Audio Sample
==================

The main processes of TTS include:

1. Convert the original text into characters/phonemes through the ``text frontend`` module.
2. Convert characters/phonemes into acoustic features, such as the linear spectrogram, mel spectrogram, LPC features, etc., through ``Acoustic models``.
3. Convert acoustic features into waveforms through ``Vocoders``.

When training ``Tacotron2``, ``TransformerTTS`` and ``WaveFlow``, we use the English single-speaker TTS dataset `LJSpeech `_ by default. However, when training ``SpeedySpeech``, ``FastSpeech2`` and ``ParallelWaveGAN``, we use the Chinese single-speaker dataset `CSMSC `_ by default. In the future, ``Parakeet`` will mainly use Chinese TTS datasets for default examples.

Here, we display three types of audio samples:

1. Analysis/synthesis (ground-truth spectrograms + Vocoder)
2. TTS (Acoustic model + Vocoder)
3. Chinese TTS with/without text frontend (mainly tone sandhi)

Analysis/synthesis
--------------------------

Audio samples generated from ground-truth spectrograms with a vocoder.
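
In this setting only the vocoder is exercised: the acoustic features are extracted directly from a recorded utterance rather than predicted by an acoustic model, and the vocoder reconstructs the waveform from them. Below is a minimal sketch of the feature-extraction side, assuming ``librosa`` and placeholder parameters; the exact settings used in the Parakeet recipes may differ.

.. code-block:: python

    import librosa
    import numpy as np

    # Load a ground-truth recording; the path and parameters below are
    # placeholders, not the exact settings used in the Parakeet recipes.
    wav, sr = librosa.load("LJ001-0001.wav", sr=22050)

    # Extract a log-mel spectrogram to serve as the "ground-truth" feature.
    mel = librosa.feature.melspectrogram(
        y=wav, sr=sr, n_fft=1024, hop_length=256, win_length=1024, n_mels=80)
    log_mel = np.log(np.clip(mel, a_min=1e-5, a_max=None))

    # A trained vocoder (e.g. WaveFlow or ParallelWaveGAN) would now map
    # ``log_mel`` back to a waveform; ``vocoder.infer`` is a stand-in name,
    # not the actual Parakeet API.
    # wav_reconstructed = vocoder.infer(log_mel)
    print(log_mel.shape)  # (80, n_frames)
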

Samples:

* LJSpeech (English): GT, WaveFlow
* CSMSC (Chinese): GT (converted to 24 kHz), ParallelWaveGAN

TTS
-------------------

Audio samples generated by a TTS system. Text is first transformed into a spectrogram by a text-to-spectrogram model, then the spectrogram is converted into raw audio by a vocoder.
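
As an outline of how the two stages fit together (following the three steps listed at the top of this page), here is a structural sketch in which every stage is a stub; in practice the acoustic model (e.g. ``FastSpeech2``) and the vocoder (e.g. ``ParallelWaveGAN``) are trained Parakeet models with their own checkpoints and APIs, which are not shown here.

.. code-block:: python

    import numpy as np

    # Structural sketch of the pipeline described above. Each stage is a
    # stub standing in for a real module; only the data flow and the shapes
    # of the intermediate results are illustrative.

    def text_frontend(text: str) -> list:
        """Text -> phoneme/character sequence (stub: one token per character)."""
        return list(text)

    def acoustic_model(tokens: list) -> np.ndarray:
        """Tokens -> mel spectrogram (stub: random frames, 80 mel bins)."""
        return np.random.randn(80, 10 * len(tokens)).astype(np.float32)

    def vocoder(mel: np.ndarray) -> np.ndarray:
        """Mel spectrogram -> waveform (stub: 256 samples per frame)."""
        return np.random.randn(mel.shape[1] * 256).astype(np.float32)

    mel = acoustic_model(text_frontend("Hello, Parakeet."))
    wav = vocoder(mel)
    print(mel.shape, wav.shape)
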
Samples:

* LJSpeech (English): TransformerTTS + WaveFlow, Tacotron2 + WaveFlow
* CSMSC (Chinese): SpeedySpeech + ParallelWaveGAN, FastSpeech2 + ParallelWaveGAN

Chinese TTS with/without text frontend
--------------------------------------

We provide a complete Chinese text frontend module in ``Parakeet``. ``Text Normalization`` and ``G2P`` are the most important modules in the text frontend. We assume the texts are already normalized and mainly compare the ``G2P`` module here.

We use ``FastSpeech2`` + ``ParallelWaveGAN`` here.
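
The difference mainly shows up in tone sandhi. As a rough illustration of what a per-character G2P without such rules produces, here is a minimal sketch assuming the third-party ``pypinyin`` package; this is only for illustration and is not Parakeet's frontend implementation.

.. code-block:: python

    # Naive per-syllable G2P for illustration only; it does not apply tone
    # sandhi. For "你好" it yields "ni3 hao3", whereas a sandhi-aware frontend
    # would change the first third tone to a second tone ("ni2 hao3").
    from pypinyin import Style, lazy_pinyin

    text = "你好，欢迎使用语音合成。"
    naive_phones = lazy_pinyin(text, style=Style.TONE3)
    print(" ".join(naive_phones))
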
Samples:

* With Text Frontend
* Without Text Frontend