Audio Sample
==================
The main processes of TTS include:
1. Convert the original text into characters/phonemes, through the ``text frontend`` module.
2. Convert characters/phonemes into acoustic features, such as linear spectrograms, mel spectrograms, LPC features, etc., through ``Acoustic models``.
3. Convert acoustic features into waveforms through ``Vocoders``.
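The three steps above can be sketched as a simple pipeline. The functions below are hypothetical stand-ins (not ``Parakeet`` APIs) and only illustrate how data flows from text to phonemes, to acoustic features, to a waveform.

.. code-block:: python

    import numpy as np

    def text_frontend(text):
        """Hypothetical stand-in: map raw text to a sequence of phoneme IDs."""
        # A real frontend normalizes the text and runs G2P; here we fake IDs.
        return np.array([ord(c) % 64 for c in text], dtype=np.int64)

    def acoustic_model(phoneme_ids, n_mels=80):
        """Hypothetical stand-in: map phoneme IDs to a mel spectrogram."""
        n_frames = 4 * len(phoneme_ids)   # pretend ~4 frames per phoneme
        return np.random.rand(n_frames, n_mels).astype(np.float32)

    def vocoder(mel, hop_length=256):
        """Hypothetical stand-in: map a mel spectrogram to a waveform."""
        return np.random.uniform(-1.0, 1.0, size=mel.shape[0] * hop_length)

    text = "Hello, Parakeet."
    phonemes = text_frontend(text)   # step 1: text -> characters/phonemes
    mel = acoustic_model(phonemes)   # step 2: phonemes -> acoustic features
    wav = vocoder(mel)               # step 3: acoustic features -> waveform
    print(phonemes.shape, mel.shape, wav.shape)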
When training ``Tacotron2``, ``TransformerTTS`` and ``WaveFlow``, we use the English single-speaker TTS dataset LJSpeech by default. However, when training ``SpeedySpeech``, ``FastSpeech2`` and ``ParallelWaveGAN``, we use the Chinese single-speaker dataset CSMSC by default.
In the future, ``Parakeet`` will mainly use Chinese TTS datasets for default examples.
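For quick reference, the default dataset of each example can be summarized as a plain mapping. This is purely illustrative; ``Parakeet`` does not ship such a table.

.. code-block:: python

    # Illustrative mapping of examples to their default training datasets.
    DEFAULT_DATASET = {
        "Tacotron2": "LJSpeech",        # English, single speaker
        "TransformerTTS": "LJSpeech",
        "WaveFlow": "LJSpeech",
        "SpeedySpeech": "CSMSC",        # Chinese, single speaker
        "FastSpeech2": "CSMSC",
        "ParallelWaveGAN": "CSMSC",
    }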
Here, we will display three types of audio samples:
1. Analysis/synthesis (ground-truth spectrograms + Vocoder)
2. TTS (Acoustic model + Vocoder)
3. Chinese TTS with/without text frontend (mainly tone sandhi)
Analysis/synthesis
--------------------------
Audio samples generated from ground-truth spectrograms with a vocoder.
Audio samples:

- LJSpeech (English): GT, WaveFlow
- CSMSC (Chinese): GT (converted to 24k), ParallelWaveGAN
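For reference, the sketch below approximates the analysis/synthesis loop: it extracts a mel spectrogram from a recording and reconstructs audio from it. It uses Griffin-Lim (via ``librosa``) in place of a neural vocoder such as ``WaveFlow`` or ``ParallelWaveGAN``, and ``sample.wav`` is a placeholder path.

.. code-block:: python

    import librosa
    import soundfile as sf

    # Analysis: load a recording and compute its ("ground-truth") mel spectrogram.
    wav, sr = librosa.load("sample.wav", sr=24000)   # "sample.wav" is a placeholder
    mel = librosa.feature.melspectrogram(y=wav, sr=sr, n_fft=1024,
                                         hop_length=256, n_mels=80)

    # Synthesis: invert the mel spectrogram back to a waveform.
    # Griffin-Lim is used here only for illustration; the samples on this page
    # were generated with neural vocoders (WaveFlow / ParallelWaveGAN).
    recon = librosa.feature.inverse.mel_to_audio(mel, sr=sr, n_fft=1024,
                                                 hop_length=256)
    sf.write("reconstructed.wav", recon, sr)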
TTS
-------------------
Audio samples generated by a TTS system. Text is first transformed into a spectrogram by a text-to-spectrogram model (acoustic model), then the spectrogram is converted into raw audio by a vocoder.
Audio samples:

- TransformerTTS + WaveFlow
- Tacotron2 + WaveFlow
- SpeedySpeech + ParallelWaveGAN
- FastSpeech2 + ParallelWaveGAN
Chinese TTS with/without text frontend
--------------------------------------
We provide a complete Chinese text frontend module in ``Parakeet``. ``Text Normalization`` and ``G2P`` are the most important modules in the text frontend. We assume that the input texts are already normalized, and mainly compare the ``G2P`` module here.
We use ``FastSpeech2`` + ``ParallelWaveGAN`` here.
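To see why the ``G2P`` step matters, the hedged sketch below uses ``pypinyin`` (not ``Parakeet``'s own frontend) to convert characters to pinyin; a naive per-character lookup misses the third-tone sandhi that a full text frontend applies.

.. code-block:: python

    from pypinyin import lazy_pinyin, Style

    text = "你好"  # "hello"

    # Naive per-character G2P: each character keeps its dictionary tone.
    naive = lazy_pinyin(text, style=Style.TONE3)
    print(naive)  # ['ni3', 'hao3']

    # Third-tone sandhi: when two third tones are adjacent, the first becomes
    # a rising (second) tone, so a full text frontend would output ni2 hao3.
    # Parakeet's Chinese frontend applies such rules before the phonemes are
    # passed to the acoustic model.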