The examples in PaddleSpeech are mainly organized by dataset. The TTS datasets we mainly use are:
* CSMSC (Mandarin, single speaker)
* AISHELL3 (Mandarin, multiple speakers)
* LJSpeech (English, single speaker)
* VCTK (English, multiple speakers)
The models in PaddleSpeech TTS have the following mapping relationship:
* tts0 - Tacotron2
* tts1 - TransformerTTS
* tts2 - SpeedySpeech
* tts3 - FastSpeech2
* voc0 - WaveFlow
* voc1 - Parallel WaveGAN
* voc2 - MelGAN
* voc3 - Multi-Band MelGAN
* vc0 - Tacotron2 Voice Cloning with GE2E
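Combining a dataset name with one of these abbreviations gives the corresponding example directory, for example:
```text
examples/csmsc/tts3   # FastSpeech2 trained on CSMSC
examples/csmsc/voc1   # Parallel WaveGAN trained on CSMSC
```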
## Quick Start
Let's take FastSpeech2 + Parallel WaveGAN on the CSMSC dataset as an example: [examples/csmsc](https://github.com/PaddlePaddle/DeepSpeech/tree/develop/examples/csmsc).
### Train Parallel WaveGAN with CSMSC
- Go to directory
```bash
cd examples/csmsc/voc1
```
- Source env
```bash
source path.sh
```
**You must do this before running anything else.**
This sets `MAIN_ROOT` to the project root and uses the `parallelwave_gan` model as `MODEL`.
- Main entrypoint
```bash
bash run.sh
```
This is just a demo; please make sure the source data has been prepared properly and that each `step` finishes successfully before moving on to the next.
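After sourcing `path.sh`, you can do a quick sanity check that the environment was set up (assuming `MAIN_ROOT` is exported by `path.sh`, as described above):
```bash
# should print the absolute path of the project root
echo "${MAIN_ROOT}"
```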
### Train FastSpeech2 with CSMSC
- Go to directory
```bash
cd examples/csmsc/tts3
```
- Source env
```bash
source path.sh
```
**You must do this before running anything else.**
This sets `MAIN_ROOT` to the project root and uses the `fastspeech2` model as `MODEL`.
- Main entrypoint
```bash
bash run.sh
```
This is just a demo; please make sure the source data has been prepared properly and that each `step` finishes successfully before moving on to the next.
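After training finishes, the checkpoints typically land in an experiment directory. The layout below is only a rough sketch under that assumption; the exact output path is configured in each example's `run.sh`:
```text
exp/default
├── checkpoints/
│   └── snapshot_iter_*.pdz
└── ...
```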
The steps in `run.sh` mainly include:
- sourcing `path.sh`,
- preprocessing the dataset,
- training the model,
- synthesizing waveforms from `metadata.jsonl`,
- synthesizing waveforms from a text file (acoustic models only),
- inference with the static model (optional).
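These steps are wired up as numbered stages inside `run.sh`, so you can also run a subset of them. A usage sketch, assuming the kaldi-style `--stage`/`--stop-stage` options used across the PaddleSpeech examples (check each `run.sh` for the exact stage numbering):
```bash
# run all stages, from preprocessing to inference
bash run.sh

# run only the first stage (dataset preprocessing)
bash run.sh --stage 0 --stop-stage 0

# run training and synthesis from metadata.jsonl only
bash run.sh --stage 1 --stop-stage 2
```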
For more details, see the `README.md` in each example.
## Pipeline of TTS
This section shows how to use the pretrained models provided by TTS to run inference.
Pretrained models in TTS are provided in an archive. Extract it to get a folder like this:
**Acoustic Models:**
```text
checkpoint_name
├── default.yaml
├── snapshot_iter_*.pdz
├── speech_stats.npy
├── phone_id_map.txt
├── spk_id_map.txt (optional)
└── tone_id_map.txt (optional)
```
**Vocoders:**
```text
checkpoint_name
├── default.yaml
├── snapshot_iter_*.pdz
└── stats.npy
```
- `default.yaml` stores the config used to train the model.
- `snapshot_iter_*.pdz` is the checkpoint file, where `*` is the number of steps the model has been trained for.
- `*_stats.npy` is the stats file of the feature, if it was normalized before training.
- `phone_id_map.txt` is the map of phonemes to phoneme_ids.
- `tone_id_map.txt` is the map of tones to tone_ids, used when tones and phones are split before training acoustic models (for example, in our csmsc/speedyspeech example).
- `spk_id_map.txt` is the map of speakers to spk_ids in multi-speaker acoustic models (for example, in our aishell3/fastspeech2 example).
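As a quick illustration of how these files are consumed, here is a hedged sketch; the exact formats (space-separated `<phone> <id>` pairs, and a stats array holding the mean and standard deviation) are assumptions, so check the example code in the repository:
```python
import numpy as np

# phoneme -> id mapping; each line is assumed to be "<phone> <id>"
with open("checkpoint_name/phone_id_map.txt") as f:
    phone_id_map = dict(line.split() for line in f if line.strip())

# feature statistics saved at training time; assumed to hold the
# mean and standard deviation used for normalization
mean, std = np.load("checkpoint_name/speech_stats.npy")
```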
The example code below shows how to use the models for prediction.
### Acoustic Models (text to spectrogram)
The code below shows how to use a `FastSpeech2` model. After loading the pretrained model, use it and a normalizer object to construct a prediction object, then use `fastspeech2_inference(phone_ids)` to generate spectrograms, which can be further used to synthesize raw audio with a vocoder.
```python
import paddle

# `fastspeech2_inference` is the prediction object built from the pretrained
# model and its normalizer; `phone_ids` comes from the Chinese text frontend.
# The output of the Chinese text frontend is segmented, so synthesize each
# segment and concatenate the mel spectrograms.
flags = 0
for part_phone_ids in phone_ids:
    with paddle.no_grad():
        temp_mel = fastspeech2_inference(part_phone_ids)
        if flags == 0:
            mel = temp_mel
            flags = 1
        else:
            mel = paddle.concat([mel, temp_mel])
```
### Vocoder (spectrogram to wave)
The code below shows how to use a `Parallel WaveGAN` model. As in the example above, after loading the pretrained model, use it and a normalizer object to construct a prediction object, then use `pwg_inference(mel)` to generate raw audio (in wav format).
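A minimal sketch of that flow, assuming `pwg_inference` has already been constructed from the pretrained Parallel WaveGAN and its normalizer, and reusing the `mel` generated by the acoustic model above; the output path and the 24 kHz sample rate (the CSMSC default) are illustrative assumptions, so read the actual sample rate from the vocoder's `default.yaml`:
```python
import paddle
import soundfile as sf

# spectrogram to waveform
with paddle.no_grad():
    wav = pwg_inference(mel)

# write the waveform to disk (sample rate assumed to be 24 kHz for CSMSC)
sf.write("output.wav", wav.numpy(), samplerate=24000)
```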