PaddleSpeech/docs/source/tts/quick_start.md

# Quick Start of Text-to-Speech
The examples in PaddleSpeech are mainly classified by datasets, the TTS datasets we mainly used are:
* CSMCS (Mandarin single speaker)
* AISHELL3 (Mandarin multiple speaker)
* LJSpeech (English single speaker)
* VCTK (English multiple speaker)

The models in PaddleSpeech TTS have the following mapping relationship:
* tts0 - Tactron2
* tts1 - TransformerTTS
* tts2 - SpeedySpeech
* tts3 - FastSpeech2
* voc0 - WaveFlow
* voc1 - Parallel WaveGAN
* voc2 - MelGAN
* voc3 - MultiBand MelGAN
* vc0 - Tactron2 Voice Clone with GE2E
* vc1 - FastSpeech2 Voice Clone with GE2E

## Quick Start

Let's take a FastSpeech2 + Parallel WaveGAN with CSMSC dataset for instance. (./examples/csmsc/)(https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc)

### Train Parallel WaveGAN with CSMSC
- Go to directory
    ```bash
    cd examples/csmsc/voc1
    ```
- Source env
    ```bash
    source path.sh
    ```
    **Must do this before you start to do anything.**
    Set `MAIN_ROOT` as project dir. Using `parallelwave_gan` model as `MODEL`.

- Main entrypoint
    ```bash
    bash run.sh
    ```
    This is just a demo, please make sure source data have been prepared well and every `step` works well before next `step`.
### Train FastSpeech2 with CSMSC
- Go to directory
    ```bash
    cd examples/csmsc/tts3
    ```
- Source env
    ```bash
    source path.sh
    ```
    **Must do this before you start to do anything.**
    Set `MAIN_ROOT` as project dir. Using `fastspeech2` model as `MODEL`.
- Main entrypoint
    ```bash
    bash run.sh
    ```
    This is just a demo, please make sure source data have been prepared well and every `step` works well before next `step`.

The steps in `run.sh` mainly include:
- source path.
- preprocess the dataset,
- train the model.
- synthesize waveform from metadata.jsonl.
- synthesize waveform from text file. (in acoustic models)
- inference using static model. (optional)

For more details , you can see `README.md` in examples.

## Pipeline of TTS
This section shows how to use pretrained models provided by TTS and make inference with them.

Pretrained models in TTS are provided in a archive. Extract it to get a folder like this:
**Acoustic Models:**
```text
checkpoint_name
├── default.yaml
├── snapshot_iter_*.pdz
├── speech_stats.npy
├── phone_id_map.txt
├── spk_id_map.txt (optimal)
└── tone_id_map.txt (optimal)
```
**Vocoders:**
```text
checkpoint_name
├── default.yaml  
├── snapshot_iter_*.pdz
└── stats.npy  
```
- `default.yaml` stores the config used to train the model.
- `snapshot_iter_*.pdz` is the chechpoint file, where `*` is the steps it has been trained.
- `*_stats.npy` is the stats file of feature if  it has been normalized before training.
- `phone_id_map.txt` is the map of  phonemes to phoneme_ids.
- `tone_id_map.txt` is the map of  tones to tones_ids, when you split tones and phones before training acoustic models. (for example in our csmsc/speedyspeech example)
- `spk_id_map.txt` is the map of  spkeaker to spk_ids in multi-spk acoustic models. (for example in our aishell3/fastspeech2 example)

The example code below shows how to use the models for prediction.
### Acoustic Models (text to spectrogram)
The code below show how to use a `FastSpeech2` model.  After loading the pretrained model, use it and normalizer object to construct a prediction object，then use `fastspeech2_inferencet(phone_ids)` to generate spectrograms, which can be further used to synthesize raw audio with a vocoder.

```python
from pathlib import Path
import numpy as np
import paddle
import yaml
from yacs.config import CfgNode
from paddlespeech.t2s.models.fastspeech2 import FastSpeech2
from paddlespeech.t2s.models.fastspeech2 import FastSpeech2Inference
from paddlespeech.t2s.modules.normalizer import ZScore
# examples/fastspeech2/baker/frontend.py
from frontend import Frontend

# load the pretrained model
checkpoint_dir = Path("fastspeech2_nosil_baker_ckpt_0.4")
with open(checkpoint_dir / "phone_id_map.txt", "r") as f:
    phn_id = [line.strip().split() for line in f.readlines()]
vocab_size = len(phn_id)
with open(checkpoint_dir / "default.yaml") as f:
    fastspeech2_config = CfgNode(yaml.safe_load(f))
odim = fastspeech2_config.n_mels
model = FastSpeech2(
    idim=vocab_size, odim=odim, **fastspeech2_config["model"])
model.set_state_dict(
    paddle.load(args.fastspeech2_checkpoint)["main_params"])
model.eval()

# load stats file
stat = np.load(checkpoint_dir / "speech_stats.npy")
mu, std = stat
mu = paddle.to_tensor(mu)
std = paddle.to_tensor(std)
fastspeech2_normalizer = ZScore(mu, std)

# construct a prediction object
fastspeech2_inference = FastSpeech2Inference(fastspeech2_normalizer, model)

# load Chinese Frontend
frontend = Frontend(checkpoint_dir / "phone_id_map.txt")

# text to spectrogram
sentence = "你好吗？"
input_ids = frontend.get_input_ids(sentence, merge_sentences=True)
phone_ids = input_ids["phone_ids"]
flags = 0
# The output of Chinese text frontend is segmented
for part_phone_ids in phone_ids:
    with paddle.no_grad():
        temp_mel = fastspeech2_inference(part_phone_ids)
        if flags == 0:
            mel = temp_mel
            flags = 1
        else:
            mel = paddle.concat([mel, temp_mel])
```

### Vocoder (spectrogram to wave)
The code below show how to use a  ` Parallel WaveGAN` model. Like the example above, after loading the pretrained model, use it and normalizer object to construct a prediction object，then use `pwg_inference(mel)` to generate  raw audio (in wav format).

```python
from pathlib import Path
import numpy as np
import paddle
import soundfile as sf
import yaml
from yacs.config import CfgNode
from paddlespeech.t2s.models.parallel_wavegan import PWGGenerator
from paddlespeech.t2s.models.parallel_wavegan import PWGInference
from paddlespeech.t2s.modules.normalizer import ZScore

# load the pretrained model
checkpoint_dir = Path("parallel_wavegan_baker_ckpt_0.4")
with open(checkpoint_dir / "pwg_default.yaml") as f:
    pwg_config = CfgNode(yaml.safe_load(f))
vocoder = PWGGenerator(**pwg_config["generator_params"])
vocoder.set_state_dict(paddle.load(args.pwg_params))
vocoder.remove_weight_norm()
vocoder.eval()

# load stats file
stat = np.load(checkpoint_dir / "pwg_stats.npy")
mu, std = stat
mu = paddle.to_tensor(mu)
std = paddle.to_tensor(std)
pwg_normalizer = ZScore(mu, std)

# construct a prediction object
pwg_inference = PWGInference(pwg_normalizer, vocoder)

# spectrogram to wave
wav = pwg_inference(mel)
sf.write(
        audio_path,
        wav.numpy(),
        samplerate=fastspeech2_config.fs)
```