FastSpeech2 with Cantonese language

Dataset

Download and Extract

If you don't have the Cantonese datasets, please download and unzip Guangzhou_Cantonese_Scripted_Speech_Corpus_Daily_Use_Sentence and Guangzhou_Cantonese_Scripted_Speech_Corpus_in_Vehicle under ~/datasets/.

To obtain better performance, please combine these two datasets together as follows:

mkdir -p ~/datasets/canton_all/WAV
cp -r ~/datasets/Guangzhou_Cantonese_Scripted_Speech_Corpus_Daily_Use_Sentence/WAV/* ~/datasets/canton_all/WAV
cp -r ~/datasets/Guangzhou_Cantonese_Scripted_Speech_Corpus_in_Vehicle/WAV/* ~/datasets/canton_all/WAV

After that, the directory should look like this:

~/datasets/canton_all
└── WAV
    ├── G0001
    ├── G0002
    ├── ...
    ├── G0071
    └── G0072
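
As a sanity check on the merge, it can help to count the speaker folders and to see whether the two source corpora share any folder names, since `cp -r` would merge same-named folders into one. The following is a minimal sketch, not part of the recipe; `count_speakers` and `shared_speakers` are hypothetical helper names, and the commented usage lines assume the paths used above.

```shell
# Hypothetical helpers (not part of PaddleSpeech) for checking the merged corpus.

# count_speakers DIR: number of sub-directories directly under DIR.
count_speakers() {
    find "$1" -mindepth 1 -maxdepth 1 -type d | wc -l | tr -d ' '
}

# shared_speakers DIR_A DIR_B: print folder names present in both directories.
shared_speakers() {
    for d in "$1"/*; do
        if [ -d "$d" ] && [ -d "$2/$(basename "$d")" ]; then
            basename "$d"
        fi
    done
}

# Usage (paths as in the commands above):
#   count_speakers ~/datasets/canton_all/WAV
#   shared_speakers \
#     ~/datasets/Guangzhou_Cantonese_Scripted_Speech_Corpus_Daily_Use_Sentence/WAV \
#     ~/datasets/Guangzhou_Cantonese_Scripted_Speech_Corpus_in_Vehicle/WAV
```

If `shared_speakers` prints nothing, no speaker folder was merged across corpora during the copy.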

Get MFA Result and Extract

We use MFA1.x to get durations for canton_fastspeech2. You can train your own MFA model by following the canton_mfa example in our repo (which currently uses MFA1.x). We also provide the MFA results for these two datasets: canton_alignment.zip

Get Started

Assume the path to the Cantonese MFA result of the two datasets mentioned above is ./canton_alignment. Run the command below to:

  1. source path.
  2. preprocess the dataset.
  3. train the model.
  4. synthesize wavs.
    • synthesize waveform from metadata.jsonl.
    • synthesize waveform from text file.
./run.sh

You can choose a range of stages to run, or set stage equal to stop-stage to run a single stage. For example, the following command only preprocesses the dataset.

./run.sh --stage 0 --stop-stage 0
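
The stage selection can be pictured as a simple range check: a stage runs when its number lies between stage and stop-stage, inclusive. The sketch below illustrates that idea only and is not the actual content of run.sh; `should_run` and `selected_stages` are hypothetical names, and the stage numbers 1 to 3 are assumed for illustration (only stage 0, preprocessing, is confirmed above).

```shell
# Illustrative sketch of stage gating; not copied from run.sh.
stage=0
stop_stage=100

# should_run N: succeed when stage N lies inside [stage, stop_stage].
should_run() {
    [ "$stage" -le "$1" ] && [ "$stop_stage" -ge "$1" ]
}

# selected_stages FIRST LAST: print the stages (0..3 assumed) that would run.
selected_stages() {
    stage=$1
    stop_stage=$2
    for n in 0 1 2 3; do
        if should_run "$n"; then
            printf '%s\n' "$n"
        fi
    done
}

# Usage:
#   selected_stages 0 0   # prints only stage 0
#   selected_stages 1 2   # prints stages 1 and 2
```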

Data Preprocessing

./local/preprocess.sh ${conf_path}

When it is done, a dump folder is created in the current directory. The structure of the dump folder is listed below.

dump
├── dev
│   ├── norm
│   └── raw
├── phone_id_map.txt
├── speaker_id_map.txt
├── test
│   ├── norm
│   └── raw
└── train
    ├── energy_stats.npy
    ├── norm
    ├── pitch_stats.npy
    ├── raw
    └── speech_stats.npy

The dataset is split into 3 parts, namely train, dev, and test, each of which contains a norm and a raw subfolder. The raw folder contains the speech, pitch, and energy features of each utterance, while the norm folder contains the normalized ones. The statistics used to normalize features are computed from the training set and are stored in dump/train/*_stats.npy.

Also, there is a metadata.jsonl in each subfolder. It is a table-like file that contains the phones, text_lengths, speech_lengths, durations, the path of speech features, the path of pitch features, the path of energy features, speaker, and id of each utterance.
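
Since a .jsonl file holds one JSON record per line, a quick way to inspect the preprocessing output is to count the lines and look at the first record. This is a minimal sketch; `count_utts` and `first_record` are hypothetical helper names, and the path in the usage comments assumes the dump layout shown above.

```shell
# Hypothetical helpers for inspecting a metadata.jsonl file.

# count_utts FILE: number of non-empty lines, i.e. utterance records.
count_utts() {
    grep -c . "$1"
}

# first_record FILE: print the first record as stored (one JSON object).
first_record() {
    head -n 1 "$1"
}

# Usage:
#   count_utts dump/train/norm/metadata.jsonl
#   first_record dump/train/norm/metadata.jsonl | python3 -m json.tool
```

Piping the first record through `python3 -m json.tool` pretty-prints it, which makes the field names easy to read.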

For training details, refer to the scripts in examples/aishell3/tts3.

Pretrained Model

Pretrained FastSpeech2 model with no silence at the edges of the audio:

The static model can be downloaded here:

The ONNX model can be downloaded here:

The FastSpeech2 checkpoint contains the files listed below.

fastspeech2_canton_ckpt_1.4.0
├── default.yaml            # default config used to train fastspeech2
├── energy_stats.npy        # statistics used to normalize energy when training fastspeech2
├── phone_id_map.txt        # phone vocabulary file when training fastspeech2
├── pitch_stats.npy         # statistics used to normalize pitch when training fastspeech2
├── snapshot_iter_140000.pdz # model parameters and optimizer states
├── speaker_id_map.txt      # speaker id map file when training a multi-speaker fastspeech2
└── speech_stats.npy        # statistics used to normalize spectrogram when training fastspeech2
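
Before running synthesis, it may be useful to confirm that the unpacked checkpoint folder contains every file listed above. The sketch below is one way to do that; `missing_ckpt_files` is a hypothetical helper name, and the file list is taken from the tree above.

```shell
# Hypothetical helper: print the expected checkpoint files that are missing.
# missing_ckpt_files DIR
missing_ckpt_files() {
    for f in default.yaml energy_stats.npy phone_id_map.txt pitch_stats.npy \
             snapshot_iter_140000.pdz speaker_id_map.txt speech_stats.npy; do
        if [ ! -f "$1/$f" ]; then
            printf '%s\n' "$f"
        fi
    done
}

# Usage:
#   missing_ckpt_files fastspeech2_canton_ckpt_1.4.0
```

An empty output means all seven files are present.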

We use Parallel WaveGAN as the neural vocoder. Download the pretrained Parallel WaveGAN model from pwg_aishell3_ckpt_0.5.zip and unzip it.

unzip pwg_aishell3_ckpt_0.5.zip

You can use the following script to synthesize speech for the sentences in ${BIN_DIR}/../../assets/sentences_canton.txt using the pretrained FastSpeech2 and Parallel WaveGAN models.

source path.sh

FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize_e2e.py \
  --am=fastspeech2_aishell3 \
  --am_config=fastspeech2_canton_ckpt_1.4.0/default.yaml \
  --am_ckpt=fastspeech2_canton_ckpt_1.4.0/snapshot_iter_140000.pdz \
  --am_stat=fastspeech2_canton_ckpt_1.4.0/speech_stats.npy \
  --voc=pwgan_aishell3 \
  --voc_config=pwg_aishell3_ckpt_0.5/default.yaml \
  --voc_ckpt=pwg_aishell3_ckpt_0.5/snapshot_iter_1000000.pdz \
  --voc_stat=pwg_aishell3_ckpt_0.5/feats_stats.npy \
  --lang=canton \
  --text=${BIN_DIR}/../../assets/sentences_canton.txt \
  --output_dir=exp/default/test_e2e \
  --phones_dict=fastspeech2_canton_ckpt_1.4.0/phone_id_map.txt \
  --speaker_dict=fastspeech2_canton_ckpt_1.4.0/speaker_id_map.txt \
  --spk_id=10 \
  --inference_dir=exp/default/inference