ERNIE-SAT with VCTK dataset

ERNIE-SAT is a speech-text joint pretraining framework that achieves state-of-the-art results in cross-lingual multi-speaker speech synthesis and cross-lingual speech editing. It can be applied to scenarios such as speech editing, personalized speech synthesis, and voice cloning.

Model Framework

In ERNIE-SAT, we propose two innovations:

- In the pretraining process, the phonemes corresponding to Chinese and English are used as input to achieve cross-language and personalized soft phoneme mapping.
- The joint mask learning of speech and text is used to realize the alignment of speech and text.

Dataset

Download and Extract the dataset

Download VCTK-0.92 from its official website and extract it to ~/datasets. The dataset is then located in the directory ~/datasets/VCTK-Corpus-0.92.

Get MFA Result and Extract

We use MFA (Montreal Forced Aligner) to get durations for FastSpeech2. You can download the alignment result vctk_alignment.tar.gz, or train your own MFA model by referring to the mfa example in our repo. Note: we remove three speakers from VCTK-0.92 (see reorganize_vctk.py):

- p315, because it has no text.
- p280 and p362, because they have no *_mic2.flac files (which are better than *_mic1.flac).
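The speaker exclusion described above can be sketched as a simple filter. This is only an illustration and is not the code of reorganize_vctk.py; the function name and the sample speaker list are assumptions made for the example.

```python
# Speakers excluded from VCTK-0.92, as listed in the text above.
EXCLUDED_SPEAKERS = {"p315", "p280", "p362"}


def keep_speakers(speaker_ids):
    """Return the speaker IDs in sorted order, minus the excluded ones."""
    return sorted(s for s in speaker_ids if s not in EXCLUDED_SPEAKERS)


# Hypothetical speaker list for illustration; a real run would list the
# speaker folders of the extracted corpus instead.
speakers = ["p225", "p280", "p315", "p362", "p376"]
kept = keep_speakers(speakers)
print(kept)
```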

Get Started

Assume the path to the dataset is ~/datasets/VCTK-Corpus-0.92 and the path to the MFA result of VCTK is ./vctk_alignment. Run the command below to:

1. source path.
2. preprocess the dataset.
3. train the model.
4. synthesize wavs.
   - synthesize waveform from metadata.jsonl.
   - synthesize waveform from text file.

./run.sh

You can choose a range of stages to run, or set stage equal to stop-stage to run a single stage. For example, the following command only preprocesses the dataset.

./run.sh --stage 0 --stop-stage 0
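The stage selection can be pictured as an inclusive range check: a stage runs when it lies between --stage and --stop-stage. The sketch below models that rule in Python; it is not the content of run.sh, and the default of four stages (0 to 3) is an assumption inferred from the stages this README mentions.

```python
def selected_stages(stage, stop_stage, num_stages=4):
    """Return the stage numbers k with stage <= k <= stop_stage.

    num_stages=4 (stages 0..3) is an assumption for this sketch.
    """
    return [k for k in range(num_stages) if stage <= k <= stop_stage]


print(selected_stages(0, 0))  # preprocessing only
print(selected_stages(1, 2))  # two consecutive stages
```

An empty result (for example, when --stage is larger than --stop-stage) means no stage would run under this rule.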

Data Preprocessing

./local/preprocess.sh ${conf_path}

When it is done, a dump folder is created in the current directory. The structure of the dump folder is listed below.

dump
├── dev
│   ├── norm
│   └── raw
├── phone_id_map.txt
├── speaker_id_map.txt
├── test
│   ├── norm
│   └── raw
└── train
    ├── norm
    ├── raw
    └── speech_stats.npy

The dataset is split into 3 parts, namely train, dev, and test, each of which contains a norm and a raw subfolder. The raw folder contains the speech features of each utterance, while the norm folder contains the normalized ones. The statistics used to normalize the features are computed from the training set and stored in dump/train/*_stats.npy.
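The key point is that the statistics come from the training set only and are then reused for dev and test. A minimal sketch of that idea follows, using plain Python lists and per-dimension mean and standard deviation; the actual layout of the .npy file and the exact normalization used by the preprocessing scripts are not stated here, so treat both as assumptions.

```python
from statistics import fmean, pstdev


def compute_stats(frames):
    """Per-dimension mean and (population) standard deviation over frames."""
    columns = list(zip(*frames))
    means = [fmean(col) for col in columns]
    stds = [pstdev(col) for col in columns]
    return means, stds


def normalize(frames, means, stds):
    """Apply (x - mean) / std to every dimension of every frame."""
    return [[(x - m) / s for x, m, s in zip(frame, means, stds)]
            for frame in frames]


# Toy 2-dimensional "features" standing in for real speech features.
train_raw = [[1.0, 10.0], [3.0, 30.0]]
dev_raw = [[2.0, 40.0]]

# Statistics are computed on the training split only ...
means, stds = compute_stats(train_raw)
# ... and reused unchanged for every split.
train_norm = normalize(train_raw, means, stds)
dev_norm = normalize(dev_raw, means, stds)
print(means, stds, dev_norm)
```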

Also, there is a metadata.jsonl in each subfolder. It is a table-like file that contains phones, text_lengths, speech_lengths, durations, the path of speech features, speaker, and id of each utterance.
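A JSON Lines file holds one JSON object per line, so it can be read with the standard json module. The sketch below parses two made-up records held in a string; the key names (utt_id, speaker, speech, and so on) and all values are assumptions for illustration and may differ from what the preprocessing step actually writes.

```python
import json
from collections import Counter

# Two hypothetical records; keys and values are illustrative only.
sample_jsonl = "\n".join([
    json.dumps({"utt_id": "p225_001", "speaker": "p225",
                "phones": ["P", "L", "IY1", "Z"], "text_lengths": 4,
                "speech_lengths": 120, "durations": [20, 30, 40, 30],
                "speech": "dump/train/raw/p225_001_speech.npy"}),
    json.dumps({"utt_id": "p226_002", "speaker": "p226",
                "phones": ["AE1", "S", "K"], "text_lengths": 3,
                "speech_lengths": 90, "durations": [30, 30, 30],
                "speech": "dump/train/raw/p226_002_speech.npy"}),
])


def load_metadata(text):
    """Parse JSON Lines text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


records = load_metadata(sample_jsonl)
utts_per_speaker = Counter(r["speaker"] for r in records)
total_frames = sum(r["speech_lengths"] for r in records)
print(utts_per_speaker, total_frames)
```

To read a real file, the same function could be applied to the text returned by reading dump/train/norm/metadata.jsonl, assuming that is where the file is written.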

Model Training

CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path}

./local/train.sh calls ${BIN_DIR}/train.py.

Synthesizing

We use HiFiGAN as the neural vocoder.

Download the pretrained HiFiGAN model hifigan_vctk_ckpt_0.2.0.zip and unzip it.

unzip hifigan_vctk_ckpt_0.2.0.zip

The HiFiGAN checkpoint contains the files listed below.

hifigan_vctk_ckpt_0.2.0
├── default.yaml                    # default config used to train HiFiGAN
├── feats_stats.npy                 # statistics used to normalize spectrogram when training HiFiGAN
└── snapshot_iter_2500000.pdz       # generator parameters of HiFiGAN

./local/synthesize.sh calls ${BIN_DIR}/../synthesize.py, which synthesizes waveforms from metadata.jsonl.

CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name}

Speech Synthesis and Speech Editing

Prepare

prepare aligner

mkdir -p tools/aligner
cd tools
# download MFA
wget https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner/releases/download/v1.0.1/montreal-forced-aligner_linux.tar.gz
# extract MFA
tar xvf montreal-forced-aligner_linux.tar.gz
# fix .so of MFA
cd montreal-forced-aligner/lib
ln -snf libpython3.6m.so.1.0 libpython3.6m.so
cd -
# download align models and dicts
cd aligner
wget https://paddlespeech.bj.bcebos.com/MFA/ernie_sat/aishell3_model.zip
wget https://paddlespeech.bj.bcebos.com/MFA/AISHELL-3/with_tone/simple.lexicon
wget https://paddlespeech.bj.bcebos.com/MFA/ernie_sat/vctk_model.zip
wget https://paddlespeech.bj.bcebos.com/MFA/LJSpeech-1.1/cmudict-0.7b
cd ../../

prepare pretrained FastSpeech2 models

ERNIE-SAT uses FastSpeech2 as the phoneme duration predictor:

mkdir download
cd download
wget https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_conformer_baker_ckpt_0.5.zip
wget https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_ljspeech_ckpt_0.5.zip
unzip fastspeech2_conformer_baker_ckpt_0.5.zip
unzip fastspeech2_nosil_ljspeech_ckpt_0.5.zip
cd ../

prepare source data

mkdir source
cd source
wget https://paddlespeech.bj.bcebos.com/Parakeet/released_models/ernie_sat/source/SSB03540307.wav
wget https://paddlespeech.bj.bcebos.com/Parakeet/released_models/ernie_sat/source/SSB03540428.wav
wget https://paddlespeech.bj.bcebos.com/Parakeet/released_models/ernie_sat/source/LJ050-0278.wav
wget https://paddlespeech.bj.bcebos.com/Parakeet/released_models/ernie_sat/source/p243_313.wav
wget https://paddlespeech.bj.bcebos.com/Parakeet/released_models/ernie_sat/source/p299_096.wav
wget https://paddlespeech.bj.bcebos.com/Parakeet/released_models/ernie_sat/source/this_was_not_the_show_for_me.wav
wget https://paddlespeech.bj.bcebos.com/Parakeet/released_models/ernie_sat/source/README.md
cd ../

You can check the text of downloaded wavs in source/README.md.

Speech Synthesis and Speech Editing

./run.sh --stage 3 --stop-stage 3 --gpus 0

Stage 3 of run.sh calls local/synthesize_e2e.sh, whose stage 0 is speech synthesis and whose stage 1 is speech editing.

You can modify --wav_path, --old_str, and --new_str yourself. --old_str should be the text corresponding to the audio of --wav_path, and --new_str should be designed according to --task_name. Both --source_lang and --target_lang should be en for a model trained with the VCTK dataset.
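When preparing an editing request, it can help to check which words actually differ between --old_str and --new_str. The sketch below does a word-level comparison with Python's difflib; it is only a sanity-check helper and is not how ERNIE-SAT itself locates the edited region. The two strings are made up for illustration.

```python
from difflib import SequenceMatcher


def changed_spans(old_str, new_str):
    """Return (tag, old_words, new_words) for each non-matching span."""
    old_words, new_words = old_str.split(), new_str.split()
    matcher = SequenceMatcher(a=old_words, b=new_words, autojunk=False)
    return [
        (tag, old_words[i1:i2], new_words[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


# Hypothetical strings standing in for --old_str and --new_str.
old_str = "this was not the show for me"
new_str = "this was not the movie for me"
print(changed_spans(old_str, new_str))
```

An empty list means the two strings contain the same words, in which case there is nothing to edit.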

Pretrained Model

Pretrained ErnieSAT model:

erniesat_vctk_ckpt_1.2.0.zip

Model	Step	eval/mlm_loss	eval/loss
default	8(gpu) x 199500	57.622215	57.622215