# ERNIE-SAT with AISHELL-3 dataset
ERNIE-SAT is a speech-text joint pretraining framework that achieves SOTA results in cross-lingual multi-speaker speech synthesis and cross-lingual speech editing tasks. It can be applied to a range of scenarios such as Speech Editing, personalized Speech Synthesis, and Voice Cloning.

## Model Framework
In ERNIE-SAT, we propose two innovations:
- In the pretraining process, the phonemes corresponding to Chinese and English are used as input to achieve cross-language and personalized soft phoneme mapping
- The joint mask learning of speech and text is used to realize the alignment of speech and text

<p align="center">
<img src="https://user-images.githubusercontent.com/24568452/186110814-1b9c6618-a0ab-4c0c-bb3d-3d860b0e8cc2.png" />
</p>

## Dataset
### Download and Extract
Download AISHELL-3 from its [Official Website](http://www.aishelltech.com/aishell_3) and extract it to `~/datasets`. Then the dataset is in the directory `~/datasets/data_aishell3`.

### Get MFA Result and Extract
We use [MFA2.x](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get durations for aishell3_fastspeech2.
You can download the alignment result from [aishell3_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/AISHELL-3/with_tone/aishell3_alignment_tone.tar.gz), or train your own MFA model by referring to the [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) in our repo (which currently uses MFA1.x).

## Get Started
Assume the path to the dataset is `~/datasets/data_aishell3`.
Assume the path to the MFA result of AISHELL-3 is `./aishell3_alignment_tone`.
Run the command below to
1. **source path**.
2. preprocess the dataset.
3. train the model.
4. synthesize wavs.
    - synthesize waveform from `metadata.jsonl`.
    - synthesize waveform from text file.

```bash
./run.sh
```
You can choose a range of stages to run, or set `stage` equal to `stop-stage` to run a single stage. For example, the following command only preprocesses the dataset.
```bash
./run.sh --stage 0 --stop-stage 0
```
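The stage selection can be pictured with a minimal sketch like the one below. It is not the actual `run.sh`: the function name `run_stages` and the `echo` placeholders are hypothetical, and the assignment of stages 1 and 2 to training and synthesis is an assumption based on the step list above (only stage 0 and stage 3 are stated explicitly in this document). A stage runs only when its index lies between `stage` and `stop-stage`, inclusive.

```bash
# Hypothetical sketch of stage gating; NOT the real run.sh.
# Stage 0 (preprocess) and stage 3 (synthesize_e2e) follow this README;
# stages 1 and 2 are assumed from the order of the step list.
run_stages() {
    stage="$1"
    stop_stage="$2"
    i=0
    for name in preprocess train synthesize synthesize_e2e; do
        # Run stage i only when stage <= i <= stop_stage.
        if [ "$stage" -le "$i" ] && [ "$stop_stage" -ge "$i" ]; then
            echo "stage $i: $name"
        fi
        i=$((i + 1))
    done
}

run_stages 0 0   # prints: stage 0: preprocess
```

Calling `run_stages 1 2` in this sketch would print the two middle stages, which mirrors how a range of stages is selected.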
### Data Preprocessing
```bash
./local/preprocess.sh ${conf_path}
```
When it is done, a `dump` folder is created in the current directory. The structure of the dump folder is listed below.

```text
dump
├── dev
│   ├── norm
│   └── raw
├── phone_id_map.txt
├── speaker_id_map.txt
├── test
│   ├── norm
│   └── raw
└── train
    ├── norm
    ├── raw
    └── speech_stats.npy
```
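To confirm that preprocessing produced this layout, a small check like the following can help. It is only a sketch: `check_dump` is a hypothetical helper, not part of the repository, and it tests nothing beyond the directories and files named in the tree.

```bash
# Hypothetical helper: verify the dump layout listed in the tree.
# Prints every missing path and returns non-zero if anything is absent.
check_dump() {
    root="${1:-dump}"
    missing=0
    for split in train dev test; do
        for sub in norm raw; do
            if [ ! -d "$root/$split/$sub" ]; then
                echo "missing directory: $root/$split/$sub"
                missing=1
            fi
        done
    done
    for f in phone_id_map.txt speaker_id_map.txt train/speech_stats.npy; do
        if [ ! -f "$root/$f" ]; then
            echo "missing file: $root/$f"
            missing=1
        fi
    done
    return "$missing"
}
```

After stage 0 finishes, `check_dump dump && echo "dump looks complete"` gives a quick yes/no answer before training starts.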
The dataset is split into 3 parts, namely `train`, `dev`, and `test`, each of which contains a `norm` and a `raw` subfolder. The raw folder contains the speech features of each utterance, while the norm folder contains the normalized ones. The statistics used to normalize features are computed from the training set and stored in `dump/train/*_stats.npy`.

Also, there is a `metadata.jsonl` in each subfolder. It is a table-like file that contains phones, text_lengths, speech_lengths, durations, the path of speech features, speaker, and id of each utterance.
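
Assuming `metadata.jsonl` follows the usual JSON Lines convention of one record per line, ordinary line tools are enough for a first look. The sketch below uses a hypothetical helper, `peek_metadata`, which reports the number of utterances and prints the first record unchanged. It makes no assumption about the key names inside each record.

```bash
# Hypothetical helper: summarize a JSON Lines metadata file.
peek_metadata() {
    meta="$1"
    if [ ! -f "$meta" ]; then
        echo "no such file: $meta" >&2
        return 1
    fi
    # One record per line, so the number of non-empty lines
    # is the number of utterances.
    count=$(grep -c . "$meta" || true)
    echo "utterances: $count"
    # Show the first record as-is.
    head -n 1 "$meta"
}
```

For instance, `peek_metadata dump/train/norm/metadata.jsonl` would summarize the normalized training split, assuming the `dump` layout described above.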

### Model Training
```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path}
```
`./local/train.sh` calls `${BIN_DIR}/train.py`.

### Synthesizing
We use [HiFiGAN](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/voc5) as the neural vocoder.

Download the pretrained HiFiGAN model from [hifigan_aishell3_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_aishell3_ckpt_0.2.0.zip) and unzip it.
```bash
unzip hifigan_aishell3_ckpt_0.2.0.zip
```
The HiFiGAN checkpoint contains the files listed below.
```text
hifigan_aishell3_ckpt_0.2.0
├── default.yaml                   # default config used to train HiFiGAN
├── feats_stats.npy                # statistics used to normalize spectrogram when training HiFiGAN
└── snapshot_iter_2500000.pdz      # generator parameters of HiFiGAN
```
`./local/synthesize.sh` calls `${BIN_DIR}/../synthesize.py`, which can synthesize waveform from `metadata.jsonl`.
```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name}
```
## Speech Synthesis and Speech Editing
### Prepare
**prepare aligner**
```bash
mkdir -p tools/aligner
cd tools
# download MFA
wget https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner/releases/download/v1.0.1/montreal-forced-aligner_linux.tar.gz
# extract MFA
tar xvf montreal-forced-aligner_linux.tar.gz
# fix .so of MFA
cd montreal-forced-aligner/lib
ln -snf libpython3.6m.so.1.0 libpython3.6m.so
cd -
# download align models and dicts
cd aligner
wget https://paddlespeech.bj.bcebos.com/MFA/ernie_sat/aishell3_model.zip
wget https://paddlespeech.bj.bcebos.com/MFA/AISHELL-3/with_tone/simple.lexicon
wget https://paddlespeech.bj.bcebos.com/MFA/ernie_sat/vctk_model.zip
wget https://paddlespeech.bj.bcebos.com/MFA/LJSpeech-1.1/cmudict-0.7b
cd ../../
```
**prepare pretrained FastSpeech2 models**

ERNIE-SAT uses FastSpeech2 as the phoneme duration predictor:
```bash
# -p keeps the step re-runnable if the directory already exists
mkdir -p download
cd download
wget https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_conformer_baker_ckpt_0.5.zip
wget https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_ljspeech_ckpt_0.5.zip
unzip fastspeech2_conformer_baker_ckpt_0.5.zip
unzip fastspeech2_nosil_ljspeech_ckpt_0.5.zip
cd ../
```
**prepare source data**
```bash
# -p keeps the step re-runnable if the directory already exists
mkdir -p source
cd source
wget https://paddlespeech.bj.bcebos.com/Parakeet/released_models/ernie_sat/source/SSB03540307.wav
wget https://paddlespeech.bj.bcebos.com/Parakeet/released_models/ernie_sat/source/SSB03540428.wav
wget https://paddlespeech.bj.bcebos.com/Parakeet/released_models/ernie_sat/source/LJ050-0278.wav
wget https://paddlespeech.bj.bcebos.com/Parakeet/released_models/ernie_sat/source/p243_313.wav
wget https://paddlespeech.bj.bcebos.com/Parakeet/released_models/ernie_sat/source/p299_096.wav
wget https://paddlespeech.bj.bcebos.com/Parakeet/released_models/ernie_sat/source/this_was_not_the_show_for_me.wav
wget https://paddlespeech.bj.bcebos.com/Parakeet/released_models/ernie_sat/source/README.md
cd ../
```

You can check the text of the downloaded wavs in `source/README.md`.
### Speech Synthesis and Speech Editing
```bash
./run.sh --stage 3 --stop-stage 3 --gpus 0
```
`stage 3` of `run.sh` calls `local/synthesize_e2e.sh`, in which `stage 0` is **Speech Synthesis** and `stage 1` is **Speech Editing**.

You can modify `--wav_path`, `--old_str`, and `--new_str` yourself. `--old_str` should be the text corresponding to the audio of `--wav_path`, and `--new_str` should be designed according to `--task_name`. Both `--source_lang` and `--target_lang` should be `zh` for a model trained on the AISHELL-3 dataset.
## Pretrained Model
Pretrained ERNIE-SAT model:
- [erniesat_aishell3_ckpt_1.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/ernie_sat/erniesat_aishell3_ckpt_1.2.0.zip)

Model | Step | eval/mlm_loss | eval/loss
:-------------:| :------------:| :-----: | :-----:
default | 8(gpu) x 289500 | 51.723782 | 51.723782