Merge branch 'develop' into audio

3 years ago · 592aa28551
parent 41f08ef9fc e6cbcca3e2
commit 592aa28551
8 changed files with 555 additions and 70 deletions
--- a/examples/aishell3/ernie_sat/README.md
+++ b/examples/aishell3/ernie_sat/README.md
@ -1 +1,150 @@
-# ERNIE SAT with AISHELL3 dataset
+# ERNIE-SAT with AISHELL3 dataset
 ERNIE-SAT is a speech-text joint pretraining framework that achieves SOTA results in cross-lingual multi-speaker speech synthesis and cross-lingual speech editing tasks. It can be applied to a series of scenarios such as Speech Editing, personalized Speech Synthesis, and Voice Cloning.
 ## Model Framework
 In ERNIE-SAT, we propose two innovations:
 - In the pretraining process, the phonemes corresponding to Chinese and English are used as input to achieve cross-language and personalized soft phoneme mapping
 - The joint mask learning of speech and text is used to realize the alignment of speech and text
 <p align="center">
    <img src="https://user-images.githubusercontent.com/24568452/186110814-1b9c6618-a0ab-4c0c-bb3d-3d860b0e8cc2.png" />
 </p>
 ## Dataset
 ### Download and Extract
 Download AISHELL-3 from its [Official Website](http://www.aishelltech.com/aishell_3) and extract it to `~/datasets`. The dataset is then in the directory `~/datasets/data_aishell3`.
 ### Get MFA Result and Extract
 We use [MFA2.x](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get durations for aishell3_fastspeech2.
 You can download the alignment result from [aishell3_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/AISHELL-3/with_tone/aishell3_alignment_tone.tar.gz), or train your own MFA model by referring to the [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) in our repo (which uses MFA1.x now).
 ## Get Started
 Assume the path to the dataset is `~/datasets/data_aishell3`.
 Assume the path to the MFA result of AISHELL-3 is `./aishell3_alignment_tone`.
 Run the command below to
 1. **source path**.
 2. preprocess the dataset.
 3. train the model.
 4. synthesize wavs.
    - synthesize waveform from `metadata.jsonl`.
    - synthesize waveform from text file.
 ```bash
 ./run.sh
 ```
 You can choose a range of stages to run, or set `stage` equal to `stop-stage` to run only one stage. For example, the following command only preprocesses the dataset.
 ```bash
 ./run.sh --stage 0 --stop-stage 0
 ```
 ### Data Preprocessing
 ```bash
 ./local/preprocess.sh ${conf_path}
 ```
 When it is done, a `dump` folder is created in the current directory. The structure of the dump folder is listed below.
 ```text
 dump
 ├── dev
 │   ├── norm
 │   └── raw
 ├── phone_id_map.txt
 ├── speaker_id_map.txt
 ├── test
 │   ├── norm
 │   └── raw
 └── train
    ├── norm
    ├── raw
    └── speech_stats.npy
 ```
 The dataset is split into 3 parts, namely `train`, `dev`, and `test`, each of which contains a `norm` and a `raw` subfolder. The raw folder contains the speech features of each utterance, while the norm folder contains the normalized ones. The statistics used to normalize features are computed from the training set and are stored in `dump/train/*_stats.npy`.
 Also, there is a `metadata.jsonl` in each subfolder. It is a table-like file that contains phones, text_lengths, speech_lengths, durations, the path of speech features, speaker, and id of each utterance.
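A `metadata.jsonl` file of this kind can be inspected with a few lines of Python. The following is a minimal sketch, assuming only that each non-empty line is one JSON object; the key name `speaker` is an assumption based on the field list above, and the helper names are hypothetical rather than part of PaddleSpeech.

```python
import json
from pathlib import Path
from typing import Dict, Iterator


def read_metadata(path: Path) -> Iterator[Dict]:
    """Yield one dict per non-empty line of a JSON Lines file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def count_utterances_per_speaker(path: Path) -> Dict[str, int]:
    """Count records per value of the (assumed) "speaker" key."""
    counts: Dict[str, int] = {}
    for record in read_metadata(path):
        # Records without the assumed key are grouped under a placeholder.
        speaker = record.get("speaker", "<unknown>")
        counts[speaker] = counts.get(speaker, 0) + 1
    return counts
```

For instance, `count_utterances_per_speaker(Path("dump/train/norm/metadata.jsonl"))` would summarize the training split, provided the file exists at that location.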
 ### Model Training
 ```bash
 CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path}
 ```
 `./local/train.sh` calls `${BIN_DIR}/train.py`.
 ### Synthesizing
 We use [HiFiGAN](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/voc5) as the neural vocoder.
 Download pretrained HiFiGAN model from [hifigan_aishell3_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_aishell3_ckpt_0.2.0.zip) and unzip it.
 ```bash
 unzip hifigan_aishell3_ckpt_0.2.0.zip
 ```
 HiFiGAN checkpoint contains files listed below.
 ```text
 hifigan_aishell3_ckpt_0.2.0
 ├── default.yaml                    # default config used to train HiFiGAN
 ├── feats_stats.npy                 # statistics used to normalize spectrogram when training HiFiGAN
 └── snapshot_iter_2500000.pdz       # generator parameters of HiFiGAN
 ```
 `./local/synthesize.sh` calls `${BIN_DIR}/../synthesize.py`, which can synthesize waveform from `metadata.jsonl`.
 ```bash
 CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name}
 ```
 ##  Speech Synthesis and Speech Editing
 ### Prepare
 **prepare aligner**
 ```bash
 mkdir -p tools/aligner
 cd tools
 # download MFA
 wget https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner/releases/download/v1.0.1/montreal-forced-aligner_linux.tar.gz
 # extract MFA
 tar xvf montreal-forced-aligner_linux.tar.gz
 # fix .so of MFA
 cd montreal-forced-aligner/lib
 ln -snf libpython3.6m.so.1.0 libpython3.6m.so
 cd -
 # download align models and dicts
 cd aligner
 wget https://paddlespeech.bj.bcebos.com/MFA/ernie_sat/aishell3_model.zip
 wget https://paddlespeech.bj.bcebos.com/MFA/AISHELL-3/with_tone/simple.lexicon
 wget https://paddlespeech.bj.bcebos.com/MFA/ernie_sat/vctk_model.zip
 wget https://paddlespeech.bj.bcebos.com/MFA/LJSpeech-1.1/cmudict-0.7b
 cd ../../
 ```
 **prepare pretrained FastSpeech2 models**
 ERNIE-SAT uses FastSpeech2 as the phoneme duration predictor:
 ```bash
 mkdir download
 cd download
 wget https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_conformer_baker_ckpt_0.5.zip
 wget https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_ljspeech_ckpt_0.5.zip
 unzip fastspeech2_conformer_baker_ckpt_0.5.zip
 unzip fastspeech2_nosil_ljspeech_ckpt_0.5.zip
 cd ../
 ```
 **prepare source data**
 ```bash
 mkdir source
 cd source
 wget https://paddlespeech.bj.bcebos.com/Parakeet/released_models/ernie_sat/source/SSB03540307.wav
 wget https://paddlespeech.bj.bcebos.com/Parakeet/released_models/ernie_sat/source/SSB03540428.wav
 wget https://paddlespeech.bj.bcebos.com/Parakeet/released_models/ernie_sat/source/LJ050-0278.wav
 wget https://paddlespeech.bj.bcebos.com/Parakeet/released_models/ernie_sat/source/p243_313.wav
 wget https://paddlespeech.bj.bcebos.com/Parakeet/released_models/ernie_sat/source/p299_096.wav
 wget https://paddlespeech.bj.bcebos.com/Parakeet/released_models/ernie_sat/source/this_was_not_the_show_for_me.wav
 wget https://paddlespeech.bj.bcebos.com/Parakeet/released_models/ernie_sat/source/README.md
 cd ../
 ```
 You can check the text of downloaded wavs in `source/README.md`.
 ### Speech Synthesis and Speech Editing
 ```bash
 ./run.sh --stage 3 --stop-stage 3 --gpus 0
 ```
 `stage 3` of `run.sh` calls `local/synthesize_e2e.sh`, in which `stage 0` is **Speech Synthesis** and `stage 1` is **Speech Editing**.
 You can modify `--wav_path`, `--old_str`, and `--new_str` yourself. `--old_str` should be the text corresponding to the audio of `--wav_path`, and `--new_str` should be designed according to `--task_name`. Both `--source_lang` and `--target_lang` should be `zh` for a model trained with the AISHELL3 dataset.
 ## Pretrained Model
 Pretrained ErnieSAT model:
 - [erniesat_aishell3_ckpt_1.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/ernie_sat/erniesat_aishell3_ckpt_1.2.0.zip)
 Model | Step | eval/mlm_loss | eval/loss
 :-------------:| :------------:| :-----: | :-----:
 default| 8(gpu) x 289500|51.723782|51.723782
--- a/examples/aishell3_vctk/ernie_sat/README.md
+++ b/examples/aishell3_vctk/ernie_sat/README.md
@ -1 +1,162 @@
-# ERNIE SAT with AISHELL3 and VCTK dataset
+# ERNIE-SAT with AISHELL3 and VCTK dataset
 ERNIE-SAT is a speech-text joint pretraining framework that achieves SOTA results in cross-lingual multi-speaker speech synthesis and cross-lingual speech editing tasks. It can be applied to a series of scenarios such as Speech Editing, personalized Speech Synthesis, and Voice Cloning.
 ## Model Framework
 In ERNIE-SAT, we propose two innovations:
 - In the pretraining process, the phonemes corresponding to Chinese and English are used as input to achieve cross-language and personalized soft phoneme mapping
 - The joint mask learning of speech and text is used to realize the alignment of speech and text
 <p align="center">
    <img src="https://user-images.githubusercontent.com/24568452/186110814-1b9c6618-a0ab-4c0c-bb3d-3d860b0e8cc2.png" />
 </p>
 ## Dataset
 ### Download and Extract
 Download all datasets and extract them to `~/datasets`:
 - The aishell3 dataset is in the directory `~/datasets/data_aishell3`
 - The vctk dataset is in the directory `~/datasets/VCTK-Corpus-0.92`
 ### Get MFA Result and Extract
 We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get durations for the fastspeech2 training.
 You can download from here:
 - [aishell3_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/AISHELL-3/with_tone/aishell3_alignment_tone.tar.gz) 
 - [vctk_alignment.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/VCTK-Corpus-0.92/vctk_alignment.tar.gz)
 Or train your own MFA model by referring to the [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) in our repo (which uses MFA1.x now).
 ## Get Started
 Assume the paths to the datasets are:
 - `~/datasets/data_aishell3` 
 - `~/datasets/VCTK-Corpus-0.92`
 Assume the paths to the MFA results of the datasets are:
 - `./aishell3_alignment_tone`
 - `./vctk_alignment`
 Run the command below to
 1. **source path**.
 2. preprocess the dataset.
 3. train the model.
 4. synthesize wavs.
    - synthesize waveform from `metadata.jsonl`.
    - synthesize waveform from text file.
 ```bash
 ./run.sh
 ```
 You can choose a range of stages to run, or set `stage` equal to `stop-stage` to run only one stage. For example, the following command only preprocesses the dataset.
 ```bash
 ./run.sh --stage 0 --stop-stage 0
 ```
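The stage selection can be pictured as a simple range filter. The sketch below assumes that `run.sh` guards each stage with a check of the form `stage <= N <= stop_stage` (a common convention in Kaldi-style recipes) and that the stages are numbered 0 to 3. Both points are assumptions, and the function is a hypothetical illustration rather than code from the repo.

```python
from typing import List


def stages_to_run(stage: int, stop_stage: int, num_stages: int = 4) -> List[int]:
    """Return the stage numbers that fall inside [stage, stop_stage].

    num_stages=4 is an assumed count (preprocess, train, synthesize,
    synthesize_e2e); adjust it to match the actual script.
    """
    return [s for s in range(num_stages) if stage <= s <= stop_stage]


if __name__ == "__main__":
    print(stages_to_run(0, 0))
    print(stages_to_run(1, 3))
```

Under these assumptions, `--stage 0 --stop-stage 0` selects only the preprocessing stage, while `--stage 3 --stop-stage 3` selects only the last one.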
 ### Data Preprocessing
 ```bash
 ./local/preprocess.sh ${conf_path}
 ```
 When it is done, a `dump` folder is created in the current directory. The structure of the dump folder is listed below.
 ```text
 dump
 ├── dev
 │   ├── norm
 │   └── raw
 ├── phone_id_map.txt
 ├── speaker_id_map.txt
 ├── test
 │   ├── norm
 │   └── raw
 └── train
    ├── norm
    ├── raw
    └── speech_stats.npy
 ```
 The dataset is split into 3 parts, namely `train`, `dev`, and `test`, each of which contains a `norm` and a `raw` subfolder. The raw folder contains the speech features of each utterance, while the norm folder contains the normalized ones. The statistics used to normalize features are computed from the training set and are stored in `dump/train/*_stats.npy`.
 Also, there is a `metadata.jsonl` in each subfolder. It is a table-like file that contains phones, text_lengths, speech_lengths, durations, the path of speech features, speaker, and id of each utterance.
 ### Model Training
 ```bash
 CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path}
 ```
 `./local/train.sh` calls `${BIN_DIR}/train.py`.
 ### Synthesizing
 We use [HiFiGAN](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/voc5) as the neural vocoder.
 Download pretrained HiFiGAN model from [hifigan_aishell3_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_aishell3_ckpt_0.2.0.zip) and unzip it.
 ```bash
 unzip hifigan_aishell3_ckpt_0.2.0.zip
 ```
 HiFiGAN checkpoint contains files listed below.
 ```text
 hifigan_aishell3_ckpt_0.2.0
 ├── default.yaml                    # default config used to train HiFiGAN
 ├── feats_stats.npy                 # statistics used to normalize spectrogram when training HiFiGAN
 └── snapshot_iter_2500000.pdz       # generator parameters of HiFiGAN
 ```
 `./local/synthesize.sh` calls `${BIN_DIR}/../synthesize.py`, which can synthesize waveform from `metadata.jsonl`.
 ```bash
 CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name}
 ```
 ##  Speech Synthesis and Speech Editing
 ### Prepare
 **prepare aligner**
 ```bash
 mkdir -p tools/aligner
 cd tools
 # download MFA
 wget https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner/releases/download/v1.0.1/montreal-forced-aligner_linux.tar.gz
 # extract MFA
 tar xvf montreal-forced-aligner_linux.tar.gz
 # fix .so of MFA
 cd montreal-forced-aligner/lib
 ln -snf libpython3.6m.so.1.0 libpython3.6m.so
 cd -
 # download align models and dicts
 cd aligner
 wget https://paddlespeech.bj.bcebos.com/MFA/ernie_sat/aishell3_model.zip
 wget https://paddlespeech.bj.bcebos.com/MFA/AISHELL-3/with_tone/simple.lexicon
 wget https://paddlespeech.bj.bcebos.com/MFA/ernie_sat/vctk_model.zip
 wget https://paddlespeech.bj.bcebos.com/MFA/LJSpeech-1.1/cmudict-0.7b
 cd ../../
 ```
 **prepare pretrained FastSpeech2 models**
 ERNIE-SAT uses FastSpeech2 as the phoneme duration predictor:
 ```bash
 mkdir download
 cd download
 wget https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_conformer_baker_ckpt_0.5.zip
 wget https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_ljspeech_ckpt_0.5.zip
 unzip fastspeech2_conformer_baker_ckpt_0.5.zip
 unzip fastspeech2_nosil_ljspeech_ckpt_0.5.zip
 cd ../
 ```
 **prepare source data**
 ```bash
 mkdir source
 cd source
 wget https://paddlespeech.bj.bcebos.com/Parakeet/released_models/ernie_sat/source/SSB03540307.wav
 wget https://paddlespeech.bj.bcebos.com/Parakeet/released_models/ernie_sat/source/SSB03540428.wav
 wget https://paddlespeech.bj.bcebos.com/Parakeet/released_models/ernie_sat/source/LJ050-0278.wav
 wget https://paddlespeech.bj.bcebos.com/Parakeet/released_models/ernie_sat/source/p243_313.wav
 wget https://paddlespeech.bj.bcebos.com/Parakeet/released_models/ernie_sat/source/p299_096.wav
 wget https://paddlespeech.bj.bcebos.com/Parakeet/released_models/ernie_sat/source/this_was_not_the_show_for_me.wav
 wget https://paddlespeech.bj.bcebos.com/Parakeet/released_models/ernie_sat/source/README.md
 cd ../
 ```
 You can check the text of downloaded wavs in `source/README.md`.
 ### Cross Language Voice Cloning
 ```bash
 ./run.sh --stage 3 --stop-stage 3 --gpus 0
 ```
 `stage 3` of `run.sh` calls `local/synthesize_e2e.sh`.
 You can modify `--wav_path`, `--old_str`, and `--new_str` yourself. `--old_str` should be the text corresponding to the audio of `--wav_path`, and `--new_str` should be designed according to `--task_name`. `--source_lang` and `--target_lang` should be different in this example.
 ## Pretrained Model
 Pretrained ErnieSAT model:
 - [erniesat_aishell3_vctk_ckpt_1.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/ernie_sat/erniesat_aishell3_vctk_ckpt_1.2.0.zip)
 Model | Step | eval/text_mlm_loss | eval/mlm_loss | eval/loss
 :-------------:| :------------:| :-----: | :-----:| :-----:
 default| 8(gpu) x 489000|0.000001|52.477642 |52.477642
--- a/examples/vctk/ernie_sat/README.md
+++ b/examples/vctk/ernie_sat/README.md
@ -1 +1,151 @@
-# ERNIE SAT with VCTK dataset
+# ERNIE-SAT with VCTK dataset
 ERNIE-SAT is a speech-text joint pretraining framework that achieves SOTA results in cross-lingual multi-speaker speech synthesis and cross-lingual speech editing tasks. It can be applied to a series of scenarios such as Speech Editing, personalized Speech Synthesis, and Voice Cloning.
 ## Model Framework
 In ERNIE-SAT, we propose two innovations:
 - In the pretraining process, the phonemes corresponding to Chinese and English are used as input to achieve cross-language and personalized soft phoneme mapping
 - The joint mask learning of speech and text is used to realize the alignment of speech and text
 <p align="center">
    <img src="https://user-images.githubusercontent.com/24568452/186110814-1b9c6618-a0ab-4c0c-bb3d-3d860b0e8cc2.png" />
 </p>
 ## Dataset
 ### Download and Extract the dataset
 Download VCTK-0.92 from its [Official Website](https://datashare.ed.ac.uk/handle/10283/3443) and extract it to `~/datasets`. The dataset is then in the directory `~/datasets/VCTK-Corpus-0.92`.
 ### Get MFA Result and Extract
 We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get durations for fastspeech2.
 You can download the alignment result from [vctk_alignment.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/VCTK-Corpus-0.92/vctk_alignment.tar.gz), or train your own MFA model by referring to the [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) in our repo.
 Note: we remove three speakers from VCTK-0.92 (see [reorganize_vctk.py](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/examples/other/mfa/local/reorganize_vctk.py)):
 1. `p315`, because it has no text.
 2. `p280` and `p362`, because they have no `*_mic2.flac` files (which are better than `*_mic1.flac`).
 ## Get Started
 Assume the path to the dataset is `~/datasets/VCTK-Corpus-0.92`.
 Assume the path to the MFA result of VCTK is `./vctk_alignment`.
 Run the command below to
 1. **source path**.
 2. preprocess the dataset.
 3. train the model.
 4. synthesize wavs.
    - synthesize waveform from `metadata.jsonl`.
    - synthesize waveform from text file.
 ```bash
 ./run.sh
 ```
 You can choose a range of stages to run, or set `stage` equal to `stop-stage` to run only one stage. For example, the following command only preprocesses the dataset.
 ```bash
 ./run.sh --stage 0 --stop-stage 0
 ```
 ### Data Preprocessing
 ```bash
 ./local/preprocess.sh ${conf_path}
 ```
 When it is done, a `dump` folder is created in the current directory. The structure of the dump folder is listed below.
 ```text
 dump
 ├── dev
 │   ├── norm
 │   └── raw
 ├── phone_id_map.txt
 ├── speaker_id_map.txt
 ├── test
 │   ├── norm
 │   └── raw
 └── train
    ├── norm
    ├── raw
    └── speech_stats.npy
 ```
 The dataset is split into 3 parts, namely `train`, `dev`, and `test`, each of which contains a `norm` and a `raw` subfolder. The raw folder contains the speech features of each utterance, while the norm folder contains the normalized ones. The statistics used to normalize features are computed from the training set and are stored in `dump/train/*_stats.npy`.
 Also, there is a `metadata.jsonl` in each subfolder. It is a table-like file that contains phones, text_lengths, speech_lengths, durations, the path of speech features, speaker, and id of each utterance.
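The relationship between the `raw` and `norm` features can be illustrated with a small sketch. It assumes per-dimension mean/standard-deviation (z-score) normalization with statistics taken from the training split only; the README does not state the exact formula or the layout of the `.npy` file, so the functions below are hypothetical and use plain Python lists instead of the real feature files.

```python
import math
from typing import List, Sequence, Tuple

Frames = List[List[float]]  # one utterance: frames x feature dimensions


def compute_stats(utterances: Sequence[Frames]) -> Tuple[List[float], List[float]]:
    """Per-dimension mean and (population) std over all frames of a split."""
    frames = [frame for utt in utterances for frame in utt]
    n, dims = len(frames), len(frames[0])
    mean = [sum(f[d] for f in frames) / n for d in range(dims)]
    std = [
        math.sqrt(sum((f[d] - mean[d]) ** 2 for f in frames) / n)
        for d in range(dims)
    ]
    return mean, std


def normalize(utt: Frames, mean: List[float], std: List[float],
              eps: float = 1e-8) -> Frames:
    """Apply (x - mean) / std per dimension; eps guards against zero std."""
    return [[(x - m) / (s + eps) for x, m, s in zip(frame, mean, std)]
            for frame in utt]
```

The point of the design is that `compute_stats` sees only training utterances, and the same `mean` and `std` are then reused for `dev` and `test`, so no information from the held-out splits leaks into the normalization.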
 ### Model Training
 ```bash
 CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path}
 ```
 `./local/train.sh` calls `${BIN_DIR}/train.py`.
 ### Synthesizing
 We use [HiFiGAN](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/vctk/voc5) as the neural vocoder.
 Download pretrained HiFiGAN model from [hifigan_vctk_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_vctk_ckpt_0.2.0.zip) and unzip it.
 ```bash
 unzip hifigan_vctk_ckpt_0.2.0.zip
 ```
 HiFiGAN checkpoint contains files listed below.
 ```text
 hifigan_vctk_ckpt_0.2.0
 ├── default.yaml                    # default config used to train HiFiGAN
 ├── feats_stats.npy                 # statistics used to normalize spectrogram when training HiFiGAN
 └── snapshot_iter_2500000.pdz       # generator parameters of HiFiGAN
 ```
 `./local/synthesize.sh` calls `${BIN_DIR}/../synthesize.py`, which can synthesize waveform from `metadata.jsonl`.
 ```bash
 CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name}
 ```
 ##  Speech Synthesis and Speech Editing
 ### Prepare
 **prepare aligner**
 ```bash
 mkdir -p tools/aligner
 cd tools
 # download MFA
 wget https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner/releases/download/v1.0.1/montreal-forced-aligner_linux.tar.gz
 # extract MFA
 tar xvf montreal-forced-aligner_linux.tar.gz
 # fix .so of MFA
 cd montreal-forced-aligner/lib
 ln -snf libpython3.6m.so.1.0 libpython3.6m.so
 cd -
 # download align models and dicts
 cd aligner
 wget https://paddlespeech.bj.bcebos.com/MFA/ernie_sat/aishell3_model.zip
 wget https://paddlespeech.bj.bcebos.com/MFA/AISHELL-3/with_tone/simple.lexicon
 wget https://paddlespeech.bj.bcebos.com/MFA/ernie_sat/vctk_model.zip
 wget https://paddlespeech.bj.bcebos.com/MFA/LJSpeech-1.1/cmudict-0.7b
 cd ../../
 ```
 **prepare pretrained FastSpeech2 models**
 ERNIE-SAT uses FastSpeech2 as the phoneme duration predictor:
 ```bash
 mkdir download
 cd download
 wget https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_conformer_baker_ckpt_0.5.zip
 wget https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_ljspeech_ckpt_0.5.zip
 unzip fastspeech2_conformer_baker_ckpt_0.5.zip
 unzip fastspeech2_nosil_ljspeech_ckpt_0.5.zip
 cd ../
 ```
 **prepare source data**
 ```bash
 mkdir source
 cd source
 wget https://paddlespeech.bj.bcebos.com/Parakeet/released_models/ernie_sat/source/SSB03540307.wav
 wget https://paddlespeech.bj.bcebos.com/Parakeet/released_models/ernie_sat/source/SSB03540428.wav
 wget https://paddlespeech.bj.bcebos.com/Parakeet/released_models/ernie_sat/source/LJ050-0278.wav
 wget https://paddlespeech.bj.bcebos.com/Parakeet/released_models/ernie_sat/source/p243_313.wav
 wget https://paddlespeech.bj.bcebos.com/Parakeet/released_models/ernie_sat/source/p299_096.wav
 wget https://paddlespeech.bj.bcebos.com/Parakeet/released_models/ernie_sat/source/this_was_not_the_show_for_me.wav
 wget https://paddlespeech.bj.bcebos.com/Parakeet/released_models/ernie_sat/source/README.md
 cd ../
 ```
 You can check the text of downloaded wavs in `source/README.md`.
 ### Speech Synthesis and Speech Editing
 ```bash
 ./run.sh --stage 3 --stop-stage 3 --gpus 0
 ```
 `stage 3` of `run.sh` calls `local/synthesize_e2e.sh`, in which `stage 0` is **Speech Synthesis** and `stage 1` is **Speech Editing**.
 You can modify `--wav_path`, `--old_str`, and `--new_str` yourself. `--old_str` should be the text corresponding to the audio of `--wav_path`, and `--new_str` should be designed according to `--task_name`. Both `--source_lang` and `--target_lang` should be `en` for a model trained with the VCTK dataset.
 ## Pretrained Model
 Pretrained ErnieSAT model:
 - [erniesat_vctk_ckpt_1.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/ernie_sat/erniesat_vctk_ckpt_1.2.0.zip)
 Model | Step | eval/mlm_loss | eval/loss
 :-------------:| :------------:| :-----: | :-----:
 default| 8(gpu) x 199500|57.622215|57.622215
--- a/examples/wenetspeech/asr1/RESULTS.md
+++ b/examples/wenetspeech/asr1/RESULTS.md
@ -34,3 +34,22 @@ Pretrain model from http://mobvoi-speech-public.ufile.ucloud.cn/public/wenet/wen
 | conformer | 32.52 M | conf/conformer.yaml | spec_aug  | aishell1 | ctc_greedy_search | - | 0.052534 |  
 | conformer | 32.52 M | conf/conformer.yaml | spec_aug  | aishell1 | ctc_prefix_beam_search | - | 0.052915 |  
 | conformer | 32.52 M | conf/conformer.yaml | spec_aug  | aishell1 | attention_rescoring | - | 0.047904 |  
 ## Conformer Streaming Pretrained Model
 Pretrain model from https://paddlespeech.bj.bcebos.com/s2t/wenetspeech/asr1/asr1_chunk_conformer_wenetspeech_ckpt_1.0.0a.model.tar.gz
 | Model | Params | Config | Augmentation| Test set | Decode method | Chunk Size | CER |  
 | --- | --- | --- | --- | --- | --- | --- | --- |
 | conformer | 32.52 M | conf/chunk_conformer.yaml | spec_aug  | aishell1 | attention | 16 | 0.056273 |  
 | conformer | 32.52 M | conf/chunk_conformer.yaml | spec_aug  | aishell1 | ctc_greedy_search | 16 | 0.078918 |  
 | conformer | 32.52 M | conf/chunk_conformer.yaml | spec_aug  | aishell1 | ctc_prefix_beam_search | 16 | 0.079080 |  
 | conformer | 32.52 M | conf/chunk_conformer.yaml | spec_aug  | aishell1 | attention_rescoring | 16 | 0.054401 |
 | Model | Params | Config | Augmentation| Test set | Decode method | Chunk Size | CER |  
 | --- | --- | --- | --- | --- | --- | --- | --- |
 | conformer | 32.52 M | conf/chunk_conformer.yaml | spec_aug  | aishell1 | attention | -1 | 0.050767 |  
 | conformer | 32.52 M | conf/chunk_conformer.yaml | spec_aug  | aishell1 | ctc_greedy_search | -1 | 0.061884 |  
 | conformer | 32.52 M | conf/chunk_conformer.yaml | spec_aug  | aishell1 | ctc_prefix_beam_search | -1 | 0.062056 |  
 | conformer | 32.52 M | conf/chunk_conformer.yaml | spec_aug  | aishell1 | attention_rescoring | -1 |  0.052110 |
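For readers unfamiliar with the CER column, the sketch below shows the standard definition: total character-level edit distance divided by the total number of reference characters. It is a minimal illustration of the metric, not the scoring script used to produce the numbers above, which may apply additional text normalization.

```python
from typing import Sequence


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = cur
    return prev[-1]


def cer(refs: Sequence[str], hyps: Sequence[str]) -> float:
    """Corpus-level character error rate over paired references and hypotheses."""
    errors = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    total = sum(len(r) for r in refs)
    return errors / total
```

Read this way, a CER of 0.05 corresponds to roughly five character errors per hundred reference characters.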
--- a/paddlespeech/t2s/frontend/g2pw/__init__.py
+++ b/paddlespeech/t2s/frontend/g2pw/__init__.py
@ -1,2 +1 @@
-from paddlespeech.t2s.frontend.g2pw.onnx_api import G2PWOnnxConverter
+from .onnx_api import G2PWOnnxConverter
--- a/paddlespeech/t2s/frontend/g2pw/dataset.py
+++ b/paddlespeech/t2s/frontend/g2pw/dataset.py
@ -15,6 +15,10 @@
 Credits
    This code is modified from https://github.com/GitYCC/g2pW
 """
+from typing import Dict
+from typing import List
+from typing import Tuple
 import numpy as np
 from paddlespeech.t2s.frontend.g2pw.utils import tokenize_and_map
@ -23,22 +27,17 @@ ANCHOR_CHAR = '▁'
 def prepare_onnx_input(tokenizer,
-                       labels,
-                       char2phonemes,
-                       chars,
-                       texts,
-                       query_ids,
-                       phonemes=None,
-                       pos_tags=None,
-                       use_mask=False,
-                       use_char_phoneme=False,
-                       use_pos=False,
-                       window_size=None,
-                       max_len=512):
+                       labels: List[str],
+                       char2phonemes: Dict[str, List[int]],
+                       chars: List[str],
+                       texts: List[str],
+                       query_ids: List[int],
+                       use_mask: bool=False,
+                       window_size: int=None,
+                       max_len: int=512) -> Dict[str, np.array]:
    if window_size is not None:
-        truncated_texts, truncated_query_ids = _truncate_texts(window_size,
+        truncated_texts, truncated_query_ids = _truncate_texts(
-                                                               texts, query_ids)
+            window_size=window_size, texts=texts, query_ids=query_ids)
    input_ids = []
    token_type_ids = []
    attention_masks = []
@ -51,13 +50,19 @@ def prepare_onnx_input(tokenizer,
        query_id = (truncated_query_ids if window_size else query_ids)[idx]
        try:
-            tokens, text2token, token2text = tokenize_and_map(tokenizer, text)
+            tokens, text2token, token2text = tokenize_and_map(
+                tokenizer=tokenizer, text=text)
        except Exception:
            print(f'warning: text "{text}" is invalid')
            return {}
        text, query_id, tokens, text2token, token2text = _truncate(
-            max_len, text, query_id, tokens, text2token, token2text)
+            max_len=max_len,
+            text=text,
+            query_id=query_id,
+            tokens=tokens,
+            text2token=text2token,
+            token2text=token2text)
        processed_tokens = ['[CLS]'] + tokens + ['[SEP]']
@ -91,7 +96,8 @@ def prepare_onnx_input(tokenizer,
    return outputs
-def _truncate_texts(window_size, texts, query_ids):
+def _truncate_texts(window_size: int, texts: List[str],
+                    query_ids: List[int]) -> Tuple[List[str], List[int]]:
    truncated_texts = []
    truncated_query_ids = []
    for text, query_id in zip(texts, query_ids):
@ -105,7 +111,12 @@ def _truncate_texts(window_size, texts, query_ids):
    return truncated_texts, truncated_query_ids
-def _truncate(max_len, text, query_id, tokens, text2token, token2text):
+def _truncate(max_len: int,
+              text: str,
+              query_id: int,
+              tokens: List[str],
+              text2token: List[int],
+              token2text: List[Tuple[int]]):
    truncate_len = max_len - 2
    if len(tokens) <= truncate_len:
        return (text, query_id, tokens, text2token, token2text)
@ -132,18 +143,8 @@ def _truncate(max_len, text, query_id, tokens, text2token, token2text):
    ], [(s - start, e - start) for s, e in token2text[token_start:token_end]])
-def prepare_data(sent_path, lb_path=None):
-    raw_texts = open(sent_path).read().rstrip().split('\n')
-    query_ids = [raw.index(ANCHOR_CHAR) for raw in raw_texts]
-    texts = [raw.replace(ANCHOR_CHAR, '') for raw in raw_texts]
-    if lb_path is None:
-        return texts, query_ids
-    else:
-        phonemes = open(lb_path).read().rstrip().split('\n')
-        return texts, query_ids, phonemes
-def get_phoneme_labels(polyphonic_chars):
+def get_phoneme_labels(polyphonic_chars: List[List[str]]
+                       ) -> Tuple[List[str], Dict[str, List[int]]]:
    labels = sorted(list(set([phoneme for char, phoneme in polyphonic_chars])))
    char2phonemes = {}
    for char, phoneme in polyphonic_chars:
@ -153,7 +154,8 @@ def get_phoneme_labels(polyphonic_chars):
    return labels, char2phonemes
-def get_char_phoneme_labels(polyphonic_chars):
+def get_char_phoneme_labels(polyphonic_chars: List[List[str]]
+                            ) -> Tuple[List[str], Dict[str, List[int]]]:
    labels = sorted(
        list(set([f'{char} {phoneme}' for char, phoneme in polyphonic_chars])))
    char2phonemes = {}
--- a/paddlespeech/t2s/frontend/g2pw/onnx_api.py
+++ b/paddlespeech/t2s/frontend/g2pw/onnx_api.py
@ -17,6 +17,10 @@ Credits
 """
 import json
 import os
+from typing import Any
+from typing import Dict
+from typing import List
+from typing import Tuple
 import numpy as np
 import onnxruntime
@ -34,7 +38,8 @@ from paddlespeech.t2s.frontend.g2pw.utils import load_config
 from paddlespeech.utils.env import MODEL_HOME
-def predict(session, onnx_input, labels):
+def predict(session, onnx_input: Dict[str, Any],
+            labels: List[str]) -> Tuple[List[str], List[float]]:
    all_preds = []
    all_confidences = []
    probs = session.run([], {
@ -58,13 +63,12 @@ def predict(session, onnx_input, labels):
 class G2PWOnnxConverter:
    def __init__(self,
-                 model_dir=MODEL_HOME,
+                 model_dir: os.PathLike=MODEL_HOME,
-                 style='bopomofo',
+                 style: str='bopomofo',
-                 model_source=None,
+                 model_source: str=None,
-                 enable_non_tradional_chinese=False):
+                 enable_non_tradional_chinese: bool=False):
        if not os.path.exists(os.path.join(model_dir, 'G2PWModel/g2pW.onnx')):
        uncompress_path = download_and_decompress(
-                g2pw_onnx_models['G2PWModel']['1.0'], model_dir)
+            g2pw_onnx_models['G2PWModel'][model_version], model_dir)
        sess_options = onnxruntime.SessionOptions()
        sess_options.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_ENABLE_ALL
@ -74,7 +78,8 @@ class G2PWOnnxConverter:
            os.path.join(model_dir, 'G2PWModel/g2pW.onnx'),
            sess_options=sess_options)
        self.config = load_config(
-            os.path.join(model_dir, 'G2PWModel/config.py'), use_default=True)
+            config_path=os.path.join(uncompress_path, 'config.py'),
+            use_default=True)
        self.model_source = model_source if model_source else self.config.model_source
        self.enable_opencc = enable_non_tradional_chinese
@ -96,9 +101,9 @@ class G2PWOnnxConverter:
            .strip().split('\n')
        ]
        self.labels, self.char2phonemes = get_char_phoneme_labels(
-            self.polyphonic_chars
+            polyphonic_chars=self.polyphonic_chars
        ) if self.config.use_char_phoneme else get_phoneme_labels(
-            self.polyphonic_chars)
+            polyphonic_chars=self.polyphonic_chars)
        self.chars = sorted(list(self.char2phonemes.keys()))
        self.pos_tags = [
@ -125,7 +130,7 @@ class G2PWOnnxConverter:
        if self.enable_opencc:
            self.cc = OpenCC('s2tw')
-    def _convert_bopomofo_to_pinyin(self, bopomofo):
+    def _convert_bopomofo_to_pinyin(self, bopomofo: str) -> str:
        tone = bopomofo[-1]
        assert tone in '12345'
        component = self.bopomofo_convert_dict.get(bopomofo[:-1])
@ -135,7 +140,7 @@ class G2PWOnnxConverter:
            print(f'Warning: "{bopomofo}" cannot convert to pinyin')
            return None
-    def __call__(self, sentences):
+    def __call__(self, sentences: List[str]) -> List[List[str]]:
        if isinstance(sentences, str):
            sentences = [sentences]
@ -148,23 +153,25 @@ class G2PWOnnxConverter:
            sentences = translated_sentences
        texts, query_ids, sent_ids, partial_results = self._prepare_data(
-            sentences)
+            sentences=sentences)
        if len(texts) == 0:
            # sentences no polyphonic words
            return partial_results
        onnx_input = prepare_onnx_input(
-            self.tokenizer,
+            tokenizer=self.tokenizer,
-            self.labels,
+            labels=self.labels,
-            self.char2phonemes,
+            char2phonemes=self.char2phonemes,
-            self.chars,
+            chars=self.chars,
-            texts,
+            texts=texts,
-            query_ids,
+            query_ids=query_ids,
            use_mask=self.config.use_mask,
-            use_char_phoneme=self.config.use_char_phoneme,
            window_size=None)
-        preds, confidences = predict(self.session_g2pW, onnx_input, self.labels)
+        preds, confidences = predict(
+            session=self.session_g2pW,
+            onnx_input=onnx_input,
+            labels=self.labels)
        if self.config.use_char_phoneme:
            preds = [pred.split(' ')[1] for pred in preds]
@ -174,12 +181,9 @@ class G2PWOnnxConverter:
        return results
-    def _prepare_data(self, sentences):
-        polyphonic_chars = set(self.chars)
-        monophonic_chars_dict = {
-            char: phoneme
-            for char, phoneme in self.monophonic_chars
-        }
+    def _prepare_data(
+            self, sentences: List[str]
+    ) -> Tuple[List[str], List[int], List[int], List[List[str]]]:
        texts, query_ids, sent_ids, partial_results = [], [], [], []
        for sent_id, sent in enumerate(sentences):
            pypinyin_result = pinyin(sent, style=Style.TONE3)
--- a/paddlespeech/t2s/frontend/g2pw/utils.py
+++ b/paddlespeech/t2s/frontend/g2pw/utils.py
@ -15,10 +15,11 @@
 Credits
    This code is modified from https://github.com/GitYCC/g2pW
 """
+import os
 import re
-def wordize_and_map(text):
+def wordize_and_map(text: str):
    words = []
    index_map_from_text_to_word = []
    index_map_from_word_to_text = []
@ -54,8 +55,8 @@ def wordize_and_map(text):
    return words, index_map_from_text_to_word, index_map_from_word_to_text
-def tokenize_and_map(tokenizer, text):
+def tokenize_and_map(tokenizer, text: str):
-    words, text2word, word2text = wordize_and_map(text)
+    words, text2word, word2text = wordize_and_map(text=text)
    tokens = []
    index_map_from_token_to_text = []
@ -82,7 +83,7 @@ def tokenize_and_map(tokenizer, text):
    return tokens, index_map_from_text_to_token, index_map_from_token_to_text
-def _load_config(config_path):
+def _load_config(config_path: os.PathLike):
    import importlib.util
    spec = importlib.util.spec_from_file_location('__init__', config_path)
    config = importlib.util.module_from_spec(spec)
@ -130,7 +131,7 @@ default_config_dict = {
 }
-def load_config(config_path, use_default=False):
+def load_config(config_path: os.PathLike, use_default: bool=False):
    config = _load_config(config_path)
    if use_default:
        for attr, val in default_config_dict.items():
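The pattern in `_load_config` and `load_config` (import a Python file as a module, then fall back to defaults) can be reproduced in isolation. The sketch below is a hedged illustration: the function name `load_module_config`, the module name passed to `importlib`, and the contents of `EXAMPLE_DEFAULTS` are hypothetical. It also assumes that a default is applied only when the attribute is missing, which the truncated hunk above does not show.

```python
import importlib.util
import os
from types import ModuleType
from typing import Any, Dict, Optional

# Hypothetical defaults for illustration; not the repo's default_config_dict.
EXAMPLE_DEFAULTS: Dict[str, Any] = {"use_mask": False, "window_size": None}


def load_module_config(config_path: os.PathLike,
                       defaults: Optional[Dict[str, Any]]=None) -> ModuleType:
    """Execute a .py config file as a module and fill in missing attributes."""
    # The path should end in ".py" so importlib can pick a source loader.
    spec = importlib.util.spec_from_file_location("example_config",
                                                  config_path)
    config = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(config)
    for attr, val in (defaults or {}).items():
        # Assumption: values set in the file take precedence over defaults.
        if not hasattr(config, attr):
            setattr(config, attr, val)
    return config
```

A config file containing only `use_mask = True` would then yield a module whose `use_mask` is `True` and whose `window_size` falls back to `None` when `EXAMPLE_DEFAULTS` is passed as `defaults`.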