[TTS] Add hifigan (#1097)

* add hifigan
* integrate synthesize, synthesize_e2e, inference for tts, test=tts
* add some python files, test=tts
* update readme, test=doc_fix

parent 675cff258b
commit 19ef7210a0
@ -0,0 +1,117 @@
# HiFiGAN with CSMSC

This example contains code used to train a [HiFiGAN](https://arxiv.org/abs/2010.05646) model with the [Chinese Standard Mandarin Speech Corpus](https://www.data-baker.com/open_source.html).

## Dataset
### Download and Extract
Download CSMSC from the [official website](https://www.data-baker.com/data/index/source) and extract it to `~/datasets`. Then the dataset is in the directory `~/datasets/BZNSYP`.

### Get MFA Result and Extract
We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) results to cut the silence at the edges of audio.
You can download it from here: [baker_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/BZNSYP/with_tone/baker_alignment_tone.tar.gz), or train your own MFA model by referring to the [mfa example](https://github.com/PaddlePaddle/Parakeet/tree/develop/examples/mfa) in our repo.

## Get Started
Assume the path to the dataset is `~/datasets/BZNSYP`.
Assume the path to the MFA result of CSMSC is `./baker_alignment_tone`.
Run the command below to
1. **source path**.
2. preprocess the dataset.
3. train the model.
4. synthesize wavs.
    - synthesize waveform from `metadata.jsonl`.
```bash
./run.sh
```
You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage. For example, running the following command will only preprocess the dataset.
```bash
./run.sh --stage 0 --stop-stage 0
```

### Data Preprocessing
```bash
./local/preprocess.sh ${conf_path}
```
When it is done, a `dump` folder is created in the current directory. The structure of the dump folder is listed below.
```text
dump
├── dev
│   ├── norm
│   └── raw
├── test
│   ├── norm
│   └── raw
└── train
    ├── norm
    ├── raw
    └── feats_stats.npy
```

The dataset is split into 3 parts, namely `train`, `dev`, and `test`, each of which contains a `norm` and `raw` subfolder. The `raw` folder contains the log magnitude of the mel spectrogram of each utterance, while the `norm` folder contains the normalized spectrogram. The statistics used to normalize the spectrogram are computed from the training set and stored in `dump/train/feats_stats.npy`.
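
The normalization step can be sketched as follows. This is only an illustration, not the actual `normalize.py`; in particular, the layout of `feats_stats.npy` (mean in row 0, standard deviation in row 1) is an assumption here.

```python
import numpy as np

def normalize(feats, stats):
    """Per-dimension mean/std normalization of a mel spectrogram.

    ``stats`` is assumed to hold the training-set mean in row 0 and the
    standard deviation in row 1 (shape: [2, n_mels]).
    """
    mean, std = stats[0], stats[1]
    return (feats - mean) / std

# toy example: 4 frames of an 80-dim mel spectrogram
rng = np.random.default_rng(0)
feats = rng.normal(loc=3.0, scale=2.0, size=(4, 80))
stats = np.stack([feats.mean(axis=0), feats.std(axis=0)])
normed = normalize(feats, stats)
```

Because `dev` and `test` reuse the training-set statistics, their normalized features are generally not exactly zero-mean, unit-variance.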

Also, there is a `metadata.jsonl` in each subfolder. It is a table-like file that contains the id and the paths to the spectrogram of each utterance.
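
`metadata.jsonl` is a jsonlines file (one JSON object per line), so it can be read with the standard library alone. A minimal sketch; the field names used in the toy records (`utt_id`, `feats`) are illustrative assumptions:

```python
import io
import json

def read_metadata(fp):
    """Parse a jsonlines file object into a list of record dicts."""
    return [json.loads(line) for line in fp if line.strip()]

# toy stand-in for dump/train/norm/metadata.jsonl
sample = io.StringIO(
    '{"utt_id": "009901", "feats": "dump/train/norm/009901_feats.npy"}\n'
    '{"utt_id": "009902", "feats": "dump/train/norm/009902_feats.npy"}\n'
)
records = read_metadata(sample)
```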

### Model Training
```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path}
```
`./local/train.sh` calls `${BIN_DIR}/train.py`.
Here's the complete help message.

```text
usage: train.py [-h] [--config CONFIG] [--train-metadata TRAIN_METADATA]
                [--dev-metadata DEV_METADATA] [--output-dir OUTPUT_DIR]
                [--ngpu NGPU] [--verbose VERBOSE]

Train a HiFiGAN model.

optional arguments:
  -h, --help            show this help message and exit
  --config CONFIG       config file to overwrite default config.
  --train-metadata TRAIN_METADATA
                        training data.
  --dev-metadata DEV_METADATA
                        dev data.
  --output-dir OUTPUT_DIR
                        output dir.
  --ngpu NGPU           if ngpu == 0, use cpu.
  --verbose VERBOSE     verbose.
```

1. `--config` is a config file in yaml format to overwrite the default config, which can be found at `conf/default.yaml`.
2. `--train-metadata` and `--dev-metadata` should be the metadata files in the normalized subfolders of `train` and `dev` in the `dump` folder.
3. `--output-dir` is the directory to save the results of the experiment. Checkpoints are saved in `checkpoints/` inside this directory.
4. `--ngpu` is the number of gpus to use; if `ngpu == 0`, the CPU is used.
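
The help message above corresponds to an argparse setup roughly like the following. This is a sketch for illustration, not the actual `train.py`:

```python
import argparse

# Mirror the CLI described by the help message above (illustrative only).
parser = argparse.ArgumentParser(description="Train a HiFiGAN model.")
parser.add_argument("--config", type=str, help="config file to overwrite default config.")
parser.add_argument("--train-metadata", type=str, help="training data.")
parser.add_argument("--dev-metadata", type=str, help="dev data.")
parser.add_argument("--output-dir", type=str, help="output dir.")
parser.add_argument("--ngpu", type=int, default=1, help="if ngpu == 0, use cpu.")

# parse a sample command line instead of sys.argv
args = parser.parse_args(["--ngpu", "0", "--output-dir", "exp/default"])
```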

### Synthesizing
`./local/synthesize.sh` calls `${BIN_DIR}/../synthesize.py`, which can synthesize waveform from `metadata.jsonl`.
```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name}
```

```text
usage: synthesize.py [-h] [--generator-type GENERATOR_TYPE] [--config CONFIG]
                     [--checkpoint CHECKPOINT] [--test-metadata TEST_METADATA]
                     [--output-dir OUTPUT_DIR] [--ngpu NGPU]
                     [--verbose VERBOSE]

Synthesize with GANVocoder.

optional arguments:
  -h, --help            show this help message and exit
  --generator-type GENERATOR_TYPE
                        type of GANVocoder, should in {pwgan, mb_melgan,
                        style_melgan, } now
  --config CONFIG       GANVocoder config file.
  --checkpoint CHECKPOINT
                        snapshot to load.
  --test-metadata TEST_METADATA
                        dev data.
  --output-dir OUTPUT_DIR
                        output dir.
  --ngpu NGPU           if ngpu == 0, use cpu.
  --verbose VERBOSE     verbose.
```

1. `--config` is the config file. You should use the same config with which the model is trained.
2. `--checkpoint` is the checkpoint to load. Pick one of the checkpoints from `checkpoints` inside the training output directory.
3. `--test-metadata` is the metadata of the test dataset. Use the `metadata.jsonl` in the `dev/norm` subfolder from the processed directory.
4. `--output-dir` is the directory to save the synthesized audio files.
5. `--ngpu` is the number of gpus to use; if `ngpu == 0`, the CPU is used.

## Fine-tuning
@ -0,0 +1,167 @@
# This is the configuration file for the CSMSC dataset.
# This configuration is based on HiFiGAN V1, which is an official configuration.
# But I found that the optimizer setting does not work well with my implementation.
# So I changed the optimizer settings as follows:
# - AdamW -> Adam
# - betas: [0.8, 0.99] -> betas: [0.5, 0.9]
# - Scheduler: ExponentialLR -> MultiStepLR
# To match the shift size difference, the upsample scales are also modified from the original 256-shift setting.

###########################################################
#                FEATURE EXTRACTION SETTING               #
###########################################################
fs: 24000          # Sampling rate.
n_fft: 2048        # FFT size (samples).
n_shift: 300       # Hop size (samples). 12.5ms
win_length: 1200   # Window length (samples). 50ms
                   # If set to null, it will be the same as fft_size.
window: "hann"     # Window function.
n_mels: 80         # Number of mel basis.
fmin: 80           # Minimum frequency in mel basis calculation. (Hz)
fmax: 7600         # Maximum frequency in mel basis calculation. (Hz)

###########################################################
#         GENERATOR NETWORK ARCHITECTURE SETTING          #
###########################################################
generator_params:
    in_channels: 80                       # Number of input channels.
    out_channels: 1                       # Number of output channels.
    channels: 512                         # Number of initial channels.
    kernel_size: 7                        # Kernel size of initial and final conv layers.
    upsample_scales: [5, 5, 4, 3]         # Upsampling scales.
    upsample_kernel_sizes: [10, 10, 8, 6] # Kernel size for upsampling layers.
    resblock_kernel_sizes: [3, 7, 11]     # Kernel size for residual blocks.
    resblock_dilations:                   # Dilations for residual blocks.
        - [1, 3, 5]
        - [1, 3, 5]
        - [1, 3, 5]
    use_additional_convs: true            # Whether to use additional conv layers in residual blocks.
    bias: true                            # Whether to use bias parameter in conv.
    nonlinear_activation: "leakyrelu"     # Nonlinear activation type.
    nonlinear_activation_params:          # Nonlinear activation parameters.
        negative_slope: 0.1
    use_weight_norm: true                 # Whether to apply weight normalization.

###########################################################
#       DISCRIMINATOR NETWORK ARCHITECTURE SETTING        #
###########################################################
discriminator_params:
    scales: 3                              # Number of multi-scale discriminators.
    scale_downsample_pooling: "AvgPool1D"  # Pooling operation for scale discriminator.
    scale_downsample_pooling_params:
        kernel_size: 4                     # Pooling kernel size.
        stride: 2                          # Pooling stride.
        padding: 2                         # Padding size.
    scale_discriminator_params:
        in_channels: 1                     # Number of input channels.
        out_channels: 1                    # Number of output channels.
        kernel_sizes: [15, 41, 5, 3]       # List of kernel sizes.
        channels: 128                      # Initial number of channels.
        max_downsample_channels: 1024      # Maximum number of channels in downsampling conv layers.
        max_groups: 16                     # Maximum number of groups in downsampling conv layers.
        bias: true
        downsample_scales: [4, 4, 4, 4, 1] # Downsampling scales.
        nonlinear_activation: "leakyrelu"  # Nonlinear activation.
        nonlinear_activation_params:
            negative_slope: 0.1
    follow_official_norm: true             # Whether to follow the official norm setting.
    periods: [2, 3, 5, 7, 11]              # List of periods for multi-period discriminator.
    period_discriminator_params:
        in_channels: 1                     # Number of input channels.
        out_channels: 1                    # Number of output channels.
        kernel_sizes: [5, 3]               # List of kernel sizes.
        channels: 32                       # Initial number of channels.
        downsample_scales: [3, 3, 3, 3, 1] # Downsampling scales.
        max_downsample_channels: 1024      # Maximum number of channels in downsampling conv layers.
        bias: true                         # Whether to use bias parameter in conv layer.
        nonlinear_activation: "leakyrelu"  # Nonlinear activation.
        nonlinear_activation_params:       # Nonlinear activation parameters.
            negative_slope: 0.1
        use_weight_norm: true              # Whether to apply weight normalization.
        use_spectral_norm: false           # Whether to apply spectral normalization.

###########################################################
#                   STFT LOSS SETTING                     #
###########################################################
use_stft_loss: false # Whether to use multi-resolution STFT loss.
use_mel_loss: true   # Whether to use Mel-spectrogram loss.
mel_loss_params:
    fs: 24000
    fft_size: 2048
    hop_size: 300
    win_length: 1200
    window: "hann"
    num_mels: 80
    fmin: 0
    fmax: 12000
    log_base: null
generator_adv_loss_params:
    average_by_discriminators: false # Whether to average loss by #discriminators.
discriminator_adv_loss_params:
    average_by_discriminators: false # Whether to average loss by #discriminators.
use_feat_match_loss: true
feat_match_loss_params:
    average_by_discriminators: false # Whether to average loss by #discriminators.
    average_by_layers: false         # Whether to average loss by #layers in each discriminator.
    include_final_outputs: false     # Whether to include final outputs in feat match loss calculation.

###########################################################
#               ADVERSARIAL LOSS SETTING                  #
###########################################################
lambda_aux: 45.0       # Loss balancing coefficient for STFT loss.
lambda_adv: 1.0        # Loss balancing coefficient for adversarial loss.
lambda_feat_match: 2.0 # Loss balancing coefficient for feat match loss.

###########################################################
#                  DATA LOADER SETTING                    #
###########################################################
batch_size: 16        # Batch size.
batch_max_steps: 8400 # Length of each audio in batch. Make sure it is divisible by hop_size.
num_workers: 2        # Number of workers in DataLoader.

###########################################################
#             OPTIMIZER & SCHEDULER SETTING               #
###########################################################
generator_optimizer_params:
    beta1: 0.5
    beta2: 0.9
    weight_decay: 0.0     # Generator's weight decay coefficient.
generator_scheduler_params:
    learning_rate: 2.0e-4 # Generator's learning rate.
    gamma: 0.5            # Generator's scheduler gamma.
    milestones:           # At each milestone, lr will be multiplied by gamma.
        - 200000
        - 400000
        - 600000
        - 800000
generator_grad_norm: -1   # Generator's gradient norm.
discriminator_optimizer_params:
    beta1: 0.5
    beta2: 0.9
    weight_decay: 0.0     # Discriminator's weight decay coefficient.
discriminator_scheduler_params:
    learning_rate: 2.0e-4 # Discriminator's learning rate.
    gamma: 0.5            # Discriminator's scheduler gamma.
    milestones:           # At each milestone, lr will be multiplied by gamma.
        - 200000
        - 400000
        - 600000
        - 800000
discriminator_grad_norm: -1 # Discriminator's gradient norm.

###########################################################
#                    INTERVAL SETTING                     #
###########################################################
generator_train_start_steps: 1     # Number of steps to start to train generator.
discriminator_train_start_steps: 0 # Number of steps to start to train discriminator.
train_max_steps: 2500000           # Number of training steps.
save_interval_steps: 5000          # Interval steps to save checkpoint.
eval_interval_steps: 1000          # Interval steps to evaluate the network.

###########################################################
#                     OTHER SETTING                       #
###########################################################
num_snapshots: 10 # max number of snapshots to keep while training
seed: 42          # random seed for paddle, random, and np.random
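
As the header comments note, the upsample scales were adapted from the original 256-shift setting to the 300-sample hop used here: the product of `upsample_scales` must equal `n_shift`, because the generator expands each mel frame into one hop of waveform samples. The same relation makes `batch_max_steps` a whole number of frames. A quick check of the numbers above:

```python
import math

n_shift = 300                   # hop size from the feature settings above
upsample_scales = [5, 5, 4, 3]  # generator upsampling scales
total_upsampling = math.prod(upsample_scales)

batch_max_steps = 8400          # audio samples per training example
frames_per_example = batch_max_steps // n_shift
```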
@ -0,0 +1,168 @@
# This is the configuration file for the CSMSC dataset.
# This configuration is based on HiFiGAN V1, which is an official configuration.
# But I found that the optimizer setting does not work well with my implementation.
# So I changed the optimizer settings as follows:
# - AdamW -> Adam
# - betas: [0.8, 0.99] -> betas: [0.5, 0.9]
# - Scheduler: ExponentialLR -> MultiStepLR
# To match the shift size difference, the upsample scales are also modified from the original 256-shift setting.

###########################################################
#                FEATURE EXTRACTION SETTING               #
###########################################################
fs: 24000          # Sampling rate.
n_fft: 2048        # FFT size (samples).
n_shift: 300       # Hop size (samples). 12.5ms
win_length: 1200   # Window length (samples). 50ms
                   # If set to null, it will be the same as fft_size.
window: "hann"     # Window function.
n_mels: 80         # Number of mel basis.
fmin: 80           # Minimum frequency in mel basis calculation. (Hz)
fmax: 7600         # Maximum frequency in mel basis calculation. (Hz)

###########################################################
#         GENERATOR NETWORK ARCHITECTURE SETTING          #
###########################################################
generator_params:
    in_channels: 80                       # Number of input channels.
    out_channels: 1                       # Number of output channels.
    channels: 512                         # Number of initial channels.
    kernel_size: 7                        # Kernel size of initial and final conv layers.
    upsample_scales: [5, 5, 4, 3]         # Upsampling scales.
    upsample_kernel_sizes: [10, 10, 8, 6] # Kernel size for upsampling layers.
    resblock_kernel_sizes: [3, 7, 11]     # Kernel size for residual blocks.
    resblock_dilations:                   # Dilations for residual blocks.
        - [1, 3, 5]
        - [1, 3, 5]
        - [1, 3, 5]
    use_additional_convs: true            # Whether to use additional conv layers in residual blocks.
    bias: true                            # Whether to use bias parameter in conv.
    nonlinear_activation: "leakyrelu"     # Nonlinear activation type.
    nonlinear_activation_params:          # Nonlinear activation parameters.
        negative_slope: 0.1
    use_weight_norm: true                 # Whether to apply weight normalization.

###########################################################
#       DISCRIMINATOR NETWORK ARCHITECTURE SETTING        #
###########################################################
discriminator_params:
    scales: 3                              # Number of multi-scale discriminators.
    scale_downsample_pooling: "AvgPool1D"  # Pooling operation for scale discriminator.
    scale_downsample_pooling_params:
        kernel_size: 4                     # Pooling kernel size.
        stride: 2                          # Pooling stride.
        padding: 2                         # Padding size.
    scale_discriminator_params:
        in_channels: 1                     # Number of input channels.
        out_channels: 1                    # Number of output channels.
        kernel_sizes: [15, 41, 5, 3]       # List of kernel sizes.
        channels: 128                      # Initial number of channels.
        max_downsample_channels: 1024      # Maximum number of channels in downsampling conv layers.
        max_groups: 16                     # Maximum number of groups in downsampling conv layers.
        bias: true
        downsample_scales: [4, 4, 4, 4, 1] # Downsampling scales.
        nonlinear_activation: "leakyrelu"  # Nonlinear activation.
        nonlinear_activation_params:
            negative_slope: 0.1
    follow_official_norm: true             # Whether to follow the official norm setting.
    periods: [2, 3, 5, 7, 11]              # List of periods for multi-period discriminator.
    period_discriminator_params:
        in_channels: 1                     # Number of input channels.
        out_channels: 1                    # Number of output channels.
        kernel_sizes: [5, 3]               # List of kernel sizes.
        channels: 32                       # Initial number of channels.
        downsample_scales: [3, 3, 3, 3, 1] # Downsampling scales.
        max_downsample_channels: 1024      # Maximum number of channels in downsampling conv layers.
        bias: true                         # Whether to use bias parameter in conv layer.
        nonlinear_activation: "leakyrelu"  # Nonlinear activation.
        nonlinear_activation_params:       # Nonlinear activation parameters.
            negative_slope: 0.1
        use_weight_norm: true              # Whether to apply weight normalization.
        use_spectral_norm: false           # Whether to apply spectral normalization.

###########################################################
#                   STFT LOSS SETTING                     #
###########################################################
use_stft_loss: false # Whether to use multi-resolution STFT loss.
use_mel_loss: true   # Whether to use Mel-spectrogram loss.
mel_loss_params:
    fs: 24000
    fft_size: 2048
    hop_size: 300
    win_length: 1200
    window: "hann"
    num_mels: 80
    fmin: 0
    fmax: 12000
    log_base: null
generator_adv_loss_params:
    average_by_discriminators: false # Whether to average loss by #discriminators.
discriminator_adv_loss_params:
    average_by_discriminators: false # Whether to average loss by #discriminators.
use_feat_match_loss: true
feat_match_loss_params:
    average_by_discriminators: false # Whether to average loss by #discriminators.
    average_by_layers: false         # Whether to average loss by #layers in each discriminator.
    include_final_outputs: false     # Whether to include final outputs in feat match loss calculation.

###########################################################
#               ADVERSARIAL LOSS SETTING                  #
###########################################################
lambda_aux: 45.0       # Loss balancing coefficient for STFT loss.
lambda_adv: 1.0        # Loss balancing coefficient for adversarial loss.
lambda_feat_match: 2.0 # Loss balancing coefficient for feat match loss.

###########################################################
#                  DATA LOADER SETTING                    #
###########################################################
batch_size: 16        # Batch size.
batch_max_steps: 8400 # Length of each audio in batch. Make sure it is divisible by hop_size.
num_workers: 2        # Number of workers in DataLoader.

###########################################################
#             OPTIMIZER & SCHEDULER SETTING               #
###########################################################
generator_optimizer_params:
    beta1: 0.5
    beta2: 0.9
    weight_decay: 0.0     # Generator's weight decay coefficient.
generator_scheduler_params:
    learning_rate: 2.0e-4 # Generator's learning rate.
    gamma: 0.5            # Generator's scheduler gamma.
    milestones:           # At each milestone, lr will be multiplied by gamma.
        - 200000
        - 400000
        - 600000
        - 800000
generator_grad_norm: -1   # Generator's gradient norm.
discriminator_optimizer_params:
    beta1: 0.5
    beta2: 0.9
    weight_decay: 0.0     # Discriminator's weight decay coefficient.
discriminator_scheduler_params:
    learning_rate: 2.0e-4 # Discriminator's learning rate.
    gamma: 0.5            # Discriminator's scheduler gamma.
    milestones:           # At each milestone, lr will be multiplied by gamma.
        - 200000
        - 400000
        - 600000
        - 800000
discriminator_grad_norm: -1 # Discriminator's gradient norm.

###########################################################
#                    INTERVAL SETTING                     #
###########################################################
generator_train_start_steps: 1     # Number of steps to start to train generator.
discriminator_train_start_steps: 0 # Number of steps to start to train discriminator.
train_max_steps: 2500000           # Number of training steps.
save_interval_steps: 10000         # Interval steps to save checkpoint.
eval_interval_steps: 1000          # Interval steps to evaluate the network.
log_interval_steps: 100            # Interval steps to record the training log.

###########################################################
#                     OTHER SETTING                       #
###########################################################
num_snapshots: 10 # max number of snapshots to keep while training
seed: 42          # random seed for paddle, random, and np.random
@ -0,0 +1,62 @@
#!/bin/bash

source path.sh

gpus=0
stage=0
stop_stage=100

source ${MAIN_ROOT}/utils/parse_options.sh || exit 1

if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
    python3 ${MAIN_ROOT}/paddlespeech/t2s/exps/fastspeech2/gen_gta_mel.py \
        --fastspeech2-config=fastspeech2_nosil_baker_ckpt_0.4/default.yaml \
        --fastspeech2-checkpoint=fastspeech2_nosil_baker_ckpt_0.4/snapshot_iter_76000.pdz \
        --fastspeech2-stat=fastspeech2_nosil_baker_ckpt_0.4/speech_stats.npy \
        --dur-file=durations.txt \
        --output-dir=dump_finetune \
        --phones-dict=fastspeech2_nosil_baker_ckpt_0.4/phone_id_map.txt
fi

if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
    python3 local/link_wav.py \
        --old-dump-dir=dump \
        --dump-dir=dump_finetune
fi

if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
    # reuse features' stats (mean and std) computed on the original dump
    echo "Get features' stats ..."
    cp dump/train/feats_stats.npy dump_finetune/train/
fi

if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
    # normalize; dev and test should use train's stats
    echo "Normalize ..."
    python3 ${BIN_DIR}/../normalize.py \
        --metadata=dump_finetune/train/raw/metadata.jsonl \
        --dumpdir=dump_finetune/train/norm \
        --stats=dump_finetune/train/feats_stats.npy
    python3 ${BIN_DIR}/../normalize.py \
        --metadata=dump_finetune/dev/raw/metadata.jsonl \
        --dumpdir=dump_finetune/dev/norm \
        --stats=dump_finetune/train/feats_stats.npy
    python3 ${BIN_DIR}/../normalize.py \
        --metadata=dump_finetune/test/raw/metadata.jsonl \
        --dumpdir=dump_finetune/test/norm \
        --stats=dump_finetune/train/feats_stats.npy
fi

if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
    CUDA_VISIBLE_DEVICES=${gpus} \
    FLAGS_cudnn_exhaustive_search=true \
    FLAGS_conv_workspace_size_limit=4000 \
    python ${BIN_DIR}/train.py \
        --train-metadata=dump_finetune/train/norm/metadata.jsonl \
        --dev-metadata=dump_finetune/dev/norm/metadata.jsonl \
        --config=conf/finetune.yaml \
        --output-dir=exp/finetune \
        --ngpu=1
fi
@ -0,0 +1,85 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import os
from operator import itemgetter
from pathlib import Path

import jsonlines
import numpy as np


def main():
    # parse config and args
    parser = argparse.ArgumentParser(
        description="Link wave files from the old dump dir and regenerate metadata.")

    parser.add_argument(
        "--old-dump-dir",
        default=None,
        type=str,
        help="directory of the original dumped feature files.")
    parser.add_argument(
        "--dump-dir",
        type=str,
        required=True,
        help="directory of the finetune dumped feature files.")
    args = parser.parse_args()

    old_dump_dir = Path(args.old_dump_dir).expanduser()
    old_dump_dir = old_dump_dir.resolve()
    dump_dir = Path(args.dump_dir).expanduser()
    # use absolute path
    dump_dir = dump_dir.resolve()
    dump_dir.mkdir(parents=True, exist_ok=True)

    assert old_dump_dir.is_dir()
    assert dump_dir.is_dir()

    for sub in ["train", "dev", "test"]:
        # symlink the *_wave.npy files in old_dump_dir to the corresponding
        # locations in dump_dir
        output_dir = dump_dir / sub
        output_dir.mkdir(parents=True, exist_ok=True)
        results = []
        for name in os.listdir(output_dir / "raw"):
            # e.g. 003918_feats.npy
            utt_id = name.split("_")[0]
            mel_path = output_dir / ("raw/" + name)
            gen_mel = np.load(mel_path)
            wave_name = utt_id + "_wave.npy"
            wav = np.load(old_dump_dir / sub / ("raw/" + wave_name))
            os.symlink(old_dump_dir / sub / ("raw/" + wave_name),
                       output_dir / ("raw/" + wave_name))
            num_sample = wav.shape[0]
            num_frames = gen_mel.shape[0]
            wav_path = output_dir / ("raw/" + wave_name)

            record = {
                "utt_id": utt_id,
                "num_samples": num_sample,
                "num_frames": num_frames,
                "feats": str(mel_path),
                "wave": str(wav_path),
            }
            results.append(record)

        results.sort(key=itemgetter("utt_id"))

        with jsonlines.open(output_dir / "raw/metadata.jsonl", 'w') as writer:
            for item in results:
                writer.write(item)


if __name__ == "__main__":
    main()
@ -0,0 +1,55 @@
#!/bin/bash

stage=0
stop_stage=100

config_path=$1

if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
    # get durations from MFA's result
    echo "Generate durations.txt from MFA results ..."
    python3 ${MAIN_ROOT}/utils/gen_duration_from_textgrid.py \
        --inputdir=./baker_alignment_tone \
        --output=durations.txt \
        --config=${config_path}
fi

if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
    # extract features
    echo "Extract features ..."
    python3 ${BIN_DIR}/../preprocess.py \
        --rootdir=~/datasets/BZNSYP/ \
        --dataset=baker \
        --dumpdir=dump \
        --dur-file=durations.txt \
        --config=${config_path} \
        --cut-sil=True \
        --num-cpu=20
fi

if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
    # get features' stats (mean and std)
    echo "Get features' stats ..."
    python3 ${MAIN_ROOT}/utils/compute_statistics.py \
        --metadata=dump/train/raw/metadata.jsonl \
        --field-name="feats"
fi

if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
    # normalize; dev and test should use train's stats
    echo "Normalize ..."
    python3 ${BIN_DIR}/../normalize.py \
        --metadata=dump/train/raw/metadata.jsonl \
        --dumpdir=dump/train/norm \
        --stats=dump/train/feats_stats.npy
    python3 ${BIN_DIR}/../normalize.py \
        --metadata=dump/dev/raw/metadata.jsonl \
        --dumpdir=dump/dev/norm \
        --stats=dump/train/feats_stats.npy

    python3 ${BIN_DIR}/../normalize.py \
        --metadata=dump/test/raw/metadata.jsonl \
        --dumpdir=dump/test/norm \
        --stats=dump/train/feats_stats.npy
fi
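Stage 3 of the preprocessing script normalizes the dev and test splits with the *training* split's statistics, so every split shares one scale and no statistics leak from held-out data. A minimal sketch (pure Python, illustrative numbers) of that z-score step:

```python
from math import sqrt

def zscore(xs, mu, std):
    """Standardize a feature column with a precomputed mean/std."""
    return [(x - mu) / std for x in xs]

train = [1.0, 3.0, 5.0]
mu = sum(train) / len(train)                                # train mean: 3.0
std = sqrt(sum((x - mu) ** 2 for x in train) / len(train))  # population std

# dev/test reuse the train-set mu/std instead of computing their own
dev = zscore([2.0, 3.0], mu, std)
```

In the real pipeline `compute_statistics.py` writes these stats to `dump/train/feats_stats.npy` once, and all three `normalize.py` calls read the same file.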
@ -0,0 +1,14 @@
#!/bin/bash

config_path=$1
train_output_path=$2
ckpt_name=$3

FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize.py \
    --config=${config_path} \
    --checkpoint=${train_output_path}/checkpoints/${ckpt_name} \
    --test-metadata=dump/test/norm/metadata.jsonl \
    --output-dir=${train_output_path}/test \
    --generator-type=hifigan
@ -0,0 +1,13 @@
#!/bin/bash

config_path=$1
train_output_path=$2

FLAGS_cudnn_exhaustive_search=true \
FLAGS_conv_workspace_size_limit=4000 \
python ${BIN_DIR}/train.py \
    --train-metadata=dump/train/norm/metadata.jsonl \
    --dev-metadata=dump/dev/norm/metadata.jsonl \
    --config=${config_path} \
    --output-dir=${train_output_path} \
    --ngpu=1
@ -0,0 +1,13 @@
#!/bin/bash
export MAIN_ROOT=`realpath ${PWD}/../../../`

export PATH=${MAIN_ROOT}:${MAIN_ROOT}/utils:${PATH}
export LC_ALL=C

export PYTHONDONTWRITEBYTECODE=1
# Use UTF-8 in Python to avoid UnicodeDecodeError when LC_ALL=C
export PYTHONIOENCODING=UTF-8
export PYTHONPATH=${MAIN_ROOT}:${PYTHONPATH}

MODEL=hifigan
export BIN_DIR=${MAIN_ROOT}/paddlespeech/t2s/exps/gan_vocoder/${MODEL}
@ -0,0 +1,32 @@
#!/bin/bash

set -e
source path.sh

gpus=0,1
stage=0
stop_stage=100

conf_path=conf/default.yaml
train_output_path=exp/default
ckpt_name=snapshot_iter_50000.pdz

# With the following options, you can choose the range of stages to run,
# e.g. `./run.sh --stage 0 --stop-stage 0`.
# These options cannot be mixed with the positional args `$1`, `$2` ...
source ${MAIN_ROOT}/utils/parse_options.sh || exit 1

if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
    # prepare data
    ./local/preprocess.sh ${conf_path} || exit -1
fi

if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
    # train model; all `ckpt` files are saved under the `train_output_path/checkpoints/` dir
    CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path} || exit -1
fi

if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
    # synthesize
    CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name} || exit -1
fi
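The stage gating in run.sh follows the pattern `[ ${stage} -le N ] && [ ${stop_stage} -ge N ]`: stage N executes iff `stage <= N <= stop_stage`. An illustrative Python rendering of that selection logic (names here are mine, not from the repo):

```python
def stages_to_run(stage, stop_stage,
                  names=("preprocess", "train", "synthesize")):
    """Return the stage names that run.sh would execute for this range."""
    return [name for n, name in enumerate(names) if stage <= n <= stop_stage]

# `./run.sh --stage 0 --stop-stage 0` runs only preprocessing:
print(stages_to_run(0, 0))
# the default stop_stage=100 simply means "run everything from `stage` on":
print(stages_to_run(1, 100))
```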
@ -1,135 +0,0 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import os
from pathlib import Path

import soundfile as sf
from paddle import inference

from paddlespeech.t2s.frontend.zh_frontend import Frontend


def main():
    parser = argparse.ArgumentParser(
        description="Paddle Inference with speedyspeech & parallel wavegan.")
    parser.add_argument(
        "--inference-dir", type=str, help="dir to save inference models")
    parser.add_argument(
        "--text",
        type=str,
        help="text to synthesize, a 'utt_id sentence' pair per line")
    parser.add_argument("--output-dir", type=str, help="output dir")
    parser.add_argument(
        "--enable-auto-log", action="store_true", help="use auto log")
    parser.add_argument(
        "--phones-dict",
        type=str,
        default="phones.txt",
        help="phone vocabulary file.")

    args, _ = parser.parse_known_args()

    frontend = Frontend(phone_vocab_path=args.phones_dict)
    print("frontend done!")

    fastspeech2_config = inference.Config(
        str(Path(args.inference_dir) / "fastspeech2.pdmodel"),
        str(Path(args.inference_dir) / "fastspeech2.pdiparams"))
    fastspeech2_config.enable_use_gpu(50, 0)
    # This line must stay commented out; otherwise it causes OOM
    # fastspeech2_config.enable_memory_optim()
    fastspeech2_predictor = inference.create_predictor(fastspeech2_config)

    pwg_config = inference.Config(
        str(Path(args.inference_dir) / "pwg.pdmodel"),
        str(Path(args.inference_dir) / "pwg.pdiparams"))
    pwg_config.enable_use_gpu(100, 0)
    pwg_config.enable_memory_optim()
    pwg_predictor = inference.create_predictor(pwg_config)

    if args.enable_auto_log:
        import auto_log
        os.makedirs("output", exist_ok=True)
        pid = os.getpid()
        logger = auto_log.AutoLogger(
            model_name="fastspeech2",
            model_precision='float32',
            batch_size=1,
            data_shape="dynamic",
            save_path="./output/auto_log.log",
            inference_config=fastspeech2_config,
            pids=pid,
            process_name=None,
            gpu_ids=0,
            time_keys=['preprocess_time', 'inference_time', 'postprocess_time'],
            warmup=0)

    output_dir = Path(args.output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    sentences = []

    with open(args.text, 'rt') as f:
        for line in f:
            items = line.strip().split()
            utt_id = items[0]
            sentence = "".join(items[1:])
            sentences.append((utt_id, sentence))

    for utt_id, sentence in sentences:
        if args.enable_auto_log:
            logger.times.start()
        input_ids = frontend.get_input_ids(sentence, merge_sentences=True)
        phone_ids = input_ids["phone_ids"]
        phones = phone_ids[0].numpy()

        if args.enable_auto_log:
            logger.times.stamp()

        input_names = fastspeech2_predictor.get_input_names()
        phones_handle = fastspeech2_predictor.get_input_handle(input_names[0])

        phones_handle.reshape(phones.shape)
        phones_handle.copy_from_cpu(phones)

        fastspeech2_predictor.run()
        output_names = fastspeech2_predictor.get_output_names()
        output_handle = fastspeech2_predictor.get_output_handle(output_names[0])
        output_data = output_handle.copy_to_cpu()

        input_names = pwg_predictor.get_input_names()
        mel_handle = pwg_predictor.get_input_handle(input_names[0])
        mel_handle.reshape(output_data.shape)
        mel_handle.copy_from_cpu(output_data)

        pwg_predictor.run()
        output_names = pwg_predictor.get_output_names()
        output_handle = pwg_predictor.get_output_handle(output_names[0])
        wav = output_data = output_handle.copy_to_cpu()

        if args.enable_auto_log:
            logger.times.stamp()

        sf.write(output_dir / (utt_id + ".wav"), wav, samplerate=24000)

        if args.enable_auto_log:
            logger.times.end(stamp=True)
        print(f"{utt_id} done!")

    if args.enable_auto_log:
        logger.report()


if __name__ == "__main__":
    main()
@ -1,178 +0,0 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import logging
from pathlib import Path

import numpy as np
import paddle
import soundfile as sf
import yaml
from yacs.config import CfgNode

from paddlespeech.t2s.frontend.zh_frontend import Frontend
from paddlespeech.t2s.models.fastspeech2 import FastSpeech2
from paddlespeech.t2s.models.fastspeech2 import FastSpeech2Inference
from paddlespeech.t2s.models.parallel_wavegan import PWGGenerator
from paddlespeech.t2s.models.parallel_wavegan import PWGInference
from paddlespeech.t2s.modules.normalizer import ZScore


def evaluate(args, fastspeech2_config, pwg_config):
    # the dataloader is too verbose
    logging.getLogger("DataLoader").disabled = True

    # construct dataset for evaluation
    sentences = []
    with open(args.text, 'rt') as f:
        for line in f:
            items = line.strip().split()
            utt_id = items[0]
            sentence = "".join(items[1:])
            sentences.append((utt_id, sentence))

    with open(args.phones_dict, "r") as f:
        phn_id = [line.strip().split() for line in f.readlines()]
    vocab_size = len(phn_id)
    print("vocab_size:", vocab_size)
    with open(args.speaker_dict, 'rt') as f:
        spk_id = [line.strip().split() for line in f.readlines()]
    spk_num = len(spk_id)
    print("spk_num:", spk_num)

    odim = fastspeech2_config.n_mels
    model = FastSpeech2(
        idim=vocab_size,
        odim=odim,
        spk_num=spk_num,
        **fastspeech2_config["model"])

    model.set_state_dict(
        paddle.load(args.fastspeech2_checkpoint)["main_params"])
    model.eval()

    vocoder = PWGGenerator(**pwg_config["generator_params"])
    vocoder.set_state_dict(paddle.load(args.pwg_checkpoint)["generator_params"])
    vocoder.remove_weight_norm()
    vocoder.eval()
    print("model done!")

    frontend = Frontend(phone_vocab_path=args.phones_dict)
    print("frontend done!")

    stat = np.load(args.fastspeech2_stat)
    mu, std = stat
    mu = paddle.to_tensor(mu)
    std = paddle.to_tensor(std)
    fastspeech2_normalizer = ZScore(mu, std)

    stat = np.load(args.pwg_stat)
    mu, std = stat
    mu = paddle.to_tensor(mu)
    std = paddle.to_tensor(std)
    pwg_normalizer = ZScore(mu, std)

    fastspeech2_inference = FastSpeech2Inference(fastspeech2_normalizer, model)
    pwg_inference = PWGInference(pwg_normalizer, vocoder)

    output_dir = Path(args.output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    # test the first 20 speakers on the first 2 sentences
    spk_ids = list(range(20))
    for spk_id in spk_ids:
        for utt_id, sentence in sentences[:2]:
            input_ids = frontend.get_input_ids(sentence, merge_sentences=True)
            phone_ids = input_ids["phone_ids"]
            flags = 0
            for part_phone_ids in phone_ids:
                with paddle.no_grad():
                    mel = fastspeech2_inference(
                        part_phone_ids, spk_id=paddle.to_tensor(spk_id))
                    temp_wav = pwg_inference(mel)
                if flags == 0:
                    wav = temp_wav
                    flags = 1
                else:
                    wav = paddle.concat([wav, temp_wav])
            sf.write(
                str(output_dir / (str(spk_id) + "_" + utt_id + ".wav")),
                wav.numpy(),
                samplerate=fastspeech2_config.fs)
            print(f"{spk_id}_{utt_id} done!")


def main():
    # parse args and config and redirect to train_sp
    parser = argparse.ArgumentParser(
        description="Synthesize with fastspeech2 & parallel wavegan.")
    parser.add_argument(
        "--fastspeech2-config", type=str, help="fastspeech2 config file.")
    parser.add_argument(
        "--fastspeech2-checkpoint",
        type=str,
        help="fastspeech2 checkpoint to load.")
    parser.add_argument(
        "--fastspeech2-stat",
        type=str,
        help="mean and standard deviation used to normalize spectrogram when training fastspeech2."
    )
    parser.add_argument(
        "--pwg-config", type=str, help="parallel wavegan config file.")
    parser.add_argument(
        "--pwg-checkpoint",
        type=str,
        help="parallel wavegan generator parameters to load.")
    parser.add_argument(
        "--pwg-stat",
        type=str,
        help="mean and standard deviation used to normalize spectrogram when training parallel wavegan."
    )
    parser.add_argument(
        "--phones-dict", type=str, default=None, help="phone vocabulary file.")
    parser.add_argument(
        "--speaker-dict", type=str, default=None, help="speaker id map file.")
    parser.add_argument(
        "--text",
        type=str,
        help="text to synthesize, a 'utt_id sentence' pair per line.")
    parser.add_argument("--output-dir", type=str, help="output dir.")
    parser.add_argument(
        "--ngpu", type=int, default=1, help="if ngpu == 0, use cpu.")
    parser.add_argument("--verbose", type=int, default=1, help="verbose.")

    args = parser.parse_args()

    if args.ngpu == 0:
        paddle.set_device("cpu")
    elif args.ngpu > 0:
        paddle.set_device("gpu")
    else:
        print("ngpu should >= 0 !")

    with open(args.fastspeech2_config) as f:
        fastspeech2_config = CfgNode(yaml.safe_load(f))
    with open(args.pwg_config) as f:
        pwg_config = CfgNode(yaml.safe_load(f))

    print("========Args========")
    print(yaml.safe_dump(vars(args)))
    print("========Config========")
    print(fastspeech2_config)
    print(pwg_config)

    evaluate(args, fastspeech2_config, pwg_config)


if __name__ == "__main__":
    main()
@ -1,175 +0,0 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import logging
from pathlib import Path

import numpy as np
import paddle
import soundfile as sf
import yaml
from yacs.config import CfgNode

from paddlespeech.t2s.frontend import English
from paddlespeech.t2s.models.fastspeech2 import FastSpeech2
from paddlespeech.t2s.models.fastspeech2 import FastSpeech2Inference
from paddlespeech.t2s.models.parallel_wavegan import PWGGenerator
from paddlespeech.t2s.models.parallel_wavegan import PWGInference
from paddlespeech.t2s.modules.normalizer import ZScore


def evaluate(args, fastspeech2_config, pwg_config):
    # the dataloader is too verbose
    logging.getLogger("DataLoader").disabled = True

    # construct dataset for evaluation
    sentences = []
    with open(args.text, 'rt') as f:
        for line in f:
            line_list = line.strip().split()
            utt_id = line_list[0]
            sentence = " ".join(line_list[1:])
            sentences.append((utt_id, sentence))

    with open(args.phones_dict, "r") as f:
        phn_id = [line.strip().split() for line in f.readlines()]
    vocab_size = len(phn_id)
    phone_id_map = {}
    for phn, id in phn_id:
        phone_id_map[phn] = int(id)
    print("vocab_size:", vocab_size)
    with open(args.speaker_dict, 'rt') as f:
        spk_id = [line.strip().split() for line in f.readlines()]
    spk_num = len(spk_id)
    print("spk_num:", spk_num)

    odim = fastspeech2_config.n_mels
    model = FastSpeech2(
        idim=vocab_size,
        odim=odim,
        spk_num=spk_num,
        **fastspeech2_config["model"])

    model.set_state_dict(
        paddle.load(args.fastspeech2_checkpoint)["main_params"])
    model.eval()

    vocoder = PWGGenerator(**pwg_config["generator_params"])
    vocoder.set_state_dict(paddle.load(args.pwg_checkpoint)["generator_params"])
    vocoder.remove_weight_norm()
    vocoder.eval()
    print("model done!")

    frontend = English(phone_vocab_path=args.phones_dict)
    print("frontend done!")

    stat = np.load(args.fastspeech2_stat)
    mu, std = stat
    mu = paddle.to_tensor(mu)
    std = paddle.to_tensor(std)
    fastspeech2_normalizer = ZScore(mu, std)

    stat = np.load(args.pwg_stat)
    mu, std = stat
    mu = paddle.to_tensor(mu)
    std = paddle.to_tensor(std)
    pwg_normalizer = ZScore(mu, std)

    fastspeech2_inference = FastSpeech2Inference(fastspeech2_normalizer, model)
    pwg_inference = PWGInference(pwg_normalizer, vocoder)

    output_dir = Path(args.output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    # only test speaker 0
    spk_id = 0
    for utt_id, sentence in sentences:
        input_ids = frontend.get_input_ids(sentence)
        phone_ids = input_ids["phone_ids"]

        with paddle.no_grad():
            mel = fastspeech2_inference(
                phone_ids, spk_id=paddle.to_tensor(spk_id))
            wav = pwg_inference(mel)

        sf.write(
            str(output_dir / (str(spk_id) + "_" + utt_id + ".wav")),
            wav.numpy(),
            samplerate=fastspeech2_config.fs)
        print(f"{spk_id}_{utt_id} done!")


def main():
    # parse args and config and redirect to train_sp
    parser = argparse.ArgumentParser(
        description="Synthesize with fastspeech2 & parallel wavegan.")
    parser.add_argument(
        "--fastspeech2-config", type=str, help="fastspeech2 config file.")
    parser.add_argument(
        "--fastspeech2-checkpoint",
        type=str,
        help="fastspeech2 checkpoint to load.")
    parser.add_argument(
        "--fastspeech2-stat",
        type=str,
        help="mean and standard deviation used to normalize spectrogram when training fastspeech2."
    )
    parser.add_argument(
        "--pwg-config", type=str, help="parallel wavegan config file.")
    parser.add_argument(
        "--pwg-checkpoint",
        type=str,
        help="parallel wavegan generator parameters to load.")
    parser.add_argument(
        "--pwg-stat",
        type=str,
        help="mean and standard deviation used to normalize spectrogram when training parallel wavegan."
    )
    parser.add_argument(
        "--phones-dict", type=str, default=None, help="phone vocabulary file.")
    parser.add_argument(
        "--speaker-dict", type=str, default=None, help="speaker id map file.")
    parser.add_argument(
        "--text",
        type=str,
        help="text to synthesize, a 'utt_id sentence' pair per line.")
    parser.add_argument("--output-dir", type=str, help="output dir.")
    parser.add_argument(
        "--ngpu", type=int, default=1, help="if ngpu == 0, use cpu.")
    parser.add_argument("--verbose", type=int, default=1, help="verbose.")

    args = parser.parse_args()

    if args.ngpu == 0:
        paddle.set_device("cpu")
    elif args.ngpu > 0:
        paddle.set_device("gpu")
    else:
        print("ngpu should >= 0 !")

    with open(args.fastspeech2_config) as f:
        fastspeech2_config = CfgNode(yaml.safe_load(f))
    with open(args.pwg_config) as f:
        pwg_config = CfgNode(yaml.safe_load(f))

    print("========Args========")
    print(yaml.safe_dump(vars(args)))
    print("========Config========")
    print(fastspeech2_config)
    print(pwg_config)

    evaluate(args, fastspeech2_config, pwg_config)


if __name__ == "__main__":
    main()
@ -1,189 +0,0 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import logging
from pathlib import Path

import jsonlines
import numpy as np
import paddle
import soundfile as sf
import yaml
from yacs.config import CfgNode

from paddlespeech.t2s.datasets.data_table import DataTable
from paddlespeech.t2s.models.fastspeech2 import FastSpeech2
from paddlespeech.t2s.models.fastspeech2 import FastSpeech2Inference
from paddlespeech.t2s.models.parallel_wavegan import PWGGenerator
from paddlespeech.t2s.models.parallel_wavegan import PWGInference
from paddlespeech.t2s.modules.normalizer import ZScore


def evaluate(args, fastspeech2_config, pwg_config):
    # the dataloader is too verbose
    logging.getLogger("DataLoader").disabled = True

    # construct dataset for evaluation
    with jsonlines.open(args.test_metadata, 'r') as reader:
        test_metadata = list(reader)

    fields = ["utt_id", "text"]

    spk_num = None
    if args.speaker_dict is not None:
        print("multiple speaker fastspeech2!")
        with open(args.speaker_dict, 'rt') as f:
            spk_id = [line.strip().split() for line in f.readlines()]
        spk_num = len(spk_id)
        fields += ["spk_id"]
    elif args.voice_cloning:
        print("voice cloning!")
        fields += ["spk_emb"]
    else:
        print("single speaker fastspeech2!")
    print("spk_num:", spk_num)

    test_dataset = DataTable(data=test_metadata, fields=fields)

    odim = fastspeech2_config.n_mels
    with open(args.phones_dict, "r") as f:
        phn_id = [line.strip().split() for line in f.readlines()]
    vocab_size = len(phn_id)
    print("vocab_size:", vocab_size)

    model = FastSpeech2(
        idim=vocab_size,
        odim=odim,
        spk_num=spk_num,
        **fastspeech2_config["model"])

    model.set_state_dict(
        paddle.load(args.fastspeech2_checkpoint)["main_params"])
    model.eval()

    vocoder = PWGGenerator(**pwg_config["generator_params"])
    vocoder.set_state_dict(paddle.load(args.pwg_checkpoint)["generator_params"])
    vocoder.remove_weight_norm()
    vocoder.eval()
    print("model done!")

    stat = np.load(args.fastspeech2_stat)
    mu, std = stat
    mu = paddle.to_tensor(mu)
    std = paddle.to_tensor(std)
    fastspeech2_normalizer = ZScore(mu, std)

    stat = np.load(args.pwg_stat)
    mu, std = stat
    mu = paddle.to_tensor(mu)
    std = paddle.to_tensor(std)
    pwg_normalizer = ZScore(mu, std)

    fastspeech2_inference = FastSpeech2Inference(fastspeech2_normalizer, model)
    pwg_inference = PWGInference(pwg_normalizer, vocoder)

    output_dir = Path(args.output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    for datum in test_dataset:
        utt_id = datum["utt_id"]
        text = paddle.to_tensor(datum["text"])
        spk_emb = None
        spk_id = None
        if args.voice_cloning and "spk_emb" in datum:
            spk_emb = paddle.to_tensor(np.load(datum["spk_emb"]))
        elif "spk_id" in datum:
            spk_id = paddle.to_tensor(datum["spk_id"])
        with paddle.no_grad():
            wav = pwg_inference(
                fastspeech2_inference(text, spk_id=spk_id, spk_emb=spk_emb))
        sf.write(
            str(output_dir / (utt_id + ".wav")),
            wav.numpy(),
            samplerate=fastspeech2_config.fs)
        print(f"{utt_id} done!")


def main():
    # parse args and config and redirect to train_sp
    parser = argparse.ArgumentParser(
        description="Synthesize with fastspeech2 & parallel wavegan.")
    parser.add_argument(
        "--fastspeech2-config", type=str, help="fastspeech2 config file.")
    parser.add_argument(
        "--fastspeech2-checkpoint",
        type=str,
        help="fastspeech2 checkpoint to load.")
    parser.add_argument(
        "--fastspeech2-stat",
        type=str,
        help="mean and standard deviation used to normalize spectrogram when training fastspeech2."
    )
    parser.add_argument(
        "--pwg-config", type=str, help="parallel wavegan config file.")
    parser.add_argument(
        "--pwg-checkpoint",
        type=str,
        help="parallel wavegan generator parameters to load.")
    parser.add_argument(
        "--pwg-stat",
        type=str,
        help="mean and standard deviation used to normalize spectrogram when training parallel wavegan."
    )
    parser.add_argument(
        "--phones-dict", type=str, default=None, help="phone vocabulary file.")
    parser.add_argument(
        "--speaker-dict",
        type=str,
        default=None,
        help="speaker id map file for multiple speaker model.")

    def str2bool(s):
        return s.lower() == 'true'

    parser.add_argument(
        "--voice-cloning",
        type=str2bool,
        default=False,
        help="whether training voice cloning model.")
    parser.add_argument("--test-metadata", type=str, help="test metadata.")
    parser.add_argument("--output-dir", type=str, help="output dir.")
    parser.add_argument(
        "--ngpu", type=int, default=1, help="if ngpu == 0, use cpu.")
    parser.add_argument("--verbose", type=int, default=1, help="verbose.")

    args = parser.parse_args()
    if args.ngpu == 0:
        paddle.set_device("cpu")
    elif args.ngpu > 0:
        paddle.set_device("gpu")
    else:
        print("ngpu should >= 0 !")

    with open(args.fastspeech2_config) as f:
        fastspeech2_config = CfgNode(yaml.safe_load(f))
    with open(args.pwg_config) as f:
        pwg_config = CfgNode(yaml.safe_load(f))

    print("========Args========")
    print(yaml.safe_dump(vars(args)))
    print("========Config========")
    print(fastspeech2_config)
    print(pwg_config)

    evaluate(args, fastspeech2_config, pwg_config)


if __name__ == "__main__":
    main()
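The `ZScore` objects above wrap the train-set mean/std around the models: the acoustic model works in normalized mel space, and the inference wrappers map its output back to the original scale before the vocoder consumes it. A hedged sketch of that role (this class is illustrative, not paddlespeech's implementation):

```python
class SimpleZScore:
    """Toy scalar z-score normalizer mirroring the normalize/denormalize pair."""

    def __init__(self, mu, sigma):
        self.mu, self.sigma = mu, sigma

    def forward(self, x):
        # normalize: original scale -> model space
        return [(v - self.mu) / self.sigma for v in x]

    def inverse(self, x):
        # denormalize: model space -> original scale
        return [v * self.sigma + self.mu for v in x]

norm = SimpleZScore(mu=2.0, sigma=0.5)
mel = norm.inverse([0.0, 2.0])   # -> [2.0, 3.0]
```

This is why both a `fastspeech2_stat` and a `pwg_stat` file are loaded: each model was trained against its own feature statistics.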
@@ -1,187 +0,0 @@

```python
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import logging
import os
from pathlib import Path

import numpy as np
import paddle
import soundfile as sf
import yaml
from paddle import jit
from paddle.static import InputSpec
from yacs.config import CfgNode

from paddlespeech.t2s.frontend.zh_frontend import Frontend
from paddlespeech.t2s.models.fastspeech2 import FastSpeech2
from paddlespeech.t2s.models.fastspeech2 import FastSpeech2Inference
from paddlespeech.t2s.models.parallel_wavegan import PWGGenerator
from paddlespeech.t2s.models.parallel_wavegan import PWGInference
from paddlespeech.t2s.modules.normalizer import ZScore


def evaluate(args, fastspeech2_config, pwg_config):
    # dataloader has been too verbose
    logging.getLogger("DataLoader").disabled = True

    # construct dataset for evaluation
    sentences = []
    with open(args.text, 'rt') as f:
        for line in f:
            items = line.strip().split()
            utt_id = items[0]
            sentence = "".join(items[1:])
            sentences.append((utt_id, sentence))

    with open(args.phones_dict, "r") as f:
        phn_id = [line.strip().split() for line in f.readlines()]
    vocab_size = len(phn_id)
    print("vocab_size:", vocab_size)
    odim = fastspeech2_config.n_mels
    model = FastSpeech2(
        idim=vocab_size, odim=odim, **fastspeech2_config["model"])

    model.set_state_dict(
        paddle.load(args.fastspeech2_checkpoint)["main_params"])
    model.eval()

    vocoder = PWGGenerator(**pwg_config["generator_params"])
    vocoder.set_state_dict(paddle.load(args.pwg_checkpoint)["generator_params"])
    vocoder.remove_weight_norm()
    vocoder.eval()
    print("model done!")

    frontend = Frontend(phone_vocab_path=args.phones_dict)
    print("frontend done!")

    stat = np.load(args.fastspeech2_stat)
    mu, std = stat
    mu = paddle.to_tensor(mu)
    std = paddle.to_tensor(std)
    fastspeech2_normalizer = ZScore(mu, std)

    stat = np.load(args.pwg_stat)
    mu, std = stat
    mu = paddle.to_tensor(mu)
    std = paddle.to_tensor(std)
    pwg_normalizer = ZScore(mu, std)

    fastspeech2_inference = FastSpeech2Inference(fastspeech2_normalizer, model)
    fastspeech2_inference.eval()
    fastspeech2_inference = jit.to_static(
        fastspeech2_inference, input_spec=[InputSpec([-1], dtype=paddle.int64)])
    paddle.jit.save(fastspeech2_inference,
                    os.path.join(args.inference_dir, "fastspeech2"))
    fastspeech2_inference = paddle.jit.load(
        os.path.join(args.inference_dir, "fastspeech2"))
    pwg_inference = PWGInference(pwg_normalizer, vocoder)
    pwg_inference.eval()
    pwg_inference = jit.to_static(
        pwg_inference, input_spec=[
            InputSpec([-1, 80], dtype=paddle.float32),
        ])
    paddle.jit.save(pwg_inference, os.path.join(args.inference_dir, "pwg"))
    pwg_inference = paddle.jit.load(os.path.join(args.inference_dir, "pwg"))

    output_dir = Path(args.output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    for utt_id, sentence in sentences:
        input_ids = frontend.get_input_ids(sentence, merge_sentences=True)
        phone_ids = input_ids["phone_ids"]
        flags = 0
        for part_phone_ids in phone_ids:
            with paddle.no_grad():
                mel = fastspeech2_inference(part_phone_ids)
                temp_wav = pwg_inference(mel)
            if flags == 0:
                wav = temp_wav
                flags = 1
            else:
                wav = paddle.concat([wav, temp_wav])
        sf.write(
            str(output_dir / (utt_id + ".wav")),
            wav.numpy(),
            samplerate=fastspeech2_config.fs)
        print(f"{utt_id} done!")


def main():
    # parse args and config and redirect to train_sp
    parser = argparse.ArgumentParser(
        description="Synthesize with fastspeech2 & parallel wavegan.")
    parser.add_argument(
        "--fastspeech2-config", type=str, help="fastspeech2 config file.")
    parser.add_argument(
        "--fastspeech2-checkpoint",
        type=str,
        help="fastspeech2 checkpoint to load.")
    parser.add_argument(
        "--fastspeech2-stat",
        type=str,
        help="mean and standard deviation used to normalize spectrogram when training fastspeech2."
    )
    parser.add_argument(
        "--pwg-config", type=str, help="parallel wavegan config file.")
    parser.add_argument(
        "--pwg-checkpoint",
        type=str,
        help="parallel wavegan generator parameters to load.")
    parser.add_argument(
        "--pwg-stat",
        type=str,
        help="mean and standard deviation used to normalize spectrogram when training parallel wavegan."
    )
    parser.add_argument(
        "--phones-dict",
        type=str,
        default="phone_id_map.txt",
        help="phone vocabulary file.")
    parser.add_argument(
        "--text",
        type=str,
        help="text to synthesize, a 'utt_id sentence' pair per line.")
    parser.add_argument("--output-dir", type=str, help="output dir.")
    parser.add_argument(
        "--inference-dir", type=str, help="dir to save inference models")
    parser.add_argument(
        "--ngpu", type=int, default=1, help="if ngpu == 0, use cpu.")
    parser.add_argument("--verbose", type=int, default=1, help="verbose.")

    args = parser.parse_args()

    if args.ngpu == 0:
        paddle.set_device("cpu")
    elif args.ngpu > 0:
        paddle.set_device("gpu")
    else:
        print("ngpu should >= 0 !")

    with open(args.fastspeech2_config) as f:
        fastspeech2_config = CfgNode(yaml.safe_load(f))
    with open(args.pwg_config) as f:
        pwg_config = CfgNode(yaml.safe_load(f))

    print("========Args========")
    print(yaml.safe_dump(vars(args)))
    print("========Config========")
    print(fastspeech2_config)
    print(pwg_config)

    evaluate(args, fastspeech2_config, pwg_config)


if __name__ == "__main__":
    main()
```
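The per-sentence loop above stitches waveform chunks together with a `flags` variable; the same accumulation can be sketched more idiomatically by collecting the chunks in a list and concatenating once (a sketch with NumPy standing in for Paddle tensors, and a hypothetical `vocode` callable in place of the acoustic model + vocoder pair):

```python
import numpy as np

def synthesize_utterance(part_phone_ids, vocode):
    # `vocode` is a hypothetical stand-in for the fastspeech2 + vocoder
    # pipeline: it maps one sentence's phone ids to a waveform chunk.
    chunks = [vocode(ids) for ids in part_phone_ids]
    return np.concatenate(chunks) if chunks else np.zeros(0, dtype=np.float32)

# toy vocoder: each phone id becomes two identical samples
fake_vocode = lambda ids: np.repeat(np.asarray(ids, dtype=np.float32), 2)
wav = synthesize_utterance([[1, 2], [3]], fake_vocode)
# wav now holds the concatenated waveform for the whole utterance
```

Concatenating once at the end avoids reallocating the growing tensor on every sentence, which the `flags` pattern does.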
@@ -1,166 +0,0 @@

```python
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import logging
from pathlib import Path

import numpy as np
import paddle
import soundfile as sf
import yaml
from yacs.config import CfgNode

from paddlespeech.t2s.frontend import English
from paddlespeech.t2s.models.fastspeech2 import FastSpeech2
from paddlespeech.t2s.models.fastspeech2 import FastSpeech2Inference
from paddlespeech.t2s.models.parallel_wavegan import PWGGenerator
from paddlespeech.t2s.models.parallel_wavegan import PWGInference
from paddlespeech.t2s.modules.normalizer import ZScore


def evaluate(args, fastspeech2_config, pwg_config):
    # dataloader has been too verbose
    logging.getLogger("DataLoader").disabled = True

    # construct dataset for evaluation
    sentences = []
    with open(args.text, 'rt') as f:
        for line in f:
            line_list = line.strip().split()
            utt_id = line_list[0]
            sentence = " ".join(line_list[1:])
            sentences.append((utt_id, sentence))

    with open(args.phones_dict, "r") as f:
        phn_id = [line.strip().split() for line in f.readlines()]
    vocab_size = len(phn_id)
    phone_id_map = {}
    for phn, id in phn_id:
        phone_id_map[phn] = int(id)
    print("vocab_size:", vocab_size)
    odim = fastspeech2_config.n_mels
    model = FastSpeech2(
        idim=vocab_size, odim=odim, **fastspeech2_config["model"])

    model.set_state_dict(
        paddle.load(args.fastspeech2_checkpoint)["main_params"])
    model.eval()

    vocoder = PWGGenerator(**pwg_config["generator_params"])
    vocoder.set_state_dict(paddle.load(args.pwg_checkpoint)["generator_params"])
    vocoder.remove_weight_norm()
    vocoder.eval()
    print("model done!")

    frontend = English(phone_vocab_path=args.phones_dict)
    print("frontend done!")

    stat = np.load(args.fastspeech2_stat)
    mu, std = stat
    mu = paddle.to_tensor(mu)
    std = paddle.to_tensor(std)
    fastspeech2_normalizer = ZScore(mu, std)

    stat = np.load(args.pwg_stat)
    mu, std = stat
    mu = paddle.to_tensor(mu)
    std = paddle.to_tensor(std)
    pwg_normalizer = ZScore(mu, std)

    fastspeech2_inference = FastSpeech2Inference(fastspeech2_normalizer, model)
    pwg_inference = PWGInference(pwg_normalizer, vocoder)

    output_dir = Path(args.output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    for utt_id, sentence in sentences:
        input_ids = frontend.get_input_ids(sentence)
        phone_ids = input_ids["phone_ids"]

        with paddle.no_grad():
            mel = fastspeech2_inference(phone_ids)
            wav = pwg_inference(mel)

        sf.write(
            str(output_dir / (utt_id + ".wav")),
            wav.numpy(),
            samplerate=fastspeech2_config.fs)
        print(f"{utt_id} done!")


def main():
    # parse args and config and redirect to train_sp
    parser = argparse.ArgumentParser(
        description="Synthesize with fastspeech2 & parallel wavegan.")
    parser.add_argument(
        "--fastspeech2-config", type=str, help="fastspeech2 config file.")
    parser.add_argument(
        "--fastspeech2-checkpoint",
        type=str,
        help="fastspeech2 checkpoint to load.")
    parser.add_argument(
        "--fastspeech2-stat",
        type=str,
        help="mean and standard deviation used to normalize spectrogram when training fastspeech2."
    )
    parser.add_argument(
        "--pwg-config", type=str, help="parallel wavegan config file.")
    parser.add_argument(
        "--pwg-checkpoint",
        type=str,
        help="parallel wavegan generator parameters to load.")
    parser.add_argument(
        "--pwg-stat",
        type=str,
        help="mean and standard deviation used to normalize spectrogram when training parallel wavegan."
    )
    parser.add_argument(
        "--phones-dict",
        type=str,
        default="phone_id_map.txt",
        help="phone vocabulary file.")
    parser.add_argument(
        "--text",
        type=str,
        help="text to synthesize, a 'utt_id sentence' pair per line.")
    parser.add_argument("--output-dir", type=str, help="output dir.")
    parser.add_argument(
        "--ngpu", type=int, default=1, help="if ngpu == 0, use cpu.")
    parser.add_argument("--verbose", type=int, default=1, help="verbose.")

    args = parser.parse_args()

    if args.ngpu == 0:
        paddle.set_device("cpu")
    elif args.ngpu > 0:
        paddle.set_device("gpu")
    else:
        print("ngpu should >= 0 !")

    with open(args.fastspeech2_config) as f:
        fastspeech2_config = CfgNode(yaml.safe_load(f))
    with open(args.pwg_config) as f:
        pwg_config = CfgNode(yaml.safe_load(f))

    print("========Args========")
    print(yaml.safe_dump(vars(args)))
    print("========Config========")
    print(fastspeech2_config)
    print(pwg_config)

    evaluate(args, fastspeech2_config, pwg_config)


if __name__ == "__main__":
    main()
```
@@ -1,192 +0,0 @@

```python
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import logging
import os
from pathlib import Path

import numpy as np
import paddle
import soundfile as sf
import yaml
from paddle import jit
from paddle.static import InputSpec
from yacs.config import CfgNode

from paddlespeech.t2s.frontend.zh_frontend import Frontend
from paddlespeech.t2s.models.fastspeech2 import FastSpeech2
from paddlespeech.t2s.models.fastspeech2 import FastSpeech2Inference
from paddlespeech.t2s.models.melgan import MelGANGenerator
from paddlespeech.t2s.models.melgan import MelGANInference
from paddlespeech.t2s.modules.normalizer import ZScore


def evaluate(args, fastspeech2_config, melgan_config):
    # dataloader has been too verbose
    logging.getLogger("DataLoader").disabled = True

    # construct dataset for evaluation
    sentences = []
    with open(args.text, 'rt') as f:
        for line in f:
            items = line.strip().split()
            utt_id = items[0]
            sentence = "".join(items[1:])
            sentences.append((utt_id, sentence))

    with open(args.phones_dict, "r") as f:
        phn_id = [line.strip().split() for line in f.readlines()]
    vocab_size = len(phn_id)
    print("vocab_size:", vocab_size)
    odim = fastspeech2_config.n_mels
    model = FastSpeech2(
        idim=vocab_size, odim=odim, **fastspeech2_config["model"])

    model.set_state_dict(
        paddle.load(args.fastspeech2_checkpoint)["main_params"])
    model.eval()

    vocoder = MelGANGenerator(**melgan_config["generator_params"])
    vocoder.set_state_dict(
        paddle.load(args.melgan_checkpoint)["generator_params"])
    vocoder.remove_weight_norm()
    vocoder.eval()
    print("model done!")

    frontend = Frontend(phone_vocab_path=args.phones_dict)
    print("frontend done!")

    stat = np.load(args.fastspeech2_stat)
    mu, std = stat
    mu = paddle.to_tensor(mu)
    std = paddle.to_tensor(std)
    fastspeech2_normalizer = ZScore(mu, std)

    stat = np.load(args.melgan_stat)
    mu, std = stat
    mu = paddle.to_tensor(mu)
    std = paddle.to_tensor(std)
    melgan_normalizer = ZScore(mu, std)

    fastspeech2_inference = FastSpeech2Inference(fastspeech2_normalizer, model)
    fastspeech2_inference.eval()
    fastspeech2_inference = jit.to_static(
        fastspeech2_inference, input_spec=[InputSpec([-1], dtype=paddle.int64)])
    paddle.jit.save(fastspeech2_inference,
                    os.path.join(args.inference_dir, "fastspeech2"))
    fastspeech2_inference = paddle.jit.load(
        os.path.join(args.inference_dir, "fastspeech2"))

    mb_melgan_inference = MelGANInference(melgan_normalizer, vocoder)
    mb_melgan_inference.eval()
    mb_melgan_inference = jit.to_static(
        mb_melgan_inference,
        input_spec=[
            InputSpec([-1, 80], dtype=paddle.float32),
        ])
    paddle.jit.save(mb_melgan_inference,
                    os.path.join(args.inference_dir, "mb_melgan"))
    mb_melgan_inference = paddle.jit.load(
        os.path.join(args.inference_dir, "mb_melgan"))

    output_dir = Path(args.output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    for utt_id, sentence in sentences:
        input_ids = frontend.get_input_ids(sentence, merge_sentences=True)
        phone_ids = input_ids["phone_ids"]
        flags = 0
        for part_phone_ids in phone_ids:
            with paddle.no_grad():
                mel = fastspeech2_inference(part_phone_ids)
                temp_wav = mb_melgan_inference(mel)
            if flags == 0:
                wav = temp_wav
                flags = 1
            else:
                wav = paddle.concat([wav, temp_wav])
        sf.write(
            str(output_dir / (utt_id + ".wav")),
            wav.numpy(),
            samplerate=fastspeech2_config.fs)
        print(f"{utt_id} done!")


def main():
    # parse args and config and redirect to train_sp
    parser = argparse.ArgumentParser(
        description="Synthesize with fastspeech2 & multi band melgan.")
    parser.add_argument(
        "--fastspeech2-config", type=str, help="fastspeech2 config file.")
    parser.add_argument(
        "--fastspeech2-checkpoint",
        type=str,
        help="fastspeech2 checkpoint to load.")
    parser.add_argument(
        "--fastspeech2-stat",
        type=str,
        help="mean and standard deviation used to normalize spectrogram when training fastspeech2."
    )
    parser.add_argument(
        "--melgan-config", type=str, help="multi band melgan config file.")
    parser.add_argument(
        "--melgan-checkpoint",
        type=str,
        help="multi band melgan generator parameters to load.")
    parser.add_argument(
        "--melgan-stat",
        type=str,
        help="mean and standard deviation used to normalize spectrogram when training multi band melgan."
    )
    parser.add_argument(
        "--phones-dict",
        type=str,
        default="phone_id_map.txt",
        help="phone vocabulary file.")
    parser.add_argument(
        "--text",
        type=str,
        help="text to synthesize, a 'utt_id sentence' pair per line.")
    parser.add_argument("--output-dir", type=str, help="output dir.")
    parser.add_argument(
        "--inference-dir", type=str, help="dir to save inference models")
    parser.add_argument(
        "--ngpu", type=int, default=1, help="if ngpu == 0, use cpu.")
    parser.add_argument("--verbose", type=int, default=1, help="verbose.")

    args = parser.parse_args()

    if args.ngpu == 0:
        paddle.set_device("cpu")
    elif args.ngpu > 0:
        paddle.set_device("gpu")
    else:
        print("ngpu should >= 0 !")

    with open(args.fastspeech2_config) as f:
        fastspeech2_config = CfgNode(yaml.safe_load(f))
    with open(args.melgan_config) as f:
        melgan_config = CfgNode(yaml.safe_load(f))

    print("========Args========")
    print(yaml.safe_dump(vars(args)))
    print("========Config========")
    print(fastspeech2_config)
    print(melgan_config)

    evaluate(args, fastspeech2_config, melgan_config)


if __name__ == "__main__":
    main()
```
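All of the synthesis scripts above build `ZScore` normalizers from saved mean/std statistics (`*_stat.npy`, a stacked `[mu, std]` array). A minimal NumPy sketch of the transform these normalizers are assumed to compute (not the actual code of the Paddle `ZScore` module):

```python
import numpy as np

def zscore(x, mu, std):
    # normalize features to zero mean, unit variance per dimension
    return (x - mu) / std

def inv_zscore(x, mu, std):
    # map normalized features back to the original scale
    return x * std + mu

# stacked [mu, std] rows, mirroring what np.load(args.*_stat) returns
stats = np.array([[2.0, 10.0], [0.5, 4.0]])
mu, std = stats
feats = np.array([[2.5, 14.0], [1.5, 6.0]])
norm = zscore(feats, mu, std)
```

The acoustic model predicts in the normalized space, so the inference wrappers denormalize with the same statistics before handing mels to the vocoder.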
@@ -0,0 +1,13 @@

```python
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
```
@ -0,0 +1,277 @@
|
|||||||
|
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
import argparse
|
||||||
|
import logging
|
||||||
|
import os
|
||||||
|
import shutil
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
import jsonlines
|
||||||
|
import numpy as np
|
||||||
|
import paddle
|
||||||
|
import yaml
|
||||||
|
from paddle import DataParallel
|
||||||
|
from paddle import distributed as dist
|
||||||
|
from paddle import nn
|
||||||
|
from paddle.io import DataLoader
|
||||||
|
from paddle.io import DistributedBatchSampler
|
||||||
|
from paddle.optimizer import Adam
|
||||||
|
from paddle.optimizer.lr import MultiStepDecay
|
||||||
|
from yacs.config import CfgNode
|
||||||
|
|
||||||
|
from paddlespeech.t2s.datasets.data_table import DataTable
|
||||||
|
from paddlespeech.t2s.datasets.vocoder_batch_fn import Clip
|
||||||
|
from paddlespeech.t2s.models.hifigan import HiFiGANEvaluator
|
||||||
|
from paddlespeech.t2s.models.hifigan import HiFiGANGenerator
|
||||||
|
from paddlespeech.t2s.models.hifigan import HiFiGANMultiScaleMultiPeriodDiscriminator
|
||||||
|
from paddlespeech.t2s.models.hifigan import HiFiGANUpdater
|
||||||
|
from paddlespeech.t2s.modules.losses import DiscriminatorAdversarialLoss
|
||||||
|
from paddlespeech.t2s.modules.losses import FeatureMatchLoss
|
||||||
|
from paddlespeech.t2s.modules.losses import GeneratorAdversarialLoss
|
||||||
|
from paddlespeech.t2s.modules.losses import MelSpectrogramLoss
|
||||||
|
from paddlespeech.t2s.training.extensions.snapshot import Snapshot
|
||||||
|
from paddlespeech.t2s.training.extensions.visualizer import VisualDL
|
||||||
|
from paddlespeech.t2s.training.seeding import seed_everything
|
||||||
|
from paddlespeech.t2s.training.trainer import Trainer
|
||||||
|
|
||||||
|
|
||||||
|
def train_sp(args, config):
|
||||||
|
# decides device type and whether to run in parallel
|
||||||
|
# setup running environment correctly
|
||||||
|
world_size = paddle.distributed.get_world_size()
|
||||||
|
if (not paddle.is_compiled_with_cuda()) or args.ngpu == 0:
|
||||||
|
paddle.set_device("cpu")
|
||||||
|
else:
|
||||||
|
paddle.set_device("gpu")
|
||||||
|
if world_size > 1:
|
||||||
|
paddle.distributed.init_parallel_env()
|
||||||
|
|
||||||
|
# set the random seed, it is a must for multiprocess training
|
||||||
|
seed_everything(config.seed)
|
||||||
|
|
||||||
|
print(
|
||||||
|
f"rank: {dist.get_rank()}, pid: {os.getpid()}, parent_pid: {os.getppid()}",
|
||||||
|
)
|
||||||
|
|
||||||
|
# dataloader has been too verbose
|
||||||
|
logging.getLogger("DataLoader").disabled = True
|
||||||
|
|
||||||
|
# construct dataset for training and validation
|
||||||
|
with jsonlines.open(args.train_metadata, 'r') as reader:
|
||||||
|
train_metadata = list(reader)
|
||||||
|
train_dataset = DataTable(
|
||||||
|
data=train_metadata,
|
||||||
|
fields=["wave", "feats"],
|
||||||
|
converters={
|
||||||
|
"wave": np.load,
|
||||||
|
"feats": np.load,
|
||||||
|
}, )
|
||||||
|
with jsonlines.open(args.dev_metadata, 'r') as reader:
|
||||||
|
dev_metadata = list(reader)
|
||||||
|
dev_dataset = DataTable(
|
||||||
|
data=dev_metadata,
|
||||||
|
fields=["wave", "feats"],
|
||||||
|
converters={
|
||||||
|
"wave": np.load,
|
||||||
|
"feats": np.load,
|
||||||
|
}, )
|
||||||
|
|
||||||
|
# collate function and dataloader
|
||||||
|
train_sampler = DistributedBatchSampler(
|
||||||
|
train_dataset,
|
||||||
|
batch_size=config.batch_size,
|
||||||
|
shuffle=True,
|
||||||
|
drop_last=True)
|
||||||
|
dev_sampler = DistributedBatchSampler(
|
||||||
|
dev_dataset,
|
||||||
|
batch_size=config.batch_size,
|
||||||
|
shuffle=False,
|
||||||
|
drop_last=False)
|
||||||
|
print("samplers done!")
|
||||||
|
|
||||||
|
if "aux_context_window" in config.generator_params:
|
||||||
|
aux_context_window = config.generator_params.aux_context_window
|
||||||
|
else:
|
||||||
|
aux_context_window = 0
|
||||||
|
train_batch_fn = Clip(
|
||||||
|
batch_max_steps=config.batch_max_steps,
|
||||||
|
hop_size=config.n_shift,
|
||||||
|
aux_context_window=aux_context_window)
|
||||||
|
|
||||||
|
train_dataloader = DataLoader(
|
||||||
|
train_dataset,
|
||||||
|
batch_sampler=train_sampler,
|
||||||
|
collate_fn=train_batch_fn,
|
||||||
|
num_workers=config.num_workers)
|
||||||
|
|
||||||
|
dev_dataloader = DataLoader(
|
||||||
|
dev_dataset,
|
||||||
|
batch_sampler=dev_sampler,
|
||||||
|
collate_fn=train_batch_fn,
|
||||||
|
num_workers=config.num_workers)
|
||||||
|
print("dataloaders done!")
|
||||||
|
|
||||||
|
generator = HiFiGANGenerator(**config["generator_params"])
|
||||||
|
discriminator = HiFiGANMultiScaleMultiPeriodDiscriminator(
|
||||||
|
**config["discriminator_params"])
|
||||||
|
if world_size > 1:
|
||||||
|
generator = DataParallel(generator)
|
||||||
|
discriminator = DataParallel(discriminator)
|
||||||
|
print("models done!")
|
||||||
|
|
||||||
|
criterion_feat_match = FeatureMatchLoss(**config["feat_match_loss_params"])
|
||||||
|
criterion_mel = MelSpectrogramLoss(
|
||||||
|
fs=config.fs,
|
||||||
|
fft_size=config.n_fft,
|
||||||
|
hop_size=config.n_shift,
|
||||||
|
win_length=config.win_length,
|
||||||
|
window=config.window,
|
||||||
|
num_mels=config.n_mels,
|
||||||
|
fmin=config.fmin,
|
||||||
|
fmax=config.fmax, )
|
||||||
|
criterion_gen_adv = GeneratorAdversarialLoss(
|
||||||
|
**config["generator_adv_loss_params"])
|
||||||
|
criterion_dis_adv = DiscriminatorAdversarialLoss(
|
||||||
|
**config["discriminator_adv_loss_params"])
|
||||||
|
print("criterions done!")
|
||||||
|
|
||||||
|
lr_schedule_g = MultiStepDecay(**config["generator_scheduler_params"])
|
||||||
|
# Compared to multi_band_melgan.v1 config, Adam optimizer without gradient norm is used
|
||||||
|
generator_grad_norm = config["generator_grad_norm"]
|
||||||
|
gradient_clip_g = nn.ClipGradByGlobalNorm(
|
||||||
|
generator_grad_norm) if generator_grad_norm > 0 else None
|
||||||
|
print("gradient_clip_g:", gradient_clip_g)
|
||||||
|
|
||||||
|
optimizer_g = Adam(
|
||||||
|
learning_rate=lr_schedule_g,
|
||||||
|
grad_clip=gradient_clip_g,
|
||||||
|
parameters=generator.parameters(),
|
||||||
|
**config["generator_optimizer_params"])
|
||||||
|
lr_schedule_d = MultiStepDecay(**config["discriminator_scheduler_params"])
|
||||||
|
discriminator_grad_norm = config["discriminator_grad_norm"]
|
||||||
|
gradient_clip_d = nn.ClipGradByGlobalNorm(
|
||||||
|
discriminator_grad_norm) if discriminator_grad_norm > 0 else None
|
||||||
|
print("gradient_clip_d:", gradient_clip_d)
|
||||||
|
optimizer_d = Adam(
|
||||||
|
learning_rate=lr_schedule_d,
|
||||||
|
grad_clip=gradient_clip_d,
|
||||||
|
parameters=discriminator.parameters(),
|
||||||
|
**config["discriminator_optimizer_params"])
|
||||||
|
print("optimizers done!")
|
||||||
|
|
||||||
|
output_dir = Path(args.output_dir)
|
||||||
|
output_dir.mkdir(parents=True, exist_ok=True)
|
||||||
|
if dist.get_rank() == 0:
|
||||||
|
config_name = args.config.split("/")[-1]
|
||||||
|
# copy conf to output_dir
|
||||||
|
shutil.copyfile(args.config, output_dir / config_name)
|
||||||
|
|
||||||
|
updater = HiFiGANUpdater(
|
||||||
|
models={
|
||||||
|
"generator": generator,
|
||||||
|
"discriminator": discriminator,
|
||||||
|
},
|
||||||
|
optimizers={
|
||||||
|
"generator": optimizer_g,
|
||||||
|
"discriminator": optimizer_d,
|
||||||
|
},
|
||||||
|
criterions={
|
||||||
|
"mel": criterion_mel,
|
||||||
|
"feat_match": criterion_feat_match,
|
||||||
|
"gen_adv": criterion_gen_adv,
|
||||||
|
"dis_adv": criterion_dis_adv,
|
||||||
|
},
|
||||||
|
schedulers={
|
||||||
|
"generator": lr_schedule_g,
|
||||||
|
"discriminator": lr_schedule_d,
|
||||||
|
},
|
||||||
|
dataloader=train_dataloader,
|
||||||
|
discriminator_train_start_steps=config.discriminator_train_start_steps,
|
||||||
|
    # only hifigan has generator_train_start_steps
        generator_train_start_steps=config.generator_train_start_steps,
        lambda_adv=config.lambda_adv,
        lambda_aux=config.lambda_aux,
        lambda_feat_match=config.lambda_feat_match,
        output_dir=output_dir)

    evaluator = HiFiGANEvaluator(
        models={
            "generator": generator,
            "discriminator": discriminator,
        },
        criterions={
            "mel": criterion_mel,
            "feat_match": criterion_feat_match,
            "gen_adv": criterion_gen_adv,
            "dis_adv": criterion_dis_adv,
        },
        dataloader=dev_dataloader,
        lambda_adv=config.lambda_adv,
        lambda_aux=config.lambda_aux,
        lambda_feat_match=config.lambda_feat_match,
        output_dir=output_dir)

    trainer = Trainer(
        updater,
        stop_trigger=(config.train_max_steps, "iteration"),
        out=output_dir)

    if dist.get_rank() == 0:
        trainer.extend(
            evaluator, trigger=(config.eval_interval_steps, 'iteration'))
        trainer.extend(VisualDL(output_dir), trigger=(1, 'iteration'))
        trainer.extend(
            Snapshot(max_size=config.num_snapshots),
            trigger=(config.save_interval_steps, 'iteration'))

    print("Trainer Done!")
    trainer.run()


def main():
    # parse args and config and redirect to train_sp

    parser = argparse.ArgumentParser(
        description="Train a HiFiGAN model.")
    parser.add_argument(
        "--config", type=str, help="config file to overwrite default config.")
    parser.add_argument("--train-metadata", type=str, help="training data.")
    parser.add_argument("--dev-metadata", type=str, help="dev data.")
    parser.add_argument("--output-dir", type=str, help="output dir.")
    parser.add_argument(
        "--ngpu", type=int, default=1, help="if ngpu == 0, use cpu.")
    parser.add_argument("--verbose", type=int, default=1, help="verbose.")

    args = parser.parse_args()

    with open(args.config, 'rt') as f:
        config = CfgNode(yaml.safe_load(f))

    print("========Args========")
    print(yaml.safe_dump(vars(args)))
    print("========Config========")
    print(config)
    print(
        f"master sees the world size: {dist.get_world_size()}, from pid: {os.getpid()}"
    )

    # dispatch
    if args.ngpu > 1:
        dist.spawn(train_sp, (args, config), nprocs=args.ngpu)
    else:
        train_sp(args, config)


if __name__ == "__main__":
    main()
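The `trigger=(N, 'iteration')` pairs above schedule trainer extensions (evaluation, VisualDL logging, snapshots) at fixed iteration intervals. A minimal sketch of that interval-trigger idea; the class name `IntervalTrigger` is illustrative, the real trigger machinery lives inside paddlespeech's training package:

```python
# Illustrative sketch of an interval trigger: fires every `period` iterations,
# like trigger=(config.save_interval_steps, 'iteration') above.
class IntervalTrigger:
    def __init__(self, period):
        self.period = period

    def __call__(self, iteration):
        # fire on every multiple of the period
        return iteration % self.period == 0


snapshot_trigger = IntervalTrigger(1000)
print(snapshot_trigger(1000))  # True
print(snapshot_trigger(1500))  # False
```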
@ -0,0 +1,136 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
from pathlib import Path

import soundfile as sf
from paddle import inference

from paddlespeech.t2s.frontend.zh_frontend import Frontend


# only inference for models trained with csmsc now
def main():
    parser = argparse.ArgumentParser(
        description="Paddle Inference with speedyspeech & parallel wavegan.")
    # acoustic model
    parser.add_argument(
        '--am',
        type=str,
        default='fastspeech2_csmsc',
        choices=['speedyspeech_csmsc', 'fastspeech2_csmsc'],
        help='Choose acoustic model type of tts task.')
    parser.add_argument(
        "--phones_dict", type=str, default=None, help="phone vocabulary file.")
    parser.add_argument(
        "--tones_dict", type=str, default=None, help="tone vocabulary file.")
    # voc
    parser.add_argument(
        '--voc',
        type=str,
        default='pwgan_csmsc',
        choices=['pwgan_csmsc', 'mb_melgan_csmsc', 'hifigan_csmsc'],
        help='Choose vocoder type of tts task.')
    # other
    parser.add_argument(
        "--text",
        type=str,
        help="text to synthesize, a 'utt_id sentence' pair per line")
    parser.add_argument(
        "--inference_dir", type=str, help="dir to save inference models")
    parser.add_argument("--output_dir", type=str, help="output dir")

    args, _ = parser.parse_known_args()

    frontend = Frontend(
        phone_vocab_path=args.phones_dict, tone_vocab_path=args.tones_dict)
    print("frontend done!")

    # model: {model_name}_{dataset}
    am_name = args.am[:args.am.rindex('_')]
    am_dataset = args.am[args.am.rindex('_') + 1:]

    am_config = inference.Config(
        str(Path(args.inference_dir) / (args.am + ".pdmodel")),
        str(Path(args.inference_dir) / (args.am + ".pdiparams")))
    am_config.enable_use_gpu(100, 0)
    # This line must be commented out for fastspeech2; otherwise it will OOM
    if am_name != 'fastspeech2':
        am_config.enable_memory_optim()
    am_predictor = inference.create_predictor(am_config)

    voc_config = inference.Config(
        str(Path(args.inference_dir) / (args.voc + ".pdmodel")),
        str(Path(args.inference_dir) / (args.voc + ".pdiparams")))
    voc_config.enable_use_gpu(100, 0)
    voc_config.enable_memory_optim()
    voc_predictor = inference.create_predictor(voc_config)

    output_dir = Path(args.output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    sentences = []

    print("in new inference")

    with open(args.text, 'rt') as f:
        for line in f:
            items = line.strip().split()
            utt_id = items[0]
            sentence = "".join(items[1:])
            sentences.append((utt_id, sentence))

    get_tone_ids = False
    if am_name == 'speedyspeech':
        get_tone_ids = True

    am_input_names = am_predictor.get_input_names()

    for utt_id, sentence in sentences:
        input_ids = frontend.get_input_ids(
            sentence, merge_sentences=True, get_tone_ids=get_tone_ids)
        phone_ids = input_ids["phone_ids"]
        if get_tone_ids:
            tone_ids = input_ids["tone_ids"]
            tones = tone_ids[0].numpy()
            tones_handle = am_predictor.get_input_handle(am_input_names[1])
            tones_handle.reshape(tones.shape)
            tones_handle.copy_from_cpu(tones)

        phones = phone_ids[0].numpy()
        phones_handle = am_predictor.get_input_handle(am_input_names[0])
        phones_handle.reshape(phones.shape)
        phones_handle.copy_from_cpu(phones)

        am_predictor.run()
        am_output_names = am_predictor.get_output_names()
        am_output_handle = am_predictor.get_output_handle(am_output_names[0])
        am_output_data = am_output_handle.copy_to_cpu()

        voc_input_names = voc_predictor.get_input_names()
        mel_handle = voc_predictor.get_input_handle(voc_input_names[0])
        mel_handle.reshape(am_output_data.shape)
        mel_handle.copy_from_cpu(am_output_data)

        voc_predictor.run()
        voc_output_names = voc_predictor.get_output_names()
        voc_output_handle = voc_predictor.get_output_handle(voc_output_names[0])
        wav = voc_output_handle.copy_to_cpu()

        sf.write(output_dir / (utt_id + ".wav"), wav, samplerate=24000)

        print(f"{utt_id} done!")


if __name__ == "__main__":
    main()
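The scripts in this commit repeatedly split a model tag like `fastspeech2_csmsc` on its last underscore (`args.am[:args.am.rindex('_')]`) to recover the `{model_name}_{dataset}` pair. A small standalone sketch of that convention; the helper name `split_model_tag` is illustrative, not from the repo:

```python
def split_model_tag(tag: str):
    """Split a '{model_name}_{dataset}' tag on the LAST underscore,
    so multi-word model names like 'mb_melgan' stay intact."""
    idx = tag.rindex('_')
    return tag[:idx], tag[idx + 1:]


print(split_model_tag('fastspeech2_csmsc'))  # ('fastspeech2', 'csmsc')
print(split_model_tag('mb_melgan_csmsc'))    # ('mb_melgan', 'csmsc')
```

Splitting on the first underscore instead would break names such as `mb_melgan_csmsc`, which is why `rindex` is used.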
@ -1,185 +0,0 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import logging
import os
from pathlib import Path

import jsonlines
import numpy as np
import paddle
import soundfile as sf
import yaml
from paddle import jit
from paddle.static import InputSpec
from yacs.config import CfgNode

from paddlespeech.t2s.datasets.data_table import DataTable
from paddlespeech.t2s.models.parallel_wavegan import PWGGenerator
from paddlespeech.t2s.models.parallel_wavegan import PWGInference
from paddlespeech.t2s.models.speedyspeech import SpeedySpeech
from paddlespeech.t2s.models.speedyspeech import SpeedySpeechInference
from paddlespeech.t2s.modules.normalizer import ZScore


def evaluate(args, speedyspeech_config, pwg_config):
    # dataloader has been too verbose
    logging.getLogger("DataLoader").disabled = True

    # construct dataset for evaluation
    with jsonlines.open(args.test_metadata, 'r') as reader:
        test_metadata = list(reader)
    test_dataset = DataTable(
        data=test_metadata, fields=["utt_id", "phones", "tones"])

    with open(args.phones_dict, "r") as f:
        phn_id = [line.strip().split() for line in f.readlines()]
    vocab_size = len(phn_id)
    print("vocab_size:", vocab_size)
    with open(args.tones_dict, "r") as f:
        tone_id = [line.strip().split() for line in f.readlines()]
    tone_size = len(tone_id)
    print("tone_size:", tone_size)

    model = SpeedySpeech(
        vocab_size=vocab_size,
        tone_size=tone_size,
        **speedyspeech_config["model"])
    model.set_state_dict(
        paddle.load(args.speedyspeech_checkpoint)["main_params"])
    model.eval()

    vocoder = PWGGenerator(**pwg_config["generator_params"])
    vocoder.set_state_dict(paddle.load(args.pwg_checkpoint)["generator_params"])
    vocoder.remove_weight_norm()
    vocoder.eval()
    print("model done!")

    stat = np.load(args.speedyspeech_stat)
    mu, std = stat
    mu = paddle.to_tensor(mu)
    std = paddle.to_tensor(std)
    speedyspeech_normalizer = ZScore(mu, std)
    speedyspeech_normalizer.eval()

    stat = np.load(args.pwg_stat)
    mu, std = stat
    mu = paddle.to_tensor(mu)
    std = paddle.to_tensor(std)
    pwg_normalizer = ZScore(mu, std)
    pwg_normalizer.eval()

    speedyspeech_inference = SpeedySpeechInference(speedyspeech_normalizer,
                                                   model)
    speedyspeech_inference.eval()
    speedyspeech_inference = jit.to_static(
        speedyspeech_inference,
        input_spec=[
            InputSpec([-1], dtype=paddle.int64), InputSpec(
                [-1], dtype=paddle.int64)
        ])
    paddle.jit.save(speedyspeech_inference,
                    os.path.join(args.inference_dir, "speedyspeech"))
    speedyspeech_inference = paddle.jit.load(
        os.path.join(args.inference_dir, "speedyspeech"))

    pwg_inference = PWGInference(pwg_normalizer, vocoder)
    pwg_inference.eval()
    pwg_inference = jit.to_static(
        pwg_inference, input_spec=[
            InputSpec([-1, 80], dtype=paddle.float32),
        ])
    paddle.jit.save(pwg_inference, os.path.join(args.inference_dir, "pwg"))
    pwg_inference = paddle.jit.load(os.path.join(args.inference_dir, "pwg"))

    output_dir = Path(args.output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    for datum in test_dataset:
        utt_id = datum["utt_id"]
        phones = paddle.to_tensor(datum["phones"])
        tones = paddle.to_tensor(datum["tones"])

        with paddle.no_grad():
            wav = pwg_inference(speedyspeech_inference(phones, tones))
        sf.write(
            output_dir / (utt_id + ".wav"),
            wav.numpy(),
            samplerate=speedyspeech_config.fs)
        print(f"{utt_id} done!")


def main():
    # parse args and config and redirect to train_sp
    parser = argparse.ArgumentParser(
        description="Synthesize with speedyspeech & parallel wavegan.")
    parser.add_argument(
        "--speedyspeech-config", type=str, help="config file for speedyspeech.")
    parser.add_argument(
        "--speedyspeech-checkpoint",
        type=str,
        help="speedyspeech checkpoint to load.")
    parser.add_argument(
        "--speedyspeech-stat",
        type=str,
        help="mean and standard deviation used to normalize spectrogram when training speedyspeech."
    )
    parser.add_argument(
        "--pwg-config", type=str, help="config file for parallel wavegan.")
    parser.add_argument(
        "--pwg-checkpoint",
        type=str,
        help="parallel wavegan generator parameters to load.")
    parser.add_argument(
        "--pwg-stat",
        type=str,
        help="mean and standard deviation used to normalize spectrogram when training parallel wavegan."
    )
    parser.add_argument(
        "--phones-dict", type=str, default=None, help="phone vocabulary file.")
    parser.add_argument(
        "--tones-dict", type=str, default=None, help="tone vocabulary file.")
    parser.add_argument("--test-metadata", type=str, help="test metadata")
    parser.add_argument("--output-dir", type=str, help="output dir")
    parser.add_argument(
        "--inference-dir", type=str, help="dir to save inference models")
    parser.add_argument(
        "--ngpu", type=int, default=1, help="if ngpu == 0, use cpu.")
    parser.add_argument("--verbose", type=int, default=1, help="verbose")

    args, _ = parser.parse_known_args()

    if args.ngpu == 0:
        paddle.set_device("cpu")
    elif args.ngpu > 0:
        paddle.set_device("gpu")
    else:
        print("ngpu should >= 0 !")

    with open(args.speedyspeech_config) as f:
        speedyspeech_config = CfgNode(yaml.safe_load(f))
    with open(args.pwg_config) as f:
        pwg_config = CfgNode(yaml.safe_load(f))

    print("========Args========")
    print(yaml.safe_dump(vars(args)))
    print("========Config========")
    print(speedyspeech_config)
    print(pwg_config)

    evaluate(args, speedyspeech_config, pwg_config)


if __name__ == "__main__":
    main()
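The `ZScore` normalizer built from the `*_stat` files above applies z-score normalization with the training-set mean and standard deviation, and the inference wrappers invert it before the vocoder. A minimal pure-Python sketch of the forward and inverse transforms (illustrative only; the real `ZScore` in `paddlespeech.t2s.modules.normalizer` operates on paddle tensors):

```python
def zscore(x, mu, std):
    # normalize: subtract the training-set mean, divide by its std
    return [(v - mu) / std for v in x]


def inverse_zscore(x, mu, std):
    # denormalize: exact inverse of the transform above
    return [v * std + mu for v in x]


mel = [1.0, 3.0, 5.0]
mu, std = 3.0, 2.0
norm = zscore(mel, mu, std)
print(norm)                           # [-1.0, 0.0, 1.0]
print(inverse_zscore(norm, mu, std))  # [1.0, 3.0, 5.0]
```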
@ -0,0 +1,268 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import logging
from pathlib import Path

import jsonlines
import numpy as np
import paddle
import soundfile as sf
import yaml
from yacs.config import CfgNode

from paddlespeech.s2t.utils.dynamic_import import dynamic_import
from paddlespeech.t2s.datasets.data_table import DataTable
from paddlespeech.t2s.modules.normalizer import ZScore

model_alias = {
    # acoustic model
    "speedyspeech":
    "paddlespeech.t2s.models.speedyspeech:SpeedySpeech",
    "speedyspeech_inference":
    "paddlespeech.t2s.models.speedyspeech:SpeedySpeechInference",
    "fastspeech2":
    "paddlespeech.t2s.models.fastspeech2:FastSpeech2",
    "fastspeech2_inference":
    "paddlespeech.t2s.models.fastspeech2:FastSpeech2Inference",
    # voc
    "pwgan":
    "paddlespeech.t2s.models.parallel_wavegan:PWGGenerator",
    "pwgan_inference":
    "paddlespeech.t2s.models.parallel_wavegan:PWGInference",
    "mb_melgan":
    "paddlespeech.t2s.models.melgan:MelGANGenerator",
    "mb_melgan_inference":
    "paddlespeech.t2s.models.melgan:MelGANInference",
}


def evaluate(args):
    # dataloader has been too verbose
    logging.getLogger("DataLoader").disabled = True

    # construct dataset for evaluation
    with jsonlines.open(args.test_metadata, 'r') as reader:
        test_metadata = list(reader)

    # Init body.
    with open(args.am_config) as f:
        am_config = CfgNode(yaml.safe_load(f))
    with open(args.voc_config) as f:
        voc_config = CfgNode(yaml.safe_load(f))

    print("========Args========")
    print(yaml.safe_dump(vars(args)))
    print("========Config========")
    print(am_config)
    print(voc_config)

    # model: {model_name}_{dataset}
    am_name = args.am[:args.am.rindex('_')]
    am_dataset = args.am[args.am.rindex('_') + 1:]

    if am_name == 'fastspeech2':
        fields = ["utt_id", "text"]
        spk_num = None
        if am_dataset in {"aishell3", "vctk"} and args.speaker_dict:
            print("multiple speaker fastspeech2!")
            with open(args.speaker_dict, 'rt') as f:
                spk_id = [line.strip().split() for line in f.readlines()]
            spk_num = len(spk_id)
            fields += ["spk_id"]
        elif args.voice_cloning:
            print("voice cloning!")
            fields += ["spk_emb"]
        else:
            print("single speaker fastspeech2!")
        print("spk_num:", spk_num)
    elif am_name == 'speedyspeech':
        fields = ["utt_id", "phones", "tones"]

    test_dataset = DataTable(data=test_metadata, fields=fields)

    with open(args.phones_dict, "r") as f:
        phn_id = [line.strip().split() for line in f.readlines()]
    vocab_size = len(phn_id)
    print("vocab_size:", vocab_size)

    tone_size = None
    if args.tones_dict:
        with open(args.tones_dict, "r") as f:
            tone_id = [line.strip().split() for line in f.readlines()]
        tone_size = len(tone_id)
        print("tone_size:", tone_size)

    # acoustic model
    odim = am_config.n_mels
    am_class = dynamic_import(am_name, model_alias)
    am_inference_class = dynamic_import(am_name + '_inference', model_alias)

    if am_name == 'fastspeech2':
        am = am_class(
            idim=vocab_size, odim=odim, spk_num=spk_num, **am_config["model"])
    elif am_name == 'speedyspeech':
        am = am_class(
            vocab_size=vocab_size, tone_size=tone_size, **am_config["model"])

    am.set_state_dict(paddle.load(args.am_ckpt)["main_params"])
    am.eval()
    am_mu, am_std = np.load(args.am_stat)
    am_mu = paddle.to_tensor(am_mu)
    am_std = paddle.to_tensor(am_std)
    am_normalizer = ZScore(am_mu, am_std)
    am_inference = am_inference_class(am_normalizer, am)
    print("am_inference.training0:", am_inference.training)
    am_inference.eval()
    print("acoustic model done!")

    # vocoder
    # model: {model_name}_{dataset}
    voc_name = args.voc[:args.voc.rindex('_')]
    voc_class = dynamic_import(voc_name, model_alias)
    voc_inference_class = dynamic_import(voc_name + '_inference', model_alias)
    voc = voc_class(**voc_config["generator_params"])
    voc.set_state_dict(paddle.load(args.voc_ckpt)["generator_params"])
    voc.remove_weight_norm()
    voc.eval()
    voc_mu, voc_std = np.load(args.voc_stat)
    voc_mu = paddle.to_tensor(voc_mu)
    voc_std = paddle.to_tensor(voc_std)
    voc_normalizer = ZScore(voc_mu, voc_std)
    voc_inference = voc_inference_class(voc_normalizer, voc)
    print("voc_inference.training0:", voc_inference.training)
    voc_inference.eval()
    print("voc done!")

    output_dir = Path(args.output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    for datum in test_dataset:
        utt_id = datum["utt_id"]
        with paddle.no_grad():
            # acoustic model
            if am_name == 'fastspeech2':
                phone_ids = paddle.to_tensor(datum["text"])
                spk_emb = None
                spk_id = None
                # multi speaker
                if args.voice_cloning and "spk_emb" in datum:
                    spk_emb = paddle.to_tensor(np.load(datum["spk_emb"]))
                elif "spk_id" in datum:
                    spk_id = paddle.to_tensor(datum["spk_id"])
                mel = am_inference(phone_ids, spk_id=spk_id, spk_emb=spk_emb)
            elif am_name == 'speedyspeech':
                phone_ids = paddle.to_tensor(datum["phones"])
                tone_ids = paddle.to_tensor(datum["tones"])
                mel = am_inference(phone_ids, tone_ids)
            # vocoder
            wav = voc_inference(mel)
        sf.write(
            str(output_dir / (utt_id + ".wav")),
            wav.numpy(),
            samplerate=am_config.fs)
        print(f"{utt_id} done!")


def main():
    # parse args and config and redirect to train_sp
    parser = argparse.ArgumentParser(
        description="Synthesize with acoustic model & vocoder")
    # acoustic model
    parser.add_argument(
        '--am',
        type=str,
        default='fastspeech2_csmsc',
        choices=[
            'speedyspeech_csmsc', 'fastspeech2_csmsc', 'fastspeech2_ljspeech',
            'fastspeech2_aishell3', 'fastspeech2_vctk'
        ],
        help='Choose acoustic model type of tts task.')
    parser.add_argument(
        '--am_config',
        type=str,
        default=None,
        help='Config of acoustic model. Use default config when it is None.')
    parser.add_argument(
        '--am_ckpt',
        type=str,
        default=None,
        help='Checkpoint file of acoustic model.')
    parser.add_argument(
        "--am_stat",
        type=str,
        default=None,
        help="mean and standard deviation used to normalize spectrogram when training acoustic model."
    )
    parser.add_argument(
        "--phones_dict", type=str, default=None, help="phone vocabulary file.")
    parser.add_argument(
        "--tones_dict", type=str, default=None, help="tone vocabulary file.")
    parser.add_argument(
        "--speaker_dict", type=str, default=None, help="speaker id map file.")

    def str2bool(str):
        return True if str.lower() == 'true' else False

    parser.add_argument(
        "--voice-cloning",
        type=str2bool,
        default=False,
        help="whether training voice cloning model.")
    # vocoder
    parser.add_argument(
        '--voc',
        type=str,
        default='pwgan_csmsc',
        choices=[
            'pwgan_csmsc', 'pwgan_ljspeech', 'pwgan_aishell3', 'pwgan_vctk',
            'mb_melgan_csmsc'
        ],
        help='Choose vocoder type of tts task.')

    parser.add_argument(
        '--voc_config',
        type=str,
        default=None,
        help='Config of voc. Use default config when it is None.')
    parser.add_argument(
        '--voc_ckpt', type=str, default=None, help='Checkpoint file of voc.')
    parser.add_argument(
        "--voc_stat",
        type=str,
        default=None,
        help="mean and standard deviation used to normalize spectrogram when training voc."
    )
    # other
    parser.add_argument(
        "--ngpu", type=int, default=1, help="if ngpu == 0, use cpu.")
    parser.add_argument("--test_metadata", type=str, help="test metadata.")
    parser.add_argument("--output_dir", type=str, help="output dir.")

    args = parser.parse_args()

    if args.ngpu == 0:
        paddle.set_device("cpu")
    elif args.ngpu > 0:
        paddle.set_device("gpu")
    else:
        print("ngpu should >= 0 !")

    evaluate(args)


if __name__ == "__main__":
    main()
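The `model_alias` table maps short names to `"module.path:ClassName"` strings that `dynamic_import` resolves at runtime, so the same script can instantiate any registered acoustic model or vocoder. A standalone sketch of that resolution pattern using `importlib`; the helper `resolve_alias` and the demo alias entry are illustrative, not the repo's implementation:

```python
import importlib


def resolve_alias(name, alias_table):
    """Resolve a 'module.path:AttrName' alias to the actual object."""
    module_path, _, attr_name = alias_table[name].partition(':')
    return getattr(importlib.import_module(module_path), attr_name)


# Hypothetical alias entry, mirroring the "module.path:ClassName" format above.
demo_alias = {"mean": "statistics:mean"}
mean_fn = resolve_alias("mean", demo_alias)
print(mean_fn([1, 2, 3]))  # 2
```

Registering a new model then only requires adding a pair of entries (e.g. `hifigan` / `hifigan_inference`) to the table, which is exactly what the e2e script below does.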
@ -0,0 +1,336 @@
|
|||||||
|
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
import argparse
|
||||||
|
import os
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
import numpy as np
|
||||||
|
import paddle
|
||||||
|
import soundfile as sf
|
||||||
|
import yaml
|
||||||
|
from paddle import jit
|
||||||
|
from paddle.static import InputSpec
|
||||||
|
from yacs.config import CfgNode
|
||||||
|
|
||||||
|
from paddlespeech.s2t.utils.dynamic_import import dynamic_import
|
||||||
|
from paddlespeech.t2s.frontend import English
|
||||||
|
from paddlespeech.t2s.frontend.zh_frontend import Frontend
|
||||||
|
from paddlespeech.t2s.modules.normalizer import ZScore
|
||||||
|
|
||||||
|
model_alias = {
|
||||||
|
# acoustic model
|
||||||
|
"speedyspeech":
|
||||||
|
"paddlespeech.t2s.models.speedyspeech:SpeedySpeech",
|
||||||
|
"speedyspeech_inference":
|
||||||
|
"paddlespeech.t2s.models.speedyspeech:SpeedySpeechInference",
|
||||||
|
"fastspeech2":
|
||||||
|
"paddlespeech.t2s.models.fastspeech2:FastSpeech2",
|
||||||
|
"fastspeech2_inference":
|
||||||
|
"paddlespeech.t2s.models.fastspeech2:FastSpeech2Inference",
|
||||||
|
# voc
|
||||||
|
"pwgan":
|
||||||
|
"paddlespeech.t2s.models.parallel_wavegan:PWGGenerator",
|
||||||
|
"pwgan_inference":
|
||||||
|
"paddlespeech.t2s.models.parallel_wavegan:PWGInference",
|
||||||
|
"mb_melgan":
|
||||||
|
"paddlespeech.t2s.models.melgan:MelGANGenerator",
|
||||||
|
"mb_melgan_inference":
|
||||||
|
"paddlespeech.t2s.models.melgan:MelGANInference",
|
||||||
|
"style_melgan":
|
||||||
|
"paddlespeech.t2s.models.melgan:StyleMelGANGenerator",
|
||||||
|
"style_melgan_inference":
|
||||||
|
"paddlespeech.t2s.models.melgan:StyleMelGANInference",
|
||||||
|
"hifigan":
|
||||||
|
"paddlespeech.t2s.models.hifigan:HiFiGANGenerator",
|
||||||
|
"hifigan_inference":
|
||||||
|
"paddlespeech.t2s.models.hifigan:HiFiGANInference",
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
def evaluate(args):
|
||||||
|
|
||||||
|
# Init body.
|
||||||
|
with open(args.am_config) as f:
|
||||||
|
am_config = CfgNode(yaml.safe_load(f))
|
||||||
|
with open(args.voc_config) as f:
|
||||||
|
voc_config = CfgNode(yaml.safe_load(f))
|
||||||
|
|
||||||
|
print("========Args========")
|
||||||
|
print(yaml.safe_dump(vars(args)))
|
||||||
|
print("========Config========")
|
||||||
|
print(am_config)
|
||||||
|
print(voc_config)
|
||||||
|
|
||||||
|
# construct dataset for evaluation
|
||||||
|
sentences = []
|
||||||
|
with open(args.text, 'rt') as f:
|
||||||
|
for line in f:
|
||||||
|
items = line.strip().split()
|
||||||
|
utt_id = items[0]
|
||||||
|
if args.lang == 'zh':
|
||||||
|
sentence = "".join(items[1:])
|
||||||
|
elif args.lang == 'en':
|
||||||
|
sentence = " ".join(items[1:])
|
||||||
|
sentences.append((utt_id, sentence))
|
||||||
|
|
||||||
|
with open(args.phones_dict, "r") as f:
|
||||||
|
phn_id = [line.strip().split() for line in f.readlines()]
|
||||||
|
vocab_size = len(phn_id)
|
||||||
|
print("vocab_size:", vocab_size)
|
||||||
|
|
||||||
|
tone_size = None
|
||||||
|
if args.tones_dict:
|
||||||
|
with open(args.tones_dict, "r") as f:
|
||||||
|
tone_id = [line.strip().split() for line in f.readlines()]
|
||||||
|
tone_size = len(tone_id)
|
||||||
|
print("tone_size:", tone_size)
|
||||||
|
|
||||||
|
spk_num = None
|
||||||
|
if args.speaker_dict:
|
||||||
|
with open(args.speaker_dict, 'rt') as f:
|
||||||
|
spk_id = [line.strip().split() for line in f.readlines()]
|
||||||
|
spk_num = len(spk_id)
|
||||||
|
print("spk_num:", spk_num)
|
||||||
|
|
||||||
|
# frontend
|
||||||
|
if args.lang == 'zh':
|
||||||
|
frontend = Frontend(
|
||||||
|
phone_vocab_path=args.phones_dict, tone_vocab_path=args.tones_dict)
|
||||||
|
elif args.lang == 'en':
|
||||||
|
frontend = English(phone_vocab_path=args.phones_dict)
|
||||||
|
print("frontend done!")
|
||||||
|
|
||||||
|
# acoustic model
|
||||||
|
odim = am_config.n_mels
|
||||||
|
# model: {model_name}_{dataset}
|
||||||
|
am_name = args.am[:args.am.rindex('_')]
|
||||||
|
am_dataset = args.am[args.am.rindex('_') + 1:]
|
||||||
|
|
||||||
|
am_class = dynamic_import(am_name, model_alias)
|
||||||
|
am_inference_class = dynamic_import(am_name + '_inference', model_alias)
|
||||||
|
|
||||||
|
if am_name == 'fastspeech2':
|
||||||
|
am = am_class(
|
||||||
|
idim=vocab_size, odim=odim, spk_num=spk_num, **am_config["model"])
|
||||||
|
elif am_name == 'speedyspeech':
|
||||||
|
am = am_class(
|
||||||
|
vocab_size=vocab_size, tone_size=tone_size, **am_config["model"])
|
||||||
|
|
||||||
|
am.set_state_dict(paddle.load(args.am_ckpt)["main_params"])
|
||||||
|
am.eval()
|
||||||
|
am_mu, am_std = np.load(args.am_stat)
|
||||||
|
am_mu = paddle.to_tensor(am_mu)
|
||||||
|
am_std = paddle.to_tensor(am_std)
|
||||||
|
am_normalizer = ZScore(am_mu, am_std)
|
||||||
|
am_inference = am_inference_class(am_normalizer, am)
|
||||||
|
am_inference.eval()
|
||||||
|
print("acoustic model done!")
|
||||||
|
|
||||||
|
# vocoder
|
||||||
|
# model: {model_name}_{dataset}
|
||||||
|
voc_name = args.voc[:args.voc.rindex('_')]
|
||||||
|
voc_class = dynamic_import(voc_name, model_alias)
|
||||||
|
voc_inference_class = dynamic_import(voc_name + '_inference', model_alias)
|
||||||
|
voc = voc_class(**voc_config["generator_params"])
|
||||||
|
voc.set_state_dict(paddle.load(args.voc_ckpt)["generator_params"])
|
||||||
|
voc.remove_weight_norm()
|
||||||
|
voc.eval()
|
||||||
|
voc_mu, voc_std = np.load(args.voc_stat)
|
||||||
|
voc_mu = paddle.to_tensor(voc_mu)
|
||||||
|
voc_std = paddle.to_tensor(voc_std)
|
||||||
|
voc_normalizer = ZScore(voc_mu, voc_std)
|
||||||
|
voc_inference = voc_inference_class(voc_normalizer, voc)
|
||||||
|
voc_inference.eval()
|
||||||
|
print("voc done!")
|
||||||
|
|
||||||
|
    # whether dygraph to static
    if args.inference_dir:
        # acoustic model
        if am_name == 'fastspeech2':
            if am_dataset in {"aishell3", "vctk"} and args.speaker_dict:
                print(
                    "Haven't tested dygraph to static for multi-speaker fastspeech2 yet!"
                )
            else:
                am_inference = jit.to_static(
                    am_inference,
                    input_spec=[InputSpec([-1], dtype=paddle.int64)])
                paddle.jit.save(am_inference,
                                os.path.join(args.inference_dir, args.am))
                am_inference = paddle.jit.load(
                    os.path.join(args.inference_dir, args.am))
        elif am_name == 'speedyspeech':
            am_inference = jit.to_static(
                am_inference,
                input_spec=[
                    InputSpec([-1], dtype=paddle.int64),
                    InputSpec([-1], dtype=paddle.int64)
                ])

            paddle.jit.save(am_inference,
                            os.path.join(args.inference_dir, args.am))
            am_inference = paddle.jit.load(
                os.path.join(args.inference_dir, args.am))

        # vocoder
        voc_inference = jit.to_static(
            voc_inference,
            input_spec=[
                InputSpec([-1, 80], dtype=paddle.float32),
            ])
        paddle.jit.save(voc_inference,
                        os.path.join(args.inference_dir, args.voc))
        voc_inference = paddle.jit.load(
            os.path.join(args.inference_dir, args.voc))
    output_dir = Path(args.output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    for utt_id, sentence in sentences:
        get_tone_ids = False
        if am_name == 'speedyspeech':
            get_tone_ids = True
        if args.lang == 'zh':
            input_ids = frontend.get_input_ids(
                sentence, merge_sentences=True, get_tone_ids=get_tone_ids)
            phone_ids = input_ids["phone_ids"]
            phone_ids = phone_ids[0]
            if get_tone_ids:
                tone_ids = input_ids["tone_ids"]
                tone_ids = tone_ids[0]
        elif args.lang == 'en':
            input_ids = frontend.get_input_ids(sentence)
            phone_ids = input_ids["phone_ids"]
        else:
            print("lang should be in {'zh', 'en'}!")

        with paddle.no_grad():
            # acoustic model
            if am_name == 'fastspeech2':
                # multi speaker
                if am_dataset in {"aishell3", "vctk"}:
                    spk_id = paddle.to_tensor(args.spk_id)
                    mel = am_inference(phone_ids, spk_id)
                else:
                    mel = am_inference(phone_ids)
            elif am_name == 'speedyspeech':
                mel = am_inference(phone_ids, tone_ids)
            # vocoder
            wav = voc_inference(mel)
        sf.write(
            str(output_dir / (utt_id + ".wav")),
            wav.numpy(),
            samplerate=am_config.fs)
        print(f"{utt_id} done!")
def main():
    # parse args and config and redirect to train_sp
    parser = argparse.ArgumentParser(
        description="Synthesize with acoustic model & vocoder")
    # acoustic model
    parser.add_argument(
        '--am',
        type=str,
        default='fastspeech2_csmsc',
        choices=[
            'speedyspeech_csmsc', 'fastspeech2_csmsc', 'fastspeech2_ljspeech',
            'fastspeech2_aishell3', 'fastspeech2_vctk'
        ],
        help='Choose acoustic model type of tts task.')
    parser.add_argument(
        '--am_config',
        type=str,
        default=None,
        help='Config of acoustic model. Use default config when it is None.')
    parser.add_argument(
        '--am_ckpt',
        type=str,
        default=None,
        help='Checkpoint file of acoustic model.')
    parser.add_argument(
        "--am_stat",
        type=str,
        default=None,
        help="mean and standard deviation used to normalize spectrogram when training acoustic model."
    )
    parser.add_argument(
        "--phones_dict", type=str, default=None, help="phone vocabulary file.")
    parser.add_argument(
        "--tones_dict", type=str, default=None, help="tone vocabulary file.")
    parser.add_argument(
        "--speaker_dict", type=str, default=None, help="speaker id map file.")
    parser.add_argument(
        '--spk_id',
        type=int,
        default=0,
        help='spk id for multi speaker acoustic model')
    # vocoder
    parser.add_argument(
        '--voc',
        type=str,
        default='pwgan_csmsc',
        choices=[
            'pwgan_csmsc', 'pwgan_ljspeech', 'pwgan_aishell3', 'pwgan_vctk',
            'mb_melgan_csmsc', 'style_melgan_csmsc', 'hifigan_csmsc'
        ],
        help='Choose vocoder type of tts task.')
    parser.add_argument(
        '--voc_config',
        type=str,
        default=None,
        help='Config of voc. Use default config when it is None.')
    parser.add_argument(
        '--voc_ckpt', type=str, default=None, help='Checkpoint file of voc.')
    parser.add_argument(
        "--voc_stat",
        type=str,
        default=None,
        help="mean and standard deviation used to normalize spectrogram when training voc."
    )
    # other
    parser.add_argument(
        '--lang',
        type=str,
        default='zh',
        help='Choose model language. zh or en')
    parser.add_argument(
        "--inference_dir",
        type=str,
        default=None,
        help="dir to save inference models")
    parser.add_argument(
        "--ngpu", type=int, default=1, help="if ngpu == 0, use cpu.")
    parser.add_argument(
        "--text",
        type=str,
        help="text to synthesize, a 'utt_id sentence' pair per line.")
    parser.add_argument("--output_dir", type=str, help="output dir.")

    args = parser.parse_args()

    if args.ngpu == 0:
        paddle.set_device("cpu")
    elif args.ngpu > 0:
        paddle.set_device("gpu")
    else:
        print("ngpu should be >= 0!")

    evaluate(args)


if __name__ == "__main__":
    main()
@ -0,0 +1,15 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from .hifigan import *
from .hifigan_updater import *
@ -0,0 +1,779 @@
# -*- coding: utf-8 -*-
"""HiFi-GAN Modules.

This code is based on https://github.com/jik876/hifi-gan.

"""
import copy
from typing import Any
from typing import Dict
from typing import List

import paddle
import paddle.nn.functional as F
from paddle import nn

from paddlespeech.t2s.modules.activation import get_activation
from paddlespeech.t2s.modules.nets_utils import initialize
from paddlespeech.t2s.modules.residual_block import HiFiGANResidualBlock as ResidualBlock
class HiFiGANGenerator(nn.Layer):
    """HiFiGAN generator module."""

    def __init__(
            self,
            in_channels: int=80,
            out_channels: int=1,
            channels: int=512,
            kernel_size: int=7,
            upsample_scales: List[int]=(8, 8, 2, 2),
            upsample_kernel_sizes: List[int]=(16, 16, 4, 4),
            resblock_kernel_sizes: List[int]=(3, 7, 11),
            resblock_dilations: List[List[int]]=[(1, 3, 5), (1, 3, 5),
                                                 (1, 3, 5)],
            use_additional_convs: bool=True,
            bias: bool=True,
            nonlinear_activation: str="leakyrelu",
            nonlinear_activation_params: Dict[str, Any]={"negative_slope": 0.1},
            use_weight_norm: bool=True,
            init_type: str="xavier_uniform", ):
        """Initialize HiFiGANGenerator module.
        Parameters
        ----------
        in_channels : int
            Number of input channels.
        out_channels : int
            Number of output channels.
        channels : int
            Number of hidden representation channels.
        kernel_size : int
            Kernel size of initial and final conv layer.
        upsample_scales : list
            List of upsampling scales.
        upsample_kernel_sizes : list
            List of kernel sizes for upsampling layers.
        resblock_kernel_sizes : list
            List of kernel sizes for residual blocks.
        resblock_dilations : list
            List of dilation list for residual blocks.
        use_additional_convs : bool
            Whether to use additional conv layers in residual blocks.
        bias : bool
            Whether to add bias parameter in convolution layers.
        nonlinear_activation : str
            Activation function module name.
        nonlinear_activation_params : dict
            Hyperparameters for activation function.
        use_weight_norm : bool
            Whether to use weight norm.
            If set to true, it will be applied to all of the conv layers.
        """
        super().__init__()

        # initialize parameters
        initialize(self, init_type)

        # check hyperparameters are valid
        assert kernel_size % 2 == 1, "Kernel size must be odd number."
        assert len(upsample_scales) == len(upsample_kernel_sizes)
        assert len(resblock_dilations) == len(resblock_kernel_sizes)

        # define modules
        self.num_upsamples = len(upsample_kernel_sizes)
        self.num_blocks = len(resblock_kernel_sizes)
        self.input_conv = nn.Conv1D(
            in_channels,
            channels,
            kernel_size,
            1,
            padding=(kernel_size - 1) // 2, )
        self.upsamples = nn.LayerList()
        self.blocks = nn.LayerList()
        for i in range(len(upsample_kernel_sizes)):
            assert upsample_kernel_sizes[i] == 2 * upsample_scales[i]
            self.upsamples.append(
                nn.Sequential(
                    get_activation(nonlinear_activation, **
                                   nonlinear_activation_params),
                    nn.Conv1DTranspose(
                        channels // (2**i),
                        channels // (2**(i + 1)),
                        upsample_kernel_sizes[i],
                        upsample_scales[i],
                        padding=upsample_scales[i] // 2 + upsample_scales[i] %
                        2,
                        output_padding=upsample_scales[i] % 2, ), ))
            for j in range(len(resblock_kernel_sizes)):
                self.blocks.append(
                    ResidualBlock(
                        kernel_size=resblock_kernel_sizes[j],
                        channels=channels // (2**(i + 1)),
                        dilations=resblock_dilations[j],
                        bias=bias,
                        use_additional_convs=use_additional_convs,
                        nonlinear_activation=nonlinear_activation,
                        nonlinear_activation_params=nonlinear_activation_params,
                    ))
        self.output_conv = nn.Sequential(
            nn.LeakyReLU(),
            nn.Conv1D(
                channels // (2**(i + 1)),
                out_channels,
                kernel_size,
                1,
                padding=(kernel_size - 1) // 2, ),
            nn.Tanh(), )

        nn.initializer.set_global_initializer(None)

        # apply weight norm
        if use_weight_norm:
            self.apply_weight_norm()

        # reset parameters
        self.reset_parameters()
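With the defaults above, each `Conv1DTranspose` stage halves the channel count (`channels // 2**(i + 1)`) while the time axis grows by `upsample_scales[i]`, so one input frame ends up as `prod(upsample_scales)` samples (the hop size). A small sketch of that arithmetic, using the default values:

```python
import math

# Default generator config from the constructor above.
upsample_scales = (8, 8, 2, 2)
channels = 512

# One mel frame expands to this many waveform samples overall.
total_upsampling = math.prod(upsample_scales)
# Channel width after each upsampling stage: channels // 2**(i + 1).
stage_channels = [channels // (2**(i + 1)) for i in range(len(upsample_scales))]

print(total_upsampling)  # 256
print(stage_channels)    # [256, 128, 64, 32]
```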
    def forward(self, c):
        """Calculate forward propagation.
        Parameters
        ----------
        c : Tensor
            Input tensor (B, in_channels, T).
        Returns
        ----------
        Tensor
            Output tensor (B, out_channels, T).
        """
        c = self.input_conv(c)
        for i in range(self.num_upsamples):
            c = self.upsamples[i](c)
            # initialize
            cs = 0.0
            for j in range(self.num_blocks):
                cs += self.blocks[i * self.num_blocks + j](c)
            c = cs / self.num_blocks
        c = self.output_conv(c)

        return c

    def reset_parameters(self):
        """Reset parameters.
        This initialization follows official implementation manner.
        https://github.com/jik876/hifi-gan/blob/master/models.py
        """
        # normal distribution with float parameters (mean 0.0, std 0.01)
        dist = paddle.distribution.Normal(loc=0.0, scale=0.01)

        def _reset_parameters(m):
            if isinstance(m, nn.Conv1D) or isinstance(m, nn.Conv1DTranspose):
                w = dist.sample(m.weight.shape)
                m.weight.set_value(w)

        self.apply(_reset_parameters)

    def apply_weight_norm(self):
        """Recursively apply weight normalization to all the Convolution layers
        in the sublayers.
        """

        def _apply_weight_norm(layer):
            if isinstance(layer, (nn.Conv1D, nn.Conv2D, nn.Conv1DTranspose)):
                nn.utils.weight_norm(layer)

        self.apply(_apply_weight_norm)

    def remove_weight_norm(self):
        """Recursively remove weight normalization from all the Convolution
        layers in the sublayers.
        """

        def _remove_weight_norm(layer):
            try:
                nn.utils.remove_weight_norm(layer)
            except ValueError:
                pass

        self.apply(_remove_weight_norm)

    def inference(self, c):
        """Perform inference.
        Parameters
        ----------
        c : Tensor
            Input tensor (T, in_channels).
        Returns
        ----------
        Tensor
            Output tensor (T * prod(upsample_scales), out_channels).
        """
        c = self.forward(c.transpose([1, 0]).unsqueeze(0))
        return c.squeeze(0).transpose([1, 0])
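In `forward`, the `num_blocks` residual blocks attached to each upsampling stage all see the same input; their outputs are summed and averaged (the multi-receptive-field fusion). A framework-free sketch with stand-in blocks in place of the real `ResidualBlock`s:

```python
import numpy as np

# Stand-in for the num_blocks residual blocks at one resolution; the real
# blocks differ only in kernel size / dilations, here they just scale.
num_blocks = 3
blocks = [lambda x, k=k: x * k for k in (1.0, 2.0, 3.0)]

c = np.ones((1, 4, 8))  # stand-in feature map (B, C, T)

# Same accumulation pattern as HiFiGANGenerator.forward.
cs = 0.0
for j in range(num_blocks):
    cs = cs + blocks[j](c)
c = cs / num_blocks

print(c[0, 0, 0])  # (1 + 2 + 3) / 3 = 2.0
```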
class HiFiGANPeriodDiscriminator(nn.Layer):
    """HiFiGAN period discriminator module."""

    def __init__(
            self,
            in_channels: int=1,
            out_channels: int=1,
            period: int=3,
            kernel_sizes: List[int]=[5, 3],
            channels: int=32,
            downsample_scales: List[int]=[3, 3, 3, 3, 1],
            max_downsample_channels: int=1024,
            bias: bool=True,
            nonlinear_activation: str="leakyrelu",
            nonlinear_activation_params: Dict[str, Any]={"negative_slope": 0.1},
            use_weight_norm: bool=True,
            use_spectral_norm: bool=False,
            init_type: str="xavier_uniform", ):
        """Initialize HiFiGANPeriodDiscriminator module.
        Parameters
        ----------
        in_channels : int
            Number of input channels.
        out_channels : int
            Number of output channels.
        period : int
            Period.
        kernel_sizes : list
            Kernel sizes of initial conv layers and the final conv layer.
        channels : int
            Number of initial channels.
        downsample_scales : list
            List of downsampling scales.
        max_downsample_channels : int
            Number of maximum downsampling channels.
        bias : bool
            Whether to add bias parameter in convolution layers.
        nonlinear_activation : str
            Activation function module name.
        nonlinear_activation_params : dict
            Hyperparameters for activation function.
        use_weight_norm : bool
            Whether to use weight norm.
            If set to true, it will be applied to all of the conv layers.
        use_spectral_norm : bool
            Whether to use spectral norm.
            If set to true, it will be applied to all of the conv layers.
        """
        super().__init__()

        # initialize parameters
        initialize(self, init_type)

        assert len(kernel_sizes) == 2
        assert kernel_sizes[0] % 2 == 1, "Kernel size must be odd number."
        assert kernel_sizes[1] % 2 == 1, "Kernel size must be odd number."

        self.period = period
        self.convs = nn.LayerList()
        in_chs = in_channels
        out_chs = channels
        for downsample_scale in downsample_scales:
            self.convs.append(
                nn.Sequential(
                    nn.Conv2D(
                        in_chs,
                        out_chs,
                        (kernel_sizes[0], 1),
                        (downsample_scale, 1),
                        padding=((kernel_sizes[0] - 1) // 2, 0), ),
                    get_activation(nonlinear_activation, **
                                   nonlinear_activation_params), ))
            in_chs = out_chs
            # NOTE: Use downsample_scale + 1?
            out_chs = min(out_chs * 4, max_downsample_channels)
        self.output_conv = nn.Conv2D(
            out_chs,
            out_channels,
            (kernel_sizes[1] - 1, 1),
            1,
            padding=((kernel_sizes[1] - 1) // 2, 0), )

        if use_weight_norm and use_spectral_norm:
            raise ValueError("Either use use_weight_norm or use_spectral_norm.")

        # apply weight norm
        if use_weight_norm:
            self.apply_weight_norm()

        # apply spectral norm
        if use_spectral_norm:
            self.apply_spectral_norm()

    def forward(self, x):
        """Calculate forward propagation.
        Parameters
        ----------
        x : Tensor
            Input tensor (B, in_channels, T).
        Returns
        ----------
        list
            List of each layer's tensors.
        """
        # transform 1d to 2d -> (B, C, T/P, P)
        b, c, t = paddle.shape(x)
        if t % self.period != 0:
            n_pad = self.period - (t % self.period)
            x = F.pad(x, (0, n_pad), "reflect", data_format="NCL")
            t += n_pad
        x = x.reshape([b, c, t // self.period, self.period])

        # forward conv
        outs = []
        for layer in self.convs:
            x = layer(x)
            outs += [x]
        x = self.output_conv(x)
        x = paddle.flatten(x, 1, -1)
        outs += [x]

        return outs

    def apply_weight_norm(self):
        """Recursively apply weight normalization to all the Convolution layers
        in the sublayers.
        """

        def _apply_weight_norm(layer):
            if isinstance(layer, (nn.Conv1D, nn.Conv2D, nn.Conv1DTranspose)):
                nn.utils.weight_norm(layer)

        self.apply(_apply_weight_norm)

    def apply_spectral_norm(self):
        """Recursively apply spectral normalization to all the Conv2D layers
        in the sublayers.
        """

        def _apply_spectral_norm(m):
            if isinstance(m, nn.Conv2D):
                nn.utils.spectral_norm(m)

        self.apply(_apply_spectral_norm)
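The period discriminator folds the 1D waveform into a 2D map so that the 2D convs see one column per period. A NumPy sketch of the pad-and-reshape step in `forward` (reflect padding, as above; shapes are illustrative):

```python
import numpy as np

period = 3
x = np.arange(10, dtype=np.float32).reshape(1, 1, 10)  # (B, C, T) with T=10

# Pad T up to a multiple of `period`, then fold into (B, C, T/P, P).
b, c, t = x.shape
if t % period != 0:
    n_pad = period - (t % period)
    x = np.pad(x, ((0, 0), (0, 0), (0, n_pad)), mode='reflect')
    t += n_pad
x = x.reshape(b, c, t // period, period)

print(x.shape)  # (1, 1, 4, 3)
```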
class HiFiGANMultiPeriodDiscriminator(nn.Layer):
    """HiFiGAN multi-period discriminator module."""

    def __init__(
            self,
            periods: List[int]=[2, 3, 5, 7, 11],
            discriminator_params: Dict[str, Any]={
                "in_channels": 1,
                "out_channels": 1,
                "kernel_sizes": [5, 3],
                "channels": 32,
                "downsample_scales": [3, 3, 3, 3, 1],
                "max_downsample_channels": 1024,
                "bias": True,
                "nonlinear_activation": "leakyrelu",
                "nonlinear_activation_params": {
                    "negative_slope": 0.1
                },
                "use_weight_norm": True,
                "use_spectral_norm": False,
            },
            init_type: str="xavier_uniform", ):
        """Initialize HiFiGANMultiPeriodDiscriminator module.
        Parameters
        ----------
        periods : list
            List of periods.
        discriminator_params : dict
            Parameters for hifi-gan period discriminator module.
            The period parameter will be overwritten.
        """
        super().__init__()
        # initialize parameters
        initialize(self, init_type)

        self.discriminators = nn.LayerList()
        for period in periods:
            params = copy.deepcopy(discriminator_params)
            params["period"] = period
            self.discriminators.append(HiFiGANPeriodDiscriminator(**params))

    def forward(self, x):
        """Calculate forward propagation.
        Parameters
        ----------
        x : Tensor
            Input noise signal (B, 1, T).
        Returns
        ----------
        List
            List of list of each discriminator outputs, which consists of each layer output tensors.
        """
        outs = []
        for f in self.discriminators:
            outs += [f(x)]

        return outs
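The multi-period wrapper builds one discriminator per period from a shared config, deep-copying it so that only `"period"` differs between instances and the shared dict stays untouched. A sketch with a trimmed-down config dict:

```python
import copy

# Trimmed-down version of discriminator_params for illustration.
discriminator_params = {"kernel_sizes": [5, 3], "channels": 32}
periods = [2, 3, 5, 7, 11]

# Same pattern as HiFiGANMultiPeriodDiscriminator.__init__: deep-copy the
# shared config and overwrite only the period.
configs = []
for period in periods:
    params = copy.deepcopy(discriminator_params)
    params["period"] = period
    configs.append(params)

print([p["period"] for p in configs])  # [2, 3, 5, 7, 11]
```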
class HiFiGANScaleDiscriminator(nn.Layer):
    """HiFi-GAN scale discriminator module."""

    def __init__(
            self,
            in_channels: int=1,
            out_channels: int=1,
            kernel_sizes: List[int]=[15, 41, 5, 3],
            channels: int=128,
            max_downsample_channels: int=1024,
            max_groups: int=16,
            bias: bool=True,
            downsample_scales: List[int]=[2, 2, 4, 4, 1],
            nonlinear_activation: str="leakyrelu",
            nonlinear_activation_params: Dict[str, Any]={"negative_slope": 0.1},
            use_weight_norm: bool=True,
            use_spectral_norm: bool=False,
            init_type: str="xavier_uniform", ):
        """Initialize HiFiGAN scale discriminator module.
        Parameters
        ----------
        in_channels : int
            Number of input channels.
        out_channels : int
            Number of output channels.
        kernel_sizes : list
            List of four kernel sizes. The first is used for the first conv layer,
            the second for the downsampling layers, and the remaining two for the output layers.
        channels : int
            Initial number of channels for conv layer.
        max_downsample_channels : int
            Maximum number of channels for downsampling layers.
        max_groups : int
            Maximum number of groups in downsampling conv layers.
        bias : bool
            Whether to add bias parameter in convolution layers.
        downsample_scales : list
            List of downsampling scales.
        nonlinear_activation : str
            Activation function module name.
        nonlinear_activation_params : dict
            Hyperparameters for activation function.
        use_weight_norm : bool
            Whether to use weight norm.
            If set to true, it will be applied to all of the conv layers.
        use_spectral_norm : bool
            Whether to use spectral norm.
            If set to true, it will be applied to all of the conv layers.
        """
        super().__init__()

        # initialize parameters
        initialize(self, init_type)

        self.layers = nn.LayerList()

        # check kernel size is valid
        assert len(kernel_sizes) == 4
        for ks in kernel_sizes:
            assert ks % 2 == 1

        # add first layer
        self.layers.append(
            nn.Sequential(
                nn.Conv1D(
                    in_channels,
                    channels,
                    # NOTE: Use always the same kernel size
                    kernel_sizes[0],
                    bias_attr=bias,
                    padding=(kernel_sizes[0] - 1) // 2, ),
                get_activation(nonlinear_activation, **
                               nonlinear_activation_params), ))

        # add downsample layers
        in_chs = channels
        out_chs = channels
        # NOTE(kan-bayashi): Remove hard coding?
        groups = 4
        for downsample_scale in downsample_scales:
            self.layers.append(
                nn.Sequential(
                    nn.Conv1D(
                        in_chs,
                        out_chs,
                        kernel_size=kernel_sizes[1],
                        stride=downsample_scale,
                        padding=(kernel_sizes[1] - 1) // 2,
                        groups=groups,
                        bias_attr=bias, ),
                    get_activation(nonlinear_activation, **
                                   nonlinear_activation_params), ))
            in_chs = out_chs
            # NOTE: Remove hard coding?
            out_chs = min(in_chs * 2, max_downsample_channels)
            # NOTE: Remove hard coding?
            groups = min(groups * 4, max_groups)

        # add final layers
        out_chs = min(in_chs * 2, max_downsample_channels)
        self.layers.append(
            nn.Sequential(
                nn.Conv1D(
                    in_chs,
                    out_chs,
                    kernel_size=kernel_sizes[2],
                    stride=1,
                    padding=(kernel_sizes[2] - 1) // 2,
                    bias_attr=bias, ),
                get_activation(nonlinear_activation, **
                               nonlinear_activation_params), ))
        self.layers.append(
            nn.Conv1D(
                out_chs,
                out_channels,
                kernel_size=kernel_sizes[3],
                stride=1,
                padding=(kernel_sizes[3] - 1) // 2,
                bias_attr=bias, ), )

        if use_weight_norm and use_spectral_norm:
            raise ValueError("Either use use_weight_norm or use_spectral_norm.")

        # apply weight norm
        if use_weight_norm:
            self.apply_weight_norm()

        # apply spectral norm
        if use_spectral_norm:
            self.apply_spectral_norm()

    def forward(self, x):
        """Calculate forward propagation.
        Parameters
        ----------
        x : Tensor
            Input noise signal (B, 1, T).
        Returns
        ----------
        List
            List of output tensors of each layer.
        """
        outs = []
        for f in self.layers:
            x = f(x)
            outs += [x]

        return outs

    def apply_weight_norm(self):
        """Recursively apply weight normalization to all the Convolution layers
        in the sublayers.
        """

        def _apply_weight_norm(layer):
            if isinstance(layer, (nn.Conv1D, nn.Conv2D, nn.Conv1DTranspose)):
                nn.utils.weight_norm(layer)

        self.apply(_apply_weight_norm)

    def apply_spectral_norm(self):
        """Recursively apply spectral normalization to all the Conv2D layers
        in the sublayers.
        """

        def _apply_spectral_norm(m):
            if isinstance(m, nn.Conv2D):
                nn.utils.spectral_norm(m)

        self.apply(_apply_spectral_norm)
class HiFiGANMultiScaleDiscriminator(nn.Layer):
    """HiFi-GAN multi-scale discriminator module."""

    def __init__(
            self,
            scales: int=3,
            downsample_pooling: str="AvgPool1D",
            # follow the official implementation setting
            downsample_pooling_params: Dict[str, Any]={
                "kernel_size": 4,
                "stride": 2,
                "padding": 2,
            },
            discriminator_params: Dict[str, Any]={
                "in_channels": 1,
                "out_channels": 1,
                "kernel_sizes": [15, 41, 5, 3],
                "channels": 128,
                "max_downsample_channels": 1024,
                "max_groups": 16,
                "bias": True,
                "downsample_scales": [2, 2, 4, 4, 1],
                "nonlinear_activation": "leakyrelu",
                "nonlinear_activation_params": {
                    "negative_slope": 0.1
                },
            },
            follow_official_norm: bool=False,
            init_type: str="xavier_uniform", ):
        """Initialize HiFiGAN multi-scale discriminator module.
        Parameters
        ----------
        scales : int
            Number of multi-scales.
        downsample_pooling : str
            Pooling module name for downsampling of the inputs.
        downsample_pooling_params : dict
            Parameters for the above pooling module.
        discriminator_params : dict
            Parameters for hifi-gan scale discriminator module.
        follow_official_norm : bool
            Whether to follow the norm setting of the official
            implementation. The first discriminator uses spectral norm
            and the other discriminators use weight norm.
        """
        super().__init__()

        # initialize parameters
        initialize(self, init_type)

        self.discriminators = nn.LayerList()

        # add discriminators
        for i in range(scales):
            params = copy.deepcopy(discriminator_params)
            if follow_official_norm:
                if i == 0:
                    params["use_weight_norm"] = False
                    params["use_spectral_norm"] = True
                else:
                    params["use_weight_norm"] = True
                    params["use_spectral_norm"] = False
            self.discriminators.append(HiFiGANScaleDiscriminator(**params))
        self.pooling = getattr(nn, downsample_pooling)(
            **downsample_pooling_params)

    def forward(self, x):
        """Calculate forward propagation.
        Parameters
        ----------
        x : Tensor
            Input noise signal (B, 1, T).
        Returns
        ----------
        List
            List of list of each discriminator outputs, which consists of each layer output tensors.
        """
        outs = []
        for f in self.discriminators:
            outs += [f(x)]
            x = self.pooling(x)

        return outs
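Each scale discriminator after the first sees the input pooled once more, so the three scales judge the waveform at progressively lower resolutions. A NumPy sketch of how the default pooling settings (kernel 4, stride 2, padding 2) shrink the time axis across the scales; this simplified pooling includes the zero padding in each average, which real pooling layers may handle differently:

```python
import numpy as np

def avg_pool1d(x, kernel_size=4, stride=2, padding=2):
    # Simplified 1D average pooling over the last axis (includes padding).
    x = np.pad(x, ((0, 0), (0, 0), (padding, padding)))
    t_out = (x.shape[-1] - kernel_size) // stride + 1
    return np.stack(
        [x[..., i * stride:i * stride + kernel_size].mean(-1)
         for i in range(t_out)], axis=-1)

x = np.ones((1, 1, 16), dtype=np.float32)  # stand-in waveform (B, 1, T)
lengths = []
for _ in range(3):  # scales = 3
    lengths.append(x.shape[-1])
    x = avg_pool1d(x)

print(lengths)  # [16, 9, 5]
```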
class HiFiGANMultiScaleMultiPeriodDiscriminator(nn.Layer):
    """HiFi-GAN multi-scale + multi-period discriminator module."""

    def __init__(
            self,
            # Multi-scale discriminator related
            scales: int=3,
            scale_downsample_pooling: str="AvgPool1D",
            scale_downsample_pooling_params: Dict[str, Any]={
                "kernel_size": 4,
                "stride": 2,
                "padding": 2,
            },
            scale_discriminator_params: Dict[str, Any]={
                "in_channels": 1,
                "out_channels": 1,
                "kernel_sizes": [15, 41, 5, 3],
                "channels": 128,
                "max_downsample_channels": 1024,
                "max_groups": 16,
                "bias": True,
                "downsample_scales": [2, 2, 4, 4, 1],
                "nonlinear_activation": "leakyrelu",
                "nonlinear_activation_params": {
                    "negative_slope": 0.1
                },
            },
            follow_official_norm: bool=True,
            # Multi-period discriminator related
            periods: List[int]=[2, 3, 5, 7, 11],
            period_discriminator_params: Dict[str, Any]={
                "in_channels": 1,
                "out_channels": 1,
                "kernel_sizes": [5, 3],
                "channels": 32,
                "downsample_scales": [3, 3, 3, 3, 1],
                "max_downsample_channels": 1024,
                "bias": True,
                "nonlinear_activation": "leakyrelu",
                "nonlinear_activation_params": {
                    "negative_slope": 0.1
                },
                "use_weight_norm": True,
                "use_spectral_norm": False,
            },
            init_type: str="xavier_uniform", ):
        """Initialize HiFiGAN multi-scale + multi-period discriminator module.
        Parameters
        ----------
        scales : int
            Number of multi-scales.
        scale_downsample_pooling : str
            Pooling module name for downsampling of the inputs.
        scale_downsample_pooling_params : dict
            Parameters for the above pooling module.
        scale_discriminator_params : dict
            Parameters for hifi-gan scale discriminator module.
        follow_official_norm : bool
            Whether to follow the norm setting of the official
            implementation. The first discriminator uses spectral norm
            and the other discriminators use weight norm.
        periods : list
            List of periods.
        period_discriminator_params : dict
            Parameters for hifi-gan period discriminator module.
            The period parameter will be overwritten.
        """
        super().__init__()

        # initialize parameters
        initialize(self, init_type)

        self.msd = HiFiGANMultiScaleDiscriminator(
            scales=scales,
            downsample_pooling=scale_downsample_pooling,
            downsample_pooling_params=scale_downsample_pooling_params,
            discriminator_params=scale_discriminator_params,
            follow_official_norm=follow_official_norm, )
        self.mpd = HiFiGANMultiPeriodDiscriminator(
            periods=periods,
            discriminator_params=period_discriminator_params, )

    def forward(self, x):
        """Calculate forward propagation.
        Parameters
||||||
|
----------
|
||||||
|
x : Tensor
|
||||||
|
Input noise signal (B, 1, T).
|
||||||
|
Returns
|
||||||
|
----------
|
||||||
|
List:
|
||||||
|
List of list of each discriminator outputs,
|
||||||
|
which consists of each layer output tensors.
|
||||||
|
Multi scale and multi period ones are concatenated.
|
||||||
|
"""
|
||||||
|
msd_outs = self.msd(x)
|
||||||
|
mpd_outs = self.mpd(x)
|
||||||
|
return msd_outs + mpd_outs
|
||||||
|
|
||||||
|
|
||||||
|
class HiFiGANInference(nn.Layer):
|
||||||
|
def __init__(self, normalizer, hifigan_generator):
|
||||||
|
super().__init__()
|
||||||
|
self.normalizer = normalizer
|
||||||
|
self.hifigan_generator = hifigan_generator
|
||||||
|
|
||||||
|
def forward(self, logmel):
|
||||||
|
normalized_mel = self.normalizer(logmel)
|
||||||
|
wav = self.hifigan_generator.inference(normalized_mel)
|
||||||
|
return wav
|
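The combined discriminator's forward pass is plain list concatenation: one output list per scale from the MSD plus one per period from the MPD. A minimal pure-Python sketch of that bookkeeping, using stand-in strings for the per-layer tensor lists (defaults `scales=3`, `periods=[2, 3, 5, 7, 11]`):

```python
# With the default config the combined discriminator returns
# scales + len(periods) = 3 + 5 = 8 output lists; forward() computes
# msd_outs + mpd_outs, i.e. ordinary Python list concatenation.
scales = 3
periods = [2, 3, 5, 7, 11]

msd_outs = [f"msd_{i}" for i in range(scales)]  # stand-ins for per-layer tensor lists
mpd_outs = [f"mpd_{p}" for p in periods]
outs = msd_outs + mpd_outs

print(len(outs))  # 8
```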
@ -0,0 +1,247 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import logging
from typing import Dict

import paddle
from paddle import distributed as dist
from paddle.io import DataLoader
from paddle.nn import Layer
from paddle.optimizer import Optimizer
from paddle.optimizer.lr import LRScheduler

from paddlespeech.t2s.training.extensions.evaluator import StandardEvaluator
from paddlespeech.t2s.training.reporter import report
from paddlespeech.t2s.training.updaters.standard_updater import StandardUpdater
from paddlespeech.t2s.training.updaters.standard_updater import UpdaterState

logging.basicConfig(
    format='%(asctime)s [%(levelname)s] [%(filename)s:%(lineno)d] %(message)s',
    datefmt='[%Y-%m-%d %H:%M:%S]')
logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)


class HiFiGANUpdater(StandardUpdater):
    def __init__(self,
                 models: Dict[str, Layer],
                 optimizers: Dict[str, Optimizer],
                 criterions: Dict[str, Layer],
                 schedulers: Dict[str, LRScheduler],
                 dataloader: DataLoader,
                 generator_train_start_steps: int=0,
                 discriminator_train_start_steps: int=100000,
                 lambda_adv: float=1.0,
                 lambda_aux: float=1.0,
                 lambda_feat_match: float=1.0,
                 output_dir=None):
        self.models = models
        self.generator: Layer = models['generator']
        self.discriminator: Layer = models['discriminator']

        self.optimizers = optimizers
        self.optimizer_g: Optimizer = optimizers['generator']
        self.optimizer_d: Optimizer = optimizers['discriminator']

        self.criterions = criterions
        self.criterion_feat_match = criterions['feat_match']
        self.criterion_mel = criterions['mel']
        self.criterion_gen_adv = criterions["gen_adv"]
        self.criterion_dis_adv = criterions["dis_adv"]

        self.schedulers = schedulers
        self.scheduler_g = schedulers['generator']
        self.scheduler_d = schedulers['discriminator']

        self.dataloader = dataloader

        self.generator_train_start_steps = generator_train_start_steps
        self.discriminator_train_start_steps = discriminator_train_start_steps
        self.lambda_adv = lambda_adv
        self.lambda_aux = lambda_aux
        self.lambda_feat_match = lambda_feat_match

        self.state = UpdaterState(iteration=0, epoch=0)
        self.train_iterator = iter(self.dataloader)

        log_file = output_dir / 'worker_{}.log'.format(dist.get_rank())
        self.filehandler = logging.FileHandler(str(log_file))
        logger.addHandler(self.filehandler)
        self.logger = logger
        self.msg = ""

    def update_core(self, batch):
        self.msg = "Rank: {}, ".format(dist.get_rank())
        losses_dict = {}
        # parse batch
        wav, mel = batch

        # Generator
        if self.state.iteration > self.generator_train_start_steps:
            # (B, out_channels, T * prod(upsample_scales))
            wav_ = self.generator(mel)

            # initialize
            gen_loss = 0.0
            aux_loss = 0.0

            # mel spectrogram loss
            mel_loss = self.criterion_mel(wav_, wav)
            aux_loss += mel_loss
            report("train/mel_loss", float(mel_loss))
            losses_dict["mel_loss"] = float(mel_loss)

            gen_loss += aux_loss * self.lambda_aux

            # adversarial loss
            if self.state.iteration > self.discriminator_train_start_steps:
                p_ = self.discriminator(wav_)
                adv_loss = self.criterion_gen_adv(p_)
                report("train/adversarial_loss", float(adv_loss))
                losses_dict["adversarial_loss"] = float(adv_loss)

                # feature matching loss
                # no need to track gradients
                with paddle.no_grad():
                    p = self.discriminator(wav)
                fm_loss = self.criterion_feat_match(p_, p)
                report("train/feature_matching_loss", float(fm_loss))
                losses_dict["feature_matching_loss"] = float(fm_loss)

                adv_loss += self.lambda_feat_match * fm_loss

                gen_loss += self.lambda_adv * adv_loss

            report("train/generator_loss", float(gen_loss))
            losses_dict["generator_loss"] = float(gen_loss)

            self.optimizer_g.clear_grad()
            gen_loss.backward()

            self.optimizer_g.step()
            self.scheduler_g.step()

        # Discriminator
        if self.state.iteration > self.discriminator_train_start_steps:
            # re-compute wav_, which leads to better quality
            with paddle.no_grad():
                wav_ = self.generator(mel)

            p = self.discriminator(wav)
            p_ = self.discriminator(wav_.detach())
            real_loss, fake_loss = self.criterion_dis_adv(p_, p)
            dis_loss = real_loss + fake_loss
            report("train/real_loss", float(real_loss))
            report("train/fake_loss", float(fake_loss))
            report("train/discriminator_loss", float(dis_loss))
            losses_dict["real_loss"] = float(real_loss)
            losses_dict["fake_loss"] = float(fake_loss)
            losses_dict["discriminator_loss"] = float(dis_loss)

            self.optimizer_d.clear_grad()
            dis_loss.backward()

            self.optimizer_d.step()
            self.scheduler_d.step()

        self.msg += ', '.join('{}: {:>.6f}'.format(k, v)
                              for k, v in losses_dict.items())


class HiFiGANEvaluator(StandardEvaluator):
    def __init__(self,
                 models: Dict[str, Layer],
                 criterions: Dict[str, Layer],
                 dataloader: DataLoader,
                 lambda_adv: float=1.0,
                 lambda_aux: float=1.0,
                 lambda_feat_match: float=1.0,
                 output_dir=None):
        self.models = models
        self.generator = models['generator']
        self.discriminator = models['discriminator']

        self.criterions = criterions
        self.criterion_feat_match = criterions['feat_match']
        self.criterion_mel = criterions['mel']
        self.criterion_gen_adv = criterions["gen_adv"]
        self.criterion_dis_adv = criterions["dis_adv"]

        self.dataloader = dataloader

        self.lambda_adv = lambda_adv
        self.lambda_aux = lambda_aux
        self.lambda_feat_match = lambda_feat_match

        log_file = output_dir / 'worker_{}.log'.format(dist.get_rank())
        self.filehandler = logging.FileHandler(str(log_file))
        logger.addHandler(self.filehandler)
        self.logger = logger
        self.msg = ""

    def evaluate_core(self, batch):
        # logging.debug("Evaluate: ")
        self.msg = "Evaluate: "
        losses_dict = {}
        wav, mel = batch

        # Generator
        # (B, out_channels, T * prod(upsample_scales))
        wav_ = self.generator(mel)

        # initialize
        gen_loss = 0.0
        aux_loss = 0.0

        # adversarial loss
        p_ = self.discriminator(wav_)
        adv_loss = self.criterion_gen_adv(p_)
        report("eval/adversarial_loss", float(adv_loss))
        losses_dict["adversarial_loss"] = float(adv_loss)

        # feature matching loss
        p = self.discriminator(wav)
        fm_loss = self.criterion_feat_match(p_, p)
        report("eval/feature_matching_loss", float(fm_loss))
        losses_dict["feature_matching_loss"] = float(fm_loss)
        adv_loss += self.lambda_feat_match * fm_loss

        gen_loss += self.lambda_adv * adv_loss

        # mel spectrogram loss
        mel_loss = self.criterion_mel(wav_, wav)
        aux_loss += mel_loss
        report("eval/mel_loss", float(mel_loss))
        losses_dict["mel_loss"] = float(mel_loss)

        gen_loss += aux_loss * self.lambda_aux

        report("eval/generator_loss", float(gen_loss))
        losses_dict["generator_loss"] = float(gen_loss)

        # Discriminator
        p = self.discriminator(wav)
        real_loss, fake_loss = self.criterion_dis_adv(p_, p)
        dis_loss = real_loss + fake_loss
        report("eval/real_loss", float(real_loss))
        report("eval/fake_loss", float(fake_loss))
        report("eval/discriminator_loss", float(dis_loss))

        losses_dict["real_loss"] = float(real_loss)
        losses_dict["fake_loss"] = float(fake_loss)
        losses_dict["discriminator_loss"] = float(dis_loss)

        self.msg += ', '.join('{}: {:>.6f}'.format(k, v)
                              for k, v in losses_dict.items())
        self.logger.info(self.msg)
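The updater above implements a two-phase schedule: until `discriminator_train_start_steps` only the mel (aux) loss trains the generator; past it, adversarial and feature-matching losses are added and the discriminator begins updating. A small sketch of which losses are active at a given iteration, assuming the default thresholds:

```python
# Sketch of the updater's two-phase training schedule (default thresholds).
def active_losses(iteration,
                  generator_train_start_steps=0,
                  discriminator_train_start_steps=100000):
    losses = []
    if iteration > generator_train_start_steps:
        losses.append("mel")  # aux loss always trains the generator
        if iteration > discriminator_train_start_steps:
            # adversarial + feature-matching losses kick in later
            losses.extend(["adversarial", "feature_matching"])
    if iteration > discriminator_train_start_steps:
        losses.append("discriminator")
    return losses

print(active_losses(50000))   # ['mel']
print(active_losses(150000))  # ['mel', 'adversarial', 'feature_matching', 'discriminator']
```

Warming up the generator on the mel loss alone before the adversarial game starts is the usual stabilization trick for GAN vocoders.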
@ -0,0 +1,207 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import math
from typing import Any
from typing import Dict
from typing import List

import paddle
from paddle import nn
from paddle.nn import functional as F

from paddlespeech.t2s.modules.activation import get_activation


class WaveNetResidualBlock(nn.Layer):
    """A gated activation unit composed of a 1D convolution, a gated tanh
    unit and parametric residual and skip connections. For more details,
    refer to `WaveNet: A Generative Model for Raw Audio <https://arxiv.org/abs/1609.03499>`_.

    Parameters
    ----------
    kernel_size : int, optional
        Kernel size of the 1D convolution, by default 3
    residual_channels : int, optional
        Feature size of the residual output (and also the input), by default 64
    gate_channels : int, optional
        Output feature size of the 1D convolution, by default 128
    skip_channels : int, optional
        Feature size of the skip output, by default 64
    aux_channels : int, optional
        Feature size of the auxiliary input (e.g. spectrogram), by default 80
    dropout : float, optional
        Probability of the dropout before the 1D convolution, by default 0.
    dilation : int, optional
        Dilation of the 1D convolution, by default 1
    bias : bool, optional
        Whether to use bias in the 1D convolution, by default True
    use_causal_conv : bool, optional
        Whether to use causal padding for the 1D convolution, by default False
    """

    def __init__(self,
                 kernel_size: int=3,
                 residual_channels: int=64,
                 gate_channels: int=128,
                 skip_channels: int=64,
                 aux_channels: int=80,
                 dropout: float=0.,
                 dilation: int=1,
                 bias: bool=True,
                 use_causal_conv: bool=False):
        super().__init__()
        self.dropout = dropout
        if use_causal_conv:
            padding = (kernel_size - 1) * dilation
        else:
            assert kernel_size % 2 == 1
            padding = (kernel_size - 1) // 2 * dilation
        self.use_causal_conv = use_causal_conv

        self.conv = nn.Conv1D(
            residual_channels,
            gate_channels,
            kernel_size,
            padding=padding,
            dilation=dilation,
            bias_attr=bias)
        if aux_channels is not None:
            self.conv1x1_aux = nn.Conv1D(
                aux_channels, gate_channels, kernel_size=1, bias_attr=False)
        else:
            self.conv1x1_aux = None

        gate_out_channels = gate_channels // 2
        self.conv1x1_out = nn.Conv1D(
            gate_out_channels, residual_channels, kernel_size=1, bias_attr=bias)
        self.conv1x1_skip = nn.Conv1D(
            gate_out_channels, skip_channels, kernel_size=1, bias_attr=bias)

    def forward(self, x, c):
        """
        Parameters
        ----------
        x : Tensor
            Shape (N, C_res, T), the input features.
        c : Tensor
            Shape (N, C_aux, T), the auxiliary input.

        Returns
        -------
        res : Tensor
            Shape (N, C_res, T), the residual output, which is used as the
            input of the next ResidualBlock in a stack of ResidualBlocks.
        skip : Tensor
            Shape (N, C_skip, T), the skip output, which is collected among
            each layer in a stack of ResidualBlocks.
        """
        x_input = x
        x = F.dropout(x, self.dropout, training=self.training)
        x = self.conv(x)
        # trim the extra causal padding on the right of the time axis
        x = x[:, :, :x_input.shape[-1]] if self.use_causal_conv else x
        if c is not None:
            c = self.conv1x1_aux(c)
            x += c

        a, b = paddle.chunk(x, 2, axis=1)
        x = paddle.tanh(a) * F.sigmoid(b)

        skip = self.conv1x1_skip(x)
        res = (self.conv1x1_out(x) + x_input) * math.sqrt(0.5)
        return res, skip


class HiFiGANResidualBlock(nn.Layer):
    """Residual block module in HiFiGAN."""

    def __init__(
            self,
            kernel_size: int=3,
            channels: int=512,
            dilations: List[int]=(1, 3, 5),
            bias: bool=True,
            use_additional_convs: bool=True,
            nonlinear_activation: str="leakyrelu",
            nonlinear_activation_params: Dict[str, Any]={"negative_slope": 0.1},
    ):
        """Initialize HiFiGANResidualBlock module.
        Parameters
        ----------
        kernel_size : int
            Kernel size of dilation convolution layer.
        channels : int
            Number of channels for convolution layer.
        dilations : List[int]
            List of dilation factors.
        use_additional_convs : bool
            Whether to use additional convolution layers.
        bias : bool
            Whether to add bias parameter in convolution layers.
        nonlinear_activation : str
            Activation function module name.
        nonlinear_activation_params : dict
            Hyperparameters for activation function.
        """
        super().__init__()

        self.use_additional_convs = use_additional_convs
        self.convs1 = nn.LayerList()
        if use_additional_convs:
            self.convs2 = nn.LayerList()
        assert kernel_size % 2 == 1, "Kernel size must be odd number."

        for dilation in dilations:
            self.convs1.append(
                nn.Sequential(
                    get_activation(nonlinear_activation,
                                   **nonlinear_activation_params),
                    nn.Conv1D(
                        channels,
                        channels,
                        kernel_size,
                        1,
                        dilation=dilation,
                        bias_attr=bias,
                        padding=(kernel_size - 1) // 2 * dilation, ), ))
            if use_additional_convs:
                self.convs2.append(
                    nn.Sequential(
                        get_activation(nonlinear_activation,
                                       **nonlinear_activation_params),
                        nn.Conv1D(
                            channels,
                            channels,
                            kernel_size,
                            1,
                            dilation=1,
                            bias_attr=bias,
                            padding=(kernel_size - 1) // 2, ), ))

    def forward(self, x):
        """Calculate forward propagation.
        Parameters
        ----------
        x : Tensor
            Input tensor (B, channels, T).
        Returns
        ----------
        Tensor
            Output tensor (B, channels, T).
        """
        for idx in range(len(self.convs1)):
            xt = self.convs1[idx](x)
            if self.use_additional_convs:
                xt = self.convs2[idx](xt)
            x = xt + x
        return x
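The residual block's core operations are simple enough to check numerically: the gated tanh unit multiplies a tanh branch by a sigmoid gate, and the padding rules in `__init__` differ between the causal and "same" cases. A scalar sketch in pure Python (no paddle needed):

```python
import math

# Scalar version of the gated activation unit: the gate_channels-wide conv
# output is split in half along channels; one half goes through tanh, the
# other through a sigmoid gate, and the halves are multiplied.
def gated_activation(a, b):
    return math.tanh(a) * (1.0 / (1.0 + math.exp(-b)))

# Padding amounts as computed in WaveNetResidualBlock.__init__:
def pad_amount(kernel_size, dilation, use_causal_conv):
    if use_causal_conv:
        # full receptive field minus one; the extra right side is trimmed in forward()
        return (kernel_size - 1) * dilation
    assert kernel_size % 2 == 1
    # "same" padding: output length equals input length
    return (kernel_size - 1) // 2 * dilation

print(pad_amount(3, 1, False))  # 1
print(pad_amount(3, 4, True))   # 8
```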
@ -0,0 +1,220 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Modified from espnet(https://github.com/espnet/espnet)
from typing import Any
from typing import Dict
from typing import List
from typing import Optional

from paddle import nn
from paddle.nn import functional as F

from paddlespeech.t2s.modules.activation import get_activation


class Stretch2D(nn.Layer):
    def __init__(self, w_scale: int, h_scale: int, mode: str="nearest"):
        """Stretch an image (or image-like object) with some interpolation.

        Parameters
        ----------
        w_scale : int
            Scale factor of the width.
        h_scale : int
            Scale factor of the height.
        mode : str, optional
            Interpolation mode; supported modes are "nearest", "bilinear",
            "trilinear", "bicubic", "linear" and "area", by default "nearest"

            For more details about interpolation, see
            `paddle.nn.functional.interpolate <https://www.paddlepaddle.org.cn/documentation/docs/en/api/paddle/nn/functional/interpolate_en.html>`_.
        """
        super().__init__()
        self.w_scale = w_scale
        self.h_scale = h_scale
        self.mode = mode

    def forward(self, x):
        """
        Parameters
        ----------
        x : Tensor
            Shape (N, C, H, W)

        Returns
        -------
        Tensor
            Shape (N, C, H', W'), where ``H'=h_scale * H``, ``W'=w_scale * W``.
            The stretched image.
        """
        out = F.interpolate(
            x, scale_factor=(self.h_scale, self.w_scale), mode=self.mode)
        return out
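For `mode="nearest"` the stretch is just element repetition along each axis. A pure-Python sketch of a single (H, W) plane, illustrating what `F.interpolate` does in the nearest case (the real layer operates on 4-D tensors):

```python
# Nearest-neighbour stretch of a 2-D list: every element is repeated
# w_scale times along the width and every row h_scale times along the height.
def stretch_nearest(rows, h_scale, w_scale):
    out = []
    for row in rows:
        stretched = [v for v in row for _ in range(w_scale)]
        out.extend([list(stretched) for _ in range(h_scale)])
    return out

print(stretch_nearest([[1, 2]], 2, 3))
# [[1, 1, 1, 2, 2, 2], [1, 1, 1, 2, 2, 2]]
```

In `UpsampleNet` below, `Stretch2D(scale, 1, ...)` uses `h_scale=1`, so only the time (width) axis is stretched.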


class UpsampleNet(nn.Layer):
    """A Layer to upsample spectrogram by applying consecutive stretch and
    convolutions.

    Parameters
    ----------
    upsample_scales : List[int]
        Upsampling factors for each stretch.
    nonlinear_activation : Optional[str], optional
        Activation after each convolution, by default None
    nonlinear_activation_params : Dict[str, Any], optional
        Parameters passed to construct the activation, by default {}
    interpolate_mode : str, optional
        Interpolation mode of the stretch, by default "nearest"
    freq_axis_kernel_size : int, optional
        Convolution kernel size along the frequency axis, by default 1
    use_causal_conv : bool, optional
        Whether to use causal padding before convolution, by default False

        If True, causal padding is used along the time axis, i.e. the padding
        amount is ``receptive field - 1`` before and 0 after, respectively.

        If False, "same" padding is used along the time axis.
    """

    def __init__(self,
                 upsample_scales: List[int],
                 nonlinear_activation: Optional[str]=None,
                 nonlinear_activation_params: Dict[str, Any]={},
                 interpolate_mode: str="nearest",
                 freq_axis_kernel_size: int=1,
                 use_causal_conv: bool=False):
        super().__init__()
        self.use_causal_conv = use_causal_conv
        self.up_layers = nn.LayerList()

        for scale in upsample_scales:
            stretch = Stretch2D(scale, 1, interpolate_mode)
            assert freq_axis_kernel_size % 2 == 1
            freq_axis_padding = (freq_axis_kernel_size - 1) // 2
            kernel_size = (freq_axis_kernel_size, scale * 2 + 1)
            if use_causal_conv:
                padding = (freq_axis_padding, scale * 2)
            else:
                padding = (freq_axis_padding, scale)
            conv = nn.Conv2D(
                1, 1, kernel_size, padding=padding, bias_attr=False)
            self.up_layers.extend([stretch, conv])
            if nonlinear_activation is not None:
                # for compatibility
                nonlinear_activation = nonlinear_activation.lower()

                nonlinear = get_activation(nonlinear_activation,
                                           **nonlinear_activation_params)
                self.up_layers.append(nonlinear)

    def forward(self, c):
        """
        Parameters
        ----------
        c : Tensor
            Shape (N, F, T), spectrogram

        Returns
        -------
        Tensor
            Shape (N, F, T'), where ``T' = upsample_factor * T``, upsampled
            spectrogram
        """
        c = c.unsqueeze(1)
        for f in self.up_layers:
            if self.use_causal_conv and isinstance(f, nn.Conv2D):
                # trim the extra causal padding on the right of the time axis
                c = f(c)[:, :, :, :c.shape[-1]]
            else:
                c = f(c)
        return c.squeeze(1)


class ConvInUpsampleNet(nn.Layer):
    """A Layer to upsample spectrogram composed of a convolution and an
    UpsampleNet.

    Parameters
    ----------
    upsample_scales : List[int]
        Upsampling factors for each stretch.
    nonlinear_activation : Optional[str], optional
        Activation after each convolution, by default None
    nonlinear_activation_params : Dict[str, Any], optional
        Parameters passed to construct the activation, by default {}
    interpolate_mode : str, optional
        Interpolation mode of the stretch, by default "nearest"
    freq_axis_kernel_size : int, optional
        Convolution kernel size along the frequency axis, by default 1
    aux_channels : int, optional
        Feature size of the input, by default 80
    aux_context_window : int, optional
        Context window of the first 1D convolution applied to the input. It
        is related to the kernel size of the convolution, by default 0

        If causal convolution is used, the kernel size is ``window + 1``,
        else the kernel size is ``2 * window + 1``.
    use_causal_conv : bool, optional
        Whether to use causal padding before convolution, by default False

        If True, causal padding is used along the time axis, i.e. the padding
        amount is ``receptive field - 1`` before and 0 after, respectively.

        If False, "same" padding is used along the time axis.
    """

    def __init__(self,
                 upsample_scales: List[int],
                 nonlinear_activation: Optional[str]=None,
                 nonlinear_activation_params: Dict[str, Any]={},
                 interpolate_mode: str="nearest",
                 freq_axis_kernel_size: int=1,
                 aux_channels: int=80,
                 aux_context_window: int=0,
                 use_causal_conv: bool=False):
        super().__init__()
        self.aux_context_window = aux_context_window
        self.use_causal_conv = use_causal_conv and aux_context_window > 0
        kernel_size = aux_context_window + 1 if use_causal_conv else 2 * aux_context_window + 1
        self.conv_in = nn.Conv1D(
            aux_channels,
            aux_channels,
            kernel_size=kernel_size,
            bias_attr=False)
        self.upsample = UpsampleNet(
            upsample_scales=upsample_scales,
            nonlinear_activation=nonlinear_activation,
            nonlinear_activation_params=nonlinear_activation_params,
            interpolate_mode=interpolate_mode,
            freq_axis_kernel_size=freq_axis_kernel_size,
            use_causal_conv=use_causal_conv)

    def forward(self, c):
        """
        Parameters
        ----------
        c : Tensor
            Shape (N, F, T), spectrogram

        Returns
        -------
        Tensor
            Shape (N, F, T'), where ``T' = upsample_factor * T``, upsampled
            spectrogram
        """
        c_ = self.conv_in(c)
        c = c_[:, :, :-self.aux_context_window] if self.use_causal_conv else c_
        return self.upsample(c)
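The shape bookkeeping of these upsampling layers is easy to verify: each `Stretch2D` multiplies the time axis by its scale, so the total upsampling factor is the product of `upsample_scales`, and `conv_in`'s kernel size follows the causal / non-causal rule from the docstring. A small sketch:

```python
from functools import reduce
from operator import mul

# Total time-axis upsampling factor of UpsampleNet: the product of the
# per-stage scales, since each Stretch2D multiplies T by its scale.
def total_upsample_factor(upsample_scales):
    return reduce(mul, upsample_scales, 1)

# Kernel size of ConvInUpsampleNet.conv_in, per the docstring rule.
def conv_in_kernel_size(aux_context_window, use_causal_conv):
    return (aux_context_window + 1 if use_causal_conv
            else 2 * aux_context_window + 1)

print(total_upsample_factor([4, 4, 4, 4]))  # 256, i.e. T' = 256 * T
print(conv_in_kernel_size(2, False))        # 5
```

For a vocoder, the product of `upsample_scales` must equal the hop size of the mel analysis, so one spectrogram frame expands to exactly one hop of waveform samples.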