parent
bc0dd51149
commit
469329221b
@@ -1,125 +0,0 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Modified from espnet(https://github.com/espnet/espnet)
"""Adversarial loss modules."""
import paddle
import paddle.nn.functional as F
from paddle import nn


class GeneratorAdversarialLoss(nn.Layer):
    """Generator adversarial loss module."""

    def __init__(
            self,
            average_by_discriminators=True,
            loss_type="mse", ):
        """Initialize GeneratorAdversarialLoss module."""
        super().__init__()
        self.average_by_discriminators = average_by_discriminators
        assert loss_type in ["mse", "hinge"], f"{loss_type} is not supported."
        if loss_type == "mse":
            self.criterion = self._mse_loss
        else:
            self.criterion = self._hinge_loss

    def forward(self, outputs):
        """Calculate generator adversarial loss.
        Parameters
        ----------
        outputs: Tensor or List
            Discriminator outputs or list of discriminator outputs.
        Returns
        ----------
        Tensor
            Generator adversarial loss value.
        """
        if isinstance(outputs, (tuple, list)):
            adv_loss = 0.0
            for i, outputs_ in enumerate(outputs):
                if isinstance(outputs_, (tuple, list)):
                    # case including feature maps
                    outputs_ = outputs_[-1]
                adv_loss += self.criterion(outputs_)
            if self.average_by_discriminators:
                adv_loss /= i + 1
        else:
            adv_loss = self.criterion(outputs)

        return adv_loss

    def _mse_loss(self, x):
        return F.mse_loss(x, paddle.ones_like(x))

    def _hinge_loss(self, x):
        return -x.mean()


class DiscriminatorAdversarialLoss(nn.Layer):
    """Discriminator adversarial loss module."""

    def __init__(
            self,
            average_by_discriminators=True,
            loss_type="mse", ):
        """Initialize DiscriminatorAdversarialLoss module."""
        super().__init__()
        self.average_by_discriminators = average_by_discriminators
        assert loss_type in ["mse"], f"{loss_type} is not supported."
        if loss_type == "mse":
            self.fake_criterion = self._mse_fake_loss
            self.real_criterion = self._mse_real_loss

    def forward(self, outputs_hat, outputs):
        """Calculate discriminator adversarial loss.
        Parameters
        ----------
        outputs_hat : Tensor or list
            Discriminator outputs or list of
            discriminator outputs calculated from generator outputs.
        outputs : Tensor or list
            Discriminator outputs or list of
            discriminator outputs calculated from groundtruth.
        Returns
        ----------
        Tensor
            Discriminator real loss value.
        Tensor
            Discriminator fake loss value.
        """
        if isinstance(outputs, (tuple, list)):
            real_loss = 0.0
            fake_loss = 0.0
            for i, (outputs_hat_,
                    outputs_) in enumerate(zip(outputs_hat, outputs)):
                if isinstance(outputs_hat_, (tuple, list)):
                    # case including feature maps
                    outputs_hat_ = outputs_hat_[-1]
                    outputs_ = outputs_[-1]
                real_loss += self.real_criterion(outputs_)
                fake_loss += self.fake_criterion(outputs_hat_)
            if self.average_by_discriminators:
                fake_loss /= i + 1
                real_loss /= i + 1
        else:
            real_loss = self.real_criterion(outputs)
            fake_loss = self.fake_criterion(outputs_hat)

        return real_loss, fake_loss

    def _mse_real_loss(self, x):
        return F.mse_loss(x, paddle.ones_like(x))

    def _mse_fake_loss(self, x):
        return F.mse_loss(x, paddle.zeros_like(x))
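Below is a minimal usage sketch, not part of the deleted file: it assumes a multi-scale discriminator that returns one [feature_map, ..., final_score] list per scale, and all shapes and the 3-scale setup are made up for illustration.

import paddle

# Dummy outputs of a 3-scale discriminator; only the last element of each
# inner list (the score map) is used by the loss modules above.
scores_fake = [[paddle.randn([4, 64, 32]), paddle.randn([4, 1, 32])]
               for _ in range(3)]
scores_real = [[paddle.randn([4, 64, 32]), paddle.randn([4, 1, 32])]
               for _ in range(3)]

gen_criterion = GeneratorAdversarialLoss(loss_type="mse")
dis_criterion = DiscriminatorAdversarialLoss(loss_type="mse")

adv_loss = gen_criterion(scores_fake)                           # generator term
real_loss, fake_loss = dis_criterion(scores_fake, scores_real)
dis_loss = real_loss + fake_loss                                # discriminator term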
@@ -1,348 +0,0 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import math

import numpy as np
import paddle
from paddle import nn
from paddle.nn import functional as F


def scaled_dot_product_attention(q, k, v, mask=None, dropout=0.0,
                                 training=True):
    r"""Scaled dot product attention with masking.

    Assume that q, k, v all have the same leading dimensions (denoted as * in
    descriptions below). Dropout is applied to attention weights before
    weighted sum of values.

    Parameters
    -----------
    q : Tensor [shape=(\*, T_q, d)]
        the query tensor.
    k : Tensor [shape=(\*, T_k, d)]
        the key tensor.
    v : Tensor [shape=(\*, T_k, d_v)]
        the value tensor.
    mask : Tensor, [shape=(\*, T_q, T_k) or broadcastable shape], optional
        the mask tensor, zeros correspond to paddings. Defaults to None.

    Returns
    ----------
    out : Tensor [shape=(\*, T_q, d_v)]
        the context vector.
    attn_weights : Tensor [shape=(\*, T_q, T_k)]
        the attention weights.
    """
    d = q.shape[-1]  # we only support imperative execution
    qk = paddle.matmul(q, k, transpose_y=True)
    scaled_logit = paddle.scale(qk, 1.0 / math.sqrt(d))

    if mask is not None:
        scaled_logit += paddle.scale((1.0 - mask), -1e9)  # hard coded here

    attn_weights = F.softmax(scaled_logit, axis=-1)
    attn_weights = F.dropout(attn_weights, dropout, training=training)
    out = paddle.matmul(attn_weights, v)
    return out, attn_weights


def drop_head(x, drop_n_heads, training=True):
    """Drop n context vectors from multiple ones.

    Parameters
    ----------
    x : Tensor [shape=(batch_size, num_heads, time_steps, channels)]
        The input, multiple context vectors.
    drop_n_heads : int [0<= drop_n_heads <= num_heads]
        Number of vectors to drop.
    training : bool
        A flag indicating whether it is in training. If `False`, no dropout is
        applied.

    Returns
    -------
    Tensor
        The output.
    """
    if not training or (drop_n_heads == 0):
        return x

    batch_size, num_heads, _, _ = x.shape
    # drop all heads
    if num_heads == drop_n_heads:
        return paddle.zeros_like(x)

    mask = np.ones([batch_size, num_heads])
    mask[:, :drop_n_heads] = 0
    for subarray in mask:
        np.random.shuffle(subarray)
    scale = float(num_heads) / (num_heads - drop_n_heads)
    mask = scale * np.reshape(mask, [batch_size, num_heads, 1, 1])
    out = x * paddle.to_tensor(mask)
    return out


def _split_heads(x, num_heads):
    batch_size, time_steps, _ = x.shape
    x = paddle.reshape(x, [batch_size, time_steps, num_heads, -1])
    x = paddle.transpose(x, [0, 2, 1, 3])
    return x


def _concat_heads(x):
    batch_size, _, time_steps, _ = x.shape
    x = paddle.transpose(x, [0, 2, 1, 3])
    x = paddle.reshape(x, [batch_size, time_steps, -1])
    return x


# Standard implementations of Monohead Attention & Multihead Attention
class MonoheadAttention(nn.Layer):
    """Monohead Attention module.

    Parameters
    ----------
    model_dim : int
        Feature size of the query.
    dropout : float, optional
        Dropout probability of scaled dot product attention and final context
        vector. Defaults to 0.0.
    k_dim : int, optional
        Feature size of the key of the scaled dot product attention. If not
        provided, it is set to `model_dim`. Defaults to None.
    v_dim : int, optional
        Feature size of the value of the scaled dot product attention. If not
        provided, it is set to `model_dim`. Defaults to None.
    """

    def __init__(self,
                 model_dim: int,
                 dropout: float=0.0,
                 k_dim: int=None,
                 v_dim: int=None):
        super(MonoheadAttention, self).__init__()
        k_dim = k_dim or model_dim
        v_dim = v_dim or model_dim
        self.affine_q = nn.Linear(model_dim, k_dim)
        self.affine_k = nn.Linear(model_dim, k_dim)
        self.affine_v = nn.Linear(model_dim, v_dim)
        self.affine_o = nn.Linear(v_dim, model_dim)

        self.model_dim = model_dim
        self.dropout = dropout

    def forward(self, q, k, v, mask):
        """Compute context vector and attention weights.

        Parameters
        -----------
        q : Tensor [shape=(batch_size, time_steps_q, model_dim)]
            The queries.
        k : Tensor [shape=(batch_size, time_steps_k, model_dim)]
            The keys.
        v : Tensor [shape=(batch_size, time_steps_k, model_dim)]
            The values.
        mask : Tensor [shape=(batch_size, time_steps_q, time_steps_k)] or broadcastable shape
            The mask.

        Returns
        ----------
        out : Tensor [shape=(batch_size, time_steps_q, model_dim)]
            The context vector.
        attention_weights : Tensor [shape=(batch_size, time_steps_q, time_steps_k)]
            The attention weights.
        """
        q = self.affine_q(q)  # (B, T, C)
        k = self.affine_k(k)
        v = self.affine_v(v)

        context_vectors, attention_weights = scaled_dot_product_attention(
            q, k, v, mask, self.dropout, self.training)

        out = self.affine_o(context_vectors)
        return out, attention_weights


class MultiheadAttention(nn.Layer):
    """Multihead Attention module.

    Parameters
    -----------
    model_dim: int
        The feature size of query.
    num_heads : int
        The number of attention heads.
    dropout : float, optional
        Dropout probability of scaled dot product attention and final context
        vector. Defaults to 0.0.
    k_dim : int, optional
        Feature size of the key of each scaled dot product attention. If not
        provided, it is set to ``model_dim / num_heads``. Defaults to None.
    v_dim : int, optional
        Feature size of the value of each scaled dot product attention. If not
        provided, it is set to ``model_dim / num_heads``. Defaults to None.

    Raises
    ---------
    ValueError
        If ``model_dim`` is not divisible by ``num_heads``.
    """

    def __init__(self,
                 model_dim: int,
                 num_heads: int,
                 dropout: float=0.0,
                 k_dim: int=None,
                 v_dim: int=None):
        super(MultiheadAttention, self).__init__()
        if model_dim % num_heads != 0:
            raise ValueError("model_dim must be divisible by num_heads")
        depth = model_dim // num_heads
        k_dim = k_dim or depth
        v_dim = v_dim or depth
        self.affine_q = nn.Linear(model_dim, num_heads * k_dim)
        self.affine_k = nn.Linear(model_dim, num_heads * k_dim)
        self.affine_v = nn.Linear(model_dim, num_heads * v_dim)
        self.affine_o = nn.Linear(num_heads * v_dim, model_dim)

        self.num_heads = num_heads
        self.model_dim = model_dim
        self.dropout = dropout

    def forward(self, q, k, v, mask):
        """Compute context vector and attention weights.

        Parameters
        -----------
        q : Tensor [shape=(batch_size, time_steps_q, model_dim)]
            The queries.
        k : Tensor [shape=(batch_size, time_steps_k, model_dim)]
            The keys.
        v : Tensor [shape=(batch_size, time_steps_k, model_dim)]
            The values.
        mask : Tensor [shape=(batch_size, time_steps_q, time_steps_k)] or broadcastable shape
            The mask.

        Returns
        ----------
        out : Tensor [shape=(batch_size, time_steps_q, model_dim)]
            The context vector.
        attention_weights : Tensor [shape=(batch_size, time_steps_q, time_steps_k)]
            The attention weights.
        """
        q = _split_heads(self.affine_q(q), self.num_heads)  # (B, h, T, C)
        k = _split_heads(self.affine_k(k), self.num_heads)
        v = _split_heads(self.affine_v(v), self.num_heads)
        mask = paddle.unsqueeze(mask, 1)  # unsqueeze for the h dim

        context_vectors, attention_weights = scaled_dot_product_attention(
            q, k, v, mask, self.dropout, self.training)
        # NOTE: there is more sophisticated implementation: Scheduled DropHead
        context_vectors = _concat_heads(context_vectors)  # (B, T, h*C)
        out = self.affine_o(context_vectors)
        return out, attention_weights


class LocationSensitiveAttention(nn.Layer):
    """Location Sensitive Attention module.

    Reference: `Attention-Based Models for Speech Recognition <https://arxiv.org/pdf/1506.07503.pdf>`_

    Parameters
    -----------
    d_query: int
        The feature size of query.
    d_key : int
        The feature size of key.
    d_attention : int
        The feature size of the attention hidden representation.
    location_filters : int
        Number of filters of the attention convolution.
    location_kernel_size : int
        Kernel size of the attention convolution.
    """

    def __init__(self,
                 d_query: int,
                 d_key: int,
                 d_attention: int,
                 location_filters: int,
                 location_kernel_size: int):
        super().__init__()

        self.query_layer = nn.Linear(d_query, d_attention, bias_attr=False)
        self.key_layer = nn.Linear(d_key, d_attention, bias_attr=False)
        self.value = nn.Linear(d_attention, 1, bias_attr=False)

        # Location Layer
        self.location_conv = nn.Conv1D(
            2,
            location_filters,
            kernel_size=location_kernel_size,
            padding=int((location_kernel_size - 1) / 2),
            bias_attr=False,
            data_format='NLC')
        self.location_layer = nn.Linear(
            location_filters, d_attention, bias_attr=False)

    def forward(self,
                query,
                processed_key,
                value,
                attention_weights_cat,
                mask=None):
        """Compute context vector and attention weights.

        Parameters
        -----------
        query : Tensor [shape=(batch_size, d_query)]
            The queries.
        processed_key : Tensor [shape=(batch_size, time_steps_k, d_attention)]
            The keys after linear layer.
        value : Tensor [shape=(batch_size, time_steps_k, d_key)]
            The values.
        attention_weights_cat : Tensor [shape=(batch_size, time_steps_k, 2)]
            Concatenated attention weights (current and cumulative).
        mask : Tensor, optional
            The mask. Shape should be (batch_size, time_steps_k, 1).
            Defaults to None.

        Returns
        ----------
        attention_context : Tensor [shape=(batch_size, d_key)]
            The context vector.
        attention_weights : Tensor [shape=(batch_size, time_steps_k)]
            The attention weights.
        """

        processed_query = self.query_layer(paddle.unsqueeze(query, axis=[1]))
        processed_attention_weights = self.location_layer(
            self.location_conv(attention_weights_cat))
        # (B, T_enc, 1)
        alignment = self.value(
            paddle.tanh(processed_attention_weights + processed_key +
                        processed_query))

        if mask is not None:
            alignment = alignment + (1.0 - mask) * -1e9

        attention_weights = F.softmax(alignment, axis=1)
        attention_context = paddle.matmul(
            attention_weights, value, transpose_x=True)

        attention_weights = paddle.squeeze(attention_weights, axis=-1)
        attention_context = paddle.squeeze(attention_context, axis=1)

        return attention_context, attention_weights
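A small sketch of how the multi-head module above might be called (not part of the commit); the dimensions and the all-ones padding mask are illustrative only.

import paddle

attn = MultiheadAttention(model_dim=256, num_heads=4, dropout=0.1)

q = paddle.randn([2, 10, 256])    # (batch, T_q, model_dim)
kv = paddle.randn([2, 20, 256])   # (batch, T_k, model_dim)
# 1.0 marks valid key positions, 0.0 marks padding; broadcast over queries.
mask = paddle.ones([2, 1, 20])

out, weights = attn(q, kv, kv, mask)
# out: [2, 10, 256]; weights: [2, 4, 10, 20], one attention map per head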
@@ -1,229 +0,0 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import librosa
import numpy as np
import paddle
from librosa.util import pad_center
from paddle import nn
from paddle.nn import functional as F
from scipy import signal

__all__ = ["quantize", "dequantize", "STFT", "MelScale"]


def quantize(values, n_bands):
    """Linearly quantize a float Tensor in [-1, 1) to an integer Tensor in
    [0, n_bands).

    Parameters
    -----------
    values : Tensor [dtype: float32 or float64]
        The floating point value.

    n_bands : int
        The number of bands. The output integer Tensor's value is in the range
        [0, n_bands).

    Returns
    ----------
    Tensor [dtype: int64]
        The quantized tensor.
    """
    quantized = paddle.cast((values + 1.0) / 2.0 * n_bands, "int64")
    return quantized


def dequantize(quantized, n_bands, dtype=None):
    """Linearly dequantize an integer Tensor into a float Tensor in the range
    [-1, 1).

    Parameters
    -----------
    quantized : Tensor [dtype: int]
        The quantized value in the range [0, n_bands).

    n_bands : int
        Number of bands. The input integer Tensor's value is in the range
        [0, n_bands).

    dtype : str, optional
        Data type of the output.

    Returns
    -----------
    Tensor
        The dequantized tensor, dtype is specified by `dtype`. If `dtype` is
        not specified, the default float data type is used.
    """
    dtype = dtype or paddle.get_default_dtype()
    value = (paddle.cast(quantized, dtype) + 0.5) * (2.0 / n_bands) - 1.0
    return value


class STFT(nn.Layer):
    """A module for computing stft transformation in a differentiable way.

    Parameters
    ------------
    n_fft : int
        Number of samples in a frame.
    hop_length : int
        Number of samples shifted between adjacent frames.
    win_length : int
        Length of the window.
    window : str, optional
        Name of window function, see `scipy.signal.get_window` for more
        details. Defaults to "hanning".
    center : bool
        If True, the signal y is padded so that frame D[:, t] is centered
        at y[t * hop_length]. If False, then D[:, t] begins at y[t * hop_length].
        Defaults to True.
    pad_mode : string or function
        If center=True, this argument is passed to np.pad for padding the edges
        of the signal y. By default (pad_mode="reflect"), y is padded on both
        sides with its own reflection, mirrored around its first and last
        sample respectively. If center=False, this argument is ignored.

    Notes
    -----------
    It behaves like ``librosa.core.stft``. See ``librosa.core.stft`` for more
    details.

    Given an audio signal with ``T`` samples, the STFT transformation outputs a
    spectrum with shape (C, frames) and complex dtype, where ``C = 1 + n_fft / 2``
    and ``frames = 1 + T // hop_length``.

    Only ``center`` and ``reflect`` padding are supported now.

    """

    def __init__(self,
                 n_fft,
                 hop_length=None,
                 win_length=None,
                 window="hanning",
                 center=True,
                 pad_mode="reflect"):
        super().__init__()
        # By default, use the entire frame
        if win_length is None:
            win_length = n_fft

        # Set the default hop, if it's not already specified
        if hop_length is None:
            hop_length = int(win_length // 4)

        self.hop_length = hop_length
        self.n_bin = 1 + n_fft // 2
        self.n_fft = n_fft
        self.center = center
        self.pad_mode = pad_mode

        # calculate window
        window = signal.get_window(window, win_length, fftbins=True)

        # pad window to n_fft size
        if n_fft != win_length:
            window = pad_center(window, n_fft, mode="constant")
            # lpad = (n_fft - win_length) // 2
            # rpad = n_fft - win_length - lpad
            # window = np.pad(window, ((lpad, pad), ), 'constant')

        # calculate weights
        # r = np.arange(0, n_fft)
        # M = np.expand_dims(r, -1) * np.expand_dims(r, 0)
        # w_real = np.reshape(window *
        #                     np.cos(2 * np.pi * M / n_fft)[:self.n_bin],
        #                     (self.n_bin, 1, self.n_fft))
        # w_imag = np.reshape(window *
        #                     np.sin(-2 * np.pi * M / n_fft)[:self.n_bin],
        #                     (self.n_bin, 1, self.n_fft))
        weight = np.fft.fft(np.eye(n_fft))[:self.n_bin]
        w_real = weight.real
        w_imag = weight.imag
        w = np.concatenate([w_real, w_imag], axis=0)
        w = w * window
        w = np.expand_dims(w, 1)
        weight = paddle.cast(paddle.to_tensor(w), paddle.get_default_dtype())
        self.register_buffer("weight", weight)

    def forward(self, x):
        """Compute the stft transform.
        Parameters
        ------------
        x : Tensor [shape=(B, T)]
            The input waveform.
        Returns
        ------------
        real : Tensor [shape=(B, C, frames)]
            The real part of the spectrogram.

        imag : Tensor [shape=(B, C, frames)]
            The imaginary part of the spectrogram.
        """
        x = paddle.unsqueeze(x, axis=1)
        if self.center:
            x = F.pad(
                x, [self.n_fft // 2, self.n_fft // 2],
                data_format='NCL',
                mode=self.pad_mode)

        # to BCT, C=1
        out = F.conv1d(x, self.weight, stride=self.hop_length)
        real, imag = paddle.chunk(out, 2, axis=1)  # BCT
        return real, imag

    def power(self, x):
        """Compute the power spectrum.
        Parameters
        ------------
        x : Tensor [shape=(B, T)]
            The input waveform.
        Returns
        ------------
        Tensor [shape=(B, C, T)]
            The power spectrum.
        """
        real, imag = self.forward(x)
        power = real**2 + imag**2
        return power

    def magnitude(self, x):
        """Compute the magnitude of the spectrum.
        Parameters
        ------------
        x : Tensor [shape=(B, T)]
            The input waveform.
        Returns
        ------------
        Tensor [shape=(B, C, T)]
            The magnitude of the spectrum.
        """
        power = self.power(x)
        magnitude = paddle.sqrt(power)  # TODO(chenfeiyu): maybe clipping
        return magnitude


class MelScale(nn.Layer):
    def __init__(self, sr, n_fft, n_mels, fmin, fmax):
        super().__init__()
        mel_basis = librosa.filters.mel(sr, n_fft, n_mels, fmin, fmax)
        # self.weight = paddle.to_tensor(mel_basis)
        weight = paddle.to_tensor(mel_basis, dtype=paddle.get_default_dtype())
        self.register_buffer("weight", weight)

    def forward(self, spec):
        # (n_mels, n_freq) * (batch_size, n_freq, n_frames)
        mel = paddle.matmul(self.weight, spec)
        return mel
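A usage sketch for the modules above (not part of the commit). The sample rate, FFT parameters and mel settings are illustrative; note that `MelScale` passes its arguments positionally to `librosa.filters.mel`, which newer librosa releases expect as keyword arguments.

import paddle

wav = paddle.randn([2, 22050])     # a batch of 1-second waveforms at 22.05 kHz

stft = STFT(n_fft=1024, hop_length=256, win_length=1024, window="hann")
mag = stft.magnitude(wav)          # (B, 1 + n_fft // 2, frames)

mel_scale = MelScale(sr=22050, n_fft=1024, n_mels=80, fmin=0, fmax=8000)
mel = mel_scale(mag)               # (B, n_mels, frames)

# linear quantization round trip; clip so values stay strictly below 1.0
codes = quantize(paddle.clip(wav, -1.0, 0.999), n_bands=256)
approx = dequantize(codes, n_bands=256)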
@@ -1,37 +0,0 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import numpy as np
import paddle
from paddle import Tensor


def expand(encodings: Tensor, durations: Tensor) -> Tensor:
    """
    encodings: (B, T, C)
    durations: (B, T)
    """
    batch_size, t_enc = durations.shape
    durations = durations.numpy()
    slens = np.sum(durations, -1)
    t_dec = np.max(slens)
    M = np.zeros([batch_size, t_dec, t_enc])
    for i in range(batch_size):
        k = 0
        for j in range(t_enc):
            d = durations[i, j]
            M[i, k:k + d, j] = 1
            k += d
    M = paddle.to_tensor(M, dtype=encodings.dtype)
    encodings = paddle.matmul(M, encodings)
    return encodings
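A worked example of `expand` with made-up shapes: each encoder frame is repeated according to its duration, so a (1, 3, C) encoding with durations [2, 0, 3] becomes a (1, 5, C) frame-rate sequence.

import paddle

encodings = paddle.randn([1, 3, 4])                        # (B, T_enc, C)
durations = paddle.to_tensor([[2, 0, 3]], dtype="int64")   # (B, T_enc)

expanded = expand(encodings, durations)
print(expanded.shape)  # [1, 5, 4]: frame 0 twice, frame 1 skipped, frame 2 three times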
@@ -1,120 +0,0 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import paddle

__all__ = [
    "id_mask",
    "feature_mask",
    "combine_mask",
    "future_mask",
]


def id_mask(input, padding_index=0, dtype="bool"):
    """Generate mask with input ids.

    Those positions where the value equals ``padding_index`` correspond to 0 or
    ``False``, otherwise, 1 or ``True``.

    Parameters
    ----------
    input : Tensor [dtype: int]
        The input tensor. It represents the ids.
    padding_index : int, optional
        The id which represents padding, by default 0.
    dtype : str, optional
        Data type of the returned mask, by default "bool".

    Returns
    -------
    Tensor
        The generated mask. It has the same shape as ``input`` does.
    """
    return paddle.cast(input != padding_index, dtype)


def feature_mask(input, axis, dtype="bool"):
    """Compute mask from input features.

    For input features, represented as batched feature vectors, those vectors
    which are all zeros are considered padding vectors.

    Parameters
    ----------
    input : Tensor [dtype: float]
        The input tensor which represents features.
    axis : int
        The index of the feature dimension in ``input``. Other dimensions are
        considered ``spatial`` dimensions.
    dtype : str, optional
        Data type of the generated mask, by default "bool"
    Returns
    -------
    Tensor
        The generated mask with ``spatial`` shape as mentioned above.

        It has one less dimension than ``input`` does.
    """
    feature_sum = paddle.sum(paddle.abs(input), axis)
    return paddle.cast(feature_sum != 0, dtype)


def combine_mask(mask1, mask2):
    """Combine two masks with multiplication or logical and.

    Parameters
    -----------
    mask1 : Tensor
        The first mask.
    mask2 : Tensor
        The second mask with broadcastable shape with ``mask1``.
    Returns
    --------
    Tensor
        Combined mask.

    Notes
    ------
    It is mainly used to combine the padding mask and the no-future mask for
    the transformer decoder.

    The padding mask is used to mask padding positions of the decoder inputs
    and the no-future mask is used to prevent the decoder from seeing future
    information.
    """
    if mask1.dtype == paddle.fluid.core.VarDesc.VarType.BOOL:
        return paddle.logical_and(mask1, mask2)
    else:
        return mask1 * mask2


def future_mask(time_steps, dtype="bool"):
    """Generate lower triangular mask.

    It is used at the transformer decoder to prevent the decoder from seeing
    future information.

    Parameters
    ----------
    time_steps : int
        Decoder time steps.
    dtype : str, optional
        The data type of the generated mask, by default "bool".

    Returns
    -------
    Tensor
        The generated mask.
    """
    mask = paddle.tril(paddle.ones([time_steps, time_steps]))
    return paddle.cast(mask, dtype)
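A short sketch of how the helpers above combine into a decoder mask (not part of the commit); the ids, shapes and the padding id of 0 are illustrative.

import paddle

ids = paddle.to_tensor([[5, 7, 2, 0, 0]])          # 0 is the padding id
padding_mask = id_mask(ids)                        # (1, 5), bool

feats = paddle.concat(
    [paddle.randn([1, 3, 80]), paddle.zeros([1, 2, 80])], axis=1)
frame_mask = feature_mask(feats, axis=-1)          # (1, 5), bool; zero vectors -> False

# Decoder-style mask: hide padding and future positions at the same time.
causal = future_mask(5)                            # (5, 5), lower triangular, bool
dec_mask = combine_mask(paddle.unsqueeze(padding_mask, 1), causal)  # (1, 5, 5)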
@@ -1,80 +0,0 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from math import exp

import paddle
import paddle.nn.functional as F
from paddle import nn


def gaussian(window_size, sigma):
    # 1D Gaussian kernel, normalized to sum to 1
    gauss = paddle.to_tensor([
        exp(-(x - window_size // 2)**2 / float(2 * sigma**2))
        for x in range(window_size)
    ])
    return gauss / gauss.sum()


def create_window(window_size, channel):
    # 2D Gaussian window expanded to (channel, 1, window_size, window_size)
    # for grouped (per-channel) convolution
    _1D_window = gaussian(window_size, 1.5).unsqueeze(1)
    _2D_window = paddle.matmul(_1D_window, paddle.transpose(
        _1D_window, [1, 0])).unsqueeze([0, 1])
    window = paddle.expand(_2D_window, [channel, 1, window_size, window_size])
    return window


def _ssim(img1, img2, window, window_size, channel, size_average=True):
    # local means, variances and covariance via Gaussian filtering
    mu1 = F.conv2d(img1, window, padding=window_size // 2, groups=channel)
    mu2 = F.conv2d(img2, window, padding=window_size // 2, groups=channel)

    mu1_sq = mu1.pow(2)
    mu2_sq = mu2.pow(2)
    mu1_mu2 = mu1 * mu2

    sigma1_sq = F.conv2d(
        img1 * img1, window, padding=window_size // 2, groups=channel) - mu1_sq
    sigma2_sq = F.conv2d(
        img2 * img2, window, padding=window_size // 2, groups=channel) - mu2_sq
    sigma12 = F.conv2d(
        img1 * img2, window, padding=window_size // 2, groups=channel) - mu1_mu2

    # stabilizing constants from the SSIM paper, assuming inputs in [0, 1]
    C1 = 0.01**2
    C2 = 0.03**2

    ssim_map = ((2 * mu1_mu2 + C1) * (2 * sigma12 + C2)) \
        / ((mu1_sq + mu2_sq + C1) * (sigma1_sq + sigma2_sq + C2))

    if size_average:
        return ssim_map.mean()
    else:
        return ssim_map.mean(1).mean(1).mean(1)


class SSIM(nn.Layer):
    def __init__(self, window_size=11, size_average=True):
        super().__init__()
        self.window_size = window_size
        self.size_average = size_average
        self.channel = 1
        self.window = create_window(window_size, self.channel)

    def forward(self, img1, img2):
        return _ssim(img1, img2, self.window, self.window_size, self.channel,
                     self.size_average)


def ssim(img1, img2, window_size=11, size_average=True):
    (_, channel, _, _) = img1.shape
    window = create_window(window_size, channel)
    return _ssim(img1, img2, window, window_size, channel, size_average)
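An illustrative call (not from the commit), treating predicted and reference mel spectrograms as single-channel images scaled to [0, 1], which is what the C1/C2 constants assume.

import paddle

mel_pred = paddle.rand([4, 1, 80, 200])   # (B, 1, n_mels, frames), values in [0, 1]
mel_ref = paddle.rand([4, 1, 80, 200])

metric = SSIM(window_size=11)             # module form, fixed to one channel
similarity = metric(mel_pred, mel_ref)    # scalar; 1.0 when the inputs are identical
loss = 1.0 - similarity                   # a common way to use SSIM as a loss

score = ssim(mel_pred, mel_ref)           # functional form infers the channel count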
@@ -1,220 +0,0 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Modified from espnet(https://github.com/espnet/espnet)
import paddle
from paddle import nn
from paddle.nn import functional as F
from scipy import signal


def stft(x,
         fft_size,
         hop_length=None,
         win_length=None,
         window='hann',
         center=True,
         pad_mode='reflect'):
    r"""Perform STFT and convert to magnitude spectrogram.
    Parameters
    ----------
    x : Tensor
        Input signal tensor (B, T).
    fft_size : int
        FFT size.
    hop_length : int
        Hop size (number of samples between adjacent frames).
    win_length : int
        Window length.
    window : str, optional
        Name of window function, see `scipy.signal.get_window` for more
        details. Defaults to "hann".
    center : bool, optional
        Whether to pad `x` so that the :math:`t`-th frame is centered at
        :math:`t \times hop\_length`. Defaults to `True`.
    pad_mode : str, optional
        Choose padding pattern when `center` is `True`.
    Returns
    ----------
    Tensor:
        Magnitude spectrogram (B, #frames, fft_size // 2 + 1).
    """
    # calculate window
    window = signal.get_window(window, win_length, fftbins=True)
    window = paddle.to_tensor(window)
    x_stft = paddle.signal.stft(
        x,
        fft_size,
        hop_length,
        win_length,
        window=window,
        center=center,
        pad_mode=pad_mode)

    real = x_stft.real()
    imag = x_stft.imag()

    return paddle.sqrt(paddle.clip(real**2 + imag**2, min=1e-7)).transpose(
        [0, 2, 1])


class SpectralConvergenceLoss(nn.Layer):
    """Spectral convergence loss module."""

    def __init__(self):
        """Initialize spectral convergence loss module."""
        super().__init__()

    def forward(self, x_mag, y_mag):
        """Calculate forward propagation.
        Parameters
        ----------
        x_mag : Tensor
            Magnitude spectrogram of predicted signal (B, #frames, #freq_bins).
        y_mag : Tensor
            Magnitude spectrogram of groundtruth signal (B, #frames, #freq_bins).
        Returns
        ----------
        Tensor
            Spectral convergence loss value.
        """
        return paddle.norm(
            y_mag - x_mag, p="fro") / paddle.clip(
                paddle.norm(y_mag, p="fro"), min=1e-10)


class LogSTFTMagnitudeLoss(nn.Layer):
    """Log STFT magnitude loss module."""

    def __init__(self, epsilon=1e-7):
        """Initialize log STFT magnitude loss module."""
        super().__init__()
        self.epsilon = epsilon

    def forward(self, x_mag, y_mag):
        """Calculate forward propagation.
        Parameters
        ----------
        x_mag : Tensor
            Magnitude spectrogram of predicted signal (B, #frames, #freq_bins).
        y_mag : Tensor
            Magnitude spectrogram of groundtruth signal (B, #frames, #freq_bins).
        Returns
        ----------
        Tensor
            Log STFT magnitude loss value.
        """
        return F.l1_loss(
            paddle.log(paddle.clip(y_mag, min=self.epsilon)),
            paddle.log(paddle.clip(x_mag, min=self.epsilon)))


class STFTLoss(nn.Layer):
    """STFT loss module."""

    def __init__(self,
                 fft_size=1024,
                 shift_size=120,
                 win_length=600,
                 window="hann"):
        """Initialize STFT loss module."""
        super().__init__()
        self.fft_size = fft_size
        self.shift_size = shift_size
        self.win_length = win_length
        self.window = window
        self.spectral_convergence_loss = SpectralConvergenceLoss()
        self.log_stft_magnitude_loss = LogSTFTMagnitudeLoss()

    def forward(self, x, y):
        """Calculate forward propagation.
        Parameters
        ----------
        x : Tensor
            Predicted signal (B, T).
        y : Tensor
            Groundtruth signal (B, T).
        Returns
        ----------
        Tensor
            Spectral convergence loss value.
        Tensor
            Log STFT magnitude loss value.
        """
        x_mag = stft(x, self.fft_size, self.shift_size, self.win_length,
                     self.window)
        y_mag = stft(y, self.fft_size, self.shift_size, self.win_length,
                     self.window)
        sc_loss = self.spectral_convergence_loss(x_mag, y_mag)
        mag_loss = self.log_stft_magnitude_loss(x_mag, y_mag)

        return sc_loss, mag_loss


class MultiResolutionSTFTLoss(nn.Layer):
    """Multi resolution STFT loss module."""

    def __init__(
            self,
            fft_sizes=[1024, 2048, 512],
            hop_sizes=[120, 240, 50],
            win_lengths=[600, 1200, 240],
            window="hann", ):
        """Initialize Multi resolution STFT loss module.
        Parameters
        ----------
        fft_sizes : list
            List of FFT sizes.
        hop_sizes : list
            List of hop sizes.
        win_lengths : list
            List of window lengths.
        window : str
            Window function type.
        """
        super().__init__()
        assert len(fft_sizes) == len(hop_sizes) == len(win_lengths)
        self.stft_losses = nn.LayerList()
        for fs, ss, wl in zip(fft_sizes, hop_sizes, win_lengths):
            self.stft_losses.append(STFTLoss(fs, ss, wl, window))

    def forward(self, x, y):
        """Calculate forward propagation.
        Parameters
        ----------
        x : Tensor
            Predicted signal (B, T) or (B, #subband, T).
        y : Tensor
            Groundtruth signal (B, T) or (B, #subband, T).
        Returns
        ----------
        Tensor
            Multi resolution spectral convergence loss value.
        Tensor
            Multi resolution log STFT magnitude loss value.
        """
        if len(x.shape) == 3:
            # (B, C, T) -> (B x C, T)
            x = x.reshape([-1, x.shape[2]])
            # (B, C, T) -> (B x C, T)
            y = y.reshape([-1, y.shape[2]])
        sc_loss = 0.0
        mag_loss = 0.0
        for f in self.stft_losses:
            sc_l, mag_l = f(x, y)
            sc_loss += sc_l
            mag_loss += mag_l
        sc_loss /= len(self.stft_losses)
        mag_loss /= len(self.stft_losses)

        return sc_loss, mag_loss
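A minimal training-loss sketch assuming the default resolutions above (not part of the commit); the waveforms are random placeholders and the equal weighting of the two terms is just one common choice.

import paddle

criterion = MultiResolutionSTFTLoss()     # three resolutions defined above

wav_pred = paddle.randn([4, 8000])        # generated waveform (B, T)
wav_ref = paddle.randn([4, 8000])         # reference waveform (B, T)

sc_loss, mag_loss = criterion(wav_pred, wav_ref)
total_loss = sc_loss + mag_loss           # combine the two terms for backprop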