Merge branch 'PaddlePaddle:develop' into develop

2 years ago · 5c4df656cd
parent a2e7ccac4b ae521d3700
commit 5c4df656cd
143 changed files with 4363 additions and 4413 deletions
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@ -1,14 +1,29 @@
 # Changelog
-Date: 2022-1-19, Author: yt605155624.  
+Date: 2022-1-29, Author: yt605155624.
-Add features to: T2S:  
+Add features to: T2S:
-  - Add csmsc Tacotron2.  
+  - Update aishell3 vc0 with new Tacotron2.
  - PRLink: https://github.com/PaddlePaddle/PaddleSpeech/pull/1419
 Date: 2022-1-29, Author: yt605155624.
 Add features to: T2S:
  - Add ljspeech Tacotron2.
  - PRLink: https://github.com/PaddlePaddle/PaddleSpeech/pull/1416
 Date: 2022-1-24, Author: yt605155624.
 Add features to: T2S:
  - Add csmsc WaveRNN.
  - PRLink: https://github.com/PaddlePaddle/PaddleSpeech/pull/1379
 Date: 2022-1-19, Author: yt605155624.
 Add features to: T2S:
  - Add csmsc Tacotron2.
  - PRLink: https://github.com/PaddlePaddle/PaddleSpeech/pull/1314
 Date: 2022-1-10, Author: Jackwaterveg.  
-Add features to: CLI:  
+Add features to: CLI:
-  - Support English (librispeech/asr1/transformer).  
+  - Support English (librispeech/asr1/transformer).
  - Support choosing `decode_method` for conformer and transformer models.  
  - Refactor the config, using the unified config.  
  - PRLink: https://github.com/PaddlePaddle/PaddleSpeech/pull/1297
@ -16,8 +31,8 @@ Add features to: CLI:
 ***
 Date: 2022-1-17, Author: Jackwaterveg.  
-Add features to: CLI:  
+Add features to: CLI:
-  - Support deepspeech2 online/offline model(aishell). 
+  - Support deepspeech2 online/offline model(aishell).
  - PRLink: https://github.com/PaddlePaddle/PaddleSpeech/pull/1356
 ***
--- a/README.md
+++ b/README.md
@ -16,12 +16,15 @@
 <p align="center">
    <a href="./LICENSE"><img src="https://img.shields.io/badge/license-Apache%202-red.svg"></a>
-    <a href="support os"><img src="https://img.shields.io/badge/os-linux-yellow.svg"></a>
+    <a href="https://github.com/PaddlePaddle/PaddleSpeech/releases"><img src="https://img.shields.io/github/v/release/PaddlePaddle/PaddleSpeech?color=ffa"></a>
    <a href="support os"><img src="https://img.shields.io/badge/os-linux%2C%20win%2C%20mac-pink.svg"></a>
    <a href=""><img src="https://img.shields.io/badge/python-3.7+-aff.svg"></a>
    <a href="https://github.com/PaddlePaddle/PaddleSpeech/graphs/contributors"><img src="https://img.shields.io/github/contributors/PaddlePaddle/PaddleSpeech?color=9ea"></a>
    <a href="https://github.com/PaddlePaddle/PaddleSpeech/commits"><img src="https://img.shields.io/github/commit-activity/m/PaddlePaddle/PaddleSpeech?color=3af"></a>
    <a href="https://github.com/PaddlePaddle/PaddleSpeech/issues"><img src="https://img.shields.io/github/issues/PaddlePaddle/PaddleSpeech?color=9cc"></a>
    <a href="https://github.com/PaddlePaddle/PaddleSpeech/stargazers"><img src="https://img.shields.io/github/stars/PaddlePaddle/PaddleSpeech?color=ccf"></a>
    <a href="https://pypi.org/project/paddlespeech/"><img src="https://img.shields.io/pypi/dm/PaddleSpeech"></a>
    <a href="https://pypi.org/project/paddlespeech/"><img src="https://static.pepy.tech/badge/paddlespeech"></a>
    <a href="https://huggingface.co/spaces"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue"></a>
 </p>
@ -143,6 +146,8 @@ For more synthesized audios, please refer to [PaddleSpeech Text-to-Speech sample
 <div align="center"><a href="https://www.bilibili.com/video/BV1cL411V71o?share_source=copy_web"><img src="https://ai-studio-static-online.cdn.bcebos.com/06fd746ab32042f398fb6f33f873e6869e846fe63c214596ae37860fe8103720" / width="500px"></a></div>
 - [PaddleSpeech Demo Video](https://paddlespeech.readthedocs.io/en/latest/demo_video.html)
 ### 🔥 Hot Activities
 - 2021.12.21~12.24
@ -317,14 +322,15 @@ PaddleSpeech supports a series of most popular models. They are summarized in [r
    </tr>
    <tr>
      <td rowspan="4">Acoustic Model</td>
-      <td >Tacotron2</td>
+      <td>Tacotron2</td>
-      <td rowspan="2" >LJSpeech</td>
+      <td>LJSpeech / CSMSC</td>
      <td>
-      <a href = "./examples/ljspeech/tts0">tacotron2-ljspeech</a>
+      <a href = "./examples/ljspeech/tts0">tacotron2-ljspeech</a> / <a href = "./examples/csmsc/tts0">tacotron2-csmsc</a>
      </td>
    </tr>
    <tr>
      <td>Transformer TTS</td>
      <td>LJSpeech</td>
      <td>
      <a href = "./examples/ljspeech/tts1">transformer-ljspeech</a>
      </td>
@ -344,7 +350,7 @@ PaddleSpeech supports a series of most popular models. They are summarized in [r
      </td>
    </tr>
   <tr>
-      <td rowspan="5">Vocoder</td>
+      <td rowspan="6">Vocoder</td>
      <td >WaveFlow</td>
      <td >LJSpeech</td>
      <td>
@ -378,7 +384,14 @@ PaddleSpeech supports a series of most popular models. They are summarized in [r
      <td>
      <a href = "./examples/csmsc/voc5">HiFiGAN-csmsc</a> 
      </td>
-    <tr>                                                                                                                                       
+    </tr>
    <tr>
      <td >WaveRNN</td>
      <td >CSMSC</td>
      <td>
      <a href = "./examples/csmsc/voc6">WaveRNN-csmsc</a>
      </td>
    </tr>
    <tr>
      <td rowspan="3">Voice Cloning</td>
      <td>GE2E</td>
@ -416,7 +429,6 @@ PaddleSpeech supports a series of most popular models. They are summarized in [r
    </tr>
  </thead>
  <tbody>
  <tr>
      <td>Audio Classification</td>
      <td>ESC-50</td>
@ -440,7 +452,6 @@ PaddleSpeech supports a series of most popular models. They are summarized in [r
    </tr>
  </thead>
  <tbody>
  <tr>
      <td>Punctuation Restoration</td>
      <td>IWLST2012_zh</td>
@ -488,7 +499,17 @@ author={PaddlePaddle Authors},
 howpublished = {\url{https://github.com/PaddlePaddle/PaddleSpeech}},
 year={2021}
 }
@inproceedings{zheng2021fused,
  title={Fused acoustic and text encoding for multimodal bilingual pretraining and speech translation},
  author={Zheng, Renjie and Chen, Junkun and Ma, Mingbo and Huang, Liang},
  booktitle={International Conference on Machine Learning},
  pages={12736--12746},
  year={2021},
  organization={PMLR}
 }
 ```
 <a name="contribution"></a>
 ## Contribute to PaddleSpeech
--- a/README_cn.md
+++ b/README_cn.md
@ -147,6 +147,8 @@ from https://github.com/18F/open-source-guide/blob/18f-pages/pages/making-readme
 <div align="center"><a href="https://www.bilibili.com/video/BV1cL411V71o?share_source=copy_web"><img src="https://ai-studio-static-online.cdn.bcebos.com/06fd746ab32042f398fb6f33f873e6869e846fe63c214596ae37860fe8103720" / width="500px"></a></div>
 - [PaddleSpeech Demo Video](https://paddlespeech.readthedocs.io/en/latest/demo_video.html)
 ### 🔥 Hot Activities
@ -315,14 +317,15 @@ PaddleSpeech 的 **语音合成** 主要包含三个模块：文本前端、声
    </tr>
    <tr>
      <td rowspan="4">Acoustic Model</td>
-      <td >Tacotron2</td>
+      <td>Tacotron2</td>
-      <td rowspan="2" >LJSpeech</td>
+      <td>LJSpeech / CSMSC</td>
      <td>
-      <a href = "./examples/ljspeech/tts0">tacotron2-ljspeech</a>
+      <a href = "./examples/ljspeech/tts0">tacotron2-ljspeech</a> / <a href = "./examples/csmsc/tts0">tacotron2-csmsc</a>
      </td>
    </tr>
    <tr>
      <td>Transformer TTS</td>
      <td>LJSpeech</td>
      <td>
      <a href = "./examples/ljspeech/tts1">transformer-ljspeech</a>
      </td>
@ -342,7 +345,7 @@ PaddleSpeech 的 **语音合成** 主要包含三个模块：文本前端、声
      </td>
    </tr>
   <tr>
-      <td rowspan="5">Vocoder</td>
+      <td rowspan="6">Vocoder</td>
      <td >WaveFlow</td>
      <td >LJSpeech</td>
      <td>
@ -376,7 +379,14 @@ PaddleSpeech 的 **语音合成** 主要包含三个模块：文本前端、声
      <td>
      <a href = "./examples/csmsc/voc5">HiFiGAN-csmsc</a> 
      </td>
-    <tr>                                                                                                                                       
+    </tr>
    <tr>
      <td >WaveRNN</td>
      <td >CSMSC</td>
      <td>
      <a href = "./examples/csmsc/voc6">WaveRNN-csmsc</a>
      </td>
    </tr>
    <tr>
      <td rowspan="3">Voice Cloning</td>
      <td>GE2E</td>
@ -415,8 +425,6 @@ PaddleSpeech 的 **语音合成** 主要包含三个模块：文本前端、声
    </tr>
  </thead>
  <tbody>
  <tr>
      <td>Audio Classification</td>
      <td>ESC-50</td>
@ -440,7 +448,6 @@ PaddleSpeech 的 **语音合成** 主要包含三个模块：文本前端、声
    </tr>
  </thead>
  <tbody>
  <tr>
      <td>Punctuation Restoration</td>
      <td>IWLST2012_zh</td>
--- a/docs/source/demo_video.rst
+++ b/docs/source/demo_video.rst
@ -0,0 +1,13 @@
 Demo Video 
 ==================
 .. raw:: html
    <video controls width="1024">
    <source src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/PaddleSpeech_Demo.mp4"
            type="video/mp4">
    Sorry, your browser doesn't support embedded videos.
    </video>
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@ -41,6 +41,7 @@ Contents
   tts/gan_vocoder
   tts/demo
   tts/demo_2
 .. toctree::
   :maxdepth: 1
@ -50,12 +51,14 @@ Contents
 .. toctree::
   :maxdepth: 1
-   :caption: Acknowledgement
+   :caption: Demos
   asr/reference
   demo_video
   tts_demo_video
 .. toctree::
   :maxdepth: 1
   :caption: Acknowledgement
   asr/reference
--- a/docs/source/released_model.md
+++ b/docs/source/released_model.md
@ -1,3 +1,4 @@
 # Released Models
 ## Speech-to-Text Models
@ -32,14 +33,15 @@ Language Model | Training Data | Token-based | Size | Descriptions
 ### Acoustic Models
 Model Type | Dataset| Example Link | Pretrained Models|Static Models|Size (static)
 :-------------:| :------------:| :-----: | :-----:| :-----:| :-----:
-Tacotron2|LJSpeech|[tacotron2-vctk](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/ljspeech/tts0)|[tacotron2_ljspeech_ckpt_0.3.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/tacotron2/tacotron2_ljspeech_ckpt_0.3.zip)|||
+Tacotron2|LJSpeech|[tacotron2-ljspeech](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/ljspeech/tts0)|[tacotron2_ljspeech_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/tacotron2/tacotron2_ljspeech_ckpt_0.2.0.zip)|||
 Tacotron2|CSMSC|[tacotron2-csmsc](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/tts0)|[tacotron2_csmsc_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/tacotron2/tacotron2_csmsc_ckpt_0.2.0.zip)|[tacotron2_csmsc_static_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/tacotron2/tacotron2_csmsc_static_0.2.0.zip)|103MB|
 TransformerTTS| LJSpeech| [transformer-ljspeech](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/ljspeech/tts1)|[transformer_tts_ljspeech_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/transformer_tts/transformer_tts_ljspeech_ckpt_0.4.zip)|||
 SpeedySpeech| CSMSC | [speedyspeech-csmsc](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/tts2) |[speedyspeech_nosil_baker_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/speedyspeech/speedyspeech_nosil_baker_ckpt_0.5.zip)|[speedyspeech_nosil_baker_static_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/speedyspeech/speedyspeech_nosil_baker_static_0.5.zip)|12MB|
 FastSpeech2| CSMSC |[fastspeech2-csmsc](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/tts3)|[fastspeech2_nosil_baker_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_baker_ckpt_0.4.zip)|[fastspeech2_nosil_baker_static_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_baker_static_0.4.zip)|157MB|
 FastSpeech2-Conformer| CSMSC |[fastspeech2-csmsc](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/tts3)|[fastspeech2_conformer_baker_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_conformer_baker_ckpt_0.5.zip)|||
 FastSpeech2| AISHELL-3 |[fastspeech2-aishell3](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/tts3)|[fastspeech2_nosil_aishell3_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_aishell3_ckpt_0.4.zip)|||
 FastSpeech2| LJSpeech |[fastspeech2-ljspeech](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/ljspeech/tts3)|[fastspeech2_nosil_ljspeech_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_ljspeech_ckpt_0.5.zip)|||
-FastSpeech2| VCTK |[fastspeech2-csmsc](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/vctk/tts3)|[fastspeech2_nosil_vctk_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_vctk_ckpt_0.5.zip)|||
+FastSpeech2| VCTK |[fastspeech2-vctk](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/vctk/tts3)|[fastspeech2_nosil_vctk_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_vctk_ckpt_0.5.zip)|||
 ### Vocoders
 Model Type | Dataset| Example Link | Pretrained Models| Static Models|Size (static)
@ -52,12 +54,14 @@ Parallel WaveGAN| VCTK |[PWGAN-vctk](https://github.com/PaddlePaddle/PaddleSpeec
 |Multi Band MelGAN | CSMSC |[MB MelGAN-csmsc](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/voc3) | [mb_melgan_csmsc_ckpt_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/mb_melgan/mb_melgan_csmsc_ckpt_0.1.1.zip) <br>[mb_melgan_baker_finetune_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/mb_melgan/mb_melgan_baker_finetune_ckpt_0.5.zip)|[mb_melgan_csmsc_static_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/mb_melgan/mb_melgan_csmsc_static_0.1.1.zip) |8.2MB|
 Style MelGAN | CSMSC |[Style MelGAN-csmsc](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/voc4)|[style_melgan_csmsc_ckpt_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/style_melgan/style_melgan_csmsc_ckpt_0.1.1.zip)| | |
 HiFiGAN | CSMSC |[HiFiGAN-csmsc](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/voc5)|[hifigan_csmsc_ckpt_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_csmsc_ckpt_0.1.1.zip)|[hifigan_csmsc_static_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_csmsc_static_0.1.1.zip)|50MB|
 WaveRNN | CSMSC |[WaveRNN-csmsc](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/voc6)|[wavernn_csmsc_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/wavernn/wavernn_csmsc_ckpt_0.2.0.zip)|[wavernn_csmsc_static_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/wavernn/wavernn_csmsc_static_0.2.0.zip)|18MB|
 ### Voice Cloning
 Model Type | Dataset| Example Link | Pretrained Models
 :-------------:| :------------:| :-----: | :-----:
 GE2E| AISHELL-3, etc. |[ge2e](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/ge2e)|[ge2e_ckpt_0.3.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/ge2e/ge2e_ckpt_0.3.zip)
-GE2E + Tacotron2| AISHELL-3 |[ge2e-tacotron2-aishell3](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/vc0)|[tacotron2_aishell3_ckpt_0.3.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/tacotron2/tacotron2_aishell3_ckpt_0.3.zip)
+GE2E + Tacotron2| AISHELL-3 |[ge2e-tacotron2-aishell3](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/vc0)|[tacotron2_aishell3_ckpt_vc0_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/tacotron2/tacotron2_aishell3_ckpt_vc0_0.2.0.zip)
 GE2E + FastSpeech2 | AISHELL-3  |[ge2e-fastspeech2-aishell3](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/vc1)|[fastspeech2_nosil_aishell3_vc1_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_aishell3_vc1_ckpt_0.5.zip)
--- a/docs/source/tts/quick_start_cn.md
+++ b/docs/source/tts/quick_start_cn.md
@ -202,4 +202,4 @@ sf.write(
        audio_path,
        wav.numpy(),
        samplerate=fastspeech2_config.fs)
-```
+```
--- a/docs/source/tts_demo_video.rst
+++ b/docs/source/tts_demo_video.rst
@ -0,0 +1,12 @@
 TTS Demo Video
 ==================
 .. raw:: html
    <video controls width="1024">
    <source src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/paddle2021_with_me.mp4"
            type="video/mp4">
    Sorry, your browser doesn't support embedded videos.
    </video>
--- a/examples/aishell3/vc0/README.md
+++ b/examples/aishell3/vc0/README.md
@ -1,4 +1,3 @@
 # Tacotron2 + AISHELL-3 Voice Cloning
 This example contains code used to train a [Tacotron2](https://arxiv.org/abs/1712.05884) model with [AISHELL-3](http://www.aishelltech.com/aishell_3). The trained model can be used for the voice cloning task. We follow the model structure of [Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis](https://arxiv.org/pdf/1806.04558.pdf). The general steps are as follows:
 1. Speaker Encoder: We train a speaker encoder on the speaker verification task. Because transcriptions are not needed, the datasets differ from those used for `Tacotron2`, and more of them are used; refer to [ge2e](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/ge2e).
@ -17,7 +16,7 @@ mkdir data_aishell3
 tar zxvf data_aishell3.tgz -C data_aishell3
 ```
 ### Get MFA Result and Extract
-We use [MFA2.x](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get durations for aishell3_fastspeech2.
+We use [MFA2.x](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get phonemes for Tacotron2; the durations from MFA are not needed here.
 You can download it here: [aishell3_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/AISHELL-3/with_tone/aishell3_alignment_tone.tar.gz), or train your own MFA model by referring to the [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) (which uses MFA1.x for now) in our repo.
 ## Pretrained GE2E Model
@ -117,3 +116,25 @@ ref_audio
 ```bash
 CUDA_VISIBLE_DEVICES=${gpus} ./local/voice_cloning.sh ${conf_path} ${train_output_path} ${ckpt_name} ${ge2e_params_path} ${ref_audio_dir}
 ```
 ## Pretrained Model
 [tacotron2_aishell3_ckpt_vc0_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/tacotron2/tacotron2_aishell3_ckpt_vc0_0.2.0.zip)
 Model | Step | eval/loss | eval/l1_loss | eval/mse_loss | eval/bce_loss| eval/attn_loss
 :-------------:| :------------:| :-----: | :-----: | :--------: |:--------:|:---------:
 default| 2(gpu) x 37596|0.58704|0.39623|0.15073|0.039|1.9981e-04|
 Tacotron2 checkpoint contains files listed below.
 (There is no need for `speaker_id_map.txt` here.)
 ```text
 tacotron2_aishell3_ckpt_vc0_0.2.0
 ├── default.yaml            # default config used to train tacotron2
 ├── phone_id_map.txt        # phone vocabulary file when training tacotron2
 ├── snapshot_iter_37596.pdz # model parameters and optimizer states
 └── speech_stats.npy        # statistics used to normalize spectrogram when training tacotron2
 ```
 ## More
 We strongly recommend that you use [FastSpeech2 + AISHELL-3 Voice Cloning](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/vc1) which works better.
--- a/examples/aishell3/vc0/conf/default.yaml
+++ b/examples/aishell3/vc0/conf/default.yaml
@ -77,7 +77,7 @@ optimizer:
 ###########################################################
 #                     TRAINING SETTING                    #
 ###########################################################
-max_epoch: 200
+max_epoch: 100
 num_snapshots: 5
 ###########################################################
--- a/examples/aishell3/vc0/path.sh
+++ b/examples/aishell3/vc0/path.sh
@ -9,5 +9,5 @@ export PYTHONDONTWRITEBYTECODE=1
 export PYTHONIOENCODING=UTF-8
 export PYTHONPATH=${MAIN_ROOT}:${PYTHONPATH}
-MODEL=new_tacotron2
+MODEL=tacotron2
 export BIN_DIR=${MAIN_ROOT}/paddlespeech/t2s/exps/${MODEL}
--- a/examples/aishell3/vc1/README.md
+++ b/examples/aishell3/vc1/README.md
@ -1,4 +1,3 @@
 # FastSpeech2 + AISHELL-3 Voice Cloning
 This example contains code used to train a [FastSpeech2](https://arxiv.org/abs/2006.04558) model with [AISHELL-3](http://www.aishelltech.com/aishell_3). The trained model can be used for the voice cloning task. We follow the model structure of [Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis](https://arxiv.org/pdf/1806.04558.pdf). The general steps are as follows:
 1. Speaker Encoder: We train a speaker encoder on the speaker verification task. Because transcriptions are not needed, the datasets differ from those used for `FastSpeech2`, and more of them are used; refer to [ge2e](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/ge2e).
--- a/examples/callcenter/README.md
+++ b/examples/callcenter/README.md
@ -0,0 +1,20 @@
 # Callcenter 8k sample rate
 Data distribution:
 ```
 676048 utts
 491.4004722221223 h
 4357792.0 text
 2.4633630739178654 text/sec
 2.6167397877068495 sec/utt
 ```
 train/dev/test partition:
 ```
    33802 manifest.dev
    67606 manifest.test
   574640 manifest.train
   676048 total
 ```
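
The figures above are mutually consistent, which can be checked with a few lines of arithmetic. The sketch below only re-derives the two ratios and the partition total from the numbers listed; the variable names are illustrative, and reading `text` as the total number of text tokens is an assumption.

```python
# Numbers copied from the data distribution and partition listings above.
num_utts = 676048
hours = 491.4004722221223
num_text = 4357792.0  # assumed to be the total count of text tokens

total_seconds = hours * 3600
sec_per_utt = total_seconds / num_utts
text_per_sec = num_text / total_seconds

# The dev/test/train manifests should add up to the utterance count.
partition = {"dev": 33802, "test": 67606, "train": 574640}
partition_total = sum(partition.values())

print(f"{sec_per_utt:.4f} sec/utt")
print(f"{text_per_sec:.4f} text/sec")
print(partition_total == num_utts)
```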
--- a/examples/csmsc/README.md
+++ b/examples/csmsc/README.md
@ -10,3 +10,5 @@
 * voc2 - MelGAN
 * voc3 - MultiBand MelGAN
 * voc4 - Style MelGAN
 * voc5 - HiFiGAN
 * voc6 - WaveRNN
--- a/examples/csmsc/tts0/README.md
+++ b/examples/csmsc/tts0/README.md
@ -212,6 +212,8 @@ optional arguments:
 Pretrained Tacotron2 model with silence trimmed from the edges of the audio:
 - [tacotron2_csmsc_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/tacotron2/tacotron2_csmsc_ckpt_0.2.0.zip)
 The static model can be downloaded here [tacotron2_csmsc_static_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/tacotron2/tacotron2_csmsc_static_0.2.0.zip).
 Model | Step | eval/loss | eval/l1_loss | eval/mse_loss | eval/bce_loss| eval/attn_loss 
 :-------------:| :------------:| :-----: | :-----: | :--------: |:--------:|:---------:
--- a/examples/csmsc/tts0/local/synthesize_e2e.sh
+++ b/examples/csmsc/tts0/local/synthesize_e2e.sh
@ -7,6 +7,7 @@ ckpt_name=$3
 stage=0
 stop_stage=0
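 # TODO: the dynamic-to-static result of tacotron2 is not as loud as the static graph's; some function used during decoding may still be misaligned between dynamic and static modes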
 if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
    FLAGS_allocator_strategy=naive_best_fit \
    FLAGS_fraction_of_gpu_memory_to_use=0.01 \
@ -33,7 +34,7 @@ if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
    FLAGS_allocator_strategy=naive_best_fit \
    FLAGS_fraction_of_gpu_memory_to_use=0.01 \
    python3 ${BIN_DIR}/../synthesize_e2e.py \
-        --am=fastspeech2_csmsc \
+        --am=tacotron2_csmsc \
        --am_config=${config_path} \
        --am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
        --am_stat=dump/train/speech_stats.npy \
@ -55,7 +56,7 @@ if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
    FLAGS_allocator_strategy=naive_best_fit \
    FLAGS_fraction_of_gpu_memory_to_use=0.01 \
    python3 ${BIN_DIR}/../synthesize_e2e.py \
-        --am=fastspeech2_csmsc \
+        --am=tacotron2_csmsc \
        --am_config=${config_path} \
        --am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
        --am_stat=dump/train/speech_stats.npy \
@ -76,7 +77,7 @@ if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
    FLAGS_allocator_strategy=naive_best_fit \
    FLAGS_fraction_of_gpu_memory_to_use=0.01 \
    python3 ${BIN_DIR}/../synthesize_e2e.py \
-        --am=fastspeech2_csmsc \
+        --am=tacotron2_csmsc \
        --am_config=${config_path} \
        --am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
        --am_stat=dump/train/speech_stats.npy \
@ -90,3 +91,24 @@ if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
        --inference_dir=${train_output_path}/inference \
        --phones_dict=dump/phone_id_map.txt
 fi
 # wavernn
 if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
    echo "in wavernn syn_e2e"
    FLAGS_allocator_strategy=naive_best_fit \
    FLAGS_fraction_of_gpu_memory_to_use=0.01 \
    python3 ${BIN_DIR}/../synthesize_e2e.py \
        --am=tacotron2_csmsc \
        --am_config=${config_path} \
        --am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
        --am_stat=dump/train/speech_stats.npy \
        --voc=wavernn_csmsc \
        --voc_config=wavernn_csmsc_ckpt_0.2.0/default.yaml \
        --voc_ckpt=wavernn_csmsc_ckpt_0.2.0/snapshot_iter_400000.pdz \
        --voc_stat=wavernn_csmsc_ckpt_0.2.0/feats_stats.npy \
        --lang=zh \
        --text=${BIN_DIR}/../sentences.txt \
        --output_dir=${train_output_path}/test_e2e \
        --phones_dict=dump/phone_id_map.txt \
        --inference_dir=${train_output_path}/inference
 fi
--- a/examples/csmsc/tts0/path.sh
+++ b/examples/csmsc/tts0/path.sh
@ -9,5 +9,5 @@ export PYTHONDONTWRITEBYTECODE=1
 export PYTHONIOENCODING=UTF-8
 export PYTHONPATH=${MAIN_ROOT}:${PYTHONPATH}
-MODEL=new_tacotron2
+MODEL=tacotron2
 export BIN_DIR=${MAIN_ROOT}/paddlespeech/t2s/exps/${MODEL}
--- a/examples/csmsc/tts0/run.sh
+++ b/examples/csmsc/tts0/run.sh
@ -35,3 +35,8 @@ if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
    # synthesize_e2e, vocoder is pwgan
    CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize_e2e.sh ${conf_path} ${train_output_path} ${ckpt_name} || exit -1
 fi
 if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
    # inference with static model
    CUDA_VISIBLE_DEVICES=${gpus} ./local/inference.sh ${train_output_path} || exit -1
 fi
--- a/examples/csmsc/tts2/local/synthesize_e2e.sh
+++ b/examples/csmsc/tts2/local/synthesize_e2e.sh
@ -92,3 +92,26 @@ if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
        --phones_dict=dump/phone_id_map.txt \
        --tones_dict=dump/tone_id_map.txt
 fi
 # wavernn
 if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
    echo "in wavernn syn_e2e"
    FLAGS_allocator_strategy=naive_best_fit \
    FLAGS_fraction_of_gpu_memory_to_use=0.01 \
    python3 ${BIN_DIR}/../synthesize_e2e.py \
        --am=speedyspeech_csmsc \
        --am_config=${config_path} \
        --am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
        --am_stat=dump/train/feats_stats.npy \
        --voc=wavernn_csmsc \
        --voc_config=wavernn_csmsc_ckpt_0.2.0/default.yaml \
        --voc_ckpt=wavernn_csmsc_ckpt_0.2.0/snapshot_iter_400000.pdz \
        --voc_stat=wavernn_csmsc_ckpt_0.2.0/feats_stats.npy \
        --lang=zh \
        --text=${BIN_DIR}/../sentences.txt \
        --output_dir=${train_output_path}/test_e2e \
        --phones_dict=dump/phone_id_map.txt \
        --tones_dict=dump/tone_id_map.txt \
        --inference_dir=${train_output_path}/inference
 fi
--- a/examples/csmsc/tts3/README.md
+++ b/examples/csmsc/tts3/README.md
@ -243,6 +243,8 @@ fastspeech2_nosil_baker_ckpt_0.4
 └── speech_stats.npy        # statistics used to normalize spectrogram when training fastspeech2
 ```
 You can use the following scripts to synthesize for `${BIN_DIR}/../sentences.txt` using pretrained fastspeech2 and parallel wavegan models.
 If you want to use fastspeech2_conformer, you must delete the line `--inference_dir=exp/default/inference \` to skip the dygraph-to-static-graph step, because we have not yet tested dygraph-to-static conversion for fastspeech2_conformer.
 ```bash
 source path.sh
--- a/examples/csmsc/tts3/local/synthesize_e2e.sh
+++ b/examples/csmsc/tts3/local/synthesize_e2e.sh
@ -102,9 +102,9 @@ if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
        --am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
        --am_stat=dump/train/speech_stats.npy \
        --voc=wavernn_csmsc \
-        --voc_config=wavernn_test/default.yaml \
+        --voc_config=wavernn_csmsc_ckpt_0.2.0/default.yaml \
-        --voc_ckpt=wavernn_test/snapshot_iter_5000.pdz \
+        --voc_ckpt=wavernn_csmsc_ckpt_0.2.0/snapshot_iter_400000.pdz \
-        --voc_stat=wavernn_test/feats_stats.npy \
+        --voc_stat=wavernn_csmsc_ckpt_0.2.0/feats_stats.npy \
        --lang=zh \
        --text=${BIN_DIR}/../sentences.txt \
        --output_dir=${train_output_path}/test_e2e \
--- a/examples/csmsc/tts3/run.sh
+++ b/examples/csmsc/tts3/run.sh
@ -36,3 +36,8 @@ if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
    CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize_e2e.sh ${conf_path} ${train_output_path} ${ckpt_name} || exit -1
 fi
 if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
    # inference with static model
    CUDA_VISIBLE_DEVICES=${gpus} ./local/inference.sh ${train_output_path} || exit -1
 fi
--- a/examples/csmsc/voc6/README.md
+++ b/examples/csmsc/voc6/README.md
@ -0,0 +1,127 @@
 # WaveRNN with CSMSC
 This example contains code used to train a [WaveRNN](https://arxiv.org/abs/1802.08435) model with [Chinese Standard Mandarin Speech Corpus](https://www.data-baker.com/open_source.html).
 ## Dataset
 ### Download and Extract
 Download CSMSC from the [official website](https://www.data-baker.com/data/index/source) and extract it to `~/datasets`. Then the dataset is in the directory `~/datasets/BZNSYP`.
 ### Get MFA Result and Extract
 We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) results to cut silence at the edge of audio.
 You can download it here: [baker_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/BZNSYP/with_tone/baker_alignment_tone.tar.gz), or train your own MFA model by referring to the [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) in our repo.
 ## Get Started
 Assume the path to the dataset is `~/datasets/BZNSYP`.
 Assume the path to the MFA result of CSMSC is `./baker_alignment_tone`.
 Run the command below to
 1. **source path**.
 2. preprocess the dataset.
 3. train the model.
 4. synthesize wavs.
    - synthesize waveform from `metadata.jsonl`.
 ```bash
 ./run.sh
 ```
 You can choose a range of stages to run, or set `stage` equal to `stop-stage` to run a single stage. For example, the following command only preprocesses the dataset.
 ```bash
 ./run.sh --stage 0 --stop-stage 0
 ```
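
The stage selection can be read as a simple range check: stage `k` runs when `stage <= k <= stop_stage`, which matches the `[ ${stage} -le k ] && [ ${stop_stage} -ge k ]` guards used by the shell scripts in this repo. The Python sketch below only illustrates that logic; `stages_to_run` and `num_stages` are hypothetical names, not part of `run.sh`.

```python
from typing import List


def stages_to_run(stage: int, stop_stage: int, num_stages: int) -> List[int]:
    """Return every stage index k with stage <= k <= stop_stage."""
    return [k for k in range(num_stages) if stage <= k <= stop_stage]


# --stage 0 --stop-stage 0 selects only the first stage (preprocessing).
print(stages_to_run(0, 0, 4))
# --stage 1 --stop-stage 2 selects a contiguous range of stages.
print(stages_to_run(1, 2, 4))
```

An empty list results when `stage` is greater than `stop_stage`, which corresponds to no guard being satisfied.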
 ### Data Preprocessing
 ```bash
 ./local/preprocess.sh ${conf_path}
 ```
 When it is done, a `dump` folder is created in the current directory. The structure of the dump folder is listed below.
 ```text
 dump
 ├── dev
 │   ├── norm
 │   └── raw
 ├── test
 │   ├── norm
 │   └── raw
 └── train
    ├── norm
    ├── raw
    └── feats_stats.npy
 ```
 The dataset is split into 3 parts, namely `train`, `dev`, and `test`, each of which contains a `norm` and `raw` subfolder. The `raw` folder contains the log magnitude of the mel spectrogram of each utterance, while the norm folder contains the normalized spectrogram. The statistics used to normalize the spectrogram are computed from the training set, which is located in `dump/train/feats_stats.npy`.
 Also, there is a `metadata.jsonl` in each subfolder. It is a table-like file that contains id and paths to the spectrogram of each utterance.
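
As a rough illustration of how such a file can be consumed, the sketch below parses a JSON Lines file one record per line. The function name `read_metadata` and the keys `utt_id` and `feats` are assumptions made for the example; check a real `metadata.jsonl` for the actual field names.

```python
import json
import tempfile
from pathlib import Path


def read_metadata(path):
    """Parse a JSON Lines file: one JSON object per non-empty line."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


# Demo on a throwaway file; the keys "utt_id" and "feats" are hypothetical.
with tempfile.TemporaryDirectory() as tmp:
    demo = Path(tmp) / "metadata.jsonl"
    demo.write_text(
        '{"utt_id": "000001", "feats": "dump/train/norm/000001_feats.npy"}\n'
        '{"utt_id": "000002", "feats": "dump/train/norm/000002_feats.npy"}\n',
        encoding="utf-8",
    )
    records = read_metadata(demo)

print(len(records), records[0]["utt_id"])
```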
 ### Model Training
 ```bash
 CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path}
 ```
 `./local/train.sh` calls `${BIN_DIR}/train.py`.
 Here's the complete help message.
 ```text
 usage: train.py [-h] [--config CONFIG] [--train-metadata TRAIN_METADATA]
                [--dev-metadata DEV_METADATA] [--output-dir OUTPUT_DIR]
                [--ngpu NGPU]
 Train a WaveRNN model.
 optional arguments:
  -h, --help            show this help message and exit
  --config CONFIG       config file to overwrite default config.
  --train-metadata TRAIN_METADATA
                        training data.
  --dev-metadata DEV_METADATA
                        dev data.
  --output-dir OUTPUT_DIR
                        output dir.
  --ngpu NGPU           if ngpu == 0, use cpu.
 ```
 1. `--config` is a config file in yaml format to overwrite the default config, which can be found at `conf/default.yaml`.
 2. `--train-metadata` and `--dev-metadata` should be the metadata file in the normalized subfolder of `train` and `dev` in the `dump` folder.
 3. `--output-dir` is the directory to save the results of the experiment. Checkpoints are saved in `checkpoints/` inside this directory.
 4. `--ngpu` is the number of GPUs to use; if `ngpu == 0`, the CPU is used.
 ### Synthesizing
 `./local/synthesize.sh` calls `${BIN_DIR}/synthesize.py`, which can synthesize waveform from `metadata.jsonl`.
 ```bash
 CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name}
 ```
 ```text
 usage: synthesize.py [-h] [--config CONFIG] [--checkpoint CHECKPOINT]
                     [--test-metadata TEST_METADATA] [--output-dir OUTPUT_DIR]
                     [--ngpu NGPU]
 Synthesize with WaveRNN.
 optional arguments:
  -h, --help            show this help message and exit
  --config CONFIG       Vocoder config file.
  --checkpoint CHECKPOINT
                        snapshot to load.
  --test-metadata TEST_METADATA
                        dev data.
  --output-dir OUTPUT_DIR
                        output dir.
  --ngpu NGPU           if ngpu == 0, use cpu.
 ```
 1. `--config` is the WaveRNN config file. Use the same config that the model was trained with.
 2. `--checkpoint` is the checkpoint to load. Pick one of the checkpoints from `checkpoints` inside the training output directory.
 3. `--test-metadata` is the metadata of the test dataset. Use the `metadata.jsonl` in the `dev/norm` subfolder from the processed directory.
 4. `--output-dir` is the directory to save the synthesized audio files.
 5. `--ngpu` is the number of GPUs to use; if `ngpu == 0`, the CPU is used.
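When several checkpoints exist, one way to pick a value for `${ckpt_name}` is to take the snapshot with the highest iteration count. The sketch below assumes training writes files named `snapshot_iter_<N>.pdz`, the pattern seen in the pretrained archives; the helper names are hypothetical.

```python
import re
from pathlib import Path

# Assumed naming pattern, taken from the pretrained model listings.
_SNAPSHOT = re.compile(r"snapshot_iter_(\d+)\.pdz$")


def latest_checkpoint(names):
    """Return the snapshot name with the highest iteration, or None."""
    best, best_iter = None, -1
    for name in names:
        match = _SNAPSHOT.search(name)
        if match and int(match.group(1)) > best_iter:
            best, best_iter = name, int(match.group(1))
    return best


def latest_checkpoint_in(output_dir):
    """Look inside <output_dir>/checkpoints for the newest snapshot."""
    ckpt_dir = Path(output_dir) / "checkpoints"
    return latest_checkpoint(p.name for p in ckpt_dir.iterdir())
```

Iterations are compared as integers, so `snapshot_iter_10000.pdz` ranks above `snapshot_iter_9000.pdz` even though it sorts lower as a string.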
 ## Pretrained Models
 The pretrained model can be downloaded here [wavernn_csmsc_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/wavernn/wavernn_csmsc_ckpt_0.2.0.zip).
 The static model can be downloaded here [wavernn_csmsc_static_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/wavernn/wavernn_csmsc_static_0.2.0.zip).
 Model | Step | eval/loss
 :-------------:|:------------:| :------------:
 default| 1(gpu) x 400000|2.602768
 WaveRNN checkpoint contains files listed below.
 ```text
 wavernn_csmsc_ckpt_0.2.0
 ├── default.yaml                   # default config used to train wavernn
 ├── feats_stats.npy                # statistics used to normalize spectrogram when training wavernn
 └── snapshot_iter_400000.pdz       # parameters of wavernn
 ```
--- a/examples/ljspeech/tts0/README.md
+++ b/examples/ljspeech/tts0/README.md
@ -0,0 +1,247 @@
 # Tacotron2 with LJSpeech-1.1
 This example contains code used to train a [Tacotron2](https://arxiv.org/abs/1712.05884) model with [LJSpeech-1.1](https://keithito.com/LJ-Speech-Dataset/).
 ## Dataset
 ### Download and Extract
 Download LJSpeech-1.1 from the [official website](https://keithito.com/LJ-Speech-Dataset/).
 ### Get MFA Result and Extract
 We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get phonemes for Tacotron2; the durations from MFA are not needed here.
 You can download the alignment result from [ljspeech_alignment.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/LJSpeech-1.1/ljspeech_alignment.tar.gz), or train your own MFA model by referring to the [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) in our repo.
 ## Get Started
 Assume the path to the dataset is `~/datasets/LJSpeech-1.1`.
 Assume the path to the MFA result of LJSpeech-1.1 is `./ljspeech_alignment`.
 Run the command below to
 1. **source path**.
 2. preprocess the dataset.
 3. train the model.
 4. synthesize wavs.
    - synthesize waveform from `metadata.jsonl`.
    - synthesize waveform from a text file.
 ```bash
 ./run.sh
 ```
 You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage, for example, running the following command will only preprocess the dataset.
 ```bash
 ./run.sh --stage 0 --stop-stage 0
 ```
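The stage range can be read as an inclusive interval: a stage runs when `stage <= n <= stop_stage`. The sketch below illustrates that rule only. Stage 0 being preprocessing follows from the example above; the numbering of the remaining stages is an assumption based on the order of the list.

```python
# Stage 0 is preprocessing per the README; the other numbers are assumed.
STAGES = {
    0: "preprocess the dataset",
    1: "train the model",
    2: "synthesize waveform from metadata.jsonl",
    3: "synthesize waveform from a text file",
}


def selected_stages(stage, stop_stage):
    """Return the stage numbers that run for an inclusive stage range."""
    return [n for n in sorted(STAGES) if stage <= n <= stop_stage]


for n in selected_stages(0, 0):
    print(n, STAGES[n])
```

An empty result, for example when `stage` is greater than `stop_stage`, means nothing runs.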
 ### Data Preprocessing
 ```bash
 ./local/preprocess.sh ${conf_path}
 ```
 When it is done, a `dump` folder is created in the current directory. The structure of the dump folder is listed below.
 ```text
 dump
 ├── dev
 │   ├── norm
 │   └── raw
 ├── phone_id_map.txt
 ├── speaker_id_map.txt
 ├── test
 │   ├── norm
 │   └── raw
 └── train
    ├── norm
    ├── raw
    └── speech_stats.npy
 ```
 The dataset is split into 3 parts, namely `train`, `dev`, and `test`, each of which contains a `norm` and `raw` subfolder. The raw folder contains speech features of each utterance, while the norm folder contains normalized ones. The statistics used to normalize features are computed from the training set, which is located in `dump/train/*_stats.npy`.
 Also, there is a `metadata.jsonl` in each subfolder. It is a table-like file that contains phones, text_lengths, speech_lengths, durations, the path of speech features, speaker, and the id of each utterance.
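The normalization step described above (statistics computed on the training set, then applied to every split) can be sketched as follows. This is a dependency-free illustration that assumes per-dimension mean and standard deviation normalization; the function names are hypothetical, and the actual preprocessing scripts operate on `.npy` arrays.

```python
from statistics import mean, pstdev


def compute_stats(frames):
    """Per-dimension mean and std over frames (each frame is a list of floats)."""
    columns = list(zip(*frames))
    return [mean(c) for c in columns], [pstdev(c) for c in columns]


def normalize(frames, mu, sigma):
    """Apply (x - mean) / std per dimension; a zero std is treated as 1."""
    return [
        [(v - m) / (s if s > 0 else 1.0) for v, m, s in zip(frame, mu, sigma)]
        for frame in frames
    ]


# Toy "training set" of two frames with two feature dimensions.
train_frames = [[1.0, 10.0], [3.0, 30.0]]
mu, sigma = compute_stats(train_frames)

# The same training statistics are reused for dev and test features.
dev_frames = [[2.0, 20.0]]
print(normalize(train_frames, mu, sigma), normalize(dev_frames, mu, sigma))
```

Reusing the training statistics for `dev` and `test` keeps all splits on the same scale.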
 ### Model Training
 ```bash
 CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path}
 ```
 `./local/train.sh` calls `${BIN_DIR}/train.py`.
 Here's the complete help message.
 ```text
 usage: train.py [-h] [--config CONFIG] [--train-metadata TRAIN_METADATA]
                [--dev-metadata DEV_METADATA] [--output-dir OUTPUT_DIR]
                [--ngpu NGPU] [--phones-dict PHONES_DICT]
 Train a Tacotron2 model.
 optional arguments:
  -h, --help            show this help message and exit
  --config CONFIG       tacotron2 config file.
  --train-metadata TRAIN_METADATA
                        training data.
  --dev-metadata DEV_METADATA
                        dev data.
  --output-dir OUTPUT_DIR
                        output dir.
  --ngpu NGPU           if ngpu == 0, use cpu.
  --phones-dict PHONES_DICT
                        phone vocabulary file.
 ```
 1. `--config` is a config file in yaml format to overwrite the default config, which can be found at `conf/default.yaml`.
 2. `--train-metadata` and `--dev-metadata` should be the metadata file in the normalized subfolder of `train` and `dev` in the `dump` folder.
 3. `--output-dir` is the directory to save the results of the experiment. Checkpoints are saved in `checkpoints/` inside this directory.
 4. `--ngpu` is the number of GPUs to use; if `ngpu == 0`, the CPU is used.
 5. `--phones-dict` is the path of the phone vocabulary file.
 ### Synthesizing
 We use [parallel wavegan](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/ljspeech/voc1) as the neural vocoder.
 Download pretrained parallel wavegan model from [pwg_ljspeech_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_ljspeech_ckpt_0.5.zip) and unzip it.
 ```bash
 unzip pwg_ljspeech_ckpt_0.5.zip
 ```
 Parallel WaveGAN checkpoint contains files listed below.
 ```text
 pwg_ljspeech_ckpt_0.5
 ├── pwg_default.yaml              # default config used to train parallel wavegan
 ├── pwg_snapshot_iter_400000.pdz  # generator parameters of parallel wavegan
 └── pwg_stats.npy                 # statistics used to normalize spectrogram when training parallel wavegan
 ```
 `./local/synthesize.sh` calls `${BIN_DIR}/../synthesize.py`, which can synthesize waveform from `metadata.jsonl`.
 ```bash
 CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name}
 ```
 ```text
 usage: synthesize.py [-h]
                     [--am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk,tacotron2_csmsc}]
                     [--am_config AM_CONFIG] [--am_ckpt AM_CKPT]
                     [--am_stat AM_STAT] [--phones_dict PHONES_DICT]
                     [--tones_dict TONES_DICT] [--speaker_dict SPEAKER_DICT]
                     [--voice-cloning VOICE_CLONING]
                     [--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc}]
                     [--voc_config VOC_CONFIG] [--voc_ckpt VOC_CKPT]
                     [--voc_stat VOC_STAT] [--ngpu NGPU]
                     [--test_metadata TEST_METADATA] [--output_dir OUTPUT_DIR]
 Synthesize with acoustic model & vocoder
 optional arguments:
  -h, --help            show this help message and exit
  --am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk,tacotron2_csmsc}
                        Choose acoustic model type of tts task.
  --am_config AM_CONFIG
                        Config of acoustic model. Use deault config when it is
                        None.
  --am_ckpt AM_CKPT     Checkpoint file of acoustic model.
  --am_stat AM_STAT     mean and standard deviation used to normalize
                        spectrogram when training acoustic model.
  --phones_dict PHONES_DICT
                        phone vocabulary file.
  --tones_dict TONES_DICT
                        tone vocabulary file.
  --speaker_dict SPEAKER_DICT
                        speaker id map file.
  --voice-cloning VOICE_CLONING
                        whether training voice cloning model.
  --voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc}
                        Choose vocoder type of tts task.
  --voc_config VOC_CONFIG
                        Config of voc. Use deault config when it is None.
  --voc_ckpt VOC_CKPT   Checkpoint file of voc.
  --voc_stat VOC_STAT   mean and standard deviation used to normalize
                        spectrogram when training voc.
  --ngpu NGPU           if ngpu == 0, use cpu.
  --test_metadata TEST_METADATA
                        test metadata.
  --output_dir OUTPUT_DIR
                        output dir.
 ```
 `./local/synthesize_e2e.sh` calls `${BIN_DIR}/../synthesize_e2e.py`, which can synthesize waveform from text file.
 ```bash
 CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize_e2e.sh ${conf_path} ${train_output_path} ${ckpt_name}
 ```
 ```text
 usage: synthesize_e2e.py [-h]
                         [--am {speedyspeech_csmsc,speedyspeech_aishell3,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk,tacotron2_csmsc}]
                         [--am_config AM_CONFIG] [--am_ckpt AM_CKPT]
                         [--am_stat AM_STAT] [--phones_dict PHONES_DICT]
                         [--tones_dict TONES_DICT]
                         [--speaker_dict SPEAKER_DICT] [--spk_id SPK_ID]
                         [--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc,style_melgan_csmsc,hifigan_csmsc}]
                         [--voc_config VOC_CONFIG] [--voc_ckpt VOC_CKPT]
                         [--voc_stat VOC_STAT] [--lang LANG]
                         [--inference_dir INFERENCE_DIR] [--ngpu NGPU]
                         [--text TEXT] [--output_dir OUTPUT_DIR]
 Synthesize with acoustic model & vocoder
 optional arguments:
  -h, --help            show this help message and exit
  --am {speedyspeech_csmsc,speedyspeech_aishell3,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk,tacotron2_csmsc}
                        Choose acoustic model type of tts task.
  --am_config AM_CONFIG
                        Config of acoustic model. Use deault config when it is
                        None.
  --am_ckpt AM_CKPT     Checkpoint file of acoustic model.
  --am_stat AM_STAT     mean and standard deviation used to normalize
                        spectrogram when training acoustic model.
  --phones_dict PHONES_DICT
                        phone vocabulary file.
  --tones_dict TONES_DICT
                        tone vocabulary file.
  --speaker_dict SPEAKER_DICT
                        speaker id map file.
  --spk_id SPK_ID       spk id for multi speaker acoustic model
  --voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc,style_melgan_csmsc,hifigan_csmsc}
                        Choose vocoder type of tts task.
  --voc_config VOC_CONFIG
                        Config of voc. Use deault config when it is None.
  --voc_ckpt VOC_CKPT   Checkpoint file of voc.
  --voc_stat VOC_STAT   mean and standard deviation used to normalize
                        spectrogram when training voc.
  --lang LANG           Choose model language. zh or en
  --inference_dir INFERENCE_DIR
                        dir to save inference models
  --ngpu NGPU           if ngpu == 0, use cpu.
  --text TEXT           text to synthesize, a 'utt_id sentence' pair per line.
  --output_dir OUTPUT_DIR
                        output dir.
 ```
 1. `--am` is the acoustic model type, in the format `{model_name}_{dataset}`.
 2. `--am_config`, `--am_ckpt`, `--am_stat` and `--phones_dict` are arguments for the acoustic model, which correspond to the 4 files in the Tacotron2 pretrained model.
 3. `--voc` is the vocoder type, in the format `{model_name}_{dataset}`.
 4. `--voc_config`, `--voc_ckpt`, `--voc_stat` are arguments for the vocoder, which correspond to the 3 files in the parallel wavegan pretrained model.
 5. `--lang` is the model language, which can be `zh` or `en`.
 6. `--test_metadata` should be the metadata file in the normalized subfolder of `test` in the `dump` folder.
 7. `--text` is the text file, which contains sentences to synthesize.
 8. `--output_dir` is the directory to save synthesized audio files.
 9. `--ngpu` is the number of GPUs to use; if `ngpu == 0`, the CPU is used.
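Because model names such as `mb_melgan` contain underscores themselves, a `{model_name}_{dataset}` value is easiest to split on the last underscore. The helper below is a hypothetical sketch, not part of PaddleSpeech, and assumes dataset names never contain an underscore (true for `csmsc`, `ljspeech`, `aishell3`, and `vctk`).

```python
def split_model_tag(tag):
    """Split '{model_name}_{dataset}' on the last underscore."""
    model_name, sep, dataset = tag.rpartition("_")
    if not sep or not model_name or not dataset:
        raise ValueError(f"expected '{{model_name}}_{{dataset}}', got {tag!r}")
    return model_name, dataset


for tag in ("tacotron2_ljspeech", "mb_melgan_csmsc"):
    print(tag, "->", split_model_tag(tag))
```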
 ## Pretrained Model
 Pretrained Tacotron2 model trained on audio with no silence at the edges:
 - [tacotron2_ljspeech_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/tacotron2/tacotron2_ljspeech_ckpt_0.2.0.zip)
 Model | Step | eval/loss | eval/l1_loss | eval/mse_loss | eval/bce_loss| eval/attn_loss 
 :-------------:| :------------:| :-----: | :-----: | :--------: |:--------:|:---------:
 default| 1(gpu) x 60300|0.554092|0.394260|0.141046|0.018747|3.8e-05
 Tacotron2 checkpoint contains files listed below.
 ```text
 tacotron2_ljspeech_ckpt_0.2.0
 ├── default.yaml            # default config used to train Tacotron2
 ├── phone_id_map.txt        # phone vocabulary file when training Tacotron2
 ├── snapshot_iter_60300.pdz # model parameters and optimizer states
 └── speech_stats.npy        # statistics used to normalize spectrogram when training Tacotron2
 ```
 You can use the following scripts to synthesize for `${BIN_DIR}/../sentences_en.txt` using pretrained Tacotron2 and parallel wavegan models.
 ```bash
 source path.sh
 FLAGS_allocator_strategy=naive_best_fit \
 FLAGS_fraction_of_gpu_memory_to_use=0.01 \
 python3 ${BIN_DIR}/../synthesize_e2e.py \
  --am=tacotron2_ljspeech \
  --am_config=tacotron2_ljspeech_ckpt_0.2.0/default.yaml \
  --am_ckpt=tacotron2_ljspeech_ckpt_0.2.0/snapshot_iter_60300.pdz \
  --am_stat=tacotron2_ljspeech_ckpt_0.2.0/speech_stats.npy  \
  --voc=pwgan_ljspeech \
  --voc_config=pwg_ljspeech_ckpt_0.5/pwg_default.yaml \
  --voc_ckpt=pwg_ljspeech_ckpt_0.5/pwg_snapshot_iter_400000.pdz  \
  --voc_stat=pwg_ljspeech_ckpt_0.5/pwg_stats.npy \
  --lang=en \
  --text=${BIN_DIR}/../sentences_en.txt \
  --output_dir=exp/default/test_e2e \
  --phones_dict=tacotron2_ljspeech_ckpt_0.2.0/phone_id_map.txt
 ```
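The file passed to `--text` holds one `utt_id sentence` pair per line, as stated in the help message. A minimal sketch of reading that layout is shown below; the sample sentences are made up, and the parser is an illustration rather than the code used by `synthesize_e2e.py`.

```python
def parse_sentences(lines):
    """Yield (utt_id, sentence) pairs from 'utt_id sentence' lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        parts = line.split(maxsplit=1)
        if len(parts) != 2:
            raise ValueError(f"expected 'utt_id sentence', got {line!r}")
        yield parts[0], parts[1]


# Hypothetical file content.
sample = [
    "001 Life was like a box of chocolates.",
    "",
    "002 With great power there must come great responsibility.",
]
for utt_id, sentence in parse_sentences(sample):
    print(utt_id, sentence)
```

Splitting only on the first run of whitespace keeps spaces inside the sentence intact.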
--- a/examples/ljspeech/tts0/local/synthesize_e2e.sh
+++ b/examples/ljspeech/tts0/local/synthesize_e2e.sh
@ -0,0 +1,22 @@
 #!/bin/bash
 config_path=$1
 train_output_path=$2
 ckpt_name=$3
 # TODO: dygraph to static graph is not good for tacotron2_ljspeech now
 FLAGS_allocator_strategy=naive_best_fit \
 FLAGS_fraction_of_gpu_memory_to_use=0.01 \
 python3 ${BIN_DIR}/../synthesize_e2e.py \
    --am=tacotron2_ljspeech \
    --am_config=${config_path} \
    --am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
    --am_stat=dump/train/speech_stats.npy \
    --voc=pwgan_ljspeech \
    --voc_config=pwg_ljspeech_ckpt_0.5/pwg_default.yaml \
    --voc_ckpt=pwg_ljspeech_ckpt_0.5/pwg_snapshot_iter_400000.pdz  \
    --voc_stat=pwg_ljspeech_ckpt_0.5/pwg_stats.npy \
    --lang=en \
    --text=${BIN_DIR}/../sentences_en.txt \
    --output_dir=${train_output_path}/test_e2e \
    --phones_dict=dump/phone_id_map.txt \
    # --inference_dir=${train_output_path}/inference
--- a/examples/ljspeech/tts0/path.sh
+++ b/examples/ljspeech/tts0/path.sh
@ -9,5 +9,5 @@ export PYTHONDONTWRITEBYTECODE=1
 export PYTHONIOENCODING=UTF-8
 export PYTHONPATH=${MAIN_ROOT}:${PYTHONPATH}
-MODEL=new_tacotron2
+MODEL=tacotron2
 export BIN_DIR=${MAIN_ROOT}/paddlespeech/t2s/exps/${MODEL}
--- a/examples/ljspeech/tts3/README.md
+++ b/examples/ljspeech/tts3/README.md
@ -1,4 +1,4 @@
-# FastSpeech2 with the LJSpeech-1.1
+# FastSpeech2 with LJSpeech-1.1
 This example contains code used to train a [Fastspeech2](https://arxiv.org/abs/2006.04558) model with [LJSpeech-1.1](https://keithito.com/LJ-Speech-Dataset/).
 ## Dataset
--- a/examples/voxceleb/README.md
+++ b/examples/voxceleb/README.md
@ -0,0 +1,8 @@
 For dataset information, refer to [VoxCeleb](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/index.html#about).
 sv0 - speaker verification with a softmax backend etc., all Python code;
      for more information, refer to sv0/readme.txt
 sv1 - depends on Kaldi, speaker verification with a PLDA/SC backend;
      for more information, refer to sv1/readme.txt
--- a/examples/voxceleb/sv0/local/make_voxceleb_kaldi_trial.py
+++ b/examples/voxceleb/sv0/local/make_voxceleb_kaldi_trial.py
@ -0,0 +1,81 @@
 #!/usr/bin/python3
 # Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
 # You may obtain a copy of the License at
 #
 #     http://www.apache.org/licenses/LICENSE-2.0
 #
 # Unless required by applicable law or agreed to in writing, software
 # distributed under the License is distributed on an "AS IS" BASIS,
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
 """
 Make a VoxCeleb1 trial file in Kaldi format.
 This script converts the Kaldi trial voxceleb1_test_v2.txt or the official trial veri_test2.txt
 to the Kaldi trial format.
 """
 import argparse
 import codecs
 import os
 parser = argparse.ArgumentParser(description=__doc__)
 parser.add_argument("--voxceleb_trial",
                    default="voxceleb1_test_v2",
                    type=str,
                    help="VoxCeleb trial file. Default we use the kaldi trial voxceleb1_test_v2.txt")
 parser.add_argument("--trial",
                    default="data/test/trial",
                    type=str,
                    help="Kaldi format trial file")
 args = parser.parse_args()
 def main(voxceleb_trial, trial):
    """
         VoxCeleb provides several trial files, whose format differs from the Kaldi format.
         The VoxCeleb format is as follows:
        --------------------------------
        target_or_nontarget path1 path2
        --------------------------------
        target_or_nontarget is an integer: 1 target                 path1 is equal to path2
                                           0 nontarget              path1 is unequal to path2    
        path1: spkr_id/rec_id/name
        path2: spkr_id/rec_id/name
         The Kaldi format is as follows:
        ---------------------------------------
        utt_id1 utt_id2 target_or_nontarget
        ---------------------------------------
        utt_id1: utterance identification or speaker identification
        utt_id2: utterance identification or speaker identification
         target_or_nontarget is a string: 'target'    utt_id1 is equal to utt_id2
                                          'nontarget' utt_id1 is unequal to utt_id2
    """
    print("Start convert the voxceleb trial to kaldi format")
    if not os.path.exists(voxceleb_trial):
         raise RuntimeError("{} does not exist. Please input the correct file path".format(voxceleb_trial))
     trial_dirname = os.path.dirname(trial)
     if trial_dirname:
         # makedirs also creates missing parents, e.g. for the default data/test
         os.makedirs(trial_dirname, exist_ok=True)
    with codecs.open(voxceleb_trial, 'r', encoding='utf-8') as f, \
         codecs.open(trial, 'w', encoding='utf-8') as w:
         for line in f:
            target_or_nontarget, path1, path2 = line.strip().split()
            utt_id1 = "-".join(path1.split("/"))
            utt_id2 = "-".join(path2.split("/"))
            target = "nontarget"
            if int(target_or_nontarget):
                target = "target"
            w.write("{} {} {}\n".format(utt_id1, utt_id2, target))
    print("Convert the voxceleb trial to kaldi format successfully")
 if __name__ == "__main__":
    main(args.voxceleb_trial, args.trial)
--- a/paddleaudio/features/core.py
+++ b/paddleaudio/features/core.py
@ -415,11 +415,11 @@ def mfcc(x,
        **kwargs)
    # librosa mfcc:
-    spect = librosa.feature.melspectrogram(x,sr=16000,n_fft=512,
+    spect = librosa.feature.melspectrogram(y=x,sr=16000,n_fft=512,
                                              win_length=512,
                                              hop_length=320,
                                              n_mels=64, fmin=50)
-    b = librosa.feature.mfcc(x,
+    b = librosa.feature.mfcc(y=x,
        sr=16000,
        S=spect,
        n_mfcc=20,
--- a/paddlespeech/cli/asr/infer.py
+++ b/paddlespeech/cli/asr/infer.py
@ -311,8 +311,10 @@ class ASRExecutor(BaseExecutor):
                    audio = audio[:, 0]
                # pcm16 -> pcm 32
                audio = self._pcm16to32(audio)
-                audio = librosa.resample(audio, audio_sample_rate,
-                                         self.sample_rate)
+                audio = librosa.resample(
+                    audio,
+                    orig_sr=audio_sample_rate,
+                    target_sr=self.sample_rate)
                audio_sample_rate = self.sample_rate
                # pcm32 -> pcm 16
                audio = self._pcm32to16(audio)
--- a/paddlespeech/cli/cls/infer.py
+++ b/paddlespeech/cli/cls/infer.py
@ -114,8 +114,9 @@ class CLSExecutor(BaseExecutor):
        """
            Download and returns pretrained resources path of current task.
        """
-        assert tag in pretrained_models, 'Can not find pretrained resources of {}.'.format(
-            tag)
+        support_models = list(pretrained_models.keys())
+        assert tag in pretrained_models, 'The model "{}" you want to use has not been supported, please choose other models.\nThe support models includes:\n\t\t{}\n'.format(
+            tag, '\n\t\t'.join(support_models))
        res_path = os.path.join(MODEL_HOME, tag)
        decompressed_path = download_and_decompress(pretrained_models[tag],
--- a/paddlespeech/cli/st/infer.py
+++ b/paddlespeech/cli/st/infer.py
@ -112,8 +112,9 @@ class STExecutor(BaseExecutor):
        """
            Download and returns pretrained resources path of current task.
        """
-        assert tag in pretrained_models, "Can not find pretrained resources of {}.".format(
-            tag)
+        support_models = list(pretrained_models.keys())
+        assert tag in pretrained_models, 'The model "{}" you want to use has not been supported, please choose other models.\nThe support models includes:\n\t\t{}\n'.format(
+            tag, '\n\t\t'.join(support_models))
        res_path = os.path.join(MODEL_HOME, tag)
        decompressed_path = download_and_decompress(pretrained_models[tag],
--- a/paddlespeech/cli/text/infer.py
+++ b/paddlespeech/cli/text/infer.py
@ -124,8 +124,9 @@ class TextExecutor(BaseExecutor):
        """
            Download and returns pretrained resources path of current task.
        """
-        assert tag in pretrained_models, 'Can not find pretrained resources of {}.'.format(
-            tag)
+        support_models = list(pretrained_models.keys())
+        assert tag in pretrained_models, 'The model "{}" you want to use has not been supported, please choose other models.\nThe support models includes:\n\t\t{}\n'.format(
+            tag, '\n\t\t'.join(support_models))
        res_path = os.path.join(MODEL_HOME, tag)
        decompressed_path = download_and_decompress(pretrained_models[tag],
--- a/paddlespeech/cli/tts/infer.py
+++ b/paddlespeech/cli/tts/infer.py
@ -117,6 +117,36 @@ pretrained_models = {
        'speaker_dict':
        'speaker_id_map.txt',
    },
    # tacotron2
    "tacotron2_csmsc-zh": {
        'url':
        'https://paddlespeech.bj.bcebos.com/Parakeet/released_models/tacotron2/tacotron2_csmsc_ckpt_0.2.0.zip',
        'md5':
        '0df4b6f0bcbe0d73c5ed6df8867ab91a',
        'config':
        'default.yaml',
        'ckpt':
        'snapshot_iter_30600.pdz',
        'speech_stats':
        'speech_stats.npy',
        'phones_dict':
        'phone_id_map.txt',
    },
    "tacotron2_ljspeech-en": {
        'url':
        'https://paddlespeech.bj.bcebos.com/Parakeet/released_models/tacotron2/tacotron2_ljspeech_ckpt_0.2.0.zip',
        'md5':
        '6a5eddd81ae0e81d16959b97481135f3',
        'config':
        'default.yaml',
        'ckpt':
        'snapshot_iter_60300.pdz',
        'speech_stats':
        'speech_stats.npy',
        'phones_dict':
        'phone_id_map.txt',
    },
    # pwgan
    "pwgan_csmsc-zh": {
        'url':
@ -205,6 +235,20 @@ pretrained_models = {
        'speech_stats':
        'feats_stats.npy',
    },
    # wavernn
    "wavernn_csmsc-zh": {
        'url':
        'https://paddlespeech.bj.bcebos.com/Parakeet/released_models/wavernn/wavernn_csmsc_ckpt_0.2.0.zip',
        'md5':
        'ee37b752f09bcba8f2af3b777ca38e13',
        'config':
        'default.yaml',
        'ckpt':
        'snapshot_iter_400000.pdz',
        'speech_stats':
        'feats_stats.npy',
    }
 }
 model_alias = {
@ -217,6 +261,10 @@ model_alias = {
    "paddlespeech.t2s.models.fastspeech2:FastSpeech2",
    "fastspeech2_inference":
    "paddlespeech.t2s.models.fastspeech2:FastSpeech2Inference",
    "tacotron2":
    "paddlespeech.t2s.models.tacotron2:Tacotron2",
    "tacotron2_inference":
    "paddlespeech.t2s.models.tacotron2:Tacotron2Inference",
    # voc
    "pwgan":
    "paddlespeech.t2s.models.parallel_wavegan:PWGGenerator",
@ -234,6 +282,10 @@ model_alias = {
    "paddlespeech.t2s.models.hifigan:HiFiGANGenerator",
    "hifigan_inference":
    "paddlespeech.t2s.models.hifigan:HiFiGANInference",
    "wavernn":
    "paddlespeech.t2s.models.wavernn:WaveRNN",
    "wavernn_inference":
    "paddlespeech.t2s.models.wavernn:WaveRNNInference",
 }
@ -253,9 +305,13 @@ class TTSExecutor(BaseExecutor):
            type=str,
            default='fastspeech2_csmsc',
            choices=[
-                'speedyspeech_csmsc', 'fastspeech2_csmsc',
-                'fastspeech2_ljspeech', 'fastspeech2_aishell3',
-                'fastspeech2_vctk'
+                'speedyspeech_csmsc',
+                'fastspeech2_csmsc',
+                'fastspeech2_ljspeech',
+                'fastspeech2_aishell3',
+                'fastspeech2_vctk',
+                'tacotron2_csmsc',
+                'tacotron2_ljspeech',
            ],
            help='Choose acoustic model type of tts task.')
        self.parser.add_argument(
@ -300,8 +356,14 @@ class TTSExecutor(BaseExecutor):
            type=str,
            default='pwgan_csmsc',
            choices=[
-                'pwgan_csmsc', 'pwgan_ljspeech', 'pwgan_aishell3', 'pwgan_vctk',
-                'mb_melgan_csmsc', 'style_melgan_csmsc', 'hifigan_csmsc'
+                'pwgan_csmsc',
+                'pwgan_ljspeech',
+                'pwgan_aishell3',
+                'pwgan_vctk',
+                'mb_melgan_csmsc',
+                'style_melgan_csmsc',
+                'hifigan_csmsc',
+                'wavernn_csmsc',
            ],
            help='Choose vocoder type of tts task.')
@ -340,8 +402,9 @@ class TTSExecutor(BaseExecutor):
        """
        Download and returns pretrained resources path of current task.
        """
-        assert tag in pretrained_models, 'Can not find pretrained resources of {}.'.format(
-            tag)
+        support_models = list(pretrained_models.keys())
+        assert tag in pretrained_models, 'The model "{}" you want to use has not been supported, please choose other models.\nThe support models includes:\n\t\t{}\n'.format(
+            tag, '\n\t\t'.join(support_models))
        res_path = os.path.join(MODEL_HOME, tag)
        decompressed_path = download_and_decompress(pretrained_models[tag],
@ -368,7 +431,7 @@ class TTSExecutor(BaseExecutor):
        """
        Init model and other resources from a specific path.
        """
-        if hasattr(self, 'am') and hasattr(self, 'voc'):
+        if hasattr(self, 'am_inference') and hasattr(self, 'voc_inference'):
            logger.info('Models had been initialized.')
            return
        # am
@ -488,6 +551,8 @@ class TTSExecutor(BaseExecutor):
                vocab_size=vocab_size,
                tone_size=tone_size,
                **self.am_config["model"])
        elif am_name == 'tacotron2':
            am = am_class(idim=vocab_size, odim=odim, **self.am_config["model"])
        am.set_state_dict(paddle.load(self.am_ckpt)["main_params"])
        am.eval()
@ -505,10 +570,15 @@ class TTSExecutor(BaseExecutor):
        voc_class = dynamic_import(voc_name, model_alias)
        voc_inference_class = dynamic_import(voc_name + '_inference',
                                             model_alias)
-        voc = voc_class(**self.voc_config["generator_params"])
-        voc.set_state_dict(paddle.load(self.voc_ckpt)["generator_params"])
-        voc.remove_weight_norm()
-        voc.eval()
+        if voc_name != 'wavernn':
+            voc = voc_class(**self.voc_config["generator_params"])
+            voc.set_state_dict(paddle.load(self.voc_ckpt)["generator_params"])
+            voc.remove_weight_norm()
+            voc.eval()
+        else:
+            voc = voc_class(**self.voc_config["model"])
+            voc.set_state_dict(paddle.load(self.voc_ckpt)["main_params"])
+            voc.eval()
        voc_mu, voc_std = np.load(self.voc_stat)
        voc_mu = paddle.to_tensor(voc_mu)
        voc_std = paddle.to_tensor(voc_std)
--- a/paddlespeech/s2t/exps/u2/model.py
+++ b/paddlespeech/s2t/exps/u2/model.py
@ -175,7 +175,7 @@ class U2Trainer(Trainer):
                        observation['batch_cost'] = observation[
                            'reader_cost'] + observation['step_cost']
                        observation['samples'] = observation['batch_size']
-                        observation['ips,sent./sec'] = observation[
+                        observation['ips,samples/s'] = observation[
                            'batch_size'] / observation['batch_cost']
                        for k, v in observation.items():
                            msg += f" {k.split(',')[0]}: "
--- a/paddlespeech/s2t/io/batchfy.py
+++ b/paddlespeech/s2t/io/batchfy.py
@ -419,7 +419,7 @@ def make_batchset(
        # sort it by input lengths (long to short)
        sorted_data = sorted(
            d.items(),
-            key=lambda data: int(data[1][batch_sort_key][batch_sort_axis]["shape"][0]),
+            key=lambda data: float(data[1][batch_sort_key][batch_sort_axis]["shape"][0]),
            reverse=not shortest_first, )
        logger.info("# utts: " + str(len(sorted_data)))
--- a/paddlespeech/s2t/io/dataloader.py
+++ b/paddlespeech/s2t/io/dataloader.py
@ -61,7 +61,7 @@ class BatchDataLoader():
    def __init__(self,
                 json_file: str,
                 train_mode: bool,
-                 sortagrad: bool=False,
+                 sortagrad: int=0,
                 batch_size: int=0,
                 maxlen_in: float=float('inf'),
                 maxlen_out: float=float('inf'),
--- a/paddlespeech/s2t/training/trainer.py
+++ b/paddlespeech/s2t/training/trainer.py
@ -252,8 +252,7 @@ class Trainer():
        if self.args.benchmark_max_step and self.iteration > self.args.benchmark_max_step:
            logger.info(
                f"Reach benchmark-max-step: {self.args.benchmark_max_step}")
-            sys.exit(
-                f"Reach benchmark-max-step: {self.args.benchmark_max_step}")
+            sys.exit(0)
    def do_train(self):
        """The training process control by epoch."""
@ -282,7 +281,7 @@ class Trainer():
                        observation['batch_cost'] = observation[
                            'reader_cost'] + observation['step_cost']
                        observation['samples'] = observation['batch_size']
-                        observation['ips[sent./sec]'] = observation[
+                        observation['ips samples/s'] = observation[
                            'batch_size'] / observation['batch_cost']
                        for k, v in observation.items():
                            msg += f" {k}: "
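The `benchmark_max_step` hunk in this file replaces `sys.exit(<message>)` with `sys.exit(0)`. Passing a string to `sys.exit` prints it to stderr and ends the process with status 1, which a benchmark harness would read as a failure. The sketch below checks both statuses in a child interpreter; the `exit_status` helper is hypothetical.

```python
import subprocess
import sys


def exit_status(snippet):
    """Run a one-line snippet in a child interpreter and return its exit status."""
    result = subprocess.run(
        [sys.executable, "-c", snippet], stderr=subprocess.DEVNULL)
    return result.returncode


# A string argument is printed to stderr and mapped to status 1.
status_with_message = exit_status(
    "import sys; sys.exit('Reach benchmark-max-step: 10')")
# An explicit 0 reports success.
status_with_zero = exit_status("import sys; sys.exit(0)")
print(status_with_message, status_with_zero)
```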
--- a/paddlespeech/s2t/transform/perturb.py
+++ b/paddlespeech/s2t/transform/perturb.py
@ -90,7 +90,8 @@ class SpeedPerturbation():
        # Note1: resample requires the sampling-rate of input and output,
        #        but actually only the ratio is used.
-        y = librosa.resample(x, ratio, 1, res_type=self.res_type)
+        y = librosa.resample(
+            x, orig_sr=ratio, target_sr=1, res_type=self.res_type)
        if self.keep_length:
            diff = abs(len(x) - len(y))
--- a/paddlespeech/s2t/transform/spectrogram.py
+++ b/paddlespeech/s2t/transform/spectrogram.py
@ -38,7 +38,7 @@ def stft(x,
    x = np.stack(
        [
            librosa.stft(
-                x[:, ch],
+                y=x[:, ch],
                n_fft=n_fft,
                hop_length=n_shift,
                win_length=win_length,
@ -67,7 +67,7 @@ def istft(x, n_shift, win_length=None, window="hann", center=True):
    x = np.stack(
        [
            librosa.istft(
-                x[:, ch].T,  # [Time, Freq] -> [Freq, Time]
+                stft_matrix=x[:, ch].T,  # [Time, Freq] -> [Freq, Time]
                hop_length=n_shift,
                win_length=win_length,
                window=window,
@ -95,7 +95,8 @@ def stft2logmelspectrogram(x_stft,
    # spc: (Time, Channel, Freq) or (Time, Freq)
    spc = np.abs(x_stft)
    # mel_basis: (Mel_freq, Freq)
-    mel_basis = librosa.filters.mel(fs, n_fft, n_mels, fmin, fmax)
+    mel_basis = librosa.filters.mel(
+        sr=fs, n_fft=n_fft, n_mels=n_mels, fmin=fmin, fmax=fmax)
    # lmspc: (Time, Channel, Mel_freq) or (Time, Mel_freq)
    lmspc = np.log10(np.maximum(eps, np.dot(spc, mel_basis.T)))
--- a/paddlespeech/t2s/__init__.py
+++ b/paddlespeech/t2s/__init__.py
@ -13,7 +13,6 @@
 # limitations under the License.
 import logging
-from . import data
 from . import datasets
 from . import exps
 from . import frontend
--- a/paddlespeech/t2s/audio/audio.py
+++ b/paddlespeech/t2s/audio/audio.py
@ -53,8 +53,8 @@ class AudioProcessor(object):
    def _create_mel_filter(self):
        mel_filter = librosa.filters.mel(
-            self.sample_rate,
+            sr=self.sample_rate,
-            self.n_fft,
+            n_fft=self.n_fft,
            n_mels=self.n_mels,
            fmin=self.fmin,
            fmax=self.fmax)
--- a/paddlespeech/t2s/data/__init__.py
+++ b/paddlespeech/t2s/data/__init__.py
@ -1,17 +0,0 @@
-# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-"""t2s's infrastructure for data processing.
-"""
-from .batch import *
-from .dataset import *
--- a/paddlespeech/t2s/datasets/__init__.py
+++ b/paddlespeech/t2s/datasets/__init__.py
@ -11,5 +11,4 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
-from .common import *
 from .ljspeech import *
--- a/paddlespeech/t2s/datasets/am_batch_fn.py
+++ b/paddlespeech/t2s/datasets/am_batch_fn.py
@ -14,7 +14,7 @@
 import numpy as np
 import paddle
-from paddlespeech.t2s.data.batch import batch_sequences
+from paddlespeech.t2s.datasets.batch import batch_sequences
 def tacotron2_single_spk_batch_fn(examples):
--- a/paddlespeech/t2s/datasets/batch.py
+++ b/paddlespeech/t2s/datasets/batch.py
--- a/paddlespeech/t2s/datasets/common.py
+++ b/paddlespeech/t2s/datasets/common.py
@ -1,92 +0,0 @@
-# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-from pathlib import Path
-from typing import List
-import librosa
-import numpy as np
-from paddle.io import Dataset
-__all__ = ["AudioSegmentDataset", "AudioDataset", "AudioFolderDataset"]
-class AudioSegmentDataset(Dataset):
-    """A simple dataset adaptor for audio files to train vocoders.
-    Read -> trim silence -> normalize -> extract a segment
-    """
-    def __init__(self,
-                 file_paths: List[Path],
-                 sample_rate: int,
-                 length: int,
-                 top_db: float):
-        self.file_paths = file_paths
-        self.sr = sample_rate
-        self.top_db = top_db
-        self.length = length  # samples in the clip
-    def __getitem__(self, i):
-        fpath = self.file_paths[i]
-        y, sr = librosa.load(fpath, self.sr)
-        y, _ = librosa.effects.trim(y, top_db=self.top_db)
-        y = librosa.util.normalize(y)
-        y = y.astype(np.float32)
-        # pad or trim
-        if y.size <= self.length:
-            y = np.pad(y, [0, self.length - len(y)], mode='constant')
-        else:
-            start = np.random.randint(0, 1 + len(y) - self.length)
-            y = y[start:start + self.length]
-        return y
-    def __len__(self):
-        return len(self.file_paths)
-class AudioDataset(Dataset):
-    """A simple dataset adaptor for the audio files.
-    Read -> trim silence -> normalize
-    """
-    def __init__(self,
-                 file_paths: List[Path],
-                 sample_rate: int,
-                 top_db: float=60):
-        self.file_paths = file_paths
-        self.sr = sample_rate
-        self.top_db = top_db
-    def __getitem__(self, i):
-        fpath = self.file_paths[i]
-        y, sr = librosa.load(fpath, self.sr)
-        y, _ = librosa.effects.trim(y, top_db=self.top_db)
-        y = librosa.util.normalize(y)
-        y = y.astype(np.float32)
-        return y
-    def __len__(self):
-        return len(self.file_paths)
-class AudioFolderDataset(AudioDataset):
-    def __init__(
-            self,
-            root,
-            sample_rate,
-            top_db=60,
-            extension=".wav", ):
-        root = Path(root).expanduser()
-        file_paths = sorted(list(root.rglob("*{}".format(extension))))
-        super().__init__(file_paths, sample_rate, top_db)
--- a/paddlespeech/t2s/datasets/data_table.py
+++ b/paddlespeech/t2s/datasets/data_table.py
@ -22,26 +22,17 @@ from paddle.io import Dataset
 class DataTable(Dataset):
    """Dataset to load and convert data for general purpose.
-
-    Parameters
-    ----------
-    data : List[Dict[str, Any]]
-        Metadata, a list of meta datum, each of which is composed of
-        several fields
-    fields : List[str], optional
-        Fields to use, if not specified, all the fields in the data are
-        used, by default None
-    converters : Dict[str, Callable], optional
-        Converters used to process each field, by default None
-    use_cache : bool, optional
-        Whether to use cache, by default False
-    Raises
-    ------
-    ValueError
-        If there is some field that does not exist in data.
-    ValueError
-        If there is some field in converters that does not exist in fields.
+    Args:
+        data (List[Dict[str, Any]]): Metadata, a list of meta datum, each of which is composed of  several fields
+        fields (List[str], optional): Fields to use, if not specified, all the fields in the data are used, by default None
+        converters (Dict[str, Callable], optional): Converters used to process each field, by default None
+        use_cache (bool, optional): Whether to use cache, by default False
+
+    Raises:
+        ValueError:
+            If there is some field that does not exist in data.
+        ValueError:
+            If there is some field in converters that does not exist in fields.
    """
    def __init__(self,
@ -95,15 +86,11 @@ class DataTable(Dataset):
        """Convert a meta datum to an example by applying the corresponding 
        converters to each fields requested.
-        Parameters
-        ----------
-        meta_datum : Dict[str, Any]
-            Meta datum
-        Returns
-        -------
-        Dict[str, Any]
-            Converted example
+        Args:
+            meta_datum (Dict[str, Any]): Meta datum
+        Returns:
+            Dict[str, Any]: Converted example
        """
        example = {}
        for field in self.fields:
@ -118,16 +105,11 @@ class DataTable(Dataset):
    def __getitem__(self, idx: int) -> Dict[str, Any]:
        """Get an example given an index.
-        Parameters
-        ----------
-        idx : int
-            Index of the example to get
-        Returns
-        -------
-        Dict[str, Any]
-            A converted example
+        Args:
+            idx (int): Index of the example to get
+        Returns:
+            Dict[str, Any]: A converted example
        """
        if self.use_cache and self.caches[idx] is not None:
            return self.caches[idx]
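The `DataTable` docstrings above describe field selection, per-field converters, an optional cache, and two `ValueError` cases. The sketch below is a simplified stand-in that follows that description only. `MiniDataTable` and the sample record are hypothetical, and this is not the project's implementation.

```python
from typing import Any, Callable, Dict, List, Optional


class MiniDataTable:
    """Simplified stand-in that follows the documented DataTable behaviour."""

    def __init__(self,
                 data: List[Dict[str, Any]],
                 fields: Optional[List[str]] = None,
                 converters: Optional[Dict[str, Callable]] = None,
                 use_cache: bool = False):
        self.data = data
        if fields is None:
            # Use every field present in the first record.
            fields = list(data[0].keys())
        for field in fields:
            if field not in data[0]:
                raise ValueError(f"field {field!r} does not exist in data")
        self.fields = fields
        self.converters = dict(converters or {})
        for field in self.converters:
            if field not in self.fields:
                raise ValueError(f"converter field {field!r} is not in fields")
        self.use_cache = use_cache
        self.caches = [None] * len(data) if use_cache else []

    def _convert(self, meta_datum: Dict[str, Any]) -> Dict[str, Any]:
        example = {}
        for field in self.fields:
            converter = self.converters.get(field)
            value = meta_datum[field]
            example[field] = converter(value) if converter else value
        return example

    def __getitem__(self, idx: int) -> Dict[str, Any]:
        if self.use_cache and self.caches[idx] is not None:
            return self.caches[idx]
        example = self._convert(self.data[idx])
        if self.use_cache:
            self.caches[idx] = example
        return example

    def __len__(self) -> int:
        return len(self.data)


records = [{"utt_id": "a", "num_frames": "120", "text": "hi"}]
table = MiniDataTable(
    records,
    fields=["utt_id", "num_frames"],
    converters={"num_frames": int},
    use_cache=True)
first = table[0]
print(first)
```

With `use_cache=True`, a second `table[0]` returns the same converted object rather than running the converters again.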
--- a/paddlespeech/t2s/datasets/dataset.py
+++ b/paddlespeech/t2s/datasets/dataset.py
@ -258,4 +258,4 @@ class ChainDataset(Dataset):
                return dataset[i]
            i -= len(dataset)
-        raise IndexError("dataset index out of range")
+        raise IndexError("dataset index out of range")
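The tail of `ChainDataset.__getitem__` shown in this hunk walks the child datasets, subtracting each length until the index falls inside one of them. The sketch below reproduces that lookup in isolation. `MiniChainDataset` is a hypothetical stand-in, and its negative-index check is an added assumption.

```python
class MiniChainDataset:
    """Concatenate several indexable datasets behind one index space."""

    def __init__(self, datasets):
        self._datasets = list(datasets)

    def __len__(self):
        return sum(len(dataset) for dataset in self._datasets)

    def __getitem__(self, i):
        if i < 0:
            # Assumption: negative indices are rejected in this sketch.
            raise IndexError("dataset index out of range")
        for dataset in self._datasets:
            if i < len(dataset):
                return dataset[i]
            # Skip past this child and keep looking.
            i -= len(dataset)
        raise IndexError("dataset index out of range")


chained = MiniChainDataset([["a", "b"], ["c"], ["d", "e"]])
print(len(chained), chained[2], chained[4])
```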
--- a/paddlespeech/t2s/datasets/get_feats.py
+++ b/paddlespeech/t2s/datasets/get_feats.py
--- a/paddlespeech/t2s/datasets/preprocess_utils.py
+++ b/paddlespeech/t2s/datasets/preprocess_utils.py
@ -18,14 +18,10 @@ import re
 def get_phn_dur(file_name):
    '''
    read MFA duration.txt
-    Parameters
-    ----------
-    file_name : str or Path
-        path of gen_duration_from_textgrid.py's result
-    Returns
-    ----------
-    Dict
-        sentence: {'utt': ([char], [int])}
+    Args:
+        file_name (str or Path): path of gen_duration_from_textgrid.py's result
+    Returns:
+        Dict: sentence: {'utt': ([char], [int])}
    '''
    f = open(file_name, 'r')
    sentence = {}
@ -48,10 +44,8 @@ def get_phn_dur(file_name):
 def merge_silence(sentence):
    '''
    merge silences
-    Parameters
-    ----------
-    sentence : Dict
-        sentence: {'utt': (([char], [int]), str)}
+    Args:
+        sentence (Dict): sentence: {'utt': (([char], [int]), str)}
    '''
    for utt in sentence:
        cur_phn, cur_dur, speaker = sentence[utt]
@ -81,12 +75,9 @@ def merge_silence(sentence):
 def get_input_token(sentence, output_path, dataset="baker"):
    '''
    get phone set from training data and save it
-    Parameters
-    ----------
-    sentence : Dict
-        sentence: {'utt': ([char], [int])}
-    output_path : str or path
-        path to save phone_id_map
+    Args:
+        sentence (Dict): sentence: {'utt': ([char], [int])}
+        output_path (str or path):path to save phone_id_map
    '''
    phn_token = set()
    for utt in sentence:
@ -112,14 +103,10 @@ def get_phones_tones(sentence,
                     dataset="baker"):
    '''
    get phone set and tone set from training data and save it
-    Parameters
-    ----------
-    sentence : Dict
-        sentence: {'utt': ([char], [int])}
-    phones_output_path : str or path
-        path to save phone_id_map
-    tones_output_path : str or path
-        path to save tone_id_map
+    Args:
+        sentence (Dict): sentence: {'utt': ([char], [int])}
+        phones_output_path (str or path): path to save phone_id_map
+        tones_output_path (str or path): path to save tone_id_map
    '''
    phn_token = set()
    tone_token = set()
@ -162,14 +149,10 @@ def get_spk_id_map(speaker_set, output_path):
 def compare_duration_and_mel_length(sentences, utt, mel):
    '''
    check duration error, correct sentences[utt] if possible, else pop sentences[utt]
-    Parameters
-    ----------
-    sentences : Dict
-        sentences[utt] = [phones_list ,durations_list]
-    utt : str
-        utt_id
-    mel : np.ndarry
-        features (num_frames, n_mels)
+    Args:
+        sentences (Dict): sentences[utt] = [phones_list ,durations_list]
+        utt (str): utt_id
+        mel (np.ndarry): features (num_frames, n_mels)
    '''
    if utt in sentences:
--- a/paddlespeech/t2s/datasets/vocoder_batch_fn.py
+++ b/paddlespeech/t2s/datasets/vocoder_batch_fn.py
@ -29,15 +29,11 @@ class Clip(object):
            hop_size=256,
            aux_context_window=0, ):
        """Initialize customized collater for DataLoader.
-        Parameters
-        ----------
-        batch_max_steps : int
-            The maximum length of input signal in batch.
-        hop_size : int
-            Hop size of auxiliary features.
-        aux_context_window : int
-            Context window size for auxiliary feature conv.
+        Args:
+            batch_max_steps (int): The maximum length of input signal in batch.
+            hop_size (int): Hop size of auxiliary features.
+            aux_context_window (int): Context window size for auxiliary feature conv.
        """
        if batch_max_steps % hop_size != 0:
@ -56,18 +52,15 @@ class Clip(object):
    def __call__(self, batch):
        """Convert into batch tensors.
-        Parameters
-        ----------
-        batch : list
-            list of tuple of the pair of audio and features. Audio shape (T, ), features shape(T', C).
-        Returns
-        ----------
-        Tensor
-            Auxiliary feature batch (B, C, T'), where
-            T = (T' - 2 * aux_context_window) * hop_size.
-        Tensor
-            Target signal batch (B, 1, T).
+        Args:
+            batch (list): list of tuple of the pair of audio and features. Audio shape (T, ), features shape(T', C).
+        Returns:
+            Tensor:
+                Auxiliary feature batch (B, C, T'), where
+                T = (T' - 2 * aux_context_window) * hop_size.
+            Tensor:
+                Target signal batch (B, 1, T).
        """
        # check length
@ -104,11 +97,10 @@ class Clip(object):
    def _adjust_length(self, x, c):
        """Adjust the audio and feature lengths.
-        Note
-        -------
-        Basically we assume that the length of x and c are adjusted
-        through preprocessing stage, but if we use other library processed
-        features, this process will be needed.
+        Note:
+            Basically we assume that the length of x and c are adjusted
+            through preprocessing stage, but if we use other library processed
+            features, this process will be needed.
        """
        if len(x) < c.shape[0] * self.hop_size:
@ -162,22 +154,14 @@ class WaveRNNClip(Clip):
        # voc_pad = 2  this will pad the input so that the resnet can 'see' wider than input length
        # max_offsets = n_frames - 2 - (mel_win + 2 * hp.voc_pad) = n_frames - 15
        """Convert into batch tensors.
-
-        Parameters
-        ----------
-        batch : list
-            list of tuple of the pair of audio and features.
-            Audio shape (T, ), features shape(T', C).
-
-        Returns
-        ----------
-        Tensor
-            Input signal batch (B, 1, T).
-        Tensor
-            Target signal batch (B, 1, T).
-        Tensor
-            Auxiliary feature batch (B, C, T'), where
-            T = (T' - 2 * aux_context_window) * hop_size.
+        Args:
+            batch (list): list of tuple of the pair of audio and features. Audio shape (T, ), features shape(T', C).
+
+        Returns:
+            Tensor: Input signal batch (B, 1, T).
+            Tensor: Target signal batch (B, 1, T).
+            Tensor: Auxiliary feature batch (B, C, T'),
+                where T = (T' - 2 * aux_context_window) * hop_size.
        """
        # check length
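The docstrings in this file relate the audio length `T` to the feature length `T'` by `T = (T' - 2 * aux_context_window) * hop_size`. The sketch below works that arithmetic in both directions. The helper names and the sample numbers are hypothetical, and rejecting non-multiples of `hop_size` is an assumption of this sketch.

```python
def target_length(num_frames, hop_size=256, aux_context_window=0):
    """Audio samples T covered by T' feature frames, per the docstring formula."""
    return (num_frames - 2 * aux_context_window) * hop_size


def frames_needed(num_samples, hop_size=256, aux_context_window=0):
    """Inverse relation: feature frames T' required for T audio samples."""
    if num_samples % hop_size != 0:
        # Assumption: this sketch rejects lengths that are not a multiple of hop_size.
        raise ValueError("num_samples must be a multiple of hop_size")
    return num_samples // hop_size + 2 * aux_context_window


# 104 frames with a context window of 2 on each side leave 100 usable frames.
samples = target_length(104, hop_size=256, aux_context_window=2)
frames = frames_needed(samples, hop_size=256, aux_context_window=2)
print(samples, frames)
```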
--- a/paddlespeech/t2s/exps/fastspeech2/preprocess.py
+++ b/paddlespeech/t2s/exps/fastspeech2/preprocess.py
@ -27,9 +27,9 @@ import tqdm
 import yaml
 from yacs.config import CfgNode
-from paddlespeech.t2s.data.get_feats import Energy
+from paddlespeech.t2s.datasets.get_feats import Energy
-from paddlespeech.t2s.data.get_feats import LogMelFBank
+from paddlespeech.t2s.datasets.get_feats import LogMelFBank
-from paddlespeech.t2s.data.get_feats import Pitch
+from paddlespeech.t2s.datasets.get_feats import Pitch
 from paddlespeech.t2s.datasets.preprocess_utils import compare_duration_and_mel_length
 from paddlespeech.t2s.datasets.preprocess_utils import get_input_token
 from paddlespeech.t2s.datasets.preprocess_utils import get_phn_dur
--- a/paddlespeech/t2s/exps/fastspeech2/train.py
+++ b/paddlespeech/t2s/exps/fastspeech2/train.py
@ -160,9 +160,8 @@ def train_sp(args, config):
    if dist.get_rank() == 0:
        trainer.extend(evaluator, trigger=(1, "epoch"))
        trainer.extend(VisualDL(output_dir), trigger=(1, "iteration"))
-        trainer.extend(
+    trainer.extend(
-            Snapshot(max_size=config.num_snapshots), trigger=(1, 'epoch'))
+        Snapshot(max_size=config.num_snapshots), trigger=(1, 'epoch'))
    # print(trainer.extensions)
    trainer.run()
--- a/paddlespeech/t2s/exps/gan_vocoder/hifigan/train.py
+++ b/paddlespeech/t2s/exps/gan_vocoder/hifigan/train.py
@ -231,9 +231,9 @@ def train_sp(args, config):
        trainer.extend(
            evaluator, trigger=(config.eval_interval_steps, 'iteration'))
        trainer.extend(VisualDL(output_dir), trigger=(1, 'iteration'))
-        trainer.extend(
+    trainer.extend(
-            Snapshot(max_size=config.num_snapshots),
+        Snapshot(max_size=config.num_snapshots),
-            trigger=(config.save_interval_steps, 'iteration'))
+        trigger=(config.save_interval_steps, 'iteration'))
    print("Trainer Done!")
    trainer.run()
--- a/paddlespeech/t2s/exps/gan_vocoder/multi_band_melgan/train.py
+++ b/paddlespeech/t2s/exps/gan_vocoder/multi_band_melgan/train.py
@ -219,9 +219,9 @@ def train_sp(args, config):
        trainer.extend(
            evaluator, trigger=(config.eval_interval_steps, 'iteration'))
        trainer.extend(VisualDL(output_dir), trigger=(1, 'iteration'))
-        trainer.extend(
+    trainer.extend(
-            Snapshot(max_size=config.num_snapshots),
+        Snapshot(max_size=config.num_snapshots),
-            trigger=(config.save_interval_steps, 'iteration'))
+        trigger=(config.save_interval_steps, 'iteration'))
    print("Trainer Done!")
    trainer.run()
--- a/paddlespeech/t2s/exps/gan_vocoder/parallelwave_gan/synthesize_from_wav.py
+++ b/paddlespeech/t2s/exps/gan_vocoder/parallelwave_gan/synthesize_from_wav.py
@ -23,7 +23,7 @@ import soundfile as sf
 import yaml
 from yacs.config import CfgNode
-from paddlespeech.t2s.data.get_feats import LogMelFBank
+from paddlespeech.t2s.datasets.get_feats import LogMelFBank
 from paddlespeech.t2s.models.parallel_wavegan import PWGGenerator
 from paddlespeech.t2s.models.parallel_wavegan import PWGInference
 from paddlespeech.t2s.modules.normalizer import ZScore
--- a/paddlespeech/t2s/exps/gan_vocoder/parallelwave_gan/train.py
+++ b/paddlespeech/t2s/exps/gan_vocoder/parallelwave_gan/train.py
@ -194,11 +194,10 @@ def train_sp(args, config):
        trainer.extend(
            evaluator, trigger=(config.eval_interval_steps, 'iteration'))
        trainer.extend(VisualDL(output_dir), trigger=(1, 'iteration'))
-        trainer.extend(
+    trainer.extend(
-            Snapshot(max_size=config.num_snapshots),
+        Snapshot(max_size=config.num_snapshots),
-            trigger=(config.save_interval_steps, 'iteration'))
+        trigger=(config.save_interval_steps, 'iteration'))
    # print(trainer.extensions.keys())
    print("Trainer Done!")
    trainer.run()
--- a/paddlespeech/t2s/exps/gan_vocoder/preprocess.py
+++ b/paddlespeech/t2s/exps/gan_vocoder/preprocess.py
@ -27,7 +27,7 @@ import tqdm
 import yaml
 from yacs.config import CfgNode
-from paddlespeech.t2s.data.get_feats import LogMelFBank
+from paddlespeech.t2s.datasets.get_feats import LogMelFBank
 from paddlespeech.t2s.datasets.preprocess_utils import get_phn_dur
 from paddlespeech.t2s.datasets.preprocess_utils import merge_silence
 from paddlespeech.t2s.utils import str2bool
--- a/paddlespeech/t2s/exps/gan_vocoder/style_melgan/train.py
+++ b/paddlespeech/t2s/exps/gan_vocoder/style_melgan/train.py
@ -212,9 +212,9 @@ def train_sp(args, config):
        trainer.extend(
            evaluator, trigger=(config.eval_interval_steps, 'iteration'))
        trainer.extend(VisualDL(output_dir), trigger=(1, 'iteration'))
-        trainer.extend(
+    trainer.extend(
-            Snapshot(max_size=config.num_snapshots),
+        Snapshot(max_size=config.num_snapshots),
-            trigger=(config.save_interval_steps, 'iteration'))
+        trigger=(config.save_interval_steps, 'iteration'))
    print("Trainer Done!")
    trainer.run()
--- a/paddlespeech/t2s/exps/speedyspeech/preprocess.py
+++ b/paddlespeech/t2s/exps/speedyspeech/preprocess.py
@ -27,7 +27,7 @@ import tqdm
 import yaml
 from yacs.config import CfgNode
-from paddlespeech.t2s.data.get_feats import LogMelFBank
+from paddlespeech.t2s.datasets.get_feats import LogMelFBank
 from paddlespeech.t2s.datasets.preprocess_utils import compare_duration_and_mel_length
 from paddlespeech.t2s.datasets.preprocess_utils import get_phn_dur
 from paddlespeech.t2s.datasets.preprocess_utils import get_phones_tones
--- a/paddlespeech/t2s/exps/speedyspeech/train.py
+++ b/paddlespeech/t2s/exps/speedyspeech/train.py
@ -171,8 +171,8 @@ def train_sp(args, config):
    if dist.get_rank() == 0:
        trainer.extend(evaluator, trigger=(1, "epoch"))
        trainer.extend(VisualDL(output_dir), trigger=(1, "iteration"))
-        trainer.extend(
+    trainer.extend(
-            Snapshot(max_size=config.num_snapshots), trigger=(1, 'epoch'))
+        Snapshot(max_size=config.num_snapshots), trigger=(1, 'epoch'))
    trainer.run()
--- a/paddlespeech/t2s/exps/synthesize.py
+++ b/paddlespeech/t2s/exps/synthesize.py
@ -38,9 +38,9 @@ model_alias = {
    "fastspeech2_inference":
    "paddlespeech.t2s.models.fastspeech2:FastSpeech2Inference",
    "tacotron2":
-    "paddlespeech.t2s.models.new_tacotron2:Tacotron2",
+    "paddlespeech.t2s.models.tacotron2:Tacotron2",
    "tacotron2_inference":
-    "paddlespeech.t2s.models.new_tacotron2:Tacotron2Inference",
+    "paddlespeech.t2s.models.tacotron2:Tacotron2Inference",
    # voc
    "pwgan":
    "paddlespeech.t2s.models.parallel_wavegan:PWGGenerator",
--- a/paddlespeech/t2s/exps/synthesize_e2e.py
+++ b/paddlespeech/t2s/exps/synthesize_e2e.py
@ -39,9 +39,9 @@ model_alias = {
    "fastspeech2_inference":
    "paddlespeech.t2s.models.fastspeech2:FastSpeech2Inference",
    "tacotron2":
-    "paddlespeech.t2s.models.new_tacotron2:Tacotron2",
+    "paddlespeech.t2s.models.tacotron2:Tacotron2",
    "tacotron2_inference":
-    "paddlespeech.t2s.models.new_tacotron2:Tacotron2Inference",
+    "paddlespeech.t2s.models.tacotron2:Tacotron2Inference",
    # voc
    "pwgan":
    "paddlespeech.t2s.models.parallel_wavegan:PWGGenerator",
@ -229,6 +229,11 @@ def evaluate(args):
    output_dir = Path(args.output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
+    merge_sentences = False
+    # Avoid not stopping at the end of a sub sentence when tacotron2_ljspeech dygraph to static graph
+    # but still not stopping in the end (NOTE by yuantian01 Feb 9 2022)
+    if am_name == 'tacotron2':
+        merge_sentences = True
    for utt_id, sentence in sentences:
        get_tone_ids = False
        if am_name == 'speedyspeech':
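The `model_alias` entries in `synthesize.py` and `synthesize_e2e.py` use a `module.path:ClassName` string, so renaming `new_tacotron2` to `tacotron2` only requires editing the string. The sketch below shows one way such a string could be resolved. `resolve_alias` is a hypothetical helper, not the project's loader, and the demo uses standard-library targets so it runs without PaddleSpeech installed.

```python
import importlib


def resolve_alias(spec):
    """Resolve a 'module.path:AttributeName' string to the named object."""
    module_name, _, attr_name = spec.partition(":")
    if not module_name or not attr_name:
        raise ValueError(f"expected 'module:name', got {spec!r}")
    module = importlib.import_module(module_name)
    return getattr(module, attr_name)


# Stand-in alias table pointing at standard-library classes.
demo_alias = {
    "ordered": "collections:OrderedDict",
    "fraction": "fractions:Fraction",
}

ordered_cls = resolve_alias(demo_alias["ordered"])
fraction_cls = resolve_alias(demo_alias["fraction"])
print(ordered_cls.__name__, fraction_cls(1, 2))
```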
--- a/paddlespeech/t2s/exps/new_tacotron2/__init__.py
+++ b/paddlespeech/t2s/exps/new_tacotron2/__init__.py
--- a/paddlespeech/t2s/exps/new_tacotron2/normalize.py
+++ b/paddlespeech/t2s/exps/new_tacotron2/normalize.py
--- a/paddlespeech/t2s/exps/new_tacotron2/preprocess.py
+++ b/paddlespeech/t2s/exps/new_tacotron2/preprocess.py
@ -27,7 +27,7 @@ import tqdm
 import yaml
 from yacs.config import CfgNode
-from paddlespeech.t2s.data.get_feats import LogMelFBank
+from paddlespeech.t2s.datasets.get_feats import LogMelFBank
 from paddlespeech.t2s.datasets.preprocess_utils import compare_duration_and_mel_length
 from paddlespeech.t2s.datasets.preprocess_utils import get_input_token
 from paddlespeech.t2s.datasets.preprocess_utils import get_phn_dur
--- a/paddlespeech/t2s/exps/new_tacotron2/train.py
+++ b/paddlespeech/t2s/exps/new_tacotron2/train.py
@ -30,9 +30,9 @@ from yacs.config import CfgNode
 from paddlespeech.t2s.datasets.am_batch_fn import tacotron2_multi_spk_batch_fn
 from paddlespeech.t2s.datasets.am_batch_fn import tacotron2_single_spk_batch_fn
 from paddlespeech.t2s.datasets.data_table import DataTable
-from paddlespeech.t2s.models.new_tacotron2 import Tacotron2
+from paddlespeech.t2s.models.tacotron2 import Tacotron2
-from paddlespeech.t2s.models.new_tacotron2 import Tacotron2Evaluator
+from paddlespeech.t2s.models.tacotron2 import Tacotron2Evaluator
-from paddlespeech.t2s.models.new_tacotron2 import Tacotron2Updater
+from paddlespeech.t2s.models.tacotron2 import Tacotron2Updater
 from paddlespeech.t2s.training.extensions.snapshot import Snapshot
 from paddlespeech.t2s.training.extensions.visualizer import VisualDL
 from paddlespeech.t2s.training.optimizer import build_optimizers
@ -155,9 +155,8 @@ def train_sp(args, config):
    if dist.get_rank() == 0:
        trainer.extend(evaluator, trigger=(1, "epoch"))
        trainer.extend(VisualDL(output_dir), trigger=(1, "iteration"))
-        trainer.extend(
+    trainer.extend(
-            Snapshot(max_size=config.num_snapshots), trigger=(1, 'epoch'))
+        Snapshot(max_size=config.num_snapshots), trigger=(1, 'epoch'))
    # print(trainer.extensions)
    trainer.run()
--- a/paddlespeech/t2s/exps/transformer_tts/preprocess.py
+++ b/paddlespeech/t2s/exps/transformer_tts/preprocess.py
@ -26,20 +26,17 @@ import tqdm
 import yaml
 from yacs.config import CfgNode as Configuration
-from paddlespeech.t2s.data.get_feats import LogMelFBank
+from paddlespeech.t2s.datasets.get_feats import LogMelFBank
 from paddlespeech.t2s.frontend import English
 def get_lj_sentences(file_name, frontend):
-    '''
-    read MFA duration.txt
-    Parameters
-    ----------
-    file_name : str or Path
-    Returns
-    ----------
-    Dict
-        sentence: {'utt': ([char], [int])}
+    '''read MFA duration.txt
+
+    Args:
+        file_name (str or Path)
+    Returns:
+        Dict: sentence: {'utt': ([char], [int])}
    '''
    f = open(file_name, 'r')
    sentence = {}
@ -59,14 +56,11 @@ def get_lj_sentences(file_name, frontend):
 def get_input_token(sentence, output_path):
-    '''
-    get phone set from training data and save it
-    Parameters
-    ----------
-    sentence : Dict
-        sentence: {'utt': ([char], str)}
-    output_path : str or path
-        path to save phone_id_map
+    '''get phone set from training data and save it
+
+    Args:
+        sentence (Dict): sentence: {'utt': ([char], str)}
+        output_path (str or path): path to save phone_id_map
    '''
    phn_token = set()
    for utt in sentence:
--- a/paddlespeech/t2s/exps/transformer_tts/train.py
+++ b/paddlespeech/t2s/exps/transformer_tts/train.py
@ -148,9 +148,8 @@ def train_sp(args, config):
    if dist.get_rank() == 0:
        trainer.extend(evaluator, trigger=(1, "epoch"))
        trainer.extend(VisualDL(output_dir), trigger=(1, "iteration"))
-        trainer.extend(
+    trainer.extend(
-            Snapshot(max_size=config.num_snapshots), trigger=(1, 'epoch'))
+        Snapshot(max_size=config.num_snapshots), trigger=(1, 'epoch'))
    # print(trainer.extensions)
    trainer.run()
--- a/paddlespeech/t2s/exps/voice_cloning.py
+++ b/paddlespeech/t2s/exps/voice_cloning.py
@ -34,9 +34,9 @@ model_alias = {
    "fastspeech2_inference":
    "paddlespeech.t2s.models.fastspeech2:FastSpeech2Inference",
    "tacotron2":
-    "paddlespeech.t2s.models.new_tacotron2:Tacotron2",
+    "paddlespeech.t2s.models.tacotron2:Tacotron2",
    "tacotron2_inference":
-    "paddlespeech.t2s.models.new_tacotron2:Tacotron2Inference",
+    "paddlespeech.t2s.models.tacotron2:Tacotron2Inference",
    # voc
    "pwgan":
    "paddlespeech.t2s.models.parallel_wavegan:PWGGenerator",
--- a/paddlespeech/t2s/exps/waveflow/ljspeech.py
+++ b/paddlespeech/t2s/exps/waveflow/ljspeech.py
@ -17,8 +17,8 @@ import numpy as np
 import pandas
 from paddle.io import Dataset
-from paddlespeech.t2s.data.batch import batch_spec
+from paddlespeech.t2s.datasets.batch import batch_spec
-from paddlespeech.t2s.data.batch import batch_wav
+from paddlespeech.t2s.datasets.batch import batch_wav
 class LJSpeech(Dataset):
--- a/paddlespeech/t2s/exps/waveflow/train.py
+++ b/paddlespeech/t2s/exps/waveflow/train.py
@ -19,7 +19,7 @@ from paddle import distributed as dist
 from paddle.io import DataLoader
 from paddle.io import DistributedBatchSampler
-from paddlespeech.t2s.data import dataset
+from paddlespeech.t2s.datasets import dataset
 from paddlespeech.t2s.exps.waveflow.config import get_cfg_defaults
 from paddlespeech.t2s.exps.waveflow.ljspeech import LJSpeech
 from paddlespeech.t2s.exps.waveflow.ljspeech import LJSpeechClipCollector
--- a/paddlespeech/t2s/exps/wavernn/synthesize.py
+++ b/paddlespeech/t2s/exps/wavernn/synthesize.py
@ -31,7 +31,7 @@ from paddlespeech.t2s.models.wavernn import WaveRNN
 def main():
    parser = argparse.ArgumentParser(description="Synthesize with WaveRNN.")
-    parser.add_argument("--config", type=str, help="GANVocoder config file.")
+    parser.add_argument("--config", type=str, help="Vocoder config file.")
    parser.add_argument("--checkpoint", type=str, help="snapshot to load.")
    parser.add_argument("--test-metadata", type=str, help="dev data.")
    parser.add_argument("--output-dir", type=str, help="output dir.")
--- a/paddlespeech/t2s/exps/wavernn/train.py
+++ b/paddlespeech/t2s/exps/wavernn/train.py
@ -168,9 +168,9 @@ def train_sp(args, config):
        trainer.extend(
            evaluator, trigger=(config.eval_interval_steps, 'iteration'))
        trainer.extend(VisualDL(output_dir), trigger=(1, 'iteration'))
-        trainer.extend(
+    trainer.extend(
-            Snapshot(max_size=config.num_snapshots),
+        Snapshot(max_size=config.num_snapshots),
-            trigger=(config.save_interval_steps, 'iteration'))
+        trigger=(config.save_interval_steps, 'iteration'))
    print("Trainer Done!")
    trainer.run()
@ -179,7 +179,7 @@ def train_sp(args, config):
 def main():
    # parse args and config and redirect to train_sp
-    parser = argparse.ArgumentParser(description="Train a HiFiGAN model.")
+    parser = argparse.ArgumentParser(description="Train a WaveRNN model.")
    parser.add_argument(
        "--config", type=str, help="config file to overwrite default config.")
    parser.add_argument("--train-metadata", type=str, help="training data.")
--- a/paddlespeech/t2s/frontend/arpabet.py
+++ b/paddlespeech/t2s/frontend/arpabet.py
@ -133,16 +133,11 @@ class ARPABET(Phonetics):
    def phoneticize(self, sentence, add_start_end=False):
        """ Normalize the input text sequence and convert it into pronunciation sequence.
        Args:
            sentence (str): The input text sequence.
-        Parameters
+        Returns:
-        -----------
+            List[str]: The list of pronunciation sequence.
        sentence: str
            The input text sequence.
        Returns
        ----------
        List[str]
            The list of pronunciation sequence.
        """
        phonemes = [
            self._remove_vowels(item) for item in self.backend(sentence)
@ -156,16 +151,12 @@ class ARPABET(Phonetics):
    def numericalize(self, phonemes):
        """ Convert pronunciation sequence into pronunciation id sequence.
-        
+
-        Parameters
+        Args:
-        -----------
+            phonemes (List[str]): The list of pronunciation sequence.
        phonemes: List[str]
            The list of pronunciation sequence.
-        Returns
+        Returns:
-        ----------
+            List[int]: The list of pronunciation id sequence.
        List[int]
            The list of pronunciation id sequence.
        """
        ids = [self.vocab.lookup(item) for item in phonemes]
        return ids
@ -173,30 +164,23 @@ class ARPABET(Phonetics):
    def reverse(self, ids):
        """ Reverse the list of pronunciation id sequence to a list of pronunciation sequence.
-        Parameters
+        Args:
-        -----------
+            ids (List[int]): The list of pronunciation id sequence.
        ids: List[int]
            The list of pronunciation id sequence.
-        Returns
+        Returns: 
-        ----------
+            List[str]: 
-        List[str]
+                The list of pronunciation sequence.
            The list of pronunciation sequence.
        """
        return [self.vocab.reverse(i) for i in ids]
    def __call__(self, sentence, add_start_end=False):
        """ Convert the input text sequence into pronunciation id sequence.
-        Parameters
+        Args:
-        -----------
+            sentence (str): The input text sequence.
        sentence: str
            The input text sequence.
-        Returns
+        Returns:
-        ----------
+            List[int]: The list of pronunciation id sequence.
        List[str]
            The list of pronunciation id sequence.
        """
        return self.numericalize(
            self.phoneticize(sentence, add_start_end=add_start_end))
@ -229,15 +213,11 @@ class ARPABETWithStress(Phonetics):
    def phoneticize(self, sentence, add_start_end=False):
        """ Normalize the input text sequence and convert it into pronunciation sequence.
-        Parameters
+        Args: 
-        -----------
+            sentence (str): The input text sequence.
        sentence: str
            The input text sequence.
-        Returns
+        Returns: 
-        ----------
+            List[str]: The list of pronunciation sequence.
        List[str]
            The list of pronunciation sequence.
        """
        phonemes = self.backend(sentence)
        if add_start_end:
@ -249,47 +229,33 @@ class ARPABETWithStress(Phonetics):
    def numericalize(self, phonemes):
        """ Convert pronunciation sequence into pronunciation id sequence.
-        
+
-        Parameters
+        Args:
-        -----------
+            phonemes (List[str]): The list of pronunciation sequence.
        phonemes: List[str]
            The list of pronunciation sequence.
-        Returns
+        Returns:
-        ----------
+            List[int]: The list of pronunciation id sequence.
        List[int]
            The list of pronunciation id sequence.
        """
        ids = [self.vocab.lookup(item) for item in phonemes]
        return ids
    def reverse(self, ids):
        """ Reverse the list of pronunciation id sequence to a list of pronunciation sequence.
-        
+        Args:
-        Parameters
+            ids (List[int]): The list of pronunciation id sequence.
        -----------
        ids: List[int]
            The list of pronunciation id sequence.
-        Returns
+        Returns: 
-        ----------
+            List[str]: The list of pronunciation sequence.
        List[str]
            The list of pronunciation sequence.
        """
        return [self.vocab.reverse(i) for i in ids]
    def __call__(self, sentence, add_start_end=False):
        """ Convert the input text sequence into pronunciation id sequence.
        Args:
            sentence (str): The input text sequence.
-        Parameters
+        Returns: 
-        -----------
+            List[int]: The list of pronunciation id sequence.
        sentence: str
            The input text sequence.
        Returns
        ----------
        List[str]
            The list of pronunciation id sequence.
        """
        return self.numericalize(
            self.phoneticize(sentence, add_start_end=add_start_end))
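The `numericalize` and `reverse` methods documented above are inverse list mappings over a vocabulary. A minimal sketch of that round trip follows; `TinyVocab` is a hypothetical stand-in written for illustration, not the project's `Vocab` class, and its fallback to `<unk>` for unseen symbols is an assumption.

```python
from typing import Dict, Iterable, List


class TinyVocab:
    """Hypothetical stand-in for the frontend vocabulary (not the real Vocab)."""

    def __init__(self, symbols: Iterable[str], unk_symbol: str = "<unk>"):
        self.unk_symbol = unk_symbol
        # Index 0 is reserved for the unknown symbol (an assumption of this sketch).
        self.id_to_symbol: List[str] = [unk_symbol, *symbols]
        self.symbol_to_id: Dict[str, int] = {
            s: i for i, s in enumerate(self.id_to_symbol)
        }

    def lookup(self, symbol: str) -> int:
        # Unseen symbols map to the id of the unknown symbol.
        return self.symbol_to_id.get(symbol, self.symbol_to_id[self.unk_symbol])

    def reverse(self, index: int) -> str:
        return self.id_to_symbol[index]


def numericalize(vocab: TinyVocab, phonemes: List[str]) -> List[int]:
    """Convert a pronunciation sequence into an id sequence."""
    return [vocab.lookup(item) for item in phonemes]


def reverse(vocab: TinyVocab, ids: List[int]) -> List[str]:
    """Convert an id sequence back into a pronunciation sequence."""
    return [vocab.reverse(i) for i in ids]


vocab = TinyVocab(["HH", "AH", "L", "OW"])
ids = numericalize(vocab, ["HH", "AH", "L", "OW"])
restored = reverse(vocab, ids)
```

In this sketch the round trip is lossless only for symbols present in the vocabulary; anything else collapses to the unknown id.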
--- a/paddlespeech/t2s/frontend/phonectic.py
+++ b/paddlespeech/t2s/frontend/phonectic.py
@ -65,14 +65,10 @@ class English(Phonetics):
    def phoneticize(self, sentence):
        """ Normalize the input text sequence and convert it into pronunciation sequence.
-        Parameters
+        Args:
-        -----------
+            sentence (str): The input text sequence.
-        sentence: str
+        Returns: 
-            The input text sequence.
+            List[str]: The list of pronunciation sequence.
        Returns
        ----------
        List[str]
            The list of pronunciation sequence.
        """
        start = self.vocab.start_symbol
        end = self.vocab.end_symbol
@ -83,11 +79,6 @@ class English(Phonetics):
        return phonemes
    def _p2id(self, phonemes: List[str]) -> np.array:
        # replace unk phone with sp
        phonemes = [
            phn if (phn in self.vocab_phones and phn not in self.punc) else "sp"
            for phn in phonemes
        ]
        phone_ids = [self.vocab_phones[item] for item in phonemes]
        return np.array(phone_ids, np.int64)
@ -102,6 +93,12 @@ class English(Phonetics):
            # remove start_symbol and end_symbol
            phones = phones[1:-1]
            phones = [phn for phn in phones if not phn.isspace()]
            # replace unk phone with sp
            phones = [
                phn
                if (phn in self.vocab_phones and phn not in self.punc) else "sp"
                for phn in phones
            ]
            phones_list.append(phones)
        if merge_sentences:
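The hunk above moves the "replace unk phone with sp" step so that it runs per sentence before the phones are collected. As a rough sketch of that filter in isolation (the function name `replace_unknown_phones` and the sample vocabulary are hypothetical, chosen only for illustration):

```python
from typing import Collection, List


def replace_unknown_phones(phones: List[str],
                           vocab_phones: Collection[str],
                           punc: Collection[str],
                           fallback: str = "sp") -> List[str]:
    """Keep a phone only if it is in the vocabulary and is not punctuation."""
    return [
        phn if (phn in vocab_phones and phn not in punc) else fallback
        for phn in phones
    ]


# Hypothetical phone table and punctuation set.
sample_vocab = {"HH": 0, "AH0": 1, "sp": 2, "!": 3}
sample_punc = {"!"}
cleaned = replace_unknown_phones(["HH", "AH0", "!", "ZZ"], sample_vocab,
                                 sample_punc)
```

Both the punctuation mark (which is in the vocabulary) and the out-of-vocabulary phone become `sp`, so a later id lookup cannot fail on a missing key.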
@ -122,14 +119,10 @@ class English(Phonetics):
    def numericalize(self, phonemes):
        """ Convert pronunciation sequence into pronunciation id sequence.
-        Parameters
+        Args:
-        -----------
+            phonemes (List[str]): The list of pronunciation sequence.
-        phonemes: List[str]
+        Returns: 
-            The list of pronunciation sequence.
+            List[int]: The list of pronunciation id sequence.
        Returns
        ----------
        List[int]
            The list of pronunciation id sequence.
        """
        ids = [
            self.vocab.lookup(item) for item in phonemes
@ -139,27 +132,19 @@ class English(Phonetics):
    def reverse(self, ids):
        """ Reverse the list of pronunciation id sequence to a list of pronunciation sequence.
-        Parameters
+        Args:
-        -----------
+            ids (List[int]): The list of pronunciation id sequence.
-        ids: List[int]
+        Returns: 
-            The list of pronunciation id sequence.
+            List[str]: The list of pronunciation sequence.
        Returns
        ----------
        List[str]
            The list of pronunciation sequence.
        """
        return [self.vocab.reverse(i) for i in ids]
    def __call__(self, sentence):
        """ Convert the input text sequence into pronunciation id sequence.
-        Parameters
+        Args:
-        -----------
+            sentence(str): The input text sequence.
-        sentence: str
+        Returns: 
-            The input text sequence.
+            List[int]: The list of pronunciation id sequence.
        Returns
        ----------
        List[str]
            The list of pronunciation id sequence.
        """
        return self.numericalize(self.phoneticize(sentence))
@ -182,28 +167,21 @@ class EnglishCharacter(Phonetics):
    def phoneticize(self, sentence):
        """ Normalize the input text sequence.
-        Parameters
+        Args:
-        -----------
+            sentence(str): The input text sequence.
-        sentence: str
+        Returns:
-            The input text sequence.
+            str: The text sequence after normalization.
        Returns
        ----------
        str
            A text sequence after normalize.
        """
        words = normalize(sentence)
        return words
    def numericalize(self, sentence):
        """ Convert a text sequence into ids.
-        Parameters
+        Args:
-        -----------
+            sentence (str): The input text sequence.
-        sentence: str
+        Returns:
-            The input text sequence.
+            List[int]:
-        Returns
+                List of a character id sequence.
        ----------
        List[int]
            List of a character id sequence.
        """
        ids = [
            self.vocab.lookup(item) for item in sentence
@ -213,27 +191,19 @@ class EnglishCharacter(Phonetics):
    def reverse(self, ids):
        """ Convert a character id sequence into text.
-        Parameters
+        Args:
-        -----------
+            ids (List[int]): List of a character id sequence.
-        ids: List[int]
+        Returns:
-            List of a character id sequence.
+            str: The input text sequence.
        Returns
        ----------
        str
            The input text sequence.
        """
        return [self.vocab.reverse(i) for i in ids]
    def __call__(self, sentence):
        """ Normalize the input text sequence and convert it into character id sequence.
-        Parameters
+        Args:
-        -----------
+            sentence (str): The input text sequence.
-        sentence: str
+        Returns: 
-            The input text sequence.
+            List[int]: List of a character id sequence.
        Returns
        ----------
        List[int]
            List of a character id sequence.
        """
        return self.numericalize(self.phoneticize(sentence))
@ -263,14 +233,10 @@ class Chinese(Phonetics):
    def phoneticize(self, sentence):
        """ Normalize the input text sequence and convert it into pronunciation sequence.
-        Parameters
+        Args:
-        -----------
+            sentence(str): The input text sequence.
-        sentence: str
+        Returns: 
-            The input text sequence.
+            List[str]: The list of pronunciation sequence.
        Returns
        ----------
        List[str]
            The list of pronunciation sequence.
        """
        # simplified = self.opencc_backend.convert(sentence)
        simplified = sentence
@ -295,28 +261,20 @@ class Chinese(Phonetics):
    def numericalize(self, phonemes):
        """ Convert pronunciation sequence into pronunciation id sequence.
-        Parameters
+        Args:
-        -----------
+            phonemes(List[str]): The list of pronunciation sequence.
-        phonemes: List[str]
+        Returns:
-            The list of pronunciation sequence.
+            List[int]: The list of pronunciation id sequence.
        Returns
        ----------
        List[int]
            The list of pronunciation id sequence.
        """
        ids = [self.vocab.lookup(item) for item in phonemes]
        return ids
    def __call__(self, sentence):
        """ Convert the input text sequence into pronunciation id sequence.
-        Parameters
+        Args:
-        -----------
+            sentence (str): The input text sequence.
-        sentence: str
+        Returns:
-            The input text sequence.
+            List[int]: The list of pronunciation id sequence.
        Returns
        ----------
        List[str]
            The list of pronunciation id sequence.
        """
        return self.numericalize(self.phoneticize(sentence))
@ -328,13 +286,9 @@ class Chinese(Phonetics):
    def reverse(self, ids):
        """ Reverse the list of pronunciation id sequence to a list of pronunciation sequence.
-        Parameters
+        Args:
-        -----------
+            ids (List[int]): The list of pronunciation id sequence.
-        ids: List[int]
+        Returns: 
-            The list of pronunciation id sequence.
+            List[str]: The list of pronunciation sequence.
        Returns
        ----------
        List[str]
            The list of pronunciation sequence.
        """
        return [self.vocab.reverse(i) for i in ids]
--- a/paddlespeech/t2s/frontend/vocab.py
+++ b/paddlespeech/t2s/frontend/vocab.py
@ -20,22 +20,12 @@ __all__ = ["Vocab"]
 class Vocab(object):
    """  Vocabulary.
-    Parameters
+    Args:
-    -----------
+        symbols (Iterable[str]): Common symbols.
-    symbols: Iterable[str]
+        padding_symbol (str, optional): Symbol for pad. Defaults to "<pad>".
-        Common symbols.
+        unk_symbol (str, optional): Symbol for unknown. Defaults to "<unk>"
-
+        start_symbol (str, optional): Symbol for start. Defaults to "<s>"
-    padding_symbol: str, optional
+        end_symbol (str, optional): Symbol for end. Defaults to "</s>"
        Symbol for pad. Defaults to "<pad>".
    unk_symbol: str, optional
        Symbol for unknow. Defaults to "<unk>"
    start_symbol: str, optional
        Symbol for start. Defaults to "<s>"
    end_symbol: str, optional
        Symbol for end. Defaults to "</s>"
    """
    def __init__(self,
--- a/paddlespeech/t2s/frontend/zh_normalization/chronology.py
+++ b/paddlespeech/t2s/frontend/zh_normalization/chronology.py
@ -44,12 +44,10 @@ RE_TIME_RANGE = re.compile(r'([0-1]?[0-9]|2[0-3])'
 def replace_time(match) -> str:
    """
-    Parameters
+    Args:
-    ----------
+        match (re.Match)
-    match : re.Match
+    Returns:
-    Returns
+        str
    ----------
    str
    """
    is_range = len(match.groups()) > 5
@ -87,12 +85,10 @@ RE_DATE = re.compile(r'(\d{4}|\d{2})年'
 def replace_date(match) -> str:
    """
-    Parameters
+    Args:
-    ----------
+        match (re.Match)
-    match : re.Match
+    Returns:
-    Returns
+        str
    ----------
    str
    """
    year = match.group(1)
    month = match.group(3)
@ -114,12 +110,10 @@ RE_DATE2 = re.compile(
 def replace_date2(match) -> str:
    """
-    Parameters
+    Args:
-    ----------
+        match (re.Match)
-    match : re.Match
+    Returns:
-    Returns
+        str
    ----------
    str
    """
    year = match.group(1)
    month = match.group(3)
--- a/paddlespeech/t2s/frontend/zh_normalization/num.py
+++ b/paddlespeech/t2s/frontend/zh_normalization/num.py
@ -36,12 +36,10 @@ RE_FRAC = re.compile(r'(-?)(\d+)/(\d+)')
 def replace_frac(match) -> str:
    """
-    Parameters
+    Args:
-    ----------
+        match (re.Match)
-    match : re.Match
+    Returns:
-    Returns
+        str
    ----------
    str
    """
    sign = match.group(1)
    nominator = match.group(2)
@ -59,12 +57,10 @@ RE_PERCENTAGE = re.compile(r'(-?)(\d+(\.\d+)?)%')
 def replace_percentage(match) -> str:
    """
-    Parameters
+    Args:
-    ----------
+        match (re.Match)
-    match : re.Match
+    Returns:
-    Returns
+        str
    ----------
    str
    """
    sign = match.group(1)
    percent = match.group(2)
@ -81,12 +77,10 @@ RE_INTEGER = re.compile(r'(-)' r'(\d+)')
 def replace_negative_num(match) -> str:
    """
-    Parameters
+    Args:
-    ----------
+        match (re.Match)
-    match : re.Match
+    Returns:
-    Returns
+        str
    ----------
    str
    """
    sign = match.group(1)
    number = match.group(2)
@ -103,12 +97,10 @@ RE_DEFAULT_NUM = re.compile(r'\d{3}\d*')
 def replace_default_num(match):
    """
-    Parameters
+    Args:
-    ----------
+        match (re.Match)
-    match : re.Match
+    Returns:
-    Returns
+        str
    ----------
    str
    """
    number = match.group(0)
    return verbalize_digit(number)
@ -124,12 +116,10 @@ RE_NUMBER = re.compile(r'(-?)((\d+)(\.\d+)?)' r'|(\.(\d+))')
 def replace_positive_quantifier(match) -> str:
    """
-    Parameters
+    Args:
-    ----------
+        match (re.Match)
-    match : re.Match
+    Returns:
-    Returns
+        str
    ----------
    str
    """
    number = match.group(1)
    match_2 = match.group(2)
@ -142,12 +132,10 @@ def replace_positive_quantifier(match) -> str:
 def replace_number(match) -> str:
    """
-    Parameters
+    Args:
-    ----------
+        match (re.Match)
-    match : re.Match
+    Returns:
-    Returns
+        str
    ----------
    str
    """
    sign = match.group(1)
    number = match.group(2)
@ -169,12 +157,10 @@ RE_RANGE = re.compile(
 def replace_range(match) -> str:
    """
-    Parameters
+    Args:
-    ----------
+        match (re.Match)
-    match : re.Match
+    Returns:
-    Returns
+        str
    ----------
    str
    """
    first, second = match.group(1), match.group(8)
    first = RE_NUMBER.sub(replace_number, first)
@ -222,7 +208,7 @@ def verbalize_digit(value_string: str, alt_one=False) -> str:
    result_symbols = [DIGITS[digit] for digit in value_string]
    result = ''.join(result_symbols)
    if alt_one:
-        result.replace("一", "幺")
+        result = result.replace("一", "幺")
    return result
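The one-line fix above matters because Python strings are immutable: `str.replace` returns a new string and leaves the original untouched, so the old code silently discarded the substitution. A small sketch of the difference, assuming a stand-in `DIGITS` table (the real mapping is defined elsewhere in `num.py` and is not shown in this hunk):

```python
# Stand-in digit table for illustration; an assumption, not copied from num.py.
DIGITS = {str(i): ch for i, ch in enumerate("零一二三四五六七八九")}


def verbalize_digit(value_string: str, alt_one: bool = False) -> str:
    result = "".join(DIGITS[digit] for digit in value_string)
    if alt_one:
        # The assignment is required: replace() does not modify `result` in place.
        result = result.replace("一", "幺")
    return result


# The pre-fix pattern: the return value is dropped, so nothing changes.
discarded = "一一零"
discarded.replace("一", "幺")

with_alt_one = verbalize_digit("110", alt_one=True)
without_alt_one = verbalize_digit("110")
```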
--- a/paddlespeech/t2s/frontend/zh_normalization/phonecode.py
+++ b/paddlespeech/t2s/frontend/zh_normalization/phonecode.py
@ -45,23 +45,19 @@ def phone2str(phone_string: str, mobile=True) -> str:
 def replace_phone(match) -> str:
    """
-    Parameters
+    Args:
-    ----------
+        match (re.Match)
-    match : re.Match
+    Returns:
-    Returns
+        str
    ----------
    str
    """
    return phone2str(match.group(0), mobile=False)
 def replace_mobile(match) -> str:
    """
-    Parameters
+    Args:
-    ----------
+        match (re.Match)
-    match : re.Match
+    Returns:
-    Returns
+        str
    ----------
    str
    """
    return phone2str(match.group(0))
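Every `replace_*` function in these normalization modules takes a `re.Match` and returns a `str`, which is the callback shape that `re.sub` accepts as its replacement argument. A minimal sketch of that wiring, reusing the `RE_PERCENTAGE` pattern visible in the `num.py` hunk header; the callback `replace_percentage_demo` and its English output are hypothetical and are not the project's verbalization:

```python
import re

# Pattern as shown in the num.py hunk header.
RE_PERCENTAGE = re.compile(r'(-?)(\d+(\.\d+)?)%')


def replace_percentage_demo(match: re.Match) -> str:
    """Hypothetical callback: spell out the sign and the percent symbol."""
    sign = match.group(1)
    number = match.group(2)
    prefix = "minus " if sign else ""
    return f"{prefix}{number} percent"


# re.sub calls the function once per non-overlapping match.
normalized = RE_PERCENTAGE.sub(replace_percentage_demo, "up 5% down -2.5%")
```

Passing a function instead of a template string lets each match be rewritten with arbitrary logic, which is why the docstrings only need to state `match (re.Match)` and a `str` return.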
--- a/paddlespeech/t2s/frontend/zh_normalization/quantifier.py
+++ b/paddlespeech/t2s/frontend/zh_normalization/quantifier.py
@ -22,12 +22,10 @@ RE_TEMPERATURE = re.compile(r'(-?)(\d+(\.\d+)?)(°C|℃|度|摄氏度)')
 def replace_temperature(match) -> str:
    """
-    Parameters
+    Args:
-    ----------
+        match (re.Match)
-    match : re.Match
+    Returns:
-    Returns
+        str
    ----------
    str
    """
    sign = match.group(1)
    temperature = match.group(2)
--- a/paddlespeech/t2s/frontend/zh_normalization/text_normlization.py
+++ b/paddlespeech/t2s/frontend/zh_normalization/text_normlization.py
@ -55,14 +55,10 @@ class TextNormalizer():
    def _split(self, text: str, lang="zh") -> List[str]:
        """Split long text into sentences with sentence-splitting punctuations.
-        Parameters
+        Args:
-        ----------
+            text (str): The input text.
-        text : str
+        Returns:
-            The input text.
+            List[str]: Sentences.
        Returns
        -------
        List[str]
            Sentences.
        """
        # Only for pure Chinese here
        if lang == "zh":
--- a/paddlespeech/t2s/models/init.py
+++ b/paddlespeech/t2s/models/init.py
@ -14,9 +14,9 @@
 from .fastspeech2 import *
 from .hifigan import *
 from .melgan import *
 from .new_tacotron2 import *
 from .parallel_wavegan import *
 from .speedyspeech import *
 from .tacotron2 import *
 from .transformer_tts import *
 from .waveflow import *
 from .wavernn import *
--- a/paddlespeech/t2s/models/fastspeech2/fastspeech2.py
+++ b/paddlespeech/t2s/models/fastspeech2/fastspeech2.py
@ -38,17 +38,21 @@ from paddlespeech.t2s.modules.transformer.encoder import TransformerEncoder
 class FastSpeech2(nn.Layer):
    """FastSpeech2 module.
-
+    
    This is a module of FastSpeech2 described in `FastSpeech 2: Fast and
    High-Quality End-to-End Text to Speech`_. Instead of quantized pitch and
    energy, we use token-averaged value introduced in `FastPitch: Parallel
    Text-to-speech with Pitch Prediction`_.
-
+    
    .. _`FastSpeech 2: Fast and High-Quality End-to-End Text to Speech`:
        https://arxiv.org/abs/2006.04558
    .. _`FastPitch: Parallel Text-to-speech with Pitch Prediction`:
        https://arxiv.org/abs/2006.06873
    Args:
    Returns:
    """
    def __init__(
@ -127,136 +131,72 @@ class FastSpeech2(nn.Layer):
            init_enc_alpha: float=1.0,
            init_dec_alpha: float=1.0, ):
        """Initialize FastSpeech2 module.
-        Parameters
+        Args:
-        ----------
+            idim (int): Dimension of the inputs.
-        idim : int
+            odim (int): Dimension of the outputs.
-            Dimension of the inputs.
+            adim (int): Attention dimension.
-        odim : int
+            aheads (int): Number of attention heads.
-            Dimension of the outputs.
+            elayers (int): Number of encoder layers.
-        adim : int
+            eunits (int): Number of encoder hidden units.
-            Attention dimension.
+            dlayers (int): Number of decoder layers.
-        aheads : int
+            dunits (int): Number of decoder hidden units.
-            Number of attention heads.
+            postnet_layers (int): Number of postnet layers.
-        elayers : int
+            postnet_chans (int): Number of postnet channels.
-            Number of encoder layers.
+            postnet_filts (int): Kernel size of postnet.
-        eunits : int
+            postnet_dropout_rate (float): Dropout rate in postnet.
-            Number of encoder hidden units.
+            use_scaled_pos_enc (bool): Whether to use trainable scaled pos encoding.
-        dlayers : int
+            use_batch_norm (bool): Whether to use batch normalization in encoder prenet.
-            Number of decoder layers.
+            encoder_normalize_before (bool): Whether to apply layernorm layer before encoder block.
-        dunits : int
+            decoder_normalize_before (bool): Whether to apply layernorm layer before decoder block.
-            Number of decoder hidden units.
+            encoder_concat_after (bool): Whether to concatenate attention layer's input and output in encoder.
-        postnet_layers : int
+            decoder_concat_after (bool): Whether to concatenate attention layer's input  and output in decoder.
-            Number of postnet layers.
+            reduction_factor (int): Reduction factor.
-        postnet_chans : int
+            encoder_type (str): Encoder type ("transformer" or "conformer").
-            Number of postnet channels.
+            decoder_type (str): Decoder type ("transformer" or "conformer").
-        postnet_filts : int
+            transformer_enc_dropout_rate (float): Dropout rate in encoder except attention and positional encoding.
-            Kernel size of postnet.
+            transformer_enc_positional_dropout_rate (float): Dropout rate after encoder positional encoding.
-        postnet_dropout_rate : float
+            transformer_enc_attn_dropout_rate (float): Dropout rate in encoder self-attention module.
-            Dropout rate in postnet.
+            transformer_dec_dropout_rate (float): Dropout rate in decoder except attention & positional encoding.
-        use_scaled_pos_enc : bool
+            transformer_dec_positional_dropout_rate (float): Dropout rate after decoder positional encoding.
-            Whether to use trainable scaled pos encoding.
+            transformer_dec_attn_dropout_rate (float): Dropout rate in decoder self-attention module.
-        use_batch_norm : bool
+            conformer_pos_enc_layer_type (str): Pos encoding layer type in conformer.
-            Whether to use batch normalization in encoder prenet.
+            conformer_self_attn_layer_type (str): Self-attention layer type in conformer
-        encoder_normalize_before : bool
+            conformer_activation_type (str): Activation function type in conformer.
-            Whether to apply layernorm layer before encoder block.
+            use_macaron_style_in_conformer (bool): Whether to use macaron style FFN.
-        decoder_normalize_before : bool
+            use_cnn_in_conformer (bool): Whether to use CNN in conformer.
-            Whether to apply layernorm layer before
+            zero_triu (bool): Whether to use zero triu in relative self-attention module.
-            decoder block.
+            conformer_enc_kernel_size (int): Kernel size of encoder conformer.
-        encoder_concat_after : bool
+            conformer_dec_kernel_size (int): Kernel size of decoder conformer.
-            Whether to concatenate attention layer's input and output in encoder.
+            duration_predictor_layers (int): Number of duration predictor layers.
-        decoder_concat_after : bool
+            duration_predictor_chans (int): Number of duration predictor channels.
-            Whether to concatenate attention layer's input  and output in decoder.
+            duration_predictor_kernel_size (int): Kernel size of duration predictor.
-        reduction_factor : int
+            duration_predictor_dropout_rate (float): Dropout rate in duration predictor.
-            Reduction factor.
+            pitch_predictor_layers (int): Number of pitch predictor layers.
-        encoder_type : str
+            pitch_predictor_chans (int): Number of pitch predictor channels.
-            Encoder type ("transformer" or "conformer").
+            pitch_predictor_kernel_size (int): Kernel size of pitch predictor.
-        decoder_type : str
+            pitch_predictor_dropout_rate (float): Dropout rate in pitch predictor.
-            Decoder type ("transformer" or "conformer").
+            pitch_embed_kernel_size (float): Kernel size of pitch embedding.
-        transformer_enc_dropout_rate : float
+            pitch_embed_dropout_rate (float): Dropout rate for pitch embedding.
-            Dropout rate in encoder except attention and positional encoding.
+            stop_gradient_from_pitch_predictor (bool): Whether to stop gradient from pitch predictor to encoder.
-        transformer_enc_positional_dropout_rate (float): Dropout rate after encoder
+            energy_predictor_layers (int): Number of energy predictor layers.
-            positional encoding.
+            energy_predictor_chans (int): Number of energy predictor channels.
-        transformer_enc_attn_dropout_rate (float): Dropout rate in encoder
+            energy_predictor_kernel_size (int): Kernel size of energy predictor.
-            self-attention module.
+            energy_predictor_dropout_rate (float): Dropout rate in energy predictor.
-        transformer_dec_dropout_rate (float): Dropout rate in decoder except
+            energy_embed_kernel_size (float): Kernel size of energy embedding.
-            attention & positional encoding.
+            energy_embed_dropout_rate (float): Dropout rate for energy embedding.
-        transformer_dec_positional_dropout_rate (float): Dropout rate after decoder
+            stop_gradient_from_energy_predictor (bool): Whether to stop gradient from energy predictor to encoder.
-            positional encoding.
+            spk_num (Optional[int]): Number of speakers. If not None, assume that the spk_embed_dim is not None,
-        transformer_dec_attn_dropout_rate (float): Dropout rate in decoder
+                spk_ids will be provided as the input and use spk_embedding_table.
-            self-attention module.
+            spk_embed_dim (Optional[int]): Speaker embedding dimension. If not None, 
-        conformer_pos_enc_layer_type : str
+                assume that spk_emb will be provided as the input or spk_num is not None.
-            Pos encoding layer type in conformer.
+            spk_embed_integration_type (str): How to integrate speaker embedding.
-        conformer_self_attn_layer_type : str
+            tone_num (Optional[int]): Number of tones. If not None, assume that the
-            Self-attention layer type in conformer
+                tone_ids will be provided as the input and use tone_embedding_table.
-        conformer_activation_type : str
+            tone_embed_dim (Optional[int]): Tone embedding dimension. If not None, assume that tone_num is not None.
-            Activation function type in conformer.
+            tone_embed_integration_type (str): How to integrate tone embedding.
-        use_macaron_style_in_conformer : bool
+            init_type (str): How to initialize transformer parameters.
-            Whether to use macaron style FFN.
+            init_enc_alpha (float): Initial value of alpha in scaled pos encoding of the encoder.
-        use_cnn_in_conformer : bool
+            init_dec_alpha (float): Initial value of alpha in scaled pos encoding of the decoder.
            Whether to use CNN in conformer.
        zero_triu : bool
            Whether to use zero triu in relative self-attention module.
        conformer_enc_kernel_size : int
            Kernel size of encoder conformer.
        conformer_dec_kernel_size : int
            Kernel size of decoder conformer.
        duration_predictor_layers : int
            Number of duration predictor layers.
        duration_predictor_chans : int
            Number of duration predictor channels.
        duration_predictor_kernel_size : int
            Kernel size of duration predictor.
        duration_predictor_dropout_rate : float
            Dropout rate in duration predictor.
        pitch_predictor_layers : int
            Number of pitch predictor layers.
        pitch_predictor_chans : int
            Number of pitch predictor channels.
        pitch_predictor_kernel_size : int
            Kernel size of pitch predictor.
        pitch_predictor_dropout_rate : float
            Dropout rate in pitch predictor.
        pitch_embed_kernel_size : float
            Kernel size of pitch embedding.
        pitch_embed_dropout_rate : float
            Dropout rate for pitch embedding.
        stop_gradient_from_pitch_predictor : bool
            Whether to stop gradient from pitch predictor to encoder.
        energy_predictor_layers : int
            Number of energy predictor layers.
        energy_predictor_chans : int
            Number of energy predictor channels.
        energy_predictor_kernel_size : int
            Kernel size of energy predictor.
        energy_predictor_dropout_rate : float
            Dropout rate in energy predictor.
        energy_embed_kernel_size : float
            Kernel size of energy embedding.
        energy_embed_dropout_rate : float
            Dropout rate for energy embedding.
        stop_gradient_from_energy_predictor : bool 
            Whether to stop gradient from energy predictor to encoder.
        spk_num : Optional[int]
            Number of speakers. If not None, assume that the spk_embed_dim is not None,
            spk_ids will be provided as the input and use spk_embedding_table.
        spk_embed_dim : Optional[int]
            Speaker embedding dimension. If not None, 
            assume that spk_emb will be provided as the input or spk_num is not None.
        spk_embed_integration_type : str
            How to integrate speaker embedding.
        tone_num : Optional[int]
            Number of tones. If not None, assume that the
            tone_ids will be provided as the input and use tone_embedding_table.
        tone_embed_dim : Optional[int]
            Tone embedding dimension. If not None, assume that tone_num is not None.
        tone_embed_integration_type : str
            How to integrate tone embedding.
        init_type : str
            How to initialize transformer parameters.
        init_enc_alpha : float
            Initial value of alpha in scaled pos encoding of the encoder.
        init_dec_alpha : float
            Initial value of alpha in scaled pos encoding of the decoder.
        """
        assert check_argument_types()
@ -489,45 +429,21 @@ class FastSpeech2(nn.Layer):
    ) -> Tuple[paddle.Tensor, Dict[str, paddle.Tensor], paddle.Tensor]:
        """Calculate forward propagation.
-        Parameters
+        Args:
-        ----------
+            text(Tensor(int64)): Batch of padded token ids (B, Tmax).
-        text : Tensor(int64)
+            text_lengths(Tensor(int64)): Batch of lengths of each input (B,).
-            Batch of padded token ids (B, Tmax).
+            speech(Tensor): Batch of padded target features (B, Lmax, odim).
-        text_lengths : Tensor(int64)
+            speech_lengths(Tensor(int64)): Batch of the lengths of each target (B,).
-            Batch of lengths of each input (B,).
+            durations(Tensor(int64)): Batch of padded durations (B, Tmax).
-        speech : Tensor
+            pitch(Tensor): Batch of padded token-averaged pitch (B, Tmax, 1).
-            Batch of padded target features (B, Lmax, odim).
+            energy(Tensor): Batch of padded token-averaged energy (B, Tmax, 1).
-        speech_lengths : Tensor(int64)
+            tone_id(Tensor, optional(int64)): Batch of padded tone ids  (B, Tmax).
-            Batch of the lengths of each target (B,).
+            spk_emb(Tensor, optional): Batch of speaker embeddings (B, spk_embed_dim).
-        durations : Tensor(int64)
+            spk_id(Tensor, optional(int64)): Batch of speaker ids (B,)
-            Batch of padded durations (B, Tmax).
+
-        pitch : Tensor
+        Returns:
-            Batch of padded token-averaged pitch (B, Tmax, 1).
+
-        energy : Tensor
+        
            Batch of padded token-averaged energy (B, Tmax, 1).
        tone_id : Tensor, optional(int64)
                Batch of padded tone ids  (B, Tmax).
        spk_emb : Tensor, optional
            Batch of speaker embeddings (B, spk_embed_dim).
        spk_id : Tensor, optional(int64)
            Batch of speaker ids (B,)
        Returns
        ----------
        Tensor
            mel outs before postnet
        Tensor
            mel outs after postnet
        Tensor
            duration predictor's output
        Tensor
            pitch predictor's output
        Tensor
            energy predictor's output
        Tensor
            speech
        Tensor
            speech_lengths, modified if reduction_factor > 1
        """
        # input of embedding must be int64
@ -680,34 +596,22 @@ class FastSpeech2(nn.Layer):
    ) -> Tuple[paddle.Tensor, paddle.Tensor, paddle.Tensor]:
        """Generate the sequence of features given the sequences of characters.
-        Parameters
+        Args:
-        ----------
+            text(Tensor(int64)): Input sequence of characters (T,).
-        text : Tensor(int64)
+            speech(Tensor, optional): Feature sequence to extract style (N, idim).
-            Input sequence of characters (T,).
+            durations(Tensor, optional (int64)): Groundtruth of duration (T,).
-        speech : Tensor, optional
+            pitch(Tensor, optional): Groundtruth of token-averaged pitch (T, 1).
-            Feature sequence to extract style (N, idim).
+            energy(Tensor, optional): Groundtruth of token-averaged energy (T, 1).
-        durations : Tensor, optional (int64)
+            alpha(float, optional): Alpha to control the speed.
-            Groundtruth of duration (T,).
+            use_teacher_forcing(bool, optional): Whether to use teacher forcing.
-        pitch : Tensor, optional
+                If true, groundtruth of duration, pitch and energy will be used.
-            Groundtruth of token-averaged pitch (T, 1).
+            spk_emb(Tensor, optional): Speaker embedding vector (spk_embed_dim,). (Default value = None)
-        energy : Tensor, optional
+            spk_id(Tensor, optional(int64), optional): Batch of padded spk ids  (1,). (Default value = None)
-            Groundtruth of token-averaged energy (T, 1).
+            tone_id(Tensor, optional(int64), optional): Batch of padded tone ids  (T,). (Default value = None)
-        alpha : float, optional
+
-            Alpha to control the speed.
+        Returns:
-        use_teacher_forcing : bool, optional
+
-            Whether to use teacher forcing.
+        
            If true, groundtruth of duration, pitch and energy will be used.
        spk_emb : Tensor, optional
            Speaker embedding vector (spk_embed_dim,).
        spk_id : Tensor, optional(int64)
            Batch of padded spk ids  (1,).
        tone_id : Tensor, optional(int64)
            Batch of padded tone ids  (T,).
        Returns
        ----------
        Tensor
            Output sequence of features (L, odim).
        """
        # input of embedding must be int64
        x = paddle.cast(text, 'int64')
@ -761,17 +665,13 @@ class FastSpeech2(nn.Layer):
    def _integrate_with_spk_embed(self, hs, spk_emb):
        """Integrate speaker embedding with hidden states.
-        Parameters
+        Args:
-        ----------
+            hs(Tensor): Batch of hidden state sequences (B, Tmax, adim).
-        hs : Tensor
+            spk_emb(Tensor): Batch of speaker embeddings (B, spk_embed_dim).
-            Batch of hidden state sequences (B, Tmax, adim).
+
-        spk_emb : Tensor
+        Returns:
-            Batch of speaker embeddings (B, spk_embed_dim).
+
-
+        
        Returns
        ----------
        Tensor
            Batch of integrated hidden state sequences (B, Tmax, adim)
        """
        if self.spk_embed_integration_type == "add":
            # apply projection and then add to hidden states
@ -790,17 +690,13 @@ class FastSpeech2(nn.Layer):
    def _integrate_with_tone_embed(self, hs, tone_embs):
        """Integrate tone embedding with hidden states.
-        Parameters
+        Args:
-        ----------
+            hs(Tensor): Batch of hidden state sequences (B, Tmax, adim).
-        hs : Tensor
+            tone_embs(Tensor): Batch of tone embeddings (B, Tmax, tone_embed_dim).
-            Batch of hidden state sequences (B, Tmax, adim).
+
-        tone_embs : Tensor
+        Returns:
-            Batch of speaker embeddings (B, Tmax, tone_embed_dim).
+
-
+        
        Returns
        ----------
        Tensor
            Batch of integrated hidden state sequences (B, Tmax, adim)
        """
        if self.tone_embed_integration_type == "add":
            # apply projection and then add to hidden states
@ -819,24 +715,17 @@ class FastSpeech2(nn.Layer):
    def _source_mask(self, ilens: paddle.Tensor) -> paddle.Tensor:
        """Make masks for self-attention.
-        Parameters
+        Args:
-        ----------
+            ilens(Tensor): Batch of lengths (B,).
        ilens : Tensor
            Batch of lengths (B,).
-        Returns
+        Returns:
-        -------
+            Tensor: Mask tensor for self-attention. dtype=paddle.bool
        Tensor
            Mask tensor for self-attention.
            dtype=paddle.bool
        Examples
        -------
        >>> ilens = [5, 3]
        >>> self._source_mask(ilens)
        tensor([[[1, 1, 1, 1, 1],
                    [1, 1, 1, 0, 0]]]) bool
        Examples:
            >>> ilens = [5, 3]
            >>> self._source_mask(ilens)
            tensor([[[1, 1, 1, 1, 1],
                        [1, 1, 1, 0, 0]]]) bool
        """
        x_masks = make_non_pad_mask(ilens)
        return x_masks.unsqueeze(-2)
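The masking behaviour documented in this hunk can be sketched without Paddle. This is a minimal pure-Python illustration, not the repository's implementation: `source_mask_sketch` is a hypothetical name, and the (B, 1, Tmax) nesting is inferred from the `unsqueeze(-2)` call above, assuming `make_non_pad_mask` returns a (B, Tmax) boolean mask.

```python
from typing import List


def source_mask_sketch(ilens: List[int]) -> List[List[List[bool]]]:
    """Hypothetical stand-in for _source_mask, using nested lists instead of tensors."""
    t_max = max(ilens)
    # (B, Tmax): True marks a real token, False marks padding.
    non_pad = [[t < n for t in range(t_max)] for n in ilens]
    # Mimic unsqueeze(-2): wrap each row so the result is (B, 1, Tmax).
    return [[row] for row in non_pad]


mask = source_mask_sketch([5, 3])
# mask[0] == [[True, True, True, True, True]]
# mask[1] == [[True, True, True, False, False]]
```

The extra middle axis lets the same per-utterance mask broadcast over every query position in self-attention.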
@ -910,34 +799,26 @@ class StyleFastSpeech2Inference(FastSpeech2Inference):
                spk_emb=None,
                spk_id=None):
        """
-        Parameters
+
-        ----------
+        Args:
-        text : Tensor(int64)
+            text(Tensor(int64)): Input sequence of characters (T,).
-            Input sequence of characters (T,).
+            speech(Tensor, optional): Feature sequence to extract style (N, idim).
-        speech : Tensor, optional
+            durations(paddle.Tensor/np.ndarray, optional (int64)): Groundtruth of duration (T,), this will overwrite the set of durations_scale and durations_bias
-            Feature sequence to extract style (N, idim).
+            durations_scale(int/float, optional): 
-        durations : paddle.Tensor/np.ndarray, optional (int64)
+            durations_bias(int/float, optional): 
-            Groundtruth of duration (T,), this will overwrite the set of durations_scale and durations_bias
+            pitch(paddle.Tensor/np.ndarray, optional): Groundtruth of token-averaged pitch (T, 1), this will overwrite the set of pitch_scale and pitch_bias
-        durations_scale: int/float, optional
+            pitch_scale(int/float, optional): In denormed HZ domain.
-        durations_bias: int/float, optional
+            pitch_bias(int/float, optional): In denormed HZ domain.
-        pitch : paddle.Tensor/np.ndarray, optional
+            energy(paddle.Tensor/np.ndarray, optional): Groundtruth of token-averaged energy (T, 1), this will overwrite the set of energy_scale and energy_bias
-            Groundtruth of token-averaged pitch (T, 1), this will overwrite the set of pitch_scale and pitch_bias
+            energy_scale(int/float, optional): In denormed domain.
-        pitch_scale: int/float, optional
+            energy_bias(int/float, optional): In denormed domain.
-            In denormed HZ domain.
+            robot(bool, optional): Whether to output robot style. (Default value = False)
-        pitch_bias: int/float, optional
+            spk_emb: (Default value = None)
-            In denormed HZ domain.
+            spk_id: (Default value = None)
-        energy : paddle.Tensor/np.ndarray, optional
+
-            Groundtruth of token-averaged energy (T, 1), this will overwrite the set of energy_scale and energy_bias
+        Returns:
-        energy_scale: int/float, optional
+            Tensor: logmel
-            In denormed domain.
+
        energy_bias: int/float, optional
            In denormed domain.
        robot : bool, optional
            Whether to output robot style.
        Returns
        ----------
        Tensor
            Output sequence of features (L, odim).
        """
        normalized_mel, d_outs, p_outs, e_outs = self.acoustic_model.inference(
            text,
@ -1011,13 +892,9 @@ class FastSpeech2Loss(nn.Layer):
    def __init__(self, use_masking: bool=True,
                 use_weighted_masking: bool=False):
        """Initialize feed-forward Transformer loss module.
-
+        Args:
-        Parameters
+            use_masking (bool): Whether to apply masking for padded part in loss calculation.
-        ----------
+            use_weighted_masking (bool): Whether to apply weighted masking in loss calculation.
        use_masking : bool
            Whether to apply masking for padded part in loss calculation.
        use_weighted_masking : bool
            Whether to apply weighted masking in loss calculation.
        """
        assert check_argument_types()
        super().__init__()
@ -1048,42 +925,22 @@ class FastSpeech2Loss(nn.Layer):
    ) -> Tuple[paddle.Tensor, paddle.Tensor, paddle.Tensor, paddle.Tensor]:
        """Calculate forward propagation.
-        Parameters
+        Args:
-        ----------
+            after_outs(Tensor): Batch of outputs after postnets (B, Lmax, odim).
-        after_outs : Tensor
+            before_outs(Tensor): Batch of outputs before postnets (B, Lmax, odim).
-            Batch of outputs after postnets (B, Lmax, odim).
+            d_outs(Tensor): Batch of outputs of duration predictor (B, Tmax).
-        before_outs : Tensor
+            p_outs(Tensor): Batch of outputs of pitch predictor (B, Tmax, 1).
-            Batch of outputs before postnets (B, Lmax, odim).
+            e_outs(Tensor): Batch of outputs of energy predictor (B, Tmax, 1).
-        d_outs : Tensor
+            ys(Tensor): Batch of target features (B, Lmax, odim).
-                Batch of outputs of duration predictor (B, Tmax).
+            ds(Tensor): Batch of durations (B, Tmax).
-        p_outs : Tensor
+            ps(Tensor): Batch of target token-averaged pitch (B, Tmax, 1).
-            Batch of outputs of pitch predictor (B, Tmax, 1).
+            es(Tensor): Batch of target token-averaged energy (B, Tmax, 1).
-        e_outs : Tensor
+            ilens(Tensor): Batch of the lengths of each input (B,).
-            Batch of outputs of energy predictor (B, Tmax, 1).
+            olens(Tensor): Batch of the lengths of each target (B,).
-        ys : Tensor
+
-            Batch of target features (B, Lmax, odim).
+        Returns:
-        ds : Tensor
+
-            Batch of durations (B, Tmax).
+        
        ps : Tensor
            Batch of target token-averaged pitch (B, Tmax, 1).
        es : Tensor
            Batch of target token-averaged energy (B, Tmax, 1).
        ilens : Tensor
            Batch of the lengths of each input (B,).
        olens : Tensor
            Batch of the lengths of each target (B,).
        Returns
        ----------
        Tensor
            L1 loss value.
        Tensor
            Duration predictor loss value.
        Tensor
            Pitch predictor loss value.
        Tensor
            Energy predictor loss value.
        """
        # apply mask to remove padded part
        if self.use_masking:
--- a/paddlespeech/t2s/models/hifigan/hifigan.py
+++ b/paddlespeech/t2s/models/hifigan/hifigan.py
@ -37,35 +37,21 @@ class HiFiGANGenerator(nn.Layer):
            use_weight_norm: bool=True,
            init_type: str="xavier_uniform", ):
        """Initialize HiFiGANGenerator module.
-        Parameters
+        Args:
-        ----------
+            in_channels (int): Number of input channels.
-        in_channels : int
+            out_channels (int): Number of output channels.
-            Number of input channels.
+            channels (int): Number of hidden representation channels.
-        out_channels : int
+            kernel_size (int): Kernel size of initial and final conv layer.
-            Number of output channels.
+            upsample_scales (list): List of upsampling scales.
-        channels : int
+            upsample_kernel_sizes (list): List of kernel sizes for upsampling layers.
-            Number of hidden representation channels.
+            resblock_kernel_sizes (list): List of kernel sizes for residual blocks.
-        kernel_size : int
+            resblock_dilations (list): List of dilation list for residual blocks.
-            Kernel size of initial and final conv layer.
+            use_additional_convs (bool): Whether to use additional conv layers in residual blocks.
-        upsample_scales : list
+            bias (bool): Whether to add bias parameter in convolution layers.
-            List of upsampling scales.
+            nonlinear_activation (str): Activation function module name.
-        upsample_kernel_sizes : list
+            nonlinear_activation_params (dict): Hyperparameters for activation function.
-            List of kernel sizes for upsampling layers.
+            use_weight_norm (bool): Whether to use weight norm.
-        resblock_kernel_sizes : list
+                If set to true, it will be applied to all of the conv layers.
            List of kernel sizes for residual blocks.
        resblock_dilations : list
            List of dilation list for residual blocks.
        use_additional_convs : bool
            Whether to use additional conv layers in residual blocks.
        bias : bool
            Whether to add bias parameter in convolution layers.
        nonlinear_activation : str
            Activation function module name.
        nonlinear_activation_params : dict
            Hyperparameters for activation function.
        use_weight_norm : bool
            Whether to use weight norm.
            If set to true, it will be applied to all of the conv layers.
        """
        super().__init__()
@ -134,14 +120,11 @@ class HiFiGANGenerator(nn.Layer):
    def forward(self, c):
        """Calculate forward propagation.
-        Parameters
+        
-        ----------
+        Args:
-        c : Tensor
+            c (Tensor): Input tensor (B, in_channels, T).
-            Input tensor (B, in_channels, T).
+        Returns:
-        Returns
+            Tensor: Output tensor (B, out_channels, T).
        ----------
        Tensor
            Output tensor (B, out_channels, T).
        """
        c = self.input_conv(c)
        for i in range(self.num_upsamples):
@ -196,15 +179,12 @@ class HiFiGANGenerator(nn.Layer):
    def inference(self, c):
        """Perform inference.
-        Parameters
+        Args:
-        ----------
+            c (Tensor): Input tensor (T, in_channels).
-        c : Tensor 
+                normalize_before (bool): Whether to perform normalization.
-            Input tensor (T, in_channels).
+        Returns:
-            normalize_before (bool): Whether to perform normalization.
+            Tensor:
-        Returns
+                Output tensor (T * prod(upsample_scales), out_channels).
        ----------
        Tensor
            Output tensor (T * prod(upsample_scales), out_channels).
        """
        c = self.forward(c.transpose([1, 0]).unsqueeze(0))
        return c.squeeze(0).transpose([1, 0])
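The shape bookkeeping in `inference` above can be traced with plain integers. This is a sketch under stated assumptions: `inference_shape_trace` is a hypothetical helper, and it assumes each upsampling layer multiplies the time axis by exactly its scale, which the diff itself does not show.

```python
import math
from typing import List, Sequence, Tuple


def inference_shape_trace(t: int, in_channels: int, out_channels: int,
                          upsample_scales: Sequence[int]) -> List[Tuple[int, ...]]:
    """Return the tensor shape after each step of the inference wrapper."""
    shapes: List[Tuple[int, ...]] = [(t, in_channels)]   # input c
    shapes.append((in_channels, t))                       # c.transpose([1, 0])
    shapes.append((1, in_channels, t))                    # .unsqueeze(0)
    t_out = t * math.prod(upsample_scales)                # assumed effect of forward()
    shapes.append((1, out_channels, t_out))               # self.forward(...)
    shapes.append((out_channels, t_out))                  # .squeeze(0)
    shapes.append((t_out, out_channels))                  # .transpose([1, 0])
    return shapes


# Illustrative values only; the scales here are an assumption, not read from the diff.
trace = inference_shape_trace(t=100, in_channels=80, out_channels=1,
                              upsample_scales=[8, 8, 2, 2])
```

With these illustrative values the final shape is `(25600, 1)`, since 8 * 8 * 2 * 2 = 256 output samples per input frame.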
@ -229,36 +209,23 @@ class HiFiGANPeriodDiscriminator(nn.Layer):
            use_spectral_norm: bool=False,
            init_type: str="xavier_uniform", ):
        """Initialize HiFiGANPeriodDiscriminator module.
-        Parameters
+
-        ----------
+        Args:
-        in_channels : int
+            in_channels (int): Number of input channels.
-            Number of input channels.
+            out_channels (int): Number of output channels.
-        out_channels : int
+            period (int): Period.
-            Number of output channels.
+            kernel_sizes (list): Kernel sizes of initial conv layers and the final conv layer.
-        period : int
+            channels (int): Number of initial channels.
-            Period.
+            downsample_scales (list): List of downsampling scales.
-        kernel_sizes : list
+            max_downsample_channels (int): Number of maximum downsampling channels.
-            Kernel sizes of initial conv layers and the final conv layer.
+            use_additional_convs (bool): Whether to use additional conv layers in residual blocks.
-        channels : int
+            bias (bool): Whether to add bias parameter in convolution layers.
-            Number of initial channels.
+            nonlinear_activation (str): Activation function module name.
-        downsample_scales : list
+            nonlinear_activation_params (dict): Hyperparameters for activation function.
-            List of downsampling scales.
+            use_weight_norm (bool): Whether to use weight norm.
-        max_downsample_channels : int
+                If set to true, it will be applied to all of the conv layers.
-            Number of maximum downsampling channels.
+            use_spectral_norm (bool): Whether to use spectral norm.
-        use_additional_convs : bool
+                If set to true, it will be applied to all of the conv layers.
            Whether to use additional conv layers in residual blocks.
        bias : bool
            Whether to add bias parameter in convolution layers.
        nonlinear_activation : str
            Activation function module name.
        nonlinear_activation_params : dict
            Hyperparameters for activation function.
        use_weight_norm : bool
            Whether to use weight norm.
            If set to true, it will be applied to all of the conv layers.
        use_spectral_norm : bool
            Whether to use spectral norm.
            If set to true, it will be applied to all of the conv layers.
        """
        super().__init__()
@ -307,14 +274,11 @@ class HiFiGANPeriodDiscriminator(nn.Layer):
    def forward(self, x):
        """Calculate forward propagation.
-        Parameters
+
-        ----------
+        Args:
-        c : Tensor
+            x (Tensor): Input tensor (B, in_channels, T).
-            Input tensor (B, in_channels, T).
+        Returns:
-        Returns
+            list: List of each layer's tensors.
        ----------
        list
            List of each layer's tensors.
        """
        # transform 1d to 2d -> (B, C, T/P, P)
        b, c, t = paddle.shape(x)
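The "1d to 2d" comment above describes folding the time axis into rows of length `period`. The sketch below shows only the shape arithmetic; `period_view_shape` is a hypothetical helper, and padding T up to the next multiple of the period is an assumption, because the rest of this hunk is not shown in the diff.

```python
from typing import Tuple


def period_view_shape(b: int, c: int, t: int, period: int) -> Tuple[int, int, int, int]:
    """Shape of a (B, C, T) signal after folding time into (T/P, P)."""
    # Number of samples needed to reach the next multiple of `period` (0 if aligned).
    n_pad = (-t) % period
    t_padded = t + n_pad
    return (b, c, t_padded // period, period)


aligned = period_view_shape(2, 1, 9, 3)     # (2, 1, 3, 3), no padding needed
unaligned = period_view_shape(2, 1, 10, 3)  # (2, 1, 4, 3), padded from 10 to 12
```

Folding this way places samples that are `period` steps apart in the same column, so a 2-D convolution can look at periodic structure directly.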
@ -379,13 +343,11 @@ class HiFiGANMultiPeriodDiscriminator(nn.Layer):
            },
            init_type: str="xavier_uniform", ):
        """Initialize HiFiGANMultiPeriodDiscriminator module.
-        Parameters
+
-        ----------
+        Args:
-        periods : list
+            periods (list): List of periods.
-            List of periods.
+            discriminator_params (dict): Parameters for hifi-gan period discriminator module.
-        discriminator_params : dict
+                The period parameter will be overwritten.
            Parameters for hifi-gan period discriminator module.
            The period parameter will be overwritten.
        """
        super().__init__()
        # initialize parameters
@ -399,14 +361,11 @@ class HiFiGANMultiPeriodDiscriminator(nn.Layer):
    def forward(self, x):
        """Calculate forward propagation.
-        Parameters
+
-        ----------
+        Args:
-        x : Tensor
+            x (Tensor): Input noise signal (B, 1, T).
-            Input noise signal (B, 1, T).
+        Returns:
-        Returns
+            List: List of list of each discriminator outputs, which consists of each layer output tensors.
        ----------
        List
            List of list of each discriminator outputs, which consists of each layer output tensors.
        """
        outs = []
        for f in self.discriminators:
@ -434,33 +393,22 @@ class HiFiGANScaleDiscriminator(nn.Layer):
            use_spectral_norm: bool=False,
            init_type: str="xavier_uniform", ):
        """Initialize HiFiGAN scale discriminator module.
-        Parameters
+
-        ----------
+        Args:
-        in_channels : int
+            in_channels (int): Number of input channels.
-            Number of input channels.
+            out_channels (int): Number of output channels.
-        out_channels : int
+            kernel_sizes (list): List of four kernel sizes. The first will be used for the first conv layer,
-            Number of output channels.
+                and the second is for downsampling part, and the remaining two are for output layers.
-        kernel_sizes : list
+            channels (int): Initial number of channels for conv layer.
-            List of four kernel sizes. The first will be used for the first conv layer,
+            max_downsample_channels (int): Maximum number of channels for downsampling layers.
-            and the second is for downsampling part, and the remaining two are for output layers.
+            bias (bool): Whether to add bias parameter in convolution layers.
-        channels : int
+            downsample_scales (list): List of downsampling scales.
-            Initial number of channels for conv layer.
+            nonlinear_activation (str): Activation function module name.
-        max_downsample_channels : int
+            nonlinear_activation_params (dict): Hyperparameters for activation function.
-            Maximum number of channels for downsampling layers.
+            use_weight_norm (bool): Whether to use weight norm.
-        bias : bool
+                If set to true, it will be applied to all of the conv layers.
-            Whether to add bias parameter in convolution layers.
+            use_spectral_norm (bool): Whether to use spectral norm.
-        downsample_scales : list
+                If set to true, it will be applied to all of the conv layers.
            List of downsampling scales.
        nonlinear_activation : str
            Activation function module name.
        nonlinear_activation_params : dict
            Hyperparameters for activation function.
        use_weight_norm : bool
            Whether to use weight norm.
            If set to true, it will be applied to all of the conv layers.
        use_spectral_norm : bool
            Whether to use spectral norm.
            If set to true, it will be applied to all of the conv layers.
        """
        super().__init__()
@ -546,14 +494,11 @@ class HiFiGANScaleDiscriminator(nn.Layer):
    def forward(self, x):
        """Calculate forward propagation.
-        Parameters
+
-        ----------
+        Args:
-        x : Tensor
+            x (Tensor): Input noise signal (B, 1, T).
-            Input noise signal (B, 1, T).
+        Returns:
-        Returns
+            List: List of output tensors of each layer.
        ----------
        List
            List of output tensors of each layer.
        """
        outs = []
        for f in self.layers:
@ -613,20 +558,14 @@ class HiFiGANMultiScaleDiscriminator(nn.Layer):
            follow_official_norm: bool=False,
            init_type: str="xavier_uniform", ):
        """Initialize HiFiGAN multi-scale discriminator module.
-        Parameters
+   
-        ----------
+        Args:
-        scales : int
+            scales (int): Number of multi-scales.
-            Number of multi-scales.
+            downsample_pooling (str): Pooling module name for downsampling of the inputs.
-        downsample_pooling : str
+            downsample_pooling_params (dict): Parameters for the above pooling module.
-            Pooling module name for downsampling of the inputs.
+            discriminator_params (dict): Parameters for hifi-gan scale discriminator module.
-        downsample_pooling_params : dict
+            follow_official_norm (bool): Whether to follow the norm setting of the official
-            Parameters for the above pooling module.
+                implementation. The first discriminator uses spectral norm and the other discriminators use weight norm.
        discriminator_params : dict
            Parameters for hifi-gan scale discriminator module.
        follow_official_norm : bool
            Whether to follow the norm setting of the official
            implementation. The first discriminator uses spectral norm and the other
            discriminators use weight norm.
        """
        super().__init__()
@ -651,14 +590,11 @@ class HiFiGANMultiScaleDiscriminator(nn.Layer):
    def forward(self, x):
        """Calculate forward propagation.
-        Parameters
+
-        ----------
+        Args:
-        x : Tensor
+            x (Tensor): Input noise signal (B, 1, T).
-            Input noise signal (B, 1, T).
+        Returns:
-        Returns
+            List: List of list of each discriminator outputs, which consists of each layer output tensors.
        ----------
        List
            List of list of each discriminator outputs, which consists of each layer output tensors.
        """
        outs = []
        for f in self.discriminators:
@ -715,24 +651,17 @@ class HiFiGANMultiScaleMultiPeriodDiscriminator(nn.Layer):
            },
            init_type: str="xavier_uniform", ):
        """Initialize HiFiGAN multi-scale + multi-period discriminator module.
-        Parameters
+
-        ----------
+        Args:
-        scales : int
+            scales (int): Number of multi-scales.
-            Number of multi-scales.
+            scale_downsample_pooling (str): Pooling module name for downsampling of the inputs.
-        scale_downsample_pooling : str
+            scale_downsample_pooling_params (dict): Parameters for the above pooling module.
-            Pooling module name for downsampling of the inputs.
+            scale_discriminator_params (dict): Parameters for hifi-gan scale discriminator module.
-        scale_downsample_pooling_params : dict
+            follow_official_norm (bool): Whether to follow the norm setting of the official implementation.
-            Parameters for the above pooling module.
+                The first discriminator uses spectral norm and the other discriminators use weight norm.
-        scale_discriminator_params : dict
+            periods (list): List of periods.
-            Parameters for hifi-gan scale discriminator module.
+            period_discriminator_params (dict): Parameters for hifi-gan period discriminator module.
-        follow_official_norm : bool): Whether to follow the norm setting of the official
+                The period parameter will be overwritten.
            implementation. The first discriminator uses spectral norm and the other
            discriminators use weight norm.
        periods : list
            List of periods.
        period_discriminator_params : dict
            Parameters for hifi-gan period discriminator module.
            The period parameter will be overwritten.
        """
        super().__init__()
@ -751,16 +680,14 @@ class HiFiGANMultiScaleMultiPeriodDiscriminator(nn.Layer):
    def forward(self, x):
        """Calculate forward propagation.
-        Parameters
+
-        ----------
+        Args:
-        x : Tensor
+            x (Tensor): Input noise signal (B, 1, T).
-            Input noise signal (B, 1, T).
+        Returns:
-        Returns
+            List:
-        ----------
+                List of list of each discriminator outputs,
-        List:
+                which consists of each layer output tensors.
-            List of list of each discriminator outputs,
+                Multi scale and multi period ones are concatenated.
            which consists of each layer output tensors.
            Multi scale and multi period ones are concatenated.
        """
        msd_outs = self.msd(x)
        mpd_outs = self.mpd(x)
--- a/paddlespeech/t2s/models/melgan/melgan.py
+++ b/paddlespeech/t2s/models/melgan/melgan.py
@ -51,41 +51,26 @@ class MelGANGenerator(nn.Layer):
            use_causal_conv: bool=False,
            init_type: str="xavier_uniform", ):
        """Initialize MelGANGenerator module.
-        Parameters
+
-        ----------
+        Args:
-        in_channels : int
+            in_channels (int): Number of input channels.
-            Number of input channels.
+            out_channels (int): Number of output channels,
-        out_channels : int
+                the number of sub-band is out_channels in multi-band melgan.
-            Number of output channels,
+            kernel_size (int): Kernel size of initial and final conv layer.
-            the number of sub-band is out_channels in multi-band melgan.
+            channels (int): Initial number of channels for conv layer.
-        kernel_size : int
+            bias (bool): Whether to add bias parameter in convolution layers.
-            Kernel size of initial and final conv layer.
+            upsample_scales (List[int]): List of upsampling scales.
-        channels : int
+            stack_kernel_size (int): Kernel size of dilated conv layers in residual stack.
-            Initial number of channels for conv layer.
+            stacks (int): Number of stacks in a single residual stack.
-        bias : bool
+            nonlinear_activation (Optional[str], optional): Non linear activation in upsample network, by default None
-            Whether to add bias parameter in convolution layers.
+            nonlinear_activation_params (Dict[str, Any], optional): Parameters passed to the linear activation in the upsample network, 
-        upsample_scales : List[int]
+                by default {}
-            List of upsampling scales.
+            pad (str): Padding function module name before dilated convolution layer.
-        stack_kernel_size : int
+            pad_params (dict): Hyperparameters for padding function.
-            Kernel size of dilated conv layers in residual stack.
+            use_final_nonlinear_activation (nn.Layer): Activation function for the final layer.
-        stacks : int
+            use_weight_norm (bool): Whether to use weight norm.
-            Number of stacks in a single residual stack.
+                If set to true, it will be applied to all of the conv layers.
-        nonlinear_activation : Optional[str], optional
+            use_causal_conv (bool): Whether to use causal convolution.
            Non linear activation in upsample network, by default None
        nonlinear_activation_params : Dict[str, Any], optional
            Parameters passed to the linear activation in the upsample network, 
            by default {}
        pad : str
            Padding function module name before dilated convolution layer.
        pad_params : dict
            Hyperparameters for padding function.
        use_final_nonlinear_activation : nn.Layer
            Activation function for the final layer.
        use_weight_norm : bool
            Whether to use weight norm.
            If set to true, it will be applied to all of the conv layers.
        use_causal_conv : bool
            Whether to use causal convolution.
        """
        super().__init__()
@ -207,14 +192,11 @@ class MelGANGenerator(nn.Layer):
    def forward(self, c):
        """Calculate forward propagation.
-        Parameters
-        ----------
-        c : Tensor
-            Input tensor (B, in_channels, T).
-        Returns
-        ----------
-        Tensor
-            Output tensor (B, out_channels, T ** prod(upsample_scales)).
+
+        Args:
+            c (Tensor): Input tensor (B, in_channels, T).
+        Returns:
+            Tensor: Output tensor (B, out_channels, T ** prod(upsample_scales)).
        """
        out = self.melgan(c)
        return out
@ -260,14 +242,11 @@ class MelGANGenerator(nn.Layer):
    def inference(self, c):
        """Perform inference.
-        Parameters
-        ----------
-        c : Union[Tensor, ndarray]
-            Input tensor (T, in_channels).
-        Returns
-        ----------
-        Tensor
-            Output tensor (out_channels*T ** prod(upsample_scales), 1).
+
+        Args:
+            c (Union[Tensor, ndarray]): Input tensor (T, in_channels).
+        Returns:
+            Tensor: Output tensor (out_channels*T ** prod(upsample_scales), 1).
        """
        # pseudo batch
        c = c.transpose([1, 0]).unsqueeze(0)
@ -298,33 +277,22 @@ class MelGANDiscriminator(nn.Layer):
            pad_params: Dict[str, Any]={"mode": "reflect"},
            init_type: str="xavier_uniform", ):
        """Initilize MelGAN discriminator module.
-        Parameters
-        ----------
-        in_channels : int
-            Number of input channels.
-        out_channels : int
-            Number of output channels.
-        kernel_sizes : List[int]
-            List of two kernel sizes. The prod will be used for the first conv layer,
-            and the first and the second kernel sizes will be used for the last two layers.
-            For example if kernel_sizes = [5, 3], the first layer kernel size will be 5 * 3 = 15,
-            the last two layers' kernel size will be 5 and 3, respectively.
-        channels : int
-            Initial number of channels for conv layer.
-        max_downsample_channels : int
-            Maximum number of channels for downsampling layers.
-        bias : bool
-            Whether to add bias parameter in convolution layers.
-        downsample_scales : List[int]
-            List of downsampling scales.
-        nonlinear_activation : str
-            Activation function module name.
-        nonlinear_activation_params : dict
-            Hyperparameters for activation function.
-        pad : str
-            Padding function module name before dilated convolution layer.
-        pad_params : dict
-            Hyperparameters for padding function.
+
+        Args:
+            in_channels (int): Number of input channels.
+            out_channels (int): Number of output channels.
+            kernel_sizes (List[int]): List of two kernel sizes. The prod will be used for the first conv layer,
+                and the first and the second kernel sizes will be used for the last two layers.
+                For example if kernel_sizes = [5, 3], the first layer kernel size will be 5 * 3 = 15,
+                the last two layers' kernel size will be 5 and 3, respectively.
+            channels (int): Initial number of channels for conv layer.
+            max_downsample_channels (int): Maximum number of channels for downsampling layers.
+            bias (bool): Whether to add bias parameter in convolution layers.
+            downsample_scales (List[int]): List of downsampling scales.
+            nonlinear_activation (str): Activation function module name.
+            nonlinear_activation_params (dict): Hyperparameters for activation function.
+            pad (str): Padding function module name before dilated convolution layer.
+            pad_params (dict): Hyperparameters for padding function.
        """
        super().__init__()
@ -395,14 +363,10 @@ class MelGANDiscriminator(nn.Layer):
    def forward(self, x):
        """Calculate forward propagation.
-        Parameters
-        ----------
-        x : Tensor
-            Input noise signal (B, 1, T).
-        Returns
-        ----------
-        List
-            List of output tensors of each layer (for feat_match_loss).
+        Args:
+            x (Tensor): Input noise signal (B, 1, T).
+        Returns:
+            List: List of output tensors of each layer (for feat_match_loss).
        """
        outs = []
        for f in self.layers:
@ -440,39 +404,24 @@ class MelGANMultiScaleDiscriminator(nn.Layer):
            use_weight_norm: bool=True,
            init_type: str="xavier_uniform", ):
        """Initilize MelGAN multi-scale discriminator module.
-        Parameters
-        ----------
-        in_channels : int
-            Number of input channels.
-        out_channels : int
-            Number of output channels.
-        scales : int
-            Number of multi-scales.
-        downsample_pooling : str
-            Pooling module name for downsampling of the inputs.
-        downsample_pooling_params : dict
-            Parameters for the above pooling module.
-        kernel_sizes : List[int]
-            List of two kernel sizes. The sum will be used for the first conv layer,
-            and the first and the second kernel sizes will be used for the last two layers.
-        channels : int
-            Initial number of channels for conv layer.
-        max_downsample_channels : int
-            Maximum number of channels for downsampling layers.
-        bias : bool
-            Whether to add bias parameter in convolution layers.
-        downsample_scales : List[int]
-            List of downsampling scales.
-        nonlinear_activation : str
-            Activation function module name.
-        nonlinear_activation_params : dict
-            Hyperparameters for activation function.
-        pad : str
-            Padding function module name before dilated convolution layer.
-        pad_params : dict
-            Hyperparameters for padding function.
-        use_causal_conv : bool
-            Whether to use causal convolution.
+
+        Args:
+            in_channels (int): Number of input channels.
+            out_channels (int): Number of output channels.
+            scales (int): Number of multi-scales.
+            downsample_pooling (str): Pooling module name for downsampling of the inputs.
+            downsample_pooling_params (dict): Parameters for the above pooling module.
+            kernel_sizes (List[int]): List of two kernel sizes. The sum will be used for the first conv layer,
+                and the first and the second kernel sizes will be used for the last two layers.
+            channels (int): Initial number of channels for conv layer.
+            max_downsample_channels (int): Maximum number of channels for downsampling layers.
+            bias (bool): Whether to add bias parameter in convolution layers.
+            downsample_scales (List[int]): List of downsampling scales.
+            nonlinear_activation (str): Activation function module name.
+            nonlinear_activation_params (dict): Hyperparameters for activation function.
+            pad (str): Padding function module name before dilated convolution layer.
+            pad_params (dict): Hyperparameters for padding function.
+            use_causal_conv (bool): Whether to use causal convolution.
        """
        super().__init__()
@ -514,14 +463,10 @@ class MelGANMultiScaleDiscriminator(nn.Layer):
    def forward(self, x):
        """Calculate forward propagation.
-        Parameters
-        ----------
-        x : Tensor
-            Input noise signal (B, 1, T).
-        Returns
-        ----------
-        List
-            List of list of each discriminator outputs, which consists of each layer output tensors.
+        Args:
+            x (Tensor): Input noise signal (B, 1, T).
+        Returns:
+            List: List of list of each discriminator outputs, which consists of each layer output tensors.
        """
        outs = []
        for f in self.discriminators:
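
The melgan.py docstrings above describe two small pieces of arithmetic: how `kernel_sizes` maps to per-layer kernel sizes in `MelGANDiscriminator`, and how the generator's output length relates to `upsample_scales`. The following is a minimal sketch in plain Python, not code from this commit. Both helper names are hypothetical, the scale values are illustrative, and the sketch assumes the `T ** prod(upsample_scales)` shape comment means the frame count multiplied by the product of the scales.

```python
from math import prod


def discriminator_kernel_sizes(kernel_sizes):
    """Return (first_layer_kernel, last_two_kernels) for a two-element list.

    Follows the MelGANDiscriminator docstring: the product is used for the
    first conv layer, the two entries for the last two layers.
    """
    if len(kernel_sizes) != 2:
        raise ValueError("kernel_sizes must contain exactly two entries")
    return prod(kernel_sizes), tuple(kernel_sizes)


def upsampled_length(num_frames, upsample_scales):
    """Assumed output length: frames times the product of all scales."""
    return num_frames * prod(upsample_scales)


first, last_two = discriminator_kernel_sizes([5, 3])
print(first, last_two)                       # 15 (5, 3)
print(upsampled_length(100, [8, 8, 2, 2]))   # 25600
```

Note that the `MelGANMultiScaleDiscriminator` docstring says "sum" where the `MelGANDiscriminator` docstring says "prod"; the sketch follows the latter because it comes with a worked example.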
--- a/paddlespeech/t2s/models/melgan/style_melgan.py
+++ b/paddlespeech/t2s/models/melgan/style_melgan.py
@ -52,37 +52,23 @@ class StyleMelGANGenerator(nn.Layer):
            use_weight_norm: bool=True,
            init_type: str="xavier_uniform", ):
        """Initilize Style MelGAN generator.
-        Parameters
-        ----------
-        in_channels : int
-            Number of input noise channels.
-        aux_channels : int
-            Number of auxiliary input channels.
-        channels : int
-            Number of channels for conv layer.
-        out_channels : int
-            Number of output channels.
-        kernel_size : int
-            Kernel size of conv layers.
-        dilation : int
-            Dilation factor for conv layers.
-        bias : bool
-            Whether to add bias parameter in convolution layers.
-        noise_upsample_scales : list
-            List of noise upsampling scales.
-        noise_upsample_activation : str
-            Activation function module name for noise upsampling.
-        noise_upsample_activation_params : dict
-            Hyperparameters for the above activation function.
-        upsample_scales : list
-            List of upsampling scales.
-        upsample_mode : str
-            Upsampling mode in TADE layer.
-        gated_function : str
-            Gated function in TADEResBlock ("softmax" or "sigmoid").
-        use_weight_norm : bool
-            Whether to use weight norm.
-            If set to true, it will be applied to all of the conv layers.
+
+        Args:
+            in_channels (int): Number of input noise channels.
+            aux_channels (int): Number of auxiliary input channels.
+            channels (int): Number of channels for conv layer.
+            out_channels (int): Number of output channels.
+            kernel_size (int): Kernel size of conv layers.
+            dilation (int): Dilation factor for conv layers.
+            bias (bool): Whether to add bias parameter in convolution layers.
+            noise_upsample_scales (list): List of noise upsampling scales.
+            noise_upsample_activation (str): Activation function module name for noise upsampling.
+            noise_upsample_activation_params (dict): Hyperparameters for the above activation function.
+            upsample_scales (list): List of upsampling scales.
+            upsample_mode (str): Upsampling mode in TADE layer.
+            gated_function (str): Gated function in TADEResBlock ("softmax" or "sigmoid").
+            use_weight_norm (bool): Whether to use weight norm.
+                If set to true, it will be applied to all of the conv layers.
        """
        super().__init__()
@ -147,16 +133,12 @@ class StyleMelGANGenerator(nn.Layer):
    def forward(self, c, z=None):
        """Calculate forward propagation.
-        Parameters
-        ----------
-        c : Tensor
-            Auxiliary input tensor (B, channels, T).
-        z : Tensor
-            Input noise tensor (B, in_channels, 1).
-        Returns
-        ----------
-        Tensor
-            Output tensor (B, out_channels, T ** prod(upsample_scales)).
+
+        Args:
+            c (Tensor): Auxiliary input tensor (B, channels, T).
+            z (Tensor): Input noise tensor (B, in_channels, 1).
+        Returns:
+            Tensor: Output tensor (B, out_channels, T ** prod(upsample_scales)).
        """
        # batch_max_steps(24000) == noise_upsample_factor(80) * upsample_factor(300)
        if z is None:
@ -211,14 +193,10 @@ class StyleMelGANGenerator(nn.Layer):
    def inference(self, c):
        """Perform inference.
-        Parameters
-        ----------
-        c : Tensor
-            Input tensor (T, in_channels).
-        Returns
-        ----------
-        Tensor
-            Output tensor (T ** prod(upsample_scales), out_channels).
+        Args:
+            c (Tensor): Input tensor (T, in_channels).
+        Returns:
+            Tensor: Output tensor (T ** prod(upsample_scales), out_channels).
        """
        # (1, in_channels, T)
        c = c.transpose([1, 0]).unsqueeze(0)
@ -278,18 +256,13 @@ class StyleMelGANDiscriminator(nn.Layer):
            use_weight_norm: bool=True,
            init_type: str="xavier_uniform", ):
        """Initilize Style MelGAN discriminator.
-        Parameters
-        ----------
-        repeats : int
-            Number of repititons to apply RWD.
-        window_sizes : list
-            List of random window sizes.
-        pqmf_params : list
-            List of list of Parameters for PQMF modules
-        discriminator_params : dict
-            Parameters for base discriminator module.
-        use_weight_nom : bool
-            Whether to apply weight normalization.
+
+        Args:
+            repeats (int): Number of repititons to apply RWD.
+            window_sizes (list): List of random window sizes.
+            pqmf_params (list): List of list of Parameters for PQMF modules
+            discriminator_params (dict): Parameters for base discriminator module.
+            use_weight_nom (bool): Whether to apply weight normalization.
        """
        super().__init__()
@ -325,15 +298,11 @@ class StyleMelGANDiscriminator(nn.Layer):
    def forward(self, x):
        """Calculate forward propagation.
-        Parameters
-        ----------
-        x : Tensor
-            Input tensor (B, 1, T).
-        Returns
-        ----------
-        List
-            List of discriminator outputs, #items in the list will be
-            equal to repeats * #discriminators.
+        Args:
+            x (Tensor): Input tensor (B, 1, T).
+        Returns:
+            List: List of discriminator outputs, #items in the list will be
+                equal to repeats * #discriminators.
        """
        outs = []
        for _ in range(self.repeats):
--- a/paddlespeech/t2s/models/parallel_wavegan/parallel_wavegan.py
+++ b/paddlespeech/t2s/models/parallel_wavegan/parallel_wavegan.py
@ -31,51 +31,30 @@ from paddlespeech.t2s.modules.upsample import ConvInUpsampleNet
 class PWGGenerator(nn.Layer):
    """Wave Generator for Parallel WaveGAN
-    Parameters
-    ----------
-    in_channels : int, optional
-        Number of channels of the input waveform, by default 1
-    out_channels : int, optional
-        Number of channels of the output waveform, by default 1
-    kernel_size : int, optional
-        Kernel size of the residual blocks inside, by default 3
-    layers : int, optional
-        Number of residual blocks inside, by default 30
-    stacks : int, optional
-        The number of groups to split the residual blocks into, by default 3
-        Within each group, the dilation of the residual block grows 
-        exponentially.
-    residual_channels : int, optional
-        Residual channel of the residual blocks, by default 64
-    gate_channels : int, optional
-        Gate channel of the residual blocks, by default 128
-    skip_channels : int, optional
-        Skip channel of the residual blocks, by default 64
-    aux_channels : int, optional
-        Auxiliary channel of the residual blocks, by default 80
-    aux_context_window : int, optional
-        The context window size of the first convolution applied to the 
-        auxiliary input, by default 2
-    dropout : float, optional
-        Dropout of the residual blocks, by default 0.
-    bias : bool, optional
-        Whether to use bias in residual blocks, by default True
-    use_weight_norm : bool, optional
-        Whether to use weight norm in all convolutions, by default True
-    use_causal_conv : bool, optional
-        Whether to use causal padding in the upsample network and residual 
-        blocks, by default False
-    upsample_scales : List[int], optional
-        Upsample scales of the upsample network, by default [4, 4, 4, 4]
-    nonlinear_activation : Optional[str], optional
-        Non linear activation in upsample network, by default None
-    nonlinear_activation_params : Dict[str, Any], optional
-        Parameters passed to the linear activation in the upsample network, 
-        by default {}
-    interpolate_mode : str, optional
-        Interpolation mode of the upsample network, by default "nearest"
-    freq_axis_kernel_size : int, optional
-        Kernel size along the frequency axis of the upsample network, by default 1
+    Args:
+        in_channels (int, optional): Number of channels of the input waveform, by default 1
+        out_channels (int, optional): Number of channels of the output waveform, by default 1
+        kernel_size (int, optional): Kernel size of the residual blocks inside, by default 3
+        layers (int, optional): Number of residual blocks inside, by default 30
+        stacks (int, optional): The number of groups to split the residual blocks into, by default 3
+            Within each group, the dilation of the residual block grows exponentially.
+        residual_channels (int, optional): Residual channel of the residual blocks, by default 64
+        gate_channels (int, optional): Gate channel of the residual blocks, by default 128
+        skip_channels (int, optional): Skip channel of the residual blocks, by default 64
+        aux_channels (int, optional): Auxiliary channel of the residual blocks, by default 80
+        aux_context_window (int, optional): The context window size of the first convolution applied to the 
+            auxiliary input, by default 2
+        dropout (float, optional): Dropout of the residual blocks, by default 0.
+        bias (bool, optional): Whether to use bias in residual blocks, by default True
+        use_weight_norm (bool, optional): Whether to use weight norm in all convolutions, by default True
+        use_causal_conv (bool, optional): Whether to use causal padding in the upsample network and residual 
+            blocks, by default False
+        upsample_scales (List[int], optional): Upsample scales of the upsample network, by default [4, 4, 4, 4]
+        nonlinear_activation (Optional[str], optional): Non linear activation in upsample network, by default None
+        nonlinear_activation_params (Dict[str, Any], optional): Parameters passed to the linear activation in the upsample network, 
+            by default {}
+        interpolate_mode (str, optional): Interpolation mode of the upsample network, by default "nearest"
+        freq_axis_kernel_size (int, optional): Kernel size along the frequency axis of the upsample network, by default 1
    """
    def __init__(
@ -167,18 +146,13 @@ class PWGGenerator(nn.Layer):
    def forward(self, x, c):
        """Generate waveform.
-        Parameters
-        ----------
-        x : Tensor
-            Shape (N, C_in, T), The input waveform.
-        c : Tensor
-            Shape (N, C_aux, T'). The auxiliary input (e.g. spectrogram). It 
+        Args:
+            x(Tensor): Shape (N, C_in, T), The input waveform.
+            c(Tensor): Shape (N, C_aux, T'). The auxiliary input (e.g. spectrogram). It
            is upsampled to match the time resolution of the input.
-        Returns
-        -------
-        Tensor
-            Shape (N, C_out, T), the generated waveform.
+        Returns:
+            Tensor: Shape (N, C_out, T), the generated waveform.
        """
        c = self.upsample_net(c)
        assert c.shape[-1] == x.shape[-1]
@ -218,19 +192,14 @@ class PWGGenerator(nn.Layer):
        self.apply(_remove_weight_norm)
    def inference(self, c=None):
-        """Waveform generation. This function is used for single instance 
-        inference.
-        Parameters
-        ----------
-        c : Tensor, optional
-            Shape (T', C_aux), the auxiliary input, by default None
-        x : Tensor, optional
-            Shape (T, C_in), the noise waveform, by default None
-            If not provided, a sample is drawn from a gaussian distribution.
-        Returns
-        -------
-        Tensor
-            Shape (T, C_out), the generated waveform
+        """Waveform generation. This function is used for single instance inference.
+
+        Args:
+            c(Tensor, optional, optional): Shape (T', C_aux), the auxiliary input, by default None
+            x(Tensor, optional): Shape (T, C_in), the noise waveform, by default None
+
+        Returns:
+            Tensor: Shape (T, C_out), the generated waveform
        """
        # when to static, can not input x, see https://github.com/PaddlePaddle/Parakeet/pull/132/files
        x = paddle.randn(
@ -244,32 +213,21 @@ class PWGGenerator(nn.Layer):
 class PWGDiscriminator(nn.Layer):
    """A convolutional discriminator for audio.
-    Parameters
-    ----------
-    in_channels : int, optional
-        Number of channels of the input audio, by default 1
-    out_channels : int, optional
-        Output feature size, by default 1
-    kernel_size : int, optional
-        Kernel size of convolutional sublayers, by default 3
-    layers : int, optional
-        Number of layers, by default 10
-    conv_channels : int, optional
-        Feature size of the convolutional sublayers, by default 64
-    dilation_factor : int, optional
-        The factor with which dilation of each convolutional sublayers grows 
-        exponentially if it is greater than 1, else the dilation of each 
-        convolutional sublayers grows linearly, by default 1
-    nonlinear_activation : str, optional
-        The activation after each convolutional sublayer, by default "leakyrelu"
-    nonlinear_activation_params : Dict[str, Any], optional
-        The parameters passed to the activation's initializer, by default 
-        {"negative_slope": 0.2}
-    bias : bool, optional
-        Whether to use bias in convolutional sublayers, by default True
-    use_weight_norm : bool, optional
-        Whether to use weight normalization at all convolutional sublayers, 
-        by default True
+    Args:
+        in_channels (int, optional): Number of channels of the input audio, by default 1
+        out_channels (int, optional): Output feature size, by default 1
+        kernel_size (int, optional): Kernel size of convolutional sublayers, by default 3
+        layers (int, optional): Number of layers, by default 10
+        conv_channels (int, optional): Feature size of the convolutional sublayers, by default 64
+        dilation_factor (int, optional): The factor with which dilation of each convolutional sublayers grows 
+            exponentially if it is greater than 1, else the dilation of each convolutional sublayers grows linearly, 
+            by default 1
+        nonlinear_activation (str, optional): The activation after each convolutional sublayer, by default "leakyrelu"
+        nonlinear_activation_params (Dict[str, Any], optional): The parameters passed to the activation's initializer, by default 
+            {"negative_slope": 0.2}
+        bias (bool, optional): Whether to use bias in convolutional sublayers, by default True
+        use_weight_norm (bool, optional): Whether to use weight normalization at all convolutional sublayers, 
+            by default True
    """
    def __init__(
@ -330,15 +288,12 @@ class PWGDiscriminator(nn.Layer):
    def forward(self, x):
        """
-        Parameters
-        ----------
-        x : Tensor
-            Shape (N, in_channels, num_samples), the input audio.
-
-        Returns
-        -------
-        Tensor
-            Shape (N, out_channels, num_samples), the predicted logits.
+
+        Args:
+            x (Tensor): Shape (N, in_channels, num_samples), the input audio.
+
+        Returns:
+            Tensor: Shape (N, out_channels, num_samples), the predicted logits.
        """
        return self.conv_layers(x)
@ -362,39 +317,25 @@ class PWGDiscriminator(nn.Layer):
 class ResidualPWGDiscriminator(nn.Layer):
    """A wavenet-style discriminator for audio.
-    Parameters
-    ----------
-    in_channels : int, optional
-        Number of channels of the input audio, by default 1
-    out_channels : int, optional
-        Output feature size, by default 1
-    kernel_size : int, optional
-        Kernel size of residual blocks, by default 3
-    layers : int, optional
-        Number of residual blocks, by default 30
-    stacks : int, optional
-        Number of groups of residual blocks, within which the dilation 
-        of each residual blocks grows exponentially, by default 3
-    residual_channels : int, optional
-        Residual channels of residual blocks, by default 64
-    gate_channels : int, optional
-        Gate channels of residual blocks, by default 128
-    skip_channels : int, optional
-        Skip channels of residual blocks, by default 64
-    dropout : float, optional
-        Dropout probability of residual blocks, by default 0.
-    bias : bool, optional
-        Whether to use bias in residual blocks, by default True
-    use_weight_norm : bool, optional
-        Whether to use weight normalization in all convolutional layers, 
-        by default True
-    use_causal_conv : bool, optional
-        Whether to use causal convolution in residual blocks, by default False
-    nonlinear_activation : str, optional
-        Activation after convolutions other than those in residual blocks, 
-        by default "leakyrelu"
-    nonlinear_activation_params : Dict[str, Any], optional
-        Parameters to pass to the activation, by default {"negative_slope": 0.2}
+    Args:
+        in_channels (int, optional): Number of channels of the input audio, by default 1
+        out_channels (int, optional): Output feature size, by default 1
+        kernel_size (int, optional): Kernel size of residual blocks, by default 3
+        layers (int, optional): Number of residual blocks, by default 30
+        stacks (int, optional): Number of groups of residual blocks, within which the dilation 
+            of each residual blocks grows exponentially, by default 3
+        residual_channels (int, optional): Residual channels of residual blocks, by default 64
+        gate_channels (int, optional): Gate channels of residual blocks, by default 128
+        skip_channels (int, optional): Skip channels of residual blocks, by default 64
+        dropout (float, optional): Dropout probability of residual blocks, by default 0.
+        bias (bool, optional): Whether to use bias in residual blocks, by default True
+        use_weight_norm (bool, optional): Whether to use weight normalization in all convolutional layers, 
+            by default True
+        use_causal_conv (bool, optional): Whether to use causal convolution in residual blocks, by default False
+        nonlinear_activation (str, optional): Activation after convolutions other than those in residual blocks, 
+            by default "leakyrelu"
+        nonlinear_activation_params (Dict[str, Any], optional): Parameters to pass to the activation, 
+            by default {"negative_slope": 0.2}
    """
    def __init__(
@ -463,15 +404,11 @@ class ResidualPWGDiscriminator(nn.Layer):
    def forward(self, x):
        """
-        Parameters
-        ----------
-        x : Tensor
-            Shape (N, in_channels, num_samples), the input audio.
-
-        Returns
-        -------
-        Tensor
-            Shape (N, out_channels, num_samples), the predicted logits.
+        Args:
+            x(Tensor): Shape (N, in_channels, num_samples), the input audio.
+
+        Returns:
+            Tensor: Shape (N, out_channels, num_samples), the predicted logits.
        """
        x = self.first_conv(x)
        skip = 0
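
The `PWGGenerator` and `ResidualPWGDiscriminator` docstrings above say that the residual blocks are split into `stacks` groups and that the dilation grows exponentially within each group. The sketch below illustrates one way such a schedule can be computed. It is not taken from this commit: the helper name `dilation_schedule` is hypothetical, and the growth base of 2 and the requirement that `layers` be divisible by `stacks` are assumptions that the docstrings do not state.

```python
def dilation_schedule(layers, stacks, base=2):
    """Return one dilation value per residual block.

    Assumption: blocks are split evenly into `stacks` groups and the
    dilation restarts at 1 at the beginning of every group.
    """
    if layers % stacks != 0:
        raise ValueError("layers must be divisible by stacks")
    layers_per_stack = layers // stacks
    return [base ** (i % layers_per_stack) for i in range(layers)]


print(dilation_schedule(layers=6, stacks=2))  # [1, 2, 4, 1, 2, 4]
```

Under these assumptions, the documented defaults (`layers=30`, `stacks=3`) would give three groups of ten blocks with dilations 1 through 512.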
--- a/paddlespeech/t2s/models/new_tacotron2/init.py
+++ b/paddlespeech/t2s/models/new_tacotron2/init.py
--- a/paddlespeech/t2s/models/new_tacotron2/tacotron2.py
+++ b/paddlespeech/t2s/models/new_tacotron2/tacotron2.py
@ -81,69 +81,39 @@ class Tacotron2(nn.Layer):
            # training related
            init_type: str="xavier_uniform", ):
        """Initialize Tacotron2 module.
-        Parameters
+        Args:
-        ----------
+            idim (int): Dimension of the inputs.
-        idim : int
+            odim (int): Dimension of the outputs.
-            Dimension of the inputs.
+            embed_dim (int): Dimension of the token embedding.
-        odim : int
+            elayers (int): Number of encoder blstm layers.
-            Dimension of the outputs.
+            eunits (int): Number of encoder blstm units.
-        embed_dim : int
+            econv_layers (int): Number of encoder conv layers.
-            Dimension of the token embedding.
+            econv_filts (int): Number of encoder conv filter size.
-        elayers : int
+            econv_chans (int): Number of encoder conv filter channels.
-            Number of encoder blstm layers.
+            dlayers (int): Number of decoder lstm layers.
-        eunits : int
+            dunits (int): Number of decoder lstm units.
-            Number of encoder blstm units.
+            prenet_layers (int): Number of prenet layers.
-        econv_layers : int
+            prenet_units (int): Number of prenet units.
-            Number of encoder conv layers.
+            postnet_layers (int): Number of postnet layers.
-        econv_filts : int
+            postnet_filts (int): Number of postnet filter size.
-            Number of encoder conv filter size.
+            postnet_chans (int): Number of postnet filter channels.
-        econv_chans : int
+            output_activation (str): Name of activation function for outputs.
-            Number of encoder conv filter channels.
+            adim (int): Number of dimension of mlp in attention.
-        dlayers : int
+            aconv_chans (int): Number of attention conv filter channels.
-            Number of decoder lstm layers.
+            aconv_filts (int): Number of attention conv filter size.
-        dunits : int
+            cumulate_att_w (bool): Whether to cumulate previous attention weight.
-            Number of decoder lstm units.
+            use_batch_norm (bool): Whether to use batch normalization.
-        prenet_layers : int
+            use_concate (bool): Whether to concat enc outputs w/ dec lstm outputs.
-            Number of prenet layers.
+            reduction_factor (int): Reduction factor.
-        prenet_units : int
+            spk_num (Optional[int]): Number of speakers. If set to > 1, assume that the
-            Number of prenet units.
+                sids will be provided as the input and use sid embedding layer.
-        postnet_layers : int
+            lang_num (Optional[int]): Number of languages. If set to > 1, assume that the
-            Number of postnet layers.
+                lids will be provided as the input and use sid embedding layer.
-        postnet_filts : int
+            spk_embed_dim (Optional[int]): Speaker embedding dimension. If set to > 0,
-            Number of postnet filter size.
+                assume that spk_emb will be provided as the input.
-        postnet_chans : int
+            spk_embed_integration_type (str): How to integrate speaker embedding.
-            Number of postnet filter channels.
+            dropout_rate (float): Dropout rate.
-        output_activation : str
+            zoneout_rate (float): Zoneout rate.
            Name of activation function for outputs.
        adim : int
            Number of dimension of mlp in attention.
        aconv_chans : int
            Number of attention conv filter channels.
        aconv_filts : int
            Number of attention conv filter size.
        cumulate_att_w : bool
            Whether to cumulate previous attention weight.
        use_batch_norm : bool
            Whether to use batch normalization.
        use_concate : bool
            Whether to concat enc outputs w/ dec lstm outputs.
        reduction_factor : int
            Reduction factor.
        spk_num : Optional[int]
            Number of speakers. If set to > 1, assume that the
            sids will be provided as the input and use sid embedding layer.
        lang_num : Optional[int]
            Number of languages. If set to > 1, assume that the
            lids will be provided as the input and use lid embedding layer.
        spk_embed_dim : Optional[int]
            Speaker embedding dimension. If set to > 0,
            assume that spk_emb will be provided as the input.
        spk_embed_integration_type : str
            How to integrate speaker embedding.
        dropout_rate : float
            Dropout rate.
        zoneout_rate : float
            Zoneout rate.
        """
        assert check_argument_types()
        super().__init__()
@ -258,31 +228,19 @@ class Tacotron2(nn.Layer):
    ) -> Tuple[paddle.Tensor, Dict[str, paddle.Tensor], paddle.Tensor]:
        """Calculate forward propagation.
-        Parameters
+        Args:
-        ----------
+            text (Tensor(int64)): Batch of padded character ids (B, T_text).
-        text : Tensor(int64)
+            text_lengths (Tensor(int64)): Batch of lengths of each input sequence (B,).
-            Batch of padded character ids (B, T_text).
+            speech (Tensor): Batch of padded target features (B, T_feats, odim).
-        text_lengths : Tensor(int64)
+            speech_lengths (Tensor(int64)): Batch of the lengths of each target (B,).
-            Batch of lengths of each input batch (B,).
+            spk_emb (Optional[Tensor]): Batch of speaker embeddings (B, spk_embed_dim).
-        speech : Tensor
+            spk_id (Optional[Tensor]): Batch of speaker IDs (B, 1).
-            Batch of padded target features (B, T_feats, odim).
+            lang_id (Optional[Tensor]): Batch of language IDs (B, 1).
-        speech_lengths : Tensor(int64)
+
-            Batch of the lengths of each target (B,).
+        Returns:
-        spk_emb : Optional[Tensor]
+            Tensor: Loss scalar value.
-            Batch of speaker embeddings (B, spk_embed_dim).
+            Dict: Statistics to be monitored.
-        spk_id : Optional[Tensor]
+            Tensor: Weight value if not joint training else model outputs.
            Batch of speaker IDs (B, 1).
        lang_id : Optional[Tensor]
            Batch of language IDs (B, 1).
        Returns
        ----------
        Tensor
            Loss scalar value.
        Dict
            Statistics to be monitored.
        Tensor
            Weight value if not joint training else model outputs.
        """
        text = text[:, :text_lengths.max()]
@ -369,40 +327,26 @@ class Tacotron2(nn.Layer):
            use_teacher_forcing: bool=False, ) -> Dict[str, paddle.Tensor]:
        """Generate the sequence of features given the sequences of characters.
-        Parameters
+        Args:
-        ----------
+            text (Tensor(int64)): Input sequence of characters (T_text,).
-        text Tensor(int64)
+            speech (Optional[Tensor]): Feature sequence to extract style (N, idim).
-            Input sequence of characters (T_text,).
+            spk_emb (Optional[Tensor]): Speaker embedding (spk_embed_dim,).
-        speech : Optional[Tensor]
+            spk_id (Optional[Tensor]): Speaker ID (1,).
-            Feature sequence to extract style (N, idim).
+            lang_id (Optional[Tensor]): Language ID (1,).
-        spk_emb : ptional[Tensor]
+            threshold (float): Threshold in inference.
-            Speaker embedding (spk_embed_dim,).
+            minlenratio (float): Minimum length ratio in inference.
-        spk_id : Optional[Tensor]
+            maxlenratio (float): Maximum length ratio in inference.
-            Speaker ID (1,).
+            use_att_constraint (bool): Whether to apply attention constraint.
-        lang_id : Optional[Tensor]
+            backward_window (int): Backward window in attention constraint.
-            Language ID (1,).
+            forward_window (int): Forward window in attention constraint.
-        threshold : float
+            use_teacher_forcing (bool): Whether to use teacher forcing.
-            Threshold in inference.
+
-        minlenratio : float
+        Returns:
-            Minimum length ratio in inference.
+            Dict[str, Tensor]
-        maxlenratio : float
+            Output dict including the following items:
-            Maximum length ratio in inference.
+                * feat_gen (Tensor): Output sequence of features (T_feats, odim).
-        use_att_constraint : bool
+                * prob (Tensor): Output sequence of stop probabilities (T_feats,).
-            Whether to apply attention constraint.
+                * att_w (Tensor): Attention weights (T_feats, T).
        backward_window : int
            Backward window in attention constraint.
        forward_window : int
            Forward window in attention constraint.
        use_teacher_forcing : bool
            Whether to use teacher forcing.
        Return
        ----------
        Dict[str, Tensor]
        Output dict including the following items:
            * feat_gen (Tensor): Output sequence of features (T_feats, odim).
            * prob (Tensor): Output sequence of stop probabilities (T_feats,).
            * att_w (Tensor): Attention weights (T_feats, T).
        """
        x = text
@ -458,18 +402,13 @@ class Tacotron2(nn.Layer):
                                  spk_emb: paddle.Tensor) -> paddle.Tensor:
        """Integrate speaker embedding with hidden states.
-        Parameters
+        Args:
-        ----------
+            hs (Tensor): Batch of hidden state sequences (B, Tmax, eunits).
-         hs : Tensor
+            spk_emb (Tensor): Batch of speaker embeddings (B, spk_embed_dim).
-            Batch of hidden state sequences (B, Tmax, eunits).
+
-         spk_emb : Tensor
+        Returns:
-            Batch of speaker embeddings (B, spk_embed_dim).
+            Tensor: Batch of integrated hidden state sequences (B, Tmax, eunits) if
-
+                integration_type is "add" else (B, Tmax, eunits + spk_embed_dim).
        Returns
        ----------
         Tensor
            Batch of integrated hidden state sequences (B, Tmax, eunits) if
            integration_type is "add" else (B, Tmax, eunits + spk_embed_dim).
        """
        if self.spk_embed_integration_type == "add":
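Review note: the docstring above says the non-"add" path returns hidden states of width `eunits + spk_embed_dim`. A minimal pure-Python sketch of that shape behaviour follows, assuming the non-"add" branch simply repeats each speaker embedding along the time axis and concatenates it to every frame. The helper name `concat_speaker_embedding` is hypothetical, nested lists stand in for tensors, and any normalization the real layer may apply is left out.

```python
from typing import List

Frame = List[float]


def concat_speaker_embedding(hs: List[List[Frame]],
                             spk_emb: List[Frame]) -> List[List[Frame]]:
    """Concatenate one speaker embedding to every frame of its utterance.

    hs:      nested list of shape (B, Tmax, eunits)
    spk_emb: nested list of shape (B, spk_embed_dim)
    returns: nested list of shape (B, Tmax, eunits + spk_embed_dim)
    """
    if len(hs) != len(spk_emb):
        raise ValueError("hs and spk_emb must have the same batch size")
    # The same embedding is reused for all Tmax frames of a batch item.
    return [[frame + emb for frame in seq] for seq, emb in zip(hs, spk_emb)]


# Hypothetical sizes: B=2, Tmax=4, eunits=3, spk_embed_dim=2.
hs = [[[0.0] * 3 for _ in range(4)] for _ in range(2)]
spk_emb = [[1.0, 2.0], [3.0, 4.0]]
out = concat_speaker_embedding(hs, spk_emb)
```

The "add" path is not sketched here because it needs a learned projection from `spk_embed_dim` to `eunits`, which this hunk does not show.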
--- a/paddlespeech/t2s/models/new_tacotron2/tacotron2_updater.py
+++ b/paddlespeech/t2s/models/new_tacotron2/tacotron2_updater.py
--- a/paddlespeech/t2s/models/transformer_tts/transformer_tts.py
+++ b/paddlespeech/t2s/models/transformer_tts/transformer_tts.py
@ -48,127 +48,67 @@ class TransformerTTS(nn.Layer):
    .. _`Neural Speech Synthesis with Transformer Network`:
        https://arxiv.org/pdf/1809.08895.pdf
-    Parameters
+    Args:
-    ----------
+        idim (int): Dimension of the inputs.
-    idim : int
+        odim (int): Dimension of the outputs.
-        Dimension of the inputs.
+        embed_dim (int, optional): Dimension of character embedding.
-    odim : int
+        eprenet_conv_layers (int, optional): Number of encoder prenet convolution layers.
-        Dimension of the outputs.
+        eprenet_conv_chans (int, optional): Number of encoder prenet convolution channels.
-    embed_dim : int, optional
+        eprenet_conv_filts (int, optional): Filter size of encoder prenet convolution.
-        Dimension of character embedding.
+        dprenet_layers (int, optional): Number of decoder prenet layers.
-    eprenet_conv_layers : int, optional
+        dprenet_units (int, optional): Number of decoder prenet hidden units.
-        Number of encoder prenet convolution layers.
+        elayers (int, optional): Number of encoder layers.
-    eprenet_conv_chans : int, optional
+        eunits (int, optional): Number of encoder hidden units.
-        Number of encoder prenet convolution channels.
+        adim (int, optional): Number of attention transformation dimensions.
-    eprenet_conv_filts : int, optional
+        aheads (int, optional): Number of heads for multi head attention.
-        Filter size of encoder prenet convolution.
+        dlayers (int, optional): Number of decoder layers.
-    dprenet_layers : int, optional
+        dunits (int, optional): Number of decoder hidden units.
-        Number of decoder prenet layers.
+        postnet_layers (int, optional): Number of postnet layers.
-    dprenet_units : int, optional
+        postnet_chans (int, optional): Number of postnet channels.
-        Number of decoder prenet hidden units.
+        postnet_filts (int, optional): Filter size of postnet.
-    elayers : int, optional
+        use_scaled_pos_enc (bool, optional): Whether to use trainable scaled positional encoding.
-        Number of encoder layers.
+        use_batch_norm (bool, optional): Whether to use batch normalization in encoder prenet.
-    eunits : int, optional
+        encoder_normalize_before (bool, optional): Whether to perform layer normalization before encoder block.
-        Number of encoder hidden units.
+        decoder_normalize_before (bool, optional): Whether to perform layer normalization before decoder block.
-    adim : int, optional
+        encoder_concat_after (bool, optional): Whether to concatenate attention layer's input and output in encoder.
-        Number of attention transformation dimensions.
+        decoder_concat_after (bool, optional): Whether to concatenate attention layer's input and output in decoder.
-    aheads : int, optional
+        positionwise_layer_type (str, optional): Position-wise operation type.
-        Number of heads for multi head attention.
+        positionwise_conv_kernel_size (int, optional): Kernel size in position wise conv 1d.
-    dlayers : int, optional
+        reduction_factor (int, optional): Reduction factor.
-        Number of decoder layers.
+        spk_embed_dim (int, optional): Number of speaker embedding dimensions.
-    dunits : int, optional
+        spk_embed_integration_type (str, optional): How to integrate speaker embedding.
-        Number of decoder hidden units.
+        use_gst (bool, optional): Whether to use global style token.
-    postnet_layers : int, optional
+        gst_tokens (int, optional): The number of GST embeddings.
-        Number of postnet layers.
+        gst_heads (int, optional): The number of heads in GST multihead attention.
-    postnet_chans : int, optional
+        gst_conv_layers (int, optional): The number of conv layers in GST.
-        Number of postnet channels.
+        gst_conv_chans_list (Sequence[int], optional): List of the number of channels of conv layers in GST.
-    postnet_filts : int, optional
+        gst_conv_kernel_size (int, optional): Kernel size of conv layers in GST.
-        Filter size of postnet.
+        gst_conv_stride (int, optional): Stride size of conv layers in GST.
-    use_scaled_pos_enc : pool, optional
+        gst_gru_layers (int, optional): The number of GRU layers in GST.
-        Whether to use trainable scaled positional encoding.
+        gst_gru_units (int, optional): The number of GRU units in GST.
-    use_batch_norm : bool, optional
+        transformer_lr (float, optional): Initial value of learning rate.
-        Whether to use batch normalization in encoder prenet.
+        transformer_warmup_steps (int, optional): Optimizer warmup steps.
-    encoder_normalize_before : bool, optional
+        transformer_enc_dropout_rate (float, optional): Dropout rate in encoder except attention and positional encoding.
-        Whether to perform layer normalization before encoder block.
+        transformer_enc_positional_dropout_rate (float, optional): Dropout rate after encoder positional encoding.
-    decoder_normalize_before : bool, optional
+        transformer_enc_attn_dropout_rate (float, optional): Dropout rate in encoder self-attention module.
-        Whether to perform layer normalization before decoder block.
+        transformer_dec_dropout_rate (float, optional): Dropout rate in decoder except attention & positional encoding.
-    encoder_concat_after : bool, optional
+        transformer_dec_positional_dropout_rate (float, optional): Dropout rate after decoder positional encoding.
-        Whether to concatenate attention layer's input and output in encoder.
+        transformer_dec_attn_dropout_rate (float, optional): Dropout rate in decoder self-attention module.
-    decoder_concat_after : bool, optional
+        transformer_enc_dec_attn_dropout_rate (float, optional): Dropout rate in encoder-decoder attention module.
-        Whether to concatenate attention layer's input and output in decoder.
+        init_type (str, optional): How to initialize transformer parameters.
-    positionwise_layer_type : str, optional
+        init_enc_alpha (float, optional): Initial value of alpha in scaled pos encoding of the encoder.
-        Position-wise operation type.
+        init_dec_alpha (float, optional): Initial value of alpha in scaled pos encoding of the decoder.
-    positionwise_conv_kernel_size : int, optional
+        eprenet_dropout_rate (float, optional): Dropout rate in encoder prenet.
-        Kernel size in position wise conv 1d.
+        dprenet_dropout_rate (float, optional): Dropout rate in decoder prenet.
-    reduction_factor : int, optional
+        postnet_dropout_rate (float, optional): Dropout rate in postnet.
-        Reduction factor.
+        use_masking (bool, optional): Whether to apply masking for padded part in loss calculation.
-    spk_embed_dim : int, optional
+        use_weighted_masking (bool, optional): Whether to apply weighted masking in loss calculation.
-        Number of speaker embedding dimenstions.
+        bce_pos_weight (float, optional): Positive sample weight in bce calculation (only for use_masking=True).
-    spk_embed_integration_type : str, optional
+        loss_type (str, optional): How to calculate loss.
-        How to integrate speaker embedding.
+        use_guided_attn_loss (bool, optional): Whether to use guided attention loss.
-    use_gst : str, optional
+        num_heads_applied_guided_attn (int, optional): Number of heads in each layer to apply guided attention loss.
-        Whether to use global style token.
+        num_layers_applied_guided_attn (int, optional): Number of layers to apply guided attention loss.
-    gst_tokens : int, optional
+            List of module names to apply guided attention loss.
        The number of GST embeddings.
    gst_heads : int, optional
        The number of heads in GST multihead attention.
    gst_conv_layers : int, optional
        The number of conv layers in GST.
    gst_conv_chans_list : Sequence[int], optional
            List of the number of channels of conv layers in GST.
    gst_conv_kernel_size : int, optional
        Kernel size of conv layers in GST.
    gst_conv_stride : int, optional
        Stride size of conv layers in GST.
    gst_gru_layers : int, optional
        The number of GRU layers in GST.
    gst_gru_units : int, optional
        The number of GRU units in GST.
    transformer_lr : float, optional
        Initial value of learning rate.
    transformer_warmup_steps : int, optional
        Optimizer warmup steps.
    transformer_enc_dropout_rate : float, optional
        Dropout rate in encoder except attention and positional encoding.
    transformer_enc_positional_dropout_rate : float, optional
        Dropout rate after encoder positional encoding.
    transformer_enc_attn_dropout_rate : float, optional
        Dropout rate in encoder self-attention module.
    transformer_dec_dropout_rate : float, optional
        Dropout rate in decoder except attention & positional encoding.
    transformer_dec_positional_dropout_rate : float, optional
        Dropout rate after decoder positional encoding.
    transformer_dec_attn_dropout_rate : float, optional
        Dropout rate in decoder self-attention module.
    transformer_enc_dec_attn_dropout_rate : float, optional
        Dropout rate in encoder-decoder attention module.
    init_type : str, optional
        How to initialize transformer parameters.
    init_enc_alpha : float, optional
        Initial value of alpha in scaled pos encoding of the encoder.
    init_dec_alpha : float, optional
        Initial value of alpha in scaled pos encoding of the decoder.
    eprenet_dropout_rate : float, optional
        Dropout rate in encoder prenet.
    dprenet_dropout_rate : float, optional
        Dropout rate in decoder prenet.
    postnet_dropout_rate : float, optional
        Dropout rate in postnet.
    use_masking : bool, optional
        Whether to apply masking for padded part in loss calculation.
    use_weighted_masking : bool, optional
        Whether to apply weighted masking in loss calculation.
    bce_pos_weight : float, optional
        Positive sample weight in bce calculation (only for use_masking=true).
    loss_type : str, optional
        How to calculate loss.
    use_guided_attn_loss : bool, optional
        Whether to use guided attention loss.
    num_heads_applied_guided_attn : int, optional
        Number of heads in each layer to apply guided attention loss.
    num_layers_applied_guided_attn : int, optional
        Number of layers to apply guided attention loss.
        List of module names to apply guided attention loss.
    """
    def __init__(
@ -398,25 +338,16 @@ class TransformerTTS(nn.Layer):
    ) -> Tuple[paddle.Tensor, Dict[str, paddle.Tensor], paddle.Tensor]:
        """Calculate forward propagation.
-        Parameters
+        Args:
-        ----------
+            text(Tensor(int64)): Batch of padded character ids (B, Tmax).
-        text : Tensor(int64)
+            text_lengths(Tensor(int64)): Batch of lengths of each input sequence (B,).
-            Batch of padded character ids (B, Tmax).
+            speech(Tensor): Batch of padded target features (B, Lmax, odim).
-        text_lengths : Tensor(int64)
+            speech_lengths(Tensor(int64)): Batch of the lengths of each target (B,).
-            Batch of lengths of each input batch (B,).
+            spk_emb(Tensor, optional): Batch of speaker embeddings (B, spk_embed_dim).
-        speech : Tensor
+
-            Batch of padded target features (B, Lmax, odim).
+        Returns:
-        speech_lengths : Tensor(int64)
+            Tensor: Loss scalar value.
-            Batch of the lengths of each target (B,).
+            Dict: Statistics to be monitored.
        spk_emb : Tensor, optional
            Batch of speaker embeddings (B, spk_embed_dim).
        Returns
        ----------
        Tensor
            Loss scalar value.
        Dict
            Statistics to be monitored.
        """
        # input of embedding must be int64
@ -525,31 +456,19 @@ class TransformerTTS(nn.Layer):
    ) -> Tuple[paddle.Tensor, paddle.Tensor, paddle.Tensor]:
        """Generate the sequence of features given the sequences of characters.
-        Parameters
+        Args:
-        ----------
+            text(Tensor(int64)): Input sequence of characters (T,).
-        text : Tensor(int64)
+            speech(Tensor, optional): Feature sequence to extract style (N, idim).
-            Input sequence of characters (T,).
+            spk_emb(Tensor, optional): Speaker embedding vector (spk_embed_dim,).
-        speech : Tensor, optional
+            threshold(float, optional): Threshold in inference.
-            Feature sequence to extract style (N, idim).
+            minlenratio(float, optional): Minimum length ratio in inference.
-        spk_emb : Tensor, optional
+            maxlenratio(float, optional): Maximum length ratio in inference.
-            Speaker embedding vector (spk_embed_dim,).
+            use_teacher_forcing(bool, optional): Whether to use teacher forcing.
-        threshold : float, optional
+
-            Threshold in inference.
+        Returns:
-        minlenratio : float, optional
+            Tensor: Output sequence of features (L, odim).
-            Minimum length ratio in inference.
+            Tensor: Output sequence of stop probabilities (L,).
-        maxlenratio : float, optional
+            Tensor: Encoder-decoder (source) attention weights (#layers, #heads, L, T).
            Maximum length ratio in inference.
        use_teacher_forcing : bool, optional
            Whether to use teacher forcing.
        Returns
        ----------
        Tensor
            Output sequence of features (L, odim).
        Tensor
            Output sequence of stop probabilities (L,).
        Tensor
            Encoder-decoder (source) attention weights (#layers, #heads, L, T).
        """
        # input of embedding must be int64
@ -671,23 +590,17 @@ class TransformerTTS(nn.Layer):
    def _source_mask(self, ilens: paddle.Tensor) -> paddle.Tensor:
        """Make masks for self-attention.
-        Parameters
+        Args:
-        ----------
+            ilens(Tensor): Batch of lengths (B,).
        ilens : Tensor
            Batch of lengths (B,).
-        Returns
+        Returns:
-        -------
+            Tensor: Mask tensor for self-attention. dtype=paddle.bool
        Tensor
            Mask tensor for self-attention.
            dtype=paddle.bool
-        Examples
+        Examples:
-        -------
+            >>> ilens = [5, 3]
-        >>> ilens = [5, 3]
+            >>> self._source_mask(ilens)
-        >>> self._source_mask(ilens)
+            tensor([[[1, 1, 1, 1, 1],
-        tensor([[[1, 1, 1, 1, 1],
+                        [1, 1, 1, 0, 0]]]) bool
                    [1, 1, 1, 0, 0]]]) bool
        """
        x_masks = make_non_pad_mask(ilens)
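Review note: the docstring example for `_source_mask` can be reproduced without Paddle. The sketch below is a pure-Python stand-in for the padding mask only; the function name `non_pad_mask` is hypothetical and plain lists replace tensors. It assumes, as the example suggests, that position `t` of a sequence is valid exactly when `t` is less than that sequence's length. Any extra axis the real method adds for attention is not modelled.

```python
from typing import List


def non_pad_mask(lengths: List[int]) -> List[List[int]]:
    """Return a (B, Tmax) mask: 1 marks a real position, 0 marks padding."""
    t_max = max(lengths)
    return [[1 if t < n else 0 for t in range(t_max)] for n in lengths]


mask = non_pad_mask([5, 3])
for row in mask:
    print(row)
```

With `ilens = [5, 3]` the second row ends in two zeros, matching the padded tail shown in the docstring.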
@ -696,30 +609,25 @@ class TransformerTTS(nn.Layer):
    def _target_mask(self, olens: paddle.Tensor) -> paddle.Tensor:
        """Make masks for masked self-attention.
-        Parameters
+        Args:
-        ----------
+            olens (Tensor(int64)): Batch of lengths (B,).
-            olens : LongTensor
+
-                Batch of lengths (B,).
+        Returns:
-
+            Tensor: Mask tensor for masked self-attention.
-        Returns
+
-        ----------
+        Examples:
-        Tensor
+            >>> olens = [5, 3]
-            Mask tensor for masked self-attention.
+            >>> self._target_mask(olens)
-
+            tensor([[[1, 0, 0, 0, 0],
-        Examples
+                        [1, 1, 0, 0, 0],
-        ----------
+                        [1, 1, 1, 0, 0],
-        >>> olens = [5, 3]
+                        [1, 1, 1, 1, 0],
-        >>> self._target_mask(olens)
+                        [1, 1, 1, 1, 1]],
-        tensor([[[1, 0, 0, 0, 0],
+                    [[1, 0, 0, 0, 0],
-                    [1, 1, 0, 0, 0],
+                        [1, 1, 0, 0, 0],
-                    [1, 1, 1, 0, 0],
+                        [1, 1, 1, 0, 0],
-                    [1, 1, 1, 1, 0],
+                        [1, 1, 1, 0, 0],
-                    [1, 1, 1, 1, 1]],
+                        [1, 1, 1, 0, 0]]], dtype=paddle.uint8)
                [[1, 0, 0, 0, 0],
                    [1, 1, 0, 0, 0],
                    [1, 1, 1, 0, 0],
                    [1, 1, 1, 0, 0],
                    [1, 1, 1, 0, 0]]], dtype=paddle.uint8)
        """
        y_masks = make_non_pad_mask(olens)
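Review note: the `_target_mask` example combines two conditions, a padding mask and a look-ahead (causal) mask. A pure-Python sketch of that combination is given below. The function name `target_mask` is hypothetical and nested lists stand in for tensors. It assumes that query step `i` may attend to key step `j` only when `j <= i` and `j` lies inside the sequence.

```python
from typing import List


def target_mask(lengths: List[int]) -> List[List[List[int]]]:
    """Return a (B, Tmax, Tmax) mask for masked self-attention.

    Entry [b][i][j] is 1 when key j is not in the future of query i
    (j <= i) and is not padding for sequence b (j < lengths[b]).
    """
    t_max = max(lengths)
    return [[[1 if (j <= i and j < n) else 0 for j in range(t_max)]
             for i in range(t_max)]
            for n in lengths]


masks = target_mask([5, 3])
for batch_item in masks:
    for row in batch_item:
        print(row)
    print()
```

For the length-3 sequence, rows 3 and 4 repeat `[1, 1, 1, 0, 0]` because the padding condition caps the visible keys, which is the pattern in the docstring.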
@ -731,17 +639,12 @@ class TransformerTTS(nn.Layer):
                                  spk_emb: paddle.Tensor) -> paddle.Tensor:
        """Integrate speaker embedding with hidden states.
-        Parameters
+        Args:
-        ----------
+            hs(Tensor): Batch of hidden state sequences (B, Tmax, adim).
-        hs : Tensor
+            spk_emb(Tensor): Batch of speaker embeddings (B, spk_embed_dim).
-            Batch of hidden state sequences (B, Tmax, adim).
+
-        spk_emb : Tensor
+        Returns:
-            Batch of speaker embeddings (B, spk_embed_dim).
+            Tensor: Batch of integrated hidden state sequences (B, Tmax, adim).
        Returns
        ----------
        Tensor
            Batch of integrated hidden state sequences (B, Tmax, adim).
        """
        if self.spk_embed_integration_type == "add":
--- a/paddlespeech/t2s/models/waveflow.py
+++ b/paddlespeech/t2s/models/waveflow.py
@ -30,20 +30,14 @@ __all__ = ["WaveFlow", "ConditionalWaveFlow", "WaveFlowLoss"]
 def fold(x, n_group):
-    r"""Fold audio or spectrogram's temporal dimension in to groups.
+    """Fold audio or spectrogram's temporal dimension into groups.
-    Parameters
+    Args:
-    ----------
+        x(Tensor): The input tensor. shape=(*, time_steps)
-    x : Tensor [shape=(\*, time_steps)
+        n_group(int): The size of a group.
        The input tensor.
-    n_group : int
+    Returns:
-        The size of a group.
+        Tensor: Folded tensor. shape=(*, time_steps // n_group, group)
    Returns
    ---------
    Tensor : [shape=(\*, time_steps // n_group, group)]
        Folded tensor.
    """
    spatial_shape = list(x.shape[:-1])
    time_steps = paddle.shape(x)[-1]
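Review note: the shape change described for `fold`, from `(*, time_steps)` to `(*, time_steps // n_group, group)`, can be illustrated on a single 1-D sequence. The sketch below is pure Python with a hypothetical name, `fold_1d`. It assumes `time_steps` is a multiple of `n_group`, and raising `ValueError` otherwise is a choice made for this sketch, not behaviour confirmed by the diff.

```python
from typing import List, Sequence


def fold_1d(samples: Sequence[float], n_group: int) -> List[List[float]]:
    """Split a length-T sequence into T // n_group consecutive rows."""
    if n_group <= 0 or len(samples) % n_group != 0:
        raise ValueError("time_steps must be a positive multiple of n_group")
    n_rows = len(samples) // n_group
    # Row r holds samples r * n_group .. (r + 1) * n_group - 1.
    return [list(samples[r * n_group:(r + 1) * n_group])
            for r in range(n_rows)]


folded = fold_1d([0, 1, 2, 3, 4, 5], n_group=2)
print(folded)
```

Consecutive samples stay adjacent inside a row, so the last axis has size `n_group` and the new second-to-last axis has size `time_steps // n_group`.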
@ -58,27 +52,23 @@ class UpsampleNet(nn.LayerList):
    It consists of several conv2dtranspose layers which perform deconvolution
    on mel and time dimension.
-    Parameters
+    Args:
-    ----------
+        upscale_factors(List[int], optional): Time upsampling factors for each Conv2DTranspose Layer.
-    upscale_factors : List[int], optional
+            The ``UpsampleNet`` contains ``len(upscale_factor)`` Conv2DTranspose
-        Time upsampling factors for each Conv2DTranspose Layer.
+            Layers. Each upscale_factor is used as the ``stride`` for the
-
+            corresponding Conv2DTranspose. Defaults to [16, 16], so the default
-        The ``UpsampleNet`` contains ``len(upscale_factor)`` Conv2DTranspose
+            upsampling factor is 256.
        Layers. Each upscale_factor is used as the ``stride`` for the
        corresponding Conv2DTranspose. Defaults to [16, 16], this the default
        upsampling factor is 256.
-    Notes
+    Notes:
-    ------
+        ``np.prod(upscale_factors)`` should equal the ``hop_length`` of the stft
-    ``np.prod(upscale_factors)`` should equals the ``hop_length`` of the stft
+        transformation used to extract spectrogram features from audio.
    transformation used to extract spectrogram features from audio.
-    For example, ``16 * 16 = 256``, then the spectrogram extracted with a stft
+        For example, ``16 * 16 = 256``, then the spectrogram extracted with a stft
-    transformation whose ``hop_length`` equals 256 is suitable.
+        transformation whose ``hop_length`` equals 256 is suitable.
-    See Also
+        See Also
-    ---------
+    
-    ``librosa.core.stft``
+        ``librosa.core.stft``
    """
    def __init__(self, upsample_factors):
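Review note: the Notes section states that the product of the upsampling factors should match the STFT `hop_length`. A small check of that arithmetic is sketched below using only the standard library. The helper names `total_upsampling` and `check_hop_length` are hypothetical and not part of the module.

```python
import math
from typing import Sequence


def total_upsampling(factors: Sequence[int]) -> int:
    """Overall time upsampling factor: the product of per-layer strides."""
    return math.prod(factors)


def check_hop_length(factors: Sequence[int], hop_length: int) -> None:
    """Raise if the stacked strides do not reproduce the STFT hop length."""
    total = total_upsampling(factors)
    if total != hop_length:
        raise ValueError(
            f"prod({list(factors)}) = {total}, expected hop_length {hop_length}")


print(total_upsampling([16, 16]))
check_hop_length([16, 16], hop_length=256)
```

With the default `[16, 16]`, one spectrogram frame expands to 256 samples, which is why features extracted with `hop_length=256` line up with the audio.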
@ -101,25 +91,18 @@ class UpsampleNet(nn.LayerList):
        self.upsample_factors = upsample_factors
    def forward(self, x, trim_conv_artifact=False):
-        r"""Forward pass of the ``UpsampleNet``.
+        """Forward pass of the ``UpsampleNet``
-        Parameters
+        Args:
-        -----------
+            x(Tensor): The input spectrogram. shape=(batch_size, input_channels, time_steps)
-        x : Tensor [shape=(batch_size, input_channels, time_steps)]
+            trim_conv_artifact(bool, optional): Trim deconvolution artifact at each layer. Defaults to False.
            The input spectrogram.
-        trim_conv_artifact : bool, optional
+        Returns:
-            Trim deconvolution artifact at each layer. Defaults to False.
+            Tensor: The upsampled spectrogram. shape=(batch_size, input_channels, time_steps * upsample_factor)
-        Returns
+        Notes:
-        --------
+            If trim_conv_artifact is ``True``, the output time steps is less
-        Tensor: [shape=(batch_size, input_channels, time_steps \* upsample_factor)]
+            than ``time_steps * upsample_factors``.
            The upsampled spectrogram.
        Notes
        --------
        If trim_conv_artifact is ``True``, the output time steps is less
        than ``time_steps \* upsample_factors``.
        """
        x = paddle.unsqueeze(x, 1)  # (B, C, T) -> (B, 1, C, T)
        for layer in self:
@ -139,19 +122,11 @@ class ResidualBlock(nn.Layer):
    same padding in width dimension. It also has projection for the condition
    and output.
-    Parameters
+    Args:
-    ----------
+        channels (int): Feature size of the input.
-    channels : int
+        cond_channels (int): Feature size of the condition.
-        Feature size of the input.
+        kernel_size (Tuple[int]): Kernel size of the Convolution2d applied to the input.
-
+        dilations (int): Dilations of the Convolution2d applied to the input.
    cond_channels : int
        Feature size of the condition.
    kernel_size : Tuple[int]
        Kernel size of the Convolution2d applied to the input.
    dilations : int
        Dilations of the Convolution2d applied to the input.
    """
    def __init__(self, channels, cond_channels, kernel_size, dilations):
@ -197,21 +172,13 @@ class ResidualBlock(nn.Layer):
    def forward(self, x, condition):
        """Compute output for a whole folded sequence.
-        Parameters
+        Args:
-        ----------
+            x (Tensor): The input. [shape=(batch_size, channel, height, width)]
-        x : Tensor [shape=(batch_size, channel, height, width)]
+            condition (Tensor): The local condition. [shape=(batch_size, condition_channel, height, width)]
            The input.
        condition : Tensor [shape=(batch_size, condition_channel, height, width)]
            The local condition.
-        Returns
+        Returns: 
-        -------
+            res (Tensor): The residual output. [shape=(batch_size, channel, height, width)]
-        res : Tensor [shape=(batch_size, channel, height, width)]
+            skip (Tensor): The skip output. [shape=(batch_size, channel, height, width)]
            The residual output.
        skip : Tensor [shape=(batch_size, channel, height, width)]
            The skip output.
        """
        x_in = x
        x = self.conv(x)
@ -248,21 +215,14 @@ class ResidualBlock(nn.Layer):
    def add_input(self, x_row, condition_row):
        """Compute the output for a row and update the buffer.
-        Parameters
+        Args:
-        ----------
+            x_row (Tensor): A row of the input. shape=(batch_size, channel, 1, width)
-        x_row : Tensor [shape=(batch_size, channel, 1, width)]
+            condition_row (Tensor): A row of the condition. shape=(batch_size, condition_channel, 1, width)
            A row of the input.
        condition_row : Tensor [shape=(batch_size, condition_channel, 1, width)]
            A row of the condition.
-        Returns
+        Returns:
-        -------
+            res (Tensor): A row of the residual output. shape=(batch_size, channel, 1, width)
-        res : Tensor [shape=(batch_size, channel, 1, width)]
+            skip (Tensor): A row of the skip output. shape=(batch_size, channel, 1, width)
            A row of the residual output.
        skip : Tensor [shape=(batch_size, channel, 1, width)]
            A row of the skip output.
        """
        x_row_in = x_row
        if len(paddle.shape(self._conv_buffer)) == 1:
@ -297,27 +257,15 @@ class ResidualBlock(nn.Layer):
 class ResidualNet(nn.LayerList):
    """A stack of several ResidualBlocks. It merges condition at each layer.
-    Parameters
+    Args:
-    ----------
+        n_layer (int): Number of ResidualBlocks in the ResidualNet.
-    n_layer : int
+        residual_channels (int): Feature size of each ResidualBlock.
-        Number of ResidualBlocks in the ResidualNet.
+        condition_channels (int): Feature size of the condition.
-
+        kernel_size (Tuple[int]): Kernel size of each ResidualBlock.
-    residual_channels : int
+        dilations_h (List[int]): Dilation in height dimension of every ResidualBlock.
        Feature size of each ResidualBlock.
    condition_channels : int
        Feature size of the condition.
-    kernel_size : Tuple[int]
+    Raises:
-        Kernel size of each ResidualBlock.
+        ValueError: If the length of dilations_h does not equal n_layer.
    dilations_h : List[int]
        Dilation in height dimension of every ResidualBlock.
    Raises
    ------
    ValueError
        If the length of dilations_h does not equal n_layer.
    """
    def __init__(self,
@ -339,18 +287,13 @@ class ResidualNet(nn.LayerList):
    def forward(self, x, condition):
        """Compute the output given the input and the condition.
-        Parameters
+        Args:
-        -----------
+            x (Tensor): The input. shape=(batch_size, channel, height, width)
-        x : Tensor [shape=(batch_size, channel, height, width)]
+            condition (Tensor): The local condition. shape=(batch_size, condition_channel, height, width)
-            The input.
+            
-
+        Returns: 
-        condition : Tensor [shape=(batch_size, condition_channel, height, width)]
+            Tensor : The output, which is an aggregation of all the skip outputs. shape=(batch_size, channel, height, width)
-            The local condition.
+            
        Returns
        --------
        Tensor : [shape=(batch_size, channel, height, width)]
            The output, which is an aggregation of all the skip outputs.
        """
        skip_connections = []
        for layer in self:
@ -368,21 +311,14 @@ class ResidualNet(nn.LayerList):
    def add_input(self, x_row, condition_row):
        """Compute the output for a row and update the buffers.
-        Parameters
+        Args:
-        ----------
+            x_row (Tensor): A row of the input. shape=(batch_size, channel, 1, width)
-        x_row : Tensor [shape=(batch_size, channel, 1, width)]
+            condition_row (Tensor):  A row of the condition. shape=(batch_size, condition_channel, 1, width)
-            A row of the input.
+            
-
+        Returns:
-        condition_row : Tensor [shape=(batch_size, condition_channel, 1, width)]
+            res (Tensor): A row of the residual output. shape=(batch_size, channel, 1, width)
-            A row of the condition.
+            skip (Tensor): A row of the skip output. shape=(batch_size, channel, 1, width)
-
+                
        Returns
        -------
        res : Tensor [shape=(batch_size, channel, 1, width)]
            A row of the residual output.
        skip : Tensor [shape=(batch_size, channel, 1, width)]
            A row of the skip output.
        """
        skip_connections = []
        for layer in self:
@ -400,22 +336,12 @@ class Flow(nn.Layer):
    probability density estimation. The ``inverse`` method implements the
    sampling.
-    Parameters
+    Args:
-    ----------
+        n_layers (int): Number of ResidualBlocks in the Flow.
-    n_layers : int
+        channels (int): Feature size of the ResidualBlocks.
-        Number of ResidualBlocks in the Flow.
+        mel_bands (int): Feature size of the mel spectrogram (mel bands).
-
+        kernel_size (Tuple[int]): Kernel size of each ResidualBlock in the Flow.
-    channels : int
+        n_group (int): Number of timesteps to be folded into a group.
        Feature size of the ResidualBlocks.
    mel_bands : int
        Feature size of the mel spectrogram (mel bands).
    kernel_size : Tuple[int]
        Kernel size of each ResidualBlock in the Flow.
    n_group : int
        Number of timesteps to be folded into a group.
    """
    dilations_dict = {
        8: [1, 1, 1, 1, 1, 1, 1, 1],
@ -466,26 +392,16 @@ class Flow(nn.Layer):
        """Probability density estimation. It is done by inversely transforming
        a sample from p(X) into a sample from p(Z).
-        Parameters
+        Args:
-        -----------
+            x (Tensor): An input sample of the distribution p(X). shape=(batch, 1, height, width)
-        x : Tensor [shape=(batch, 1, height, width)]
+            condition (Tensor): The local condition. shape=(batch, condition_channel, height, width)
-            A input sample of the distribution p(X).
+            
-
+        Returns:
-        condition : Tensor [shape=(batch, condition_channel, height, width)]
+            z (Tensor): shape(batch, 1, height, width), the transformed sample.
-            The local condition.
+            Tuple[Tensor, Tensor]:
-
+                The parameter of the transformation.
-        Returns
+                logs (Tensor): shape(batch, 1, height - 1, width), the log scale of the transformation from x to z.
-        --------
+                b (Tensor): shape(batch, 1, height - 1, width), the shift of the transformation from x to z.
        z (Tensor): shape(batch, 1, height, width), the transformed sample.
        Tuple[Tensor, Tensor]
            The parameter of the transformation.
            logs (Tensor): shape(batch, 1, height - 1, width), the log scale
            of the transformation from x to z.
            b (Tensor): shape(batch, 1, height - 1, width), the shift of the
            transformation from x to z.
        """
        # (B, C, H-1, W)
        logs, b = self._predict_parameters(x[:, :, :-1, :],
@ -516,27 +432,12 @@ class Flow(nn.Layer):
        """Sampling from the distribution p(X). It is done by sampling from
        p(Z) and transforming the sample. It is an autoregressive transformation.
-        Parameters
+        Args:
-        -----------
+            z(Tensor): A sample of the distribution p(Z). shape=(batch, 1, height, width)
-        z : Tensor [shape=(batch, 1, height, width)]
+            condition(Tensor): The local condition. shape=(batch, condition_channel, height, width)
-            A sample of the distribution p(Z).
+        Returns:
-
+            Tensor:
-        condition : Tensor [shape=(batch, condition_channel, height, width)]
+                The transformed sample. shape=(batch, 1, height, width)
            The local condition.
        Returns
        ---------
        x : Tensor [shape=(batch, 1, height, width)]
            The transformed sample.
        Tuple[Tensor, Tensor]
            The parameter of the transformation.
            logs (Tensor): shape(batch, 1, height - 1, width), the log scale
            of the transformation from x to z.
            b (Tensor): shape(batch, 1, height - 1, width), the shift of the
            transformation from x to z.
        """
        z_0 = z[:, :, :1, :]
        x = paddle.zeros_like(z)
@ -560,25 +461,13 @@ class WaveFlow(nn.LayerList):
    """A deep reversible layer that is composed of several autoregressive
    flows.
-    Parameters
+    Args:
-    -----------
+        n_flows (int): Number of flows in the WaveFlow model.
-    n_flows : int
+        n_layers (int): Number of ResidualBlocks in each Flow.
-        Number of flows in the WaveFlow model.
+        n_group (int): Number of timesteps to fold as a group.
-
+        channels (int): Feature size of each ResidualBlock.
-    n_layers : int
+        mel_bands (int): Feature size of mel spectrogram (mel bands).
-        Number of ResidualBlocks in each Flow.
+        kernel_size (Union[int, List[int]]): Kernel size of the convolution layer in each ResidualBlock.
    n_group : int
        Number of timesteps to fold as a group.
    channels : int
        Feature size of each ResidualBlock.
    mel_bands : int
        Feature size of mel spectrogram (mel bands).
    kernel_size : Union[int, List[int]]
        Kernel size of the convolution layer in each ResidualBlock.
    """
    def __init__(self, n_flows, n_layers, n_group, channels, mel_bands,
@ -628,22 +517,13 @@ class WaveFlow(nn.LayerList):
        """Probability density estimation of random variable x given the
        condition.
-        Parameters
+        Args:
-        -----------
+            x (Tensor): The audio. shape=(batch_size, time_steps)
-        x : Tensor [shape=(batch_size, time_steps)]
+            condition (Tensor): The local condition (mel spectrogram here). shape=(batch_size, condition channel, time_steps)
-            The audio.
+                
-
+        Returns:
-        condition : Tensor [shape=(batch_size, condition channel, time_steps)]
+            Tensor: The transformed random variable. shape=(batch_size, time_steps)
-            The local condition (mel spectrogram here).
+            Tensor: The log determinant of the jacobian of the transformation from x to z. shape=(1,)
        Returns
        --------
        z : Tensor [shape=(batch_size, time_steps)]
            The transformed random variable.
        log_det_jacobian: Tensor [shape=(1,)]
            The log determinant of the jacobian of the transformation from x
            to z.
        """
        # x: (B, T)
        # condition: (B, C, T) upsampled condition
@ -678,18 +558,13 @@ class WaveFlow(nn.LayerList):
        Each Flow transforms :math:`z_{i-1}` to :math:`z_{i}` in an
        autoregressive manner.
-        Parameters
+        Args:
-        ----------
+            z (Tensor): A sample of the distribution p(Z). shape=(batch, 1, time_steps)
-        z : Tensor [shape=(batch, 1, time_steps]
+            condition (Tensor): The local condition. shape=(batch, condition_channel, time_steps)    
            A sample of the distribution p(Z).
        condition : Tensor [shape=(batch, condition_channel, time_steps)]
            The local condition.
-        Returns
+        Returns: 
-        --------
+            Tensor: The transformed sample (audio here). shape=(batch_size, time_steps)
-        x : Tensor [shape=(batch_size, time_steps)]
+            
            The transformed sample (audio here).
        """
        z, condition = self._trim(z, condition)
@ -714,29 +589,15 @@ class WaveFlow(nn.LayerList):
 class ConditionalWaveFlow(nn.LayerList):
    """ConditionalWaveFlow, an UpsampleNet with a WaveFlow model.
-    Parameters
+    Args:
-    ----------
+        upsample_factors (List[int]): Upsample factors for the upsample net.
-    upsample_factors : List[int]
+        n_flows (int): Number of flows in the WaveFlow model.
-        Upsample factors for the upsample net.
+        n_layers (int): Number of ResidualBlocks in each Flow.
-
+        n_group (int): Number of timesteps to fold as a group.
-    n_flows : int
+        channels (int): Feature size of each ResidualBlock.
-        Number of flows in the WaveFlow model.
+        n_mels (int): Feature size of mel spectrogram (mel bands).
-
+        kernel_size (Union[int, List[int]]): Kernel size of the convolution layer in each ResidualBlock.
-    n_layers : int
+        """
        Number of ResidualBlocks in each Flow.
    n_group : int
        Number of timesteps to fold as a group.
    channels : int
        Feature size of each ResidualBlock.
    n_mels : int
        Feature size of mel spectrogram (mel bands).
    kernel_size : Union[int, List[int]]
        Kernel size of the convolution layer in each ResidualBlock.
    """
    def __init__(self,
                 upsample_factors: List[int],
@ -760,22 +621,13 @@ class ConditionalWaveFlow(nn.LayerList):
        """Compute the transformed random variable z (x to z) and the log of
        the determinant of the jacobian of the transformation from x to z.
-        Parameters
+        Args:
-        ----------
+            audio(Tensor): The audio. shape=(B, T)
-        audio : Tensor [shape=(B, T)]
+            mel(Tensor): The mel spectrogram. shape=(B, C_mel, T_mel)
            The audio.
-        mel : Tensor [shape=(B, C_mel, T_mel)]
+        Returns:
-            The mel spectrogram.
+            Tensor: The inversely transformed random variable z (x to z). shape=(B, T)
-
+            Tensor: the log of the determinant of the jacobian of the transformation from x to z. shape=(1,)
        Returns
        -------
        z : Tensor [shape=(B, T)]
            The inversely transformed random variable z (x to z)
        log_det_jacobian: Tensor [shape=(1,)]
            the log of the determinant of the jacobian of the transformation
            from x to z.
        """
        condition = self.encoder(mel)
        z, log_det_jacobian = self.decoder(audio, condition)
@ -783,17 +635,13 @@ class ConditionalWaveFlow(nn.LayerList):
    @paddle.no_grad()
    def infer(self, mel):
-        r"""Generate raw audio given mel spectrogram.
+        """Generate raw audio given mel spectrogram.
-        Parameters
+        Args:
-        ----------
+            mel(Tensor): Mel spectrogram (in log-magnitude). shape=(B, C_mel, T_mel)
        mel : Tensor [shape=(B, C_mel, T_mel)]
            Mel spectrogram (in log-magnitude).
-        Returns
+        Returns:
-        -------
+            Tensor: The synthesized audio, where ``T <= T_mel * upsample_factors``. shape=(B, T)
        Tensor : [shape=(B, T)]
            The synthesized audio, where``T <= T_mel \* upsample_factors``.
        """
        start = time.time()
        condition = self.encoder(mel, trim_conv_artifact=True)  # (B, C, T)
@ -808,15 +656,11 @@ class ConditionalWaveFlow(nn.LayerList):
    def predict(self, mel):
        """Generate raw audio given mel spectrogram.
-        Parameters
+        Args:
-        ----------
+            mel(np.ndarray): Mel spectrogram of an utterance(in log-magnitude). shape=(C_mel, T_mel)
        mel : np.ndarray [shape=(C_mel, T_mel)]
            Mel spectrogram of an utterance(in log-magnitude).
-        Returns
+        Returns:
-        -------
+            np.ndarray: The synthesized audio. shape=(T,)
        np.ndarray [shape=(T,)]
            The synthesized audio.
        """
        mel = paddle.to_tensor(mel)
        mel = paddle.unsqueeze(mel, 0)
@ -828,18 +672,12 @@ class ConditionalWaveFlow(nn.LayerList):
    def from_pretrained(cls, config, checkpoint_path):
        """Build a ConditionalWaveFlow model from a pretrained model.
-        Parameters
+        Args:
-        ----------
+            config(yacs.config.CfgNode): model configs
-        config: yacs.config.CfgNode
+            checkpoint_path(Path or str): the path of pretrained model checkpoint, without extension name
            model configs
-        checkpoint_path: Path or str
+        Returns:
-            the path of pretrained model checkpoint, without extension name
+            ConditionalWaveFlow: The model built from the pretrained result.
        Returns
        -------
        ConditionalWaveFlow
            The model built from pretrained result.
        """
        model = cls(upsample_factors=config.model.upsample_factors,
                    n_flows=config.model.n_flows,
@ -855,11 +693,9 @@ class ConditionalWaveFlow(nn.LayerList):
 class WaveFlowLoss(nn.Layer):
    """Criterion of a WaveFlow model.
-    Parameters
+    Args:
-    ----------
+        sigma (float): The standard deviation of the gaussian noise used in WaveFlow, 
-    sigma : float
+            by default 1.0.
        The standard deviation of the gaussian noise used in WaveFlow, by
        default 1.0.
    """
    def __init__(self, sigma=1.0):
@ -871,19 +707,13 @@ class WaveFlowLoss(nn.Layer):
        """Compute the loss given the transformed random variable z and the
        log_det_jacobian of transformation from x to z.
-        Parameters
+        Args:
-        ----------
+            z(Tensor): The transformed random variable (x to z). shape=(B, T)
-        z : Tensor [shape=(B, T)]
+            log_det_jacobian(Tensor): The log of the determinant of the jacobian matrix of the
-            The transformed random variable (x to z).
+                transformation from x to z.  shape=(1,)
        log_det_jacobian : Tensor [shape=(1,)]
            The log of the determinant of the jacobian matrix of the
            transformation from x to z.
-        Returns
+        Returns:
-        -------
+            Tensor: The loss. shape=(1,)
        Tensor [shape=(1,)]
            The loss.
        """
        loss = paddle.sum(z * z) / (2 * self.sigma * self.sigma
                                    ) - log_det_jacobian
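The expression shown in this hunk can be sketched in plain Python as follows. This is an illustrative sketch only: `waveflow_nll` is a hypothetical helper name, `z` is taken as a flat list of floats rather than a paddle Tensor, and any normalization or constant term applied outside the lines shown in the diff is deliberately left out.

```python
def waveflow_nll(z, log_det_jacobian, sigma=1.0):
    """Quadratic Gaussian term minus the log-determinant, as in the hunk above.

    z: flat list of floats (hypothetical stand-in for the (B, T) Tensor).
    log_det_jacobian: float, the log-determinant of the x -> z transform.
    """
    quadratic = sum(v * v for v in z) / (2.0 * sigma * sigma)
    return quadratic - log_det_jacobian


# Usage: a larger log-determinant lowers the loss for the same z.
loss_a = waveflow_nll([1.0, 2.0], log_det_jacobian=0.5)
loss_b = waveflow_nll([1.0, 2.0], log_det_jacobian=1.5)
```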
@ -895,15 +725,12 @@ class ConditionalWaveFlow2Infer(ConditionalWaveFlow):
    def forward(self, mel):
        """Generate raw audio given mel spectrogram.
-        Parameters
+        Args:
-        ----------
+            mel (np.ndarray): Mel spectrogram of an utterance(in log-magnitude). shape=(C_mel, T_mel)
-        mel : np.ndarray [shape=(C_mel, T_mel)]
+            
-            Mel spectrogram of an utterance(in log-magnitude).
+        Returns:
-
+            np.ndarray: The synthesized audio. shape=(T,)
-        Returns
+            
        -------
        np.ndarray [shape=(T,)]
            The synthesized audio.
        """
        audio = self.predict(mel)
        return audio
--- a/paddlespeech/t2s/models/wavernn/wavernn.py
+++ b/paddlespeech/t2s/models/wavernn/wavernn.py
@ -67,14 +67,10 @@ class MelResNet(nn.Layer):
    def forward(self, x):
        '''
-        Parameters
+        Args:
-        ----------
+            x (Tensor): Input tensor (B, in_dims, T).
-        x : Tensor
+        Returns:
-            Input tensor (B, in_dims, T).
+            Tensor: Output tensor (B, res_out_dims, T).
        Returns
        ----------
        Tensor
            Output tensor (B, res_out_dims, T).
        '''
        x = self.conv_in(x)
@ -121,16 +117,11 @@ class UpsampleNetwork(nn.Layer):
    def forward(self, m):
        '''
-        Parameters
+        Args:
-        ----------
+            m (Tensor): Input tensor (B, C_aux, T).
-        c : Tensor
+        Returns:
-            Input tensor (B, C_aux, T).
+            Tensor: Output tensor (B, (T - 2 * pad) * prod(upsample_scales), C_aux).
-        Returns
+            Tensor: Output tensor (B, (T - 2 * pad) * prod(upsample_scales), res_out_dims).
        ----------
        Tensor
            Output tensor (B, (T - 2 * pad) * prod(upsample_scales), C_aux).
        Tensor
            Output tensor (B, (T - 2 * pad) * prod(upsample_scales), res_out_dims).
        '''
        # aux: [B, C_aux, T] 
        # -> [B, res_out_dims, T - 2 * aux_context_window]
@ -172,32 +163,20 @@ class WaveRNN(nn.Layer):
            mode='RAW',
            init_type: str="xavier_uniform", ):
        '''
-        Parameters
+        Args:
-        ----------
+            rnn_dims (int, optional): Hidden dims of RNN Layers.
-        rnn_dims : int, optional
+            fc_dims (int, optional): Dims of FC Layers.
-            Hidden dims of RNN Layers.
+            bits (int, optional): bit depth of signal.
-        fc_dims : int, optional
+            aux_context_window (int, optional): The context window size of the first convolution applied to the 
-             Dims of FC Layers.
+                auxiliary input, by default 2
-        bits : int, optional
+            upsample_scales (List[int], optional): Upsample scales of the upsample network.
-            bit depth of signal.
+            aux_channels (int, optional): Auxiliary channel of the residual blocks.
-        aux_context_window : int, optional
+            compute_dims (int, optional): Dims of Conv1D in MelResNet.
-            The context window size of the first convolution applied to the 
+            res_out_dims (int, optional): Dims of output in MelResNet.
-            auxiliary input, by default 2
+            res_blocks (int, optional): Number of residual blocks.
-        upsample_scales : List[int], optional
+            mode (str, optional): Output mode of the WaveRNN vocoder. 
-            Upsample scales of the upsample network.
+                `MOL` for Mixture of Logistic Distribution, and `RAW` for quantized bits as the model's output.
-        aux_channels : int, optional
+            init_type (str): How to initialize parameters.
            Auxiliary channel of the residual blocks.
        compute_dims : int, optional
            Dims of Conv1D in MelResNet.
        res_out_dims : int, optional
            Dims of output in MelResNet.
        res_blocks : int, optional
            Number of residual blocks.
        mode : str, optional
            Output mode of the WaveRNN vocoder. `MOL` for Mixture of Logistic Distribution,
            and `RAW` for quantized bits as the model's output.
        init_type : str
            How to initialize parameters.
        '''
        super().__init__()
        self.mode = mode
@ -245,18 +224,13 @@ class WaveRNN(nn.Layer):
    def forward(self, x, c):
        '''
-        Parameters
+        Args:
-        ----------
+            x (Tensor): wav sequence, [B, T]
-        x : Tensor
+            c (Tensor): mel spectrogram [B, C_aux, T']
-            wav sequence, [B, T]
+
-        c : Tensor
+            T = (T' - 2 * aux_context_window ) * hop_length
-            mel spectrogram [B, C_aux, T']
+        Returns:
-        
+            Tensor: [B, T, n_classes]
        T = (T' - 2 * aux_context_window ) * hop_length
        Returns
        ----------
        Tensor
            [B, T, n_classes]
        '''
        # Although we `_flatten_parameters()` on init, when using DataParallel
        # the model gets replicated, making it no longer guaranteed that the
@ -304,22 +278,14 @@ class WaveRNN(nn.Layer):
                 mu_law: bool=True,
                 gen_display: bool=False):
        """
-        Parameters
+        Args:
-        ----------
+            c(Tensor): input mels, (T', C_aux)
-        c : Tensor
+            batched(bool): generate in batch or not
-            input mels, (T', C_aux)
+            target(int): target number of samples to be generated in each batch entry
-        batched : bool
+            overlap(int): number of samples for crossfading between batches
-            generate in batch or not
+            mu_law(bool): use mu law or not
-        target : int
+        Returns: 
-            target number of samples to be generated in each batch entry
+            wav sequence: Output (T' * prod(upsample_scales), out_channels, C_out).
        overlap : int
            number of samples for crossfading between batches
        mu_law : bool
            use mu law or not
        Returns
        ----------
        wav sequence
            Output (T' * prod(upsample_scales), out_channels, C_out).
        """
        self.eval()
@ -434,16 +400,13 @@ class WaveRNN(nn.Layer):
    def pad_tensor(self, x, pad, side='both'):
        '''
-        Parameters
+        Args:
-        ----------
+            x(Tensor): mel, [1, n_frames, 80]
-        x : Tensor
+            pad(int): 
-            mel, [1, n_frames, 80]
+            side(str, optional): 'both', 'before' or 'after'. (Default value = 'both')
-        pad : int
+
-        side : str 
+        Returns:
-            'both', 'before' or 'after'
+            Tensor
        Returns
        ----------
        Tensor
        '''
        b, t, _ = paddle.shape(x)
        # for dygraph to static graph
@ -461,38 +424,29 @@ class WaveRNN(nn.Layer):
        Fold the tensor with overlap for quick batched inference.
        Overlap will be used for crossfading in xfade_and_unfold()
-        Parameters
+        Args:
-        ----------
+            x(Tensor): Upsampled conditioning features. mels or aux
-        x : Tensor
+                shape=(1, T, features)
-            Upsampled conditioning features. mels or aux
+                mels: [1, T, 80]
-            shape=(1, T, features)
+                aux: [1, T, 128]
-            mels: [1, T, 80]
+            target(int): Target timesteps for each index of batch
-            aux: [1, T, 128]
+            overlap(int): Timesteps for both xfade and rnn warmup
-        target : int
+
-            Target timesteps for each index of batch
+        Returns:
-        overlap : int
+            Tensor: 
-            Timesteps for both xfade and rnn warmup
+                shape=(num_folds, target + 2 * overlap, features)
-            overlap = hop_length * 2
+                num_folds = (time_seq - overlap) // (target + overlap)
-
+                mel: [num_folds, target + 2 * overlap, 80]
-        Returns
+                aux: [num_folds, target + 2 * overlap, 128]
-        ----------
+
-        Tensor 
+        Details:
-            shape=(num_folds, target + 2 * overlap, features)
+            x = [[h1, h2, ... hn]]
-            num_flods = (time_seq - overlap) // (target + overlap)
+            Where each h is a vector of conditioning features
-            mel: [num_folds, target + 2 * overlap, 80]
+            Eg: target=2, overlap=1 with x.size(1)=10
-            aux: [num_folds, target + 2 * overlap, 128]
+
-
+            folded = [[h1, h2, h3, h4],
-        Details
+                    [h4, h5, h6, h7],
-        ----------
+                    [h7, h8, h9, h10]]
        x = [[h1, h2, ... hn]]
        Where each h is a vector of conditioning features
        Eg: target=2, overlap=1 with x.size(1)=10
        folded = [[h1, h2, h3, h4],
                  [h4, h5, h6, h7],
                  [h7, h8, h9, h10]]
        '''
        _, total_len, features = paddle.shape(x)
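The folding described in the docstring above can be sketched in plain Python. This is a minimal sketch, not the paddle implementation: `fold_with_overlap_sketch` is a hypothetical name, the input is a plain list of per-timestep items instead of a `(1, T, features)` Tensor, and any padding the real method applies when the length does not divide evenly is omitted.

```python
def fold_with_overlap_sketch(seq, target, overlap):
    """Split a sequence into folds of length target + 2 * overlap.

    Consecutive folds share `overlap` timesteps, so the tail of one fold
    is the head of the next.
    """
    total_len = len(seq)
    num_folds = (total_len - overlap) // (target + overlap)
    folds = []
    for i in range(num_folds):
        start = i * (target + overlap)
        end = start + target + 2 * overlap
        folds.append(seq[start:end])
    return folds


# The docstring's example: target=2, overlap=1, ten timesteps h1..h10.
x = ["h%d" % n for n in range(1, 11)]
folded = fold_with_overlap_sketch(x, target=2, overlap=1)
```

Each fold starts `target + overlap` steps after the previous one, which is why `h4` and `h7` appear twice.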
@ -520,37 +474,33 @@ class WaveRNN(nn.Layer):
    def xfade_and_unfold(self, y, target: int=12000, overlap: int=600):
        ''' Applies a crossfade and unfolds into a 1d array.
-        Parameters
+        Args:
-        ----------
+            y (Tensor): 
-        y : Tensor
+                Batched sequences of audio samples
-            Batched sequences of audio samples
+                shape=(num_folds, target + 2 * overlap)
-            shape=(num_folds, target + 2 * overlap)
+                dtype=paddle.float32
-            dtype=paddle.float32
+            overlap (int): Timesteps for both xfade and rnn warmup
-        overlap : int
+
-            Timesteps for both xfade and rnn warmup
+        Returns:
-
+            Tensor
-        Returns
+                audio samples in a 1d array
-        ----------
+                shape=(total_len)
-        Tensor
+                dtype=paddle.float32
-            audio samples in a 1d array
+
-            shape=(total_len)
+        Details:
-            dtype=paddle.float32
+            y = [[seq1],
-
+                [seq2],
-        Details
+                [seq3]]
-        ----------
+
-        y = [[seq1],
+            Apply a gain envelope at both ends of the sequences
-            [seq2],
+
-            [seq3]]
+            y = [[seq1_in, seq1_target, seq1_out],
-
+                [seq2_in, seq2_target, seq2_out],
-        Apply a gain envelope at both ends of the sequences
+                [seq3_in, seq3_target, seq3_out]]
-
+
-        y = [[seq1_in, seq1_target, seq1_out],
+            Stagger and add up the groups of samples:
-            [seq2_in, seq2_target, seq2_out],
+
-            [seq3_in, seq3_target, seq3_out]]
+            [seq1_in, seq1_target, (seq1_out + seq2_in), seq2_target, ...]
        Stagger and add up the groups of samples:
        [seq1_in, seq1_target, (seq1_out + seq2_in), seq2_target, ...]
        '''
        # num_folds = (total_len - overlap) // (target + overlap)
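The stagger-and-add step described in the docstring above can be illustrated with a small pure-Python sketch. The linear gain envelope used here is an assumption chosen so that overlapping gains sum to one; the shape of the envelope in the actual method is not shown in this diff. `xfade_and_unfold_sketch` is a hypothetical name, and folds are plain lists of floats.

```python
def xfade_and_unfold_sketch(folds, overlap):
    """Crossfade overlapping folds and lay them out in one 1-D list.

    folds: list of equal-length lists, each of length target + 2 * overlap.
    overlap: number of samples shared by neighbouring folds (must be > 0).
    """
    num_folds = len(folds)
    length = len(folds[0])
    target = length - 2 * overlap
    total_len = num_folds * (target + overlap) + overlap

    # Assumed linear envelope: fade_in[n] + fade_out[n] == 1 for every n.
    fade_in = [(n + 0.5) / overlap for n in range(overlap)]
    fade_out = [1.0 - g for g in fade_in]

    out = [0.0] * total_len
    for i, fold in enumerate(folds):
        start = i * (target + overlap)  # stagger each fold
        for n, sample in enumerate(fold):
            gain = 1.0
            if n < overlap:
                gain = fade_in[n]
            elif n >= length - overlap:
                gain = fade_out[n - (length - overlap)]
            out[start + n] += gain * sample  # add up overlapping samples
    return out


# Three constant folds with target=2 and overlap=2.
audio = xfade_and_unfold_sketch([[1.0] * 6 for _ in range(3)], overlap=2)
```

With constant input, every sample away from the two outer edges comes back at full gain, which is a quick way to check that the fades of neighbouring folds are complementary.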
--- a/paddlespeech/t2s/modules/causal_conv.py
+++ b/paddlespeech/t2s/modules/causal_conv.py
@ -41,14 +41,10 @@ class CausalConv1D(nn.Layer):
    def forward(self, x):
        """Calculate forward propagation.
-        Parameters
+        Args:
-        ----------
+            x (Tensor): Input tensor (B, in_channels, T).
-        x : Tensor
+        Returns: 
-            Input tensor (B, in_channels, T).
+            Tensor: Output tensor (B, out_channels, T).
        Returns
        ----------
        Tensor
            Output tensor (B, out_channels, T).
        """
        return self.conv(self.pad(x))[:, :, :x.shape[2]]
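The pad-then-trim pattern in the return statement above can be sketched for a single channel in plain Python. This sketch assumes the padding layer adds `(kernel_size - 1) * dilation` zeros on both sides; that padding amount is not visible in this diff, so treat it as an assumption. `causal_conv1d_sketch` is a hypothetical name.

```python
def causal_conv1d_sketch(x, w, dilation=1):
    """Toy single-channel causal convolution over a list of floats.

    Pads symmetrically (assumed), applies a valid convolution, then keeps
    only the first len(x) outputs so that y[t] depends on x[:t + 1] only.
    """
    k = len(w)
    pad = (k - 1) * dilation
    xp = [0.0] * pad + list(x) + [0.0] * pad
    n_out = len(xp) - pad  # un-trimmed output length: T + pad
    y = []
    for t in range(n_out):
        y.append(sum(w[j] * xp[t + j * dilation] for j in range(k)))
    return y[:len(x)]  # mirrors the [:, :, :x.shape[2]] trim


# Changing a future sample must not change earlier outputs.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
w = [0.5, 0.25, 1.0]
y = causal_conv1d_sketch(x, w, dilation=2)
x_future = list(x)
x_future[4] = 100.0
y_future = causal_conv1d_sketch(x_future, w, dilation=2)
```

Trimming from the end rather than the start is what discards the outputs that would otherwise look ahead in time.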
@ -70,13 +66,9 @@ class CausalConv1DTranspose(nn.Layer):
    def forward(self, x):
        """Calculate forward propagation.
-        Parameters
+        Args:
-        ----------
+            x (Tensor): Input tensor (B, in_channels, T_in).
-        x : Tensor
+        Returns:
-            Input tensor (B, in_channels, T_in).
+            Tensor: Output tensor (B, out_channels, T_out).
        Returns
        ----------
        Tensor
            Output tensor (B, out_channels, T_out).
        """
        return self.deconv(x)[:, :, :-self.stride]
--- a/paddlespeech/t2s/modules/conformer/convolution.py
+++ b/paddlespeech/t2s/modules/conformer/convolution.py
@ -18,12 +18,10 @@ from paddle import nn
 class ConvolutionModule(nn.Layer):
    """ConvolutionModule in Conformer model.
-    Parameters
+
-    ----------
+    Args:
-    channels : int
+        channels (int): The number of channels of conv layers.
-        The number of channels of conv layers.
+        kernel_size (int): Kernel size of conv layers.
    kernel_size : int
        Kernel size of conv layers.
    """
    def __init__(self, channels, kernel_size, activation=nn.ReLU(), bias=True):
@ -59,14 +57,11 @@ class ConvolutionModule(nn.Layer):
    def forward(self, x):
        """Compute convolution module.
-        Parameters
+
-        ----------
+        Args:
-        x : paddle.Tensor
+            x (Tensor): Input tensor (#batch, time, channels).
-            Input tensor (#batch, time, channels).
+        Returns:
-        Returns
+            Tensor: Output tensor (#batch, time, channels).
        ----------
        paddle.Tensor
            Output tensor (#batch, time, channels).
        """
        # exchange the temporal dimension and the feature dimension
        x = x.transpose([0, 2, 1])
--- a/paddlespeech/t2s/modules/conformer/encoder_layer.py
+++ b/paddlespeech/t2s/modules/conformer/encoder_layer.py
@ -21,38 +21,29 @@ from paddlespeech.t2s.modules.layer_norm import LayerNorm
 class EncoderLayer(nn.Layer):
    """Encoder layer module.
-    Parameters
+    
-    ----------
+    Args:
-    size : int
+        size (int): Input dimension.
-        Input dimension.
+        self_attn (nn.Layer): Self-attention module instance.
-    self_attn : nn.Layer
+            `MultiHeadedAttention` or `RelPositionMultiHeadedAttention` instance
-        Self-attention module instance.
+            can be used as the argument.
-        `MultiHeadedAttention` or `RelPositionMultiHeadedAttention` instance
+        feed_forward (nn.Layer): Feed-forward module instance.
-        can be used as the argument.
+            `PositionwiseFeedForward`, `MultiLayeredConv1d`, or `Conv1dLinear` instance
-    feed_forward : nn.Layer
+            can be used as the argument.
-        Feed-forward module instance.
+        feed_forward_macaron (nn.Layer): Additional feed-forward module instance.
-        `PositionwiseFeedForward`, `MultiLayeredConv1d`, or `Conv1dLinear` instance
+            `PositionwiseFeedForward`, `MultiLayeredConv1d`, or `Conv1dLinear` instance
-        can be used as the argument.
+            can be used as the argument.
-    feed_forward_macaron : nn.Layer
+        conv_module (nn.Layer): Convolution module instance.
-        Additional feed-forward module instance.
+            `ConvolutionModule` instance can be used as the argument.
-        `PositionwiseFeedForward`, `MultiLayeredConv1d`, or `Conv1dLinear` instance
+        dropout_rate (float): Dropout rate.
-        can be used as the argument.
+        normalize_before (bool): Whether to use layer_norm before the first block.
-    conv_module : nn.Layer
+        concat_after (bool): Whether to concat attention layer's input and output.
-        Convolution module instance.
+            if True, additional linear will be applied.
-        `ConvolutionModule` instance can be used as the argument.
+            i.e. x -> x + linear(concat(x, att(x)))
-    dropout_rate : float
+            if False, no additional linear will be applied. i.e. x -> x + att(x)
-        Dropout rate.
+            stochastic_depth_rate (float): Probability to skip this layer.
-    normalize_before : bool
+            During training, the layer may skip residual computation and return input
-        Whether to use layer_norm before the first block.
+            as-is with given probability.
    concat_after : bool
        Whether to concat attention layer's input and output.
        if True, additional linear will be applied.
        i.e. x -> x + linear(concat(x, att(x)))
        if False, no additional linear will be applied. i.e. x -> x + att(x)
    stochastic_depth_rate : float
        Probability to skip this layer.
        During training, the layer may skip residual computation and return input
        as-is with given probability.
    """
    def __init__(
@ -93,22 +84,17 @@ class EncoderLayer(nn.Layer):
    def forward(self, x_input, mask, cache=None):
        """Compute encoded features.
-        Parameters
+
-        ----------
+        Args:
-        x_input : Union[Tuple, paddle.Tensor]
+            x_input(Union[Tuple, Tensor]): Input tensor w/ or w/o pos emb.
-            Input tensor w/ or w/o pos emb.
+                - w/ pos emb: Tuple of tensors [(#batch, time, size), (1, time, size)].
-            - w/ pos emb: Tuple of tensors [(#batch, time, size), (1, time, size)].
+                - w/o pos emb: Tensor (#batch, time, size).
-            - w/o pos emb: Tensor (#batch, time, size).
+            mask(Tensor): Mask tensor for the input (#batch, time).
-        mask : paddle.Tensor
+            cache (Tensor): 
-            Mask tensor for the input (#batch, time).
+
-        cache paddle.Tensor
+        Returns:
-            Cache tensor of the input (#batch, time - 1, size).
+            Tensor: Output tensor (#batch, time, size).
-        Returns
+            Tensor: Mask tensor (#batch, time).
        ----------
        paddle.Tensor
            Output tensor (#batch, time, size).
        paddle.Tensor
            Mask tensor (#batch, time).
        """
        if isinstance(x_input, tuple):
            x, pos_emb = x_input[0], x_input[1]
--- a/paddlespeech/t2s/modules/conv.py
+++ b/paddlespeech/t2s/modules/conv.py
@ -40,36 +40,29 @@ class Conv1dCell(nn.Conv1D):
    2. padding must be a causal padding (receptive_field - 1, 0).
    Thus, these arguments are removed from the ``__init__`` method of this
    class.
-    
+
-    Parameters
+    Args:
-    ----------
+        in_channels (int): The feature size of the input.
-    in_channels: int
+        out_channels (int): The feature size of the output.
-        The feature size of the input.
+        kernel_size (int or Tuple[int]): The size of the kernel.
-    out_channels: int
+        dilation (int or Tuple[int]): The dilation of the convolution, by default 1
-        The feature size of the output.
+        weight_attr (ParamAttr, Initializer, str or bool, optional) : The parameter attribute of the convolution kernel, 
-    kernel_size: int or Tuple[int]
+            by default None.
-        The size of the kernel.
+        bias_attr (ParamAttr, Initializer, str or bool, optional):The parameter attribute of the bias. 
-    dilation: int or Tuple[int]
+            If ``False``, this layer does not have a bias, by default None.
-        The dilation of the convolution, by default 1
+            
-    weight_attr: ParamAttr, Initializer, str or bool, optional
+    Examples: 
-        The parameter attribute of the convolution kernel, by default None.
+        >>> cell = Conv1dCell(3, 4, kernel_size=5)
-    bias_attr: ParamAttr, Initializer, str or bool, optional
+        >>> inputs = [paddle.randn([4, 3]) for _ in range(16)]
-        The parameter attribute of the bias. If ``False``, this layer does not
+        >>> outputs = []
-        have a bias, by default None.
+        >>> cell.eval()
-        
+        >>> cell.start_sequence()
-    Examples
+        >>> for xt in inputs:
-    --------
+        >>>     outputs.append(cell.add_input(xt))
-    >>> cell = Conv1dCell(3, 4, kernel_size=5)
+        >>> len(outputs)
-    >>> inputs = [paddle.randn([4, 3]) for _ in range(16)]
+        16
-    >>> outputs = []
+        >>> outputs[0].shape
-    >>> cell.eval()
+        [4, 4]
    >>> cell.start_sequence()
    >>> for xt in inputs:
    >>>     outputs.append(cell.add_input(xt))
    >>> len(outputs)
    16
    >>> outputs[0].shape
    [4, 4]
    """
    def __init__(self,
@ -103,15 +96,13 @@ class Conv1dCell(nn.Conv1D):
    def start_sequence(self):
        """Prepare the layer for a series of incremental forward.
-        Warnings
+        Warnings:
-        ---------
+            This method should be called before a sequence of calls to
-        This method should be called before a sequence of calls to
+            ``add_input``.
        ``add_input``.
-        Raises
+        Raises:
-        ------
+            Exception
-        Exception
+                If this method is called when the layer is in training mode.
            If this method is called when the layer is in training mode.
        """
        if self.training:
            raise Exception("only use start_sequence in evaluation")
@ -130,10 +121,9 @@ class Conv1dCell(nn.Conv1D):
    def initialize_buffer(self, x_t):
        """Initialize the buffer for the step input.
-        Parameters
+        Args:
-        ----------
+            x_t (Tensor): The step input. shape=(batch_size, in_channels)
-        x_t : Tensor [shape=(batch_size, in_channels)]
+            
            The step input.
        """
        batch_size, _ = x_t.shape
        self._buffer = paddle.zeros(
@ -143,26 +133,22 @@ class Conv1dCell(nn.Conv1D):
    def update_buffer(self, x_t):
        """Shift the buffer by one step.
-        Parameters
+        Args:
-        ----------
+            x_t (Tensor): The step input. shape=(batch_size, in_channels)
-        x_t : Tensor [shape=(batch_size, in_channels)]
+            
            The step input.
        """
        self._buffer = paddle.concat(
            [self._buffer[:, :, 1:], paddle.unsqueeze(x_t, -1)], -1)
    def add_input(self, x_t):
        """Add step input and compute step output.
-        
+
-        Parameters
+        Args:
-        -----------
+            x_t (Tensor): The step input. shape=(batch_size, in_channels)
-        x_t : Tensor [shape=(batch_size, in_channels)]
+          
-            The step input.
+        Returns: 
-            
+            y_t (Tensor): The step output. shape=(batch_size, out_channels)
-        Returns
+
        -------
        y_t :Tensor [shape=(batch_size, out_channels)]
            The step output.
        """
        batch_size = x_t.shape[0]
        if self.receptive_field > 1:
@ -186,33 +172,26 @@ class Conv1dCell(nn.Conv1D):
 class Conv1dBatchNorm(nn.Layer):
    """A Conv1D Layer followed by a BatchNorm1D.
-    Parameters
+    Args:
-    ----------
+        in_channels (int): The feature size of the input.
-    in_channels : int
+        out_channels (int): The feature size of the output.
-        The feature size of the input.
+        kernel_size (int): The size of the convolution kernel.
-    out_channels : int
+        stride (int, optional): The stride of the convolution, by default 1.
-        The feature size of the output.
+        padding (int, str or Tuple[int], optional):
-    kernel_size : int
+            The padding of the convolution.
-        The size of the convolution kernel.
+            If int, a symmetrical padding is applied before convolution;
-    stride : int, optional
+            If str, it should be "same" or "valid";
-        The stride of the convolution, by default 1.
+            If Tuple[int], its length should be 2, meaning
-    padding : int, str or Tuple[int], optional
+            ``(pad_before, pad_after)``, by default 0.
-        The padding of the convolution.
+        weight_attr (ParamAttr, Initializer, str or bool, optional):
-        If int, a symmetrical padding is applied before convolution;
+            The parameter attribute of the convolution kernel,
-        If str, it should be "same" or "valid";
+            by default None.
-        If Tuple[int], its length should be 2, meaning
+        bias_attr (ParamAttr, Initializer, str or bool, optional):
-        ``(pad_before, pad_after)``, by default 0.
+            The parameter attribute of the bias of the convolution,
-    weight_attr : ParamAttr, Initializer, str or bool, optional
+            by default None.
-        The parameter attribute of the convolution kernel, by default None.
+        data_format (str ["NCL" or "NLC"], optional): The data layout of the input, by default "NCL"
-    bias_attr : ParamAttr, Initializer, str or bool, optional
+        momentum (float, optional): The momentum of the BatchNorm1D layer, by default 0.9
-        The parameter attribute of the bias of the convolution, by default
+        epsilon (float, optional): The epsilon of the BatchNorm1D layer, by default 1e-05
        None.
    data_format : str ["NCL" or "NLC"], optional
        The data layout of the input, by default "NCL"
    momentum : float, optional
        The momentum of the BatchNorm1D layer, by default 0.9
    epsilon : [type], optional
        The epsilon of the BatchNorm1D layer, by default 1e-05
    """
    def __init__(self,
@ -244,16 +223,15 @@ class Conv1dBatchNorm(nn.Layer):
    def forward(self, x):
        """Forward pass of the Conv1dBatchNorm layer.
-
+        
-        Parameters
+        Args:
-        ----------
+            x (Tensor): The input tensor. Its data layout depends on ``data_format``. 
-        x : Tensor [shape=(B, C_in, T_in) or (B, T_in, C_in)]
+            shape=(B, C_in, T_in) or (B, T_in, C_in)
-            The input tensor. Its data layout depends on ``data_format``.
+    
-
+        Returns:
-        Returns
+            Tensor: The output tensor. 
-        -------
+                shape=(B, C_out, T_out) or (B, T_out, C_out)
-        Tensor [shape=(B, C_out, T_out) or (B, T_out, C_out)]
+                
            The output tensor. 
        """
        x = self.conv(x)
        x = self.bn(x)
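
The `Conv1dCell` docstrings in this file describe an incremental forward pass: `start_sequence` resets the state, `update_buffer` shifts a buffer by one step, and `add_input` returns one output frame per input frame. A minimal pure-Python sketch of that idea follows. It is not the PaddleSpeech implementation: the class `StepwiseCausalConv1d` and the function `causal_conv1d` are hypothetical names, and the sketch assumes a batch of one, dilation 1, and no bias. It only illustrates why stepping through a shifted buffer gives the same result as a causal convolution over the whole sequence.

```python
def causal_conv1d(weight, xs):
    """Reference: full-sequence causal convolution (dilation 1, no bias).

    weight[o][c][k] has shape (out_channels, in_channels, kernel_size);
    xs is a list of time steps, each a list of in_channels floats.
    """
    kernel_size = len(weight[0][0])
    in_channels = len(weight[0])
    # Left-pad with (kernel_size - 1) zero frames so step t only sees steps <= t.
    padded = [[0.0] * in_channels for _ in range(kernel_size - 1)] + xs
    return [[
        sum(w_o[c][k] * padded[t + k][c]
            for c in range(in_channels) for k in range(kernel_size))
        for w_o in weight
    ] for t in range(len(xs))]


class StepwiseCausalConv1d:
    """Hypothetical stand-in for the buffer logic described in the docstrings."""

    def __init__(self, weight):
        self.weight = weight
        self.in_channels = len(weight[0])
        self.kernel_size = len(weight[0][0])
        self._buffer = None

    def start_sequence(self):
        # Forget any history from a previous sequence.
        self._buffer = None

    def add_input(self, x_t):
        # Lazily create a zero buffer of shape (in_channels, kernel_size).
        if self._buffer is None:
            self._buffer = [[0.0] * self.kernel_size
                            for _ in range(self.in_channels)]
        # Shift the buffer left by one step and append the new frame.
        self._buffer = [row[1:] + [x] for row, x in zip(self._buffer, x_t)]
        # One output frame: dot product of each filter with the buffer.
        return [
            sum(w_o[c][k] * self._buffer[c][k]
                for c in range(self.in_channels)
                for k in range(self.kernel_size))
            for w_o in self.weight
        ]


# One output channel, two input channels, kernel size 2.
weight = [[[1.0, 2.0], [0.5, -1.0]]]
xs = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

cell = StepwiseCausalConv1d(weight)
cell.start_sequence()
stepwise = [cell.add_input(x_t) for x_t in xs]
full = causal_conv1d(weight, xs)
```

Here `stepwise` and `full` hold the same three frames, which is the property that lets an autoregressive decoder feed one frame at a time without recomputing the whole sequence.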