Merge branch 'develop' into webdataset

3 years ago · 05d41523ad
parent 92d1d08b9a 8817bf8636
commit 05d41523ad
139 changed files with 2992 additions and 4492 deletions
--- a/README.md
+++ b/README.md
@ -1,3 +1,4 @@
 ([简体中文](./README_cn.md)|English)
 <p align="center">
  <img src="./docs/images/PaddleSpeech_logo.png" />
@ -494,6 +495,14 @@ PaddleSpeech supports a series of most popular models. They are summarized in [r
      <a href = "./examples/aishell3/vc1">ge2e-fastspeech2-aishell3</a>
      </td>
    </tr>
     <tr>
      <td rowspan="3">End-to-End</td>
      <td>VITS</td>
      <td >CSMSC</td>
      <td>
      <a href = "./examples/csmsc/vits">VITS-csmsc</a>
      </td>
    </tr>
  </tbody>
 </table>
--- a/README_cn.md
+++ b/README_cn.md
@ -1,3 +1,4 @@
 (简体中文|[English](./README.md))
 <p align="center">
  <img src="./docs/images/PaddleSpeech_logo.png" />
@ -481,6 +482,15 @@ PaddleSpeech 的 **语音合成** 主要包含三个模块：文本前端、声
      <a href = "./examples/aishell3/vc1">ge2e-fastspeech2-aishell3</a>
      </td>
    </tr>
    </tr>
     <tr>
      <td rowspan="3">端到端</td>
      <td>VITS</td>
      <td >CSMSC</td>
      <td>
      <a href = "./examples/csmsc/vits">VITS-csmsc</a>
      </td>
    </tr>
  </tbody>
 </table>
--- a/dataset/aidatatang_200zh/README.md
+++ b/dataset/aidatatang_200zh/README.md
@ -1,4 +1,4 @@
-# [Aidatatang_200zh](http://www.openslr.org/62/)
+# [Aidatatang_200zh](http://openslr.elda.org/62/)
 Aidatatang_200zh is a free Chinese Mandarin speech corpus provided by Beijing DataTang Technology Co., Ltd under Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International Public License.
 The contents and the corresponding descriptions of the corpus include:
--- a/dataset/aishell/README.md
+++ b/dataset/aishell/README.md
@ -1,3 +1,3 @@
-# [Aishell1](http://www.openslr.org/33/)
+# [Aishell1](http://openslr.elda.org/33/)
 This Open Source Mandarin Speech Corpus, AISHELL-ASR0009-OS1, is 178 hours long. It is a part of AISHELL-ASR0009, of which utterance contains 11 domains, including smart home, autonomous driving, and industrial production. The whole recording was put in quiet indoor environment, using 3 different devices at the same time: high fidelity microphone (44.1kHz, 16-bit,); Android-system mobile phone (16kHz, 16-bit), iOS-system mobile phone (16kHz, 16-bit). Audios in high fidelity were re-sampled to 16kHz to build AISHELL- ASR0009-OS1. 400 speakers from different accent areas in China were invited to participate in the recording. The manual transcription accuracy rate is above 95%, through professional speech annotation and strict quality inspection. The corpus is divided into training, development and testing sets. ( This database is free for academic research, not in the commerce, if without permission. )
--- a/dataset/aishell/aishell.py
+++ b/dataset/aishell/aishell.py
@ -31,7 +31,7 @@ from utils.utility import unpack
 DATA_HOME = os.path.expanduser('~/.cache/paddle/dataset/speech')
-URL_ROOT = 'http://www.openslr.org/resources/33'
+URL_ROOT = 'http://openslr.elda.org/resources/33'
 # URL_ROOT = 'https://openslr.magicdatatech.com/resources/33'
 DATA_URL = URL_ROOT + '/data_aishell.tgz'
 MD5_DATA = '2f494334227864a8a8fec932999db9d8'
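Review note: each dataset script pairs its download URL with an MD5 checksum such as `MD5_DATA`, so a mirror change like this one is only safe if the archive served by the new host still matches. The repository's own `download` helper is not shown in this diff, so the following is only a minimal sketch of such a check using the standard library; `md5_of_file` and `verify_md5` are hypothetical names, not functions from `utils.utility`.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in fixed-size chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # iter() with a sentinel stops as soon as read() returns empty bytes.
        for block in iter(lambda: f.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def verify_md5(path, expected_md5):
    """Return True when the file at `path` matches the expected checksum."""
    return md5_of_file(path) == expected_md5.lower()
```

Reading in chunks keeps memory use flat for multi-gigabyte archives such as `data_aishell.tgz`.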
--- a/dataset/librispeech/librispeech.py
+++ b/dataset/librispeech/librispeech.py
@ -31,7 +31,7 @@ import soundfile
 from utils.utility import download
 from utils.utility import unpack
-URL_ROOT = "http://www.openslr.org/resources/12"
+URL_ROOT = "http://openslr.elda.org/resources/12"
 #URL_ROOT = "https://openslr.magicdatatech.com/resources/12"
 URL_TEST_CLEAN = URL_ROOT + "/test-clean.tar.gz"
 URL_TEST_OTHER = URL_ROOT + "/test-other.tar.gz"
--- a/dataset/magicdata/README.md
+++ b/dataset/magicdata/README.md
@ -1,4 +1,4 @@
-# [MagicData](http://www.openslr.org/68/)
+# [MagicData](http://openslr.elda.org/68/)
 MAGICDATA Mandarin Chinese Read Speech Corpus was developed by MAGIC DATA Technology Co., Ltd. and freely published for non-commercial use.
 The contents and the corresponding descriptions of the corpus include:
--- a/dataset/mini_librispeech/mini_librispeech.py
+++ b/dataset/mini_librispeech/mini_librispeech.py
@ -30,7 +30,7 @@ import soundfile
 from utils.utility import download
 from utils.utility import unpack
-URL_ROOT = "http://www.openslr.org/resources/31"
+URL_ROOT = "http://openslr.elda.org/resources/31"
 URL_TRAIN_CLEAN = URL_ROOT + "/train-clean-5.tar.gz"
 URL_DEV_CLEAN = URL_ROOT + "/dev-clean-2.tar.gz"
--- a/dataset/musan/musan.py
+++ b/dataset/musan/musan.py
@ -34,7 +34,7 @@ from utils.utility import unpack
 DATA_HOME = os.path.expanduser('~/.cache/paddle/dataset/speech')
-URL_ROOT = 'https://www.openslr.org/resources/17'
+URL_ROOT = 'https://openslr.elda.org/resources/17'
 DATA_URL = URL_ROOT + '/musan.tar.gz'
 MD5_DATA = '0c472d4fc0c5141eca47ad1ffeb2a7df'
--- a/dataset/primewords/README.md
+++ b/dataset/primewords/README.md
@ -1,4 +1,4 @@
-# [Primewords](http://www.openslr.org/47/)
+# [Primewords](http://openslr.elda.org/47/)
 This free Chinese Mandarin speech corpus set is released by Shanghai Primewords Information Technology Co., Ltd.
 The corpus is recorded by smart mobile phones from 296 native Chinese speakers. The transcription accuracy is larger than 98%, at the confidence level of 95%. It is free for academic use.
--- a/dataset/rir_noise/rir_noise.py
+++ b/dataset/rir_noise/rir_noise.py
@ -34,7 +34,7 @@ from utils.utility import unzip
 DATA_HOME = os.path.expanduser('~/.cache/paddle/dataset/speech')
-URL_ROOT = '--no-check-certificate http://www.openslr.org/resources/28'
+URL_ROOT = '--no-check-certificate https://us.openslr.org/resources/28'
 DATA_URL = URL_ROOT + '/rirs_noises.zip'
 MD5_DATA = 'e6f48e257286e05de56413b4779d8ffb'
--- a/dataset/st-cmds/README.md
+++ b/dataset/st-cmds/README.md
@ -1 +1 @@
-# [FreeST](http://www.openslr.org/38/)
+# [FreeST](http://openslr.elda.org/38/)
--- a/dataset/thchs30/README.md
+++ b/dataset/thchs30/README.md
@ -1,4 +1,4 @@
-# [THCHS30](http://www.openslr.org/18/)
+# [THCHS30](http://openslr.elda.org/18/)
 This is the *data part* of the `THCHS30 2015` acoustic data
 & scripts dataset.
--- a/dataset/thchs30/thchs30.py
+++ b/dataset/thchs30/thchs30.py
@ -32,7 +32,7 @@ from utils.utility import unpack
 DATA_HOME = os.path.expanduser('~/.cache/paddle/dataset/speech')
-URL_ROOT = 'http://www.openslr.org/resources/18'
+URL_ROOT = 'http://openslr.elda.org/resources/18'
 # URL_ROOT = 'https://openslr.magicdatatech.com/resources/18'
 DATA_URL = URL_ROOT + '/data_thchs30.tgz'
 TEST_NOISE_URL = URL_ROOT + '/test-noise.tgz'
--- a/demos/streaming_asr_server/web/app.py
+++ b/demos/streaming_asr_server/web/app.py
@ -1,23 +0,0 @@
 #!/usr/bin/env python3
 # -*- coding: utf-8 -*-
 # Copyright 2021 Mobvoi Inc. All Rights Reserved.
 # Author: zhendong.peng@mobvoi.com (Zhendong Peng)
 import argparse
 from flask import Flask
 from flask import render_template
 parser = argparse.ArgumentParser(description='training your network')
 parser.add_argument('--port', default=19999, type=int, help='port id')
 args = parser.parse_args()
 app = Flask(__name__)
@app.route('/')
 def index():
    return render_template('index.html')
 if __name__ == '__main__':
    app.run(host='0.0.0.0', port=args.port, debug=True)
--- a/demos/streaming_asr_server/web/favicon.ico
+++ b/demos/streaming_asr_server/web/favicon.ico
--- a/demos/streaming_asr_server/web/index.html
+++ b/demos/streaming_asr_server/web/index.html
--- a/demos/streaming_asr_server/web/paddle_web_demo.png
+++ b/demos/streaming_asr_server/web/paddle_web_demo.png
--- a/demos/streaming_asr_server/web/readme.md
+++ b/demos/streaming_asr_server/web/readme.md
@ -1,18 +1,20 @@
 # paddlespeech serving Web Demo
- Thanks to the [wenet](https://github.com/wenet-e2e/wenet) team for the front-end demo code.
+![image](./paddle_web_demo.png)
 step1: Start the streaming speech recognition server
-## Usage
+```
-### 1. Start the web service on the local machine
+# Start the streaming speech recognition service
-   ```
+cd PaddleSpeech/demos/streaming_asr_server
-   python app.py
+paddlespeech_server start --config_file conf/ws_conformer_wenetspeech_application_faster.yaml
 ```
-   ```
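+step2: Open `index.html` in the `web` directory with Google Chrome
-### 2. Browser on the local machine
+step3: Click `Connect` (`连接`) to verify that the WebSocket connection succeeds
 step4: Click to start recording (allow recording when the browser prompts)
 Enter 127.0.0.1:19999 in the browser to see the web demo.
 ![image](./paddle_web_demo.png)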
--- a/demos/streaming_asr_server/web/static/css/font-awesome.min.css
+++ b/demos/streaming_asr_server/web/static/css/font-awesome.min.css
--- a/demos/streaming_asr_server/web/static/css/style.css
+++ b/demos/streaming_asr_server/web/static/css/style.css
@ -1,453 +0,0 @@
 /*
 * @Author: baipengxia
 * @Date:   2021-03-12 11:44:28
 * @Last Modified by:   baipengxia
 * @Last Modified time: 2021-03-12 15:14:24
 */
 /** COMMON RESET **/
 * {
  -webkit-tap-highlight-color: rgba(0, 0, 0, 0);
 }
 body,
 h1,
 h2,
 h3,
 h4,
 h5,
 h6,
 hr,
 p,
 dl,
 dt,
 dd,
 ul,
 ol,
 li,
 fieldset,
 legend,
 button,
 input,
 textarea,
 th,
 td {
  margin: 0;
  padding: 0;
  color: #000;
 }
 body {
  font-size: 14px;
 }
 html, body {
  min-width: 1200px;
 }
 button,
 input,
 select,
 textarea {
  font-size: 14px;
 }
 h1 {
  font-size: 18px;
 }
 h2 {
  font-size: 14px;
 }
 h3 {
  font-size: 14px;
 }
 ul,
 ol,
 li {
  list-style: none;
 }
 a {
  text-decoration: none;
 }
 a:hover {
  text-decoration: none;
 }
 fieldset,
 img {
  border: none;
 }
 table {
  border-collapse: collapse;
  border-spacing: 0;
 }
 i {
  font-style: normal;
 }
 label {
  position: inherit;
 }
 .clearfix:after {
  content: ".";
  display: block;
  height: 0;
  clear: both;
  visibility: hidden;
 }
 .clearfix {
  zoom: 1;
  display: block;
 }
 html,
 body {
   font-family: Tahoma, Arial, 'microsoft yahei', 'Roboto', 'Droid Sans', 'Helvetica Neue', 'Droid Sans Fallback', 'Heiti SC', 'Hiragino Sans GB', 'Simsun', sans-serif;
 }
 .audio-banner {
  width: 100%;
  overflow: auto;
  padding: 0;
  background: url('../image/voice-dictation.svg');
  background-size: cover;
 }
 .weaper {
  width: 1200px;
  height: 155px;
  margin: 72px auto;
 }
 .text-content {
  width: 670px;
  height: 100%;
  float: left;
 }
 .text-content .title {
  font-size: 34px;
  font-family: 'PingFangSC-Medium';
  font-weight: 500;
  color: rgba(255, 255, 255, 1);
  line-height: 48px;
 }
 .text-content .con {
  font-size: 16px;
  font-family: PingFangSC-Light;
  font-weight: 300;
  color: rgba(255, 255, 255, 1);
  line-height: 30px;
 }
 .img-con {
  width: 416px;
  height: 100%;
  float: right;
 }
 .img-con img {
  width: 100%;
  height: 100%;
 }
 .con-container {
  margin-top: 34px;
 }
 .audio-advantage {
  background: #f8f9fa;
 }
 .asr-advantage {
  width: 1200px;
  margin: 0 auto;
 }
 .asr-advantage h2 {
  text-align: center;
  font-size: 22px;
  padding: 30px 0 0 0;
 }
 .asr-advantage > ul > li {
  box-sizing: border-box;
  padding: 0 16px;
  width: 33%;
  text-align: center;
  margin-bottom: 35px;
 }
 .asr-advantage > ul > li .icons{
  margin-top: 10px;
  margin-bottom: 20px;
  width: 42px;
  height: 42px;
 }
 .service-item-content {
  margin-top: 35px;
  display: flex;
  justify-content: center;
  flex-wrap: wrap;
 }
 .service-item-content img {
  width: 160px;
  vertical-align: bottom;
 }
 .service-item-content > li {
    box-sizing: border-box;
    padding: 0 16px;
    width: 33%;
    text-align: center;
    margin-bottom: 35px;
 }
 .service-item-content > li .service-item-content-title {
  line-height: 1.5;
  font-weight: 700;
  margin-top: 10px;
 }
 .service-item-content > li .service-item-content-desc {
  margin-top: 5px;
  line-height: 1.8;
  color: #657384;
 }
 .audio-scene-con {
  width: 100%;
  padding-bottom: 84px;
  background: #fff;
 }
 .audio-scene {
  overflow: auto;
  width: 1200px;
  background: #fff;
  text-align: center;
  padding: 0;
  margin: 0 auto;
 }
 .audio-scene h2 {
  padding: 30px 0 0 0;
  font-size: 22px;
  text-align: center;
 }
 .audio-experience {
  width: 100%;
  height: 538px;
  background: #fff;
  padding: 0;
  margin: 0;
  overflow: auto;
 }
 .asr-box {
  width: 1200px;
  height: 394px;
  margin: 64px auto;
 }
 .asr-box h2 {
  font-size: 22px;
  text-align: center;
  margin-bottom: 64px;
 }
 .voice-container {
  position: relative;
  width: 1200px;
  height: 308px;
  background: rgba(255, 255, 255, 1);
  border-radius: 8px;
  border: 1px solid rgba(225, 225, 225, 1);
 }
 .voice-container .voice {
  height: 236px;
  width: 100%;
  border-radius: 8px;
 }
 .voice-container .voice textarea {
  height: 100%;
  width: 100%;
  border: none;
  outline: none;
  border-radius: 8px;
  padding: 25px;
  font-size: 14px;
  box-sizing: border-box;
  resize: none;
 }
 .voice-input {
  width: 100%;
  height: 72px;
  box-sizing: border-box;
  padding-left: 35px;
  background: rgba(242, 244, 245, 1);
  border-radius: 8px;
  line-height: 72px;
 }
 .voice-input .el-select {
  width: 492px;
 }
 .start-voice {
  display: inline-block;
  margin-left: 10px;
 }
 .start-voice .time {
  margin-right: 25px;
 }
 .asr-advantage > ul > li {
  margin-bottom: 77px;
 }
 #msg {
  width: 100%;
  line-height: 40px;
  font-size: 14px;
  margin-left: 330px;
 }
 #captcha {
  margin-left: 350px !important;
  display: inline-block;
  position: relative;
 }
 .black {
  position: fixed;
  width: 100%;
  height: 100%;
  z-index: 5;
  background: rgba(0, 0, 0, 0.5);
  top: 0;
  left: 0;
 }
 .container {
  position: fixed;
  z-index: 6;
  top: 25%;
  left: 10%;
 }
 .audio-scene-con {
  width: 100%;
  padding-bottom: 84px;
  background: #fff;
 }
 #sound {
  color: #fff;
  cursor: pointer;
  background: #147ede;
  padding: 10px;
  margin-top: 30px;
  margin-left: 135px;
  width: 176px;
  height: 30px !important;
  text-align: center;
  line-height: 30px !important;
  border-radius: 10px;
 }
 .con-ten {
  position: absolute;
  width: 100%;
  height: 100%;
  z-index: 5;
  background: #fff;
  opacity: 0.5;
  top: 0;
  left: 0;
 }
 .websocket-url {
  width: 320px;
  height: 20px;
  border: 1px solid #dcdfe6;
  line-height: 20px;
  padding: 10px;
  border-radius: 4px;
 }
 .voice-btn {
  color: #fff;
  background-color: #409eff;
  font-weight: 500;
  padding: 12px 20px;
  font-size: 14px;
  border-radius: 4px;
  border: 0;
  cursor: pointer;
 }
 .voice-btn.end {
  display: none;
 }
 .result-text {
  background: #fff;
  padding: 20px;
 }
 .voice-footer {
  border-top: 1px solid #dddede;
  background: #f7f9fa;
  text-align: center;
  margin-bottom: 8px;
  color: #333;
  font-size: 12px;
  padding: 20px 0;
 }
 /** line animate **/
 .time-box {
  display: none;
  margin-left: 10px;
  width: 300px;
 }
 .total-time {
  font-size: 14px;
  color: #545454;
 }
 .voice-btn.end.show,
 .time-box.show {
  display: inline;
 }
 .start-taste-line {
  margin-right: 20px;
  display: inline-block;
 }
 .start-taste-line hr {
  background-color: #187cff;
  width: 3px;
  height: 8px;
  margin: 0 3px;
  display: inline-block;
  border: none;
 }
 .hr {
  animation: note 0.2s ease-in-out;
  animation-iteration-count: infinite;
  animation-direction: alternate;
 }
 .hr-one {
  animation-delay: -0.9s;
 }
 .hr-two {
  animation-delay: -0.8s;
 }
 .hr-three {
  animation-delay: -0.7s;
 }
 .hr-four {
  animation-delay: -0.6s;
 }
 .hr-five {
  animation-delay: -0.5s;
 }
 .hr-six {
  animation-delay: -0.4s;
 }
 .hr-seven {
  animation-delay: -0.3s;
 }
 .hr-eight {
  animation-delay: -0.2s;
 }
 .hr-nine {
  animation-delay: -0.1s;
 }
@keyframes note {
  from {
    transform: scaleY(1);
  }
  to {
    transform: scaleY(4);
  }
 }
--- a/demos/streaming_asr_server/web/static/fonts/FontAwesome.otf
+++ b/demos/streaming_asr_server/web/static/fonts/FontAwesome.otf
--- a/demos/streaming_asr_server/web/static/fonts/fontawesome-webfont.eot
+++ b/demos/streaming_asr_server/web/static/fonts/fontawesome-webfont.eot
--- a/demos/streaming_asr_server/web/static/fonts/fontawesome-webfont.svg
+++ b/demos/streaming_asr_server/web/static/fonts/fontawesome-webfont.svg
--- a/demos/streaming_asr_server/web/static/fonts/fontawesome-webfont.ttf
+++ b/demos/streaming_asr_server/web/static/fonts/fontawesome-webfont.ttf
--- a/demos/streaming_asr_server/web/static/fonts/fontawesome-webfont.woff
+++ b/demos/streaming_asr_server/web/static/fonts/fontawesome-webfont.woff
--- a/demos/streaming_asr_server/web/static/fonts/fontawesome-webfont.woff2
+++ b/demos/streaming_asr_server/web/static/fonts/fontawesome-webfont.woff2
--- a/demos/streaming_asr_server/web/static/image/PaddleSpeech_logo.png
+++ b/demos/streaming_asr_server/web/static/image/PaddleSpeech_logo.png
--- a/demos/streaming_asr_server/web/static/image/voice-dictation.svg
+++ b/demos/streaming_asr_server/web/static/image/voice-dictation.svg
--- a/demos/streaming_asr_server/web/static/js/SoundRecognizer.js
+++ b/demos/streaming_asr_server/web/static/js/SoundRecognizer.js
@ -1,133 +0,0 @@
 SoundRecognizer = {
    rec: null,
    wave: null,
    SampleRate: 16000,
    testBitRate: 16,
    isCloseRecorder: false,
    SendInterval: 300,
    realTimeSendTryType: 'pcm',
    realTimeSendTryEncBusy: 0,
    realTimeSendTryTime: 0,
    realTimeSendTryNumber: 0,
    transferUploadNumberMax: 0,
    realTimeSendTryChunk: null,
    soundType: "pcm",
    init: function (config) {
        this.soundType = config.soundType || 'pcm';
        this.SampleRate = config.sampleRate || 16000;
        this.recwaveElm = config.recwaveElm || '';
        this.TransferUpload = config.translerCallBack || this.TransferProcess;
        this.initRecorder();
    },
    RealTimeSendTryReset: function (type) {
        this.realTimeSendTryType = type;
        this.realTimeSendTryTime = 0;
    },
    RealTimeSendTry: function (rec, isClose) {
        var that = this;
        var t1 = Date.now(), endT = 0, recImpl = Recorder.prototype;
        if (this.realTimeSendTryTime == 0) {
            this.realTimeSendTryTime = t1;
            this.realTimeSendTryEncBusy = 0;
            this.realTimeSendTryNumber = 0;
            this.transferUploadNumberMax = 0;
            this.realTimeSendTryChunk = null;
        }
        if (!isClose && t1 - this.realTimeSendTryTime < this.SendInterval) {
            return; // wait until the buffered audio spans the send interval before transmitting
        }
        this.realTimeSendTryTime = t1;
        var number = ++this.realTimeSendTryNumber;
        // Reuse SampleData to process the data continuously; sample-rate conversion comes along with it
        var chunk = Recorder.SampleData(rec.buffers, rec.srcSampleRate, this.SampleRate, this.realTimeSendTryChunk, { frameType: isClose ? "" : this.realTimeSendTryType });
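        // Clear buffers that have already been processed, freeing memory for long recordings; stop cannot be called when recording finishes because the data has already been cleared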
        for (var i = this.realTimeSendTryChunk ? this.realTimeSendTryChunk.index : 0; i < chunk.index; i++) {
            rec.buffers[i] = null;
        }
        this.realTimeSendTryChunk = chunk;
        // No new data, or too little data at the end, so mock transcoding is not possible
        if (chunk.data.length == 0 || isClose && chunk.data.length < 2000) {
            this.TransferUpload(number, null, 0, null, isClose);
            return;
        }
        // Handle a blocked real-time encoding queue
        if (!isClose) {
            if (this.realTimeSendTryEncBusy >= 2) {
                console.log("Encoding queue is blocked; dropped one frame", 1);
                return;
            }
        }
        this.realTimeSendTryEncBusy++;
        // Transcode in real time to mp3 or wav via the mock method
        var encStartTime = Date.now();
        var recMock = Recorder({
            type: this.realTimeSendTryType
            , sampleRate: this.SampleRate // sample rate
            , bitRate: this.testBitRate // bit rate
        });
        recMock.mock(chunk.data, chunk.sampleRate);
        recMock.stop(function (blob, duration) {
            that.realTimeSendTryEncBusy && (that.realTimeSendTryEncBusy--);
            blob.encTime = Date.now() - encStartTime;
            // Once transcoded, hand the blob over for transmission
            that.TransferUpload(number, blob, duration, recMock, isClose);
        }, function (msg) {
            that.realTimeSendTryEncBusy && (that.realTimeSendTryEncBusy--);
            // Transcoding error? It is unclear when this could happen.
            console.log("Unexpected error: " + msg, 1);
        });
    },
    recordClose: function () {
        // Keep a reference: the callbacks below are not invoked with SoundRecognizer as `this`.
        var that = this;
        try {
            this.rec.close(function () {
                that.isCloseRecorder = true;
            });
            this.RealTimeSendTry(this.rec, true); // final send
        } catch (ex) {
            // recordClose();
        }
    },
    recordEnd: function () {
        var that = this;
        try {
            this.rec.stop(function (blob, time) {
                that.recordClose();
            }, function (s) {
                that.recordClose();
            });
        } catch (ex) {
        }
    },
    initRecorder: function () {
        var that = this;
        var rec = Recorder({
            type: that.soundType
            , bitRate: that.testBitRate
            , sampleRate: that.SampleRate
            , onProcess: function (buffers, level, time, sampleRate) {
                that.wave.input(buffers[buffers.length - 1], level, sampleRate);
                that.RealTimeSendTry(rec, false); // Push into real-time processing. Because the format is unknown, the call is simplified: buffers and bufferSampleRate are not used, since that data is identical to rec.buffers.
            }
        });
        rec.open(function () {
            that.wave = Recorder.FrequencyHistogramView({
                elem: that.recwaveElm, lineCount: 90
                , position: 0
                , minHeight: 1
                , stripeEnable: false
            });
            rec.start();
            that.isCloseRecorder = false;
            that.RealTimeSendTryReset(that.soundType); // reset
        });
        this.rec = rec;
    },
    TransferProcess: function (number, blobOrNull, duration, blobRec, isClose) {
    }
 }
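Review note: the removed recognizer buffers microphone audio and forwards it roughly every `SendInterval` (300 ms) as 16 kHz, 16-bit PCM. The arithmetic behind that chunk size can be sketched as below. This is an illustration, not a port: the JavaScript is time-driven rather than fixed-size, mono audio is an assumption, and `bytes_per_interval` and `split_pcm` are hypothetical names.

```python
SAMPLE_RATE = 16000      # Hz, mirrors SoundRecognizer.SampleRate
BITS_PER_SAMPLE = 16     # mirrors SoundRecognizer.testBitRate
SEND_INTERVAL_MS = 300   # mirrors SoundRecognizer.SendInterval


def bytes_per_interval(sample_rate=SAMPLE_RATE, bits=BITS_PER_SAMPLE,
                       interval_ms=SEND_INTERVAL_MS):
    """Size in bytes of one mono PCM chunk covering `interval_ms`."""
    samples = sample_rate * interval_ms // 1000
    return samples * (bits // 8)


def split_pcm(pcm, chunk_bytes=None):
    """Yield consecutive chunks of raw PCM bytes; the last one may be shorter."""
    if chunk_bytes is None:
        chunk_bytes = bytes_per_interval()
    for start in range(0, len(pcm), chunk_bytes):
        yield pcm[start:start + chunk_bytes]
```

With the defaults, one interval is 4800 samples, or 9600 bytes, which is the order of magnitude a client replacing this script would send per WebSocket message.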
--- a/demos/streaming_asr_server/web/static/js/jquery-3.2.1.min.js
+++ b/demos/streaming_asr_server/web/static/js/jquery-3.2.1.min.js
--- a/demos/streaming_asr_server/web/static/js/recorder/engine/mp3.js
+++ b/demos/streaming_asr_server/web/static/js/recorder/engine/mp3.js
--- a/demos/streaming_asr_server/web/static/js/recorder/engine/pcm.js
+++ b/demos/streaming_asr_server/web/static/js/recorder/engine/pcm.js
@ -1,6 +0,0 @@
 /*
 录音
 https://github.com/xiangyuecn/Recorder
 src: engine/pcm.js
 */
 !function(){"use strict";Recorder.prototype.enc_pcm={stable:!0,testmsg:"pcm为未封装的原始音频数据，pcm数据文件无法直接播放；支持位数8位、16位（填在比特率里面），采样率取值无限制"},Recorder.prototype.pcm=function(e,t,r){var a=this.set,n=e.length,o=8==a.bitRate?8:16,c=new ArrayBuffer(n*(o/8)),s=new DataView(c),l=0;if(8==o)for(var p=0;p<n;p++,l++){var i=128+(e[p]>>8);s.setInt8(l,i,!0)}else for(p=0;p<n;p++,l+=2)s.setInt16(l,e[p],!0);t(new Blob([s.buffer],{type:"audio/pcm"}))},Recorder.pcm2wav=function(e,a,n){e.slice&&null!=e.type&&(e={blob:e});var o=e.sampleRate||16e3,c=e.bitRate||16;if(e.sampleRate&&e.bitRate||console.warn("pcm2wav必须提供sampleRate和bitRate"),Recorder.prototype.wav){var s=new FileReader;s.onloadend=function(){var e;if(8==c){var t=new Uint8Array(s.result);e=new Int16Array(t.length);for(var r=0;r<t.length;r++)e[r]=t[r]-128<<8}else e=new Int16Array(s.result);Recorder({type:"wav",sampleRate:o,bitRate:c}).mock(e,o).stop(function(e,t){a(e,t)},n)},s.readAsArrayBuffer(e.blob)}else n("pcm2wav必须先加载wav编码器wav.js")}}();
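Review note: the minified encoder above narrows 16-bit samples to 8 bits with `128 + (sample >> 8)`, and `pcm2wav` widens them again with `(value - 128) << 8`. A small sketch of that mapping, with hypothetical function names, makes the lossy round trip easier to see:

```python
def to_uint8(sample16):
    """Map a signed 16-bit sample (-32768..32767) to an unsigned 8-bit value (0..255)."""
    # The arithmetic right shift keeps the sign, so the result is centred on 128.
    return 128 + (sample16 >> 8)


def to_int16(sample8):
    """Map an unsigned 8-bit value back to a signed 16-bit sample (low byte is lost)."""
    return (sample8 - 128) << 8
```

The low byte of every sample is discarded on the way down, so the round trip only recovers the sample rounded down to a multiple of 256.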
--- a/demos/streaming_asr_server/web/static/js/recorder/engine/wav.js
+++ b/demos/streaming_asr_server/web/static/js/recorder/engine/wav.js
@ -1,6 +0,0 @@
 /*
 录音
 https://github.com/xiangyuecn/Recorder
 src: engine/wav.js
 */
 !function(){"use strict";Recorder.prototype.enc_wav={stable:!0,testmsg:"支持位数8位、16位（填在比特率里面），采样率取值无限制"},Recorder.prototype.wav=function(t,e,n){var r=this.set,a=t.length,o=r.sampleRate,f=8==r.bitRate?8:16,i=a*(f/8),s=new ArrayBuffer(44+i),c=new DataView(s),u=0,v=function(t){for(var e=0;e<t.length;e++,u++)c.setUint8(u,t.charCodeAt(e))},w=function(t){c.setUint16(u,t,!0),u+=2},l=function(t){c.setUint32(u,t,!0),u+=4};if(v("RIFF"),l(36+i),v("WAVE"),v("fmt "),l(16),w(1),w(1),l(o),l(o*(f/8)),w(f/8),w(f),v("data"),l(i),8==f)for(var p=0;p<a;p++,u++){var d=128+(t[p]>>8);c.setInt8(u,d,!0)}else for(p=0;p<a;p++,u+=2)c.setInt16(u,t[p],!0);e(new Blob([c.buffer],{type:"audio/wav"}))}}();
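Review note: the minified WAV encoder above writes a 44-byte RIFF header (`RIFF`, `WAVE`, `fmt `, `data`) ahead of the PCM samples. For readers who do not want to unpick the minified source, here is a minimal sketch of the same header layout using the standard `struct` module. The `wav_header` name and the `channels` parameter are assumptions added for illustration; the JavaScript hard-codes a single channel.

```python
import struct


def wav_header(data_size, sample_rate=16000, bits=16, channels=1):
    """Build the 44-byte RIFF/WAVE header for uncompressed PCM data."""
    block_align = channels * bits // 8        # bytes per sample frame
    byte_rate = sample_rate * block_align     # bytes per second
    return struct.pack(
        "<4sI4s4sIHHIIHH4sI",                 # little-endian, no padding
        b"RIFF", 36 + data_size, b"WAVE",
        b"fmt ", 16, 1, channels,             # 16-byte fmt chunk, format 1 = PCM
        sample_rate, byte_rate, block_align, bits,
        b"data", data_size,
    )
```

Prepending `wav_header(len(pcm))` to raw PCM bytes yields a file that standard tools, including Python's `wave` module, can open.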
--- a/demos/streaming_asr_server/web/static/js/recorder/extensions/frequency.histogram.view.js
+++ b/demos/streaming_asr_server/web/static/js/recorder/extensions/frequency.histogram.view.js
@ -1,6 +0,0 @@
 /*
 录音
 https://github.com/xiangyuecn/Recorder
 src: extensions/frequency.histogram.view.js
 */
 !function(){"use strict";var t=function(t){return new e(t)},e=function(t){var e=this,r={scale:2,fps:20,lineCount:30,widthRatio:.6,spaceWidth:0,minHeight:0,position:-1,mirrorEnable:!1,stripeEnable:!0,stripeHeight:3,stripeMargin:6,fallDuration:1e3,stripeFallDuration:3500,linear:[0,"rgba(0,187,17,1)",.5,"rgba(255,215,0,1)",1,"rgba(255,102,0,1)"],stripeLinear:null,shadowBlur:0,shadowColor:"#bbb",stripeShadowBlur:-1,stripeShadowColor:"",onDraw:function(t,e){}};for(var a in t)r[a]=t[a];e.set=t=r;var i=t.elem;i&&("string"==typeof i?i=document.querySelector(i):i.length&&(i=i[0])),i&&(t.width=i.offsetWidth,t.height=i.offsetHeight);var o=t.scale,l=t.width*o,n=t.height*o,h=e.elem=document.createElement("div"),s=["","transform-origin:0 0;","transform:scale("+1/o+");"];h.innerHTML='<div style="width:'+t.width+"px;height:"+t.height+'px;overflow:hidden"><div style="width:'+l+"px;height:"+n+"px;"+s.join("-webkit-")+s.join("-ms-")+s.join("-moz-")+s.join("")+'"><canvas/></div></div>';var f=e.canvas=h.querySelector("canvas");e.ctx=f.getContext("2d");if(f.width=l,f.height=n,i&&(i.innerHTML="",i.appendChild(h)),!Recorder.LibFFT)throw new Error("需要lib.fft.js支持");e.fft=Recorder.LibFFT(1024),e.lastH=[],e.stripesH=[]};e.prototype=t.prototype={genLinear:function(t,e,r,a){for(var i=t.createLinearGradient(0,r,0,a),o=0;o<e.length;)i.addColorStop(e[o++],e[o++]);return i},input:function(t,e,r){var a=this;a.sampleRate=r,a.pcmData=t,a.pcmPos=0,a.inputTime=Date.now(),a.schedule()},schedule:function(){var t=this,e=t.set,r=Math.floor(1e3/e.fps);t.timer||(t.timer=setInterval(function(){t.schedule()},r));var a=Date.now(),i=t.drawTime||0;if(a-t.inputTime>1.3*e.stripeFallDuration)return clearInterval(t.timer),void(t.timer=0);if(!(a-i<r)){t.drawTime=a;for(var o=t.fft.bufferSize,l=t.pcmData,n=t.pcmPos,h=new Int16Array(o),s=0;s<o&&n<l.length;s++,n++)h[s]=l[n];t.pcmPos=n;var f=t.fft.transform(h);t.draw(f,t.sampleRate)}},draw:function(t,e){var 
r=this,a=r.set,i=r.ctx,o=a.scale,l=a.width*o,n=a.height*o,h=a.lineCount,s=r.fft.bufferSize,f=a.position,d=Math.abs(a.position),c=1==f?0:n,p=n;d<1&&(c=p/=2,p=Math.floor(p*(1+d)),c=Math.floor(0<f?c*(1-d):c*(1+d)));for(var u=r.lastH,v=r.stripesH,w=Math.ceil(p/(a.fallDuration/(1e3/a.fps))),g=Math.ceil(p/(a.stripeFallDuration/(1e3/a.fps))),m=a.stripeMargin*o,M=1<<(Math.round(Math.log(s)/Math.log(2)+3)<<1),b=Math.log(M)/Math.log(10),L=20*Math.log(32767)/Math.log(10),y=s/2,S=Math.min(y,Math.floor(5e3*y/(e/2))),C=S==y,H=C?h:Math.round(.8*h),R=S/H,D=C?0:(y-S)/(h-H),x=0,F=0;F<h;F++){var T=Math.ceil(x);x+=F<H?R:D;for(var B=Math.min(Math.ceil(x),y),E=0,j=T;j<B;j++)E=Math.max(E,Math.abs(t[j]));var I=M<E?Math.floor(17*(Math.log(E)/Math.log(10)-b)):0,q=p*Math.min(I/L,1);u[F]=(u[F]||0)-w,q<u[F]&&(q=u[F]),q<0&&(q=0),u[F]=q;var z=v[F]||0;if(q&&z<q+m)v[F]=q+m;else{var P=z-g;P<0&&(P=0),v[F]=P}}i.clearRect(0,0,l,n);var W=r.genLinear(i,a.linear,c,c-p),k=a.stripeLinear&&r.genLinear(i,a.stripeLinear,c,c-p)||W,A=r.genLinear(i,a.linear,c,c+p),G=a.stripeLinear&&r.genLinear(i,a.stripeLinear,c,c+p)||A;i.shadowBlur=a.shadowBlur*o,i.shadowColor=a.shadowColor;var V=a.mirrorEnable,J=V?2*h-1:h,K=a.widthRatio,N=a.spaceWidth*o;0!=N&&(K=(l-N*(J+1))/l);for(var O=Math.max(1*o,Math.floor(l*K/J)),Q=(l-J*O)/(J+1),U=a.minHeight*o,X=V?l/2-(Q+O/2):0,Y=(F=0,X);F<h;F++)Y+=Q,$=Math.floor(Y),q=Math.max(u[F],U),0!=c&&(_=c-q,i.fillStyle=W,i.fillRect($,_,O,q)),c!=n&&(i.fillStyle=A,i.fillRect($,c,O,q)),Y+=O;if(a.stripeEnable){var Z=a.stripeShadowBlur;i.shadowBlur=(-1==Z?a.shadowBlur:Z)*o,i.shadowColor=a.stripeShadowColor||a.shadowColor;var $,_,tt=a.stripeHeight*o;for(F=0,Y=X;F<h;F++)Y+=Q,$=Math.floor(Y),q=v[F],0!=c&&((_=c-q-tt)<0&&(_=0),i.fillStyle=k,i.fillRect($,_,O,tt)),c!=n&&(n<(_=c+q)+tt&&(_=n-tt),i.fillStyle=G,i.fillRect($,_,O,tt)),Y+=O}if(V){var et=Math.floor(l/2);i.save(),i.scale(-1,1),i.drawImage(r.canvas,Math.ceil(l/2),0,et,n,-et,0,et,n),i.restore()}a.onDraw(t,e)}},Recorder.FrequencyHistogramView=t}();
--- a/demos/streaming_asr_server/web/static/js/recorder/extensions/lib.fft.js
+++ b/demos/streaming_asr_server/web/static/js/recorder/extensions/lib.fft.js
@ -1,6 +0,0 @@
 /*
 录音
 https://github.com/xiangyuecn/Recorder
 src: extensions/lib.fft.js
 */
 Recorder.LibFFT=function(r){"use strict";var s,v,d,l,F,b,g,m;return function(r){var o,t,a,f;for(s=Math.round(Math.log(r)/Math.log(2)),d=((v=1<<s)<<2)*Math.sqrt(2),l=[],F=[],b=[0],g=[0],m=[],o=0;o<v;o++){for(a=o,f=t=0;t!=s;t++)f<<=1,f|=1&a,a>>>=1;m[o]=f}var n,u=2*Math.PI/v;for(o=(v>>1)-1;0<o;o--)n=o*u,g[o]=Math.cos(n),b[o]=Math.sin(n)}(r),{transform:function(r){var o,t,a,f,n,u,e,h,M=1,i=s-1;for(o=0;o!=v;o++)l[o]=r[m[o]],F[o]=0;for(o=s;0!=o;o--){for(t=0;t!=M;t++)for(n=g[t<<i],u=b[t<<i],a=t;a<v;a+=M<<1)e=n*l[f=a+M]-u*F[f],h=n*F[f]+u*l[f],l[f]=l[a]-e,F[f]=F[a]-h,l[a]+=e,F[a]+=h;M<<=1,i--}t=v>>1;var c=new Float64Array(t);for(n=-(u=d),o=t;0!=o;o--)e=l[o],h=F[o],c[o-1]=n<e&&e<u&&n<h&&h<u?0:Math.round(e*e+h*h);return c},bufferSize:v}};
--- a/demos/streaming_asr_server/web/static/js/recorder/recorder-core.js
+++ b/demos/streaming_asr_server/web/static/js/recorder/recorder-core.js
--- a/demos/streaming_asr_server/web/static/paddle.ico
+++ b/demos/streaming_asr_server/web/static/paddle.ico
--- a/demos/streaming_asr_server/web/templates/index.html
+++ b/demos/streaming_asr_server/web/templates/index.html
@ -1,156 +0,0 @@
 <!DOCTYPE html>
 <html>
 <head>
  <meta charset="UTF-8">
   <title>PaddleSpeech Serving - Real-Time Speech Transcription</title>
  <link rel="shortcut icon" href="./static/paddle.ico">
  <script src="../static/js/jquery-3.2.1.min.js"></script>
  <script src="../static/js/recorder/recorder-core.js"></script>
  <script src="../static/js/recorder/extensions/lib.fft.js"></script>
  <script src="../static/js/recorder/extensions/frequency.histogram.view.js"></script>
  <script src="../static/js/recorder/engine/pcm.js"></script>
  <script src="../static/js/SoundRecognizer.js"></script>
  <link rel="stylesheet" href="../static/css/style.css">
  <link rel="stylesheet" href="../static/css/font-awesome.min.css">
 </head>
 <body>
  <div class="asr-content">
    <div class="audio-banner">
      <div class="weaper">
        <div class="text-content">
          <p><span class="title">Introduction to PaddleSpeech Serving</span></p>
          <p class="con-container">
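            <span class="con">PaddleSpeech is an open-source speech model library built on PaddlePaddle, for developing a wide range of key speech and audio tasks. PaddleSpeech Serving is a client/server back-end service for speech models, built on Python and FastAPI, that aims to expose the speech operators in PaddleSpeech through one unified back-end service.</span>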
          </p>
        </div>
        <div class="img-con">
          <img src="../static/image/PaddleSpeech_logo.png" alt="" />
        </div>
      </div>
    </div>
    <div class="audio-experience">
      <div class="asr-box">
        <h2>Try It Out</h2>
        <div id="client-word-recorder" style="position: relative;">
          <div class="pd">
            <div style="text-align:center;height:20px;width:100%;
                        border:0px solid #bcbcbc;color:#000;box-sizing: border-box;display:inline-block"
              class="recwave">
            </div>
          </div>
        </div>
        <div class="voice-container">
          <div class="voice-input">
            <span>WebSocket URL:</span>
            <input type="text" id="socketUrl" class="websocket-url" value="ws://127.0.0.1:8091/ws/asr"
              placeholder="Enter the server address, e.g. ws://127.0.0.1:8091/ws/asr">
            <div class="start-voice">
              <button type="primary" id="beginBtn" class="voice-btn">
                <span class="fa fa-microphone"> Start recognition</span>
              </button>
              <button type="primary" id="endBtn" class="voice-btn end">
                <span class="fa fa-microphone-slash"> Stop recognition</span>
              </button>
              <div id="timeBox" class="time-box flex-display-1">
                <span class="total-time">Recognizing; stops automatically in <i id="timeCount"></i> seconds</span>
              </div>
            </div>
          </div>
          <div class="voice">
            <div class="result-text" id="resultPanel">Recognition results appear here</div>
          </div>
        </div>
      </div>
    </div>
  </div>
  <script>
    var wenetWs = null
    var timeLoop = null
    var result = ""
    $(document).ready(function () {
      $('#beginBtn').on('click', startRecording)
      $('#endBtn').on('click', stopRecording)
    })
    function openWebSocket(url) {
      if ("WebSocket" in window) {
        wenetWs = new WebSocket(url)
        wenetWs.onopen = function () {
          console.log("WebSocket connected; starting recognition")
          wenetWs.send(JSON.stringify({
            "signal": "start"
          }))
        }
        wenetWs.onmessage = function (_msg) { parseResult(_msg.data) }
        wenetWs.onclose = function () {
          console.log("WebSocket disconnected")
        }
        wenetWs.onerror = function () { console.log("WebSocket connection failed") }
      }
    }
    function parseResult(message) {
      // Parse the JSON reply from the server; the text is in its `result` field
      var data = JSON.parse(message)
      console.log('result json:', data)
      var result = data.result
      console.log(result)
      $("#resultPanel").html(result)
    }
    function TransferUpload(number, blobOrNull, duration, blobRec, isClose) {
      if (blobOrNull) {
        var blob = blobOrNull
        var encTime = blob.encTime
        var reader = new FileReader()
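        reader.onloadend = function () {
          // Skip chunks that arrive before the socket is open or after it closes
          if (wenetWs && wenetWs.readyState === WebSocket.OPEN) { wenetWs.send(reader.result) }
        }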
        reader.readAsArrayBuffer(blob)
      }
    }
    function startRecording() {
      // Check socket url
      var socketUrl = $('#socketUrl').val()
      if (!socketUrl.trim()) {
        alert('Please enter the WebSocket server address, e.g. ws://127.0.0.1:8091/ws/asr')
        $('#socketUrl').focus()
        return
      }
      // init recorder
      SoundRecognizer.init({
        soundType: 'pcm',
        sampleRate: 16000,
        recwaveElm: '.recwave',
        translerCallBack: TransferUpload
      })
      openWebSocket(socketUrl)
      // Change button state
      $('#beginBtn').hide()
      $('#endBtn, #timeBox').addClass('show')
      // Start countdown
      var seconds = 180
      $('#timeCount').text(seconds)
      timeLoop = setInterval(function () {
        seconds--
        $('#timeCount').text(seconds)
        if (seconds === 0) {
          stopRecording()
        }
      }, 1000)
    }
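    function stopRecording() {
      // Only signal the end of the stream if the socket is actually open
      if (wenetWs && wenetWs.readyState === WebSocket.OPEN) {
        wenetWs.send(JSON.stringify({ "signal": "end" }))
      }
      SoundRecognizer.recordClose()
      $('#endBtn').add($('#timeBox')).removeClass('show')
      $('#beginBtn').show()
      $('#timeCount').text('')
      clearInterval(timeLoop)
    }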
  </script>
 </body>
 </html>
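
The page script above implies a simple message protocol: a JSON `start` signal once the socket opens, recorded audio as binary frames, a JSON `end` signal on stop, and JSON replies carrying a `result` field. The Python sketch below only builds and parses those messages, without opening a connection. The helper names and the 3200-byte chunk size (100 ms of audio, assuming 16 kHz 16-bit mono PCM) are hypothetical and not part of PaddleSpeech.

```python
import json
from typing import Iterator


def start_message() -> str:
    """JSON text frame sent once the socket is open."""
    return json.dumps({"signal": "start"})


def end_message() -> str:
    """JSON text frame sent when recording stops."""
    return json.dumps({"signal": "end"})


def pcm_chunks(pcm: bytes, chunk_bytes: int = 3200) -> Iterator[bytes]:
    """Split raw PCM into binary frames.

    The default of 3200 bytes is an assumption: 100 ms of 16 kHz,
    16-bit mono audio. The last chunk may be shorter.
    """
    if chunk_bytes <= 0:
        raise ValueError("chunk_bytes must be positive")
    for offset in range(0, len(pcm), chunk_bytes):
        yield pcm[offset:offset + chunk_bytes]


def parse_result(reply: str) -> str:
    """Return the `result` field of a server reply, or '' if it is absent."""
    return json.loads(reply).get("result", "")
```

A real client would send `start_message()`, then each chunk as a binary frame, then `end_message()`, and would call `parse_result` on every text frame it receives.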
--- a/docs/requirements.txt
+++ b/docs/requirements.txt
@ -22,6 +22,7 @@ onnxruntime
 pandas
 paddlenlp
 paddlespeech_feat
 Pillow>=9.0.0
 praatio==5.0.0
 pypinyin
 pypinyin-dict
--- a/docs/source/released_model.md
+++ b/docs/source/released_model.md
@ -10,7 +10,7 @@ Acoustic Model | Training Data | Token-based | Size | Descriptions | CER | WER |
 [Ds2 Offline Aishell ASR0 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_offline_aishell_ckpt_1.0.1.model.tar.gz)| Aishell Dataset | Char-based | 1.4 GB | 2 Conv + 5 bidirectional LSTM layers| 0.0554 |-| 151 h | [Ds2 Offline Aishell ASR0](../../examples/aishell/asr0) | inference/python |
 [Conformer Online Wenetspeech ASR1 Model](https://paddlespeech.bj.bcebos.com/s2t/wenetspeech/asr1/asr1_chunk_conformer_wenetspeech_ckpt_1.0.0a.model.tar.gz) | WenetSpeech Dataset | Char-based | 457 MB  | Encoder:Conformer, Decoder:Transformer, Decoding method: Attention rescoring| 0.11 (test\_net) 0.1879 (test\_meeting) |-| 10000 h |- | python |
 [Conformer Online Aishell ASR1 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr1/asr1_chunk_conformer_aishell_ckpt_0.2.0.model.tar.gz) | Aishell Dataset | Char-based | 189 MB  | Encoder:Conformer, Decoder:Transformer, Decoding method: Attention rescoring| 0.0544 |-| 151 h | [Conformer Online Aishell ASR1](../../examples/aishell/asr1) | python |
-[Conformer Offline Aishell ASR1 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr1/asr1_conformer_aishell_ckpt_0.1.2.model.tar.gz) | Aishell Dataset | Char-based | 189 MB  | Encoder:Conformer, Decoder:Transformer, Decoding method: Attention rescoring | 0.0464 |-| 151 h | [Conformer Offline Aishell ASR1](../../examples/aishell/asr1) | python |
+[Conformer Offline Aishell ASR1 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr1/asr1_conformer_aishell_ckpt_1.0.1.model.tar.gz) | Aishell Dataset | Char-based | 189 MB  | Encoder:Conformer, Decoder:Transformer, Decoding method: Attention rescoring | 0.0460 |-| 151 h | [Conformer Offline Aishell ASR1](../../examples/aishell/asr1) | python |
 [Transformer Aishell ASR1 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr1/asr1_transformer_aishell_ckpt_0.1.1.model.tar.gz) | Aishell Dataset | Char-based | 128 MB | Encoder:Transformer, Decoder:Transformer, Decoding method: Attention rescoring | 0.0523 || 151 h | [Transformer  Aishell ASR1](../../examples/aishell/asr1) | python |
 [Ds2 Offline Librispeech ASR0 Model](https://paddlespeech.bj.bcebos.com/s2t/librispeech/asr0/asr0_deepspeech2_offline_librispeech_ckpt_1.0.1.model.tar.gz)| Librispeech Dataset | Char-based | 1.3 GB | 2 Conv + 5 bidirectional LSTM layers| - |0.0467| 960 h | [Ds2 Offline Librispeech ASR0](../../examples/librispeech/asr0) | inference/python |
 [Conformer Librispeech ASR1 Model](https://paddlespeech.bj.bcebos.com/s2t/librispeech/asr1/asr1_conformer_librispeech_ckpt_0.1.1.model.tar.gz) | Librispeech Dataset | subword-based | 191 MB | Encoder:Conformer, Decoder:Transformer, Decoding method: Attention rescoring |-| 0.0338 | 960 h | [Conformer Librispeech ASR1](../../examples/librispeech/asr1) | python |
--- a/examples/aishell/asr1/RESULTS.md
+++ b/examples/aishell/asr1/RESULTS.md
@ -2,13 +2,13 @@
 ## Conformer
 paddle version: 2.2.2  
-paddlespeech version: 0.2.0
+paddlespeech version: 1.0.1
 | Model | Params | Config | Augmentation| Test set | Decode method | Loss | CER |
 | --- | --- | --- | --- | --- | --- | --- | --- | 
-| conformer | 47.07M  | conf/conformer.yaml | spec_aug | test | attention | - | 0.0530 |
+| conformer | 47.07M  | conf/conformer.yaml | spec_aug | test | attention | - | 0.0522 |
-| conformer | 47.07M  | conf/conformer.yaml | spec_aug | test | ctc_greedy_search | - | 0.0495 |
+| conformer | 47.07M  | conf/conformer.yaml | spec_aug | test | ctc_greedy_search | - | 0.0481 |
-| conformer | 47.07M  | conf/conformer.yaml | spec_aug| test | ctc_prefix_beam_search | - | 0.0494 | 
+| conformer | 47.07M  | conf/conformer.yaml | spec_aug| test | ctc_prefix_beam_search | - | 0.0480 | 
-| conformer | 47.07M  | conf/conformer.yaml | spec_aug | test | attention_rescoring | - | 0.0464 | 
+| conformer | 47.07M  | conf/conformer.yaml | spec_aug | test | attention_rescoring | - | 0.0460 | 
 ## Conformer Streaming
--- a/examples/aishell/asr1/conf/conformer.yaml
+++ b/examples/aishell/asr1/conf/conformer.yaml
@ -57,7 +57,7 @@ feat_dim: 80
 stride_ms: 10.0
 window_ms: 25.0
 sortagrad: 0 # Feed samples from shortest to longest ; -1: enabled for all epochs, 0: disabled, other: enabled for 'other' epochs 
-batch_size: 64
+batch_size: 32
 maxlen_in: 512  # if input length  > maxlen-in, batchsize is automatically reduced
 maxlen_out: 150  # if output length > maxlen-out, batchsize is automatically reduced
 minibatches: 0 # for debug
@ -73,10 +73,10 @@ num_encs: 1
 ###########################################
 #                Training                 #
 ###########################################
-n_epoch: 240 
+n_epoch: 150 
-accum_grad: 2
+accum_grad: 8
 global_grad_clip: 5.0
-dist_sampler: True
+dist_sampler: False
 optim: adam
 optim_conf:
  lr: 0.002
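
The hunks above halve `batch_size` and quadruple `accum_grad`. Assuming the usual meaning of gradient accumulation (one optimizer update per `accum_grad` mini-batches) and ignoring the automatic batch reduction tied to `maxlen_in` and `maxlen_out`, the number of samples per update on each device doubles. The helper name below is hypothetical.

```python
def samples_per_update(batch_size: int, accum_grad: int, n_devices: int = 1) -> int:
    """Samples contributing to one optimizer update.

    Assumes every mini-batch is full and that gradients from
    `accum_grad` mini-batches are accumulated before each update.
    """
    return batch_size * accum_grad * n_devices


old = samples_per_update(batch_size=64, accum_grad=2)  # values removed by the diff
new = samples_per_update(batch_size=32, accum_grad=8)  # values added by the diff
print(f"old: {old}, new: {new}")
```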
--- a/examples/csmsc/vits/README.md
+++ b/examples/csmsc/vits/README.md
@ -144,3 +144,34 @@ optional arguments:
 6. `--ngpu` is the number of GPUs to use; if ngpu == 0, the CPU is used.
 ## Pretrained Model
 The pretrained model can be downloaded here:
 - [vits_csmsc_ckpt_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/vits/vits_csmsc_ckpt_1.1.0.zip) (add_blank=true)
 The VITS checkpoint contains the files listed below.
 ```text
 vits_csmsc_ckpt_1.1.0
 ├── default.yaml              # default config used to train vits
 ├── phone_id_map.txt          # phone vocabulary file when training vits
 └── snapshot_iter_350000.pdz  # model parameters and optimizer states
 ```
 Note: this checkpoint is not yet good enough; a better model is still being trained.
 You can use the following script to synthesize the sentences in `${BIN_DIR}/../sentences.txt` with the pretrained VITS model.
 ```bash
 source path.sh
 add_blank=true
 FLAGS_allocator_strategy=naive_best_fit \
 FLAGS_fraction_of_gpu_memory_to_use=0.01 \
 python3 ${BIN_DIR}/synthesize_e2e.py \
    --config=vits_csmsc_ckpt_1.1.0/default.yaml \
    --ckpt=vits_csmsc_ckpt_1.1.0/snapshot_iter_350000.pdz \
    --phones_dict=vits_csmsc_ckpt_1.1.0/phone_id_map.txt \
    --output_dir=exp/default/test_e2e \
    --text=${BIN_DIR}/../sentences.txt \
    --add-blank=${add_blank} 
 ```
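
Before running the synthesis command, it can help to confirm that the unzipped checkpoint directory holds the three files listed in the README. The following is a minimal sketch; the function name is hypothetical, and the file names are taken from the listing above.

```python
from pathlib import Path
from typing import List

# File names as listed for vits_csmsc_ckpt_1.1.0 in the README
EXPECTED_FILES = ("default.yaml", "phone_id_map.txt", "snapshot_iter_350000.pdz")


def missing_checkpoint_files(ckpt_dir: str) -> List[str]:
    """Return the expected files that are not present in ckpt_dir."""
    root = Path(ckpt_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_checkpoint_files("vits_csmsc_ckpt_1.1.0")
    if missing:
        print("Missing files:", ", ".join(missing))
    else:
        print("Checkpoint directory looks complete.")
```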
--- a/examples/csmsc/vits/local/train.sh
+++ b/examples/csmsc/vits/local/train.sh
@ -3,6 +3,11 @@
 config_path=$1
 train_output_path=$2
 # install monotonic_align
 cd ${MAIN_ROOT}/paddlespeech/t2s/models/vits/monotonic_align
 python3 setup.py build_ext --inplace
 cd -
 python3 ${BIN_DIR}/train.py \
    --train-metadata=dump/train/norm/metadata.jsonl \
    --dev-metadata=dump/dev/norm/metadata.jsonl \
--- a/examples/voxceleb/sv0/local/data.sh
+++ b/examples/voxceleb/sv0/local/data.sh
@ -74,7 +74,7 @@ if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
   # convert the m4a to wav
   # and we will not delete the original m4a file
   echo "start to convert the m4a to wav"
-   bash local/convert.sh ${TARGET_DIR}/voxceleb/vox2/test/ || exit 1;
+   bash local/convert.sh ${TARGET_DIR}/voxceleb/vox2/ || exit 1;
   if [ $? -ne 0 ]; then
      echo "Convert voxceleb2 dataset from m4a to wav failed. Terminated."
--- a/paddlespeech/audio/transform/spec_augment.py
+++ b/paddlespeech/audio/transform/spec_augment.py
@ -14,10 +14,8 @@
 # Modified from espnet(https://github.com/espnet/espnet)
 """Spec Augment module for preprocessing i.e., data augmentation"""
 import random
 import numpy
 from PIL import Image
 from PIL.Image import BICUBIC
 from .functional import FuncTrans
@ -46,9 +44,10 @@ def time_warp(x, max_time_warp=80, inplace=False, mode="PIL"):
        warped = random.randrange(center - window, center +
                                  window) + 1  # 1 ... t - 1
-        left = Image.fromarray(x[:center]).resize((x.shape[1], warped), BICUBIC)
+        left = Image.fromarray(x[:center]).resize((x.shape[1], warped),
                                                  Image.BICUBIC)
        right = Image.fromarray(x[center:]).resize((x.shape[1], t - warped),
-                                                   BICUBIC)
+                                                   Image.BICUBIC)
        if inplace:
            x[:warped] = left
            x[warped:] = right
--- a/paddlespeech/cli/asr/infer.py
+++ b/paddlespeech/cli/asr/infer.py
@ -133,11 +133,11 @@ class ASRExecutor(BaseExecutor):
        """
        Init model and other resources from a specific path.
        """
-        logger.info("start to init the model")
+        logger.debug("start to init the model")
        # default max_len: unit:second
        self.max_len = 50
        if hasattr(self, 'model'):
-            logger.info('Model had been initialized.')
+            logger.debug('Model had been initialized.')
            return
        if cfg_path is None or ckpt_path is None:
@ -151,15 +151,15 @@ class ASRExecutor(BaseExecutor):
            self.ckpt_path = os.path.join(
                self.res_path,
                self.task_resource.res_dict['ckpt_path'] + ".pdparams")
-            logger.info(self.res_path)
+            logger.debug(self.res_path)
        else:
            self.cfg_path = os.path.abspath(cfg_path)
            self.ckpt_path = os.path.abspath(ckpt_path + ".pdparams")
            self.res_path = os.path.dirname(
                os.path.dirname(os.path.abspath(self.cfg_path)))
-        logger.info(self.cfg_path)
+        logger.debug(self.cfg_path)
-        logger.info(self.ckpt_path)
+        logger.debug(self.ckpt_path)
        #Init body.
        self.config = CfgNode(new_allowed=True)
@ -216,7 +216,7 @@ class ASRExecutor(BaseExecutor):
                max_len = self.config.encoder_conf.max_len
            self.max_len = frame_shift_ms * max_len * subsample_rate
-            logger.info(
+            logger.debug(
                f"The asr server limit max duration len: {self.max_len}")
    def preprocess(self, model_type: str, input: Union[str, os.PathLike]):
@ -227,15 +227,15 @@ class ASRExecutor(BaseExecutor):
        audio_file = input
        if isinstance(audio_file, (str, os.PathLike)):
-            logger.info("Preprocess audio_file:" + audio_file)
+            logger.debug("Preprocess audio_file:" + audio_file)
        # Get the object for feature extraction
        if "deepspeech2" in model_type or "conformer" in model_type or "transformer" in model_type:
-            logger.info("get the preprocess conf")
+            logger.debug("get the preprocess conf")
            preprocess_conf = self.config.preprocess_config
            preprocess_args = {"train": False}
            preprocessing = Transformation(preprocess_conf)
-            logger.info("read the audio file")
+            logger.debug("read the audio file")
            audio, audio_sample_rate = soundfile.read(
                audio_file, dtype="int16", always_2d=True)
            if self.change_format:
@ -255,7 +255,7 @@ class ASRExecutor(BaseExecutor):
            else:
                audio = audio[:, 0]
-            logger.info(f"audio shape: {audio.shape}")
+            logger.debug(f"audio shape: {audio.shape}")
            # fbank
            audio = preprocessing(audio, **preprocess_args)
@ -264,19 +264,19 @@ class ASRExecutor(BaseExecutor):
            self._inputs["audio"] = audio
            self._inputs["audio_len"] = audio_len
-            logger.info(f"audio feat shape: {audio.shape}")
+            logger.debug(f"audio feat shape: {audio.shape}")
        else:
            raise Exception("wrong type")
-        logger.info("audio feat process success")
+        logger.debug("audio feat process success")
    @paddle.no_grad()
    def infer(self, model_type: str):
        """
        Model inference and result stored in self.output.
        """
-        logger.info("start to infer the model to get the output")
+        logger.debug("start to infer the model to get the output")
        cfg = self.config.decode
        audio = self._inputs["audio"]
        audio_len = self._inputs["audio_len"]
@ -293,7 +293,7 @@ class ASRExecutor(BaseExecutor):
            self._outputs["result"] = result_transcripts[0]
        elif "conformer" in model_type or "transformer" in model_type:
-            logger.info(
+            logger.debug(
                f"we will use the transformer like model : {model_type}")
            try:
                result_transcripts = self.model.decode(
@ -352,7 +352,7 @@ class ASRExecutor(BaseExecutor):
                logger.error("Please input the right audio file path")
                return False
-        logger.info("checking the audio file format......")
+        logger.debug("checking the audio file format......")
        try:
            audio, audio_sample_rate = soundfile.read(
                audio_file, dtype="int16", always_2d=True)
@ -374,7 +374,7 @@ class ASRExecutor(BaseExecutor):
                 sox input_audio.xx --rate 8k --bits 16 --channels 1 output_audio.wav \n \
                 ")
            return False
-        logger.info("The sample rate is %d" % audio_sample_rate)
+        logger.debug("The sample rate is %d" % audio_sample_rate)
        if audio_sample_rate != self.sample_rate:
            logger.warning("The sample rate of the input file is not {}.\n \
                            The program will resample the wav file to {}.\n \
@ -383,28 +383,28 @@ class ASRExecutor(BaseExecutor):
                        ".format(self.sample_rate, self.sample_rate))
            if force_yes is False:
                while (True):
-                    logger.info(
+                    logger.debug(
                        "Whether to change the sample rate and the channel. Y: change the sample. N: exit the prgream."
                    )
                    content = input("Input(Y/N):")
                    if content.strip() == "Y" or content.strip(
                    ) == "y" or content.strip() == "yes" or content.strip(
                    ) == "Yes":
-                        logger.info(
+                        logger.debug(
                            "change the sampele rate, channel to 16k and 1 channel"
                        )
                        break
                    elif content.strip() == "N" or content.strip(
                    ) == "n" or content.strip() == "no" or content.strip(
                    ) == "No":
-                        logger.info("Exit the program")
+                        logger.debug("Exit the program")
                        return False
                    else:
                        logger.warning("Not regular input, please input again")
            self.change_format = True
        else:
-            logger.info("The audio file format is right")
+            logger.debug("The audio file format is right")
            self.change_format = False
        return True
--- a/paddlespeech/cli/cls/infer.py
+++ b/paddlespeech/cli/cls/infer.py
@ -92,7 +92,7 @@ class CLSExecutor(BaseExecutor):
            Init model and other resources from a specific path.
        """
        if hasattr(self, 'model'):
-            logger.info('Model had been initialized.')
+            logger.debug('Model had been initialized.')
            return
        if label_file is None or ckpt_path is None:
@ -135,14 +135,14 @@ class CLSExecutor(BaseExecutor):
            Input content can be a text(tts), a file(asr, cls) or a streaming(not supported yet).
        """
        feat_conf = self._conf['feature']
-        logger.info(feat_conf)
+        logger.debug(feat_conf)
        waveform, _ = load(
            file=audio_file,
            sr=feat_conf['sample_rate'],
            mono=True,
            dtype='float32')
        if isinstance(audio_file, (str, os.PathLike)):
-            logger.info("Preprocessing audio_file:" + audio_file)
+            logger.debug("Preprocessing audio_file:" + audio_file)
        # Feature extraction
        feature_extractor = LogMelSpectrogram(
--- a/paddlespeech/cli/download.py
+++ b/paddlespeech/cli/download.py
@ -61,7 +61,7 @@ def _get_unique_endpoints(trainer_endpoints):
            continue
        ips.add(ip)
        unique_endpoints.add(endpoint)
-    logger.info("unique_endpoints {}".format(unique_endpoints))
+    logger.debug("unique_endpoints {}".format(unique_endpoints))
    return unique_endpoints
@ -96,7 +96,7 @@ def get_path_from_url(url,
    # data, and the same ip will only download data once.
    unique_endpoints = _get_unique_endpoints(ParallelEnv().trainer_endpoints[:])
    if osp.exists(fullpath) and check_exist and _md5check(fullpath, md5sum):
-        logger.info("Found {}".format(fullpath))
+        logger.debug("Found {}".format(fullpath))
    else:
        if ParallelEnv().current_endpoint in unique_endpoints:
            fullpath = _download(url, root_dir, md5sum, method=method)
@ -118,7 +118,7 @@ def _get_download(url, fullname):
    try:
        req = requests.get(url, stream=True)
    except Exception as e:  # requests.exceptions.ConnectionError
-        logger.info("Downloading {} from {} failed with exception {}".format(
+        logger.debug("Downloading {} from {} failed with exception {}".format(
            fname, url, str(e)))
        return False
@ -190,7 +190,7 @@ def _download(url, path, md5sum=None, method='get'):
    fullname = osp.join(path, fname)
    retry_cnt = 0
-    logger.info("Downloading {} from {}".format(fname, url))
+    logger.debug("Downloading {} from {}".format(fname, url))
    while not (osp.exists(fullname) and _md5check(fullname, md5sum)):
        if retry_cnt < DOWNLOAD_RETRY_LIMIT:
            retry_cnt += 1
@ -209,7 +209,7 @@ def _md5check(fullname, md5sum=None):
    if md5sum is None:
        return True
-    logger.info("File {} md5 checking...".format(fullname))
+    logger.debug("File {} md5 checking...".format(fullname))
    md5 = hashlib.md5()
    with open(fullname, 'rb') as f:
        for chunk in iter(lambda: f.read(4096), b""):
@ -217,8 +217,8 @@ def _md5check(fullname, md5sum=None):
    calc_md5sum = md5.hexdigest()
    if calc_md5sum != md5sum:
-        logger.info("File {} md5 check failed, {}(calc) != "
+        logger.debug("File {} md5 check failed, {}(calc) != "
-                    "{}(base)".format(fullname, calc_md5sum, md5sum))
+                     "{}(base)".format(fullname, calc_md5sum, md5sum))
        return False
    return True
@ -227,7 +227,7 @@ def _decompress(fname):
    """
    Decompress for zip and tar file
    """
-    logger.info("Decompressing {}...".format(fname))
+    logger.debug("Decompressing {}...".format(fname))
    # For protecting decompressing interupted,
    # decompress to fpath_tmp directory firstly, if decompress
--- a/paddlespeech/cli/executor.py
+++ b/paddlespeech/cli/executor.py
@ -217,7 +217,7 @@ class BaseExecutor(ABC):
            logging.getLogger(name) for name in logging.root.manager.loggerDict
        ]
        for l in loggers:
-            l.disabled = True
+            l.setLevel(logging.ERROR)
    def show_rtf(self, info: Dict[str, List[float]]):
        """
--- a/paddlespeech/cli/kws/infer.py
+++ b/paddlespeech/cli/kws/infer.py
@ -88,7 +88,7 @@ class KWSExecutor(BaseExecutor):
            Init model and other resources from a specific path.
        """
        if hasattr(self, 'model'):
-            logger.info('Model had been initialized.')
+            logger.debug('Model had been initialized.')
            return
        if ckpt_path is None:
@ -141,7 +141,7 @@ class KWSExecutor(BaseExecutor):
        assert os.path.isfile(audio_file)
        waveform, _ = load(audio_file)
        if isinstance(audio_file, (str, os.PathLike)):
-            logger.info("Preprocessing audio_file:" + audio_file)
+            logger.debug("Preprocessing audio_file:" + audio_file)
        # Feature extraction
        waveform = paddle.to_tensor(waveform).unsqueeze(0)
--- a/paddlespeech/cli/log.py
+++ b/paddlespeech/cli/log.py
@ -49,7 +49,7 @@ class Logger(object):
        self.handler.setFormatter(self.format)
        self.logger.addHandler(self.handler)
-        self.logger.setLevel(logging.DEBUG)
+        self.logger.setLevel(logging.INFO)
        self.logger.propagate = False
    def __call__(self, log_level: str, msg: str):
--- a/paddlespeech/cli/st/infer.py
+++ b/paddlespeech/cli/st/infer.py
@ -110,7 +110,7 @@ class STExecutor(BaseExecutor):
        """
        decompressed_path = download_and_decompress(self.kaldi_bins, MODEL_HOME)
        decompressed_path = os.path.abspath(decompressed_path)
-        logger.info("Kaldi_bins stored in: {}".format(decompressed_path))
+        logger.debug("Kaldi_bins stored in: {}".format(decompressed_path))
        if "LD_LIBRARY_PATH" in os.environ:
            os.environ["LD_LIBRARY_PATH"] += f":{decompressed_path}"
        else:
@ -128,7 +128,7 @@ class STExecutor(BaseExecutor):
            Init model and other resources from a specific path.
        """
        if hasattr(self, 'model'):
-            logger.info('Model had been initialized.')
+            logger.debug('Model had been initialized.')
            return
        if cfg_path is None or ckpt_path is None:
@ -140,8 +140,8 @@ class STExecutor(BaseExecutor):
            self.ckpt_path = os.path.join(
                self.task_resource.res_dir,
                self.task_resource.res_dict['ckpt_path'])
-            logger.info(self.cfg_path)
+            logger.debug(self.cfg_path)
-            logger.info(self.ckpt_path)
+            logger.debug(self.ckpt_path)
            res_path = self.task_resource.res_dir
        else:
            self.cfg_path = os.path.abspath(cfg_path)
@ -192,7 +192,7 @@ class STExecutor(BaseExecutor):
            Input content can be a file(wav).
        """
        audio_file = os.path.abspath(wav_file)
-        logger.info("Preprocess audio_file:" + audio_file)
+        logger.debug("Preprocess audio_file:" + audio_file)
        if "fat_st" in model_type:
            cmvn = self.config.cmvn_path
--- a/paddlespeech/cli/text/infer.py
+++ b/paddlespeech/cli/text/infer.py
@ -98,7 +98,7 @@ class TextExecutor(BaseExecutor):
            Init model and other resources from a specific path.
        """
        if hasattr(self, 'model'):
-            logger.info('Model had been initialized.')
+            logger.debug('Model had been initialized.')
            return
        self.task = task
--- a/paddlespeech/cli/tts/infer.py
+++ b/paddlespeech/cli/tts/infer.py
@ -173,16 +173,23 @@ class TTSExecutor(BaseExecutor):
        Init model and other resources from a specific path.
        """
        if hasattr(self, 'am_inference') and hasattr(self, 'voc_inference'):
-            logger.info('Models had been initialized.')
+            logger.debug('Models had been initialized.')
            return
        # am
        if am_ckpt is None or am_config is None or am_stat is None or phones_dict is None:
            use_pretrained_am = True
        else:
            use_pretrained_am = False
        am_tag = am + '-' + lang
        self.task_resource.set_task_model(
            model_tag=am_tag,
            model_type=0,  # am
            skip_download=not use_pretrained_am,
            version=None,  # default version
        )
-        if am_ckpt is None or am_config is None or am_stat is None or phones_dict is None:
+        if use_pretrained_am:
            self.am_res_path = self.task_resource.res_dir
            self.am_config = os.path.join(self.am_res_path,
                                          self.task_resource.res_dict['config'])
@ -193,9 +200,9 @@ class TTSExecutor(BaseExecutor):
            # must have phones_dict in acoustic
            self.phones_dict = os.path.join(
                self.am_res_path, self.task_resource.res_dict['phones_dict'])
-            logger.info(self.am_res_path)
+            logger.debug(self.am_res_path)
-            logger.info(self.am_config)
+            logger.debug(self.am_config)
-            logger.info(self.am_ckpt)
+            logger.debug(self.am_ckpt)
        else:
            self.am_config = os.path.abspath(am_config)
            self.am_ckpt = os.path.abspath(am_ckpt)
@ -220,13 +227,19 @@ class TTSExecutor(BaseExecutor):
                self.speaker_dict = speaker_dict
        # voc
        if voc_ckpt is None or voc_config is None or voc_stat is None:
            use_pretrained_voc = True
        else:
            use_pretrained_voc = False
        voc_tag = voc + '-' + lang
        self.task_resource.set_task_model(
            model_tag=voc_tag,
            model_type=1,  # vocoder
            skip_download=not use_pretrained_voc,
            version=None,  # default version
        )
-        if voc_ckpt is None or voc_config is None or voc_stat is None:
+        if use_pretrained_voc:
            self.voc_res_path = self.task_resource.voc_res_dir
            self.voc_config = os.path.join(
                self.voc_res_path, self.task_resource.voc_res_dict['config'])
@ -235,9 +248,9 @@ class TTSExecutor(BaseExecutor):
            self.voc_stat = os.path.join(
                self.voc_res_path,
                self.task_resource.voc_res_dict['speech_stats'])
-            logger.info(self.voc_res_path)
+            logger.debug(self.voc_res_path)
-            logger.info(self.voc_config)
+            logger.debug(self.voc_config)
-            logger.info(self.voc_ckpt)
+            logger.debug(self.voc_ckpt)
        else:
            self.voc_config = os.path.abspath(voc_config)
            self.voc_ckpt = os.path.abspath(voc_ckpt)
@ -254,21 +267,18 @@ class TTSExecutor(BaseExecutor):
        with open(self.phones_dict, "r") as f:
            phn_id = [line.strip().split() for line in f.readlines()]
        vocab_size = len(phn_id)
        print("vocab_size:", vocab_size)
        tone_size = None
        if self.tones_dict:
            with open(self.tones_dict, "r") as f:
                tone_id = [line.strip().split() for line in f.readlines()]
            tone_size = len(tone_id)
            print("tone_size:", tone_size)
        spk_num = None
        if self.speaker_dict:
            with open(self.speaker_dict, 'rt') as f:
                spk_id = [line.strip().split() for line in f.readlines()]
            spk_num = len(spk_id)
            print("spk_num:", spk_num)
        # frontend
        if lang == 'zh':
@ -278,7 +288,6 @@ class TTSExecutor(BaseExecutor):
        elif lang == 'en':
            self.frontend = English(phone_vocab_path=self.phones_dict)
        print("frontend done!")
        # acoustic model
        odim = self.am_config.n_mels
@ -311,7 +320,6 @@ class TTSExecutor(BaseExecutor):
        am_normalizer = ZScore(am_mu, am_std)
        self.am_inference = am_inference_class(am_normalizer, am)
        self.am_inference.eval()
        print("acoustic model done!")
        # vocoder
        # model: {model_name}_{dataset}
@ -334,7 +342,6 @@ class TTSExecutor(BaseExecutor):
        voc_normalizer = ZScore(voc_mu, voc_std)
        self.voc_inference = voc_inference_class(voc_normalizer, voc)
        self.voc_inference.eval()
        print("voc done!")
    def preprocess(self, input: Any, *args, **kwargs):
        """
@ -375,7 +382,7 @@ class TTSExecutor(BaseExecutor):
                text, merge_sentences=merge_sentences)
            phone_ids = input_ids["phone_ids"]
        else:
-            print("lang should in {'zh', 'en'}!")
+            logger.error("lang should in {'zh', 'en'}!")
        self.frontend_time = time.time() - frontend_st
        self.am_time = 0
--- a/paddlespeech/cli/vector/infer.py
+++ b/paddlespeech/cli/vector/infer.py
@ -117,7 +117,7 @@ class VectorExecutor(BaseExecutor):
        # stage 2: read the input data and store them as a list
        task_source = self.get_input_source(parser_args.input)
-        logger.info(f"task source: {task_source}")
+        logger.debug(f"task source: {task_source}")
        # stage 3: process the audio one by one
        # we do action according the task type
@ -127,13 +127,13 @@ class VectorExecutor(BaseExecutor):
            try:
                # extract the speaker audio embedding
                if parser_args.task == "spk":
-                    logger.info("do vector spk task")
+                    logger.debug("do vector spk task")
                    res = self(input_, model, sample_rate, config, ckpt_path,
                               device)
                    task_result[id_] = res
                elif parser_args.task == "score":
-                    logger.info("do vector score task")
+                    logger.debug("do vector score task")
-                    logger.info(f"input content {input_}")
+                    logger.debug(f"input content {input_}")
                    if len(input_.split()) != 2:
                        logger.error(
                            f"vector score task input {input_} wav num is not two,"
@ -142,7 +142,7 @@ class VectorExecutor(BaseExecutor):
                    # get the enroll and test embedding
                    enroll_audio, test_audio = input_.split()
-                    logger.info(
+                    logger.debug(
                        f"score task, enroll audio: {enroll_audio}, test audio: {test_audio}"
                    )
                    enroll_embedding = self(enroll_audio, model, sample_rate,
@ -158,8 +158,8 @@ class VectorExecutor(BaseExecutor):
                has_exceptions = True
                task_result[id_] = f'{e.__class__.__name__}: {e}'
-        logger.info("task result as follows: ")
+        logger.debug("task result as follows: ")
-        logger.info(f"{task_result}")
+        logger.debug(f"{task_result}")
        # stage 4: process the all the task results
        self.process_task_results(parser_args.input, task_result,
@ -207,7 +207,7 @@ class VectorExecutor(BaseExecutor):
        """
        if not hasattr(self, "score_func"):
            self.score_func = paddle.nn.CosineSimilarity(axis=0)
-            logger.info("create the cosine score function ")
+            logger.debug("create the cosine score function ")
        score = self.score_func(
            paddle.to_tensor(enroll_embedding),
@ -244,7 +244,7 @@ class VectorExecutor(BaseExecutor):
            sys.exit(-1)
        # stage 1: set the paddle runtime host device
-        logger.info(f"device type: {device}")
+        logger.debug(f"device type: {device}")
        paddle.device.set_device(device)
        # stage 2: read the specific pretrained model
@ -283,7 +283,7 @@ class VectorExecutor(BaseExecutor):
        # stage 0: avoid to init the mode again
        self.task = task
        if hasattr(self, "model"):
-            logger.info("Model has been initialized")
+            logger.debug("Model has been initialized")
            return
        # stage 1: get the model and config path
@ -294,7 +294,7 @@ class VectorExecutor(BaseExecutor):
            sample_rate_str = "16k" if sample_rate == 16000 else "8k"
            tag = model_type + "-" + sample_rate_str
            self.task_resource.set_task_model(tag, version=None)
-            logger.info(f"load the pretrained model: {tag}")
+            logger.debug(f"load the pretrained model: {tag}")
            # get the model from the pretrained list
            # we download the pretrained model and store it in the res_path
            self.res_path = self.task_resource.res_dir
@ -312,19 +312,19 @@ class VectorExecutor(BaseExecutor):
            self.res_path = os.path.dirname(
                os.path.dirname(os.path.abspath(self.cfg_path)))
-        logger.info(f"start to read the ckpt from {self.ckpt_path}")
+        logger.debug(f"start to read the ckpt from {self.ckpt_path}")
-        logger.info(f"read the config from {self.cfg_path}")
+        logger.debug(f"read the config from {self.cfg_path}")
-        logger.info(f"get the res path {self.res_path}")
+        logger.debug(f"get the res path {self.res_path}")
        # stage 2: read and config and init the model body
        self.config = CfgNode(new_allowed=True)
        self.config.merge_from_file(self.cfg_path)
        # stage 3: get the model name to instance the model network with dynamic_import
-        logger.info("start to dynamic import the model class")
+        logger.debug("start to dynamic import the model class")
        model_name = model_type[:model_type.rindex('_')]
        model_class = self.task_resource.get_model_class(model_name)
-        logger.info(f"model name {model_name}")
+        logger.debug(f"model name {model_name}")
        model_conf = self.config.model
        backbone = model_class(**model_conf)
        model = SpeakerIdetification(
@ -333,11 +333,11 @@ class VectorExecutor(BaseExecutor):
        self.model.eval()
        # stage 4: load the model parameters
-        logger.info("start to set the model parameters to model")
+        logger.debug("start to set the model parameters to model")
        model_dict = paddle.load(self.ckpt_path)
        self.model.set_state_dict(model_dict)
-        logger.info("create the model instance success")
+        logger.debug("create the model instance success")
    @paddle.no_grad()
    def infer(self, model_type: str):
@ -349,14 +349,14 @@ class VectorExecutor(BaseExecutor):
        # stage 0: get the feat and length from _inputs
        feats = self._inputs["feats"]
        lengths = self._inputs["lengths"]
-        logger.info("start to do backbone network model forward")
+        logger.debug("start to do backbone network model forward")
-        logger.info(
+        logger.debug(
            f"feats shape:{feats.shape}, lengths shape: {lengths.shape}")
        # stage 1: get the audio embedding
        # embedding from (1, emb_size, 1) -> (emb_size)
        embedding = self.model.backbone(feats, lengths).squeeze().numpy()
-        logger.info(f"embedding size: {embedding.shape}")
+        logger.debug(f"embedding size: {embedding.shape}")
        # stage 2: put the embedding and dim info to _outputs property
        #          the embedding type is numpy.array
@ -380,12 +380,13 @@ class VectorExecutor(BaseExecutor):
        """
        audio_file = input_file
        if isinstance(audio_file, (str, os.PathLike)):
-            logger.info(f"Preprocess audio file: {audio_file}")
+            logger.debug(f"Preprocess audio file: {audio_file}")
        # stage 1: load the audio sample points
        #    Note: this process must match the training process
        waveform, sr = load_audio(audio_file)
-        logger.info(f"load the audio sample points, shape is: {waveform.shape}")
+        logger.debug(
+            f"load the audio sample points, shape is: {waveform.shape}")
        # stage 2: get the audio feat
        # Note: Now we only support fbank feature
@ -396,9 +397,9 @@ class VectorExecutor(BaseExecutor):
                n_mels=self.config.n_mels,
                window_size=self.config.window_size,
                hop_length=self.config.hop_size)
-            logger.info(f"extract the audio feat, shape is: {feat.shape}")
+            logger.debug(f"extract the audio feat, shape is: {feat.shape}")
        except Exception as e:
-            logger.info(f"feat occurs exception {e}")
+            logger.debug(f"feat occurs exception {e}")
            sys.exit(-1)
        feat = paddle.to_tensor(feat).unsqueeze(0)
@ -411,11 +412,11 @@ class VectorExecutor(BaseExecutor):
        # stage 4: store the feat and length in the _inputs,
        #          which will be used in other function
-        logger.info(f"feats shape: {feat.shape}")
+        logger.debug(f"feats shape: {feat.shape}")
        self._inputs["feats"] = feat
        self._inputs["lengths"] = lengths
-        logger.info("audio extract the feat success")
+        logger.debug("audio extract the feat success")
    def _check(self, audio_file: str, sample_rate: int):
        """Check if the model sample match the audio sample rate 
@ -441,7 +442,7 @@ class VectorExecutor(BaseExecutor):
                logger.error("Please input the right audio file path")
                return False
-        logger.info("checking the aduio file format......")
+        logger.debug("checking the aduio file format......")
        try:
            audio, audio_sample_rate = soundfile.read(
                audio_file, dtype="float32", always_2d=True)
@ -458,7 +459,7 @@ class VectorExecutor(BaseExecutor):
                 ")
            return False
-        logger.info(f"The sample rate is {audio_sample_rate}")
+        logger.debug(f"The sample rate is {audio_sample_rate}")
        if audio_sample_rate != self.sample_rate:
            logger.error("The sample rate of the input file is not {}.\n \
@ -468,6 +469,6 @@ class VectorExecutor(BaseExecutor):
                        ".format(self.sample_rate, self.sample_rate))
            sys.exit(-1)
        else:
-            logger.info("The audio file format is right")
+            logger.debug("The audio file format is right")
        return True
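
Review note: the hunks above demote per-request messages from `info` to `debug`, so they vanish once the logger threshold is INFO. A minimal sketch of that effect, assuming the standard-library `logging` module rather than PaddleSpeech's own logger wrapper; `build_logger` and `run_task` are hypothetical names used only for illustration.

```python
import io
import logging
from typing import Tuple


def build_logger(name: str, level: int) -> Tuple[logging.Logger, io.StringIO]:
    """Create a logger that writes to an in-memory buffer at the given level."""
    buffer = io.StringIO()
    handler = logging.StreamHandler(buffer)
    handler.setFormatter(logging.Formatter("%(levelname)s %(message)s"))
    logger = logging.getLogger(name)
    logger.handlers.clear()
    logger.addHandler(handler)
    logger.setLevel(level)
    logger.propagate = False  # keep the demo output out of the root logger
    return logger, buffer


def run_task(logger: logging.Logger) -> None:
    """Emit one demoted message and one message kept at INFO."""
    logger.debug("task source: demo.wav")
    logger.info("Initialize engine successfully on device: cpu.")


# With an INFO threshold only the summary line is recorded.
info_logger, info_out = build_logger("demo.info", logging.INFO)
run_task(info_logger)

# With a DEBUG threshold the demoted line is recorded as well.
debug_logger, debug_out = build_logger("demo.debug", logging.DEBUG)
run_task(debug_logger)

print(info_out.getvalue(), end="")
print(debug_out.getvalue(), end="")
```

Lowering the threshold to DEBUG restores the detailed trace, so nothing is lost for troubleshooting.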
--- a/paddlespeech/resource/resource.py
+++ b/paddlespeech/resource/resource.py
@ -60,6 +60,7 @@ class CommonTaskResource:
    def set_task_model(self,
                       model_tag: str,
                       model_type: int=0,
+                       skip_download: bool=False,
                       version: Optional[str]=None):
        """Set model tag and version of current task.
@ -83,16 +84,18 @@ class CommonTaskResource:
            self.version = version
            self.res_dict = self.pretrained_models[model_tag][version]
            self._format_path(self.res_dict)
-            self.res_dir = self._fetch(self.res_dict,
-                                       self._get_model_dir(model_type))
+            if not skip_download:
+                self.res_dir = self._fetch(self.res_dict,
+                                           self._get_model_dir(model_type))
        else:
            assert self.task == 'tts', 'Vocoder will only be used in tts task.'
            self.voc_model_tag = model_tag
            self.voc_version = version
            self.voc_res_dict = self.pretrained_models[model_tag][version]
            self._format_path(self.voc_res_dict)
-            self.voc_res_dir = self._fetch(self.voc_res_dict,
-                                           self._get_model_dir(model_type))
+            if not skip_download:
+                self.voc_res_dir = self._fetch(self.voc_res_dict,
+                                               self._get_model_dir(model_type))
    @staticmethod
    def get_model_class(model_name) -> List[object]:
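
Review note: the `resource.py` hunk above guards the fetch with a new `skip_download` flag. A minimal sketch of the resulting control flow, assuming a simplified stand-in class; `ResourceSketch`, `fake_fetch`, the `"1.0"` version key, and the example URL are all hypothetical and not part of PaddleSpeech.

```python
from typing import Callable, Dict, List, Optional


class ResourceSketch:
    """Hypothetical stand-in for a task resource; not the PaddleSpeech class."""

    def __init__(self, pretrained: Dict[str, Dict[str, dict]],
                 fetch: Callable[[dict], str]):
        self.pretrained = pretrained
        self._fetch = fetch
        self.res_dict: Optional[dict] = None
        self.res_dir: Optional[str] = None

    def set_task_model(self, model_tag: str, skip_download: bool=False,
                       version: str="1.0") -> None:
        # The metadata lookup always happens.
        self.res_dict = self.pretrained[model_tag][version]
        # Only the fetch (download and unpack) is optional.
        if not skip_download:
            self.res_dir = self._fetch(self.res_dict)


calls: List[str] = []


def fake_fetch(res_dict: dict) -> str:
    """Record the request instead of downloading anything."""
    calls.append(res_dict["url"])
    return "/tmp/models/demo"


models = {"demo-tag": {"1.0": {"url": "https://example.invalid/demo.tar.gz"}}}
resource = ResourceSketch(models, fake_fetch)

resource.set_task_model("demo-tag", skip_download=True)
skipped_dir = resource.res_dir  # still None: nothing was fetched

resource.set_task_model("demo-tag")
fetched_dir = resource.res_dir  # set by the fetch callback
```

In this sketch `res_dir` stays unset when the download is skipped, so a caller that passes `skip_download=True` should read only the metadata.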
--- a/paddlespeech/s2t/exps/deepspeech2/bin/export.py
+++ b/paddlespeech/s2t/exps/deepspeech2/bin/export.py
@ -35,12 +35,6 @@ if __name__ == "__main__":
    # save jit model to
    parser.add_argument(
        "--export_path", type=str, help="path of the jit model to save")
-    parser.add_argument(
-        '--nxpu',
-        type=int,
-        default=0,
-        choices=[0, 1],
-        help="if nxpu == 0 and ngpu == 0, use cpu.")
    args = parser.parse_args()
    print_arguments(args)
--- a/paddlespeech/s2t/exps/deepspeech2/bin/test.py
+++ b/paddlespeech/s2t/exps/deepspeech2/bin/test.py
@ -35,12 +35,6 @@ if __name__ == "__main__":
    # save asr result to
    parser.add_argument(
        "--result_file", type=str, help="path of save the asr result")
-    parser.add_argument(
-        '--nxpu',
-        type=int,
-        default=0,
-        choices=[0, 1],
-        help="if nxpu == 0 and ngpu == 0, use cpu.")
    args = parser.parse_args()
    print_arguments(args, globals())
--- a/paddlespeech/s2t/exps/deepspeech2/bin/test_export.py
+++ b/paddlespeech/s2t/exps/deepspeech2/bin/test_export.py
@ -38,12 +38,6 @@ if __name__ == "__main__":
    #load jit model from
    parser.add_argument(
        "--export_path", type=str, help="path of the jit model to save")
-    parser.add_argument(
-        '--nxpu',
-        type=int,
-        default=0,
-        choices=[0, 1],
-        help="if nxpu == 0 and ngpu == 0, use cpu.")
    parser.add_argument(
        "--enable-auto-log", action="store_true", help="use auto log")
    args = parser.parse_args()
--- a/paddlespeech/s2t/exps/deepspeech2/bin/train.py
+++ b/paddlespeech/s2t/exps/deepspeech2/bin/train.py
@ -31,12 +31,6 @@ def main(config, args):
 if __name__ == "__main__":
    parser = default_argument_parser()
-    parser.add_argument(
-        '--nxpu',
-        type=int,
-        default=0,
-        choices=[0, 1],
-        help="if nxpu == 0 and ngpu == 0, use cpu.")
    args = parser.parse_args()
    print_arguments(args, globals())
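
Review note: the four DeepSpeech2 scripts above each shrink from 12 lines to 6 around their `--nxpu` block, and the `paddlespeech/s2t/training/cli.py` hunk registers the same option once in `default_argument_parser`. A minimal sketch of that consolidation using the standard-library `argparse`; the `--ngpu` flag name is inferred from the help text, and the group title and `pick_device` helper are hypothetical.

```python
import argparse


def default_argument_parser(parser=None):
    """Shared parser: every script that calls this gets --ngpu and --nxpu."""
    if parser is None:
        parser = argparse.ArgumentParser()
    train_group = parser.add_argument_group("train")  # group title is an assumption
    train_group.add_argument(
        "--ngpu",
        type=int,
        default=1,
        help="number of parallel processes. 0 for cpu.")
    train_group.add_argument(
        "--nxpu",
        type=int,
        default=0,
        choices=[0, 1],
        help="if nxpu == 0 and ngpu == 0, use cpu.")
    return parser


def pick_device(args) -> str:
    """Hypothetical helper mirroring the help text: cpu only if both are 0."""
    if args.nxpu > 0:
        return "xpu"
    if args.ngpu > 0:
        return "gpu"
    return "cpu"


# A script-specific option can still be added on top of the shared parser.
parser = default_argument_parser()
parser.add_argument("--export_path", type=str, help="path of the jit model to save")

cpu_args = parser.parse_args(["--ngpu", "0"])
xpu_args = parser.parse_args(["--ngpu", "0", "--nxpu", "1"])
print(pick_device(cpu_args), pick_device(xpu_args))
```

Defining the option in one place keeps its default and `choices` from drifting between scripts.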
--- a/paddlespeech/s2t/frontend/augmentor/spec_augment.py
+++ b/paddlespeech/s2t/frontend/augmentor/spec_augment.py
@ -16,7 +16,6 @@ import random
 import numpy as np
 from PIL import Image
-from PIL.Image import BICUBIC
 from paddlespeech.s2t.frontend.augmentor.base import AugmentorBase
 from paddlespeech.s2t.utils.log import Log
@ -164,9 +163,9 @@ class SpecAugmentor(AugmentorBase):
                                      window) + 1  # 1 ... t - 1
            left = Image.fromarray(x[:center]).resize((x.shape[1], warped),
-                                                      BICUBIC)
+                                                      Image.BICUBIC)
            right = Image.fromarray(x[center:]).resize((x.shape[1], t - warped),
-                                                       BICUBIC)
+                                                       Image.BICUBIC)
            if self.inplace:
                x[:warped] = left
                x[warped:] = right
--- a/paddlespeech/s2t/frontend/featurizer/text_featurizer.py
+++ b/paddlespeech/s2t/frontend/featurizer/text_featurizer.py
@ -226,10 +226,10 @@ class TextFeaturizer():
        sos_id = vocab_list.index(SOS) if SOS in vocab_list else -1
        space_id = vocab_list.index(SPACE) if SPACE in vocab_list else -1
-        logger.info(f"BLANK id: {blank_id}")
+        logger.debug(f"BLANK id: {blank_id}")
-        logger.info(f"UNK id: {unk_id}")
+        logger.debug(f"UNK id: {unk_id}")
-        logger.info(f"EOS id: {eos_id}")
+        logger.debug(f"EOS id: {eos_id}")
-        logger.info(f"SOS id: {sos_id}")
+        logger.debug(f"SOS id: {sos_id}")
-        logger.info(f"SPACE id: {space_id}")
+        logger.debug(f"SPACE id: {space_id}")
-        logger.info(f"MASKCTC id: {maskctc_id}")
+        logger.debug(f"MASKCTC id: {maskctc_id}")
        return token2id, id2token, vocab_list, unk_id, eos_id, blank_id
--- a/paddlespeech/s2t/models/u2/u2.py
+++ b/paddlespeech/s2t/models/u2/u2.py
@ -827,7 +827,7 @@ class U2Model(U2DecodeModel):
        # encoder
        encoder_type = configs.get('encoder', 'transformer')
-        logger.info(f"U2 Encoder type: {encoder_type}")
+        logger.debug(f"U2 Encoder type: {encoder_type}")
        if encoder_type == 'transformer':
            encoder = TransformerEncoder(
                input_dim, global_cmvn=global_cmvn, **configs['encoder_conf'])
@ -894,7 +894,7 @@ class U2Model(U2DecodeModel):
        if checkpoint_path:
            infos = checkpoint.Checkpoint().load_parameters(
                model, checkpoint_path=checkpoint_path)
-            logger.info(f"checkpoint info: {infos}")
+            logger.debug(f"checkpoint info: {infos}")
        layer_tools.summary(model)
        return model
--- a/paddlespeech/s2t/modules/loss.py
+++ b/paddlespeech/s2t/modules/loss.py
@ -37,9 +37,9 @@ class CTCLoss(nn.Layer):
        self.loss = nn.CTCLoss(blank=blank, reduction=reduction)
        self.batch_average = batch_average
-        logger.info(
+        logger.debug(
            f"CTCLoss Loss reduction: {reduction}, div-bs: {batch_average}")
-        logger.info(f"CTCLoss Grad Norm Type: {grad_norm_type}")
+        logger.debug(f"CTCLoss Grad Norm Type: {grad_norm_type}")
        assert grad_norm_type in ('instance', 'batch', 'frame', None)
        self.norm_by_times = False
@ -70,7 +70,8 @@ class CTCLoss(nn.Layer):
            param = {}
        self._kwargs = {k: v for k, v in kwargs.items() if k in param}
        _notin = {k: v for k, v in kwargs.items() if k not in param}
-        logger.info(f"{self.loss} kwargs:{self._kwargs}, not support: {_notin}")
+        logger.debug(
+            f"{self.loss} kwargs:{self._kwargs}, not support: {_notin}")
    def forward(self, logits, ys_pad, hlens, ys_lens):
        """Compute CTC loss.
--- a/paddlespeech/s2t/training/cli.py
+++ b/paddlespeech/s2t/training/cli.py
@ -82,6 +82,12 @@ def default_argument_parser(parser=None):
        type=int,
        default=1,
        help="number of parallel processes. 0 for cpu.")
+    train_group.add_argument(
+        '--nxpu',
+        type=int,
+        default=0,
+        choices=[0, 1],
+        help="if nxpu == 0 and ngpu == 0, use cpu.")
    train_group.add_argument(
        "--config", metavar="CONFIG_FILE", help="config file.")
    train_group.add_argument(
--- a/paddlespeech/s2t/utils/tensor_utils.py
+++ b/paddlespeech/s2t/utils/tensor_utils.py
@ -94,7 +94,7 @@ def pad_sequence(sequences: List[paddle.Tensor],
    for i, tensor in enumerate(sequences):
        length = tensor.shape[0]
        # use index notation to prevent duplicate references to the tensor
-        logger.info(
+        logger.debug(
            f"length {length}, out_tensor {out_tensor.shape}, tensor {tensor.shape}"
        )
        if batch_first:
--- a/paddlespeech/server/bin/paddlespeech_client.py
+++ b/paddlespeech/server/bin/paddlespeech_client.py
@ -123,7 +123,6 @@ class TTSClientExecutor(BaseExecutor):
            time_end = time.time()
            time_consume = time_end - time_start
            response_dict = res.json()
            logger.info(response_dict["message"])
            logger.info("Save synthesized audio successfully on %s." % (output))
            logger.info("Audio duration: %f s." %
                        (response_dict['result']['duration']))
@ -702,7 +701,6 @@ class VectorClientExecutor(BaseExecutor):
                test_audio=args.test,
                task=task)
            time_end = time.time()
            logger.info(f"The vector: {res}")
            logger.info("Response time %f s." % (time_end - time_start))
            return True
        except Exception as e:
--- a/paddlespeech/server/engine/acs/python/acs_engine.py
+++ b/paddlespeech/server/engine/acs/python/acs_engine.py
@ -30,7 +30,7 @@ class ACSEngine(BaseEngine):
        """The ACSEngine Engine
        """
        super(ACSEngine, self).__init__()
-        logger.info("Create the ACSEngine Instance")
+        logger.debug("Create the ACSEngine Instance")
        self.word_list = []
    def init(self, config: dict):
@ -42,7 +42,7 @@ class ACSEngine(BaseEngine):
        Returns:
            bool: The engine instance flag
        """
-        logger.info("Init the acs engine")
+        logger.debug("Init the acs engine")
        try:
            self.config = config
            self.device = self.config.get("device", paddle.get_device())
@ -50,7 +50,7 @@ class ACSEngine(BaseEngine):
            # websocket default ping timeout is 20 seconds
            self.ping_timeout = self.config.get("ping_timeout", 20)
            paddle.set_device(self.device)
-            logger.info(f"ACS Engine set the device: {self.device}")
+            logger.debug(f"ACS Engine set the device: {self.device}")
        except BaseException as e:
            logger.error(
@ -66,7 +66,9 @@ class ACSEngine(BaseEngine):
        self.url = "ws://" + self.config.asr_server_ip + ":" + str(
            self.config.asr_server_port) + "/paddlespeech/asr/streaming"
-        logger.info("Init the acs engine successfully")
+        logger.info("Initialize acs server engine successfully on device: %s." %
+                    (self.device))
        return True
    def read_search_words(self):
@ -95,12 +97,12 @@ class ACSEngine(BaseEngine):
        Returns:
            _type_: _description_
        """
-        logger.info("send a message to the server")
+        logger.debug("send a message to the server")
        if self.url is None:
            logger.error("No asr server, please input valid ip and port")
            return ""
        ws = websocket.WebSocket()
-        logger.info(f"set the ping timeout: {self.ping_timeout} seconds")
+        logger.debug(f"set the ping timeout: {self.ping_timeout} seconds")
        ws.connect(self.url, ping_timeout=self.ping_timeout)
        audio_info = json.dumps(
            {
@ -123,7 +125,7 @@ class ACSEngine(BaseEngine):
            logger.info(f"audio result: {msg}")
        # 3. send chunk audio data to engine
-        logger.info("send the end signal")
+        logger.debug("send the end signal")
        audio_info = json.dumps(
            {
                "name": "test.wav",
@ -197,7 +199,7 @@ class ACSEngine(BaseEngine):
                start = max(time_stamp[m.start(0)]['bg'] - offset, 0)
                end = min(time_stamp[m.end(0) - 1]['ed'] + offset, max_ed)
-                logger.info(f'start: {start}, end: {end}')
+                logger.debug(f'start: {start}, end: {end}')
                acs_result.append({'w': w, 'bg': start, 'ed': end})
        return acs_result, asr_result
@ -212,7 +214,7 @@ class ACSEngine(BaseEngine):
        Returns:
            acs_result, asr_result: the acs result and the asr result
        """
-        logger.info("start to process the audio content search")
+        logger.debug("start to process the audio content search")
        msg = self.get_asr_content(io.BytesIO(audio_data))
        acs_result, asr_result = self.get_macthed_word(msg)
--- a/paddlespeech/server/engine/asr/online/onnx/asr_engine.py
+++ b/paddlespeech/server/engine/asr/online/onnx/asr_engine.py
@ -44,7 +44,7 @@ class PaddleASRConnectionHanddler:
            asr_engine (ASREngine): the global asr engine
        """
        super().__init__()
-        logger.info(
+        logger.debug(
            "create an paddle asr connection handler to process the websocket connection"
        )
        self.config = asr_engine.config  # server config
@ -152,12 +152,12 @@ class PaddleASRConnectionHanddler:
        self.output_reset()
    def extract_feat(self, samples: ByteString):
-        logger.info("Online ASR extract the feat")
+        logger.debug("Online ASR extract the feat")
        samples = np.frombuffer(samples, dtype=np.int16)
        assert samples.ndim == 1
        self.num_samples += samples.shape[0]
-        logger.info(
+        logger.debug(
            f"This package receive {samples.shape[0]} pcm data. Global samples:{self.num_samples}"
        )
@ -168,7 +168,7 @@ class PaddleASRConnectionHanddler:
        else:
            assert self.remained_wav.ndim == 1  # (T,)
            self.remained_wav = np.concatenate([self.remained_wav, samples])
-        logger.info(
+        logger.debug(
            f"The concatenation of remain and now audio samples length is: {self.remained_wav.shape}"
        )
@ -202,14 +202,14 @@ class PaddleASRConnectionHanddler:
        # update remained wav
        self.remained_wav = self.remained_wav[self.n_shift * num_frames:]
-        logger.info(
+        logger.debug(
            f"process the audio feature success, the cached feat shape: {self.cached_feat.shape}"
        )
-        logger.info(
+        logger.debug(
            f"After extract feat, the cached remain the audio samples: {self.remained_wav.shape}"
        )
-        logger.info(f"global samples: {self.num_samples}")
+        logger.debug(f"global samples: {self.num_samples}")
-        logger.info(f"global frames: {self.num_frames}")
+        logger.debug(f"global frames: {self.num_frames}")
    def decode(self, is_finished=False):
        """advance decoding
@ -237,7 +237,7 @@ class PaddleASRConnectionHanddler:
                return
            num_frames = self.cached_feat.shape[1]
-            logger.info(
+            logger.debug(
                f"Required decoding window {decoding_window} frames, and the connection has {num_frames} frames"
            )
@ -355,7 +355,7 @@ class ASRServerExecutor(ASRExecutor):
            lm_url = self.task_resource.res_dict['lm_url']
            lm_md5 = self.task_resource.res_dict['lm_md5']
-            logger.info(f"Start to load language model {lm_url}")
+            logger.debug(f"Start to load language model {lm_url}")
            self.download_lm(
                lm_url,
                os.path.dirname(self.config.decode.lang_model_path), lm_md5)
@ -367,7 +367,7 @@ class ASRServerExecutor(ASRExecutor):
        if "deepspeech2" in self.model_type:
            # AM predictor
-            logger.info("ASR engine start to init the am predictor")
+            logger.debug("ASR engine start to init the am predictor")
            self.am_predictor = onnx_infer.get_sess(
                model_path=self.am_model, sess_conf=self.am_predictor_conf)
        else:
@ -400,7 +400,7 @@ class ASRServerExecutor(ASRExecutor):
        self.num_decoding_left_chunks = num_decoding_left_chunks
        # conf for paddleinference predictor or onnx
        self.am_predictor_conf = am_predictor_conf
-        logger.info(f"model_type: {self.model_type}")
+        logger.debug(f"model_type: {self.model_type}")
        sample_rate_str = '16k' if sample_rate == 16000 else '8k'
        tag = model_type + '-' + lang + '-' + sample_rate_str
@ -422,12 +422,11 @@ class ASRServerExecutor(ASRExecutor):
        #     self.res_path, self.task_resource.res_dict[
        #         'params']) if am_params is None else os.path.abspath(am_params)
-        logger.info("Load the pretrained model:")
+        logger.debug("Load the pretrained model:")
-        logger.info(f"  tag = {tag}")
+        logger.debug(f"  tag = {tag}")
-        logger.info(f"  res_path: {self.res_path}")
+        logger.debug(f"  res_path: {self.res_path}")
-        logger.info(f"  cfg path: {self.cfg_path}")
+        logger.debug(f"  cfg path: {self.cfg_path}")
-        logger.info(f"  am_model path: {self.am_model}")
+        logger.debug(f"  am_model path: {self.am_model}")
        # logger.info(f"  am_params path: {self.am_params}")
        #Init body.
        self.config = CfgNode(new_allowed=True)
@ -436,7 +435,7 @@ class ASRServerExecutor(ASRExecutor):
        if self.config.spm_model_prefix:
            self.config.spm_model_prefix = os.path.join(
                self.res_path, self.config.spm_model_prefix)
-            logger.info(f"spm model path: {self.config.spm_model_prefix}")
+            logger.debug(f"spm model path: {self.config.spm_model_prefix}")
        self.vocab = self.config.vocab_filepath
@ -450,7 +449,7 @@ class ASRServerExecutor(ASRExecutor):
        # AM predictor
        self.init_model()
-        logger.info(f"create the {model_type} model success")
+        logger.debug(f"create the {model_type} model success")
        return True
@ -501,7 +500,7 @@ class ASREngine(BaseEngine):
                "If all GPU or XPU is used, you can set the server to 'cpu'")
            sys.exit(-1)
-        logger.info(f"paddlespeech_server set the device: {self.device}")
+        logger.debug(f"paddlespeech_server set the device: {self.device}")
        if not self.init_model():
            logger.error(
@ -509,7 +508,8 @@ class ASREngine(BaseEngine):
            )
            return False
-        logger.info("Initialize ASR server engine successfully.")
+        logger.info("Initialize ASR server engine successfully on device: %s." %
+                    (self.device))
        return True
    def new_handler(self):
--- a/paddlespeech/server/engine/asr/online/paddleinference/asr_engine.py
+++ b/paddlespeech/server/engine/asr/online/paddleinference/asr_engine.py
@ -44,7 +44,7 @@ class PaddleASRConnectionHanddler:
            asr_engine (ASREngine): the global asr engine
        """
        super().__init__()
-        logger.info(
+        logger.debug(
            "create an paddle asr connection handler to process the websocket connection"
        )
        self.config = asr_engine.config  # server config
@ -157,7 +157,7 @@ class PaddleASRConnectionHanddler:
        assert samples.ndim == 1
        self.num_samples += samples.shape[0]
-        logger.info(
+        logger.debug(
            f"This package receive {samples.shape[0]} pcm data. Global samples:{self.num_samples}"
        )
@ -168,7 +168,7 @@ class PaddleASRConnectionHanddler:
        else:
            assert self.remained_wav.ndim == 1  # (T,)
            self.remained_wav = np.concatenate([self.remained_wav, samples])
-        logger.info(
+        logger.debug(
            f"The concatenation of remain and now audio samples length is: {self.remained_wav.shape}"
        )
@ -202,14 +202,14 @@ class PaddleASRConnectionHanddler:
        # update remained wav
        self.remained_wav = self.remained_wav[self.n_shift * num_frames:]
-        logger.info(
+        logger.debug(
            f"process the audio feature success, the cached feat shape: {self.cached_feat.shape}"
        )
-        logger.info(
+        logger.debug(
            f"After extract feat, the cached remain the audio samples: {self.remained_wav.shape}"
        )
-        logger.info(f"global samples: {self.num_samples}")
+        logger.debug(f"global samples: {self.num_samples}")
-        logger.info(f"global frames: {self.num_frames}")
+        logger.debug(f"global frames: {self.num_frames}")
    def decode(self, is_finished=False):
        """advance decoding
@ -237,13 +237,13 @@ class PaddleASRConnectionHanddler:
                return
            num_frames = self.cached_feat.shape[1]
-            logger.info(
+            logger.debug(
                f"Required decoding window {decoding_window} frames, and the connection has {num_frames} frames"
            )
            # the cached feat must be larger decoding_window
            if num_frames < decoding_window and not is_finished:
-                logger.info(
+                logger.debug(
                    f"frame feat num is less than {decoding_window}, please input more pcm data"
                )
                return None, None
@ -294,7 +294,7 @@ class PaddleASRConnectionHanddler:
        Returns:
            logprob: poster probability.
        """
-        logger.info("start to decoce one chunk for deepspeech2")
+        logger.debug("start to decoce one chunk for deepspeech2")
        input_names = self.am_predictor.get_input_names()
        audio_handle = self.am_predictor.get_input_handle(input_names[0])
        audio_len_handle = self.am_predictor.get_input_handle(input_names[1])
@ -369,7 +369,7 @@ class ASRServerExecutor(ASRExecutor):
            lm_url = self.task_resource.res_dict['lm_url']
            lm_md5 = self.task_resource.res_dict['lm_md5']
-            logger.info(f"Start to load language model {lm_url}")
+            logger.debug(f"Start to load language model {lm_url}")
            self.download_lm(
                lm_url,
                os.path.dirname(self.config.decode.lang_model_path), lm_md5)
@ -381,7 +381,7 @@ class ASRServerExecutor(ASRExecutor):
        if "deepspeech2" in self.model_type:
            # AM predictor
-            logger.info("ASR engine start to init the am predictor")
+            logger.debug("ASR engine start to init the am predictor")
            self.am_predictor = init_predictor(
                model_file=self.am_model,
                params_file=self.am_params,
@ -415,7 +415,7 @@ class ASRServerExecutor(ASRExecutor):
        self.num_decoding_left_chunks = num_decoding_left_chunks
        # conf for paddleinference predictor or onnx
        self.am_predictor_conf = am_predictor_conf
-        logger.info(f"model_type: {self.model_type}")
+        logger.debug(f"model_type: {self.model_type}")
        sample_rate_str = '16k' if sample_rate == 16000 else '8k'
        tag = model_type + '-' + lang + '-' + sample_rate_str
@ -437,12 +437,12 @@ class ASRServerExecutor(ASRExecutor):
            self.res_path = os.path.dirname(
                os.path.dirname(os.path.abspath(self.cfg_path)))
-        logger.info("Load the pretrained model:")
+        logger.debug("Load the pretrained model:")
-        logger.info(f"  tag = {tag}")
+        logger.debug(f"  tag = {tag}")
-        logger.info(f"  res_path: {self.res_path}")
+        logger.debug(f"  res_path: {self.res_path}")
-        logger.info(f"  cfg path: {self.cfg_path}")
+        logger.debug(f"  cfg path: {self.cfg_path}")
-        logger.info(f"  am_model path: {self.am_model}")
+        logger.debug(f"  am_model path: {self.am_model}")
-        logger.info(f"  am_params path: {self.am_params}")
+        logger.debug(f"  am_params path: {self.am_params}")
        #Init body.
        self.config = CfgNode(new_allowed=True)
@ -451,7 +451,7 @@ class ASRServerExecutor(ASRExecutor):
        if self.config.spm_model_prefix:
            self.config.spm_model_prefix = os.path.join(
                self.res_path, self.config.spm_model_prefix)
-            logger.info(f"spm model path: {self.config.spm_model_prefix}")
+            logger.debug(f"spm model path: {self.config.spm_model_prefix}")
        self.vocab = self.config.vocab_filepath
@ -465,7 +465,7 @@ class ASRServerExecutor(ASRExecutor):
        # AM predictor
        self.init_model()
-        logger.info(f"create the {model_type} model success")
+        logger.debug(f"create the {model_type} model success")
        return True
@ -516,7 +516,7 @@ class ASREngine(BaseEngine):
                "If all GPU or XPU is used, you can set the server to 'cpu'")
            sys.exit(-1)
-        logger.info(f"paddlespeech_server set the device: {self.device}")
+        logger.debug(f"paddlespeech_server set the device: {self.device}")
        if not self.init_model():
            logger.error(
@ -524,7 +524,9 @@ class ASREngine(BaseEngine):
            )
            return False
-        logger.info("Initialize ASR server engine successfully.")
+        logger.info("Initialize ASR server engine successfully on device: %s." %
+                    (self.device))
        return True
    def new_handler(self):
--- a/paddlespeech/server/engine/asr/online/python/asr_engine.py
+++ b/paddlespeech/server/engine/asr/online/python/asr_engine.py
@ -49,7 +49,7 @@ class PaddleASRConnectionHanddler:
            asr_engine (ASREngine): the global asr engine
        """
        super().__init__()
-        logger.info(
+        logger.debug(
            "create an paddle asr connection handler to process the websocket connection"
        )
        self.config = asr_engine.config  # server config
@ -107,7 +107,7 @@ class PaddleASRConnectionHanddler:
            # acoustic model
            self.model = self.asr_engine.executor.model
            self.continuous_decoding = self.config.continuous_decoding
-            logger.info(f"continue decoding: {self.continuous_decoding}")
+            logger.debug(f"continue decoding: {self.continuous_decoding}")
            # ctc decoding config
            self.ctc_decode_config = self.asr_engine.executor.config.decode
@ -207,7 +207,7 @@ class PaddleASRConnectionHanddler:
        assert samples.ndim == 1
        self.num_samples += samples.shape[0]
-        logger.info(
+        logger.debug(
            f"This package receive {samples.shape[0]} pcm data. Global samples:{self.num_samples}"
        )
@ -218,7 +218,7 @@ class PaddleASRConnectionHanddler:
        else:
            assert self.remained_wav.ndim == 1  # (T,)
            self.remained_wav = np.concatenate([self.remained_wav, samples])
-        logger.info(
+        logger.debug(
            f"The concatenation of remain and now audio samples length is: {self.remained_wav.shape}"
        )
@ -252,14 +252,14 @@ class PaddleASRConnectionHanddler:
        # update remained wav
        self.remained_wav = self.remained_wav[self.n_shift * num_frames:]
-        logger.info(
+        logger.debug(
            f"process the audio feature success, the cached feat shape: {self.cached_feat.shape}"
        )
-        logger.info(
+        logger.debug(
            f"After extract feat, the cached remain the audio samples: {self.remained_wav.shape}"
        )
-        logger.info(f"global samples: {self.num_samples}")
+        logger.debug(f"global samples: {self.num_samples}")
-        logger.info(f"global frames: {self.num_frames}")
+        logger.debug(f"global frames: {self.num_frames}")
    def decode(self, is_finished=False):
        """advance decoding
@ -283,24 +283,24 @@ class PaddleASRConnectionHanddler:
            stride = subsampling * decoding_chunk_size
            if self.cached_feat is None:
-                logger.info("no audio feat, please input more pcm data")
+                logger.debug("no audio feat, please input more pcm data")
                return
            num_frames = self.cached_feat.shape[1]
-            logger.info(
+            logger.debug(
                f"Required decoding window {decoding_window} frames, and the connection has {num_frames} frames"
            )
            # the cached feat must be larger decoding_window
            if num_frames < decoding_window and not is_finished:
-                logger.info(
+                logger.debug(
                    f"frame feat num is less than {decoding_window}, please input more pcm data"
                )
                return None, None
            # if is_finished=True, we need at least context frames
            if num_frames < context:
-                logger.info(
+                logger.debug(
                    "flast {num_frames} is less than context {context} frames, and we cannot do model forward"
                )
                return None, None
@ -354,7 +354,7 @@ class PaddleASRConnectionHanddler:
        Returns:
            logprob: poster probability.
        """
-        logger.info("start to decoce one chunk for deepspeech2")
+        logger.debug("start to decoce one chunk for deepspeech2")
        input_names = self.am_predictor.get_input_names()
        audio_handle = self.am_predictor.get_input_handle(input_names[0])
        audio_len_handle = self.am_predictor.get_input_handle(input_names[1])
@ -391,7 +391,7 @@ class PaddleASRConnectionHanddler:
        self.decoder.next(output_chunk_probs, output_chunk_lens)
        trans_best, trans_beam = self.decoder.decode()
-        logger.info(f"decode one best result for deepspeech2: {trans_best[0]}")
+        logger.debug(f"decode one best result for deepspeech2: {trans_best[0]}")
        return trans_best[0]
    @paddle.no_grad()
@ -402,7 +402,7 @@ class PaddleASRConnectionHanddler:
        # reset endpiont state
        self.endpoint_state = False
-        logger.info(
+        logger.debug(
            "Conformer/Transformer: start to decode with advanced_decoding method"
        )
        cfg = self.ctc_decode_config
@ -427,25 +427,25 @@ class PaddleASRConnectionHanddler:
        stride = subsampling * decoding_chunk_size
        if self.cached_feat is None:
-            logger.info("no audio feat, please input more pcm data")
+            logger.debug("no audio feat, please input more pcm data")
            return
        # (B=1,T,D)
        num_frames = self.cached_feat.shape[1]
-        logger.info(
+        logger.debug(
            f"Required decoding window {decoding_window} frames, and the connection has {num_frames} frames"
        )
        # the cached feat must be larger decoding_window
        if num_frames < decoding_window and not is_finished:
-            logger.info(
+            logger.debug(
                f"frame feat num is less than {decoding_window}, please input more pcm data"
            )
            return None, None
        # if is_finished=True, we need at least context frames
        if num_frames < context:
-            logger.info(
+            logger.debug(
                "flast {num_frames} is less than context {context} frames, and we cannot do model forward"
            )
            return None, None
@ -489,7 +489,7 @@ class PaddleASRConnectionHanddler:
            self.encoder_out = ys
        else:
            self.encoder_out = paddle.concat([self.encoder_out, ys], axis=1)
-        logger.info(
+        logger.debug(
            f"This connection handler encoder out shape: {self.encoder_out.shape}"
        )
@ -513,7 +513,8 @@ class PaddleASRConnectionHanddler:
            if self.endpointer.endpoint_detected(ctc_probs.numpy(),
                                                 decoding_something):
                self.endpoint_state = True
-                logger.info(f"Endpoint is detected at {self.num_frames} frame.")
+                logger.debug(
+                    f"Endpoint is detected at {self.num_frames} frame.")
        # advance cache of feat
        assert self.cached_feat.shape[0] == 1  #(B=1,T,D)
@ -526,7 +527,7 @@ class PaddleASRConnectionHanddler:
    def update_result(self):
        """Conformer/Transformer hyps to result.
        """
-        logger.info("update the final result")
+        logger.debug("update the final result")
        hyps = self.hyps
        # output results and tokenids
@ -560,16 +561,16 @@ class PaddleASRConnectionHanddler:
        only for conformer and transformer model.
        """
        if "deepspeech2" in self.model_type:
-            logger.info("deepspeech2 not support rescoring decoding.")
+            logger.debug("deepspeech2 not support rescoring decoding.")
            return
        if "attention_rescoring" != self.ctc_decode_config.decoding_method:
-            logger.info(
+            logger.debug(
                f"decoding method not match: {self.ctc_decode_config.decoding_method}, need attention_rescoring"
            )
            return
-        logger.info("rescoring the final result")
+        logger.debug("rescoring the final result")
        # last decoding for last audio
        self.searcher.finalize_search()
@ -685,7 +686,6 @@ class PaddleASRConnectionHanddler:
                "bg": global_offset_in_sec + start,
                "ed": global_offset_in_sec + end
            })
            # logger.info(f"{word_time_stamp[-1]}")
        self.word_time_stamp = word_time_stamp
        logger.info(f"word time stamp: {self.word_time_stamp}")
@ -707,13 +707,13 @@ class ASRServerExecutor(ASRExecutor):
            lm_url = self.task_resource.res_dict['lm_url']
            lm_md5 = self.task_resource.res_dict['lm_md5']
-            logger.info(f"Start to load language model {lm_url}")
+            logger.debug(f"Start to load language model {lm_url}")
            self.download_lm(
                lm_url,
                os.path.dirname(self.config.decode.lang_model_path), lm_md5)
        elif "conformer" in self.model_type or "transformer" in self.model_type:
            with UpdateConfig(self.config):
-                logger.info("start to create the stream conformer asr engine")
+                logger.debug("start to create the stream conformer asr engine")
                # update the decoding method
                if self.decode_method:
                    self.config.decode.decoding_method = self.decode_method
@ -726,7 +726,7 @@ class ASRServerExecutor(ASRExecutor):
                if self.config.decode.decoding_method not in [
                        "ctc_prefix_beam_search", "attention_rescoring"
                ]:
-                    logger.info(
+                    logger.debug(
                        "we set the decoding_method to attention_rescoring")
                    self.config.decode.decoding_method = "attention_rescoring"
@ -739,7 +739,7 @@ class ASRServerExecutor(ASRExecutor):
    def init_model(self) -> None:
        if "deepspeech2" in self.model_type:
            # AM predictor
-            logger.info("ASR engine start to init the am predictor")
+            logger.debug("ASR engine start to init the am predictor")
            self.am_predictor = init_predictor(
                model_file=self.am_model,
                params_file=self.am_params,
@ -748,7 +748,7 @@ class ASRServerExecutor(ASRExecutor):
            # load model
            # model_type: {model_name}_{dataset}
            model_name = self.model_type[:self.model_type.rindex('_')]
-            logger.info(f"model name: {model_name}")
+            logger.debug(f"model name: {model_name}")
            model_class = self.task_resource.get_model_class(model_name)
            model = model_class.from_config(self.config)
            self.model = model
@ -782,7 +782,7 @@ class ASRServerExecutor(ASRExecutor):
        self.num_decoding_left_chunks = num_decoding_left_chunks
        # conf for paddleinference predictor or onnx
        self.am_predictor_conf = am_predictor_conf
-        logger.info(f"model_type: {self.model_type}")
+        logger.debug(f"model_type: {self.model_type}")
        sample_rate_str = '16k' if sample_rate == 16000 else '8k'
        tag = model_type + '-' + lang + '-' + sample_rate_str
@ -804,12 +804,12 @@ class ASRServerExecutor(ASRExecutor):
            self.res_path = os.path.dirname(
                os.path.dirname(os.path.abspath(self.cfg_path)))
-        logger.info("Load the pretrained model:")
+        logger.debug("Load the pretrained model:")
-        logger.info(f"  tag = {tag}")
+        logger.debug(f"  tag = {tag}")
-        logger.info(f"  res_path: {self.res_path}")
+        logger.debug(f"  res_path: {self.res_path}")
-        logger.info(f"  cfg path: {self.cfg_path}")
+        logger.debug(f"  cfg path: {self.cfg_path}")
-        logger.info(f"  am_model path: {self.am_model}")
+        logger.debug(f"  am_model path: {self.am_model}")
-        logger.info(f"  am_params path: {self.am_params}")
+        logger.debug(f"  am_params path: {self.am_params}")
        #Init body.
        self.config = CfgNode(new_allowed=True)
@ -818,7 +818,7 @@ class ASRServerExecutor(ASRExecutor):
        if self.config.spm_model_prefix:
            self.config.spm_model_prefix = os.path.join(
                self.res_path, self.config.spm_model_prefix)
-            logger.info(f"spm model path: {self.config.spm_model_prefix}")
+            logger.debug(f"spm model path: {self.config.spm_model_prefix}")
        self.vocab = self.config.vocab_filepath
@ -832,7 +832,7 @@ class ASRServerExecutor(ASRExecutor):
        # AM predictor
        self.init_model()
-        logger.info(f"create the {model_type} model success")
+        logger.debug(f"create the {model_type} model success")
        return True
@ -883,7 +883,7 @@ class ASREngine(BaseEngine):
                "If all GPU or XPU is used, you can set the server to 'cpu'")
            sys.exit(-1)
-        logger.info(f"paddlespeech_server set the device: {self.device}")
+        logger.debug(f"paddlespeech_server set the device: {self.device}")
        if not self.init_model():
            logger.error(
@ -891,7 +891,9 @@ class ASREngine(BaseEngine):
            )
            return False
-        logger.info("Initialize ASR server engine successfully.")
+        logger.info("Initialize ASR server engine successfully on device: %s." %
+                    (self.device))
        return True
    def new_handler(self):
--- a/paddlespeech/server/engine/asr/paddleinference/asr_engine.py
+++ b/paddlespeech/server/engine/asr/paddleinference/asr_engine.py
@ -65,10 +65,10 @@ class ASRServerExecutor(ASRExecutor):
                                         self.task_resource.res_dict['model'])
            self.am_params = os.path.join(self.res_path,
                                          self.task_resource.res_dict['params'])
-            logger.info(self.res_path)
+            logger.debug(self.res_path)
-            logger.info(self.cfg_path)
+            logger.debug(self.cfg_path)
-            logger.info(self.am_model)
+            logger.debug(self.am_model)
-            logger.info(self.am_params)
+            logger.debug(self.am_params)
        else:
            self.cfg_path = os.path.abspath(cfg_path)
            self.am_model = os.path.abspath(am_model)
@ -236,16 +236,16 @@ class PaddleASRConnectionHandler(ASRServerExecutor):
        if self._check(
                io.BytesIO(audio_data), self.asr_engine.config.sample_rate,
                self.asr_engine.config.force_yes):
-            logger.info("start running asr engine")
+            logger.debug("start running asr engine")
            self.preprocess(self.asr_engine.config.model_type,
                            io.BytesIO(audio_data))
            st = time.time()
            self.infer(self.asr_engine.config.model_type)
            infer_time = time.time() - st
            self.output = self.postprocess()  # Retrieve result of asr.
-            logger.info("end inferring asr engine")
+            logger.debug("end inferring asr engine")
        else:
-            logger.info("file check failed!")
+            logger.error("file check failed!")
            self.output = None
        logger.info("inference time: {}".format(infer_time))
--- a/paddlespeech/server/engine/asr/python/asr_engine.py
+++ b/paddlespeech/server/engine/asr/python/asr_engine.py
@ -104,7 +104,7 @@ class PaddleASRConnectionHandler(ASRServerExecutor):
            if self._check(
                    io.BytesIO(audio_data), self.asr_engine.config.sample_rate,
                    self.asr_engine.config.force_yes):
-                logger.info("start run asr engine")
+                logger.debug("start run asr engine")
                self.preprocess(self.asr_engine.config.model,
                                io.BytesIO(audio_data))
                st = time.time()
@ -112,7 +112,7 @@ class PaddleASRConnectionHandler(ASRServerExecutor):
                infer_time = time.time() - st
                self.output = self.postprocess()  # Retrieve result of asr.
            else:
-                logger.info("file check failed!")
+                logger.error("file check failed!")
                self.output = None
            logger.info("inference time: {}".format(infer_time))
--- a/paddlespeech/server/engine/cls/paddleinference/cls_engine.py
+++ b/paddlespeech/server/engine/cls/paddleinference/cls_engine.py
@ -67,22 +67,22 @@ class CLSServerExecutor(CLSExecutor):
            self.params_path = os.path.abspath(params_path)
            self.label_file = os.path.abspath(label_file)
-        logger.info(self.cfg_path)
+        logger.debug(self.cfg_path)
-        logger.info(self.model_path)
+        logger.debug(self.model_path)
-        logger.info(self.params_path)
+        logger.debug(self.params_path)
-        logger.info(self.label_file)
+        logger.debug(self.label_file)
        # config
        with open(self.cfg_path, 'r') as f:
            self._conf = yaml.safe_load(f)
-        logger.info("Read cfg file successfully.")
+        logger.debug("Read cfg file successfully.")
        # labels
        self._label_list = []
        with open(self.label_file, 'r') as f:
            for line in f:
                self._label_list.append(line.strip())
-        logger.info("Read label file successfully.")
+        logger.debug("Read label file successfully.")
        # Create predictor
        self.predictor_conf = predictor_conf
@ -90,7 +90,7 @@ class CLSServerExecutor(CLSExecutor):
            model_file=self.model_path,
            params_file=self.params_path,
            predictor_conf=self.predictor_conf)
-        logger.info("Create predictor successfully.")
+        logger.debug("Create predictor successfully.")
    @paddle.no_grad()
    def infer(self):
@ -148,7 +148,8 @@ class CLSEngine(BaseEngine):
            logger.error(e)
            return False
-        logger.info("Initialize CLS server engine successfully.")
+        logger.info("Initialize CLS server engine successfully on device: %s." %
+                    (self.device))
        return True
@ -160,7 +161,7 @@ class PaddleCLSConnectionHandler(CLSServerExecutor):
            cls_engine (CLSEngine): The CLS engine
        """
        super().__init__()
-        logger.info(
+        logger.debug(
            "Create PaddleCLSConnectionHandler to process the cls request")
        self._inputs = OrderedDict()
@ -183,7 +184,7 @@ class PaddleCLSConnectionHandler(CLSServerExecutor):
        self.infer()
        infer_time = time.time() - st
-        logger.info("inference time: {}".format(infer_time))
+        logger.debug("inference time: {}".format(infer_time))
        logger.info("cls engine type: inference")
    def postprocess(self, topk: int):
--- a/paddlespeech/server/engine/cls/python/cls_engine.py
+++ b/paddlespeech/server/engine/cls/python/cls_engine.py
@ -88,7 +88,7 @@ class PaddleCLSConnectionHandler(CLSServerExecutor):
            cls_engine (CLSEngine): The CLS engine
        """
        super().__init__()
-        logger.info(
+        logger.debug(
            "Create PaddleCLSConnectionHandler to process the cls request")
        self._inputs = OrderedDict()
@ -110,7 +110,7 @@ class PaddleCLSConnectionHandler(CLSServerExecutor):
        self.infer()
        infer_time = time.time() - st
-        logger.info("inference time: {}".format(infer_time))
+        logger.debug("inference time: {}".format(infer_time))
        logger.info("cls engine type: python")
    def postprocess(self, topk: int):
--- a/paddlespeech/server/engine/engine_factory.py
+++ b/paddlespeech/server/engine/engine_factory.py
@ -13,7 +13,7 @@
 # limitations under the License.
 from typing import Text
-from ..utils.log import logger
+from paddlespeech.cli.log import logger
 __all__ = ['EngineFactory']
--- a/paddlespeech/server/engine/engine_warmup.py
+++ b/paddlespeech/server/engine/engine_warmup.py
@ -45,7 +45,7 @@ def warm_up(engine_and_type: str, warm_up_time: int=3) -> bool:
            logger.error("Please check tte engine type.")
        try:
-            logger.info("Start to warm up tts engine.")
+            logger.debug("Start to warm up tts engine.")
            for i in range(warm_up_time):
                connection_handler = PaddleTTSConnectionHandler(tts_engine)
                if flag_online:
@ -53,7 +53,7 @@ def warm_up(engine_and_type: str, warm_up_time: int=3) -> bool:
                            text=sentence,
                            lang=tts_engine.lang,
                            am=tts_engine.config.am):
-                        logger.info(
+                        logger.debug(
                            f"The first response time of the {i} warm up: {connection_handler.first_response_time} s"
                        )
                        break
@ -62,7 +62,7 @@ def warm_up(engine_and_type: str, warm_up_time: int=3) -> bool:
                    st = time.time()
                    connection_handler.infer(text=sentence)
                    et = time.time()
-                    logger.info(
+                    logger.debug(
                        f"The response time of the {i} warm up: {et - st} s")
        except Exception as e:
            logger.error("Failed to warm up on tts engine.")
--- a/paddlespeech/server/engine/text/python/text_engine.py
+++ b/paddlespeech/server/engine/text/python/text_engine.py
@ -28,7 +28,7 @@ class PaddleTextConnectionHandler:
            text_engine (TextEngine): The Text engine
        """
        super().__init__()
-        logger.info(
+        logger.debug(
            "Create PaddleTextConnectionHandler to process the text request")
        self.text_engine = text_engine
        self.task = self.text_engine.executor.task
@ -130,7 +130,7 @@ class TextEngine(BaseEngine):
        """The Text Engine
        """
        super(TextEngine, self).__init__()
-        logger.info("Create the TextEngine Instance")
+        logger.debug("Create the TextEngine Instance")
    def init(self, config: dict):
        """Init the Text Engine
@ -141,7 +141,7 @@ class TextEngine(BaseEngine):
        Returns:
            bool: The engine instance flag
        """
-        logger.info("Init the text engine")
+        logger.debug("Init the text engine")
        try:
            self.config = config
            if self.config.device:
@ -150,7 +150,7 @@ class TextEngine(BaseEngine):
                self.device = paddle.get_device()
            paddle.set_device(self.device)
-            logger.info(f"Text Engine set the device: {self.device}")
+            logger.debug(f"Text Engine set the device: {self.device}")
        except BaseException as e:
            logger.error(
                "Set device failed, please check if device is already used and the parameter 'device' in the yaml file"
@ -168,5 +168,6 @@ class TextEngine(BaseEngine):
            ckpt_path=config.ckpt_path,
            vocab_file=config.vocab_file)
-        logger.info("Init the text engine successfully")
+        logger.info("Initialize Text server engine successfully on device: %s."
+                    % (self.device))
        return True
--- a/paddlespeech/server/engine/tts/online/onnx/tts_engine.py
+++ b/paddlespeech/server/engine/tts/online/onnx/tts_engine.py
@ -62,7 +62,7 @@ class TTSServerExecutor(TTSExecutor):
            (hasattr(self, 'am_encoder_infer_sess') and
             hasattr(self, 'am_decoder_sess') and hasattr(
                 self, 'am_postnet_sess'))) and hasattr(self, 'voc_inference'):
-            logger.info('Models had been initialized.')
+            logger.debug('Models had been initialized.')
            return
        # am
        am_tag = am + '-' + lang
@ -85,8 +85,7 @@ class TTSServerExecutor(TTSExecutor):
            else:
                self.am_ckpt = os.path.abspath(am_ckpt[0])
                self.phones_dict = os.path.abspath(phones_dict)
-                self.am_res_path = os.path.dirname(
-                    os.path.abspath(am_ckpt))
+                self.am_res_path = os.path.dirname(os.path.abspath(am_ckpt))
            # create am sess
            self.am_sess = get_sess(self.am_ckpt, am_sess_conf)
@ -119,8 +118,7 @@ class TTSServerExecutor(TTSExecutor):
                self.am_postnet = os.path.abspath(am_ckpt[2])
                self.phones_dict = os.path.abspath(phones_dict)
                self.am_stat = os.path.abspath(am_stat)
-                self.am_res_path = os.path.dirname(
-                    os.path.abspath(am_ckpt[0]))
+                self.am_res_path = os.path.dirname(os.path.abspath(am_ckpt[0]))
            # create am sess
            self.am_encoder_infer_sess = get_sess(self.am_encoder_infer,
@ -130,9 +128,9 @@ class TTSServerExecutor(TTSExecutor):
            self.am_mu, self.am_std = np.load(self.am_stat)
-        logger.info(f"self.phones_dict: {self.phones_dict}")
+        logger.debug(f"self.phones_dict: {self.phones_dict}")
-        logger.info(f"am model dir: {self.am_res_path}")
+        logger.debug(f"am model dir: {self.am_res_path}")
-        logger.info("Create am sess successfully.")
+        logger.debug("Create am sess successfully.")
        # voc model info
        voc_tag = voc + '-' + lang
@ -149,16 +147,16 @@ class TTSServerExecutor(TTSExecutor):
        else:
            self.voc_ckpt = os.path.abspath(voc_ckpt)
            self.voc_res_path = os.path.dirname(os.path.abspath(self.voc_ckpt))
-        logger.info(self.voc_res_path)
+        logger.debug(self.voc_res_path)
        # create voc sess
        self.voc_sess = get_sess(self.voc_ckpt, voc_sess_conf)
-        logger.info("Create voc sess successfully.")
+        logger.debug("Create voc sess successfully.")
        with open(self.phones_dict, "r") as f:
            phn_id = [line.strip().split() for line in f.readlines()]
        self.vocab_size = len(phn_id)
-        logger.info(f"vocab_size: {self.vocab_size}")
+        logger.debug(f"vocab_size: {self.vocab_size}")
        # frontend
        self.tones_dict = None
@ -169,7 +167,7 @@ class TTSServerExecutor(TTSExecutor):
        elif lang == 'en':
            self.frontend = English(phone_vocab_path=self.phones_dict)
-        logger.info("frontend done!")
+        logger.debug("frontend done!")
 class TTSEngine(BaseEngine):
@ -267,7 +265,7 @@ class PaddleTTSConnectionHandler:
            tts_engine (TTSEngine): The TTS engine
        """
        super().__init__()
-        logger.info(
+        logger.debug(
            "Create PaddleTTSConnectionHandler to process the tts request")
        self.tts_engine = tts_engine
--- a/paddlespeech/server/engine/tts/online/python/tts_engine.py
+++ b/paddlespeech/server/engine/tts/online/python/tts_engine.py
@ -102,16 +102,22 @@ class TTSServerExecutor(TTSExecutor):
        Init model and other resources from a specific path.
        """
        if hasattr(self, 'am_inference') and hasattr(self, 'voc_inference'):
-            logger.info('Models had been initialized.')
+            logger.debug('Models had been initialized.')
            return
        # am model info
+        if am_ckpt is None or am_config is None or am_stat is None or phones_dict is None:
+            use_pretrained_am = True
+        else:
+            use_pretrained_am = False
        am_tag = am + '-' + lang
        self.task_resource.set_task_model(
            model_tag=am_tag,
            model_type=0,  # am
+            skip_download=not use_pretrained_am,
            version=None,  # default version
        )
-        if am_ckpt is None or am_config is None or am_stat is None or phones_dict is None:
+        if use_pretrained_am:
            self.am_res_path = self.task_resource.res_dir
            self.am_config = os.path.join(self.am_res_path,
                                          self.task_resource.res_dict['config'])
@ -122,29 +128,33 @@ class TTSServerExecutor(TTSExecutor):
            # must have phones_dict in acoustic
            self.phones_dict = os.path.join(
                self.am_res_path, self.task_resource.res_dict['phones_dict'])
-            print("self.phones_dict:", self.phones_dict)
-            logger.info(self.am_res_path)
-            logger.info(self.am_config)
-            logger.info(self.am_ckpt)
+            logger.debug(self.am_res_path)
+            logger.debug(self.am_config)
+            logger.debug(self.am_ckpt)
        else:
            self.am_config = os.path.abspath(am_config)
            self.am_ckpt = os.path.abspath(am_ckpt)
            self.am_stat = os.path.abspath(am_stat)
            self.phones_dict = os.path.abspath(phones_dict)
            self.am_res_path = os.path.dirname(os.path.abspath(self.am_config))
        print("self.phones_dict:", self.phones_dict)
        self.tones_dict = None
        self.speaker_dict = None
        # voc model info
+        if voc_ckpt is None or voc_config is None or voc_stat is None:
+            use_pretrained_voc = True
+        else:
+            use_pretrained_voc = False
        voc_tag = voc + '-' + lang
        self.task_resource.set_task_model(
            model_tag=voc_tag,
            model_type=1,  # vocoder
+            skip_download=not use_pretrained_voc,
            version=None,  # default version
        )
-        if voc_ckpt is None or voc_config is None or voc_stat is None:
+        if use_pretrained_voc:
            self.voc_res_path = self.task_resource.voc_res_dir
            self.voc_config = os.path.join(
                self.voc_res_path, self.task_resource.voc_res_dict['config'])
@ -153,9 +163,9 @@ class TTSServerExecutor(TTSExecutor):
            self.voc_stat = os.path.join(
                self.voc_res_path,
                self.task_resource.voc_res_dict['speech_stats'])
-            logger.info(self.voc_res_path)
+            logger.debug(self.voc_res_path)
-            logger.info(self.voc_config)
+            logger.debug(self.voc_config)
-            logger.info(self.voc_ckpt)
+            logger.debug(self.voc_ckpt)
        else:
            self.voc_config = os.path.abspath(voc_config)
            self.voc_ckpt = os.path.abspath(voc_ckpt)
@ -172,7 +182,6 @@ class TTSServerExecutor(TTSExecutor):
        with open(self.phones_dict, "r") as f:
            phn_id = [line.strip().split() for line in f.readlines()]
        self.vocab_size = len(phn_id)
-        print("vocab_size:", self.vocab_size)
        # frontend
        if lang == 'zh':
@ -182,7 +191,6 @@ class TTSServerExecutor(TTSExecutor):
        elif lang == 'en':
            self.frontend = English(phone_vocab_path=self.phones_dict)
-        print("frontend done!")
        # am infer info
        self.am_name = am[:am.rindex('_')]
@ -197,7 +205,6 @@ class TTSServerExecutor(TTSExecutor):
                self.am_name + '_inference')
            self.am_inference = am_inference_class(am_normalizer, am)
            self.am_inference.eval()
-        print("acoustic model done!")
        # voc infer info
        self.voc_name = voc[:voc.rindex('_')]
@ -208,7 +215,6 @@ class TTSServerExecutor(TTSExecutor):
                                                                 '_inference')
        self.voc_inference = voc_inference_class(voc_normalizer, voc)
        self.voc_inference.eval()
-        print("voc done!")
 class TTSEngine(BaseEngine):
@ -297,7 +303,7 @@ class PaddleTTSConnectionHandler:
            tts_engine (TTSEngine): The TTS engine
        """
        super().__init__()
-        logger.info(
+        logger.debug(
            "Create PaddleTTSConnectionHandler to process the tts request")
        self.tts_engine = tts_engine
@ -357,7 +363,7 @@ class PaddleTTSConnectionHandler:
                text, merge_sentences=merge_sentences)
            phone_ids = input_ids["phone_ids"]
        else:
-            print("lang should in {'zh', 'en'}!")
+            logger.error("lang should in {'zh', 'en'}!")
        frontend_et = time.time()
        self.frontend_time = frontend_et - frontend_st
--- a/paddlespeech/server/engine/tts/paddleinference/tts_engine.py
+++ b/paddlespeech/server/engine/tts/paddleinference/tts_engine.py
@ -65,16 +65,22 @@ class TTSServerExecutor(TTSExecutor):
        Init model and other resources from a specific path.
        """
        if hasattr(self, 'am_predictor') and hasattr(self, 'voc_predictor'):
-            logger.info('Models had been initialized.')
+            logger.debug('Models had been initialized.')
            return
        # am
+        if am_model is None or am_params is None or phones_dict is None:
+            use_pretrained_am = True
+        else:
+            use_pretrained_am = False
        am_tag = am + '-' + lang
        self.task_resource.set_task_model(
            model_tag=am_tag,
            model_type=0,  # am
+            skip_download=not use_pretrained_am,
            version=None,  # default version
        )
-        if am_model is None or am_params is None or phones_dict is None:
+        if use_pretrained_am:
            self.am_res_path = self.task_resource.res_dir
            self.am_model = os.path.join(self.am_res_path,
                                         self.task_resource.res_dict['model'])
@ -85,16 +91,16 @@ class TTSServerExecutor(TTSExecutor):
                self.am_res_path, self.task_resource.res_dict['phones_dict'])
            self.am_sample_rate = self.task_resource.res_dict['sample_rate']
-            logger.info(self.am_res_path)
+            logger.debug(self.am_res_path)
-            logger.info(self.am_model)
+            logger.debug(self.am_model)
-            logger.info(self.am_params)
+            logger.debug(self.am_params)
        else:
            self.am_model = os.path.abspath(am_model)
            self.am_params = os.path.abspath(am_params)
            self.phones_dict = os.path.abspath(phones_dict)
            self.am_sample_rate = am_sample_rate
            self.am_res_path = os.path.dirname(os.path.abspath(self.am_model))
-        logger.info("self.phones_dict: {}".format(self.phones_dict))
+        logger.debug("self.phones_dict: {}".format(self.phones_dict))
        # for speedyspeech
        self.tones_dict = None
@ -113,13 +119,19 @@ class TTSServerExecutor(TTSExecutor):
                self.speaker_dict = speaker_dict
        # voc
+        if voc_model is None or voc_params is None:
+            use_pretrained_voc = True
+        else:
+            use_pretrained_voc = False
        voc_tag = voc + '-' + lang
        self.task_resource.set_task_model(
            model_tag=voc_tag,
            model_type=1,  # vocoder
+            skip_download=not use_pretrained_voc,
            version=None,  # default version
        )
-        if voc_model is None or voc_params is None:
+        if use_pretrained_voc:
            self.voc_res_path = self.task_resource.voc_res_dir
            self.voc_model = os.path.join(
                self.voc_res_path, self.task_resource.voc_res_dict['model'])
@ -127,9 +139,9 @@ class TTSServerExecutor(TTSExecutor):
                self.voc_res_path, self.task_resource.voc_res_dict['params'])
            self.voc_sample_rate = self.task_resource.voc_res_dict[
                'sample_rate']
-            logger.info(self.voc_res_path)
+            logger.debug(self.voc_res_path)
-            logger.info(self.voc_model)
+            logger.debug(self.voc_model)
-            logger.info(self.voc_params)
+            logger.debug(self.voc_params)
        else:
            self.voc_model = os.path.abspath(voc_model)
            self.voc_params = os.path.abspath(voc_params)
@ -144,21 +156,21 @@ class TTSServerExecutor(TTSExecutor):
        with open(self.phones_dict, "r") as f:
            phn_id = [line.strip().split() for line in f.readlines()]
        vocab_size = len(phn_id)
-        logger.info("vocab_size: {}".format(vocab_size))
+        logger.debug("vocab_size: {}".format(vocab_size))
        tone_size = None
        if self.tones_dict:
            with open(self.tones_dict, "r") as f:
                tone_id = [line.strip().split() for line in f.readlines()]
            tone_size = len(tone_id)
-            logger.info("tone_size: {}".format(tone_size))
+            logger.debug("tone_size: {}".format(tone_size))
        spk_num = None
        if self.speaker_dict:
            with open(self.speaker_dict, 'rt') as f:
                spk_id = [line.strip().split() for line in f.readlines()]
            spk_num = len(spk_id)
-            logger.info("spk_num: {}".format(spk_num))
+            logger.debug("spk_num: {}".format(spk_num))
        # frontend
        if lang == 'zh':
@ -168,7 +180,7 @@ class TTSServerExecutor(TTSExecutor):
        elif lang == 'en':
            self.frontend = English(phone_vocab_path=self.phones_dict)
-        logger.info("frontend done!")
+        logger.debug("frontend done!")
        # Create am predictor
        self.am_predictor_conf = am_predictor_conf
@ -176,7 +188,7 @@ class TTSServerExecutor(TTSExecutor):
            model_file=self.am_model,
            params_file=self.am_params,
            predictor_conf=self.am_predictor_conf)
-        logger.info("Create AM predictor successfully.")
+        logger.debug("Create AM predictor successfully.")
        # Create voc predictor
        self.voc_predictor_conf = voc_predictor_conf
@ -184,7 +196,7 @@ class TTSServerExecutor(TTSExecutor):
            model_file=self.voc_model,
            params_file=self.voc_params,
            predictor_conf=self.voc_predictor_conf)
-        logger.info("Create Vocoder predictor successfully.")
+        logger.debug("Create Vocoder predictor successfully.")
    @paddle.no_grad()
    def infer(self,
@ -316,7 +328,8 @@ class TTSEngine(BaseEngine):
            logger.error(e)
            return False
-        logger.info("Initialize TTS server engine successfully.")
+        logger.info("Initialize TTS server engine successfully on device: %s." %
+                    (self.device))
        return True
@ -328,7 +341,7 @@ class PaddleTTSConnectionHandler(TTSServerExecutor):
            tts_engine (TTSEngine): The TTS engine
        """
        super().__init__()
-        logger.info(
+        logger.debug(
            "Create PaddleTTSConnectionHandler to process the tts request")
        self.tts_engine = tts_engine
@ -366,23 +379,23 @@ class PaddleTTSConnectionHandler(TTSServerExecutor):
        if target_fs == 0 or target_fs > original_fs:
            target_fs = original_fs
            wav_tar_fs = wav
-            logger.info(
+            logger.debug(
                "The sample rate of synthesized audio is the same as model, which is {}Hz".
                format(original_fs))
        else:
            wav_tar_fs = librosa.resample(
                np.squeeze(wav), original_fs, target_fs)
-            logger.info(
+            logger.debug(
                "The sample rate of model is {}Hz and the target sample rate is {}Hz. Converting the sample rate of the synthesized audio successfully.".
                format(original_fs, target_fs))
        # transform volume
        wav_vol = wav_tar_fs * volume
-        logger.info("Transform the volume of the audio successfully.")
+        logger.debug("Transform the volume of the audio successfully.")
        # transform speed
        try:  # windows not support soxbindings
            wav_speed = change_speed(wav_vol, speed, target_fs)
-            logger.info("Transform the speed of the audio successfully.")
+            logger.debug("Transform the speed of the audio successfully.")
        except ServerBaseException:
            raise ServerBaseException(
                ErrorCode.SERVER_INTERNAL_ERR,
@ -399,7 +412,7 @@ class PaddleTTSConnectionHandler(TTSServerExecutor):
        wavfile.write(buf, target_fs, wav_speed)
        base64_bytes = base64.b64encode(buf.read())
        wav_base64 = base64_bytes.decode('utf-8')
-        logger.info("Audio to string successfully.")
+        logger.debug("Audio to string successfully.")
        # save audio
        if audio_path is not None:
@ -487,15 +500,15 @@ class PaddleTTSConnectionHandler(TTSServerExecutor):
            logger.error(e)
            sys.exit(-1)
-        logger.info("AM model: {}".format(self.config.am))
+        logger.debug("AM model: {}".format(self.config.am))
-        logger.info("Vocoder model: {}".format(self.config.voc))
+        logger.debug("Vocoder model: {}".format(self.config.voc))
-        logger.info("Language: {}".format(lang))
+        logger.debug("Language: {}".format(lang))
        logger.info("tts engine type: python")
        logger.info("audio duration: {}".format(duration))
-        logger.info("frontend inference time: {}".format(self.frontend_time))
+        logger.debug("frontend inference time: {}".format(self.frontend_time))
-        logger.info("AM inference time: {}".format(self.am_time))
+        logger.debug("AM inference time: {}".format(self.am_time))
-        logger.info("Vocoder inference time: {}".format(self.voc_time))
+        logger.debug("Vocoder inference time: {}".format(self.voc_time))
        logger.info("total inference time: {}".format(infer_time))
        logger.info(
            "postprocess (change speed, volume, target sample rate) time: {}".
@ -503,6 +516,6 @@ class PaddleTTSConnectionHandler(TTSServerExecutor):
        logger.info("total generate audio time: {}".format(infer_time +
                                                           postprocess_time))
        logger.info("RTF: {}".format(rtf))
-        logger.info("device: {}".format(self.tts_engine.device))
+        logger.debug("device: {}".format(self.tts_engine.device))
        return lang, target_sample_rate, duration, wav_base64
--- a/paddlespeech/server/engine/tts/python/tts_engine.py
+++ b/paddlespeech/server/engine/tts/python/tts_engine.py
@ -105,7 +105,7 @@ class PaddleTTSConnectionHandler(TTSServerExecutor):
            tts_engine (TTSEngine): The TTS engine
        """
        super().__init__()
-        logger.info(
+        logger.debug(
            "Create PaddleTTSConnectionHandler to process the tts request")
        self.tts_engine = tts_engine
@ -143,23 +143,23 @@ class PaddleTTSConnectionHandler(TTSServerExecutor):
        if target_fs == 0 or target_fs > original_fs:
            target_fs = original_fs
            wav_tar_fs = wav
-            logger.info(
+            logger.debug(
                "The sample rate of synthesized audio is the same as model, which is {}Hz".
                format(original_fs))
        else:
            wav_tar_fs = librosa.resample(
                np.squeeze(wav), original_fs, target_fs)
-            logger.info(
+            logger.debug(
                "The sample rate of model is {}Hz and the target sample rate is {}Hz. Converting the sample rate of the synthesized audio successfully.".
                format(original_fs, target_fs))
        # transform volume
        wav_vol = wav_tar_fs * volume
-        logger.info("Transform the volume of the audio successfully.")
+        logger.debug("Transform the volume of the audio successfully.")
        # transform speed
        try:  # windows not support soxbindings
            wav_speed = change_speed(wav_vol, speed, target_fs)
-            logger.info("Transform the speed of the audio successfully.")
+            logger.debug("Transform the speed of the audio successfully.")
        except ServerBaseException:
            raise ServerBaseException(
                ErrorCode.SERVER_INTERNAL_ERR,
@ -176,7 +176,7 @@ class PaddleTTSConnectionHandler(TTSServerExecutor):
        wavfile.write(buf, target_fs, wav_speed)
        base64_bytes = base64.b64encode(buf.read())
        wav_base64 = base64_bytes.decode('utf-8')
-        logger.info("Audio to string successfully.")
+        logger.debug("Audio to string successfully.")
        # save audio
        if audio_path is not None:
@ -264,15 +264,15 @@ class PaddleTTSConnectionHandler(TTSServerExecutor):
            logger.error(e)
            sys.exit(-1)
-        logger.info("AM model: {}".format(self.config.am))
+        logger.debug("AM model: {}".format(self.config.am))
-        logger.info("Vocoder model: {}".format(self.config.voc))
+        logger.debug("Vocoder model: {}".format(self.config.voc))
-        logger.info("Language: {}".format(lang))
+        logger.debug("Language: {}".format(lang))
        logger.info("tts engine type: python")
        logger.info("audio duration: {}".format(duration))
-        logger.info("frontend inference time: {}".format(self.frontend_time))
+        logger.debug("frontend inference time: {}".format(self.frontend_time))
-        logger.info("AM inference time: {}".format(self.am_time))
+        logger.debug("AM inference time: {}".format(self.am_time))
-        logger.info("Vocoder inference time: {}".format(self.voc_time))
+        logger.debug("Vocoder inference time: {}".format(self.voc_time))
        logger.info("total inference time: {}".format(infer_time))
        logger.info(
            "postprocess (change speed, volume, target sample rate) time: {}".
@ -280,6 +280,6 @@ class PaddleTTSConnectionHandler(TTSServerExecutor):
        logger.info("total generate audio time: {}".format(infer_time +
                                                           postprocess_time))
        logger.info("RTF: {}".format(rtf))
-        logger.info("device: {}".format(self.tts_engine.device))
+        logger.debug("device: {}".format(self.tts_engine.device))
        return lang, target_sample_rate, duration, wav_base64
--- a/paddlespeech/server/engine/vector/python/vector_engine.py
+++ b/paddlespeech/server/engine/vector/python/vector_engine.py
@ -33,7 +33,7 @@ class PaddleVectorConnectionHandler:
            vector_engine (VectorEngine): The Vector engine
        """
        super().__init__()
-        logger.info(
+        logger.debug(
            "Create PaddleVectorConnectionHandler to process the vector request")
        self.vector_engine = vector_engine
        self.executor = self.vector_engine.executor
@ -54,7 +54,7 @@ class PaddleVectorConnectionHandler:
        Returns:
            str: the punctuation text
        """
-        logger.info(
+        logger.debug(
            f"start to extract the do vector {self.task} from the http request")
        if self.task == "spk" and task == "spk":
            embedding = self.extract_audio_embedding(audio_data)
@ -81,17 +81,17 @@ class PaddleVectorConnectionHandler:
        Returns:
            float: the score between enroll and test audio
        """
-        logger.info("start to extract the enroll audio embedding")
+        logger.debug("start to extract the enroll audio embedding")
        enroll_emb = self.extract_audio_embedding(enroll_audio)
-        logger.info("start to extract the test audio embedding")
+        logger.debug("start to extract the test audio embedding")
        test_emb = self.extract_audio_embedding(test_audio)
-        logger.info(
+        logger.debug(
            "start to get the score between the enroll and test embedding")
        score = self.executor.get_embeddings_score(enroll_emb, test_emb)
-        logger.info(f"get the enroll vs test score: {score}")
+        logger.debug(f"get the enroll vs test score: {score}")
        return score
    @paddle.no_grad()
@ -106,11 +106,12 @@ class PaddleVectorConnectionHandler:
        # because the soundfile will change the io.BytesIO(audio) to the end
        # thus we should convert the base64 string to io.BytesIO when we need the audio data
        if not self.executor._check(io.BytesIO(audio), sample_rate):
-            logger.info("check the audio sample rate occurs error")
+            logger.debug("check the audio sample rate occurs error")
            return np.array([0.0])
        waveform, sr = load_audio(io.BytesIO(audio))
-        logger.info(f"load the audio sample points, shape is: {waveform.shape}")
+        logger.debug(
+            f"load the audio sample points, shape is: {waveform.shape}")
        # stage 2: get the audio feat
        # Note: Now we only support fbank feature
@ -121,9 +122,9 @@ class PaddleVectorConnectionHandler:
                n_mels=self.config.n_mels,
                window_size=self.config.window_size,
                hop_length=self.config.hop_size)
-            logger.info(f"extract the audio feats, shape is: {feats.shape}")
+            logger.debug(f"extract the audio feats, shape is: {feats.shape}")
        except Exception as e:
-            logger.info(f"feats occurs exception {e}")
+            logger.error(f"feats occurs exception {e}")
            sys.exit(-1)
        feats = paddle.to_tensor(feats).unsqueeze(0)
@ -159,7 +160,7 @@ class VectorEngine(BaseEngine):
        """The Vector Engine
        """
        super(VectorEngine, self).__init__()
-        logger.info("Create the VectorEngine Instance")
+        logger.debug("Create the VectorEngine Instance")
    def init(self, config: dict):
        """Init the Vector Engine
@ -170,7 +171,7 @@ class VectorEngine(BaseEngine):
        Returns:
            bool: The engine instance flag
        """
-        logger.info("Init the vector engine")
+        logger.debug("Init the vector engine")
        try:
            self.config = config
            if self.config.device:
@ -179,7 +180,7 @@ class VectorEngine(BaseEngine):
                self.device = paddle.get_device()
            paddle.set_device(self.device)
-            logger.info(f"Vector Engine set the device: {self.device}")
+            logger.debug(f"Vector Engine set the device: {self.device}")
        except BaseException as e:
            logger.error(
                "Set device failed, please check if device is already used and the parameter 'device' in the yaml file"
@ -196,5 +197,7 @@ class VectorEngine(BaseEngine):
            ckpt_path=config.ckpt_path,
            task=config.task)
-        logger.info("Init the Vector engine successfully")
+        logger.info(
+            "Initialize Vector server engine successfully on device: %s." %
+            (self.device))
        return True
--- a/paddlespeech/server/utils/audio_handler.py
+++ b/paddlespeech/server/utils/audio_handler.py
@ -138,7 +138,7 @@ class ASRWsAudioHandler:
        Returns:
            str: the final asr result
        """
-        logging.info("send a message to the server")
+        logging.debug("send a message to the server")
        if self.url is None:
            logger.error("No asr server, please input valid ip and port")
@ -160,7 +160,7 @@ class ASRWsAudioHandler:
                separators=(',', ': '))
            await ws.send(audio_info)
            msg = await ws.recv()
-            logger.info("client receive msg={}".format(msg))
+            logger.debug("client receive msg={}".format(msg))
            # 3. send chunk audio data to engine
            for chunk_data in self.read_wave(wavfile_path):
@ -170,7 +170,7 @@ class ASRWsAudioHandler:
                if self.punc_server and len(msg["result"]) > 0:
                    msg["result"] = self.punc_server.run(msg["result"])
-                logger.info("client receive msg={}".format(msg))
+                logger.debug("client receive msg={}".format(msg))
            # 4. we must send finished signal to the server
            audio_info = json.dumps(
@ -310,7 +310,7 @@ class TTSWsHandler:
            start_request = json.dumps({"task": "tts", "signal": "start"})
            await ws.send(start_request)
            msg = await ws.recv()
-            logger.info(f"client receive msg={msg}")
+            logger.debug(f"client receive msg={msg}")
            msg = json.loads(msg)
            session = msg["session"]
@ -319,7 +319,7 @@ class TTSWsHandler:
            request = json.dumps({"text": text_base64})
            st = time.time()
            await ws.send(request)
-            logging.info("send a message to the server")
+            logging.debug("send a message to the server")
            # 4. Process the received response
            message = await ws.recv()
@ -543,7 +543,6 @@ class VectorHttpHandler:
            "sample_rate": sample_rate,
        }
-        logger.info(self.url)
        res = requests.post(url=self.url, data=json.dumps(data))
        return res.json()
--- a/paddlespeech/server/utils/audio_process.py
+++ b/paddlespeech/server/utils/audio_process.py
@ -169,7 +169,7 @@ def save_audio(bytes_data, audio_path, sample_rate: int=24000) -> bool:
            sample_rate=sample_rate)
        os.remove("./tmp.pcm")
    else:
-        print("Only supports saved audio format is pcm or wav")
+        logger.error("Only supports saved audio format is pcm or wav")
        return False
    return True
--- a/paddlespeech/server/utils/log.py
+++ b/paddlespeech/server/utils/log.py
@ -1,59 +0,0 @@
-# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
-#
-# Licensed under the Apache License, Version 2.0 (the "License");
-# you may not use this file except in compliance with the License.
-# You may obtain a copy of the License at
-#
-#     http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing, software
-# distributed under the License is distributed on an "AS IS" BASIS,
-# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-# See the License for the specific language governing permissions and
-# limitations under the License.
-import functools
-import logging
-__all__ = [
-    'logger',
-]
-class Logger(object):
-    def __init__(self, name: str=None):
-        name = 'PaddleSpeech' if not name else name
-        self.logger = logging.getLogger(name)
-        log_config = {
-            'DEBUG': 10,
-            'INFO': 20,
-            'TRAIN': 21,
-            'EVAL': 22,
-            'WARNING': 30,
-            'ERROR': 40,
-            'CRITICAL': 50,
-            'EXCEPTION': 100,
-        }
-        for key, level in log_config.items():
-            logging.addLevelName(level, key)
-            if key == 'EXCEPTION':
-                self.__dict__[key.lower()] = self.logger.exception
-            else:
-                self.__dict__[key.lower()] = functools.partial(self.__call__,
-                                                               level)
-        self.format = logging.Formatter(
-            fmt='[%(asctime)-15s] [%(levelname)8s] - %(message)s')
-        self.handler = logging.StreamHandler()
-        self.handler.setFormatter(self.format)
-        self.logger.addHandler(self.handler)
-        self.logger.setLevel(logging.DEBUG)
-        self.logger.propagate = False
-    def __call__(self, log_level: str, msg: str):
-        self.logger.log(log_level, msg)
-logger = Logger()
--- a/paddlespeech/server/utils/onnx_infer.py
+++ b/paddlespeech/server/utils/onnx_infer.py
@ -16,11 +16,11 @@ from typing import Optional
 import onnxruntime as ort
-from .log import logger
+from paddlespeech.cli.log import logger
 def get_sess(model_path: Optional[os.PathLike]=None, sess_conf: dict=None):
-    logger.info(f"ort sessconf: {sess_conf}")
+    logger.debug(f"ort sessconf: {sess_conf}")
    sess_options = ort.SessionOptions()
    sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
    if sess_conf.get('graph_optimization_level', 99) == 0:
@ -34,7 +34,7 @@ def get_sess(model_path: Optional[os.PathLike]=None, sess_conf: dict=None):
        # fastspeech2/mb_melgan can't use trt now!
        if sess_conf.get("use_trt", 0):
            providers = ['TensorrtExecutionProvider']
-    logger.info(f"ort providers: {providers}")
+    logger.debug(f"ort providers: {providers}")
    if 'cpu_threads' in sess_conf:
        sess_options.intra_op_num_threads = sess_conf.get("cpu_threads", 0)
--- a/paddlespeech/server/utils/util.py
+++ b/paddlespeech/server/utils/util.py
@ -13,6 +13,8 @@
 import base64
 import math
+from paddlespeech.cli.log import logger
 def wav2base64(wav_file: str):
    """
@ -61,7 +63,7 @@ def get_chunks(data, block_size, pad_size, step):
    elif step == "voc":
        data_len = data.shape[0]
    else:
-        print("Please set correct type to get chunks, am or voc")
+        logger.error("Please set correct type to get chunks, am or voc")
    chunks = []
    n = math.ceil(data_len / block_size)
@ -73,7 +75,7 @@ def get_chunks(data, block_size, pad_size, step):
        elif step == "voc":
            chunks.append(data[start:end, :])
        else:
-            print("Please set correct type to get chunks, am or voc")
+            logger.error("Please set correct type to get chunks, am or voc")
    return chunks
--- a/paddlespeech/t2s/models/fastspeech2/fastspeech2.py
+++ b/paddlespeech/t2s/models/fastspeech2/fastspeech2.py
@ -141,71 +141,133 @@ class FastSpeech2(nn.Layer):
            init_dec_alpha: float=1.0, ):
        """Initialize FastSpeech2 module.
        Args:
-            idim (int): Dimension of the inputs.
+            idim (int): 
-            odim (int): Dimension of the outputs.
+                Dimension of the inputs.
-            adim (int): Attention dimension.
+            odim (int): 
-            aheads (int): Number of attention heads.
+                Dimension of the outputs.
-            elayers (int): Number of encoder layers.
+            adim (int): 
-            eunits (int): Number of encoder hidden units.
+                Attention dimension.
-            dlayers (int): Number of decoder layers.
+            aheads (int): 
-            dunits (int): Number of decoder hidden units.
+                Number of attention heads.
-            postnet_layers (int): Number of postnet layers.
+            elayers (int): 
-            postnet_chans (int): Number of postnet channels.
+                Number of encoder layers.
-            postnet_filts (int): Kernel size of postnet.
+            eunits (int): 
-            postnet_dropout_rate (float): Dropout rate in postnet.
+                Number of encoder hidden units.
-            use_scaled_pos_enc (bool): Whether to use trainable scaled pos encoding.
+            dlayers (int): 
-            use_batch_norm (bool): Whether to use batch normalization in encoder prenet.
+                Number of decoder layers.
-            encoder_normalize_before (bool): Whether to apply layernorm layer before encoder block.
+            dunits (int): 
-            decoder_normalize_before (bool): Whether to apply layernorm layer before decoder block.
+                Number of decoder hidden units.
-            encoder_concat_after (bool): Whether to concatenate attention layer's input and output in encoder.
+            postnet_layers (int): 
-            decoder_concat_after (bool): Whether to concatenate attention layer's input  and output in decoder.
+                Number of postnet layers.
-            reduction_factor (int): Reduction factor.
+            postnet_chans (int): 
-            encoder_type (str): Encoder type ("transformer" or "conformer").
+                Number of postnet channels.
-            decoder_type (str): Decoder type ("transformer" or "conformer").
+            postnet_filts (int): 
-            transformer_enc_dropout_rate (float): Dropout rate in encoder except attention and positional encoding.
+                Kernel size of postnet.
-            transformer_enc_positional_dropout_rate (float): Dropout rate after encoder positional encoding.
+            postnet_dropout_rate (float): 
-            transformer_enc_attn_dropout_rate (float): Dropout rate in encoder self-attention module.
+                Dropout rate in postnet.
-            transformer_dec_dropout_rate (float): Dropout rate in decoder except attention & positional encoding.
+            use_scaled_pos_enc (bool): 
-            transformer_dec_positional_dropout_rate (float): Dropout rate after decoder positional encoding.
+                Whether to use trainable scaled pos encoding.
-            transformer_dec_attn_dropout_rate (float): Dropout rate in decoder self-attention module.
+            use_batch_norm (bool): 
-            conformer_pos_enc_layer_type (str): Pos encoding layer type in conformer.
+                Whether to use batch normalization in encoder prenet.
-            conformer_self_attn_layer_type (str): Self-attention layer type in conformer
+            encoder_normalize_before (bool): 
-            conformer_activation_type (str): Activation function type in conformer.
+                Whether to apply layernorm layer before encoder block.
-            use_macaron_style_in_conformer (bool): Whether to use macaron style FFN.
+            decoder_normalize_before (bool): 
-            use_cnn_in_conformer (bool): Whether to use CNN in conformer.
+                Whether to apply layernorm layer before decoder block.
-            zero_triu (bool): Whether to use zero triu in relative self-attention module.
+            encoder_concat_after (bool): 
-            conformer_enc_kernel_size (int): Kernel size of encoder conformer.
+                Whether to concatenate attention layer's input and output in encoder.
-            conformer_dec_kernel_size (int): Kernel size of decoder conformer.
+            decoder_concat_after (bool): 
-            duration_predictor_layers (int): Number of duration predictor layers.
+                Whether to concatenate attention layer's input  and output in decoder.
-            duration_predictor_chans (int): Number of duration predictor channels.
+            reduction_factor (int): 
-            duration_predictor_kernel_size (int): Kernel size of duration predictor.
+                Reduction factor.
-            duration_predictor_dropout_rate (float): Dropout rate in duration predictor.
+            encoder_type (str): 
-            pitch_predictor_layers (int): Number of pitch predictor layers.
+                Encoder type ("transformer" or "conformer").
-            pitch_predictor_chans (int): Number of pitch predictor channels.
+            decoder_type (str): 
-            pitch_predictor_kernel_size (int): Kernel size of pitch predictor.
+                Decoder type ("transformer" or "conformer").
-            pitch_predictor_dropout_rate (float): Dropout rate in pitch predictor.
+            transformer_enc_dropout_rate (float): 
-            pitch_embed_kernel_size (float): Kernel size of pitch embedding.
+                Dropout rate in encoder except attention and positional encoding.
-            pitch_embed_dropout_rate (float): Dropout rate for pitch embedding.
+            transformer_enc_positional_dropout_rate (float): 
-            stop_gradient_from_pitch_predictor (bool): Whether to stop gradient from pitch predictor to encoder.
+                Dropout rate after encoder positional encoding.
-            energy_predictor_layers (int): Number of energy predictor layers.
+            transformer_enc_attn_dropout_rate (float): 
-            energy_predictor_chans (int): Number of energy predictor channels.
+                Dropout rate in encoder self-attention module.
-            energy_predictor_kernel_size (int): Kernel size of energy predictor.
+            transformer_dec_dropout_rate (float): 
-            energy_predictor_dropout_rate (float): Dropout rate in energy predictor.
+                Dropout rate in decoder except attention & positional encoding.
-            energy_embed_kernel_size (float): Kernel size of energy embedding.
+            transformer_dec_positional_dropout_rate (float):
-            energy_embed_dropout_rate (float): Dropout rate for energy embedding.
+                Dropout rate after decoder positional encoding.
-            stop_gradient_from_energy_predictor（bool): Whether to stop gradient from energy predictor to encoder.
+            transformer_dec_attn_dropout_rate (float): 
-            spk_num (Optional[int]): Number of speakers. If not None, assume that the spk_embed_dim is not None,
+                Dropout rate in decoder self-attention module.
            conformer_pos_enc_layer_type (str): 
                Pos encoding layer type in conformer.
            conformer_self_attn_layer_type (str): 
                Self-attention layer type in conformer.
            conformer_activation_type (str): 
                Activation function type in conformer.
            use_macaron_style_in_conformer (bool): 
                Whether to use macaron style FFN.
            use_cnn_in_conformer (bool): 
                Whether to use CNN in conformer.
            zero_triu (bool): 
                Whether to use zero triu in relative self-attention module.
            conformer_enc_kernel_size (int): 
                Kernel size of encoder conformer.
            conformer_dec_kernel_size (int): 
                Kernel size of decoder conformer.
            duration_predictor_layers (int): 
                Number of duration predictor layers.
            duration_predictor_chans (int): 
                Number of duration predictor channels.
            duration_predictor_kernel_size (int): 
                Kernel size of duration predictor.
            duration_predictor_dropout_rate (float): 
                Dropout rate in duration predictor.
            pitch_predictor_layers (int): 
                Number of pitch predictor layers.
            pitch_predictor_chans (int):
                Number of pitch predictor channels.
            pitch_predictor_kernel_size (int): 
                Kernel size of pitch predictor.
            pitch_predictor_dropout_rate (float): 
                Dropout rate in pitch predictor.
            pitch_embed_kernel_size (int): 
                Kernel size of pitch embedding.
            pitch_embed_dropout_rate (float): 
                Dropout rate for pitch embedding.
            stop_gradient_from_pitch_predictor (bool): 
                Whether to stop gradient from pitch predictor to encoder.
            energy_predictor_layers (int): 
                Number of energy predictor layers.
            energy_predictor_chans (int): 
                Number of energy predictor channels.
            energy_predictor_kernel_size (int): 
                Kernel size of energy predictor.
            energy_predictor_dropout_rate (float): 
                Dropout rate in energy predictor.
            energy_embed_kernel_size (int): 
                Kernel size of energy embedding.
            energy_embed_dropout_rate (float): 
                Dropout rate for energy embedding.
            stop_gradient_from_energy_predictor (bool): 
                Whether to stop gradient from energy predictor to encoder.
            spk_num (Optional[int]): 
                Number of speakers. If not None, spk_embed_dim is assumed to be not None as well;
                spk_ids will be provided as the input and spk_embedding_table will be used.
-            spk_embed_dim (Optional[int]): Speaker embedding dimension. If not None, 
+            spk_embed_dim (Optional[int]): 
                Speaker embedding dimension. If not None, 
                assume that spk_emb will be provided as the input or spk_num is not None.
-            spk_embed_integration_type (str): How to integrate speaker embedding.
+            spk_embed_integration_type (str): 
-            tone_num (Optional[int]): Number of tones. If not None, assume that the
+                How to integrate speaker embedding.
            tone_num (Optional[int]): 
                Number of tones. If not None, assume that the
                tone_ids will be provided as the input and use tone_embedding_table.
-            tone_embed_dim (Optional[int]): Tone embedding dimension. If not None, assume that tone_num is not None.
+            tone_embed_dim (Optional[int]):
-            tone_embed_integration_type (str): How to integrate tone embedding.
+                Tone embedding dimension. If not None, assume that tone_num is not None.
-            init_type (str): How to initialize transformer parameters.
+            tone_embed_integration_type (str): 
-            init_enc_alpha (float): Initial value of alpha in scaled pos encoding of the encoder.
+                How to integrate tone embedding.
-            init_dec_alpha (float): Initial value of alpha in scaled pos encoding of the decoder.
+            init_type (str): 
                How to initialize transformer parameters.
            init_enc_alpha (float): 
                Initial value of alpha in scaled pos encoding of the encoder.
            init_dec_alpha (float): 
                Initial value of alpha in scaled pos encoding of the decoder.
        """
        assert check_argument_types()
@ -258,7 +320,6 @@ class FastSpeech2(nn.Layer):
            padding_idx=self.padding_idx)
        if encoder_type == "transformer":
            print("encoder_type is transformer")
            self.encoder = TransformerEncoder(
                idim=idim,
                attention_dim=adim,
@ -275,7 +336,6 @@ class FastSpeech2(nn.Layer):
                positionwise_layer_type=positionwise_layer_type,
                positionwise_conv_kernel_size=positionwise_conv_kernel_size, )
        elif encoder_type == "conformer":
            print("encoder_type is conformer")
            self.encoder = ConformerEncoder(
                idim=idim,
                attention_dim=adim,
@ -362,7 +422,6 @@ class FastSpeech2(nn.Layer):
        # NOTE: we use encoder as decoder
        # because fastspeech's decoder is the same as encoder
        if decoder_type == "transformer":
            print("decoder_type is transformer")
            self.decoder = TransformerEncoder(
                idim=0,
                attention_dim=adim,
@ -380,7 +439,6 @@ class FastSpeech2(nn.Layer):
                positionwise_layer_type=positionwise_layer_type,
                positionwise_conv_kernel_size=positionwise_conv_kernel_size, )
        elif decoder_type == "conformer":
            print("decoder_type is conformer")
            self.decoder = ConformerEncoder(
                idim=0,
                attention_dim=adim,
@ -453,20 +511,29 @@ class FastSpeech2(nn.Layer):
        """Calculate forward propagation.
        Args:
-            text(Tensor(int64)): Batch of padded token ids (B, Tmax).
+            text(Tensor(int64)): 
-            text_lengths(Tensor(int64)): Batch of lengths of each input (B,).
+                Batch of padded token ids (B, Tmax).
-            speech(Tensor): Batch of padded target features (B, Lmax, odim).
+            text_lengths(Tensor(int64)): 
-            speech_lengths(Tensor(int64)): Batch of the lengths of each target (B,).
+                Batch of lengths of each input (B,).
-            durations(Tensor(int64)): Batch of padded durations (B, Tmax).
+            speech(Tensor): 
-            pitch(Tensor): Batch of padded token-averaged pitch (B, Tmax, 1).
+                Batch of padded target features (B, Lmax, odim).
-            energy(Tensor): Batch of padded token-averaged energy (B, Tmax, 1).
+            speech_lengths(Tensor(int64)): 
-            tone_id(Tensor, optional(int64)): Batch of padded tone ids  (B, Tmax).
+                Batch of the lengths of each target (B,).
-            spk_emb(Tensor, optional): Batch of speaker embeddings (B, spk_embed_dim).
+            durations(Tensor(int64)): 
-            spk_id(Tensor, optional(int64)): Batch of speaker ids (B,).
+                Batch of padded durations (B, Tmax).
            pitch(Tensor): 
                Batch of padded token-averaged pitch (B, Tmax, 1).
            energy(Tensor): 
                Batch of padded token-averaged energy (B, Tmax, 1).
            tone_id(Tensor, optional(int64)): 
                Batch of padded tone ids  (B, Tmax).
            spk_emb(Tensor, optional): 
                Batch of speaker embeddings (B, spk_embed_dim).
            spk_id(Tensor, optional(int64)): 
                Batch of speaker ids (B,).
        Returns:
        """
        # input of embedding must be int64
@ -662,20 +729,28 @@ class FastSpeech2(nn.Layer):
        """Generate the sequence of features given the sequences of characters.
        Args:
-            text(Tensor(int64)): Input sequence of characters (T,).
+            text(Tensor(int64)): 
-            durations(Tensor, optional (int64)): Groundtruth of duration (T,).
+                Input sequence of characters (T,).
-            pitch(Tensor, optional): Groundtruth of token-averaged pitch (T, 1).
+            durations(Tensor, optional (int64)): 
-            energy(Tensor, optional): Groundtruth of token-averaged energy (T, 1).
+                Groundtruth of duration (T,).
-            alpha(float, optional): Alpha to control the speed.
+            pitch(Tensor, optional): 
-            use_teacher_forcing(bool, optional): Whether to use teacher forcing.
+                Groundtruth of token-averaged pitch (T, 1).
            energy(Tensor, optional): 
                Groundtruth of token-averaged energy (T, 1).
            alpha(float, optional): 
                Alpha to control the speed.
            use_teacher_forcing(bool, optional): 
                Whether to use teacher forcing.
                If true, groundtruth of duration, pitch and energy will be used.
-            spk_emb(Tensor, optional): Speaker embedding vector (spk_embed_dim,). (Default value = None)
+            spk_emb(Tensor, optional): 
-            spk_id(Tensor(int64), optional): Speaker ids (1,). (Default value = None)
+                Speaker embedding vector (spk_embed_dim,). (Default value = None)
-            tone_id(Tensor(int64), optional): Tone ids (T,). (Default value = None)
+            spk_id(Tensor(int64), optional): 
                Speaker ids (1,). (Default value = None)
            tone_id(Tensor(int64), optional): 
                Tone ids (T,). (Default value = None)
        Returns:
        """
        # input of embedding must be int64
        x = paddle.cast(text, 'int64')
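The `alpha` argument above is described only as controlling the speed. A minimal NumPy sketch of one common way this works, assuming a length regulator that multiplies each token duration by `alpha`, rounds, and repeats that token's hidden vector that many times. `length_regulate` is a hypothetical helper for illustration, not the PaddleSpeech API.

```python
import numpy as np


def length_regulate(hs, durations, alpha=1.0):
    """Expand token-level states (T, adim) to frame level.

    Hypothetical helper: each duration is scaled by alpha and rounded,
    then the matching hidden vector is repeated that many times.
    """
    scaled = np.round(np.asarray(durations, dtype=np.float64) * alpha)
    return np.repeat(hs, scaled.astype(np.int64), axis=0)


hs = np.arange(6, dtype=np.float32).reshape(3, 2)  # T=3 tokens, adim=2
durations = [2, 1, 3]

normal = length_regulate(hs, durations)            # 2 + 1 + 3 = 6 frames
slow = length_regulate(hs, durations, alpha=2.0)   # 4 + 2 + 6 = 12 frames
```

Under this assumption, `alpha > 1` lengthens the output (slower speech) and `alpha < 1` shortens it.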
@ -724,8 +799,10 @@ class FastSpeech2(nn.Layer):
        """Integrate speaker embedding with hidden states.
        Args:
-            hs(Tensor): Batch of hidden state sequences (B, Tmax, adim).
+            hs(Tensor): 
-            spk_emb(Tensor): Batch of speaker embeddings (B, spk_embed_dim).
+                Batch of hidden state sequences (B, Tmax, adim).
            spk_emb(Tensor): 
                Batch of speaker embeddings (B, spk_embed_dim).
        Returns:
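The shapes in this docstring, `hs` as (B, Tmax, adim) and `spk_emb` as (B, spk_embed_dim), imply that the speaker embedding must be broadcast over the time axis. A rough NumPy sketch of two plausible integration modes follows. The mode names `"add"` and `"concat"`, the `weight` projection matrix, and the function `integrate_spk_emb` are assumptions for illustration; the diff only says "How to integrate speaker embedding".

```python
import numpy as np


def integrate_spk_emb(hs, spk_emb, weight, mode="add"):
    """Hypothetical sketch of two integration modes.

    hs:      (B, Tmax, adim) hidden states
    spk_emb: (B, spk_embed_dim) speaker embeddings
    weight:  projection matrix; (spk_embed_dim, adim) for "add",
             (adim + spk_embed_dim, adim) for "concat"
    """
    if mode == "add":
        # Project to adim, then broadcast over the time axis.
        return hs + (spk_emb @ weight)[:, None, :]
    if mode == "concat":
        # Repeat the embedding at every time step, concatenate, project back.
        tiled = np.repeat(spk_emb[:, None, :], hs.shape[1], axis=1)
        return np.concatenate([hs, tiled], axis=-1) @ weight
    raise ValueError(f"unsupported mode: {mode}")


rng = np.random.default_rng(0)
B, Tmax, adim, spk_embed_dim = 2, 5, 4, 3
hs = rng.standard_normal((B, Tmax, adim))
spk_emb = rng.standard_normal((B, spk_embed_dim))

w_add = rng.standard_normal((spk_embed_dim, adim))
w_cat = rng.standard_normal((adim + spk_embed_dim, adim))
out_add = integrate_spk_emb(hs, spk_emb, w_add, mode="add")
out_cat = integrate_spk_emb(hs, spk_emb, w_cat, mode="concat")
```

Both modes return a tensor with the same shape as `hs`, so later layers do not need to know which one was used.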
@ -749,8 +826,10 @@ class FastSpeech2(nn.Layer):
        """Integrate tone embedding with hidden states.
        Args:
-            hs(Tensor): Batch of hidden state sequences (B, Tmax, adim).
+            hs(Tensor): 
-            tone_embs(Tensor): Batch of tone embeddings (B, Tmax, tone_embed_dim).
+                Batch of hidden state sequences (B, Tmax, adim).
            tone_embs(Tensor): 
                Batch of tone embeddings (B, Tmax, tone_embed_dim).
        Returns:
@ -773,10 +852,12 @@ class FastSpeech2(nn.Layer):
        """Make masks for self-attention.
        Args:
-            ilens(Tensor): Batch of lengths (B,).
+            ilens(Tensor): 
                Batch of lengths (B,).
        Returns:
-            Tensor: Mask tensor for self-attention. dtype=paddle.bool
+            Tensor: 
                Mask tensor for self-attention. dtype=paddle.bool
        Examples:
            >>> ilens = [5, 3]
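The doctest above is cut off by the hunk boundary. As a stand-in, here is a small NumPy sketch of building a boolean self-attention mask from a batch of lengths. It assumes that True marks real (non-padded) positions and that the mask has shape (B, 1, Tmax); neither detail is visible in this diff, and `make_source_mask` is a hypothetical name.

```python
import numpy as np


def make_source_mask(ilens):
    """Return a boolean mask of shape (B, 1, Tmax); True marks real tokens."""
    ilens = np.asarray(ilens)
    positions = np.arange(ilens.max())             # (Tmax,)
    non_pad = positions[None, :] < ilens[:, None]  # (B, Tmax)
    return non_pad[:, None, :]                     # (B, 1, Tmax)


mask = make_source_mask([5, 3])
```

The extra middle axis lets the same mask broadcast across every query position in an attention score matrix.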
@ -858,19 +939,32 @@ class StyleFastSpeech2Inference(FastSpeech2Inference):
        """
        Args:
-            text(Tensor(int64)): Input sequence of characters (T,).
+            text(Tensor(int64)): 
-            durations(paddle.Tensor/np.ndarray, optional (int64)): Groundtruth of duration (T,); if given, this overrides durations_scale and durations_bias.
+                Input sequence of characters (T,).
            durations(paddle.Tensor/np.ndarray, optional (int64)): 
                Groundtruth of duration (T,); if given, this overrides durations_scale and durations_bias.
            durations_scale(int/float, optional): 
            durations_bias(int/float, optional): 
-            pitch(paddle.Tensor/np.ndarray, optional): Groundtruth of token-averaged pitch (T, 1); if given, this overrides pitch_scale and pitch_bias.
+
-            pitch_scale(int/float, optional): In the denormalized Hz domain.
+            pitch(paddle.Tensor/np.ndarray, optional): 
-            pitch_bias(int/float, optional): In the denormalized Hz domain.
+                Groundtruth of token-averaged pitch (T, 1); if given, this overrides pitch_scale and pitch_bias.
-            energy(paddle.Tensor/np.ndarray, optional): Groundtruth of token-averaged energy (T, 1); if given, this overrides energy_scale and energy_bias.
+            pitch_scale(int/float, optional): 
-            energy_scale(int/float, optional): In the denormalized domain.
+                In the denormalized Hz domain.
-            energy_bias(int/float, optional): In the denormalized domain.
+            pitch_bias(int/float, optional): 
-            robot: bool:  (Default value = False)
+                In the denormalized Hz domain.
-            spk_emb: (Default value = None)
+            energy(paddle.Tensor/np.ndarray, optional): 
-            spk_id: (Default value = None)
+                Groundtruth of token-averaged energy (T, 1); if given, this overrides energy_scale and energy_bias.
            energy_scale(int/float, optional): 
                In the denormalized domain.
            energy_bias(int/float, optional): 
                In the denormalized domain.
            robot(bool) (Default value = False):
            spk_emb(Default value = None):
            spk_id(Default value = None):
        Returns:
            Tensor: logmel
@ -949,8 +1043,10 @@ class FastSpeech2Loss(nn.Layer):
                 use_weighted_masking: bool=False):
        """Initialize feed-forward Transformer loss module.
        Args:
-            use_masking (bool): Whether to apply masking for padded part in loss calculation.
+            use_masking (bool): 
-            use_weighted_masking (bool): Whether to apply weighted masking in loss calculation.
+                Whether to apply masking for padded part in loss calculation.
            use_weighted_masking (bool): 
                Whether to apply weighted masking in loss calculation.
        """
        assert check_argument_types()
        super().__init__()
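To make the `use_masking` option concrete, the sketch below computes a loss over non-padded frames only. It is a NumPy illustration under stated assumptions, not the module's implementation: the L1 criterion and the name `masked_l1` are hypothetical, and only the idea of excluding padded positions comes from the docstring.

```python
import numpy as np


def masked_l1(pred, target, lengths):
    """Mean absolute error over non-padded frames only.

    pred, target: (B, Lmax, odim); lengths: (B,) valid frames per item.
    """
    lengths = np.asarray(lengths)
    valid = np.arange(pred.shape[1])[None, :] < lengths[:, None]  # (B, Lmax)
    diff = np.abs(pred - target)[valid]                           # (n_valid, odim)
    return diff.mean()


pred = np.zeros((2, 4, 3))
target = np.ones((2, 4, 3))
target[1, 2:] = 100.0  # padded frames of the second item hold junk values

loss = masked_l1(pred, target, lengths=[4, 2])
unmasked = np.abs(pred - target).mean()
```

Without the mask, the junk values in the padded frames dominate the average, which is the situation masking is meant to avoid.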
@ -982,17 +1078,28 @@ class FastSpeech2Loss(nn.Layer):
        """Calculate forward propagation.
        Args:
-            after_outs(Tensor): Batch of outputs after postnets (B, Lmax, odim).
+            after_outs(Tensor):  
-            before_outs(Tensor): Batch of outputs before postnets (B, Lmax, odim).
+                Batch of outputs after postnets (B, Lmax, odim).
-            d_outs(Tensor): Batch of outputs of duration predictor (B, Tmax).
+            before_outs(Tensor): 
-            p_outs(Tensor): Batch of outputs of pitch predictor (B, Tmax, 1).
+                Batch of outputs before postnets (B, Lmax, odim).
-            e_outs(Tensor): Batch of outputs of energy predictor (B, Tmax, 1).
+            d_outs(Tensor): 
-            ys(Tensor): Batch of target features (B, Lmax, odim).
+                Batch of outputs of duration predictor (B, Tmax).
-            ds(Tensor): Batch of durations (B, Tmax).
+            p_outs(Tensor): 
-            ps(Tensor): Batch of target token-averaged pitch (B, Tmax, 1).
+                Batch of outputs of pitch predictor (B, Tmax, 1).
-            es(Tensor): Batch of target token-averaged energy (B, Tmax, 1).
+            e_outs(Tensor): 
-            ilens(Tensor): Batch of the lengths of each input (B,).
+                Batch of outputs of energy predictor (B, Tmax, 1).
-            olens(Tensor): Batch of the lengths of each target (B,).
+            ys(Tensor): 
                Batch of target features (B, Lmax, odim).
            ds(Tensor): 
                Batch of durations (B, Tmax).
            ps(Tensor): 
                Batch of target token-averaged pitch (B, Tmax, 1).
            es(Tensor): 
                Batch of target token-averaged energy (B, Tmax, 1).
            ilens(Tensor): 
                Batch of the lengths of each input (B,).
            olens(Tensor): 
                Batch of the lengths of each target (B,).
        Returns:
--- a/paddlespeech/t2s/models/hifigan/hifigan.py
+++ b/paddlespeech/t2s/models/hifigan/hifigan.py
@ -50,20 +50,34 @@ class HiFiGANGenerator(nn.Layer):
            init_type: str="xavier_uniform", ):
        """Initialize HiFiGANGenerator module.
        Args:
-            in_channels (int): Number of input channels.
+            in_channels (int): 
-            out_channels (int): Number of output channels.
+                Number of input channels.
-            channels (int): Number of hidden representation channels.
+            out_channels (int): 
-            global_channels (int): Number of global conditioning channels.
+                Number of output channels.
-            kernel_size (int): Kernel size of initial and final conv layer.
+            channels (int): 
-            upsample_scales (list): List of upsampling scales.
+                Number of hidden representation channels.
-            upsample_kernel_sizes (list): List of kernel sizes for upsampling layers.
+            global_channels (int): 
-            resblock_kernel_sizes (list): List of kernel sizes for residual blocks.
+                Number of global conditioning channels.
-            resblock_dilations (list): List of dilation lists for residual blocks.
+            kernel_size (int): 
-            use_additional_convs (bool): Whether to use additional conv layers in residual blocks.
+                Kernel size of initial and final conv layer.
-            bias (bool): Whether to add bias parameter in convolution layers.
+            upsample_scales (list): 
-            nonlinear_activation (str): Activation function module name.
+                List of upsampling scales.
-            nonlinear_activation_params (dict): Hyperparameters for activation function.
+            upsample_kernel_sizes (list): 
-            use_weight_norm (bool): Whether to use weight norm.
+                List of kernel sizes for upsampling layers.
            resblock_kernel_sizes (list): 
                List of kernel sizes for residual blocks.
            resblock_dilations (list): 
                List of dilation lists for residual blocks.
            use_additional_convs (bool): 
                Whether to use additional conv layers in residual blocks.
            bias (bool): 
                Whether to add bias parameter in convolution layers.
            nonlinear_activation (str): 
                Activation function module name.
            nonlinear_activation_params (dict): 
                Hyperparameters for activation function.
            use_weight_norm (bool): 
                Whether to use weight norm.
                If set to true, it will be applied to all of the conv layers.
        """
        super().__init__()
@ -199,9 +213,10 @@ class HiFiGANGenerator(nn.Layer):
    def inference(self, c, g: Optional[paddle.Tensor]=None):
        """Perform inference.
        Args:
-            c (Tensor): Input tensor (T, in_channels).
+            c (Tensor): 
-                normalize_before (bool): Whether to perform normalization.
+                Input tensor (T, in_channels).
-            g (Optional[Tensor]): Global conditioning tensor (global_channels, 1).
+            g (Optional[Tensor]): 
                Global conditioning tensor (global_channels, 1).
        Returns:
            Tensor:
                Output tensor (T * prod(upsample_scales), out_channels).
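Each upsampling layer stretches the time axis by its scale, so the output length is the number of input frames multiplied by the product of all scales. A short sketch of that arithmetic; the scale values below are hypothetical and not taken from this diff.

```python
import math


def output_length(num_frames, upsample_scales):
    """Number of output samples for `num_frames` input frames."""
    return num_frames * math.prod(upsample_scales)


# Hypothetical scales: their product (the hop size) is 8 * 8 * 2 * 2 = 256.
n_samples = output_length(100, [8, 8, 2, 2])
```

The product of the scales should match the hop size used to extract the input features; otherwise the generated waveform will not line up with the frames.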
@ -233,20 +248,33 @@ class HiFiGANPeriodDiscriminator(nn.Layer):
        """Initialize HiFiGANPeriodDiscriminator module.
        Args:
-            in_channels (int): Number of input channels.
+            in_channels (int): 
-            out_channels (int): Number of output channels.
+                Number of input channels.
-            period (int): Period.
+            out_channels (int): 
-            kernel_sizes (list): Kernel sizes of initial conv layers and the final conv layer.
+                Number of output channels.
-            channels (int): Number of initial channels.
+            period (int): 
-            downsample_scales (list): List of downsampling scales.
+                Period.
-            max_downsample_channels (int): Number of maximum downsampling channels.
+            kernel_sizes (list): 
-            use_additional_convs (bool): Whether to use additional conv layers in residual blocks.
+                Kernel sizes of initial conv layers and the final conv layer.
-            bias (bool): Whether to add bias parameter in convolution layers.
+            channels (int): 
-            nonlinear_activation (str): Activation function module name.
+                Number of initial channels.
-            nonlinear_activation_params (dict): Hyperparameters for activation function.
+            downsample_scales (list): 
-            use_weight_norm (bool): Whether to use weight norm.
+                List of downsampling scales.
            max_downsample_channels (int): 
                Number of maximum downsampling channels.
            use_additional_convs (bool): 
                Whether to use additional conv layers in residual blocks.
            bias (bool): 
                Whether to add bias parameter in convolution layers.
            nonlinear_activation (str): 
                Activation function module name.
            nonlinear_activation_params (dict): 
                Hyperparameters for activation function.
            use_weight_norm (bool): 
                Whether to use weight norm.
                If set to true, it will be applied to all of the conv layers.
-            use_spectral_norm (bool): Whether to use spectral norm.
+            use_spectral_norm (bool): 
                Whether to use spectral norm.
                If set to true, it will be applied to all of the conv layers.
        """
        super().__init__()
@ -298,7 +326,8 @@ class HiFiGANPeriodDiscriminator(nn.Layer):
        """Calculate forward propagation.
        Args:
-            c (Tensor): Input tensor (B, in_channels, T).
+            c (Tensor): 
                Input tensor (B, in_channels, T).
        Returns:
            list: List of each layer's tensors.
        """
@ -367,8 +396,10 @@ class HiFiGANMultiPeriodDiscriminator(nn.Layer):
        """Initialize HiFiGANMultiPeriodDiscriminator module.
        Args:
-            periods (list): List of periods.
+            periods (list): 
-            discriminator_params (dict): Parameters for hifi-gan period discriminator module.
+                List of periods.
            discriminator_params (dict): 
                Parameters for hifi-gan period discriminator module.
                The period parameter will be overwritten.
        """
        super().__init__()
@ -385,7 +416,8 @@ class HiFiGANMultiPeriodDiscriminator(nn.Layer):
        """Calculate forward propagation.
        Args:
-            x (Tensor): Input noise signal (B, 1, T).
+            x (Tensor): 
                Input noise signal (B, 1, T).
        Returns:
            List: List of lists, one per discriminator, each holding that discriminator's per-layer output tensors.
        """
@ -417,16 +449,25 @@ class HiFiGANScaleDiscriminator(nn.Layer):
        """Initialize HiFiGAN scale discriminator module.
        Args:
-            in_channels (int): Number of input channels.
+            in_channels (int): 
-            out_channels (int): Number of output channels.
+                Number of input channels.
-            kernel_sizes (list): List of four kernel sizes. The first will be used for the first conv layer,
+            out_channels (int): 
                Number of output channels.
            kernel_sizes (list): 
                List of four kernel sizes. The first will be used for the first conv layer,
                and the second is for downsampling part, and the remaining two are for output layers.
-            channels (int): Initial number of channels for conv layer.
+            channels (int): 
-            max_downsample_channels (int): Maximum number of channels for downsampling layers.
+                Initial number of channels for conv layer.
-            bias (bool): Whether to add bias parameter in convolution layers.
+            max_downsample_channels (int): 
-            downsample_scales (list): List of downsampling scales.
+                Maximum number of channels for downsampling layers.
-            nonlinear_activation (str): Activation function module name.
+            bias (bool): 
-            nonlinear_activation_params (dict): Hyperparameters for activation function.
+                Whether to add bias parameter in convolution layers.
            downsample_scales (list): 
                List of downsampling scales.
            nonlinear_activation (str): 
                Activation function module name.
            nonlinear_activation_params (dict): 
                Hyperparameters for activation function.
            use_weight_norm (bool): Whether to use weight norm.
                If set to true, it will be applied to all of the conv layers.
            use_spectral_norm (bool): Whether to use spectral norm.
@ -614,7 +655,8 @@ class HiFiGANMultiScaleDiscriminator(nn.Layer):
        """Calculate forward propagation.
        Args:
-            x (Tensor): Input noise signal (B, 1, T).
+            x (Tensor): 
                Input noise signal (B, 1, T).
        Returns:
            List: List of lists, one per discriminator, each holding that discriminator's per-layer output tensors.
        """
@ -675,14 +717,21 @@ class HiFiGANMultiScaleMultiPeriodDiscriminator(nn.Layer):
        """Initialize HiFiGAN multi-scale + multi-period discriminator module.
        Args:
-            scales (int): Number of multi-scales.
+            scales (int): 
-            scale_downsample_pooling (str): Pooling module name for downsampling of the inputs.
+                Number of multi-scales.
-            scale_downsample_pooling_params (dict): Parameters for the above pooling module.
+            scale_downsample_pooling (str): 
-            scale_discriminator_params (dict): Parameters for hifi-gan scale discriminator module.
+                Pooling module name for downsampling of the inputs.
-            follow_official_norm (bool): Whether to follow the norm setting of the official implementation.
+            scale_downsample_pooling_params (dict): 
                Parameters for the above pooling module.
            scale_discriminator_params (dict): 
                Parameters for hifi-gan scale discriminator module.
            follow_official_norm (bool): 
                Whether to follow the norm setting of the official implementation.
                The first discriminator uses spectral norm and the other discriminators use weight norm.
-            periods (list): List of periods.
+            periods (list): 
-            period_discriminator_params (dict): Parameters for hifi-gan period discriminator module.
+                List of periods.
            period_discriminator_params (dict): 
                Parameters for hifi-gan period discriminator module.
                The period parameter will be overwritten.
        """
        super().__init__()
@ -704,7 +753,8 @@ class HiFiGANMultiScaleMultiPeriodDiscriminator(nn.Layer):
        """Calculate forward propagation.
        Args:
-            x (Tensor): Input noise signal (B, 1, T).
+            x (Tensor): 
                Input noise signal (B, 1, T).
        Returns:
            List:
                List of list of each discriminator outputs,
--- a/paddlespeech/t2s/models/melgan/melgan.py
+++ b/paddlespeech/t2s/models/melgan/melgan.py
@ -53,24 +53,38 @@ class MelGANGenerator(nn.Layer):
        """Initialize MelGANGenerator module.
        Args:
-            in_channels (int): Number of input channels.
+            in_channels (int): 
-            out_channels (int): Number of output channels,
+                Number of input channels.
            out_channels (int): 
                Number of output channels,
                the number of sub-band is out_channels in multi-band melgan.
-            kernel_size (int): Kernel size of initial and final conv layer.
+            kernel_size (int): 
-            channels (int): Initial number of channels for conv layer.
+                Kernel size of initial and final conv layer.
-            bias (bool): Whether to add bias parameter in convolution layers.
+            channels (int): 
-            upsample_scales (List[int]): List of upsampling scales.
+                Initial number of channels for conv layer.
-            stack_kernel_size (int): Kernel size of dilated conv layers in residual stack.
+            bias (bool): 
-            stacks (int): Number of stacks in a single residual stack.
+                Whether to add bias parameter in convolution layers.
-            nonlinear_activation (Optional[str], optional): Nonlinear activation in upsample network, by default None
+            upsample_scales (List[int]): 
-            nonlinear_activation_params (Dict[str, Any], optional): Parameters passed to the nonlinear activation in the upsample network, 
+                List of upsampling scales.
-                by default {}
+            stack_kernel_size (int): 
-            pad (str): Padding function module name before dilated convolution layer.
+                Kernel size of dilated conv layers in residual stack.
-            pad_params (dict): Hyperparameters for padding function.
+            stacks (int): 
-            use_final_nonlinear_activation (nn.Layer): Activation function for the final layer.
+                Number of stacks in a single residual stack.
-            use_weight_norm (bool): Whether to use weight norm.
+            nonlinear_activation (Optional[str], optional): 
                Nonlinear activation in upsample network, by default None
            nonlinear_activation_params (Dict[str, Any], optional): 
                Parameters passed to the nonlinear activation in the upsample network, by default {}
            pad (str): 
                Padding function module name before dilated convolution layer.
            pad_params (dict): 
                Hyperparameters for padding function.
            use_final_nonlinear_activation (nn.Layer): 
                Activation function for the final layer.
            use_weight_norm (bool): 
                Whether to use weight norm.
                If set to true, it will be applied to all of the conv layers.
-            use_causal_conv (bool): Whether to use causal convolution.
+            use_causal_conv (bool):
                Whether to use causal convolution.
        """
        super().__init__()
@ -194,7 +208,8 @@ class MelGANGenerator(nn.Layer):
        """Calculate forward propagation.
        Args:
-            c (Tensor): Input tensor (B, in_channels, T).
+            c (Tensor): 
                Input tensor (B, in_channels, T).
        Returns:
            Tensor: Output tensor (B, out_channels, T * prod(upsample_scales)).
        """
@ -244,7 +259,8 @@ class MelGANGenerator(nn.Layer):
        """Perform inference.
        Args:
-            c (Union[Tensor, ndarray]): Input tensor (T, in_channels).
+            c (Union[Tensor, ndarray]): 
                Input tensor (T, in_channels).
        Returns:
            Tensor: Output tensor (out_channels * T * prod(upsample_scales), 1).
        """
@ -279,20 +295,30 @@ class MelGANDiscriminator(nn.Layer):
        """Initialize MelGAN discriminator module.
        Args:
-            in_channels (int): Number of input channels.
+            in_channels (int): 
-            out_channels (int): Number of output channels.
+                Number of input channels.
            out_channels (int): 
                Number of output channels.
            kernel_sizes (List[int]): List of two kernel sizes. The prod will be used for the first conv layer,
                and the first and the second kernel sizes will be used for the last two layers.
                For example if kernel_sizes = [5, 3], the first layer kernel size will be 5 * 3 = 15,
                the last two layers' kernel size will be 5 and 3, respectively.
-            channels (int): Initial number of channels for conv layer.
+            channels (int): 
-            max_downsample_channels (int): Maximum number of channels for downsampling layers.
+                Initial number of channels for conv layer.
-            bias (bool): Whether to add bias parameter in convolution layers.
+            max_downsample_channels (int): 
-            downsample_scales (List[int]): List of downsampling scales.
+                Maximum number of channels for downsampling layers.
-            nonlinear_activation (str): Activation function module name.
+            bias (bool): 
-            nonlinear_activation_params (dict): Hyperparameters for activation function.
+                Whether to add bias parameter in convolution layers.
-            pad (str): Padding function module name before dilated convolution layer.
+            downsample_scales (List[int]): 
-            pad_params (dict): Hyperparameters for padding function.
+                List of downsampling scales.
            nonlinear_activation (str): 
                Activation function module name.
            nonlinear_activation_params (dict): 
                Hyperparameters for activation function.
            pad (str): 
                Padding function module name before dilated convolution layer.
            pad_params (dict): 
                Hyperparameters for padding function.
        """
        super().__init__()
@ -364,7 +390,8 @@ class MelGANDiscriminator(nn.Layer):
    def forward(self, x):
        """Calculate forward propagation.
        Args:
-            x (Tensor): Input noise signal (B, 1, T).
+            x (Tensor): 
                Input noise signal (B, 1, T).
        Returns:
            List: List of output tensors of each layer (for feat_match_loss).
        """
@ -406,22 +433,37 @@ class MelGANMultiScaleDiscriminator(nn.Layer):
        """Initialize MelGAN multi-scale discriminator module.
        Args:
-            in_channels (int): Number of input channels.
+            in_channels (int): 
-            out_channels (int): Number of output channels.
+                Number of input channels.
-            scales (int): Number of multi-scales.
+            out_channels (int): 
-            downsample_pooling (str): Pooling module name for downsampling of the inputs.
+                Number of output channels.
-            downsample_pooling_params (dict): Parameters for the above pooling module.
+            scales (int): 
-            kernel_sizes (List[int]): List of two kernel sizes. The sum will be used for the first conv layer,
+                Number of multi-scales.
            downsample_pooling (str): 
                Pooling module name for downsampling of the inputs.
            downsample_pooling_params (dict): 
                Parameters for the above pooling module.
            kernel_sizes (List[int]): 
                List of two kernel sizes. The product will be used for the first conv layer,
                and the first and the second kernel sizes will be used for the last two layers.
-            channels (int): Initial number of channels for conv layer.
+            channels (int): 
-            max_downsample_channels (int): Maximum number of channels for downsampling layers.
+                Initial number of channels for conv layer.
-            bias (bool): Whether to add bias parameter in convolution layers.
+            max_downsample_channels (int): 
-            downsample_scales (List[int]): List of downsampling scales.
+                Maximum number of channels for downsampling layers.
-            nonlinear_activation (str): Activation function module name.
+            bias (bool): 
-            nonlinear_activation_params (dict): Hyperparameters for activation function.
+                Whether to add bias parameter in convolution layers.
-            pad (str): Padding function module name before dilated convolution layer.
+            downsample_scales (List[int]): 
-            pad_params (dict): Hyperparameters for padding function.
+                List of downsampling scales.
-            use_causal_conv (bool): Whether to use causal convolution.
+            nonlinear_activation (str): 
                Activation function module name.
            nonlinear_activation_params (dict): 
                Hyperparameters for activation function.
            pad (str): 
                Padding function module name before dilated convolution layer.
            pad_params (dict): 
                Hyperparameters for padding function.
            use_causal_conv (bool): 
                Whether to use causal convolution.
        """
        super().__init__()
@ -464,7 +506,8 @@ class MelGANMultiScaleDiscriminator(nn.Layer):
    def forward(self, x):
        """Calculate forward propagation.
        Args:
-            x (Tensor): Input noise signal (B, 1, T).
+            x (Tensor):
                Input noise signal (B, 1, T).
        Returns:
            List: List of lists, one per discriminator, each containing that discriminator's layer output tensors.
        """
--- a/paddlespeech/t2s/models/melgan/style_melgan.py
+++ b/paddlespeech/t2s/models/melgan/style_melgan.py
@ -54,20 +54,34 @@ class StyleMelGANGenerator(nn.Layer):
        """Initialize Style MelGAN generator.
        Args:
-            in_channels (int): Number of input noise channels.
+            in_channels (int): 
-            aux_channels (int): Number of auxiliary input channels.
+                Number of input noise channels.
-            channels (int): Number of channels for conv layer.
+            aux_channels (int): 
-            out_channels (int): Number of output channels.
+                Number of auxiliary input channels.
-            kernel_size (int): Kernel size of conv layers.
+            channels (int): 
-            dilation (int): Dilation factor for conv layers.
+                Number of channels for conv layer.
-            bias (bool): Whether to add bias parameter in convolution layers.
+            out_channels (int): 
-            noise_upsample_scales (list): List of noise upsampling scales.
+                Number of output channels.
-            noise_upsample_activation (str): Activation function module name for noise upsampling.
+            kernel_size (int): 
-            noise_upsample_activation_params (dict): Hyperparameters for the above activation function.
+                Kernel size of conv layers.
-            upsample_scales (list): List of upsampling scales.
+            dilation (int): 
-            upsample_mode (str): Upsampling mode in TADE layer.
+                Dilation factor for conv layers.
-            gated_function (str): Gated function in TADEResBlock ("softmax" or "sigmoid").
+            bias (bool): 
-            use_weight_norm (bool): Whether to use weight norm.
+                Whether to add bias parameter in convolution layers.
            noise_upsample_scales (list): 
                List of noise upsampling scales.
            noise_upsample_activation (str): 
                Activation function module name for noise upsampling.
            noise_upsample_activation_params (dict): 
                Hyperparameters for the above activation function.
            upsample_scales (list): 
                List of upsampling scales.
            upsample_mode (str): 
                Upsampling mode in TADE layer.
            gated_function (str): 
                Gated function in TADEResBlock ("softmax" or "sigmoid").
            use_weight_norm (bool): 
                Whether to use weight norm.
                If set to true, it will be applied to all of the conv layers.
        """
        super().__init__()
@ -194,7 +208,8 @@ class StyleMelGANGenerator(nn.Layer):
    def inference(self, c):
        """Perform inference.
        Args:
-            c (Tensor): Input tensor (T, in_channels).
+            c (Tensor): 
                Input tensor (T, in_channels).
        Returns:
            Tensor: Output tensor (T * prod(upsample_scales), out_channels).
        """
@ -258,11 +273,16 @@ class StyleMelGANDiscriminator(nn.Layer):
        """Initialize Style MelGAN discriminator.
        Args:
-            repeats (int): Number of repititons to apply RWD.
+            repeats (int): 
-            window_sizes (list): List of random window sizes.
+                Number of repetitions to apply RWD.
-            pqmf_params (list): List of list of Parameters for PQMF modules
+            window_sizes (list): 
-            discriminator_params (dict): Parameters for base discriminator module.
+                List of random window sizes.
-            use_weight_nom (bool): Whether to apply weight normalization.
+            pqmf_params (list): 
                List of lists of parameters for PQMF modules.
            discriminator_params (dict): 
                Parameters for base discriminator module.
            use_weight_norm (bool): 
                Whether to apply weight normalization.
        """
        super().__init__()
@ -299,7 +319,8 @@ class StyleMelGANDiscriminator(nn.Layer):
    def forward(self, x):
        """Calculate forward propagation.
        Args:
-            x (Tensor): Input tensor (B, 1, T).
+            x (Tensor): 
                Input tensor (B, 1, T).
        Returns:
            List: List of discriminator outputs, #items in the list will be
                equal to repeats * #discriminators.
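
The shape arithmetic described in the MelGAN docstrings above can be checked with a minimal sketch. The helper names below are hypothetical and not part of PaddleSpeech. The sketch assumes only what the docstrings state: the first discriminator layer uses the product of the two kernel sizes, and upsampling multiplies the input length by the product of `upsample_scales`.

```python
from math import prod
from typing import List, Tuple


def discriminator_kernel_sizes(kernel_sizes: List[int]) -> Tuple[int, int, int]:
    """Hypothetical helper: kernel sizes for the first and the last two conv layers."""
    if len(kernel_sizes) != 2:
        raise ValueError("kernel_sizes must contain exactly two values")
    first, second = kernel_sizes
    # First layer uses the product; the last two layers use each size in turn.
    return first * second, first, second


def upsampled_length(num_frames: int, upsample_scales: List[int]) -> int:
    """Hypothetical helper: output length after applying every upsampling scale."""
    return num_frames * prod(upsample_scales)


print(discriminator_kernel_sizes([5, 3]))  # (15, 5, 3)
print(upsampled_length(10, [8, 8, 2, 2]))  # 2560
```

The `upsample_scales` value `[8, 8, 2, 2]` is only an illustrative choice; the scales in an actual config may differ.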
--- a/paddlespeech/t2s/models/parallel_wavegan/parallel_wavegan.py
+++ b/paddlespeech/t2s/models/parallel_wavegan/parallel_wavegan.py
@ -32,29 +32,45 @@ class PWGGenerator(nn.Layer):
    """Wave Generator for Parallel WaveGAN
    Args:
-        in_channels (int, optional): Number of channels of the input waveform, by default 1
+        in_channels (int, optional): 
-        out_channels (int, optional): Number of channels of the output waveform, by default 1
+            Number of channels of the input waveform, by default 1
-        kernel_size (int, optional): Kernel size of the residual blocks inside, by default 3
+        out_channels (int, optional): 
-        layers (int, optional): Number of residual blocks inside, by default 30
+            Number of channels of the output waveform, by default 1
-        stacks (int, optional): The number of groups to split the residual blocks into, by default 3
+        kernel_size (int, optional): 
            Kernel size of the residual blocks inside, by default 3
        layers (int, optional): 
            Number of residual blocks inside, by default 30
        stacks (int, optional):
            The number of groups to split the residual blocks into, by default 3
            Within each group, the dilation of the residual block grows exponentially.
-        residual_channels (int, optional): Residual channel of the residual blocks, by default 64
+        residual_channels (int, optional): 
-        gate_channels (int, optional): Gate channel of the residual blocks, by default 128
+            Residual channel of the residual blocks, by default 64
-        skip_channels (int, optional): Skip channel of the residual blocks, by default 64
+        gate_channels (int, optional): 
-        aux_channels (int, optional): Auxiliary channel of the residual blocks, by default 80
+            Gate channel of the residual blocks, by default 128
-        aux_context_window (int, optional): The context window size of the first convolution applied to the 
+        skip_channels (int, optional): 
-            auxiliary input, by default 2
+            Skip channel of the residual blocks, by default 64
-        dropout (float, optional): Dropout of the residual blocks, by default 0.
+        aux_channels (int, optional): 
-        bias (bool, optional): Whether to use bias in residual blocks, by default True
+            Auxiliary channel of the residual blocks, by default 80
-        use_weight_norm (bool, optional): Whether to use weight norm in all convolutions, by default True
+        aux_context_window (int, optional): 
-        use_causal_conv (bool, optional): Whether to use causal padding in the upsample network and residual 
+            The context window size of the first convolution applied to the auxiliary input, by default 2
-            blocks, by default False
+        dropout (float, optional): 
-        upsample_scales (List[int], optional): Upsample scales of the upsample network, by default [4, 4, 4, 4]
+            Dropout of the residual blocks, by default 0.
-        nonlinear_activation (Optional[str], optional): Non linear activation in upsample network, by default None
+        bias (bool, optional): 
-        nonlinear_activation_params (Dict[str, Any], optional): Parameters passed to the linear activation in the upsample network, 
+            Whether to use bias in residual blocks, by default True
-            by default {}
+        use_weight_norm (bool, optional): 
-        interpolate_mode (str, optional): Interpolation mode of the upsample network, by default "nearest"
+            Whether to use weight norm in all convolutions, by default True
-        freq_axis_kernel_size (int, optional): Kernel size along the frequency axis of the upsample network, by default 1
+        use_causal_conv (bool, optional): 
            Whether to use causal padding in the upsample network and residual blocks, by default False
        upsample_scales (List[int], optional): 
            Upsample scales of the upsample network, by default [4, 4, 4, 4]
        nonlinear_activation (Optional[str], optional): 
            Nonlinear activation in upsample network, by default None
        nonlinear_activation_params (Dict[str, Any], optional): 
            Parameters passed to the nonlinear activation in the upsample network, by default {}
        interpolate_mode (str, optional): 
            Interpolation mode of the upsample network, by default "nearest"
        freq_axis_kernel_size (int, optional): 
            Kernel size along the frequency axis of the upsample network, by default 1
    """
    def __init__(
@ -147,9 +163,11 @@ class PWGGenerator(nn.Layer):
        """Generate waveform.
        Args:
-            x(Tensor): Shape (N, C_in, T), The input waveform.
+            x(Tensor): 
-            c(Tensor): Shape (N, C_aux, T'). The auxiliary input (e.g. spectrogram). It
+                Shape (N, C_in, T), The input waveform.
-            is upsampled to match the time resolution of the input.
+            c(Tensor): 
                Shape (N, C_aux, T'). The auxiliary input (e.g. spectrogram). 
                It is upsampled to match the time resolution of the input.
        Returns:
            Tensor: Shape (N, C_out, T), the generated waveform.
@ -195,8 +213,10 @@ class PWGGenerator(nn.Layer):
        """Waveform generation. This function is used for single instance inference.
        Args:
-            c(Tensor, optional, optional): Shape (T', C_aux), the auxiliary input, by default None
+            c(Tensor, optional): 
-            x(Tensor, optional): Shape (T, C_in), the noise waveform, by default None
+                Shape (T', C_aux), the auxiliary input, by default None
            x(Tensor, optional): 
                Shape (T, C_in), the noise waveform, by default None
        Returns:
            Tensor: Shape (T, C_out), the generated waveform
@ -214,20 +234,28 @@ class PWGDiscriminator(nn.Layer):
    """A convolutional discriminator for audio.
    Args:
-        in_channels (int, optional): Number of channels of the input audio, by default 1
+        in_channels (int, optional): 
-        out_channels (int, optional): Output feature size, by default 1
+            Number of channels of the input audio, by default 1
-        kernel_size (int, optional): Kernel size of convolutional sublayers, by default 3
+        out_channels (int, optional): 
-        layers (int, optional): Number of layers, by default 10
+            Output feature size, by default 1
-        conv_channels (int, optional): Feature size of the convolutional sublayers, by default 64
+        kernel_size (int, optional): 
-        dilation_factor (int, optional): The factor with which dilation of each convolutional sublayers grows 
+            Kernel size of convolutional sublayers, by default 3
        layers (int, optional): 
            Number of layers, by default 10
        conv_channels (int, optional): 
            Feature size of the convolutional sublayers, by default 64
        dilation_factor (int, optional): 
            The factor by which the dilation of each convolutional sublayer grows 
            exponentially if it is greater than 1; otherwise the dilation grows linearly, 
            by default 1
-        nonlinear_activation (str, optional): The activation after each convolutional sublayer, by default "leakyrelu"
+        nonlinear_activation (str, optional): 
-        nonlinear_activation_params (Dict[str, Any], optional): The parameters passed to the activation's initializer, by default 
+            The activation after each convolutional sublayer, by default "leakyrelu"
-            {"negative_slope": 0.2}
+        nonlinear_activation_params (Dict[str, Any], optional): 
-        bias (bool, optional): Whether to use bias in convolutional sublayers, by default True
+            The parameters passed to the activation's initializer, by default {"negative_slope": 0.2}
-        use_weight_norm (bool, optional): Whether to use weight normalization at all convolutional sublayers, 
+        bias (bool, optional): 
-            by default True
+            Whether to use bias in convolutional sublayers, by default True
        use_weight_norm (bool, optional): 
            Whether to use weight normalization at all convolutional sublayers, by default True
    """
    def __init__(
@ -290,7 +318,8 @@ class PWGDiscriminator(nn.Layer):
        """
        Args:
-            x (Tensor): Shape (N, in_channels, num_samples), the input audio.
+            x (Tensor): 
                Shape (N, in_channels, num_samples), the input audio.
        Returns:
            Tensor: Shape (N, out_channels, num_samples), the predicted logits.
@ -318,24 +347,35 @@ class ResidualPWGDiscriminator(nn.Layer):
    """A wavenet-style discriminator for audio.
    Args:
-        in_channels (int, optional): Number of channels of the input audio, by default 1
+        in_channels (int, optional): 
-        out_channels (int, optional): Output feature size, by default 1
+            Number of channels of the input audio, by default 1
-        kernel_size (int, optional): Kernel size of residual blocks, by default 3
+        out_channels (int, optional): 
-        layers (int, optional): Number of residual blocks, by default 30
+            Output feature size, by default 1
-        stacks (int, optional): Number of groups of residual blocks, within which the dilation 
+        kernel_size (int, optional): 
            Kernel size of residual blocks, by default 3
        layers (int, optional): 
            Number of residual blocks, by default 30
        stacks (int, optional): 
            Number of groups of residual blocks, within which the dilation 
            of each residual block grows exponentially, by default 3
-        residual_channels (int, optional): Residual channels of residual blocks, by default 64
+        residual_channels (int, optional): 
-        gate_channels (int, optional): Gate channels of residual blocks, by default 128
+            Residual channels of residual blocks, by default 64
-        skip_channels (int, optional): Skip channels of residual blocks, by default 64
+        gate_channels (int, optional): 
-        dropout (float, optional): Dropout probability of residual blocks, by default 0.
+            Gate channels of residual blocks, by default 128
-        bias (bool, optional): Whether to use bias in residual blocks, by default True
+        skip_channels (int, optional): 
-        use_weight_norm (bool, optional): Whether to use weight normalization in all convolutional layers, 
+            Skip channels of residual blocks, by default 64
-            by default True
+        dropout (float, optional): 
-        use_causal_conv (bool, optional): Whether to use causal convolution in residual blocks, by default False
+            Dropout probability of residual blocks, by default 0.
-        nonlinear_activation (str, optional): Activation after convolutions other than those in residual blocks, 
+        bias (bool, optional): 
-            by default "leakyrelu"
+            Whether to use bias in residual blocks, by default True
-        nonlinear_activation_params (Dict[str, Any], optional): Parameters to pass to the activation, 
+        use_weight_norm (bool, optional): 
-            by default {"negative_slope": 0.2}
+            Whether to use weight normalization in all convolutional layers, by default True
        use_causal_conv (bool, optional): 
            Whether to use causal convolution in residual blocks, by default False
        nonlinear_activation (str, optional): 
            Activation after convolutions other than those in residual blocks, by default "leakyrelu"
        nonlinear_activation_params (Dict[str, Any], optional): 
            Parameters to pass to the activation, by default {"negative_slope": 0.2}
    """
    def __init__(
@ -405,7 +445,8 @@ class ResidualPWGDiscriminator(nn.Layer):
    def forward(self, x):
        """
        Args:
-            x(Tensor): Shape (N, in_channels, num_samples), the input audio.
+            x(Tensor): 
                Shape (N, in_channels, num_samples), the input audio.
        Returns:
            Tensor: Shape (N, out_channels, num_samples), the predicted logits.
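
The `stacks` and `dilation_factor` descriptions above can be read as dilation schedules. The following is a minimal sketch of one such reading, not the PaddleSpeech implementation. The function names, the exponential base of 2, and the exact form of the linear schedule are assumptions made for illustration.

```python
from typing import List


def generator_dilations(layers: int = 30, stacks: int = 3, base: int = 2) -> List[int]:
    """Hypothetical helper: dilation per residual block.

    The blocks are split into `stacks` groups, and the dilation restarts at 1
    and grows exponentially inside each group. The base of 2 is an assumption.
    """
    if layers % stacks != 0:
        raise ValueError("layers must be divisible by stacks")
    layers_per_stack = layers // stacks
    return [base ** (i % layers_per_stack) for i in range(layers)]


def discriminator_dilations(num_convs: int, dilation_factor: int = 1) -> List[int]:
    """Hypothetical helper: dilation per convolutional sublayer.

    Exponential growth when dilation_factor > 1, linear growth otherwise.
    Clamping the first linear value to 1 is an assumption.
    """
    if dilation_factor > 1:
        return [dilation_factor ** i for i in range(num_convs)]
    return [max(1, i) for i in range(num_convs)]


print(generator_dilations()[:11])      # [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1]
print(discriminator_dilations(4))      # [1, 1, 2, 3]
print(discriminator_dilations(4, 2))   # [1, 2, 4, 8]
```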
--- a/paddlespeech/t2s/models/speedyspeech/speedyspeech.py
+++ b/paddlespeech/t2s/models/speedyspeech/speedyspeech.py
@ -29,10 +29,14 @@ class ResidualBlock(nn.Layer):
                 n: int=2):
        """SpeedySpeech encoder module.
        Args:
-            channels (int, optional): Feature size of the residual output(and also the input).
+            channels (int, optional): 
-            kernel_size (int, optional): Kernel size of the 1D convolution.
+                Feature size of the residual output (and also the input).
-            dilation (int, optional): Dilation of the 1D convolution.
+            kernel_size (int, optional): 
-            n (int): Number of blocks.
+                Kernel size of the 1D convolution.
            dilation (int, optional): 
                Dilation of the 1D convolution.
            n (int): 
                Number of blocks.
        """
        super().__init__()
@ -57,7 +61,8 @@ class ResidualBlock(nn.Layer):
    def forward(self, x: paddle.Tensor):
        """Calculate forward propagation.
        Args:
-            x(Tensor): Batch of input sequences (B, hidden_size, Tmax).
+            x(Tensor): 
                Batch of input sequences (B, hidden_size, Tmax).
        Returns:
            Tensor: The residual output (B, hidden_size, Tmax).
        """
@ -89,8 +94,10 @@ class TextEmbedding(nn.Layer):
    def forward(self, text: paddle.Tensor, tone: paddle.Tensor=None):
        """Calculate forward propagation.
        Args:
-            text(Tensor(int64)): Batch of padded token ids (B, Tmax).
+            text(Tensor(int64)): 
-            tones(Tensor, optional(int64)): Batch of padded tone ids (B, Tmax).
+                Batch of padded token ids (B, Tmax).
            tone(Tensor, optional(int64)): 
                Batch of padded tone ids (B, Tmax).
        Returns:
            Tensor: The residual output (B, Tmax, embedding_size).
        """
@ -109,12 +116,18 @@ class TextEmbedding(nn.Layer):
 class SpeedySpeechEncoder(nn.Layer):
    """SpeedySpeech encoder module.
    Args:
-        vocab_size (int): Dimension of the inputs.
+        vocab_size (int): 
-        tone_size (Optional[int]): Number of tones.
+            Dimension of the inputs.
-        hidden_size (int): Number of encoder hidden units.
+        tone_size (Optional[int]): 
-        kernel_size (int): Kernel size of encoder.
+            Number of tones.
-        dilations (List[int]): Dilations of encoder.
+        hidden_size (int): 
-        spk_num (Optional[int]): Number of speakers. 
+            Number of encoder hidden units.
        kernel_size (int): 
            Kernel size of encoder.
        dilations (List[int]): 
            Dilations of encoder.
        spk_num (Optional[int]): 
            Number of speakers. 
    """
    def __init__(self,
@ -161,9 +174,12 @@ class SpeedySpeechEncoder(nn.Layer):
                spk_id: paddle.Tensor=None):
        """Encoder input sequence.
        Args:
-            text(Tensor(int64)): Batch of padded token ids (B, Tmax).
+            text(Tensor(int64)): 
-            tones(Tensor, optional(int64)): Batch of padded tone ids (B, Tmax).
+                Batch of padded token ids (B, Tmax).
-            spk_id(Tnesor, optional(int64)): Batch of speaker ids (B,)
+            tones(Tensor, optional(int64)): 
                Batch of padded tone ids (B, Tmax).
            spk_id(Tensor, optional(int64)): 
                Batch of speaker ids (B,)
        Returns:
            Tensor: Output tensor (B, Tmax, hidden_size).
@ -192,7 +208,8 @@ class DurationPredictor(nn.Layer):
    def forward(self, x: paddle.Tensor):
        """Calculate forward propagation.
        Args:
-            x(Tensor): Batch of input sequences (B, Tmax, hidden_size).
+            x(Tensor): 
                Batch of input sequences (B, Tmax, hidden_size).
        Returns:
            Tensor: Batch of predicted durations in log domain (B, Tmax).
@ -212,10 +229,14 @@ class SpeedySpeechDecoder(nn.Layer):
                 ]):
        """SpeedySpeech decoder module.
        Args:
-            hidden_size (int): Number of decoder hidden units.
+            hidden_size (int): 
-            kernel_size (int): Kernel size of decoder.
+                Number of decoder hidden units.
-            output_size (int): Dimension of the outputs.
+            kernel_size (int): 
-            dilations (List[int]): Dilations of decoder.
+                Kernel size of decoder.
            output_size (int): 
                Dimension of the outputs.
            dilations (List[int]): 
                Dilations of decoder.
        """
        super().__init__()
        res_blocks = [
@ -230,7 +251,8 @@ class SpeedySpeechDecoder(nn.Layer):
    def forward(self, x):
        """Decoder input sequence.
        Args:
-            x(Tensor): Input tensor (B, time, hidden_size).
+            x(Tensor): 
                Input tensor (B, time, hidden_size).
        Returns:
            Tensor: Output tensor (B, time, output_size).
@ -261,18 +283,30 @@ class SpeedySpeech(nn.Layer):
            positional_dropout_rate: int=0.1):
        """Initialize SpeedySpeech module.
        Args:
-            vocab_size (int): Dimension of the inputs.
+            vocab_size (int): 
-            encoder_hidden_size (int): Number of encoder hidden units.
+                Dimension of the inputs.
-            encoder_kernel_size (int): Kernel size of encoder.
+            encoder_hidden_size (int): 
-            encoder_dilations (List[int]): Dilations of encoder.
+                Number of encoder hidden units.
-            duration_predictor_hidden_size (int): Number of duration predictor hidden units.
+            encoder_kernel_size (int): 
-            decoder_hidden_size (int): Number of decoder hidden units.
+                Kernel size of encoder.
-            decoder_kernel_size (int): Kernel size of decoder.
+            encoder_dilations (List[int]): 
-            decoder_dilations (List[int]): Dilations of decoder.
+                Dilations of encoder.
-            decoder_output_size (int): Dimension of the outputs.
+            duration_predictor_hidden_size (int):
-            tone_size (Optional[int]): Number of tones.
+                Number of duration predictor hidden units.
-            spk_num (Optional[int]): Number of speakers. 
+            decoder_hidden_size (int): 
-            init_type (str): How to initialize transformer parameters.
+                Number of decoder hidden units.
            decoder_kernel_size (int): 
                Kernel size of decoder.
            decoder_dilations (List[int]): 
                Dilations of decoder.
            decoder_output_size (int): 
                Dimension of the outputs.
            tone_size (Optional[int]): 
                Number of tones.
            spk_num (Optional[int]): 
                Number of speakers. 
            init_type (str): 
                How to initialize transformer parameters.
        """
        super().__init__()
@ -304,14 +338,20 @@ class SpeedySpeech(nn.Layer):
                spk_id: paddle.Tensor=None):
        """Calculate forward propagation.
        Args:
-            text(Tensor(int64)): Batch of padded token ids (B, Tmax).
+            text(Tensor(int64)): 
-            durations(Tensor(int64)): Batch of padded durations (B, Tmax).
+                Batch of padded token ids (B, Tmax).
-            tones(Tensor, optional(int64)): Batch of padded tone ids  (B, Tmax).
+            durations(Tensor(int64)): 
-            spk_id(Tnesor, optional(int64)): Batch of speaker ids (B,)
+                Batch of padded durations (B, Tmax).
            tones(Tensor, optional(int64)): 
                Batch of padded tone ids (B, Tmax).
            spk_id(Tensor, optional(int64)): 
                Batch of speaker ids (B,)
        Returns:
-            Tensor: Output tensor (B, T_frames, decoder_output_size).
+            Tensor: 
-            Tensor: Predicted durations (B, Tmax).
+                Output tensor (B, T_frames, decoder_output_size).
            Tensor: 
                Predicted durations (B, Tmax).
        """
        # input of embedding must be int64
        text = paddle.cast(text, 'int64')
@ -336,10 +376,14 @@ class SpeedySpeech(nn.Layer):
                  spk_id: paddle.Tensor=None):
        """Generate the sequence of features given the sequences of characters.
        Args:
-            text(Tensor(int64)): Input sequence of characters (T,).
+            text(Tensor(int64)): 
-            tones(Tensor, optional(int64)): Batch of padded tone ids (T, ).
+                Input sequence of characters (T,).
-            durations(Tensor, optional (int64)): Groundtruth of duration (T,).
+            tones(Tensor, optional(int64)): 
-            spk_id(Tensor, optional(int64), optional): spk ids (1,). (Default value = None)
+                Batch of padded tone ids (T, ).
            durations(Tensor, optional (int64)): 
                Ground-truth durations (T,).
            spk_id(Tensor, optional(int64)): 
                Speaker ids (1,), by default None
        Returns:
            Tensor: logmel (T, decoder_output_size).
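
The SpeedySpeech docstrings above take token-level `durations` of length `Tmax` and return frame-level output of length `T_frames`. One common way to relate the two is to repeat each encoder output by its duration. The sketch below illustrates that idea in plain Python. The function name is hypothetical, and whether the model expands its encodings exactly this way is an assumption.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def expand_by_durations(encodings: Sequence[T], durations: Sequence[int]) -> List[T]:
    """Hypothetical helper: repeat each token encoding `duration` times."""
    if len(encodings) != len(durations):
        raise ValueError("encodings and durations must have the same length")
    frames: List[T] = []
    for encoding, duration in zip(encodings, durations):
        if duration < 0:
            raise ValueError("durations must be non-negative")
        # A duration of 0 drops the token from the frame-level sequence.
        frames.extend([encoding] * duration)
    return frames


tokens = ["a", "b", "c"]
frames = expand_by_durations(tokens, [2, 0, 3])
print(frames)       # ['a', 'a', 'c', 'c', 'c']
print(len(frames))  # 5, the sum of the durations
```

Under this reading, `T_frames` equals the sum of the durations for each utterance.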
--- a/paddlespeech/t2s/models/tacotron2/tacotron2.py
+++ b/paddlespeech/t2s/models/tacotron2/tacotron2.py
@ -83,38 +83,67 @@ class Tacotron2(nn.Layer):
            init_type: str="xavier_uniform", ):
        """Initialize Tacotron2 module.
        Args:
-            idim (int): Dimension of the inputs.
+            idim (int): 
-            odim (int): Dimension of the outputs.
+                Dimension of the inputs.
-            embed_dim (int): Dimension of the token embedding.
+            odim (int): 
-            elayers (int): Number of encoder blstm layers.
+                Dimension of the outputs.
-            eunits (int): Number of encoder blstm units.
+            embed_dim (int): 
-            econv_layers (int): Number of encoder conv layers.
+                Dimension of the token embedding.
-            econv_filts (int): Number of encoder conv filter size.
+            elayers (int): 
-            econv_chans (int): Number of encoder conv filter channels.
+                Number of encoder blstm layers.
-            dlayers (int): Number of decoder lstm layers.
+            eunits (int): 
-            dunits (int): Number of decoder lstm units.
+                Number of encoder blstm units.
-            prenet_layers (int): Number of prenet layers.
+            econv_layers (int): 
-            prenet_units (int): Number of prenet units.
+                Number of encoder conv layers.
-            postnet_layers (int): Number of postnet layers.
+            econv_filts (int): 
-            postnet_filts (int): Number of postnet filter size.
+                Number of encoder conv filter size.
-            postnet_chans (int): Number of postnet filter channels.
+            econv_chans (int): 
-            output_activation (str): Name of activation function for outputs.
+                Number of encoder conv filter channels.
-            adim (int): Number of dimension of mlp in attention.
+            dlayers (int): 
-            aconv_chans (int): Number of attention conv filter channels.
+                Number of decoder lstm layers.
-            aconv_filts (int): Number of attention conv filter size.
+            dunits (int): 
-            cumulate_att_w (bool): Whether to cumulate previous attention weight.
+                Number of decoder lstm units.
-            use_batch_norm (bool): Whether to use batch normalization.
+            prenet_layers (int): 
-            use_concate (bool): Whether to concat enc outputs w/ dec lstm outputs.
+                Number of prenet layers.
-            reduction_factor (int): Reduction factor.
+            prenet_units (int): 
-            spk_num (Optional[int]): Number of speakers. If set to > 1, assume that the
+                Number of prenet units.
            postnet_layers (int): 
                Number of postnet layers.
            postnet_filts (int): 
                Number of postnet filter size.
            postnet_chans (int): 
                Number of postnet filter channels.
            output_activation (str): 
                Name of activation function for outputs.
            adim (int): 
                Number of dimension of mlp in attention.
            aconv_chans (int): 
                Number of attention conv filter channels.
            aconv_filts (int): 
                Number of attention conv filter size.
            cumulate_att_w (bool): 
                Whether to cumulate previous attention weight.
            use_batch_norm (bool): 
                Whether to use batch normalization.
            use_concate (bool): 
                Whether to concat enc outputs w/ dec lstm outputs.
            reduction_factor (int): 
                Reduction factor.
            spk_num (Optional[int]): 
                Number of speakers. If set to > 1, assume that the
                sids will be provided as the input and use sid embedding layer.
-            lang_num (Optional[int]): Number of languages. If set to > 1, assume that the
+            lang_num (Optional[int]): 
                Number of languages. If set to > 1, assume that the
                lids will be provided as the input and use lid embedding layer.
-            spk_embed_dim (Optional[int]): Speaker embedding dimension. If set to > 0,
+            spk_embed_dim (Optional[int]): 
                Speaker embedding dimension. If set to > 0,
                assume that spk_emb will be provided as the input.
-            spk_embed_integration_type (str): How to integrate speaker embedding.
+            spk_embed_integration_type (str): 
-            dropout_rate (float): Dropout rate.
+                How to integrate speaker embedding.
-            zoneout_rate (float): Zoneout rate.
+            dropout_rate (float): 
                Dropout rate.
            zoneout_rate (float): 
                Zoneout rate.
        """
        assert check_argument_types()
        super().__init__()
@ -230,18 +259,28 @@ class Tacotron2(nn.Layer):
        """Calculate forward propagation.
        Args:
-            text (Tensor(int64)): Batch of padded character ids (B, T_text).
+            text (Tensor(int64)):   
-            text_lengths (Tensor(int64)): Batch of lengths of each input batch (B,).
+                Batch of padded character ids (B, T_text).
-            speech (Tensor): Batch of padded target features (B, T_feats, odim).
+            text_lengths (Tensor(int64)): 
-            speech_lengths (Tensor(int64)): Batch of the lengths of each target (B,).
+                Batch of lengths of each input batch (B,).
-            spk_emb (Optional[Tensor]): Batch of speaker embeddings (B, spk_embed_dim).
+            speech (Tensor):
-            spk_id (Optional[Tensor]): Batch of speaker IDs (B, 1).
+                Batch of padded target features (B, T_feats, odim).
-            lang_id (Optional[Tensor]): Batch of language IDs (B, 1).
+            speech_lengths (Tensor(int64)): 
                Batch of the lengths of each target (B,).
            spk_emb (Optional[Tensor]): 
                Batch of speaker embeddings (B, spk_embed_dim).
            spk_id (Optional[Tensor]): 
                Batch of speaker IDs (B, 1).
            lang_id (Optional[Tensor]): 
                Batch of language IDs (B, 1).
        Returns:
-            Tensor: Loss scalar value.
+            Tensor: 
-            Dict: Statistics to be monitored.
+                Loss scalar value.
-            Tensor: Weight value if not joint training else model outputs.
+            Dict: 
                Statistics to be monitored.
            Tensor: 
                Weight value if not joint training else model outputs.
        """
        text = text[:, :text_lengths.max()]
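The slice `text[:, :text_lengths.max()]` in the context line above drops the columns that are padding for every item in the batch. A minimal pure-Python sketch of the same idea, using nested lists in place of tensors; `trim_to_longest` is a hypothetical helper name used only for illustration and is not part of PaddleSpeech:

```python
def trim_to_longest(batch, lengths):
    """Drop trailing columns that are padding in every row of the batch."""
    if len(batch) != len(lengths):
        raise ValueError("batch and lengths must have the same number of items")
    max_len = max(lengths)  # longest real sequence in the batch
    return [row[:max_len] for row in batch]


padded = [[5, 7, 9, 0, 0], [3, 4, 0, 0, 0]]
trimmed = trim_to_longest(padded, [3, 2])
print(trimmed)  # [[5, 7, 9], [3, 4, 0]]
```

Shorter rows keep their padding up to the longest real length, so the batch stays rectangular.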
@ -329,18 +368,30 @@ class Tacotron2(nn.Layer):
        """Generate the sequence of features given the sequences of characters.
        Args:
-            text (Tensor(int64)): Input sequence of characters (T_text,).
+            text (Tensor(int64)): 
-            speech (Optional[Tensor]): Feature sequence to extract style (N, idim).
+                Input sequence of characters (T_text,).
-            spk_emb (ptional[Tensor]): Speaker embedding (spk_embed_dim,).
+            speech (Optional[Tensor]): 
-            spk_id (Optional[Tensor]): Speaker ID (1,).
+                Feature sequence to extract style (N, idim).
-            lang_id (Optional[Tensor]): Language ID (1,).
+            spk_emb (Optional[Tensor]):
-            threshold (float): Threshold in inference.
+                Speaker embedding (spk_embed_dim,).
-            minlenratio (float): Minimum length ratio in inference.
+            spk_id (Optional[Tensor]): 
-            maxlenratio (float): Maximum length ratio in inference.
+                Speaker ID (1,).
-            use_att_constraint (bool): Whether to apply attention constraint.
+            lang_id (Optional[Tensor]): 
-            backward_window (int): Backward window in attention constraint.
+                Language ID (1,).
-            forward_window (int): Forward window in attention constraint.
+            threshold (float): 
-            use_teacher_forcing (bool): Whether to use teacher forcing.
+                Threshold in inference.
            minlenratio (float): 
                Minimum length ratio in inference.
            maxlenratio (float): 
                Maximum length ratio in inference.
            use_att_constraint (bool): 
                Whether to apply attention constraint.
            backward_window (int): 
                Backward window in attention constraint.
            forward_window (int): 
                Forward window in attention constraint.
            use_teacher_forcing (bool): 
                Whether to use teacher forcing.
        Returns:
            Dict[str, Tensor]
--- a/paddlespeech/t2s/models/transformer_tts/transformer_tts.py
+++ b/paddlespeech/t2s/models/transformer_tts/transformer_tts.py
@ -49,66 +49,124 @@ class TransformerTTS(nn.Layer):
        https://arxiv.org/pdf/1809.08895.pdf
    Args:
-        idim (int): Dimension of the inputs.
+        idim (int): 
-        odim (int): Dimension of the outputs.
+            Dimension of the inputs.
-        embed_dim (int, optional): Dimension of character embedding.
+        odim (int): 
-        eprenet_conv_layers (int, optional): Number of encoder prenet convolution layers.
+            Dimension of the outputs.
-        eprenet_conv_chans (int, optional): Number of encoder prenet convolution channels.
+        embed_dim (int, optional): 
-        eprenet_conv_filts (int, optional): Filter size of encoder prenet convolution.
+            Dimension of character embedding.
-        dprenet_layers (int, optional): Number of decoder prenet layers.
+        eprenet_conv_layers (int, optional): 
-        dprenet_units (int, optional): Number of decoder prenet hidden units.
+            Number of encoder prenet convolution layers.
-        elayers (int, optional): Number of encoder layers.
+        eprenet_conv_chans (int, optional): 
-        eunits (int, optional): Number of encoder hidden units.
+            Number of encoder prenet convolution channels.
-        adim (int, optional): Number of attention transformation dimensions.
+        eprenet_conv_filts (int, optional): 
-        aheads (int, optional): Number of heads for multi head attention.
+            Filter size of encoder prenet convolution.
-        dlayers (int, optional): Number of decoder layers.
+        dprenet_layers (int, optional): 
-        dunits (int, optional): Number of decoder hidden units.
+            Number of decoder prenet layers.
-        postnet_layers (int, optional): Number of postnet layers.
+        dprenet_units (int, optional): 
-        postnet_chans (int, optional): Number of postnet channels.
+            Number of decoder prenet hidden units.
-        postnet_filts (int, optional): Filter size of postnet.
+        elayers (int, optional): 
-        use_scaled_pos_enc (pool, optional): Whether to use trainable scaled positional encoding.
+            Number of encoder layers.
-        use_batch_norm (bool, optional): Whether to use batch normalization in encoder prenet.
+        eunits (int, optional): 
-        encoder_normalize_before (bool, optional): Whether to perform layer normalization before encoder block.
+            Number of encoder hidden units.
-        decoder_normalize_before (bool, optional): Whether to perform layer normalization before decoder block.
+        adim (int, optional): 
-        encoder_concat_after (bool, optional): Whether to concatenate attention layer's input and output in encoder.
+            Number of attention transformation dimensions.
-        decoder_concat_after (bool, optional): Whether to concatenate attention layer's input and output in decoder.
+        aheads (int, optional): 
-        positionwise_layer_type (str, optional): Position-wise operation type.
+            Number of heads for multi head attention.
-        positionwise_conv_kernel_size (int, optional): Kernel size in position wise conv 1d.
+        dlayers (int, optional): 
-        reduction_factor (int, optional): Reduction factor.
+            Number of decoder layers.
-        spk_embed_dim (int, optional): Number of speaker embedding dimenstions.
+        dunits (int, optional): 
-        spk_embed_integration_type (str, optional): How to integrate speaker embedding.
+            Number of decoder hidden units.
-        use_gst (str, optional): Whether to use global style token.
+        postnet_layers (int, optional): 
-        gst_tokens (int, optional): The number of GST embeddings.
+            Number of postnet layers.
-        gst_heads (int, optional): The number of heads in GST multihead attention.
+        postnet_chans (int, optional): 
-        gst_conv_layers (int, optional): The number of conv layers in GST.
+            Number of postnet channels.
-        gst_conv_chans_list (Sequence[int], optional): List of the number of channels of conv layers in GST.
+        postnet_filts (int, optional): 
-        gst_conv_kernel_size (int, optional): Kernal size of conv layers in GST.
+            Filter size of postnet.
-        gst_conv_stride (int, optional): Stride size of conv layers in GST.
+        use_scaled_pos_enc (bool, optional):
-        gst_gru_layers (int, optional): The number of GRU layers in GST.
+            Whether to use trainable scaled positional encoding.
-        gst_gru_units (int, optional): The number of GRU units in GST.
+        use_batch_norm (bool, optional): 
-        transformer_lr (float, optional): Initial value of learning rate.
+            Whether to use batch normalization in encoder prenet.
-        transformer_warmup_steps (int, optional): Optimizer warmup steps.
+        encoder_normalize_before (bool, optional): 
-        transformer_enc_dropout_rate (float, optional): Dropout rate in encoder except attention and positional encoding.
+            Whether to perform layer normalization before encoder block.
-        transformer_enc_positional_dropout_rate (float, optional): Dropout rate after encoder positional encoding.
+        decoder_normalize_before (bool, optional): 
-        transformer_enc_attn_dropout_rate （float, optional): Dropout rate in encoder self-attention module.
+            Whether to perform layer normalization before decoder block.
-        transformer_dec_dropout_rate (float, optional): Dropout rate in decoder except attention & positional encoding.
+        encoder_concat_after (bool, optional): 
-        transformer_dec_positional_dropout_rate (float, optional): Dropout rate after decoder positional encoding.
+            Whether to concatenate attention layer's input and output in encoder.
-        transformer_dec_attn_dropout_rate （float, optional): Dropout rate in deocoder self-attention module.
+        decoder_concat_after (bool, optional): 
-        transformer_enc_dec_attn_dropout_rate (float, optional): Dropout rate in encoder-deocoder attention module.
+            Whether to concatenate attention layer's input and output in decoder.
-        init_type (str, optional): How to initialize transformer parameters.
+        positionwise_layer_type (str, optional): 
-        init_enc_alpha （float, optional）: Initial value of alpha in scaled pos encoding of the encoder.
+            Position-wise operation type.
-        init_dec_alpha (float, optional): Initial value of alpha in scaled pos encoding of the decoder.
+        positionwise_conv_kernel_size (int, optional): 
-        eprenet_dropout_rate (float, optional): Dropout rate in encoder prenet.
+            Kernel size in position wise conv 1d.
-        dprenet_dropout_rate (float, optional): Dropout rate in decoder prenet.
+        reduction_factor (int, optional): 
-        postnet_dropout_rate (float, optional): Dropout rate in postnet.
+            Reduction factor.
-        use_masking (bool, optional): Whether to apply masking for padded part in loss calculation.
+        spk_embed_dim (int, optional): 
-        use_weighted_masking (bool, optional): Whether to apply weighted masking in loss calculation.
+            Number of speaker embedding dimensions.
-        bce_pos_weight (float, optional): Positive sample weight in bce calculation (only for use_masking=true).
+        spk_embed_integration_type (str, optional): 
-        loss_type (str, optional): How to calculate loss.
+            How to integrate speaker embedding.
-        use_guided_attn_loss (bool, optional): Whether to use guided attention loss.
+        use_gst (str, optional): 
-        num_heads_applied_guided_attn (int, optional): Number of heads in each layer to apply guided attention loss.
+            Whether to use global style token.
-        num_layers_applied_guided_attn (int, optional): Number of layers to apply guided attention loss.
+        gst_tokens (int, optional): 
-            List of module names to apply guided attention loss.
+            The number of GST embeddings.
        gst_heads (int, optional): 
            The number of heads in GST multihead attention.
        gst_conv_layers (int, optional): 
            The number of conv layers in GST.
        gst_conv_chans_list (Sequence[int], optional): 
            List of the number of channels of conv layers in GST.
        gst_conv_kernel_size (int, optional): 
            Kernel size of conv layers in GST.
        gst_conv_stride (int, optional): 
            Stride size of conv layers in GST.
        gst_gru_layers (int, optional): 
            The number of GRU layers in GST.
        gst_gru_units (int, optional): 
            The number of GRU units in GST.
        transformer_lr (float, optional): 
            Initial value of learning rate.
        transformer_warmup_steps (int, optional): 
            Optimizer warmup steps.
        transformer_enc_dropout_rate (float, optional): 
            Dropout rate in encoder except attention and positional encoding.
        transformer_enc_positional_dropout_rate (float, optional): 
            Dropout rate after encoder positional encoding.
        transformer_enc_attn_dropout_rate (float, optional):
            Dropout rate in encoder self-attention module.
        transformer_dec_dropout_rate (float, optional): 
            Dropout rate in decoder except attention & positional encoding.
        transformer_dec_positional_dropout_rate (float, optional): 
            Dropout rate after decoder positional encoding.
        transformer_dec_attn_dropout_rate (float, optional):
            Dropout rate in decoder self-attention module.
        transformer_enc_dec_attn_dropout_rate (float, optional): 
            Dropout rate in encoder-decoder attention module.
        init_type (str, optional): 
            How to initialize transformer parameters.
        init_enc_alpha (float, optional):
            Initial value of alpha in scaled pos encoding of the encoder.
        init_dec_alpha (float, optional): 
            Initial value of alpha in scaled pos encoding of the decoder.
        eprenet_dropout_rate (float, optional): 
            Dropout rate in encoder prenet.
        dprenet_dropout_rate (float, optional): 
            Dropout rate in decoder prenet.
        postnet_dropout_rate (float, optional): 
            Dropout rate in postnet.
        use_masking (bool, optional): 
            Whether to apply masking for padded part in loss calculation.
        use_weighted_masking (bool, optional): 
            Whether to apply weighted masking in loss calculation.
        bce_pos_weight (float, optional): 
            Positive sample weight in bce calculation (only for use_masking=true).
        loss_type (str, optional): 
            How to calculate loss.
        use_guided_attn_loss (bool, optional): 
            Whether to use guided attention loss.
        num_heads_applied_guided_attn (int, optional):
            Number of heads in each layer to apply guided attention loss.
        num_layers_applied_guided_attn (int, optional): 
            Number of layers to apply guided attention loss.
    """
    def __init__(
--- a/paddlespeech/t2s/models/waveflow.py
+++ b/paddlespeech/t2s/models/waveflow.py
@ -33,8 +33,10 @@ def fold(x, n_group):
    """Fold audio or spectrogram's temporal dimension into groups.
    Args:
-        x(Tensor): The input tensor. shape=(*, time_steps)
+        x(Tensor): 
-        n_group(int): The size of a group.
+            The input tensor. shape=(*, time_steps)
        n_group(int): 
            The size of a group.
    Returns:
        Tensor: Folded tensor. shape=(*, time_steps // n_group, group)
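The `fold` docstring above describes reshaping the last axis from `time_steps` into `(time_steps // n_group, n_group)`. A minimal sketch of that reshape for a single flat sequence, in pure Python rather than Paddle; `fold_1d` is a hypothetical name, and requiring `time_steps` to be divisible by `n_group` is an assumption of this sketch:

```python
def fold_1d(x, n_group):
    """Reshape a flat sequence of time steps into rows of n_group samples."""
    if n_group <= 0:
        raise ValueError("n_group must be positive")
    if len(x) % n_group != 0:
        # Assumption: the caller trims the input to a multiple of n_group first.
        raise ValueError("time_steps must be divisible by n_group")
    return [list(x[i:i + n_group]) for i in range(0, len(x), n_group)]


folded = fold_1d(list(range(8)), n_group=4)
print(folded)  # [[0, 1, 2, 3], [4, 5, 6, 7]]
```

Eight time steps with `n_group=4` give a result of shape `(2, 4)`, matching `(time_steps // n_group, n_group)`.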
@ -53,7 +55,8 @@ class UpsampleNet(nn.LayerList):
    on mel and time dimension.
    Args:
-        upscale_factors(List[int], optional): Time upsampling factors for each Conv2DTranspose Layer.
+        upscale_factors(List[int], optional): 
            Time upsampling factors for each Conv2DTranspose Layer.
            The ``UpsampleNet`` contains ``len(upscale_factor)`` Conv2DTranspose
            Layers. Each upscale_factor is used as the ``stride`` for the
            corresponding Conv2DTranspose. Defaults to [16, 16], this the default
@ -94,8 +97,10 @@ class UpsampleNet(nn.LayerList):
        """Forward pass of the ``UpsampleNet``
        Args:
-            x(Tensor): The input spectrogram. shape=(batch_size, input_channels, time_steps)
+            x(Tensor): 
-            trim_conv_artifact(bool, optional, optional): Trim deconvolution artifact at each layer. Defaults to False.
+                The input spectrogram. shape=(batch_size, input_channels, time_steps)
            trim_conv_artifact(bool, optional):
                Trim deconvolution artifact at each layer. Defaults to False.
        Returns:
           Tensor: The upsampled spectrogram. shape=(batch_size, input_channels, time_steps * upsample_factor)
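The returned shape above is `time_steps * upsample_factor`. Since each entry of `upscale_factors` is used as the stride of one Conv2DTranspose layer, a reasonable reading is that the overall factor is the product of the entries. The sketch below rests on that assumption; `upsampled_length` is a hypothetical helper, not a PaddleSpeech function:

```python
import math


def upsampled_length(time_steps, upscale_factors=(16, 16)):
    """Output length along time, assuming the total factor is the product of strides."""
    total_factor = math.prod(upscale_factors)
    return time_steps * total_factor


print(upsampled_length(10))  # 2560
```

With the default `[16, 16]`, the assumed total factor is 256, so 10 spectrogram frames map to 2560 samples.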
@ -123,10 +128,14 @@ class ResidualBlock(nn.Layer):
    and output.
    Args:
-        channels (int): Feature size of the input.
+        channels (int): 
-        cond_channels (int): Featuer size of the condition.
+            Feature size of the input.
-        kernel_size (Tuple[int]): Kernel size of the Convolution2d applied to the input.
+        cond_channels (int): 
-        dilations (int): Dilations of the Convolution2d applied to the input.
+            Feature size of the condition.
        kernel_size (Tuple[int]): 
            Kernel size of the Convolution2d applied to the input.
        dilations (int): 
            Dilations of the Convolution2d applied to the input.
    """
    def __init__(self, channels, cond_channels, kernel_size, dilations):
@ -173,12 +182,16 @@ class ResidualBlock(nn.Layer):
        """Compute output for a whole folded sequence.
        Args:
-            x (Tensor): The input. [shape=(batch_size, channel, height, width)]
+            x (Tensor): 
-            condition (Tensor [shape=(batch_size, condition_channel, height, width)]): The local condition.
+                The input. [shape=(batch_size, channel, height, width)]
            condition (Tensor [shape=(batch_size, condition_channel, height, width)]): 
                The local condition.
        Returns: 
-            res (Tensor): The residual output. [shape=(batch_size, channel, height, width)]
+            res (Tensor): 
-            skip (Tensor): The skip output. [shape=(batch_size, channel, height, width)]
+                The residual output. [shape=(batch_size, channel, height, width)]
            skip (Tensor): 
                The skip output. [shape=(batch_size, channel, height, width)]
        """
        x_in = x
        x = self.conv(x)
@ -216,12 +229,16 @@ class ResidualBlock(nn.Layer):
        """Compute the output for a row and update the buffer.
        Args:
-            x_row (Tensor): A row of the input. shape=(batch_size, channel, 1, width)
+            x_row (Tensor): 
-            condition_row (Tensor): A row of the condition. shape=(batch_size, condition_channel, 1, width)
+                A row of the input. shape=(batch_size, channel, 1, width)
            condition_row (Tensor): 
                A row of the condition. shape=(batch_size, condition_channel, 1, width)
        Returns:
-            res (Tensor): A row of the the residual output. shape=(batch_size, channel, 1, width)
+            res (Tensor): 
-            skip (Tensor): A row of the skip output. shape=(batch_size, channel, 1, width)
+                A row of the residual output. shape=(batch_size, channel, 1, width)
            skip (Tensor): 
                A row of the skip output. shape=(batch_size, channel, 1, width)
        """
        x_row_in = x_row
@ -258,11 +275,16 @@ class ResidualNet(nn.LayerList):
    """A stack of several ResidualBlocks. It merges condition at each layer.
    Args:
-        n_layer (int): Number of ResidualBlocks in the ResidualNet.
+        n_layer (int): 
-        residual_channels (int): Feature size of each ResidualBlocks.
+            Number of ResidualBlocks in the ResidualNet.
-        condition_channels (int): Feature size of the condition.
+        residual_channels (int): 
-        kernel_size (Tuple[int]): Kernel size of each ResidualBlock.
+            Feature size of each ResidualBlock.
-        dilations_h (List[int]): Dilation in height dimension of every ResidualBlock.
+        condition_channels (int): 
            Feature size of the condition.
        kernel_size (Tuple[int]): 
            Kernel size of each ResidualBlock.
        dilations_h (List[int]): 
            Dilation in height dimension of every ResidualBlock.
    Raises:
        ValueError: If the length of dilations_h does not equal n_layer.
@ -288,11 +310,13 @@ class ResidualNet(nn.LayerList):
        """Compute the output given the input and the condition.
        Args:
-            x (Tensor): The input. shape=(batch_size, channel, height, width)
+            x (Tensor): 
-            condition (Tensor): The local condition. shape=(batch_size, condition_channel, height, width)
+                The input. shape=(batch_size, channel, height, width)
            condition (Tensor): 
                The local condition. shape=(batch_size, condition_channel, height, width)
        Returns: 
-            Tensor : The output, which is an aggregation of all the skip outputs. shape=(batch_size, channel, height, width)
+            Tensor: The output, which is an aggregation of all the skip outputs. shape=(batch_size, channel, height, width)
        """
        skip_connections = []
@ -312,12 +336,16 @@ class ResidualNet(nn.LayerList):
        """Compute the output for a row and update the buffers.
        Args:
-            x_row (Tensor): A row of the input. shape=(batch_size, channel, 1, width)
+            x_row (Tensor): 
-            condition_row (Tensor):  A row of the condition. shape=(batch_size, condition_channel, 1, width)
+                A row of the input. shape=(batch_size, channel, 1, width)
            condition_row (Tensor):  
                A row of the condition. shape=(batch_size, condition_channel, 1, width)
        Returns:
-            res (Tensor): A row of the the residual output. shape=(batch_size, channel, 1, width) 
+            res (Tensor): 
-            skip (Tensor): A row of the skip output. shape=(batch_size, channel, 1, width)
+                A row of the residual output. shape=(batch_size, channel, 1, width)
            skip (Tensor): 
                A row of the skip output. shape=(batch_size, channel, 1, width)
        """
        skip_connections = []
@ -337,11 +365,16 @@ class Flow(nn.Layer):
    sampling.
    Args:
-        n_layers (int): Number of ResidualBlocks in the Flow.
+        n_layers (int): 
-        channels (int): Feature size of the ResidualBlocks.
+            Number of ResidualBlocks in the Flow.
-        mel_bands (int): Feature size of the mel spectrogram (mel bands).
+        channels (int): 
-        kernel_size (Tuple[int]): Kernel size of each ResisualBlocks in the Flow.
+            Feature size of the ResidualBlocks.
-        n_group (int): Number of timesteps to the folded into a group.
+        mel_bands (int): 
            Feature size of the mel spectrogram (mel bands).
        kernel_size (Tuple[int]): 
            Kernel size of each ResidualBlock in the Flow.
        n_group (int): 
            Number of timesteps to be folded into a group.
    """
    dilations_dict = {
        8: [1, 1, 1, 1, 1, 1, 1, 1],
@ -393,11 +426,14 @@ class Flow(nn.Layer):
        a sample from p(X) into a sample from p(Z).
        Args:
-            x (Tensor): A input sample of the distribution p(X). shape=(batch, 1, height, width)
+            x (Tensor): 
-            condition (Tensor): The local condition. shape=(batch, condition_channel, height, width)
+                An input sample of the distribution p(X). shape=(batch, 1, height, width)
            condition (Tensor): 
                The local condition. shape=(batch, condition_channel, height, width)
        Returns:
-            z (Tensor): shape(batch, 1, height, width), the transformed sample.
+            z (Tensor): 
                shape(batch, 1, height, width), the transformed sample.
            Tuple[Tensor, Tensor]:
                The parameter of the transformation.
                logs (Tensor): shape(batch, 1, height - 1, width), the log scale of the transformation from x to z.
@ -433,8 +469,10 @@ class Flow(nn.Layer):
        p(Z) and transform the sample. It is an autoregressive transformation.
        Args:
-            z(Tensor): A sample of the distribution p(Z). shape=(batch, 1, time_steps
+            z(Tensor): 
-            condition(Tensor): The local condition. shape=(batch, condition_channel, time_steps)
+                A sample of the distribution p(Z). shape=(batch, 1, time_steps)
            condition(Tensor): 
                The local condition. shape=(batch, condition_channel, time_steps)
        Returns:
            Tensor:
                The transformed sample. shape=(batch, 1, height, width)
@ -462,12 +500,18 @@ class WaveFlow(nn.LayerList):
    flows.
    Args:
-        n_flows (int): Number of flows in the WaveFlow model.
+        n_flows (int): 
-        n_layers (int): Number of ResidualBlocks in each Flow.
+            Number of flows in the WaveFlow model.
-        n_group (int): Number of timesteps to fold as a group.
+        n_layers (int): 
-        channels (int): Feature size of each ResidualBlock.
+            Number of ResidualBlocks in each Flow.
-        mel_bands (int): Feature size of mel spectrogram (mel bands).
+        n_group (int): 
-        kernel_size (Union[int, List[int]]): Kernel size of the convolution layer in each ResidualBlock.
+            Number of timesteps to fold as a group.
        channels (int): 
            Feature size of each ResidualBlock.
        mel_bands (int): 
            Feature size of mel spectrogram (mel bands).
        kernel_size (Union[int, List[int]]): 
            Kernel size of the convolution layer in each ResidualBlock.
    """
    def __init__(self, n_flows, n_layers, n_group, channels, mel_bands,
@ -518,12 +562,16 @@ class WaveFlow(nn.LayerList):
        condition.
        Args:
-            x (Tensor): The audio. shape=(batch_size, time_steps)
+            x (Tensor): 
-            condition (Tensor): The local condition (mel spectrogram here). shape=(batch_size, condition channel, time_steps)
+                The audio. shape=(batch_size, time_steps)
            condition (Tensor): 
                The local condition (mel spectrogram here). shape=(batch_size, condition channel, time_steps)
        Returns:
-            Tensor: The transformed random variable. shape=(batch_size, time_steps)
+            Tensor: 
-            Tensor: The log determinant of the jacobian of the transformation from x to z. shape=(1,)
+                The transformed random variable. shape=(batch_size, time_steps)
            Tensor: 
                The log determinant of the jacobian of the transformation from x to z. shape=(1,)
        """
        # x: (B, T)
        # condition: (B, C, T) upsampled condition
@ -559,12 +607,13 @@ class WaveFlow(nn.LayerList):
        autoregressive manner.
        Args:
-            z (Tensor): A sample of the distribution p(Z). shape=(batch, 1, time_steps
-            condition (Tensor): The local condition. shape=(batch, condition_channel, time_steps)    
+            z (Tensor): 
+                A sample of the distribution p(Z). shape=(batch, 1, time_steps
+            condition (Tensor): 
+                The local condition. shape=(batch, condition_channel, time_steps)    
        Returns: 
            Tensor: The transformed sample (audio here). shape=(batch_size, time_steps)
        """
        z, condition = self._trim(z, condition)
@ -590,13 +639,20 @@ class ConditionalWaveFlow(nn.LayerList):
    """ConditionalWaveFlow, a UpsampleNet with a WaveFlow model.
    Args:
-        upsample_factors (List[int]): Upsample factors for the upsample net.
-        n_flows (int): Number of flows in the WaveFlow model.
-        n_layers (int): Number of ResidualBlocks in each Flow.
-        n_group (int): Number of timesteps to fold as a group.
-        channels (int): Feature size of each ResidualBlock.
-        n_mels (int): Feature size of mel spectrogram (mel bands).
-        kernel_size (Union[int, List[int]]): Kernel size of the convolution layer in each ResidualBlock.
+        upsample_factors (List[int]): 
+            Upsample factors for the upsample net.
+        n_flows (int): 
+            Number of flows in the WaveFlow model.
+        n_layers (int): 
+            Number of ResidualBlocks in each Flow.
+        n_group (int): 
+            Number of timesteps to fold as a group.
+        channels (int): 
+            Feature size of each ResidualBlock.
+        n_mels (int): 
+            Feature size of mel spectrogram (mel bands).
+        kernel_size (Union[int, List[int]]): 
+            Kernel size of the convolution layer in each ResidualBlock.
        """
    def __init__(self,
@ -622,12 +678,16 @@ class ConditionalWaveFlow(nn.LayerList):
        the determinant of the jacobian of the transformation from x to z.
        Args:
-            audio(Tensor): The audio. shape=(B, T)
-            mel(Tensor): The mel spectrogram. shape=(B, C_mel, T_mel)
+            audio(Tensor): 
+                The audio. shape=(B, T)
+            mel(Tensor): 
+                The mel spectrogram. shape=(B, C_mel, T_mel)
        Returns:
-            Tensor: The inversely transformed random variable z (x to z). shape=(B, T)
-            Tensor: the log of the determinant of the jacobian of the transformation from x to z. shape=(1,)
+            Tensor: 
+                The inversely transformed random variable z (x to z). shape=(B, T)
+            Tensor: 
+                the log of the determinant of the jacobian of the transformation from x to z. shape=(1,)
        """
        condition = self.encoder(mel)
        z, log_det_jacobian = self.decoder(audio, condition)
@ -638,10 +698,12 @@ class ConditionalWaveFlow(nn.LayerList):
        """Generate raw audio given mel spectrogram.
        Args:
-            mel(np.ndarray): Mel spectrogram of an utterance(in log-magnitude). shape=(C_mel, T_mel)
+            mel(np.ndarray): 
+                Mel spectrogram of an utterance(in log-magnitude). shape=(C_mel, T_mel)
        Returns:
-            Tensor: The synthesized audio, where``T <= T_mel * upsample_factors``. shape=(B, T)
+            Tensor: 
+                The synthesized audio, where``T <= T_mel * upsample_factors``. shape=(B, T)
        """
        start = time.time()
        condition = self.encoder(mel, trim_conv_artifact=True)  # (B, C, T)
@ -657,7 +719,8 @@ class ConditionalWaveFlow(nn.LayerList):
        """Generate raw audio given mel spectrogram.
        Args:
-            mel(np.ndarray): Mel spectrogram of an utterance(in log-magnitude). shape=(C_mel, T_mel)
+            mel(np.ndarray): 
+                Mel spectrogram of an utterance(in log-magnitude). shape=(C_mel, T_mel)
        Returns:
            np.ndarray: The synthesized audio. shape=(T,)
@ -673,8 +736,10 @@ class ConditionalWaveFlow(nn.LayerList):
        """Build a ConditionalWaveFlow model from a pretrained model.
        Args:
-            config(yacs.config.CfgNode): model configs
-            checkpoint_path(Path or str): the path of pretrained model checkpoint, without extension name
+            config(yacs.config.CfgNode): 
+                model configs
+            checkpoint_path(Path or str): 
+                the path of pretrained model checkpoint, without extension name
        Returns:
            ConditionalWaveFlow The model built from pretrained result.
@ -694,8 +759,8 @@ class WaveFlowLoss(nn.Layer):
    """Criterion of a WaveFlow model.
    Args:
-        sigma (float): The standard deviation of the gaussian noise used in WaveFlow, 
-            by default 1.0.
+        sigma (float): 
+            The standard deviation of the gaussian noise used in WaveFlow, by default 1.0.
    """
    def __init__(self, sigma=1.0):
@ -708,8 +773,10 @@ class WaveFlowLoss(nn.Layer):
        log_det_jacobian of transformation from x to z.
        Args:
-            z(Tensor): The transformed random variable (x to z). shape=(B, T)
-            log_det_jacobian(Tensor): The log of the determinant of the jacobian matrix of the
+            z(Tensor): 
+                The transformed random variable (x to z). shape=(B, T)
+            log_det_jacobian(Tensor): 
+                The log of the determinant of the jacobian matrix of the
                transformation from x to z.  shape=(1,)
        Returns:
@ -726,7 +793,8 @@ class ConditionalWaveFlow2Infer(ConditionalWaveFlow):
        """Generate raw audio given mel spectrogram.
        Args:
-            mel (np.ndarray): Mel spectrogram of an utterance(in log-magnitude). shape=(C_mel, T_mel)
+            mel (np.ndarray): 
+                Mel spectrogram of an utterance(in log-magnitude). shape=(C_mel, T_mel)
        Returns:
            np.ndarray: The synthesized audio. shape=(T,)
--- a/paddlespeech/t2s/models/wavernn/wavernn.py
+++ b/paddlespeech/t2s/models/wavernn/wavernn.py
@ -165,19 +165,29 @@ class WaveRNN(nn.Layer):
            init_type: str="xavier_uniform", ):
        '''
        Args:
-            rnn_dims (int, optional): Hidden dims of RNN Layers.
-            fc_dims (int, optional): Dims of FC Layers.
-            bits (int, optional): bit depth of signal.
-            aux_context_window (int, optional): The context window size of the first convolution applied to the 
-                auxiliary input, by default 2
-            upsample_scales (List[int], optional): Upsample scales of the upsample network.
-            aux_channels (int, optional): Auxiliary channel of the residual blocks.
-            compute_dims (int, optional): Dims of Conv1D in MelResNet.
-            res_out_dims (int, optional): Dims of output in MelResNet.
-            res_blocks (int, optional): Number of residual blocks.
-            mode (str, optional): Output mode of the WaveRNN vocoder. 
+            rnn_dims (int, optional): 
+                Hidden dims of RNN Layers.
+            fc_dims (int, optional): 
+                Dims of FC Layers.
+            bits (int, optional): 
+                bit depth of signal.
+            aux_context_window (int, optional): 
+                The context window size of the first convolution applied to the auxiliary input, by default 2
+            upsample_scales (List[int], optional): 
+                Upsample scales of the upsample network.
+            aux_channels (int, optional): 
+                Auxiliary channel of the residual blocks.
+            compute_dims (int, optional): 
+                Dims of Conv1D in MelResNet.
+            res_out_dims (int, optional): 
+                Dims of output in MelResNet.
+            res_blocks (int, optional): 
+                Number of residual blocks.
+            mode (str, optional): 
+                Output mode of the WaveRNN vocoder. 
                `MOL` for Mixture of Logistic Distribution, and `RAW` for quantized bits as the model's output.
-            init_type (str): How to initialize parameters.
+            init_type (str): 
+                How to initialize parameters.
        '''
        super().__init__()
        self.mode = mode
@ -226,8 +236,10 @@ class WaveRNN(nn.Layer):
    def forward(self, x, c):
        '''
        Args:
-            x (Tensor): wav sequence, [B, T]
-            c (Tensor): mel spectrogram [B, C_aux, T']
+            x (Tensor): 
+                wav sequence, [B, T]
+            c (Tensor): 
+                mel spectrogram [B, C_aux, T']
            T = (T' - 2 * aux_context_window ) * hop_length
        Returns:
@ -280,10 +292,14 @@ class WaveRNN(nn.Layer):
                 gen_display: bool=False):
        """
        Args:
-            c(Tensor): input mels, (T', C_aux)
-            batched(bool): generate in batch or not
-            target(int): target number of samples to be generated in each batch entry
-            overlap(int): number of samples for crossfading between batches
+            c(Tensor): 
+                input mels, (T', C_aux)
+            batched(bool): 
+                generate in batch or not
+            target(int): 
+                target number of samples to be generated in each batch entry
+            overlap(int): 
+                number of samples for crossfading between batches
            mu_law(bool)
        Returns: 
            wav sequence: Output (T' * prod(upsample_scales), out_channels, C_out).
@ -404,7 +420,8 @@ class WaveRNN(nn.Layer):
    def pad_tensor(self, x, pad, side='both'):
        '''
        Args:
-            x(Tensor): mel, [1, n_frames, 80]
+            x(Tensor): 
+                mel, [1, n_frames, 80]
            pad(int): 
            side(str, optional):  (Default value = 'both')
@ -428,12 +445,15 @@ class WaveRNN(nn.Layer):
        Overlap will be used for crossfading in xfade_and_unfold()
        Args:
-            x(Tensor): Upsampled conditioning features. mels or aux
+            x(Tensor): 
+                Upsampled conditioning features. mels or aux
                shape=(1, T, features)
                mels: [1, T, 80]
                aux: [1, T, 128]
-            target(int): Target timesteps for each index of batch
-            overlap(int): Timesteps for both xfade and rnn warmup
+            target(int): 
+                Target timesteps for each index of batch
+            overlap(int): 
+                Timesteps for both xfade and rnn warmup
        Returns:
            Tensor: 