Merge branch 'develop' into audio

pull/1494/head
Hui Zhang committed d70bcb8f7a (3 years ago, via GitHub)

.gitignore

@@ -2,6 +2,7 @@
*.pyc
.vscode
*log
*.wav
*.pdmodel
*.pdiparams*
*.zip
@@ -13,6 +14,7 @@
*.whl
*.egg-info
build
*output/
docs/build/
docs/topic/ctc/warp-ctc/
@@ -32,4 +34,4 @@ tools/activate_python.sh
tools/miniconda.sh
tools/CRF++-0.58/
-*output/
+speechx/fc_patch/

@@ -148,6 +148,12 @@ For more synthesized audios, please refer to [PaddleSpeech Text-to-Speech sample
- [PaddleSpeech Demo Video](https://paddlespeech.readthedocs.io/en/latest/demo_video.html)
- **[VTuberTalk](https://github.com/jerryuhoo/VTuberTalk): Use PaddleSpeech TTS and ASR to clone voice from videos.**
<div align="center">
<img src="https://raw.githubusercontent.com/jerryuhoo/VTuberTalk/main/gui/gui.png" width = "500px" />
</div>
### 🔥 Hot Activities
- 2021.12.21~12.24
@@ -196,16 +202,18 @@ Developers can have a try of our models with [PaddleSpeech Command Line](./paddl
```shell
paddlespeech cls --input input.wav
```
**Automatic Speech Recognition**
```shell
paddlespeech asr --lang zh --input input_16k.wav
```
**Speech Translation** (English to Chinese)
(not supported on Mac and Windows for now)
```shell
paddlespeech st --input input_16k.wav
```
**Text-to-Speech**
```shell
paddlespeech tts --input "你好,欢迎使用飞桨深度学习框架!" --output output.wav
@@ -218,7 +226,16 @@ paddlespeech tts --input "你好,欢迎使用飞桨深度学习框架!" --ou
paddlespeech text --task punc --input 今天的天气真不错啊你下午有空吗我想约你一起去吃饭
```
**Batch Process**
```
echo -e "1 欢迎光临。\n2 谢谢惠顾。" | paddlespeech tts
```
**Shell Pipeline**
- ASR + Punctuation Restoration
```
paddlespeech asr --input ./zh.wav | paddlespeech text --task punc
```
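These one-liners assume a local audio file; the 16 kHz sample clips used by the demos can be fetched first (a small sketch, assuming `wget` is available):
```shell
# Download the Mandarin/English sample clips used throughout the demos.
wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespeech.bj.bcebos.com/PaddleAudio/en.wav
```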
For more command lines, please see: [demos](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/demos)
@@ -561,6 +578,9 @@ You are warmly welcome to submit questions in [discussions](https://github.com/P
- Many thanks to [JiehangXie](https://github.com/JiehangXie)/[PaddleBoBo](https://github.com/JiehangXie/PaddleBoBo) for developing Virtual Uploader(VUP)/Virtual YouTuber(VTuber) with PaddleSpeech TTS function.
- Many thanks to [745165806](https://github.com/745165806)/[PaddleSpeechTask](https://github.com/745165806/PaddleSpeechTask) for contributing Punctuation Restoration model.
- Many thanks to [kslz](https://github.com/kslz) for supplementary Chinese documents.
- Many thanks to [awmmmm](https://github.com/awmmmm) for contributing fastspeech2 aishell3 conformer pretrained model.
- Many thanks to [phecda-xu](https://github.com/phecda-xu)/[PaddleDubbing](https://github.com/phecda-xu/PaddleDubbing) for developing a dubbing tool with GUI based on PaddleSpeech TTS model.
- Many thanks to [jerryuhoo](https://github.com/jerryuhoo)/[VTuberTalk](https://github.com/jerryuhoo/VTuberTalk) for developing a GUI tool based on PaddleSpeech TTS and code for making datasets from videos based on PaddleSpeech ASR.
Besides, PaddleSpeech depends on a lot of open source repositories. See [references](./docs/source/reference.md) for more information.

@@ -150,6 +150,12 @@ from https://github.com/18F/open-source-guide/blob/18f-pages/pages/making-readme
- [PaddleSpeech 示例视频](https://paddlespeech.readthedocs.io/en/latest/demo_video.html)
- **[VTuberTalk](https://github.com/jerryuhoo/VTuberTalk): 使用 PaddleSpeech 的语音合成和语音识别从视频中克隆人声。**
<div align="center">
<img src="https://raw.githubusercontent.com/jerryuhoo/VTuberTalk/main/gui/gui.png" width = "500px" />
</div>
### 🔥 热门活动
- 2021.12.21~12.24
@@ -216,6 +222,17 @@ paddlespeech tts --input "你好,欢迎使用百度飞桨深度学习框架!
paddlespeech text --task punc --input 今天的天气真不错啊你下午有空吗我想约你一起去吃饭
```
**批处理**
```
echo -e "1 欢迎光临。\n2 谢谢惠顾。" | paddlespeech tts
```
**Shell管道**
ASR + Punc:
```
paddlespeech asr --input ./zh.wav | paddlespeech text --task punc
```
更多命令行命令请参考 [demos](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/demos)
> Note: 如果需要训练或者微调,请查看[语音识别](./docs/source/asr/quick_start.md) [语音合成](./docs/source/tts/quick_start.md)。
@@ -556,6 +573,10 @@ year={2021}
- 非常感谢 [JiehangXie](https://github.com/JiehangXie)/[PaddleBoBo](https://github.com/JiehangXie/PaddleBoBo) 采用 PaddleSpeech 语音合成功能实现 Virtual Uploader(VUP)/Virtual YouTuber(VTuber) 虚拟主播。
- 非常感谢 [745165806](https://github.com/745165806)/[PaddleSpeechTask](https://github.com/745165806/PaddleSpeechTask) 贡献标点重建相关模型。
- 非常感谢 [kslz](https://github.com/kslz) 补充中文文档。
- 非常感谢 [awmmmm](https://github.com/awmmmm) 提供 fastspeech2 aishell3 conformer 预训练模型。
- 非常感谢 [phecda-xu](https://github.com/phecda-xu)/[PaddleDubbing](https://github.com/phecda-xu/PaddleDubbing) 基于 PaddleSpeech 的 TTS 模型搭建带 GUI 操作界面的配音工具。
- 非常感谢 [jerryuhoo](https://github.com/jerryuhoo)/[VTuberTalk](https://github.com/jerryuhoo/VTuberTalk) 基于 PaddleSpeech 的 TTS GUI 界面和基于 ASR 制作数据集的相关代码。
此外PaddleSpeech 依赖于许多开源存储库。有关更多信息,请参阅 [references](./docs/source/reference.md)。

@@ -27,6 +27,8 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
paddlespeech asr --input ./zh.wav
# English
paddlespeech asr --model transformer_librispeech --lang en --input ./en.wav
# Chinese ASR + Punctuation Restoration
paddlespeech asr --input ./zh.wav | paddlespeech text --task punc
```
(It doesn't matter if the `paddlespeech-ctcdecoders` package is not found; it is optional.)

@@ -25,6 +25,8 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
paddlespeech asr --input ./zh.wav
# 英文
paddlespeech asr --model transformer_librispeech --lang en --input ./en.wav
# 中文 + 标点恢复
paddlespeech asr --input ./zh.wav | paddlespeech text --task punc
```
(如果显示 `paddlespeech-ctcdecoders` 这个 python 包没有找到的 Error没有关系这个包是非必须的。)

@@ -1,4 +1,10 @@
#!/bin/bash
wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespeech.bj.bcebos.com/PaddleAudio/en.wav
# asr
paddlespeech asr --input ./zh.wav
# asr + punc
paddlespeech asr --input ./zh.wav | paddlespeech text --task punc

@@ -10,11 +10,23 @@ This demo is an implementation of starting the voice service and accessing the s
### 1. Installation
See [installation](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install.md).
It is recommended to use **paddlepaddle 2.2.1** or above.
You can choose one way from easy, medium and hard to install paddlespeech.
### 2. Prepare config File
The configuration file contains the service-related configuration and the model configurations for the speech tasks included in the service. They are all under the `conf` folder.
**Note: The configuration of `engine_backend` in `application.yaml` represents all speech tasks included in the started service.**
If the service you want to start should contain only certain speech tasks, comment out the speech tasks that are not needed. For example, if you only want to use the speech recognition (ASR) service, you can comment out the speech synthesis (TTS) service, as in the following example:
```bash
engine_backend:
    asr: 'conf/asr/asr.yaml'
    #tts: 'conf/tts/tts.yaml'
```
**Note: The configuration file of `engine_backend` in `application.yaml` needs to match the configuration type of `engine_type`.**
When the configuration file of `engine_backend` is `XXX.yaml`, the configuration type of `engine_type` needs to be set to `python`; when the configuration file of `engine_backend` is `XXX_pd.yaml`, the configuration type of `engine_type` needs to be set to `inference`.
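As a quick self-check of the rule above, the two sections can be pulled out of the YAML together (a minimal sketch; paths follow this demo's `conf` layout):
```bash
# asr.yaml should be paired with 'python'; asr_pd.yaml with 'inference'.
grep -A 2 -E '^(engine_backend|engine_type):' conf/application.yaml
```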
The input of the ASR client demo should be a WAV file (`.wav`), and the sample rate must be the same as the model's.
Here are sample files for this ASR client demo that can be downloaded:
@@ -76,6 +88,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
### 4. ASR Client Usage
**Note:** The response time will be slightly longer when using the client for the first time
- Command Line (Recommended)
```
paddlespeech_client asr --server_ip 127.0.0.1 --port 8090 --input ./zh.wav
@@ -122,6 +135,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
```
### 5. TTS Client Usage
**Note:** The response time will be slightly longer when using the client for the first time
- Command Line (Recommended)
```bash
paddlespeech_client tts --server_ip 127.0.0.1 --port 8090 --input "您好,欢迎使用百度飞桨语音合成服务。" --output output.wav
@@ -147,8 +161,6 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
[2022-02-23 15:20:37,875] [ INFO] - Save synthesized audio successfully on output.wav.
[2022-02-23 15:20:37,875] [ INFO] - Audio duration: 3.612500 s.
[2022-02-23 15:20:37,875] [ INFO] - Response time: 0.348050 s.
[2022-02-23 15:20:37,875] [ INFO] - RTF: 0.096346
```
@@ -174,51 +186,13 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
Save synthesized audio successfully on ./output.wav.
Audio duration: 3.612500 s.
Response time: 0.388317 s.
RTF: 0.107493
```
-## Pretrained Models
+## Models supported by the service
### ASR model
-Here is a list of [ASR pretrained models](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/demos/speech_recognition/README.md#4pretrained-models) released by PaddleSpeech; both command line and python interfaces are available:
-| Model | Language | Sample Rate |
-| :--- | :---: | :---: |
-| conformer_wenetspeech | zh | 16000 |
-| transformer_librispeech | en | 16000 |
+Get all models supported by the ASR service via `paddlespeech_server stats --task asr`; the static models among them can be used for Paddle Inference.
### TTS model
-Here is a list of [TTS pretrained models](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/demos/text_to_speech/README.md#4-pretrained-models) released by PaddleSpeech; both command line and python interfaces are available:
+Get all models supported by the TTS service via `paddlespeech_server stats --task tts`; the static models among them can be used for Paddle Inference.
- Acoustic model
| Model | Language
| :--- | :---: |
| speedyspeech_csmsc| zh
| fastspeech2_csmsc| zh
| fastspeech2_aishell3| zh
| fastspeech2_ljspeech| en
| fastspeech2_vctk| en
- Vocoder
| Model | Language
| :--- | :---: |
| pwgan_csmsc| zh
| pwgan_aishell3| zh
| pwgan_ljspeech| en
| pwgan_vctk| en
| mb_melgan_csmsc| zh
Here is a list of **TTS pretrained static models** released by PaddleSpeech, both command line and python interfaces are available:
- Acoustic model
| Model | Language
| :--- | :---: |
| speedyspeech_csmsc| zh
| fastspeech2_csmsc| zh
- Vocoder
| Model | Language
| :--- | :---: |
| pwgan_csmsc| zh
| mb_melgan_csmsc| zh
| hifigan_csmsc| zh

@@ -10,10 +10,21 @@
### 1. 安装
请看 [安装文档](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install.md).
推荐使用 **paddlepaddle 2.2.1** 或以上版本。
你可以从 easy、medium、hard 三种方式中选择一种方式安装 PaddleSpeech。
### 2. 准备配置文件
配置文件包含服务相关的配置文件和服务中包含的语音任务相关的模型配置。它们都在 `conf` 文件夹下。
**注意:`application.yaml` 中 `engine_backend` 的配置表示启动的服务中包含的所有语音任务。**
如果你想启动的服务中只包含某项语音任务那么你需要注释掉不需要包含的语音任务。例如你只想使用语音识别ASR服务那么你可以将语音合成TTS服务注释掉如下示例
```bash
engine_backend:
    asr: 'conf/asr/asr.yaml'
    #tts: 'conf/tts/tts.yaml'
```
**注意:`application.yaml` 中 `engine_backend` 的配置文件需要和 `engine_type` 的配置类型匹配。**
当`engine_backend` 的配置文件为`XXX.yaml`时,需要设置`engine_type`的配置类型为`python`;当`engine_backend` 的配置文件为`XXX_pd.yaml`时,需要设置`engine_type`的配置类型为`inference`;
这个 ASR client 的输入应该是一个 WAV 文件(`.wav`),并且采样率必须与模型的采样率相同。
@@ -75,6 +86,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
```
### 4. ASR客户端使用方法
**注意:**初次使用客户端时响应时间会略长
- 命令行 (推荐使用)
```
paddlespeech_client asr --server_ip 127.0.0.1 --port 8090 --input ./zh.wav
@@ -123,6 +135,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
```
### 5. TTS客户端使用方法
**注意:**初次使用客户端时响应时间会略长
```bash
paddlespeech_client tts --server_ip 127.0.0.1 --port 8090 --input "您好,欢迎使用百度飞桨语音合成服务。" --output output.wav
```
@@ -148,7 +161,6 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
[2022-02-23 15:20:37,875] [ INFO] - Save synthesized audio successfully on output.wav.
[2022-02-23 15:20:37,875] [ INFO] - Audio duration: 3.612500 s.
[2022-02-23 15:20:37,875] [ INFO] - Response time: 0.348050 s.
[2022-02-23 15:20:37,875] [ INFO] - RTF: 0.096346
```
- Python API
@@ -173,50 +185,12 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
Save synthesized audio successfully on ./output.wav.
Audio duration: 3.612500 s.
Response time: 0.388317 s.
RTF: 0.107493
```
+## 服务支持的模型
+### ASR 支持的模型
+通过 `paddlespeech_server stats --task asr` 获取 ASR 服务支持的所有模型,其中静态模型可用于 paddle inference 推理。
+### TTS 支持的模型
+通过 `paddlespeech_server stats --task tts` 获取 TTS 服务支持的所有模型,其中静态模型可用于 paddle inference 推理。
-## Pretrained Models
-### ASR model
-下面是PaddleSpeech发布的[ASR预训练模型](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/demos/speech_recognition/README.md#4pretrained-models)列表命令行和python接口均可用
-| Model | Language | Sample Rate |
-| :--- | :---: | :---: |
-| conformer_wenetspeech | zh | 16000 |
-| transformer_librispeech | en | 16000 |
-### TTS model
-下面是PaddleSpeech发布的 [TTS预训练模型](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/demos/text_to_speech/README.md#4-pretrained-models) 列表命令行和python接口均可用
- Acoustic model
| Model | Language
| :--- | :---: |
| speedyspeech_csmsc| zh
| fastspeech2_csmsc| zh
| fastspeech2_aishell3| zh
| fastspeech2_ljspeech| en
| fastspeech2_vctk| en
- Vocoder
| Model | Language
| :--- | :---: |
| pwgan_csmsc| zh
| pwgan_aishell3| zh
| pwgan_ljspeech| en
| pwgan_vctk| en
| mb_melgan_csmsc| zh
下面是PaddleSpeech发布的 **TTS预训练静态模型** 列表命令行和python接口均可用
- Acoustic model
| Model | Language
| :--- | :---: |
| speedyspeech_csmsc| zh
| fastspeech2_csmsc| zh
- Vocoder
| Model | Language
| :--- | :---: |
| pwgan_csmsc| zh
| mb_melgan_csmsc| zh
| hifigan_csmsc| zh

@@ -3,15 +3,25 @@
##################################################################
# SERVER SETTING #
##################################################################
-host: '0.0.0.0'
+host: 127.0.0.1
port: 8090
##################################################################
# CONFIG FILE #
##################################################################
-# add engine type (Options: asr, tts) and config file here.
+# add engine backend type (Options: asr, tts) and config file here.
# Adding a speech task to engine_backend means starting the service.
engine_backend:
    asr: 'conf/asr/asr.yaml'
    tts: 'conf/tts/tts.yaml'
# The engine_type of a speech task needs to keep the same type as the config file of that speech task.
# E.g: The engine_type of asr is 'python', the engine_backend of asr is 'XX/asr.yaml'
# E.g: The engine_type of asr is 'inference', the engine_backend of asr is 'XX/asr_pd.yaml'
#
# add engine type (Options: python, inference)
engine_type:
    asr: 'python'
    tts: 'python'
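Once the two sections agree, the service is started by pointing `paddlespeech_server` at this file; the flag below follows the demo documentation, so treat it as an assumption if your version differs:
```bash
# Start the server; it listens on the host/port configured above.
paddlespeech_server start --config_file ./conf/application.yaml
```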

@@ -1,7 +1,8 @@
model: 'conformer_wenetspeech'
lang: 'zh'
sample_rate: 16000
-cfg_path:
+cfg_path: # [optional]
-ckpt_path:
+ckpt_path: # [optional]
decode_method: 'attention_rescoring'
-force_yes: False
+force_yes: True
device: # set 'gpu:id' or 'cpu'

@ -0,0 +1,26 @@
# This is the parameter configuration file for ASR server.
# These are the static models that support paddle inference.
##################################################################
# ACOUSTIC MODEL SETTING #
# am choices=['deepspeech2offline_aishell'] TODO
##################################################################
model_type: 'deepspeech2offline_aishell'
am_model: # the pdmodel file of am static model [optional]
am_params: # the pdiparams file of am static model [optional]
lang: 'zh'
sample_rate: 16000
cfg_path:
decode_method:
force_yes: True
am_predictor_conf:
    device: # set 'gpu:id' or 'cpu'
    switch_ir_optim: True
    glog_info: False # True -> print glog
    summary: True # False -> do not show predictor config
##################################################################
# OTHERS #
##################################################################

@@ -29,4 +29,4 @@ voc_stat:
# OTHERS #
##################################################################
lang: 'zh'
-device: 'gpu:2'
+device: # set 'gpu:id' or 'cpu'

@@ -6,8 +6,8 @@
# am choices=['speedyspeech_csmsc', 'fastspeech2_csmsc']
##################################################################
am: 'fastspeech2_csmsc'
-am_model: # the pdmodel file of am static model
+am_model: # the pdmodel file of your am static model (XX.pdmodel)
-am_params: # the pdiparams file of am static model
+am_params: # the pdiparams file of your am static model (XX.pdiparams)
am_sample_rate: 24000
phones_dict:
tones_dict:
@@ -15,9 +15,10 @@ speaker_dict:
spk_id: 0
am_predictor_conf:
-    use_gpu: True
-    enable_mkldnn: True
+    device: # set 'gpu:id' or 'cpu'
    switch_ir_optim: True
+    glog_info: False # True -> print glog
+    summary: True # False -> do not show predictor config
##################################################################
@@ -25,17 +26,17 @@ am_predictor_conf:
# voc choices=['pwgan_csmsc', 'mb_melgan_csmsc','hifigan_csmsc']
##################################################################
voc: 'pwgan_csmsc'
-voc_model: # the pdmodel file of vocoder static model
+voc_model: # the pdmodel file of your vocoder static model (XX.pdmodel)
-voc_params: # the pdiparams file of vocoder static model
+voc_params: # the pdiparams file of your vocoder static model (XX.pdiparams)
voc_sample_rate: 24000
voc_predictor_conf:
-    use_gpu: True
-    enable_mkldnn: True
+    device: # set 'gpu:id' or 'cpu'
    switch_ir_optim: True
+    glog_info: False # True -> print glog
+    summary: True # False -> do not show predictor config
##################################################################
# OTHERS #
##################################################################
lang: 'zh'
-device: paddle.get_device()

@@ -17,11 +17,14 @@ The input of this demo should be a text of the specific language that can be pas
### 3. Usage
- Command Line (Recommended)
- Chinese
The default acoustic model is `Fastspeech2`, and the default vocoder is `Parallel WaveGAN`.
```bash
paddlespeech tts --input "你好,欢迎使用百度飞桨深度学习框架!"
```
- Batch Process
```bash
echo -e "1 欢迎光临。\n2 谢谢惠顾。" | paddlespeech tts
```
- Chinese, use `SpeedySpeech` as the acoustic model
```bash
paddlespeech tts --am speedyspeech_csmsc --input "你好,欢迎使用百度飞桨深度学习框架!"

@@ -24,6 +24,10 @@
```bash
paddlespeech tts --input "你好,欢迎使用百度飞桨深度学习框架!"
```
- 批处理
```bash
echo -e "1 欢迎光临。\n2 谢谢惠顾。" | paddlespeech tts
```
- 中文,使用 `SpeedySpeech` 作为声学模型
```bash
paddlespeech tts --am speedyspeech_csmsc --input "你好,欢迎使用百度飞桨深度学习框架!"

@@ -1,3 +1,7 @@
#!/bin/bash
# single process
paddlespeech tts --input 今天的天气不错啊
# Batch process
echo -e "1 欢迎光临。\n2 谢谢惠顾。" | paddlespeech tts
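# Sketch (not part of the original script): the same single-process command with an
# explicit output path, using the --output flag shown in the README examples above.
paddlespeech tts --input 今天的天气不错啊 --output single.wav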

@ -0,0 +1,369 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "a1e738e0",
"metadata": {},
"source": [
"## 获取测试的 logit 数据"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "29d3368b",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"hlens.npy\n",
"logits.npy\n",
"ys_lens.npy\n",
"ys_pad.npy\n"
]
}
],
"source": [
"!mkdir -p ./test_data\n",
"!test -f ./test_data/ctc_loss_compare_data.tgz || wget -P ./test_data https://paddlespeech.bj.bcebos.com/datasets/unit_test/asr/ctc_loss_compare_data.tgz\n",
"!tar xzvf test_data/ctc_loss_compare_data.tgz -C ./test_data\n"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "240caf1d",
"metadata": {},
"outputs": [],
"source": [
"import os\n",
"import numpy as np\n",
"import time\n",
"\n",
"data_dir=\"./test_data\"\n"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "91bad949",
"metadata": {},
"outputs": [],
"source": [
"logits_np = np.load(os.path.join(data_dir, \"logits.npy\"))\n",
"ys_pad_np = np.load(os.path.join(data_dir, \"ys_pad.npy\"))\n",
"hlens_np = np.load(os.path.join(data_dir, \"hlens.npy\"))\n",
"ys_lens_np = np.load(os.path.join(data_dir, \"ys_lens.npy\"))"
]
},
{
"cell_type": "markdown",
"id": "4cef2f15",
"metadata": {},
"source": [
"## 使用 torch 的 ctc loss"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "90612004",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'1.10.1+cu102'"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import torch\n",
"torch.__version__"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "00799f97",
"metadata": {},
"outputs": [],
"source": [
"def torch_ctc_loss(use_cpu):\n",
" if use_cpu:\n",
" device = torch.device(\"cpu\")\n",
" else:\n",
" device = torch.device(\"cuda\")\n",
"\n",
" reduction_type = \"sum\" \n",
"\n",
" ctc_loss = torch.nn.CTCLoss(reduction=reduction_type)\n",
"\n",
" ys_hat = torch.tensor(logits_np, device = device)\n",
" ys_pad = torch.tensor(ys_pad_np, device = device)\n",
" hlens = torch.tensor(hlens_np, device = device)\n",
" ys_lens = torch.tensor(ys_lens_np, device = device)\n",
"\n",
" ys_hat = ys_hat.transpose(0, 1)\n",
" \n",
" # 开始计算时间\n",
" start_time = time.time()\n",
" ys_hat = ys_hat.log_softmax(2)\n",
" loss = ctc_loss(ys_hat, ys_pad, hlens, ys_lens)\n",
" end_time = time.time()\n",
" \n",
" loss = loss / ys_hat.size(1)\n",
" return end_time - start_time, loss.item()"
]
},
{
"cell_type": "markdown",
"id": "ba47b5a4",
"metadata": {},
"source": [
"## 使用 paddle 的 ctc loss"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "6882a06e",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"'2.2.2'"
]
},
"execution_count": 6,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import paddle\n",
"paddle.__version__"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "3cfa3b7c",
"metadata": {},
"outputs": [],
"source": [
"def paddle_ctc_loss(use_cpu): \n",
" import paddle.nn as pn\n",
" if use_cpu:\n",
" device = \"cpu\"\n",
" else:\n",
" device = \"gpu\"\n",
"\n",
" paddle.set_device(device)\n",
"\n",
" logits = paddle.to_tensor(logits_np)\n",
" ys_pad = paddle.to_tensor(ys_pad_np,dtype='int32')\n",
" hlens = paddle.to_tensor(hlens_np, dtype='int64')\n",
" ys_lens = paddle.to_tensor(ys_lens_np, dtype='int64')\n",
"\n",
" logits = logits.transpose([1,0,2])\n",
"\n",
" ctc_loss = pn.CTCLoss(reduction='sum')\n",
" # 开始计算时间\n",
" start_time = time.time()\n",
" pn_loss = ctc_loss(logits, ys_pad, hlens, ys_lens)\n",
" end_time = time.time()\n",
" \n",
" pn_loss = pn_loss / logits.shape[1]\n",
" return end_time - start_time, pn_loss.item()"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "40413ef9",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"CPU, iteration 10\n",
"torch_ctc_loss 159.17137145996094\n",
"paddle_ctc_loss 159.16574096679688\n",
"paddle average time 1.718252992630005\n",
"torch average time 0.17536230087280275\n",
"paddle time / torch time (cpu) 9.798303193320452\n",
"\n",
"GPU, iteration 10\n",
"torch_ctc_loss 159.172119140625\n",
"paddle_ctc_loss 159.17205810546875\n",
"paddle average time 0.018606925010681154\n",
"torch average time 0.0026710033416748047\n",
"paddle time / torch time (gpu) 6.966267963938231\n"
]
}
],
"source": [
"# 使用 CPU\n",
"\n",
"iteration = 10\n",
"use_cpu = True\n",
"torch_total_time = 0\n",
"paddle_total_time = 0\n",
"for _ in range(iteration):\n",
" cost_time, torch_loss = torch_ctc_loss(use_cpu)\n",
" torch_total_time += cost_time\n",
"for _ in range(iteration):\n",
" cost_time, paddle_loss = paddle_ctc_loss(use_cpu)\n",
" paddle_total_time += cost_time\n",
"print (\"CPU, iteration\", iteration)\n",
"print (\"torch_ctc_loss\", torch_loss)\n",
"print (\"paddle_ctc_loss\", paddle_loss)\n",
"print (\"paddle average time\", paddle_total_time / iteration)\n",
"print (\"torch average time\", torch_total_time / iteration)\n",
"print (\"paddle time / torch time (cpu)\" , paddle_total_time/ torch_total_time)\n",
"\n",
"print (\"\")\n",
"\n",
"# 使用 GPU\n",
"\n",
"use_cpu = False\n",
"torch_total_time = 0\n",
"paddle_total_time = 0\n",
"for _ in range(iteration):\n",
" cost_time, torch_loss = torch_ctc_loss(use_cpu)\n",
" torch_total_time += cost_time\n",
"for _ in range(iteration):\n",
" cost_time, paddle_loss = paddle_ctc_loss(use_cpu)\n",
" paddle_total_time += cost_time\n",
"print (\"GPU, iteration\", iteration)\n",
"print (\"torch_ctc_loss\", torch_loss)\n",
"print (\"paddle_ctc_loss\", paddle_loss)\n",
"print (\"paddle average time\", paddle_total_time / iteration)\n",
"print (\"torch average time\", torch_total_time / iteration)\n",
"print (\"paddle time / torch time (gpu)\" , paddle_total_time/ torch_total_time)"
]
},
{
"cell_type": "markdown",
"id": "7cdf8697",
"metadata": {},
"source": [
"## 其他: 使用 PaddleSpeech 中的 ctcloss 查一下loss值"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "73fad81d",
"metadata": {},
"outputs": [],
"source": [
"logits_np = np.load(os.path.join(data_dir, \"logits.npy\"))\n",
"ys_pad_np = np.load(os.path.join(data_dir, \"ys_pad.npy\"))\n",
"hlens_np = np.load(os.path.join(data_dir, \"hlens.npy\"))\n",
"ys_lens_np = np.load(os.path.join(data_dir, \"ys_lens.npy\"))"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "2b41e45d",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"2022-02-25 11:34:34.143 | INFO | paddlespeech.s2t.modules.loss:__init__:41 - CTCLoss Loss reduction: sum, div-bs: True\n",
"2022-02-25 11:34:34.143 | INFO | paddlespeech.s2t.modules.loss:__init__:42 - CTCLoss Grad Norm Type: instance\n",
"2022-02-25 11:34:34.144 | INFO | paddlespeech.s2t.modules.loss:__init__:73 - CTCLoss() kwargs:{'norm_by_times': True}, not support: {'norm_by_batchsize': False, 'norm_by_total_logits_len': False}\n",
"loss 159.17205810546875\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
"/root/miniconda3/lib/python3.7/site-packages/paddle/fluid/dygraph/math_op_patch.py:253: UserWarning: The dtype of left and right variables are not the same, left dtype is paddle.float32, but right dtype is paddle.int32, the right dtype will convert to paddle.float32\n",
" format(lhs_dtype, rhs_dtype, lhs_dtype))\n"
]
}
],
"source": [
"use_cpu = False\n",
"\n",
"from paddlespeech.s2t.modules.loss import CTCLoss\n",
"\n",
"if use_cpu:\n",
" device = \"cpu\"\n",
"else:\n",
" device = \"gpu\"\n",
"\n",
"paddle.set_device(device)\n",
"\n",
"blank_id=0\n",
"reduction_type='sum'\n",
"batch_average= True\n",
"grad_norm_type='instance'\n",
"\n",
"criterion = CTCLoss(\n",
" blank=blank_id,\n",
" reduction=reduction_type,\n",
" batch_average=batch_average,\n",
" grad_norm_type=grad_norm_type)\n",
"\n",
"logits = paddle.to_tensor(logits_np)\n",
"ys_pad = paddle.to_tensor(ys_pad_np,dtype='int32')\n",
"hlens = paddle.to_tensor(hlens_np, dtype='int64')\n",
"ys_lens = paddle.to_tensor(ys_lens_np, dtype='int64')\n",
"\n",
"pn_ctc_loss = criterion(logits, ys_pad, hlens, ys_lens)\n",
"print(\"loss\", pn_ctc_loss.item())\n",
" "
]
},
{
"cell_type": "markdown",
"id": "de525d38",
"metadata": {},
"source": [
"## 结论\n",
"在 CPU 环境下: torch 的 CTC loss 的计算速度是 paddle 的 9.8 倍 \n",
"在 GPU 环境下: torch 的 CTC loss 的计算速度是 paddle 的 6.87 倍\n",
"\n",
"## 其他结论\n",
"torch 的 ctc loss 在 CPU 和 GPU 下 都没有完全对齐。其中CPU的前向对齐精度大约为 1e-2。 GPU 的前向对齐精度大约为 1e-4 。"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.7.10"
}
},
"nbformat": 4,
"nbformat_minor": 5
}

@@ -225,7 +225,9 @@ optional arguments:
9. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.
## Pretrained Model
Pretrained FastSpeech2 model with no silence in the edge of audios:
- [fastspeech2_nosil_aishell3_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_aishell3_ckpt_0.4.zip)
- [fastspeech2_conformer_aishell3_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_conformer_aishell3_ckpt_0.2.0.zip) (Thanks for [@awmmmm](https://github.com/awmmmm)'s contribution)
FastSpeech2 checkpoint contains files listed below.
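To try the new conformer checkpoint locally, it can be fetched and unpacked like this (a sketch; the URL is the one listed above):
```bash
# Download and unpack the conformer-based FastSpeech2 AISHELL-3 checkpoint.
wget -c https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_conformer_aishell3_ckpt_0.2.0.zip
unzip fastspeech2_conformer_aishell3_ckpt_0.2.0.zip
```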

@ -0,0 +1,110 @@
###########################################################
# FEATURE EXTRACTION SETTING #
###########################################################
fs: 24000 # sr
n_fft: 2048 # FFT size (samples).
n_shift: 300 # Hop size (samples). 12.5ms
win_length: 1200 # Window length (samples). 50ms
# If set to null, it will be the same as fft_size.
window: "hann" # Window function.
# Only used for feats_type != raw
fmin: 80 # Minimum frequency of Mel basis.
fmax: 7600 # Maximum frequency of Mel basis.
n_mels: 80 # The number of mel basis.
# Only used for the model using pitch features (e.g. FastSpeech2)
f0min: 80 # Minimum f0 for pitch extraction.
f0max: 400 # Maximum f0 for pitch extraction.
###########################################################
# DATA SETTING #
###########################################################
batch_size: 32
num_workers: 4
###########################################################
# MODEL SETTING #
###########################################################
model:
adim: 384 # attention dimension
aheads: 2 # number of attention heads
elayers: 4 # number of encoder layers
eunits: 1536 # number of encoder ff units
dlayers: 4 # number of decoder layers
dunits: 1536 # number of decoder ff units
positionwise_layer_type: conv1d # type of position-wise layer
positionwise_conv_kernel_size: 3 # kernel size of position wise conv layer
duration_predictor_layers: 2 # number of layers of duration predictor
duration_predictor_chans: 256 # number of channels of duration predictor
duration_predictor_kernel_size: 3 # filter size of duration predictor
postnet_layers: 5 # number of layers of postnet
postnet_filts: 5 # filter size of conv layers in postnet
postnet_chans: 256 # number of channels of conv layers in postnet
encoder_normalize_before: True # whether to perform layer normalization before the input
decoder_normalize_before: True # whether to perform layer normalization before the input
reduction_factor: 1 # reduction factor
encoder_type: conformer # encoder type
decoder_type: conformer # decoder type
conformer_pos_enc_layer_type: rel_pos # conformer positional encoding type
conformer_self_attn_layer_type: rel_selfattn # conformer self-attention type
conformer_activation_type: swish # conformer activation type
use_macaron_style_in_conformer: true # whether to use macaron style in conformer
use_cnn_in_conformer: true # whether to use CNN in conformer
conformer_enc_kernel_size: 7 # kernel size in CNN module of conformer-based encoder
conformer_dec_kernel_size: 31 # kernel size in CNN module of conformer-based decoder
init_type: xavier_uniform # initialization type
transformer_enc_dropout_rate: 0.2 # dropout rate for transformer encoder layer
transformer_enc_positional_dropout_rate: 0.2 # dropout rate for transformer encoder positional encoding
transformer_enc_attn_dropout_rate: 0.2 # dropout rate for transformer encoder attention layer
transformer_dec_dropout_rate: 0.2 # dropout rate for transformer decoder layer
transformer_dec_positional_dropout_rate: 0.2 # dropout rate for transformer decoder positional encoding
transformer_dec_attn_dropout_rate: 0.2 # dropout rate for transformer decoder attention layer
pitch_predictor_layers: 5 # number of conv layers in pitch predictor
pitch_predictor_chans: 256 # number of channels of conv layers in pitch predictor
pitch_predictor_kernel_size: 5 # kernel size of conv layers in pitch predictor
pitch_predictor_dropout: 0.5 # dropout rate in pitch predictor
pitch_embed_kernel_size: 1 # kernel size of conv embedding layer for pitch
pitch_embed_dropout: 0.0 # dropout rate after conv embedding layer for pitch
stop_gradient_from_pitch_predictor: true # whether to stop the gradient from pitch predictor to encoder
energy_predictor_layers: 2 # number of conv layers in energy predictor
energy_predictor_chans: 256 # number of channels of conv layers in energy predictor
energy_predictor_kernel_size: 3 # kernel size of conv layers in energy predictor
energy_predictor_dropout: 0.5 # dropout rate in energy predictor
energy_embed_kernel_size: 1 # kernel size of conv embedding layer for energy
energy_embed_dropout: 0.0 # dropout rate after conv embedding layer for energy
stop_gradient_from_energy_predictor: false # whether to stop the gradient from energy predictor to encoder
spk_embed_dim: 256 # speaker embedding dimension
spk_embed_integration_type: concat # speaker embedding integration type
###########################################################
# UPDATER SETTING #
###########################################################
updater:
use_masking: True # whether to apply masking for padded part in loss calculation
###########################################################
# OPTIMIZER SETTING #
###########################################################
optimizer:
optim: adam # optimizer type
learning_rate: 0.001 # learning rate
###########################################################
# TRAINING SETTING #
###########################################################
max_epoch: 1000
num_snapshots: 5
###########################################################
# OTHER SETTING #
###########################################################
seed: 10086
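A quick sanity check after editing the file is to load it and echo a few of the fields above (a minimal sketch; assumes PyYAML is installed and the file is saved as `conf/conformer.yaml`):
```bash
python3 -c "
import yaml
cfg = yaml.safe_load(open('conf/conformer.yaml'))
print(cfg['fs'], cfg['n_mels'])                                     # 24000 80
print(cfg['model']['encoder_type'], cfg['model']['decoder_type'])   # conformer conformer
"
```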

@@ -3,10 +3,14 @@
config_path=$1
train_output_path=$2
ckpt_name=$3
stage=0
stop_stage=0
# pwgan
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize.py \
--am=tacotron2_csmsc \
--am_config=${config_path} \
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
@@ -18,3 +22,79 @@ python3 ${BIN_DIR}/../synthesize.py \
--test_metadata=dump/test/norm/metadata.jsonl \
--output_dir=${train_output_path}/test \
--phones_dict=dump/phone_id_map.txt
fi
# for more GAN Vocoders
# multi band melgan
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize.py \
--am=tacotron2_csmsc \
--am_config=${config_path} \
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--am_stat=dump/train/speech_stats.npy \
--voc=mb_melgan_csmsc \
--voc_config=mb_melgan_csmsc_ckpt_0.1.1/default.yaml \
--voc_ckpt=mb_melgan_csmsc_ckpt_0.1.1/snapshot_iter_1000000.pdz\
--voc_stat=mb_melgan_csmsc_ckpt_0.1.1/feats_stats.npy \
--test_metadata=dump/test/norm/metadata.jsonl \
--output_dir=${train_output_path}/test \
--phones_dict=dump/phone_id_map.txt
fi
# style melgan
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize.py \
--am=tacotron2_csmsc \
--am_config=${config_path} \
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--am_stat=dump/train/speech_stats.npy \
--voc=style_melgan_csmsc \
--voc_config=style_melgan_csmsc_ckpt_0.1.1/default.yaml \
--voc_ckpt=style_melgan_csmsc_ckpt_0.1.1/snapshot_iter_1500000.pdz \
--voc_stat=style_melgan_csmsc_ckpt_0.1.1/feats_stats.npy \
--test_metadata=dump/test/norm/metadata.jsonl \
--output_dir=${train_output_path}/test \
--phones_dict=dump/phone_id_map.txt
fi
# hifigan
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
echo "in hifigan syn"
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize.py \
--am=tacotron2_csmsc \
--am_config=${config_path} \
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--am_stat=dump/train/speech_stats.npy \
--voc=hifigan_csmsc \
--voc_config=hifigan_csmsc_ckpt_0.1.1/default.yaml \
--voc_ckpt=hifigan_csmsc_ckpt_0.1.1/snapshot_iter_2500000.pdz \
--voc_stat=hifigan_csmsc_ckpt_0.1.1/feats_stats.npy \
--test_metadata=dump/test/norm/metadata.jsonl \
--output_dir=${train_output_path}/test \
--phones_dict=dump/phone_id_map.txt
fi
# wavernn
if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
echo "in wavernn syn"
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize.py \
--am=tacotron2_csmsc \
--am_config=${config_path} \
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--am_stat=dump/train/speech_stats.npy \
--voc=wavernn_csmsc \
--voc_config=wavernn_csmsc_ckpt_0.2.0/default.yaml \
--voc_ckpt=wavernn_csmsc_ckpt_0.2.0/snapshot_iter_400000.pdz \
--voc_stat=wavernn_csmsc_ckpt_0.2.0/feats_stats.npy \
--test_metadata=dump/test/norm/metadata.jsonl \
--output_dir=${train_output_path}/test \
--phones_dict=dump/phone_id_map.txt
fi
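# Sketch (not part of the original script): typical invocation, mirroring how run.sh
# drives this file; the checkpoint name is a placeholder.
#   CUDA_VISIBLE_DEVICES=0 ./local/synthesize.sh conf/default.yaml exp/default snapshot_iter_XXX.pdz
# Edit stage/stop_stage at the top to pick a vocoder, e.g. stage=3 stop_stage=3 for hifigan.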

@@ -8,6 +8,7 @@ stage=0
stop_stage=0
# TODO: tacotron2 动转静的结果没有静态图的响亮, 可能还是 decode 的时候某个函数动静不对齐
# pwgan
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
@@ -39,14 +40,14 @@ if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--am_stat=dump/train/speech_stats.npy \
--voc=mb_melgan_csmsc \
---voc_config=mb_melgan_baker_finetune_ckpt_0.5/finetune.yaml \
---voc_ckpt=mb_melgan_baker_finetune_ckpt_0.5/snapshot_iter_2000000.pdz\
---voc_stat=mb_melgan_baker_finetune_ckpt_0.5/feats_stats.npy \
+--voc_config=mb_melgan_csmsc_ckpt_0.1.1/default.yaml \
+--voc_ckpt=mb_melgan_csmsc_ckpt_0.1.1/snapshot_iter_1000000.pdz\
+--voc_stat=mb_melgan_csmsc_ckpt_0.1.1/feats_stats.npy \
--lang=zh \
--text=${BIN_DIR}/../sentences.txt \
--output_dir=${train_output_path}/test_e2e \
---inference_dir=${train_output_path}/inference \
---phones_dict=dump/phone_id_map.txt
+--phones_dict=dump/phone_id_map.txt \
+--inference_dir=${train_output_path}/inference
fi
# the pretrained models haven't release now
@@ -88,8 +89,8 @@ if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
--lang=zh \
--text=${BIN_DIR}/../sentences.txt \
--output_dir=${train_output_path}/test_e2e \
---inference_dir=${train_output_path}/inference \
---phones_dict=dump/phone_id_map.txt
+--phones_dict=dump/phone_id_map.txt \
+--inference_dir=${train_output_path}/inference
fi
# wavernn

@@ -1,15 +1,20 @@
#!/bin/bash
config_path=$1
train_output_path=$2
ckpt_name=$3
stage=0
stop_stage=0
# pwgan
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize.py \
--am=speedyspeech_csmsc \
--am_config=${config_path} \
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
---am_stat=dump/train/feats_stats.npy \
+--am_stat=dump/train/speech_stats.npy \
--voc=pwgan_csmsc \
--voc_config=pwg_baker_ckpt_0.4/pwg_default.yaml \
--voc_ckpt=pwg_baker_ckpt_0.4/pwg_snapshot_iter_400000.pdz \
@@ -18,3 +23,83 @@ python3 ${BIN_DIR}/../synthesize.py \
--output_dir=${train_output_path}/test \
--phones_dict=dump/phone_id_map.txt \
--tones_dict=dump/tone_id_map.txt
fi
# for more GAN Vocoders
# multi band melgan
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize.py \
--am=speedyspeech_csmsc \
--am_config=${config_path} \
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--am_stat=dump/train/speech_stats.npy \
--voc=mb_melgan_csmsc \
--voc_config=mb_melgan_csmsc_ckpt_0.1.1/default.yaml \
--voc_ckpt=mb_melgan_csmsc_ckpt_0.1.1/snapshot_iter_1000000.pdz\
--voc_stat=mb_melgan_csmsc_ckpt_0.1.1/feats_stats.npy \
--test_metadata=dump/test/norm/metadata.jsonl \
--output_dir=${train_output_path}/test \
--phones_dict=dump/phone_id_map.txt \
--tones_dict=dump/tone_id_map.txt
fi
# style melgan
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize.py \
--am=speedyspeech_csmsc \
--am_config=${config_path} \
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--am_stat=dump/train/speech_stats.npy \
--voc=style_melgan_csmsc \
--voc_config=style_melgan_csmsc_ckpt_0.1.1/default.yaml \
--voc_ckpt=style_melgan_csmsc_ckpt_0.1.1/snapshot_iter_1500000.pdz \
--voc_stat=style_melgan_csmsc_ckpt_0.1.1/feats_stats.npy \
--test_metadata=dump/test/norm/metadata.jsonl \
--output_dir=${train_output_path}/test \
--phones_dict=dump/phone_id_map.txt \
--tones_dict=dump/tone_id_map.txt
fi
# hifigan
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
echo "in hifigan syn"
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize.py \
--am=speedyspeech_csmsc \
--am_config=${config_path} \
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--am_stat=dump/train/speech_stats.npy \
--voc=hifigan_csmsc \
--voc_config=hifigan_csmsc_ckpt_0.1.1/default.yaml \
--voc_ckpt=hifigan_csmsc_ckpt_0.1.1/snapshot_iter_2500000.pdz \
--voc_stat=hifigan_csmsc_ckpt_0.1.1/feats_stats.npy \
--test_metadata=dump/test/norm/metadata.jsonl \
--output_dir=${train_output_path}/test \
--phones_dict=dump/phone_id_map.txt \
--tones_dict=dump/tone_id_map.txt
fi
# wavernn
if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
echo "in wavernn syn"
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize.py \
--am=speedyspeech_csmsc \
--am_config=${config_path} \
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--am_stat=dump/train/speech_stats.npy \
--voc=wavernn_csmsc \
--voc_config=wavernn_csmsc_ckpt_0.2.0/default.yaml \
--voc_ckpt=wavernn_csmsc_ckpt_0.2.0/snapshot_iter_400000.pdz \
--voc_stat=wavernn_csmsc_ckpt_0.2.0/feats_stats.npy \
--test_metadata=dump/test/norm/metadata.jsonl \
--output_dir=${train_output_path}/test \
--tones_dict=dump/tone_id_map.txt \
--phones_dict=dump/phone_id_map.txt
fi

@@ -7,6 +7,7 @@ ckpt_name=$3
stage=0
stop_stage=0
# pwgan
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
@@ -22,9 +23,9 @@ if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
--lang=zh \
--text=${BIN_DIR}/../sentences.txt \
--output_dir=${train_output_path}/test_e2e \
---inference_dir=${train_output_path}/inference \
--phones_dict=dump/phone_id_map.txt \
---tones_dict=dump/tone_id_map.txt
+--tones_dict=dump/tone_id_map.txt \
+--inference_dir=${train_output_path}/inference
fi
# for more GAN Vocoders
@@ -44,9 +45,9 @@ if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
--lang=zh \
--text=${BIN_DIR}/../sentences.txt \
--output_dir=${train_output_path}/test_e2e \
---inference_dir=${train_output_path}/inference \
--phones_dict=dump/phone_id_map.txt \
---tones_dict=dump/tone_id_map.txt
+--tones_dict=dump/tone_id_map.txt \
+--inference_dir=${train_output_path}/inference
fi
# the pretrained models haven't release now
@@ -88,12 +89,11 @@ if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
--lang=zh \
--text=${BIN_DIR}/../sentences.txt \
--output_dir=${train_output_path}/test_e2e \
---inference_dir=${train_output_path}/inference \
--phones_dict=dump/phone_id_map.txt \
---tones_dict=dump/tone_id_map.txt
+--tones_dict=dump/tone_id_map.txt \
+--inference_dir=${train_output_path}/inference
fi
# wavernn
if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
echo "in wavernn syn_e2e"

@@ -3,10 +3,14 @@
config_path=$1
train_output_path=$2
ckpt_name=$3
stage=0
stop_stage=0
# pwgan
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize.py \
--am=fastspeech2_csmsc \
--am_config=${config_path} \
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
@@ -18,3 +22,79 @@ python3 ${BIN_DIR}/../synthesize.py \
--test_metadata=dump/test/norm/metadata.jsonl \
--output_dir=${train_output_path}/test \
--phones_dict=dump/phone_id_map.txt
fi
# for more GAN Vocoders
# multi band melgan
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize.py \
--am=fastspeech2_csmsc \
--am_config=${config_path} \
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--am_stat=dump/train/speech_stats.npy \
--voc=mb_melgan_csmsc \
--voc_config=mb_melgan_csmsc_ckpt_0.1.1/default.yaml \
--voc_ckpt=mb_melgan_csmsc_ckpt_0.1.1/snapshot_iter_1000000.pdz\
--voc_stat=mb_melgan_csmsc_ckpt_0.1.1/feats_stats.npy \
--test_metadata=dump/test/norm/metadata.jsonl \
--output_dir=${train_output_path}/test \
--phones_dict=dump/phone_id_map.txt
fi
# style melgan
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize.py \
--am=fastspeech2_csmsc \
--am_config=${config_path} \
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--am_stat=dump/train/speech_stats.npy \
--voc=style_melgan_csmsc \
--voc_config=style_melgan_csmsc_ckpt_0.1.1/default.yaml \
--voc_ckpt=style_melgan_csmsc_ckpt_0.1.1/snapshot_iter_1500000.pdz \
--voc_stat=style_melgan_csmsc_ckpt_0.1.1/feats_stats.npy \
--test_metadata=dump/test/norm/metadata.jsonl \
--output_dir=${train_output_path}/test \
--phones_dict=dump/phone_id_map.txt
fi
# hifigan
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
echo "in hifigan syn"
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize.py \
--am=fastspeech2_csmsc \
--am_config=${config_path} \
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--am_stat=dump/train/speech_stats.npy \
--voc=hifigan_csmsc \
--voc_config=hifigan_csmsc_ckpt_0.1.1/default.yaml \
--voc_ckpt=hifigan_csmsc_ckpt_0.1.1/snapshot_iter_2500000.pdz \
--voc_stat=hifigan_csmsc_ckpt_0.1.1/feats_stats.npy \
--test_metadata=dump/test/norm/metadata.jsonl \
--output_dir=${train_output_path}/test \
--phones_dict=dump/phone_id_map.txt
fi
# wavernn
if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
echo "in wavernn syn"
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize.py \
--am=fastspeech2_csmsc \
--am_config=${config_path} \
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--am_stat=dump/train/speech_stats.npy \
--voc=wavernn_csmsc \
--voc_config=wavernn_csmsc_ckpt_0.2.0/default.yaml \
--voc_ckpt=wavernn_csmsc_ckpt_0.2.0/snapshot_iter_400000.pdz \
--voc_stat=wavernn_csmsc_ckpt_0.2.0/feats_stats.npy \
--test_metadata=dump/test/norm/metadata.jsonl \
--output_dir=${train_output_path}/test \
--phones_dict=dump/phone_id_map.txt
fi

@@ -7,6 +7,7 @@ ckpt_name=$3
stage=0
stop_stage=0
# pwgan
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
@@ -22,8 +23,8 @@ if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
--lang=zh \
--text=${BIN_DIR}/../sentences.txt \
--output_dir=${train_output_path}/test_e2e \
---inference_dir=${train_output_path}/inference \
---phones_dict=dump/phone_id_map.txt
+--phones_dict=dump/phone_id_map.txt \
+--inference_dir=${train_output_path}/inference
fi
# for more GAN Vocoders
@ -43,8 +44,8 @@ if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
--lang=zh \ --lang=zh \
--text=${BIN_DIR}/../sentences.txt \ --text=${BIN_DIR}/../sentences.txt \
--output_dir=${train_output_path}/test_e2e \ --output_dir=${train_output_path}/test_e2e \
--inference_dir=${train_output_path}/inference \ --phones_dict=dump/phone_id_map.txt \
--phones_dict=dump/phone_id_map.txt --inference_dir=${train_output_path}/inference
fi fi
# the pretrained models haven't release now # the pretrained models haven't release now
@ -86,8 +87,8 @@ if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
--lang=zh \ --lang=zh \
--text=${BIN_DIR}/../sentences.txt \ --text=${BIN_DIR}/../sentences.txt \
--output_dir=${train_output_path}/test_e2e \ --output_dir=${train_output_path}/test_e2e \
--inference_dir=${train_output_path}/inference \ --phones_dict=dump/phone_id_map.txt \
--phones_dict=dump/phone_id_map.txt --inference_dir=${train_output_path}/inference
fi fi

@@ -10,7 +10,7 @@ Run the command below to get the results of the test.
```bash
./run.sh
```
-The `avg WER` of g2p is: 0.027124048652822204
+The `avg WER` of g2p is: 0.026014352515701198
```text
,--------------------------------------------------------------------.
|        |  # Snt   # Wrd  |  Corr    Sub    Del    Ins    Err  S.Err |

@@ -11,3 +11,6 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
+import _locale
+
+_locale._getdefaultlocale = (lambda *args: ['en_US', 'utf8'])

@@ -18,6 +18,7 @@ from .base_commands import BaseCommand
from .base_commands import HelpCommand
from .cls import CLSExecutor
from .st import STExecutor
+from .stats import StatsExecutor
from .text import TextExecutor
from .tts import TTSExecutor

@@ -0,0 +1,14 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from .infer import StatsExecutor

@@ -0,0 +1,193 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
from typing import List
from prettytable import PrettyTable
from ..log import logger
from ..utils import cli_register
from ..utils import stats_wrapper
__all__ = ['StatsExecutor']
model_name_format = {
'asr': 'Model-Language-Sample Rate',
'cls': 'Model-Sample Rate',
'st': 'Model-Source language-Target language',
'text': 'Model-Task-Language',
'tts': 'Model-Language'
}
@cli_register(
name='paddlespeech.stats',
description='Get speech tasks support models list.')
class StatsExecutor():
def __init__(self):
super(StatsExecutor, self).__init__()
self.parser = argparse.ArgumentParser(
prog='paddlespeech.stats', add_help=True)
self.parser.add_argument(
'--task',
type=str,
default='asr',
choices=['asr', 'cls', 'st', 'text', 'tts'],
help='Choose speech task.',
required=True)
self.task_choices = ['asr', 'cls', 'st', 'text', 'tts']
def show_support_models(self, pretrained_models: dict):
fields = model_name_format[self.task].split("-")
table = PrettyTable(fields)
for key in pretrained_models:
table.add_row(key.split("-"))
print(table)
def execute(self, argv: List[str]) -> bool:
"""
Command line entry.
"""
parser_args = self.parser.parse_args(argv)
self.task = parser_args.task
if self.task not in self.task_choices:
logger.error(
"Please input correct speech task, choices = ['asr', 'cls', 'st', 'text', 'tts']"
)
return False
elif self.task == 'asr':
try:
from ..asr.infer import pretrained_models
logger.info(
"Here is the list of ASR pretrained models released by PaddleSpeech that can be used by command line and python API"
)
self.show_support_models(pretrained_models)
return True
except BaseException:
logger.error("Failed to get the list of ASR pretrained models.")
return False
elif self.task == 'cls':
try:
from ..cls.infer import pretrained_models
logger.info(
"Here is the list of CLS pretrained models released by PaddleSpeech that can be used by command line and python API"
)
self.show_support_models(pretrained_models)
return True
except BaseException:
logger.error("Failed to get the list of CLS pretrained models.")
return False
elif self.task == 'st':
try:
from ..st.infer import pretrained_models
logger.info(
"Here is the list of ST pretrained models released by PaddleSpeech that can be used by command line and python API"
)
self.show_support_models(pretrained_models)
return True
except BaseException:
logger.error("Failed to get the list of ST pretrained models.")
return False
elif self.task == 'text':
try:
from ..text.infer import pretrained_models
logger.info(
"Here is the list of TEXT pretrained models released by PaddleSpeech that can be used by command line and python API"
)
self.show_support_models(pretrained_models)
return True
except BaseException:
logger.error(
"Failed to get the list of TEXT pretrained models.")
return False
elif self.task == 'tts':
try:
from ..tts.infer import pretrained_models
logger.info(
"Here is the list of TTS pretrained models released by PaddleSpeech that can be used by command line and python API"
)
self.show_support_models(pretrained_models)
return True
except BaseException:
logger.error("Failed to get the list of TTS pretrained models.")
return False
@stats_wrapper
def __call__(
self,
task: str=None, ):
"""
Python API to call an executor.
"""
self.task = task
if self.task not in self.task_choices:
print(
"Please input correct speech task, choices = ['asr', 'cls', 'st', 'text', 'tts']"
)
elif self.task == 'asr':
try:
from ..asr.infer import pretrained_models
print(
"Here is the list of ASR pretrained models released by PaddleSpeech that can be used by command line and python API"
)
self.show_support_models(pretrained_models)
except BaseException:
print("Failed to get the list of ASR pretrained models.")
elif self.task == 'cls':
try:
from ..cls.infer import pretrained_models
print(
"Here is the list of CLS pretrained models released by PaddleSpeech that can be used by command line and python API"
)
self.show_support_models(pretrained_models)
except BaseException:
print("Failed to get the list of CLS pretrained models.")
elif self.task == 'st':
try:
from ..st.infer import pretrained_models
print(
"Here is the list of ST pretrained models released by PaddleSpeech that can be used by command line and python API"
)
self.show_support_models(pretrained_models)
except BaseException:
print("Failed to get the list of ST pretrained models.")
elif self.task == 'text':
try:
from ..text.infer import pretrained_models
print(
"Here is the list of TEXT pretrained models released by PaddleSpeech that can be used by command line and python API"
)
self.show_support_models(pretrained_models)
except BaseException:
print("Failed to get the list of TEXT pretrained models.")
elif self.task == 'tts':
try:
from ..tts.infer import pretrained_models
print(
"Here is the list of TTS pretrained models released by PaddleSpeech that can be used by command line and python API"
)
self.show_support_models(pretrained_models)
except BaseException:
print("Failed to get the list of TTS pretrained models.")
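The new `StatsExecutor` is exposed both as the `paddlespeech stats` subcommand and as a small Python API; a minimal usage sketch, assuming the package layout added in this PR (`paddlespeech.cli.stats` re-exports the class):

```python
# Minimal usage sketch for the new stats executor (paths assume this PR's layout).
from paddlespeech.cli.stats import StatsExecutor

stats = StatsExecutor()

# Command-line style entry: prints a PrettyTable whose columns follow
# model_name_format (e.g. "Model-Language-Sample Rate" for ASR).
stats.execute(argv=["--task", "asr"])

# Python API entry: same table, no return value to check.
stats(task="tts")
```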

@@ -13,6 +13,7 @@
# limitations under the License.
import argparse
import os
+import time
from collections import OrderedDict
from typing import Any
from typing import List
@@ -621,6 +622,7 @@ class TTSExecutor(BaseExecutor):
        am_dataset = am[am.rindex('_') + 1:]
        get_tone_ids = False
        merge_sentences = False
+        frontend_st = time.time()
        if am_name == 'speedyspeech':
            get_tone_ids = True
        if lang == 'zh':
@@ -637,9 +639,13 @@ class TTSExecutor(BaseExecutor):
            phone_ids = input_ids["phone_ids"]
        else:
            print("lang should in {'zh', 'en'}!")
+        self.frontend_time = time.time() - frontend_st
+        self.am_time = 0
+        self.voc_time = 0
        flags = 0
        for i in range(len(phone_ids)):
+            am_st = time.time()
            part_phone_ids = phone_ids[i]
            # am
            if am_name == 'speedyspeech':
@@ -653,13 +659,16 @@ class TTSExecutor(BaseExecutor):
                    part_phone_ids, spk_id=paddle.to_tensor(spk_id))
            else:
                mel = self.am_inference(part_phone_ids)
+            self.am_time += (time.time() - am_st)
            # voc
+            voc_st = time.time()
            wav = self.voc_inference(mel)
            if flags == 0:
                wav_all = wav
                flags = 1
            else:
                wav_all = paddle.concat([wav_all, wav])
+            self.voc_time += (time.time() - voc_st)
        self._outputs['wav'] = wav_all
    def postprocess(self, output: str='output.wav') -> Union[str, os.PathLike]:
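With the timers added here, the executor records how long the frontend, acoustic model, and vocoder each take. A common way to read these numbers is the real-time factor (RTF), i.e. processing time divided by the duration of the generated audio; a hedged sketch (the attribute names come from this diff, the 24 kHz sample rate is only an example):

```python
# Hedged sketch: compute an RTF from the timing attributes added in this change.
# `executor` stands for a TTSExecutor instance after one synthesis call.
def report_rtf(executor, sample_rate: int = 24000):
    num_samples = executor._outputs['wav'].shape[0]
    duration = num_samples / sample_rate          # seconds of generated audio
    total = executor.frontend_time + executor.am_time + executor.voc_time
    return {
        "frontend_s": executor.frontend_time,
        "am_s": executor.am_time,
        "voc_s": executor.voc_time,
        "rtf": total / duration,                  # < 1.0 means faster than real time
    }
```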

@@ -51,7 +51,7 @@ def _batch_shuffle(indices, batch_size, epoch, clipped=False):
    """
    rng = np.random.RandomState(epoch)
    shift_len = rng.randint(0, batch_size - 1)
-    batch_indices = list(zip(*[iter(indices[shift_len:])] * batch_size))
+    batch_indices = list(zip(* [iter(indices[shift_len:])] * batch_size))
    rng.shuffle(batch_indices)
    batch_indices = [item for batch in batch_indices for item in batch]
    assert clipped is False
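The line being reformatted here relies on a standard Python idiom: `zip(*[iter(seq)] * n)` repeats one iterator object `n` times, so `zip` pulls consecutive items from the same iterator and yields non-overlapping chunks of `n`. A standalone illustration:

```python
# Standalone illustration of the chunking idiom used in _batch_shuffle.
indices = list(range(10))
batch_size = 3
shift_len = 1

# One iterator repeated batch_size times: each tuple is the next batch of
# batch_size indices; any trailing remainder shorter than batch_size is dropped.
batches = list(zip(*[iter(indices[shift_len:])] * batch_size))
print(batches)  # [(1, 2, 3), (4, 5, 6), (7, 8, 9)]
```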

@@ -33,8 +33,6 @@ from paddlespeech.s2t.modules.decoder import TransformerDecoder
from paddlespeech.s2t.modules.encoder import ConformerEncoder
from paddlespeech.s2t.modules.encoder import TransformerEncoder
from paddlespeech.s2t.modules.loss import LabelSmoothingLoss
-from paddlespeech.s2t.modules.mask import mask_finished_preds
-from paddlespeech.s2t.modules.mask import mask_finished_scores
from paddlespeech.s2t.modules.mask import subsequent_mask
from paddlespeech.s2t.utils import checkpoint
from paddlespeech.s2t.utils import layer_tools

@@ -14,3 +14,4 @@
from .paddlespeech_client import ASRClientExecutor
from .paddlespeech_client import TTSClientExecutor
from .paddlespeech_server import ServerExecutor
+from .paddlespeech_server import ServerStatsExecutor

@@ -12,8 +12,8 @@
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import uvicorn
-import yaml
from fastapi import FastAPI
from paddlespeech.server.engine.engine_pool import init_engine_pool

@@ -48,8 +48,9 @@ class TTSClientExecutor(BaseExecutor):
        self.parser.add_argument(
            '--input',
            type=str,
-            default="你好,欢迎使用语音合成服务",
-            help='A sentence to be synthesized.')
+            default=None,
+            help='Text to be synthesized.',
+            required=True)
        self.parser.add_argument(
            '--spk_id', type=int, default=0, help='Speaker id')
        self.parser.add_argument(
@@ -120,10 +121,9 @@ class TTSClientExecutor(BaseExecutor):
                        (args.output))
            logger.info("Audio duration: %f s." % (duration))
            logger.info("Response time: %f s." % (time_consume))
-            logger.info("RTF: %f " % (time_consume / duration))
            return True
-        except:
+        except BaseException:
            logger.error("Failed to synthesized audio.")
            return False
@@ -163,7 +163,7 @@ class TTSClientExecutor(BaseExecutor):
            print("Audio duration: %f s." % (duration))
            print("Response time: %f s." % (time_consume))
            print("RTF: %f " % (time_consume / duration))
-        except:
+        except BaseException:
            print("Failed to synthesized audio.")
@@ -181,8 +181,9 @@ class ASRClientExecutor(BaseExecutor):
        self.parser.add_argument(
            '--input',
            type=str,
-            default="./paddlespeech/server/tests/16_audio.wav",
-            help='Audio file to be recognized')
+            default=None,
+            help='Audio file to be recognized',
+            required=True)
        self.parser.add_argument(
            '--sample_rate', type=int, default=16000, help='audio sample rate')
        self.parser.add_argument(
@@ -209,7 +210,7 @@ class ASRClientExecutor(BaseExecutor):
            logger.info(r.json())
            logger.info("time cost %f s." % (time_end - time_start))
            return True
-        except:
+        except BaseException:
            logger.error("Failed to speech recognition.")
            return False
@@ -240,5 +241,5 @@ class ASRClientExecutor(BaseExecutor):
            time_end = time.time()
            print(r.json())
            print("time cost %f s." % (time_end - time_start))
-        except:
+        except BaseException:
            print("Failed to speech recognition.")

@@ -16,15 +16,17 @@ from typing import List
import uvicorn
from fastapi import FastAPI
+from prettytable import PrettyTable
from ..executor import BaseExecutor
from ..util import cli_server_register
from ..util import stats_wrapper
-from paddlespeech.server.engine.engine_factory import EngineFactory
+from paddlespeech.cli.log import logger
+from paddlespeech.server.engine.engine_pool import init_engine_pool
from paddlespeech.server.restful.api import setup_router
from paddlespeech.server.utils.config import get_config
-__all__ = ['ServerExecutor']
+__all__ = ['ServerExecutor', 'ServerStatsExecutor']
app = FastAPI(
    title="PaddleSpeech Serving API", description="Api", version="0.0.1")
@@ -41,7 +43,8 @@
            "--config_file",
            action="store",
            help="yaml file of the app",
-            default="./conf/application.yaml")
+            default=None,
+            required=True)
        self.parser.add_argument(
            "--log_file",
@@ -51,8 +54,10 @@
    def init(self, config) -> bool:
        """system initialization
        Args:
            config (CfgNode): config object
        Returns:
            bool:
        """
@@ -61,12 +66,7 @@
        api_router = setup_router(api_list)
        app.include_router(api_router)
-        # init engine
-        engine_pool = []
-        for engine in config.engine_backend:
-            engine_pool.append(EngineFactory.get_engine(engine_name=engine))
-            if not engine_pool[-1].init(
-                    config_file=config.engine_backend[engine]):
+        if not init_engine_pool(config):
            return False
        return True
@@ -88,3 +88,139 @@
        config = get_config(config_file)
        if self.init(config):
            uvicorn.run(app, host=config.host, port=config.port, debug=True)
@cli_server_register(
name='paddlespeech_server.stats',
description='Get the models supported by each speech task in the service.')
class ServerStatsExecutor():
def __init__(self):
super(ServerStatsExecutor, self).__init__()
self.parser = argparse.ArgumentParser(
prog='paddlespeech_server.stats', add_help=True)
self.parser.add_argument(
'--task',
type=str,
default=None,
choices=['asr', 'tts'],
help='Choose speech task.',
required=True)
self.task_choices = ['asr', 'tts']
self.model_name_format = {
'asr': 'Model-Language-Sample Rate',
'tts': 'Model-Language'
}
def show_support_models(self, pretrained_models: dict):
fields = self.model_name_format[self.task].split("-")
table = PrettyTable(fields)
for key in pretrained_models:
table.add_row(key.split("-"))
print(table)
def execute(self, argv: List[str]) -> bool:
"""
Command line entry.
"""
parser_args = self.parser.parse_args(argv)
self.task = parser_args.task
if self.task not in self.task_choices:
logger.error(
"Please input correct speech task, choices = ['asr', 'tts']")
return False
elif self.task == 'asr':
try:
from paddlespeech.cli.asr.infer import pretrained_models
logger.info(
"Here is the table of ASR pretrained models supported in the service."
)
self.show_support_models(pretrained_models)
# show ASR static pretrained model
from paddlespeech.server.engine.asr.paddleinference.asr_engine import pretrained_models
logger.info(
"Here is the table of ASR static pretrained models supported in the service."
)
self.show_support_models(pretrained_models)
return True
except BaseException:
logger.error(
"Failed to get the table of ASR pretrained models supported in the service."
)
return False
elif self.task == 'tts':
try:
from paddlespeech.cli.tts.infer import pretrained_models
logger.info(
"Here is the table of TTS pretrained models supported in the service."
)
self.show_support_models(pretrained_models)
# show TTS static pretrained model
from paddlespeech.server.engine.tts.paddleinference.tts_engine import pretrained_models
logger.info(
"Here is the table of TTS static pretrained models supported in the service."
)
self.show_support_models(pretrained_models)
return True
except BaseException:
logger.error(
"Failed to get the table of TTS pretrained models supported in the service."
)
return False
@stats_wrapper
def __call__(
self,
task: str=None, ):
"""
Python API to call an executor.
"""
self.task = task
if self.task not in self.task_choices:
print("Please input correct speech task, choices = ['asr', 'tts']")
elif self.task == 'asr':
try:
from paddlespeech.cli.asr.infer import pretrained_models
print(
"Here is the table of ASR pretrained models supported in the service."
)
self.show_support_models(pretrained_models)
# show ASR static pretrained model
from paddlespeech.server.engine.asr.paddleinference.asr_engine import pretrained_models
print(
"Here is the table of ASR static pretrained models supported in the service."
)
self.show_support_models(pretrained_models)
except BaseException:
print(
"Failed to get the table of ASR pretrained models supported in the service."
)
elif self.task == 'tts':
try:
from paddlespeech.cli.tts.infer import pretrained_models
print(
"Here is the table of TTS pretrained models supported in the service."
)
self.show_support_models(pretrained_models)
# show TTS static pretrained model
from paddlespeech.server.engine.tts.paddleinference.tts_engine import pretrained_models
print(
"Here is the table of TTS static pretrained models supported in the service."
)
self.show_support_models(pretrained_models)
except BaseException:
print(
"Failed to get the table of TTS pretrained models supported in the service."
)
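Like the CLI version, `ServerStatsExecutor` (registered as `paddlespeech_server.stats`) can be driven from the command line or from Python; a minimal sketch, assuming the class is importable from the `paddlespeech.server.bin` module this diff extends:

```python
# Hedged usage sketch for the server-side stats executor.
# The import path assumes paddlespeech/server/bin/paddlespeech_server.py from this PR.
from paddlespeech.server.bin.paddlespeech_server import ServerStatsExecutor

server_stats = ServerStatsExecutor()

# Command-line style entry: prints the dynamic (cli) and static (paddle inference)
# model tables for the chosen task.
server_stats.execute(argv=["--task", "asr"])

# Python API entry for the TTS tables.
server_stats(task="tts")
```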

@@ -3,18 +3,25 @@
##################################################################
#                       SERVER SETTING                           #
##################################################################
-host: '0.0.0.0'
+host: 127.0.0.1
port: 8090
##################################################################
#                        CONFIG FILE                             #
##################################################################
+# add engine backend type (Options: asr, tts) and config file here.
+# Adding a speech task to engine_backend means starting the service.
+engine_backend:
+    asr: 'conf/asr/asr.yaml'
+    tts: 'conf/tts/tts.yaml'
+
+# The engine_type of speech task needs to keep the same type as the config file of speech task.
+# E.g: The engine_type of asr is 'python', the engine_backend of asr is 'XX/asr.yaml'
+# E.g: The engine_type of asr is 'inference', the engine_backend of asr is 'XX/asr_pd.yaml'
+#
# add engine type (Options: python, inference)
engine_type:
-    asr: 'inference'
-    # tts: 'inference'
+    asr: 'python'
+    tts: 'python'
-# add engine backend type (Options: asr, tts) and config file here.
-engine_backend:
-    asr: 'conf/asr/asr_pd.yaml'
-    #tts: 'conf/tts/tts_pd.yaml'
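As the new comments state, every task listed under `engine_backend` is started, and its `engine_type` must match the kind of config file it points to ('python' with `asr.yaml`/`tts.yaml`, 'inference' with the `*_pd.yaml` variants). A small, hypothetical sanity check for that pairing:

```python
# Hypothetical helper: check that engine_type and engine_backend agree,
# following the convention described in the yaml comments above.
import yaml

def check_engine_config(path="conf/application.yaml"):
    with open(path) as f:
        cfg = yaml.safe_load(f)
    for task, conf_file in cfg.get("engine_backend", {}).items():
        engine_type = cfg.get("engine_type", {}).get(task)
        is_pd_conf = conf_file.endswith("_pd.yaml")
        if engine_type == "inference" and not is_pd_conf:
            raise ValueError(f"{task}: 'inference' engine expects a *_pd.yaml config")
        if engine_type == "python" and is_pd_conf:
            raise ValueError(f"{task}: 'python' engine expects a non *_pd.yaml config")

check_engine_config()
```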

@@ -5,3 +5,4 @@ cfg_path:  # [optional]
ckpt_path:  # [optional]
decode_method: 'attention_rescoring'
force_yes: True
+device:  # set 'gpu:id' or 'cpu'

@@ -15,9 +15,10 @@ decode_method:
force_yes: True
am_predictor_conf:
-    use_gpu: True
-    enable_mkldnn: True
+    device:  # set 'gpu:id' or 'cpu'
    switch_ir_optim: True
+    glog_info: False  # True -> print glog
+    summary: True  # False -> do not show predictor config
##################################################################

@@ -29,4 +29,4 @@ voc_stat:
#                             OTHERS                             #
##################################################################
lang: 'zh'
-device: paddle.get_device()
+device:  # set 'gpu:id' or 'cpu'

@@ -6,8 +6,8 @@
# am choices=['speedyspeech_csmsc', 'fastspeech2_csmsc']
##################################################################
am: 'fastspeech2_csmsc'
-am_model:   # the pdmodel file of am static model
-am_params:  # the pdiparams file of am static model
+am_model:   # the pdmodel file of your am static model (XX.pdmodel)
+am_params:  # the pdiparams file of your am static model (XX.pdipparams)
am_sample_rate: 24000
phones_dict:
tones_dict:
@@ -15,9 +15,10 @@ speaker_dict:
spk_id: 0
am_predictor_conf:
-    use_gpu: True
-    enable_mkldnn: True
+    device:  # set 'gpu:id' or 'cpu'
    switch_ir_optim: True
+    glog_info: False  # True -> print glog
+    summary: True  # False -> do not show predictor config
##################################################################
@@ -25,17 +26,17 @@ am_predictor_conf:
# voc choices=['pwgan_csmsc', 'mb_melgan_csmsc','hifigan_csmsc']
##################################################################
voc: 'pwgan_csmsc'
-voc_model:   # the pdmodel file of vocoder static model
-voc_params:  # the pdiparams file of vocoder static model
+voc_model:   # the pdmodel file of your vocoder static model (XX.pdmodel)
+voc_params:  # the pdiparams file of your vocoder static model (XX.pdipparams)
voc_sample_rate: 24000
voc_predictor_conf:
-    use_gpu: True
-    enable_mkldnn: True
+    device:  # set 'gpu:id' or 'cpu'
    switch_ir_optim: True
+    glog_info: False  # True -> print glog
+    summary: True  # False -> do not show predictor config
##################################################################
#                             OTHERS                             #
##################################################################
lang: 'zh'
-device: paddle.get_device()
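The hard-coded `use_gpu`/`paddle.get_device()` fields are replaced by a single optional `device` entry; when it is left empty, the engine falls back to whatever device Paddle currently reports. A sketch of that fallback, mirroring the engine code later in this diff:

```python
# Sketch of the device-selection fallback the engines apply to these configs.
import paddle

def resolve_device(config_device: str = None) -> str:
    # An empty `device:` field in the yaml arrives here as None/'' and falls
    # back to the current Paddle device (e.g. 'gpu:0' or 'cpu').
    device = config_device if config_device else paddle.get_device()
    paddle.set_device(device)
    return device
```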

@ -13,31 +13,25 @@
# limitations under the License. # limitations under the License.
import io import io
import os import os
from typing import List import time
from typing import Optional from typing import Optional
from typing import Union
import librosa
import paddle import paddle
import soundfile
from yacs.config import CfgNode from yacs.config import CfgNode
from paddlespeech.cli.utils import MODEL_HOME
from paddlespeech.s2t.modules.ctc import CTCDecoder
from paddlespeech.cli.asr.infer import ASRExecutor from paddlespeech.cli.asr.infer import ASRExecutor
from paddlespeech.cli.log import logger from paddlespeech.cli.log import logger
from paddlespeech.cli.utils import MODEL_HOME
from paddlespeech.s2t.frontend.featurizer.text_featurizer import TextFeaturizer from paddlespeech.s2t.frontend.featurizer.text_featurizer import TextFeaturizer
from paddlespeech.s2t.transform.transformation import Transformation from paddlespeech.s2t.modules.ctc import CTCDecoder
from paddlespeech.s2t.utils.dynamic_import import dynamic_import
from paddlespeech.s2t.utils.utility import UpdateConfig from paddlespeech.s2t.utils.utility import UpdateConfig
from paddlespeech.server.engine.base_engine import BaseEngine
from paddlespeech.server.utils.config import get_config from paddlespeech.server.utils.config import get_config
from paddlespeech.server.utils.paddle_predictor import init_predictor from paddlespeech.server.utils.paddle_predictor import init_predictor
from paddlespeech.server.utils.paddle_predictor import run_model from paddlespeech.server.utils.paddle_predictor import run_model
from paddlespeech.server.engine.base_engine import BaseEngine
__all__ = ['ASREngine'] __all__ = ['ASREngine']
pretrained_models = { pretrained_models = {
"deepspeech2offline_aishell-zh-16k": { "deepspeech2offline_aishell-zh-16k": {
'url': 'url':
@ -143,7 +137,6 @@ class ASRServerExecutor(ASRExecutor):
batch_average=True, # sum / batch_size batch_average=True, # sum / batch_size
grad_norm_type=self.config.get('ctc_grad_norm_type', None)) grad_norm_type=self.config.get('ctc_grad_norm_type', None))
@paddle.no_grad() @paddle.no_grad()
def infer(self, model_type: str): def infer(self, model_type: str):
""" """
@ -161,8 +154,7 @@ class ASRServerExecutor(ASRExecutor):
cfg.beam_size, cfg.cutoff_prob, cfg.cutoff_top_n, cfg.beam_size, cfg.cutoff_prob, cfg.cutoff_top_n,
cfg.num_proc_bsearch) cfg.num_proc_bsearch)
output_data = run_model( output_data = run_model(self.am_predictor,
self.am_predictor,
[audio.numpy(), audio_len.numpy()]) [audio.numpy(), audio_len.numpy()])
probs = output_data[0] probs = output_data[0]
@ -206,7 +198,6 @@ class ASREngine(BaseEngine):
self.executor = ASRServerExecutor() self.executor = ASRServerExecutor()
self.config = get_config(config_file) self.config = get_config(config_file)
paddle.set_device(paddle.get_device())
self.executor._init_from_path( self.executor._init_from_path(
model_type=self.config.model_type, model_type=self.config.model_type,
am_model=self.config.am_model, am_model=self.config.am_model,
@ -230,14 +221,20 @@ class ASREngine(BaseEngine):
io.BytesIO(audio_data), self.config.sample_rate, io.BytesIO(audio_data), self.config.sample_rate,
self.config.force_yes): self.config.force_yes):
logger.info("start running asr engine") logger.info("start running asr engine")
self.executor.preprocess(self.config.model_type, io.BytesIO(audio_data)) self.executor.preprocess(self.config.model_type,
io.BytesIO(audio_data))
st = time.time()
self.executor.infer(self.config.model_type) self.executor.infer(self.config.model_type)
infer_time = time.time() - st
self.output = self.executor.postprocess() # Retrieve result of asr. self.output = self.executor.postprocess() # Retrieve result of asr.
logger.info("end inferring asr engine") logger.info("end inferring asr engine")
else: else:
logger.info("file check failed!") logger.info("file check failed!")
self.output = None self.output = None
logger.info("inference time: {}".format(infer_time))
logger.info("asr engine type: paddle inference")
def postprocess(self): def postprocess(self):
"""postprocess """postprocess
""" """

@ -12,21 +12,12 @@
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
import io import io
import os import time
from typing import List
from typing import Optional
from typing import Union
import librosa
import paddle import paddle
import soundfile
from paddlespeech.cli.asr.infer import ASRExecutor from paddlespeech.cli.asr.infer import ASRExecutor
from paddlespeech.cli.log import logger from paddlespeech.cli.log import logger
from paddlespeech.s2t.frontend.featurizer.text_featurizer import TextFeaturizer
from paddlespeech.s2t.transform.transformation import Transformation
from paddlespeech.s2t.utils.dynamic_import import dynamic_import
from paddlespeech.s2t.utils.utility import UpdateConfig
from paddlespeech.server.engine.base_engine import BaseEngine from paddlespeech.server.engine.base_engine import BaseEngine
from paddlespeech.server.utils.config import get_config from paddlespeech.server.utils.config import get_config
@ -63,13 +54,24 @@ class ASREngine(BaseEngine):
self.executor = ASRServerExecutor() self.executor = ASRServerExecutor()
self.config = get_config(config_file) self.config = get_config(config_file)
paddle.set_device(paddle.get_device()) try:
if self.config.device:
self.device = self.config.device
else:
self.device = paddle.get_device()
paddle.set_device(self.device)
except BaseException:
logger.error(
"Set device failed, please check if device is already used and the parameter 'device' in the yaml file"
)
self.executor._init_from_path( self.executor._init_from_path(
self.config.model, self.config.lang, self.config.sample_rate, self.config.model, self.config.lang, self.config.sample_rate,
self.config.cfg_path, self.config.decode_method, self.config.cfg_path, self.config.decode_method,
self.config.ckpt_path) self.config.ckpt_path)
logger.info("Initialize ASR server engine successfully.") logger.info("Initialize ASR server engine successfully on device: %s." %
(self.device))
return True return True
def run(self, audio_data): def run(self, audio_data):
@ -83,12 +85,17 @@ class ASREngine(BaseEngine):
self.config.force_yes): self.config.force_yes):
logger.info("start run asr engine") logger.info("start run asr engine")
self.executor.preprocess(self.config.model, io.BytesIO(audio_data)) self.executor.preprocess(self.config.model, io.BytesIO(audio_data))
st = time.time()
self.executor.infer(self.config.model) self.executor.infer(self.config.model)
infer_time = time.time() - st
self.output = self.executor.postprocess() # Retrieve result of asr. self.output = self.executor.postprocess() # Retrieve result of asr.
else: else:
logger.info("file check failed!") logger.info("file check failed!")
self.output = None self.output = None
logger.info("inference time: {}".format(infer_time))
logger.info("asr engine type: python")
def postprocess(self): def postprocess(self):
"""postprocess """postprocess
""" """

@@ -12,8 +12,6 @@
# See the License for the specific language governing permissions and
# limitations under the License.
import os
-from typing import Any
-from typing import List
from typing import Union
from pattern_singleton import Singleton

@@ -13,7 +13,6 @@
# limitations under the License.
from typing import Text
__all__ = ['EngineFactory']

@@ -29,8 +29,10 @@ def init_engine_pool(config) -> bool:
    """
    global ENGINE_POOL
    for engine in config.engine_backend:
-        ENGINE_POOL[engine] = EngineFactory.get_engine(engine_name=engine, engine_type=config.engine_type[engine])
-        if not ENGINE_POOL[engine].init(config_file=config.engine_backend[engine]):
+        ENGINE_POOL[engine] = EngineFactory.get_engine(
+            engine_name=engine, engine_type=config.engine_type[engine])
+        if not ENGINE_POOL[engine].init(
+                config_file=config.engine_backend[engine]):
            return False
    return True
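Engines are now created once at startup and shared through a process-wide pool; request handlers then look their engine up by task name instead of constructing it per request, as the TTS route further down in this diff does. A minimal sketch of that lookup:

```python
# Sketch of how a request handler consumes the engine pool initialized above.
from paddlespeech.server.engine.engine_pool import get_engine_pool

def handle_tts_request(text: str, spk_id: int = 0):
    engine_pool = get_engine_pool()   # filled by init_engine_pool(config) at startup
    tts_engine = engine_pool['tts']   # keyed by the task names in engine_backend
    # run(sentence, spk_id, speed, volume, sample_rate, save_path), as used in tts_api
    return tts_engine.run(text, spk_id, 1.0, 1.0, 0, None)
```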

@ -14,6 +14,7 @@
import base64 import base64
import io import io
import os import os
import time
from typing import Optional from typing import Optional
import librosa import librosa
@ -179,7 +180,7 @@ class TTSServerExecutor(TTSExecutor):
self.phones_dict = os.path.abspath(phones_dict) self.phones_dict = os.path.abspath(phones_dict)
self.am_sample_rate = am_sample_rate self.am_sample_rate = am_sample_rate
self.am_res_path = os.path.dirname(os.path.abspath(self.am_model)) self.am_res_path = os.path.dirname(os.path.abspath(self.am_model))
print("self.phones_dict:", self.phones_dict) logger.info("self.phones_dict: {}".format(self.phones_dict))
# for speedyspeech # for speedyspeech
self.tones_dict = None self.tones_dict = None
@ -224,21 +225,21 @@ class TTSServerExecutor(TTSExecutor):
with open(self.phones_dict, "r") as f: with open(self.phones_dict, "r") as f:
phn_id = [line.strip().split() for line in f.readlines()] phn_id = [line.strip().split() for line in f.readlines()]
vocab_size = len(phn_id) vocab_size = len(phn_id)
print("vocab_size:", vocab_size) logger.info("vocab_size: {}".format(vocab_size))
tone_size = None tone_size = None
if self.tones_dict: if self.tones_dict:
with open(self.tones_dict, "r") as f: with open(self.tones_dict, "r") as f:
tone_id = [line.strip().split() for line in f.readlines()] tone_id = [line.strip().split() for line in f.readlines()]
tone_size = len(tone_id) tone_size = len(tone_id)
print("tone_size:", tone_size) logger.info("tone_size: {}".format(tone_size))
spk_num = None spk_num = None
if self.speaker_dict: if self.speaker_dict:
with open(self.speaker_dict, 'rt') as f: with open(self.speaker_dict, 'rt') as f:
spk_id = [line.strip().split() for line in f.readlines()] spk_id = [line.strip().split() for line in f.readlines()]
spk_num = len(spk_id) spk_num = len(spk_id)
print("spk_num:", spk_num) logger.info("spk_num: {}".format(spk_num))
# frontend # frontend
if lang == 'zh': if lang == 'zh':
@ -248,21 +249,29 @@ class TTSServerExecutor(TTSExecutor):
elif lang == 'en': elif lang == 'en':
self.frontend = English(phone_vocab_path=self.phones_dict) self.frontend = English(phone_vocab_path=self.phones_dict)
print("frontend done!") logger.info("frontend done!")
try:
# am predictor # am predictor
self.am_predictor_conf = am_predictor_conf self.am_predictor_conf = am_predictor_conf
self.am_predictor = init_predictor( self.am_predictor = init_predictor(
model_file=self.am_model, model_file=self.am_model,
params_file=self.am_params, params_file=self.am_params,
predictor_conf=self.am_predictor_conf) predictor_conf=self.am_predictor_conf)
logger.info("Create AM predictor successfully.")
except BaseException:
logger.error("Failed to create AM predictor.")
try:
# voc predictor # voc predictor
self.voc_predictor_conf = voc_predictor_conf self.voc_predictor_conf = voc_predictor_conf
self.voc_predictor = init_predictor( self.voc_predictor = init_predictor(
model_file=self.voc_model, model_file=self.voc_model,
params_file=self.voc_params, params_file=self.voc_params,
predictor_conf=self.voc_predictor_conf) predictor_conf=self.voc_predictor_conf)
logger.info("Create Vocoder predictor successfully.")
except BaseException:
logger.error("Failed to create Vocoder predictor.")
@paddle.no_grad() @paddle.no_grad()
def infer(self, def infer(self,
@ -277,6 +286,7 @@ class TTSServerExecutor(TTSExecutor):
am_dataset = am[am.rindex('_') + 1:] am_dataset = am[am.rindex('_') + 1:]
get_tone_ids = False get_tone_ids = False
merge_sentences = False merge_sentences = False
frontend_st = time.time()
if am_name == 'speedyspeech': if am_name == 'speedyspeech':
get_tone_ids = True get_tone_ids = True
if lang == 'zh': if lang == 'zh':
@ -292,10 +302,14 @@ class TTSServerExecutor(TTSExecutor):
text, merge_sentences=merge_sentences) text, merge_sentences=merge_sentences)
phone_ids = input_ids["phone_ids"] phone_ids = input_ids["phone_ids"]
else: else:
print("lang should in {'zh', 'en'}!") logger.error("lang should in {'zh', 'en'}!")
self.frontend_time = time.time() - frontend_st
self.am_time = 0
self.voc_time = 0
flags = 0 flags = 0
for i in range(len(phone_ids)): for i in range(len(phone_ids)):
am_st = time.time()
part_phone_ids = phone_ids[i] part_phone_ids = phone_ids[i]
# am # am
if am_name == 'speedyspeech': if am_name == 'speedyspeech':
@ -314,7 +328,10 @@ class TTSServerExecutor(TTSExecutor):
am_result = run_model(self.am_predictor, am_result = run_model(self.am_predictor,
[part_phone_ids.numpy()]) [part_phone_ids.numpy()])
mel = am_result[0] mel = am_result[0]
self.am_time += (time.time() - am_st)
# voc # voc
voc_st = time.time()
voc_result = run_model(self.voc_predictor, [mel]) voc_result = run_model(self.voc_predictor, [mel])
wav = voc_result[0] wav = voc_result[0]
wav = paddle.to_tensor(wav) wav = paddle.to_tensor(wav)
@ -324,6 +341,7 @@ class TTSServerExecutor(TTSExecutor):
flags = 1 flags = 1
else: else:
wav_all = paddle.concat([wav_all, wav]) wav_all = paddle.concat([wav_all, wav])
self.voc_time += (time.time() - voc_st)
self._outputs['wav'] = wav_all self._outputs['wav'] = wav_all
@ -344,7 +362,6 @@ class TTSEngine(BaseEngine):
try: try:
self.config = get_config(config_file) self.config = get_config(config_file)
self.executor._init_from_path( self.executor._init_from_path(
am=self.config.am, am=self.config.am,
am_model=self.config.am_model, am_model=self.config.am_model,
@ -361,8 +378,8 @@ class TTSEngine(BaseEngine):
am_predictor_conf=self.config.am_predictor_conf, am_predictor_conf=self.config.am_predictor_conf,
voc_predictor_conf=self.config.voc_predictor_conf, ) voc_predictor_conf=self.config.voc_predictor_conf, )
except: except BaseException:
logger.info("Initialize TTS server engine Failed.") logger.error("Initialize TTS server engine Failed.")
return False return False
logger.info("Initialize TTS server engine successfully.") logger.info("Initialize TTS server engine successfully.")
@ -371,7 +388,7 @@ class TTSEngine(BaseEngine):
def postprocess(self, def postprocess(self,
wav, wav,
original_fs: int, original_fs: int,
target_fs: int=16000, target_fs: int=0,
volume: float=1.0, volume: float=1.0,
speed: float=1.0, speed: float=1.0,
audio_path: str=None): audio_path: str=None):
@ -396,36 +413,50 @@ class TTSEngine(BaseEngine):
if target_fs == 0 or target_fs > original_fs: if target_fs == 0 or target_fs > original_fs:
target_fs = original_fs target_fs = original_fs
wav_tar_fs = wav wav_tar_fs = wav
logger.info(
"The sample rate of synthesized audio is the same as model, which is {}Hz".
format(original_fs))
else: else:
wav_tar_fs = librosa.resample( wav_tar_fs = librosa.resample(
np.squeeze(wav), original_fs, target_fs) np.squeeze(wav), original_fs, target_fs)
logger.info(
"The sample rate of model is {}Hz and the target sample rate is {}Hz. Converting the sample rate of the synthesized audio successfully.".
format(original_fs, target_fs))
# transform volume # transform volume
wav_vol = wav_tar_fs * volume wav_vol = wav_tar_fs * volume
logger.info("Transform the volume of the audio successfully.")
# transform speed # transform speed
try: # windows not support soxbindings try: # windows not support soxbindings
wav_speed = change_speed(wav_vol, speed, target_fs) wav_speed = change_speed(wav_vol, speed, target_fs)
except: logger.info("Transform the speed of the audio successfully.")
except ServerBaseException:
raise ServerBaseException( raise ServerBaseException(
ErrorCode.SERVER_INTERNAL_ERR, ErrorCode.SERVER_INTERNAL_ERR,
"Transform speed failed. Can not install soxbindings on your system. \ "Failed to transform speed. Can not install soxbindings on your system. \
You need to set speed value 1.0.") You need to set speed value 1.0.")
except BaseException:
logger.error("Failed to transform speed.")
# wav to base64 # wav to base64
buf = io.BytesIO() buf = io.BytesIO()
wavfile.write(buf, target_fs, wav_speed) wavfile.write(buf, target_fs, wav_speed)
base64_bytes = base64.b64encode(buf.read()) base64_bytes = base64.b64encode(buf.read())
wav_base64 = base64_bytes.decode('utf-8') wav_base64 = base64_bytes.decode('utf-8')
logger.info("Audio to string successfully.")
# save audio # save audio
if audio_path is not None and audio_path.endswith(".wav"): if audio_path is not None:
if audio_path.endswith(".wav"):
sf.write(audio_path, wav_speed, target_fs) sf.write(audio_path, wav_speed, target_fs)
elif audio_path is not None and audio_path.endswith(".pcm"): elif audio_path.endswith(".pcm"):
wav_norm = wav_speed * (32767 / max(0.001, wav_norm = wav_speed * (32767 / max(0.001,
np.max(np.abs(wav_speed)))) np.max(np.abs(wav_speed))))
with open(audio_path, "wb") as f: with open(audio_path, "wb") as f:
f.write(wav_norm.astype(np.int16)) f.write(wav_norm.astype(np.int16))
logger.info("Save audio to {} successfully.".format(audio_path))
else:
logger.info("There is no need to save audio.")
return target_fs, wav_base64 return target_fs, wav_base64
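`postprocess` hands the audio back as a base64 string (`wav_base64`) together with the sample rate actually used, so a caller has to reverse the encoding before it can play or save the result. A hedged client-side sketch of that round trip:

```python
# Hedged sketch: decode the base64 audio string returned by the TTS engine/server.
import base64
import io

import soundfile as sf

def decode_wav_base64(wav_base64: str, out_path: str = "out.wav"):
    wav_bytes = base64.b64decode(wav_base64)                 # undo base64
    samples, sample_rate = sf.read(io.BytesIO(wav_bytes))    # parse the in-memory wav
    sf.write(out_path, samples, sample_rate)
    return samples, sample_rate
```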
@ -461,13 +492,20 @@ class TTSEngine(BaseEngine):
lang = self.config.lang lang = self.config.lang
try: try:
infer_st = time.time()
self.executor.infer( self.executor.infer(
text=sentence, lang=lang, am=self.config.am, spk_id=spk_id) text=sentence, lang=lang, am=self.config.am, spk_id=spk_id)
except: infer_et = time.time()
infer_time = infer_et - infer_st
except ServerBaseException:
raise ServerBaseException(ErrorCode.SERVER_INTERNAL_ERR, raise ServerBaseException(ErrorCode.SERVER_INTERNAL_ERR,
"tts infer failed.") "tts infer failed.")
except BaseException:
logger.error("tts infer failed.")
try: try:
postprocess_st = time.time()
target_sample_rate, wav_base64 = self.postprocess( target_sample_rate, wav_base64 = self.postprocess(
wav=self.executor._outputs['wav'].numpy(), wav=self.executor._outputs['wav'].numpy(),
original_fs=self.executor.am_sample_rate, original_fs=self.executor.am_sample_rate,
@ -475,8 +513,34 @@ class TTSEngine(BaseEngine):
volume=volume, volume=volume,
speed=speed, speed=speed,
audio_path=save_path) audio_path=save_path)
except: postprocess_et = time.time()
postprocess_time = postprocess_et - postprocess_st
duration = len(self.executor._outputs['wav']
.numpy()) / self.executor.am_sample_rate
rtf = infer_time / duration
except ServerBaseException:
raise ServerBaseException(ErrorCode.SERVER_INTERNAL_ERR, raise ServerBaseException(ErrorCode.SERVER_INTERNAL_ERR,
"tts postprocess failed.") "tts postprocess failed.")
except BaseException:
logger.error("tts postprocess failed.")
logger.info("AM model: {}".format(self.config.am))
logger.info("Vocoder model: {}".format(self.config.voc))
logger.info("Language: {}".format(lang))
logger.info("tts engine type: paddle inference")
logger.info("audio duration: {}".format(duration))
logger.info(
"frontend inference time: {}".format(self.executor.frontend_time))
logger.info("AM inference time: {}".format(self.executor.am_time))
logger.info("Vocoder inference time: {}".format(self.executor.voc_time))
logger.info("total inference time: {}".format(infer_time))
logger.info(
"postprocess (change speed, volume, target sample rate) time: {}".
format(postprocess_time))
logger.info("total generate audio time: {}".format(infer_time +
postprocess_time))
logger.info("RTF: {}".format(rtf))
return lang, target_sample_rate, wav_base64 return lang, target_sample_rate, wav_base64

@ -13,6 +13,7 @@
# limitations under the License. # limitations under the License.
import base64 import base64
import io import io
import time
import librosa import librosa
import numpy as np import numpy as np
@ -54,8 +55,20 @@ class TTSEngine(BaseEngine):
try: try:
self.config = get_config(config_file) self.config = get_config(config_file)
paddle.set_device(self.config.device) if self.config.device:
self.device = self.config.device
else:
self.device = paddle.get_device()
paddle.set_device(self.device)
except BaseException:
logger.error(
"Set device failed, please check if device is already used and the parameter 'device' in the yaml file"
)
logger.error("Initialize TTS server engine Failed on device: %s." %
(self.device))
return False
try:
self.executor._init_from_path( self.executor._init_from_path(
am=self.config.am, am=self.config.am,
am_config=self.config.am_config, am_config=self.config.am_config,
@ -69,17 +82,20 @@ class TTSEngine(BaseEngine):
voc_ckpt=self.config.voc_ckpt, voc_ckpt=self.config.voc_ckpt,
voc_stat=self.config.voc_stat, voc_stat=self.config.voc_stat,
lang=self.config.lang) lang=self.config.lang)
except: except BaseException:
logger.info("Initialize TTS server engine Failed.") logger.error("Failed to get model related files.")
logger.error("Initialize TTS server engine Failed on device: %s." %
(self.device))
return False return False
logger.info("Initialize TTS server engine successfully.") logger.info("Initialize TTS server engine successfully on device: %s." %
(self.device))
return True return True
def postprocess(self, def postprocess(self,
wav, wav,
original_fs: int, original_fs: int,
target_fs: int=16000, target_fs: int=0,
volume: float=1.0, volume: float=1.0,
speed: float=1.0, speed: float=1.0,
audio_path: str=None): audio_path: str=None):
@ -104,35 +120,50 @@ class TTSEngine(BaseEngine):
if target_fs == 0 or target_fs > original_fs: if target_fs == 0 or target_fs > original_fs:
target_fs = original_fs target_fs = original_fs
wav_tar_fs = wav wav_tar_fs = wav
logger.info(
"The sample rate of synthesized audio is the same as model, which is {}Hz".
format(original_fs))
else: else:
wav_tar_fs = librosa.resample( wav_tar_fs = librosa.resample(
np.squeeze(wav), original_fs, target_fs) np.squeeze(wav), original_fs, target_fs)
logger.info(
"The sample rate of model is {}Hz and the target sample rate is {}Hz. Converting the sample rate of the synthesized audio successfully.".
format(original_fs, target_fs))
# transform volume # transform volume
wav_vol = wav_tar_fs * volume wav_vol = wav_tar_fs * volume
logger.info("Transform the volume of the audio successfully.")
# transform speed # transform speed
try: # windows not support soxbindings try: # windows not support soxbindings
wav_speed = change_speed(wav_vol, speed, target_fs) wav_speed = change_speed(wav_vol, speed, target_fs)
except: logger.info("Transform the speed of the audio successfully.")
except ServerBaseException:
raise ServerBaseException( raise ServerBaseException(
ErrorCode.SERVER_INTERNAL_ERR, ErrorCode.SERVER_INTERNAL_ERR,
"Can not install soxbindings on your system.") "Failed to transform speed. Can not install soxbindings on your system. \
You need to set speed value 1.0.")
except BaseException:
logger.error("Failed to transform speed.")
# wav to base64 # wav to base64
buf = io.BytesIO() buf = io.BytesIO()
wavfile.write(buf, target_fs, wav_speed) wavfile.write(buf, target_fs, wav_speed)
base64_bytes = base64.b64encode(buf.read()) base64_bytes = base64.b64encode(buf.read())
wav_base64 = base64_bytes.decode('utf-8') wav_base64 = base64_bytes.decode('utf-8')
logger.info("Audio to string successfully.")
# save audio # save audio
if audio_path is not None and audio_path.endswith(".wav"): if audio_path is not None:
if audio_path.endswith(".wav"):
sf.write(audio_path, wav_speed, target_fs) sf.write(audio_path, wav_speed, target_fs)
elif audio_path is not None and audio_path.endswith(".pcm"): elif audio_path.endswith(".pcm"):
wav_norm = wav_speed * (32767 / max(0.001, wav_norm = wav_speed * (32767 / max(0.001,
np.max(np.abs(wav_speed)))) np.max(np.abs(wav_speed))))
with open(audio_path, "wb") as f: with open(audio_path, "wb") as f:
f.write(wav_norm.astype(np.int16)) f.write(wav_norm.astype(np.int16))
logger.info("Save audio to {} successfully.".format(audio_path))
else:
logger.info("There is no need to save audio.")
return target_fs, wav_base64 return target_fs, wav_base64
@ -168,13 +199,23 @@ class TTSEngine(BaseEngine):
lang = self.config.lang lang = self.config.lang
try: try:
infer_st = time.time()
self.executor.infer( self.executor.infer(
text=sentence, lang=lang, am=self.config.am, spk_id=spk_id) text=sentence, lang=lang, am=self.config.am, spk_id=spk_id)
except: infer_et = time.time()
infer_time = infer_et - infer_st
duration = len(self.executor._outputs['wav']
.numpy()) / self.executor.am_config.fs
rtf = infer_time / duration
except ServerBaseException:
raise ServerBaseException(ErrorCode.SERVER_INTERNAL_ERR, raise ServerBaseException(ErrorCode.SERVER_INTERNAL_ERR,
"tts infer failed.") "tts infer failed.")
except BaseException:
logger.error("tts infer failed.")
try: try:
postprocess_st = time.time()
target_sample_rate, wav_base64 = self.postprocess( target_sample_rate, wav_base64 = self.postprocess(
wav=self.executor._outputs['wav'].numpy(), wav=self.executor._outputs['wav'].numpy(),
original_fs=self.executor.am_config.fs, original_fs=self.executor.am_config.fs,
@ -182,8 +223,32 @@ class TTSEngine(BaseEngine):
volume=volume, volume=volume,
speed=speed, speed=speed,
audio_path=save_path) audio_path=save_path)
except: postprocess_et = time.time()
postprocess_time = postprocess_et - postprocess_st
except ServerBaseException:
raise ServerBaseException(ErrorCode.SERVER_INTERNAL_ERR, raise ServerBaseException(ErrorCode.SERVER_INTERNAL_ERR,
"tts postprocess failed.") "tts postprocess failed.")
except BaseException:
logger.error("tts postprocess failed.")
logger.info("AM model: {}".format(self.config.am))
logger.info("Vocoder model: {}".format(self.config.voc))
logger.info("Language: {}".format(lang))
logger.info("tts engine type: python")
logger.info("audio duration: {}".format(duration))
logger.info(
"frontend inference time: {}".format(self.executor.frontend_time))
logger.info("AM inference time: {}".format(self.executor.am_time))
logger.info("Vocoder inference time: {}".format(self.executor.voc_time))
logger.info("total inference time: {}".format(infer_time))
logger.info(
"postprocess (change speed, volume, target sample rate) time: {}".
format(postprocess_time))
logger.info("total generate audio time: {}".format(infer_time +
postprocess_time))
logger.info("RTF: {}".format(rtf))
logger.info("device: {}".format(self.device))
return lang, target_sample_rate, wav_base64 return lang, target_sample_rate, wav_base64

@@ -14,6 +14,7 @@
import base64
import traceback
from typing import Union
from fastapi import APIRouter
+from paddlespeech.server.engine.engine_pool import get_engine_pool
@@ -83,7 +84,7 @@ def asr(request_body: ASRRequest):
    except ServerBaseException as e:
        response = failed_response(e.error_code, e.msg)
-    except:
+    except BaseException:
        response = failed_response(ErrorCode.SERVER_UNKOWN_ERR)
        traceback.print_exc()

@@ -11,7 +11,6 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
-from typing import List
from typing import Optional
from pydantic import BaseModel

@@ -11,9 +11,6 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
-from typing import List
-from typing import Optional
from pydantic import BaseModel
__all__ = ['ASRResponse', 'TTSResponse']

@ -16,7 +16,8 @@ from typing import Union
from fastapi import APIRouter from fastapi import APIRouter
from paddlespeech.server.engine.tts.paddleinference.tts_engine import TTSEngine from paddlespeech.cli.log import logger
from paddlespeech.server.engine.engine_pool import get_engine_pool
from paddlespeech.server.restful.request import TTSRequest from paddlespeech.server.restful.request import TTSRequest
from paddlespeech.server.restful.response import ErrorResponse from paddlespeech.server.restful.response import ErrorResponse
from paddlespeech.server.restful.response import TTSResponse from paddlespeech.server.restful.response import TTSResponse
@ -60,28 +61,45 @@ def tts(request_body: TTSRequest):
Returns: Returns:
json: [description] json: [description]
""" """
# json to dict
item_dict = request_body.dict()
sentence = item_dict['text']
spk_id = item_dict['spk_id']
speed = item_dict['speed']
volume = item_dict['volume']
sample_rate = item_dict['sample_rate']
save_path = item_dict['save_path']
# Check parameters logger.info("request: {}".format(request_body))
if speed <=0 or speed > 3 or volume <=0 or volume > 3 or \
sample_rate not in [0, 16000, 8000] or \ # get params
(save_path is not None and not save_path.endswith("pcm") and not save_path.endswith("wav")): text = request_body.text
return failed_response(ErrorCode.SERVER_PARAM_ERR) spk_id = request_body.spk_id
speed = request_body.speed
volume = request_body.volume
sample_rate = request_body.sample_rate
save_path = request_body.save_path
# single # Check parameters
tts_engine = TTSEngine() if speed <= 0 or speed > 3:
return failed_response(
ErrorCode.SERVER_PARAM_ERR,
"invalid speed value, the value should be between 0 and 3.")
if volume <= 0 or volume > 3:
return failed_response(
ErrorCode.SERVER_PARAM_ERR,
"invalid volume value, the value should be between 0 and 3.")
if sample_rate not in [0, 16000, 8000]:
return failed_response(
ErrorCode.SERVER_PARAM_ERR,
"invalid sample_rate value, the choice of value is 0, 8000, 16000.")
if save_path is not None and not save_path.endswith(
"pcm") and not save_path.endswith("wav"):
return failed_response(
ErrorCode.SERVER_PARAM_ERR,
"invalid save_path, saved audio formats support pcm and wav")
# run # run
try: try:
# get single engine from engine pool
engine_pool = get_engine_pool()
tts_engine = engine_pool['tts']
logger.info("Get tts engine successfully.")
lang, target_sample_rate, wav_base64 = tts_engine.run( lang, target_sample_rate, wav_base64 = tts_engine.run(
sentence, spk_id, speed, volume, sample_rate, save_path) text, spk_id, speed, volume, sample_rate, save_path)
response = { response = {
"success": True, "success": True,
@ -101,7 +119,7 @@ def tts(request_body: TTSRequest):
} }
except ServerBaseException as e: except ServerBaseException as e:
response = failed_response(e.error_code, e.msg) response = failed_response(e.error_code, e.msg)
except: except BaseException:
response = failed_response(ErrorCode.SERVER_UNKOWN_ERR) response = failed_response(ErrorCode.SERVER_UNKOWN_ERR)
traceback.print_exc() traceback.print_exc()
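
The hunk above replaces the old dict-unpacking with explicit per-field validation before the engine pool is touched. A minimal, self-contained sketch of that validation logic (the standalone `validate_tts_params` helper is hypothetical; the error messages mirror the ones added in the diff):

```python
# Hypothetical standalone version of the parameter checks in the /tts handler above.
# Returns an error message string, or None when all parameters are acceptable.
from typing import Optional


def validate_tts_params(speed: float,
                        volume: float,
                        sample_rate: int,
                        save_path: Optional[str] = None) -> Optional[str]:
    if not 0 < speed <= 3:
        return "invalid speed value, the value should be between 0 and 3."
    if not 0 < volume <= 3:
        return "invalid volume value, the value should be between 0 and 3."
    if sample_rate not in (0, 8000, 16000):
        return "invalid sample_rate value, the choice of value is 0, 8000, 16000."
    if save_path is not None and not save_path.endswith(("pcm", "wav")):
        return "invalid save_path, saved audio formats support pcm and wav"
    return None


if __name__ == "__main__":
    print(validate_tts_params(speed=1.0, volume=1.0, sample_rate=0))      # None
    print(validate_tts_params(speed=5.0, volume=1.0, sample_rate=24000))  # speed error
```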

@ -10,11 +10,11 @@
# distributed under the License is distributed on an "AS IS" BASIS, # distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the # See the License for the
import requests import base64
import json import json
import time import time
import base64
import io import requests
def readwav2base64(wav_file): def readwav2base64(wav_file):
@ -34,7 +34,7 @@ def main():
url = "http://127.0.0.1:8090/paddlespeech/asr" url = "http://127.0.0.1:8090/paddlespeech/asr"
# start Timestamp # start Timestamp
time_start=time.time() time_start = time.time()
test_audio_dir = "./16_audio.wav" test_audio_dir = "./16_audio.wav"
audio = readwav2base64(test_audio_dir) audio = readwav2base64(test_audio_dir)
@ -49,8 +49,8 @@ def main():
r = requests.post(url=url, data=json.dumps(data)) r = requests.post(url=url, data=json.dumps(data))
# ending Timestamp # ending Timestamp
time_end=time.time() time_end = time.time()
print('time cost',time_end - time_start, 's') print('time cost', time_end - time_start, 's')
print(r.json()) print(r.json())
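
A hedged sketch of the client flow shown in this hunk: read a wav file, base64-encode it, and POST it as JSON. The request field names (`audio`, `audio_format`, `sample_rate`, `lang`) are assumptions for illustration and may differ from the server's actual schema.

```python
# Minimal ASR REST client sketch; field names in `data` are illustrative assumptions.
import base64
import json
import time

import requests


def readwav2base64(wav_file: str) -> str:
    with open(wav_file, "rb") as f:
        return base64.b64encode(f.read()).decode("utf8")


def request_asr(url: str, wav_file: str) -> dict:
    data = {
        "audio": readwav2base64(wav_file),
        "audio_format": "wav",
        "sample_rate": 16000,
        "lang": "zh_cn",
    }
    time_start = time.time()
    r = requests.post(url=url, data=json.dumps(data))
    print("time cost", time.time() - time_start, "s")
    return r.json()


if __name__ == "__main__":
    print(request_asr("http://127.0.0.1:8090/paddlespeech/asr", "./16_audio.wav"))
```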

@ -25,6 +25,7 @@ import soundfile
from paddlespeech.server.utils.audio_process import wav2pcm from paddlespeech.server.utils.audio_process import wav2pcm
# Request and response # Request and response
def tts_client(args): def tts_client(args):
""" Request and response """ Request and response
@ -99,5 +100,5 @@ if __name__ == "__main__":
print("Inference time: %f" % (time_consume)) print("Inference time: %f" % (time_consume))
print("The duration of synthesized audio: %f" % (duration)) print("The duration of synthesized audio: %f" % (duration))
print("The RTF is: %f" % (rtf)) print("The RTF is: %f" % (rtf))
except: except BaseException:
print("Failed to synthesized audio.") print("Failed to synthesized audio.")

@ -219,7 +219,7 @@ class ConfigCache:
try: try:
cfg = yaml.load(file, Loader=yaml.FullLoader) cfg = yaml.load(file, Loader=yaml.FullLoader)
self._data.update(cfg) self._data.update(cfg)
except: except BaseException:
self.flush() self.flush()
@property @property
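
The intent of this change is simply to avoid a bare `except` while still falling back gracefully when the cached YAML is unreadable; a tiny sketch (the `load_config` helper and file path are hypothetical):

```python
# Hypothetical helper mirroring ConfigCache's defensive YAML read above:
# return an empty dict instead of raising when the cache file is missing or corrupt.
import yaml


def load_config(path: str) -> dict:
    try:
        with open(path) as f:
            return yaml.load(f, Loader=yaml.FullLoader) or {}
    except BaseException:
        return {}


print(load_config("nonexistent.yaml"))  # {}
```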

@ -15,6 +15,7 @@ import os
from typing import List from typing import List
from typing import Optional from typing import Optional
import paddle
from paddle.inference import Config from paddle.inference import Config
from paddle.inference import create_predictor from paddle.inference import create_predictor
@ -40,14 +41,30 @@ def init_predictor(model_dir: Optional[os.PathLike]=None,
else: else:
config = Config(model_file, params_file) config = Config(model_file, params_file)
config.enable_memory_optim() # set device
if predictor_conf["use_gpu"]: if predictor_conf["device"]:
config.enable_use_gpu(1000, 0) device = predictor_conf["device"]
if predictor_conf["enable_mkldnn"]: else:
config.enable_mkldnn() device = paddle.get_device()
if "gpu" in device:
gpu_id = device.split(":")[-1]
config.enable_use_gpu(1000, int(gpu_id))
# IR optim
if predictor_conf["switch_ir_optim"]: if predictor_conf["switch_ir_optim"]:
config.switch_ir_optim() config.switch_ir_optim()
# glog
if not predictor_conf["glog_info"]:
config.disable_glog_info()
# config summary
if predictor_conf["summary"]:
print(config.summary())
# memory optim
config.enable_memory_optim()
predictor = create_predictor(config) predictor = create_predictor(config)
return predictor return predictor
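
A sketch of the device-selection flow added to `init_predictor`, isolated from the rest of the setup; `predictor_conf` is assumed to mirror the `am_predictor_conf`/`voc_predictor_conf` blocks in the `*_pd.yaml` configs further down, and the example needs a working paddle installation.

```python
# Sketch of the device/optimisation switches added to init_predictor above.
import paddle
from paddle.inference import Config, create_predictor


def build_predictor(model_file: str, params_file: str, predictor_conf: dict):
    config = Config(model_file, params_file)

    # set device: prefer the configured one, otherwise ask paddle
    device = predictor_conf.get("device") or paddle.get_device()
    if "gpu" in device:
        gpu_id = int(device.split(":")[-1])
        config.enable_use_gpu(1000, gpu_id)  # 1000 MB initial GPU memory pool

    # IR optimisation and logging switches
    if predictor_conf.get("switch_ir_optim"):
        config.switch_ir_optim()
    if not predictor_conf.get("glog_info"):
        config.disable_glog_info()

    config.enable_memory_optim()
    return create_predictor(config)
```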

@ -20,6 +20,7 @@ import numpy as np
import paddle import paddle
import soundfile as sf import soundfile as sf
import yaml import yaml
from timer import timer
from yacs.config import CfgNode from yacs.config import CfgNode
from paddlespeech.s2t.utils.dynamic_import import dynamic_import from paddlespeech.s2t.utils.dynamic_import import dynamic_import
@ -50,6 +51,18 @@ model_alias = {
"paddlespeech.t2s.models.melgan:MelGANGenerator", "paddlespeech.t2s.models.melgan:MelGANGenerator",
"mb_melgan_inference": "mb_melgan_inference":
"paddlespeech.t2s.models.melgan:MelGANInference", "paddlespeech.t2s.models.melgan:MelGANInference",
"style_melgan":
"paddlespeech.t2s.models.melgan:StyleMelGANGenerator",
"style_melgan_inference":
"paddlespeech.t2s.models.melgan:StyleMelGANInference",
"hifigan":
"paddlespeech.t2s.models.hifigan:HiFiGANGenerator",
"hifigan_inference":
"paddlespeech.t2s.models.hifigan:HiFiGANInference",
"wavernn":
"paddlespeech.t2s.models.wavernn:WaveRNN",
"wavernn_inference":
"paddlespeech.t2s.models.wavernn:WaveRNNInference",
} }
@ -146,10 +159,15 @@ def evaluate(args):
voc_name = args.voc[:args.voc.rindex('_')] voc_name = args.voc[:args.voc.rindex('_')]
voc_class = dynamic_import(voc_name, model_alias) voc_class = dynamic_import(voc_name, model_alias)
voc_inference_class = dynamic_import(voc_name + '_inference', model_alias) voc_inference_class = dynamic_import(voc_name + '_inference', model_alias)
if voc_name != 'wavernn':
voc = voc_class(**voc_config["generator_params"]) voc = voc_class(**voc_config["generator_params"])
voc.set_state_dict(paddle.load(args.voc_ckpt)["generator_params"]) voc.set_state_dict(paddle.load(args.voc_ckpt)["generator_params"])
voc.remove_weight_norm() voc.remove_weight_norm()
voc.eval() voc.eval()
else:
voc = voc_class(**voc_config["model"])
voc.set_state_dict(paddle.load(args.voc_ckpt)["main_params"])
voc.eval()
voc_mu, voc_std = np.load(args.voc_stat) voc_mu, voc_std = np.load(args.voc_stat)
voc_mu = paddle.to_tensor(voc_mu) voc_mu = paddle.to_tensor(voc_mu)
voc_std = paddle.to_tensor(voc_std) voc_std = paddle.to_tensor(voc_std)
@ -162,8 +180,12 @@ def evaluate(args):
output_dir = Path(args.output_dir) output_dir = Path(args.output_dir)
output_dir.mkdir(parents=True, exist_ok=True) output_dir.mkdir(parents=True, exist_ok=True)
N = 0
T = 0
for datum in test_dataset: for datum in test_dataset:
utt_id = datum["utt_id"] utt_id = datum["utt_id"]
with timer() as t:
with paddle.no_grad(): with paddle.no_grad():
# acoustic model # acoustic model
if am_name == 'fastspeech2': if am_name == 'fastspeech2':
@ -175,7 +197,8 @@ def evaluate(args):
spk_emb = paddle.to_tensor(np.load(datum["spk_emb"])) spk_emb = paddle.to_tensor(np.load(datum["spk_emb"]))
elif "spk_id" in datum: elif "spk_id" in datum:
spk_id = paddle.to_tensor(datum["spk_id"]) spk_id = paddle.to_tensor(datum["spk_id"])
mel = am_inference(phone_ids, spk_id=spk_id, spk_emb=spk_emb) mel = am_inference(
phone_ids, spk_id=spk_id, spk_emb=spk_emb)
elif am_name == 'speedyspeech': elif am_name == 'speedyspeech':
phone_ids = paddle.to_tensor(datum["phones"]) phone_ids = paddle.to_tensor(datum["phones"])
tone_ids = paddle.to_tensor(datum["tones"]) tone_ids = paddle.to_tensor(datum["tones"])
@ -189,11 +212,19 @@ def evaluate(args):
mel = am_inference(phone_ids, spk_emb=spk_emb) mel = am_inference(phone_ids, spk_emb=spk_emb)
# vocoder # vocoder
wav = voc_inference(mel) wav = voc_inference(mel)
wav = wav.numpy()
N += wav.size
T += t.elapse
speed = wav.size / t.elapse
rtf = am_config.fs / speed
print(
f"{utt_id}, mel: {mel.shape}, wave: {wav.size}, time: {t.elapse}s, Hz: {speed}, RTF: {rtf}."
)
sf.write( sf.write(
str(output_dir / (utt_id + ".wav")), str(output_dir / (utt_id + ".wav")), wav, samplerate=am_config.fs)
wav.numpy(),
samplerate=am_config.fs)
print(f"{utt_id} done!") print(f"{utt_id} done!")
print(f"generation speed: {N / T}Hz, RTF: {am_config.fs / (N / T) }")
def main(): def main():
@ -246,7 +277,8 @@ def main():
default='pwgan_csmsc', default='pwgan_csmsc',
choices=[ choices=[
'pwgan_csmsc', 'pwgan_ljspeech', 'pwgan_aishell3', 'pwgan_vctk', 'pwgan_csmsc', 'pwgan_ljspeech', 'pwgan_aishell3', 'pwgan_vctk',
'mb_melgan_csmsc' 'mb_melgan_csmsc', 'wavernn_csmsc', 'hifigan_csmsc',
'style_melgan_csmsc'
], ],
help='Choose vocoder type of tts task.') help='Choose vocoder type of tts task.')
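
The timing code added above accumulates generated samples `N` and elapsed seconds `T`, then reports generation speed in Hz and the real-time factor. The relationship, with made-up numbers for illustration:

```python
# RTF bookkeeping from synthesize.py above, in isolation: speed is generated
# samples per second of wall-clock time; RTF compares synthesis time to audio duration.
def rtf_stats(num_samples: int, elapsed_s: float, sample_rate: int):
    speed = num_samples / elapsed_s   # samples generated per second (Hz)
    rtf = sample_rate / speed         # equals elapsed_s / audio_duration
    return speed, rtf


speed, rtf = rtf_stats(num_samples=48000, elapsed_s=0.5, sample_rate=24000)
print(f"Hz: {speed}, RTF: {rtf}")  # Hz: 96000.0, RTF: 0.25
```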

@ -21,6 +21,7 @@ import soundfile as sf
import yaml import yaml
from paddle import jit from paddle import jit
from paddle.static import InputSpec from paddle.static import InputSpec
from timer import timer
from yacs.config import CfgNode from yacs.config import CfgNode
from paddlespeech.s2t.utils.dynamic_import import dynamic_import from paddlespeech.s2t.utils.dynamic_import import dynamic_import
@ -196,8 +197,8 @@ def evaluate(args):
input_spec=[ input_spec=[
InputSpec([-1], dtype=paddle.int64), # text InputSpec([-1], dtype=paddle.int64), # text
InputSpec([-1], dtype=paddle.int64), # tone InputSpec([-1], dtype=paddle.int64), # tone
None, # duration InputSpec([1], dtype=paddle.int64), # spk_id
InputSpec([-1], dtype=paddle.int64) # spk_id None # duration
]) ])
else: else:
am_inference = jit.to_static( am_inference = jit.to_static(
@ -233,8 +234,10 @@ def evaluate(args):
# but still not stopping in the end (NOTE by yuantian01 Feb 9 2022) # but still not stopping in the end (NOTE by yuantian01 Feb 9 2022)
if am_name == 'tacotron2': if am_name == 'tacotron2':
merge_sentences = True merge_sentences = True
N = 0
T = 0
for utt_id, sentence in sentences: for utt_id, sentence in sentences:
with timer() as t:
get_tone_ids = False get_tone_ids = False
if am_name == 'speedyspeech': if am_name == 'speedyspeech':
get_tone_ids = True get_tone_ids = True
@ -281,11 +284,18 @@ def evaluate(args):
flags = 1 flags = 1
else: else:
wav_all = paddle.concat([wav_all, wav]) wav_all = paddle.concat([wav_all, wav])
wav = wav_all.numpy()
N += wav.size
T += t.elapse
speed = wav.size / t.elapse
rtf = am_config.fs / speed
print(
f"{utt_id}, mel: {mel.shape}, wave: {wav.shape}, time: {t.elapse}s, Hz: {speed}, RTF: {rtf}."
)
sf.write( sf.write(
str(output_dir / (utt_id + ".wav")), str(output_dir / (utt_id + ".wav")), wav, samplerate=am_config.fs)
wav_all.numpy(),
samplerate=am_config.fs)
print(f"{utt_id} done!") print(f"{utt_id} done!")
print(f"generation speed: {N / T}Hz, RTF: {am_config.fs / (N / T) }")
def main(): def main():

@ -91,7 +91,7 @@ def main():
target=config.inference.target, target=config.inference.target,
overlap=config.inference.overlap, overlap=config.inference.overlap,
mu_law=config.mu_law, mu_law=config.mu_law,
gen_display=True) gen_display=False)
wav = wav.numpy() wav = wav.numpy()
N += wav.size N += wav.size
T += t.elapse T += t.elapse

@ -63,7 +63,7 @@ class ToneSandhi():
'扫把', '惦记' '扫把', '惦记'
} }
self.must_not_neural_tone_words = { self.must_not_neural_tone_words = {
"男子", "女子", "分子", "原子", "量子", "莲子", "石子", "瓜子", "电子" "男子", "女子", "分子", "原子", "量子", "莲子", "石子", "瓜子", "电子", "人人", "虎虎"
} }
self.punc = ":,;。?!“”‘’':,;.?!" self.punc = ":,;。?!“”‘’':,;.?!"
@ -77,7 +77,9 @@ class ToneSandhi():
# reduplication words for n. and v. e.g. 奶奶, 试试, 旺旺 # reduplication words for n. and v. e.g. 奶奶, 试试, 旺旺
for j, item in enumerate(word): for j, item in enumerate(word):
if j - 1 >= 0 and item == word[j - 1] and pos[0] in {"n", "v", "a"}: if j - 1 >= 0 and item == word[j - 1] and pos[0] in {
"n", "v", "a"
} and word not in self.must_not_neural_tone_words:
finals[j] = finals[j][:-1] + "5" finals[j] = finals[j][:-1] + "5"
ge_idx = word.find("") ge_idx = word.find("")
if len(word) >= 1 and word[-1] in "吧呢哈啊呐噻嘛吖嗨呐哦哒额滴哩哟喽啰耶喔诶": if len(word) >= 1 and word[-1] in "吧呢哈啊呐噻嘛吖嗨呐哦哒额滴哩哟喽啰耶喔诶":
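
A self-contained sketch of the reduplication rule patched above: the repeated character in an n/v/a word takes the neutral tone (its final gets a trailing "5") unless the whole word is on the new blacklist. The helper name and simplified inputs are for illustration only.

```python
# Reduced sketch of the reduplication rule above, with the extended blacklist.
must_not_neural_tone_words = {
    "男子", "女子", "分子", "原子", "量子", "莲子", "石子", "瓜子", "电子", "人人", "虎虎"
}


def apply_reduplication_rule(word: str, pos: str, finals: list) -> list:
    for j, ch in enumerate(word):
        if (j - 1 >= 0 and ch == word[j - 1] and pos[0] in {"n", "v", "a"} and
                word not in must_not_neural_tone_words):
            finals[j] = finals[j][:-1] + "5"
    return finals


print(apply_reduplication_rule("奶奶", "n", ["ai3", "ai3"]))  # ['ai3', 'ai5']
print(apply_reduplication_rule("人人", "n", ["en2", "en2"]))  # unchanged: ['en2', 'en2']
```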

@ -20,7 +20,10 @@ import numpy as np
import paddle import paddle
from g2pM import G2pM from g2pM import G2pM
from pypinyin import lazy_pinyin from pypinyin import lazy_pinyin
from pypinyin import load_phrases_dict
from pypinyin import load_single_dict
from pypinyin import Style from pypinyin import Style
from pypinyin_dict.phrase_pinyin_data import large_pinyin
from paddlespeech.t2s.frontend.generate_lexicon import generate_lexicon from paddlespeech.t2s.frontend.generate_lexicon import generate_lexicon
from paddlespeech.t2s.frontend.tone_sandhi import ToneSandhi from paddlespeech.t2s.frontend.tone_sandhi import ToneSandhi
@ -41,6 +44,8 @@ class Frontend():
self.g2pM_model = G2pM() self.g2pM_model = G2pM()
self.pinyin2phone = generate_lexicon( self.pinyin2phone = generate_lexicon(
with_tone=True, with_erhua=False) with_tone=True, with_erhua=False)
else:
self.__init__pypinyin()
self.must_erhua = {"小院儿", "胡同儿", "范儿", "老汉儿", "撒欢儿", "寻老礼儿", "妥妥儿"} self.must_erhua = {"小院儿", "胡同儿", "范儿", "老汉儿", "撒欢儿", "寻老礼儿", "妥妥儿"}
self.not_erhua = { self.not_erhua = {
"虐儿", "为儿", "护儿", "瞒儿", "救儿", "替儿", "有儿", "一儿", "我儿", "俺儿", "妻儿", "虐儿", "为儿", "护儿", "瞒儿", "救儿", "替儿", "有儿", "一儿", "我儿", "俺儿", "妻儿",
@ -62,6 +67,23 @@ class Frontend():
for tone, id in tone_id: for tone, id in tone_id:
self.vocab_tones[tone] = int(id) self.vocab_tones[tone] = int(id)
def __init__pypinyin(self):
large_pinyin.load()
load_phrases_dict({u'开户行': [[u'ka1i'], [u'hu4'], [u'hang2']]})
load_phrases_dict({u'发卡行': [[u'fa4'], [u'ka3'], [u'hang2']]})
load_phrases_dict({u'放款行': [[u'fa4ng'], [u'kua3n'], [u'hang2']]})
load_phrases_dict({u'茧行': [[u'jia3n'], [u'hang2']]})
load_phrases_dict({u'行号': [[u'hang2'], [u'ha4o']]})
load_phrases_dict({u'各地': [[u'ge4'], [u'di4']]})
load_phrases_dict({u'借还款': [[u'jie4'], [u'hua2n'], [u'kua3n']]})
load_phrases_dict({u'时间为': [[u'shi2'], [u'jia1n'], [u'we2i']]})
load_phrases_dict({u'为准': [[u'we2i'], [u'zhu3n']]})
load_phrases_dict({u'色差': [[u'se4'], [u'cha1']]})
# adjust the pinyin reading order of the character 地
load_single_dict({ord(u''): u'de,di4'})
def _get_initials_finals(self, word: str) -> List[List[str]]: def _get_initials_finals(self, word: str) -> List[List[str]]:
initials = [] initials = []
finals = [] finals = []
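
To see what the new `__init__pypinyin` customisation buys, a short illustration of overriding a phrase pronunciation with `load_phrases_dict`; the printed readings are indicative and depend on the installed pypinyin / pypinyin_dict versions.

```python
# load_phrases_dict overrides multi-character pronunciations such as the
# polyphonic 行 in 开户行 (entry copied from the frontend code above).
from pypinyin import Style, lazy_pinyin, load_phrases_dict

print(lazy_pinyin("开户行", style=Style.TONE3))   # 行 may be read as xing2
load_phrases_dict({u'开户行': [[u'ka1i'], [u'hu4'], [u'hang2']]})
print(lazy_pinyin("开户行", style=Style.TONE3))   # 行 now follows the custom entry (hang2)
```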

@ -63,6 +63,9 @@ def replace_time(match) -> str:
result = f"{num2str(hour)}" result = f"{num2str(hour)}"
if minute.lstrip('0'): if minute.lstrip('0'):
if int(minute) == 30:
result += f""
else:
result += f"{_time_num2str(minute)}" result += f"{_time_num2str(minute)}"
if second and second.lstrip('0'): if second and second.lstrip('0'):
result += f"{_time_num2str(second)}" result += f"{_time_num2str(second)}"
@ -71,6 +74,9 @@ def replace_time(match) -> str:
result += "" result += ""
result += f"{num2str(hour_2)}" result += f"{num2str(hour_2)}"
if minute_2.lstrip('0'): if minute_2.lstrip('0'):
if int(minute) == 30:
result += f""
else:
result += f"{_time_num2str(minute_2)}" result += f"{_time_num2str(minute_2)}"
if second_2 and second_2.lstrip('0'): if second_2 and second_2.lstrip('0'):
result += f"{_time_num2str(second_2)}" result += f"{_time_num2str(second_2)}"
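
The two hunks above add a special case so that a minute value of 30 is read as 半 ("half past") instead of spelling out the digits. A self-contained sketch with a hypothetical digit map standing in for the real `num2str`/`_time_num2str` helpers:

```python
# Sketch of the new ":30 → 半" branch; other minute readings are simplified here.
DIGITS = {"0": "零", "1": "一", "2": "二", "3": "三", "4": "四",
          "5": "五", "6": "六", "7": "七", "8": "八", "9": "九"}


def read_hour_minute(hour: str, minute: str) -> str:
    result = DIGITS[hour] + "点"
    if minute.lstrip("0"):
        if int(minute) == 30:
            result += "半"
        else:
            result += "".join(DIGITS[d] for d in minute) + "分"
    return result


print(read_hour_minute("3", "30"))  # 三点半
print(read_hour_minute("3", "15"))  # 三点一五分 (stub reading; the real helper says 十五)
```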

@ -28,7 +28,7 @@ UNITS = OrderedDict({
8: '亿', 8: '亿',
}) })
COM_QUANTIFIERS = '(朵|匹|张|座|回|场|尾|条|个|首|阙|阵|网|炮|顶|丘|棵|只|支|袭|辆|挑|担|颗|壳|窠|曲|墙|群|腔|砣|座|客|贯|扎|捆|刀|令|打|手|罗|坡|山|岭|江|溪|钟|队|单|双|对|出|口|头|脚|板|跳|枝|件|贴|针|线|管|名|位|身|堂|课|本|页|家|户|层|丝|毫|厘|分|钱|两|斤|担|铢|石|钧|锱|忽|(千|毫|微)克|毫|厘|(公)分|分|寸|尺|丈|里|寻|常|铺|程|(千|分|厘|毫|微)米|米|撮|勺|合|升|斗|石|盘|碗|碟|叠|桶|笼|盆|盒|杯|钟|斛|锅|簋|篮|盘|桶|罐|瓶|壶|卮|盏|箩|箱|煲|啖|袋|钵|年|月|日|季|刻|时|周|天|秒|分|旬|纪|岁|世|更|夜|春|夏|秋|冬|代|伏|辈|丸|泡|粒|颗|幢|堆|条|根|支|道|面|片|张|颗|块|元|(亿|千万|百万|万|千|百)|(亿|千万|百万|万|千|百|美|)元|(亿|千万|百万|万|千|百|)块|角|毛|分)' COM_QUANTIFIERS = '(所|朵|匹|张|座|回|场|尾|条|个|首|阙|阵|网|炮|顶|丘|棵|只|支|袭|辆|挑|担|颗|壳|窠|曲|墙|群|腔|砣|座|客|贯|扎|捆|刀|令|打|手|罗|坡|山|岭|江|溪|钟|队|单|双|对|出|口|头|脚|板|跳|枝|件|贴|针|线|管|名|位|身|堂|课|本|页|家|户|层|丝|毫|厘|分|钱|两|斤|担|铢|石|钧|锱|忽|(千|毫|微)克|毫|厘|(公)分|分|寸|尺|丈|里|寻|常|铺|程|(千|分|厘|毫|微)米|米|撮|勺|合|升|斗|石|盘|碗|碟|叠|桶|笼|盆|盒|杯|钟|斛|锅|簋|篮|盘|桶|罐|瓶|壶|卮|盏|箩|箱|煲|啖|袋|钵|年|月|日|季|刻|时|周|天|秒|分|小时|旬|纪|岁|世|更|夜|春|夏|秋|冬|代|伏|辈|丸|泡|粒|颗|幢|堆|条|根|支|道|面|片|张|颗|块|元|(亿|千万|百万|万|千|百)|(亿|千万|百万|万|千|百|美|)元|(亿|千万|百万|万|千|百|)块|角|毛|分)'
# fraction expression # fraction expression
RE_FRAC = re.compile(r'(-?)(\d+)/(\d+)') RE_FRAC = re.compile(r'(-?)(\d+)/(\d+)')
@ -110,7 +110,7 @@ def replace_default_num(match):
# pure decimal # pure decimal
RE_DECIMAL_NUM = re.compile(r'(-?)((\d+)(\.\d+))' r'|(\.(\d+))') RE_DECIMAL_NUM = re.compile(r'(-?)((\d+)(\.\d+))' r'|(\.(\d+))')
# positive integer + quantifier # positive integer + quantifier
RE_POSITIVE_QUANTIFIERS = re.compile(r"(\d+)([多余几])?" + COM_QUANTIFIERS) RE_POSITIVE_QUANTIFIERS = re.compile(r"(\d+)([多余几\+])?" + COM_QUANTIFIERS)
RE_NUMBER = re.compile(r'(-?)((\d+)(\.\d+)?)' r'|(\.(\d+))') RE_NUMBER = re.compile(r'(-?)((\d+)(\.\d+)?)' r'|(\.(\d+))')
@ -123,6 +123,8 @@ def replace_positive_quantifier(match) -> str:
""" """
number = match.group(1) number = match.group(1)
match_2 = match.group(2) match_2 = match.group(2)
if match_2 == "+":
match_2 = ""
match_2: str = match_2 if match_2 else "" match_2: str = match_2 if match_2 else ""
quantifiers: str = match.group(3) quantifiers: str = match.group(3)
number: str = num2str(number) number: str = num2str(number)
@ -151,6 +153,7 @@ def replace_number(match) -> str:
# range expression # range expression
# match.group(1) and match.group(8) are copy from RE_NUMBER # match.group(1) and match.group(8) are copy from RE_NUMBER
RE_RANGE = re.compile( RE_RANGE = re.compile(
r'((-?)((\d+)(\.\d+)?)|(\.(\d+)))[-~]((-?)((\d+)(\.\d+)?)|(\.(\d+)))') r'((-?)((\d+)(\.\d+)?)|(\.(\d+)))[-~]((-?)((\d+)(\.\d+)?)|(\.(\d+)))')
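
Together, the regex change (`[多余几\+]`) and the new branch in `replace_positive_quantifier` read a trailing plus sign as 多. A reduced sketch with a stub quantifier list and digit map so it runs on its own:

```python
# Reduced sketch of the "+" handling above; the quantifier alternatives and the
# digit map are stubs standing in for COM_QUANTIFIERS and num2str.
import re

RE_POSITIVE_QUANTIFIERS = re.compile(r"(\d+)([多余几\+])?(个|所|块)")
DIGITS = {"0": "零", "1": "一", "2": "二", "3": "三", "4": "四",
          "5": "五", "6": "六", "7": "七", "8": "八", "9": "九"}


def replace_positive_quantifier(match) -> str:
    number, suffix, quantifier = match.group(1), match.group(2), match.group(3)
    if suffix == "+":
        suffix = "多"
    suffix = suffix or ""
    number = "".join(DIGITS[d] for d in number)
    return number + suffix + quantifier


print(RE_POSITIVE_QUANTIFIERS.sub(replace_positive_quantifier, "3+个苹果"))  # 三多个苹果
```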

@ -63,11 +63,19 @@ class TextNormalizer():
# Only for pure Chinese here # Only for pure Chinese here
if lang == "zh": if lang == "zh":
text = text.replace(" ", "") text = text.replace(" ", "")
# filter out special characters
text = re.sub(r'[《》【】<=>{}()#&@“”^_|…\\]', '', text)
text = self.SENTENCE_SPLITOR.sub(r'\1\n', text) text = self.SENTENCE_SPLITOR.sub(r'\1\n', text)
text = text.strip() text = text.strip()
sentences = [sentence.strip() for sentence in re.split(r'\n+', text)] sentences = [sentence.strip() for sentence in re.split(r'\n+', text)]
return sentences return sentences
def _post_replace(self, sentence: str) -> str:
sentence = sentence.replace('/', '')
sentence = sentence.replace('~', '')
return sentence
def normalize_sentence(self, sentence: str) -> str: def normalize_sentence(self, sentence: str) -> str:
# basic character conversions # basic character conversions
sentence = tranditional_to_simplified(sentence) sentence = tranditional_to_simplified(sentence)
@ -97,6 +105,7 @@ class TextNormalizer():
sentence) sentence)
sentence = RE_DEFAULT_NUM.sub(replace_default_num, sentence) sentence = RE_DEFAULT_NUM.sub(replace_default_num, sentence)
sentence = RE_NUMBER.sub(replace_number, sentence) sentence = RE_NUMBER.sub(replace_number, sentence)
sentence = self._post_replace(sentence)
return sentence return sentence
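
The normalizer now strips a fixed set of special symbols before sentence splitting and removes '/' and '~' again after number normalization; a compact sketch:

```python
# Compact sketch of the two filtering steps added to TextNormalizer above.
import re


def strip_special_chars(text: str) -> str:
    # filter out special characters before sentence splitting
    return re.sub(r'[《》【】<=>{}()#&@“”^_|…\\]', '', text)


def post_replace(sentence: str) -> str:
    # '/' and '~' are removed again after number normalization
    return sentence.replace('/', '').replace('~', '')


print(post_replace(strip_special_chars("今天天气【很好】~/")))  # 今天天气很好
```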

@ -66,7 +66,7 @@ class MelGANGenerator(nn.Layer):
nonlinear_activation_params (Dict[str, Any], optional): Parameters passed to the linear activation in the upsample network, nonlinear_activation_params (Dict[str, Any], optional): Parameters passed to the linear activation in the upsample network,
by default {} by default {}
pad (str): Padding function module name before dilated convolution layer. pad (str): Padding function module name before dilated convolution layer.
pad_params dict): Hyperparameters for padding function. pad_params (dict): Hyperparameters for padding function.
use_final_nonlinear_activation (nn.Layer): Activation function for the final layer. use_final_nonlinear_activation (nn.Layer): Activation function for the final layer.
use_weight_norm (bool): Whether to use weight norm. use_weight_norm (bool): Whether to use weight norm.
If set to true, it will be applied to all of the conv layers. If set to true, it will be applied to all of the conv layers.

@ -247,7 +247,7 @@ class SpeedySpeechInference(nn.Layer):
self.normalizer = normalizer self.normalizer = normalizer
self.acoustic_model = speedyspeech_model self.acoustic_model = speedyspeech_model
def forward(self, phones, tones, durations=None, spk_id=None): def forward(self, phones, tones, spk_id=None, durations=None):
normalized_mel = self.acoustic_model.inference( normalized_mel = self.acoustic_model.inference(
phones, tones, durations=durations, spk_id=spk_id) phones, tones, durations=durations, spk_id=spk_id)
logmel = self.normalizer.inverse(normalized_mel) logmel = self.normalizer.inverse(normalized_mel)

@ -509,16 +509,20 @@ class WaveRNN(nn.Layer):
total_len = num_folds * (target + overlap) + overlap total_len = num_folds * (target + overlap) + overlap
# Need some silence for the run warmup # Need some silence for the run warmup
slience_len = overlap // 2 slience_len = 0
linear_len = slience_len
fade_len = overlap - slience_len fade_len = overlap - slience_len
slience = paddle.zeros([slience_len], dtype=paddle.float32) slience = paddle.zeros([slience_len], dtype=paddle.float32)
linear = paddle.ones([fade_len], dtype=paddle.float32) linear = paddle.ones([linear_len], dtype=paddle.float32)
# Equal power crossfade # Equal power crossfade
# fade_in increase from 0 to 1, fade_out reduces from 1 to 0 # fade_in increase from 0 to 1, fade_out reduces from 1 to 0
t = paddle.linspace(-1, 1, fade_len, dtype=paddle.float32) sigmoid_scale = 2.3
fade_in = paddle.sqrt(0.5 * (1 + t)) t = paddle.linspace(
fade_out = paddle.sqrt(0.5 * (1 - t)) -sigmoid_scale, sigmoid_scale, fade_len, dtype=paddle.float32)
# a sigmoid curve should work better here
fade_in = paddle.nn.functional.sigmoid(t)
fade_out = 1 - paddle.nn.functional.sigmoid(t)
# Concat the silence to the fades # Concat the silence to the fades
fade_out = paddle.concat([linear, fade_out]) fade_out = paddle.concat([linear, fade_out])
fade_in = paddle.concat([slience, fade_in]) fade_in = paddle.concat([slience, fade_in])
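
The fade change above drops the leading silence and swaps the equal-power sqrt curves for sigmoid-shaped fades over the overlap; since `fade_in + fade_out == 1` at every sample, overlapped folds still sum back at constant gain. A tiny check (the overlap length is an arbitrary example value):

```python
# Quick check of the new sigmoid fade curves used in WaveRNN's xfade_and_unfold.
import paddle

fade_len = 8
sigmoid_scale = 2.3
t = paddle.linspace(-sigmoid_scale, sigmoid_scale, fade_len, dtype=paddle.float32)
fade_in = paddle.nn.functional.sigmoid(t)
fade_out = 1 - paddle.nn.functional.sigmoid(t)
print((fade_in + fade_out).numpy())  # all ones
```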

@ -36,4 +36,4 @@ def repeat(N, fn):
Returns: Returns:
MultiSequential: Repeated model instance. MultiSequential: Repeated model instance.
""" """
return MultiSequential(*[fn(n) for n in range(N)]) return MultiSequential(* [fn(n) for n in range(N)])

@ -27,10 +27,9 @@ from setuptools.command.install import install
HERE = Path(os.path.abspath(os.path.dirname(__file__))) HERE = Path(os.path.abspath(os.path.dirname(__file__)))
VERSION = '0.1.1' VERSION = '0.1.2'
requirements = { base = [
"install": [
"editdistance", "editdistance",
"g2p_en", "g2p_en",
"g2pM", "g2pM",
@ -39,7 +38,7 @@ requirements = {
"jieba", "jieba",
"jsonlines", "jsonlines",
"kaldiio", "kaldiio",
"librosa", "librosa==0.8.1",
"loguru", "loguru",
"matplotlib", "matplotlib",
"nara_wpe", "nara_wpe",
@ -49,6 +48,7 @@ requirements = {
"paddlespeech_feat", "paddlespeech_feat",
"praatio==5.0.0", "praatio==5.0.0",
"pypinyin", "pypinyin",
"pypinyin-dict",
"python-dateutil", "python-dateutil",
"pyworld", "pyworld",
"resampy==0.2.2", "resampy==0.2.2",
@ -63,10 +63,18 @@ requirements = {
"visualdl", "visualdl",
"webrtcvad", "webrtcvad",
"yacs~=0.1.8", "yacs~=0.1.8",
# fastapi server "prettytable",
]
server = [
"fastapi", "fastapi",
"uvicorn", "uvicorn",
], "pattern_singleton",
]
requirements = {
"install":
base + server,
"develop": [ "develop": [
"ConfigArgParse", "ConfigArgParse",
"coverage", "coverage",

@ -54,4 +54,4 @@ batch_size:16|30
fp_items:fp32 fp_items:fp32
iteration:50 iteration:50
--profiler-options:"batch_range=[10,35];state=GPU;tracer_option=Default;profile_path=model.profile" --profiler-options:"batch_range=[10,35];state=GPU;tracer_option=Default;profile_path=model.profile"
flags:FLAGS_eager_delete_tensor_gb=0.0;FLAGS_fraction_of_gpu_memory_to_use=0.98;FLAGS_conv_workspace_size_limit=4096 flags:null

@ -54,4 +54,4 @@ batch_size:6|16
fp_items:fp32 fp_items:fp32
iteration:50 iteration:50
--profiler_options:"batch_range=[10,35];state=GPU;tracer_option=Default;profile_path=model.profile" --profiler_options:"batch_range=[10,35];state=GPU;tracer_option=Default;profile_path=model.profile"
flags:FLAGS_eager_delete_tensor_gb=0.0;FLAGS_fraction_of_gpu_memory_to_use=0.98;FLAGS_conv_workspace_size_limit=4096 flags:null

@ -26,15 +26,19 @@ if [ ${MODE} = "benchmark_train" ];then
curPath=$(readlink -f "$(dirname "$0")") curPath=$(readlink -f "$(dirname "$0")")
echo "curPath:"${curPath} echo "curPath:"${curPath}
cd ${curPath}/../.. cd ${curPath}/../..
pip install . apt-get install libsndfile1
pip install pytest-runner kaldiio setuptools_scm -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install . -i https://pypi.tuna.tsinghua.edu.cn/simple
cd - cd -
if [ ${model_name} == "conformer" ]; then if [ ${model_name} == "conformer" ]; then
# set the URL for aishell_tiny dataset # set the URL for aishell_tiny dataset
URL='None' URL=${conformer_data_URL:-"None"}
echo "URL:"${URL} echo "URL:"${URL}
if [ ${URL} == 'None' ];then if [ ${URL} == 'None' ];then
echo "please contact author to get the URL.\n" echo "please contact author to get the URL.\n"
exit exit
else
wget -P ${curPath}/../../dataset/aishell/ ${URL}
fi fi
sed -i "s#^URL_ROOT_TAG#URL_ROOT = '${URL}'#g" ${curPath}/conformer/scripts/aishell_tiny.py sed -i "s#^URL_ROOT_TAG#URL_ROOT = '${URL}'#g" ${curPath}/conformer/scripts/aishell_tiny.py
cp ${curPath}/conformer/scripts/aishell_tiny.py ${curPath}/../../dataset/aishell/ cp ${curPath}/conformer/scripts/aishell_tiny.py ${curPath}/../../dataset/aishell/
@ -42,6 +46,7 @@ if [ ${MODE} = "benchmark_train" ];then
source path.sh source path.sh
# download audio data # download audio data
sed -i "s#aishell.py#aishell_tiny.py#g" ./local/data.sh sed -i "s#aishell.py#aishell_tiny.py#g" ./local/data.sh
sed -i "s#python3#python#g" ./local/data.sh
bash ./local/data.sh || exit -1 bash ./local/data.sh || exit -1
if [ $? -ne 0 ]; then if [ $? -ne 0 ]; then
exit 1 exit 1
@ -56,7 +61,6 @@ if [ ${MODE} = "benchmark_train" ];then
sed -i "s#conf/#test_tipc/conformer/benchmark_train/conf/#g" ${curPath}/conformer/benchmark_train/conf/conformer.yaml sed -i "s#conf/#test_tipc/conformer/benchmark_train/conf/#g" ${curPath}/conformer/benchmark_train/conf/conformer.yaml
sed -i "s#data/#test_tipc/conformer/benchmark_train/data/#g" ${curPath}/conformer/benchmark_train/conf/tuning/decode.yaml sed -i "s#data/#test_tipc/conformer/benchmark_train/data/#g" ${curPath}/conformer/benchmark_train/conf/tuning/decode.yaml
sed -i "s#data/#test_tipc/conformer/benchmark_train/data/#g" ${curPath}/conformer/benchmark_train/conf/preprocess.yaml sed -i "s#data/#test_tipc/conformer/benchmark_train/data/#g" ${curPath}/conformer/benchmark_train/conf/preprocess.yaml
fi fi
if [ ${model_name} == "pwgan" ]; then if [ ${model_name} == "pwgan" ]; then

@ -11,11 +11,15 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
import os
import pickle
import unittest import unittest
import numpy as np import numpy as np
import paddle import paddle
from paddle import inference
from paddlespeech.s2t.models.ds2_online import DeepSpeech2InferModelOnline
from paddlespeech.s2t.models.ds2_online import DeepSpeech2ModelOnline from paddlespeech.s2t.models.ds2_online import DeepSpeech2ModelOnline
@ -182,5 +186,77 @@ class TestDeepSpeech2ModelOnline(unittest.TestCase):
paddle.allclose(final_state_c_box, final_state_c_box_chk), True) paddle.allclose(final_state_c_box, final_state_c_box_chk), True)
class TestDeepSpeech2StaticModelOnline(unittest.TestCase):
def setUp(self):
export_prefix = "exp/deepspeech2_online/checkpoints/test_export"
if not os.path.exists(os.path.dirname(export_prefix)):
os.makedirs(os.path.dirname(export_prefix), mode=0o755)
infer_model = DeepSpeech2InferModelOnline(
feat_size=161,
dict_size=4233,
num_conv_layers=2,
num_rnn_layers=5,
rnn_size=1024,
num_fc_layers=0,
fc_layers_size_list=[-1],
use_gru=False)
static_model = infer_model.export()
paddle.jit.save(static_model, export_prefix)
with open("test_data/static_ds2online_inputs.pickle", "rb") as f:
self.data_dict = pickle.load(f)
self.setup_model(export_prefix)
def setup_model(self, export_prefix):
deepspeech_config = inference.Config(export_prefix + ".pdmodel",
export_prefix + ".pdiparams")
if ('CUDA_VISIBLE_DEVICES' in os.environ.keys() and
os.environ['CUDA_VISIBLE_DEVICES'].strip() != ''):
deepspeech_config.enable_use_gpu(100, 0)
deepspeech_config.enable_memory_optim()
deepspeech_predictor = inference.create_predictor(deepspeech_config)
self.predictor = deepspeech_predictor
def test_unit(self):
input_names = self.predictor.get_input_names()
audio_handle = self.predictor.get_input_handle(input_names[0])
audio_len_handle = self.predictor.get_input_handle(input_names[1])
h_box_handle = self.predictor.get_input_handle(input_names[2])
c_box_handle = self.predictor.get_input_handle(input_names[3])
x_chunk = self.data_dict["audio_chunk"]
x_chunk_lens = self.data_dict["audio_chunk_lens"]
chunk_state_h_box = self.data_dict["chunk_state_h_box"]
chunk_state_c_box = self.data_dict["chunk_state_c_bos"]
audio_handle.reshape(x_chunk.shape)
audio_handle.copy_from_cpu(x_chunk)
audio_len_handle.reshape(x_chunk_lens.shape)
audio_len_handle.copy_from_cpu(x_chunk_lens)
h_box_handle.reshape(chunk_state_h_box.shape)
h_box_handle.copy_from_cpu(chunk_state_h_box)
c_box_handle.reshape(chunk_state_c_box.shape)
c_box_handle.copy_from_cpu(chunk_state_c_box)
output_names = self.predictor.get_output_names()
output_handle = self.predictor.get_output_handle(output_names[0])
output_lens_handle = self.predictor.get_output_handle(output_names[1])
output_state_h_handle = self.predictor.get_output_handle(
output_names[2])
output_state_c_handle = self.predictor.get_output_handle(
output_names[3])
self.predictor.run()
output_chunk_probs = output_handle.copy_to_cpu()
output_chunk_lens = output_lens_handle.copy_to_cpu()
chunk_state_h_box = output_state_h_handle.copy_to_cpu()
chunk_state_c_box = output_state_c_handle.copy_to_cpu()
return True
if __name__ == '__main__': if __name__ == '__main__':
unittest.main() unittest.main()

@ -0,0 +1,3 @@
mkdir -p ./test_data
wget -P ./test_data https://paddlespeech.bj.bcebos.com/datasets/unit_test/asr/static_ds2online_inputs.pickle
python deepspeech2_online_model_test.py

@ -0,0 +1,114 @@
#!/usr/bin/python
import argparse
import os
import yaml
def change_speech_yaml(yaml_name: str, device: str):
"""Change the settings of the device under the voice task configuration file
Args:
yaml_name (str): asr or asr_pd or tts or tts_pd
device (str): 'cpu' or 'gpu'
"""
if "asr" in yaml_name:
dirpath = "./conf/asr/"
elif 'tts' in yaml_name:
dirpath = "./conf/tts/"
yamlfile = dirpath + yaml_name + ".yaml"
tmp_yamlfile = dirpath + yaml_name + "_tmp.yaml"
os.system("cp %s %s" % (yamlfile, tmp_yamlfile))
with open(tmp_yamlfile) as f, open(yamlfile, "w+", encoding="utf-8") as fw:
y = yaml.safe_load(f)
if device == 'cpu':
print("Set device: cpu")
if yaml_name == 'asr':
y['device'] = 'cpu'
elif yaml_name == 'asr_pd':
y['am_predictor_conf']['device'] = 'cpu'
elif yaml_name == 'tts':
y['device'] = 'cpu'
elif yaml_name == 'tts_pd':
y['am_predictor_conf']['device'] = 'cpu'
y['voc_predictor_conf']['device'] = 'cpu'
elif device == 'gpu':
print("Set device: gpu")
if yaml_name == 'asr':
y['device'] = 'gpu:0'
elif yaml_name == 'asr_pd':
y['am_predictor_conf']['device'] = 'gpu:0'
elif yaml_name == 'tts':
y['device'] = 'gpu:0'
elif yaml_name == 'tts_pd':
y['am_predictor_conf']['device'] = 'gpu:0'
y['voc_predictor_conf']['device'] = 'gpu:0'
else:
print("Please set correct device: cpu or gpu.")
print("The content of '%s': " % (yamlfile))
print(yaml.dump(y, default_flow_style=False, sort_keys=False))
yaml.dump(y, fw, allow_unicode=True)
os.system("rm %s" % (tmp_yamlfile))
print("Change %s successfully." % (yamlfile))
def change_app_yaml(task: str, engine_type: str):
"""Change the engine type and corresponding configuration file of the speech task in application.yaml
Args:
task (str): asr or tts
engine_type (str): python or inference
"""
yamlfile = "./conf/application.yaml"
tmp_yamlfile = "./conf/application_tmp.yaml"
os.system("cp %s %s" % (yamlfile, tmp_yamlfile))
with open(tmp_yamlfile) as f, open(yamlfile, "w+", encoding="utf-8") as fw:
y = yaml.safe_load(f)
y['engine_type'][task] = engine_type
path_list = ["./conf/", task, "/", task]
if engine_type == 'python':
path_list.append(".yaml")
elif engine_type == 'inference':
path_list.append("_pd.yaml")
y['engine_backend'][task] = ''.join(path_list)
print("The content of './conf/application.yaml': ")
print(yaml.dump(y, default_flow_style=False, sort_keys=False))
yaml.dump(y, fw, allow_unicode=True)
os.system("rm %s" % (tmp_yamlfile))
print("Change %s successfully." % (yamlfile))
if __name__ == "__main__":
parser = argparse.ArgumentParser()
parser.add_argument(
'--change_task',
type=str,
default=None,
help='Change task',
choices=[
'app-asr-python',
'app-asr-inference',
'app-tts-python',
'app-tts-inference',
'speech-asr-cpu',
'speech-asr-gpu',
'speech-asr_pd-cpu',
'speech-asr_pd-gpu',
'speech-tts-cpu',
'speech-tts-gpu',
'speech-tts_pd-cpu',
'speech-tts_pd-gpu',
],
required=True)
args = parser.parse_args()
types = args.change_task.split("-")
if types[0] == "app":
change_app_yaml(types[1], types[2])
elif types[0] == "speech":
change_speech_yaml(types[1], types[2])
else:
print("Error change task, please check change_task.")

@ -0,0 +1,27 @@
# This is the parameter configuration file for PaddleSpeech Serving.
##################################################################
# SERVER SETTING #
##################################################################
host: 127.0.0.1
port: 8090
##################################################################
# CONFIG FILE #
##################################################################
# add engine backend type (Options: asr, tts) and config file here.
# Adding a speech task to engine_backend means starting the service.
engine_backend:
asr: 'conf/asr/asr.yaml'
tts: 'conf/tts/tts.yaml'
# The engine_type of speech task needs to keep the same type as the config file of speech task.
# E.g: The engine_type of asr is 'python', the engine_backend of asr is 'XX/asr.yaml'
# E.g: The engine_type of asr is 'inference', the engine_backend of asr is 'XX/asr_pd.yaml'
#
# add engine type (Options: python, inference)
engine_type:
asr: 'python'
tts: 'python'
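
As the comments above note, each task's `engine_type` decides which task config `engine_backend` should point at; a small sketch of that mapping, mirroring the `change_app_yaml` logic in the earlier script (the helper name is hypothetical):

```python
# Hypothetical helper: map a task and engine_type to its config path,
# following the 'python' -> <task>.yaml, 'inference' -> <task>_pd.yaml convention.
def engine_backend_path(task: str, engine_type: str) -> str:
    suffix = ".yaml" if engine_type == "python" else "_pd.yaml"
    return f"./conf/{task}/{task}{suffix}"


print(engine_backend_path("asr", "python"))      # ./conf/asr/asr.yaml
print(engine_backend_path("tts", "inference"))   # ./conf/tts/tts_pd.yaml
```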

@ -0,0 +1,8 @@
model: 'conformer_wenetspeech'
lang: 'zh'
sample_rate: 16000
cfg_path: # [optional]
ckpt_path: # [optional]
decode_method: 'attention_rescoring'
force_yes: True
device: # set 'gpu:id' or 'cpu'

@ -0,0 +1,26 @@
# This is the parameter configuration file for ASR server.
# These are the static models that support paddle inference.
##################################################################
# ACOUSTIC MODEL SETTING #
# am choices=['deepspeech2offline_aishell'] TODO
##################################################################
model_type: 'deepspeech2offline_aishell'
am_model: # the pdmodel file of am static model [optional]
am_params: # the pdiparams file of am static model [optional]
lang: 'zh'
sample_rate: 16000
cfg_path:
decode_method:
force_yes: True
am_predictor_conf:
device: # set 'gpu:id' or 'cpu'
switch_ir_optim: True
glog_info: False # True -> print glog
summary: True # False -> do not show predictor config
##################################################################
# OTHERS #
##################################################################

@ -0,0 +1,32 @@
# This is the parameter configuration file for TTS server.
##################################################################
# ACOUSTIC MODEL SETTING #
# am choices=['speedyspeech_csmsc', 'fastspeech2_csmsc',
# 'fastspeech2_ljspeech', 'fastspeech2_aishell3',
# 'fastspeech2_vctk']
##################################################################
am: 'fastspeech2_csmsc'
am_config:
am_ckpt:
am_stat:
phones_dict:
tones_dict:
speaker_dict:
spk_id: 0
##################################################################
# VOCODER SETTING #
# voc choices=['pwgan_csmsc', 'pwgan_ljspeech', 'pwgan_aishell3',
# 'pwgan_vctk', 'mb_melgan_csmsc']
##################################################################
voc: 'pwgan_csmsc'
voc_config:
voc_ckpt:
voc_stat:
##################################################################
# OTHERS #
##################################################################
lang: 'zh'
device: # set 'gpu:id' or 'cpu'

@ -0,0 +1,42 @@
# This is the parameter configuration file for TTS server.
# These are the static models that support paddle inference.
##################################################################
# ACOUSTIC MODEL SETTING #
# am choices=['speedyspeech_csmsc', 'fastspeech2_csmsc']
##################################################################
am: 'fastspeech2_csmsc'
am_model: # the pdmodel file of your am static model (XX.pdmodel)
am_params: # the pdiparams file of your am static model (XX.pdiparams)
am_sample_rate: 24000
phones_dict:
tones_dict:
speaker_dict:
spk_id: 0
am_predictor_conf:
device: # set 'gpu:id' or 'cpu'
switch_ir_optim: True
glog_info: False # True -> print glog
summary: True # False -> do not show predictor config
##################################################################
# VOCODER SETTING #
# voc choices=['pwgan_csmsc', 'mb_melgan_csmsc','hifigan_csmsc']
##################################################################
voc: 'pwgan_csmsc'
voc_model: # the pdmodel file of your vocoder static model (XX.pdmodel)
voc_params: # the pdiparams file of your vocoder static model (XX.pdiparams)
voc_sample_rate: 24000
voc_predictor_conf:
device: # set 'gpu:id' or 'cpu'
switch_ir_optim: True
glog_info: False # True -> print glog
summary: True # False -> do not show predictor config
##################################################################
# OTHERS #
##################################################################
lang: 'zh'

@ -0,0 +1,185 @@
#!/bin/bash
# bash test_server_client.sh
StartService(){
# Start service
paddlespeech_server start --config_file $config_file 1>>log/server.log 2>>log/server.log.wf &
echo $! > pid
start_num=$(cat log/server.log.wf | grep "INFO: Uvicorn running on http://" -c)
flag="normal"
while [[ $start_num -lt $target_start_num && $flag == "normal" ]]
do
start_num=$(cat log/server.log.wf | grep "INFO: Uvicorn running on http://" -c)
# start service failed
if [ $(cat log/server.log.wf | grep -i "error" -c) -gt $error_time ];then
echo "Service failed to start." | tee -a ./log/test_result.log
error_time=$(cat log/server.log.wf | grep -i "error" -c)
flag="unnormal"
fi
done
}
ClientTest(){
# Client test
# test asr client
paddlespeech_client asr --server_ip $server_ip --port $port --input ./zh.wav
((test_times+=1))
paddlespeech_client asr --server_ip $server_ip --port $port --input ./zh.wav
((test_times+=1))
# test tts client
paddlespeech_client tts --server_ip $server_ip --port $port --input "您好,欢迎使用百度飞桨语音合成服务。" --output output.wav
((test_times+=1))
paddlespeech_client tts --server_ip $server_ip --port $port --input "您好,欢迎使用百度飞桨语音合成服务。" --output output.wav
((test_times+=1))
}
GetTestResult() {
# Determine if the test was successful
response_success_time=$(cat log/server.log | grep "200 OK" -c)
if (( $response_success_time == $test_times )) ; then
echo "Testing succeeded. The service configuration is: asr engine type: $1; tts engine type: $1; device: $2." | tee -a ./log/test_result.log
else
echo "Testing failed. The service configuration is: asr engine type: $1; tts engine type: $1; device: $2." | tee -a ./log/test_result.log
fi
test_times=$response_success_time
}
mkdir -p log
rm -rf log/server.log.wf
rm -rf log/server.log
rm -rf log/test_result.log
config_file=./conf/application.yaml
server_ip=$(cat $config_file | grep "host" | awk -F " " '{print $2}')
port=$(cat $config_file | grep "port" | awk '/port:/ {print $2}')
echo "Service ip: $server_ip" | tee ./log/test_result.log
echo "Service port: $port" | tee -a ./log/test_result.log
# whether a process is listening on $port
pid=`lsof -i :"$port"|grep -v "PID" | awk '{print $2}'`
if [ "$pid" != "" ]; then
echo "The port: $port is occupied, please use another port"
exit
fi
# download test audios for ASR client
wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespeech.bj.bcebos.com/PaddleAudio/en.wav
target_start_num=0 # the expected number of service starts
test_times=0 # the number of client test requests
error_time=0 # the number of errors seen so far in server.log.wf
# start server: asr engine type: python; tts engine type: python; device: gpu
echo "Start the service: asr engine type: python; tts engine type: python; device: gpu" | tee -a ./log/test_result.log
((target_start_num+=1))
StartService
if [[ $start_num -eq $target_start_num && $flag == "normal" ]]; then
echo "Service started successfully." | tee -a ./log/test_result.log
ClientTest
echo "This round of testing is over." | tee -a ./log/test_result.log
GetTestResult python gpu
else
echo "Service failed to start, no client test."
target_start_num=$start_num
fi
kill -9 `cat pid`
rm -rf pid
sleep 2s
echo "**************************************************************************************" | tee -a ./log/test_result.log
# start server: asr engine type: python; tts engine type: python; device: cpu
python change_yaml.py --change_task speech-asr-cpu # change asr.yaml device: cpu
python change_yaml.py --change_task speech-tts-cpu # change tts.yaml device: cpu
echo "Start the service: asr engine type: python; tts engine type: python; device: cpu" | tee -a ./log/test_result.log
((target_start_num+=1))
StartService
if [[ $start_num -eq $target_start_num && $flag == "normal" ]]; then
echo "Service started successfully." | tee -a ./log/test_result.log
ClientTest
echo "This round of testing is over." | tee -a ./log/test_result.log
GetTestResult python cpu
else
echo "Service failed to start, no client test."
target_start_num=$start_num
fi
kill -9 `cat pid`
rm -rf pid
sleep 2s
echo "**************************************************************************************" | tee -a ./log/test_result.log
# start server: asr engine type: inference; tts engine type: inference; device: gpu
python change_yaml.py --change_task app-asr-inference # change application.yaml, asr engine_type: inference; asr engine_backend: asr_pd.yaml
python change_yaml.py --change_task app-tts-inference # change application.yaml, tts engine_type: inference; tts engine_backend: tts_pd.yaml
echo "Start the service: asr engine type: inference; tts engine type: inference; device: gpu" | tee -a ./log/test_result.log
((target_start_num+=1))
StartService
if [[ $start_num -eq $target_start_num && $flag == "normal" ]]; then
echo "Service started successfully." | tee -a ./log/test_result.log
ClientTest
echo "This round of testing is over." | tee -a ./log/test_result.log
GetTestResult inference gpu
else
echo "Service failed to start, no client test."
target_start_num=$start_num
fi
kill -9 `cat pid`
rm -rf pid
sleep 2s
echo "**************************************************************************************" | tee -a ./log/test_result.log
# start server: asr engine type: inference; tts engine type: inference; device: cpu
python change_yaml.py --change_task speech-asr_pd-cpu # change asr_pd.yaml device: cpu
python change_yaml.py --change_task speech-tts_pd-cpu # change tts_pd.yaml device: cpu
echo "start the service: asr engine type: inference; tts engine type: inference; device: cpu" | tee -a ./log/test_result.log
((target_start_num+=1))
StartService
if [[ $start_num -eq $target_start_num && $flag == "normal" ]]; then
echo "Service started successfully." | tee -a ./log/test_result.log
ClientTest
echo "This round of testing is over." | tee -a ./log/test_result.log
GetTestResult inference cpu
else
echo "Service failed to start, no client test."
target_start_num=$start_num
fi
kill -9 `cat pid`
rm -rf pid
sleep 2s
echo "**************************************************************************************" | tee -a ./log/test_result.log
echo "All tests completed." | tee -a ./log/test_result.log
# show all the test results
echo "***************** Here are all the test results ********************"
cat ./log/test_result.log
# Restoring conf is the same as demos/speech_server
cp ../../../demos/speech_server/conf/ ./ -rf