Merge branch 'PaddlePaddle:develop' into develop

3 years ago · 05447ea93b
parent 7939884c3f 58309aa9d7
commit 05447ea93b
105 changed files with 5771 additions and 220 deletions
--- a/README.md
+++ b/README.md
@@ -157,14 +157,18 @@ Via the easy-to-use, efficient, flexible and scalable implementation, our vision
  - 🧩  *Cascaded models application*: as an extension of the typical traditional audio tasks, we combine the workflows of the aforementioned tasks with other fields like Natural language processing (NLP) and Computer Vision (CV).

 ### Recent Update
+- 👑 2022.11.18: Add [Whisper CLI and Demos](https://github.com/PaddlePaddle/PaddleSpeech/pull/2640), supporting multilingual recognition and translation.
+- 🔥 2022.11.18: Add [Wav2vec2 CLI and Demos](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/demos/speech_ssl), supporting ASR and feature extraction.
+- 🎉 2022.11.17: Add [male voice for TTS](https://github.com/PaddlePaddle/PaddleSpeech/pull/2660).
+- 🔥 2022.11.07: Add [U2/U2++ C++ High Performance Streaming ASR Deployment](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/speechx/examples/u2pp_ol/wenetspeech).
 - 👑 2022.11.01: Add [Adversarial Loss](https://arxiv.org/pdf/1907.04448.pdf) for [Chinese English mixed TTS](./examples/zh_en_tts/tts3).
 - 🔥 2022.10.26: Add [Prosody Prediction](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/rhy) for TTS.
 - 🎉 2022.10.21: Add [SSML](https://github.com/PaddlePaddle/PaddleSpeech/discussions/2538) for TTS Chinese Text Frontend.
- 👑 2022.10.11: Add [Wav2vec2ASR](./examples/librispeech/asr3), wav2vec2.0 fine-tuning for ASR on LibriSpeech.
- 🔥 2022.09.26: Add Voice Cloning, TTS finetune, and ERNIE-SAT in [PaddleSpeech Web Demo](./demos/speech_web).
+- 👑 2022.10.11: Add [Wav2vec2ASR-en](./examples/librispeech/asr3), wav2vec2.0 fine-tuning for ASR on LibriSpeech.
+- 🔥 2022.09.26: Add Voice Cloning, TTS finetune, and [ERNIE-SAT](https://arxiv.org/abs/2211.03545) in [PaddleSpeech Web Demo](./demos/speech_web).
 - ⚡ 2022.09.09: Add AISHELL-3 Voice Cloning [example](./examples/aishell3/vc2) with ECAPA-TDNN speaker encoder.
 - ⚡ 2022.08.25: Release TTS [finetune](./examples/other/tts_finetune/tts3) example.
- 🔥 2022.08.22: Add ERNIE-SAT models: [ERNIE-SAT-vctk](./examples/vctk/ernie_sat)、[ERNIE-SAT-aishell3](./examples/aishell3/ernie_sat)、[ERNIE-SAT-zh_en](./examples/aishell3_vctk/ernie_sat).
+- 🔥 2022.08.22: Add [ERNIE-SAT](https://arxiv.org/abs/2211.03545) models: [ERNIE-SAT-vctk](./examples/vctk/ernie_sat)、[ERNIE-SAT-aishell3](./examples/aishell3/ernie_sat)、[ERNIE-SAT-zh_en](./examples/aishell3_vctk/ernie_sat).
 - 🔥 2022.08.15: Add [g2pW](https://github.com/GitYCC/g2pW) into TTS Chinese Text Frontend.
 - 🔥 2022.08.09: Release [Chinese English mixed TTS](./examples/zh_en_tts/tts3).
 - ⚡ 2022.08.03: Add ONNXRuntime infer for  TTS CLI.
@@ -579,7 +583,7 @@ PaddleSpeech supports a series of most popular models. They are summarized in [r
      </td>
    </tr>
    <tr>
-      <td>ERNIE-SAT</td>
+      <td><a href = "https://arxiv.org/abs/2211.03545">ERNIE-SAT</a></td>
      <td>VCTK / AISHELL-3 / ZH_EN</td>
      <td>
      <a href = "./examples/vctk/ernie_sat">ERNIE-SAT-vctk</a> / <a href = "./examples/aishell3/ernie_sat">ERNIE-SAT-aishell3</a> / <a href = "./examples/aishell3_vctk/ernie_sat">ERNIE-SAT-zh_en</a>
@@ -977,6 +981,7 @@ You are warmly welcome to submit questions in [discussions](https://github.com/P
 - Many thanks to [jerryuhoo](https://github.com/jerryuhoo)/[VTuberTalk](https://github.com/jerryuhoo/VTuberTalk) for developing a GUI tool based on PaddleSpeech TTS and code for making datasets from videos based on PaddleSpeech ASR.
 - Many thanks to [vpegasus](https://github.com/vpegasus)/[xuesebot](https://github.com/vpegasus/xuesebot) for developing a rasa chatbot, which is able to speak and listen thanks to PaddleSpeech.
 - Many thanks to [chenkui164](https://github.com/chenkui164)/[FastASR](https://github.com/chenkui164/FastASR) for the C++ inference implementation of PaddleSpeech ASR.
+- Many thanks to [heyudage](https://github.com/heyudage)/[VoiceTyping](https://github.com/heyudage/VoiceTyping) for implementing a real-time voice typing tool based on PaddleSpeech ASR streaming services.

 Besides, PaddleSpeech depends on a lot of open source repositories. See [references](./docs/source/reference.md) for more information.

--- a/README_cn.md
+++ b/README_cn.md
@@ -164,14 +164,18 @@

  
 ### Recent Update
+- 👑 2022.11.18: Add [Whisper CLI and Demos](https://github.com/PaddlePaddle/PaddleSpeech/pull/2640), supporting multilingual recognition and translation.
+- 🔥 2022.11.18: Add [Wav2vec2 CLI and Demos](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/demos/speech_ssl), supporting ASR and feature extraction.
+- 🎉 2022.11.17: Add a [high-quality male voice](https://github.com/PaddlePaddle/PaddleSpeech/pull/2660) for TTS.
+- 🔥 2022.11.07: Add [U2/U2++ high-performance streaming ASR C++ deployment](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/speechx/examples/u2pp_ol/wenetspeech).
 - 👑 2022.11.01: Add an [Adversarial Loss](https://arxiv.org/pdf/1907.04448.pdf) module to [Chinese English mixed TTS](./examples/zh_en_tts/tts3).
 - 🔥 2022.10.26: Add [prosody prediction](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/rhy) for TTS.
 - 🎉 2022.10.21: Add [SSML](https://github.com/PaddlePaddle/PaddleSpeech/discussions/2538) support to the TTS Chinese text frontend.
- 👑 2022.10.11: Add [Wav2vec2ASR](./examples/librispeech/asr3), wav2vec2.0 fine-tuning for ASR on LibriSpeech.
- 🔥 2022.09.26: Add Voice Cloning, TTS finetune, and ERNIE-SAT to the [PaddleSpeech Web Demo](./demos/speech_web).
+- 👑 2022.10.11: Add [Wav2vec2ASR-en](./examples/librispeech/asr3), wav2vec2.0 fine-tuning for ASR on LibriSpeech.
+- 🔥 2022.09.26: Add Voice Cloning, TTS finetune, and [ERNIE-SAT](https://arxiv.org/abs/2211.03545) to the [PaddleSpeech Web Demo](./demos/speech_web).
 - ⚡ 2022.09.09: Add an AISHELL-3 Voice Cloning [example](./examples/aishell3/vc2) based on the ECAPA-TDNN speaker encoder.
 - ⚡ 2022.08.25: Release the TTS [finetune](./examples/other/tts_finetune/tts3) example.
- 🔥 2022.08.22: Add ERNIE-SAT models: [ERNIE-SAT-vctk](./examples/vctk/ernie_sat), [ERNIE-SAT-aishell3](./examples/aishell3/ernie_sat), [ERNIE-SAT-zh_en](./examples/aishell3_vctk/ernie_sat).
+- 🔥 2022.08.22: Add [ERNIE-SAT](https://arxiv.org/abs/2211.03545) models: [ERNIE-SAT-vctk](./examples/vctk/ernie_sat), [ERNIE-SAT-aishell3](./examples/aishell3/ernie_sat), [ERNIE-SAT-zh_en](./examples/aishell3_vctk/ernie_sat).
 - 🔥 2022.08.15: Add [g2pW](https://github.com/GitYCC/g2pW) to the TTS Chinese text frontend.
 - 🔥 2022.08.09: Release [Chinese English mixed TTS](./examples/zh_en_tts/tts3).
 - ⚡ 2022.08.03: Add ONNXRuntime inference for the TTS CLI.
@@ -576,7 +580,7 @@ PaddleSpeech 的 **语音合成** 主要包含三个模块：文本前端、声
      </td>
    </tr>
    <tr>
-      <td>ERNIE-SAT</td>
+      <td><a href = "https://arxiv.org/abs/2211.03545">ERNIE-SAT</a></td>
      <td>VCTK / AISHELL-3 / ZH_EN</td>
      <td>
      <a href = "./examples/vctk/ernie_sat">ERNIE-SAT-vctk</a> / <a href = "./examples/aishell3/ernie_sat">ERNIE-SAT-aishell3</a> / <a href = "./examples/aishell3_vctk/ernie_sat">ERNIE-SAT-zh_en</a>
@@ -983,6 +987,7 @@ PaddleSpeech 的 **语音合成** 主要包含三个模块：文本前端、声

 - Many thanks to [vpegasus](https://github.com/vpegasus)/[xuesebot](https://github.com/vpegasus/xuesebot) for designing a chatbot that can listen and speak, based on PaddleSpeech ASR and TTS.
 - Many thanks to [chenkui164](https://github.com/chenkui164)/[FastASR](https://github.com/chenkui164/FastASR) for the C++ inference implementation of PaddleSpeech ASR.
+- Many thanks to [heyudage](https://github.com/heyudage)/[VoiceTyping](https://github.com/heyudage/VoiceTyping) for implementing a real-time voice typing tool based on PaddleSpeech ASR streaming services.

 Besides, PaddleSpeech depends on a lot of open source repositories. See [references](./docs/source/reference.md) for more information.

--- a/demos/README.md
+++ b/demos/README.md
@@ -17,3 +17,5 @@ This directory contains many speech applications in multiple scenarios.
 * story talker - book reader based on OCR and TTS  
 * style_fs2 - multi style control for FastSpeech2 model  
 * text_to_speech - convert text into speech 
+* self-supervised pretraining - speech feature extraction and speech recognition based on wav2vec2
+* Whisper - speech recognition and translation based on the Whisper model
--- a/demos/README_cn.md
+++ b/demos/README_cn.md
@@ -17,3 +17,5 @@
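 * Talking storybook - a talking storybook based on OCR and TTS.
 * Personalized speech synthesis - personalized speech synthesis based on the FastSpeech2 model.
 * Text to speech - generate speech audio from the given text.
+* Self-supervised pretrained models - speech feature extraction and speech recognition based on wav2vec2.
+* Whisper - speech recognition and translation based on the Whisper model.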
--- a/demos/asr_deployment/README.md
+++ b/demos/asr_deployment/README.md
@@ -0,0 +1,100 @@
+([简体中文](./README_cn.md)|English)
+# ASR Deployment by SpeechX
+
+## Introduction
+
+ASR deployment supports U2/U2++/Deepspeech2 ASR models using C++, which is common practice in industrial deployment.
+
+For more information about SpeechX, see [here](../../speechx/README.md).
+
+## Usage
+### 1. Environment
+
+* python - 3.7
+* docker - `registry.baidubce.com/paddlepaddle/paddle:2.2.2-gpu-cuda10.2-cudnn7`
+* os - Ubuntu 16.04.7 LTS
+* gcc/g++/gfortran - 8.2.0
+* cmake - 3.16.0
+
+For more information, see [here](../../speechx/README.md).
+
+### 2. Compile SpeechX
+
+Please see [here](../../speechx/README.md).
+
+### 3. Usage
+
+For the u2++ ASR deployment example, see [here](../../speechx/examples/u2pp_ol/wenetspeech/).
+
+First, go to the `speechx/speechx/examples/u2pp_ol/wenetspeech` directory.
+
+- Source path.sh
+  ```bash
+  source path.sh
+  ```
+
+- Download the model and prepare the test data and CMVN file
+  ```bash
+  ./run.sh --stage 0 --stop_stage 1
+  ```
+
+- Decode with WAV
+  
+  ```bash
+  # FP32
+  ./local/recognizer.sh
+
+  # INT8
+  ./local/recognizer_quant.sh
+  ```
+
+  Output:
+  ```bash
+  I1026 16:13:24.683531 48038 u2_recognizer_main.cc:55] utt: BAC009S0916W0495
+  I1026 16:13:24.683578 48038 u2_recognizer_main.cc:56] wav dur: 4.17119 sec.
+  I1026 16:13:24.683595 48038 u2_recognizer_main.cc:64] wav len (sample): 66739
+  I1026 16:13:25.037652 48038 u2_recognizer_main.cc:87] Pratial result: 3 这令
+  I1026 16:13:25.043697 48038 u2_recognizer_main.cc:87] Pratial result: 4 这令
+  I1026 16:13:25.222124 48038 u2_recognizer_main.cc:87] Pratial result: 5 这令被贷款
+  I1026 16:13:25.228385 48038 u2_recognizer_main.cc:87] Pratial result: 6 这令被贷款
+  I1026 16:13:25.414669 48038 u2_recognizer_main.cc:87] Pratial result: 7 这令被贷款的员工
+  I1026 16:13:25.420714 48038 u2_recognizer_main.cc:87] Pratial result: 8 这令被贷款的员工
+  I1026 16:13:25.608129 48038 u2_recognizer_main.cc:87] Pratial result: 9 这令被贷款的员工们请
+  I1026 16:13:25.801620 48038 u2_recognizer_main.cc:87] Pratial result: 10 这令被贷款的员工们请食难安
+  I1026 16:13:25.804101 48038 feature_cache.h:44] set finished
+  I1026 16:13:25.804128 48038 feature_cache.h:51] compute last feats done.
+  I1026 16:13:25.948771 48038 u2_recognizer_main.cc:87] Pratial result: 11 这令被贷款的员工们请食难安
+  I1026 16:13:26.246963 48038 u2_recognizer_main.cc:113] BAC009S0916W0495 这令被贷款的员工们请食难安
+  ```
+
+## Result
+
+> CER is computed on the aishell test set.
+> RTF covers both feature extraction and decoding, which makes it closer to an end-to-end measurement.
+> Machine: Intel(R) Xeon(R) Gold 6271C CPU @ 2.60GHz with avx512_vnni
+
+### FP32
+
+```
+Overall -> 5.75 % N=104765 C=99035 S=5587 D=143 I=294
+Mandarin -> 5.75 % N=104762 C=99035 S=5584 D=143 I=294
+English -> 0.00 % N=0 C=0 S=0 D=0 I=0
+Other -> 100.00 % N=3 C=0 S=3 D=0 I=0
+```
+
+```
+RTF is: 0.315337
+```
+
+### INT8
+
+```
+Overall -> 5.83 % N=104765 C=98943 S=5675 D=147 I=286
+Mandarin -> 5.83 % N=104762 C=98943 S=5672 D=147 I=286
+English -> 0.00 % N=0 C=0 S=0 D=0 I=0
+Other -> 100.00 % N=3 C=0 S=3 D=0 I=0
+```
+
+```
+RTF is: 0.269674
+```
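
The error-rate lines in the Result section follow the usual scoring convention: N is the number of reference characters, and C, S, D and I count correct, substituted, deleted and inserted characters. As a minimal sketch (the function names are hypothetical and not part of SpeechX), the reported percentages can be reproduced from those counts. The RTF helper uses the common definition, processing time divided by audio duration; how SpeechX measures the two times is not shown in this README, so treat that part as an assumption.

```python
def cer_percent(n_ref: int, subs: int, dels: int, ins: int) -> float:
    """Character error rate in percent: (S + D + I) / N * 100."""
    if n_ref <= 0:
        raise ValueError("reference length must be positive")
    return 100.0 * (subs + dels + ins) / n_ref


def real_time_factor(processing_sec: float, audio_sec: float) -> float:
    """RTF under the common definition: processing time / audio duration."""
    if audio_sec <= 0:
        raise ValueError("audio duration must be positive")
    return processing_sec / audio_sec


# Counts copied from the FP32 and INT8 "Overall" lines of the English README.
fp32_cer = cer_percent(n_ref=104765, subs=5587, dels=143, ins=294)
int8_cer = cer_percent(n_ref=104765, subs=5675, dels=147, ins=286)
print(f"FP32 CER: {fp32_cer:.2f} %")
print(f"INT8 CER: {int8_cer:.2f} %")
```

An RTF below 1.0 means the recognizer processes audio faster than real time, which is the case for both the FP32 and INT8 figures reported here.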
--- a/demos/asr_deployment/README_cn.md
+++ b/demos/asr_deployment/README_cn.md
@@ -0,0 +1,96 @@
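+(简体中文|[English](./README.md))
+# ASR Deployment with SpeechX
+
+## Introduction
+
+C++ deployment of U2/U2++/Deepspeech2 models is supported, which is commonly used in industrial practice.
+
+For more information about SpeechX, see the [documentation](../../speechx/README.md).
+
+## Usage
+### 1. Environment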
+
+* python - 3.7
+* docker - `registry.baidubce.com/paddlepaddle/paddle:2.2.2-gpu-cuda10.2-cudnn7`
+* os - Ubuntu 16.04.7 LTS
+* gcc/g++/gfortran - 8.2.0
+* cmake - 3.16.0
+
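+For more information, see the [documentation](../../speechx/README.md).
+
+### 2. Compile SpeechX
+
+For more information, see the [documentation](../../speechx/README.md).
+
+### 3. Example
+
+For the u2++ recognition deployment, see [here](../../speechx/examples/u2pp_ol/wenetspeech/).
+
+The following steps are run in `speechx/speechx/examples/u2pp_ol/wenetspeech`.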
+
+- Source path.sh
+  ```bash
+  source path.sh
+  ```
+
+- Download the model and prepare the test data and CMVN file
+  ```bash
+  ./run.sh --stage 0 --stop_stage 1
+  ```
+
+- Decode
+  
+  ```bash
+  # FP32
+  ./local/recognizer.sh
+
+  # INT8
+  ./local/recognizer_quant.sh
+  ```
+
+  Output:
+  ```bash
+  I1026 16:13:24.683531 48038 u2_recognizer_main.cc:55] utt: BAC009S0916W0495
+  I1026 16:13:24.683578 48038 u2_recognizer_main.cc:56] wav dur: 4.17119 sec.
+  I1026 16:13:24.683595 48038 u2_recognizer_main.cc:64] wav len (sample): 66739
+  I1026 16:13:25.037652 48038 u2_recognizer_main.cc:87] Pratial result: 3 这令
+  I1026 16:13:25.043697 48038 u2_recognizer_main.cc:87] Pratial result: 4 这令
+  I1026 16:13:25.222124 48038 u2_recognizer_main.cc:87] Pratial result: 5 这令被贷款
+  I1026 16:13:25.228385 48038 u2_recognizer_main.cc:87] Pratial result: 6 这令被贷款
+  I1026 16:13:25.414669 48038 u2_recognizer_main.cc:87] Pratial result: 7 这令被贷款的员工
+  I1026 16:13:25.420714 48038 u2_recognizer_main.cc:87] Pratial result: 8 这令被贷款的员工
+  I1026 16:13:25.608129 48038 u2_recognizer_main.cc:87] Pratial result: 9 这令被贷款的员工们请
+  I1026 16:13:25.801620 48038 u2_recognizer_main.cc:87] Pratial result: 10 这令被贷款的员工们请食难安
+  I1026 16:13:25.804101 48038 feature_cache.h:44] set finished
+  I1026 16:13:25.804128 48038 feature_cache.h:51] compute last feats done.
+  I1026 16:13:25.948771 48038 u2_recognizer_main.cc:87] Pratial result: 11 这令被贷款的员工们请食难安
+  I1026 16:13:26.246963 48038 u2_recognizer_main.cc:113] BAC009S0916W0495 这令被贷款的员工们请食难安
+  ```
+
+## Result
+
+> CER is computed on the aishell test set.
+> RTF covers both feature extraction and decoding.
+> Test machine: Intel(R) Xeon(R) Gold 6271C CPU @ 2.60GHz avx512_vnni
+
+### FP32
+
+```
+Overall -> 5.75 % N=104765 C=99035 S=5587 D=143 I=294
+Mandarin -> 5.75 % N=104762 C=99035 S=5584 D=143 I=294
+English -> 0.00 % N=0 C=0 S=0 D=0 I=0
+Other -> 100.00 % N=3 C=0 S=3 D=0 I=0
+```
+
+```
+RTF is: 0.315337
+```
+
+### INT8
+
+```
+Overall -> 5.87 % N=104765 C=98909 S=5711 D=145 I=289
+Mandarin -> 5.86 % N=104762 C=98909 S=5708 D=145 I=289
+English -> 0.00 % N=0 C=0 S=0 D=0 I=0
+Other -> 100.00 % N=3 C=0 S=3 D=0 I=0
+```
--- a/demos/speech_ssl/README.md
+++ b/demos/speech_ssl/README.md
@@ -0,0 +1,102 @@
+([简体中文](./README_cn.md)|English)
+# Speech SSL (Self-Supervised Learning)
+
+## Introduction
+Speech SSL, or Self-Supervised Learning, refers to training on large-scale unlabeled speech datasets. A model trained this way can produce a good acoustic representation and can be applied to other downstream speech tasks by fine-tuning on labeled datasets.
+
+This demo uses speech SSL models to recognize text from a given audio file or to produce its acoustic representation. It can be done with a single command or a few lines of Python using `PaddleSpeech`.
+
+## Usage
+### 1. Installation
+See [installation](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install.md).
+
+You can choose one of the easy, medium, or hard ways to install paddlespeech.
+
+### 2. Prepare Input File
+The input of this demo should be a WAV file (`.wav`), and its sample rate must match the sample rate of the model.
+
+A sample file for this demo can be downloaded:
+```bash
+wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/en.wav
+```
+
+### 3. Usage
+- Command Line (Recommended)
+  ```bash
+  # to recognize text 
+  paddlespeech ssl --task asr --lang en --input ./en.wav
+
+  # to get acoustic representation
+  paddlespeech ssl --task vector --lang en --input ./en.wav
+  ```
+
+  Usage:
+  ```bash
+  paddlespeech ssl --help
+  ```
+  Arguments:
+  - `input` (required): Audio file to recognize.
+  - `model`: Model type of the ASR task. Default: `wav2vec2ASR_librispeech`.
+  - `task`: Output type. Default: `asr`.
+  - `lang`: Model language. Default: `en`.
+  - `sample_rate`: Sample rate of the model. Default: `16000`.
+  - `config`: Config of the ASR task. The pretrained model is used when it is None. Default: `None`.
+  - `ckpt_path`: Model checkpoint. The pretrained model is used when it is None. Default: `None`.
+  - `yes`: Takes no additional value. Once set, all requests from the program are accepted by default, including transforming the audio sample rate. Default: `False`.
+  - `device`: Device used for model inference. Default: the default device of paddlepaddle in the current environment.
+  - `verbose`: Show the log information.
+
+
+- Python API
+  ```python
+  import paddle
+  from paddlespeech.cli.ssl import SSLExecutor
+
+  ssl_executor = SSLExecutor()
+
+  # to recognize text 
+  text = ssl_executor(
+      model='wav2vec2ASR_librispeech',
+      task='asr',
+      lang='en',
+      sample_rate=16000,
+      config=None,  # Set `config` and `ckpt_path` to None to use pretrained model.
+      ckpt_path=None,
+      audio_file='./en.wav',
+      device=paddle.get_device())
+  print('ASR Result: \n{}'.format(text))
+
+  # to get acoustic representation
+  feature = ssl_executor(
+      model='wav2vec2',
+      task='vector',
+      lang='en',
+      sample_rate=16000,
+      config=None,  # Set `config` and `ckpt_path` to None to use pretrained model.
+      ckpt_path=None,
+      audio_file='./en.wav',
+      device=paddle.get_device())
+  print('Representation: \n{}'.format(feature))
+  ```
+
+  Output:
+  ```bash
+  ASR Result:
+  i knocked at the door on the ancient side of the building
+
+  Representation:
+  Tensor(shape=[1, 164, 1024], dtype=float32, place=Place(gpu:0), stop_gradient=True,
+       [[[ 0.02351918, -0.12980647,  0.17868176, ...,  0.10118122,
+          -0.04614586,  0.17853957],
+         [ 0.02361383, -0.12978461,  0.17870593, ...,  0.10103855,
+          -0.04638699,  0.17855372],
+         [ 0.02345137, -0.12982975,  0.17883906, ...,  0.10104341,
+          -0.04643029,  0.17856732],
+         ...,
+         [ 0.02313030, -0.12918393,  0.17845058, ...,  0.10073373,
+          -0.04701405,  0.17862988],
+         [ 0.02176583, -0.12929161,  0.17797582, ...,  0.10097728,
+          -0.04687393,  0.17864393],
+         [ 0.05269200,  0.01297141, -0.23336855, ..., -0.11257174,
+          -0.17227529,  0.20338398]]])
+  ```
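
The `Representation` tensor in the output has shape `[1, 164, 1024]`: one utterance, 164 frames, and 1024 features per frame. As a rough sketch of where a frame count like this comes from, the snippet below applies the convolutional feature-encoder layout described in the original wav2vec 2.0 paper (kernel sizes 10, 3, 3, 3, 3, 2, 2 with strides 5, 2, 2, 2, 2, 2, 2). That the PaddleSpeech port uses the same layout is an assumption, and `num_frames` is a hypothetical helper, not a PaddleSpeech API.

```python
# (kernel, stride) of each 1-D convolution in the wav2vec 2.0 feature encoder,
# taken from the original paper; assumed, not confirmed, to match PaddleSpeech.
CONV_LAYERS = [(10, 5), (3, 2), (3, 2), (3, 2), (3, 2), (2, 2), (2, 2)]


def num_frames(num_samples: int) -> int:
    """Number of output frames for a waveform of `num_samples` samples."""
    frames = num_samples
    for kernel, stride in CONV_LAYERS:
        if frames < kernel:
            raise ValueError("waveform is too short for the feature encoder")
        # Standard output-length formula for an unpadded 1-D convolution.
        frames = (frames - kernel) // stride + 1
    return frames


# One second of 16 kHz audio gives 49 frames, i.e. roughly one frame per 20 ms.
print(num_frames(16000))
```

Under this layout, 52560 samples (about 3.3 s at 16 kHz) is the shortest input that yields 164 frames, which is in line with a short sentence like the one recognized in this demo.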
--- a/demos/speech_ssl/README_cn.md
+++ b/demos/speech_ssl/README_cn.md
@@ -0,0 +1,103 @@
+(简体中文|[English](./README.md))
+
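+# Speech Self-Supervised Learning
+## Introduction
+Speech self-supervised learning refers to training on large-scale unlabeled speech datasets. A model trained this way can produce good acoustic representations and can be applied to other downstream speech tasks by fine-tuning on labeled datasets.
+
+This demo uses a speech self-supervised model to recognize text from a given audio file or to produce its acoustic representation. It can be done with a single `PaddleSpeech` command or a few lines of Python.
+
+## Usage
+### 1. Installation
+See the [installation document](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install_cn.md).
+
+You can choose one of the three installation methods: easy, medium, or hard.
+
+### 2. Prepare Input File
+The input of this demo should be a WAV file (`.wav`), and its sample rate must match the sample rate of the model.
+
+A sample audio file for this demo can be downloaded: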
+```bash
+wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/en.wav
+```
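+### 3. Usage
+- Command Line (Recommended)
+  ```bash
+
+  # to recognize text
+  paddlespeech ssl --task asr --lang en --input ./en.wav
+
+  # to get acoustic representation
+  paddlespeech ssl --task vector --lang en --input ./en.wav
+  ```
+  
+  Usage:
+  ```bash
+  paddlespeech ssl --help
+  ```
+  Arguments:
+  - `input` (required): Audio file to recognize.
+  - `model`: Model of the ASR task. Default: `wav2vec2ASR_librispeech`.
+  - `task`: Output type. Default: `asr`.
+  - `lang`: Model language. Default: `en`.
+  - `sample_rate`: Audio sample rate. Default: `16000`.
+  - `config`: Config file of the ASR task. If not set, the default config of the pretrained model is used. Default: `None`.
+  - `ckpt_path`: Model checkpoint file. If not set, the pretrained model is downloaded and used. Default: `None`.
+  - `yes`: Takes no additional value. Once set, all requests from the program are accepted by default, including automatic conversion of the input audio sample rate. Default: `False`.
+  - `device`: Device used for inference. Default: the default device of paddlepaddle in the current environment.
+  - `verbose`: If set, show the logger information.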
+
+
+- Python API
+  ```python
+  import paddle
+  from paddlespeech.cli.ssl import SSLExecutor
+
+  ssl_executor = SSLExecutor()
+
+  # to recognize text
+  text = ssl_executor(
+      model='wav2vec2ASR_librispeech',
+      task='asr',
+      lang='en',
+      sample_rate=16000,
+      config=None,  # Set `config` and `ckpt_path` to None to use pretrained model.
+      ckpt_path=None,
+      audio_file='./en.wav',
+      device=paddle.get_device())
+  print('ASR Result: \n{}'.format(text))
+
+  # to get acoustic representation
+  feature = ssl_executor(
+      model='wav2vec2',
+      task='vector',
+      lang='en',
+      sample_rate=16000,
+      config=None,  # Set `config` and `ckpt_path` to None to use pretrained model.
+      ckpt_path=None,
+      audio_file='./en.wav',
+      device=paddle.get_device())
+  print('Representation: \n{}'.format(feature))
+  ```
+
+
+  Output:
+  ```bash
+  ASR Result:
+  i knocked at the door on the ancient side of the building
+  
+  Representation:
+  Tensor(shape=[1, 164, 1024], dtype=float32, place=Place(gpu:0), stop_gradient=True,
+       [[[ 0.02351918, -0.12980647,  0.17868176, ...,  0.10118122,
+          -0.04614586,  0.17853957],
+         [ 0.02361383, -0.12978461,  0.17870593, ...,  0.10103855,
+          -0.04638699,  0.17855372],
+         [ 0.02345137, -0.12982975,  0.17883906, ...,  0.10104341,
+          -0.04643029,  0.17856732],
+         ...,
+         [ 0.02313030, -0.12918393,  0.17845058, ...,  0.10073373,
+          -0.04701405,  0.17862988],
+         [ 0.02176583, -0.12929161,  0.17797582, ...,  0.10097728,
+          -0.04687393,  0.17864393],
+         [ 0.05269200,  0.01297141, -0.23336855, ..., -0.11257174,
+          -0.17227529,  0.20338398]]])
+  ```
--- a/demos/speech_ssl/run.sh
+++ b/demos/speech_ssl/run.sh
@@ -0,0 +1,10 @@
+#!/bin/bash
+
+# audio download
+wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/en.wav
+
+# to recognize text 
+paddlespeech ssl --task asr --lang en --input ./en.wav
+
+# to get acoustic representation
+paddlespeech ssl --task vector --lang en --input ./en.wav
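
The demo READMEs state that the input sample rate must match the model's, which is `16000` by default. A minimal sketch of checking a file before running the script, using only the Python standard library; `wav_sample_rate`, `matches_model_rate` and `write_silence` are hypothetical helpers for illustration, not PaddleSpeech APIs, and the `wave` module only reads uncompressed PCM WAV files.

```python
import os
import tempfile
import wave


def wav_sample_rate(path: str) -> int:
    """Return the sample rate (Hz) stored in a PCM WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def matches_model_rate(path: str, expected_rate: int = 16000) -> bool:
    """True if the file's sample rate equals the rate the model expects."""
    return wav_sample_rate(path) == expected_rate


def write_silence(path: str, rate: int, seconds: float = 0.1) -> None:
    """Write a short mono 16-bit silent WAV file, used only for this demo."""
    with wave.open(path, "wb") as wav_file:
        wav_file.setnchannels(1)
        wav_file.setsampwidth(2)
        wav_file.setframerate(rate)
        wav_file.writeframes(b"\x00\x00" * int(rate * seconds))


with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo.wav")
    write_silence(demo_path, rate=8000)
    # An 8000 Hz file does not match the default 16000 Hz model rate.
    print(matches_model_rate(demo_path))
```

For the downloaded sample, the same check would be `matches_model_rate("./en.wav")`.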
--- a/demos/whisper/README.md
+++ b/demos/whisper/README.md
@@ -0,0 +1,95 @@
+([简体中文](./README_cn.md)|English)
+
+## Introduction
+Whisper is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is also a multi-task model that can perform multilingual speech recognition as well as speech translation and language identification.
+
+The Whisper model was trained by OpenAI; see https://github.com/openai/whisper
+
+## Usage
+ ### 1. Installation
+ See [installation](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install.md).
+
+ You can choose one of the easy, medium, or hard ways to install paddlespeech.
+
+ ### 2. Prepare Input File
+ The input of this demo should be a WAV file (`.wav`), and its sample rate must match the sample rate of the model.
+
+ A sample file for this demo can be downloaded:
+ ```bash
+ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav
+ ```
+
+ ### 3. Usage
+ - Command Line (Recommended)
+   ```bash
+   # to recognize text 
+   paddlespeech whisper --task transcribe --input ./zh.wav
+
+   # to use the English-only base-size model
+   paddlespeech whisper --lang en --size base --task transcribe  --input ./en.wav
+
+   # to recognize text and translate to English
+   paddlespeech whisper --task translate --input ./zh.wav
+   
+   ```
+
+   Usage:
+   ```bash
+   paddlespeech whisper --help
+   ```
+   Arguments:
+   - `input` (required): Audio file to recognize.
+   - `model`: Model type of the ASR task. Default: `whisper-large`.
+   - `task`: Output type. Default: `transcribe`.
+   - `lang`: Model language. Default: ``. Use `en` to choose the English-only model. English-only models are currently available in the [medium,base,small,tiny] sizes.
+   - `size`: Model size for decoding. Default: `large`. Currently supports [large,medium,base,small,tiny].
+   - `language`: Decoding language. Default: `None`. Forces the recognized language, which the model determines on its own by default.
+   - `sample_rate`: Sample rate of the model. Default: `16000`. Other sample rates are not supported for now.
+   - `config`: Config of the ASR task. The pretrained model is used when it is None. Default: `None`.
+   - `ckpt_path`: Model checkpoint. The pretrained model is used when it is None. Default: `None`.
+   - `yes`: Takes no additional value. Once set, all requests from the program are accepted by default, including transforming the audio sample rate. Default: `False`.
+   - `device`: Device used for model inference. Default: the default device of paddlepaddle in the current environment.
+   - `verbose`: Show the log information.
+
+
+ - Python API
+   ```python
+   import paddle
+   from paddlespeech.cli.whisper import WhisperExecutor
+
+   whisper_executor = WhisperExecutor()
+
+   # to recognize text 
+   text = whisper_executor(
+       model='whisper',
+       task='transcribe',
+       sample_rate=16000,
+       config=None,  # Set `config` and `ckpt_path` to None to use pretrained model.
+       ckpt_path=None,
+       audio_file='./zh.wav',
+       device=paddle.get_device())
+   print('ASR Result: \n{}'.format(text))
+
+   # to recognize text and translate to English
+   result = whisper_executor(
+       model='whisper',
+       task='translate',
+       sample_rate=16000,
+       config=None,  # Set `config` and `ckpt_path` to None to use pretrained model.
+       ckpt_path=None,
+       audio_file='./zh.wav',
+       device=paddle.get_device())
+   print('Translate Result: \n{}'.format(result))
+   ```
+
+   Output:
+   ```bash
+   Transcribe Result:
+   Detected language: Chinese
+   [00:00.000 --> 00:05.000] 我认为跑步最重要的就是给我带来了身体健康
+   {'text': '我认为跑步最重要的就是给我带来了身体健康', 'segments': [{'id': 0, 'seek': 0, 'start': 0.0, 'end': 5.0, 'text': '我认为跑步最重要的就是给我带来了身体健康', 'tokens': [50364, 1654, 7422, 97, 13992, 32585, 31429, 8661, 24928, 1546, 5620, 49076, 4845, 99, 34912, 19847, 29485, 44201, 6346, 115, 50614], 'temperature': 0.0, 'avg_logprob': -0.23577967557040128, 'compression_ratio': 0.28169014084507044, 'no_speech_prob': 0.028302080929279327}], 'language': 'zh'}
+
+   Translate Result:
+   Detected language: Chinese
+   [00:00.000 --> 00:05.000]  I think the most important thing about running is that it brings me good health.
+   {'text': ' I think the most important thing about running is that it brings me good health.', 'segments': [{'id': 0, 'seek': 0, 'start': 0.0, 'end': 5.0, 'text': ' I think the most important thing about running is that it brings me good health.', 'tokens': [50364, 286, 519, 264, 881, 1021, 551, 466, 2614, 307, 300, 309, 5607, 385, 665, 1585, 13, 50614], 'temperature': 0.0, 'avg_logprob': -0.47945233395225123, 'compression_ratio': 1.095890410958904, 'no_speech_prob': 0.028302080929279327}], 'language': 'zh'}
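
The printed result is a dictionary with `text`, `segments` and `language` keys, as the output of this demo shows. A minimal sketch of turning such a dictionary into the bracketed timestamp lines seen in the log; `format_timestamp` and `format_segments` are hypothetical helpers, not part of PaddleSpeech, and the sample dictionary is a trimmed copy of the transcription output.

```python
def format_timestamp(seconds: float) -> str:
    """Render seconds as MM:SS.mmm, for example 5.0 -> '00:05.000'."""
    total_ms = round(seconds * 1000)
    minutes, remainder_ms = divmod(total_ms, 60_000)
    secs, millis = divmod(remainder_ms, 1000)
    return f"{minutes:02d}:{secs:02d}.{millis:03d}"


def format_segments(result: dict) -> list:
    """Build one '[start --> end] text' line per segment of the result."""
    return [
        f"[{format_timestamp(seg['start'])} --> {format_timestamp(seg['end'])}] {seg['text']}"
        for seg in result["segments"]
    ]


# Trimmed copy of the transcription result shown in the demo output.
result = {
    "text": "我认为跑步最重要的就是给我带来了身体健康",
    "segments": [
        {"id": 0, "start": 0.0, "end": 5.0,
         "text": "我认为跑步最重要的就是给我带来了身体健康"},
    ],
    "language": "zh",
}

for line in format_segments(result):
    print(line)
```

Only `start`, `end` and `text` are read from each segment, so the other fields in the real output (`tokens`, `avg_logprob`, and so on) can be left in place.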
--- a/demos/whisper/README_cn.md
+++ b/demos/whisper/README_cn.md
@@ -0,0 +1,96 @@
+(简体中文|[English](./README.md))
+
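+# Whisper Model
+## Introduction
+Whisper is a general-purpose speech recognition model. It is trained on a large dataset of diverse audio and is also a multi-task model that can perform multilingual speech recognition as well as speech translation and language identification.
+
+The Whisper model was trained by OpenAI; see https://github.com/openai/whisper
+
+## Usage
+### 1. Installation
+ See the [installation document](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install_cn.md).
+
+ You can choose one of the three installation methods: easy, medium, or hard.
+
+### 2. Prepare Input File
+ The input of this demo should be a WAV file (`.wav`), and its sample rate must match the sample rate of the model.
+
+ A sample audio file for this demo can be downloaded: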
+ ```bash
+ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav
+ ```
+
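+### 3. Usage
+ - Command Line (Recommended)
+   ```bash
+
+   # to recognize text
+   paddlespeech whisper --task transcribe --input ./zh.wav
+
+   # to select the English-only model and switch to a different model size
+   paddlespeech whisper --lang en --size base --task transcribe  --input ./en.wav
+
+   # to translate speech into English
+   paddlespeech whisper --task translate --input ./zh.wav
+   ```
+   Usage:
+   ```bash
+   paddlespeech whisper --help
+   ```
+   Arguments:
+   - `input` (required): Audio file to recognize.
+   - `model`: Model of the ASR task. Default: `whisper-large`.
+   - `task`: Output type. Default: `transcribe`.
+   - `lang`: Model language. Default: ``. Use `en` to select the English-only model; English-only models are currently available in the [medium,base,small,tiny] sizes.
+   - `size`: Model size. Default: `large`. Currently supports [large,medium,base,small,tiny].
+   - `language`: Decoding language. Default: `None`. Forces the recognized language; by default the model determines it on its own.
+   - `sample_rate`: Audio sample rate. Default: `16000`. Whisper does not support other sample rates for now.
+   - `config`: Config file of the ASR task. If not set, the default config of the pretrained model is used. Default: `None`.
+   - `ckpt_path`: Model checkpoint file. If not set, the decoding model is downloaded and used. Default: `None`.
+   - `yes`: Takes no additional value. Once set, all requests from the program are accepted by default, including automatic conversion of the input audio sample rate. Default: `False`.
+   - `device`: Device used for inference. Default: the default device of paddlepaddle in the current environment.
+   - `verbose`: If set, show the logger information.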
+
+
+- Python API
+   ```python
+   import paddle
+   from paddlespeech.cli.whisper import WhisperExecutor
+
+   whisper_executor = WhisperExecutor()
+
+   # to recognize text
+   text = whisper_executor(
+       model='whisper',
+       task='transcribe',
+       sample_rate=16000,
+       config=None,  # Set `config` and `ckpt_path` to None to use pretrained model.
+       ckpt_path=None,
+       audio_file='./zh.wav',
+       device=paddle.get_device())
+   print('ASR Result: \n{}'.format(text))
+
+   # to recognize text and translate to English
+   result = whisper_executor(
+       model='whisper',
+       task='translate',
+       sample_rate=16000,
+       config=None,  # Set `config` and `ckpt_path` to None to use pretrained model.
+       ckpt_path=None,
+       audio_file='./zh.wav',
+       device=paddle.get_device())
+   print('Translate Result: \n{}'.format(result))
+   ```
+
+
+   Output:
+   ```bash
+   Transcribe Result:
+   Detected language: Chinese
+   [00:00.000 --> 00:05.000] 我认为跑步最重要的就是给我带来了身体健康
+   {'text': '我认为跑步最重要的就是给我带来了身体健康', 'segments': [{'id': 0, 'seek': 0, 'start': 0.0, 'end': 5.0, 'text': '我认为跑步最重要的就是给我带来了身体健康', 'tokens': [50364, 1654, 7422, 97, 13992, 32585, 31429, 8661, 24928, 1546, 5620, 49076, 4845, 99, 34912, 19847, 29485, 44201, 6346, 115, 50614], 'temperature': 0.0, 'avg_logprob': -0.23577967557040128, 'compression_ratio': 0.28169014084507044, 'no_speech_prob': 0.028302080929279327}], 'language': 'zh'}
+
+   Translate Result:
+   Detected language: Chinese
+   [00:00.000 --> 00:05.000]  I think the most important thing about running is that it brings me good health.
+   {'text': ' I think the most important thing about running is that it brings me good health.', 'segments': [{'id': 0, 'seek': 0, 'start': 0.0, 'end': 5.0, 'text': ' I think the most important thing about running is that it brings me good health.', 'tokens': [50364, 286, 519, 264, 881, 1021, 551, 466, 2614, 307, 300, 309, 5607, 385, 665, 1585, 13, 50614], 'temperature': 0.0, 'avg_logprob': -0.47945233395225123, 'compression_ratio': 1.095890410958904, 'no_speech_prob': 0.028302080929279327}], 'language': 'zh'}
--- a/demos/whisper/run.sh
+++ b/demos/whisper/run.sh
@@ -0,0 +1,13 @@
+#!/bin/bash
+
+# audio download
+wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespeech.bj.bcebos.com/PaddleAudio/en.wav
+
+# to recognize text 
+paddlespeech whisper --task transcribe --input ./zh.wav
+
+# to recognize text and translate to English
+paddlespeech whisper --task translate --input ./zh.wav
+
+# to use the English-only base-size model
+paddlespeech whisper --lang en --size base --task transcribe  --input ./en.wav
--- a/docs/source/install.md
+++ b/docs/source/install.md
@@ -12,8 +12,8 @@ There are 3 ways to use `PaddleSpeech`. According to the degree of difficulty, t
 - Python >= 3.7
 - PaddlePaddle latest version (please refer to the [Installation Guide](https://www.paddlepaddle.org.cn/documentation/docs/en/beginners_guide/index_en.html))
 - C++ compilation environment
- Hip: For Linux and Mac, do not use command `sh` instead of command `bash` in installation document.
- Hip: We recommand you to install `paddlepaddle` from https://mirror.baidu.com/pypi/simple and install `paddlespeech` from https://pypi.tuna.tsinghua.edu.cn/simple. 
+- Tip: On Linux and Mac, use `bash` rather than `sh` to run the commands in the installation document.
+- Tip: We recommend installing `paddlepaddle` from https://mirror.baidu.com/pypi/simple and `paddlespeech` from https://pypi.tuna.tsinghua.edu.cn/simple. 

 ## Easy: Get the Basic Function (Support Linux, Mac, and Windows)
 - If you are new to `PaddleSpeech` and want to try it without using your own machine, we recommend [AI Studio](https://aistudio.baidu.com/aistudio/index). There is a step-by-step [tutorial](https://aistudio.baidu.com/aistudio/education/group/info/25130) for `PaddleSpeech`, and you can use the basic functions of `PaddleSpeech` on a free machine.
--- a/docs/source/released_model.md
+++ b/docs/source/released_model.md
@@ -22,7 +22,7 @@ Acoustic Model | Training Data | Token-based | Size | Descriptions | CER | WER |
 Model | Pre-Train Method | Pre-Train Data | Finetune Data | Size | Descriptions | CER | WER |  Example Link |
 :-------------:| :------------:| :-----: | -----: | :-----: |:-----:| :-----:  | :-----:  | :-----: | 
 [Wav2vec2-large-960h-lv60-self Model](https://paddlespeech.bj.bcebos.com/wav2vec/wav2vec2-large-960h-lv60-self.pdparams) | wav2vec2 | Librispeech and LV-60k Dataset (53,000 h) | - | 1.18 GB |Pre-trained Wav2vec2.0 Model | - | - | - | 
-[Wav2vec2ASR-large-960h-librispeech Model](https://paddlespeech.bj.bcebos.com/s2t/librispeech/asr3/wav2vec2ASR-large-960h-librispeech_ckpt_1.3.0.model.tar.gz) | wav2vec2 | Librispeech and LV-60k Dataset (5.3w h) | Librispeech (960 h) | 1.18 GB |Encoder: Wav2vec2.0, Decoder: CTC, Decoding method: Greedy search | - | 0.0189 | [Wav2vecASR Librispeech ASR3](../../examples/librispeech/asr3) |
+[Wav2vec2ASR-large-960h-librispeech Model](https://paddlespeech.bj.bcebos.com/s2t/librispeech/asr3/wav2vec2ASR-large-960h-librispeech_ckpt_1.3.1.model.tar.gz) | wav2vec2 | Librispeech and LV-60k Dataset (53,000 h) | Librispeech (960 h) | 718 MB |Encoder: Wav2vec2.0, Decoder: CTC, Decoding method: Greedy search | - | 0.0189 | [Wav2vecASR Librispeech ASR3](../../examples/librispeech/asr3) |

 ### Language Model based on NGram
 Language Model | Training Data | Token-based | Size | Descriptions
@ -40,36 +40,37 @@ Language Model | Training Data | Token-based | Size | Descriptions
 ## Text-to-Speech Models

 ### Acoustic Models
-Model Type | Dataset| Example Link | Pretrained Models|Static/ONNX Models|Size (static)
+Model Type | Dataset| Example Link | Pretrained Models|Static / ONNX / Paddle-Lite Models|Size (static)
 :-------------:| :------------:| :-----: | :-----:| :-----:| :-----:
 Tacotron2|LJSpeech|[tacotron2-ljspeech](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/ljspeech/tts0)|[tacotron2_ljspeech_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/tacotron2/tacotron2_ljspeech_ckpt_0.2.0.zip)|||
 Tacotron2|CSMSC|[tacotron2-csmsc](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/tts0)|[tacotron2_csmsc_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/tacotron2/tacotron2_csmsc_ckpt_0.2.0.zip)|[tacotron2_csmsc_static_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/tacotron2/tacotron2_csmsc_static_0.2.0.zip)|103MB|
 TransformerTTS| LJSpeech| [transformer-ljspeech](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/ljspeech/tts1)|[transformer_tts_ljspeech_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/transformer_tts/transformer_tts_ljspeech_ckpt_0.4.zip)|||
-SpeedySpeech| CSMSC | [speedyspeech-csmsc](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/tts2)|[speedyspeech_csmsc_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/speedyspeech/speedyspeech_csmsc_ckpt_0.2.0.zip)|[speedyspeech_csmsc_static_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/speedyspeech/speedyspeech_csmsc_static_0.2.0.zip) </br> [speedyspeech_csmsc_onnx_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/speedyspeech/speedyspeech_csmsc_onnx_0.2.0.zip)|13MB|
-FastSpeech2| CSMSC |[fastspeech2-csmsc](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/tts3)|[fastspeech2_nosil_baker_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_baker_ckpt_0.4.zip)|[fastspeech2_csmsc_static_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_csmsc_static_0.2.0.zip) </br> [fastspeech2_csmsc_onnx_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_csmsc_onnx_0.2.0.zip)|157MB|
+SpeedySpeech| CSMSC | [speedyspeech-csmsc](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/tts2)|[speedyspeech_csmsc_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/speedyspeech/speedyspeech_csmsc_ckpt_0.2.0.zip)|[speedyspeech_csmsc_static_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/speedyspeech/speedyspeech_csmsc_static_0.2.0.zip) </br> [speedyspeech_csmsc_onnx_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/speedyspeech/speedyspeech_csmsc_onnx_0.2.0.zip) </br> [speedyspeech_csmsc_pdlite_1.3.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/speedyspeech/speedyspeech_csmsc_pdlite_1.3.0.zip)|13MB|
+FastSpeech2| CSMSC |[fastspeech2-csmsc](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/tts3)|[fastspeech2_nosil_baker_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_baker_ckpt_0.4.zip)|[fastspeech2_csmsc_static_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_csmsc_static_0.2.0.zip) </br> [fastspeech2_csmsc_onnx_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_csmsc_onnx_0.2.0.zip) </br> [fastspeech2_csmsc_pdlite_1.3.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_csmsc_pdlite_1.3.0.zip)|157MB|
 FastSpeech2-Conformer| CSMSC |[fastspeech2-csmsc](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/tts3)|[fastspeech2_conformer_baker_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_conformer_baker_ckpt_0.5.zip)|||
-FastSpeech2-CNNDecoder| CSMSC| [fastspeech2-csmsc](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/tts3)| [fastspeech2_cnndecoder_csmsc_ckpt_1.0.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_cnndecoder_csmsc_ckpt_1.0.0.zip) |  [fastspeech2_cnndecoder_csmsc_static_1.0.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_cnndecoder_csmsc_static_1.0.0.zip) </br>[fastspeech2_cnndecoder_csmsc_streaming_static_1.0.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_cnndecoder_csmsc_streaming_static_1.0.0.zip)  </br>[fastspeech2_cnndecoder_csmsc_onnx_1.0.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_cnndecoder_csmsc_onnx_1.0.0.zip)  </br>[fastspeech2_cnndecoder_csmsc_streaming_onnx_1.0.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_cnndecoder_csmsc_streaming_onnx_1.0.0.zip) | 84MB|
-FastSpeech2| AISHELL-3 |[fastspeech2-aishell3](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/tts3)|[fastspeech2_aishell3_ckpt_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_aishell3_ckpt_1.1.0.zip)|[fastspeech2_aishell3_static_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_aishell3_static_1.1.0.zip) </br> [fastspeech2_aishell3_onnx_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_aishell3_onnx_1.1.0.zip)|147MB|
-FastSpeech2| LJSpeech |[fastspeech2-ljspeech](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/ljspeech/tts3)|[fastspeech2_nosil_ljspeech_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_ljspeech_ckpt_0.5.zip)|[fastspeech2_ljspeech_static_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_ljspeech_static_1.1.0.zip) </br> [fastspeech2_ljspeech_onnx_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_ljspeech_onnx_1.1.0.zip)|145MB|
-FastSpeech2| VCTK |[fastspeech2-vctk](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/vctk/tts3)|[fastspeech2_vctk_ckpt_1.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_vctk_ckpt_1.2.0.zip)|[fastspeech2_vctk_static_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_vctk_static_1.1.0.zip) </br> [fastspeech2_vctk_onnx_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_vctk_onnx_1.1.0.zip) | 145MB|
+FastSpeech2-CNNDecoder| CSMSC| [fastspeech2-csmsc](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/tts3)| [fastspeech2_cnndecoder_csmsc_ckpt_1.0.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_cnndecoder_csmsc_ckpt_1.0.0.zip) |  [fastspeech2_cnndecoder_csmsc_static_1.0.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_cnndecoder_csmsc_static_1.0.0.zip) </br>[fastspeech2_cnndecoder_csmsc_streaming_static_1.0.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_cnndecoder_csmsc_streaming_static_1.0.0.zip)  </br>[fastspeech2_cnndecoder_csmsc_onnx_1.0.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_cnndecoder_csmsc_onnx_1.0.0.zip)  </br>[fastspeech2_cnndecoder_csmsc_streaming_onnx_1.0.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_cnndecoder_csmsc_streaming_onnx_1.0.0.zip)  </br> [fastspeech2_cnndecoder_csmsc_pdlite_1.3.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_cnndecoder_csmsc_pdlite_1.3.0.zip) </br> [fastspeech2_cnndecoder_csmsc_streaming_pdlite_1.3.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_cnndecoder_csmsc_streaming_pdlite_1.3.0.zip)| 84MB|
+FastSpeech2| AISHELL-3 |[fastspeech2-aishell3](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/tts3)|[fastspeech2_aishell3_ckpt_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_aishell3_ckpt_1.1.0.zip)|[fastspeech2_aishell3_static_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_aishell3_static_1.1.0.zip) </br> [fastspeech2_aishell3_onnx_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_aishell3_onnx_1.1.0.zip) </br> [fastspeech2_aishell3_pdlite_1.3.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_aishell3_pdlite_1.3.0.zip) |147MB|
+FastSpeech2| LJSpeech |[fastspeech2-ljspeech](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/ljspeech/tts3)|[fastspeech2_nosil_ljspeech_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_ljspeech_ckpt_0.5.zip)|[fastspeech2_ljspeech_static_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_ljspeech_static_1.1.0.zip) </br> [fastspeech2_ljspeech_onnx_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_ljspeech_onnx_1.1.0.zip) </br> [fastspeech2_ljspeech_pdlite_1.3.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_ljspeech_pdlite_1.3.0.zip)|145MB|
+FastSpeech2| VCTK |[fastspeech2-vctk](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/vctk/tts3)|[fastspeech2_vctk_ckpt_1.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_vctk_ckpt_1.2.0.zip)|[fastspeech2_vctk_static_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_vctk_static_1.1.0.zip) </br> [fastspeech2_vctk_onnx_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_vctk_onnx_1.1.0.zip) </br> [fastspeech2_vctk_pdlite_1.3.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_vctk_pdlite_1.3.0.zip)| 145MB|
 FastSpeech2| ZH_EN |[fastspeech2-zh_en](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/zh_en_tts/tts3)|[fastspeech2_mix_ckpt_1.2.0.zip](https://paddlespeech.bj.bcebos.com/t2s/chinse_english_mixed/models/fastspeech2_mix_ckpt_1.2.0.zip)|[fastspeech2_mix_static_0.2.0.zip](https://paddlespeech.bj.bcebos.com/t2s/chinse_english_mixed/models/fastspeech2_mix_static_0.2.0.zip) </br> [fastspeech2_mix_onnx_0.2.0.zip](https://paddlespeech.bj.bcebos.com/t2s/chinse_english_mixed/models/fastspeech2_mix_onnx_0.2.0.zip) | 145MB|
-
+FastSpeech2| Male ||[fastspeech2_male_ckpt_1.3.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_male_ckpt_1.3.0.zip)| | |

 ### Vocoders
-Model Type | Dataset| Example Link | Pretrained Models| Static/ONNX Models|Size (static)
+Model Type | Dataset| Example Link | Pretrained Models| Static / ONNX / Paddle-Lite Models|Size (static)
 :-----:| :-----:| :-----: | :-----:| :-----:| :-----:
 WaveFlow| LJSpeech |[waveflow-ljspeech](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/ljspeech/voc0)|[waveflow_ljspeech_ckpt_0.3.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/waveflow/waveflow_ljspeech_ckpt_0.3.zip)|||
-Parallel WaveGAN| CSMSC |[PWGAN-csmsc](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/voc1)|[pwg_baker_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_baker_ckpt_0.4.zip)|[pwg_baker_static_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_baker_static_0.4.zip) </br> [pwgan_csmsc_onnx_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwgan_csmsc_onnx_0.2.0.zip)|4.8MB|
-Parallel WaveGAN| LJSpeech |[PWGAN-ljspeech](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/ljspeech/voc1)|[pwg_ljspeech_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_ljspeech_ckpt_0.5.zip)|[pwgan_ljspeech_static_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwgan_ljspeech_static_1.1.0.zip) </br> [pwgan_ljspeech_onnx_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwgan_ljspeech_onnx_1.1.0.zip)|4.8MB|
-Parallel WaveGAN| AISHELL-3 |[PWGAN-aishell3](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/voc1)|[pwg_aishell3_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_aishell3_ckpt_0.5.zip)| [pwgan_aishell3_static_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwgan_aishell3_static_1.1.0.zip) </br> [pwgan_aishell3_onnx_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwgan_aishell3_onnx_1.1.0.zip)|4.8MB|
-Parallel WaveGAN| VCTK |[PWGAN-vctk](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/vctk/voc1)|[pwg_vctk_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_vctk_ckpt_0.5.zip)|[pwgan_vctk_static_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwgan_vctk_static_1.1.0.zip) </br> [pwgan_vctk_onnx_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwgan_vctk_onnx_1.1.0.zip)|4.8MB|
-|Multi Band MelGAN | CSMSC |[MB MelGAN-csmsc](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/voc3) | [mb_melgan_csmsc_ckpt_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/mb_melgan/mb_melgan_csmsc_ckpt_0.1.1.zip) <br>[mb_melgan_baker_finetune_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/mb_melgan/mb_melgan_baker_finetune_ckpt_0.5.zip)|[mb_melgan_csmsc_static_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/mb_melgan/mb_melgan_csmsc_static_0.1.1.zip) </br> [mb_melgan_csmsc_onnx_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/mb_melgan/mb_melgan_csmsc_onnx_0.2.0.zip)|7.6MB|
+Parallel WaveGAN| CSMSC |[PWGAN-csmsc](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/voc1)|[pwg_baker_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_baker_ckpt_0.4.zip)|[pwg_baker_static_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_baker_static_0.4.zip) </br> [pwgan_csmsc_onnx_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwgan_csmsc_onnx_0.2.0.zip) </br> [pwgan_csmsc_pdlite_1.3.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwgan_csmsc_pdlite_1.3.0.zip)|4.8MB|
+Parallel WaveGAN| LJSpeech |[PWGAN-ljspeech](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/ljspeech/voc1)|[pwg_ljspeech_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_ljspeech_ckpt_0.5.zip)|[pwgan_ljspeech_static_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwgan_ljspeech_static_1.1.0.zip) </br> [pwgan_ljspeech_onnx_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwgan_ljspeech_onnx_1.1.0.zip) </br> [pwgan_ljspeech_pdlite_1.3.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwgan_ljspeech_pdlite_1.3.0.zip)|4.8MB|
+Parallel WaveGAN| AISHELL-3 |[PWGAN-aishell3](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/voc1)|[pwg_aishell3_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_aishell3_ckpt_0.5.zip)| [pwgan_aishell3_static_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwgan_aishell3_static_1.1.0.zip) </br> [pwgan_aishell3_onnx_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwgan_aishell3_onnx_1.1.0.zip) </br> [pwgan_aishell3_pdlite_1.3.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwgan_aishell3_pdlite_1.3.0.zip)|4.8MB|
+Parallel WaveGAN| VCTK |[PWGAN-vctk](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/vctk/voc1)|[pwg_vctk_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_vctk_ckpt_0.5.zip)|[pwgan_vctk_static_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwgan_vctk_static_1.1.0.zip) </br> [pwgan_vctk_onnx_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwgan_vctk_onnx_1.1.0.zip) </br> [pwgan_vctk_pdlite_1.3.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwgan_vctk_pdlite_1.3.0.zip)|4.8MB|
+|Multi Band MelGAN | CSMSC |[MB MelGAN-csmsc](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/voc3) | [mb_melgan_csmsc_ckpt_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/mb_melgan/mb_melgan_csmsc_ckpt_0.1.1.zip) <br>[mb_melgan_baker_finetune_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/mb_melgan/mb_melgan_baker_finetune_ckpt_0.5.zip)|[mb_melgan_csmsc_static_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/mb_melgan/mb_melgan_csmsc_static_0.1.1.zip) </br> [mb_melgan_csmsc_onnx_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/mb_melgan/mb_melgan_csmsc_onnx_0.2.0.zip) </br> [mb_melgan_csmsc_pdlite_1.3.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/mb_melgan/mb_melgan_csmsc_pdlite_1.3.0.zip)|7.6MB|
 Style MelGAN | CSMSC |[Style MelGAN-csmsc](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/voc4)|[style_melgan_csmsc_ckpt_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/style_melgan/style_melgan_csmsc_ckpt_0.1.1.zip)| | |
-HiFiGAN | CSMSC |[HiFiGAN-csmsc](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/voc5)|[hifigan_csmsc_ckpt_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_csmsc_ckpt_0.1.1.zip)|[hifigan_csmsc_static_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_csmsc_static_0.1.1.zip) </br> [hifigan_csmsc_onnx_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_csmsc_onnx_0.2.0.zip)|46MB|
-HiFiGAN | LJSpeech |[HiFiGAN-ljspeech](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/ljspeech/voc5)|[hifigan_ljspeech_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_ljspeech_ckpt_0.2.0.zip)|[hifigan_ljspeech_static_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_ljspeech_static_1.1.0.zip) </br> [hifigan_ljspeech_onnx_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_ljspeech_onnx_1.1.0.zip) |49MB|
-HiFiGAN | AISHELL-3 |[HiFiGAN-aishell3](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/voc5)|[hifigan_aishell3_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_aishell3_ckpt_0.2.0.zip)|[hifigan_aishell3_static_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_aishell3_static_1.1.0.zip) </br> [hifigan_aishell3_onnx_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_aishell3_onnx_1.1.0.zip)|46MB|
-HiFiGAN | VCTK |[HiFiGAN-vctk](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/vctk/voc5)|[hifigan_vctk_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_vctk_ckpt_0.2.0.zip)|[hifigan_vctk_static_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_vctk_static_1.1.0.zip) </br> [hifigan_vctk_onnx_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_vctk_onnx_1.1.0.zip)|46MB|
+HiFiGAN | CSMSC |[HiFiGAN-csmsc](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/voc5)|[hifigan_csmsc_ckpt_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_csmsc_ckpt_0.1.1.zip)|[hifigan_csmsc_static_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_csmsc_static_0.1.1.zip) </br> [hifigan_csmsc_onnx_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_csmsc_onnx_0.2.0.zip) </br> [hifigan_csmsc_pdlite_1.3.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_csmsc_pdlite_1.3.0.zip)|46MB|
+HiFiGAN | LJSpeech |[HiFiGAN-ljspeech](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/ljspeech/voc5)|[hifigan_ljspeech_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_ljspeech_ckpt_0.2.0.zip)|[hifigan_ljspeech_static_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_ljspeech_static_1.1.0.zip) </br> [hifigan_ljspeech_onnx_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_ljspeech_onnx_1.1.0.zip) </br> [hifigan_ljspeech_pdlite_1.3.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_ljspeech_pdlite_1.3.0.zip) |49MB|
+HiFiGAN | AISHELL-3 |[HiFiGAN-aishell3](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/voc5)|[hifigan_aishell3_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_aishell3_ckpt_0.2.0.zip)|[hifigan_aishell3_static_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_aishell3_static_1.1.0.zip) </br> [hifigan_aishell3_onnx_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_aishell3_onnx_1.1.0.zip) </br> [hifigan_aishell3_pdlite_1.3.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_aishell3_pdlite_1.3.0.zip)|46MB|
+HiFiGAN | VCTK |[HiFiGAN-vctk](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/vctk/voc5)|[hifigan_vctk_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_vctk_ckpt_0.2.0.zip)|[hifigan_vctk_static_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_vctk_static_1.1.0.zip) </br> [hifigan_vctk_onnx_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_vctk_onnx_1.1.0.zip) </br> [hifigan_vctk_pdlite_1.3.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_vctk_pdlite_1.3.0.zip)|46MB|
 WaveRNN | CSMSC |[WaveRNN-csmsc](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/voc6)|[wavernn_csmsc_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/wavernn/wavernn_csmsc_ckpt_0.2.0.zip)|[wavernn_csmsc_static_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/wavernn/wavernn_csmsc_static_0.2.0.zip)|18MB|
+Parallel WaveGAN| Male ||[pwg_male_ckpt_1.3.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_male_ckpt_1.3.0.zip)|||


 ### Voice Cloning
--- a/examples/aishell3/ernie_sat/README.md
+++ b/examples/aishell3/ernie_sat/README.md
@ -1,5 +1,5 @@
 # ERNIE-SAT with AISHELL-3 dataset
-ERNIE-SAT speech-text joint pretraining framework, which achieves SOTA results in cross-lingual multi-speaker speech synthesis and cross-lingual speech editing tasks, It can be applied to a series of scenarios such as Speech Editing, personalized Speech Synthesis, and Voice Cloning.
+[ERNIE-SAT](https://arxiv.org/abs/2211.03545) is a speech-text joint pretraining framework that achieves SOTA results in cross-lingual multi-speaker speech synthesis and cross-lingual speech editing tasks. It can be applied to a series of scenarios such as Speech Editing, personalized Speech Synthesis, and Voice Cloning.

 ## Model Framework
 In ERNIE-SAT, we propose two innovations:
--- a/examples/aishell3/tts3/README.md
+++ b/examples/aishell3/tts3/README.md
@ -226,6 +226,9 @@ The static model can be downloaded here:
 The ONNX model can be downloaded here:
 - [fastspeech2_aishell3_onnx_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_aishell3_onnx_1.1.0.zip)

+The Paddle-Lite model can be downloaded here:
+- [fastspeech2_aishell3_pdlite_1.3.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_aishell3_pdlite_1.3.0.zip)
+
 FastSpeech2 checkpoint contains files listed below.

 ```text
--- a/examples/aishell3/tts3/local/export2lite.sh
+++ b/examples/aishell3/tts3/local/export2lite.sh
@ -0,0 +1 @@
+../../../csmsc/tts3/local/export2lite.sh
--- a/examples/aishell3/tts3/local/lite_predict.sh
+++ b/examples/aishell3/tts3/local/lite_predict.sh
@ -0,0 +1,32 @@
+#!/bin/bash
+
+train_output_path=$1
+
+stage=0
+stop_stage=0
+
+# pwgan
+if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
+    python3 ${BIN_DIR}/../lite_predict.py \
+        --inference_dir=${train_output_path}/pdlite \
+        --am=fastspeech2_aishell3 \
+        --voc=pwgan_aishell3 \
+        --text=${BIN_DIR}/../sentences.txt \
+        --output_dir=${train_output_path}/lite_infer_out \
+        --phones_dict=dump/phone_id_map.txt \
+        --speaker_dict=dump/speaker_id_map.txt \
+        --spk_id=0
+fi
+
+# hifigan
+if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
+    python3 ${BIN_DIR}/../lite_predict.py \
+        --inference_dir=${train_output_path}/pdlite \
+        --am=fastspeech2_aishell3 \
+        --voc=hifigan_aishell3 \
+        --text=${BIN_DIR}/../sentences.txt \
+        --output_dir=${train_output_path}/lite_infer_out \
+        --phones_dict=dump/phone_id_map.txt \
+        --speaker_dict=dump/speaker_id_map.txt \
+        --spk_id=0
+fi
--- a/examples/aishell3/tts3/run.sh
+++ b/examples/aishell3/tts3/run.sh
@ -58,3 +58,13 @@ fi
 if [ ${stage} -le 6 ] && [ ${stop_stage} -ge 6 ]; then
    ./local/ort_predict.sh ${train_output_path}
 fi
+
+if [ ${stage} -le 7 ] && [ ${stop_stage} -ge 7 ]; then
+    ./local/export2lite.sh ${train_output_path} inference pdlite fastspeech2_aishell3 x86
+    ./local/export2lite.sh ${train_output_path} inference pdlite pwgan_aishell3 x86
+    # ./local/export2lite.sh ${train_output_path} inference pdlite hifigan_aishell3 x86
+fi
+
+if [ ${stage} -le 8 ] && [ ${stop_stage} -ge 8 ]; then
+    CUDA_VISIBLE_DEVICES=${gpus} ./local/lite_predict.sh ${train_output_path} || exit -1
+fi
--- a/examples/aishell3/voc1/README.md
+++ b/examples/aishell3/voc1/README.md
@ -139,6 +139,9 @@ The static model can be downloaded here:
 The ONNX model can be downloaded here:
 - [pwgan_aishell3_onnx_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwgan_aishell3_onnx_1.1.0.zip)

+The Paddle-Lite model can be downloaded here:
+- [pwgan_aishell3_pdlite_1.3.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwgan_aishell3_pdlite_1.3.0.zip)
+
 Model | Step | eval/generator_loss | eval/log_stft_magnitude_loss | eval/spectral_convergence_loss 
 :-------------:| :------------:| :-----: | :-----: | :--------:
 default| 1(gpu) x 400000|1.968762|0.759008|0.218524
--- a/examples/aishell3/voc5/README.md
+++ b/examples/aishell3/voc5/README.md
@ -122,6 +122,9 @@ The static model can be downloaded here:
 The ONNX model can be downloaded here:
 - [hifigan_aishell3_onnx_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_aishell3_onnx_1.1.0.zip)

+The Paddle-Lite model can be downloaded here:
+- [hifigan_aishell3_pdlite_1.3.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_aishell3_pdlite_1.3.0.zip)
+
 Model | Step | eval/generator_loss | eval/mel_loss| eval/feature_matching_loss
 :-------------:| :------------:| :-----: | :-----: | :--------:
 default| 1(gpu) x 2500000|24.060|0.1068|7.499
--- a/examples/aishell3_vctk/ernie_sat/README.md
+++ b/examples/aishell3_vctk/ernie_sat/README.md
@ -1,5 +1,5 @@
 # ERNIE-SAT with AISHELL-3 and VCTK dataset
-ERNIE-SAT speech-text joint pretraining framework, which achieves SOTA results in cross-lingual multi-speaker speech synthesis and cross-lingual speech editing tasks, It can be applied to a series of scenarios such as Speech Editing, personalized Speech Synthesis, and Voice Cloning.
+[ERNIE-SAT](https://arxiv.org/abs/2211.03545) is a speech-text joint pretraining framework that achieves SOTA results in cross-lingual multi-speaker speech synthesis and cross-lingual speech editing tasks. It can be applied to a series of scenarios such as Speech Editing, personalized Speech Synthesis, and Voice Cloning.

 ## Model Framework
 In ERNIE-SAT, we propose two innovations:
--- a/examples/csmsc/tts2/README.md
+++ b/examples/csmsc/tts2/README.md
@ -230,6 +230,9 @@ The static model can be downloaded here:
 The ONNX model can be downloaded here:
 - [speedyspeech_csmsc_onnx_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/speedyspeech/speedyspeech_csmsc_onnx_0.2.0.zip)

+The Paddle-Lite model can be downloaded here:
+- [speedyspeech_csmsc_pdlite_1.3.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/speedyspeech/speedyspeech_csmsc_pdlite_1.3.0.zip)
+

 Model | Step | eval/loss | eval/l1_loss | eval/duration_loss | eval/ssim_loss
 :-------------:| :------------:| :-----: | :-----: | :--------:|:--------:
--- a/examples/csmsc/tts2/local/export2lite.sh
+++ b/examples/csmsc/tts2/local/export2lite.sh
@ -0,0 +1 @@
+../../tts3/local/export2lite.sh
--- a/examples/csmsc/tts2/local/lite_predict.sh
+++ b/examples/csmsc/tts2/local/lite_predict.sh
@ -0,0 +1,43 @@
+#!/bin/bash
+
+train_output_path=$1
+
+stage=0
+stop_stage=0
+
+# pwgan
+if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
+    python3 ${BIN_DIR}/../lite_predict.py \
+        --inference_dir=${train_output_path}/pdlite \
+        --am=speedyspeech_csmsc \
+        --voc=pwgan_csmsc \
+        --text=${BIN_DIR}/../sentences.txt \
+        --output_dir=${train_output_path}/lite_infer_out \
+        --phones_dict=dump/phone_id_map.txt \
+        --tones_dict=dump/tone_id_map.txt
+fi
+
+# for more GAN Vocoders
+# multi band melgan
+if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
+    python3 ${BIN_DIR}/../lite_predict.py \
+        --inference_dir=${train_output_path}/pdlite \
+        --am=speedyspeech_csmsc \
+        --voc=mb_melgan_csmsc \
+        --text=${BIN_DIR}/../sentences.txt \
+        --output_dir=${train_output_path}/lite_infer_out \
+        --phones_dict=dump/phone_id_map.txt \
+        --tones_dict=dump/tone_id_map.txt
+fi
+
+# hifigan
+if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
+    python3 ${BIN_DIR}/../lite_predict.py \
+        --inference_dir=${train_output_path}/pdlite \
+        --am=speedyspeech_csmsc \
+        --voc=hifigan_csmsc \
+        --text=${BIN_DIR}/../sentences.txt \
+        --output_dir=${train_output_path}/lite_infer_out \
+        --phones_dict=dump/phone_id_map.txt \
+        --tones_dict=dump/tone_id_map.txt
+fi
--- a/examples/csmsc/tts2/run.sh
+++ b/examples/csmsc/tts2/run.sh
@ -60,3 +60,15 @@ fi
 if [ ${stage} -le 6 ] && [ ${stop_stage} -ge 6 ]; then
    ./local/ort_predict.sh ${train_output_path}
 fi
+
+# must run after stage 3 (the stage that generates the static models)
+if [ ${stage} -le 7 ] && [ ${stop_stage} -ge 7 ]; then
+    ./local/export2lite.sh ${train_output_path} inference pdlite speedyspeech_csmsc x86
+    ./local/export2lite.sh ${train_output_path} inference pdlite pwgan_csmsc x86
+    # ./local/export2lite.sh ${train_output_path} inference pdlite mb_melgan_csmsc x86
+    # ./local/export2lite.sh ${train_output_path} inference pdlite hifigan_csmsc x86
+fi
+
+if [ ${stage} -le 8 ] && [ ${stop_stage} -ge 8 ]; then
+    CUDA_VISIBLE_DEVICES=${gpus} ./local/lite_predict.sh ${train_output_path} || exit -1
+fi
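The `stage`/`stop_stage` guards used in these `run.sh` scripts select an inclusive range of steps: block N runs when `stage <= N <= stop_stage`. Below is a minimal Python sketch of that rule. The function name `stages_to_run` and the stage descriptions are illustrative and not part of the repository.

```python
def stages_to_run(stage: int, stop_stage: int, available: dict) -> list:
    """Mirror `[ ${stage} -le N ] && [ ${stop_stage} -ge N ]` for every block N."""
    return [n for n in sorted(available) if stage <= n <= stop_stage]


# Hypothetical labels for the two blocks this diff appends to tts2/run.sh.
LITE_STAGES = {
    7: "export static models to Paddle-Lite (export2lite.sh)",
    8: "run Paddle-Lite inference (lite_predict.sh)",
}

if __name__ == "__main__":
    for n in stages_to_run(stage=7, stop_stage=8, available=LITE_STAGES):
        print(n, LITE_STAGES[n])
```

Note that `lite_predict.sh` hard-codes its own `stage=0` and `stop_stage=0`, so by the same rule only its first (pwgan) block runs unless those values are edited.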
--- a/examples/csmsc/tts3/README.md
+++ b/examples/csmsc/tts3/README.md
@ -238,6 +238,12 @@ The ONNX model can be downloaded here:
 - [fastspeech2_cnndecoder_csmsc_onnx_1.0.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_cnndecoder_csmsc_onnx_1.0.0.zip)
 - [fastspeech2_cnndecoder_csmsc_streaming_onnx_1.0.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_cnndecoder_csmsc_streaming_onnx_1.0.0.zip)

+The Paddle-Lite model can be downloaded here:
+> Please compile the develop version of Paddle-Lite to export and run TTS models, because TTS models are supported by https://github.com/PaddlePaddle/Paddle-Lite/pull/9587 and https://github.com/PaddlePaddle/Paddle-Lite/pull/9706
+- [fastspeech2_csmsc_pdlite_1.3.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_csmsc_pdlite_1.3.0.zip)
+- [fastspeech2_cnndecoder_csmsc_pdlite_1.3.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_cnndecoder_csmsc_pdlite_1.3.0.zip)
+- [fastspeech2_cnndecoder_csmsc_streaming_pdlite_1.3.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_cnndecoder_csmsc_streaming_pdlite_1.3.0.zip)
+
 Model | Step | eval/loss | eval/l1_loss | eval/duration_loss | eval/pitch_loss| eval/energy_loss 
 :-------------:| :------------:| :-----: | :-----: | :--------: |:--------:|:---------:
 default| 2(gpu) x 76000|1.0991|0.59132|0.035815|0.31915|0.15287|
--- a/examples/csmsc/tts3/local/export2lite.sh
+++ b/examples/csmsc/tts3/local/export2lite.sh
@ -0,0 +1,18 @@
+train_output_path=$1
+model_dir=$2
+output_dir=$3
+model=$4
+valid_targets=$5
+
+model_name=${model%_*}
+echo model_name: ${model_name}
+
+suffix=${valid_targets%,*}
+
+mkdir -p ${train_output_path}/${output_dir}
+
+paddle_lite_opt \
+    --model_file ${train_output_path}/${model_dir}/${model}.pdmodel \
+    --param_file  ${train_output_path}/${model_dir}/${model}.pdiparams \
+    --optimize_out ${train_output_path}/${output_dir}/${model}_${suffix} \
+    --valid_targets ${valid_targets}
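The two parameter expansions in `export2lite.sh` are easy to misread: `${model%_*}` and `${valid_targets%,*}` each strip the shortest trailing match, which means everything from the last `_` or `,` onward. The sketch below reproduces that string handling in Python to show which output prefix is passed to `--optimize_out`. The function names and the `exp/default` path are hypothetical; only the expansion semantics come from the script.

```python
def model_name(model: str) -> str:
    """Equivalent of ${model%_*}: drop the last '_' and what follows it."""
    return model.rsplit("_", 1)[0]


def lite_output_prefix(train_output_path: str, output_dir: str,
                       model: str, valid_targets: str) -> str:
    """Rebuild the --optimize_out value assembled by export2lite.sh."""
    # Equivalent of ${valid_targets%,*}: drop the last ',' and what follows it.
    suffix = valid_targets.rsplit(",", 1)[0]
    return f"{train_output_path}/{output_dir}/{model}_{suffix}"


if __name__ == "__main__":
    print(model_name("fastspeech2_csmsc"))
    print(lite_output_prefix("exp/default", "pdlite", "fastspeech2_csmsc", "x86"))
```

With a single target such as `x86` the suffix is unchanged. With a comma-separated list only the last entry is dropped, so `x86,arm` would yield the suffix `x86`.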
--- a/examples/csmsc/tts3/local/lite_predict.sh
+++ b/examples/csmsc/tts3/local/lite_predict.sh
@ -0,0 +1,40 @@
+#!/bin/bash
+
+train_output_path=$1
+
+stage=0
+stop_stage=0
+
+# pwgan
+if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
+    python3 ${BIN_DIR}/../lite_predict.py \
+        --inference_dir=${train_output_path}/pdlite \
+        --am=fastspeech2_csmsc \
+        --voc=pwgan_csmsc \
+        --text=${BIN_DIR}/../sentences.txt \
+        --output_dir=${train_output_path}/lite_infer_out \
+        --phones_dict=dump/phone_id_map.txt
+fi
+
+# for more GAN Vocoders
+# multi band melgan
+if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
+    python3 ${BIN_DIR}/../lite_predict.py \
+        --inference_dir=${train_output_path}/pdlite \
+        --am=fastspeech2_csmsc \
+        --voc=mb_melgan_csmsc \
+        --text=${BIN_DIR}/../sentences.txt \
+        --output_dir=${train_output_path}/lite_infer_out \
+        --phones_dict=dump/phone_id_map.txt
+fi
+
+# hifigan
+if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
+    python3 ${BIN_DIR}/../lite_predict.py \
+        --inference_dir=${train_output_path}/pdlite \
+        --am=fastspeech2_csmsc \
+        --voc=hifigan_csmsc \
+        --text=${BIN_DIR}/../sentences.txt \
+        --output_dir=${train_output_path}/lite_infer_out \
+        --phones_dict=dump/phone_id_map.txt
+fi
--- a/examples/csmsc/tts3/local/lite_predict_streaming.sh
+++ b/examples/csmsc/tts3/local/lite_predict_streaming.sh
@ -0,0 +1,47 @@
+#!/bin/bash
+
+train_output_path=$1
+
+stage=0
+stop_stage=0
+
+# pwgan
+if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
+    python3 ${BIN_DIR}/../lite_predict_streaming.py \
+        --inference_dir=${train_output_path}/pdlite_streaming \
+        --am=fastspeech2_csmsc \
+        --am_stat=dump/train/speech_stats.npy \
+        --voc=pwgan_csmsc \
+        --text=${BIN_DIR}/../sentences.txt \
+        --output_dir=${train_output_path}/lite_infer_out_streaming \
+        --phones_dict=dump/phone_id_map.txt \
+        --am_streaming=True
+fi
+
+# for more GAN Vocoders
+# multi band melgan
+if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
+    python3 ${BIN_DIR}/../lite_predict_streaming.py \
+        --inference_dir=${train_output_path}/pdlite_streaming \
+        --am=fastspeech2_csmsc \
+        --am_stat=dump/train/speech_stats.npy \
+        --voc=mb_melgan_csmsc \
+        --text=${BIN_DIR}/../sentences.txt \
+        --output_dir=${train_output_path}/lite_infer_out_streaming \
+        --phones_dict=dump/phone_id_map.txt \
+        --am_streaming=True
+fi
+
+# hifigan
+if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
+    python3 ${BIN_DIR}/../lite_predict_streaming.py \
+        --inference_dir=${train_output_path}/pdlite_streaming \
+        --am=fastspeech2_csmsc \
+        --am_stat=dump/train/speech_stats.npy \
+        --voc=hifigan_csmsc \
+        --text=${BIN_DIR}/../sentences.txt \
+        --output_dir=${train_output_path}/lite_infer_out_streaming \
+        --phones_dict=dump/phone_id_map.txt \
+        --am_streaming=True
+fi
+
--- a/examples/csmsc/tts3/run.sh
+++ b/examples/csmsc/tts3/run.sh
@ -61,3 +61,18 @@ fi
 if [ ${stage} -le 6 ] && [ ${stop_stage} -ge 6 ]; then
    ./local/ort_predict.sh ${train_output_path}
 fi
+
+# must run after stage 3 (the stage that generates the static models)
+if [ ${stage} -le 7 ] && [ ${stop_stage} -ge 7 ]; then
+    # NOTE by yuantian 2022.11.21: please compile the develop version of Paddle-Lite to export and run TTS models,
+    #                   because TTS models are supported by https://github.com/PaddlePaddle/Paddle-Lite/pull/9587
+    #                   and https://github.com/PaddlePaddle/Paddle-Lite/pull/9706
+    ./local/export2lite.sh ${train_output_path} inference pdlite fastspeech2_csmsc x86
+    ./local/export2lite.sh ${train_output_path} inference pdlite pwgan_csmsc x86
+    # ./local/export2lite.sh ${train_output_path} inference pdlite mb_melgan_csmsc x86
+    # ./local/export2lite.sh ${train_output_path} inference pdlite hifigan_csmsc x86
+fi
+
+if [ ${stage} -le 8 ] && [ ${stop_stage} -ge 8 ]; then
+    CUDA_VISIBLE_DEVICES=${gpus} ./local/lite_predict.sh ${train_output_path} || exit -1
+fi
--- a/examples/csmsc/tts3/run_cnndecoder.sh
+++ b/examples/csmsc/tts3/run_cnndecoder.sh
@ -75,7 +75,6 @@ if [ ${stage} -le 8 ] && [ ${stop_stage} -ge 8 ]; then
 fi

 # paddle2onnx streaming
-
 if [ ${stage} -le 9 ] && [ ${stop_stage} -ge 9 ]; then
    # install paddle2onnx
    version=$(echo `pip list |grep "paddle2onnx"` |awk -F" " '{print $2}')
@ -97,3 +96,29 @@ if [ ${stage} -le 10 ] && [ ${stop_stage} -ge 10 ]; then
    ./local/ort_predict_streaming.sh ${train_output_path}
 fi

+# must run after stage 3 (the stage that generates the static models)
+if [ ${stage} -le 11 ] && [ ${stop_stage} -ge 11 ]; then
+    ./local/export2lite.sh ${train_output_path} inference pdlite fastspeech2_csmsc x86
+    ./local/export2lite.sh ${train_output_path} inference pdlite pwgan_csmsc x86
+    # ./local/export2lite.sh ${train_output_path} inference pdlite mb_melgan_csmsc x86
+    # ./local/export2lite.sh ${train_output_path} inference pdlite hifigan_csmsc x86
+fi
+
+if [ ${stage} -le 12 ] && [ ${stop_stage} -ge 12 ]; then
+    CUDA_VISIBLE_DEVICES=${gpus} ./local/lite_predict.sh ${train_output_path} || exit -1
+fi
+
+# must run after stage 5 (the stage that generates the static models)
+if [ ${stage} -le 13 ] && [ ${stop_stage} -ge 13 ]; then
+    # streaming acoustic model
+    ./local/export2lite.sh ${train_output_path} inference_streaming pdlite_streaming fastspeech2_csmsc_am_encoder_infer x86
+    ./local/export2lite.sh ${train_output_path} inference_streaming pdlite_streaming fastspeech2_csmsc_am_decoder x86
+    ./local/export2lite.sh ${train_output_path} inference_streaming pdlite_streaming fastspeech2_csmsc_am_postnet x86
+    ./local/export2lite.sh ${train_output_path} inference_streaming pdlite_streaming pwgan_csmsc x86
+    # ./local/export2lite.sh ${train_output_path} inference_streaming pdlite_streaming mb_melgan_csmsc x86
+    # ./local/export2lite.sh ${train_output_path} inference_streaming pdlite_streaming hifigan_csmsc x86
+fi
+
+if [ ${stage} -le 14 ] && [ ${stop_stage} -ge 14 ]; then
+    CUDA_VISIBLE_DEVICES=${gpus} ./local/lite_predict_streaming.sh ${train_output_path} || exit -1
+fi
--- a/examples/csmsc/voc1/README.md
+++ b/examples/csmsc/voc1/README.md
@ -136,6 +136,9 @@ The static model can be downloaded here:
 The ONNX model can be downloaded here:
 - [pwgan_csmsc_onnx_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwgan_csmsc_onnx_0.2.0.zip)

+The Paddle-Lite model can be downloaded here:
+- [pwgan_csmsc_pdlite_1.3.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwgan_csmsc_pdlite_1.3.0.zip)
+
 Model | Step | eval/generator_loss | eval/log_stft_magnitude_loss| eval/spectral_convergence_loss
 :-------------:| :------------:| :-----: | :-----: | :--------:
 default| 1(gpu) x 400000|1.948763|0.670098|0.248882
--- a/examples/csmsc/voc3/README.md
+++ b/examples/csmsc/voc3/README.md
@ -164,6 +164,9 @@ The static model can be downloaded here:
 The ONNX model can be downloaded here:
 - [mb_melgan_csmsc_onnx_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/mb_melgan/mb_melgan_csmsc_onnx_0.2.0.zip)

+The Paddle-Lite model can be downloaded here:
+- [mb_melgan_csmsc_pdlite_1.3.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/mb_melgan/mb_melgan_csmsc_pdlite_1.3.0.zip)
+
 Model | Step | eval/generator_loss | eval/log_stft_magnitude_loss|eval/spectral_convergence_loss |eval/sub_log_stft_magnitude_loss|eval/sub_spectral_convergence_loss
 :-------------:| :------------:| :-----: | :-----: | :--------:| :--------:| :--------:
 default| 1(gpu) x 1000000| 2.4851|0.71778 |0.2761 |0.66334 |0.2777|
--- a/examples/csmsc/voc5/README.md
+++ b/examples/csmsc/voc5/README.md
@ -121,6 +121,9 @@ The static model can be downloaded here:
 The ONNX model can be downloaded here:
 - [hifigan_csmsc_onnx_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_csmsc_onnx_0.2.0.zip)

+The Paddle-Lite model can be downloaded here:
+- [hifigan_csmsc_pdlite_1.3.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_csmsc_pdlite_1.3.0.zip)
+
 Model | Step | eval/generator_loss | eval/mel_loss| eval/feature_matching_loss
 :-------------:| :------------:| :-----: | :-----: | :--------:
 default| 1(gpu) x 2500000|24.927|0.1262|7.554
--- a/examples/librispeech/asr3/RESULTS.md
+++ b/examples/librispeech/asr3/RESULTS.md
@ -1,8 +1,8 @@
 # LibriSpeech

 ## Wav2VecASR
-train: Epoch 1, 1*V100-32G, batchsize:10
+train: Epoch 1, 1*V100-32G, batchsize: 6

 | Model | Params | Config | Augmentation| Test set | Decode method | WER |  
 | --- | --- | --- | --- | --- | --- | --- |
-| wav2vec2ASR | 302.86 M | conf/wav2vec2ASR.yaml | spec_aug | test-clean | greedy search | 0.018887 |  
+| wav2vec2ASR | 302.86 M | conf/wav2vec2ASR.yaml | spec_aug | test-clean | greedy search | 0.018906 |  
--- a/examples/librispeech/asr3/conf/preprocess.yaml
+++ b/examples/librispeech/asr3/conf/preprocess.yaml
@ -1,4 +1,3 @@
 process:
    # use raw audio
  - type: wav_process
-    dither: 0.0
--- a/examples/librispeech/asr3/conf/wav2vec2ASR.yaml
+++ b/examples/librispeech/asr3/conf/wav2vec2ASR.yaml
@ -4,16 +4,21 @@
 freeze_wav2vec2: True
 normalize_wav: True
 output_norm: True
-dnn_blocks: 2
-dnn_neurons: 1024
-blank_id: 0
-ctc_dropout_rate: 0.0
+init_type: 'kaiming_uniform' # !Warning: needed for convergence
+enc:
+  input_shape: 1024
+  dnn_blocks: 2
+  dnn_neurons: 1024
+  activation: True
+ctc:
+  enc_n_units: 1024
+  blank_id: 0
+  dropout_rate: 0.0
 wav2vec2_params_path: "exp/wav2vec2/wav2vec2-large-960h-lv60-self.pdparams"

 ############################################
 #               Wav2Vec2.0                 #
 ############################################
-vocab_size: 32
 hidden_size: 1024
 num_hidden_layers: 24
 num_attention_heads: 16
@ -54,9 +59,6 @@ diversity_loss_weight: 0.1
 ctc_loss_reduction: "sum"
 ctc_zero_infinity: False
 use_weighted_layer_sum: False
-pad_token_id: 0
-bos_token_id: 1
-eos_token_id: 2
 add_adapter: False
 adapter_kernel_size: 3
 adapter_stride: 2
@ -70,7 +72,6 @@ train_manifest: data/manifest.train
 dev_manifest: data/manifest.dev
 test_manifest: data/manifest.test-clean

-
 ###########################################
 #              Dataloader                 #
 ###########################################
@ -79,7 +80,7 @@ unit_type: 'char'
 mean_std_filepath: ""
 preprocess_config: conf/preprocess.yaml
 sortagrad: -1 # Feed samples from shortest to longest ; -1: enabled for all epochs 0: disabled other: enabled for 'other' epochs 
-batch_size: 10  # Different batch_size may cause large differences in results
+batch_size: 6  # Different batch_size may cause large differences in results
 maxlen_in: 51200000000  # if input length  > maxlen-in batchsize is automatically reduced
 maxlen_out: 1500000  # if output length > maxlen-out batchsize is automatically reduced
 minibatches: 0 # for debug
@ -95,26 +96,38 @@ dist_sampler: True
 shortest_first: True
 return_lens_rate: True
  
+############################################
+#             Data Augmentation            #
+############################################
+audio_augment:  # for raw audio 
+  sample_rate: 16000
+  speeds: [95, 100, 105]

 ###########################################
 #                 Training                #
 ###########################################
 n_epoch: 1
 accum_grad: 1
-global_grad_clip: 3.0
+global_grad_clip: 5.0
 model_optim: adadelta
 model_optim_conf:
  lr: 0.9
  epsilon: 1.0e-6
  rho: 0.95
-scheduler: constantlr    
-scheduler_conf:
+model_scheduler: constantlr    
+model_scheduler_conf:
+  warmup_steps: 25000
+  lr_decay: 1.0
+wav2vec2_optim: adadelta
+wav2vec2_optim_conf:
+  lr: 0.9
+  epsilon: 1.0e-6
+  rho: 0.95
+wav2vec2_scheduler: constantlr    
+wav2vec2_scheduler_conf:
  warmup_steps: 25000
  lr_decay: 1.0
 log_interval: 1
 checkpoint:
  kbest_n: 50
  latest_n: 5
-augment: True
-
-
--- a/examples/librispeech/asr3/local/train.sh
+++ b/examples/librispeech/asr3/local/train.sh
@ -10,7 +10,8 @@ echo "using $ngpu gpus..."

 config_path=$1
 ckpt_name=$2
-ips=$3
+resume=$3
+ips=$4

 if [ ! $ips ];then
  ips_config=
@ -21,7 +22,7 @@ fi
 mkdir -p exp

 # seed may break model convergence
-seed=1998
+seed=1988
 if [ ${seed} != 0 ]; then
    export FLAGS_cudnn_deterministic=True
 fi
@ -34,13 +35,15 @@ python3 -u ${BIN_DIR}/train.py \
 --ngpu ${ngpu} \
 --config ${config_path} \
 --output exp/${ckpt_name} \
--seed ${seed} 
+--seed ${seed} \
+--resume ${resume}
 else
 python3 -m paddle.distributed.launch --gpus=${CUDA_VISIBLE_DEVICES} ${ips_config} ${BIN_DIR}/train.py \
 --ngpu ${ngpu} \
 --config ${config_path} \
 --output exp/${ckpt_name} \
--seed ${seed}
+--seed ${seed} \
+--resume ${resume}
 fi

 if [ ${seed} != 0 ]; then
--- a/examples/librispeech/asr3/run.sh
+++ b/examples/librispeech/asr3/run.sh
@ -11,7 +11,7 @@ conf_path=conf/wav2vec2ASR.yaml
 ips=            #xx.xx.xx.xx,xx.xx.xx.xx
 decode_conf_path=conf/tuning/decode.yaml
 avg_num=1
-dict_path=data/lang_char/vocab.txt
+resume=         # xx e.g. 30

 . ${MAIN_ROOT}/utils/parse_options.sh || exit 1;

@ -28,7 +28,7 @@ fi

 if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
    # train model, all `ckpt` under `exp` dir
-    CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${ckpt} ${ips} 
+    CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${ckpt} ${resume} ${ips} 
 fi

 if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
@ -38,10 +38,10 @@ fi

 if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
    # greedy search decoder
-    CUDA_VISIBLE_DEVICES=${gpus} ./local/test.sh ${conf_path} ${decode_conf_path} exp/${ckpt}/checkpoints/${avg_ckpt} || exit -1
+    CUDA_VISIBLE_DEVICES=0 ./local/test.sh ${conf_path} ${decode_conf_path} exp/${ckpt}/checkpoints/${avg_ckpt} || exit -1
 fi

 if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
    # test a single .wav file
-    CUDA_VISIBLE_DEVICES=${gpus} ./local/test_wav.sh ${conf_path} ${decode_conf_path} exp/${ckpt}/checkpoints/${avg_ckpt} ${audio_file} || exit -1
+    CUDA_VISIBLE_DEVICES=0 ./local/test_wav.sh ${conf_path} ${decode_conf_path} exp/${ckpt}/checkpoints/${avg_ckpt} ${audio_file} || exit -1
 fi
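One thing that may be worth checking in review: `run.sh` passes `${resume}` and `${ips}` to `train.sh` unquoted, and `train.sh` reads them as `$3` and `$4`. When `resume` is empty, shell word splitting drops it, so a non-empty `ips` would arrive as `$3`. The sketch below approximates that behaviour in Python. The helper names and the IP addresses are hypothetical, and `str.split()` is only a stand-in for bash word splitting with the default `IFS`.

```python
def expand_unquoted(*values: str) -> list:
    """Approximate unquoted ${var} expansion: empty values vanish,
    and whitespace inside a value splits it into several words."""
    return [word for value in values for word in value.split()]


def bind_train_args(argv: list) -> dict:
    """Bind positional words the way train.sh does: $1..$4."""
    names = ["config_path", "ckpt_name", "resume", "ips"]
    padded = list(argv) + [""] * (len(names) - len(argv))
    return dict(zip(names, padded))


if __name__ == "__main__":
    # resume left empty, ips set (addresses are made up for illustration).
    argv = expand_unquoted("conf/wav2vec2ASR.yaml", "wav2vec2ASR",
                           "", "10.0.0.1,10.0.0.2")
    print(bind_train_args(argv))
```

If this matters in practice, quoting the arguments at the call site or passing them as named options would keep the positions stable. Whether the multi-node path is used with an empty `resume` is not visible in this diff.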
--- a/examples/ljspeech/tts3/README.md
+++ b/examples/ljspeech/tts3/README.md
@ -221,6 +221,9 @@ The static model can be downloaded here:
 The ONNX model can be downloaded here:
 - [fastspeech2_ljspeech_onnx_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_ljspeech_onnx_1.1.0.zip)

+The Paddle-Lite model can be downloaded here:
+- [fastspeech2_ljspeech_pdlite_1.3.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_ljspeech_pdlite_1.3.0.zip)
+

 Model | Step | eval/loss | eval/l1_loss | eval/duration_loss | eval/pitch_loss| eval/energy_loss 
 :-------------:| :------------:| :-----: | :-----: | :--------: |:--------:|:---------:
--- a/examples/ljspeech/tts3/local/export2lite.sh
+++ b/examples/ljspeech/tts3/local/export2lite.sh
@ -0,0 +1 @@
+../../../csmsc/tts3/local/export2lite.sh
--- a/examples/ljspeech/tts3/local/lite_predict.sh
+++ b/examples/ljspeech/tts3/local/lite_predict.sh
@ -0,0 +1,30 @@
+#!/bin/bash
+
+train_output_path=$1
+
+stage=0
+stop_stage=0
+
+# pwgan
+if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
+    python3 ${BIN_DIR}/../lite_predict.py \
+        --inference_dir=${train_output_path}/pdlite \
+        --am=fastspeech2_ljspeech \
+        --voc=pwgan_ljspeech \
+        --text=${BIN_DIR}/../sentences_en.txt \
+        --output_dir=${train_output_path}/lite_infer_out \
+        --phones_dict=dump/phone_id_map.txt \
+        --lang=en
+fi
+
+# hifigan
+if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
+    python3 ${BIN_DIR}/../lite_predict.py \
+        --inference_dir=${train_output_path}/pdlite \
+        --am=fastspeech2_ljspeech \
+        --voc=hifigan_ljspeech \
+        --text=${BIN_DIR}/../sentences_en.txt \
+        --output_dir=${train_output_path}/lite_infer_out \
+        --phones_dict=dump/phone_id_map.txt \
+        --lang=en
+fi
--- a/examples/ljspeech/tts3/run.sh
+++ b/examples/ljspeech/tts3/run.sh
@ -59,3 +59,14 @@ fi
 if [ ${stage} -le 6 ] && [ ${stop_stage} -ge 6 ]; then
    ./local/ort_predict.sh ${train_output_path}
 fi
+
+# must run after stage 3 (the stage that generates the static models)
+if [ ${stage} -le 7 ] && [ ${stop_stage} -ge 7 ]; then
+    ./local/export2lite.sh ${train_output_path} inference pdlite fastspeech2_ljspeech x86
+    ./local/export2lite.sh ${train_output_path} inference pdlite pwgan_ljspeech x86
+    # ./local/export2lite.sh ${train_output_path} inference pdlite hifigan_ljspeech x86
+fi
+
+if [ ${stage} -le 8 ] && [ ${stop_stage} -ge 8 ]; then
+    CUDA_VISIBLE_DEVICES=${gpus} ./local/lite_predict.sh ${train_output_path} || exit -1
+fi
--- a/examples/ljspeech/voc1/README.md
+++ b/examples/ljspeech/voc1/README.md
@ -136,6 +136,9 @@ The static model can be downloaded here:
 The ONNX model can be downloaded here:
 - [pwgan_ljspeech_onnx_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwgan_ljspeech_onnx_1.1.0.zip)

+The Paddle-Lite model can be downloaded here:
+- [pwgan_ljspeech_pdlite_1.3.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwgan_ljspeech_pdlite_1.3.0.zip)
+

 Parallel WaveGAN checkpoint contains files listed below.

--- a/examples/ljspeech/voc5/README.md
+++ b/examples/ljspeech/voc5/README.md
@ -121,6 +121,8 @@ The static model can be downloaded here:
 The ONNX model can be downloaded here:
 - [hifigan_ljspeech_onnx_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_ljspeech_onnx_1.1.0.zip)

+The Paddle-Lite model can be downloaded here:
+- [hifigan_ljspeech_pdlite_1.3.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_ljspeech_pdlite_1.3.0.zip)

 Model | Step | eval/generator_loss | eval/mel_loss| eval/feature_matching_loss
 :-------------:| :------------:| :-----: | :-----: | :--------:
--- a/examples/other/tn/data/textnorm_test_cases.txt
+++ b/examples/other/tn/data/textnorm_test_cases.txt
@ -123,3 +123,5 @@ iPad Pro的秒控键盘这次也推出白色版本。|iPad Pro的秒控键盘这
 985|九八五
 12~23|十二到二十三
 12-23|十二到二十三
+25cm²|二十五平方厘米
+25m|米
--- a/examples/other/tts_finetune/tts3/run_mix.sh
+++ b/examples/other/tts_finetune/tts3/run_mix.sh
@ -108,3 +108,4 @@ if [ ${stage} -le 6 ] && [ ${stop_stage} -ge 6 ]; then
        --spk_id=$replace_spkid
 fi

+
--- a/examples/vctk/ernie_sat/README.md
+++ b/examples/vctk/ernie_sat/README.md
@ -1,5 +1,5 @@
 # ERNIE-SAT with VCTK dataset
-ERNIE-SAT speech-text joint pretraining framework, which achieves SOTA results in cross-lingual multi-speaker speech synthesis and cross-lingual speech editing tasks, It can be applied to a series of scenarios such as Speech Editing, personalized Speech Synthesis, and Voice Cloning.
+[ERNIE-SAT](https://arxiv.org/abs/2211.03545) is a speech-text joint pretraining framework that achieves SOTA results in cross-lingual multi-speaker speech synthesis and cross-lingual speech editing tasks. It can be applied to a series of scenarios such as Speech Editing, personalized Speech Synthesis, and Voice Cloning.

 ## Model Framework
 In ERNIE-SAT, we propose two innovations:
--- a/examples/vctk/tts3/README.md
+++ b/examples/vctk/tts3/README.md
@ -224,6 +224,9 @@ The static model can be downloaded here:
 The ONNX model can be downloaded here:
 - [fastspeech2_vctk_onnx_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_vctk_onnx_1.1.0.zip)

+The Paddle-Lite model can be downloaded here:
+- [fastspeech2_vctk_pdlite_1.3.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_vctk_pdlite_1.3.0.zip)
+
 FastSpeech2 checkpoint contains files listed below.
 ```text
 fastspeech2_vctk_ckpt_1.2.0
--- a/examples/vctk/tts3/local/export2lite.sh
+++ b/examples/vctk/tts3/local/export2lite.sh
@ -0,0 +1 @@
+../../../csmsc/tts3/local/export2lite.sh
--- a/examples/vctk/tts3/local/lite_predict.sh
+++ b/examples/vctk/tts3/local/lite_predict.sh
@ -0,0 +1,34 @@
+#!/bin/bash
+
+train_output_path=$1
+
+stage=0
+stop_stage=0
+
+# pwgan
+if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
+    python3 ${BIN_DIR}/../lite_predict.py \
+        --inference_dir=${train_output_path}/pdlite \
+        --am=fastspeech2_vctk \
+        --voc=pwgan_vctk \
+        --text=${BIN_DIR}/../sentences_en.txt \
+        --output_dir=${train_output_path}/lite_infer_out \
+        --phones_dict=dump/phone_id_map.txt \
+        --speaker_dict=dump/speaker_id_map.txt \
+        --spk_id=0 \
+        --lang=en
+fi
+
+# hifigan
+if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
+    python3 ${BIN_DIR}/../lite_predict.py \
+        --inference_dir=${train_output_path}/pdlite \
+        --am=fastspeech2_vctk \
+        --voc=hifigan_vctk \
+        --text=${BIN_DIR}/../sentences_en.txt \
+        --output_dir=${train_output_path}/lite_infer_out \
+        --phones_dict=dump/phone_id_map.txt \
+        --speaker_dict=dump/speaker_id_map.txt \
+        --spk_id=0 \
+        --lang=en
+fi
--- a/examples/vctk/tts3/run.sh
+++ b/examples/vctk/tts3/run.sh
@ -58,3 +58,14 @@ fi
 if [ ${stage} -le 6 ] && [ ${stop_stage} -ge 6 ]; then
    ./local/ort_predict.sh ${train_output_path}
 fi
+
+# must run after stage 3 (the stage that generates the static models)
+if [ ${stage} -le 7 ] && [ ${stop_stage} -ge 7 ]; then
+    ./local/export2lite.sh ${train_output_path} inference pdlite fastspeech2_vctk x86
+    ./local/export2lite.sh ${train_output_path} inference pdlite pwgan_vctk x86
+    # ./local/export2lite.sh ${train_output_path} inference pdlite hifigan_vctk x86
+fi
+
+if [ ${stage} -le 8 ] && [ ${stop_stage} -ge 8 ]; then
+    CUDA_VISIBLE_DEVICES=${gpus} ./local/lite_predict.sh ${train_output_path} || exit -1
+fi
--- a/examples/vctk/voc1/README.md
+++ b/examples/vctk/voc1/README.md
@ -141,6 +141,9 @@ The static model can be downloaded here:
 The ONNX model can be downloaded here:
 - [pwgan_vctk_onnx_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwgan_vctk_onnx_1.1.0.zip)

+The Paddle-Lite model can be downloaded here:
+- [pwgan_vctk_pdlite_1.3.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwgan_vctk_pdlite_1.3.0.zip)
+

 Parallel WaveGAN checkpoint contains files listed below.

--- a/examples/vctk/voc5/README.md
+++ b/examples/vctk/voc5/README.md
@ -127,6 +127,9 @@ The static model can be downloaded here:
 The ONNX model can be downloaded here:
 - [hifigan_vctk_onnx_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_vctk_onnx_1.1.0.zip)

+The Paddle-Lite model can be downloaded here:
+- [hifigan_vctk_pdlite_1.3.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_vctk_pdlite_1.3.0.zip)
+

 Model | Step | eval/generator_loss | eval/mel_loss| eval/feature_matching_loss
 :-------------:| :------------:| :-----: | :-----: | :--------:
--- a/paddlespeech/audio/transform/spectrogram.py
+++ b/paddlespeech/audio/transform/spectrogram.py
@ -383,7 +383,7 @@ class LogMelSpectrogramKaldi():


 class WavProcess():
-    def __init__(self, dither=0.0):
+    def __init__(self):
        """
        Args:
            dither (float): Dithering constant
@ -391,9 +391,7 @@ class WavProcess():
        Returns:
        """

-        self.dither = dither
-
-    def __call__(self, x, train):
+    def __call__(self, x):
        """
        Args:
            x (np.ndarray): shape (Ti,)
@ -405,10 +403,10 @@ class WavProcess():
        Returns:
            np.ndarray: (T, D)
        """
-        dither = self.dither if train else 0.0
        if x.ndim != 1:
            raise ValueError("Not support x: [Time, Channel]")
-        waveform = np.expand_dims(x, -1)
+        waveform = x.astype("float32") / 32768.0
+        waveform = np.expand_dims(waveform, -1)
        return waveform
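The new `WavProcess.__call__` divides by `32768.0` before adding the trailing axis. The divisor suggests the input is expected to be 16-bit PCM, whose samples span -32768..32767, so the result lies in [-1.0, 1.0). That input assumption is not stated in the diff. A plain-Python stand-in for the NumPy code, using a hypothetical helper name, illustrates the scaling and the resulting (T, 1) layout:

```python
def normalize_pcm16(samples: list) -> list:
    """Scale 16-bit PCM integers to floats in [-1.0, 1.0) and
    wrap each sample in its own list, giving a (T, 1) layout."""
    return [[s / 32768.0] for s in samples]


if __name__ == "__main__":
    # Extremes and midpoints of the assumed int16 range.
    for row in normalize_pcm16([-32768, 0, 16384, 32767]):
        print(row)
```

If a caller already supplies float audio in [-1, 1], the same division would shrink it by a factor of 32768, so the input dtype is worth confirming against the data loader.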


--- a/paddlespeech/cli/base_commands.py
+++ b/paddlespeech/cli/base_commands.py
@ -83,7 +83,9 @@ model_name_format = {
    'st': 'Model-Source language-Target language',
    'text': 'Model-Task-Language',
    'tts': 'Model-Language',
-    'vector': 'Model-Sample Rate'
+    'vector': 'Model-Sample Rate',
+    'ssl': 'Model-Language-Sample Rate',
+    'whisper': 'Model-Language-Sample Rate'
 }


@ -94,7 +96,9 @@ class StatsCommand:
    def __init__(self):
        self.parser = argparse.ArgumentParser(
            prog='paddlespeech.stats', add_help=True)
-        self.task_choices = ['asr', 'cls', 'st', 'text', 'tts', 'vector', 'kws']
+        self.task_choices = [
+            'asr', 'cls', 'st', 'text', 'tts', 'vector', 'kws', 'ssl', 'whisper'
+        ]
        self.parser.add_argument(
            '--task',
            type=str,
@ -141,6 +145,12 @@ _commands = {
    'tts': ['Text to Speech infer command.', 'TTSExecutor'],
    'vector': ['Speech to vector embedding infer command.', 'VectorExecutor'],
    'kws': ['Keyword Spotting infer command.', 'KWSExecutor'],
+    'ssl':
+    ['Self-Supervised Learning Pretrained model infer command.', 'SSLExecutor'],
+    'whisper': [
+        'Whisper model for speech to text or translate speech to English.',
+        'WhisperExecutor'
+    ]
 }

 for com, info in _commands.items():
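The `'Model-Language-Sample Rate'` format registered for `ssl` and `whisper` describes how resource tags are laid out, and the `--model` choices in `ssl/infer.py` are derived by slicing each tag up to its first `-`. The sketch below shows both operations on the tag `wav2vec2ASR_librispeech-en-16k`, which is taken from the `--lang` help text. The function names are hypothetical, and it assumes tags contain exactly three `-`-separated fields, which this diff does not confirm.

```python
def parse_ssl_tag(tag: str) -> dict:
    """Split a tag shaped like 'Model-Language-Sample Rate' (assumed 3 fields)."""
    model, lang, sample_rate = tag.split("-")
    return {"model": model, "lang": lang, "sample_rate": sample_rate}


def model_choice(tag: str) -> str:
    """Same slice as the `choices` expression: text before the first '-'."""
    return tag[:tag.index("-")]


if __name__ == "__main__":
    tag = "wav2vec2ASR_librispeech-en-16k"
    print(parse_ssl_tag(tag))
    print(model_choice(tag))
```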
--- a/paddlespeech/cli/ssl/init.py
+++ b/paddlespeech/cli/ssl/init.py
@ -0,0 +1,14 @@
+# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from .infer import SSLExecutor
--- a/paddlespeech/cli/ssl/infer.py
+++ b/paddlespeech/cli/ssl/infer.py
@ -0,0 +1,449 @@
+# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import argparse
+import io
+import os
+import sys
+import time
+from collections import OrderedDict
+from typing import List
+from typing import Optional
+from typing import Union
+
+import librosa
+import numpy as np
+import paddle
+import soundfile
+from yacs.config import CfgNode
+
+from ..executor import BaseExecutor
+from ..log import logger
+from ..utils import CLI_TIMER
+from ..utils import stats_wrapper
+from ..utils import timer_register
+from paddlespeech.audio.transform.transformation import Transformation
+from paddlespeech.s2t.frontend.featurizer.text_featurizer import TextFeaturizer
+from paddlespeech.s2t.utils.utility import UpdateConfig
+
+__all__ = ['SSLExecutor']
+
+
+@timer_register
+class SSLExecutor(BaseExecutor):
+    def __init__(self):
+        super().__init__('ssl')
+        self.parser = argparse.ArgumentParser(
+            prog='paddlespeech.ssl', add_help=True)
+        self.parser.add_argument(
+            '--input', type=str, default=None, help='Audio file to recognize.')
+        self.parser.add_argument(
+            '--model',
+            type=str,
+            default='wav2vec2ASR_librispeech',
+            choices=[
+                tag[:tag.index('-')]
+                for tag in self.task_resource.pretrained_models.keys()
+            ],
+            help='Choose model type of asr task.')
+        self.parser.add_argument(
+            '--task',
+            type=str,
+            default='asr',
+            choices=['asr', 'vector'],
+            help='Choose output type for ssl task')
+        self.parser.add_argument(
+            '--lang',
+            type=str,
+            default='en',
+            help='Choose model language. zh or en, zh:[wav2vec2ASR_aishell1-zh-16k], en:[wav2vec2ASR_librispeech-en-16k]'
+        )
+        self.parser.add_argument(
+            "--sample_rate",
+            type=int,
+            default=16000,
+            choices=[8000, 16000],
+            help='Choose the audio sample rate of the model. 8000 or 16000')
+        self.parser.add_argument(
+            '--config',
+            type=str,
+            default=None,
+            help='Config of asr task. Use default config when it is None.')
+        self.parser.add_argument(
+            '--decode_method',
+            type=str,
+            default='ctc_greedy_search',
+            choices=[
+                'ctc_greedy_search',
+                'ctc_prefix_beam_search',
+            ],
+            help='only support asr task')
+        self.parser.add_argument(
+            '--ckpt_path',
+            type=str,
+            default=None,
+            help='Checkpoint file of model.')
+        self.parser.add_argument(
+            '--yes',
+            '-y',
+            action="store_true",
+            default=False,
+            help='Flag that takes no value. \
+            When set, all prompts from the program are accepted automatically, \
+            including the prompt to convert the audio sample rate.')
+        self.parser.add_argument(
+            '--rtf',
+            action="store_true",
+            default=False,
+            help='Show Real-time Factor(RTF).')
+        self.parser.add_argument(
+            '--device',
+            type=str,
+            default=paddle.get_device(),
+            help='Choose device to execute model inference.')
+        self.parser.add_argument(
+            '-d',
+            '--job_dump_result',
+            action='store_true',
+            help='Save job result into file.')
+        self.parser.add_argument(
+            '-v',
+            '--verbose',
+            action='store_true',
+            help='Increase logger verbosity of current task.')
+
+    def _init_from_path(self,
+                        model_type: str='wav2vec2ASR_librispeech',
+                        task: str='asr',
+                        lang: str='en',
+                        sample_rate: int=16000,
+                        cfg_path: Optional[os.PathLike]=None,
+                        decode_method: str='ctc_greedy_search',
+                        ckpt_path: Optional[os.PathLike]=None):
+        """
+        Init model and other resources from a specific path.
+        """
+        logger.debug("start to init the model")
+        # default max_len: unit:second
+        self.max_len = 50
+        if hasattr(self, 'model'):
+            logger.debug('Model has already been initialized.')
+            return
+        if cfg_path is None or ckpt_path is None:
+            sample_rate_str = '16k' if sample_rate == 16000 else '8k'
+            if task == 'asr':
+                tag = model_type + '-' + lang + '-' + sample_rate_str
+            else:
+                tag = 'wav2vec2' + '-' + lang + '-' + sample_rate_str
+            self.task_resource.set_task_model(tag, version=None)
+            self.res_path = self.task_resource.res_dir
+
+            self.cfg_path = os.path.join(
+                self.res_path, self.task_resource.res_dict['cfg_path'])
+            self.ckpt_path = os.path.join(
+                self.res_path,
+                self.task_resource.res_dict['ckpt_path'] + ".pdparams")
+            logger.debug(self.res_path)
+        else:
+            self.cfg_path = os.path.abspath(cfg_path)
+            self.ckpt_path = os.path.abspath(ckpt_path + ".pdparams")
+            self.res_path = os.path.dirname(
+                os.path.dirname(os.path.abspath(self.cfg_path)))
+        logger.debug(self.cfg_path)
+        logger.debug(self.ckpt_path)
+
+        #Init body.
+        self.config = CfgNode(new_allowed=True)
+        self.config.merge_from_file(self.cfg_path)
+        if task == 'asr':
+            with UpdateConfig(self.config):
+                self.text_feature = TextFeaturizer(
+                    unit_type=self.config.unit_type,
+                    vocab=self.config.vocab_filepath)
+                self.config.decode.decoding_method = decode_method
+            model_name = model_type[:model_type.rindex(
+                '_')]  # model_type: {model_name}_{dataset}
+        else:
+            model_name = 'wav2vec2'
+        model_class = self.task_resource.get_model_class(model_name)
+
+        model_conf = self.config
+        model = model_class.from_config(model_conf)
+        self.model = model
+        self.model.eval()
+
+        # load model
+        model_dict = paddle.load(self.ckpt_path)
+        if task == 'asr':
+            self.model.set_state_dict(model_dict)
+        else:
+            self.model.wav2vec2.set_state_dict(model_dict)
+
+    def preprocess(self, model_type: str, input: Union[str, os.PathLike]):
+        """
+        Input preprocess and return paddle.Tensor stored in self.input.
+        Input content can be a text(tts), a file(asr, cls) or a streaming(not supported yet).
+        """
+
+        audio_file = input
+        if isinstance(audio_file, (str, os.PathLike)):
+            logger.debug(f"Preprocess audio_file: {audio_file}")
+        elif isinstance(audio_file, io.BytesIO):
+            audio_file.seek(0)
+
+        # Get the object for feature extraction
+        logger.debug("get the preprocess conf")
+        preprocess_conf = self.config.preprocess_config
+        preprocess_args = {"train": False}
+        preprocessing = Transformation(preprocess_conf)
+        logger.debug("read the audio file")
+        audio, audio_sample_rate = soundfile.read(
+            audio_file, dtype="int16", always_2d=True)
+        if self.change_format:
+            if audio.shape[1] >= 2:
+                audio = audio.mean(axis=1).round().astype(np.int16)
+            else:
+                audio = audio[:, 0]
+            # pcm16 -> pcm 32
+            audio = self._pcm16to32(audio)
+            audio = librosa.resample(
+                audio, orig_sr=audio_sample_rate, target_sr=self.sample_rate)
+            audio_sample_rate = self.sample_rate
+            # pcm32 -> pcm 16
+            audio = self._pcm32to16(audio)
+        else:
+            audio = audio[:, 0]
+
+        logger.debug(f"audio shape: {audio.shape}")
+        # fbank
+        audio = preprocessing(audio, **preprocess_args)
+
+        audio_len = paddle.to_tensor(audio.shape[0])
+        audio = paddle.to_tensor(audio, dtype='float32').unsqueeze(axis=0)
+
+        self._inputs["audio"] = audio
+        self._inputs["audio_len"] = audio_len
+        logger.debug(f"audio feat shape: {audio.shape}")
+
+        logger.debug("audio feat process success")
+
+    @paddle.no_grad()
+    def infer(self, model_type: str, task: str):
+        """
+        Model inference and result stored in self.output.
+        """
+        logger.debug("start to infer the model to get the output")
+        audio = self._inputs["audio"]
+        if task == 'asr':
+            cfg = self.config.decode
+            logger.debug(
+                f"we will use the wav2vec2ASR like model : {model_type}")
+            try:
+                result_transcripts = self.model.decode(
+                    audio,
+                    text_feature=self.text_feature,
+                    decoding_method=cfg.decoding_method,
+                    beam_size=cfg.beam_size)
+                self._outputs["result"] = result_transcripts[0][0]
+            except Exception as e:
+                logger.exception(e)
+        else:
+            logger.debug(
+                "we will use the wav2vec2 like model to extract audio feature")
+            try:
+                out_feature = self.model(audio[:, :, 0])
+                self._outputs["result"] = out_feature[0]
+            except Exception as e:
+                logger.exception(e)
+
+    def postprocess(self) -> Union[str, os.PathLike]:
+        """
+            Output postprocess and return human-readable results such as texts and audio files.
+        """
+        return self._outputs["result"]
+
+    def _pcm16to32(self, audio):
+        assert (audio.dtype == np.int16)
+        audio = audio.astype("float32")
+        bits = np.iinfo(np.int16).bits
+        audio = audio / (2**(bits - 1))
+        return audio
+
+    def _pcm32to16(self, audio):
+        assert (audio.dtype == np.float32)
+        bits = np.iinfo(np.int16).bits
+        audio = audio * (2**(bits - 1))
+        audio = np.round(audio).astype("int16")
+        return audio
+
+    def _check(self, audio_file: str, sample_rate: int, force_yes: bool=False):
+        self.sample_rate = sample_rate
+        if self.sample_rate != 16000 and self.sample_rate != 8000:
+            logger.error(
+                "invalid sample rate, please input --sample_rate 8000 or --sample_rate 16000")
+            return False
+
+        if isinstance(audio_file, (str, os.PathLike)):
+            if not os.path.isfile(audio_file):
+                logger.error("Please input the right audio file path")
+                return False
+        elif isinstance(audio_file, io.BytesIO):
+            audio_file.seek(0)
+
+        logger.debug("checking the audio file format......")
+        try:
+            audio, audio_sample_rate = soundfile.read(
+                audio_file, dtype="int16", always_2d=True)
+            audio_duration = audio.shape[0] / audio_sample_rate
+            if audio_duration > self.max_len:
+                logger.error(
+                    f"Please input an audio file shorter than {self.max_len} seconds.\n"
+                )
+                return False
+        except Exception as e:
+            logger.exception(e)
+            logger.error(
+                f"Cannot open the audio file. Please check that the audio file ({audio_file}) is in 'wav' format. \n \
+                 You can try using sox to convert the file format.\n \
+                 For example: \n \
+                 sample rate: 16k \n \
+                 sox input_audio.xx --rate 16k --bits 16 --channels 1 output_audio.wav \n \
+                 sample rate: 8k \n \
+                 sox input_audio.xx --rate 8k --bits 16 --channels 1 output_audio.wav \n \
+                 ")
+            return False
+        logger.debug("The sample rate is %d" % audio_sample_rate)
+        if audio_sample_rate != self.sample_rate:
+            logger.warning("The sample rate of the input file is not {}.\n \
+                            The program will resample the wav file to {}.\n \
+                            If the result does not meet your expectations,\n \
+                            please input a 16k, 16-bit, 1-channel wav file. \
+                        ".format(self.sample_rate, self.sample_rate))
+            if force_yes is False:
+                while (True):
+                    logger.debug(
+                        "Whether to change the sample rate and the channel. Y: change the sample rate. N: exit the program."
+                    )
+                    content = input("Input(Y/N):")
+                    if content.strip() == "Y" or content.strip(
+                    ) == "y" or content.strip() == "yes" or content.strip(
+                    ) == "Yes":
+                        logger.debug(
+                            "change the sample rate and channel to 16k and 1 channel"
+                        )
+                        break
+                    elif content.strip() == "N" or content.strip(
+                    ) == "n" or content.strip() == "no" or content.strip(
+                    ) == "No":
+                        logger.debug("Exit the program")
+                        return False
+                    else:
+                        logger.warning("Invalid input, please try again")
+
+            self.change_format = True
+        else:
+            logger.debug("The audio file format is right")
+            self.change_format = False
+
+        return True
+
+    def execute(self, argv: List[str]) -> bool:
+        """
+            Command line entry.
+        """
+        parser_args = self.parser.parse_args(argv)
+
+        model = parser_args.model
+        task = parser_args.task
+        lang = parser_args.lang
+        sample_rate = parser_args.sample_rate
+        config = parser_args.config
+        ckpt_path = parser_args.ckpt_path
+        decode_method = parser_args.decode_method
+        force_yes = parser_args.yes
+        rtf = parser_args.rtf
+        device = parser_args.device
+
+        if not parser_args.verbose:
+            self.disable_task_loggers()
+
+        task_source = self.get_input_source(parser_args.input)
+        task_results = OrderedDict()
+        has_exceptions = False
+
+        for id_, input_ in task_source.items():
+            try:
+                res = self(
+                    audio_file=input_,
+                    model=model,
+                    task=task,
+                    lang=lang,
+                    sample_rate=sample_rate,
+                    config=config,
+                    ckpt_path=ckpt_path,
+                    decode_method=decode_method,
+                    force_yes=force_yes,
+                    rtf=rtf,
+                    device=device)
+                task_results[id_] = res
+
+            except Exception as e:
+                has_exceptions = True
+                task_results[id_] = f'{e.__class__.__name__}: {e}'
+
+        if rtf:
+            self.show_rtf(CLI_TIMER[self.__class__.__name__])
+        self.process_task_results(parser_args.input, task_results,
+                                  parser_args.job_dump_result)
+        if has_exceptions:
+            return False
+        else:
+            return True
+
+    @stats_wrapper
+    def __call__(self,
+                 audio_file: os.PathLike,
+                 model: str='wav2vec2ASR_librispeech',
+                 task: str='asr',
+                 lang: str='en',
+                 sample_rate: int=16000,
+                 config: os.PathLike=None,
+                 ckpt_path: os.PathLike=None,
+                 decode_method: str='ctc_greedy_search',
+                 force_yes: bool=False,
+                 rtf: bool=False,
+                 device=paddle.get_device()):
+        """
+        Python API to call an executor.
+        """
+
+        audio_file = os.path.abspath(audio_file)
+        paddle.set_device(device)
+        self._init_from_path(model, task, lang, sample_rate, config,
+                             decode_method, ckpt_path)
+        if not self._check(audio_file, sample_rate, force_yes):
+            sys.exit(-1)
+        if rtf:
+            k = self.__class__.__name__
+            CLI_TIMER[k]['start'].append(time.time())
+        self.preprocess(model, audio_file)
+        self.infer(model, task)
+        res = self.postprocess()  # Retrieve result of asr.
+
+        if rtf:
+            CLI_TIMER[k]['end'].append(time.time())
+            audio, audio_sample_rate = soundfile.read(
+                audio_file, dtype="int16", always_2d=True)
+            CLI_TIMER[k]['extra'].append(audio.shape[0] / audio_sample_rate)
+
+        return res
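The `_init_from_path` method above builds the pretrained-resource tag from `model_type`, `lang`, and `sample_rate`, and the `--model` choices are derived by cutting each registered tag at its first `-`. The standalone sketch below mirrors that string logic so it can be checked in isolation. The helper names `build_ssl_tag` and `model_choices` are hypothetical and not part of PaddleSpeech, and the sketch does not verify that a tag is actually registered in the resource table.

```python
# Illustrative sketch only: `build_ssl_tag` and `model_choices` are
# hypothetical helper names, not part of PaddleSpeech.
from typing import Iterable, List


def build_ssl_tag(model_type: str, task: str, lang: str,
                  sample_rate: int) -> str:
    """Compose the resource tag the same way _init_from_path does."""
    sample_rate_str = '16k' if sample_rate == 16000 else '8k'
    if task == 'asr':
        return f"{model_type}-{lang}-{sample_rate_str}"
    # Any non-asr task (i.e. 'vector') ignores model_type and uses the
    # bare wav2vec2 encoder tag.
    return f"wav2vec2-{lang}-{sample_rate_str}"


def model_choices(tags: Iterable[str]) -> List[str]:
    """Derive --model choices by cutting each tag at its first '-'."""
    return [tag[:tag.index('-')] for tag in tags]


if __name__ == "__main__":
    print(build_ssl_tag('wav2vec2ASR_librispeech', 'asr', 'en', 16000))
    print(build_ssl_tag('wav2vec2ASR_librispeech', 'vector', 'en', 16000))
```

Note that with `--task vector` the `--model` value plays no part in the tag, so only `--lang` and `--sample_rate` select the feature-extraction resource.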
--- a/paddlespeech/cli/tts/infer.py
+++ b/paddlespeech/cli/tts/infer.py
@ -67,6 +67,7 @@ class TTSExecutor(BaseExecutor):
                'fastspeech2_mix',
                'tacotron2_csmsc',
                'tacotron2_ljspeech',
+                'fastspeech2_male',
            ],
            help='Choose acoustic model type of tts task.')
        self.parser.add_argument(
@ -122,6 +123,7 @@ class TTSExecutor(BaseExecutor):
                'hifigan_aishell3',
                'hifigan_vctk',
                'wavernn_csmsc',
+                'pwgan_male',
            ],
            help='Choose vocoder type of tts task.')

--- a/paddlespeech/cli/whisper/__init__.py
+++ b/paddlespeech/cli/whisper/__init__.py
@ -0,0 +1,14 @@
+# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from .infer import WhisperExecutor
--- a/paddlespeech/cli/whisper/infer.py
+++ b/paddlespeech/cli/whisper/infer.py
@ -0,0 +1,493 @@
+# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import argparse
+import io
+import os
+import sys
+import time
+from collections import OrderedDict
+from typing import List
+from typing import Optional
+from typing import Union
+
+import librosa
+import numpy as np
+import paddle
+import soundfile
+from yacs.config import CfgNode
+
+from ...utils.env import DATA_HOME
+from ..download import get_path_from_url
+from ..executor import BaseExecutor
+from ..log import logger
+from ..utils import CLI_TIMER
+from ..utils import stats_wrapper
+from ..utils import timer_register
+from paddlespeech.s2t.models.whisper import log_mel_spectrogram
+from paddlespeech.s2t.models.whisper import ModelDimensions
+from paddlespeech.s2t.models.whisper import Whisper
+from paddlespeech.s2t.models.whisper.tokenizer import LANGUAGES
+from paddlespeech.s2t.models.whisper.tokenizer import TO_LANGUAGE_CODE
+from paddlespeech.s2t.utils.utility import UpdateConfig
+
+__all__ = ['WhisperExecutor']
+
+
+@timer_register
+class WhisperExecutor(BaseExecutor):
+    def __init__(self):
+        super().__init__('whisper')
+        self.parser = argparse.ArgumentParser(
+            prog='paddlespeech.whisper', add_help=True)
+        self.parser.add_argument(
+            '--input', type=str, default=None, help='Audio file to recognize.')
+        self.parser.add_argument(
+            '--model',
+            type=str,
+            default='whisper',
+            choices=['whisper'],
+            help='Choose model type of asr task.')
+        self.parser.add_argument(
+            '--lang',
+            type=str,
+            default='',
+            choices=['', 'en'],
+            help='Choose model language. Default is "" (multilingual model); set "en" for the English-only model.'
+        )
+        self.parser.add_argument(
+            '--task',
+            type=str,
+            default='transcribe',
+            choices=["transcribe", "translate"],
+            help='Choose task type: transcribe or translate.')
+        self.parser.add_argument(
+            '--size',
+            type=str,
+            default='large',
+            choices=['large', 'medium', 'base', 'small', 'tiny'],
+            help='Choose model size. Only large is supported now, large:[whisper-large-16k]'
+        )
+        self.parser.add_argument(
+            '--language',
+            type=str,
+            default='None',
+            choices=sorted(LANGUAGES.keys()) + sorted(
+                [k.title() for k in TO_LANGUAGE_CODE.keys()]),
+            help='Choose model decode language. Default is None, recognized by model.'
+        )
+        self.parser.add_argument(
+            "--sample_rate",
+            type=int,
+            default=16000,
+            choices=[16000],
+            help='Choose the audio sample rate of the model. only support 16000')
+        self.parser.add_argument(
+            '--config',
+            type=str,
+            default=None,
+            help='Config of asr task. Use default config when it is None.')
+        self.parser.add_argument(
+            '--decode_method',
+            type=str,
+            default='ctc_prefix_beam_search',
+            choices=['ctc_greedy_search', 'ctc_prefix_beam_search'],
+            help='only support transformer and conformer model')
+        self.parser.add_argument(
+            '--ckpt_path',
+            type=str,
+            default=None,
+            help='Checkpoint file of model.')
+        self.parser.add_argument(
+            '--yes',
+            '-y',
+            action="store_true",
+            default=False,
+            help='Flag that takes no value. \
+            When set, all prompts from the program are accepted automatically, \
+            including the prompt to convert the audio sample rate.')
+        self.parser.add_argument(
+            '--rtf',
+            action="store_true",
+            default=False,
+            help='Show Real-time Factor(RTF).')
+        self.parser.add_argument(
+            '--device',
+            type=str,
+            default=paddle.get_device(),
+            help='Choose device to execute model inference.')
+        self.parser.add_argument(
+            '-d',
+            '--job_dump_result',
+            action='store_true',
+            help='Save job result into file.')
+        self.parser.add_argument(
+            '-v',
+            '--verbose',
+            action='store_true',
+            help='Increase logger verbosity of current task.')
+
+    def _init_from_path(self,
+                        model_type: str='whisper',
+                        lang: str='',
+                        task: str='transcribe',
+                        size: str='large',
+                        language: str='None',
+                        sample_rate: int=16000,
+                        cfg_path: Optional[os.PathLike]=None,
+                        decode_method: str='ctc_prefix_beam_search',
+                        num_decoding_left_chunks: int=-1,
+                        ckpt_path: Optional[os.PathLike]=None):
+        """
+        Init model and other resources from a specific path.
+        """
+        logger.debug("start to init the model")
+        # default max_len: unit:second
+        self.max_len = 50
+        if hasattr(self, 'model'):
+            logger.debug('Model has already been initialized.')
+            return
+
+        if cfg_path is None or ckpt_path is None:
+            sample_rate_str = '16k' if sample_rate == 16000 else '8k'
+            if lang == "":
+                tag = model_type + '-' + size + '-' + sample_rate_str
+            else:
+                tag = model_type + '-' + size + '-' + lang + '-' + sample_rate_str
+            self.task_resource.set_task_model(tag, version=None)
+            self.res_path = self.task_resource.res_dir
+
+            self.cfg_path = os.path.join(
+                self.res_path, self.task_resource.res_dict['cfg_path'])
+            self.ckpt_path = os.path.join(
+                self.res_path,
+                self.task_resource.res_dict['ckpt_path'] + ".pdparams")
+            logger.debug(self.res_path)
+
+        else:
+            self.cfg_path = os.path.abspath(cfg_path)
+            self.ckpt_path = os.path.abspath(ckpt_path + ".pdparams")
+            self.res_path = os.path.dirname(
+                os.path.dirname(os.path.abspath(self.cfg_path)))
+        logger.debug(self.cfg_path)
+        logger.debug(self.ckpt_path)
+
+        #Init body.
+        self.config = CfgNode(new_allowed=True)
+        self.config.merge_from_file(self.cfg_path)
+
+        with UpdateConfig(self.config):
+            if "whisper" in model_type:
+                resource_url = self.task_resource.res_dict['resource_data']
+                resource_md5 = self.task_resource.res_dict['resource_data_md5']
+
+                self.resource_path = os.path.join(
+                    DATA_HOME, self.task_resource.version, 'whisper')
+                self.download_resource(resource_url, self.resource_path,
+                                       resource_md5)
+            else:
+                raise Exception(f"Unsupported model type: {model_type}")
+
+        # load model
+        model_dict = paddle.load(self.ckpt_path)
+        dims = ModelDimensions(**model_dict["dims"])
+        self.model = Whisper(dims)
+        self.model.load_dict(model_dict)
+        self.model.eval()
+
+        #set task
+        if task is not None:
+            self.task = task
+
+        #set language
+        if language is not None:
+            if lang == 'en' and language != 'en':
+                logger.info(
+                    f"{model_type}-{size}-{lang} is an English-only model, setting language to 'en'.")
+                self.language = 'en'
+            else:
+                self.language = language
+
+    def preprocess(self, model_type: str, input: Union[str, os.PathLike]):
+        """
+        Input preprocess and return paddle.Tensor stored in self.input.
+        Input content can be a text(tts), a file(asr, cls) or a streaming(not supported yet).
+        """
+
+        audio_file = input
+        if isinstance(audio_file, (str, os.PathLike)):
+            logger.debug(f"Preprocess audio_file: {audio_file}")
+        elif isinstance(audio_file, io.BytesIO):
+            audio_file.seek(0)
+
+        # Get the object for feature extraction
+        # whisper hard-coded audio hyperparameters, params in paddlespeech/s2t/models/whisper/whisper.py
+        logger.debug("read the audio file")
+        audio, audio_sample_rate = soundfile.read(
+            audio_file, dtype="float32", always_2d=True)
+        if self.change_format:
+            if audio.shape[1] >= 2:
+                audio = audio.mean(axis=1)
+            else:
+                audio = audio[:, 0]
+            # The audio was read as float32, so no PCM16 <-> float32
+            # conversion is needed around resampling (the helpers assert
+            # on dtype and would fail here).
+            audio = librosa.resample(
+                audio, orig_sr=audio_sample_rate, target_sr=self.sample_rate)
+            audio_sample_rate = self.sample_rate
+            # Keep float32 so both branches feed the same dtype downstream.
+        else:
+            audio = audio[:, 0]
+
+        logger.debug(f"audio shape: {audio.shape}")
+        # log-Mel spectrogram
+        audio = log_mel_spectrogram(audio, resource_path=self.resource_path)
+
+        audio_len = paddle.to_tensor(audio.shape[0])
+
+        self._inputs["audio"] = audio
+        self._inputs["audio_len"] = audio_len
+        logger.debug(f"audio feat shape: {audio.shape}")
+
+        logger.debug("audio feat process success")
+
+    @paddle.no_grad()
+    def infer(self, model_type: str):
+        """
+        Model inference and result stored in self.output.
+        """
+        logger.debug("start to infer the model to get the output")
+        cfg = self.config
+        audio = self._inputs["audio"]
+        if cfg.temperature_increment_on_fallback is not None:
+            temperature = tuple(
+                np.arange(cfg.temperature, 1.0 + 1e-6,
+                          cfg.temperature_increment_on_fallback))
+        else:
+            temperature = [cfg.temperature]
+
+        self._outputs["result"] = self.model.transcribe(
+            audio,
+            verbose=cfg.verbose,
+            task=self.task,
+            language=self.language,
+            resource_path=self.resource_path,
+            temperature=temperature,
+            compression_ratio_threshold=cfg.compression_ratio_threshold,
+            logprob_threshold=cfg.logprob_threshold,
+            best_of=cfg.best_of,
+            beam_size=cfg.beam_size,
+            patience=cfg.patience,
+            length_penalty=cfg.length_penalty,
+            initial_prompt=cfg.initial_prompt,
+            condition_on_previous_text=cfg.condition_on_previous_text,
+            no_speech_threshold=cfg.no_speech_threshold)
+
+    def postprocess(self) -> Union[str, os.PathLike]:
+        """
+            Output postprocess and return human-readable results such as texts and audio files.
+        """
+        return self._outputs["result"]
+
+    def download_resource(self, url, lm_dir, md5sum):
+        download_path = get_path_from_url(
+            url=url,
+            root_dir=lm_dir,
+            md5sum=md5sum,
+            decompress=True, )
+
+    def _pcm16to32(self, audio):
+        assert (audio.dtype == np.int16)
+        audio = audio.astype("float32")
+        bits = np.iinfo(np.int16).bits
+        audio = audio / (2**(bits - 1))
+        return audio
+
+    def _pcm32to16(self, audio):
+        assert (audio.dtype == np.float32)
+        bits = np.iinfo(np.int16).bits
+        audio = audio * (2**(bits - 1))
+        audio = np.round(audio).astype("int16")
+        return audio
+
+    def _check(self, audio_file: str, sample_rate: int, force_yes: bool=False):
+        self.sample_rate = sample_rate
+        if self.sample_rate != 16000 and self.sample_rate != 8000:
+            logger.error(
+                "invalid sample rate, please input --sample_rate 8000 or --sample_rate 16000")
+            return False
+
+        if isinstance(audio_file, (str, os.PathLike)):
+            if not os.path.isfile(audio_file):
+                logger.error("Please input the right audio file path")
+                return False
+        elif isinstance(audio_file, io.BytesIO):
+            audio_file.seek(0)
+
+        logger.debug("checking the audio file format......")
+        try:
+            audio, audio_sample_rate = soundfile.read(
+                audio_file, dtype="int16", always_2d=True)
+            audio_duration = audio.shape[0] / audio_sample_rate
+            if audio_duration > self.max_len:
+                logger.error(
+                    f"Please input an audio file shorter than {self.max_len} seconds.\n"
+                )
+                return False
+        except Exception as e:
+            logger.exception(e)
+            logger.error(
+                f"Cannot open the audio file. Please check that the audio file ({audio_file}) is in 'wav' format. \n \
+                 You can try using sox to convert the file format.\n \
+                 For example: \n \
+                 sample rate: 16k \n \
+                 sox input_audio.xx --rate 16k --bits 16 --channels 1 output_audio.wav \n \
+                 sample rate: 8k \n \
+                 sox input_audio.xx --rate 8k --bits 16 --channels 1 output_audio.wav \n \
+                 ")
+            return False
+        logger.debug("The sample rate is %d" % audio_sample_rate)
+        if audio_sample_rate != self.sample_rate:
+            logger.warning("The sample rate of the input file is not {}.\n \
+                            The program will resample the wav file to {}.\n \
+                            If the result does not meet your expectations,\n \
+                            please input a 16k, 16-bit, 1-channel wav file. \
+                        ".format(self.sample_rate, self.sample_rate))
+            if force_yes is False:
+                while (True):
+                    logger.debug(
+                        "Whether to change the sample rate and the channel. Y: change the sample rate. N: exit the program."
+                    )
+                    content = input("Input(Y/N):")
+                    if content.strip() == "Y" or content.strip(
+                    ) == "y" or content.strip() == "yes" or content.strip(
+                    ) == "Yes":
+                        logger.debug(
+                            "change the sample rate, channel to 16k and 1 channel"
+                        )
+                        break
+                    elif content.strip() == "N" or content.strip(
+                    ) == "n" or content.strip() == "no" or content.strip(
+                    ) == "No":
+                        logger.debug("Exit the program")
+                        return False
+                    else:
+                        logger.warning("Invalid input, please input again")
+
+            self.change_format = True
+        else:
+            logger.debug("The audio file format is right")
+            self.change_format = False
+
+        return True
+
+    def execute(self, argv: List[str]) -> bool:
+        """
+            Command line entry.
+        """
+        parser_args = self.parser.parse_args(argv)
+
+        model = parser_args.model
+        lang = parser_args.lang
+        task = parser_args.task
+        size = parser_args.size
+        language = parser_args.language
+        sample_rate = parser_args.sample_rate
+        config = parser_args.config
+        ckpt_path = parser_args.ckpt_path
+        decode_method = parser_args.decode_method
+        force_yes = parser_args.yes
+        rtf = parser_args.rtf
+        device = parser_args.device
+
+        if not parser_args.verbose:
+            self.disable_task_loggers()
+
+        task_source = self.get_input_source(parser_args.input)
+        task_results = OrderedDict()
+        has_exceptions = False
+
+        for id_, input_ in task_source.items():
+            try:
+                res = self(
+                    audio_file=input_,
+                    model=model,
+                    lang=lang,
+                    task=task,
+                    size=size,
+                    language=language,
+                    sample_rate=sample_rate,
+                    config=config,
+                    ckpt_path=ckpt_path,
+                    decode_method=decode_method,
+                    force_yes=force_yes,
+                    rtf=rtf,
+                    device=device)
+                task_results[id_] = res
+            except Exception as e:
+                has_exceptions = True
+                task_results[id_] = f'{e.__class__.__name__}: {e}'
+
+        if rtf:
+            self.show_rtf(CLI_TIMER[self.__class__.__name__])
+
+        self.process_task_results(parser_args.input, task_results,
+                                  parser_args.job_dump_result)
+
+        if has_exceptions:
+            return False
+        else:
+            return True
+
+    @stats_wrapper
+    def __call__(self,
+                 audio_file: os.PathLike,
+                 model: str='whisper',
+                 lang: str='',
+                 task: str='transcribe',
+                 size: str='large',
+                 language: str='None',
+                 sample_rate: int=16000,
+                 config: os.PathLike=None,
+                 ckpt_path: os.PathLike=None,
+                 decode_method: str='attention_rescoring',
+                 num_decoding_left_chunks: int=-1,
+                 force_yes: bool=False,
+                 rtf: bool=False,
+                 device=paddle.get_device()):
+        """
+        Python API to call an executor.
+        """
+        audio_file = os.path.abspath(audio_file)
+        paddle.set_device(device)
+        self._init_from_path(model, lang, task, size, language, sample_rate,
+                             config, decode_method, num_decoding_left_chunks,
+                             ckpt_path)
+        if not self._check(audio_file, sample_rate, force_yes):
+            sys.exit(-1)
+        if rtf:
+            k = self.__class__.__name__
+            CLI_TIMER[k]['start'].append(time.time())
+
+        self.preprocess(model, audio_file)
+        self.infer(model)
+        res = self.postprocess()  # Retrieve result of asr.
+
+        if rtf:
+            CLI_TIMER[k]['end'].append(time.time())
+            audio, audio_sample_rate = soundfile.read(
+                audio_file, dtype="int16", always_2d=True)
+            CLI_TIMER[k]['extra'].append(audio.shape[0] / audio_sample_rate)
+
+        return res
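
The confirmation loop in `_check` above compares the reply against a fixed set of literal spellings ("Y", "y", "yes", "Yes", and the matching "no" forms). A minimal sketch of how that decision could be normalized in one place follows. The helper name `parse_yes_no` is hypothetical and not part of PaddleSpeech, and unlike the original loop it also accepts other capitalizations such as "YES".

```python
from typing import Optional


def parse_yes_no(content: str) -> Optional[bool]:
    """Hypothetical helper: map a raw reply to True (yes), False (no),
    or None when the reply is not recognized and should be asked again."""
    answer = content.strip().lower()
    if answer in ("y", "yes"):
        return True
    if answer in ("n", "no"):
        return False
    return None
```

A caller would keep prompting while the helper returns `None`, continue with resampling on `True`, and return `False` from the check on `False`.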
--- a/paddlespeech/resource/model_alias.py
+++ b/paddlespeech/resource/model_alias.py
@ -18,6 +18,12 @@ __all__ = [

 # Records of model name to import class
 model_alias = {
+    # ---------------------------------
+    # -------------- SSL --------------
+    # ---------------------------------
+    "wav2vec2ASR": ["paddlespeech.s2t.models.wav2vec2:Wav2vec2ASR"],
+    "wav2vec2": ["paddlespeech.s2t.models.wav2vec2:Wav2vec2Base"],
+
    # ---------------------------------
    # -------------- ASR --------------
    # ---------------------------------
@ -29,6 +35,11 @@ model_alias = {
    "transformer": ["paddlespeech.s2t.models.u2:U2Model"],
    "wenetspeech": ["paddlespeech.s2t.models.u2:U2Model"],

+    # ---------------------------------
+    # ------------ Whisper ------------
+    # ---------------------------------
+    "whisper": ["paddlespeech.s2t.models.whisper:Whisper"],
+
    # ---------------------------------
    # -------------- CLS --------------
    # ---------------------------------
--- a/paddlespeech/resource/pretrained_models.py
+++ b/paddlespeech/resource/pretrained_models.py
@ -25,6 +25,8 @@ __all__ = [
    'tts_static_pretrained_models',
    'tts_onnx_pretrained_models',
    'vector_dynamic_pretrained_models',
+    'ssl_dynamic_pretrained_models',
+    'whisper_dynamic_pretrained_models',
 ]

 # The tags for pretrained_models should be "{model_name}[_{dataset}][-{lang}][-...]".
@ -32,6 +34,44 @@ __all__ = [
 # Command line and python api use "{model_name}[_{dataset}]" as --model, usage:
 # "paddlespeech asr --model conformer_wenetspeech --lang zh --sr 16000 --input ./input.wav"

+# ---------------------------------
+# -------------- SSL --------------
+# ---------------------------------
+ssl_dynamic_pretrained_models = {
+    "wav2vec2-en-16k": {
+        '1.3': {
+            'url':
+            'https://paddlespeech.bj.bcebos.com/s2t/librispeech/asr3/wav2vec2-large-960h-lv60-self_ckpt_1.3.0.model.tar.gz',
+            'md5':
+            'acc46900680e341e500437aa59193518',
+            'cfg_path':
+            'model.yaml',
+            'ckpt_path':
+            'wav2vec2-large-960h-lv60-self',
+            'model':
+            'wav2vec2-large-960h-lv60-self.pdparams',
+            'params':
+            'wav2vec2-large-960h-lv60-self.pdparams',
+        },
+    },
+    "wav2vec2ASR_librispeech-en-16k": {
+        '1.3': {
+            'url':
+            'https://paddlespeech.bj.bcebos.com/s2t/librispeech/asr3/wav2vec2ASR-large-960h-librispeech_ckpt_1.3.1.model.tar.gz',
+            'md5':
+            'cbe28d6c78f3dd2e189968402381f454',
+            'cfg_path':
+            'model.yaml',
+            'ckpt_path':
+            'exp/wav2vec2ASR/checkpoints/avg_1',
+            'model':
+            'exp/wav2vec2ASR/checkpoints/avg_1.pdparams',
+            'params':
+            'exp/wav2vec2ASR/checkpoints/avg_1.pdparams',
+        },
+    },
+}
+
 # ---------------------------------
 # -------------- ASR --------------
 # ---------------------------------
@ -424,6 +464,189 @@ asr_onnx_pretrained_models = {
    },
 }

+whisper_dynamic_pretrained_models = {
+    "whisper-large-16k": {
+        '1.3': {
+            'url':
+            'https://paddlespeech.bj.bcebos.com/whisper/whisper_model_20221122/whisper-large-model.tar.gz',
+            'md5':
+            'cf1557af9d8ffa493fefad9cb08ae189',
+            'cfg_path':
+            'whisper.yaml',
+            'ckpt_path':
+            'whisper-large-model',
+            'model':
+            'whisper-large-model.pdparams',
+            'params':
+            'whisper-large-model.pdparams',
+            'resource_data':
+            'https://paddlespeech.bj.bcebos.com/whisper/whisper_model_20221108/assets.tar',
+            'resource_data_md5':
+            '37a0a8abdb3641a51194f79567a93b61',
+        },
+    },
+    "whisper-base-en-16k": {
+        '1.3': {
+            'url':
+            'https://paddlespeech.bj.bcebos.com/whisper/whisper_model_20221122/whisper-base-en-model.tar.gz',
+            'md5':
+            'b156529aefde6beb7726d2ea98fd067a',
+            'cfg_path':
+            'whisper.yaml',
+            'ckpt_path':
+            'whisper-base-en-model',
+            'model':
+            'whisper-base-en-model.pdparams',
+            'params':
+            'whisper-base-en-model.pdparams',
+            'resource_data':
+            'https://paddlespeech.bj.bcebos.com/whisper/whisper_model_20221108/assets.tar',
+            'resource_data_md5':
+            '37a0a8abdb3641a51194f79567a93b61',
+        },
+    },
+    "whisper-base-16k": {
+        '1.3': {
+            'url':
+            'https://paddlespeech.bj.bcebos.com/whisper/whisper_model_20221122/whisper-base-model.tar.gz',
+            'md5':
+            '6b012a5abd583db14398c3492e47120b',
+            'cfg_path':
+            'whisper.yaml',
+            'ckpt_path':
+            'whisper-base-model',
+            'model':
+            'whisper-base-model.pdparams',
+            'params':
+            'whisper-base-model.pdparams',
+            'resource_data':
+            'https://paddlespeech.bj.bcebos.com/whisper/whisper_model_20221108/assets.tar',
+            'resource_data_md5':
+            '37a0a8abdb3641a51194f79567a93b61',
+        },
+    },
+    "whisper-medium-en-16k": {
+        '1.3': {
+            'url':
+            'https://paddlespeech.bj.bcebos.com/whisper/whisper_model_20221122/whisper-medium-en-model.tar.gz',
+            'md5':
+            'c7f57d270bd20c7b170ba9dcf6c16f74',
+            'cfg_path':
+            'whisper.yaml',
+            'ckpt_path':
+            'whisper-medium-en-model',
+            'model':
+            'whisper-medium-en-model.pdparams',
+            'params':
+            'whisper-medium-en-model.pdparams',
+            'resource_data':
+            'https://paddlespeech.bj.bcebos.com/whisper/whisper_model_20221108/assets.tar',
+            'resource_data_md5':
+            '37a0a8abdb3641a51194f79567a93b61',
+        },
+    },
+    "whisper-medium-16k": {
+        '1.3': {
+            'url':
+            'https://paddlespeech.bj.bcebos.com/whisper/whisper_model_20221122/whisper-medium-model.tar.gz',
+            'md5':
+            '4c7dcd0df25f408199db4a4548336786',
+            'cfg_path':
+            'whisper.yaml',
+            'ckpt_path':
+            'whisper-medium-model',
+            'model':
+            'whisper-medium-model.pdparams',
+            'params':
+            'whisper-medium-model.pdparams',
+            'resource_data':
+            'https://paddlespeech.bj.bcebos.com/whisper/whisper_model_20221108/assets.tar',
+            'resource_data_md5':
+            '37a0a8abdb3641a51194f79567a93b61',
+        },
+    },
+    "whisper-small-en-16k": {
+        '1.3': {
+            'url':
+            'https://paddlespeech.bj.bcebos.com/whisper/whisper_model_20221122/whisper-small-en-model.tar.gz',
+            'md5':
+            '2b24efcb2e93f3275af7c0c7f598ff1c',
+            'cfg_path':
+            'whisper.yaml',
+            'ckpt_path':
+            'whisper-small-en-model',
+            'model':
+            'whisper-small-en-model.pdparams',
+            'params':
+            'whisper-small-en-model.pdparams',
+            'resource_data':
+            'https://paddlespeech.bj.bcebos.com/whisper/whisper_model_20221108/assets.tar',
+            'resource_data_md5':
+            '37a0a8abdb3641a51194f79567a93b61',
+        },
+    },
+    "whisper-small-16k": {
+        '1.3': {
+            'url':
+            'https://paddlespeech.bj.bcebos.com/whisper/whisper_model_20221122/whisper-small-model.tar.gz',
+            'md5':
+            '5a57911dd41651dd6ed78c5763912825',
+            'cfg_path':
+            'whisper.yaml',
+            'ckpt_path':
+            'whisper-small-model',
+            'model':
+            'whisper-small-model.pdparams',
+            'params':
+            'whisper-small-model.pdparams',
+            'resource_data':
+            'https://paddlespeech.bj.bcebos.com/whisper/whisper_model_20221108/assets.tar',
+            'resource_data_md5':
+            '37a0a8abdb3641a51194f79567a93b61',
+        },
+    },
+    "whisper-tiny-en-16k": {
+        '1.3': {
+            'url':
+            'https://paddlespeech.bj.bcebos.com/whisper/whisper_model_20221122/whisper-tiny-en-model.tar.gz',
+            'md5':
+            '14969164a3f713fd58e56978c34188f6',
+            'cfg_path':
+            'whisper.yaml',
+            'ckpt_path':
+            'whisper-tiny-en-model',
+            'model':
+            'whisper-tiny-en-model.pdparams',
+            'params':
+            'whisper-tiny-en-model.pdparams',
+            'resource_data':
+            'https://paddlespeech.bj.bcebos.com/whisper/whisper_model_20221108/assets.tar',
+            'resource_data_md5':
+            '37a0a8abdb3641a51194f79567a93b61',
+        },
+    },
+    "whisper-tiny-16k": {
+        '1.3': {
+            'url':
+            'https://paddlespeech.bj.bcebos.com/whisper/whisper_model_20221122/whisper-tiny-model.tar.gz',
+            'md5':
+            'a5b82a1f2067a2ca400f17fabd62b81b',
+            'cfg_path':
+            'whisper.yaml',
+            'ckpt_path':
+            'whisper-tiny-model',
+            'model':
+            'whisper-tiny-model.pdparams',
+            'params':
+            'whisper-tiny-model.pdparams',
+            'resource_data':
+            'https://paddlespeech.bj.bcebos.com/whisper/whisper_model_20221108/assets.tar',
+            'resource_data_md5':
+            '37a0a8abdb3641a51194f79567a93b61',
+        },
+    },
+}
+
 # ---------------------------------
 # -------------- CLS --------------
 # ---------------------------------
@ -723,6 +946,22 @@ tts_dynamic_pretrained_models = {
            'speaker_id_map.txt',
        },
    },
+    "fastspeech2_male-zh": {
+        '1.0': {
+            'url':
+            'https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_male_ckpt_1.3.0.zip',
+            'md5':
+            'a4b1a2f667b878ec8f67375357b04282',
+            'config':
+            'default.yaml',
+            'ckpt':
+            'snapshot_iter_76000.pdz',
+            'speech_stats':
+            'speech_stats.npy',
+            'phones_dict':
+            'phone_id_map.txt',
+        },
+    },
    # tacotron2
    "tacotron2_csmsc-zh": {
        '1.0': {
@ -813,6 +1052,20 @@ tts_dynamic_pretrained_models = {
            'feats_stats.npy',
        },
    },
+    "pwgan_male-zh": {
+        '1.0': {
+            'url':
+            'https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_male_ckpt_1.3.0.zip',
+            'md5':
+            'c98cdb889c809973f8cc764437311132',
+            'config':
+            'default.yaml',
+            'ckpt':
+            'snapshot_iter_200000.pdz',
+            'speech_stats':
+            'feats_stats.npy',
+        },
+    },
    # mb_melgan
    "mb_melgan_csmsc-zh": {
        '1.0': {
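
Each entry in the tables above pairs a `url` with an `md5` value. As an illustration only, the sketch below shows one way a downloaded archive could be checked against such a value using the standard library. The function name `md5_matches` and the chunked-read strategy are assumptions for this example; the download utility PaddleSpeech actually uses may work differently.

```python
import hashlib


def md5_matches(path: str, expected_md5: str, chunk_size: int = 1 << 20) -> bool:
    """Hypothetical helper: return True if the file at `path` has the given MD5.

    The file is read in chunks so that large model archives do not need
    to fit in memory.
    """
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest() == expected_md5.lower()
```

MD5 here serves only as an integrity check against truncated or corrupted downloads; it is not a defense against deliberate tampering.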
--- a/paddlespeech/resource/resource.py
+++ b/paddlespeech/resource/resource.py
@ -22,7 +22,9 @@ from ..utils.dynamic_import import dynamic_import
 from ..utils.env import MODEL_HOME
 from .model_alias import model_alias

-task_supported = ['asr', 'cls', 'st', 'text', 'tts', 'vector', 'kws']
+task_supported = [
+    'asr', 'cls', 'st', 'text', 'tts', 'vector', 'kws', 'ssl', 'whisper'
+]
 model_format_supported = ['dynamic', 'static', 'onnx']
 inference_mode_supported = ['online', 'offline']

@ -108,7 +110,6 @@ class CommonTaskResource:
        """
        assert model_name in model_alias, 'No model classes found for "{}"'.format(
            model_name)
-
        ret = []
        for import_path in model_alias[model_name]:
            ret.append(dynamic_import(import_path))
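
The `model_alias` entries use a `"package.module:ClassName"` string, which `dynamic_import` resolves to a class. The sketch below shows how such a string could be resolved with `importlib`. It is an assumption about the general technique, not the actual implementation in `paddlespeech.utils.dynamic_import`, and the name `load_class` is hypothetical.

```python
import importlib


def load_class(import_path: str):
    """Hypothetical helper: resolve a "module.path:ClassName" string to the class."""
    module_name, _, class_name = import_path.partition(":")
    if not module_name or not class_name:
        raise ValueError(
            f'expected "module.path:ClassName", got "{import_path}"')
    module = importlib.import_module(module_name)
    return getattr(module, class_name)
```

For example, `load_class("collections:OrderedDict")` returns the `OrderedDict` class from the standard library.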
--- a/paddlespeech/s2t/exps/wav2vec2/__init__.py
+++ b/paddlespeech/s2t/exps/wav2vec2/__init__.py
@ -0,0 +1,13 @@
+# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
--- a/paddlespeech/s2t/exps/wav2vec2/bin/__init__.py
+++ b/paddlespeech/s2t/exps/wav2vec2/bin/__init__.py
@ -1,4 +1,4 @@
-# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
+# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
--- a/paddlespeech/s2t/exps/wav2vec2/bin/train.py
+++ b/paddlespeech/s2t/exps/wav2vec2/bin/train.py
@ -34,9 +34,10 @@ def main(config, args):

 if __name__ == "__main__":
    parser = default_argument_parser()
+    parser.add_argument(
+        '--resume', type=str, default="", nargs="?", help='resume ckpt path.')
    args = parser.parse_args()
    print_arguments(args, globals())
-
    # https://yaml.org/type/float.html
    config = CfgNode(new_allowed=True)
    if args.config:
--- a/paddlespeech/s2t/exps/wav2vec2/model.py
+++ b/paddlespeech/s2t/exps/wav2vec2/model.py
@ -15,6 +15,7 @@
 import json
 import math
 import os
+import re
 import time
 from collections import defaultdict
 from collections import OrderedDict
@ -62,6 +63,19 @@ class Wav2Vec2ASRTrainer(Trainer):
            self.avg_train_loss -= self.avg_train_loss / (batch_index + 1)
            self.avg_train_loss += loss / (batch_index + 1)

+    def before_train(self):
+        from_scratch = self.resume_or_scratch()
+        if from_scratch:
+            # scratch: save init model, i.e. 0 epoch
+            self.save(tag='init', infos=None)
+        else:
+            # resume: train next_epoch and next_iteration
+            self.epoch += 1
+            logger.info(
+                f"Resume train: epoch {self.epoch }, step {self.iteration}!")
+
+        self.maybe_batch_sampler_step()
+
    def train_batch(self, batch_index, batch, msg):
        train_conf = self.config
        start = time.time()
@ -69,13 +83,14 @@ class Wav2Vec2ASRTrainer(Trainer):
        # forward
        utt, wav, wavs_lens, target, target_lens = batch
        wavs_lens_rate = wavs_lens / wav.shape[1]
-        target_lens_rate = target_lens / target.shape[1]
+
        wav = wav[:, :, 0]
-        wav = self.speech_augmentation(wav, wavs_lens_rate)
-        loss = self.model(wav, wavs_lens_rate, target, target_lens_rate)
+        if hasattr(train_conf, 'audio_augment'):
+            wav = self.speech_augmentation(wav, wavs_lens_rate)
+
+        loss = self.model(wav, wavs_lens_rate, target, target_lens)
        # loss div by `batch_size * accum_grad`
        loss /= train_conf.accum_grad
-
        # update self.avg_train_loss
        self.update_average(batch_index, float(loss))

@ -97,11 +112,17 @@ class Wav2Vec2ASRTrainer(Trainer):

        # optimizer step old
        if (batch_index + 1) % train_conf.accum_grad == 0:
-            self.optimizer.step()
-            self.optimizer.clear_grad()
-            self.lr_scheduler.step()
+            self.model_optimizer.step()
+            self.model_optimizer.clear_grad()
+            if not train_conf.freeze_wav2vec2:
+                self.wav2vec2_optimizer.step()
+                self.wav2vec2_optimizer.clear_grad()
+            if self.config.model_scheduler != 'newbobscheduler':
+                self.model_lr_scheduler.step()
+            if self.config.wav2vec2_scheduler != 'newbobscheduler':
+                if not train_conf.freeze_wav2vec2:
+                    self.wav2vec2_lr_scheduler.step()
            self.iteration += 1
-
        losses_np = {'loss': self.avg_train_loss * train_conf.accum_grad}
        iteration_time = time.time() - start
        for k, v in losses_np.items():
@ -113,7 +134,10 @@ class Wav2Vec2ASRTrainer(Trainer):
        if (batch_index + 1) % train_conf.accum_grad == 0:
            if dist.get_rank() == 0 and self.visualizer:
                losses_np_v = losses_np.copy()
-                losses_np_v.update({"lr": self.lr_scheduler()})
+                losses_np_v.update({
+                    "model_lr": self.model_lr_scheduler(),
+                    "wav2vec2_lr": self.wav2vec2_lr_scheduler()
+                })
                for key, val in losses_np_v.items():
                    self.visualizer.add_scalar(
                        tag='train/' + key, value=val, step=self.iteration - 1)
@ -130,11 +154,10 @@ class Wav2Vec2ASRTrainer(Trainer):
        for i, batch in enumerate(self.valid_loader):
            utt, wav, wavs_lens, target, target_lens = batch
            wavs_lens_rate = wavs_lens / wav.shape[1]
-            target_lens_rate = target_lens / target.shape[1]
            wav = wav[:, :, 0]
-            loss = self.model(wav, wavs_lens_rate, target, target_lens_rate)
+            loss = self.model(wav, wavs_lens_rate, target, target_lens)

-            if paddle.isfinite(loss):
+            if math.isfinite(float(loss)):
                num_utts = batch[1].shape[0]
                num_seen_utts += num_utts
                total_loss += float(loss) * num_utts
@ -159,6 +182,106 @@ class Wav2Vec2ASRTrainer(Trainer):
            dist.get_rank(), total_loss / num_seen_utts))
        return total_loss, num_seen_utts

+    @mp_tools.rank_zero_only
+    def save(self, tag=None, infos: dict=None):
+        """Save checkpoint (model parameters and optimizer states).
+
+        Args:
+            tag (int or str, optional): None for step, else using tag, e.g. epoch. Defaults to None.
+            infos (dict, optional): meta data to save. Defaults to None.
+        """
+
+        infos = infos if infos else dict()
+        infos.update({
+            "epoch": self.epoch,
+            "model_lr": self.model_optimizer.get_lr(),
+            "wav2vec2_lr": self.wav2vec2_optimizer.get_lr()
+        })
+
+        checkpoint_path = os.path.join(
+            self.checkpoint_dir,
+            "{}".format(self.iteration if tag is None else tag))
+
+        model_dict = self.model.state_dict()
+        params_path = checkpoint_path + ".pdparams"
+        paddle.save(model_dict, params_path)
+        logger.info("Saved model to {}".format(params_path))
+
+        model_opt_dict = self.model_optimizer.state_dict()
+        wav2vec2_opt_dict = self.wav2vec2_optimizer.state_dict()
+
+        opt_dict = {'model': model_opt_dict, 'wav2vec2': wav2vec2_opt_dict}
+
+        optimizer_path = checkpoint_path + ".pdopt"
+        paddle.save(opt_dict, optimizer_path)
+        logger.info("Saved optimizer state to {}".format(optimizer_path))
+
+        scheduler_dict = {}
+
+        if self.config.model_scheduler == 'newbobscheduler':
+            scheduler_dict['model'] = self.model_lr_scheduler.save()
+        if self.config.wav2vec2_scheduler == 'newbobscheduler':
+            scheduler_dict['wav2vec2'] = self.wav2vec2_lr_scheduler.save()
+        if scheduler_dict:
+            scheduler_path = checkpoint_path + ".pdlrs"
+            paddle.save(scheduler_dict, scheduler_path)
+            logger.info("Saved scheduler state to {}".format(scheduler_path))
+        info_path = re.sub(r'\.pdparams$', '.json', params_path)
+        infos = {} if infos is None else infos
+        with open(info_path, 'w') as fout:
+            data = json.dumps(infos)
+            fout.write(data)
+
+    def resume_or_scratch(self):
+        """Resume from a specified checkpoint in the checkpoint directory,
+        or start training from scratch.
+
+        If ``args.resume`` is set, load that checkpoint (model, optimizers
+        and lr schedulers); otherwise initialize from scratch.
+        """
+        scratch = None
+        if self.args.resume:
+            # just restore ckpt
+            # lr will be restored from the optimizer ckpt
+            resume_json_path = os.path.join(self.checkpoint_dir,
+                                            self.args.resume + '.json')
+            with open(resume_json_path, 'r') as f:
+                resume_json = json.load(f)
+            self.iteration = 0
+            self.epoch = resume_json["epoch"]
+
+            # restore model from *.pdparams
+            params_path = os.path.join(self.checkpoint_dir,
+                                       "{}".format(self.epoch)) + '.pdparams'
+            model_dict = paddle.load(params_path)
+            self.model.set_state_dict(model_dict)
+
+            # restore optimizer from *.pdopt
+            optimizer_path = os.path.join(self.checkpoint_dir,
+                                          "{}".format(self.epoch)) + '.pdopt'
+            optimizer_dict = paddle.load(optimizer_path)
+            self.model_optimizer.set_state_dict(optimizer_dict['model'])
+            self.wav2vec2_optimizer.set_state_dict(optimizer_dict['wav2vec2'])
+
+            # restore lr_scheduler from *.pdlrs
+            scheduler_path = os.path.join(self.checkpoint_dir,
+                                          "{}".format(self.epoch)) + '.pdlrs'
+            if os.path.isfile(os.path.join(scheduler_path)):
+                scheduler_dict = paddle.load(scheduler_path)
+                if self.config.model_scheduler == 'newbobscheduler':
+                    self.model_lr_scheduler.load(scheduler_dict['model'])
+                if self.config.wav2vec2_scheduler == 'newbobscheduler':
+                    self.wav2vec2_lr_scheduler.load(scheduler_dict['wav2vec2'])
+            logger.info(
+                f"Restore ckpt: epoch {self.epoch }, step {self.iteration}!")
+            scratch = False
+        else:
+            self.iteration = 0
+            self.epoch = 0
+            scratch = True
+            logger.info("Init from scratch!")
+        return scratch
+
    def do_train(self):
        """The training process control by step."""
        # !!!IMPORTANT!!!
@ -169,7 +292,6 @@ class Wav2Vec2ASRTrainer(Trainer):
        # paddle.jit.save(script_model, script_model_path)

        self.before_train()
-
        if not self.use_streamdata:
            logger.info(
                f"Train Total Examples: {len(self.train_loader.dataset)}")
@ -186,7 +308,9 @@ class Wav2Vec2ASRTrainer(Trainer):
                            report("Rank", dist.get_rank())
                            report("epoch", self.epoch)
                            report('step', self.iteration)
-                            report("lr", self.lr_scheduler())
+                            report("model_lr", self.model_optimizer.get_lr())
+                            report("wav2vec2_lr",
+                                   self.wav2vec2_optimizer.get_lr())
                            self.train_batch(batch_index, batch, msg)
                            self.after_train_batch()
                            report('iter', batch_index + 1)
@ -224,15 +348,25 @@ class Wav2Vec2ASRTrainer(Trainer):
                    cv_loss = float(cv_loss)
                else:
                    cv_loss = total_loss / num_seen_utts
-
            logger.info(
                'Epoch {} Val info val_loss {}'.format(self.epoch, cv_loss))
            if self.visualizer:
                self.visualizer.add_scalar(
                    tag='eval/cv_loss', value=cv_loss, step=self.epoch)
                self.visualizer.add_scalar(
-                    tag='eval/lr', value=self.lr_scheduler(), step=self.epoch)
-
+                    tag='eval/model_lr',
+                    value=self.model_lr_scheduler(),
+                    step=self.epoch)
+                self.visualizer.add_scalar(
+                    tag='eval/wav2vec2_lr',
+                    value=self.wav2vec2_lr_scheduler(),
+                    step=self.epoch)
+
+            if self.config.model_scheduler == 'newbobscheduler':
+                self.model_lr_scheduler.step(cv_loss)
+            if self.config.wav2vec2_scheduler == 'newbobscheduler':
+                if not self.config.freeze_wav2vec2:
+                    self.wav2vec2_lr_scheduler.step(cv_loss)
            self.save(tag=self.epoch, infos={'val_loss': cv_loss})
            self.new_epoch()

@ -267,62 +401,93 @@ class Wav2Vec2ASRTrainer(Trainer):
                model_conf.output_dim = self.test_loader.vocab_size

        model = Wav2vec2ASR.from_config(model_conf)
+        model_dict = paddle.load(config.wav2vec2_params_path)
+        model.wav2vec2.set_state_dict(model_dict)

        if self.parallel:
            model = paddle.DataParallel(model, find_unused_parameters=True)
-
        logger.info(f"{model}")
        layer_tools.print_params(model, logger.info)
        self.model = model
        logger.info("Setup model!")

        # setup speech augmentation for wav2vec2
-        self.speech_augmentation = TimeDomainSpecAugment()
+        if hasattr(config, 'audio_augment') and self.train:
+            self.speech_augmentation = TimeDomainSpecAugment(
+                **config.audio_augment)

        if not self.train:
            return

        train_config = config
-        optim_type = train_config.model_optim
-        optim_conf = train_config.model_optim_conf
-        scheduler_type = train_config.scheduler
-        scheduler_conf = train_config.scheduler_conf
-
-        scheduler_args = {
-            "learning_rate": optim_conf.lr,
-            "verbose": False,
-            "warmup_steps": scheduler_conf.warmup_steps,
-            "gamma": scheduler_conf.lr_decay,
-            "d_model": model_conf.dnn_neurons,
-        }
-        lr_scheduler = LRSchedulerFactory.from_args(scheduler_type,
-                                                    scheduler_args)
+        model_optim_type = train_config.model_optim
+        model_optim_conf = train_config.model_optim_conf
+        wav2vec2_optim_type = train_config.model_optim
+        wav2vec2_optim_conf = train_config.wav2vec2_optim_conf
+
+        model_scheduler_type = train_config.model_scheduler
+        model_scheduler_conf = train_config.model_scheduler_conf
+        wav2vec2_scheduler_type = train_config.wav2vec2_scheduler
+        wav2vec2_scheduler_conf = train_config.wav2vec2_scheduler_conf
+
+        model_scheduler_args = dict(
+            **{"learning_rate": model_optim_conf.lr,
+               "verbose": False}, **(dict(model_scheduler_conf)))
+
+        wav2vec2_scheduler_args = dict(
+            **{"learning_rate": wav2vec2_optim_conf.lr,
+               "verbose": False}, **(dict(wav2vec2_scheduler_conf)))
+
+        model_lr_scheduler = LRSchedulerFactory.from_args(model_scheduler_type,
+                                                          model_scheduler_args)
+        wav2vec2_lr_scheduler = LRSchedulerFactory.from_args(
+            wav2vec2_scheduler_type, wav2vec2_scheduler_args)

        def optimizer_args(
                config,
+                optim_type,
+                optim_conf,
                parameters,
                lr_scheduler=None, ):
            train_config = config
-            optim_type = train_config.model_optim
-            optim_conf = train_config.model_optim_conf
-            scheduler_type = train_config.scheduler
-            scheduler_conf = train_config.scheduler_conf
-            return {
-                "grad_clip": train_config.global_grad_clip,
-                "learning_rate": lr_scheduler
-                if lr_scheduler else optim_conf.lr,
-                "epsilon": optim_conf.epsilon,
-                "rho": optim_conf.rho,
-                "parameters": parameters,
-                "beta1": 0.9 if optim_type == 'noam' else None,
-                "beat2": 0.98 if optim_type == 'noam' else None,
-            }
-
-        optimzer_args = optimizer_args(config, model.parameters(), lr_scheduler)
-        optimizer = OptimizerFactory.from_args(optim_type, optimzer_args)
-
-        self.optimizer = optimizer
-        self.lr_scheduler = lr_scheduler
+            optim_arg = dict(optim_conf)
+            optim_arg.update({
+                "grad_clip":
+                train_config.global_grad_clip,
+                "learning_rate":
+                lr_scheduler if lr_scheduler else optim_conf.lr,
+                "parameters":
+                parameters
+            })
+            return optim_arg
+
+        model_optimizer_args = optimizer_args(config, model_optim_type,
+                                              model_optim_conf, [{
+                                                  'params':
+                                                  model._layers.enc.parameters()
+                                              }, {
+                                                  'params':
+                                                  model._layers.ctc.parameters()
+                                              }] if self.parallel else [{
+                                                  'params':
+                                                  model.enc.parameters()
+                                              }, {
+                                                  'params':
+                                                  model.ctc.parameters()
+                                              }], model_lr_scheduler)
+        wav2vec2_optimizer_args = optimizer_args(
+            config, wav2vec2_optim_type, wav2vec2_optim_conf,
+            model._layers.wav2vec2.parameters() if self.parallel else
+            model.wav2vec2.parameters(), wav2vec2_lr_scheduler)
+        model_optimizer = OptimizerFactory.from_args(model_optim_type,
+                                                     model_optimizer_args)
+        wav2vec2_optimizer = OptimizerFactory.from_args(wav2vec2_optim_type,
+                                                        wav2vec2_optimizer_args)
+
+        self.model_optimizer = model_optimizer
+        self.wav2vec2_optimizer = wav2vec2_optimizer
+        self.model_lr_scheduler = model_lr_scheduler
+        self.wav2vec2_lr_scheduler = wav2vec2_lr_scheduler
        logger.info("Setup optimizer/lr_scheduler!")
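
The setup above builds one scheduler and one optimizer per parameter group by merging a base learning rate into each scheduler config and overlaying shared settings on each optimizer config. A minimal pure-Python sketch of that merging, assuming plain dicts stand in for the YAML config nodes. `build_scheduler_args`, `build_optimizer_args`, and the sample values are hypothetical names for illustration, not PaddleSpeech APIs:

```python
def build_scheduler_args(optim_conf, scheduler_conf):
    """Merge the optimizer's base lr with scheduler-specific settings."""
    # dict(**a, **b) raises TypeError on duplicate keys, so scheduler_conf
    # must not redefine "learning_rate" or "verbose".
    return dict(**{"learning_rate": optim_conf["lr"],
                   "verbose": False}, **dict(scheduler_conf))


def build_optimizer_args(optim_conf, grad_clip, parameters, lr_scheduler=None):
    """Copy the optimizer config and overlay the shared settings."""
    args = dict(optim_conf)
    args.update({
        "grad_clip": grad_clip,
        # Prefer the scheduler object; fall back to the constant lr.
        "learning_rate": lr_scheduler if lr_scheduler else optim_conf["lr"],
        "parameters": parameters,
    })
    return args


# Hypothetical sample values, for illustration only.
model_optim_conf = {"lr": 1.0, "epsilon": 1e-8, "rho": 0.95}
scheduler_args = build_scheduler_args(model_optim_conf, {"gamma": 0.8})
optimizer_args = build_optimizer_args(
    model_optim_conf, grad_clip=5.0, parameters=["w"], lr_scheduler="sched")
```

Because the optimizer config is copied wholesale, every key in it (including `lr`) is forwarded to the optimizer factory, so the config should contain only keys the chosen optimizer accepts or the factory is prepared to drop.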


--- a/paddlespeech/s2t/exps/whisper/test_wav.py
+++ b/paddlespeech/s2t/exps/whisper/test_wav.py
@ -0,0 +1,123 @@
+# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# Modified from Whisper (https://github.com/openai/whisper/whisper/)
+import os.path
+import sys
+
+import distutils.util
+import numpy as np
+import paddle
+import soundfile
+from yacs.config import CfgNode
+
+from paddlespeech.s2t.models.whisper import log_mel_spectrogram
+from paddlespeech.s2t.models.whisper import ModelDimensions
+from paddlespeech.s2t.models.whisper import transcribe
+from paddlespeech.s2t.models.whisper import Whisper
+from paddlespeech.s2t.training.cli import default_argument_parser
+from paddlespeech.s2t.utils.log import Log
+
+logger = Log(__name__).getlog()
+
+
+class WhisperInfer():
+    def __init__(self, config, args):
+        self.args = args
+        self.config = config
+        self.audio_file = args.audio_file
+
+        paddle.set_device('gpu' if self.args.ngpu > 0 else 'cpu')
+        config.pop("ngpu")
+
+        #load_model
+        model_dict = paddle.load(self.config.model_file)
+        config.pop("model_file")
+        dims = ModelDimensions(**model_dict["dims"])
+        self.model = Whisper(dims)
+        self.model.load_dict(model_dict)
+
+    def run(self):
+        check(self.audio_file)
+
+        with paddle.no_grad():
+            temperature = self.config.pop("temperature")
+            temperature_increment_on_fallback = self.config.pop(
+                "temperature_increment_on_fallback")
+            if temperature_increment_on_fallback is not None:
+                temperature = tuple(
+                    np.arange(temperature, 1.0 + 1e-6,
+                              temperature_increment_on_fallback))
+            else:
+                temperature = [temperature]
+
+            #load audio
+            mel = log_mel_spectrogram(
+                self.args.audio_file, resource_path=self.config.resource_path)
+
+            result = transcribe(
+                self.model, mel, temperature=temperature, **self.config)
+            if self.args.result_file is not None:
+                with open(self.args.result_file, 'w') as f:
+                    f.write(str(result))
+            return result
+
+
+def check(audio_file: str):
+    if not os.path.isfile(audio_file):
+        print("Please input the right audio file path")
+        sys.exit(-1)
+
+    logger.info("checking the audio file format......")
+    try:
+        _, sample_rate = soundfile.read(audio_file)
+    except Exception as e:
+        logger.error(str(e))
+        logger.error(
+            "can not open the wav file, please check the audio file format")
+        sys.exit(-1)
+    logger.info("The sample rate is %d" % sample_rate)
+    assert (sample_rate == 16000)
+    logger.info("The audio file format is right")
+
+
+def main(config, args):
+    WhisperInfer(config, args).run()
+
+
+if __name__ == "__main__":
+    parser = default_argument_parser()
+    # save asr result to
+    parser.add_argument(
+        "--result_file", type=str, help="path to save the ASR result")
+    parser.add_argument(
+        "--audio_file", type=str, help="path of the input audio file")
+    parser.add_argument(
+        "--debug",
+        type=distutils.util.strtobool,
+        default=False,
+        help="for debug.")
+    args = parser.parse_args()
+
+    config = CfgNode(new_allowed=True)
+
+    if args.config:
+        config.merge_from_file(args.config)
+    if args.decode_cfg:
+        decode_confs = CfgNode(new_allowed=True)
+        decode_confs.merge_from_file(args.decode_cfg)
+        config.decode = decode_confs
+    if args.opts:
+        config.merge_from_list(args.opts)
+    config.freeze()
+    main(config, args)
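
The `run` method in this file expands a starting temperature and an increment into a fallback schedule that stops at 1.0. A minimal sketch of that expansion without NumPy, assuming the same upper bound of `1.0 + 1e-6`. `temperature_schedule` is a hypothetical helper name, and unlike the script it returns a tuple in both branches:

```python
import math


def temperature_schedule(temperature, increment=None, limit=1.0):
    """Return the decoding temperatures to try, lowest first."""
    if increment is None:
        # No fallback requested: decode once at the given temperature.
        return (temperature, )
    if increment <= 0:
        raise ValueError("increment must be positive")
    # Same element count as np.arange(temperature, limit + 1e-6, increment).
    count = max(0, math.ceil((limit + 1e-6 - temperature) / increment))
    return tuple(temperature + i * increment for i in range(count))


schedule = temperature_schedule(0.0, 0.2)
```

With a start of 0.0 and an increment of 0.2 the schedule has six entries, ending at 1.0 up to floating-point rounding.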
--- a/paddlespeech/s2t/models/wav2vec2/__init__.py
+++ b/paddlespeech/s2t/models/wav2vec2/__init__.py
@ -0,0 +1,17 @@
+# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from .wav2vec2_ASR import Wav2vec2ASR
+from .wav2vec2_ASR import Wav2vec2Base
+
+__all__ = ["Wav2vec2ASR", "Wav2vec2Base"]
--- a/paddlespeech/s2t/models/wav2vec2/modules/VanillaNN.py
+++ b/paddlespeech/s2t/models/wav2vec2/modules/VanillaNN.py
@ -18,6 +18,7 @@ import paddle

 from paddlespeech.s2t.models.wav2vec2.modules import containers
 from paddlespeech.s2t.models.wav2vec2.modules import linear
+from paddlespeech.s2t.models.wav2vec2.modules.normalization import BatchNorm1d


 class VanillaNN(containers.Sequential):
@ -39,18 +40,33 @@ class VanillaNN(containers.Sequential):
    paddle.shape([10, 120, 512])
    """

-    def __init__(
-            self,
-            input_shape,
-            activation=paddle.nn.LeakyReLU,
-            dnn_blocks=2,
-            dnn_neurons=512, ):
-        super().__init__(input_shape=input_shape)
+    def __init__(self,
+                 input_shape,
+                 dnn_blocks=2,
+                 dnn_neurons=512,
+                 activation=True,
+                 normalization=False,
+                 dropout_rate=0.5):
+        super().__init__(input_shape=[None, None, input_shape])
+
+        if not isinstance(dropout_rate, list):
+            dropout_rate = [dropout_rate] * dnn_blocks
+        else:
+            assert len(
+                dropout_rate
+            ) == dnn_blocks, "len(dropout_rate) must be equal to dnn_blocks"

        for block_index in range(dnn_blocks):
            self.append(
                linear.Linear,
                n_neurons=dnn_neurons,
-                bias=True,
+                bias_attr=None,
                layer_name="linear", )
-            self.append(activation(), layer_name="act")
+            if normalization:
+                self.append(
+                    BatchNorm1d, input_size=dnn_neurons, layer_name='bn')
+            if activation:
+                self.append(paddle.nn.LeakyReLU(), layer_name="act")
+            self.append(
+                paddle.nn.Dropout(p=dropout_rate[block_index]),
+                layer_name='dropout')
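
The new constructor accepts either one dropout rate for all blocks or a list with one rate per block, and each block is assembled as linear, optional batch norm, optional activation, then dropout. A minimal sketch of that argument handling and layer ordering in plain Python. `expand_dropout_rates` and `block_layout` are hypothetical helper names, and the sketch raises `ValueError` where the diff uses an assertion:

```python
def expand_dropout_rates(dropout_rate, dnn_blocks):
    """Return one dropout rate per block."""
    if not isinstance(dropout_rate, list):
        # A scalar rate is shared by every block.
        return [dropout_rate] * dnn_blocks
    if len(dropout_rate) != dnn_blocks:
        raise ValueError("len(dropout_rate) must be equal to dnn_blocks")
    return dropout_rate


def block_layout(normalization=False, activation=True):
    """Return the layer names of one block, in the order they are appended."""
    layers = ["linear"]
    if normalization:
        layers.append("bn")
    if activation:
        layers.append("act")
    layers.append("dropout")
    return layers


rates = expand_dropout_rates(0.5, 2)
layout = block_layout(normalization=True)
```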
--- a/paddlespeech/s2t/models/wav2vec2/modules/__init__.py
+++ b/paddlespeech/s2t/models/wav2vec2/modules/__init__.py
@ -0,0 +1,13 @@
+# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
--- a/paddlespeech/s2t/models/wav2vec2/modules/containers.py
+++ b/paddlespeech/s2t/models/wav2vec2/modules/containers.py
@ -141,5 +141,4 @@ class Sequential(paddle.nn.LayerDict):
            x = layer(x)
            if isinstance(x, tuple):
                x = x[0]
-
        return x
--- a/paddlespeech/s2t/models/wav2vec2/modules/linear.py
+++ b/paddlespeech/s2t/models/wav2vec2/modules/linear.py
@ -53,7 +53,7 @@ class Linear(paddle.nn.Layer):
            n_neurons,
            input_shape=None,
            input_size=None,
-            bias=True,
+            bias_attr=None,
            combine_dims=False, ):
        super().__init__()
        self.combine_dims = combine_dims
@ -67,7 +67,7 @@ class Linear(paddle.nn.Layer):
                input_size = input_shape[2] * input_shape[3]

        # Weights are initialized following paddle approach
-        self.w = align.Linear(input_size, n_neurons, bias_attr=bias)
+        self.w = align.Linear(input_size, n_neurons, bias_attr=bias_attr)

    def forward(self, x):
        """Returns the linear transformation of input tensor.
--- a/paddlespeech/s2t/models/wav2vec2/modules/modeling_wav2vec2.py
+++ b/paddlespeech/s2t/models/wav2vec2/modules/modeling_wav2vec2.py
@ -1120,9 +1120,6 @@ class Wav2Vec2ConfigPure():
        self.output_hidden_states = False
        self.use_return_dict = True

-        self.pad_token_id = config.pad_token_id
-        self.bos_token_id = config.bos_token_id
-        self.eos_token_id = config.eos_token_id
        self.hidden_size = config.hidden_size
        self.feat_extract_norm = config.feat_extract_norm
        self.feat_extract_activation = config.feat_extract_activation
@ -1145,7 +1142,6 @@ class Wav2Vec2ConfigPure():
        self.layerdrop = config.layerdrop
        self.layer_norm_eps = config.layer_norm_eps
        self.initializer_range = config.initializer_range
-        self.vocab_size = config.vocab_size
        self.do_stable_layer_norm = config.do_stable_layer_norm
        self.use_weighted_layer_sum = config.use_weighted_layer_sum

--- a/paddlespeech/s2t/models/wav2vec2/modules/normalization.py
+++ b/paddlespeech/s2t/models/wav2vec2/modules/normalization.py
@ -0,0 +1,97 @@
+# Authors
+#  * Mirco Ravanelli 2020
+#  * Guillermo Cámbara 2021
+#  * Sarthak Yadav 2022
+# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+# Modified from speechbrain(https://github.com/speechbrain/speechbrain/blob/develop/speechbrain/nnet/normalization.py)
+import paddle.nn as nn
+
+from paddlespeech.s2t.modules.align import BatchNorm1D
+
+
+class BatchNorm1d(nn.Layer):
+    """Applies 1d batch normalization to the input tensor.
+    Arguments
+    ---------
+    input_shape : tuple
+        The expected shape of the input. Alternatively, use ``input_size``.
+    input_size : int
+        The expected size of the input. Alternatively, use ``input_shape``.
+    eps : float
+        This value is added to std deviation estimation to improve the numerical
+        stability.
+    momentum : float
+        It is a value used for the running_mean and running_var computation.
+    skip_transpose : bool
+        When False (the default) and combine_batch_time is False, the
+        input is transposed from (batch, time, channels) to
+        (batch, channels, time) before normalization and transposed back
+        afterwards. When True, the input is normalized as given.
+    combine_batch_time : bool
+        When True, it combines the batch and time axes.
+    Example
+    -------
+    >>> input = paddle.randn([100, 10])
+    >>> norm = BatchNorm1d(input_shape=input.shape)
+    >>> output = norm(input)
+    >>> output.shape
+    Paddle.Shape([100, 10])
+    """
+
+    def __init__(
+            self,
+            input_shape=None,
+            input_size=None,
+            eps=1e-05,
+            momentum=0.9,
+            combine_batch_time=False,
+            skip_transpose=False, ):
+        super().__init__()
+        self.combine_batch_time = combine_batch_time
+        self.skip_transpose = skip_transpose
+
+        if input_size is None and skip_transpose:
+            input_size = input_shape[1]
+        elif input_size is None:
+            input_size = input_shape[-1]
+
+        self.norm = BatchNorm1D(input_size, momentum=momentum, epsilon=eps)
+
+    def forward(self, x):
+        """Returns the normalized input tensor.
+        Arguments
+        ---------
+        x : paddle.Tensor (batch, time, [channels])
+            input to normalize. 2d or 3d tensors are expected as input;
+            4d tensors can be used when combine_batch_time=True.
+        """
+        shape_or = x.shape
+        if self.combine_batch_time:
+            if x.ndim == 3:
+                x = x.reshape([shape_or[0] * shape_or[1], shape_or[2]])
+            else:
+                x = x.reshape(
+                    [shape_or[0] * shape_or[1], shape_or[3], shape_or[2]])
+
+        elif not self.skip_transpose:
+            x = x.transpose([0, 2, 1])
+
+        x_n = self.norm(x)
+        if self.combine_batch_time:
+            x_n = x_n.reshape(shape_or)
+        elif not self.skip_transpose:
+            x_n = x_n.transpose([0, 2, 1])
+
+        return x_n
--- a/paddlespeech/s2t/models/wav2vec2/processing/__init__.py
+++ b/paddlespeech/s2t/models/wav2vec2/processing/__init__.py
@ -0,0 +1,13 @@
+# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
--- a/paddlespeech/s2t/models/wav2vec2/processing/speech_augmentation.py
+++ b/paddlespeech/s2t/models/wav2vec2/processing/speech_augmentation.py
@ -639,16 +639,177 @@ class DropChunk(nn.Layer):
        return dropped_waveform


+class SpecAugment(paddle.nn.Layer):
+    """An implementation of the SpecAugment algorithm.
+    Reference:
+        https://arxiv.org/abs/1904.08779
+    Arguments
+    ---------
+    time_warp : bool
+        Whether to apply time warping.
+    time_warp_window : int
+        Time warp window.
+    time_warp_mode : str
+        Interpolation mode for time warping (default "bicubic").
+    freq_mask : bool
+        Whether to apply frequency masking.
+    freq_mask_width : int or tuple
+        Frequency mask width range.
+    n_freq_mask : int
+        Number of frequency masks.
+    time_mask : bool
+        Whether to apply time masking.
+    time_mask_width : int or tuple
+        Time mask width range.
+    n_time_mask : int
+        Number of time masks.
+    replace_with_zero : bool
+        If True, replace masked values with 0; otherwise replace them with the mean of the input tensor.
+    Example
+    -------
+    >>> aug = SpecAugment()
+    >>> a = paddle.rand([8, 120, 80])
+    >>> a = aug(a)
+    >>> print(a.shape)
+    paddle.Size([8, 120, 80])
+    """
+
+    def __init__(
+            self,
+            time_warp=True,
+            time_warp_window=5,
+            time_warp_mode="bicubic",
+            freq_mask=True,
+            freq_mask_width=(0, 20),
+            n_freq_mask=2,
+            time_mask=True,
+            time_mask_width=(0, 100),
+            n_time_mask=2,
+            replace_with_zero=True, ):
+        super().__init__()
+        assert (
+            time_warp or freq_mask or time_mask
+        ), "at least one of time_warp, time_mask, or freq_mask should be applied"
+
+        self.apply_time_warp = time_warp
+        self.time_warp_window = time_warp_window
+        self.time_warp_mode = time_warp_mode
+
+        self.freq_mask = freq_mask
+        if isinstance(freq_mask_width, int):
+            freq_mask_width = (0, freq_mask_width)
+        self.freq_mask_width = freq_mask_width
+        self.n_freq_mask = n_freq_mask
+
+        self.time_mask = time_mask
+        if isinstance(time_mask_width, int):
+            time_mask_width = (0, time_mask_width)
+        self.time_mask_width = time_mask_width
+        self.n_time_mask = n_time_mask
+
+        self.replace_with_zero = replace_with_zero
+
+    def forward(self, x):
+        """Takes an input tensor and returns an augmented one."""
+        if self.apply_time_warp:
+            x = self.time_warp(x)
+        if self.freq_mask:
+            x = self.mask_along_axis(x, dim=2)
+        if self.time_mask:
+            x = self.mask_along_axis(x, dim=1)
+        return x
+
+    def time_warp(self, x):
+        """Time warping with paddle.nn.functional.interpolate"""
+        original_size = x.shape
+        window = self.time_warp_window
+
+        # 2d interpolation requires 4D or higher dimension tensors
+        # x: (Batch, Time, Freq) -> (Batch, 1, Time, Freq)
+        if x.dim() == 3:
+            x = x.unsqueeze(1)
+
+        time = x.shape[2]
+        if time - window <= window:
+            return x.view(*original_size)
+
+        # compute center and corresponding window
+        c = paddle.randint(window, time - window, (1, ))[0]
+        w = paddle.randint(c - window, c + window, (1, ))[0] + 1
+        # c = 5
+        # w = 10
+        left = paddle.nn.functional.interpolate(
+            x[:, :, :c],
+            (w, x.shape[3]),
+            mode=self.time_warp_mode,
+            align_corners=True, )
+        right = paddle.nn.functional.interpolate(
+            x[:, :, c:],
+            (time - w, x.shape[3]),
+            mode=self.time_warp_mode,
+            align_corners=True, )
+
+        x[:, :, :w] = left
+        x[:, :, w:] = right
+        return x.view(*original_size)
+
+    def mask_along_axis(self, x, dim):
+        """Mask along time or frequency axis.
+        Arguments
+        ---------
+        x : tensor
+            Input tensor.
+        dim : int
+            Corresponding dimension to mask.
+        """
+        original_size = x.shape
+        if x.dim() == 4:
+            x = x.view(-1, x.shape[2], x.shape[3])
+
+        batch, time, fea = x.shape
+
+        if dim == 1:
+            D = time
+            n_mask = self.n_time_mask
+            width_range = self.time_mask_width
+        else:
+            D = fea
+            n_mask = self.n_freq_mask
+            width_range = self.freq_mask_width
+
+        mask_len = paddle.randint(width_range[0], width_range[1],
+                                  (batch, n_mask)).unsqueeze(2)
+
+        mask_pos = paddle.randint(0, max(1, D - mask_len.max()),
+                                  (batch, n_mask)).unsqueeze(2)
+
+        # compute masks
+        arange = paddle.arange(end=D).view(1, 1, -1)
+        mask = (mask_pos <= arange) * (arange < (mask_pos + mask_len))
+        mask = mask.any(axis=1)
+
+        if dim == 1:
+            mask = mask.unsqueeze(2)
+        else:
+            mask = mask.unsqueeze(1)
+
+        if self.replace_with_zero:
+            val = 0.0
+        else:
+            val = x.mean()
+        # same as x.masked_fill_(mask, val)
+        y = paddle.full(x.shape, val, x.dtype)
+        x = paddle.where(mask, y, x)
+        return x.view(*original_size)
+
+
 class TimeDomainSpecAugment(nn.Layer):
    """A time-domain approximation of the SpecAugment algorithm.
-
    This augmentation module implements three augmentations in
    the time-domain.
-
     1. Drop chunks of the audio (zero amplitude or white noise)
     2. Drop frequency bands (with band-drop filters)
     3. Speed peturbation (via resampling to slightly different rate)
-
    Arguments
    ---------
    perturb_prob : float from 0 to 1
@ -677,7 +838,6 @@ class TimeDomainSpecAugment(nn.Layer):
    drop_chunk_noise_factor : float
        The noise factor used to scale the white noise inserted, relative to
        the average amplitude of the utterance. Default 0 (no noise inserted).
-
    Example
    -------
    >>> inputs = paddle.randn([10, 16000])
@ -718,7 +878,6 @@ class TimeDomainSpecAugment(nn.Layer):

    def forward(self, waveforms, lengths):
        """Returns the distorted waveforms.
-
        Arguments
        ---------
        waveforms : tensor
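
The `mask_along_axis` method added to `SpecAugment` in this file samples a width and a start position for each mask, marks every position covered by any mask, and overwrites the marked positions with a fill value. A minimal sketch of the same idea for a single utterance in plain Python, assuming a list of per-frame feature lists. `sample_mask` and `apply_time_mask` are hypothetical helper names, and the start bound is computed per utterance here rather than from the largest width in the batch:

```python
import random


def sample_mask(length, width_range, n_mask, rng):
    """Return a boolean list marking positions covered by any of n_mask spans."""
    low, high = width_range
    # Widths come from [low, high): the upper bound is exclusive,
    # as with paddle.randint.
    widths = [rng.randrange(low, high) for _ in range(n_mask)]
    # Keep every span inside the axis; at least one start position is allowed.
    max_start = max(1, length - max(widths))
    starts = [rng.randrange(0, max_start) for _ in range(n_mask)]
    return [
        any(s <= i < s + w for s, w in zip(starts, widths))
        for i in range(length)
    ]


def apply_time_mask(frames, mask, fill=0.0):
    """Replace every masked frame with a frame of fill values."""
    return [[fill] * len(row) if m else list(row)
            for row, m in zip(frames, mask)]


rng = random.Random(0)
mask = sample_mask(120, (0, 20), 2, rng)
frames = [[1.0, 2.0] for _ in range(120)]
masked = apply_time_mask(frames, mask)
```

Masking along the frequency axis follows the same pattern with the mask indexed by feature bin instead of by frame.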
--- a/paddlespeech/s2t/models/wav2vec2/wav2vec2_ASR.py
+++ b/paddlespeech/s2t/models/wav2vec2/wav2vec2_ASR.py
@ -23,7 +23,9 @@ import paddle.nn.functional as F
 from paddlespeech.s2t.models.wav2vec2.modules.modeling_wav2vec2 import Wav2Vec2ConfigPure
 from paddlespeech.s2t.models.wav2vec2.modules.modeling_wav2vec2 import Wav2Vec2Model
 from paddlespeech.s2t.models.wav2vec2.modules.VanillaNN import VanillaNN
+from paddlespeech.s2t.models.wav2vec2.processing.speech_augmentation import SpecAugment
 from paddlespeech.s2t.modules.ctc import CTCDecoderBase as CTC
+from paddlespeech.s2t.modules.initializer import DefaultInitializerContext
 from paddlespeech.s2t.utils.ctc_utils import remove_duplicates_and_blank
 from paddlespeech.s2t.utils.utility import log_add

@ -31,44 +33,41 @@ from paddlespeech.s2t.utils.utility import log_add
 class Wav2vec2ASR(nn.Layer):
    def __init__(self, config: dict):
        super().__init__()
+        init_type = config.get("init_type", None)
+        with DefaultInitializerContext(init_type):
+            self.config = config
+            wav2vec2_config = Wav2Vec2ConfigPure(config)
+            wav2vec2 = Wav2Vec2Model(wav2vec2_config)
+            self.normalize_wav = config.normalize_wav
+            self.output_norm = config.output_norm
+            if hasattr(config, 'spec_augment'):
+                self.spec_augment = SpecAugment(**config.spec_augment)

-        wav2vec2_config = Wav2Vec2ConfigPure(config)
-        wav2vec2 = Wav2Vec2Model(wav2vec2_config)
-        model_dict = paddle.load(config.wav2vec2_params_path)
-        wav2vec2.set_state_dict(model_dict)
-        self.normalize_wav = config.normalize_wav
-        self.output_norm = config.output_norm
-        if config.freeze_wav2vec2:
-            wav2vec2.eval()
-            for parm in wav2vec2.parameters():
-                parm.trainable = False
-        self.wav2vec2 = wav2vec2
-        self.enc = VanillaNN(
-            input_shape=[None, None, wav2vec2_config.hidden_size],
-            activation=nn.LeakyReLU,
-            dnn_blocks=config.dnn_blocks,
-            dnn_neurons=config.dnn_neurons)
-        self.ctc = CTC(odim=config.output_dim,
-                       enc_n_units=config.dnn_neurons,
-                       blank_id=config.blank_id,
-                       dropout_rate=config.ctc_dropout_rate,
-                       reduction='mean')
-
-    def forward(self, wav, wavs_lens_rate, target, target_lens_rate):
+            if config.freeze_wav2vec2:
+                wav2vec2.eval()
+                for parm in wav2vec2.parameters():
+                    parm.trainable = False
+            self.wav2vec2 = wav2vec2
+            self.enc = VanillaNN(**config.enc)
+            self.ctc = CTC(**config.ctc,
+                           odim=config.output_dim,
+                           batch_average=False,
+                           reduction='mean')
+
+    def forward(self, wav, wavs_lens_rate, target, target_lens):
        if self.normalize_wav:
-            wav = F.layer_norm(wav, wav.shape[1:])
+            wav = F.layer_norm(wav, wav.shape)
        # Extract wav2vec output
        out = self.wav2vec2(wav)[0]
        # We normalize the output if required
        if self.output_norm:
-            out = F.layer_norm(out, out.shape[1:])
-        feats = out
-
+            out = F.layer_norm(out, out.shape)
+        if self.training and hasattr(self.config, 'spec_augment'):
+            feats = self.spec_augment(out)
+        else:
+            feats = out
        x = self.enc(feats)
        x_lens = (wavs_lens_rate * x.shape[1]).round().astype(paddle.int64)
-        target_lens = (target_lens_rate *
-                       target.shape[1]).round().astype(paddle.int64)
-
        ctc_loss = self.ctc(x, x_lens, target, target_lens)
        return ctc_loss

@ -239,3 +238,33 @@ class Wav2vec2ASR(nn.Layer):
        """
        hyps = self._ctc_prefix_beam_search(wav, beam_size)
        return hyps[0][0]
+
+
+class Wav2vec2Base(nn.Layer):
+    """Wav2vec2 model"""
+
+    def __init__(self, config: dict):
+        super().__init__()
+        wav2vec2_config = Wav2Vec2ConfigPure(config)
+        wav2vec2 = Wav2Vec2Model(wav2vec2_config)
+        self.wav2vec2 = wav2vec2
+
+    @classmethod
+    def from_config(cls, configs: dict):
+        """init model.
+
+        Args:
+            configs (dict): config dict.
+
+        Raises:
+            ValueError: raised when an unsupported encoder type is used.
+
+        Returns:
+            nn.Layer: Wav2Vec2Base
+        """
+        model = cls(configs)
+        return model
+
+    def forward(self, wav):
+        out = self.wav2vec2(wav)
+        return out
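
`Wav2vec2ASR.forward` in this file applies SpecAugment only in training mode. When gating on mode, the value to test is the boolean `training` attribute; the `train` method is a bound method and is truthy in every mode. A minimal sketch of the difference, assuming a layer that follows the `train()`/`eval()`/`training` convention of `paddle.nn.Layer`. `TinyLayer` is a hypothetical stand-in, not a Paddle class:

```python
class TinyLayer:
    """Stand-in for a layer with train()/eval() methods and a training flag."""

    def __init__(self):
        self.training = True

    def train(self):
        self.training = True

    def eval(self):
        self.training = False


layer = TinyLayer()
layer.eval()

# A bound method is always truthy, so it cannot distinguish the two modes.
gate_on_method = bool(layer.train)
# The boolean attribute reflects the current mode.
gate_on_flag = bool(layer.training)
```

After `eval()`, gating on the method would still enable augmentation, while gating on the flag disables it.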
--- a/paddlespeech/s2t/models/whisper/__init__.py
+++ b/paddlespeech/s2t/models/whisper/__init__.py
@ -0,0 +1,12 @@
+# MIT License, Copyright (c) 2022 OpenAI.
+# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
+# 
+# Modified from OpenAI Whisper 2022 (https://github.com/openai/whisper/whisper/__init__.py)
+from paddlespeech.s2t.models.whisper.whipser import decode
+from paddlespeech.s2t.models.whisper.whipser import DecodingOptions
+from paddlespeech.s2t.models.whisper.whipser import DecodingResult
+from paddlespeech.s2t.models.whisper.whipser import detect_language
+from paddlespeech.s2t.models.whisper.whipser import log_mel_spectrogram
+from paddlespeech.s2t.models.whisper.whipser import ModelDimensions
+from paddlespeech.s2t.models.whisper.whipser import transcribe
+from paddlespeech.s2t.models.whisper.whipser import Whisper
--- a/paddlespeech/s2t/models/whisper/tokenizer.py
+++ b/paddlespeech/s2t/models/whisper/tokenizer.py
@ -0,0 +1,362 @@
+# MIT License, Copyright (c) 2022 OpenAI.
+# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
+# 
+# Modified from OpenAI Whisper 2022 (https://github.com/openai/whisper/whisper/tokenizer.py)
+import os
+from dataclasses import dataclass
+from functools import lru_cache
+from typing import List
+from typing import Optional
+from typing import Tuple
+from typing import Union
+
+import numpy as np
+import paddle
+from paddlenlp.transformers import GPTTokenizer
+
+LANGUAGES = {
+    "en": "english",
+    "zh": "chinese",
+    "de": "german",
+    "es": "spanish",
+    "ru": "russian",
+    "ko": "korean",
+    "fr": "french",
+    "ja": "japanese",
+    "pt": "portuguese",
+    "tr": "turkish",
+    "pl": "polish",
+    "ca": "catalan",
+    "nl": "dutch",
+    "ar": "arabic",
+    "sv": "swedish",
+    "it": "italian",
+    "id": "indonesian",
+    "hi": "hindi",
+    "fi": "finnish",
+    "vi": "vietnamese",
+    "iw": "hebrew",
+    "uk": "ukrainian",
+    "el": "greek",
+    "ms": "malay",
+    "cs": "czech",
+    "ro": "romanian",
+    "da": "danish",
+    "hu": "hungarian",
+    "ta": "tamil",
+    "no": "norwegian",
+    "th": "thai",
+    "ur": "urdu",
+    "hr": "croatian",
+    "bg": "bulgarian",
+    "lt": "lithuanian",
+    "la": "latin",
+    "mi": "maori",
+    "ml": "malayalam",
+    "cy": "welsh",
+    "sk": "slovak",
+    "te": "telugu",
+    "fa": "persian",
+    "lv": "latvian",
+    "bn": "bengali",
+    "sr": "serbian",
+    "az": "azerbaijani",
+    "sl": "slovenian",
+    "kn": "kannada",
+    "et": "estonian",
+    "mk": "macedonian",
+    "br": "breton",
+    "eu": "basque",
+    "is": "icelandic",
+    "hy": "armenian",
+    "ne": "nepali",
+    "mn": "mongolian",
+    "bs": "bosnian",
+    "kk": "kazakh",
+    "sq": "albanian",
+    "sw": "swahili",
+    "gl": "galician",
+    "mr": "marathi",
+    "pa": "punjabi",
+    "si": "sinhala",
+    "km": "khmer",
+    "sn": "shona",
+    "yo": "yoruba",
+    "so": "somali",
+    "af": "afrikaans",
+    "oc": "occitan",
+    "ka": "georgian",
+    "be": "belarusian",
+    "tg": "tajik",
+    "sd": "sindhi",
+    "gu": "gujarati",
+    "am": "amharic",
+    "yi": "yiddish",
+    "lo": "lao",
+    "uz": "uzbek",
+    "fo": "faroese",
+    "ht": "haitian creole",
+    "ps": "pashto",
+    "tk": "turkmen",
+    "nn": "nynorsk",
+    "mt": "maltese",
+    "sa": "sanskrit",
+    "lb": "luxembourgish",
+    "my": "myanmar",
+    "bo": "tibetan",
+    "tl": "tagalog",
+    "mg": "malagasy",
+    "as": "assamese",
+    "tt": "tatar",
+    "haw": "hawaiian",
+    "ln": "lingala",
+    "ha": "hausa",
+    "ba": "bashkir",
+    "jw": "javanese",
+    "su": "sundanese",
+}
+
+# language code lookup by name, with a few language aliases
+TO_LANGUAGE_CODE = {
+    **{language: code for code, language in LANGUAGES.items()},
+    "burmese": "my",
+    "valencian": "ca",
+    "flemish": "nl",
+    "haitian": "ht",
+    "letzeburgesch": "lb",
+    "pushto": "ps",
+    "panjabi": "pa",
+    "moldavian": "ro",
+    "moldovan": "ro",
+    "sinhalese": "si",
+    "castilian": "es",
+}
+
+
+@dataclass(frozen=True)
+class Tokenizer:
+    """A thin wrapper around `GPTTokenizer` providing quick access to special tokens"""
+
+    tokenizer: "GPTTokenizer"
+    language: Optional[str]
+    sot_sequence: Tuple[int]
+
+    def encode(self, text, **kwargs):
+        return self.tokenizer.encode(text, **kwargs)
+
+    def decode(self,
+               token_ids: Union[int, List[int], np.ndarray, paddle.Tensor],
+               **kwargs):
+        if len(token_ids) > 1:
+            ids_list = []
+            for ids in token_ids:
+                if paddle.is_tensor(ids):
+                    ids = ids.item()
+                if ids < len(self.tokenizer):
+                    ids_list.append(ids)
+            token_ids = ids_list
+
+        return self.tokenizer.decode(token_ids, **kwargs)
+
+    def decode_with_timestamps(self, tokens) -> str:
+        """
+        Timestamp tokens are above the special tokens' id range and are ignored by `decode()`.
+        This method decodes the given tokens with timestamp tokens annotated, e.g. "<|1.08|>".
+        """
+        outputs = [[]]
+        for token in tokens:
+            if token >= self.timestamp_begin:
+                timestamp = f"<|{(token - self.timestamp_begin) * 0.02:.2f}|>"
+                outputs.append(timestamp)
+                outputs.append([])
+            else:
+                outputs[-1].append(token)
+        outputs = [
+            s if isinstance(s, str) else self.tokenizer.decode(s)
+            for s in outputs
+        ]
+        return "".join(outputs)
+
+    @property
+    @lru_cache()
+    def eot(self) -> int:
+        return self.tokenizer.eos_token_id
+
+    @property
+    @lru_cache()
+    def sot(self) -> int:
+        return self._get_single_token_id("<|startoftranscript|>")
+
+    @property
+    @lru_cache()
+    def sot_lm(self) -> int:
+        return self._get_single_token_id("<|startoflm|>")
+
+    @property
+    @lru_cache()
+    def sot_prev(self) -> int:
+        return self._get_single_token_id("<|startofprev|>")
+
+    @property
+    @lru_cache()
+    def no_speech(self) -> int:
+        return self._get_single_token_id("<|nospeech|>")
+
+    @property
+    @lru_cache()
+    def no_timestamps(self) -> int:
+        return self._get_single_token_id("<|notimestamps|>")
+
+    @property
+    @lru_cache()
+    def timestamp_begin(self) -> int:
+        return self.tokenizer.all_special_ids[-1] + 1
+
+    @property
+    @lru_cache()
+    def language_token(self) -> int:
+        """Returns the token id corresponding to the value of the `language` field"""
+        if self.language is None:
+            raise ValueError(
+                "This tokenizer does not have language token configured")
+
+        additional_tokens = dict(
+            zip(
+                self.tokenizer.additional_special_tokens,
+                self.tokenizer.additional_special_tokens_ids, ))
+        candidate = f"<|{self.language}|>"
+        if candidate in additional_tokens:
+            return additional_tokens[candidate]
+
+        raise KeyError(f"Language {self.language} not found in tokenizer.")
+
+    @property
+    @lru_cache()
+    def all_language_tokens(self) -> Tuple[int]:
+        result = []
+        for token, token_id in zip(
+                self.tokenizer.additional_special_tokens,
+                self.tokenizer.additional_special_tokens_ids, ):
+            if token.strip("<|>") in LANGUAGES:
+                result.append(token_id)
+        return tuple(result)
+
+    @property
+    @lru_cache()
+    def all_language_codes(self) -> Tuple[str]:
+        return tuple(
+            self.decode([l]).strip("<|>") for l in self.all_language_tokens)
+
+    @property
+    @lru_cache()
+    def sot_sequence_including_notimestamps(self) -> Tuple[int]:
+        return tuple(list(self.sot_sequence) + [self.no_timestamps])
+
+    @property
+    @lru_cache()
+    def non_speech_tokens(self) -> Tuple[int]:
+        """
+        Returns the list of tokens to suppress in order to avoid any speaker tags or non-speech
+        annotations, to prevent sampling texts that are not actually spoken in the audio, e.g.
+
+        - ♪♪♪
+        - ( SPEAKING FOREIGN LANGUAGE )
+        - [DAVID] Hey there,
+
+        while keeping basic punctuation such as commas, periods, question marks and exclamation points.
+        """
+        symbols = list("\"#()*+/:;<=>@[\\]^_`{|}~「」『』")
+        symbols += "<< >> <<< >>> -- --- -( -[ (' (\" (( )) ((( ))) [[ ]] {{ }} ♪♪ ♪♪♪".split(
+        )
+
+        # symbols that may be a single token or multiple tokens depending on the tokenizer.
+        # In case they're multiple tokens, suppress the first token, which is safe because:
+        # These are between U+2640 and U+267F miscellaneous symbols that are okay to suppress
+        # in generations, and in the 3-byte UTF-8 representation they share the first two bytes.
+        miscellaneous = set("♩♪♫♬♭♮♯")
+        assert all(0x2640 <= ord(c) <= 0x267F for c in miscellaneous)
+
+        # allow hyphens "-" and single quotes "'" between words, but not at the beginning of a word
+        result = {
+            self.tokenizer.encode(" -").input_ids[0],
+            self.tokenizer.encode(" '").input_ids[0]
+        }
+        for symbol in symbols + list(miscellaneous):
+            for tokens in [
+                    self.tokenizer.encode(symbol).input_ids,
+                    self.tokenizer.encode(" " + symbol).input_ids
+            ]:
+                if len(tokens) == 1 or symbol in miscellaneous:
+                    result.add(tokens[0])
+
+        return tuple(sorted(result))
+
+    def _get_single_token_id(self, text) -> int:
+        tokens = self.tokenizer.encode(text).input_ids
+        assert len(tokens) == 1, f"{text} is not encoded as a single token"
+        return tokens[0]
+
+
+@lru_cache(maxsize=None)
+def build_tokenizer(resource_path: str, name: str="gpt2"):
+    os.environ["TOKENIZERS_PARALLELISM"] = "false"
+    path = os.path.join(resource_path, "assets", name)
+    tokenizer = GPTTokenizer.from_pretrained(path)
+
+    specials = [
+        "<|startoftranscript|>",
+        * [f"<|{lang}|>" for lang in LANGUAGES.keys()],
+        "<|translate|>",
+        "<|transcribe|>",
+        "<|startoflm|>",
+        "<|startofprev|>",
+        "<|nospeech|>",
+        "<|notimestamps|>",
+    ]
+
+    tokenizer.add_special_tokens(dict(additional_special_tokens=specials))
+    return tokenizer
+
+
+@lru_cache(maxsize=None)
+def get_tokenizer(
+        multilingual: bool,
+        resource_path: str,
+        *,
+        task: Optional[str]=None,  # Literal["transcribe", "translate", None]
+        language: Optional[str]=None, ) -> Tokenizer:
+    if language is not None:
+        language = language.lower()
+        if language not in LANGUAGES:
+            if language in TO_LANGUAGE_CODE:
+                language = TO_LANGUAGE_CODE[language]
+            else:
+                raise ValueError(f"Unsupported language: {language}")
+
+    if multilingual:
+        tokenizer_name = "multilingual"
+        task = task or "transcribe"
+        language = language or "en"
+    else:
+        tokenizer_name = "gpt2"
+        task = None
+        language = None
+
+    tokenizer = build_tokenizer(
+        resource_path=resource_path, name=tokenizer_name)
+    all_special_ids: List[int] = tokenizer.all_special_ids
+    sot: int = all_special_ids[1]
+    translate: int = all_special_ids[-6]
+    transcribe: int = all_special_ids[-5]
+
+    langs = tuple(LANGUAGES.keys())
+    sot_sequence = [sot]
+    if language is not None:
+        sot_sequence.append(sot + 1 + langs.index(language))
+    if task is not None:
+        sot_sequence.append(transcribe if task == "transcribe" else translate)
+
+    return Tokenizer(
+        tokenizer=tokenizer,
+        language=language,
+        sot_sequence=tuple(sot_sequence))
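
The prompt assembly in `get_tokenizer` can be sketched without the real tokenizer. The block below is a minimal illustration, assuming a three-language subset of the table and a made-up id of 100 for `<|startoftranscript|>`; `resolve_language`, the `sot_sequence` function and all numeric ids are hypothetical and only mirror the ordering in which `build_tokenizer` registers the special tokens (start token, one token per language, then `<|translate|>` and `<|transcribe|>`).

```python
from typing import Optional, Tuple

# Hypothetical three-language subset and token ids, chosen only for illustration.
LANGUAGES = {"en": "english", "zh": "chinese", "nl": "dutch"}
TO_LANGUAGE_CODE = {
    **{name: code for code, name in LANGUAGES.items()},
    "flemish": "nl",
}
SOT = 100                             # assumed id of <|startoftranscript|>
TRANSLATE = SOT + 1 + len(LANGUAGES)  # task tokens follow the language tokens
TRANSCRIBE = TRANSLATE + 1


def resolve_language(language: str) -> str:
    """Map a language code, name, or alias to its code."""
    language = language.lower()
    if language in LANGUAGES:
        return language
    if language in TO_LANGUAGE_CODE:
        return TO_LANGUAGE_CODE[language]
    raise ValueError(f"Unsupported language: {language}")


def sot_sequence(language: Optional[str],
                 task: Optional[str]) -> Tuple[int, ...]:
    """Build the prompt: start token, then language token, then task token."""
    sequence = [SOT]
    if language is not None:
        langs = tuple(LANGUAGES.keys())
        sequence.append(SOT + 1 + langs.index(resolve_language(language)))
    if task is not None:
        sequence.append(TRANSCRIBE if task == "transcribe" else TRANSLATE)
    return tuple(sequence)


print(sot_sequence("Flemish", "translate"))  # (100, 103, 104)
print(sot_sequence(None, None))              # (100,)
```

The language token id is computed as an offset from the start token, so the scheme depends on the special tokens being added in exactly the order of the language table.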
--- a/paddlespeech/s2t/models/whisper/utils.py
+++ b/paddlespeech/s2t/models/whisper/utils.py
@ -0,0 +1,92 @@
+# MIT License, Copyright (c) 2022 OpenAI.
+# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
+# 
+# Modified from OpenAI Whisper 2022 (https://github.com/openai/whisper/whisper/utils.py)
+import zlib
+from typing import Iterator
+from typing import TextIO
+
+
+def exact_div(x, y):
+    assert x % y == 0
+    return x // y
+
+
+def str2bool(string):
+    str2val = {"True": True, "False": False}
+    if string in str2val:
+        return str2val[string]
+    else:
+        raise ValueError(f"Expected one of {set(str2val.keys())}, got {string}")
+
+
+def optional_int(string):
+    return None if string == "None" else int(string)
+
+
+def optional_float(string):
+    return None if string == "None" else float(string)
+
+
+def compression_ratio(text) -> float:
+    return len(text) / len(zlib.compress(text.encode("utf-8")))
+
+
+def format_timestamp(seconds: float,
+                     always_include_hours: bool=False,
+                     decimal_marker: str='.'):
+    assert seconds >= 0, "non-negative timestamp expected"
+    milliseconds = round(seconds * 1000.0)
+
+    hours = milliseconds // 3_600_000
+    milliseconds -= hours * 3_600_000
+
+    minutes = milliseconds // 60_000
+    milliseconds -= minutes * 60_000
+
+    seconds = milliseconds // 1_000
+    milliseconds -= seconds * 1_000
+
+    hours_marker = f"{hours:02d}:" if always_include_hours or hours > 0 else ""
+    return f"{hours_marker}{minutes:02d}:{seconds:02d}{decimal_marker}{milliseconds:03d}"
+
+
+def write_txt(transcript: Iterator[dict], file: TextIO):
+    for segment in transcript:
+        print(segment['text'].strip(), file=file, flush=True)
+
+
+def write_vtt(transcript: Iterator[dict], file: TextIO):
+    print("WEBVTT\n", file=file)
+    for segment in transcript:
+        print(
+            f"{format_timestamp(segment['start'])} --> {format_timestamp(segment['end'])}\n"
+            f"{segment['text'].strip().replace('-->', '->')}\n",
+            file=file,
+            flush=True, )
+
+
+def write_srt(transcript: Iterator[dict], file: TextIO):
+    """
+    Write a transcript to a file in SRT format.
+
+    Example usage:
+        from pathlib import Path
+        from whisper.utils import write_srt
+
+        result = transcribe(model, audio_path, temperature=temperature, **args)
+
+        # save SRT
+        audio_basename = Path(audio_path).stem
+        with open(Path(output_dir) / (audio_basename + ".srt"), "w", encoding="utf-8") as srt:
+            write_srt(result["segments"], file=srt)
+    """
+    for i, segment in enumerate(transcript, start=1):
+        # write srt lines
+        print(
+            f"{i}\n"
+            f"{format_timestamp(segment['start'], always_include_hours=True, decimal_marker=',')} --> "
+            f"{format_timestamp(segment['end'], always_include_hours=True, decimal_marker=',')}\n"
+            f"{segment['text'].strip().replace('-->', '->')}\n",
+            file=file,
+            flush=True, )
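
As a quick check of the timestamp and SRT formatting above, the following self-contained sketch re-implements `format_timestamp` with `divmod` and writes two made-up segments to an in-memory buffer. It is an illustrative rewrite rather than the module's code, and the segment values are invented for the example.

```python
import io


def format_timestamp(seconds, always_include_hours=False, decimal_marker="."):
    """Format seconds as [HH:]MM:SS<marker>mmm, rounding to milliseconds."""
    assert seconds >= 0, "non-negative timestamp expected"
    milliseconds = round(seconds * 1000.0)
    hours, milliseconds = divmod(milliseconds, 3_600_000)
    minutes, milliseconds = divmod(milliseconds, 60_000)
    secs, milliseconds = divmod(milliseconds, 1_000)
    hours_marker = f"{hours:02d}:" if always_include_hours or hours > 0 else ""
    return f"{hours_marker}{minutes:02d}:{secs:02d}{decimal_marker}{milliseconds:03d}"


def write_srt(transcript, file):
    """Write numbered cues; SRT uses a comma before the milliseconds."""
    for i, segment in enumerate(transcript, start=1):
        start = format_timestamp(segment["start"], True, ",")
        end = format_timestamp(segment["end"], True, ",")
        # "-->" inside the text would be read as a cue separator, so soften it.
        text = segment["text"].strip().replace("-->", "->")
        print(f"{i}\n{start} --> {end}\n{text}\n", file=file)


# Made-up segments, only for illustration.
segments = [
    {"start": 0.0, "end": 1.08, "text": " hello "},
    {"start": 1.08, "end": 3661.5, "text": "a --> b"},
]
buffer = io.StringIO()
write_srt(segments, buffer)
srt_text = buffer.getvalue()
print(srt_text)
print(format_timestamp(1.08))  # 00:01.080
```

Writing to `io.StringIO` keeps the example free of file-system side effects; any text-mode file object works the same way.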
--- a/paddlespeech/s2t/models/whisper/whipser.py
+++ b/paddlespeech/s2t/models/whisper/whipser.py
--- a/paddlespeech/s2t/models/whisper/whisper_LICENSE
+++ b/paddlespeech/s2t/models/whisper/whisper_LICENSE
@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2022 OpenAI
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
--- a/paddlespeech/s2t/training/scheduler.py
+++ b/paddlespeech/s2t/training/scheduler.py
@ -17,6 +17,7 @@ from typing import Dict
 from typing import Text
 from typing import Union

+import paddle
 from paddle.optimizer.lr import LRScheduler
 from typeguard import check_argument_types

@ -107,6 +108,125 @@ class ConstantLR(LRScheduler):
        return self.base_lr


+@register_scheduler
+class NewBobScheduler(LRScheduler):
+    """Scheduler with the new-bob technique, used for LR annealing.
+
+    The learning rate is annealed based on the validation performance.
+    In particular: if (past_loss - current_loss) / past_loss < improvement_threshold:
+    lr = lr * annealing_factor.
+
+    Arguments
+    ---------
+    learning_rate : float
+        The initial learning rate.
+    last_epoch : int
+        The index of the last epoch, passed to ``LRScheduler``. Default: -1.
+    verbose : bool
+        If True, print a message on every update. Default: False.
+    annealing_factor : float
+        The factor the learning rate is multiplied by when it is annealed.
+    improvement_threshold : float
+        The minimum relative improvement between consecutive metric values;
+        a smaller improvement counts toward annealing.
+    patient : int
+        The number of insufficient improvements to tolerate before the
+        learning rate is reduced.
+
+    Example
+    -------
+    >>> scheduler = NewBobScheduler(learning_rate=1.0)
+    >>> scheduler.step(metric_value=10.0)
+    >>> scheduler.last_lr
+    1.0
+    >>> scheduler.step(metric_value=2.0)
+    >>> scheduler.last_lr
+    1.0
+    >>> scheduler.step(metric_value=2.5)
+    >>> scheduler.last_lr
+    0.5
+    """
+
+    def __init__(
+            self,
+            learning_rate,
+            last_epoch=-1,
+            verbose=False,
+            annealing_factor=0.5,
+            improvement_threshold=0.0025,
+            patient=0, ):
+        self.hyperparam_value = learning_rate
+        self.annealing_factor = annealing_factor
+        self.improvement_threshold = improvement_threshold
+        self.patient = patient
+        self.metric_values = []
+        self.current_patient = self.patient
+        super().__init__(learning_rate, last_epoch, verbose)
+
+    def step(self, metric_value=None):
+        """
+
+        ``step`` should be called after ``optimizer.step`` . It advances ``last_epoch`` by one and updates the learning rate according to ``metric_value`` .
+        The new learning rate will take effect on next ``optimizer.step`` .
+
+        Args:
+            metric_value (float, None): validation metric (e.g. loss) of the current epoch. Default: None, which keeps the current learning rate.
+
+        Returns:
+            None
+        """
+        if metric_value is None:
+            self.last_epoch += 1
+            self.last_lr = self.hyperparam_value
+        else:
+            self.last_epoch += 1
+            self.last_lr = self.get_lr(metric_value)
+
+        if self.verbose:
+            print('Epoch {}: {} set learning rate to {}.'.format(
+                self.last_epoch, self.__class__.__name__, self.last_lr))
+
+    def get_lr(self, metric_value):
+        """Returns the new value for the hyperparameter.
+
+        Arguments
+        ---------
+        metric_value : float
+            A number for determining whether to change the hyperparameter value.
+        """
+        new_value = self.hyperparam_value
+        if len(self.metric_values) > 0:
+            prev_metric = self.metric_values[-1]
+            # Update value if improvement too small and patience is 0
+            if prev_metric == 0:  # Prevent division by zero
+                improvement = 0
+            else:
+                improvement = (prev_metric - metric_value) / prev_metric
+            if improvement < self.improvement_threshold:
+                if self.current_patient == 0:
+                    new_value *= self.annealing_factor
+                    self.current_patient = self.patient
+                else:
+                    self.current_patient -= 1
+
+        # Store relevant info
+        self.metric_values.append(metric_value)
+        self.hyperparam_value = new_value
+
+        return new_value
+
+    def save(self):
+        """Returns the current scheduler state as a dict."""
+        data = {
+            "current_epoch_index": self.last_epoch,
+            "hyperparam_value": self.hyperparam_value,
+            "metric_values": self.metric_values,
+            "current_patient": self.current_patient
+        }
+        return data
+
+    def load(self, data):
+        """Loads the scheduler state from the path ``data`` via ``paddle.load``."""
+        data = paddle.load(data)
+        self.last_epoch = data["current_epoch_index"]
+        self.hyperparam_value = data["hyperparam_value"]
+        self.metric_values = data["metric_values"]
+        self.current_patient = data["current_patient"]
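
The annealing rule in `get_lr` can be exercised without Paddle. The class below is a framework-free sketch under that assumption; `NewBob` and `update` are hypothetical names, and the loss values are invented to show when the rate is halved.

```python
class NewBob:
    """Framework-free sketch of the new-bob annealing rule (hypothetical helper)."""

    def __init__(self, lr, annealing_factor=0.5,
                 improvement_threshold=0.0025, patient=0):
        self.lr = lr
        self.annealing_factor = annealing_factor
        self.improvement_threshold = improvement_threshold
        self.patient = patient
        self.current_patient = patient
        self.previous_metric = None

    def update(self, metric):
        """Record a validation metric and return the learning rate to use next."""
        if self.previous_metric is not None:
            if self.previous_metric == 0:  # avoid division by zero
                improvement = 0
            else:
                improvement = (self.previous_metric - metric) / self.previous_metric
            if improvement < self.improvement_threshold:
                if self.current_patient == 0:
                    # Out of patience: anneal and reset the counter.
                    self.lr *= self.annealing_factor
                    self.current_patient = self.patient
                else:
                    self.current_patient -= 1
        self.previous_metric = metric
        return self.lr


strict = NewBob(lr=1.0)
history = [strict.update(loss) for loss in (10.0, 2.0, 2.5)]
print(history)  # [1.0, 1.0, 0.5]

tolerant = NewBob(lr=1.0, patient=1)
tolerant_history = [tolerant.update(loss) for loss in (10.0, 11.0, 12.0)]
print(tolerant_history)  # [1.0, 1.0, 0.5]
```

With `patient=1` the first insufficient improvement only uses up the patience counter; the rate is reduced on the second one.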
+
+
 def dynamic_import_scheduler(module):
    """Import Scheduler class dynamically.

--- a/paddlespeech/t2s/exps/inference.py
+++ b/paddlespeech/t2s/exps/inference.py
@ -145,7 +145,7 @@ def main():
    # warmup
    for utt_id, sentence in sentences[:3]:
        with timer() as t:
-            am_output_data = get_am_output(
+            mel = get_am_output(
                input=sentence,
                am_predictor=am_predictor,
                am=args.am,
@ -154,12 +154,11 @@ def main():
                merge_sentences=merge_sentences,
                speaker_dict=args.speaker_dict,
                spk_id=args.spk_id, )
-            wav = get_voc_output(
-                voc_predictor=voc_predictor, input=am_output_data)
+            wav = get_voc_output(voc_predictor=voc_predictor, input=mel)
        speed = wav.size / t.elapse
        rtf = fs / speed
        print(
-            f"{utt_id}, mel: {am_output_data.shape}, wave: {wav.shape}, time: {t.elapse}s, Hz: {speed}, RTF: {rtf}."
+            f"{utt_id}, mel: {mel.shape}, wave: {wav.shape}, time: {t.elapse}s, Hz: {speed}, RTF: {rtf}."
        )

    print("warm up done!")
@ -168,7 +167,7 @@ def main():
    T = 0
    for utt_id, sentence in sentences:
        with timer() as t:
-            am_output_data = get_am_output(
+            mel = get_am_output(
                input=sentence,
                am_predictor=am_predictor,
                am=args.am,
@ -177,8 +176,7 @@ def main():
                merge_sentences=merge_sentences,
                speaker_dict=args.speaker_dict,
                spk_id=args.spk_id, )
-            wav = get_voc_output(
-                voc_predictor=voc_predictor, input=am_output_data)
+            wav = get_voc_output(voc_predictor=voc_predictor, input=mel)

        N += wav.size
        T += t.elapse
@ -187,7 +185,7 @@ def main():

        sf.write(output_dir / (utt_id + ".wav"), wav, samplerate=fs)
        print(
-            f"{utt_id}, mel: {am_output_data.shape}, wave: {wav.shape}, time: {t.elapse}s, Hz: {speed}, RTF: {rtf}."
+            f"{utt_id}, mel: {mel.shape}, wave: {wav.shape}, time: {t.elapse}s, Hz: {speed}, RTF: {rtf}."
        )

        print(f"{utt_id} done!")
--- a/paddlespeech/t2s/exps/lite_predict.py
+++ b/paddlespeech/t2s/exps/lite_predict.py
@ -0,0 +1,168 @@
+# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import argparse
+from pathlib import Path
+
+import soundfile as sf
+from timer import timer
+
+from paddlespeech.t2s.exps.syn_utils import get_frontend
+from paddlespeech.t2s.exps.syn_utils import get_lite_am_output
+from paddlespeech.t2s.exps.syn_utils import get_lite_predictor
+from paddlespeech.t2s.exps.syn_utils import get_lite_voc_output
+from paddlespeech.t2s.exps.syn_utils import get_sentences
+
+
+def parse_args():
+    parser = argparse.ArgumentParser(
+        description="Paddle Inference with acoustic model & vocoder.")
+    # acoustic model
+    parser.add_argument(
+        '--am',
+        type=str,
+        default='fastspeech2_csmsc',
+        choices=[
+            'speedyspeech_csmsc',
+            'fastspeech2_csmsc',
+            'fastspeech2_aishell3',
+            'fastspeech2_ljspeech',
+            'fastspeech2_vctk',
+            'fastspeech2_mix',
+        ],
+        help='Choose acoustic model type of tts task.')
+    parser.add_argument(
+        "--phones_dict", type=str, default=None, help="phone vocabulary file.")
+    parser.add_argument(
+        "--tones_dict", type=str, default=None, help="tone vocabulary file.")
+    parser.add_argument(
+        "--speaker_dict", type=str, default=None, help="speaker id map file.")
+    parser.add_argument(
+        '--spk_id',
+        type=int,
+        default=0,
+        help='spk id for multi speaker acoustic model')
+    # voc
+    parser.add_argument(
+        '--voc',
+        type=str,
+        default='pwgan_csmsc',
+        choices=[
+            'pwgan_csmsc',
+            'pwgan_aishell3',
+            'pwgan_ljspeech',
+            'pwgan_vctk',
+            'mb_melgan_csmsc',
+            'hifigan_csmsc',
+            'hifigan_aishell3',
+            'hifigan_ljspeech',
+            'hifigan_vctk',
+        ],
+        help='Choose vocoder type of tts task.')
+    # other
+    parser.add_argument(
+        '--lang',
+        type=str,
+        default='zh',
+        help='Choose model language. zh or en or mix')
+    parser.add_argument(
+        "--text",
+        type=str,
+        help="text to synthesize, a 'utt_id sentence' pair per line")
+    parser.add_argument(
+        "--inference_dir", type=str, help="dir containing the inference models")
+    parser.add_argument("--output_dir", type=str, help="output dir")
+
+    args, _ = parser.parse_known_args()
+    return args
+
+
+# only inference for models trained with csmsc now
+def main():
+    args = parse_args()
+
+    # frontend
+    frontend = get_frontend(
+        lang=args.lang,
+        phones_dict=args.phones_dict,
+        tones_dict=args.tones_dict)
+
+    # am_predictor
+    am_predictor = get_lite_predictor(
+        model_dir=args.inference_dir, model_file=args.am + "_x86.nb")
+    # model: {model_name}_{dataset}
+    am_dataset = args.am[args.am.rindex('_') + 1:]
+
+    # voc_predictor
+    voc_predictor = get_lite_predictor(
+        model_dir=args.inference_dir, model_file=args.voc + "_x86.nb")
+
+    output_dir = Path(args.output_dir)
+    output_dir.mkdir(parents=True, exist_ok=True)
+
+    sentences = get_sentences(text_file=args.text, lang=args.lang)
+
+    merge_sentences = True
+    fs = 24000 if am_dataset != 'ljspeech' else 22050
+    # warmup
+    for utt_id, sentence in sentences[:3]:
+        with timer() as t:
+            mel = get_lite_am_output(
+                input=sentence,
+                am_predictor=am_predictor,
+                am=args.am,
+                frontend=frontend,
+                lang=args.lang,
+                merge_sentences=merge_sentences,
+                speaker_dict=args.speaker_dict,
+                spk_id=args.spk_id, )
+            wav = get_lite_voc_output(voc_predictor=voc_predictor, input=mel)
+        speed = wav.size / t.elapse
+        rtf = fs / speed
+        print(
+            f"{utt_id}, mel: {mel.shape}, wave: {wav.shape}, time: {t.elapse}s, Hz: {speed}, RTF: {rtf}."
+        )
+
+    print("warm up done!")
+
+    N = 0
+    T = 0
+    for utt_id, sentence in sentences:
+        with timer() as t:
+            mel = get_lite_am_output(
+                input=sentence,
+                am_predictor=am_predictor,
+                am=args.am,
+                frontend=frontend,
+                lang=args.lang,
+                merge_sentences=merge_sentences,
+                speaker_dict=args.speaker_dict,
+                spk_id=args.spk_id, )
+            wav = get_lite_voc_output(voc_predictor=voc_predictor, input=mel)
+
+        N += wav.size
+        T += t.elapse
+        speed = wav.size / t.elapse
+        rtf = fs / speed
+
+        sf.write(output_dir / (utt_id + ".wav"), wav, samplerate=fs)
+        print(
+            f"{utt_id}, mel: {mel.shape}, wave: {wav.shape}, time: {t.elapse}s, Hz: {speed}, RTF: {rtf}."
+        )
+
+        print(f"{utt_id} done!")
+    print(f"generation speed: {N / T}Hz, RTF: {fs / (N / T) }")
+
+
+if __name__ == "__main__":
+    main()
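
The speed and RTF figures printed by the script follow from two divisions. The helper below is a small sketch of that arithmetic; `throughput_and_rtf` is a hypothetical name and the numbers are invented for the example.

```python
def throughput_and_rtf(num_samples: int, elapsed_seconds: float,
                       sample_rate: int):
    """Return (samples generated per second, real-time factor)."""
    speed = num_samples / elapsed_seconds
    # Equivalent to elapsed time divided by audio duration.
    rtf = sample_rate / speed
    return speed, rtf


# Two seconds of 24 kHz audio synthesized in half a second.
speed, rtf = throughput_and_rtf(
    num_samples=48000, elapsed_seconds=0.5, sample_rate=24000)
print(speed, rtf)  # 96000.0 0.25
```

An RTF below 1 means synthesis runs faster than playback.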
--- a/paddlespeech/t2s/exps/lite_predict_streaming.py
+++ b/paddlespeech/t2s/exps/lite_predict_streaming.py
@ -0,0 +1,230 @@
+# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import argparse
+from pathlib import Path
+
+import numpy as np
+import soundfile as sf
+from timer import timer
+
+from paddlespeech.t2s.exps.syn_utils import denorm
+from paddlespeech.t2s.exps.syn_utils import get_chunks
+from paddlespeech.t2s.exps.syn_utils import get_frontend
+from paddlespeech.t2s.exps.syn_utils import get_lite_am_sublayer_output
+from paddlespeech.t2s.exps.syn_utils import get_lite_predictor
+from paddlespeech.t2s.exps.syn_utils import get_lite_streaming_am_output
+from paddlespeech.t2s.exps.syn_utils import get_lite_voc_output
+from paddlespeech.t2s.exps.syn_utils import get_sentences
+from paddlespeech.t2s.exps.syn_utils import run_frontend
+from paddlespeech.t2s.utils import str2bool
+
+
+def parse_args():
+    parser = argparse.ArgumentParser(
+        description="Paddle Inference with acoustic model & vocoder.")
+    # acoustic model
+    parser.add_argument(
+        '--am',
+        type=str,
+        default='fastspeech2_csmsc',
+        choices=['fastspeech2_csmsc'],
+        help='Choose acoustic model type of tts task.')
+    parser.add_argument(
+        "--am_stat",
+        type=str,
+        default=None,
+        help="mean and standard deviation used to normalize spectrogram when training acoustic model."
+    )
+    parser.add_argument(
+        "--phones_dict", type=str, default=None, help="phone vocabulary file.")
+    parser.add_argument(
+        "--tones_dict", type=str, default=None, help="tone vocabulary file.")
+    parser.add_argument(
+        "--speaker_dict", type=str, default=None, help="speaker id map file.")
+    parser.add_argument(
+        '--spk_id',
+        type=int,
+        default=0,
+        help='spk id for multi speaker acoustic model')
+    # voc
+    parser.add_argument(
+        '--voc',
+        type=str,
+        default='pwgan_csmsc',
+        choices=['pwgan_csmsc', 'mb_melgan_csmsc', 'hifigan_csmsc'],
+        help='Choose vocoder type of tts task.')
+    # other
+    parser.add_argument(
+        '--lang',
+        type=str,
+        default='zh',
+        help='Choose model language. zh or en')
+    parser.add_argument(
+        "--text",
+        type=str,
+        help="text to synthesize, a 'utt_id sentence' pair per line")
+    parser.add_argument(
+        "--inference_dir", type=str, help="dir containing the inference models")
+    parser.add_argument("--output_dir", type=str, help="output dir")
+    # inference
+
+    # streaming related
+    parser.add_argument(
+        "--am_streaming",
+        type=str2bool,
+        default=False,
+        help="whether use streaming acoustic model")
+    parser.add_argument(
+        "--block_size", type=int, default=42, help="block size of am streaming")
+    parser.add_argument(
+        "--pad_size", type=int, default=12, help="pad size of am streaming")
+
+    args, _ = parser.parse_known_args()
+    return args
+
+
+# only inference for models trained with csmsc now
+def main():
+    args = parse_args()
+
+    # frontend
+    frontend = get_frontend(
+        lang=args.lang,
+        phones_dict=args.phones_dict,
+        tones_dict=args.tones_dict)
+
+    # am_predictor
+    am_encoder_infer_predictor = get_lite_predictor(
+        model_dir=args.inference_dir,
+        model_file=args.am + "_am_encoder_infer" + "_x86.nb")
+    am_decoder_predictor = get_lite_predictor(
+        model_dir=args.inference_dir,
+        model_file=args.am + "_am_decoder" + "_x86.nb")
+    am_postnet_predictor = get_lite_predictor(
+        model_dir=args.inference_dir,
+        model_file=args.am + "_am_postnet" + "_x86.nb")
+    am_mu, am_std = np.load(args.am_stat)
+    # model: {model_name}_{dataset}
+    am_dataset = args.am[args.am.rindex('_') + 1:]
+
+    # voc_predictor
+    voc_predictor = get_lite_predictor(
+        model_dir=args.inference_dir, model_file=args.voc + "_x86.nb")
+
+    output_dir = Path(args.output_dir)
+    output_dir.mkdir(parents=True, exist_ok=True)
+
+    sentences = get_sentences(text_file=args.text, lang=args.lang)
+
+    merge_sentences = True
+
+    fs = 24000 if am_dataset != 'ljspeech' else 22050
+    # warmup
+    for utt_id, sentence in sentences[:3]:
+        with timer() as t:
+            normalized_mel = get_lite_streaming_am_output(
+                input=sentence,
+                am_encoder_infer_predictor=am_encoder_infer_predictor,
+                am_decoder_predictor=am_decoder_predictor,
+                am_postnet_predictor=am_postnet_predictor,
+                frontend=frontend,
+                lang=args.lang,
+                merge_sentences=merge_sentences, )
+            mel = denorm(normalized_mel, am_mu, am_std)
+            wav = get_lite_voc_output(voc_predictor=voc_predictor, input=mel)
+        speed = wav.size / t.elapse
+        rtf = fs / speed
+        print(
+            f"{utt_id}, mel: {mel.shape}, wave: {wav.shape}, time: {t.elapse}s, Hz: {speed}, RTF: {rtf}."
+        )
+
+    print("warm up done!")
+
+    N = 0
+    T = 0
+    block_size = args.block_size
+    pad_size = args.pad_size
+    get_tone_ids = False
+    for utt_id, sentence in sentences:
+        with timer() as t:
+            # frontend
+            frontend_dict = run_frontend(
+                frontend=frontend,
+                text=sentence,
+                merge_sentences=merge_sentences,
+                get_tone_ids=get_tone_ids,
+                lang=args.lang)
+            phone_ids = frontend_dict['phone_ids']
+            phones = phone_ids[0].numpy()
+            # acoustic model
+            orig_hs = get_lite_am_sublayer_output(
+                am_encoder_infer_predictor, input=phones)
+
+            if args.am_streaming:
+                hss = get_chunks(orig_hs, block_size, pad_size)
+                chunk_num = len(hss)
+                mel_list = []
+                for i, hs in enumerate(hss):
+                    am_decoder_output = get_lite_am_sublayer_output(
+                        am_decoder_predictor, input=hs)
+                    am_postnet_output = get_lite_am_sublayer_output(
+                        am_postnet_predictor,
+                        input=np.transpose(am_decoder_output, (0, 2, 1)))
+                    am_output_data = am_decoder_output + np.transpose(
+                        am_postnet_output, (0, 2, 1))
+                    normalized_mel = am_output_data[0]
+
+                    sub_mel = denorm(normalized_mel, am_mu, am_std)
+                    # clip output part of pad
+                    if i == 0:
+                        sub_mel = sub_mel[:-pad_size]
+                    elif i == chunk_num - 1:
+                        # the right side of the last chunk is never fully padded
+                        sub_mel = sub_mel[pad_size:]
+                    else:
+                        # the right side of the last few chunks may also not be fully padded
+                        sub_mel = sub_mel[pad_size:(block_size + pad_size) -
+                                          sub_mel.shape[0]]
+                    mel_list.append(sub_mel)
+                mel = np.concatenate(mel_list, axis=0)
+
+            else:
+                am_decoder_output = get_lite_am_sublayer_output(
+                    am_decoder_predictor, input=orig_hs)
+                am_postnet_output = get_lite_am_sublayer_output(
+                    am_postnet_predictor,
+                    input=np.transpose(am_decoder_output, (0, 2, 1)))
+                am_output_data = am_decoder_output + np.transpose(
+                    am_postnet_output, (0, 2, 1))
+                normalized_mel = am_output_data[0]
+                mel = denorm(normalized_mel, am_mu, am_std)
+            # vocoder
+            wav = get_lite_voc_output(voc_predictor=voc_predictor, input=mel)
+
+        N += wav.size
+        T += t.elapse
+        speed = wav.size / t.elapse
+        rtf = fs / speed
+
+        sf.write(output_dir / (utt_id + ".wav"), wav, samplerate=24000)
+        print(
+            f"{utt_id}, mel: {mel.shape}, wave: {wav.shape}, time: {t.elapse}s, Hz: {speed}, RTF: {rtf}."
+        )
+
+        print(f"{utt_id} done!")
+    print(f"generation speed: {N / T}Hz, RTF: {fs / (N / T) }")
+
+
+if __name__ == "__main__":
+    main()
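
Note on the streaming branch above: each chunk is decoded with `pad_size` frames of context on both sides, and the context frames are clipped before the chunks are concatenated. The following is a minimal pure-Python sketch of that idea, not the code from this commit. `split_with_context` and `stream` are hypothetical names, and the sketch records where the core of each chunk starts and ends instead of inferring it from the chunk length. It also assumes the per-chunk model returns exactly one output frame per input frame.

```python
import math


def split_with_context(seq, block_size, pad_size):
    """Yield (chunk, lo, hi); chunk[lo:hi] is the part without context frames."""
    n_chunks = math.ceil(len(seq) / block_size)
    for i in range(n_chunks):
        core_start = i * block_size
        core_end = min(core_start + block_size, len(seq))
        # Extend by up to pad_size frames on each side, clamped to the sequence.
        start = max(0, core_start - pad_size)
        end = min(core_end + pad_size, len(seq))
        yield seq[start:end], core_start - start, core_end - start


def stream(seq, block_size, pad_size, process=lambda chunk: list(chunk)):
    """Process seq chunk by chunk and stitch the clipped outputs together."""
    out = []
    for chunk, lo, hi in split_with_context(seq, block_size, pad_size):
        frames = process(chunk)  # assumed: one output frame per input frame
        out.extend(frames[lo:hi])  # drop the left and right context
    return out
```

With an identity `process`, `stream` reproduces its input for any length, including inputs shorter than one block, which is a convenient sanity check for the clipping arithmetic.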
--- a/paddlespeech/t2s/exps/syn_utils.py
+++ b/paddlespeech/t2s/exps/syn_utils.py
@ -26,6 +26,8 @@ import paddle
 from paddle import inference
 from paddle import jit
 from paddle.static import InputSpec
+from paddlelite.lite import create_paddle_predictor
+from paddlelite.lite import MobileConfig
 from yacs.config import CfgNode

 from paddlespeech.t2s.datasets.data_table import DataTable
@ -510,3 +512,105 @@ def get_sess(model_path: Optional[os.PathLike],
    sess = ort.InferenceSession(
        model_path, providers=providers, sess_options=sess_options)
    return sess
+
+
+# Paddle-Lite
+def get_lite_predictor(model_dir: Optional[os.PathLike]=None,
+                       model_file: Optional[os.PathLike]=None,
+                       cpu_threads: int=1):
+    config = MobileConfig()
+    config.set_model_from_file(str(Path(model_dir) / model_file))
+    predictor = create_paddle_predictor(config)
+    return predictor
+
+
+def get_lite_am_output(
+        input: str,
+        am_predictor,
+        am: str,
+        frontend: object,
+        lang: str='zh',
+        merge_sentences: bool=True,
+        speaker_dict: Optional[os.PathLike]=None,
+        spk_id: int=0, ):
+    am_name = am[:am.rindex('_')]
+    am_dataset = am[am.rindex('_') + 1:]
+    get_spk_id = False
+    get_tone_ids = False
+    if am_name == 'speedyspeech':
+        get_tone_ids = True
+    if am_dataset in {"aishell3", "vctk", "mix"} and speaker_dict:
+        get_spk_id = True
+        spk_id = np.array([spk_id])
+
+    frontend_dict = run_frontend(
+        frontend=frontend,
+        text=input,
+        merge_sentences=merge_sentences,
+        get_tone_ids=get_tone_ids,
+        lang=lang)
+
+    if get_tone_ids:
+        tone_ids = frontend_dict['tone_ids']
+        tones = tone_ids[0].numpy()
+        tones_handle = am_predictor.get_input(1)
+        tones_handle.from_numpy(tones)
+
+    if get_spk_id:
+        spk_id_handle = am_predictor.get_input(1)
+        spk_id_handle.from_numpy(spk_id)
+    phone_ids = frontend_dict['phone_ids']
+    phones = phone_ids[0].numpy()
+    phones_handle = am_predictor.get_input(0)
+    phones_handle.from_numpy(phones)
+    am_predictor.run()
+    am_output_handle = am_predictor.get_output(0)
+    am_output_data = am_output_handle.numpy()
+    return am_output_data
+
+
+def get_lite_voc_output(voc_predictor, input):
+    mel_handle = voc_predictor.get_input(0)
+    mel_handle.from_numpy(input)
+    voc_predictor.run()
+    voc_output_handle = voc_predictor.get_output(0)
+    wav = voc_output_handle.numpy()
+    return wav
+
+
+def get_lite_am_sublayer_output(am_sublayer_predictor, input):
+    input_handle = am_sublayer_predictor.get_input(0)
+    input_handle.from_numpy(input)
+
+    am_sublayer_predictor.run()
+    am_sublayer_handle = am_sublayer_predictor.get_output(0)
+    am_sublayer_output = am_sublayer_handle.numpy()
+    return am_sublayer_output
+
+
+def get_lite_streaming_am_output(input: str,
+                                 am_encoder_infer_predictor,
+                                 am_decoder_predictor,
+                                 am_postnet_predictor,
+                                 frontend,
+                                 lang: str='zh',
+                                 merge_sentences: bool=True):
+    get_tone_ids = False
+    frontend_dict = run_frontend(
+        frontend=frontend,
+        text=input,
+        merge_sentences=merge_sentences,
+        get_tone_ids=get_tone_ids,
+        lang=lang)
+    phone_ids = frontend_dict['phone_ids']
+    phones = phone_ids[0].numpy()
+    am_encoder_infer_output = get_lite_am_sublayer_output(
+        am_encoder_infer_predictor, input=phones)
+    am_decoder_output = get_lite_am_sublayer_output(
+        am_decoder_predictor, input=am_encoder_infer_output)
+    am_postnet_output = get_lite_am_sublayer_output(
+        am_postnet_predictor, input=np.transpose(am_decoder_output, (0, 2, 1)))
+    am_output_data = am_decoder_output + np.transpose(am_postnet_output,
+                                                      (0, 2, 1))
+    normalized_mel = am_output_data[0]
+    return normalized_mel
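
Note on `get_lite_am_output`: the `am` argument is split at its last underscore into a model name and a dataset name, which then decide whether tone ids or a speaker id are fed to the predictor. A minimal sketch of that parsing step follows. `split_am` is a hypothetical helper name and the example strings are illustrative only.

```python
def split_am(am: str):
    """Split an acoustic-model tag at its last underscore: (name, dataset)."""
    sep = am.rindex('_')  # raises ValueError if the tag has no underscore
    return am[:sep], am[sep + 1:]


name, dataset = split_am("fastspeech2_aishell3")  # illustrative tag
needs_tone_ids = name == "speedyspeech"
is_multi_speaker_dataset = dataset in {"aishell3", "vctk", "mix"}
```

Splitting at the last underscore rather than the first keeps model names that themselves contain underscores intact.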
--- a/paddlespeech/t2s/frontend/g2pw/onnx_api.py
+++ b/paddlespeech/t2s/frontend/g2pw/onnx_api.py
@ -100,7 +100,7 @@ class G2PWOnnxConverter:
        ]
        self.non_polyphonic = {
            '一', '不', '和', '咋', '嗲', '剖', '差', '攢', '倒', '難', '奔', '勁', '拗',
-            '肖', '瘙', '誒', '泊'
+            '肖', '瘙', '誒', '泊', '听'
        }
        self.non_monophonic = {'似', '攢'}
        self.monophonic_chars = [
--- a/paddlespeech/t2s/frontend/zh_normalization/constants.py
+++ b/paddlespeech/t2s/frontend/zh_normalization/constants.py
@ -19,7 +19,7 @@ from pypinyin.constants import SUPPORT_UCS4
 # Full-width / half-width conversion
 # Full-width -> half-width mapping for ASCII letters (num: 52)
 F2H_ASCII_LETTERS = {
-    chr(ord(char) + 65248): char
+    ord(char) + 65248: ord(char)
    for char in string.ascii_letters
 }

@ -27,12 +27,12 @@ F2H_ASCII_LETTERS = {
 H2F_ASCII_LETTERS = {value: key for key, value in F2H_ASCII_LETTERS.items()}

 # Full-width -> half-width mapping for digits (num: 10)
-F2H_DIGITS = {chr(ord(char) + 65248): char for char in string.digits}
+F2H_DIGITS = {ord(char) + 65248: ord(char) for char in string.digits}
 # Half-width -> full-width mapping for digits
 H2F_DIGITS = {value: key for key, value in F2H_DIGITS.items()}

 # Full-width -> half-width mapping for punctuation (num: 32)
-F2H_PUNCTUATIONS = {chr(ord(char) + 65248): char for char in string.punctuation}
+F2H_PUNCTUATIONS = {ord(char) + 65248: ord(char) for char in string.punctuation}
 # Half-width -> full-width mapping for punctuation
 H2F_PUNCTUATIONS = {value: key for key, value in F2H_PUNCTUATIONS.items()}
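
Note on the key change from `chr(...)` to `ord(...)`: integer code points are the key type that `str.translate` expects, so tables built this way can presumably be passed to it directly. The sketch below illustrates the pattern under that assumption. `F2H`, `H2F`, `to_half_width`, and `to_full_width` are hypothetical names that merge the three tables into one, and the full-width space (U+3000) is not covered because it does not follow the fixed offset.

```python
import string

OFFSET = 65248  # 0xFEE0: distance between an ASCII char and its full-width form

# Full-width -> half-width, keyed and valued by code point for str.translate.
F2H = {
    ord(char) + OFFSET: ord(char)
    for char in string.ascii_letters + string.digits + string.punctuation
}
# Half-width -> full-width is the inverse mapping.
H2F = {value: key for key, value in F2H.items()}


def to_half_width(text: str) -> str:
    return text.translate(F2H)


def to_full_width(text: str) -> str:
    return text.translate(H2F)
```

Characters without an entry in the table, such as Chinese text, pass through `translate` unchanged.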

--- a/paddlespeech/t2s/frontend/zh_normalization/quantifier.py
+++ b/paddlespeech/t2s/frontend/zh_normalization/quantifier.py
@ -18,6 +18,25 @@ from .num import num2str
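 # Temperature expressions; a temperature changes how the minus sign is read
 # e.g. -3°C is read as 零下三度 ("three degrees below zero")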
 RE_TEMPERATURE = re.compile(r'(-?)(\d+(\.\d+)?)(°C|℃|度|摄氏度)')
+measure_dict = {
+    "cm2": "平方厘米",
+    "cm²": "平方厘米",
+    "cm3": "立方厘米",
+    "cm³": "立方厘米",
+    "cm": "厘米",
+    "db": "分贝",
+    "ms": "毫秒",
+    "kg": "千克",
+    "km": "千米",
+    "m2": "平方米",
+    "m²": "平方米",
+    "m³": "立方米",
+    "m3": "立方米",
+    "ml": "毫升",
+    "mm": "毫米",
+    "m": "米",
+    "s": "秒"
+}


 def replace_temperature(match) -> str:
@ -35,3 +54,10 @@ def replace_temperature(match) -> str:
    unit: str = "摄氏度" if unit == "摄氏度" else "度"
    result = f"{sign}{temperature}{unit}"
    return result
+
+
+def replace_measure(sentence) -> str:
+    for q_notation in measure_dict:
+        if q_notation in sentence:
+            sentence = sentence.replace(q_notation, measure_dict[q_notation])
+    return sentence
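
Note on `replace_measure`: it performs plain substring replacement in dict order, so a short unit can rewrite part of a longer one if it is visited first. One way to make the result independent of dict order is to try longer units first. The sketch below is an illustration under that assumption, not the code in this commit. `UNITS` is a small hypothetical subset of `measure_dict`, deliberately listed in an unfavourable order.

```python
# Hypothetical subset of the unit table, with "m" listed before "mm".
UNITS = {"m": "米", "mm": "毫米", "cm": "厘米", "cm2": "平方厘米", "s": "秒"}


def replace_units_longest_first(sentence: str, units=UNITS) -> str:
    # Longer keys first, so "mm" is consumed before "m" can match inside it.
    for unit in sorted(units, key=len, reverse=True):
        sentence = sentence.replace(unit, units[unit])
    return sentence


def replace_units_in_dict_order(sentence: str, units=UNITS) -> str:
    # Same loop as replace_measure, kept here only for comparison.
    for unit in units:
        if unit in sentence:
            sentence = sentence.replace(unit, units[unit])
    return sentence
```

For the input `"5mm"`, the dict-order variant yields `"5米米"` with this table, while the longest-first variant yields `"5毫米"`.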
--- a/paddlespeech/t2s/frontend/zh_normalization/text_normlization.py
+++ b/paddlespeech/t2s/frontend/zh_normalization/text_normlization.py
@ -46,6 +46,7 @@ from .phonecode import RE_TELEPHONE
 from .phonecode import replace_mobile
 from .phonecode import replace_phone
 from .quantifier import RE_TEMPERATURE
+from .quantifier import replace_measure
 from .quantifier import replace_temperature


@ -73,6 +74,17 @@ class TextNormalizer():
    def _post_replace(self, sentence: str) -> str:
        sentence = sentence.replace('/', '每')
        sentence = sentence.replace('~', '至')
+        sentence = sentence.replace('～', '至')
+        sentence = sentence.replace('①', '一')
+        sentence = sentence.replace('②', '二')
+        sentence = sentence.replace('③', '三')
+        sentence = sentence.replace('④', '四')
+        sentence = sentence.replace('⑤', '五')
+        sentence = sentence.replace('⑥', '六')
+        sentence = sentence.replace('⑦', '七')
+        sentence = sentence.replace('⑧', '八')
+        sentence = sentence.replace('⑨', '九')
+        sentence = sentence.replace('⑩', '十')

        return sentence

@ -91,6 +103,7 @@ class TextNormalizer():
        sentence = RE_TIME.sub(replace_time, sentence)

        sentence = RE_TEMPERATURE.sub(replace_temperature, sentence)
+        sentence = replace_measure(sentence)
        sentence = RE_FRAC.sub(replace_frac, sentence)
        sentence = RE_PERCENTAGE.sub(replace_percentage, sentence)
        sentence = RE_MOBILE_PHONE.sub(replace_mobile, sentence)
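
Note on `_post_replace`: the chain of single-character `replace` calls can also be expressed as one translation table. The sketch below is an equivalent formulation for illustration, assuming only the replacements shown in the hunk above. `POST_TABLE` and `post_replace` are hypothetical names.

```python
# Single-character replacements taken from _post_replace.
_POST_MAP = {'/': '每', '~': '至', chr(0xFF5E): '至'}  # U+FF5E is the full-width tilde
# Circled digits 1..10 occupy U+2460..U+2469.
for offset, numeral in enumerate("一二三四五六七八九十"):
    _POST_MAP[chr(0x2460 + offset)] = numeral

POST_TABLE = str.maketrans(_POST_MAP)


def post_replace(sentence: str) -> str:
    # One pass over the string instead of one pass per replaced character.
    return sentence.translate(POST_TABLE)
```

Because every key is a single character and no replacement text contains another key, the result matches the sequential `replace` calls.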
--- a/setup.py
+++ b/setup.py
@ -75,6 +75,7 @@ base = [
    "braceexpand",
    "pyyaml",
    "pybind11",
+    "paddlelite",
    "paddleslim==2.3.4",
 ]

--- a/speechx/examples/codelab/u2/utils
+++ b/speechx/examples/codelab/u2/utils
@ -0,0 +1 @@
+../../../../utils
--- a/speechx/examples/u2pp_ol/wenetspeech/README.md
+++ b/speechx/examples/u2pp_ol/wenetspeech/README.md
@ -2,10 +2,10 @@

 ## Testing with Aishell Test Data

-## Download wav and model
+### Download wav and model

 ```
-run.sh --stop_stage 0
+./run.sh --stop_stage 0
 ```

 ### compute feature
@ -22,7 +22,6 @@ run.sh --stop_stage 0

 ### decoding using wav

-
 ```
 ./run.sh --stage 3 --stop_stage 3
 ```
--- a/speechx/examples/u2pp_ol/wenetspeech/RESULTS.md
+++ b/speechx/examples/u2pp_ol/wenetspeech/RESULTS.md
@ -2,9 +2,11 @@

 7176 utts, duration 36108.9 sec.

-## Attention Rescore
+## U2++ Attention Rescore

-### u2++ FP32
+> Intel(R) Xeon(R) Gold 6271C CPU @ 2.60GHz, supports `avx512_vnni`.
+> RTF includes feature extraction and decoding time, which makes it closer to an end-to-end measurement.
+### FP32

 #### CER

@ -17,20 +19,29 @@ Other -> 100.00 % N=3 C=0 S=3 D=0 I=0

 #### RTF 

-> RTF with feature and decoder which is more end to end.
-
-* Intel(R) Xeon(R) Gold 6271C CPU @ 2.60GHz, support `avx512_vnni`
-
 ```
 I1027 10:52:38.662868 51665 u2_recognizer_main.cc:122] total wav duration is: 36108.9 sec
 I1027 10:52:38.662858 51665 u2_recognizer_main.cc:121] total cost:11169.1 sec
 I1027 10:52:38.662876 51665 u2_recognizer_main.cc:123] RTF is: 0.309318
 ```

-* Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz, not support `avx512_vnni`
+### INT8
+
+> INT8 improves RTF by 12.8% relative to FP32; the RTF counts feature extraction and decoding time.
+
+#### CER
+
+```
+Overall -> 5.83 % N=104765 C=98943 S=5675 D=147 I=286
+Mandarin -> 5.83 % N=104762 C=98943 S=5672 D=147 I=286
+English -> 0.00 % N=0 C=0 S=0 D=0 I=0
+Other -> 100.00 % N=3 C=0 S=3 D=0 I=0
+```
+
+#### RTF 

 ```
-I1026 16:13:26.247121 48038 u2_recognizer_main.cc:123] total wav duration is: 36108.9 sec
-I1026 16:13:26.247130 48038 u2_recognizer_main.cc:124] total decode cost:13656.7 sec
-I1026 16:13:26.247138 48038 u2_recognizer_main.cc:125] RTF is: 0.378208
+I1110 09:59:52.551712 37249 u2_recognizer_main.cc:122] total wav duration is: 36108.9 sec
+I1110 09:59:52.551717 37249 u2_recognizer_main.cc:123] total decode cost:9737.63 sec
+I1110 09:59:52.551723 37249 u2_recognizer_main.cc:124] RTF is: 0.269674
 ```
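
Note on the RTF figures above: the real-time factor is the processing cost divided by the audio duration, and the 12.8% figure is the relative drop from the FP32 RTF to the INT8 RTF. The sketch below recomputes both from the logged totals. `rtf` and `relative_improvement` are hypothetical helper names.

```python
def rtf(cost_sec: float, audio_sec: float) -> float:
    """Real-time factor: seconds of compute per second of audio."""
    return cost_sec / audio_sec


def relative_improvement(baseline: float, improved: float) -> float:
    """Fractional reduction of `improved` with respect to `baseline`."""
    return (baseline - improved) / baseline


AUDIO_SEC = 36108.9  # total wav duration from the logs
fp32_rtf = rtf(11169.1, AUDIO_SEC)  # FP32 total cost from the log
int8_rtf = rtf(9737.63, AUDIO_SEC)  # INT8 total decode cost from the log
gain = relative_improvement(fp32_rtf, int8_rtf)
print(f"FP32 RTF: {fp32_rtf:.6f}, INT8 RTF: {int8_rtf:.6f}, gain: {gain:.1%}")
```

An RTF below 1 means the system processes audio faster than real time.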
--- a/speechx/examples/u2pp_ol/wenetspeech/local/decode.sh
+++ b/speechx/examples/u2pp_ol/wenetspeech/local/decode.sh
@ -9,8 +9,9 @@ nj=20
 mkdir -p $exp
 ckpt_dir=./data/model
 model_dir=$ckpt_dir/asr1_chunk_conformer_u2pp_wenetspeech_static_1.3.0.model/
+text=$data/test/text

-utils/run.pl JOB=1:$nj $data/split${nj}/JOB/decoder.fbank.wolm.log \
+utils/run.pl JOB=1:$nj $data/split${nj}/JOB/decoder.log \
 ctc_prefix_beam_search_decoder_main \
    --model_path=$model_dir/export.jit \
    --vocab_path=$model_dir/unit.txt \
@ -20,6 +21,6 @@ ctc_prefix_beam_search_decoder_main \
    --feature_rspecifier=scp:$data/split${nj}/JOB/fbank.scp \
    --result_wspecifier=ark,t:$data/split${nj}/JOB/result_decode.ark

-cat $data/split${nj}/*/result_decode.ark > $exp/${label_file}
-utils/compute-wer.py --char=1 --v=1 $text $exp/${label_file} > $exp/${wer}
-tail -n 7 $exp/${wer}
+cat $data/split${nj}/*/result_decode.ark > $exp/aishell.decode.rsl
+utils/compute-wer.py --char=1 --v=1 $text $exp/aishell.decode.rsl > $exp/aishell.decode.err
+tail -n 7 $exp/aishell.decode.err
--- a/speechx/examples/u2pp_ol/wenetspeech/local/nnet.sh
+++ b/speechx/examples/u2pp_ol/wenetspeech/local/nnet.sh
@ -1,18 +1,21 @@
 #!/bin/bash
-set -x
 set -e

 . path.sh

+nj=20
 data=data
 exp=exp
+
 mkdir -p $exp
 ckpt_dir=./data/model
 model_dir=$ckpt_dir/asr1_chunk_conformer_u2pp_wenetspeech_static_1.3.0.model/

+utils/run.pl JOB=1:$nj $data/split${nj}/JOB/nnet.log \
 u2_nnet_main \
    --model_path=$model_dir/export.jit \
-    --feature_rspecifier=ark,t:$exp/fbank.ark \
+    --vocab_path=$model_dir/unit.txt \
+    --feature_rspecifier=ark,t:${data}/split${nj}/JOB/fbank.ark \
    --nnet_decoder_chunk=16 \
    --receptive_field_length=7 \
    --subsampling_rate=4 \
@ -20,4 +23,3 @@ u2_nnet_main \
    --nnet_encoder_outs_wspecifier=ark,t:$exp/encoder_outs.ark \
    --nnet_prob_wspecifier=ark,t:$exp/logprobs.ark
 echo "u2 nnet decode."
-