Merge branch 'develop' of github.com:PaddlePaddle/PaddleSpeech into fix_name_bug

pull/2189/head
TianYuan 2 years ago
commit 7bbd9097a1

@ -25,7 +25,7 @@
| <a href="#documents"> Documents </a>
| <a href="#model-list"> Models List </a>
| <a href="https://aistudio.baidu.com/aistudio/education/group/info/25130"> AIStudio Courses </a>
| <a href="https://arxiv.org/abs/2205.12007"> NAACL2022 Paper </a>
| <a href="https://arxiv.org/abs/2205.12007"> NAACL2022 Best Demo Award Paper </a>
| <a href="https://gitee.com/paddlepaddle/PaddleSpeech"> Gitee </a>
</h4>
</div>
@ -34,7 +34,7 @@
**PaddleSpeech** is an open-source toolkit on [PaddlePaddle](https://github.com/PaddlePaddle/Paddle) platform for a variety of critical tasks in speech and audio, with state-of-the-art and influential models.
**PaddleSpeech** won the [NAACL2022 Best Demo Award](https://2022.naacl.org/blog/best-demo-award/).
**PaddleSpeech** won the [NAACL2022 Best Demo Award](https://2022.naacl.org/blog/best-demo-award/), please check out our paper on [Arxiv](https://arxiv.org/abs/2205.12007).
##### Speech Recognition
@ -179,7 +179,7 @@ Via the easy-to-use, efficient, flexible and scalable implementation, our vision
## Installation
We strongly recommend our users to install PaddleSpeech in **Linux** with *python>=3.7*.
We strongly recommend our users to install PaddleSpeech in **Linux** with *python>=3.7* and *paddlepaddle>=2.3.1*.
Up to now, **Linux** supports CLI for all of our tasks, while **Mac OSX** and **Windows** only support the PaddleSpeech CLI for Audio Classification, Speech-to-Text and Text-to-Speech. To install `PaddleSpeech`, please see [installation](./docs/source/install.md).
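As a quick sketch (the CPU build of PaddlePaddle is shown here; pick the build that matches your machine from the PaddlePaddle site), a typical pip-based install looks like:

```shell
# install the paddlepaddle backend first (CPU build as an example)
pip install paddlepaddle -i https://mirror.baidu.com/pypi/simple
# then install paddlespeech itself
pip install pytest-runner
pip install paddlespeech
```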

@ -20,7 +20,8 @@
</p>
<div align="center">
<h4>
<a href="#快速开始"> 快速开始 </a>
<a href="#安装"> 安装 </a>
| <a href="#快速开始"> 快速开始 </a>
| <a href="#快速使用服务"> 快速使用服务 </a>
| <a href="#快速使用流式服务"> 快速使用流式服务 </a>
| <a href="#教程文档"> 教程文档 </a>
@ -36,8 +37,10 @@
**PaddleSpeech** 是基于飞桨 [PaddlePaddle](https://github.com/PaddlePaddle/Paddle) 的语音方向的开源模型库,用于语音和音频中的各种关键任务的开发,包含大量基于深度学习前沿和有影响力的模型,一些典型的应用示例如下:
**PaddleSpeech** 荣获 [NAACL2022 Best Demo Award](https://2022.naacl.org/blog/best-demo-award/).
**PaddleSpeech** 荣获 [NAACL2022 Best Demo Award](https://2022.naacl.org/blog/best-demo-award/), 请访问 [Arxiv](https://arxiv.org/abs/2205.12007) 论文。
### 效果展示
##### 语音识别
<div align = "center">
@ -154,7 +157,7 @@
本项目采用了易用、高效、灵活以及可扩展的实现,旨在为工业应用、学术研究提供更好的支持,实现的功能包含训练、推断以及测试模块,以及部署过程,主要包括
- 📦 **易用性**: 安装门槛低,可使用 [CLI](#quick-start) 快速开始。
- 🏆 **对标 SoTA**: 提供了高速、轻量级模型,且借鉴了最前沿的技术。
- 🏆 **流式ASR和TTS系统**:工业级的端到端流式识别、流式合成系统。
- 🏆 **流式 ASR 和 TTS 系统**:工业级的端到端流式识别、流式合成系统。
- 💯 **基于规则的中文前端**: 我们的前端包含文本正则化和字音转换(G2P)。此外,我们使用自定义语言规则来适应中文语境。
- **多种工业界以及学术界主流功能支持**:
- 🛎️ 典型音频任务: 本工具包提供了音频任务如音频分类、语音翻译、自动语音识别、文本转语音、语音合成、声纹识别、KWS等任务的实现。
@ -182,61 +185,195 @@
<img src="https://user-images.githubusercontent.com/23690325/169763015-cbd8e28d-602c-4723-810d-dbc6da49441e.jpg" width = "200" />
</div>
<a name="安装"></a>
## 安装
我们强烈建议用户在 **Linux** 环境下,*3.7* 以上版本的 *python* 上安装 PaddleSpeech。
目前为止,**Linux** 支持声音分类、语音识别、语音合成和语音翻译四种功能,**Mac OSX、 Windows** 下暂不支持语音翻译功能。 想了解具体安装细节,可以参考[安装文档](./docs/source/install_cn.md)。
### 相关依赖
+ gcc >= 4.8.5
+ paddlepaddle >= 2.3.1
+ python >= 3.7
+ linux(推荐), mac, windows
PaddleSpeech 依赖于 paddlepaddle,安装可以参考 [paddlepaddle 官网](https://www.paddlepaddle.org.cn/),根据自己机器的情况进行选择。这里给出 cpu 版本示例,其它版本大家可以根据自己机器的情况进行安装。
```shell
pip install paddlepaddle -i https://mirror.baidu.com/pypi/simple
```
PaddleSpeech 快速安装方式有两种,一种是 pip 安装,一种是源码编译(推荐)。
### pip 安装
```shell
pip install pytest-runner
pip install paddlespeech
```
### 源码编译
```shell
git clone https://github.com/PaddlePaddle/PaddleSpeech.git
cd PaddleSpeech
pip install pytest-runner
pip install .
```
更多关于安装问题,如 conda 环境、librosa 依赖的系统库、gcc 环境问题、kaldi 安装等,可以参考这篇[安装文档](docs/source/install_cn.md)。如安装上遇到问题,可以在 [#2150](https://github.com/PaddlePaddle/PaddleSpeech/issues/2150) 上留言以及查找相关问题。
<a name="快速开始"></a>
## 快速开始
安装完成后,开发者可以通过命令行快速开始,改变 `--input` 可以尝试用自己的音频或文本测试。
安装完成后,开发者可以通过命令行或者 Python 快速开始,命令行模式下改变 `--input` 可以尝试用自己的音频或文本测试,支持 16k wav 格式音频。
你也可以在 `aistudio` 中快速体验 👉🏻[PaddleSpeech API Demo](https://aistudio.baidu.com/aistudio/projectdetail/4281335?shared=1)。
**声音分类**
测试音频示例下载
```shell
paddlespeech cls --input input.wav
wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav
wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/en.wav
```
**声纹识别**
### 语音识别
<details><summary>&emsp;(点击可展开)开源中文语音识别</summary>
命令行一键体验
```shell
paddlespeech vector --task spk --input input_16k.wav
paddlespeech asr --lang zh --input zh.wav
```
Python API 一键预测
```python
>>> from paddlespeech.cli.asr.infer import ASRExecutor
>>> asr = ASRExecutor()
>>> result = asr(audio_file="zh.wav")
>>> print(result)
我认为跑步最重要的就是给我带来了身体健康
```
**语音识别**
</details>
### 语音合成
<details><summary>&emsp;开源中文语音合成</summary>
输出 24k 采样率 wav 格式音频。
命令行一键体验
```shell
paddlespeech asr --lang zh --input input_16k.wav
paddlespeech tts --input "你好,欢迎使用百度飞桨深度学习框架!" --output output.wav
```
Python API 一键预测
```python
>>> from paddlespeech.cli.tts.infer import TTSExecutor
>>> tts = TTSExecutor()
>>> tts(text="今天天气十分不错。", output="output.wav")
```
**语音翻译** (English to Chinese)
- 语音合成的 web demo 已经集成进了 [Huggingface Spaces](https://huggingface.co/spaces). 请参考: [TTS Demo](https://huggingface.co/spaces/KPatrick/PaddleSpeechTTS)
</details>
### 声音分类
<details><summary>&emsp;适配多场景的开放领域声音分类工具</summary>
基于 AudioSet 数据集 527 个类别的声音分类模型。
命令行一键体验
```shell
paddlespeech st --input input_16k.wav
paddlespeech cls --input zh.wav
```
**语音合成**
Python API 一键预测
```python
>>> from paddlespeech.cli.cls.infer import CLSExecutor
>>> cls = CLSExecutor()
>>> result = cls(audio_file="zh.wav")
>>> print(result)
Speech 0.9027186632156372
```
</details>
### 声纹提取
<details><summary>&emsp;工业级声纹提取工具</summary>
命令行一键体验
```shell
paddlespeech tts --input "你好,欢迎使用百度飞桨深度学习框架!" --output output.wav
paddlespeech vector --task spk --input zh.wav
```
- 语音合成的 web demo 已经集成进了 [Huggingface Spaces](https://huggingface.co/spaces). 请参考: [TTS Demo](https://huggingface.co/spaces/akhaliq/paddlespeech)
**文本后处理**
- 标点恢复
```bash
paddlespeech text --task punc --input 今天的天气真不错啊你下午有空吗我想约你一起去吃饭
```
Python API 一键预测
**批处理**
```python
>>> from paddlespeech.cli.vector import VectorExecutor
>>> vec = VectorExecutor()
>>> result = vec(audio_file="zh.wav")
>>> print(result) # 187维向量
[ -0.19083306 9.474295 -14.122263 -2.0916545 0.04848729
4.9295826 1.4780062 0.3733844 10.695862 3.2697146
-4.48199 -0.6617882 -9.170393 -11.1568775 -1.2358263 ...]
```
echo -e "1 欢迎光临。\n2 谢谢惠顾。" | paddlespeech tts
</details>
### 标点恢复
<details><summary>&emsp;一键恢复文本标点,可与 ASR 模型配合使用</summary>
命令行一键体验
```shell
paddlespeech text --task punc --input 今天的天气真不错啊你下午有空吗我想约你一起去吃饭
```
Python API 一键预测
```python
>>> from paddlespeech.cli.text.infer import TextExecutor
>>> text_punc = TextExecutor()
>>> result = text_punc(text="今天的天气真不错啊你下午有空吗我想约你一起去吃饭")
今天的天气真不错啊!你下午有空吗?我想约你一起去吃饭。
```
**Shell管道**
ASR + Punc:
</details>
### 语音翻译
<details><summary>&emsp;端到端英译中语音翻译工具</summary>
使用预编译的 kaldi 相关工具,只支持在 Ubuntu 系统中体验
命令行一键体验
```shell
paddlespeech st --input en.wav
```
paddlespeech asr --input ./zh.wav | paddlespeech text --task punc
Python API 一键预测
```python
>>> from paddlespeech.cli.st.infer import STExecutor
>>> st = STExecutor()
>>> result = st(audio_file="en.wav")
['我 在 这栋 建筑 的 古老 门上 敲门 。']
```
更多命令行命令请参考 [demos](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/demos)
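例如,可以先用下面的命令查看某个任务可用的预训练模型以及命令行帮助(示意,命令与下文 demos 脚本中的用法一致):

```shell
# 查看 ASR 任务可用的预训练模型
paddlespeech stats --task asr
# 查看命令行帮助
paddlespeech --help
```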
> Note: 如果需要训练或者微调,请查看[语音识别](./docs/source/asr/quick_start.md)、[语音合成](./docs/source/tts/quick_start.md)。
</details>
<a name="快速使用服务"></a>
## 快速使用服务
安装完成后,开发者可以通过命令行快速使用服务。
安装完成后,开发者可以通过命令行一键启动语音识别、语音合成、音频分类三种服务。
**启动服务**
```shell
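# 示意:与下文离线语音服务 demo 的启动脚本一致,假设使用默认的 ./conf/application.yaml 配置
paddlespeech_server start --config_file ./conf/application.yaml
```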
@ -614,6 +751,7 @@ PaddleSpeech 的 **语音合成** 主要包含三个模块:文本前端、声
语音合成模块最初被称为 [Parakeet](https://github.com/PaddlePaddle/Parakeet),现在与此仓库合并。如果您对该任务的学术研究感兴趣,请参阅 [TTS 研究概述](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/docs/source/tts#overview)。此外,[模型介绍](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/tts/models_introduction.md) 是了解语音合成流程的一个很好的指南。
## ⭐ 应用案例
- **[PaddleBoBo](https://github.com/JiehangXie/PaddleBoBo): 使用 PaddleSpeech 的语音合成模块生成虚拟人的声音。**

@ -12,6 +12,7 @@ This directory contains many speech applications in multiple scenarios.
* speech recognition - recognize text of an audio file
* speech server - Server for Speech Task, e.g. ASR,TTS,CLS
* streaming asr server - receive audio stream from websocket, and recognize it into a transcript.
* streaming tts server - receive text from http or websocket, and return a streaming audio data stream.
* speech translation - end to end speech translation
* story talker - book reader based on OCR and TTS
* style_fs2 - multi style control for FastSpeech2 model

@ -10,8 +10,9 @@
* 元宇宙 - 基于语音合成的 2D 增强现实。
* 标点恢复 - 通常作为语音识别的文本后处理任务,为一段无标点的纯文本添加相应的标点符号。
* 语音识别 - 识别一段音频中包含的语音文字。
* 语音服务 - 离线语音服务包括ASR、TTS、CLS等
* 流式语音识别服务 - 流式输入语音数据流识别音频中的文字
* 语音服务 - 离线语音服务包括ASR、TTS、CLS等。
* 流式语音识别服务 - 流式输入语音数据流识别音频中的文字。
* 流式语音合成服务 - 根据待合成文本流式生成合成音频数据流。
* 语音翻译 - 实时识别音频中的语言,并同时翻译成目标语言。
* 会说话的故事书 - 基于 OCR 和语音合成的会说话的故事书。
* 个性化语音合成 - 基于 FastSpeech2 模型的个性化语音合成。

@ -1,6 +1,7 @@
#!/bin/bash
wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespeech.bj.bcebos.com/PaddleAudio/en.wav
wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav
wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/en.wav
# asr
paddlespeech asr --input ./zh.wav
@ -8,3 +9,18 @@ paddlespeech asr --input ./zh.wav
# asr + punc
paddlespeech asr --input ./zh.wav | paddlespeech text --task punc
# asr help
paddlespeech asr --help
# english asr
paddlespeech asr --lang en --model transformer_librispeech --input ./en.wav
# model stats
paddlespeech stats --task asr
# paddlespeech help
paddlespeech --help

@ -14,7 +14,10 @@ For service interface definition, please check:
see [installation](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install.md).
It is recommended to use **paddlepaddle 2.3.1** or above.
You can choose one way from meduim and hard to install paddlespeech.
You can choose one way from easy, medium and hard to install paddlespeech.
**If you install in easy mode, you need to prepare the yaml file by yourself; you can refer to the yaml file in the conf directory.**
### 2. Prepare config File
The configuration file can be found in `conf/application.yaml` .
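A minimal sketch of starting the server with this config (assuming the default `conf/application.yaml` is used as-is, matching the demo's start script):

```bash
paddlespeech_server start --config_file ./conf/application.yaml
```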

@ -3,8 +3,10 @@
# 语音服务
## 介绍
这个 demo 是一个启动离线语音服务和访问服务的实现。它可以通过使用 `paddlespeech_server` 和 `paddlespeech_client` 的单个命令或 python 的几行代码来实现。
服务接口定义请参考:
- [PaddleSpeech Server RESTful API](https://github.com/PaddlePaddle/PaddleSpeech/wiki/PaddleSpeech-Server-RESTful-API)
@ -13,12 +15,17 @@
请看 [安装文档](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install.md).
推荐使用 **paddlepaddle 2.3.1** 或以上版本。
你可以从 mediumhard 两种方式中选择一种方式安装 PaddleSpeech。
你可以从简单、中等、困难几种方式中选择一种方式安装 PaddleSpeech。
**如果使用简单模式安装,需要自行准备 yaml 文件,可参考 conf 目录下的 yaml 文件。**
### 2. 准备配置文件
配置文件可参见 `conf/application.yaml`
其中,`engine_list`表示即将启动的服务将会包含的语音引擎,格式为 <语音任务>_<引擎类型>。
其中,`engine_list` 表示即将启动的服务将会包含的语音引擎,格式为 <语音任务>_<引擎类型>。
目前服务集成的语音任务有: asr (语音识别)、tts (语音合成)、cls (音频分类)、vector (声纹识别)以及 text (文本处理)。
目前引擎类型支持两种形式:python 及 inference (Paddle Inference)。
**注意:** 如果在容器里可正常启动服务,但客户端访问 ip 不可达,可尝试将配置文件中 `host` 地址换成本地 ip 地址。

@ -1,3 +1,3 @@
#!/bin/bash
paddlespeech_server start --config_file ./conf/application.yaml
paddlespeech_server start --config_file ./conf/application.yaml &> server.log &

@ -0,0 +1,10 @@
#!/bin/bash
wget -c https://paddlespeech.bj.bcebos.com/vector/audio/85236145389.wav
wget -c https://paddlespeech.bj.bcebos.com/vector/audio/123456789.wav
# sid extract
paddlespeech_client vector --server_ip 127.0.0.1 --port 8090 --task spk --input ./85236145389.wav
# sid score
paddlespeech_client vector --server_ip 127.0.0.1 --port 8090 --task score --enroll ./85236145389.wav --test ./123456789.wav

@ -0,0 +1,4 @@
#!/bin/bash
paddlespeech_client text --server_ip 127.0.0.1 --port 8090 --input 今天的天气真好啊你下午有空吗我想约你一起去吃饭

@ -1,6 +1,6 @@
# Paddle Speech Demo
PaddleSpeechDemo 是一个以 PaddleSpeech 的语音交互功能为主体开发的Demo展示项目用于帮助大家更好的上手PaddleSpeech 以及使用 PaddleSpeech 构建自己的应用。
PaddleSpeechDemo 是一个以 PaddleSpeech 的语音交互功能为主体开发的 Demo 展示项目,用于帮助大家更好的上手 PaddleSpeech 以及使用 PaddleSpeech 构建自己的应用。
智能语音交互部分使用 PaddleSpeech,对话以及信息抽取部分使用 PaddleNLP,网页前端展示部分基于 Vue3 进行开发。

@ -747,9 +747,9 @@
}
},
"node_modules/moment": {
"version": "2.29.3",
"resolved": "https://registry.npmmirror.com/moment/-/moment-2.29.3.tgz",
"integrity": "sha512-c6YRvhEo//6T2Jz/vVtYzqBzwvPT95JBQ+smCytzf7c50oMZRsR/a4w88aD34I+/QVSfnoAnSBFPJHItlOMJVw==",
"version": "2.29.4",
"resolved": "https://registry.npmjs.org/moment/-/moment-2.29.4.tgz",
"integrity": "sha512-5LC9SOxjSc2HF6vO2CyuTDNivEdoz2IvyJJGj6X8DJ0eFyfszE0QiEd+iXmBvUP3WHxSjFH/vIsA0EN00cgr8w==",
"engines": {
"node": "*"
}
@ -1636,9 +1636,9 @@
"optional": true
},
"moment": {
"version": "2.29.3",
"resolved": "https://registry.npmmirror.com/moment/-/moment-2.29.3.tgz",
"integrity": "sha512-c6YRvhEo//6T2Jz/vVtYzqBzwvPT95JBQ+smCytzf7c50oMZRsR/a4w88aD34I+/QVSfnoAnSBFPJHItlOMJVw=="
"version": "2.29.4",
"resolved": "https://registry.npmjs.org/moment/-/moment-2.29.4.tgz",
"integrity": "sha512-5LC9SOxjSc2HF6vO2CyuTDNivEdoz2IvyJJGj6X8DJ0eFyfszE0QiEd+iXmBvUP3WHxSjFH/vIsA0EN00cgr8w=="
},
"nanoid": {
"version": "3.3.2",

@ -587,9 +587,9 @@ mime@^1.4.1:
integrity sha512-x0Vn8spI+wuJ1O6S7gnbaQg8Pxh4NNHb7KSINmEWKiPE4RKOplvijn+NkmYmmRgP68mc70j2EbeTFRsrswaQeg==
moment@^2.27.0:
version "2.29.3"
resolved "https://registry.npmmirror.com/moment/-/moment-2.29.3.tgz"
integrity sha512-c6YRvhEo//6T2Jz/vVtYzqBzwvPT95JBQ+smCytzf7c50oMZRsR/a4w88aD34I+/QVSfnoAnSBFPJHItlOMJVw==
version "2.29.4"
resolved "https://registry.yarnpkg.com/moment/-/moment-2.29.4.tgz#3dbe052889fe7c1b2ed966fcb3a77328964ef108"
integrity sha512-5LC9SOxjSc2HF6vO2CyuTDNivEdoz2IvyJJGj6X8DJ0eFyfszE0QiEd+iXmBvUP3WHxSjFH/vIsA0EN00cgr8w==
ms@^2.1.1:
version "2.1.3"

@ -15,7 +15,10 @@ Streaming ASR server only support `websocket` protocol, and doesn't support `htt
see [installation](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install.md).
It is recommended to use **paddlepaddle 2.3.1** or above.
You can choose one way from meduim and hard to install paddlespeech.
You can choose one way from easy, medium and hard to install paddlespeech.
**If you install in easy mode, you need to prepare the yaml file by yourself; you can refer to the yaml file in the conf directory.**
### 2. Prepare config File
The configuration file can be found in `conf/ws_application.yaml` and `conf/ws_conformer_wenetspeech_application.yaml`.
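As a rough sketch (assuming the conformer wenetspeech config, as in this demo's start script), the streaming ASR server can then be launched in the background with:

```bash
paddlespeech_server start --config_file conf/ws_conformer_wenetspeech_application.yaml &> streaming_asr.log &
```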

@ -3,12 +3,11 @@
# 流式语音识别服务
## 介绍
这个demo是一个启动流式语音服务和访问服务的实现。 它可以通过使用 `paddlespeech_server``paddlespeech_client`的单个命令或 python 的几行代码来实现。
这个 demo 是一个启动流式语音服务和访问服务的实现。它可以通过使用 `paddlespeech_server` 和 `paddlespeech_client` 的单个命令或 python 的几行代码来实现。
**流式语音识别服务只支持 `weboscket` 协议,不支持 `http` 协议。**
For service interface definition, please check:
服务接口定义请参考:
- [PaddleSpeech Streaming Server WebSocket API](https://github.com/PaddlePaddle/PaddleSpeech/wiki/PaddleSpeech-Server-WebSocket-API)
## 使用方法
@ -16,7 +15,10 @@ For service interface definition, please check:
安装 PaddleSpeech 的详细过程请看 [安装文档](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install.md)。
推荐使用 **paddlepaddle 2.3.1** 或以上版本。
你可以从mediumhard 两种方式中选择一种方式安装 PaddleSpeech。
你可以从简单、中等、困难几种方式中选择一种方式安装 PaddleSpeech。
**如果使用简单模式安装,需要自行准备 yaml 文件,可参考 conf 目录下的 yaml 文件。**
### 2. 准备配置文件

@ -38,4 +38,4 @@ if __name__ == '__main__':
T += m['T']
P += m['P']
print(f"RTF: {P/T}")
print(f"RTF: {P/T}, utts: {n}")

@ -1,9 +1,8 @@
export CUDA_VISIBLE_DEVICE=0,1,2,3
export CUDA_VISIBLE_DEVICE=0,1,2,3
#export CUDA_VISIBLE_DEVICE=0,1,2,3
# nohup python3 punc_server.py --config_file conf/punc_application.yaml > punc.log 2>&1 &
# nohup python3 local/punc_server.py --config_file conf/punc_application.yaml > punc.log 2>&1 &
paddlespeech_server start --config_file conf/punc_application.yaml &> punc.log &
# nohup python3 streaming_asr_server.py --config_file conf/ws_conformer_wenetspeech_application.yaml > streaming_asr.log 2>&1 &
# nohup python3 local/streaming_asr_server.py --config_file conf/ws_conformer_wenetspeech_application.yaml > streaming_asr.log 2>&1 &
paddlespeech_server start --config_file conf/ws_conformer_wenetspeech_application.yaml &> streaming_asr.log &

@ -7,5 +7,5 @@ paddlespeech_client asr_online --server_ip 127.0.0.1 --port 8090 --input ./zh.wa
# read the wav and call streaming and punc service
# If `127.0.0.1` is not accessible, you need to use the actual service IP address.
paddlespeech_client asr_online --server_ip 127.0.0.1 --port 8290 --punc.server_ip 127.0.0.1 --punc.port 8190 --input ./zh.wav
paddlespeech_client asr_online --server_ip 127.0.0.1 --port 8090 --punc.server_ip 127.0.0.1 --punc.port 8190 --input ./zh.wav

@ -15,7 +15,10 @@ For service interface definition, please check:
see [installation](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install.md).
It is recommended to use **paddlepaddle 2.3.1** or above.
You can choose one way from meduim and hard to install paddlespeech.
You can choose one way from easy, medium and hard to install paddlespeech.
**If you install in easy mode, you need to prepare the yaml file by yourself; you can refer to the yaml file in the conf directory.**
### 2. Prepare config File
The configuration file can be found in `conf/tts_online_application.yaml`.
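As a rough sketch (assuming the default http-protocol config is kept), the streaming TTS server can be started with the same command as the demo's start script:

```bash
paddlespeech_server start --config_file ./conf/tts_online_application.yaml &> tts.http.log &
```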

@ -13,7 +13,11 @@
请看 [安装文档](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install.md).
推荐使用 **paddlepaddle 2.3.1** 或以上版本。
你可以从 mediumhard 两种方式中选择一种方式安装 PaddleSpeech。
你可以从简单、中等、困难几种方式中选择一种方式安装 PaddleSpeech。
**如果使用简单模式安装,需要自行准备 yaml 文件,可参考 conf 目录下的 yaml 文件。**
### 2. 准备配置文件
配置文件可参见 `conf/tts_online_application.yaml`。
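下面给出一个示意(命令与本目录的启动和测试脚本一致,假设使用默认的 http 协议和 8092 端口):

```bash
# 启动流式语音合成服务
paddlespeech_server start --config_file ./conf/tts_online_application.yaml &> tts.http.log &
# 使用 http 客户端测试
paddlespeech_client tts_online --server_ip 127.0.0.1 --port 8092 --protocol http --input "您好,欢迎使用百度飞桨语音合成服务。" --output output.http.wav
```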

@ -2,8 +2,8 @@
# http client test
# If `127.0.0.1` is not accessible, you need to use the actual service IP address.
paddlespeech_client tts_online --server_ip 127.0.0.1 --port 8092 --protocol http --input "您好,欢迎使用百度飞桨语音合成服务。" --output output.wav
paddlespeech_client tts_online --server_ip 127.0.0.1 --port 8092 --protocol http --input "您好,欢迎使用百度飞桨语音合成服务。" --output output.http.wav
# websocket client test
# If `127.0.0.1` is not accessible, you need to use the actual service IP address.
# paddlespeech_client tts_online --server_ip 127.0.0.1 --port 8092 --protocol websocket --input "您好,欢迎使用百度飞桨语音合成服务。" --output output.wav
paddlespeech_client tts_online --server_ip 127.0.0.1 --port 8192 --protocol websocket --input "您好,欢迎使用百度飞桨语音合成服务。" --output output.ws.wav

@ -0,0 +1,103 @@
# This is the parameter configuration file for streaming tts server.
#################################################################################
# SERVER SETTING #
#################################################################################
host: 0.0.0.0
port: 8192
# The task format in the engine_list is: <speech task>_<engine type>
# engine_list choices = ['tts_online', 'tts_online-onnx'], the inference speed of tts_online-onnx is faster than tts_online.
# protocol choices = ['websocket', 'http']
protocol: 'websocket'
engine_list: ['tts_online-onnx']
#################################################################################
# ENGINE CONFIG #
#################################################################################
################################### TTS #########################################
################### speech task: tts; engine_type: online #######################
tts_online:
# am (acoustic model) choices=['fastspeech2_csmsc', 'fastspeech2_cnndecoder_csmsc']
# fastspeech2_cnndecoder_csmsc support streaming am infer.
am: 'fastspeech2_csmsc'
am_config:
am_ckpt:
am_stat:
phones_dict:
tones_dict:
speaker_dict:
spk_id: 0
# voc (vocoder) choices=['mb_melgan_csmsc', 'hifigan_csmsc']
# Both mb_melgan_csmsc and hifigan_csmsc support streaming voc inference
voc: 'mb_melgan_csmsc'
voc_config:
voc_ckpt:
voc_stat:
# others
lang: 'zh'
device: 'cpu' # set 'gpu:id' or 'cpu'
# am_block and am_pad only for fastspeech2_cnndecoder_onnx model to streaming am infer,
# when am_pad set 12, streaming synthetic audio is the same as non-streaming synthetic audio
am_block: 72
am_pad: 12
# voc_pad and voc_block are used by the voc model for streaming voc inference;
# when voc model is mb_melgan_csmsc, voc_pad set 14, streaming synthetic audio is the same as non-streaming synthetic audio; The minimum value of pad can be set to 7, streaming synthetic audio sounds normal
# when voc model is hifigan_csmsc, voc_pad set 19, streaming synthetic audio is the same as non-streaming synthetic audio; voc_pad set 14, streaming synthetic audio sounds normal
voc_block: 36
voc_pad: 14
#################################################################################
# ENGINE CONFIG #
#################################################################################
################################### TTS #########################################
################### speech task: tts; engine_type: online-onnx #######################
tts_online-onnx:
# am (acoustic model) choices=['fastspeech2_csmsc_onnx', 'fastspeech2_cnndecoder_csmsc_onnx']
# fastspeech2_cnndecoder_csmsc_onnx support streaming am infer.
am: 'fastspeech2_cnndecoder_csmsc_onnx'
# am_ckpt is a list, if am is fastspeech2_cnndecoder_csmsc_onnx, am_ckpt = [encoder model, decoder model, postnet model];
# if am is fastspeech2_csmsc_onnx, am_ckpt = [ckpt model];
am_ckpt: # list
am_stat:
phones_dict:
tones_dict:
speaker_dict:
spk_id: 0
am_sample_rate: 24000
am_sess_conf:
device: "cpu" # set 'gpu:id' or 'cpu'
use_trt: False
cpu_threads: 4
# voc (vocoder) choices=['mb_melgan_csmsc_onnx', 'hifigan_csmsc_onnx']
# Both mb_melgan_csmsc_onnx and hifigan_csmsc_onnx support streaming voc inference
voc: 'hifigan_csmsc_onnx'
voc_ckpt:
voc_sample_rate: 24000
voc_sess_conf:
device: "cpu" # set 'gpu:id' or 'cpu'
use_trt: False
cpu_threads: 4
# others
lang: 'zh'
# am_block and am_pad only for fastspeech2_cnndecoder_onnx model to streaming am infer,
# when am_pad set 12, streaming synthetic audio is the same as non-streaming synthetic audio
am_block: 72
am_pad: 12
# voc_pad and voc_block are used by the voc model for streaming voc inference;
# when voc model is mb_melgan_csmsc_onnx, voc_pad set 14, streaming synthetic audio is the same as non-streaming synthetic audio; The minimum value of pad can be set to 7, streaming synthetic audio sounds normal
# when voc model is hifigan_csmsc_onnx, voc_pad set 19, streaming synthetic audio is the same as non-streaming synthetic audio; voc_pad set 14, streaming synthetic audio sounds normal
voc_block: 36
voc_pad: 14
# voc_upsample should be same as n_shift on voc config.
voc_upsample: 300

@ -0,0 +1,10 @@
#!/bin/bash
# http server
paddlespeech_server start --config_file ./conf/tts_online_application.yaml &> tts.http.log &
# websocket server
paddlespeech_server start --config_file ./conf/tts_online_ws_application.yaml &> tts.ws.log &

@ -1,3 +0,0 @@
#!/bin/bash
# start server
paddlespeech_server start --config_file ./conf/tts_online_application.yaml

@ -4,4 +4,10 @@
paddlespeech tts --input 今天的天气不错啊
# Batch process
echo -e "1 欢迎光临。\n2 谢谢惠顾。" | paddlespeech tts
echo -e "1 欢迎光临。\n2 谢谢惠顾。" | paddlespeech tts
# Text Frontend
paddlespeech tts --input 今天是2022/10/29,最低温度是-3℃.

@ -1,15 +1,17 @@
FROM registry.baidubce.com/paddlepaddle/paddle:2.2.2
LABEL maintainer="paddlesl@baidu.com"
RUN apt-get update \
&& apt-get install libsndfile-dev \
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/*
RUN git clone --depth 1 https://github.com/PaddlePaddle/PaddleSpeech.git /home/PaddleSpeech
RUN pip3 uninstall mccabe -y ; exit 0;
RUN pip3 install multiprocess==0.70.12 importlib-metadata==4.2.0 dill==0.3.4
RUN cd /home/PaddleSpeech/audio
RUN python setup.py bdist_wheel
RUN cd /home/PaddleSpeech
WORKDIR /home/PaddleSpeech/
RUN python setup.py bdist_wheel
RUN pip install audio/dist/*.whl dist/*.whl
RUN pip install dist/*.whl -i https://pypi.tuna.tsinghua.edu.cn/simple
WORKDIR /home/PaddleSpeech/
CMD ["bash"]

@ -48,4 +48,5 @@ fastapi
websockets
keyboard
uvicorn
pattern_singleton
pattern_singleton
braceexpand

@ -117,9 +117,9 @@ conda install -y -c gcc_linux-64=8.4.0 gxx_linux-64=8.4.0
```
(Tip: Do not use the last script if you want to install by the **Hard** way):
### Install PaddlePaddle
You can choose the `PaddlePaddle` version based on your system. For example, for CUDA 10.2, CuDNN7.5 install paddlepaddle-gpu 2.2.0:
You can choose the `PaddlePaddle` version based on your system. For example, for CUDA 10.2, CuDNN7.5 install paddlepaddle-gpu 2.3.1:
```bash
python3 -m pip install paddlepaddle-gpu==2.2.0 -i https://mirror.baidu.com/pypi/simple
python3 -m pip install paddlepaddle-gpu==2.3.1 -i https://mirror.baidu.com/pypi/simple
```
### Install PaddleSpeech
You can install `paddlespeech` by the following command, then you can use the `ready-made` examples in `paddlespeech`:
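For example (a sketch, assuming the Tsinghua PyPI mirror used elsewhere in this guide):

```bash
pip install pytest-runner -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install paddlespeech -i https://pypi.tuna.tsinghua.edu.cn/simple
```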
@ -180,9 +180,9 @@ Some users may fail to install `kaldiio` due to the default download source, you
```bash
pip install pytest-runner -i https://pypi.tuna.tsinghua.edu.cn/simple
```
Make sure you have GPU and the paddlepaddle version is right. For example, for CUDA 10.2, CuDNN7.5 install paddle 2.2.0:
Make sure you have GPU and the paddlepaddle version is right. For example, for CUDA 10.2, CuDNN7.5 install paddle 2.3.1:
```bash
python3 -m pip install paddlepaddle-gpu==2.2.0 -i https://mirror.baidu.com/pypi/simple
python3 -m pip install paddlepaddle-gpu==2.3.1 -i https://mirror.baidu.com/pypi/simple
```
### Install PaddleSpeech in Developing Mode
```bash

@ -111,9 +111,9 @@ conda install -y -c gcc_linux-64=8.4.0 gxx_linux-64=8.4.0
```
(提示: 如果你想使用**困难**方式完成安装,请不要使用最后一条命令)
### 安装 PaddlePaddle
你可以根据系统配置选择 PaddlePaddle 版本,例如系统使用 CUDA 10.2 CuDNN7.5 ,你可以安装 paddlepaddle-gpu 2.2.0
你可以根据系统配置选择 PaddlePaddle 版本,例如系统使用 CUDA 10.2 CuDNN7.5 ,你可以安装 paddlepaddle-gpu 2.3.1
```bash
python3 -m pip install paddlepaddle-gpu==2.2.0 -i https://mirror.baidu.com/pypi/simple
python3 -m pip install paddlepaddle-gpu==2.3.1 -i https://mirror.baidu.com/pypi/simple
```
### 安装 PaddleSpeech
最后安装 `paddlespeech`,这样你就可以使用 `paddlespeech` 中已有的 examples
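示例(示意,假设使用与上文英文文档相同的清华镜像源):

```bash
pip install pytest-runner -i https://pypi.tuna.tsinghua.edu.cn/simple
pip install paddlespeech -i https://pypi.tuna.tsinghua.edu.cn/simple
```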
@ -168,9 +168,9 @@ conda activate tools/venv
conda install -y -c conda-forge sox libsndfile swig bzip2 libflac bc
```
### 安装 PaddlePaddle
请确认你系统是否有 GPU并且使用了正确版本的 paddlepaddle。例如系统使用 CUDA 10.2, CuDNN7.5 ,你可以安装 paddlepaddle-gpu 2.2.0
请确认你系统是否有 GPU并且使用了正确版本的 paddlepaddle。例如系统使用 CUDA 10.2, CuDNN7.5 ,你可以安装 paddlepaddle-gpu 2.3.1
```bash
python3 -m pip install paddlepaddle-gpu==2.2.0 -i https://mirror.baidu.com/pypi/simple
python3 -m pip install paddlepaddle-gpu==2.3.1 -i https://mirror.baidu.com/pypi/simple
```
### 用开发者模式安装 PaddleSpeech
部分用户系统由于默认源的问题,安装中会出现 kaldiio 安装出错的问题,建议首先安装 pytest-runner:
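示例(假设使用与上文英文文档相同的清华镜像源):

```bash
pip install pytest-runner -i https://pypi.tuna.tsinghua.edu.cn/simple
```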

@ -1,5 +1,5 @@
# Transformer/Conformer ASR with Aishell
This example contains code used to train a Transformer or [Conformer](http://arxiv.org/abs/2008.03802) model with [Aishell dataset](http://www.openslr.org/resources/33)
This example contains code used to train a [u2](https://arxiv.org/pdf/2012.05481.pdf) model (Transformer or [Conformer](https://arxiv.org/pdf/2005.08100.pdf) model) with [Aishell dataset](http://www.openslr.org/resources/33)
## Overview
All the scripts you need are in `run.sh`. There are several stages in `run.sh`, and each stage has its function.
| Stage | Function |

@ -1,20 +1,3 @@
# Callcenter 8k sample rate
Data distribution:
```
676048 utts
491.4004722221223 h
4357792.0 text
2.4633630739178654 text/sec
2.6167397877068495 sec/utt
```
train/dev/test partition:
```
33802 manifest.dev
67606 manifest.test
574640 manifest.train
676048 total
```
This recipe only has the model/data config for 8k ASR; users need to prepare the data and generate the manifest metafile themselves. You can refer to Aishell or Librispeech.

@ -154,7 +154,7 @@ VITS checkpoint contains files listed below.
vits_csmsc_ckpt_1.1.0
├── default.yaml # default config used to train vits
├── phone_id_map.txt # phone vocabulary file when training vits
└── snapshot_iter_350000.pdz # model parameters and optimizer states
└── snapshot_iter_333000.pdz # model parameters and optimizer states
```
ps: This ckpt is not good enough; a better one is still in training.
@ -169,7 +169,7 @@ FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/synthesize_e2e.py \
--config=vits_csmsc_ckpt_1.1.0/default.yaml \
--ckpt=vits_csmsc_ckpt_1.1.0/snapshot_iter_350000.pdz \
--ckpt=vits_csmsc_ckpt_1.1.0/snapshot_iter_333000.pdz \
--phones_dict=vits_csmsc_ckpt_1.1.0/phone_id_map.txt \
--output_dir=exp/default/test_e2e \
--text=${BIN_DIR}/../sentences.txt \

@ -179,7 +179,7 @@ generator_first: False # whether to start updating generator first
# OTHER TRAINING SETTING #
##########################################################
num_snapshots: 10 # max number of snapshots to keep while training
train_max_steps: 250000 # Number of training steps. == total_iters / ngpus, total_iters = 1000000
train_max_steps: 350000 # Number of training steps. == total_iters / ngpus, total_iters = 1000000
save_interval_steps: 1000 # Interval steps to save checkpoint.
eval_interval_steps: 250 # Interval steps to evaluate the network.
seed: 777 # random seed number

@ -1,5 +1,5 @@
# Transformer/Conformer ASR with Librispeech
This example contains code used to train a Transformer or [Conformer](http://arxiv.org/abs/2008.03802) model with [Librispeech dataset](http://www.openslr.org/resources/12)
This example contains code used to train [u2](https://arxiv.org/pdf/2012.05481.pdf) model (Transformer or [Conformer](https://arxiv.org/pdf/2005.08100.pdf) model) with [Librispeech dataset](http://www.openslr.org/resources/12)
## Overview
All the scripts you need are in `run.sh`. There are several stages in `run.sh`, and each stage has its function.
| Stage | Function |

@ -1,6 +1,6 @@
# Transformer/Conformer ASR with Librispeech ASR2
This example contains code used to train a Transformer or [Conformer](http://arxiv.org/abs/2008.03802) model with [Librispeech dataset](http://www.openslr.org/resources/12) and use some functions in kaldi.
This example contains code used to train a [u2](https://arxiv.org/pdf/2012.05481.pdf) model (Transformer or [Conformer](https://arxiv.org/pdf/2005.08100.pdf) model) with [Librispeech dataset](http://www.openslr.org/resources/12) and use some functions in kaldi.
To use this example, you need to install Kaldi first.

@ -1,5 +1,5 @@
# Transformer/Conformer ASR with Tiny
This example contains code used to train a Transformer or [Conformer](http://arxiv.org/abs/2008.03802) model Tiny dataset(a part of [[Librispeech dataset](http://www.openslr.org/resources/12)](http://www.openslr.org/resources/33))
This example contains code used to train a [u2](https://arxiv.org/pdf/2012.05481.pdf) model (Transformer or [Conformer](https://arxiv.org/pdf/2005.08100.pdf) model) with the Tiny dataset (a part of the [Librispeech dataset](http://www.openslr.org/resources/12))
## Overview
All the scripts you need are in `run.sh`. There are several stages in `run.sh`, and each stage has its function.
| Stage | Function |

@ -0,0 +1,26 @@
# Test
We train a Chinese-English mixed fastspeech2 model. The training code is still being sorted out; for now, here is how to use the released model.
The sample rate of the synthesized audio is 22050 Hz.
## Download pretrained models
Put pretrained models in a directory named `models`.
- [fastspeech2_csmscljspeech_add-zhen.zip](https://paddlespeech.bj.bcebos.com/t2s/chinse_english_mixed/models/fastspeech2_csmscljspeech_add-zhen.zip)
- [hifigan_ljspeech_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_ljspeech_ckpt_0.2.0.zip)
```bash
mkdir models
cd models
wget https://paddlespeech.bj.bcebos.com/t2s/chinse_english_mixed/models/fastspeech2_csmscljspeech_add-zhen.zip
unzip fastspeech2_csmscljspeech_add-zhen.zip
wget https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_ljspeech_ckpt_0.2.0.zip
unzip hifigan_ljspeech_ckpt_0.2.0.zip
cd ../
```
## Test
You can choose `--spk_id` {0, 1} in `local/synthesize_e2e.sh`.
```bash
bash test.sh
```

@ -0,0 +1,31 @@
#!/bin/bash
model_dir=$1
output=$2
am_name=fastspeech2_csmscljspeech_add-zhen
am_model_dir=${model_dir}/${am_name}/
stage=1
stop_stage=1
# hifigan
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize_e2e.py \
--am=fastspeech2_mix \
--am_config=${am_model_dir}/default.yaml \
--am_ckpt=${am_model_dir}/snapshot_iter_94000.pdz \
--am_stat=${am_model_dir}/speech_stats.npy \
--voc=hifigan_ljspeech \
--voc_config=${model_dir}/hifigan_ljspeech_ckpt_0.2.0/default.yaml \
--voc_ckpt=${model_dir}/hifigan_ljspeech_ckpt_0.2.0/snapshot_iter_2500000.pdz \
--voc_stat=${model_dir}/hifigan_ljspeech_ckpt_0.2.0/feats_stats.npy \
--lang=mix \
--text=${BIN_DIR}/../sentences_mix.txt \
--output_dir=${output}/test_e2e \
--phones_dict=${am_model_dir}/phone_id_map.txt \
--speaker_dict=${am_model_dir}/speaker_id_map.txt \
--spk_id 0
fi

@ -0,0 +1,13 @@
#!/bin/bash
export MAIN_ROOT=`realpath ${PWD}/../../../`
export PATH=${MAIN_ROOT}:${MAIN_ROOT}/utils:${PATH}
export LC_ALL=C
export PYTHONDONTWRITEBYTECODE=1
# Use UTF-8 in Python to avoid UnicodeDecodeError when LC_ALL=C
export PYTHONIOENCODING=UTF-8
export PYTHONPATH=${MAIN_ROOT}:${PYTHONPATH}
MODEL=fastspeech2
export BIN_DIR=${MAIN_ROOT}/paddlespeech/t2s/exps/${MODEL}

@ -0,0 +1,23 @@
#!/bin/bash
set -e
source path.sh
gpus=0,1
stage=3
stop_stage=100
model_dir=models
output_dir=output
# with the following command, you can choose the stage range you want to run
# such as `./run.sh --stage 0 --stop-stage 0`
# this cannot be mixed with `$1`, `$2` ...
source ${MAIN_ROOT}/utils/parse_options.sh || exit 1
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
# synthesize_e2e, vocoder is hifigan by default
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize_e2e.sh ${model_dir} ${output_dir} || exit -1
fi

@ -14,3 +14,5 @@
import _locale
_locale._getdefaultlocale = (lambda *args: ['en_US', 'utf8'])

@ -14,6 +14,9 @@
from . import compliance
from . import datasets
from . import features
from . import text
from . import transform
from . import streamdata
from . import functional
from . import io
from . import metric

@ -0,0 +1,13 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

@ -365,7 +365,7 @@ class ASRExecutor(BaseExecutor):
except Exception as e:
logger.exception(e)
logger.error(
"can not open the audio file, please check the audio file format is 'wav'. \n \
f"can not open the audio file, please check the audio file({audio_file}) format is 'wav'. \n \
you can try to use sox to change the file format.\n \
For example: \n \
sample rate: 16k \n \

@ -108,19 +108,20 @@ class BaseExecutor(ABC):
Dict[str, Union[str, os.PathLike]]: A dict with ids and inputs.
"""
if self._is_job_input(input_):
# .job/.scp/.txt file
ret = self._get_job_contents(input_)
else:
# job from stdin
ret = OrderedDict()
if input_ is None: # Take input from stdin
if not sys.stdin.isatty(
): # Avoid getting stuck when stdin is empty.
for i, line in enumerate(sys.stdin):
line = line.strip()
if len(line.split(' ')) == 1:
if len(line.split()) == 1:
ret[str(i + 1)] = line
elif len(line.split(' ')) == 2:
id_, info = line.split(' ')
elif len(line.split()) == 2:
id_, info = line.split()
ret[id_] = info
else: # No valid input info from one line.
continue
@ -170,7 +171,8 @@ class BaseExecutor(ABC):
bool: return `True` for job input, `False` otherwise.
"""
return input_ and os.path.isfile(input_) and (input_.endswith('.job') or
input_.endswith('.txt'))
input_.endswith('.txt') or
input_.endswith('.scp'))
def _get_job_contents(
self, job_input: os.PathLike) -> Dict[str, Union[str, os.PathLike]]:
@ -189,7 +191,7 @@ class BaseExecutor(ABC):
line = line.strip()
if not line:
continue
k, v = line.split(' ')
k, v = line.split() # space or \t
job_contents[k] = v
return job_contents

@ -18,7 +18,6 @@ from typing import Union
import paddle
from paddle import nn
from paddle.fluid import core
from paddle.nn import functional as F
from paddlespeech.s2t.utils.log import Log
@ -39,46 +38,6 @@ paddle.long = 'int64'
paddle.uint16 = 'uint16'
paddle.cdouble = 'complex128'
def convert_dtype_to_string(tensor_dtype):
"""
Convert the data type in numpy to the data type in Paddle
Args:
tensor_dtype(core.VarDesc.VarType): the data type in numpy.
Returns:
core.VarDesc.VarType: the data type in Paddle.
"""
dtype = tensor_dtype
if dtype == core.VarDesc.VarType.FP32:
return paddle.float32
elif dtype == core.VarDesc.VarType.FP64:
return paddle.float64
elif dtype == core.VarDesc.VarType.FP16:
return paddle.float16
elif dtype == core.VarDesc.VarType.INT32:
return paddle.int32
elif dtype == core.VarDesc.VarType.INT16:
return paddle.int16
elif dtype == core.VarDesc.VarType.INT64:
return paddle.int64
elif dtype == core.VarDesc.VarType.BOOL:
return paddle.bool
elif dtype == core.VarDesc.VarType.BF16:
# since there is still no support for bfloat16 in NumPy,
# uint16 is used for casting bfloat16
return paddle.uint16
elif dtype == core.VarDesc.VarType.UINT8:
return paddle.uint8
elif dtype == core.VarDesc.VarType.INT8:
return paddle.int8
elif dtype == core.VarDesc.VarType.COMPLEX64:
return paddle.complex64
elif dtype == core.VarDesc.VarType.COMPLEX128:
return paddle.complex128
else:
raise ValueError("Not supported tensor dtype %s" % dtype)
if not hasattr(paddle, 'softmax'):
logger.debug("register user softmax to paddle, remove this when fixed!")
setattr(paddle, 'softmax', paddle.nn.functional.softmax)
@ -155,28 +114,6 @@ if not hasattr(paddle.Tensor, 'new_full'):
paddle.Tensor.new_full = new_full
paddle.static.Variable.new_full = new_full
def eq(xs: paddle.Tensor, ys: Union[paddle.Tensor, float]) -> paddle.Tensor:
if convert_dtype_to_string(xs.dtype) == paddle.bool:
xs = xs.astype(paddle.int)
return xs.equal(
paddle.to_tensor(
ys, dtype=convert_dtype_to_string(xs.dtype), place=xs.place))
if not hasattr(paddle.Tensor, 'eq'):
logger.debug(
"override eq of paddle.Tensor if exists or register, remove this when fixed!"
)
paddle.Tensor.eq = eq
paddle.static.Variable.eq = eq
if not hasattr(paddle, 'eq'):
logger.debug(
"override eq of paddle if exists or register, remove this when fixed!")
paddle.eq = eq
def contiguous(xs: paddle.Tensor) -> paddle.Tensor:
return xs
@ -219,13 +156,22 @@ def is_broadcastable(shp1, shp2):
return True
def broadcast_shape(shp1, shp2):
result = []
for a, b in zip(shp1[::-1], shp2[::-1]):
result.append(max(a, b))
return result[::-1]
def masked_fill(xs: paddle.Tensor,
mask: paddle.Tensor,
value: Union[float, int]):
assert is_broadcastable(xs.shape, mask.shape) is True, (xs.shape,
mask.shape)
bshape = paddle.broadcast_shape(xs.shape, mask.shape)
mask = mask.broadcast_to(bshape)
bshape = broadcast_shape(xs.shape, mask.shape)
mask.stop_gradient = True
tmp = paddle.ones(shape=[len(bshape)], dtype='int32')
for index in range(len(bshape)):
tmp[index] = bshape[index]
mask = mask.broadcast_to(tmp)
trues = paddle.ones_like(xs) * value
xs = paddle.where(mask, trues, xs)
return xs

@ -29,6 +29,9 @@ import paddle
from paddle import jit
from paddle import nn
from paddlespeech.audio.utils.tensor_utils import add_sos_eos
from paddlespeech.audio.utils.tensor_utils import pad_sequence
from paddlespeech.audio.utils.tensor_utils import th_accuracy
from paddlespeech.s2t.decoders.scorers.ctc import CTCPrefixScorer
from paddlespeech.s2t.frontend.utility import IGNORE_ID
from paddlespeech.s2t.frontend.utility import load_cmvn
@ -48,9 +51,6 @@ from paddlespeech.s2t.utils import checkpoint
from paddlespeech.s2t.utils import layer_tools
from paddlespeech.s2t.utils.ctc_utils import remove_duplicates_and_blank
from paddlespeech.s2t.utils.log import Log
from paddlespeech.audio.utils.tensor_utils import add_sos_eos
from paddlespeech.audio.utils.tensor_utils import pad_sequence
from paddlespeech.audio.utils.tensor_utils import th_accuracy
from paddlespeech.s2t.utils.utility import log_add
from paddlespeech.s2t.utils.utility import UpdateConfig
@ -318,7 +318,7 @@ class U2BaseModel(ASRInterface, nn.Layer):
dim=1) # (B*N, i+1)
# 2.6 Update end flag
end_flag = paddle.eq(hyps[:, -1], self.eos).view(-1, 1)
end_flag = paddle.equal(hyps[:, -1], self.eos).view(-1, 1)
# 3. Select best of best
scores = scores.view(batch_size, beam_size)
@ -605,29 +605,42 @@ class U2BaseModel(ASRInterface, nn.Layer):
xs: paddle.Tensor,
offset: int,
required_cache_size: int,
subsampling_cache: Optional[paddle.Tensor]=None,
elayers_output_cache: Optional[List[paddle.Tensor]]=None,
conformer_cnn_cache: Optional[List[paddle.Tensor]]=None,
) -> Tuple[paddle.Tensor, paddle.Tensor, List[paddle.Tensor], List[
paddle.Tensor]]:
att_cache: paddle.Tensor=paddle.zeros([0, 0, 0, 0]),
cnn_cache: paddle.Tensor=paddle.zeros([0, 0, 0, 0]),
) -> Tuple[paddle.Tensor, paddle.Tensor, paddle.Tensor]:
""" Export interface for c++ call, give input chunk xs, and return
output from time 0 to current chunk.
Args:
xs (paddle.Tensor): chunk input
subsampling_cache (Optional[paddle.Tensor]): subsampling cache
elayers_output_cache (Optional[List[paddle.Tensor]]):
transformer/conformer encoder layers output cache
conformer_cnn_cache (Optional[List[paddle.Tensor]]): conformer
cnn cache
xs (paddle.Tensor): chunk input, with shape (b=1, time, mel-dim),
where `time == (chunk_size - 1) * subsample_rate + \
subsample.right_context + 1`
offset (int): current offset in encoder output time stamp
required_cache_size (int): cache size required for next chunk
computation
>=0: actual cache size
<0: means all history cache is required
att_cache (paddle.Tensor): cache tensor for KEY & VALUE in
transformer/conformer attention, with shape
(elayers, head, cache_t1, d_k * 2), where
`head * d_k == hidden-dim` and
`cache_t1 == chunk_size * num_decoding_left_chunks`.
`d_k * 2` for att key & value.
cnn_cache (paddle.Tensor): cache tensor for cnn_module in conformer,
(elayers, b=1, hidden-dim, cache_t2), where
`cache_t2 == cnn.lorder - 1`.
Returns:
paddle.Tensor: output, it ranges from time 0 to current chunk.
paddle.Tensor: subsampling cache
List[paddle.Tensor]: attention cache
List[paddle.Tensor]: conformer cnn cache
paddle.Tensor: output of current input xs,
with shape (b=1, chunk_size, hidden-dim).
paddle.Tensor: new attention cache required for next chunk, with
dynamic shape (elayers, head, T(?), d_k * 2)
depending on required_cache_size.
paddle.Tensor: new conformer cnn cache required for next chunk, with
same shape as the original cnn_cache.
"""
return self.encoder.forward_chunk(
xs, offset, required_cache_size, subsampling_cache,
elayers_output_cache, conformer_cnn_cache)
return self.encoder.forward_chunk(xs, offset, required_cache_size,
att_cache, cnn_cache)
# @jit.to_static
def ctc_activation(self, xs: paddle.Tensor) -> paddle.Tensor:

@ -401,29 +401,42 @@ class U2STBaseModel(nn.Layer):
xs: paddle.Tensor,
offset: int,
required_cache_size: int,
subsampling_cache: Optional[paddle.Tensor]=None,
elayers_output_cache: Optional[List[paddle.Tensor]]=None,
conformer_cnn_cache: Optional[List[paddle.Tensor]]=None,
) -> Tuple[paddle.Tensor, paddle.Tensor, List[paddle.Tensor], List[
paddle.Tensor]]:
att_cache: paddle.Tensor = paddle.zeros([0, 0, 0, 0]),
cnn_cache: paddle.Tensor = paddle.zeros([0, 0, 0, 0]),
) -> Tuple[paddle.Tensor, paddle.Tensor, paddle.Tensor]:
""" Export interface for c++ call, give input chunk xs, and return
output from time 0 to current chunk.
Args:
xs (paddle.Tensor): chunk input
subsampling_cache (Optional[paddle.Tensor]): subsampling cache
elayers_output_cache (Optional[List[paddle.Tensor]]):
transformer/conformer encoder layers output cache
conformer_cnn_cache (Optional[List[paddle.Tensor]]): conformer
cnn cache
xs (paddle.Tensor): chunk input, with shape (b=1, time, mel-dim),
where `time == (chunk_size - 1) * subsample_rate + \
subsample.right_context + 1`
offset (int): current offset in encoder output time stamp
required_cache_size (int): cache size required for next chunk
computation
>=0: actual cache size
<0: means all history cache is required
att_cache (paddle.Tensor): cache tensor for KEY & VALUE in
transformer/conformer attention, with shape
(elayers, head, cache_t1, d_k * 2), where
`head * d_k == hidden-dim` and
`cache_t1 == chunk_size * num_decoding_left_chunks`.
`d_k * 2` for att key & value.
cnn_cache (paddle.Tensor): cache tensor for cnn_module in conformer,
(elayers, b=1, hidden-dim, cache_t2), where
`cache_t2 == cnn.lorder - 1`
Returns:
paddle.Tensor: output, it ranges from time 0 to current chunk.
paddle.Tensor: subsampling cache
List[paddle.Tensor]: attention cache
List[paddle.Tensor]: conformer cnn cache
paddle.Tensor: output of current input xs,
with shape (b=1, chunk_size, hidden-dim).
paddle.Tensor: new attention cache required for next chunk, with
dynamic shape (elayers, head, T(?), d_k * 2)
depending on required_cache_size.
paddle.Tensor: new conformer cnn cache required for next chunk, with
same shape as the original cnn_cache.
"""
return self.encoder.forward_chunk(
xs, offset, required_cache_size, subsampling_cache,
elayers_output_cache, conformer_cnn_cache)
xs, offset, required_cache_size, att_cache, cnn_cache)
# @jit.to_static
def ctc_activation(self, xs: paddle.Tensor) -> paddle.Tensor:

@ -13,8 +13,7 @@
# limitations under the License.
import paddle
from paddle import nn
from paddlespeech.s2t.modules.initializer import KaimingUniform
import math
"""
To align the initializer between paddle and torch,
the APIs below set a default initializer with priority higher than the global initializer.
@ -82,10 +81,10 @@ class Linear(nn.Linear):
name=None):
if weight_attr is None:
if global_init_type == "kaiming_uniform":
weight_attr = paddle.ParamAttr(initializer=KaimingUniform())
weight_attr = paddle.ParamAttr(initializer=nn.initializer.KaimingUniform(fan_in=None, negative_slope=math.sqrt(5), nonlinearity='leaky_relu'))
if bias_attr is None:
if global_init_type == "kaiming_uniform":
bias_attr = paddle.ParamAttr(initializer=KaimingUniform())
bias_attr = paddle.ParamAttr(initializer=nn.initializer.KaimingUniform(fan_in=None, negative_slope=math.sqrt(5), nonlinearity='leaky_relu'))
super(Linear, self).__init__(in_features, out_features, weight_attr,
bias_attr, name)
@ -105,10 +104,10 @@ class Conv1D(nn.Conv1D):
data_format='NCL'):
if weight_attr is None:
if global_init_type == "kaiming_uniform":
weight_attr = paddle.ParamAttr(initializer=KaimingUniform())
weight_attr = paddle.ParamAttr(initializer=nn.initializer.KaimingUniform(fan_in=None, negative_slope=math.sqrt(5), nonlinearity='leaky_relu'))
if bias_attr is None:
if global_init_type == "kaiming_uniform":
bias_attr = paddle.ParamAttr(initializer=KaimingUniform())
bias_attr = paddle.ParamAttr(initializer=nn.initializer.KaimingUniform(fan_in=None, negative_slope=math.sqrt(5), nonlinearity='leaky_relu'))
super(Conv1D, self).__init__(
in_channels, out_channels, kernel_size, stride, padding, dilation,
groups, padding_mode, weight_attr, bias_attr, data_format)
@ -129,10 +128,10 @@ class Conv2D(nn.Conv2D):
data_format='NCHW'):
if weight_attr is None:
if global_init_type == "kaiming_uniform":
weight_attr = paddle.ParamAttr(initializer=KaimingUniform())
weight_attr = paddle.ParamAttr(initializer=nn.initializer.KaimingUniform(fan_in=None, negative_slope=math.sqrt(5), nonlinearity='leaky_relu'))
if bias_attr is None:
if global_init_type == "kaiming_uniform":
bias_attr = paddle.ParamAttr(initializer=KaimingUniform())
bias_attr = paddle.ParamAttr(initializer=nn.initializer.KaimingUniform(fan_in=None, negative_slope=math.sqrt(5), nonlinearity='leaky_relu'))
super(Conv2D, self).__init__(
in_channels, out_channels, kernel_size, stride, padding, dilation,
groups, padding_mode, weight_attr, bias_attr, data_format)

@ -84,9 +84,10 @@ class MultiHeadedAttention(nn.Layer):
return q, k, v
def forward_attention(self,
value: paddle.Tensor,
scores: paddle.Tensor,
mask: Optional[paddle.Tensor]) -> paddle.Tensor:
value: paddle.Tensor,
scores: paddle.Tensor,
mask: paddle.Tensor = paddle.ones([0, 0, 0], dtype=paddle.bool),
) -> paddle.Tensor:
"""Compute attention context vector.
Args:
value (paddle.Tensor): Transformed value, size
@ -94,14 +95,23 @@ class MultiHeadedAttention(nn.Layer):
scores (paddle.Tensor): Attention score, size
(#batch, n_head, time1, time2).
mask (paddle.Tensor): Mask, size (#batch, 1, time2) or
(#batch, time1, time2).
(#batch, time1, time2), (0, 0, 0) means fake mask.
Returns:
paddle.Tensor: Transformed value weighted
by the attention score, (#batch, time1, d_model).
paddle.Tensor: Transformed value (#batch, time1, d_model)
weighted by the attention score (#batch, time1, time2).
"""
n_batch = value.shape[0]
if mask is not None:
mask = mask.unsqueeze(1).eq(0) # (batch, 1, *, time2)
# When `if mask.size(2) > 0` be True:
# 1. training.
# 2. onnx(16/4, chunk_size/history_size), feed real cache and real mask for the 1st chunk.
# When will `if mask.size(2) > 0` be False?
# 1. onnx(16/-1, -1/-1, 16/0)
# 2. jit (16/-1, -1/-1, 16/0, 16/4)
if paddle.shape(mask)[2] > 0: # time2 > 0
mask = mask.unsqueeze(1).equal(0) # (batch, 1, *, time2)
# for last chunk, time2 might be larger than scores.size(-1)
mask = mask[:, :, :, :paddle.shape(scores)[-1]]
scores = scores.masked_fill(mask, -float('inf'))
attn = paddle.softmax(
scores, axis=-1).masked_fill(mask,
@ -121,21 +131,66 @@ class MultiHeadedAttention(nn.Layer):
query: paddle.Tensor,
key: paddle.Tensor,
value: paddle.Tensor,
mask: Optional[paddle.Tensor]) -> paddle.Tensor:
mask: paddle.Tensor = paddle.ones([0,0,0], dtype=paddle.bool),
pos_emb: paddle.Tensor = paddle.empty([0]),
cache: paddle.Tensor = paddle.zeros([0,0,0,0])
) -> Tuple[paddle.Tensor, paddle.Tensor]:
"""Compute scaled dot product attention.
Args:
query (torch.Tensor): Query tensor (#batch, time1, size).
key (torch.Tensor): Key tensor (#batch, time2, size).
value (torch.Tensor): Value tensor (#batch, time2, size).
mask (torch.Tensor): Mask tensor (#batch, 1, time2) or
Args:
query (paddle.Tensor): Query tensor (#batch, time1, size).
key (paddle.Tensor): Key tensor (#batch, time2, size).
value (paddle.Tensor): Value tensor (#batch, time2, size).
mask (paddle.Tensor): Mask tensor (#batch, 1, time2) or
(#batch, time1, time2).
1.When applying cross attention between decoder and encoder,
the batch padding mask for input is in (#batch, 1, T) shape.
2.When applying self attention of encoder,
the mask is in (#batch, T, T) shape.
3.When applying self attention of decoder,
the mask is in (#batch, L, L) shape.
4.If the different position in decoder see different block
of the encoder, such as Mocha, the passed in mask could be
in (#batch, L, T) shape. But there is no such case in current
Wenet.
cache (paddle.Tensor): Cache tensor (1, head, cache_t, d_k * 2),
where `cache_t == chunk_size * num_decoding_left_chunks`
and `head * d_k == size`
Returns:
torch.Tensor: Output tensor (#batch, time1, d_model).
paddle.Tensor: Output tensor (#batch, time1, d_model).
paddle.Tensor: Cache tensor (1, head, cache_t + time1, d_k * 2)
where `cache_t == chunk_size * num_decoding_left_chunks`
and `head * d_k == size`
"""
q, k, v = self.forward_qkv(query, key, value)
# when export onnx model, for 1st chunk, we feed
# cache(1, head, 0, d_k * 2) (16/-1, -1/-1, 16/0 mode)
# or cache(1, head, real_cache_t, d_k * 2) (16/4 mode).
# In all modes, `if cache.size(0) > 0` will always be `True`
# and we will always do splitting and
# concatenation (this will simplify onnx export). Note that
# it's OK to concat & split zero-shaped tensors (see code below).
# when export jit model, for 1st chunk, we always feed
# cache(0, 0, 0, 0) since jit supports dynamic if-branch.
# >>> a = torch.ones((1, 2, 0, 4))
# >>> b = torch.ones((1, 2, 3, 4))
# >>> c = torch.cat((a, b), dim=2)
# >>> torch.equal(b, c) # True
# >>> d = torch.split(a, 2, dim=-1)
# >>> torch.equal(d[0], d[1]) # True
if paddle.shape(cache)[0] > 0:
# last dim `d_k * 2` for (key, val)
key_cache, value_cache = paddle.split(cache, 2, axis=-1)
k = paddle.concat([key_cache, k], axis=2)
v = paddle.concat([value_cache, v], axis=2)
# We do cache slicing in encoder.forward_chunk, since it's
# non-trivial to calculate `next_cache_start` here.
new_cache = paddle.concat((k, v), axis=-1)
scores = paddle.matmul(q,
k.transpose([0, 1, 3, 2])) / math.sqrt(self.d_k)
return self.forward_attention(v, scores, mask)
return self.forward_attention(v, scores, mask), new_cache
class RelPositionMultiHeadedAttention(MultiHeadedAttention):
@ -192,23 +247,55 @@ class RelPositionMultiHeadedAttention(MultiHeadedAttention):
query: paddle.Tensor,
key: paddle.Tensor,
value: paddle.Tensor,
pos_emb: paddle.Tensor,
mask: Optional[paddle.Tensor]):
mask: paddle.Tensor = paddle.ones([0,0,0], dtype=paddle.bool),
pos_emb: paddle.Tensor = paddle.empty([0]),
cache: paddle.Tensor = paddle.zeros([0,0,0,0])
) -> Tuple[paddle.Tensor, paddle.Tensor]:
"""Compute 'Scaled Dot Product Attention' with rel. positional encoding.
Args:
query (paddle.Tensor): Query tensor (#batch, time1, size).
key (paddle.Tensor): Key tensor (#batch, time2, size).
value (paddle.Tensor): Value tensor (#batch, time2, size).
pos_emb (paddle.Tensor): Positional embedding tensor
(#batch, time1, size).
mask (paddle.Tensor): Mask tensor (#batch, 1, time2) or
(#batch, time1, time2).
(#batch, time1, time2), (0, 0, 0) means fake mask.
pos_emb (paddle.Tensor): Positional embedding tensor
(#batch, time2, size).
cache (paddle.Tensor): Cache tensor (1, head, cache_t, d_k * 2),
where `cache_t == chunk_size * num_decoding_left_chunks`
and `head * d_k == size`
Returns:
paddle.Tensor: Output tensor (#batch, time1, d_model).
paddle.Tensor: Cache tensor (1, head, cache_t + time1, d_k * 2)
where `cache_t == chunk_size * num_decoding_left_chunks`
and `head * d_k == size`
"""
q, k, v = self.forward_qkv(query, key, value)
q = q.transpose([0, 2, 1, 3]) # (batch, time1, head, d_k)
# when export onnx model, for 1st chunk, we feed
# cache(1, head, 0, d_k * 2) (16/-1, -1/-1, 16/0 mode)
# or cache(1, head, real_cache_t, d_k * 2) (16/4 mode).
# In all modes, `if cache.size(0) > 0` will always be `True`
# and we will always do splitting and
# concatenation (this will simplify onnx export). Note that
# it's OK to concat & split zero-shaped tensors (see code below).
# when export jit model, for 1st chunk, we always feed
# cache(0, 0, 0, 0) since jit supports dynamic if-branch.
# >>> a = torch.ones((1, 2, 0, 4))
# >>> b = torch.ones((1, 2, 3, 4))
# >>> c = torch.cat((a, b), dim=2)
# >>> torch.equal(b, c) # True
# >>> d = torch.split(a, 2, dim=-1)
# >>> torch.equal(d[0], d[1]) # True
if paddle.shape(cache)[0] > 0:
# last dim `d_k * 2` for (key, val)
key_cache, value_cache = paddle.split(cache, 2, axis=-1)
k = paddle.concat([key_cache, k], axis=2)
v = paddle.concat([value_cache, v], axis=2)
# We do cache slicing in encoder.forward_chunk, since it's
# non-trivial to calculate `next_cache_start` here.
new_cache = paddle.concat((k, v), axis=-1)
n_batch_pos = pos_emb.shape[0]
p = self.linear_pos(pos_emb).view(n_batch_pos, -1, self.h, self.d_k)
p = p.transpose([0, 2, 1, 3]) # (batch, head, time1, d_k)
@ -234,4 +321,4 @@ class RelPositionMultiHeadedAttention(MultiHeadedAttention):
scores = (matrix_ac + matrix_bd) / math.sqrt(
self.d_k) # (batch, head, time1, time2)
return self.forward_attention(v, scores, mask)
return self.forward_attention(v, scores, mask), new_cache

@ -108,15 +108,17 @@ class ConvolutionModule(nn.Layer):
def forward(self,
x: paddle.Tensor,
mask_pad: Optional[paddle.Tensor]=None,
cache: Optional[paddle.Tensor]=None
mask_pad: paddle.Tensor= paddle.ones([0,0,0], dtype=paddle.bool),
cache: paddle.Tensor= paddle.zeros([0,0,0]),
) -> Tuple[paddle.Tensor, paddle.Tensor]:
"""Compute convolution module.
Args:
x (paddle.Tensor): Input tensor (#batch, time, channels).
mask_pad (paddle.Tensor): used for batch padding, (#batch, channels, time).
mask_pad (paddle.Tensor): used for batch padding (#batch, 1, time),
(0, 0, 0) means fake mask.
cache (paddle.Tensor): left context cache, it is only
used in causal convolution. (#batch, channels, time')
used in causal convolution (#batch, channels, cache_t),
(0, 0, 0) means fake cache.
Returns:
paddle.Tensor: Output tensor (#batch, time, channels).
paddle.Tensor: Output cache tensor (#batch, channels, time')
@ -125,11 +127,11 @@ class ConvolutionModule(nn.Layer):
x = x.transpose([0, 2, 1]) # [B, C, T]
# mask batch padding
if mask_pad is not None:
if paddle.shape(mask_pad)[2] > 0: # time > 0
x = x.masked_fill(mask_pad, 0.0)
if self.lorder > 0:
if cache is None:
if paddle.shape(cache)[2] == 0: # cache_t == 0
x = nn.functional.pad(
x, [self.lorder, 0], 'constant', 0.0, data_format='NCL')
else:
@ -143,7 +145,7 @@ class ConvolutionModule(nn.Layer):
# It's better to just return None if no cache is required,
# However, for JIT export, here we just fake one tensor instead of
# None.
new_cache = paddle.zeros([1], dtype=x.dtype)
new_cache = paddle.zeros([0, 0, 0], dtype=x.dtype)
# GLU mechanism
x = self.pointwise_conv1(x) # (batch, 2*channel, dim)
@ -159,7 +161,7 @@ class ConvolutionModule(nn.Layer):
x = self.pointwise_conv2(x)
# mask batch padding
if mask_pad is not None:
if paddle.shape(mask_pad)[2] > 0: # time > 0
x = x.masked_fill(mask_pad, 0.0)
x = x.transpose([0, 2, 1]) # [B, T, C]
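
The left-context handling above can be illustrated with a small, self-contained sketch (the shapes and `lorder` value are placeholders, not taken from this diff): on the first chunk the cache is empty and the input is zero-padded on the left, which is equivalent to feeding an all-zero cache of length `lorder`.

```python
import paddle
import paddle.nn.functional as F

lorder = 7                               # kernel_size - 1 for a causal depthwise conv
x = paddle.randn([1, 256, 16])           # (B, C, T) chunk, already transposed to NCL
cache = paddle.zeros([1, 256, 0])        # empty cache on the first chunk

if paddle.shape(cache)[2] == 0:          # cache_t == 0
    x = F.pad(x, [lorder, 0], 'constant', 0.0, data_format='NCL')
else:
    x = paddle.concat([cache, x], axis=2)

# the last `lorder` frames become the cache for the next chunk
new_cache = x[:, :, -lorder:]
```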

@ -121,11 +121,11 @@ class DecoderLayer(nn.Layer):
if self.concat_after:
tgt_concat = paddle.cat(
(tgt_q, self.self_attn(tgt_q, tgt, tgt, tgt_q_mask)), dim=-1)
(tgt_q, self.self_attn(tgt_q, tgt, tgt, tgt_q_mask)[0]), dim=-1)
x = residual + self.concat_linear1(tgt_concat)
else:
x = residual + self.dropout(
self.self_attn(tgt_q, tgt, tgt, tgt_q_mask))
self.self_attn(tgt_q, tgt, tgt, tgt_q_mask)[0])
if not self.normalize_before:
x = self.norm1(x)
@ -134,11 +134,11 @@ class DecoderLayer(nn.Layer):
x = self.norm2(x)
if self.concat_after:
x_concat = paddle.cat(
(x, self.src_attn(x, memory, memory, memory_mask)), dim=-1)
(x, self.src_attn(x, memory, memory, memory_mask)[0]), dim=-1)
x = residual + self.concat_linear2(x_concat)
else:
x = residual + self.dropout(
self.src_attn(x, memory, memory, memory_mask))
self.src_attn(x, memory, memory, memory_mask)[0])
if not self.normalize_before:
x = self.norm2(x)

@ -131,7 +131,7 @@ class PositionalEncoding(nn.Layer, PositionalEncodingInterface):
offset (int): start offset
size (int): required size of position encoding
Returns:
paddle.Tensor: Corresponding position encoding
paddle.Tensor: Corresponding position encoding, shape (1, T, D).
"""
assert offset + size < self.max_len
return self.dropout(self.pe[:, offset:offset + size])
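
For context, a hedged sketch of how the streaming encoder below re-queries this table so that the cached keys and the new chunk share one contiguous encoding (`embed`, `offset`, `cache_t1` and `chunk_size` are placeholders):

```python
# pos_emb covers both the cached frames and the new chunk: (1, cache_t1 + chunk_size, D)
pos_emb = embed.position_encoding(offset=offset - cache_t1,
                                  size=cache_t1 + chunk_size)
```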

@ -177,7 +177,7 @@ class BaseEncoder(nn.Layer):
decoding_chunk_size, self.static_chunk_size,
num_decoding_left_chunks)
for layer in self.encoders:
xs, chunk_masks, _ = layer(xs, chunk_masks, pos_emb, mask_pad)
xs, chunk_masks, _, _ = layer(xs, chunk_masks, pos_emb, mask_pad)
if self.normalize_before:
xs = self.after_norm(xs)
# Here we assume the mask is not changed in encoder layers, so just
@ -190,30 +190,31 @@ class BaseEncoder(nn.Layer):
xs: paddle.Tensor,
offset: int,
required_cache_size: int,
subsampling_cache: Optional[paddle.Tensor]=None,
elayers_output_cache: Optional[List[paddle.Tensor]]=None,
conformer_cnn_cache: Optional[List[paddle.Tensor]]=None,
) -> Tuple[paddle.Tensor, paddle.Tensor, List[paddle.Tensor], List[
paddle.Tensor]]:
att_cache: paddle.Tensor = paddle.zeros([0,0,0,0]),
cnn_cache: paddle.Tensor = paddle.zeros([0,0,0,0]),
att_mask: paddle.Tensor = paddle.ones([0,0,0], dtype=paddle.bool),
) -> Tuple[paddle.Tensor, paddle.Tensor, paddle.Tensor]:
""" Forward just one chunk
Args:
xs (paddle.Tensor): chunk input, [B=1, T, D]
xs (paddle.Tensor): chunk audio feat input, [B=1, T, D], where
`T==(chunk_size-1)*subsampling_rate + subsample.right_context + 1`
offset (int): current offset in encoder output time stamp
required_cache_size (int): cache size required for next chunk
computation
>=0: actual cache size
<0: means all history cache is required
subsampling_cache (Optional[paddle.Tensor]): subsampling cache
elayers_output_cache (Optional[List[paddle.Tensor]]):
transformer/conformer encoder layers output cache
conformer_cnn_cache (Optional[List[paddle.Tensor]]): conformer
cnn cache
att_cache(paddle.Tensor): cache tensor for key & val in
transformer/conformer attention. Shape is
(elayers, head, cache_t1, d_k * 2), where `head * d_k == hidden-dim`
and `cache_t1 == chunk_size * num_decoding_left_chunks`.
cnn_cache (paddle.Tensor): cache tensor for cnn_module in conformer,
(elayers, B=1, hidden-dim, cache_t2), where `cache_t2 == cnn.lorder - 1`
Returns:
paddle.Tensor: output of current input xs
paddle.Tensor: subsampling cache required for next chunk computation
List[paddle.Tensor]: encoder layers output cache required for next
chunk computation
List[paddle.Tensor]: conformer cnn cache
paddle.Tensor: output of current input xs, (B=1, chunk_size, hidden-dim)
paddle.Tensor: new attention cache required for next chunk, dynamic shape
(elayers, head, T, d_k*2) depending on required_cache_size
paddle.Tensor: new conformer cnn cache required for next chunk, with
same shape as the original cnn_cache
"""
assert xs.shape[0] == 1 # batch size must be one
# tmp_masks is just for interface compatibility
@ -225,50 +226,50 @@ class BaseEncoder(nn.Layer):
if self.global_cmvn is not None:
xs = self.global_cmvn(xs)
xs, pos_emb, _ = self.embed(
xs, tmp_masks, offset=offset) #xs=(B, T, D), pos_emb=(B=1, T, D)
# before embed, xs=(B, T, D1), pos_emb=(B=1, T, D)
xs, pos_emb, _ = self.embed(xs, tmp_masks, offset=offset)
# after embed, xs=(B=1, chunk_size, hidden-dim)
if subsampling_cache is not None:
cache_size = subsampling_cache.shape[1] #T
xs = paddle.cat((subsampling_cache, xs), dim=1)
else:
cache_size = 0
elayers = paddle.shape(att_cache)[0]
cache_t1 = paddle.shape(att_cache)[2]
chunk_size = paddle.shape(xs)[1]
attention_key_size = cache_t1 + chunk_size
# only used when using `RelPositionMultiHeadedAttention`
pos_emb = self.embed.position_encoding(
offset=offset - cache_size, size=xs.shape[1])
offset=offset - cache_t1, size=attention_key_size)
if required_cache_size < 0:
next_cache_start = 0
elif required_cache_size == 0:
next_cache_start = xs.shape[1]
next_cache_start = attention_key_size
else:
next_cache_start = xs.shape[1] - required_cache_size
r_subsampling_cache = xs[:, next_cache_start:, :]
# Real mask for transformer/conformer layers
masks = paddle.ones([1, xs.shape[1]], dtype=paddle.bool)
masks = masks.unsqueeze(1) #[B=1, L'=1, T]
r_elayers_output_cache = []
r_conformer_cnn_cache = []
next_cache_start = max(attention_key_size - required_cache_size, 0)
r_att_cache = []
r_cnn_cache = []
for i, layer in enumerate(self.encoders):
attn_cache = None if elayers_output_cache is None else elayers_output_cache[
i]
cnn_cache = None if conformer_cnn_cache is None else conformer_cnn_cache[
i]
xs, _, new_cnn_cache = layer(
xs,
masks,
pos_emb,
output_cache=attn_cache,
cnn_cache=cnn_cache)
r_elayers_output_cache.append(xs[:, next_cache_start:, :])
r_conformer_cnn_cache.append(new_cnn_cache)
# att_cache[i:i+1] = (1, head, cache_t1, d_k*2)
# cnn_cache[i:i+1] = (1, B=1, hidden-dim, cache_t2)
xs, _, new_att_cache, new_cnn_cache = layer(
xs, att_mask, pos_emb,
att_cache=att_cache[i:i+1] if elayers > 0 else att_cache,
cnn_cache=cnn_cache[i:i+1] if paddle.shape(cnn_cache)[0] > 0 else cnn_cache,
)
# new_att_cache = (1, head, attention_key_size, d_k*2)
# new_cnn_cache = (B=1, hidden-dim, cache_t2)
r_att_cache.append(new_att_cache[:,:, next_cache_start:, :])
r_cnn_cache.append(new_cnn_cache.unsqueeze(0)) # add elayer dim
if self.normalize_before:
xs = self.after_norm(xs)
return (xs[:, cache_size:, :], r_subsampling_cache,
r_elayers_output_cache, r_conformer_cnn_cache)
# r_att_cache (elayers, head, T, d_k*2)
# r_cnn_cache (elayers, B=1, hidden-dim, cache_t2)
r_att_cache = paddle.concat(r_att_cache, axis=0)
r_cnn_cache = paddle.concat(r_cnn_cache, axis=0)
return xs, r_att_cache, r_cnn_cache
def forward_chunk_by_chunk(
self,
@ -313,25 +314,24 @@ class BaseEncoder(nn.Layer):
num_frames = xs.shape[1]
required_cache_size = decoding_chunk_size * num_decoding_left_chunks
subsampling_cache: Optional[paddle.Tensor] = None
elayers_output_cache: Optional[List[paddle.Tensor]] = None
conformer_cnn_cache: Optional[List[paddle.Tensor]] = None
att_cache: paddle.Tensor = paddle.zeros([0,0,0,0])
cnn_cache: paddle.Tensor = paddle.zeros([0,0,0,0])
outputs = []
offset = 0
# Feed forward overlap input step by step
for cur in range(0, num_frames - context + 1, stride):
end = min(cur + decoding_window, num_frames)
chunk_xs = xs[:, cur:end, :]
(y, subsampling_cache, elayers_output_cache,
conformer_cnn_cache) = self.forward_chunk(
chunk_xs, offset, required_cache_size, subsampling_cache,
elayers_output_cache, conformer_cnn_cache)
(y, att_cache, cnn_cache) = self.forward_chunk(
chunk_xs, offset, required_cache_size, att_cache, cnn_cache)
outputs.append(y)
offset += y.shape[1]
ys = paddle.cat(outputs, 1)
# fake mask, just for jit script and compatibility with `forward` api
masks = paddle.ones([1, ys.shape[1]], dtype=paddle.bool)
masks = masks.unsqueeze(1)
masks = paddle.ones([1, 1, ys.shape[1]], dtype=paddle.bool)
return ys, masks
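
A minimal sketch of driving the new cache interface from an external streaming caller (the `encoder` object, the `feature_chunks` iterator and the cache sizes are placeholders; this mirrors `forward_chunk_by_chunk` above and the connection-handler change further below):

```python
import paddle

att_cache = paddle.zeros([0, 0, 0, 0])   # grows to (elayers, head, cache_t1, d_k * 2)
cnn_cache = paddle.zeros([0, 0, 0, 0])   # (elayers, B=1, hidden-dim, cache_t2)
offset = 0
required_cache_size = 16 * 4             # decoding_chunk_size * num_decoding_left_chunks

outputs = []
for chunk_xs in feature_chunks:          # each chunk: (B=1, T, D)
    y, att_cache, cnn_cache = encoder.forward_chunk(
        chunk_xs, offset, required_cache_size, att_cache, cnn_cache)
    outputs.append(y)
    offset += y.shape[1]
ys = paddle.concat(outputs, axis=1)
```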

@ -75,49 +75,43 @@ class TransformerEncoderLayer(nn.Layer):
self,
x: paddle.Tensor,
mask: paddle.Tensor,
pos_emb: Optional[paddle.Tensor]=None,
mask_pad: Optional[paddle.Tensor]=None,
output_cache: Optional[paddle.Tensor]=None,
cnn_cache: Optional[paddle.Tensor]=None,
) -> Tuple[paddle.Tensor, paddle.Tensor, paddle.Tensor]:
pos_emb: paddle.Tensor,
mask_pad: paddle.Tensor=paddle.ones([0, 0, 0], dtype=paddle.bool),
att_cache: paddle.Tensor=paddle.zeros([0, 0, 0, 0]),
cnn_cache: paddle.Tensor=paddle.zeros([0, 0, 0, 0]),
) -> Tuple[paddle.Tensor, paddle.Tensor, paddle.Tensor, paddle.Tensor]:
"""Compute encoded features.
Args:
x (paddle.Tensor): Input tensor (#batch, time, size).
mask (paddle.Tensor): Mask tensor for the input (#batch, time).
x (paddle.Tensor): (#batch, time, size)
mask (paddle.Tensor): Mask tensor for the input (#batch, time, time),
(0, 0, 0) means fake mask.
pos_emb (paddle.Tensor): just for interface compatibility
to ConformerEncoderLayer
mask_pad (paddle.Tensor): not used here, it's for interface
compatibility to ConformerEncoderLayer
output_cache (paddle.Tensor): Cache tensor of the output
(#batch, time2, size), time2 < time in x.
cnn_cache (paddle.Tensor): not used here, it's for interface
compatibility to ConformerEncoderLayer
mask_pad (paddle.Tensor): not used in the transformer layer,
just kept for a unified api with the conformer layer.
att_cache (paddle.Tensor): Cache tensor of the KEY & VALUE
(#batch=1, head, cache_t1, d_k * 2), head * d_k == size.
cnn_cache (paddle.Tensor): Convolution cache in conformer layer
(#batch=1, size, cache_t2), not used here, it's for interface
compatibility to ConformerEncoderLayer.
Returns:
paddle.Tensor: Output tensor (#batch, time, size).
paddle.Tensor: Mask tensor (#batch, time).
paddle.Tensor: Fake cnn cache tensor for api compatibility with Conformer (#batch, channels, time').
paddle.Tensor: Mask tensor (#batch, time, time).
paddle.Tensor: att_cache tensor,
(#batch=1, head, cache_t1 + time, d_k * 2).
paddle.Tensor: cnn_cache tensor (#batch=1, size, cache_t2).
"""
residual = x
if self.normalize_before:
x = self.norm1(x)
if output_cache is None:
x_q = x
else:
assert output_cache.shape[0] == x.shape[0]
assert output_cache.shape[1] < x.shape[1]
assert output_cache.shape[2] == self.size
chunk = x.shape[1] - output_cache.shape[1]
x_q = x[:, -chunk:, :]
residual = residual[:, -chunk:, :]
mask = mask[:, -chunk:, :]
x_att, new_att_cache = self.self_attn(x, x, x, mask, cache=att_cache)
if self.concat_after:
x_concat = paddle.concat(
(x, self.self_attn(x_q, x, x, mask)), axis=-1)
x_concat = paddle.concat((x, x_att), axis=-1)
x = residual + self.concat_linear(x_concat)
else:
x = residual + self.dropout(self.self_attn(x_q, x, x, mask))
x = residual + self.dropout(x_att)
if not self.normalize_before:
x = self.norm1(x)
@ -128,11 +122,8 @@ class TransformerEncoderLayer(nn.Layer):
if not self.normalize_before:
x = self.norm2(x)
if output_cache is not None:
x = paddle.concat([output_cache, x], axis=1)
fake_cnn_cache = paddle.zeros([1], dtype=x.dtype)
return x, mask, fake_cnn_cache
fake_cnn_cache = paddle.zeros([0, 0, 0], dtype=x.dtype)
return x, mask, new_att_cache, fake_cnn_cache
class ConformerEncoderLayer(nn.Layer):
@ -192,32 +183,44 @@ class ConformerEncoderLayer(nn.Layer):
self.size = size
self.normalize_before = normalize_before
self.concat_after = concat_after
self.concat_linear = Linear(size + size, size)
if self.concat_after:
self.concat_linear = Linear(size + size, size)
else:
self.concat_linear = nn.Identity()
def forward(
self,
x: paddle.Tensor,
mask: paddle.Tensor,
pos_emb: paddle.Tensor,
mask_pad: Optional[paddle.Tensor]=None,
output_cache: Optional[paddle.Tensor]=None,
cnn_cache: Optional[paddle.Tensor]=None,
) -> Tuple[paddle.Tensor, paddle.Tensor, paddle.Tensor]:
mask_pad: paddle.Tensor=paddle.ones([0, 0, 0], dtype=paddle.bool),
att_cache: paddle.Tensor=paddle.zeros([0, 0, 0, 0]),
cnn_cache: paddle.Tensor=paddle.zeros([0, 0, 0, 0]),
) -> Tuple[paddle.Tensor, paddle.Tensor, paddle.Tensor, paddle.Tensor]:
"""Compute encoded features.
Args:
x (paddle.Tensor): (#batch, time, size)
mask (paddle.Tensor): Mask tensor for the input (#batch, timetime).
pos_emb (paddle.Tensor): positional encoding, must not be None
for ConformerEncoderLayer.
mask_pad (paddle.Tensor): batch padding mask used for conv module, (B, 1, T).
output_cache (paddle.Tensor): Cache tensor of the encoder output
(#batch, time2, size), time2 < time in x.
x (paddle.Tensor): Input tensor (#batch, time, size).
mask (paddle.Tensor): Mask tensor for the input (#batch, time, time).
(0, 0, 0) means fake mask.
pos_emb (paddle.Tensor): positional encoding, must not be None
for ConformerEncoderLayer
mask_pad (paddle.Tensor): batch padding mask used for conv module.
(#batch, 1, time), (0, 0, 0) means fake mask.
att_cache (paddle.Tensor): Cache tensor of the KEY & VALUE
(#batch=1, head, cache_t1, d_k * 2), head * d_k == size.
cnn_cache (paddle.Tensor): Convolution cache in conformer layer
(1, #batch=1, size, cache_t2). First dim will not be used, just
for dy2st.
Returns:
paddle.Tensor: Output tensor (#batch, time, size).
paddle.Tensor: Mask tensor (#batch, time).
paddle.Tensor: New cnn cache tensor (#batch, channels, time').
paddle.Tensor: Output tensor (#batch, time, size).
paddle.Tensor: Mask tensor (#batch, time, time).
paddle.Tensor: att_cache tensor,
(#batch=1, head, cache_t1 + time, d_k * 2).
paddle.Tensor: cnn_cache tensor (#batch, size, cache_t2).
"""
# (1, #batch=1, size, cache_t2) -> (#batch=1, size, cache_t2)
cnn_cache = paddle.squeeze(cnn_cache, axis=0)
# whether to use macaron style FFN
if self.feed_forward_macaron is not None:
residual = x
@ -233,18 +236,8 @@ class ConformerEncoderLayer(nn.Layer):
if self.normalize_before:
x = self.norm_mha(x)
if output_cache is None:
x_q = x
else:
assert output_cache.shape[0] == x.shape[0]
assert output_cache.shape[1] < x.shape[1]
assert output_cache.shape[2] == self.size
chunk = x.shape[1] - output_cache.shape[1]
x_q = x[:, -chunk:, :]
residual = residual[:, -chunk:, :]
mask = mask[:, -chunk:, :]
x_att = self.self_attn(x_q, x, x, pos_emb, mask)
x_att, new_att_cache = self.self_attn(
x, x, x, mask, pos_emb, cache=att_cache)
if self.concat_after:
x_concat = paddle.concat((x, x_att), axis=-1)
@ -257,7 +250,7 @@ class ConformerEncoderLayer(nn.Layer):
# convolution module
# Fake new cnn cache here, and then change it in conv_module
new_cnn_cache = paddle.zeros([1], dtype=x.dtype)
new_cnn_cache = paddle.zeros([0, 0, 0], dtype=x.dtype)
if self.conv_module is not None:
residual = x
if self.normalize_before:
@ -282,7 +275,4 @@ class ConformerEncoderLayer(nn.Layer):
if self.conv_module is not None:
x = self.norm_final(x)
if output_cache is not None:
x = paddle.concat([output_cache, x], axis=1)
return x, mask, new_cnn_cache
return x, mask, new_att_cache, new_cnn_cache

@ -12,142 +12,6 @@
# See the License for the specific language governing permissions and
# limitations under the License.
import numpy as np
from paddle.fluid import framework
from paddle.fluid import unique_name
from paddle.fluid.core import VarDesc
from paddle.fluid.initializer import MSRAInitializer
__all__ = ['KaimingUniform']
class KaimingUniform(MSRAInitializer):
r"""Implements the Kaiming Uniform initializer
This class implements the weight initialization from the paper
`Delving Deep into Rectifiers: Surpassing Human-Level Performance on
ImageNet Classification <https://arxiv.org/abs/1502.01852>`_
by Kaiming He, Xiangyu Zhang, Shaoqing Ren and Jian Sun. This is a
robust initialization method that particularly considers the rectifier
nonlinearities.
In case of Uniform distribution, the range is [-x, x], where
.. math::
x = \sqrt{\frac{1.0}{fan\_in}}
In case of Normal distribution, the mean is 0 and the standard deviation
is
.. math::
\sqrt{\\frac{2.0}{fan\_in}}
Args:
fan_in (float32|None): fan_in for Kaiming uniform Initializer. If None, it is\
inferred from the variable. default is None.
Note:
It is recommended to set fan_in to None for most cases.
Examples:
.. code-block:: python
import paddle
import paddle.nn as nn
linear = nn.Linear(2,
4,
weight_attr=nn.initializer.KaimingUniform())
data = paddle.rand([30, 10, 2], dtype='float32')
res = linear(data)
"""
def __init__(self, fan_in=None):
super(KaimingUniform, self).__init__(
uniform=True, fan_in=fan_in, seed=0)
def __call__(self, var, block=None):
"""Initialize the input tensor with MSRA initialization.
Args:
var(Tensor): Tensor that needs to be initialized.
block(Block, optional): The block in which initialization ops
should be added. Used in static graph only, default None.
Returns:
The initialization op
"""
block = self._check_block(block)
assert isinstance(var, framework.Variable)
assert isinstance(block, framework.Block)
f_in, f_out = self._compute_fans(var)
# If fan_in is passed, use it
fan_in = f_in if self._fan_in is None else self._fan_in
if self._seed == 0:
self._seed = block.program.random_seed
# to be compatible of fp16 initalizers
if var.dtype == VarDesc.VarType.FP16 or (
var.dtype == VarDesc.VarType.BF16 and not self._uniform):
out_dtype = VarDesc.VarType.FP32
out_var = block.create_var(
name=unique_name.generate(
".".join(['masra_init', var.name, 'tmp'])),
shape=var.shape,
dtype=out_dtype,
type=VarDesc.VarType.LOD_TENSOR,
persistable=False)
else:
out_dtype = var.dtype
out_var = var
if self._uniform:
limit = np.sqrt(1.0 / float(fan_in))
op = block.append_op(
type="uniform_random",
inputs={},
outputs={"Out": out_var},
attrs={
"shape": out_var.shape,
"dtype": int(out_dtype),
"min": -limit,
"max": limit,
"seed": self._seed
},
stop_gradient=True)
else:
std = np.sqrt(2.0 / float(fan_in))
op = block.append_op(
type="gaussian_random",
outputs={"Out": out_var},
attrs={
"shape": out_var.shape,
"dtype": int(out_dtype),
"mean": 0.0,
"std": std,
"seed": self._seed
},
stop_gradient=True)
if var.dtype == VarDesc.VarType.FP16 or (
var.dtype == VarDesc.VarType.BF16 and not self._uniform):
block.append_op(
type="cast",
inputs={"X": out_var},
outputs={"Out": var},
attrs={"in_dtype": out_var.dtype,
"out_dtype": var.dtype})
if not framework.in_dygraph_mode():
var.op = op
return op
class DefaultInitializerContext(object):
"""

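Since the local `KaimingUniform` wrapper is removed here, the framework initializer can be used in its place. The usage below is the same as in the deleted docstring (a sketch, assuming `paddle.nn.initializer.KaimingUniform` is available in the installed PaddlePaddle version):

```python
import paddle
import paddle.nn as nn

# use the built-in initializer directly instead of the removed local wrapper
linear = nn.Linear(2, 4, weight_attr=nn.initializer.KaimingUniform())
data = paddle.rand([30, 10, 2], dtype='float32')
res = linear(data)
```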
@ -718,6 +718,7 @@ class VectorClientExecutor(BaseExecutor):
logger.info(f"the input audio: {input}")
handler = VectorHttpHandler(server_ip=server_ip, port=port)
res = handler.run(input, audio_format, sample_rate)
logger.info(f"The spk embedding is: {res}")
return res
elif task == "score":
from paddlespeech.server.utils.audio_handler import VectorScoreHttpHandler

@ -39,10 +39,10 @@ class OnlineCTCEndpoingOpt:
# rule1 times out after 5 seconds of silence, even if we decoded nothing.
rule1: OnlineCTCEndpointRule = OnlineCTCEndpointRule(False, 5000, 0)
# rule4 times out after 1.0 seconds of silence after decoding something,
# rule2 times out after 1.0 seconds of silence after decoding something,
# even if we did not reach a final-state at all.
rule2: OnlineCTCEndpointRule = OnlineCTCEndpointRule(True, 1000, 0)
# rule5 times out after the utterance is 20 seconds long, regardless of
# rule3 times out after the utterance is 20 seconds long, regardless of
# anything else.
rule3: OnlineCTCEndpointRule = OnlineCTCEndpointRule(False, 0, 20000)
@ -102,7 +102,8 @@ class OnlineCTCEndpoint:
assert self.num_frames_decoded >= self.trailing_silence_frames
assert self.frame_shift_in_ms > 0
decoding_something = (self.num_frames_decoded > self.trailing_silence_frames) and decoding_something
utterance_length = self.num_frames_decoded * self.frame_shift_in_ms
trailing_silence = self.trailing_silence_frames * self.frame_shift_in_ms

@ -130,9 +130,9 @@ class PaddleASRConnectionHanddler:
## conformer
# cache for conformer online
self.subsampling_cache = None
self.elayers_output_cache = None
self.conformer_cnn_cache = None
self.att_cache = paddle.zeros([0,0,0,0])
self.cnn_cache = paddle.zeros([0,0,0,0])
self.encoder_out = None
# conformer decoding state
self.offset = 0 # global offset in decoding frame unit
@ -474,11 +474,9 @@ class PaddleASRConnectionHanddler:
# cur chunk
chunk_xs = self.cached_feat[:, cur:end, :]
# forward chunk
(y, self.subsampling_cache, self.elayers_output_cache,
self.conformer_cnn_cache) = self.model.encoder.forward_chunk(
(y, self.att_cache, self.cnn_cache) = self.model.encoder.forward_chunk(
chunk_xs, self.offset, required_cache_size,
self.subsampling_cache, self.elayers_output_cache,
self.conformer_cnn_cache)
self.att_cache, self.cnn_cache)
outputs.append(y)
# update the global offset, in decoding frame unit

@ -60,7 +60,10 @@ def warm_up(engine_and_type: str, warm_up_time: int=3) -> bool:
else:
st = time.time()
connection_handler.infer(text=sentence)
connection_handler.infer(
text=sentence,
lang=tts_engine.lang,
am=tts_engine.config.am)
et = time.time()
logger.debug(
f"The response time of the {i} warm up: {et - st} s")

@ -0,0 +1,8 @@
001 你好,欢迎使用 Paddle Speech 中英文混合 T T S 功能,开始你的合成之旅吧!
002 我们的声学模型使用了 Fast Speech Two, 声码器使用了 Parallel Wave GAN and Hifi GAN.
003 Paddle N L P 发布 ERNIE Tiny 全系列中文预训练小模型,快速提升预训练模型部署效率,通用信息抽取技术 U I E Tiny 系列模型全新升级,支持速度更快效果更好的 U I E 小模型。
004 Paddle Speech 发布 P P A S R 流式语音识别系统、P P T T S 流式语音合成系统、P P V P R 全链路声纹识别系统。
005 Paddle Bo Bo: 使用 Paddle Speech 的语音合成模块生成虚拟人的声音。
006 热烈欢迎您在 Discussions 中提交问题,并在 Issues 中指出发现的 bug。此外我们非常希望您参与到 Paddle Speech 的开发中!
007 我喜欢 eat apple, 你喜欢 drink milk。
008 我们要去云南 team building, 非常非常 happy.

@ -29,6 +29,7 @@ from yacs.config import CfgNode
from paddlespeech.t2s.datasets.data_table import DataTable
from paddlespeech.t2s.frontend import English
from paddlespeech.t2s.frontend.mix_frontend import MixFrontend
from paddlespeech.t2s.frontend.zh_frontend import Frontend
from paddlespeech.t2s.modules.normalizer import ZScore
from paddlespeech.utils.dynamic_import import dynamic_import
@ -98,6 +99,8 @@ def get_sentences(text_file: Optional[os.PathLike], lang: str='zh'):
sentence = "".join(items[1:])
elif lang == 'en':
sentence = " ".join(items[1:])
elif lang == 'mix':
sentence = " ".join(items[1:])
sentences.append((utt_id, sentence))
return sentences
@ -111,7 +114,8 @@ def get_test_dataset(test_metadata: List[Dict[str, Any]],
am_dataset = am[am.rindex('_') + 1:]
if am_name == 'fastspeech2':
fields = ["utt_id", "text"]
if am_dataset in {"aishell3", "vctk"} and speaker_dict is not None:
if am_dataset in {"aishell3", "vctk",
"mix"} and speaker_dict is not None:
print("multiple speaker fastspeech2!")
fields += ["spk_id"]
elif voice_cloning:
@ -140,6 +144,10 @@ def get_frontend(lang: str='zh',
phone_vocab_path=phones_dict, tone_vocab_path=tones_dict)
elif lang == 'en':
frontend = English(phone_vocab_path=phones_dict)
elif lang == 'mix':
frontend = MixFrontend(
phone_vocab_path=phones_dict, tone_vocab_path=tones_dict)
else:
print("wrong lang!")
print("frontend done!")
@ -341,8 +349,12 @@ def get_am_output(
input_ids = frontend.get_input_ids(
input, merge_sentences=merge_sentences)
phone_ids = input_ids["phone_ids"]
elif lang == 'mix':
input_ids = frontend.get_input_ids(
input, merge_sentences=merge_sentences)
phone_ids = input_ids["phone_ids"]
else:
print("lang should in {'zh', 'en'}!")
print("lang should in {'zh', 'en', 'mix'}!")
if get_tone_ids:
tone_ids = input_ids["tone_ids"]

@ -113,8 +113,12 @@ def evaluate(args):
input_ids = frontend.get_input_ids(
sentence, merge_sentences=merge_sentences)
phone_ids = input_ids["phone_ids"]
elif args.lang == 'mix':
input_ids = frontend.get_input_ids(
sentence, merge_sentences=merge_sentences)
phone_ids = input_ids["phone_ids"]
else:
print("lang should in {'zh', 'en'}!")
print("lang should in {'zh', 'en', 'mix'}!")
with paddle.no_grad():
flags = 0
for i in range(len(phone_ids)):
@ -122,7 +126,7 @@ def evaluate(args):
# acoustic model
if am_name == 'fastspeech2':
# multi speaker
if am_dataset in {"aishell3", "vctk"}:
if am_dataset in {"aishell3", "vctk", "mix"}:
spk_id = paddle.to_tensor(args.spk_id)
mel = am_inference(part_phone_ids, spk_id)
else:
@ -170,7 +174,7 @@ def parse_args():
choices=[
'speedyspeech_csmsc', 'speedyspeech_aishell3', 'fastspeech2_csmsc',
'fastspeech2_ljspeech', 'fastspeech2_aishell3', 'fastspeech2_vctk',
'tacotron2_csmsc', 'tacotron2_ljspeech'
'tacotron2_csmsc', 'tacotron2_ljspeech', 'fastspeech2_mix'
],
help='Choose acoustic model type of tts task.')
parser.add_argument(
@ -231,7 +235,7 @@ def parse_args():
'--lang',
type=str,
default='zh',
help='Choose model language. zh or en')
help='Choose model language. zh or en or mix')
parser.add_argument(
"--inference_dir",

@ -0,0 +1,179 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import re
from typing import Dict
from typing import List
import paddle
from paddlespeech.t2s.frontend import English
from paddlespeech.t2s.frontend.zh_frontend import Frontend
class MixFrontend():
def __init__(self,
g2p_model="pypinyin",
phone_vocab_path=None,
tone_vocab_path=None):
self.zh_frontend = Frontend(
phone_vocab_path=phone_vocab_path, tone_vocab_path=tone_vocab_path)
self.en_frontend = English(phone_vocab_path=phone_vocab_path)
self.SENTENCE_SPLITOR = re.compile(r'([:、,;。?!,;?!][”’]?)')
self.sp_id = self.zh_frontend.vocab_phones["sp"]
self.sp_id_tensor = paddle.to_tensor([self.sp_id])
def is_chinese(self, char):
if char >= '\u4e00' and char <= '\u9fa5':
return True
else:
return False
def is_alphabet(self, char):
if (char >= '\u0041' and char <= '\u005a') or (char >= '\u0061' and
char <= '\u007a'):
return True
else:
return False
def is_number(self, char):
if char >= '\u0030' and char <= '\u0039':
return True
else:
return False
def is_other(self, char):
if not (self.is_chinese(char) or self.is_number(char) or
self.is_alphabet(char)):
return True
else:
return False
def _split(self, text: str) -> List[str]:
text = re.sub(r'[《》【】<=>{}()#&@“”^_|…\\]', '', text)
text = self.SENTENCE_SPLITOR.sub(r'\1\n', text)
text = text.strip()
sentences = [sentence.strip() for sentence in re.split(r'\n+', text)]
return sentences
def _distinguish(self, text: str) -> List[str]:
# sentence --> [ch_part, en_part, ch_part, ...]
segments = []
types = []
flag = 0
temp_seg = ""
temp_lang = ""
# Determine the type of each character. type: blank, chinese, alphabet, number, unk.
for ch in text:
if self.is_chinese(ch):
types.append("zh")
elif self.is_alphabet(ch):
types.append("en")
elif ch == " ":
types.append("blank")
elif self.is_number(ch):
types.append("num")
else:
types.append("unk")
assert len(types) == len(text)
for i in range(len(types)):
# find the first char of the seg
if flag == 0:
if types[i] != "unk" and types[i] != "blank":
temp_seg += text[i]
temp_lang = types[i]
flag = 1
else:
if types[i] == temp_lang or types[i] == "num":
temp_seg += text[i]
elif temp_lang == "num" and types[i] != "unk":
temp_seg += text[i]
if types[i] == "zh" or types[i] == "en":
temp_lang = types[i]
elif temp_lang == "en" and types[i] == "blank":
temp_seg += text[i]
elif types[i] == "unk":
pass
else:
segments.append((temp_seg, temp_lang))
if types[i] != "unk" and types[i] != "blank":
temp_seg = text[i]
temp_lang = types[i]
flag = 1
else:
flag = 0
temp_seg = ""
temp_lang = ""
segments.append((temp_seg, temp_lang))
return segments
def get_input_ids(self,
sentence: str,
merge_sentences: bool=True,
get_tone_ids: bool=False,
add_sp: bool=True) -> Dict[str, List[paddle.Tensor]]:
sentences = self._split(sentence)
phones_list = []
result = {}
for text in sentences:
phones_seg = []
segments = self._distinguish(text)
for seg in segments:
content = seg[0]
lang = seg[1]
if lang == "zh":
input_ids = self.zh_frontend.get_input_ids(
content,
merge_sentences=True,
get_tone_ids=get_tone_ids)
elif lang == "en":
input_ids = self.en_frontend.get_input_ids(
content, merge_sentences=True)
phones_seg.append(input_ids["phone_ids"][0])
if add_sp:
phones_seg.append(self.sp_id_tensor)
phones = paddle.concat(phones_seg)
phones_list.append(phones)
if merge_sentences:
merge_list = paddle.concat(phones_list)
# remove the last 'sp' to avoid noise at the end,
# because the training data has no 'sp' at the end
if merge_list[-1] == self.sp_id_tensor:
merge_list = merge_list[:-1]
phones_list = []
phones_list.append(merge_list)
result["phone_ids"] = phones_list
return result
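
A small usage sketch for the new `MixFrontend` (the vocab path is a placeholder and must point to the phone id map of a `fastspeech2_mix` model; the sentence is taken from the mixed test list above):

```python
from paddlespeech.t2s.frontend.mix_frontend import MixFrontend

# phone_id_map.txt is a placeholder path for the fastspeech2_mix phone vocabulary
frontend = MixFrontend(phone_vocab_path="phone_id_map.txt")

result = frontend.get_input_ids(
    "我喜欢 eat apple, 你喜欢 drink milk。", merge_sentences=True)
phone_ids = result["phone_ids"]   # list of paddle.Tensor; one entry when merge_sentences=True
```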

@ -72,7 +72,8 @@ base = [
"colorlog",
"pathos == 0.2.8",
"braceexpand",
"pyyaml"
"pyyaml",
"pybind11",
]
server = [
@ -91,7 +92,6 @@ requirements = {
"gpustat",
"paddlespeech_ctcdecoders",
"phkit",
"pybind11",
"pypi-kenlm",
"snakeviz",
"sox",

@ -1,27 +1,26 @@
* [python_kaldi_features](https://github.com/ZitengWang/python_kaldi_features)
# python_kaldi_features
[python_kaldi_features](https://github.com/ZitengWang/python_kaldi_features)
commit: fc1bd6240c2008412ab64dc25045cd872f5e126c
ref: https://zhuanlan.zhihu.com/p/55371926
license: MIT
* [python-pinyin](https://github.com/mozillazg/python-pinyin.git)
commit: 55e524aa1b7b8eec3d15c5306043c6cdd5938b03
license: MIT
# Install ctc_decoder for Windows
* [zhon](https://github.com/tsroten/zhon)
commit: 09bf543696277f71de502506984661a60d24494c
license: MIT
`install_win_ctc.bat` is a bat script to install paddlespeech_ctc_decoders for Windows
* [pymmseg-cpp](https://github.com/pluskid/pymmseg-cpp.git)
commit: b76465045717fbb4f118c4fbdd24ce93bab10a6d
license: MIT
## Prepare your environment
* [chinese_text_normalization](https://github.com/speechio/chinese_text_normalization.git)
commit: 9e92c7bf2d6b5a7974305406d8e240045beac51c
license: MIT
ensure your environment is like this:
* [phkit](https://github.com/KuangDD/phkit.git)
commit: b2100293c1e36da531d7f30bd52c9b955a649522
license: None
* gcc: version >= 12.1.0
* cmake: version >= 3.24.0
* make: version >= 3.82.90
* visual studio: version >= 2019
* [nnAudio](https://github.com/KinWaiCheuk/nnAudio.git)
license: MIT
## Start your bat script
```shell
start install_win_ctc.bat
```

@ -13,7 +13,8 @@
#include "decoder_utils.h"
using namespace lm::ngram;
// if your platform is windows, you need to add this define
#define F_OK 0
Scorer::Scorer(double alpha,
double beta,
const std::string& lm_path,

@ -89,10 +89,11 @@ FILES = [
or fn.endswith('unittest.cc'))
]
# yapf: enable
LIBS = ['stdc++']
if platform.system() != 'Darwin':
LIBS.append('rt')
if platform.system() == 'Windows':
LIBS = ['-static-libstdc++']
ARGS = ['-O3', '-DNDEBUG', '-DKENLM_MAX_ORDER=6', '-std=c++11']

@ -0,0 +1,21 @@
@echo off
cd ctc_decoders
if not exist kenlm (
git clone https://github.com/Doubledongli/kenlm.git
@echo.
)
if not exist openfst-1.6.3 (
echo "Download and extract openfst ..."
git clone https://gitee.com/koala999/openfst.git
ren openfst openfst-1.6.3
@echo.
)
if not exist ThreadPool (
git clone https://github.com/progschj/ThreadPool.git
@echo.
)
echo "Install decoders ..."
python setup.py install --num_processes 4