
@ -25,14 +25,16 @@
| <a href="#documents"> Documents </a>
| <a href="#model-list"> Models List </a>
| <a href="https://aistudio.baidu.com/aistudio/education/group/info/25130"> AIStudio Courses </a>
| <a href="https://arxiv.org/abs/2205.12007"> Paper </a>
| <a href="https://arxiv.org/abs/2205.12007"> NAACL2022 Best Demo Award Paper </a>
| <a href="https://gitee.com/paddlepaddle/PaddleSpeech"> Gitee </a>
</h4>
</div>
------------------------------------------------------------------------------------
**PaddleSpeech** is an open-source toolkit on the [PaddlePaddle](https://github.com/PaddlePaddle/Paddle) platform for a variety of critical tasks in speech and audio, with state-of-the-art and influential models.
**PaddleSpeech** won the [NAACL2022 Best Demo Award](https://2022.naacl.org/blog/best-demo-award/); please check out our paper on [arXiv](https://arxiv.org/abs/2205.12007).
##### Speech Recognition
@ -177,7 +179,7 @@ Via the easy-to-use, efficient, flexible and scalable implementation, our vision
## Installation
We strongly recommend our users to install PaddleSpeech in **Linux** with *python>=3.7*.
We strongly recommend our users to install PaddleSpeech in **Linux** with *python>=3.7* and *paddlepaddle>=2.3.1*.
Up to now, **Linux** supports CLI for all our tasks; **Mac OSX** and **Windows** only support the PaddleSpeech CLI for Audio Classification, Speech-to-Text and Text-to-Speech. To install `PaddleSpeech`, please see [installation](./docs/source/install.md).
@ -706,6 +708,7 @@ You are warmly welcome to submit questions in [discussions](https://github.com/P
- Many thanks to [phecda-xu](https://github.com/phecda-xu)/[PaddleDubbing](https://github.com/phecda-xu/PaddleDubbing) for developing a dubbing tool with GUI based on PaddleSpeech TTS model.
- Many thanks to [jerryuhoo](https://github.com/jerryuhoo)/[VTuberTalk](https://github.com/jerryuhoo/VTuberTalk) for developing a GUI tool based on PaddleSpeech TTS and code for making datasets from videos based on PaddleSpeech ASR.
- Many thanks to [vpegasus](https://github.com/vpegasus)/[xuesebot](https://github.com/vpegasus/xuesebot) for developing a Rasa chatbot, which is able to speak and listen thanks to PaddleSpeech.
- Many thanks to [chenkui164](https://github.com/chenkui164)/[FastASR](https://github.com/chenkui164/FastASR) for the C++ inference implementation of PaddleSpeech ASR.
Besides, PaddleSpeech depends on a lot of open source repositories. See [references](./docs/source/reference.md) for more information.

@ -26,7 +26,7 @@
| <a href="#教程文档"> 教程文档 </a>
| <a href="#模型列表"> 模型列表 </a>
| <a href="https://aistudio.baidu.com/aistudio/education/group/info/25130"> AIStudio 课程 </a>
| <a href="https://arxiv.org/abs/2205.12007"> 论文 </a>
| <a href="https://arxiv.org/abs/2205.12007"> NAACL2022 论文 </a>
| <a href="https://gitee.com/paddlepaddle/PaddleSpeech"> Gitee
</h4>
</div>
@ -35,6 +35,9 @@
------------------------------------------------------------------------------------
**PaddleSpeech** is an open-source speech model library based on [PaddlePaddle](https://github.com/PaddlePaddle/Paddle), used for the development of a variety of critical tasks in speech and audio; it contains many influential models built on cutting-edge deep learning. Some typical application examples are shown below:
**PaddleSpeech** won the [NAACL2022 Best Demo Award](https://2022.naacl.org/blog/best-demo-award/); please check out our paper on [arXiv](https://arxiv.org/abs/2205.12007).
##### Speech Recognition
<div align = "center">
@ -702,7 +705,7 @@ PaddleSpeech's **speech synthesis** mainly consists of three modules: text frontend, acoustic
- Many thanks to [jerryuhoo](https://github.com/jerryuhoo)/[VTuberTalk](https://github.com/jerryuhoo/VTuberTalk) for the TTS GUI based on PaddleSpeech and the related code for building datasets with PaddleSpeech ASR.
- Many thanks to [vpegasus](https://github.com/vpegasus)/[xuesebot](https://github.com/vpegasus/xuesebot) for the chatbot that can listen and speak, designed with PaddleSpeech ASR and TTS.
- Many thanks to [chenkui164](https://github.com/chenkui164)/[FastASR](https://github.com/chenkui164/FastASR) for the C++ inference implementation of PaddleSpeech ASR.
In addition, PaddleSpeech depends on many open-source repositories. See [references](./docs/source/reference.md) for more information.

@ -12,6 +12,7 @@ This directory contains many speech applications in multiple scenarios.
* speech recognition - recognize text of an audio file
* speech server - server for speech tasks, e.g. ASR, TTS, CLS
* streaming asr server - receive an audio stream over websocket and recognize it into a transcript.
* streaming tts server - receive text over http or websocket and stream back the synthesized audio.
* speech translation - end to end speech translation
* story talker - book reader based on OCR and TTS
* style_fs2 - multi style control for FastSpeech2 model

@ -10,8 +10,9 @@
* metaverse - 2D augmented reality based on speech synthesis.
* punctuation restoration - usually a text post-processing task for speech recognition, adding punctuation to a piece of unpunctuated plain text.
* speech recognition - recognize the text contained in a piece of audio.
* speech server - offline speech services, including ASR, TTS, CLS, etc
* streaming asr server - recognize the text in a streaming audio input
* speech server - offline speech services, including ASR, TTS, CLS, etc.
* streaming asr server - recognize the text in a streaming audio input.
* streaming tts server - stream out synthesized audio generated from the input text.
* speech translation - recognize the speech in audio in real time and simultaneously translate it into the target language.
* story talker - a talking storybook based on OCR and speech synthesis.
* personalized speech synthesis - personalized speech synthesis based on the FastSpeech2 model.

@ -1,6 +1,7 @@
#!/bin/bash
wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespeech.bj.bcebos.com/PaddleAudio/en.wav
wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav
wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/en.wav
# asr
paddlespeech asr --input ./zh.wav
@ -8,3 +9,18 @@ paddlespeech asr --input ./zh.wav
# asr + punc
paddlespeech asr --input ./zh.wav | paddlespeech text --task punc
# asr help
paddlespeech asr --help
# english asr
paddlespeech asr --lang en --model transformer_librispeech --input ./en.wav
# model stats
paddlespeech stats --task asr
# paddlespeech help
paddlespeech --help
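For reference, the same pipeline is exposed through the documented Python executors; a minimal sketch, assuming the `zh.wav` downloaded by this script:

```python
from paddlespeech.cli.asr.infer import ASRExecutor
from paddlespeech.cli.text.infer import TextExecutor

asr = ASRExecutor()
text_punc = TextExecutor()  # punctuation restoration is the default task

# mirrors `paddlespeech asr --input ./zh.wav | paddlespeech text --task punc`
transcript = asr(audio_file="./zh.wav")
print(text_punc(text=transcript))
```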

@ -11,7 +11,8 @@ This demo is an implementation of starting the voice service and accessing the s
see [installation](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install.md).
It is recommended to use **paddlepaddle 2.2.2** or above.
You can choose one way from medium and hard to install paddlespeech.
You can choose one way from easy, medium and hard to install paddlespeech.
**If you install in easy mode, you need to prepare the yaml file by yourself; you can refer to the yaml files in the conf directory.**
### 2. Prepare config File
The configuration file can be found in `conf/application.yaml`.

@ -3,7 +3,7 @@
# Speech Server
## Introduction
This demo is an implementation of starting the offline speech service and accessing it. It can be done with a single command using `paddlespeech_server` and `paddlespeech_client`, or with a few lines of Python code.
## Usage
@ -11,13 +11,14 @@
See the [installation documentation](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install.md).
It is recommended to use **paddlepaddle 2.2.2** or above.
You can choose one way from medium and hard to install PaddleSpeech.
You can choose one way from easy, medium and hard to install PaddleSpeech.
**If you install in easy mode, you need to prepare the yaml file by yourself; you can refer to the yaml files in the conf directory.**
### 2. Prepare config File
The configuration file can be found in `conf/application.yaml`.
Among them, `engine_list` indicates the speech engines that the service to be started will include, in the format <speech task>_<engine type>.
The speech tasks currently integrated into the service are: asr (speech recognition), tts (speech synthesis), cls (audio classification), vector (speaker recognition) and text (text processing).
Two engine types are currently supported: python and inference (Paddle Inference).
**Note:** If the service starts normally in the container but the client cannot reach its ip, try replacing the `host` address in the config file with the local ip address.
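For reference, the demo also documents a Python client for the offline server; a minimal sketch, assuming an asr engine is enabled in `conf/application.yaml` and the server is reachable on port 8090 (the exact result format may differ by version):

```python
from paddlespeech.server.bin.paddlespeech_client import ASRClientExecutor

asrclient = ASRClientExecutor()
res = asrclient(
    input="./zh.wav",
    server_ip="127.0.0.1",
    port=8090,
    sample_rate=16000,
    lang="zh_cn",
    audio_format="wav")
print(res)
```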

@ -1,3 +1,3 @@
#!/bin/bash
paddlespeech_server start --config_file ./conf/application.yaml
paddlespeech_server start --config_file ./conf/application.yaml &> server.log &

@ -0,0 +1,10 @@
#!/bin/bash
wget -c https://paddlespeech.bj.bcebos.com/vector/audio/85236145389.wav
wget -c https://paddlespeech.bj.bcebos.com/vector/audio/123456789.wav
# sid extract
paddlespeech_client vector --server_ip 127.0.0.1 --port 8090 --task spk --input ./85236145389.wav
# sid score
paddlespeech_client vector --server_ip 127.0.0.1 --port 8090 --task score --enroll ./85236145389.wav --test ./123456789.wav
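The same request can be made from Python via the HTTP handler that the client executor uses internally (its `run(input, audio_format, sample_rate)` call appears in the `VectorClientExecutor` hunk later in this PR); a minimal sketch:

```python
from paddlespeech.server.utils.audio_handler import VectorHttpHandler

handler = VectorHttpHandler(server_ip="127.0.0.1", port=8090)
# returns the speaker embedding for the given utterance
res = handler.run("./85236145389.wav", "wav", 16000)
print(res)
```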

@ -0,0 +1,4 @@
#!/bin/bash
paddlespeech_client text --server_ip 127.0.0.1 --port 8090 --input 今天的天气真好啊你下午有空吗我想约你一起去吃饭

@ -747,9 +747,9 @@
}
},
"node_modules/moment": {
"version": "2.29.3",
"resolved": "https://registry.npmmirror.com/moment/-/moment-2.29.3.tgz",
"integrity": "sha512-c6YRvhEo//6T2Jz/vVtYzqBzwvPT95JBQ+smCytzf7c50oMZRsR/a4w88aD34I+/QVSfnoAnSBFPJHItlOMJVw==",
"version": "2.29.4",
"resolved": "https://registry.npmjs.org/moment/-/moment-2.29.4.tgz",
"integrity": "sha512-5LC9SOxjSc2HF6vO2CyuTDNivEdoz2IvyJJGj6X8DJ0eFyfszE0QiEd+iXmBvUP3WHxSjFH/vIsA0EN00cgr8w==",
"engines": {
"node": "*"
}
@ -1636,9 +1636,9 @@
"optional": true
},
"moment": {
"version": "2.29.3",
"resolved": "https://registry.npmmirror.com/moment/-/moment-2.29.3.tgz",
"integrity": "sha512-c6YRvhEo//6T2Jz/vVtYzqBzwvPT95JBQ+smCytzf7c50oMZRsR/a4w88aD34I+/QVSfnoAnSBFPJHItlOMJVw=="
"version": "2.29.4",
"resolved": "https://registry.npmjs.org/moment/-/moment-2.29.4.tgz",
"integrity": "sha512-5LC9SOxjSc2HF6vO2CyuTDNivEdoz2IvyJJGj6X8DJ0eFyfszE0QiEd+iXmBvUP3WHxSjFH/vIsA0EN00cgr8w=="
},
"nanoid": {
"version": "3.3.2",

@ -587,9 +587,9 @@ mime@^1.4.1:
integrity sha512-x0Vn8spI+wuJ1O6S7gnbaQg8Pxh4NNHb7KSINmEWKiPE4RKOplvijn+NkmYmmRgP68mc70j2EbeTFRsrswaQeg==
moment@^2.27.0:
version "2.29.3"
resolved "https://registry.npmmirror.com/moment/-/moment-2.29.3.tgz"
integrity sha512-c6YRvhEo//6T2Jz/vVtYzqBzwvPT95JBQ+smCytzf7c50oMZRsR/a4w88aD34I+/QVSfnoAnSBFPJHItlOMJVw==
version "2.29.4"
resolved "https://registry.yarnpkg.com/moment/-/moment-2.29.4.tgz#3dbe052889fe7c1b2ed966fcb3a77328964ef108"
integrity sha512-5LC9SOxjSc2HF6vO2CyuTDNivEdoz2IvyJJGj6X8DJ0eFyfszE0QiEd+iXmBvUP3WHxSjFH/vIsA0EN00cgr8w==
ms@^2.1.1:
version "2.1.3"

@ -12,7 +12,8 @@ Streaming ASR server only supports the `websocket` protocol, and doesn't support `htt
see [installation](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install.md).
It is recommended to use **paddlepaddle 2.2.1** or above.
You can choose one way from medium and hard to install paddlespeech.
You can choose one way from easy, medium and hard to install paddlespeech.
**If you install in easy mode, you need to prepare the yaml file by yourself; you can refer to the yaml files in the conf directory.**
### 2. Prepare config File
The configuration file can be found in `conf/ws_application.yaml` or `conf/ws_conformer_wenetspeech_application.yaml`.

@ -12,8 +12,8 @@
For the detailed process of installing PaddleSpeech, see the [installation documentation](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install.md).
It is recommended to use **paddlepaddle 2.2.1** or above.
You can choose one way from medium and hard to install PaddleSpeech.
You can choose one way from easy, medium and hard to install PaddleSpeech.
**If you install in easy mode, you need to prepare the yaml file by yourself; you can refer to the yaml files in the conf directory.**
### 2. Prepare config File

@ -1,9 +1,8 @@
export CUDA_VISIBLE_DEVICE=0,1,2,3
export CUDA_VISIBLE_DEVICE=0,1,2,3
#export CUDA_VISIBLE_DEVICE=0,1,2,3
# nohup python3 punc_server.py --config_file conf/punc_application.yaml > punc.log 2>&1 &
# nohup python3 local/punc_server.py --config_file conf/punc_application.yaml > punc.log 2>&1 &
paddlespeech_server start --config_file conf/punc_application.yaml &> punc.log &
# nohup python3 streaming_asr_server.py --config_file conf/ws_conformer_wenetspeech_application.yaml > streaming_asr.log 2>&1 &
# nohup python3 local/streaming_asr_server.py --config_file conf/ws_conformer_wenetspeech_application.yaml > streaming_asr.log 2>&1 &
paddlespeech_server start --config_file conf/ws_conformer_wenetspeech_application.yaml &> streaming_asr.log &

@ -7,5 +7,5 @@ paddlespeech_client asr_online --server_ip 127.0.0.1 --port 8090 --input ./zh.wa
# read the wav and call streaming and punc service
# If `127.0.0.1` is not accessible, you need to use the actual service IP address.
paddlespeech_client asr_online --server_ip 127.0.0.1 --port 8290 --punc.server_ip 127.0.0.1 --punc.port 8190 --input ./zh.wav
paddlespeech_client asr_online --server_ip 127.0.0.1 --port 8090 --punc.server_ip 127.0.0.1 --punc.port 8190 --input ./zh.wav
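A Python counterpart of this client is also documented for the streaming demo; a minimal sketch, assuming the streaming server from the script above is listening on port 8090 (parameter names follow the demo docs):

```python
from paddlespeech.server.bin.paddlespeech_client import ASROnlineClientExecutor

executor = ASROnlineClientExecutor()
res = executor(
    input="./zh.wav",
    server_ip="127.0.0.1",
    port=8090,
    sample_rate=16000,
    lang="zh_cn",
    audio_format="wav")
print(res)
```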

@ -11,7 +11,8 @@ This demo is an implementation of starting the streaming speech synthesis servic
see [installation](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install.md).
It is recommended to use **paddlepaddle 2.2.2** or above.
You can choose one way from medium and hard to install paddlespeech.
You can choose one way from easy, medium and hard to install paddlespeech.
**If you install in easy mode, you need to prepare the yaml file by yourself; you can refer to the yaml files in the conf directory.**
### 2. Prepare config File

@ -11,8 +11,8 @@
See the [installation documentation](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install.md).
It is recommended to use **paddlepaddle 2.2.2** or above.
You can choose one way from medium and hard to install PaddleSpeech.
You can choose one way from easy, medium and hard to install PaddleSpeech.
**If you install in easy mode, you need to prepare the yaml file by yourself; you can refer to the yaml files in the conf directory.**
### 2. Prepare config File
The configuration file can be found in `conf/tts_online_application.yaml`.

@ -2,8 +2,8 @@
# http client test
# If `127.0.0.1` is not accessible, you need to use the actual service IP address.
paddlespeech_client tts_online --server_ip 127.0.0.1 --port 8092 --protocol http --input "您好,欢迎使用百度飞桨语音合成服务。" --output output.wav
paddlespeech_client tts_online --server_ip 127.0.0.1 --port 8092 --protocol http --input "您好,欢迎使用百度飞桨语音合成服务。" --output output.http.wav
# websocket client test
# If `127.0.0.1` is not accessible, you need to use the actual service IP address.
# paddlespeech_client tts_online --server_ip 127.0.0.1 --port 8092 --protocol websocket --input "您好,欢迎使用百度飞桨语音合成服务。" --output output.wav
paddlespeech_client tts_online --server_ip 127.0.0.1 --port 8192 --protocol websocket --input "您好,欢迎使用百度飞桨语音合成服务。" --output output.ws.wav

@ -0,0 +1,103 @@
# This is the parameter configuration file for the streaming tts server.
#################################################################################
#                             SERVER SETTING                                    #
#################################################################################
host: 0.0.0.0
port: 8192
# The task format in the engine_list is: <speech task>_<engine type>
# engine_list choices = ['tts_online', 'tts_online-onnx'], the inference speed of tts_online-onnx is faster than tts_online.
# protocol choices = ['websocket', 'http']
protocol: 'websocket'
engine_list: ['tts_online-onnx']
#################################################################################
#                               ENGINE CONFIG                                   #
#################################################################################
################################### TTS #########################################
################### speech task: tts; engine_type: online #######################
tts_online:
    # am (acoustic model) choices=['fastspeech2_csmsc', 'fastspeech2_cnndecoder_csmsc']
    # fastspeech2_cnndecoder_csmsc supports streaming am inference.
    am: 'fastspeech2_csmsc'
    am_config:
    am_ckpt:
    am_stat:
    phones_dict:
    tones_dict:
    speaker_dict:
    spk_id: 0
    # voc (vocoder) choices=['mb_melgan_csmsc', 'hifigan_csmsc']
    # both mb_melgan_csmsc and hifigan_csmsc support streaming voc inference
    voc: 'mb_melgan_csmsc'
    voc_config:
    voc_ckpt:
    voc_stat:
    # others
    lang: 'zh'
    device: 'cpu' # set 'gpu:id' or 'cpu'
    # am_block and am_pad are only used by the fastspeech2_cnndecoder_onnx model for streaming am inference;
    # when am_pad is set to 12, the streaming synthesized audio is the same as the non-streaming synthesized audio
    am_block: 72
    am_pad: 12
    # voc_block and voc_pad are used by the voc model for streaming voc inference;
    # when the voc model is mb_melgan_csmsc, setting voc_pad to 14 makes the streaming synthesized audio identical to the non-streaming audio; voc_pad can be set as low as 7 and the streaming synthesized audio still sounds normal
    # when the voc model is hifigan_csmsc, setting voc_pad to 19 makes the streaming synthesized audio identical to the non-streaming audio; with voc_pad set to 14, the streaming synthesized audio sounds normal
    voc_block: 36
    voc_pad: 14
#################################################################################
#                               ENGINE CONFIG                                   #
#################################################################################
################################### TTS #########################################
################### speech task: tts; engine_type: online-onnx #######################
tts_online-onnx:
    # am (acoustic model) choices=['fastspeech2_csmsc_onnx', 'fastspeech2_cnndecoder_csmsc_onnx']
    # fastspeech2_cnndecoder_csmsc_onnx supports streaming am inference.
    am: 'fastspeech2_cnndecoder_csmsc_onnx'
    # am_ckpt is a list; if am is fastspeech2_cnndecoder_csmsc_onnx, am_ckpt = [encoder model, decoder model, postnet model];
    # if am is fastspeech2_csmsc_onnx, am_ckpt = [ckpt model]
    am_ckpt: # list
    am_stat:
    phones_dict:
    tones_dict:
    speaker_dict:
    spk_id: 0
    am_sample_rate: 24000
    am_sess_conf:
        device: "cpu" # set 'gpu:id' or 'cpu'
        use_trt: False
        cpu_threads: 4
    # voc (vocoder) choices=['mb_melgan_csmsc_onnx', 'hifigan_csmsc_onnx']
    # both mb_melgan_csmsc_onnx and hifigan_csmsc_onnx support streaming voc inference
    voc: 'hifigan_csmsc_onnx'
    voc_ckpt:
    voc_sample_rate: 24000
    voc_sess_conf:
        device: "cpu" # set 'gpu:id' or 'cpu'
        use_trt: False
        cpu_threads: 4
    # others
    lang: 'zh'
    # am_block and am_pad are only used by the fastspeech2_cnndecoder_onnx model for streaming am inference;
    # when am_pad is set to 12, the streaming synthesized audio is the same as the non-streaming synthesized audio
    am_block: 72
    am_pad: 12
    # voc_block and voc_pad are used by the voc model for streaming voc inference;
    # when the voc model is mb_melgan_csmsc_onnx, setting voc_pad to 14 makes the streaming synthesized audio identical to the non-streaming audio; voc_pad can be set as low as 7 and the streaming synthesized audio still sounds normal
    # when the voc model is hifigan_csmsc_onnx, setting voc_pad to 19 makes the streaming synthesized audio identical to the non-streaming audio; with voc_pad set to 14, the streaming synthesized audio sounds normal
    voc_block: 36
    voc_pad: 14
    # voc_upsample should be the same as n_shift in the voc config.
    voc_upsample: 300
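A client matching this config (websocket protocol on port 8192) can be driven from Python as documented in the streaming TTS demo; a minimal sketch (result-handling options vary by version):

```python
from paddlespeech.server.bin.paddlespeech_client import TTSOnlineClientExecutor

executor = TTSOnlineClientExecutor()
executor(
    input="您好,欢迎使用百度飞桨语音合成服务。",
    server_ip="127.0.0.1",
    port=8192,             # matches the `port` above
    protocol="websocket",  # matches the `protocol` above
    output="./output.wav")
```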

@ -0,0 +1,10 @@
#!/bin/bash
# http server
paddlespeech_server start --config_file ./conf/tts_online_application.yaml &> tts.http.log &
# websocket server
paddlespeech_server start --config_file ./conf/tts_online_ws_application.yaml &> tts.ws.log &

@ -1,3 +0,0 @@
#!/bin/bash
# start server
paddlespeech_server start --config_file ./conf/tts_online_application.yaml

@ -4,4 +4,10 @@
paddlespeech tts --input 今天的天气不错啊
# Batch process
echo -e "1 欢迎光临。\n2 谢谢惠顾。" | paddlespeech tts
echo -e "1 欢迎光临。\n2 谢谢惠顾。" | paddlespeech tts
# Text Frontend
paddlespeech tts --input 今天是2022/10/29,最低温度是-3℃.
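These CLI calls also have a documented Python counterpart; a minimal sketch:

```python
from paddlespeech.cli.tts.infer import TTSExecutor

tts = TTSExecutor()
# the text frontend normalizes dates and temperatures such as 2022/10/29 and -3℃
tts(text="今天是2022/10/29,最低温度是-3℃.", output="output.wav")
```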

@ -0,0 +1,77 @@
FROM nvidia/cuda:11.2.2-cudnn8-runtime-ubuntu16.04
RUN echo "deb [trusted=true] http://mirrors.tuna.tsinghua.edu.cn/ubuntu/ xenial main restricted \n\
deb [trusted=true] http://mirrors.tuna.tsinghua.edu.cn/ubuntu/ xenial-updates main restricted \n\
deb [trusted=true] http://mirrors.tuna.tsinghua.edu.cn/ubuntu/ xenial universe \n\
deb [trusted=true] http://mirrors.tuna.tsinghua.edu.cn/ubuntu/ xenial-updates universe \n\
deb [trusted=true] http://mirrors.tuna.tsinghua.edu.cn/ubuntu/ xenial multiverse \n\
deb [trusted=true] http://mirrors.tuna.tsinghua.edu.cn/ubuntu/ xenial-updates multiverse \n\
deb [trusted=true] http://mirrors.tuna.tsinghua.edu.cn/ubuntu/ xenial-backports main restricted universe multiverse \n\
deb [trusted=true] http://mirrors.tuna.tsinghua.edu.cn/ubuntu/ xenial-security main restricted \n\
deb [trusted=true] http://mirrors.tuna.tsinghua.edu.cn/ubuntu/ xenial-security universe \n\
deb [trusted=true] http://mirrors.tuna.tsinghua.edu.cn/ubuntu/ xenial-security multiverse" > /etc/apt/sources.list
RUN apt-get update && apt-get install -y inetutils-ping wget vim curl cmake git sox libsndfile1 libpng12-dev \
libpng-dev swig libzip-dev openssl bc libflac* libgdk-pixbuf2.0-dev libpango1.0-dev libcairo2-dev \
libgtk2.0-dev pkg-config zip unzip zlib1g-dev libreadline-dev libbz2-dev liblapack-dev libjpeg-turbo8-dev \
sudo lrzsz libsqlite3-dev libx11-dev libsm6 apt-utils libopencv-dev libavcodec-dev libavformat-dev \
libswscale-dev locales liblzma-dev python-lzma m4 libxext-dev strace libibverbs-dev libpcre3 libpcre3-dev \
build-essential libncurses5-dev libgdbm-dev libnss3-dev libssl-dev libreadline-dev libffi-dev xz-utils \
libfreetype6-dev libxslt1-dev libxml2-dev libgeos-3.5.0 libgeos-dev && apt-get install -y --allow-downgrades \
--allow-change-held-packages libnccl2 libnccl-dev && DEBIAN_FRONTEND=noninteractive apt-get install -y tzdata \
&& /bin/cp /usr/share/zoneinfo/Asia/Shanghai /etc/localtime && dpkg-reconfigure -f noninteractive tzdata && \
cd /usr/lib/x86_64-linux-gnu && ln -s libcudnn.so.8 libcudnn.so && \
cd /usr/local/cuda-11.2/targets/x86_64-linux/lib && ln -s libcublas.so.11.4.1.1043 libcublas.so && \
ln -s libcusolver.so.11.1.0.152 libcusolver.so && ln -s libcusparse.so.11 libcusparse.so && \
ln -s libcufft.so.10.4.1.152 libcufft.so
RUN echo "set meta-flag on" >> /etc/inputrc && echo "set convert-meta off" >> /etc/inputrc && \
locale-gen en_US.UTF-8 && /sbin/ldconfig -v && groupadd -g 10001 paddle && \
useradd -m -s /bin/bash -N -u 10001 paddle -g paddle && chmod g+w /etc/passwd && \
echo "paddle ALL=(ALL) NOPASSWD: ALL" >> /etc/sudoers
ENV LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8 LANGUAGE=en_US.UTF-8 TZ=Asia/Shanghai
# official download site: https://www.python.org/ftp/python/3.7.13/Python-3.7.13.tgz
RUN wget https://cdn.npmmirror.com/binaries/python/3.7.13/Python-3.7.13.tgz && tar xvf Python-3.7.13.tgz && \
cd Python-3.7.13 && ./configure --prefix=/home/paddle/python3.7 && make -j8 && make install && \
rm -rf ../Python-3.7.13 ../Python-3.7.13.tgz && chown -R paddle:paddle /home/paddle/python3.7
RUN cd /tmp && wget https://mirrors.sjtug.sjtu.edu.cn/gnu/gmp/gmp-6.1.0.tar.bz2 && tar xvf gmp-6.1.0.tar.bz2 && \
cd gmp-6.1.0 && ./configure --prefix=/usr/local && make -j8 && make install && \
rm -rf ../gmp-6.1.0.tar.bz2 ../gmp-6.1.0 && cd /tmp && \
wget https://www.mpfr.org/mpfr-3.1.4/mpfr-3.1.4.tar.bz2 && tar xvf mpfr-3.1.4.tar.bz2 && cd mpfr-3.1.4 && \
./configure --prefix=/usr/local && make -j8 && make install && rm -rf ../mpfr-3.1.4.tar.bz2 ../mpfr-3.1.4 && \
cd /tmp && wget https://mirrors.sjtug.sjtu.edu.cn/gnu/mpc/mpc-1.0.3.tar.gz && tar xvf mpc-1.0.3.tar.gz && \
cd mpc-1.0.3 && ./configure --prefix=/usr/local && make -j8 && make install && \
rm -rf ../mpc-1.0.3.tar.gz ../mpc-1.0.3 && cd /tmp && \
wget http://www.mirrorservice.org/sites/sourceware.org/pub/gcc/infrastructure/isl-0.18.tar.bz2 && \
tar xvf isl-0.18.tar.bz2 && cd isl-0.18 && ./configure --prefix=/usr/local && make -j8 && make install \
&& rm -rf ../isl-0.18.tar.bz2 ../isl-0.18 && cd /tmp && \
wget http://mirrors.ustc.edu.cn/gnu/gcc/gcc-8.2.0/gcc-8.2.0.tar.gz --no-check-certificate && \
tar xvf gcc-8.2.0.tar.gz && cd gcc-8.2.0 && unset LIBRARY_PATH && ./configure --prefix=/home/paddle/gcc82 \
--enable-threads=posix --disable-checking --disable-multilib --enable-languages=c,c++ --with-gmp=/usr/local \
--with-mpfr=/usr/local --with-mpc=/usr/local --with-isl=/usr/local && make -j8 && make install && \
rm -rf ../gcc-8.2.0.tar.gz ../gcc-8.2.0 && chown -R paddle:paddle /home/paddle/gcc82
WORKDIR /home/paddle
ENV PATH=/home/paddle/python3.7/bin:/home/paddle/gcc82/bin:${PATH} \
LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:/usr/local/cuda-11.2/targets/x86_64-linux/lib:${LD_LIBRARY_PATH}
RUN mkdir -p ~/.pip && echo "[global]" > ~/.pip/pip.conf && \
echo "index-url=https://mirror.baidu.com/pypi/simple" >> ~/.pip/pip.conf && \
echo "trusted-host=mirror.baidu.com" >> ~/.pip/pip.conf && \
python3 -m pip install --upgrade pip && \
pip install paddlepaddle-gpu==2.3.1.post112 -f https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html && \
rm -rf ~/.cache/pip
RUN git clone https://github.com/PaddlePaddle/PaddleSpeech.git && cd PaddleSpeech && \
pip3 install pytest-runner paddleaudio -i https://pypi.tuna.tsinghua.edu.cn/simple && \
pip3 install -e .[develop] -i https://pypi.tuna.tsinghua.edu.cn/simple && \
pip3 install importlib-metadata==4.2.0 urllib3==1.25.10 -i https://pypi.tuna.tsinghua.edu.cn/simple && \
rm -rf ~/.cache/pip && \
sudo cp -f /home/paddle/gcc82/lib64/libstdc++.so.6.0.25 /usr/lib/x86_64-linux-gnu/libstdc++.so.6 && \
chown -R paddle:paddle /home/paddle/PaddleSpeech
USER paddle
CMD ["bash"]

@ -10,7 +10,7 @@ There are 3 ways to use `PaddleSpeech`. According to the degree of difficulty, t
## Prerequisites
- Python >= 3.7
- PaddlePaddle latest version (please refer to the [Installation Guide] (https://www.paddlepaddle.org.cn/documentation/docs/en/beginners_guide/index_en.html))
- PaddlePaddle latest version (please refer to the [Installation Guide](https://www.paddlepaddle.org.cn/documentation/docs/en/beginners_guide/index_en.html))
- C++ compilation environment
- Tip: For Linux and Mac, do not use `sh` instead of `bash` when following the installation document.
- Tip: We recommend you install `paddlepaddle` from https://mirror.baidu.com/pypi/simple and install `paddlespeech` from https://pypi.tuna.tsinghua.edu.cn/simple.
@ -117,9 +117,9 @@ conda install -y -c gcc_linux-64=8.4.0 gxx_linux-64=8.4.0
```
(Tip: Do not use the last command if you want to install the **Hard** way):
### Install PaddlePaddle
You can choose the `PaddlePaddle` version based on your system. For example, for CUDA 10.2 and CuDNN 7.5, install paddlepaddle-gpu 2.2.0:
You can choose the `PaddlePaddle` version based on your system. For example, for CUDA 10.2 and CuDNN 7.5, install paddlepaddle-gpu 2.3.1:
```bash
python3 -m pip install paddlepaddle-gpu==2.2.0 -i https://mirror.baidu.com/pypi/simple
python3 -m pip install paddlepaddle-gpu==2.3.1 -i https://mirror.baidu.com/pypi/simple
```
### Install PaddleSpeech
You can install `paddlespeech` with the following command; then you can use the ready-made examples in `paddlespeech`:
@ -180,9 +180,9 @@ Some users may fail to install `kaldiio` due to the default download source, you
```bash
pip install pytest-runner -i https://pypi.tuna.tsinghua.edu.cn/simple
```
Make sure you have a GPU and that the paddlepaddle version is right. For example, for CUDA 10.2 and CuDNN 7.5, install paddle 2.2.0:
Make sure you have a GPU and that the paddlepaddle version is right. For example, for CUDA 10.2 and CuDNN 7.5, install paddle 2.3.1:
```bash
python3 -m pip install paddlepaddle-gpu==2.2.0 -i https://mirror.baidu.com/pypi/simple
python3 -m pip install paddlepaddle-gpu==2.3.1 -i https://mirror.baidu.com/pypi/simple
```
### Install PaddleSpeech in Developing Mode
```bash

@ -111,9 +111,9 @@ conda install -y -c gcc_linux-64=8.4.0 gxx_linux-64=8.4.0
```
(Tip: Do not use the last command if you want to install the **Hard** way.)
### Install PaddlePaddle
You can choose the PaddlePaddle version based on your system configuration; for example, with CUDA 10.2 and CuDNN 7.5 you can install paddlepaddle-gpu 2.2.0:
You can choose the PaddlePaddle version based on your system configuration; for example, with CUDA 10.2 and CuDNN 7.5 you can install paddlepaddle-gpu 2.3.1:
```bash
python3 -m pip install paddlepaddle-gpu==2.2.0 -i https://mirror.baidu.com/pypi/simple
python3 -m pip install paddlepaddle-gpu==2.3.1 -i https://mirror.baidu.com/pypi/simple
```
### Install PaddleSpeech
Finally, install `paddlespeech`, so that you can use the ready-made examples in `paddlespeech`:
@ -168,9 +168,9 @@ conda activate tools/venv
conda install -y -c conda-forge sox libsndfile swig bzip2 libflac bc
```
### Install PaddlePaddle
Make sure your system has a GPU and that you use the matching paddlepaddle version. For example, with CUDA 10.2 and CuDNN 7.5 you can install paddlepaddle-gpu 2.2.0:
Make sure your system has a GPU and that you use the matching paddlepaddle version. For example, with CUDA 10.2 and CuDNN 7.5 you can install paddlepaddle-gpu 2.3.1:
```bash
python3 -m pip install paddlepaddle-gpu==2.2.0 -i https://mirror.baidu.com/pypi/simple
python3 -m pip install paddlepaddle-gpu==2.3.1 -i https://mirror.baidu.com/pypi/simple
```
### Install PaddleSpeech in Developing Mode
On some systems, `kaldiio` may fail to install because of the default package source; it is recommended to install pytest-runner first:

@ -150,7 +150,7 @@ manylinux1 supports CentOS 5 and above, manylinux2010 supports CentOS 6 and above, manyli
### Pull manylinux2010
```bash
docker pull quay.io/pypa/manylinux1_x86_64
docker pull quay.io/pypa/manylinux2010_x86_64
```
### Use manylinux2010

@ -1,5 +1,5 @@
# Transformer/Conformer ASR with Aishell
This example contains code used to train a Transformer or [Conformer](http://arxiv.org/abs/2008.03802) model with [Aishell dataset](http://www.openslr.org/resources/33)
This example contains code used to train a [u2](https://arxiv.org/pdf/2012.05481.pdf) model (Transformer or [Conformer](https://arxiv.org/pdf/2005.08100.pdf) model) with [Aishell dataset](http://www.openslr.org/resources/33)
## Overview
All the scripts you need are in `run.sh`. There are several stages in `run.sh`, and each stage has its function.
| Stage | Function |

@ -1,20 +1,3 @@
# Callcenter 8k sample rate
Data distribution:
```
676048 utts
491.4004722221223 h
4357792.0 text
2.4633630739178654 text/sec
2.6167397877068495 sec/utt
```
train/dev/test partition:
```
33802 manifest.dev
67606 manifest.test
574640 manifest.train
676048 total
```
This recipe only has the model/data config for 8k ASR; users need to prepare the data and generate the manifest metafiles. You can refer to the Aishell or Librispeech recipes.

@ -154,7 +154,7 @@ VITS checkpoint contains files listed below.
vits_csmsc_ckpt_1.1.0
├── default.yaml # default config used to train vitx
├── phone_id_map.txt # phone vocabulary file when training vits
└── snapshot_iter_350000.pdz # model parameters and optimizer states
└── snapshot_iter_333000.pdz # model parameters and optimizer states
```
ps: This ckpt is not good enough; a better one is still being trained
@ -169,7 +169,7 @@ FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/synthesize_e2e.py \
--config=vits_csmsc_ckpt_1.1.0/default.yaml \
--ckpt=vits_csmsc_ckpt_1.1.0/snapshot_iter_350000.pdz \
--ckpt=vits_csmsc_ckpt_1.1.0/snapshot_iter_333000.pdz \
--phones_dict=vits_csmsc_ckpt_1.1.0/phone_id_map.txt \
--output_dir=exp/default/test_e2e \
--text=${BIN_DIR}/../sentences.txt \

@ -179,7 +179,7 @@ generator_first: False # whether to start updating generator first
# OTHER TRAINING SETTING #
##########################################################
num_snapshots: 10 # max number of snapshots to keep while training
train_max_steps: 250000 # Number of training steps. == total_iters / ngpus, total_iters = 1000000
train_max_steps: 350000 # Number of training steps. == total_iters / ngpus, total_iters = 1000000
save_interval_steps: 1000 # Interval steps to save checkpoint.
eval_interval_steps: 250 # Interval steps to evaluate the network.
seed: 777 # random seed number

@ -1,5 +1,5 @@
# Transformer/Conformer ASR with Librispeech
This example contains code used to train a Transformer or [Conformer](http://arxiv.org/abs/2008.03802) model with [Librispeech dataset](http://www.openslr.org/resources/12)
This example contains code used to train a [u2](https://arxiv.org/pdf/2012.05481.pdf) model (Transformer or [Conformer](https://arxiv.org/pdf/2005.08100.pdf) model) with [Librispeech dataset](http://www.openslr.org/resources/12)
## Overview
All the scripts you need are in `run.sh`. There are several stages in `run.sh`, and each stage has its function.
| Stage | Function |

@ -1,6 +1,6 @@
# Transformer/Conformer ASR with Librispeech ASR2
This example contains code used to train a Transformer or [Conformer](http://arxiv.org/abs/2008.03802) model with [Librispeech dataset](http://www.openslr.org/resources/12) and use some functions in kaldi.
This example contains code used to train a [u2](https://arxiv.org/pdf/2012.05481.pdf) model (Transformer or [Conformer](https://arxiv.org/pdf/2005.08100.pdf) model) with [Librispeech dataset](http://www.openslr.org/resources/12) and use some functions in kaldi.
To use this example, you need to install Kaldi first.

@ -1,5 +1,5 @@
# Transformer/Conformer ASR with Tiny
This example contains code used to train a Transformer or [Conformer](http://arxiv.org/abs/2008.03802) model Tiny dataset(a part of [[Librispeech dataset](http://www.openslr.org/resources/12)](http://www.openslr.org/resources/33))
This example contains code used to train a [u2](https://arxiv.org/pdf/2012.05481.pdf) model (Transformer or [Conformer](https://arxiv.org/pdf/2005.08100.pdf) model) with the Tiny dataset (a part of the [Librispeech dataset](http://www.openslr.org/resources/12))
## Overview
All the scripts you need are in `run.sh`. There are several stages in `run.sh`, and each stage has its function.
| Stage | Function |

@ -31,7 +31,7 @@ if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize.py \
--am=fastspeech2_aishell3 \
--am=fastspeech2_vctk \
--am_config=${config_path} \
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--am_stat=dump/train/speech_stats.npy \

@ -14,3 +14,5 @@
import _locale
_locale._getdefaultlocale = (lambda *args: ['en_US', 'utf8'])

@ -14,6 +14,9 @@
from . import compliance
from . import datasets
from . import features
from . import text
from . import transform
from . import streamdata
from . import functional
from . import io
from . import metric

@ -0,0 +1,13 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

@ -18,7 +18,6 @@ from typing import Union
import paddle
from paddle import nn
from paddle.fluid import core
from paddle.nn import functional as F
from paddlespeech.s2t.utils.log import Log
@ -39,46 +38,6 @@ paddle.long = 'int64'
paddle.uint16 = 'uint16'
paddle.cdouble = 'complex128'
def convert_dtype_to_string(tensor_dtype):
"""
Convert the data type in numpy to the data type in Paddle
Args:
tensor_dtype(core.VarDesc.VarType): the data type in numpy.
Returns:
core.VarDesc.VarType: the data type in Paddle.
"""
dtype = tensor_dtype
if dtype == core.VarDesc.VarType.FP32:
return paddle.float32
elif dtype == core.VarDesc.VarType.FP64:
return paddle.float64
elif dtype == core.VarDesc.VarType.FP16:
return paddle.float16
elif dtype == core.VarDesc.VarType.INT32:
return paddle.int32
elif dtype == core.VarDesc.VarType.INT16:
return paddle.int16
elif dtype == core.VarDesc.VarType.INT64:
return paddle.int64
elif dtype == core.VarDesc.VarType.BOOL:
return paddle.bool
elif dtype == core.VarDesc.VarType.BF16:
# since there is still no support for bfloat16 in NumPy,
# uint16 is used for casting bfloat16
return paddle.uint16
elif dtype == core.VarDesc.VarType.UINT8:
return paddle.uint8
elif dtype == core.VarDesc.VarType.INT8:
return paddle.int8
elif dtype == core.VarDesc.VarType.COMPLEX64:
return paddle.complex64
elif dtype == core.VarDesc.VarType.COMPLEX128:
return paddle.complex128
else:
raise ValueError("Not supported tensor dtype %s" % dtype)
if not hasattr(paddle, 'softmax'):
logger.debug("register user softmax to paddle, remove this when fixed!")
setattr(paddle, 'softmax', paddle.nn.functional.softmax)
@ -156,25 +115,6 @@ if not hasattr(paddle.Tensor, 'new_full'):
paddle.static.Variable.new_full = new_full
def eq(xs: paddle.Tensor, ys: Union[paddle.Tensor, float]) -> paddle.Tensor:
if convert_dtype_to_string(xs.dtype) == paddle.bool:
xs = xs.astype(paddle.int)
return xs.equal(ys)
if not hasattr(paddle.Tensor, 'eq'):
logger.debug(
"override eq of paddle.Tensor if exists or register, remove this when fixed!"
)
paddle.Tensor.eq = eq
paddle.static.Variable.eq = eq
if not hasattr(paddle, 'eq'):
logger.debug(
"override eq of paddle if exists or register, remove this when fixed!")
paddle.eq = eq
def contiguous(xs: paddle.Tensor) -> paddle.Tensor:
return xs

@ -318,7 +318,7 @@ class U2BaseModel(ASRInterface, nn.Layer):
dim=1) # (B*N, i+1)
# 2.6 Update end flag
end_flag = paddle.eq(hyps[:, -1], self.eos).view(-1, 1)
end_flag = paddle.equal(hyps[:, -1], self.eos).view(-1, 1)
# 3. Select best of best
scores = scores.view(batch_size, beam_size)

@ -13,8 +13,7 @@
# limitations under the License.
import paddle
from paddle import nn
from paddlespeech.s2t.modules.initializer import KaimingUniform
import math
"""
To align the initializers between paddle and torch,
the APIs below set a default initializer with a priority higher than the global initializer.
@ -82,10 +81,10 @@ class Linear(nn.Linear):
name=None):
if weight_attr is None:
if global_init_type == "kaiming_uniform":
weight_attr = paddle.ParamAttr(initializer=KaimingUniform())
weight_attr = paddle.ParamAttr(initializer=nn.initializer.KaimingUniform(fan_in=None, negative_slope=math.sqrt(5), nonlinearity='leaky_relu'))
if bias_attr is None:
if global_init_type == "kaiming_uniform":
bias_attr = paddle.ParamAttr(initializer=KaimingUniform())
bias_attr = paddle.ParamAttr(initializer=nn.initializer.KaimingUniform(fan_in=None, negative_slope=math.sqrt(5), nonlinearity='leaky_relu'))
super(Linear, self).__init__(in_features, out_features, weight_attr,
bias_attr, name)
@ -105,10 +104,10 @@ class Conv1D(nn.Conv1D):
data_format='NCL'):
if weight_attr is None:
if global_init_type == "kaiming_uniform":
weight_attr = paddle.ParamAttr(initializer=KaimingUniform())
weight_attr = paddle.ParamAttr(initializer=nn.initializer.KaimingUniform(fan_in=None, negative_slope=math.sqrt(5), nonlinearity='leaky_relu'))
if bias_attr is None:
if global_init_type == "kaiming_uniform":
bias_attr = paddle.ParamAttr(initializer=KaimingUniform())
bias_attr = paddle.ParamAttr(initializer=nn.initializer.KaimingUniform(fan_in=None, negative_slope=math.sqrt(5), nonlinearity='leaky_relu'))
super(Conv1D, self).__init__(
in_channels, out_channels, kernel_size, stride, padding, dilation,
groups, padding_mode, weight_attr, bias_attr, data_format)
@ -129,10 +128,10 @@ class Conv2D(nn.Conv2D):
data_format='NCHW'):
if weight_attr is None:
if global_init_type == "kaiming_uniform":
weight_attr = paddle.ParamAttr(initializer=KaimingUniform())
weight_attr = paddle.ParamAttr(initializer=nn.initializer.KaimingUniform(fan_in=None, negative_slope=math.sqrt(5), nonlinearity='leaky_relu'))
if bias_attr is None:
if global_init_type == "kaiming_uniform":
bias_attr = paddle.ParamAttr(initializer=KaimingUniform())
bias_attr = paddle.ParamAttr(initializer=nn.initializer.KaimingUniform(fan_in=None, negative_slope=math.sqrt(5), nonlinearity='leaky_relu'))
super(Conv2D, self).__init__(
in_channels, out_channels, kernel_size, stride, padding, dilation,
groups, padding_mode, weight_attr, bias_attr, data_format)
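The replacement relies on the stock `paddle.nn.initializer.KaimingUniform`, passing the same `negative_slope=sqrt(5)`/`leaky_relu` arguments that torch's default `kaiming_uniform_` uses for linear and conv layers. A standalone sketch:

```python
import math

import paddle
from paddle import nn

# torch-style kaiming uniform: fan_in inferred, a = sqrt(5), leaky_relu
init = nn.initializer.KaimingUniform(
    fan_in=None, negative_slope=math.sqrt(5), nonlinearity='leaky_relu')
linear = nn.Linear(2, 4, weight_attr=paddle.ParamAttr(initializer=init))
print(linear.weight.shape)  # [2, 4]
```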

@ -109,7 +109,7 @@ class MultiHeadedAttention(nn.Layer):
# 1. onnx(16/-1, -1/-1, 16/0)
# 2. jit (16/-1, -1/-1, 16/0, 16/4)
if paddle.shape(mask)[2] > 0: # time2 > 0
mask = mask.unsqueeze(1).eq(0) # (batch, 1, *, time2)
mask = mask.unsqueeze(1).equal(0) # (batch, 1, *, time2)
# for last chunk, time2 might be larger than scores.size(-1)
mask = mask[:, :, :, :paddle.shape(scores)[-1]]
scores = scores.masked_fill(mask, -float('inf'))
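The `.eq(0)` to `.equal(0)` switch replaces the monkey-patched alias (removed earlier in this PR) with the stock Paddle op. A self-contained sketch of the masking idea, using `paddle.where` in place of the repo's patched `masked_fill` helper:

```python
import paddle
import paddle.nn.functional as F

scores = paddle.randn([1, 1, 1, 4])            # (batch, head, time1, time2)
mask = paddle.to_tensor([[1, 1, 0, 0]])        # 0 marks padded frames
pad = mask.unsqueeze(1).unsqueeze(1).equal(0)  # (batch, 1, 1, time2), bool
scores = paddle.where(pad, paddle.full_like(scores, -float('inf')), scores)
attn = F.softmax(scores, axis=-1)              # padded positions get zero weight
```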
@ -321,4 +321,4 @@ class RelPositionMultiHeadedAttention(MultiHeadedAttention):
scores = (matrix_ac + matrix_bd) / math.sqrt(
self.d_k) # (batch, head, time1, time2)
return self.forward_attention(v, scores, mask), new_cache

@ -12,142 +12,6 @@
# See the License for the specific language governing permissions and
# limitations under the License.
import numpy as np
from paddle.fluid import framework
from paddle.fluid import unique_name
from paddle.fluid.core import VarDesc
from paddle.fluid.initializer import MSRAInitializer
__all__ = ['KaimingUniform']
class KaimingUniform(MSRAInitializer):
r"""Implements the Kaiming Uniform initializer
This class implements the weight initialization from the paper
`Delving Deep into Rectifiers: Surpassing Human-Level Performance on
ImageNet Classification <https://arxiv.org/abs/1502.01852>`_
by Kaiming He, Xiangyu Zhang, Shaoqing Ren and Jian Sun. This is a
robust initialization method that particularly considers the rectifier
nonlinearities.
In case of Uniform distribution, the range is [-x, x], where
.. math::
x = \sqrt{\frac{1.0}{fan\_in}}
In case of Normal distribution, the mean is 0 and the standard deviation
is
.. math::
\sqrt{\\frac{2.0}{fan\_in}}
Args:
fan_in (float32|None): fan_in for Kaiming uniform Initializer. If None, it is\
inferred from the variable. default is None.
Note:
It is recommended to set fan_in to None for most cases.
Examples:
.. code-block:: python
import paddle
import paddle.nn as nn
linear = nn.Linear(2,
4,
weight_attr=nn.initializer.KaimingUniform())
data = paddle.rand([30, 10, 2], dtype='float32')
res = linear(data)
"""
def __init__(self, fan_in=None):
super(KaimingUniform, self).__init__(
uniform=True, fan_in=fan_in, seed=0)
def __call__(self, var, block=None):
"""Initialize the input tensor with MSRA initialization.
Args:
var(Tensor): Tensor that needs to be initialized.
block(Block, optional): The block in which initialization ops
should be added. Used in static graph only, default None.
Returns:
The initialization op
"""
block = self._check_block(block)
assert isinstance(var, framework.Variable)
assert isinstance(block, framework.Block)
f_in, f_out = self._compute_fans(var)
# If fan_in is passed, use it
fan_in = f_in if self._fan_in is None else self._fan_in
if self._seed == 0:
self._seed = block.program.random_seed
# to be compatible of fp16 initalizers
if var.dtype == VarDesc.VarType.FP16 or (
var.dtype == VarDesc.VarType.BF16 and not self._uniform):
out_dtype = VarDesc.VarType.FP32
out_var = block.create_var(
name=unique_name.generate(
".".join(['masra_init', var.name, 'tmp'])),
shape=var.shape,
dtype=out_dtype,
type=VarDesc.VarType.LOD_TENSOR,
persistable=False)
else:
out_dtype = var.dtype
out_var = var
if self._uniform:
limit = np.sqrt(1.0 / float(fan_in))
op = block.append_op(
type="uniform_random",
inputs={},
outputs={"Out": out_var},
attrs={
"shape": out_var.shape,
"dtype": int(out_dtype),
"min": -limit,
"max": limit,
"seed": self._seed
},
stop_gradient=True)
else:
std = np.sqrt(2.0 / float(fan_in))
op = block.append_op(
type="gaussian_random",
outputs={"Out": out_var},
attrs={
"shape": out_var.shape,
"dtype": int(out_dtype),
"mean": 0.0,
"std": std,
"seed": self._seed
},
stop_gradient=True)
if var.dtype == VarDesc.VarType.FP16 or (
var.dtype == VarDesc.VarType.BF16 and not self._uniform):
block.append_op(
type="cast",
inputs={"X": out_var},
outputs={"Out": var},
attrs={"in_dtype": out_var.dtype,
"out_dtype": var.dtype})
if not framework.in_dygraph_mode():
var.op = op
return op
class DefaultInitializerContext(object):
"""

@ -718,6 +718,7 @@ class VectorClientExecutor(BaseExecutor):
logger.info(f"the input audio: {input}")
handler = VectorHttpHandler(server_ip=server_ip, port=port)
res = handler.run(input, audio_format, sample_rate)
logger.info(f"The spk embedding is: {res}")
return res
elif task == "score":
from paddlespeech.server.utils.audio_handler import VectorScoreHttpHandler

@ -192,12 +192,15 @@ class ACSEngine(BaseEngine):
# search for each word in self.word_list
offset = self.config.offset
# last time in time_stamp
max_ed = time_stamp[-1]['ed']
for w in self.word_list:
# search the w in asr_result and the index in asr_result
# https://docs.python.org/3/library/re.html#re.finditer
for m in re.finditer(w, asr_result):
# match start and end char index in timestamp
# https://docs.python.org/3/library/re.html#re.Match.start
start = max(time_stamp[m.start(0)]['bg'] - offset, 0)
end = min(time_stamp[m.end(0) - 1]['ed'] + offset, max_ed)
logger.debug(f'start: {start}, end: {end}')
acs_result.append({'w': w, 'bg': start, 'ed': end})
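The `max(..., 0)` / `min(..., max_ed)` clamps keep the padded window inside the utterance. A toy example of the same lookup with made-up per-character timestamps:

```python
import re

asr_result = "今天天气很好"
# one {bg, ed} entry (ms) per character, as produced upstream; values are made up
time_stamp = [{'bg': i * 100, 'ed': i * 100 + 100} for i in range(len(asr_result))]
offset = 50
max_ed = time_stamp[-1]['ed']

for m in re.finditer("天气", asr_result):
    start = max(time_stamp[m.start(0)]['bg'] - offset, 0)
    end = min(time_stamp[m.end(0) - 1]['ed'] + offset, max_ed)
    print({'w': '天气', 'bg': start, 'ed': end})  # {'w': '天气', 'bg': 150, 'ed': 450}
```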

@ -39,10 +39,10 @@ class OnlineCTCEndpoingOpt:
# rule1 times out after 5 seconds of silence, even if we decoded nothing.
rule1: OnlineCTCEndpointRule = OnlineCTCEndpointRule(False, 5000, 0)
# rule4 times out after 1.0 seconds of silence after decoding something,
# rule2 times out after 1.0 seconds of silence after decoding something,
# even if we did not reach a final-state at all.
rule2: OnlineCTCEndpointRule = OnlineCTCEndpointRule(True, 1000, 0)
# rule5 times out after the utterance is 20 seconds long, regardless of
# rule3 times out after the utterance is 20 seconds long, regardless of
# anything else.
rule3: OnlineCTCEndpointRule = OnlineCTCEndpointRule(False, 0, 20000)
@ -102,7 +102,8 @@ class OnlineCTCEndpoint:
assert self.num_frames_decoded >= self.trailing_silence_frames
assert self.frame_shift_in_ms > 0
decoding_something = (self.num_frames_decoded > self.trailing_silence_frames) and decoding_something
utterance_length = self.num_frames_decoded * self.frame_shift_in_ms
trailing_silence = self.trailing_silence_frames * self.frame_shift_in_ms
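For intuition, both quantities are just frame counts converted to milliseconds, and the rules above compare `trailing_silence` against their thresholds. A worked example with made-up numbers:

```python
# made-up values: 10 ms frame shift, 600 decoded frames, 150 trailing silence frames
frame_shift_in_ms = 10
num_frames_decoded = 600
trailing_silence_frames = 150

utterance_length = num_frames_decoded * frame_shift_in_ms       # 6000 ms
trailing_silence = trailing_silence_frames * frame_shift_in_ms  # 1500 ms

decoding_something = num_frames_decoded > trailing_silence_frames  # True
# rule2 (requires decoding, 1000 ms min trailing silence) would trigger here:
print(decoding_something and trailing_silence >= 1000)  # True
```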

@ -83,11 +83,11 @@ class CTCPrefixBeamSearch:
# cur_hyps: (prefix, (blank_ending_score, none_blank_ending_score))
# 0. blank_ending_score,
# 1. none_blank_ending_score,
# 2. viterbi_blank ending,
# 3. viterbi_non_blank,
# 2. viterbi_blank ending score,
# 3. viterbi_non_blank score,
# 4. current_token_prob,
# 5. times_viterbi_blank,
# 6. times_viterbi_non_blank
# 5. times_viterbi_blank, times_b
# 6. times_viterbi_non_blank, times_nb
if self.cur_hyps is None:
self.cur_hyps = [(tuple(), (0.0, -float('inf'), 0.0, 0.0,
-float('inf'), [], []))]
@ -106,69 +106,69 @@ class CTCPrefixBeamSearch:
for s in top_k_index:
s = s.item()
ps = logp[s].item()
for prefix, (pb, pnb, v_b_s, v_nb_s, cur_token_prob, times_s,
times_ns) in self.cur_hyps:
for prefix, (pb, pnb, v_b_s, v_nb_s, cur_token_prob, times_b,
times_nb) in self.cur_hyps:
last = prefix[-1] if len(prefix) > 0 else None
if s == blank_id: # blank
n_pb, n_pnb, n_v_s, n_v_ns, n_cur_token_prob, n_times_s, n_times_ns = next_hyps[
n_pb, n_pnb, n_v_b, n_v_nb, n_cur_token_prob, n_times_b, n_times_nb = next_hyps[
prefix]
n_pb = log_add([n_pb, pb + ps, pnb + ps])
pre_times = times_s if v_b_s > v_nb_s else times_ns
n_times_s = copy.deepcopy(pre_times)
pre_times = times_b if v_b_s > v_nb_s else times_nb
n_times_b = copy.deepcopy(pre_times)
viterbi_score = v_b_s if v_b_s > v_nb_s else v_nb_s
n_v_s = viterbi_score + ps
next_hyps[prefix] = (n_pb, n_pnb, n_v_s, n_v_ns,
n_cur_token_prob, n_times_s,
n_times_ns)
n_v_b = viterbi_score + ps
next_hyps[prefix] = (n_pb, n_pnb, n_v_b, n_v_nb,
n_cur_token_prob, n_times_b,
n_times_nb)
elif s == last:
# Update *ss -> *s;
# case1: *a + a => *a
n_pb, n_pnb, n_v_s, n_v_ns, n_cur_token_prob, n_times_s, n_times_ns = next_hyps[
n_pb, n_pnb, n_v_b, n_v_nb, n_cur_token_prob, n_times_b, n_times_nb = next_hyps[
prefix]
n_pnb = log_add([n_pnb, pnb + ps])
if n_v_ns < v_nb_s + ps:
n_v_ns = v_nb_s + ps
if n_v_nb < v_nb_s + ps:
n_v_nb = v_nb_s + ps
if n_cur_token_prob < ps:
n_cur_token_prob = ps
n_times_ns = copy.deepcopy(times_ns)
n_times_ns[
n_times_nb = copy.deepcopy(times_nb)
n_times_nb[
-1] = self.abs_time_step  # note: the absolute time step must be reused here
next_hyps[prefix] = (n_pb, n_pnb, n_v_s, n_v_ns,
n_cur_token_prob, n_times_s,
n_times_ns)
next_hyps[prefix] = (n_pb, n_pnb, n_v_b, n_v_nb,
n_cur_token_prob, n_times_b,
n_times_nb)
# Update *s-s -> *ss, - is for blank
# Case 2: *aε + a => *aa
n_prefix = prefix + (s, )
n_pb, n_pnb, n_v_s, n_v_ns, n_cur_token_prob, n_times_s, n_times_ns = next_hyps[
n_pb, n_pnb, n_v_b, n_v_nb, n_cur_token_prob, n_times_b, n_times_nb = next_hyps[
n_prefix]
if n_v_ns < v_b_s + ps:
n_v_ns = v_b_s + ps
if n_v_nb < v_b_s + ps:
n_v_nb = v_b_s + ps
n_cur_token_prob = ps
n_times_ns = copy.deepcopy(times_s)
n_times_ns.append(self.abs_time_step)
n_times_nb = copy.deepcopy(times_b)
n_times_nb.append(self.abs_time_step)
n_pnb = log_add([n_pnb, pb + ps])
next_hyps[n_prefix] = (n_pb, n_pnb, n_v_s, n_v_ns,
n_cur_token_prob, n_times_s,
n_times_ns)
next_hyps[n_prefix] = (n_pb, n_pnb, n_v_b, n_v_nb,
n_cur_token_prob, n_times_b,
n_times_nb)
else:
# Case 3: *a + b => *ab, *aε + b => *ab
n_prefix = prefix + (s, )
n_pb, n_pnb, n_v_s, n_v_ns, n_cur_token_prob, n_times_s, n_times_ns = next_hyps[
n_pb, n_pnb, n_v_b, n_v_nb, n_cur_token_prob, n_times_b, n_times_nb = next_hyps[
n_prefix]
viterbi_score = v_b_s if v_b_s > v_nb_s else v_nb_s
pre_times = times_s if v_b_s > v_nb_s else times_ns
if n_v_ns < viterbi_score + ps:
n_v_ns = viterbi_score + ps
pre_times = times_b if v_b_s > v_nb_s else times_nb
if n_v_nb < viterbi_score + ps:
n_v_nb = viterbi_score + ps
n_cur_token_prob = ps
n_times_ns = copy.deepcopy(pre_times)
n_times_ns.append(self.abs_time_step)
n_times_nb = copy.deepcopy(pre_times)
n_times_nb.append(self.abs_time_step)
n_pnb = log_add([n_pnb, pb + ps, pnb + ps])
next_hyps[n_prefix] = (n_pb, n_pnb, n_v_s, n_v_ns,
n_cur_token_prob, n_times_s,
n_times_ns)
next_hyps[n_prefix] = (n_pb, n_pnb, n_v_b, n_v_nb,
n_cur_token_prob, n_times_b,
n_times_nb)
# 2.2 Second beam prune
next_hyps = sorted(
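`log_add` above merges path probabilities in log space; the repo's helper lives in `paddlespeech.s2t.utils.utility`, and a standard numerically stable version looks like this:

```python
import math
from typing import List

def log_add(args: List[float]) -> float:
    """Stable log(sum(exp(a) for a in args))."""
    if all(a == -float('inf') for a in args):
        return -float('inf')
    a_max = max(args)
    return a_max + math.log(sum(math.exp(a - a_max) for a in args))

print(log_add([math.log(0.2), math.log(0.3)]))  # ~= log(0.5)
```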

@ -30,7 +30,9 @@ def get_sess(model_path: Optional[os.PathLike]=None, sess_conf: dict=None):
# "gpu:0"
providers = ['CPUExecutionProvider']
if "gpu" in sess_conf.get("device", ""):
providers = ['CUDAExecutionProvider']
device_id = int(sess_conf["device"].split(":")[1])
providers = [('CUDAExecutionProvider', {'device_id': device_id})]
# fastspeech2/mb_melgan can't use trt now!
if sess_conf.get("use_trt", 0):
providers = ['TensorrtExecutionProvider']
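With this change the CUDA provider is pinned to the GPU parsed from `sess_conf['device']` instead of defaulting to device 0. A minimal sketch of how such a providers list is consumed by ONNX Runtime (the model path is a placeholder):

```python
import onnxruntime as ort

sess_conf = {"device": "gpu:1", "use_trt": 0, "cpu_threads": 4}

providers = ['CPUExecutionProvider']
if "gpu" in sess_conf.get("device", ""):
    device_id = int(sess_conf["device"].split(":")[1])
    providers = [('CUDAExecutionProvider', {'device_id': device_id})]

sess_options = ort.SessionOptions()
sess_options.intra_op_num_threads = sess_conf["cpu_threads"]
sess = ort.InferenceSession("model.onnx", sess_options, providers=providers)
```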

@ -29,6 +29,7 @@ from yacs.config import CfgNode
from paddlespeech.t2s.datasets.data_table import DataTable
from paddlespeech.t2s.frontend import English
from paddlespeech.t2s.frontend.mix_frontend import MixFrontend
from paddlespeech.t2s.frontend.zh_frontend import Frontend
from paddlespeech.t2s.modules.normalizer import ZScore
from paddlespeech.utils.dynamic_import import dynamic_import
@ -98,6 +99,8 @@ def get_sentences(text_file: Optional[os.PathLike], lang: str='zh'):
sentence = "".join(items[1:])
elif lang == 'en':
sentence = " ".join(items[1:])
elif lang == 'mix':
sentence = " ".join(items[1:])
sentences.append((utt_id, sentence))
return sentences
@ -111,7 +114,8 @@ def get_test_dataset(test_metadata: List[Dict[str, Any]],
am_dataset = am[am.rindex('_') + 1:]
if am_name == 'fastspeech2':
fields = ["utt_id", "text"]
if am_dataset in {"aishell3", "vctk"} and speaker_dict is not None:
if am_dataset in {"aishell3", "vctk",
"mix"} and speaker_dict is not None:
print("multiple speaker fastspeech2!")
fields += ["spk_id"]
elif voice_cloning:
@ -140,6 +144,10 @@ def get_frontend(lang: str='zh',
phone_vocab_path=phones_dict, tone_vocab_path=tones_dict)
elif lang == 'en':
frontend = English(phone_vocab_path=phones_dict)
elif lang == 'mix':
frontend = MixFrontend(
phone_vocab_path=phones_dict, tone_vocab_path=tones_dict)
else:
print("wrong lang!")
print("frontend done!")
@ -341,8 +349,12 @@ def get_am_output(
input_ids = frontend.get_input_ids(
input, merge_sentences=merge_sentences)
phone_ids = input_ids["phone_ids"]
elif lang == 'mix':
input_ids = frontend.get_input_ids(
input, merge_sentences=merge_sentences)
phone_ids = input_ids["phone_ids"]
else:
print("lang should in {'zh', 'en'}!")
print("lang should in {'zh', 'en', 'mix'}!")
if get_tone_ids:
tone_ids = input_ids["tone_ids"]

@ -113,8 +113,12 @@ def evaluate(args):
input_ids = frontend.get_input_ids(
sentence, merge_sentences=merge_sentences)
phone_ids = input_ids["phone_ids"]
elif args.lang == 'mix':
input_ids = frontend.get_input_ids(
sentence, merge_sentences=merge_sentences)
phone_ids = input_ids["phone_ids"]
else:
print("lang should in {'zh', 'en'}!")
print("lang should in {'zh', 'en', 'mix'}!")
with paddle.no_grad():
flags = 0
for i in range(len(phone_ids)):
@ -122,7 +126,7 @@ def evaluate(args):
# acoustic model
if am_name == 'fastspeech2':
# multi speaker
if am_dataset in {"aishell3", "vctk"}:
if am_dataset in {"aishell3", "vctk", "mix"}:
spk_id = paddle.to_tensor(args.spk_id)
mel = am_inference(part_phone_ids, spk_id)
else:
@ -170,7 +174,7 @@ def parse_args():
choices=[
'speedyspeech_csmsc', 'speedyspeech_aishell3', 'fastspeech2_csmsc',
'fastspeech2_ljspeech', 'fastspeech2_aishell3', 'fastspeech2_vctk',
'tacotron2_csmsc', 'tacotron2_ljspeech'
'tacotron2_csmsc', 'tacotron2_ljspeech', 'fastspeech2_mix'
],
help='Choose acoustic model type of tts task.')
parser.add_argument(
@ -231,7 +235,7 @@ def parse_args():
'--lang',
type=str,
default='zh',
help='Choose model language. zh or en')
help='Choose model language. zh or en or mix')
parser.add_argument(
"--inference_dir",

@ -0,0 +1,179 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import re
from typing import Dict
from typing import List
import paddle
from paddlespeech.t2s.frontend import English
from paddlespeech.t2s.frontend.zh_frontend import Frontend
class MixFrontend():
def __init__(self,
g2p_model="pypinyin",
phone_vocab_path=None,
tone_vocab_path=None):
self.zh_frontend = Frontend(
phone_vocab_path=phone_vocab_path, tone_vocab_path=tone_vocab_path)
self.en_frontend = English(phone_vocab_path=phone_vocab_path)
self.SENTENCE_SPLITOR = re.compile(r'([:、,;。?!,;?!][”’]?)')
self.sp_id = self.zh_frontend.vocab_phones["sp"]
self.sp_id_tensor = paddle.to_tensor([self.sp_id])
def is_chinese(self, char):
if char >= '\u4e00' and char <= '\u9fa5':
return True
else:
return False
def is_alphabet(self, char):
if (char >= '\u0041' and char <= '\u005a') or (char >= '\u0061' and
char <= '\u007a'):
return True
else:
return False
def is_number(self, char):
if char >= '\u0030' and char <= '\u0039':
return True
else:
return False
def is_other(self, char):
if not (self.is_chinese(char) or self.is_number(char) or
self.is_alphabet(char)):
return True
else:
return False
def _split(self, text: str) -> List[str]:
text = re.sub(r'[《》【】<=>{}()#&@“”^_|…\\]', '', text)
text = self.SENTENCE_SPLITOR.sub(r'\1\n', text)
text = text.strip()
sentences = [sentence.strip() for sentence in re.split(r'\n+', text)]
return sentences
def _distinguish(self, text: str) -> List[str]:
# sentence --> [ch_part, en_part, ch_part, ...]
segments = []
types = []
flag = 0
temp_seg = ""
temp_lang = ""
# Determine the type of each character. type: blank, chinese, alphabet, number, unk.
for ch in text:
if self.is_chinese(ch):
types.append("zh")
elif self.is_alphabet(ch):
types.append("en")
elif ch == " ":
types.append("blank")
elif self.is_number(ch):
types.append("num")
else:
types.append("unk")
assert len(types) == len(text)
for i in range(len(types)):
# find the first char of the seg
if flag == 0:
if types[i] != "unk" and types[i] != "blank":
temp_seg += text[i]
temp_lang = types[i]
flag = 1
else:
if types[i] == temp_lang or types[i] == "num":
temp_seg += text[i]
elif temp_lang == "num" and types[i] != "unk":
temp_seg += text[i]
if types[i] == "zh" or types[i] == "en":
temp_lang = types[i]
elif temp_lang == "en" and types[i] == "blank":
temp_seg += text[i]
elif types[i] == "unk":
pass
else:
segments.append((temp_seg, temp_lang))
if types[i] != "unk" and types[i] != "blank":
temp_seg = text[i]
temp_lang = types[i]
flag = 1
else:
flag = 0
temp_seg = ""
temp_lang = ""
segments.append((temp_seg, temp_lang))
return segments
def get_input_ids(self,
sentence: str,
merge_sentences: bool=True,
get_tone_ids: bool=False,
add_sp: bool=True) -> Dict[str, List[paddle.Tensor]]:
sentences = self._split(sentence)
phones_list = []
result = {}
for text in sentences:
phones_seg = []
segments = self._distinguish(text)
for seg in segments:
content = seg[0]
lang = seg[1]
if lang == "zh":
input_ids = self.zh_frontend.get_input_ids(
content,
merge_sentences=True,
get_tone_ids=get_tone_ids)
elif lang == "en":
input_ids = self.en_frontend.get_input_ids(
content, merge_sentences=True)
phones_seg.append(input_ids["phone_ids"][0])
if add_sp:
phones_seg.append(self.sp_id_tensor)
phones = paddle.concat(phones_seg)
phones_list.append(phones)
if merge_sentences:
merge_list = paddle.concat(phones_list)
# rm the last 'sp' to avoid the noise at the end
# cause in the training data, no 'sp' in the end
if merge_list[-1] == self.sp_id_tensor:
merge_list = merge_list[:-1]
phones_list = []
phones_list.append(merge_list)
result["phone_ids"] = phones_list
return result
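A usage sketch for the new frontend (the vocab path is a placeholder; it normally comes from the downloaded fastspeech2_mix resources):

```python
from paddlespeech.t2s.frontend.mix_frontend import MixFrontend

frontend = MixFrontend(phone_vocab_path="phone_id_map.txt")
result = frontend.get_input_ids("我喜欢 deep learning 呀", merge_sentences=True)
print(result["phone_ids"])  # a list holding one concatenated phone-id tensor
```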

@ -50,6 +50,7 @@ base = [
"paddlespeech_feat",
"Pillow>=9.0.0"
"praatio==5.0.0",
"protobuf>=3.1.0, <=3.20.0",
"pypinyin",
"pypinyin-dict",
"python-dateutil",
