diff --git a/README.md b/README.md
index 330da1a9..e93aa1d9 100644
--- a/README.md
+++ b/README.md
@@ -25,7 +25,7 @@
| Documents
| Models List
| AIStudio Courses
- | NAACL2022 Paper
+ | NAACL2022 Best Demo Award Paper
| Gitee
@@ -34,7 +34,7 @@
**PaddleSpeech** is an open-source toolkit on the [PaddlePaddle](https://github.com/PaddlePaddle/Paddle) platform for a variety of critical tasks in speech and audio, with state-of-the-art and influential models.
-**PaddleSpeech** won the [NAACL2022 Best Demo Award](https://2022.naacl.org/blog/best-demo-award/).
+**PaddleSpeech** won the [NAACL2022 Best Demo Award](https://2022.naacl.org/blog/best-demo-award/); please check out our paper on [Arxiv](https://arxiv.org/abs/2205.12007).
##### Speech Recognition
@@ -179,7 +179,7 @@ Via the easy-to-use, efficient, flexible and scalable implementation, our vision
## Installation
-We strongly recommend our users to install PaddleSpeech in **Linux** with *python>=3.7*.
+We strongly recommend that users install PaddleSpeech in **Linux** with *python>=3.7* and *paddlepaddle>=2.3.1*.
Up to now, **Linux** supports CLI for the all our tasks, **Mac OSX** and **Windows** only supports PaddleSpeech CLI for Audio Classification, Speech-to-Text and Text-to-Speech. To install `PaddleSpeech`, please see [installation](./docs/source/install.md).
diff --git a/README_cn.md b/README_cn.md
index 8df38602..896c575c 100644
--- a/README_cn.md
+++ b/README_cn.md
@@ -20,7 +20,8 @@
- 快速开始
+ 安装
+ | 快速开始
| 快速使用服务
| 快速使用流式服务
| 教程文档
@@ -36,8 +37,10 @@
**PaddleSpeech** 是基于飞桨 [PaddlePaddle](https://github.com/PaddlePaddle/Paddle) 的语音方向的开源模型库,用于语音和音频中的各种关键任务的开发,包含大量基于深度学习前沿和有影响力的模型,一些典型的应用示例如下:
-**PaddleSpeech** 荣获 [NAACL2022 Best Demo Award](https://2022.naacl.org/blog/best-demo-award/).
+**PaddleSpeech** 荣获 [NAACL2022 Best Demo Award](https://2022.naacl.org/blog/best-demo-award/),欢迎阅读我们的 [Arxiv](https://arxiv.org/abs/2205.12007) 论文。
+### 效果展示
+
##### 语音识别
@@ -154,7 +157,7 @@
本项目采用了易用、高效、灵活以及可扩展的实现,旨在为工业应用、学术研究提供更好的支持,实现的功能包含训练、推断以及测试模块,以及部署过程,主要包括
- 📦 **易用性**: 安装门槛低,可使用 [CLI](#quick-start) 快速开始。
- 🏆 **对标 SoTA**: 提供了高速、轻量级模型,且借鉴了最前沿的技术。
-- 🏆 **流式ASR和TTS系统**:工业级的端到端流式识别、流式合成系统。
+- 🏆 **流式 ASR 和 TTS 系统**:工业级的端到端流式识别、流式合成系统。
- 💯 **基于规则的中文前端**: 我们的前端包含文本正则化和字音转换(G2P)。此外,我们使用自定义语言规则来适应中文语境。
- **多种工业界以及学术界主流功能支持**:
- 🛎️ 典型音频任务: 本工具包提供了音频任务如音频分类、语音翻译、自动语音识别、文本转语音、语音合成、声纹识别、KWS等任务的实现。
@@ -182,61 +185,195 @@
+
## 安装
我们强烈建议用户在 **Linux** 环境下,*3.7* 以上版本的 *python* 上安装 PaddleSpeech。
-目前为止,**Linux** 支持声音分类、语音识别、语音合成和语音翻译四种功能,**Mac OSX、 Windows** 下暂不支持语音翻译功能。 想了解具体安装细节,可以参考[安装文档](./docs/source/install_cn.md)。
+
+### 相关依赖
++ gcc >= 4.8.5
++ paddlepaddle >= 2.3.1
++ python >= 3.7
++ linux(推荐), mac, windows
+
+PaddleSpeech 依赖于 paddlepaddle,安装可以参考 [paddlepaddle 官网](https://www.paddlepaddle.org.cn/),根据自己机器的情况进行选择。这里给出 CPU 版本示例,其它版本可以根据自己机器的情况进行安装。
+
+```shell
+pip install paddlepaddle -i https://mirror.baidu.com/pypi/simple
+```
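+
+如果机器上有 GPU,也可以安装对应的 GPU 版本,下面给出一个仅供参考的示例(假设 CUDA 10.2 环境,具体请以 paddlepaddle 官网的安装指引为准):
+
+```shell
+# GPU 版本安装示例(CUDA 10.2,仅供参考)
+python3 -m pip install paddlepaddle-gpu==2.3.1 -i https://mirror.baidu.com/pypi/simple
+```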
+
+PaddleSpeech 快速安装方式有两种:一种是 pip 安装,一种是源码编译(推荐)。
+
+### pip 安装
+```shell
+pip install pytest-runner
+pip install paddlespeech
+```
+
+### 源码编译
+```shell
+git clone https://github.com/PaddlePaddle/PaddleSpeech.git
+cd PaddleSpeech
+pip install pytest-runner
+pip install .
+```
+
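+安装完成后,可以运行下面的命令做一个简单的自检(仅作示例,非必需步骤):
+
+```shell
+# 查看命令行帮助,确认 paddlespeech 命令可用
+paddlespeech --help
+```
+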
+更多关于安装问题,如 conda 环境、librosa 依赖的系统库、gcc 环境问题、kaldi 安装等,可以参考这篇[安装文档](docs/source/install_cn.md)。如果安装时遇到问题,可以在 [#2150](https://github.com/PaddlePaddle/PaddleSpeech/issues/2150) 上留言以及查找相关问题。
## 快速开始
-安装完成后,开发者可以通过命令行快速开始,改变 `--input` 可以尝试用自己的音频或文本测试。
+安装完成后,开发者可以通过命令行或者 Python 快速开始:命令行模式下改变 `--input` 可以尝试用自己的音频或文本测试,支持 16k 采样率的 wav 格式音频。
+
+你也可以在 `aistudio` 中快速体验 👉🏻 [PaddleSpeech API Demo](https://aistudio.baidu.com/aistudio/projectdetail/4281335?shared=1)。
-**声音分类**
+测试音频示例下载
```shell
-paddlespeech cls --input input.wav
+wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav
+wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/en.wav
```
-**声纹识别**
+
+### 语音识别
+ (点击可展开)开源中文语音识别
+
+命令行一键体验
+
```shell
-paddlespeech vector --task spk --input input_16k.wav
+paddlespeech asr --lang zh --input zh.wav
+```
+
+Python API 一键预测
+
+```python
+>>> from paddlespeech.cli.asr.infer import ASRExecutor
+>>> asr = ASRExecutor()
+>>> result = asr(audio_file="zh.wav")
+>>> print(result)
+我认为跑步最重要的就是给我带来了身体健康
```
-**语音识别**
+
+
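+除了中文模型,也可以指定英文模型识别英文音频(示例来自 demos/speech_recognition/run.sh,假设已下载上面的 en.wav):
+
+```shell
+paddlespeech asr --lang en --model transformer_librispeech --input ./en.wav
+```
+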
+### 语音合成
+
+ 开源中文语音合成
+
+输出 24k 采样率 wav 格式音频
+
+
+命令行一键体验
+
```shell
-paddlespeech asr --lang zh --input input_16k.wav
+paddlespeech tts --input "你好,欢迎使用百度飞桨深度学习框架!" --output output.wav
+```
+
+Python API 一键预测
+
+```python
+>>> from paddlespeech.cli.tts.infer import TTSExecutor
+>>> tts = TTSExecutor()
+>>> tts(text="今天天气十分不错。", output="output.wav")
```
-**语音翻译** (English to Chinese)
+- 语音合成的 web demo 已经集成进了 [Huggingface Spaces](https://huggingface.co/spaces),请参考:[TTS Demo](https://huggingface.co/spaces/KPatrick/PaddleSpeechTTS)。
+
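+命令行模式还支持批处理,例如一次合成多条文本(示例来自 demos/text_to_speech/run.sh):
+
+```shell
+echo -e "1 欢迎光临。\n2 谢谢惠顾。" | paddlespeech tts
+```
+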
+
+
+### 声音分类
+
+ 适配多场景的开放领域声音分类工具
+
+基于 AudioSet 数据集 527 个类别的声音分类模型
+
+命令行一键体验
+
```shell
-paddlespeech st --input input_16k.wav
+paddlespeech cls --input zh.wav
```
-**语音合成**
+
+Python API 一键预测
+
+```python
+>>> from paddlespeech.cli.cls.infer import CLSExecutor
+>>> cls = CLSExecutor()
+>>> result = cls(audio_file="zh.wav")
+>>> print(result)
+Speech 0.9027186632156372
+```
+
+
+
+### 声纹提取
+
+ 工业级声纹提取工具
+
+命令行一键体验
+
```shell
-paddlespeech tts --input "你好,欢迎使用百度飞桨深度学习框架!" --output output.wav
+paddlespeech vector --task spk --input zh.wav
```
-- 语音合成的 web demo 已经集成进了 [Huggingface Spaces](https://huggingface.co/spaces). 请参考: [TTS Demo](https://huggingface.co/spaces/akhaliq/paddlespeech)
-**文本后处理**
- - 标点恢复
- ```bash
- paddlespeech text --task punc --input 今天的天气真不错啊你下午有空吗我想约你一起去吃饭
- ```
+Python API 一键预测
-**批处理**
+```python
+>>> from paddlespeech.cli.vector import VectorExecutor
+>>> vec = VectorExecutor()
+>>> result = vec(audio_file="zh.wav")
+>>> print(result) # 187维向量
+[ -0.19083306 9.474295 -14.122263 -2.0916545 0.04848729
+ 4.9295826 1.4780062 0.3733844 10.695862 3.2697146
+ -4.48199 -0.6617882 -9.170393 -11.1568775 -1.2358263 ...]
```
-echo -e "1 欢迎光临。\n2 谢谢惠顾。" | paddlespeech tts
+
+
+
+### 标点恢复
+
+ 一键恢复文本标点,可与 ASR 模型配合使用
+
+命令行一键体验
+
+```shell
+paddlespeech text --task punc --input 今天的天气真不错啊你下午有空吗我想约你一起去吃饭
+```
+
+Python API 一键预测
+
+```python
+>>> from paddlespeech.cli.text.infer import TextExecutor
+>>> text_punc = TextExecutor()
+>>> result = text_punc(text="今天的天气真不错啊你下午有空吗我想约你一起去吃饭")
+今天的天气真不错啊!你下午有空吗?我想约你一起去吃饭。
```
-**Shell管道**
-ASR + Punc:
+
+
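+标点恢复可以和语音识别通过管道配合使用,下面是一个简单示例(来自 demos/speech_recognition/run.sh):
+
+```shell
+paddlespeech asr --input ./zh.wav | paddlespeech text --task punc
+```
+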
+### 语音翻译
+
+ 端到端英译中语音翻译工具
+
+使用预编译的 kaldi 相关工具,只支持在 Ubuntu 系统中体验
+
+命令行一键体验
+
+```shell
+paddlespeech st --input en.wav
```
-paddlespeech asr --input ./zh.wav | paddlespeech text --task punc
+
+Python API 一键预测
+
+```python
+>>> from paddlespeech.cli.st.infer import STExecutor
+>>> st = STExecutor()
+>>> result = st(audio_file="en.wav")
+['我 在 这栋 建筑 的 古老 门上 敲门 。']
```
-更多命令行命令请参考 [demos](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/demos)
-> Note: 如果需要训练或者微调,请查看[语音识别](./docs/source/asr/quick_start.md), [语音合成](./docs/source/tts/quick_start.md)。
+
+
+
## 快速使用服务
-安装完成后,开发者可以通过命令行快速使用服务。
+安装完成后,开发者可以通过命令行一键启动语音识别、语音合成、音频分类三种服务。
**启动服务**
```shell
@@ -614,6 +751,7 @@ PaddleSpeech 的 **语音合成** 主要包含三个模块:文本前端、声
语音合成模块最初被称为 [Parakeet](https://github.com/PaddlePaddle/Parakeet),现在与此仓库合并。如果您对该任务的学术研究感兴趣,请参阅 [TTS 研究概述](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/docs/source/tts#overview)。此外,[模型介绍](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/tts/models_introduction.md) 是了解语音合成流程的一个很好的指南。
+
## ⭐ 应用案例
- **[PaddleBoBo](https://github.com/JiehangXie/PaddleBoBo): 使用 PaddleSpeech 的语音合成模块生成虚拟人的声音。**
diff --git a/demos/README.md b/demos/README.md
index 2a306df6..72b70b23 100644
--- a/demos/README.md
+++ b/demos/README.md
@@ -12,6 +12,7 @@ This directory contains many speech applications in multiple scenarios.
* speech recognition - recognize text of an audio file
* speech server - Server for Speech Task, e.g. ASR,TTS,CLS
* streaming asr server - receive audio stream from websocket, and recognize to transcript.
+* streaming tts server - receive text from http or websocket, and stream back the synthesized audio data.
* speech translation - end to end speech translation
* story talker - book reader based on OCR and TTS
* style_fs2 - multi style control for FastSpeech2 model
diff --git a/demos/README_cn.md b/demos/README_cn.md
index 47134212..04fc1fa7 100644
--- a/demos/README_cn.md
+++ b/demos/README_cn.md
@@ -10,8 +10,9 @@
* 元宇宙 - 基于语音合成的 2D 增强现实。
* 标点恢复 - 通常作为语音识别的文本后处理任务,为一段无标点的纯文本添加相应的标点符号。
* 语音识别 - 识别一段音频中包含的语音文字。
-* 语音服务 - 离线语音服务,包括ASR、TTS、CLS等
-* 流式语音识别服务 - 流式输入语音数据流识别音频中的文字
+* 语音服务 - 离线语音服务,包括ASR、TTS、CLS等。
+* 流式语音识别服务 - 流式输入语音数据流识别音频中的文字。
+* 流式语音合成服务 - 根据待合成文本流式生成合成音频数据流。
* 语音翻译 - 实时识别音频中的语言,并同时翻译成目标语言。
* 会说话的故事书 - 基于 OCR 和语音合成的会说话的故事书。
* 个性化语音合成 - 基于 FastSpeech2 模型的个性化语音合成。
diff --git a/demos/custom_streaming_asr/setup_docker.sh b/demos/custom_streaming_asr/setup_docker.sh
old mode 100644
new mode 100755
diff --git a/demos/keyword_spotting/run.sh b/demos/keyword_spotting/run.sh
old mode 100644
new mode 100755
diff --git a/demos/speaker_verification/run.sh b/demos/speaker_verification/run.sh
old mode 100644
new mode 100755
diff --git a/demos/speech_recognition/run.sh b/demos/speech_recognition/run.sh
old mode 100644
new mode 100755
index 19ce0ebb..e48ff3e9
--- a/demos/speech_recognition/run.sh
+++ b/demos/speech_recognition/run.sh
@@ -1,6 +1,7 @@
#!/bin/bash
-wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespeech.bj.bcebos.com/PaddleAudio/en.wav
+wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav
+wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/en.wav
# asr
paddlespeech asr --input ./zh.wav
@@ -8,3 +9,18 @@ paddlespeech asr --input ./zh.wav
# asr + punc
paddlespeech asr --input ./zh.wav | paddlespeech text --task punc
+
+
+# asr help
+paddlespeech asr --help
+
+
+# english asr
+paddlespeech asr --lang en --model transformer_librispeech --input ./en.wav
+
+# model stats
+paddlespeech stats --task asr
+
+
+# paddlespeech help
+paddlespeech --help
diff --git a/demos/speech_server/README.md b/demos/speech_server/README.md
index dbbf9765..65a12940 100644
--- a/demos/speech_server/README.md
+++ b/demos/speech_server/README.md
@@ -14,7 +14,10 @@ For service interface definition, please check:
see [installation](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install.md).
It is recommended to use **paddlepaddle 2.3.1** or above.
-You can choose one way from meduim and hard to install paddlespeech.
+
+You can choose one way from easy, medium and hard to install paddlespeech.
+
+**If you install in easy mode, you need to prepare the yaml file by yourself; you can refer to the yaml file in the conf directory.**
### 2. Prepare config File
The configuration file can be found in `conf/application.yaml` .
diff --git a/demos/speech_server/README_cn.md b/demos/speech_server/README_cn.md
index 9ed9175d..d21a53b0 100644
--- a/demos/speech_server/README_cn.md
+++ b/demos/speech_server/README_cn.md
@@ -3,8 +3,10 @@
# 语音服务
## 介绍
+
这个 demo 是一个启动离线语音服务和访问服务的实现。它可以通过使用 `paddlespeech_server` 和 `paddlespeech_client` 的单个命令或 python 的几行代码来实现。
+
服务接口定义请参考:
- [PaddleSpeech Server RESTful API](https://github.com/PaddlePaddle/PaddleSpeech/wiki/PaddleSpeech-Server-RESTful-API)
@@ -13,12 +15,17 @@
请看 [安装文档](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install.md).
推荐使用 **paddlepaddle 2.3.1** 或以上版本。
-你可以从 medium,hard 两种方式中选择一种方式安装 PaddleSpeech。
+
+你可以从简单、中等、困难三种方式中选择一种安装 PaddleSpeech。
+
+**如果使用简单模式安装,需要自行准备 yaml 文件,可参考 conf 目录下的 yaml 文件。**
### 2. 准备配置文件
配置文件可参见 `conf/application.yaml` 。
-其中,`engine_list`表示即将启动的服务将会包含的语音引擎,格式为 <语音任务>_<引擎类型>。
+其中,`engine_list` 表示即将启动的服务将会包含的语音引擎,格式为 <语音任务>_<引擎类型>。
+
目前服务集成的语音任务有: asr (语音识别)、tts (语音合成)、cls (音频分类)、vector (声纹识别)以及 text (文本处理)。
+
目前引擎类型支持两种形式:python 及 inference (Paddle Inference)
**注意:** 如果在容器里可正常启动服务,但客户端访问 ip 不可达,可尝试将配置文件中 `host` 地址换成本地 ip 地址。
diff --git a/demos/speech_server/asr_client.sh b/demos/speech_server/asr_client.sh
old mode 100644
new mode 100755
diff --git a/demos/speech_server/cls_client.sh b/demos/speech_server/cls_client.sh
old mode 100644
new mode 100755
diff --git a/demos/speech_server/server.sh b/demos/speech_server/server.sh
old mode 100644
new mode 100755
index e5961286..fd719ffc
--- a/demos/speech_server/server.sh
+++ b/demos/speech_server/server.sh
@@ -1,3 +1,3 @@
#!/bin/bash
-paddlespeech_server start --config_file ./conf/application.yaml
+paddlespeech_server start --config_file ./conf/application.yaml &> server.log &
diff --git a/demos/speech_server/sid_client.sh b/demos/speech_server/sid_client.sh
new file mode 100755
index 00000000..99bab21a
--- /dev/null
+++ b/demos/speech_server/sid_client.sh
@@ -0,0 +1,10 @@
+#!/bin/bash
+
+wget -c https://paddlespeech.bj.bcebos.com/vector/audio/85236145389.wav
+wget -c https://paddlespeech.bj.bcebos.com/vector/audio/123456789.wav
+
+# sid extract
+paddlespeech_client vector --server_ip 127.0.0.1 --port 8090 --task spk --input ./85236145389.wav
+
+# sid score
+paddlespeech_client vector --server_ip 127.0.0.1 --port 8090 --task score --enroll ./85236145389.wav --test ./123456789.wav
diff --git a/demos/speech_server/text_client.sh b/demos/speech_server/text_client.sh
new file mode 100755
index 00000000..098f159f
--- /dev/null
+++ b/demos/speech_server/text_client.sh
@@ -0,0 +1,4 @@
+#!/bin/bash
+
+
+paddlespeech_client text --server_ip 127.0.0.1 --port 8090 --input 今天的天气真好啊你下午有空吗我想约你一起去吃饭
diff --git a/demos/speech_server/tts_client.sh b/demos/speech_server/tts_client.sh
old mode 100644
new mode 100755
diff --git a/demos/speech_web/README.md b/demos/speech_web/README.md
index ded78a6e..3b2da6e9 100644
--- a/demos/speech_web/README.md
+++ b/demos/speech_web/README.md
@@ -1,6 +1,6 @@
# Paddle Speech Demo
-PaddleSpeechDemo 是一个以 PaddleSpeech 的语音交互功能为主体开发的Demo展示项目,用于帮助大家更好的上手PaddleSpeech 以及使用 PaddleSpeech 构建自己的应用。
+PaddleSpeechDemo 是一个以 PaddleSpeech 的语音交互功能为主体开发的 Demo 展示项目,用于帮助大家更好的上手 PaddleSpeech 以及使用 PaddleSpeech 构建自己的应用。
智能语音交互部分使用 PaddleSpeech,对话以及信息抽取部分使用 PaddleNLP,网页前端展示部分基于 Vue3 进行开发
diff --git a/demos/speech_web/web_client/package-lock.json b/demos/speech_web/web_client/package-lock.json
index f1c77978..509be385 100644
--- a/demos/speech_web/web_client/package-lock.json
+++ b/demos/speech_web/web_client/package-lock.json
@@ -747,9 +747,9 @@
}
},
"node_modules/moment": {
- "version": "2.29.3",
- "resolved": "https://registry.npmmirror.com/moment/-/moment-2.29.3.tgz",
- "integrity": "sha512-c6YRvhEo//6T2Jz/vVtYzqBzwvPT95JBQ+smCytzf7c50oMZRsR/a4w88aD34I+/QVSfnoAnSBFPJHItlOMJVw==",
+ "version": "2.29.4",
+ "resolved": "https://registry.npmjs.org/moment/-/moment-2.29.4.tgz",
+ "integrity": "sha512-5LC9SOxjSc2HF6vO2CyuTDNivEdoz2IvyJJGj6X8DJ0eFyfszE0QiEd+iXmBvUP3WHxSjFH/vIsA0EN00cgr8w==",
"engines": {
"node": "*"
}
@@ -1636,9 +1636,9 @@
"optional": true
},
"moment": {
- "version": "2.29.3",
- "resolved": "https://registry.npmmirror.com/moment/-/moment-2.29.3.tgz",
- "integrity": "sha512-c6YRvhEo//6T2Jz/vVtYzqBzwvPT95JBQ+smCytzf7c50oMZRsR/a4w88aD34I+/QVSfnoAnSBFPJHItlOMJVw=="
+ "version": "2.29.4",
+ "resolved": "https://registry.npmjs.org/moment/-/moment-2.29.4.tgz",
+ "integrity": "sha512-5LC9SOxjSc2HF6vO2CyuTDNivEdoz2IvyJJGj6X8DJ0eFyfszE0QiEd+iXmBvUP3WHxSjFH/vIsA0EN00cgr8w=="
},
"nanoid": {
"version": "3.3.2",
diff --git a/demos/speech_web/web_client/yarn.lock b/demos/speech_web/web_client/yarn.lock
index 4504eab3..6777cf4c 100644
--- a/demos/speech_web/web_client/yarn.lock
+++ b/demos/speech_web/web_client/yarn.lock
@@ -587,9 +587,9 @@ mime@^1.4.1:
integrity sha512-x0Vn8spI+wuJ1O6S7gnbaQg8Pxh4NNHb7KSINmEWKiPE4RKOplvijn+NkmYmmRgP68mc70j2EbeTFRsrswaQeg==
moment@^2.27.0:
- version "2.29.3"
- resolved "https://registry.npmmirror.com/moment/-/moment-2.29.3.tgz"
- integrity sha512-c6YRvhEo//6T2Jz/vVtYzqBzwvPT95JBQ+smCytzf7c50oMZRsR/a4w88aD34I+/QVSfnoAnSBFPJHItlOMJVw==
+ version "2.29.4"
+ resolved "https://registry.yarnpkg.com/moment/-/moment-2.29.4.tgz#3dbe052889fe7c1b2ed966fcb3a77328964ef108"
+ integrity sha512-5LC9SOxjSc2HF6vO2CyuTDNivEdoz2IvyJJGj6X8DJ0eFyfszE0QiEd+iXmBvUP3WHxSjFH/vIsA0EN00cgr8w==
ms@^2.1.1:
version "2.1.3"
diff --git a/demos/streaming_asr_server/README.md b/demos/streaming_asr_server/README.md
index 3ada1b8d..ae66cae4 100644
--- a/demos/streaming_asr_server/README.md
+++ b/demos/streaming_asr_server/README.md
@@ -15,7 +15,10 @@ Streaming ASR server only support `websocket` protocol, and doesn't support `htt
see [installation](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install.md).
It is recommended to use **paddlepaddle 2.3.1** or above.
-You can choose one way from meduim and hard to install paddlespeech.
+
+You can choose one way from easy, medium and hard to install paddlespeech.
+
+**If you install in easy mode, you need to prepare the yaml file by yourself; you can refer to the yaml file in the conf directory.**
### 2. Prepare config File
The configuration file can be found in `conf/ws_application.yaml` 和 `conf/ws_conformer_wenetspeech_application.yaml`.
diff --git a/demos/streaming_asr_server/README_cn.md b/demos/streaming_asr_server/README_cn.md
index e4a7ef64..55acc07c 100644
--- a/demos/streaming_asr_server/README_cn.md
+++ b/demos/streaming_asr_server/README_cn.md
@@ -3,12 +3,11 @@
# 流式语音识别服务
## 介绍
-这个demo是一个启动流式语音服务和访问服务的实现。 它可以通过使用 `paddlespeech_server` 和 `paddlespeech_client`的单个命令或 python 的几行代码来实现。
+这个 demo 是一个启动流式语音服务和访问服务的实现。 它可以通过使用 `paddlespeech_server` 和 `paddlespeech_client` 的单个命令或 python 的几行代码来实现。
**流式语音识别服务只支持 `websocket` 协议,不支持 `http` 协议。**
-
-For service interface definition, please check:
+服务接口定义请参考:
- [PaddleSpeech Streaming Server WebSocket API](https://github.com/PaddlePaddle/PaddleSpeech/wiki/PaddleSpeech-Server-WebSocket-API)
## 使用方法
@@ -16,7 +15,10 @@ For service interface definition, please check:
安装 PaddleSpeech 的详细过程请看 [安装文档](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install.md)。
推荐使用 **paddlepaddle 2.3.1** 或以上版本。
-你可以从medium,hard 两种方式中选择一种方式安装 PaddleSpeech。
+
+你可以从简单、中等、困难三种方式中选择一种安装 PaddleSpeech。
+
+**如果使用简单模式安装,需要自行准备 yaml 文件,可参考 conf 目录下的 yaml 文件。**
### 2. 准备配置文件
diff --git a/demos/streaming_asr_server/punc_server.py b/demos/streaming_asr_server/local/punc_server.py
similarity index 100%
rename from demos/streaming_asr_server/punc_server.py
rename to demos/streaming_asr_server/local/punc_server.py
diff --git a/demos/streaming_asr_server/local/rtf_from_log.py b/demos/streaming_asr_server/local/rtf_from_log.py
index 4f30d640..4b89b48f 100755
--- a/demos/streaming_asr_server/local/rtf_from_log.py
+++ b/demos/streaming_asr_server/local/rtf_from_log.py
@@ -38,4 +38,4 @@ if __name__ == '__main__':
T += m['T']
P += m['P']
- print(f"RTF: {P/T}")
+ print(f"RTF: {P/T}, utts: {n}")
diff --git a/demos/streaming_asr_server/streaming_asr_server.py b/demos/streaming_asr_server/local/streaming_asr_server.py
similarity index 100%
rename from demos/streaming_asr_server/streaming_asr_server.py
rename to demos/streaming_asr_server/local/streaming_asr_server.py
diff --git a/demos/streaming_asr_server/run.sh b/demos/streaming_asr_server/run.sh
old mode 100644
new mode 100755
diff --git a/demos/streaming_asr_server/server.sh b/demos/streaming_asr_server/server.sh
index f532546e..961cb046 100755
--- a/demos/streaming_asr_server/server.sh
+++ b/demos/streaming_asr_server/server.sh
@@ -1,9 +1,8 @@
-export CUDA_VISIBLE_DEVICE=0,1,2,3
- export CUDA_VISIBLE_DEVICE=0,1,2,3
+#export CUDA_VISIBLE_DEVICE=0,1,2,3
-# nohup python3 punc_server.py --config_file conf/punc_application.yaml > punc.log 2>&1 &
+# nohup python3 local/punc_server.py --config_file conf/punc_application.yaml > punc.log 2>&1 &
paddlespeech_server start --config_file conf/punc_application.yaml &> punc.log &
-# nohup python3 streaming_asr_server.py --config_file conf/ws_conformer_wenetspeech_application.yaml > streaming_asr.log 2>&1 &
+# nohup python3 local/streaming_asr_server.py --config_file conf/ws_conformer_wenetspeech_application.yaml > streaming_asr.log 2>&1 &
paddlespeech_server start --config_file conf/ws_conformer_wenetspeech_application.yaml &> streaming_asr.log &
diff --git a/demos/streaming_asr_server/test.sh b/demos/streaming_asr_server/test.sh
index 67a5ec4c..386c7f89 100755
--- a/demos/streaming_asr_server/test.sh
+++ b/demos/streaming_asr_server/test.sh
@@ -7,5 +7,5 @@ paddlespeech_client asr_online --server_ip 127.0.0.1 --port 8090 --input ./zh.wa
# read the wav and call streaming and punc service
# If `127.0.0.1` is not accessible, you need to use the actual service IP address.
-paddlespeech_client asr_online --server_ip 127.0.0.1 --port 8290 --punc.server_ip 127.0.0.1 --punc.port 8190 --input ./zh.wav
+paddlespeech_client asr_online --server_ip 127.0.0.1 --port 8090 --punc.server_ip 127.0.0.1 --punc.port 8190 --input ./zh.wav
diff --git a/demos/streaming_tts_server/README.md b/demos/streaming_tts_server/README.md
index f708fd31..53a33f3c 100644
--- a/demos/streaming_tts_server/README.md
+++ b/demos/streaming_tts_server/README.md
@@ -15,7 +15,10 @@ For service interface definition, please check:
see [installation](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install.md).
It is recommended to use **paddlepaddle 2.3.1** or above.
-You can choose one way from meduim and hard to install paddlespeech.
+
+You can choose one way from easy, medium and hard to install paddlespeech.
+
+**If you install in easy mode, you need to prepare the yaml file by yourself; you can refer to the yaml file in the conf directory.**
### 2. Prepare config File
The configuration file can be found in `conf/tts_online_application.yaml`.
diff --git a/demos/streaming_tts_server/README_cn.md b/demos/streaming_tts_server/README_cn.md
index fa041323..560791a9 100644
--- a/demos/streaming_tts_server/README_cn.md
+++ b/demos/streaming_tts_server/README_cn.md
@@ -13,7 +13,11 @@
请看 [安装文档](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install.md).
推荐使用 **paddlepaddle 2.3.1** 或以上版本。
-你可以从 medium,hard 两种方式中选择一种方式安装 PaddleSpeech。
+
+你可以从简单、中等、困难三种方式中选择一种安装 PaddleSpeech。
+
+**如果使用简单模式安装,需要自行准备 yaml 文件,可参考 conf 目录下的 yaml 文件。**
+
### 2. 准备配置文件
配置文件可参见 `conf/tts_online_application.yaml` 。
diff --git a/demos/streaming_tts_server/test_client.sh b/demos/streaming_tts_server/client.sh
old mode 100644
new mode 100755
similarity index 61%
rename from demos/streaming_tts_server/test_client.sh
rename to demos/streaming_tts_server/client.sh
index bd88f20b..e93da58a
--- a/demos/streaming_tts_server/test_client.sh
+++ b/demos/streaming_tts_server/client.sh
@@ -2,8 +2,8 @@
# http client test
# If `127.0.0.1` is not accessible, you need to use the actual service IP address.
-paddlespeech_client tts_online --server_ip 127.0.0.1 --port 8092 --protocol http --input "您好,欢迎使用百度飞桨语音合成服务。" --output output.wav
+paddlespeech_client tts_online --server_ip 127.0.0.1 --port 8092 --protocol http --input "您好,欢迎使用百度飞桨语音合成服务。" --output output.http.wav
# websocket client test
# If `127.0.0.1` is not accessible, you need to use the actual service IP address.
-# paddlespeech_client tts_online --server_ip 127.0.0.1 --port 8092 --protocol websocket --input "您好,欢迎使用百度飞桨语音合成服务。" --output output.wav
+paddlespeech_client tts_online --server_ip 127.0.0.1 --port 8192 --protocol websocket --input "您好,欢迎使用百度飞桨语音合成服务。" --output output.ws.wav
diff --git a/demos/streaming_tts_server/conf/tts_online_ws_application.yaml b/demos/streaming_tts_server/conf/tts_online_ws_application.yaml
new file mode 100644
index 00000000..146f06f1
--- /dev/null
+++ b/demos/streaming_tts_server/conf/tts_online_ws_application.yaml
@@ -0,0 +1,103 @@
+# This is the parameter configuration file for streaming tts server.
+
+#################################################################################
+# SERVER SETTING #
+#################################################################################
+host: 0.0.0.0
+port: 8192
+
+# The task format in the engine_list is: <speech task>_<engine type>
+# engine_list choices = ['tts_online', 'tts_online-onnx']; the inference speed of tts_online-onnx is faster than tts_online.
+# protocol choices = ['websocket', 'http']
+protocol: 'websocket'
+engine_list: ['tts_online-onnx']
+
+
+#################################################################################
+# ENGINE CONFIG #
+#################################################################################
+
+################################### TTS #########################################
+################### speech task: tts; engine_type: online #######################
+tts_online:
+ # am (acoustic model) choices=['fastspeech2_csmsc', 'fastspeech2_cnndecoder_csmsc']
+    # fastspeech2_cnndecoder_csmsc supports streaming am inference.
+ am: 'fastspeech2_csmsc'
+ am_config:
+ am_ckpt:
+ am_stat:
+ phones_dict:
+ tones_dict:
+ speaker_dict:
+ spk_id: 0
+
+ # voc (vocoder) choices=['mb_melgan_csmsc, hifigan_csmsc']
+ # Both mb_melgan_csmsc and hifigan_csmsc support streaming voc inference
+ voc: 'mb_melgan_csmsc'
+ voc_config:
+ voc_ckpt:
+ voc_stat:
+
+ # others
+ lang: 'zh'
+ device: 'cpu' # set 'gpu:id' or 'cpu'
+    # am_block and am_pad are only used by the fastspeech2_cnndecoder_onnx model for streaming am inference;
+    # when am_pad is set to 12, the streaming synthetic audio is the same as the non-streaming synthetic audio
+ am_block: 72
+ am_pad: 12
+    # voc_pad and voc_block are used by the voc model for streaming voc inference;
+    # when the voc model is mb_melgan_csmsc, setting voc_pad to 14 makes the streaming synthetic audio the same as the non-streaming synthetic audio; the minimum pad value is 7, with which the streaming synthetic audio still sounds normal
+    # when the voc model is hifigan_csmsc, setting voc_pad to 19 makes the streaming synthetic audio the same as the non-streaming synthetic audio; with voc_pad set to 14, the streaming synthetic audio sounds normal
+ voc_block: 36
+ voc_pad: 14
+
+
+
+#################################################################################
+# ENGINE CONFIG #
+#################################################################################
+
+################################### TTS #########################################
+################### speech task: tts; engine_type: online-onnx #######################
+tts_online-onnx:
+ # am (acoustic model) choices=['fastspeech2_csmsc_onnx', 'fastspeech2_cnndecoder_csmsc_onnx']
+    # fastspeech2_cnndecoder_csmsc_onnx supports streaming am inference.
+ am: 'fastspeech2_cnndecoder_csmsc_onnx'
+    # am_ckpt is a list; if am is fastspeech2_cnndecoder_csmsc_onnx, am_ckpt = [encoder model, decoder model, postnet model];
+ # if am is fastspeech2_csmsc_onnx, am_ckpt = [ckpt model];
+ am_ckpt: # list
+ am_stat:
+ phones_dict:
+ tones_dict:
+ speaker_dict:
+ spk_id: 0
+ am_sample_rate: 24000
+ am_sess_conf:
+ device: "cpu" # set 'gpu:id' or 'cpu'
+ use_trt: False
+ cpu_threads: 4
+
+ # voc (vocoder) choices=['mb_melgan_csmsc_onnx, hifigan_csmsc_onnx']
+ # Both mb_melgan_csmsc_onnx and hifigan_csmsc_onnx support streaming voc inference
+ voc: 'hifigan_csmsc_onnx'
+ voc_ckpt:
+ voc_sample_rate: 24000
+ voc_sess_conf:
+ device: "cpu" # set 'gpu:id' or 'cpu'
+ use_trt: False
+ cpu_threads: 4
+
+ # others
+ lang: 'zh'
+    # am_block and am_pad are only used by the fastspeech2_cnndecoder_onnx model for streaming am inference;
+    # when am_pad is set to 12, the streaming synthetic audio is the same as the non-streaming synthetic audio
+ am_block: 72
+ am_pad: 12
+    # voc_pad and voc_block are used by the voc model for streaming voc inference;
+    # when the voc model is mb_melgan_csmsc_onnx, setting voc_pad to 14 makes the streaming synthetic audio the same as the non-streaming synthetic audio; the minimum pad value is 7, with which the streaming synthetic audio still sounds normal
+    # when the voc model is hifigan_csmsc_onnx, setting voc_pad to 19 makes the streaming synthetic audio the same as the non-streaming synthetic audio; with voc_pad set to 14, the streaming synthetic audio sounds normal
+ voc_block: 36
+ voc_pad: 14
+    # voc_upsample should be the same as n_shift in the voc config.
+ voc_upsample: 300
+
diff --git a/demos/streaming_tts_server/server.sh b/demos/streaming_tts_server/server.sh
new file mode 100755
index 00000000..d34ddba0
--- /dev/null
+++ b/demos/streaming_tts_server/server.sh
@@ -0,0 +1,10 @@
+#!/bin/bash
+
+# http server
+paddlespeech_server start --config_file ./conf/tts_online_application.yaml &> tts.http.log &
+
+
+# websocket server
+paddlespeech_server start --config_file ./conf/tts_online_ws_application.yaml &> tts.ws.log &
+
+
diff --git a/demos/streaming_tts_server/start_server.sh b/demos/streaming_tts_server/start_server.sh
deleted file mode 100644
index 9c71f2fe..00000000
--- a/demos/streaming_tts_server/start_server.sh
+++ /dev/null
@@ -1,3 +0,0 @@
-#!/bin/bash
-# start server
-paddlespeech_server start --config_file ./conf/tts_online_application.yaml
\ No newline at end of file
diff --git a/demos/text_to_speech/run.sh b/demos/text_to_speech/run.sh
index b1340241..2b588be5 100755
--- a/demos/text_to_speech/run.sh
+++ b/demos/text_to_speech/run.sh
@@ -4,4 +4,10 @@
paddlespeech tts --input 今天的天气不错啊
# Batch process
-echo -e "1 欢迎光临。\n2 谢谢惠顾。" | paddlespeech tts
\ No newline at end of file
+echo -e "1 欢迎光临。\n2 谢谢惠顾。" | paddlespeech tts
+
+# Text Frontend
+paddlespeech tts --input 今天是2022/10/29,最低温度是-3℃.
+
+
+
diff --git a/docker/ubuntu18-cpu/Dockerfile b/docker/ubuntu18-cpu/Dockerfile
index d14c0185..35f45f2e 100644
--- a/docker/ubuntu18-cpu/Dockerfile
+++ b/docker/ubuntu18-cpu/Dockerfile
@@ -1,15 +1,17 @@
FROM registry.baidubce.com/paddlepaddle/paddle:2.2.2
LABEL maintainer="paddlesl@baidu.com"
+RUN apt-get update \
+    && apt-get install -y libsndfile-dev \
+ && apt-get clean \
+ && rm -rf /var/lib/apt/lists/*
+
RUN git clone --depth 1 https://github.com/PaddlePaddle/PaddleSpeech.git /home/PaddleSpeech
RUN pip3 uninstall mccabe -y ; exit 0;
RUN pip3 install multiprocess==0.70.12 importlib-metadata==4.2.0 dill==0.3.4
-RUN cd /home/PaddleSpeech/audio
-RUN python setup.py bdist_wheel
-
-RUN cd /home/PaddleSpeech
+WORKDIR /home/PaddleSpeech/
RUN python setup.py bdist_wheel
-RUN pip install audio/dist/*.whl dist/*.whl
+RUN pip install dist/*.whl -i https://pypi.tuna.tsinghua.edu.cn/simple
-WORKDIR /home/PaddleSpeech/
+CMD ["bash"]
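+
+# Illustrative usage (not part of the original image definition): build and run the image, e.g.
+#   docker build -t paddlespeech:cpu -f docker/ubuntu18-cpu/Dockerfile .
+#   docker run -it paddlespeech:cpu bash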
diff --git a/docs/requirements.txt b/docs/requirements.txt
index 08a049c1..bf1486c5 100644
--- a/docs/requirements.txt
+++ b/docs/requirements.txt
@@ -48,4 +48,5 @@ fastapi
websockets
keyboard
uvicorn
-pattern_singleton
\ No newline at end of file
+pattern_singleton
+braceexpand
\ No newline at end of file
diff --git a/docs/source/install.md b/docs/source/install.md
index 83b64619..6a9ff3bc 100644
--- a/docs/source/install.md
+++ b/docs/source/install.md
@@ -117,9 +117,9 @@ conda install -y -c gcc_linux-64=8.4.0 gxx_linux-64=8.4.0
```
(Hip: Do not use the last script if you want to install by **Hard** way):
### Install PaddlePaddle
-You can choose the `PaddlePaddle` version based on your system. For example, for CUDA 10.2, CuDNN7.5 install paddlepaddle-gpu 2.2.0:
+You can choose the `PaddlePaddle` version based on your system. For example, for CUDA 10.2, CuDNN7.5 install paddlepaddle-gpu 2.3.1:
```bash
-python3 -m pip install paddlepaddle-gpu==2.2.0 -i https://mirror.baidu.com/pypi/simple
+python3 -m pip install paddlepaddle-gpu==2.3.1 -i https://mirror.baidu.com/pypi/simple
```
### Install PaddleSpeech
You can install `paddlespeech` by the following command,then you can use the `ready-made` examples in `paddlespeech` :
@@ -180,9 +180,9 @@ Some users may fail to install `kaldiio` due to the default download source, you
```bash
pip install pytest-runner -i https://pypi.tuna.tsinghua.edu.cn/simple
```
-Make sure you have GPU and the paddlepaddle version is right. For example, for CUDA 10.2, CuDNN7.5 install paddle 2.2.0:
+Make sure you have GPU and the paddlepaddle version is right. For example, for CUDA 10.2, CuDNN7.5 install paddle 2.3.1:
```bash
-python3 -m pip install paddlepaddle-gpu==2.2.0 -i https://mirror.baidu.com/pypi/simple
+python3 -m pip install paddlepaddle-gpu==2.3.1 -i https://mirror.baidu.com/pypi/simple
```
### Install PaddleSpeech in Developing Mode
```bash
diff --git a/docs/source/install_cn.md b/docs/source/install_cn.md
index 75f4174e..9f49ebad 100644
--- a/docs/source/install_cn.md
+++ b/docs/source/install_cn.md
@@ -111,9 +111,9 @@ conda install -y -c gcc_linux-64=8.4.0 gxx_linux-64=8.4.0
```
(提示: 如果你想使用**困难**方式完成安装,请不要使用最后一条命令)
### 安装 PaddlePaddle
-你可以根据系统配置选择 PaddlePaddle 版本,例如系统使用 CUDA 10.2, CuDNN7.5 ,你可以安装 paddlepaddle-gpu 2.2.0:
+你可以根据系统配置选择 PaddlePaddle 版本,例如系统使用 CUDA 10.2, CuDNN7.5 ,你可以安装 paddlepaddle-gpu 2.3.1:
```bash
-python3 -m pip install paddlepaddle-gpu==2.2.0 -i https://mirror.baidu.com/pypi/simple
+python3 -m pip install paddlepaddle-gpu==2.3.1 -i https://mirror.baidu.com/pypi/simple
```
### 安装 PaddleSpeech
最后安装 `paddlespeech`,这样你就可以使用 `paddlespeech` 中已有的 examples:
@@ -168,9 +168,9 @@ conda activate tools/venv
conda install -y -c conda-forge sox libsndfile swig bzip2 libflac bc
```
### 安装 PaddlePaddle
-请确认你系统是否有 GPU,并且使用了正确版本的 paddlepaddle。例如系统使用 CUDA 10.2, CuDNN7.5 ,你可以安装 paddlepaddle-gpu 2.2.0:
+请确认你系统是否有 GPU,并且使用了正确版本的 paddlepaddle。例如系统使用 CUDA 10.2, CuDNN7.5 ,你可以安装 paddlepaddle-gpu 2.3.1:
```bash
-python3 -m pip install paddlepaddle-gpu==2.2.0 -i https://mirror.baidu.com/pypi/simple
+python3 -m pip install paddlepaddle-gpu==2.3.1 -i https://mirror.baidu.com/pypi/simple
```
### 用开发者模式安装 PaddleSpeech
部分用户系统由于默认源的问题,安装中会出现 kaldiio 安装出错的问题,建议首先安装 pytest-runner:
diff --git a/examples/aishell/asr1/README.md b/examples/aishell/asr1/README.md
index 25b28ede..a7390fd6 100644
--- a/examples/aishell/asr1/README.md
+++ b/examples/aishell/asr1/README.md
@@ -1,5 +1,5 @@
# Transformer/Conformer ASR with Aishell
-This example contains code used to train a Transformer or [Conformer](http://arxiv.org/abs/2008.03802) model with [Aishell dataset](http://www.openslr.org/resources/33)
+This example contains code used to train a [u2](https://arxiv.org/pdf/2012.05481.pdf) model (Transformer or [Conformer](https://arxiv.org/pdf/2005.08100.pdf) model) with [Aishell dataset](http://www.openslr.org/resources/33)
## Overview
All the scripts you need are in `run.sh`. There are several stages in `run.sh`, and each stage has its function.
| Stage | Function |
diff --git a/examples/callcenter/README.md b/examples/callcenter/README.md
index 1c715cb6..6d521146 100644
--- a/examples/callcenter/README.md
+++ b/examples/callcenter/README.md
@@ -1,20 +1,3 @@
# Callcenter 8k sample rate
-Data distribution:
-
-```
-676048 utts
-491.4004722221223 h
-4357792.0 text
-2.4633630739178654 text/sec
-2.6167397877068495 sec/utt
-```
-
-train/dev/test partition:
-
-```
- 33802 manifest.dev
- 67606 manifest.test
- 574640 manifest.train
- 676048 total
-```
+This recipe only provides the model/data config for 8k ASR; users need to prepare the data and generate the manifest metafiles themselves. You can refer to the Aishell or Librispeech recipes.
diff --git a/examples/csmsc/vits/README.md b/examples/csmsc/vits/README.md
index 5ca57e3a..8f223e07 100644
--- a/examples/csmsc/vits/README.md
+++ b/examples/csmsc/vits/README.md
@@ -154,7 +154,7 @@ VITS checkpoint contains files listed below.
vits_csmsc_ckpt_1.1.0
├── default.yaml # default config used to train vitx
├── phone_id_map.txt # phone vocabulary file when training vits
-└── snapshot_iter_350000.pdz # model parameters and optimizer states
+└── snapshot_iter_333000.pdz # model parameters and optimizer states
```
ps: This ckpt is not good enough, a better result is training
@@ -169,7 +169,7 @@ FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/synthesize_e2e.py \
--config=vits_csmsc_ckpt_1.1.0/default.yaml \
- --ckpt=vits_csmsc_ckpt_1.1.0/snapshot_iter_350000.pdz \
+ --ckpt=vits_csmsc_ckpt_1.1.0/snapshot_iter_333000.pdz \
--phones_dict=vits_csmsc_ckpt_1.1.0/phone_id_map.txt \
--output_dir=exp/default/test_e2e \
--text=${BIN_DIR}/../sentences.txt \
diff --git a/examples/csmsc/vits/conf/default.yaml b/examples/csmsc/vits/conf/default.yaml
index 32f995cc..a2aef998 100644
--- a/examples/csmsc/vits/conf/default.yaml
+++ b/examples/csmsc/vits/conf/default.yaml
@@ -179,7 +179,7 @@ generator_first: False # whether to start updating generator first
# OTHER TRAINING SETTING #
##########################################################
num_snapshots: 10 # max number of snapshots to keep while training
-train_max_steps: 250000 # Number of training steps. == total_iters / ngpus, total_iters = 1000000
+train_max_steps: 350000 # Number of training steps. == total_iters / ngpus, total_iters = 1000000
save_interval_steps: 1000 # Interval steps to save checkpoint.
eval_interval_steps: 250 # Interval steps to evaluate the network.
seed: 777 # random seed number
diff --git a/examples/librispeech/asr1/README.md b/examples/librispeech/asr1/README.md
index ae252a58..ca008144 100644
--- a/examples/librispeech/asr1/README.md
+++ b/examples/librispeech/asr1/README.md
@@ -1,5 +1,5 @@
# Transformer/Conformer ASR with Librispeech
-This example contains code used to train a Transformer or [Conformer](http://arxiv.org/abs/2008.03802) model with [Librispeech dataset](http://www.openslr.org/resources/12)
+This example contains code used to train a [u2](https://arxiv.org/pdf/2012.05481.pdf) model (Transformer or [Conformer](https://arxiv.org/pdf/2005.08100.pdf) model) with [Librispeech dataset](http://www.openslr.org/resources/12)
## Overview
All the scripts you need are in `run.sh`. There are several stages in `run.sh`, and each stage has its function.
| Stage | Function |
diff --git a/examples/librispeech/asr2/README.md b/examples/librispeech/asr2/README.md
index 5bc7185a..26978520 100644
--- a/examples/librispeech/asr2/README.md
+++ b/examples/librispeech/asr2/README.md
@@ -1,6 +1,6 @@
# Transformer/Conformer ASR with Librispeech ASR2
-This example contains code used to train a Transformer or [Conformer](http://arxiv.org/abs/2008.03802) model with [Librispeech dataset](http://www.openslr.org/resources/12) and use some functions in kaldi.
+This example contains code used to train a [u2](https://arxiv.org/pdf/2012.05481.pdf) model (Transformer or [Conformer](https://arxiv.org/pdf/2005.08100.pdf) model) with [Librispeech dataset](http://www.openslr.org/resources/12) and use some functions in kaldi.
To use this example, you need to install Kaldi first.
diff --git a/examples/tiny/asr1/README.md b/examples/tiny/asr1/README.md
index 6a4999aa..cfa26670 100644
--- a/examples/tiny/asr1/README.md
+++ b/examples/tiny/asr1/README.md
@@ -1,5 +1,5 @@
# Transformer/Conformer ASR with Tiny
-This example contains code used to train a Transformer or [Conformer](http://arxiv.org/abs/2008.03802) model Tiny dataset(a part of [[Librispeech dataset](http://www.openslr.org/resources/12)](http://www.openslr.org/resources/33))
+This example contains code used to train a [u2](https://arxiv.org/pdf/2012.05481.pdf) model (Transformer or [Conformer](https://arxiv.org/pdf/2005.08100.pdf) model) with the Tiny dataset (a part of the [Librispeech dataset](http://www.openslr.org/resources/12))
## Overview
All the scripts you need are in `run.sh`. There are several stages in `run.sh`, and each stage has its function.
| Stage | Function |
diff --git a/examples/zh_en_tts/tts3/README.md b/examples/zh_en_tts/tts3/README.md
new file mode 100644
index 00000000..6d38181c
--- /dev/null
+++ b/examples/zh_en_tts/tts3/README.md
@@ -0,0 +1,26 @@
+# Test
+We train a Chinese-English mixed fastspeech2 model. The training code is still being sorted out; for now, let's show how to use the pretrained models.
+The sample rate of the synthesized audio is 22050 Hz.
+
+## Download pretrained models
+Put pretrained models in a directory named `models`.
+
+- [fastspeech2_csmscljspeech_add-zhen.zip](https://paddlespeech.bj.bcebos.com/t2s/chinse_english_mixed/models/fastspeech2_csmscljspeech_add-zhen.zip)
+- [hifigan_ljspeech_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_ljspeech_ckpt_0.2.0.zip)
+
+```bash
+mkdir models
+cd models
+wget https://paddlespeech.bj.bcebos.com/t2s/chinse_english_mixed/models/fastspeech2_csmscljspeech_add-zhen.zip
+unzip fastspeech2_csmscljspeech_add-zhen.zip
+wget https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_ljspeech_ckpt_0.2.0.zip
+unzip hifigan_ljspeech_ckpt_0.2.0.zip
+cd ../
+```
+
+## Test
+You can choose `--spk_id` {0, 1} in `local/synthesize_e2e.sh`.
+
+```bash
+bash test.sh
+```
diff --git a/examples/zh_en_tts/tts3/local/synthesize_e2e.sh b/examples/zh_en_tts/tts3/local/synthesize_e2e.sh
new file mode 100755
index 00000000..a206c3a8
--- /dev/null
+++ b/examples/zh_en_tts/tts3/local/synthesize_e2e.sh
@@ -0,0 +1,31 @@
+#!/bin/bash
+
+model_dir=$1
+output=$2
+am_name=fastspeech2_csmscljspeech_add-zhen
+am_model_dir=${model_dir}/${am_name}/
+
+stage=1
+stop_stage=1
+
+
+# hifigan
+if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
+ FLAGS_allocator_strategy=naive_best_fit \
+ FLAGS_fraction_of_gpu_memory_to_use=0.01 \
+ python3 ${BIN_DIR}/../synthesize_e2e.py \
+ --am=fastspeech2_mix \
+ --am_config=${am_model_dir}/default.yaml \
+ --am_ckpt=${am_model_dir}/snapshot_iter_94000.pdz \
+ --am_stat=${am_model_dir}/speech_stats.npy \
+ --voc=hifigan_ljspeech \
+ --voc_config=${model_dir}/hifigan_ljspeech_ckpt_0.2.0/default.yaml \
+ --voc_ckpt=${model_dir}/hifigan_ljspeech_ckpt_0.2.0/snapshot_iter_2500000.pdz \
+ --voc_stat=${model_dir}/hifigan_ljspeech_ckpt_0.2.0/feats_stats.npy \
+ --lang=mix \
+ --text=${BIN_DIR}/../sentences_mix.txt \
+ --output_dir=${output}/test_e2e \
+ --phones_dict=${am_model_dir}/phone_id_map.txt \
+ --speaker_dict=${am_model_dir}/speaker_id_map.txt \
+ --spk_id 0
+fi
diff --git a/examples/zh_en_tts/tts3/path.sh b/examples/zh_en_tts/tts3/path.sh
new file mode 100755
index 00000000..fb7e8411
--- /dev/null
+++ b/examples/zh_en_tts/tts3/path.sh
@@ -0,0 +1,13 @@
+#!/bin/bash
+export MAIN_ROOT=`realpath ${PWD}/../../../`
+
+export PATH=${MAIN_ROOT}:${MAIN_ROOT}/utils:${PATH}
+export LC_ALL=C
+
+export PYTHONDONTWRITEBYTECODE=1
+# Use UTF-8 in Python to avoid UnicodeDecodeError when LC_ALL=C
+export PYTHONIOENCODING=UTF-8
+export PYTHONPATH=${MAIN_ROOT}:${PYTHONPATH}
+
+MODEL=fastspeech2
+export BIN_DIR=${MAIN_ROOT}/paddlespeech/t2s/exps/${MODEL}
diff --git a/examples/zh_en_tts/tts3/test.sh b/examples/zh_en_tts/tts3/test.sh
new file mode 100755
index 00000000..ff34da14
--- /dev/null
+++ b/examples/zh_en_tts/tts3/test.sh
@@ -0,0 +1,23 @@
+#!/bin/bash
+
+set -e
+source path.sh
+
+gpus=0,1
+stage=3
+stop_stage=100
+
+model_dir=models
+output_dir=output
+
+# with the following command, you can choose the stage range you want to run
+# such as `./test.sh --stage 0 --stop-stage 0`
+# this cannot be mixed with `$1`, `$2` ...
+source ${MAIN_ROOT}/utils/parse_options.sh || exit 1
+
+
+if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
+ # synthesize_e2e, vocoder is hifigan by default
+ CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize_e2e.sh ${model_dir} ${output_dir} || exit -1
+fi
+
diff --git a/paddlespeech/__init__.py b/paddlespeech/__init__.py
index b781c4a8..4b1c0ef3 100644
--- a/paddlespeech/__init__.py
+++ b/paddlespeech/__init__.py
@@ -14,3 +14,5 @@
import _locale
_locale._getdefaultlocale = (lambda *args: ['en_US', 'utf8'])
+
+
diff --git a/paddlespeech/audio/__init__.py b/paddlespeech/audio/__init__.py
index 6184c1dd..83be8e32 100644
--- a/paddlespeech/audio/__init__.py
+++ b/paddlespeech/audio/__init__.py
@@ -14,6 +14,9 @@
from . import compliance
from . import datasets
from . import features
+from . import text
+from . import transform
+from . import streamdata
from . import functional
from . import io
from . import metric
diff --git a/paddlespeech/audio/text/__init__.py b/paddlespeech/audio/text/__init__.py
new file mode 100644
index 00000000..185a92b8
--- /dev/null
+++ b/paddlespeech/audio/text/__init__.py
@@ -0,0 +1,13 @@
+# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
diff --git a/paddlespeech/cli/asr/infer.py b/paddlespeech/cli/asr/infer.py
index 76dfafb9..f9b4439e 100644
--- a/paddlespeech/cli/asr/infer.py
+++ b/paddlespeech/cli/asr/infer.py
@@ -365,7 +365,7 @@ class ASRExecutor(BaseExecutor):
except Exception as e:
logger.exception(e)
logger.error(
- "can not open the audio file, please check the audio file format is 'wav'. \n \
+ f"can not open the audio file, please check the audio file({audio_file}) format is 'wav'. \n \
you can try to use sox to change the file format.\n \
For example: \n \
sample rate: 16k \n \
diff --git a/paddlespeech/cli/executor.py b/paddlespeech/cli/executor.py
index d4187a51..3800c36d 100644
--- a/paddlespeech/cli/executor.py
+++ b/paddlespeech/cli/executor.py
@@ -108,19 +108,20 @@ class BaseExecutor(ABC):
Dict[str, Union[str, os.PathLike]]: A dict with ids and inputs.
"""
if self._is_job_input(input_):
+ # .job/.scp/.txt file
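+            # each non-empty line is "<id> <audio path or text>", separated by whitespace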
ret = self._get_job_contents(input_)
else:
+ # job from stdin
ret = OrderedDict()
-
if input_ is None: # Take input from stdin
if not sys.stdin.isatty(
): # Avoid getting stuck when stdin is empty.
for i, line in enumerate(sys.stdin):
line = line.strip()
- if len(line.split(' ')) == 1:
+ if len(line.split()) == 1:
ret[str(i + 1)] = line
- elif len(line.split(' ')) == 2:
- id_, info = line.split(' ')
+ elif len(line.split()) == 2:
+ id_, info = line.split()
ret[id_] = info
else: # No valid input info from one line.
continue
@@ -170,7 +171,8 @@ class BaseExecutor(ABC):
bool: return `True` for job input, `False` otherwise.
"""
return input_ and os.path.isfile(input_) and (input_.endswith('.job') or
- input_.endswith('.txt'))
+ input_.endswith('.txt') or
+ input_.endswith('.scp'))
def _get_job_contents(
self, job_input: os.PathLike) -> Dict[str, Union[str, os.PathLike]]:
@@ -189,7 +191,7 @@ class BaseExecutor(ABC):
line = line.strip()
if not line:
continue
- k, v = line.split(' ')
+ k, v = line.split() # space or \t
job_contents[k] = v
return job_contents
diff --git a/paddlespeech/s2t/__init__.py b/paddlespeech/s2t/__init__.py
index 2da68435..f6476b9a 100644
--- a/paddlespeech/s2t/__init__.py
+++ b/paddlespeech/s2t/__init__.py
@@ -18,7 +18,6 @@ from typing import Union
import paddle
from paddle import nn
-from paddle.fluid import core
from paddle.nn import functional as F
from paddlespeech.s2t.utils.log import Log
@@ -39,46 +38,6 @@ paddle.long = 'int64'
paddle.uint16 = 'uint16'
paddle.cdouble = 'complex128'
-
-def convert_dtype_to_string(tensor_dtype):
- """
- Convert the data type in numpy to the data type in Paddle
- Args:
- tensor_dtype(core.VarDesc.VarType): the data type in numpy.
- Returns:
- core.VarDesc.VarType: the data type in Paddle.
- """
- dtype = tensor_dtype
- if dtype == core.VarDesc.VarType.FP32:
- return paddle.float32
- elif dtype == core.VarDesc.VarType.FP64:
- return paddle.float64
- elif dtype == core.VarDesc.VarType.FP16:
- return paddle.float16
- elif dtype == core.VarDesc.VarType.INT32:
- return paddle.int32
- elif dtype == core.VarDesc.VarType.INT16:
- return paddle.int16
- elif dtype == core.VarDesc.VarType.INT64:
- return paddle.int64
- elif dtype == core.VarDesc.VarType.BOOL:
- return paddle.bool
- elif dtype == core.VarDesc.VarType.BF16:
- # since there is still no support for bfloat16 in NumPy,
- # uint16 is used for casting bfloat16
- return paddle.uint16
- elif dtype == core.VarDesc.VarType.UINT8:
- return paddle.uint8
- elif dtype == core.VarDesc.VarType.INT8:
- return paddle.int8
- elif dtype == core.VarDesc.VarType.COMPLEX64:
- return paddle.complex64
- elif dtype == core.VarDesc.VarType.COMPLEX128:
- return paddle.complex128
- else:
- raise ValueError("Not supported tensor dtype %s" % dtype)
-
-
if not hasattr(paddle, 'softmax'):
logger.debug("register user softmax to paddle, remove this when fixed!")
setattr(paddle, 'softmax', paddle.nn.functional.softmax)
@@ -155,28 +114,6 @@ if not hasattr(paddle.Tensor, 'new_full'):
paddle.Tensor.new_full = new_full
paddle.static.Variable.new_full = new_full
-
-def eq(xs: paddle.Tensor, ys: Union[paddle.Tensor, float]) -> paddle.Tensor:
- if convert_dtype_to_string(xs.dtype) == paddle.bool:
- xs = xs.astype(paddle.int)
- return xs.equal(
- paddle.to_tensor(
- ys, dtype=convert_dtype_to_string(xs.dtype), place=xs.place))
-
-
-if not hasattr(paddle.Tensor, 'eq'):
- logger.debug(
- "override eq of paddle.Tensor if exists or register, remove this when fixed!"
- )
- paddle.Tensor.eq = eq
- paddle.static.Variable.eq = eq
-
-if not hasattr(paddle, 'eq'):
- logger.debug(
- "override eq of paddle if exists or register, remove this when fixed!")
- paddle.eq = eq
-
-
def contiguous(xs: paddle.Tensor) -> paddle.Tensor:
return xs
@@ -219,13 +156,22 @@ def is_broadcastable(shp1, shp2):
return True
+def broadcast_shape(shp1, shp2):
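+    # Elementwise max over dims; assumes the two shapes are broadcastable and have the same rank.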
+ result = []
+ for a, b in zip(shp1[::-1], shp2[::-1]):
+ result.append(max(a, b))
+ return result[::-1]
+
+
def masked_fill(xs: paddle.Tensor,
mask: paddle.Tensor,
value: Union[float, int]):
- assert is_broadcastable(xs.shape, mask.shape) is True, (xs.shape,
- mask.shape)
- bshape = paddle.broadcast_shape(xs.shape, mask.shape)
- mask = mask.broadcast_to(bshape)
+ bshape = broadcast_shape(xs.shape, mask.shape)
+ mask.stop_gradient = True
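+    # pack the broadcast shape into an int32 tensor and expand the mask to that shape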
+ tmp = paddle.ones(shape=[len(bshape)], dtype='int32')
+ for index in range(len(bshape)):
+ tmp[index] = bshape[index]
+ mask = mask.broadcast_to(tmp)
trues = paddle.ones_like(xs) * value
xs = paddle.where(mask, trues, xs)
return xs
diff --git a/paddlespeech/s2t/models/u2/u2.py b/paddlespeech/s2t/models/u2/u2.py
index 100aca18..e19f411c 100644
--- a/paddlespeech/s2t/models/u2/u2.py
+++ b/paddlespeech/s2t/models/u2/u2.py
@@ -29,6 +29,9 @@ import paddle
from paddle import jit
from paddle import nn
+from paddlespeech.audio.utils.tensor_utils import add_sos_eos
+from paddlespeech.audio.utils.tensor_utils import pad_sequence
+from paddlespeech.audio.utils.tensor_utils import th_accuracy
from paddlespeech.s2t.decoders.scorers.ctc import CTCPrefixScorer
from paddlespeech.s2t.frontend.utility import IGNORE_ID
from paddlespeech.s2t.frontend.utility import load_cmvn
@@ -48,9 +51,6 @@ from paddlespeech.s2t.utils import checkpoint
from paddlespeech.s2t.utils import layer_tools
from paddlespeech.s2t.utils.ctc_utils import remove_duplicates_and_blank
from paddlespeech.s2t.utils.log import Log
-from paddlespeech.audio.utils.tensor_utils import add_sos_eos
-from paddlespeech.audio.utils.tensor_utils import pad_sequence
-from paddlespeech.audio.utils.tensor_utils import th_accuracy
from paddlespeech.s2t.utils.utility import log_add
from paddlespeech.s2t.utils.utility import UpdateConfig
@@ -318,7 +318,7 @@ class U2BaseModel(ASRInterface, nn.Layer):
dim=1) # (B*N, i+1)
# 2.6 Update end flag
- end_flag = paddle.eq(hyps[:, -1], self.eos).view(-1, 1)
+ end_flag = paddle.equal(hyps[:, -1], self.eos).view(-1, 1)
# 3. Select best of best
scores = scores.view(batch_size, beam_size)
@@ -605,29 +605,42 @@ class U2BaseModel(ASRInterface, nn.Layer):
xs: paddle.Tensor,
offset: int,
required_cache_size: int,
- subsampling_cache: Optional[paddle.Tensor]=None,
- elayers_output_cache: Optional[List[paddle.Tensor]]=None,
- conformer_cnn_cache: Optional[List[paddle.Tensor]]=None,
- ) -> Tuple[paddle.Tensor, paddle.Tensor, List[paddle.Tensor], List[
- paddle.Tensor]]:
+ att_cache: paddle.Tensor=paddle.zeros([0, 0, 0, 0]),
+ cnn_cache: paddle.Tensor=paddle.zeros([0, 0, 0, 0]),
+ ) -> Tuple[paddle.Tensor, paddle.Tensor, paddle.Tensor]:
""" Export interface for c++ call, give input chunk xs, and return
output from time 0 to current chunk.
+
Args:
- xs (paddle.Tensor): chunk input
- subsampling_cache (Optional[paddle.Tensor]): subsampling cache
- elayers_output_cache (Optional[List[paddle.Tensor]]):
- transformer/conformer encoder layers output cache
- conformer_cnn_cache (Optional[List[paddle.Tensor]]): conformer
- cnn cache
+ xs (paddle.Tensor): chunk input, with shape (b=1, time, mel-dim),
+ where `time == (chunk_size - 1) * subsample_rate + \
+ subsample.right_context + 1`
+ offset (int): current offset in encoder output time stamp
+ required_cache_size (int): cache size required for next chunk
+                computation
+ >=0: actual cache size
+ <0: means all history cache is required
+ att_cache (paddle.Tensor): cache tensor for KEY & VALUE in
+ transformer/conformer attention, with shape
+ (elayers, head, cache_t1, d_k * 2), where
+ `head * d_k == hidden-dim` and
+ `cache_t1 == chunk_size * num_decoding_left_chunks`.
+ `d_k * 2` for att key & value.
+ cnn_cache (paddle.Tensor): cache tensor for cnn_module in conformer,
+ (elayers, b=1, hidden-dim, cache_t2), where
+ `cache_t2 == cnn.lorder - 1`.
+
Returns:
- paddle.Tensor: output, it ranges from time 0 to current chunk.
- paddle.Tensor: subsampling cache
- List[paddle.Tensor]: attention cache
- List[paddle.Tensor]: conformer cnn cache
+ paddle.Tensor: output of current input xs,
+ with shape (b=1, chunk_size, hidden-dim).
+ paddle.Tensor: new attention cache required for next chunk, with
+ dynamic shape (elayers, head, T(?), d_k * 2)
+ depending on required_cache_size.
+ paddle.Tensor: new conformer cnn cache required for next chunk, with
+ same shape as the original cnn_cache.
"""
- return self.encoder.forward_chunk(
- xs, offset, required_cache_size, subsampling_cache,
- elayers_output_cache, conformer_cnn_cache)
+ return self.encoder.forward_chunk(xs, offset, required_cache_size,
+ att_cache, cnn_cache)
# @jit.to_static
def ctc_activation(self, xs: paddle.Tensor) -> paddle.Tensor:
diff --git a/paddlespeech/s2t/models/u2_st/u2_st.py b/paddlespeech/s2t/models/u2_st/u2_st.py
index 00ded912..e86bbedf 100644
--- a/paddlespeech/s2t/models/u2_st/u2_st.py
+++ b/paddlespeech/s2t/models/u2_st/u2_st.py
@@ -401,29 +401,42 @@ class U2STBaseModel(nn.Layer):
xs: paddle.Tensor,
offset: int,
required_cache_size: int,
- subsampling_cache: Optional[paddle.Tensor]=None,
- elayers_output_cache: Optional[List[paddle.Tensor]]=None,
- conformer_cnn_cache: Optional[List[paddle.Tensor]]=None,
- ) -> Tuple[paddle.Tensor, paddle.Tensor, List[paddle.Tensor], List[
- paddle.Tensor]]:
+ att_cache: paddle.Tensor = paddle.zeros([0, 0, 0, 0]),
+ cnn_cache: paddle.Tensor = paddle.zeros([0, 0, 0, 0]),
+ ) -> Tuple[paddle.Tensor, paddle.Tensor, paddle.Tensor]:
""" Export interface for c++ call, give input chunk xs, and return
output from time 0 to current chunk.
+
Args:
- xs (paddle.Tensor): chunk input
- subsampling_cache (Optional[paddle.Tensor]): subsampling cache
- elayers_output_cache (Optional[List[paddle.Tensor]]):
- transformer/conformer encoder layers output cache
- conformer_cnn_cache (Optional[List[paddle.Tensor]]): conformer
- cnn cache
+ xs (paddle.Tensor): chunk input, with shape (b=1, time, mel-dim),
+ where `time == (chunk_size - 1) * subsample_rate + \
+ subsample.right_context + 1`
+ offset (int): current offset in encoder output time stamp
+ required_cache_size (int): cache size required for next chunk
+                computation
+ >=0: actual cache size
+ <0: means all history cache is required
+ att_cache (paddle.Tensor): cache tensor for KEY & VALUE in
+ transformer/conformer attention, with shape
+ (elayers, head, cache_t1, d_k * 2), where
+ `head * d_k == hidden-dim` and
+ `cache_t1 == chunk_size * num_decoding_left_chunks`.
+ `d_k * 2` for att key & value.
+ cnn_cache (paddle.Tensor): cache tensor for cnn_module in conformer,
+ (elayers, b=1, hidden-dim, cache_t2), where
+ `cache_t2 == cnn.lorder - 1`
+
Returns:
- paddle.Tensor: output, it ranges from time 0 to current chunk.
- paddle.Tensor: subsampling cache
- List[paddle.Tensor]: attention cache
- List[paddle.Tensor]: conformer cnn cache
+ paddle.Tensor: output of current input xs,
+ with shape (b=1, chunk_size, hidden-dim).
+ paddle.Tensor: new attention cache required for next chunk, with
+ dynamic shape (elayers, head, T(?), d_k * 2)
+ depending on required_cache_size.
+ paddle.Tensor: new conformer cnn cache required for next chunk, with
+ same shape as the original cnn_cache.
"""
return self.encoder.forward_chunk(
- xs, offset, required_cache_size, subsampling_cache,
- elayers_output_cache, conformer_cnn_cache)
+ xs, offset, required_cache_size, att_cache, cnn_cache)
# @jit.to_static
def ctc_activation(self, xs: paddle.Tensor) -> paddle.Tensor:
diff --git a/paddlespeech/s2t/modules/align.py b/paddlespeech/s2t/modules/align.py
index ad71ee02..cacda246 100644
--- a/paddlespeech/s2t/modules/align.py
+++ b/paddlespeech/s2t/modules/align.py
@@ -13,8 +13,7 @@
# limitations under the License.
import paddle
from paddle import nn
-
-from paddlespeech.s2t.modules.initializer import KaimingUniform
+import math
"""
To align the initializer between paddle and torch,
the API below are set defalut initializer with priority higger than global initializer.
@@ -82,10 +81,10 @@ class Linear(nn.Linear):
name=None):
if weight_attr is None:
if global_init_type == "kaiming_uniform":
- weight_attr = paddle.ParamAttr(initializer=KaimingUniform())
+ weight_attr = paddle.ParamAttr(initializer=nn.initializer.KaimingUniform(fan_in=None, negative_slope=math.sqrt(5), nonlinearity='leaky_relu'))
if bias_attr is None:
if global_init_type == "kaiming_uniform":
- bias_attr = paddle.ParamAttr(initializer=KaimingUniform())
+ bias_attr = paddle.ParamAttr(initializer=nn.initializer.KaimingUniform(fan_in=None, negative_slope=math.sqrt(5), nonlinearity='leaky_relu'))
super(Linear, self).__init__(in_features, out_features, weight_attr,
bias_attr, name)
@@ -105,10 +104,10 @@ class Conv1D(nn.Conv1D):
data_format='NCL'):
if weight_attr is None:
if global_init_type == "kaiming_uniform":
- weight_attr = paddle.ParamAttr(initializer=KaimingUniform())
+ weight_attr = paddle.ParamAttr(initializer=nn.initializer.KaimingUniform(fan_in=None, negative_slope=math.sqrt(5), nonlinearity='leaky_relu'))
if bias_attr is None:
if global_init_type == "kaiming_uniform":
- bias_attr = paddle.ParamAttr(initializer=KaimingUniform())
+ bias_attr = paddle.ParamAttr(initializer=nn.initializer.KaimingUniform(fan_in=None, negative_slope=math.sqrt(5), nonlinearity='leaky_relu'))
super(Conv1D, self).__init__(
in_channels, out_channels, kernel_size, stride, padding, dilation,
groups, padding_mode, weight_attr, bias_attr, data_format)
@@ -129,10 +128,10 @@ class Conv2D(nn.Conv2D):
data_format='NCHW'):
if weight_attr is None:
if global_init_type == "kaiming_uniform":
- weight_attr = paddle.ParamAttr(initializer=KaimingUniform())
+ weight_attr = paddle.ParamAttr(initializer=nn.initializer.KaimingUniform(fan_in=None, negative_slope=math.sqrt(5), nonlinearity='leaky_relu'))
if bias_attr is None:
if global_init_type == "kaiming_uniform":
- bias_attr = paddle.ParamAttr(initializer=KaimingUniform())
+ bias_attr = paddle.ParamAttr(initializer=nn.initializer.KaimingUniform(fan_in=None, negative_slope=math.sqrt(5), nonlinearity='leaky_relu'))
super(Conv2D, self).__init__(
in_channels, out_channels, kernel_size, stride, padding, dilation,
groups, padding_mode, weight_attr, bias_attr, data_format)
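
The align.py change above drops the custom `KaimingUniform` class and reuses Paddle's built-in `nn.initializer.KaimingUniform` with torch-compatible arguments. A minimal sketch of the resulting layer construction, assuming paddlepaddle >= 2.3.1; the hidden sizes are placeholders:

```python
import math
import paddle
from paddle import nn

# Sketch only: build a Linear layer whose weight and bias reuse the built-in
# KaimingUniform initializer configured like torch's default
# (leaky_relu with negative_slope = sqrt(5)), as in the diff above.
init = nn.initializer.KaimingUniform(
    fan_in=None, negative_slope=math.sqrt(5), nonlinearity='leaky_relu')
linear = nn.Linear(
    256, 256,
    weight_attr=paddle.ParamAttr(initializer=init),
    bias_attr=paddle.ParamAttr(initializer=init))
x = paddle.rand([2, 10, 256])
y = linear(x)  # (2, 10, 256)
```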
diff --git a/paddlespeech/s2t/modules/attention.py b/paddlespeech/s2t/modules/attention.py
index 438efd2a..b6d61586 100644
--- a/paddlespeech/s2t/modules/attention.py
+++ b/paddlespeech/s2t/modules/attention.py
@@ -84,9 +84,10 @@ class MultiHeadedAttention(nn.Layer):
return q, k, v
def forward_attention(self,
- value: paddle.Tensor,
- scores: paddle.Tensor,
- mask: Optional[paddle.Tensor]) -> paddle.Tensor:
+ value: paddle.Tensor,
+ scores: paddle.Tensor,
+ mask: paddle.Tensor = paddle.ones([0, 0, 0], dtype=paddle.bool),
+ ) -> paddle.Tensor:
"""Compute attention context vector.
Args:
value (paddle.Tensor): Transformed value, size
@@ -94,14 +95,23 @@ class MultiHeadedAttention(nn.Layer):
scores (paddle.Tensor): Attention score, size
(#batch, n_head, time1, time2).
mask (paddle.Tensor): Mask, size (#batch, 1, time2) or
- (#batch, time1, time2).
+ (#batch, time1, time2), (0, 0, 0) means fake mask.
Returns:
- paddle.Tensor: Transformed value weighted
- by the attention score, (#batch, time1, d_model).
+ paddle.Tensor: Transformed value (#batch, time1, d_model)
+ weighted by the attention score (#batch, time1, time2).
"""
n_batch = value.shape[0]
- if mask is not None:
- mask = mask.unsqueeze(1).eq(0) # (batch, 1, *, time2)
+
+ # When will `mask.shape[2] > 0` be True?
+ # 1. training.
+ # 2. onnx (16/4, chunk_size/history_size): real cache and real mask are fed for the 1st chunk.
+ # When will `mask.shape[2] > 0` be False?
+ # 1. onnx (16/-1, -1/-1, 16/0)
+ # 2. jit (16/-1, -1/-1, 16/0, 16/4)
+ if paddle.shape(mask)[2] > 0: # time2 > 0
+ mask = mask.unsqueeze(1).equal(0) # (batch, 1, *, time2)
+ # for last chunk, time2 might be larger than scores.size(-1)
+ mask = mask[:, :, :, :paddle.shape(scores)[-1]]
scores = scores.masked_fill(mask, -float('inf'))
attn = paddle.softmax(
scores, axis=-1).masked_fill(mask,
@@ -121,21 +131,66 @@ class MultiHeadedAttention(nn.Layer):
query: paddle.Tensor,
key: paddle.Tensor,
value: paddle.Tensor,
- mask: Optional[paddle.Tensor]) -> paddle.Tensor:
+ mask: paddle.Tensor = paddle.ones([0,0,0], dtype=paddle.bool),
+ pos_emb: paddle.Tensor = paddle.empty([0]),
+ cache: paddle.Tensor = paddle.zeros([0,0,0,0])
+ ) -> Tuple[paddle.Tensor, paddle.Tensor]:
"""Compute scaled dot product attention.
- Args:
- query (torch.Tensor): Query tensor (#batch, time1, size).
- key (torch.Tensor): Key tensor (#batch, time2, size).
- value (torch.Tensor): Value tensor (#batch, time2, size).
- mask (torch.Tensor): Mask tensor (#batch, 1, time2) or
+ Args:
+ query (paddle.Tensor): Query tensor (#batch, time1, size).
+ key (paddle.Tensor): Key tensor (#batch, time2, size).
+ value (paddle.Tensor): Value tensor (#batch, time2, size).
+ mask (paddle.Tensor): Mask tensor (#batch, 1, time2) or
(#batch, time1, time2).
+ 1.When applying cross attention between decoder and encoder,
+ the batch padding mask for input is in (#batch, 1, T) shape.
+ 2.When applying self attention of encoder,
+ the mask is in (#batch, T, T) shape.
+ 3.When applying self attention of decoder,
+ the mask is in (#batch, L, L) shape.
+ 4.If different positions in the decoder see different blocks
+ of the encoder, such as Mocha, the passed-in mask could be
+ in (#batch, L, T) shape. But there is no such case in current
+ Wenet.
+ cache (paddle.Tensor): Cache tensor (1, head, cache_t, d_k * 2),
+ where `cache_t == chunk_size * num_decoding_left_chunks`
+ and `head * d_k == size`
Returns:
- torch.Tensor: Output tensor (#batch, time1, d_model).
+ paddle.Tensor: Output tensor (#batch, time1, d_model).
+ paddle.Tensor: Cache tensor (1, head, cache_t + time1, d_k * 2)
+ where `cache_t == chunk_size * num_decoding_left_chunks`
+ and `head * d_k == size`
+
"""
q, k, v = self.forward_qkv(query, key, value)
+
+ # When exporting the onnx model, for the 1st chunk we feed
+ # cache(1, head, 0, d_k * 2) (16/-1, -1/-1, 16/0 mode)
+ # or cache(1, head, real_cache_t, d_k * 2) (16/4 mode).
+ # In all modes `if cache.shape[0] > 0` will always be `True`,
+ # so we always do the splitting and
+ # concatenation (this simplifies onnx export). Note that
+ # it's OK to concat & split zero-shaped tensors (see code below).
+ # When exporting the jit model, for the 1st chunk we always feed
+ # cache(0, 0, 0, 0) since jit supports a dynamic if-branch.
+ # >>> a = torch.ones((1, 2, 0, 4))
+ # >>> b = torch.ones((1, 2, 3, 4))
+ # >>> c = torch.cat((a, b), dim=2)
+ # >>> torch.equal(b, c) # True
+ # >>> d = torch.split(a, 2, dim=-1)
+ # >>> torch.equal(d[0], d[1]) # True
+ if paddle.shape(cache)[0] > 0:
+ # last dim `d_k * 2` for (key, val)
+ key_cache, value_cache = paddle.split(cache, 2, axis=-1)
+ k = paddle.concat([key_cache, k], axis=2)
+ v = paddle.concat([value_cache, v], axis=2)
+ # We do cache slicing in encoder.forward_chunk, since it's
+ # non-trivial to calculate `next_cache_start` here.
+ new_cache = paddle.concat((k, v), axis=-1)
+
scores = paddle.matmul(q,
k.transpose([0, 1, 3, 2])) / math.sqrt(self.d_k)
- return self.forward_attention(v, scores, mask)
+ return self.forward_attention(v, scores, mask), new_cache
class RelPositionMultiHeadedAttention(MultiHeadedAttention):
@@ -192,23 +247,55 @@ class RelPositionMultiHeadedAttention(MultiHeadedAttention):
query: paddle.Tensor,
key: paddle.Tensor,
value: paddle.Tensor,
- pos_emb: paddle.Tensor,
- mask: Optional[paddle.Tensor]):
+ mask: paddle.Tensor = paddle.ones([0,0,0], dtype=paddle.bool),
+ pos_emb: paddle.Tensor = paddle.empty([0]),
+ cache: paddle.Tensor = paddle.zeros([0,0,0,0])
+ ) -> Tuple[paddle.Tensor, paddle.Tensor]:
"""Compute 'Scaled Dot Product Attention' with rel. positional encoding.
Args:
query (paddle.Tensor): Query tensor (#batch, time1, size).
key (paddle.Tensor): Key tensor (#batch, time2, size).
value (paddle.Tensor): Value tensor (#batch, time2, size).
- pos_emb (paddle.Tensor): Positional embedding tensor
- (#batch, time1, size).
mask (paddle.Tensor): Mask tensor (#batch, 1, time2) or
- (#batch, time1, time2).
+ (#batch, time1, time2), (0, 0, 0) means fake mask.
+ pos_emb (paddle.Tensor): Positional embedding tensor
+ (#batch, time2, size).
+ cache (paddle.Tensor): Cache tensor (1, head, cache_t, d_k * 2),
+ where `cache_t == chunk_size * num_decoding_left_chunks`
+ and `head * d_k == size`
Returns:
paddle.Tensor: Output tensor (#batch, time1, d_model).
+ paddle.Tensor: Cache tensor (1, head, cache_t + time1, d_k * 2)
+ where `cache_t == chunk_size * num_decoding_left_chunks`
+ and `head * d_k == size`
"""
q, k, v = self.forward_qkv(query, key, value)
q = q.transpose([0, 2, 1, 3]) # (batch, time1, head, d_k)
+ # When exporting the onnx model, for the 1st chunk we feed
+ # cache(1, head, 0, d_k * 2) (16/-1, -1/-1, 16/0 mode)
+ # or cache(1, head, real_cache_t, d_k * 2) (16/4 mode).
+ # In all modes `if cache.shape[0] > 0` will always be `True`,
+ # so we always do the splitting and
+ # concatenation (this simplifies onnx export). Note that
+ # it's OK to concat & split zero-shaped tensors (see code below).
+ # When exporting the jit model, for the 1st chunk we always feed
+ # cache(0, 0, 0, 0) since jit supports a dynamic if-branch.
+ # >>> a = torch.ones((1, 2, 0, 4))
+ # >>> b = torch.ones((1, 2, 3, 4))
+ # >>> c = torch.cat((a, b), dim=2)
+ # >>> torch.equal(b, c) # True
+ # >>> d = torch.split(a, 2, dim=-1)
+ # >>> torch.equal(d[0], d[1]) # True
+ if paddle.shape(cache)[0] > 0:
+ # last dim `d_k * 2` for (key, val)
+ key_cache, value_cache = paddle.split(cache, 2, axis=-1)
+ k = paddle.concat([key_cache, k], axis=2)
+ v = paddle.concat([value_cache, v], axis=2)
+ # We do cache slicing in encoder.forward_chunk, since it's
+ # non-trivial to calculate `next_cache_start` here.
+ new_cache = paddle.concat((k, v), axis=-1)
+
n_batch_pos = pos_emb.shape[0]
p = self.linear_pos(pos_emb).view(n_batch_pos, -1, self.h, self.d_k)
p = p.transpose([0, 2, 1, 3]) # (batch, head, time1, d_k)
@@ -234,4 +321,4 @@ class RelPositionMultiHeadedAttention(MultiHeadedAttention):
scores = (matrix_ac + matrix_bd) / math.sqrt(
self.d_k) # (batch, head, time1, time2)
- return self.forward_attention(v, scores, mask)
+ return self.forward_attention(v, scores, mask), new_cache
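
The comments above rely on concatenation with a zero-length cache being a no-op, so the same split/concat path serves both the first chunk and later chunks. A minimal sketch of that behaviour, assuming eager-mode paddlepaddle; head count and dimensions are placeholders:

```python
import paddle

# Sketch of the zero-shaped-cache trick described in the comments above:
# concatenating an empty (cache_t == 0) KV cache along the time axis is a
# no-op, so the 1st chunk needs no special branch.
head, d_k = 4, 64
empty_cache = paddle.zeros([1, head, 0, d_k * 2])   # 1st chunk: no history
k = paddle.rand([1, head, 16, d_k])                  # current chunk keys
v = paddle.rand([1, head, 16, d_k])                  # current chunk values

key_cache, value_cache = paddle.split(empty_cache, 2, axis=-1)
k_all = paddle.concat([key_cache, k], axis=2)        # still (1, head, 16, d_k)
v_all = paddle.concat([value_cache, v], axis=2)
new_cache = paddle.concat([k_all, v_all], axis=-1)   # (1, head, 16, d_k * 2)
print(new_cache.shape)
```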
diff --git a/paddlespeech/s2t/modules/conformer_convolution.py b/paddlespeech/s2t/modules/conformer_convolution.py
index 89e65268..c384b9c7 100644
--- a/paddlespeech/s2t/modules/conformer_convolution.py
+++ b/paddlespeech/s2t/modules/conformer_convolution.py
@@ -108,15 +108,17 @@ class ConvolutionModule(nn.Layer):
def forward(self,
x: paddle.Tensor,
- mask_pad: Optional[paddle.Tensor]=None,
- cache: Optional[paddle.Tensor]=None
+ mask_pad: paddle.Tensor= paddle.ones([0,0,0], dtype=paddle.bool),
+ cache: paddle.Tensor= paddle.zeros([0,0,0]),
) -> Tuple[paddle.Tensor, paddle.Tensor]:
"""Compute convolution module.
Args:
x (paddle.Tensor): Input tensor (#batch, time, channels).
- mask_pad (paddle.Tensor): used for batch padding, (#batch, channels, time).
+ mask_pad (paddle.Tensor): used for batch padding (#batch, 1, time),
+ (0, 0, 0) means fake mask.
cache (paddle.Tensor): left context cache, it is only
- used in causal convolution. (#batch, channels, time')
+ used in causal convolution (#batch, channels, cache_t),
+ (0, 0, 0) means fake cache.
Returns:
paddle.Tensor: Output tensor (#batch, time, channels).
paddle.Tensor: Output cache tensor (#batch, channels, time')
@@ -125,11 +127,11 @@ class ConvolutionModule(nn.Layer):
x = x.transpose([0, 2, 1]) # [B, C, T]
# mask batch padding
- if mask_pad is not None:
+ if paddle.shape(mask_pad)[2] > 0: # time > 0
x = x.masked_fill(mask_pad, 0.0)
if self.lorder > 0:
- if cache is None:
+ if paddle.shape(cache)[2] == 0: # cache_t == 0
x = nn.functional.pad(
x, [self.lorder, 0], 'constant', 0.0, data_format='NCL')
else:
@@ -143,7 +145,7 @@ class ConvolutionModule(nn.Layer):
# It would be better to just return None if no cache is required,
# However, for JIT export, here we just fake one tensor instead of
# None.
- new_cache = paddle.zeros([1], dtype=x.dtype)
+ new_cache = paddle.zeros([0, 0, 0], dtype=x.dtype)
# GLU mechanism
x = self.pointwise_conv1(x) # (batch, 2*channel, dim)
@@ -159,7 +161,7 @@ class ConvolutionModule(nn.Layer):
x = self.pointwise_conv2(x)
# mask batch padding
- if mask_pad is not None:
+ if paddle.shape(mask_pad)[2] > 0: # time > 0
x = x.masked_fill(mask_pad, 0.0)
x = x.transpose([0, 2, 1]) # [B, T, C]
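
Throughout these modules a `(0, 0, 0)`-shaped tensor stands in for `None`, so the exported graph keeps a static signature while the runtime check on `paddle.shape(...)` decides whether the tensor carries real data. A minimal sketch of the convention, with placeholder shapes:

```python
import paddle

# Sketch of the "fake tensor" convention used above: a (0, 0, 0)-shaped mask
# or cache replaces None, and paddle.shape(x)[2] > 0 tells real from fake.
fake_mask = paddle.ones([0, 0, 0], dtype=paddle.bool)
real_mask = paddle.ones([1, 1, 16], dtype=paddle.bool)

for mask in (fake_mask, real_mask):
    if paddle.shape(mask)[2] > 0:   # time > 0 -> real mask
        print("apply masking, time =", int(paddle.shape(mask)[2]))
    else:                           # (0, 0, 0) -> fake mask, skip masking
        print("fake mask, skip masking")
```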
diff --git a/paddlespeech/s2t/modules/decoder_layer.py b/paddlespeech/s2t/modules/decoder_layer.py
index b7f8694c..37b124e8 100644
--- a/paddlespeech/s2t/modules/decoder_layer.py
+++ b/paddlespeech/s2t/modules/decoder_layer.py
@@ -121,11 +121,11 @@ class DecoderLayer(nn.Layer):
if self.concat_after:
tgt_concat = paddle.cat(
- (tgt_q, self.self_attn(tgt_q, tgt, tgt, tgt_q_mask)), dim=-1)
+ (tgt_q, self.self_attn(tgt_q, tgt, tgt, tgt_q_mask)[0]), dim=-1)
x = residual + self.concat_linear1(tgt_concat)
else:
x = residual + self.dropout(
- self.self_attn(tgt_q, tgt, tgt, tgt_q_mask))
+ self.self_attn(tgt_q, tgt, tgt, tgt_q_mask)[0])
if not self.normalize_before:
x = self.norm1(x)
@@ -134,11 +134,11 @@ class DecoderLayer(nn.Layer):
x = self.norm2(x)
if self.concat_after:
x_concat = paddle.cat(
- (x, self.src_attn(x, memory, memory, memory_mask)), dim=-1)
+ (x, self.src_attn(x, memory, memory, memory_mask)[0]), dim=-1)
x = residual + self.concat_linear2(x_concat)
else:
x = residual + self.dropout(
- self.src_attn(x, memory, memory, memory_mask))
+ self.src_attn(x, memory, memory, memory_mask)[0])
if not self.normalize_before:
x = self.norm2(x)
diff --git a/paddlespeech/s2t/modules/embedding.py b/paddlespeech/s2t/modules/embedding.py
index 51e558eb..3aeebd29 100644
--- a/paddlespeech/s2t/modules/embedding.py
+++ b/paddlespeech/s2t/modules/embedding.py
@@ -131,7 +131,7 @@ class PositionalEncoding(nn.Layer, PositionalEncodingInterface):
offset (int): start offset
size (int): requried size of position encoding
Returns:
- paddle.Tensor: Corresponding position encoding
+ paddle.Tensor: Corresponding position encoding, with shape [1, T, D].
"""
assert offset + size < self.max_len
return self.dropout(self.pe[:, offset:offset + size])
diff --git a/paddlespeech/s2t/modules/encoder.py b/paddlespeech/s2t/modules/encoder.py
index 4d31acf1..bff2d69b 100644
--- a/paddlespeech/s2t/modules/encoder.py
+++ b/paddlespeech/s2t/modules/encoder.py
@@ -177,7 +177,7 @@ class BaseEncoder(nn.Layer):
decoding_chunk_size, self.static_chunk_size,
num_decoding_left_chunks)
for layer in self.encoders:
- xs, chunk_masks, _ = layer(xs, chunk_masks, pos_emb, mask_pad)
+ xs, chunk_masks, _, _ = layer(xs, chunk_masks, pos_emb, mask_pad)
if self.normalize_before:
xs = self.after_norm(xs)
# Here we assume the mask is not changed in encoder layers, so just
@@ -190,30 +190,31 @@ class BaseEncoder(nn.Layer):
xs: paddle.Tensor,
offset: int,
required_cache_size: int,
- subsampling_cache: Optional[paddle.Tensor]=None,
- elayers_output_cache: Optional[List[paddle.Tensor]]=None,
- conformer_cnn_cache: Optional[List[paddle.Tensor]]=None,
- ) -> Tuple[paddle.Tensor, paddle.Tensor, List[paddle.Tensor], List[
- paddle.Tensor]]:
+ att_cache: paddle.Tensor = paddle.zeros([0,0,0,0]),
+ cnn_cache: paddle.Tensor = paddle.zeros([0,0,0,0]),
+ att_mask: paddle.Tensor = paddle.ones([0,0,0], dtype=paddle.bool),
+ ) -> Tuple[paddle.Tensor, paddle.Tensor, paddle.Tensor]:
""" Forward just one chunk
Args:
- xs (paddle.Tensor): chunk input, [B=1, T, D]
+ xs (paddle.Tensor): chunk audio feat input, [B=1, T, D], where
+ `T==(chunk_size-1)*subsampling_rate + subsample.right_context + 1`
offset (int): current offset in encoder output time stamp
required_cache_size (int): cache size required for next chunk
computation
>=0: actual cache size
<0: means all history cache is required
- subsampling_cache (Optional[paddle.Tensor]): subsampling cache
- elayers_output_cache (Optional[List[paddle.Tensor]]):
- transformer/conformer encoder layers output cache
- conformer_cnn_cache (Optional[List[paddle.Tensor]]): conformer
- cnn cache
+ att_cache (paddle.Tensor): cache tensor for key & val in
+ transformer/conformer attention. Shape is
+ (elayers, head, cache_t1, d_k * 2), where `head * d_k == hidden-dim`
+ and `cache_t1 == chunk_size * num_decoding_left_chunks`.
+ cnn_cache (paddle.Tensor): cache tensor for cnn_module in conformer,
+ (elayers, B=1, hidden-dim, cache_t2), where `cache_t2 == cnn.lorder - 1`
Returns:
- paddle.Tensor: output of current input xs
- paddle.Tensor: subsampling cache required for next chunk computation
- List[paddle.Tensor]: encoder layers output cache required for next
- chunk computation
- List[paddle.Tensor]: conformer cnn cache
+ paddle.Tensor: output of current input xs, (B=1, chunk_size, hidden-dim)
+ paddle.Tensor: new attention cache required for next chunk, dynamic shape
+ (elayers, head, T, d_k*2) depending on required_cache_size
+ paddle.Tensor: new conformer cnn cache required for next chunk, with
+ same shape as the original cnn_cache
"""
assert xs.shape[0] == 1 # batch size must be one
# tmp_masks is just for interface compatibility
@@ -225,50 +226,50 @@ class BaseEncoder(nn.Layer):
if self.global_cmvn is not None:
xs = self.global_cmvn(xs)
- xs, pos_emb, _ = self.embed(
- xs, tmp_masks, offset=offset) #xs=(B, T, D), pos_emb=(B=1, T, D)
+ # before embed, xs=(B, T, D1), pos_emb=(B=1, T, D)
+ xs, pos_emb, _ = self.embed(xs, tmp_masks, offset=offset)
+ # after embed, xs=(B=1, chunk_size, hidden-dim)
- if subsampling_cache is not None:
- cache_size = subsampling_cache.shape[1] #T
- xs = paddle.cat((subsampling_cache, xs), dim=1)
- else:
- cache_size = 0
+ elayers = paddle.shape(att_cache)[0]
+ cache_t1 = paddle.shape(att_cache)[2]
+ chunk_size = paddle.shape(xs)[1]
+ attention_key_size = cache_t1 + chunk_size
# only used when using `RelPositionMultiHeadedAttention`
pos_emb = self.embed.position_encoding(
- offset=offset - cache_size, size=xs.shape[1])
+ offset=offset - cache_t1, size=attention_key_size)
if required_cache_size < 0:
next_cache_start = 0
elif required_cache_size == 0:
- next_cache_start = xs.shape[1]
+ next_cache_start = attention_key_size
else:
- next_cache_start = xs.shape[1] - required_cache_size
- r_subsampling_cache = xs[:, next_cache_start:, :]
-
- # Real mask for transformer/conformer layers
- masks = paddle.ones([1, xs.shape[1]], dtype=paddle.bool)
- masks = masks.unsqueeze(1) #[B=1, L'=1, T]
- r_elayers_output_cache = []
- r_conformer_cnn_cache = []
+ next_cache_start = max(attention_key_size - required_cache_size, 0)
+
+ r_att_cache = []
+ r_cnn_cache = []
for i, layer in enumerate(self.encoders):
- attn_cache = None if elayers_output_cache is None else elayers_output_cache[
- i]
- cnn_cache = None if conformer_cnn_cache is None else conformer_cnn_cache[
- i]
- xs, _, new_cnn_cache = layer(
- xs,
- masks,
- pos_emb,
- output_cache=attn_cache,
- cnn_cache=cnn_cache)
- r_elayers_output_cache.append(xs[:, next_cache_start:, :])
- r_conformer_cnn_cache.append(new_cnn_cache)
+ # att_cache[i:i+1] = (1, head, cache_t1, d_k*2)
+ # cnn_cache[i:i+1] = (1, B=1, hidden-dim, cache_t2)
+ xs, _, new_att_cache, new_cnn_cache = layer(
+ xs, att_mask, pos_emb,
+ att_cache=att_cache[i:i+1] if elayers > 0 else att_cache,
+ cnn_cache=cnn_cache[i:i+1] if paddle.shape(cnn_cache)[0] > 0 else cnn_cache,
+ )
+ # new_att_cache = (1, head, attention_key_size, d_k*2)
+ # new_cnn_cache = (B=1, hidden-dim, cache_t2)
+ r_att_cache.append(new_att_cache[:,:, next_cache_start:, :])
+ r_cnn_cache.append(new_cnn_cache.unsqueeze(0)) # add elayer dim
+
if self.normalize_before:
xs = self.after_norm(xs)
- return (xs[:, cache_size:, :], r_subsampling_cache,
- r_elayers_output_cache, r_conformer_cnn_cache)
+ # r_att_cache (elayers, head, T, d_k*2)
+ # r_cnn_cache (elayers, B=1, hidden-dim, cache_t2)
+ r_att_cache = paddle.concat(r_att_cache, axis=0)
+ r_cnn_cache = paddle.concat(r_cnn_cache, axis=0)
+ return xs, r_att_cache, r_cnn_cache
+
def forward_chunk_by_chunk(
self,
@@ -313,25 +314,24 @@ class BaseEncoder(nn.Layer):
num_frames = xs.shape[1]
required_cache_size = decoding_chunk_size * num_decoding_left_chunks
- subsampling_cache: Optional[paddle.Tensor] = None
- elayers_output_cache: Optional[List[paddle.Tensor]] = None
- conformer_cnn_cache: Optional[List[paddle.Tensor]] = None
+
+ att_cache: paddle.Tensor = paddle.zeros([0,0,0,0])
+ cnn_cache: paddle.Tensor = paddle.zeros([0,0,0,0])
+
outputs = []
offset = 0
# Feed forward overlap input step by step
for cur in range(0, num_frames - context + 1, stride):
end = min(cur + decoding_window, num_frames)
chunk_xs = xs[:, cur:end, :]
- (y, subsampling_cache, elayers_output_cache,
- conformer_cnn_cache) = self.forward_chunk(
- chunk_xs, offset, required_cache_size, subsampling_cache,
- elayers_output_cache, conformer_cnn_cache)
+
+ (y, att_cache, cnn_cache) = self.forward_chunk(
+ chunk_xs, offset, required_cache_size, att_cache, cnn_cache)
+
outputs.append(y)
offset += y.shape[1]
ys = paddle.cat(outputs, 1)
- # fake mask, just for jit script and compatibility with `forward` api
- masks = paddle.ones([1, ys.shape[1]], dtype=paddle.bool)
- masks = masks.unsqueeze(1)
+ masks = paddle.ones([1, 1, ys.shape[1]], dtype=paddle.bool)
return ys, masks
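
A minimal sketch of how the new `forward_chunk` interface is driven, mirroring `forward_chunk_by_chunk` above; `encoder`, `xs`, `stride`, `decoding_window` and `context` are assumed to be prepared as in the surrounding code:

```python
import paddle

# Sketch only: stream an utterance through the encoder chunk by chunk,
# threading the attention KV cache and conformer conv cache between calls.
def stream_encode(encoder, xs, stride, decoding_window, context,
                  decoding_chunk_size=16, num_decoding_left_chunks=4):
    required_cache_size = decoding_chunk_size * num_decoding_left_chunks
    att_cache = paddle.zeros([0, 0, 0, 0])   # empty attention KV cache
    cnn_cache = paddle.zeros([0, 0, 0, 0])   # empty conformer conv cache
    outputs, offset = [], 0
    num_frames = xs.shape[1]
    for cur in range(0, num_frames - context + 1, stride):
        chunk_xs = xs[:, cur:cur + decoding_window, :]
        y, att_cache, cnn_cache = encoder.forward_chunk(
            chunk_xs, offset, required_cache_size, att_cache, cnn_cache)
        outputs.append(y)
        offset += y.shape[1]
    return paddle.concat(outputs, axis=1)
```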
diff --git a/paddlespeech/s2t/modules/encoder_layer.py b/paddlespeech/s2t/modules/encoder_layer.py
index e80a298d..5f810dfd 100644
--- a/paddlespeech/s2t/modules/encoder_layer.py
+++ b/paddlespeech/s2t/modules/encoder_layer.py
@@ -75,49 +75,43 @@ class TransformerEncoderLayer(nn.Layer):
self,
x: paddle.Tensor,
mask: paddle.Tensor,
- pos_emb: Optional[paddle.Tensor]=None,
- mask_pad: Optional[paddle.Tensor]=None,
- output_cache: Optional[paddle.Tensor]=None,
- cnn_cache: Optional[paddle.Tensor]=None,
- ) -> Tuple[paddle.Tensor, paddle.Tensor, paddle.Tensor]:
+ pos_emb: paddle.Tensor,
+ mask_pad: paddle.Tensor=paddle.ones([0, 0, 0], dtype=paddle.bool),
+ att_cache: paddle.Tensor=paddle.zeros([0, 0, 0, 0]),
+ cnn_cache: paddle.Tensor=paddle.zeros([0, 0, 0, 0]),
+ ) -> Tuple[paddle.Tensor, paddle.Tensor, paddle.Tensor, paddle.Tensor]:
"""Compute encoded features.
Args:
- x (paddle.Tensor): Input tensor (#batch, time, size).
- mask (paddle.Tensor): Mask tensor for the input (#batch, time).
+ x (paddle.Tensor): (#batch, time, size)
+ mask (paddle.Tensor): Mask tensor for the input (#batch, time, time),
+ (0, 0, 0) means fake mask.
pos_emb (paddle.Tensor): just for interface compatibility
to ConformerEncoderLayer
- mask_pad (paddle.Tensor): not used here, it's for interface
- compatibility to ConformerEncoderLayer
- output_cache (paddle.Tensor): Cache tensor of the output
- (#batch, time2, size), time2 < time in x.
- cnn_cache (paddle.Tensor): not used here, it's for interface
- compatibility to ConformerEncoderLayer
+ mask_pad (paddle.Tensor): not used in the transformer layer,
+ just kept for a unified API with the conformer layer.
+ att_cache (paddle.Tensor): Cache tensor of the KEY & VALUE
+ (#batch=1, head, cache_t1, d_k * 2), head * d_k == size.
+ cnn_cache (paddle.Tensor): Convolution cache in conformer layer
+ (#batch=1, size, cache_t2), not used here, it's for interface
+ compatibility to ConformerEncoderLayer.
Returns:
paddle.Tensor: Output tensor (#batch, time, size).
- paddle.Tensor: Mask tensor (#batch, time).
- paddle.Tensor: Fake cnn cache tensor for api compatibility with Conformer (#batch, channels, time').
+ paddle.Tensor: Mask tensor (#batch, time, time).
+ paddle.Tensor: att_cache tensor,
+ (#batch=1, head, cache_t1 + time, d_k * 2).
+ paddle.Tensor: cnn_cache tensor (#batch=1, size, cache_t2).
"""
residual = x
if self.normalize_before:
x = self.norm1(x)
- if output_cache is None:
- x_q = x
- else:
- assert output_cache.shape[0] == x.shape[0]
- assert output_cache.shape[1] < x.shape[1]
- assert output_cache.shape[2] == self.size
- chunk = x.shape[1] - output_cache.shape[1]
- x_q = x[:, -chunk:, :]
- residual = residual[:, -chunk:, :]
- mask = mask[:, -chunk:, :]
+ x_att, new_att_cache = self.self_attn(x, x, x, mask, cache=att_cache)
if self.concat_after:
- x_concat = paddle.concat(
- (x, self.self_attn(x_q, x, x, mask)), axis=-1)
+ x_concat = paddle.concat((x, x_att), axis=-1)
x = residual + self.concat_linear(x_concat)
else:
- x = residual + self.dropout(self.self_attn(x_q, x, x, mask))
+ x = residual + self.dropout(x_att)
if not self.normalize_before:
x = self.norm1(x)
@@ -128,11 +122,8 @@ class TransformerEncoderLayer(nn.Layer):
if not self.normalize_before:
x = self.norm2(x)
- if output_cache is not None:
- x = paddle.concat([output_cache, x], axis=1)
-
- fake_cnn_cache = paddle.zeros([1], dtype=x.dtype)
- return x, mask, fake_cnn_cache
+ fake_cnn_cache = paddle.zeros([0, 0, 0], dtype=x.dtype)
+ return x, mask, new_att_cache, fake_cnn_cache
class ConformerEncoderLayer(nn.Layer):
@@ -192,32 +183,44 @@ class ConformerEncoderLayer(nn.Layer):
self.size = size
self.normalize_before = normalize_before
self.concat_after = concat_after
- self.concat_linear = Linear(size + size, size)
+ if self.concat_after:
+ self.concat_linear = Linear(size + size, size)
+ else:
+ self.concat_linear = nn.Identity()
def forward(
self,
x: paddle.Tensor,
mask: paddle.Tensor,
pos_emb: paddle.Tensor,
- mask_pad: Optional[paddle.Tensor]=None,
- output_cache: Optional[paddle.Tensor]=None,
- cnn_cache: Optional[paddle.Tensor]=None,
- ) -> Tuple[paddle.Tensor, paddle.Tensor, paddle.Tensor]:
+ mask_pad: paddle.Tensor=paddle.ones([0, 0, 0], dtype=paddle.bool),
+ att_cache: paddle.Tensor=paddle.zeros([0, 0, 0, 0]),
+ cnn_cache: paddle.Tensor=paddle.zeros([0, 0, 0, 0]),
+ ) -> Tuple[paddle.Tensor, paddle.Tensor, paddle.Tensor, paddle.Tensor]:
"""Compute encoded features.
Args:
- x (paddle.Tensor): (#batch, time, size)
- mask (paddle.Tensor): Mask tensor for the input (#batch, time,time).
- pos_emb (paddle.Tensor): positional encoding, must not be None
- for ConformerEncoderLayer.
- mask_pad (paddle.Tensor): batch padding mask used for conv module, (B, 1, T).
- output_cache (paddle.Tensor): Cache tensor of the encoder output
- (#batch, time2, size), time2 < time in x.
+ x (paddle.Tensor): Input tensor (#batch, time, size).
+ mask (paddle.Tensor): Mask tensor for the input (#batch, time, time).
+ (0, 0, 0) means fake mask.
+ pos_emb (paddle.Tensor): positional encoding, must not be None
+ for ConformerEncoderLayer.
+ mask_pad (paddle.Tensor): batch padding mask used for conv module,
+ (#batch, 1, time), (0, 0, 0) means fake mask.
+ att_cache (paddle.Tensor): Cache tensor of the KEY & VALUE
+ (#batch=1, head, cache_t1, d_k * 2), head * d_k == size.
cnn_cache (paddle.Tensor): Convolution cache in conformer layer
+ (1, #batch=1, size, cache_t2). First dim will not be used, just
+ for dy2st.
Returns:
- paddle.Tensor: Output tensor (#batch, time, size).
- paddle.Tensor: Mask tensor (#batch, time).
- paddle.Tensor: New cnn cache tensor (#batch, channels, time').
+ paddle.Tensor: Output tensor (#batch, time, size).
+ paddle.Tensor: Mask tensor (#batch, time, time).
+ paddle.Tensor: att_cache tensor,
+ (#batch=1, head, cache_t1 + time, d_k * 2).
+ paddle.Tensor: cnn_cache tensor (#batch, size, cache_t2).
"""
+ # (1, #batch=1, size, cache_t2) -> (#batch=1, size, cache_t2)
+ cnn_cache = paddle.squeeze(cnn_cache, axis=0)
+
# whether to use macaron style FFN
if self.feed_forward_macaron is not None:
residual = x
@@ -233,18 +236,8 @@ class ConformerEncoderLayer(nn.Layer):
if self.normalize_before:
x = self.norm_mha(x)
- if output_cache is None:
- x_q = x
- else:
- assert output_cache.shape[0] == x.shape[0]
- assert output_cache.shape[1] < x.shape[1]
- assert output_cache.shape[2] == self.size
- chunk = x.shape[1] - output_cache.shape[1]
- x_q = x[:, -chunk:, :]
- residual = residual[:, -chunk:, :]
- mask = mask[:, -chunk:, :]
-
- x_att = self.self_attn(x_q, x, x, pos_emb, mask)
+ x_att, new_att_cache = self.self_attn(
+ x, x, x, mask, pos_emb, cache=att_cache)
if self.concat_after:
x_concat = paddle.concat((x, x_att), axis=-1)
@@ -257,7 +250,7 @@ class ConformerEncoderLayer(nn.Layer):
# convolution module
# Fake new cnn cache here, and then change it in conv_module
- new_cnn_cache = paddle.zeros([1], dtype=x.dtype)
+ new_cnn_cache = paddle.zeros([0, 0, 0], dtype=x.dtype)
if self.conv_module is not None:
residual = x
if self.normalize_before:
@@ -282,7 +275,4 @@ class ConformerEncoderLayer(nn.Layer):
if self.conv_module is not None:
x = self.norm_final(x)
- if output_cache is not None:
- x = paddle.concat([output_cache, x], axis=1)
-
- return x, mask, new_cnn_cache
+ return x, mask, new_att_cache, new_cnn_cache
diff --git a/paddlespeech/s2t/modules/initializer.py b/paddlespeech/s2t/modules/initializer.py
index 30a04e44..cdcf2e05 100644
--- a/paddlespeech/s2t/modules/initializer.py
+++ b/paddlespeech/s2t/modules/initializer.py
@@ -12,142 +12,6 @@
# See the License for the specific language governing permissions and
# limitations under the License.
import numpy as np
-from paddle.fluid import framework
-from paddle.fluid import unique_name
-from paddle.fluid.core import VarDesc
-from paddle.fluid.initializer import MSRAInitializer
-
-__all__ = ['KaimingUniform']
-
-
-class KaimingUniform(MSRAInitializer):
- r"""Implements the Kaiming Uniform initializer
-
- This class implements the weight initialization from the paper
- `Delving Deep into Rectifiers: Surpassing Human-Level Performance on
- ImageNet Classification `_
- by Kaiming He, Xiangyu Zhang, Shaoqing Ren and Jian Sun. This is a
- robust initialization method that particularly considers the rectifier
- nonlinearities.
-
- In case of Uniform distribution, the range is [-x, x], where
-
- .. math::
-
- x = \sqrt{\frac{1.0}{fan\_in}}
-
- In case of Normal distribution, the mean is 0 and the standard deviation
- is
-
- .. math::
-
- \sqrt{\\frac{2.0}{fan\_in}}
-
- Args:
- fan_in (float32|None): fan_in for Kaiming uniform Initializer. If None, it is\
- inferred from the variable. default is None.
-
- Note:
- It is recommended to set fan_in to None for most cases.
-
- Examples:
- .. code-block:: python
-
- import paddle
- import paddle.nn as nn
-
- linear = nn.Linear(2,
- 4,
- weight_attr=nn.initializer.KaimingUniform())
- data = paddle.rand([30, 10, 2], dtype='float32')
- res = linear(data)
-
- """
-
- def __init__(self, fan_in=None):
- super(KaimingUniform, self).__init__(
- uniform=True, fan_in=fan_in, seed=0)
-
- def __call__(self, var, block=None):
- """Initialize the input tensor with MSRA initialization.
-
- Args:
- var(Tensor): Tensor that needs to be initialized.
- block(Block, optional): The block in which initialization ops
- should be added. Used in static graph only, default None.
-
- Returns:
- The initialization op
- """
- block = self._check_block(block)
-
- assert isinstance(var, framework.Variable)
- assert isinstance(block, framework.Block)
- f_in, f_out = self._compute_fans(var)
-
- # If fan_in is passed, use it
- fan_in = f_in if self._fan_in is None else self._fan_in
-
- if self._seed == 0:
- self._seed = block.program.random_seed
-
- # to be compatible of fp16 initalizers
- if var.dtype == VarDesc.VarType.FP16 or (
- var.dtype == VarDesc.VarType.BF16 and not self._uniform):
- out_dtype = VarDesc.VarType.FP32
- out_var = block.create_var(
- name=unique_name.generate(
- ".".join(['masra_init', var.name, 'tmp'])),
- shape=var.shape,
- dtype=out_dtype,
- type=VarDesc.VarType.LOD_TENSOR,
- persistable=False)
- else:
- out_dtype = var.dtype
- out_var = var
-
- if self._uniform:
- limit = np.sqrt(1.0 / float(fan_in))
- op = block.append_op(
- type="uniform_random",
- inputs={},
- outputs={"Out": out_var},
- attrs={
- "shape": out_var.shape,
- "dtype": int(out_dtype),
- "min": -limit,
- "max": limit,
- "seed": self._seed
- },
- stop_gradient=True)
-
- else:
- std = np.sqrt(2.0 / float(fan_in))
- op = block.append_op(
- type="gaussian_random",
- outputs={"Out": out_var},
- attrs={
- "shape": out_var.shape,
- "dtype": int(out_dtype),
- "mean": 0.0,
- "std": std,
- "seed": self._seed
- },
- stop_gradient=True)
-
- if var.dtype == VarDesc.VarType.FP16 or (
- var.dtype == VarDesc.VarType.BF16 and not self._uniform):
- block.append_op(
- type="cast",
- inputs={"X": out_var},
- outputs={"Out": var},
- attrs={"in_dtype": out_var.dtype,
- "out_dtype": var.dtype})
-
- if not framework.in_dygraph_mode():
- var.op = op
- return op
-
class DefaultInitializerContext(object):
"""
diff --git a/paddlespeech/server/bin/paddlespeech_client.py b/paddlespeech/server/bin/paddlespeech_client.py
index e8e57fff..96368c0f 100644
--- a/paddlespeech/server/bin/paddlespeech_client.py
+++ b/paddlespeech/server/bin/paddlespeech_client.py
@@ -718,6 +718,7 @@ class VectorClientExecutor(BaseExecutor):
logger.info(f"the input audio: {input}")
handler = VectorHttpHandler(server_ip=server_ip, port=port)
res = handler.run(input, audio_format, sample_rate)
+ logger.info(f"The spk embedding is: {res}")
return res
elif task == "score":
from paddlespeech.server.utils.audio_handler import VectorScoreHttpHandler
diff --git a/paddlespeech/server/engine/asr/online/ctc_endpoint.py b/paddlespeech/server/engine/asr/online/ctc_endpoint.py
index 2dba3641..b87dbe80 100644
--- a/paddlespeech/server/engine/asr/online/ctc_endpoint.py
+++ b/paddlespeech/server/engine/asr/online/ctc_endpoint.py
@@ -39,10 +39,10 @@ class OnlineCTCEndpoingOpt:
# rule1 times out after 5 seconds of silence, even if we decoded nothing.
rule1: OnlineCTCEndpointRule = OnlineCTCEndpointRule(False, 5000, 0)
- # rule4 times out after 1.0 seconds of silence after decoding something,
+ # rule2 times out after 1.0 seconds of silence after decoding something,
# even if we did not reach a final-state at all.
rule2: OnlineCTCEndpointRule = OnlineCTCEndpointRule(True, 1000, 0)
- # rule5 times out after the utterance is 20 seconds long, regardless of
+ # rule3 times out after the utterance is 20 seconds long, regardless of
# anything else.
rule3: OnlineCTCEndpointRule = OnlineCTCEndpointRule(False, 0, 20000)
@@ -102,7 +102,8 @@ class OnlineCTCEndpoint:
assert self.num_frames_decoded >= self.trailing_silence_frames
assert self.frame_shift_in_ms > 0
-
+
+ decoding_something = (self.num_frames_decoded > self.trailing_silence_frames) and decoding_something
utterance_length = self.num_frames_decoded * self.frame_shift_in_ms
trailing_silence = self.trailing_silence_frames * self.frame_shift_in_ms
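
The added guard above makes the "silence after decoding something" rule fire only when some non-silence frames have actually been decoded. A tiny numeric sketch of what the condition does; the values are made up:

```python
# Sketch: rule2 should only apply when decoded frames exceed trailing silence.
num_frames_decoded = 120        # assumed example values
trailing_silence_frames = 120
decoding_something = True       # e.g. the partial hypothesis is non-empty

decoding_something = (num_frames_decoded > trailing_silence_frames) and decoding_something
print(decoding_something)       # False: everything decoded so far is silence
```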
diff --git a/paddlespeech/server/engine/asr/online/python/asr_engine.py b/paddlespeech/server/engine/asr/online/python/asr_engine.py
index 2bacfecd..4df38f09 100644
--- a/paddlespeech/server/engine/asr/online/python/asr_engine.py
+++ b/paddlespeech/server/engine/asr/online/python/asr_engine.py
@@ -130,9 +130,9 @@ class PaddleASRConnectionHanddler:
## conformer
# cache for conformer online
- self.subsampling_cache = None
- self.elayers_output_cache = None
- self.conformer_cnn_cache = None
+ self.att_cache = paddle.zeros([0,0,0,0])
+ self.cnn_cache = paddle.zeros([0,0,0,0])
+
self.encoder_out = None
# conformer decoding state
self.offset = 0 # global offset in decoding frame unit
@@ -474,11 +474,9 @@ class PaddleASRConnectionHanddler:
# cur chunk
chunk_xs = self.cached_feat[:, cur:end, :]
# forward chunk
- (y, self.subsampling_cache, self.elayers_output_cache,
- self.conformer_cnn_cache) = self.model.encoder.forward_chunk(
+ (y, self.att_cache, self.cnn_cache) = self.model.encoder.forward_chunk(
chunk_xs, self.offset, required_cache_size,
- self.subsampling_cache, self.elayers_output_cache,
- self.conformer_cnn_cache)
+ self.att_cache, self.cnn_cache)
outputs.append(y)
# update the global offset, in decoding frame unit
diff --git a/paddlespeech/server/engine/engine_warmup.py b/paddlespeech/server/engine/engine_warmup.py
index 12c760c6..3751554c 100644
--- a/paddlespeech/server/engine/engine_warmup.py
+++ b/paddlespeech/server/engine/engine_warmup.py
@@ -60,7 +60,10 @@ def warm_up(engine_and_type: str, warm_up_time: int=3) -> bool:
else:
st = time.time()
- connection_handler.infer(text=sentence)
+ connection_handler.infer(
+ text=sentence,
+ lang=tts_engine.lang,
+ am=tts_engine.config.am)
et = time.time()
logger.debug(
f"The response time of the {i} warm up: {et - st} s")
diff --git a/paddlespeech/t2s/exps/sentences_mix.txt b/paddlespeech/t2s/exps/sentences_mix.txt
new file mode 100644
index 00000000..06e97d14
--- /dev/null
+++ b/paddlespeech/t2s/exps/sentences_mix.txt
@@ -0,0 +1,8 @@
+001 你好,欢迎使用 Paddle Speech 中英文混合 T T S 功能,开始你的合成之旅吧!
+002 我们的声学模型使用了 Fast Speech Two, 声码器使用了 Parallel Wave GAN and Hifi GAN.
+003 Paddle N L P 发布 ERNIE Tiny 全系列中文预训练小模型,快速提升预训练模型部署效率,通用信息抽取技术 U I E Tiny 系列模型全新升级,支持速度更快效果更好的 U I E 小模型。
+004 Paddle Speech 发布 P P A S R 流式语音识别系统、P P T T S 流式语音合成系统、P P V P R 全链路声纹识别系统。
+005 Paddle Bo Bo: 使用 Paddle Speech 的语音合成模块生成虚拟人的声音。
+006 热烈欢迎您在 Discussions 中提交问题,并在 Issues 中指出发现的 bug。此外,我们非常希望您参与到 Paddle Speech 的开发中!
+007 我喜欢 eat apple, 你喜欢 drink milk。
+008 我们要去云南 team building, 非常非常 happy.
\ No newline at end of file
diff --git a/paddlespeech/t2s/exps/syn_utils.py b/paddlespeech/t2s/exps/syn_utils.py
index cabea989..77abf97d 100644
--- a/paddlespeech/t2s/exps/syn_utils.py
+++ b/paddlespeech/t2s/exps/syn_utils.py
@@ -29,6 +29,7 @@ from yacs.config import CfgNode
from paddlespeech.t2s.datasets.data_table import DataTable
from paddlespeech.t2s.frontend import English
+from paddlespeech.t2s.frontend.mix_frontend import MixFrontend
from paddlespeech.t2s.frontend.zh_frontend import Frontend
from paddlespeech.t2s.modules.normalizer import ZScore
from paddlespeech.utils.dynamic_import import dynamic_import
@@ -98,6 +99,8 @@ def get_sentences(text_file: Optional[os.PathLike], lang: str='zh'):
sentence = "".join(items[1:])
elif lang == 'en':
sentence = " ".join(items[1:])
+ elif lang == 'mix':
+ sentence = " ".join(items[1:])
sentences.append((utt_id, sentence))
return sentences
@@ -111,7 +114,8 @@ def get_test_dataset(test_metadata: List[Dict[str, Any]],
am_dataset = am[am.rindex('_') + 1:]
if am_name == 'fastspeech2':
fields = ["utt_id", "text"]
- if am_dataset in {"aishell3", "vctk"} and speaker_dict is not None:
+ if am_dataset in {"aishell3", "vctk",
+ "mix"} and speaker_dict is not None:
print("multiple speaker fastspeech2!")
fields += ["spk_id"]
elif voice_cloning:
@@ -140,6 +144,10 @@ def get_frontend(lang: str='zh',
phone_vocab_path=phones_dict, tone_vocab_path=tones_dict)
elif lang == 'en':
frontend = English(phone_vocab_path=phones_dict)
+ elif lang == 'mix':
+ frontend = MixFrontend(
+ phone_vocab_path=phones_dict, tone_vocab_path=tones_dict)
+
else:
print("wrong lang!")
print("frontend done!")
@@ -341,8 +349,12 @@ def get_am_output(
input_ids = frontend.get_input_ids(
input, merge_sentences=merge_sentences)
phone_ids = input_ids["phone_ids"]
+ elif lang == 'mix':
+ input_ids = frontend.get_input_ids(
+ input, merge_sentences=merge_sentences)
+ phone_ids = input_ids["phone_ids"]
else:
- print("lang should in {'zh', 'en'}!")
+ print("lang should in {'zh', 'en', 'mix'}!")
if get_tone_ids:
tone_ids = input_ids["tone_ids"]
diff --git a/paddlespeech/t2s/exps/synthesize_e2e.py b/paddlespeech/t2s/exps/synthesize_e2e.py
index 28657eb2..ef954329 100644
--- a/paddlespeech/t2s/exps/synthesize_e2e.py
+++ b/paddlespeech/t2s/exps/synthesize_e2e.py
@@ -113,8 +113,12 @@ def evaluate(args):
input_ids = frontend.get_input_ids(
sentence, merge_sentences=merge_sentences)
phone_ids = input_ids["phone_ids"]
+ elif args.lang == 'mix':
+ input_ids = frontend.get_input_ids(
+ sentence, merge_sentences=merge_sentences)
+ phone_ids = input_ids["phone_ids"]
else:
- print("lang should in {'zh', 'en'}!")
+ print("lang should in {'zh', 'en', 'mix'}!")
with paddle.no_grad():
flags = 0
for i in range(len(phone_ids)):
@@ -122,7 +126,7 @@ def evaluate(args):
# acoustic model
if am_name == 'fastspeech2':
# multi speaker
- if am_dataset in {"aishell3", "vctk"}:
+ if am_dataset in {"aishell3", "vctk", "mix"}:
spk_id = paddle.to_tensor(args.spk_id)
mel = am_inference(part_phone_ids, spk_id)
else:
@@ -170,7 +174,7 @@ def parse_args():
choices=[
'speedyspeech_csmsc', 'speedyspeech_aishell3', 'fastspeech2_csmsc',
'fastspeech2_ljspeech', 'fastspeech2_aishell3', 'fastspeech2_vctk',
- 'tacotron2_csmsc', 'tacotron2_ljspeech'
+ 'tacotron2_csmsc', 'tacotron2_ljspeech', 'fastspeech2_mix'
],
help='Choose acoustic model type of tts task.')
parser.add_argument(
@@ -231,7 +235,7 @@ def parse_args():
'--lang',
type=str,
default='zh',
- help='Choose model language. zh or en')
+ help='Choose model language. zh or en or mix')
parser.add_argument(
"--inference_dir",
diff --git a/paddlespeech/t2s/frontend/mix_frontend.py b/paddlespeech/t2s/frontend/mix_frontend.py
new file mode 100644
index 00000000..6386c871
--- /dev/null
+++ b/paddlespeech/t2s/frontend/mix_frontend.py
@@ -0,0 +1,179 @@
+# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import re
+from typing import Dict
+from typing import List
+
+import paddle
+
+from paddlespeech.t2s.frontend import English
+from paddlespeech.t2s.frontend.zh_frontend import Frontend
+
+
+class MixFrontend():
+ def __init__(self,
+ g2p_model="pypinyin",
+ phone_vocab_path=None,
+ tone_vocab_path=None):
+
+ self.zh_frontend = Frontend(
+ phone_vocab_path=phone_vocab_path, tone_vocab_path=tone_vocab_path)
+ self.en_frontend = English(phone_vocab_path=phone_vocab_path)
+ self.SENTENCE_SPLITOR = re.compile(r'([:、,;。?!,;?!][”’]?)')
+ self.sp_id = self.zh_frontend.vocab_phones["sp"]
+ self.sp_id_tensor = paddle.to_tensor([self.sp_id])
+
+ def is_chinese(self, char):
+ if char >= '\u4e00' and char <= '\u9fa5':
+ return True
+ else:
+ return False
+
+ def is_alphabet(self, char):
+ if (char >= '\u0041' and char <= '\u005a') or (char >= '\u0061' and
+ char <= '\u007a'):
+ return True
+ else:
+ return False
+
+ def is_number(self, char):
+ if char >= '\u0030' and char <= '\u0039':
+ return True
+ else:
+ return False
+
+ def is_other(self, char):
+ if not (self.is_chinese(char) or self.is_number(char) or
+ self.is_alphabet(char)):
+ return True
+ else:
+ return False
+
+ def _split(self, text: str) -> List[str]:
+ text = re.sub(r'[《》【】<=>{}()()#&@“”^_|…\\]', '', text)
+ text = self.SENTENCE_SPLITOR.sub(r'\1\n', text)
+ text = text.strip()
+ sentences = [sentence.strip() for sentence in re.split(r'\n+', text)]
+ return sentences
+
+ def _distinguish(self, text: str) -> List[str]:
+ # sentence --> [ch_part, en_part, ch_part, ...]
+
+ segments = []
+ types = []
+
+ flag = 0
+ temp_seg = ""
+ temp_lang = ""
+
+ # Determine the type of each character. type: blank, chinese, alphabet, number, unk.
+ for ch in text:
+ if self.is_chinese(ch):
+ types.append("zh")
+ elif self.is_alphabet(ch):
+ types.append("en")
+ elif ch == " ":
+ types.append("blank")
+ elif self.is_number(ch):
+ types.append("num")
+ else:
+ types.append("unk")
+
+ assert len(types) == len(text)
+
+ for i in range(len(types)):
+
+ # find the first char of the seg
+ if flag == 0:
+ if types[i] != "unk" and types[i] != "blank":
+ temp_seg += text[i]
+ temp_lang = types[i]
+ flag = 1
+
+ else:
+ if types[i] == temp_lang or types[i] == "num":
+ temp_seg += text[i]
+
+ elif temp_lang == "num" and types[i] != "unk":
+ temp_seg += text[i]
+ if types[i] == "zh" or types[i] == "en":
+ temp_lang = types[i]
+
+ elif temp_lang == "en" and types[i] == "blank":
+ temp_seg += text[i]
+
+ elif types[i] == "unk":
+ pass
+
+ else:
+ segments.append((temp_seg, temp_lang))
+
+ if types[i] != "unk" and types[i] != "blank":
+ temp_seg = text[i]
+ temp_lang = types[i]
+ flag = 1
+ else:
+ flag = 0
+ temp_seg = ""
+ temp_lang = ""
+
+ segments.append((temp_seg, temp_lang))
+
+ return segments
+
+ def get_input_ids(self,
+ sentence: str,
+ merge_sentences: bool=True,
+ get_tone_ids: bool=False,
+ add_sp: bool=True) -> Dict[str, List[paddle.Tensor]]:
+
+ sentences = self._split(sentence)
+ phones_list = []
+ result = {}
+
+ for text in sentences:
+ phones_seg = []
+ segments = self._distinguish(text)
+ for seg in segments:
+ content = seg[0]
+ lang = seg[1]
+ if lang == "zh":
+ input_ids = self.zh_frontend.get_input_ids(
+ content,
+ merge_sentences=True,
+ get_tone_ids=get_tone_ids)
+
+ elif lang == "en":
+ input_ids = self.en_frontend.get_input_ids(
+ content, merge_sentences=True)
+
+ phones_seg.append(input_ids["phone_ids"][0])
+ if add_sp:
+ phones_seg.append(self.sp_id_tensor)
+
+ phones = paddle.concat(phones_seg)
+ phones_list.append(phones)
+
+ if merge_sentences:
+ merge_list = paddle.concat(phones_list)
+ # remove the last 'sp' to avoid noise at the end,
+ # because there is no 'sp' at the end of the training data
+ if merge_list[-1] == self.sp_id_tensor:
+ merge_list = merge_list[:-1]
+ phones_list = []
+ phones_list.append(merge_list)
+
+ result["phone_ids"] = phones_list
+
+ return result
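
A minimal usage sketch for the new `MixFrontend`; the `phone_id_map.txt` path is a placeholder for the phone vocabulary shipped with the fastspeech2_mix acoustic model:

```python
from paddlespeech.t2s.frontend.mix_frontend import MixFrontend

# Sketch only: convert a mixed Chinese/English sentence into phone ids.
frontend = MixFrontend(phone_vocab_path="phone_id_map.txt")  # placeholder path
sentence = "我们的声学模型使用了 Fast Speech Two."
input_ids = frontend.get_input_ids(sentence, merge_sentences=True, add_sp=True)
phone_ids = input_ids["phone_ids"]  # list of paddle.Tensor, one per merged sentence
print(phone_ids[0].shape)
```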
diff --git a/setup.py b/setup.py
index c90d037e..1cc82fa7 100644
--- a/setup.py
+++ b/setup.py
@@ -72,7 +72,8 @@ base = [
"colorlog",
"pathos == 0.2.8",
"braceexpand",
- "pyyaml"
+ "pyyaml",
+ "pybind11",
]
server = [
@@ -91,7 +92,6 @@ requirements = {
"gpustat",
"paddlespeech_ctcdecoders",
"phkit",
- "pybind11",
"pypi-kenlm",
"snakeviz",
"sox",
diff --git a/third_party/README.md b/third_party/README.md
index 843d0d3b..98e03b0a 100644
--- a/third_party/README.md
+++ b/third_party/README.md
@@ -1,27 +1,26 @@
-* [python_kaldi_features](https://github.com/ZitengWang/python_kaldi_features)
+# python_kaldi_features
+
+[python_kaldi_features](https://github.com/ZitengWang/python_kaldi_features)
commit: fc1bd6240c2008412ab64dc25045cd872f5e126c
ref: https://zhuanlan.zhihu.com/p/55371926
license: MIT
-* [python-pinyin](https://github.com/mozillazg/python-pinyin.git)
-commit: 55e524aa1b7b8eec3d15c5306043c6cdd5938b03
-license: MIT
+# Install ctc_decoder for Windows
-* [zhon](https://github.com/tsroten/zhon)
-commit: 09bf543696277f71de502506984661a60d24494c
-license: MIT
+`install_win_ctc.bat` is a bat script to install paddlespeech_ctc_decoders on Windows.
-* [pymmseg-cpp](https://github.com/pluskid/pymmseg-cpp.git)
-commit: b76465045717fbb4f118c4fbdd24ce93bab10a6d
-license: MIT
+## Prepare your environment
-* [chinese_text_normalization](https://github.com/speechio/chinese_text_normalization.git)
-commit: 9e92c7bf2d6b5a7974305406d8e240045beac51c
-license: MIT
+Ensure your environment meets the following requirements:
-* [phkit](https://github.com/KuangDD/phkit.git)
-commit: b2100293c1e36da531d7f30bd52c9b955a649522
-license: None
+* gcc: version >= 12.1.0
+* cmake: version >= 3.24.0
+* make: version >= 3.82.90
+* visual studio: version >= 2019
-* [nnAudio](https://github.com/KinWaiCheuk/nnAudio.git)
-license: MIT
+## Start your bat script
+
+```shell
+start install_win_ctc.bat
+
+```
diff --git a/third_party/ctc_decoders/scorer.cpp b/third_party/ctc_decoders/scorer.cpp
index 6c1d96be..6e7f68cf 100644
--- a/third_party/ctc_decoders/scorer.cpp
+++ b/third_party/ctc_decoders/scorer.cpp
@@ -13,7 +13,8 @@
#include "decoder_utils.h"
using namespace lm::ngram;
-
+// if your platform is Windows, you need to add this define
+#define F_OK 0
Scorer::Scorer(double alpha,
double beta,
const std::string& lm_path,
diff --git a/third_party/ctc_decoders/setup.py b/third_party/ctc_decoders/setup.py
index ce2787e3..9a8b292a 100644
--- a/third_party/ctc_decoders/setup.py
+++ b/third_party/ctc_decoders/setup.py
@@ -89,10 +89,11 @@ FILES = [
or fn.endswith('unittest.cc'))
]
# yapf: enable
-
LIBS = ['stdc++']
if platform.system() != 'Darwin':
LIBS.append('rt')
+if platform.system() == 'Windows':
+ LIBS = ['-static-libstdc++']
ARGS = ['-O3', '-DNDEBUG', '-DKENLM_MAX_ORDER=6', '-std=c++11']
diff --git a/third_party/install_win_ctc.bat b/third_party/install_win_ctc.bat
new file mode 100644
index 00000000..0bf1e7bb
--- /dev/null
+++ b/third_party/install_win_ctc.bat
@@ -0,0 +1,21 @@
+@echo off
+
+cd ctc_decoders
+if not exist kenlm (
+ git clone https://github.com/Doubledongli/kenlm.git
+ @echo.
+)
+
+if not exist openfst-1.6.3 (
+ echo "Download and extract openfst ..."
+ git clone https://gitee.com/koala999/openfst.git
+ ren openfst openfst-1.6.3
+ @echo.
+)
+
+if not exist ThreadPool (
+ git clone https://github.com/progschj/ThreadPool.git
+ @echo.
+)
+echo "Install decoders ..."
+python setup.py install --num_processes 4
\ No newline at end of file