Merge branch 'develop' of github.com:PaddlePaddle/PaddleSpeech into fix_name_bug

pull/2189/head
TianYuan 3 years ago
commit 7bbd9097a1

@ -25,7 +25,7 @@
| <a href="#documents"> Documents </a> | <a href="#documents"> Documents </a>
| <a href="#model-list"> Models List </a> | <a href="#model-list"> Models List </a>
| <a href="https://aistudio.baidu.com/aistudio/education/group/info/25130"> AIStudio Courses </a> | <a href="https://aistudio.baidu.com/aistudio/education/group/info/25130"> AIStudio Courses </a>
| <a href="https://arxiv.org/abs/2205.12007"> NAACL2022 Paper </a> | <a href="https://arxiv.org/abs/2205.12007"> NAACL2022 Best Demo Award Paper </a>
| <a href="https://gitee.com/paddlepaddle/PaddleSpeech"> Gitee </a> | <a href="https://gitee.com/paddlepaddle/PaddleSpeech"> Gitee </a>
</h4> </h4>
</div> </div>
@ -34,7 +34,7 @@
**PaddleSpeech** is an open-source toolkit on the [PaddlePaddle](https://github.com/PaddlePaddle/Paddle) platform for a variety of critical tasks in speech and audio, with state-of-the-art and influential models. **PaddleSpeech** is an open-source toolkit on the [PaddlePaddle](https://github.com/PaddlePaddle/Paddle) platform for a variety of critical tasks in speech and audio, with state-of-the-art and influential models.
**PaddleSpeech** won the [NAACL2022 Best Demo Award](https://2022.naacl.org/blog/best-demo-award/). **PaddleSpeech** won the [NAACL2022 Best Demo Award](https://2022.naacl.org/blog/best-demo-award/); please check out our paper on [arXiv](https://arxiv.org/abs/2205.12007).
##### Speech Recognition ##### Speech Recognition
@ -179,7 +179,7 @@ Via the easy-to-use, efficient, flexible and scalable implementation, our vision
## Installation ## Installation
We strongly recommend installing PaddleSpeech on **Linux** with *python>=3.7*. We strongly recommend installing PaddleSpeech on **Linux** with *python>=3.7* and *paddlepaddle>=2.3.1*.
So far, **Linux** supports the CLI for all our tasks, while **Mac OSX** and **Windows** only support the PaddleSpeech CLI for Audio Classification, Speech-to-Text and Text-to-Speech. To install `PaddleSpeech`, please see [installation](./docs/source/install.md). So far, **Linux** supports the CLI for all our tasks, while **Mac OSX** and **Windows** only support the PaddleSpeech CLI for Audio Classification, Speech-to-Text and Text-to-Speech. To install `PaddleSpeech`, please see [installation](./docs/source/install.md).
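The version requirements stated in this hunk (*python>=3.7*, *paddlepaddle>=2.3.1*) can be checked before installing. The following is a minimal sketch, not part of PaddleSpeech: the helper names `parse_version`, `meets_minimum` and `installed_version` are hypothetical, and the check assumes the package is distributed as `paddlepaddle` or `paddlepaddle-gpu`. Pre-release suffixes such as `rc0` are ignored by this simple parser.

```python
import re
import sys


def parse_version(text):
    """Turn '2.3.1' or '2.3.1.post101' into a tuple of leading integers."""
    parts = []
    for piece in text.split("."):
        match = re.match(r"\d+", piece)
        if match is None:
            break  # stop at the first non-numeric component
        parts.append(int(match.group()))
    return tuple(parts)


def meets_minimum(installed, minimum):
    """True when the installed version string is at least the minimum."""
    return parse_version(installed) >= parse_version(minimum)


def installed_version(package):
    """Return the installed version string of a package, or None if unknown."""
    try:
        from importlib import metadata  # available from Python 3.8
    except ImportError:
        return None
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    print("python ok:", sys.version_info >= (3, 7))
    for name in ("paddlepaddle", "paddlepaddle-gpu"):
        found = installed_version(name)
        if found is not None:
            print(name, found, "ok:", meets_minimum(found, "2.3.1"))
```

Comparing tuples of integers avoids the usual string-comparison pitfall, where `"2.10.0"` would sort before `"2.3.1"`.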

@ -20,7 +20,8 @@
</p> </p>
<div align="center"> <div align="center">
<h4> <h4>
<a href="#快速开始"> Quick Start </a> <a href="#安装"> Installation </a>
| <a href="#快速开始"> Quick Start </a>
| <a href="#快速使用服务"> Quick Start Server </a> | <a href="#快速使用服务"> Quick Start Server </a>
| <a href="#快速使用流式服务"> Quick Start Streaming Server </a> | <a href="#快速使用流式服务"> Quick Start Streaming Server </a>
| <a href="#教程文档"> Documents </a> | <a href="#教程文档"> Documents </a>
@ -36,7 +37,9 @@
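**PaddleSpeech** is an open-source speech model library based on [PaddlePaddle](https://github.com/PaddlePaddle/Paddle), used to develop a variety of critical tasks in speech and audio. It contains a large number of cutting-edge and influential deep-learning models. Some typical applications are shown below: **PaddleSpeech** is an open-source speech model library based on [PaddlePaddle](https://github.com/PaddlePaddle/Paddle), used to develop a variety of critical tasks in speech and audio. It contains a large number of cutting-edge and influential deep-learning models. Some typical applications are shown below:
**PaddleSpeech** won the [NAACL2022 Best Demo Award](https://2022.naacl.org/blog/best-demo-award/). **PaddleSpeech** won the [NAACL2022 Best Demo Award](https://2022.naacl.org/blog/best-demo-award/); please see the paper on [Arxiv](https://arxiv.org/abs/2205.12007).
### Showcase
##### Speech Recognition ##### Speech Recognition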
@ -154,7 +157,7 @@
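This project adopts an easy-to-use, efficient, flexible and scalable implementation, aiming to better support industrial applications and academic research. It covers training, inference and testing modules as well as deployment, and mainly includes: This project adopts an easy-to-use, efficient, flexible and scalable implementation, aiming to better support industrial applications and academic research. It covers training, inference and testing modules as well as deployment, and mainly includes:
- 📦 **Ease of use**: low barrier to installation; use the [CLI](#quick-start) to get started quickly. - 📦 **Ease of use**: low barrier to installation; use the [CLI](#quick-start) to get started quickly.
- 🏆 **On par with SoTA**: provides high-speed, lightweight models that draw on cutting-edge techniques. - 🏆 **On par with SoTA**: provides high-speed, lightweight models that draw on cutting-edge techniques.
- 🏆 **Streaming ASR and TTS system**: production-grade end-to-end streaming recognition and streaming synthesis systems. - 🏆 **Streaming ASR and TTS system**: production-grade end-to-end streaming recognition and streaming synthesis systems.
- 💯 **Rule-based Chinese frontend**: our frontend includes text normalization and grapheme-to-phoneme conversion (G2P). In addition, we use custom linguistic rules to adapt to the Chinese context. - 💯 **Rule-based Chinese frontend**: our frontend includes text normalization and grapheme-to-phoneme conversion (G2P). In addition, we use custom linguistic rules to adapt to the Chinese context.
- **Support for many mainstream industrial and academic features**: - **Support for many mainstream industrial and academic features**:
- 🛎️ Typical audio tasks: this toolkit implements audio tasks such as audio classification, speech translation, automatic speech recognition, text-to-speech, speech synthesis, speaker verification and KWS. - 🛎️ Typical audio tasks: this toolkit implements audio tasks such as audio classification, speech translation, automatic speech recognition, text-to-speech, speech synthesis, speaker verification and KWS.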
@ -182,61 +185,195 @@
<img src="https://user-images.githubusercontent.com/23690325/169763015-cbd8e28d-602c-4723-810d-dbc6da49441e.jpg" width = "200" /> <img src="https://user-images.githubusercontent.com/23690325/169763015-cbd8e28d-602c-4723-810d-dbc6da49441e.jpg" width = "200" />
</div> </div>
<a name="安装"></a>
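## Installation ## Installation
We strongly recommend installing PaddleSpeech on **Linux** with *python* version *3.7* or above. We strongly recommend installing PaddleSpeech on **Linux** with *python* version *3.7* or above.
So far, **Linux** supports four features: audio classification, speech recognition, speech synthesis and speech translation; **Mac OSX and Windows** do not yet support speech translation. For installation details, see the [installation document](./docs/source/install_cn.md).
### Dependencies
+ gcc >= 4.8.5
+ paddlepaddle >= 2.3.1
+ python >= 3.7
+ linux (recommended), mac, windows
PaddleSpeech depends on paddlepaddle. For installation, see the [paddlepaddle website](https://www.paddlepaddle.org.cn/) and choose the build that matches your machine. A CPU example is given here; install other versions according to your machine.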
```shell
pip install paddlepaddle -i https://mirror.baidu.com/pypi/simple
```
There are two quick ways to install PaddleSpeech: pip installation, and building from source (recommended).
### pip Installation
```shell
pip install pytest-runner
pip install paddlespeech
```
### Build from Source
```shell
git clone https://github.com/PaddlePaddle/PaddleSpeech.git
cd PaddleSpeech
pip install pytest-runner
pip install .
```
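For more installation topics, such as the conda environment, the system libraries that librosa depends on, gcc environment issues and kaldi installation, see the [installation document](docs/source/install_cn.md). If you run into installation problems, you can leave a comment on [#2150](https://github.com/PaddlePaddle/PaddleSpeech/issues/2150) and search for related issues there.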
<a name="快速开始"></a> <a name="快速开始"></a>
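## Quick Start ## Quick Start
After installation, developers can get started quickly from the command line; change `--input` to test with your own audio or text. After installation, developers can get started quickly from the command line or Python. In command-line mode, change `--input` to test with your own audio or text; 16k wav audio is supported.
**Audio Classification** You can also try it out quickly in `aistudio` 👉🏻[PaddleSpeech API Demo ](https://aistudio.baidu.com/aistudio/projectdetail/4281335?shared=1).
Download the sample test audio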
```shell ```shell
paddlespeech cls --input input.wav wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav
wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/en.wav
``` ```
**Speaker Verification**
### Speech Recognition
<details><summary>&emsp;(click to expand) Open-source Chinese speech recognition</summary>
Try it from the command line
```shell ```shell
paddlespeech vector --task spk --input input_16k.wav paddlespeech asr --lang zh --input zh.wav
``` ```
**Speech Recognition**
One-call prediction with the Python API
```python
>>> from paddlespeech.cli.asr.infer import ASRExecutor
>>> asr = ASRExecutor()
>>> result = asr(audio_file="zh.wav")
>>> print(result)
我认为跑步最重要的就是给我带来了身体健康
```
</details>
### Speech Synthesis
<details><summary>&emsp;Open-source Chinese speech synthesis</summary>
Outputs wav audio at a 24k sample rate
Try it from the command line
```shell ```shell
paddlespeech asr --lang zh --input input_16k.wav paddlespeech tts --input "你好,欢迎使用百度飞桨深度学习框架!" --output output.wav
``` ```
**Speech Translation** (English to Chinese)
One-call prediction with the Python API
```python
>>> from paddlespeech.cli.tts.infer import TTSExecutor
>>> tts = TTSExecutor()
>>> tts(text="今天天气十分不错。", output="output.wav")
```
- The speech synthesis web demo has been integrated into [Huggingface Spaces](https://huggingface.co/spaces). See: [TTS Demo](https://huggingface.co/spaces/KPatrick/PaddleSpeechTTS)
</details>
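### Audio Classification
<details><summary>&emsp;Open-domain audio classification tool for multiple scenarios</summary>
Audio classification model with 527 classes, based on the AudioSet dataset
Try it from the command line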
```shell ```shell
paddlespeech st --input input_16k.wav paddlespeech cls --input zh.wav
``` ```
**Speech Synthesis**
One-call prediction with the Python API
```python
>>> from paddlespeech.cli.cls.infer import CLSExecutor
>>> cls = CLSExecutor()
>>> result = cls(audio_file="zh.wav")
>>> print(result)
Speech 0.9027186632156372
```
</details>
### Speaker Embedding Extraction
<details><summary>&emsp;Production-grade speaker embedding extraction tool</summary>
Try it from the command line
```shell ```shell
paddlespeech tts --input "你好,欢迎使用百度飞桨深度学习框架!" --output output.wav paddlespeech vector --task spk --input zh.wav
```
One-call prediction with the Python API
```python
>>> from paddlespeech.cli.vector import VectorExecutor
>>> vec = VectorExecutor()
>>> result = vec(audio_file="zh.wav")
>>> print(result) # 187-dimensional vector
[ -0.19083306 9.474295 -14.122263 -2.0916545 0.04848729
4.9295826 1.4780062 0.3733844 10.695862 3.2697146
-4.48199 -0.6617882 -9.170393 -11.1568775 -1.2358263 ...]
``` ```
- The speech synthesis web demo has been integrated into [Huggingface Spaces](https://huggingface.co/spaces). See: [TTS Demo](https://huggingface.co/spaces/akhaliq/paddlespeech)
**Text Post-processing** </details>
- Punctuation restoration
```bash
paddlespeech text --task punc --input 今天的天气真不错啊你下午有空吗我想约你一起去吃饭
```
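**Batch Processing** ### Punctuation Restoration
<details><summary>&emsp;Restore text punctuation in one step; can be used together with ASR models</summary>
Try it from the command line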
```shell
paddlespeech text --task punc --input 今天的天气真不错啊你下午有空吗我想约你一起去吃饭
``` ```
echo -e "1 欢迎光临。\n2 谢谢惠顾。" | paddlespeech tts
One-call prediction with the Python API
```python
>>> from paddlespeech.cli.text.infer import TextExecutor
>>> text_punc = TextExecutor()
>>> result = text_punc(text="今天的天气真不错啊你下午有空吗我想约你一起去吃饭")
>>> print(result)
今天的天气真不错啊!你下午有空吗?我想约你一起去吃饭。
``` ```
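**Shell Pipeline** </details>
ASR + Punc:
### Speech Translation
<details><summary>&emsp;End-to-end English-to-Chinese speech translation tool</summary>
Uses precompiled kaldi tools, so it is only supported on Ubuntu
Try it from the command line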
```shell
paddlespeech st --input en.wav
``` ```
paddlespeech asr --input ./zh.wav | paddlespeech text --task punc
One-call prediction with the Python API
```python
>>> from paddlespeech.cli.st.infer import STExecutor
>>> st = STExecutor()
>>> result = st(audio_file="en.wav")
>>> print(result)
['我 在 这栋 建筑 的 古老 门上 敲门 。']
``` ```
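For more command-line commands, see [demos](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/demos) </details>
> Note: If you need to train or fine-tune, see [Speech Recognition](./docs/source/asr/quick_start.md) and [Speech Synthesis](./docs/source/tts/quick_start.md).
<a name="快速使用服务"></a> <a name="快速使用服务"></a>
## Quick Start Server ## Quick Start Server
After installation, developers can quickly use the services from the command line. After installation, developers can start three services (speech recognition, speech synthesis and audio classification) with a single command.
**Start the server** **Start the server**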
```shell ```shell
@ -614,6 +751,7 @@ PaddleSpeech 的 **语音合成** 主要包含三个模块:文本前端、声
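The speech synthesis module was originally called [Parakeet](https://github.com/PaddlePaddle/Parakeet) and has now been merged into this repository. If you are interested in academic research on this task, see the [TTS research overview](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/docs/source/tts#overview). In addition, the [model introduction](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/tts/models_introduction.md) is a good guide to the speech synthesis pipeline. The speech synthesis module was originally called [Parakeet](https://github.com/PaddlePaddle/Parakeet) and has now been merged into this repository. If you are interested in academic research on this task, see the [TTS research overview](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/docs/source/tts#overview). In addition, the [model introduction](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/tts/models_introduction.md) is a good guide to the speech synthesis pipeline.
## ⭐ Examples ## ⭐ Examples
- **[PaddleBoBo](https://github.com/JiehangXie/PaddleBoBo): uses PaddleSpeech's speech synthesis module to generate the voice of a virtual human.** - **[PaddleBoBo](https://github.com/JiehangXie/PaddleBoBo): uses PaddleSpeech's speech synthesis module to generate the voice of a virtual human.**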

@ -12,6 +12,7 @@ This directory contains many speech applications in multiple scenarios.
* speech recognition - recognize text of an audio file * speech recognition - recognize text of an audio file
* speech server - Server for Speech Task, e.g. ASR,TTS,CLS * speech server - Server for Speech Task, e.g. ASR,TTS,CLS
* streaming asr server - receive an audio stream over websocket and recognize it into a transcript. * streaming asr server - receive an audio stream over websocket and recognize it into a transcript.
* streaming tts server - receive text over http or websocket and stream back synthesized audio data.
* speech translation - end to end speech translation * speech translation - end to end speech translation
* story talker - book reader based on OCR and TTS * story talker - book reader based on OCR and TTS
* style_fs2 - multi style control for FastSpeech2 model * style_fs2 - multi style control for FastSpeech2 model

@ -10,8 +10,9 @@
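* Metaverse - 2D augmented reality based on speech synthesis. * Metaverse - 2D augmented reality based on speech synthesis.
* Punctuation restoration - usually a text post-processing task for speech recognition, adding the appropriate punctuation to unpunctuated plain text. * Punctuation restoration - usually a text post-processing task for speech recognition, adding the appropriate punctuation to unpunctuated plain text.
* Speech recognition - recognize the spoken text in an audio clip. * Speech recognition - recognize the spoken text in an audio clip.
* Speech server - offline speech services, including ASR, TTS, CLS, etc * Speech server - offline speech services, including ASR, TTS, CLS, etc.
* Streaming speech recognition server - takes a streaming audio input and recognizes the text in the audio * Streaming speech recognition server - takes a streaming audio input and recognizes the text in the audio.
* Streaming speech synthesis server - generates a synthesized audio stream, in a streaming fashion, from the text to be synthesized.
* Speech translation - recognize the language in the audio in real time and translate it into the target language at the same time. * Speech translation - recognize the language in the audio in real time and translate it into the target language at the same time.
* Story talker - a talking storybook based on OCR and speech synthesis. * Story talker - a talking storybook based on OCR and speech synthesis.
* Personalized speech synthesis - personalized speech synthesis based on the FastSpeech2 model. * Personalized speech synthesis - personalized speech synthesis based on the FastSpeech2 model.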

@ -1,6 +1,7 @@
#!/bin/bash #!/bin/bash
wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespeech.bj.bcebos.com/PaddleAudio/en.wav wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav
wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/en.wav
# asr # asr
paddlespeech asr --input ./zh.wav paddlespeech asr --input ./zh.wav
@ -8,3 +9,18 @@ paddlespeech asr --input ./zh.wav
# asr + punc # asr + punc
paddlespeech asr --input ./zh.wav | paddlespeech text --task punc paddlespeech asr --input ./zh.wav | paddlespeech text --task punc
# asr help
paddlespeech asr --help
# english asr
paddlespeech asr --lang en --model transformer_librispeech --input ./en.wav
# model stats
paddlespeech stats --task asr
# paddlespeech help
paddlespeech --help

@ -14,7 +14,10 @@ For service interface definition, please check:
see [installation](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install.md). see [installation](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install.md).
It is recommended to use **paddlepaddle 2.3.1** or above. It is recommended to use **paddlepaddle 2.3.1** or above.
You can install paddlespeech using either the medium or the hard method.
You can install paddlespeech using the easy, medium or hard method.
**If you install in easy mode, you need to prepare the yaml file yourself; you can refer to the yaml file in the conf directory.**
### 2. Prepare config File ### 2. Prepare config File
The configuration file can be found in `conf/application.yaml` . The configuration file can be found in `conf/application.yaml` .

@ -3,8 +3,10 @@
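# Speech Server # Speech Server
## Introduction ## Introduction
This demo shows how to start an offline speech service and access it. It can be done with a single `paddlespeech_server` or `paddlespeech_client` command, or with a few lines of python code. This demo shows how to start an offline speech service and access it. It can be done with a single `paddlespeech_server` or `paddlespeech_client` command, or with a few lines of python code.
For the service interface definition, please check: For the service interface definition, please check: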
- [PaddleSpeech Server RESTful API](https://github.com/PaddlePaddle/PaddleSpeech/wiki/PaddleSpeech-Server-RESTful-API) - [PaddleSpeech Server RESTful API](https://github.com/PaddlePaddle/PaddleSpeech/wiki/PaddleSpeech-Server-RESTful-API)
@ -13,12 +15,17 @@
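See the [installation document](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install.md). See the [installation document](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install.md).
It is recommended to use **paddlepaddle 2.3.1** or above. It is recommended to use **paddlepaddle 2.3.1** or above.
You can install PaddleSpeech using either the medium or the hard method.
You can install PaddleSpeech using the easy, medium or hard method.
**If you install in easy mode, you need to prepare the yaml file yourself; you can refer to the yaml file in the conf directory.**
### 2. Prepare the config file ### 2. Prepare the config file
The configuration file can be found in `conf/application.yaml`. The configuration file can be found in `conf/application.yaml`.
Here, `engine_list` lists the speech engines that the service will start, in the format <speech task>_<engine type>. Here, `engine_list` lists the speech engines that the service will start, in the format <speech task>_<engine type>.
The speech tasks currently integrated into the service are: asr (speech recognition), tts (speech synthesis), cls (audio classification), vector (speaker verification) and text (text processing). The speech tasks currently integrated into the service are: asr (speech recognition), tts (speech synthesis), cls (audio classification), vector (speaker verification) and text (text processing).
Two engine types are currently supported: python and inference (Paddle Inference). Two engine types are currently supported: python and inference (Paddle Inference).
**Note:** If the service starts normally in a container but the client cannot reach the ip, try replacing the `host` address in the configuration file with the local ip address. **Note:** If the service starts normally in a container but the client cannot reach the ip, try replacing the `host` address in the configuration file with the local ip address.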
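The `<speech task>_<engine type>` naming used by `engine_list` can be illustrated with a small parser. This is only a sketch of the naming convention as described in this README; `split_engine_name` is a hypothetical helper, not a PaddleSpeech function, and splitting at the first underscore is an assumption that also happens to handle names such as `tts_online-onnx`.

```python
def split_engine_name(name):
    """Split '<speech task>_<engine type>' at the first underscore."""
    task, sep, engine_type = name.partition("_")
    if not sep or not task or not engine_type:
        raise ValueError(f"expected '<speech task>_<engine type>', got {name!r}")
    return task, engine_type


# Example entries built from the tasks and engine types listed in the README.
for entry in ["asr_python", "tts_inference", "cls_python"]:
    task, engine_type = split_engine_name(entry)
    print(f"{entry}: task={task}, engine type={engine_type}")
```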

@ -1,3 +1,3 @@
#!/bin/bash #!/bin/bash
paddlespeech_server start --config_file ./conf/application.yaml paddlespeech_server start --config_file ./conf/application.yaml &> server.log &

@ -0,0 +1,10 @@
#!/bin/bash
wget -c https://paddlespeech.bj.bcebos.com/vector/audio/85236145389.wav
wget -c https://paddlespeech.bj.bcebos.com/vector/audio/123456789.wav
# sid extract
paddlespeech_client vector --server_ip 127.0.0.1 --port 8090 --task spk --input ./85236145389.wav
# sid score
paddlespeech_client vector --server_ip 127.0.0.1 --port 8090 --task score --enroll ./85236145389.wav --test ./123456789.wav
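The `score` task above compares an enrollment recording with a test recording. A common way to compare two speaker embeddings is cosine similarity; whether the server uses exactly this metric is an assumption here, and `cosine_similarity` below is an illustrative helper rather than part of the PaddleSpeech API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors, in [-1, 1]."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Toy 3-dimensional vectors standing in for real speaker embeddings.
enroll = [0.2, 0.9, -0.1]
test = [0.25, 0.8, -0.05]
print(cosine_similarity(enroll, test))
```

A value close to 1 suggests the two recordings are likely from the same speaker; any decision threshold would have to be tuned on real data.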

@ -0,0 +1,4 @@
#!/bin/bash
paddlespeech_client text --server_ip 127.0.0.1 --port 8090 --input 今天的天气真好啊你下午有空吗我想约你一起去吃饭

@ -1,6 +1,6 @@
# Paddle Speech Demo # Paddle Speech Demo
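PaddleSpeechDemo is a demo project built around PaddleSpeech's voice interaction features, meant to help you get started with PaddleSpeech and build your own applications with it. PaddleSpeechDemo is a demo project built around PaddleSpeech's voice interaction features, meant to help you get started with PaddleSpeech and build your own applications with it.
The intelligent voice interaction part uses PaddleSpeech, the dialogue and information extraction part uses PaddleNLP, and the web frontend is developed with Vue3. The intelligent voice interaction part uses PaddleSpeech, the dialogue and information extraction part uses PaddleNLP, and the web frontend is developed with Vue3.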

@ -747,9 +747,9 @@
} }
}, },
"node_modules/moment": { "node_modules/moment": {
"version": "2.29.3", "version": "2.29.4",
"resolved": "https://registry.npmmirror.com/moment/-/moment-2.29.3.tgz", "resolved": "https://registry.npmjs.org/moment/-/moment-2.29.4.tgz",
"integrity": "sha512-c6YRvhEo//6T2Jz/vVtYzqBzwvPT95JBQ+smCytzf7c50oMZRsR/a4w88aD34I+/QVSfnoAnSBFPJHItlOMJVw==", "integrity": "sha512-5LC9SOxjSc2HF6vO2CyuTDNivEdoz2IvyJJGj6X8DJ0eFyfszE0QiEd+iXmBvUP3WHxSjFH/vIsA0EN00cgr8w==",
"engines": { "engines": {
"node": "*" "node": "*"
} }
@ -1636,9 +1636,9 @@
"optional": true "optional": true
}, },
"moment": { "moment": {
"version": "2.29.3", "version": "2.29.4",
"resolved": "https://registry.npmmirror.com/moment/-/moment-2.29.3.tgz", "resolved": "https://registry.npmjs.org/moment/-/moment-2.29.4.tgz",
"integrity": "sha512-c6YRvhEo//6T2Jz/vVtYzqBzwvPT95JBQ+smCytzf7c50oMZRsR/a4w88aD34I+/QVSfnoAnSBFPJHItlOMJVw==" "integrity": "sha512-5LC9SOxjSc2HF6vO2CyuTDNivEdoz2IvyJJGj6X8DJ0eFyfszE0QiEd+iXmBvUP3WHxSjFH/vIsA0EN00cgr8w=="
}, },
"nanoid": { "nanoid": {
"version": "3.3.2", "version": "3.3.2",

@ -587,9 +587,9 @@ mime@^1.4.1:
integrity sha512-x0Vn8spI+wuJ1O6S7gnbaQg8Pxh4NNHb7KSINmEWKiPE4RKOplvijn+NkmYmmRgP68mc70j2EbeTFRsrswaQeg== integrity sha512-x0Vn8spI+wuJ1O6S7gnbaQg8Pxh4NNHb7KSINmEWKiPE4RKOplvijn+NkmYmmRgP68mc70j2EbeTFRsrswaQeg==
moment@^2.27.0: moment@^2.27.0:
version "2.29.3" version "2.29.4"
resolved "https://registry.npmmirror.com/moment/-/moment-2.29.3.tgz" resolved "https://registry.yarnpkg.com/moment/-/moment-2.29.4.tgz#3dbe052889fe7c1b2ed966fcb3a77328964ef108"
integrity sha512-c6YRvhEo//6T2Jz/vVtYzqBzwvPT95JBQ+smCytzf7c50oMZRsR/a4w88aD34I+/QVSfnoAnSBFPJHItlOMJVw== integrity sha512-5LC9SOxjSc2HF6vO2CyuTDNivEdoz2IvyJJGj6X8DJ0eFyfszE0QiEd+iXmBvUP3WHxSjFH/vIsA0EN00cgr8w==
ms@^2.1.1: ms@^2.1.1:
version "2.1.3" version "2.1.3"

@ -15,7 +15,10 @@ Streaming ASR server only support `websocket` protocol, and doesn't support `htt
see [installation](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install.md). see [installation](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install.md).
It is recommended to use **paddlepaddle 2.3.1** or above. It is recommended to use **paddlepaddle 2.3.1** or above.
You can install paddlespeech using either the medium or the hard method.
You can install paddlespeech using the easy, medium or hard method.
**If you install in easy mode, you need to prepare the yaml file yourself; you can refer to the yaml file in the conf directory.**
### 2. Prepare config File ### 2. Prepare config File
The configuration file can be found in `conf/ws_application.yaml` and `conf/ws_conformer_wenetspeech_application.yaml`. The configuration file can be found in `conf/ws_application.yaml` and `conf/ws_conformer_wenetspeech_application.yaml`.

@ -3,12 +3,11 @@
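# Streaming Speech Recognition Server # Streaming Speech Recognition Server
## Introduction ## Introduction
This demo shows how to start a streaming speech service and access it. It can be done with a single `paddlespeech_server` or `paddlespeech_client` command, or with a few lines of python code. This demo shows how to start a streaming speech service and access it. It can be done with a single `paddlespeech_server` or `paddlespeech_client` command, or with a few lines of python code.
**The streaming speech recognition service only supports the `websocket` protocol, not the `http` protocol.** **The streaming speech recognition service only supports the `websocket` protocol, not the `http` protocol.**
For the service interface definition, see:
For service interface definition, please check:
- [PaddleSpeech Streaming Server WebSocket API](https://github.com/PaddlePaddle/PaddleSpeech/wiki/PaddleSpeech-Server-WebSocket-API) - [PaddleSpeech Streaming Server WebSocket API](https://github.com/PaddlePaddle/PaddleSpeech/wiki/PaddleSpeech-Server-WebSocket-API)
## Usage ## Usage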
@ -16,7 +15,10 @@ For service interface definition, please check:
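For the detailed PaddleSpeech installation process, see the [installation document](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install.md). For the detailed PaddleSpeech installation process, see the [installation document](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install.md).
It is recommended to use **paddlepaddle 2.3.1** or above. It is recommended to use **paddlepaddle 2.3.1** or above.
You can install PaddleSpeech using either the medium or the hard method.
You can install PaddleSpeech using the easy, medium or hard method.
**If you install in easy mode, you need to prepare the yaml file yourself; you can refer to the yaml file in the conf directory.**
### 2. Prepare the config file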

@ -38,4 +38,4 @@ if __name__ == '__main__':
T += m['T'] T += m['T']
P += m['P'] P += m['P']
print(f"RTF: {P/T}") print(f"RTF: {P/T}, utts: {n}")
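The hunk above accumulates `T` and `P` over all records and prints their ratio together with the utterance count. A minimal sketch of that aggregation follows, assuming `T` is the audio duration and `P` the processing time of each utterance (the diff does not state this, it is inferred from the RTF label); `aggregate_rtf` and the sample records are hypothetical.

```python
def aggregate_rtf(records):
    """Return (RTF, utterance count) for a list of {'T': ..., 'P': ...} records."""
    total_audio = sum(r["T"] for r in records)    # assumed: audio duration in seconds
    total_process = sum(r["P"] for r in records)  # assumed: processing time in seconds
    if total_audio <= 0:
        raise ValueError("total audio duration must be positive")
    return total_process / total_audio, len(records)


# Hypothetical measurements, for illustration only.
records = [{"T": 4.0, "P": 1.0}, {"T": 6.0, "P": 2.0}]
rtf, n = aggregate_rtf(records)
print(f"RTF: {rtf}, utts: {n}")
```

Summing before dividing weights each utterance by its duration, which is generally more stable than averaging per-utterance ratios. An RTF below 1 means processing is faster than real time.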

@ -1,9 +1,8 @@
export CUDA_VISIBLE_DEVICES=0,1,2,3 #export CUDA_VISIBLE_DEVICES=0,1,2,3
export CUDA_VISIBLE_DEVICES=0,1,2,3
# nohup python3 punc_server.py --config_file conf/punc_application.yaml > punc.log 2>&1 & # nohup python3 local/punc_server.py --config_file conf/punc_application.yaml > punc.log 2>&1 &
paddlespeech_server start --config_file conf/punc_application.yaml &> punc.log & paddlespeech_server start --config_file conf/punc_application.yaml &> punc.log &
# nohup python3 streaming_asr_server.py --config_file conf/ws_conformer_wenetspeech_application.yaml > streaming_asr.log 2>&1 & # nohup python3 local/streaming_asr_server.py --config_file conf/ws_conformer_wenetspeech_application.yaml > streaming_asr.log 2>&1 &
paddlespeech_server start --config_file conf/ws_conformer_wenetspeech_application.yaml &> streaming_asr.log & paddlespeech_server start --config_file conf/ws_conformer_wenetspeech_application.yaml &> streaming_asr.log &

@ -7,5 +7,5 @@ paddlespeech_client asr_online --server_ip 127.0.0.1 --port 8090 --input ./zh.wa
# read the wav and call streaming and punc service # read the wav and call streaming and punc service
# If `127.0.0.1` is not accessible, you need to use the actual service IP address. # If `127.0.0.1` is not accessible, you need to use the actual service IP address.
paddlespeech_client asr_online --server_ip 127.0.0.1 --port 8290 --punc.server_ip 127.0.0.1 --punc.port 8190 --input ./zh.wav paddlespeech_client asr_online --server_ip 127.0.0.1 --port 8090 --punc.server_ip 127.0.0.1 --punc.port 8190 --input ./zh.wav

@ -15,7 +15,10 @@ For service interface definition, please check:
see [installation](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install.md). see [installation](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install.md).
It is recommended to use **paddlepaddle 2.3.1** or above. It is recommended to use **paddlepaddle 2.3.1** or above.
You can install paddlespeech using either the medium or the hard method.
You can install paddlespeech using the easy, medium or hard method.
**If you install in easy mode, you need to prepare the yaml file yourself; you can refer to the yaml file in the conf directory.**
### 2. Prepare config File ### 2. Prepare config File
The configuration file can be found in `conf/tts_online_application.yaml`. The configuration file can be found in `conf/tts_online_application.yaml`.

@ -13,7 +13,11 @@
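See the [installation document](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install.md). See the [installation document](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install.md).
It is recommended to use **paddlepaddle 2.3.1** or above. It is recommended to use **paddlepaddle 2.3.1** or above.
You can install PaddleSpeech using either the medium or the hard method.
You can install PaddleSpeech using the easy, medium or hard method.
**If you install in easy mode, you need to prepare the yaml file yourself; you can refer to the yaml file in the conf directory.**
### 2. Prepare the config file ### 2. Prepare the config file
The configuration file can be found in `conf/tts_online_application.yaml`. The configuration file can be found in `conf/tts_online_application.yaml`.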

@ -2,8 +2,8 @@
# http client test # http client test
# If `127.0.0.1` is not accessible, you need to use the actual service IP address. # If `127.0.0.1` is not accessible, you need to use the actual service IP address.
paddlespeech_client tts_online --server_ip 127.0.0.1 --port 8092 --protocol http --input "您好,欢迎使用百度飞桨语音合成服务。" --output output.wav paddlespeech_client tts_online --server_ip 127.0.0.1 --port 8092 --protocol http --input "您好,欢迎使用百度飞桨语音合成服务。" --output output.http.wav
# websocket client test # websocket client test
# If `127.0.0.1` is not accessible, you need to use the actual service IP address. # If `127.0.0.1` is not accessible, you need to use the actual service IP address.
# paddlespeech_client tts_online --server_ip 127.0.0.1 --port 8092 --protocol websocket --input "您好,欢迎使用百度飞桨语音合成服务。" --output output.wav paddlespeech_client tts_online --server_ip 127.0.0.1 --port 8192 --protocol websocket --input "您好,欢迎使用百度飞桨语音合成服务。" --output output.ws.wav

@ -0,0 +1,103 @@
# This is the parameter configuration file for streaming tts server.
#################################################################################
# SERVER SETTING #
#################################################################################
host: 0.0.0.0
port: 8192
# The task format in the engine_list is: <speech task>_<engine type>
# engine_list choices = ['tts_online', 'tts_online-onnx'], the inference speed of tts_online-onnx is faster than tts_online.
# protocol choices = ['websocket', 'http']
protocol: 'websocket'
engine_list: ['tts_online-onnx']
#################################################################################
# ENGINE CONFIG #
#################################################################################
################################### TTS #########################################
################### speech task: tts; engine_type: online #######################
tts_online:
    # am (acoustic model) choices=['fastspeech2_csmsc', 'fastspeech2_cnndecoder_csmsc']
    # fastspeech2_cnndecoder_csmsc supports streaming am inference.
    am: 'fastspeech2_csmsc'
    am_config:
    am_ckpt:
    am_stat:
    phones_dict:
    tones_dict:
    speaker_dict:
    spk_id: 0
    # voc (vocoder) choices=['mb_melgan_csmsc', 'hifigan_csmsc']
    # Both mb_melgan_csmsc and hifigan_csmsc support streaming voc inference
    voc: 'mb_melgan_csmsc'
    voc_config:
    voc_ckpt:
    voc_stat:
    # others
    lang: 'zh'
    device: 'cpu' # set 'gpu:id' or 'cpu'
    # am_block and am_pad are only used by the fastspeech2_cnndecoder_onnx model for streaming am inference;
    # when am_pad is set to 12, streaming synthetic audio is the same as non-streaming synthetic audio
    am_block: 72
    am_pad: 12
    # voc_pad and voc_block are used by the voc model for streaming voc inference;
    # when the voc model is mb_melgan_csmsc and voc_pad is set to 14, streaming synthetic audio is the same as non-streaming synthetic audio; voc_pad can be set as low as 7 and streaming synthetic audio still sounds normal
    # when the voc model is hifigan_csmsc and voc_pad is set to 19, streaming synthetic audio is the same as non-streaming synthetic audio; with voc_pad set to 14, streaming synthetic audio sounds normal
    voc_block: 36
    voc_pad: 14
#################################################################################
# ENGINE CONFIG #
#################################################################################
################################### TTS #########################################
################### speech task: tts; engine_type: online-onnx #######################
tts_online-onnx:
    # am (acoustic model) choices=['fastspeech2_csmsc_onnx', 'fastspeech2_cnndecoder_csmsc_onnx']
    # fastspeech2_cnndecoder_csmsc_onnx supports streaming am inference.
    am: 'fastspeech2_cnndecoder_csmsc_onnx'
    # am_ckpt is a list, if am is fastspeech2_cnndecoder_csmsc_onnx, am_ckpt = [encoder model, decoder model, postnet model];
    # if am is fastspeech2_csmsc_onnx, am_ckpt = [ckpt model];
    am_ckpt: # list
    am_stat:
    phones_dict:
    tones_dict:
    speaker_dict:
    spk_id: 0
    am_sample_rate: 24000
    am_sess_conf:
        device: "cpu" # set 'gpu:id' or 'cpu'
        use_trt: False
        cpu_threads: 4
    # voc (vocoder) choices=['mb_melgan_csmsc_onnx', 'hifigan_csmsc_onnx']
    # Both mb_melgan_csmsc_onnx and hifigan_csmsc_onnx support streaming voc inference
    voc: 'hifigan_csmsc_onnx'
    voc_ckpt:
    voc_sample_rate: 24000
    voc_sess_conf:
        device: "cpu" # set 'gpu:id' or 'cpu'
        use_trt: False
        cpu_threads: 4
    # others
    lang: 'zh'
    # am_block and am_pad are only used by the fastspeech2_cnndecoder_onnx model for streaming am inference;
    # when am_pad is set to 12, streaming synthetic audio is the same as non-streaming synthetic audio
    am_block: 72
    am_pad: 12
    # voc_pad and voc_block are used by the voc model for streaming voc inference;
    # when the voc model is mb_melgan_csmsc_onnx and voc_pad is set to 14, streaming synthetic audio is the same as non-streaming synthetic audio; voc_pad can be set as low as 7 and streaming synthetic audio still sounds normal
    # when the voc model is hifigan_csmsc_onnx and voc_pad is set to 19, streaming synthetic audio is the same as non-streaming synthetic audio; with voc_pad set to 14, streaming synthetic audio sounds normal
    voc_block: 36
    voc_pad: 14
    # voc_upsample should be same as n_shift on voc config.
    voc_upsample: 300
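To give a feel for the block and pad values in this config, the sketch below converts a block of frames into seconds of audio and lists the frame ranges a block-plus-padding scheme would cover. It rests on two assumptions not confirmed by the diff: that `voc_upsample` (stated to equal `n_shift`) is the number of audio samples per frame, and that each block is processed with up to `pad` frames of context on both sides. `block_seconds` and `iter_blocks` are hypothetical helpers, not the server's implementation.

```python
def block_seconds(n_frames, hop_samples, sample_rate):
    """Duration in seconds of n_frames frames at hop_samples samples per frame."""
    return n_frames * hop_samples / sample_rate


def iter_blocks(n_frames, block, pad):
    """Yield (ctx_start, start, end, ctx_end) frame indices for each block.

    [start, end) is the part whose output is kept; [ctx_start, ctx_end) adds up
    to `pad` frames of context on each side. This layout is an assumption made
    for illustration, not taken from the server code.
    """
    if block <= 0 or pad < 0:
        raise ValueError("block must be positive and pad non-negative")
    for start in range(0, n_frames, block):
        end = min(start + block, n_frames)
        yield max(start - pad, 0), start, end, min(end + pad, n_frames)


# Values copied from the tts_online-onnx section of the config.
voc_block, voc_pad, voc_upsample, voc_sample_rate = 36, 14, 300, 24000

# Seconds of audio produced per vocoder block under the assumptions above.
print(block_seconds(voc_block, voc_upsample, voc_sample_rate))

# Frame ranges for a hypothetical 100-frame utterance.
for bounds in iter_blocks(100, voc_block, voc_pad):
    print(bounds)
```

Under these assumptions a larger block lowers the per-block overhead but delays the first audio packet, while a larger pad recomputes more context frames per block in exchange for output closer to non-streaming synthesis.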

@ -0,0 +1,10 @@
#!/bin/bash
# http server
paddlespeech_server start --config_file ./conf/tts_online_application.yaml &> tts.http.log &
# websocket server
paddlespeech_server start --config_file ./conf/tts_online_ws_application.yaml &> tts.ws.log &
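Because both servers are started in the background, a client script launched right afterwards may try to connect before they are listening. The helper below is a minimal sketch that polls a TCP port until it accepts connections; `wait_for_port` is a hypothetical name, and the ports 8092 (http) and 8192 (websocket) mentioned in the note are taken from the client commands in this commit rather than from the server script itself.

```python
import socket
import time


def wait_for_port(host, port, timeout=30.0, interval=0.5):
    """Return True once host:port accepts a TCP connection, False on timeout."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # create_connection raises OSError while nothing is listening yet
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)
    return False
```

For example, `wait_for_port("127.0.0.1", 8092)` and `wait_for_port("127.0.0.1", 8192)` could be called before running the http and websocket client tests. A successful TCP connection only shows that the port is open, not that the models have finished loading.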

@ -1,3 +0,0 @@
#!/bin/bash
# start server
paddlespeech_server start --config_file ./conf/tts_online_application.yaml

@ -5,3 +5,9 @@ paddlespeech tts --input 今天的天气不错啊
# Batch process # Batch process
echo -e "1 欢迎光临。\n2 谢谢惠顾。" | paddlespeech tts echo -e "1 欢迎光临。\n2 谢谢惠顾。" | paddlespeech tts
# Text Frontend
paddlespeech tts --input 今天是2022/10/29,最低温度是-3℃.

@ -1,15 +1,17 @@
FROM registry.baidubce.com/paddlepaddle/paddle:2.2.2 FROM registry.baidubce.com/paddlepaddle/paddle:2.2.2
LABEL maintainer="paddlesl@baidu.com" LABEL maintainer="paddlesl@baidu.com"
RUN apt-get update \
&& apt-get install -y libsndfile-dev \
&& apt-get clean \
&& rm -rf /var/lib/apt/lists/*
RUN git clone --depth 1 https://github.com/PaddlePaddle/PaddleSpeech.git /home/PaddleSpeech RUN git clone --depth 1 https://github.com/PaddlePaddle/PaddleSpeech.git /home/PaddleSpeech
RUN pip3 uninstall mccabe -y ; exit 0; RUN pip3 uninstall mccabe -y ; exit 0;
RUN pip3 install multiprocess==0.70.12 importlib-metadata==4.2.0 dill==0.3.4 RUN pip3 install multiprocess==0.70.12 importlib-metadata==4.2.0 dill==0.3.4
RUN cd /home/PaddleSpeech/audio WORKDIR /home/PaddleSpeech/
RUN python setup.py bdist_wheel
RUN cd /home/PaddleSpeech
RUN python setup.py bdist_wheel RUN python setup.py bdist_wheel
RUN pip install audio/dist/*.whl dist/*.whl RUN pip install dist/*.whl -i https://pypi.tuna.tsinghua.edu.cn/simple
WORKDIR /home/PaddleSpeech/ CMD ["bash"]

@ -49,3 +49,4 @@ websockets
keyboard keyboard
uvicorn uvicorn
pattern_singleton pattern_singleton
braceexpand

@ -117,9 +117,9 @@ conda install -y -c gcc_linux-64=8.4.0 gxx_linux-64=8.4.0
``` ```
(Tip: Do not use the last script if you want to install the **Hard** way): (Tip: Do not use the last script if you want to install the **Hard** way):
### Install PaddlePaddle ### Install PaddlePaddle
You can choose the `PaddlePaddle` version based on your system. For example, for CUDA 10.2, CuDNN7.5 install paddlepaddle-gpu 2.2.0: You can choose the `PaddlePaddle` version based on your system. For example, for CUDA 10.2, CuDNN7.5 install paddlepaddle-gpu 2.3.1:
```bash ```bash
python3 -m pip install paddlepaddle-gpu==2.2.0 -i https://mirror.baidu.com/pypi/simple python3 -m pip install paddlepaddle-gpu==2.3.1 -i https://mirror.baidu.com/pypi/simple
``` ```
### Install PaddleSpeech ### Install PaddleSpeech
You can install `paddlespeech` with the following command, then you can use the `ready-made` examples in `paddlespeech`: You can install `paddlespeech` with the following command, then you can use the `ready-made` examples in `paddlespeech`:
@ -180,9 +180,9 @@ Some users may fail to install `kaldiio` due to the default download source, you
```bash ```bash
pip install pytest-runner -i https://pypi.tuna.tsinghua.edu.cn/simple pip install pytest-runner -i https://pypi.tuna.tsinghua.edu.cn/simple
``` ```
Make sure you have a GPU and that the paddlepaddle version is right. For example, for CUDA 10.2, CuDNN7.5 install paddle 2.2.0: Make sure you have a GPU and that the paddlepaddle version is right. For example, for CUDA 10.2, CuDNN7.5 install paddle 2.3.1:
```bash ```bash
python3 -m pip install paddlepaddle-gpu==2.2.0 -i https://mirror.baidu.com/pypi/simple python3 -m pip install paddlepaddle-gpu==2.3.1 -i https://mirror.baidu.com/pypi/simple
``` ```
### Install PaddleSpeech in Developing Mode ### Install PaddleSpeech in Developing Mode
```bash ```bash

@ -111,9 +111,9 @@ conda install -y -c gcc_linux-64=8.4.0 gxx_linux-64=8.4.0
``` ```
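(Tip: if you want to install the **hard** way, do not use the last command) (Tip: if you want to install the **hard** way, do not use the last command)
### Install PaddlePaddle ### Install PaddlePaddle
You can choose the PaddlePaddle version based on your system configuration. For example, with CUDA 10.2 and CuDNN7.5 you can install paddlepaddle-gpu 2.2.0: You can choose the PaddlePaddle version based on your system configuration. For example, with CUDA 10.2 and CuDNN7.5 you can install paddlepaddle-gpu 2.3.1: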
```bash ```bash
python3 -m pip install paddlepaddle-gpu==2.2.0 -i https://mirror.baidu.com/pypi/simple python3 -m pip install paddlepaddle-gpu==2.3.1 -i https://mirror.baidu.com/pypi/simple
``` ```
### Install PaddleSpeech ### Install PaddleSpeech
Finally, install `paddlespeech`, so that you can use the existing examples in `paddlespeech` Finally, install `paddlespeech`, so that you can use the existing examples in `paddlespeech`
@ -168,9 +168,9 @@ conda activate tools/venv
conda install -y -c conda-forge sox libsndfile swig bzip2 libflac bc conda install -y -c conda-forge sox libsndfile swig bzip2 libflac bc
``` ```
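### Install PaddlePaddle ### Install PaddlePaddle
Make sure your system has a GPU and that you are using the correct version of paddlepaddle. For example, for CUDA 10.2 and CuDNN7.5, you can install paddlepaddle-gpu 2.2.0 Make sure your system has a GPU and that you are using the correct version of paddlepaddle. For example, for CUDA 10.2 and CuDNN7.5, you can install paddlepaddle-gpu 2.3.1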
```bash ```bash
python3 -m pip install paddlepaddle-gpu==2.2.0 -i https://mirror.baidu.com/pypi/simple python3 -m pip install paddlepaddle-gpu==2.3.1 -i https://mirror.baidu.com/pypi/simple
``` ```
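### Install PaddleSpeech in Developer Mode ### Install PaddleSpeech in Developer Mode
Some users may fail to install kaldiio because of the default download source; we recommend installing pytest-runner first: Some users may fail to install kaldiio because of the default download source; we recommend installing pytest-runner first: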

@ -1,5 +1,5 @@
# Transformer/Conformer ASR with Aishell # Transformer/Conformer ASR with Aishell
This example contains code used to train a Transformer or [Conformer](http://arxiv.org/abs/2008.03802) model with [Aishell dataset](http://www.openslr.org/resources/33) This example contains code used to train a [u2](https://arxiv.org/pdf/2012.05481.pdf) model (Transformer or [Conformer](https://arxiv.org/pdf/2005.08100.pdf) model) with [Aishell dataset](http://www.openslr.org/resources/33)
## Overview ## Overview
All the scripts you need are in `run.sh`. There are several stages in `run.sh`, and each stage has its function. All the scripts you need are in `run.sh`. There are several stages in `run.sh`, and each stage has its function.
| Stage | Function | | Stage | Function |

@ -1,20 +1,3 @@
# Callcenter 8k sample rate # Callcenter 8k sample rate
Data distribution: This recipe only has model/data config for 8k ASR; users need to prepare the data and generate the manifest metafile themselves. See the Aishell or Librispeech recipes for reference.
```
676048 utts
491.4004722221223 h
4357792.0 text
2.4633630739178654 text/sec
2.6167397877068495 sec/utt
```
train/dev/test partition:
```
33802 manifest.dev
67606 manifest.test
574640 manifest.train
676048 total
```

@ -154,7 +154,7 @@ VITS checkpoint contains files listed below.
vits_csmsc_ckpt_1.1.0 vits_csmsc_ckpt_1.1.0
├── default.yaml # default config used to train vits ├── default.yaml # default config used to train vits
├── phone_id_map.txt # phone vocabulary file when training vits ├── phone_id_map.txt # phone vocabulary file when training vits
└── snapshot_iter_350000.pdz # model parameters and optimizer states └── snapshot_iter_333000.pdz # model parameters and optimizer states
``` ```
ps: This ckpt is not good enough yet; a better one is still being trained ps: This ckpt is not good enough yet; a better one is still being trained
@ -169,7 +169,7 @@ FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \ FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/synthesize_e2e.py \ python3 ${BIN_DIR}/synthesize_e2e.py \
--config=vits_csmsc_ckpt_1.1.0/default.yaml \ --config=vits_csmsc_ckpt_1.1.0/default.yaml \
--ckpt=vits_csmsc_ckpt_1.1.0/snapshot_iter_350000.pdz \ --ckpt=vits_csmsc_ckpt_1.1.0/snapshot_iter_333000.pdz \
--phones_dict=vits_csmsc_ckpt_1.1.0/phone_id_map.txt \ --phones_dict=vits_csmsc_ckpt_1.1.0/phone_id_map.txt \
--output_dir=exp/default/test_e2e \ --output_dir=exp/default/test_e2e \
--text=${BIN_DIR}/../sentences.txt \ --text=${BIN_DIR}/../sentences.txt \

@ -179,7 +179,7 @@ generator_first: False # whether to start updating generator first
# OTHER TRAINING SETTING # # OTHER TRAINING SETTING #
########################################################## ##########################################################
num_snapshots: 10 # max number of snapshots to keep while training num_snapshots: 10 # max number of snapshots to keep while training
train_max_steps: 250000 # Number of training steps. == total_iters / ngpus, total_iters = 1000000 train_max_steps: 350000 # Number of training steps. == total_iters / ngpus, total_iters = 1000000
save_interval_steps: 1000 # Interval steps to save checkpoint. save_interval_steps: 1000 # Interval steps to save checkpoint.
eval_interval_steps: 250 # Interval steps to evaluate the network. eval_interval_steps: 250 # Interval steps to evaluate the network.
seed: 777 # random seed number seed: 777 # random seed number

@ -1,5 +1,5 @@
# Transformer/Conformer ASR with Librispeech # Transformer/Conformer ASR with Librispeech
This example contains code used to train a Transformer or [Conformer](http://arxiv.org/abs/2008.03802) model with [Librispeech dataset](http://www.openslr.org/resources/12) This example contains code used to train a [u2](https://arxiv.org/pdf/2012.05481.pdf) model (Transformer or [Conformer](https://arxiv.org/pdf/2005.08100.pdf) model) with [Librispeech dataset](http://www.openslr.org/resources/12)
## Overview ## Overview
All the scripts you need are in `run.sh`. There are several stages in `run.sh`, and each stage has its function. All the scripts you need are in `run.sh`. There are several stages in `run.sh`, and each stage has its function.
| Stage | Function | | Stage | Function |

@ -1,6 +1,6 @@
# Transformer/Conformer ASR with Librispeech ASR2 # Transformer/Conformer ASR with Librispeech ASR2
This example contains code used to train a Transformer or [Conformer](http://arxiv.org/abs/2008.03802) model with [Librispeech dataset](http://www.openslr.org/resources/12) and use some functions in kaldi. This example contains code used to train a [u2](https://arxiv.org/pdf/2012.05481.pdf) model (Transformer or [Conformer](https://arxiv.org/pdf/2005.08100.pdf) model) with [Librispeech dataset](http://www.openslr.org/resources/12) and use some functions in kaldi.
To use this example, you need to install Kaldi first. To use this example, you need to install Kaldi first.

@ -1,5 +1,5 @@
# Transformer/Conformer ASR with Tiny # Transformer/Conformer ASR with Tiny
This example contains code used to train a Transformer or [Conformer](http://arxiv.org/abs/2008.03802) model Tiny dataset(a part of [[Librispeech dataset](http://www.openslr.org/resources/12)](http://www.openslr.org/resources/33)) This example contains code used to train a [u2](https://arxiv.org/pdf/2012.05481.pdf) model (Transformer or [Conformer](https://arxiv.org/pdf/2005.08100.pdf) model) with Tiny dataset(a part of [[Librispeech dataset](http://www.openslr.org/resources/12)](http://www.openslr.org/resources/33))
## Overview ## Overview
All the scripts you need are in `run.sh`. There are several stages in `run.sh`, and each stage has its function. All the scripts you need are in `run.sh`. There are several stages in `run.sh`, and each stage has its function.
| Stage | Function | | Stage | Function |

@ -0,0 +1,26 @@
# Test
We trained a Chinese-English mixed fastspeech2 model. The training code is still being sorted out, so for now we only show how to use the pretrained model.
The sample rate of the synthesized audio is 22050 Hz.
## Download pretrained models
Put pretrained models in a directory named `models`.
- [fastspeech2_csmscljspeech_add-zhen.zip](https://paddlespeech.bj.bcebos.com/t2s/chinse_english_mixed/models/fastspeech2_csmscljspeech_add-zhen.zip)
- [hifigan_ljspeech_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_ljspeech_ckpt_0.2.0.zip)
```bash
mkdir models
cd models
wget https://paddlespeech.bj.bcebos.com/t2s/chinse_english_mixed/models/fastspeech2_csmscljspeech_add-zhen.zip
unzip fastspeech2_csmscljspeech_add-zhen.zip
wget https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_ljspeech_ckpt_0.2.0.zip
unzip hifigan_ljspeech_ckpt_0.2.0.zip
cd ../
```
## Test
You can choose `--spk_id` {0, 1} in `local/synthesize_e2e.sh`.
```bash
bash test.sh
```

@ -0,0 +1,31 @@
#!/bin/bash
model_dir=$1
output=$2
am_name=fastspeech2_csmscljspeech_add-zhen
am_model_dir=${model_dir}/${am_name}/
stage=1
stop_stage=1
# hifigan
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize_e2e.py \
--am=fastspeech2_mix \
--am_config=${am_model_dir}/default.yaml \
--am_ckpt=${am_model_dir}/snapshot_iter_94000.pdz \
--am_stat=${am_model_dir}/speech_stats.npy \
--voc=hifigan_ljspeech \
--voc_config=${model_dir}/hifigan_ljspeech_ckpt_0.2.0/default.yaml \
--voc_ckpt=${model_dir}/hifigan_ljspeech_ckpt_0.2.0/snapshot_iter_2500000.pdz \
--voc_stat=${model_dir}/hifigan_ljspeech_ckpt_0.2.0/feats_stats.npy \
--lang=mix \
--text=${BIN_DIR}/../sentences_mix.txt \
--output_dir=${output}/test_e2e \
--phones_dict=${am_model_dir}/phone_id_map.txt \
--speaker_dict=${am_model_dir}/speaker_id_map.txt \
--spk_id 0
fi

@ -0,0 +1,13 @@
#!/bin/bash
export MAIN_ROOT=`realpath ${PWD}/../../../`
export PATH=${MAIN_ROOT}:${MAIN_ROOT}/utils:${PATH}
export LC_ALL=C
export PYTHONDONTWRITEBYTECODE=1
# Use UTF-8 in Python to avoid UnicodeDecodeError when LC_ALL=C
export PYTHONIOENCODING=UTF-8
export PYTHONPATH=${MAIN_ROOT}:${PYTHONPATH}
MODEL=fastspeech2
export BIN_DIR=${MAIN_ROOT}/paddlespeech/t2s/exps/${MODEL}

@ -0,0 +1,23 @@
#!/bin/bash
set -e
source path.sh
gpus=0,1
stage=3
stop_stage=100
model_dir=models
output_dir=output
# with the following command, you can choose the stage range you want to run
# such as `./run.sh --stage 0 --stop-stage 0`
# this can not be mixed use with `$1`, `$2` ...
source ${MAIN_ROOT}/utils/parse_options.sh || exit 1
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
# synthesize_e2e, vocoder is hifigan by default
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize_e2e.sh ${model_dir} ${output_dir} || exit -1
fi

@ -14,3 +14,5 @@
import _locale import _locale
_locale._getdefaultlocale = (lambda *args: ['en_US', 'utf8']) _locale._getdefaultlocale = (lambda *args: ['en_US', 'utf8'])

@ -14,6 +14,9 @@
from . import compliance from . import compliance
from . import datasets from . import datasets
from . import features from . import features
from . import text
from . import transform
from . import streamdata
from . import functional from . import functional
from . import io from . import io
from . import metric from . import metric

@ -0,0 +1,13 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

@ -365,7 +365,7 @@ class ASRExecutor(BaseExecutor):
except Exception as e: except Exception as e:
logger.exception(e) logger.exception(e)
logger.error( logger.error(
"can not open the audio file, please check the audio file format is 'wav'. \n \ f"can not open the audio file, please check the audio file({audio_file}) format is 'wav'. \n \
you can try to use sox to change the file format.\n \ you can try to use sox to change the file format.\n \
For example: \n \ For example: \n \
sample rate: 16k \n \ sample rate: 16k \n \

@ -108,19 +108,20 @@ class BaseExecutor(ABC):
Dict[str, Union[str, os.PathLike]]: A dict with ids and inputs. Dict[str, Union[str, os.PathLike]]: A dict with ids and inputs.
""" """
if self._is_job_input(input_): if self._is_job_input(input_):
# .job/.scp/.txt file
ret = self._get_job_contents(input_) ret = self._get_job_contents(input_)
else: else:
# job from stdin
ret = OrderedDict() ret = OrderedDict()
if input_ is None: # Take input from stdin if input_ is None: # Take input from stdin
if not sys.stdin.isatty( if not sys.stdin.isatty(
): # Avoid getting stuck when stdin is empty. ): # Avoid getting stuck when stdin is empty.
for i, line in enumerate(sys.stdin): for i, line in enumerate(sys.stdin):
line = line.strip() line = line.strip()
if len(line.split(' ')) == 1: if len(line.split()) == 1:
ret[str(i + 1)] = line ret[str(i + 1)] = line
elif len(line.split(' ')) == 2: elif len(line.split()) == 2:
id_, info = line.split(' ') id_, info = line.split()
ret[id_] = info ret[id_] = info
else: # No valid input info from one line. else: # No valid input info from one line.
continue continue
@ -170,7 +171,8 @@ class BaseExecutor(ABC):
bool: return `True` for job input, `False` otherwise. bool: return `True` for job input, `False` otherwise.
""" """
return input_ and os.path.isfile(input_) and (input_.endswith('.job') or return input_ and os.path.isfile(input_) and (input_.endswith('.job') or
input_.endswith('.txt')) input_.endswith('.txt') or
input_.endswith('.scp'))
def _get_job_contents( def _get_job_contents(
self, job_input: os.PathLike) -> Dict[str, Union[str, os.PathLike]]: self, job_input: os.PathLike) -> Dict[str, Union[str, os.PathLike]]:
@ -189,7 +191,7 @@ class BaseExecutor(ABC):
line = line.strip() line = line.strip()
if not line: if not line:
continue continue
k, v = line.split(' ') k, v = line.split() # space or \t
job_contents[k] = v job_contents[k] = v
return job_contents return job_contents
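Note on the `split(' ')` to `split()` change in this hunk: a minimal sketch, assuming a job file whose ids and paths may be separated by tabs or by repeated spaces. The helper name `parse_job_lines` is hypothetical and not part of PaddleSpeech; it only mirrors the parsing rule shown above, and silently skipping malformed lines is an illustrative choice rather than the behavior of `_get_job_contents`.

```python
from collections import OrderedDict


def parse_job_lines(lines):
    """Hypothetical helper: map `id value` lines to an ordered dict.

    Mirrors the rule in the hunk above: str.split() with no argument
    splits on any run of whitespace (spaces or tabs).
    """
    contents = OrderedDict()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        fields = line.split()
        if len(fields) != 2:
            continue  # illustrative choice: ignore malformed lines
        key, value = fields
        contents[key] = value
    return contents


jobs = parse_job_lines(["utt1\t/data/a.wav", "utt2   /data/b.wav", ""])
print(jobs)

# With split(' ') the same lines do not yield exactly two fields.
print("utt1\t/data/a.wav".split(' '))   # one field: the tab is not a space
print("utt2   /data/b.wav".split(' '))  # four fields: empty strings between spaces
```

With `split(' ')`, both sample lines would fail the `k, v = ...` unpacking, which is likely why `.scp` files (often tab-separated) needed the change.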

@ -18,7 +18,6 @@ from typing import Union
import paddle import paddle
from paddle import nn from paddle import nn
from paddle.fluid import core
from paddle.nn import functional as F from paddle.nn import functional as F
from paddlespeech.s2t.utils.log import Log from paddlespeech.s2t.utils.log import Log
@ -39,46 +38,6 @@ paddle.long = 'int64'
paddle.uint16 = 'uint16' paddle.uint16 = 'uint16'
paddle.cdouble = 'complex128' paddle.cdouble = 'complex128'
def convert_dtype_to_string(tensor_dtype):
"""
Convert the data type in numpy to the data type in Paddle
Args:
tensor_dtype(core.VarDesc.VarType): the data type in numpy.
Returns:
core.VarDesc.VarType: the data type in Paddle.
"""
dtype = tensor_dtype
if dtype == core.VarDesc.VarType.FP32:
return paddle.float32
elif dtype == core.VarDesc.VarType.FP64:
return paddle.float64
elif dtype == core.VarDesc.VarType.FP16:
return paddle.float16
elif dtype == core.VarDesc.VarType.INT32:
return paddle.int32
elif dtype == core.VarDesc.VarType.INT16:
return paddle.int16
elif dtype == core.VarDesc.VarType.INT64:
return paddle.int64
elif dtype == core.VarDesc.VarType.BOOL:
return paddle.bool
elif dtype == core.VarDesc.VarType.BF16:
# since there is still no support for bfloat16 in NumPy,
# uint16 is used for casting bfloat16
return paddle.uint16
elif dtype == core.VarDesc.VarType.UINT8:
return paddle.uint8
elif dtype == core.VarDesc.VarType.INT8:
return paddle.int8
elif dtype == core.VarDesc.VarType.COMPLEX64:
return paddle.complex64
elif dtype == core.VarDesc.VarType.COMPLEX128:
return paddle.complex128
else:
raise ValueError("Not supported tensor dtype %s" % dtype)
if not hasattr(paddle, 'softmax'): if not hasattr(paddle, 'softmax'):
logger.debug("register user softmax to paddle, remove this when fixed!") logger.debug("register user softmax to paddle, remove this when fixed!")
setattr(paddle, 'softmax', paddle.nn.functional.softmax) setattr(paddle, 'softmax', paddle.nn.functional.softmax)
@ -155,28 +114,6 @@ if not hasattr(paddle.Tensor, 'new_full'):
paddle.Tensor.new_full = new_full paddle.Tensor.new_full = new_full
paddle.static.Variable.new_full = new_full paddle.static.Variable.new_full = new_full
def eq(xs: paddle.Tensor, ys: Union[paddle.Tensor, float]) -> paddle.Tensor:
if convert_dtype_to_string(xs.dtype) == paddle.bool:
xs = xs.astype(paddle.int)
return xs.equal(
paddle.to_tensor(
ys, dtype=convert_dtype_to_string(xs.dtype), place=xs.place))
if not hasattr(paddle.Tensor, 'eq'):
logger.debug(
"override eq of paddle.Tensor if exists or register, remove this when fixed!"
)
paddle.Tensor.eq = eq
paddle.static.Variable.eq = eq
if not hasattr(paddle, 'eq'):
logger.debug(
"override eq of paddle if exists or register, remove this when fixed!")
paddle.eq = eq
def contiguous(xs: paddle.Tensor) -> paddle.Tensor: def contiguous(xs: paddle.Tensor) -> paddle.Tensor:
return xs return xs
@ -219,13 +156,22 @@ def is_broadcastable(shp1, shp2):
return True return True
def broadcast_shape(shp1, shp2):
result = []
for a, b in zip(shp1[::-1], shp2[::-1]):
result.append(max(a, b))
return result[::-1]
def masked_fill(xs: paddle.Tensor, def masked_fill(xs: paddle.Tensor,
mask: paddle.Tensor, mask: paddle.Tensor,
value: Union[float, int]): value: Union[float, int]):
assert is_broadcastable(xs.shape, mask.shape) is True, (xs.shape, bshape = broadcast_shape(xs.shape, mask.shape)
mask.shape) mask.stop_gradient = True
bshape = paddle.broadcast_shape(xs.shape, mask.shape) tmp = paddle.ones(shape=[len(bshape)], dtype='int32')
mask = mask.broadcast_to(bshape) for index in range(len(bshape)):
tmp[index] = bshape[index]
mask = mask.broadcast_to(tmp)
trues = paddle.ones_like(xs) * value trues = paddle.ones_like(xs) * value
xs = paddle.where(mask, trues, xs) xs = paddle.where(mask, trues, xs)
return xs return xs
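Note on the new `broadcast_shape` helper: it pairs dimensions with `zip`, so it stops at the shorter shape when the ranks differ. The sketch below reproduces that rule next to a padded variant for comparison. `broadcast_shape_padded` is a hypothetical name, and treating missing leading dimensions as 1 is the usual NumPy-style convention assumed here, not something this diff states.

```python
from itertools import zip_longest


def broadcast_shape_zip(shp1, shp2):
    """The rule from the hunk above, reproduced for comparison."""
    return [max(a, b) for a, b in zip(shp1[::-1], shp2[::-1])][::-1]


def broadcast_shape_padded(shp1, shp2):
    """Hypothetical variant: NumPy-style broadcast shape for two shapes.

    Missing leading dimensions are treated as 1, and a mismatch that is
    not against a 1 raises an error instead of silently taking max().
    """
    result = []
    for a, b in zip_longest(reversed(shp1), reversed(shp2), fillvalue=1):
        if a != b and a != 1 and b != 1:
            raise ValueError(f"shapes {shp1} and {shp2} are not broadcastable")
        # a size-1 dimension stretches to the other size (including 0)
        result.append(b if a == 1 else a)
    return result[::-1]


print(broadcast_shape_zip([2, 1, 4], [1, 3, 4]))     # equal ranks: both agree
print(broadcast_shape_padded([2, 1, 4], [1, 3, 4]))
print(broadcast_shape_zip([2, 3, 4], [4]))           # zip stops at the shorter shape
print(broadcast_shape_padded([2, 3, 4], [4]))
```

For the equal-rank shapes that `masked_fill` typically sees, the two versions agree; the difference only shows up when `xs` and `mask` have different ranks.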

@ -29,6 +29,9 @@ import paddle
from paddle import jit from paddle import jit
from paddle import nn from paddle import nn
from paddlespeech.audio.utils.tensor_utils import add_sos_eos
from paddlespeech.audio.utils.tensor_utils import pad_sequence
from paddlespeech.audio.utils.tensor_utils import th_accuracy
from paddlespeech.s2t.decoders.scorers.ctc import CTCPrefixScorer from paddlespeech.s2t.decoders.scorers.ctc import CTCPrefixScorer
from paddlespeech.s2t.frontend.utility import IGNORE_ID from paddlespeech.s2t.frontend.utility import IGNORE_ID
from paddlespeech.s2t.frontend.utility import load_cmvn from paddlespeech.s2t.frontend.utility import load_cmvn
@ -48,9 +51,6 @@ from paddlespeech.s2t.utils import checkpoint
from paddlespeech.s2t.utils import layer_tools from paddlespeech.s2t.utils import layer_tools
from paddlespeech.s2t.utils.ctc_utils import remove_duplicates_and_blank from paddlespeech.s2t.utils.ctc_utils import remove_duplicates_and_blank
from paddlespeech.s2t.utils.log import Log from paddlespeech.s2t.utils.log import Log
from paddlespeech.audio.utils.tensor_utils import add_sos_eos
from paddlespeech.audio.utils.tensor_utils import pad_sequence
from paddlespeech.audio.utils.tensor_utils import th_accuracy
from paddlespeech.s2t.utils.utility import log_add from paddlespeech.s2t.utils.utility import log_add
from paddlespeech.s2t.utils.utility import UpdateConfig from paddlespeech.s2t.utils.utility import UpdateConfig
@ -318,7 +318,7 @@ class U2BaseModel(ASRInterface, nn.Layer):
dim=1) # (B*N, i+1) dim=1) # (B*N, i+1)
# 2.6 Update end flag # 2.6 Update end flag
end_flag = paddle.eq(hyps[:, -1], self.eos).view(-1, 1) end_flag = paddle.equal(hyps[:, -1], self.eos).view(-1, 1)
# 3. Select best of best # 3. Select best of best
scores = scores.view(batch_size, beam_size) scores = scores.view(batch_size, beam_size)
@ -605,29 +605,42 @@ class U2BaseModel(ASRInterface, nn.Layer):
xs: paddle.Tensor, xs: paddle.Tensor,
offset: int, offset: int,
required_cache_size: int, required_cache_size: int,
subsampling_cache: Optional[paddle.Tensor]=None, att_cache: paddle.Tensor=paddle.zeros([0, 0, 0, 0]),
elayers_output_cache: Optional[List[paddle.Tensor]]=None, cnn_cache: paddle.Tensor=paddle.zeros([0, 0, 0, 0]),
conformer_cnn_cache: Optional[List[paddle.Tensor]]=None, ) -> Tuple[paddle.Tensor, paddle.Tensor, paddle.Tensor]:
) -> Tuple[paddle.Tensor, paddle.Tensor, List[paddle.Tensor], List[
paddle.Tensor]]:
""" Export interface for c++ call, give input chunk xs, and return """ Export interface for c++ call, give input chunk xs, and return
output from time 0 to current chunk. output from time 0 to current chunk.
Args: Args:
xs (paddle.Tensor): chunk input xs (paddle.Tensor): chunk input, with shape (b=1, time, mel-dim),
subsampling_cache (Optional[paddle.Tensor]): subsampling cache where `time == (chunk_size - 1) * subsample_rate + \
elayers_output_cache (Optional[List[paddle.Tensor]]): subsample.right_context + 1`
transformer/conformer encoder layers output cache offset (int): current offset in encoder output time stamp
conformer_cnn_cache (Optional[List[paddle.Tensor]]): conformer required_cache_size (int): cache size required for next chunk
cnn cache computation
>=0: actual cache size
<0: means all history cache is required
att_cache (paddle.Tensor): cache tensor for KEY & VALUE in
transformer/conformer attention, with shape
(elayers, head, cache_t1, d_k * 2), where
`head * d_k == hidden-dim` and
`cache_t1 == chunk_size * num_decoding_left_chunks`.
`d_k * 2` for att key & value.
cnn_cache (paddle.Tensor): cache tensor for cnn_module in conformer,
(elayers, b=1, hidden-dim, cache_t2), where
`cache_t2 == cnn.lorder - 1`.
Returns: Returns:
paddle.Tensor: output, it ranges from time 0 to current chunk. paddle.Tensor: output of current input xs,
paddle.Tensor: subsampling cache with shape (b=1, chunk_size, hidden-dim).
List[paddle.Tensor]: attention cache paddle.Tensor: new attention cache required for next chunk, with
List[paddle.Tensor]: conformer cnn cache dynamic shape (elayers, head, T(?), d_k * 2)
depending on required_cache_size.
paddle.Tensor: new conformer cnn cache required for next chunk, with
same shape as the original cnn_cache.
""" """
return self.encoder.forward_chunk( return self.encoder.forward_chunk(xs, offset, required_cache_size,
xs, offset, required_cache_size, subsampling_cache, att_cache, cnn_cache)
elayers_output_cache, conformer_cnn_cache)
# @jit.to_static # @jit.to_static
def ctc_activation(self, xs: paddle.Tensor) -> paddle.Tensor: def ctc_activation(self, xs: paddle.Tensor) -> paddle.Tensor:
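The shape relations in the new `forward_encoder_chunk` docstring can be checked with a short worked example. This is a sketch under stated assumptions: the function names are hypothetical, and every numeric value (chunk size, subsampling rate, right context, number of left chunks) is chosen only for illustration, not read from any config in this diff.

```python
def chunk_input_frames(chunk_size, subsample_rate, right_context):
    """Input frames needed for one chunk, per the docstring formula."""
    return (chunk_size - 1) * subsample_rate + right_context + 1


def att_cache_frames(chunk_size, num_decoding_left_chunks):
    """cache_t1 from the docstring: cached encoder frames for attention."""
    return chunk_size * num_decoding_left_chunks


# All numbers below are assumed, for illustration only.
chunk_size = 16
subsample_rate = 4
right_context = 6
num_decoding_left_chunks = 4

print(chunk_input_frames(chunk_size, subsample_rate, right_context))  # 15 * 4 + 6 + 1 = 67
print(att_cache_frames(chunk_size, num_decoding_left_chunks))         # 16 * 4 = 64
```

Under these assumed values, `xs` would have shape `(1, 67, mel-dim)` and `att_cache` would have shape `(elayers, head, 64, d_k * 2)`.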

@ -401,29 +401,42 @@ class U2STBaseModel(nn.Layer):
xs: paddle.Tensor, xs: paddle.Tensor,
offset: int, offset: int,
required_cache_size: int, required_cache_size: int,
subsampling_cache: Optional[paddle.Tensor]=None, att_cache: paddle.Tensor = paddle.zeros([0, 0, 0, 0]),
elayers_output_cache: Optional[List[paddle.Tensor]]=None, cnn_cache: paddle.Tensor = paddle.zeros([0, 0, 0, 0]),
conformer_cnn_cache: Optional[List[paddle.Tensor]]=None, ) -> Tuple[paddle.Tensor, paddle.Tensor, paddle.Tensor]:
) -> Tuple[paddle.Tensor, paddle.Tensor, List[paddle.Tensor], List[
paddle.Tensor]]:
""" Export interface for c++ call, give input chunk xs, and return """ Export interface for c++ call, give input chunk xs, and return
output from time 0 to current chunk. output from time 0 to current chunk.
Args: Args:
xs (paddle.Tensor): chunk input xs (paddle.Tensor): chunk input, with shape (b=1, time, mel-dim),
subsampling_cache (Optional[paddle.Tensor]): subsampling cache where `time == (chunk_size - 1) * subsample_rate + \
elayers_output_cache (Optional[List[paddle.Tensor]]): subsample.right_context + 1`
transformer/conformer encoder layers output cache offset (int): current offset in encoder output time stamp
conformer_cnn_cache (Optional[List[paddle.Tensor]]): conformer required_cache_size (int): cache size required for next chunk
cnn cache computation
>=0: actual cache size
<0: means all history cache is required
att_cache (paddle.Tensor): cache tensor for KEY & VALUE in
transformer/conformer attention, with shape
(elayers, head, cache_t1, d_k * 2), where
`head * d_k == hidden-dim` and
`cache_t1 == chunk_size * num_decoding_left_chunks`.
`d_k * 2` for att key & value.
cnn_cache (paddle.Tensor): cache tensor for cnn_module in conformer,
(elayers, b=1, hidden-dim, cache_t2), where
`cache_t2 == cnn.lorder - 1`
Returns: Returns:
paddle.Tensor: output, it ranges from time 0 to current chunk. paddle.Tensor: output of current input xs,
paddle.Tensor: subsampling cache with shape (b=1, chunk_size, hidden-dim).
List[paddle.Tensor]: attention cache paddle.Tensor: new attention cache required for next chunk, with
List[paddle.Tensor]: conformer cnn cache dynamic shape (elayers, head, T(?), d_k * 2)
depending on required_cache_size.
paddle.Tensor: new conformer cnn cache required for next chunk, with
same shape as the original cnn_cache.
""" """
return self.encoder.forward_chunk( return self.encoder.forward_chunk(
xs, offset, required_cache_size, subsampling_cache, xs, offset, required_cache_size, att_cache, cnn_cache)
elayers_output_cache, conformer_cnn_cache)
# @jit.to_static # @jit.to_static
def ctc_activation(self, xs: paddle.Tensor) -> paddle.Tensor: def ctc_activation(self, xs: paddle.Tensor) -> paddle.Tensor:

@ -13,8 +13,7 @@
# limitations under the License. # limitations under the License.
import paddle import paddle
from paddle import nn from paddle import nn
import math
from paddlespeech.s2t.modules.initializer import KaimingUniform
""" """
To align the initializer between paddle and torch, To align the initializer between paddle and torch,
the APIs below set a default initializer with higher priority than the global initializer. the APIs below set a default initializer with higher priority than the global initializer.
@ -82,10 +81,10 @@ class Linear(nn.Linear):
name=None): name=None):
if weight_attr is None: if weight_attr is None:
if global_init_type == "kaiming_uniform": if global_init_type == "kaiming_uniform":
weight_attr = paddle.ParamAttr(initializer=KaimingUniform()) weight_attr = paddle.ParamAttr(initializer=nn.initializer.KaimingUniform(fan_in=None, negative_slope=math.sqrt(5), nonlinearity='leaky_relu'))
if bias_attr is None: if bias_attr is None:
if global_init_type == "kaiming_uniform": if global_init_type == "kaiming_uniform":
bias_attr = paddle.ParamAttr(initializer=KaimingUniform()) bias_attr = paddle.ParamAttr(initializer=nn.initializer.KaimingUniform(fan_in=None, negative_slope=math.sqrt(5), nonlinearity='leaky_relu'))
super(Linear, self).__init__(in_features, out_features, weight_attr, super(Linear, self).__init__(in_features, out_features, weight_attr,
bias_attr, name) bias_attr, name)
@ -105,10 +104,10 @@ class Conv1D(nn.Conv1D):
data_format='NCL'): data_format='NCL'):
if weight_attr is None: if weight_attr is None:
if global_init_type == "kaiming_uniform": if global_init_type == "kaiming_uniform":
weight_attr = paddle.ParamAttr(initializer=KaimingUniform()) weight_attr = paddle.ParamAttr(initializer=nn.initializer.KaimingUniform(fan_in=None, negative_slope=math.sqrt(5), nonlinearity='leaky_relu'))
if bias_attr is None: if bias_attr is None:
if global_init_type == "kaiming_uniform": if global_init_type == "kaiming_uniform":
bias_attr = paddle.ParamAttr(initializer=KaimingUniform()) bias_attr = paddle.ParamAttr(initializer=nn.initializer.KaimingUniform(fan_in=None, negative_slope=math.sqrt(5), nonlinearity='leaky_relu'))
super(Conv1D, self).__init__( super(Conv1D, self).__init__(
in_channels, out_channels, kernel_size, stride, padding, dilation, in_channels, out_channels, kernel_size, stride, padding, dilation,
groups, padding_mode, weight_attr, bias_attr, data_format) groups, padding_mode, weight_attr, bias_attr, data_format)
@ -129,10 +128,10 @@ class Conv2D(nn.Conv2D):
data_format='NCHW'): data_format='NCHW'):
if weight_attr is None: if weight_attr is None:
if global_init_type == "kaiming_uniform": if global_init_type == "kaiming_uniform":
weight_attr = paddle.ParamAttr(initializer=KaimingUniform()) weight_attr = paddle.ParamAttr(initializer=nn.initializer.KaimingUniform(fan_in=None, negative_slope=math.sqrt(5), nonlinearity='leaky_relu'))
if bias_attr is None: if bias_attr is None:
if global_init_type == "kaiming_uniform": if global_init_type == "kaiming_uniform":
bias_attr = paddle.ParamAttr(initializer=KaimingUniform()) bias_attr = paddle.ParamAttr(initializer=nn.initializer.KaimingUniform(fan_in=None, negative_slope=math.sqrt(5), nonlinearity='leaky_relu'))
super(Conv2D, self).__init__( super(Conv2D, self).__init__(
in_channels, out_channels, kernel_size, stride, padding, dilation, in_channels, out_channels, kernel_size, stride, padding, dilation,
groups, padding_mode, weight_attr, bias_attr, data_format) groups, padding_mode, weight_attr, bias_attr, data_format)
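Note on `negative_slope=math.sqrt(5)` in the hunks above: assuming the initializer follows the standard Kaiming-uniform rule for `leaky_relu` (gain `sqrt(2 / (1 + a^2))`, bound `gain * sqrt(3 / fan_in)`), this slope makes the bound collapse to `1 / sqrt(fan_in)`. The sketch below only checks that arithmetic; `kaiming_uniform_bound` is a hypothetical helper and does not call Paddle.

```python
import math


def kaiming_uniform_bound(fan_in, negative_slope):
    """Bound of U(-b, b) under the standard Kaiming rule for leaky_relu.

    Assumed formula: gain = sqrt(2 / (1 + a^2)), b = gain * sqrt(3 / fan_in).
    """
    gain = math.sqrt(2.0 / (1.0 + negative_slope ** 2))
    return gain * math.sqrt(3.0 / fan_in)


fan_in = 256  # assumed layer width, for illustration
bound = kaiming_uniform_bound(fan_in, math.sqrt(5))
print(bound, 1.0 / math.sqrt(fan_in))
```

With `a = sqrt(5)`, the gain is `sqrt(2 / 6)`, so `gain * sqrt(3 / fan_in) = sqrt(1 / fan_in)`, which matches the goal stated in the file of aligning the initializer with torch.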

@ -84,9 +84,10 @@ class MultiHeadedAttention(nn.Layer):
return q, k, v return q, k, v
def forward_attention(self, def forward_attention(self,
value: paddle.Tensor, value: paddle.Tensor,
scores: paddle.Tensor, scores: paddle.Tensor,
mask: Optional[paddle.Tensor]) -> paddle.Tensor: mask: paddle.Tensor = paddle.ones([0, 0, 0], dtype=paddle.bool),
) -> paddle.Tensor:
"""Compute attention context vector. """Compute attention context vector.
Args: Args:
value (paddle.Tensor): Transformed value, size value (paddle.Tensor): Transformed value, size
@ -94,14 +95,23 @@ class MultiHeadedAttention(nn.Layer):
scores (paddle.Tensor): Attention score, size scores (paddle.Tensor): Attention score, size
(#batch, n_head, time1, time2). (#batch, n_head, time1, time2).
mask (paddle.Tensor): Mask, size (#batch, 1, time2) or mask (paddle.Tensor): Mask, size (#batch, 1, time2) or
(#batch, time1, time2). (#batch, time1, time2), (0, 0, 0) means fake mask.
Returns: Returns:
paddle.Tensor: Transformed value weighted paddle.Tensor: Transformed value (#batch, time1, d_model)
by the attention score, (#batch, time1, d_model). weighted by the attention score (#batch, time1, time2).
""" """
n_batch = value.shape[0] n_batch = value.shape[0]
if mask is not None:
mask = mask.unsqueeze(1).eq(0) # (batch, 1, *, time2) # When will `if mask.size(2) > 0` be True?
# 1. training.
# 2. onnx(16/4, chunk_size/history_size), feed real cache and real mask for the 1st chunk.
# When will `if mask.size(2) > 0` be False?
# 1. onnx(16/-1, -1/-1, 16/0)
# 2. jit (16/-1, -1/-1, 16/0, 16/4)
if paddle.shape(mask)[2] > 0: # time2 > 0
mask = mask.unsqueeze(1).equal(0) # (batch, 1, *, time2)
# for last chunk, time2 might be larger than scores.size(-1)
mask = mask[:, :, :, :paddle.shape(scores)[-1]]
scores = scores.masked_fill(mask, -float('inf')) scores = scores.masked_fill(mask, -float('inf'))
attn = paddle.softmax( attn = paddle.softmax(
scores, axis=-1).masked_fill(mask, scores, axis=-1).masked_fill(mask,
@ -121,21 +131,66 @@ class MultiHeadedAttention(nn.Layer):
query: paddle.Tensor, query: paddle.Tensor,
key: paddle.Tensor, key: paddle.Tensor,
value: paddle.Tensor, value: paddle.Tensor,
mask: Optional[paddle.Tensor]) -> paddle.Tensor: mask: paddle.Tensor = paddle.ones([0,0,0], dtype=paddle.bool),
pos_emb: paddle.Tensor = paddle.empty([0]),
cache: paddle.Tensor = paddle.zeros([0,0,0,0])
) -> Tuple[paddle.Tensor, paddle.Tensor]:
"""Compute scaled dot product attention. """Compute scaled dot product attention.
Args: Args:
query (torch.Tensor): Query tensor (#batch, time1, size). query (paddle.Tensor): Query tensor (#batch, time1, size).
key (torch.Tensor): Key tensor (#batch, time2, size). key (paddle.Tensor): Key tensor (#batch, time2, size).
value (torch.Tensor): Value tensor (#batch, time2, size). value (paddle.Tensor): Value tensor (#batch, time2, size).
mask (torch.Tensor): Mask tensor (#batch, 1, time2) or mask (paddle.Tensor): Mask tensor (#batch, 1, time2) or
(#batch, time1, time2). (#batch, time1, time2).
1.When applying cross attention between decoder and encoder,
the batch padding mask for input is in (#batch, 1, T) shape.
2.When applying self attention of encoder,
the mask is in (#batch, T, T) shape.
3.When applying self attention of decoder,
the mask is in (#batch, L, L) shape.
4.If the different position in decoder see different block
of the encoder, such as Mocha, the passed in mask could be
in (#batch, L, T) shape. But there is no such case in current
Wenet.
cache (paddle.Tensor): Cache tensor (1, head, cache_t, d_k * 2),
where `cache_t == chunk_size * num_decoding_left_chunks`
and `head * d_k == size`
Returns: Returns:
torch.Tensor: Output tensor (#batch, time1, d_model). paddle.Tensor: Output tensor (#batch, time1, d_model).
paddle.Tensor: Cache tensor (1, head, cache_t + time1, d_k * 2)
where `cache_t == chunk_size * num_decoding_left_chunks`
and `head * d_k == size`
""" """
q, k, v = self.forward_qkv(query, key, value) q, k, v = self.forward_qkv(query, key, value)
# When exporting an ONNX model, for the 1st chunk we feed
# cache(1, head, 0, d_k * 2) (16/-1, -1/-1, 16/0 mode)
# or cache(1, head, real_cache_t, d_k * 2) (16/4 mode).
# In all modes, `if paddle.shape(cache)[0] > 0` will always be `True`
# and we will always do splitting and
# concatenation (this simplifies ONNX export). Note that
# it's OK to concat & split zero-shaped tensors (see code below).
# When exporting a JIT model, for the 1st chunk we always feed
# cache(0, 0, 0, 0) since JIT supports a dynamic if-branch.
# >>> a = torch.ones((1, 2, 0, 4))
# >>> b = torch.ones((1, 2, 3, 4))
# >>> c = torch.cat((a, b), dim=2)
# >>> torch.equal(b, c) # True
# >>> d = torch.split(a, 2, dim=-1)
# >>> torch.equal(d[0], d[1]) # True
if paddle.shape(cache)[0] > 0:
# last dim `d_k * 2` for (key, val)
key_cache, value_cache = paddle.split(cache, 2, axis=-1)
k = paddle.concat([key_cache, k], axis=2)
v = paddle.concat([value_cache, v], axis=2)
# We do cache slicing in encoder.forward_chunk, since it's
# non-trivial to calculate `next_cache_start` here.
new_cache = paddle.concat((k, v), axis=-1)
scores = paddle.matmul(q, scores = paddle.matmul(q,
k.transpose([0, 1, 3, 2])) / math.sqrt(self.d_k) k.transpose([0, 1, 3, 2])) / math.sqrt(self.d_k)
return self.forward_attention(v, scores, mask) return self.forward_attention(v, scores, mask), new_cache
class RelPositionMultiHeadedAttention(MultiHeadedAttention): class RelPositionMultiHeadedAttention(MultiHeadedAttention):
@ -192,23 +247,55 @@ class RelPositionMultiHeadedAttention(MultiHeadedAttention):
query: paddle.Tensor, query: paddle.Tensor,
key: paddle.Tensor, key: paddle.Tensor,
value: paddle.Tensor, value: paddle.Tensor,
pos_emb: paddle.Tensor, mask: paddle.Tensor = paddle.ones([0,0,0], dtype=paddle.bool),
mask: Optional[paddle.Tensor]): pos_emb: paddle.Tensor = paddle.empty([0]),
cache: paddle.Tensor = paddle.zeros([0,0,0,0])
) -> Tuple[paddle.Tensor, paddle.Tensor]:
"""Compute 'Scaled Dot Product Attention' with rel. positional encoding. """Compute 'Scaled Dot Product Attention' with rel. positional encoding.
Args: Args:
query (paddle.Tensor): Query tensor (#batch, time1, size). query (paddle.Tensor): Query tensor (#batch, time1, size).
key (paddle.Tensor): Key tensor (#batch, time2, size). key (paddle.Tensor): Key tensor (#batch, time2, size).
value (paddle.Tensor): Value tensor (#batch, time2, size). value (paddle.Tensor): Value tensor (#batch, time2, size).
pos_emb (paddle.Tensor): Positional embedding tensor
(#batch, time1, size).
mask (paddle.Tensor): Mask tensor (#batch, 1, time2) or mask (paddle.Tensor): Mask tensor (#batch, 1, time2) or
(#batch, time1, time2). (#batch, time1, time2), (0, 0, 0) means fake mask.
pos_emb (paddle.Tensor): Positional embedding tensor
(#batch, time2, size).
cache (paddle.Tensor): Cache tensor (1, head, cache_t, d_k * 2),
where `cache_t == chunk_size * num_decoding_left_chunks`
and `head * d_k == size`
Returns: Returns:
paddle.Tensor: Output tensor (#batch, time1, d_model). paddle.Tensor: Output tensor (#batch, time1, d_model).
paddle.Tensor: Cache tensor (1, head, cache_t + time1, d_k * 2)
where `cache_t == chunk_size * num_decoding_left_chunks`
and `head * d_k == size`
""" """
q, k, v = self.forward_qkv(query, key, value) q, k, v = self.forward_qkv(query, key, value)
q = q.transpose([0, 2, 1, 3]) # (batch, time1, head, d_k) q = q.transpose([0, 2, 1, 3]) # (batch, time1, head, d_k)
# When exporting an ONNX model, for the 1st chunk we feed
# cache(1, head, 0, d_k * 2) (16/-1, -1/-1, 16/0 mode)
# or cache(1, head, real_cache_t, d_k * 2) (16/4 mode).
# In all modes, `if paddle.shape(cache)[0] > 0` will always be `True`
# and we will always do splitting and
# concatenation (this simplifies ONNX export). Note that
# it's OK to concat & split zero-shaped tensors (see code below).
# When exporting a JIT model, for the 1st chunk we always feed
# cache(0, 0, 0, 0) since JIT supports a dynamic if-branch.
# >>> a = torch.ones((1, 2, 0, 4))
# >>> b = torch.ones((1, 2, 3, 4))
# >>> c = torch.cat((a, b), dim=2)
# >>> torch.equal(b, c) # True
# >>> d = torch.split(a, 2, dim=-1)
# >>> torch.equal(d[0], d[1]) # True
if paddle.shape(cache)[0] > 0:
# last dim `d_k * 2` for (key, val)
key_cache, value_cache = paddle.split(cache, 2, axis=-1)
k = paddle.concat([key_cache, k], axis=2)
v = paddle.concat([value_cache, v], axis=2)
# We do cache slicing in encoder.forward_chunk, since it's
# non-trivial to calculate `next_cache_start` here.
new_cache = paddle.concat((k, v), axis=-1)
n_batch_pos = pos_emb.shape[0] n_batch_pos = pos_emb.shape[0]
p = self.linear_pos(pos_emb).view(n_batch_pos, -1, self.h, self.d_k) p = self.linear_pos(pos_emb).view(n_batch_pos, -1, self.h, self.d_k)
p = p.transpose([0, 2, 1, 3]) # (batch, head, time1, d_k) p = p.transpose([0, 2, 1, 3]) # (batch, head, time1, d_k)
@ -234,4 +321,4 @@ class RelPositionMultiHeadedAttention(MultiHeadedAttention):
scores = (matrix_ac + matrix_bd) / math.sqrt( scores = (matrix_ac + matrix_bd) / math.sqrt(
self.d_k) # (batch, head, time1, time2) self.d_k) # (batch, head, time1, time2)
return self.forward_attention(v, scores, mask) return self.forward_attention(v, scores, mask), new_cache
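The cache handling shared by both attention classes can be sketched outside Paddle. The following is a minimal illustration that uses NumPy as a stand-in for `paddle.split` and `paddle.concat`; the helper name `update_kv_cache` and the toy shapes are assumptions made for this example and are not part of the commit.

```python
import numpy as np


def update_kv_cache(k, v, cache):
    """Prepend cached keys/values to the current chunk and build the new cache.

    k, v:  (batch, head, time1, d_k) projections of the current chunk.
    cache: (1, head, cache_t, d_k * 2), or shape (0, 0, 0, 0) for "no cache".
    """
    if cache.shape[0] > 0:
        # The last dim packs key and value side by side: d_k * 2.
        key_cache, value_cache = np.split(cache, 2, axis=-1)
        k = np.concatenate([key_cache, k], axis=2)
        v = np.concatenate([value_cache, v], axis=2)
    # The caller is expected to trim this along the time axis later.
    new_cache = np.concatenate([k, v], axis=-1)
    return k, v, new_cache


head, d_k, chunk = 2, 4, 3
k = np.ones((1, head, chunk, d_k))
v = np.full((1, head, chunk, d_k), 2.0)

# First chunk with a JIT-style fake cache: the branch is skipped.
_, _, cache1 = update_kv_cache(k, v, np.zeros((0, 0, 0, 0)))
# First chunk with an ONNX-style cache where cache_t == 0: concat is a no-op.
_, _, cache1_onnx = update_kv_cache(k, v, np.zeros((1, head, 0, d_k * 2)))
# Second chunk: the time axis grows by one chunk.
k2, v2, cache2 = update_kv_cache(k, v, cache1)
```

Both first-chunk conventions yield the same cache, which is why the export comments treat splitting and concatenating zero-length tensors as safe.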

@ -108,15 +108,17 @@ class ConvolutionModule(nn.Layer):
def forward(self, def forward(self,
x: paddle.Tensor, x: paddle.Tensor,
mask_pad: Optional[paddle.Tensor]=None, mask_pad: paddle.Tensor= paddle.ones([0,0,0], dtype=paddle.bool),
cache: Optional[paddle.Tensor]=None cache: paddle.Tensor= paddle.zeros([0,0,0]),
) -> Tuple[paddle.Tensor, paddle.Tensor]: ) -> Tuple[paddle.Tensor, paddle.Tensor]:
"""Compute convolution module. """Compute convolution module.
Args: Args:
x (paddle.Tensor): Input tensor (#batch, time, channels). x (paddle.Tensor): Input tensor (#batch, time, channels).
mask_pad (paddle.Tensor): used for batch padding, (#batch, channels, time). mask_pad (paddle.Tensor): used for batch padding (#batch, 1, time),
(0, 0, 0) means fake mask.
cache (paddle.Tensor): left context cache, it is only cache (paddle.Tensor): left context cache, it is only
used in causal convolution. (#batch, channels, time') used in causal convolution (#batch, channels, cache_t),
(0, 0, 0) means fake cache.
Returns: Returns:
paddle.Tensor: Output tensor (#batch, time, channels). paddle.Tensor: Output tensor (#batch, time, channels).
paddle.Tensor: Output cache tensor (#batch, channels, time') paddle.Tensor: Output cache tensor (#batch, channels, time')
@ -125,11 +127,11 @@ class ConvolutionModule(nn.Layer):
x = x.transpose([0, 2, 1]) # [B, C, T] x = x.transpose([0, 2, 1]) # [B, C, T]
# mask batch padding # mask batch padding
if mask_pad is not None: if paddle.shape(mask_pad)[2] > 0: # time > 0
x = x.masked_fill(mask_pad, 0.0) x = x.masked_fill(mask_pad, 0.0)
if self.lorder > 0: if self.lorder > 0:
if cache is None: if paddle.shape(cache)[2] == 0: # cache_t == 0
x = nn.functional.pad( x = nn.functional.pad(
x, [self.lorder, 0], 'constant', 0.0, data_format='NCL') x, [self.lorder, 0], 'constant', 0.0, data_format='NCL')
else: else:
@ -143,7 +145,7 @@ class ConvolutionModule(nn.Layer):
# It's better we just return None if no cache is required, # It's better we just return None if no cache is required,
# However, for JIT export, here we just fake one tensor instead of # However, for JIT export, here we just fake one tensor instead of
# None. # None.
new_cache = paddle.zeros([1], dtype=x.dtype) new_cache = paddle.zeros([0, 0, 0], dtype=x.dtype)
# GLU mechanism # GLU mechanism
x = self.pointwise_conv1(x) # (batch, 2*channel, dim) x = self.pointwise_conv1(x) # (batch, 2*channel, dim)
@ -159,7 +161,7 @@ class ConvolutionModule(nn.Layer):
x = self.pointwise_conv2(x) x = self.pointwise_conv2(x)
# mask batch padding # mask batch padding
if mask_pad is not None: if paddle.shape(mask_pad)[2] > 0: # time > 0
x = x.masked_fill(mask_pad, 0.0) x = x.masked_fill(mask_pad, 0.0)
x = x.transpose([0, 2, 1]) # [B, T, C] x = x.transpose([0, 2, 1]) # [B, T, C]
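The left-context cache of a causal convolution can be illustrated on a single channel with plain Python lists. This is a sketch under assumptions: the lines that build `new_cache` are elided from the hunk above, so keeping the trailing `lorder` frames is an assumed pattern, and `prepend_left_context` and `causal_conv` are hypothetical helper names.

```python
def prepend_left_context(x, cache, lorder):
    """Return (padded_x, new_cache) for one channel of a causal convolution.

    x:     frames of the current chunk.
    cache: last `lorder` frames of the previous input, or [] as a fake cache.
    """
    if lorder <= 0:
        return list(x), []
    if len(cache) == 0:
        padded = [0.0] * lorder + list(x)  # first chunk: zero left padding
    else:
        assert len(cache) == lorder
        padded = list(cache) + list(x)     # later chunks: real left context
    new_cache = padded[-lorder:]           # assumed: keep the trailing frames
    return padded, new_cache


def causal_conv(padded, weights):
    """Valid 1-D correlation; emits len(padded) - len(weights) + 1 frames."""
    return [
        sum(w * padded[i + j] for j, w in enumerate(weights))
        for i in range(len(padded) - len(weights) + 1)
    ]


weights = [1.0, 1.0, 1.0]  # kernel size 3, so lorder == 2
lorder = len(weights) - 1

# Offline: pad once and convolve the whole sequence.
offline = causal_conv(prepend_left_context([1, 2, 3, 4, 5, 6], [], lorder)[0], weights)

# Streaming: two chunks, carrying the cache between them.
padded1, cache = prepend_left_context([1, 2, 3], [], lorder)
padded2, cache = prepend_left_context([4, 5, 6], cache, lorder)
streaming = causal_conv(padded1, weights) + causal_conv(padded2, weights)
```

With the cache carried over, the streamed output matches the offline output frame for frame.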

@ -121,11 +121,11 @@ class DecoderLayer(nn.Layer):
if self.concat_after: if self.concat_after:
tgt_concat = paddle.cat( tgt_concat = paddle.cat(
(tgt_q, self.self_attn(tgt_q, tgt, tgt, tgt_q_mask)), dim=-1) (tgt_q, self.self_attn(tgt_q, tgt, tgt, tgt_q_mask)[0]), dim=-1)
x = residual + self.concat_linear1(tgt_concat) x = residual + self.concat_linear1(tgt_concat)
else: else:
x = residual + self.dropout( x = residual + self.dropout(
self.self_attn(tgt_q, tgt, tgt, tgt_q_mask)) self.self_attn(tgt_q, tgt, tgt, tgt_q_mask)[0])
if not self.normalize_before: if not self.normalize_before:
x = self.norm1(x) x = self.norm1(x)
@ -134,11 +134,11 @@ class DecoderLayer(nn.Layer):
x = self.norm2(x) x = self.norm2(x)
if self.concat_after: if self.concat_after:
x_concat = paddle.cat( x_concat = paddle.cat(
(x, self.src_attn(x, memory, memory, memory_mask)), dim=-1) (x, self.src_attn(x, memory, memory, memory_mask)[0]), dim=-1)
x = residual + self.concat_linear2(x_concat) x = residual + self.concat_linear2(x_concat)
else: else:
x = residual + self.dropout( x = residual + self.dropout(
self.src_attn(x, memory, memory, memory_mask)) self.src_attn(x, memory, memory, memory_mask)[0])
if not self.normalize_before: if not self.normalize_before:
x = self.norm2(x) x = self.norm2(x)

@ -131,7 +131,7 @@ class PositionalEncoding(nn.Layer, PositionalEncodingInterface):
offset (int): start offset offset (int): start offset
size (int): required size of position encoding size (int): required size of position encoding
Returns: Returns:
paddle.Tensor: Corresponding position encoding paddle.Tensor: Corresponding position encoding, #[1, T, D].
""" """
assert offset + size < self.max_len assert offset + size < self.max_len
return self.dropout(self.pe[:, offset:offset + size]) return self.dropout(self.pe[:, offset:offset + size])

@ -177,7 +177,7 @@ class BaseEncoder(nn.Layer):
decoding_chunk_size, self.static_chunk_size, decoding_chunk_size, self.static_chunk_size,
num_decoding_left_chunks) num_decoding_left_chunks)
for layer in self.encoders: for layer in self.encoders:
xs, chunk_masks, _ = layer(xs, chunk_masks, pos_emb, mask_pad) xs, chunk_masks, _, _ = layer(xs, chunk_masks, pos_emb, mask_pad)
if self.normalize_before: if self.normalize_before:
xs = self.after_norm(xs) xs = self.after_norm(xs)
# Here we assume the mask is not changed in encoder layers, so just # Here we assume the mask is not changed in encoder layers, so just
@ -190,30 +190,31 @@ class BaseEncoder(nn.Layer):
xs: paddle.Tensor, xs: paddle.Tensor,
offset: int, offset: int,
required_cache_size: int, required_cache_size: int,
subsampling_cache: Optional[paddle.Tensor]=None, att_cache: paddle.Tensor = paddle.zeros([0,0,0,0]),
elayers_output_cache: Optional[List[paddle.Tensor]]=None, cnn_cache: paddle.Tensor = paddle.zeros([0,0,0,0]),
conformer_cnn_cache: Optional[List[paddle.Tensor]]=None, att_mask: paddle.Tensor = paddle.ones([0,0,0], dtype=paddle.bool),
) -> Tuple[paddle.Tensor, paddle.Tensor, List[paddle.Tensor], List[ ) -> Tuple[paddle.Tensor, paddle.Tensor, paddle.Tensor]:
paddle.Tensor]]:
""" Forward just one chunk """ Forward just one chunk
Args: Args:
xs (paddle.Tensor): chunk input, [B=1, T, D] xs (paddle.Tensor): chunk audio feat input, [B=1, T, D], where
`T==(chunk_size-1)*subsampling_rate + subsample.right_context + 1`
offset (int): current offset in encoder output time stamp offset (int): current offset in encoder output time stamp
required_cache_size (int): cache size required for next chunk required_cache_size (int): cache size required for next chunk
computation computation
>=0: actual cache size >=0: actual cache size
<0: means all history cache is required <0: means all history cache is required
subsampling_cache (Optional[paddle.Tensor]): subsampling cache att_cache(paddle.Tensor): cache tensor for key & val in
elayers_output_cache (Optional[List[paddle.Tensor]]): transformer/conformer attention. Shape is
transformer/conformer encoder layers output cache (elayers, head, cache_t1, d_k * 2), where `head * d_k == hidden-dim`
conformer_cnn_cache (Optional[List[paddle.Tensor]]): conformer and `cache_t1 == chunk_size * num_decoding_left_chunks`.
cnn cache cnn_cache (paddle.Tensor): cache tensor for cnn_module in conformer,
(elayers, B=1, hidden-dim, cache_t2), where `cache_t2 == cnn.lorder - 1`
Returns: Returns:
paddle.Tensor: output of current input xs paddle.Tensor: output of current input xs, (B=1, chunk_size, hidden-dim)
paddle.Tensor: subsampling cache required for next chunk computation paddle.Tensor: new attention cache required for next chunk, dynamic shape
List[paddle.Tensor]: encoder layers output cache required for next (elayers, head, T, d_k*2) depending on required_cache_size
chunk computation paddle.Tensor: new conformer cnn cache required for next chunk, with
List[paddle.Tensor]: conformer cnn cache same shape as the original cnn_cache
""" """
assert xs.shape[0] == 1 # batch size must be one assert xs.shape[0] == 1 # batch size must be one
# tmp_masks is just for interface compatibility # tmp_masks is just for interface compatibility
@ -225,50 +226,50 @@ class BaseEncoder(nn.Layer):
if self.global_cmvn is not None: if self.global_cmvn is not None:
xs = self.global_cmvn(xs) xs = self.global_cmvn(xs)
xs, pos_emb, _ = self.embed( # before embed, xs=(B, T, D1), pos_emb=(B=1, T, D)
xs, tmp_masks, offset=offset) #xs=(B, T, D), pos_emb=(B=1, T, D) xs, pos_emb, _ = self.embed(xs, tmp_masks, offset=offset)
# after embed, xs=(B=1, chunk_size, hidden-dim)
if subsampling_cache is not None: elayers = paddle.shape(att_cache)[0]
cache_size = subsampling_cache.shape[1] #T cache_t1 = paddle.shape(att_cache)[2]
xs = paddle.cat((subsampling_cache, xs), dim=1) chunk_size = paddle.shape(xs)[1]
else: attention_key_size = cache_t1 + chunk_size
cache_size = 0
# only used when using `RelPositionMultiHeadedAttention` # only used when using `RelPositionMultiHeadedAttention`
pos_emb = self.embed.position_encoding( pos_emb = self.embed.position_encoding(
offset=offset - cache_size, size=xs.shape[1]) offset=offset - cache_t1, size=attention_key_size)
if required_cache_size < 0: if required_cache_size < 0:
next_cache_start = 0 next_cache_start = 0
elif required_cache_size == 0: elif required_cache_size == 0:
next_cache_start = xs.shape[1] next_cache_start = attention_key_size
else: else:
next_cache_start = xs.shape[1] - required_cache_size next_cache_start = max(attention_key_size - required_cache_size, 0)
r_subsampling_cache = xs[:, next_cache_start:, :]
r_att_cache = []
# Real mask for transformer/conformer layers r_cnn_cache = []
masks = paddle.ones([1, xs.shape[1]], dtype=paddle.bool)
masks = masks.unsqueeze(1) #[B=1, L'=1, T]
r_elayers_output_cache = []
r_conformer_cnn_cache = []
for i, layer in enumerate(self.encoders): for i, layer in enumerate(self.encoders):
attn_cache = None if elayers_output_cache is None else elayers_output_cache[ # att_cache[i:i+1] = (1, head, cache_t1, d_k*2)
i] # cnn_cache[i:i+1] = (1, B=1, hidden-dim, cache_t2)
cnn_cache = None if conformer_cnn_cache is None else conformer_cnn_cache[ xs, _, new_att_cache, new_cnn_cache = layer(
i] xs, att_mask, pos_emb,
xs, _, new_cnn_cache = layer( att_cache=att_cache[i:i+1] if elayers > 0 else att_cache,
xs, cnn_cache=cnn_cache[i:i+1] if paddle.shape(cnn_cache)[0] > 0 else cnn_cache,
masks, )
pos_emb, # new_att_cache = (1, head, attention_key_size, d_k*2)
output_cache=attn_cache, # new_cnn_cache = (B=1, hidden-dim, cache_t2)
cnn_cache=cnn_cache) r_att_cache.append(new_att_cache[:,:, next_cache_start:, :])
r_elayers_output_cache.append(xs[:, next_cache_start:, :]) r_cnn_cache.append(new_cnn_cache.unsqueeze(0)) # add elayer dim
r_conformer_cnn_cache.append(new_cnn_cache)
if self.normalize_before: if self.normalize_before:
xs = self.after_norm(xs) xs = self.after_norm(xs)
return (xs[:, cache_size:, :], r_subsampling_cache, # r_att_cache (elayers, head, T, d_k*2)
r_elayers_output_cache, r_conformer_cnn_cache) # r_cnn_cache (elayers, B=1, hidden-dim, cache_t2)
r_att_cache = paddle.concat(r_att_cache, axis=0)
r_cnn_cache = paddle.concat(r_cnn_cache, axis=0)
return xs, r_att_cache, r_cnn_cache
def forward_chunk_by_chunk( def forward_chunk_by_chunk(
self, self,
@ -313,25 +314,24 @@ class BaseEncoder(nn.Layer):
num_frames = xs.shape[1] num_frames = xs.shape[1]
required_cache_size = decoding_chunk_size * num_decoding_left_chunks required_cache_size = decoding_chunk_size * num_decoding_left_chunks
subsampling_cache: Optional[paddle.Tensor] = None
elayers_output_cache: Optional[List[paddle.Tensor]] = None att_cache: paddle.Tensor = paddle.zeros([0,0,0,0])
conformer_cnn_cache: Optional[List[paddle.Tensor]] = None cnn_cache: paddle.Tensor = paddle.zeros([0,0,0,0])
outputs = [] outputs = []
offset = 0 offset = 0
# Feed forward overlap input step by step # Feed forward overlap input step by step
for cur in range(0, num_frames - context + 1, stride): for cur in range(0, num_frames - context + 1, stride):
end = min(cur + decoding_window, num_frames) end = min(cur + decoding_window, num_frames)
chunk_xs = xs[:, cur:end, :] chunk_xs = xs[:, cur:end, :]
(y, subsampling_cache, elayers_output_cache,
conformer_cnn_cache) = self.forward_chunk( (y, att_cache, cnn_cache) = self.forward_chunk(
chunk_xs, offset, required_cache_size, subsampling_cache, chunk_xs, offset, required_cache_size, att_cache, cnn_cache)
elayers_output_cache, conformer_cnn_cache)
outputs.append(y) outputs.append(y)
offset += y.shape[1] offset += y.shape[1]
ys = paddle.cat(outputs, 1) ys = paddle.cat(outputs, 1)
# fake mask, just for jit script and compatibility with `forward` api masks = paddle.ones([1, 1, ys.shape[1]], dtype=paddle.bool)
masks = paddle.ones([1, ys.shape[1]], dtype=paddle.bool)
masks = masks.unsqueeze(1)
return ys, masks return ys, masks
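The `next_cache_start` arithmetic in `forward_chunk` decides how much of the key/value history survives into the next chunk. A minimal pure-Python sketch of that rule follows; `simulate_cache_lengths` is a hypothetical helper added only to show how the cache length evolves over successive chunks.

```python
def next_cache_start(attention_key_size, required_cache_size):
    """Index on the time axis from which the new attention cache is kept."""
    if required_cache_size < 0:
        return 0                   # keep all history
    if required_cache_size == 0:
        return attention_key_size  # keep nothing
    # Clamp at 0 so early chunks with little history keep everything.
    return max(attention_key_size - required_cache_size, 0)


def simulate_cache_lengths(num_chunks, chunk_size, required_cache_size):
    """Track cache_t1 across chunks, assuming a fixed chunk_size."""
    cache_t = 0
    lengths = []
    for _ in range(num_chunks):
        attention_key_size = cache_t + chunk_size
        start = next_cache_start(attention_key_size, required_cache_size)
        cache_t = attention_key_size - start
        lengths.append(cache_t)
    return lengths


limited = simulate_cache_lengths(5, 16, 64)    # 16 frames x 4 left chunks
unlimited = simulate_cache_lengths(3, 16, -1)  # all history
no_cache = simulate_cache_lengths(3, 16, 0)    # no left context
```

With `required_cache_size = 64`, the cache grows chunk by chunk until it reaches 64 frames and then stays there, which matches `cache_t1 == chunk_size * num_decoding_left_chunks` in the docstring.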

@ -75,49 +75,43 @@ class TransformerEncoderLayer(nn.Layer):
self, self,
x: paddle.Tensor, x: paddle.Tensor,
mask: paddle.Tensor, mask: paddle.Tensor,
pos_emb: Optional[paddle.Tensor]=None, pos_emb: paddle.Tensor,
mask_pad: Optional[paddle.Tensor]=None, mask_pad: paddle.Tensor=paddle.ones([0, 0, 0], dtype=paddle.bool),
output_cache: Optional[paddle.Tensor]=None, att_cache: paddle.Tensor=paddle.zeros([0, 0, 0, 0]),
cnn_cache: Optional[paddle.Tensor]=None, cnn_cache: paddle.Tensor=paddle.zeros([0, 0, 0, 0]),
) -> Tuple[paddle.Tensor, paddle.Tensor, paddle.Tensor]: ) -> Tuple[paddle.Tensor, paddle.Tensor, paddle.Tensor, paddle.Tensor]:
"""Compute encoded features. """Compute encoded features.
Args: Args:
x (paddle.Tensor): Input tensor (#batch, time, size). x (paddle.Tensor): (#batch, time, size)
mask (paddle.Tensor): Mask tensor for the input (#batch, time). mask (paddle.Tensor): Mask tensor for the input (#batch, time, time),
(0, 0, 0) means fake mask.
pos_emb (paddle.Tensor): just for interface compatibility pos_emb (paddle.Tensor): just for interface compatibility
to ConformerEncoderLayer to ConformerEncoderLayer
mask_pad (paddle.Tensor): not used here, it's for interface mask_pad (paddle.Tensor): not used in transformer layer,
compatibility to ConformerEncoderLayer just for unified api with conformer.
output_cache (paddle.Tensor): Cache tensor of the output att_cache (paddle.Tensor): Cache tensor of the KEY & VALUE
(#batch, time2, size), time2 < time in x. (#batch=1, head, cache_t1, d_k * 2), head * d_k == size.
cnn_cache (paddle.Tensor): not used here, it's for interface cnn_cache (paddle.Tensor): Convolution cache in conformer layer
compatibility to ConformerEncoderLayer (#batch=1, size, cache_t2), not used here, it's for interface
compatibility to ConformerEncoderLayer.
Returns: Returns:
paddle.Tensor: Output tensor (#batch, time, size). paddle.Tensor: Output tensor (#batch, time, size).
paddle.Tensor: Mask tensor (#batch, time). paddle.Tensor: Mask tensor (#batch, time, time).
paddle.Tensor: Fake cnn cache tensor for api compatibility with Conformer (#batch, channels, time'). paddle.Tensor: att_cache tensor,
(#batch=1, head, cache_t1 + time, d_k * 2).
paddle.Tensor: cnn_cache tensor (#batch=1, size, cache_t2).
""" """
residual = x residual = x
if self.normalize_before: if self.normalize_before:
x = self.norm1(x) x = self.norm1(x)
if output_cache is None: x_att, new_att_cache = self.self_attn(x, x, x, mask, cache=att_cache)
x_q = x
else:
assert output_cache.shape[0] == x.shape[0]
assert output_cache.shape[1] < x.shape[1]
assert output_cache.shape[2] == self.size
chunk = x.shape[1] - output_cache.shape[1]
x_q = x[:, -chunk:, :]
residual = residual[:, -chunk:, :]
mask = mask[:, -chunk:, :]
if self.concat_after: if self.concat_after:
x_concat = paddle.concat( x_concat = paddle.concat((x, x_att), axis=-1)
(x, self.self_attn(x_q, x, x, mask)), axis=-1)
x = residual + self.concat_linear(x_concat) x = residual + self.concat_linear(x_concat)
else: else:
x = residual + self.dropout(self.self_attn(x_q, x, x, mask)) x = residual + self.dropout(x_att)
if not self.normalize_before: if not self.normalize_before:
x = self.norm1(x) x = self.norm1(x)
@ -128,11 +122,8 @@ class TransformerEncoderLayer(nn.Layer):
if not self.normalize_before: if not self.normalize_before:
x = self.norm2(x) x = self.norm2(x)
if output_cache is not None: fake_cnn_cache = paddle.zeros([0, 0, 0], dtype=x.dtype)
x = paddle.concat([output_cache, x], axis=1) return x, mask, new_att_cache, fake_cnn_cache
fake_cnn_cache = paddle.zeros([1], dtype=x.dtype)
return x, mask, fake_cnn_cache
class ConformerEncoderLayer(nn.Layer): class ConformerEncoderLayer(nn.Layer):
@ -192,32 +183,44 @@ class ConformerEncoderLayer(nn.Layer):
self.size = size self.size = size
self.normalize_before = normalize_before self.normalize_before = normalize_before
self.concat_after = concat_after self.concat_after = concat_after
self.concat_linear = Linear(size + size, size) if self.concat_after:
self.concat_linear = Linear(size + size, size)
else:
self.concat_linear = nn.Identity()
def forward( def forward(
self, self,
x: paddle.Tensor, x: paddle.Tensor,
mask: paddle.Tensor, mask: paddle.Tensor,
pos_emb: paddle.Tensor, pos_emb: paddle.Tensor,
mask_pad: Optional[paddle.Tensor]=None, mask_pad: paddle.Tensor=paddle.ones([0, 0, 0], dtype=paddle.bool),
output_cache: Optional[paddle.Tensor]=None, att_cache: paddle.Tensor=paddle.zeros([0, 0, 0, 0]),
cnn_cache: Optional[paddle.Tensor]=None, cnn_cache: paddle.Tensor=paddle.zeros([0, 0, 0, 0]),
) -> Tuple[paddle.Tensor, paddle.Tensor, paddle.Tensor]: ) -> Tuple[paddle.Tensor, paddle.Tensor, paddle.Tensor, paddle.Tensor]:
"""Compute encoded features. """Compute encoded features.
Args: Args:
x (paddle.Tensor): (#batch, time, size) x (paddle.Tensor): Input tensor (#batch, time, size).
mask (paddle.Tensor): Mask tensor for the input (#batch, time, time). mask (paddle.Tensor): Mask tensor for the input (#batch, time, time).
pos_emb (paddle.Tensor): positional encoding, must not be None (0,0,0) means fake mask.
for ConformerEncoderLayer. pos_emb (paddle.Tensor): positional encoding, must not be None
mask_pad (paddle.Tensor): batch padding mask used for conv module, (B, 1, T). for ConformerEncoderLayer
output_cache (paddle.Tensor): Cache tensor of the encoder output mask_pad (paddle.Tensor): batch padding mask used for conv module.
(#batch, time2, size), time2 < time in x. (#batch, 1, time), (0, 0, 0) means fake mask.
att_cache (paddle.Tensor): Cache tensor of the KEY & VALUE
(#batch=1, head, cache_t1, d_k * 2), head * d_k == size.
cnn_cache (paddle.Tensor): Convolution cache in conformer layer cnn_cache (paddle.Tensor): Convolution cache in conformer layer
(1, #batch=1, size, cache_t2). First dim will not be used, just
for dy2st.
Returns: Returns:
paddle.Tensor: Output tensor (#batch, time, size). paddle.Tensor: Output tensor (#batch, time, size).
paddle.Tensor: Mask tensor (#batch, time). paddle.Tensor: Mask tensor (#batch, time, time).
paddle.Tensor: New cnn cache tensor (#batch, channels, time'). paddle.Tensor: att_cache tensor,
(#batch=1, head, cache_t1 + time, d_k * 2).
paddle.Tensor: cnn_cache tensor (#batch, size, cache_t2).
""" """
# (1, #batch=1, size, cache_t2) -> (#batch=1, size, cache_t2)
cnn_cache = paddle.squeeze(cnn_cache, axis=0)
# whether to use macaron style FFN # whether to use macaron style FFN
if self.feed_forward_macaron is not None: if self.feed_forward_macaron is not None:
residual = x residual = x
@ -233,18 +236,8 @@ class ConformerEncoderLayer(nn.Layer):
if self.normalize_before: if self.normalize_before:
x = self.norm_mha(x) x = self.norm_mha(x)
if output_cache is None: x_att, new_att_cache = self.self_attn(
x_q = x x, x, x, mask, pos_emb, cache=att_cache)
else:
assert output_cache.shape[0] == x.shape[0]
assert output_cache.shape[1] < x.shape[1]
assert output_cache.shape[2] == self.size
chunk = x.shape[1] - output_cache.shape[1]
x_q = x[:, -chunk:, :]
residual = residual[:, -chunk:, :]
mask = mask[:, -chunk:, :]
x_att = self.self_attn(x_q, x, x, pos_emb, mask)
if self.concat_after: if self.concat_after:
x_concat = paddle.concat((x, x_att), axis=-1) x_concat = paddle.concat((x, x_att), axis=-1)
@ -257,7 +250,7 @@ class ConformerEncoderLayer(nn.Layer):
# convolution module # convolution module
# Fake new cnn cache here, and then change it in conv_module # Fake new cnn cache here, and then change it in conv_module
new_cnn_cache = paddle.zeros([1], dtype=x.dtype) new_cnn_cache = paddle.zeros([0, 0, 0], dtype=x.dtype)
if self.conv_module is not None: if self.conv_module is not None:
residual = x residual = x
if self.normalize_before: if self.normalize_before:
@ -282,7 +275,4 @@ class ConformerEncoderLayer(nn.Layer):
if self.conv_module is not None: if self.conv_module is not None:
x = self.norm_final(x) x = self.norm_final(x)
if output_cache is not None: return x, mask, new_att_cache, new_cnn_cache
x = paddle.concat([output_cache, x], axis=1)
return x, mask, new_cnn_cache
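The per-layer handling of `cnn_cache` is a round trip over the leading layer dimension: slice with `i:i+1`, squeeze inside the layer, then add the dimension back and concatenate. The sketch below uses NumPy as a stand-in for the Paddle calls. The toy shapes and the `+ 1.0` placeholder for the convolution module's update are assumptions for illustration only.

```python
import numpy as np

elayers, batch, hidden, cache_t2 = 3, 1, 8, 6
cnn_cache = np.arange(
    elayers * batch * hidden * cache_t2, dtype=np.float32
).reshape(elayers, batch, hidden, cache_t2)

new_caches = []
for i in range(elayers):
    # Slicing with i:i+1 keeps the layer dim: (1, B=1, hidden, cache_t2).
    layer_in = cnn_cache[i:i + 1]
    # The layer drops that dim on entry: (B=1, hidden, cache_t2).
    layer_cache = np.squeeze(layer_in, axis=0)
    # Placeholder for whatever the convolution module writes back.
    new_layer_cache = layer_cache + 1.0
    # Add the layer dim again so the results can be stacked.
    new_caches.append(new_layer_cache[np.newaxis])

# (elayers, B=1, hidden, cache_t2), same shape as the input cache.
r_cnn_cache = np.concatenate(new_caches, axis=0)
```

Keeping the cache as one stacked tensor rather than a list of per-layer tensors gives `forward_chunk` a fixed tensor-only signature.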

@ -12,142 +12,6 @@
# See the License for the specific language governing permissions and # See the License for the specific language governing permissions and
# limitations under the License. # limitations under the License.
import numpy as np import numpy as np
from paddle.fluid import framework
from paddle.fluid import unique_name
from paddle.fluid.core import VarDesc
from paddle.fluid.initializer import MSRAInitializer
__all__ = ['KaimingUniform']
class KaimingUniform(MSRAInitializer):
r"""Implements the Kaiming Uniform initializer
This class implements the weight initialization from the paper
`Delving Deep into Rectifiers: Surpassing Human-Level Performance on
ImageNet Classification <https://arxiv.org/abs/1502.01852>`_
by Kaiming He, Xiangyu Zhang, Shaoqing Ren and Jian Sun. This is a
robust initialization method that particularly considers the rectifier
nonlinearities.
In case of Uniform distribution, the range is [-x, x], where
.. math::
x = \sqrt{\frac{1.0}{fan\_in}}
In case of Normal distribution, the mean is 0 and the standard deviation
is
.. math::
\sqrt{\\frac{2.0}{fan\_in}}
Args:
fan_in (float32|None): fan_in for Kaiming uniform Initializer. If None, it is\
inferred from the variable. default is None.
Note:
It is recommended to set fan_in to None for most cases.
Examples:
.. code-block:: python
import paddle
import paddle.nn as nn
linear = nn.Linear(2,
4,
weight_attr=nn.initializer.KaimingUniform())
data = paddle.rand([30, 10, 2], dtype='float32')
res = linear(data)
"""
def __init__(self, fan_in=None):
super(KaimingUniform, self).__init__(
uniform=True, fan_in=fan_in, seed=0)
def __call__(self, var, block=None):
"""Initialize the input tensor with MSRA initialization.
Args:
var(Tensor): Tensor that needs to be initialized.
block(Block, optional): The block in which initialization ops
should be added. Used in static graph only, default None.
Returns:
The initialization op
"""
block = self._check_block(block)
assert isinstance(var, framework.Variable)
assert isinstance(block, framework.Block)
f_in, f_out = self._compute_fans(var)
# If fan_in is passed, use it
fan_in = f_in if self._fan_in is None else self._fan_in
if self._seed == 0:
self._seed = block.program.random_seed
# to be compatible with fp16 initializers
if var.dtype == VarDesc.VarType.FP16 or (
var.dtype == VarDesc.VarType.BF16 and not self._uniform):
out_dtype = VarDesc.VarType.FP32
out_var = block.create_var(
name=unique_name.generate(
".".join(['masra_init', var.name, 'tmp'])),
shape=var.shape,
dtype=out_dtype,
type=VarDesc.VarType.LOD_TENSOR,
persistable=False)
else:
out_dtype = var.dtype
out_var = var
if self._uniform:
limit = np.sqrt(1.0 / float(fan_in))
op = block.append_op(
type="uniform_random",
inputs={},
outputs={"Out": out_var},
attrs={
"shape": out_var.shape,
"dtype": int(out_dtype),
"min": -limit,
"max": limit,
"seed": self._seed
},
stop_gradient=True)
else:
std = np.sqrt(2.0 / float(fan_in))
op = block.append_op(
type="gaussian_random",
outputs={"Out": out_var},
attrs={
"shape": out_var.shape,
"dtype": int(out_dtype),
"mean": 0.0,
"std": std,
"seed": self._seed
},
stop_gradient=True)
if var.dtype == VarDesc.VarType.FP16 or (
var.dtype == VarDesc.VarType.BF16 and not self._uniform):
block.append_op(
type="cast",
inputs={"X": out_var},
outputs={"Out": var},
attrs={"in_dtype": out_var.dtype,
"out_dtype": var.dtype})
if not framework.in_dygraph_mode():
var.op = op
return op
class DefaultInitializerContext(object): class DefaultInitializerContext(object):
""" """

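For reference, the bounds that the removed `KaimingUniform` class computed can be reproduced with the standard library alone. This sketch mirrors only the two formulas from its docstring. The function names are hypothetical, and `random.Random` stands in for the `uniform_random` op.

```python
import math
import random


def kaiming_uniform_limit(fan_in):
    """Half-width x of the uniform range [-x, x]: sqrt(1 / fan_in)."""
    return math.sqrt(1.0 / float(fan_in))


def kaiming_normal_std(fan_in):
    """Standard deviation of the normal variant: sqrt(2 / fan_in)."""
    return math.sqrt(2.0 / float(fan_in))


def sample_uniform(fan_in, n, seed=0):
    """Draw n weights uniformly from [-limit, limit]."""
    rng = random.Random(seed)
    limit = kaiming_uniform_limit(fan_in)
    return [rng.uniform(-limit, limit) for _ in range(n)]


weights = sample_uniform(fan_in=4, n=1000)
```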
@ -718,6 +718,7 @@ class VectorClientExecutor(BaseExecutor):
logger.info(f"the input audio: {input}") logger.info(f"the input audio: {input}")
handler = VectorHttpHandler(server_ip=server_ip, port=port) handler = VectorHttpHandler(server_ip=server_ip, port=port)
res = handler.run(input, audio_format, sample_rate) res = handler.run(input, audio_format, sample_rate)
logger.info(f"The spk embedding is: {res}")
return res return res
elif task == "score": elif task == "score":
from paddlespeech.server.utils.audio_handler import VectorScoreHttpHandler from paddlespeech.server.utils.audio_handler import VectorScoreHttpHandler

@ -39,10 +39,10 @@ class OnlineCTCEndpoingOpt:
# rule1 times out after 5 seconds of silence, even if we decoded nothing. # rule1 times out after 5 seconds of silence, even if we decoded nothing.
rule1: OnlineCTCEndpointRule = OnlineCTCEndpointRule(False, 5000, 0) rule1: OnlineCTCEndpointRule = OnlineCTCEndpointRule(False, 5000, 0)
# rule4 times out after 1.0 seconds of silence after decoding something, # rule2 times out after 1.0 seconds of silence after decoding something,
# even if we did not reach a final-state at all. # even if we did not reach a final-state at all.
rule2: OnlineCTCEndpointRule = OnlineCTCEndpointRule(True, 1000, 0) rule2: OnlineCTCEndpointRule = OnlineCTCEndpointRule(True, 1000, 0)
# rule5 times out after the utterance is 20 seconds long, regardless of # rule3 times out after the utterance is 20 seconds long, regardless of
# anything else. # anything else.
rule3: OnlineCTCEndpointRule = OnlineCTCEndpointRule(False, 0, 20000) rule3: OnlineCTCEndpointRule = OnlineCTCEndpointRule(False, 0, 20000)
@ -103,6 +103,7 @@ class OnlineCTCEndpoint:
assert self.num_frames_decoded >= self.trailing_silence_frames assert self.num_frames_decoded >= self.trailing_silence_frames
assert self.frame_shift_in_ms > 0 assert self.frame_shift_in_ms > 0
decoding_something = (self.num_frames_decoded > self.trailing_silence_frames) and decoding_something
utterance_length = self.num_frames_decoded * self.frame_shift_in_ms utterance_length = self.num_frames_decoded * self.frame_shift_in_ms
trailing_silence = self.trailing_silence_frames * self.frame_shift_in_ms trailing_silence = self.trailing_silence_frames * self.frame_shift_in_ms
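The three endpoint rules and the new `decoding_something` guard can be sketched as follows. The rule-evaluation function is not shown in this diff, so `rule_activated` below follows the usual Kaldi-style convention and is an assumption. The field names and the 10 ms default frame shift are also hypothetical.

```python
from dataclasses import dataclass


@dataclass
class EndpointRule:
    must_decoded_sth: bool        # rule applies only after something was decoded
    min_trailing_silence_ms: int
    min_utterance_length_ms: int


def rule_activated(rule, decoded_sth, trailing_silence_ms, utterance_length_ms):
    # Assumed evaluation: every condition of the rule must hold.
    return ((decoded_sth or not rule.must_decoded_sth)
            and trailing_silence_ms >= rule.min_trailing_silence_ms
            and utterance_length_ms >= rule.min_utterance_length_ms)


RULES = [
    EndpointRule(False, 5000, 0),   # rule1: 5 s of silence, even with nothing decoded
    EndpointRule(True, 1000, 0),    # rule2: 1 s of silence after decoding something
    EndpointRule(False, 0, 20000),  # rule3: utterance is 20 s long
]


def endpoint_detected(num_frames_decoded, trailing_silence_frames,
                      decoding_something, frame_shift_in_ms=10):
    assert num_frames_decoded >= trailing_silence_frames
    assert frame_shift_in_ms > 0
    # Guard added in this commit: an all-silence utterance decoded nothing.
    decoding_something = (
        num_frames_decoded > trailing_silence_frames) and decoding_something
    utterance_length = num_frames_decoded * frame_shift_in_ms
    trailing_silence = trailing_silence_frames * frame_shift_in_ms
    return any(
        rule_activated(r, decoding_something, trailing_silence, utterance_length)
        for r in RULES)
```

For example, 1.2 s of trailing silence ends the utterance only if something was decoded before it. If every decoded frame is silence, the guard clears the flag and rule2 no longer fires.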

@ -130,9 +130,9 @@ class PaddleASRConnectionHanddler:
## conformer ## conformer
# cache for conformer online # cache for conformer online
self.subsampling_cache = None self.att_cache = paddle.zeros([0,0,0,0])
self.elayers_output_cache = None self.cnn_cache = paddle.zeros([0,0,0,0])
self.conformer_cnn_cache = None
self.encoder_out = None self.encoder_out = None
# conformer decoding state # conformer decoding state
self.offset = 0 # global offset in decoding frame unit self.offset = 0 # global offset in decoding frame unit
@ -474,11 +474,9 @@ class PaddleASRConnectionHanddler:
# cur chunk # cur chunk
chunk_xs = self.cached_feat[:, cur:end, :] chunk_xs = self.cached_feat[:, cur:end, :]
# forward chunk # forward chunk
(y, self.subsampling_cache, self.elayers_output_cache, (y, self.att_cache, self.cnn_cache) = self.model.encoder.forward_chunk(
self.conformer_cnn_cache) = self.model.encoder.forward_chunk(
chunk_xs, self.offset, required_cache_size, chunk_xs, self.offset, required_cache_size,
self.subsampling_cache, self.elayers_output_cache, self.att_cache, self.cnn_cache)
self.conformer_cnn_cache)
outputs.append(y) outputs.append(y)
# update the global offset, in decoding frame unit # update the global offset, in decoding frame unit

@ -60,7 +60,10 @@ def warm_up(engine_and_type: str, warm_up_time: int=3) -> bool:
else: else:
st = time.time() st = time.time()
connection_handler.infer(text=sentence) connection_handler.infer(
text=sentence,
lang=tts_engine.lang,
am=tts_engine.config.am)
et = time.time() et = time.time()
logger.debug( logger.debug(
f"The response time of the {i} warm up: {et - st} s") f"The response time of the {i} warm up: {et - st} s")

@ -0,0 +1,8 @@
001 你好,欢迎使用 Paddle Speech 中英文混合 T T S 功能,开始你的合成之旅吧!
002 我们的声学模型使用了 Fast Speech Two, 声码器使用了 Parallel Wave GAN and Hifi GAN.
003 Paddle N L P 发布 ERNIE Tiny 全系列中文预训练小模型,快速提升预训练模型部署效率,通用信息抽取技术 U I E Tiny 系列模型全新升级,支持速度更快效果更好的 U I E 小模型。
004 Paddle Speech 发布 P P A S R 流式语音识别系统、P P T T S 流式语音合成系统、P P V P R 全链路声纹识别系统。
005 Paddle Bo Bo: 使用 Paddle Speech 的语音合成模块生成虚拟人的声音。
006 热烈欢迎您在 Discussions 中提交问题,并在 Issues 中指出发现的 bug。此外我们非常希望您参与到 Paddle Speech 的开发中!
007 我喜欢 eat apple, 你喜欢 drink milk。
008 我们要去云南 team building, 非常非常 happy.

@ -29,6 +29,7 @@ from yacs.config import CfgNode
from paddlespeech.t2s.datasets.data_table import DataTable from paddlespeech.t2s.datasets.data_table import DataTable
from paddlespeech.t2s.frontend import English from paddlespeech.t2s.frontend import English
from paddlespeech.t2s.frontend.mix_frontend import MixFrontend
from paddlespeech.t2s.frontend.zh_frontend import Frontend from paddlespeech.t2s.frontend.zh_frontend import Frontend
from paddlespeech.t2s.modules.normalizer import ZScore from paddlespeech.t2s.modules.normalizer import ZScore
from paddlespeech.utils.dynamic_import import dynamic_import from paddlespeech.utils.dynamic_import import dynamic_import
@ -98,6 +99,8 @@ def get_sentences(text_file: Optional[os.PathLike], lang: str='zh'):
sentence = "".join(items[1:]) sentence = "".join(items[1:])
elif lang == 'en': elif lang == 'en':
sentence = " ".join(items[1:]) sentence = " ".join(items[1:])
elif lang == 'mix':
sentence = " ".join(items[1:])
sentences.append((utt_id, sentence)) sentences.append((utt_id, sentence))
return sentences return sentences
@ -111,7 +114,8 @@ def get_test_dataset(test_metadata: List[Dict[str, Any]],
am_dataset = am[am.rindex('_') + 1:] am_dataset = am[am.rindex('_') + 1:]
if am_name == 'fastspeech2': if am_name == 'fastspeech2':
fields = ["utt_id", "text"] fields = ["utt_id", "text"]
if am_dataset in {"aishell3", "vctk"} and speaker_dict is not None: if am_dataset in {"aishell3", "vctk",
"mix"} and speaker_dict is not None:
print("multiple speaker fastspeech2!") print("multiple speaker fastspeech2!")
fields += ["spk_id"] fields += ["spk_id"]
elif voice_cloning: elif voice_cloning:
@ -140,6 +144,10 @@ def get_frontend(lang: str='zh',
phone_vocab_path=phones_dict, tone_vocab_path=tones_dict) phone_vocab_path=phones_dict, tone_vocab_path=tones_dict)
elif lang == 'en': elif lang == 'en':
frontend = English(phone_vocab_path=phones_dict) frontend = English(phone_vocab_path=phones_dict)
elif lang == 'mix':
frontend = MixFrontend(
phone_vocab_path=phones_dict, tone_vocab_path=tones_dict)
else: else:
print("wrong lang!") print("wrong lang!")
print("frontend done!") print("frontend done!")
@ -341,8 +349,12 @@ def get_am_output(
input_ids = frontend.get_input_ids( input_ids = frontend.get_input_ids(
input, merge_sentences=merge_sentences) input, merge_sentences=merge_sentences)
phone_ids = input_ids["phone_ids"] phone_ids = input_ids["phone_ids"]
elif lang == 'mix':
input_ids = frontend.get_input_ids(
input, merge_sentences=merge_sentences)
phone_ids = input_ids["phone_ids"]
else: else:
print("lang should in {'zh', 'en'}!") print("lang should in {'zh', 'en', 'mix'}!")
if get_tone_ids: if get_tone_ids:
tone_ids = input_ids["tone_ids"] tone_ids = input_ids["tone_ids"]

@ -113,8 +113,12 @@ def evaluate(args):
input_ids = frontend.get_input_ids( input_ids = frontend.get_input_ids(
sentence, merge_sentences=merge_sentences) sentence, merge_sentences=merge_sentences)
phone_ids = input_ids["phone_ids"] phone_ids = input_ids["phone_ids"]
elif args.lang == 'mix':
input_ids = frontend.get_input_ids(
sentence, merge_sentences=merge_sentences)
phone_ids = input_ids["phone_ids"]
else: else:
print("lang should in {'zh', 'en'}!") print("lang should in {'zh', 'en', 'mix'}!")
with paddle.no_grad(): with paddle.no_grad():
flags = 0 flags = 0
for i in range(len(phone_ids)): for i in range(len(phone_ids)):
@ -122,7 +126,7 @@ def evaluate(args):
# acoustic model # acoustic model
if am_name == 'fastspeech2': if am_name == 'fastspeech2':
# multi speaker # multi speaker
if am_dataset in {"aishell3", "vctk"}: if am_dataset in {"aishell3", "vctk", "mix"}:
spk_id = paddle.to_tensor(args.spk_id) spk_id = paddle.to_tensor(args.spk_id)
mel = am_inference(part_phone_ids, spk_id) mel = am_inference(part_phone_ids, spk_id)
else: else:
@ -170,7 +174,7 @@ def parse_args():
choices=[ choices=[
'speedyspeech_csmsc', 'speedyspeech_aishell3', 'fastspeech2_csmsc', 'speedyspeech_csmsc', 'speedyspeech_aishell3', 'fastspeech2_csmsc',
'fastspeech2_ljspeech', 'fastspeech2_aishell3', 'fastspeech2_vctk', 'fastspeech2_ljspeech', 'fastspeech2_aishell3', 'fastspeech2_vctk',
'tacotron2_csmsc', 'tacotron2_ljspeech' 'tacotron2_csmsc', 'tacotron2_ljspeech', 'fastspeech2_mix'
], ],
help='Choose acoustic model type of tts task.') help='Choose acoustic model type of tts task.')
parser.add_argument( parser.add_argument(
@ -231,7 +235,7 @@ def parse_args():
'--lang', '--lang',
type=str, type=str,
default='zh', default='zh',
help='Choose model language. zh or en') help='Choose model language. zh or en or mix')
parser.add_argument( parser.add_argument(
"--inference_dir", "--inference_dir",

@ -0,0 +1,179 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import re
from typing import Dict
from typing import List
import paddle
from paddlespeech.t2s.frontend import English
from paddlespeech.t2s.frontend.zh_frontend import Frontend
class MixFrontend():
def __init__(self,
g2p_model="pypinyin",
phone_vocab_path=None,
tone_vocab_path=None):
self.zh_frontend = Frontend(
phone_vocab_path=phone_vocab_path, tone_vocab_path=tone_vocab_path)
self.en_frontend = English(phone_vocab_path=phone_vocab_path)
self.SENTENCE_SPLITOR = re.compile(r'([:、,;。?!,;?!][”’]?)')
self.sp_id = self.zh_frontend.vocab_phones["sp"]
self.sp_id_tensor = paddle.to_tensor([self.sp_id])
def is_chinese(self, char):
if char >= '\u4e00' and char <= '\u9fa5':
return True
else:
return False
def is_alphabet(self, char):
if (char >= '\u0041' and char <= '\u005a') or (char >= '\u0061' and
char <= '\u007a'):
return True
else:
return False
def is_number(self, char):
if char >= '\u0030' and char <= '\u0039':
return True
else:
return False
def is_other(self, char):
if not (self.is_chinese(char) or self.is_number(char) or
self.is_alphabet(char)):
return True
else:
return False
def _split(self, text: str) -> List[str]:
text = re.sub(r'[《》【】<=>{}()#&@“”^_|…\\]', '', text)
text = self.SENTENCE_SPLITOR.sub(r'\1\n', text)
text = text.strip()
sentences = [sentence.strip() for sentence in re.split(r'\n+', text)]
return sentences
def _distinguish(self, text: str) -> List[str]:
# sentence --> [ch_part, en_part, ch_part, ...]
segments = []
types = []
flag = 0
temp_seg = ""
temp_lang = ""
# Determine the type of each character. type: blank, chinese, alphabet, number, unk.
for ch in text:
if self.is_chinese(ch):
types.append("zh")
elif self.is_alphabet(ch):
types.append("en")
elif ch == " ":
types.append("blank")
elif self.is_number(ch):
types.append("num")
else:
types.append("unk")
assert len(types) == len(text)
for i in range(len(types)):
# find the first char of the seg
if flag == 0:
if types[i] != "unk" and types[i] != "blank":
temp_seg += text[i]
temp_lang = types[i]
flag = 1
else:
if types[i] == temp_lang or types[i] == "num":
temp_seg += text[i]
elif temp_lang == "num" and types[i] != "unk":
temp_seg += text[i]
if types[i] == "zh" or types[i] == "en":
temp_lang = types[i]
elif temp_lang == "en" and types[i] == "blank":
temp_seg += text[i]
elif types[i] == "unk":
pass
else:
segments.append((temp_seg, temp_lang))
if types[i] != "unk" and types[i] != "blank":
temp_seg = text[i]
temp_lang = types[i]
flag = 1
else:
flag = 0
temp_seg = ""
temp_lang = ""
segments.append((temp_seg, temp_lang))
return segments
def get_input_ids(self,
sentence: str,
merge_sentences: bool=True,
get_tone_ids: bool=False,
add_sp: bool=True) -> Dict[str, List[paddle.Tensor]]:
sentences = self._split(sentence)
phones_list = []
result = {}
for text in sentences:
phones_seg = []
segments = self._distinguish(text)
for seg in segments:
content = seg[0]
lang = seg[1]
if lang == "zh":
input_ids = self.zh_frontend.get_input_ids(
content,
merge_sentences=True,
get_tone_ids=get_tone_ids)
elif lang == "en":
input_ids = self.en_frontend.get_input_ids(
content, merge_sentences=True)
phones_seg.append(input_ids["phone_ids"][0])
if add_sp:
phones_seg.append(self.sp_id_tensor)
phones = paddle.concat(phones_seg)
phones_list.append(phones)
if merge_sentences:
merge_list = paddle.concat(phones_list)
# rm the last 'sp' to avoid the noise at the end
# cause in the training data, no 'sp' in the end
if merge_list[-1] == self.sp_id_tensor:
merge_list = merge_list[:-1]
phones_list = []
phones_list.append(merge_list)
result["phone_ids"] = phones_list
return result
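
The idea behind `_distinguish` can be illustrated with a much smaller standalone sketch. The code below is a simplified approximation written for this note, not the method above: `char_type` and `split_by_language` are hypothetical names, and the handling of digits and blanks is deliberately reduced (digits join the current run or the next one, blanks are kept only inside English runs, and every other character is dropped).

```python
from typing import List, Tuple


def char_type(ch: str) -> str:
    # Same character ranges as is_chinese / is_alphabet / is_number above.
    if '\u4e00' <= ch <= '\u9fa5':
        return "zh"
    if ('A' <= ch <= 'Z') or ('a' <= ch <= 'z'):
        return "en"
    if ch == " ":
        return "blank"
    if '0' <= ch <= '9':
        return "num"
    return "unk"


def split_by_language(text: str) -> List[Tuple[str, str]]:
    """Group a sentence into (segment, lang) runs; simplified sketch."""
    segments = []
    seg, lang = "", ""
    for ch in text:
        kind = char_type(ch)
        if kind == "unk":
            continue  # punctuation and symbols are dropped
        if kind == "blank":
            if lang == "en":
                seg += ch  # keep spaces between English words
            continue
        if kind == "num":
            seg += ch  # digits stay with the current (or next) run
            continue
        if kind == lang or lang == "":
            seg += ch
            lang = kind
        else:
            # Language switch: close the current run and start a new one.
            segments.append((seg.strip(), lang))
            seg, lang = ch, kind
    if seg and lang:
        segments.append((seg.strip(), lang))
    return segments


example = split_by_language("我喜欢 eat apple, 你喜欢 drink milk。")
```

Each resulting run would then be routed to the Chinese or the English frontend according to its language tag, which is the role `get_input_ids` plays in the class above.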

@ -72,7 +72,8 @@ base = [
"colorlog", "colorlog",
"pathos == 0.2.8", "pathos == 0.2.8",
"braceexpand", "braceexpand",
"pyyaml" "pyyaml",
"pybind11",
] ]
server = [ server = [
@ -91,7 +92,6 @@ requirements = {
"gpustat", "gpustat",
"paddlespeech_ctcdecoders", "paddlespeech_ctcdecoders",
"phkit", "phkit",
"pybind11",
"pypi-kenlm", "pypi-kenlm",
"snakeviz", "snakeviz",
"sox", "sox",

@ -1,27 +1,26 @@
* [python_kaldi_features](https://github.com/ZitengWang/python_kaldi_features) # python_kaldi_features
[python_kaldi_features](https://github.com/ZitengWang/python_kaldi_features)
commit: fc1bd6240c2008412ab64dc25045cd872f5e126c commit: fc1bd6240c2008412ab64dc25045cd872f5e126c
ref: https://zhuanlan.zhihu.com/p/55371926 ref: https://zhuanlan.zhihu.com/p/55371926
license: MIT license: MIT
* [python-pinyin](https://github.com/mozillazg/python-pinyin.git) # Install ctc_decoder for Windows
commit: 55e524aa1b7b8eec3d15c5306043c6cdd5938b03
license: MIT
* [zhon](https://github.com/tsroten/zhon) `install_win_ctc.bat` is a batch script that installs paddlespeech_ctc_decoders on Windows.
commit: 09bf543696277f71de502506984661a60d24494c
license: MIT
* [pymmseg-cpp](https://github.com/pluskid/pymmseg-cpp.git) ## Prepare your environment
commit: b76465045717fbb4f118c4fbdd24ce93bab10a6d
license: MIT
* [chinese_text_normalization](https://github.com/speechio/chinese_text_normalization.git) Ensure your environment meets the following requirements:
commit: 9e92c7bf2d6b5a7974305406d8e240045beac51c
license: MIT
* [phkit](https://github.com/KuangDD/phkit.git) * gcc: version >= 12.1.0
commit: b2100293c1e36da531d7f30bd52c9b955a649522 * cmake: version >= 3.24.0
license: None * make: version >= 3.82.90
* visual studio: version >= 2019
* [nnAudio](https://github.com/KinWaiCheuk/nnAudio.git) ## Start your bat script
license: MIT
```shell
start install_win_ctc.bat
```
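
Before starting the script, it may help to confirm that the required tools are on `PATH`. The following is a minimal sketch using only the Python standard library. The tool list is an assumption drawn from the requirements above plus the `git` and `python` commands the batch script calls. It does not check versions, and it cannot detect Visual Studio.

```python
import shutil
from typing import Dict, Iterable, Optional

# Assumed list of executables; adjust to match your setup.
REQUIRED_TOOLS = ("gcc", "cmake", "make", "git", "python")


def find_tools(names: Iterable[str] = REQUIRED_TOOLS) -> Dict[str, Optional[str]]:
    """Map each tool name to its resolved path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_tools().items():
        status = path if path is not None else "NOT FOUND"
        print(f"{name}: {status}")
```

Any tool reported as `NOT FOUND` should be installed or added to `PATH` before running `install_win_ctc.bat`.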

@ -13,7 +13,8 @@
#include "decoder_utils.h" #include "decoder_utils.h"
using namespace lm::ngram; using namespace lm::ngram;
// If your platform is Windows, you need to add this define.
#define F_OK 0
Scorer::Scorer(double alpha, Scorer::Scorer(double alpha,
double beta, double beta,
const std::string& lm_path, const std::string& lm_path,

@ -89,10 +89,11 @@ FILES = [
or fn.endswith('unittest.cc')) or fn.endswith('unittest.cc'))
] ]
# yapf: enable # yapf: enable
LIBS = ['stdc++'] LIBS = ['stdc++']
if platform.system() != 'Darwin': if platform.system() != 'Darwin':
LIBS.append('rt') LIBS.append('rt')
if platform.system() == 'Windows':
LIBS = ['-static-libstdc++']
ARGS = ['-O3', '-DNDEBUG', '-DKENLM_MAX_ORDER=6', '-std=c++11'] ARGS = ['-O3', '-DNDEBUG', '-DKENLM_MAX_ORDER=6', '-std=c++11']

@ -0,0 +1,21 @@
@echo off
cd ctc_decoders
if not exist kenlm (
git clone https://github.com/Doubledongli/kenlm.git
@echo.
)
if not exist openfst-1.6.3 (
echo "Download and extract openfst ..."
git clone https://gitee.com/koala999/openfst.git
ren openfst openfst-1.6.3
@echo.
)
if not exist ThreadPool (
git clone https://github.com/progschj/ThreadPool.git
@echo.
)
echo "Install decoders ..."
python setup.py install --num_processes 4