commit a36ecccfcb (pull/1827/head)
Yang Zhou (2 years ago)

Merge branch 'develop' of github.com:SmileGoat/PaddleSpeech into add_fbank

@ -1,19 +1,10 @@
([简体中文](./README_cn.md)|English)
<p align="center">
<img src="./docs/images/PaddleSpeech_logo.png" />
</p>
<div align="center">
<h3>
<a href="#quick-start"> Quick Start </a>
| <a href="#quick-start-server"> Quick Start Server </a>
| <a href="#documents"> Documents </a>
| <a href="#model-list"> Models List </a>
</div>
------------------------------------------------------------------------------------
<p align="center">
<a href="./LICENSE"><img src="https://img.shields.io/badge/license-Apache%202-red.svg"></a>
@ -28,6 +19,20 @@
<a href="=https://pypi.org/project/paddlespeech/"><img src="https://static.pepy.tech/badge/paddlespeech"></a>
<a href="https://huggingface.co/spaces"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue"></a>
</p>
<div align="center">
<h3>
| <a href="#quick-start"> Quick Start </a>
| <a href="#quick-start-server"> Quick Start Server </a>
| <a href="#quick-start-streaming-server"> Quick Start Streaming Server</a>
|
</br>
| <a href="#documents"> Documents </a>
| <a href="#model-list"> Models List </a>
|
</h3>
</div>
**PaddleSpeech** is an open-source toolkit on the [PaddlePaddle](https://github.com/PaddlePaddle/Paddle) platform for a variety of critical tasks in speech and audio, with state-of-the-art and influential models.
@ -142,26 +147,6 @@ For more synthesized audios, please refer to [PaddleSpeech Text-to-Speech sample
</div>
### ⭐ Examples
- **[PaddleBoBo](https://github.com/JiehangXie/PaddleBoBo): Use PaddleSpeech TTS to generate virtual human voice.**
<div align="center"><a href="https://www.bilibili.com/video/BV1cL411V71o?share_source=copy_web"><img src="https://ai-studio-static-online.cdn.bcebos.com/06fd746ab32042f398fb6f33f873e6869e846fe63c214596ae37860fe8103720" / width="500px"></a></div>
- [PaddleSpeech Demo Video](https://paddlespeech.readthedocs.io/en/latest/demo_video.html)
- **[VTuberTalk](https://github.com/jerryuhoo/VTuberTalk): Use PaddleSpeech TTS and ASR to clone voice from videos.**
<div align="center">
<img src="https://raw.githubusercontent.com/jerryuhoo/VTuberTalk/main/gui/gui.png" width = "500px" />
</div>
### 🔥 Hot Activities
- 2021.12.21~12.24
4-Day Live Course: In-depth interpretation of PaddleSpeech!
**Course videos and related materials: https://aistudio.baidu.com/aistudio/education/group/info/25130**
### Features
@ -174,11 +159,22 @@ Via the easy-to-use, efficient, flexible and scalable implementation, our vision
- 🔬 *Integration of mainstream models and datasets*: the toolkit implements modules that participate in the whole pipeline of the speech tasks, and uses mainstream datasets like LibriSpeech, LJSpeech, AIShell, CSMSC, etc. See also [model list](#model-list) for more details.
- 🧩 *Cascaded models application*: as an extension of the typical traditional audio tasks, we combine the workflows of the aforementioned tasks with other fields like Natural language processing (NLP) and Computer Vision (CV).
### Recent Update
### 🔥 Hot Activities
<!---
2021.12.14: We would like to have an online courses to introduce basics and research of speech, as well as code practice with `paddlespeech`. Please pay attention to our [Calendar](https://www.paddlepaddle.org.cn/live).
--->
- 2021.12.21~12.24
4-Day Live Course: In-depth interpretation of PaddleSpeech!
**Course videos and related materials: https://aistudio.baidu.com/aistudio/education/group/info/25130**
### Recent Update
- 👏🏻 2022.04.28: PaddleSpeech Streaming Server is available for Automatic Speech Recognition and Text-to-Speech.
- 👏🏻 2022.03.28: PaddleSpeech Server is available for Audio Classification, Automatic Speech Recognition and Text-to-Speech.
- 👏🏻 2022.03.28: PaddleSpeech CLI is available for Speaker Verification.
- 🤗 2021.12.14: Our PaddleSpeech [ASR](https://huggingface.co/spaces/KPatrick/PaddleSpeechASR) and [TTS](https://huggingface.co/spaces/KPatrick/PaddleSpeechTTS) Demos on Hugging Face Spaces are available!
@ -196,6 +192,7 @@ Via the easy-to-use, efficient, flexible and scalable implementation, our vision
We strongly recommend installing PaddleSpeech on **Linux** with *python>=3.7*.
Up to now, **Linux** supports the CLI for all of our tasks, while **Mac OSX** and **Windows** only support the PaddleSpeech CLI for Audio Classification, Speech-to-Text and Text-to-Speech. To install `PaddleSpeech`, please see [installation](./docs/source/install.md).
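If you just want the CLI on a machine that already has PaddlePaddle, a pip-based setup is usually enough. This is only a sketch; the installation document above remains the authoritative reference:

```bash
# assumes Linux, python>=3.7 and an existing PaddlePaddle installation
pip install paddlespeech
```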
<a name="quickstart"></a>
## Quick Start
@ -238,7 +235,7 @@ paddlespeech tts --input "你好,欢迎使用飞桨深度学习框架!" --ou
**Batch Process**
```
echo -e "1 欢迎光临。\n2 谢谢惠顾。" | paddlespeech tts
```
```
**Shell Pipeline**
- ASR + Punctuation Restoration
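A minimal sketch of this pipeline, chaining the same two CLI commands used in the ASR demo documentation with a shell pipe:

```bash
# Chinese ASR followed by punctuation restoration
paddlespeech asr --input ./zh.wav | paddlespeech text --task punc
```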
@ -257,16 +254,19 @@ If you want to try more functions like training and tuning, please have a look a
Developers can try out our speech server with the [PaddleSpeech Server Command Line](./paddlespeech/server/README.md).
**Start server**
```shell
paddlespeech_server start --config_file ./paddlespeech/server/conf/application.yaml
```
**Access Speech Recognition Services**
```shell
paddlespeech_client asr --server_ip 127.0.0.1 --port 8090 --input input_16k.wav
```
**Access Text to Speech Services**
```shell
paddlespeech_client tts --server_ip 127.0.0.1 --port 8090 --input "您好,欢迎使用百度飞桨语音合成服务。" --output output.wav
```
@ -280,6 +280,37 @@ paddlespeech_client cls --server_ip 127.0.0.1 --port 8090 --input input.wav
For more information about server command lines, please see: [speech server demos](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/demos/speech_server)
<a name="quickstartstreamingserver"></a>
## Quick Start Streaming Server
Developers can try out the [streaming asr](./demos/streaming_asr_server/README.md) and [streaming tts](./demos/streaming_tts_server/README.md) servers.
**Start Streaming Speech Recognition Server**
```
paddlespeech_server start --config_file ./demos/streaming_asr_server/conf/application.yaml
```
**Access Streaming Speech Recognition Services**
```
paddlespeech_client asr_online --server_ip 127.0.0.1 --port 8090 --input input_16k.wav
```
**Start Streaming Text to Speech Server**
```
paddlespeech_server start --config_file ./demos/streaming_tts_server/conf/tts_online_application.yaml
```
**Access Streaming Text to Speech Services**
```
paddlespeech_client tts_online --server_ip 127.0.0.1 --port 8092 --protocol http --input "您好,欢迎使用百度飞桨语音合成服务。" --output output.wav
```
For more information please see: [streaming asr](./demos/streaming_asr_server/README.md) and [streaming tts](./demos/streaming_tts_server/README.md)
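Since the streaming ASR and TTS services listen on different ports in the commands above (8090 and 8092 respectively), they can be brought up side by side; a hedged sketch:

```bash
# run both streaming servers in the background, using the same config files as above
paddlespeech_server start --config_file ./demos/streaming_asr_server/conf/application.yaml &
paddlespeech_server start --config_file ./demos/streaming_tts_server/conf/tts_online_application.yaml &
```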
<a name="ModelList"></a>
## Model List
@ -589,6 +620,21 @@ Normally, [Speech SoTA](https://paperswithcode.com/area/speech), [Audio SoTA](ht
The Text-to-Speech module was originally called [Parakeet](https://github.com/PaddlePaddle/Parakeet) and has now been merged into this repository. If you are interested in academic research about this task, please see the [TTS research overview](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/docs/source/tts#overview). Also, [this document](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/tts/models_introduction.md) is a good guideline for the pipeline components.
## ⭐ Examples
- **[PaddleBoBo](https://github.com/JiehangXie/PaddleBoBo): Use PaddleSpeech TTS to generate virtual human voice.**
<div align="center"><a href="https://www.bilibili.com/video/BV1cL411V71o?share_source=copy_web"><img src="https://ai-studio-static-online.cdn.bcebos.com/06fd746ab32042f398fb6f33f873e6869e846fe63c214596ae37860fe8103720" / width="500px"></a></div>
- [PaddleSpeech Demo Video](https://paddlespeech.readthedocs.io/en/latest/demo_video.html)
- **[VTuberTalk](https://github.com/jerryuhoo/VTuberTalk): Use PaddleSpeech TTS and ASR to clone voice from videos.**
<div align="center">
<img src="https://raw.githubusercontent.com/jerryuhoo/VTuberTalk/main/gui/gui.png" width = "500px" />
</div>
## Citation
To cite PaddleSpeech for research, please use the following format.
@ -655,7 +701,6 @@ You are warmly welcome to submit questions in [discussions](https://github.com/P
## Acknowledgement
- Many thanks to [yeyupiaoling](https://github.com/yeyupiaoling)/[PPASR](https://github.com/yeyupiaoling/PPASR)/[PaddlePaddle-DeepSpeech](https://github.com/yeyupiaoling/PaddlePaddle-DeepSpeech)/[VoiceprintRecognition-PaddlePaddle](https://github.com/yeyupiaoling/VoiceprintRecognition-PaddlePaddle)/[AudioClassification-PaddlePaddle](https://github.com/yeyupiaoling/AudioClassification-PaddlePaddle) for years of attention, constructive advice and great help.
- Many thanks to [mymagicpower](https://github.com/mymagicpower) for the Java implementation of ASR upon [short](https://github.com/mymagicpower/AIAS/tree/main/3_audio_sdks/asr_sdk) and [long](https://github.com/mymagicpower/AIAS/tree/main/3_audio_sdks/asr_long_audio_sdk) audio files.
- Many thanks to [JiehangXie](https://github.com/JiehangXie)/[PaddleBoBo](https://github.com/JiehangXie/PaddleBoBo) for developing Virtual Uploader(VUP)/Virtual YouTuber(VTuber) with PaddleSpeech TTS function.

@ -2,26 +2,45 @@
<p align="center">
<img src="./docs/images/PaddleSpeech_logo.png" />
</p>
<div align="center">
<h3>
<a href="#quick-start"> 快速开始 </a>
| <a href="#quick-start-server"> 快速使用服务 </a>
| <a href="#documents"> 教程文档 </a>
| <a href="#model-list"> 模型列表 </a>
</div>
------------------------------------------------------------------------------------
<p align="center">
<a href="./LICENSE"><img src="https://img.shields.io/badge/license-Apache%202-red.svg"></a>
<a href="support os"><img src="https://img.shields.io/badge/os-linux-yellow.svg"></a>
<a href="https://github.com/PaddlePaddle/PaddleSpeech/releases"><img src="https://img.shields.io/github/v/release/PaddlePaddle/PaddleSpeech?color=ffa"></a>
<a href="support os"><img src="https://img.shields.io/badge/os-linux%2C%20win%2C%20mac-pink.svg"></a>
<a href=""><img src="https://img.shields.io/badge/python-3.7+-aff.svg"></a>
<a href="https://github.com/PaddlePaddle/PaddleSpeech/graphs/contributors"><img src="https://img.shields.io/github/contributors/PaddlePaddle/PaddleSpeech?color=9ea"></a>
<a href="https://github.com/PaddlePaddle/PaddleSpeech/commits"><img src="https://img.shields.io/github/commit-activity/m/PaddlePaddle/PaddleSpeech?color=3af"></a>
<a href="https://github.com/PaddlePaddle/PaddleSpeech/issues"><img src="https://img.shields.io/github/issues/PaddlePaddle/PaddleSpeech?color=9cc"></a>
<a href="https://github.com/PaddlePaddle/PaddleSpeech/stargazers"><img src="https://img.shields.io/github/stars/PaddlePaddle/PaddleSpeech?color=ccf"></a>
<a href="=https://pypi.org/project/paddlespeech/"><img src="https://img.shields.io/pypi/dm/PaddleSpeech"></a>
<a href="=https://pypi.org/project/paddlespeech/"><img src="https://static.pepy.tech/badge/paddlespeech"></a>
<a href="https://huggingface.co/spaces"><img src="https://img.shields.io/badge/%F0%9F%A4%97%20Hugging%20Face-Spaces-blue"></a>
</p>
<div align="center">
<h3>
<a href="#quick-start"> Quick Start </a>
| <a href="#quick-start-server"> Quick Start Server </a>
| <a href="#quick-start-streaming-server"> Quick Start Streaming Server</a>
</br>
<a href="#documents"> Documents </a>
| <a href="#model-list"> Models List </a>
</h3>
</div>
------------------------------------------------------------------------------------
<div align="center">
<h3>
<a href="#quick-start"> 快速开始 </a>
| <a href="#quick-start-server"> 快速使用服务 </a>
| <a href="#quick-start-streaming-server"> 快速使用流式服务 </a>
| <a href="#documents"> 教程文档 </a>
| <a href="#model-list"> 模型列表 </a>
</div>
<!---
from https://github.com/18F/open-source-guide/blob/18f-pages/pages/making-readmes-readable.md
@ -31,6 +50,8 @@ from https://github.com/18F/open-source-guide/blob/18f-pages/pages/making-readme
4.What is the goal of this project?
-->
**PaddleSpeech** 是基于飞桨 [PaddlePaddle](https://github.com/PaddlePaddle/Paddle) 的语音方向的开源模型库,用于语音和音频中的各种关键任务的开发,包含大量基于深度学习前沿和有影响力的模型,一些典型的应用示例如下:
##### 语音识别
@ -57,7 +78,6 @@ from https://github.com/18F/open-source-guide/blob/18f-pages/pages/making-readme
</td>
<td>我认为跑步最重要的就是给我带来了身体健康。</td>
</tr>
</tbody>
</table>
@ -143,19 +163,6 @@ from https://github.com/18F/open-source-guide/blob/18f-pages/pages/making-readme
</div>
### ⭐ 应用案例
- **[PaddleBoBo](https://github.com/JiehangXie/PaddleBoBo): 使用 PaddleSpeech 的语音合成模块生成虚拟人的声音。**
<div align="center"><a href="https://www.bilibili.com/video/BV1cL411V71o?share_source=copy_web"><img src="https://ai-studio-static-online.cdn.bcebos.com/06fd746ab32042f398fb6f33f873e6869e846fe63c214596ae37860fe8103720" / width="500px"></a></div>
- [PaddleSpeech 示例视频](https://paddlespeech.readthedocs.io/en/latest/demo_video.html)
- **[VTuberTalk](https://github.com/jerryuhoo/VTuberTalk): 使用 PaddleSpeech 的语音合成和语音识别从视频中克隆人声。**
<div align="center">
<img src="https://raw.githubusercontent.com/jerryuhoo/VTuberTalk/main/gui/gui.png" width = "500px" />
</div>
### 🔥 热门活动
@ -164,27 +171,32 @@ from https://github.com/18F/open-source-guide/blob/18f-pages/pages/making-readme
4 日直播课: 深度解读 PaddleSpeech 语音技术!
**直播回放与课件资料: https://aistudio.baidu.com/aistudio/education/group/info/25130**
### 特性
本项目采用了易用、高效、灵活以及可扩展的实现,旨在为工业应用、学术研究提供更好的支持,实现的功能包含训练、推断以及测试模块,以及部署过程,主要包括
- 📦 **易用性**: 安装门槛低,可使用 [CLI](#quick-start) 快速开始。
- 🏆 **对标 SoTA**: 提供了高速、轻量级模型,且借鉴了最前沿的技术。
- 💯 **基于规则的中文前端**: 我们的前端包含文本正则化和字音转换(G2P)。此外,我们使用自定义语言规则来适应中文语境。
- **多种工业界以及学术界主流功能支持**:
- 🛎️ 典型音频任务: 本工具包提供了音频任务如音频分类、语音翻译、自动语音识别、文本转语音、语音合成等任务的实现。
- 🔬 主流模型及数据集: 本工具包实现了参与整条语音任务流水线的各个模块,并且采用了主流数据集如 LibriSpeech、LJSpeech、AIShell、CSMSC,详情请见 [模型列表](#model-list)。
- 🧩 级联模型应用: 作为传统语音任务的扩展,我们结合了自然语言处理、计算机视觉等任务,实现更接近实际需求的产业级应用。
### 近期更新
<!---
2021.12.14: We would like to have an online courses to introduce basics and research of speech, as well as code practice with `paddlespeech`. Please pay attention to our [Calendar](https://www.paddlepaddle.org.cn/live).
--->
- 👏🏻 2022.04.28: PaddleSpeech Streaming Server 上线! 覆盖了语音识别和语音合成。
- 👏🏻 2022.03.28: PaddleSpeech Server 上线! 覆盖了声音分类、语音识别、以及语音合成。
- 👏🏻 2022.03.28: PaddleSpeech CLI 上线声纹验证。
- 🤗 2021.12.14: Our PaddleSpeech [ASR](https://huggingface.co/spaces/KPatrick/PaddleSpeechASR) and [TTS](https://huggingface.co/spaces/KPatrick/PaddleSpeechTTS) Demos on Hugging Face Spaces are available!
- 👏🏻 2021.12.10: PaddleSpeech CLI 上线!覆盖了声音分类、语音识别、语音翻译(英译中)以及语音合成。
### 特性
本项目采用了易用、高效、灵活以及可扩展的实现,旨在为工业应用、学术研究提供更好的支持,实现的功能包含训练、推断以及测试模块,以及部署过程,主要包括
- 📦 **易用性**: 安装门槛低,可使用 [CLI](#quick-start) 快速开始。
- 🏆 **对标 SoTA**: 提供了高速、轻量级模型,且借鉴了最前沿的技术。
- 💯 **基于规则的中文前端**: 我们的前端包含文本正则化和字音转换(G2P)。此外,我们使用自定义语言规则来适应中文语境。
- **多种工业界以及学术界主流功能支持**:
- 🛎️ 典型音频任务: 本工具包提供了音频任务如音频分类、语音翻译、自动语音识别、文本转语音、语音合成等任务的实现。
- 🔬 主流模型及数据集: 本工具包实现了参与整条语音任务流水线的各个模块,并且采用了主流数据集如 LibriSpeech、LJSpeech、AIShell、CSMSC,详情请见 [模型列表](#model-list)。
- 🧩 级联模型应用: 作为传统语音任务的扩展,我们结合了自然语言处理、计算机视觉等任务,实现更接近实际需求的产业级应用。
### 技术交流群
微信扫描二维码(好友申请通过后回复【语音】)加入官方交流群,获得更高效的问题答疑,与各行各业开发者充分交流,期待您的加入。
@ -192,11 +204,13 @@ from https://github.com/18F/open-source-guide/blob/18f-pages/pages/making-readme
<img src="https://raw.githubusercontent.com/yt605155624/lanceTest/main/images/wechat_4.jpg" width = "300" />
</div>
## 安装
我们强烈建议用户在 **Linux** 环境下,*3.7* 以上版本的 *python* 上安装 PaddleSpeech。
目前为止,**Linux** 支持声音分类、语音识别、语音合成和语音翻译四种功能,**Mac OSX、 Windows** 下暂不支持语音翻译功能。 想了解具体安装细节,可以参考[安装文档](./docs/source/install_cn.md)。
## 快速开始
安装完成后,开发者可以通过命令行快速开始,改变 `--input` 可以尝试用自己的音频或文本测试。
@ -232,7 +246,7 @@ paddlespeech tts --input "你好,欢迎使用百度飞桨深度学习框架!
**批处理**
```
echo -e "1 欢迎光临。\n2 谢谢惠顾。" | paddlespeech tts
```
```
**Shell管道**
ASR + Punc:
@ -269,6 +283,38 @@ paddlespeech_client cls --server_ip 127.0.0.1 --port 8090 --input input.wav
更多服务相关的命令行使用信息,请参考 [demos](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/demos/speech_server)
<a name="quickstartstreamingserver"></a>
## 快速使用流式服务
开发者可以尝试[流式ASR](./demos/streaming_asr_server/README.md)和 [流式TTS](./demos/streaming_tts_server/README.md)服务.
**启动流式ASR服务**
```
paddlespeech_server start --config_file ./demos/streaming_asr_server/conf/application.yaml
```
**访问流式ASR服务**
```
paddlespeech_client asr_online --server_ip 127.0.0.1 --port 8090 --input input_16k.wav
```
**启动流式TTS服务**
```
paddlespeech_server start --config_file ./demos/streaming_tts_server/conf/tts_online_application.yaml
```
**访问流式TTS服务**
```
paddlespeech_client tts_online --server_ip 127.0.0.1 --port 8092 --protocol http --input "您好,欢迎使用百度飞桨语音合成服务。" --output output.wav
```
更多信息参看: [流式 ASR](./demos/streaming_asr_server/README.md) 和 [流式 TTS](./demos/streaming_tts_server/README.md)
<a name="modulelist"></a>
## 模型列表
PaddleSpeech 支持很多主流的模型,并提供了预训练模型,详情请见[模型列表](./docs/source/released_model.md)。
@ -582,6 +628,21 @@ PaddleSpeech 的 **语音合成** 主要包含三个模块:文本前端、声
语音合成模块最初被称为 [Parakeet](https://github.com/PaddlePaddle/Parakeet),现在与此仓库合并。如果您对该任务的学术研究感兴趣,请参阅 [TTS 研究概述](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/docs/source/tts#overview)。此外,[模型介绍](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/tts/models_introduction.md) 是了解语音合成流程的一个很好的指南。
## ⭐ 应用案例
- **[PaddleBoBo](https://github.com/JiehangXie/PaddleBoBo): 使用 PaddleSpeech 的语音合成模块生成虚拟人的声音。**
<div align="center"><a href="https://www.bilibili.com/video/BV1cL411V71o?share_source=copy_web"><img src="https://ai-studio-static-online.cdn.bcebos.com/06fd746ab32042f398fb6f33f873e6869e846fe63c214596ae37860fe8103720" / width="500px"></a></div>
- [PaddleSpeech 示例视频](https://paddlespeech.readthedocs.io/en/latest/demo_video.html)
- **[VTuberTalk](https://github.com/jerryuhoo/VTuberTalk): 使用 PaddleSpeech 的语音合成和语音识别从视频中克隆人声。**
<div align="center">
<img src="https://raw.githubusercontent.com/jerryuhoo/VTuberTalk/main/gui/gui.png" width = "500px" />
</div>
## 引用
要引用 PaddleSpeech 进行研究,请使用以下格式进行引用。
@ -658,6 +719,7 @@ PaddleSpeech 的 **语音合成** 主要包含三个模块:文本前端、声
- 非常感谢 [jerryuhoo](https://github.com/jerryuhoo)/[VTuberTalk](https://github.com/jerryuhoo/VTuberTalk) 基于 PaddleSpeech 的 TTS GUI 界面和基于 ASR 制作数据集的相关代码。
此外,PaddleSpeech 依赖于许多开源存储库。有关更多信息,请参阅 [references](./docs/source/reference.md)。
## License

@ -24,13 +24,13 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
- Command Line(Recommended)
```bash
# Chinese
paddlespeech asr --input ./zh.wav
paddlespeech asr --input ./zh.wav -v
# English
paddlespeech asr --model transformer_librispeech --lang en --input ./en.wav
paddlespeech asr --model transformer_librispeech --lang en --input ./en.wav -v
# Chinese ASR + Punctuation Restoration
paddlespeech asr --input ./zh.wav | paddlespeech text --task punc
paddlespeech asr --input ./zh.wav -v | paddlespeech text --task punc -v
```
(It doesn't matter if package `paddlespeech-ctcdecoders` is not found, this package is optional.)
(If you don't want to see the log information, you can remove "-v". Also, it doesn't matter if the package `paddlespeech-ctcdecoders` is not found; this package is optional.)
Usage:
```bash
@ -45,6 +45,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
- `ckpt_path`: Model checkpoint. Use pretrained model when it is None. Default: `None`.
- `yes`: No additional argument is required. Once this flag is set, every prompt from the program is accepted by default, including automatic conversion of the audio sample rate. Default: `False`.
- `device`: Choose the device to execute model inference on. Default: the default device of paddlepaddle in the current environment.
- `verbose`: Show the log information. (A combined example follows this list.)
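A combined invocation that exercises several of the arguments above; this is only a sketch, assuming each listed argument maps to a `--<name>` flag, with the model name taken from the pretrained model list below:

```bash
# hypothetical combination of the flags documented above
paddlespeech asr --model conformer_wenetspeech --lang zh --input ./zh.wav --device cpu -v
```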
Output:
```bash
@ -84,8 +85,12 @@ Here is a list of pretrained models released by PaddleSpeech that can be used by
| Model | Language | Sample Rate
| :--- | :---: | :---: |
| conformer_wenetspeech| zh| 16k
| transformer_librispeech| en| 16k
| conformer_wenetspeech | zh | 16k
| conformer_online_multicn | zh | 16k
| conformer_aishell | zh | 16k
| conformer_online_aishell | zh | 16k
| transformer_librispeech | en | 16k
| deepspeech2online_wenetspeech | zh | 16k
| deepspeech2offline_aishell| zh| 16k
| deepspeech2online_aishell | zh | 16k
|deepspeech2offline_librispeech|en| 16k
| deepspeech2offline_librispeech | en | 16k
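Any model in this table can be selected with the `--model` argument shown above; for example, a sketch that swaps in the streaming WenetSpeech model:

```bash
# same CLI as above, different pretrained model
paddlespeech asr --model deepspeech2online_wenetspeech --lang zh --input ./zh.wav -v
```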

@ -22,13 +22,13 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
- 命令行 (推荐使用)
```bash
# 中文
paddlespeech asr --input ./zh.wav
paddlespeech asr --input ./zh.wav -v
# 英文
paddlespeech asr --model transformer_librispeech --lang en --input ./en.wav
paddlespeech asr --model transformer_librispeech --lang en --input ./en.wav -v
# 中文 + 标点恢复
paddlespeech asr --input ./zh.wav | paddlespeech text --task punc
paddlespeech asr --input ./zh.wav -v | paddlespeech text --task punc -v
```
(如果显示 `paddlespeech-ctcdecoders` 这个 python 包没有找到的 Error,没有关系,这个包是非必须的。)
(如果不想显示 log 信息,可以不使用 "-v",另外如果显示 `paddlespeech-ctcdecoders` 这个 python 包没有找到的 Error,没有关系,这个包是非必须的。)
使用方法:
```bash
@ -43,6 +43,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
- `ckpt_path`:模型参数文件,若不设置则下载预训练模型使用,默认值:`None`。
- `yes`;不需要设置额外的参数,一旦设置了该参数,说明你默认同意程序的所有请求,其中包括自动转换输入音频的采样率。默认值:`False`。
- `device`:执行预测的设备,默认值:当前系统下 paddlepaddle 的默认 device。
- `verbose`: 如果使用,显示 logger 信息。
输出:
```bash
@ -82,7 +83,11 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
| 模型 | 语言 | 采样率
| :--- | :---: | :---: |
| conformer_wenetspeech | zh | 16k
| conformer_online_multicn | zh | 16k
| conformer_aishell | zh | 16k
| conformer_online_aishell | zh | 16k
| transformer_librispeech | en | 16k
| deepspeech2online_wenetspeech | zh | 16k
| deepspeech2offline_aishell| zh| 16k
| deepspeech2online_aishell | zh | 16k
| deepspeech2offline_librispeech | en | 16k

@ -5,6 +5,7 @@
## Introduction
This demo is an implementation of starting the streaming speech service and accessing the service. It can be achieved with a single command using `paddlespeech_server` and `paddlespeech_client` or a few lines of code in python.
The streaming ASR server only supports the `websocket` protocol; it does not support `http`.
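As an overview, the whole flow is two commands, both taken from the sections below:

```bash
# start the streaming ASR server (run from the PaddleSpeech/demos/streaming_asr_server directory)
paddlespeech_server start --config_file ./conf/ws_conformer_application.yaml
# in another shell, stream a 16 kHz wav file to it over websocket
paddlespeech_client asr_online --server_ip 127.0.0.1 --port 8090 --input ./zh.wav
```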
## Usage
### 1. Installation
@ -30,7 +31,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav
- Command Line (Recommended)
```bash
# start the service
# in PaddleSpeech/demos/streaming_asr_server start the service
paddlespeech_server start --config_file ./conf/ws_conformer_application.yaml
```
@ -110,11 +111,12 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav
- Python API
```python
# in PaddleSpeech/demos/streaming_asr_server directory
from paddlespeech.server.bin.paddlespeech_server import ServerExecutor
server_executor = ServerExecutor()
server_executor(
config_file="./conf/ws_conformer_application.yaml",
config_file="./conf/ws_conformer_application.yaml",
log_file="./log/paddlespeech.log")
```
@ -185,6 +187,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav
### 4. ASR Client Usage
**Note:** The response time will be slightly longer when using the client for the first time
- Command Line (Recommended)
```
@ -203,6 +206,8 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav
- `sample_rate`: Audio sampling rate, default: 16000.
- `lang`: Language. Default: "zh_cn".
- `audio_format`: Audio format. Default: "wav".
- `punc.server_ip`: Punctuation server IP address. Default: None.
- `punc.server_port`: Punctuation server port. Default: None. (See the sketch after this list for pairing the client with a punctuation server.)
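If a punctuation server has been started separately, the two `punc.*` arguments point the streaming client at it. A hypothetical combination (setting up the punctuation service itself is not covered in this demo; port 8091 mirrors the default used by the websocket client script):

```bash
# assumes a punctuation service is already listening on 127.0.0.1:8091
paddlespeech_client asr_online --server_ip 127.0.0.1 --port 8090 \
    --punc.server_ip 127.0.0.1 --punc.server_port 8091 --input ./zh.wav
```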
Output:
```bash
@ -274,10 +279,9 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav
- Python API
```python
from paddlespeech.server.bin.paddlespeech_client import ASRClientExecutor
import json
from paddlespeech.server.bin.paddlespeech_client import ASROnlineClientExecutor
asrclient_executor = ASRClientExecutor()
asrclient_executor = ASROnlineClientExecutor()
res = asrclient_executor(
input="./zh.wav",
server_ip="127.0.0.1",
@ -285,7 +289,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav
sample_rate=16000,
lang="zh_cn",
audio_format="wav")
print(res.json())
print(res)
```
Output:
@ -351,5 +355,4 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav
[2022-04-21 15:59:08,016] [ INFO] - receive msg={'asr_results': '我认为跑步最重要的就是给我带来了身体健康'}
[2022-04-21 15:59:08,024] [ INFO] - receive msg={'asr_results': '我认为跑步最重要的就是给我带来了身体健康'}
[2022-04-21 15:59:12,883] [ INFO] - final receive msg={'status': 'ok', 'signal': 'finished', 'asr_results': '我认为跑步最重要的就是给我带来了身体健康'}
[2022-04-21 15:59:12,884] [ INFO] - 我认为跑步最重要的就是给我带来了身体健康
```
```

@ -5,18 +5,26 @@
## 介绍
这个demo是一个启动流式语音服务和访问服务的实现。 它可以通过使用`paddlespeech_server` 和 `paddlespeech_client`的单个命令或 python 的几行代码来实现。
**流式语音识别服务只支持 `websocket` 协议,不支持 `http` 协议。**
## 使用方法
### 1. 安装
请看 [安装文档](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install.md).
安装 PaddleSpeech 的详细过程请看 [安装文档](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install.md)
推荐使用 **paddlepaddle 2.2.1** 或以上版本。
你可以从 mediumhard 三中方式中选择一种方式安装 PaddleSpeech。
你可以从 medium,hard 两种方式中选择一种方式安装 PaddleSpeech。
### 2. 准备配置文件
配置文件可参见 `conf/ws_application.yaml``conf/ws_conformer_application.yaml`
目前服务集成的模型有: DeepSpeech2和conformer模型。
流式ASR的服务启动脚本和服务测试脚本存放在 `PaddleSpeech/demos/streaming_asr_server` 目录。
下载好 `PaddleSpeech` 之后,进入到 `PaddleSpeech/demos/streaming_asr_server` 目录。
配置文件可参见该目录下的 `conf/ws_application.yaml` 和 `conf/ws_conformer_application.yaml` 。
目前服务集成的模型有: DeepSpeech2 和 conformer 模型,对应的配置文件如下:
* DeepSpeech: `conf/ws_application.yaml`
* conformer: `conf/ws_conformer_application.yaml`
这个 ASR client 的输入应该是一个 WAV 文件(`.wav`),并且采样率必须与模型的采样率相同。
@ -30,7 +38,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav
- 命令行 (推荐使用)
```bash
# 启动服务
# 在 PaddleSpeech/demos/streaming_asr_server 目录启动服务
paddlespeech_server start --config_file ./conf/ws_conformer_application.yaml
```
@ -110,6 +118,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav
- Python API
```python
# 在 PaddleSpeech/demos/streaming_asr_server 目录
from paddlespeech.server.bin.paddlespeech_server import ServerExecutor
server_executor = ServerExecutor()
@ -184,11 +193,11 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav
```
### 4. ASR 客户端使用方法
**注意:** 初次使用客户端时响应时间会略长
- 命令行 (推荐使用)
```
paddlespeech_client asr_online --server_ip 127.0.0.1 --port 8090 --input ./zh.wav
```
使用帮助:
@ -204,6 +213,8 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav
- `sample_rate`: 音频采样率默认值16000。
- `lang`: 模型语言默认值zh_cn。
- `audio_format`: 音频格式默认值wav。
- `punc.server_ip` 标点预测服务的ip。默认是None。
- `punc.server_port` 标点预测服务的端口port。默认是None。
输出:
@ -276,7 +287,6 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav
- Python API
```python
from paddlespeech.server.bin.paddlespeech_client import ASROnlineClientExecutor
import json
asrclient_executor = ASROnlineClientExecutor()
res = asrclient_executor(
@ -286,7 +296,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav
sample_rate=16000,
lang="zh_cn",
audio_format="wav")
print(res.json())
print(res)
```
输出:
@ -352,5 +362,4 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav
[2022-04-21 15:59:08,016] [ INFO] - receive msg={'asr_results': '我认为跑步最重要的就是给我带来了身体健康'}
[2022-04-21 15:59:08,024] [ INFO] - receive msg={'asr_results': '我认为跑步最重要的就是给我带来了身体健康'}
[2022-04-21 15:59:12,883] [ INFO] - final receive msg={'status': 'ok', 'signal': 'finished', 'asr_results': '我认为跑步最重要的就是给我带来了身体健康'}
[2022-04-21 15:59:12,884] [ INFO] - 我认为跑步最重要的就是给我带来了身体健康
```

@ -0,0 +1,45 @@
# This is the parameter configuration file for PaddleSpeech Serving.
#################################################################################
# SERVER SETTING #
#################################################################################
host: 0.0.0.0
port: 8090
# The task format in the engine_list is: <speech task>_<engine type>
# task choices = ['asr_online']
# protocol = ['websocket'] (only one can be selected).
# websocket only supports the online engine type.
protocol: 'websocket'
engine_list: ['asr_online']
#################################################################################
# ENGINE CONFIG #
#################################################################################
################################### ASR #########################################
################### speech task: asr; engine_type: online #######################
asr_online:
model_type: 'conformer_online_multicn'
am_model: # the pdmodel file of am static model [optional]
am_params: # the pdiparams file of am static model [optional]
lang: 'zh'
sample_rate: 16000
cfg_path:
decode_method:
force_yes: True
device: # cpu or gpu:id
am_predictor_conf:
device: # set 'gpu:id' or 'cpu'
switch_ir_optim: True
glog_info: False # True -> print glog
summary: True # False -> do not show predictor config
chunk_buffer_conf:
window_n: 7 # frame
shift_n: 4 # frame
window_ms: 25 # ms
shift_ms: 10 # ms
sample_rate: 16000
sample_width: 2

@ -7,8 +7,8 @@ host: 0.0.0.0
port: 8090
# The task format in the engin_list is: <speech task>_<engine type>
# task choices = ['asr_online', 'tts_online']
# protocol = ['websocket', 'http'] (only one can be selected).
# task choices = ['asr_online']
# protocol = ['websocket'] (only one can be selected).
# websocket only support online engine type.
protocol: 'websocket'
engine_list: ['asr_online']

@ -7,8 +7,8 @@ host: 0.0.0.0
port: 8090
# The task format in the engin_list is: <speech task>_<engine type>
# task choices = ['asr_online', 'tts_online']
# protocol = ['websocket', 'http'] (only one can be selected).
# task choices = ['asr_online']
# protocol = ['websocket'] (only one can be selected).
# websocket only support online engine type.
protocol: 'websocket'
engine_list: ['asr_online']

@ -93,7 +93,7 @@
function parseResult(data) {
var data = JSON.parse(data)
var result = data.asr_results
var result = data.result
console.log(result)
$("#resultPanel").html(result)
}
@ -152,4 +152,4 @@
</script>
</body>
</html>
</html>

@ -20,19 +20,23 @@ import logging
import os
from paddlespeech.cli.log import logger
from paddlespeech.server.utils.audio_handler import ASRAudioHandler
from paddlespeech.server.utils.audio_handler import ASRWsAudioHandler
def main(args):
logger.info("asr websocket client start")
handler = ASRAudioHandler("127.0.0.1", 8090)
handler = ASRWsAudioHandler(
args.server_ip,
args.port,
punc_server_ip=args.punc_server_ip,
punc_server_port=args.punc_server_port)
loop = asyncio.get_event_loop()
# support to process single audio file
if args.wavfile and os.path.exists(args.wavfile):
logger.info(f"start to process the wavscp: {args.wavfile}")
result = loop.run_until_complete(handler.run(args.wavfile))
result = result["asr_results"]
result = result["result"]
logger.info(f"asr websocket client finished : {result}")
# support to process batch audios from wav.scp
@ -43,13 +47,29 @@ def main(args):
for line in f:
utt_name, utt_path = line.strip().split()
result = loop.run_until_complete(handler.run(utt_path))
result = result["asr_results"]
result = result["result"]
w.write(f"{utt_name} {result}\n")
if __name__ == "__main__":
logger.info("Start to do streaming asr client")
parser = argparse.ArgumentParser()
parser.add_argument(
'--server_ip', type=str, default='127.0.0.1', help='server ip')
parser.add_argument('--port', type=int, default=8090, help='server port')
parser.add_argument(
'--punc.server_ip',
type=str,
default=None,
dest="punc_server_ip",
help='Punctuation server ip')
parser.add_argument(
'--punc.port',
type=int,
default=8091,
dest="punc_server_port",
help='Punctuation server port')
parser.add_argument(
"--wavfile",
action="store",

@ -15,19 +15,28 @@ You can choose one way from medium and hard to install paddlespeech.
### 2. Prepare config File
The configuration file can be found in `conf/tts_online_application.yaml`
Among them, `protocol` indicates the network protocol used by the streaming TTS service. Currently, both http and websocket are supported.
`engine_list` indicates the speech engine that will be included in the service to be started, in the format of `<speech task>_<engine type>`.
This demo mainly introduces the streaming speech synthesis service, so the speech task should be set to `tts`.
Currently, the engine type supports two forms: **online** and **online-onnx**. `online` indicates an engine that uses python for dynamic graph inference; `online-onnx` indicates an engine that uses onnxruntime for inference. The inference speed of online-onnx is faster.
Streaming TTS AM model support: **fastspeech2 and fastspeech2_cnndecoder**; Voc model support: **hifigan and mb_melgan**
The configuration file can be found in `conf/tts_online_application.yaml`, and each option below can be edited there directly (a protocol-switching sketch follows this list).
- `protocol` indicates the network protocol used by the streaming TTS service. Currently, both **http and websocket** are supported.
- `engine_list` indicates the speech engine that will be included in the service to be started, in the format of `<speech task>_<engine type>`.
- This demo mainly introduces the streaming speech synthesis service, so the speech task should be set to `tts`.
- The engine type supports two forms: **online** and **online-onnx**. `online` indicates an engine that uses python for dynamic graph inference; `online-onnx` indicates an engine that uses onnxruntime for inference. The inference speed of online-onnx is faster.
- The streaming TTS engine supports the AM models **fastspeech2 and fastspeech2_cnndecoder** and the Voc models **hifigan and mb_melgan**.
- In streaming am inference, one chunk of data is inferred at a time to achieve a streaming effect. Here, `am_block` indicates the number of valid frames in the chunk, and `am_pad` indicates the number of frames added before and after am_block in a chunk. am_pad exists to eliminate errors caused by streaming inference, so that streaming does not degrade the quality of the synthesized audio.
- fastspeech2 does not support streaming am inference, so am_pad and am_block have no effect on it.
- fastspeech2_cnndecoder supports streaming inference. When am_pad=12, streaming inference synthesized audio is consistent with non-streaming synthesized audio.
- In streaming voc inference, one chunk of data is inferred at a time to achieve a streaming effect. Here, `voc_block` indicates the number of valid frames in the chunk, and `voc_pad` indicates the number of frames added before and after voc_block in a chunk. voc_pad exists to eliminate errors caused by streaming inference, so that streaming does not degrade the quality of the synthesized audio.
- Both hifigan and mb_melgan support streaming voc inference.
- When the voc model is mb_melgan and voc_pad=14, the streaming synthesized audio is consistent with the non-streaming synthesized audio; voc_pad can be set as low as 7 without audible artifacts, and below 7 the synthesized audio sounds abnormal.
- When the voc model is hifigan and voc_pad=20, the streaming synthesized audio is consistent with the non-streaming synthesized audio; with voc_pad=14 the synthesized audio has no audible artifacts.
- Inference speed: mb_melgan > hifigan; Audio quality: mb_melgan < hifigan
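For instance, switching the service from the default `http` protocol to `websocket` is a one-line change to the configuration file described above. A sketch assuming the default key layout of `conf/tts_online_application.yaml`:

```bash
# flip the protocol, then start the server with the same config file
sed -i "s/^protocol: 'http'/protocol: 'websocket'/" ./conf/tts_online_application.yaml
paddlespeech_server start --config_file ./conf/tts_online_application.yaml
```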
### 3. Server Usage
### 3. Streaming speech synthesis server and client using http protocol
#### 3.1 Server Usage
- Command Line (Recommended)
Start the service (the configuration file uses http by default):
```bash
# start the service
paddlespeech_server start --config_file ./conf/tts_online_application.yaml
```
@ -67,7 +76,7 @@ Streaming TTS AM model support: **fastspeech2 and fastspeech2_cnndecoder**; Voc
log_file="./log/paddlespeech.log")
```
Output:
Output:
```bash
[2022-04-24 21:00:16,934] [ INFO] - The first response time of the 0 warm up: 1.268730878829956 s
[2022-04-24 21:00:17,046] [ INFO] - The first response time of the 1 warm up: 0.11168622970581055 s
@ -85,17 +94,15 @@ Streaming TTS AM model support: **fastspeech2 and fastspeech2_cnndecoder**; Voc
```
### 4. Streaming TTS client Usage
#### 3.2 Streaming TTS client Usage
- Command Line (Recommended)
```bash
# Access http streaming TTS service
paddlespeech_client tts_online --server_ip 127.0.0.1 --port 8092 --input "您好,欢迎使用百度飞桨语音合成服务。" --output output.wav
Access http streaming TTS service:
# Access websocket streaming TTS service
paddlespeech_client tts_online --server_ip 127.0.0.1 --port 8092 --protocol websocket --input "您好,欢迎使用百度飞桨语音合成服务。" --output output.wav
```bash
paddlespeech_client tts_online --server_ip 127.0.0.1 --port 8092 --protocol http --input "您好,欢迎使用百度飞桨语音合成服务。" --output output.wav
```
Usage:
```bash
@ -113,7 +120,6 @@ Streaming TTS AM model support: **fastspeech2 and fastspeech2_cnndecoder**; Voc
- `sample_rate`: Sampling rate, choices: [0, 8000, 16000], the default is the same as the model. Default: 0
- `output`: Output wave filepath. Default: None, which means not to save the audio to the local.
- `play`: Whether to play the audio while it is being synthesized. Default: False (do not play). **Playing audio requires the pyaudio library.** (A combined example follows this list.)
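Putting several of the options above together; a sketch in which every value except the output path is the documented default:

```bash
# documented defaults spelled out explicitly
paddlespeech_client tts_online --server_ip 127.0.0.1 --port 8092 --protocol http \
    --spk_id 0 --speed 1.0 --volume 1.0 \
    --input "您好,欢迎使用百度飞桨语音合成服务。" --output ./output.wav
```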
Output:
```bash
@ -156,8 +162,144 @@ Streaming TTS AM model support: **fastspeech2 and fastspeech2_cnndecoder**; Voc
[2022-04-24 21:11:16,802] [ INFO] - 音频时长3.825 s
[2022-04-24 21:11:16,802] [ INFO] - RTF: 0.7846773683635238
[2022-04-24 21:11:16,837] [ INFO] - 音频保存至:./output.wav
```
### 4. Streaming speech synthesis server and client using websocket protocol
#### 4.1 Server Usage
- Command Line (Recommended)
First modify the configuration file `conf/tts_online_application.yaml` and **set `protocol` to `websocket`**.
Start the service:
```bash
paddlespeech_server start --config_file ./conf/tts_online_application.yaml
```
Usage:
```bash
paddlespeech_server start --help
```
Arguments:
- `config_file`: yaml file of the app, default: ./conf/tts_online_application.yaml
- `log_file`: log file. Default: ./log/paddlespeech.log
Output:
```bash
[2022-04-27 10:18:09,107] [ INFO] - The first response time of the 0 warm up: 1.1551103591918945 s
[2022-04-27 10:18:09,219] [ INFO] - The first response time of the 1 warm up: 0.11204338073730469 s
[2022-04-27 10:18:09,324] [ INFO] - The first response time of the 2 warm up: 0.1051797866821289 s
[2022-04-27 10:18:09,325] [ INFO] - **********************************************************************
INFO: Started server process [17600]
[2022-04-27 10:18:09] [INFO] [server.py:75] Started server process [17600]
INFO: Waiting for application startup.
[2022-04-27 10:18:09] [INFO] [on.py:45] Waiting for application startup.
INFO: Application startup complete.
[2022-04-27 10:18:09] [INFO] [on.py:59] Application startup complete.
INFO: Uvicorn running on http://127.0.0.1:8092 (Press CTRL+C to quit)
[2022-04-27 10:18:09] [INFO] [server.py:211] Uvicorn running on http://127.0.0.1:8092 (Press CTRL+C to quit)
```
- Python API
```python
from paddlespeech.server.bin.paddlespeech_server import ServerExecutor
server_executor = ServerExecutor()
server_executor(
config_file="./conf/tts_online_application.yaml",
log_file="./log/paddlespeech.log")
```
Output:
```bash
[2022-04-27 10:20:16,660] [ INFO] - The first response time of the 0 warm up: 1.0945196151733398 s
[2022-04-27 10:20:16,773] [ INFO] - The first response time of the 1 warm up: 0.11222052574157715 s
[2022-04-27 10:20:16,878] [ INFO] - The first response time of the 2 warm up: 0.10494542121887207 s
[2022-04-27 10:20:16,878] [ INFO] - **********************************************************************
INFO: Started server process [23466]
[2022-04-27 10:20:16] [INFO] [server.py:75] Started server process [23466]
INFO: Waiting for application startup.
[2022-04-27 10:20:16] [INFO] [on.py:45] Waiting for application startup.
INFO: Application startup complete.
[2022-04-27 10:20:16] [INFO] [on.py:59] Application startup complete.
INFO: Uvicorn running on http://127.0.0.1:8092 (Press CTRL+C to quit)
[2022-04-27 10:20:16] [INFO] [server.py:211] Uvicorn running on http://127.0.0.1:8092 (Press CTRL+C to quit)
```
#### 4.2 Streaming TTS client Usage
- Command Line (Recommended)
Access websocket streaming TTS service:
```bash
paddlespeech_client tts_online --server_ip 127.0.0.1 --port 8092 --protocol websocket --input "您好,欢迎使用百度飞桨语音合成服务。" --output output.wav
```
Usage:
```bash
paddlespeech_client tts_online --help
```
Arguments:
- `server_ip`: server IP address. Default: 127.0.0.1
- `port`: server port. Default: 8092
- `protocol`: Service protocol, choices: [http, websocket], default: http.
- `input`: (required): Input text to generate.
- `spk_id`: Speaker id for multi-speaker text to speech. Default: 0
- `speed`: Audio speed, the value should be set between 0 and 3. Default: 1.0
- `volume`: Audio volume, the value should be set between 0 and 3. Default: 1.0
- `sample_rate`: Sampling rate, choices: [0, 8000, 16000], the default is the same as the model. Default: 0
- `output`: Output wave filepath. Default: None, which means not to save the audio to the local.
- `play`: Whether to play audio, play while synthesizing, default value: False, which means not playing. **Playing audio needs to rely on the pyaudio library**.
Output:
```bash
[2022-04-27 10:21:04,262] [ INFO] - tts websocket client start
[2022-04-27 10:21:04,496] [ INFO] - 句子:您好,欢迎使用百度飞桨语音合成服务。
[2022-04-27 10:21:04,496] [ INFO] - 首包响应0.2124948501586914 s
[2022-04-27 10:21:07,483] [ INFO] - 尾包响应3.199106454849243 s
[2022-04-27 10:21:07,484] [ INFO] - 音频时长3.825 s
[2022-04-27 10:21:07,484] [ INFO] - RTF: 0.8363677006141812
[2022-04-27 10:21:07,516] [ INFO] - 音频保存至output.wav
```
- Python API
```python
from paddlespeech.server.bin.paddlespeech_client import TTSOnlineClientExecutor
import json
executor = TTSOnlineClientExecutor()
executor(
input="您好,欢迎使用百度飞桨语音合成服务。",
server_ip="127.0.0.1",
port=8092,
protocol="websocket",
spk_id=0,
speed=1.0,
volume=1.0,
sample_rate=0,
output="./output.wav",
play=False)
```
Output:
```bash
[2022-04-27 10:22:48,852] [ INFO] - tts websocket client start
[2022-04-27 10:22:49,080] [ INFO] - 句子:您好,欢迎使用百度飞桨语音合成服务。
[2022-04-27 10:22:49,080] [ INFO] - 首包响应0.21017956733703613 s
[2022-04-27 10:22:52,100] [ INFO] - 尾包响应3.2304444313049316 s
[2022-04-27 10:22:52,101] [ INFO] - 音频时长3.825 s
[2022-04-27 10:22:52,101] [ INFO] - RTF: 0.8445606356352762
[2022-04-27 10:22:52,134] [ INFO] - 音频保存至:./output.wav
```

@ -1,4 +1,4 @@
([简体中文](./README_cn.md)|English)
(简体中文|[English](./README.md))
# 流式语音合成服务
@ -16,17 +16,26 @@
### 2. 准备配置文件
配置文件可参见 `conf/tts_online_application.yaml`
其中,`protocol`表示该流式TTS服务使用的网络协议目前支持 http 和 websocket 两种。
其中,`engine_list`表示即将启动的服务将会包含的语音引擎,格式为 <语音任务>_<引擎类型>。
该demo主要介绍流式语音合成服务因此语音任务应设置为tts。
目前引擎类型支持两种形式:**online** 表示使用python进行动态图推理的引擎**online-onnx** 表示使用onnxruntime进行推理的引擎。其中online-onnx的推理速度更快。
流式TTS的AM 模型支持fastspeech2 以及fastspeech2_cnndecoder; Voc 模型支持hifigan, mb_melgan
- `protocol`表示该流式TTS服务使用的网络协议目前支持 **http 和 websocket** 两种。
- `engine_list`表示即将启动的服务将会包含的语音引擎,格式为 <语音任务>_<引擎类型>。
- 该demo主要介绍流式语音合成服务因此语音任务应设置为tts。
- 目前引擎类型支持两种形式:**online** 表示使用python进行动态图推理的引擎**online-onnx** 表示使用onnxruntime进行推理的引擎。其中online-onnx的推理速度更快。
- 流式TTS引擎的AM模型支持**fastspeech2 以及fastspeech2_cnndecoder**; Voc 模型支持:**hifigan, mb_melgan**
- 流式am推理中每次会对一个chunk的数据进行推理以达到流式的效果。其中`am_block`表示chunk中的有效帧数`am_pad` 表示一个chunk中am_block前后各加的帧数。am_pad的存在用于消除流式推理产生的误差避免由流式推理对合成音频质量的影响。
- fastspeech2不支持流式am推理因此am_pad与am_block对它无效
- fastspeech2_cnndecoder 支持流式推理当am_pad=12时流式推理合成音频与非流式合成音频一致
- 流式voc推理中每次会对一个chunk的数据进行推理以达到流式的效果。其中`voc_block`表示chunk中的有效帧数`voc_pad` 表示一个chunk中voc_block前后各加的帧数。voc_pad的存在用于消除流式推理产生的误差避免由流式推理对合成音频质量的影响。
- hifigan, mb_melgan 均支持流式voc 推理
- 当voc模型为mb_melgan当voc_pad=14时流式推理合成音频与非流式合成音频一致voc_pad最小可以设置为7合成音频听感上没有异常若voc_pad小于7合成音频听感上存在异常。
- 当voc模型为hifigan当voc_pad=20时流式推理合成音频与非流式合成音频一致当voc_pad=14时合成音频听感上没有异常。
- 推理速度mb_melgan > hifigan; 音频质量mb_melgan < hifigan
### 3. 服务端使用方法
### 3. 使用http协议的流式语音合成服务端及客户端使用方法
#### 3.1 服务端使用方法
- 命令行 (推荐使用)
启动服务配置文件默认使用http
```bash
# 启动服务
paddlespeech_server start --config_file ./conf/tts_online_application.yaml
```
@ -36,7 +45,7 @@
paddlespeech_server start --help
```
参数:
- `config_file`: 服务的配置文件,默认: ./conf/application.yaml
- `config_file`: 服务的配置文件,默认: ./conf/tts_online_application.yaml
- `log_file`: log 文件. 默认:./log/paddlespeech.log
输出:
@ -84,17 +93,15 @@
```
### 4. 流式TTS 客户端使用方法
#### 3.2 客户端使用方法
- 命令行 (推荐使用)
```bash
# 访问 http 流式TTS服务
paddlespeech_client tts_online --server_ip 127.0.0.1 --port 8092 --input "您好,欢迎使用百度飞桨语音合成服务。" --output output.wav
访问 http 流式TTS服务
# 访问 websocket 流式TTS服务
paddlespeech_client tts_online --server_ip 127.0.0.1 --port 8092 --protocol websocket --input "您好,欢迎使用百度飞桨语音合成服务。" --output output.wav
```bash
paddlespeech_client tts_online --server_ip 127.0.0.1 --port 8092 --protocol http --input "您好,欢迎使用百度飞桨语音合成服务。" --output output.wav
```
使用帮助:
```bash
@ -155,8 +162,143 @@
[2022-04-24 21:11:16,802] [ INFO] - 音频时长3.825 s
[2022-04-24 21:11:16,802] [ INFO] - RTF: 0.7846773683635238
[2022-04-24 21:11:16,837] [ INFO] - 音频保存至:./output.wav
```
### 4. 使用websocket协议的流式语音合成服务端及客户端使用方法
#### 4.1 服务端使用方法
- 命令行 (推荐使用)
首先修改配置文件 `conf/tts_online_application.yaml` **`protocol` 设置为 `websocket`**。
启动服务:
```bash
paddlespeech_server start --config_file ./conf/tts_online_application.yaml
```
使用方法:
```bash
paddlespeech_server start --help
```
参数:
- `config_file`: 服务的配置文件,默认: ./conf/tts_online_application.yaml
- `log_file`: log 文件. 默认:./log/paddlespeech.log
输出:
```bash
[2022-04-27 10:18:09,107] [ INFO] - The first response time of the 0 warm up: 1.1551103591918945 s
[2022-04-27 10:18:09,219] [ INFO] - The first response time of the 1 warm up: 0.11204338073730469 s
[2022-04-27 10:18:09,324] [ INFO] - The first response time of the 2 warm up: 0.1051797866821289 s
[2022-04-27 10:18:09,325] [ INFO] - **********************************************************************
INFO: Started server process [17600]
[2022-04-27 10:18:09] [INFO] [server.py:75] Started server process [17600]
INFO: Waiting for application startup.
[2022-04-27 10:18:09] [INFO] [on.py:45] Waiting for application startup.
INFO: Application startup complete.
[2022-04-27 10:18:09] [INFO] [on.py:59] Application startup complete.
INFO: Uvicorn running on http://127.0.0.1:8092 (Press CTRL+C to quit)
[2022-04-27 10:18:09] [INFO] [server.py:211] Uvicorn running on http://127.0.0.1:8092 (Press CTRL+C to quit)
```
- Python API
```python
from paddlespeech.server.bin.paddlespeech_server import ServerExecutor
server_executor = ServerExecutor()
server_executor(
config_file="./conf/tts_online_application.yaml",
log_file="./log/paddlespeech.log")
```
输出:
```bash
[2022-04-27 10:20:16,660] [ INFO] - The first response time of the 0 warm up: 1.0945196151733398 s
[2022-04-27 10:20:16,773] [ INFO] - The first response time of the 1 warm up: 0.11222052574157715 s
[2022-04-27 10:20:16,878] [ INFO] - The first response time of the 2 warm up: 0.10494542121887207 s
[2022-04-27 10:20:16,878] [ INFO] - **********************************************************************
INFO: Started server process [23466]
[2022-04-27 10:20:16] [INFO] [server.py:75] Started server process [23466]
INFO: Waiting for application startup.
[2022-04-27 10:20:16] [INFO] [on.py:45] Waiting for application startup.
INFO: Application startup complete.
[2022-04-27 10:20:16] [INFO] [on.py:59] Application startup complete.
INFO: Uvicorn running on http://127.0.0.1:8092 (Press CTRL+C to quit)
[2022-04-27 10:20:16] [INFO] [server.py:211] Uvicorn running on http://127.0.0.1:8092 (Press CTRL+C to quit)
```
#### 4.2 客户端使用方法
- 命令行 (推荐使用)
访问 websocket 流式TTS服务
```bash
paddlespeech_client tts_online --server_ip 127.0.0.1 --port 8092 --protocol websocket --input "您好,欢迎使用百度飞桨语音合成服务。" --output output.wav
```
使用帮助:
```bash
paddlespeech_client tts_online --help
```
参数:
- `server_ip`: 服务端ip地址默认: 127.0.0.1。
- `port`: 服务端口,默认: 8092。
- `protocol`: 服务协议,可选 [http, websocket], 默认: http。
- `input`: (必须输入): 待合成的文本。
- `spk_id`: 说话人 id用于多说话人语音合成默认值 0。
- `speed`: 音频速度,该值应设置在 0 到 3 之间。 默认值1.0
- `volume`: 音频音量,该值应设置在 0 到 3 之间。 默认值: 1.0
- `sample_rate`: 采样率,可选 [0, 8000, 16000]默认值0表示与模型采样率相同
- `output`: 输出音频的路径, 默认值None表示不保存音频到本地。
- `play`: 是否播放音频,边合成边播放, 默认值False表示不播放。**播放音频需要依赖pyaudio库**。
输出:
```bash
[2022-04-27 10:21:04,262] [ INFO] - tts websocket client start
[2022-04-27 10:21:04,496] [ INFO] - 句子:您好,欢迎使用百度飞桨语音合成服务。
[2022-04-27 10:21:04,496] [ INFO] - 首包响应0.2124948501586914 s
[2022-04-27 10:21:07,483] [ INFO] - 尾包响应3.199106454849243 s
[2022-04-27 10:21:07,484] [ INFO] - 音频时长3.825 s
[2022-04-27 10:21:07,484] [ INFO] - RTF: 0.8363677006141812
[2022-04-27 10:21:07,516] [ INFO] - 音频保存至output.wav
```
- Python API
```python
from paddlespeech.server.bin.paddlespeech_client import TTSOnlineClientExecutor
import json
executor = TTSOnlineClientExecutor()
executor(
input="您好,欢迎使用百度飞桨语音合成服务。",
server_ip="127.0.0.1",
port=8092,
protocol="websocket",
spk_id=0,
speed=1.0,
volume=1.0,
sample_rate=0,
output="./output.wav",
play=False)
```
输出:
```bash
[2022-04-27 10:22:48,852] [ INFO] - tts websocket client start
[2022-04-27 10:22:49,080] [ INFO] - 句子:您好,欢迎使用百度飞桨语音合成服务。
[2022-04-27 10:22:49,080] [ INFO] - 首包响应0.21017956733703613 s
[2022-04-27 10:22:52,100] [ INFO] - 尾包响应3.2304444313049316 s
[2022-04-27 10:22:52,101] [ INFO] - 音频时长3.825 s
[2022-04-27 10:22:52,101] [ INFO] - RTF: 0.8445606356352762
[2022-04-27 10:22:52,134] [ INFO] - 音频保存至:./output.wav
```

@ -1,4 +1,4 @@
# This is the parameter configuration file for PaddleSpeech Serving.
# This is the parameter configuration file for streaming tts server.
#################################################################################
# SERVER SETTING #
@ -7,8 +7,8 @@ host: 127.0.0.1
port: 8092
# The task format in the engin_list is: <speech task>_<engine type>
# engine_list choices = ['tts_online', 'tts_online-onnx']
# protocol = ['websocket', 'http'] (only one can be selected).
# engine_list choices = ['tts_online', 'tts_online-onnx'], the inference speed of tts_online-onnx is faster than tts_online.
# protocol choices = ['websocket', 'http']
protocol: 'http'
engine_list: ['tts_online-onnx']
@ -20,7 +20,8 @@ engine_list: ['tts_online-onnx']
################################### TTS #########################################
################### speech task: tts; engine_type: online #######################
tts_online:
# am (acoustic model) choices=['fastspeech2_csmsc', 'fastspeech2_cnndecoder_csmsc']
# am (acoustic model) choices=['fastspeech2_csmsc', 'fastspeech2_cnndecoder_csmsc']
# fastspeech2_cnndecoder_csmsc support streaming am infer.
am: 'fastspeech2_csmsc'
am_config:
am_ckpt:
@ -31,6 +32,7 @@ tts_online:
spk_id: 0
# voc (vocoder) choices=['mb_melgan_csmsc, hifigan_csmsc']
# Both mb_melgan_csmsc and hifigan_csmsc support streaming voc inference
voc: 'mb_melgan_csmsc'
voc_config:
voc_ckpt:
@ -39,8 +41,13 @@ tts_online:
# others
lang: 'zh'
device: 'cpu' # set 'gpu:id' or 'cpu'
# am_block and am_pad are only used by the fastspeech2_cnndecoder_csmsc model for streaming am inference;
# when am_pad is set to 12, the streaming synthetic audio is the same as the non-streaming synthetic audio
am_block: 42
am_pad: 12
# voc_block and voc_pad are used by the voc model for streaming voc inference;
# when the voc model is mb_melgan_csmsc and voc_pad is set to 14, the streaming synthetic audio is the same as the non-streaming synthetic audio; the minimum pad value is 7, and the streaming synthetic audio still sounds normal
# when the voc model is hifigan_csmsc and voc_pad is set to 20, the streaming synthetic audio is the same as the non-streaming synthetic audio; with voc_pad set to 14 the streaming synthetic audio sounds normal
voc_block: 14
voc_pad: 14
@ -53,7 +60,8 @@ tts_online:
################################### TTS #########################################
################### speech task: tts; engine_type: online-onnx #######################
tts_online-onnx:
# am (acoustic model) choices=['fastspeech2_csmsc_onnx', 'fastspeech2_cnndecoder_csmsc_onnx']
# am (acoustic model) choices=['fastspeech2_csmsc_onnx', 'fastspeech2_cnndecoder_csmsc_onnx']
# fastspeech2_cnndecoder_csmsc_onnx support streaming am infer.
am: 'fastspeech2_cnndecoder_csmsc_onnx'
# am_ckpt is a list, if am is fastspeech2_cnndecoder_csmsc_onnx, am_ckpt = [encoder model, decoder model, postnet model];
# if am is fastspeech2_csmsc_onnx, am_ckpt = [ckpt model];
@ -70,6 +78,7 @@ tts_online-onnx:
cpu_threads: 4
# voc (vocoder) choices=['mb_melgan_csmsc_onnx, hifigan_csmsc_onnx']
# Both mb_melgan_csmsc_onnx and hifigan_csmsc_onnx support streaming voc inference
voc: 'hifigan_csmsc_onnx'
voc_ckpt:
voc_sample_rate: 24000
@ -80,9 +89,15 @@ tts_online-onnx:
# others
lang: 'zh'
# am_block and am_pad are only used by the fastspeech2_cnndecoder_csmsc_onnx model for streaming am inference;
# when am_pad is set to 12, the streaming synthetic audio is the same as the non-streaming synthetic audio
am_block: 42
am_pad: 12
# voc_block and voc_pad are used by the voc model for streaming voc inference;
# when the voc model is mb_melgan_csmsc_onnx and voc_pad is set to 14, the streaming synthetic audio is the same as the non-streaming synthetic audio; the minimum pad value is 7, and the streaming synthetic audio still sounds normal
# when the voc model is hifigan_csmsc_onnx and voc_pad is set to 20, the streaming synthetic audio is the same as the non-streaming synthetic audio; with voc_pad set to 14 the streaming synthetic audio sounds normal
voc_block: 14
voc_pad: 14
# voc_upsample should be same as n_shift on voc config.
voc_upsample: 300

@ -6,7 +6,7 @@
### Speech Recognition Model
Acoustic Model | Training Data | Token-based | Size | Descriptions | CER | WER | Hours of speech | Example Link
:-------------:| :------------:| :-----: | -----: | :-----: |:-----:| :-----: | :-----: | :-----:
[Ds2 Online Aishell ASR0 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_online_aishell_ckpt_0.2.0.model.tar.gz) | Aishell Dataset | Char-based | 345 MB | 2 Conv + 5 LSTM layers with only forward direction | 0.078 |-| 151 h | [D2 Online Aishell ASR0](../../examples/aishell/asr0)
[Ds2 Online Aishell ASR0 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_online_aishell_fbank161_ckpt_0.2.1.model.tar.gz) | Aishell Dataset | Char-based | 491 MB | 2 Conv + 5 LSTM layers with only forward direction | 0.0666 |-| 151 h | [D2 Online Aishell ASR0](../../examples/aishell/asr0)
[Ds2 Offline Aishell ASR0 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_aishell_ckpt_0.1.1.model.tar.gz)| Aishell Dataset | Char-based | 306 MB | 2 Conv + 3 bidirectional GRU layers| 0.064 |-| 151 h | [Ds2 Offline Aishell ASR0](../../examples/aishell/asr0)
[Conformer Online Aishell ASR1 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr1/asr1_chunk_conformer_aishell_ckpt_0.2.0.model.tar.gz) | Aishell Dataset | Char-based | 189 MB | Encoder:Conformer, Decoder:Transformer, Decoding method: Attention rescoring| 0.0544 |-| 151 h | [Conformer Online Aishell ASR1](../../examples/aishell/asr1)
[Conformer Offline Aishell ASR1 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr1/asr1_conformer_aishell_ckpt_0.1.2.model.tar.gz) | Aishell Dataset | Char-based | 189 MB | Encoder:Conformer, Decoder:Transformer, Decoding method: Attention rescoring | 0.0464 |-| 151 h | [Conformer Offline Aishell ASR1](../../examples/aishell/asr1)

@ -4,6 +4,8 @@
| Model | Number of Params | Release | Config | Test set | Valid Loss | CER |
| --- | --- | --- | --- | --- | --- | --- |
| DeepSpeech2 | 45.18M | r0.2.0 | conf/deepspeech2_online.yaml + U2 Data pipeline and spec aug + fbank161 | test | 6.876979827880859 | 0.0666 |
| DeepSpeech2 | 45.18M | r0.2.0 | conf/deepspeech2_online.yaml + spec aug + fbank161 | test | 7.679287910461426 | 0.0718 |
| DeepSpeech2 | 45.18M | r0.2.0 | conf/deepspeech2_online.yaml + spec aug | test | 7.708217620849609| 0.078 |
| DeepSpeech2 | 45.18M | v2.2.0 | conf/deepspeech2_online.yaml + spec aug | test | 7.994938373565674 | 0.080 |

@ -1,3 +1,3 @@
# Speaker Diarization on AMI corpus
* sd0 - speaker diarization by AHC, SC based on x-vectors
* sd0 - speaker diarization by AHC, SC based on embeddings

@ -7,7 +7,23 @@
The script performs diarization using x-vectors(TDNN,ECAPA-TDNN) on the AMI mix-headset data. We demonstrate the use of different clustering methods: AHC, spectral.
## How to Run
### prepare annotations and audios
Download the AMI corpus. You need around 10 GB of free space to get the whole dataset.
The signals are too large to package in one archive, so you need to use the download chooser to indicate which ones you wish to download.
```bash
## download annotations
wget http://groups.inf.ed.ac.uk/ami/AMICorpusAnnotations/ami_public_manual_1.6.2.zip && unzip ami_public_manual_1.6.2.zip
```
Then please follow https://groups.inf.ed.ac.uk/ami/download/ to download the signals:
1) Select one or more AMI meetings: for the IDs, please follow ./ami_split.py
2) Select media streams: just select Headset mix
### start running
Use the following command to run diarization on AMI corpus.
`bash ./run.sh`
```bash
./run.sh --data_folder ./amicorpus --manual_annot_folder ./ami_public_manual_1.6.2
```
## Results (DER) coming soon! :)

@ -17,18 +17,6 @@ device=gpu
. ${MAIN_ROOT}/utils/parse_options.sh || exit 1;
if [ $stage -le 0 ]; then
# Prepare data
# Download AMI corpus, You need around 10GB of free space to get whole data
# The signals are too large to package in this way,
# so you need to use the chooser to indicate which ones you wish to download
echo "Please follow https://groups.inf.ed.ac.uk/ami/download/ to download the data."
echo "Annotations: AMI manual annotations v1.6.2 "
echo "Signals: "
echo "1) Select one or more AMI meetings: the IDs please follow ./ami_split.py"
echo "2) Select media streams: Just select Headset mix"
fi
if [ $stage -le 1 ]; then
# Download the pretrained model
wget https://paddlespeech.bj.bcebos.com/vector/voxceleb/sv0_ecapa_tdnn_voxceleb12_ckpt_0_1_1.tar.gz

@ -1,39 +1,49 @@
data:
data_dir: '/PATH/TO/DATA/hey_snips_research_6k_en_train_eval_clean_ter'
dataset: 'paddleaudio.datasets:HeySnips'
# https://yaml.org/type/float.html
###########################################
# Data #
###########################################
dataset: 'paddleaudio.datasets:HeySnips'
data_dir: '/PATH/TO/DATA/hey_snips_research_6k_en_train_eval_clean_ter'
model:
num_keywords: 1
backbone: 'paddlespeech.kws.models:MDTC'
config:
stack_num: 3
stack_size: 4
in_channels: 80
res_channels: 32
kernel_size: 5
############################################
# Network Architecture #
############################################
backbone: 'paddlespeech.kws.models:MDTC'
num_keywords: 1
stack_num: 3
stack_size: 4
in_channels: 80
res_channels: 32
kernel_size: 5
feature:
feat_type: 'kaldi_fbank'
sample_rate: 16000
frame_shift: 10
frame_length: 25
n_mels: 80
###########################################
# Feature #
###########################################
feat_type: 'kaldi_fbank'
sample_rate: 16000
frame_shift: 10
frame_length: 25
n_mels: 80
training:
epochs: 100
num_workers: 16
batch_size: 100
checkpoint_dir: './checkpoint'
save_freq: 10
log_freq: 10
learning_rate: 0.001
weight_decay: 0.00005
grad_clip: 5.0
###########################################
# Training #
###########################################
epochs: 100
num_workers: 16
batch_size: 100
checkpoint_dir: './checkpoint'
save_freq: 10
log_freq: 10
learning_rate: 0.001
weight_decay: 0.00005
grad_clip: 5.0
scoring:
batch_size: 100
num_workers: 16
checkpoint: './checkpoint/epoch_100/model.pdparams'
score_file: './scores.txt'
stats_file: './stats.0.txt'
img_file: './det.png'
###########################################
# Scoring #
###########################################
batch_size: 100
num_workers: 16
checkpoint: './checkpoint/epoch_100/model.pdparams'
score_file: './scores.txt'
stats_file: './stats.0.txt'
img_file: './det.png'
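The flattened config above is consumed through a yacs `CfgNode` in the updated experiment entry points; a minimal loading sketch (the config path is an assumption):

```python
# Load the flattened KWS config the same way the updated exps do.
from yacs.config import CfgNode

config = CfgNode(new_allowed=True)
config.merge_from_file("conf/mdtc.yaml")  # hypothetical path to the file above
print(config["backbone"], config["num_keywords"], config["n_mels"])
```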

@ -1,2 +1,25 @@
#!/bin/bash
python3 ${BIN_DIR}/plot_det_curve.py --cfg_path=$1 --keyword HeySnips
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
if [ $# != 3 ];then
echo "usage: ${0} config_path checkpoint output_file"
exit -1
fi
keyword=$1
stats_file=$2
img_file=$3
python3 ${BIN_DIR}/plot_det_curve.py --keyword_label ${keyword} --stats_file ${stats_file} --img_file ${img_file}

@ -1,5 +1,27 @@
#!/bin/bash
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
python3 ${BIN_DIR}/score.py --cfg_path=$1
if [ $# != 4 ];then
echo "usage: ${0} checkpoint score_file stats_file"
exit -1
fi
python3 ${BIN_DIR}/compute_det.py --cfg_path=$1
cfg_path=$1
ckpt=$2
score_file=$3
stats_file=$4
python3 ${BIN_DIR}/score.py --config ${cfg_path} --ckpt ${ckpt} --score_file ${score_file} || exit -1
python3 ${BIN_DIR}/compute_det.py --config ${cfg_path} --score_file ${score_file} --stats_file ${stats_file} || exit -1

@ -1,13 +1,31 @@
#!/bin/bash
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
if [ $# != 2 ];then
echo "usage: ${0} num_gpus config_path"
exit -1
fi
ngpu=$1
cfg_path=$2
if [ ${ngpu} -gt 0 ]; then
python3 -m paddle.distributed.launch --gpus $CUDA_VISIBLE_DEVICES ${BIN_DIR}/train.py \
--cfg_path ${cfg_path}
--config ${cfg_path}
else
echo "set CUDA_VISIBLE_DEVICES to enable multi-gpus trainning."
python3 ${BIN_DIR}/train.py \
--cfg_path ${cfg_path}
--config ${cfg_path}
fi

@ -32,10 +32,16 @@ if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
./local/train.sh ${ngpu} ${cfg_path} || exit -1
fi
ckpt=./checkpoint/epoch_100/model.pdparams
score_file=./scores.txt
stats_file=./stats.0.txt
img_file=./det.png
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
./local/score.sh ${cfg_path} || exit -1
./local/score.sh ${cfg_path} ${ckpt} ${score_file} ${stats_file} || exit -1
fi
keyword=HeySnips
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
./local/plot.sh ${cfg_path} || exit -1
./local/plot.sh ${keyword} ${stats_file} ${img_file} || exit -1
fi

@ -142,10 +142,10 @@ using the `tar` scripts to unpack the model and then you can use the script to t
For example:
```
wget https://paddlespeech.bj.bcebos.com/vector/voxceleb/sv0_ecapa_tdnn_voxceleb12_ckpt_0_2_0.tar.gz
tar xzvf sv0_ecapa_tdnn_voxceleb12_ckpt_0_2_0.tar.gz
tar -xvf sv0_ecapa_tdnn_voxceleb12_ckpt_0_2_0.tar.gz
source path.sh
# If you have processed the data and get the manifest file you can skip the following 2 steps
CUDA_VISIBLE_DEVICES= ./local/test.sh ./data sv0_ecapa_tdnn_voxceleb12_ckpt_0_1_2 conf/ecapa_tdnn.yaml
CUDA_VISIBLE_DEVICES= bash ./local/test.sh ./data sv0_ecapa_tdnn_voxceleb12_ckpt_0_1_2/model/ conf/ecapa_tdnn.yaml
```
The performance of the released models is shown in [this](./RESULTS.md)
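The test stage essentially scores trial pairs by cosine similarity between speaker embeddings before computing EER; a minimal, hypothetical sketch of that scoring step:

```python
# Hypothetical cosine scoring of two speaker embeddings (e.g. ECAPA-TDNN outputs).
import numpy as np

def cosine_score(emb1: np.ndarray, emb2: np.ndarray) -> float:
    return float(np.dot(emb1, emb2) / (np.linalg.norm(emb1) * np.linalg.norm(emb2)))

enroll = np.random.randn(192)  # placeholder enrollment embedding
test = np.random.randn(192)    # placeholder test embedding
print(cosine_score(enroll, test))  # accept if the score exceeds a tuned threshold
```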

@ -33,10 +33,26 @@ dir=$1
exp_dir=$2
conf_path=$3
# get the gpu nums for training
ngpu=$(echo $CUDA_VISIBLE_DEVICES | awk -F "," '{print NF}')
echo "using $ngpu gpus..."
# setting training device
device="cpu"
if ${use_gpu}; then
device="gpu"
fi
if [ $ngpu -le 0 ]; then
echo "no gpu, training in cpu mode"
device='cpu'
use_gpu=false
fi
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
# test the model and compute the eer metrics
python3 ${BIN_DIR}/test.py \
--data-dir ${dir} \
--load-checkpoint ${exp_dir} \
--config ${conf_path}
--config ${conf_path} \
--device ${device}
fi

@ -42,15 +42,25 @@ device="cpu"
if ${use_gpu}; then
device="gpu"
fi
if [ $ngpu -le 0 ]; then
echo "no gpu, training in cpu mode"
device='cpu'
use_gpu=false
fi
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
# train the speaker identification task with voxceleb data
# and we will create the trained model parameters in ${exp_dir}/model.pdparams as the soft link
# Note: we will store the log file in exp/log directory
python3 -m paddle.distributed.launch --gpus=$CUDA_VISIBLE_DEVICES \
${BIN_DIR}/train.py --device ${device} --checkpoint-dir ${exp_dir} \
--data-dir ${dir} --config ${conf_path}
if $use_gpu; then
python3 -m paddle.distributed.launch --gpus=$CUDA_VISIBLE_DEVICES \
${BIN_DIR}/train.py --device ${device} --checkpoint-dir ${exp_dir} \
--data-dir ${dir} --config ${conf_path}
else
python3 \
${BIN_DIR}/train.py --device ${device} --checkpoint-dir ${exp_dir} \
--data-dir ${dir} --config ${conf_path}
fi
fi
if [ $? -ne 0 ]; then

@ -27,6 +27,16 @@ pretrained_models = {
'ckpt_path':
'exp/conformer/checkpoints/wenetspeech',
},
"conformer_online_multicn-zh-16k": {
'url':
'https://paddlespeech.bj.bcebos.com/s2t/multi_cn/asr1/asr1_chunk_conformer_multi_cn_ckpt_0.2.0.model.tar.gz',
'md5':
'7989b3248c898070904cf042fd656003',
'cfg_path':
'model.yaml',
'ckpt_path':
'exp/chunk_conformer/checkpoints/multi_cn',
},
"conformer_aishell-zh-16k": {
'url':
'https://paddlespeech.bj.bcebos.com/s2t/aishell/asr1/asr1_conformer_aishell_ckpt_0.1.2.model.tar.gz',
@ -57,6 +67,20 @@ pretrained_models = {
'ckpt_path':
'exp/transformer/checkpoints/avg_10',
},
"deepspeech2online_wenetspeech-zh-16k": {
'url':
'https://paddlespeech.bj.bcebos.com/s2t/wenetspeech/asr0/WIP_asr0_deepspeech2_online_wenetspeech_ckpt_1.0.0a.model.tar.gz',
'md5':
'b3ef6fcae8c0058c3c53375341ccb209',
'cfg_path':
'model.yaml',
'ckpt_path':
'exp/deepspeech2_online/checkpoints/avg_3',
'lm_url':
'https://deepspeech.bj.bcebos.com/zh_lm/zh_giga.no_cna_cmn.prune01244.klm',
'lm_md5':
'29e02312deb2e59b3c8686c7966d4fe3'
},
"deepspeech2offline_aishell-zh-16k": {
'url':
'https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_aishell_ckpt_0.1.1.model.tar.gz',
@ -73,9 +97,9 @@ pretrained_models = {
},
"deepspeech2online_aishell-zh-16k": {
'url':
'https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_online_aishell_ckpt_0.2.0.model.tar.gz',
'https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_online_aishell_fbank161_ckpt_0.2.1.model.tar.gz',
'md5':
'23e16c69730a1cb5d735c98c83c21e16',
'98b87b171b7240b7cae6e07d8d0bc9be',
'cfg_path':
'model.yaml',
'ckpt_path':

@ -22,6 +22,8 @@ from typing import Union
import paddle
import soundfile
from paddleaudio.backends import load as load_audio
from paddleaudio.compliance.librosa import melspectrogram
from yacs.config import CfgNode
from ..executor import BaseExecutor
@ -30,8 +32,6 @@ from ..utils import cli_register
from ..utils import stats_wrapper
from .pretrained_models import model_alias
from .pretrained_models import pretrained_models
from paddleaudio.backends import load as load_audio
from paddleaudio.compliance.librosa import melspectrogram
from paddlespeech.s2t.utils.dynamic_import import dynamic_import
from paddlespeech.vector.io.batch import feature_normalize
from paddlespeech.vector.modules.sid_model import SpeakerIdetification

@ -12,24 +12,15 @@
# See the License for the specific language governing permissions and
# limitations under the License.
# Modified from wekws(https://github.com/wenet-e2e/wekws)
import argparse
import os
import paddle
import yaml
from tqdm import tqdm
from yacs.config import CfgNode
from paddlespeech.s2t.training.cli import default_argument_parser
from paddlespeech.s2t.utils.dynamic_import import dynamic_import
# yapf: disable
parser = argparse.ArgumentParser(__doc__)
parser.add_argument("--cfg_path", type=str, required=True)
parser.add_argument('--keyword_index', type=int, default=0, help='keyword index')
parser.add_argument('--step', type=float, default=0.01, help='threshold step of trigger score')
parser.add_argument('--window_shift', type=int, default=50, help='window_shift is used to skip the frames after triggered')
args = parser.parse_args()
# yapf: enable
def load_label_and_score(keyword_index: int,
ds: paddle.io.Dataset,
@ -61,26 +52,52 @@ def load_label_and_score(keyword_index: int,
if __name__ == '__main__':
args.cfg_path = os.path.abspath(os.path.expanduser(args.cfg_path))
with open(args.cfg_path, 'r') as f:
config = yaml.safe_load(f)
parser = default_argument_parser()
parser.add_argument(
'--keyword_index', type=int, default=0, help='keyword index')
parser.add_argument(
'--step',
type=float,
default=0.01,
help='threshold step of trigger score')
parser.add_argument(
'--window_shift',
type=int,
default=50,
help='window_shift is used to skip the frames after triggered')
parser.add_argument(
"--score_file",
type=str,
required=True,
help='output file of trigger scores')
parser.add_argument(
'--stats_file',
type=str,
default='./stats.0.txt',
help='output file of detection error tradeoff')
args = parser.parse_args()
data_conf = config['data']
feat_conf = config['feature']
scoring_conf = config['scoring']
# https://yaml.org/type/float.html
config = CfgNode(new_allowed=True)
if args.config:
config.merge_from_file(args.config)
# Dataset
ds_class = dynamic_import(data_conf['dataset'])
test_ds = ds_class(data_dir=data_conf['data_dir'], mode='test', **feat_conf)
score_file = os.path.abspath(scoring_conf['score_file'])
stats_file = os.path.abspath(scoring_conf['stats_file'])
ds_class = dynamic_import(config['dataset'])
test_ds = ds_class(
data_dir=config['data_dir'],
mode='test',
feat_type=config['feat_type'],
sample_rate=config['sample_rate'],
frame_shift=config['frame_shift'],
frame_length=config['frame_length'],
n_mels=config['n_mels'], )
keyword_table, filler_table, filler_duration = load_label_and_score(
args.keyword, test_ds, score_file)
args.keyword_index, test_ds, args.score_file)
print('Filler total duration Hours: {}'.format(filler_duration / 3600.0))
pbar = tqdm(total=int(1.0 / args.step))
with open(stats_file, 'w', encoding='utf8') as fout:
with open(args.stats_file, 'w', encoding='utf8') as fout:
keyword_index = args.keyword_index
threshold = 0.0
while threshold <= 1.0:
@ -113,4 +130,4 @@ if __name__ == '__main__':
pbar.update(1)
pbar.close()
print('DET saved to: {}'.format(stats_file))
print('DET saved to: {}'.format(args.stats_file))
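For intuition, the DET sweep above walks the trigger threshold over [0, 1] and, at each step, records the keyword false-rejection rate and the false alarms per hour over the filler audio; a simplified, hypothetical sketch of one sweep:

```python
# Simplified DET sweep (illustrative only, not the exps/mdtc/compute_det.py logic).
import numpy as np

keyword_scores = np.random.rand(200)   # placeholder: peak score per keyword utterance
filler_scores = np.random.rand(1000)   # placeholder: peak score per filler segment
filler_duration_h = 10.0               # placeholder: total filler audio in hours

points = []
for threshold in np.arange(0.0, 1.0, 0.01):
    frr = float(np.mean(keyword_scores < threshold))                    # false rejection rate
    fa_per_hour = float(np.sum(filler_scores >= threshold)) / filler_duration_h
    points.append((threshold, frr, fa_per_hour))                        # one DET point
print(points[0], points[-1])
```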

@ -17,12 +17,12 @@ import os
import matplotlib.pyplot as plt
import numpy as np
import yaml
# yapf: disable
parser = argparse.ArgumentParser(__doc__)
parser.add_argument("--cfg_path", type=str, required=True)
parser.add_argument("--keyword", type=str, required=True)
parser.add_argument('--keyword_label', type=str, required=True, help='keyword string shown on image')
parser.add_argument('--stats_file', type=str, required=True, help='output file of detection error tradeoff')
parser.add_argument('--img_file', type=str, default='./det.png', help='output det image')
args = parser.parse_args()
# yapf: enable
@ -61,14 +61,8 @@ def plot_det_curve(keywords, stats_file, figure_file, xlim, x_step, ylim,
if __name__ == '__main__':
args.cfg_path = os.path.abspath(os.path.expanduser(args.cfg_path))
with open(args.cfg_path, 'r') as f:
config = yaml.safe_load(f)
scoring_conf = config['scoring']
img_file = os.path.abspath(scoring_conf['img_file'])
stats_file = os.path.abspath(scoring_conf['stats_file'])
keywords = [args.keyword]
plot_det_curve(keywords, stats_file, img_file, 10, 2, 10, 2)
img_file = os.path.abspath(args.img_file)
stats_file = os.path.abspath(args.stats_file)
plot_det_curve([args.keyword_label], stats_file, img_file, 10, 2, 10, 2)
print('DET curve image saved to: {}'.format(img_file))

@ -12,55 +12,67 @@
# See the License for the specific language governing permissions and
# limitations under the License.
# Modified from wekws(https://github.com/wenet-e2e/wekws)
import argparse
import os
import paddle
import yaml
from tqdm import tqdm
from yacs.config import CfgNode
from paddlespeech.kws.exps.mdtc.collate import collate_features
from paddlespeech.kws.models.mdtc import KWSModel
from paddlespeech.s2t.training.cli import default_argument_parser
from paddlespeech.s2t.utils.dynamic_import import dynamic_import
# yapf: disable
parser = argparse.ArgumentParser(__doc__)
parser.add_argument("--cfg_path", type=str, required=True)
args = parser.parse_args()
# yapf: enable
if __name__ == '__main__':
args.cfg_path = os.path.abspath(os.path.expanduser(args.cfg_path))
with open(args.cfg_path, 'r') as f:
config = yaml.safe_load(f)
parser = default_argument_parser()
parser.add_argument(
"--ckpt",
type=str,
required=True,
help='model checkpoint for evaluation.')
parser.add_argument(
"--score_file",
type=str,
default='./scores.txt',
help='output file of trigger scores')
args = parser.parse_args()
model_conf = config['model']
data_conf = config['data']
feat_conf = config['feature']
scoring_conf = config['scoring']
# https://yaml.org/type/float.html
config = CfgNode(new_allowed=True)
if args.config:
config.merge_from_file(args.config)
# Dataset
ds_class = dynamic_import(data_conf['dataset'])
test_ds = ds_class(data_dir=data_conf['data_dir'], mode='test', **feat_conf)
ds_class = dynamic_import(config['dataset'])
test_ds = ds_class(
data_dir=config['data_dir'],
mode='test',
feat_type=config['feat_type'],
sample_rate=config['sample_rate'],
frame_shift=config['frame_shift'],
frame_length=config['frame_length'],
n_mels=config['n_mels'], )
test_sampler = paddle.io.BatchSampler(
test_ds, batch_size=scoring_conf['batch_size'], drop_last=False)
test_ds, batch_size=config['batch_size'], drop_last=False)
test_loader = paddle.io.DataLoader(
test_ds,
batch_sampler=test_sampler,
num_workers=scoring_conf['num_workers'],
num_workers=config['num_workers'],
return_list=True,
use_buffer_reader=True,
collate_fn=collate_features, )
# Model
backbone_class = dynamic_import(model_conf['backbone'])
backbone = backbone_class(**model_conf['config'])
model = KWSModel(backbone=backbone, num_keywords=model_conf['num_keywords'])
model.set_state_dict(paddle.load(scoring_conf['checkpoint']))
backbone_class = dynamic_import(config['backbone'])
backbone = backbone_class(
stack_num=config['stack_num'],
stack_size=config['stack_size'],
in_channels=config['in_channels'],
res_channels=config['res_channels'],
kernel_size=config['kernel_size'], )
model = KWSModel(backbone=backbone, num_keywords=config['num_keywords'])
model.set_state_dict(paddle.load(args.ckpt))
model.eval()
with paddle.no_grad(), open(
scoring_conf['score_file'], 'w', encoding='utf8') as fout:
with paddle.no_grad(), open(args.score_file, 'w', encoding='utf8') as f:
for batch_idx, batch in enumerate(
tqdm(test_loader, total=len(test_loader))):
keys, feats, labels, lengths = batch
@ -73,7 +85,6 @@ if __name__ == '__main__':
keyword_scores = score[:, keyword_i]
score_frames = ' '.join(
['{:.6f}'.format(x) for x in keyword_scores.tolist()])
fout.write(
'{} {} {}\n'.format(key, keyword_i, score_frames))
f.write('{} {} {}\n'.format(key, keyword_i, score_frames))
print('Result saved to: {}'.format(scoring_conf['score_file']))
print('Result saved to: {}'.format(args.score_file))

@ -11,77 +11,88 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import os
import paddle
import yaml
from paddleaudio.utils import logger
from paddleaudio.utils import Timer
from yacs.config import CfgNode
from paddlespeech.kws.exps.mdtc.collate import collate_features
from paddlespeech.kws.models.loss import max_pooling_loss
from paddlespeech.kws.models.mdtc import KWSModel
from paddlespeech.s2t.training.cli import default_argument_parser
from paddlespeech.s2t.utils.dynamic_import import dynamic_import
# yapf: disable
parser = argparse.ArgumentParser(__doc__)
parser.add_argument("--cfg_path", type=str, required=True)
args = parser.parse_args()
# yapf: enable
if __name__ == '__main__':
parser = default_argument_parser()
args = parser.parse_args()
# https://yaml.org/type/float.html
config = CfgNode(new_allowed=True)
if args.config:
config.merge_from_file(args.config)
nranks = paddle.distributed.get_world_size()
if paddle.distributed.get_world_size() > 1:
paddle.distributed.init_parallel_env()
local_rank = paddle.distributed.get_rank()
args.cfg_path = os.path.abspath(os.path.expanduser(args.cfg_path))
with open(args.cfg_path, 'r') as f:
config = yaml.safe_load(f)
model_conf = config['model']
data_conf = config['data']
feat_conf = config['feature']
training_conf = config['training']
# Dataset
ds_class = dynamic_import(data_conf['dataset'])
ds_class = dynamic_import(config['dataset'])
train_ds = ds_class(
data_dir=data_conf['data_dir'], mode='train', **feat_conf)
dev_ds = ds_class(data_dir=data_conf['data_dir'], mode='dev', **feat_conf)
data_dir=config['data_dir'],
mode='train',
feat_type=config['feat_type'],
sample_rate=config['sample_rate'],
frame_shift=config['frame_shift'],
frame_length=config['frame_length'],
n_mels=config['n_mels'], )
dev_ds = ds_class(
data_dir=config['data_dir'],
mode='dev',
feat_type=config['feat_type'],
sample_rate=config['sample_rate'],
frame_shift=config['frame_shift'],
frame_length=config['frame_length'],
n_mels=config['n_mels'], )
train_sampler = paddle.io.DistributedBatchSampler(
train_ds,
batch_size=training_conf['batch_size'],
batch_size=config['batch_size'],
shuffle=True,
drop_last=False)
train_loader = paddle.io.DataLoader(
train_ds,
batch_sampler=train_sampler,
num_workers=training_conf['num_workers'],
num_workers=config['num_workers'],
return_list=True,
use_buffer_reader=True,
collate_fn=collate_features, )
# Model
backbone_class = dynamic_import(model_conf['backbone'])
backbone = backbone_class(**model_conf['config'])
model = KWSModel(backbone=backbone, num_keywords=model_conf['num_keywords'])
backbone_class = dynamic_import(config['backbone'])
backbone = backbone_class(
stack_num=config['stack_num'],
stack_size=config['stack_size'],
in_channels=config['in_channels'],
res_channels=config['res_channels'],
kernel_size=config['kernel_size'], )
model = KWSModel(backbone=backbone, num_keywords=config['num_keywords'])
model = paddle.DataParallel(model)
clip = paddle.nn.ClipGradByGlobalNorm(training_conf['grad_clip'])
clip = paddle.nn.ClipGradByGlobalNorm(config['grad_clip'])
optimizer = paddle.optimizer.Adam(
learning_rate=training_conf['learning_rate'],
weight_decay=training_conf['weight_decay'],
learning_rate=config['learning_rate'],
weight_decay=config['weight_decay'],
parameters=model.parameters(),
grad_clip=clip)
criterion = max_pooling_loss
steps_per_epoch = len(train_sampler)
timer = Timer(steps_per_epoch * training_conf['epochs'])
timer = Timer(steps_per_epoch * config['epochs'])
timer.start()
for epoch in range(1, training_conf['epochs'] + 1):
for epoch in range(1, config['epochs'] + 1):
model.train()
avg_loss = 0
@ -107,15 +118,13 @@ if __name__ == '__main__':
timer.count()
if (batch_idx + 1
) % training_conf['log_freq'] == 0 and local_rank == 0:
if (batch_idx + 1) % config['log_freq'] == 0 and local_rank == 0:
lr = optimizer.get_lr()
avg_loss /= training_conf['log_freq']
avg_loss /= config['log_freq']
avg_acc = num_corrects / num_samples
print_msg = 'Epoch={}/{}, Step={}/{}'.format(
epoch, training_conf['epochs'], batch_idx + 1,
steps_per_epoch)
epoch, config['epochs'], batch_idx + 1, steps_per_epoch)
print_msg += ' loss={:.4f}'.format(avg_loss)
print_msg += ' acc={:.4f}'.format(avg_acc)
print_msg += ' lr={:.6f} step/sec={:.2f} | ETA {}'.format(
@ -126,17 +135,17 @@ if __name__ == '__main__':
num_corrects = 0
num_samples = 0
if epoch % training_conf[
if epoch % config[
'save_freq'] == 0 and batch_idx + 1 == steps_per_epoch and local_rank == 0:
dev_sampler = paddle.io.BatchSampler(
dev_ds,
batch_size=training_conf['batch_size'],
batch_size=config['batch_size'],
shuffle=False,
drop_last=False)
dev_loader = paddle.io.DataLoader(
dev_ds,
batch_sampler=dev_sampler,
num_workers=training_conf['num_workers'],
num_workers=config['num_workers'],
return_list=True,
use_buffer_reader=True,
collate_fn=collate_features, )
@ -159,7 +168,7 @@ if __name__ == '__main__':
logger.eval(print_msg)
# Save model
save_dir = os.path.join(training_conf['checkpoint_dir'],
save_dir = os.path.join(config['checkpoint_dir'],
'epoch_{}'.format(epoch))
logger.info('Saving model checkpoint to {}'.format(save_dir))
paddle.save(model.state_dict(),

@ -131,12 +131,14 @@ if not hasattr(paddle.Tensor, 'long'):
"override long of paddle.Tensor if exists or register, remove this when fixed!"
)
paddle.Tensor.long = func_long
paddle.static.Variable.long = func_long
if not hasattr(paddle.Tensor, 'numel'):
logger.debug(
"override numel of paddle.Tensor if exists or register, remove this when fixed!"
)
paddle.Tensor.numel = paddle.numel
paddle.static.Variable.numel = paddle.numel
def new_full(x: paddle.Tensor,
@ -151,6 +153,7 @@ if not hasattr(paddle.Tensor, 'new_full'):
"override new_full of paddle.Tensor if exists or register, remove this when fixed!"
)
paddle.Tensor.new_full = new_full
paddle.static.Variable.new_full = new_full
def eq(xs: paddle.Tensor, ys: Union[paddle.Tensor, float]) -> paddle.Tensor:
@ -166,6 +169,7 @@ if not hasattr(paddle.Tensor, 'eq'):
"override eq of paddle.Tensor if exists or register, remove this when fixed!"
)
paddle.Tensor.eq = eq
paddle.static.Variable.eq = eq
if not hasattr(paddle, 'eq'):
logger.debug(
@ -182,6 +186,7 @@ if not hasattr(paddle.Tensor, 'contiguous'):
"override contiguous of paddle.Tensor if exists or register, remove this when fixed!"
)
paddle.Tensor.contiguous = contiguous
paddle.static.Variable.contiguous = contiguous
def size(xs: paddle.Tensor, *args: int) -> paddle.Tensor:
@ -200,6 +205,7 @@ logger.debug(
"(`to_static` do not process `size` property, maybe some `paddle` api dependent on it), remove this when fixed!"
)
paddle.Tensor.size = size
paddle.static.Variable.size = size
def view(xs: paddle.Tensor, *args: int) -> paddle.Tensor:
@ -209,6 +215,7 @@ def view(xs: paddle.Tensor, *args: int) -> paddle.Tensor:
if not hasattr(paddle.Tensor, 'view'):
logger.debug("register user view to paddle.Tensor, remove this when fixed!")
paddle.Tensor.view = view
paddle.static.Variable.view = view
def view_as(xs: paddle.Tensor, ys: paddle.Tensor) -> paddle.Tensor:
@ -219,6 +226,7 @@ if not hasattr(paddle.Tensor, 'view_as'):
logger.debug(
"register user view_as to paddle.Tensor, remove this when fixed!")
paddle.Tensor.view_as = view_as
paddle.static.Variable.view_as = view_as
def is_broadcastable(shp1, shp2):
@ -246,6 +254,7 @@ if not hasattr(paddle.Tensor, 'masked_fill'):
logger.debug(
"register user masked_fill to paddle.Tensor, remove this when fixed!")
paddle.Tensor.masked_fill = masked_fill
paddle.static.Variable.masked_fill = masked_fill
def masked_fill_(xs: paddle.Tensor,
@ -264,6 +273,7 @@ if not hasattr(paddle.Tensor, 'masked_fill_'):
logger.debug(
"register user masked_fill_ to paddle.Tensor, remove this when fixed!")
paddle.Tensor.masked_fill_ = masked_fill_
paddle.static.Variable.masked_fill_ = masked_fill_
def fill_(xs: paddle.Tensor, value: Union[float, int]) -> paddle.Tensor:
@ -276,6 +286,7 @@ if not hasattr(paddle.Tensor, 'fill_'):
logger.debug(
"register user fill_ to paddle.Tensor, remove this when fixed!")
paddle.Tensor.fill_ = fill_
paddle.static.Variable.fill_ = fill_
def repeat(xs: paddle.Tensor, *size: Any) -> paddle.Tensor:
@ -286,6 +297,7 @@ if not hasattr(paddle.Tensor, 'repeat'):
logger.debug(
"register user repeat to paddle.Tensor, remove this when fixed!")
paddle.Tensor.repeat = repeat
paddle.static.Variable.repeat = repeat
if not hasattr(paddle.Tensor, 'softmax'):
logger.debug(
@ -310,6 +322,7 @@ if not hasattr(paddle.Tensor, 'type_as'):
logger.debug(
"register user type_as to paddle.Tensor, remove this when fixed!")
setattr(paddle.Tensor, 'type_as', type_as)
setattr(paddle.static.Variable, 'type_as', type_as)
def to(x: paddle.Tensor, *args, **kwargs) -> paddle.Tensor:
@ -325,6 +338,7 @@ def to(x: paddle.Tensor, *args, **kwargs) -> paddle.Tensor:
if not hasattr(paddle.Tensor, 'to'):
logger.debug("register user to to paddle.Tensor, remove this when fixed!")
setattr(paddle.Tensor, 'to', to)
setattr(paddle.static.Variable, 'to', to)
def func_float(x: paddle.Tensor) -> paddle.Tensor:
@ -335,6 +349,7 @@ if not hasattr(paddle.Tensor, 'float'):
logger.debug(
"register user float to paddle.Tensor, remove this when fixed!")
setattr(paddle.Tensor, 'float', func_float)
setattr(paddle.static.Variable, 'float', func_float)
def func_int(x: paddle.Tensor) -> paddle.Tensor:
@ -344,6 +359,7 @@ def func_int(x: paddle.Tensor) -> paddle.Tensor:
if not hasattr(paddle.Tensor, 'int'):
logger.debug("register user int to paddle.Tensor, remove this when fixed!")
setattr(paddle.Tensor, 'int', func_int)
setattr(paddle.static.Variable, 'int', func_int)
def tolist(x: paddle.Tensor) -> List[Any]:
@ -354,6 +370,7 @@ if not hasattr(paddle.Tensor, 'tolist'):
logger.debug(
"register user tolist to paddle.Tensor, remove this when fixed!")
setattr(paddle.Tensor, 'tolist', tolist)
setattr(paddle.static.Variable, 'tolist', tolist)
########### hack paddle.nn #############
from paddle.nn import Layer
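The hunk above mirrors each patched `paddle.Tensor` helper onto `paddle.static.Variable` so the same call sites work under static-graph export; the registration pattern looks roughly like this (the helper name is hypothetical):

```python
# Register a helper on both dynamic and static tensors (illustrative pattern).
from typing import Any, List

import paddle

def tolist_compat(x: paddle.Tensor) -> List[Any]:
    # hypothetical helper mirroring the pattern used in the hunk above
    return x.numpy().tolist()

if not hasattr(paddle.Tensor, "tolist_compat"):
    setattr(paddle.Tensor, "tolist_compat", tolist_compat)
    setattr(paddle.static.Variable, "tolist_compat", tolist_compat)
```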

@ -16,7 +16,6 @@ import asyncio
import base64
import io
import json
import logging
import os
import random
import time
@ -30,7 +29,7 @@ from ..executor import BaseExecutor
from ..util import cli_client_register
from ..util import stats_wrapper
from paddlespeech.cli.log import logger
from paddlespeech.server.utils.audio_handler import ASRAudioHandler
from paddlespeech.server.utils.audio_handler import ASRWsAudioHandler
from paddlespeech.server.utils.audio_process import wav2pcm
from paddlespeech.server.utils.util import wav2base64
@ -288,6 +287,12 @@ class ASRClientExecutor(BaseExecutor):
default=None,
help='Audio file to be recognized',
required=True)
self.parser.add_argument(
'--protocol',
type=str,
default="http",
choices=["http", "websocket"],
help='server protocol')
self.parser.add_argument(
'--sample_rate', type=int, default=16000, help='audio sample rate')
self.parser.add_argument(
@ -295,6 +300,19 @@ class ASRClientExecutor(BaseExecutor):
self.parser.add_argument(
'--audio_format', type=str, default="wav", help='audio format')
self.parser.add_argument(
'--punc.server_ip',
type=str,
default=None,
dest="punc_server_ip",
help='Punctuation server ip')
self.parser.add_argument(
'--punc.port',
type=int,
default=8091,
dest="punc_server_port",
help='Punctuation server port')
def execute(self, argv: List[str]) -> bool:
args = self.parser.parse_args(argv)
input_ = args.input
@ -303,6 +321,7 @@ class ASRClientExecutor(BaseExecutor):
sample_rate = args.sample_rate
lang = args.lang
audio_format = args.audio_format
protocol = args.protocol
try:
time_start = time.time()
@ -312,13 +331,17 @@ class ASRClientExecutor(BaseExecutor):
port=port,
sample_rate=sample_rate,
lang=lang,
audio_format=audio_format)
audio_format=audio_format,
protocol=protocol,
punc_server_ip=args.punc_server_ip,
punc_server_port=args.punc_server_port)
time_end = time.time()
logger.info(res.json())
logger.info(f"ASR result: {res}")
logger.info("Response time %f s." % (time_end - time_start))
return True
except Exception as e:
logger.error("Failed to speech recognition.")
logger.error(e)
return False
@stats_wrapper
@ -328,21 +351,39 @@ class ASRClientExecutor(BaseExecutor):
port: int=8090,
sample_rate: int=16000,
lang: str="zh_cn",
audio_format: str="wav"):
"""
Python API to call an executor.
"""
audio_format: str="wav",
protocol: str="http",
punc_server_ip: str=None,
punc_server_port: int=None):
"""Python API to call an executor.
url = 'http://' + server_ip + ":" + str(port) + '/paddlespeech/asr'
audio = wav2base64(input)
data = {
"audio": audio,
"audio_format": audio_format,
"sample_rate": sample_rate,
"lang": lang,
}
Args:
input (str): The input audio file path
server_ip (str, optional): The ASR server ip. Defaults to "127.0.0.1".
port (int, optional): The ASR server port. Defaults to 8090.
sample_rate (int, optional): The audio sample rate. Defaults to 16000.
lang (str, optional): The audio language type. Defaults to "zh_cn".
audio_format (str, optional): The audio format information. Defaults to "wav".
protocol (str, optional): The ASR server. Defaults to "http".
Returns:
str: The ASR results
"""
# we use the asr server to recognize the audio text content
# and paddlespeech_client asr only support http protocol
protocol = "http"
if protocol.lower() == "http":
from paddlespeech.server.utils.audio_handler import ASRHttpHandler
logger.info("asr http client start")
handler = ASRHttpHandler(server_ip=server_ip, port=port)
res = handler.run(input, audio_format, sample_rate, lang)
res = res['result']['transcription']
logger.info("asr http client finished")
else:
logger.error(f"Sorry, we have not support protocol: {protocol},"
"please use http or websocket protocol")
sys.exit(-1)
res = requests.post(url=url, data=json.dumps(data))
return res
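With this change the HTTP path goes through `ASRHttpHandler`; calling the client executor from Python looks roughly like the sketch below (server address and wav path are assumptions):

```python
# Hypothetical usage of the ASR client executor over HTTP.
from paddlespeech.server.bin.paddlespeech_client import ASRClientExecutor

asr_client = ASRClientExecutor()
text = asr_client(
    input="./zh.wav",
    server_ip="127.0.0.1",
    port=8090,
    sample_rate=16000,
    lang="zh_cn",
    audio_format="wav",
    protocol="http")
print(text)
```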
@ -379,7 +420,6 @@ class ASROnlineClientExecutor(BaseExecutor):
sample_rate = args.sample_rate
lang = args.lang
audio_format = args.audio_format
try:
time_start = time.time()
res = self(
@ -409,14 +449,13 @@ class ASROnlineClientExecutor(BaseExecutor):
"""
Python API to call an executor.
"""
logging.basicConfig(level=logging.INFO)
logging.info("asr websocket client start")
handler = ASRAudioHandler(server_ip, port)
logger.info("asr websocket client start")
handler = ASRWsAudioHandler(server_ip, port)
loop = asyncio.get_event_loop()
res = loop.run_until_complete(handler.run(input))
logging.info("asr websocket client finished")
logger.info("asr websocket client finished")
return res['asr_results']
return res['result']
@cli_client_register(
@ -509,7 +548,6 @@ class TextClientExecutor(BaseExecutor):
input_ = args.input
server_ip = args.server_ip
port = args.port
output = args.output
try:
time_start = time.time()

@ -1,4 +1,4 @@
# This is the parameter configuration file for PaddleSpeech Serving.
# This is the parameter configuration file for streaming tts server.
#################################################################################
# SERVER SETTING #
@ -7,8 +7,8 @@ host: 127.0.0.1
port: 8092
# The task format in the engine_list is: <speech task>_<engine type>
# task choices = ['tts_online', 'tts_online-onnx']
# protocol = ['websocket', 'http'] (only one can be selected).
# engine_list choices = ['tts_online', 'tts_online-onnx'], the inference speed of tts_online-onnx is faster than tts_online.
# protocol choices = ['websocket', 'http']
protocol: 'http'
engine_list: ['tts_online-onnx']
@ -20,8 +20,9 @@ engine_list: ['tts_online-onnx']
################################### TTS #########################################
################### speech task: tts; engine_type: online #######################
tts_online:
# am (acoustic model) choices=['fastspeech2_csmsc', 'fastspeech2_cnndecoder_csmsc']
am: 'fastspeech2_cnndecoder_csmsc'
# am (acoustic model) choices=['fastspeech2_csmsc', 'fastspeech2_cnndecoder_csmsc']
# fastspeech2_cnndecoder_csmsc supports streaming am inference.
am: 'fastspeech2_csmsc'
am_config:
am_ckpt:
am_stat:
@ -31,6 +32,7 @@ tts_online:
spk_id: 0
# voc (vocoder) choices=['mb_melgan_csmsc, hifigan_csmsc']
# Both mb_melgan_csmsc and hifigan_csmsc support streaming voc inference
voc: 'mb_melgan_csmsc'
voc_config:
voc_ckpt:
@ -39,8 +41,13 @@ tts_online:
# others
lang: 'zh'
device: 'cpu' # set 'gpu:id' or 'cpu'
# am_block and am_pad are only used by the fastspeech2_cnndecoder_csmsc model for streaming am inference;
# when am_pad is set to 12, the streaming synthetic audio is the same as the non-streaming synthetic audio
am_block: 42
am_pad: 12
# voc_block and voc_pad are used by the voc model for streaming voc inference;
# when the voc model is mb_melgan_csmsc, setting voc_pad to 14 makes the streaming synthetic audio the same as the non-streaming audio; pad can be reduced to 7 and the streaming audio still sounds normal
# when the voc model is hifigan_csmsc, setting voc_pad to 20 makes the streaming synthetic audio the same as the non-streaming audio; with voc_pad set to 14 the streaming audio sounds normal
voc_block: 14
voc_pad: 14
@ -53,7 +60,8 @@ tts_online:
################################### TTS #########################################
################### speech task: tts; engine_type: online-onnx #######################
tts_online-onnx:
# am (acoustic model) choices=['fastspeech2_csmsc_onnx', 'fastspeech2_cnndecoder_csmsc_onnx']
# am (acoustic model) choices=['fastspeech2_csmsc_onnx', 'fastspeech2_cnndecoder_csmsc_onnx']
# fastspeech2_cnndecoder_csmsc_onnx supports streaming am inference.
am: 'fastspeech2_cnndecoder_csmsc_onnx'
# am_ckpt is a list, if am is fastspeech2_cnndecoder_csmsc_onnx, am_ckpt = [encoder model, decoder model, postnet model];
# if am is fastspeech2_csmsc_onnx, am_ckpt = [ckpt model];
@ -70,6 +78,7 @@ tts_online-onnx:
cpu_threads: 4
# voc (vocoder) choices=['mb_melgan_csmsc_onnx, hifigan_csmsc_onnx']
# Both mb_melgan_csmsc_onnx and hifigan_csmsc_onnx support streaming voc inference
voc: 'hifigan_csmsc_onnx'
voc_ckpt:
voc_sample_rate: 24000
@ -80,9 +89,15 @@ tts_online-onnx:
# others
lang: 'zh'
# am_block and am_pad are only used by the fastspeech2_cnndecoder_csmsc_onnx model for streaming am inference;
# when am_pad is set to 12, the streaming synthetic audio is the same as the non-streaming synthetic audio
am_block: 42
am_pad: 12
# voc_block and voc_pad are used by the voc model for streaming voc inference;
# when the voc model is mb_melgan_csmsc_onnx, setting voc_pad to 14 makes the streaming synthetic audio the same as the non-streaming audio; pad can be reduced to 7 and the streaming audio still sounds normal
# when the voc model is hifigan_csmsc_onnx, setting voc_pad to 20 makes the streaming synthetic audio the same as the non-streaming audio; with voc_pad set to 14 the streaming audio sounds normal
voc_block: 14
voc_pad: 14
# voc_upsample should be the same as n_shift in the voc config.
voc_upsample: 300
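Since `voc_upsample` matches the vocoder's `n_shift`, each streaming vocoder block of `voc_block` frames yields `voc_block * voc_upsample` samples; a quick worked example with the values above:

```python
# Worked example: audio produced per streaming vocoder chunk (values from this config).
voc_block = 14           # frames per streaming chunk
voc_upsample = 300       # samples per frame, same as the vocoder's n_shift
voc_sample_rate = 24000  # Hz

samples_per_chunk = voc_block * voc_upsample             # 4200 samples
seconds_per_chunk = samples_per_chunk / voc_sample_rate  # 0.175 s of audio per chunk
print(samples_per_chunk, seconds_per_chunk)
```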

@ -43,9 +43,9 @@ __all__ = ['ASREngine']
pretrained_models = {
"deepspeech2online_aishell-zh-16k": {
'url':
'https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_online_aishell_ckpt_0.2.0.model.tar.gz',
'https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_online_aishell_fbank161_ckpt_0.2.1.model.tar.gz',
'md5':
'23e16c69730a1cb5d735c98c83c21e16',
'98b87b171b7240b7cae6e07d8d0bc9be',
'cfg_path':
'model.yaml',
'ckpt_path':

@ -26,7 +26,7 @@ import pyaudio
import websockets
class ASRAudioHandler(threading.Thread):
class ASRWsAudioHandler(threading.Thread):
def __init__(self, url="127.0.0.1", port=8091):
threading.Thread.__init__(self)
self.url = url
@ -148,7 +148,7 @@ if __name__ == "__main__":
logging.basicConfig(level=logging.INFO)
logging.info("asr websocket client start")
handler = ASRAudioHandler("127.0.0.1", 8091)
handler = ASRWsAudioHandler("127.0.0.1", 8091)
loop = asyncio.get_event_loop()
main_task = asyncio.ensure_future(handler.run())
for signal in [SIGINT, SIGTERM]:

@ -24,11 +24,11 @@ from typing import Any
from typing import Dict
import paddle
import paddleaudio
import requests
import yaml
from paddle.framework import load
import paddleaudio
from . import download
from .entry import client_commands
from .entry import server_commands

@ -24,19 +24,76 @@ import websockets
from paddlespeech.cli.log import logger
from paddlespeech.server.utils.audio_process import save_audio
from paddlespeech.server.utils.util import wav2base64
class ASRAudioHandler:
def __init__(self, url="127.0.0.1", port=8090):
class TextHttpHandler:
def __init__(self, server_ip="127.0.0.1", port=8090):
"""Text http client request
Args:
server_ip (str, optional): the text server ip. Defaults to "127.0.0.1".
port (int, optional): the text server port. Defaults to 8090.
"""
super().__init__()
self.server_ip = server_ip
self.port = port
if server_ip is None or port is None:
self.url = None
else:
self.url = 'http://' + self.server_ip + ":" + str(
self.port) + '/paddlespeech/text'
def run(self, text):
"""Call the text server to process the specific text
Args:
text (str): the text to be processed
Returns:
str: punctuation text
"""
if self.server_ip is None or self.port is None:
return text
request = {
"text": text,
}
try:
res = requests.post(url=self.url, data=json.dumps(request))
response_dict = res.json()
punc_text = response_dict["result"]["punc_text"]
except Exception as e:
logger.error(f"Call punctuation {self.url} occurs error")
logger.error(e)
punc_text = text
return punc_text
class ASRWsAudioHandler:
def __init__(self,
url=None,
port=None,
endpoint="/paddlespeech/asr/streaming",
punc_server_ip=None,
punc_server_port=None):
"""PaddleSpeech Online ASR Server Client audio handler
Online asr server uses the websocket protocol
Args:
url (str, optional): the server ip. Defaults to "127.0.0.1".
port (int, optional): the server port. Defaults to 8090.
url (str, optional): the server ip. Defaults to None.
port (int, optional): the server port. Defaults to None.
endpoint(str, optional): the asr endpoint, to be compatible with the python server and the c++ server.
punc_server_ip(str, optional): the punctuation server ip. Defaults to None.
punc_server_port(int, optional): the punctuation port. Defaults to None
"""
self.url = url
self.port = port
self.url = "ws://" + self.url + ":" + str(self.port) + "/ws/asr"
if url is None or port is None or endpoint is None:
self.url = None
else:
self.url = "ws://" + self.url + ":" + str(self.port) + endpoint
self.punc_server = TextHttpHandler(punc_server_ip, punc_server_port)
logger.info(f"endpoint: {self.url}")
def read_wave(self, wavfile_path: str):
"""read the audio file from specific wavfile path
@ -80,6 +137,10 @@ class ASRAudioHandler:
"""
logging.info("send a message to the server")
if self.url is None:
logger.error("No asr server, please input valid ip and port")
return ""
# 1. send websocket handshake protocal
async with websockets.connect(self.url) as ws:
# 2. server has already received handshake protocal
@ -88,28 +149,31 @@ class ASRAudioHandler:
{
"name": "test.wav",
"signal": "start",
"nbest": 5
"nbest": 1
},
sort_keys=True,
indent=4,
separators=(',', ': '))
await ws.send(audio_info)
msg = await ws.recv()
logger.info("receive msg={}".format(msg))
logger.info("client receive msg={}".format(msg))
# 3. send chunk audio data to engine
for chunk_data in self.read_wave(wavfile_path):
await ws.send(chunk_data.tobytes())
msg = await ws.recv()
msg = json.loads(msg)
logger.info("receive msg={}".format(msg))
if self.punc_server and len(msg["result"]) > 0:
msg["result"] = self.punc_server.run(msg["result"])
logger.info("client receive msg={}".format(msg))
# 4. we must send finished signal to the server
audio_info = json.dumps(
{
"name": "test.wav",
"signal": "end",
"nbest": 5
"nbest": 1
},
sort_keys=True,
indent=4,
@ -119,11 +183,63 @@ class ASRAudioHandler:
# 5. decode the bytes to str
msg = json.loads(msg)
logger.info("final receive msg={}".format(msg))
if self.punc_server:
msg["result"] = self.punc_server.run(msg["result"])
logger.info("client final receive msg={}".format(msg))
result = msg
return result
class ASRHttpHandler:
def __init__(self, server_ip=None, port=None):
"""The ASR client http request
Args:
server_ip (str, optional): the http asr server ip. Defaults to "127.0.0.1".
port (int, optional): the http asr server port. Defaults to 8090.
"""
super().__init__()
self.server_ip = server_ip
self.port = port
if server_ip is None or port is None:
self.url = None
else:
self.url = 'http://' + self.server_ip + ":" + str(
self.port) + '/paddlespeech/asr'
def run(self, input, audio_format, sample_rate, lang):
"""Call the http asr to process the audio
Args:
input (str): the audio file path
audio_format (str): the audio format
sample_rate (str): the audio sample rate
lang (str): the audio language type
Returns:
str: the final asr result
"""
if self.url is None:
logger.error(
"No punctuation server, please input valid ip and port")
return ""
audio = wav2base64(input)
data = {
"audio": audio,
"audio_format": audio_format,
"sample_rate": sample_rate,
"lang": lang,
}
res = requests.post(url=self.url, data=json.dumps(data))
return res.json()
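Mirroring how the client executor drives it, the streaming websocket handler can also be used standalone; a minimal sketch (server address, endpoint, and wav path are assumptions):

```python
# Hypothetical standalone use of the streaming ASR websocket handler.
import asyncio

from paddlespeech.server.utils.audio_handler import ASRWsAudioHandler

handler = ASRWsAudioHandler(
    url="127.0.0.1",
    port=8090,
    endpoint="/paddlespeech/asr/streaming")
loop = asyncio.get_event_loop()
res = loop.run_until_complete(handler.run("./zh.wav"))
print(res["result"])
```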
class TTSWsHandler:
def __init__(self, server="127.0.0.1", port=8092, play: bool=False):
"""PaddleSpeech Online TTS Server Client audio handler

@ -24,7 +24,7 @@ from paddlespeech.server.engine.engine_pool import get_engine_pool
router = APIRouter()
@router.websocket('/ws/asr')
@router.websocket('/paddlespeech/asr/streaming')
async def websocket_endpoint(websocket: WebSocket):
"""PaddleSpeech Online ASR Server api
@ -83,7 +83,7 @@ async def websocket_endpoint(websocket: WebSocket):
resp = {
"status": "ok",
"signal": "finished",
'asr_results': asr_results
'result': asr_results
}
await websocket.send_json(resp)
break
@ -102,7 +102,7 @@ async def websocket_endpoint(websocket: WebSocket):
# return the current period result
# if the engine create the vad instance, this connection will have many period results
resp = {'asr_results': asr_results}
resp = {'result': asr_results}
await websocket.send_json(resp)
except WebSocketDisconnect:
pass

@ -73,8 +73,6 @@ server = [
"uvicorn",
"pattern_singleton",
"websockets",
"websocket",
"websocket-client",
]
requirements = {

@ -24,8 +24,6 @@ docker run --privileged --net=host --ipc=host -it --rm -v $PWD:/workspace --nam
* You can find more `Paddle` docker images [here](https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/install/docker/linux-docker.html).
* If you only want to work on CPU, please download the corresponding [image](https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/install/docker/linux-docker.html) and use `docker` instead of `nvidia-docker`.
2. Build `speechx` and `examples`.

@ -79,6 +79,7 @@ if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
--feature_wspecifier=ark,scp:$data/split${nj}/JOB/feat.ark,$data/split${nj}/JOB/feat.scp \
--cmvn_file=$cmvn \
--streaming_chunk=0.36
echo "feature make have finished!!!"
fi
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
@ -94,6 +95,8 @@ if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
cat $data/split${nj}/*/result > $exp/${label_file}
utils/compute-wer.py --char=1 --v=1 $text $exp/${label_file} > $exp/${wer}
echo "ctc-prefix-beam-search-decoder-ol without lm has finished!!!"
echo "please checkout in ${exp}/${wer}"
fi
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
@ -110,6 +113,8 @@ if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
cat $data/split${nj}/*/result_lm > $exp/${label_file}_lm
utils/compute-wer.py --char=1 --v=1 $text $exp/${label_file}_lm > $exp/${wer}.lm
echo "ctc-prefix-beam-search-decoder-ol with lm test has finished!!!"
echo "please checkout in ${exp}/${wer}.lm"
fi
wfst=$data/wfst/
@ -139,6 +144,8 @@ if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
cat $data/split${nj}/*/result_tlg > $exp/${label_file}_tlg
utils/compute-wer.py --char=1 --v=1 $text $exp/${label_file}_tlg > $exp/${wer}.tlg
echo "wfst-decoder-ol have finished!!!"
echo "please checkout in ${exp}/${wer}.tlg"
fi
if [ ${stage} -le 5 ] && [ ${stop_stage} -ge 5 ]; then
@ -159,4 +166,6 @@ if [ ${stage} -le 5 ] && [ ${stop_stage} -ge 5 ]; then
cat $data/split${nj}/*/result_recognizer > $exp/${label_file}_recognizer
utils/compute-wer.py --char=1 --v=1 $text $exp/${label_file}_recognizer > $exp/${wer}.recognizer
echo "recognizer test have finished!!!"
echo "please checkout in ${exp}/${wer}.recognizer"
fi

@ -48,6 +48,12 @@ void TLGDecoder::Reset() {
}
std::string TLGDecoder::GetFinalBestPath() {
if (frame_decoded_size_ == 0) {
// Assertion failed: (this->NumFramesDecoded() > 0 && "You cannot call
// BestPathEnd if no frames were decoded.")
return std::string("");
}
decoder_->FinalizeDecoding();
kaldi::Lattice lat;
kaldi::LatticeWeight weight;

@ -27,26 +27,27 @@ ConnectionHandler::ConnectionHandler(
: ws_(std::move(socket)), recognizer_resource_(recognizer_resource) {}
void ConnectionHandler::OnSpeechStart() {
LOG(INFO) << "Recieved speech start signal, start reading speech";
got_start_tag_ = true;
json::value rv = {{"status", "ok"}, {"type", "server_ready"}};
ws_.text(true);
ws_.write(asio::buffer(json::serialize(rv)));
recognizer_ = std::make_shared<Recognizer>(recognizer_resource_);
// Start decoder thread
decode_thread_ = std::make_shared<std::thread>(
&ConnectionHandler::DecodeThreadFunc, this);
got_start_tag_ = true;
LOG(INFO) << "Server: Recieved speech start signal, start reading speech";
json::value rv = {{"status", "ok"}, {"type", "server_ready"}};
ws_.text(true);
ws_.write(asio::buffer(json::serialize(rv)));
}
void ConnectionHandler::OnSpeechEnd() {
LOG(INFO) << "Recieved speech end signal";
CHECK(recognizer_ != nullptr);
recognizer_->SetFinished();
LOG(INFO) << "Server: Recieved speech end signal";
if (recognizer_ != nullptr) {
recognizer_->SetFinished();
}
got_end_tag_ = true;
}
void ConnectionHandler::OnFinalResult(const std::string& result) {
LOG(INFO) << "Final result: " << result;
LOG(INFO) << "Server: Final result: " << result;
json::value rv = {
{"status", "ok"}, {"type", "final_result"}, {"result", result}};
ws_.text(true);
@ -69,10 +70,16 @@ void ConnectionHandler::OnSpeechData(const beast::flat_buffer& buffer) {
pcm_data(i) = static_cast<float>(*pdata);
pdata++;
}
VLOG(2) << "Recieved " << num_samples << " samples";
LOG(INFO) << "Recieved " << num_samples << " samples";
VLOG(2) << "Server: Recieved " << num_samples << " samples";
LOG(INFO) << "Server: Recieved " << num_samples << " samples";
CHECK(recognizer_ != nullptr);
recognizer_->Accept(pcm_data);
// TODO: return partial result
json::value rv = {
{"status", "ok"}, {"type", "partial_result"}, {"result", "TODO"}};
ws_.text(true);
ws_.write(asio::buffer(json::serialize(rv)));
}
void ConnectionHandler::DecodeThreadFunc() {
@ -80,9 +87,9 @@ void ConnectionHandler::DecodeThreadFunc() {
while (true) {
recognizer_->Decode();
if (recognizer_->IsFinished()) {
LOG(INFO) << "enter finish";
LOG(INFO) << "Server: enter finish";
recognizer_->Decode();
LOG(INFO) << "finish";
LOG(INFO) << "Server: finish";
std::string result = recognizer_->GetFinalResult();
OnFinalResult(result);
OnFinish();
@ -135,7 +142,7 @@ void ConnectionHandler::operator()() {
ws_.read(buffer);
if (ws_.got_text()) {
std::string message = beast::buffers_to_string(buffer.data());
LOG(INFO) << message;
LOG(INFO) << "Server: Text: " << message;
OnText(message);
if (got_end_tag_) {
break;
@ -152,7 +159,7 @@ void ConnectionHandler::operator()() {
}
}
LOG(INFO) << "Read all pcm data, wait for decoding thread";
LOG(INFO) << "Server: finished to wait for decoding thread join.";
if (decode_thread_ != nullptr) {
decode_thread_->join();
}

@ -14,22 +14,24 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
paddlespeech asr --input ./zh.wav
paddlespeech asr --model conformer_aishell --input ./zh.wav
paddlespeech asr --model conformer_online_aishell --input ./zh.wav
paddlespeech asr --model conformer_online_multicn --input ./zh.wav
paddlespeech asr --model transformer_librispeech --lang en --input ./en.wav
paddlespeech asr --model deepspeech2offline_aishell --input ./zh.wav
paddlespeech asr --model deepspeech2online_wenetspeech --input ./zh.wav
paddlespeech asr --model deepspeech2online_aishell --input ./zh.wav
paddlespeech asr --model deepspeech2offline_librispeech --lang en --input ./en.wav
# long audio restriction
{
wget -c wget https://paddlespeech.bj.bcebos.com/datasets/single_wav/zh/test_long_audio_01.wav
wget -c https://paddlespeech.bj.bcebos.com/datasets/single_wav/zh/test_long_audio_01.wav
paddlespeech asr --input test_long_audio_01.wav
if [ $? -ne 255 ]; then
echo "Time restriction not passed"
echo -e "\e[1;31mTime restriction not passed\e[0m"
exit 1
fi
} &&
{
echo "Time restriction passed"
echo -e "\033[32mTime restriction passed\033[0m"
}
# Text To Speech
@ -77,4 +79,4 @@ paddlespeech stats --task vector
paddlespeech stats --task st
echo "Test success !!!"
echo -e "\033[32mTest success !!!\033[0m"
