diff --git a/README.md b/README.md index 379550ce..9791b895 100644 --- a/README.md +++ b/README.md @@ -1,19 +1,10 @@ ([简体中文](./README_cn.md)|English) + +

-
- -

- Quick Start - | Quick Start Server - | Documents - | Models List -

- ------------------------------------------------------------------------------------- -

@@ -28,6 +19,20 @@

+
+

+ | Quick Start + | Quick Start Server + | Quick Start Streaming Server + | +
+ | Documents + | Models List + | +

+
+ + **PaddleSpeech** is an open-source toolkit on [PaddlePaddle](https://github.com/PaddlePaddle/Paddle) platform for a variety of critical tasks in speech and audio, with the state-of-art and influential models. @@ -142,26 +147,6 @@ For more synthesized audios, please refer to [PaddleSpeech Text-to-Speech sample -### ⭐ Examples -- **[PaddleBoBo](https://github.com/JiehangXie/PaddleBoBo): Use PaddleSpeech TTS to generate virtual human voice.** - -
- -- [PaddleSpeech Demo Video](https://paddlespeech.readthedocs.io/en/latest/demo_video.html) - -- **[VTuberTalk](https://github.com/jerryuhoo/VTuberTalk): Use PaddleSpeech TTS and ASR to clone voice from videos.** - -
- -
- -### 🔥 Hot Activities - -- 2021.12.21~12.24 - - 4 Days Live Courses: Depth interpretation of PaddleSpeech! - - **Courses videos and related materials: https://aistudio.baidu.com/aistudio/education/group/info/25130** ### Features @@ -174,11 +159,22 @@ Via the easy-to-use, efficient, flexible and scalable implementation, our vision - 🔬 *Integration of mainstream models and datasets*: the toolkit implements modules that participate in the whole pipeline of the speech tasks, and uses mainstream datasets like LibriSpeech, LJSpeech, AIShell, CSMSC, etc. See also [model list](#model-list) for more details. - 🧩 *Cascaded models application*: as an extension of the typical traditional audio tasks, we combine the workflows of the aforementioned tasks with other fields like Natural language processing (NLP) and Computer Vision (CV). -### Recent Update +### 🔥 Hot Activities + +- 2021.12.21~12.24 + + 4 Days Live Courses: Depth interpretation of PaddleSpeech! + + **Courses videos and related materials: https://aistudio.baidu.com/aistudio/education/group/info/25130** + + +### Recent Update + +- 👏🏻 2022.04.28: PaddleSpeech Streaming Server is available for Automatic Speech Recognition and Text-to-Speech. - 👏🏻 2022.03.28: PaddleSpeech Server is available for Audio Classification, Automatic Speech Recognition and Text-to-Speech. - 👏🏻 2022.03.28: PaddleSpeech CLI is available for Speaker Verification. - 🤗 2021.12.14: Our PaddleSpeech [ASR](https://huggingface.co/spaces/KPatrick/PaddleSpeechASR) and [TTS](https://huggingface.co/spaces/KPatrick/PaddleSpeechTTS) Demos on Hugging Face Spaces are available! @@ -196,6 +192,7 @@ Via the easy-to-use, efficient, flexible and scalable implementation, our vision We strongly recommend our users to install PaddleSpeech in **Linux** with *python>=3.7*. Up to now, **Linux** supports CLI for the all our tasks, **Mac OSX** and **Windows** only supports PaddleSpeech CLI for Audio Classification, Speech-to-Text and Text-to-Speech. To install `PaddleSpeech`, please see [installation](./docs/source/install.md). + ## Quick Start @@ -238,7 +235,7 @@ paddlespeech tts --input "你好,欢迎使用飞桨深度学习框架!" --ou **Batch Process** ``` echo -e "1 欢迎光临。\n2 谢谢惠顾。" | paddlespeech tts -``` +``` **Shell Pipeline** - ASR + Punctuation Restoration @@ -257,16 +254,19 @@ If you want to try more functions like training and tuning, please have a look a Developers can have a try of our speech server with [PaddleSpeech Server Command Line](./paddlespeech/server/README.md). **Start server** + ```shell paddlespeech_server start --config_file ./paddlespeech/server/conf/application.yaml ``` **Access Speech Recognition Services** + ```shell paddlespeech_client asr --server_ip 127.0.0.1 --port 8090 --input input_16k.wav ``` **Access Text to Speech Services** + ```shell paddlespeech_client tts --server_ip 127.0.0.1 --port 8090 --input "您好,欢迎使用百度飞桨语音合成服务。" --output output.wav ``` @@ -280,6 +280,37 @@ paddlespeech_client cls --server_ip 127.0.0.1 --port 8090 --input input.wav For more information about server command lines, please see: [speech server demos](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/demos/speech_server) + +## Quick Start Streaming Server + +Developers can have a try of [streaming asr](./demos/streaming_asr_server/README.md) and [streaming tts](./demos/streaming_tts_server/README.md) server. 
+ +**Start Streaming Speech Recognition Server** + +``` +paddlespeech_server start --config_file ./demos/streaming_asr_server/conf/application.yaml +``` + +**Access Streaming Speech Recognition Services** + +``` +paddlespeech_client asr_online --server_ip 127.0.0.1 --port 8090 --input input_16k.wav +``` + +**Start Streaming Text to Speech Server** + +``` +paddlespeech_server start --config_file ./demos/streaming_tts_server/conf/tts_online_application.yaml +``` + +**Access Streaming Text to Speech Services** + +``` +paddlespeech_client tts_online --server_ip 127.0.0.1 --port 8092 --protocol http --input "您好,欢迎使用百度飞桨语音合成服务。" --output output.wav +``` + +For more information please see: [streaming asr](./demos/streaming_asr_server/README.md) and [streaming tts](./demos/streaming_tts_server/README.md) + ## Model List @@ -589,6 +620,21 @@ Normally, [Speech SoTA](https://paperswithcode.com/area/speech), [Audio SoTA](ht The Text-to-Speech module is originally called [Parakeet](https://github.com/PaddlePaddle/Parakeet), and now merged with this repository. If you are interested in academic research about this task, please see [TTS research overview](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/docs/source/tts#overview). Also, [this document](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/tts/models_introduction.md) is a good guideline for the pipeline components. + +## ⭐ Examples +- **[PaddleBoBo](https://github.com/JiehangXie/PaddleBoBo): Use PaddleSpeech TTS to generate virtual human voice.** + +
+ +- [PaddleSpeech Demo Video](https://paddlespeech.readthedocs.io/en/latest/demo_video.html) + +- **[VTuberTalk](https://github.com/jerryuhoo/VTuberTalk): Use PaddleSpeech TTS and ASR to clone voice from videos.** + +
+ +
+ + ## Citation To cite PaddleSpeech for research, please use the following format. @@ -655,7 +701,6 @@ You are warmly welcome to submit questions in [discussions](https://github.com/P ## Acknowledgement - - Many thanks to [yeyupiaoling](https://github.com/yeyupiaoling)/[PPASR](https://github.com/yeyupiaoling/PPASR)/[PaddlePaddle-DeepSpeech](https://github.com/yeyupiaoling/PaddlePaddle-DeepSpeech)/[VoiceprintRecognition-PaddlePaddle](https://github.com/yeyupiaoling/VoiceprintRecognition-PaddlePaddle)/[AudioClassification-PaddlePaddle](https://github.com/yeyupiaoling/AudioClassification-PaddlePaddle) for years of attention, constructive advice and great help. - Many thanks to [mymagicpower](https://github.com/mymagicpower) for the Java implementation of ASR upon [short](https://github.com/mymagicpower/AIAS/tree/main/3_audio_sdks/asr_sdk) and [long](https://github.com/mymagicpower/AIAS/tree/main/3_audio_sdks/asr_long_audio_sdk) audio files. - Many thanks to [JiehangXie](https://github.com/JiehangXie)/[PaddleBoBo](https://github.com/JiehangXie/PaddleBoBo) for developing Virtual Uploader(VUP)/Virtual YouTuber(VTuber) with PaddleSpeech TTS function. diff --git a/README_cn.md b/README_cn.md index 228d5d78..497863db 100644 --- a/README_cn.md +++ b/README_cn.md @@ -2,26 +2,45 @@

-
-

- 快速开始 - | 快速使用服务 - | 教程文档 - | 模型列表 -

-------------------------------------------------------------------------------------

- + + + +

+
+

+ Quick Start + | Quick Start Server + | Quick Start Streaming Server +
+ Documents + | Models List +

+
+ + +------------------------------------------------------------------------------------ + +
+

+ 快速开始 + | 快速使用服务 + | 快速使用流式服务 + | 教程文档 + | 模型列表 +

+ + + + **PaddleSpeech** 是基于飞桨 [PaddlePaddle](https://github.com/PaddlePaddle/Paddle) 的语音方向的开源模型库,用于语音和音频中的各种关键任务的开发,包含大量基于深度学习前沿和有影响力的模型,一些典型的应用示例如下: ##### 语音识别 @@ -57,7 +78,6 @@ from https://github.com/18F/open-source-guide/blob/18f-pages/pages/making-readme 我认为跑步最重要的就是给我带来了身体健康。 - @@ -143,19 +163,6 @@ from https://github.com/18F/open-source-guide/blob/18f-pages/pages/making-readme -### ⭐ 应用案例 -- **[PaddleBoBo](https://github.com/JiehangXie/PaddleBoBo): 使用 PaddleSpeech 的语音合成模块生成虚拟人的声音。** - -
- -- [PaddleSpeech 示例视频](https://paddlespeech.readthedocs.io/en/latest/demo_video.html) - - -- **[VTuberTalk](https://github.com/jerryuhoo/VTuberTalk): 使用 PaddleSpeech 的语音合成和语音识别从视频中克隆人声。** - -
- -
### 🔥 热门活动 @@ -164,27 +171,32 @@ from https://github.com/18F/open-source-guide/blob/18f-pages/pages/making-readme 4 日直播课: 深度解读 PaddleSpeech 语音技术! **直播回放与课件资料: https://aistudio.baidu.com/aistudio/education/group/info/25130** -### 特性 -本项目采用了易用、高效、灵活以及可扩展的实现,旨在为工业应用、学术研究提供更好的支持,实现的功能包含训练、推断以及测试模块,以及部署过程,主要包括 -- 📦 **易用性**: 安装门槛低,可使用 [CLI](#quick-start) 快速开始。 -- 🏆 **对标 SoTA**: 提供了高速、轻量级模型,且借鉴了最前沿的技术。 -- 💯 **基于规则的中文前端**: 我们的前端包含文本正则化和字音转换(G2P)。此外,我们使用自定义语言规则来适应中文语境。 -- **多种工业界以及学术界主流功能支持**: - - 🛎️ 典型音频任务: 本工具包提供了音频任务如音频分类、语音翻译、自动语音识别、文本转语音、语音合成等任务的实现。 - - 🔬 主流模型及数据集: 本工具包实现了参与整条语音任务流水线的各个模块,并且采用了主流数据集如 LibriSpeech、LJSpeech、AIShell、CSMSC,详情请见 [模型列表](#model-list)。 - - 🧩 级联模型应用: 作为传统语音任务的扩展,我们结合了自然语言处理、计算机视觉等任务,实现更接近实际需求的产业级应用。 ### 近期更新 +- 👏🏻 2022.04.28: PaddleSpeech Streaming Server 上线! 覆盖了语音识别和语音合成。 - 👏🏻 2022.03.28: PaddleSpeech Server 上线! 覆盖了声音分类、语音识别、以及语音合成。 - 👏🏻 2022.03.28: PaddleSpeech CLI 上线声纹验证。 - 🤗 2021.12.14: Our PaddleSpeech [ASR](https://huggingface.co/spaces/KPatrick/PaddleSpeechASR) and [TTS](https://huggingface.co/spaces/KPatrick/PaddleSpeechTTS) Demos on Hugging Face Spaces are available! - 👏🏻 2021.12.10: PaddleSpeech CLI 上线!覆盖了声音分类、语音识别、语音翻译(英译中)以及语音合成。 + +### 特性 + +本项目采用了易用、高效、灵活以及可扩展的实现,旨在为工业应用、学术研究提供更好的支持,实现的功能包含训练、推断以及测试模块,以及部署过程,主要包括 +- 📦 **易用性**: 安装门槛低,可使用 [CLI](#quick-start) 快速开始。 +- 🏆 **对标 SoTA**: 提供了高速、轻量级模型,且借鉴了最前沿的技术。 +- 💯 **基于规则的中文前端**: 我们的前端包含文本正则化和字音转换(G2P)。此外,我们使用自定义语言规则来适应中文语境。 +- **多种工业界以及学术界主流功能支持**: + - 🛎️ 典型音频任务: 本工具包提供了音频任务如音频分类、语音翻译、自动语音识别、文本转语音、语音合成等任务的实现。 + - 🔬 主流模型及数据集: 本工具包实现了参与整条语音任务流水线的各个模块,并且采用了主流数据集如 LibriSpeech、LJSpeech、AIShell、CSMSC,详情请见 [模型列表](#model-list)。 + - 🧩 级联模型应用: 作为传统语音任务的扩展,我们结合了自然语言处理、计算机视觉等任务,实现更接近实际需求的产业级应用。 + + ### 技术交流群 微信扫描二维码(好友申请通过后回复【语音】)加入官方交流群,获得更高效的问题答疑,与各行各业开发者充分交流,期待您的加入。 @@ -192,11 +204,13 @@ from https://github.com/18F/open-source-guide/blob/18f-pages/pages/making-readme + ## 安装 我们强烈建议用户在 **Linux** 环境下,*3.7* 以上版本的 *python* 上安装 PaddleSpeech。 目前为止,**Linux** 支持声音分类、语音识别、语音合成和语音翻译四种功能,**Mac OSX、 Windows** 下暂不支持语音翻译功能。 想了解具体安装细节,可以参考[安装文档](./docs/source/install_cn.md)。 + ## 快速开始 安装完成后,开发者可以通过命令行快速开始,改变 `--input` 可以尝试用自己的音频或文本测试。 @@ -232,7 +246,7 @@ paddlespeech tts --input "你好,欢迎使用百度飞桨深度学习框架! **批处理** ``` echo -e "1 欢迎光临。\n2 谢谢惠顾。" | paddlespeech tts -``` +``` **Shell管道** ASR + Punc: @@ -269,6 +283,38 @@ paddlespeech_client cls --server_ip 127.0.0.1 --port 8090 --input input.wav 更多服务相关的命令行使用信息,请参考 [demos](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/demos/speech_server) + +## 快速使用流式服务 + +开发者可以尝试[流式ASR](./demos/streaming_asr_server/README.md)和 [流式TTS](./demos/streaming_tts_server/README.md)服务. 
+ +**启动流式ASR服务** + +``` +paddlespeech_server start --config_file ./demos/streaming_asr_server/conf/application.yaml +``` + +**访问流式ASR服务** + +``` +paddlespeech_client asr_online --server_ip 127.0.0.1 --port 8090 --input input_16k.wav +``` + +**启动流式TTS服务** + +``` +paddlespeech_server start --config_file ./demos/streaming_tts_server/conf/tts_online_application.yaml +``` + +**访问流式TTS服务** + +``` +paddlespeech_client tts_online --server_ip 127.0.0.1 --port 8092 --protocol http --input "您好,欢迎使用百度飞桨语音合成服务。" --output output.wav +``` + +更多信息参看: [流式 ASR](./demos/streaming_asr_server/README.md) 和 [流式 TTS](./demos/streaming_tts_server/README.md) + + ## 模型列表 PaddleSpeech 支持很多主流的模型,并提供了预训练模型,详情请见[模型列表](./docs/source/released_model.md)。 @@ -582,6 +628,21 @@ PaddleSpeech 的 **语音合成** 主要包含三个模块:文本前端、声 语音合成模块最初被称为 [Parakeet](https://github.com/PaddlePaddle/Parakeet),现在与此仓库合并。如果您对该任务的学术研究感兴趣,请参阅 [TTS 研究概述](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/docs/source/tts#overview)。此外,[模型介绍](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/tts/models_introduction.md) 是了解语音合成流程的一个很好的指南。 +## ⭐ 应用案例 +- **[PaddleBoBo](https://github.com/JiehangXie/PaddleBoBo): 使用 PaddleSpeech 的语音合成模块生成虚拟人的声音。** + +
+ +- [PaddleSpeech 示例视频](https://paddlespeech.readthedocs.io/en/latest/demo_video.html) + + +- **[VTuberTalk](https://github.com/jerryuhoo/VTuberTalk): 使用 PaddleSpeech 的语音合成和语音识别从视频中克隆人声。** + +
+ +
+ + ## 引用 要引用 PaddleSpeech 进行研究,请使用以下格式进行引用。 @@ -658,6 +719,7 @@ PaddleSpeech 的 **语音合成** 主要包含三个模块:文本前端、声 - 非常感谢 [jerryuhoo](https://github.com/jerryuhoo)/[VTuberTalk](https://github.com/jerryuhoo/VTuberTalk) 基于 PaddleSpeech 的 TTS GUI 界面和基于 ASR 制作数据集的相关代码。 + 此外,PaddleSpeech 依赖于许多开源存储库。有关更多信息,请参阅 [references](./docs/source/reference.md)。 ## License diff --git a/demos/speech_recognition/README.md b/demos/speech_recognition/README.md index 63654880..6493e8e6 100644 --- a/demos/speech_recognition/README.md +++ b/demos/speech_recognition/README.md @@ -24,13 +24,13 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee - Command Line(Recommended) ```bash # Chinese - paddlespeech asr --input ./zh.wav + paddlespeech asr --input ./zh.wav -v # English - paddlespeech asr --model transformer_librispeech --lang en --input ./en.wav + paddlespeech asr --model transformer_librispeech --lang en --input ./en.wav -v # Chinese ASR + Punctuation Restoration - paddlespeech asr --input ./zh.wav | paddlespeech text --task punc + paddlespeech asr --input ./zh.wav -v | paddlespeech text --task punc -v ``` - (It doesn't matter if package `paddlespeech-ctcdecoders` is not found, this package is optional.) + (If you don't want to see the log information, you can remove "-v". Besides, it doesn't matter if package `paddlespeech-ctcdecoders` is not found, this package is optional.) Usage: ```bash @@ -45,6 +45,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee - `ckpt_path`: Model checkpoint. Use pretrained model when it is None. Default: `None`. - `yes`: No additional parameters required. Once set this parameter, it means accepting the request of the program by default, which includes transforming the audio sample rate. Default: `False`. - `device`: Choose device to execute model inference. Default: default device of paddlepaddle in current environment. + - `verbose`: Show the log information. 
Output: ```bash @@ -84,8 +85,12 @@ Here is a list of pretrained models released by PaddleSpeech that can be used by | Model | Language | Sample Rate | :--- | :---: | :---: | -| conformer_wenetspeech| zh| 16k -| transformer_librispeech| en| 16k +| conformer_wenetspeech | zh | 16k +| conformer_online_multicn | zh | 16k +| conformer_aishell | zh | 16k +| conformer_online_aishell | zh | 16k +| transformer_librispeech | en | 16k +| deepspeech2online_wenetspeech | zh | 16k | deepspeech2offline_aishell| zh| 16k | deepspeech2online_aishell | zh | 16k -|deepspeech2offline_librispeech|en| 16k +| deepspeech2offline_librispeech | en | 16k diff --git a/demos/speech_recognition/README_cn.md b/demos/speech_recognition/README_cn.md index 8033dbd8..8d631d89 100644 --- a/demos/speech_recognition/README_cn.md +++ b/demos/speech_recognition/README_cn.md @@ -22,13 +22,13 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee - 命令行 (推荐使用) ```bash # 中文 - paddlespeech asr --input ./zh.wav + paddlespeech asr --input ./zh.wav -v # 英文 - paddlespeech asr --model transformer_librispeech --lang en --input ./en.wav + paddlespeech asr --model transformer_librispeech --lang en --input ./en.wav -v # 中文 + 标点恢复 - paddlespeech asr --input ./zh.wav | paddlespeech text --task punc + paddlespeech asr --input ./zh.wav -v | paddlespeech text --task punc -v ``` - (如果显示 `paddlespeech-ctcdecoders` 这个 python 包没有找到的 Error,没有关系,这个包是非必须的。) + (如果不想显示 log 信息,可以不使用"-v", 另外如果显示 `paddlespeech-ctcdecoders` 这个 python 包没有找到的 Error,没有关系,这个包是非必须的。) 使用方法: ```bash @@ -43,6 +43,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee - `ckpt_path`:模型参数文件,若不设置则下载预训练模型使用,默认值:`None`。 - `yes`;不需要设置额外的参数,一旦设置了该参数,说明你默认同意程序的所有请求,其中包括自动转换输入音频的采样率。默认值:`False`。 - `device`:执行预测的设备,默认值:当前系统下 paddlepaddle 的默认 device。 + - `verbose`: 如果使用,显示 logger 信息。 输出: ```bash @@ -82,7 +83,11 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee | 模型 | 语言 | 采样率 | :--- | :---: | :---: | | conformer_wenetspeech | zh | 16k +| conformer_online_multicn | zh | 16k +| conformer_aishell | zh | 16k +| conformer_online_aishell | zh | 16k | transformer_librispeech | en | 16k +| deepspeech2online_wenetspeech | zh | 16k | deepspeech2offline_aishell| zh| 16k | deepspeech2online_aishell | zh | 16k | deepspeech2offline_librispeech | en | 16k diff --git a/demos/streaming_asr_server/README.md b/demos/streaming_asr_server/README.md index 6a2f21aa..3de2f386 100644 --- a/demos/streaming_asr_server/README.md +++ b/demos/streaming_asr_server/README.md @@ -5,6 +5,7 @@ ## Introduction This demo is an implementation of starting the streaming speech service and accessing the service. It can be achieved with a single command using `paddlespeech_server` and `paddlespeech_client` or a few lines of code in python. +Streaming ASR server only support `websocket` protocol, and doesn't support `http` protocol. ## Usage ### 1. 
Installation @@ -30,7 +31,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav - Command Line (Recommended) ```bash - # start the service + # in PaddleSpeech/demos/streaming_asr_server start the service paddlespeech_server start --config_file ./conf/ws_conformer_application.yaml ``` @@ -110,11 +111,12 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav - Python API ```python + # in PaddleSpeech/demos/streaming_asr_server directory from paddlespeech.server.bin.paddlespeech_server import ServerExecutor server_executor = ServerExecutor() server_executor( - config_file="./conf/ws_conformer_application.yaml", + config_file="./conf/ws_conformer_application.yaml", log_file="./log/paddlespeech.log") ``` @@ -185,6 +187,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav ### 4. ASR Client Usage + **Note:** The response time will be slightly longer when using the client for the first time - Command Line (Recommended) ``` @@ -203,6 +206,8 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav - `sample_rate`: Audio ampling rate, default: 16000. - `lang`: Language. Default: "zh_cn". - `audio_format`: Audio format. Default: "wav". + - `punc.server_ip`: punctuation server ip. Default: None. + - `punc.server_port`: punctuation server port. Default: None. Output: ```bash @@ -274,10 +279,9 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav - Python API ```python - from paddlespeech.server.bin.paddlespeech_client import ASRClientExecutor - import json + from paddlespeech.server.bin.paddlespeech_client import ASROnlineClientExecutor - asrclient_executor = ASRClientExecutor() + asrclient_executor = ASROnlineClientExecutor() res = asrclient_executor( input="./zh.wav", server_ip="127.0.0.1", @@ -285,7 +289,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav sample_rate=16000, lang="zh_cn", audio_format="wav") - print(res.json()) + print(res) ``` Output: @@ -351,5 +355,4 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav [2022-04-21 15:59:08,016] [ INFO] - receive msg={'asr_results': '我认为跑步最重要的就是给我带来了身体健康'} [2022-04-21 15:59:08,024] [ INFO] - receive msg={'asr_results': '我认为跑步最重要的就是给我带来了身体健康'} [2022-04-21 15:59:12,883] [ INFO] - final receive msg={'status': 'ok', 'signal': 'finished', 'asr_results': '我认为跑步最重要的就是给我带来了身体健康'} - [2022-04-21 15:59:12,884] [ INFO] - 我认为跑步最重要的就是给我带来了身体健康 - ``` + ``` \ No newline at end of file diff --git a/demos/streaming_asr_server/README_cn.md b/demos/streaming_asr_server/README_cn.md index 9224206b..bb1d3772 100644 --- a/demos/streaming_asr_server/README_cn.md +++ b/demos/streaming_asr_server/README_cn.md @@ -5,18 +5,26 @@ ## 介绍 这个demo是一个启动流式语音服务和访问服务的实现。 它可以通过使用`paddlespeech_server` 和 `paddlespeech_client`的单个命令或 python 的几行代码来实现。 +**流式语音识别服务只支持 `weboscket` 协议,不支持 `http` 协议。** ## 使用方法 ### 1. 安装 -请看 [安装文档](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install.md). +安装 PaddleSpeech 的详细过程请看 [安装文档](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install.md)。 推荐使用 **paddlepaddle 2.2.1** 或以上版本。 -你可以从 medium,hard 三中方式中选择一种方式安装 PaddleSpeech。 +你可以从medium,hard 两种方式中选择一种方式安装 PaddleSpeech。 ### 2. 
准备配置文件 -配置文件可参见 `conf/ws_application.yaml` 和 `conf/ws_conformer_application.yaml` 。 -目前服务集成的模型有: DeepSpeech2和conformer模型。 + +流式ASR的服务启动脚本和服务测试脚本存放在 `PaddleSpeech/demos/streaming_asr_server` 目录。 +下载好 `PaddleSpeech` 之后,进入到 `PaddleSpeech/demos/streaming_asr_server` 目录。 +配置文件可参见该目录下 `conf/ws_application.yaml` 和 `conf/ws_conformer_application.yaml` 。 + +目前服务集成的模型有: DeepSpeech2和 conformer模型,对应的配置文件如下: +* DeepSpeech: `conf/ws_application.yaml` +* conformer: `conf/ws_conformer_application.yaml` + 这个 ASR client 的输入应该是一个 WAV 文件(`.wav`),并且采样率必须与模型的采样率相同。 @@ -30,7 +38,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav - 命令行 (推荐使用) ```bash - # 启动服务 + # 在 PaddleSpeech/demos/streaming_asr_server 目录启动服务 paddlespeech_server start --config_file ./conf/ws_conformer_application.yaml ``` @@ -110,6 +118,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav - Python API ```python + # 在 PaddleSpeech/demos/streaming_asr_server 目录 from paddlespeech.server.bin.paddlespeech_server import ServerExecutor server_executor = ServerExecutor() @@ -184,11 +193,11 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav ``` ### 4. ASR 客户端使用方法 + **注意:** 初次使用客户端时响应时间会略长 - 命令行 (推荐使用) ``` paddlespeech_client asr_online --server_ip 127.0.0.1 --port 8090 --input ./zh.wav - ``` 使用帮助: @@ -204,6 +213,8 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav - `sample_rate`: 音频采样率,默认值:16000。 - `lang`: 模型语言,默认值:zh_cn。 - `audio_format`: 音频格式,默认值:wav。 + - `punc.server_ip` 标点预测服务的ip。默认是None。 + - `punc.server_port` 标点预测服务的端口port。默认是None。 输出: @@ -276,7 +287,6 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav - Python API ```python from paddlespeech.server.bin.paddlespeech_client import ASROnlineClientExecutor - import json asrclient_executor = ASROnlineClientExecutor() res = asrclient_executor( @@ -286,7 +296,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav sample_rate=16000, lang="zh_cn", audio_format="wav") - print(res.json()) + print(res) ``` 输出: @@ -352,5 +362,4 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav [2022-04-21 15:59:08,016] [ INFO] - receive msg={'asr_results': '我认为跑步最重要的就是给我带来了身体健康'} [2022-04-21 15:59:08,024] [ INFO] - receive msg={'asr_results': '我认为跑步最重要的就是给我带来了身体健康'} [2022-04-21 15:59:12,883] [ INFO] - final receive msg={'status': 'ok', 'signal': 'finished', 'asr_results': '我认为跑步最重要的就是给我带来了身体健康'} - [2022-04-21 15:59:12,884] [ INFO] - 我认为跑步最重要的就是给我带来了身体健康 ``` diff --git a/demos/streaming_asr_server/conf/application.yaml b/demos/streaming_asr_server/conf/application.yaml new file mode 100644 index 00000000..50c7a727 --- /dev/null +++ b/demos/streaming_asr_server/conf/application.yaml @@ -0,0 +1,45 @@ +# This is the parameter configuration file for PaddleSpeech Serving. + +################################################################################# +# SERVER SETTING # +################################################################################# +host: 0.0.0.0 +port: 8090 + +# The task format in the engin_list is: _ +# task choices = ['asr_online'] +# protocol = ['websocket'] (only one can be selected). +# websocket only support online engine type. 
+protocol: 'websocket' +engine_list: ['asr_online'] + + +################################################################################# +# ENGINE CONFIG # +################################################################################# + +################################### ASR ######################################### +################### speech task: asr; engine_type: online ####################### +asr_online: + model_type: 'conformer_online_multicn' + am_model: # the pdmodel file of am static model [optional] + am_params: # the pdiparams file of am static model [optional] + lang: 'zh' + sample_rate: 16000 + cfg_path: + decode_method: + force_yes: True + device: # cpu or gpu:id + am_predictor_conf: + device: # set 'gpu:id' or 'cpu' + switch_ir_optim: True + glog_info: False # True -> print glog + summary: True # False -> do not show predictor config + + chunk_buffer_conf: + window_n: 7 # frame + shift_n: 4 # frame + window_ms: 25 # ms + shift_ms: 10 # ms + sample_rate: 16000 + sample_width: 2 \ No newline at end of file diff --git a/demos/streaming_asr_server/conf/ws_application.yaml b/demos/streaming_asr_server/conf/ws_application.yaml index dee8d78b..fc02f2ca 100644 --- a/demos/streaming_asr_server/conf/ws_application.yaml +++ b/demos/streaming_asr_server/conf/ws_application.yaml @@ -7,8 +7,8 @@ host: 0.0.0.0 port: 8090 # The task format in the engin_list is: _ -# task choices = ['asr_online', 'tts_online'] -# protocol = ['websocket', 'http'] (only one can be selected). +# task choices = ['asr_online'] +# protocol = ['websocket'] (only one can be selected). # websocket only support online engine type. protocol: 'websocket' engine_list: ['asr_online'] diff --git a/demos/streaming_asr_server/conf/ws_conformer_application.yaml b/demos/streaming_asr_server/conf/ws_conformer_application.yaml index 8f011485..50c7a727 100644 --- a/demos/streaming_asr_server/conf/ws_conformer_application.yaml +++ b/demos/streaming_asr_server/conf/ws_conformer_application.yaml @@ -7,8 +7,8 @@ host: 0.0.0.0 port: 8090 # The task format in the engin_list is: _ -# task choices = ['asr_online', 'tts_online'] -# protocol = ['websocket', 'http'] (only one can be selected). +# task choices = ['asr_online'] +# protocol = ['websocket'] (only one can be selected). # websocket only support online engine type. 
protocol: 'websocket' engine_list: ['asr_online'] diff --git a/demos/streaming_asr_server/web/templates/index.html b/demos/streaming_asr_server/web/templates/index.html index 7aa227fb..56c63080 100644 --- a/demos/streaming_asr_server/web/templates/index.html +++ b/demos/streaming_asr_server/web/templates/index.html @@ -93,7 +93,7 @@ function parseResult(data) { var data = JSON.parse(data) - var result = data.asr_results + var result = data.result console.log(result) $("#resultPanel").html(result) } @@ -152,4 +152,4 @@ - \ No newline at end of file + diff --git a/demos/streaming_asr_server/websocket_client.py b/demos/streaming_asr_server/websocket_client.py index 2a15096c..523ef482 100644 --- a/demos/streaming_asr_server/websocket_client.py +++ b/demos/streaming_asr_server/websocket_client.py @@ -20,19 +20,23 @@ import logging import os from paddlespeech.cli.log import logger -from paddlespeech.server.utils.audio_handler import ASRAudioHandler +from paddlespeech.server.utils.audio_handler import ASRWsAudioHandler def main(args): logger.info("asr websocket client start") - handler = ASRAudioHandler("127.0.0.1", 8090) + handler = ASRWsAudioHandler( + args.server_ip, + args.port, + punc_server_ip=args.punc_server_ip, + punc_server_port=args.punc_server_port) loop = asyncio.get_event_loop() # support to process single audio file if args.wavfile and os.path.exists(args.wavfile): logger.info(f"start to process the wavscp: {args.wavfile}") result = loop.run_until_complete(handler.run(args.wavfile)) - result = result["asr_results"] + result = result["result"] logger.info(f"asr websocket client finished : {result}") # support to process batch audios from wav.scp @@ -43,13 +47,29 @@ def main(args): for line in f: utt_name, utt_path = line.strip().split() result = loop.run_until_complete(handler.run(utt_path)) - result = result["asr_results"] + result = result["result"] w.write(f"{utt_name} {result}\n") if __name__ == "__main__": logger.info("Start to do streaming asr client") parser = argparse.ArgumentParser() + parser.add_argument( + '--server_ip', type=str, default='127.0.0.1', help='server ip') + parser.add_argument('--port', type=int, default=8090, help='server port') + parser.add_argument( + '--punc.server_ip', + type=str, + default=None, + dest="punc_server_ip", + help='Punctuation server ip') + parser.add_argument( + '--punc.port', + type=int, + default=8091, + dest="punc_server_port", + help='Punctuation server port') + parser.add_argument( "--wavfile", action="store", diff --git a/demos/streaming_tts_server/README.md b/demos/streaming_tts_server/README.md index 801c4f31..d03b9e28 100644 --- a/demos/streaming_tts_server/README.md +++ b/demos/streaming_tts_server/README.md @@ -15,19 +15,28 @@ You can choose one way from meduim and hard to install paddlespeech. ### 2. Prepare config File -The configuration file can be found in `conf/tts_online_application.yaml` 。 -Among them, `protocol` indicates the network protocol used by the streaming TTS service. Currently, both http and websocket are supported. -`engine_list` indicates the speech engine that will be included in the service to be started, in the format of `_`. -This demo mainly introduces the streaming speech synthesis service, so the speech task should be set to `tts`. -Currently, the engine type supports two forms: **online** and **online-onnx**. `online` indicates an engine that uses python for dynamic graph inference; `online-onnx` indicates an engine that uses onnxruntime for inference. The inference speed of online-onnx is faster. 
-Streaming TTS AM model support: **fastspeech2 and fastspeech2_cnndecoder**; Voc model support: **hifigan and mb_melgan** +The configuration file can be found in `conf/tts_online_application.yaml`. +- `protocol` indicates the network protocol used by the streaming TTS service. Currently, both **http and websocket** are supported. +- `engine_list` indicates the speech engine that will be included in the service to be started, in the format of `_`. + - This demo mainly introduces the streaming speech synthesis service, so the speech task should be set to `tts`. + - the engine type supports two forms: **online** and **online-onnx**. `online` indicates an engine that uses python for dynamic graph inference; `online-onnx` indicates an engine that uses onnxruntime for inference. The inference speed of online-onnx is faster. +- Streaming TTS engine AM model support: **fastspeech2 and fastspeech2_cnndecoder**; Voc model support: **hifigan and mb_melgan** +- In streaming am inference, one chunk of data is inferred at a time to achieve a streaming effect. Among them, `am_block` indicates the number of valid frames in the chunk, and `am_pad` indicates the number of frames added before and after am_block in a chunk. The existence of am_pad is used to eliminate errors caused by streaming inference and avoid the influence of streaming inference on the quality of synthesized audio. + - fastspeech2 does not support streaming am inference, so am_pad and am_block have no effect on it. + - fastspeech2_cnndecoder supports streaming inference. When am_pad=12, streaming inference synthesized audio is consistent with non-streaming synthesized audio. +- In streaming voc inference, one chunk of data is inferred at a time to achieve a streaming effect. Where `voc_block` indicates the number of valid frames in the chunk, and `voc_pad` indicates the number of frames added before and after the voc_block in a chunk. The existence of voc_pad is used to eliminate errors caused by streaming inference and avoid the influence of streaming inference on the quality of synthesized audio. + - Both hifigan and mb_melgan support streaming voc inference. + - When the voc model is mb_melgan, when voc_pad=14, the synthetic audio for streaming inference is consistent with the non-streaming synthetic audio; the minimum voc_pad can be set to 7, and the synthetic audio has no abnormal hearing. If the voc_pad is less than 7, the synthetic audio sounds abnormal. + - When the voc model is hifigan, when voc_pad=20, the streaming inference synthetic audio is consistent with the non-streaming synthetic audio; when voc_pad=14, the synthetic audio has no abnormal hearing. +- Inference speed: mb_melgan > hifigan; Audio quality: mb_melgan < hifigan -### 3. Server Usage +### 3. Streaming speech synthesis server and client using http protocol +#### 3.1 Server Usage - Command Line (Recommended) + Start the service (the configuration file uses http by default): ```bash - # start the service paddlespeech_server start --config_file ./conf/tts_online_application.yaml ``` @@ -67,7 +76,7 @@ Streaming TTS AM model support: **fastspeech2 and fastspeech2_cnndecoder**; Voc log_file="./log/paddlespeech.log") ``` - Output: + Output: ```bash [2022-04-24 21:00:16,934] [ INFO] - The first response time of the 0 warm up: 1.268730878829956 s [2022-04-24 21:00:17,046] [ INFO] - The first response time of the 1 warm up: 0.11168622970581055 s @@ -85,17 +94,15 @@ Streaming TTS AM model support: **fastspeech2 and fastspeech2_cnndecoder**; Voc ``` - -### 4. 
Streaming TTS client Usage +#### 3.2 Streaming TTS client Usage - Command Line (Recommended) - ```bash - # Access http streaming TTS service - paddlespeech_client tts_online --server_ip 127.0.0.1 --port 8092 --input "您好,欢迎使用百度飞桨语音合成服务。" --output output.wav + Access http streaming TTS service: - # Access websocket streaming TTS service - paddlespeech_client tts_online --server_ip 127.0.0.1 --port 8092 --protocol websocket --input "您好,欢迎使用百度飞桨语音合成服务。" --output output.wav + ```bash + paddlespeech_client tts_online --server_ip 127.0.0.1 --port 8092 --protocol http --input "您好,欢迎使用百度飞桨语音合成服务。" --output output.wav ``` + Usage: ```bash @@ -113,7 +120,6 @@ Streaming TTS AM model support: **fastspeech2 and fastspeech2_cnndecoder**; Voc - `sample_rate`: Sampling rate, choices: [0, 8000, 16000], the default is the same as the model. Default: 0 - `output`: Output wave filepath. Default: None, which means not to save the audio to the local. - `play`: Whether to play audio, play while synthesizing, default value: False, which means not playing. **Playing audio needs to rely on the pyaudio library**. - Output: ```bash @@ -156,8 +162,144 @@ Streaming TTS AM model support: **fastspeech2 and fastspeech2_cnndecoder**; Voc [2022-04-24 21:11:16,802] [ INFO] - 音频时长:3.825 s [2022-04-24 21:11:16,802] [ INFO] - RTF: 0.7846773683635238 [2022-04-24 21:11:16,837] [ INFO] - 音频保存至:./output.wav + ``` + + +### 4. Streaming speech synthesis server and client using websocket protocol +#### 4.1 Server Usage +- Command Line (Recommended) + First modify the configuration file `conf/tts_online_application.yaml`, **set `protocol` to `websocket`**. + Start the service: + ```bash + paddlespeech_server start --config_file ./conf/tts_online_application.yaml + ``` + + Usage: + + ```bash + paddlespeech_server start --help + ``` + Arguments: + - `config_file`: yaml file of the app, defalut: ./conf/tts_online_application.yaml + - `log_file`: log file. Default: ./log/paddlespeech.log + + Output: + ```bash + [2022-04-27 10:18:09,107] [ INFO] - The first response time of the 0 warm up: 1.1551103591918945 s + [2022-04-27 10:18:09,219] [ INFO] - The first response time of the 1 warm up: 0.11204338073730469 s + [2022-04-27 10:18:09,324] [ INFO] - The first response time of the 2 warm up: 0.1051797866821289 s + [2022-04-27 10:18:09,325] [ INFO] - ********************************************************************** + INFO: Started server process [17600] + [2022-04-27 10:18:09] [INFO] [server.py:75] Started server process [17600] + INFO: Waiting for application startup. + [2022-04-27 10:18:09] [INFO] [on.py:45] Waiting for application startup. + INFO: Application startup complete. + [2022-04-27 10:18:09] [INFO] [on.py:59] Application startup complete. 
+ INFO: Uvicorn running on http://127.0.0.1:8092 (Press CTRL+C to quit) + [2022-04-27 10:18:09] [INFO] [server.py:211] Uvicorn running on http://127.0.0.1:8092 (Press CTRL+C to quit) + + + ``` + +- Python API + ```python + from paddlespeech.server.bin.paddlespeech_server import ServerExecutor + + server_executor = ServerExecutor() + server_executor( + config_file="./conf/tts_online_application.yaml", + log_file="./log/paddlespeech.log") + ``` + + Output: + ```bash + [2022-04-27 10:20:16,660] [ INFO] - The first response time of the 0 warm up: 1.0945196151733398 s + [2022-04-27 10:20:16,773] [ INFO] - The first response time of the 1 warm up: 0.11222052574157715 s + [2022-04-27 10:20:16,878] [ INFO] - The first response time of the 2 warm up: 0.10494542121887207 s + [2022-04-27 10:20:16,878] [ INFO] - ********************************************************************** + INFO: Started server process [23466] + [2022-04-27 10:20:16] [INFO] [server.py:75] Started server process [23466] + INFO: Waiting for application startup. + [2022-04-27 10:20:16] [INFO] [on.py:45] Waiting for application startup. + INFO: Application startup complete. + [2022-04-27 10:20:16] [INFO] [on.py:59] Application startup complete. + INFO: Uvicorn running on http://127.0.0.1:8092 (Press CTRL+C to quit) + [2022-04-27 10:20:16] [INFO] [server.py:211] Uvicorn running on http://127.0.0.1:8092 (Press CTRL+C to quit) + + ``` + +#### 4.2 Streaming TTS client Usage +- Command Line (Recommended) + + Access websocket streaming TTS service: + + ```bash + paddlespeech_client tts_online --server_ip 127.0.0.1 --port 8092 --protocol websocket --input "您好,欢迎使用百度飞桨语音合成服务。" --output output.wav + ``` + Usage: + + ```bash + paddlespeech_client tts_online --help + ``` + + Arguments: + - `server_ip`: erver ip. Default: 127.0.0.1 + - `port`: server port. Default: 8092 + - `protocol`: Service protocol, choices: [http, websocket], default: http. + - `input`: (required): Input text to generate. + - `spk_id`: Speaker id for multi-speaker text to speech. Default: 0 + - `speed`: Audio speed, the value should be set between 0 and 3. Default: 1.0 + - `volume`: Audio volume, the value should be set between 0 and 3. Default: 1.0 + - `sample_rate`: Sampling rate, choices: [0, 8000, 16000], the default is the same as the model. Default: 0 + - `output`: Output wave filepath. Default: None, which means not to save the audio to the local. + - `play`: Whether to play audio, play while synthesizing, default value: False, which means not playing. **Playing audio needs to rely on the pyaudio library**. 
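If you intend to pass `--play`, install the audio playback dependency first; a minimal sketch (assuming the PyPI package name, and note that building PyAudio on Linux also needs the PortAudio development headers):

```bash
# only needed for --play (play the audio while it is being synthesized)
pip install pyaudio
```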
+ + + Output: + ```bash + [2022-04-27 10:21:04,262] [ INFO] - tts websocket client start + [2022-04-27 10:21:04,496] [ INFO] - 句子:您好,欢迎使用百度飞桨语音合成服务。 + [2022-04-27 10:21:04,496] [ INFO] - 首包响应:0.2124948501586914 s + [2022-04-27 10:21:07,483] [ INFO] - 尾包响应:3.199106454849243 s + [2022-04-27 10:21:07,484] [ INFO] - 音频时长:3.825 s + [2022-04-27 10:21:07,484] [ INFO] - RTF: 0.8363677006141812 + [2022-04-27 10:21:07,516] [ INFO] - 音频保存至:output.wav + + ``` + +- Python API + ```python + from paddlespeech.server.bin.paddlespeech_client import TTSOnlineClientExecutor + import json + + executor = TTSOnlineClientExecutor() + executor( + input="您好,欢迎使用百度飞桨语音合成服务。", + server_ip="127.0.0.1", + port=8092, + protocol="websocket", + spk_id=0, + speed=1.0, + volume=1.0, + sample_rate=0, + output="./output.wav", + play=False) + + ``` + + Output: + ```bash + [2022-04-27 10:22:48,852] [ INFO] - tts websocket client start + [2022-04-27 10:22:49,080] [ INFO] - 句子:您好,欢迎使用百度飞桨语音合成服务。 + [2022-04-27 10:22:49,080] [ INFO] - 首包响应:0.21017956733703613 s + [2022-04-27 10:22:52,100] [ INFO] - 尾包响应:3.2304444313049316 s + [2022-04-27 10:22:52,101] [ INFO] - 音频时长:3.825 s + [2022-04-27 10:22:52,101] [ INFO] - RTF: 0.8445606356352762 + [2022-04-27 10:22:52,134] [ INFO] - 音频保存至:./output.wav ``` + + diff --git a/demos/streaming_tts_server/README_cn.md b/demos/streaming_tts_server/README_cn.md index 211dc388..e40de11b 100644 --- a/demos/streaming_tts_server/README_cn.md +++ b/demos/streaming_tts_server/README_cn.md @@ -1,4 +1,4 @@ -([简体中文](./README_cn.md)|English) +(简体中文|[English](./README.md)) # 流式语音合成服务 @@ -16,17 +16,26 @@ ### 2. 准备配置文件 配置文件可参见 `conf/tts_online_application.yaml` 。 -其中,`protocol`表示该流式TTS服务使用的网络协议,目前支持 http 和 websocket 两种。 -其中,`engine_list`表示即将启动的服务将会包含的语音引擎,格式为 <语音任务>_<引擎类型>。 -该demo主要介绍流式语音合成服务,因此语音任务应设置为tts。 -目前引擎类型支持两种形式:**online** 表示使用python进行动态图推理的引擎;**online-onnx** 表示使用onnxruntime进行推理的引擎。其中,online-onnx的推理速度更快。 -流式TTS的AM 模型支持:fastspeech2 以及fastspeech2_cnndecoder; Voc 模型支持:hifigan, mb_melgan +- `protocol`表示该流式TTS服务使用的网络协议,目前支持 **http 和 websocket** 两种。 +- `engine_list`表示即将启动的服务将会包含的语音引擎,格式为 <语音任务>_<引擎类型>。 + - 该demo主要介绍流式语音合成服务,因此语音任务应设置为tts。 + - 目前引擎类型支持两种形式:**online** 表示使用python进行动态图推理的引擎;**online-onnx** 表示使用onnxruntime进行推理的引擎。其中,online-onnx的推理速度更快。 +- 流式TTS引擎的AM模型支持:**fastspeech2 以及fastspeech2_cnndecoder**; Voc 模型支持:**hifigan, mb_melgan** +- 流式am推理中,每次会对一个chunk的数据进行推理以达到流式的效果。其中`am_block`表示chunk中的有效帧数,`am_pad` 表示一个chunk中am_block前后各加的帧数。am_pad的存在用于消除流式推理产生的误差,避免由流式推理对合成音频质量的影响。 + - fastspeech2不支持流式am推理,因此am_pad与am_block对它无效 + - fastspeech2_cnndecoder 支持流式推理,当am_pad=12时,流式推理合成音频与非流式合成音频一致 +- 流式voc推理中,每次会对一个chunk的数据进行推理以达到流式的效果。其中`voc_block`表示chunk中的有效帧数,`voc_pad` 表示一个chunk中voc_block前后各加的帧数。voc_pad的存在用于消除流式推理产生的误差,避免由流式推理对合成音频质量的影响。 + - hifigan, mb_melgan 均支持流式voc 推理 + - 当voc模型为mb_melgan,当voc_pad=14时,流式推理合成音频与非流式合成音频一致;voc_pad最小可以设置为7,合成音频听感上没有异常,若voc_pad小于7,合成音频听感上存在异常。 + - 当voc模型为hifigan,当voc_pad=20时,流式推理合成音频与非流式合成音频一致;当voc_pad=14时,合成音频听感上没有异常。 +- 推理速度:mb_melgan > hifigan; 音频质量:mb_melgan < hifigan -### 3. 服务端使用方法 +### 3. 使用http协议的流式语音合成服务端及客户端使用方法 +#### 3.1 服务端使用方法 - 命令行 (推荐使用) + 启动服务(配置文件默认使用http): ```bash - # 启动服务 paddlespeech_server start --config_file ./conf/tts_online_application.yaml ``` @@ -36,7 +45,7 @@ paddlespeech_server start --help ``` 参数: - - `config_file`: 服务的配置文件,默认: ./conf/application.yaml + - `config_file`: 服务的配置文件,默认: ./conf/tts_online_application.yaml - `log_file`: log 文件. 默认:./log/paddlespeech.log 输出: @@ -84,17 +93,15 @@ ``` - -### 4. 
流式TTS 客户端使用方法 +#### 3.2 客户端使用方法 - 命令行 (推荐使用) - ```bash - # 访问 http 流式TTS服务 - paddlespeech_client tts_online --server_ip 127.0.0.1 --port 8092 --input "您好,欢迎使用百度飞桨语音合成服务。" --output output.wav + 访问 http 流式TTS服务: - # 访问 websocket 流式TTS服务 - paddlespeech_client tts_online --server_ip 127.0.0.1 --port 8092 --protocol websocket --input "您好,欢迎使用百度飞桨语音合成服务。" --output output.wav + ```bash + paddlespeech_client tts_online --server_ip 127.0.0.1 --port 8092 --protocol http --input "您好,欢迎使用百度飞桨语音合成服务。" --output output.wav ``` + 使用帮助: ```bash @@ -155,8 +162,143 @@ [2022-04-24 21:11:16,802] [ INFO] - 音频时长:3.825 s [2022-04-24 21:11:16,802] [ INFO] - RTF: 0.7846773683635238 [2022-04-24 21:11:16,837] [ INFO] - 音频保存至:./output.wav + ``` + + +### 4. 使用websocket协议的流式语音合成服务端及客户端使用方法 +#### 4.1 服务端使用方法 +- 命令行 (推荐使用) + 首先修改配置文件 `conf/tts_online_application.yaml`, **将 `protocol` 设置为 `websocket`**。 + 启动服务: + ```bash + paddlespeech_server start --config_file ./conf/tts_online_application.yaml + ``` + + 使用方法: + + ```bash + paddlespeech_server start --help + ``` + 参数: + - `config_file`: 服务的配置文件,默认: ./conf/tts_online_application.yaml + - `log_file`: log 文件. 默认:./log/paddlespeech.log + + 输出: + ```bash + [2022-04-27 10:18:09,107] [ INFO] - The first response time of the 0 warm up: 1.1551103591918945 s + [2022-04-27 10:18:09,219] [ INFO] - The first response time of the 1 warm up: 0.11204338073730469 s + [2022-04-27 10:18:09,324] [ INFO] - The first response time of the 2 warm up: 0.1051797866821289 s + [2022-04-27 10:18:09,325] [ INFO] - ********************************************************************** + INFO: Started server process [17600] + [2022-04-27 10:18:09] [INFO] [server.py:75] Started server process [17600] + INFO: Waiting for application startup. + [2022-04-27 10:18:09] [INFO] [on.py:45] Waiting for application startup. + INFO: Application startup complete. + [2022-04-27 10:18:09] [INFO] [on.py:59] Application startup complete. + INFO: Uvicorn running on http://127.0.0.1:8092 (Press CTRL+C to quit) + [2022-04-27 10:18:09] [INFO] [server.py:211] Uvicorn running on http://127.0.0.1:8092 (Press CTRL+C to quit) + + + ``` + +- Python API + ```python + from paddlespeech.server.bin.paddlespeech_server import ServerExecutor + + server_executor = ServerExecutor() + server_executor( + config_file="./conf/tts_online_application.yaml", + log_file="./log/paddlespeech.log") + ``` + + 输出: + ```bash + [2022-04-27 10:20:16,660] [ INFO] - The first response time of the 0 warm up: 1.0945196151733398 s + [2022-04-27 10:20:16,773] [ INFO] - The first response time of the 1 warm up: 0.11222052574157715 s + [2022-04-27 10:20:16,878] [ INFO] - The first response time of the 2 warm up: 0.10494542121887207 s + [2022-04-27 10:20:16,878] [ INFO] - ********************************************************************** + INFO: Started server process [23466] + [2022-04-27 10:20:16] [INFO] [server.py:75] Started server process [23466] + INFO: Waiting for application startup. + [2022-04-27 10:20:16] [INFO] [on.py:45] Waiting for application startup. + INFO: Application startup complete. + [2022-04-27 10:20:16] [INFO] [on.py:59] Application startup complete. 
+ INFO: Uvicorn running on http://127.0.0.1:8092 (Press CTRL+C to quit) + [2022-04-27 10:20:16] [INFO] [server.py:211] Uvicorn running on http://127.0.0.1:8092 (Press CTRL+C to quit) + + ``` + +#### 4.2 客户端使用方法 +- 命令行 (推荐使用) + + 访问 websocket 流式TTS服务: + + ```bash + paddlespeech_client tts_online --server_ip 127.0.0.1 --port 8092 --protocol websocket --input "您好,欢迎使用百度飞桨语音合成服务。" --output output.wav + ``` + + 使用帮助: + + ```bash + paddlespeech_client tts_online --help + ``` + + 参数: + - `server_ip`: 服务端ip地址,默认: 127.0.0.1。 + - `port`: 服务端口,默认: 8092。 + - `protocol`: 服务协议,可选 [http, websocket], 默认: http。 + - `input`: (必须输入): 待合成的文本。 + - `spk_id`: 说话人 id,用于多说话人语音合成,默认值: 0。 + - `speed`: 音频速度,该值应设置在 0 到 3 之间。 默认值:1.0 + - `volume`: 音频音量,该值应设置在 0 到 3 之间。 默认值: 1.0 + - `sample_rate`: 采样率,可选 [0, 8000, 16000],默认值:0,表示与模型采样率相同 + - `output`: 输出音频的路径, 默认值:None,表示不保存音频到本地。 + - `play`: 是否播放音频,边合成边播放, 默认值:False,表示不播放。**播放音频需要依赖pyaudio库**。 + + + 输出: + ```bash + [2022-04-27 10:21:04,262] [ INFO] - tts websocket client start + [2022-04-27 10:21:04,496] [ INFO] - 句子:您好,欢迎使用百度飞桨语音合成服务。 + [2022-04-27 10:21:04,496] [ INFO] - 首包响应:0.2124948501586914 s + [2022-04-27 10:21:07,483] [ INFO] - 尾包响应:3.199106454849243 s + [2022-04-27 10:21:07,484] [ INFO] - 音频时长:3.825 s + [2022-04-27 10:21:07,484] [ INFO] - RTF: 0.8363677006141812 + [2022-04-27 10:21:07,516] [ INFO] - 音频保存至:output.wav + ``` + +- Python API + ```python + from paddlespeech.server.bin.paddlespeech_client import TTSOnlineClientExecutor + import json + + executor = TTSOnlineClientExecutor() + executor( + input="您好,欢迎使用百度飞桨语音合成服务。", + server_ip="127.0.0.1", + port=8092, + protocol="websocket", + spk_id=0, + speed=1.0, + volume=1.0, + sample_rate=0, + output="./output.wav", + play=False) ``` + 输出: + ```bash + [2022-04-27 10:22:48,852] [ INFO] - tts websocket client start + [2022-04-27 10:22:49,080] [ INFO] - 句子:您好,欢迎使用百度飞桨语音合成服务。 + [2022-04-27 10:22:49,080] [ INFO] - 首包响应:0.21017956733703613 s + [2022-04-27 10:22:52,100] [ INFO] - 尾包响应:3.2304444313049316 s + [2022-04-27 10:22:52,101] [ INFO] - 音频时长:3.825 s + [2022-04-27 10:22:52,101] [ INFO] - RTF: 0.8445606356352762 + [2022-04-27 10:22:52,134] [ INFO] - 音频保存至:./output.wav + + ``` + + diff --git a/demos/streaming_tts_server/conf/tts_online_application.yaml b/demos/streaming_tts_server/conf/tts_online_application.yaml index 353c3e32..67d4641a 100644 --- a/demos/streaming_tts_server/conf/tts_online_application.yaml +++ b/demos/streaming_tts_server/conf/tts_online_application.yaml @@ -1,4 +1,4 @@ -# This is the parameter configuration file for PaddleSpeech Serving. +# This is the parameter configuration file for streaming tts server. ################################################################################# # SERVER SETTING # @@ -7,8 +7,8 @@ host: 127.0.0.1 port: 8092 # The task format in the engin_list is: _ -# engine_list choices = ['tts_online', 'tts_online-onnx'] -# protocol = ['websocket', 'http'] (only one can be selected). +# engine_list choices = ['tts_online', 'tts_online-onnx'], the inference speed of tts_online-onnx is faster than tts_online. 
+# protocol choices = ['websocket', 'http'] protocol: 'http' engine_list: ['tts_online-onnx'] @@ -20,7 +20,8 @@ engine_list: ['tts_online-onnx'] ################################### TTS ######################################### ################### speech task: tts; engine_type: online ####################### tts_online: - # am (acoustic model) choices=['fastspeech2_csmsc', 'fastspeech2_cnndecoder_csmsc'] + # am (acoustic model) choices=['fastspeech2_csmsc', 'fastspeech2_cnndecoder_csmsc'] + # fastspeech2_cnndecoder_csmsc support streaming am infer. am: 'fastspeech2_csmsc' am_config: am_ckpt: @@ -31,6 +32,7 @@ tts_online: spk_id: 0 # voc (vocoder) choices=['mb_melgan_csmsc, hifigan_csmsc'] + # Both mb_melgan_csmsc and hifigan_csmsc support streaming voc inference voc: 'mb_melgan_csmsc' voc_config: voc_ckpt: @@ -39,8 +41,13 @@ tts_online: # others lang: 'zh' device: 'cpu' # set 'gpu:id' or 'cpu' + # am_block and am_pad only for fastspeech2_cnndecoder_onnx model to streaming am infer, + # when am_pad set 12, streaming synthetic audio is the same as non-streaming synthetic audio am_block: 42 am_pad: 12 + # voc_pad and voc_block voc model to streaming voc infer, + # when voc model is mb_melgan_csmsc, voc_pad set 14, streaming synthetic audio is the same as non-streaming synthetic audio; The minimum value of pad can be set to 7, streaming synthetic audio sounds normal + # when voc model is hifigan_csmsc, voc_pad set 20, streaming synthetic audio is the same as non-streaming synthetic audio; voc_pad set 14, streaming synthetic audio sounds normal voc_block: 14 voc_pad: 14 @@ -53,7 +60,8 @@ tts_online: ################################### TTS ######################################### ################### speech task: tts; engine_type: online-onnx ####################### tts_online-onnx: - # am (acoustic model) choices=['fastspeech2_csmsc_onnx', 'fastspeech2_cnndecoder_csmsc_onnx'] + # am (acoustic model) choices=['fastspeech2_csmsc_onnx', 'fastspeech2_cnndecoder_csmsc_onnx'] + # fastspeech2_cnndecoder_csmsc_onnx support streaming am infer. am: 'fastspeech2_cnndecoder_csmsc_onnx' # am_ckpt is a list, if am is fastspeech2_cnndecoder_csmsc_onnx, am_ckpt = [encoder model, decoder model, postnet model]; # if am is fastspeech2_csmsc_onnx, am_ckpt = [ckpt model]; @@ -70,6 +78,7 @@ tts_online-onnx: cpu_threads: 4 # voc (vocoder) choices=['mb_melgan_csmsc_onnx, hifigan_csmsc_onnx'] + # Both mb_melgan_csmsc_onnx and hifigan_csmsc_onnx support streaming voc inference voc: 'hifigan_csmsc_onnx' voc_ckpt: voc_sample_rate: 24000 @@ -80,9 +89,15 @@ tts_online-onnx: # others lang: 'zh' + # am_block and am_pad only for fastspeech2_cnndecoder_onnx model to streaming am infer, + # when am_pad set 12, streaming synthetic audio is the same as non-streaming synthetic audio am_block: 42 am_pad: 12 + # voc_pad and voc_block voc model to streaming voc infer, + # when voc model is mb_melgan_csmsc_onnx, voc_pad set 14, streaming synthetic audio is the same as non-streaming synthetic audio; The minimum value of pad can be set to 7, streaming synthetic audio sounds normal + # when voc model is hifigan_csmsc_onnx, voc_pad set 20, streaming synthetic audio is the same as non-streaming synthetic audio; voc_pad set 14, streaming synthetic audio sounds normal voc_block: 14 voc_pad: 14 + # voc_upsample should be same as n_shift on voc config. 
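# note: the CSMSC vocoder configs assumed here use a 24 kHz sample rate with n_shift = 300 samples (12.5 ms per hop), which is why voc_upsample is set to 300 below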
voc_upsample: 300 diff --git a/docs/source/released_model.md b/docs/source/released_model.md index baa4ff45..aee44859 100644 --- a/docs/source/released_model.md +++ b/docs/source/released_model.md @@ -6,7 +6,7 @@ ### Speech Recognition Model Acoustic Model | Training Data | Token-based | Size | Descriptions | CER | WER | Hours of speech | Example Link :-------------:| :------------:| :-----: | -----: | :-----: |:-----:| :-----: | :-----: | :-----: -[Ds2 Online Aishell ASR0 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_online_aishell_ckpt_0.2.0.model.tar.gz) | Aishell Dataset | Char-based | 345 MB | 2 Conv + 5 LSTM layers with only forward direction | 0.078 |-| 151 h | [D2 Online Aishell ASR0](../../examples/aishell/asr0) +[Ds2 Online Aishell ASR0 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_online_aishell_fbank161_ckpt_0.2.1.model.tar.gz) | Aishell Dataset | Char-based | 491 MB | 2 Conv + 5 LSTM layers with only forward direction | 0.0666 |-| 151 h | [D2 Online Aishell ASR0](../../examples/aishell/asr0) [Ds2 Offline Aishell ASR0 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_aishell_ckpt_0.1.1.model.tar.gz)| Aishell Dataset | Char-based | 306 MB | 2 Conv + 3 bidirectional GRU layers| 0.064 |-| 151 h | [Ds2 Offline Aishell ASR0](../../examples/aishell/asr0) [Conformer Online Aishell ASR1 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr1/asr1_chunk_conformer_aishell_ckpt_0.2.0.model.tar.gz) | Aishell Dataset | Char-based | 189 MB | Encoder:Conformer, Decoder:Transformer, Decoding method: Attention rescoring| 0.0544 |-| 151 h | [Conformer Online Aishell ASR1](../../examples/aishell/asr1) [Conformer Offline Aishell ASR1 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr1/asr1_conformer_aishell_ckpt_0.1.2.model.tar.gz) | Aishell Dataset | Char-based | 189 MB | Encoder:Conformer, Decoder:Transformer, Decoding method: Attention rescoring | 0.0464 |-| 151 h | [Conformer Offline Aishell ASR1](../../examples/aishell/asr1) diff --git a/examples/aishell/asr0/RESULTS.md b/examples/aishell/asr0/RESULTS.md index 8af3d66d..131b6628 100644 --- a/examples/aishell/asr0/RESULTS.md +++ b/examples/aishell/asr0/RESULTS.md @@ -4,6 +4,8 @@ | Model | Number of Params | Release | Config | Test set | Valid Loss | CER | | --- | --- | --- | --- | --- | --- | --- | +| DeepSpeech2 | 45.18M | r0.2.0 | conf/deepspeech2_online.yaml + U2 Data pipline and spec aug + fbank161 | test | 6.876979827880859 | 0.0666 | +| DeepSpeech2 | 45.18M | r0.2.0 | conf/deepspeech2_online.yaml + spec aug + fbank161 | test | 7.679287910461426 | 0.0718 | | DeepSpeech2 | 45.18M | r0.2.0 | conf/deepspeech2_online.yaml + spec aug | test | 7.708217620849609| 0.078 | | DeepSpeech2 | 45.18M | v2.2.0 | conf/deepspeech2_online.yaml + spec aug | test | 7.994938373565674 | 0.080 | diff --git a/examples/ami/README.md b/examples/ami/README.md index a038eaeb..adc9dc4b 100644 --- a/examples/ami/README.md +++ b/examples/ami/README.md @@ -1,3 +1,3 @@ # Speaker Diarization on AMI corpus -* sd0 - speaker diarization by AHC,SC base on x-vectors +* sd0 - speaker diarization by AHC,SC base on embeddings diff --git a/examples/ami/sd0/README.md b/examples/ami/sd0/README.md index ffe95741..e9ecc285 100644 --- a/examples/ami/sd0/README.md +++ b/examples/ami/sd0/README.md @@ -7,7 +7,23 @@ The script performs diarization using x-vectors(TDNN,ECAPA-TDNN) on the AMI mix-headset data. We demonstrate the use of different clustering methods: AHC, spectral. 
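As a rough, self-contained illustration of what the AHC clustering step does with such embeddings (a sketch only; the toy embeddings, the distance threshold, and the scikit-learn dependency are assumptions, not the script's actual implementation):

```python
# Illustration only: agglomerative clustering of per-segment speaker embeddings.
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# toy stand-ins for x-vectors extracted from short speech segments (n_segments x dim)
embeddings = np.random.randn(20, 192)
embeddings /= np.linalg.norm(embeddings, axis=1, keepdims=True)  # L2-normalize

# With normalized vectors, Euclidean distance ranks pairs the same way as cosine
# distance, so the default metric can be used. distance_threshold stops merging
# once clusters get too far apart, so the speaker count need not be known a priori.
ahc = AgglomerativeClustering(
    n_clusters=None,
    linkage="average",
    distance_threshold=1.0)
labels = ahc.fit_predict(embeddings)
print(labels)  # one hypothesized speaker label per segment
```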
## How to Run +### prepare annotations and audios +Download AMI corpus, You need around 10GB of free space to get whole data +The signals are too large to package in this way, so you need to use the chooser to indicate which ones you wish to download + +```bash +## download annotations +wget http://groups.inf.ed.ac.uk/ami/AMICorpusAnnotations/ami_public_manual_1.6.2.zip && unzip ami_public_manual_1.6.2.zip +``` + +then please follow https://groups.inf.ed.ac.uk/ami/download/ to download the Signals: +1) Select one or more AMI meetings: the IDs please follow ./ami_split.py +2) Select media streams: Just select Headset mix + +### start running Use the following command to run diarization on AMI corpus. -`bash ./run.sh` +```bash +./run.sh --data_folder ./amicorpus --manual_annot_folder ./ami_public_manual_1.6.2 +``` ## Results (DER) coming soon! :) diff --git a/examples/ami/sd0/run.sh b/examples/ami/sd0/run.sh index 9035f595..1fcec269 100644 --- a/examples/ami/sd0/run.sh +++ b/examples/ami/sd0/run.sh @@ -17,18 +17,6 @@ device=gpu . ${MAIN_ROOT}/utils/parse_options.sh || exit 1; -if [ $stage -le 0 ]; then - # Prepare data - # Download AMI corpus, You need around 10GB of free space to get whole data - # The signals are too large to package in this way, - # so you need to use the chooser to indicate which ones you wish to download - echo "Please follow https://groups.inf.ed.ac.uk/ami/download/ to download the data." - echo "Annotations: AMI manual annotations v1.6.2 " - echo "Signals: " - echo "1) Select one or more AMI meetings: the IDs please follow ./ami_split.py" - echo "2) Select media streams: Just select Headset mix" -fi - if [ $stage -le 1 ]; then # Download the pretrained model wget https://paddlespeech.bj.bcebos.com/vector/voxceleb/sv0_ecapa_tdnn_voxceleb12_ckpt_0_1_1.tar.gz diff --git a/examples/hey_snips/kws0/conf/mdtc.yaml b/examples/hey_snips/kws0/conf/mdtc.yaml index 3ce9f9d0..4bd0708c 100644 --- a/examples/hey_snips/kws0/conf/mdtc.yaml +++ b/examples/hey_snips/kws0/conf/mdtc.yaml @@ -1,39 +1,49 @@ -data: - data_dir: '/PATH/TO/DATA/hey_snips_research_6k_en_train_eval_clean_ter' - dataset: 'paddleaudio.datasets:HeySnips' +# https://yaml.org/type/float.html +########################################### +# Data # +########################################### +dataset: 'paddleaudio.datasets:HeySnips' +data_dir: '/PATH/TO/DATA/hey_snips_research_6k_en_train_eval_clean_ter' -model: - num_keywords: 1 - backbone: 'paddlespeech.kws.models:MDTC' - config: - stack_num: 3 - stack_size: 4 - in_channels: 80 - res_channels: 32 - kernel_size: 5 +############################################ +# Network Architecture # +############################################ +backbone: 'paddlespeech.kws.models:MDTC' +num_keywords: 1 +stack_num: 3 +stack_size: 4 +in_channels: 80 +res_channels: 32 +kernel_size: 5 -feature: - feat_type: 'kaldi_fbank' - sample_rate: 16000 - frame_shift: 10 - frame_length: 25 - n_mels: 80 +########################################### +# Feature # +########################################### +feat_type: 'kaldi_fbank' +sample_rate: 16000 +frame_shift: 10 +frame_length: 25 +n_mels: 80 -training: - epochs: 100 - num_workers: 16 - batch_size: 100 - checkpoint_dir: './checkpoint' - save_freq: 10 - log_freq: 10 - learning_rate: 0.001 - weight_decay: 0.00005 - grad_clip: 5.0 +########################################### +# Training # +########################################### +epochs: 100 +num_workers: 16 +batch_size: 100 +checkpoint_dir: './checkpoint' +save_freq: 10 +log_freq: 10 
+learning_rate: 0.001 +weight_decay: 0.00005 +grad_clip: 5.0 -scoring: - batch_size: 100 - num_workers: 16 - checkpoint: './checkpoint/epoch_100/model.pdparams' - score_file: './scores.txt' - stats_file: './stats.0.txt' - img_file: './det.png' \ No newline at end of file +########################################### +# Scoring # +########################################### +batch_size: 100 +num_workers: 16 +checkpoint: './checkpoint/epoch_100/model.pdparams' +score_file: './scores.txt' +stats_file: './stats.0.txt' +img_file: './det.png' \ No newline at end of file diff --git a/examples/hey_snips/kws0/local/plot.sh b/examples/hey_snips/kws0/local/plot.sh index 5869e50b..783de98b 100755 --- a/examples/hey_snips/kws0/local/plot.sh +++ b/examples/hey_snips/kws0/local/plot.sh @@ -1,2 +1,25 @@ #!/bin/bash -python3 ${BIN_DIR}/plot_det_curve.py --cfg_path=$1 --keyword HeySnips +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +if [ $# != 3 ];then + echo "usage: ${0} config_path checkpoint output_file" + exit -1 +fi + +keyword=$1 +stats_file=$2 +img_file=$3 + +python3 ${BIN_DIR}/plot_det_curve.py --keyword_label ${keyword} --stats_file ${stats_file} --img_file ${img_file} diff --git a/examples/hey_snips/kws0/local/score.sh b/examples/hey_snips/kws0/local/score.sh index ed21d08c..916536af 100755 --- a/examples/hey_snips/kws0/local/score.sh +++ b/examples/hey_snips/kws0/local/score.sh @@ -1,5 +1,27 @@ #!/bin/bash +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. -python3 ${BIN_DIR}/score.py --cfg_path=$1 +if [ $# != 4 ];then + echo "usage: ${0} checkpoint score_file stats_file" + exit -1 +fi -python3 ${BIN_DIR}/compute_det.py --cfg_path=$1 +cfg_path=$1 +ckpt=$2 +score_file=$3 +stats_file=$4 + +python3 ${BIN_DIR}/score.py --config ${cfg_path} --ckpt ${ckpt} --score_file ${score_file} || exit -1 +python3 ${BIN_DIR}/compute_det.py --config ${cfg_path} --score_file ${score_file} --stats_file ${stats_file} || exit -1 diff --git a/examples/hey_snips/kws0/local/train.sh b/examples/hey_snips/kws0/local/train.sh index 8d0181b8..c403f22a 100755 --- a/examples/hey_snips/kws0/local/train.sh +++ b/examples/hey_snips/kws0/local/train.sh @@ -1,13 +1,31 @@ #!/bin/bash +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +if [ $# != 2 ];then + echo "usage: ${0} num_gpus config_path" + exit -1 +fi ngpu=$1 cfg_path=$2 if [ ${ngpu} -gt 0 ]; then python3 -m paddle.distributed.launch --gpus $CUDA_VISIBLE_DEVICES ${BIN_DIR}/train.py \ - --cfg_path ${cfg_path} + --config ${cfg_path} else echo "set CUDA_VISIBLE_DEVICES to enable multi-gpus trainning." python3 ${BIN_DIR}/train.py \ - --cfg_path ${cfg_path} + --config ${cfg_path} fi diff --git a/examples/hey_snips/kws0/run.sh b/examples/hey_snips/kws0/run.sh index 2cc09a4f..bc25a8e8 100755 --- a/examples/hey_snips/kws0/run.sh +++ b/examples/hey_snips/kws0/run.sh @@ -32,10 +32,16 @@ if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then ./local/train.sh ${ngpu} ${cfg_path} || exit -1 fi +ckpt=./checkpoint/epoch_100/model.pdparams +score_file=./scores.txt +stats_file=./stats.0.txt +img_file=./det.png + if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then - ./local/score.sh ${cfg_path} || exit -1 + ./local/score.sh ${cfg_path} ${ckpt} ${score_file} ${stats_file} || exit -1 fi +keyword=HeySnips if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then - ./local/plot.sh ${cfg_path} || exit -1 + ./local/plot.sh ${keyword} ${stats_file} ${img_file} || exit -1 fi \ No newline at end of file diff --git a/examples/voxceleb/sv0/README.md b/examples/voxceleb/sv0/README.md index 567963e5..418102b4 100644 --- a/examples/voxceleb/sv0/README.md +++ b/examples/voxceleb/sv0/README.md @@ -142,10 +142,10 @@ using the `tar` scripts to unpack the model and then you can use the script to t For example: ``` wget https://paddlespeech.bj.bcebos.com/vector/voxceleb/sv0_ecapa_tdnn_voxceleb12_ckpt_0_2_0.tar.gz -tar xzvf sv0_ecapa_tdnn_voxceleb12_ckpt_0_2_0.tar.gz +tar -xvf sv0_ecapa_tdnn_voxceleb12_ckpt_0_2_0.tar.gz source path.sh # If you have processed the data and get the manifest file, you can skip the following 2 steps -CUDA_VISIBLE_DEVICES= ./local/test.sh ./data sv0_ecapa_tdnn_voxceleb12_ckpt_0_1_2 conf/ecapa_tdnn.yaml +CUDA_VISIBLE_DEVICES= bash ./local/test.sh ./data sv0_ecapa_tdnn_voxceleb12_ckpt_0_1_2/model/ conf/ecapa_tdnn.yaml ``` The performance of the released models are shown in [this](./RESULTS.md) diff --git a/examples/voxceleb/sv0/local/test.sh b/examples/voxceleb/sv0/local/test.sh index 4460a165..800fa67d 100644 --- a/examples/voxceleb/sv0/local/test.sh +++ b/examples/voxceleb/sv0/local/test.sh @@ -33,10 +33,26 @@ dir=$1 exp_dir=$2 conf_path=$3 +# get the gpu nums for training +ngpu=$(echo $CUDA_VISIBLE_DEVICES | awk -F "," '{print NF}') +echo "using $ngpu gpus..." 
+ +# setting training device +device="cpu" +if ${use_gpu}; then + device="gpu" +fi +if [ $ngpu -le 0 ]; then + echo "no gpu, training in cpu mode" + device='cpu' + use_gpu=false +fi + if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then # test the model and compute the eer metrics python3 ${BIN_DIR}/test.py \ --data-dir ${dir} \ --load-checkpoint ${exp_dir} \ - --config ${conf_path} + --config ${conf_path} \ + --device ${device} fi diff --git a/examples/voxceleb/sv0/local/train.sh b/examples/voxceleb/sv0/local/train.sh index 5477d0a3..674fedb3 100755 --- a/examples/voxceleb/sv0/local/train.sh +++ b/examples/voxceleb/sv0/local/train.sh @@ -42,15 +42,25 @@ device="cpu" if ${use_gpu}; then device="gpu" fi +if [ $ngpu -le 0 ]; then + echo "no gpu, training in cpu mode" + device='cpu' + use_gpu=false +fi if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then # train the speaker identification task with voxceleb data # and we will create the trained model parameters in ${exp_dir}/model.pdparams as the soft link # Note: we will store the log file in exp/log directory - python3 -m paddle.distributed.launch --gpus=$CUDA_VISIBLE_DEVICES \ - ${BIN_DIR}/train.py --device ${device} --checkpoint-dir ${exp_dir} \ - --data-dir ${dir} --config ${conf_path} - + if $use_gpu; then + python3 -m paddle.distributed.launch --gpus=$CUDA_VISIBLE_DEVICES \ + ${BIN_DIR}/train.py --device ${device} --checkpoint-dir ${exp_dir} \ + --data-dir ${dir} --config ${conf_path} + else + python3 \ + ${BIN_DIR}/train.py --device ${device} --checkpoint-dir ${exp_dir} \ + --data-dir ${dir} --config ${conf_path} + fi fi if [ $? -ne 0 ]; then diff --git a/paddlespeech/cli/asr/pretrained_models.py b/paddlespeech/cli/asr/pretrained_models.py index cb4c5e27..80b04aa4 100644 --- a/paddlespeech/cli/asr/pretrained_models.py +++ b/paddlespeech/cli/asr/pretrained_models.py @@ -27,6 +27,16 @@ pretrained_models = { 'ckpt_path': 'exp/conformer/checkpoints/wenetspeech', }, + "conformer_online_multicn-zh-16k": { + 'url': + 'https://paddlespeech.bj.bcebos.com/s2t/multi_cn/asr1/asr1_chunk_conformer_multi_cn_ckpt_0.2.0.model.tar.gz', + 'md5': + '7989b3248c898070904cf042fd656003', + 'cfg_path': + 'model.yaml', + 'ckpt_path': + 'exp/chunk_conformer/checkpoints/multi_cn', + }, "conformer_aishell-zh-16k": { 'url': 'https://paddlespeech.bj.bcebos.com/s2t/aishell/asr1/asr1_conformer_aishell_ckpt_0.1.2.model.tar.gz', @@ -57,6 +67,20 @@ pretrained_models = { 'ckpt_path': 'exp/transformer/checkpoints/avg_10', }, + "deepspeech2online_wenetspeech-zh-16k": { + 'url': + 'https://paddlespeech.bj.bcebos.com/s2t/wenetspeech/asr0/WIP_asr0_deepspeech2_online_wenetspeech_ckpt_1.0.0a.model.tar.gz', + 'md5': + 'b3ef6fcae8c0058c3c53375341ccb209', + 'cfg_path': + 'model.yaml', + 'ckpt_path': + 'exp/deepspeech2_online/checkpoints/avg_3', + 'lm_url': + 'https://deepspeech.bj.bcebos.com/zh_lm/zh_giga.no_cna_cmn.prune01244.klm', + 'lm_md5': + '29e02312deb2e59b3c8686c7966d4fe3' + }, "deepspeech2offline_aishell-zh-16k": { 'url': 'https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_aishell_ckpt_0.1.1.model.tar.gz', @@ -73,9 +97,9 @@ pretrained_models = { }, "deepspeech2online_aishell-zh-16k": { 'url': - 'https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_online_aishell_ckpt_0.2.0.model.tar.gz', + 'https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_online_aishell_fbank161_ckpt_0.2.1.model.tar.gz', 'md5': - '23e16c69730a1cb5d735c98c83c21e16', + '98b87b171b7240b7cae6e07d8d0bc9be', 'cfg_path': 'model.yaml', 'ckpt_path': diff 
--git a/paddlespeech/cli/vector/infer.py b/paddlespeech/cli/vector/infer.py index 1dff6edb..37e19391 100644 --- a/paddlespeech/cli/vector/infer.py +++ b/paddlespeech/cli/vector/infer.py @@ -22,6 +22,8 @@ from typing import Union import paddle import soundfile +from paddleaudio.backends import load as load_audio +from paddleaudio.compliance.librosa import melspectrogram from yacs.config import CfgNode from ..executor import BaseExecutor @@ -30,8 +32,6 @@ from ..utils import cli_register from ..utils import stats_wrapper from .pretrained_models import model_alias from .pretrained_models import pretrained_models -from paddleaudio.backends import load as load_audio -from paddleaudio.compliance.librosa import melspectrogram from paddlespeech.s2t.utils.dynamic_import import dynamic_import from paddlespeech.vector.io.batch import feature_normalize from paddlespeech.vector.modules.sid_model import SpeakerIdetification diff --git a/paddlespeech/kws/exps/mdtc/compute_det.py b/paddlespeech/kws/exps/mdtc/compute_det.py index 817846b8..e43a953d 100644 --- a/paddlespeech/kws/exps/mdtc/compute_det.py +++ b/paddlespeech/kws/exps/mdtc/compute_det.py @@ -12,24 +12,15 @@ # See the License for the specific language governing permissions and # limitations under the License. # Modified from wekws(https://github.com/wenet-e2e/wekws) -import argparse import os import paddle -import yaml from tqdm import tqdm +from yacs.config import CfgNode +from paddlespeech.s2t.training.cli import default_argument_parser from paddlespeech.s2t.utils.dynamic_import import dynamic_import -# yapf: disable -parser = argparse.ArgumentParser(__doc__) -parser.add_argument("--cfg_path", type=str, required=True) -parser.add_argument('--keyword_index', type=int, default=0, help='keyword index') -parser.add_argument('--step', type=float, default=0.01, help='threshold step of trigger score') -parser.add_argument('--window_shift', type=int, default=50, help='window_shift is used to skip the frames after triggered') -args = parser.parse_args() -# yapf: enable - def load_label_and_score(keyword_index: int, ds: paddle.io.Dataset, @@ -61,26 +52,52 @@ def load_label_and_score(keyword_index: int, if __name__ == '__main__': - args.cfg_path = os.path.abspath(os.path.expanduser(args.cfg_path)) - with open(args.cfg_path, 'r') as f: - config = yaml.safe_load(f) + parser = default_argument_parser() + parser.add_argument( + '--keyword_index', type=int, default=0, help='keyword index') + parser.add_argument( + '--step', + type=float, + default=0.01, + help='threshold step of trigger score') + parser.add_argument( + '--window_shift', + type=int, + default=50, + help='window_shift is used to skip the frames after triggered') + parser.add_argument( + "--score_file", + type=str, + required=True, + help='output file of trigger scores') + parser.add_argument( + '--stats_file', + type=str, + default='./stats.0.txt', + help='output file of detection error tradeoff') + args = parser.parse_args() - data_conf = config['data'] - feat_conf = config['feature'] - scoring_conf = config['scoring'] + # https://yaml.org/type/float.html + config = CfgNode(new_allowed=True) + if args.config: + config.merge_from_file(args.config) # Dataset - ds_class = dynamic_import(data_conf['dataset']) - test_ds = ds_class(data_dir=data_conf['data_dir'], mode='test', **feat_conf) - - score_file = os.path.abspath(scoring_conf['score_file']) - stats_file = os.path.abspath(scoring_conf['stats_file']) + ds_class = dynamic_import(config['dataset']) + test_ds = ds_class( + 
data_dir=config['data_dir'], + mode='test', + feat_type=config['feat_type'], + sample_rate=config['sample_rate'], + frame_shift=config['frame_shift'], + frame_length=config['frame_length'], + n_mels=config['n_mels'], ) keyword_table, filler_table, filler_duration = load_label_and_score( - args.keyword, test_ds, score_file) + args.keyword_index, test_ds, args.score_file) print('Filler total duration Hours: {}'.format(filler_duration / 3600.0)) pbar = tqdm(total=int(1.0 / args.step)) - with open(stats_file, 'w', encoding='utf8') as fout: + with open(args.stats_file, 'w', encoding='utf8') as fout: keyword_index = args.keyword_index threshold = 0.0 while threshold <= 1.0: @@ -113,4 +130,4 @@ if __name__ == '__main__': pbar.update(1) pbar.close() - print('DET saved to: {}'.format(stats_file)) + print('DET saved to: {}'.format(args.stats_file)) diff --git a/paddlespeech/kws/exps/mdtc/plot_det_curve.py b/paddlespeech/kws/exps/mdtc/plot_det_curve.py index ac920358..a3ea21ef 100644 --- a/paddlespeech/kws/exps/mdtc/plot_det_curve.py +++ b/paddlespeech/kws/exps/mdtc/plot_det_curve.py @@ -17,12 +17,12 @@ import os import matplotlib.pyplot as plt import numpy as np -import yaml # yapf: disable parser = argparse.ArgumentParser(__doc__) -parser.add_argument("--cfg_path", type=str, required=True) -parser.add_argument("--keyword", type=str, required=True) +parser.add_argument('--keyword_label', type=str, required=True, help='keyword string shown on image') +parser.add_argument('--stats_file', type=str, required=True, help='output file of detection error tradeoff') +parser.add_argument('--img_file', type=str, default='./det.png', help='output det image') args = parser.parse_args() # yapf: enable @@ -61,14 +61,8 @@ def plot_det_curve(keywords, stats_file, figure_file, xlim, x_step, ylim, if __name__ == '__main__': - args.cfg_path = os.path.abspath(os.path.expanduser(args.cfg_path)) - with open(args.cfg_path, 'r') as f: - config = yaml.safe_load(f) - - scoring_conf = config['scoring'] - img_file = os.path.abspath(scoring_conf['img_file']) - stats_file = os.path.abspath(scoring_conf['stats_file']) - keywords = [args.keyword] - plot_det_curve(keywords, stats_file, img_file, 10, 2, 10, 2) + img_file = os.path.abspath(args.img_file) + stats_file = os.path.abspath(args.stats_file) + plot_det_curve([args.keyword_label], stats_file, img_file, 10, 2, 10, 2) print('DET curve image saved to: {}'.format(img_file)) diff --git a/paddlespeech/kws/exps/mdtc/score.py b/paddlespeech/kws/exps/mdtc/score.py index 7fe88ea3..1b5e1e29 100644 --- a/paddlespeech/kws/exps/mdtc/score.py +++ b/paddlespeech/kws/exps/mdtc/score.py @@ -12,55 +12,67 @@ # See the License for the specific language governing permissions and # limitations under the License. 
# Modified from wekws(https://github.com/wenet-e2e/wekws) -import argparse -import os - import paddle -import yaml from tqdm import tqdm +from yacs.config import CfgNode from paddlespeech.kws.exps.mdtc.collate import collate_features from paddlespeech.kws.models.mdtc import KWSModel +from paddlespeech.s2t.training.cli import default_argument_parser from paddlespeech.s2t.utils.dynamic_import import dynamic_import -# yapf: disable -parser = argparse.ArgumentParser(__doc__) -parser.add_argument("--cfg_path", type=str, required=True) -args = parser.parse_args() -# yapf: enable - if __name__ == '__main__': - args.cfg_path = os.path.abspath(os.path.expanduser(args.cfg_path)) - with open(args.cfg_path, 'r') as f: - config = yaml.safe_load(f) + parser = default_argument_parser() + parser.add_argument( + "--ckpt", + type=str, + required=True, + help='model checkpoint for evaluation.') + parser.add_argument( + "--score_file", + type=str, + default='./scores.txt', + help='output file of trigger scores') + args = parser.parse_args() - model_conf = config['model'] - data_conf = config['data'] - feat_conf = config['feature'] - scoring_conf = config['scoring'] + # https://yaml.org/type/float.html + config = CfgNode(new_allowed=True) + if args.config: + config.merge_from_file(args.config) # Dataset - ds_class = dynamic_import(data_conf['dataset']) - test_ds = ds_class(data_dir=data_conf['data_dir'], mode='test', **feat_conf) + ds_class = dynamic_import(config['dataset']) + test_ds = ds_class( + data_dir=config['data_dir'], + mode='test', + feat_type=config['feat_type'], + sample_rate=config['sample_rate'], + frame_shift=config['frame_shift'], + frame_length=config['frame_length'], + n_mels=config['n_mels'], ) test_sampler = paddle.io.BatchSampler( - test_ds, batch_size=scoring_conf['batch_size'], drop_last=False) + test_ds, batch_size=config['batch_size'], drop_last=False) test_loader = paddle.io.DataLoader( test_ds, batch_sampler=test_sampler, - num_workers=scoring_conf['num_workers'], + num_workers=config['num_workers'], return_list=True, use_buffer_reader=True, collate_fn=collate_features, ) # Model - backbone_class = dynamic_import(model_conf['backbone']) - backbone = backbone_class(**model_conf['config']) - model = KWSModel(backbone=backbone, num_keywords=model_conf['num_keywords']) - model.set_state_dict(paddle.load(scoring_conf['checkpoint'])) + backbone_class = dynamic_import(config['backbone']) + backbone = backbone_class( + stack_num=config['stack_num'], + stack_size=config['stack_size'], + in_channels=config['in_channels'], + res_channels=config['res_channels'], + kernel_size=config['kernel_size'], ) + model = KWSModel(backbone=backbone, num_keywords=config['num_keywords']) + model.set_state_dict(paddle.load(args.ckpt)) model.eval() - with paddle.no_grad(), open( - scoring_conf['score_file'], 'w', encoding='utf8') as fout: + with paddle.no_grad(), open(args.score_file, 'w', encoding='utf8') as f: for batch_idx, batch in enumerate( tqdm(test_loader, total=len(test_loader))): keys, feats, labels, lengths = batch @@ -73,7 +85,6 @@ if __name__ == '__main__': keyword_scores = score[:, keyword_i] score_frames = ' '.join( ['{:.6f}'.format(x) for x in keyword_scores.tolist()]) - fout.write( - '{} {} {}\n'.format(key, keyword_i, score_frames)) + f.write('{} {} {}\n'.format(key, keyword_i, score_frames)) - print('Result saved to: {}'.format(scoring_conf['score_file'])) + print('Result saved to: {}'.format(args.score_file)) diff --git a/paddlespeech/kws/exps/mdtc/train.py 
b/paddlespeech/kws/exps/mdtc/train.py index 99e72871..5a9ca92d 100644 --- a/paddlespeech/kws/exps/mdtc/train.py +++ b/paddlespeech/kws/exps/mdtc/train.py @@ -11,77 +11,88 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. -import argparse import os import paddle -import yaml - from paddleaudio.utils import logger from paddleaudio.utils import Timer +from yacs.config import CfgNode + from paddlespeech.kws.exps.mdtc.collate import collate_features from paddlespeech.kws.models.loss import max_pooling_loss from paddlespeech.kws.models.mdtc import KWSModel +from paddlespeech.s2t.training.cli import default_argument_parser from paddlespeech.s2t.utils.dynamic_import import dynamic_import -# yapf: disable -parser = argparse.ArgumentParser(__doc__) -parser.add_argument("--cfg_path", type=str, required=True) -args = parser.parse_args() -# yapf: enable - if __name__ == '__main__': + parser = default_argument_parser() + args = parser.parse_args() + + # https://yaml.org/type/float.html + config = CfgNode(new_allowed=True) + if args.config: + config.merge_from_file(args.config) + nranks = paddle.distributed.get_world_size() if paddle.distributed.get_world_size() > 1: paddle.distributed.init_parallel_env() local_rank = paddle.distributed.get_rank() - args.cfg_path = os.path.abspath(os.path.expanduser(args.cfg_path)) - with open(args.cfg_path, 'r') as f: - config = yaml.safe_load(f) - - model_conf = config['model'] - data_conf = config['data'] - feat_conf = config['feature'] - training_conf = config['training'] - # Dataset - ds_class = dynamic_import(data_conf['dataset']) + ds_class = dynamic_import(config['dataset']) train_ds = ds_class( - data_dir=data_conf['data_dir'], mode='train', **feat_conf) - dev_ds = ds_class(data_dir=data_conf['data_dir'], mode='dev', **feat_conf) + data_dir=config['data_dir'], + mode='train', + feat_type=config['feat_type'], + sample_rate=config['sample_rate'], + frame_shift=config['frame_shift'], + frame_length=config['frame_length'], + n_mels=config['n_mels'], ) + dev_ds = ds_class( + data_dir=config['data_dir'], + mode='dev', + feat_type=config['feat_type'], + sample_rate=config['sample_rate'], + frame_shift=config['frame_shift'], + frame_length=config['frame_length'], + n_mels=config['n_mels'], ) train_sampler = paddle.io.DistributedBatchSampler( train_ds, - batch_size=training_conf['batch_size'], + batch_size=config['batch_size'], shuffle=True, drop_last=False) train_loader = paddle.io.DataLoader( train_ds, batch_sampler=train_sampler, - num_workers=training_conf['num_workers'], + num_workers=config['num_workers'], return_list=True, use_buffer_reader=True, collate_fn=collate_features, ) # Model - backbone_class = dynamic_import(model_conf['backbone']) - backbone = backbone_class(**model_conf['config']) - model = KWSModel(backbone=backbone, num_keywords=model_conf['num_keywords']) + backbone_class = dynamic_import(config['backbone']) + backbone = backbone_class( + stack_num=config['stack_num'], + stack_size=config['stack_size'], + in_channels=config['in_channels'], + res_channels=config['res_channels'], + kernel_size=config['kernel_size'], ) + model = KWSModel(backbone=backbone, num_keywords=config['num_keywords']) model = paddle.DataParallel(model) - clip = paddle.nn.ClipGradByGlobalNorm(training_conf['grad_clip']) + clip = paddle.nn.ClipGradByGlobalNorm(config['grad_clip']) optimizer = paddle.optimizer.Adam( - 
learning_rate=training_conf['learning_rate'], - weight_decay=training_conf['weight_decay'], + learning_rate=config['learning_rate'], + weight_decay=config['weight_decay'], parameters=model.parameters(), grad_clip=clip) criterion = max_pooling_loss steps_per_epoch = len(train_sampler) - timer = Timer(steps_per_epoch * training_conf['epochs']) + timer = Timer(steps_per_epoch * config['epochs']) timer.start() - for epoch in range(1, training_conf['epochs'] + 1): + for epoch in range(1, config['epochs'] + 1): model.train() avg_loss = 0 @@ -107,15 +118,13 @@ if __name__ == '__main__': timer.count() - if (batch_idx + 1 - ) % training_conf['log_freq'] == 0 and local_rank == 0: + if (batch_idx + 1) % config['log_freq'] == 0 and local_rank == 0: lr = optimizer.get_lr() - avg_loss /= training_conf['log_freq'] + avg_loss /= config['log_freq'] avg_acc = num_corrects / num_samples print_msg = 'Epoch={}/{}, Step={}/{}'.format( - epoch, training_conf['epochs'], batch_idx + 1, - steps_per_epoch) + epoch, config['epochs'], batch_idx + 1, steps_per_epoch) print_msg += ' loss={:.4f}'.format(avg_loss) print_msg += ' acc={:.4f}'.format(avg_acc) print_msg += ' lr={:.6f} step/sec={:.2f} | ETA {}'.format( @@ -126,17 +135,17 @@ if __name__ == '__main__': num_corrects = 0 num_samples = 0 - if epoch % training_conf[ + if epoch % config[ 'save_freq'] == 0 and batch_idx + 1 == steps_per_epoch and local_rank == 0: dev_sampler = paddle.io.BatchSampler( dev_ds, - batch_size=training_conf['batch_size'], + batch_size=config['batch_size'], shuffle=False, drop_last=False) dev_loader = paddle.io.DataLoader( dev_ds, batch_sampler=dev_sampler, - num_workers=training_conf['num_workers'], + num_workers=config['num_workers'], return_list=True, use_buffer_reader=True, collate_fn=collate_features, ) @@ -159,7 +168,7 @@ if __name__ == '__main__': logger.eval(print_msg) # Save model - save_dir = os.path.join(training_conf['checkpoint_dir'], + save_dir = os.path.join(config['checkpoint_dir'], 'epoch_{}'.format(epoch)) logger.info('Saving model checkpoint to {}'.format(save_dir)) paddle.save(model.state_dict(), diff --git a/paddlespeech/s2t/__init__.py b/paddlespeech/s2t/__init__.py index 7acc3716..2365071f 100644 --- a/paddlespeech/s2t/__init__.py +++ b/paddlespeech/s2t/__init__.py @@ -131,12 +131,14 @@ if not hasattr(paddle.Tensor, 'long'): "override long of paddle.Tensor if exists or register, remove this when fixed!" ) paddle.Tensor.long = func_long + paddle.static.Variable.long = func_long if not hasattr(paddle.Tensor, 'numel'): logger.debug( "override numel of paddle.Tensor if exists or register, remove this when fixed!" ) paddle.Tensor.numel = paddle.numel + paddle.static.Variable.numel = paddle.numel def new_full(x: paddle.Tensor, @@ -151,6 +153,7 @@ if not hasattr(paddle.Tensor, 'new_full'): "override new_full of paddle.Tensor if exists or register, remove this when fixed!" ) paddle.Tensor.new_full = new_full + paddle.static.Variable.new_full = new_full def eq(xs: paddle.Tensor, ys: Union[paddle.Tensor, float]) -> paddle.Tensor: @@ -166,6 +169,7 @@ if not hasattr(paddle.Tensor, 'eq'): "override eq of paddle.Tensor if exists or register, remove this when fixed!" ) paddle.Tensor.eq = eq + paddle.static.Variable.eq = eq if not hasattr(paddle, 'eq'): logger.debug( @@ -182,6 +186,7 @@ if not hasattr(paddle.Tensor, 'contiguous'): "override contiguous of paddle.Tensor if exists or register, remove this when fixed!" 
) paddle.Tensor.contiguous = contiguous + paddle.static.Variable.contiguous = contiguous def size(xs: paddle.Tensor, *args: int) -> paddle.Tensor: @@ -200,6 +205,7 @@ logger.debug( "(`to_static` do not process `size` property, maybe some `paddle` api dependent on it), remove this when fixed!" ) paddle.Tensor.size = size +paddle.static.Variable.size = size def view(xs: paddle.Tensor, *args: int) -> paddle.Tensor: @@ -209,6 +215,7 @@ def view(xs: paddle.Tensor, *args: int) -> paddle.Tensor: if not hasattr(paddle.Tensor, 'view'): logger.debug("register user view to paddle.Tensor, remove this when fixed!") paddle.Tensor.view = view + paddle.static.Variable.view = view def view_as(xs: paddle.Tensor, ys: paddle.Tensor) -> paddle.Tensor: @@ -219,6 +226,7 @@ if not hasattr(paddle.Tensor, 'view_as'): logger.debug( "register user view_as to paddle.Tensor, remove this when fixed!") paddle.Tensor.view_as = view_as + paddle.static.Variable.view_as = view_as def is_broadcastable(shp1, shp2): @@ -246,6 +254,7 @@ if not hasattr(paddle.Tensor, 'masked_fill'): logger.debug( "register user masked_fill to paddle.Tensor, remove this when fixed!") paddle.Tensor.masked_fill = masked_fill + paddle.static.Variable.masked_fill = masked_fill def masked_fill_(xs: paddle.Tensor, @@ -264,6 +273,7 @@ if not hasattr(paddle.Tensor, 'masked_fill_'): logger.debug( "register user masked_fill_ to paddle.Tensor, remove this when fixed!") paddle.Tensor.masked_fill_ = masked_fill_ + paddle.static.Variable.maksed_fill_ = masked_fill_ def fill_(xs: paddle.Tensor, value: Union[float, int]) -> paddle.Tensor: @@ -276,6 +286,7 @@ if not hasattr(paddle.Tensor, 'fill_'): logger.debug( "register user fill_ to paddle.Tensor, remove this when fixed!") paddle.Tensor.fill_ = fill_ + paddle.static.Variable.fill_ = fill_ def repeat(xs: paddle.Tensor, *size: Any) -> paddle.Tensor: @@ -286,6 +297,7 @@ if not hasattr(paddle.Tensor, 'repeat'): logger.debug( "register user repeat to paddle.Tensor, remove this when fixed!") paddle.Tensor.repeat = repeat + paddle.static.Variable.repeat = repeat if not hasattr(paddle.Tensor, 'softmax'): logger.debug( @@ -310,6 +322,7 @@ if not hasattr(paddle.Tensor, 'type_as'): logger.debug( "register user type_as to paddle.Tensor, remove this when fixed!") setattr(paddle.Tensor, 'type_as', type_as) + setattr(paddle.static.Variable, 'type_as', type_as) def to(x: paddle.Tensor, *args, **kwargs) -> paddle.Tensor: @@ -325,6 +338,7 @@ def to(x: paddle.Tensor, *args, **kwargs) -> paddle.Tensor: if not hasattr(paddle.Tensor, 'to'): logger.debug("register user to to paddle.Tensor, remove this when fixed!") setattr(paddle.Tensor, 'to', to) + setattr(paddle.static.Variable, 'to', to) def func_float(x: paddle.Tensor) -> paddle.Tensor: @@ -335,6 +349,7 @@ if not hasattr(paddle.Tensor, 'float'): logger.debug( "register user float to paddle.Tensor, remove this when fixed!") setattr(paddle.Tensor, 'float', func_float) + setattr(paddle.static.Variable, 'float', func_float) def func_int(x: paddle.Tensor) -> paddle.Tensor: @@ -344,6 +359,7 @@ def func_int(x: paddle.Tensor) -> paddle.Tensor: if not hasattr(paddle.Tensor, 'int'): logger.debug("register user int to paddle.Tensor, remove this when fixed!") setattr(paddle.Tensor, 'int', func_int) + setattr(paddle.static.Variable, 'int', func_int) def tolist(x: paddle.Tensor) -> List[Any]: @@ -354,6 +370,7 @@ if not hasattr(paddle.Tensor, 'tolist'): logger.debug( "register user tolist to paddle.Tensor, remove this when fixed!") setattr(paddle.Tensor, 'tolist', tolist) + 
setattr(paddle.static.Variable, 'tolist', tolist) ########### hack paddle.nn ############# from paddle.nn import Layer diff --git a/paddlespeech/server/bin/paddlespeech_client.py b/paddlespeech/server/bin/paddlespeech_client.py index 1cc0a6ab..2f1ce385 100644 --- a/paddlespeech/server/bin/paddlespeech_client.py +++ b/paddlespeech/server/bin/paddlespeech_client.py @@ -16,7 +16,6 @@ import asyncio import base64 import io import json -import logging import os import random import time @@ -30,7 +29,7 @@ from ..executor import BaseExecutor from ..util import cli_client_register from ..util import stats_wrapper from paddlespeech.cli.log import logger -from paddlespeech.server.utils.audio_handler import ASRAudioHandler +from paddlespeech.server.utils.audio_handler import ASRWsAudioHandler from paddlespeech.server.utils.audio_process import wav2pcm from paddlespeech.server.utils.util import wav2base64 @@ -288,6 +287,12 @@ class ASRClientExecutor(BaseExecutor): default=None, help='Audio file to be recognized', required=True) + self.parser.add_argument( + '--protocol', + type=str, + default="http", + choices=["http", "websocket"], + help='server protocol') self.parser.add_argument( '--sample_rate', type=int, default=16000, help='audio sample rate') self.parser.add_argument( @@ -295,6 +300,19 @@ class ASRClientExecutor(BaseExecutor): self.parser.add_argument( '--audio_format', type=str, default="wav", help='audio format') + self.parser.add_argument( + '--punc.server_ip', + type=str, + default=None, + dest="punc_server_ip", + help='Punctuation server ip') + self.parser.add_argument( + '--punc.port', + type=int, + default=8091, + dest="punc_server_port", + help='Punctuation server port') + def execute(self, argv: List[str]) -> bool: args = self.parser.parse_args(argv) input_ = args.input @@ -303,6 +321,7 @@ class ASRClientExecutor(BaseExecutor): sample_rate = args.sample_rate lang = args.lang audio_format = args.audio_format + protocol = args.protocol try: time_start = time.time() @@ -312,13 +331,17 @@ class ASRClientExecutor(BaseExecutor): port=port, sample_rate=sample_rate, lang=lang, - audio_format=audio_format) + audio_format=audio_format, + protocol=protocol, + punc_server_ip=args.punc_server_ip, + punc_server_port=args.punc_server_port) time_end = time.time() - logger.info(res.json()) + logger.info(f"ASR result: {res}") logger.info("Response time %f s." % (time_end - time_start)) return True except Exception as e: logger.error("Failed to speech recognition.") + logger.error(e) return False @stats_wrapper @@ -328,21 +351,39 @@ class ASRClientExecutor(BaseExecutor): port: int=8090, sample_rate: int=16000, lang: str="zh_cn", - audio_format: str="wav"): - """ - Python API to call an executor. - """ + audio_format: str="wav", + protocol: str="http", + punc_server_ip: str=None, + punc_server_port: int=None): + """Python API to call an executor. - url = 'http://' + server_ip + ":" + str(port) + '/paddlespeech/asr' - audio = wav2base64(input) - data = { - "audio": audio, - "audio_format": audio_format, - "sample_rate": sample_rate, - "lang": lang, - } + Args: + input (str): The input audio file path + server_ip (str, optional): The ASR server ip. Defaults to "127.0.0.1". + port (int, optional): The ASR server port. Defaults to 8090. + sample_rate (int, optional): The audio sample rate. Defaults to 16000. + lang (str, optional): The audio language type. Defaults to "zh_cn". + audio_format (str, optional): The audio format information. Defaults to "wav". + protocol (str, optional): The ASR server. 
Defaults to "http". + + Returns: + str: The ASR results + """ + # we use the asr server to recognize the audio text content + # and paddlespeech_client asr only support http protocol + protocol = "http" + if protocol.lower() == "http": + from paddlespeech.server.utils.audio_handler import ASRHttpHandler + logger.info("asr http client start") + handler = ASRHttpHandler(server_ip=server_ip, port=port) + res = handler.run(input, audio_format, sample_rate, lang) + res = res['result']['transcription'] + logger.info("asr http client finished") + else: + logger.error(f"Sorry, we have not support protocol: {protocol}," + "please use http or websocket protocol") + sys.exit(-1) - res = requests.post(url=url, data=json.dumps(data)) return res @@ -379,7 +420,6 @@ class ASROnlineClientExecutor(BaseExecutor): sample_rate = args.sample_rate lang = args.lang audio_format = args.audio_format - try: time_start = time.time() res = self( @@ -409,14 +449,13 @@ class ASROnlineClientExecutor(BaseExecutor): """ Python API to call an executor. """ - logging.basicConfig(level=logging.INFO) - logging.info("asr websocket client start") - handler = ASRAudioHandler(server_ip, port) + logger.info("asr websocket client start") + handler = ASRWsAudioHandler(server_ip, port) loop = asyncio.get_event_loop() res = loop.run_until_complete(handler.run(input)) - logging.info("asr websocket client finished") + logger.info("asr websocket client finished") - return res['asr_results'] + return res['result'] @cli_client_register( @@ -509,7 +548,6 @@ class TextClientExecutor(BaseExecutor): input_ = args.input server_ip = args.server_ip port = args.port - output = args.output try: time_start = time.time() diff --git a/paddlespeech/server/conf/tts_online_application.yaml b/paddlespeech/server/conf/tts_online_application.yaml index 6214188d..67d4641a 100644 --- a/paddlespeech/server/conf/tts_online_application.yaml +++ b/paddlespeech/server/conf/tts_online_application.yaml @@ -1,4 +1,4 @@ -# This is the parameter configuration file for PaddleSpeech Serving. +# This is the parameter configuration file for streaming tts server. ################################################################################# # SERVER SETTING # @@ -7,8 +7,8 @@ host: 127.0.0.1 port: 8092 # The task format in the engin_list is: _ -# task choices = ['tts_online', 'tts_online-onnx'] -# protocol = ['websocket', 'http'] (only one can be selected). +# engine_list choices = ['tts_online', 'tts_online-onnx'], the inference speed of tts_online-onnx is faster than tts_online. +# protocol choices = ['websocket', 'http'] protocol: 'http' engine_list: ['tts_online-onnx'] @@ -20,8 +20,9 @@ engine_list: ['tts_online-onnx'] ################################### TTS ######################################### ################### speech task: tts; engine_type: online ####################### tts_online: - # am (acoustic model) choices=['fastspeech2_csmsc', 'fastspeech2_cnndecoder_csmsc'] - am: 'fastspeech2_cnndecoder_csmsc' + # am (acoustic model) choices=['fastspeech2_csmsc', 'fastspeech2_cnndecoder_csmsc'] + # fastspeech2_cnndecoder_csmsc support streaming am infer. 
+ am: 'fastspeech2_csmsc' am_config: am_ckpt: am_stat: @@ -31,6 +32,7 @@ tts_online: spk_id: 0 # voc (vocoder) choices=['mb_melgan_csmsc, hifigan_csmsc'] + # Both mb_melgan_csmsc and hifigan_csmsc support streaming voc inference voc: 'mb_melgan_csmsc' voc_config: voc_ckpt: @@ -39,8 +41,13 @@ tts_online: # others lang: 'zh' device: 'cpu' # set 'gpu:id' or 'cpu' + # am_block and am_pad only for fastspeech2_cnndecoder_onnx model to streaming am infer, + # when am_pad set 12, streaming synthetic audio is the same as non-streaming synthetic audio am_block: 42 am_pad: 12 + # voc_pad and voc_block voc model to streaming voc infer, + # when voc model is mb_melgan_csmsc, voc_pad set 14, streaming synthetic audio is the same as non-streaming synthetic audio; The minimum value of pad can be set to 7, streaming synthetic audio sounds normal + # when voc model is hifigan_csmsc, voc_pad set 20, streaming synthetic audio is the same as non-streaming synthetic audio; voc_pad set 14, streaming synthetic audio sounds normal voc_block: 14 voc_pad: 14 @@ -53,7 +60,8 @@ tts_online: ################################### TTS ######################################### ################### speech task: tts; engine_type: online-onnx ####################### tts_online-onnx: - # am (acoustic model) choices=['fastspeech2_csmsc_onnx', 'fastspeech2_cnndecoder_csmsc_onnx'] + # am (acoustic model) choices=['fastspeech2_csmsc_onnx', 'fastspeech2_cnndecoder_csmsc_onnx'] + # fastspeech2_cnndecoder_csmsc_onnx support streaming am infer. am: 'fastspeech2_cnndecoder_csmsc_onnx' # am_ckpt is a list, if am is fastspeech2_cnndecoder_csmsc_onnx, am_ckpt = [encoder model, decoder model, postnet model]; # if am is fastspeech2_csmsc_onnx, am_ckpt = [ckpt model]; @@ -70,6 +78,7 @@ tts_online-onnx: cpu_threads: 4 # voc (vocoder) choices=['mb_melgan_csmsc_onnx, hifigan_csmsc_onnx'] + # Both mb_melgan_csmsc_onnx and hifigan_csmsc_onnx support streaming voc inference voc: 'hifigan_csmsc_onnx' voc_ckpt: voc_sample_rate: 24000 @@ -80,9 +89,15 @@ tts_online-onnx: # others lang: 'zh' + # am_block and am_pad only for fastspeech2_cnndecoder_onnx model to streaming am infer, + # when am_pad set 12, streaming synthetic audio is the same as non-streaming synthetic audio am_block: 42 am_pad: 12 + # voc_pad and voc_block voc model to streaming voc infer, + # when voc model is mb_melgan_csmsc_onnx, voc_pad set 14, streaming synthetic audio is the same as non-streaming synthetic audio; The minimum value of pad can be set to 7, streaming synthetic audio sounds normal + # when voc model is hifigan_csmsc_onnx, voc_pad set 20, streaming synthetic audio is the same as non-streaming synthetic audio; voc_pad set 14, streaming synthetic audio sounds normal voc_block: 14 voc_pad: 14 + # voc_upsample should be same as n_shift on voc config. 
voc_upsample: 300 diff --git a/paddlespeech/server/engine/asr/online/asr_engine.py b/paddlespeech/server/engine/asr/online/asr_engine.py index 10e72024..990590b4 100644 --- a/paddlespeech/server/engine/asr/online/asr_engine.py +++ b/paddlespeech/server/engine/asr/online/asr_engine.py @@ -43,9 +43,9 @@ __all__ = ['ASREngine'] pretrained_models = { "deepspeech2online_aishell-zh-16k": { 'url': - 'https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_online_aishell_ckpt_0.2.0.model.tar.gz', + 'https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_online_aishell_fbank161_ckpt_0.2.1.model.tar.gz', 'md5': - '23e16c69730a1cb5d735c98c83c21e16', + '98b87b171b7240b7cae6e07d8d0bc9be', 'cfg_path': 'model.yaml', 'ckpt_path': diff --git a/paddlespeech/server/tests/asr/online/microphone_client.py b/paddlespeech/server/tests/asr/online/microphone_client.py index 2ceaf6d0..bb27e548 100644 --- a/paddlespeech/server/tests/asr/online/microphone_client.py +++ b/paddlespeech/server/tests/asr/online/microphone_client.py @@ -26,7 +26,7 @@ import pyaudio import websockets -class ASRAudioHandler(threading.Thread): +class ASRWsAudioHandler(threading.Thread): def __init__(self, url="127.0.0.1", port=8091): threading.Thread.__init__(self) self.url = url @@ -148,7 +148,7 @@ if __name__ == "__main__": logging.basicConfig(level=logging.INFO) logging.info("asr websocket client start") - handler = ASRAudioHandler("127.0.0.1", 8091) + handler = ASRWsAudioHandler("127.0.0.1", 8091) loop = asyncio.get_event_loop() main_task = asyncio.ensure_future(handler.run()) for signal in [SIGINT, SIGTERM]: diff --git a/paddlespeech/server/util.py b/paddlespeech/server/util.py index 1f1b0be1..ae3e9c6a 100644 --- a/paddlespeech/server/util.py +++ b/paddlespeech/server/util.py @@ -24,11 +24,11 @@ from typing import Any from typing import Dict import paddle +import paddleaudio import requests import yaml from paddle.framework import load -import paddleaudio from . import download from .entry import client_commands from .entry import server_commands diff --git a/paddlespeech/server/utils/audio_handler.py b/paddlespeech/server/utils/audio_handler.py index c2863115..f0ec0eaa 100644 --- a/paddlespeech/server/utils/audio_handler.py +++ b/paddlespeech/server/utils/audio_handler.py @@ -24,19 +24,76 @@ import websockets from paddlespeech.cli.log import logger from paddlespeech.server.utils.audio_process import save_audio +from paddlespeech.server.utils.util import wav2base64 -class ASRAudioHandler: - def __init__(self, url="127.0.0.1", port=8090): +class TextHttpHandler: + def __init__(self, server_ip="127.0.0.1", port=8090): + """Text http client request + + Args: + server_ip (str, optional): the text server ip. Defaults to "127.0.0.1". + port (int, optional): the text server port. Defaults to 8090. 
+ """ + super().__init__() + self.server_ip = server_ip + self.port = port + if server_ip is None or port is None: + self.url = None + else: + self.url = 'http://' + self.server_ip + ":" + str( + self.port) + '/paddlespeech/text' + + def run(self, text): + """Call the text server to process the specific text + + Args: + text (str): the text to be processed + + Returns: + str: punctuation text + """ + if self.server_ip is None or self.port is None: + return text + request = { + "text": text, + } + try: + res = requests.post(url=self.url, data=json.dumps(request)) + response_dict = res.json() + punc_text = response_dict["result"]["punc_text"] + except Exception as e: + logger.error(f"Call punctuation {self.url} occurs error") + logger.error(e) + punc_text = text + + return punc_text + + +class ASRWsAudioHandler: + def __init__(self, + url=None, + port=None, + endpoint="/paddlespeech/asr/streaming", + punc_server_ip=None, + punc_server_port=None): """PaddleSpeech Online ASR Server Client audio handler Online asr server use the websocket protocal Args: - url (str, optional): the server ip. Defaults to "127.0.0.1". - port (int, optional): the server port. Defaults to 8090. + url (str, optional): the server ip. Defaults to None. + port (int, optional): the server port. Defaults to None. + endpoint(str, optional): to compatiable with python server and c++ server. + punc_server_ip(str, optional): the punctuation server ip. Defaults to None. + punc_server_port(int, optional): the punctuation port. Defaults to None """ self.url = url self.port = port - self.url = "ws://" + self.url + ":" + str(self.port) + "/ws/asr" + if url is None or port is None or endpoint is None: + self.url = None + else: + self.url = "ws://" + self.url + ":" + str(self.port) + endpoint + self.punc_server = TextHttpHandler(punc_server_ip, punc_server_port) + logger.info(f"endpoint: {self.url}") def read_wave(self, wavfile_path: str): """read the audio file from specific wavfile path @@ -80,6 +137,10 @@ class ASRAudioHandler: """ logging.info("send a message to the server") + if self.url is None: + logger.error("No asr server, please input valid ip and port") + return "" + # 1. send websocket handshake protocal async with websockets.connect(self.url) as ws: # 2. server has already received handshake protocal @@ -88,28 +149,31 @@ class ASRAudioHandler: { "name": "test.wav", "signal": "start", - "nbest": 5 + "nbest": 1 }, sort_keys=True, indent=4, separators=(',', ': ')) await ws.send(audio_info) msg = await ws.recv() - logger.info("receive msg={}".format(msg)) + logger.info("client receive msg={}".format(msg)) # 3. send chunk audio data to engine for chunk_data in self.read_wave(wavfile_path): await ws.send(chunk_data.tobytes()) msg = await ws.recv() msg = json.loads(msg) - logger.info("receive msg={}".format(msg)) + + if self.punc_server and len(msg["result"]) > 0: + msg["result"] = self.punc_server.run(msg["result"]) + logger.info("client receive msg={}".format(msg)) # 4. we must send finished signal to the server audio_info = json.dumps( { "name": "test.wav", "signal": "end", - "nbest": 5 + "nbest": 1 }, sort_keys=True, indent=4, @@ -119,11 +183,63 @@ class ASRAudioHandler: # 5. 
decode the bytes to str msg = json.loads(msg) - logger.info("final receive msg={}".format(msg)) + + if self.punc_server: + msg["result"] = self.punc_server.run(msg["result"]) + + logger.info("client final receive msg={}".format(msg)) result = msg + return result +class ASRHttpHandler: + def __init__(self, server_ip=None, port=None): + """The ASR client http request + + Args: + server_ip (str, optional): the http asr server ip. Defaults to "127.0.0.1". + port (int, optional): the http asr server port. Defaults to 8090. + """ + super().__init__() + self.server_ip = server_ip + self.port = port + if server_ip is None or port is None: + self.url = None + else: + self.url = 'http://' + self.server_ip + ":" + str( + self.port) + '/paddlespeech/asr' + + def run(self, input, audio_format, sample_rate, lang): + """Call the http asr to process the audio + + Args: + input (str): the audio file path + audio_format (str): the audio format + sample_rate (str): the audio sample rate + lang (str): the audio language type + + Returns: + str: the final asr result + """ + if self.url is None: + logger.error( + "No punctuation server, please input valid ip and port") + return "" + + audio = wav2base64(input) + data = { + "audio": audio, + "audio_format": audio_format, + "sample_rate": sample_rate, + "lang": lang, + } + + res = requests.post(url=self.url, data=json.dumps(data)) + + return res.json() + + class TTSWsHandler: def __init__(self, server="127.0.0.1", port=8092, play: bool=False): """PaddleSpeech Online TTS Server Client audio handler diff --git a/paddlespeech/server/ws/asr_socket.py b/paddlespeech/server/ws/asr_socket.py index 10967f28..68686d3d 100644 --- a/paddlespeech/server/ws/asr_socket.py +++ b/paddlespeech/server/ws/asr_socket.py @@ -24,7 +24,7 @@ from paddlespeech.server.engine.engine_pool import get_engine_pool router = APIRouter() -@router.websocket('/ws/asr') +@router.websocket('/paddlespeech/asr/streaming') async def websocket_endpoint(websocket: WebSocket): """PaddleSpeech Online ASR Server api @@ -83,7 +83,7 @@ async def websocket_endpoint(websocket: WebSocket): resp = { "status": "ok", "signal": "finished", - 'asr_results': asr_results + 'result': asr_results } await websocket.send_json(resp) break @@ -102,7 +102,7 @@ async def websocket_endpoint(websocket: WebSocket): # return the current period result # if the engine create the vad instance, this connection will have many period results - resp = {'asr_results': asr_results} + resp = {'result': asr_results} await websocket.send_json(resp) except WebSocketDisconnect: pass diff --git a/setup.py b/setup.py index 34c0baa3..912fdd6d 100644 --- a/setup.py +++ b/setup.py @@ -73,8 +73,6 @@ server = [ "uvicorn", "pattern_singleton", "websockets", - "websocket", - "websocket-client", ] requirements = { diff --git a/speechx/README.md b/speechx/README.md index 34a66278..f75d8ac4 100644 --- a/speechx/README.md +++ b/speechx/README.md @@ -24,8 +24,6 @@ docker run --privileged --net=host --ipc=host -it --rm -v $PWD:/workspace --nam * More `Paddle` docker images you can see [here](https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/install/docker/linux-docker.html). -* If you want only work under cpu, please download corresponded [image](https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/install/docker/linux-docker.html), and using `docker` instead `nvidia-docker`. - 2. Build `speechx` and `examples`. 
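As a usage note for the client-side changes above: `ASRWsAudioHandler` (added in `paddlespeech/server/utils/audio_handler.py`) wraps the streaming exchange behind a single coroutine, sending a JSON `start` signal, the raw audio chunks, and then an `end` signal, which is the same handshake the C++ `websocket_server.cc` further below speaks. A minimal usage sketch, assuming a streaming ASR server is already listening on 127.0.0.1:8090 and a local `zh.wav` exists:

```python
# Minimal client sketch for the streaming ASR endpoint introduced in this patch.
# The server address, port and wav path are assumptions made for illustration.
import asyncio

from paddlespeech.server.utils.audio_handler import ASRWsAudioHandler

handler = ASRWsAudioHandler(
    url="127.0.0.1",
    port=8090,
    endpoint="/paddlespeech/asr/streaming",
    punc_server_ip=None,  # point these at a text server to restore punctuation
    punc_server_port=None)

loop = asyncio.get_event_loop()
msg = loop.run_until_complete(handler.run("./zh.wav"))  # start signal, chunks, end signal
print(msg["result"])
```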
diff --git a/speechx/examples/ds2_ol/aishell/run.sh b/speechx/examples/ds2_ol/aishell/run.sh
index 0d520278..b44200b0 100755
--- a/speechx/examples/ds2_ol/aishell/run.sh
+++ b/speechx/examples/ds2_ol/aishell/run.sh
@@ -79,6 +79,7 @@ if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
        --feature_wspecifier=ark,scp:$data/split${nj}/JOB/feat.ark,$data/split${nj}/JOB/feat.scp \
        --cmvn_file=$cmvn \
        --streaming_chunk=0.36
+    echo "feature extraction has finished!!!"
 fi

 if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
@@ -94,6 +95,8 @@ if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then

     cat $data/split${nj}/*/result > $exp/${label_file}
     utils/compute-wer.py --char=1 --v=1 $text $exp/${label_file} > $exp/${wer}
+    echo "ctc-prefix-beam-search-decoder-ol without lm has finished!!!"
+    echo "please checkout in ${exp}/${wer}"
 fi

 if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
@@ -110,6 +113,8 @@ if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then

     cat $data/split${nj}/*/result_lm > $exp/${label_file}_lm
     utils/compute-wer.py --char=1 --v=1 $text $exp/${label_file}_lm > $exp/${wer}.lm
+    echo "ctc-prefix-beam-search-decoder-ol with lm test has finished!!!"
+    echo "please checkout in ${exp}/${wer}.lm"
 fi

 wfst=$data/wfst/
@@ -139,6 +144,8 @@ if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then

     cat $data/split${nj}/*/result_tlg > $exp/${label_file}_tlg
     utils/compute-wer.py --char=1 --v=1 $text $exp/${label_file}_tlg > $exp/${wer}.tlg
+    echo "wfst-decoder-ol has finished!!!"
+    echo "please checkout in ${exp}/${wer}.tlg"
 fi

 if [ ${stage} -le 5 ] && [ ${stop_stage} -ge 5 ]; then
@@ -159,4 +166,6 @@ if [ ${stage} -le 5 ] && [ ${stop_stage} -ge 5 ]; then

     cat $data/split${nj}/*/result_recognizer > $exp/${label_file}_recognizer
     utils/compute-wer.py --char=1 --v=1 $text $exp/${label_file}_recognizer > $exp/${wer}.recognizer
+    echo "recognizer test has finished!!!"
+ echo "please checkout in ${exp}/${wer}.recognizer" fi diff --git a/speechx/speechx/decoder/ctc_tlg_decoder.cc b/speechx/speechx/decoder/ctc_tlg_decoder.cc index 7b720e7b..02e64316 100644 --- a/speechx/speechx/decoder/ctc_tlg_decoder.cc +++ b/speechx/speechx/decoder/ctc_tlg_decoder.cc @@ -48,6 +48,12 @@ void TLGDecoder::Reset() { } std::string TLGDecoder::GetFinalBestPath() { + if (frame_decoded_size_ == 0) { + // Assertion failed: (this->NumFramesDecoded() > 0 && "You cannot call + // BestPathEnd if no frames were decoded.") + return std::string(""); + } + decoder_->FinalizeDecoding(); kaldi::Lattice lat; kaldi::LatticeWeight weight; diff --git a/speechx/speechx/websocket/websocket_server.cc b/speechx/speechx/websocket/websocket_server.cc index 3f6da894..71a9e127 100644 --- a/speechx/speechx/websocket/websocket_server.cc +++ b/speechx/speechx/websocket/websocket_server.cc @@ -27,26 +27,27 @@ ConnectionHandler::ConnectionHandler( : ws_(std::move(socket)), recognizer_resource_(recognizer_resource) {} void ConnectionHandler::OnSpeechStart() { - LOG(INFO) << "Recieved speech start signal, start reading speech"; - got_start_tag_ = true; - json::value rv = {{"status", "ok"}, {"type", "server_ready"}}; - ws_.text(true); - ws_.write(asio::buffer(json::serialize(rv))); recognizer_ = std::make_shared(recognizer_resource_); // Start decoder thread decode_thread_ = std::make_shared( &ConnectionHandler::DecodeThreadFunc, this); + got_start_tag_ = true; + LOG(INFO) << "Server: Recieved speech start signal, start reading speech"; + json::value rv = {{"status", "ok"}, {"type", "server_ready"}}; + ws_.text(true); + ws_.write(asio::buffer(json::serialize(rv))); } void ConnectionHandler::OnSpeechEnd() { - LOG(INFO) << "Recieved speech end signal"; - CHECK(recognizer_ != nullptr); - recognizer_->SetFinished(); + LOG(INFO) << "Server: Recieved speech end signal"; + if (recognizer_ != nullptr) { + recognizer_->SetFinished(); + } got_end_tag_ = true; } void ConnectionHandler::OnFinalResult(const std::string& result) { - LOG(INFO) << "Final result: " << result; + LOG(INFO) << "Server: Final result: " << result; json::value rv = { {"status", "ok"}, {"type", "final_result"}, {"result", result}}; ws_.text(true); @@ -69,10 +70,16 @@ void ConnectionHandler::OnSpeechData(const beast::flat_buffer& buffer) { pcm_data(i) = static_cast(*pdata); pdata++; } - VLOG(2) << "Recieved " << num_samples << " samples"; - LOG(INFO) << "Recieved " << num_samples << " samples"; + VLOG(2) << "Server: Recieved " << num_samples << " samples"; + LOG(INFO) << "Server: Recieved " << num_samples << " samples"; CHECK(recognizer_ != nullptr); recognizer_->Accept(pcm_data); + + // TODO: return lpartial result + json::value rv = { + {"status", "ok"}, {"type", "partial_result"}, {"result", "TODO"}}; + ws_.text(true); + ws_.write(asio::buffer(json::serialize(rv))); } void ConnectionHandler::DecodeThreadFunc() { @@ -80,9 +87,9 @@ void ConnectionHandler::DecodeThreadFunc() { while (true) { recognizer_->Decode(); if (recognizer_->IsFinished()) { - LOG(INFO) << "enter finish"; + LOG(INFO) << "Server: enter finish"; recognizer_->Decode(); - LOG(INFO) << "finish"; + LOG(INFO) << "Server: finish"; std::string result = recognizer_->GetFinalResult(); OnFinalResult(result); OnFinish(); @@ -135,7 +142,7 @@ void ConnectionHandler::operator()() { ws_.read(buffer); if (ws_.got_text()) { std::string message = beast::buffers_to_string(buffer.data()); - LOG(INFO) << message; + LOG(INFO) << "Server: Text: " << message; OnText(message); if (got_end_tag_) { break; 
@@ -152,7 +159,7 @@ void ConnectionHandler::operator()() {
         }
     }

-    LOG(INFO) << "Read all pcm data, wait for decoding thread";
+    LOG(INFO) << "Server: read all pcm data, waiting for the decoding thread to join.";
     if (decode_thread_ != nullptr) {
         decode_thread_->join();
     }
diff --git a/tests/unit/cli/test_cli.sh b/tests/unit/cli/test_cli.sh
index 59f31516..bdf05524 100755
--- a/tests/unit/cli/test_cli.sh
+++ b/tests/unit/cli/test_cli.sh
@@ -14,22 +14,24 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
 paddlespeech asr --input ./zh.wav
 paddlespeech asr --model conformer_aishell --input ./zh.wav
 paddlespeech asr --model conformer_online_aishell --input ./zh.wav
+paddlespeech asr --model conformer_online_multicn --input ./zh.wav
 paddlespeech asr --model transformer_librispeech --lang en --input ./en.wav
 paddlespeech asr --model deepspeech2offline_aishell --input ./zh.wav
+paddlespeech asr --model deepspeech2online_wenetspeech --input ./zh.wav
 paddlespeech asr --model deepspeech2online_aishell --input ./zh.wav
 paddlespeech asr --model deepspeech2offline_librispeech --lang en --input ./en.wav

 # long audio restriction
 {
-wget -c wget https://paddlespeech.bj.bcebos.com/datasets/single_wav/zh/test_long_audio_01.wav
+wget -c https://paddlespeech.bj.bcebos.com/datasets/single_wav/zh/test_long_audio_01.wav
 paddlespeech asr --input test_long_audio_01.wav
 if [ $? -ne 255 ]; then
-    echo "Time restriction not passed"
+    echo -e "\e[1;31mTime restriction not passed\e[0m"
     exit 1
 fi
 } && {
-    echo "Time restriction passed"
+    echo -e "\033[32mTime restriction passed\033[0m"
 }

 # Text To Speech
@@ -77,4 +79,4 @@ paddlespeech stats --task vector
 paddlespeech stats --task st

-echo "Test success !!!"
+echo -e "\033[32mTest success !!!\033[0m"
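The extended `test_cli.sh` above exercises the newly registered models from the command line. A rough Python-API counterpart is sketched below; the `ASRExecutor` import path and call signature are taken from the general `paddlespeech.cli` layout rather than from this patch, so treat them as assumptions that may differ slightly between releases.

```python
# Sketch of the Python-API equivalent of the new CLI checks (not part of this patch).
from paddlespeech.cli.asr.infer import ASRExecutor

asr = ASRExecutor()
# "deepspeech2online_wenetspeech" and "conformer_online_multicn" are the model tags
# this patch registers in paddlespeech/cli/asr/pretrained_models.py.
text = asr(
    audio_file="./zh.wav",
    model="deepspeech2online_wenetspeech",
    lang="zh",
    sample_rate=16000)
print(text)
```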