diff --git a/README.md b/README.md index 330da1a9..e93aa1d9 100644 --- a/README.md +++ b/README.md @@ -25,7 +25,7 @@
| Documents
| Models List
| AIStudio Courses
- | NAACL2022 Paper
+ | NAACL2022 Best Demo Award Paper
| Gitee
@@ -34,7 +34,7 @@
**PaddleSpeech** is an open-source toolkit on [PaddlePaddle](https://github.com/PaddlePaddle/Paddle) platform for a variety of critical tasks in speech and audio, with state-of-the-art and influential models.
-**PaddleSpeech** won the [NAACL2022 Best Demo Award](https://2022.naacl.org/blog/best-demo-award/).
+**PaddleSpeech** won the [NAACL2022 Best Demo Award](https://2022.naacl.org/blog/best-demo-award/). Please check out our paper on [Arxiv](https://arxiv.org/abs/2205.12007).
##### Speech Recognition
@@ -179,7 +179,7 @@ Via the easy-to-use, efficient, flexible and scalable implementation, our vision
## Installation
-We strongly recommend our users to install PaddleSpeech in **Linux** with *python>=3.7*.
+We strongly recommend that users install PaddleSpeech in **Linux** with *python>=3.7* and *paddlepaddle>=2.3.1*.
Up to now, **Linux** supports CLI for all our tasks; **Mac OSX** and **Windows** only support PaddleSpeech CLI for Audio Classification, Speech-to-Text and Text-to-Speech. To install `PaddleSpeech`, please see [installation](./docs/source/install.md).
diff --git a/README_cn.md b/README_cn.md index 8df38602..896c575c 100644 --- a/README_cn.md +++ b/README_cn.md @@ -20,7 +20,8 @@

- 快速开始
+ 安装
+ | 快速开始
| 快速使用服务
| 快速使用流式服务
| 教程文档
@@ -36,8 +37,10 @@
**PaddleSpeech** 是基于飞桨 [PaddlePaddle](https://github.com/PaddlePaddle/Paddle) 的语音方向的开源模型库,用于语音和音频中的各种关键任务的开发,包含大量基于深度学习前沿和有影响力的模型,一些典型的应用示例如下:
-**PaddleSpeech** 荣获 [NAACL2022 Best Demo Award](https://2022.naacl.org/blog/best-demo-award/).
+**PaddleSpeech** 荣获 [NAACL2022 Best Demo Award](https://2022.naacl.org/blog/best-demo-award/),详情请查阅我们发表在 [Arxiv](https://arxiv.org/abs/2205.12007) 上的论文。
+### 效果展示
+
##### 语音识别
@@ -154,7 +157,7 @@ 本项目采用了易用、高效、灵活以及可扩展的实现,旨在为工业应用、学术研究提供更好的支持,实现的功能包含训练、推断以及测试模块,以及部署过程,主要包括 - 📦 **易用性**: 安装门槛低,可使用 [CLI](#quick-start) 快速开始。 - 🏆 **对标 SoTA**: 提供了高速、轻量级模型,且借鉴了最前沿的技术。 -- 🏆 **流式ASR和TTS系统**:工业级的端到端流式识别、流式合成系统。 +- 🏆 **流式 ASR 和 TTS 系统**:工业级的端到端流式识别、流式合成系统。 - 💯 **基于规则的中文前端**: 我们的前端包含文本正则化和字音转换(G2P)。此外,我们使用自定义语言规则来适应中文语境。 - **多种工业界以及学术界主流功能支持**: - 🛎️ 典型音频任务: 本工具包提供了音频任务如音频分类、语音翻译、自动语音识别、文本转语音、语音合成、声纹识别、KWS等任务的实现。 @@ -182,61 +185,195 @@
+
## 安装
我们强烈建议用户在 **Linux** 环境下,*3.7* 以上版本的 *python* 上安装 PaddleSpeech。
-目前为止,**Linux** 支持声音分类、语音识别、语音合成和语音翻译四种功能,**Mac OSX、 Windows** 下暂不支持语音翻译功能。 想了解具体安装细节,可以参考[安装文档](./docs/source/install_cn.md)。
+
+### 相关依赖
++ gcc >= 4.8.5
++ paddlepaddle >= 2.3.1
++ python >= 3.7
++ linux(推荐), mac, windows
+
+PaddleSpeech 依赖于 paddlepaddle,安装时可以参考 [paddlepaddle 官网](https://www.paddlepaddle.org.cn/),根据自己机器的情况选择合适的版本。这里给出 CPU 版本示例:
+
+```shell
+pip install paddlepaddle -i https://mirror.baidu.com/pypi/simple
+```
+
+PaddleSpeech 快速安装方式有两种:一种是 pip 安装,一种是源码编译(推荐)。
+
+### pip 安装
+```shell
+pip install pytest-runner
+pip install paddlespeech
+```
+
+### 源码编译
+```shell
+git clone https://github.com/PaddlePaddle/PaddleSpeech.git
+cd PaddleSpeech
+pip install pytest-runner
+pip install .
+```
+
+更多安装问题,如 conda 环境、librosa 依赖的系统库、gcc 环境、kaldi 安装等,可以参考这篇[安装文档](docs/source/install_cn.md);如果安装时遇到问题,可以在 [#2150](https://github.com/PaddlePaddle/PaddleSpeech/issues/2150) 上留言或查找相关问题。
## 快速开始
-安装完成后,开发者可以通过命令行快速开始,改变 `--input` 可以尝试用自己的音频或文本测试。
+安装完成后,开发者可以通过命令行或者 Python 快速开始,命令行模式下改变 `--input` 可以尝试用自己的音频或文本测试,支持 16k 采样率的 wav 格式音频。
+
+你也可以在 `aistudio` 中快速体验 👉🏻[PaddleSpeech API Demo](https://aistudio.baidu.com/aistudio/projectdetail/4281335?shared=1)。
-**声音分类**
+测试音频示例下载
```shell
-paddlespeech cls --input input.wav
+wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav
+wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/en.wav
```
-**声纹识别**
+
+### 语音识别
+
 (点击可展开)开源中文语音识别 + +命令行一键体验 + ```shell -paddlespeech vector --task spk --input input_16k.wav +paddlespeech asr --lang zh --input zh.wav +``` + +Python API 一键预测 + +```python +>>> from paddlespeech.cli.asr.infer import ASRExecutor +>>> asr = ASRExecutor() +>>> result = asr(audio_file="zh.wav") +>>> print(result) +我认为跑步最重要的就是给我带来了身体健康 ``` -**语音识别** +
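如果需要批量处理多段音频,可以复用同一个 executor,模型只会加载一次。下面是一个示意写法(假设 `ASRExecutor` 的调用参数与命令行参数一致,即同样支持 `model`、`lang`、`sample_rate` 关键字,实际以源码为准):

```python
from paddlespeech.cli.asr.infer import ASRExecutor

# 复用同一个 executor,避免每段音频都重新加载模型
asr = ASRExecutor()

for wav in ["zh.wav", "zh.wav"]:
    # 假设:model/lang/sample_rate 关键字与命令行的 --model/--lang/--sample_rate 参数对应
    text = asr(audio_file=wav, model="conformer_wenetspeech", lang="zh", sample_rate=16000)
    print(wav, "->", text)
```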
+ +### 语音合成 + +
 开源中文语音合成 + +输出 24k 采样率wav格式音频 + + +命令行一键体验 + ```shell -paddlespeech asr --lang zh --input input_16k.wav +paddlespeech tts --input "你好,欢迎使用百度飞桨深度学习框架!" --output output.wav +``` + +Python API 一键预测 + +```python +>>> from paddlespeech.cli.tts.infer import TTSExecutor +>>> tts = TTSExecutor() +>>> tts(text="今天天气十分不错。", output="output.wav") ``` -**语音翻译** (English to Chinese) +- 语音合成的 web demo 已经集成进了 [Huggingface Spaces](https://huggingface.co/spaces). 请参考: [TTS Demo](https://huggingface.co/spaces/KPatrick/PaddleSpeechTTS) + +
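合成结束后可以检查生成的音频文件。下面是一个最小示意(假设环境中已安装 `soundfile`,它是 PaddleSpeech 的依赖之一),用于验证本节所述的 24k 采样率:

```python
import soundfile as sf

from paddlespeech.cli.tts.infer import TTSExecutor

tts = TTSExecutor()
tts(text="今天天气十分不错。", output="output.wav")

# 读取合成结果,确认采样率为 24000 Hz
wav, sr = sf.read("output.wav")
print(sr, wav.shape)
```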
+ +### 声音分类 + +
 适配多场景的开放领域声音分类工具
+
+基于 AudioSet 数据集 527 个类别的声音分类模型
+
+命令行一键体验
+
```shell
-paddlespeech st --input input_16k.wav
+paddlespeech cls --input zh.wav
```
+
+Python API 一键预测
+
+```python
+>>> from paddlespeech.cli.cls.infer import CLSExecutor
+>>> cls = CLSExecutor()
+>>> result = cls(audio_file="zh.wav")
+>>> print(result)
+Speech 0.9027186632156372
+```
+
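命令行还支持 `--topk` 参数输出多个候选类别;下面的 Python 写法是一个示意(假设 `CLSExecutor` 的调用同样接受与命令行对应的 `topk` 关键字参数),并把返回的文本结果解析成标签和分数:

```python
from paddlespeech.cli.cls.infer import CLSExecutor

cls = CLSExecutor()
# 假设:topk 关键字与命令行的 --topk 参数对应
result = cls(audio_file="zh.wav", topk=3)

# 返回结果形如每行 "标签 分数",逐行解析
for line in str(result).splitlines():
    label, score = line.rsplit(maxsplit=1)
    print(f"{label}: {float(score):.4f}")
```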
+ +### 声纹提取 + +
 工业级声纹提取工具 + +命令行一键体验 + ```shell -paddlespeech tts --input "你好,欢迎使用百度飞桨深度学习框架!" --output output.wav +paddlespeech vector --task spk --input zh.wav ``` -- 语音合成的 web demo 已经集成进了 [Huggingface Spaces](https://huggingface.co/spaces). 请参考: [TTS Demo](https://huggingface.co/spaces/akhaliq/paddlespeech) -**文本后处理** - - 标点恢复 - ```bash - paddlespeech text --task punc --input 今天的天气真不错啊你下午有空吗我想约你一起去吃饭 - ``` +Python API 一键预测 -**批处理** +```python +>>> from paddlespeech.cli.vector import VectorExecutor +>>> vec = VectorExecutor() +>>> result = vec(audio_file="zh.wav") +>>> print(result) # 187维向量 +[ -0.19083306 9.474295 -14.122263 -2.0916545 0.04848729 + 4.9295826 1.4780062 0.3733844 10.695862 3.2697146 + -4.48199 -0.6617882 -9.170393 -11.1568775 -1.2358263 ...] ``` -echo -e "1 欢迎光临。\n2 谢谢惠顾。" | paddlespeech tts + +
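拿到两段音频的声纹向量后,可以进一步计算相似度来做说话人比对。下面是一个示意(只使用上文已有的 `vec(audio_file=...)` 调用,余弦相似度用 numpy 手工计算):

```python
import numpy as np

from paddlespeech.cli.vector import VectorExecutor

vec = VectorExecutor()
emb1 = np.asarray(vec(audio_file="zh.wav"))
emb2 = np.asarray(vec(audio_file="zh.wav"))

# 余弦相似度:越接近 1,两段音频越可能来自同一说话人
score = np.dot(emb1, emb2) / (np.linalg.norm(emb1) * np.linalg.norm(emb2))
print(f"cosine similarity: {score:.4f}")
```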
+ +### 标点恢复 + +
 一键恢复文本标点,可与 ASR 模型配合使用
+
+命令行一键体验
+
+```shell
+paddlespeech text --task punc --input 今天的天气真不错啊你下午有空吗我想约你一起去吃饭
+```
+
+Python API 一键预测
+
+```python
+>>> from paddlespeech.cli.text.infer import TextExecutor
+>>> text_punc = TextExecutor()
+>>> result = text_punc(text="今天的天气真不错啊你下午有空吗我想约你一起去吃饭")
+>>> print(result)
+今天的天气真不错啊!你下午有空吗?我想约你一起去吃饭。
```
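标点恢复通常接在语音识别之后使用。与前文 shell 管道 `paddlespeech asr --input ./zh.wav | paddlespeech text --task punc` 等价的 Python 写法示意如下:

```python
from paddlespeech.cli.asr.infer import ASRExecutor
from paddlespeech.cli.text.infer import TextExecutor

asr = ASRExecutor()
text_punc = TextExecutor()

# 先识别出无标点文本,再恢复标点
raw_text = asr(audio_file="zh.wav")
print(text_punc(text=raw_text))
```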
+ +### 语音翻译 + +
 端到端英译中语音翻译工具
+
+使用预编译的 kaldi 相关工具,只支持在 Ubuntu 系统中体验
+
+命令行一键体验
+
```shell
+paddlespeech st --input en.wav
```
+
+Python API 一键预测
+
+```python
+>>> from paddlespeech.cli.st.infer import STExecutor
+>>> st = STExecutor()
+>>> result = st(audio_file="en.wav")
+>>> print(result)
+['我 在 这栋 建筑 的 古老 门上 敲门 。']
```
-更多命令行命令请参考 [demos](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/demos)
-> Note: 如果需要训练或者微调,请查看[语音识别](./docs/source/asr/quick_start.md), [语音合成](./docs/source/tts/quick_start.md)。
+
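返回值是一个译文列表(每条输入对应一条译文),可以直接迭代。下面是一个简单的示意:

```python
from paddlespeech.cli.st.infer import STExecutor

st = STExecutor()
# 逐句打印翻译结果
for sentence in st(audio_file="en.wav"):
    print(sentence)
```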
+
+
## 快速使用服务
-安装完成后,开发者可以通过命令行快速使用服务。
+安装完成后,开发者可以通过命令行一键启动语音识别、语音合成、音频分类三种服务。
**启动服务**
```shell
@@ -614,6 +751,7 @@
PaddleSpeech 的 **语音合成** 主要包含三个模块:文本前端、声
语音合成模块最初被称为 [Parakeet](https://github.com/PaddlePaddle/Parakeet),现在与此仓库合并。如果您对该任务的学术研究感兴趣,请参阅 [TTS 研究概述](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/docs/source/tts#overview)。此外,[模型介绍](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/tts/models_introduction.md) 是了解语音合成流程的一个很好的指南。
+
## ⭐ 应用案例
- **[PaddleBoBo](https://github.com/JiehangXie/PaddleBoBo): 使用 PaddleSpeech 的语音合成模块生成虚拟人的声音。**
diff --git a/demos/README.md b/demos/README.md index 2a306df6..72b70b23 100644 --- a/demos/README.md +++ b/demos/README.md @@ -12,6 +12,7 @@ This directory contains many speech applications in multiple scenarios.
* speech recognition - recognize text of an audio file
* speech server - Server for Speech Task, e.g. ASR,TTS,CLS
* streaming asr server - receive audio stream from websocket, and recognize to transcript.
+* streaming tts server - receive text from http or websocket, and stream back the synthesized audio data.
* speech translation - end to end speech translation
* story talker - book reader based on OCR and TTS
* style_fs2 - multi style control for FastSpeech2 model
diff --git a/demos/README_cn.md b/demos/README_cn.md index 47134212..04fc1fa7 100644 --- a/demos/README_cn.md +++ b/demos/README_cn.md @@ -10,8 +10,9 @@
* 元宇宙 - 基于语音合成的 2D 增强现实。
* 标点恢复 - 通常作为语音识别的文本后处理任务,为一段无标点的纯文本添加相应的标点符号。
* 语音识别 - 识别一段音频中包含的语音文字。
-* 语音服务 - 离线语音服务,包括ASR、TTS、CLS等
-* 流式语音识别服务 - 流式输入语音数据流识别音频中的文字
+* 语音服务 - 离线语音服务,包括 ASR、TTS、CLS 等。
+* 流式语音识别服务 - 流式输入语音数据流识别音频中的文字。
+* 流式语音合成服务 - 根据待合成文本流式生成合成音频数据流。
* 语音翻译 - 实时识别音频中的语言,并同时翻译成目标语言。
* 会说话的故事书 - 基于 OCR 和语音合成的会说话的故事书。
* 个性化语音合成 - 基于 FastSpeech2 模型的个性化语音合成。
diff --git a/demos/custom_streaming_asr/setup_docker.sh b/demos/custom_streaming_asr/setup_docker.sh old mode 100644 new mode 100755
diff --git a/demos/keyword_spotting/run.sh b/demos/keyword_spotting/run.sh old mode 100644 new mode 100755
diff --git a/demos/speaker_verification/run.sh b/demos/speaker_verification/run.sh old mode 100644 new mode 100755
diff --git a/demos/speech_recognition/run.sh b/demos/speech_recognition/run.sh old mode 100644 new mode 100755 index 19ce0ebb..e48ff3e9 --- a/demos/speech_recognition/run.sh +++ b/demos/speech_recognition/run.sh @@ -1,6 +1,7 @@
#!/bin/bash
-wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespeech.bj.bcebos.com/PaddleAudio/en.wav
+wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav
+wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/en.wav
# asr
paddlespeech asr --input ./zh.wav
@@ -8,3 +9,18 @@ paddlespeech asr --input ./zh.wav
# asr + punc
paddlespeech asr --input ./zh.wav | paddlespeech text --task punc
+
+
+# asr help
+paddlespeech asr --help
+
+
+# english asr
+paddlespeech asr --lang en --model transformer_librispeech --input ./en.wav
+
+# model stats
+paddlespeech stats --task asr
+
+
+# paddlespeech help
+paddlespeech --help
diff --git a/demos/speech_server/README.md b/demos/speech_server/README.md index dbbf9765..65a12940 --- a/demos/speech_server/README.md +++ b/demos/speech_server/README.md @@ -14,7 +14,10 @@ For service interface definition, please check:
see [installation](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install.md). It is recommended to use **paddlepaddle 2.3.1** or above.
-You can choose one way from meduim and hard to install paddlespeech.
+
+You can choose one way from easy, medium and hard to install PaddleSpeech.
+
+**If you install in easy mode, you need to prepare the yaml file by yourself, you can refer to the yaml file in the conf directory.**
### 2. Prepare config File
The configuration file can be found in `conf/application.yaml` .
diff --git a/demos/speech_server/README_cn.md b/demos/speech_server/README_cn.md index 9ed9175d..d21a53b0 --- a/demos/speech_server/README_cn.md +++ b/demos/speech_server/README_cn.md @@ -3,8 +3,10 @@
# 语音服务
## 介绍
+
这个 demo 是一个启动离线语音服务和访问服务的实现。它可以通过使用 `paddlespeech_server` 和 `paddlespeech_client` 的单个命令或 python 的几行代码来实现。
+
服务接口定义请参考:
- [PaddleSpeech Server RESTful API](https://github.com/PaddlePaddle/PaddleSpeech/wiki/PaddleSpeech-Server-RESTful-API)
@@ -13,12 +15,17 @@
请看 [安装文档](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install.md). 推荐使用 **paddlepaddle 2.3.1** 或以上版本。
-你可以从 medium,hard 两种方式中选择一种方式安装 PaddleSpeech。
+
+你可以从简单、中等、困难几种方式中选择一种方式安装 PaddleSpeech。
+
+**如果使用简单模式安装,需要自行准备 yaml 文件,可参考 conf 目录下的 yaml 文件。**
### 2. 准备配置文件
配置文件可参见 `conf/application.yaml` 。
-其中,`engine_list`表示即将启动的服务将会包含的语音引擎,格式为 <语音任务>_<引擎类型>。
+其中,`engine_list` 表示即将启动的服务将会包含的语音引擎,格式为 <语音任务>_<引擎类型>。
+
目前服务集成的语音任务有: asr (语音识别)、tts (语音合成)、cls (音频分类)、vector (声纹识别)以及 text (文本处理)。
+
目前引擎类型支持两种形式:python 及 inference (Paddle Inference)
**注意:** 如果在容器里可正常启动服务,但客户端访问 ip 不可达,可尝试将配置文件中 `host` 地址换成本地 ip 地址。
diff --git a/demos/speech_server/asr_client.sh b/demos/speech_server/asr_client.sh old mode 100644 new mode 100755
diff --git a/demos/speech_server/cls_client.sh b/demos/speech_server/cls_client.sh old mode 100644 new mode 100755
diff --git a/demos/speech_server/server.sh b/demos/speech_server/server.sh old mode 100644 new mode 100755 index e5961286..fd719ffc --- a/demos/speech_server/server.sh +++ b/demos/speech_server/server.sh @@ -1,3 +1,3 @@
#!/bin/bash
-paddlespeech_server start --config_file ./conf/application.yaml
+paddlespeech_server start --config_file ./conf/application.yaml &> server.log &
diff --git a/demos/speech_server/sid_client.sh b/demos/speech_server/sid_client.sh new file mode 100755 index 00000000..99bab21a
+#!/bin/bash
+
+wget -c https://paddlespeech.bj.bcebos.com/vector/audio/85236145389.wav
+wget -c https://paddlespeech.bj.bcebos.com/vector/audio/123456789.wav
+
+# sid extract
+paddlespeech_client vector --server_ip 127.0.0.1 --port 8090 --task spk --input ./85236145389.wav
+
+# sid score
+paddlespeech_client vector --server_ip 127.0.0.1 --port 8090 --task score --enroll ./85236145389.wav --test ./123456789.wav
diff --git a/demos/speech_server/text_client.sh b/demos/speech_server/text_client.sh new file mode 100755 index 00000000..098f159f
+#!/bin/bash
+
+
+paddlespeech_client text --server_ip 127.0.0.1 --port 8090 --input 今天的天气真好啊你下午有空吗我想约你一起去吃饭
diff --git a/demos/speech_server/tts_client.sh b/demos/speech_server/tts_client.sh old mode 100644 new mode 100755
diff --git a/demos/speech_web/README.md b/demos/speech_web/README.md index ded78a6e..3b2da6e9 --- a/demos/speech_web/README.md +++ b/demos/speech_web/README.md @@ -1,6 +1,6 @@
# Paddle Speech Demo
-PaddleSpeechDemo 是一个以 PaddleSpeech 的语音交互功能为主体开发的Demo展示项目,用于帮助大家更好的上手PaddleSpeech 以及使用 PaddleSpeech 构建自己的应用。
+PaddleSpeechDemo 是一个以 PaddleSpeech 的语音交互功能为主体开发的 Demo 展示项目,用于帮助大家更好地上手 PaddleSpeech 以及使用 PaddleSpeech 构建自己的应用。
智能语音交互部分使用 PaddleSpeech,对话以及信息抽取部分使用 PaddleNLP,网页前端展示部分基于
Vue3 进行开发
diff --git a/demos/speech_web/web_client/package-lock.json b/demos/speech_web/web_client/package-lock.json index f1c77978..509be385 100644 --- a/demos/speech_web/web_client/package-lock.json +++ b/demos/speech_web/web_client/package-lock.json @@ -747,9 +747,9 @@
} },
"node_modules/moment": {
- "version": "2.29.3",
- "resolved": "https://registry.npmmirror.com/moment/-/moment-2.29.3.tgz",
- "integrity": "sha512-c6YRvhEo//6T2Jz/vVtYzqBzwvPT95JBQ+smCytzf7c50oMZRsR/a4w88aD34I+/QVSfnoAnSBFPJHItlOMJVw==",
+ "version": "2.29.4",
+ "resolved": "https://registry.npmjs.org/moment/-/moment-2.29.4.tgz",
+ "integrity": "sha512-5LC9SOxjSc2HF6vO2CyuTDNivEdoz2IvyJJGj6X8DJ0eFyfszE0QiEd+iXmBvUP3WHxSjFH/vIsA0EN00cgr8w==",
"engines": { "node": "*" }
@@ -1636,9 +1636,9 @@
"optional": true },
"moment": {
- "version": "2.29.3",
- "resolved": "https://registry.npmmirror.com/moment/-/moment-2.29.3.tgz",
- "integrity": "sha512-c6YRvhEo//6T2Jz/vVtYzqBzwvPT95JBQ+smCytzf7c50oMZRsR/a4w88aD34I+/QVSfnoAnSBFPJHItlOMJVw=="
+ "version": "2.29.4",
+ "resolved": "https://registry.npmjs.org/moment/-/moment-2.29.4.tgz",
+ "integrity": "sha512-5LC9SOxjSc2HF6vO2CyuTDNivEdoz2IvyJJGj6X8DJ0eFyfszE0QiEd+iXmBvUP3WHxSjFH/vIsA0EN00cgr8w=="
},
"nanoid": { "version": "3.3.2",
diff --git a/demos/speech_web/web_client/yarn.lock b/demos/speech_web/web_client/yarn.lock index 4504eab3..6777cf4c 100644 --- a/demos/speech_web/web_client/yarn.lock +++ b/demos/speech_web/web_client/yarn.lock @@ -587,9 +587,9 @@ mime@^1.4.1:
integrity sha512-x0Vn8spI+wuJ1O6S7gnbaQg8Pxh4NNHb7KSINmEWKiPE4RKOplvijn+NkmYmmRgP68mc70j2EbeTFRsrswaQeg==
moment@^2.27.0:
- version "2.29.3"
- resolved "https://registry.npmmirror.com/moment/-/moment-2.29.3.tgz"
- integrity sha512-c6YRvhEo//6T2Jz/vVtYzqBzwvPT95JBQ+smCytzf7c50oMZRsR/a4w88aD34I+/QVSfnoAnSBFPJHItlOMJVw==
+ version "2.29.4"
+ resolved "https://registry.yarnpkg.com/moment/-/moment-2.29.4.tgz#3dbe052889fe7c1b2ed966fcb3a77328964ef108"
+ integrity sha512-5LC9SOxjSc2HF6vO2CyuTDNivEdoz2IvyJJGj6X8DJ0eFyfszE0QiEd+iXmBvUP3WHxSjFH/vIsA0EN00cgr8w==
ms@^2.1.1: version "2.1.3"
diff --git a/demos/streaming_asr_server/README.md b/demos/streaming_asr_server/README.md index 3ada1b8d..ae66cae4 --- a/demos/streaming_asr_server/README.md +++ b/demos/streaming_asr_server/README.md @@ -15,7 +15,10 @@ Streaming ASR server only supports `websocket` protocol, and doesn't support `http` protocol.
see [installation](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install.md). It is recommended to use **paddlepaddle 2.3.1** or above.
-You can choose one way from meduim and hard to install paddlespeech.
+
+You can choose one way from easy, medium and hard to install PaddleSpeech.
+
+**If you install in easy mode, you need to prepare the yaml file by yourself, you can refer to the yaml file in the conf directory.**
### 2. Prepare config File
The configuration file can be found in `conf/ws_application.yaml` and `conf/ws_conformer_wenetspeech_application.yaml`.
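Once the streaming server is running, you can also call it from Python instead of the `paddlespeech_client asr_online` command line. The following is a minimal sketch, assuming the Python client class `ASROnlineClientExecutor` in `paddlespeech.server.bin.paddlespeech_client` accepts keyword arguments mirroring the CLI flags (this is an assumption; check the source for the exact API):

```python
from paddlespeech.server.bin.paddlespeech_client import ASROnlineClientExecutor

# Hypothetical usage sketch: keyword arguments are assumed to mirror the
# `paddlespeech_client asr_online` flags (--server_ip, --port, --input).
asr_client = ASROnlineClientExecutor()
result = asr_client(input="zh.wav", server_ip="127.0.0.1", port=8090)
print(result)
```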
diff --git a/demos/streaming_asr_server/README_cn.md b/demos/streaming_asr_server/README_cn.md index e4a7ef64..55acc07c 100644 --- a/demos/streaming_asr_server/README_cn.md +++ b/demos/streaming_asr_server/README_cn.md @@ -3,12 +3,11 @@
# 流式语音识别服务
## 介绍
-这个demo是一个启动流式语音服务和访问服务的实现。 它可以通过使用 `paddlespeech_server` 和 `paddlespeech_client`的单个命令或 python 的几行代码来实现。
+这个 demo 是一个启动流式语音服务和访问服务的实现。 它可以通过使用 `paddlespeech_server` 和 `paddlespeech_client` 的单个命令或 Python 的几行代码来实现。
**流式语音识别服务只支持 `websocket` 协议,不支持 `http` 协议。**
-
-For service interface definition, please check:
+服务接口定义请参考:
- [PaddleSpeech Streaming Server WebSocket API](https://github.com/PaddlePaddle/PaddleSpeech/wiki/PaddleSpeech-Server-WebSocket-API)
## 使用方法
@@ -16,7 +15,10 @@
安装 PaddleSpeech 的详细过程请看 [安装文档](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install.md)。 推荐使用 **paddlepaddle 2.3.1** 或以上版本。
-你可以从medium,hard 两种方式中选择一种方式安装 PaddleSpeech。
+
+你可以从简单、中等、困难几种方式中选择一种方式安装 PaddleSpeech。
+
+**如果使用简单模式安装,需要自行准备 yaml 文件,可参考 conf 目录下的 yaml 文件。**
### 2. 准备配置文件
diff --git a/demos/streaming_asr_server/punc_server.py b/demos/streaming_asr_server/local/punc_server.py similarity index 100% rename from demos/streaming_asr_server/punc_server.py rename to demos/streaming_asr_server/local/punc_server.py
diff --git a/demos/streaming_asr_server/local/rtf_from_log.py b/demos/streaming_asr_server/local/rtf_from_log.py index 4f30d640..4b89b48f 100755 --- a/demos/streaming_asr_server/local/rtf_from_log.py +++ b/demos/streaming_asr_server/local/rtf_from_log.py @@ -38,4 +38,4 @@ if __name__ == '__main__':
T += m['T']
P += m['P']
- print(f"RTF: {P/T}")
+ print(f"RTF: {P/T}, utts: {n}")
diff --git a/demos/streaming_asr_server/streaming_asr_server.py b/demos/streaming_asr_server/local/streaming_asr_server.py similarity index 100% rename from demos/streaming_asr_server/streaming_asr_server.py rename to demos/streaming_asr_server/local/streaming_asr_server.py
diff --git a/demos/streaming_asr_server/run.sh b/demos/streaming_asr_server/run.sh old mode 100644 new mode 100755
diff --git a/demos/streaming_asr_server/server.sh b/demos/streaming_asr_server/server.sh index f532546e..961cb046 100755 --- a/demos/streaming_asr_server/server.sh +++ b/demos/streaming_asr_server/server.sh @@ -1,9 +1,8 @@
-export CUDA_VISIBLE_DEVICE=0,1,2,3
-
 export CUDA_VISIBLE_DEVICE=0,1,2,3
+#export CUDA_VISIBLE_DEVICES=0,1,2,3
-# nohup python3 punc_server.py --config_file conf/punc_application.yaml > punc.log 2>&1 &
+# nohup python3 local/punc_server.py --config_file conf/punc_application.yaml > punc.log 2>&1 &
paddlespeech_server start --config_file conf/punc_application.yaml &> punc.log &
-# nohup python3 streaming_asr_server.py --config_file conf/ws_conformer_wenetspeech_application.yaml > streaming_asr.log 2>&1 &
+# nohup python3 local/streaming_asr_server.py --config_file conf/ws_conformer_wenetspeech_application.yaml > streaming_asr.log 2>&1 &
paddlespeech_server start --config_file conf/ws_conformer_wenetspeech_application.yaml &> streaming_asr.log &
diff --git a/demos/streaming_asr_server/test.sh b/demos/streaming_asr_server/test.sh index 67a5ec4c..386c7f89 100755 --- a/demos/streaming_asr_server/test.sh +++ b/demos/streaming_asr_server/test.sh @@ -7,5 +7,5 @@ paddlespeech_client asr_online --server_ip 127.0.0.1 --port 8090 --input ./zh.wa
# read the wav and call streaming and punc service
# If `127.0.0.1` is not accessible, you need to use the actual service IP address.
-paddlespeech_client asr_online --server_ip 127.0.0.1 --port 8290 --punc.server_ip 127.0.0.1 --punc.port 8190 --input ./zh.wav
+paddlespeech_client asr_online --server_ip 127.0.0.1 --port 8090 --punc.server_ip 127.0.0.1 --punc.port 8190 --input ./zh.wav
diff --git a/demos/streaming_tts_server/README.md b/demos/streaming_tts_server/README.md index f708fd31..53a33f3c --- a/demos/streaming_tts_server/README.md +++ b/demos/streaming_tts_server/README.md @@ -15,7 +15,10 @@ For service interface definition, please check:
see [installation](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install.md). It is recommended to use **paddlepaddle 2.3.1** or above.
-You can choose one way from meduim and hard to install paddlespeech.
+
+You can choose one way from easy, medium and hard to install PaddleSpeech.
+
+**If you install in easy mode, you need to prepare the yaml file by yourself, you can refer to the yaml file in the conf directory.**
### 2. Prepare config File
The configuration file can be found in `conf/tts_online_application.yaml`.
diff --git a/demos/streaming_tts_server/README_cn.md b/demos/streaming_tts_server/README_cn.md index fa041323..560791a9 --- a/demos/streaming_tts_server/README_cn.md +++ b/demos/streaming_tts_server/README_cn.md @@ -13,7 +13,11 @@
请看 [安装文档](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install.md). 推荐使用 **paddlepaddle 2.3.1** 或以上版本。
-你可以从 medium,hard 两种方式中选择一种方式安装 PaddleSpeech。
+
+你可以从简单、中等、困难几种方式中选择一种方式安装 PaddleSpeech。
+
+**如果使用简单模式安装,需要自行准备 yaml 文件,可参考 conf 目录下的 yaml 文件。**
+
### 2. 准备配置文件
配置文件可参见 `conf/tts_online_application.yaml` 。
diff --git a/demos/streaming_tts_server/test_client.sh b/demos/streaming_tts_server/client.sh old mode 100644 new mode 100755 similarity index 61% rename from demos/streaming_tts_server/test_client.sh rename to demos/streaming_tts_server/client.sh index bd88f20b..e93da58a --- a/demos/streaming_tts_server/test_client.sh +++ b/demos/streaming_tts_server/client.sh @@ -2,8 +2,8 @@
# http client test
# If `127.0.0.1` is not accessible, you need to use the actual service IP address.
-paddlespeech_client tts_online --server_ip 127.0.0.1 --port 8092 --protocol http --input "您好,欢迎使用百度飞桨语音合成服务。" --output output.wav
+paddlespeech_client tts_online --server_ip 127.0.0.1 --port 8092 --protocol http --input "您好,欢迎使用百度飞桨语音合成服务。" --output output.http.wav
# websocket client test
# If `127.0.0.1` is not accessible, you need to use the actual service IP address.
-# paddlespeech_client tts_online --server_ip 127.0.0.1 --port 8092 --protocol websocket --input "您好,欢迎使用百度飞桨语音合成服务。" --output output.wav
+paddlespeech_client tts_online --server_ip 127.0.0.1 --port 8192 --protocol websocket --input "您好,欢迎使用百度飞桨语音合成服务。" --output output.ws.wav
diff --git a/demos/streaming_tts_server/conf/tts_online_ws_application.yaml b/demos/streaming_tts_server/conf/tts_online_ws_application.yaml new file mode 100644 index 00000000..146f06f1 --- /dev/null +++ b/demos/streaming_tts_server/conf/tts_online_ws_application.yaml @@ -0,0 +1,103 @@
+# This is the parameter configuration file for streaming tts server.
+
+#################################################################################
+#                             SERVER SETTING                                    #
+#################################################################################
+host: 0.0.0.0
+port: 8192
+
+# The task format in the engine_list is: <speech task>_<engine type>
+# engine_list choices = ['tts_online', 'tts_online-onnx'], the inference speed of tts_online-onnx is faster than tts_online.
+# protocol choices = ['websocket', 'http']
+protocol: 'websocket'
+engine_list: ['tts_online-onnx']
+
+
+#################################################################################
+#                                ENGINE CONFIG                                  #
+#################################################################################
+
+################################### TTS #########################################
+################### speech task: tts; engine_type: online #######################
+tts_online:
+    # am (acoustic model) choices=['fastspeech2_csmsc', 'fastspeech2_cnndecoder_csmsc']
+    # fastspeech2_cnndecoder_csmsc supports streaming am infer.
+    am: 'fastspeech2_csmsc'
+    am_config:
+    am_ckpt:
+    am_stat:
+    phones_dict:
+    tones_dict:
+    speaker_dict:
+    spk_id: 0
+
+    # voc (vocoder) choices=['mb_melgan_csmsc, hifigan_csmsc']
+    # Both mb_melgan_csmsc and hifigan_csmsc support streaming voc inference
+    voc: 'mb_melgan_csmsc'
+    voc_config:
+    voc_ckpt:
+    voc_stat:
+
+    # others
+    lang: 'zh'
+    device: 'cpu' # set 'gpu:id' or 'cpu'
+    # am_block and am_pad are only used by the fastspeech2_cnndecoder_onnx model for streaming am infer;
+    # when am_pad is set to 12, streaming synthetic audio is the same as non-streaming synthetic audio
+    am_block: 72
+    am_pad: 12
+    # voc_pad and voc_block are used by the voc model for streaming voc infer;
+    # when the voc model is mb_melgan_csmsc, setting voc_pad to 14 makes streaming synthetic audio the same as non-streaming synthetic audio; the minimum pad value is 7, with which streaming synthetic audio still sounds normal
+    # when the voc model is hifigan_csmsc, setting voc_pad to 19 makes streaming synthetic audio the same as non-streaming synthetic audio; with voc_pad set to 14, streaming synthetic audio sounds normal
+    voc_block: 36
+    voc_pad: 14
+
+
+
+#################################################################################
+#                                ENGINE CONFIG                                  #
+#################################################################################
+
+################################### TTS #########################################
+################### speech task: tts; engine_type: online-onnx #######################
+tts_online-onnx:
+    # am (acoustic model) choices=['fastspeech2_csmsc_onnx', 'fastspeech2_cnndecoder_csmsc_onnx']
+    # fastspeech2_cnndecoder_csmsc_onnx supports streaming am infer.
+    am: 'fastspeech2_cnndecoder_csmsc_onnx'
+    # am_ckpt is a list; if am is fastspeech2_cnndecoder_csmsc_onnx, am_ckpt = [encoder model, decoder model, postnet model];
+    # if am is fastspeech2_csmsc_onnx, am_ckpt = [ckpt model]
+    am_ckpt: # list
+    am_stat:
+    phones_dict:
+    tones_dict:
+    speaker_dict:
+    spk_id: 0
+    am_sample_rate: 24000
+    am_sess_conf:
+        device: "cpu" # set 'gpu:id' or 'cpu'
+        use_trt: False
+        cpu_threads: 4
+
+    # voc (vocoder) choices=['mb_melgan_csmsc_onnx, hifigan_csmsc_onnx']
+    # Both mb_melgan_csmsc_onnx and hifigan_csmsc_onnx support streaming voc inference
+    voc: 'hifigan_csmsc_onnx'
+    voc_ckpt:
+    voc_sample_rate: 24000
+    voc_sess_conf:
+        device: "cpu" # set 'gpu:id' or 'cpu'
+        use_trt: False
+        cpu_threads: 4
+
+    # others
+    lang: 'zh'
+    # am_block and am_pad are only used by the fastspeech2_cnndecoder_onnx model for streaming am infer;
+    # when am_pad is set to 12, streaming synthetic audio is the same as non-streaming synthetic audio
+    am_block: 72
+    am_pad: 12
+    # voc_pad and voc_block are used by the voc model for streaming voc infer;
+    # when the voc model is mb_melgan_csmsc_onnx, setting voc_pad to 14 makes streaming synthetic audio the same as non-streaming synthetic audio; the minimum pad value is 7, with which streaming synthetic audio still sounds normal
+    # when the voc model is hifigan_csmsc_onnx, setting voc_pad to 19 makes streaming synthetic audio the same as non-streaming synthetic audio; with voc_pad set to 14, streaming synthetic audio sounds normal
+    voc_block: 36
+    voc_pad: 14
+    # voc_upsample should be the same as n_shift in the voc config.
+    voc_upsample: 300
+
diff --git a/demos/streaming_tts_server/server.sh b/demos/streaming_tts_server/server.sh new file mode 100755 index 00000000..d34ddba0 --- /dev/null +++ b/demos/streaming_tts_server/server.sh @@ -0,0 +1,10 @@
+#!/bin/bash
+
+# http server
+paddlespeech_server start --config_file ./conf/tts_online_application.yaml &> tts.http.log &
+
+
+# websocket server
+paddlespeech_server start --config_file ./conf/tts_online_ws_application.yaml &> tts.ws.log &
+
+
diff --git a/demos/streaming_tts_server/start_server.sh b/demos/streaming_tts_server/start_server.sh deleted file mode 100644 index 9c71f2fe..00000000 --- a/demos/streaming_tts_server/start_server.sh +++ /dev/null @@ -1,3 +0,0 @@
-#!/bin/bash
-# start server
-paddlespeech_server start --config_file ./conf/tts_online_application.yaml
\ No newline at end of file
diff --git a/demos/text_to_speech/run.sh b/demos/text_to_speech/run.sh index b1340241..2b588be5 100755 --- a/demos/text_to_speech/run.sh +++ b/demos/text_to_speech/run.sh @@ -4,4 +4,10 @@
paddlespeech tts --input 今天的天气不错啊
# Batch process
-echo -e "1 欢迎光临。\n2 谢谢惠顾。" | paddlespeech tts
\ No newline at end of file
+echo -e "1 欢迎光临。\n2 谢谢惠顾。" | paddlespeech tts
+
+# Text Frontend
+paddlespeech tts --input 今天是2022/10/29,最低温度是-3℃.
+
+
+
diff --git a/docker/ubuntu18-cpu/Dockerfile b/docker/ubuntu18-cpu/Dockerfile index d14c0185..35f45f2e 100644 --- a/docker/ubuntu18-cpu/Dockerfile +++ b/docker/ubuntu18-cpu/Dockerfile @@ -1,15 +1,17 @@
FROM registry.baidubce.com/paddlepaddle/paddle:2.2.2
LABEL maintainer="paddlesl@baidu.com"
+RUN apt-get update \
+    && apt-get install -y libsndfile-dev \
+    && apt-get clean \
+    && rm -rf /var/lib/apt/lists/*
+
RUN git clone --depth 1 https://github.com/PaddlePaddle/PaddleSpeech.git /home/PaddleSpeech
RUN pip3 uninstall mccabe -y ; exit 0;
RUN pip3 install multiprocess==0.70.12 importlib-metadata==4.2.0 dill==0.3.4
-RUN cd /home/PaddleSpeech/audio
-RUN python setup.py bdist_wheel
-
-RUN cd /home/PaddleSpeech
+WORKDIR /home/PaddleSpeech/
RUN python setup.py bdist_wheel
-RUN pip install audio/dist/*.whl dist/*.whl
+RUN pip install dist/*.whl -i https://pypi.tuna.tsinghua.edu.cn/simple
-WORKDIR /home/PaddleSpeech/
+CMD ["bash"]
diff --git a/docs/requirements.txt b/docs/requirements.txt index 08a049c1..bf1486c5 100644 --- a/docs/requirements.txt +++ b/docs/requirements.txt @@ -48,4 +48,5 @@
fastapi
websockets
keyboard
uvicorn
-pattern_singleton
\ No newline at end of file
+pattern_singleton
+braceexpand
\ No newline at end of file
diff --git a/docs/source/install.md b/docs/source/install.md index 83b64619..6a9ff3bc 100644 --- a/docs/source/install.md +++ b/docs/source/install.md @@ -117,9 +117,9 @@
conda install -y -c gcc_linux-64=8.4.0 gxx_linux-64=8.4.0
```
(Tip: Do not use the last script if you want to install by the **Hard** way):
### Install PaddlePaddle
-You can choose the `PaddlePaddle` version based on your system. For example, for CUDA 10.2, CuDNN7.5 install paddlepaddle-gpu 2.2.0:
+You can choose the `PaddlePaddle` version based on your system. For example, for CUDA 10.2, CuDNN7.5 install paddlepaddle-gpu 2.3.1:
```bash
-python3 -m pip install paddlepaddle-gpu==2.2.0 -i https://mirror.baidu.com/pypi/simple
+python3 -m pip install paddlepaddle-gpu==2.3.1 -i https://mirror.baidu.com/pypi/simple
```
### Install PaddleSpeech
You can install `paddlespeech` by the following command, then you can use the `ready-made` examples in `paddlespeech` :
@@ -180,9 +180,9 @@
Some users may fail to install `kaldiio` due to the default download source, you can install `pytest-runner` at first:
```bash
pip install pytest-runner -i https://pypi.tuna.tsinghua.edu.cn/simple
```
-Make sure you have GPU and the paddlepaddle version is right.
For example, for CUDA 10.2, CuDNN7.5 install paddle 2.3.1:
```bash
-python3 -m pip install paddlepaddle-gpu==2.2.0 -i https://mirror.baidu.com/pypi/simple
+python3 -m pip install paddlepaddle-gpu==2.3.1 -i https://mirror.baidu.com/pypi/simple
```
### Install PaddleSpeech in Developing Mode
```bash
diff --git a/docs/source/install_cn.md b/docs/source/install_cn.md index 75f4174e..9f49ebad 100644 --- a/docs/source/install_cn.md +++ b/docs/source/install_cn.md @@ -111,9 +111,9 @@
conda install -y -c gcc_linux-64=8.4.0 gxx_linux-64=8.4.0
```
(提示: 如果你想使用**困难**方式完成安装,请不要使用最后一条命令)
### 安装 PaddlePaddle
-你可以根据系统配置选择 PaddlePaddle 版本,例如系统使用 CUDA 10.2, CuDNN7.5 ,你可以安装 paddlepaddle-gpu 2.2.0:
+你可以根据系统配置选择 PaddlePaddle 版本,例如系统使用 CUDA 10.2, CuDNN7.5 ,你可以安装 paddlepaddle-gpu 2.3.1:
```bash
-python3 -m pip install paddlepaddle-gpu==2.2.0 -i https://mirror.baidu.com/pypi/simple
+python3 -m pip install paddlepaddle-gpu==2.3.1 -i https://mirror.baidu.com/pypi/simple
```
### 安装 PaddleSpeech
最后安装 `paddlespeech`,这样你就可以使用 `paddlespeech` 中已有的 examples:
@@ -168,9 +168,9 @@
conda activate tools/venv
conda install -y -c conda-forge sox libsndfile swig bzip2 libflac bc
```
### 安装 PaddlePaddle
-请确认你系统是否有 GPU,并且使用了正确版本的 paddlepaddle。例如系统使用 CUDA 10.2, CuDNN7.5 ,你可以安装 paddlepaddle-gpu 2.2.0:
+请确认你系统是否有 GPU,并且使用了正确版本的 paddlepaddle。例如系统使用 CUDA 10.2, CuDNN7.5 ,你可以安装 paddlepaddle-gpu 2.3.1:
```bash
-python3 -m pip install paddlepaddle-gpu==2.2.0 -i https://mirror.baidu.com/pypi/simple
+python3 -m pip install paddlepaddle-gpu==2.3.1 -i https://mirror.baidu.com/pypi/simple
```
### 用开发者模式安装 PaddleSpeech
部分用户系统由于默认源的问题,安装中会出现 kaldiio 安装出错的问题,建议首先安装 pytest-runner:
diff --git a/examples/aishell/asr1/README.md b/examples/aishell/asr1/README.md index 25b28ede..a7390fd6 --- a/examples/aishell/asr1/README.md +++ b/examples/aishell/asr1/README.md @@ -1,5 +1,5 @@
# Transformer/Conformer ASR with Aishell
-This example contains code used to train a Transformer or [Conformer](http://arxiv.org/abs/2008.03802) model with [Aishell dataset](http://www.openslr.org/resources/33)
+This example contains code used to train a [u2](https://arxiv.org/pdf/2012.05481.pdf) model (Transformer or [Conformer](https://arxiv.org/pdf/2005.08100.pdf) model) with [Aishell dataset](http://www.openslr.org/resources/33)
## Overview
All the scripts you need are in `run.sh`. There are several stages in `run.sh`, and each stage has its function.
| Stage | Function |
diff --git a/examples/callcenter/README.md b/examples/callcenter/README.md index 1c715cb6..6d521146 --- a/examples/callcenter/README.md +++ b/examples/callcenter/README.md @@ -1,20 +1,3 @@
# Callcenter 8k sample rate
-Data distribution:
-
-```
-676048 utts
-491.4004722221223 h
-4357792.0 text
-2.4633630739178654 text/sec
-2.6167397877068495 sec/utt
-```
-
-train/dev/test partition:
-
-```
- 33802 manifest.dev
- 67606 manifest.test
- 574640 manifest.train
- 676048 total
-```
+This recipe only has model/data config for 8k ASR; users need to prepare data and generate the manifest metafile. You can refer to Aishell or Librispeech.
diff --git a/examples/csmsc/vits/README.md b/examples/csmsc/vits/README.md index 5ca57e3a..8f223e07 --- a/examples/csmsc/vits/README.md +++ b/examples/csmsc/vits/README.md @@ -154,7 +154,7 @@ VITS checkpoint contains files listed below.
vits_csmsc_ckpt_1.1.0
├── default.yaml # default config used to train vits
├── phone_id_map.txt # phone vocabulary file when training vits
-└── snapshot_iter_350000.pdz # model parameters and optimizer states
+└── snapshot_iter_333000.pdz # model parameters and optimizer states
```
ps: This ckpt is not good enough; a better one is still being trained
@@ -169,7 +169,7 @@
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/synthesize_e2e.py \
--config=vits_csmsc_ckpt_1.1.0/default.yaml \
- --ckpt=vits_csmsc_ckpt_1.1.0/snapshot_iter_350000.pdz \
+ --ckpt=vits_csmsc_ckpt_1.1.0/snapshot_iter_333000.pdz \
--phones_dict=vits_csmsc_ckpt_1.1.0/phone_id_map.txt \
--output_dir=exp/default/test_e2e \
--text=${BIN_DIR}/../sentences.txt \
diff --git a/examples/csmsc/vits/conf/default.yaml b/examples/csmsc/vits/conf/default.yaml index 32f995cc..a2aef998 --- a/examples/csmsc/vits/conf/default.yaml +++ b/examples/csmsc/vits/conf/default.yaml @@ -179,7 +179,7 @@ generator_first: False # whether to start updating generator first
# OTHER TRAINING SETTING #
##########################################################
num_snapshots: 10 # max number of snapshots to keep while training
-train_max_steps: 250000 # Number of training steps. == total_iters / ngpus, total_iters = 1000000
+train_max_steps: 350000 # Number of training steps. == total_iters / ngpus, total_iters = 1000000
save_interval_steps: 1000 # Interval steps to save checkpoint.
eval_interval_steps: 250 # Interval steps to evaluate the network.
seed: 777 # random seed number
diff --git a/examples/librispeech/asr1/README.md b/examples/librispeech/asr1/README.md index ae252a58..ca008144 --- a/examples/librispeech/asr1/README.md +++ b/examples/librispeech/asr1/README.md @@ -1,5 +1,5 @@
# Transformer/Conformer ASR with Librispeech
-This example contains code used to train a Transformer or [Conformer](http://arxiv.org/abs/2008.03802) model with [Librispeech dataset](http://www.openslr.org/resources/12)
+This example contains code used to train a [u2](https://arxiv.org/pdf/2012.05481.pdf) model (Transformer or [Conformer](https://arxiv.org/pdf/2005.08100.pdf) model) with [Librispeech dataset](http://www.openslr.org/resources/12)
## Overview
All the scripts you need are in `run.sh`. There are several stages in `run.sh`, and each stage has its function.
| Stage | Function |
diff --git a/examples/librispeech/asr2/README.md b/examples/librispeech/asr2/README.md index 5bc7185a..26978520 --- a/examples/librispeech/asr2/README.md +++ b/examples/librispeech/asr2/README.md @@ -1,6 +1,6 @@
# Transformer/Conformer ASR with Librispeech ASR2
-This example contains code used to train a Transformer or [Conformer](http://arxiv.org/abs/2008.03802) model with [Librispeech dataset](http://www.openslr.org/resources/12) and use some functions in kaldi.
+This example contains code used to train a [u2](https://arxiv.org/pdf/2012.05481.pdf) model (Transformer or [Conformer](https://arxiv.org/pdf/2005.08100.pdf) model) with [Librispeech dataset](http://www.openslr.org/resources/12) and use some functions in kaldi.
To use this example, you need to install Kaldi first.
diff --git a/examples/tiny/asr1/README.md b/examples/tiny/asr1/README.md index 6a4999aa..cfa26670 --- a/examples/tiny/asr1/README.md +++ b/examples/tiny/asr1/README.md @@ -1,5 +1,5 @@
# Transformer/Conformer ASR with Tiny
-This example contains code used to train a Transformer or [Conformer](http://arxiv.org/abs/2008.03802) model Tiny dataset(a part of [[Librispeech dataset](http://www.openslr.org/resources/12)](http://www.openslr.org/resources/33))
+This example contains code used to train a [u2](https://arxiv.org/pdf/2012.05481.pdf) model (Transformer or [Conformer](https://arxiv.org/pdf/2005.08100.pdf) model) with the Tiny dataset (a part of the [Librispeech dataset](http://www.openslr.org/resources/12))
## Overview
All the scripts you need are in `run.sh`. There are several stages in `run.sh`, and each stage has its function.
| Stage | Function |
diff --git a/examples/zh_en_tts/tts3/README.md b/examples/zh_en_tts/tts3/README.md new file mode 100644 index 00000000..6d38181c --- /dev/null +++ b/examples/zh_en_tts/tts3/README.md @@ -0,0 +1,26 @@
+# Test
+We train a Chinese-English mixed fastspeech2 model. The training code is still being sorted out, so for now we show how to use the pretrained models.
+The sample rate of the synthesized audio is 22050 Hz.
+
+## Download pretrained models
+Put pretrained models in a directory named `models`.
+
+- [fastspeech2_csmscljspeech_add-zhen.zip](https://paddlespeech.bj.bcebos.com/t2s/chinse_english_mixed/models/fastspeech2_csmscljspeech_add-zhen.zip)
+- [hifigan_ljspeech_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_ljspeech_ckpt_0.2.0.zip)
+
+```bash
+mkdir models
+cd models
+wget https://paddlespeech.bj.bcebos.com/t2s/chinse_english_mixed/models/fastspeech2_csmscljspeech_add-zhen.zip
+unzip fastspeech2_csmscljspeech_add-zhen.zip
+wget https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_ljspeech_ckpt_0.2.0.zip
+unzip hifigan_ljspeech_ckpt_0.2.0.zip
+cd ../
+```
+
+## Test
+You can choose `--spk_id` {0, 1} in `local/synthesize_e2e.sh`.
+ +```bash +bash test.sh +``` diff --git a/examples/zh_en_tts/tts3/local/synthesize_e2e.sh b/examples/zh_en_tts/tts3/local/synthesize_e2e.sh new file mode 100755 index 00000000..a206c3a8 --- /dev/null +++ b/examples/zh_en_tts/tts3/local/synthesize_e2e.sh @@ -0,0 +1,31 @@ +#!/bin/bash + +model_dir=$1 +output=$2 +am_name=fastspeech2_csmscljspeech_add-zhen +am_model_dir=${model_dir}/${am_name}/ + +stage=1 +stop_stage=1 + + +# hifigan +if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then + FLAGS_allocator_strategy=naive_best_fit \ + FLAGS_fraction_of_gpu_memory_to_use=0.01 \ + python3 ${BIN_DIR}/../synthesize_e2e.py \ + --am=fastspeech2_mix \ + --am_config=${am_model_dir}/default.yaml \ + --am_ckpt=${am_model_dir}/snapshot_iter_94000.pdz \ + --am_stat=${am_model_dir}/speech_stats.npy \ + --voc=hifigan_ljspeech \ + --voc_config=${model_dir}/hifigan_ljspeech_ckpt_0.2.0/default.yaml \ + --voc_ckpt=${model_dir}/hifigan_ljspeech_ckpt_0.2.0/snapshot_iter_2500000.pdz \ + --voc_stat=${model_dir}/hifigan_ljspeech_ckpt_0.2.0/feats_stats.npy \ + --lang=mix \ + --text=${BIN_DIR}/../sentences_mix.txt \ + --output_dir=${output}/test_e2e \ + --phones_dict=${am_model_dir}/phone_id_map.txt \ + --speaker_dict=${am_model_dir}/speaker_id_map.txt \ + --spk_id 0 +fi diff --git a/examples/zh_en_tts/tts3/path.sh b/examples/zh_en_tts/tts3/path.sh new file mode 100755 index 00000000..fb7e8411 --- /dev/null +++ b/examples/zh_en_tts/tts3/path.sh @@ -0,0 +1,13 @@ +#!/bin/bash +export MAIN_ROOT=`realpath ${PWD}/../../../` + +export PATH=${MAIN_ROOT}:${MAIN_ROOT}/utils:${PATH} +export LC_ALL=C + +export PYTHONDONTWRITEBYTECODE=1 +# Use UTF-8 in Python to avoid UnicodeDecodeError when LC_ALL=C +export PYTHONIOENCODING=UTF-8 +export PYTHONPATH=${MAIN_ROOT}:${PYTHONPATH} + +MODEL=fastspeech2 +export BIN_DIR=${MAIN_ROOT}/paddlespeech/t2s/exps/${MODEL} diff --git a/examples/zh_en_tts/tts3/test.sh b/examples/zh_en_tts/tts3/test.sh new file mode 100755 index 00000000..ff34da14 --- /dev/null +++ b/examples/zh_en_tts/tts3/test.sh @@ -0,0 +1,23 @@ +#!/bin/bash + +set -e +source path.sh + +gpus=0,1 +stage=3 +stop_stage=100 + +model_dir=models +output_dir=output + +# with the following command, you can choose the stage range you want to run +# such as `./run.sh --stage 0 --stop-stage 0` +# this can not be mixed use with `$1`, `$2` ... +source ${MAIN_ROOT}/utils/parse_options.sh || exit 1 + + +if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then + # synthesize_e2e, vocoder is hifigan by default + CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize_e2e.sh ${model_dir} ${output_dir} || exit -1 +fi + diff --git a/paddlespeech/__init__.py b/paddlespeech/__init__.py index b781c4a8..4b1c0ef3 100644 --- a/paddlespeech/__init__.py +++ b/paddlespeech/__init__.py @@ -14,3 +14,5 @@ import _locale _locale._getdefaultlocale = (lambda *args: ['en_US', 'utf8']) + + diff --git a/paddlespeech/audio/__init__.py b/paddlespeech/audio/__init__.py index 6184c1dd..83be8e32 100644 --- a/paddlespeech/audio/__init__.py +++ b/paddlespeech/audio/__init__.py @@ -14,6 +14,9 @@ from . import compliance from . import datasets from . import features +from . import text +from . import transform +from . import streamdata from . import functional from . import io from . import metric diff --git a/paddlespeech/audio/text/__init__.py b/paddlespeech/audio/text/__init__.py new file mode 100644 index 00000000..185a92b8 --- /dev/null +++ b/paddlespeech/audio/text/__init__.py @@ -0,0 +1,13 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. diff --git a/paddlespeech/cli/asr/infer.py b/paddlespeech/cli/asr/infer.py index 76dfafb9..f9b4439e 100644 --- a/paddlespeech/cli/asr/infer.py +++ b/paddlespeech/cli/asr/infer.py @@ -365,7 +365,7 @@ class ASRExecutor(BaseExecutor): except Exception as e: logger.exception(e) logger.error( - "can not open the audio file, please check the audio file format is 'wav'. \n \ + f"can not open the audio file, please check the audio file({audio_file}) format is 'wav'. \n \ you can try to use sox to change the file format.\n \ For example: \n \ sample rate: 16k \n \ diff --git a/paddlespeech/cli/executor.py b/paddlespeech/cli/executor.py index d4187a51..3800c36d 100644 --- a/paddlespeech/cli/executor.py +++ b/paddlespeech/cli/executor.py @@ -108,19 +108,20 @@ class BaseExecutor(ABC): Dict[str, Union[str, os.PathLike]]: A dict with ids and inputs. """ if self._is_job_input(input_): + # .job/.scp/.txt file ret = self._get_job_contents(input_) else: + # job from stdin ret = OrderedDict() - if input_ is None: # Take input from stdin if not sys.stdin.isatty( ): # Avoid getting stuck when stdin is empty. for i, line in enumerate(sys.stdin): line = line.strip() - if len(line.split(' ')) == 1: + if len(line.split()) == 1: ret[str(i + 1)] = line - elif len(line.split(' ')) == 2: - id_, info = line.split(' ') + elif len(line.split()) == 2: + id_, info = line.split() ret[id_] = info else: # No valid input info from one line. continue @@ -170,7 +171,8 @@ class BaseExecutor(ABC): bool: return `True` for job input, `False` otherwise. """ return input_ and os.path.isfile(input_) and (input_.endswith('.job') or - input_.endswith('.txt')) + input_.endswith('.txt') or + input_.endswith('.scp')) def _get_job_contents( self, job_input: os.PathLike) -> Dict[str, Union[str, os.PathLike]]: @@ -189,7 +191,7 @@ class BaseExecutor(ABC): line = line.strip() if not line: continue - k, v = line.split(' ') + k, v = line.split() # space or \t job_contents[k] = v return job_contents diff --git a/paddlespeech/s2t/__init__.py b/paddlespeech/s2t/__init__.py index 2da68435..f6476b9a 100644 --- a/paddlespeech/s2t/__init__.py +++ b/paddlespeech/s2t/__init__.py @@ -18,7 +18,6 @@ from typing import Union import paddle from paddle import nn -from paddle.fluid import core from paddle.nn import functional as F from paddlespeech.s2t.utils.log import Log @@ -39,46 +38,6 @@ paddle.long = 'int64' paddle.uint16 = 'uint16' paddle.cdouble = 'complex128' - -def convert_dtype_to_string(tensor_dtype): - """ - Convert the data type in numpy to the data type in Paddle - Args: - tensor_dtype(core.VarDesc.VarType): the data type in numpy. - Returns: - core.VarDesc.VarType: the data type in Paddle. 
- """ - dtype = tensor_dtype - if dtype == core.VarDesc.VarType.FP32: - return paddle.float32 - elif dtype == core.VarDesc.VarType.FP64: - return paddle.float64 - elif dtype == core.VarDesc.VarType.FP16: - return paddle.float16 - elif dtype == core.VarDesc.VarType.INT32: - return paddle.int32 - elif dtype == core.VarDesc.VarType.INT16: - return paddle.int16 - elif dtype == core.VarDesc.VarType.INT64: - return paddle.int64 - elif dtype == core.VarDesc.VarType.BOOL: - return paddle.bool - elif dtype == core.VarDesc.VarType.BF16: - # since there is still no support for bfloat16 in NumPy, - # uint16 is used for casting bfloat16 - return paddle.uint16 - elif dtype == core.VarDesc.VarType.UINT8: - return paddle.uint8 - elif dtype == core.VarDesc.VarType.INT8: - return paddle.int8 - elif dtype == core.VarDesc.VarType.COMPLEX64: - return paddle.complex64 - elif dtype == core.VarDesc.VarType.COMPLEX128: - return paddle.complex128 - else: - raise ValueError("Not supported tensor dtype %s" % dtype) - - if not hasattr(paddle, 'softmax'): logger.debug("register user softmax to paddle, remove this when fixed!") setattr(paddle, 'softmax', paddle.nn.functional.softmax) @@ -155,28 +114,6 @@ if not hasattr(paddle.Tensor, 'new_full'): paddle.Tensor.new_full = new_full paddle.static.Variable.new_full = new_full - -def eq(xs: paddle.Tensor, ys: Union[paddle.Tensor, float]) -> paddle.Tensor: - if convert_dtype_to_string(xs.dtype) == paddle.bool: - xs = xs.astype(paddle.int) - return xs.equal( - paddle.to_tensor( - ys, dtype=convert_dtype_to_string(xs.dtype), place=xs.place)) - - -if not hasattr(paddle.Tensor, 'eq'): - logger.debug( - "override eq of paddle.Tensor if exists or register, remove this when fixed!" - ) - paddle.Tensor.eq = eq - paddle.static.Variable.eq = eq - -if not hasattr(paddle, 'eq'): - logger.debug( - "override eq of paddle if exists or register, remove this when fixed!") - paddle.eq = eq - - def contiguous(xs: paddle.Tensor) -> paddle.Tensor: return xs @@ -219,13 +156,22 @@ def is_broadcastable(shp1, shp2): return True +def broadcast_shape(shp1, shp2): + result = [] + for a, b in zip(shp1[::-1], shp2[::-1]): + result.append(max(a, b)) + return result[::-1] + + def masked_fill(xs: paddle.Tensor, mask: paddle.Tensor, value: Union[float, int]): - assert is_broadcastable(xs.shape, mask.shape) is True, (xs.shape, - mask.shape) - bshape = paddle.broadcast_shape(xs.shape, mask.shape) - mask = mask.broadcast_to(bshape) + bshape = broadcast_shape(xs.shape, mask.shape) + mask.stop_gradient = True + tmp = paddle.ones(shape=[len(bshape)], dtype='int32') + for index in range(len(bshape)): + tmp[index] = bshape[index] + mask = mask.broadcast_to(tmp) trues = paddle.ones_like(xs) * value xs = paddle.where(mask, trues, xs) return xs diff --git a/paddlespeech/s2t/models/u2/u2.py b/paddlespeech/s2t/models/u2/u2.py index 100aca18..e19f411c 100644 --- a/paddlespeech/s2t/models/u2/u2.py +++ b/paddlespeech/s2t/models/u2/u2.py @@ -29,6 +29,9 @@ import paddle from paddle import jit from paddle import nn +from paddlespeech.audio.utils.tensor_utils import add_sos_eos +from paddlespeech.audio.utils.tensor_utils import pad_sequence +from paddlespeech.audio.utils.tensor_utils import th_accuracy from paddlespeech.s2t.decoders.scorers.ctc import CTCPrefixScorer from paddlespeech.s2t.frontend.utility import IGNORE_ID from paddlespeech.s2t.frontend.utility import load_cmvn @@ -48,9 +51,6 @@ from paddlespeech.s2t.utils import checkpoint from paddlespeech.s2t.utils import layer_tools from paddlespeech.s2t.utils.ctc_utils 
import remove_duplicates_and_blank
from paddlespeech.s2t.utils.log import Log
-from paddlespeech.audio.utils.tensor_utils import add_sos_eos
-from paddlespeech.audio.utils.tensor_utils import pad_sequence
-from paddlespeech.audio.utils.tensor_utils import th_accuracy
from paddlespeech.s2t.utils.utility import log_add
from paddlespeech.s2t.utils.utility import UpdateConfig
@@ -318,7 +318,7 @@ class U2BaseModel(ASRInterface, nn.Layer):
dim=1) # (B*N, i+1)
# 2.6 Update end flag
- end_flag = paddle.eq(hyps[:, -1], self.eos).view(-1, 1)
+ end_flag = paddle.equal(hyps[:, -1], self.eos).view(-1, 1)
# 3. Select best of best
scores = scores.view(batch_size, beam_size)
@@ -605,29 +605,42 @@ class U2BaseModel(ASRInterface, nn.Layer):
xs: paddle.Tensor,
offset: int,
required_cache_size: int,
- subsampling_cache: Optional[paddle.Tensor]=None,
- elayers_output_cache: Optional[List[paddle.Tensor]]=None,
- conformer_cnn_cache: Optional[List[paddle.Tensor]]=None,
- ) -> Tuple[paddle.Tensor, paddle.Tensor, List[paddle.Tensor], List[
- paddle.Tensor]]:
+ att_cache: paddle.Tensor=paddle.zeros([0, 0, 0, 0]),
+ cnn_cache: paddle.Tensor=paddle.zeros([0, 0, 0, 0]),
+ ) -> Tuple[paddle.Tensor, paddle.Tensor, paddle.Tensor]:
""" Export interface for c++ call, give input chunk xs,
and return output from time 0 to current chunk.
+
Args:
- xs (paddle.Tensor): chunk input
- subsampling_cache (Optional[paddle.Tensor]): subsampling cache
- elayers_output_cache (Optional[List[paddle.Tensor]]):
- transformer/conformer encoder layers output cache
- conformer_cnn_cache (Optional[List[paddle.Tensor]]): conformer
- cnn cache
+ xs (paddle.Tensor): chunk input, with shape (b=1, time, mel-dim),
+ where `time == (chunk_size - 1) * subsample_rate + \
+ subsample.right_context + 1`
+ offset (int): current offset in encoder output time stamp
+ required_cache_size (int): cache size required for next chunk
+ computation
+ >=0: actual cache size
+ <0: means all history cache is required
+ att_cache (paddle.Tensor): cache tensor for KEY & VALUE in
+ transformer/conformer attention, with shape
+ (elayers, head, cache_t1, d_k * 2), where
+ `head * d_k == hidden-dim` and
+ `cache_t1 == chunk_size * num_decoding_left_chunks`.
+ `d_k * 2` for att key & value.
+ cnn_cache (paddle.Tensor): cache tensor for cnn_module in conformer,
+ (elayers, b=1, hidden-dim, cache_t2), where
+ `cache_t2 == cnn.lorder - 1`.
+
Returns:
- paddle.Tensor: output, it ranges from time 0 to current chunk.
- paddle.Tensor: subsampling cache
- List[paddle.Tensor]: attention cache
- List[paddle.Tensor]: conformer cnn cache
+ paddle.Tensor: output of current input xs,
+ with shape (b=1, chunk_size, hidden-dim).
+ paddle.Tensor: new attention cache required for next chunk, with
+ dynamic shape (elayers, head, T(?), d_k * 2)
+ depending on required_cache_size.
+ paddle.Tensor: new conformer cnn cache required for next chunk, with
+ same shape as the original cnn_cache.
""" - return self.encoder.forward_chunk( - xs, offset, required_cache_size, subsampling_cache, - elayers_output_cache, conformer_cnn_cache) + return self.encoder.forward_chunk(xs, offset, required_cache_size, + att_cache, cnn_cache) # @jit.to_static def ctc_activation(self, xs: paddle.Tensor) -> paddle.Tensor: diff --git a/paddlespeech/s2t/models/u2_st/u2_st.py b/paddlespeech/s2t/models/u2_st/u2_st.py index 00ded912..e86bbedf 100644 --- a/paddlespeech/s2t/models/u2_st/u2_st.py +++ b/paddlespeech/s2t/models/u2_st/u2_st.py @@ -401,29 +401,42 @@ class U2STBaseModel(nn.Layer): xs: paddle.Tensor, offset: int, required_cache_size: int, - subsampling_cache: Optional[paddle.Tensor]=None, - elayers_output_cache: Optional[List[paddle.Tensor]]=None, - conformer_cnn_cache: Optional[List[paddle.Tensor]]=None, - ) -> Tuple[paddle.Tensor, paddle.Tensor, List[paddle.Tensor], List[ - paddle.Tensor]]: + att_cache: paddle.Tensor = paddle.zeros([0, 0, 0, 0]), + cnn_cache: paddle.Tensor = paddle.zeros([0, 0, 0, 0]), + ) -> Tuple[paddle.Tensor, paddle.Tensor, paddle.Tensor]: """ Export interface for c++ call, give input chunk xs, and return output from time 0 to current chunk. + Args: - xs (paddle.Tensor): chunk input - subsampling_cache (Optional[paddle.Tensor]): subsampling cache - elayers_output_cache (Optional[List[paddle.Tensor]]): - transformer/conformer encoder layers output cache - conformer_cnn_cache (Optional[List[paddle.Tensor]]): conformer - cnn cache + xs (paddle.Tensor): chunk input, with shape (b=1, time, mel-dim), + where `time == (chunk_size - 1) * subsample_rate + \ + subsample.right_context + 1` + offset (int): current offset in encoder output time stamp + required_cache_size (int): cache size required for next chunk + compuation + >=0: actual cache size + <0: means all history cache is required + att_cache (paddle.Tensor): cache tensor for KEY & VALUE in + transformer/conformer attention, with shape + (elayers, head, cache_t1, d_k * 2), where + `head * d_k == hidden-dim` and + `cache_t1 == chunk_size * num_decoding_left_chunks`. + `d_k * 2` for att key & value. + cnn_cache (paddle.Tensor): cache tensor for cnn_module in conformer, + (elayers, b=1, hidden-dim, cache_t2), where + `cache_t2 == cnn.lorder - 1` + Returns: - paddle.Tensor: output, it ranges from time 0 to current chunk. - paddle.Tensor: subsampling cache - List[paddle.Tensor]: attention cache - List[paddle.Tensor]: conformer cnn cache + paddle.Tensor: output of current input xs, + with shape (b=1, chunk_size, hidden-dim). + paddle.Tensor: new attention cache required for next chunk, with + dynamic shape (elayers, head, T(?), d_k * 2) + depending on required_cache_size. + paddle.Tensor: new conformer cnn cache required for next chunk, with + same shape as the original cnn_cache. """ return self.encoder.forward_chunk( - xs, offset, required_cache_size, subsampling_cache, - elayers_output_cache, conformer_cnn_cache) + xs, offset, required_cache_size, att_cache, cnn_cache) # @jit.to_static def ctc_activation(self, xs: paddle.Tensor) -> paddle.Tensor: diff --git a/paddlespeech/s2t/modules/align.py b/paddlespeech/s2t/modules/align.py index ad71ee02..cacda246 100644 --- a/paddlespeech/s2t/modules/align.py +++ b/paddlespeech/s2t/modules/align.py @@ -13,8 +13,7 @@ # limitations under the License. 
import paddle from paddle import nn - -from paddlespeech.s2t.modules.initializer import KaimingUniform +import math """ To align the initializer between paddle and torch, the API below are set defalut initializer with priority higger than global initializer. @@ -82,10 +81,10 @@ class Linear(nn.Linear): name=None): if weight_attr is None: if global_init_type == "kaiming_uniform": - weight_attr = paddle.ParamAttr(initializer=KaimingUniform()) + weight_attr = paddle.ParamAttr(initializer=nn.initializer.KaimingUniform(fan_in=None, negative_slope=math.sqrt(5), nonlinearity='leaky_relu')) if bias_attr is None: if global_init_type == "kaiming_uniform": - bias_attr = paddle.ParamAttr(initializer=KaimingUniform()) + bias_attr = paddle.ParamAttr(initializer=nn.initializer.KaimingUniform(fan_in=None, negative_slope=math.sqrt(5), nonlinearity='leaky_relu')) super(Linear, self).__init__(in_features, out_features, weight_attr, bias_attr, name) @@ -105,10 +104,10 @@ class Conv1D(nn.Conv1D): data_format='NCL'): if weight_attr is None: if global_init_type == "kaiming_uniform": - weight_attr = paddle.ParamAttr(initializer=KaimingUniform()) + weight_attr = paddle.ParamAttr(initializer=nn.initializer.KaimingUniform(fan_in=None, negative_slope=math.sqrt(5), nonlinearity='leaky_relu')) if bias_attr is None: if global_init_type == "kaiming_uniform": - bias_attr = paddle.ParamAttr(initializer=KaimingUniform()) + bias_attr = paddle.ParamAttr(initializer=nn.initializer.KaimingUniform(fan_in=None, negative_slope=math.sqrt(5), nonlinearity='leaky_relu')) super(Conv1D, self).__init__( in_channels, out_channels, kernel_size, stride, padding, dilation, groups, padding_mode, weight_attr, bias_attr, data_format) @@ -129,10 +128,10 @@ class Conv2D(nn.Conv2D): data_format='NCHW'): if weight_attr is None: if global_init_type == "kaiming_uniform": - weight_attr = paddle.ParamAttr(initializer=KaimingUniform()) + weight_attr = paddle.ParamAttr(initializer=nn.initializer.KaimingUniform(fan_in=None, negative_slope=math.sqrt(5), nonlinearity='leaky_relu')) if bias_attr is None: if global_init_type == "kaiming_uniform": - bias_attr = paddle.ParamAttr(initializer=KaimingUniform()) + bias_attr = paddle.ParamAttr(initializer=nn.initializer.KaimingUniform(fan_in=None, negative_slope=math.sqrt(5), nonlinearity='leaky_relu')) super(Conv2D, self).__init__( in_channels, out_channels, kernel_size, stride, padding, dilation, groups, padding_mode, weight_attr, bias_attr, data_format) diff --git a/paddlespeech/s2t/modules/attention.py b/paddlespeech/s2t/modules/attention.py index 438efd2a..b6d61586 100644 --- a/paddlespeech/s2t/modules/attention.py +++ b/paddlespeech/s2t/modules/attention.py @@ -84,9 +84,10 @@ class MultiHeadedAttention(nn.Layer): return q, k, v def forward_attention(self, - value: paddle.Tensor, - scores: paddle.Tensor, - mask: Optional[paddle.Tensor]) -> paddle.Tensor: + value: paddle.Tensor, + scores: paddle.Tensor, + mask: paddle.Tensor = paddle.ones([0, 0, 0], dtype=paddle.bool), + ) -> paddle.Tensor: """Compute attention context vector. Args: value (paddle.Tensor): Transformed value, size @@ -94,14 +95,23 @@ class MultiHeadedAttention(nn.Layer): scores (paddle.Tensor): Attention score, size (#batch, n_head, time1, time2). mask (paddle.Tensor): Mask, size (#batch, 1, time2) or - (#batch, time1, time2). + (#batch, time1, time2), (0, 0, 0) means fake mask. Returns: - paddle.Tensor: Transformed value weighted - by the attention score, (#batch, time1, d_model). 
+            paddle.Tensor: Transformed value (#batch, time1, d_model)
+                weighted by the attention score (#batch, time1, time2).
         """
         n_batch = value.shape[0]
-        if mask is not None:
-            mask = mask.unsqueeze(1).eq(0)  # (batch, 1, *, time2)
+
+        # When is `mask.size(2) > 0` True?
+        #   1. training.
+        #   2. onnx(16/4, chunk_size/history_size): a real cache and a real
+        #      mask are fed for the 1st chunk.
+        # When is it False?
+        #   1. onnx(16/-1, -1/-1, 16/0)
+        #   2. jit (16/-1, -1/-1, 16/0, 16/4)
+        if paddle.shape(mask)[2] > 0:  # time2 > 0
+            mask = mask.unsqueeze(1).equal(0)  # (batch, 1, *, time2)
+            # for the last chunk, time2 might be larger than scores.size(-1)
+            mask = mask[:, :, :, :paddle.shape(scores)[-1]]
             scores = scores.masked_fill(mask, -float('inf'))
             attn = paddle.softmax(
                 scores, axis=-1).masked_fill(mask,
@@ -121,21 +131,66 @@ class MultiHeadedAttention(nn.Layer):
                 query: paddle.Tensor,
                 key: paddle.Tensor,
                 value: paddle.Tensor,
-                mask: Optional[paddle.Tensor]) -> paddle.Tensor:
+                mask: paddle.Tensor = paddle.ones([0, 0, 0], dtype=paddle.bool),
+                pos_emb: paddle.Tensor = paddle.empty([0]),
+                cache: paddle.Tensor = paddle.zeros([0, 0, 0, 0])
+                ) -> Tuple[paddle.Tensor, paddle.Tensor]:
         """Compute scaled dot product attention.
-        Args:
-            query (torch.Tensor): Query tensor (#batch, time1, size).
-            key (torch.Tensor): Key tensor (#batch, time2, size).
-            value (torch.Tensor): Value tensor (#batch, time2, size).
-            mask (torch.Tensor): Mask tensor (#batch, 1, time2) or
+        Args:
+            query (paddle.Tensor): Query tensor (#batch, time1, size).
+            key (paddle.Tensor): Key tensor (#batch, time2, size).
+            value (paddle.Tensor): Value tensor (#batch, time2, size).
+            mask (paddle.Tensor): Mask tensor (#batch, 1, time2) or
                 (#batch, time1, time2).
+                1. When applying cross attention between the decoder and the
+                   encoder, the batch padding mask for the input is in
+                   (#batch, 1, T) shape.
+                2. When applying self attention in the encoder,
+                   the mask is in (#batch, T, T) shape.
+                3. When applying self attention in the decoder,
+                   the mask is in (#batch, L, L) shape.
+                4. If different decoder positions see different blocks of the
+                   encoder, as in Mocha, the passed-in mask could be in
+                   (#batch, L, T) shape; but there is no such case in current
+                   WeNet.
+            cache (paddle.Tensor): Cache tensor (1, head, cache_t, d_k * 2),
+                where `cache_t == chunk_size * num_decoding_left_chunks`
+                and `head * d_k == size`
         Returns:
-            torch.Tensor: Output tensor (#batch, time1, d_model).
+            paddle.Tensor: Output tensor (#batch, time1, d_model).
+            paddle.Tensor: Cache tensor (1, head, cache_t + time1, d_k * 2)
+                where `cache_t == chunk_size * num_decoding_left_chunks`
+                and `head * d_k == size`
+
         """
         q, k, v = self.forward_qkv(query, key, value)
+
+        # When exporting the onnx model, for the 1st chunk we feed
+        #   cache(1, head, 0, d_k * 2) (16/-1, -1/-1, 16/0 mode)
+        #   or cache(1, head, real_cache_t, d_k * 2) (16/4 mode).
+        # In all modes `cache.size(0) > 0` is always `True`, so we always do
+        # the split and concatenation (this simplifies onnx export). Note that
+        # it is OK to concat & split zero-shaped tensors (see the code below).
+        # When exporting the jit model, for the 1st chunk we always feed
+        # cache(0, 0, 0, 0), since jit supports a dynamic if-branch.
+        # >>> a = torch.ones((1, 2, 0, 4))
+        # >>> b = torch.ones((1, 2, 3, 4))
+        # >>> c = torch.cat((a, b), dim=2)
+        # >>> torch.equal(b, c)        # True
+        # >>> d = torch.split(a, 2, dim=-1)
+        # >>> torch.equal(d[0], d[1])  # True
+        if paddle.shape(cache)[0] > 0:
+            # the last dim `d_k * 2` holds (key, val)
+            key_cache, value_cache = paddle.split(cache, 2, axis=-1)
+            k = paddle.concat([key_cache, k], axis=2)
+            v = paddle.concat([value_cache, v], axis=2)
+        # We do cache slicing in encoder.forward_chunk, since it is
+        # non-trivial to calculate `next_cache_start` here.
+        new_cache = paddle.concat((k, v), axis=-1)
+
         scores = paddle.matmul(q,
                                k.transpose([0, 1, 3, 2])) / math.sqrt(self.d_k)
-        return self.forward_attention(v, scores, mask)
+        return self.forward_attention(v, scores, mask), new_cache
 
 
 class RelPositionMultiHeadedAttention(MultiHeadedAttention):
@@ -192,23 +247,55 @@ class RelPositionMultiHeadedAttention(MultiHeadedAttention):
                 query: paddle.Tensor,
                 key: paddle.Tensor,
                 value: paddle.Tensor,
-                pos_emb: paddle.Tensor,
-                mask: Optional[paddle.Tensor]):
+                mask: paddle.Tensor = paddle.ones([0, 0, 0], dtype=paddle.bool),
+                pos_emb: paddle.Tensor = paddle.empty([0]),
+                cache: paddle.Tensor = paddle.zeros([0, 0, 0, 0])
+                ) -> Tuple[paddle.Tensor, paddle.Tensor]:
         """Compute 'Scaled Dot Product Attention' with rel. positional encoding.
         Args:
             query (paddle.Tensor): Query tensor (#batch, time1, size).
             key (paddle.Tensor): Key tensor (#batch, time2, size).
             value (paddle.Tensor): Value tensor (#batch, time2, size).
-            pos_emb (paddle.Tensor): Positional embedding tensor
-                (#batch, time1, size).
             mask (paddle.Tensor): Mask tensor (#batch, 1, time2) or
-                (#batch, time1, time2).
+                (#batch, time1, time2), (0, 0, 0) means fake mask.
+            pos_emb (paddle.Tensor): Positional embedding tensor
+                (#batch, time2, size).
+            cache (paddle.Tensor): Cache tensor (1, head, cache_t, d_k * 2),
+                where `cache_t == chunk_size * num_decoding_left_chunks`
+                and `head * d_k == size`
         Returns:
             paddle.Tensor: Output tensor (#batch, time1, d_model).
+            paddle.Tensor: Cache tensor (1, head, cache_t + time1, d_k * 2)
+                where `cache_t == chunk_size * num_decoding_left_chunks`
+                and `head * d_k == size`
         """
         q, k, v = self.forward_qkv(query, key, value)
         q = q.transpose([0, 2, 1, 3])  # (batch, time1, head, d_k)
 
+        # When exporting the onnx model, for the 1st chunk we feed
+        #   cache(1, head, 0, d_k * 2) (16/-1, -1/-1, 16/0 mode)
+        #   or cache(1, head, real_cache_t, d_k * 2) (16/4 mode).
+        # In all modes `cache.size(0) > 0` is always `True`, so we always do
+        # the split and concatenation (this simplifies onnx export). Note that
+        # it is OK to concat & split zero-shaped tensors (see the code below).
+        # When exporting the jit model, for the 1st chunk we always feed
+        # cache(0, 0, 0, 0), since jit supports a dynamic if-branch.
+        # >>> a = torch.ones((1, 2, 0, 4))
+        # >>> b = torch.ones((1, 2, 3, 4))
+        # >>> c = torch.cat((a, b), dim=2)
+        # >>> torch.equal(b, c)        # True
+        # >>> d = torch.split(a, 2, dim=-1)
+        # >>> torch.equal(d[0], d[1])  # True
+        if paddle.shape(cache)[0] > 0:
+            # the last dim `d_k * 2` holds (key, val)
+            key_cache, value_cache = paddle.split(cache, 2, axis=-1)
+            k = paddle.concat([key_cache, k], axis=2)
+            v = paddle.concat([value_cache, v], axis=2)
+        # We do cache slicing in encoder.forward_chunk, since it is
+        # non-trivial to calculate `next_cache_start` here.
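+        # Editor's note: the doctest above is inherited from WeNet and uses
+        # torch; the paddle equivalents are expected to behave the same way
+        # (an assumption worth checking on your paddle version):
+        # >>> a = paddle.ones([1, 2, 0, 4])
+        # >>> b = paddle.ones([1, 2, 3, 4])
+        # >>> c = paddle.concat([a, b], axis=2)
+        # >>> bool(paddle.equal_all(b, c))        # True
+        # >>> d = paddle.split(a, 2, axis=-1)
+        # >>> bool(paddle.equal_all(d[0], d[1]))  # True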
+        new_cache = paddle.concat((k, v), axis=-1)
+
         n_batch_pos = pos_emb.shape[0]
         p = self.linear_pos(pos_emb).view(n_batch_pos, -1, self.h, self.d_k)
         p = p.transpose([0, 2, 1, 3])  # (batch, head, time1, d_k)
@@ -234,4 +321,4 @@ class RelPositionMultiHeadedAttention(MultiHeadedAttention):
         scores = (matrix_ac + matrix_bd) / math.sqrt(
             self.d_k)  # (batch, head, time1, time2)
 
-        return self.forward_attention(v, scores, mask)
+        return self.forward_attention(v, scores, mask), new_cache
diff --git a/paddlespeech/s2t/modules/conformer_convolution.py b/paddlespeech/s2t/modules/conformer_convolution.py
index 89e65268..c384b9c7 100644
--- a/paddlespeech/s2t/modules/conformer_convolution.py
+++ b/paddlespeech/s2t/modules/conformer_convolution.py
@@ -108,15 +108,17 @@ class ConvolutionModule(nn.Layer):
     def forward(self,
                 x: paddle.Tensor,
-                mask_pad: Optional[paddle.Tensor]=None,
-                cache: Optional[paddle.Tensor]=None
+                mask_pad: paddle.Tensor = paddle.ones([0, 0, 0], dtype=paddle.bool),
+                cache: paddle.Tensor = paddle.zeros([0, 0, 0]),
                 ) -> Tuple[paddle.Tensor, paddle.Tensor]:
         """Compute convolution module.
         Args:
             x (paddle.Tensor): Input tensor (#batch, time, channels).
-            mask_pad (paddle.Tensor): used for batch padding, (#batch, channels, time).
+            mask_pad (paddle.Tensor): used for batch padding (#batch, 1, time),
+                (0, 0, 0) means fake mask.
             cache (paddle.Tensor): left context cache, it is only
-                used in causal convolution. (#batch, channels, time')
+                used in causal convolution (#batch, channels, cache_t),
+                (0, 0, 0) means fake cache.
         Returns:
             paddle.Tensor: Output tensor (#batch, time, channels).
             paddle.Tensor: Output cache tensor (#batch, channels, time')
         """
         x = x.transpose([0, 2, 1])  # [B, C, T]
 
         # mask batch padding
-        if mask_pad is not None:
+        if paddle.shape(mask_pad)[2] > 0:  # time > 0
             x = x.masked_fill(mask_pad, 0.0)
 
         if self.lorder > 0:
-            if cache is None:
+            if paddle.shape(cache)[2] == 0:  # cache_t == 0
                 x = nn.functional.pad(
                     x, [self.lorder, 0], 'constant', 0.0, data_format='NCL')
             else:
@@ -143,7 +145,7 @@ class ConvolutionModule(nn.Layer):
             # It's better to just return None if no cache is required;
             # however, for JIT export, here we just fake one tensor instead
             # of None.
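+            # Editor's note: the zero-shaped tensor below doubles as the
+            # "no cache" sentinel, so `paddle.shape(cache)[2] == 0` selects
+            # the padding branch above and callers never have to pass None.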
-            new_cache = paddle.zeros([1], dtype=x.dtype)
+            new_cache = paddle.zeros([0, 0, 0], dtype=x.dtype)
 
         # GLU mechanism
         x = self.pointwise_conv1(x)  # (batch, 2*channel, dim)
@@ -159,7 +161,7 @@ class ConvolutionModule(nn.Layer):
         x = self.pointwise_conv2(x)
 
         # mask batch padding
-        if mask_pad is not None:
+        if paddle.shape(mask_pad)[2] > 0:  # time > 0
             x = x.masked_fill(mask_pad, 0.0)
 
         x = x.transpose([0, 2, 1])  # [B, T, C]
diff --git a/paddlespeech/s2t/modules/decoder_layer.py b/paddlespeech/s2t/modules/decoder_layer.py
index b7f8694c..37b124e8 100644
--- a/paddlespeech/s2t/modules/decoder_layer.py
+++ b/paddlespeech/s2t/modules/decoder_layer.py
@@ -121,11 +121,11 @@ class DecoderLayer(nn.Layer):
 
         if self.concat_after:
             tgt_concat = paddle.cat(
-                (tgt_q, self.self_attn(tgt_q, tgt, tgt, tgt_q_mask)), dim=-1)
+                (tgt_q, self.self_attn(tgt_q, tgt, tgt, tgt_q_mask)[0]), dim=-1)
             x = residual + self.concat_linear1(tgt_concat)
         else:
             x = residual + self.dropout(
-                self.self_attn(tgt_q, tgt, tgt, tgt_q_mask))
+                self.self_attn(tgt_q, tgt, tgt, tgt_q_mask)[0])
         if not self.normalize_before:
             x = self.norm1(x)
 
@@ -134,11 +134,11 @@ class DecoderLayer(nn.Layer):
             x = self.norm2(x)
         if self.concat_after:
             x_concat = paddle.cat(
-                (x, self.src_attn(x, memory, memory, memory_mask)), dim=-1)
+                (x, self.src_attn(x, memory, memory, memory_mask)[0]), dim=-1)
             x = residual + self.concat_linear2(x_concat)
         else:
             x = residual + self.dropout(
-                self.src_attn(x, memory, memory, memory_mask))
+                self.src_attn(x, memory, memory, memory_mask)[0])
         if not self.normalize_before:
             x = self.norm2(x)
 
diff --git a/paddlespeech/s2t/modules/embedding.py b/paddlespeech/s2t/modules/embedding.py
index 51e558eb..3aeebd29 100644
--- a/paddlespeech/s2t/modules/embedding.py
+++ b/paddlespeech/s2t/modules/embedding.py
@@ -131,7 +131,7 @@ class PositionalEncoding(nn.Layer, PositionalEncodingInterface):
             offset (int): start offset
             size (int): required size of position encoding
         Returns:
-            paddle.Tensor: Corresponding position encoding
+            paddle.Tensor: Corresponding position encoding, (1, T, D).
""" assert offset + size < self.max_len return self.dropout(self.pe[:, offset:offset + size]) diff --git a/paddlespeech/s2t/modules/encoder.py b/paddlespeech/s2t/modules/encoder.py index 4d31acf1..bff2d69b 100644 --- a/paddlespeech/s2t/modules/encoder.py +++ b/paddlespeech/s2t/modules/encoder.py @@ -177,7 +177,7 @@ class BaseEncoder(nn.Layer): decoding_chunk_size, self.static_chunk_size, num_decoding_left_chunks) for layer in self.encoders: - xs, chunk_masks, _ = layer(xs, chunk_masks, pos_emb, mask_pad) + xs, chunk_masks, _, _ = layer(xs, chunk_masks, pos_emb, mask_pad) if self.normalize_before: xs = self.after_norm(xs) # Here we assume the mask is not changed in encoder layers, so just @@ -190,30 +190,31 @@ class BaseEncoder(nn.Layer): xs: paddle.Tensor, offset: int, required_cache_size: int, - subsampling_cache: Optional[paddle.Tensor]=None, - elayers_output_cache: Optional[List[paddle.Tensor]]=None, - conformer_cnn_cache: Optional[List[paddle.Tensor]]=None, - ) -> Tuple[paddle.Tensor, paddle.Tensor, List[paddle.Tensor], List[ - paddle.Tensor]]: + att_cache: paddle.Tensor = paddle.zeros([0,0,0,0]), + cnn_cache: paddle.Tensor = paddle.zeros([0,0,0,0]), + att_mask: paddle.Tensor = paddle.ones([0,0,0], dtype=paddle.bool), + ) -> Tuple[paddle.Tensor, paddle.Tensor, paddle.Tensor]: """ Forward just one chunk Args: - xs (paddle.Tensor): chunk input, [B=1, T, D] + xs (paddle.Tensor): chunk audio feat input, [B=1, T, D], where + `T==(chunk_size-1)*subsampling_rate + subsample.right_context + 1` offset (int): current offset in encoder output time stamp required_cache_size (int): cache size required for next chunk compuation >=0: actual cache size <0: means all history cache is required - subsampling_cache (Optional[paddle.Tensor]): subsampling cache - elayers_output_cache (Optional[List[paddle.Tensor]]): - transformer/conformer encoder layers output cache - conformer_cnn_cache (Optional[List[paddle.Tensor]]): conformer - cnn cache + att_cache(paddle.Tensor): cache tensor for key & val in + transformer/conformer attention. Shape is + (elayers, head, cache_t1, d_k * 2), where`head * d_k == hidden-dim` + and `cache_t1 == chunk_size * num_decoding_left_chunks`. 
+            cnn_cache (paddle.Tensor): cache tensor for cnn_module in conformer,
+                (elayers, B=1, hidden-dim, cache_t2), where `cache_t2 == cnn.lorder - 1`
         Returns:
-            paddle.Tensor: output of current input xs
-            paddle.Tensor: subsampling cache required for next chunk computation
-            List[paddle.Tensor]: encoder layers output cache required for next
-                chunk computation
-            List[paddle.Tensor]: conformer cnn cache
+            paddle.Tensor: output of the current input xs, (B=1, chunk_size, hidden-dim)
+            paddle.Tensor: new attention cache required for the next chunk, dynamic
+                shape (elayers, head, T, d_k * 2) depending on required_cache_size
+            paddle.Tensor: new conformer cnn cache required for the next chunk, with
+                the same shape as the original cnn_cache
         """
         assert xs.shape[0] == 1  # batch size must be one
         # tmp_masks is just for interface compatibility
@@ -225,50 +226,50 @@
         if self.global_cmvn is not None:
             xs = self.global_cmvn(xs)
 
-        xs, pos_emb, _ = self.embed(
-            xs, tmp_masks, offset=offset)  #xs=(B, T, D), pos_emb=(B=1, T, D)
+        # before embed, xs=(B, T, D1), pos_emb=(B=1, T, D)
+        xs, pos_emb, _ = self.embed(xs, tmp_masks, offset=offset)
+        # after embed, xs=(B=1, chunk_size, hidden-dim)
 
-        if subsampling_cache is not None:
-            cache_size = subsampling_cache.shape[1]  #T
-            xs = paddle.cat((subsampling_cache, xs), dim=1)
-        else:
-            cache_size = 0
+        elayers = paddle.shape(att_cache)[0]
+        cache_t1 = paddle.shape(att_cache)[2]
+        chunk_size = paddle.shape(xs)[1]
+        attention_key_size = cache_t1 + chunk_size
 
         # only used when using `RelPositionMultiHeadedAttention`
         pos_emb = self.embed.position_encoding(
-            offset=offset - cache_size, size=xs.shape[1])
+            offset=offset - cache_t1, size=attention_key_size)
 
         if required_cache_size < 0:
             next_cache_start = 0
         elif required_cache_size == 0:
-            next_cache_start = xs.shape[1]
+            next_cache_start = attention_key_size
         else:
-            next_cache_start = xs.shape[1] - required_cache_size
-        r_subsampling_cache = xs[:, next_cache_start:, :]
-
-        # Real mask for transformer/conformer layers
-        masks = paddle.ones([1, xs.shape[1]], dtype=paddle.bool)
-        masks = masks.unsqueeze(1)  #[B=1, L'=1, T]
-        r_elayers_output_cache = []
-        r_conformer_cnn_cache = []
+            next_cache_start = max(attention_key_size - required_cache_size, 0)
+
+        r_att_cache = []
+        r_cnn_cache = []
         for i, layer in enumerate(self.encoders):
-            attn_cache = None if elayers_output_cache is None else elayers_output_cache[
-                i]
-            cnn_cache = None if conformer_cnn_cache is None else conformer_cnn_cache[
-                i]
-            xs, _, new_cnn_cache = layer(
-                xs,
-                masks,
-                pos_emb,
-                output_cache=attn_cache,
-                cnn_cache=cnn_cache)
-            r_elayers_output_cache.append(xs[:, next_cache_start:, :])
-            r_conformer_cnn_cache.append(new_cnn_cache)
+            # att_cache[i:i+1] = (1, head, cache_t1, d_k*2)
+            # cnn_cache[i:i+1] = (1, B=1, hidden-dim, cache_t2)
+            xs, _, new_att_cache, new_cnn_cache = layer(
+                xs, att_mask, pos_emb,
+                att_cache=att_cache[i:i + 1] if elayers > 0 else att_cache,
+                cnn_cache=cnn_cache[i:i + 1] if paddle.shape(cnn_cache)[0] > 0 else cnn_cache,
+            )
+            # new_att_cache = (1, head, attention_key_size, d_k*2)
+            # new_cnn_cache = (B=1, hidden-dim, cache_t2)
+            r_att_cache.append(new_att_cache[:, :, next_cache_start:, :])
+            r_cnn_cache.append(new_cnn_cache.unsqueeze(0))  # add elayer dim
+
         if self.normalize_before:
             xs = self.after_norm(xs)
 
-        return (xs[:, cache_size:, :], r_subsampling_cache,
-                r_elayers_output_cache, r_conformer_cnn_cache)
+        # r_att_cache (elayers, head, T, d_k*2)
+        # r_cnn_cache (elayers, B=1, hidden-dim, cache_t2)
+        r_att_cache = paddle.concat(r_att_cache, axis=0)
+        r_cnn_cache = paddle.concat(r_cnn_cache, axis=0)
+        return xs, r_att_cache, r_cnn_cache
 
     def forward_chunk_by_chunk(
             self,
@@ -313,25 +314,24 @@ class BaseEncoder(nn.Layer):
         num_frames = xs.shape[1]
         required_cache_size = decoding_chunk_size * num_decoding_left_chunks
-        subsampling_cache: Optional[paddle.Tensor] = None
-        elayers_output_cache: Optional[List[paddle.Tensor]] = None
-        conformer_cnn_cache: Optional[List[paddle.Tensor]] = None
+
+        att_cache: paddle.Tensor = paddle.zeros([0, 0, 0, 0])
+        cnn_cache: paddle.Tensor = paddle.zeros([0, 0, 0, 0])
+
         outputs = []
         offset = 0
         # Feed forward the overlapping input step by step
         for cur in range(0, num_frames - context + 1, stride):
             end = min(cur + decoding_window, num_frames)
             chunk_xs = xs[:, cur:end, :]
-            (y, subsampling_cache, elayers_output_cache,
-             conformer_cnn_cache) = self.forward_chunk(
-                 chunk_xs, offset, required_cache_size, subsampling_cache,
-                 elayers_output_cache, conformer_cnn_cache)
+
+            (y, att_cache, cnn_cache) = self.forward_chunk(
+                chunk_xs, offset, required_cache_size, att_cache, cnn_cache)
+
             outputs.append(y)
             offset += y.shape[1]
         ys = paddle.cat(outputs, 1)
-        # fake mask, just for jit script and compatibility with `forward` api
-        masks = paddle.ones([1, ys.shape[1]], dtype=paddle.bool)
-        masks = masks.unsqueeze(1)
+        masks = paddle.ones([1, 1, ys.shape[1]], dtype=paddle.bool)
         return ys, masks
diff --git a/paddlespeech/s2t/modules/encoder_layer.py b/paddlespeech/s2t/modules/encoder_layer.py
index e80a298d..5f810dfd 100644
--- a/paddlespeech/s2t/modules/encoder_layer.py
+++ b/paddlespeech/s2t/modules/encoder_layer.py
@@ -75,49 +75,43 @@ class TransformerEncoderLayer(nn.Layer):
             self,
             x: paddle.Tensor,
             mask: paddle.Tensor,
-            pos_emb: Optional[paddle.Tensor]=None,
-            mask_pad: Optional[paddle.Tensor]=None,
-            output_cache: Optional[paddle.Tensor]=None,
-            cnn_cache: Optional[paddle.Tensor]=None,
-    ) -> Tuple[paddle.Tensor, paddle.Tensor, paddle.Tensor]:
+            pos_emb: paddle.Tensor,
+            mask_pad: paddle.Tensor=paddle.ones([0, 0, 0], dtype=paddle.bool),
+            att_cache: paddle.Tensor=paddle.zeros([0, 0, 0, 0]),
+            cnn_cache: paddle.Tensor=paddle.zeros([0, 0, 0, 0]),
+    ) -> Tuple[paddle.Tensor, paddle.Tensor, paddle.Tensor, paddle.Tensor]:
         """Compute encoded features.
         Args:
-            x (paddle.Tensor): Input tensor (#batch, time, size).
-            mask (paddle.Tensor): Mask tensor for the input (#batch, time).
+            x (paddle.Tensor): (#batch, time, size)
+            mask (paddle.Tensor): Mask tensor for the input (#batch, time, time),
+                (0, 0, 0) means fake mask.
             pos_emb (paddle.Tensor): just for interface compatibility
                 to ConformerEncoderLayer
-            mask_pad (paddle.Tensor): not used here, it's for interface
-                compatibility to ConformerEncoderLayer
-            output_cache (paddle.Tensor): Cache tensor of the output
-                (#batch, time2, size), time2 < time in x.
-            cnn_cache (paddle.Tensor): not used here, it's for interface
-                compatibility to ConformerEncoderLayer
+            mask_pad (paddle.Tensor): not used in the transformer layer,
+                just for a unified api with the conformer layer.
+            att_cache (paddle.Tensor): Cache tensor of the KEY & VALUE
+                (#batch=1, head, cache_t1, d_k * 2), head * d_k == size.
+            cnn_cache (paddle.Tensor): Convolution cache in the conformer layer
+                (#batch=1, size, cache_t2), not used here, it's for interface
+                compatibility to ConformerEncoderLayer.
         Returns:
             paddle.Tensor: Output tensor (#batch, time, size).
-            paddle.Tensor: Mask tensor (#batch, time).
-            paddle.Tensor: Fake cnn cache tensor for api compatibility with Conformer (#batch, channels, time').
+            paddle.Tensor: Mask tensor (#batch, time, time).
+            paddle.Tensor: att_cache tensor,
+                (#batch=1, head, cache_t1 + time, d_k * 2).
+            paddle.Tensor: cnn_cache tensor (#batch=1, size, cache_t2).
         """
         residual = x
         if self.normalize_before:
             x = self.norm1(x)
 
-        if output_cache is None:
-            x_q = x
-        else:
-            assert output_cache.shape[0] == x.shape[0]
-            assert output_cache.shape[1] < x.shape[1]
-            assert output_cache.shape[2] == self.size
-            chunk = x.shape[1] - output_cache.shape[1]
-            x_q = x[:, -chunk:, :]
-            residual = residual[:, -chunk:, :]
-            mask = mask[:, -chunk:, :]
+        x_att, new_att_cache = self.self_attn(x, x, x, mask, cache=att_cache)
 
         if self.concat_after:
-            x_concat = paddle.concat(
-                (x, self.self_attn(x_q, x, x, mask)), axis=-1)
+            x_concat = paddle.concat((x, x_att), axis=-1)
             x = residual + self.concat_linear(x_concat)
         else:
-            x = residual + self.dropout(self.self_attn(x_q, x, x, mask))
+            x = residual + self.dropout(x_att)
         if not self.normalize_before:
             x = self.norm1(x)
 
@@ -128,11 +122,8 @@ class TransformerEncoderLayer(nn.Layer):
         if not self.normalize_before:
             x = self.norm2(x)
 
-        if output_cache is not None:
-            x = paddle.concat([output_cache, x], axis=1)
-
-        fake_cnn_cache = paddle.zeros([1], dtype=x.dtype)
-        return x, mask, fake_cnn_cache
+        fake_cnn_cache = paddle.zeros([0, 0, 0], dtype=x.dtype)
+        return x, mask, new_att_cache, fake_cnn_cache
 
 
 class ConformerEncoderLayer(nn.Layer):
@@ -192,32 +183,44 @@ class ConformerEncoderLayer(nn.Layer):
         self.size = size
         self.normalize_before = normalize_before
         self.concat_after = concat_after
-        self.concat_linear = Linear(size + size, size)
+        if self.concat_after:
+            self.concat_linear = Linear(size + size, size)
+        else:
+            self.concat_linear = nn.Identity()
 
     def forward(
             self,
             x: paddle.Tensor,
             mask: paddle.Tensor,
             pos_emb: paddle.Tensor,
-            mask_pad: Optional[paddle.Tensor]=None,
-            output_cache: Optional[paddle.Tensor]=None,
-            cnn_cache: Optional[paddle.Tensor]=None,
-    ) -> Tuple[paddle.Tensor, paddle.Tensor, paddle.Tensor]:
+            mask_pad: paddle.Tensor=paddle.ones([0, 0, 0], dtype=paddle.bool),
+            att_cache: paddle.Tensor=paddle.zeros([0, 0, 0, 0]),
+            cnn_cache: paddle.Tensor=paddle.zeros([0, 0, 0, 0]),
+    ) -> Tuple[paddle.Tensor, paddle.Tensor, paddle.Tensor, paddle.Tensor]:
         """Compute encoded features.
         Args:
-            x (paddle.Tensor): (#batch, time, size)
-            mask (paddle.Tensor): Mask tensor for the input (#batch, time,time).
-            pos_emb (paddle.Tensor): positional encoding, must not be None
-                for ConformerEncoderLayer.
-            mask_pad (paddle.Tensor): batch padding mask used for conv module, (B, 1, T).
-            output_cache (paddle.Tensor): Cache tensor of the encoder output
-                (#batch, time2, size), time2 < time in x.
+            x (paddle.Tensor): Input tensor (#batch, time, size).
+            mask (paddle.Tensor): Mask tensor for the input (#batch, time, time).
+                (0, 0, 0) means fake mask.
+            pos_emb (paddle.Tensor): positional encoding, must not be None
+                for ConformerEncoderLayer
+            mask_pad (paddle.Tensor): batch padding mask used for the conv module.
+                (#batch, 1, time), (0, 0, 0) means fake mask.
+            att_cache (paddle.Tensor): Cache tensor of the KEY & VALUE
+                (#batch=1, head, cache_t1, d_k * 2), head * d_k == size.
             cnn_cache (paddle.Tensor): Convolution cache in conformer layer
+                (1, #batch=1, size, cache_t2). First dim will not be used, just
+                for dy2st.
         Returns:
-            paddle.Tensor: Output tensor (#batch, time, size).
-            paddle.Tensor: Mask tensor (#batch, time).
-            paddle.Tensor: New cnn cache tensor (#batch, channels, time').
+            paddle.Tensor: Output tensor (#batch, time, size).
+            paddle.Tensor: Mask tensor (#batch, time, time).
+            paddle.Tensor: att_cache tensor,
+                (#batch=1, head, cache_t1 + time, d_k * 2).
+            paddle.Tensor: cnn_cache tensor (#batch, size, cache_t2).
         """
+        # (1, #batch=1, size, cache_t2) -> (#batch=1, size, cache_t2)
+        cnn_cache = paddle.squeeze(cnn_cache, axis=0)
+
         # whether to use macaron style FFN
         if self.feed_forward_macaron is not None:
             residual = x
@@ -233,18 +236,8 @@ class ConformerEncoderLayer(nn.Layer):
         if self.normalize_before:
             x = self.norm_mha(x)
 
-        if output_cache is None:
-            x_q = x
-        else:
-            assert output_cache.shape[0] == x.shape[0]
-            assert output_cache.shape[1] < x.shape[1]
-            assert output_cache.shape[2] == self.size
-            chunk = x.shape[1] - output_cache.shape[1]
-            x_q = x[:, -chunk:, :]
-            residual = residual[:, -chunk:, :]
-            mask = mask[:, -chunk:, :]
-
-        x_att = self.self_attn(x_q, x, x, pos_emb, mask)
+        x_att, new_att_cache = self.self_attn(
+            x, x, x, mask, pos_emb, cache=att_cache)
 
         if self.concat_after:
             x_concat = paddle.concat((x, x_att), axis=-1)
@@ -257,7 +250,7 @@ class ConformerEncoderLayer(nn.Layer):
 
         # convolution module
         # Fake new cnn cache here, and then change it in conv_module
-        new_cnn_cache = paddle.zeros([1], dtype=x.dtype)
+        new_cnn_cache = paddle.zeros([0, 0, 0], dtype=x.dtype)
         if self.conv_module is not None:
             residual = x
             if self.normalize_before:
@@ -282,7 +275,4 @@ class ConformerEncoderLayer(nn.Layer):
         if self.conv_module is not None:
             x = self.norm_final(x)
 
-        if output_cache is not None:
-            x = paddle.concat([output_cache, x], axis=1)
-
-        return x, mask, new_cnn_cache
+        return x, mask, new_att_cache, new_cnn_cache
diff --git a/paddlespeech/s2t/modules/initializer.py b/paddlespeech/s2t/modules/initializer.py
index 30a04e44..cdcf2e05 100644
--- a/paddlespeech/s2t/modules/initializer.py
+++ b/paddlespeech/s2t/modules/initializer.py
@@ -12,142 +12,6 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 import numpy as np
-from paddle.fluid import framework
-from paddle.fluid import unique_name
-from paddle.fluid.core import VarDesc
-from paddle.fluid.initializer import MSRAInitializer
-
-__all__ = ['KaimingUniform']
-
-
-class KaimingUniform(MSRAInitializer):
-    r"""Implements the Kaiming Uniform initializer
-
-    This class implements the weight initialization from the paper
-    `Delving Deep into Rectifiers: Surpassing Human-Level Performance on
-    ImageNet Classification `_
-    by Kaiming He, Xiangyu Zhang, Shaoqing Ren and Jian Sun. This is a
-    robust initialization method that particularly considers the rectifier
-    nonlinearities.
-
-    In case of Uniform distribution, the range is [-x, x], where
-
-    .. math::
-
-        x = \sqrt{\frac{1.0}{fan\_in}}
-
-    In case of Normal distribution, the mean is 0 and the standard deviation
-    is
-
-    .. math::
-
-        \sqrt{\\frac{2.0}{fan\_in}}
-
-    Args:
-        fan_in (float32|None): fan_in for Kaiming uniform Initializer. If None, it is\
-        inferred from the variable. default is None.
-
-    Note:
-        It is recommended to set fan_in to None for most cases.
-
-    Examples:
-        .. code-block:: python
-
-            import paddle
-            import paddle.nn as nn
-
-            linear = nn.Linear(2,
-                               4,
-                               weight_attr=nn.initializer.KaimingUniform())
-            data = paddle.rand([30, 10, 2], dtype='float32')
-            res = linear(data)
-
-    """
-
-    def __init__(self, fan_in=None):
-        super(KaimingUniform, self).__init__(
-            uniform=True, fan_in=fan_in, seed=0)
-
-    def __call__(self, var, block=None):
-        """Initialize the input tensor with MSRA initialization.
- - Args: - var(Tensor): Tensor that needs to be initialized. - block(Block, optional): The block in which initialization ops - should be added. Used in static graph only, default None. - - Returns: - The initialization op - """ - block = self._check_block(block) - - assert isinstance(var, framework.Variable) - assert isinstance(block, framework.Block) - f_in, f_out = self._compute_fans(var) - - # If fan_in is passed, use it - fan_in = f_in if self._fan_in is None else self._fan_in - - if self._seed == 0: - self._seed = block.program.random_seed - - # to be compatible of fp16 initalizers - if var.dtype == VarDesc.VarType.FP16 or ( - var.dtype == VarDesc.VarType.BF16 and not self._uniform): - out_dtype = VarDesc.VarType.FP32 - out_var = block.create_var( - name=unique_name.generate( - ".".join(['masra_init', var.name, 'tmp'])), - shape=var.shape, - dtype=out_dtype, - type=VarDesc.VarType.LOD_TENSOR, - persistable=False) - else: - out_dtype = var.dtype - out_var = var - - if self._uniform: - limit = np.sqrt(1.0 / float(fan_in)) - op = block.append_op( - type="uniform_random", - inputs={}, - outputs={"Out": out_var}, - attrs={ - "shape": out_var.shape, - "dtype": int(out_dtype), - "min": -limit, - "max": limit, - "seed": self._seed - }, - stop_gradient=True) - - else: - std = np.sqrt(2.0 / float(fan_in)) - op = block.append_op( - type="gaussian_random", - outputs={"Out": out_var}, - attrs={ - "shape": out_var.shape, - "dtype": int(out_dtype), - "mean": 0.0, - "std": std, - "seed": self._seed - }, - stop_gradient=True) - - if var.dtype == VarDesc.VarType.FP16 or ( - var.dtype == VarDesc.VarType.BF16 and not self._uniform): - block.append_op( - type="cast", - inputs={"X": out_var}, - outputs={"Out": var}, - attrs={"in_dtype": out_var.dtype, - "out_dtype": var.dtype}) - - if not framework.in_dygraph_mode(): - var.op = op - return op - class DefaultInitializerContext(object): """ diff --git a/paddlespeech/server/bin/paddlespeech_client.py b/paddlespeech/server/bin/paddlespeech_client.py index e8e57fff..96368c0f 100644 --- a/paddlespeech/server/bin/paddlespeech_client.py +++ b/paddlespeech/server/bin/paddlespeech_client.py @@ -718,6 +718,7 @@ class VectorClientExecutor(BaseExecutor): logger.info(f"the input audio: {input}") handler = VectorHttpHandler(server_ip=server_ip, port=port) res = handler.run(input, audio_format, sample_rate) + logger.info(f"The spk embedding is: {res}") return res elif task == "score": from paddlespeech.server.utils.audio_handler import VectorScoreHttpHandler diff --git a/paddlespeech/server/engine/asr/online/ctc_endpoint.py b/paddlespeech/server/engine/asr/online/ctc_endpoint.py index 2dba3641..b87dbe80 100644 --- a/paddlespeech/server/engine/asr/online/ctc_endpoint.py +++ b/paddlespeech/server/engine/asr/online/ctc_endpoint.py @@ -39,10 +39,10 @@ class OnlineCTCEndpoingOpt: # rule1 times out after 5 seconds of silence, even if we decoded nothing. rule1: OnlineCTCEndpointRule = OnlineCTCEndpointRule(False, 5000, 0) - # rule4 times out after 1.0 seconds of silence after decoding something, + # rule2 times out after 1.0 seconds of silence after decoding something, # even if we did not reach a final-state at all. rule2: OnlineCTCEndpointRule = OnlineCTCEndpointRule(True, 1000, 0) - # rule5 times out after the utterance is 20 seconds long, regardless of + # rule3 times out after the utterance is 20 seconds long, regardless of # anything else. 
rule3: OnlineCTCEndpointRule = OnlineCTCEndpointRule(False, 0, 20000) @@ -102,7 +102,8 @@ class OnlineCTCEndpoint: assert self.num_frames_decoded >= self.trailing_silence_frames assert self.frame_shift_in_ms > 0 - + + decoding_something = (self.num_frames_decoded > self.trailing_silence_frames) and decoding_something utterance_length = self.num_frames_decoded * self.frame_shift_in_ms trailing_silence = self.trailing_silence_frames * self.frame_shift_in_ms diff --git a/paddlespeech/server/engine/asr/online/python/asr_engine.py b/paddlespeech/server/engine/asr/online/python/asr_engine.py index 2bacfecd..4df38f09 100644 --- a/paddlespeech/server/engine/asr/online/python/asr_engine.py +++ b/paddlespeech/server/engine/asr/online/python/asr_engine.py @@ -130,9 +130,9 @@ class PaddleASRConnectionHanddler: ## conformer # cache for conformer online - self.subsampling_cache = None - self.elayers_output_cache = None - self.conformer_cnn_cache = None + self.att_cache = paddle.zeros([0,0,0,0]) + self.cnn_cache = paddle.zeros([0,0,0,0]) + self.encoder_out = None # conformer decoding state self.offset = 0 # global offset in decoding frame unit @@ -474,11 +474,9 @@ class PaddleASRConnectionHanddler: # cur chunk chunk_xs = self.cached_feat[:, cur:end, :] # forward chunk - (y, self.subsampling_cache, self.elayers_output_cache, - self.conformer_cnn_cache) = self.model.encoder.forward_chunk( + (y, self.att_cache, self.cnn_cache) = self.model.encoder.forward_chunk( chunk_xs, self.offset, required_cache_size, - self.subsampling_cache, self.elayers_output_cache, - self.conformer_cnn_cache) + self.att_cache, self.cnn_cache) outputs.append(y) # update the global offset, in decoding frame unit diff --git a/paddlespeech/server/engine/engine_warmup.py b/paddlespeech/server/engine/engine_warmup.py index 12c760c6..3751554c 100644 --- a/paddlespeech/server/engine/engine_warmup.py +++ b/paddlespeech/server/engine/engine_warmup.py @@ -60,7 +60,10 @@ def warm_up(engine_and_type: str, warm_up_time: int=3) -> bool: else: st = time.time() - connection_handler.infer(text=sentence) + connection_handler.infer( + text=sentence, + lang=tts_engine.lang, + am=tts_engine.config.am) et = time.time() logger.debug( f"The response time of the {i} warm up: {et - st} s") diff --git a/paddlespeech/t2s/exps/sentences_mix.txt b/paddlespeech/t2s/exps/sentences_mix.txt new file mode 100644 index 00000000..06e97d14 --- /dev/null +++ b/paddlespeech/t2s/exps/sentences_mix.txt @@ -0,0 +1,8 @@ +001 你好,欢迎使用 Paddle Speech 中英文混合 T T S 功能,开始你的合成之旅吧! +002 我们的声学模型使用了 Fast Speech Two, 声码器使用了 Parallel Wave GAN and Hifi GAN. +003 Paddle N L P 发布 ERNIE Tiny 全系列中文预训练小模型,快速提升预训练模型部署效率,通用信息抽取技术 U I E Tiny 系列模型全新升级,支持速度更快效果更好的 U I E 小模型。 +004 Paddle Speech 发布 P P A S R 流式语音识别系统、P P T T S 流式语音合成系统、P P V P R 全链路声纹识别系统。 +005 Paddle Bo Bo: 使用 Paddle Speech 的语音合成模块生成虚拟人的声音。 +006 热烈欢迎您在 Discussions 中提交问题,并在 Issues 中指出发现的 bug。此外,我们非常希望您参与到 Paddle Speech 的开发中! +007 我喜欢 eat apple, 你喜欢 drink milk。 +008 我们要去云南 team building, 非常非常 happy. 
\ No newline at end of file
diff --git a/paddlespeech/t2s/exps/syn_utils.py b/paddlespeech/t2s/exps/syn_utils.py
index cabea989..77abf97d 100644
--- a/paddlespeech/t2s/exps/syn_utils.py
+++ b/paddlespeech/t2s/exps/syn_utils.py
@@ -29,6 +29,7 @@ from yacs.config import CfgNode
 
 from paddlespeech.t2s.datasets.data_table import DataTable
 from paddlespeech.t2s.frontend import English
+from paddlespeech.t2s.frontend.mix_frontend import MixFrontend
 from paddlespeech.t2s.frontend.zh_frontend import Frontend
 from paddlespeech.t2s.modules.normalizer import ZScore
 from paddlespeech.utils.dynamic_import import dynamic_import
@@ -98,6 +99,8 @@ def get_sentences(text_file: Optional[os.PathLike], lang: str='zh'):
                 sentence = "".join(items[1:])
             elif lang == 'en':
                 sentence = " ".join(items[1:])
+            elif lang == 'mix':
+                sentence = " ".join(items[1:])
             sentences.append((utt_id, sentence))
     return sentences
 
@@ -111,7 +114,8 @@ def get_test_dataset(test_metadata: List[Dict[str, Any]],
     am_dataset = am[am.rindex('_') + 1:]
     if am_name == 'fastspeech2':
         fields = ["utt_id", "text"]
-        if am_dataset in {"aishell3", "vctk"} and speaker_dict is not None:
+        if am_dataset in {"aishell3", "vctk",
+                          "mix"} and speaker_dict is not None:
             print("multiple speaker fastspeech2!")
             fields += ["spk_id"]
     elif voice_cloning:
@@ -140,6 +144,10 @@ def get_frontend(lang: str='zh',
             phone_vocab_path=phones_dict, tone_vocab_path=tones_dict)
     elif lang == 'en':
         frontend = English(phone_vocab_path=phones_dict)
+    elif lang == 'mix':
+        frontend = MixFrontend(
+            phone_vocab_path=phones_dict, tone_vocab_path=tones_dict)
+
     else:
         print("wrong lang!")
     print("frontend done!")
@@ -341,8 +349,12 @@ def get_am_output(
         input_ids = frontend.get_input_ids(
             input, merge_sentences=merge_sentences)
         phone_ids = input_ids["phone_ids"]
+    elif lang == 'mix':
+        input_ids = frontend.get_input_ids(
+            input, merge_sentences=merge_sentences)
+        phone_ids = input_ids["phone_ids"]
     else:
-        print("lang should be in {'zh', 'en'}!")
+        print("lang should be in {'zh', 'en', 'mix'}!")
     if get_tone_ids:
         tone_ids = input_ids["tone_ids"]
diff --git a/paddlespeech/t2s/exps/synthesize_e2e.py b/paddlespeech/t2s/exps/synthesize_e2e.py
index 28657eb2..ef954329 100644
--- a/paddlespeech/t2s/exps/synthesize_e2e.py
+++ b/paddlespeech/t2s/exps/synthesize_e2e.py
@@ -113,8 +113,12 @@ def evaluate(args):
         input_ids = frontend.get_input_ids(
             sentence, merge_sentences=merge_sentences)
         phone_ids = input_ids["phone_ids"]
+    elif args.lang == 'mix':
+        input_ids = frontend.get_input_ids(
+            sentence, merge_sentences=merge_sentences)
+        phone_ids = input_ids["phone_ids"]
     else:
-        print("lang should be in {'zh', 'en'}!")
+        print("lang should be in {'zh', 'en', 'mix'}!")
     with paddle.no_grad():
         flags = 0
         for i in range(len(phone_ids)):
@@ -122,7 +126,7 @@ def evaluate(args):
             # acoustic model
             if am_name == 'fastspeech2':
                 # multi speaker
-                if am_dataset in {"aishell3", "vctk"}:
+                if am_dataset in {"aishell3", "vctk", "mix"}:
                     spk_id = paddle.to_tensor(args.spk_id)
                     mel = am_inference(part_phone_ids, spk_id)
                 else:
@@ -170,7 +174,7 @@ def parse_args():
         choices=[
             'speedyspeech_csmsc', 'speedyspeech_aishell3', 'fastspeech2_csmsc',
             'fastspeech2_ljspeech', 'fastspeech2_aishell3', 'fastspeech2_vctk',
-            'tacotron2_csmsc', 'tacotron2_ljspeech'
+            'tacotron2_csmsc', 'tacotron2_ljspeech', 'fastspeech2_mix'
         ],
         help='Choose acoustic model type of tts task.')
     parser.add_argument(
@@ -231,7 +235,7 @@ def parse_args():
         '--lang',
         type=str,
         default='zh',
-        help='Choose model language. zh or en')
+        help='Choose model language. zh or en or mix')
     parser.add_argument(
         "--inference_dir",
diff --git a/paddlespeech/t2s/frontend/mix_frontend.py b/paddlespeech/t2s/frontend/mix_frontend.py
new file mode 100644
index 00000000..6386c871
--- /dev/null
+++ b/paddlespeech/t2s/frontend/mix_frontend.py
@@ -0,0 +1,179 @@
+# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import re
+from typing import Dict
+from typing import List
+
+import paddle
+
+from paddlespeech.t2s.frontend import English
+from paddlespeech.t2s.frontend.zh_frontend import Frontend
+
+
+class MixFrontend():
+    def __init__(self,
+                 g2p_model="pypinyin",
+                 phone_vocab_path=None,
+                 tone_vocab_path=None):
+
+        self.zh_frontend = Frontend(
+            phone_vocab_path=phone_vocab_path, tone_vocab_path=tone_vocab_path)
+        self.en_frontend = English(phone_vocab_path=phone_vocab_path)
+        self.SENTENCE_SPLITOR = re.compile(r'([:、,;。?!,;?!][”’]?)')
+        self.sp_id = self.zh_frontend.vocab_phones["sp"]
+        self.sp_id_tensor = paddle.to_tensor([self.sp_id])
+
+    def is_chinese(self, char):
+        return '\u4e00' <= char <= '\u9fa5'
+
+    def is_alphabet(self, char):
+        return ('\u0041' <= char <= '\u005a') or ('\u0061' <= char <= '\u007a')
+
+    def is_number(self, char):
+        return '\u0030' <= char <= '\u0039'
+
+    def is_other(self, char):
+        return not (self.is_chinese(char) or self.is_number(char) or
+                    self.is_alphabet(char))
+
+    def _split(self, text: str) -> List[str]:
+        text = re.sub(r'[《》【】<=>{}()()#&@“”^_|…\\]', '', text)
+        text = self.SENTENCE_SPLITOR.sub(r'\1\n', text)
+        text = text.strip()
+        sentences = [sentence.strip() for sentence in re.split(r'\n+', text)]
+        return sentences
+
+    def _distinguish(self, text: str) -> List[str]:
+        # sentence --> [ch_part, en_part, ch_part, ...]
+
+        segments = []
+        types = []
+
+        flag = 0
+        temp_seg = ""
+        temp_lang = ""
+
+        # Determine the type of each character: blank, chinese (zh),
+        # alphabet (en), number (num), or unk.
+        for ch in text:
+            if self.is_chinese(ch):
+                types.append("zh")
+            elif self.is_alphabet(ch):
+                types.append("en")
+            elif ch == " ":
+                types.append("blank")
+            elif self.is_number(ch):
+                types.append("num")
+            else:
+                types.append("unk")
+
+        assert len(types) == len(text)
+
+        for i in range(len(types)):
+            # find the first char of the seg
+            if flag == 0:
+                if types[i] != "unk" and types[i] != "blank":
+                    temp_seg += text[i]
+                    temp_lang = types[i]
+                    flag = 1
+            else:
+                if types[i] == temp_lang or types[i] == "num":
+                    temp_seg += text[i]
+                elif temp_lang == "num" and types[i] != "unk":
+                    temp_seg += text[i]
+                    if types[i] == "zh" or types[i] == "en":
+                        temp_lang = types[i]
+                elif temp_lang == "en" and types[i] == "blank":
+                    temp_seg += text[i]
+                elif types[i] == "unk":
+                    pass
+                else:
+                    segments.append((temp_seg, temp_lang))
+
+                    if types[i] != "unk" and types[i] != "blank":
+                        temp_seg = text[i]
+                        temp_lang = types[i]
+                        flag = 1
+                    else:
+                        flag = 0
+                        temp_seg = ""
+                        temp_lang = ""
+
+        segments.append((temp_seg, temp_lang))
+
+        return segments
+
+    def get_input_ids(self,
+                      sentence: str,
+                      merge_sentences: bool=True,
+                      get_tone_ids: bool=False,
+                      add_sp: bool=True) -> Dict[str, List[paddle.Tensor]]:
+
+        sentences = self._split(sentence)
+        phones_list = []
+        result = {}
+
+        for text in sentences:
+            phones_seg = []
+            segments = self._distinguish(text)
+            for seg in segments:
+                content = seg[0]
+                lang = seg[1]
+                if lang == "zh":
+                    input_ids = self.zh_frontend.get_input_ids(
+                        content,
+                        merge_sentences=True,
+                        get_tone_ids=get_tone_ids)
+                elif lang == "en":
+                    input_ids = self.en_frontend.get_input_ids(
+                        content, merge_sentences=True)
+                else:
+                    # skip segments that neither frontend can handle
+                    # (e.g. a pure-number or empty segment)
+                    continue
+
+                phones_seg.append(input_ids["phone_ids"][0])
+                if add_sp:
+                    phones_seg.append(self.sp_id_tensor)
+
+            phones = paddle.concat(phones_seg)
+            phones_list.append(phones)
+
+        if merge_sentences:
+            merge_list = paddle.concat(phones_list)
+            # remove the last 'sp' to avoid noise at the end, because
+            # the training data has no 'sp' at the end
+            if merge_list[-1] == self.sp_id_tensor:
+                merge_list = merge_list[:-1]
+            phones_list = []
+            phones_list.append(merge_list)
+
+        result["phone_ids"] = phones_list
+
+        return result
diff --git a/setup.py b/setup.py
index c90d037e..1cc82fa7 100644
--- a/setup.py
+++ b/setup.py
@@ -72,7 +72,8 @@ base = [
     "colorlog",
     "pathos == 0.2.8",
     "braceexpand",
-    "pyyaml"
+    "pyyaml",
+    "pybind11",
 ]
 
 server = [
@@ -91,7 +92,6 @@ requirements = {
         "gpustat",
         "paddlespeech_ctcdecoders",
         "phkit",
-        "pybind11",
         "pypi-kenlm",
         "snakeviz",
         "sox",
diff --git a/third_party/README.md b/third_party/README.md
index 843d0d3b..98e03b0a 100644
--- a/third_party/README.md
+++ b/third_party/README.md
@@ -1,27 +1,26 @@
-* [python_kaldi_features](https://github.com/ZitengWang/python_kaldi_features)
+# python_kaldi_features
+
+[python_kaldi_features](https://github.com/ZitengWang/python_kaldi_features)
 commit: fc1bd6240c2008412ab64dc25045cd872f5e126c
 ref: https://zhuanlan.zhihu.com/p/55371926
 license: MIT
 
-* [python-pinyin](https://github.com/mozillazg/python-pinyin.git)
-commit: 55e524aa1b7b8eec3d15c5306043c6cdd5938b03
-license: MIT
+# Install ctc_decoder for Windows
 
-* [zhon](https://github.com/tsroten/zhon)
-commit: 09bf543696277f71de502506984661a60d24494c
-license: MIT
+`install_win_ctc.bat` is a bat script to install paddlespeech_ctc_decoders for Windows
 
-* [pymmseg-cpp](https://github.com/pluskid/pymmseg-cpp.git)
-commit: b76465045717fbb4f118c4fbdd24ce93bab10a6d
-license: MIT
+## Prepare your environment
 
-* [chinese_text_normalization](https://github.com/speechio/chinese_text_normalization.git)
-commit: 9e92c7bf2d6b5a7974305406d8e240045beac51c
-license: MIT
+Ensure your environment meets the following requirements:
 
-* [phkit](https://github.com/KuangDD/phkit.git)
-commit: b2100293c1e36da531d7f30bd52c9b955a649522
-license: None
+* gcc: version >= 12.1.0
+* cmake: version >= 3.24.0
+* make: version >= 3.82.90
+* visual studio: version >= 2019
 
-* [nnAudio](https://github.com/KinWaiCheuk/nnAudio.git)
-license: MIT
+## Start your bat script
+
+```shell
+start install_win_ctc.bat
+```
diff --git a/third_party/ctc_decoders/scorer.cpp b/third_party/ctc_decoders/scorer.cpp
index 6c1d96be..6e7f68cf 100644
--- a/third_party/ctc_decoders/scorer.cpp
+++ b/third_party/ctc_decoders/scorer.cpp
@@ -13,7 +13,8 @@
 #include "decoder_utils.h"
 
 using namespace lm::ngram;
-
+// if your platform is Windows, you need to add this define
+#define F_OK 0
 Scorer::Scorer(double alpha,
                double beta,
                const std::string& lm_path,
diff --git a/third_party/ctc_decoders/setup.py b/third_party/ctc_decoders/setup.py
index ce2787e3..9a8b292a 100644
--- a/third_party/ctc_decoders/setup.py
+++ b/third_party/ctc_decoders/setup.py
@@ -89,10 +89,11 @@ FILES = [
     or fn.endswith('unittest.cc'))
 ]
 # yapf: enable
-
 LIBS = ['stdc++']
 if platform.system() != 'Darwin':
     LIBS.append('rt')
+if platform.system() == 'Windows':
+    LIBS = ['-static-libstdc++']
 
 ARGS = ['-O3', '-DNDEBUG', '-DKENLM_MAX_ORDER=6', '-std=c++11']
diff --git a/third_party/install_win_ctc.bat b/third_party/install_win_ctc.bat
new file mode 100644
index 00000000..0bf1e7bb
--- /dev/null
+++ b/third_party/install_win_ctc.bat
@@ -0,0 +1,21 @@
+@echo off
+
+cd ctc_decoders
+if not exist kenlm (
+    git clone https://github.com/Doubledongli/kenlm.git
+    @echo.
+)
+
+if not exist openfst-1.6.3 (
+    echo "Download and extract openfst ..."
+    git clone https://gitee.com/koala999/openfst.git
+    ren openfst openfst-1.6.3
+    @echo.
+)
+
+if not exist ThreadPool (
+    git clone https://github.com/progschj/ThreadPool.git
+    @echo.
+)
+echo "Install decoders ..."
+python setup.py install --num_processes 4
\ No newline at end of file
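To see how the mixed-language pieces of this patch fit together, here is an editor's sketch of exercising the new `MixFrontend` directly; the two vocab file names are placeholders and should point at a real `fastspeech2_mix` model's phone/tone id maps:

```python
from paddlespeech.t2s.frontend.mix_frontend import MixFrontend

# Editor's sketch (not part of the patch): the vocab paths below are
# placeholder names; substitute the id-map files shipped with your model.
frontend = MixFrontend(phone_vocab_path="phone_id_map.txt",
                       tone_vocab_path="tone_id_map.txt")
ids = frontend.get_input_ids("我喜欢 eat apple, 你喜欢 drink milk。",
                             merge_sentences=True)
print(ids["phone_ids"][0])  # a single tensor of phone ids for the merged input
```

With `merge_sentences=True`, the Chinese and English segments are detected by `_distinguish`, converted by the zh/en frontends, joined with `sp` separators, and returned as one phone-id tensor.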