diff --git a/.readthedocs.yml b/.readthedocs.yml index dc38a20fc..e922891e1 100644 --- a/.readthedocs.yml +++ b/.readthedocs.yml @@ -21,5 +21,6 @@ python: version: 3.7 install: - requirements: docs/requirements.txt - - + - method: setuptools + path: . + system_packages: true \ No newline at end of file diff --git a/MANIFEST.in b/MANIFEST.in new file mode 100644 index 000000000..7c9f4165c --- /dev/null +++ b/MANIFEST.in @@ -0,0 +1,2 @@ +include paddlespeech/t2s/exps/*.txt +include paddlespeech/t2s/frontend/*.yaml \ No newline at end of file diff --git a/README.md b/README.md index c9d4796c8..e35289e2b 100644 --- a/README.md +++ b/README.md @@ -1,3 +1,4 @@ + ([简体中文](./README_cn.md)|English)

@@ -24,14 +25,16 @@ | Documents | Models List | AIStudio Courses - | Paper + | NAACL2022 Best Demo Award Paper | Gitee ------------------------------------------------------------------------------------ -**PaddleSpeech** is an open-source toolkit on [PaddlePaddle](https://github.com/PaddlePaddle/Paddle) platform for a variety of critical tasks in speech and audio, with the state-of-art and influential models. +**PaddleSpeech** is an open-source toolkit on [PaddlePaddle](https://github.com/PaddlePaddle/Paddle) platform for a variety of critical tasks in speech and audio, with the state-of-art and influential models. + +**PaddleSpeech** won the [NAACL2022 Best Demo Award](https://2022.naacl.org/blog/best-demo-award/), please check out our paper on [Arxiv](https://arxiv.org/abs/2205.12007). ##### Speech Recognition @@ -176,7 +179,7 @@ Via the easy-to-use, efficient, flexible and scalable implementation, our vision ## Installation -We strongly recommend our users to install PaddleSpeech in **Linux** with *python>=3.7*. +We strongly recommend our users to install PaddleSpeech in **Linux** with *python>=3.7* and *paddlepaddle>=2.3.1*. Up to now, **Linux** supports CLI for all our tasks, **Mac OSX** and **Windows** only support PaddleSpeech CLI for Audio Classification, Speech-to-Text and Text-to-Speech. To install `PaddleSpeech`, please see [installation](./docs/source/install.md). @@ -494,6 +497,14 @@ PaddleSpeech supports a series of most popular models. They are summarized in [r ge2e-fastspeech2-aishell3 + + End-to-End + VITS + CSMSC + + VITS-csmsc + + @@ -688,6 +699,7 @@ You are warmly welcome to submit questions in [discussions](https://github.com/P ## Acknowledgement +- Many thanks to [BarryKCL](https://github.com/BarryKCL) for improving the TTS Chinese frontend based on [G2PW](https://github.com/GitYCC/g2pW). - Many thanks to [yeyupiaoling](https://github.com/yeyupiaoling)/[PPASR](https://github.com/yeyupiaoling/PPASR)/[PaddlePaddle-DeepSpeech](https://github.com/yeyupiaoling/PaddlePaddle-DeepSpeech)/[VoiceprintRecognition-PaddlePaddle](https://github.com/yeyupiaoling/VoiceprintRecognition-PaddlePaddle)/[AudioClassification-PaddlePaddle](https://github.com/yeyupiaoling/AudioClassification-PaddlePaddle) for years of attention, constructive advice and great help. - Many thanks to [mymagicpower](https://github.com/mymagicpower) for the Java implementation of ASR upon [short](https://github.com/mymagicpower/AIAS/tree/main/3_audio_sdks/asr_sdk) and [long](https://github.com/mymagicpower/AIAS/tree/main/3_audio_sdks/asr_long_audio_sdk) audio files. - Many thanks to [JiehangXie](https://github.com/JiehangXie)/[PaddleBoBo](https://github.com/JiehangXie/PaddleBoBo) for developing Virtual Uploader(VUP)/Virtual YouTuber(VTuber) with PaddleSpeech TTS function. @@ -696,6 +708,8 @@ You are warmly welcome to submit questions in [discussions](https://github.com/P - Many thanks to [awmmmm](https://github.com/awmmmm) for contributing fastspeech2 aishell3 conformer pretrained model. - Many thanks to [phecda-xu](https://github.com/phecda-xu)/[PaddleDubbing](https://github.com/phecda-xu/PaddleDubbing) for developing a dubbing tool with GUI based on PaddleSpeech TTS model. - Many thanks to [jerryuhoo](https://github.com/jerryuhoo)/[VTuberTalk](https://github.com/jerryuhoo/VTuberTalk) for developing a GUI tool based on PaddleSpeech TTS and code for making datasets from videos based on PaddleSpeech ASR. 
+- Many thanks to [vpegasus](https://github.com/vpegasus)/[xuesebot](https://github.com/vpegasus/xuesebot) for developing a Rasa chatbot, which is able to speak and listen thanks to PaddleSpeech. +- Many thanks to [chenkui164](https://github.com/chenkui164)/[FastASR](https://github.com/chenkui164/FastASR) for the C++ inference implementation of PaddleSpeech ASR. Besides, PaddleSpeech depends on a lot of open source repositories. See [references](./docs/source/reference.md) for more information. diff --git a/README_cn.md b/README_cn.md index c751b061d..1c6a949fd 100644 --- a/README_cn.md +++ b/README_cn.md @@ -1,3 +1,4 @@ + (简体中文|[English](./README.md))

@@ -19,13 +20,14 @@

- 快速开始 + 安装 + | 快速开始 | 快速使用服务 | 快速使用流式服务 | 教程文档 | 模型列表 | AIStudio 课程 - | 论文 + | NAACL2022 论文 | Gitee

@@ -34,6 +36,11 @@ ------------------------------------------------------------------------------------ **PaddleSpeech** 是基于飞桨 [PaddlePaddle](https://github.com/PaddlePaddle/Paddle) 的语音方向的开源模型库,用于语音和音频中的各种关键任务的开发,包含大量基于深度学习前沿和有影响力的模型,一些典型的应用示例如下: + +**PaddleSpeech** 荣获 [NAACL2022 Best Demo Award](https://2022.naacl.org/blog/best-demo-award/), 请访问 [Arxiv](https://arxiv.org/abs/2205.12007) 论文。 + +### 效果展示 + ##### 语音识别
@@ -150,7 +157,7 @@ 本项目采用了易用、高效、灵活以及可扩展的实现,旨在为工业应用、学术研究提供更好的支持,实现的功能包含训练、推断以及测试模块,以及部署过程,主要包括 - 📦 **易用性**: 安装门槛低,可使用 [CLI](#quick-start) 快速开始。 - 🏆 **对标 SoTA**: 提供了高速、轻量级模型,且借鉴了最前沿的技术。 -- 🏆 **流式ASR和TTS系统**:工业级的端到端流式识别、流式合成系统。 +- 🏆 **流式 ASR 和 TTS 系统**:工业级的端到端流式识别、流式合成系统。 - 💯 **基于规则的中文前端**: 我们的前端包含文本正则化和字音转换(G2P)。此外,我们使用自定义语言规则来适应中文语境。 - **多种工业界以及学术界主流功能支持**: - 🛎️ 典型音频任务: 本工具包提供了音频任务如音频分类、语音翻译、自动语音识别、文本转语音、语音合成、声纹识别、KWS等任务的实现。 @@ -159,6 +166,7 @@ ### 近期更新 + - 👑 2022.05.13: PaddleSpeech 发布 [PP-ASR](./docs/source/asr/PPASR_cn.md) 流式语音识别系统、[PP-TTS](./docs/source/tts/PPTTS_cn.md) 流式语音合成系统、[PP-VPR](docs/source/vpr/PPVPR_cn.md) 全链路声纹识别系统 - 👏🏻 2022.05.06: PaddleSpeech Streaming Server 上线! 覆盖了语音识别(标点恢复、时间戳),和语音合成。 - 👏🏻 2022.05.06: PaddleSpeech Server 上线! 覆盖了声音分类、语音识别、语音合成、声纹识别,标点恢复。 @@ -177,61 +185,195 @@
+
## 安装 我们强烈建议用户在 **Linux** 环境下,*3.7* 以上版本的 *python* 上安装 PaddleSpeech。 -目前为止,**Linux** 支持声音分类、语音识别、语音合成和语音翻译四种功能,**Mac OSX、 Windows** 下暂不支持语音翻译功能。 想了解具体安装细节,可以参考[安装文档](./docs/source/install_cn.md)。 + +### 相关依赖 ++ gcc >= 4.8.5 ++ paddlepaddle >= 2.3.1 ++ python >= 3.7 ++ linux(推荐), mac, windows + +PaddleSpeech依赖于paddlepaddle,安装可以参考[paddlepaddle官网](https://www.paddlepaddle.org.cn/),根据自己机器的情况进行选择。这里给出cpu版本示例,其它版本大家可以根据自己机器的情况进行安装。 + +```shell +pip install paddlepaddle -i https://mirror.baidu.com/pypi/simple +``` + +PaddleSpeech快速安装方式有两种,一种是pip安装,一种是源码编译(推荐)。 + +### pip 安装 +```shell +pip install pytest-runner +pip install paddlespeech +``` + +### 源码编译 +```shell +git clone https://github.com/PaddlePaddle/PaddleSpeech.git +cd PaddleSpeech +pip install pytest-runner +pip install . +``` + +更多关于安装问题,如 conda 环境,librosa 依赖的系统库,gcc 环境问题,kaldi 安装等,可以参考这篇[安装文档](docs/source/install_cn.md),如安装上遇到问题可以在 [#2150](https://github.com/PaddlePaddle/PaddleSpeech/issues/2150) 上留言以及查找相关问题 ## 快速开始 -安装完成后,开发者可以通过命令行快速开始,改变 `--input` 可以尝试用自己的音频或文本测试。 +安装完成后,开发者可以通过命令行或者Python快速开始,命令行模式下改变 `--input` 可以尝试用自己的音频或文本测试,支持16k wav格式音频。 -**声音分类** +你也可以在`aistudio`中快速体验 👉🏻[PaddleSpeech API Demo ](https://aistudio.baidu.com/aistudio/projectdetail/4281335?shared=1)。 + +测试音频示例下载 ```shell -paddlespeech cls --input input.wav +wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav +wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/en.wav ``` -**声纹识别** + +### 语音识别 +
 (点击可展开)开源中文语音识别 + +命令行一键体验 + ```shell -paddlespeech vector --task spk --input input_16k.wav +paddlespeech asr --lang zh --input zh.wav +``` + +Python API 一键预测 + +```python +>>> from paddlespeech.cli.asr.infer import ASRExecutor +>>> asr = ASRExecutor() +>>> result = asr(audio_file="zh.wav") +>>> print(result) +我认为跑步最重要的就是给我带来了身体健康 ``` -**语音识别** +
+ +### 语音合成 + +
 开源中文语音合成 + +输出 24k 采样率wav格式音频 + + +命令行一键体验 + ```shell -paddlespeech asr --lang zh --input input_16k.wav +paddlespeech tts --input "你好,欢迎使用百度飞桨深度学习框架!" --output output.wav ``` -**语音翻译** (English to Chinese) + +Python API 一键预测 + +```python +>>> from paddlespeech.cli.tts.infer import TTSExecutor +>>> tts = TTSExecutor() +>>> tts(text="今天天气十分不错。", output="output.wav") +``` +- 语音合成的 web demo 已经集成进了 [Huggingface Spaces](https://huggingface.co/spaces). 请参考: [TTS Demo](https://huggingface.co/spaces/KPatrick/PaddleSpeechTTS) + +
+ +### 声音分类 + +
 适配多场景的开放领域声音分类工具 + +基于AudioSet数据集527个类别的声音分类模型 + +命令行一键体验 + ```shell -paddlespeech st --input input_16k.wav +paddlespeech cls --input zh.wav +``` + +python API 一键预测 + +```python +>>> from paddlespeech.cli.cls.infer import CLSExecutor +>>> cls = CLSExecutor() +>>> result = cls(audio_file="zh.wav") +>>> print(result) +Speech 0.9027186632156372 ``` -**语音合成** + +
+ +### 声纹提取 + +
 工业级声纹提取工具 + +命令行一键体验 + ```shell -paddlespeech tts --input "你好,欢迎使用百度飞桨深度学习框架!" --output output.wav +paddlespeech vector --task spk --input zh.wav +``` + +Python API 一键预测 + +```python +>>> from paddlespeech.cli.vector import VectorExecutor +>>> vec = VectorExecutor() +>>> result = vec(audio_file="zh.wav") +>>> print(result) # 187维向量 +[ -0.19083306 9.474295 -14.122263 -2.0916545 0.04848729 + 4.9295826 1.4780062 0.3733844 10.695862 3.2697146 + -4.48199 -0.6617882 -9.170393 -11.1568775 -1.2358263 ...] ``` -- 语音合成的 web demo 已经集成进了 [Huggingface Spaces](https://huggingface.co/spaces). 请参考: [TTS Demo](https://huggingface.co/spaces/akhaliq/paddlespeech) -**文本后处理** - - 标点恢复 - ```bash - paddlespeech text --task punc --input 今天的天气真不错啊你下午有空吗我想约你一起去吃饭 - ``` +
-**批处理** +### 标点恢复 + +
 一键恢复文本标点,可与ASR模型配合使用 + +命令行一键体验 + +```shell +paddlespeech text --task punc --input 今天的天气真不错啊你下午有空吗我想约你一起去吃饭 ``` -echo -e "1 欢迎光临。\n2 谢谢惠顾。" | paddlespeech tts + +Python API 一键预测 + +```python +>>> from paddlespeech.cli.text.infer import TextExecutor +>>> text_punc = TextExecutor() +>>> result = text_punc(text="今天的天气真不错啊你下午有空吗我想约你一起去吃饭") +今天的天气真不错啊!你下午有空吗?我想约你一起去吃饭。 ``` -**Shell管道** -ASR + Punc: +
+ +### 语音翻译 + +
 端到端英译中语音翻译工具 + +使用预编译的kaldi相关工具,只支持在Ubuntu系统中体验 + +命令行一键体验 + +```shell +paddlespeech st --input en.wav ``` -paddlespeech asr --input ./zh.wav | paddlespeech text --task punc + +python API 一键预测 + +```python +>>> from paddlespeech.cli.st.infer import STExecutor +>>> st = STExecutor() +>>> result = st(audio_file="en.wav") +['我 在 这栋 建筑 的 古老 门上 敲门 。'] ``` -更多命令行命令请参考 [demos](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/demos) -> Note: 如果需要训练或者微调,请查看[语音识别](./docs/source/asr/quick_start.md), [语音合成](./docs/source/tts/quick_start.md)。 +
+ + ## 快速使用服务 -安装完成后,开发者可以通过命令行快速使用服务。 +安装完成后,开发者可以通过命令行一键启动语音识别,语音合成,音频分类三种服务。 **启动服务** ```shell @@ -480,6 +622,15 @@ PaddleSpeech 的 **语音合成** 主要包含三个模块:文本前端、声 ge2e-fastspeech2-aishell3 + + + 端到端 + VITS + CSMSC + + VITS-csmsc + + @@ -600,6 +751,7 @@ PaddleSpeech 的 **语音合成** 主要包含三个模块:文本前端、声 语音合成模块最初被称为 [Parakeet](https://github.com/PaddlePaddle/Parakeet),现在与此仓库合并。如果您对该任务的学术研究感兴趣,请参阅 [TTS 研究概述](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/docs/source/tts#overview)。此外,[模型介绍](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/tts/models_introduction.md) 是了解语音合成流程的一个很好的指南。 + ## ⭐ 应用案例 - **[PaddleBoBo](https://github.com/JiehangXie/PaddleBoBo): 使用 PaddleSpeech 的语音合成模块生成虚拟人的声音。** @@ -681,6 +833,7 @@ PaddleSpeech 的 **语音合成** 主要包含三个模块:文本前端、声 ## 致谢 +- 非常感谢 [BarryKCL](https://github.com/BarryKCL)基于[G2PW](https://github.com/GitYCC/g2pW)对TTS中文文本前端的优化。 - 非常感谢 [yeyupiaoling](https://github.com/yeyupiaoling)/[PPASR](https://github.com/yeyupiaoling/PPASR)/[PaddlePaddle-DeepSpeech](https://github.com/yeyupiaoling/PaddlePaddle-DeepSpeech)/[VoiceprintRecognition-PaddlePaddle](https://github.com/yeyupiaoling/VoiceprintRecognition-PaddlePaddle)/[AudioClassification-PaddlePaddle](https://github.com/yeyupiaoling/AudioClassification-PaddlePaddle) 多年来的关注和建议,以及在诸多问题上的帮助。 - 非常感谢 [mymagicpower](https://github.com/mymagicpower) 采用PaddleSpeech 对 ASR 的[短语音](https://github.com/mymagicpower/AIAS/tree/main/3_audio_sdks/asr_sdk)及[长语音](https://github.com/mymagicpower/AIAS/tree/main/3_audio_sdks/asr_long_audio_sdk)进行 Java 实现。 - 非常感谢 [JiehangXie](https://github.com/JiehangXie)/[PaddleBoBo](https://github.com/JiehangXie/PaddleBoBo) 采用 PaddleSpeech 语音合成功能实现 Virtual Uploader(VUP)/Virtual YouTuber(VTuber) 虚拟主播。 @@ -690,7 +843,8 @@ PaddleSpeech 的 **语音合成** 主要包含三个模块:文本前端、声 - 非常感谢 [phecda-xu](https://github.com/phecda-xu)/[PaddleDubbing](https://github.com/phecda-xu/PaddleDubbing) 基于 PaddleSpeech 的 TTS 模型搭建带 GUI 操作界面的配音工具。 - 非常感谢 [jerryuhoo](https://github.com/jerryuhoo)/[VTuberTalk](https://github.com/jerryuhoo/VTuberTalk) 基于 PaddleSpeech 的 TTS GUI 界面和基于 ASR 制作数据集的相关代码。 - +- 非常感谢 [vpegasus](https://github.com/vpegasus)/[xuesebot](https://github.com/vpegasus/xuesebot) 基于 PaddleSpeech 的 ASR 与 TTS 设计的可听、说对话机器人。 +- 非常感谢 [chenkui164](https://github.com/chenkui164)/[FastASR](https://github.com/chenkui164/FastASR) 对 PaddleSpeech 的 ASR 进行 C++ 推理实现。 此外,PaddleSpeech 依赖于许多开源存储库。有关更多信息,请参阅 [references](./docs/source/reference.md)。 diff --git a/dataset/aidatatang_200zh/README.md b/dataset/aidatatang_200zh/README.md index e6f1eefbd..addc323a6 100644 --- a/dataset/aidatatang_200zh/README.md +++ b/dataset/aidatatang_200zh/README.md @@ -1,4 +1,4 @@ -# [Aidatatang_200zh](http://www.openslr.org/62/) +# [Aidatatang_200zh](http://openslr.elda.org/62/) Aidatatang_200zh is a free Chinese Mandarin speech corpus provided by Beijing DataTang Technology Co., Ltd under Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International Public License. The contents and the corresponding descriptions of the corpus include: diff --git a/dataset/aishell/README.md b/dataset/aishell/README.md index 6770cd207..a7dd0cf32 100644 --- a/dataset/aishell/README.md +++ b/dataset/aishell/README.md @@ -1,3 +1,3 @@ -# [Aishell1](http://www.openslr.org/33/) +# [Aishell1](http://openslr.elda.org/33/) This Open Source Mandarin Speech Corpus, AISHELL-ASR0009-OS1, is 178 hours long. It is a part of AISHELL-ASR0009, of which utterance contains 11 domains, including smart home, autonomous driving, and industrial production. 
The whole recording was put in quiet indoor environment, using 3 different devices at the same time: high fidelity microphone (44.1kHz, 16-bit,); Android-system mobile phone (16kHz, 16-bit), iOS-system mobile phone (16kHz, 16-bit). Audios in high fidelity were re-sampled to 16kHz to build AISHELL- ASR0009-OS1. 400 speakers from different accent areas in China were invited to participate in the recording. The manual transcription accuracy rate is above 95%, through professional speech annotation and strict quality inspection. The corpus is divided into training, development and testing sets. ( This database is free for academic research, not in the commerce, if without permission. ) diff --git a/dataset/aishell/aishell.py b/dataset/aishell/aishell.py index 7431fc083..ec43104db 100644 --- a/dataset/aishell/aishell.py +++ b/dataset/aishell/aishell.py @@ -31,7 +31,7 @@ from utils.utility import unpack DATA_HOME = os.path.expanduser('~/.cache/paddle/dataset/speech') -URL_ROOT = 'http://www.openslr.org/resources/33' +URL_ROOT = 'http://openslr.elda.org/resources/33' # URL_ROOT = 'https://openslr.magicdatatech.com/resources/33' DATA_URL = URL_ROOT + '/data_aishell.tgz' MD5_DATA = '2f494334227864a8a8fec932999db9d8' diff --git a/dataset/librispeech/librispeech.py b/dataset/librispeech/librispeech.py index 65cab2490..2d6f1763d 100644 --- a/dataset/librispeech/librispeech.py +++ b/dataset/librispeech/librispeech.py @@ -31,7 +31,7 @@ import soundfile from utils.utility import download from utils.utility import unpack -URL_ROOT = "http://www.openslr.org/resources/12" +URL_ROOT = "http://openslr.elda.org/resources/12" #URL_ROOT = "https://openslr.magicdatatech.com/resources/12" URL_TEST_CLEAN = URL_ROOT + "/test-clean.tar.gz" URL_TEST_OTHER = URL_ROOT + "/test-other.tar.gz" diff --git a/dataset/magicdata/README.md b/dataset/magicdata/README.md index 083aee97b..4641a21d6 100644 --- a/dataset/magicdata/README.md +++ b/dataset/magicdata/README.md @@ -1,4 +1,4 @@ -# [MagicData](http://www.openslr.org/68/) +# [MagicData](http://openslr.elda.org/68/) MAGICDATA Mandarin Chinese Read Speech Corpus was developed by MAGIC DATA Technology Co., Ltd. and freely published for non-commercial use. 
The contents and the corresponding descriptions of the corpus include: diff --git a/dataset/mini_librispeech/mini_librispeech.py b/dataset/mini_librispeech/mini_librispeech.py index 730c73a8b..0eb80bf8f 100644 --- a/dataset/mini_librispeech/mini_librispeech.py +++ b/dataset/mini_librispeech/mini_librispeech.py @@ -30,7 +30,7 @@ import soundfile from utils.utility import download from utils.utility import unpack -URL_ROOT = "http://www.openslr.org/resources/31" +URL_ROOT = "http://openslr.elda.org/resources/31" URL_TRAIN_CLEAN = URL_ROOT + "/train-clean-5.tar.gz" URL_DEV_CLEAN = URL_ROOT + "/dev-clean-2.tar.gz" diff --git a/dataset/musan/musan.py b/dataset/musan/musan.py index 2ac701bed..ae3430b2a 100644 --- a/dataset/musan/musan.py +++ b/dataset/musan/musan.py @@ -34,7 +34,7 @@ from utils.utility import unpack DATA_HOME = os.path.expanduser('~/.cache/paddle/dataset/speech') -URL_ROOT = 'https://www.openslr.org/resources/17' +URL_ROOT = 'https://openslr.elda.org/resources/17' DATA_URL = URL_ROOT + '/musan.tar.gz' MD5_DATA = '0c472d4fc0c5141eca47ad1ffeb2a7df' diff --git a/dataset/primewords/README.md b/dataset/primewords/README.md index a4f1ed65d..dba51cec7 100644 --- a/dataset/primewords/README.md +++ b/dataset/primewords/README.md @@ -1,4 +1,4 @@ -# [Primewords](http://www.openslr.org/47/) +# [Primewords](http://openslr.elda.org/47/) This free Chinese Mandarin speech corpus set is released by Shanghai Primewords Information Technology Co., Ltd. The corpus is recorded by smart mobile phones from 296 native Chinese speakers. The transcription accuracy is larger than 98%, at the confidence level of 95%. It is free for academic use. diff --git a/dataset/rir_noise/rir_noise.py b/dataset/rir_noise/rir_noise.py index 009175e5b..b1d475584 100644 --- a/dataset/rir_noise/rir_noise.py +++ b/dataset/rir_noise/rir_noise.py @@ -34,7 +34,7 @@ from utils.utility import unzip DATA_HOME = os.path.expanduser('~/.cache/paddle/dataset/speech') -URL_ROOT = '--no-check-certificate http://www.openslr.org/resources/28' +URL_ROOT = '--no-check-certificate https://us.openslr.org/resources/28/rirs_noises.zip' DATA_URL = URL_ROOT + '/rirs_noises.zip' MD5_DATA = 'e6f48e257286e05de56413b4779d8ffb' diff --git a/dataset/st-cmds/README.md b/dataset/st-cmds/README.md index c7ae50e59..bbf85c3e7 100644 --- a/dataset/st-cmds/README.md +++ b/dataset/st-cmds/README.md @@ -1 +1 @@ -# [FreeST](http://www.openslr.org/38/) +# [FreeST](http://openslr.elda.org/38/) diff --git a/dataset/thchs30/README.md b/dataset/thchs30/README.md index 6b59d663a..b488a3551 100644 --- a/dataset/thchs30/README.md +++ b/dataset/thchs30/README.md @@ -1,4 +1,4 @@ -# [THCHS30](http://www.openslr.org/18/) +# [THCHS30](http://openslr.elda.org/18/) This is the *data part* of the `THCHS30 2015` acoustic data & scripts dataset. 
diff --git a/dataset/thchs30/thchs30.py b/dataset/thchs30/thchs30.py index cdfc0a75c..d41c0e175 100644 --- a/dataset/thchs30/thchs30.py +++ b/dataset/thchs30/thchs30.py @@ -32,7 +32,7 @@ from utils.utility import unpack DATA_HOME = os.path.expanduser('~/.cache/paddle/dataset/speech') -URL_ROOT = 'http://www.openslr.org/resources/18' +URL_ROOT = 'http://openslr.elda.org/resources/18' # URL_ROOT = 'https://openslr.magicdatatech.com/resources/18' DATA_URL = URL_ROOT + '/data_thchs30.tgz' TEST_NOISE_URL = URL_ROOT + '/test-noise.tgz' diff --git a/demos/README.md b/demos/README.md index 2a306df6b..72b70b237 100644 --- a/demos/README.md +++ b/demos/README.md @@ -12,6 +12,7 @@ This directory contains many speech applications in multiple scenarios. * speech recognition - recognize text of an audio file * speech server - Server for Speech Task, e.g. ASR,TTS,CLS * streaming asr server - receive audio stream from websocket, and recognize to transcript. +* streaming tts server - receive text from http or websocket, and streaming audio data stream. * speech translation - end to end speech translation * story talker - book reader based on OCR and TTS * style_fs2 - multi style control for FastSpeech2 model diff --git a/demos/README_cn.md b/demos/README_cn.md index 471342127..04fc1fa7d 100644 --- a/demos/README_cn.md +++ b/demos/README_cn.md @@ -10,8 +10,9 @@ * 元宇宙 - 基于语音合成的 2D 增强现实。 * 标点恢复 - 通常作为语音识别的文本后处理任务,为一段无标点的纯文本添加相应的标点符号。 * 语音识别 - 识别一段音频中包含的语音文字。 -* 语音服务 - 离线语音服务,包括ASR、TTS、CLS等 -* 流式语音识别服务 - 流式输入语音数据流识别音频中的文字 +* 语音服务 - 离线语音服务,包括ASR、TTS、CLS等。 +* 流式语音识别服务 - 流式输入语音数据流识别音频中的文字。 +* 流式语音合成服务 - 根据待合成文本流式生成合成音频数据流。 * 语音翻译 - 实时识别音频中的语言,并同时翻译成目标语言。 * 会说话的故事书 - 基于 OCR 和语音合成的会说话的故事书。 * 个性化语音合成 - 基于 FastSpeech2 模型的个性化语音合成。 diff --git a/demos/audio_searching/requirements.txt b/demos/audio_searching/requirements.txt index 057c6ab92..9d0f6419b 100644 --- a/demos/audio_searching/requirements.txt +++ b/demos/audio_searching/requirements.txt @@ -2,7 +2,7 @@ diskcache==5.2.1 dtaidistance==2.3.1 fastapi librosa==0.8.0 -numpy==1.21.0 +numpy==1.22.0 pydantic pymilvus==2.0.1 pymysql diff --git a/demos/custom_streaming_asr/setup_docker.sh b/demos/custom_streaming_asr/setup_docker.sh old mode 100644 new mode 100755 diff --git a/demos/keyword_spotting/README.md b/demos/keyword_spotting/README.md new file mode 100644 index 000000000..6544cf71e --- /dev/null +++ b/demos/keyword_spotting/README.md @@ -0,0 +1,79 @@ +([简体中文](./README_cn.md)|English) +# KWS (Keyword Spotting) + +## Introduction +KWS(Keyword Spotting) is a technique to recognize keyword from a giving speech audio. + +This demo is an implementation to recognize keyword from a specific audio file. It can be done by a single command or a few lines in python using `PaddleSpeech`. + +## Usage +### 1. Installation +see [installation](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install.md). + +You can choose one way from easy, meduim and hard to install paddlespeech. + +### 2. Prepare Input File +The input of this demo should be a WAV file(`.wav`), and the sample rate must be the same as the model. + +Here are sample files for this demo that can be downloaded: +```bash +wget -c https://paddlespeech.bj.bcebos.com/kws/hey_snips.wav https://paddlespeech.bj.bcebos.com/kws/non-keyword.wav +``` + +### 3. 
Usage +- Command Line (Recommended) + ```bash + paddlespeech kws --input ./hey_snips.wav + paddlespeech kws --input ./non-keyword.wav + ``` + + Usage: + ```bash + paddlespeech kws --help + ``` + Arguments: + - `input`(required): Audio file to recognize. + - `threshold`: Score threshold for kws. Default: `0.8`. + - `model`: Model type of kws task. Default: `mdtc_heysnips`. + - `config`: Config of kws task. Use pretrained model when it is None. Default: `None`. + - `ckpt_path`: Model checkpoint. Use pretrained model when it is None. Default: `None`. + - `device`: Choose device to execute model inference. Default: default device of paddlepaddle in current environment. + - `verbose`: Show the log information. + + Output: + ```bash + # Input file: ./hey_snips.wav + Score: 1.000, Threshold: 0.8, Is keyword: True + # Input file: ./non-keyword.wav + Score: 0.000, Threshold: 0.8, Is keyword: False + ``` + +- Python API + ```python + import paddle + from paddlespeech.cli.kws import KWSExecutor + + kws_executor = KWSExecutor() + result = kws_executor( + audio_file='./hey_snips.wav', + threshold=0.8, + model='mdtc_heysnips', + config=None, + ckpt_path=None, + device=paddle.get_device()) + print('KWS Result: \n{}'.format(result)) + ``` + + Output: + ```bash + KWS Result: + Score: 1.000, Threshold: 0.8, Is keyword: True + ``` + +### 4. Pretrained Models + +Here is a list of pretrained models released by PaddleSpeech that can be used by command line and Python API: + +| Model | Language | Sample Rate +| :--- | :---: | :---: | +| mdtc_heysnips | en | 16k diff --git a/demos/keyword_spotting/README_cn.md b/demos/keyword_spotting/README_cn.md new file mode 100644 index 000000000..0d8f44a53 --- /dev/null +++ b/demos/keyword_spotting/README_cn.md @@ -0,0 +1,76 @@ +(简体中文|[English](./README.md)) + +# 关键词识别 +## 介绍 +关键词识别是一项用于识别一段语音内是否包含特定关键词的技术。 + +这个 demo 是一个从给定音频文件识别特定关键词的实现,它可以通过使用 `PaddleSpeech` 的单个命令或 python 中的几行代码来实现。 +## 使用方法 +### 1. 安装 +请看[安装文档](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install_cn.md)。 + +你可以从 easy,medium,hard 三种方式中选择一种方式安装。 + +### 2. 准备输入 +这个 demo 的输入应该是一个 WAV 文件(`.wav`),并且采样率必须与模型的采样率相同。 + +可以下载此 demo 的示例音频: +```bash +wget -c https://paddlespeech.bj.bcebos.com/kws/hey_snips.wav https://paddlespeech.bj.bcebos.com/kws/non-keyword.wav +``` +### 3. 
使用方法 +- 命令行 (推荐使用) + ```bash + paddlespeech kws --input ./hey_snips.wav + paddlespeech kws --input ./non-keyword.wav + ``` + + 使用方法: + ```bash + paddlespeech kws --help + ``` + 参数: + - `input`(必须输入):用于识别关键词的音频文件。 + - `threshold`:用于判别是包含关键词的得分阈值,默认值:`0.8`。 + - `model`:KWS 任务的模型,默认值:`mdtc_heysnips`。 + - `config`:KWS 任务的参数文件,若不设置则使用预训练模型中的默认配置,默认值:`None`。 + - `ckpt_path`:模型参数文件,若不设置则下载预训练模型使用,默认值:`None`。 + - `device`:执行预测的设备,默认值:当前系统下 paddlepaddle 的默认 device。 + - `verbose`: 如果使用,显示 logger 信息。 + + 输出: + ```bash + # 输入为 ./hey_snips.wav + Score: 1.000, Threshold: 0.8, Is keyword: True + # 输入为 ./non-keyword.wav + Score: 0.000, Threshold: 0.8, Is keyword: False + ``` + +- Python API + ```python + import paddle + from paddlespeech.cli.kws import KWSExecutor + + kws_executor = KWSExecutor() + result = kws_executor( + audio_file='./hey_snips.wav', + threshold=0.8, + model='mdtc_heysnips', + config=None, + ckpt_path=None, + device=paddle.get_device()) + print('KWS Result: \n{}'.format(result)) + ``` + + 输出: + ```bash + KWS Result: + Score: 1.000, Threshold: 0.8, Is keyword: True + ``` + +### 4.预训练模型 +以下是 PaddleSpeech 提供的可以被命令行和 python API 使用的预训练模型列表: + +| 模型 | 语言 | 采样率 +| :--- | :---: | :---: | +| mdtc_heysnips | en | 16k diff --git a/demos/keyword_spotting/run.sh b/demos/keyword_spotting/run.sh new file mode 100755 index 000000000..7f9e0ebba --- /dev/null +++ b/demos/keyword_spotting/run.sh @@ -0,0 +1,7 @@ +#!/bin/bash + +wget -c https://paddlespeech.bj.bcebos.com/kws/hey_snips.wav https://paddlespeech.bj.bcebos.com/kws/non-keyword.wav + +# kws +paddlespeech kws --input ./hey_snips.wav +paddlespeech kws --input non-keyword.wav diff --git a/demos/speaker_verification/run.sh b/demos/speaker_verification/run.sh old mode 100644 new mode 100755 diff --git a/demos/speech_recognition/run.sh b/demos/speech_recognition/run.sh old mode 100644 new mode 100755 index 19ce0ebb3..e48ff3e96 --- a/demos/speech_recognition/run.sh +++ b/demos/speech_recognition/run.sh @@ -1,6 +1,7 @@ #!/bin/bash -wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespeech.bj.bcebos.com/PaddleAudio/en.wav +wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav +wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/en.wav # asr paddlespeech asr --input ./zh.wav @@ -8,3 +9,18 @@ paddlespeech asr --input ./zh.wav # asr + punc paddlespeech asr --input ./zh.wav | paddlespeech text --task punc + + +# asr help +paddlespeech asr --help + + +# english asr +paddlespeech asr --lang en --model transformer_librispeech --input ./en.wav + +# model stats +paddlespeech stats --task asr + + +# paddlespeech help +paddlespeech --help diff --git a/demos/speech_server/README.md b/demos/speech_server/README.md index 14a88f078..e400f7e74 100644 --- a/demos/speech_server/README.md +++ b/demos/speech_server/README.md @@ -5,13 +5,19 @@ ## Introduction This demo is an implementation of starting the voice service and accessing the service. It can be achieved with a single command using `paddlespeech_server` and `paddlespeech_client` or a few lines of code in python. +For service interface definition, please check: +- [PaddleSpeech Server RESTful API](https://github.com/PaddlePaddle/PaddleSpeech/wiki/PaddleSpeech-Server-RESTful-API) + ## Usage ### 1. Installation see [installation](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install.md). -It is recommended to use **paddlepaddle 2.2.2** or above. -You can choose one way from meduim and hard to install paddlespeech. 
+It is recommended to use **paddlepaddle 2.3.1** or above. + +You can choose one way from easy, meduim and hard to install paddlespeech. + +**If you install in easy mode, you need to prepare the yaml file by yourself, you can refer to the yaml file in the conf directory.** ### 2. Prepare config File The configuration file can be found in `conf/application.yaml` . @@ -20,14 +26,6 @@ At present, the speech tasks integrated by the service include: asr (speech reco Currently the engine type supports two forms: python and inference (Paddle Inference) **Note:** If the service can be started normally in the container, but the client access IP is unreachable, you can try to replace the `host` address in the configuration file with the local IP address. - -The input of ASR client demo should be a WAV file(`.wav`), and the sample rate must be the same as the model. - -Here are sample files for thisASR client demo that can be downloaded: -```bash -wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespeech.bj.bcebos.com/PaddleAudio/en.wav -``` - ### 3. Server Usage - Command Line (Recommended) @@ -46,7 +44,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee - `log_file`: log file. Default: ./log/paddlespeech.log Output: - ```bash + ```text [2022-02-23 11:17:32] [INFO] [server.py:64] Started server process [6384] INFO: Waiting for application startup. [2022-02-23 11:17:32] [INFO] [on.py:26] Waiting for application startup. @@ -54,7 +52,6 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee [2022-02-23 11:17:32] [INFO] [on.py:38] Application startup complete. INFO: Uvicorn running on http://0.0.0.0:8090 (Press CTRL+C to quit) [2022-02-23 11:17:32] [INFO] [server.py:204] Uvicorn running on http://0.0.0.0:8090 (Press CTRL+C to quit) - ``` - Python API @@ -68,7 +65,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee ``` Output: - ```bash + ```text INFO: Started server process [529] [2022-02-23 14:57:56] [INFO] [server.py:64] Started server process [529] INFO: Waiting for application startup. @@ -77,11 +74,19 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee [2022-02-23 14:57:56] [INFO] [on.py:38] Application startup complete. INFO: Uvicorn running on http://0.0.0.0:8090 (Press CTRL+C to quit) [2022-02-23 14:57:56] [INFO] [server.py:204] Uvicorn running on http://0.0.0.0:8090 (Press CTRL+C to quit) - ``` ### 4. ASR Client Usage + +The input of ASR client demo should be a WAV file(`.wav`), and the sample rate must be the same as the model. + +Here are sample files for this ASR client demo that can be downloaded: +```bash +wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav +wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/en.wav +``` + **Note:** The response time will be slightly longer when using the client for the first time - Command Line (Recommended) @@ -105,16 +110,14 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee - `audio_format`: Audio format. Default: "wav". Output: - ```bash - [2022-02-23 18:11:22,819] [ INFO] - {'success': True, 'code': 200, 'message': {'description': 'success'}, 'result': {'transcription': '我认为跑步最重要的就是给我带来了身体健康'}} - [2022-02-23 18:11:22,820] [ INFO] - time cost 0.689145 s. - + ```text + [2022-08-01 07:54:01,646] [ INFO] - ASR result: 我认为跑步最重要的就是给我带来了身体健康 + [2022-08-01 07:54:01,646] [ INFO] - Response time 4.898965 s. 
``` - Python API ```python from paddlespeech.server.bin.paddlespeech_client import ASRClientExecutor - import json asrclient_executor = ASRClientExecutor() res = asrclient_executor( @@ -124,12 +127,11 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee sample_rate=16000, lang="zh_cn", audio_format="wav") - print(res.json()) + print(res) ``` - Output: - ```bash - {'success': True, 'code': 200, 'message': {'description': 'success'}, 'result': {'transcription': '我认为跑步最重要的就是给我带来了身体健康'}} + ```text + 我认为跑步最重要的就是给我带来了身体健康 ``` ### 5. TTS Client Usage @@ -157,12 +159,10 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee - `output`: Output wave filepath. Default: None, which means not to save the audio to the local. Output: - ```bash - [2022-02-23 15:20:37,875] [ INFO] - {'description': 'success.'} + ```text [2022-02-23 15:20:37,875] [ INFO] - Save synthesized audio successfully on output.wav. [2022-02-23 15:20:37,875] [ INFO] - Audio duration: 3.612500 s. [2022-02-23 15:20:37,875] [ INFO] - Response time: 0.348050 s. - ``` - Python API @@ -188,20 +188,25 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee ``` Output: - ```bash + ```text {'description': 'success.'} Save synthesized audio successfully on ./output.wav. Audio duration: 3.612500 s. - ``` ### 6. CLS Client Usage + +Here are sample files for this CLS Client demo that can be downloaded: +```bash +wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav +``` + **Note:** The response time will be slightly longer when using the client for the first time - Command Line (Recommended) If `127.0.0.1` is not accessible, you need to use the actual service IP address. - ``` + ```bash paddlespeech_client cls --server_ip 127.0.0.1 --port 8090 --input ./zh.wav ``` @@ -217,11 +222,9 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee - `topk`: topk scores of classification result. Output: - ```bash + ```text [2022-03-09 20:44:39,974] [ INFO] - {'success': True, 'code': 200, 'message': {'description': 'success'}, 'result': {'topk': 1, 'results': [{'class_name': 'Speech', 'prob': 0.9027184844017029}]}} [2022-03-09 20:44:39,975] [ INFO] - Response time 0.104360 s. - - ``` - Python API @@ -239,14 +242,19 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee ``` Output: - ```bash + ```text {'success': True, 'code': 200, 'message': {'description': 'success'}, 'result': {'topk': 1, 'results': [{'class_name': 'Speech', 'prob': 0.9027184844017029}]}} - ``` ### 7. 
Speaker Verification Client Usage +Here are sample files for this Speaker Verification Client demo that can be downloaded: +```bash +wget -c https://paddlespeech.bj.bcebos.com/vector/audio/85236145389.wav +wget -c https://paddlespeech.bj.bcebos.com/vector/audio/123456789.wav +``` + #### 7.1 Extract speaker embedding **Note:** The response time will be slightly longer when using the client for the first time - Command Line (Recommended) @@ -273,19 +281,19 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee Output: - ```bash - [2022-05-25 12:25:36,165] [ INFO] - vector http client start - [2022-05-25 12:25:36,165] [ INFO] - the input audio: 85236145389.wav - [2022-05-25 12:25:36,165] [ INFO] - endpoint: http://127.0.0.1:8790/paddlespeech/vector - [2022-05-25 12:25:36,166] [ INFO] - http://127.0.0.1:8790/paddlespeech/vector - [2022-05-25 12:25:36,324] [ INFO] - The vector: {'success': True, 'code': 200, 'message': {'description': 'success'}, 'result': {'vec': [-1.3251205682754517, 7.860682487487793, -4.620625972747803, 0.3000721037387848, 2.2648534774780273, -1.1931440830230713, 3.064713716506958, 7.673594951629639, -6.004472732543945, -12.024259567260742, -1.9496068954467773, 3.126953601837158, 1.6188379526138306, -7.638310432434082, -1.2299772500991821, -12.33833122253418, 2.1373026371002197, -5.395712375640869, 9.717328071594238, 5.675230503082275, 3.7805123329162598, 3.0597171783447266, 3.429692029953003, 8.9760103225708, 13.174124717712402, -0.5313228368759155, 8.942471504211426, 4.465109825134277, -4.426247596740723, -9.726503372192383, 8.399328231811523, 7.223917484283447, -7.435853958129883, 2.9441683292388916, -4.343039512634277, -13.886964797973633, -1.6346734762191772, -10.902740478515625, -5.311244964599609, 3.800722122192383, 3.897603750228882, -2.123077392578125, -2.3521194458007812, 4.151031017303467, -7.404866695404053, 0.13911646604537964, 2.4626107215881348, 4.96645450592041, 0.9897574186325073, 5.483975410461426, -3.3574001789093018, 10.13400650024414, -0.6120170950889587, -10.403095245361328, 4.600754261016846, 16.009349822998047, -7.78369140625, -4.194530487060547, -6.93686056137085, 1.1789555549621582, 11.490800857543945, 4.23802375793457, 9.550930976867676, 8.375045776367188, 7.508914470672607, -0.6570729613304138, -0.3005157709121704, 2.8406054973602295, 3.0828027725219727, 0.7308170199394226, 6.1483540534973145, 0.1376611888408661, -13.424735069274902, -7.746140480041504, -2.322798252105713, -8.305252075195312, 2.98791241645813, -10.99522876739502, 0.15211068093776703, -2.3820347785949707, -1.7984174489974976, 8.49562931060791, -5.852236747741699, -3.755497932434082, 0.6989710927009583, -5.270299434661865, -2.6188621520996094, -1.8828465938568115, -4.6466498374938965, 14.078543663024902, -0.5495333075523376, 10.579157829284668, -3.216050148010254, 9.349003791809082, -4.381077766418457, -11.675816535949707, -2.863020658493042, 4.5721755027771, 2.246612071990967, -4.574341773986816, 1.8610187768936157, 2.3767874240875244, 5.625787734985352, -9.784077644348145, 0.6496725678443909, -1.457950472831726, 0.4263263940811157, -4.921126365661621, -2.4547839164733887, 3.4869801998138428, -0.4265422224998474, 8.341268539428711, 1.356552004814148, 7.096688270568848, -13.102828979492188, 8.01673412322998, -7.115934371948242, 1.8699780702590942, 0.20872099697589874, 14.699383735656738, -1.0252779722213745, -2.6107232570648193, -2.5082311630249023, 8.427192687988281, 6.913852691650391, -6.29124641418457, 0.6157366037368774, 2.489687919616699, 
-3.4668266773223877, 9.92176342010498, 11.200815200805664, -0.19664029777050018, 7.491600513458252, -0.6231271624565125, -0.2584814429283142, -9.947997093200684, -0.9611040949821472, 1.1649218797683716, -2.1907122135162354, -1.502848744392395, -0.5192610621452332, 15.165953636169434, 2.4649462699890137, -0.998044490814209, 7.44166374206543, -2.0768048763275146, 3.5896823406219482, -7.305543422698975, -7.562084674835205, 4.32333517074585, 0.08044180274009705, -6.564010143280029, -2.314805269241333, -1.7642345428466797, -2.470881700515747, -7.6756181716918945, -9.548877716064453, -1.017755389213562, 0.1698644608259201, 2.5877134799957275, -1.8752295970916748, -0.36614322662353516, -6.049378395080566, -2.3965611457824707, -5.945338726043701, 0.9424033164978027, -13.155974388122559, -7.45780086517334, 0.14658108353614807, -3.7427968978881836, 5.841492652893066, -1.2872905731201172, 5.569431304931641, 12.570590019226074, 1.0939218997955322, 2.2142086029052734, 1.9181575775146484, 6.991420745849609, -5.888138771057129, 3.1409823894500732, -2.0036280155181885, 2.4434285163879395, 9.973138809204102, 5.036680221557617, 2.005120277404785, 2.861560344696045, 5.860223770141602, 2.917618751525879, -1.63111412525177, 2.0292205810546875, -4.070415019989014, -6.831437110900879]}} - [2022-05-25 12:25:36,324] [ INFO] - Response time 0.159053 s. + ```text + [2022-08-01 09:01:22,151] [ INFO] - vector http client start + [2022-08-01 09:01:22,152] [ INFO] - the input audio: 85236145389.wav + [2022-08-01 09:01:22,152] [ INFO] - endpoint: http://127.0.0.1:8090/paddlespeech/vector + [2022-08-01 09:01:27,093] [ INFO] - {'success': True, 'code': 200, 'message': {'description': 'success'}, 'result': {'vec': [1.4217487573623657, 5.626248836517334, -5.342073440551758, 1.177390217781067, 3.308061122894287, 1.7565997838974, 5.1678876876831055, 10.806346893310547, -3.822679042816162, -5.614130973815918, 2.6238481998443604, -0.8072965741157532, 1.963512659072876, -7.312864780426025, 0.011034967377781868, -9.723127365112305, 0.661963164806366, -6.976816654205322, 10.213465690612793, 7.494767189025879, 2.9105641841888428, 3.894925117492676, 3.7999846935272217, 7.106173992156982, 16.905324935913086, -7.149376392364502, 8.733112335205078, 3.423002004623413, -4.831653118133545, -11.403371810913086, 11.232216835021973, 7.127464771270752, -4.282831192016602, 2.4523589611053467, -5.13075065612793, -18.17765998840332, -2.611666440963745, -11.00034236907959, -6.731431007385254, 1.6564655303955078, 0.7618184685707092, 1.1253058910369873, -2.0838277339935303, 4.725739002227783, -8.782590866088867, -3.5398736000061035, 3.8142387866973877, 5.142062664031982, 2.162053346633911, 4.09642219543457, -6.416221618652344, 12.747454643249512, 1.9429889917373657, -15.152948379516602, 6.417416572570801, 16.097013473510742, -9.716649055480957, -1.9920448064804077, -3.364956855773926, -1.8719490766525269, 11.567351341247559, 3.6978795528411865, 11.258269309997559, 7.442364692687988, 9.183405876159668, 4.528151512145996, -1.2417811155319214, 4.395910263061523, 6.672768592834473, 5.889888763427734, 7.627115249633789, -0.6692016124725342, -11.889703750610352, -9.208883285522461, -7.427401542663574, -3.777655601501465, 6.917237758636475, -9.848749160766602, -2.094479560852051, -5.1351189613342285, 0.49564215540885925, 9.317541122436523, -5.9141845703125, -1.809845209121704, -0.11738205701112747, -7.169270992279053, -1.0578246116638184, -5.721685886383057, -5.117387294769287, 16.137670516967773, -4.473618984222412, 7.66243314743042, -0.5538089871406555, 
9.631582260131836, -6.470466613769531, -8.54850959777832, 4.371622085571289, -0.7970349192619324, 4.479003429412842, -2.9758646488189697, 3.2721707820892334, 2.8382749557495117, 5.1345953941345215, -9.19078254699707, -0.5657423138618469, -4.874573230743408, 2.316561460494995, -5.984307289123535, -2.1798791885375977, 0.35541653633117676, -0.3178458511829376, 9.493547439575195, 2.114448070526123, 4.358088493347168, -12.089820861816406, 8.451695442199707, -7.925461769104004, 4.624246120452881, 4.428938388824463, 18.691999435424805, -2.620460033416748, -5.149182319641113, -0.3582168221473694, 8.488557815551758, 4.98148250579834, -9.326834678649902, -2.2544236183166504, 6.64176607131958, 1.2119656801223755, 10.977132797241211, 16.55504035949707, 3.323848247528076, 9.55185317993164, -1.6677050590515137, -0.7953923940658569, -8.605660438537598, -0.4735637903213501, 2.6741855144500732, -5.359188079833984, -2.6673784255981445, 0.6660736799240112, 15.443212509155273, 4.740597724914551, -3.4725306034088135, 11.592561721801758, -2.05450701713562, 1.7361239194869995, -8.26533031463623, -9.304476737976074, 5.406835079193115, -1.5180232524871826, -7.746610641479492, -6.089605331420898, 0.07112561166286469, -0.34904858469963074, -8.649889945983887, -9.998958587646484, -2.5648481845855713, -0.5399898886680603, 2.6018145084381104, -0.31927648186683655, -1.8815231323242188, -2.0721378326416016, -3.4105639457702637, -8.299802780151367, 1.4836379289627075, -15.366002082824707, -8.288193702697754, 3.884773015975952, -3.4876506328582764, 7.362995624542236, 0.4657321572303772, 3.1326000690460205, 12.438883781433105, -1.8337029218673706, 4.532927513122559, 2.726433277130127, 10.145345687866211, -6.521956920623779, 2.8971481323242188, -3.3925881385803223, 5.079156398773193, 7.759725093841553, 4.677562236785889, 5.8457818031311035, 2.4023921489715576, 7.707108974456787, 3.9711389541625977, -6.390035152435303, 6.126871109008789, -3.776031017303467, -11.118141174316406]}} + [2022-08-01 09:01:27,094] [ INFO] - Response time 4.941739 s. 
``` * Python API ``` python from paddlespeech.server.bin.paddlespeech_client import VectorClientExecutor + import json vectorclient_executor = VectorClientExecutor() res = vectorclient_executor( @@ -293,13 +301,13 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee server_ip="127.0.0.1", port=8090, task="spk") - print(res) + print(res.json()) ``` Output: - ``` bash - {'success': True, 'code': 200, 'message': {'description': 'success'}, 'result': {'vec': [-1.3251205682754517, 7.860682487487793, -4.620625972747803, 0.3000721037387848, 2.2648534774780273, -1.1931440830230713, 3.064713716506958, 7.673594951629639, -6.004472732543945, -12.024259567260742, -1.9496068954467773, 3.126953601837158, 1.6188379526138306, -7.638310432434082, -1.2299772500991821, -12.33833122253418, 2.1373026371002197, -5.395712375640869, 9.717328071594238, 5.675230503082275, 3.7805123329162598, 3.0597171783447266, 3.429692029953003, 8.9760103225708, 13.174124717712402, -0.5313228368759155, 8.942471504211426, 4.465109825134277, -4.426247596740723, -9.726503372192383, 8.399328231811523, 7.223917484283447, -7.435853958129883, 2.9441683292388916, -4.343039512634277, -13.886964797973633, -1.6346734762191772, -10.902740478515625, -5.311244964599609, 3.800722122192383, 3.897603750228882, -2.123077392578125, -2.3521194458007812, 4.151031017303467, -7.404866695404053, 0.13911646604537964, 2.4626107215881348, 4.96645450592041, 0.9897574186325073, 5.483975410461426, -3.3574001789093018, 10.13400650024414, -0.6120170950889587, -10.403095245361328, 4.600754261016846, 16.009349822998047, -7.78369140625, -4.194530487060547, -6.93686056137085, 1.1789555549621582, 11.490800857543945, 4.23802375793457, 9.550930976867676, 8.375045776367188, 7.508914470672607, -0.6570729613304138, -0.3005157709121704, 2.8406054973602295, 3.0828027725219727, 0.7308170199394226, 6.1483540534973145, 0.1376611888408661, -13.424735069274902, -7.746140480041504, -2.322798252105713, -8.305252075195312, 2.98791241645813, -10.99522876739502, 0.15211068093776703, -2.3820347785949707, -1.7984174489974976, 8.49562931060791, -5.852236747741699, -3.755497932434082, 0.6989710927009583, -5.270299434661865, -2.6188621520996094, -1.8828465938568115, -4.6466498374938965, 14.078543663024902, -0.5495333075523376, 10.579157829284668, -3.216050148010254, 9.349003791809082, -4.381077766418457, -11.675816535949707, -2.863020658493042, 4.5721755027771, 2.246612071990967, -4.574341773986816, 1.8610187768936157, 2.3767874240875244, 5.625787734985352, -9.784077644348145, 0.6496725678443909, -1.457950472831726, 0.4263263940811157, -4.921126365661621, -2.4547839164733887, 3.4869801998138428, -0.4265422224998474, 8.341268539428711, 1.356552004814148, 7.096688270568848, -13.102828979492188, 8.01673412322998, -7.115934371948242, 1.8699780702590942, 0.20872099697589874, 14.699383735656738, -1.0252779722213745, -2.6107232570648193, -2.5082311630249023, 8.427192687988281, 6.913852691650391, -6.29124641418457, 0.6157366037368774, 2.489687919616699, -3.4668266773223877, 9.92176342010498, 11.200815200805664, -0.19664029777050018, 7.491600513458252, -0.6231271624565125, -0.2584814429283142, -9.947997093200684, -0.9611040949821472, 1.1649218797683716, -2.1907122135162354, -1.502848744392395, -0.5192610621452332, 15.165953636169434, 2.4649462699890137, -0.998044490814209, 7.44166374206543, -2.0768048763275146, 3.5896823406219482, -7.305543422698975, -7.562084674835205, 4.32333517074585, 0.08044180274009705, -6.564010143280029, -2.314805269241333, 
-1.7642345428466797, -2.470881700515747, -7.6756181716918945, -9.548877716064453, -1.017755389213562, 0.1698644608259201, 2.5877134799957275, -1.8752295970916748, -0.36614322662353516, -6.049378395080566, -2.3965611457824707, -5.945338726043701, 0.9424033164978027, -13.155974388122559, -7.45780086517334, 0.14658108353614807, -3.7427968978881836, 5.841492652893066, -1.2872905731201172, 5.569431304931641, 12.570590019226074, 1.0939218997955322, 2.2142086029052734, 1.9181575775146484, 6.991420745849609, -5.888138771057129, 3.1409823894500732, -2.0036280155181885, 2.4434285163879395, 9.973138809204102, 5.036680221557617, 2.005120277404785, 2.861560344696045, 5.860223770141602, 2.917618751525879, -1.63111412525177, 2.0292205810546875, -4.070415019989014, -6.831437110900879]}} + ```text + {'success': True, 'code': 200, 'message': {'description': 'success'}, 'result': {'vec': [1.4217487573623657, 5.626248836517334, -5.342073440551758, 1.177390217781067, 3.308061122894287, 1.7565997838974, 5.1678876876831055, 10.806346893310547, -3.822679042816162, -5.614130973815918, 2.6238481998443604, -0.8072965741157532, 1.963512659072876, -7.312864780426025, 0.011034967377781868, -9.723127365112305, 0.661963164806366, -6.976816654205322, 10.213465690612793, 7.494767189025879, 2.9105641841888428, 3.894925117492676, 3.7999846935272217, 7.106173992156982, 16.905324935913086, -7.149376392364502, 8.733112335205078, 3.423002004623413, -4.831653118133545, -11.403371810913086, 11.232216835021973, 7.127464771270752, -4.282831192016602, 2.4523589611053467, -5.13075065612793, -18.17765998840332, -2.611666440963745, -11.00034236907959, -6.731431007385254, 1.6564655303955078, 0.7618184685707092, 1.1253058910369873, -2.0838277339935303, 4.725739002227783, -8.782590866088867, -3.5398736000061035, 3.8142387866973877, 5.142062664031982, 2.162053346633911, 4.09642219543457, -6.416221618652344, 12.747454643249512, 1.9429889917373657, -15.152948379516602, 6.417416572570801, 16.097013473510742, -9.716649055480957, -1.9920448064804077, -3.364956855773926, -1.8719490766525269, 11.567351341247559, 3.6978795528411865, 11.258269309997559, 7.442364692687988, 9.183405876159668, 4.528151512145996, -1.2417811155319214, 4.395910263061523, 6.672768592834473, 5.889888763427734, 7.627115249633789, -0.6692016124725342, -11.889703750610352, -9.208883285522461, -7.427401542663574, -3.777655601501465, 6.917237758636475, -9.848749160766602, -2.094479560852051, -5.1351189613342285, 0.49564215540885925, 9.317541122436523, -5.9141845703125, -1.809845209121704, -0.11738205701112747, -7.169270992279053, -1.0578246116638184, -5.721685886383057, -5.117387294769287, 16.137670516967773, -4.473618984222412, 7.66243314743042, -0.5538089871406555, 9.631582260131836, -6.470466613769531, -8.54850959777832, 4.371622085571289, -0.7970349192619324, 4.479003429412842, -2.9758646488189697, 3.2721707820892334, 2.8382749557495117, 5.1345953941345215, -9.19078254699707, -0.5657423138618469, -4.874573230743408, 2.316561460494995, -5.984307289123535, -2.1798791885375977, 0.35541653633117676, -0.3178458511829376, 9.493547439575195, 2.114448070526123, 4.358088493347168, -12.089820861816406, 8.451695442199707, -7.925461769104004, 4.624246120452881, 4.428938388824463, 18.691999435424805, -2.620460033416748, -5.149182319641113, -0.3582168221473694, 8.488557815551758, 4.98148250579834, -9.326834678649902, -2.2544236183166504, 6.64176607131958, 1.2119656801223755, 10.977132797241211, 16.55504035949707, 3.323848247528076, 9.55185317993164, -1.6677050590515137, 
-0.7953923940658569, -8.605660438537598, -0.4735637903213501, 2.6741855144500732, -5.359188079833984, -2.6673784255981445, 0.6660736799240112, 15.443212509155273, 4.740597724914551, -3.4725306034088135, 11.592561721801758, -2.05450701713562, 1.7361239194869995, -8.26533031463623, -9.304476737976074, 5.406835079193115, -1.5180232524871826, -7.746610641479492, -6.089605331420898, 0.07112561166286469, -0.34904858469963074, -8.649889945983887, -9.998958587646484, -2.5648481845855713, -0.5399898886680603, 2.6018145084381104, -0.31927648186683655, -1.8815231323242188, -2.0721378326416016, -3.4105639457702637, -8.299802780151367, 1.4836379289627075, -15.366002082824707, -8.288193702697754, 3.884773015975952, -3.4876506328582764, 7.362995624542236, 0.4657321572303772, 3.1326000690460205, 12.438883781433105, -1.8337029218673706, 4.532927513122559, 2.726433277130127, 10.145345687866211, -6.521956920623779, 2.8971481323242188, -3.3925881385803223, 5.079156398773193, 7.759725093841553, 4.677562236785889, 5.8457818031311035, 2.4023921489715576, 7.707108974456787, 3.9711389541625977, -6.390035152435303, 6.126871109008789, -3.776031017303467, -11.118141174316406]}} ``` #### 7.2 Get the score between speaker audio embedding @@ -330,19 +338,19 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee Output: - ``` bash - [2022-05-25 12:33:24,527] [ INFO] - vector score http client start - [2022-05-25 12:33:24,527] [ INFO] - enroll audio: 85236145389.wav, test audio: 123456789.wav - [2022-05-25 12:33:24,528] [ INFO] - endpoint: http://127.0.0.1:8790/paddlespeech/vector/score - [2022-05-25 12:33:24,695] [ INFO] - The vector score is: {'success': True, 'code': 200, 'message': {'description': 'success'}, 'result': {'score': 0.45332613587379456}} - [2022-05-25 12:33:24,696] [ INFO] - The vector: {'success': True, 'code': 200, 'message': {'description': 'success'}, 'result': {'score': 0.45332613587379456}} - [2022-05-25 12:33:24,696] [ INFO] - Response time 0.168271 s. + ```text + [2022-08-01 09:04:42,275] [ INFO] - vector score http client start + [2022-08-01 09:04:42,275] [ INFO] - enroll audio: 85236145389.wav, test audio: 123456789.wav + [2022-08-01 09:04:42,275] [ INFO] - endpoint: http://127.0.0.1:8090/paddlespeech/vector/score + [2022-08-01 09:04:44,611] [ INFO] - {'success': True, 'code': 200, 'message': {'description': 'success'}, 'result': {'score': 0.4292638897895813}} + [2022-08-01 09:04:44,611] [ INFO] - Response time 2.336258 s. 
``` * Python API ``` python from paddlespeech.server.bin.paddlespeech_client import VectorClientExecutor + import json vectorclient_executor = VectorClientExecutor() res = vectorclient_executor( @@ -352,17 +360,13 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee server_ip="127.0.0.1", port=8090, task="score") - print(res) + print(res.json()) ``` Output: - ``` bash - [2022-05-25 12:30:14,143] [ INFO] - vector score http client start - [2022-05-25 12:30:14,143] [ INFO] - enroll audio: 85236145389.wav, test audio: 123456789.wav - [2022-05-25 12:30:14,143] [ INFO] - endpoint: http://127.0.0.1:8790/paddlespeech/vector/score - [2022-05-25 12:30:14,363] [ INFO] - The vector score is: {'success': True, 'code': 200, 'message': {'description': 'success'}, 'result': {'score': 0.45332613587379456}} - {'success': True, 'code': 200, 'message': {'description': 'success'}, 'result': {'score': 0.45332613587379456}} + ```text + {'success': True, 'code': 200, 'message': {'description': 'success'}, 'result': {'score': 0.4292638897895813}} ``` ### 8. Punctuation prediction @@ -388,9 +392,9 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee - `input`(required): Input text to get punctuation. Output: - ```bash - [2022-05-09 18:19:04,397] [ INFO] - The punc text: 我认为跑步最重要的就是给我带来了身体健康。 - [2022-05-09 18:19:04,397] [ INFO] - Response time 0.092407 s. + ```text + [2022-05-09 18:19:04,397] [ INFO] - The punc text: 我认为跑步最重要的就是给我带来了身体健康。 + [2022-05-09 18:19:04,397] [ INFO] - Response time 0.092407 s. ``` - Python API @@ -403,15 +407,13 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee server_ip="127.0.0.1", port=8090,) print(res) - ``` Output: - ```bash + ```text 我认为跑步最重要的就是给我带来了身体健康。 ``` - ## Models supported by the service ### ASR model Get all models supported by the ASR service via `paddlespeech_server stats --task asr`, where static models can be used for paddle inference inference. diff --git a/demos/speech_server/README_cn.md b/demos/speech_server/README_cn.md index 29629b7e8..628468c83 100644 --- a/demos/speech_server/README_cn.md +++ b/demos/speech_server/README_cn.md @@ -3,31 +3,31 @@ # 语音服务 ## 介绍 -这个 demo 是一个启动离线语音服务和访问服务的实现。它可以通过使用`paddlespeech_server` 和 `paddlespeech_client`的单个命令或 python 的几行代码来实现。 +这个 demo 是一个启动离线语音服务和访问服务的实现。它可以通过使用 `paddlespeech_server` 和 `paddlespeech_client` 的单个命令或 python 的几行代码来实现。 + + +服务接口定义请参考: +- [PaddleSpeech Server RESTful API](https://github.com/PaddlePaddle/PaddleSpeech/wiki/PaddleSpeech-Server-RESTful-API) ## 使用方法 ### 1. 安装 请看 [安装文档](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install.md). -推荐使用 **paddlepaddle 2.2.2** 或以上版本。 -你可以从 medium,hard 两种方式中选择一种方式安装 PaddleSpeech。 +推荐使用 **paddlepaddle 2.3.1** 或以上版本。 + +你可以从简单,中等,困难 几种方式中选择一种方式安装 PaddleSpeech。 +**如果使用简单模式安装,需要自行准备 yaml 文件,可参考 conf 目录下的 yaml 文件。** ### 2. 
准备配置文件 配置文件可参见 `conf/application.yaml` 。 -其中,`engine_list`表示即将启动的服务将会包含的语音引擎,格式为 <语音任务>_<引擎类型>。 -目前服务集成的语音任务有: asr(语音识别)、tts(语音合成)、cls(音频分类)、vector(声纹识别)以及text(文本处理)。 -目前引擎类型支持两种形式:python 及 inference (Paddle Inference) -**注意:** 如果在容器里可正常启动服务,但客户端访问 ip 不可达,可尝试将配置文件中 `host` 地址换成本地 ip 地址。 +其中,`engine_list` 表示即将启动的服务将会包含的语音引擎,格式为 <语音任务>_<引擎类型>。 +目前服务集成的语音任务有: asr (语音识别)、tts (语音合成)、cls (音频分类)、vector (声纹识别)以及 text (文本处理)。 -ASR client 的输入是一个 WAV 文件(`.wav`),并且采样率必须与模型的采样率相同。 - -可以下载此 ASR client 的示例音频: -```bash -wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespeech.bj.bcebos.com/PaddleAudio/en.wav -``` +目前引擎类型支持两种形式:python 及 inference (Paddle Inference) +**注意:** 如果在容器里可正常启动服务,但客户端访问 ip 不可达,可尝试将配置文件中 `host` 地址换成本地 ip 地址。 ### 3. 服务端使用方法 - 命令行 (推荐使用) @@ -47,7 +47,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee - `log_file`: log 文件. 默认:./log/paddlespeech.log 输出: - ```bash + ```text [2022-02-23 11:17:32] [INFO] [server.py:64] Started server process [6384] INFO: Waiting for application startup. [2022-02-23 11:17:32] [INFO] [on.py:26] Waiting for application startup. @@ -55,7 +55,6 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee [2022-02-23 11:17:32] [INFO] [on.py:38] Application startup complete. INFO: Uvicorn running on http://0.0.0.0:8090 (Press CTRL+C to quit) [2022-02-23 11:17:32] [INFO] [server.py:204] Uvicorn running on http://0.0.0.0:8090 (Press CTRL+C to quit) - ``` - Python API @@ -69,7 +68,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee ``` 输出: - ```bash + ```text INFO: Started server process [529] [2022-02-23 14:57:56] [INFO] [server.py:64] Started server process [529] INFO: Waiting for application startup. @@ -78,10 +77,18 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee [2022-02-23 14:57:56] [INFO] [on.py:38] Application startup complete. INFO: Uvicorn running on http://0.0.0.0:8090 (Press CTRL+C to quit) [2022-02-23 14:57:56] [INFO] [server.py:204] Uvicorn running on http://0.0.0.0:8090 (Press CTRL+C to quit) - ``` ### 4. ASR 客户端使用方法 + +ASR 客户端的输入是一个 WAV 文件(`.wav`),并且采样率必须与模型的采样率相同。 + +可以下载 ASR 客户端的示例音频: +```bash +wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav +wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/en.wav +``` + **注意:** 初次使用客户端时响应时间会略长 - 命令行 (推荐使用) @@ -89,7 +96,6 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee ``` paddlespeech_client asr --server_ip 127.0.0.1 --port 8090 --input ./zh.wav - ``` 使用帮助: @@ -107,16 +113,14 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee - `audio_format`: 音频格式,默认值:wav。 输出: - - ```bash - [2022-02-23 18:11:22,819] [ INFO] - {'success': True, 'code': 200, 'message': {'description': 'success'}, 'result': {'transcription': '我认为跑步最重要的就是给我带来了身体健康'}} - [2022-02-23 18:11:22,820] [ INFO] - time cost 0.689145 s. + ```text + [2022-08-01 07:54:01,646] [ INFO] - ASR result: 我认为跑步最重要的就是给我带来了身体健康 + [2022-08-01 07:54:01,646] [ INFO] - Response time 4.898965 s. 
``` - Python API ```python from paddlespeech.server.bin.paddlespeech_client import ASRClientExecutor - import json asrclient_executor = ASRClientExecutor() res = asrclient_executor( @@ -126,13 +130,12 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee sample_rate=16000, lang="zh_cn", audio_format="wav") - print(res.json()) + print(res) ``` 输出: - ```bash - {'success': True, 'code': 200, 'message': {'description': 'success'}, 'result': {'transcription': '我认为跑步最重要的就是给我带来了身体健康'}} - + ```text + 我认为跑步最重要的就是给我带来了身体健康 ``` ### 5. TTS 客户端使用方法 @@ -161,8 +164,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee - `output`: 输出音频的路径, 默认值:None,表示不保存音频到本地。 输出: - ```bash - [2022-02-23 15:20:37,875] [ INFO] - {'description': 'success.'} + ```text [2022-02-23 15:20:37,875] [ INFO] - Save synthesized audio successfully on output.wav. [2022-02-23 15:20:37,875] [ INFO] - Audio duration: 3.612500 s. [2022-02-23 15:20:37,875] [ INFO] - Response time: 0.348050 s. @@ -191,22 +193,26 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee ``` 输出: - ```bash + ```text {'description': 'success.'} Save synthesized audio successfully on ./output.wav. Audio duration: 3.612500 s. - ``` ### 6. CLS 客户端使用方法 +可以下载 CLS 客户端的示例音频: +```bash +wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav +``` + **注意:** 初次使用客户端时响应时间会略长 - 命令行 (推荐使用) 若 `127.0.0.1` 不能访问,则需要使用实际服务 IP 地址 - ``` + ```bash paddlespeech_client cls --server_ip 127.0.0.1 --port 8090 --input ./zh.wav ``` @@ -222,11 +228,9 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee - `topk`: 分类结果的topk。 输出: - ```bash + ```text [2022-03-09 20:44:39,974] [ INFO] - {'success': True, 'code': 200, 'message': {'description': 'success'}, 'result': {'topk': 1, 'results': [{'class_name': 'Speech', 'prob': 0.9027184844017029}]}} [2022-03-09 20:44:39,975] [ INFO] - Response time 0.104360 s. - - ``` - Python API @@ -241,24 +245,28 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee port=8090, topk=1) print(res.json()) - ``` 输出: - ```bash + ```text {'success': True, 'code': 200, 'message': {'description': 'success'}, 'result': {'topk': 1, 'results': [{'class_name': 'Speech', 'prob': 0.9027184844017029}]}} - ``` ### 7. 
声纹客户端使用方法 +可以下载声纹客户端的示例音频: +```bash +wget -c https://paddlespeech.bj.bcebos.com/vector/audio/85236145389.wav +wget -c https://paddlespeech.bj.bcebos.com/vector/audio/123456789.wav +``` + #### 7.1 提取声纹特征 -注意: 初次使用客户端时响应时间会略长 +**注意:** 初次使用客户端时响应时间会略长 * 命令行 (推荐使用) 若 `127.0.0.1` 不能访问,则需要使用实际服务 IP 地址 - ``` bash + ```bash paddlespeech_client vector --task spk --server_ip 127.0.0.1 --port 8090 --input 85236145389.wav ``` @@ -274,21 +282,21 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee * task: vector 的任务,可选spk或者score。默认是 spk。 * enroll: 注册音频;。 * test: 测试音频。 - 输出: - ``` bash - [2022-05-25 12:25:36,165] [ INFO] - vector http client start - [2022-05-25 12:25:36,165] [ INFO] - the input audio: 85236145389.wav - [2022-05-25 12:25:36,165] [ INFO] - endpoint: http://127.0.0.1:8790/paddlespeech/vector - [2022-05-25 12:25:36,166] [ INFO] - http://127.0.0.1:8790/paddlespeech/vector - [2022-05-25 12:25:36,324] [ INFO] - The vector: {'success': True, 'code': 200, 'message': {'description': 'success'}, 'result': {'vec': [-1.3251205682754517, 7.860682487487793, -4.620625972747803, 0.3000721037387848, 2.2648534774780273, -1.1931440830230713, 3.064713716506958, 7.673594951629639, -6.004472732543945, -12.024259567260742, -1.9496068954467773, 3.126953601837158, 1.6188379526138306, -7.638310432434082, -1.2299772500991821, -12.33833122253418, 2.1373026371002197, -5.395712375640869, 9.717328071594238, 5.675230503082275, 3.7805123329162598, 3.0597171783447266, 3.429692029953003, 8.9760103225708, 13.174124717712402, -0.5313228368759155, 8.942471504211426, 4.465109825134277, -4.426247596740723, -9.726503372192383, 8.399328231811523, 7.223917484283447, -7.435853958129883, 2.9441683292388916, -4.343039512634277, -13.886964797973633, -1.6346734762191772, -10.902740478515625, -5.311244964599609, 3.800722122192383, 3.897603750228882, -2.123077392578125, -2.3521194458007812, 4.151031017303467, -7.404866695404053, 0.13911646604537964, 2.4626107215881348, 4.96645450592041, 0.9897574186325073, 5.483975410461426, -3.3574001789093018, 10.13400650024414, -0.6120170950889587, -10.403095245361328, 4.600754261016846, 16.009349822998047, -7.78369140625, -4.194530487060547, -6.93686056137085, 1.1789555549621582, 11.490800857543945, 4.23802375793457, 9.550930976867676, 8.375045776367188, 7.508914470672607, -0.6570729613304138, -0.3005157709121704, 2.8406054973602295, 3.0828027725219727, 0.7308170199394226, 6.1483540534973145, 0.1376611888408661, -13.424735069274902, -7.746140480041504, -2.322798252105713, -8.305252075195312, 2.98791241645813, -10.99522876739502, 0.15211068093776703, -2.3820347785949707, -1.7984174489974976, 8.49562931060791, -5.852236747741699, -3.755497932434082, 0.6989710927009583, -5.270299434661865, -2.6188621520996094, -1.8828465938568115, -4.6466498374938965, 14.078543663024902, -0.5495333075523376, 10.579157829284668, -3.216050148010254, 9.349003791809082, -4.381077766418457, -11.675816535949707, -2.863020658493042, 4.5721755027771, 2.246612071990967, -4.574341773986816, 1.8610187768936157, 2.3767874240875244, 5.625787734985352, -9.784077644348145, 0.6496725678443909, -1.457950472831726, 0.4263263940811157, -4.921126365661621, -2.4547839164733887, 3.4869801998138428, -0.4265422224998474, 8.341268539428711, 1.356552004814148, 7.096688270568848, -13.102828979492188, 8.01673412322998, -7.115934371948242, 1.8699780702590942, 0.20872099697589874, 14.699383735656738, -1.0252779722213745, -2.6107232570648193, -2.5082311630249023, 8.427192687988281, 6.913852691650391, -6.29124641418457, 
0.6157366037368774, 2.489687919616699, -3.4668266773223877, 9.92176342010498, 11.200815200805664, -0.19664029777050018, 7.491600513458252, -0.6231271624565125, -0.2584814429283142, -9.947997093200684, -0.9611040949821472, 1.1649218797683716, -2.1907122135162354, -1.502848744392395, -0.5192610621452332, 15.165953636169434, 2.4649462699890137, -0.998044490814209, 7.44166374206543, -2.0768048763275146, 3.5896823406219482, -7.305543422698975, -7.562084674835205, 4.32333517074585, 0.08044180274009705, -6.564010143280029, -2.314805269241333, -1.7642345428466797, -2.470881700515747, -7.6756181716918945, -9.548877716064453, -1.017755389213562, 0.1698644608259201, 2.5877134799957275, -1.8752295970916748, -0.36614322662353516, -6.049378395080566, -2.3965611457824707, -5.945338726043701, 0.9424033164978027, -13.155974388122559, -7.45780086517334, 0.14658108353614807, -3.7427968978881836, 5.841492652893066, -1.2872905731201172, 5.569431304931641, 12.570590019226074, 1.0939218997955322, 2.2142086029052734, 1.9181575775146484, 6.991420745849609, -5.888138771057129, 3.1409823894500732, -2.0036280155181885, 2.4434285163879395, 9.973138809204102, 5.036680221557617, 2.005120277404785, 2.861560344696045, 5.860223770141602, 2.917618751525879, -1.63111412525177, 2.0292205810546875, -4.070415019989014, -6.831437110900879]}} - [2022-05-25 12:25:36,324] [ INFO] - Response time 0.159053 s. + 输出: + ```text + [2022-08-01 09:01:22,151] [ INFO] - vector http client start + [2022-08-01 09:01:22,152] [ INFO] - the input audio: 85236145389.wav + [2022-08-01 09:01:22,152] [ INFO] - endpoint: http://127.0.0.1:8090/paddlespeech/vector + [2022-08-01 09:01:27,093] [ INFO] - {'success': True, 'code': 200, 'message': {'description': 'success'}, 'result': {'vec': [1.4217487573623657, 5.626248836517334, -5.342073440551758, 1.177390217781067, 3.308061122894287, 1.7565997838974, 5.1678876876831055, 10.806346893310547, -3.822679042816162, -5.614130973815918, 2.6238481998443604, -0.8072965741157532, 1.963512659072876, -7.312864780426025, 0.011034967377781868, -9.723127365112305, 0.661963164806366, -6.976816654205322, 10.213465690612793, 7.494767189025879, 2.9105641841888428, 3.894925117492676, 3.7999846935272217, 7.106173992156982, 16.905324935913086, -7.149376392364502, 8.733112335205078, 3.423002004623413, -4.831653118133545, -11.403371810913086, 11.232216835021973, 7.127464771270752, -4.282831192016602, 2.4523589611053467, -5.13075065612793, -18.17765998840332, -2.611666440963745, -11.00034236907959, -6.731431007385254, 1.6564655303955078, 0.7618184685707092, 1.1253058910369873, -2.0838277339935303, 4.725739002227783, -8.782590866088867, -3.5398736000061035, 3.8142387866973877, 5.142062664031982, 2.162053346633911, 4.09642219543457, -6.416221618652344, 12.747454643249512, 1.9429889917373657, -15.152948379516602, 6.417416572570801, 16.097013473510742, -9.716649055480957, -1.9920448064804077, -3.364956855773926, -1.8719490766525269, 11.567351341247559, 3.6978795528411865, 11.258269309997559, 7.442364692687988, 9.183405876159668, 4.528151512145996, -1.2417811155319214, 4.395910263061523, 6.672768592834473, 5.889888763427734, 7.627115249633789, -0.6692016124725342, -11.889703750610352, -9.208883285522461, -7.427401542663574, -3.777655601501465, 6.917237758636475, -9.848749160766602, -2.094479560852051, -5.1351189613342285, 0.49564215540885925, 9.317541122436523, -5.9141845703125, -1.809845209121704, -0.11738205701112747, -7.169270992279053, -1.0578246116638184, -5.721685886383057, -5.117387294769287, 16.137670516967773, 
-4.473618984222412, 7.66243314743042, -0.5538089871406555, 9.631582260131836, -6.470466613769531, -8.54850959777832, 4.371622085571289, -0.7970349192619324, 4.479003429412842, -2.9758646488189697, 3.2721707820892334, 2.8382749557495117, 5.1345953941345215, -9.19078254699707, -0.5657423138618469, -4.874573230743408, 2.316561460494995, -5.984307289123535, -2.1798791885375977, 0.35541653633117676, -0.3178458511829376, 9.493547439575195, 2.114448070526123, 4.358088493347168, -12.089820861816406, 8.451695442199707, -7.925461769104004, 4.624246120452881, 4.428938388824463, 18.691999435424805, -2.620460033416748, -5.149182319641113, -0.3582168221473694, 8.488557815551758, 4.98148250579834, -9.326834678649902, -2.2544236183166504, 6.64176607131958, 1.2119656801223755, 10.977132797241211, 16.55504035949707, 3.323848247528076, 9.55185317993164, -1.6677050590515137, -0.7953923940658569, -8.605660438537598, -0.4735637903213501, 2.6741855144500732, -5.359188079833984, -2.6673784255981445, 0.6660736799240112, 15.443212509155273, 4.740597724914551, -3.4725306034088135, 11.592561721801758, -2.05450701713562, 1.7361239194869995, -8.26533031463623, -9.304476737976074, 5.406835079193115, -1.5180232524871826, -7.746610641479492, -6.089605331420898, 0.07112561166286469, -0.34904858469963074, -8.649889945983887, -9.998958587646484, -2.5648481845855713, -0.5399898886680603, 2.6018145084381104, -0.31927648186683655, -1.8815231323242188, -2.0721378326416016, -3.4105639457702637, -8.299802780151367, 1.4836379289627075, -15.366002082824707, -8.288193702697754, 3.884773015975952, -3.4876506328582764, 7.362995624542236, 0.4657321572303772, 3.1326000690460205, 12.438883781433105, -1.8337029218673706, 4.532927513122559, 2.726433277130127, 10.145345687866211, -6.521956920623779, 2.8971481323242188, -3.3925881385803223, 5.079156398773193, 7.759725093841553, 4.677562236785889, 5.8457818031311035, 2.4023921489715576, 7.707108974456787, 3.9711389541625977, -6.390035152435303, 6.126871109008789, -3.776031017303467, -11.118141174316406]}} + [2022-08-01 09:01:27,094] [ INFO] - Response time 4.941739 s. 
``` * Python API ``` python from paddlespeech.server.bin.paddlespeech_client import VectorClientExecutor + import json vectorclient_executor = VectorClientExecutor() res = vectorclient_executor( @@ -296,18 +304,17 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee server_ip="127.0.0.1", port=8090, task="spk") - print(res) + print(res.json()) ``` 输出: - - ``` bash - {'success': True, 'code': 200, 'message': {'description': 'success'}, 'result': {'vec': [-1.3251205682754517, 7.860682487487793, -4.620625972747803, 0.3000721037387848, 2.2648534774780273, -1.1931440830230713, 3.064713716506958, 7.673594951629639, -6.004472732543945, -12.024259567260742, -1.9496068954467773, 3.126953601837158, 1.6188379526138306, -7.638310432434082, -1.2299772500991821, -12.33833122253418, 2.1373026371002197, -5.395712375640869, 9.717328071594238, 5.675230503082275, 3.7805123329162598, 3.0597171783447266, 3.429692029953003, 8.9760103225708, 13.174124717712402, -0.5313228368759155, 8.942471504211426, 4.465109825134277, -4.426247596740723, -9.726503372192383, 8.399328231811523, 7.223917484283447, -7.435853958129883, 2.9441683292388916, -4.343039512634277, -13.886964797973633, -1.6346734762191772, -10.902740478515625, -5.311244964599609, 3.800722122192383, 3.897603750228882, -2.123077392578125, -2.3521194458007812, 4.151031017303467, -7.404866695404053, 0.13911646604537964, 2.4626107215881348, 4.96645450592041, 0.9897574186325073, 5.483975410461426, -3.3574001789093018, 10.13400650024414, -0.6120170950889587, -10.403095245361328, 4.600754261016846, 16.009349822998047, -7.78369140625, -4.194530487060547, -6.93686056137085, 1.1789555549621582, 11.490800857543945, 4.23802375793457, 9.550930976867676, 8.375045776367188, 7.508914470672607, -0.6570729613304138, -0.3005157709121704, 2.8406054973602295, 3.0828027725219727, 0.7308170199394226, 6.1483540534973145, 0.1376611888408661, -13.424735069274902, -7.746140480041504, -2.322798252105713, -8.305252075195312, 2.98791241645813, -10.99522876739502, 0.15211068093776703, -2.3820347785949707, -1.7984174489974976, 8.49562931060791, -5.852236747741699, -3.755497932434082, 0.6989710927009583, -5.270299434661865, -2.6188621520996094, -1.8828465938568115, -4.6466498374938965, 14.078543663024902, -0.5495333075523376, 10.579157829284668, -3.216050148010254, 9.349003791809082, -4.381077766418457, -11.675816535949707, -2.863020658493042, 4.5721755027771, 2.246612071990967, -4.574341773986816, 1.8610187768936157, 2.3767874240875244, 5.625787734985352, -9.784077644348145, 0.6496725678443909, -1.457950472831726, 0.4263263940811157, -4.921126365661621, -2.4547839164733887, 3.4869801998138428, -0.4265422224998474, 8.341268539428711, 1.356552004814148, 7.096688270568848, -13.102828979492188, 8.01673412322998, -7.115934371948242, 1.8699780702590942, 0.20872099697589874, 14.699383735656738, -1.0252779722213745, -2.6107232570648193, -2.5082311630249023, 8.427192687988281, 6.913852691650391, -6.29124641418457, 0.6157366037368774, 2.489687919616699, -3.4668266773223877, 9.92176342010498, 11.200815200805664, -0.19664029777050018, 7.491600513458252, -0.6231271624565125, -0.2584814429283142, -9.947997093200684, -0.9611040949821472, 1.1649218797683716, -2.1907122135162354, -1.502848744392395, -0.5192610621452332, 15.165953636169434, 2.4649462699890137, -0.998044490814209, 7.44166374206543, -2.0768048763275146, 3.5896823406219482, -7.305543422698975, -7.562084674835205, 4.32333517074585, 0.08044180274009705, -6.564010143280029, -2.314805269241333, -1.7642345428466797, 
-2.470881700515747, -7.6756181716918945, -9.548877716064453, -1.017755389213562, 0.1698644608259201, 2.5877134799957275, -1.8752295970916748, -0.36614322662353516, -6.049378395080566, -2.3965611457824707, -5.945338726043701, 0.9424033164978027, -13.155974388122559, -7.45780086517334, 0.14658108353614807, -3.7427968978881836, 5.841492652893066, -1.2872905731201172, 5.569431304931641, 12.570590019226074, 1.0939218997955322, 2.2142086029052734, 1.9181575775146484, 6.991420745849609, -5.888138771057129, 3.1409823894500732, -2.0036280155181885, 2.4434285163879395, 9.973138809204102, 5.036680221557617, 2.005120277404785, 2.861560344696045, 5.860223770141602, 2.917618751525879, -1.63111412525177, 2.0292205810546875, -4.070415019989014, -6.831437110900879]}} + ```text + {'success': True, 'code': 200, 'message': {'description': 'success'}, 'result': {'vec': [1.4217487573623657, 5.626248836517334, -5.342073440551758, 1.177390217781067, 3.308061122894287, 1.7565997838974, 5.1678876876831055, 10.806346893310547, -3.822679042816162, -5.614130973815918, 2.6238481998443604, -0.8072965741157532, 1.963512659072876, -7.312864780426025, 0.011034967377781868, -9.723127365112305, 0.661963164806366, -6.976816654205322, 10.213465690612793, 7.494767189025879, 2.9105641841888428, 3.894925117492676, 3.7999846935272217, 7.106173992156982, 16.905324935913086, -7.149376392364502, 8.733112335205078, 3.423002004623413, -4.831653118133545, -11.403371810913086, 11.232216835021973, 7.127464771270752, -4.282831192016602, 2.4523589611053467, -5.13075065612793, -18.17765998840332, -2.611666440963745, -11.00034236907959, -6.731431007385254, 1.6564655303955078, 0.7618184685707092, 1.1253058910369873, -2.0838277339935303, 4.725739002227783, -8.782590866088867, -3.5398736000061035, 3.8142387866973877, 5.142062664031982, 2.162053346633911, 4.09642219543457, -6.416221618652344, 12.747454643249512, 1.9429889917373657, -15.152948379516602, 6.417416572570801, 16.097013473510742, -9.716649055480957, -1.9920448064804077, -3.364956855773926, -1.8719490766525269, 11.567351341247559, 3.6978795528411865, 11.258269309997559, 7.442364692687988, 9.183405876159668, 4.528151512145996, -1.2417811155319214, 4.395910263061523, 6.672768592834473, 5.889888763427734, 7.627115249633789, -0.6692016124725342, -11.889703750610352, -9.208883285522461, -7.427401542663574, -3.777655601501465, 6.917237758636475, -9.848749160766602, -2.094479560852051, -5.1351189613342285, 0.49564215540885925, 9.317541122436523, -5.9141845703125, -1.809845209121704, -0.11738205701112747, -7.169270992279053, -1.0578246116638184, -5.721685886383057, -5.117387294769287, 16.137670516967773, -4.473618984222412, 7.66243314743042, -0.5538089871406555, 9.631582260131836, -6.470466613769531, -8.54850959777832, 4.371622085571289, -0.7970349192619324, 4.479003429412842, -2.9758646488189697, 3.2721707820892334, 2.8382749557495117, 5.1345953941345215, -9.19078254699707, -0.5657423138618469, -4.874573230743408, 2.316561460494995, -5.984307289123535, -2.1798791885375977, 0.35541653633117676, -0.3178458511829376, 9.493547439575195, 2.114448070526123, 4.358088493347168, -12.089820861816406, 8.451695442199707, -7.925461769104004, 4.624246120452881, 4.428938388824463, 18.691999435424805, -2.620460033416748, -5.149182319641113, -0.3582168221473694, 8.488557815551758, 4.98148250579834, -9.326834678649902, -2.2544236183166504, 6.64176607131958, 1.2119656801223755, 10.977132797241211, 16.55504035949707, 3.323848247528076, 9.55185317993164, -1.6677050590515137, -0.7953923940658569, 
-8.605660438537598, -0.4735637903213501, 2.6741855144500732, -5.359188079833984, -2.6673784255981445, 0.6660736799240112, 15.443212509155273, 4.740597724914551, -3.4725306034088135, 11.592561721801758, -2.05450701713562, 1.7361239194869995, -8.26533031463623, -9.304476737976074, 5.406835079193115, -1.5180232524871826, -7.746610641479492, -6.089605331420898, 0.07112561166286469, -0.34904858469963074, -8.649889945983887, -9.998958587646484, -2.5648481845855713, -0.5399898886680603, 2.6018145084381104, -0.31927648186683655, -1.8815231323242188, -2.0721378326416016, -3.4105639457702637, -8.299802780151367, 1.4836379289627075, -15.366002082824707, -8.288193702697754, 3.884773015975952, -3.4876506328582764, 7.362995624542236, 0.4657321572303772, 3.1326000690460205, 12.438883781433105, -1.8337029218673706, 4.532927513122559, 2.726433277130127, 10.145345687866211, -6.521956920623779, 2.8971481323242188, -3.3925881385803223, 5.079156398773193, 7.759725093841553, 4.677562236785889, 5.8457818031311035, 2.4023921489715576, 7.707108974456787, 3.9711389541625977, -6.390035152435303, 6.126871109008789, -3.776031017303467, -11.118141174316406]}} ``` #### 7.2 音频声纹打分 -注意: 初次使用客户端时响应时间会略长 +**注意:** 初次使用客户端时响应时间会略长 * 命令行 (推荐使用) 若 `127.0.0.1` 不能访问,则需要使用实际服务 IP 地址 @@ -331,20 +338,19 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee * test: 测试音频。 输出: - - ``` bash - [2022-05-25 12:33:24,527] [ INFO] - vector score http client start - [2022-05-25 12:33:24,527] [ INFO] - enroll audio: 85236145389.wav, test audio: 123456789.wav - [2022-05-25 12:33:24,528] [ INFO] - endpoint: http://127.0.0.1:8790/paddlespeech/vector/score - [2022-05-25 12:33:24,695] [ INFO] - The vector score is: {'success': True, 'code': 200, 'message': {'description': 'success'}, 'result': {'score': 0.45332613587379456}} - [2022-05-25 12:33:24,696] [ INFO] - The vector: {'success': True, 'code': 200, 'message': {'description': 'success'}, 'result': {'score': 0.45332613587379456}} - [2022-05-25 12:33:24,696] [ INFO] - Response time 0.168271 s. + ```text + [2022-08-01 09:04:42,275] [ INFO] - vector score http client start + [2022-08-01 09:04:42,275] [ INFO] - enroll audio: 85236145389.wav, test audio: 123456789.wav + [2022-08-01 09:04:42,275] [ INFO] - endpoint: http://127.0.0.1:8090/paddlespeech/vector/score + [2022-08-01 09:04:44,611] [ INFO] - {'success': True, 'code': 200, 'message': {'description': 'success'}, 'result': {'score': 0.4292638897895813}} + [2022-08-01 09:04:44,611] [ INFO] - Response time 2.336258 s. 
``` * Python API - ``` python + ```python from paddlespeech.server.bin.paddlespeech_client import VectorClientExecutor + import json vectorclient_executor = VectorClientExecutor() res = vectorclient_executor( @@ -354,20 +360,14 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee server_ip="127.0.0.1", port=8090, task="score") - print(res) + print(res.json()) ``` 输出: - - ``` bash - [2022-05-25 12:30:14,143] [ INFO] - vector score http client start - [2022-05-25 12:30:14,143] [ INFO] - enroll audio: 85236145389.wav, test audio: 123456789.wav - [2022-05-25 12:30:14,143] [ INFO] - endpoint: http://127.0.0.1:8790/paddlespeech/vector/score - [2022-05-25 12:30:14,363] [ INFO] - The vector score is: {'success': True, 'code': 200, 'message': {'description': 'success'}, 'result': {'score': 0.45332613587379456}} - {'success': True, 'code': 200, 'message': {'description': 'success'}, 'result': {'score': 0.45332613587379456}} + ```text + {'success': True, 'code': 200, 'message': {'description': 'success'}, 'result': {'score': 0.4292638897895813}} ``` - ### 8. 标点预测 **注意:** 初次使用客户端时响应时间会略长 @@ -390,9 +390,9 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee - `input`(必须输入): 用于标点预测的文本内容。 输出: - ```bash - [2022-05-09 18:19:04,397] [ INFO] - The punc text: 我认为跑步最重要的就是给我带来了身体健康。 - [2022-05-09 18:19:04,397] [ INFO] - Response time 0.092407 s. + ```text + [2022-05-09 18:19:04,397] [ INFO] - The punc text: 我认为跑步最重要的就是给我带来了身体健康。 + [2022-05-09 18:19:04,397] [ INFO] - Response time 0.092407 s. ``` - Python API @@ -405,11 +405,10 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee server_ip="127.0.0.1", port=8090,) print(res) - ``` 输出: - ```bash + ```text 我认为跑步最重要的就是给我带来了身体健康。 ``` @@ -418,10 +417,10 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee 通过 `paddlespeech_server stats --task asr` 获取 ASR 服务支持的所有模型,其中静态模型可用于 paddle inference 推理。 ### TTS 支持的模型 -通过 `paddlespeech_server stats --task tts` 获取 TTS 服务支持的所有模型,其中静态模型可用于 paddle inference 推理。 +通过 `paddlespeech_server stats --task tts` 获取 TTS 服务支持的所有模型,其中静态模型可用于 paddle inference 推理。 ### CLS 支持的模型 -通过 `paddlespeech_server stats --task cls` 获取 CLS 服务支持的所有模型,其中静态模型可用于 paddle inference 推理。 +通过 `paddlespeech_server stats --task cls` 获取 CLS 服务支持的所有模型,其中静态模型可用于 paddle inference 推理。 ### Vector 支持的模型 通过 `paddlespeech_server stats --task vector` 获取 Vector 服务支持的所有模型。 diff --git a/demos/speech_server/asr_client.sh b/demos/speech_server/asr_client.sh old mode 100644 new mode 100755 diff --git a/demos/speech_server/cls_client.sh b/demos/speech_server/cls_client.sh old mode 100644 new mode 100755 diff --git a/demos/speech_server/conf/application.yaml b/demos/speech_server/conf/application.yaml index c6588ce80..9c171c470 100644 --- a/demos/speech_server/conf/application.yaml +++ b/demos/speech_server/conf/application.yaml @@ -7,7 +7,7 @@ host: 0.0.0.0 port: 8090 # The task format in the engin_list is: _ -# task choices = ['asr_python', 'asr_inference', 'tts_python', 'tts_inference', 'cls_python', 'cls_inference'] +# task choices = ['asr_python', 'asr_inference', 'tts_python', 'tts_inference', 'cls_python', 'cls_inference', 'text_python', 'vector_python'] protocol: 'http' engine_list: ['asr_python', 'tts_python', 'cls_python', 'text_python', 'vector_python'] @@ -28,7 +28,6 @@ asr_python: force_yes: True device: # set 'gpu:id' or 'cpu' - ################### speech task: asr; engine_type: inference ####################### asr_inference: # model_type 
choices=['deepspeech2offline_aishell'] @@ -50,10 +49,11 @@ asr_inference: ################################### TTS ######################################### ################### speech task: tts; engine_type: python ####################### -tts_python: - # am (acoustic model) choices=['speedyspeech_csmsc', 'fastspeech2_csmsc', - # 'fastspeech2_ljspeech', 'fastspeech2_aishell3', - # 'fastspeech2_vctk'] +tts_python: + # am (acoustic model) choices=['speedyspeech_csmsc', 'fastspeech2_csmsc', + # 'fastspeech2_ljspeech', 'fastspeech2_aishell3', + # 'fastspeech2_vctk', 'fastspeech2_mix', + # 'tacotron2_csmsc', 'tacotron2_ljspeech'] am: 'fastspeech2_csmsc' am_config: am_ckpt: @@ -64,8 +64,10 @@ tts_python: spk_id: 0 # voc (vocoder) choices=['pwgan_csmsc', 'pwgan_ljspeech', 'pwgan_aishell3', - # 'pwgan_vctk', 'mb_melgan_csmsc'] - voc: 'pwgan_csmsc' + # 'pwgan_vctk', 'mb_melgan_csmsc', 'style_melgan_csmsc', + # 'hifigan_csmsc', 'hifigan_ljspeech', 'hifigan_aishell3', + # 'hifigan_vctk', 'wavernn_csmsc'] + voc: 'mb_melgan_csmsc' voc_config: voc_ckpt: voc_stat: @@ -94,7 +96,7 @@ tts_inference: summary: True # False -> do not show predictor config # voc (vocoder) choices=['pwgan_csmsc', 'mb_melgan_csmsc','hifigan_csmsc'] - voc: 'pwgan_csmsc' + voc: 'mb_melgan_csmsc' voc_model: # the pdmodel file of your vocoder static model (XX.pdmodel) voc_params: # the pdiparams file of your vocoder static model (XX.pdipparams) voc_sample_rate: 24000 diff --git a/demos/speech_server/server.sh b/demos/speech_server/server.sh old mode 100644 new mode 100755 index e5961286b..fd719ffc1 --- a/demos/speech_server/server.sh +++ b/demos/speech_server/server.sh @@ -1,3 +1,3 @@ #!/bin/bash -paddlespeech_server start --config_file ./conf/application.yaml +paddlespeech_server start --config_file ./conf/application.yaml &> server.log & diff --git a/demos/speech_server/sid_client.sh b/demos/speech_server/sid_client.sh new file mode 100755 index 000000000..99bab21ae --- /dev/null +++ b/demos/speech_server/sid_client.sh @@ -0,0 +1,10 @@ +#!/bin/bash + +wget -c https://paddlespeech.bj.bcebos.com/vector/audio/85236145389.wav +wget -c https://paddlespeech.bj.bcebos.com/vector/audio/123456789.wav + +# sid extract +paddlespeech_client vector --server_ip 127.0.0.1 --port 8090 --task spk --input ./85236145389.wav + +# sid score +paddlespeech_client vector --server_ip 127.0.0.1 --port 8090 --task score --enroll ./85236145389.wav --test ./123456789.wav diff --git a/demos/speech_server/text_client.sh b/demos/speech_server/text_client.sh new file mode 100755 index 000000000..098f159fb --- /dev/null +++ b/demos/speech_server/text_client.sh @@ -0,0 +1,4 @@ +#!/bin/bash + + +paddlespeech_client text --server_ip 127.0.0.1 --port 8090 --input 今天的天气真好啊你下午有空吗我想约你一起去吃饭 diff --git a/demos/speech_server/tts_client.sh b/demos/speech_server/tts_client.sh old mode 100644 new mode 100755 diff --git a/demos/speech_web/.gitignore b/demos/speech_web/.gitignore new file mode 100644 index 000000000..54418e605 --- /dev/null +++ b/demos/speech_web/.gitignore @@ -0,0 +1,16 @@ +*/.vscode/* +*.wav +*/resource/* +.Ds* +*.pyc +*.pcm +*.npy +*.diff +*.sqlite +*/static/* +*.pdparams +*.pdiparams* +*.pdmodel +*/source/* +*/PaddleSpeech/* + diff --git a/demos/speech_web/API.md b/demos/speech_web/API.md new file mode 100644 index 000000000..c51446749 --- /dev/null +++ b/demos/speech_web/API.md @@ -0,0 +1,404 @@ +# 接口文档 + +开启服务后可参照: + +http://0.0.0.0:8010/docs + +## ASR + +### 【POST】/asr/offline + +说明:上传 16k, 16bit wav 文件,返回 offline 语音识别模型识别结果 + +返回: JSON + +前端接口: 
ASR-端到端识别,音频文件识别;语音指令-录音上传 + +示例: + +```json +{ + "code": 0, + "result": "你也喜欢这个天气吗", + "message": "ok" +} +``` + +### 【POST】/asr/offlinefile + +说明:上传16k,16bit wav文件,返回 offline 语音识别模型识别结果 + wav 数据的 base64 + +返回: JSON + +前端接口: 音频文件识别(播放这段base64还原后记得添加 wav 头,采样率 16k, int16,添加后才能播放) + +示例: + +```json +{ + "code": 0, + "result": { + "asr_result": "今天天气真好", + "wav_base64": "///+//3//f/8/////v/////////////////+/wAA//8AAAEAAQACAAIAAQABAP" + }, + "message": "ok" +} +``` + + +### 【POST】/asr/collectEnv + +说明: 通过采集环境噪音,上传 16k, int16 wav 文件,来生成后台 VAD 的能量阈值, 返回阈值结果 + +前端接口:ASR-环境采样 + +返回: JSON + +```json +{ + "code": 0, + "result": 3624.93505859375, + "message": "采集环境噪音成功" +} +``` + +### 【GET】/asr/stopRecord + +说明:通过 GET 请求 /asr/stopRecord, 后台停止接收 offlineStream 中通过 WS 协议 上传的数据 + +前端接口:语音聊天-暂停录音(获取 NLP,播放 TTS 时暂停) + +返回: JSON + +```JSON +{ + "code": 0, + "result": null, + "message": "停止成功" +} +``` + +### 【GET】/asr/resumeRecord + +说明:通过 GET 请求 /asr/resumeRecord, 后台停止接收 offlineStream 中通过 WS 协议 上传的数据 + +前端接口:语音聊天-恢复录音( TTS 播放完毕时,告诉后台恢复录音) + +返回: JSON + +```JSON +{ + "code": 0, + "result": null, + "message": "Online录音恢复" +} +``` + +### 【Websocket】/ws/asr/offlineStream + +说明:通过 WS 协议,将前端音频持续上传到后台,前端采集 16k,Int16 类型的PCM片段,持续上传到后端 + +前端接口:语音聊天-开始录音,持续将麦克风语音传给后端,后端推送语音识别结果 + +返回:后端返回识别结果,offline 模型识别结果, 由WS推送 + + +### 【Websocket】/ws/asr/onlineStream + +说明:通过 WS 协议,将前端音频持续上传到后台,前端采集 16k,Int16 类型的 PCM 片段,持续上传到后端 + +前端接口:ASR-流式识别开始录音,持续将麦克风语音传给后端,后端推送语音识别结果 + +返回:后端返回识别结果,online 模型识别结果, 由 WS 推送 + +## NLP + +### 【POST】/nlp/chat + +说明:返回闲聊对话的结果 + +前端接口:语音聊天-获取到ASR识别结果后,向后端获取闲聊文本 + +上传示例: + +```json +{ + "chat": "天气非常棒" +} +``` + +返回示例: + +```json +{ + "code": 0, + "result": "是的,我也挺喜欢的", + "message": "ok" +} +``` + + +### 【POST】/nlp/ie + +说明:返回信息抽取结果 + +前端接口:语音指令-向后端获取信息抽取结果 + +上传示例: + +```json +{ + "chat": "今天我从马来西亚出发去香港花了五十万元" +} +``` + +返回示例: + +```json +{ + "code": 0, + "result": [ + { + "时间": [ + { + "text": "今天", + "start": 0, + "end": 2, + "probability": 0.9817976247505698 + } + ], + "出发地": [ + { + "text": "马来西亚", + "start": 4, + "end": 8, + "probability": 0.974892389414169 + } + ], + "目的地": [ + { + "text": "马来西亚", + "start": 4, + "end": 8, + "probability": 0.7347504438136951 + } + ], + "费用": [ + { + "text": "五十万元", + "start": 15, + "end": 19, + "probability": 0.9679076530644402 + } + ] + } + ], + "message": "ok" +} +``` + + +## TTS + +### 【POST】/tts/offline + +说明:获取 TTS 离线模型音频 + +前端接口:TTS-端到端合成 + +上传示例: + +```json +{ + "text": "天气非常棒" +} +``` + +返回示例:对应音频对应的 base64 编码 + +```json +{ + "code": 0, + "result": "UklGRrzQAABXQVZFZm10IBAAAAABAAEAwF0AAIC7AAACABAAZGF0YZjQAAADAP7/BAADAAAA...", + "message": "ok" +} +``` + +### 【POST】/tts/online + +说明:流式获取语音合成音频 + +前端接口:流式合成 + +上传示例: +```json +{ + "text": "天气非常棒" +} + +``` + +返回示例: + +二进制PCM片段,16k Int 16类型 + +## VPR + +### 【POST】/vpr/enroll + +说明:声纹注册,通过表单上传 spk_id(字符串,非空), 与 audio (文件) + +前端接口:声纹识别-声纹注册 + +上传示例: + +```text +curl -X 'POST' \ + 'http://0.0.0.0:8010/vpr/enroll' \ + -H 'accept: application/json' \ + -H 'Content-Type: multipart/form-data' \ + -F 'spk_id=啦啦啦啦' \ + -F 'audio=@demo_16k.wav;type=audio/wav' +``` + +返回示例: + +```json +{ + "status": true, + "msg": "Successfully enroll data!" 
+} +``` + +### 【POST】/vpr/recog + +说明:声纹识别,识别文件,提取文件的声纹信息做比对 音频 16k, int 16 wav 格式 + +前端接口:声纹识别-上传音频,返回声纹识别结果 + +上传示例: + +```shell +curl -X 'POST' \ + 'http://0.0.0.0:8010/vpr/recog' \ + -H 'accept: application/json' \ + -H 'Content-Type: multipart/form-data' \ + -F 'audio=@demo_16k.wav;type=audio/wav' +``` + +返回示例: + +```json +[ + [ + "啦啦啦啦", + [ + "", + 100 + ] + ], + [ + "test1", + [ + "", + 11.64 + ] + ], + [ + "test2", + [ + "", + 6.09 + ] + ] +] + +``` + + +### 【POST】/vpr/del + +说明: 根据 spk_id 删除用户数据 + +前端接口:声纹识别-删除用户数据 + +上传示例: +```json +{ + "spk_id":"啦啦啦啦" +} +``` + +返回示例 + +```json +{ + "status": true, + "msg": "Successfully delete data!" +} + +``` + + +### 【GET】/vpr/list + +说明:查询用户列表数据,无需参数,返回 spk_id 与 vpr_id + +前端接口:声纹识别-获取声纹数据列表 + +返回示例: + +```json +[ + [ + "test1", + "test2" + ], + [ + 9, + 10 + ] +] + +``` + + +### 【GET】/vpr/data + +说明: 根据 vpr_id 获取用户vpr时使用的音频 + +前端接口:声纹识别-获取vpr对应的音频 + +访问示例: + +```shell +curl -X 'GET' \ + 'http://0.0.0.0:8010/vpr/data?vprId=9' \ + -H 'accept: application/json' +``` + +返回示例: + +对应音频文件 + +### 【GET】/vpr/database64 + +说明: 根据 vpr_id 获取用户 vpr 时注册使用音频转换成 16k, int16 类型的数组,返回 base64 编码 + +前端接口:声纹识别-获取 vpr 对应的音频(注意:播放时需要添加 wav头,16k,int16, 可参考 tts 播放时添加 wav 的方式,注意更改采样率) + +访问示例: + +```shell +curl -X 'GET' \ + 'http://localhost:8010/vpr/database64?vprId=12' \ + -H 'accept: application/json' +``` + +返回示例: +```json +{ + "code": 0, + "result":"AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA", + "message": "ok" +``` \ No newline at end of file diff --git a/demos/speech_web/README.md b/demos/speech_web/README.md new file mode 100644 index 000000000..3b2da6e9a --- /dev/null +++ b/demos/speech_web/README.md @@ -0,0 +1,116 @@ +# Paddle Speech Demo + +PaddleSpeechDemo 是一个以 PaddleSpeech 的语音交互功能为主体开发的 Demo 展示项目,用于帮助大家更好的上手 PaddleSpeech 以及使用 PaddleSpeech 构建自己的应用。 + +智能语音交互部分使用 PaddleSpeech,对话以及信息抽取部分使用 PaddleNLP,网页前端展示部分基于 Vue3 进行开发 + +主要功能: + ++ 语音聊天:PaddleSpeech 的语音识别能力+语音合成能力,对话部分基于 PaddleNLP 的闲聊功能 ++ 声纹识别:PaddleSpeech 的声纹识别功能展示 ++ 语音识别:支持【实时语音识别】,【端到端识别】,【音频文件识别】三种模式 ++ 语音合成:支持【流式合成】与【端到端合成】两种方式 ++ 语音指令:基于 PaddleSpeech 的语音识别能力与 PaddleNLP 的信息抽取,实现交通费的智能报销 + +运行效果: + + ![效果](docs/效果展示.png) + +## 安装 + +### 后端环境安装 + +``` +# 安装环境 +cd speech_server +pip install -r requirements.txt + +# 下载 ie 模型,针对地点进行微调,效果更好,不下载的话会使用其它版本,效果没有这个好 +cd source +mkdir model +cd model +wget https://bj.bcebos.com/paddlenlp/applications/speech-cmd-analysis/finetune/model_state.pdparams +``` + +### 前端环境安装 + +前端依赖 `node.js` ,需要提前安装,确保 `npm` 可用,`npm` 测试版本 `8.3.1`,建议下载[官网](https://nodejs.org/en/)稳定版的 `node.js` + +``` +# 进入前端目录 +cd web_client + +# 安装 `yarn`,已经安装可跳过 +npm install -g yarn + +# 使用yarn安装前端依赖 +yarn install +``` + +## 启动服务 + +### 开启后端服务 + +``` +cd speech_server +# 默认8010端口 +python main.py --port 8010 +``` + +### 开启前端服务 + +``` +cd web_client +yarn dev --port 8011 +``` + +默认配置下,前端中配置的后台地址信息是 localhost,确保后端服务器和打开页面的游览器在同一台机器上,不在一台机器的配置方式见下方的 FAQ:【后端如果部署在其它机器或者别的端口如何修改】 +## FAQ + +#### Q: 如何安装node.js + +A: node.js的安装可以参考[【菜鸟教程】](https://www.runoob.com/nodejs/nodejs-install-setup.html), 确保 npm 可用 + +#### Q:后端如果部署在其它机器或者别的端口如何修改 + +A:后端的配置地址有分散在两个文件中 + +修改第一个文件 `PaddleSpeechWebClient/vite.config.js` + +``` +server: { + host: "0.0.0.0", + proxy: { + "/api": { + target: "http://localhost:8010", // 这里改成后端所在接口 + changeOrigin: true, + rewrite: (path) => path.replace(/^\/api/, ""), + }, + }, + } +``` + +修改第二个文件 `PaddleSpeechWebClient/src/api/API.js`( Websocket 代理配置失败,所以需要在这个文件中修改) + +``` +// websocket (这里改成后端所在的接口) +CHAT_SOCKET_RECORD: 'ws://localhost:8010/ws/asr/offlineStream', // ChatBot websocket 接口 
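+// 其余 websocket 地址同理:将 localhost:8010 换成后端实际的 IP 和端口(需与 main.py 启动时的 --port 参数保持一致,默认 8010)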
+ASR_SOCKET_RECORD: 'ws://localhost:8010/ws/asr/onlineStream', // Stream ASR 接口 +TTS_SOCKET_RECORD: 'ws://localhost:8010/ws/tts/online', // Stream TTS 接口 +``` + +#### Q:后端以IP地址的形式,前端无法录音 + +A:这里主要是游览器安全策略的限制,需要配置游览器后重启。游览器修改配置可参考[使用js-audio-recorder报浏览器不支持getUserMedia](https://blog.csdn.net/YRY_LIKE_YOU/article/details/113745273) + +chrome设置地址: chrome://flags/#unsafely-treat-insecure-origin-as-secure + +## 参考资料 + +vue实现录音参考资料:https://blog.csdn.net/qq_41619796/article/details/107865602#t1 + +前端流式播放音频参考仓库: + +https://github.com/AnthumChris/fetch-stream-audio + +https://bm.enthuses.me/buffered.php?bref=6677 diff --git a/demos/speech_web/docs/效果展示.png b/demos/speech_web/docs/效果展示.png new file mode 100644 index 000000000..5f7997c17 Binary files /dev/null and b/demos/speech_web/docs/效果展示.png differ diff --git a/demos/speech_web/speech_server/conf/tts_online_application.yaml b/demos/speech_web/speech_server/conf/tts_online_application.yaml new file mode 100644 index 000000000..0460a5e16 --- /dev/null +++ b/demos/speech_web/speech_server/conf/tts_online_application.yaml @@ -0,0 +1,103 @@ +# This is the parameter configuration file for streaming tts server. + +################################################################################# +# SERVER SETTING # +################################################################################# +host: 0.0.0.0 +port: 8092 + +# The task format in the engin_list is: _ +# engine_list choices = ['tts_online', 'tts_online-onnx'], the inference speed of tts_online-onnx is faster than tts_online. +# protocol choices = ['websocket', 'http'] +protocol: 'http' +engine_list: ['tts_online-onnx'] + + +################################################################################# +# ENGINE CONFIG # +################################################################################# + +################################### TTS ######################################### +################### speech task: tts; engine_type: online ####################### +tts_online: + # am (acoustic model) choices=['fastspeech2_csmsc', 'fastspeech2_cnndecoder_csmsc'] + # fastspeech2_cnndecoder_csmsc support streaming am infer. 
+ am: 'fastspeech2_csmsc' + am_config: + am_ckpt: + am_stat: + phones_dict: + tones_dict: + speaker_dict: + spk_id: 0 + + # voc (vocoder) choices=['mb_melgan_csmsc, hifigan_csmsc'] + # Both mb_melgan_csmsc and hifigan_csmsc support streaming voc inference + voc: 'mb_melgan_csmsc' + voc_config: + voc_ckpt: + voc_stat: + + # others + lang: 'zh' + device: 'cpu' # set 'gpu:id' or 'cpu' + # am_block and am_pad only for fastspeech2_cnndecoder_onnx model to streaming am infer, + # when am_pad set 12, streaming synthetic audio is the same as non-streaming synthetic audio + am_block: 72 + am_pad: 12 + # voc_pad and voc_block voc model to streaming voc infer, + # when voc model is mb_melgan_csmsc, voc_pad set 14, streaming synthetic audio is the same as non-streaming synthetic audio; The minimum value of pad can be set to 7, streaming synthetic audio sounds normal + # when voc model is hifigan_csmsc, voc_pad set 19, streaming synthetic audio is the same as non-streaming synthetic audio; voc_pad set 14, streaming synthetic audio sounds normal + voc_block: 36 + voc_pad: 14 + + + +################################################################################# +# ENGINE CONFIG # +################################################################################# + +################################### TTS ######################################### +################### speech task: tts; engine_type: online-onnx ####################### +tts_online-onnx: + # am (acoustic model) choices=['fastspeech2_csmsc_onnx', 'fastspeech2_cnndecoder_csmsc_onnx'] + # fastspeech2_cnndecoder_csmsc_onnx support streaming am infer. + am: 'fastspeech2_cnndecoder_csmsc_onnx' + # am_ckpt is a list, if am is fastspeech2_cnndecoder_csmsc_onnx, am_ckpt = [encoder model, decoder model, postnet model]; + # if am is fastspeech2_csmsc_onnx, am_ckpt = [ckpt model]; + am_ckpt: # list + am_stat: + phones_dict: + tones_dict: + speaker_dict: + spk_id: 0 + am_sample_rate: 24000 + am_sess_conf: + device: "cpu" # set 'gpu:id' or 'cpu' + use_trt: False + cpu_threads: 4 + + # voc (vocoder) choices=['mb_melgan_csmsc_onnx, hifigan_csmsc_onnx'] + # Both mb_melgan_csmsc_onnx and hifigan_csmsc_onnx support streaming voc inference + voc: 'hifigan_csmsc_onnx' + voc_ckpt: + voc_sample_rate: 24000 + voc_sess_conf: + device: "cpu" # set 'gpu:id' or 'cpu' + use_trt: False + cpu_threads: 4 + + # others + lang: 'zh' + # am_block and am_pad only for fastspeech2_cnndecoder_onnx model to streaming am infer, + # when am_pad set 12, streaming synthetic audio is the same as non-streaming synthetic audio + am_block: 72 + am_pad: 12 + # voc_pad and voc_block voc model to streaming voc infer, + # when voc model is mb_melgan_csmsc_onnx, voc_pad set 14, streaming synthetic audio is the same as non-streaming synthetic audio; The minimum value of pad can be set to 7, streaming synthetic audio sounds normal + # when voc model is hifigan_csmsc_onnx, voc_pad set 19, streaming synthetic audio is the same as non-streaming synthetic audio; voc_pad set 14, streaming synthetic audio sounds normal + voc_block: 36 + voc_pad: 14 + # voc_upsample should be same as n_shift on voc config. 
+ voc_upsample: 300 + diff --git a/demos/speech_web/speech_server/conf/ws_conformer_wenetspeech_application_faster.yaml b/demos/speech_web/speech_server/conf/ws_conformer_wenetspeech_application_faster.yaml new file mode 100644 index 000000000..ba413c802 --- /dev/null +++ b/demos/speech_web/speech_server/conf/ws_conformer_wenetspeech_application_faster.yaml @@ -0,0 +1,48 @@ +# This is the parameter configuration file for PaddleSpeech Serving. + +################################################################################# +# SERVER SETTING # +################################################################################# +host: 0.0.0.0 +port: 8090 + +# The task format in the engin_list is: _ +# task choices = ['asr_online'] +# protocol = ['websocket'] (only one can be selected). +# websocket only support online engine type. +protocol: 'websocket' +engine_list: ['asr_online'] + + +################################################################################# +# ENGINE CONFIG # +################################################################################# + +################################### ASR ######################################### +################### speech task: asr; engine_type: online ####################### +asr_online: + model_type: 'conformer_online_wenetspeech' + am_model: # the pdmodel file of am static model [optional] + am_params: # the pdiparams file of am static model [optional] + lang: 'zh' + sample_rate: 16000 + cfg_path: + decode_method: + force_yes: True + device: 'cpu' # cpu or gpu:id + decode_method: "attention_rescoring" + continuous_decoding: True # enable continue decoding when endpoint detected + num_decoding_left_chunks: 16 + am_predictor_conf: + device: # set 'gpu:id' or 'cpu' + switch_ir_optim: True + glog_info: False # True -> print glog + summary: True # False -> do not show predictor config + + chunk_buffer_conf: + window_n: 7 # frame + shift_n: 4 # frame + window_ms: 25 # ms + shift_ms: 10 # ms + sample_rate: 16000 + sample_width: 2 diff --git a/demos/speech_web/speech_server/main.py b/demos/speech_web/speech_server/main.py new file mode 100644 index 000000000..b10176670 --- /dev/null +++ b/demos/speech_web/speech_server/main.py @@ -0,0 +1,492 @@ +# todo: +# 1. 开启服务 +# 2. 接收录音音频,返回识别结果 +# 3. 接收ASR识别结果,返回NLP对话结果 +# 4. 
接收NLP对话结果,返回TTS音频 + +import base64 +import yaml +import os +import json +import datetime +import librosa +import soundfile as sf +import numpy as np +import argparse +import uvicorn +import aiofiles +from typing import Optional, List +from pydantic import BaseModel +from fastapi import FastAPI, Header, File, UploadFile, Form, Cookie, WebSocket, WebSocketDisconnect +from fastapi.responses import StreamingResponse +from starlette.responses import FileResponse +from starlette.middleware.cors import CORSMiddleware +from starlette.requests import Request +from starlette.websockets import WebSocketState as WebSocketState + +from src.AudioManeger import AudioMannger +from src.util import * +from src.robot import Robot +from src.WebsocketManeger import ConnectionManager +from src.SpeechBase.vpr import VPR + +from paddlespeech.server.engine.asr.online.python.asr_engine import PaddleASRConnectionHanddler +from paddlespeech.server.utils.audio_process import float2pcm + + +# 解析配置 +parser = argparse.ArgumentParser( + prog='PaddleSpeechDemo', add_help=True) + +parser.add_argument( + "--port", + action="store", + type=int, + help="port of the app", + default=8010, + required=False) + +args = parser.parse_args() +port = args.port + +# 配置文件 +tts_config = "conf/tts_online_application.yaml" +asr_config = "conf/ws_conformer_wenetspeech_application_faster.yaml" +asr_init_path = "source/demo/demo.wav" +db_path = "source/db/vpr.sqlite" +ie_model_path = "source/model" + +# 路径配置 +UPLOAD_PATH = "source/vpr" +WAV_PATH = "source/wav" + + +base_sources = [ + UPLOAD_PATH, WAV_PATH +] +for path in base_sources: + os.makedirs(path, exist_ok=True) + + +# 初始化 +app = FastAPI() +chatbot = Robot(asr_config, tts_config, asr_init_path, ie_model_path=ie_model_path) +manager = ConnectionManager() +aumanager = AudioMannger(chatbot) +aumanager.init() +vpr = VPR(db_path, dim = 192, top_k = 5) + +# 服务配置 +class NlpBase(BaseModel): + chat: str + +class TtsBase(BaseModel): + text: str + +class Audios: + def __init__(self) -> None: + self.audios = b"" + +audios = Audios() + +###################################################################### +########################### ASR 服务 ################################# +##################################################################### + +# 接收文件,返回ASR结果 +# 上传文件 +@app.post("/asr/offline") +async def speech2textOffline(files: List[UploadFile]): + # 只有第一个有效 + asr_res = "" + for file in files[:1]: + # 生成时间戳 + now_name = "asr_offline_" + datetime.datetime.strftime(datetime.datetime.now(), '%Y%m%d%H%M%S') + randName() + ".wav" + out_file_path = os.path.join(WAV_PATH, now_name) + async with aiofiles.open(out_file_path, 'wb') as out_file: + content = await file.read() # async read + await out_file.write(content) # async write + + # 返回ASR识别结果 + asr_res = chatbot.speech2text(out_file_path) + return SuccessRequest(result=asr_res) + # else: + # return ErrorRequest(message="文件不是.wav格式") + return ErrorRequest(message="上传文件为空") + +# 接收文件,同时将wav强制转成16k, int16类型 +@app.post("/asr/offlinefile") +async def speech2textOfflineFile(files: List[UploadFile]): + # 只有第一个有效 + asr_res = "" + for file in files[:1]: + # 生成时间戳 + now_name = "asr_offline_" + datetime.datetime.strftime(datetime.datetime.now(), '%Y%m%d%H%M%S') + randName() + ".wav" + out_file_path = os.path.join(WAV_PATH, now_name) + async with aiofiles.open(out_file_path, 'wb') as out_file: + content = await file.read() # async read + await out_file.write(content) # async write + + # 将文件转成16k, 16bit类型的wav文件 + wav, sr = librosa.load(out_file_path, sr=16000) + 
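+            # librosa 按 sr=16000 重采样后得到 [-1, 1] 区间的 float32 波形,
+            # 下面先转成 16bit PCM,再编码成 base64 返回给前端播放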
wav = float2pcm(wav) # float32 to int16 + wav_bytes = wav.tobytes() # to bytes + wav_base64 = base64.b64encode(wav_bytes).decode('utf8') + + # 将文件重新写入 + now_name = now_name[:-4] + "_16k" + ".wav" + out_file_path = os.path.join(WAV_PATH, now_name) + sf.write(out_file_path,wav,16000) + + # 返回ASR识别结果 + asr_res = chatbot.speech2text(out_file_path) + response_res = { + "asr_result": asr_res, + "wav_base64": wav_base64 + } + return SuccessRequest(result=response_res) + + return ErrorRequest(message="上传文件为空") + + + +# 流式接收测试 +@app.post("/asr/online1") +async def speech2textOnlineRecive(files: List[UploadFile]): + audio_bin = b'' + for file in files: + content = await file.read() + audio_bin += content + audios.audios += audio_bin + print(f"audios长度变化: {len(audios.audios)}") + return SuccessRequest(message="接收成功") + +# 采集环境噪音大小 +@app.post("/asr/collectEnv") +async def collectEnv(files: List[UploadFile]): + for file in files[:1]: + content = await file.read() # async read + # 初始化, wav 前44字节是头部信息 + aumanager.compute_env_volume(content[44:]) + vad_ = aumanager.vad_threshold + return SuccessRequest(result=vad_,message="采集环境噪音成功") + +# 停止录音 +@app.get("/asr/stopRecord") +async def stopRecord(): + audios.audios = b"" + aumanager.stop() + print("Online录音暂停") + return SuccessRequest(message="停止成功") + +# 恢复录音 +@app.get("/asr/resumeRecord") +async def resumeRecord(): + aumanager.resume() + print("Online录音恢复") + return SuccessRequest(message="Online录音恢复") + + +# 聊天用的ASR +@app.websocket("/ws/asr/offlineStream") +async def websocket_endpoint(websocket: WebSocket): + await manager.connect(websocket) + try: + while True: + asr_res = None + # websocket 不接收,只推送 + data = await websocket.receive_bytes() + if not aumanager.is_pause: + asr_res = aumanager.stream_asr(data) + else: + print("录音暂停") + if asr_res: + await manager.send_personal_message(asr_res, websocket) + aumanager.clear_asr() + + except WebSocketDisconnect: + manager.disconnect(websocket) + # await manager.broadcast(f"用户-{user}-离开") + # print(f"用户-{user}-离开") + + +# Online识别的ASR +@app.websocket('/ws/asr/onlineStream') +async def websocket_endpoint(websocket: WebSocket): + """PaddleSpeech Online ASR Server api + + Args: + websocket (WebSocket): the websocket instance + """ + + #1. the interface wait to accept the websocket protocal header + # and only we receive the header, it establish the connection with specific thread + await websocket.accept() + + #2. if we accept the websocket headers, we will get the online asr engine instance + engine = chatbot.asr.engine + + #3. each websocket connection, we will create an PaddleASRConnectionHanddler to process such audio + # and each connection has its own connection instance to process the request + # and only if client send the start signal, we create the PaddleASRConnectionHanddler instance + connection_handler = None + + try: + #4. 
we do a loop to process the audio package by package according the protocal + # and only if the client send finished signal, we will break the loop + while True: + # careful here, changed the source code from starlette.websockets + # 4.1 we wait for the client signal for the specific action + assert websocket.application_state == WebSocketState.CONNECTED + message = await websocket.receive() + websocket._raise_on_disconnect(message) + + #4.2 text for the action command and bytes for pcm data + if "text" in message: + # we first parse the specific command + message = json.loads(message["text"]) + if 'signal' not in message: + resp = {"status": "ok", "message": "no valid json data"} + await websocket.send_json(resp) + + # start command, we create the PaddleASRConnectionHanddler instance to process the audio data + # end command, we process the all the last audio pcm and return the final result + # and we break the loop + if message['signal'] == 'start': + resp = {"status": "ok", "signal": "server_ready"} + # do something at begining here + # create the instance to process the audio + # connection_handler = chatbot.asr.connection_handler + connection_handler = PaddleASRConnectionHanddler(engine) + await websocket.send_json(resp) + elif message['signal'] == 'end': + # reset single engine for an new connection + # and we will destroy the connection + connection_handler.decode(is_finished=True) + connection_handler.rescoring() + asr_results = connection_handler.get_result() + connection_handler.reset() + + resp = { + "status": "ok", + "signal": "finished", + 'result': asr_results + } + await websocket.send_json(resp) + break + else: + resp = {"status": "ok", "message": "no valid json data"} + await websocket.send_json(resp) + elif "bytes" in message: + # bytes for the pcm data + message = message["bytes"] + print("###############") + print("len message: ", len(message)) + print("###############") + + # we extract the remained audio pcm + # and decode for the result in this package data + connection_handler.extract_feat(message) + connection_handler.decode(is_finished=False) + asr_results = connection_handler.get_result() + + # return the current period result + # if the engine create the vad instance, this connection will have many period results + resp = {'result': asr_results} + print(resp) + await websocket.send_json(resp) + except WebSocketDisconnect: + pass + +###################################################################### +########################### NLP 服务 ################################# +##################################################################### + +@app.post("/nlp/chat") +async def chatOffline(nlp_base:NlpBase): + chat = nlp_base.chat + if not chat: + return ErrorRequest(message="传入文本为空") + else: + res = chatbot.chat(chat) + return SuccessRequest(result=res) + +@app.post("/nlp/ie") +async def ieOffline(nlp_base:NlpBase): + nlp_text = nlp_base.chat + if not nlp_text: + return ErrorRequest(message="传入文本为空") + else: + res = chatbot.ie(nlp_text) + return SuccessRequest(result=res) + +###################################################################### +########################### TTS 服务 ################################# +##################################################################### + +@app.post("/tts/offline") +async def text2speechOffline(tts_base:TtsBase): + text = tts_base.text + if not text: + return ErrorRequest(message="文本为空") + else: + now_name = "tts_"+ datetime.datetime.strftime(datetime.datetime.now(), '%Y%m%d%H%M%S') + randName() + ".wav" + 
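+        # 文件名由时间戳加随机后缀组成(randName 来自 src.util),避免并发请求互相覆盖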
out_file_path = os.path.join(WAV_PATH, now_name) + # 保存为文件,再转成base64传输 + chatbot.text2speech(text, outpath=out_file_path) + with open(out_file_path, "rb") as f: + data_bin = f.read() + base_str = base64.b64encode(data_bin) + return SuccessRequest(result=base_str) + +# http流式TTS +@app.post("/tts/online") +async def stream_tts(request_body: TtsBase): + text = request_body.text + return StreamingResponse(chatbot.text2speechStreamBytes(text=text)) + +# ws流式TTS +@app.websocket("/ws/tts/online") +async def stream_ttsWS(websocket: WebSocket): + await manager.connect(websocket) + try: + while True: + text = await websocket.receive_text() + # 用 websocket 流式接收音频数据 + if text: + for sub_wav in chatbot.text2speechStream(text=text): + # print("发送sub wav: ", len(sub_wav)) + res = { + "wav": sub_wav, + "done": False + } + await websocket.send_json(res) + + # 输送结束 + res = { + "wav": sub_wav, + "done": True + } + await websocket.send_json(res) + # manager.disconnect(websocket) + + except WebSocketDisconnect: + manager.disconnect(websocket) + + +###################################################################### +########################### VPR 服务 ################################# +##################################################################### + +app.add_middleware( + CORSMiddleware, + allow_origins=["*"], + allow_credentials=True, + allow_methods=["*"], + allow_headers=["*"]) + + +@app.post('/vpr/enroll') +async def vpr_enroll(table_name: str=None, + spk_id: str=Form(...), + audio: UploadFile=File(...)): + # Enroll the uploaded audio with spk-id into MySQL + try: + if not spk_id: + return {'status': False, 'msg': "spk_id can not be None"} + # Save the upload data to server. + content = await audio.read() + now_name = "vpr_enroll_" + datetime.datetime.strftime(datetime.datetime.now(), '%Y%m%d%H%M%S') + randName() + ".wav" + audio_path = os.path.join(UPLOAD_PATH, now_name) + + with open(audio_path, "wb+") as f: + f.write(content) + vpr.vpr_enroll(username=spk_id, wav_path=audio_path) + return {'status': True, 'msg': "Successfully enroll data!"} + except Exception as e: + return {'status': False, 'msg': e} + + +@app.post('/vpr/recog') +async def vpr_recog(request: Request, + table_name: str=None, + audio: UploadFile=File(...)): + # Voice print recognition online + # try: + # Save the upload data to server. 
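+    # NOTE: the surrounding try/except is commented out here, so exceptions propagate to FastAPI directly;
+    # restore the try block if the same error handling as the other endpoints is desired.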
+ content = await audio.read() + now_name = "vpr_query_" + datetime.datetime.strftime(datetime.datetime.now(), '%Y%m%d%H%M%S') + randName() + ".wav" + query_audio_path = os.path.join(UPLOAD_PATH, now_name) + with open(query_audio_path, "wb+") as f: + f.write(content) + spk_ids, paths, scores = vpr.do_search_vpr(query_audio_path) + + res = dict(zip(spk_ids, zip(paths, scores))) + # Sort results by distance metric, closest distances first + res = sorted(res.items(), key=lambda item: item[1][1], reverse=True) + return res + # except Exception as e: + # return {'status': False, 'msg': e}, 400 + + +@app.post('/vpr/del') +async def vpr_del(spk_id: dict=None): + # Delete a record by spk_id in MySQL + try: + spk_id = spk_id['spk_id'] + if not spk_id: + return {'status': False, 'msg': "spk_id can not be None"} + vpr.vpr_del(username=spk_id) + return {'status': True, 'msg': "Successfully delete data!"} + except Exception as e: + return {'status': False, 'msg': e}, 400 + + +@app.get('/vpr/list') +async def vpr_list(): + # Get all records in MySQL + try: + spk_ids, vpr_ids = vpr.do_list() + return spk_ids, vpr_ids + except Exception as e: + return {'status': False, 'msg': e}, 400 + + +@app.get('/vpr/database64') +async def vpr_database64(vprId: int): + # Get the audio file from path by spk_id in MySQL + try: + if not vprId: + return {'status': False, 'msg': "vpr_id can not be None"} + audio_path = vpr.do_get_wav(vprId) + # 返回base64 + + # 将文件转成16k, 16bit类型的wav文件 + wav, sr = librosa.load(audio_path, sr=16000) + wav = float2pcm(wav) # float32 to int16 + wav_bytes = wav.tobytes() # to bytes + wav_base64 = base64.b64encode(wav_bytes).decode('utf8') + + return SuccessRequest(result=wav_base64) + except Exception as e: + return {'status': False, 'msg': e}, 400 + +@app.get('/vpr/data') +async def vpr_data(vprId: int): + # Get the audio file from path by spk_id in MySQL + try: + if not vprId: + return {'status': False, 'msg': "vpr_id can not be None"} + audio_path = vpr.do_get_wav(vprId) + return FileResponse(audio_path) + except Exception as e: + return {'status': False, 'msg': e}, 400 + +if __name__ == '__main__': + uvicorn.run(app=app, host='0.0.0.0', port=port) + + + + + + diff --git a/demos/speech_web/speech_server/requirements.txt b/demos/speech_web/speech_server/requirements.txt new file mode 100644 index 000000000..7e7bd1680 --- /dev/null +++ b/demos/speech_web/speech_server/requirements.txt @@ -0,0 +1,14 @@ +aiofiles +fastapi +librosa +numpy +pydantic +scikit_learn +SoundFile +starlette +uvicorn +paddlepaddle +paddlespeech +paddlenlp +faiss-cpu +python-multipart \ No newline at end of file diff --git a/demos/speech_web/speech_server/src/AudioManeger.py b/demos/speech_web/speech_server/src/AudioManeger.py new file mode 100644 index 000000000..0deb03699 --- /dev/null +++ b/demos/speech_web/speech_server/src/AudioManeger.py @@ -0,0 +1,150 @@ +import imp +from queue import Queue +import numpy as np +import os +import wave +import random +import datetime +from .util import randName + + +class AudioMannger: + def __init__(self, robot, frame_length=160, frame=10, data_width=2, vad_default = 300): + # 二进制 pcm 流 + self.audios = b'' + self.asr_result = "" + # Speech 核心主体 + self.robot = robot + + self.file_dir = "source" + os.makedirs(self.file_dir, exist_ok=True) + self.vad_deafult = vad_default + self.vad_threshold = vad_default + self.vad_threshold_path = os.path.join(self.file_dir, "vad_threshold.npy") + + # 10ms 一帧 + self.frame_length = frame_length + # 10帧,检测一次 vad + self.frame = frame + # int 16, 两个bytes 
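+        # 按默认参数:frame_length=160(16kHz 下约 10ms)、frame=10、data_width=2,
+        # 故下方 window_length = 160 * 10 * 2 = 3200 字节,约对应 100ms 音频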
+ self.data_width = data_width + # window + self.window_length = frame_length * frame * data_width + + # 是否开始录音 + self.on_asr = False + self.silence_cnt = 0 + self.max_silence_cnt = 4 + self.is_pause = False # 录音暂停与恢复 + + + + def init(self): + if os.path.exists(self.vad_threshold_path): + # 平均响度文件存在 + self.vad_threshold = np.load(self.vad_threshold_path) + + + def clear_audio(self): + # 清空 pcm 累积片段与 asr 识别结果 + self.audios = b'' + + def clear_asr(self): + self.asr_result = "" + + + def compute_chunk_volume(self, start_index, pcm_bins): + # 根据帧长计算能量平均值 + pcm_bin = pcm_bins[start_index: start_index + self.window_length] + # 转成 numpy + pcm_np = np.frombuffer(pcm_bin, np.int16) + # 归一化 + 计算响度 + x = pcm_np.astype(np.float32) + x = np.abs(x) + return np.mean(x) + + + def is_speech(self, start_index, pcm_bins): + # 检查是否没 + if start_index > len(pcm_bins): + return False + # 检查从这个 start 开始是否为静音帧 + energy = self.compute_chunk_volume(start_index=start_index, pcm_bins=pcm_bins) + # print(energy) + if energy > self.vad_threshold: + return True + else: + return False + + def compute_env_volume(self, pcm_bins): + max_energy = 0 + start = 0 + while start < len(pcm_bins): + energy = self.compute_chunk_volume(start_index=start, pcm_bins=pcm_bins) + if energy > max_energy: + max_energy = energy + start += self.window_length + self.vad_threshold = max_energy + 100 if max_energy > self.vad_deafult else self.vad_deafult + + # 保存成文件 + np.save(self.vad_threshold_path, self.vad_threshold) + print(f"vad 阈值大小: {self.vad_threshold}") + print(f"环境采样保存: {os.path.realpath(self.vad_threshold_path)}") + + def stream_asr(self, pcm_bin): + # 先把 pcm_bin 送进去做端点检测 + start = 0 + while start < len(pcm_bin): + if self.is_speech(start_index=start, pcm_bins=pcm_bin): + self.on_asr = True + self.silence_cnt = 0 + print("录音中") + self.audios += pcm_bin[ start : start + self.window_length] + else: + if self.on_asr: + self.silence_cnt += 1 + if self.silence_cnt > self.max_silence_cnt: + self.on_asr = False + self.silence_cnt = 0 + # 录音停止 + print("录音停止") + # audios 保存为 wav, 送入 ASR + if len(self.audios) > 2 * 16000: + file_path = os.path.join(self.file_dir, "asr_" + datetime.datetime.strftime(datetime.datetime.now(), '%Y%m%d%H%M%S') + randName() + ".wav") + self.save_audio(file_path=file_path) + self.asr_result = self.robot.speech2text(file_path) + self.clear_audio() + return self.asr_result + else: + # 正常接收 + print("录音中 静音") + self.audios += pcm_bin[ start : start + self.window_length] + start += self.window_length + return "" + + def save_audio(self, file_path): + print("保存音频") + wf = wave.open(file_path, 'wb') # 创建一个音频文件,名字为“01.wav" + wf.setnchannels(1) # 设置声道数为2 + wf.setsampwidth(2) # 设置采样深度为 + wf.setframerate(16000) # 设置采样率为16000 + # 将数据写入创建的音频文件 + wf.writeframes(self.audios) + # 写完后将文件关闭 + wf.close() + + def end(self): + # audios 保存为 wav, 送入 ASR + file_path = os.path.join(self.file_dir, "asr.wav") + self.save_audio(file_path=file_path) + return self.robot.speech2text(file_path) + + def stop(self): + self.is_pause = True + self.audios = b'' + + def resume(self): + self.is_pause = False + + + \ No newline at end of file diff --git a/demos/speech_web/speech_server/src/SpeechBase/asr.py b/demos/speech_web/speech_server/src/SpeechBase/asr.py new file mode 100644 index 000000000..8d4c0cffc --- /dev/null +++ b/demos/speech_web/speech_server/src/SpeechBase/asr.py @@ -0,0 +1,62 @@ +from re import sub +import numpy as np +import paddle +import librosa +import soundfile + +from paddlespeech.server.engine.asr.online.python.asr_engine import 
ASREngine +from paddlespeech.server.engine.asr.online.python.asr_engine import PaddleASRConnectionHanddler +from paddlespeech.server.utils.config import get_config + +def readWave(samples): + x_len = len(samples) + + chunk_size = 85 * 16 #80ms, sample_rate = 16kHz + if x_len % chunk_size != 0: + padding_len_x = chunk_size - x_len % chunk_size + else: + padding_len_x = 0 + + padding = np.zeros((padding_len_x), dtype=samples.dtype) + padded_x = np.concatenate([samples, padding], axis=0) + + assert (x_len + padding_len_x) % chunk_size == 0 + num_chunk = (x_len + padding_len_x) / chunk_size + num_chunk = int(num_chunk) + for i in range(0, num_chunk): + start = i * chunk_size + end = start + chunk_size + x_chunk = padded_x[start:end] + yield x_chunk + + +class ASR: + def __init__(self, config_path, ) -> None: + self.config = get_config(config_path)['asr_online'] + self.engine = ASREngine() + self.engine.init(self.config) + self.connection_handler = PaddleASRConnectionHanddler(self.engine) + + def offlineASR(self, samples, sample_rate=16000): + x_chunk, x_chunk_lens = self.engine.preprocess(samples=samples, sample_rate=sample_rate) + self.engine.run(x_chunk, x_chunk_lens) + result = self.engine.postprocess() + self.engine.reset() + return result + + def onlineASR(self, samples:bytes=None, is_finished=False): + if not is_finished: + # 流式开始 + self.connection_handler.extract_feat(samples) + self.connection_handler.decode(is_finished) + asr_results = self.connection_handler.get_result() + return asr_results + else: + # 流式结束 + self.connection_handler.decode(is_finished=True) + self.connection_handler.rescoring() + asr_results = self.connection_handler.get_result() + self.connection_handler.reset() + return asr_results + + \ No newline at end of file diff --git a/demos/speech_web/speech_server/src/SpeechBase/nlp.py b/demos/speech_web/speech_server/src/SpeechBase/nlp.py new file mode 100644 index 000000000..4ece63256 --- /dev/null +++ b/demos/speech_web/speech_server/src/SpeechBase/nlp.py @@ -0,0 +1,23 @@ +from paddlenlp import Taskflow + +class NLP: + def __init__(self, ie_model_path=None): + schema = ["时间", "出发地", "目的地", "费用"] + if ie_model_path: + self.ie_model = Taskflow("information_extraction", + schema=schema, task_path=ie_model_path) + else: + self.ie_model = Taskflow("information_extraction", + schema=schema) + + self.dialogue_model = Taskflow("dialogue") + + def chat(self, text): + result = self.dialogue_model([text]) + return result[0] + + def ie(self, text): + result = self.ie_model(text) + return result + + \ No newline at end of file diff --git a/demos/speech_web/speech_server/src/SpeechBase/sql_helper.py b/demos/speech_web/speech_server/src/SpeechBase/sql_helper.py new file mode 100644 index 000000000..6937def58 --- /dev/null +++ b/demos/speech_web/speech_server/src/SpeechBase/sql_helper.py @@ -0,0 +1,116 @@ +import base64 +import sqlite3 +import os +import numpy as np +from pkg_resources import resource_stream + + +def dict_factory(cursor, row): + d = {} + for idx, col in enumerate(cursor.description): + d[col[0]] = row[idx] + return d + +class DataBase(object): + def __init__(self, db_path:str): + db_path = os.path.realpath(db_path) + + if os.path.exists(db_path): + self.db_path = db_path + else: + db_path_dir = os.path.dirname(db_path) + os.makedirs(db_path_dir, exist_ok=True) + self.db_path = db_path + + self.conn = sqlite3.connect(self.db_path) + self.conn.row_factory = dict_factory + self.cursor = self.conn.cursor() + self.init_database() + + def init_database(self): + """ + 
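+        Initialize the database: create the `vprtable` table (id, username, vector, wavpath) if it does not already exist.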
初始化数据库, 若表不存在则创建 + """ + sql = """ + CREATE TABLE IF NOT EXISTS vprtable ( + `id` INTEGER PRIMARY KEY AUTOINCREMENT, + `username` TEXT NOT NULL, + `vector` TEXT NOT NULL, + `wavpath` TEXT NOT NULL + ); + """ + self.cursor.execute(sql) + self.conn.commit() + + def execute_base(self, sql, data_dict): + self.cursor.execute(sql, data_dict) + self.conn.commit() + + def insert_one(self, username, vector_base64:str, wav_path): + if not os.path.exists(wav_path): + return None, "wav not exists" + else: + sql = f""" + insert into + vprtable (username, vector, wavpath) + values (?, ?, ?) + """ + try: + self.cursor.execute(sql, (username, vector_base64, wav_path)) + self.conn.commit() + lastidx = self.cursor.lastrowid + return lastidx, "data insert success" + except Exception as e: + print(e) + return None, e + + def select_all(self): + sql = """ + SELECT * from vprtable + """ + result = self.cursor.execute(sql).fetchall() + return result + + def select_by_id(self, vpr_id): + sql = f""" + SELECT * from vprtable WHERE `id` = {vpr_id} + """ + result = self.cursor.execute(sql).fetchall() + return result + + def select_by_username(self, username): + sql = f""" + SELECT * from vprtable WHERE `username` = '{username}' + """ + result = self.cursor.execute(sql).fetchall() + return result + + def drop_by_username(self, username): + sql = f""" + DELETE from vprtable WHERE `username`='{username}' + """ + self.cursor.execute(sql) + self.conn.commit() + + def drop_all(self): + sql = f""" + DELETE from vprtable + """ + self.cursor.execute(sql) + self.conn.commit() + + def drop_table(self): + sql = f""" + DROP TABLE vprtable + """ + self.cursor.execute(sql) + self.conn.commit() + + def encode_vector(self, vector:np.ndarray): + return base64.b64encode(vector).decode('utf8') + + def decode_vector(self, vector_base64, dtype=np.float32): + b = base64.b64decode(vector_base64) + vc = np.frombuffer(b, dtype=dtype) + return vc + \ No newline at end of file diff --git a/demos/speech_web/speech_server/src/SpeechBase/tts.py b/demos/speech_web/speech_server/src/SpeechBase/tts.py new file mode 100644 index 000000000..d5ba0c802 --- /dev/null +++ b/demos/speech_web/speech_server/src/SpeechBase/tts.py @@ -0,0 +1,209 @@ +# tts 推理引擎,支持流式与非流式 +# 精简化使用 +# 用 onnxruntime 进行推理 +# 1. 下载对应的模型 +# 2. 加载模型 +# 3. 端到端推理 +# 4. 
流式推理 + +import base64 +import math +import logging +import numpy as np +from paddlespeech.server.utils.onnx_infer import get_sess +from paddlespeech.t2s.frontend.zh_frontend import Frontend +from paddlespeech.server.utils.util import denorm, get_chunks +from paddlespeech.server.utils.audio_process import float2pcm +from paddlespeech.server.utils.config import get_config + +from paddlespeech.server.engine.tts.online.onnx.tts_engine import TTSEngine + +class TTS: + def __init__(self, config_path): + self.config = get_config(config_path)['tts_online-onnx'] + self.config['voc_block'] = 36 + self.engine = TTSEngine() + self.engine.init(self.config) + self.executor = self.engine.executor + #self.engine.warm_up() + + # 前端初始化 + self.frontend = Frontend( + phone_vocab_path=self.engine.executor.phones_dict, + tone_vocab_path=None) + + def depadding(self, data, chunk_num, chunk_id, block, pad, upsample): + """ + Streaming inference removes the result of pad inference + """ + front_pad = min(chunk_id * block, pad) + # first chunk + if chunk_id == 0: + data = data[:block * upsample] + # last chunk + elif chunk_id == chunk_num - 1: + data = data[front_pad * upsample:] + # middle chunk + else: + data = data[front_pad * upsample:(front_pad + block) * upsample] + + return data + + def offlineTTS(self, text): + get_tone_ids = False + merge_sentences = False + + input_ids = self.frontend.get_input_ids( + text, + merge_sentences=merge_sentences, + get_tone_ids=get_tone_ids) + phone_ids = input_ids["phone_ids"] + wav_list = [] + for i in range(len(phone_ids)): + orig_hs = self.engine.executor.am_encoder_infer_sess.run( + None, input_feed={'text': phone_ids[i].numpy()} + ) + hs = orig_hs[0] + am_decoder_output = self.engine.executor.am_decoder_sess.run( + None, input_feed={'xs': hs}) + am_postnet_output = self.engine.executor.am_postnet_sess.run( + None, + input_feed={ + 'xs': np.transpose(am_decoder_output[0], (0, 2, 1)) + }) + am_output_data = am_decoder_output + np.transpose( + am_postnet_output[0], (0, 2, 1)) + normalized_mel = am_output_data[0][0] + mel = denorm(normalized_mel, self.engine.executor.am_mu, self.engine.executor.am_std) + wav = self.engine.executor.voc_sess.run( + output_names=None, input_feed={'logmel': mel})[0] + wav_list.append(wav) + wavs = np.concatenate(wav_list) + return wavs + + def streamTTS(self, text): + + get_tone_ids = False + merge_sentences = False + + # front + input_ids = self.frontend.get_input_ids( + text, + merge_sentences=merge_sentences, + get_tone_ids=get_tone_ids) + phone_ids = input_ids["phone_ids"] + + for i in range(len(phone_ids)): + part_phone_ids = phone_ids[i].numpy() + voc_chunk_id = 0 + + # fastspeech2_csmsc + if self.config.am == "fastspeech2_csmsc_onnx": + # am + mel = self.executor.am_sess.run( + output_names=None, input_feed={'text': part_phone_ids}) + mel = mel[0] + + # voc streaming + mel_chunks = get_chunks(mel, self.config.voc_block, self.config.voc_pad, "voc") + voc_chunk_num = len(mel_chunks) + for i, mel_chunk in enumerate(mel_chunks): + sub_wav = self.executor.voc_sess.run( + output_names=None, input_feed={'logmel': mel_chunk}) + sub_wav = self.depadding(sub_wav[0], voc_chunk_num, i, + self.config.voc_block, self.config.voc_pad, + self.config.voc_upsample) + + yield self.after_process(sub_wav) + + # fastspeech2_cnndecoder_csmsc + elif self.config.am == "fastspeech2_cnndecoder_csmsc_onnx": + # am + orig_hs = self.executor.am_encoder_infer_sess.run( + None, input_feed={'text': part_phone_ids}) + orig_hs = orig_hs[0] + + # streaming voc chunk info + 
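+                    # the encoder output length (in mel frames) fixes how many vocoder chunks of voc_block frames each (padded by voc_pad) will be produced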
mel_len = orig_hs.shape[1] + voc_chunk_num = math.ceil(mel_len / self.config.voc_block) + start = 0 + end = min(self.config.voc_block + self.config.voc_pad, mel_len) + + # streaming am + hss = get_chunks(orig_hs, self.config.am_block, self.config.am_pad, "am") + am_chunk_num = len(hss) + for i, hs in enumerate(hss): + am_decoder_output = self.executor.am_decoder_sess.run( + None, input_feed={'xs': hs}) + am_postnet_output = self.executor.am_postnet_sess.run( + None, + input_feed={ + 'xs': np.transpose(am_decoder_output[0], (0, 2, 1)) + }) + am_output_data = am_decoder_output + np.transpose( + am_postnet_output[0], (0, 2, 1)) + normalized_mel = am_output_data[0][0] + + sub_mel = denorm(normalized_mel, self.executor.am_mu, + self.executor.am_std) + sub_mel = self.depadding(sub_mel, am_chunk_num, i, + self.config.am_block, self.config.am_pad, 1) + + if i == 0: + mel_streaming = sub_mel + else: + mel_streaming = np.concatenate( + (mel_streaming, sub_mel), axis=0) + + # streaming voc + # 当流式AM推理的mel帧数大于流式voc推理的chunk size,开始进行流式voc 推理 + while (mel_streaming.shape[0] >= end and + voc_chunk_id < voc_chunk_num): + voc_chunk = mel_streaming[start:end, :] + + sub_wav = self.executor.voc_sess.run( + output_names=None, input_feed={'logmel': voc_chunk}) + sub_wav = self.depadding( + sub_wav[0], voc_chunk_num, voc_chunk_id, + self.config.voc_block, self.config.voc_pad, self.config.voc_upsample) + + yield self.after_process(sub_wav) + + voc_chunk_id += 1 + start = max( + 0, voc_chunk_id * self.config.voc_block - self.config.voc_pad) + end = min( + (voc_chunk_id + 1) * self.config.voc_block + self.config.voc_pad, + mel_len) + + else: + logging.error( + "Only support fastspeech2_csmsc or fastspeech2_cnndecoder_csmsc on streaming tts." + ) + + + def streamTTSBytes(self, text): + for wav in self.engine.executor.infer( + text=text, + lang=self.engine.config.lang, + am=self.engine.config.am, + spk_id=0): + wav = float2pcm(wav) # float32 to int16 + wav_bytes = wav.tobytes() # to bytes + yield wav_bytes + + + def after_process(self, wav): + # for tvm + wav = float2pcm(wav) # float32 to int16 + wav_bytes = wav.tobytes() # to bytes + wav_base64 = base64.b64encode(wav_bytes).decode('utf8') # to base64 + return wav_base64 + + def streamTTS_TVM(self, text): + # 用 TVM 优化 + pass + + + + \ No newline at end of file diff --git a/demos/speech_web/speech_server/src/SpeechBase/vpr.py b/demos/speech_web/speech_server/src/SpeechBase/vpr.py new file mode 100644 index 000000000..29ee986e3 --- /dev/null +++ b/demos/speech_web/speech_server/src/SpeechBase/vpr.py @@ -0,0 +1,118 @@ +# vpr Demo 没有使用 mysql 与 muilvs, 仅用于docker演示 +import logging +import faiss +from matplotlib import use +import numpy as np +from .sql_helper import DataBase +from .vpr_encode import get_audio_embedding + +class VPR: + def __init__(self, db_path, dim, top_k) -> None: + # 初始化 + self.db_path = db_path + self.dim = dim + self.top_k = top_k + self.dtype = np.float32 + self.vpr_idx = 0 + + # db 初始化 + self.db = DataBase(db_path) + + # faiss 初始化 + index_ip = faiss.IndexFlatIP(dim) + self.index_ip = faiss.IndexIDMap(index_ip) + self.init() + + def init(self): + # demo 初始化,把 mysql中的向量注册到 faiss 中 + sql_dbs = self.db.select_all() + if sql_dbs: + for sql_db in sql_dbs: + idx = sql_db['id'] + vc_bs64 = sql_db['vector'] + vc = self.db.decode_vector(vc_bs64) + if len(vc.shape) == 1: + vc = np.expand_dims(vc, axis=0) + # 构建数据库 + self.index_ip.add_with_ids(vc, np.array((idx,)).astype('int64')) + logging.info("faiss 构建完毕") + + def faiss_enroll(self, idx, vc): + 
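+        # register a single embedding (shape [1, dim]) in the faiss index, keyed by its sqlite row id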
self.index_ip.add_with_ids(vc, np.array((idx,)).astype('int64')) + + def vpr_enroll(self, username, wav_path): + # 注册声纹 + emb = get_audio_embedding(wav_path) + emb = np.expand_dims(emb, axis=0) + if emb is not None: + emb_bs64 = self.db.encode_vector(emb) + last_idx, mess = self.db.insert_one(username, emb_bs64, wav_path) + if last_idx: + # faiss 注册 + self.faiss_enroll(last_idx, emb) + else: + last_idx, mess = None + return last_idx + + def vpr_recog(self, wav_path): + # 识别声纹 + emb_search = get_audio_embedding(wav_path) + + if emb_search is not None: + emb_search = np.expand_dims(emb_search, axis=0) + D, I = self.index_ip.search(emb_search, self.top_k) + D = D.tolist()[0] + I = I.tolist()[0] + return [(round(D[i] * 100, 2 ), I[i]) for i in range(len(D)) if I[i] != -1] + else: + logging.error("识别失败") + return None + + def do_search_vpr(self, wav_path): + spk_ids, paths, scores = [], [], [] + recog_result = self.vpr_recog(wav_path) + for score, idx in recog_result: + username = self.db.select_by_id(idx)[0]['username'] + if username not in spk_ids: + spk_ids.append(username) + scores.append(score) + paths.append("") + return spk_ids, paths, scores + + def vpr_del(self, username): + # 根据用户username, 删除声纹 + # 查用户ID,删除对应向量 + res = self.db.select_by_username(username) + for r in res: + idx = r['id'] + self.index_ip.remove_ids(np.array((idx,)).astype('int64')) + + self.db.drop_by_username(username) + + def vpr_list(self): + # 获取数据列表 + return self.db.select_all() + + def do_list(self): + spk_ids, vpr_ids = [], [] + for res in self.db.select_all(): + spk_ids.append(res['username']) + vpr_ids.append(res['id']) + return spk_ids, vpr_ids + + def do_get_wav(self, vpr_idx): + res = self.db.select_by_id(vpr_idx) + return res[0]['wavpath'] + + + def vpr_data(self, idx): + # 获取对应ID的数据 + res = self.db.select_by_id(idx) + return res + + def vpr_droptable(self): + # 删除表 + self.db.drop_table() + # 清空 faiss + self.index_ip.reset() + diff --git a/demos/speech_web/speech_server/src/SpeechBase/vpr_encode.py b/demos/speech_web/speech_server/src/SpeechBase/vpr_encode.py new file mode 100644 index 000000000..a6a00e4d0 --- /dev/null +++ b/demos/speech_web/speech_server/src/SpeechBase/vpr_encode.py @@ -0,0 +1,20 @@ +from paddlespeech.cli.vector import VectorExecutor +import numpy as np +import logging + +vector_executor = VectorExecutor() + +def get_audio_embedding(path): + """ + Use vpr_inference to generate embedding of audio + """ + try: + embedding = vector_executor( + audio_file=path, model='ecapatdnn_voxceleb12') + embedding = embedding / np.linalg.norm(embedding) + return embedding + except Exception as e: + logging.error(f"Error with embedding:{e}") + return None + + \ No newline at end of file diff --git a/demos/speech_web/speech_server/src/WebsocketManeger.py b/demos/speech_web/speech_server/src/WebsocketManeger.py new file mode 100644 index 000000000..5edde8430 --- /dev/null +++ b/demos/speech_web/speech_server/src/WebsocketManeger.py @@ -0,0 +1,31 @@ +from typing import List + +from fastapi import WebSocket + +class ConnectionManager: + def __init__(self): + # 存放激活的ws连接对象 + self.active_connections: List[WebSocket] = [] + + async def connect(self, ws: WebSocket): + # 等待连接 + await ws.accept() + # 存储ws连接对象 + self.active_connections.append(ws) + + def disconnect(self, ws: WebSocket): + # 关闭时 移除ws对象 + self.active_connections.remove(ws) + + @staticmethod + async def send_personal_message(message: str, ws: WebSocket): + # 发送个人消息 + await ws.send_text(message) + + async def broadcast(self, message: str): + # 广播消息 + 
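+        # send the message to every currently active websocket connection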
for connection in self.active_connections: + await connection.send_text(message) + + +manager = ConnectionManager() \ No newline at end of file diff --git a/demos/speech_web/speech_server/src/robot.py b/demos/speech_web/speech_server/src/robot.py new file mode 100644 index 000000000..b971c57b5 --- /dev/null +++ b/demos/speech_web/speech_server/src/robot.py @@ -0,0 +1,70 @@ +from paddlespeech.cli.asr.infer import ASRExecutor +import soundfile as sf +import os +import librosa + +from src.SpeechBase.asr import ASR +from src.SpeechBase.tts import TTS +from src.SpeechBase.nlp import NLP + + +class Robot: + def __init__(self, asr_config, tts_config,asr_init_path, + ie_model_path=None) -> None: + self.nlp = NLP(ie_model_path=ie_model_path) + self.asr = ASR(config_path=asr_config) + self.tts = TTS(config_path=tts_config) + self.tts_sample_rate = 24000 + self.asr_sample_rate = 16000 + + # 流式识别效果不如端到端的模型,这里流式模型与端到端模型分开 + self.asr_model = ASRExecutor() + self.asr_name = "conformer_wenetspeech" + self.warm_up_asrmodel(asr_init_path) + + + def warm_up_asrmodel(self, asr_init_path): + if not os.path.exists(asr_init_path): + path_dir = os.path.dirname(asr_init_path) + if not os.path.exists(path_dir): + os.makedirs(path_dir, exist_ok=True) + + # TTS生成,采样率24000 + text = "生成初始音频" + self.text2speech(text, asr_init_path) + + # asr model初始化 + self.asr_model(asr_init_path, model=self.asr_name,lang='zh', + sample_rate=16000, force_yes=True) + + + def speech2text(self, audio_file): + self.asr_model.preprocess(self.asr_name, audio_file) + self.asr_model.infer(self.asr_name) + res = self.asr_model.postprocess() + return res + + def text2speech(self, text, outpath): + wav = self.tts.offlineTTS(text) + sf.write( + outpath, wav, samplerate=self.tts_sample_rate) + res = wav + return res + + def text2speechStream(self, text): + for sub_wav_base64 in self.tts.streamTTS(text=text): + yield sub_wav_base64 + + def text2speechStreamBytes(self, text): + for wav_bytes in self.tts.streamTTSBytes(text=text): + yield wav_bytes + + def chat(self, text): + result = self.nlp.chat(text) + return result + + def ie(self, text): + result = self.nlp.ie(text) + return result + + \ No newline at end of file diff --git a/demos/speech_web/speech_server/src/util.py b/demos/speech_web/speech_server/src/util.py new file mode 100644 index 000000000..34005d919 --- /dev/null +++ b/demos/speech_web/speech_server/src/util.py @@ -0,0 +1,18 @@ +import random + +def randName(n=5): + return "".join(random.sample('zyxwvutsrqponmlkjihgfedcba',n)) + +def SuccessRequest(result=None, message="ok"): + return { + "code": 0, + "result":result, + "message": message + } + +def ErrorRequest(result=None, message="error"): + return { + "code": -1, + "result":result, + "message": message + } \ No newline at end of file diff --git a/demos/speech_web/web_client/.gitignore b/demos/speech_web/web_client/.gitignore new file mode 100644 index 000000000..e33435dce --- /dev/null +++ b/demos/speech_web/web_client/.gitignore @@ -0,0 +1,25 @@ +# Logs +logs +*.log +npm-debug.log* +yarn-debug.log* +yarn-error.log* +pnpm-debug.log* +lerna-debug.log* + +node_modules +dist +dist-ssr +*.local + +# Editor directories and files +.vscode/* +!.vscode/extensions.json +.idea +.DS_Store +*.suo +*.ntvs* +*.njsproj +*.sln +*.sw? +.vscode/* diff --git a/demos/speech_web/web_client/index.html b/demos/speech_web/web_client/index.html new file mode 100644 index 000000000..6b20e7b7b --- /dev/null +++ b/demos/speech_web/web_client/index.html @@ -0,0 +1,13 @@ + + + + + + + 飞桨PaddleSpeech + + +
+ + + diff --git a/demos/speech_web/web_client/package-lock.json b/demos/speech_web/web_client/package-lock.json new file mode 100644 index 000000000..509be385c --- /dev/null +++ b/demos/speech_web/web_client/package-lock.json @@ -0,0 +1,1869 @@ +{ + "name": "paddlespeechwebclient", + "version": "0.0.0", + "lockfileVersion": 2, + "requires": true, + "packages": { + "": { + "name": "paddlespeechwebclient", + "version": "0.0.0", + "dependencies": { + "ant-design-vue": "^2.2.8", + "axios": "^0.26.1", + "element-plus": "^2.1.9", + "js-audio-recorder": "0.5.7", + "lamejs": "^1.2.1", + "less": "^4.1.2", + "vue": "^3.2.25" + }, + "devDependencies": { + "@vitejs/plugin-vue": "^2.3.0", + "vite": "^2.9.0" + } + }, + "node_modules/@ant-design/colors": { + "version": "6.0.0", + "resolved": "https://registry.npmmirror.com/@ant-design/colors/-/colors-6.0.0.tgz", + "integrity": "sha512-qAZRvPzfdWHtfameEGP2Qvuf838NhergR35o+EuVyB5XvSA98xod5r4utvi4TJ3ywmevm290g9nsCG5MryrdWQ==", + "dependencies": { + "@ctrl/tinycolor": "^3.4.0" + } + }, + "node_modules/@ant-design/icons-svg": { + "version": "4.2.1", + "resolved": "https://registry.npmmirror.com/@ant-design/icons-svg/-/icons-svg-4.2.1.tgz", + "integrity": "sha512-EB0iwlKDGpG93hW8f85CTJTs4SvMX7tt5ceupvhALp1IF44SeUFOMhKUOYqpsoYWQKAOuTRDMqn75rEaKDp0Xw==" + }, + "node_modules/@ant-design/icons-vue": { + "version": "6.1.0", + "resolved": "https://registry.npmmirror.com/@ant-design/icons-vue/-/icons-vue-6.1.0.tgz", + "integrity": "sha512-EX6bYm56V+ZrKN7+3MT/ubDkvJ5rK/O2t380WFRflDcVFgsvl3NLH7Wxeau6R8DbrO5jWR6DSTC3B6gYFp77AA==", + "dependencies": { + "@ant-design/colors": "^6.0.0", + "@ant-design/icons-svg": "^4.2.1" + }, + "peerDependencies": { + "vue": ">=3.0.3" + } + }, + "node_modules/@babel/parser": { + "version": "7.17.9", + "resolved": "https://registry.npmmirror.com/@babel/parser/-/parser-7.17.9.tgz", + "integrity": "sha512-vqUSBLP8dQHFPdPi9bc5GK9vRkYHJ49fsZdtoJ8EQ8ibpwk5rPKfvNIwChB0KVXcIjcepEBBd2VHC5r9Gy8ueg==", + "license": "MIT", + "bin": { + "parser": "bin/babel-parser.js" + }, + "engines": { + "node": ">=6.0.0" + } + }, + "node_modules/@babel/runtime": { + "version": "7.17.9", + "resolved": "https://registry.npmmirror.com/@babel/runtime/-/runtime-7.17.9.tgz", + "integrity": "sha512-lSiBBvodq29uShpWGNbgFdKYNiFDo5/HIYsaCEY9ff4sb10x9jizo2+pRrSyF4jKZCXqgzuqBOQKbUm90gQwJg==", + "dependencies": { + "regenerator-runtime": "^0.13.4" + }, + "engines": { + "node": ">=6.9.0" + } + }, + "node_modules/@ctrl/tinycolor": { + "version": "3.4.1", + "resolved": "https://registry.npmmirror.com/@ctrl/tinycolor/-/tinycolor-3.4.1.tgz", + "integrity": "sha512-ej5oVy6lykXsvieQtqZxCOaLT+xD4+QNarq78cIYISHmZXshCvROLudpQN3lfL8G0NL7plMSSK+zlyvCaIJ4Iw==", + "license": "MIT", + "engines": { + "node": ">=10" + } + }, + "node_modules/@element-plus/icons-vue": { + "version": "1.1.4", + "resolved": "https://registry.npmmirror.com/@element-plus/icons-vue/-/icons-vue-1.1.4.tgz", + "integrity": "sha512-Iz/nHqdp1sFPmdzRwHkEQQA3lKvoObk8azgABZ81QUOpW9s/lUyQVUSh0tNtEPZXQlKwlSh7SPgoVxzrE0uuVQ==", + "license": "MIT", + "peerDependencies": { + "vue": "^3.2.0" + } + }, + "node_modules/@floating-ui/core": { + "version": "0.6.1", + "resolved": "https://registry.npmmirror.com/@floating-ui/core/-/core-0.6.1.tgz", + "integrity": "sha512-Y30eVMcZva8o84c0HcXAtDO4BEzPJMvF6+B7x7urL2xbAqVsGJhojOyHLaoQHQYjb6OkqRq5kO+zeySycQwKqg==", + "license": "MIT" + }, + "node_modules/@floating-ui/dom": { + "version": "0.4.4", + "resolved": "https://registry.npmmirror.com/@floating-ui/dom/-/dom-0.4.4.tgz", + "integrity": 
"sha512-0Ulu3B/dqQplUUSqnTx0foSrlYuMN+GTtlJWvNJwt6Fr7/PqmlR/Y08o6/+bxDWr6p3roBJRaQ51MDZsNmEhhw==", + "license": "MIT", + "dependencies": { + "@floating-ui/core": "^0.6.1" + } + }, + "node_modules/@popperjs/core": { + "version": "2.11.5", + "resolved": "https://registry.npmmirror.com/@popperjs/core/-/core-2.11.5.tgz", + "integrity": "sha512-9X2obfABZuDVLCgPK9aX0a/x4jaOEweTTWE2+9sr0Qqqevj2Uv5XorvusThmc9XGYpS9yI+fhh8RTafBtGposw==", + "license": "MIT", + "funding": { + "type": "opencollective", + "url": "https://opencollective.com/popperjs" + } + }, + "node_modules/@simonwep/pickr": { + "version": "1.8.2", + "resolved": "https://registry.npmmirror.com/@simonwep/pickr/-/pickr-1.8.2.tgz", + "integrity": "sha512-/l5w8BIkrpP6n1xsetx9MWPWlU6OblN5YgZZphxan0Tq4BByTCETL6lyIeY8lagalS2Nbt4F2W034KHLIiunKA==", + "dependencies": { + "core-js": "^3.15.1", + "nanopop": "^2.1.0" + } + }, + "node_modules/@types/lodash": { + "version": "4.14.181", + "resolved": "https://registry.npmmirror.com/@types/lodash/-/lodash-4.14.181.tgz", + "integrity": "sha512-n3tyKthHJbkiWhDZs3DkhkCzt2MexYHXlX0td5iMplyfwketaOeKboEVBqzceH7juqvEg3q5oUoBFxSLu7zFag==", + "license": "MIT" + }, + "node_modules/@types/lodash-es": { + "version": "4.17.6", + "resolved": "https://registry.npmmirror.com/@types/lodash-es/-/lodash-es-4.17.6.tgz", + "integrity": "sha512-R+zTeVUKDdfoRxpAryaQNRKk3105Rrgx2CFRClIgRGaqDTdjsm8h6IYA8ir584W3ePzkZfst5xIgDwYrlh9HLg==", + "license": "MIT", + "dependencies": { + "@types/lodash": "*" + } + }, + "node_modules/@vitejs/plugin-vue": { + "version": "2.3.1", + "resolved": "https://registry.npmmirror.com/@vitejs/plugin-vue/-/plugin-vue-2.3.1.tgz", + "integrity": "sha512-YNzBt8+jt6bSwpt7LP890U1UcTOIZZxfpE5WOJ638PNxSEKOqAi0+FSKS0nVeukfdZ0Ai/H7AFd6k3hayfGZqQ==", + "dev": true, + "license": "MIT", + "engines": { + "node": ">=12.0.0" + }, + "peerDependencies": { + "vite": "^2.5.10", + "vue": "^3.2.25" + } + }, + "node_modules/@vue/compiler-core": { + "version": "3.2.32", + "resolved": "https://registry.npmmirror.com/@vue/compiler-core/-/compiler-core-3.2.32.tgz", + "integrity": "sha512-bRQ8Rkpm/aYFElDWtKkTPHeLnX5pEkNxhPUcqu5crEJIilZH0yeFu/qUAcV4VfSE2AudNPkQSOwMZofhnuutmA==", + "license": "MIT", + "dependencies": { + "@babel/parser": "^7.16.4", + "@vue/shared": "3.2.32", + "estree-walker": "^2.0.2", + "source-map": "^0.6.1" + } + }, + "node_modules/@vue/compiler-dom": { + "version": "3.2.32", + "resolved": "https://registry.npmmirror.com/@vue/compiler-dom/-/compiler-dom-3.2.32.tgz", + "integrity": "sha512-maa3PNB/NxR17h2hDQfcmS02o1f9r9QIpN1y6fe8tWPrS1E4+q8MqrvDDQNhYVPd84rc3ybtyumrgm9D5Rf/kg==", + "license": "MIT", + "dependencies": { + "@vue/compiler-core": "3.2.32", + "@vue/shared": "3.2.32" + } + }, + "node_modules/@vue/compiler-sfc": { + "version": "3.2.32", + "resolved": "https://registry.npmmirror.com/@vue/compiler-sfc/-/compiler-sfc-3.2.32.tgz", + "integrity": "sha512-uO6+Gh3AVdWm72lRRCjMr8nMOEqc6ezT9lWs5dPzh1E9TNaJkMYPaRtdY9flUv/fyVQotkfjY/ponjfR+trPSg==", + "license": "MIT", + "dependencies": { + "@babel/parser": "^7.16.4", + "@vue/compiler-core": "3.2.32", + "@vue/compiler-dom": "3.2.32", + "@vue/compiler-ssr": "3.2.32", + "@vue/reactivity-transform": "3.2.32", + "@vue/shared": "3.2.32", + "estree-walker": "^2.0.2", + "magic-string": "^0.25.7", + "postcss": "^8.1.10", + "source-map": "^0.6.1" + } + }, + "node_modules/@vue/compiler-ssr": { + "version": "3.2.32", + "resolved": "https://registry.npmmirror.com/@vue/compiler-ssr/-/compiler-ssr-3.2.32.tgz", + "integrity": 
"sha512-ZklVUF/SgTx6yrDUkaTaBL/JMVOtSocP+z5Xz/qIqqLdW/hWL90P+ob/jOQ0Xc/om57892Q7sRSrex0wujOL2Q==", + "license": "MIT", + "dependencies": { + "@vue/compiler-dom": "3.2.32", + "@vue/shared": "3.2.32" + } + }, + "node_modules/@vue/reactivity": { + "version": "3.2.32", + "resolved": "https://registry.npmmirror.com/@vue/reactivity/-/reactivity-3.2.32.tgz", + "integrity": "sha512-4zaDumuyDqkuhbb63hRd+YHFGopW7srFIWesLUQ2su/rJfWrSq3YUvoKAJE8Eu1EhZ2Q4c1NuwnEreKj1FkDxA==", + "license": "MIT", + "dependencies": { + "@vue/shared": "3.2.32" + } + }, + "node_modules/@vue/reactivity-transform": { + "version": "3.2.32", + "resolved": "https://registry.npmmirror.com/@vue/reactivity-transform/-/reactivity-transform-3.2.32.tgz", + "integrity": "sha512-CW1W9zaJtE275tZSWIfQKiPG0iHpdtSlmTqYBu7Y62qvtMgKG5yOxtvBs4RlrZHlaqFSE26avLAgQiTp4YHozw==", + "license": "MIT", + "dependencies": { + "@babel/parser": "^7.16.4", + "@vue/compiler-core": "3.2.32", + "@vue/shared": "3.2.32", + "estree-walker": "^2.0.2", + "magic-string": "^0.25.7" + } + }, + "node_modules/@vue/runtime-core": { + "version": "3.2.32", + "resolved": "https://registry.npmmirror.com/@vue/runtime-core/-/runtime-core-3.2.32.tgz", + "integrity": "sha512-uKKzK6LaCnbCJ7rcHvsK0azHLGpqs+Vi9B28CV1mfWVq1F3Bj8Okk3cX+5DtD06aUh4V2bYhS2UjjWiUUKUF0w==", + "license": "MIT", + "dependencies": { + "@vue/reactivity": "3.2.32", + "@vue/shared": "3.2.32" + } + }, + "node_modules/@vue/runtime-dom": { + "version": "3.2.32", + "resolved": "https://registry.npmmirror.com/@vue/runtime-dom/-/runtime-dom-3.2.32.tgz", + "integrity": "sha512-AmlIg+GPqjkNoADLjHojEX5RGcAg+TsgXOOcUrtDHwKvA8mO26EnLQLB8nylDjU6AMJh2CIYn8NEgyOV5ZIScQ==", + "license": "MIT", + "dependencies": { + "@vue/runtime-core": "3.2.32", + "@vue/shared": "3.2.32", + "csstype": "^2.6.8" + } + }, + "node_modules/@vue/server-renderer": { + "version": "3.2.32", + "resolved": "https://registry.npmmirror.com/@vue/server-renderer/-/server-renderer-3.2.32.tgz", + "integrity": "sha512-TYKpZZfRJpGTTiy/s6bVYwQJpAUx3G03z4G7/3O18M11oacrMTVHaHjiPuPqf3xQtY8R4LKmQ3EOT/DRCA/7Wg==", + "license": "MIT", + "dependencies": { + "@vue/compiler-ssr": "3.2.32", + "@vue/shared": "3.2.32" + }, + "peerDependencies": { + "vue": "3.2.32" + } + }, + "node_modules/@vue/shared": { + "version": "3.2.32", + "resolved": "https://registry.npmmirror.com/@vue/shared/-/shared-3.2.32.tgz", + "integrity": "sha512-bjcixPErUsAnTQRQX4Z5IQnICYjIfNCyCl8p29v1M6kfVzvwOICPw+dz48nNuWlTOOx2RHhzHdazJibE8GSnsw==", + "license": "MIT" + }, + "node_modules/@vueuse/core": { + "version": "8.2.5", + "resolved": "https://registry.npmmirror.com/@vueuse/core/-/core-8.2.5.tgz", + "integrity": "sha512-5prZAA1Ji2ltwNUnzreu6WIXYqHYP/9U2BiY5mD/650VYLpVcwVlYznJDFcLCmEWI3o3Vd34oS1FUf+6Mh68GQ==", + "license": "MIT", + "dependencies": { + "@vueuse/metadata": "8.2.5", + "@vueuse/shared": "8.2.5", + "vue-demi": "*" + }, + "funding": { + "url": "https://github.com/sponsors/antfu" + }, + "peerDependencies": { + "@vue/composition-api": "^1.1.0", + "vue": "^2.6.0 || ^3.2.0" + }, + "peerDependenciesMeta": { + "@vue/composition-api": { + "optional": true + }, + "vue": { + "optional": true + } + } + }, + "node_modules/@vueuse/metadata": { + "version": "8.2.5", + "resolved": "https://registry.npmmirror.com/@vueuse/metadata/-/metadata-8.2.5.tgz", + "integrity": "sha512-Lk9plJjh9cIdiRdcj16dau+2LANxIdFCiTgdfzwYXbflxq0QnMBeOD2qHgKDE7fuVrtPcVWj8VSuZEx1HRfNQA==", + "license": "MIT", + "funding": { + "url": "https://github.com/sponsors/antfu" + } + }, + "node_modules/@vueuse/shared": { + "version": 
"8.2.5", + "resolved": "https://registry.npmmirror.com/@vueuse/shared/-/shared-8.2.5.tgz", + "integrity": "sha512-lNWo+7sk6JCuOj4AiYM+6HZ6fq4xAuVq1sVckMQKgfCJZpZRe4i8es+ZULO5bYTKP+VrOCtqrLR2GzEfrbr3YQ==", + "license": "MIT", + "dependencies": { + "vue-demi": "*" + }, + "funding": { + "url": "https://github.com/sponsors/antfu" + }, + "peerDependencies": { + "@vue/composition-api": "^1.1.0", + "vue": "^2.6.0 || ^3.2.0" + }, + "peerDependenciesMeta": { + "@vue/composition-api": { + "optional": true + }, + "vue": { + "optional": true + } + } + }, + "node_modules/ant-design-vue": { + "version": "2.2.8", + "resolved": "https://registry.npmmirror.com/ant-design-vue/-/ant-design-vue-2.2.8.tgz", + "integrity": "sha512-3graq9/gCfJQs6hznrHV6sa9oDmk/D1H3Oo0vLdVpPS/I61fZPk8NEyNKCHpNA6fT2cx6xx9U3QS63uuyikg/Q==", + "dependencies": { + "@ant-design/icons-vue": "^6.0.0", + "@babel/runtime": "^7.10.5", + "@simonwep/pickr": "~1.8.0", + "array-tree-filter": "^2.1.0", + "async-validator": "^3.3.0", + "dom-align": "^1.12.1", + "dom-scroll-into-view": "^2.0.0", + "lodash": "^4.17.21", + "lodash-es": "^4.17.15", + "moment": "^2.27.0", + "omit.js": "^2.0.0", + "resize-observer-polyfill": "^1.5.1", + "scroll-into-view-if-needed": "^2.2.25", + "shallow-equal": "^1.0.0", + "vue-types": "^3.0.0", + "warning": "^4.0.0" + }, + "peerDependencies": { + "@vue/compiler-sfc": ">=3.1.0", + "vue": ">=3.1.0" + } + }, + "node_modules/ant-design-vue/node_modules/async-validator": { + "version": "3.5.2", + "resolved": "https://registry.npmmirror.com/async-validator/-/async-validator-3.5.2.tgz", + "integrity": "sha512-8eLCg00W9pIRZSB781UUX/H6Oskmm8xloZfr09lz5bikRpBVDlJ3hRVuxxP1SxcwsEYfJ4IU8Q19Y8/893r3rQ==" + }, + "node_modules/array-tree-filter": { + "version": "2.1.0", + "resolved": "https://registry.npmmirror.com/array-tree-filter/-/array-tree-filter-2.1.0.tgz", + "integrity": "sha512-4ROwICNlNw/Hqa9v+rk5h22KjmzB1JGTMVKP2AKJBOCgb0yL0ASf0+YvCcLNNwquOHNX48jkeZIJ3a+oOQqKcw==" + }, + "node_modules/async-validator": { + "version": "4.0.7", + "resolved": "https://registry.npmmirror.com/async-validator/-/async-validator-4.0.7.tgz", + "integrity": "sha512-Pj2IR7u8hmUEDOwB++su6baaRi+QvsgajuFB9j95foM1N2gy5HM4z60hfusIO0fBPG5uLAEl6yCJr1jNSVugEQ==", + "license": "MIT" + }, + "node_modules/axios": { + "version": "0.26.1", + "resolved": "https://registry.npmmirror.com/axios/-/axios-0.26.1.tgz", + "integrity": "sha512-fPwcX4EvnSHuInCMItEhAGnaSEXRBjtzh9fOtsE6E1G6p7vl7edEeZe11QHf18+6+9gR5PbKV/sGKNaD8YaMeA==", + "license": "MIT", + "dependencies": { + "follow-redirects": "^1.14.8" + } + }, + "node_modules/axios/node_modules/follow-redirects": { + "version": "1.14.9", + "resolved": "https://registry.npmmirror.com/follow-redirects/-/follow-redirects-1.14.9.tgz", + "integrity": "sha512-MQDfihBQYMcyy5dhRDJUHcw7lb2Pv/TuE6xP1vyraLukNDHKbDxDNaOE3NbCAdKQApno+GPRyo1YAp89yCjK4w==", + "funding": [ + { + "type": "individual", + "url": "https://github.com/sponsors/RubenVerborgh" + } + ], + "license": "MIT", + "engines": { + "node": ">=4.0" + }, + "peerDependenciesMeta": { + "debug": { + "optional": true + } + } + }, + "node_modules/compute-scroll-into-view": { + "version": "1.0.17", + "resolved": "https://registry.npmmirror.com/compute-scroll-into-view/-/compute-scroll-into-view-1.0.17.tgz", + "integrity": "sha512-j4dx+Fb0URmzbwwMUrhqWM2BEWHdFGx+qZ9qqASHRPqvTYdqvWnHg0H1hIbcyLnvgnoNAVMlwkepyqM3DaIFUg==" + }, + "node_modules/copy-anything": { + "version": "2.0.6", + "resolved": "https://registry.npmmirror.com/copy-anything/-/copy-anything-2.0.6.tgz", + 
"integrity": "sha512-1j20GZTsvKNkc4BY3NpMOM8tt///wY3FpIzozTOFO2ffuZcV61nojHXVKIy3WM+7ADCy5FVhdZYHYDdgTU0yJw==", + "dependencies": { + "is-what": "^3.14.1" + } + }, + "node_modules/core-js": { + "version": "3.22.5", + "resolved": "https://registry.npmmirror.com/core-js/-/core-js-3.22.5.tgz", + "integrity": "sha512-VP/xYuvJ0MJWRAobcmQ8F2H6Bsn+s7zqAAjFaHGBMc5AQm7zaelhD1LGduFn2EehEcQcU+br6t+fwbpQ5d1ZWA==", + "hasInstallScript": true + }, + "node_modules/csstype": { + "version": "2.6.20", + "resolved": "https://registry.npmmirror.com/csstype/-/csstype-2.6.20.tgz", + "integrity": "sha512-/WwNkdXfckNgw6S5R125rrW8ez139lBHWouiBvX8dfMFtcn6V81REDqnH7+CRpRipfYlyU1CmOnOxrmGcFOjeA==", + "license": "MIT" + }, + "node_modules/dayjs": { + "version": "1.11.0", + "resolved": "https://registry.npmmirror.com/dayjs/-/dayjs-1.11.0.tgz", + "integrity": "sha512-JLC809s6Y948/FuCZPm5IX8rRhQwOiyMb2TfVVQEixG7P8Lm/gt5S7yoQZmC8x1UehI9Pb7sksEt4xx14m+7Ug==", + "license": "MIT" + }, + "node_modules/dom-align": { + "version": "1.12.3", + "resolved": "https://registry.npmmirror.com/dom-align/-/dom-align-1.12.3.tgz", + "integrity": "sha512-Gj9hZN3a07cbR6zviMUBOMPdWxYhbMI+x+WS0NAIu2zFZmbK8ys9R79g+iG9qLnlCwpFoaB+fKy8Pdv470GsPA==" + }, + "node_modules/dom-scroll-into-view": { + "version": "2.0.1", + "resolved": "https://registry.npmmirror.com/dom-scroll-into-view/-/dom-scroll-into-view-2.0.1.tgz", + "integrity": "sha512-bvVTQe1lfaUr1oFzZX80ce9KLDlZ3iU+XGNE/bz9HnGdklTieqsbmsLHe+rT2XWqopvL0PckkYqN7ksmm5pe3w==" + }, + "node_modules/element-plus": { + "version": "2.1.9", + "resolved": "https://registry.npmmirror.com/element-plus/-/element-plus-2.1.9.tgz", + "integrity": "sha512-6mWqS3YrmJPnouWP4otzL8+MehfOnDFqDbcIdnmC07p+Z0JkWe/CVKc4Wky8AYC8nyDMUQyiZYvooCbqGuM7pg==", + "license": "MIT", + "dependencies": { + "@ctrl/tinycolor": "^3.4.0", + "@element-plus/icons-vue": "^1.1.4", + "@floating-ui/dom": "^0.4.2", + "@popperjs/core": "^2.11.4", + "@types/lodash": "^4.14.181", + "@types/lodash-es": "^4.17.6", + "@vueuse/core": "^8.2.4", + "async-validator": "^4.0.7", + "dayjs": "^1.11.0", + "escape-html": "^1.0.3", + "lodash": "^4.17.21", + "lodash-es": "^4.17.21", + "lodash-unified": "^1.0.2", + "memoize-one": "^6.0.0", + "normalize-wheel-es": "^1.1.2" + }, + "peerDependencies": { + "vue": "^3.2.0" + } + }, + "node_modules/errno": { + "version": "0.1.8", + "resolved": "https://registry.npmmirror.com/errno/-/errno-0.1.8.tgz", + "integrity": "sha512-dJ6oBr5SQ1VSd9qkk7ByRgb/1SH4JZjCHSW/mr63/QcXO9zLVxvJ6Oy13nio03rxpSnVDDjFor75SjVeZWPW/A==", + "optional": true, + "dependencies": { + "prr": "~1.0.1" + }, + "bin": { + "errno": "cli.js" + } + }, + "node_modules/esbuild": { + "version": "0.14.36", + "resolved": "https://registry.npmmirror.com/esbuild/-/esbuild-0.14.36.tgz", + "integrity": "sha512-HhFHPiRXGYOCRlrhpiVDYKcFJRdO0sBElZ668M4lh2ER0YgnkLxECuFe7uWCf23FrcLc59Pqr7dHkTqmRPDHmw==", + "dev": true, + "hasInstallScript": true, + "license": "MIT", + "bin": { + "esbuild": "bin/esbuild" + }, + "engines": { + "node": ">=12" + }, + "optionalDependencies": { + "esbuild-android-64": "0.14.36", + "esbuild-android-arm64": "0.14.36", + "esbuild-darwin-64": "0.14.36", + "esbuild-darwin-arm64": "0.14.36", + "esbuild-freebsd-64": "0.14.36", + "esbuild-freebsd-arm64": "0.14.36", + "esbuild-linux-32": "0.14.36", + "esbuild-linux-64": "0.14.36", + "esbuild-linux-arm": "0.14.36", + "esbuild-linux-arm64": "0.14.36", + "esbuild-linux-mips64le": "0.14.36", + "esbuild-linux-ppc64le": "0.14.36", + "esbuild-linux-riscv64": "0.14.36", + "esbuild-linux-s390x": "0.14.36", + 
"esbuild-netbsd-64": "0.14.36", + "esbuild-openbsd-64": "0.14.36", + "esbuild-sunos-64": "0.14.36", + "esbuild-windows-32": "0.14.36", + "esbuild-windows-64": "0.14.36", + "esbuild-windows-arm64": "0.14.36" + } + }, + "node_modules/esbuild-darwin-64": { + "version": "0.14.36", + "resolved": "https://registry.npmmirror.com/esbuild-darwin-64/-/esbuild-darwin-64-0.14.36.tgz", + "integrity": "sha512-kkl6qmV0dTpyIMKagluzYqlc1vO0ecgpviK/7jwPbRDEv5fejRTaBBEE2KxEQbTHcLhiiDbhG7d5UybZWo/1zQ==", + "cpu": [ + "x64" + ], + "dev": true, + "license": "MIT", + "optional": true, + "os": [ + "darwin" + ], + "engines": { + "node": ">=12" + } + }, + "node_modules/escape-html": { + "version": "1.0.3", + "resolved": "https://registry.npmmirror.com/escape-html/-/escape-html-1.0.3.tgz", + "integrity": "sha512-NiSupZ4OeuGwr68lGIeym/ksIZMJodUGOSCZ/FSnTxcrekbvqrgdUxlJOMpijaKZVjAJrWrGs/6Jy8OMuyj9ow==", + "license": "MIT" + }, + "node_modules/estree-walker": { + "version": "2.0.2", + "resolved": "https://registry.npmmirror.com/estree-walker/-/estree-walker-2.0.2.tgz", + "integrity": "sha512-Rfkk/Mp/DL7JVje3u18FxFujQlTNR2q6QfMSMB7AvCBx91NGj/ba3kCfza0f6dVDbw7YlRf/nDrn7pQrCCyQ/w==", + "license": "MIT" + }, + "node_modules/fsevents": { + "version": "2.3.2", + "resolved": "https://registry.npmmirror.com/fsevents/-/fsevents-2.3.2.tgz", + "integrity": "sha512-xiqMQR4xAeHTuB9uWm+fFRcIOgKBMiOBP+eXiyT7jsgVCq1bkVygt00oASowB7EdtpOHaaPgKt812P9ab+DDKA==", + "dev": true, + "license": "MIT", + "optional": true, + "os": [ + "darwin" + ], + "engines": { + "node": "^8.16.0 || ^10.6.0 || >=11.0.0" + } + }, + "node_modules/function-bind": { + "version": "1.1.1", + "resolved": "https://registry.npmmirror.com/function-bind/-/function-bind-1.1.1.tgz", + "integrity": "sha512-yIovAzMX49sF8Yl58fSCWJ5svSLuaibPxXQJFLmBObTuCr0Mf1KiPopGM9NiFjiYBCbfaa2Fh6breQ6ANVTI0A==", + "dev": true, + "license": "MIT" + }, + "node_modules/graceful-fs": { + "version": "4.2.10", + "resolved": "https://registry.npmmirror.com/graceful-fs/-/graceful-fs-4.2.10.tgz", + "integrity": "sha512-9ByhssR2fPVsNZj478qUUbKfmL0+t5BDVyjShtyZZLiK7ZDAArFFfopyOTj0M05wE2tJPisA4iTnnXl2YoPvOA==", + "optional": true + }, + "node_modules/has": { + "version": "1.0.3", + "resolved": "https://registry.npmmirror.com/has/-/has-1.0.3.tgz", + "integrity": "sha512-f2dvO0VU6Oej7RkWJGrehjbzMAjFp5/VKPp5tTpWIV4JHHZK1/BxbFRtf/siA2SWTe09caDmVtYYzWEIbBS4zw==", + "dev": true, + "license": "MIT", + "dependencies": { + "function-bind": "^1.1.1" + }, + "engines": { + "node": ">= 0.4.0" + } + }, + "node_modules/iconv-lite": { + "version": "0.4.24", + "resolved": "https://registry.npmmirror.com/iconv-lite/-/iconv-lite-0.4.24.tgz", + "integrity": "sha512-v3MXnZAcvnywkTUEZomIActle7RXXeedOR31wwl7VlyoXO4Qi9arvSenNQWne1TcRwhCL1HwLI21bEqdpj8/rA==", + "optional": true, + "dependencies": { + "safer-buffer": ">= 2.1.2 < 3" + }, + "engines": { + "node": ">=0.10.0" + } + }, + "node_modules/image-size": { + "version": "0.5.5", + "resolved": "https://registry.npmmirror.com/image-size/-/image-size-0.5.5.tgz", + "integrity": "sha512-6TDAlDPZxUFCv+fuOkIoXT/V/f3Qbq8e37p+YOiYrUv3v9cc3/6x78VdfPgFVaB9dZYeLUfKgHRebpkm/oP2VQ==", + "optional": true, + "bin": { + "image-size": "bin/image-size.js" + }, + "engines": { + "node": ">=0.10.0" + } + }, + "node_modules/is-core-module": { + "version": "2.8.1", + "resolved": "https://registry.npmmirror.com/is-core-module/-/is-core-module-2.8.1.tgz", + "integrity": "sha512-SdNCUs284hr40hFTFP6l0IfZ/RSrMXF3qgoRHd3/79unUTvrFO/JoXwkGm+5J/Oe3E/b5GsnG330uUNgRpu1PA==", + "dev": true, + "license": 
"MIT", + "dependencies": { + "has": "^1.0.3" + }, + "funding": { + "url": "https://github.com/sponsors/ljharb" + } + }, + "node_modules/is-plain-object": { + "version": "3.0.1", + "resolved": "https://registry.npmmirror.com/is-plain-object/-/is-plain-object-3.0.1.tgz", + "integrity": "sha512-Xnpx182SBMrr/aBik8y+GuR4U1L9FqMSojwDQwPMmxyC6bvEqly9UBCxhauBF5vNh2gwWJNX6oDV7O+OM4z34g==", + "engines": { + "node": ">=0.10.0" + } + }, + "node_modules/is-what": { + "version": "3.14.1", + "resolved": "https://registry.npmmirror.com/is-what/-/is-what-3.14.1.tgz", + "integrity": "sha512-sNxgpk9793nzSs7bA6JQJGeIuRBQhAaNGG77kzYQgMkrID+lS6SlK07K5LaptscDlSaIgH+GPFzf+d75FVxozA==" + }, + "node_modules/js-audio-recorder": { + "version": "0.5.7", + "resolved": "https://registry.npmmirror.com/js-audio-recorder/-/js-audio-recorder-0.5.7.tgz", + "integrity": "sha512-DIlv30N86AYHr7zGHN0O7V/3Rd8Q6SIJ/MBzVJaT9STWTdhF4E/8fxCX6ZMgRSv8xmx6fEqcFFNPoofmxJD4+A==", + "license": "MIT" + }, + "node_modules/js-tokens": { + "version": "4.0.0", + "resolved": "https://registry.npmmirror.com/js-tokens/-/js-tokens-4.0.0.tgz", + "integrity": "sha512-RdJUflcE3cUzKiMqQgsCu06FPu9UdIJO0beYbPhHN4k6apgJtifcoCtT9bcxOpYBtpD2kCM6Sbzg4CausW/PKQ==" + }, + "node_modules/lamejs": { + "version": "1.2.1", + "resolved": "https://registry.npmmirror.com/lamejs/-/lamejs-1.2.1.tgz", + "integrity": "sha512-s7bxvjvYthw6oPLCm5pFxvA84wUROODB8jEO2+CE1adhKgrIvVOlmMgY8zyugxGrvRaDHNJanOiS21/emty6dQ==", + "license": "LGPL-3.0", + "dependencies": { + "use-strict": "1.0.1" + } + }, + "node_modules/less": { + "version": "4.1.2", + "resolved": "https://registry.npmmirror.com/less/-/less-4.1.2.tgz", + "integrity": "sha512-EoQp/Et7OSOVu0aJknJOtlXZsnr8XE8KwuzTHOLeVSEx8pVWUICc8Q0VYRHgzyjX78nMEyC/oztWFbgyhtNfDA==", + "dependencies": { + "copy-anything": "^2.0.1", + "parse-node-version": "^1.0.1", + "tslib": "^2.3.0" + }, + "bin": { + "lessc": "bin/lessc" + }, + "engines": { + "node": ">=6" + }, + "optionalDependencies": { + "errno": "^0.1.1", + "graceful-fs": "^4.1.2", + "image-size": "~0.5.0", + "make-dir": "^2.1.0", + "mime": "^1.4.1", + "needle": "^2.5.2", + "source-map": "~0.6.0" + } + }, + "node_modules/lodash": { + "version": "4.17.21", + "resolved": "https://registry.npmmirror.com/lodash/-/lodash-4.17.21.tgz", + "integrity": "sha512-v2kDEe57lecTulaDIuNTPy3Ry4gLGJ6Z1O3vE1krgXZNrsQ+LFTGHVxVjcXPs17LhbZVGedAJv8XZ1tvj5FvSg==", + "license": "MIT" + }, + "node_modules/lodash-es": { + "version": "4.17.21", + "resolved": "https://registry.npmmirror.com/lodash-es/-/lodash-es-4.17.21.tgz", + "integrity": "sha512-mKnC+QJ9pWVzv+C4/U3rRsHapFfHvQFoFB92e52xeyGMcX6/OlIl78je1u8vePzYZSkkogMPJ2yjxxsb89cxyw==", + "license": "MIT" + }, + "node_modules/lodash-unified": { + "version": "1.0.2", + "resolved": "https://registry.npmmirror.com/lodash-unified/-/lodash-unified-1.0.2.tgz", + "integrity": "sha512-OGbEy+1P+UT26CYi4opY4gebD8cWRDxAT6MAObIVQMiqYdxZr1g3QHWCToVsm31x2NkLS4K3+MC2qInaRMa39g==", + "license": "MIT", + "peerDependencies": { + "@types/lodash-es": "*", + "lodash": "*", + "lodash-es": "*" + } + }, + "node_modules/loose-envify": { + "version": "1.4.0", + "resolved": "https://registry.npmmirror.com/loose-envify/-/loose-envify-1.4.0.tgz", + "integrity": "sha512-lyuxPGr/Wfhrlem2CL/UcnUc1zcqKAImBDzukY7Y5F/yQiNdko6+fRLevlw1HgMySw7f611UIY408EtxRSoK3Q==", + "dependencies": { + "js-tokens": "^3.0.0 || ^4.0.0" + }, + "bin": { + "loose-envify": "cli.js" + } + }, + "node_modules/magic-string": { + "version": "0.25.9", + "resolved": 
"https://registry.npmmirror.com/magic-string/-/magic-string-0.25.9.tgz", + "integrity": "sha512-RmF0AsMzgt25qzqqLc1+MbHmhdx0ojF2Fvs4XnOqz2ZOBXzzkEwc/dJQZCYHAn7v1jbVOjAZfK8msRn4BxO4VQ==", + "license": "MIT", + "dependencies": { + "sourcemap-codec": "^1.4.8" + } + }, + "node_modules/make-dir": { + "version": "2.1.0", + "resolved": "https://registry.npmmirror.com/make-dir/-/make-dir-2.1.0.tgz", + "integrity": "sha512-LS9X+dc8KLxXCb8dni79fLIIUA5VyZoyjSMCwTluaXA0o27cCK0bhXkpgw+sTXVpPy/lSO57ilRixqk0vDmtRA==", + "optional": true, + "dependencies": { + "pify": "^4.0.1", + "semver": "^5.6.0" + }, + "engines": { + "node": ">=6" + } + }, + "node_modules/memoize-one": { + "version": "6.0.0", + "resolved": "https://registry.npmmirror.com/memoize-one/-/memoize-one-6.0.0.tgz", + "integrity": "sha512-rkpe71W0N0c0Xz6QD0eJETuWAJGnJ9afsl1srmwPrI+yBCkge5EycXXbYRyvL29zZVUWQCY7InPRCv3GDXuZNw==", + "license": "MIT" + }, + "node_modules/mime": { + "version": "1.6.0", + "resolved": "https://registry.npmmirror.com/mime/-/mime-1.6.0.tgz", + "integrity": "sha512-x0Vn8spI+wuJ1O6S7gnbaQg8Pxh4NNHb7KSINmEWKiPE4RKOplvijn+NkmYmmRgP68mc70j2EbeTFRsrswaQeg==", + "optional": true, + "bin": { + "mime": "cli.js" + }, + "engines": { + "node": ">=4" + } + }, + "node_modules/moment": { + "version": "2.29.4", + "resolved": "https://registry.npmjs.org/moment/-/moment-2.29.4.tgz", + "integrity": "sha512-5LC9SOxjSc2HF6vO2CyuTDNivEdoz2IvyJJGj6X8DJ0eFyfszE0QiEd+iXmBvUP3WHxSjFH/vIsA0EN00cgr8w==", + "engines": { + "node": "*" + } + }, + "node_modules/nanoid": { + "version": "3.3.2", + "resolved": "https://registry.npmmirror.com/nanoid/-/nanoid-3.3.2.tgz", + "integrity": "sha512-CuHBogktKwpm5g2sRgv83jEy2ijFzBwMoYA60orPDR7ynsLijJDqgsi4RDGj3OJpy3Ieb+LYwiRmIOGyytgITA==", + "license": "MIT", + "bin": { + "nanoid": "bin/nanoid.cjs" + }, + "engines": { + "node": "^10 || ^12 || ^13.7 || ^14 || >=15.0.1" + } + }, + "node_modules/nanopop": { + "version": "2.1.0", + "resolved": "https://registry.npmmirror.com/nanopop/-/nanopop-2.1.0.tgz", + "integrity": "sha512-jGTwpFRexSH+fxappnGQtN9dspgE2ipa1aOjtR24igG0pv6JCxImIAmrLRHX+zUF5+1wtsFVbKyfP51kIGAVNw==" + }, + "node_modules/needle": { + "version": "2.9.1", + "resolved": "https://registry.npmmirror.com/needle/-/needle-2.9.1.tgz", + "integrity": "sha512-6R9fqJ5Zcmf+uYaFgdIHmLwNldn5HbK8L5ybn7Uz+ylX/rnOsSp1AHcvQSrCaFN+qNM1wpymHqD7mVasEOlHGQ==", + "optional": true, + "dependencies": { + "debug": "^3.2.6", + "iconv-lite": "^0.4.4", + "sax": "^1.2.4" + }, + "bin": { + "needle": "bin/needle" + }, + "engines": { + "node": ">= 4.4.x" + } + }, + "node_modules/needle/node_modules/debug": { + "version": "3.2.7", + "resolved": "https://registry.npmmirror.com/debug/-/debug-3.2.7.tgz", + "integrity": "sha512-CFjzYYAi4ThfiQvizrFQevTTXHtnCqWfe7x1AhgEscTz6ZbLbfoLRLPugTQyBth6f8ZERVUSyWHFD/7Wu4t1XQ==", + "optional": true, + "dependencies": { + "ms": "^2.1.1" + } + }, + "node_modules/needle/node_modules/ms": { + "version": "2.1.3", + "resolved": "https://registry.npmmirror.com/ms/-/ms-2.1.3.tgz", + "integrity": "sha512-6FlzubTLZG3J2a/NVCAleEhjzq5oxgHyaCU9yYXvcLsvoVaHJq/s5xXI6/XXP6tz7R9xAOtHnSO/tXtF3WRTlA==", + "optional": true + }, + "node_modules/normalize-wheel-es": { + "version": "1.1.2", + "resolved": "https://registry.npmmirror.com/normalize-wheel-es/-/normalize-wheel-es-1.1.2.tgz", + "integrity": "sha512-scX83plWJXYH1J4+BhAuIHadROzxX0UBF3+HuZNY2Ks8BciE7tSTQ+5JhTsvzjaO0/EJdm4JBGrfObKxFf3Png==", + "license": "BSD-3-Clause" + }, + "node_modules/omit.js": { + "version": "2.0.2", + "resolved": 
"https://registry.npmmirror.com/omit.js/-/omit.js-2.0.2.tgz", + "integrity": "sha512-hJmu9D+bNB40YpL9jYebQl4lsTW6yEHRTroJzNLqQJYHm7c+NQnJGfZmIWh8S3q3KoaxV1aLhV6B3+0N0/kyJg==" + }, + "node_modules/parse-node-version": { + "version": "1.0.1", + "resolved": "https://registry.npmmirror.com/parse-node-version/-/parse-node-version-1.0.1.tgz", + "integrity": "sha512-3YHlOa/JgH6Mnpr05jP9eDG254US9ek25LyIxZlDItp2iJtwyaXQb57lBYLdT3MowkUFYEV2XXNAYIPlESvJlA==", + "engines": { + "node": ">= 0.10" + } + }, + "node_modules/path-parse": { + "version": "1.0.7", + "resolved": "https://registry.npmmirror.com/path-parse/-/path-parse-1.0.7.tgz", + "integrity": "sha512-LDJzPVEEEPR+y48z93A0Ed0yXb8pAByGWo/k5YYdYgpY2/2EsOsksJrq7lOHxryrVOn1ejG6oAp8ahvOIQD8sw==", + "dev": true, + "license": "MIT" + }, + "node_modules/picocolors": { + "version": "1.0.0", + "resolved": "https://registry.npmmirror.com/picocolors/-/picocolors-1.0.0.tgz", + "integrity": "sha512-1fygroTLlHu66zi26VoTDv8yRgm0Fccecssto+MhsZ0D/DGW2sm8E8AjW7NU5VVTRt5GxbeZ5qBuJr+HyLYkjQ==", + "license": "ISC" + }, + "node_modules/pify": { + "version": "4.0.1", + "resolved": "https://registry.npmmirror.com/pify/-/pify-4.0.1.tgz", + "integrity": "sha512-uB80kBFb/tfd68bVleG9T5GGsGPjJrLAUpR5PZIrhBnIaRTQRjqdJSsIKkOP6OAIFbj7GOrcudc5pNjZ+geV2g==", + "optional": true, + "engines": { + "node": ">=6" + } + }, + "node_modules/postcss": { + "version": "8.4.12", + "resolved": "https://registry.npmmirror.com/postcss/-/postcss-8.4.12.tgz", + "integrity": "sha512-lg6eITwYe9v6Hr5CncVbK70SoioNQIq81nsaG86ev5hAidQvmOeETBqs7jm43K2F5/Ley3ytDtriImV6TpNiSg==", + "funding": [ + { + "type": "opencollective", + "url": "https://opencollective.com/postcss/" + }, + { + "type": "tidelift", + "url": "https://tidelift.com/funding/github/npm/postcss" + } + ], + "license": "MIT", + "dependencies": { + "nanoid": "^3.3.1", + "picocolors": "^1.0.0", + "source-map-js": "^1.0.2" + }, + "engines": { + "node": "^10 || ^12 || >=14" + } + }, + "node_modules/prr": { + "version": "1.0.1", + "resolved": "https://registry.npmmirror.com/prr/-/prr-1.0.1.tgz", + "integrity": "sha512-yPw4Sng1gWghHQWj0B3ZggWUm4qVbPwPFcRG8KyxiU7J2OHFSoEHKS+EZ3fv5l1t9CyCiop6l/ZYeWbrgoQejw==", + "optional": true + }, + "node_modules/regenerator-runtime": { + "version": "0.13.9", + "resolved": "https://registry.npmmirror.com/regenerator-runtime/-/regenerator-runtime-0.13.9.tgz", + "integrity": "sha512-p3VT+cOEgxFsRRA9X4lkI1E+k2/CtnKtU4gcxyaCUreilL/vqI6CdZ3wxVUx3UOUg+gnUOQQcRI7BmSI656MYA==" + }, + "node_modules/resize-observer-polyfill": { + "version": "1.5.1", + "resolved": "https://registry.npmmirror.com/resize-observer-polyfill/-/resize-observer-polyfill-1.5.1.tgz", + "integrity": "sha512-LwZrotdHOo12nQuZlHEmtuXdqGoOD0OhaxopaNFxWzInpEgaLWoVuAMbTzixuosCx2nEG58ngzW3vxdWoxIgdg==" + }, + "node_modules/resolve": { + "version": "1.22.0", + "resolved": "https://registry.npmmirror.com/resolve/-/resolve-1.22.0.tgz", + "integrity": "sha512-Hhtrw0nLeSrFQ7phPp4OOcVjLPIeMnRlr5mcnVuMe7M/7eBn98A3hmFRLoFo3DLZkivSYwhRUJTyPyWAk56WLw==", + "dev": true, + "license": "MIT", + "dependencies": { + "is-core-module": "^2.8.1", + "path-parse": "^1.0.7", + "supports-preserve-symlinks-flag": "^1.0.0" + }, + "bin": { + "resolve": "bin/resolve" + }, + "funding": { + "url": "https://github.com/sponsors/ljharb" + } + }, + "node_modules/rollup": { + "version": "2.70.1", + "resolved": "https://registry.npmmirror.com/rollup/-/rollup-2.70.1.tgz", + "integrity": "sha512-CRYsI5EuzLbXdxC6RnYhOuRdtz4bhejPMSWjsFLfVM/7w/85n2szZv6yExqUXsBdz5KT8eoubeyDUDjhLHEslA==", + "dev": 
true, + "license": "MIT", + "bin": { + "rollup": "dist/bin/rollup" + }, + "engines": { + "node": ">=10.0.0" + }, + "optionalDependencies": { + "fsevents": "~2.3.2" + } + }, + "node_modules/safer-buffer": { + "version": "2.1.2", + "resolved": "https://registry.npmmirror.com/safer-buffer/-/safer-buffer-2.1.2.tgz", + "integrity": "sha512-YZo3K82SD7Riyi0E1EQPojLz7kpepnSQI9IyPbHHg1XXXevb5dJI7tpyN2ADxGcQbHG7vcyRHk0cbwqcQriUtg==", + "optional": true + }, + "node_modules/sax": { + "version": "1.2.4", + "resolved": "https://registry.npmmirror.com/sax/-/sax-1.2.4.tgz", + "integrity": "sha512-NqVDv9TpANUjFm0N8uM5GxL36UgKi9/atZw+x7YFnQ8ckwFGKrl4xX4yWtrey3UJm5nP1kUbnYgLopqWNSRhWw==", + "optional": true + }, + "node_modules/scroll-into-view-if-needed": { + "version": "2.2.29", + "resolved": "https://registry.npmmirror.com/scroll-into-view-if-needed/-/scroll-into-view-if-needed-2.2.29.tgz", + "integrity": "sha512-hxpAR6AN+Gh53AdAimHM6C8oTN1ppwVZITihix+WqalywBeFcQ6LdQP5ABNl26nX8GTEL7VT+b8lKpdqq65wXg==", + "dependencies": { + "compute-scroll-into-view": "^1.0.17" + } + }, + "node_modules/semver": { + "version": "5.7.1", + "resolved": "https://registry.npmmirror.com/semver/-/semver-5.7.1.tgz", + "integrity": "sha512-sauaDf/PZdVgrLTNYHRtpXa1iRiKcaebiKQ1BJdpQlWH2lCvexQdX55snPFyK7QzpudqbCI0qXFfOasHdyNDGQ==", + "optional": true, + "bin": { + "semver": "bin/semver" + } + }, + "node_modules/shallow-equal": { + "version": "1.2.1", + "resolved": "https://registry.npmmirror.com/shallow-equal/-/shallow-equal-1.2.1.tgz", + "integrity": "sha512-S4vJDjHHMBaiZuT9NPb616CSmLf618jawtv3sufLl6ivK8WocjAo58cXwbRV1cgqxH0Qbv+iUt6m05eqEa2IRA==" + }, + "node_modules/source-map": { + "version": "0.6.1", + "resolved": "https://registry.npmmirror.com/source-map/-/source-map-0.6.1.tgz", + "integrity": "sha512-UjgapumWlbMhkBgzT7Ykc5YXUT46F0iKu8SGXq0bcwP5dz/h0Plj6enJqjz1Zbq2l5WaqYnrVbwWOWMyF3F47g==", + "license": "BSD-3-Clause", + "engines": { + "node": ">=0.10.0" + } + }, + "node_modules/source-map-js": { + "version": "1.0.2", + "resolved": "https://registry.npmmirror.com/source-map-js/-/source-map-js-1.0.2.tgz", + "integrity": "sha512-R0XvVJ9WusLiqTCEiGCmICCMplcCkIwwR11mOSD9CR5u+IXYdiseeEuXCVAjS54zqwkLcPNnmU4OeJ6tUrWhDw==", + "license": "BSD-3-Clause", + "engines": { + "node": ">=0.10.0" + } + }, + "node_modules/sourcemap-codec": { + "version": "1.4.8", + "resolved": "https://registry.npmmirror.com/sourcemap-codec/-/sourcemap-codec-1.4.8.tgz", + "integrity": "sha512-9NykojV5Uih4lgo5So5dtw+f0JgJX30KCNI8gwhz2J9A15wD0Ml6tjHKwf6fTSa6fAdVBdZeNOs9eJ71qCk8vA==", + "license": "MIT" + }, + "node_modules/supports-preserve-symlinks-flag": { + "version": "1.0.0", + "resolved": "https://registry.npmmirror.com/supports-preserve-symlinks-flag/-/supports-preserve-symlinks-flag-1.0.0.tgz", + "integrity": "sha512-ot0WnXS9fgdkgIcePe6RHNk1WA8+muPa6cSjeR3V8K27q9BB1rTE3R1p7Hv0z1ZyAc8s6Vvv8DIyWf681MAt0w==", + "dev": true, + "license": "MIT", + "engines": { + "node": ">= 0.4" + }, + "funding": { + "url": "https://github.com/sponsors/ljharb" + } + }, + "node_modules/tslib": { + "version": "2.4.0", + "resolved": "https://registry.npmmirror.com/tslib/-/tslib-2.4.0.tgz", + "integrity": "sha512-d6xOpEDfsi2CZVlPQzGeux8XMwLT9hssAsaPYExaQMuYskwb+x1x7J371tWlbBdWHroy99KnVB6qIkUbs5X3UQ==" + }, + "node_modules/use-strict": { + "version": "1.0.1", + "resolved": "https://registry.npmmirror.com/use-strict/-/use-strict-1.0.1.tgz", + "integrity": "sha512-IeiWvvEXfW5ltKVMkxq6FvNf2LojMKvB2OCeja6+ct24S1XOmQw2dGr2JyndwACWAGJva9B7yPHwAmeA9QCqAQ==", + "license": "ISC" + }, + 
"node_modules/vite": { + "version": "2.9.1", + "resolved": "https://registry.npmmirror.com/vite/-/vite-2.9.1.tgz", + "integrity": "sha512-vSlsSdOYGcYEJfkQ/NeLXgnRv5zZfpAsdztkIrs7AZHV8RCMZQkwjo4DS5BnrYTqoWqLoUe1Cah4aVO4oNNqCQ==", + "dev": true, + "license": "MIT", + "dependencies": { + "esbuild": "^0.14.27", + "postcss": "^8.4.12", + "resolve": "^1.22.0", + "rollup": "^2.59.0" + }, + "bin": { + "vite": "bin/vite.js" + }, + "engines": { + "node": ">=12.2.0" + }, + "optionalDependencies": { + "fsevents": "~2.3.2" + }, + "peerDependencies": { + "less": "*", + "sass": "*", + "stylus": "*" + }, + "peerDependenciesMeta": { + "less": { + "optional": true + }, + "sass": { + "optional": true + }, + "stylus": { + "optional": true + } + } + }, + "node_modules/vue": { + "version": "3.2.32", + "resolved": "https://registry.npmmirror.com/vue/-/vue-3.2.32.tgz", + "integrity": "sha512-6L3jKZApF042OgbCkh+HcFeAkiYi3Lovi8wNhWqIK98Pi5efAMLZzRHgi91v+60oIRxdJsGS9sTMsb+yDpY8Eg==", + "license": "MIT", + "dependencies": { + "@vue/compiler-dom": "3.2.32", + "@vue/compiler-sfc": "3.2.32", + "@vue/runtime-dom": "3.2.32", + "@vue/server-renderer": "3.2.32", + "@vue/shared": "3.2.32" + } + }, + "node_modules/vue-demi": { + "version": "0.12.5", + "resolved": "https://registry.npmmirror.com/vue-demi/-/vue-demi-0.12.5.tgz", + "integrity": "sha512-BREuTgTYlUr0zw0EZn3hnhC3I6gPWv+Kwh4MCih6QcAeaTlaIX0DwOVN0wHej7hSvDPecz4jygy/idsgKfW58Q==", + "hasInstallScript": true, + "license": "MIT", + "bin": { + "vue-demi-fix": "bin/vue-demi-fix.js", + "vue-demi-switch": "bin/vue-demi-switch.js" + }, + "engines": { + "node": ">=12" + }, + "funding": { + "url": "https://github.com/sponsors/antfu" + }, + "peerDependencies": { + "@vue/composition-api": "^1.0.0-rc.1", + "vue": "^3.0.0-0 || ^2.6.0" + }, + "peerDependenciesMeta": { + "@vue/composition-api": { + "optional": true + } + } + }, + "node_modules/vue-types": { + "version": "3.0.2", + "resolved": "https://registry.npmmirror.com/vue-types/-/vue-types-3.0.2.tgz", + "integrity": "sha512-IwUC0Aq2zwaXqy74h4WCvFCUtoV0iSWr0snWnE9TnU18S66GAQyqQbRf2qfJtUuiFsBf6qp0MEwdonlwznlcrw==", + "dependencies": { + "is-plain-object": "3.0.1" + }, + "engines": { + "node": ">=10.15.0" + }, + "peerDependencies": { + "vue": "^3.0.0" + } + }, + "node_modules/warning": { + "version": "4.0.3", + "resolved": "https://registry.npmmirror.com/warning/-/warning-4.0.3.tgz", + "integrity": "sha512-rpJyN222KWIvHJ/F53XSZv0Zl/accqHR8et1kpaMTD/fLCRxtV8iX8czMzY7sVZupTI3zcUTg8eycS2kNF9l6w==", + "dependencies": { + "loose-envify": "^1.0.0" + } + } + }, + "dependencies": { + "@ant-design/colors": { + "version": "6.0.0", + "resolved": "https://registry.npmmirror.com/@ant-design/colors/-/colors-6.0.0.tgz", + "integrity": "sha512-qAZRvPzfdWHtfameEGP2Qvuf838NhergR35o+EuVyB5XvSA98xod5r4utvi4TJ3ywmevm290g9nsCG5MryrdWQ==", + "requires": { + "@ctrl/tinycolor": "^3.4.0" + } + }, + "@ant-design/icons-svg": { + "version": "4.2.1", + "resolved": "https://registry.npmmirror.com/@ant-design/icons-svg/-/icons-svg-4.2.1.tgz", + "integrity": "sha512-EB0iwlKDGpG93hW8f85CTJTs4SvMX7tt5ceupvhALp1IF44SeUFOMhKUOYqpsoYWQKAOuTRDMqn75rEaKDp0Xw==" + }, + "@ant-design/icons-vue": { + "version": "6.1.0", + "resolved": "https://registry.npmmirror.com/@ant-design/icons-vue/-/icons-vue-6.1.0.tgz", + "integrity": "sha512-EX6bYm56V+ZrKN7+3MT/ubDkvJ5rK/O2t380WFRflDcVFgsvl3NLH7Wxeau6R8DbrO5jWR6DSTC3B6gYFp77AA==", + "requires": { + "@ant-design/colors": "^6.0.0", + "@ant-design/icons-svg": "^4.2.1" + } + }, + "@babel/parser": { + "version": "7.17.9", + "resolved": 
"https://registry.npmmirror.com/@babel/parser/-/parser-7.17.9.tgz", + "integrity": "sha512-vqUSBLP8dQHFPdPi9bc5GK9vRkYHJ49fsZdtoJ8EQ8ibpwk5rPKfvNIwChB0KVXcIjcepEBBd2VHC5r9Gy8ueg==" + }, + "@babel/runtime": { + "version": "7.17.9", + "resolved": "https://registry.npmmirror.com/@babel/runtime/-/runtime-7.17.9.tgz", + "integrity": "sha512-lSiBBvodq29uShpWGNbgFdKYNiFDo5/HIYsaCEY9ff4sb10x9jizo2+pRrSyF4jKZCXqgzuqBOQKbUm90gQwJg==", + "requires": { + "regenerator-runtime": "^0.13.4" + } + }, + "@ctrl/tinycolor": { + "version": "3.4.1", + "resolved": "https://registry.npmmirror.com/@ctrl/tinycolor/-/tinycolor-3.4.1.tgz", + "integrity": "sha512-ej5oVy6lykXsvieQtqZxCOaLT+xD4+QNarq78cIYISHmZXshCvROLudpQN3lfL8G0NL7plMSSK+zlyvCaIJ4Iw==" + }, + "@element-plus/icons-vue": { + "version": "1.1.4", + "resolved": "https://registry.npmmirror.com/@element-plus/icons-vue/-/icons-vue-1.1.4.tgz", + "integrity": "sha512-Iz/nHqdp1sFPmdzRwHkEQQA3lKvoObk8azgABZ81QUOpW9s/lUyQVUSh0tNtEPZXQlKwlSh7SPgoVxzrE0uuVQ==", + "requires": {} + }, + "@floating-ui/core": { + "version": "0.6.1", + "resolved": "https://registry.npmmirror.com/@floating-ui/core/-/core-0.6.1.tgz", + "integrity": "sha512-Y30eVMcZva8o84c0HcXAtDO4BEzPJMvF6+B7x7urL2xbAqVsGJhojOyHLaoQHQYjb6OkqRq5kO+zeySycQwKqg==" + }, + "@floating-ui/dom": { + "version": "0.4.4", + "resolved": "https://registry.npmmirror.com/@floating-ui/dom/-/dom-0.4.4.tgz", + "integrity": "sha512-0Ulu3B/dqQplUUSqnTx0foSrlYuMN+GTtlJWvNJwt6Fr7/PqmlR/Y08o6/+bxDWr6p3roBJRaQ51MDZsNmEhhw==", + "requires": { + "@floating-ui/core": "^0.6.1" + } + }, + "@popperjs/core": { + "version": "2.11.5", + "resolved": "https://registry.npmmirror.com/@popperjs/core/-/core-2.11.5.tgz", + "integrity": "sha512-9X2obfABZuDVLCgPK9aX0a/x4jaOEweTTWE2+9sr0Qqqevj2Uv5XorvusThmc9XGYpS9yI+fhh8RTafBtGposw==" + }, + "@simonwep/pickr": { + "version": "1.8.2", + "resolved": "https://registry.npmmirror.com/@simonwep/pickr/-/pickr-1.8.2.tgz", + "integrity": "sha512-/l5w8BIkrpP6n1xsetx9MWPWlU6OblN5YgZZphxan0Tq4BByTCETL6lyIeY8lagalS2Nbt4F2W034KHLIiunKA==", + "requires": { + "core-js": "^3.15.1", + "nanopop": "^2.1.0" + } + }, + "@types/lodash": { + "version": "4.14.181", + "resolved": "https://registry.npmmirror.com/@types/lodash/-/lodash-4.14.181.tgz", + "integrity": "sha512-n3tyKthHJbkiWhDZs3DkhkCzt2MexYHXlX0td5iMplyfwketaOeKboEVBqzceH7juqvEg3q5oUoBFxSLu7zFag==" + }, + "@types/lodash-es": { + "version": "4.17.6", + "resolved": "https://registry.npmmirror.com/@types/lodash-es/-/lodash-es-4.17.6.tgz", + "integrity": "sha512-R+zTeVUKDdfoRxpAryaQNRKk3105Rrgx2CFRClIgRGaqDTdjsm8h6IYA8ir584W3ePzkZfst5xIgDwYrlh9HLg==", + "requires": { + "@types/lodash": "*" + } + }, + "@vitejs/plugin-vue": { + "version": "2.3.1", + "resolved": "https://registry.npmmirror.com/@vitejs/plugin-vue/-/plugin-vue-2.3.1.tgz", + "integrity": "sha512-YNzBt8+jt6bSwpt7LP890U1UcTOIZZxfpE5WOJ638PNxSEKOqAi0+FSKS0nVeukfdZ0Ai/H7AFd6k3hayfGZqQ==", + "dev": true, + "requires": {} + }, + "@vue/compiler-core": { + "version": "3.2.32", + "resolved": "https://registry.npmmirror.com/@vue/compiler-core/-/compiler-core-3.2.32.tgz", + "integrity": "sha512-bRQ8Rkpm/aYFElDWtKkTPHeLnX5pEkNxhPUcqu5crEJIilZH0yeFu/qUAcV4VfSE2AudNPkQSOwMZofhnuutmA==", + "requires": { + "@babel/parser": "^7.16.4", + "@vue/shared": "3.2.32", + "estree-walker": "^2.0.2", + "source-map": "^0.6.1" + } + }, + "@vue/compiler-dom": { + "version": "3.2.32", + "resolved": "https://registry.npmmirror.com/@vue/compiler-dom/-/compiler-dom-3.2.32.tgz", + "integrity": 
"sha512-maa3PNB/NxR17h2hDQfcmS02o1f9r9QIpN1y6fe8tWPrS1E4+q8MqrvDDQNhYVPd84rc3ybtyumrgm9D5Rf/kg==", + "requires": { + "@vue/compiler-core": "3.2.32", + "@vue/shared": "3.2.32" + } + }, + "@vue/compiler-sfc": { + "version": "3.2.32", + "resolved": "https://registry.npmmirror.com/@vue/compiler-sfc/-/compiler-sfc-3.2.32.tgz", + "integrity": "sha512-uO6+Gh3AVdWm72lRRCjMr8nMOEqc6ezT9lWs5dPzh1E9TNaJkMYPaRtdY9flUv/fyVQotkfjY/ponjfR+trPSg==", + "requires": { + "@babel/parser": "^7.16.4", + "@vue/compiler-core": "3.2.32", + "@vue/compiler-dom": "3.2.32", + "@vue/compiler-ssr": "3.2.32", + "@vue/reactivity-transform": "3.2.32", + "@vue/shared": "3.2.32", + "estree-walker": "^2.0.2", + "magic-string": "^0.25.7", + "postcss": "^8.1.10", + "source-map": "^0.6.1" + } + }, + "@vue/compiler-ssr": { + "version": "3.2.32", + "resolved": "https://registry.npmmirror.com/@vue/compiler-ssr/-/compiler-ssr-3.2.32.tgz", + "integrity": "sha512-ZklVUF/SgTx6yrDUkaTaBL/JMVOtSocP+z5Xz/qIqqLdW/hWL90P+ob/jOQ0Xc/om57892Q7sRSrex0wujOL2Q==", + "requires": { + "@vue/compiler-dom": "3.2.32", + "@vue/shared": "3.2.32" + } + }, + "@vue/reactivity": { + "version": "3.2.32", + "resolved": "https://registry.npmmirror.com/@vue/reactivity/-/reactivity-3.2.32.tgz", + "integrity": "sha512-4zaDumuyDqkuhbb63hRd+YHFGopW7srFIWesLUQ2su/rJfWrSq3YUvoKAJE8Eu1EhZ2Q4c1NuwnEreKj1FkDxA==", + "requires": { + "@vue/shared": "3.2.32" + } + }, + "@vue/reactivity-transform": { + "version": "3.2.32", + "resolved": "https://registry.npmmirror.com/@vue/reactivity-transform/-/reactivity-transform-3.2.32.tgz", + "integrity": "sha512-CW1W9zaJtE275tZSWIfQKiPG0iHpdtSlmTqYBu7Y62qvtMgKG5yOxtvBs4RlrZHlaqFSE26avLAgQiTp4YHozw==", + "requires": { + "@babel/parser": "^7.16.4", + "@vue/compiler-core": "3.2.32", + "@vue/shared": "3.2.32", + "estree-walker": "^2.0.2", + "magic-string": "^0.25.7" + } + }, + "@vue/runtime-core": { + "version": "3.2.32", + "resolved": "https://registry.npmmirror.com/@vue/runtime-core/-/runtime-core-3.2.32.tgz", + "integrity": "sha512-uKKzK6LaCnbCJ7rcHvsK0azHLGpqs+Vi9B28CV1mfWVq1F3Bj8Okk3cX+5DtD06aUh4V2bYhS2UjjWiUUKUF0w==", + "requires": { + "@vue/reactivity": "3.2.32", + "@vue/shared": "3.2.32" + } + }, + "@vue/runtime-dom": { + "version": "3.2.32", + "resolved": "https://registry.npmmirror.com/@vue/runtime-dom/-/runtime-dom-3.2.32.tgz", + "integrity": "sha512-AmlIg+GPqjkNoADLjHojEX5RGcAg+TsgXOOcUrtDHwKvA8mO26EnLQLB8nylDjU6AMJh2CIYn8NEgyOV5ZIScQ==", + "requires": { + "@vue/runtime-core": "3.2.32", + "@vue/shared": "3.2.32", + "csstype": "^2.6.8" + } + }, + "@vue/server-renderer": { + "version": "3.2.32", + "resolved": "https://registry.npmmirror.com/@vue/server-renderer/-/server-renderer-3.2.32.tgz", + "integrity": "sha512-TYKpZZfRJpGTTiy/s6bVYwQJpAUx3G03z4G7/3O18M11oacrMTVHaHjiPuPqf3xQtY8R4LKmQ3EOT/DRCA/7Wg==", + "requires": { + "@vue/compiler-ssr": "3.2.32", + "@vue/shared": "3.2.32" + } + }, + "@vue/shared": { + "version": "3.2.32", + "resolved": "https://registry.npmmirror.com/@vue/shared/-/shared-3.2.32.tgz", + "integrity": "sha512-bjcixPErUsAnTQRQX4Z5IQnICYjIfNCyCl8p29v1M6kfVzvwOICPw+dz48nNuWlTOOx2RHhzHdazJibE8GSnsw==" + }, + "@vueuse/core": { + "version": "8.2.5", + "resolved": "https://registry.npmmirror.com/@vueuse/core/-/core-8.2.5.tgz", + "integrity": "sha512-5prZAA1Ji2ltwNUnzreu6WIXYqHYP/9U2BiY5mD/650VYLpVcwVlYznJDFcLCmEWI3o3Vd34oS1FUf+6Mh68GQ==", + "requires": { + "@vueuse/metadata": "8.2.5", + "@vueuse/shared": "8.2.5", + "vue-demi": "*" + } + }, + "@vueuse/metadata": { + "version": "8.2.5", + "resolved": 
"https://registry.npmmirror.com/@vueuse/metadata/-/metadata-8.2.5.tgz", + "integrity": "sha512-Lk9plJjh9cIdiRdcj16dau+2LANxIdFCiTgdfzwYXbflxq0QnMBeOD2qHgKDE7fuVrtPcVWj8VSuZEx1HRfNQA==" + }, + "@vueuse/shared": { + "version": "8.2.5", + "resolved": "https://registry.npmmirror.com/@vueuse/shared/-/shared-8.2.5.tgz", + "integrity": "sha512-lNWo+7sk6JCuOj4AiYM+6HZ6fq4xAuVq1sVckMQKgfCJZpZRe4i8es+ZULO5bYTKP+VrOCtqrLR2GzEfrbr3YQ==", + "requires": { + "vue-demi": "*" + } + }, + "ant-design-vue": { + "version": "2.2.8", + "resolved": "https://registry.npmmirror.com/ant-design-vue/-/ant-design-vue-2.2.8.tgz", + "integrity": "sha512-3graq9/gCfJQs6hznrHV6sa9oDmk/D1H3Oo0vLdVpPS/I61fZPk8NEyNKCHpNA6fT2cx6xx9U3QS63uuyikg/Q==", + "requires": { + "@ant-design/icons-vue": "^6.0.0", + "@babel/runtime": "^7.10.5", + "@simonwep/pickr": "~1.8.0", + "array-tree-filter": "^2.1.0", + "async-validator": "^3.3.0", + "dom-align": "^1.12.1", + "dom-scroll-into-view": "^2.0.0", + "lodash": "^4.17.21", + "lodash-es": "^4.17.15", + "moment": "^2.27.0", + "omit.js": "^2.0.0", + "resize-observer-polyfill": "^1.5.1", + "scroll-into-view-if-needed": "^2.2.25", + "shallow-equal": "^1.0.0", + "vue-types": "^3.0.0", + "warning": "^4.0.0" + }, + "dependencies": { + "async-validator": { + "version": "3.5.2", + "resolved": "https://registry.npmmirror.com/async-validator/-/async-validator-3.5.2.tgz", + "integrity": "sha512-8eLCg00W9pIRZSB781UUX/H6Oskmm8xloZfr09lz5bikRpBVDlJ3hRVuxxP1SxcwsEYfJ4IU8Q19Y8/893r3rQ==" + } + } + }, + "array-tree-filter": { + "version": "2.1.0", + "resolved": "https://registry.npmmirror.com/array-tree-filter/-/array-tree-filter-2.1.0.tgz", + "integrity": "sha512-4ROwICNlNw/Hqa9v+rk5h22KjmzB1JGTMVKP2AKJBOCgb0yL0ASf0+YvCcLNNwquOHNX48jkeZIJ3a+oOQqKcw==" + }, + "async-validator": { + "version": "4.0.7", + "resolved": "https://registry.npmmirror.com/async-validator/-/async-validator-4.0.7.tgz", + "integrity": "sha512-Pj2IR7u8hmUEDOwB++su6baaRi+QvsgajuFB9j95foM1N2gy5HM4z60hfusIO0fBPG5uLAEl6yCJr1jNSVugEQ==" + }, + "axios": { + "version": "0.26.1", + "resolved": "https://registry.npmmirror.com/axios/-/axios-0.26.1.tgz", + "integrity": "sha512-fPwcX4EvnSHuInCMItEhAGnaSEXRBjtzh9fOtsE6E1G6p7vl7edEeZe11QHf18+6+9gR5PbKV/sGKNaD8YaMeA==", + "requires": { + "follow-redirects": "^1.14.8" + }, + "dependencies": { + "follow-redirects": { + "version": "1.14.9", + "resolved": "https://registry.npmmirror.com/follow-redirects/-/follow-redirects-1.14.9.tgz", + "integrity": "sha512-MQDfihBQYMcyy5dhRDJUHcw7lb2Pv/TuE6xP1vyraLukNDHKbDxDNaOE3NbCAdKQApno+GPRyo1YAp89yCjK4w==" + } + } + }, + "compute-scroll-into-view": { + "version": "1.0.17", + "resolved": "https://registry.npmmirror.com/compute-scroll-into-view/-/compute-scroll-into-view-1.0.17.tgz", + "integrity": "sha512-j4dx+Fb0URmzbwwMUrhqWM2BEWHdFGx+qZ9qqASHRPqvTYdqvWnHg0H1hIbcyLnvgnoNAVMlwkepyqM3DaIFUg==" + }, + "copy-anything": { + "version": "2.0.6", + "resolved": "https://registry.npmmirror.com/copy-anything/-/copy-anything-2.0.6.tgz", + "integrity": "sha512-1j20GZTsvKNkc4BY3NpMOM8tt///wY3FpIzozTOFO2ffuZcV61nojHXVKIy3WM+7ADCy5FVhdZYHYDdgTU0yJw==", + "requires": { + "is-what": "^3.14.1" + } + }, + "core-js": { + "version": "3.22.5", + "resolved": "https://registry.npmmirror.com/core-js/-/core-js-3.22.5.tgz", + "integrity": "sha512-VP/xYuvJ0MJWRAobcmQ8F2H6Bsn+s7zqAAjFaHGBMc5AQm7zaelhD1LGduFn2EehEcQcU+br6t+fwbpQ5d1ZWA==" + }, + "csstype": { + "version": "2.6.20", + "resolved": "https://registry.npmmirror.com/csstype/-/csstype-2.6.20.tgz", + "integrity": 
"sha512-/WwNkdXfckNgw6S5R125rrW8ez139lBHWouiBvX8dfMFtcn6V81REDqnH7+CRpRipfYlyU1CmOnOxrmGcFOjeA==" + }, + "dayjs": { + "version": "1.11.0", + "resolved": "https://registry.npmmirror.com/dayjs/-/dayjs-1.11.0.tgz", + "integrity": "sha512-JLC809s6Y948/FuCZPm5IX8rRhQwOiyMb2TfVVQEixG7P8Lm/gt5S7yoQZmC8x1UehI9Pb7sksEt4xx14m+7Ug==" + }, + "dom-align": { + "version": "1.12.3", + "resolved": "https://registry.npmmirror.com/dom-align/-/dom-align-1.12.3.tgz", + "integrity": "sha512-Gj9hZN3a07cbR6zviMUBOMPdWxYhbMI+x+WS0NAIu2zFZmbK8ys9R79g+iG9qLnlCwpFoaB+fKy8Pdv470GsPA==" + }, + "dom-scroll-into-view": { + "version": "2.0.1", + "resolved": "https://registry.npmmirror.com/dom-scroll-into-view/-/dom-scroll-into-view-2.0.1.tgz", + "integrity": "sha512-bvVTQe1lfaUr1oFzZX80ce9KLDlZ3iU+XGNE/bz9HnGdklTieqsbmsLHe+rT2XWqopvL0PckkYqN7ksmm5pe3w==" + }, + "element-plus": { + "version": "2.1.9", + "resolved": "https://registry.npmmirror.com/element-plus/-/element-plus-2.1.9.tgz", + "integrity": "sha512-6mWqS3YrmJPnouWP4otzL8+MehfOnDFqDbcIdnmC07p+Z0JkWe/CVKc4Wky8AYC8nyDMUQyiZYvooCbqGuM7pg==", + "requires": { + "@ctrl/tinycolor": "^3.4.0", + "@element-plus/icons-vue": "^1.1.4", + "@floating-ui/dom": "^0.4.2", + "@popperjs/core": "^2.11.4", + "@types/lodash": "^4.14.181", + "@types/lodash-es": "^4.17.6", + "@vueuse/core": "^8.2.4", + "async-validator": "^4.0.7", + "dayjs": "^1.11.0", + "escape-html": "^1.0.3", + "lodash": "^4.17.21", + "lodash-es": "^4.17.21", + "lodash-unified": "^1.0.2", + "memoize-one": "^6.0.0", + "normalize-wheel-es": "^1.1.2" + } + }, + "errno": { + "version": "0.1.8", + "resolved": "https://registry.npmmirror.com/errno/-/errno-0.1.8.tgz", + "integrity": "sha512-dJ6oBr5SQ1VSd9qkk7ByRgb/1SH4JZjCHSW/mr63/QcXO9zLVxvJ6Oy13nio03rxpSnVDDjFor75SjVeZWPW/A==", + "optional": true, + "requires": { + "prr": "~1.0.1" + } + }, + "esbuild": { + "version": "0.14.36", + "resolved": "https://registry.npmmirror.com/esbuild/-/esbuild-0.14.36.tgz", + "integrity": "sha512-HhFHPiRXGYOCRlrhpiVDYKcFJRdO0sBElZ668M4lh2ER0YgnkLxECuFe7uWCf23FrcLc59Pqr7dHkTqmRPDHmw==", + "dev": true, + "requires": { + "esbuild-android-64": "0.14.36", + "esbuild-android-arm64": "0.14.36", + "esbuild-darwin-64": "0.14.36", + "esbuild-darwin-arm64": "0.14.36", + "esbuild-freebsd-64": "0.14.36", + "esbuild-freebsd-arm64": "0.14.36", + "esbuild-linux-32": "0.14.36", + "esbuild-linux-64": "0.14.36", + "esbuild-linux-arm": "0.14.36", + "esbuild-linux-arm64": "0.14.36", + "esbuild-linux-mips64le": "0.14.36", + "esbuild-linux-ppc64le": "0.14.36", + "esbuild-linux-riscv64": "0.14.36", + "esbuild-linux-s390x": "0.14.36", + "esbuild-netbsd-64": "0.14.36", + "esbuild-openbsd-64": "0.14.36", + "esbuild-sunos-64": "0.14.36", + "esbuild-windows-32": "0.14.36", + "esbuild-windows-64": "0.14.36", + "esbuild-windows-arm64": "0.14.36" + } + }, + "esbuild-darwin-64": { + "version": "0.14.36", + "resolved": "https://registry.npmmirror.com/esbuild-darwin-64/-/esbuild-darwin-64-0.14.36.tgz", + "integrity": "sha512-kkl6qmV0dTpyIMKagluzYqlc1vO0ecgpviK/7jwPbRDEv5fejRTaBBEE2KxEQbTHcLhiiDbhG7d5UybZWo/1zQ==", + "dev": true, + "optional": true + }, + "escape-html": { + "version": "1.0.3", + "resolved": "https://registry.npmmirror.com/escape-html/-/escape-html-1.0.3.tgz", + "integrity": "sha512-NiSupZ4OeuGwr68lGIeym/ksIZMJodUGOSCZ/FSnTxcrekbvqrgdUxlJOMpijaKZVjAJrWrGs/6Jy8OMuyj9ow==" + }, + "estree-walker": { + "version": "2.0.2", + "resolved": "https://registry.npmmirror.com/estree-walker/-/estree-walker-2.0.2.tgz", + "integrity": 
"sha512-Rfkk/Mp/DL7JVje3u18FxFujQlTNR2q6QfMSMB7AvCBx91NGj/ba3kCfza0f6dVDbw7YlRf/nDrn7pQrCCyQ/w==" + }, + "fsevents": { + "version": "2.3.2", + "resolved": "https://registry.npmmirror.com/fsevents/-/fsevents-2.3.2.tgz", + "integrity": "sha512-xiqMQR4xAeHTuB9uWm+fFRcIOgKBMiOBP+eXiyT7jsgVCq1bkVygt00oASowB7EdtpOHaaPgKt812P9ab+DDKA==", + "dev": true, + "optional": true + }, + "function-bind": { + "version": "1.1.1", + "resolved": "https://registry.npmmirror.com/function-bind/-/function-bind-1.1.1.tgz", + "integrity": "sha512-yIovAzMX49sF8Yl58fSCWJ5svSLuaibPxXQJFLmBObTuCr0Mf1KiPopGM9NiFjiYBCbfaa2Fh6breQ6ANVTI0A==", + "dev": true + }, + "graceful-fs": { + "version": "4.2.10", + "resolved": "https://registry.npmmirror.com/graceful-fs/-/graceful-fs-4.2.10.tgz", + "integrity": "sha512-9ByhssR2fPVsNZj478qUUbKfmL0+t5BDVyjShtyZZLiK7ZDAArFFfopyOTj0M05wE2tJPisA4iTnnXl2YoPvOA==", + "optional": true + }, + "has": { + "version": "1.0.3", + "resolved": "https://registry.npmmirror.com/has/-/has-1.0.3.tgz", + "integrity": "sha512-f2dvO0VU6Oej7RkWJGrehjbzMAjFp5/VKPp5tTpWIV4JHHZK1/BxbFRtf/siA2SWTe09caDmVtYYzWEIbBS4zw==", + "dev": true, + "requires": { + "function-bind": "^1.1.1" + } + }, + "iconv-lite": { + "version": "0.4.24", + "resolved": "https://registry.npmmirror.com/iconv-lite/-/iconv-lite-0.4.24.tgz", + "integrity": "sha512-v3MXnZAcvnywkTUEZomIActle7RXXeedOR31wwl7VlyoXO4Qi9arvSenNQWne1TcRwhCL1HwLI21bEqdpj8/rA==", + "optional": true, + "requires": { + "safer-buffer": ">= 2.1.2 < 3" + } + }, + "image-size": { + "version": "0.5.5", + "resolved": "https://registry.npmmirror.com/image-size/-/image-size-0.5.5.tgz", + "integrity": "sha512-6TDAlDPZxUFCv+fuOkIoXT/V/f3Qbq8e37p+YOiYrUv3v9cc3/6x78VdfPgFVaB9dZYeLUfKgHRebpkm/oP2VQ==", + "optional": true + }, + "is-core-module": { + "version": "2.8.1", + "resolved": "https://registry.npmmirror.com/is-core-module/-/is-core-module-2.8.1.tgz", + "integrity": "sha512-SdNCUs284hr40hFTFP6l0IfZ/RSrMXF3qgoRHd3/79unUTvrFO/JoXwkGm+5J/Oe3E/b5GsnG330uUNgRpu1PA==", + "dev": true, + "requires": { + "has": "^1.0.3" + } + }, + "is-plain-object": { + "version": "3.0.1", + "resolved": "https://registry.npmmirror.com/is-plain-object/-/is-plain-object-3.0.1.tgz", + "integrity": "sha512-Xnpx182SBMrr/aBik8y+GuR4U1L9FqMSojwDQwPMmxyC6bvEqly9UBCxhauBF5vNh2gwWJNX6oDV7O+OM4z34g==" + }, + "is-what": { + "version": "3.14.1", + "resolved": "https://registry.npmmirror.com/is-what/-/is-what-3.14.1.tgz", + "integrity": "sha512-sNxgpk9793nzSs7bA6JQJGeIuRBQhAaNGG77kzYQgMkrID+lS6SlK07K5LaptscDlSaIgH+GPFzf+d75FVxozA==" + }, + "js-audio-recorder": { + "version": "0.5.7", + "resolved": "https://registry.npmmirror.com/js-audio-recorder/-/js-audio-recorder-0.5.7.tgz", + "integrity": "sha512-DIlv30N86AYHr7zGHN0O7V/3Rd8Q6SIJ/MBzVJaT9STWTdhF4E/8fxCX6ZMgRSv8xmx6fEqcFFNPoofmxJD4+A==" + }, + "js-tokens": { + "version": "4.0.0", + "resolved": "https://registry.npmmirror.com/js-tokens/-/js-tokens-4.0.0.tgz", + "integrity": "sha512-RdJUflcE3cUzKiMqQgsCu06FPu9UdIJO0beYbPhHN4k6apgJtifcoCtT9bcxOpYBtpD2kCM6Sbzg4CausW/PKQ==" + }, + "lamejs": { + "version": "1.2.1", + "resolved": "https://registry.npmmirror.com/lamejs/-/lamejs-1.2.1.tgz", + "integrity": "sha512-s7bxvjvYthw6oPLCm5pFxvA84wUROODB8jEO2+CE1adhKgrIvVOlmMgY8zyugxGrvRaDHNJanOiS21/emty6dQ==", + "requires": { + "use-strict": "1.0.1" + } + }, + "less": { + "version": "4.1.2", + "resolved": "https://registry.npmmirror.com/less/-/less-4.1.2.tgz", + "integrity": "sha512-EoQp/Et7OSOVu0aJknJOtlXZsnr8XE8KwuzTHOLeVSEx8pVWUICc8Q0VYRHgzyjX78nMEyC/oztWFbgyhtNfDA==", + 
"requires": { + "copy-anything": "^2.0.1", + "errno": "^0.1.1", + "graceful-fs": "^4.1.2", + "image-size": "~0.5.0", + "make-dir": "^2.1.0", + "mime": "^1.4.1", + "needle": "^2.5.2", + "parse-node-version": "^1.0.1", + "source-map": "~0.6.0", + "tslib": "^2.3.0" + } + }, + "lodash": { + "version": "4.17.21", + "resolved": "https://registry.npmmirror.com/lodash/-/lodash-4.17.21.tgz", + "integrity": "sha512-v2kDEe57lecTulaDIuNTPy3Ry4gLGJ6Z1O3vE1krgXZNrsQ+LFTGHVxVjcXPs17LhbZVGedAJv8XZ1tvj5FvSg==" + }, + "lodash-es": { + "version": "4.17.21", + "resolved": "https://registry.npmmirror.com/lodash-es/-/lodash-es-4.17.21.tgz", + "integrity": "sha512-mKnC+QJ9pWVzv+C4/U3rRsHapFfHvQFoFB92e52xeyGMcX6/OlIl78je1u8vePzYZSkkogMPJ2yjxxsb89cxyw==" + }, + "lodash-unified": { + "version": "1.0.2", + "resolved": "https://registry.npmmirror.com/lodash-unified/-/lodash-unified-1.0.2.tgz", + "integrity": "sha512-OGbEy+1P+UT26CYi4opY4gebD8cWRDxAT6MAObIVQMiqYdxZr1g3QHWCToVsm31x2NkLS4K3+MC2qInaRMa39g==", + "requires": {} + }, + "loose-envify": { + "version": "1.4.0", + "resolved": "https://registry.npmmirror.com/loose-envify/-/loose-envify-1.4.0.tgz", + "integrity": "sha512-lyuxPGr/Wfhrlem2CL/UcnUc1zcqKAImBDzukY7Y5F/yQiNdko6+fRLevlw1HgMySw7f611UIY408EtxRSoK3Q==", + "requires": { + "js-tokens": "^3.0.0 || ^4.0.0" + } + }, + "magic-string": { + "version": "0.25.9", + "resolved": "https://registry.npmmirror.com/magic-string/-/magic-string-0.25.9.tgz", + "integrity": "sha512-RmF0AsMzgt25qzqqLc1+MbHmhdx0ojF2Fvs4XnOqz2ZOBXzzkEwc/dJQZCYHAn7v1jbVOjAZfK8msRn4BxO4VQ==", + "requires": { + "sourcemap-codec": "^1.4.8" + } + }, + "make-dir": { + "version": "2.1.0", + "resolved": "https://registry.npmmirror.com/make-dir/-/make-dir-2.1.0.tgz", + "integrity": "sha512-LS9X+dc8KLxXCb8dni79fLIIUA5VyZoyjSMCwTluaXA0o27cCK0bhXkpgw+sTXVpPy/lSO57ilRixqk0vDmtRA==", + "optional": true, + "requires": { + "pify": "^4.0.1", + "semver": "^5.6.0" + } + }, + "memoize-one": { + "version": "6.0.0", + "resolved": "https://registry.npmmirror.com/memoize-one/-/memoize-one-6.0.0.tgz", + "integrity": "sha512-rkpe71W0N0c0Xz6QD0eJETuWAJGnJ9afsl1srmwPrI+yBCkge5EycXXbYRyvL29zZVUWQCY7InPRCv3GDXuZNw==" + }, + "mime": { + "version": "1.6.0", + "resolved": "https://registry.npmmirror.com/mime/-/mime-1.6.0.tgz", + "integrity": "sha512-x0Vn8spI+wuJ1O6S7gnbaQg8Pxh4NNHb7KSINmEWKiPE4RKOplvijn+NkmYmmRgP68mc70j2EbeTFRsrswaQeg==", + "optional": true + }, + "moment": { + "version": "2.29.4", + "resolved": "https://registry.npmjs.org/moment/-/moment-2.29.4.tgz", + "integrity": "sha512-5LC9SOxjSc2HF6vO2CyuTDNivEdoz2IvyJJGj6X8DJ0eFyfszE0QiEd+iXmBvUP3WHxSjFH/vIsA0EN00cgr8w==" + }, + "nanoid": { + "version": "3.3.2", + "resolved": "https://registry.npmmirror.com/nanoid/-/nanoid-3.3.2.tgz", + "integrity": "sha512-CuHBogktKwpm5g2sRgv83jEy2ijFzBwMoYA60orPDR7ynsLijJDqgsi4RDGj3OJpy3Ieb+LYwiRmIOGyytgITA==" + }, + "nanopop": { + "version": "2.1.0", + "resolved": "https://registry.npmmirror.com/nanopop/-/nanopop-2.1.0.tgz", + "integrity": "sha512-jGTwpFRexSH+fxappnGQtN9dspgE2ipa1aOjtR24igG0pv6JCxImIAmrLRHX+zUF5+1wtsFVbKyfP51kIGAVNw==" + }, + "needle": { + "version": "2.9.1", + "resolved": "https://registry.npmmirror.com/needle/-/needle-2.9.1.tgz", + "integrity": "sha512-6R9fqJ5Zcmf+uYaFgdIHmLwNldn5HbK8L5ybn7Uz+ylX/rnOsSp1AHcvQSrCaFN+qNM1wpymHqD7mVasEOlHGQ==", + "optional": true, + "requires": { + "debug": "^3.2.6", + "iconv-lite": "^0.4.4", + "sax": "^1.2.4" + }, + "dependencies": { + "debug": { + "version": "3.2.7", + "resolved": 
"https://registry.npmmirror.com/debug/-/debug-3.2.7.tgz", + "integrity": "sha512-CFjzYYAi4ThfiQvizrFQevTTXHtnCqWfe7x1AhgEscTz6ZbLbfoLRLPugTQyBth6f8ZERVUSyWHFD/7Wu4t1XQ==", + "optional": true, + "requires": { + "ms": "^2.1.1" + } + }, + "ms": { + "version": "2.1.3", + "resolved": "https://registry.npmmirror.com/ms/-/ms-2.1.3.tgz", + "integrity": "sha512-6FlzubTLZG3J2a/NVCAleEhjzq5oxgHyaCU9yYXvcLsvoVaHJq/s5xXI6/XXP6tz7R9xAOtHnSO/tXtF3WRTlA==", + "optional": true + } + } + }, + "normalize-wheel-es": { + "version": "1.1.2", + "resolved": "https://registry.npmmirror.com/normalize-wheel-es/-/normalize-wheel-es-1.1.2.tgz", + "integrity": "sha512-scX83plWJXYH1J4+BhAuIHadROzxX0UBF3+HuZNY2Ks8BciE7tSTQ+5JhTsvzjaO0/EJdm4JBGrfObKxFf3Png==" + }, + "omit.js": { + "version": "2.0.2", + "resolved": "https://registry.npmmirror.com/omit.js/-/omit.js-2.0.2.tgz", + "integrity": "sha512-hJmu9D+bNB40YpL9jYebQl4lsTW6yEHRTroJzNLqQJYHm7c+NQnJGfZmIWh8S3q3KoaxV1aLhV6B3+0N0/kyJg==" + }, + "parse-node-version": { + "version": "1.0.1", + "resolved": "https://registry.npmmirror.com/parse-node-version/-/parse-node-version-1.0.1.tgz", + "integrity": "sha512-3YHlOa/JgH6Mnpr05jP9eDG254US9ek25LyIxZlDItp2iJtwyaXQb57lBYLdT3MowkUFYEV2XXNAYIPlESvJlA==" + }, + "path-parse": { + "version": "1.0.7", + "resolved": "https://registry.npmmirror.com/path-parse/-/path-parse-1.0.7.tgz", + "integrity": "sha512-LDJzPVEEEPR+y48z93A0Ed0yXb8pAByGWo/k5YYdYgpY2/2EsOsksJrq7lOHxryrVOn1ejG6oAp8ahvOIQD8sw==", + "dev": true + }, + "picocolors": { + "version": "1.0.0", + "resolved": "https://registry.npmmirror.com/picocolors/-/picocolors-1.0.0.tgz", + "integrity": "sha512-1fygroTLlHu66zi26VoTDv8yRgm0Fccecssto+MhsZ0D/DGW2sm8E8AjW7NU5VVTRt5GxbeZ5qBuJr+HyLYkjQ==" + }, + "pify": { + "version": "4.0.1", + "resolved": "https://registry.npmmirror.com/pify/-/pify-4.0.1.tgz", + "integrity": "sha512-uB80kBFb/tfd68bVleG9T5GGsGPjJrLAUpR5PZIrhBnIaRTQRjqdJSsIKkOP6OAIFbj7GOrcudc5pNjZ+geV2g==", + "optional": true + }, + "postcss": { + "version": "8.4.12", + "resolved": "https://registry.npmmirror.com/postcss/-/postcss-8.4.12.tgz", + "integrity": "sha512-lg6eITwYe9v6Hr5CncVbK70SoioNQIq81nsaG86ev5hAidQvmOeETBqs7jm43K2F5/Ley3ytDtriImV6TpNiSg==", + "requires": { + "nanoid": "^3.3.1", + "picocolors": "^1.0.0", + "source-map-js": "^1.0.2" + } + }, + "prr": { + "version": "1.0.1", + "resolved": "https://registry.npmmirror.com/prr/-/prr-1.0.1.tgz", + "integrity": "sha512-yPw4Sng1gWghHQWj0B3ZggWUm4qVbPwPFcRG8KyxiU7J2OHFSoEHKS+EZ3fv5l1t9CyCiop6l/ZYeWbrgoQejw==", + "optional": true + }, + "regenerator-runtime": { + "version": "0.13.9", + "resolved": "https://registry.npmmirror.com/regenerator-runtime/-/regenerator-runtime-0.13.9.tgz", + "integrity": "sha512-p3VT+cOEgxFsRRA9X4lkI1E+k2/CtnKtU4gcxyaCUreilL/vqI6CdZ3wxVUx3UOUg+gnUOQQcRI7BmSI656MYA==" + }, + "resize-observer-polyfill": { + "version": "1.5.1", + "resolved": "https://registry.npmmirror.com/resize-observer-polyfill/-/resize-observer-polyfill-1.5.1.tgz", + "integrity": "sha512-LwZrotdHOo12nQuZlHEmtuXdqGoOD0OhaxopaNFxWzInpEgaLWoVuAMbTzixuosCx2nEG58ngzW3vxdWoxIgdg==" + }, + "resolve": { + "version": "1.22.0", + "resolved": "https://registry.npmmirror.com/resolve/-/resolve-1.22.0.tgz", + "integrity": "sha512-Hhtrw0nLeSrFQ7phPp4OOcVjLPIeMnRlr5mcnVuMe7M/7eBn98A3hmFRLoFo3DLZkivSYwhRUJTyPyWAk56WLw==", + "dev": true, + "requires": { + "is-core-module": "^2.8.1", + "path-parse": "^1.0.7", + "supports-preserve-symlinks-flag": "^1.0.0" + } + }, + "rollup": { + "version": "2.70.1", + "resolved": 
"https://registry.npmmirror.com/rollup/-/rollup-2.70.1.tgz", + "integrity": "sha512-CRYsI5EuzLbXdxC6RnYhOuRdtz4bhejPMSWjsFLfVM/7w/85n2szZv6yExqUXsBdz5KT8eoubeyDUDjhLHEslA==", + "dev": true, + "requires": { + "fsevents": "~2.3.2" + } + }, + "safer-buffer": { + "version": "2.1.2", + "resolved": "https://registry.npmmirror.com/safer-buffer/-/safer-buffer-2.1.2.tgz", + "integrity": "sha512-YZo3K82SD7Riyi0E1EQPojLz7kpepnSQI9IyPbHHg1XXXevb5dJI7tpyN2ADxGcQbHG7vcyRHk0cbwqcQriUtg==", + "optional": true + }, + "sax": { + "version": "1.2.4", + "resolved": "https://registry.npmmirror.com/sax/-/sax-1.2.4.tgz", + "integrity": "sha512-NqVDv9TpANUjFm0N8uM5GxL36UgKi9/atZw+x7YFnQ8ckwFGKrl4xX4yWtrey3UJm5nP1kUbnYgLopqWNSRhWw==", + "optional": true + }, + "scroll-into-view-if-needed": { + "version": "2.2.29", + "resolved": "https://registry.npmmirror.com/scroll-into-view-if-needed/-/scroll-into-view-if-needed-2.2.29.tgz", + "integrity": "sha512-hxpAR6AN+Gh53AdAimHM6C8oTN1ppwVZITihix+WqalywBeFcQ6LdQP5ABNl26nX8GTEL7VT+b8lKpdqq65wXg==", + "requires": { + "compute-scroll-into-view": "^1.0.17" + } + }, + "semver": { + "version": "5.7.1", + "resolved": "https://registry.npmmirror.com/semver/-/semver-5.7.1.tgz", + "integrity": "sha512-sauaDf/PZdVgrLTNYHRtpXa1iRiKcaebiKQ1BJdpQlWH2lCvexQdX55snPFyK7QzpudqbCI0qXFfOasHdyNDGQ==", + "optional": true + }, + "shallow-equal": { + "version": "1.2.1", + "resolved": "https://registry.npmmirror.com/shallow-equal/-/shallow-equal-1.2.1.tgz", + "integrity": "sha512-S4vJDjHHMBaiZuT9NPb616CSmLf618jawtv3sufLl6ivK8WocjAo58cXwbRV1cgqxH0Qbv+iUt6m05eqEa2IRA==" + }, + "source-map": { + "version": "0.6.1", + "resolved": "https://registry.npmmirror.com/source-map/-/source-map-0.6.1.tgz", + "integrity": "sha512-UjgapumWlbMhkBgzT7Ykc5YXUT46F0iKu8SGXq0bcwP5dz/h0Plj6enJqjz1Zbq2l5WaqYnrVbwWOWMyF3F47g==" + }, + "source-map-js": { + "version": "1.0.2", + "resolved": "https://registry.npmmirror.com/source-map-js/-/source-map-js-1.0.2.tgz", + "integrity": "sha512-R0XvVJ9WusLiqTCEiGCmICCMplcCkIwwR11mOSD9CR5u+IXYdiseeEuXCVAjS54zqwkLcPNnmU4OeJ6tUrWhDw==" + }, + "sourcemap-codec": { + "version": "1.4.8", + "resolved": "https://registry.npmmirror.com/sourcemap-codec/-/sourcemap-codec-1.4.8.tgz", + "integrity": "sha512-9NykojV5Uih4lgo5So5dtw+f0JgJX30KCNI8gwhz2J9A15wD0Ml6tjHKwf6fTSa6fAdVBdZeNOs9eJ71qCk8vA==" + }, + "supports-preserve-symlinks-flag": { + "version": "1.0.0", + "resolved": "https://registry.npmmirror.com/supports-preserve-symlinks-flag/-/supports-preserve-symlinks-flag-1.0.0.tgz", + "integrity": "sha512-ot0WnXS9fgdkgIcePe6RHNk1WA8+muPa6cSjeR3V8K27q9BB1rTE3R1p7Hv0z1ZyAc8s6Vvv8DIyWf681MAt0w==", + "dev": true + }, + "tslib": { + "version": "2.4.0", + "resolved": "https://registry.npmmirror.com/tslib/-/tslib-2.4.0.tgz", + "integrity": "sha512-d6xOpEDfsi2CZVlPQzGeux8XMwLT9hssAsaPYExaQMuYskwb+x1x7J371tWlbBdWHroy99KnVB6qIkUbs5X3UQ==" + }, + "use-strict": { + "version": "1.0.1", + "resolved": "https://registry.npmmirror.com/use-strict/-/use-strict-1.0.1.tgz", + "integrity": "sha512-IeiWvvEXfW5ltKVMkxq6FvNf2LojMKvB2OCeja6+ct24S1XOmQw2dGr2JyndwACWAGJva9B7yPHwAmeA9QCqAQ==" + }, + "vite": { + "version": "2.9.1", + "resolved": "https://registry.npmmirror.com/vite/-/vite-2.9.1.tgz", + "integrity": "sha512-vSlsSdOYGcYEJfkQ/NeLXgnRv5zZfpAsdztkIrs7AZHV8RCMZQkwjo4DS5BnrYTqoWqLoUe1Cah4aVO4oNNqCQ==", + "dev": true, + "requires": { + "esbuild": "^0.14.27", + "fsevents": "~2.3.2", + "postcss": "^8.4.12", + "resolve": "^1.22.0", + "rollup": "^2.59.0" + } + }, + "vue": { + "version": "3.2.32", + "resolved": 
"https://registry.npmmirror.com/vue/-/vue-3.2.32.tgz", + "integrity": "sha512-6L3jKZApF042OgbCkh+HcFeAkiYi3Lovi8wNhWqIK98Pi5efAMLZzRHgi91v+60oIRxdJsGS9sTMsb+yDpY8Eg==", + "requires": { + "@vue/compiler-dom": "3.2.32", + "@vue/compiler-sfc": "3.2.32", + "@vue/runtime-dom": "3.2.32", + "@vue/server-renderer": "3.2.32", + "@vue/shared": "3.2.32" + } + }, + "vue-demi": { + "version": "0.12.5", + "resolved": "https://registry.npmmirror.com/vue-demi/-/vue-demi-0.12.5.tgz", + "integrity": "sha512-BREuTgTYlUr0zw0EZn3hnhC3I6gPWv+Kwh4MCih6QcAeaTlaIX0DwOVN0wHej7hSvDPecz4jygy/idsgKfW58Q==", + "requires": {} + }, + "vue-types": { + "version": "3.0.2", + "resolved": "https://registry.npmmirror.com/vue-types/-/vue-types-3.0.2.tgz", + "integrity": "sha512-IwUC0Aq2zwaXqy74h4WCvFCUtoV0iSWr0snWnE9TnU18S66GAQyqQbRf2qfJtUuiFsBf6qp0MEwdonlwznlcrw==", + "requires": { + "is-plain-object": "3.0.1" + } + }, + "warning": { + "version": "4.0.3", + "resolved": "https://registry.npmmirror.com/warning/-/warning-4.0.3.tgz", + "integrity": "sha512-rpJyN222KWIvHJ/F53XSZv0Zl/accqHR8et1kpaMTD/fLCRxtV8iX8czMzY7sVZupTI3zcUTg8eycS2kNF9l6w==", + "requires": { + "loose-envify": "^1.0.0" + } + } + } +} diff --git a/demos/speech_web/web_client/package.json b/demos/speech_web/web_client/package.json new file mode 100644 index 000000000..7f28d4c97 --- /dev/null +++ b/demos/speech_web/web_client/package.json @@ -0,0 +1,23 @@ +{ + "name": "paddlespeechwebclient", + "private": true, + "version": "0.0.0", + "scripts": { + "dev": "vite", + "build": "vite build", + "preview": "vite preview" + }, + "dependencies": { + "ant-design-vue": "^2.2.8", + "axios": "^0.26.1", + "element-plus": "^2.1.9", + "js-audio-recorder": "0.5.7", + "lamejs": "^1.2.1", + "less": "^4.1.2", + "vue": "^3.2.25" + }, + "devDependencies": { + "@vitejs/plugin-vue": "^2.3.0", + "vite": "^2.9.0" + } +} diff --git a/demos/speech_web/web_client/public/favicon.ico b/demos/speech_web/web_client/public/favicon.ico new file mode 100644 index 000000000..342038720 Binary files /dev/null and b/demos/speech_web/web_client/public/favicon.ico differ diff --git a/demos/speech_web/web_client/src/App.vue b/demos/speech_web/web_client/src/App.vue new file mode 100644 index 000000000..a70dbf9c4 --- /dev/null +++ b/demos/speech_web/web_client/src/App.vue @@ -0,0 +1,19 @@ + + + + + diff --git a/demos/speech_web/web_client/src/api/API.js b/demos/speech_web/web_client/src/api/API.js new file mode 100644 index 000000000..0feaa63f1 --- /dev/null +++ b/demos/speech_web/web_client/src/api/API.js @@ -0,0 +1,29 @@ +export const apiURL = { + ASR_OFFLINE : '/api/asr/offline', // 获取离线语音识别结果 + ASR_COLLECT_ENV : '/api/asr/collectEnv', // 采集环境噪音 + ASR_STOP_RECORD : '/api/asr/stopRecord', // 后端暂停录音 + ASR_RESUME_RECORD : '/api/asr/resumeRecord',// 后端恢复录音 + + NLP_CHAT : '/api/nlp/chat', // NLP闲聊接口 + NLP_IE : '/api/nlp/ie', // 信息抽取接口 + + TTS_OFFLINE : '/api/tts/offline', // 获取TTS音频 + + VPR_RECOG : '/api/vpr/recog', // 声纹识别接口,返回声纹对比相似度 + VPR_ENROLL : '/api/vpr/enroll', // 声纹识别注册接口 + VPR_LIST : '/api/vpr/list', // 获取声纹注册的数据列表 + VPR_DEL : '/api/vpr/del', // 删除用户声纹 + VPR_DATA : '/api/vpr/database64?vprId=', // 获取声纹注册数据 bs64格式 + + // websocket + CHAT_SOCKET_RECORD: 'ws://localhost:8010/ws/asr/offlineStream', // ChatBot websocket 接口 + ASR_SOCKET_RECORD: 'ws://localhost:8010/ws/asr/onlineStream', // Stream ASR 接口 + TTS_SOCKET_RECORD: 'ws://localhost:8010/ws/tts/online', // Stream TTS 接口 +} + + + + + + + diff --git a/demos/speech_web/web_client/src/api/ApiASR.js b/demos/speech_web/web_client/src/api/ApiASR.js new 
file mode 100644 index 000000000..342c56164 --- /dev/null +++ b/demos/speech_web/web_client/src/api/ApiASR.js @@ -0,0 +1,30 @@
+import axios from 'axios'
+import {apiURL} from "./API.js"
+
+// Upload an audio file and get the recognition result
+export async function asrOffline(params){
+    const result = await axios.post(
+        apiURL.ASR_OFFLINE, params
+    )
+    return result
+}
+
+// Upload the collected environment-noise audio
+export async function asrCollentEnv(params){
+    const result = await axios.post(
+        apiURL.ASR_COLLECT_ENV, params
+    )
+    return result
+}
+
+// Pause recording on the backend
+export async function asrStopRecord(){
+    const result = await axios.get(apiURL.ASR_STOP_RECORD);
+    return result
+}
+
+// Resume recording on the backend
+export async function asrResumeRecord(){
+    const result = await axios.get(apiURL.ASR_RESUME_RECORD);
+    return result
+}
\ No newline at end of file
diff --git a/demos/speech_web/web_client/src/api/ApiNLP.js b/demos/speech_web/web_client/src/api/ApiNLP.js new file mode 100644 index 000000000..92259054a --- /dev/null +++ b/demos/speech_web/web_client/src/api/ApiNLP.js @@ -0,0 +1,17 @@
+import axios from 'axios'
+import {apiURL} from "./API.js"
+
+// Get the chit-chat (dialogue) reply
+export async function nlpChat(text){
+    const result = await axios.post(apiURL.NLP_CHAT, { chat : text});
+    return result
+}
+
+// Get the information extraction result
+export async function nlpIE(text){
+    const result = await axios.post(apiURL.NLP_IE, { chat : text});
+    return result
+}
+
+
+
diff --git a/demos/speech_web/web_client/src/api/ApiTTS.js b/demos/speech_web/web_client/src/api/ApiTTS.js new file mode 100644 index 000000000..1d23a4bd1 --- /dev/null +++ b/demos/speech_web/web_client/src/api/ApiTTS.js @@ -0,0 +1,8 @@
+import axios from 'axios'
+import {apiURL} from "./API.js"
+
+export async function ttsOffline(text){
+    const result = await axios.post(apiURL.TTS_OFFLINE, { text : text});
+    return result
+}
+
diff --git a/demos/speech_web/web_client/src/api/ApiVPR.js b/demos/speech_web/web_client/src/api/ApiVPR.js new file mode 100644 index 000000000..e3ae2f5ec --- /dev/null +++ b/demos/speech_web/web_client/src/api/ApiVPR.js @@ -0,0 +1,32 @@
+import axios from 'axios'
+import {apiURL} from "./API.js"
+
+// Enroll a voiceprint
+export async function vprEnroll(params){
+    const result = await axios.post(apiURL.VPR_ENROLL, params);
+    return result
+}
+
+// Voiceprint recognition
+export async function vprRecog(params){
+    const result = await axios.post(apiURL.VPR_RECOG, params);
+    return result
+}
+
+// Delete a voiceprint
+export async function vprDel(params){
+    const result = await axios.post(apiURL.VPR_DEL, params);
+    return result
+}
+
+// Get the list of enrolled voiceprints
+export async function vprList(){
+    const result = await axios.get(apiURL.VPR_LIST);
+    return result
+}
+
+// Get the enrolled voiceprint audio (base64)
+export async function vprData(params){
+    const result = await axios.get(apiURL.VPR_DATA+params);
+    return result
+}
diff --git a/demos/speech_web/web_client/src/assets/image/ic_大-上传文件.svg b/demos/speech_web/web_client/src/assets/image/ic_大-上传文件.svg new file mode 100644 index 000000000..4c3c86403 --- /dev/null +++ b/demos/speech_web/web_client/src/assets/image/ic_大-上传文件.svg @@ -0,0 +1,6 @@ + + + + + + diff --git a/demos/speech_web/web_client/src/assets/image/ic_大-声音波浪.svg b/demos/speech_web/web_client/src/assets/image/ic_大-声音波浪.svg new file mode 100644 index 000000000..dfbdc0e85 --- /dev/null +++ b/demos/speech_web/web_client/src/assets/image/ic_大-声音波浪.svg @@ -0,0 +1,6 @@ + + + + + + diff --git a/demos/speech_web/web_client/src/assets/image/ic_大-语音.svg b/demos/speech_web/web_client/src/assets/image/ic_大-语音.svg new file mode 100644 index 000000000..54571a3e3 --- /dev/null +++ b/demos/speech_web/web_client/src/assets/image/ic_大-语音.svg
@@ -0,0 +1,6 @@ + + + + + + diff --git a/demos/speech_web/web_client/src/assets/image/ic_小-录制语音.svg b/demos/speech_web/web_client/src/assets/image/ic_小-录制语音.svg new file mode 100644 index 000000000..b61f7ac03 --- /dev/null +++ b/demos/speech_web/web_client/src/assets/image/ic_小-录制语音.svg @@ -0,0 +1,6 @@ + + + + + + diff --git a/demos/speech_web/web_client/src/assets/image/ic_小-结束.svg b/demos/speech_web/web_client/src/assets/image/ic_小-结束.svg new file mode 100644 index 000000000..01a8dc65e --- /dev/null +++ b/demos/speech_web/web_client/src/assets/image/ic_小-结束.svg @@ -0,0 +1,3 @@ + + + diff --git a/demos/speech_web/web_client/src/assets/image/ic_开始聊天.svg b/demos/speech_web/web_client/src/assets/image/ic_开始聊天.svg new file mode 100644 index 000000000..073efd5e0 --- /dev/null +++ b/demos/speech_web/web_client/src/assets/image/ic_开始聊天.svg @@ -0,0 +1,6 @@ + + + + + + diff --git a/demos/speech_web/web_client/src/assets/image/ic_开始聊天_hover.svg b/demos/speech_web/web_client/src/assets/image/ic_开始聊天_hover.svg new file mode 100644 index 000000000..824f974ab --- /dev/null +++ b/demos/speech_web/web_client/src/assets/image/ic_开始聊天_hover.svg @@ -0,0 +1,6 @@ + + + + + + diff --git a/demos/speech_web/web_client/src/assets/image/ic_播放(按钮).svg b/demos/speech_web/web_client/src/assets/image/ic_播放(按钮).svg new file mode 100644 index 000000000..4dc1461fd --- /dev/null +++ b/demos/speech_web/web_client/src/assets/image/ic_播放(按钮).svg @@ -0,0 +1,3 @@ + + + diff --git a/demos/speech_web/web_client/src/assets/image/ic_暂停(按钮).svg b/demos/speech_web/web_client/src/assets/image/ic_暂停(按钮).svg new file mode 100644 index 000000000..6ede8ea62 --- /dev/null +++ b/demos/speech_web/web_client/src/assets/image/ic_暂停(按钮).svg @@ -0,0 +1,3 @@ + + + diff --git a/demos/speech_web/web_client/src/assets/image/ic_更换示例.svg b/demos/speech_web/web_client/src/assets/image/ic_更换示例.svg new file mode 100644 index 000000000..d126775d3 --- /dev/null +++ b/demos/speech_web/web_client/src/assets/image/ic_更换示例.svg @@ -0,0 +1,11 @@ + + + + + + + + + + + diff --git a/demos/speech_web/web_client/src/assets/image/icon_小-声音波浪.svg b/demos/speech_web/web_client/src/assets/image/icon_小-声音波浪.svg new file mode 100644 index 000000000..3dfed9be5 --- /dev/null +++ b/demos/speech_web/web_client/src/assets/image/icon_小-声音波浪.svg @@ -0,0 +1,6 @@ + + + + + + diff --git a/demos/speech_web/web_client/src/assets/image/icon_录制声音小语音1.svg b/demos/speech_web/web_client/src/assets/image/icon_录制声音小语音1.svg new file mode 100644 index 000000000..4fe4f0f7d --- /dev/null +++ b/demos/speech_web/web_client/src/assets/image/icon_录制声音小语音1.svg @@ -0,0 +1,14 @@ + + + icon_录制声音(小语音) + + + + + + + + + + + \ No newline at end of file diff --git a/demos/speech_web/web_client/src/assets/image/在线体验-背景@2x.png b/demos/speech_web/web_client/src/assets/image/在线体验-背景@2x.png new file mode 100644 index 000000000..66627e1e6 Binary files /dev/null and b/demos/speech_web/web_client/src/assets/image/在线体验-背景@2x.png differ diff --git a/demos/speech_web/web_client/src/assets/image/场景齐全@3x.png b/demos/speech_web/web_client/src/assets/image/场景齐全@3x.png new file mode 100644 index 000000000..b85427a1a Binary files /dev/null and b/demos/speech_web/web_client/src/assets/image/场景齐全@3x.png differ diff --git a/demos/speech_web/web_client/src/assets/image/教程丰富@3x.png b/demos/speech_web/web_client/src/assets/image/教程丰富@3x.png new file mode 100644 index 000000000..6edd64316 Binary files /dev/null and b/demos/speech_web/web_client/src/assets/image/教程丰富@3x.png differ diff --git 
a/demos/speech_web/web_client/src/assets/image/模型全面@3x.png b/demos/speech_web/web_client/src/assets/image/模型全面@3x.png new file mode 100644 index 000000000..4d54eac05 Binary files /dev/null and b/demos/speech_web/web_client/src/assets/image/模型全面@3x.png differ diff --git a/demos/speech_web/web_client/src/assets/image/步骤-箭头切图@2x.png b/demos/speech_web/web_client/src/assets/image/步骤-箭头切图@2x.png new file mode 100644 index 000000000..d0cedecce Binary files /dev/null and b/demos/speech_web/web_client/src/assets/image/步骤-箭头切图@2x.png differ diff --git a/demos/speech_web/web_client/src/assets/image/用户头像@2x.png b/demos/speech_web/web_client/src/assets/image/用户头像@2x.png new file mode 100644 index 000000000..2970d0070 Binary files /dev/null and b/demos/speech_web/web_client/src/assets/image/用户头像@2x.png differ diff --git a/demos/speech_web/web_client/src/assets/image/飞桨头像@2x.png b/demos/speech_web/web_client/src/assets/image/飞桨头像@2x.png new file mode 100644 index 000000000..1712170ed Binary files /dev/null and b/demos/speech_web/web_client/src/assets/image/飞桨头像@2x.png differ diff --git a/demos/speech_web/web_client/src/assets/logo.png b/demos/speech_web/web_client/src/assets/logo.png new file mode 100644 index 000000000..f3d2503fc Binary files /dev/null and b/demos/speech_web/web_client/src/assets/logo.png differ diff --git a/demos/speech_web/web_client/src/components/Content/Header/Header.vue b/demos/speech_web/web_client/src/components/Content/Header/Header.vue new file mode 100644 index 000000000..8135a2bff --- /dev/null +++ b/demos/speech_web/web_client/src/components/Content/Header/Header.vue @@ -0,0 +1,26 @@ + + + + + \ No newline at end of file diff --git a/demos/speech_web/web_client/src/components/Content/Header/style.less b/demos/speech_web/web_client/src/components/Content/Header/style.less new file mode 100644 index 000000000..9d0261378 --- /dev/null +++ b/demos/speech_web/web_client/src/components/Content/Header/style.less @@ -0,0 +1,148 @@ +.speech_header { + width: 1200px; + margin: 0 auto; + padding-top: 50px; + // background: url("../../../assets/image/在线体验-背景@2x.png") no-repeat; + box-sizing: border-box; + &::after { + content: ""; + display: block; + clear: both; + visibility: hidden; + } + + ; + + // background: pink; + .speech_header_title { + height: 57px; + font-family: PingFangSC-Medium; + font-size: 38px; + color: #000000; + letter-spacing: 0; + line-height: 57px; + font-weight: 500; + margin-bottom: 15px; + } + + ; + + .speech_header_describe { + height: 26px; + font-family: PingFangSC-Regular; + font-size: 16px; + color: #575757; + line-height: 26px; + font-weight: 400; + margin-bottom: 24px; + } + + ; + .speech_header_link_box { + height: 40px; + margin-bottom: 40px; + display: flex; + align-items: center; + }; + .speech_header_link { + display: block; + background: #2932E1; + width: 120px; + height: 40px; + line-height: 40px; + border-radius: 20px; + font-family: PingFangSC-Medium; + font-size: 14px; + color: #FFFFFF; + text-align: center; + font-weight: 500; + margin-right: 20px; + // margin-bottom: 40px; + + &:hover { + opacity: 0.9; + } + + ; + } + + ; + + .speech_header_divider { + width: 1200px; + height: 1px; + background: #D1D1D1; + margin-bottom: 40px; + } + + ; + + .speech_header_content_wrapper { + width: 1200px; + margin: 0 auto; + // background: pink; + margin-bottom: 20px; + display: flex; + justify-content: space-between; + flex-wrap: wrap; + + .speech_header_module { + width: 384px; + background: #FFFFFF; + border: 1px solid rgba(224, 224, 224, 1); + 
box-shadow: 4px 8px 12px 0px rgba(0, 0, 0, 0.05); + border-radius: 16px; + padding: 30px 34px 0px 34px; + box-sizing: border-box; + display: flex; + margin-bottom: 40px; + .speech_header_background_img { + width: 46px; + height: 46px; + background-size: 46px 46px; + background-repeat: no-repeat; + background-position: center; + margin-right: 20px; + } + + ; + + .speech_header_content { + padding-top: 4px; + margin-bottom: 32px; + + .speech_header_module_title { + height: 26px; + font-family: PingFangSC-Medium; + font-size: 20px; + color: #000000; + letter-spacing: 0; + line-height: 26px; + font-weight: 500; + margin-bottom: 10px; + } + + ; + + .speech_header_module_introduce { + font-family: PingFangSC-Regular; + font-size: 16px; + color: #666666; + letter-spacing: 0; + font-weight: 400; + } + + ; + } + + ; + } + + ; + } + + ; +} + +; + diff --git a/demos/speech_web/web_client/src/components/Content/Tail/Tail.vue b/demos/speech_web/web_client/src/components/Content/Tail/Tail.vue new file mode 100644 index 000000000..e69de29bb diff --git a/demos/speech_web/web_client/src/components/Content/Tail/style.less b/demos/speech_web/web_client/src/components/Content/Tail/style.less new file mode 100644 index 000000000..e69de29bb diff --git a/demos/speech_web/web_client/src/components/Experience.vue b/demos/speech_web/web_client/src/components/Experience.vue new file mode 100644 index 000000000..5620d6af9 --- /dev/null +++ b/demos/speech_web/web_client/src/components/Experience.vue @@ -0,0 +1,50 @@ + + + + + + \ No newline at end of file diff --git a/demos/speech_web/web_client/src/components/SubMenu/ASR/ASR.vue b/demos/speech_web/web_client/src/components/SubMenu/ASR/ASR.vue new file mode 100644 index 000000000..edef6a787 --- /dev/null +++ b/demos/speech_web/web_client/src/components/SubMenu/ASR/ASR.vue @@ -0,0 +1,154 @@ + + + + + \ No newline at end of file diff --git a/demos/speech_web/web_client/src/components/SubMenu/ASR/ASRT.vue b/demos/speech_web/web_client/src/components/SubMenu/ASR/ASRT.vue new file mode 100644 index 000000000..245fddb2c --- /dev/null +++ b/demos/speech_web/web_client/src/components/SubMenu/ASR/ASRT.vue @@ -0,0 +1,38 @@ + + + + + + + \ No newline at end of file diff --git a/demos/speech_web/web_client/src/components/SubMenu/ASR/AudioFile/AudioFileIdentification.vue b/demos/speech_web/web_client/src/components/SubMenu/ASR/AudioFile/AudioFileIdentification.vue new file mode 100644 index 000000000..4d3cf3c31 --- /dev/null +++ b/demos/speech_web/web_client/src/components/SubMenu/ASR/AudioFile/AudioFileIdentification.vue @@ -0,0 +1,241 @@ + + + + + \ No newline at end of file diff --git a/demos/speech_web/web_client/src/components/SubMenu/ASR/AudioFile/style.less b/demos/speech_web/web_client/src/components/SubMenu/ASR/AudioFile/style.less new file mode 100644 index 000000000..46b33272d --- /dev/null +++ b/demos/speech_web/web_client/src/components/SubMenu/ASR/AudioFile/style.less @@ -0,0 +1,293 @@ +.audioFileIdentification { + width: 1106px; + height: 270px; + // background-color: pink; + padding-top: 40px; + box-sizing: border-box; + display: flex; + // 开始上传 + .public_recognition_speech { + width: 295px; + height: 230px; + padding-top: 32px; + box-sizing: border-box; + // 开始上传 + .upload_img { + width: 116px; + height: 116px; + background: #2932E1; + border-radius: 50%; + margin-left: 98px; + cursor: pointer; + margin-bottom: 20px; + display: flex; + justify-content: center; + align-items: center; + .upload_img_back { + width: 34.38px; + height: 30.82px; + background: #2932E1; + 
background: url("../../../../assets/image/ic_大-上传文件.svg"); + background-repeat: no-repeat; + background-position: center; + background-size: 34.38px 30.82px; + cursor: pointer; + } + &:hover { + opacity: 0.9; + }; + + }; + + + .speech_text { + height: 22px; + font-family: PingFangSC-Medium; + font-size: 16px; + color: #000000; + font-weight: 500; + margin-left: 124px; + margin-bottom: 10px; + }; + .speech_text_prompt { + height: 20px; + font-family: PingFangSC-Regular; + font-size: 14px; + color: #999999; + font-weight: 400; + margin-left: 84px; + }; + }; + // 上传中 + .on_the_cross_speech { + width: 295px; + height: 230px; + padding-top: 32px; + box-sizing: border-box; + + .on_the_upload_img { + width: 116px; + height: 116px; + background: #7278F5; + border-radius: 50%; + margin-left: 98px; + cursor: pointer; + margin-bottom: 20px; + display: flex; + justify-content: center; + align-items: center; + + .on_the_upload_img_back { + width: 34.38px; + height: 30.82px; + background: #7278F5; + background: url("../../../../assets/image/ic_大-上传文件.svg"); + background-repeat: no-repeat; + background-position: center; + background-size: 34.38px 30.82px; + cursor: pointer; + + }; + }; + + + .on_the_speech_text { + height: 22px; + font-family: PingFangSC-Medium; + font-size: 16px; + color: #000000; + font-weight: 500; + margin-left: 124px; + margin-bottom: 10px; + display: flex; + // justify-content: center; + align-items: center; + .on_the_speech_loading { + display: inline-block; + width: 16px; + height: 16px; + background: #7278F5; + // background: url("../../../../assets/image/ic_开始聊天.svg"); + // background-repeat: no-repeat; + // background-position: center; + // background-size: 16px 16px; + margin-right: 8px; + }; + }; + }; + + //开始识别 + .public_recognition_speech_start { + width: 295px; + height: 230px; + padding-top: 32px; + box-sizing: border-box; + position: relative; + .public_recognition_speech_content { + width: 100%; + position: absolute; + top: 40px; + left: 50%; + transform: translateX(-50%); + display: flex; + justify-content: center; + align-items: center; + + .public_recognition_speech_title { + height: 22px; + font-family: PingFangSC-Regular; + font-size: 16px; + color: #000000; + font-weight: 400; + }; + .public_recognition_speech_again { + height: 22px; + font-family: PingFangSC-Regular; + font-size: 16px; + color: #2932E1; + font-weight: 400; + margin-left: 30px; + cursor: pointer; + }; + .public_recognition_speech_play { + height: 22px; + font-family: PingFangSC-Regular; + font-size: 16px; + color: #2932E1; + font-weight: 400; + margin-left: 20px; + cursor: pointer; + }; + }; + .speech_promp { + position: absolute; + top: 112px; + left: 50%; + transform: translateX(-50%); + width: 142px; + height: 44px; + background: #2932E1; + border-radius: 22px; + font-family: PingFangSC-Medium; + font-size: 14px; + color: #FFFFFF; + text-align: center; + line-height: 44px; + font-weight: 500; + cursor: pointer; + }; + + + }; + // 识别中 + .public_recognition_speech_identify { + width: 295px; + height: 230px; + padding-top: 32px; + box-sizing: border-box; + position: relative; + .public_recognition_speech_identify_box { + width: 143px; + height: 44px; + background: #7278F5; + border-radius: 22px; + position: absolute; + top: 50%; + left: 50%; + transform: translate(-50%,-50%); + display: flex; + justify-content: center; + align-items: center; + cursor: pointer; + .public_recognition_speech_identify_back_img { + width: 16px; + height: 16px; + // background: #7278F5; + // background: 
url("../../../../assets/image/ic_开始聊天.svg"); + // background-repeat: no-repeat; + // background-position: center; + // background-size: 16px 16px; + }; + .public_recognition__identify_the_promp { + height: 20px; + font-family: PingFangSC-Medium; + font-size: 14px; + color: #FFFFFF; + font-weight: 500; + margin-left: 12px; + }; + }; + + + + }; + // 重新识别 + .public_recognition_speech_identify_ahain { + width: 295px; + height: 230px; + padding-top: 32px; + box-sizing: border-box; + position: relative; + cursor: pointer; + .public_recognition_speech_identify_box_btn { + width: 143px; + height: 44px; + background: #2932E1; + border-radius: 22px; + position: absolute; + top: 50%; + left: 50%; + transform: translate(-50%,-50%); + display: flex; + justify-content: center; + align-items: center; + cursor: pointer; + .public_recognition__identify_the_btn { + height: 20px; + font-family: PingFangSC-Medium; + font-size: 14px; + color: #FFFFFF; + font-weight: 500; + }; + }; + + + + }; + // 指向 + .public_recognition_point_to { + width: 47px; + height: 67px; + background: url("../../../../assets/image/步骤-箭头切图@2x.png") no-repeat; + background-position: center; + background-size: 47px 67px; + margin-top: 91px; + margin-right: 67px; + }; + // 识别结果 + .public_recognition_result { + width: 680px; + height: 230px; + background: #FAFAFA; + padding: 40px 50px 0px 50px; + div { + &:nth-of-type(1) { + height: 26px; + font-family: PingFangSC-Medium; + font-size: 16px; + color: #666666; + line-height: 26px; + font-weight: 500; + margin-bottom: 20px; + }; + &:nth-of-type(2) { + height: 26px; + font-family: PingFangSC-Medium; + font-size: 16px; + color: #666666; + line-height: 26px; + font-weight: 500; + }; + }; + }; +}; \ No newline at end of file diff --git a/demos/speech_web/web_client/src/components/SubMenu/ASR/EndToEnd/EndToEndIdentification.vue b/demos/speech_web/web_client/src/components/SubMenu/ASR/EndToEnd/EndToEndIdentification.vue new file mode 100644 index 000000000..651e8c725 --- /dev/null +++ b/demos/speech_web/web_client/src/components/SubMenu/ASR/EndToEnd/EndToEndIdentification.vue @@ -0,0 +1,92 @@ + + + + + \ No newline at end of file diff --git a/demos/speech_web/web_client/src/components/SubMenu/ASR/EndToEnd/style.less b/demos/speech_web/web_client/src/components/SubMenu/ASR/EndToEnd/style.less new file mode 100644 index 000000000..1fc04b2c7 --- /dev/null +++ b/demos/speech_web/web_client/src/components/SubMenu/ASR/EndToEnd/style.less @@ -0,0 +1,114 @@ +.endToEndIdentification { + width: 1106px; + height: 270px; + // background-color: pink; + padding-top: 40px; + box-sizing: border-box; + display: flex; + // 开始识别 + .public_recognition_speech { + width: 295px; + height: 230px; + padding-top: 32px; + box-sizing: border-box; + + .endToEndIdentification_start_recorder_img { + width: 116px; + height: 116px; + background: #2932E1; + background: url("../../../../assets/image/ic_开始聊天.svg"); + background-repeat: no-repeat; + background-position: center; + background-size: 116px 116px; + margin-left: 98px; + cursor: pointer; + margin-bottom: 20px; + &:hover { + background: url("../../../../assets/image/ic_开始聊天_hover.svg"); + + }; + + }; + + .endToEndIdentification_end_recorder_img { + width: 116px; + height: 116px; + background: #2932E1; + border-radius: 50%; + display: flex; + justify-content: center; + align-items: center; + margin-left: 98px; + margin-bottom: 20px; + cursor: pointer; + .endToEndIdentification_end_recorder_img_back { + width: 50px; + height: 50px; + background: 
url("../../../../assets/image/ic_大-声音波浪.svg"); + background-repeat: no-repeat; + background-position: center; + background-size: 50px 50px; + + &:hover { + opacity: 0.9; + + }; + }; + + }; + .endToEndIdentification_prompt { + height: 22px; + font-family: PingFangSC-Medium; + font-size: 16px; + color: #000000; + font-weight: 500; + margin-left: 124px; + margin-bottom: 10px; + }; + .speech_text_prompt { + height: 20px; + font-family: PingFangSC-Regular; + font-size: 14px; + color: #999999; + font-weight: 400; + margin-left: 90px; + }; + }; + // 指向 + .public_recognition_point_to { + width: 47px; + height: 67px; + background: url("../../../../assets/image/步骤-箭头切图@2x.png") no-repeat; + background-position: center; + background-size: 47px 67px; + margin-top: 91px; + margin-right: 67px; + }; + // 识别结果 + .public_recognition_result { + width: 680px; + height: 230px; + background: #FAFAFA; + padding: 40px 50px 0px 50px; + div { + &:nth-of-type(1) { + height: 26px; + font-family: PingFangSC-Medium; + font-size: 16px; + color: #666666; + line-height: 26px; + font-weight: 500; + margin-bottom: 20px; + }; + &:nth-of-type(2) { + height: 26px; + font-family: PingFangSC-Medium; + font-size: 16px; + color: #666666; + line-height: 26px; + font-weight: 500; + }; + }; + }; + +}; \ No newline at end of file diff --git a/demos/speech_web/web_client/src/components/SubMenu/ASR/RealTime/RealTime.vue b/demos/speech_web/web_client/src/components/SubMenu/ASR/RealTime/RealTime.vue new file mode 100644 index 000000000..761a5c11f --- /dev/null +++ b/demos/speech_web/web_client/src/components/SubMenu/ASR/RealTime/RealTime.vue @@ -0,0 +1,128 @@ + + + + + \ No newline at end of file diff --git a/demos/speech_web/web_client/src/components/SubMenu/ASR/RealTime/style.less b/demos/speech_web/web_client/src/components/SubMenu/ASR/RealTime/style.less new file mode 100644 index 000000000..baa89c570 --- /dev/null +++ b/demos/speech_web/web_client/src/components/SubMenu/ASR/RealTime/style.less @@ -0,0 +1,112 @@ +.realTime{ + width: 1106px; + height: 270px; + // background-color: pink; + padding-top: 40px; + box-sizing: border-box; + display: flex; + // 开始识别 + .public_recognition_speech { + width: 295px; + height: 230px; + padding-top: 32px; + box-sizing: border-box; + .endToEndIdentification_start_recorder_img { + width: 116px; + height: 116px; + background: #2932E1; + background: url("../../../../assets/image/ic_开始聊天.svg"); + background-repeat: no-repeat; + background-position: center; + background-size: 116px 116px; + margin-left: 98px; + cursor: pointer; + margin-bottom: 20px; + &:hover { + background: url("../../../../assets/image/ic_开始聊天_hover.svg"); + + }; + + }; + + .endToEndIdentification_end_recorder_img { + width: 116px; + height: 116px; + background: #2932E1; + border-radius: 50%; + display: flex; + justify-content: center; + align-items: center; + margin-left: 98px; + margin-bottom: 20px; + cursor: pointer; + .endToEndIdentification_end_recorder_img_back { + width: 50px; + height: 50px; + background: url("../../../../assets/image/ic_大-声音波浪.svg"); + background-repeat: no-repeat; + background-position: center; + background-size: 50px 50px; + + &:hover { + opacity: 0.9; + + }; + }; + + }; + .endToEndIdentification_prompt { + height: 22px; + font-family: PingFangSC-Medium; + font-size: 16px; + color: #000000; + font-weight: 500; + margin-left: 124px; + margin-bottom: 10px; + }; + .speech_text_prompt { + height: 20px; + font-family: PingFangSC-Regular; + font-size: 14px; + color: #999999; + font-weight: 400; + margin-left: 
105px; + }; + }; + // 指向 + .public_recognition_point_to { + width: 47px; + height: 67px; + background: url("../../../../assets/image/步骤-箭头切图@2x.png") no-repeat; + background-position: center; + background-size: 47px 67px; + margin-top: 91px; + margin-right: 67px; + }; + // 识别结果 + .public_recognition_result { + width: 680px; + height: 230px; + background: #FAFAFA; + padding: 40px 50px 0px 50px; + div { + &:nth-of-type(1) { + height: 26px; + font-family: PingFangSC-Medium; + font-size: 16px; + color: #666666; + line-height: 26px; + font-weight: 500; + margin-bottom: 20px; + }; + &:nth-of-type(2) { + height: 26px; + font-family: PingFangSC-Medium; + font-size: 16px; + color: #666666; + line-height: 26px; + font-weight: 500; + }; + }; + }; +}; \ No newline at end of file diff --git a/demos/speech_web/web_client/src/components/SubMenu/ASR/style.less b/demos/speech_web/web_client/src/components/SubMenu/ASR/style.less new file mode 100644 index 000000000..92ce9340b --- /dev/null +++ b/demos/speech_web/web_client/src/components/SubMenu/ASR/style.less @@ -0,0 +1,76 @@ +.speech_recognition { + width: 1200px; + height: 410px; + background: #FFFFFF; + padding: 40px 50px 50px 44px; + position: relative; + .frame { + width: 605px; + height: 50px; + border: 1px solid rgba(238,238,238,1); + border-radius: 25px; + position: absolute; + } + .speech_recognition_mytabs { + .ant-tabs-tab { + position: relative; + display: inline-flex; + align-items: center; + // padding: 12px 0; + font-size: 14px; + background: transparent; + border: 0; + outline: none; + cursor: pointer; + padding: 12px 26px; + box-sizing: border-box; + } + .ant-tabs-tab-active { + height: 50px; + background: #EEEFFD; + border-radius: 25px; + padding: 12px 26px; + box-sizing: border-box; + }; + .speech_recognition .speech_recognition_mytabs .ant-tabs-ink-bar { + position: absolute; + background: transparent !important; + pointer-events: none; + } + .ant-tabs-ink-bar { + position: absolute; + background: transparent !important; + pointer-events: none; + } + .experience .experience_wrapper .experience_content .experience_tabs .ant-tabs-nav::before { + position: absolute; + right: 0; + left: 0; + border-bottom: 1px solid transparent !important; + // border: none; + content: ''; + } + .ant-tabs-top > .ant-tabs-nav::before, .ant-tabs-bottom > .ant-tabs-nav::before, .ant-tabs-top > div > .ant-tabs-nav::before, .ant-tabs-bottom > div > .ant-tabs-nav::before { + position: absolute; + right: 0; + left: 0; + border-bottom: 1px solid transparent !important; + // border: none; + content: ''; + } + .ant-tabs-top > .ant-tabs-nav::before, .ant-tabs-bottom > .ant-tabs-nav::before, .ant-tabs-top > div > .ant-tabs-nav::before, .ant-tabs-bottom > div > .ant-tabs-nav::before { + position: absolute; + right: 0; + left: 0; + border-bottom: 1px solid transparent !important; + content: ''; + } + .ant-tabs-nav::before { + position: absolute; + right: 0; + left: 0; + border-bottom: 1px solid transparent !important; + content: ''; + }; + }; +}; \ No newline at end of file diff --git a/demos/speech_web/web_client/src/components/SubMenu/ChatBot/Chat.vue b/demos/speech_web/web_client/src/components/SubMenu/ChatBot/Chat.vue new file mode 100644 index 000000000..9d356fc80 --- /dev/null +++ b/demos/speech_web/web_client/src/components/SubMenu/ChatBot/Chat.vue @@ -0,0 +1,298 @@ + + + + + \ No newline at end of file diff --git a/demos/speech_web/web_client/src/components/SubMenu/ChatBot/ChatT.vue b/demos/speech_web/web_client/src/components/SubMenu/ChatBot/ChatT.vue new file 
mode 100644 index 000000000..c37c083ff --- /dev/null +++ b/demos/speech_web/web_client/src/components/SubMenu/ChatBot/ChatT.vue @@ -0,0 +1,255 @@ + + + + + \ No newline at end of file diff --git a/demos/speech_web/web_client/src/components/SubMenu/ChatBot/style.less b/demos/speech_web/web_client/src/components/SubMenu/ChatBot/style.less new file mode 100644 index 000000000..d868fd470 --- /dev/null +++ b/demos/speech_web/web_client/src/components/SubMenu/ChatBot/style.less @@ -0,0 +1,181 @@ +.voice_chat { + width: 1200px; + height: 410px; + background: #FFFFFF; + position: relative; + // 开始聊天 + .voice_chat_wrapper { + top: 50%; + left: 50%; + transform: translate(-50%,-50%); + position: absolute; + .voice_chat_btn { + width: 116px; + height: 116px; + margin-left: 54px; + // background: #2932E1; + border-radius: 50%; + cursor: pointer; + background: url("../../../assets/image/ic_开始聊天.svg"); + background-repeat: no-repeat; + background-position: center; + background-size: 116px 116px; + margin-bottom: 17px; + &:hover { + width: 116px; + height: 116px; + background: url("../../../assets/image/ic_开始聊天_hover.svg"); + background-repeat: no-repeat; + background-position: center; + background-size: 116px 116px; + }; + + }; + .voice_chat_btn_title { + height: 22px; + font-family: PingFangSC-Medium; + font-size: 16px; + color: #000000; + letter-spacing: 0; + text-align: center; + line-height: 22px; + font-weight: 500; + margin-bottom: 10px; + }; + .voice_chat_btn_prompt { + height: 24px; + font-family: PingFangSC-Regular; + font-size: 14px; + color: #999999; + letter-spacing: 0; + text-align: center; + line-height: 24px; + font-weight: 400; + }; + }; + .voice_chat_wrapper::after { + content: ""; + display: block; + clear: both; + visibility: hidden; + }; + // 结束聊天 + .voice_chat_dialog_wrapper { + width: 1200px; + height: 410px; + background: #FFFFFF; + position: relative; + .dialog_box { + width: 100%; + height: 410px; + padding: 50px 198px 82px 199px; + box-sizing: border-box; + + .dialog_content { + width: 100%; + height: 268px; + // background: rgb(113, 144, 145); + padding: 0px; + overflow: auto; + li { + list-style-type: none; + margin-bottom: 33px; + display: flex; + align-items: center; + &:last-of-type(1) { + margin-bottom: 0px; + }; + .dialog_content_img_pp { + width: 60px; + height: 60px; + // transform: scaleX(-1); + background: url("../../../assets/image/飞桨头像@2x.png"); + background-repeat: no-repeat; + background-position: center; + background-size: 60px 60px; + margin-right: 20px; + }; + .dialog_content_img_user { + width: 60px; + height: 60px; + transform: scaleX(-1); + background: url("../../../assets/image/用户头像@2x.png"); + background-repeat: no-repeat; + background-position: center; + background-size: 60px 60px; + margin-left: 20px; + }; + .dialog_content_dialogue_pp { + height: 50px; + background: #F5F5F5; + border-radius: 25px; + font-family: PingFangSC-Regular; + font-size: 14px; + color: #000000; + line-height: 50px; + font-weight: 400; + padding: 0px 16px; + box-sizing: border-box; + }; + .dialog_content_dialogue_user { + height: 50px; + background: rgba(41,50,225,0.90); + border-radius: 25px; + font-family: PingFangSC-Regular; + font-size: 14px; + color: #FFFFFF; + line-height: 50px; + font-weight: 400; + padding: 0px 16px; + box-sizing: border-box; + }; + }; + }; + .move_dialogue { + justify-content: flex-end; + }; + + }; + + .btn_end_dialog { + width: 124px; + height: 42px; + line-height: 42px; + background: #FFFFFF; + box-shadow: 0px 4px 16px 0px rgba(0,0,0,0.09); + 
border-radius: 21px; + padding: 0px 24px; + box-sizing: border-box; + position: absolute; + left: 50%; + bottom: 40px; + transform: translateX(-50%); + display: flex; + justify-content: space-between; + align-items: center; + cursor: pointer; + span { + display: inline-block; + &:nth-of-type(1) { + width: 16px; + height: 16px; + background: url("../../../assets/image/ic_小-结束.svg"); + background-repeat: no-repeat; + background-position: center; + background-size: 16px 16px; + + }; + &:nth-of-type(2) { + height: 20px; + font-family: PingFangSC-Regular; + font-size: 14px; + color: #F33E3E; + text-align: center; + font-weight: 400; + line-height: 20px; + margin-left: 4px; + }; + }; + }; + }; +}; \ No newline at end of file diff --git a/demos/speech_web/web_client/src/components/SubMenu/IE/IE.vue b/demos/speech_web/web_client/src/components/SubMenu/IE/IE.vue new file mode 100644 index 000000000..c7dd04e9d --- /dev/null +++ b/demos/speech_web/web_client/src/components/SubMenu/IE/IE.vue @@ -0,0 +1,125 @@ + + + + + \ No newline at end of file diff --git a/demos/speech_web/web_client/src/components/SubMenu/IE/IET.vue b/demos/speech_web/web_client/src/components/SubMenu/IE/IET.vue new file mode 100644 index 000000000..50eadec70 --- /dev/null +++ b/demos/speech_web/web_client/src/components/SubMenu/IE/IET.vue @@ -0,0 +1,166 @@ + + + + + \ No newline at end of file diff --git a/demos/speech_web/web_client/src/components/SubMenu/IE/style.less b/demos/speech_web/web_client/src/components/SubMenu/IE/style.less new file mode 100644 index 000000000..988666a26 --- /dev/null +++ b/demos/speech_web/web_client/src/components/SubMenu/IE/style.less @@ -0,0 +1,179 @@ +.voice_commands { + width: 1200px; + height: 410px; + background: #FFFFFF; + padding: 40px 50px 50px 50px; + box-sizing: border-box; + display: flex; + // 交通报销 + .voice_commands_traffic { + width: 468px; + height: 320px; + .voice_commands_traffic_title { + height: 26px; + font-family: PingFangSC-Medium; + font-size: 16px; + color: #000000; + letter-spacing: 0; + line-height: 26px; + font-weight: 500; + margin-bottom: 30px; + // background: pink; + }; + .voice_commands_traffic_wrapper { + width: 465px; + height: 264px; + // background: #FAFAFA; + position: relative; + .voice_commands_traffic_wrapper_move { + position: absolute; + top: 50%; + left: 50%; + transform: translate(-50%,-50%); + }; + .traffic_btn_img_btn { + width: 116px; + height: 116px; + background: #2932E1; + display: flex; + justify-content: center; + align-items: center; + border-radius: 50%; + cursor: pointer; + margin-bottom: 20px; + margin-left: 84px; + &:hover { + width: 116px; + height: 116px; + background: #7278F5; + + .start_recorder_img{ + width: 50px; + height: 50px; + background: url("../../../assets/image/ic_开始聊天_hover.svg") no-repeat; + background-position: center; + background-size: 50px 50px; + }; + + }; + + .start_recorder_img{ + width: 50px; + height: 50px; + background: url("../../../assets/image/ic_开始聊天.svg") no-repeat; + background-position: center; + background-size: 50px 50px; + }; + + }; + .traffic_btn_prompt { + height: 22px; + font-family: PingFangSC-Medium; + font-size: 16px; + color: #000000; + font-weight: 500; + margin-bottom: 16px; + margin-left: 110px; + }; + .traffic_btn_list { + height: 20px; + font-family: PingFangSC-Regular; + font-size: 12px; + color: #999999; + font-weight: 400; + width: 112%; + }; + }; + }; + //指向 + .voice_point_to { + width: 47px; + height: 63px; + background: url("../../../assets/image/步骤-箭头切图@2x.png") no-repeat; + 
background-position: center; + background-size: 47px 63px; + margin-top: 164px; + margin-right: 82px; + }; + //识别结果 + .voice_commands_IdentifyTheResults { + .voice_commands_IdentifyTheResults_title { + height: 26px; + font-family: PingFangSC-Medium; + font-size: 16px; + color: #000000; + line-height: 26px; + font-weight: 500; + margin-bottom: 30px; + }; + // 显示框 + .voice_commands_IdentifyTheResults_show { + width: 503px; + height: 264px; + background: #FAFAFA; + padding: 40px 0px 0px 50px; + box-sizing: border-box; + .voice_commands_IdentifyTheResults_show_title { + height: 22px; + font-family: PingFangSC-Medium; + font-size: 16px; + color: #000000; + // text-align: center; + font-weight: 500; + margin-bottom: 30px; + }; + .oice_commands_IdentifyTheResults_show_time { + height: 20px; + font-family: PingFangSC-Medium; + font-size: 14px; + color: #666666; + font-weight: 500; + margin-bottom: 12px; + }; + .oice_commands_IdentifyTheResults_show_money { + height: 20px; + font-family: PingFangSC-Medium; + font-size: 14px; + color: #666666; + font-weight: 500; + margin-bottom: 12px; + }; + .oice_commands_IdentifyTheResults_show_origin { + height: 20px; + font-family: PingFangSC-Medium; + font-size: 14px; + color: #666666; + font-weight: 500; + margin-bottom: 12px; + }; + .oice_commands_IdentifyTheResults_show_destination { + height: 20px; + font-family: PingFangSC-Medium; + font-size: 14px; + color: #666666; + font-weight: 500; + }; + }; + //加载状态 + .voice_commands_IdentifyTheResults_show_loading { + width: 503px; + height: 264px; + background: #FAFAFA; + padding: 40px 0px 0px 50px; + box-sizing: border-box; + display: flex; + justify-content: center; + align-items: center; + }; + }; + .end_recorder_img { + width: 50px; + height: 50px; + background: url("../../../assets/image/ic_大-声音波浪.svg") no-repeat; + background-position: center; + background-size: 50px 50px; + }; + .end_recorder_img:hover { + opacity: 0.9; + }; +}; \ No newline at end of file diff --git a/demos/speech_web/web_client/src/components/SubMenu/TTS/TTST.vue b/demos/speech_web/web_client/src/components/SubMenu/TTS/TTST.vue new file mode 100644 index 000000000..353221f7b --- /dev/null +++ b/demos/speech_web/web_client/src/components/SubMenu/TTS/TTST.vue @@ -0,0 +1,359 @@ + + + + + + + \ No newline at end of file diff --git a/demos/speech_web/web_client/src/components/SubMenu/TTS/style.less b/demos/speech_web/web_client/src/components/SubMenu/TTS/style.less new file mode 100644 index 000000000..b5d189650 --- /dev/null +++ b/demos/speech_web/web_client/src/components/SubMenu/TTS/style.less @@ -0,0 +1,369 @@ +.speech_recognition { + width: 1200px; + height: 410px; + background: #FFFFFF; + padding: 40px 0px 50px 50px; + box-sizing: border-box; + display: flex; + .recognition_text { + width: 589px; + height: 320px; + // background: pink; + .recognition_text_header { + margin-bottom: 30px; + display: flex; + justify-content: space-between; + align-items: center; + .recognition_text_title { + height: 26px; + font-family: PingFangSC-Medium; + font-size: 16px; + color: #000000; + letter-spacing: 0; + line-height: 26px; + font-weight: 500; + }; + .recognition_text_random { + display: flex; + align-items: center; + cursor: pointer; + span { + display: inline-block; + &:nth-of-type(1) { + width: 20px; + height: 20px; + background: url("../../../assets/image/ic_更换示例.svg") no-repeat; + background-position: center; + background-size: 20px 20px; + margin-right: 5px; + + }; + &:nth-of-type(2) { + height: 20px; + font-family: PingFangSC-Regular; + 
font-size: 14px; + color: #2932E1; + letter-spacing: 0; + font-weight: 400; + }; + }; + }; + }; + .recognition_text_field { + width: 589px; + height: 264px; + background: #FAFAFA; + .textToSpeech_content_show_text{ + width: 100%; + height: 264px; + padding: 0px 30px 30px 0px; + box-sizing: border-box; + .ant-input { + height: 208px; + resize: none; + // margin-bottom: 230px; + padding: 21px 20px; + }; + }; + }; + }; + // 指向 + .recognition_point_to { + width: 47px; + height: 63px; + background: url("../../../assets/image/步骤-箭头切图@2x.png") no-repeat; + background-position: center; + background-size: 47px 63px; + margin-top: 164px; + margin-right: 101px; + margin-left: 100px; + margin-top: 164px; + }; + // 语音合成 + .speech_recognition_new { + .speech_recognition_title { + height: 26px; + font-family: PingFangSC-Medium; + font-size: 16px; + color: #000000; + line-height: 26px; + font-weight: 500; + margin-left: 32px; + margin-bottom: 96px; + }; + // 流式合成 + .speech_recognition_streaming { + width: 136px; + height: 44px; + background: #2932E1; + border-radius: 22px; + font-family: PingFangSC-Medium; + font-size: 14px; + color: #FFFFFF; + font-weight: 500; + text-align: center; + line-height: 44px; + margin-bottom: 40px; + cursor: pointer; + &:hover { + opacity: .9; + }; + }; + // 合成中 + .streaming_ing_box { + display: flex; + align-items: center; + height: 44px; + margin-bottom: 40px; + .streaming_ing { + width: 136px; + height: 44px; + background: #7278F5; + border-radius: 22px; + display: flex; + justify-content: center; + align-items: center; + cursor: pointer; + + .streaming_ing_img { + width: 16px; + height: 16px; + // background: url("../../../assets/image/ic_小-录制语音.svg"); + // background-repeat: no-repeat; + // background-position: center; + // background-size: 16px 16px; + // margin-right: 12px; + }; + .streaming_ing_text { + height: 20px; + font-family: PingFangSC-Medium; + font-size: 14px; + color: #FFFFFF; + font-weight: 500; + margin-left: 12px; + }; + }; + // 合成时间文字 + .streaming_time { + height: 20px; + font-family: PingFangSC-Medium; + font-size: 14px; + color: #000000; + font-weight: 500; + margin-left: 12px; + }; + }; + + + // 暂停播放 + .streaming_suspended_box { + display: flex; + align-items: center; + height: 44px; + margin-bottom: 40px; + .streaming_suspended { + width: 136px; + height: 44px; + background: #2932E1; + border-radius: 22px; + display: flex; + justify-content: center; + align-items: center; + cursor: pointer; + + .streaming_suspended_img { + width: 16px; + height: 16px; + background: url("../../../assets/image/ic_暂停(按钮).svg"); + background-repeat: no-repeat; + background-position: center; + background-size: 16px 16px; + margin-right: 12px; + }; + .streaming_suspended_text { + height: 20px; + font-family: PingFangSC-Medium; + font-size: 14px; + color: #FFFFFF; + font-weight: 500; + margin-left: 12px; + }; + + }; + // 暂停获取时间 + .suspended_time { + height: 20px; + font-family: PingFangSC-Medium; + font-size: 14px; + color: #000000; + font-weight: 500; + margin-left: 12px; + } + }; + + // 继续播放 + .streaming_continue { + width: 136px; + height: 44px; + background: #2932E1; + border-radius: 22px; + display: flex; + justify-content: center; + align-items: center; + cursor: pointer; + margin-bottom: 40px; + .streaming_continue_img { + width: 16px; + height: 16px; + background: url("../../../assets/image/ic_播放(按钮).svg"); + background-repeat: no-repeat; + background-position: center; + background-size: 16px 16px; + margin-right: 12px; + }; + .streaming_continue_text { + height: 
20px; + font-family: PingFangSC-Medium; + font-size: 14px; + color: #FFFFFF; + font-weight: 500; + }; + }; + + + + + + + // 端到端合成 + .speech_recognition_end_to_end { + width: 136px; + height: 44px; + background: #2932E1; + border-radius: 22px; + font-family: PingFangSC-Medium; + font-size: 14px; + color: #FFFFFF; + font-weight: 500; + text-align: center; + line-height: 44px; + cursor: pointer; + &:hover { + opacity: .9; + }; + }; + // 合成中 + .end_to_end_ing_box { + display: flex; + align-items: center; + height: 44px; + .end_to_end_ing { + width: 136px; + height: 44px; + background: #7278F5; + border-radius: 22px; + display: flex; + justify-content: center; + align-items: center; + cursor: pointer; + .end_to_end_ing_img { + width: 16px; + height: 16px; + // background: url("../../../assets/image/ic_小-录制语音.svg"); + // background-repeat: no-repeat; + // background-position: center; + // background-size: 16px 16px; + + }; + .end_to_end_ing_text { + height: 20px; + font-family: PingFangSC-Medium; + font-size: 14px; + color: #FFFFFF; + font-weight: 500; + margin-left: 12px; + }; + }; + // 合成时间文本 + .end_to_end_ing_time { + height: 20px; + font-family: PingFangSC-Medium; + font-size: 14px; + color: #000000; + font-weight: 500; + margin-left: 12px; + }; + }; + + + // 暂停播放 + .end_to_end_suspended_box { + display: flex; + align-items: center; + height: 44px; + .end_to_end_suspended { + width: 136px; + height: 44px; + background: #2932E1; + border-radius: 22px; + display: flex; + justify-content: center; + align-items: center; + cursor: pointer; + .end_to_end_suspended_img { + width: 16px; + height: 16px; + background: url("../../../assets/image/ic_暂停(按钮).svg"); + background-repeat: no-repeat; + background-position: center; + background-size: 16px 16px; + margin-right: 12px; + }; + .end_to_end_suspended_text { + height: 20px; + font-family: PingFangSC-Medium; + font-size: 14px; + color: #FFFFFF; + font-weight: 500; + }; + }; + // 暂停播放时间 + .end_to_end_ing_suspended_time { + height: 20px; + font-family: PingFangSC-Medium; + font-size: 14px; + color: #000000; + font-weight: 500; + margin-left: 12px; + }; + }; + + // 继续播放 + .end_to_end_continue { + width: 136px; + height: 44px; + background: #2932E1; + border-radius: 22px; + display: flex; + justify-content: center; + align-items: center; + cursor: pointer; + .end_to_end_continue_img { + width: 16px; + height: 16px; + background: url("../../../assets/image/ic_播放(按钮).svg"); + background-repeat: no-repeat; + background-position: center; + background-size: 16px 16px; + margin-right: 12px; + }; + .end_to_end_continue_text { + height: 20px; + font-family: PingFangSC-Medium; + font-size: 14px; + color: #FFFFFF; + font-weight: 500; + }; + }; + }; +}; \ No newline at end of file diff --git a/demos/speech_web/web_client/src/components/SubMenu/VPR/VPR.vue b/demos/speech_web/web_client/src/components/SubMenu/VPR/VPR.vue new file mode 100644 index 000000000..1fe71e4d8 --- /dev/null +++ b/demos/speech_web/web_client/src/components/SubMenu/VPR/VPR.vue @@ -0,0 +1,178 @@ + + + + + \ No newline at end of file diff --git a/demos/speech_web/web_client/src/components/SubMenu/VPR/VPRT.vue b/demos/speech_web/web_client/src/components/SubMenu/VPR/VPRT.vue new file mode 100644 index 000000000..e398da00c --- /dev/null +++ b/demos/speech_web/web_client/src/components/SubMenu/VPR/VPRT.vue @@ -0,0 +1,335 @@ + + + + + \ No newline at end of file diff --git a/demos/speech_web/web_client/src/components/SubMenu/VPR/style.less 
b/demos/speech_web/web_client/src/components/SubMenu/VPR/style.less new file mode 100644 index 000000000..cb3df49ef --- /dev/null +++ b/demos/speech_web/web_client/src/components/SubMenu/VPR/style.less @@ -0,0 +1,419 @@ +.voiceprint { + width: 1200px; + height: 410px; + background: #FFFFFF; + padding: 41px 80px 56px 80px; + box-sizing: border-box; + display: flex; + // 录制声纹 + .voiceprint_recording { + width: 423px; + height: 354px; + margin-right: 66px; + .recording_title { + display: flex; + align-items: center; + margin-bottom: 20px; + div { + &:nth-of-type(1) { + width: 24px; + height: 24px; + background: rgba(41,50,225,0.70); + font-family: PingFangSC-Regular; + font-size: 16px; + color: #FFFFFF; + letter-spacing: 0; + text-align: center; + line-height: 24px; + font-weight: 400; + margin-right: 16px; + border-radius: 50%; + }; + &:nth-of-type(2) { + height: 26px; + font-family: PingFangSC-Regular; + font-size: 16px; + color: #000000; + line-height: 26px; + font-weight: 400; + }; + }; + }; + // 开始录音 + .recording_btn { + width: 143px; + height: 44px; + cursor: pointer; + background: #2932E1; + padding: 0px 24px 0px 21px; + box-sizing: border-box; + border-radius: 22px; + display: flex; + align-items: center; + margin-bottom: 20px; + margin-top: 10px; + + &:hover { + background: #7278F5; + .recording_img { + width: 20px; + height: 20px; + background: url("../../../assets/image//icon_录制声音小语音1.svg"); + background-repeat: no-repeat; + background-position: center; + background-size: 20px 20px; + margin-right: 8.26px; + + }; + } + .recording_img { + width: 20px; + height: 20px; + background: url("../../../assets/image//icon_录制声音小语音1.svg"); + background-repeat: no-repeat; + background-position: center; + background-size: 20px 20px; + margin-right: 8.26px; + + }; + .recording_prompt { + height: 20px; + font-family: PingFangSC-Regular; + font-size: 12px; + color: #FFFFFF; + font-weight: 400; + }; + + }; + // 录音中 + .recording_btn_the_recording { + width: 143px; + height: 44px; + cursor: pointer; + background: #7278F5; + padding: 0px 24px 0px 21px; + box-sizing: border-box; + border-radius: 22px; + display: flex; + align-items: center; + justify-content: center; + margin-bottom: 40px; + .recording_img_the_recording { + width: 20px; + height: 20px; + background: url("../../../assets/image//icon_小-声音波浪.svg"); + background-repeat: no-repeat; + background-position: center; + background-size: 20px 20px; + margin-right: 8.26px; + }; + .recording_prompt { + height: 20px; + font-family: PingFangSC-Regular; + font-size: 12px; + color: #FFFFFF; + font-weight: 400; + }; + }; + // 完成录音 + .complete_the_recording_btn { + width: 143px; + height: 44px; + cursor: pointer; + background: #2932E1; + padding: 0px 24px 0px 21px; + box-sizing: border-box; + border-radius: 22px; + display: flex; + align-items: center; + margin-bottom: 40px; + &:hover { + background: #7278F5; + .complete_the_recording_img { + width: 20px; + height: 20px; + background: url("../../../assets/image//icon_小-声音波浪.svg"); + background-repeat: no-repeat; + background-position: center; + background-size: 20px 20px; + margin-right: 8.26px; + + }; + } + .complete_the_recording_img { + width: 20px; + height: 20px; + background: url("../../../assets/image//icon_小-声音波浪.svg"); + background-repeat: no-repeat; + background-position: center; + background-size: 20px 20px; + margin-right: 8.26px; + + }; + .complete_the_recording_prompt { + height: 20px; + font-family: PingFangSC-Regular; + font-size: 12px; + color: #FFFFFF; + font-weight: 400; + }; + + }; + // 
table + .recording_table { + width: 322px; + .recording_table_box { + .ant-table-thead > tr > th { + color: rgba(0, 0, 0, 0.85); + font-weight: 500; + text-align: left; + background: rgba(40,50,225,0.08); + border-bottom: none; + transition: background 0.3s ease; + height: 22px; + font-family: PingFangSC-Regular; + font-size: 16px; + color: #333333; + // text-align: center; + font-weight: 400; + &:nth-of-type(2) { + border-left: 2px solid white; + }; + }; + .ant-table-tbody > tr > td { + border-bottom: 1px solid #f0f0f0; + transition: background 0.3s; + height: 22px; + font-family: PingFangSC-Regular; + font-size: 16px; + color: #333333; + // text-align: center; + font-weight: 400; + }; + }; + }; + // input + .recording_input { + width: 322px; + margin-bottom: 20px; + }; + }; + // 指向 + .recording_point_to { + width: 63px; + height: 47px; + background: url("../../../assets/image//步骤-箭头切图@2x.png"); + background-repeat: no-repeat; + background-position: center; + background-size: 63px 47px; + margin-right: 66px; + margin-top: 198px; + }; + //识别声纹 + .voiceprint_identify { + width: 423px; + height: 354px; + .identify_title { + display: flex; + align-items: center; + margin-bottom: 20px; + div { + &:nth-of-type(1) { + width: 24px; + height: 24px; + background: rgba(41,50,225,0.70); + font-family: PingFangSC-Regular; + font-size: 16px; + color: #FFFFFF; + letter-spacing: 0; + text-align: center; + line-height: 24px; + font-weight: 400; + margin-right: 16px; + border-radius: 50%; + }; + &:nth-of-type(2) { + height: 26px; + font-family: PingFangSC-Regular; + font-size: 16px; + color: #000000; + line-height: 26px; + font-weight: 400; + }; + }; + }; + // 开始识别 + .identify_btn { + width: 143px; + height: 44px; + cursor: pointer; + background: #2932E1; + padding: 0px 24px 0px 21px; + box-sizing: border-box; + border-radius: 22px; + display: flex; + align-items: center; + margin-bottom: 40px; + margin-top: 10px; + &:hover { + background: #7278F5; + .identify_img { + width: 20px; + height: 20px; + background: url("../../../assets/image//icon_录制声音小语音1.svg"); + background-repeat: no-repeat; + background-position: center; + background-size: 20px 20px; + margin-right: 8.26px; + + }; + } + .identify_img { + width: 20px; + height: 20px; + background: url("../../../assets/image//icon_录制声音小语音1.svg"); + background-repeat: no-repeat; + background-position: center; + background-size: 20px 20px; + margin-right: 8.26px; + + }; + .identify_prompt { + height: 20px; + font-family: PingFangSC-Regular; + font-size: 12px; + color: #FFFFFF; + font-weight: 400; + }; + + }; + // 识别中 + .identify_btn_the_recording { + width: 143px; + height: 44px; + cursor: pointer; + background: #7278F5; + padding: 0px 24px 0px 21px; + box-sizing: border-box; + border-radius: 22px; + display: flex; + align-items: center; + justify-content: center; + margin-bottom: 40px; + .identify_img_the_recording { + width: 20px; + height: 20px; + background: url("../../../assets/image//icon_录制声音小语音1.svg"); + background-repeat: no-repeat; + background-position: center; + background-size: 20px 20px; + margin-right: 8.26px; + }; + .recording_prompt { + height: 20px; + font-family: PingFangSC-Regular; + font-size: 12px; + color: #FFFFFF; + font-weight: 400; + }; + }; + // 完成识别 + .identify_complete_the_recording_btn { + width: 143px; + height: 44px; + cursor: pointer; + background: #2932E1; + padding: 0px 24px 0px 21px; + box-sizing: border-box; + border-radius: 22px; + display: flex; + align-items: center; + margin-bottom: 40px; + &:hover { + background: 
#7278F5; + .identify_complete_the_recording_img { + width: 20px; + height: 20px; + background: url("../../../assets/image//icon_小-声音波浪.svg"); + background-repeat: no-repeat; + background-position: center; + background-size: 20px 20px; + margin-right: 8.26px; + + }; + } + .identify_complete_the_recording_img { + width: 20px; + height: 20px; + background: url("../../../assets/image//icon_小-声音波浪.svg"); + background-repeat: no-repeat; + background-position: center; + background-size: 20px 20px; + margin-right: 8.26px; + + }; + .identify_complete_the_recording_prompt { + height: 20px; + font-family: PingFangSC-Regular; + font-size: 12px; + color: #FFFFFF; + font-weight: 400; + }; + + }; + + + + + // 结果 + .identify_result { + width: 422px; + height: 184px; + text-align: center; + line-height: 184px; + background: #FAFAFA; + position: relative; + .identify_result_default { + + font-family: PingFangSC-Regular; + font-size: 16px; + color: #999999; + font-weight: 400; + }; + .identify_result_content { + // text-align: center; + // position: absolute; + // top: 50%; + // left: 50%; + // transform: translate(-50%,-50%); + div { + &:nth-of-type(1) { + height: 22px; + font-family: PingFangSC-Regular; + font-size: 16px; + color: #666666; + font-weight: 400; + margin-bottom: 10px; + }; + &:nth-of-type(2) { + height: 33px; + font-family: PingFangSC-Medium; + font-size: 24px; + color: #000000; + font-weight: 500; + }; + }; + }; + }; + }; + .action_btn { + display: inline-block; + height: 22px; + font-family: PingFangSC-Regular; + font-size: 16px; + color: #2932E1; + text-align: center; + font-weight: 400; + cursor: pointer; + }; +}; \ No newline at end of file diff --git a/demos/speech_web/web_client/src/components/style.less b/demos/speech_web/web_client/src/components/style.less new file mode 100644 index 000000000..98f414f1c --- /dev/null +++ b/demos/speech_web/web_client/src/components/style.less @@ -0,0 +1,83 @@ +.experience { + width: 100%; + height: 709px; + // background: url("../assets/image/在线体验-背景@2x.png") no-repeat; + background-size: 100% 709px; + background-position: initial; + // + .experience_wrapper { + width: 1200px; + height: 709px; + margin: 0 auto; + padding: 0px 0px 0px 0px; + box-sizing: border-box; + // background: red; + .experience_title { + height: 42px; + font-family: PingFangSC-Semibold; + font-size: 30px; + color: #000000; + font-weight: 600; + line-height: 42px; + text-align: center; + margin-bottom: 10px; + }; + .experience_describe { + height: 22px; + font-family: PingFangSC-Regular; + font-size: 14px; + color: #666666; + letter-spacing: 0; + text-align: center; + line-height: 22px; + font-weight: 400; + margin-bottom: 30px; + }; + .experience_content { + width: 1200px; + margin: 0 auto; + display: flex; + justify-content: center; + .experience_tabs { + + margin-top: 15px; + + & > .ant-tabs-nav { + margin-bottom: 20px; + + &::before { + content: none; + } + + .ant-tabs-nav-wrap { + justify-content: center; + } + + .ant-tabs-tab { + font-size: 20px; + } + + .ant-tabs-nav-list { + margin-right: -32px; + flex: none; + } + }; + + .ant-tabs-nav::before { + position: absolute; + right: 0; + left: 0; + border-bottom: 1px solid #f6f7fe; + content: ''; + }; + + }; + }; + }; +}; +.experience::after { + content: ""; + display: block; + clear: both; + visibility: hidden; +} \ No newline at end of file diff --git a/demos/speech_web/web_client/src/main.js b/demos/speech_web/web_client/src/main.js new file mode 100644 index 000000000..3fbf87c85 --- /dev/null +++ 
b/demos/speech_web/web_client/src/main.js @@ -0,0 +1,13 @@ +import { createApp } from 'vue' +import ElementPlus from 'element-plus' +import 'element-plus/dist/index.css' +import Antd from 'ant-design-vue'; +import 'ant-design-vue/dist/antd.css'; +import App from './App.vue' +import axios from 'axios' + +const app = createApp(App) +app.config.globalProperties.$http = axios + +app.use(ElementPlus).use(Antd) +app.mount('#app') diff --git a/demos/speech_web/web_client/vite.config.js b/demos/speech_web/web_client/vite.config.js new file mode 100644 index 000000000..dc7e6978c --- /dev/null +++ b/demos/speech_web/web_client/vite.config.js @@ -0,0 +1,28 @@ +import { defineConfig } from 'vite' +import vue from '@vitejs/plugin-vue' + +// https://vitejs.dev/config/ +export default defineConfig({ + plugins: [vue()], + css: + { preprocessorOptions: + { css: + { + charset: false + } + } + }, + build: { + assetsInlineLimit: '2048' // 2kb + }, + server: { + host: "0.0.0.0", + proxy: { + "/api": { + target: "http://localhost:8010", + changeOrigin: true, + rewrite: (path) => path.replace(/^\/api/, ""), + }, + }, + }, +}) diff --git a/demos/speech_web/web_client/yarn.lock b/demos/speech_web/web_client/yarn.lock new file mode 100644 index 000000000..6777cf4ce --- /dev/null +++ b/demos/speech_web/web_client/yarn.lock @@ -0,0 +1,785 @@ +# THIS IS AN AUTOGENERATED FILE. DO NOT EDIT THIS FILE DIRECTLY. +# yarn lockfile v1 + + +"@ant-design/colors@^6.0.0": + version "6.0.0" + resolved "https://registry.npmmirror.com/@ant-design/colors/-/colors-6.0.0.tgz" + integrity sha512-qAZRvPzfdWHtfameEGP2Qvuf838NhergR35o+EuVyB5XvSA98xod5r4utvi4TJ3ywmevm290g9nsCG5MryrdWQ== + dependencies: + "@ctrl/tinycolor" "^3.4.0" + +"@ant-design/icons-svg@^4.2.1": + version "4.2.1" + resolved "https://registry.npmmirror.com/@ant-design/icons-svg/-/icons-svg-4.2.1.tgz" + integrity sha512-EB0iwlKDGpG93hW8f85CTJTs4SvMX7tt5ceupvhALp1IF44SeUFOMhKUOYqpsoYWQKAOuTRDMqn75rEaKDp0Xw== + +"@ant-design/icons-vue@^6.0.0": + version "6.1.0" + resolved "https://registry.npmmirror.com/@ant-design/icons-vue/-/icons-vue-6.1.0.tgz" + integrity sha512-EX6bYm56V+ZrKN7+3MT/ubDkvJ5rK/O2t380WFRflDcVFgsvl3NLH7Wxeau6R8DbrO5jWR6DSTC3B6gYFp77AA== + dependencies: + "@ant-design/colors" "^6.0.0" + "@ant-design/icons-svg" "^4.2.1" + +"@babel/parser@^7.16.4": + version "7.17.9" + resolved "https://registry.npmmirror.com/@babel/parser/-/parser-7.17.9.tgz" + integrity sha512-vqUSBLP8dQHFPdPi9bc5GK9vRkYHJ49fsZdtoJ8EQ8ibpwk5rPKfvNIwChB0KVXcIjcepEBBd2VHC5r9Gy8ueg== + +"@babel/runtime@^7.10.5": + version "7.17.9" + resolved "https://registry.npmmirror.com/@babel/runtime/-/runtime-7.17.9.tgz" + integrity sha512-lSiBBvodq29uShpWGNbgFdKYNiFDo5/HIYsaCEY9ff4sb10x9jizo2+pRrSyF4jKZCXqgzuqBOQKbUm90gQwJg== + dependencies: + regenerator-runtime "^0.13.4" + +"@ctrl/tinycolor@^3.4.0": + version "3.4.1" + resolved "https://registry.npmmirror.com/@ctrl/tinycolor/-/tinycolor-3.4.1.tgz" + integrity sha512-ej5oVy6lykXsvieQtqZxCOaLT+xD4+QNarq78cIYISHmZXshCvROLudpQN3lfL8G0NL7plMSSK+zlyvCaIJ4Iw== + +"@element-plus/icons-vue@^1.1.4": + version "1.1.4" + resolved "https://registry.npmmirror.com/@element-plus/icons-vue/-/icons-vue-1.1.4.tgz" + integrity sha512-Iz/nHqdp1sFPmdzRwHkEQQA3lKvoObk8azgABZ81QUOpW9s/lUyQVUSh0tNtEPZXQlKwlSh7SPgoVxzrE0uuVQ== + +"@floating-ui/core@^0.6.1": + version "0.6.1" + resolved "https://registry.npmmirror.com/@floating-ui/core/-/core-0.6.1.tgz" + integrity sha512-Y30eVMcZva8o84c0HcXAtDO4BEzPJMvF6+B7x7urL2xbAqVsGJhojOyHLaoQHQYjb6OkqRq5kO+zeySycQwKqg== + 
+"@floating-ui/dom@^0.4.2": + version "0.4.4" + resolved "https://registry.npmmirror.com/@floating-ui/dom/-/dom-0.4.4.tgz" + integrity sha512-0Ulu3B/dqQplUUSqnTx0foSrlYuMN+GTtlJWvNJwt6Fr7/PqmlR/Y08o6/+bxDWr6p3roBJRaQ51MDZsNmEhhw== + dependencies: + "@floating-ui/core" "^0.6.1" + +"@popperjs/core@^2.11.4": + version "2.11.5" + resolved "https://registry.npmmirror.com/@popperjs/core/-/core-2.11.5.tgz" + integrity sha512-9X2obfABZuDVLCgPK9aX0a/x4jaOEweTTWE2+9sr0Qqqevj2Uv5XorvusThmc9XGYpS9yI+fhh8RTafBtGposw== + +"@simonwep/pickr@~1.8.0": + version "1.8.2" + resolved "https://registry.npmmirror.com/@simonwep/pickr/-/pickr-1.8.2.tgz" + integrity sha512-/l5w8BIkrpP6n1xsetx9MWPWlU6OblN5YgZZphxan0Tq4BByTCETL6lyIeY8lagalS2Nbt4F2W034KHLIiunKA== + dependencies: + core-js "^3.15.1" + nanopop "^2.1.0" + +"@types/lodash-es@^4.17.6": + version "4.17.6" + resolved "https://registry.npmmirror.com/@types/lodash-es/-/lodash-es-4.17.6.tgz" + integrity sha512-R+zTeVUKDdfoRxpAryaQNRKk3105Rrgx2CFRClIgRGaqDTdjsm8h6IYA8ir584W3ePzkZfst5xIgDwYrlh9HLg== + dependencies: + "@types/lodash" "*" + +"@types/lodash@*", "@types/lodash@^4.14.181": + version "4.14.181" + resolved "https://registry.npmmirror.com/@types/lodash/-/lodash-4.14.181.tgz" + integrity sha512-n3tyKthHJbkiWhDZs3DkhkCzt2MexYHXlX0td5iMplyfwketaOeKboEVBqzceH7juqvEg3q5oUoBFxSLu7zFag== + +"@vitejs/plugin-vue@^2.3.0": + version "2.3.1" + resolved "https://registry.npmmirror.com/@vitejs/plugin-vue/-/plugin-vue-2.3.1.tgz" + integrity sha512-YNzBt8+jt6bSwpt7LP890U1UcTOIZZxfpE5WOJ638PNxSEKOqAi0+FSKS0nVeukfdZ0Ai/H7AFd6k3hayfGZqQ== + +"@vue/compiler-core@3.2.32": + version "3.2.32" + resolved "https://registry.npmmirror.com/@vue/compiler-core/-/compiler-core-3.2.32.tgz" + integrity sha512-bRQ8Rkpm/aYFElDWtKkTPHeLnX5pEkNxhPUcqu5crEJIilZH0yeFu/qUAcV4VfSE2AudNPkQSOwMZofhnuutmA== + dependencies: + "@babel/parser" "^7.16.4" + "@vue/shared" "3.2.32" + estree-walker "^2.0.2" + source-map "^0.6.1" + +"@vue/compiler-dom@3.2.32": + version "3.2.32" + resolved "https://registry.npmmirror.com/@vue/compiler-dom/-/compiler-dom-3.2.32.tgz" + integrity sha512-maa3PNB/NxR17h2hDQfcmS02o1f9r9QIpN1y6fe8tWPrS1E4+q8MqrvDDQNhYVPd84rc3ybtyumrgm9D5Rf/kg== + dependencies: + "@vue/compiler-core" "3.2.32" + "@vue/shared" "3.2.32" + +"@vue/compiler-sfc@3.2.32": + version "3.2.32" + resolved "https://registry.npmmirror.com/@vue/compiler-sfc/-/compiler-sfc-3.2.32.tgz" + integrity sha512-uO6+Gh3AVdWm72lRRCjMr8nMOEqc6ezT9lWs5dPzh1E9TNaJkMYPaRtdY9flUv/fyVQotkfjY/ponjfR+trPSg== + dependencies: + "@babel/parser" "^7.16.4" + "@vue/compiler-core" "3.2.32" + "@vue/compiler-dom" "3.2.32" + "@vue/compiler-ssr" "3.2.32" + "@vue/reactivity-transform" "3.2.32" + "@vue/shared" "3.2.32" + estree-walker "^2.0.2" + magic-string "^0.25.7" + postcss "^8.1.10" + source-map "^0.6.1" + +"@vue/compiler-ssr@3.2.32": + version "3.2.32" + resolved "https://registry.npmmirror.com/@vue/compiler-ssr/-/compiler-ssr-3.2.32.tgz" + integrity sha512-ZklVUF/SgTx6yrDUkaTaBL/JMVOtSocP+z5Xz/qIqqLdW/hWL90P+ob/jOQ0Xc/om57892Q7sRSrex0wujOL2Q== + dependencies: + "@vue/compiler-dom" "3.2.32" + "@vue/shared" "3.2.32" + +"@vue/reactivity-transform@3.2.32": + version "3.2.32" + resolved "https://registry.npmmirror.com/@vue/reactivity-transform/-/reactivity-transform-3.2.32.tgz" + integrity sha512-CW1W9zaJtE275tZSWIfQKiPG0iHpdtSlmTqYBu7Y62qvtMgKG5yOxtvBs4RlrZHlaqFSE26avLAgQiTp4YHozw== + dependencies: + "@babel/parser" "^7.16.4" + "@vue/compiler-core" "3.2.32" + "@vue/shared" "3.2.32" + estree-walker "^2.0.2" + magic-string "^0.25.7" + 
+"@vue/reactivity@3.2.32": + version "3.2.32" + resolved "https://registry.npmmirror.com/@vue/reactivity/-/reactivity-3.2.32.tgz" + integrity sha512-4zaDumuyDqkuhbb63hRd+YHFGopW7srFIWesLUQ2su/rJfWrSq3YUvoKAJE8Eu1EhZ2Q4c1NuwnEreKj1FkDxA== + dependencies: + "@vue/shared" "3.2.32" + +"@vue/runtime-core@3.2.32": + version "3.2.32" + resolved "https://registry.npmmirror.com/@vue/runtime-core/-/runtime-core-3.2.32.tgz" + integrity sha512-uKKzK6LaCnbCJ7rcHvsK0azHLGpqs+Vi9B28CV1mfWVq1F3Bj8Okk3cX+5DtD06aUh4V2bYhS2UjjWiUUKUF0w== + dependencies: + "@vue/reactivity" "3.2.32" + "@vue/shared" "3.2.32" + +"@vue/runtime-dom@3.2.32": + version "3.2.32" + resolved "https://registry.npmmirror.com/@vue/runtime-dom/-/runtime-dom-3.2.32.tgz" + integrity sha512-AmlIg+GPqjkNoADLjHojEX5RGcAg+TsgXOOcUrtDHwKvA8mO26EnLQLB8nylDjU6AMJh2CIYn8NEgyOV5ZIScQ== + dependencies: + "@vue/runtime-core" "3.2.32" + "@vue/shared" "3.2.32" + csstype "^2.6.8" + +"@vue/server-renderer@3.2.32": + version "3.2.32" + resolved "https://registry.npmmirror.com/@vue/server-renderer/-/server-renderer-3.2.32.tgz" + integrity sha512-TYKpZZfRJpGTTiy/s6bVYwQJpAUx3G03z4G7/3O18M11oacrMTVHaHjiPuPqf3xQtY8R4LKmQ3EOT/DRCA/7Wg== + dependencies: + "@vue/compiler-ssr" "3.2.32" + "@vue/shared" "3.2.32" + +"@vue/shared@3.2.32": + version "3.2.32" + resolved "https://registry.npmmirror.com/@vue/shared/-/shared-3.2.32.tgz" + integrity sha512-bjcixPErUsAnTQRQX4Z5IQnICYjIfNCyCl8p29v1M6kfVzvwOICPw+dz48nNuWlTOOx2RHhzHdazJibE8GSnsw== + +"@vueuse/core@^8.2.4": + version "8.2.5" + resolved "https://registry.npmmirror.com/@vueuse/core/-/core-8.2.5.tgz" + integrity sha512-5prZAA1Ji2ltwNUnzreu6WIXYqHYP/9U2BiY5mD/650VYLpVcwVlYznJDFcLCmEWI3o3Vd34oS1FUf+6Mh68GQ== + dependencies: + "@vueuse/metadata" "8.2.5" + "@vueuse/shared" "8.2.5" + vue-demi "*" + +"@vueuse/metadata@8.2.5": + version "8.2.5" + resolved "https://registry.npmmirror.com/@vueuse/metadata/-/metadata-8.2.5.tgz" + integrity sha512-Lk9plJjh9cIdiRdcj16dau+2LANxIdFCiTgdfzwYXbflxq0QnMBeOD2qHgKDE7fuVrtPcVWj8VSuZEx1HRfNQA== + +"@vueuse/shared@8.2.5": + version "8.2.5" + resolved "https://registry.npmmirror.com/@vueuse/shared/-/shared-8.2.5.tgz" + integrity sha512-lNWo+7sk6JCuOj4AiYM+6HZ6fq4xAuVq1sVckMQKgfCJZpZRe4i8es+ZULO5bYTKP+VrOCtqrLR2GzEfrbr3YQ== + dependencies: + vue-demi "*" + +ant-design-vue@^2.2.8: + version "2.2.8" + resolved "https://registry.npmmirror.com/ant-design-vue/-/ant-design-vue-2.2.8.tgz" + integrity sha512-3graq9/gCfJQs6hznrHV6sa9oDmk/D1H3Oo0vLdVpPS/I61fZPk8NEyNKCHpNA6fT2cx6xx9U3QS63uuyikg/Q== + dependencies: + "@ant-design/icons-vue" "^6.0.0" + "@babel/runtime" "^7.10.5" + "@simonwep/pickr" "~1.8.0" + array-tree-filter "^2.1.0" + async-validator "^3.3.0" + dom-align "^1.12.1" + dom-scroll-into-view "^2.0.0" + lodash "^4.17.21" + lodash-es "^4.17.15" + moment "^2.27.0" + omit.js "^2.0.0" + resize-observer-polyfill "^1.5.1" + scroll-into-view-if-needed "^2.2.25" + shallow-equal "^1.0.0" + vue-types "^3.0.0" + warning "^4.0.0" + +array-tree-filter@^2.1.0: + version "2.1.0" + resolved "https://registry.npmmirror.com/array-tree-filter/-/array-tree-filter-2.1.0.tgz" + integrity sha512-4ROwICNlNw/Hqa9v+rk5h22KjmzB1JGTMVKP2AKJBOCgb0yL0ASf0+YvCcLNNwquOHNX48jkeZIJ3a+oOQqKcw== + +async-validator@^3.3.0: + version "3.5.2" + resolved "https://registry.npmmirror.com/async-validator/-/async-validator-3.5.2.tgz" + integrity sha512-8eLCg00W9pIRZSB781UUX/H6Oskmm8xloZfr09lz5bikRpBVDlJ3hRVuxxP1SxcwsEYfJ4IU8Q19Y8/893r3rQ== + +async-validator@^4.0.7: + version "4.0.7" + resolved 
"https://registry.npmmirror.com/async-validator/-/async-validator-4.0.7.tgz" + integrity sha512-Pj2IR7u8hmUEDOwB++su6baaRi+QvsgajuFB9j95foM1N2gy5HM4z60hfusIO0fBPG5uLAEl6yCJr1jNSVugEQ== + +axios@^0.26.1: + version "0.26.1" + resolved "https://registry.npmmirror.com/axios/-/axios-0.26.1.tgz" + integrity sha512-fPwcX4EvnSHuInCMItEhAGnaSEXRBjtzh9fOtsE6E1G6p7vl7edEeZe11QHf18+6+9gR5PbKV/sGKNaD8YaMeA== + dependencies: + follow-redirects "^1.14.8" + +compute-scroll-into-view@^1.0.17: + version "1.0.17" + resolved "https://registry.npmmirror.com/compute-scroll-into-view/-/compute-scroll-into-view-1.0.17.tgz" + integrity sha512-j4dx+Fb0URmzbwwMUrhqWM2BEWHdFGx+qZ9qqASHRPqvTYdqvWnHg0H1hIbcyLnvgnoNAVMlwkepyqM3DaIFUg== + +copy-anything@^2.0.1: + version "2.0.6" + resolved "https://registry.npmmirror.com/copy-anything/-/copy-anything-2.0.6.tgz" + integrity sha512-1j20GZTsvKNkc4BY3NpMOM8tt///wY3FpIzozTOFO2ffuZcV61nojHXVKIy3WM+7ADCy5FVhdZYHYDdgTU0yJw== + dependencies: + is-what "^3.14.1" + +core-js@^3.15.1: + version "3.22.5" + resolved "https://registry.npmmirror.com/core-js/-/core-js-3.22.5.tgz" + integrity sha512-VP/xYuvJ0MJWRAobcmQ8F2H6Bsn+s7zqAAjFaHGBMc5AQm7zaelhD1LGduFn2EehEcQcU+br6t+fwbpQ5d1ZWA== + +csstype@^2.6.8: + version "2.6.20" + resolved "https://registry.npmmirror.com/csstype/-/csstype-2.6.20.tgz" + integrity sha512-/WwNkdXfckNgw6S5R125rrW8ez139lBHWouiBvX8dfMFtcn6V81REDqnH7+CRpRipfYlyU1CmOnOxrmGcFOjeA== + +dayjs@^1.11.0: + version "1.11.0" + resolved "https://registry.npmmirror.com/dayjs/-/dayjs-1.11.0.tgz" + integrity sha512-JLC809s6Y948/FuCZPm5IX8rRhQwOiyMb2TfVVQEixG7P8Lm/gt5S7yoQZmC8x1UehI9Pb7sksEt4xx14m+7Ug== + +debug@^3.2.6: + version "3.2.7" + resolved "https://registry.npmmirror.com/debug/-/debug-3.2.7.tgz" + integrity sha512-CFjzYYAi4ThfiQvizrFQevTTXHtnCqWfe7x1AhgEscTz6ZbLbfoLRLPugTQyBth6f8ZERVUSyWHFD/7Wu4t1XQ== + dependencies: + ms "^2.1.1" + +dom-align@^1.12.1: + version "1.12.3" + resolved "https://registry.npmmirror.com/dom-align/-/dom-align-1.12.3.tgz" + integrity sha512-Gj9hZN3a07cbR6zviMUBOMPdWxYhbMI+x+WS0NAIu2zFZmbK8ys9R79g+iG9qLnlCwpFoaB+fKy8Pdv470GsPA== + +dom-scroll-into-view@^2.0.0: + version "2.0.1" + resolved "https://registry.npmmirror.com/dom-scroll-into-view/-/dom-scroll-into-view-2.0.1.tgz" + integrity sha512-bvVTQe1lfaUr1oFzZX80ce9KLDlZ3iU+XGNE/bz9HnGdklTieqsbmsLHe+rT2XWqopvL0PckkYqN7ksmm5pe3w== + +element-plus@^2.1.9: + version "2.1.9" + resolved "https://registry.npmmirror.com/element-plus/-/element-plus-2.1.9.tgz" + integrity sha512-6mWqS3YrmJPnouWP4otzL8+MehfOnDFqDbcIdnmC07p+Z0JkWe/CVKc4Wky8AYC8nyDMUQyiZYvooCbqGuM7pg== + dependencies: + "@ctrl/tinycolor" "^3.4.0" + "@element-plus/icons-vue" "^1.1.4" + "@floating-ui/dom" "^0.4.2" + "@popperjs/core" "^2.11.4" + "@types/lodash" "^4.14.181" + "@types/lodash-es" "^4.17.6" + "@vueuse/core" "^8.2.4" + async-validator "^4.0.7" + dayjs "^1.11.0" + escape-html "^1.0.3" + lodash "^4.17.21" + lodash-es "^4.17.21" + lodash-unified "^1.0.2" + memoize-one "^6.0.0" + normalize-wheel-es "^1.1.2" + +errno@^0.1.1: + version "0.1.8" + resolved "https://registry.npmmirror.com/errno/-/errno-0.1.8.tgz" + integrity sha512-dJ6oBr5SQ1VSd9qkk7ByRgb/1SH4JZjCHSW/mr63/QcXO9zLVxvJ6Oy13nio03rxpSnVDDjFor75SjVeZWPW/A== + dependencies: + prr "~1.0.1" + +esbuild-android-64@0.14.36: + version "0.14.36" + resolved "https://registry.yarnpkg.com/esbuild-android-64/-/esbuild-android-64-0.14.36.tgz#fc5f95ce78c8c3d790fa16bc71bd904f2bb42aa1" + integrity sha512-jwpBhF1jmo0tVCYC/ORzVN+hyVcNZUWuozGcLHfod0RJCedTDTvR4nwlTXdx1gtncDqjk33itjO+27OZHbiavw== + 
+esbuild-android-arm64@0.14.36: + version "0.14.36" + resolved "https://registry.yarnpkg.com/esbuild-android-arm64/-/esbuild-android-arm64-0.14.36.tgz#44356fbb9f8de82a5cdf11849e011dfb3ad0a8a8" + integrity sha512-/hYkyFe7x7Yapmfv4X/tBmyKnggUmdQmlvZ8ZlBnV4+PjisrEhAvC3yWpURuD9XoB8Wa1d5dGkTsF53pIvpjsg== + +esbuild-darwin-64@0.14.36: + version "0.14.36" + resolved "https://registry.npmmirror.com/esbuild-darwin-64/-/esbuild-darwin-64-0.14.36.tgz" + integrity sha512-kkl6qmV0dTpyIMKagluzYqlc1vO0ecgpviK/7jwPbRDEv5fejRTaBBEE2KxEQbTHcLhiiDbhG7d5UybZWo/1zQ== + +esbuild-darwin-arm64@0.14.36: + version "0.14.36" + resolved "https://registry.yarnpkg.com/esbuild-darwin-arm64/-/esbuild-darwin-arm64-0.14.36.tgz#2a8040c2e465131e5281034f3c72405e643cb7b2" + integrity sha512-q8fY4r2Sx6P0Pr3VUm//eFYKVk07C5MHcEinU1BjyFnuYz4IxR/03uBbDwluR6ILIHnZTE7AkTUWIdidRi1Jjw== + +esbuild-freebsd-64@0.14.36: + version "0.14.36" + resolved "https://registry.yarnpkg.com/esbuild-freebsd-64/-/esbuild-freebsd-64-0.14.36.tgz#d82c387b4d01fe9e8631f97d41eb54f2dbeb68a3" + integrity sha512-Hn8AYuxXXRptybPqoMkga4HRFE7/XmhtlQjXFHoAIhKUPPMeJH35GYEUWGbjteai9FLFvBAjEAlwEtSGxnqWww== + +esbuild-freebsd-arm64@0.14.36: + version "0.14.36" + resolved "https://registry.yarnpkg.com/esbuild-freebsd-arm64/-/esbuild-freebsd-arm64-0.14.36.tgz#e8ce2e6c697da6c7ecd0cc0ac821d47c5ab68529" + integrity sha512-S3C0attylLLRiCcHiJd036eDEMOY32+h8P+jJ3kTcfhJANNjP0TNBNL30TZmEdOSx/820HJFgRrqpNAvTbjnDA== + +esbuild-linux-32@0.14.36: + version "0.14.36" + resolved "https://registry.yarnpkg.com/esbuild-linux-32/-/esbuild-linux-32-0.14.36.tgz#a4a261e2af91986ea62451f2db712a556cb38a15" + integrity sha512-Eh9OkyTrEZn9WGO4xkI3OPPpUX7p/3QYvdG0lL4rfr73Ap2HAr6D9lP59VMF64Ex01LhHSXwIsFG/8AQjh6eNw== + +esbuild-linux-64@0.14.36: + version "0.14.36" + resolved "https://registry.yarnpkg.com/esbuild-linux-64/-/esbuild-linux-64-0.14.36.tgz#4a9500f9197e2c8fcb884a511d2c9d4c2debde72" + integrity sha512-vFVFS5ve7PuwlfgoWNyRccGDi2QTNkQo/2k5U5ttVD0jRFaMlc8UQee708fOZA6zTCDy5RWsT5MJw3sl2X6KDg== + +esbuild-linux-arm64@0.14.36: + version "0.14.36" + resolved "https://registry.yarnpkg.com/esbuild-linux-arm64/-/esbuild-linux-arm64-0.14.36.tgz#c91c21e25b315464bd7da867365dd1dae14ca176" + integrity sha512-24Vq1M7FdpSmaTYuu1w0Hdhiqkbto1I5Pjyi+4Cdw5fJKGlwQuw+hWynTcRI/cOZxBcBpP21gND7W27gHAiftw== + +esbuild-linux-arm@0.14.36: + version "0.14.36" + resolved "https://registry.yarnpkg.com/esbuild-linux-arm/-/esbuild-linux-arm-0.14.36.tgz#90e23bca2e6e549affbbe994f80ba3bb6c4d934a" + integrity sha512-NhgU4n+NCsYgt7Hy61PCquEz5aevI6VjQvxwBxtxrooXsxt5b2xtOUXYZe04JxqQo+XZk3d1gcr7pbV9MAQ/Lg== + +esbuild-linux-mips64le@0.14.36: + version "0.14.36" + resolved "https://registry.yarnpkg.com/esbuild-linux-mips64le/-/esbuild-linux-mips64le-0.14.36.tgz#40e11afb08353ff24709fc89e4db0f866bc131d2" + integrity sha512-hZUeTXvppJN+5rEz2EjsOFM9F1bZt7/d2FUM1lmQo//rXh1RTFYzhC0txn7WV0/jCC7SvrGRaRz0NMsRPf8SIA== + +esbuild-linux-ppc64le@0.14.36: + version "0.14.36" + resolved "https://registry.yarnpkg.com/esbuild-linux-ppc64le/-/esbuild-linux-ppc64le-0.14.36.tgz#9e8a588c513d06cc3859f9dcc52e5fdfce8a1a5e" + integrity sha512-1Bg3QgzZjO+QtPhP9VeIBhAduHEc2kzU43MzBnMwpLSZ890azr4/A9Dganun8nsqD/1TBcqhId0z4mFDO8FAvg== + +esbuild-linux-riscv64@0.14.36: + version "0.14.36" + resolved "https://registry.yarnpkg.com/esbuild-linux-riscv64/-/esbuild-linux-riscv64-0.14.36.tgz#e578c09b23b3b97652e60e3692bfda628b541f06" + integrity sha512-dOE5pt3cOdqEhaufDRzNCHf5BSwxgygVak9UR7PH7KPVHwSTDAZHDoEjblxLqjJYpc5XaU9+gKJ9F8mp9r5I4A== + 
+esbuild-linux-s390x@0.14.36: + version "0.14.36" + resolved "https://registry.yarnpkg.com/esbuild-linux-s390x/-/esbuild-linux-s390x-0.14.36.tgz#3c9dab40d0d69932ffded0fd7317bb403626c9bc" + integrity sha512-g4FMdh//BBGTfVHjF6MO7Cz8gqRoDPzXWxRvWkJoGroKA18G9m0wddvPbEqcQf5Tbt2vSc1CIgag7cXwTmoTXg== + +esbuild-netbsd-64@0.14.36: + version "0.14.36" + resolved "https://registry.yarnpkg.com/esbuild-netbsd-64/-/esbuild-netbsd-64-0.14.36.tgz#e27847f6d506218291619b8c1e121ecd97628494" + integrity sha512-UB2bVImxkWk4vjnP62ehFNZ73lQY1xcnL5ZNYF3x0AG+j8HgdkNF05v67YJdCIuUJpBuTyCK8LORCYo9onSW+A== + +esbuild-openbsd-64@0.14.36: + version "0.14.36" + resolved "https://registry.yarnpkg.com/esbuild-openbsd-64/-/esbuild-openbsd-64-0.14.36.tgz#c94c04c557fae516872a586eae67423da6d2fabb" + integrity sha512-NvGB2Chf8GxuleXRGk8e9zD3aSdRO5kLt9coTQbCg7WMGXeX471sBgh4kSg8pjx0yTXRt0MlrUDnjVYnetyivg== + +esbuild-sunos-64@0.14.36: + version "0.14.36" + resolved "https://registry.yarnpkg.com/esbuild-sunos-64/-/esbuild-sunos-64-0.14.36.tgz#9b79febc0df65a30f1c9bd63047d1675511bf99d" + integrity sha512-VkUZS5ftTSjhRjuRLp+v78auMO3PZBXu6xl4ajomGenEm2/rGuWlhFSjB7YbBNErOchj51Jb2OK8lKAo8qdmsQ== + +esbuild-windows-32@0.14.36: + version "0.14.36" + resolved "https://registry.yarnpkg.com/esbuild-windows-32/-/esbuild-windows-32-0.14.36.tgz#910d11936c8d2122ffdd3275e5b28d8a4e1240ec" + integrity sha512-bIar+A6hdytJjZrDxfMBUSEHHLfx3ynoEZXx/39nxy86pX/w249WZm8Bm0dtOAByAf4Z6qV0LsnTIJHiIqbw0w== + +esbuild-windows-64@0.14.36: + version "0.14.36" + resolved "https://registry.yarnpkg.com/esbuild-windows-64/-/esbuild-windows-64-0.14.36.tgz#21b4ce8b42a4efc63f4b58ec617f1302448aad26" + integrity sha512-+p4MuRZekVChAeueT1Y9LGkxrT5x7YYJxYE8ZOTcEfeUUN43vktSn6hUNsvxzzATrSgq5QqRdllkVBxWZg7KqQ== + +esbuild-windows-arm64@0.14.36: + version "0.14.36" + resolved "https://registry.yarnpkg.com/esbuild-windows-arm64/-/esbuild-windows-arm64-0.14.36.tgz#ba21546fecb7297667d0052d00150de22c044b24" + integrity sha512-fBB4WlDqV1m18EF/aheGYQkQZHfPHiHJSBYzXIo8yKehek+0BtBwo/4PNwKGJ5T0YK0oc8pBKjgwPbzSrPLb+Q== + +esbuild@^0.14.27: + version "0.14.36" + resolved "https://registry.npmmirror.com/esbuild/-/esbuild-0.14.36.tgz" + integrity sha512-HhFHPiRXGYOCRlrhpiVDYKcFJRdO0sBElZ668M4lh2ER0YgnkLxECuFe7uWCf23FrcLc59Pqr7dHkTqmRPDHmw== + optionalDependencies: + esbuild-android-64 "0.14.36" + esbuild-android-arm64 "0.14.36" + esbuild-darwin-64 "0.14.36" + esbuild-darwin-arm64 "0.14.36" + esbuild-freebsd-64 "0.14.36" + esbuild-freebsd-arm64 "0.14.36" + esbuild-linux-32 "0.14.36" + esbuild-linux-64 "0.14.36" + esbuild-linux-arm "0.14.36" + esbuild-linux-arm64 "0.14.36" + esbuild-linux-mips64le "0.14.36" + esbuild-linux-ppc64le "0.14.36" + esbuild-linux-riscv64 "0.14.36" + esbuild-linux-s390x "0.14.36" + esbuild-netbsd-64 "0.14.36" + esbuild-openbsd-64 "0.14.36" + esbuild-sunos-64 "0.14.36" + esbuild-windows-32 "0.14.36" + esbuild-windows-64 "0.14.36" + esbuild-windows-arm64 "0.14.36" + +escape-html@^1.0.3: + version "1.0.3" + resolved "https://registry.npmmirror.com/escape-html/-/escape-html-1.0.3.tgz" + integrity sha512-NiSupZ4OeuGwr68lGIeym/ksIZMJodUGOSCZ/FSnTxcrekbvqrgdUxlJOMpijaKZVjAJrWrGs/6Jy8OMuyj9ow== + +estree-walker@^2.0.2: + version "2.0.2" + resolved "https://registry.npmmirror.com/estree-walker/-/estree-walker-2.0.2.tgz" + integrity sha512-Rfkk/Mp/DL7JVje3u18FxFujQlTNR2q6QfMSMB7AvCBx91NGj/ba3kCfza0f6dVDbw7YlRf/nDrn7pQrCCyQ/w== + +follow-redirects@^1.14.8: + version "1.14.9" + resolved "https://registry.npmmirror.com/follow-redirects/-/follow-redirects-1.14.9.tgz" + integrity 
sha512-MQDfihBQYMcyy5dhRDJUHcw7lb2Pv/TuE6xP1vyraLukNDHKbDxDNaOE3NbCAdKQApno+GPRyo1YAp89yCjK4w== + +fsevents@~2.3.2: + version "2.3.2" + resolved "https://registry.npmmirror.com/fsevents/-/fsevents-2.3.2.tgz" + integrity sha512-xiqMQR4xAeHTuB9uWm+fFRcIOgKBMiOBP+eXiyT7jsgVCq1bkVygt00oASowB7EdtpOHaaPgKt812P9ab+DDKA== + +function-bind@^1.1.1: + version "1.1.1" + resolved "https://registry.npmmirror.com/function-bind/-/function-bind-1.1.1.tgz" + integrity sha512-yIovAzMX49sF8Yl58fSCWJ5svSLuaibPxXQJFLmBObTuCr0Mf1KiPopGM9NiFjiYBCbfaa2Fh6breQ6ANVTI0A== + +graceful-fs@^4.1.2: + version "4.2.10" + resolved "https://registry.npmmirror.com/graceful-fs/-/graceful-fs-4.2.10.tgz" + integrity sha512-9ByhssR2fPVsNZj478qUUbKfmL0+t5BDVyjShtyZZLiK7ZDAArFFfopyOTj0M05wE2tJPisA4iTnnXl2YoPvOA== + +has@^1.0.3: + version "1.0.3" + resolved "https://registry.npmmirror.com/has/-/has-1.0.3.tgz" + integrity sha512-f2dvO0VU6Oej7RkWJGrehjbzMAjFp5/VKPp5tTpWIV4JHHZK1/BxbFRtf/siA2SWTe09caDmVtYYzWEIbBS4zw== + dependencies: + function-bind "^1.1.1" + +iconv-lite@^0.4.4: + version "0.4.24" + resolved "https://registry.npmmirror.com/iconv-lite/-/iconv-lite-0.4.24.tgz" + integrity sha512-v3MXnZAcvnywkTUEZomIActle7RXXeedOR31wwl7VlyoXO4Qi9arvSenNQWne1TcRwhCL1HwLI21bEqdpj8/rA== + dependencies: + safer-buffer ">= 2.1.2 < 3" + +image-size@~0.5.0: + version "0.5.5" + resolved "https://registry.npmmirror.com/image-size/-/image-size-0.5.5.tgz" + integrity sha512-6TDAlDPZxUFCv+fuOkIoXT/V/f3Qbq8e37p+YOiYrUv3v9cc3/6x78VdfPgFVaB9dZYeLUfKgHRebpkm/oP2VQ== + +is-core-module@^2.8.1: + version "2.8.1" + resolved "https://registry.npmmirror.com/is-core-module/-/is-core-module-2.8.1.tgz" + integrity sha512-SdNCUs284hr40hFTFP6l0IfZ/RSrMXF3qgoRHd3/79unUTvrFO/JoXwkGm+5J/Oe3E/b5GsnG330uUNgRpu1PA== + dependencies: + has "^1.0.3" + +is-plain-object@3.0.1: + version "3.0.1" + resolved "https://registry.npmmirror.com/is-plain-object/-/is-plain-object-3.0.1.tgz" + integrity sha512-Xnpx182SBMrr/aBik8y+GuR4U1L9FqMSojwDQwPMmxyC6bvEqly9UBCxhauBF5vNh2gwWJNX6oDV7O+OM4z34g== + +is-what@^3.14.1: + version "3.14.1" + resolved "https://registry.npmmirror.com/is-what/-/is-what-3.14.1.tgz" + integrity sha512-sNxgpk9793nzSs7bA6JQJGeIuRBQhAaNGG77kzYQgMkrID+lS6SlK07K5LaptscDlSaIgH+GPFzf+d75FVxozA== + +js-audio-recorder@0.5.7: + version "0.5.7" + resolved "https://registry.npmmirror.com/js-audio-recorder/-/js-audio-recorder-0.5.7.tgz" + integrity sha512-DIlv30N86AYHr7zGHN0O7V/3Rd8Q6SIJ/MBzVJaT9STWTdhF4E/8fxCX6ZMgRSv8xmx6fEqcFFNPoofmxJD4+A== + +"js-tokens@^3.0.0 || ^4.0.0": + version "4.0.0" + resolved "https://registry.npmmirror.com/js-tokens/-/js-tokens-4.0.0.tgz" + integrity sha512-RdJUflcE3cUzKiMqQgsCu06FPu9UdIJO0beYbPhHN4k6apgJtifcoCtT9bcxOpYBtpD2kCM6Sbzg4CausW/PKQ== + +lamejs@^1.2.1: + version "1.2.1" + resolved "https://registry.npmmirror.com/lamejs/-/lamejs-1.2.1.tgz" + integrity sha512-s7bxvjvYthw6oPLCm5pFxvA84wUROODB8jEO2+CE1adhKgrIvVOlmMgY8zyugxGrvRaDHNJanOiS21/emty6dQ== + dependencies: + use-strict "1.0.1" + +less@^4.1.2: + version "4.1.2" + resolved "https://registry.npmmirror.com/less/-/less-4.1.2.tgz" + integrity sha512-EoQp/Et7OSOVu0aJknJOtlXZsnr8XE8KwuzTHOLeVSEx8pVWUICc8Q0VYRHgzyjX78nMEyC/oztWFbgyhtNfDA== + dependencies: + copy-anything "^2.0.1" + parse-node-version "^1.0.1" + tslib "^2.3.0" + optionalDependencies: + errno "^0.1.1" + graceful-fs "^4.1.2" + image-size "~0.5.0" + make-dir "^2.1.0" + mime "^1.4.1" + needle "^2.5.2" + source-map "~0.6.0" + +lodash-es@^4.17.15, lodash-es@^4.17.21: + version "4.17.21" + resolved 
"https://registry.npmmirror.com/lodash-es/-/lodash-es-4.17.21.tgz" + integrity sha512-mKnC+QJ9pWVzv+C4/U3rRsHapFfHvQFoFB92e52xeyGMcX6/OlIl78je1u8vePzYZSkkogMPJ2yjxxsb89cxyw== + +lodash-unified@^1.0.2: + version "1.0.2" + resolved "https://registry.npmmirror.com/lodash-unified/-/lodash-unified-1.0.2.tgz" + integrity sha512-OGbEy+1P+UT26CYi4opY4gebD8cWRDxAT6MAObIVQMiqYdxZr1g3QHWCToVsm31x2NkLS4K3+MC2qInaRMa39g== + +lodash@^4.17.21: + version "4.17.21" + resolved "https://registry.npmmirror.com/lodash/-/lodash-4.17.21.tgz" + integrity sha512-v2kDEe57lecTulaDIuNTPy3Ry4gLGJ6Z1O3vE1krgXZNrsQ+LFTGHVxVjcXPs17LhbZVGedAJv8XZ1tvj5FvSg== + +loose-envify@^1.0.0: + version "1.4.0" + resolved "https://registry.npmmirror.com/loose-envify/-/loose-envify-1.4.0.tgz" + integrity sha512-lyuxPGr/Wfhrlem2CL/UcnUc1zcqKAImBDzukY7Y5F/yQiNdko6+fRLevlw1HgMySw7f611UIY408EtxRSoK3Q== + dependencies: + js-tokens "^3.0.0 || ^4.0.0" + +magic-string@^0.25.7: + version "0.25.9" + resolved "https://registry.npmmirror.com/magic-string/-/magic-string-0.25.9.tgz" + integrity sha512-RmF0AsMzgt25qzqqLc1+MbHmhdx0ojF2Fvs4XnOqz2ZOBXzzkEwc/dJQZCYHAn7v1jbVOjAZfK8msRn4BxO4VQ== + dependencies: + sourcemap-codec "^1.4.8" + +make-dir@^2.1.0: + version "2.1.0" + resolved "https://registry.npmmirror.com/make-dir/-/make-dir-2.1.0.tgz" + integrity sha512-LS9X+dc8KLxXCb8dni79fLIIUA5VyZoyjSMCwTluaXA0o27cCK0bhXkpgw+sTXVpPy/lSO57ilRixqk0vDmtRA== + dependencies: + pify "^4.0.1" + semver "^5.6.0" + +memoize-one@^6.0.0: + version "6.0.0" + resolved "https://registry.npmmirror.com/memoize-one/-/memoize-one-6.0.0.tgz" + integrity sha512-rkpe71W0N0c0Xz6QD0eJETuWAJGnJ9afsl1srmwPrI+yBCkge5EycXXbYRyvL29zZVUWQCY7InPRCv3GDXuZNw== + +mime@^1.4.1: + version "1.6.0" + resolved "https://registry.npmmirror.com/mime/-/mime-1.6.0.tgz" + integrity sha512-x0Vn8spI+wuJ1O6S7gnbaQg8Pxh4NNHb7KSINmEWKiPE4RKOplvijn+NkmYmmRgP68mc70j2EbeTFRsrswaQeg== + +moment@^2.27.0: + version "2.29.4" + resolved "https://registry.yarnpkg.com/moment/-/moment-2.29.4.tgz#3dbe052889fe7c1b2ed966fcb3a77328964ef108" + integrity sha512-5LC9SOxjSc2HF6vO2CyuTDNivEdoz2IvyJJGj6X8DJ0eFyfszE0QiEd+iXmBvUP3WHxSjFH/vIsA0EN00cgr8w== + +ms@^2.1.1: + version "2.1.3" + resolved "https://registry.npmmirror.com/ms/-/ms-2.1.3.tgz" + integrity sha512-6FlzubTLZG3J2a/NVCAleEhjzq5oxgHyaCU9yYXvcLsvoVaHJq/s5xXI6/XXP6tz7R9xAOtHnSO/tXtF3WRTlA== + +nanoid@^3.3.1: + version "3.3.2" + resolved "https://registry.npmmirror.com/nanoid/-/nanoid-3.3.2.tgz" + integrity sha512-CuHBogktKwpm5g2sRgv83jEy2ijFzBwMoYA60orPDR7ynsLijJDqgsi4RDGj3OJpy3Ieb+LYwiRmIOGyytgITA== + +nanopop@^2.1.0: + version "2.1.0" + resolved "https://registry.npmmirror.com/nanopop/-/nanopop-2.1.0.tgz" + integrity sha512-jGTwpFRexSH+fxappnGQtN9dspgE2ipa1aOjtR24igG0pv6JCxImIAmrLRHX+zUF5+1wtsFVbKyfP51kIGAVNw== + +needle@^2.5.2: + version "2.9.1" + resolved "https://registry.npmmirror.com/needle/-/needle-2.9.1.tgz" + integrity sha512-6R9fqJ5Zcmf+uYaFgdIHmLwNldn5HbK8L5ybn7Uz+ylX/rnOsSp1AHcvQSrCaFN+qNM1wpymHqD7mVasEOlHGQ== + dependencies: + debug "^3.2.6" + iconv-lite "^0.4.4" + sax "^1.2.4" + +normalize-wheel-es@^1.1.2: + version "1.1.2" + resolved "https://registry.npmmirror.com/normalize-wheel-es/-/normalize-wheel-es-1.1.2.tgz" + integrity sha512-scX83plWJXYH1J4+BhAuIHadROzxX0UBF3+HuZNY2Ks8BciE7tSTQ+5JhTsvzjaO0/EJdm4JBGrfObKxFf3Png== + +omit.js@^2.0.0: + version "2.0.2" + resolved "https://registry.npmmirror.com/omit.js/-/omit.js-2.0.2.tgz" + integrity sha512-hJmu9D+bNB40YpL9jYebQl4lsTW6yEHRTroJzNLqQJYHm7c+NQnJGfZmIWh8S3q3KoaxV1aLhV6B3+0N0/kyJg== + 
+parse-node-version@^1.0.1: + version "1.0.1" + resolved "https://registry.npmmirror.com/parse-node-version/-/parse-node-version-1.0.1.tgz" + integrity sha512-3YHlOa/JgH6Mnpr05jP9eDG254US9ek25LyIxZlDItp2iJtwyaXQb57lBYLdT3MowkUFYEV2XXNAYIPlESvJlA== + +path-parse@^1.0.7: + version "1.0.7" + resolved "https://registry.npmmirror.com/path-parse/-/path-parse-1.0.7.tgz" + integrity sha512-LDJzPVEEEPR+y48z93A0Ed0yXb8pAByGWo/k5YYdYgpY2/2EsOsksJrq7lOHxryrVOn1ejG6oAp8ahvOIQD8sw== + +picocolors@^1.0.0: + version "1.0.0" + resolved "https://registry.npmmirror.com/picocolors/-/picocolors-1.0.0.tgz" + integrity sha512-1fygroTLlHu66zi26VoTDv8yRgm0Fccecssto+MhsZ0D/DGW2sm8E8AjW7NU5VVTRt5GxbeZ5qBuJr+HyLYkjQ== + +pify@^4.0.1: + version "4.0.1" + resolved "https://registry.npmmirror.com/pify/-/pify-4.0.1.tgz" + integrity sha512-uB80kBFb/tfd68bVleG9T5GGsGPjJrLAUpR5PZIrhBnIaRTQRjqdJSsIKkOP6OAIFbj7GOrcudc5pNjZ+geV2g== + +postcss@^8.1.10, postcss@^8.4.12: + version "8.4.12" + resolved "https://registry.npmmirror.com/postcss/-/postcss-8.4.12.tgz" + integrity sha512-lg6eITwYe9v6Hr5CncVbK70SoioNQIq81nsaG86ev5hAidQvmOeETBqs7jm43K2F5/Ley3ytDtriImV6TpNiSg== + dependencies: + nanoid "^3.3.1" + picocolors "^1.0.0" + source-map-js "^1.0.2" + +prr@~1.0.1: + version "1.0.1" + resolved "https://registry.npmmirror.com/prr/-/prr-1.0.1.tgz" + integrity sha512-yPw4Sng1gWghHQWj0B3ZggWUm4qVbPwPFcRG8KyxiU7J2OHFSoEHKS+EZ3fv5l1t9CyCiop6l/ZYeWbrgoQejw== + +regenerator-runtime@^0.13.4: + version "0.13.9" + resolved "https://registry.npmmirror.com/regenerator-runtime/-/regenerator-runtime-0.13.9.tgz" + integrity sha512-p3VT+cOEgxFsRRA9X4lkI1E+k2/CtnKtU4gcxyaCUreilL/vqI6CdZ3wxVUx3UOUg+gnUOQQcRI7BmSI656MYA== + +resize-observer-polyfill@^1.5.1: + version "1.5.1" + resolved "https://registry.npmmirror.com/resize-observer-polyfill/-/resize-observer-polyfill-1.5.1.tgz" + integrity sha512-LwZrotdHOo12nQuZlHEmtuXdqGoOD0OhaxopaNFxWzInpEgaLWoVuAMbTzixuosCx2nEG58ngzW3vxdWoxIgdg== + +resolve@^1.22.0: + version "1.22.0" + resolved "https://registry.npmmirror.com/resolve/-/resolve-1.22.0.tgz" + integrity sha512-Hhtrw0nLeSrFQ7phPp4OOcVjLPIeMnRlr5mcnVuMe7M/7eBn98A3hmFRLoFo3DLZkivSYwhRUJTyPyWAk56WLw== + dependencies: + is-core-module "^2.8.1" + path-parse "^1.0.7" + supports-preserve-symlinks-flag "^1.0.0" + +rollup@^2.59.0: + version "2.70.1" + resolved "https://registry.npmmirror.com/rollup/-/rollup-2.70.1.tgz" + integrity sha512-CRYsI5EuzLbXdxC6RnYhOuRdtz4bhejPMSWjsFLfVM/7w/85n2szZv6yExqUXsBdz5KT8eoubeyDUDjhLHEslA== + optionalDependencies: + fsevents "~2.3.2" + +"safer-buffer@>= 2.1.2 < 3": + version "2.1.2" + resolved "https://registry.npmmirror.com/safer-buffer/-/safer-buffer-2.1.2.tgz" + integrity sha512-YZo3K82SD7Riyi0E1EQPojLz7kpepnSQI9IyPbHHg1XXXevb5dJI7tpyN2ADxGcQbHG7vcyRHk0cbwqcQriUtg== + +sax@^1.2.4: + version "1.2.4" + resolved "https://registry.npmmirror.com/sax/-/sax-1.2.4.tgz" + integrity sha512-NqVDv9TpANUjFm0N8uM5GxL36UgKi9/atZw+x7YFnQ8ckwFGKrl4xX4yWtrey3UJm5nP1kUbnYgLopqWNSRhWw== + +scroll-into-view-if-needed@^2.2.25: + version "2.2.29" + resolved "https://registry.npmmirror.com/scroll-into-view-if-needed/-/scroll-into-view-if-needed-2.2.29.tgz" + integrity sha512-hxpAR6AN+Gh53AdAimHM6C8oTN1ppwVZITihix+WqalywBeFcQ6LdQP5ABNl26nX8GTEL7VT+b8lKpdqq65wXg== + dependencies: + compute-scroll-into-view "^1.0.17" + +semver@^5.6.0: + version "5.7.1" + resolved "https://registry.npmmirror.com/semver/-/semver-5.7.1.tgz" + integrity sha512-sauaDf/PZdVgrLTNYHRtpXa1iRiKcaebiKQ1BJdpQlWH2lCvexQdX55snPFyK7QzpudqbCI0qXFfOasHdyNDGQ== + 
+shallow-equal@^1.0.0: + version "1.2.1" + resolved "https://registry.npmmirror.com/shallow-equal/-/shallow-equal-1.2.1.tgz" + integrity sha512-S4vJDjHHMBaiZuT9NPb616CSmLf618jawtv3sufLl6ivK8WocjAo58cXwbRV1cgqxH0Qbv+iUt6m05eqEa2IRA== + +source-map-js@^1.0.2: + version "1.0.2" + resolved "https://registry.npmmirror.com/source-map-js/-/source-map-js-1.0.2.tgz" + integrity sha512-R0XvVJ9WusLiqTCEiGCmICCMplcCkIwwR11mOSD9CR5u+IXYdiseeEuXCVAjS54zqwkLcPNnmU4OeJ6tUrWhDw== + +source-map@^0.6.1, source-map@~0.6.0: + version "0.6.1" + resolved "https://registry.npmmirror.com/source-map/-/source-map-0.6.1.tgz" + integrity sha512-UjgapumWlbMhkBgzT7Ykc5YXUT46F0iKu8SGXq0bcwP5dz/h0Plj6enJqjz1Zbq2l5WaqYnrVbwWOWMyF3F47g== + +sourcemap-codec@^1.4.8: + version "1.4.8" + resolved "https://registry.npmmirror.com/sourcemap-codec/-/sourcemap-codec-1.4.8.tgz" + integrity sha512-9NykojV5Uih4lgo5So5dtw+f0JgJX30KCNI8gwhz2J9A15wD0Ml6tjHKwf6fTSa6fAdVBdZeNOs9eJ71qCk8vA== + +supports-preserve-symlinks-flag@^1.0.0: + version "1.0.0" + resolved "https://registry.npmmirror.com/supports-preserve-symlinks-flag/-/supports-preserve-symlinks-flag-1.0.0.tgz" + integrity sha512-ot0WnXS9fgdkgIcePe6RHNk1WA8+muPa6cSjeR3V8K27q9BB1rTE3R1p7Hv0z1ZyAc8s6Vvv8DIyWf681MAt0w== + +tslib@^2.3.0: + version "2.4.0" + resolved "https://registry.npmmirror.com/tslib/-/tslib-2.4.0.tgz" + integrity sha512-d6xOpEDfsi2CZVlPQzGeux8XMwLT9hssAsaPYExaQMuYskwb+x1x7J371tWlbBdWHroy99KnVB6qIkUbs5X3UQ== + +use-strict@1.0.1: + version "1.0.1" + resolved "https://registry.npmmirror.com/use-strict/-/use-strict-1.0.1.tgz" + integrity sha512-IeiWvvEXfW5ltKVMkxq6FvNf2LojMKvB2OCeja6+ct24S1XOmQw2dGr2JyndwACWAGJva9B7yPHwAmeA9QCqAQ== + +vite@^2.9.0: + version "2.9.1" + resolved "https://registry.npmmirror.com/vite/-/vite-2.9.1.tgz" + integrity sha512-vSlsSdOYGcYEJfkQ/NeLXgnRv5zZfpAsdztkIrs7AZHV8RCMZQkwjo4DS5BnrYTqoWqLoUe1Cah4aVO4oNNqCQ== + dependencies: + esbuild "^0.14.27" + postcss "^8.4.12" + resolve "^1.22.0" + rollup "^2.59.0" + optionalDependencies: + fsevents "~2.3.2" + +vue-demi@*: + version "0.12.5" + resolved "https://registry.npmmirror.com/vue-demi/-/vue-demi-0.12.5.tgz" + integrity sha512-BREuTgTYlUr0zw0EZn3hnhC3I6gPWv+Kwh4MCih6QcAeaTlaIX0DwOVN0wHej7hSvDPecz4jygy/idsgKfW58Q== + +vue-types@^3.0.0: + version "3.0.2" + resolved "https://registry.npmmirror.com/vue-types/-/vue-types-3.0.2.tgz" + integrity sha512-IwUC0Aq2zwaXqy74h4WCvFCUtoV0iSWr0snWnE9TnU18S66GAQyqQbRf2qfJtUuiFsBf6qp0MEwdonlwznlcrw== + dependencies: + is-plain-object "3.0.1" + +vue@^3.2.25: + version "3.2.32" + resolved "https://registry.npmmirror.com/vue/-/vue-3.2.32.tgz" + integrity sha512-6L3jKZApF042OgbCkh+HcFeAkiYi3Lovi8wNhWqIK98Pi5efAMLZzRHgi91v+60oIRxdJsGS9sTMsb+yDpY8Eg== + dependencies: + "@vue/compiler-dom" "3.2.32" + "@vue/compiler-sfc" "3.2.32" + "@vue/runtime-dom" "3.2.32" + "@vue/server-renderer" "3.2.32" + "@vue/shared" "3.2.32" + +warning@^4.0.0: + version "4.0.3" + resolved "https://registry.npmmirror.com/warning/-/warning-4.0.3.tgz" + integrity sha512-rpJyN222KWIvHJ/F53XSZv0Zl/accqHR8et1kpaMTD/fLCRxtV8iX8czMzY7sVZupTI3zcUTg8eycS2kNF9l6w== + dependencies: + loose-envify "^1.0.0" diff --git a/demos/streaming_asr_server/README.md b/demos/streaming_asr_server/README.md index a770f58c3..a97486757 100644 --- a/demos/streaming_asr_server/README.md +++ b/demos/streaming_asr_server/README.md @@ -7,12 +7,18 @@ This demo is an implementation of starting the streaming speech service and acce Streaming ASR server only support `websocket` protocol, and doesn't support `http` protocol. 
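Because the server speaks only the websocket protocol, a custom client just has to open a websocket connection and push audio chunks to it. The sketch below is not the official client protocol: the endpoint path, port 8090 and the `server_ready` greeting are taken from the example logs in this demo, while the raw-PCM chunk format and the lack of an explicit start/stop handshake are assumptions; the authoritative message schema is the WebSocket API page linked below. For real use, prefer the `paddlespeech_client asr_online` command or the Python API shown later in this document.

```python
import asyncio
import wave

import websockets  # third-party package: pip install websockets


async def print_server_messages(ws):
    # Print every message pushed by the server (partial and final results).
    async for message in ws:
        print("server:", message)


async def stream_wav(url: str, wav_path: str, chunk_ms: int = 80) -> None:
    with wave.open(wav_path, "rb") as wav:
        frames_per_chunk = wav.getframerate() * chunk_ms // 1000
        async with websockets.connect(url) as ws:
            # The demo logs show the server greeting the client with
            # {"status": "ok", "signal": "server_ready"} before decoding starts.
            print("server:", await ws.recv())
            reader = asyncio.create_task(print_server_messages(ws))
            while True:
                chunk = wav.readframes(frames_per_chunk)
                if not chunk:
                    break
                await ws.send(chunk)                  # 16 kHz, 16-bit PCM bytes (assumption)
                await asyncio.sleep(chunk_ms / 1000)  # pace the stream roughly in real time
            await ws.close()
            await reader


if __name__ == "__main__":
    # Endpoint path and port follow the logs in this demo; adjust them to your config.
    asyncio.run(stream_wav("ws://127.0.0.1:8090/paddlespeech/asr/streaming", "./zh.wav"))
```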
+For the service API definition, please refer to: +- [PaddleSpeech Streaming Server WebSocket API](https://github.com/PaddlePaddle/PaddleSpeech/wiki/PaddleSpeech-Server-WebSocket-API) + ## Usage ### 1. Installation see [installation](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install.md). -It is recommended to use **paddlepaddle 2.2.1** or above. -You can choose one way from meduim and hard to install paddlespeech. +It is recommended to use **paddlepaddle 2.3.1** or above. + +You can choose one way from easy, medium and hard to install PaddleSpeech. + +**If you install in easy mode, you need to prepare the yaml file by yourself; you can refer to the yaml files in the `conf` directory.** ### 2. Prepare config File The configuration file can be found in `conf/ws_application.yaml` 和 `conf/ws_conformer_wenetspeech_application.yaml`. @@ -47,28 +53,28 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav - `log_file`: log file. Default: `./log/paddlespeech.log` Output: - ```bash - [2022-05-14 04:56:13,086] [ INFO] - create the online asr engine instance - [2022-05-14 04:56:13,086] [ INFO] - paddlespeech_server set the device: cpu - [2022-05-14 04:56:13,087] [ INFO] - Load the pretrained model, tag = conformer_online_wenetspeech-zh-16k - [2022-05-14 04:56:13,087] [ INFO] - File /root/.paddlespeech/models/conformer_online_wenetspeech-zh-16k/asr1_chunk_conformer_wenetspeech_ckpt_1.0.0a.model.tar.gz md5 checking... - [2022-05-14 04:56:17,542] [ INFO] - Use pretrained model stored in: /root/.paddlespeech/models/conformer_online_wenetspeech-zh-16k/asr1_chunk_conformer_wenetspeech_ckpt_1. 0.0a.model.tar - [2022-05-14 04:56:17,543] [ INFO] - /root/.paddlespeech/models/conformer_online_wenetspeech-zh-16k/asr1_chunk_conformer_wenetspeech_ckpt_1.0.0a.model.tar - [2022-05-14 04:56:17,543] [ INFO] - /root/.paddlespeech/models/conformer_online_wenetspeech-zh-16k/asr1_chunk_conformer_wenetspeech_ckpt_1.0.0a.model.tar/model.yaml - [2022-05-14 04:56:17,543] [ INFO] - /root/.paddlespeech/models/conformer_online_wenetspeech-zh-16k/asr1_chunk_conformer_wenetspeech_ckpt_1.0.0a.model.tar/exp/ chunk_conformer/checkpoints/avg_10.pdparams - [2022-05-14 04:56:17,543] [ INFO] - /root/.paddlespeech/models/conformer_online_wenetspeech-zh-16k/asr1_chunk_conformer_wenetspeech_ckpt_1.0.0a.model.tar/exp/ chunk_conformer/checkpoints/avg_10.pdparams - [2022-05-14 04:56:17,852] [ INFO] - start to create the stream conformer asr engine - [2022-05-14 04:56:17,863] [ INFO] - model name: conformer_online - [2022-05-14 04:56:22,756] [ INFO] - create the transformer like model success - [2022-05-14 04:56:22,758] [ INFO] - Initialize ASR server engine successfully. - INFO: Started server process [4242] - [2022-05-14 04:56:22] [INFO] [server.py:75] Started server process [4242] - INFO: Waiting for application startup. - [2022-05-14 04:56:22] [INFO] [on.py:45] Waiting for application startup. - INFO: Application startup complete. - [2022-05-14 04:56:22] [INFO] [on.py:59] Application startup complete.
- INFO: Uvicorn running on http://0.0.0.0:8090 (Press CTRL+C to quit) - [2022-05-14 04:56:22] [INFO] [server.py:211] Uvicorn running on http://0.0.0.0:8090 (Press CTRL+C to quit) + ```text + [2022-05-14 04:56:13,086] [ INFO] - create the online asr engine instance + [2022-05-14 04:56:13,086] [ INFO] - paddlespeech_server set the device: cpu + [2022-05-14 04:56:13,087] [ INFO] - Load the pretrained model, tag = conformer_online_wenetspeech-zh-16k + [2022-05-14 04:56:13,087] [ INFO] - File /root/.paddlespeech/models/conformer_online_wenetspeech-zh-16k/asr1_chunk_conformer_wenetspeech_ckpt_1.0.0a.model.tar.gz md5 checking... + [2022-05-14 04:56:17,542] [ INFO] - Use pretrained model stored in: /root/.paddlespeech/models/conformer_online_wenetspeech-zh-16k/asr1_chunk_conformer_wenetspeech_ckpt_1. 0.0a.model.tar + [2022-05-14 04:56:17,543] [ INFO] - /root/.paddlespeech/models/conformer_online_wenetspeech-zh-16k/asr1_chunk_conformer_wenetspeech_ckpt_1.0.0a.model.tar + [2022-05-14 04:56:17,543] [ INFO] - /root/.paddlespeech/models/conformer_online_wenetspeech-zh-16k/asr1_chunk_conformer_wenetspeech_ckpt_1.0.0a.model.tar/model.yaml + [2022-05-14 04:56:17,543] [ INFO] - /root/.paddlespeech/models/conformer_online_wenetspeech-zh-16k/asr1_chunk_conformer_wenetspeech_ckpt_1.0.0a.model.tar/exp/ chunk_conformer/checkpoints/avg_10.pdparams + [2022-05-14 04:56:17,543] [ INFO] - /root/.paddlespeech/models/conformer_online_wenetspeech-zh-16k/asr1_chunk_conformer_wenetspeech_ckpt_1.0.0a.model.tar/exp/ chunk_conformer/checkpoints/avg_10.pdparams + [2022-05-14 04:56:17,852] [ INFO] - start to create the stream conformer asr engine + [2022-05-14 04:56:17,863] [ INFO] - model name: conformer_online + [2022-05-14 04:56:22,756] [ INFO] - create the transformer like model success + [2022-05-14 04:56:22,758] [ INFO] - Initialize ASR server engine successfully. + INFO: Started server process [4242] + [2022-05-14 04:56:22] [INFO] [server.py:75] Started server process [4242] + INFO: Waiting for application startup. + [2022-05-14 04:56:22] [INFO] [on.py:45] Waiting for application startup. + INFO: Application startup complete. + [2022-05-14 04:56:22] [INFO] [on.py:59] Application startup complete. + INFO: Uvicorn running on http://0.0.0.0:8090 (Press CTRL+C to quit) + [2022-05-14 04:56:22] [INFO] [server.py:211] Uvicorn running on http://0.0.0.0:8090 (Press CTRL+C to quit) ``` - Python API @@ -84,28 +90,28 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav ``` Output: - ```bash - [2022-05-14 04:56:13,086] [ INFO] - create the online asr engine instance - [2022-05-14 04:56:13,086] [ INFO] - paddlespeech_server set the device: cpu - [2022-05-14 04:56:13,087] [ INFO] - Load the pretrained model, tag = conformer_online_wenetspeech-zh-16k - [2022-05-14 04:56:13,087] [ INFO] - File /root/.paddlespeech/models/conformer_online_wenetspeech-zh-16k/asr1_chunk_conformer_wenetspeech_ckpt_1.0.0a.model.tar.gz md5 checking... - [2022-05-14 04:56:17,542] [ INFO] - Use pretrained model stored in: /root/.paddlespeech/models/conformer_online_wenetspeech-zh-16k/asr1_chunk_conformer_wenetspeech_ckpt_1. 
0.0a.model.tar - [2022-05-14 04:56:17,543] [ INFO] - /root/.paddlespeech/models/conformer_online_wenetspeech-zh-16k/asr1_chunk_conformer_wenetspeech_ckpt_1.0.0a.model.tar - [2022-05-14 04:56:17,543] [ INFO] - /root/.paddlespeech/models/conformer_online_wenetspeech-zh-16k/asr1_chunk_conformer_wenetspeech_ckpt_1.0.0a.model.tar/model.yaml - [2022-05-14 04:56:17,543] [ INFO] - /root/.paddlespeech/models/conformer_online_wenetspeech-zh-16k/asr1_chunk_conformer_wenetspeech_ckpt_1.0.0a.model.tar/exp/ chunk_conformer/checkpoints/avg_10.pdparams - [2022-05-14 04:56:17,543] [ INFO] - /root/.paddlespeech/models/conformer_online_wenetspeech-zh-16k/asr1_chunk_conformer_wenetspeech_ckpt_1.0.0a.model.tar/exp/ chunk_conformer/checkpoints/avg_10.pdparams - [2022-05-14 04:56:17,852] [ INFO] - start to create the stream conformer asr engine - [2022-05-14 04:56:17,863] [ INFO] - model name: conformer_online - [2022-05-14 04:56:22,756] [ INFO] - create the transformer like model success - [2022-05-14 04:56:22,758] [ INFO] - Initialize ASR server engine successfully. - INFO: Started server process [4242] - [2022-05-14 04:56:22] [INFO] [server.py:75] Started server process [4242] - INFO: Waiting for application startup. - [2022-05-14 04:56:22] [INFO] [on.py:45] Waiting for application startup. - INFO: Application startup complete. - [2022-05-14 04:56:22] [INFO] [on.py:59] Application startup complete. - INFO: Uvicorn running on http://0.0.0.0:8090 (Press CTRL+C to quit) - [2022-05-14 04:56:22] [INFO] [server.py:211] Uvicorn running on http://0.0.0.0:8090 (Press CTRL+C to quit) + ```text + [2022-05-14 04:56:13,086] [ INFO] - create the online asr engine instance + [2022-05-14 04:56:13,086] [ INFO] - paddlespeech_server set the device: cpu + [2022-05-14 04:56:13,087] [ INFO] - Load the pretrained model, tag = conformer_online_wenetspeech-zh-16k + [2022-05-14 04:56:13,087] [ INFO] - File /root/.paddlespeech/models/conformer_online_wenetspeech-zh-16k/asr1_chunk_conformer_wenetspeech_ckpt_1.0.0a.model.tar.gz md5 checking... + [2022-05-14 04:56:17,542] [ INFO] - Use pretrained model stored in: /root/.paddlespeech/models/conformer_online_wenetspeech-zh-16k/asr1_chunk_conformer_wenetspeech_ckpt_1. 0.0a.model.tar + [2022-05-14 04:56:17,543] [ INFO] - /root/.paddlespeech/models/conformer_online_wenetspeech-zh-16k/asr1_chunk_conformer_wenetspeech_ckpt_1.0.0a.model.tar + [2022-05-14 04:56:17,543] [ INFO] - /root/.paddlespeech/models/conformer_online_wenetspeech-zh-16k/asr1_chunk_conformer_wenetspeech_ckpt_1.0.0a.model.tar/model.yaml + [2022-05-14 04:56:17,543] [ INFO] - /root/.paddlespeech/models/conformer_online_wenetspeech-zh-16k/asr1_chunk_conformer_wenetspeech_ckpt_1.0.0a.model.tar/exp/ chunk_conformer/checkpoints/avg_10.pdparams + [2022-05-14 04:56:17,543] [ INFO] - /root/.paddlespeech/models/conformer_online_wenetspeech-zh-16k/asr1_chunk_conformer_wenetspeech_ckpt_1.0.0a.model.tar/exp/ chunk_conformer/checkpoints/avg_10.pdparams + [2022-05-14 04:56:17,852] [ INFO] - start to create the stream conformer asr engine + [2022-05-14 04:56:17,863] [ INFO] - model name: conformer_online + [2022-05-14 04:56:22,756] [ INFO] - create the transformer like model success + [2022-05-14 04:56:22,758] [ INFO] - Initialize ASR server engine successfully. + INFO: Started server process [4242] + [2022-05-14 04:56:22] [INFO] [server.py:75] Started server process [4242] + INFO: Waiting for application startup. + [2022-05-14 04:56:22] [INFO] [on.py:45] Waiting for application startup. + INFO: Application startup complete. 
+ [2022-05-14 04:56:22] [INFO] [on.py:59] Application startup complete. + INFO: Uvicorn running on http://0.0.0.0:8090 (Press CTRL+C to quit) + [2022-05-14 04:56:22] [INFO] [server.py:211] Uvicorn running on http://0.0.0.0:8090 (Press CTRL+C to quit) ``` @@ -116,7 +122,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav If `127.0.0.1` is not accessible, you need to use the actual service IP address. - ``` + ```bash paddlespeech_client asr_online --server_ip 127.0.0.1 --port 8090 --input ./zh.wav ``` @@ -125,6 +131,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav ```bash paddlespeech_client asr_online --help ``` + Arguments: - `server_ip`: server ip. Default: 127.0.0.1 - `port`: server port. Default: 8090 @@ -136,75 +143,74 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav - `punc.server_port`: punctuation server port. Default: None. Output: - ```bash - [2022-05-06 21:10:35,598] [ INFO] - Start to do streaming asr client - [2022-05-06 21:10:35,600] [ INFO] - asr websocket client start - [2022-05-06 21:10:35,600] [ INFO] - endpoint: ws://127.0.0.1:8390/paddlespeech/asr/streaming - [2022-05-06 21:10:35,600] [ INFO] - start to process the wavscp: ./zh.wav - [2022-05-06 21:10:35,670] [ INFO] - client receive msg={"status": "ok", "signal": "server_ready"} - [2022-05-06 21:10:35,699] [ INFO] - client receive msg={'result': ''} - [2022-05-06 21:10:35,713] [ INFO] - client receive msg={'result': ''} - [2022-05-06 21:10:35,726] [ INFO] - client receive msg={'result': ''} - [2022-05-06 21:10:35,738] [ INFO] - client receive msg={'result': ''} - [2022-05-06 21:10:35,750] [ INFO] - client receive msg={'result': ''} - [2022-05-06 21:10:35,762] [ INFO] - client receive msg={'result': ''} - [2022-05-06 21:10:35,774] [ INFO] - client receive msg={'result': ''} - [2022-05-06 21:10:35,786] [ INFO] - client receive msg={'result': ''} - [2022-05-06 21:10:36,387] [ INFO] - client receive msg={'result': ''} - [2022-05-06 21:10:36,398] [ INFO] - client receive msg={'result': ''} - [2022-05-06 21:10:36,407] [ INFO] - client receive msg={'result': ''} - [2022-05-06 21:10:36,416] [ INFO] - client receive msg={'result': ''} - [2022-05-06 21:10:36,425] [ INFO] - client receive msg={'result': ''} - [2022-05-06 21:10:36,434] [ INFO] - client receive msg={'result': ''} - [2022-05-06 21:10:36,442] [ INFO] - client receive msg={'result': ''} - [2022-05-06 21:10:36,930] [ INFO] - client receive msg={'result': '我认为跑'} - [2022-05-06 21:10:36,938] [ INFO] - client receive msg={'result': '我认为跑'} - [2022-05-06 21:10:36,946] [ INFO] - client receive msg={'result': '我认为跑'} - [2022-05-06 21:10:36,954] [ INFO] - client receive msg={'result': '我认为跑'} - [2022-05-06 21:10:36,962] [ INFO] - client receive msg={'result': '我认为跑'} - [2022-05-06 21:10:36,970] [ INFO] - client receive msg={'result': '我认为跑'} - [2022-05-06 21:10:36,977] [ INFO] - client receive msg={'result': '我认为跑'} - [2022-05-06 21:10:36,985] [ INFO] - client receive msg={'result': '我认为跑'} - [2022-05-06 21:10:37,484] [ INFO] - client receive msg={'result': '我认为跑步最重要的'} - [2022-05-06 21:10:37,492] [ INFO] - client receive msg={'result': '我认为跑步最重要的'} - [2022-05-06 21:10:37,500] [ INFO] - client receive msg={'result': '我认为跑步最重要的'} - [2022-05-06 21:10:37,508] [ INFO] - client receive msg={'result': '我认为跑步最重要的'} - [2022-05-06 21:10:37,517] [ INFO] - client receive msg={'result': '我认为跑步最重要的'} - [2022-05-06 21:10:37,525] [ INFO] - client receive msg={'result': '我认为跑步最重要的'} - [2022-05-06 21:10:37,532] [ INFO] - client receive 
msg={'result': '我认为跑步最重要的'} - [2022-05-06 21:10:38,050] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是'} - [2022-05-06 21:10:38,058] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是'} - [2022-05-06 21:10:38,066] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是'} - [2022-05-06 21:10:38,073] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是'} - [2022-05-06 21:10:38,081] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是'} - [2022-05-06 21:10:38,089] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是'} - [2022-05-06 21:10:38,097] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是'} - [2022-05-06 21:10:38,105] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是'} - [2022-05-06 21:10:38,630] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给'} - [2022-05-06 21:10:38,639] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给'} - [2022-05-06 21:10:38,647] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给'} - [2022-05-06 21:10:38,655] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给'} - [2022-05-06 21:10:38,663] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给'} - [2022-05-06 21:10:38,671] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给'} - [2022-05-06 21:10:38,679] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给'} - [2022-05-06 21:10:39,216] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了'} - [2022-05-06 21:10:39,224] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了'} - [2022-05-06 21:10:39,232] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了'} - [2022-05-06 21:10:39,240] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了'} - [2022-05-06 21:10:39,248] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了'} - [2022-05-06 21:10:39,256] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了'} - [2022-05-06 21:10:39,264] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了'} - [2022-05-06 21:10:39,272] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了'} - [2022-05-06 21:10:39,885] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了身体健康'} - [2022-05-06 21:10:39,896] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了身体健康'} - [2022-05-06 21:10:39,905] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了身体健康'} - [2022-05-06 21:10:39,915] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了身体健康'} - [2022-05-06 21:10:39,924] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了身体健康'} - [2022-05-06 21:10:39,934] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了身体健康'} - [2022-05-06 21:10:44,827] [ INFO] - client final receive msg={'status': 'ok', 'signal': 'finished', 'result': '我认为跑步最重要的就是给我带来了身体健康', 'times': [{'w': '我', 'bg': 0.0, 'ed': 0.7000000000000001}, {'w': '认', 'bg': 0.7000000000000001, 'ed': 0.84}, {'w': '为', 'bg': 0.84, 'ed': 1.0}, {'w': '跑', 'bg': 1.0, 'ed': 1.18}, {'w': '步', 'bg': 1.18, 'ed': 1.36}, {'w': '最', 'bg': 1.36, 'ed': 1.5}, {'w': '重', 'bg': 1.5, 'ed': 1.6400000000000001}, {'w': '要', 'bg': 1.6400000000000001, 'ed': 1.78}, {'w': '的', 'bg': 1.78, 'ed': 1.9000000000000001}, {'w': '就', 'bg': 1.9000000000000001, 'ed': 2.06}, {'w': '是', 'bg': 2.06, 'ed': 2.62}, {'w': '给', 'bg': 2.62, 'ed': 3.16}, {'w': '我', 'bg': 3.16, 'ed': 3.3200000000000003}, {'w': '带', 'bg': 3.3200000000000003, 'ed': 3.48}, {'w': '来', 'bg': 3.48, 'ed': 3.62}, {'w': '了', 'bg': 3.62, 'ed': 3.7600000000000002}, {'w': '身', 'bg': 3.7600000000000002, 'ed': 3.9}, {'w': '体', 'bg': 3.9, 'ed': 4.0600000000000005}, {'w': '健', 'bg': 4.0600000000000005, 'ed': 4.26}, {'w': '康', 'bg': 4.26, 
'ed': 4.96}]} - [2022-05-06 21:10:44,827] [ INFO] - audio duration: 4.9968125, elapsed time: 9.225094079971313, RTF=1.846195765794957 - [2022-05-06 21:10:44,828] [ INFO] - asr websocket client finished : 我认为跑步最重要的就是给我带来了身体健康 - + ```text + [2022-05-06 21:10:35,598] [ INFO] - Start to do streaming asr client + [2022-05-06 21:10:35,600] [ INFO] - asr websocket client start + [2022-05-06 21:10:35,600] [ INFO] - endpoint: ws://127.0.0.1:8390/paddlespeech/asr/streaming + [2022-05-06 21:10:35,600] [ INFO] - start to process the wavscp: ./zh.wav + [2022-05-06 21:10:35,670] [ INFO] - client receive msg={"status": "ok", "signal": "server_ready"} + [2022-05-06 21:10:35,699] [ INFO] - client receive msg={'result': ''} + [2022-05-06 21:10:35,713] [ INFO] - client receive msg={'result': ''} + [2022-05-06 21:10:35,726] [ INFO] - client receive msg={'result': ''} + [2022-05-06 21:10:35,738] [ INFO] - client receive msg={'result': ''} + [2022-05-06 21:10:35,750] [ INFO] - client receive msg={'result': ''} + [2022-05-06 21:10:35,762] [ INFO] - client receive msg={'result': ''} + [2022-05-06 21:10:35,774] [ INFO] - client receive msg={'result': ''} + [2022-05-06 21:10:35,786] [ INFO] - client receive msg={'result': ''} + [2022-05-06 21:10:36,387] [ INFO] - client receive msg={'result': ''} + [2022-05-06 21:10:36,398] [ INFO] - client receive msg={'result': ''} + [2022-05-06 21:10:36,407] [ INFO] - client receive msg={'result': ''} + [2022-05-06 21:10:36,416] [ INFO] - client receive msg={'result': ''} + [2022-05-06 21:10:36,425] [ INFO] - client receive msg={'result': ''} + [2022-05-06 21:10:36,434] [ INFO] - client receive msg={'result': ''} + [2022-05-06 21:10:36,442] [ INFO] - client receive msg={'result': ''} + [2022-05-06 21:10:36,930] [ INFO] - client receive msg={'result': '我认为跑'} + [2022-05-06 21:10:36,938] [ INFO] - client receive msg={'result': '我认为跑'} + [2022-05-06 21:10:36,946] [ INFO] - client receive msg={'result': '我认为跑'} + [2022-05-06 21:10:36,954] [ INFO] - client receive msg={'result': '我认为跑'} + [2022-05-06 21:10:36,962] [ INFO] - client receive msg={'result': '我认为跑'} + [2022-05-06 21:10:36,970] [ INFO] - client receive msg={'result': '我认为跑'} + [2022-05-06 21:10:36,977] [ INFO] - client receive msg={'result': '我认为跑'} + [2022-05-06 21:10:36,985] [ INFO] - client receive msg={'result': '我认为跑'} + [2022-05-06 21:10:37,484] [ INFO] - client receive msg={'result': '我认为跑步最重要的'} + [2022-05-06 21:10:37,492] [ INFO] - client receive msg={'result': '我认为跑步最重要的'} + [2022-05-06 21:10:37,500] [ INFO] - client receive msg={'result': '我认为跑步最重要的'} + [2022-05-06 21:10:37,508] [ INFO] - client receive msg={'result': '我认为跑步最重要的'} + [2022-05-06 21:10:37,517] [ INFO] - client receive msg={'result': '我认为跑步最重要的'} + [2022-05-06 21:10:37,525] [ INFO] - client receive msg={'result': '我认为跑步最重要的'} + [2022-05-06 21:10:37,532] [ INFO] - client receive msg={'result': '我认为跑步最重要的'} + [2022-05-06 21:10:38,050] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是'} + [2022-05-06 21:10:38,058] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是'} + [2022-05-06 21:10:38,066] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是'} + [2022-05-06 21:10:38,073] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是'} + [2022-05-06 21:10:38,081] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是'} + [2022-05-06 21:10:38,089] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是'} + [2022-05-06 21:10:38,097] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是'} + [2022-05-06 21:10:38,105] [ INFO] - client receive msg={'result': 
'我认为跑步最重要的就是'} + [2022-05-06 21:10:38,630] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给'} + [2022-05-06 21:10:38,639] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给'} + [2022-05-06 21:10:38,647] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给'} + [2022-05-06 21:10:38,655] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给'} + [2022-05-06 21:10:38,663] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给'} + [2022-05-06 21:10:38,671] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给'} + [2022-05-06 21:10:38,679] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给'} + [2022-05-06 21:10:39,216] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了'} + [2022-05-06 21:10:39,224] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了'} + [2022-05-06 21:10:39,232] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了'} + [2022-05-06 21:10:39,240] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了'} + [2022-05-06 21:10:39,248] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了'} + [2022-05-06 21:10:39,256] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了'} + [2022-05-06 21:10:39,264] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了'} + [2022-05-06 21:10:39,272] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了'} + [2022-05-06 21:10:39,885] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了身体健康'} + [2022-05-06 21:10:39,896] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了身体健康'} + [2022-05-06 21:10:39,905] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了身体健康'} + [2022-05-06 21:10:39,915] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了身体健康'} + [2022-05-06 21:10:39,924] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了身体健康'} + [2022-05-06 21:10:39,934] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了身体健康'} + [2022-05-06 21:10:44,827] [ INFO] - client final receive msg={'status': 'ok', 'signal': 'finished', 'result': '我认为跑步最重要的就是给我带来了身体健康', 'times': [{'w': '我', 'bg': 0.0, 'ed': 0.7000000000000001}, {'w': '认', 'bg': 0.7000000000000001, 'ed': 0.84}, {'w': '为', 'bg': 0.84, 'ed': 1.0}, {'w': '跑', 'bg': 1.0, 'ed': 1.18}, {'w': '步', 'bg': 1.18, 'ed': 1.36}, {'w': '最', 'bg': 1.36, 'ed': 1.5}, {'w': '重', 'bg': 1.5, 'ed': 1.6400000000000001}, {'w': '要', 'bg': 1.6400000000000001, 'ed': 1.78}, {'w': '的', 'bg': 1.78, 'ed': 1.9000000000000001}, {'w': '就', 'bg': 1.9000000000000001, 'ed': 2.06}, {'w': '是', 'bg': 2.06, 'ed': 2.62}, {'w': '给', 'bg': 2.62, 'ed': 3.16}, {'w': '我', 'bg': 3.16, 'ed': 3.3200000000000003}, {'w': '带', 'bg': 3.3200000000000003, 'ed': 3.48}, {'w': '来', 'bg': 3.48, 'ed': 3.62}, {'w': '了', 'bg': 3.62, 'ed': 3.7600000000000002}, {'w': '身', 'bg': 3.7600000000000002, 'ed': 3.9}, {'w': '体', 'bg': 3.9, 'ed': 4.0600000000000005}, {'w': '健', 'bg': 4.0600000000000005, 'ed': 4.26}, {'w': '康', 'bg': 4.26, 'ed': 4.96}]} + [2022-05-06 21:10:44,827] [ INFO] - audio duration: 4.9968125, elapsed time: 9.225094079971313, RTF=1.846195765794957 + [2022-05-06 21:10:44,828] [ INFO] - asr websocket client finished : 我认为跑步最重要的就是给我带来了身体健康 ``` - Python API @@ -223,7 +229,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav ``` Output: - ```bash + ```text [2022-05-06 21:14:03,137] [ INFO] - asr websocket client start [2022-05-06 21:14:03,137] [ INFO] - endpoint: ws://127.0.0.1:8390/paddlespeech/asr/streaming [2022-05-06 21:14:03,149] [ INFO] - client receive msg={"status": "ok", "signal": "server_ready"} @@ -298,12 +304,11 @@ wget -c 
https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav - Command Line **Note:** The default deployment of the server is on the 'CPU' device, which can be deployed on the 'GPU' by modifying the 'device' parameter in the service configuration file. - ``` bash + ```bash In PaddleSpeech/demos/streaming_asr_server directory to lanuch punctuation service paddlespeech_server start --config_file conf/punc_application.yaml ``` - Usage: ```bash paddlespeech_server start --help @@ -315,7 +320,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav Output: - ``` bash + ```text [2022-05-02 17:59:26,285] [ INFO] - Create the TextEngine Instance [2022-05-02 17:59:26,285] [ INFO] - Init the text engine [2022-05-02 17:59:26,285] [ INFO] - Text Engine set the device: gpu:0 @@ -347,26 +352,26 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav log_file="./log/paddlespeech.log") ``` - Output: - ``` - [2022-05-02 18:09:02,542] [ INFO] - Create the TextEngine Instance - [2022-05-02 18:09:02,543] [ INFO] - Init the text engine - [2022-05-02 18:09:02,543] [ INFO] - Text Engine set the device: gpu:0 - [2022-05-02 18:09:02,545] [ INFO] - File /home/users/xiongxinlei/.paddlespeech/models/ernie_linear_p3_wudao-punc-zh/ernie_linear_p3_wudao-punc-zh.tar.gz md5 checking... - [2022-05-02 18:09:06,919] [ INFO] - Use pretrained model stored in: /home/users/xiongxinlei/.paddlespeech/models/ernie_linear_p3_wudao-punc-zh/ernie_linear_p3_wudao-punc-zh.tar - W0502 18:09:07.523002 22615 device_context.cc:447] Please NOTE: device: 0, GPU Compute Capability: 6.1, Driver API Version: 10.2, Runtime API Version: 10.2 - W0502 18:09:07.527882 22615 device_context.cc:465] device: 0, cuDNN Version: 7.6. - [2022-05-02 18:09:10,900] [ INFO] - Already cached /home/users/xiongxinlei/.paddlenlp/models/ernie-1.0/vocab.txt - [2022-05-02 18:09:10,913] [ INFO] - Init the text engine successfully - INFO: Started server process [22615] - [2022-05-02 18:09:10] [INFO] [server.py:75] Started server process [22615] - INFO: Waiting for application startup. - [2022-05-02 18:09:10] [INFO] [on.py:45] Waiting for application startup. - INFO: Application startup complete. - [2022-05-02 18:09:10] [INFO] [on.py:59] Application startup complete. - INFO: Uvicorn running on http://0.0.0.0:8190 (Press CTRL+C to quit) - [2022-05-02 18:09:10] [INFO] [server.py:206] Uvicorn running on http://0.0.0.0:8190 (Press CTRL+C to quit) - ``` + Output: + ```text + [2022-05-02 18:09:02,542] [ INFO] - Create the TextEngine Instance + [2022-05-02 18:09:02,543] [ INFO] - Init the text engine + [2022-05-02 18:09:02,543] [ INFO] - Text Engine set the device: gpu:0 + [2022-05-02 18:09:02,545] [ INFO] - File /home/users/xiongxinlei/.paddlespeech/models/ernie_linear_p3_wudao-punc-zh/ernie_linear_p3_wudao-punc-zh.tar.gz md5 checking... + [2022-05-02 18:09:06,919] [ INFO] - Use pretrained model stored in: /home/users/xiongxinlei/.paddlespeech/models/ernie_linear_p3_wudao-punc-zh/ernie_linear_p3_wudao-punc-zh.tar + W0502 18:09:07.523002 22615 device_context.cc:447] Please NOTE: device: 0, GPU Compute Capability: 6.1, Driver API Version: 10.2, Runtime API Version: 10.2 + W0502 18:09:07.527882 22615 device_context.cc:465] device: 0, cuDNN Version: 7.6. 
+ [2022-05-02 18:09:10,900] [ INFO] - Already cached /home/users/xiongxinlei/.paddlenlp/models/ernie-1.0/vocab.txt + [2022-05-02 18:09:10,913] [ INFO] - Init the text engine successfully + INFO: Started server process [22615] + [2022-05-02 18:09:10] [INFO] [server.py:75] Started server process [22615] + INFO: Waiting for application startup. + [2022-05-02 18:09:10] [INFO] [on.py:45] Waiting for application startup. + INFO: Application startup complete. + [2022-05-02 18:09:10] [INFO] [on.py:59] Application startup complete. + INFO: Uvicorn running on http://0.0.0.0:8190 (Press CTRL+C to quit) + [2022-05-02 18:09:10] [INFO] [server.py:206] Uvicorn running on http://0.0.0.0:8190 (Press CTRL+C to quit) + ``` ### 2. Client usage **Note** The response time will be slightly longer when using the client for the first time @@ -375,17 +380,17 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav If `127.0.0.1` is not accessible, you need to use the actual service IP address. - ``` + ```bash paddlespeech_client text --server_ip 127.0.0.1 --port 8190 --input "我认为跑步最重要的就是给我带来了身体健康" ``` Output - ``` + ```text [2022-05-02 18:12:29,767] [ INFO] - The punc text: 我认为跑步最重要的就是给我带来了身体健康。 [2022-05-02 18:12:29,767] [ INFO] - Response time 0.096548 s. ``` -- Python3 API +- Python API ```python from paddlespeech.server.bin.paddlespeech_client import TextClientExecutor @@ -399,11 +404,10 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav ``` Output: - ``` bash + ```text 我认为跑步最重要的就是给我带来了身体健康。 ``` - ## Join streaming asr and punctuation server By default, each server is deployed on the 'CPU' device and speech recognition and punctuation prediction can be deployed on different 'GPU' by modifying the' device 'parameter in the service configuration file respectively. @@ -412,7 +416,7 @@ We use `streaming_ asr_server.py` and `punc_server.py` two services to lanuch st ### 1. Start two server -``` bash +```bash Note: streaming speech recognition and punctuation prediction are configured on different graphics cards through configuration files bash server.sh ``` @@ -422,11 +426,11 @@ bash server.sh If `127.0.0.1` is not accessible, you need to use the actual service IP address. - ``` + ```bash paddlespeech_client asr_online --server_ip 127.0.0.1 --port 8290 --punc.server_ip 127.0.0.1 --punc.port 8190 --input ./zh.wav ``` Output: - ``` + ```text [2022-05-07 11:21:47,060] [ INFO] - asr websocket client start [2022-05-07 11:21:47,060] [ INFO] - endpoint: ws://127.0.0.1:8490/paddlespeech/asr/streaming [2022-05-07 11:21:47,080] [ INFO] - client receive msg={"status": "ok", "signal": "server_ready"} @@ -500,11 +504,11 @@ bash server.sh If `127.0.0.1` is not accessible, you need to use the actual service IP address. 
- ``` + ```bash python3 websocket_client.py --server_ip 127.0.0.1 --port 8290 --punc.server_ip 127.0.0.1 --punc.port 8190 --wavfile ./zh.wav ``` Output: - ``` + ```text [2022-05-07 11:11:02,984] [ INFO] - Start to do streaming asr client [2022-05-07 11:11:02,985] [ INFO] - asr websocket client start [2022-05-07 11:11:02,985] [ INFO] - endpoint: ws://127.0.0.1:8490/paddlespeech/asr/streaming @@ -573,5 +577,3 @@ bash server.sh [2022-05-07 11:11:18,915] [ INFO] - audio duration: 4.9968125, elapsed time: 15.928460597991943, RTF=3.187724293835709 [2022-05-07 11:11:18,916] [ INFO] - asr websocket client finished : 我认为跑步最重要的就是给我带来了身体健康 ``` - - diff --git a/demos/streaming_asr_server/README_cn.md b/demos/streaming_asr_server/README_cn.md index c771869e9..267367729 100644 --- a/demos/streaming_asr_server/README_cn.md +++ b/demos/streaming_asr_server/README_cn.md @@ -3,17 +3,22 @@ # 流式语音识别服务 ## 介绍 -这个demo是一个启动流式语音服务和访问服务的实现。 它可以通过使用`paddlespeech_server` 和 `paddlespeech_client`的单个命令或 python 的几行代码来实现。 +这个 demo 是一个启动流式语音服务和访问服务的实现。 它可以通过使用 `paddlespeech_server` 和 `paddlespeech_client` 的单个命令或 python 的几行代码来实现。 **流式语音识别服务只支持 `weboscket` 协议,不支持 `http` 协议。** +服务接口定义请参考: +- [PaddleSpeech Streaming Server WebSocket API](https://github.com/PaddlePaddle/PaddleSpeech/wiki/PaddleSpeech-Server-WebSocket-API) + ## 使用方法 ### 1. 安装 安装 PaddleSpeech 的详细过程请看 [安装文档](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install.md)。 -推荐使用 **paddlepaddle 2.2.1** 或以上版本。 -你可以从medium,hard 两种方式中选择一种方式安装 PaddleSpeech。 +推荐使用 **paddlepaddle 2.3.1** 或以上版本。 + +你可以从简单,中等,困难 几种方式中选择一种方式安装 PaddleSpeech。 +**如果使用简单模式安装,需要自行准备 yaml 文件,可参考 conf 目录下的 yaml 文件。** ### 2. 准备配置文件 @@ -26,7 +31,6 @@ * conformer: `conf/ws_conformer_wenetspeech_application.yaml` - 这个 ASR client 的输入应该是一个 WAV 文件(`.wav`),并且采样率必须与模型的采样率相同。 可以下载此 ASR client的示例音频: @@ -54,28 +58,28 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav - `log_file`: log 文件. 默认:`./log/paddlespeech.log` 输出: - ```bash - [2022-05-14 04:56:13,086] [ INFO] - create the online asr engine instance - [2022-05-14 04:56:13,086] [ INFO] - paddlespeech_server set the device: cpu - [2022-05-14 04:56:13,087] [ INFO] - Load the pretrained model, tag = conformer_online_wenetspeech-zh-16k - [2022-05-14 04:56:13,087] [ INFO] - File /root/.paddlespeech/models/conformer_online_wenetspeech-zh-16k/asr1_chunk_conformer_wenetspeech_ckpt_1.0.0a.model.tar.gz md5 checking... - [2022-05-14 04:56:17,542] [ INFO] - Use pretrained model stored in: /root/.paddlespeech/models/conformer_online_wenetspeech-zh-16k/asr1_chunk_conformer_wenetspeech_ckpt_1. 
0.0a.model.tar - [2022-05-14 04:56:17,543] [ INFO] - /root/.paddlespeech/models/conformer_online_wenetspeech-zh-16k/asr1_chunk_conformer_wenetspeech_ckpt_1.0.0a.model.tar - [2022-05-14 04:56:17,543] [ INFO] - /root/.paddlespeech/models/conformer_online_wenetspeech-zh-16k/asr1_chunk_conformer_wenetspeech_ckpt_1.0.0a.model.tar/model.yaml - [2022-05-14 04:56:17,543] [ INFO] - /root/.paddlespeech/models/conformer_online_wenetspeech-zh-16k/asr1_chunk_conformer_wenetspeech_ckpt_1.0.0a.model.tar/exp/ chunk_conformer/checkpoints/avg_10.pdparams - [2022-05-14 04:56:17,543] [ INFO] - /root/.paddlespeech/models/conformer_online_wenetspeech-zh-16k/asr1_chunk_conformer_wenetspeech_ckpt_1.0.0a.model.tar/exp/ chunk_conformer/checkpoints/avg_10.pdparams - [2022-05-14 04:56:17,852] [ INFO] - start to create the stream conformer asr engine - [2022-05-14 04:56:17,863] [ INFO] - model name: conformer_online - [2022-05-14 04:56:22,756] [ INFO] - create the transformer like model success - [2022-05-14 04:56:22,758] [ INFO] - Initialize ASR server engine successfully. - INFO: Started server process [4242] - [2022-05-14 04:56:22] [INFO] [server.py:75] Started server process [4242] - INFO: Waiting for application startup. - [2022-05-14 04:56:22] [INFO] [on.py:45] Waiting for application startup. - INFO: Application startup complete. - [2022-05-14 04:56:22] [INFO] [on.py:59] Application startup complete. - INFO: Uvicorn running on http://0.0.0.0:8090 (Press CTRL+C to quit) - [2022-05-14 04:56:22] [INFO] [server.py:211] Uvicorn running on http://0.0.0.0:8090 (Press CTRL+C to quit) + ```text + [2022-05-14 04:56:13,086] [ INFO] - create the online asr engine instance + [2022-05-14 04:56:13,086] [ INFO] - paddlespeech_server set the device: cpu + [2022-05-14 04:56:13,087] [ INFO] - Load the pretrained model, tag = conformer_online_wenetspeech-zh-16k + [2022-05-14 04:56:13,087] [ INFO] - File /root/.paddlespeech/models/conformer_online_wenetspeech-zh-16k/asr1_chunk_conformer_wenetspeech_ckpt_1.0.0a.model.tar.gz md5 checking... + [2022-05-14 04:56:17,542] [ INFO] - Use pretrained model stored in: /root/.paddlespeech/models/conformer_online_wenetspeech-zh-16k/asr1_chunk_conformer_wenetspeech_ckpt_1. 0.0a.model.tar + [2022-05-14 04:56:17,543] [ INFO] - /root/.paddlespeech/models/conformer_online_wenetspeech-zh-16k/asr1_chunk_conformer_wenetspeech_ckpt_1.0.0a.model.tar + [2022-05-14 04:56:17,543] [ INFO] - /root/.paddlespeech/models/conformer_online_wenetspeech-zh-16k/asr1_chunk_conformer_wenetspeech_ckpt_1.0.0a.model.tar/model.yaml + [2022-05-14 04:56:17,543] [ INFO] - /root/.paddlespeech/models/conformer_online_wenetspeech-zh-16k/asr1_chunk_conformer_wenetspeech_ckpt_1.0.0a.model.tar/exp/ chunk_conformer/checkpoints/avg_10.pdparams + [2022-05-14 04:56:17,543] [ INFO] - /root/.paddlespeech/models/conformer_online_wenetspeech-zh-16k/asr1_chunk_conformer_wenetspeech_ckpt_1.0.0a.model.tar/exp/ chunk_conformer/checkpoints/avg_10.pdparams + [2022-05-14 04:56:17,852] [ INFO] - start to create the stream conformer asr engine + [2022-05-14 04:56:17,863] [ INFO] - model name: conformer_online + [2022-05-14 04:56:22,756] [ INFO] - create the transformer like model success + [2022-05-14 04:56:22,758] [ INFO] - Initialize ASR server engine successfully. + INFO: Started server process [4242] + [2022-05-14 04:56:22] [INFO] [server.py:75] Started server process [4242] + INFO: Waiting for application startup. + [2022-05-14 04:56:22] [INFO] [on.py:45] Waiting for application startup. + INFO: Application startup complete. 
+ [2022-05-14 04:56:22] [INFO] [on.py:59] Application startup complete. + INFO: Uvicorn running on http://0.0.0.0:8090 (Press CTRL+C to quit) + [2022-05-14 04:56:22] [INFO] [server.py:211] Uvicorn running on http://0.0.0.0:8090 (Press CTRL+C to quit) ``` - Python API @@ -90,29 +94,29 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav log_file="./log/paddlespeech.log") ``` - 输出: - ```bash - [2022-05-14 04:56:13,086] [ INFO] - create the online asr engine instance - [2022-05-14 04:56:13,086] [ INFO] - paddlespeech_server set the device: cpu - [2022-05-14 04:56:13,087] [ INFO] - Load the pretrained model, tag = conformer_online_wenetspeech-zh-16k - [2022-05-14 04:56:13,087] [ INFO] - File /root/.paddlespeech/models/conformer_online_wenetspeech-zh-16k/asr1_chunk_conformer_wenetspeech_ckpt_1.0.0a.model.tar.gz md5 checking... - [2022-05-14 04:56:17,542] [ INFO] - Use pretrained model stored in: /root/.paddlespeech/models/conformer_online_wenetspeech-zh-16k/asr1_chunk_conformer_wenetspeech_ckpt_1. 0.0a.model.tar - [2022-05-14 04:56:17,543] [ INFO] - /root/.paddlespeech/models/conformer_online_wenetspeech-zh-16k/asr1_chunk_conformer_wenetspeech_ckpt_1.0.0a.model.tar - [2022-05-14 04:56:17,543] [ INFO] - /root/.paddlespeech/models/conformer_online_wenetspeech-zh-16k/asr1_chunk_conformer_wenetspeech_ckpt_1.0.0a.model.tar/model.yaml - [2022-05-14 04:56:17,543] [ INFO] - /root/.paddlespeech/models/conformer_online_wenetspeech-zh-16k/asr1_chunk_conformer_wenetspeech_ckpt_1.0.0a.model.tar/exp/ chunk_conformer/checkpoints/avg_10.pdparams - [2022-05-14 04:56:17,543] [ INFO] - /root/.paddlespeech/models/conformer_online_wenetspeech-zh-16k/asr1_chunk_conformer_wenetspeech_ckpt_1.0.0a.model.tar/exp/ chunk_conformer/checkpoints/avg_10.pdparams - [2022-05-14 04:56:17,852] [ INFO] - start to create the stream conformer asr engine - [2022-05-14 04:56:17,863] [ INFO] - model name: conformer_online - [2022-05-14 04:56:22,756] [ INFO] - create the transformer like model success - [2022-05-14 04:56:22,758] [ INFO] - Initialize ASR server engine successfully. - INFO: Started server process [4242] - [2022-05-14 04:56:22] [INFO] [server.py:75] Started server process [4242] - INFO: Waiting for application startup. - [2022-05-14 04:56:22] [INFO] [on.py:45] Waiting for application startup. - INFO: Application startup complete. - [2022-05-14 04:56:22] [INFO] [on.py:59] Application startup complete. - INFO: Uvicorn running on http://0.0.0.0:8090 (Press CTRL+C to quit) - [2022-05-14 04:56:22] [INFO] [server.py:211] Uvicorn running on http://0.0.0.0:8090 (Press CTRL+C to quit) + 输出: + ```text + [2022-05-14 04:56:13,086] [ INFO] - create the online asr engine instance + [2022-05-14 04:56:13,086] [ INFO] - paddlespeech_server set the device: cpu + [2022-05-14 04:56:13,087] [ INFO] - Load the pretrained model, tag = conformer_online_wenetspeech-zh-16k + [2022-05-14 04:56:13,087] [ INFO] - File /root/.paddlespeech/models/conformer_online_wenetspeech-zh-16k/asr1_chunk_conformer_wenetspeech_ckpt_1.0.0a.model.tar.gz md5 checking... + [2022-05-14 04:56:17,542] [ INFO] - Use pretrained model stored in: /root/.paddlespeech/models/conformer_online_wenetspeech-zh-16k/asr1_chunk_conformer_wenetspeech_ckpt_1. 
0.0a.model.tar + [2022-05-14 04:56:17,543] [ INFO] - /root/.paddlespeech/models/conformer_online_wenetspeech-zh-16k/asr1_chunk_conformer_wenetspeech_ckpt_1.0.0a.model.tar + [2022-05-14 04:56:17,543] [ INFO] - /root/.paddlespeech/models/conformer_online_wenetspeech-zh-16k/asr1_chunk_conformer_wenetspeech_ckpt_1.0.0a.model.tar/model.yaml + [2022-05-14 04:56:17,543] [ INFO] - /root/.paddlespeech/models/conformer_online_wenetspeech-zh-16k/asr1_chunk_conformer_wenetspeech_ckpt_1.0.0a.model.tar/exp/ chunk_conformer/checkpoints/avg_10.pdparams + [2022-05-14 04:56:17,543] [ INFO] - /root/.paddlespeech/models/conformer_online_wenetspeech-zh-16k/asr1_chunk_conformer_wenetspeech_ckpt_1.0.0a.model.tar/exp/ chunk_conformer/checkpoints/avg_10.pdparams + [2022-05-14 04:56:17,852] [ INFO] - start to create the stream conformer asr engine + [2022-05-14 04:56:17,863] [ INFO] - model name: conformer_online + [2022-05-14 04:56:22,756] [ INFO] - create the transformer like model success + [2022-05-14 04:56:22,758] [ INFO] - Initialize ASR server engine successfully. + INFO: Started server process [4242] + [2022-05-14 04:56:22] [INFO] [server.py:75] Started server process [4242] + INFO: Waiting for application startup. + [2022-05-14 04:56:22] [INFO] [on.py:45] Waiting for application startup. + INFO: Application startup complete. + [2022-05-14 04:56:22] [INFO] [on.py:59] Application startup complete. + INFO: Uvicorn running on http://0.0.0.0:8090 (Press CTRL+C to quit) + [2022-05-14 04:56:22] [INFO] [server.py:211] Uvicorn running on http://0.0.0.0:8090 (Press CTRL+C to quit) ``` ### 4. ASR 客户端使用方法 @@ -120,98 +124,97 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav **注意:** 初次使用客户端时响应时间会略长 - 命令行 (推荐使用) - 若 `127.0.0.1` 不能访问,则需要使用实际服务 IP 地址 + 若 `127.0.0.1` 不能访问,则需要使用实际服务 IP 地址 - ``` - paddlespeech_client asr_online --server_ip 127.0.0.1 --port 8090 --input ./zh.wav - ``` + ```bash + paddlespeech_client asr_online --server_ip 127.0.0.1 --port 8090 --input ./zh.wav + ``` - 使用帮助: - - ```bash - paddlespeech_client asr_online --help - ``` + 使用帮助: + + ```bash + paddlespeech_client asr_online --help + ``` - 参数: - - `server_ip`: 服务端ip地址,默认: 127.0.0.1。 - - `port`: 服务端口,默认: 8090。 - - `input`(必须输入): 用于识别的音频文件。 - - `sample_rate`: 音频采样率,默认值:16000。 - - `lang`: 模型语言,默认值:zh_cn。 - - `audio_format`: 音频格式,默认值:wav。 - - `punc.server_ip` 标点预测服务的ip。默认是None。 - - `punc.server_port` 标点预测服务的端口port。默认是None。 - - 输出: - - ```bash - [2022-05-06 21:10:35,598] [ INFO] - Start to do streaming asr client - [2022-05-06 21:10:35,600] [ INFO] - asr websocket client start - [2022-05-06 21:10:35,600] [ INFO] - endpoint: ws://127.0.0.1:8390/paddlespeech/asr/streaming - [2022-05-06 21:10:35,600] [ INFO] - start to process the wavscp: ./zh.wav - [2022-05-06 21:10:35,670] [ INFO] - client receive msg={"status": "ok", "signal": "server_ready"} - [2022-05-06 21:10:35,699] [ INFO] - client receive msg={'result': ''} - [2022-05-06 21:10:35,713] [ INFO] - client receive msg={'result': ''} - [2022-05-06 21:10:35,726] [ INFO] - client receive msg={'result': ''} - [2022-05-06 21:10:35,738] [ INFO] - client receive msg={'result': ''} - [2022-05-06 21:10:35,750] [ INFO] - client receive msg={'result': ''} - [2022-05-06 21:10:35,762] [ INFO] - client receive msg={'result': ''} - [2022-05-06 21:10:35,774] [ INFO] - client receive msg={'result': ''} - [2022-05-06 21:10:35,786] [ INFO] - client receive msg={'result': ''} - [2022-05-06 21:10:36,387] [ INFO] - client receive msg={'result': ''} - [2022-05-06 21:10:36,398] [ INFO] - client receive 
msg={'result': ''} - [2022-05-06 21:10:36,407] [ INFO] - client receive msg={'result': ''} - [2022-05-06 21:10:36,416] [ INFO] - client receive msg={'result': ''} - [2022-05-06 21:10:36,425] [ INFO] - client receive msg={'result': ''} - [2022-05-06 21:10:36,434] [ INFO] - client receive msg={'result': ''} - [2022-05-06 21:10:36,442] [ INFO] - client receive msg={'result': ''} - [2022-05-06 21:10:36,930] [ INFO] - client receive msg={'result': '我认为跑'} - [2022-05-06 21:10:36,938] [ INFO] - client receive msg={'result': '我认为跑'} - [2022-05-06 21:10:36,946] [ INFO] - client receive msg={'result': '我认为跑'} - [2022-05-06 21:10:36,954] [ INFO] - client receive msg={'result': '我认为跑'} - [2022-05-06 21:10:36,962] [ INFO] - client receive msg={'result': '我认为跑'} - [2022-05-06 21:10:36,970] [ INFO] - client receive msg={'result': '我认为跑'} - [2022-05-06 21:10:36,977] [ INFO] - client receive msg={'result': '我认为跑'} - [2022-05-06 21:10:36,985] [ INFO] - client receive msg={'result': '我认为跑'} - [2022-05-06 21:10:37,484] [ INFO] - client receive msg={'result': '我认为跑步最重要的'} - [2022-05-06 21:10:37,492] [ INFO] - client receive msg={'result': '我认为跑步最重要的'} - [2022-05-06 21:10:37,500] [ INFO] - client receive msg={'result': '我认为跑步最重要的'} - [2022-05-06 21:10:37,508] [ INFO] - client receive msg={'result': '我认为跑步最重要的'} - [2022-05-06 21:10:37,517] [ INFO] - client receive msg={'result': '我认为跑步最重要的'} - [2022-05-06 21:10:37,525] [ INFO] - client receive msg={'result': '我认为跑步最重要的'} - [2022-05-06 21:10:37,532] [ INFO] - client receive msg={'result': '我认为跑步最重要的'} - [2022-05-06 21:10:38,050] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是'} - [2022-05-06 21:10:38,058] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是'} - [2022-05-06 21:10:38,066] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是'} - [2022-05-06 21:10:38,073] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是'} - [2022-05-06 21:10:38,081] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是'} - [2022-05-06 21:10:38,089] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是'} - [2022-05-06 21:10:38,097] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是'} - [2022-05-06 21:10:38,105] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是'} - [2022-05-06 21:10:38,630] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给'} - [2022-05-06 21:10:38,639] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给'} - [2022-05-06 21:10:38,647] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给'} - [2022-05-06 21:10:38,655] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给'} - [2022-05-06 21:10:38,663] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给'} - [2022-05-06 21:10:38,671] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给'} - [2022-05-06 21:10:38,679] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给'} - [2022-05-06 21:10:39,216] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了'} - [2022-05-06 21:10:39,224] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了'} - [2022-05-06 21:10:39,232] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了'} - [2022-05-06 21:10:39,240] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了'} - [2022-05-06 21:10:39,248] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了'} - [2022-05-06 21:10:39,256] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了'} - [2022-05-06 21:10:39,264] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了'} - [2022-05-06 21:10:39,272] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了'} - [2022-05-06 21:10:39,885] [ INFO] - client 
receive msg={'result': '我认为跑步最重要的就是给我带来了身体健康'} - [2022-05-06 21:10:39,896] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了身体健康'} - [2022-05-06 21:10:39,905] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了身体健康'} - [2022-05-06 21:10:39,915] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了身体健康'} - [2022-05-06 21:10:39,924] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了身体健康'} - [2022-05-06 21:10:39,934] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了身体健康'} - [2022-05-06 21:10:44,827] [ INFO] - client final receive msg={'status': 'ok', 'signal': 'finished', 'result': '我认为跑步最重要的就是给我带来了身体健康', 'times': [{'w': '我', 'bg': 0.0, 'ed': 0.7000000000000001}, {'w': '认', 'bg': 0.7000000000000001, 'ed': 0.84}, {'w': '为', 'bg': 0.84, 'ed': 1.0}, {'w': '跑', 'bg': 1.0, 'ed': 1.18}, {'w': '步', 'bg': 1.18, 'ed': 1.36}, {'w': '最', 'bg': 1.36, 'ed': 1.5}, {'w': '重', 'bg': 1.5, 'ed': 1.6400000000000001}, {'w': '要', 'bg': 1.6400000000000001, 'ed': 1.78}, {'w': '的', 'bg': 1.78, 'ed': 1.9000000000000001}, {'w': '就', 'bg': 1.9000000000000001, 'ed': 2.06}, {'w': '是', 'bg': 2.06, 'ed': 2.62}, {'w': '给', 'bg': 2.62, 'ed': 3.16}, {'w': '我', 'bg': 3.16, 'ed': 3.3200000000000003}, {'w': '带', 'bg': 3.3200000000000003, 'ed': 3.48}, {'w': '来', 'bg': 3.48, 'ed': 3.62}, {'w': '了', 'bg': 3.62, 'ed': 3.7600000000000002}, {'w': '身', 'bg': 3.7600000000000002, 'ed': 3.9}, {'w': '体', 'bg': 3.9, 'ed': 4.0600000000000005}, {'w': '健', 'bg': 4.0600000000000005, 'ed': 4.26}, {'w': '康', 'bg': 4.26, 'ed': 4.96}]} - [2022-05-06 21:10:44,827] [ INFO] - audio duration: 4.9968125, elapsed time: 9.225094079971313, RTF=1.846195765794957 - [2022-05-06 21:10:44,828] [ INFO] - asr websocket client finished : 我认为跑步最重要的就是给我带来了身体健康 + 参数: + - `server_ip`: 服务端ip地址,默认: 127.0.0.1。 + - `port`: 服务端口,默认: 8090。 + - `input`(必须输入): 用于识别的音频文件。 + - `sample_rate`: 音频采样率,默认值:16000。 + - `lang`: 模型语言,默认值:zh_cn。 + - `audio_format`: 音频格式,默认值:wav。 + - `punc.server_ip` 标点预测服务的ip。默认是None。 + - `punc.server_port` 标点预测服务的端口port。默认是None。 + + 输出: + ```text + [2022-05-06 21:10:35,598] [ INFO] - Start to do streaming asr client + [2022-05-06 21:10:35,600] [ INFO] - asr websocket client start + [2022-05-06 21:10:35,600] [ INFO] - endpoint: ws://127.0.0.1:8390/paddlespeech/asr/streaming + [2022-05-06 21:10:35,600] [ INFO] - start to process the wavscp: ./zh.wav + [2022-05-06 21:10:35,670] [ INFO] - client receive msg={"status": "ok", "signal": "server_ready"} + [2022-05-06 21:10:35,699] [ INFO] - client receive msg={'result': ''} + [2022-05-06 21:10:35,713] [ INFO] - client receive msg={'result': ''} + [2022-05-06 21:10:35,726] [ INFO] - client receive msg={'result': ''} + [2022-05-06 21:10:35,738] [ INFO] - client receive msg={'result': ''} + [2022-05-06 21:10:35,750] [ INFO] - client receive msg={'result': ''} + [2022-05-06 21:10:35,762] [ INFO] - client receive msg={'result': ''} + [2022-05-06 21:10:35,774] [ INFO] - client receive msg={'result': ''} + [2022-05-06 21:10:35,786] [ INFO] - client receive msg={'result': ''} + [2022-05-06 21:10:36,387] [ INFO] - client receive msg={'result': ''} + [2022-05-06 21:10:36,398] [ INFO] - client receive msg={'result': ''} + [2022-05-06 21:10:36,407] [ INFO] - client receive msg={'result': ''} + [2022-05-06 21:10:36,416] [ INFO] - client receive msg={'result': ''} + [2022-05-06 21:10:36,425] [ INFO] - client receive msg={'result': ''} + [2022-05-06 21:10:36,434] [ INFO] - client receive msg={'result': ''} + [2022-05-06 21:10:36,442] [ INFO] - client receive msg={'result': ''} + [2022-05-06 
21:10:36,930] [ INFO] - client receive msg={'result': '我认为跑'} + [2022-05-06 21:10:36,938] [ INFO] - client receive msg={'result': '我认为跑'} + [2022-05-06 21:10:36,946] [ INFO] - client receive msg={'result': '我认为跑'} + [2022-05-06 21:10:36,954] [ INFO] - client receive msg={'result': '我认为跑'} + [2022-05-06 21:10:36,962] [ INFO] - client receive msg={'result': '我认为跑'} + [2022-05-06 21:10:36,970] [ INFO] - client receive msg={'result': '我认为跑'} + [2022-05-06 21:10:36,977] [ INFO] - client receive msg={'result': '我认为跑'} + [2022-05-06 21:10:36,985] [ INFO] - client receive msg={'result': '我认为跑'} + [2022-05-06 21:10:37,484] [ INFO] - client receive msg={'result': '我认为跑步最重要的'} + [2022-05-06 21:10:37,492] [ INFO] - client receive msg={'result': '我认为跑步最重要的'} + [2022-05-06 21:10:37,500] [ INFO] - client receive msg={'result': '我认为跑步最重要的'} + [2022-05-06 21:10:37,508] [ INFO] - client receive msg={'result': '我认为跑步最重要的'} + [2022-05-06 21:10:37,517] [ INFO] - client receive msg={'result': '我认为跑步最重要的'} + [2022-05-06 21:10:37,525] [ INFO] - client receive msg={'result': '我认为跑步最重要的'} + [2022-05-06 21:10:37,532] [ INFO] - client receive msg={'result': '我认为跑步最重要的'} + [2022-05-06 21:10:38,050] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是'} + [2022-05-06 21:10:38,058] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是'} + [2022-05-06 21:10:38,066] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是'} + [2022-05-06 21:10:38,073] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是'} + [2022-05-06 21:10:38,081] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是'} + [2022-05-06 21:10:38,089] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是'} + [2022-05-06 21:10:38,097] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是'} + [2022-05-06 21:10:38,105] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是'} + [2022-05-06 21:10:38,630] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给'} + [2022-05-06 21:10:38,639] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给'} + [2022-05-06 21:10:38,647] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给'} + [2022-05-06 21:10:38,655] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给'} + [2022-05-06 21:10:38,663] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给'} + [2022-05-06 21:10:38,671] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给'} + [2022-05-06 21:10:38,679] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给'} + [2022-05-06 21:10:39,216] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了'} + [2022-05-06 21:10:39,224] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了'} + [2022-05-06 21:10:39,232] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了'} + [2022-05-06 21:10:39,240] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了'} + [2022-05-06 21:10:39,248] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了'} + [2022-05-06 21:10:39,256] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了'} + [2022-05-06 21:10:39,264] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了'} + [2022-05-06 21:10:39,272] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了'} + [2022-05-06 21:10:39,885] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了身体健康'} + [2022-05-06 21:10:39,896] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了身体健康'} + [2022-05-06 21:10:39,905] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了身体健康'} + [2022-05-06 21:10:39,915] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了身体健康'} + [2022-05-06 21:10:39,924] [ INFO] - client receive msg={'result': 
'我认为跑步最重要的就是给我带来了身体健康'} + [2022-05-06 21:10:39,934] [ INFO] - client receive msg={'result': '我认为跑步最重要的就是给我带来了身体健康'} + [2022-05-06 21:10:44,827] [ INFO] - client final receive msg={'status': 'ok', 'signal': 'finished', 'result': '我认为跑步最重要的就是给我带来了身体健康', 'times': [{'w': '我', 'bg': 0.0, 'ed': 0.7000000000000001}, {'w': '认', 'bg': 0.7000000000000001, 'ed': 0.84}, {'w': '为', 'bg': 0.84, 'ed': 1.0}, {'w': '跑', 'bg': 1.0, 'ed': 1.18}, {'w': '步', 'bg': 1.18, 'ed': 1.36}, {'w': '最', 'bg': 1.36, 'ed': 1.5}, {'w': '重', 'bg': 1.5, 'ed': 1.6400000000000001}, {'w': '要', 'bg': 1.6400000000000001, 'ed': 1.78}, {'w': '的', 'bg': 1.78, 'ed': 1.9000000000000001}, {'w': '就', 'bg': 1.9000000000000001, 'ed': 2.06}, {'w': '是', 'bg': 2.06, 'ed': 2.62}, {'w': '给', 'bg': 2.62, 'ed': 3.16}, {'w': '我', 'bg': 3.16, 'ed': 3.3200000000000003}, {'w': '带', 'bg': 3.3200000000000003, 'ed': 3.48}, {'w': '来', 'bg': 3.48, 'ed': 3.62}, {'w': '了', 'bg': 3.62, 'ed': 3.7600000000000002}, {'w': '身', 'bg': 3.7600000000000002, 'ed': 3.9}, {'w': '体', 'bg': 3.9, 'ed': 4.0600000000000005}, {'w': '健', 'bg': 4.0600000000000005, 'ed': 4.26}, {'w': '康', 'bg': 4.26, 'ed': 4.96}]} + [2022-05-06 21:10:44,827] [ INFO] - audio duration: 4.9968125, elapsed time: 9.225094079971313, RTF=1.846195765794957 + [2022-05-06 21:10:44,828] [ INFO] - asr websocket client finished : 我认为跑步最重要的就是给我带来了身体健康 ``` - Python API @@ -230,7 +233,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav ``` 输出: - ```bash + ```text [2022-05-06 21:14:03,137] [ INFO] - asr websocket client start [2022-05-06 21:14:03,137] [ INFO] - endpoint: ws://127.0.0.1:8390/paddlespeech/asr/streaming [2022-05-06 21:14:03,149] [ INFO] - client receive msg={"status": "ok", "signal": "server_ready"} @@ -297,34 +300,29 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav [2022-05-06 21:14:12,159] [ INFO] - audio duration: 4.9968125, elapsed time: 9.019973039627075, RTF=1.8051453881103354 [2022-05-06 21:14:12,160] [ INFO] - asr websocket client finished ``` - - - ## 标点预测 ### 1. 服务端使用方法 - 命令行 **注意:** 默认部署在 `cpu` 设备上,可以通过修改服务配置文件中 `device` 参数部署在 `gpu` 上。 - ``` bash - 在 PaddleSpeech/demos/streaming_asr_server 目录下启动标点预测服务 + ```bash + # 在 PaddleSpeech/demos/streaming_asr_server 目录下启动标点预测服务 paddlespeech_server start --config_file conf/punc_application.yaml ``` - - 使用方法: - + 使用方法: ```bash paddlespeech_server start --help ``` - 参数: + 参数: - `config_file`: 服务的配置文件。 - `log_file`: log 文件。 - 输出: - ``` bash + 输出: + ```text [2022-05-02 17:59:26,285] [ INFO] - Create the TextEngine Instance [2022-05-02 17:59:26,285] [ INFO] - Init the text engine [2022-05-02 17:59:26,285] [ INFO] - Text Engine set the device: gpu:0 @@ -356,26 +354,26 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav log_file="./log/paddlespeech.log") ``` - 输出 - ``` - [2022-05-02 18:09:02,542] [ INFO] - Create the TextEngine Instance - [2022-05-02 18:09:02,543] [ INFO] - Init the text engine - [2022-05-02 18:09:02,543] [ INFO] - Text Engine set the device: gpu:0 - [2022-05-02 18:09:02,545] [ INFO] - File /home/users/xiongxinlei/.paddlespeech/models/ernie_linear_p3_wudao-punc-zh/ernie_linear_p3_wudao-punc-zh.tar.gz md5 checking... 
- [2022-05-02 18:09:06,919] [ INFO] - Use pretrained model stored in: /home/users/xiongxinlei/.paddlespeech/models/ernie_linear_p3_wudao-punc-zh/ernie_linear_p3_wudao-punc-zh.tar - W0502 18:09:07.523002 22615 device_context.cc:447] Please NOTE: device: 0, GPU Compute Capability: 6.1, Driver API Version: 10.2, Runtime API Version: 10.2 - W0502 18:09:07.527882 22615 device_context.cc:465] device: 0, cuDNN Version: 7.6. - [2022-05-02 18:09:10,900] [ INFO] - Already cached /home/users/xiongxinlei/.paddlenlp/models/ernie-1.0/vocab.txt - [2022-05-02 18:09:10,913] [ INFO] - Init the text engine successfully - INFO: Started server process [22615] - [2022-05-02 18:09:10] [INFO] [server.py:75] Started server process [22615] - INFO: Waiting for application startup. - [2022-05-02 18:09:10] [INFO] [on.py:45] Waiting for application startup. - INFO: Application startup complete. - [2022-05-02 18:09:10] [INFO] [on.py:59] Application startup complete. - INFO: Uvicorn running on http://0.0.0.0:8190 (Press CTRL+C to quit) - [2022-05-02 18:09:10] [INFO] [server.py:206] Uvicorn running on http://0.0.0.0:8190 (Press CTRL+C to quit) - ``` + 输出: + ```text + [2022-05-02 18:09:02,542] [ INFO] - Create the TextEngine Instance + [2022-05-02 18:09:02,543] [ INFO] - Init the text engine + [2022-05-02 18:09:02,543] [ INFO] - Text Engine set the device: gpu:0 + [2022-05-02 18:09:02,545] [ INFO] - File /home/users/xiongxinlei/.paddlespeech/models/ernie_linear_p3_wudao-punc-zh/ernie_linear_p3_wudao-punc-zh.tar.gz md5 checking... + [2022-05-02 18:09:06,919] [ INFO] - Use pretrained model stored in: /home/users/xiongxinlei/.paddlespeech/models/ernie_linear_p3_wudao-punc-zh/ernie_linear_p3_wudao-punc-zh.tar + W0502 18:09:07.523002 22615 device_context.cc:447] Please NOTE: device: 0, GPU Compute Capability: 6.1, Driver API Version: 10.2, Runtime API Version: 10.2 + W0502 18:09:07.527882 22615 device_context.cc:465] device: 0, cuDNN Version: 7.6. + [2022-05-02 18:09:10,900] [ INFO] - Already cached /home/users/xiongxinlei/.paddlenlp/models/ernie-1.0/vocab.txt + [2022-05-02 18:09:10,913] [ INFO] - Init the text engine successfully + INFO: Started server process [22615] + [2022-05-02 18:09:10] [INFO] [server.py:75] Started server process [22615] + INFO: Waiting for application startup. + [2022-05-02 18:09:10] [INFO] [on.py:45] Waiting for application startup. + INFO: Application startup complete. + [2022-05-02 18:09:10] [INFO] [on.py:59] Application startup complete. + INFO: Uvicorn running on http://0.0.0.0:8190 (Press CTRL+C to quit) + [2022-05-02 18:09:10] [INFO] [server.py:206] Uvicorn running on http://0.0.0.0:8190 (Press CTRL+C to quit) + ``` ### 2. 标点预测客户端使用方法 **注意:** 初次使用客户端时响应时间会略长 @@ -384,17 +382,17 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav 若 `127.0.0.1` 不能访问,则需要使用实际服务 IP 地址 - ``` - paddlespeech_client text --server_ip 127.0.0.1 --port 8190 --input "我认为跑步最重要的就是给我带来了身体健康" - ``` - - 输出 + ```bash + paddlespeech_client text --server_ip 127.0.0.1 --port 8190 --input "我认为跑步最重要的就是给我带来了身体健康" ``` + + 输出: + ```text [2022-05-02 18:12:29,767] [ INFO] - The punc text: 我认为跑步最重要的就是给我带来了身体健康。 [2022-05-02 18:12:29,767] [ INFO] - Response time 0.096548 s. 
``` -- Python3 API +- Python API ```python from paddlespeech.server.bin.paddlespeech_client import TextClientExecutor @@ -407,12 +405,11 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav print(res) ``` - 输出: - ``` bash + 输出: + ```text 我认为跑步最重要的就是给我带来了身体健康。 ``` - ## 联合流式语音识别和标点预测 **注意:** 默认部署在 `cpu` 设备上,可以通过修改服务配置文件中 `device` 参数将语音识别和标点预测部署在不同的 `gpu` 上。 @@ -420,7 +417,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav ### 1. 启动服务 -``` bash +```bash 注意:流式语音识别和标点预测通过配置文件配置到不同的显卡上 bash server.sh ``` @@ -430,11 +427,11 @@ bash server.sh 若 `127.0.0.1` 不能访问,则需要使用实际服务 IP 地址 - ``` + ```bash paddlespeech_client asr_online --server_ip 127.0.0.1 --port 8290 --punc.server_ip 127.0.0.1 --punc.port 8190 --input ./zh.wav ``` - 输出: - ``` + 输出: + ```text [2022-05-07 11:21:47,060] [ INFO] - asr websocket client start [2022-05-07 11:21:47,060] [ INFO] - endpoint: ws://127.0.0.1:8490/paddlespeech/asr/streaming [2022-05-07 11:21:47,080] [ INFO] - client receive msg={"status": "ok", "signal": "server_ready"} @@ -508,11 +505,11 @@ bash server.sh 若 `127.0.0.1` 不能访问,则需要使用实际服务 IP 地址 - ``` + ```bash python3 websocket_client.py --server_ip 127.0.0.1 --port 8290 --punc.server_ip 127.0.0.1 --punc.port 8190 --wavfile ./zh.wav ``` - 输出: - ``` + 输出: + ```text [2022-05-07 11:11:02,984] [ INFO] - Start to do streaming asr client [2022-05-07 11:11:02,985] [ INFO] - asr websocket client start [2022-05-07 11:11:02,985] [ INFO] - endpoint: ws://127.0.0.1:8490/paddlespeech/asr/streaming @@ -581,5 +578,3 @@ bash server.sh [2022-05-07 11:11:18,915] [ INFO] - audio duration: 4.9968125, elapsed time: 15.928460597991943, RTF=3.187724293835709 [2022-05-07 11:11:18,916] [ INFO] - asr websocket client finished : 我认为跑步最重要的就是给我带来了身体健康 ``` - - diff --git a/demos/streaming_asr_server/conf/ws_ds2_application.yaml b/demos/streaming_asr_server/conf/ws_ds2_application.yaml index e36a829cc..ac20b2a23 100644 --- a/demos/streaming_asr_server/conf/ws_ds2_application.yaml +++ b/demos/streaming_asr_server/conf/ws_ds2_application.yaml @@ -18,12 +18,13 @@ engine_list: ['asr_online-onnx'] # ENGINE CONFIG # ################################################################################# + ################################### ASR ######################################### -################### speech task: asr; engine_type: online-inference ####################### -asr_online-inference: +################### speech task: asr; engine_type: online-onnx ####################### +asr_online-onnx: model_type: 'deepspeech2online_wenetspeech' - am_model: # the pdmodel file of am static model [optional] - am_params: # the pdiparams file of am static model [optional] + am_model: # the pdmodel file of onnx am static model [optional] + am_params: # the pdiparams file of am static model [optional] lang: 'zh' sample_rate: 16000 cfg_path: @@ -32,11 +33,14 @@ asr_online-inference: force_yes: True device: 'cpu' # cpu or gpu:id + # https://onnxruntime.ai/docs/api/python/api_summary.html#inferencesession am_predictor_conf: - device: # set 'gpu:id' or 'cpu' - switch_ir_optim: True - glog_info: False # True -> print glog - summary: True # False -> do not show predictor config + device: 'cpu' # set 'gpu:id' or 'cpu' + graph_optimization_level: 0 + intra_op_num_threads: 0 # Sets the number of threads used to parallelize the execution within nodes. + inter_op_num_threads: 0 # Sets the number of threads used to parallelize the execution of the graph (across nodes). + log_severity_level: 2 # Log severity level. 
Applies to session load, initialization, etc. 0:Verbose, 1:Info, 2:Warning. 3:Error, 4:Fatal. Default is 2. + log_verbosity_level: 0 # VLOG level if DEBUG build and session_log_severity_level is 0. Applies to session load, initialization, etc. Default is 0. chunk_buffer_conf: frame_duration_ms: 85 @@ -49,13 +53,12 @@ asr_online-inference: shift_ms: 10 # ms - ################################### ASR ######################################### -################### speech task: asr; engine_type: online-onnx ####################### -asr_online-onnx: +################### speech task: asr; engine_type: online-inference ####################### +asr_online-inference: model_type: 'deepspeech2online_wenetspeech' - am_model: # the pdmodel file of onnx am static model [optional] - am_params: # the pdiparams file of am static model [optional] + am_model: # the pdmodel file of am static model [optional] + am_params: # the pdiparams file of am static model [optional] lang: 'zh' sample_rate: 16000 cfg_path: @@ -64,21 +67,18 @@ asr_online-onnx: force_yes: True device: 'cpu' # cpu or gpu:id - # https://onnxruntime.ai/docs/api/python/api_summary.html#inferencesession am_predictor_conf: - device: 'cpu' # set 'gpu:id' or 'cpu' - graph_optimization_level: 0 - intra_op_num_threads: 0 # Sets the number of threads used to parallelize the execution within nodes. - inter_op_num_threads: 0 # Sets the number of threads used to parallelize the execution of the graph (across nodes). - log_severity_level: 2 # Log severity level. Applies to session load, initialization, etc. 0:Verbose, 1:Info, 2:Warning. 3:Error, 4:Fatal. Default is 2. - log_verbosity_level: 0 # VLOG level if DEBUG build and session_log_severity_level is 0. Applies to session load, initialization, etc. Default is 0. + device: # set 'gpu:id' or 'cpu' + switch_ir_optim: True + glog_info: False # True -> print glog + summary: True # False -> do not show predictor config chunk_buffer_conf: - frame_duration_ms: 80 + frame_duration_ms: 85 shift_ms: 40 sample_rate: 16000 sample_width: 2 window_n: 7 # frame shift_n: 4 # frame window_ms: 25 # ms - shift_ms: 10 # ms + shift_ms: 10 # ms \ No newline at end of file diff --git a/demos/streaming_asr_server/punc_server.py b/demos/streaming_asr_server/local/punc_server.py similarity index 100% rename from demos/streaming_asr_server/punc_server.py rename to demos/streaming_asr_server/local/punc_server.py diff --git a/demos/streaming_asr_server/local/rtf_from_log.py b/demos/streaming_asr_server/local/rtf_from_log.py index a5634388b..4b89b48fd 100755 --- a/demos/streaming_asr_server/local/rtf_from_log.py +++ b/demos/streaming_asr_server/local/rtf_from_log.py @@ -33,7 +33,8 @@ if __name__ == '__main__': P = 0.0 n = 0 for m in rtfs: - n += 1 + # not accurate, may have duplicate log + n += 1 T += m['T'] P += m['P'] diff --git a/demos/streaming_asr_server/streaming_asr_server.py b/demos/streaming_asr_server/local/streaming_asr_server.py similarity index 100% rename from demos/streaming_asr_server/streaming_asr_server.py rename to demos/streaming_asr_server/local/streaming_asr_server.py diff --git a/demos/streaming_asr_server/local/websocket_client.py b/demos/streaming_asr_server/local/websocket_client.py index 51ae7a2f4..8b70eb2d6 100644 --- a/demos/streaming_asr_server/local/websocket_client.py +++ b/demos/streaming_asr_server/local/websocket_client.py @@ -18,7 +18,6 @@ import argparse import asyncio import codecs -import logging import os from paddlespeech.cli.log import logger @@ -44,7 +43,7 @@ def main(args): # support to 
process batch audios from wav.scp if args.wavscp and os.path.exists(args.wavscp): - logging.info(f"start to process the wavscp: {args.wavscp}") + logger.info(f"start to process the wavscp: {args.wavscp}") with codecs.open(args.wavscp, 'r', encoding='utf-8') as f,\ codecs.open("result.txt", 'w', encoding='utf-8') as w: for line in f: diff --git a/demos/streaming_asr_server/run.sh b/demos/streaming_asr_server/run.sh old mode 100644 new mode 100755 diff --git a/demos/streaming_asr_server/server.sh b/demos/streaming_asr_server/server.sh index f532546e7..961cb046a 100755 --- a/demos/streaming_asr_server/server.sh +++ b/demos/streaming_asr_server/server.sh @@ -1,9 +1,8 @@ -export CUDA_VISIBLE_DEVICE=0,1,2,3 - export CUDA_VISIBLE_DEVICE=0,1,2,3 +#export CUDA_VISIBLE_DEVICE=0,1,2,3 -# nohup python3 punc_server.py --config_file conf/punc_application.yaml > punc.log 2>&1 & +# nohup python3 local/punc_server.py --config_file conf/punc_application.yaml > punc.log 2>&1 & paddlespeech_server start --config_file conf/punc_application.yaml &> punc.log & -# nohup python3 streaming_asr_server.py --config_file conf/ws_conformer_wenetspeech_application.yaml > streaming_asr.log 2>&1 & +# nohup python3 local/streaming_asr_server.py --config_file conf/ws_conformer_wenetspeech_application.yaml > streaming_asr.log 2>&1 & paddlespeech_server start --config_file conf/ws_conformer_wenetspeech_application.yaml &> streaming_asr.log & diff --git a/demos/streaming_asr_server/test.sh b/demos/streaming_asr_server/test.sh index 67a5ec4c5..386c7f894 100755 --- a/demos/streaming_asr_server/test.sh +++ b/demos/streaming_asr_server/test.sh @@ -7,5 +7,5 @@ paddlespeech_client asr_online --server_ip 127.0.0.1 --port 8090 --input ./zh.wa # read the wav and call streaming and punc service # If `127.0.0.1` is not accessible, you need to use the actual service IP address. -paddlespeech_client asr_online --server_ip 127.0.0.1 --port 8290 --punc.server_ip 127.0.0.1 --punc.port 8190 --input ./zh.wav +paddlespeech_client asr_online --server_ip 127.0.0.1 --port 8090 --punc.server_ip 127.0.0.1 --punc.port 8190 --input ./zh.wav diff --git a/demos/streaming_asr_server/web/app.py b/demos/streaming_asr_server/web/app.py deleted file mode 100644 index 22993c08e..000000000 --- a/demos/streaming_asr_server/web/app.py +++ /dev/null @@ -1,23 +0,0 @@ -#!/usr/bin/env python3 -# -*- coding: utf-8 -*- -# Copyright 2021 Mobvoi Inc. All Rights Reserved. -# Author: zhendong.peng@mobvoi.com (Zhendong Peng) -import argparse - -from flask import Flask -from flask import render_template - -parser = argparse.ArgumentParser(description='training your network') -parser.add_argument('--port', default=19999, type=int, help='port id') -args = parser.parse_args() - -app = Flask(__name__) - - -@app.route('/') -def index(): - return render_template('index.html') - - -if __name__ == '__main__': - app.run(host='0.0.0.0', port=args.port, debug=True) diff --git a/demos/streaming_asr_server/web/favicon.ico b/demos/streaming_asr_server/web/favicon.ico new file mode 100644 index 000000000..342038720 Binary files /dev/null and b/demos/streaming_asr_server/web/favicon.ico differ diff --git a/demos/streaming_asr_server/web/index.html b/demos/streaming_asr_server/web/index.html new file mode 100644 index 000000000..33c676c55 --- /dev/null +++ b/demos/streaming_asr_server/web/index.html @@ -0,0 +1,218 @@ + + + + + + + 飞桨PaddleSpeech + + + + +
+ + + diff --git a/demos/streaming_asr_server/web/paddle_web_demo.png b/demos/streaming_asr_server/web/paddle_web_demo.png index 214edffd0..db4b63ab9 100644 Binary files a/demos/streaming_asr_server/web/paddle_web_demo.png and b/demos/streaming_asr_server/web/paddle_web_demo.png differ diff --git a/demos/streaming_asr_server/web/readme.md b/demos/streaming_asr_server/web/readme.md index 8310a2571..bef421711 100644 --- a/demos/streaming_asr_server/web/readme.md +++ b/demos/streaming_asr_server/web/readme.md @@ -1,18 +1,20 @@ # paddlespeech serving 网页Demo -- 感谢[wenet](https://github.com/wenet-e2e/wenet)团队的前端demo代码. +![图片](./paddle_web_demo.png) +step1: 开启流式语音识别服务器端 -## 使用方法 -### 1. 在本地电脑启动网页服务 - ``` - python app.py +``` +# 开启流式语音识别服务 +cd PaddleSpeech/demos/streaming_asr_server +paddlespeech_server start --config_file conf/ws_conformer_wenetspeech_application_faster.yaml +``` - ``` +step2: 谷歌浏览器打开 `web` 目录下的 `index.html` -### 2. 本地电脑浏览器 +step3: 点击`连接`,验证 WebSocket 是否成功连接 + +step4: 点击开始录音(弹窗询问,允许录音) -在浏览器中输入127.0.0.1:19999 即可看到相关网页Demo。 -![图片](./paddle_web_demo.png) diff --git a/demos/streaming_asr_server/web/static/css/font-awesome.min.css b/demos/streaming_asr_server/web/static/css/font-awesome.min.css deleted file mode 100644 index 540440ce8..000000000 --- a/demos/streaming_asr_server/web/static/css/font-awesome.min.css +++ /dev/null @@ -1,4 +0,0 @@ -/*!
fa-spin{0%{-webkit-transform:rotate(0deg);transform:rotate(0deg)}100%{-webkit-transform:rotate(359deg);transform:rotate(359deg)}}.fa-rotate-90{-ms-filter:"progid:DXImageTransform.Microsoft.BasicImage(rotation=1)";-webkit-transform:rotate(90deg);-ms-transform:rotate(90deg);transform:rotate(90deg)}.fa-rotate-180{-ms-filter:"progid:DXImageTransform.Microsoft.BasicImage(rotation=2)";-webkit-transform:rotate(180deg);-ms-transform:rotate(180deg);transform:rotate(180deg)}.fa-rotate-270{-ms-filter:"progid:DXImageTransform.Microsoft.BasicImage(rotation=3)";-webkit-transform:rotate(270deg);-ms-transform:rotate(270deg);transform:rotate(270deg)}.fa-flip-horizontal{-ms-filter:"progid:DXImageTransform.Microsoft.BasicImage(rotation=0, mirror=1)";-webkit-transform:scale(-1, 1);-ms-transform:scale(-1, 1);transform:scale(-1, 1)}.fa-flip-vertical{-ms-filter:"progid:DXImageTransform.Microsoft.BasicImage(rotation=2, mirror=1)";-webkit-transform:scale(1, -1);-ms-transform:scale(1, -1);transform:scale(1, -1)}:root .fa-rotate-90,:root .fa-rotate-180,:root .fa-rotate-270,:root .fa-flip-horizontal,:root .fa-flip-vertical{filter:none}.fa-stack{position:relative;display:inline-block;width:2em;height:2em;line-height:2em;vertical-align:middle}.fa-stack-1x,.fa-stack-2x{position:absolute;left:0;width:100%;text-align:center}.fa-stack-1x{line-height:inherit}.fa-stack-2x{font-size:2em}.fa-inverse{color:#fff}.fa-glass:before{content:"\f000"}.fa-music:before{content:"\f001"}.fa-search:before{content:"\f002"}.fa-envelope-o:before{content:"\f003"}.fa-heart:before{content:"\f004"}.fa-star:before{content:"\f005"}.fa-star-o:before{content:"\f006"}.fa-user:before{content:"\f007"}.fa-film:before{content:"\f008"}.fa-th-large:before{content:"\f009"}.fa-th:before{content:"\f00a"}.fa-th-list:before{content:"\f00b"}.fa-check:before{content:"\f00c"}.fa-remove:before,.fa-close:before,.fa-times:before{content:"\f00d"}.fa-search-plus:before{content:"\f00e"}.fa-search-minus:before{content:"\f010"}.fa-power-off:before{content:"\f011"}.fa-signal:before{content:"\f012"}.fa-gear:before,.fa-cog:before{content:"\f013"}.fa-trash-o:before{content:"\f014"}.fa-home:before{content:"\f015"}.fa-file-o:before{content:"\f016"}.fa-clock-o:before{content:"\f017"}.fa-road:before{content:"\f018"}.fa-download:before{content:"\f019"}.fa-arrow-circle-o-down:before{content:"\f01a"}.fa-arrow-circle-o-up:before{content:"\f01b"}.fa-inbox:before{content:"\f01c"}.fa-play-circle-o:before{content:"\f01d"}.fa-rotate-right:before,.fa-repeat:before{content:"\f01e"}.fa-refresh:before{content:"\f021"}.fa-list-alt:before{content:"\f022"}.fa-lock:before{content:"\f023"}.fa-flag:before{content:"\f024"}.fa-headphones:before{content:"\f025"}.fa-volume-off:before{content:"\f026"}.fa-volume-down:before{content:"\f027"}.fa-volume-up:before{content:"\f028"}.fa-qrcode:before{content:"\f029"}.fa-barcode:before{content:"\f02a"}.fa-tag:before{content:"\f02b"}.fa-tags:before{content:"\f02c"}.fa-book:before{content:"\f02d"}.fa-bookmark:before{content:"\f02e"}.fa-print:before{content:"\f02f"}.fa-camera:before{content:"\f030"}.fa-font:before{content:"\f031"}.fa-bold:before{content:"\f032"}.fa-italic:before{content:"\f033"}.fa-text-height:before{content:"\f034"}.fa-text-width:before{content:"\f035"}.fa-align-left:before{content:"\f036"}.fa-align-center:before{content:"\f037"}.fa-align-right:before{content:"\f038"}.fa-align-justify:before{content:"\f039"}.fa-list:before{content:"\f03a"}.fa-dedent:before,.fa-outdent:before{content:"\f03b"}.fa-indent:before{content:"\f03c"}.fa-video-camera:before{c
ontent:"\f03d"}.fa-photo:before,.fa-image:before,.fa-picture-o:before{content:"\f03e"}.fa-pencil:before{content:"\f040"}.fa-map-marker:before{content:"\f041"}.fa-adjust:before{content:"\f042"}.fa-tint:before{content:"\f043"}.fa-edit:before,.fa-pencil-square-o:before{content:"\f044"}.fa-share-square-o:before{content:"\f045"}.fa-check-square-o:before{content:"\f046"}.fa-arrows:before{content:"\f047"}.fa-step-backward:before{content:"\f048"}.fa-fast-backward:before{content:"\f049"}.fa-backward:before{content:"\f04a"}.fa-play:before{content:"\f04b"}.fa-pause:before{content:"\f04c"}.fa-stop:before{content:"\f04d"}.fa-forward:before{content:"\f04e"}.fa-fast-forward:before{content:"\f050"}.fa-step-forward:before{content:"\f051"}.fa-eject:before{content:"\f052"}.fa-chevron-left:before{content:"\f053"}.fa-chevron-right:before{content:"\f054"}.fa-plus-circle:before{content:"\f055"}.fa-minus-circle:before{content:"\f056"}.fa-times-circle:before{content:"\f057"}.fa-check-circle:before{content:"\f058"}.fa-question-circle:before{content:"\f059"}.fa-info-circle:before{content:"\f05a"}.fa-crosshairs:before{content:"\f05b"}.fa-times-circle-o:before{content:"\f05c"}.fa-check-circle-o:before{content:"\f05d"}.fa-ban:before{content:"\f05e"}.fa-arrow-left:before{content:"\f060"}.fa-arrow-right:before{content:"\f061"}.fa-arrow-up:before{content:"\f062"}.fa-arrow-down:before{content:"\f063"}.fa-mail-forward:before,.fa-share:before{content:"\f064"}.fa-expand:before{content:"\f065"}.fa-compress:before{content:"\f066"}.fa-plus:before{content:"\f067"}.fa-minus:before{content:"\f068"}.fa-asterisk:before{content:"\f069"}.fa-exclamation-circle:before{content:"\f06a"}.fa-gift:before{content:"\f06b"}.fa-leaf:before{content:"\f06c"}.fa-fire:before{content:"\f06d"}.fa-eye:before{content:"\f06e"}.fa-eye-slash:before{content:"\f070"}.fa-warning:before,.fa-exclamation-triangle:before{content:"\f071"}.fa-plane:before{content:"\f072"}.fa-calendar:before{content:"\f073"}.fa-random:before{content:"\f074"}.fa-comment:before{content:"\f075"}.fa-magnet:before{content:"\f076"}.fa-chevron-up:before{content:"\f077"}.fa-chevron-down:before{content:"\f078"}.fa-retweet:before{content:"\f079"}.fa-shopping-cart:before{content:"\f07a"}.fa-folder:before{content:"\f07b"}.fa-folder-open:before{content:"\f07c"}.fa-arrows-v:before{content:"\f07d"}.fa-arrows-h:before{content:"\f07e"}.fa-bar-chart-o:before,.fa-bar-chart:before{content:"\f080"}.fa-twitter-square:before{content:"\f081"}.fa-facebook-square:before{content:"\f082"}.fa-camera-retro:before{content:"\f083"}.fa-key:before{content:"\f084"}.fa-gears:before,.fa-cogs:before{content:"\f085"}.fa-comments:before{content:"\f086"}.fa-thumbs-o-up:before{content:"\f087"}.fa-thumbs-o-down:before{content:"\f088"}.fa-star-half:before{content:"\f089"}.fa-heart-o:before{content:"\f08a"}.fa-sign-out:before{content:"\f08b"}.fa-linkedin-square:before{content:"\f08c"}.fa-thumb-tack:before{content:"\f08d"}.fa-external-link:before{content:"\f08e"}.fa-sign-in:before{content:"\f090"}.fa-trophy:before{content:"\f091"}.fa-github-square:before{content:"\f092"}.fa-upload:before{content:"\f093"}.fa-lemon-o:before{content:"\f094"}.fa-phone:before{content:"\f095"}.fa-square-o:before{content:"\f096"}.fa-bookmark-o:before{content:"\f097"}.fa-phone-square:before{content:"\f098"}.fa-twitter:before{content:"\f099"}.fa-facebook-f:before,.fa-facebook:before{content:"\f09a"}.fa-github:before{content:"\f09b"}.fa-unlock:before{content:"\f09c"}.fa-credit-card:before{content:"\f09d"}.fa-feed:before,.fa-rss:before{content:"\f09e"}.fa-h
dd-o:before{content:"\f0a0"}.fa-bullhorn:before{content:"\f0a1"}.fa-bell:before{content:"\f0f3"}.fa-certificate:before{content:"\f0a3"}.fa-hand-o-right:before{content:"\f0a4"}.fa-hand-o-left:before{content:"\f0a5"}.fa-hand-o-up:before{content:"\f0a6"}.fa-hand-o-down:before{content:"\f0a7"}.fa-arrow-circle-left:before{content:"\f0a8"}.fa-arrow-circle-right:before{content:"\f0a9"}.fa-arrow-circle-up:before{content:"\f0aa"}.fa-arrow-circle-down:before{content:"\f0ab"}.fa-globe:before{content:"\f0ac"}.fa-wrench:before{content:"\f0ad"}.fa-tasks:before{content:"\f0ae"}.fa-filter:before{content:"\f0b0"}.fa-briefcase:before{content:"\f0b1"}.fa-arrows-alt:before{content:"\f0b2"}.fa-group:before,.fa-users:before{content:"\f0c0"}.fa-chain:before,.fa-link:before{content:"\f0c1"}.fa-cloud:before{content:"\f0c2"}.fa-flask:before{content:"\f0c3"}.fa-cut:before,.fa-scissors:before{content:"\f0c4"}.fa-copy:before,.fa-files-o:before{content:"\f0c5"}.fa-paperclip:before{content:"\f0c6"}.fa-save:before,.fa-floppy-o:before{content:"\f0c7"}.fa-square:before{content:"\f0c8"}.fa-navicon:before,.fa-reorder:before,.fa-bars:before{content:"\f0c9"}.fa-list-ul:before{content:"\f0ca"}.fa-list-ol:before{content:"\f0cb"}.fa-strikethrough:before{content:"\f0cc"}.fa-underline:before{content:"\f0cd"}.fa-table:before{content:"\f0ce"}.fa-magic:before{content:"\f0d0"}.fa-truck:before{content:"\f0d1"}.fa-pinterest:before{content:"\f0d2"}.fa-pinterest-square:before{content:"\f0d3"}.fa-google-plus-square:before{content:"\f0d4"}.fa-google-plus:before{content:"\f0d5"}.fa-money:before{content:"\f0d6"}.fa-caret-down:before{content:"\f0d7"}.fa-caret-up:before{content:"\f0d8"}.fa-caret-left:before{content:"\f0d9"}.fa-caret-right:before{content:"\f0da"}.fa-columns:before{content:"\f0db"}.fa-unsorted:before,.fa-sort:before{content:"\f0dc"}.fa-sort-down:before,.fa-sort-desc:before{content:"\f0dd"}.fa-sort-up:before,.fa-sort-asc:before{content:"\f0de"}.fa-envelope:before{content:"\f0e0"}.fa-linkedin:before{content:"\f0e1"}.fa-rotate-left:before,.fa-undo:before{content:"\f0e2"}.fa-legal:before,.fa-gavel:before{content:"\f0e3"}.fa-dashboard:before,.fa-tachometer:before{content:"\f0e4"}.fa-comment-o:before{content:"\f0e5"}.fa-comments-o:before{content:"\f0e6"}.fa-flash:before,.fa-bolt:before{content:"\f0e7"}.fa-sitemap:before{content:"\f0e8"}.fa-umbrella:before{content:"\f0e9"}.fa-paste:before,.fa-clipboard:before{content:"\f0ea"}.fa-lightbulb-o:before{content:"\f0eb"}.fa-exchange:before{content:"\f0ec"}.fa-cloud-download:before{content:"\f0ed"}.fa-cloud-upload:before{content:"\f0ee"}.fa-user-md:before{content:"\f0f0"}.fa-stethoscope:before{content:"\f0f1"}.fa-suitcase:before{content:"\f0f2"}.fa-bell-o:before{content:"\f0a2"}.fa-coffee:before{content:"\f0f4"}.fa-cutlery:before{content:"\f0f5"}.fa-file-text-o:before{content:"\f0f6"}.fa-building-o:before{content:"\f0f7"}.fa-hospital-o:before{content:"\f0f8"}.fa-ambulance:before{content:"\f0f9"}.fa-medkit:before{content:"\f0fa"}.fa-fighter-jet:before{content:"\f0fb"}.fa-beer:before{content:"\f0fc"}.fa-h-square:before{content:"\f0fd"}.fa-plus-square:before{content:"\f0fe"}.fa-angle-double-left:before{content:"\f100"}.fa-angle-double-right:before{content:"\f101"}.fa-angle-double-up:before{content:"\f102"}.fa-angle-double-down:before{content:"\f103"}.fa-angle-left:before{content:"\f104"}.fa-angle-right:before{content:"\f105"}.fa-angle-up:before{content:"\f106"}.fa-angle-down:before{content:"\f107"}.fa-desktop:before{content:"\f108"}.fa-laptop:before{content:"\f109"}.fa-tablet:before{content:"\f10a"}
.fa-mobile-phone:before,.fa-mobile:before{content:"\f10b"}.fa-circle-o:before{content:"\f10c"}.fa-quote-left:before{content:"\f10d"}.fa-quote-right:before{content:"\f10e"}.fa-spinner:before{content:"\f110"}.fa-circle:before{content:"\f111"}.fa-mail-reply:before,.fa-reply:before{content:"\f112"}.fa-github-alt:before{content:"\f113"}.fa-folder-o:before{content:"\f114"}.fa-folder-open-o:before{content:"\f115"}.fa-smile-o:before{content:"\f118"}.fa-frown-o:before{content:"\f119"}.fa-meh-o:before{content:"\f11a"}.fa-gamepad:before{content:"\f11b"}.fa-keyboard-o:before{content:"\f11c"}.fa-flag-o:before{content:"\f11d"}.fa-flag-checkered:before{content:"\f11e"}.fa-terminal:before{content:"\f120"}.fa-code:before{content:"\f121"}.fa-mail-reply-all:before,.fa-reply-all:before{content:"\f122"}.fa-star-half-empty:before,.fa-star-half-full:before,.fa-star-half-o:before{content:"\f123"}.fa-location-arrow:before{content:"\f124"}.fa-crop:before{content:"\f125"}.fa-code-fork:before{content:"\f126"}.fa-unlink:before,.fa-chain-broken:before{content:"\f127"}.fa-question:before{content:"\f128"}.fa-info:before{content:"\f129"}.fa-exclamation:before{content:"\f12a"}.fa-superscript:before{content:"\f12b"}.fa-subscript:before{content:"\f12c"}.fa-eraser:before{content:"\f12d"}.fa-puzzle-piece:before{content:"\f12e"}.fa-microphone:before{content:"\f130"}.fa-microphone-slash:before{content:"\f131"}.fa-shield:before{content:"\f132"}.fa-calendar-o:before{content:"\f133"}.fa-fire-extinguisher:before{content:"\f134"}.fa-rocket:before{content:"\f135"}.fa-maxcdn:before{content:"\f136"}.fa-chevron-circle-left:before{content:"\f137"}.fa-chevron-circle-right:before{content:"\f138"}.fa-chevron-circle-up:before{content:"\f139"}.fa-chevron-circle-down:before{content:"\f13a"}.fa-html5:before{content:"\f13b"}.fa-css3:before{content:"\f13c"}.fa-anchor:before{content:"\f13d"}.fa-unlock-alt:before{content:"\f13e"}.fa-bullseye:before{content:"\f140"}.fa-ellipsis-h:before{content:"\f141"}.fa-ellipsis-v:before{content:"\f142"}.fa-rss-square:before{content:"\f143"}.fa-play-circle:before{content:"\f144"}.fa-ticket:before{content:"\f145"}.fa-minus-square:before{content:"\f146"}.fa-minus-square-o:before{content:"\f147"}.fa-level-up:before{content:"\f148"}.fa-level-down:before{content:"\f149"}.fa-check-square:before{content:"\f14a"}.fa-pencil-square:before{content:"\f14b"}.fa-external-link-square:before{content:"\f14c"}.fa-share-square:before{content:"\f14d"}.fa-compass:before{content:"\f14e"}.fa-toggle-down:before,.fa-caret-square-o-down:before{content:"\f150"}.fa-toggle-up:before,.fa-caret-square-o-up:before{content:"\f151"}.fa-toggle-right:before,.fa-caret-square-o-right:before{content:"\f152"}.fa-euro:before,.fa-eur:before{content:"\f153"}.fa-gbp:before{content:"\f154"}.fa-dollar:before,.fa-usd:before{content:"\f155"}.fa-rupee:before,.fa-inr:before{content:"\f156"}.fa-cny:before,.fa-rmb:before,.fa-yen:before,.fa-jpy:before{content:"\f157"}.fa-ruble:before,.fa-rouble:before,.fa-rub:before{content:"\f158"}.fa-won:before,.fa-krw:before{content:"\f159"}.fa-bitcoin:before,.fa-btc:before{content:"\f15a"}.fa-file:before{content:"\f15b"}.fa-file-text:before{content:"\f15c"}.fa-sort-alpha-asc:before{content:"\f15d"}.fa-sort-alpha-desc:before{content:"\f15e"}.fa-sort-amount-asc:before{content:"\f160"}.fa-sort-amount-desc:before{content:"\f161"}.fa-sort-numeric-asc:before{content:"\f162"}.fa-sort-numeric-desc:before{content:"\f163"}.fa-thumbs-up:before{content:"\f164"}.fa-thumbs-down:before{content:"\f165"}.fa-youtube-square:before{content:"\f166"}.
fa-youtube:before{content:"\f167"}.fa-xing:before{content:"\f168"}.fa-xing-square:before{content:"\f169"}.fa-youtube-play:before{content:"\f16a"}.fa-dropbox:before{content:"\f16b"}.fa-stack-overflow:before{content:"\f16c"}.fa-instagram:before{content:"\f16d"}.fa-flickr:before{content:"\f16e"}.fa-adn:before{content:"\f170"}.fa-bitbucket:before{content:"\f171"}.fa-bitbucket-square:before{content:"\f172"}.fa-tumblr:before{content:"\f173"}.fa-tumblr-square:before{content:"\f174"}.fa-long-arrow-down:before{content:"\f175"}.fa-long-arrow-up:before{content:"\f176"}.fa-long-arrow-left:before{content:"\f177"}.fa-long-arrow-right:before{content:"\f178"}.fa-apple:before{content:"\f179"}.fa-windows:before{content:"\f17a"}.fa-android:before{content:"\f17b"}.fa-linux:before{content:"\f17c"}.fa-dribbble:before{content:"\f17d"}.fa-skype:before{content:"\f17e"}.fa-foursquare:before{content:"\f180"}.fa-trello:before{content:"\f181"}.fa-female:before{content:"\f182"}.fa-male:before{content:"\f183"}.fa-gittip:before,.fa-gratipay:before{content:"\f184"}.fa-sun-o:before{content:"\f185"}.fa-moon-o:before{content:"\f186"}.fa-archive:before{content:"\f187"}.fa-bug:before{content:"\f188"}.fa-vk:before{content:"\f189"}.fa-weibo:before{content:"\f18a"}.fa-renren:before{content:"\f18b"}.fa-pagelines:before{content:"\f18c"}.fa-stack-exchange:before{content:"\f18d"}.fa-arrow-circle-o-right:before{content:"\f18e"}.fa-arrow-circle-o-left:before{content:"\f190"}.fa-toggle-left:before,.fa-caret-square-o-left:before{content:"\f191"}.fa-dot-circle-o:before{content:"\f192"}.fa-wheelchair:before{content:"\f193"}.fa-vimeo-square:before{content:"\f194"}.fa-turkish-lira:before,.fa-try:before{content:"\f195"}.fa-plus-square-o:before{content:"\f196"}.fa-space-shuttle:before{content:"\f197"}.fa-slack:before{content:"\f198"}.fa-envelope-square:before{content:"\f199"}.fa-wordpress:before{content:"\f19a"}.fa-openid:before{content:"\f19b"}.fa-institution:before,.fa-bank:before,.fa-university:before{content:"\f19c"}.fa-mortar-board:before,.fa-graduation-cap:before{content:"\f19d"}.fa-yahoo:before{content:"\f19e"}.fa-google:before{content:"\f1a0"}.fa-reddit:before{content:"\f1a1"}.fa-reddit-square:before{content:"\f1a2"}.fa-stumbleupon-circle:before{content:"\f1a3"}.fa-stumbleupon:before{content:"\f1a4"}.fa-delicious:before{content:"\f1a5"}.fa-digg:before{content:"\f1a6"}.fa-pied-piper-pp:before{content:"\f1a7"}.fa-pied-piper-alt:before{content:"\f1a8"}.fa-drupal:before{content:"\f1a9"}.fa-joomla:before{content:"\f1aa"}.fa-language:before{content:"\f1ab"}.fa-fax:before{content:"\f1ac"}.fa-building:before{content:"\f1ad"}.fa-child:before{content:"\f1ae"}.fa-paw:before{content:"\f1b0"}.fa-spoon:before{content:"\f1b1"}.fa-cube:before{content:"\f1b2"}.fa-cubes:before{content:"\f1b3"}.fa-behance:before{content:"\f1b4"}.fa-behance-square:before{content:"\f1b5"}.fa-steam:before{content:"\f1b6"}.fa-steam-square:before{content:"\f1b7"}.fa-recycle:before{content:"\f1b8"}.fa-automobile:before,.fa-car:before{content:"\f1b9"}.fa-cab:before,.fa-taxi:before{content:"\f1ba"}.fa-tree:before{content:"\f1bb"}.fa-spotify:before{content:"\f1bc"}.fa-deviantart:before{content:"\f1bd"}.fa-soundcloud:before{content:"\f1be"}.fa-database:before{content:"\f1c0"}.fa-file-pdf-o:before{content:"\f1c1"}.fa-file-word-o:before{content:"\f1c2"}.fa-file-excel-o:before{content:"\f1c3"}.fa-file-powerpoint-o:before{content:"\f1c4"}.fa-file-photo-o:before,.fa-file-picture-o:before,.fa-file-image-o:before{content:"\f1c5"}.fa-file-zip-o:before,.fa-file-archive-o:before{content:"\f1
c6"}.fa-file-sound-o:before,.fa-file-audio-o:before{content:"\f1c7"}.fa-file-movie-o:before,.fa-file-video-o:before{content:"\f1c8"}.fa-file-code-o:before{content:"\f1c9"}.fa-vine:before{content:"\f1ca"}.fa-codepen:before{content:"\f1cb"}.fa-jsfiddle:before{content:"\f1cc"}.fa-life-bouy:before,.fa-life-buoy:before,.fa-life-saver:before,.fa-support:before,.fa-life-ring:before{content:"\f1cd"}.fa-circle-o-notch:before{content:"\f1ce"}.fa-ra:before,.fa-resistance:before,.fa-rebel:before{content:"\f1d0"}.fa-ge:before,.fa-empire:before{content:"\f1d1"}.fa-git-square:before{content:"\f1d2"}.fa-git:before{content:"\f1d3"}.fa-y-combinator-square:before,.fa-yc-square:before,.fa-hacker-news:before{content:"\f1d4"}.fa-tencent-weibo:before{content:"\f1d5"}.fa-qq:before{content:"\f1d6"}.fa-wechat:before,.fa-weixin:before{content:"\f1d7"}.fa-send:before,.fa-paper-plane:before{content:"\f1d8"}.fa-send-o:before,.fa-paper-plane-o:before{content:"\f1d9"}.fa-history:before{content:"\f1da"}.fa-circle-thin:before{content:"\f1db"}.fa-header:before{content:"\f1dc"}.fa-paragraph:before{content:"\f1dd"}.fa-sliders:before{content:"\f1de"}.fa-share-alt:before{content:"\f1e0"}.fa-share-alt-square:before{content:"\f1e1"}.fa-bomb:before{content:"\f1e2"}.fa-soccer-ball-o:before,.fa-futbol-o:before{content:"\f1e3"}.fa-tty:before{content:"\f1e4"}.fa-binoculars:before{content:"\f1e5"}.fa-plug:before{content:"\f1e6"}.fa-slideshare:before{content:"\f1e7"}.fa-twitch:before{content:"\f1e8"}.fa-yelp:before{content:"\f1e9"}.fa-newspaper-o:before{content:"\f1ea"}.fa-wifi:before{content:"\f1eb"}.fa-calculator:before{content:"\f1ec"}.fa-paypal:before{content:"\f1ed"}.fa-google-wallet:before{content:"\f1ee"}.fa-cc-visa:before{content:"\f1f0"}.fa-cc-mastercard:before{content:"\f1f1"}.fa-cc-discover:before{content:"\f1f2"}.fa-cc-amex:before{content:"\f1f3"}.fa-cc-paypal:before{content:"\f1f4"}.fa-cc-stripe:before{content:"\f1f5"}.fa-bell-slash:before{content:"\f1f6"}.fa-bell-slash-o:before{content:"\f1f7"}.fa-trash:before{content:"\f1f8"}.fa-copyright:before{content:"\f1f9"}.fa-at:before{content:"\f1fa"}.fa-eyedropper:before{content:"\f1fb"}.fa-paint-brush:before{content:"\f1fc"}.fa-birthday-cake:before{content:"\f1fd"}.fa-area-chart:before{content:"\f1fe"}.fa-pie-chart:before{content:"\f200"}.fa-line-chart:before{content:"\f201"}.fa-lastfm:before{content:"\f202"}.fa-lastfm-square:before{content:"\f203"}.fa-toggle-off:before{content:"\f204"}.fa-toggle-on:before{content:"\f205"}.fa-bicycle:before{content:"\f206"}.fa-bus:before{content:"\f207"}.fa-ioxhost:before{content:"\f208"}.fa-angellist:before{content:"\f209"}.fa-cc:before{content:"\f20a"}.fa-shekel:before,.fa-sheqel:before,.fa-ils:before{content:"\f20b"}.fa-meanpath:before{content:"\f20c"}.fa-buysellads:before{content:"\f20d"}.fa-connectdevelop:before{content:"\f20e"}.fa-dashcube:before{content:"\f210"}.fa-forumbee:before{content:"\f211"}.fa-leanpub:before{content:"\f212"}.fa-sellsy:before{content:"\f213"}.fa-shirtsinbulk:before{content:"\f214"}.fa-simplybuilt:before{content:"\f215"}.fa-skyatlas:before{content:"\f216"}.fa-cart-plus:before{content:"\f217"}.fa-cart-arrow-down:before{content:"\f218"}.fa-diamond:before{content:"\f219"}.fa-ship:before{content:"\f21a"}.fa-user-secret:before{content:"\f21b"}.fa-motorcycle:before{content:"\f21c"}.fa-street-view:before{content:"\f21d"}.fa-heartbeat:before{content:"\f21e"}.fa-venus:before{content:"\f221"}.fa-mars:before{content:"\f222"}.fa-mercury:before{content:"\f223"}.fa-intersex:before,.fa-transgender:before{content:"\f224"}.fa-transgend
er-alt:before{content:"\f225"}.fa-venus-double:before{content:"\f226"}.fa-mars-double:before{content:"\f227"}.fa-venus-mars:before{content:"\f228"}.fa-mars-stroke:before{content:"\f229"}.fa-mars-stroke-v:before{content:"\f22a"}.fa-mars-stroke-h:before{content:"\f22b"}.fa-neuter:before{content:"\f22c"}.fa-genderless:before{content:"\f22d"}.fa-facebook-official:before{content:"\f230"}.fa-pinterest-p:before{content:"\f231"}.fa-whatsapp:before{content:"\f232"}.fa-server:before{content:"\f233"}.fa-user-plus:before{content:"\f234"}.fa-user-times:before{content:"\f235"}.fa-hotel:before,.fa-bed:before{content:"\f236"}.fa-viacoin:before{content:"\f237"}.fa-train:before{content:"\f238"}.fa-subway:before{content:"\f239"}.fa-medium:before{content:"\f23a"}.fa-yc:before,.fa-y-combinator:before{content:"\f23b"}.fa-optin-monster:before{content:"\f23c"}.fa-opencart:before{content:"\f23d"}.fa-expeditedssl:before{content:"\f23e"}.fa-battery-4:before,.fa-battery:before,.fa-battery-full:before{content:"\f240"}.fa-battery-3:before,.fa-battery-three-quarters:before{content:"\f241"}.fa-battery-2:before,.fa-battery-half:before{content:"\f242"}.fa-battery-1:before,.fa-battery-quarter:before{content:"\f243"}.fa-battery-0:before,.fa-battery-empty:before{content:"\f244"}.fa-mouse-pointer:before{content:"\f245"}.fa-i-cursor:before{content:"\f246"}.fa-object-group:before{content:"\f247"}.fa-object-ungroup:before{content:"\f248"}.fa-sticky-note:before{content:"\f249"}.fa-sticky-note-o:before{content:"\f24a"}.fa-cc-jcb:before{content:"\f24b"}.fa-cc-diners-club:before{content:"\f24c"}.fa-clone:before{content:"\f24d"}.fa-balance-scale:before{content:"\f24e"}.fa-hourglass-o:before{content:"\f250"}.fa-hourglass-1:before,.fa-hourglass-start:before{content:"\f251"}.fa-hourglass-2:before,.fa-hourglass-half:before{content:"\f252"}.fa-hourglass-3:before,.fa-hourglass-end:before{content:"\f253"}.fa-hourglass:before{content:"\f254"}.fa-hand-grab-o:before,.fa-hand-rock-o:before{content:"\f255"}.fa-hand-stop-o:before,.fa-hand-paper-o:before{content:"\f256"}.fa-hand-scissors-o:before{content:"\f257"}.fa-hand-lizard-o:before{content:"\f258"}.fa-hand-spock-o:before{content:"\f259"}.fa-hand-pointer-o:before{content:"\f25a"}.fa-hand-peace-o:before{content:"\f25b"}.fa-trademark:before{content:"\f25c"}.fa-registered:before{content:"\f25d"}.fa-creative-commons:before{content:"\f25e"}.fa-gg:before{content:"\f260"}.fa-gg-circle:before{content:"\f261"}.fa-tripadvisor:before{content:"\f262"}.fa-odnoklassniki:before{content:"\f263"}.fa-odnoklassniki-square:before{content:"\f264"}.fa-get-pocket:before{content:"\f265"}.fa-wikipedia-w:before{content:"\f266"}.fa-safari:before{content:"\f267"}.fa-chrome:before{content:"\f268"}.fa-firefox:before{content:"\f269"}.fa-opera:before{content:"\f26a"}.fa-internet-explorer:before{content:"\f26b"}.fa-tv:before,.fa-television:before{content:"\f26c"}.fa-contao:before{content:"\f26d"}.fa-500px:before{content:"\f26e"}.fa-amazon:before{content:"\f270"}.fa-calendar-plus-o:before{content:"\f271"}.fa-calendar-minus-o:before{content:"\f272"}.fa-calendar-times-o:before{content:"\f273"}.fa-calendar-check-o:before{content:"\f274"}.fa-industry:before{content:"\f275"}.fa-map-pin:before{content:"\f276"}.fa-map-signs:before{content:"\f277"}.fa-map-o:before{content:"\f278"}.fa-map:before{content:"\f279"}.fa-commenting:before{content:"\f27a"}.fa-commenting-o:before{content:"\f27b"}.fa-houzz:before{content:"\f27c"}.fa-vimeo:before{content:"\f27d"}.fa-black-tie:before{content:"\f27e"}.fa-fonticons:before{content:"\f280"}.fa-reddit-a
lien:before{content:"\f281"}.fa-edge:before{content:"\f282"}.fa-credit-card-alt:before{content:"\f283"}.fa-codiepie:before{content:"\f284"}.fa-modx:before{content:"\f285"}.fa-fort-awesome:before{content:"\f286"}.fa-usb:before{content:"\f287"}.fa-product-hunt:before{content:"\f288"}.fa-mixcloud:before{content:"\f289"}.fa-scribd:before{content:"\f28a"}.fa-pause-circle:before{content:"\f28b"}.fa-pause-circle-o:before{content:"\f28c"}.fa-stop-circle:before{content:"\f28d"}.fa-stop-circle-o:before{content:"\f28e"}.fa-shopping-bag:before{content:"\f290"}.fa-shopping-basket:before{content:"\f291"}.fa-hashtag:before{content:"\f292"}.fa-bluetooth:before{content:"\f293"}.fa-bluetooth-b:before{content:"\f294"}.fa-percent:before{content:"\f295"}.fa-gitlab:before{content:"\f296"}.fa-wpbeginner:before{content:"\f297"}.fa-wpforms:before{content:"\f298"}.fa-envira:before{content:"\f299"}.fa-universal-access:before{content:"\f29a"}.fa-wheelchair-alt:before{content:"\f29b"}.fa-question-circle-o:before{content:"\f29c"}.fa-blind:before{content:"\f29d"}.fa-audio-description:before{content:"\f29e"}.fa-volume-control-phone:before{content:"\f2a0"}.fa-braille:before{content:"\f2a1"}.fa-assistive-listening-systems:before{content:"\f2a2"}.fa-asl-interpreting:before,.fa-american-sign-language-interpreting:before{content:"\f2a3"}.fa-deafness:before,.fa-hard-of-hearing:before,.fa-deaf:before{content:"\f2a4"}.fa-glide:before{content:"\f2a5"}.fa-glide-g:before{content:"\f2a6"}.fa-signing:before,.fa-sign-language:before{content:"\f2a7"}.fa-low-vision:before{content:"\f2a8"}.fa-viadeo:before{content:"\f2a9"}.fa-viadeo-square:before{content:"\f2aa"}.fa-snapchat:before{content:"\f2ab"}.fa-snapchat-ghost:before{content:"\f2ac"}.fa-snapchat-square:before{content:"\f2ad"}.fa-pied-piper:before{content:"\f2ae"}.fa-first-order:before{content:"\f2b0"}.fa-yoast:before{content:"\f2b1"}.fa-themeisle:before{content:"\f2b2"}.fa-google-plus-circle:before,.fa-google-plus-official:before{content:"\f2b3"}.fa-fa:before,.fa-font-awesome:before{content:"\f2b4"}.fa-handshake-o:before{content:"\f2b5"}.fa-envelope-open:before{content:"\f2b6"}.fa-envelope-open-o:before{content:"\f2b7"}.fa-linode:before{content:"\f2b8"}.fa-address-book:before{content:"\f2b9"}.fa-address-book-o:before{content:"\f2ba"}.fa-vcard:before,.fa-address-card:before{content:"\f2bb"}.fa-vcard-o:before,.fa-address-card-o:before{content:"\f2bc"}.fa-user-circle:before{content:"\f2bd"}.fa-user-circle-o:before{content:"\f2be"}.fa-user-o:before{content:"\f2c0"}.fa-id-badge:before{content:"\f2c1"}.fa-drivers-license:before,.fa-id-card:before{content:"\f2c2"}.fa-drivers-license-o:before,.fa-id-card-o:before{content:"\f2c3"}.fa-quora:before{content:"\f2c4"}.fa-free-code-camp:before{content:"\f2c5"}.fa-telegram:before{content:"\f2c6"}.fa-thermometer-4:before,.fa-thermometer:before,.fa-thermometer-full:before{content:"\f2c7"}.fa-thermometer-3:before,.fa-thermometer-three-quarters:before{content:"\f2c8"}.fa-thermometer-2:before,.fa-thermometer-half:before{content:"\f2c9"}.fa-thermometer-1:before,.fa-thermometer-quarter:before{content:"\f2ca"}.fa-thermometer-0:before,.fa-thermometer-empty:before{content:"\f2cb"}.fa-shower:before{content:"\f2cc"}.fa-bathtub:before,.fa-s15:before,.fa-bath:before{content:"\f2cd"}.fa-podcast:before{content:"\f2ce"}.fa-window-maximize:before{content:"\f2d0"}.fa-window-minimize:before{content:"\f2d1"}.fa-window-restore:before{content:"\f2d2"}.fa-times-rectangle:before,.fa-window-close:before{content:"\f2d3"}.fa-times-rectangle-o:before,.fa-window-close-o:before{
content:"\f2d4"}.fa-bandcamp:before{content:"\f2d5"}.fa-grav:before{content:"\f2d6"}.fa-etsy:before{content:"\f2d7"}.fa-imdb:before{content:"\f2d8"}.fa-ravelry:before{content:"\f2d9"}.fa-eercast:before{content:"\f2da"}.fa-microchip:before{content:"\f2db"}.fa-snowflake-o:before{content:"\f2dc"}.fa-superpowers:before{content:"\f2dd"}.fa-wpexplorer:before{content:"\f2de"}.fa-meetup:before{content:"\f2e0"}.sr-only{position:absolute;width:1px;height:1px;padding:0;margin:-1px;overflow:hidden;clip:rect(0, 0, 0, 0);border:0}.sr-only-focusable:active,.sr-only-focusable:focus{position:static;width:auto;height:auto;margin:0;overflow:visible;clip:auto} diff --git a/demos/streaming_asr_server/web/static/css/style.css b/demos/streaming_asr_server/web/static/css/style.css deleted file mode 100644 index a3040718b..000000000 --- a/demos/streaming_asr_server/web/static/css/style.css +++ /dev/null @@ -1,453 +0,0 @@ -/* -* @Author: baipengxia -* @Date: 2021-03-12 11:44:28 -* @Last Modified by: baipengxia -* @Last Modified time: 2021-03-12 15:14:24 -*/ - -/** COMMON RESET **/ -* { - -webkit-tap-highlight-color: rgba(0, 0, 0, 0); -} - -body, -h1, -h2, -h3, -h4, -h5, -h6, -hr, -p, -dl, -dt, -dd, -ul, -ol, -li, -fieldset, -lengend, -button, -input, -textarea, -th, -td { - margin: 0; - padding: 0; - color: #000; -} - -body { - font-size: 14px; -} -html, body { - min-width: 1200px; -} - -button, -input, -select, -textarea { - font-size: 14px; -} - -h1 { - font-size: 18px; -} - -h2 { - font-size: 14px; -} - -h3 { - font-size: 14px; -} - -ul, -ol, -li { - list-style: none; -} - -a { - text-decoration: none; -} - -a:hover { - text-decoration: none; -} - -fieldset, -img { - border: none; -} - -table { - border-collapse: collapse; - border-spacing: 0; -} - -i { - font-style: normal; -} - -label { - position: inherit; -} - -.clearfix:after { - content: "."; - display: block; - height: 0; - clear: both; - visibility: hidden; -} - -.clearfix { - zoom: 1; - display: block; -} - -html, -body { - font-family: Tahoma, Arial, 'microsoft yahei', 'Roboto', 'Droid Sans', 'Helvetica Neue', 'Droid Sans Fallback', 'Heiti SC', 'Hiragino Sans GB', 'Simsun', 'sans-self'; -} - - - -.audio-banner { - width: 100%; - overflow: auto; - padding: 0; - background: url('../image/voice-dictation.svg'); - background-size: cover; -} -.weaper { - width: 1200px; - height: 155px; - margin: 72px auto; -} -.text-content { - width: 670px; - height: 100%; - float: left; -} -.text-content .title { - font-size: 34px; - font-family: 'PingFangSC-Medium'; - font-weight: 500; - color: rgba(255, 255, 255, 1); - line-height: 48px; -} -.text-content .con { - font-size: 16px; - font-family: PingFangSC-Light; - font-weight: 300; - color: rgba(255, 255, 255, 1); - line-height: 30px; -} -.img-con { - width: 416px; - height: 100%; - float: right; -} -.img-con img { - width: 100%; - height: 100%; -} -.con-container { - margin-top: 34px; -} - -.audio-advantage { - background: #f8f9fa; -} -.asr-advantage { - width: 1200px; - margin: 0 auto; -} -.asr-advantage h2 { - text-align: center; - font-size: 22px; - padding: 30px 0 0 0; -} -.asr-advantage > ul > li { - box-sizing: border-box; - padding: 0 16px; - width: 33%; - text-align: center; - margin-bottom: 35px; -} -.asr-advantage > ul > li .icons{ - margin-top: 10px; - margin-bottom: 20px; - width: 42px; - height: 42px; -} -.service-item-content { - margin-top: 35px; - display: flex; - justify-content: center; - flex-wrap: wrap; -} -.service-item-content img { - width: 160px; - vertical-align: bottom; -} 
-.service-item-content > li { - box-sizing: border-box; - padding: 0 16px; - width: 33%; - text-align: center; - margin-bottom: 35px; -} -.service-item-content > li .service-item-content-title { - line-height: 1.5; - font-weight: 700; - margin-top: 10px; -} -.service-item-content > li .service-item-content-desc { - margin-top: 5px; - line-height: 1.8; - color: #657384; -} - - -.audio-scene-con { - width: 100%; - padding-bottom: 84px; - background: #fff; -} -.audio-scene { - overflow: auto; - width: 1200px; - background: #fff; - text-align: center; - padding: 0; - margin: 0 auto; -} -.audio-scene h2 { - padding: 30px 0 0 0; - font-size: 22px; - text-align: center; -} - -.audio-experience { - width: 100%; - height: 538px; - background: #fff; - padding: 0; - margin: 0; - overflow: auto; -} -.asr-box { - width: 1200px; - height: 394px; - margin: 64px auto; -} -.asr-box h2 { - font-size: 22px; - text-align: center; - margin-bottom: 64px; -} -.voice-container { - position: relative; - width: 1200px; - height: 308px; - background: rgba(255, 255, 255, 1); - border-radius: 8px; - border: 1px solid rgba(225, 225, 225, 1); -} -.voice-container .voice { - height: 236px; - width: 100%; - border-radius: 8px; -} -.voice-container .voice textarea { - height: 100%; - width: 100%; - border: none; - outline: none; - border-radius: 8px; - padding: 25px; - font-size: 14px; - box-sizing: border-box; - resize: none; -} -.voice-input { - width: 100%; - height: 72px; - box-sizing: border-box; - padding-left: 35px; - background: rgba(242, 244, 245, 1); - border-radius: 8px; - line-height: 72px; -} -.voice-input .el-select { - width: 492px; -} -.start-voice { - display: inline-block; - margin-left: 10px; -} -.start-voice .time { - margin-right: 25px; -} -.asr-advantage > ul > li { - margin-bottom: 77px; -} -#msg { - width: 100%; - line-height: 40px; - font-size: 14px; - margin-left: 330px; -} -#captcha { - margin-left: 350px !important; - display: inline-block; - position: relative; -} -.black { - position: fixed; - width: 100%; - height: 100%; - z-index: 5; - background: rgba(0, 0, 0, 0.5); - top: 0; - left: 0; -} -.container { - position: fixed; - z-index: 6; - top: 25%; - left: 10%; -} -.audio-scene-con { - width: 100%; - padding-bottom: 84px; - background: #fff; -} -#sound { - color: #fff; - cursor: pointer; - background: #147ede; - padding: 10px; - margin-top: 30px; - margin-left: 135px; - width: 176px; - height: 30px !important; - text-align: center; - line-height: 30px !important; - border-radius: 10px; -} -.con-ten { - position: absolute; - width: 100%; - height: 100%; - z-index: 5; - background: #fff; - opacity: 0.5; - top: 0; - left: 0; -} -.websocket-url { - width: 320px; - height: 20px; - border: 1px solid #dcdfe6; - line-height: 20px; - padding: 10px; - border-radius: 4px; -} -.voice-btn { - color: #fff; - background-color: #409eff; - font-weight: 500; - padding: 12px 20px; - font-size: 14px; - border-radius: 4px; - border: 0; - cursor: pointer; -} -.voice-btn.end { - display: none; -} -.result-text { - background: #fff; - padding: 20px; -} -.voice-footer { - border-top: 1px solid #dddede; - background: #f7f9fa; - text-align: center; - margin-bottom: 8px; - color: #333; - font-size: 12px; - padding: 20px 0; -} - -/** line animate **/ -.time-box { - display: none; - margin-left: 10px; - width: 300px; -} -.total-time { - font-size: 14px; - color: #545454; -} -.voice-btn.end.show, -.time-box.show { - display: inline; -} -.start-taste-line { - margin-right: 20px; - display: inline-block; -} 
-.start-taste-line hr { - background-color: #187cff; - width: 3px; - height: 8px; - margin: 0 3px; - display: inline-block; - border: none; -} -.hr { - animation: note 0.2s ease-in-out; - animation-iteration-count: infinite; - animation-direction: alternate; -} -.hr-one { - animation-delay: -0.9s; -} -.hr-two { - animation-delay: -0.8s; -} -.hr-three { - animation-delay: -0.7s; -} -.hr-four { - animation-delay: -0.6s; -} -.hr-five { - animation-delay: -0.5s; -} -.hr-six { - animation-delay: -0.4s; -} -.hr-seven { - animation-delay: -0.3s; -} -.hr-eight { - animation-delay: -0.2s; -} -.hr-nine { - animation-delay: -0.1s; -} -@keyframes note { - from { - transform: scaleY(1); - } - to { - transform: scaleY(4); - } -} \ No newline at end of file diff --git a/demos/streaming_asr_server/web/static/fonts/FontAwesome.otf b/demos/streaming_asr_server/web/static/fonts/FontAwesome.otf deleted file mode 100644 index 401ec0f36..000000000 Binary files a/demos/streaming_asr_server/web/static/fonts/FontAwesome.otf and /dev/null differ diff --git a/demos/streaming_asr_server/web/static/fonts/fontawesome-webfont.eot b/demos/streaming_asr_server/web/static/fonts/fontawesome-webfont.eot deleted file mode 100644 index e9f60ca95..000000000 Binary files a/demos/streaming_asr_server/web/static/fonts/fontawesome-webfont.eot and /dev/null differ diff --git a/demos/streaming_asr_server/web/static/fonts/fontawesome-webfont.svg b/demos/streaming_asr_server/web/static/fonts/fontawesome-webfont.svg deleted file mode 100644 index 6cd0326be..000000000 --- a/demos/streaming_asr_server/web/static/fonts/fontawesome-webfont.svg +++ /dev/null @@ -1,1951 +0,0 @@ - - - - -Created by FontForge 20120731 at Mon Oct 24 17:37:40 2016 - By ,,, -Copyright Dave Gandy 2016. All rights reserved. - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - diff --git a/demos/streaming_asr_server/web/static/fonts/fontawesome-webfont.ttf b/demos/streaming_asr_server/web/static/fonts/fontawesome-webfont.ttf deleted file mode 100644 index 35acda2fa..000000000 Binary files a/demos/streaming_asr_server/web/static/fonts/fontawesome-webfont.ttf and /dev/null differ diff --git 
a/demos/streaming_asr_server/web/static/fonts/fontawesome-webfont.woff b/demos/streaming_asr_server/web/static/fonts/fontawesome-webfont.woff deleted file mode 100644 index 400014a4b..000000000 Binary files a/demos/streaming_asr_server/web/static/fonts/fontawesome-webfont.woff and /dev/null differ diff --git a/demos/streaming_asr_server/web/static/fonts/fontawesome-webfont.woff2 b/demos/streaming_asr_server/web/static/fonts/fontawesome-webfont.woff2 deleted file mode 100644 index 4d13fc604..000000000 Binary files a/demos/streaming_asr_server/web/static/fonts/fontawesome-webfont.woff2 and /dev/null differ diff --git a/demos/streaming_asr_server/web/static/image/PaddleSpeech_logo.png b/demos/streaming_asr_server/web/static/image/PaddleSpeech_logo.png deleted file mode 100644 index fb2527754..000000000 Binary files a/demos/streaming_asr_server/web/static/image/PaddleSpeech_logo.png and /dev/null differ diff --git a/demos/streaming_asr_server/web/static/image/voice-dictation.svg b/demos/streaming_asr_server/web/static/image/voice-dictation.svg deleted file mode 100644 index d35971499..000000000 --- a/demos/streaming_asr_server/web/static/image/voice-dictation.svg +++ /dev/null @@ -1,94 +0,0 @@ - - - - 背景 - Created with Sketch. - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - \ No newline at end of file diff --git a/demos/streaming_asr_server/web/static/js/SoundRecognizer.js b/demos/streaming_asr_server/web/static/js/SoundRecognizer.js deleted file mode 100644 index 5ef3d2e89..000000000 --- a/demos/streaming_asr_server/web/static/js/SoundRecognizer.js +++ /dev/null @@ -1,133 +0,0 @@ -SoundRecognizer = { - rec: null, - wave: null, - SampleRate: 16000, - testBitRate: 16, - isCloseRecorder: false, - SendInterval: 300, - realTimeSendTryType: 'pcm', - realTimeSendTryEncBusy: 0, - realTimeSendTryTime: 0, - realTimeSendTryNumber: 0, - transferUploadNumberMax: 0, - realTimeSendTryChunk: null, - soundType: "pcm", - init: function (config) { - this.soundType = config.soundType || 'pcm'; - this.SampleRate = config.sampleRate || 16000; - this.recwaveElm = config.recwaveElm || ''; - this.TransferUpload = config.translerCallBack || this.TransferProcess; - this.initRecorder(); - }, - RealTimeSendTryReset: function (type) { - this.realTimeSendTryType = type; - this.realTimeSendTryTime = 0; - }, - RealTimeSendTry: function (rec, isClose) { - var that = this; - var t1 = Date.now(), endT = 0, recImpl = Recorder.prototype; - if (this.realTimeSendTryTime == 0) { - this.realTimeSendTryTime = t1; - this.realTimeSendTryEncBusy = 0; - this.realTimeSendTryNumber = 0; - this.transferUploadNumberMax = 0; - this.realTimeSendTryChunk = null; - } - if (!isClose && t1 - this.realTimeSendTryTime < this.SendInterval) { - return;//控制缓冲达到指定间隔才进行传输 - } - this.realTimeSendTryTime = t1; - var number = ++this.realTimeSendTryNumber; - - //借用SampleData函数进行数据的连续处理,采样率转换是顺带的 - var chunk = Recorder.SampleData(rec.buffers, rec.srcSampleRate, this.SampleRate, this.realTimeSendTryChunk, { frameType: isClose ? "" : this.realTimeSendTryType }); - - //清理已处理完的缓冲数据,释放内存以支持长时间录音,最后完成录音时不能调用stop,因为数据已经被清掉了 - for (var i = this.realTimeSendTryChunk ? 
this.realTimeSendTryChunk.index : 0; i < chunk.index; i++) { - rec.buffers[i] = null; - } - this.realTimeSendTryChunk = chunk; - - //没有新数据,或结束时的数据量太小,不能进行mock转码 - if (chunk.data.length == 0 || isClose && chunk.data.length < 2000) { - this.TransferUpload(number, null, 0, null, isClose); - return; - } - //实时编码队列阻塞处理 - if (!isClose) { - if (this.realTimeSendTryEncBusy >= 2) { - console.log("编码队列阻塞,已丢弃一帧", 1); - return; - } - } - this.realTimeSendTryEncBusy++; - - //通过mock方法实时转码成mp3、wav - var encStartTime = Date.now(); - var recMock = Recorder({ - type: this.realTimeSendTryType - , sampleRate: this.SampleRate //采样率 - , bitRate: this.testBitRate //比特率 - }); - recMock.mock(chunk.data, chunk.sampleRate); - recMock.stop(function (blob, duration) { - that.realTimeSendTryEncBusy && (that.realTimeSendTryEncBusy--); - blob.encTime = Date.now() - encStartTime; - - //转码好就推入传输 - that.TransferUpload(number, blob, duration, recMock, isClose); - }, function (msg) { - that.realTimeSendTryEncBusy && (that.realTimeSendTryEncBusy--); - //转码错误?没想到什么时候会产生错误! - console.log("不应该出现的错误:" + msg, 1); - }); - }, - recordClose: function () { - try { - this.rec.close(function () { - this.isCloseRecorder = true; - }); - this.RealTimeSendTry(this.rec, true);//最后一次发送 - } catch (ex) { - // recordClose(); - } - }, - recordEnd: function () { - try { - this.rec.stop(function (blob, time) { - this.recordClose(); - }, function (s) { - this.recordClose(); - }); - } catch (ex) { - } - }, - initRecorder: function () { - var that = this; - var rec = Recorder({ - type: that.soundType - , bitRate: that.testBitRate - , sampleRate: that.SampleRate - , onProcess: function (buffers, level, time, sampleRate) { - that.wave.input(buffers[buffers.length - 1], level, sampleRate); - that.RealTimeSendTry(rec, false);//推入实时处理,因为是unknown格式,这里简化函数调用,没有用到buffers和bufferSampleRate,因为这些数据和rec.buffers是完全相同的。 - } - }); - - rec.open(function () { - that.wave = Recorder.FrequencyHistogramView({ - elem: that.recwaveElm, lineCount: 90 - , position: 0 - , minHeight: 1 - , stripeEnable: false - }); - rec.start(); - that.isCloseRecorder = false; - that.RealTimeSendTryReset(that.soundType);//重置 - }); - this.rec = rec; - }, - TransferProcess: function (number, blobOrNull, duration, blobRec, isClose) { - - } -} \ No newline at end of file diff --git a/demos/streaming_asr_server/web/static/js/jquery-3.2.1.min.js b/demos/streaming_asr_server/web/static/js/jquery-3.2.1.min.js deleted file mode 100644 index 644d35e27..000000000 --- a/demos/streaming_asr_server/web/static/js/jquery-3.2.1.min.js +++ /dev/null @@ -1,4 +0,0 @@ -/*! 
jQuery v3.2.1 | (c) JS Foundation and other contributors | jquery.org/license */ -!function(a,b){"use strict";"object"==typeof module&&"object"==typeof module.exports?module.exports=a.document?b(a,!0):function(a){if(!a.document)throw new Error("jQuery requires a window with a document");return b(a)}:b(a)}("undefined"!=typeof window?window:this,function(a,b){"use strict";var c=[],d=a.document,e=Object.getPrototypeOf,f=c.slice,g=c.concat,h=c.push,i=c.indexOf,j={},k=j.toString,l=j.hasOwnProperty,m=l.toString,n=m.call(Object),o={};function p(a,b){b=b||d;var c=b.createElement("script");c.text=a,b.head.appendChild(c).parentNode.removeChild(c)}var q="3.2.1",r=function(a,b){return new r.fn.init(a,b)},s=/^[\s\uFEFF\xA0]+|[\s\uFEFF\xA0]+$/g,t=/^-ms-/,u=/-([a-z])/g,v=function(a,b){return b.toUpperCase()};r.fn=r.prototype={jquery:q,constructor:r,length:0,toArray:function(){return f.call(this)},get:function(a){return null==a?f.call(this):a<0?this[a+this.length]:this[a]},pushStack:function(a){var b=r.merge(this.constructor(),a);return b.prevObject=this,b},each:function(a){return r.each(this,a)},map:function(a){return this.pushStack(r.map(this,function(b,c){return a.call(b,c,b)}))},slice:function(){return this.pushStack(f.apply(this,arguments))},first:function(){return this.eq(0)},last:function(){return this.eq(-1)},eq:function(a){var b=this.length,c=+a+(a<0?b:0);return this.pushStack(c>=0&&c0&&b-1 in a)}var x=function(a){var b,c,d,e,f,g,h,i,j,k,l,m,n,o,p,q,r,s,t,u="sizzle"+1*new Date,v=a.document,w=0,x=0,y=ha(),z=ha(),A=ha(),B=function(a,b){return a===b&&(l=!0),0},C={}.hasOwnProperty,D=[],E=D.pop,F=D.push,G=D.push,H=D.slice,I=function(a,b){for(var c=0,d=a.length;c+~]|"+K+")"+K+"*"),S=new RegExp("="+K+"*([^\\]'\"]*?)"+K+"*\\]","g"),T=new RegExp(N),U=new RegExp("^"+L+"$"),V={ID:new RegExp("^#("+L+")"),CLASS:new RegExp("^\\.("+L+")"),TAG:new RegExp("^("+L+"|[*])"),ATTR:new RegExp("^"+M),PSEUDO:new RegExp("^"+N),CHILD:new RegExp("^:(only|first|last|nth|nth-last)-(child|of-type)(?:\\("+K+"*(even|odd|(([+-]|)(\\d*)n|)"+K+"*(?:([+-]|)"+K+"*(\\d+)|))"+K+"*\\)|)","i"),bool:new RegExp("^(?:"+J+")$","i"),needsContext:new RegExp("^"+K+"*[>+~]|:(even|odd|eq|gt|lt|nth|first|last)(?:\\("+K+"*((?:-\\d)?\\d*)"+K+"*\\)|)(?=[^-]|$)","i")},W=/^(?:input|select|textarea|button)$/i,X=/^h\d$/i,Y=/^[^{]+\{\s*\[native \w/,Z=/^(?:#([\w-]+)|(\w+)|\.([\w-]+))$/,$=/[+~]/,_=new RegExp("\\\\([\\da-f]{1,6}"+K+"?|("+K+")|.)","ig"),aa=function(a,b,c){var d="0x"+b-65536;return d!==d||c?b:d<0?String.fromCharCode(d+65536):String.fromCharCode(d>>10|55296,1023&d|56320)},ba=/([\0-\x1f\x7f]|^-?\d)|^-$|[^\0-\x1f\x7f-\uFFFF\w-]/g,ca=function(a,b){return b?"\0"===a?"\ufffd":a.slice(0,-1)+"\\"+a.charCodeAt(a.length-1).toString(16)+" ":"\\"+a},da=function(){m()},ea=ta(function(a){return a.disabled===!0&&("form"in a||"label"in a)},{dir:"parentNode",next:"legend"});try{G.apply(D=H.call(v.childNodes),v.childNodes),D[v.childNodes.length].nodeType}catch(fa){G={apply:D.length?function(a,b){F.apply(a,H.call(b))}:function(a,b){var c=a.length,d=0;while(a[c++]=b[d++]);a.length=c-1}}}function ga(a,b,d,e){var f,h,j,k,l,o,r,s=b&&b.ownerDocument,w=b?b.nodeType:9;if(d=d||[],"string"!=typeof a||!a||1!==w&&9!==w&&11!==w)return d;if(!e&&((b?b.ownerDocument||b:v)!==n&&m(b),b=b||n,p)){if(11!==w&&(l=Z.exec(a)))if(f=l[1]){if(9===w){if(!(j=b.getElementById(f)))return d;if(j.id===f)return d.push(j),d}else if(s&&(j=s.getElementById(f))&&t(b,j)&&j.id===f)return d.push(j),d}else{if(l[2])return 
G.apply(d,b.getElementsByTagName(a)),d;if((f=l[3])&&c.getElementsByClassName&&b.getElementsByClassName)return G.apply(d,b.getElementsByClassName(f)),d}if(c.qsa&&!A[a+" "]&&(!q||!q.test(a))){if(1!==w)s=b,r=a;else if("object"!==b.nodeName.toLowerCase()){(k=b.getAttribute("id"))?k=k.replace(ba,ca):b.setAttribute("id",k=u),o=g(a),h=o.length;while(h--)o[h]="#"+k+" "+sa(o[h]);r=o.join(","),s=$.test(a)&&qa(b.parentNode)||b}if(r)try{return G.apply(d,s.querySelectorAll(r)),d}catch(x){}finally{k===u&&b.removeAttribute("id")}}}return i(a.replace(P,"$1"),b,d,e)}function ha(){var a=[];function b(c,e){return a.push(c+" ")>d.cacheLength&&delete b[a.shift()],b[c+" "]=e}return b}function ia(a){return a[u]=!0,a}function ja(a){var b=n.createElement("fieldset");try{return!!a(b)}catch(c){return!1}finally{b.parentNode&&b.parentNode.removeChild(b),b=null}}function ka(a,b){var c=a.split("|"),e=c.length;while(e--)d.attrHandle[c[e]]=b}function la(a,b){var c=b&&a,d=c&&1===a.nodeType&&1===b.nodeType&&a.sourceIndex-b.sourceIndex;if(d)return d;if(c)while(c=c.nextSibling)if(c===b)return-1;return a?1:-1}function ma(a){return function(b){var c=b.nodeName.toLowerCase();return"input"===c&&b.type===a}}function na(a){return function(b){var c=b.nodeName.toLowerCase();return("input"===c||"button"===c)&&b.type===a}}function oa(a){return function(b){return"form"in b?b.parentNode&&b.disabled===!1?"label"in b?"label"in b.parentNode?b.parentNode.disabled===a:b.disabled===a:b.isDisabled===a||b.isDisabled!==!a&&ea(b)===a:b.disabled===a:"label"in b&&b.disabled===a}}function pa(a){return ia(function(b){return b=+b,ia(function(c,d){var e,f=a([],c.length,b),g=f.length;while(g--)c[e=f[g]]&&(c[e]=!(d[e]=c[e]))})})}function qa(a){return a&&"undefined"!=typeof a.getElementsByTagName&&a}c=ga.support={},f=ga.isXML=function(a){var b=a&&(a.ownerDocument||a).documentElement;return!!b&&"HTML"!==b.nodeName},m=ga.setDocument=function(a){var b,e,g=a?a.ownerDocument||a:v;return g!==n&&9===g.nodeType&&g.documentElement?(n=g,o=n.documentElement,p=!f(n),v!==n&&(e=n.defaultView)&&e.top!==e&&(e.addEventListener?e.addEventListener("unload",da,!1):e.attachEvent&&e.attachEvent("onunload",da)),c.attributes=ja(function(a){return a.className="i",!a.getAttribute("className")}),c.getElementsByTagName=ja(function(a){return a.appendChild(n.createComment("")),!a.getElementsByTagName("*").length}),c.getElementsByClassName=Y.test(n.getElementsByClassName),c.getById=ja(function(a){return o.appendChild(a).id=u,!n.getElementsByName||!n.getElementsByName(u).length}),c.getById?(d.filter.ID=function(a){var b=a.replace(_,aa);return function(a){return a.getAttribute("id")===b}},d.find.ID=function(a,b){if("undefined"!=typeof b.getElementById&&p){var c=b.getElementById(a);return c?[c]:[]}}):(d.filter.ID=function(a){var b=a.replace(_,aa);return function(a){var c="undefined"!=typeof a.getAttributeNode&&a.getAttributeNode("id");return c&&c.value===b}},d.find.ID=function(a,b){if("undefined"!=typeof b.getElementById&&p){var c,d,e,f=b.getElementById(a);if(f){if(c=f.getAttributeNode("id"),c&&c.value===a)return[f];e=b.getElementsByName(a),d=0;while(f=e[d++])if(c=f.getAttributeNode("id"),c&&c.value===a)return[f]}return[]}}),d.find.TAG=c.getElementsByTagName?function(a,b){return"undefined"!=typeof b.getElementsByTagName?b.getElementsByTagName(a):c.qsa?b.querySelectorAll(a):void 0}:function(a,b){var c,d=[],e=0,f=b.getElementsByTagName(a);if("*"===a){while(c=f[e++])1===c.nodeType&&d.push(c);return d}return f},d.find.CLASS=c.getElementsByClassName&&function(a,b){if("undefined"!=typeof 
b.getElementsByClassName&&p)return b.getElementsByClassName(a)},r=[],q=[],(c.qsa=Y.test(n.querySelectorAll))&&(ja(function(a){o.appendChild(a).innerHTML="",a.querySelectorAll("[msallowcapture^='']").length&&q.push("[*^$]="+K+"*(?:''|\"\")"),a.querySelectorAll("[selected]").length||q.push("\\["+K+"*(?:value|"+J+")"),a.querySelectorAll("[id~="+u+"-]").length||q.push("~="),a.querySelectorAll(":checked").length||q.push(":checked"),a.querySelectorAll("a#"+u+"+*").length||q.push(".#.+[+~]")}),ja(function(a){a.innerHTML="";var b=n.createElement("input");b.setAttribute("type","hidden"),a.appendChild(b).setAttribute("name","D"),a.querySelectorAll("[name=d]").length&&q.push("name"+K+"*[*^$|!~]?="),2!==a.querySelectorAll(":enabled").length&&q.push(":enabled",":disabled"),o.appendChild(a).disabled=!0,2!==a.querySelectorAll(":disabled").length&&q.push(":enabled",":disabled"),a.querySelectorAll("*,:x"),q.push(",.*:")})),(c.matchesSelector=Y.test(s=o.matches||o.webkitMatchesSelector||o.mozMatchesSelector||o.oMatchesSelector||o.msMatchesSelector))&&ja(function(a){c.disconnectedMatch=s.call(a,"*"),s.call(a,"[s!='']:x"),r.push("!=",N)}),q=q.length&&new RegExp(q.join("|")),r=r.length&&new RegExp(r.join("|")),b=Y.test(o.compareDocumentPosition),t=b||Y.test(o.contains)?function(a,b){var c=9===a.nodeType?a.documentElement:a,d=b&&b.parentNode;return a===d||!(!d||1!==d.nodeType||!(c.contains?c.contains(d):a.compareDocumentPosition&&16&a.compareDocumentPosition(d)))}:function(a,b){if(b)while(b=b.parentNode)if(b===a)return!0;return!1},B=b?function(a,b){if(a===b)return l=!0,0;var d=!a.compareDocumentPosition-!b.compareDocumentPosition;return d?d:(d=(a.ownerDocument||a)===(b.ownerDocument||b)?a.compareDocumentPosition(b):1,1&d||!c.sortDetached&&b.compareDocumentPosition(a)===d?a===n||a.ownerDocument===v&&t(v,a)?-1:b===n||b.ownerDocument===v&&t(v,b)?1:k?I(k,a)-I(k,b):0:4&d?-1:1)}:function(a,b){if(a===b)return l=!0,0;var c,d=0,e=a.parentNode,f=b.parentNode,g=[a],h=[b];if(!e||!f)return a===n?-1:b===n?1:e?-1:f?1:k?I(k,a)-I(k,b):0;if(e===f)return la(a,b);c=a;while(c=c.parentNode)g.unshift(c);c=b;while(c=c.parentNode)h.unshift(c);while(g[d]===h[d])d++;return d?la(g[d],h[d]):g[d]===v?-1:h[d]===v?1:0},n):n},ga.matches=function(a,b){return ga(a,null,null,b)},ga.matchesSelector=function(a,b){if((a.ownerDocument||a)!==n&&m(a),b=b.replace(S,"='$1']"),c.matchesSelector&&p&&!A[b+" "]&&(!r||!r.test(b))&&(!q||!q.test(b)))try{var d=s.call(a,b);if(d||c.disconnectedMatch||a.document&&11!==a.document.nodeType)return d}catch(e){}return ga(b,n,null,[a]).length>0},ga.contains=function(a,b){return(a.ownerDocument||a)!==n&&m(a),t(a,b)},ga.attr=function(a,b){(a.ownerDocument||a)!==n&&m(a);var e=d.attrHandle[b.toLowerCase()],f=e&&C.call(d.attrHandle,b.toLowerCase())?e(a,b,!p):void 0;return void 0!==f?f:c.attributes||!p?a.getAttribute(b):(f=a.getAttributeNode(b))&&f.specified?f.value:null},ga.escape=function(a){return(a+"").replace(ba,ca)},ga.error=function(a){throw new Error("Syntax error, unrecognized expression: "+a)},ga.uniqueSort=function(a){var b,d=[],e=0,f=0;if(l=!c.detectDuplicates,k=!c.sortStable&&a.slice(0),a.sort(B),l){while(b=a[f++])b===a[f]&&(e=d.push(f));while(e--)a.splice(d[e],1)}return k=null,a},e=ga.getText=function(a){var b,c="",d=0,f=a.nodeType;if(f){if(1===f||9===f||11===f){if("string"==typeof a.textContent)return a.textContent;for(a=a.firstChild;a;a=a.nextSibling)c+=e(a)}else if(3===f||4===f)return a.nodeValue}else while(b=a[d++])c+=e(b);return 
c},d=ga.selectors={cacheLength:50,createPseudo:ia,match:V,attrHandle:{},find:{},relative:{">":{dir:"parentNode",first:!0}," ":{dir:"parentNode"},"+":{dir:"previousSibling",first:!0},"~":{dir:"previousSibling"}},preFilter:{ATTR:function(a){return a[1]=a[1].replace(_,aa),a[3]=(a[3]||a[4]||a[5]||"").replace(_,aa),"~="===a[2]&&(a[3]=" "+a[3]+" "),a.slice(0,4)},CHILD:function(a){return a[1]=a[1].toLowerCase(),"nth"===a[1].slice(0,3)?(a[3]||ga.error(a[0]),a[4]=+(a[4]?a[5]+(a[6]||1):2*("even"===a[3]||"odd"===a[3])),a[5]=+(a[7]+a[8]||"odd"===a[3])):a[3]&&ga.error(a[0]),a},PSEUDO:function(a){var b,c=!a[6]&&a[2];return V.CHILD.test(a[0])?null:(a[3]?a[2]=a[4]||a[5]||"":c&&T.test(c)&&(b=g(c,!0))&&(b=c.indexOf(")",c.length-b)-c.length)&&(a[0]=a[0].slice(0,b),a[2]=c.slice(0,b)),a.slice(0,3))}},filter:{TAG:function(a){var b=a.replace(_,aa).toLowerCase();return"*"===a?function(){return!0}:function(a){return a.nodeName&&a.nodeName.toLowerCase()===b}},CLASS:function(a){var b=y[a+" "];return b||(b=new RegExp("(^|"+K+")"+a+"("+K+"|$)"))&&y(a,function(a){return b.test("string"==typeof a.className&&a.className||"undefined"!=typeof a.getAttribute&&a.getAttribute("class")||"")})},ATTR:function(a,b,c){return function(d){var e=ga.attr(d,a);return null==e?"!="===b:!b||(e+="","="===b?e===c:"!="===b?e!==c:"^="===b?c&&0===e.indexOf(c):"*="===b?c&&e.indexOf(c)>-1:"$="===b?c&&e.slice(-c.length)===c:"~="===b?(" "+e.replace(O," ")+" ").indexOf(c)>-1:"|="===b&&(e===c||e.slice(0,c.length+1)===c+"-"))}},CHILD:function(a,b,c,d,e){var f="nth"!==a.slice(0,3),g="last"!==a.slice(-4),h="of-type"===b;return 1===d&&0===e?function(a){return!!a.parentNode}:function(b,c,i){var j,k,l,m,n,o,p=f!==g?"nextSibling":"previousSibling",q=b.parentNode,r=h&&b.nodeName.toLowerCase(),s=!i&&!h,t=!1;if(q){if(f){while(p){m=b;while(m=m[p])if(h?m.nodeName.toLowerCase()===r:1===m.nodeType)return!1;o=p="only"===a&&!o&&"nextSibling"}return!0}if(o=[g?q.firstChild:q.lastChild],g&&s){m=q,l=m[u]||(m[u]={}),k=l[m.uniqueID]||(l[m.uniqueID]={}),j=k[a]||[],n=j[0]===w&&j[1],t=n&&j[2],m=n&&q.childNodes[n];while(m=++n&&m&&m[p]||(t=n=0)||o.pop())if(1===m.nodeType&&++t&&m===b){k[a]=[w,n,t];break}}else if(s&&(m=b,l=m[u]||(m[u]={}),k=l[m.uniqueID]||(l[m.uniqueID]={}),j=k[a]||[],n=j[0]===w&&j[1],t=n),t===!1)while(m=++n&&m&&m[p]||(t=n=0)||o.pop())if((h?m.nodeName.toLowerCase()===r:1===m.nodeType)&&++t&&(s&&(l=m[u]||(m[u]={}),k=l[m.uniqueID]||(l[m.uniqueID]={}),k[a]=[w,t]),m===b))break;return t-=e,t===d||t%d===0&&t/d>=0}}},PSEUDO:function(a,b){var c,e=d.pseudos[a]||d.setFilters[a.toLowerCase()]||ga.error("unsupported pseudo: "+a);return e[u]?e(b):e.length>1?(c=[a,a,"",b],d.setFilters.hasOwnProperty(a.toLowerCase())?ia(function(a,c){var d,f=e(a,b),g=f.length;while(g--)d=I(a,f[g]),a[d]=!(c[d]=f[g])}):function(a){return e(a,0,c)}):e}},pseudos:{not:ia(function(a){var b=[],c=[],d=h(a.replace(P,"$1"));return d[u]?ia(function(a,b,c,e){var f,g=d(a,null,e,[]),h=a.length;while(h--)(f=g[h])&&(a[h]=!(b[h]=f))}):function(a,e,f){return b[0]=a,d(b,null,f,c),b[0]=null,!c.pop()}}),has:ia(function(a){return function(b){return ga(a,b).length>0}}),contains:ia(function(a){return a=a.replace(_,aa),function(b){return(b.textContent||b.innerText||e(b)).indexOf(a)>-1}}),lang:ia(function(a){return U.test(a||"")||ga.error("unsupported lang: "+a),a=a.replace(_,aa).toLowerCase(),function(b){var c;do if(c=p?b.lang:b.getAttribute("xml:lang")||b.getAttribute("lang"))return c=c.toLowerCase(),c===a||0===c.indexOf(a+"-");while((b=b.parentNode)&&1===b.nodeType);return!1}}),target:function(b){var 
c=a.location&&a.location.hash;return c&&c.slice(1)===b.id},root:function(a){return a===o},focus:function(a){return a===n.activeElement&&(!n.hasFocus||n.hasFocus())&&!!(a.type||a.href||~a.tabIndex)},enabled:oa(!1),disabled:oa(!0),checked:function(a){var b=a.nodeName.toLowerCase();return"input"===b&&!!a.checked||"option"===b&&!!a.selected},selected:function(a){return a.parentNode&&a.parentNode.selectedIndex,a.selected===!0},empty:function(a){for(a=a.firstChild;a;a=a.nextSibling)if(a.nodeType<6)return!1;return!0},parent:function(a){return!d.pseudos.empty(a)},header:function(a){return X.test(a.nodeName)},input:function(a){return W.test(a.nodeName)},button:function(a){var b=a.nodeName.toLowerCase();return"input"===b&&"button"===a.type||"button"===b},text:function(a){var b;return"input"===a.nodeName.toLowerCase()&&"text"===a.type&&(null==(b=a.getAttribute("type"))||"text"===b.toLowerCase())},first:pa(function(){return[0]}),last:pa(function(a,b){return[b-1]}),eq:pa(function(a,b,c){return[c<0?c+b:c]}),even:pa(function(a,b){for(var c=0;c=0;)a.push(d);return a}),gt:pa(function(a,b,c){for(var d=c<0?c+b:c;++d1?function(b,c,d){var e=a.length;while(e--)if(!a[e](b,c,d))return!1;return!0}:a[0]}function va(a,b,c){for(var d=0,e=b.length;d-1&&(f[j]=!(g[j]=l))}}else r=wa(r===g?r.splice(o,r.length):r),e?e(null,g,r,i):G.apply(g,r)})}function ya(a){for(var b,c,e,f=a.length,g=d.relative[a[0].type],h=g||d.relative[" "],i=g?1:0,k=ta(function(a){return a===b},h,!0),l=ta(function(a){return I(b,a)>-1},h,!0),m=[function(a,c,d){var e=!g&&(d||c!==j)||((b=c).nodeType?k(a,c,d):l(a,c,d));return b=null,e}];i1&&ua(m),i>1&&sa(a.slice(0,i-1).concat({value:" "===a[i-2].type?"*":""})).replace(P,"$1"),c,i0,e=a.length>0,f=function(f,g,h,i,k){var l,o,q,r=0,s="0",t=f&&[],u=[],v=j,x=f||e&&d.find.TAG("*",k),y=w+=null==v?1:Math.random()||.1,z=x.length;for(k&&(j=g===n||g||k);s!==z&&null!=(l=x[s]);s++){if(e&&l){o=0,g||l.ownerDocument===n||(m(l),h=!p);while(q=a[o++])if(q(l,g||n,h)){i.push(l);break}k&&(w=y)}c&&((l=!q&&l)&&r--,f&&t.push(l))}if(r+=s,c&&s!==r){o=0;while(q=b[o++])q(t,u,g,h);if(f){if(r>0)while(s--)t[s]||u[s]||(u[s]=E.call(i));u=wa(u)}G.apply(i,u),k&&!f&&u.length>0&&r+b.length>1&&ga.uniqueSort(i)}return k&&(w=y,j=v),t};return c?ia(f):f}return h=ga.compile=function(a,b){var c,d=[],e=[],f=A[a+" "];if(!f){b||(b=g(a)),c=b.length;while(c--)f=ya(b[c]),f[u]?d.push(f):e.push(f);f=A(a,za(e,d)),f.selector=a}return f},i=ga.select=function(a,b,c,e){var f,i,j,k,l,m="function"==typeof a&&a,n=!e&&g(a=m.selector||a);if(c=c||[],1===n.length){if(i=n[0]=n[0].slice(0),i.length>2&&"ID"===(j=i[0]).type&&9===b.nodeType&&p&&d.relative[i[1].type]){if(b=(d.find.ID(j.matches[0].replace(_,aa),b)||[])[0],!b)return c;m&&(b=b.parentNode),a=a.slice(i.shift().value.length)}f=V.needsContext.test(a)?0:i.length;while(f--){if(j=i[f],d.relative[k=j.type])break;if((l=d.find[k])&&(e=l(j.matches[0].replace(_,aa),$.test(i[0].type)&&qa(b.parentNode)||b))){if(i.splice(f,1),a=e.length&&sa(i),!a)return G.apply(c,e),c;break}}}return(m||h(a,n))(e,b,!p,c,!b||$.test(a)&&qa(b.parentNode)||b),c},c.sortStable=u.split("").sort(B).join("")===u,c.detectDuplicates=!!l,m(),c.sortDetached=ja(function(a){return 1&a.compareDocumentPosition(n.createElement("fieldset"))}),ja(function(a){return a.innerHTML="","#"===a.firstChild.getAttribute("href")})||ka("type|href|height|width",function(a,b,c){if(!c)return a.getAttribute(b,"type"===b.toLowerCase()?1:2)}),c.attributes&&ja(function(a){return 
a.innerHTML="",a.firstChild.setAttribute("value",""),""===a.firstChild.getAttribute("value")})||ka("value",function(a,b,c){if(!c&&"input"===a.nodeName.toLowerCase())return a.defaultValue}),ja(function(a){return null==a.getAttribute("disabled")})||ka(J,function(a,b,c){var d;if(!c)return a[b]===!0?b.toLowerCase():(d=a.getAttributeNode(b))&&d.specified?d.value:null}),ga}(a);r.find=x,r.expr=x.selectors,r.expr[":"]=r.expr.pseudos,r.uniqueSort=r.unique=x.uniqueSort,r.text=x.getText,r.isXMLDoc=x.isXML,r.contains=x.contains,r.escapeSelector=x.escape;var y=function(a,b,c){var d=[],e=void 0!==c;while((a=a[b])&&9!==a.nodeType)if(1===a.nodeType){if(e&&r(a).is(c))break;d.push(a)}return d},z=function(a,b){for(var c=[];a;a=a.nextSibling)1===a.nodeType&&a!==b&&c.push(a);return c},A=r.expr.match.needsContext;function B(a,b){return a.nodeName&&a.nodeName.toLowerCase()===b.toLowerCase()}var C=/^<([a-z][^\/\0>:\x20\t\r\n\f]*)[\x20\t\r\n\f]*\/?>(?:<\/\1>|)$/i,D=/^.[^:#\[\.,]*$/;function E(a,b,c){return r.isFunction(b)?r.grep(a,function(a,d){return!!b.call(a,d,a)!==c}):b.nodeType?r.grep(a,function(a){return a===b!==c}):"string"!=typeof b?r.grep(a,function(a){return i.call(b,a)>-1!==c}):D.test(b)?r.filter(b,a,c):(b=r.filter(b,a),r.grep(a,function(a){return i.call(b,a)>-1!==c&&1===a.nodeType}))}r.filter=function(a,b,c){var d=b[0];return c&&(a=":not("+a+")"),1===b.length&&1===d.nodeType?r.find.matchesSelector(d,a)?[d]:[]:r.find.matches(a,r.grep(b,function(a){return 1===a.nodeType}))},r.fn.extend({find:function(a){var b,c,d=this.length,e=this;if("string"!=typeof a)return this.pushStack(r(a).filter(function(){for(b=0;b1?r.uniqueSort(c):c},filter:function(a){return this.pushStack(E(this,a||[],!1))},not:function(a){return this.pushStack(E(this,a||[],!0))},is:function(a){return!!E(this,"string"==typeof a&&A.test(a)?r(a):a||[],!1).length}});var F,G=/^(?:\s*(<[\w\W]+>)[^>]*|#([\w-]+))$/,H=r.fn.init=function(a,b,c){var e,f;if(!a)return this;if(c=c||F,"string"==typeof a){if(e="<"===a[0]&&">"===a[a.length-1]&&a.length>=3?[null,a,null]:G.exec(a),!e||!e[1]&&b)return!b||b.jquery?(b||c).find(a):this.constructor(b).find(a);if(e[1]){if(b=b instanceof r?b[0]:b,r.merge(this,r.parseHTML(e[1],b&&b.nodeType?b.ownerDocument||b:d,!0)),C.test(e[1])&&r.isPlainObject(b))for(e in b)r.isFunction(this[e])?this[e](b[e]):this.attr(e,b[e]);return this}return f=d.getElementById(e[2]),f&&(this[0]=f,this.length=1),this}return a.nodeType?(this[0]=a,this.length=1,this):r.isFunction(a)?void 0!==c.ready?c.ready(a):a(r):r.makeArray(a,this)};H.prototype=r.fn,F=r(d);var I=/^(?:parents|prev(?:Until|All))/,J={children:!0,contents:!0,next:!0,prev:!0};r.fn.extend({has:function(a){var b=r(a,this),c=b.length;return this.filter(function(){for(var a=0;a-1:1===c.nodeType&&r.find.matchesSelector(c,a))){f.push(c);break}return this.pushStack(f.length>1?r.uniqueSort(f):f)},index:function(a){return a?"string"==typeof a?i.call(r(a),this[0]):i.call(this,a.jquery?a[0]:a):this[0]&&this[0].parentNode?this.first().prevAll().length:-1},add:function(a,b){return this.pushStack(r.uniqueSort(r.merge(this.get(),r(a,b))))},addBack:function(a){return this.add(null==a?this.prevObject:this.prevObject.filter(a))}});function K(a,b){while((a=a[b])&&1!==a.nodeType);return a}r.each({parent:function(a){var b=a.parentNode;return b&&11!==b.nodeType?b:null},parents:function(a){return y(a,"parentNode")},parentsUntil:function(a,b,c){return y(a,"parentNode",c)},next:function(a){return K(a,"nextSibling")},prev:function(a){return K(a,"previousSibling")},nextAll:function(a){return 
y(a,"nextSibling")},prevAll:function(a){return y(a,"previousSibling")},nextUntil:function(a,b,c){return y(a,"nextSibling",c)},prevUntil:function(a,b,c){return y(a,"previousSibling",c)},siblings:function(a){return z((a.parentNode||{}).firstChild,a)},children:function(a){return z(a.firstChild)},contents:function(a){return B(a,"iframe")?a.contentDocument:(B(a,"template")&&(a=a.content||a),r.merge([],a.childNodes))}},function(a,b){r.fn[a]=function(c,d){var e=r.map(this,b,c);return"Until"!==a.slice(-5)&&(d=c),d&&"string"==typeof d&&(e=r.filter(d,e)),this.length>1&&(J[a]||r.uniqueSort(e),I.test(a)&&e.reverse()),this.pushStack(e)}});var L=/[^\x20\t\r\n\f]+/g;function M(a){var b={};return r.each(a.match(L)||[],function(a,c){b[c]=!0}),b}r.Callbacks=function(a){a="string"==typeof a?M(a):r.extend({},a);var b,c,d,e,f=[],g=[],h=-1,i=function(){for(e=e||a.once,d=b=!0;g.length;h=-1){c=g.shift();while(++h-1)f.splice(c,1),c<=h&&h--}),this},has:function(a){return a?r.inArray(a,f)>-1:f.length>0},empty:function(){return f&&(f=[]),this},disable:function(){return e=g=[],f=c="",this},disabled:function(){return!f},lock:function(){return e=g=[],c||b||(f=c=""),this},locked:function(){return!!e},fireWith:function(a,c){return e||(c=c||[],c=[a,c.slice?c.slice():c],g.push(c),b||i()),this},fire:function(){return j.fireWith(this,arguments),this},fired:function(){return!!d}};return j};function N(a){return a}function O(a){throw a}function P(a,b,c,d){var e;try{a&&r.isFunction(e=a.promise)?e.call(a).done(b).fail(c):a&&r.isFunction(e=a.then)?e.call(a,b,c):b.apply(void 0,[a].slice(d))}catch(a){c.apply(void 0,[a])}}r.extend({Deferred:function(b){var c=[["notify","progress",r.Callbacks("memory"),r.Callbacks("memory"),2],["resolve","done",r.Callbacks("once memory"),r.Callbacks("once memory"),0,"resolved"],["reject","fail",r.Callbacks("once memory"),r.Callbacks("once memory"),1,"rejected"]],d="pending",e={state:function(){return d},always:function(){return f.done(arguments).fail(arguments),this},"catch":function(a){return e.then(null,a)},pipe:function(){var a=arguments;return r.Deferred(function(b){r.each(c,function(c,d){var e=r.isFunction(a[d[4]])&&a[d[4]];f[d[1]](function(){var a=e&&e.apply(this,arguments);a&&r.isFunction(a.promise)?a.promise().progress(b.notify).done(b.resolve).fail(b.reject):b[d[0]+"With"](this,e?[a]:arguments)})}),a=null}).promise()},then:function(b,d,e){var f=0;function g(b,c,d,e){return function(){var h=this,i=arguments,j=function(){var a,j;if(!(b=f&&(d!==O&&(h=void 0,i=[a]),c.rejectWith(h,i))}};b?k():(r.Deferred.getStackHook&&(k.stackTrace=r.Deferred.getStackHook()),a.setTimeout(k))}}return r.Deferred(function(a){c[0][3].add(g(0,a,r.isFunction(e)?e:N,a.notifyWith)),c[1][3].add(g(0,a,r.isFunction(b)?b:N)),c[2][3].add(g(0,a,r.isFunction(d)?d:O))}).promise()},promise:function(a){return null!=a?r.extend(a,e):e}},f={};return r.each(c,function(a,b){var g=b[2],h=b[5];e[b[1]]=g.add,h&&g.add(function(){d=h},c[3-a][2].disable,c[0][2].lock),g.add(b[3].fire),f[b[0]]=function(){return f[b[0]+"With"](this===f?void 0:this,arguments),this},f[b[0]+"With"]=g.fireWith}),e.promise(f),b&&b.call(f,f),f},when:function(a){var b=arguments.length,c=b,d=Array(c),e=f.call(arguments),g=r.Deferred(),h=function(a){return function(c){d[a]=this,e[a]=arguments.length>1?f.call(arguments):c,--b||g.resolveWith(d,e)}};if(b<=1&&(P(a,g.done(h(c)).resolve,g.reject,!b),"pending"===g.state()||r.isFunction(e[c]&&e[c].then)))return g.then();while(c--)P(e[c],h(c),g.reject);return g.promise()}});var 
Q=/^(Eval|Internal|Range|Reference|Syntax|Type|URI)Error$/;r.Deferred.exceptionHook=function(b,c){a.console&&a.console.warn&&b&&Q.test(b.name)&&a.console.warn("jQuery.Deferred exception: "+b.message,b.stack,c)},r.readyException=function(b){a.setTimeout(function(){throw b})};var R=r.Deferred();r.fn.ready=function(a){return R.then(a)["catch"](function(a){r.readyException(a)}),this},r.extend({isReady:!1,readyWait:1,ready:function(a){(a===!0?--r.readyWait:r.isReady)||(r.isReady=!0,a!==!0&&--r.readyWait>0||R.resolveWith(d,[r]))}}),r.ready.then=R.then;function S(){d.removeEventListener("DOMContentLoaded",S), -a.removeEventListener("load",S),r.ready()}"complete"===d.readyState||"loading"!==d.readyState&&!d.documentElement.doScroll?a.setTimeout(r.ready):(d.addEventListener("DOMContentLoaded",S),a.addEventListener("load",S));var T=function(a,b,c,d,e,f,g){var h=0,i=a.length,j=null==c;if("object"===r.type(c)){e=!0;for(h in c)T(a,b,h,c[h],!0,f,g)}else if(void 0!==d&&(e=!0,r.isFunction(d)||(g=!0),j&&(g?(b.call(a,d),b=null):(j=b,b=function(a,b,c){return j.call(r(a),c)})),b))for(;h1,null,!0)},removeData:function(a){return this.each(function(){X.remove(this,a)})}}),r.extend({queue:function(a,b,c){var d;if(a)return b=(b||"fx")+"queue",d=W.get(a,b),c&&(!d||Array.isArray(c)?d=W.access(a,b,r.makeArray(c)):d.push(c)),d||[]},dequeue:function(a,b){b=b||"fx";var c=r.queue(a,b),d=c.length,e=c.shift(),f=r._queueHooks(a,b),g=function(){r.dequeue(a,b)};"inprogress"===e&&(e=c.shift(),d--),e&&("fx"===b&&c.unshift("inprogress"),delete f.stop,e.call(a,g,f)),!d&&f&&f.empty.fire()},_queueHooks:function(a,b){var c=b+"queueHooks";return W.get(a,c)||W.access(a,c,{empty:r.Callbacks("once memory").add(function(){W.remove(a,[b+"queue",c])})})}}),r.fn.extend({queue:function(a,b){var c=2;return"string"!=typeof a&&(b=a,a="fx",c--),arguments.length\x20\t\r\n\f]+)/i,la=/^$|\/(?:java|ecma)script/i,ma={option:[1,""],thead:[1,"","
"],col:[2,"","
"],tr:[2,"","
"],td:[3,"","
"],_default:[0,"",""]};ma.optgroup=ma.option,ma.tbody=ma.tfoot=ma.colgroup=ma.caption=ma.thead,ma.th=ma.td;function na(a,b){var c;return c="undefined"!=typeof a.getElementsByTagName?a.getElementsByTagName(b||"*"):"undefined"!=typeof a.querySelectorAll?a.querySelectorAll(b||"*"):[],void 0===b||b&&B(a,b)?r.merge([a],c):c}function oa(a,b){for(var c=0,d=a.length;c-1)e&&e.push(f);else if(j=r.contains(f.ownerDocument,f),g=na(l.appendChild(f),"script"),j&&oa(g),c){k=0;while(f=g[k++])la.test(f.type||"")&&c.push(f)}return l}!function(){var a=d.createDocumentFragment(),b=a.appendChild(d.createElement("div")),c=d.createElement("input");c.setAttribute("type","radio"),c.setAttribute("checked","checked"),c.setAttribute("name","t"),b.appendChild(c),o.checkClone=b.cloneNode(!0).cloneNode(!0).lastChild.checked,b.innerHTML="",o.noCloneChecked=!!b.cloneNode(!0).lastChild.defaultValue}();var ra=d.documentElement,sa=/^key/,ta=/^(?:mouse|pointer|contextmenu|drag|drop)|click/,ua=/^([^.]*)(?:\.(.+)|)/;function va(){return!0}function wa(){return!1}function xa(){try{return d.activeElement}catch(a){}}function ya(a,b,c,d,e,f){var g,h;if("object"==typeof b){"string"!=typeof c&&(d=d||c,c=void 0);for(h in b)ya(a,h,c,d,b[h],f);return a}if(null==d&&null==e?(e=c,d=c=void 0):null==e&&("string"==typeof c?(e=d,d=void 0):(e=d,d=c,c=void 0)),e===!1)e=wa;else if(!e)return a;return 1===f&&(g=e,e=function(a){return r().off(a),g.apply(this,arguments)},e.guid=g.guid||(g.guid=r.guid++)),a.each(function(){r.event.add(this,b,e,d,c)})}r.event={global:{},add:function(a,b,c,d,e){var f,g,h,i,j,k,l,m,n,o,p,q=W.get(a);if(q){c.handler&&(f=c,c=f.handler,e=f.selector),e&&r.find.matchesSelector(ra,e),c.guid||(c.guid=r.guid++),(i=q.events)||(i=q.events={}),(g=q.handle)||(g=q.handle=function(b){return"undefined"!=typeof r&&r.event.triggered!==b.type?r.event.dispatch.apply(a,arguments):void 0}),b=(b||"").match(L)||[""],j=b.length;while(j--)h=ua.exec(b[j])||[],n=p=h[1],o=(h[2]||"").split(".").sort(),n&&(l=r.event.special[n]||{},n=(e?l.delegateType:l.bindType)||n,l=r.event.special[n]||{},k=r.extend({type:n,origType:p,data:d,handler:c,guid:c.guid,selector:e,needsContext:e&&r.expr.match.needsContext.test(e),namespace:o.join(".")},f),(m=i[n])||(m=i[n]=[],m.delegateCount=0,l.setup&&l.setup.call(a,d,o,g)!==!1||a.addEventListener&&a.addEventListener(n,g)),l.add&&(l.add.call(a,k),k.handler.guid||(k.handler.guid=c.guid)),e?m.splice(m.delegateCount++,0,k):m.push(k),r.event.global[n]=!0)}},remove:function(a,b,c,d,e){var f,g,h,i,j,k,l,m,n,o,p,q=W.hasData(a)&&W.get(a);if(q&&(i=q.events)){b=(b||"").match(L)||[""],j=b.length;while(j--)if(h=ua.exec(b[j])||[],n=p=h[1],o=(h[2]||"").split(".").sort(),n){l=r.event.special[n]||{},n=(d?l.delegateType:l.bindType)||n,m=i[n]||[],h=h[2]&&new RegExp("(^|\\.)"+o.join("\\.(?:.*\\.|)")+"(\\.|$)"),g=f=m.length;while(f--)k=m[f],!e&&p!==k.origType||c&&c.guid!==k.guid||h&&!h.test(k.namespace)||d&&d!==k.selector&&("**"!==d||!k.selector)||(m.splice(f,1),k.selector&&m.delegateCount--,l.remove&&l.remove.call(a,k));g&&!m.length&&(l.teardown&&l.teardown.call(a,o,q.handle)!==!1||r.removeEvent(a,n,q.handle),delete i[n])}else for(n in i)r.event.remove(a,n+b[j],c,d,!0);r.isEmptyObject(i)&&W.remove(a,"handle events")}},dispatch:function(a){var b=r.event.fix(a),c,d,e,f,g,h,i=new 
Array(arguments.length),j=(W.get(this,"events")||{})[b.type]||[],k=r.event.special[b.type]||{};for(i[0]=b,c=1;c=1))for(;j!==this;j=j.parentNode||this)if(1===j.nodeType&&("click"!==a.type||j.disabled!==!0)){for(f=[],g={},c=0;c-1:r.find(e,this,null,[j]).length),g[e]&&f.push(d);f.length&&h.push({elem:j,handlers:f})}return j=this,i\x20\t\r\n\f]*)[^>]*)\/>/gi,Aa=/\s*$/g;function Ea(a,b){return B(a,"table")&&B(11!==b.nodeType?b:b.firstChild,"tr")?r(">tbody",a)[0]||a:a}function Fa(a){return a.type=(null!==a.getAttribute("type"))+"/"+a.type,a}function Ga(a){var b=Ca.exec(a.type);return b?a.type=b[1]:a.removeAttribute("type"),a}function Ha(a,b){var c,d,e,f,g,h,i,j;if(1===b.nodeType){if(W.hasData(a)&&(f=W.access(a),g=W.set(b,f),j=f.events)){delete g.handle,g.events={};for(e in j)for(c=0,d=j[e].length;c1&&"string"==typeof q&&!o.checkClone&&Ba.test(q))return a.each(function(e){var f=a.eq(e);s&&(b[0]=q.call(this,e,f.html())),Ja(f,b,c,d)});if(m&&(e=qa(b,a[0].ownerDocument,!1,a,d),f=e.firstChild,1===e.childNodes.length&&(e=f),f||d)){for(h=r.map(na(e,"script"),Fa),i=h.length;l")},clone:function(a,b,c){var d,e,f,g,h=a.cloneNode(!0),i=r.contains(a.ownerDocument,a);if(!(o.noCloneChecked||1!==a.nodeType&&11!==a.nodeType||r.isXMLDoc(a)))for(g=na(h),f=na(a),d=0,e=f.length;d0&&oa(g,!i&&na(a,"script")),h},cleanData:function(a){for(var b,c,d,e=r.event.special,f=0;void 0!==(c=a[f]);f++)if(U(c)){if(b=c[W.expando]){if(b.events)for(d in b.events)e[d]?r.event.remove(c,d):r.removeEvent(c,d,b.handle);c[W.expando]=void 0}c[X.expando]&&(c[X.expando]=void 0)}}}),r.fn.extend({detach:function(a){return Ka(this,a,!0)},remove:function(a){return Ka(this,a)},text:function(a){return T(this,function(a){return void 0===a?r.text(this):this.empty().each(function(){1!==this.nodeType&&11!==this.nodeType&&9!==this.nodeType||(this.textContent=a)})},null,a,arguments.length)},append:function(){return Ja(this,arguments,function(a){if(1===this.nodeType||11===this.nodeType||9===this.nodeType){var b=Ea(this,a);b.appendChild(a)}})},prepend:function(){return Ja(this,arguments,function(a){if(1===this.nodeType||11===this.nodeType||9===this.nodeType){var b=Ea(this,a);b.insertBefore(a,b.firstChild)}})},before:function(){return Ja(this,arguments,function(a){this.parentNode&&this.parentNode.insertBefore(a,this)})},after:function(){return Ja(this,arguments,function(a){this.parentNode&&this.parentNode.insertBefore(a,this.nextSibling)})},empty:function(){for(var a,b=0;null!=(a=this[b]);b++)1===a.nodeType&&(r.cleanData(na(a,!1)),a.textContent="");return this},clone:function(a,b){return a=null!=a&&a,b=null==b?a:b,this.map(function(){return r.clone(this,a,b)})},html:function(a){return T(this,function(a){var b=this[0]||{},c=0,d=this.length;if(void 0===a&&1===b.nodeType)return b.innerHTML;if("string"==typeof a&&!Aa.test(a)&&!ma[(ka.exec(a)||["",""])[1].toLowerCase()]){a=r.htmlPrefilter(a);try{for(;c1)}});function _a(a,b,c,d,e){return new _a.prototype.init(a,b,c,d,e)}r.Tween=_a,_a.prototype={constructor:_a,init:function(a,b,c,d,e,f){this.elem=a,this.prop=c,this.easing=e||r.easing._default,this.options=b,this.start=this.now=this.cur(),this.end=d,this.unit=f||(r.cssNumber[c]?"":"px")},cur:function(){var a=_a.propHooks[this.prop];return a&&a.get?a.get(this):_a.propHooks._default.get(this)},run:function(a){var b,c=_a.propHooks[this.prop];return 
this.options.duration?this.pos=b=r.easing[this.easing](a,this.options.duration*a,0,1,this.options.duration):this.pos=b=a,this.now=(this.end-this.start)*b+this.start,this.options.step&&this.options.step.call(this.elem,this.now,this),c&&c.set?c.set(this):_a.propHooks._default.set(this),this}},_a.prototype.init.prototype=_a.prototype,_a.propHooks={_default:{get:function(a){var b;return 1!==a.elem.nodeType||null!=a.elem[a.prop]&&null==a.elem.style[a.prop]?a.elem[a.prop]:(b=r.css(a.elem,a.prop,""),b&&"auto"!==b?b:0)},set:function(a){r.fx.step[a.prop]?r.fx.step[a.prop](a):1!==a.elem.nodeType||null==a.elem.style[r.cssProps[a.prop]]&&!r.cssHooks[a.prop]?a.elem[a.prop]=a.now:r.style(a.elem,a.prop,a.now+a.unit)}}},_a.propHooks.scrollTop=_a.propHooks.scrollLeft={set:function(a){a.elem.nodeType&&a.elem.parentNode&&(a.elem[a.prop]=a.now)}},r.easing={linear:function(a){return a},swing:function(a){return.5-Math.cos(a*Math.PI)/2},_default:"swing"},r.fx=_a.prototype.init,r.fx.step={};var ab,bb,cb=/^(?:toggle|show|hide)$/,db=/queueHooks$/;function eb(){bb&&(d.hidden===!1&&a.requestAnimationFrame?a.requestAnimationFrame(eb):a.setTimeout(eb,r.fx.interval),r.fx.tick())}function fb(){return a.setTimeout(function(){ab=void 0}),ab=r.now()}function gb(a,b){var c,d=0,e={height:a};for(b=b?1:0;d<4;d+=2-b)c=ca[d],e["margin"+c]=e["padding"+c]=a;return b&&(e.opacity=e.width=a),e}function hb(a,b,c){for(var d,e=(kb.tweeners[b]||[]).concat(kb.tweeners["*"]),f=0,g=e.length;f1)},removeAttr:function(a){return this.each(function(){r.removeAttr(this,a)})}}),r.extend({attr:function(a,b,c){var d,e,f=a.nodeType;if(3!==f&&8!==f&&2!==f)return"undefined"==typeof a.getAttribute?r.prop(a,b,c):(1===f&&r.isXMLDoc(a)||(e=r.attrHooks[b.toLowerCase()]||(r.expr.match.bool.test(b)?lb:void 0)),void 0!==c?null===c?void r.removeAttr(a,b):e&&"set"in e&&void 0!==(d=e.set(a,c,b))?d:(a.setAttribute(b,c+""),c):e&&"get"in e&&null!==(d=e.get(a,b))?d:(d=r.find.attr(a,b), -null==d?void 0:d))},attrHooks:{type:{set:function(a,b){if(!o.radioValue&&"radio"===b&&B(a,"input")){var c=a.value;return a.setAttribute("type",b),c&&(a.value=c),b}}}},removeAttr:function(a,b){var c,d=0,e=b&&b.match(L);if(e&&1===a.nodeType)while(c=e[d++])a.removeAttribute(c)}}),lb={set:function(a,b,c){return b===!1?r.removeAttr(a,c):a.setAttribute(c,c),c}},r.each(r.expr.match.bool.source.match(/\w+/g),function(a,b){var c=mb[b]||r.find.attr;mb[b]=function(a,b,d){var e,f,g=b.toLowerCase();return d||(f=mb[g],mb[g]=e,e=null!=c(a,b,d)?g:null,mb[g]=f),e}});var nb=/^(?:input|select|textarea|button)$/i,ob=/^(?:a|area)$/i;r.fn.extend({prop:function(a,b){return T(this,r.prop,a,b,arguments.length>1)},removeProp:function(a){return this.each(function(){delete this[r.propFix[a]||a]})}}),r.extend({prop:function(a,b,c){var d,e,f=a.nodeType;if(3!==f&&8!==f&&2!==f)return 1===f&&r.isXMLDoc(a)||(b=r.propFix[b]||b,e=r.propHooks[b]),void 0!==c?e&&"set"in e&&void 0!==(d=e.set(a,c,b))?d:a[b]=c:e&&"get"in e&&null!==(d=e.get(a,b))?d:a[b]},propHooks:{tabIndex:{get:function(a){var b=r.find.attr(a,"tabindex");return b?parseInt(b,10):nb.test(a.nodeName)||ob.test(a.nodeName)&&a.href?0:-1}}},propFix:{"for":"htmlFor","class":"className"}}),o.optSelected||(r.propHooks.selected={get:function(a){var b=a.parentNode;return b&&b.parentNode&&b.parentNode.selectedIndex,null},set:function(a){var 
b=a.parentNode;b&&(b.selectedIndex,b.parentNode&&b.parentNode.selectedIndex)}}),r.each(["tabIndex","readOnly","maxLength","cellSpacing","cellPadding","rowSpan","colSpan","useMap","frameBorder","contentEditable"],function(){r.propFix[this.toLowerCase()]=this});function pb(a){var b=a.match(L)||[];return b.join(" ")}function qb(a){return a.getAttribute&&a.getAttribute("class")||""}r.fn.extend({addClass:function(a){var b,c,d,e,f,g,h,i=0;if(r.isFunction(a))return this.each(function(b){r(this).addClass(a.call(this,b,qb(this)))});if("string"==typeof a&&a){b=a.match(L)||[];while(c=this[i++])if(e=qb(c),d=1===c.nodeType&&" "+pb(e)+" "){g=0;while(f=b[g++])d.indexOf(" "+f+" ")<0&&(d+=f+" ");h=pb(d),e!==h&&c.setAttribute("class",h)}}return this},removeClass:function(a){var b,c,d,e,f,g,h,i=0;if(r.isFunction(a))return this.each(function(b){r(this).removeClass(a.call(this,b,qb(this)))});if(!arguments.length)return this.attr("class","");if("string"==typeof a&&a){b=a.match(L)||[];while(c=this[i++])if(e=qb(c),d=1===c.nodeType&&" "+pb(e)+" "){g=0;while(f=b[g++])while(d.indexOf(" "+f+" ")>-1)d=d.replace(" "+f+" "," ");h=pb(d),e!==h&&c.setAttribute("class",h)}}return this},toggleClass:function(a,b){var c=typeof a;return"boolean"==typeof b&&"string"===c?b?this.addClass(a):this.removeClass(a):r.isFunction(a)?this.each(function(c){r(this).toggleClass(a.call(this,c,qb(this),b),b)}):this.each(function(){var b,d,e,f;if("string"===c){d=0,e=r(this),f=a.match(L)||[];while(b=f[d++])e.hasClass(b)?e.removeClass(b):e.addClass(b)}else void 0!==a&&"boolean"!==c||(b=qb(this),b&&W.set(this,"__className__",b),this.setAttribute&&this.setAttribute("class",b||a===!1?"":W.get(this,"__className__")||""))})},hasClass:function(a){var b,c,d=0;b=" "+a+" ";while(c=this[d++])if(1===c.nodeType&&(" "+pb(qb(c))+" ").indexOf(b)>-1)return!0;return!1}});var rb=/\r/g;r.fn.extend({val:function(a){var b,c,d,e=this[0];{if(arguments.length)return d=r.isFunction(a),this.each(function(c){var e;1===this.nodeType&&(e=d?a.call(this,c,r(this).val()):a,null==e?e="":"number"==typeof e?e+="":Array.isArray(e)&&(e=r.map(e,function(a){return null==a?"":a+""})),b=r.valHooks[this.type]||r.valHooks[this.nodeName.toLowerCase()],b&&"set"in b&&void 0!==b.set(this,e,"value")||(this.value=e))});if(e)return b=r.valHooks[e.type]||r.valHooks[e.nodeName.toLowerCase()],b&&"get"in b&&void 0!==(c=b.get(e,"value"))?c:(c=e.value,"string"==typeof c?c.replace(rb,""):null==c?"":c)}}}),r.extend({valHooks:{option:{get:function(a){var b=r.find.attr(a,"value");return null!=b?b:pb(r.text(a))}},select:{get:function(a){var b,c,d,e=a.options,f=a.selectedIndex,g="select-one"===a.type,h=g?null:[],i=g?f+1:e.length;for(d=f<0?i:g?f:0;d-1)&&(c=!0);return c||(a.selectedIndex=-1),f}}}}),r.each(["radio","checkbox"],function(){r.valHooks[this]={set:function(a,b){if(Array.isArray(b))return a.checked=r.inArray(r(a).val(),b)>-1}},o.checkOn||(r.valHooks[this].get=function(a){return null===a.getAttribute("value")?"on":a.value})});var sb=/^(?:focusinfocus|focusoutblur)$/;r.extend(r.event,{trigger:function(b,c,e,f){var g,h,i,j,k,m,n,o=[e||d],p=l.call(b,"type")?b.type:b,q=l.call(b,"namespace")?b.namespace.split("."):[];if(h=i=e=e||d,3!==e.nodeType&&8!==e.nodeType&&!sb.test(p+r.event.triggered)&&(p.indexOf(".")>-1&&(q=p.split("."),p=q.shift(),q.sort()),k=p.indexOf(":")<0&&"on"+p,b=b[r.expando]?b:new r.Event(p,"object"==typeof b&&b),b.isTrigger=f?2:3,b.namespace=q.join("."),b.rnamespace=b.namespace?new RegExp("(^|\\.)"+q.join("\\.(?:.*\\.|)")+"(\\.|$)"):null,b.result=void 
0,b.target||(b.target=e),c=null==c?[b]:r.makeArray(c,[b]),n=r.event.special[p]||{},f||!n.trigger||n.trigger.apply(e,c)!==!1)){if(!f&&!n.noBubble&&!r.isWindow(e)){for(j=n.delegateType||p,sb.test(j+p)||(h=h.parentNode);h;h=h.parentNode)o.push(h),i=h;i===(e.ownerDocument||d)&&o.push(i.defaultView||i.parentWindow||a)}g=0;while((h=o[g++])&&!b.isPropagationStopped())b.type=g>1?j:n.bindType||p,m=(W.get(h,"events")||{})[b.type]&&W.get(h,"handle"),m&&m.apply(h,c),m=k&&h[k],m&&m.apply&&U(h)&&(b.result=m.apply(h,c),b.result===!1&&b.preventDefault());return b.type=p,f||b.isDefaultPrevented()||n._default&&n._default.apply(o.pop(),c)!==!1||!U(e)||k&&r.isFunction(e[p])&&!r.isWindow(e)&&(i=e[k],i&&(e[k]=null),r.event.triggered=p,e[p](),r.event.triggered=void 0,i&&(e[k]=i)),b.result}},simulate:function(a,b,c){var d=r.extend(new r.Event,c,{type:a,isSimulated:!0});r.event.trigger(d,null,b)}}),r.fn.extend({trigger:function(a,b){return this.each(function(){r.event.trigger(a,b,this)})},triggerHandler:function(a,b){var c=this[0];if(c)return r.event.trigger(a,b,c,!0)}}),r.each("blur focus focusin focusout resize scroll click dblclick mousedown mouseup mousemove mouseover mouseout mouseenter mouseleave change select submit keydown keypress keyup contextmenu".split(" "),function(a,b){r.fn[b]=function(a,c){return arguments.length>0?this.on(b,null,a,c):this.trigger(b)}}),r.fn.extend({hover:function(a,b){return this.mouseenter(a).mouseleave(b||a)}}),o.focusin="onfocusin"in a,o.focusin||r.each({focus:"focusin",blur:"focusout"},function(a,b){var c=function(a){r.event.simulate(b,a.target,r.event.fix(a))};r.event.special[b]={setup:function(){var d=this.ownerDocument||this,e=W.access(d,b);e||d.addEventListener(a,c,!0),W.access(d,b,(e||0)+1)},teardown:function(){var d=this.ownerDocument||this,e=W.access(d,b)-1;e?W.access(d,b,e):(d.removeEventListener(a,c,!0),W.remove(d,b))}}});var tb=a.location,ub=r.now(),vb=/\?/;r.parseXML=function(b){var c;if(!b||"string"!=typeof b)return null;try{c=(new a.DOMParser).parseFromString(b,"text/xml")}catch(d){c=void 0}return c&&!c.getElementsByTagName("parsererror").length||r.error("Invalid XML: "+b),c};var wb=/\[\]$/,xb=/\r?\n/g,yb=/^(?:submit|button|image|reset|file)$/i,zb=/^(?:input|select|textarea|keygen)/i;function Ab(a,b,c,d){var e;if(Array.isArray(b))r.each(b,function(b,e){c||wb.test(a)?d(a,e):Ab(a+"["+("object"==typeof e&&null!=e?b:"")+"]",e,c,d)});else if(c||"object"!==r.type(b))d(a,b);else for(e in b)Ab(a+"["+e+"]",b[e],c,d)}r.param=function(a,b){var c,d=[],e=function(a,b){var c=r.isFunction(b)?b():b;d[d.length]=encodeURIComponent(a)+"="+encodeURIComponent(null==c?"":c)};if(Array.isArray(a)||a.jquery&&!r.isPlainObject(a))r.each(a,function(){e(this.name,this.value)});else for(c in a)Ab(c,a[c],b,e);return d.join("&")},r.fn.extend({serialize:function(){return r.param(this.serializeArray())},serializeArray:function(){return this.map(function(){var a=r.prop(this,"elements");return a?r.makeArray(a):this}).filter(function(){var a=this.type;return this.name&&!r(this).is(":disabled")&&zb.test(this.nodeName)&&!yb.test(a)&&(this.checked||!ja.test(a))}).map(function(a,b){var c=r(this).val();return null==c?null:Array.isArray(c)?r.map(c,function(a){return{name:b.name,value:a.replace(xb,"\r\n")}}):{name:b.name,value:c.replace(xb,"\r\n")}}).get()}});var Bb=/%20/g,Cb=/#.*$/,Db=/([?&])_=[^&]*/,Eb=/^(.*?):[ \t]*([^\r\n]*)$/gm,Fb=/^(?:about|app|app-storage|.+-extension|file|res|widget):$/,Gb=/^(?:GET|HEAD)$/,Hb=/^\/\//,Ib={},Jb={},Kb="*/".concat("*"),Lb=d.createElement("a");Lb.href=tb.href;function 
Mb(a){return function(b,c){"string"!=typeof b&&(c=b,b="*");var d,e=0,f=b.toLowerCase().match(L)||[];if(r.isFunction(c))while(d=f[e++])"+"===d[0]?(d=d.slice(1)||"*",(a[d]=a[d]||[]).unshift(c)):(a[d]=a[d]||[]).push(c)}}function Nb(a,b,c,d){var e={},f=a===Jb;function g(h){var i;return e[h]=!0,r.each(a[h]||[],function(a,h){var j=h(b,c,d);return"string"!=typeof j||f||e[j]?f?!(i=j):void 0:(b.dataTypes.unshift(j),g(j),!1)}),i}return g(b.dataTypes[0])||!e["*"]&&g("*")}function Ob(a,b){var c,d,e=r.ajaxSettings.flatOptions||{};for(c in b)void 0!==b[c]&&((e[c]?a:d||(d={}))[c]=b[c]);return d&&r.extend(!0,a,d),a}function Pb(a,b,c){var d,e,f,g,h=a.contents,i=a.dataTypes;while("*"===i[0])i.shift(),void 0===d&&(d=a.mimeType||b.getResponseHeader("Content-Type"));if(d)for(e in h)if(h[e]&&h[e].test(d)){i.unshift(e);break}if(i[0]in c)f=i[0];else{for(e in c){if(!i[0]||a.converters[e+" "+i[0]]){f=e;break}g||(g=e)}f=f||g}if(f)return f!==i[0]&&i.unshift(f),c[f]}function Qb(a,b,c,d){var e,f,g,h,i,j={},k=a.dataTypes.slice();if(k[1])for(g in a.converters)j[g.toLowerCase()]=a.converters[g];f=k.shift();while(f)if(a.responseFields[f]&&(c[a.responseFields[f]]=b),!i&&d&&a.dataFilter&&(b=a.dataFilter(b,a.dataType)),i=f,f=k.shift())if("*"===f)f=i;else if("*"!==i&&i!==f){if(g=j[i+" "+f]||j["* "+f],!g)for(e in j)if(h=e.split(" "),h[1]===f&&(g=j[i+" "+h[0]]||j["* "+h[0]])){g===!0?g=j[e]:j[e]!==!0&&(f=h[0],k.unshift(h[1]));break}if(g!==!0)if(g&&a["throws"])b=g(b);else try{b=g(b)}catch(l){return{state:"parsererror",error:g?l:"No conversion from "+i+" to "+f}}}return{state:"success",data:b}}r.extend({active:0,lastModified:{},etag:{},ajaxSettings:{url:tb.href,type:"GET",isLocal:Fb.test(tb.protocol),global:!0,processData:!0,async:!0,contentType:"application/x-www-form-urlencoded; charset=UTF-8",accepts:{"*":Kb,text:"text/plain",html:"text/html",xml:"application/xml, text/xml",json:"application/json, text/javascript"},contents:{xml:/\bxml\b/,html:/\bhtml/,json:/\bjson\b/},responseFields:{xml:"responseXML",text:"responseText",json:"responseJSON"},converters:{"* text":String,"text html":!0,"text json":JSON.parse,"text xml":r.parseXML},flatOptions:{url:!0,context:!0}},ajaxSetup:function(a,b){return b?Ob(Ob(a,r.ajaxSettings),b):Ob(r.ajaxSettings,a)},ajaxPrefilter:Mb(Ib),ajaxTransport:Mb(Jb),ajax:function(b,c){"object"==typeof b&&(c=b,b=void 0),c=c||{};var e,f,g,h,i,j,k,l,m,n,o=r.ajaxSetup({},c),p=o.context||o,q=o.context&&(p.nodeType||p.jquery)?r(p):r.event,s=r.Deferred(),t=r.Callbacks("once memory"),u=o.statusCode||{},v={},w={},x="canceled",y={readyState:0,getResponseHeader:function(a){var b;if(k){if(!h){h={};while(b=Eb.exec(g))h[b[1].toLowerCase()]=b[2]}b=h[a.toLowerCase()]}return null==b?null:b},getAllResponseHeaders:function(){return k?g:null},setRequestHeader:function(a,b){return null==k&&(a=w[a.toLowerCase()]=w[a.toLowerCase()]||a,v[a]=b),this},overrideMimeType:function(a){return null==k&&(o.mimeType=a),this},statusCode:function(a){var b;if(a)if(k)y.always(a[y.status]);else for(b in a)u[b]=[u[b],a[b]];return this},abort:function(a){var b=a||x;return e&&e.abort(b),A(0,b),this}};if(s.promise(y),o.url=((b||o.url||tb.href)+"").replace(Hb,tb.protocol+"//"),o.type=c.method||c.type||o.method||o.type,o.dataTypes=(o.dataType||"*").toLowerCase().match(L)||[""],null==o.crossDomain){j=d.createElement("a");try{j.href=o.url,j.href=j.href,o.crossDomain=Lb.protocol+"//"+Lb.host!=j.protocol+"//"+j.host}catch(z){o.crossDomain=!0}}if(o.data&&o.processData&&"string"!=typeof o.data&&(o.data=r.param(o.data,o.traditional)),Nb(Ib,o,c,y),k)return 
[Collapsed diff hunks for a streaming-server web demo page: the affected file bundles a vendored, minified jQuery build (garbled and truncated here) together with the page markup. The recoverable page text: a "PaddleSpeech Serving简介" (introduction) section stating that PaddleSpeech is an open-source model library for speech and audio tasks on PaddlePaddle, and that PaddleSpeech Serving is a Python + FastAPI based client/server backend that unifies the PaddleSpeech speech operators behind a single service; and a "产品体验" (product demo) panel with a WebSocket URL field, a notice that recognition stops automatically after a set number of seconds, and an area where recognition results are displayed.]
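For orientation, the page collapsed above is only a thin browser front end to the streaming WebSocket endpoint exposed by `paddlespeech_server`. Below is a minimal, hypothetical Python sketch of the same client pattern (open a WebSocket, push audio chunks, print whatever partial results come back). The endpoint path, chunk size, and message framing are illustrative assumptions, not taken from this diff; the authoritative message format is the PaddleSpeech Streaming Server WebSocket API wiki page referenced in the READMEs that follow.

```python
# Hypothetical sketch only: the endpoint path, chunk size, and message framing
# below are assumptions for illustration; consult the PaddleSpeech Streaming
# Server WebSocket API wiki for the real protocol.
import asyncio
import websockets


async def stream_audio(uri: str, wav_path: str, chunk_bytes: int = 3200) -> None:
    async with websockets.connect(uri) as ws:
        with open(wav_path, "rb") as f:
            while chunk := f.read(chunk_bytes):
                await ws.send(chunk)    # push raw audio to the server
                print(await ws.recv())  # print whatever partial result comes back


if __name__ == "__main__":
    # Hypothetical local endpoint; replace with the "WebSocket URL" the demo page asks for.
    asyncio.run(stream_audio("ws://127.0.0.1:8090/paddlespeech/asr/streaming", "./zh.wav"))
```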
- - - - diff --git a/demos/streaming_tts_server/README.md b/demos/streaming_tts_server/README.md index 860d9a978..15448a46f 100644 --- a/demos/streaming_tts_server/README.md +++ b/demos/streaming_tts_server/README.md @@ -5,14 +5,19 @@ ## Introduction This demo is an implementation of starting the streaming speech synthesis service and accessing the service. It can be achieved with a single command using `paddlespeech_server` and `paddlespeech_client` or a few lines of code in python. +For service interface definition, please check: +- [PaddleSpeech Server RESTful API](https://github.com/PaddlePaddle/PaddleSpeech/wiki/PaddleSpeech-Server-RESTful-API) +- [PaddleSpeech Streaming Server WebSocket API](https://github.com/PaddlePaddle/PaddleSpeech/wiki/PaddleSpeech-Server-WebSocket-API) ## Usage ### 1. Installation see [installation](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install.md). -It is recommended to use **paddlepaddle 2.2.2** or above. -You can choose one way from meduim and hard to install paddlespeech. +It is recommended to use **paddlepaddle 2.3.1** or above. +You can choose one way from easy, meduim and hard to install paddlespeech. + +**If you install in easy mode, you need to prepare the yaml file by yourself, you can refer to the yaml file in the conf directory.** ### 2. Prepare config File The configuration file can be found in `conf/tts_online_application.yaml`. @@ -28,11 +33,10 @@ The configuration file can be found in `conf/tts_online_application.yaml`. - Both hifigan and mb_melgan support streaming voc inference. - When the voc model is mb_melgan, when voc_pad=14, the synthetic audio for streaming inference is consistent with the non-streaming synthetic audio; the minimum voc_pad can be set to 7, and the synthetic audio has no abnormal hearing. If the voc_pad is less than 7, the synthetic audio sounds abnormal. - When the voc model is hifigan, when voc_pad=19, the streaming inference synthetic audio is consistent with the non-streaming synthetic audio; when voc_pad=14, the synthetic audio has no abnormal hearing. + - Pad calculation method of streaming vocoder in PaddleSpeech: [AIStudio tutorial](https://aistudio.baidu.com/aistudio/projectdetail/4151335) - Inference speed: mb_melgan > hifigan; Audio quality: mb_melgan < hifigan - **Note:** If the service can be started normally in the container, but the client access IP is unreachable, you can try to replace the `host` address in the configuration file with the local IP address. - - ### 3. Streaming speech synthesis server and client using http protocol #### 3.1 Server Usage - Command Line (Recommended) @@ -52,7 +56,7 @@ The configuration file can be found in `conf/tts_online_application.yaml`. - `log_file`: log file. Default: ./log/paddlespeech.log Output: - ```bash + ```text [2022-04-24 20:05:27,887] [ INFO] - The first response time of the 0 warm up: 1.0123658180236816 s [2022-04-24 20:05:28,038] [ INFO] - The first response time of the 1 warm up: 0.15108466148376465 s [2022-04-24 20:05:28,191] [ INFO] - The first response time of the 2 warm up: 0.15317344665527344 s @@ -78,8 +82,8 @@ The configuration file can be found in `conf/tts_online_application.yaml`. 
log_file="./log/paddlespeech.log") ``` - Output: - ```bash + Output: + ```text [2022-04-24 21:00:16,934] [ INFO] - The first response time of the 0 warm up: 1.268730878829956 s [2022-04-24 21:00:17,046] [ INFO] - The first response time of the 1 warm up: 0.11168622970581055 s [2022-04-24 21:00:17,151] [ INFO] - The first response time of the 2 warm up: 0.10413002967834473 s @@ -92,8 +96,6 @@ The configuration file can be found in `conf/tts_online_application.yaml`. [2022-04-24 21:00:17] [INFO] [on.py:59] Application startup complete. INFO: Uvicorn running on http://0.0.0.0:8092 (Press CTRL+C to quit) [2022-04-24 21:00:17] [INFO] [server.py:211] Uvicorn running on http://0.0.0.0:8092 (Press CTRL+C to quit) - - ``` #### 3.2 Streaming TTS client Usage @@ -119,15 +121,12 @@ The configuration file can be found in `conf/tts_online_application.yaml`. - `protocol`: Service protocol, choices: [http, websocket], default: http. - `input`: (required): Input text to generate. - `spk_id`: Speaker id for multi-speaker text to speech. Default: 0 - - `speed`: Audio speed, the value should be set between 0 and 3. Default: 1.0 - - `volume`: Audio volume, the value should be set between 0 and 3. Default: 1.0 - - `sample_rate`: Sampling rate, choices: [0, 8000, 16000], the default is the same as the model. Default: 0 - - `output`: Output wave filepath. Default: None, which means not to save the audio to the local. + - `output`: Client output wave filepath. Default: None, which means not to save the audio to the local. - `play`: Whether to play audio, play while synthesizing, default value: False, which means not playing. **Playing audio needs to rely on the pyaudio library**. - - `spk_id, speed, volume, sample_rate` do not take effect in streaming speech synthesis service temporarily. + - Currently, only the single-speaker model is supported in the code, so `spk_id` does not take effect. Streaming TTS does not support changing sample rate, variable speed and volume. Output: - ```bash + ```text [2022-04-24 21:08:18,559] [ INFO] - tts http client start [2022-04-24 21:08:21,702] [ INFO] - 句子:您好,欢迎使用百度飞桨语音合成服务。 [2022-04-24 21:08:21,703] [ INFO] - 首包响应:0.18863153457641602 s @@ -150,16 +149,13 @@ The configuration file can be found in `conf/tts_online_application.yaml`. port=8092, protocol="http", spk_id=0, - speed=1.0, - volume=1.0, - sample_rate=0, output="./output.wav", play=False) ``` Output: - ```bash + ```text [2022-04-24 21:11:13,798] [ INFO] - tts http client start [2022-04-24 21:11:16,800] [ INFO] - 句子:您好,欢迎使用百度飞桨语音合成服务。 [2022-04-24 21:11:16,801] [ INFO] - 首包响应:0.18234872817993164 s @@ -169,7 +165,6 @@ The configuration file can be found in `conf/tts_online_application.yaml`. [2022-04-24 21:11:16,837] [ INFO] - 音频保存至:./output.wav ``` - ### 4. Streaming speech synthesis server and client using websocket protocol #### 4.1 Server Usage - Command Line (Recommended) @@ -189,21 +184,19 @@ The configuration file can be found in `conf/tts_online_application.yaml`. - `log_file`: log file. 
Default: ./log/paddlespeech.log Output: - ```bash - [2022-04-27 10:18:09,107] [ INFO] - The first response time of the 0 warm up: 1.1551103591918945 s - [2022-04-27 10:18:09,219] [ INFO] - The first response time of the 1 warm up: 0.11204338073730469 s - [2022-04-27 10:18:09,324] [ INFO] - The first response time of the 2 warm up: 0.1051797866821289 s - [2022-04-27 10:18:09,325] [ INFO] - ********************************************************************** - INFO: Started server process [17600] - [2022-04-27 10:18:09] [INFO] [server.py:75] Started server process [17600] - INFO: Waiting for application startup. - [2022-04-27 10:18:09] [INFO] [on.py:45] Waiting for application startup. - INFO: Application startup complete. - [2022-04-27 10:18:09] [INFO] [on.py:59] Application startup complete. - INFO: Uvicorn running on http://0.0.0.0:8092 (Press CTRL+C to quit) - [2022-04-27 10:18:09] [INFO] [server.py:211] Uvicorn running on http://0.0.0.0:8092 (Press CTRL+C to quit) - - + ```text + [2022-04-27 10:18:09,107] [ INFO] - The first response time of the 0 warm up: 1.1551103591918945 s + [2022-04-27 10:18:09,219] [ INFO] - The first response time of the 1 warm up: 0.11204338073730469 s + [2022-04-27 10:18:09,324] [ INFO] - The first response time of the 2 warm up: 0.1051797866821289 s + [2022-04-27 10:18:09,325] [ INFO] - ********************************************************************** + INFO: Started server process [17600] + [2022-04-27 10:18:09] [INFO] [server.py:75] Started server process [17600] + INFO: Waiting for application startup. + [2022-04-27 10:18:09] [INFO] [on.py:45] Waiting for application startup. + INFO: Application startup complete. + [2022-04-27 10:18:09] [INFO] [on.py:59] Application startup complete. + INFO: Uvicorn running on http://0.0.0.0:8092 (Press CTRL+C to quit) + [2022-04-27 10:18:09] [INFO] [server.py:211] Uvicorn running on http://0.0.0.0:8092 (Press CTRL+C to quit) ``` - Python API @@ -217,20 +210,19 @@ The configuration file can be found in `conf/tts_online_application.yaml`. ``` Output: - ```bash - [2022-04-27 10:20:16,660] [ INFO] - The first response time of the 0 warm up: 1.0945196151733398 s - [2022-04-27 10:20:16,773] [ INFO] - The first response time of the 1 warm up: 0.11222052574157715 s - [2022-04-27 10:20:16,878] [ INFO] - The first response time of the 2 warm up: 0.10494542121887207 s - [2022-04-27 10:20:16,878] [ INFO] - ********************************************************************** - INFO: Started server process [23466] - [2022-04-27 10:20:16] [INFO] [server.py:75] Started server process [23466] - INFO: Waiting for application startup. - [2022-04-27 10:20:16] [INFO] [on.py:45] Waiting for application startup. - INFO: Application startup complete. - [2022-04-27 10:20:16] [INFO] [on.py:59] Application startup complete. 
- INFO: Uvicorn running on http://0.0.0.0:8092 (Press CTRL+C to quit) - [2022-04-27 10:20:16] [INFO] [server.py:211] Uvicorn running on http://0.0.0.0:8092 (Press CTRL+C to quit) - + ```text + [2022-04-27 10:20:16,660] [ INFO] - The first response time of the 0 warm up: 1.0945196151733398 s + [2022-04-27 10:20:16,773] [ INFO] - The first response time of the 1 warm up: 0.11222052574157715 s + [2022-04-27 10:20:16,878] [ INFO] - The first response time of the 2 warm up: 0.10494542121887207 s + [2022-04-27 10:20:16,878] [ INFO] - ********************************************************************** + INFO: Started server process [23466] + [2022-04-27 10:20:16] [INFO] [server.py:75] Started server process [23466] + INFO: Waiting for application startup. + [2022-04-27 10:20:16] [INFO] [on.py:45] Waiting for application startup. + INFO: Application startup complete. + [2022-04-27 10:20:16] [INFO] [on.py:59] Application startup complete. + INFO: Uvicorn running on http://0.0.0.0:8092 (Press CTRL+C to quit) + [2022-04-27 10:20:16] [INFO] [server.py:211] Uvicorn running on http://0.0.0.0:8092 (Press CTRL+C to quit) ``` #### 4.2 Streaming TTS client Usage @@ -256,16 +248,14 @@ The configuration file can be found in `conf/tts_online_application.yaml`. - `protocol`: Service protocol, choices: [http, websocket], default: http. - `input`: (required): Input text to generate. - `spk_id`: Speaker id for multi-speaker text to speech. Default: 0 - - `speed`: Audio speed, the value should be set between 0 and 3. Default: 1.0 - - `volume`: Audio volume, the value should be set between 0 and 3. Default: 1.0 - - `sample_rate`: Sampling rate, choices: [0, 8000, 16000], the default is the same as the model. Default: 0 - - `output`: Output wave filepath. Default: None, which means not to save the audio to the local. + - `output`: Client output wave filepath. Default: None, which means not to save the audio to the local. - `play`: Whether to play audio, play while synthesizing, default value: False, which means not playing. **Playing audio needs to rely on the pyaudio library**. - - `spk_id, speed, volume, sample_rate` do not take effect in streaming speech synthesis service temporarily. + - Currently, only the single-speaker model is supported in the code, so `spk_id` does not take effect. Streaming TTS does not support changing sample rate, variable speed and volume. + Output: - ```bash + ```text [2022-04-27 10:21:04,262] [ INFO] - tts websocket client start [2022-04-27 10:21:04,496] [ INFO] - 句子:您好,欢迎使用百度飞桨语音合成服务。 [2022-04-27 10:21:04,496] [ INFO] - 首包响应:0.2124948501586914 s @@ -273,7 +263,6 @@ The configuration file can be found in `conf/tts_online_application.yaml`. [2022-04-27 10:21:07,484] [ INFO] - 音频时长:3.825 s [2022-04-27 10:21:07,484] [ INFO] - RTF: 0.8363677006141812 [2022-04-27 10:21:07,516] [ INFO] - 音频保存至:output.wav - ``` - Python API @@ -288,26 +277,17 @@ The configuration file can be found in `conf/tts_online_application.yaml`. 
port=8092, protocol="websocket", spk_id=0, - speed=1.0, - volume=1.0, - sample_rate=0, output="./output.wav", play=False) - ``` Output: - ```bash - [2022-04-27 10:22:48,852] [ INFO] - tts websocket client start - [2022-04-27 10:22:49,080] [ INFO] - 句子:您好,欢迎使用百度飞桨语音合成服务。 - [2022-04-27 10:22:49,080] [ INFO] - 首包响应:0.21017956733703613 s - [2022-04-27 10:22:52,100] [ INFO] - 尾包响应:3.2304444313049316 s - [2022-04-27 10:22:52,101] [ INFO] - 音频时长:3.825 s - [2022-04-27 10:22:52,101] [ INFO] - RTF: 0.8445606356352762 - [2022-04-27 10:22:52,134] [ INFO] - 音频保存至:./output.wav - + ```text + [2022-04-27 10:22:48,852] [ INFO] - tts websocket client start + [2022-04-27 10:22:49,080] [ INFO] - 句子:您好,欢迎使用百度飞桨语音合成服务。 + [2022-04-27 10:22:49,080] [ INFO] - 首包响应:0.21017956733703613 s + [2022-04-27 10:22:52,100] [ INFO] - 尾包响应:3.2304444313049316 s + [2022-04-27 10:22:52,101] [ INFO] - 音频时长:3.825 s + [2022-04-27 10:22:52,101] [ INFO] - RTF: 0.8445606356352762 + [2022-04-27 10:22:52,134] [ INFO] - 音频保存至:./output.wav ``` - - - - diff --git a/demos/streaming_tts_server/README_cn.md b/demos/streaming_tts_server/README_cn.md index 254ec26a2..b99155bca 100644 --- a/demos/streaming_tts_server/README_cn.md +++ b/demos/streaming_tts_server/README_cn.md @@ -3,16 +3,20 @@ # 流式语音合成服务 ## 介绍 -这个demo是一个启动流式语音合成服务和访问该服务的实现。 它可以通过使用`paddlespeech_server` 和 `paddlespeech_client`的单个命令或 python 的几行代码来实现。 - +这个 demo 是一个启动流式语音合成服务和访问该服务的实现。 它可以通过使用 `paddlespeech_server` 和 `paddlespeech_client` 的单个命令或 python 的几行代码来实现。 +服务接口定义请参考: +- [PaddleSpeech Server RESTful API](https://github.com/PaddlePaddle/PaddleSpeech/wiki/PaddleSpeech-Server-RESTful-API) +- [PaddleSpeech Streaming Server WebSocket API](https://github.com/PaddlePaddle/PaddleSpeech/wiki/PaddleSpeech-Server-WebSocket-API) ## 使用方法 ### 1. 安装 请看 [安装文档](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install.md). -推荐使用 **paddlepaddle 2.2.2** 或以上版本。 -你可以从 medium,hard 两种方式中选择一种方式安装 PaddleSpeech。 +推荐使用 **paddlepaddle 2.3.1** 或以上版本。 + +你可以从简单,中等,困难 几种方式中选择一种方式安装 PaddleSpeech。 +**如果使用简单模式安装,需要自行准备 yaml 文件,可参考 conf 目录下的 yaml 文件。** ### 2. 
准备配置文件 配置文件可参见 `conf/tts_online_application.yaml` 。 @@ -20,19 +24,20 @@ - `engine_list` 表示即将启动的服务将会包含的语音引擎,格式为 <语音任务>_<引擎类型>。 - 该 demo 主要介绍流式语音合成服务,因此语音任务应设置为 tts。 - 目前引擎类型支持两种形式:**online** 表示使用python进行动态图推理的引擎;**online-onnx** 表示使用 onnxruntime 进行推理的引擎。其中,online-onnx 的推理速度更快。 -- 流式 TTS 引擎的 AM 模型支持:**fastspeech2 以及fastspeech2_cnndecoder**; Voc 模型支持:**hifigan, mb_melgan** +- 流式 TTS 引擎的 AM 模型支持:**fastspeech2 以及 fastspeech2_cnndecoder**; Voc 模型支持:**hifigan, mb_melgan** - 流式 am 推理中,每次会对一个 chunk 的数据进行推理以达到流式的效果。其中 `am_block` 表示 chunk 中的有效帧数,`am_pad` 表示一个 chunk 中 am_block 前后各加的帧数。am_pad 的存在用于消除流式推理产生的误差,避免由流式推理对合成音频质量的影响。 - - fastspeech2 不支持流式 am 推理,因此 am_pad 与 m_block 对它无效 + - fastspeech2 不支持流式 am 推理,因此 am_pad 与 am_block 对它无效 - fastspeech2_cnndecoder 支持流式推理,当 am_pad=12 时,流式推理合成音频与非流式合成音频一致 -- 流式 voc 推理中,每次会对一个 chunk 的数据进行推理以达到流式的效果。其中 `voc_block` 表示chunk中的有效帧数,`voc_pad` 表示一个 chunk 中 voc_block 前后各加的帧数。voc_pad 的存在用于消除流式推理产生的误差,避免由流式推理对合成音频质量的影响。 +- 流式 voc 推理中,每次会对一个 chunk 的数据进行推理以达到流式的效果。其中 `voc_block` 表示 chunk 中的有效帧数,`voc_pad` 表示一个 chunk 中 voc_block 前后各加的帧数。voc_pad 的存在用于消除流式推理产生的误差,避免由流式推理对合成音频质量的影响。 - hifigan, mb_melgan 均支持流式 voc 推理 - 当 voc 模型为 mb_melgan,当 voc_pad=14 时,流式推理合成音频与非流式合成音频一致;voc_pad 最小可以设置为7,合成音频听感上没有异常,若 voc_pad 小于7,合成音频听感上存在异常。 - 当 voc 模型为 hifigan,当 voc_pad=19 时,流式推理合成音频与非流式合成音频一致;当 voc_pad=14 时,合成音频听感上没有异常。 + - PaddleSpeech 中流式声码器 Pad 计算方法: [AIStudio 教程](https://aistudio.baidu.com/aistudio/projectdetail/4151335) - 推理速度:mb_melgan > hifigan; 音频质量:mb_melgan < hifigan - **注意:** 如果在容器里可正常启动服务,但客户端访问 ip 不可达,可尝试将配置文件中 `host` 地址换成本地 ip 地址。 -### 3. 使用http协议的流式语音合成服务端及客户端使用方法 +### 3. 使用 http 协议的流式语音合成服务端及客户端使用方法 #### 3.1 服务端使用方法 - 命令行 (推荐使用) @@ -51,7 +56,7 @@ - `log_file`: log 文件. 默认:./log/paddlespeech.log 输出: - ```bash + ```text [2022-04-24 20:05:27,887] [ INFO] - The first response time of the 0 warm up: 1.0123658180236816 s [2022-04-24 20:05:28,038] [ INFO] - The first response time of the 1 warm up: 0.15108466148376465 s [2022-04-24 20:05:28,191] [ INFO] - The first response time of the 2 warm up: 0.15317344665527344 s @@ -64,7 +69,6 @@ [2022-04-24 20:05:28] [INFO] [on.py:59] Application startup complete. INFO: Uvicorn running on http://0.0.0.0:8092 (Press CTRL+C to quit) [2022-04-24 20:05:28] [INFO] [server.py:211] Uvicorn running on http://0.0.0.0:8092 (Press CTRL+C to quit) - ``` - Python API @@ -77,8 +81,8 @@ log_file="./log/paddlespeech.log") ``` - 输出: - ```bash + 输出: + ```text [2022-04-24 21:00:16,934] [ INFO] - The first response time of the 0 warm up: 1.268730878829956 s [2022-04-24 21:00:17,046] [ INFO] - The first response time of the 1 warm up: 0.11168622970581055 s [2022-04-24 21:00:17,151] [ INFO] - The first response time of the 2 warm up: 0.10413002967834473 s @@ -91,8 +95,6 @@ [2022-04-24 21:00:17] [INFO] [on.py:59] Application startup complete. 
INFO: Uvicorn running on http://0.0.0.0:8092 (Press CTRL+C to quit) [2022-04-24 21:00:17] [INFO] [server.py:211] Uvicorn running on http://0.0.0.0:8092 (Press CTRL+C to quit) - - ``` #### 3.2 客户端使用方法 @@ -118,16 +120,13 @@ - `protocol`: 服务协议,可选 [http, websocket], 默认: http。 - `input`: (必须输入): 待合成的文本。 - `spk_id`: 说话人 id,用于多说话人语音合成,默认值: 0。 - - `speed`: 音频速度,该值应设置在 0 到 3 之间。 默认值:1.0 - - `volume`: 音频音量,该值应设置在 0 到 3 之间。 默认值: 1.0 - - `sample_rate`: 采样率,可选 [0, 8000, 16000],默认值:0,表示与模型采样率相同 - - `output`: 输出音频的路径, 默认值:None,表示不保存音频到本地。 + - `output`: 客户端输出音频的路径, 默认值:None,表示不保存音频。 - `play`: 是否播放音频,边合成边播放, 默认值:False,表示不播放。**播放音频需要依赖pyaudio库**。 - - `spk_id, speed, volume, sample_rate` 在流式语音合成服务中暂时不生效。 + - 目前代码中只支持单说话人的模型,因此 spk_id 的选择并不生效。流式 TTS 不支持更换采样率,变速和变音量等功能。 输出: - ```bash + ```text [2022-04-24 21:08:18,559] [ INFO] - tts http client start [2022-04-24 21:08:21,702] [ INFO] - 句子:您好,欢迎使用百度飞桨语音合成服务。 [2022-04-24 21:08:21,703] [ INFO] - 首包响应:0.18863153457641602 s @@ -150,9 +149,6 @@ port=8092, protocol="http", spk_id=0, - speed=1.0, - volume=1.0, - sample_rate=0, output="./output.wav", play=False) @@ -168,9 +164,8 @@ [2022-04-24 21:11:16,802] [ INFO] - RTF: 0.7846773683635238 [2022-04-24 21:11:16,837] [ INFO] - 音频保存至:./output.wav ``` - -### 4. 使用websocket协议的流式语音合成服务端及客户端使用方法 +### 4. 使用 websocket 协议的流式语音合成服务端及客户端使用方法 #### 4.1 服务端使用方法 - 命令行 (推荐使用) 首先修改配置文件 `conf/tts_online_application.yaml`, **将 `protocol` 设置为 `websocket`**。 @@ -189,21 +184,19 @@ - `log_file`: log 文件. 默认:./log/paddlespeech.log 输出: - ```bash - [2022-04-27 10:18:09,107] [ INFO] - The first response time of the 0 warm up: 1.1551103591918945 s - [2022-04-27 10:18:09,219] [ INFO] - The first response time of the 1 warm up: 0.11204338073730469 s - [2022-04-27 10:18:09,324] [ INFO] - The first response time of the 2 warm up: 0.1051797866821289 s - [2022-04-27 10:18:09,325] [ INFO] - ********************************************************************** - INFO: Started server process [17600] - [2022-04-27 10:18:09] [INFO] [server.py:75] Started server process [17600] - INFO: Waiting for application startup. - [2022-04-27 10:18:09] [INFO] [on.py:45] Waiting for application startup. - INFO: Application startup complete. - [2022-04-27 10:18:09] [INFO] [on.py:59] Application startup complete. - INFO: Uvicorn running on http://0.0.0.0:8092 (Press CTRL+C to quit) - [2022-04-27 10:18:09] [INFO] [server.py:211] Uvicorn running on http://0.0.0.0:8092 (Press CTRL+C to quit) - - + ```text + [2022-04-27 10:18:09,107] [ INFO] - The first response time of the 0 warm up: 1.1551103591918945 s + [2022-04-27 10:18:09,219] [ INFO] - The first response time of the 1 warm up: 0.11204338073730469 s + [2022-04-27 10:18:09,324] [ INFO] - The first response time of the 2 warm up: 0.1051797866821289 s + [2022-04-27 10:18:09,325] [ INFO] - ********************************************************************** + INFO: Started server process [17600] + [2022-04-27 10:18:09] [INFO] [server.py:75] Started server process [17600] + INFO: Waiting for application startup. + [2022-04-27 10:18:09] [INFO] [on.py:45] Waiting for application startup. + INFO: Application startup complete. + [2022-04-27 10:18:09] [INFO] [on.py:59] Application startup complete. 
+ INFO: Uvicorn running on http://0.0.0.0:8092 (Press CTRL+C to quit) + [2022-04-27 10:18:09] [INFO] [server.py:211] Uvicorn running on http://0.0.0.0:8092 (Press CTRL+C to quit) ``` - Python API @@ -216,27 +209,26 @@ log_file="./log/paddlespeech.log") ``` - 输出: - ```bash - [2022-04-27 10:20:16,660] [ INFO] - The first response time of the 0 warm up: 1.0945196151733398 s - [2022-04-27 10:20:16,773] [ INFO] - The first response time of the 1 warm up: 0.11222052574157715 s - [2022-04-27 10:20:16,878] [ INFO] - The first response time of the 2 warm up: 0.10494542121887207 s - [2022-04-27 10:20:16,878] [ INFO] - ********************************************************************** - INFO: Started server process [23466] - [2022-04-27 10:20:16] [INFO] [server.py:75] Started server process [23466] - INFO: Waiting for application startup. - [2022-04-27 10:20:16] [INFO] [on.py:45] Waiting for application startup. - INFO: Application startup complete. - [2022-04-27 10:20:16] [INFO] [on.py:59] Application startup complete. - INFO: Uvicorn running on http://0.0.0.0:8092 (Press CTRL+C to quit) - [2022-04-27 10:20:16] [INFO] [server.py:211] Uvicorn running on http://0.0.0.0:8092 (Press CTRL+C to quit) - + 输出: + ```text + [2022-04-27 10:20:16,660] [ INFO] - The first response time of the 0 warm up: 1.0945196151733398 s + [2022-04-27 10:20:16,773] [ INFO] - The first response time of the 1 warm up: 0.11222052574157715 s + [2022-04-27 10:20:16,878] [ INFO] - The first response time of the 2 warm up: 0.10494542121887207 s + [2022-04-27 10:20:16,878] [ INFO] - ********************************************************************** + INFO: Started server process [23466] + [2022-04-27 10:20:16] [INFO] [server.py:75] Started server process [23466] + INFO: Waiting for application startup. + [2022-04-27 10:20:16] [INFO] [on.py:45] Waiting for application startup. + INFO: Application startup complete. + [2022-04-27 10:20:16] [INFO] [on.py:59] Application startup complete. 
+ INFO: Uvicorn running on http://0.0.0.0:8092 (Press CTRL+C to quit) + [2022-04-27 10:20:16] [INFO] [server.py:211] Uvicorn running on http://0.0.0.0:8092 (Press CTRL+C to quit) ``` #### 4.2 客户端使用方法 - 命令行 (推荐使用) - 访问 websocket 流式TTS服务: + 访问 websocket 流式 TTS 服务: 若 `127.0.0.1` 不能访问,则需要使用实际服务 IP 地址 @@ -256,16 +248,13 @@ - `protocol`: 服务协议,可选 [http, websocket], 默认: http。 - `input`: (必须输入): 待合成的文本。 - `spk_id`: 说话人 id,用于多说话人语音合成,默认值: 0。 - - `speed`: 音频速度,该值应设置在 0 到 3 之间。 默认值:1.0 - - `volume`: 音频音量,该值应设置在 0 到 3 之间。 默认值: 1.0 - - `sample_rate`: 采样率,可选 [0, 8000, 16000],默认值:0,表示与模型采样率相同 - - `output`: 输出音频的路径, 默认值:None,表示不保存音频到本地。 + - `output`: 客户端输出音频的路径, 默认值:None,表示不保存音频。 - `play`: 是否播放音频,边合成边播放, 默认值:False,表示不播放。**播放音频需要依赖pyaudio库**。 - - `spk_id, speed, volume, sample_rate` 在流式语音合成服务中暂时不生效。 + - 目前代码中只支持单说话人的模型,因此 spk_id 的选择并不生效。流式 TTS 不支持更换采样率,变速和变音量等功能。 + - 输出: - ```bash + ```text [2022-04-27 10:21:04,262] [ INFO] - tts websocket client start [2022-04-27 10:21:04,496] [ INFO] - 句子:您好,欢迎使用百度飞桨语音合成服务。 [2022-04-27 10:21:04,496] [ INFO] - 首包响应:0.2124948501586914 s @@ -273,7 +262,6 @@ [2022-04-27 10:21:07,484] [ INFO] - 音频时长:3.825 s [2022-04-27 10:21:07,484] [ INFO] - RTF: 0.8363677006141812 [2022-04-27 10:21:07,516] [ INFO] - 音频保存至:output.wav - ``` - Python API @@ -288,16 +276,12 @@ port=8092, protocol="websocket", spk_id=0, - speed=1.0, - volume=1.0, - sample_rate=0, output="./output.wav", play=False) - ``` 输出: - ```bash + ```text [2022-04-27 10:22:48,852] [ INFO] - tts websocket client start [2022-04-27 10:22:49,080] [ INFO] - 句子:您好,欢迎使用百度飞桨语音合成服务。 [2022-04-27 10:22:49,080] [ INFO] - 首包响应:0.21017956733703613 s @@ -305,8 +289,4 @@ [2022-04-27 10:22:52,101] [ INFO] - 音频时长:3.825 s [2022-04-27 10:22:52,101] [ INFO] - RTF: 0.8445606356352762 [2022-04-27 10:22:52,134] [ INFO] - 音频保存至:./output.wav - ``` - - - diff --git a/demos/streaming_tts_server/test_client.sh b/demos/streaming_tts_server/client.sh old mode 100644 new mode 100755 similarity index 61% rename from demos/streaming_tts_server/test_client.sh rename to demos/streaming_tts_server/client.sh index bd88f20b1..e93da58a8 --- a/demos/streaming_tts_server/test_client.sh +++ b/demos/streaming_tts_server/client.sh @@ -2,8 +2,8 @@ # http client test # If `127.0.0.1` is not accessible, you need to use the actual service IP address. -paddlespeech_client tts_online --server_ip 127.0.0.1 --port 8092 --protocol http --input "您好,欢迎使用百度飞桨语音合成服务。" --output output.wav +paddlespeech_client tts_online --server_ip 127.0.0.1 --port 8092 --protocol http --input "您好,欢迎使用百度飞桨语音合成服务。" --output output.http.wav # websocket client test # If `127.0.0.1` is not accessible, you need to use the actual service IP address. 
-# paddlespeech_client tts_online --server_ip 127.0.0.1 --port 8092 --protocol websocket --input "您好,欢迎使用百度飞桨语音合成服务。" --output output.wav +paddlespeech_client tts_online --server_ip 127.0.0.1 --port 8192 --protocol websocket --input "您好,欢迎使用百度飞桨语音合成服务。" --output output.ws.wav diff --git a/demos/streaming_tts_server/conf/tts_online_application.yaml b/demos/streaming_tts_server/conf/tts_online_application.yaml index 0460a5e16..e617912fe 100644 --- a/demos/streaming_tts_server/conf/tts_online_application.yaml +++ b/demos/streaming_tts_server/conf/tts_online_application.yaml @@ -79,7 +79,7 @@ tts_online-onnx: # voc (vocoder) choices=['mb_melgan_csmsc_onnx, hifigan_csmsc_onnx'] # Both mb_melgan_csmsc_onnx and hifigan_csmsc_onnx support streaming voc inference - voc: 'hifigan_csmsc_onnx' + voc: 'mb_melgan_csmsc_onnx' voc_ckpt: voc_sample_rate: 24000 voc_sess_conf: @@ -100,4 +100,4 @@ tts_online-onnx: voc_pad: 14 # voc_upsample should be same as n_shift on voc config. voc_upsample: 300 - + \ No newline at end of file diff --git a/demos/streaming_tts_server/conf/tts_online_ws_application.yaml b/demos/streaming_tts_server/conf/tts_online_ws_application.yaml new file mode 100644 index 000000000..329f882cc --- /dev/null +++ b/demos/streaming_tts_server/conf/tts_online_ws_application.yaml @@ -0,0 +1,103 @@ +# This is the parameter configuration file for streaming tts server. + +################################################################################# +# SERVER SETTING # +################################################################################# +host: 0.0.0.0 +port: 8192 + +# The task format in the engin_list is: _ +# engine_list choices = ['tts_online', 'tts_online-onnx'], the inference speed of tts_online-onnx is faster than tts_online. +# protocol choices = ['websocket', 'http'] +protocol: 'websocket' +engine_list: ['tts_online-onnx'] + + +################################################################################# +# ENGINE CONFIG # +################################################################################# + +################################### TTS ######################################### +################### speech task: tts; engine_type: online ####################### +tts_online: + # am (acoustic model) choices=['fastspeech2_csmsc', 'fastspeech2_cnndecoder_csmsc'] + # fastspeech2_cnndecoder_csmsc support streaming am infer. 
+ am: 'fastspeech2_csmsc' + am_config: + am_ckpt: + am_stat: + phones_dict: + tones_dict: + speaker_dict: + spk_id: 0 + + # voc (vocoder) choices=['mb_melgan_csmsc, hifigan_csmsc'] + # Both mb_melgan_csmsc and hifigan_csmsc support streaming voc inference + voc: 'mb_melgan_csmsc' + voc_config: + voc_ckpt: + voc_stat: + + # others + lang: 'zh' + device: 'cpu' # set 'gpu:id' or 'cpu' + # am_block and am_pad only for fastspeech2_cnndecoder_onnx model to streaming am infer, + # when am_pad set 12, streaming synthetic audio is the same as non-streaming synthetic audio + am_block: 72 + am_pad: 12 + # voc_pad and voc_block voc model to streaming voc infer, + # when voc model is mb_melgan_csmsc, voc_pad set 14, streaming synthetic audio is the same as non-streaming synthetic audio; The minimum value of pad can be set to 7, streaming synthetic audio sounds normal + # when voc model is hifigan_csmsc, voc_pad set 19, streaming synthetic audio is the same as non-streaming synthetic audio; voc_pad set 14, streaming synthetic audio sounds normal + voc_block: 36 + voc_pad: 14 + + + +################################################################################# +# ENGINE CONFIG # +################################################################################# + +################################### TTS ######################################### +################### speech task: tts; engine_type: online-onnx ####################### +tts_online-onnx: + # am (acoustic model) choices=['fastspeech2_csmsc_onnx', 'fastspeech2_cnndecoder_csmsc_onnx'] + # fastspeech2_cnndecoder_csmsc_onnx support streaming am infer. + am: 'fastspeech2_cnndecoder_csmsc_onnx' + # am_ckpt is a list, if am is fastspeech2_cnndecoder_csmsc_onnx, am_ckpt = [encoder model, decoder model, postnet model]; + # if am is fastspeech2_csmsc_onnx, am_ckpt = [ckpt model]; + am_ckpt: # list + am_stat: + phones_dict: + tones_dict: + speaker_dict: + spk_id: 0 + am_sample_rate: 24000 + am_sess_conf: + device: "cpu" # set 'gpu:id' or 'cpu' + use_trt: False + cpu_threads: 4 + + # voc (vocoder) choices=['mb_melgan_csmsc_onnx, hifigan_csmsc_onnx'] + # Both mb_melgan_csmsc_onnx and hifigan_csmsc_onnx support streaming voc inference + voc: 'mb_melgan_csmsc_onnx' + voc_ckpt: + voc_sample_rate: 24000 + voc_sess_conf: + device: "cpu" # set 'gpu:id' or 'cpu' + use_trt: False + cpu_threads: 4 + + # others + lang: 'zh' + # am_block and am_pad only for fastspeech2_cnndecoder_onnx model to streaming am infer, + # when am_pad set 12, streaming synthetic audio is the same as non-streaming synthetic audio + am_block: 72 + am_pad: 12 + # voc_pad and voc_block voc model to streaming voc infer, + # when voc model is mb_melgan_csmsc_onnx, voc_pad set 14, streaming synthetic audio is the same as non-streaming synthetic audio; The minimum value of pad can be set to 7, streaming synthetic audio sounds normal + # when voc model is hifigan_csmsc_onnx, voc_pad set 19, streaming synthetic audio is the same as non-streaming synthetic audio; voc_pad set 14, streaming synthetic audio sounds normal + voc_block: 36 + voc_pad: 14 + # voc_upsample should be same as n_shift on voc config. 
+ voc_upsample: 300 + \ No newline at end of file diff --git a/demos/streaming_tts_server/server.sh b/demos/streaming_tts_server/server.sh new file mode 100755 index 000000000..d34ddba02 --- /dev/null +++ b/demos/streaming_tts_server/server.sh @@ -0,0 +1,10 @@ +#!/bin/bash + +# http server +paddlespeech_server start --config_file ./conf/tts_online_application.yaml &> tts.http.log & + + +# websocket server +paddlespeech_server start --config_file ./conf/tts_online_ws_application.yaml &> tts.ws.log & + + diff --git a/demos/streaming_tts_server/start_server.sh b/demos/streaming_tts_server/start_server.sh deleted file mode 100644 index 9c71f2fe2..000000000 --- a/demos/streaming_tts_server/start_server.sh +++ /dev/null @@ -1,3 +0,0 @@ -#!/bin/bash -# start server -paddlespeech_server start --config_file ./conf/tts_online_application.yaml \ No newline at end of file diff --git a/demos/text_to_speech/run.sh b/demos/text_to_speech/run.sh index b1340241b..2b588be55 100755 --- a/demos/text_to_speech/run.sh +++ b/demos/text_to_speech/run.sh @@ -4,4 +4,10 @@ paddlespeech tts --input 今天的天气不错啊 # Batch process -echo -e "1 欢迎光临。\n2 谢谢惠顾。" | paddlespeech tts \ No newline at end of file +echo -e "1 欢迎光临。\n2 谢谢惠顾。" | paddlespeech tts + +# Text Frontend +paddlespeech tts --input 今天是2022/10/29,最低温度是-3℃. + + + diff --git a/docker/ubuntu16-gpu/Dockerfile b/docker/ubuntu16-gpu/Dockerfile new file mode 100644 index 000000000..f275471ee --- /dev/null +++ b/docker/ubuntu16-gpu/Dockerfile @@ -0,0 +1,77 @@ +FROM nvidia/cuda:11.2.2-cudnn8-runtime-ubuntu16.04 + +RUN echo "deb [trusted=true] http://mirrors.tuna.tsinghua.edu.cn/ubuntu/ xenial main restricted \n\ +deb [trusted=true] http://mirrors.tuna.tsinghua.edu.cn/ubuntu/ xenial-updates main restricted \n\ +deb [trusted=true] http://mirrors.tuna.tsinghua.edu.cn/ubuntu/ xenial universe \n\ +deb [trusted=true] http://mirrors.tuna.tsinghua.edu.cn/ubuntu/ xenial-updates universe \n\ +deb [trusted=true] http://mirrors.tuna.tsinghua.edu.cn/ubuntu/ xenial multiverse \n\ +deb [trusted=true] http://mirrors.tuna.tsinghua.edu.cn/ubuntu/ xenial-updates multiverse \n\ +deb [trusted=true] http://mirrors.tuna.tsinghua.edu.cn/ubuntu/ xenial-backports main restricted universe multiverse \n\ +deb [trusted=true] http://mirrors.tuna.tsinghua.edu.cn/ubuntu/ xenial-security main restricted \n\ +deb [trusted=true] http://mirrors.tuna.tsinghua.edu.cn/ubuntu/ xenial-security universe \n\ +deb [trusted=true] http://mirrors.tuna.tsinghua.edu.cn/ubuntu/ xenial-security multiverse" > /etc/apt/sources.list + +RUN apt-get update && apt-get install -y inetutils-ping wget vim curl cmake git sox libsndfile1 libpng12-dev \ + libpng-dev swig libzip-dev openssl bc libflac* libgdk-pixbuf2.0-dev libpango1.0-dev libcairo2-dev \ + libgtk2.0-dev pkg-config zip unzip zlib1g-dev libreadline-dev libbz2-dev liblapack-dev libjpeg-turbo8-dev \ + sudo lrzsz libsqlite3-dev libx11-dev libsm6 apt-utils libopencv-dev libavcodec-dev libavformat-dev \ + libswscale-dev locales liblzma-dev python-lzma m4 libxext-dev strace libibverbs-dev libpcre3 libpcre3-dev \ + build-essential libncurses5-dev libgdbm-dev libnss3-dev libssl-dev libreadline-dev libffi-dev xz-utils \ + libfreetype6-dev libxslt1-dev libxml2-dev libgeos-3.5.0 libgeos-dev && apt-get install -y --allow-downgrades \ + --allow-change-held-packages libnccl2 libnccl-dev && DEBIAN_FRONTEND=noninteractive apt-get install -y tzdata \ + && /bin/cp /usr/share/zoneinfo/Asia/Shanghai /etc/localtime && dpkg-reconfigure -f noninteractive tzdata && \ + cd 
/usr/lib/x86_64-linux-gnu && ln -s libcudnn.so.8 libcudnn.so && \ + cd /usr/local/cuda-11.2/targets/x86_64-linux/lib && ln -s libcublas.so.11.4.1.1043 libcublas.so && \ + ln -s libcusolver.so.11.1.0.152 libcusolver.so && ln -s libcusparse.so.11 libcusparse.so && \ + ln -s libcufft.so.10.4.1.152 libcufft.so + +RUN echo "set meta-flag on" >> /etc/inputrc && echo "set convert-meta off" >> /etc/inputrc && \ + locale-gen en_US.UTF-8 && /sbin/ldconfig -v && groupadd -g 10001 paddle && \ + useradd -m -s /bin/bash -N -u 10001 paddle -g paddle && chmod g+w /etc/passwd && \ + echo "paddle ALL=(ALL) NOPASSWD: ALL" >> /etc/sudoers + +ENV LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8 LANGUAGE=en_US.UTF-8 TZ=Asia/Shanghai + +# official download site: https://www.python.org/ftp/python/3.7.13/Python-3.7.13.tgz +RUN wget https://cdn.npmmirror.com/binaries/python/3.7.13/Python-3.7.13.tgz && tar xvf Python-3.7.13.tgz && \ + cd Python-3.7.13 && ./configure --prefix=/home/paddle/python3.7 && make -j8 && make install && \ + rm -rf ../Python-3.7.13 ../Python-3.7.13.tgz && chown -R paddle:paddle /home/paddle/python3.7 + +RUN cd /tmp && wget https://mirrors.sjtug.sjtu.edu.cn/gnu/gmp/gmp-6.1.0.tar.bz2 && tar xvf gmp-6.1.0.tar.bz2 && \ + cd gmp-6.1.0 && ./configure --prefix=/usr/local && make -j8 && make install && \ + rm -rf ../gmp-6.1.0.tar.bz2 ../gmp-6.1.0 && cd /tmp && \ + wget https://www.mpfr.org/mpfr-3.1.4/mpfr-3.1.4.tar.bz2 && tar xvf mpfr-3.1.4.tar.bz2 && cd mpfr-3.1.4 && \ + ./configure --prefix=/usr/local && make -j8 && make install && rm -rf ../mpfr-3.1.4.tar.bz2 ../mpfr-3.1.4 && \ + cd /tmp && wget https://mirrors.sjtug.sjtu.edu.cn/gnu/mpc/mpc-1.0.3.tar.gz && tar xvf mpc-1.0.3.tar.gz && \ + cd mpc-1.0.3 && ./configure --prefix=/usr/local && make -j8 && make install && \ + rm -rf ../mpc-1.0.3.tar.gz ../mpc-1.0.3 && cd /tmp && \ + wget http://www.mirrorservice.org/sites/sourceware.org/pub/gcc/infrastructure/isl-0.18.tar.bz2 && \ + tar xvf isl-0.18.tar.bz2 && cd isl-0.18 && ./configure --prefix=/usr/local && make -j8 && make install \ + && rm -rf ../isl-0.18.tar.bz2 ../isl-0.18 && cd /tmp && \ + wget http://mirrors.ustc.edu.cn/gnu/gcc/gcc-8.2.0/gcc-8.2.0.tar.gz --no-check-certificate && \ + tar xvf gcc-8.2.0.tar.gz && cd gcc-8.2.0 && unset LIBRARY_PATH && ./configure --prefix=/home/paddle/gcc82 \ + --enable-threads=posix --disable-checking --disable-multilib --enable-languages=c,c++ --with-gmp=/usr/local \ + --with-mpfr=/usr/local --with-mpc=/usr/local --with-isl=/usr/local && make -j8 && make install && \ + rm -rf ../gcc-8.2.0.tar.gz ../gcc-8.2.0 && chown -R paddle:paddle /home/paddle/gcc82 + +WORKDIR /home/paddle +ENV PATH=/home/paddle/python3.7/bin:/home/paddle/gcc82/bin:${PATH} \ + LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:/usr/local/cuda-11.2/targets/x86_64-linux/lib:${LD_LIBRARY_PATH} + +RUN mkdir -p ~/.pip && echo "[global]" > ~/.pip/pip.conf && \ + echo "index-url=https://mirror.baidu.com/pypi/simple" >> ~/.pip/pip.conf && \ + echo "trusted-host=mirror.baidu.com" >> ~/.pip/pip.conf && \ + python3 -m pip install --upgrade pip && \ + pip install paddlepaddle-gpu==2.3.1.post112 -f https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html && \ + rm -rf ~/.cache/pip + +RUN git clone https://github.com/PaddlePaddle/PaddleSpeech.git && cd PaddleSpeech && \ + pip3 install pytest-runner paddleaudio -i https://pypi.tuna.tsinghua.edu.cn/simple && \ + pip3 install -e .[develop] -i https://pypi.tuna.tsinghua.edu.cn/simple && \ + pip3 install importlib-metadata==4.2.0 urllib3==1.25.10 -i 
https://pypi.tuna.tsinghua.edu.cn/simple && \ + rm -rf ~/.cache/pip && \ + sudo cp -f /home/paddle/gcc82/lib64/libstdc++.so.6.0.25 /usr/lib/x86_64-linux-gnu/libstdc++.so.6 && \ + chown -R paddle:paddle /home/paddle/PaddleSpeech + +USER paddle +CMD ['bash'] diff --git a/docker/ubuntu18-cpu/Dockerfile b/docker/ubuntu18-cpu/Dockerfile index d14c01858..35f45f2e4 100644 --- a/docker/ubuntu18-cpu/Dockerfile +++ b/docker/ubuntu18-cpu/Dockerfile @@ -1,15 +1,17 @@ FROM registry.baidubce.com/paddlepaddle/paddle:2.2.2 LABEL maintainer="paddlesl@baidu.com" +RUN apt-get update \ + && apt-get install libsndfile-dev \ + && apt-get clean \ + && rm -rf /var/lib/apt/lists/* + RUN git clone --depth 1 https://github.com/PaddlePaddle/PaddleSpeech.git /home/PaddleSpeech RUN pip3 uninstall mccabe -y ; exit 0; RUN pip3 install multiprocess==0.70.12 importlib-metadata==4.2.0 dill==0.3.4 -RUN cd /home/PaddleSpeech/audio -RUN python setup.py bdist_wheel - -RUN cd /home/PaddleSpeech +WORKDIR /home/PaddleSpeech/ RUN python setup.py bdist_wheel -RUN pip install audio/dist/*.whl dist/*.whl +RUN pip install dist/*.whl -i https://pypi.tuna.tsinghua.edu.cn/simple -WORKDIR /home/PaddleSpeech/ +CMD ['bash'] diff --git a/docs/requirements.txt b/docs/requirements.txt index 11e0d4b46..ee116a9b6 100644 --- a/docs/requirements.txt +++ b/docs/requirements.txt @@ -5,3 +5,49 @@ sphinx sphinx-autobuild sphinx-markdown-tables sphinx_rtd_theme +paddlepaddle>=2.2.2 +editdistance +g2p_en +g2pM +h5py +inflect +jieba +jsonlines +kaldiio +librosa==0.8.1 +loguru +matplotlib +nara_wpe +onnxruntime==1.10.0 +opencc +pandas +paddlenlp +paddlespeech_feat +Pillow>=9.0.0 +praatio==5.0.0 +pypinyin +pypinyin-dict +python-dateutil +pyworld==0.2.12 +resampy==0.2.2 +sacrebleu +scipy +sentencepiece~=0.1.96 +soundfile~=0.10 +textgrid +timer +tqdm +typeguard +visualdl +webrtcvad +yacs~=0.1.8 +prettytable +zhon +colorlog +pathos == 0.2.8 +fastapi +websockets +keyboard +uvicorn +pattern_singleton +braceexpand \ No newline at end of file diff --git a/docs/source/api/modules.rst b/docs/source/api/modules.rst new file mode 100644 index 000000000..42d287862 --- /dev/null +++ b/docs/source/api/modules.rst @@ -0,0 +1,7 @@ +paddlespeech +============ + +.. toctree:: + :maxdepth: 4 + + paddlespeech diff --git a/docs/source/api/paddlespeech.audio.backends.rst b/docs/source/api/paddlespeech.audio.backends.rst new file mode 100644 index 000000000..e8917897e --- /dev/null +++ b/docs/source/api/paddlespeech.audio.backends.rst @@ -0,0 +1,16 @@ +paddlespeech.audio.backends package +=================================== + +.. automodule:: paddlespeech.audio.backends + :members: + :undoc-members: + :show-inheritance: + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.audio.backends.soundfile_backend + paddlespeech.audio.backends.sox_backend diff --git a/docs/source/api/paddlespeech.audio.backends.soundfile_backend.rst b/docs/source/api/paddlespeech.audio.backends.soundfile_backend.rst new file mode 100644 index 000000000..5c4ef3881 --- /dev/null +++ b/docs/source/api/paddlespeech.audio.backends.soundfile_backend.rst @@ -0,0 +1,7 @@ +paddlespeech.audio.backends.soundfile\_backend module +===================================================== + +.. 
automodule:: paddlespeech.audio.backends.soundfile_backend + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.audio.backends.sox_backend.rst b/docs/source/api/paddlespeech.audio.backends.sox_backend.rst new file mode 100644 index 000000000..a99c49de8 --- /dev/null +++ b/docs/source/api/paddlespeech.audio.backends.sox_backend.rst @@ -0,0 +1,7 @@ +paddlespeech.audio.backends.sox\_backend module +=============================================== + +.. automodule:: paddlespeech.audio.backends.sox_backend + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.audio.compliance.kaldi.rst b/docs/source/api/paddlespeech.audio.compliance.kaldi.rst new file mode 100644 index 000000000..f1459cf1a --- /dev/null +++ b/docs/source/api/paddlespeech.audio.compliance.kaldi.rst @@ -0,0 +1,7 @@ +paddlespeech.audio.compliance.kaldi module +========================================== + +.. automodule:: paddlespeech.audio.compliance.kaldi + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.audio.compliance.librosa.rst b/docs/source/api/paddlespeech.audio.compliance.librosa.rst new file mode 100644 index 000000000..85271bee4 --- /dev/null +++ b/docs/source/api/paddlespeech.audio.compliance.librosa.rst @@ -0,0 +1,7 @@ +paddlespeech.audio.compliance.librosa module +============================================ + +.. automodule:: paddlespeech.audio.compliance.librosa + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.audio.compliance.rst b/docs/source/api/paddlespeech.audio.compliance.rst new file mode 100644 index 000000000..515d25e99 --- /dev/null +++ b/docs/source/api/paddlespeech.audio.compliance.rst @@ -0,0 +1,16 @@ +paddlespeech.audio.compliance package +===================================== + +.. automodule:: paddlespeech.audio.compliance + :members: + :undoc-members: + :show-inheritance: + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.audio.compliance.kaldi + paddlespeech.audio.compliance.librosa diff --git a/docs/source/api/paddlespeech.audio.datasets.dataset.rst b/docs/source/api/paddlespeech.audio.datasets.dataset.rst new file mode 100644 index 000000000..41243fb73 --- /dev/null +++ b/docs/source/api/paddlespeech.audio.datasets.dataset.rst @@ -0,0 +1,7 @@ +paddlespeech.audio.datasets.dataset module +========================================== + +.. automodule:: paddlespeech.audio.datasets.dataset + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.audio.datasets.esc50.rst b/docs/source/api/paddlespeech.audio.datasets.esc50.rst new file mode 100644 index 000000000..80e4a4187 --- /dev/null +++ b/docs/source/api/paddlespeech.audio.datasets.esc50.rst @@ -0,0 +1,7 @@ +paddlespeech.audio.datasets.esc50 module +======================================== + +.. automodule:: paddlespeech.audio.datasets.esc50 + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.audio.datasets.gtzan.rst b/docs/source/api/paddlespeech.audio.datasets.gtzan.rst new file mode 100644 index 000000000..47252e8d7 --- /dev/null +++ b/docs/source/api/paddlespeech.audio.datasets.gtzan.rst @@ -0,0 +1,7 @@ +paddlespeech.audio.datasets.gtzan module +======================================== + +.. 
automodule:: paddlespeech.audio.datasets.gtzan + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.audio.datasets.hey_snips.rst b/docs/source/api/paddlespeech.audio.datasets.hey_snips.rst new file mode 100644 index 000000000..ce08b7003 --- /dev/null +++ b/docs/source/api/paddlespeech.audio.datasets.hey_snips.rst @@ -0,0 +1,7 @@ +paddlespeech.audio.datasets.hey\_snips module +============================================= + +.. automodule:: paddlespeech.audio.datasets.hey_snips + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.audio.datasets.rirs_noises.rst b/docs/source/api/paddlespeech.audio.datasets.rirs_noises.rst new file mode 100644 index 000000000..3015ba9e4 --- /dev/null +++ b/docs/source/api/paddlespeech.audio.datasets.rirs_noises.rst @@ -0,0 +1,7 @@ +paddlespeech.audio.datasets.rirs\_noises module +=============================================== + +.. automodule:: paddlespeech.audio.datasets.rirs_noises + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.audio.datasets.rst b/docs/source/api/paddlespeech.audio.datasets.rst new file mode 100644 index 000000000..bfc313a70 --- /dev/null +++ b/docs/source/api/paddlespeech.audio.datasets.rst @@ -0,0 +1,22 @@ +paddlespeech.audio.datasets package +=================================== + +.. automodule:: paddlespeech.audio.datasets + :members: + :undoc-members: + :show-inheritance: + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.audio.datasets.dataset + paddlespeech.audio.datasets.esc50 + paddlespeech.audio.datasets.gtzan + paddlespeech.audio.datasets.hey_snips + paddlespeech.audio.datasets.rirs_noises + paddlespeech.audio.datasets.tess + paddlespeech.audio.datasets.urban_sound + paddlespeech.audio.datasets.voxceleb diff --git a/docs/source/api/paddlespeech.audio.datasets.tess.rst b/docs/source/api/paddlespeech.audio.datasets.tess.rst new file mode 100644 index 000000000..d845e6d6a --- /dev/null +++ b/docs/source/api/paddlespeech.audio.datasets.tess.rst @@ -0,0 +1,7 @@ +paddlespeech.audio.datasets.tess module +======================================= + +.. automodule:: paddlespeech.audio.datasets.tess + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.audio.datasets.urban_sound.rst b/docs/source/api/paddlespeech.audio.datasets.urban_sound.rst new file mode 100644 index 000000000..4efa060a8 --- /dev/null +++ b/docs/source/api/paddlespeech.audio.datasets.urban_sound.rst @@ -0,0 +1,7 @@ +paddlespeech.audio.datasets.urban\_sound module +=============================================== + +.. automodule:: paddlespeech.audio.datasets.urban_sound + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.audio.datasets.voxceleb.rst b/docs/source/api/paddlespeech.audio.datasets.voxceleb.rst new file mode 100644 index 000000000..179053dcd --- /dev/null +++ b/docs/source/api/paddlespeech.audio.datasets.voxceleb.rst @@ -0,0 +1,7 @@ +paddlespeech.audio.datasets.voxceleb module +=========================================== + +.. 
automodule:: paddlespeech.audio.datasets.voxceleb + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.audio.features.layers.rst b/docs/source/api/paddlespeech.audio.features.layers.rst new file mode 100644 index 000000000..978c018e0 --- /dev/null +++ b/docs/source/api/paddlespeech.audio.features.layers.rst @@ -0,0 +1,7 @@ +paddlespeech.audio.features.layers module +========================================= + +.. automodule:: paddlespeech.audio.features.layers + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.audio.features.rst b/docs/source/api/paddlespeech.audio.features.rst new file mode 100644 index 000000000..ab1e79b07 --- /dev/null +++ b/docs/source/api/paddlespeech.audio.features.rst @@ -0,0 +1,15 @@ +paddlespeech.audio.features package +=================================== + +.. automodule:: paddlespeech.audio.features + :members: + :undoc-members: + :show-inheritance: + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.audio.features.layers diff --git a/docs/source/api/paddlespeech.audio.functional.functional.rst b/docs/source/api/paddlespeech.audio.functional.functional.rst new file mode 100644 index 000000000..80cc5a5a4 --- /dev/null +++ b/docs/source/api/paddlespeech.audio.functional.functional.rst @@ -0,0 +1,7 @@ +paddlespeech.audio.functional.functional module +=============================================== + +.. automodule:: paddlespeech.audio.functional.functional + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.audio.functional.rst b/docs/source/api/paddlespeech.audio.functional.rst new file mode 100644 index 000000000..4e979dd9a --- /dev/null +++ b/docs/source/api/paddlespeech.audio.functional.rst @@ -0,0 +1,16 @@ +paddlespeech.audio.functional package +===================================== + +.. automodule:: paddlespeech.audio.functional + :members: + :undoc-members: + :show-inheritance: + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.audio.functional.functional + paddlespeech.audio.functional.window diff --git a/docs/source/api/paddlespeech.audio.functional.window.rst b/docs/source/api/paddlespeech.audio.functional.window.rst new file mode 100644 index 000000000..347762751 --- /dev/null +++ b/docs/source/api/paddlespeech.audio.functional.window.rst @@ -0,0 +1,7 @@ +paddlespeech.audio.functional.window module +=========================================== + +.. automodule:: paddlespeech.audio.functional.window + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.audio.io.rst b/docs/source/api/paddlespeech.audio.io.rst new file mode 100644 index 000000000..03f5b9fec --- /dev/null +++ b/docs/source/api/paddlespeech.audio.io.rst @@ -0,0 +1,7 @@ +paddlespeech.audio.io package +============================= + +.. automodule:: paddlespeech.audio.io + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.audio.metric.eer.rst b/docs/source/api/paddlespeech.audio.metric.eer.rst new file mode 100644 index 000000000..bbe881221 --- /dev/null +++ b/docs/source/api/paddlespeech.audio.metric.eer.rst @@ -0,0 +1,7 @@ +paddlespeech.audio.metric.eer module +==================================== + +.. 
automodule:: paddlespeech.audio.metric.eer + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.audio.metric.rst b/docs/source/api/paddlespeech.audio.metric.rst new file mode 100644 index 000000000..a6d411dd6 --- /dev/null +++ b/docs/source/api/paddlespeech.audio.metric.rst @@ -0,0 +1,15 @@ +paddlespeech.audio.metric package +================================= + +.. automodule:: paddlespeech.audio.metric + :members: + :undoc-members: + :show-inheritance: + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.audio.metric.eer diff --git a/docs/source/api/paddlespeech.audio.rst b/docs/source/api/paddlespeech.audio.rst new file mode 100644 index 000000000..5a3867f96 --- /dev/null +++ b/docs/source/api/paddlespeech.audio.rst @@ -0,0 +1,23 @@ +paddlespeech.audio package +========================== + +.. automodule:: paddlespeech.audio + :members: + :undoc-members: + :show-inheritance: + +Subpackages +----------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.audio.backends + paddlespeech.audio.compliance + paddlespeech.audio.datasets + paddlespeech.audio.features + paddlespeech.audio.functional + paddlespeech.audio.io + paddlespeech.audio.metric + paddlespeech.audio.sox_effects + paddlespeech.audio.utils diff --git a/docs/source/api/paddlespeech.audio.sox_effects.rst b/docs/source/api/paddlespeech.audio.sox_effects.rst new file mode 100644 index 000000000..75f991a16 --- /dev/null +++ b/docs/source/api/paddlespeech.audio.sox_effects.rst @@ -0,0 +1,7 @@ +paddlespeech.audio.sox\_effects package +======================================= + +.. automodule:: paddlespeech.audio.sox_effects + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.audio.utils.download.rst b/docs/source/api/paddlespeech.audio.utils.download.rst new file mode 100644 index 000000000..fab97813d --- /dev/null +++ b/docs/source/api/paddlespeech.audio.utils.download.rst @@ -0,0 +1,7 @@ +paddlespeech.audio.utils.download module +======================================== + +.. automodule:: paddlespeech.audio.utils.download + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.audio.utils.error.rst b/docs/source/api/paddlespeech.audio.utils.error.rst new file mode 100644 index 000000000..4aee8b168 --- /dev/null +++ b/docs/source/api/paddlespeech.audio.utils.error.rst @@ -0,0 +1,7 @@ +paddlespeech.audio.utils.error module +===================================== + +.. automodule:: paddlespeech.audio.utils.error + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.audio.utils.log.rst b/docs/source/api/paddlespeech.audio.utils.log.rst new file mode 100644 index 000000000..9094b0993 --- /dev/null +++ b/docs/source/api/paddlespeech.audio.utils.log.rst @@ -0,0 +1,7 @@ +paddlespeech.audio.utils.log module +=================================== + +.. automodule:: paddlespeech.audio.utils.log + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.audio.utils.numeric.rst b/docs/source/api/paddlespeech.audio.utils.numeric.rst new file mode 100644 index 000000000..2ad0ff2cd --- /dev/null +++ b/docs/source/api/paddlespeech.audio.utils.numeric.rst @@ -0,0 +1,7 @@ +paddlespeech.audio.utils.numeric module +======================================= + +.. 
automodule:: paddlespeech.audio.utils.numeric + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.audio.utils.rst b/docs/source/api/paddlespeech.audio.utils.rst new file mode 100644 index 000000000..db15927da --- /dev/null +++ b/docs/source/api/paddlespeech.audio.utils.rst @@ -0,0 +1,19 @@ +paddlespeech.audio.utils package +================================ + +.. automodule:: paddlespeech.audio.utils + :members: + :undoc-members: + :show-inheritance: + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.audio.utils.download + paddlespeech.audio.utils.error + paddlespeech.audio.utils.log + paddlespeech.audio.utils.numeric + paddlespeech.audio.utils.time diff --git a/docs/source/api/paddlespeech.audio.utils.time.rst b/docs/source/api/paddlespeech.audio.utils.time.rst new file mode 100644 index 000000000..ebfae6edd --- /dev/null +++ b/docs/source/api/paddlespeech.audio.utils.time.rst @@ -0,0 +1,7 @@ +paddlespeech.audio.utils.time module +==================================== + +.. automodule:: paddlespeech.audio.utils.time + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.cli.asr.infer.rst b/docs/source/api/paddlespeech.cli.asr.infer.rst new file mode 100644 index 000000000..bb201c30c --- /dev/null +++ b/docs/source/api/paddlespeech.cli.asr.infer.rst @@ -0,0 +1,7 @@ +paddlespeech.cli.asr.infer module +================================= + +.. automodule:: paddlespeech.cli.asr.infer + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.cli.asr.rst b/docs/source/api/paddlespeech.cli.asr.rst new file mode 100644 index 000000000..e7d831042 --- /dev/null +++ b/docs/source/api/paddlespeech.cli.asr.rst @@ -0,0 +1,15 @@ +paddlespeech.cli.asr package +============================ + +.. automodule:: paddlespeech.cli.asr + :members: + :undoc-members: + :show-inheritance: + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.cli.asr.infer diff --git a/docs/source/api/paddlespeech.cli.base_commands.rst b/docs/source/api/paddlespeech.cli.base_commands.rst new file mode 100644 index 000000000..44bb55d58 --- /dev/null +++ b/docs/source/api/paddlespeech.cli.base_commands.rst @@ -0,0 +1,7 @@ +paddlespeech.cli.base\_commands module +====================================== + +.. automodule:: paddlespeech.cli.base_commands + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.cli.cls.infer.rst b/docs/source/api/paddlespeech.cli.cls.infer.rst new file mode 100644 index 000000000..c29a70496 --- /dev/null +++ b/docs/source/api/paddlespeech.cli.cls.infer.rst @@ -0,0 +1,7 @@ +paddlespeech.cli.cls.infer module +================================= + +.. automodule:: paddlespeech.cli.cls.infer + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.cli.cls.rst b/docs/source/api/paddlespeech.cli.cls.rst new file mode 100644 index 000000000..70a28fce8 --- /dev/null +++ b/docs/source/api/paddlespeech.cli.cls.rst @@ -0,0 +1,15 @@ +paddlespeech.cli.cls package +============================ + +.. automodule:: paddlespeech.cli.cls + :members: + :undoc-members: + :show-inheritance: + +Submodules +---------- + +.. 
toctree:: + :maxdepth: 4 + + paddlespeech.cli.cls.infer diff --git a/docs/source/api/paddlespeech.cli.download.rst b/docs/source/api/paddlespeech.cli.download.rst new file mode 100644 index 000000000..55058bd7c --- /dev/null +++ b/docs/source/api/paddlespeech.cli.download.rst @@ -0,0 +1,7 @@ +paddlespeech.cli.download module +================================ + +.. automodule:: paddlespeech.cli.download + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.cli.entry.rst b/docs/source/api/paddlespeech.cli.entry.rst new file mode 100644 index 000000000..8adfa9289 --- /dev/null +++ b/docs/source/api/paddlespeech.cli.entry.rst @@ -0,0 +1,7 @@ +paddlespeech.cli.entry module +============================= + +.. automodule:: paddlespeech.cli.entry + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.cli.executor.rst b/docs/source/api/paddlespeech.cli.executor.rst new file mode 100644 index 000000000..0a398d3e8 --- /dev/null +++ b/docs/source/api/paddlespeech.cli.executor.rst @@ -0,0 +1,7 @@ +paddlespeech.cli.executor module +================================ + +.. automodule:: paddlespeech.cli.executor + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.cli.kws.infer.rst b/docs/source/api/paddlespeech.cli.kws.infer.rst new file mode 100644 index 000000000..8ca50379f --- /dev/null +++ b/docs/source/api/paddlespeech.cli.kws.infer.rst @@ -0,0 +1,7 @@ +paddlespeech.cli.kws.infer module +================================= + +.. automodule:: paddlespeech.cli.kws.infer + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.cli.kws.rst b/docs/source/api/paddlespeech.cli.kws.rst new file mode 100644 index 000000000..d47b56370 --- /dev/null +++ b/docs/source/api/paddlespeech.cli.kws.rst @@ -0,0 +1,15 @@ +paddlespeech.cli.kws package +============================ + +.. automodule:: paddlespeech.cli.kws + :members: + :undoc-members: + :show-inheritance: + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.cli.kws.infer diff --git a/docs/source/api/paddlespeech.cli.log.rst b/docs/source/api/paddlespeech.cli.log.rst new file mode 100644 index 000000000..b1097c492 --- /dev/null +++ b/docs/source/api/paddlespeech.cli.log.rst @@ -0,0 +1,7 @@ +paddlespeech.cli.log module +=========================== + +.. automodule:: paddlespeech.cli.log + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.cli.rst b/docs/source/api/paddlespeech.cli.rst new file mode 100644 index 000000000..bd147b9d8 --- /dev/null +++ b/docs/source/api/paddlespeech.cli.rst @@ -0,0 +1,34 @@ +paddlespeech.cli package +======================== + +.. automodule:: paddlespeech.cli + :members: + :undoc-members: + :show-inheritance: + +Subpackages +----------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.cli.asr + paddlespeech.cli.cls + paddlespeech.cli.kws + paddlespeech.cli.st + paddlespeech.cli.text + paddlespeech.cli.tts + paddlespeech.cli.vector + +Submodules +---------- + +.. 
toctree:: + :maxdepth: 4 + + paddlespeech.cli.base_commands + paddlespeech.cli.download + paddlespeech.cli.entry + paddlespeech.cli.executor + paddlespeech.cli.log + paddlespeech.cli.utils diff --git a/docs/source/api/paddlespeech.cli.st.infer.rst b/docs/source/api/paddlespeech.cli.st.infer.rst new file mode 100644 index 000000000..66d25cf0e --- /dev/null +++ b/docs/source/api/paddlespeech.cli.st.infer.rst @@ -0,0 +1,7 @@ +paddlespeech.cli.st.infer module +================================ + +.. automodule:: paddlespeech.cli.st.infer + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.cli.st.rst b/docs/source/api/paddlespeech.cli.st.rst new file mode 100644 index 000000000..e4c6860d2 --- /dev/null +++ b/docs/source/api/paddlespeech.cli.st.rst @@ -0,0 +1,15 @@ +paddlespeech.cli.st package +=========================== + +.. automodule:: paddlespeech.cli.st + :members: + :undoc-members: + :show-inheritance: + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.cli.st.infer diff --git a/docs/source/api/paddlespeech.cli.text.infer.rst b/docs/source/api/paddlespeech.cli.text.infer.rst new file mode 100644 index 000000000..5903abc82 --- /dev/null +++ b/docs/source/api/paddlespeech.cli.text.infer.rst @@ -0,0 +1,7 @@ +paddlespeech.cli.text.infer module +================================== + +.. automodule:: paddlespeech.cli.text.infer + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.cli.text.rst b/docs/source/api/paddlespeech.cli.text.rst new file mode 100644 index 000000000..b0e647083 --- /dev/null +++ b/docs/source/api/paddlespeech.cli.text.rst @@ -0,0 +1,15 @@ +paddlespeech.cli.text package +============================= + +.. automodule:: paddlespeech.cli.text + :members: + :undoc-members: + :show-inheritance: + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.cli.text.infer diff --git a/docs/source/api/paddlespeech.cli.tts.infer.rst b/docs/source/api/paddlespeech.cli.tts.infer.rst new file mode 100644 index 000000000..91bcbea97 --- /dev/null +++ b/docs/source/api/paddlespeech.cli.tts.infer.rst @@ -0,0 +1,7 @@ +paddlespeech.cli.tts.infer module +================================= + +.. automodule:: paddlespeech.cli.tts.infer + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.cli.tts.rst b/docs/source/api/paddlespeech.cli.tts.rst new file mode 100644 index 000000000..9518e74b2 --- /dev/null +++ b/docs/source/api/paddlespeech.cli.tts.rst @@ -0,0 +1,15 @@ +paddlespeech.cli.tts package +============================ + +.. automodule:: paddlespeech.cli.tts + :members: + :undoc-members: + :show-inheritance: + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.cli.tts.infer diff --git a/docs/source/api/paddlespeech.cli.utils.rst b/docs/source/api/paddlespeech.cli.utils.rst new file mode 100644 index 000000000..5d5b05890 --- /dev/null +++ b/docs/source/api/paddlespeech.cli.utils.rst @@ -0,0 +1,7 @@ +paddlespeech.cli.utils module +============================= + +.. automodule:: paddlespeech.cli.utils + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.cli.vector.infer.rst b/docs/source/api/paddlespeech.cli.vector.infer.rst new file mode 100644 index 000000000..5fcafb286 --- /dev/null +++ b/docs/source/api/paddlespeech.cli.vector.infer.rst @@ -0,0 +1,7 @@ +paddlespeech.cli.vector.infer module +==================================== + +.. 
automodule:: paddlespeech.cli.vector.infer + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.cli.vector.rst b/docs/source/api/paddlespeech.cli.vector.rst new file mode 100644 index 000000000..215b2602a --- /dev/null +++ b/docs/source/api/paddlespeech.cli.vector.rst @@ -0,0 +1,15 @@ +paddlespeech.cli.vector package +=============================== + +.. automodule:: paddlespeech.cli.vector + :members: + :undoc-members: + :show-inheritance: + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.cli.vector.infer diff --git a/docs/source/api/paddlespeech.cls.exps.panns.deploy.predict.rst b/docs/source/api/paddlespeech.cls.exps.panns.deploy.predict.rst new file mode 100644 index 000000000..d4f92a2ea --- /dev/null +++ b/docs/source/api/paddlespeech.cls.exps.panns.deploy.predict.rst @@ -0,0 +1,7 @@ +paddlespeech.cls.exps.panns.deploy.predict module +================================================= + +.. automodule:: paddlespeech.cls.exps.panns.deploy.predict + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.cls.exps.panns.deploy.rst b/docs/source/api/paddlespeech.cls.exps.panns.deploy.rst new file mode 100644 index 000000000..4415c9330 --- /dev/null +++ b/docs/source/api/paddlespeech.cls.exps.panns.deploy.rst @@ -0,0 +1,15 @@ +paddlespeech.cls.exps.panns.deploy package +========================================== + +.. automodule:: paddlespeech.cls.exps.panns.deploy + :members: + :undoc-members: + :show-inheritance: + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.cls.exps.panns.deploy.predict diff --git a/docs/source/api/paddlespeech.cls.exps.panns.export_model.rst b/docs/source/api/paddlespeech.cls.exps.panns.export_model.rst new file mode 100644 index 000000000..6c39c2bc8 --- /dev/null +++ b/docs/source/api/paddlespeech.cls.exps.panns.export_model.rst @@ -0,0 +1,7 @@ +paddlespeech.cls.exps.panns.export\_model module +================================================ + +.. automodule:: paddlespeech.cls.exps.panns.export_model + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.cls.exps.panns.predict.rst b/docs/source/api/paddlespeech.cls.exps.panns.predict.rst new file mode 100644 index 000000000..88cd40338 --- /dev/null +++ b/docs/source/api/paddlespeech.cls.exps.panns.predict.rst @@ -0,0 +1,7 @@ +paddlespeech.cls.exps.panns.predict module +========================================== + +.. automodule:: paddlespeech.cls.exps.panns.predict + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.cls.exps.panns.rst b/docs/source/api/paddlespeech.cls.exps.panns.rst new file mode 100644 index 000000000..6147b245e --- /dev/null +++ b/docs/source/api/paddlespeech.cls.exps.panns.rst @@ -0,0 +1,25 @@ +paddlespeech.cls.exps.panns package +=================================== + +.. automodule:: paddlespeech.cls.exps.panns + :members: + :undoc-members: + :show-inheritance: + +Subpackages +----------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.cls.exps.panns.deploy + +Submodules +---------- + +.. 
toctree:: + :maxdepth: 4 + + paddlespeech.cls.exps.panns.export_model + paddlespeech.cls.exps.panns.predict + paddlespeech.cls.exps.panns.train diff --git a/docs/source/api/paddlespeech.cls.exps.panns.train.rst b/docs/source/api/paddlespeech.cls.exps.panns.train.rst new file mode 100644 index 000000000..a89b7eecc --- /dev/null +++ b/docs/source/api/paddlespeech.cls.exps.panns.train.rst @@ -0,0 +1,7 @@ +paddlespeech.cls.exps.panns.train module +======================================== + +.. automodule:: paddlespeech.cls.exps.panns.train + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.cls.exps.rst b/docs/source/api/paddlespeech.cls.exps.rst new file mode 100644 index 000000000..39c79c11e --- /dev/null +++ b/docs/source/api/paddlespeech.cls.exps.rst @@ -0,0 +1,15 @@ +paddlespeech.cls.exps package +============================= + +.. automodule:: paddlespeech.cls.exps + :members: + :undoc-members: + :show-inheritance: + +Subpackages +----------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.cls.exps.panns diff --git a/docs/source/api/paddlespeech.cls.models.panns.classifier.rst b/docs/source/api/paddlespeech.cls.models.panns.classifier.rst new file mode 100644 index 000000000..724174e4f --- /dev/null +++ b/docs/source/api/paddlespeech.cls.models.panns.classifier.rst @@ -0,0 +1,7 @@ +paddlespeech.cls.models.panns.classifier module +=============================================== + +.. automodule:: paddlespeech.cls.models.panns.classifier + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.cls.models.panns.panns.rst b/docs/source/api/paddlespeech.cls.models.panns.panns.rst new file mode 100644 index 000000000..76691b119 --- /dev/null +++ b/docs/source/api/paddlespeech.cls.models.panns.panns.rst @@ -0,0 +1,7 @@ +paddlespeech.cls.models.panns.panns module +========================================== + +.. automodule:: paddlespeech.cls.models.panns.panns + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.cls.models.panns.rst b/docs/source/api/paddlespeech.cls.models.panns.rst new file mode 100644 index 000000000..c67bd361f --- /dev/null +++ b/docs/source/api/paddlespeech.cls.models.panns.rst @@ -0,0 +1,16 @@ +paddlespeech.cls.models.panns package +===================================== + +.. automodule:: paddlespeech.cls.models.panns + :members: + :undoc-members: + :show-inheritance: + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.cls.models.panns.classifier + paddlespeech.cls.models.panns.panns diff --git a/docs/source/api/paddlespeech.cls.models.rst b/docs/source/api/paddlespeech.cls.models.rst new file mode 100644 index 000000000..c7eb960e6 --- /dev/null +++ b/docs/source/api/paddlespeech.cls.models.rst @@ -0,0 +1,15 @@ +paddlespeech.cls.models package +=============================== + +.. automodule:: paddlespeech.cls.models + :members: + :undoc-members: + :show-inheritance: + +Subpackages +----------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.cls.models.panns diff --git a/docs/source/api/paddlespeech.cls.rst b/docs/source/api/paddlespeech.cls.rst new file mode 100644 index 000000000..8302a0da7 --- /dev/null +++ b/docs/source/api/paddlespeech.cls.rst @@ -0,0 +1,16 @@ +paddlespeech.cls package +======================== + +.. automodule:: paddlespeech.cls + :members: + :undoc-members: + :show-inheritance: + +Subpackages +----------- + +.. 
toctree:: + :maxdepth: 4 + + paddlespeech.cls.exps + paddlespeech.cls.models diff --git a/docs/source/api/paddlespeech.kws.models.loss.rst b/docs/source/api/paddlespeech.kws.models.loss.rst new file mode 100644 index 000000000..8ebbc88f3 --- /dev/null +++ b/docs/source/api/paddlespeech.kws.models.loss.rst @@ -0,0 +1,7 @@ +paddlespeech.kws.models.loss module +=================================== + +.. automodule:: paddlespeech.kws.models.loss + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.kws.models.mdtc.rst b/docs/source/api/paddlespeech.kws.models.mdtc.rst new file mode 100644 index 000000000..c5bc1cb06 --- /dev/null +++ b/docs/source/api/paddlespeech.kws.models.mdtc.rst @@ -0,0 +1,7 @@ +paddlespeech.kws.models.mdtc module +=================================== + +.. automodule:: paddlespeech.kws.models.mdtc + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.kws.models.rst b/docs/source/api/paddlespeech.kws.models.rst new file mode 100644 index 000000000..62d350ac3 --- /dev/null +++ b/docs/source/api/paddlespeech.kws.models.rst @@ -0,0 +1,16 @@ +paddlespeech.kws.models package +=============================== + +.. automodule:: paddlespeech.kws.models + :members: + :undoc-members: + :show-inheritance: + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.kws.models.loss + paddlespeech.kws.models.mdtc diff --git a/docs/source/api/paddlespeech.kws.rst b/docs/source/api/paddlespeech.kws.rst new file mode 100644 index 000000000..c2829a42e --- /dev/null +++ b/docs/source/api/paddlespeech.kws.rst @@ -0,0 +1,15 @@ +paddlespeech.kws package +======================== + +.. automodule:: paddlespeech.kws + :members: + :undoc-members: + :show-inheritance: + +Subpackages +----------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.kws.models diff --git a/docs/source/api/paddlespeech.rst b/docs/source/api/paddlespeech.rst new file mode 100644 index 000000000..e7a01bf76 --- /dev/null +++ b/docs/source/api/paddlespeech.rst @@ -0,0 +1,23 @@ +paddlespeech package +==================== + +.. automodule:: paddlespeech + :members: + :undoc-members: + :show-inheritance: + +Subpackages +----------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.audio + paddlespeech.cli + paddlespeech.cls + paddlespeech.kws + paddlespeech.s2t + paddlespeech.server + paddlespeech.t2s + paddlespeech.text + paddlespeech.vector diff --git a/docs/source/api/paddlespeech.s2t.decoders.beam_search.batch_beam_search.rst b/docs/source/api/paddlespeech.s2t.decoders.beam_search.batch_beam_search.rst new file mode 100644 index 000000000..98227557b --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.decoders.beam_search.batch_beam_search.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.decoders.beam\_search.batch\_beam\_search module +================================================================= + +.. automodule:: paddlespeech.s2t.decoders.beam_search.batch_beam_search + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.decoders.beam_search.beam_search.rst b/docs/source/api/paddlespeech.s2t.decoders.beam_search.beam_search.rst new file mode 100644 index 000000000..38f3cf2ad --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.decoders.beam_search.beam_search.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.decoders.beam\_search.beam\_search module +========================================================== + +.. 
automodule:: paddlespeech.s2t.decoders.beam_search.beam_search + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.decoders.beam_search.rst b/docs/source/api/paddlespeech.s2t.decoders.beam_search.rst new file mode 100644 index 000000000..c67b12a40 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.decoders.beam_search.rst @@ -0,0 +1,16 @@ +paddlespeech.s2t.decoders.beam\_search package +============================================== + +.. automodule:: paddlespeech.s2t.decoders.beam_search + :members: + :undoc-members: + :show-inheritance: + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.s2t.decoders.beam_search.batch_beam_search + paddlespeech.s2t.decoders.beam_search.beam_search diff --git a/docs/source/api/paddlespeech.s2t.decoders.ctcdecoder.decoders_deprecated.rst b/docs/source/api/paddlespeech.s2t.decoders.ctcdecoder.decoders_deprecated.rst new file mode 100644 index 000000000..61eeba73d --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.decoders.ctcdecoder.decoders_deprecated.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.decoders.ctcdecoder.decoders\_deprecated module +================================================================ + +.. automodule:: paddlespeech.s2t.decoders.ctcdecoder.decoders_deprecated + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.decoders.ctcdecoder.rst b/docs/source/api/paddlespeech.s2t.decoders.ctcdecoder.rst new file mode 100644 index 000000000..8093619b1 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.decoders.ctcdecoder.rst @@ -0,0 +1,17 @@ +paddlespeech.s2t.decoders.ctcdecoder package +============================================ + +.. automodule:: paddlespeech.s2t.decoders.ctcdecoder + :members: + :undoc-members: + :show-inheritance: + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.s2t.decoders.ctcdecoder.decoders_deprecated + paddlespeech.s2t.decoders.ctcdecoder.scorer_deprecated + paddlespeech.s2t.decoders.ctcdecoder.swig_wrapper diff --git a/docs/source/api/paddlespeech.s2t.decoders.ctcdecoder.scorer_deprecated.rst b/docs/source/api/paddlespeech.s2t.decoders.ctcdecoder.scorer_deprecated.rst new file mode 100644 index 000000000..1079d6721 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.decoders.ctcdecoder.scorer_deprecated.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.decoders.ctcdecoder.scorer\_deprecated module +============================================================== + +.. automodule:: paddlespeech.s2t.decoders.ctcdecoder.scorer_deprecated + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.decoders.ctcdecoder.swig_wrapper.rst b/docs/source/api/paddlespeech.s2t.decoders.ctcdecoder.swig_wrapper.rst new file mode 100644 index 000000000..ba6fef3ed --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.decoders.ctcdecoder.swig_wrapper.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.decoders.ctcdecoder.swig\_wrapper module +========================================================= + +.. automodule:: paddlespeech.s2t.decoders.ctcdecoder.swig_wrapper + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.decoders.recog.rst b/docs/source/api/paddlespeech.s2t.decoders.recog.rst new file mode 100644 index 000000000..a1d2736e0 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.decoders.recog.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.decoders.recog module +====================================== + +.. 
automodule:: paddlespeech.s2t.decoders.recog + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.decoders.recog_bin.rst b/docs/source/api/paddlespeech.s2t.decoders.recog_bin.rst new file mode 100644 index 000000000..4952e2e6a --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.decoders.recog_bin.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.decoders.recog\_bin module +=========================================== + +.. automodule:: paddlespeech.s2t.decoders.recog_bin + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.decoders.rst b/docs/source/api/paddlespeech.s2t.decoders.rst new file mode 100644 index 000000000..e4eabedfd --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.decoders.rst @@ -0,0 +1,27 @@ +paddlespeech.s2t.decoders package +================================= + +.. automodule:: paddlespeech.s2t.decoders + :members: + :undoc-members: + :show-inheritance: + +Subpackages +----------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.s2t.decoders.beam_search + paddlespeech.s2t.decoders.ctcdecoder + paddlespeech.s2t.decoders.scorers + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.s2t.decoders.recog + paddlespeech.s2t.decoders.recog_bin + paddlespeech.s2t.decoders.utils diff --git a/docs/source/api/paddlespeech.s2t.decoders.scorers.ctc.rst b/docs/source/api/paddlespeech.s2t.decoders.scorers.ctc.rst new file mode 100644 index 000000000..1de7174c2 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.decoders.scorers.ctc.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.decoders.scorers.ctc module +============================================ + +.. automodule:: paddlespeech.s2t.decoders.scorers.ctc + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.decoders.scorers.ctc_prefix_score.rst b/docs/source/api/paddlespeech.s2t.decoders.scorers.ctc_prefix_score.rst new file mode 100644 index 000000000..1a41d38dd --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.decoders.scorers.ctc_prefix_score.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.decoders.scorers.ctc\_prefix\_score module +=========================================================== + +.. automodule:: paddlespeech.s2t.decoders.scorers.ctc_prefix_score + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.decoders.scorers.length_bonus.rst b/docs/source/api/paddlespeech.s2t.decoders.scorers.length_bonus.rst new file mode 100644 index 000000000..0833c919e --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.decoders.scorers.length_bonus.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.decoders.scorers.length\_bonus module +====================================================== + +.. automodule:: paddlespeech.s2t.decoders.scorers.length_bonus + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.decoders.scorers.ngram.rst b/docs/source/api/paddlespeech.s2t.decoders.scorers.ngram.rst new file mode 100644 index 000000000..f38a61099 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.decoders.scorers.ngram.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.decoders.scorers.ngram module +============================================== + +.. 
automodule:: paddlespeech.s2t.decoders.scorers.ngram + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.decoders.scorers.rst b/docs/source/api/paddlespeech.s2t.decoders.scorers.rst new file mode 100644 index 000000000..83808c49b --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.decoders.scorers.rst @@ -0,0 +1,19 @@ +paddlespeech.s2t.decoders.scorers package +========================================= + +.. automodule:: paddlespeech.s2t.decoders.scorers + :members: + :undoc-members: + :show-inheritance: + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.s2t.decoders.scorers.ctc + paddlespeech.s2t.decoders.scorers.ctc_prefix_score + paddlespeech.s2t.decoders.scorers.length_bonus + paddlespeech.s2t.decoders.scorers.ngram + paddlespeech.s2t.decoders.scorers.scorer_interface diff --git a/docs/source/api/paddlespeech.s2t.decoders.scorers.scorer_interface.rst b/docs/source/api/paddlespeech.s2t.decoders.scorers.scorer_interface.rst new file mode 100644 index 000000000..26a205168 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.decoders.scorers.scorer_interface.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.decoders.scorers.scorer\_interface module +========================================================== + +.. automodule:: paddlespeech.s2t.decoders.scorers.scorer_interface + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.decoders.utils.rst b/docs/source/api/paddlespeech.s2t.decoders.utils.rst new file mode 100644 index 000000000..19556f94b --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.decoders.utils.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.decoders.utils module +====================================== + +.. automodule:: paddlespeech.s2t.decoders.utils + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.exps.deepspeech2.bin.deploy.client.rst b/docs/source/api/paddlespeech.s2t.exps.deepspeech2.bin.deploy.client.rst new file mode 100644 index 000000000..a73a56853 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.exps.deepspeech2.bin.deploy.client.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.exps.deepspeech2.bin.deploy.client module +========================================================== + +.. automodule:: paddlespeech.s2t.exps.deepspeech2.bin.deploy.client + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.exps.deepspeech2.bin.deploy.record.rst b/docs/source/api/paddlespeech.s2t.exps.deepspeech2.bin.deploy.record.rst new file mode 100644 index 000000000..bc1078485 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.exps.deepspeech2.bin.deploy.record.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.exps.deepspeech2.bin.deploy.record module +========================================================== + +.. automodule:: paddlespeech.s2t.exps.deepspeech2.bin.deploy.record + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.exps.deepspeech2.bin.deploy.rst b/docs/source/api/paddlespeech.s2t.exps.deepspeech2.bin.deploy.rst new file mode 100644 index 000000000..d1f966fc1 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.exps.deepspeech2.bin.deploy.rst @@ -0,0 +1,19 @@ +paddlespeech.s2t.exps.deepspeech2.bin.deploy package +==================================================== + +.. automodule:: paddlespeech.s2t.exps.deepspeech2.bin.deploy + :members: + :undoc-members: + :show-inheritance: + +Submodules +---------- + +.. 
toctree:: + :maxdepth: 4 + + paddlespeech.s2t.exps.deepspeech2.bin.deploy.client + paddlespeech.s2t.exps.deepspeech2.bin.deploy.record + paddlespeech.s2t.exps.deepspeech2.bin.deploy.runtime + paddlespeech.s2t.exps.deepspeech2.bin.deploy.send + paddlespeech.s2t.exps.deepspeech2.bin.deploy.server diff --git a/docs/source/api/paddlespeech.s2t.exps.deepspeech2.bin.deploy.runtime.rst b/docs/source/api/paddlespeech.s2t.exps.deepspeech2.bin.deploy.runtime.rst new file mode 100644 index 000000000..560f02ced --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.exps.deepspeech2.bin.deploy.runtime.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.exps.deepspeech2.bin.deploy.runtime module +=========================================================== + +.. automodule:: paddlespeech.s2t.exps.deepspeech2.bin.deploy.runtime + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.exps.deepspeech2.bin.deploy.send.rst b/docs/source/api/paddlespeech.s2t.exps.deepspeech2.bin.deploy.send.rst new file mode 100644 index 000000000..ba1ae0a62 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.exps.deepspeech2.bin.deploy.send.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.exps.deepspeech2.bin.deploy.send module +======================================================== + +.. automodule:: paddlespeech.s2t.exps.deepspeech2.bin.deploy.send + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.exps.deepspeech2.bin.deploy.server.rst b/docs/source/api/paddlespeech.s2t.exps.deepspeech2.bin.deploy.server.rst new file mode 100644 index 000000000..cb74f07bd --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.exps.deepspeech2.bin.deploy.server.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.exps.deepspeech2.bin.deploy.server module +========================================================== + +.. automodule:: paddlespeech.s2t.exps.deepspeech2.bin.deploy.server + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.exps.deepspeech2.bin.export.rst b/docs/source/api/paddlespeech.s2t.exps.deepspeech2.bin.export.rst new file mode 100644 index 000000000..adc282d18 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.exps.deepspeech2.bin.export.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.exps.deepspeech2.bin.export module +=================================================== + +.. automodule:: paddlespeech.s2t.exps.deepspeech2.bin.export + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.exps.deepspeech2.bin.rst b/docs/source/api/paddlespeech.s2t.exps.deepspeech2.bin.rst new file mode 100644 index 000000000..b7942220e --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.exps.deepspeech2.bin.rst @@ -0,0 +1,27 @@ +paddlespeech.s2t.exps.deepspeech2.bin package +============================================= + +.. automodule:: paddlespeech.s2t.exps.deepspeech2.bin + :members: + :undoc-members: + :show-inheritance: + +Subpackages +----------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.s2t.exps.deepspeech2.bin.deploy + +Submodules +---------- + +.. 
toctree:: + :maxdepth: 4 + + paddlespeech.s2t.exps.deepspeech2.bin.export + paddlespeech.s2t.exps.deepspeech2.bin.test + paddlespeech.s2t.exps.deepspeech2.bin.test_export + paddlespeech.s2t.exps.deepspeech2.bin.test_wav + paddlespeech.s2t.exps.deepspeech2.bin.train diff --git a/docs/source/api/paddlespeech.s2t.exps.deepspeech2.bin.test.rst b/docs/source/api/paddlespeech.s2t.exps.deepspeech2.bin.test.rst new file mode 100644 index 000000000..64ac44cde --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.exps.deepspeech2.bin.test.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.exps.deepspeech2.bin.test module +================================================= + +.. automodule:: paddlespeech.s2t.exps.deepspeech2.bin.test + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.exps.deepspeech2.bin.test_export.rst b/docs/source/api/paddlespeech.s2t.exps.deepspeech2.bin.test_export.rst new file mode 100644 index 000000000..89cdca3a6 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.exps.deepspeech2.bin.test_export.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.exps.deepspeech2.bin.test\_export module +========================================================= + +.. automodule:: paddlespeech.s2t.exps.deepspeech2.bin.test_export + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.exps.deepspeech2.bin.test_wav.rst b/docs/source/api/paddlespeech.s2t.exps.deepspeech2.bin.test_wav.rst new file mode 100644 index 000000000..d77083173 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.exps.deepspeech2.bin.test_wav.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.exps.deepspeech2.bin.test\_wav module +====================================================== + +.. automodule:: paddlespeech.s2t.exps.deepspeech2.bin.test_wav + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.exps.deepspeech2.bin.train.rst b/docs/source/api/paddlespeech.s2t.exps.deepspeech2.bin.train.rst new file mode 100644 index 000000000..b908ccbfd --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.exps.deepspeech2.bin.train.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.exps.deepspeech2.bin.train module +================================================== + +.. automodule:: paddlespeech.s2t.exps.deepspeech2.bin.train + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.exps.deepspeech2.model.rst b/docs/source/api/paddlespeech.s2t.exps.deepspeech2.model.rst new file mode 100644 index 000000000..16206c4be --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.exps.deepspeech2.model.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.exps.deepspeech2.model module +============================================== + +.. automodule:: paddlespeech.s2t.exps.deepspeech2.model + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.exps.deepspeech2.rst b/docs/source/api/paddlespeech.s2t.exps.deepspeech2.rst new file mode 100644 index 000000000..24e1b287e --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.exps.deepspeech2.rst @@ -0,0 +1,23 @@ +paddlespeech.s2t.exps.deepspeech2 package +========================================= + +.. automodule:: paddlespeech.s2t.exps.deepspeech2 + :members: + :undoc-members: + :show-inheritance: + +Subpackages +----------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.s2t.exps.deepspeech2.bin + +Submodules +---------- + +.. 
toctree:: + :maxdepth: 4 + + paddlespeech.s2t.exps.deepspeech2.model diff --git a/docs/source/api/paddlespeech.s2t.exps.rst b/docs/source/api/paddlespeech.s2t.exps.rst new file mode 100644 index 000000000..344f1e692 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.exps.rst @@ -0,0 +1,18 @@ +paddlespeech.s2t.exps package +============================= + +.. automodule:: paddlespeech.s2t.exps + :members: + :undoc-members: + :show-inheritance: + +Subpackages +----------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.s2t.exps.deepspeech2 + paddlespeech.s2t.exps.u2 + paddlespeech.s2t.exps.u2_kaldi + paddlespeech.s2t.exps.u2_st diff --git a/docs/source/api/paddlespeech.s2t.exps.u2.bin.alignment.rst b/docs/source/api/paddlespeech.s2t.exps.u2.bin.alignment.rst new file mode 100644 index 000000000..09b9a1d95 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.exps.u2.bin.alignment.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.exps.u2.bin.alignment module +============================================= + +.. automodule:: paddlespeech.s2t.exps.u2.bin.alignment + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.exps.u2.bin.export.rst b/docs/source/api/paddlespeech.s2t.exps.u2.bin.export.rst new file mode 100644 index 000000000..878e2dccc --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.exps.u2.bin.export.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.exps.u2.bin.export module +========================================== + +.. automodule:: paddlespeech.s2t.exps.u2.bin.export + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.exps.u2.bin.rst b/docs/source/api/paddlespeech.s2t.exps.u2.bin.rst new file mode 100644 index 000000000..3b3542e70 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.exps.u2.bin.rst @@ -0,0 +1,19 @@ +paddlespeech.s2t.exps.u2.bin package +==================================== + +.. automodule:: paddlespeech.s2t.exps.u2.bin + :members: + :undoc-members: + :show-inheritance: + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.s2t.exps.u2.bin.alignment + paddlespeech.s2t.exps.u2.bin.export + paddlespeech.s2t.exps.u2.bin.test + paddlespeech.s2t.exps.u2.bin.test_wav + paddlespeech.s2t.exps.u2.bin.train diff --git a/docs/source/api/paddlespeech.s2t.exps.u2.bin.test.rst b/docs/source/api/paddlespeech.s2t.exps.u2.bin.test.rst new file mode 100644 index 000000000..470b4ab66 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.exps.u2.bin.test.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.exps.u2.bin.test module +======================================== + +.. automodule:: paddlespeech.s2t.exps.u2.bin.test + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.exps.u2.bin.test_wav.rst b/docs/source/api/paddlespeech.s2t.exps.u2.bin.test_wav.rst new file mode 100644 index 000000000..e25372162 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.exps.u2.bin.test_wav.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.exps.u2.bin.test\_wav module +============================================= + +.. automodule:: paddlespeech.s2t.exps.u2.bin.test_wav + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.exps.u2.bin.train.rst b/docs/source/api/paddlespeech.s2t.exps.u2.bin.train.rst new file mode 100644 index 000000000..d8b8ca473 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.exps.u2.bin.train.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.exps.u2.bin.train module +========================================= + +.. 
automodule:: paddlespeech.s2t.exps.u2.bin.train + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.exps.u2.model.rst b/docs/source/api/paddlespeech.s2t.exps.u2.model.rst new file mode 100644 index 000000000..37e2f91e0 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.exps.u2.model.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.exps.u2.model module +===================================== + +.. automodule:: paddlespeech.s2t.exps.u2.model + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.exps.u2.rst b/docs/source/api/paddlespeech.s2t.exps.u2.rst new file mode 100644 index 000000000..e0ebb7fc9 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.exps.u2.rst @@ -0,0 +1,24 @@ +paddlespeech.s2t.exps.u2 package +================================ + +.. automodule:: paddlespeech.s2t.exps.u2 + :members: + :undoc-members: + :show-inheritance: + +Subpackages +----------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.s2t.exps.u2.bin + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.s2t.exps.u2.model + paddlespeech.s2t.exps.u2.trainer diff --git a/docs/source/api/paddlespeech.s2t.exps.u2.trainer.rst b/docs/source/api/paddlespeech.s2t.exps.u2.trainer.rst new file mode 100644 index 000000000..0cd28945a --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.exps.u2.trainer.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.exps.u2.trainer module +======================================= + +.. automodule:: paddlespeech.s2t.exps.u2.trainer + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.exps.u2_kaldi.bin.recog.rst b/docs/source/api/paddlespeech.s2t.exps.u2_kaldi.bin.recog.rst new file mode 100644 index 000000000..bc749c8f8 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.exps.u2_kaldi.bin.recog.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.exps.u2\_kaldi.bin.recog module +================================================ + +.. automodule:: paddlespeech.s2t.exps.u2_kaldi.bin.recog + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.exps.u2_kaldi.bin.rst b/docs/source/api/paddlespeech.s2t.exps.u2_kaldi.bin.rst new file mode 100644 index 000000000..ff1a6efee --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.exps.u2_kaldi.bin.rst @@ -0,0 +1,17 @@ +paddlespeech.s2t.exps.u2\_kaldi.bin package +=========================================== + +.. automodule:: paddlespeech.s2t.exps.u2_kaldi.bin + :members: + :undoc-members: + :show-inheritance: + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.s2t.exps.u2_kaldi.bin.recog + paddlespeech.s2t.exps.u2_kaldi.bin.test + paddlespeech.s2t.exps.u2_kaldi.bin.train diff --git a/docs/source/api/paddlespeech.s2t.exps.u2_kaldi.bin.test.rst b/docs/source/api/paddlespeech.s2t.exps.u2_kaldi.bin.test.rst new file mode 100644 index 000000000..b8bf93165 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.exps.u2_kaldi.bin.test.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.exps.u2\_kaldi.bin.test module +=============================================== + +.. 
automodule:: paddlespeech.s2t.exps.u2_kaldi.bin.test + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.exps.u2_kaldi.bin.train.rst b/docs/source/api/paddlespeech.s2t.exps.u2_kaldi.bin.train.rst new file mode 100644 index 000000000..bbf77bc90 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.exps.u2_kaldi.bin.train.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.exps.u2\_kaldi.bin.train module +================================================ + +.. automodule:: paddlespeech.s2t.exps.u2_kaldi.bin.train + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.exps.u2_kaldi.model.rst b/docs/source/api/paddlespeech.s2t.exps.u2_kaldi.model.rst new file mode 100644 index 000000000..60f12394f --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.exps.u2_kaldi.model.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.exps.u2\_kaldi.model module +============================================ + +.. automodule:: paddlespeech.s2t.exps.u2_kaldi.model + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.exps.u2_kaldi.rst b/docs/source/api/paddlespeech.s2t.exps.u2_kaldi.rst new file mode 100644 index 000000000..2b0539be1 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.exps.u2_kaldi.rst @@ -0,0 +1,23 @@ +paddlespeech.s2t.exps.u2\_kaldi package +======================================= + +.. automodule:: paddlespeech.s2t.exps.u2_kaldi + :members: + :undoc-members: + :show-inheritance: + +Subpackages +----------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.s2t.exps.u2_kaldi.bin + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.s2t.exps.u2_kaldi.model diff --git a/docs/source/api/paddlespeech.s2t.exps.u2_st.bin.export.rst b/docs/source/api/paddlespeech.s2t.exps.u2_st.bin.export.rst new file mode 100644 index 000000000..bdb8f0812 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.exps.u2_st.bin.export.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.exps.u2\_st.bin.export module +============================================== + +.. automodule:: paddlespeech.s2t.exps.u2_st.bin.export + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.exps.u2_st.bin.rst b/docs/source/api/paddlespeech.s2t.exps.u2_st.bin.rst new file mode 100644 index 000000000..1dc1402d8 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.exps.u2_st.bin.rst @@ -0,0 +1,17 @@ +paddlespeech.s2t.exps.u2\_st.bin package +======================================== + +.. automodule:: paddlespeech.s2t.exps.u2_st.bin + :members: + :undoc-members: + :show-inheritance: + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.s2t.exps.u2_st.bin.export + paddlespeech.s2t.exps.u2_st.bin.test + paddlespeech.s2t.exps.u2_st.bin.train diff --git a/docs/source/api/paddlespeech.s2t.exps.u2_st.bin.test.rst b/docs/source/api/paddlespeech.s2t.exps.u2_st.bin.test.rst new file mode 100644 index 000000000..3240bf100 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.exps.u2_st.bin.test.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.exps.u2\_st.bin.test module +============================================ + +.. 
automodule:: paddlespeech.s2t.exps.u2_st.bin.test + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.exps.u2_st.bin.train.rst b/docs/source/api/paddlespeech.s2t.exps.u2_st.bin.train.rst new file mode 100644 index 000000000..d52899a99 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.exps.u2_st.bin.train.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.exps.u2\_st.bin.train module +============================================= + +.. automodule:: paddlespeech.s2t.exps.u2_st.bin.train + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.exps.u2_st.model.rst b/docs/source/api/paddlespeech.s2t.exps.u2_st.model.rst new file mode 100644 index 000000000..a6d7431cf --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.exps.u2_st.model.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.exps.u2\_st.model module +========================================= + +.. automodule:: paddlespeech.s2t.exps.u2_st.model + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.exps.u2_st.rst b/docs/source/api/paddlespeech.s2t.exps.u2_st.rst new file mode 100644 index 000000000..17968d90e --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.exps.u2_st.rst @@ -0,0 +1,23 @@ +paddlespeech.s2t.exps.u2\_st package +==================================== + +.. automodule:: paddlespeech.s2t.exps.u2_st + :members: + :undoc-members: + :show-inheritance: + +Subpackages +----------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.s2t.exps.u2_st.bin + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.s2t.exps.u2_st.model diff --git a/docs/source/api/paddlespeech.s2t.frontend.audio.rst b/docs/source/api/paddlespeech.s2t.frontend.audio.rst new file mode 100644 index 000000000..1b90627d4 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.frontend.audio.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.frontend.audio module +====================================== + +.. automodule:: paddlespeech.s2t.frontend.audio + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.frontend.augmentor.augmentation.rst b/docs/source/api/paddlespeech.s2t.frontend.augmentor.augmentation.rst new file mode 100644 index 000000000..1bb42af16 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.frontend.augmentor.augmentation.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.frontend.augmentor.augmentation module +======================================================= + +.. automodule:: paddlespeech.s2t.frontend.augmentor.augmentation + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.frontend.augmentor.base.rst b/docs/source/api/paddlespeech.s2t.frontend.augmentor.base.rst new file mode 100644 index 000000000..d1fbcc6f6 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.frontend.augmentor.base.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.frontend.augmentor.base module +=============================================== + +.. automodule:: paddlespeech.s2t.frontend.augmentor.base + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.frontend.augmentor.impulse_response.rst b/docs/source/api/paddlespeech.s2t.frontend.augmentor.impulse_response.rst new file mode 100644 index 000000000..ce0cd26fa --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.frontend.augmentor.impulse_response.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.frontend.augmentor.impulse\_response module +============================================================ + +.. 
automodule:: paddlespeech.s2t.frontend.augmentor.impulse_response + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.frontend.augmentor.noise_perturb.rst b/docs/source/api/paddlespeech.s2t.frontend.augmentor.noise_perturb.rst new file mode 100644 index 000000000..49d61e8bf --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.frontend.augmentor.noise_perturb.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.frontend.augmentor.noise\_perturb module +========================================================= + +.. automodule:: paddlespeech.s2t.frontend.augmentor.noise_perturb + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.frontend.augmentor.online_bayesian_normalization.rst b/docs/source/api/paddlespeech.s2t.frontend.augmentor.online_bayesian_normalization.rst new file mode 100644 index 000000000..a84ca25a9 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.frontend.augmentor.online_bayesian_normalization.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.frontend.augmentor.online\_bayesian\_normalization module +========================================================================== + +.. automodule:: paddlespeech.s2t.frontend.augmentor.online_bayesian_normalization + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.frontend.augmentor.resample.rst b/docs/source/api/paddlespeech.s2t.frontend.augmentor.resample.rst new file mode 100644 index 000000000..fb8b5fddb --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.frontend.augmentor.resample.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.frontend.augmentor.resample module +=================================================== + +.. automodule:: paddlespeech.s2t.frontend.augmentor.resample + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.frontend.augmentor.rst b/docs/source/api/paddlespeech.s2t.frontend.augmentor.rst new file mode 100644 index 000000000..2aa99309a --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.frontend.augmentor.rst @@ -0,0 +1,24 @@ +paddlespeech.s2t.frontend.augmentor package +=========================================== + +.. automodule:: paddlespeech.s2t.frontend.augmentor + :members: + :undoc-members: + :show-inheritance: + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.s2t.frontend.augmentor.augmentation + paddlespeech.s2t.frontend.augmentor.base + paddlespeech.s2t.frontend.augmentor.impulse_response + paddlespeech.s2t.frontend.augmentor.noise_perturb + paddlespeech.s2t.frontend.augmentor.online_bayesian_normalization + paddlespeech.s2t.frontend.augmentor.resample + paddlespeech.s2t.frontend.augmentor.shift_perturb + paddlespeech.s2t.frontend.augmentor.spec_augment + paddlespeech.s2t.frontend.augmentor.speed_perturb + paddlespeech.s2t.frontend.augmentor.volume_perturb diff --git a/docs/source/api/paddlespeech.s2t.frontend.augmentor.shift_perturb.rst b/docs/source/api/paddlespeech.s2t.frontend.augmentor.shift_perturb.rst new file mode 100644 index 000000000..8b7997dda --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.frontend.augmentor.shift_perturb.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.frontend.augmentor.shift\_perturb module +========================================================= + +.. 
automodule:: paddlespeech.s2t.frontend.augmentor.shift_perturb + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.frontend.augmentor.spec_augment.rst b/docs/source/api/paddlespeech.s2t.frontend.augmentor.spec_augment.rst new file mode 100644 index 000000000..5fcce2c83 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.frontend.augmentor.spec_augment.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.frontend.augmentor.spec\_augment module +======================================================== + +.. automodule:: paddlespeech.s2t.frontend.augmentor.spec_augment + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.frontend.augmentor.speed_perturb.rst b/docs/source/api/paddlespeech.s2t.frontend.augmentor.speed_perturb.rst new file mode 100644 index 000000000..a0a1e590a --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.frontend.augmentor.speed_perturb.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.frontend.augmentor.speed\_perturb module +========================================================= + +.. automodule:: paddlespeech.s2t.frontend.augmentor.speed_perturb + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.frontend.augmentor.volume_perturb.rst b/docs/source/api/paddlespeech.s2t.frontend.augmentor.volume_perturb.rst new file mode 100644 index 000000000..9b4d3e0a8 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.frontend.augmentor.volume_perturb.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.frontend.augmentor.volume\_perturb module +========================================================== + +.. automodule:: paddlespeech.s2t.frontend.augmentor.volume_perturb + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.frontend.featurizer.audio_featurizer.rst b/docs/source/api/paddlespeech.s2t.frontend.featurizer.audio_featurizer.rst new file mode 100644 index 000000000..064b198e8 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.frontend.featurizer.audio_featurizer.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.frontend.featurizer.audio\_featurizer module +============================================================= + +.. automodule:: paddlespeech.s2t.frontend.featurizer.audio_featurizer + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.frontend.featurizer.rst b/docs/source/api/paddlespeech.s2t.frontend.featurizer.rst new file mode 100644 index 000000000..0a7aaa61c --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.frontend.featurizer.rst @@ -0,0 +1,17 @@ +paddlespeech.s2t.frontend.featurizer package +============================================ + +.. automodule:: paddlespeech.s2t.frontend.featurizer + :members: + :undoc-members: + :show-inheritance: + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.s2t.frontend.featurizer.audio_featurizer + paddlespeech.s2t.frontend.featurizer.speech_featurizer + paddlespeech.s2t.frontend.featurizer.text_featurizer diff --git a/docs/source/api/paddlespeech.s2t.frontend.featurizer.speech_featurizer.rst b/docs/source/api/paddlespeech.s2t.frontend.featurizer.speech_featurizer.rst new file mode 100644 index 000000000..dd8ff4fbe --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.frontend.featurizer.speech_featurizer.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.frontend.featurizer.speech\_featurizer module +============================================================== + +.. 
automodule:: paddlespeech.s2t.frontend.featurizer.speech_featurizer + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.frontend.featurizer.text_featurizer.rst b/docs/source/api/paddlespeech.s2t.frontend.featurizer.text_featurizer.rst new file mode 100644 index 000000000..1ca3f5df4 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.frontend.featurizer.text_featurizer.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.frontend.featurizer.text\_featurizer module +============================================================ + +.. automodule:: paddlespeech.s2t.frontend.featurizer.text_featurizer + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.frontend.normalizer.rst b/docs/source/api/paddlespeech.s2t.frontend.normalizer.rst new file mode 100644 index 000000000..de9f5f59a --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.frontend.normalizer.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.frontend.normalizer module +=========================================== + +.. automodule:: paddlespeech.s2t.frontend.normalizer + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.frontend.rst b/docs/source/api/paddlespeech.s2t.frontend.rst new file mode 100644 index 000000000..1a45ecf0d --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.frontend.rst @@ -0,0 +1,27 @@ +paddlespeech.s2t.frontend package +================================= + +.. automodule:: paddlespeech.s2t.frontend + :members: + :undoc-members: + :show-inheritance: + +Subpackages +----------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.s2t.frontend.augmentor + paddlespeech.s2t.frontend.featurizer + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.s2t.frontend.audio + paddlespeech.s2t.frontend.normalizer + paddlespeech.s2t.frontend.speech + paddlespeech.s2t.frontend.utility diff --git a/docs/source/api/paddlespeech.s2t.frontend.speech.rst b/docs/source/api/paddlespeech.s2t.frontend.speech.rst new file mode 100644 index 000000000..d30c59e94 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.frontend.speech.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.frontend.speech module +======================================= + +.. automodule:: paddlespeech.s2t.frontend.speech + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.frontend.utility.rst b/docs/source/api/paddlespeech.s2t.frontend.utility.rst new file mode 100644 index 000000000..679c7d866 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.frontend.utility.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.frontend.utility module +======================================== + +.. automodule:: paddlespeech.s2t.frontend.utility + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.io.batchfy.rst b/docs/source/api/paddlespeech.s2t.io.batchfy.rst new file mode 100644 index 000000000..0f86487ef --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.io.batchfy.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.io.batchfy module +================================== + +.. automodule:: paddlespeech.s2t.io.batchfy + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.io.collator.rst b/docs/source/api/paddlespeech.s2t.io.collator.rst new file mode 100644 index 000000000..6d5733bb7 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.io.collator.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.io.collator module +=================================== + +.. 
automodule:: paddlespeech.s2t.io.collator + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.io.converter.rst b/docs/source/api/paddlespeech.s2t.io.converter.rst new file mode 100644 index 000000000..9f686296f --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.io.converter.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.io.converter module +==================================== + +.. automodule:: paddlespeech.s2t.io.converter + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.io.dataloader.rst b/docs/source/api/paddlespeech.s2t.io.dataloader.rst new file mode 100644 index 000000000..539681557 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.io.dataloader.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.io.dataloader module +===================================== + +.. automodule:: paddlespeech.s2t.io.dataloader + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.io.dataset.rst b/docs/source/api/paddlespeech.s2t.io.dataset.rst new file mode 100644 index 000000000..b84fab0dd --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.io.dataset.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.io.dataset module +================================== + +.. automodule:: paddlespeech.s2t.io.dataset + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.io.reader.rst b/docs/source/api/paddlespeech.s2t.io.reader.rst new file mode 100644 index 000000000..fbae08042 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.io.reader.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.io.reader module +================================= + +.. automodule:: paddlespeech.s2t.io.reader + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.io.rst b/docs/source/api/paddlespeech.s2t.io.rst new file mode 100644 index 000000000..cbdce31a6 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.io.rst @@ -0,0 +1,22 @@ +paddlespeech.s2t.io package +=========================== + +.. automodule:: paddlespeech.s2t.io + :members: + :undoc-members: + :show-inheritance: + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.s2t.io.batchfy + paddlespeech.s2t.io.collator + paddlespeech.s2t.io.converter + paddlespeech.s2t.io.dataloader + paddlespeech.s2t.io.dataset + paddlespeech.s2t.io.reader + paddlespeech.s2t.io.sampler + paddlespeech.s2t.io.utility diff --git a/docs/source/api/paddlespeech.s2t.io.sampler.rst b/docs/source/api/paddlespeech.s2t.io.sampler.rst new file mode 100644 index 000000000..8ca79a243 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.io.sampler.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.io.sampler module +================================== + +.. automodule:: paddlespeech.s2t.io.sampler + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.io.utility.rst b/docs/source/api/paddlespeech.s2t.io.utility.rst new file mode 100644 index 000000000..d2a6183fe --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.io.utility.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.io.utility module +================================== + +.. 
automodule:: paddlespeech.s2t.io.utility + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.models.asr_interface.rst b/docs/source/api/paddlespeech.s2t.models.asr_interface.rst new file mode 100644 index 000000000..d679a4d59 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.models.asr_interface.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.models.asr\_interface module +============================================= + +.. automodule:: paddlespeech.s2t.models.asr_interface + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.models.ds2.conv.rst b/docs/source/api/paddlespeech.s2t.models.ds2.conv.rst new file mode 100644 index 000000000..f876d074d --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.models.ds2.conv.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.models.ds2.conv module +======================================= + +.. automodule:: paddlespeech.s2t.models.ds2.conv + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.models.ds2.deepspeech2.rst b/docs/source/api/paddlespeech.s2t.models.ds2.deepspeech2.rst new file mode 100644 index 000000000..495bcaad0 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.models.ds2.deepspeech2.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.models.ds2.deepspeech2 module +============================================== + +.. automodule:: paddlespeech.s2t.models.ds2.deepspeech2 + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.models.ds2.rst b/docs/source/api/paddlespeech.s2t.models.ds2.rst new file mode 100644 index 000000000..1b27bdcb6 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.models.ds2.rst @@ -0,0 +1,16 @@ +paddlespeech.s2t.models.ds2 package +=================================== + +.. automodule:: paddlespeech.s2t.models.ds2 + :members: + :undoc-members: + :show-inheritance: + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.s2t.models.ds2.conv + paddlespeech.s2t.models.ds2.deepspeech2 diff --git a/docs/source/api/paddlespeech.s2t.models.lm.dataset.rst b/docs/source/api/paddlespeech.s2t.models.lm.dataset.rst new file mode 100644 index 000000000..16580d85f --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.models.lm.dataset.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.models.lm.dataset module +========================================= + +.. automodule:: paddlespeech.s2t.models.lm.dataset + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.models.lm.rst b/docs/source/api/paddlespeech.s2t.models.lm.rst new file mode 100644 index 000000000..131926520 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.models.lm.rst @@ -0,0 +1,16 @@ +paddlespeech.s2t.models.lm package +================================== + +.. automodule:: paddlespeech.s2t.models.lm + :members: + :undoc-members: + :show-inheritance: + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.s2t.models.lm.dataset + paddlespeech.s2t.models.lm.transformer diff --git a/docs/source/api/paddlespeech.s2t.models.lm.transformer.rst b/docs/source/api/paddlespeech.s2t.models.lm.transformer.rst new file mode 100644 index 000000000..31dea9bd3 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.models.lm.transformer.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.models.lm.transformer module +============================================= + +.. 
automodule:: paddlespeech.s2t.models.lm.transformer + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.models.lm_interface.rst b/docs/source/api/paddlespeech.s2t.models.lm_interface.rst new file mode 100644 index 000000000..42010beae --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.models.lm_interface.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.models.lm\_interface module +============================================ + +.. automodule:: paddlespeech.s2t.models.lm_interface + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.models.rst b/docs/source/api/paddlespeech.s2t.models.rst new file mode 100644 index 000000000..f3d14b550 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.models.rst @@ -0,0 +1,28 @@ +paddlespeech.s2t.models package +=============================== + +.. automodule:: paddlespeech.s2t.models + :members: + :undoc-members: + :show-inheritance: + +Subpackages +----------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.s2t.models.ds2 + paddlespeech.s2t.models.lm + paddlespeech.s2t.models.u2 + paddlespeech.s2t.models.u2_st + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.s2t.models.asr_interface + paddlespeech.s2t.models.lm_interface + paddlespeech.s2t.models.st_interface diff --git a/docs/source/api/paddlespeech.s2t.models.st_interface.rst b/docs/source/api/paddlespeech.s2t.models.st_interface.rst new file mode 100644 index 000000000..7aaa6e0b0 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.models.st_interface.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.models.st\_interface module +============================================ + +.. automodule:: paddlespeech.s2t.models.st_interface + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.models.u2.rst b/docs/source/api/paddlespeech.s2t.models.u2.rst new file mode 100644 index 000000000..aa7a1a3f9 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.models.u2.rst @@ -0,0 +1,16 @@ +paddlespeech.s2t.models.u2 package +================================== + +.. automodule:: paddlespeech.s2t.models.u2 + :members: + :undoc-members: + :show-inheritance: + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.s2t.models.u2.u2 + paddlespeech.s2t.models.u2.updater diff --git a/docs/source/api/paddlespeech.s2t.models.u2.u2.rst b/docs/source/api/paddlespeech.s2t.models.u2.u2.rst new file mode 100644 index 000000000..62a68d0d6 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.models.u2.u2.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.models.u2.u2 module +==================================== + +.. automodule:: paddlespeech.s2t.models.u2.u2 + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.models.u2.updater.rst b/docs/source/api/paddlespeech.s2t.models.u2.updater.rst new file mode 100644 index 000000000..73cd32ca4 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.models.u2.updater.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.models.u2.updater module +========================================= + +.. automodule:: paddlespeech.s2t.models.u2.updater + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.models.u2_st.rst b/docs/source/api/paddlespeech.s2t.models.u2_st.rst new file mode 100644 index 000000000..cbffddc0c --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.models.u2_st.rst @@ -0,0 +1,15 @@ +paddlespeech.s2t.models.u2\_st package +====================================== + +.. 
automodule:: paddlespeech.s2t.models.u2_st + :members: + :undoc-members: + :show-inheritance: + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.s2t.models.u2_st.u2_st diff --git a/docs/source/api/paddlespeech.s2t.models.u2_st.u2_st.rst b/docs/source/api/paddlespeech.s2t.models.u2_st.u2_st.rst new file mode 100644 index 000000000..de4f4ca72 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.models.u2_st.u2_st.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.models.u2\_st.u2\_st module +============================================ + +.. automodule:: paddlespeech.s2t.models.u2_st.u2_st + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.modules.activation.rst b/docs/source/api/paddlespeech.s2t.modules.activation.rst new file mode 100644 index 000000000..781893520 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.modules.activation.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.modules.activation module +========================================== + +.. automodule:: paddlespeech.s2t.modules.activation + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.modules.align.rst b/docs/source/api/paddlespeech.s2t.modules.align.rst new file mode 100644 index 000000000..af3ddfdfe --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.modules.align.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.modules.align module +===================================== + +.. automodule:: paddlespeech.s2t.modules.align + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.modules.attention.rst b/docs/source/api/paddlespeech.s2t.modules.attention.rst new file mode 100644 index 000000000..4fe4c7f58 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.modules.attention.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.modules.attention module +========================================= + +.. automodule:: paddlespeech.s2t.modules.attention + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.modules.cmvn.rst b/docs/source/api/paddlespeech.s2t.modules.cmvn.rst new file mode 100644 index 000000000..30222c84d --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.modules.cmvn.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.modules.cmvn module +==================================== + +.. automodule:: paddlespeech.s2t.modules.cmvn + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.modules.conformer_convolution.rst b/docs/source/api/paddlespeech.s2t.modules.conformer_convolution.rst new file mode 100644 index 000000000..0f861f7dd --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.modules.conformer_convolution.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.modules.conformer\_convolution module +====================================================== + +.. automodule:: paddlespeech.s2t.modules.conformer_convolution + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.modules.crf.rst b/docs/source/api/paddlespeech.s2t.modules.crf.rst new file mode 100644 index 000000000..4b5868d2a --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.modules.crf.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.modules.crf module +=================================== + +.. 
automodule:: paddlespeech.s2t.modules.crf + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.modules.ctc.rst b/docs/source/api/paddlespeech.s2t.modules.ctc.rst new file mode 100644 index 000000000..1b5674caa --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.modules.ctc.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.modules.ctc module +=================================== + +.. automodule:: paddlespeech.s2t.modules.ctc + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.modules.decoder.rst b/docs/source/api/paddlespeech.s2t.modules.decoder.rst new file mode 100644 index 000000000..e0d68adad --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.modules.decoder.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.modules.decoder module +======================================= + +.. automodule:: paddlespeech.s2t.modules.decoder + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.modules.decoder_layer.rst b/docs/source/api/paddlespeech.s2t.modules.decoder_layer.rst new file mode 100644 index 000000000..8c30068cf --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.modules.decoder_layer.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.modules.decoder\_layer module +============================================== + +.. automodule:: paddlespeech.s2t.modules.decoder_layer + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.modules.embedding.rst b/docs/source/api/paddlespeech.s2t.modules.embedding.rst new file mode 100644 index 000000000..9a67105ff --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.modules.embedding.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.modules.embedding module +========================================= + +.. automodule:: paddlespeech.s2t.modules.embedding + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.modules.encoder.rst b/docs/source/api/paddlespeech.s2t.modules.encoder.rst new file mode 100644 index 000000000..b2d281981 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.modules.encoder.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.modules.encoder module +======================================= + +.. automodule:: paddlespeech.s2t.modules.encoder + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.modules.encoder_layer.rst b/docs/source/api/paddlespeech.s2t.modules.encoder_layer.rst new file mode 100644 index 000000000..15655a48b --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.modules.encoder_layer.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.modules.encoder\_layer module +============================================== + +.. automodule:: paddlespeech.s2t.modules.encoder_layer + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.modules.initializer.rst b/docs/source/api/paddlespeech.s2t.modules.initializer.rst new file mode 100644 index 000000000..11e874ac9 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.modules.initializer.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.modules.initializer module +=========================================== + +.. 
automodule:: paddlespeech.s2t.modules.initializer + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.modules.loss.rst b/docs/source/api/paddlespeech.s2t.modules.loss.rst new file mode 100644 index 000000000..ae61ed78f --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.modules.loss.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.modules.loss module +==================================== + +.. automodule:: paddlespeech.s2t.modules.loss + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.modules.mask.rst b/docs/source/api/paddlespeech.s2t.modules.mask.rst new file mode 100644 index 000000000..89785b57a --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.modules.mask.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.modules.mask module +==================================== + +.. automodule:: paddlespeech.s2t.modules.mask + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.modules.positionwise_feed_forward.rst b/docs/source/api/paddlespeech.s2t.modules.positionwise_feed_forward.rst new file mode 100644 index 000000000..7c085527b --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.modules.positionwise_feed_forward.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.modules.positionwise\_feed\_forward module +=========================================================== + +.. automodule:: paddlespeech.s2t.modules.positionwise_feed_forward + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.modules.rst b/docs/source/api/paddlespeech.s2t.modules.rst new file mode 100644 index 000000000..5bf9974f6 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.modules.rst @@ -0,0 +1,31 @@ +paddlespeech.s2t.modules package +================================ + +.. automodule:: paddlespeech.s2t.modules + :members: + :undoc-members: + :show-inheritance: + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.s2t.modules.activation + paddlespeech.s2t.modules.align + paddlespeech.s2t.modules.attention + paddlespeech.s2t.modules.cmvn + paddlespeech.s2t.modules.conformer_convolution + paddlespeech.s2t.modules.crf + paddlespeech.s2t.modules.ctc + paddlespeech.s2t.modules.decoder + paddlespeech.s2t.modules.decoder_layer + paddlespeech.s2t.modules.embedding + paddlespeech.s2t.modules.encoder + paddlespeech.s2t.modules.encoder_layer + paddlespeech.s2t.modules.initializer + paddlespeech.s2t.modules.loss + paddlespeech.s2t.modules.mask + paddlespeech.s2t.modules.positionwise_feed_forward + paddlespeech.s2t.modules.subsampling diff --git a/docs/source/api/paddlespeech.s2t.modules.subsampling.rst b/docs/source/api/paddlespeech.s2t.modules.subsampling.rst new file mode 100644 index 000000000..76fcb2a28 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.modules.subsampling.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.modules.subsampling module +=========================================== + +.. automodule:: paddlespeech.s2t.modules.subsampling + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.rst b/docs/source/api/paddlespeech.s2t.rst new file mode 100644 index 000000000..4be22cb87 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.rst @@ -0,0 +1,23 @@ +paddlespeech.s2t package +======================== + +.. automodule:: paddlespeech.s2t + :members: + :undoc-members: + :show-inheritance: + +Subpackages +----------- + +.. 
toctree:: + :maxdepth: 4 + + paddlespeech.s2t.decoders + paddlespeech.s2t.exps + paddlespeech.s2t.frontend + paddlespeech.s2t.io + paddlespeech.s2t.models + paddlespeech.s2t.modules + paddlespeech.s2t.training + paddlespeech.s2t.transform + paddlespeech.s2t.utils diff --git a/docs/source/api/paddlespeech.s2t.training.cli.rst b/docs/source/api/paddlespeech.s2t.training.cli.rst new file mode 100644 index 000000000..91c33f5c6 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.training.cli.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.training.cli module +==================================== + +.. automodule:: paddlespeech.s2t.training.cli + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.training.extensions.evaluator.rst b/docs/source/api/paddlespeech.s2t.training.extensions.evaluator.rst new file mode 100644 index 000000000..ff736d88a --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.training.extensions.evaluator.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.training.extensions.evaluator module +===================================================== + +.. automodule:: paddlespeech.s2t.training.extensions.evaluator + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.training.extensions.extension.rst b/docs/source/api/paddlespeech.s2t.training.extensions.extension.rst new file mode 100644 index 000000000..36858339c --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.training.extensions.extension.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.training.extensions.extension module +===================================================== + +.. automodule:: paddlespeech.s2t.training.extensions.extension + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.training.extensions.plot.rst b/docs/source/api/paddlespeech.s2t.training.extensions.plot.rst new file mode 100644 index 000000000..366a7ca4b --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.training.extensions.plot.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.training.extensions.plot module +================================================ + +.. automodule:: paddlespeech.s2t.training.extensions.plot + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.training.extensions.rst b/docs/source/api/paddlespeech.s2t.training.extensions.rst new file mode 100644 index 000000000..f31b8427e --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.training.extensions.rst @@ -0,0 +1,19 @@ +paddlespeech.s2t.training.extensions package +============================================ + +.. automodule:: paddlespeech.s2t.training.extensions + :members: + :undoc-members: + :show-inheritance: + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.s2t.training.extensions.evaluator + paddlespeech.s2t.training.extensions.extension + paddlespeech.s2t.training.extensions.plot + paddlespeech.s2t.training.extensions.snapshot + paddlespeech.s2t.training.extensions.visualizer diff --git a/docs/source/api/paddlespeech.s2t.training.extensions.snapshot.rst b/docs/source/api/paddlespeech.s2t.training.extensions.snapshot.rst new file mode 100644 index 000000000..e0ca21a73 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.training.extensions.snapshot.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.training.extensions.snapshot module +==================================================== + +.. 
automodule:: paddlespeech.s2t.training.extensions.snapshot + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.training.extensions.visualizer.rst b/docs/source/api/paddlespeech.s2t.training.extensions.visualizer.rst new file mode 100644 index 000000000..22ae11f11 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.training.extensions.visualizer.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.training.extensions.visualizer module +====================================================== + +.. automodule:: paddlespeech.s2t.training.extensions.visualizer + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.training.gradclip.rst b/docs/source/api/paddlespeech.s2t.training.gradclip.rst new file mode 100644 index 000000000..b9f675efa --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.training.gradclip.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.training.gradclip module +========================================= + +.. automodule:: paddlespeech.s2t.training.gradclip + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.training.optimizer.rst b/docs/source/api/paddlespeech.s2t.training.optimizer.rst new file mode 100644 index 000000000..401ec5153 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.training.optimizer.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.training.optimizer module +========================================== + +.. automodule:: paddlespeech.s2t.training.optimizer + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.training.reporter.rst b/docs/source/api/paddlespeech.s2t.training.reporter.rst new file mode 100644 index 000000000..d033dafdb --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.training.reporter.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.training.reporter module +========================================= + +.. automodule:: paddlespeech.s2t.training.reporter + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.training.rst b/docs/source/api/paddlespeech.s2t.training.rst new file mode 100644 index 000000000..4e8580500 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.training.rst @@ -0,0 +1,31 @@ +paddlespeech.s2t.training package +================================= + +.. automodule:: paddlespeech.s2t.training + :members: + :undoc-members: + :show-inheritance: + +Subpackages +----------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.s2t.training.extensions + paddlespeech.s2t.training.triggers + paddlespeech.s2t.training.updaters + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.s2t.training.cli + paddlespeech.s2t.training.gradclip + paddlespeech.s2t.training.optimizer + paddlespeech.s2t.training.reporter + paddlespeech.s2t.training.scheduler + paddlespeech.s2t.training.timer + paddlespeech.s2t.training.trainer diff --git a/docs/source/api/paddlespeech.s2t.training.scheduler.rst b/docs/source/api/paddlespeech.s2t.training.scheduler.rst new file mode 100644 index 000000000..f93b24f1c --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.training.scheduler.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.training.scheduler module +========================================== + +.. 
automodule:: paddlespeech.s2t.training.scheduler + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.training.timer.rst b/docs/source/api/paddlespeech.s2t.training.timer.rst new file mode 100644 index 000000000..68290e267 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.training.timer.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.training.timer module +====================================== + +.. automodule:: paddlespeech.s2t.training.timer + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.training.trainer.rst b/docs/source/api/paddlespeech.s2t.training.trainer.rst new file mode 100644 index 000000000..a09635660 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.training.trainer.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.training.trainer module +======================================== + +.. automodule:: paddlespeech.s2t.training.trainer + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.training.triggers.compare_value_trigger.rst b/docs/source/api/paddlespeech.s2t.training.triggers.compare_value_trigger.rst new file mode 100644 index 000000000..afd1f8491 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.training.triggers.compare_value_trigger.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.training.triggers.compare\_value\_trigger module +================================================================= + +.. automodule:: paddlespeech.s2t.training.triggers.compare_value_trigger + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.training.triggers.interval_trigger.rst b/docs/source/api/paddlespeech.s2t.training.triggers.interval_trigger.rst new file mode 100644 index 000000000..980c55c65 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.training.triggers.interval_trigger.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.training.triggers.interval\_trigger module +=========================================================== + +.. automodule:: paddlespeech.s2t.training.triggers.interval_trigger + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.training.triggers.limit_trigger.rst b/docs/source/api/paddlespeech.s2t.training.triggers.limit_trigger.rst new file mode 100644 index 000000000..5ead7c902 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.training.triggers.limit_trigger.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.training.triggers.limit\_trigger module +======================================================== + +.. automodule:: paddlespeech.s2t.training.triggers.limit_trigger + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.training.triggers.rst b/docs/source/api/paddlespeech.s2t.training.triggers.rst new file mode 100644 index 000000000..0316ce21c --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.training.triggers.rst @@ -0,0 +1,19 @@ +paddlespeech.s2t.training.triggers package +========================================== + +.. automodule:: paddlespeech.s2t.training.triggers + :members: + :undoc-members: + :show-inheritance: + +Submodules +---------- + +.. 
toctree:: + :maxdepth: 4 + + paddlespeech.s2t.training.triggers.compare_value_trigger + paddlespeech.s2t.training.triggers.interval_trigger + paddlespeech.s2t.training.triggers.limit_trigger + paddlespeech.s2t.training.triggers.time_trigger + paddlespeech.s2t.training.triggers.utils diff --git a/docs/source/api/paddlespeech.s2t.training.triggers.time_trigger.rst b/docs/source/api/paddlespeech.s2t.training.triggers.time_trigger.rst new file mode 100644 index 000000000..213af4555 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.training.triggers.time_trigger.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.training.triggers.time\_trigger module +======================================================= + +.. automodule:: paddlespeech.s2t.training.triggers.time_trigger + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.training.triggers.utils.rst b/docs/source/api/paddlespeech.s2t.training.triggers.utils.rst new file mode 100644 index 000000000..c4ae53cde --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.training.triggers.utils.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.training.triggers.utils module +=============================================== + +.. automodule:: paddlespeech.s2t.training.triggers.utils + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.training.updaters.rst b/docs/source/api/paddlespeech.s2t.training.updaters.rst new file mode 100644 index 000000000..a06170168 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.training.updaters.rst @@ -0,0 +1,17 @@ +paddlespeech.s2t.training.updaters package +========================================== + +.. automodule:: paddlespeech.s2t.training.updaters + :members: + :undoc-members: + :show-inheritance: + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.s2t.training.updaters.standard_updater + paddlespeech.s2t.training.updaters.trainer + paddlespeech.s2t.training.updaters.updater diff --git a/docs/source/api/paddlespeech.s2t.training.updaters.standard_updater.rst b/docs/source/api/paddlespeech.s2t.training.updaters.standard_updater.rst new file mode 100644 index 000000000..a6567bfc2 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.training.updaters.standard_updater.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.training.updaters.standard\_updater module +=========================================================== + +.. automodule:: paddlespeech.s2t.training.updaters.standard_updater + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.training.updaters.trainer.rst b/docs/source/api/paddlespeech.s2t.training.updaters.trainer.rst new file mode 100644 index 000000000..6981a8f05 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.training.updaters.trainer.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.training.updaters.trainer module +================================================= + +.. automodule:: paddlespeech.s2t.training.updaters.trainer + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.training.updaters.updater.rst b/docs/source/api/paddlespeech.s2t.training.updaters.updater.rst new file mode 100644 index 000000000..160fb4a14 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.training.updaters.updater.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.training.updaters.updater module +================================================= + +.. 
automodule:: paddlespeech.s2t.training.updaters.updater + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.transform.add_deltas.rst b/docs/source/api/paddlespeech.s2t.transform.add_deltas.rst new file mode 100644 index 000000000..5007fd9d8 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.transform.add_deltas.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.transform.add\_deltas module +============================================= + +.. automodule:: paddlespeech.s2t.transform.add_deltas + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.transform.channel_selector.rst b/docs/source/api/paddlespeech.s2t.transform.channel_selector.rst new file mode 100644 index 000000000..e08dd253e --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.transform.channel_selector.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.transform.channel\_selector module +=================================================== + +.. automodule:: paddlespeech.s2t.transform.channel_selector + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.transform.cmvn.rst b/docs/source/api/paddlespeech.s2t.transform.cmvn.rst new file mode 100644 index 000000000..8348e3d4b --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.transform.cmvn.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.transform.cmvn module +====================================== + +.. automodule:: paddlespeech.s2t.transform.cmvn + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.transform.functional.rst b/docs/source/api/paddlespeech.s2t.transform.functional.rst new file mode 100644 index 000000000..eb2b54a67 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.transform.functional.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.transform.functional module +============================================ + +.. automodule:: paddlespeech.s2t.transform.functional + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.transform.perturb.rst b/docs/source/api/paddlespeech.s2t.transform.perturb.rst new file mode 100644 index 000000000..0be28ab7e --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.transform.perturb.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.transform.perturb module +========================================= + +.. automodule:: paddlespeech.s2t.transform.perturb + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.transform.rst b/docs/source/api/paddlespeech.s2t.transform.rst new file mode 100644 index 000000000..5016ff4f1 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.transform.rst @@ -0,0 +1,24 @@ +paddlespeech.s2t.transform package +================================== + +.. automodule:: paddlespeech.s2t.transform + :members: + :undoc-members: + :show-inheritance: + +Submodules +---------- + +.. 
toctree:: + :maxdepth: 4 + + paddlespeech.s2t.transform.add_deltas + paddlespeech.s2t.transform.channel_selector + paddlespeech.s2t.transform.cmvn + paddlespeech.s2t.transform.functional + paddlespeech.s2t.transform.perturb + paddlespeech.s2t.transform.spec_augment + paddlespeech.s2t.transform.spectrogram + paddlespeech.s2t.transform.transform_interface + paddlespeech.s2t.transform.transformation + paddlespeech.s2t.transform.wpe diff --git a/docs/source/api/paddlespeech.s2t.transform.spec_augment.rst b/docs/source/api/paddlespeech.s2t.transform.spec_augment.rst new file mode 100644 index 000000000..00fd3ea12 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.transform.spec_augment.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.transform.spec\_augment module +=============================================== + +.. automodule:: paddlespeech.s2t.transform.spec_augment + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.transform.spectrogram.rst b/docs/source/api/paddlespeech.s2t.transform.spectrogram.rst new file mode 100644 index 000000000..33c499a7a --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.transform.spectrogram.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.transform.spectrogram module +============================================= + +.. automodule:: paddlespeech.s2t.transform.spectrogram + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.transform.transform_interface.rst b/docs/source/api/paddlespeech.s2t.transform.transform_interface.rst new file mode 100644 index 000000000..009b06589 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.transform.transform_interface.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.transform.transform\_interface module +====================================================== + +.. automodule:: paddlespeech.s2t.transform.transform_interface + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.transform.transformation.rst b/docs/source/api/paddlespeech.s2t.transform.transformation.rst new file mode 100644 index 000000000..a03e731a5 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.transform.transformation.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.transform.transformation module +================================================ + +.. automodule:: paddlespeech.s2t.transform.transformation + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.transform.wpe.rst b/docs/source/api/paddlespeech.s2t.transform.wpe.rst new file mode 100644 index 000000000..c4831f7f9 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.transform.wpe.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.transform.wpe module +===================================== + +.. automodule:: paddlespeech.s2t.transform.wpe + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.utils.asr_utils.rst b/docs/source/api/paddlespeech.s2t.utils.asr_utils.rst new file mode 100644 index 000000000..4cd6a851d --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.utils.asr_utils.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.utils.asr\_utils module +======================================== + +.. 
automodule:: paddlespeech.s2t.utils.asr_utils + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.utils.bleu_score.rst b/docs/source/api/paddlespeech.s2t.utils.bleu_score.rst new file mode 100644 index 000000000..550c21ec7 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.utils.bleu_score.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.utils.bleu\_score module +========================================= + +.. automodule:: paddlespeech.s2t.utils.bleu_score + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.utils.check_kwargs.rst b/docs/source/api/paddlespeech.s2t.utils.check_kwargs.rst new file mode 100644 index 000000000..52efa37a0 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.utils.check_kwargs.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.utils.check\_kwargs module +=========================================== + +.. automodule:: paddlespeech.s2t.utils.check_kwargs + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.utils.checkpoint.rst b/docs/source/api/paddlespeech.s2t.utils.checkpoint.rst new file mode 100644 index 000000000..dbf812eb0 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.utils.checkpoint.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.utils.checkpoint module +======================================== + +.. automodule:: paddlespeech.s2t.utils.checkpoint + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.utils.cli_readers.rst b/docs/source/api/paddlespeech.s2t.utils.cli_readers.rst new file mode 100644 index 000000000..ec726bf05 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.utils.cli_readers.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.utils.cli\_readers module +========================================== + +.. automodule:: paddlespeech.s2t.utils.cli_readers + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.utils.cli_utils.rst b/docs/source/api/paddlespeech.s2t.utils.cli_utils.rst new file mode 100644 index 000000000..45104505c --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.utils.cli_utils.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.utils.cli\_utils module +======================================== + +.. automodule:: paddlespeech.s2t.utils.cli_utils + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.utils.cli_writers.rst b/docs/source/api/paddlespeech.s2t.utils.cli_writers.rst new file mode 100644 index 000000000..e2f834653 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.utils.cli_writers.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.utils.cli\_writers module +========================================== + +.. automodule:: paddlespeech.s2t.utils.cli_writers + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.utils.ctc_utils.rst b/docs/source/api/paddlespeech.s2t.utils.ctc_utils.rst new file mode 100644 index 000000000..0fd40b5e9 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.utils.ctc_utils.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.utils.ctc\_utils module +======================================== + +.. 
automodule:: paddlespeech.s2t.utils.ctc_utils + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.utils.dynamic_import.rst b/docs/source/api/paddlespeech.s2t.utils.dynamic_import.rst new file mode 100644 index 000000000..417af4c3b --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.utils.dynamic_import.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.utils.dynamic\_import module +============================================= + +.. automodule:: paddlespeech.s2t.utils.dynamic_import + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.utils.dynamic_pip_install.rst b/docs/source/api/paddlespeech.s2t.utils.dynamic_pip_install.rst new file mode 100644 index 000000000..e43a327b6 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.utils.dynamic_pip_install.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.utils.dynamic\_pip\_install module +=================================================== + +.. automodule:: paddlespeech.s2t.utils.dynamic_pip_install + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.utils.error_rate.rst b/docs/source/api/paddlespeech.s2t.utils.error_rate.rst new file mode 100644 index 000000000..e733a2ec1 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.utils.error_rate.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.utils.error\_rate module +========================================= + +.. automodule:: paddlespeech.s2t.utils.error_rate + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.utils.layer_tools.rst b/docs/source/api/paddlespeech.s2t.utils.layer_tools.rst new file mode 100644 index 000000000..abf39cc7c --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.utils.layer_tools.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.utils.layer\_tools module +========================================== + +.. automodule:: paddlespeech.s2t.utils.layer_tools + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.utils.log.rst b/docs/source/api/paddlespeech.s2t.utils.log.rst new file mode 100644 index 000000000..34d8599e0 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.utils.log.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.utils.log module +================================= + +.. automodule:: paddlespeech.s2t.utils.log + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.utils.mp_tools.rst b/docs/source/api/paddlespeech.s2t.utils.mp_tools.rst new file mode 100644 index 000000000..68c721f19 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.utils.mp_tools.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.utils.mp\_tools module +======================================= + +.. automodule:: paddlespeech.s2t.utils.mp_tools + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.utils.profiler.rst b/docs/source/api/paddlespeech.s2t.utils.profiler.rst new file mode 100644 index 000000000..56df28aa4 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.utils.profiler.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.utils.profiler module +====================================== + +.. 
automodule:: paddlespeech.s2t.utils.profiler + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.utils.rst b/docs/source/api/paddlespeech.s2t.utils.rst new file mode 100644 index 000000000..8ab353819 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.utils.rst @@ -0,0 +1,34 @@ +paddlespeech.s2t.utils package +============================== + +.. automodule:: paddlespeech.s2t.utils + :members: + :undoc-members: + :show-inheritance: + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.s2t.utils.asr_utils + paddlespeech.s2t.utils.bleu_score + paddlespeech.s2t.utils.check_kwargs + paddlespeech.s2t.utils.checkpoint + paddlespeech.s2t.utils.cli_readers + paddlespeech.s2t.utils.cli_utils + paddlespeech.s2t.utils.cli_writers + paddlespeech.s2t.utils.ctc_utils + paddlespeech.s2t.utils.dynamic_import + paddlespeech.s2t.utils.dynamic_pip_install + paddlespeech.s2t.utils.error_rate + paddlespeech.s2t.utils.layer_tools + paddlespeech.s2t.utils.log + paddlespeech.s2t.utils.mp_tools + paddlespeech.s2t.utils.profiler + paddlespeech.s2t.utils.socket_server + paddlespeech.s2t.utils.spec_augment + paddlespeech.s2t.utils.tensor_utils + paddlespeech.s2t.utils.text_grid + paddlespeech.s2t.utils.utility diff --git a/docs/source/api/paddlespeech.s2t.utils.socket_server.rst b/docs/source/api/paddlespeech.s2t.utils.socket_server.rst new file mode 100644 index 000000000..90e10ae8b --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.utils.socket_server.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.utils.socket\_server module +============================================ + +.. automodule:: paddlespeech.s2t.utils.socket_server + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.utils.spec_augment.rst b/docs/source/api/paddlespeech.s2t.utils.spec_augment.rst new file mode 100644 index 000000000..21e613158 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.utils.spec_augment.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.utils.spec\_augment module +=========================================== + +.. automodule:: paddlespeech.s2t.utils.spec_augment + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.utils.tensor_utils.rst b/docs/source/api/paddlespeech.s2t.utils.tensor_utils.rst new file mode 100644 index 000000000..dcbfef762 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.utils.tensor_utils.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.utils.tensor\_utils module +=========================================== + +.. automodule:: paddlespeech.s2t.utils.tensor_utils + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.utils.text_grid.rst b/docs/source/api/paddlespeech.s2t.utils.text_grid.rst new file mode 100644 index 000000000..f745cb4b4 --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.utils.text_grid.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.utils.text\_grid module +======================================== + +.. automodule:: paddlespeech.s2t.utils.text_grid + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.s2t.utils.utility.rst b/docs/source/api/paddlespeech.s2t.utils.utility.rst new file mode 100644 index 000000000..5975ce6dc --- /dev/null +++ b/docs/source/api/paddlespeech.s2t.utils.utility.rst @@ -0,0 +1,7 @@ +paddlespeech.s2t.utils.utility module +===================================== + +.. 
automodule:: paddlespeech.s2t.utils.utility + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.server.base_commands.rst b/docs/source/api/paddlespeech.server.base_commands.rst new file mode 100644 index 000000000..0cbdbc8dd --- /dev/null +++ b/docs/source/api/paddlespeech.server.base_commands.rst @@ -0,0 +1,7 @@ +paddlespeech.server.base\_commands module +========================================= + +.. automodule:: paddlespeech.server.base_commands + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.server.bin.paddlespeech_client.rst b/docs/source/api/paddlespeech.server.bin.paddlespeech_client.rst new file mode 100644 index 000000000..604fd42b4 --- /dev/null +++ b/docs/source/api/paddlespeech.server.bin.paddlespeech_client.rst @@ -0,0 +1,7 @@ +paddlespeech.server.bin.paddlespeech\_client module +=================================================== + +.. automodule:: paddlespeech.server.bin.paddlespeech_client + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.server.bin.paddlespeech_server.rst b/docs/source/api/paddlespeech.server.bin.paddlespeech_server.rst new file mode 100644 index 000000000..23b9f8090 --- /dev/null +++ b/docs/source/api/paddlespeech.server.bin.paddlespeech_server.rst @@ -0,0 +1,7 @@ +paddlespeech.server.bin.paddlespeech\_server module +=================================================== + +.. automodule:: paddlespeech.server.bin.paddlespeech_server + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.server.bin.rst b/docs/source/api/paddlespeech.server.bin.rst new file mode 100644 index 000000000..20461e0d0 --- /dev/null +++ b/docs/source/api/paddlespeech.server.bin.rst @@ -0,0 +1,16 @@ +paddlespeech.server.bin package +=============================== + +.. automodule:: paddlespeech.server.bin + :members: + :undoc-members: + :show-inheritance: + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.server.bin.paddlespeech_client + paddlespeech.server.bin.paddlespeech_server diff --git a/docs/source/api/paddlespeech.server.engine.acs.python.acs_engine.rst b/docs/source/api/paddlespeech.server.engine.acs.python.acs_engine.rst new file mode 100644 index 000000000..9b61633e0 --- /dev/null +++ b/docs/source/api/paddlespeech.server.engine.acs.python.acs_engine.rst @@ -0,0 +1,7 @@ +paddlespeech.server.engine.acs.python.acs\_engine module +======================================================== + +.. automodule:: paddlespeech.server.engine.acs.python.acs_engine + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.server.engine.acs.python.rst b/docs/source/api/paddlespeech.server.engine.acs.python.rst new file mode 100644 index 000000000..3c06ba080 --- /dev/null +++ b/docs/source/api/paddlespeech.server.engine.acs.python.rst @@ -0,0 +1,15 @@ +paddlespeech.server.engine.acs.python package +============================================= + +.. automodule:: paddlespeech.server.engine.acs.python + :members: + :undoc-members: + :show-inheritance: + +Submodules +---------- + +.. 
toctree:: + :maxdepth: 4 + + paddlespeech.server.engine.acs.python.acs_engine diff --git a/docs/source/api/paddlespeech.server.engine.acs.rst b/docs/source/api/paddlespeech.server.engine.acs.rst new file mode 100644 index 000000000..e00b911f0 --- /dev/null +++ b/docs/source/api/paddlespeech.server.engine.acs.rst @@ -0,0 +1,15 @@ +paddlespeech.server.engine.acs package +====================================== + +.. automodule:: paddlespeech.server.engine.acs + :members: + :undoc-members: + :show-inheritance: + +Subpackages +----------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.server.engine.acs.python diff --git a/docs/source/api/paddlespeech.server.engine.asr.online.ctc_endpoint.rst b/docs/source/api/paddlespeech.server.engine.asr.online.ctc_endpoint.rst new file mode 100644 index 000000000..e6f950587 --- /dev/null +++ b/docs/source/api/paddlespeech.server.engine.asr.online.ctc_endpoint.rst @@ -0,0 +1,7 @@ +paddlespeech.server.engine.asr.online.ctc\_endpoint module +========================================================== + +.. automodule:: paddlespeech.server.engine.asr.online.ctc_endpoint + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.server.engine.asr.online.ctc_search.rst b/docs/source/api/paddlespeech.server.engine.asr.online.ctc_search.rst new file mode 100644 index 000000000..01563ee88 --- /dev/null +++ b/docs/source/api/paddlespeech.server.engine.asr.online.ctc_search.rst @@ -0,0 +1,7 @@ +paddlespeech.server.engine.asr.online.ctc\_search module +======================================================== + +.. automodule:: paddlespeech.server.engine.asr.online.ctc_search + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.server.engine.asr.online.onnx.asr_engine.rst b/docs/source/api/paddlespeech.server.engine.asr.online.onnx.asr_engine.rst new file mode 100644 index 000000000..baf826807 --- /dev/null +++ b/docs/source/api/paddlespeech.server.engine.asr.online.onnx.asr_engine.rst @@ -0,0 +1,7 @@ +paddlespeech.server.engine.asr.online.onnx.asr\_engine module +============================================================= + +.. automodule:: paddlespeech.server.engine.asr.online.onnx.asr_engine + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.server.engine.asr.online.onnx.rst b/docs/source/api/paddlespeech.server.engine.asr.online.onnx.rst new file mode 100644 index 000000000..d3b373a29 --- /dev/null +++ b/docs/source/api/paddlespeech.server.engine.asr.online.onnx.rst @@ -0,0 +1,15 @@ +paddlespeech.server.engine.asr.online.onnx package +================================================== + +.. automodule:: paddlespeech.server.engine.asr.online.onnx + :members: + :undoc-members: + :show-inheritance: + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.server.engine.asr.online.onnx.asr_engine diff --git a/docs/source/api/paddlespeech.server.engine.asr.online.paddleinference.asr_engine.rst b/docs/source/api/paddlespeech.server.engine.asr.online.paddleinference.asr_engine.rst new file mode 100644 index 000000000..17e25d006 --- /dev/null +++ b/docs/source/api/paddlespeech.server.engine.asr.online.paddleinference.asr_engine.rst @@ -0,0 +1,7 @@ +paddlespeech.server.engine.asr.online.paddleinference.asr\_engine module +======================================================================== + +.. 
automodule:: paddlespeech.server.engine.asr.online.paddleinference.asr_engine + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.server.engine.asr.online.paddleinference.rst b/docs/source/api/paddlespeech.server.engine.asr.online.paddleinference.rst new file mode 100644 index 000000000..7d75948a9 --- /dev/null +++ b/docs/source/api/paddlespeech.server.engine.asr.online.paddleinference.rst @@ -0,0 +1,15 @@ +paddlespeech.server.engine.asr.online.paddleinference package +============================================================= + +.. automodule:: paddlespeech.server.engine.asr.online.paddleinference + :members: + :undoc-members: + :show-inheritance: + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.server.engine.asr.online.paddleinference.asr_engine diff --git a/docs/source/api/paddlespeech.server.engine.asr.online.python.asr_engine.rst b/docs/source/api/paddlespeech.server.engine.asr.online.python.asr_engine.rst new file mode 100644 index 000000000..a5d09a5ee --- /dev/null +++ b/docs/source/api/paddlespeech.server.engine.asr.online.python.asr_engine.rst @@ -0,0 +1,7 @@ +paddlespeech.server.engine.asr.online.python.asr\_engine module +=============================================================== + +.. automodule:: paddlespeech.server.engine.asr.online.python.asr_engine + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.server.engine.asr.online.python.rst b/docs/source/api/paddlespeech.server.engine.asr.online.python.rst new file mode 100644 index 000000000..9ecd86479 --- /dev/null +++ b/docs/source/api/paddlespeech.server.engine.asr.online.python.rst @@ -0,0 +1,15 @@ +paddlespeech.server.engine.asr.online.python package +==================================================== + +.. automodule:: paddlespeech.server.engine.asr.online.python + :members: + :undoc-members: + :show-inheritance: + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.server.engine.asr.online.python.asr_engine diff --git a/docs/source/api/paddlespeech.server.engine.asr.online.rst b/docs/source/api/paddlespeech.server.engine.asr.online.rst new file mode 100644 index 000000000..140fcb629 --- /dev/null +++ b/docs/source/api/paddlespeech.server.engine.asr.online.rst @@ -0,0 +1,26 @@ +paddlespeech.server.engine.asr.online package +============================================= + +.. automodule:: paddlespeech.server.engine.asr.online + :members: + :undoc-members: + :show-inheritance: + +Subpackages +----------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.server.engine.asr.online.onnx + paddlespeech.server.engine.asr.online.paddleinference + paddlespeech.server.engine.asr.online.python + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.server.engine.asr.online.ctc_endpoint + paddlespeech.server.engine.asr.online.ctc_search diff --git a/docs/source/api/paddlespeech.server.engine.asr.paddleinference.asr_engine.rst b/docs/source/api/paddlespeech.server.engine.asr.paddleinference.asr_engine.rst new file mode 100644 index 000000000..9509eeedd --- /dev/null +++ b/docs/source/api/paddlespeech.server.engine.asr.paddleinference.asr_engine.rst @@ -0,0 +1,7 @@ +paddlespeech.server.engine.asr.paddleinference.asr\_engine module +================================================================= + +.. 
automodule:: paddlespeech.server.engine.asr.paddleinference.asr_engine + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.server.engine.asr.paddleinference.rst b/docs/source/api/paddlespeech.server.engine.asr.paddleinference.rst new file mode 100644 index 000000000..052e730cd --- /dev/null +++ b/docs/source/api/paddlespeech.server.engine.asr.paddleinference.rst @@ -0,0 +1,15 @@ +paddlespeech.server.engine.asr.paddleinference package +====================================================== + +.. automodule:: paddlespeech.server.engine.asr.paddleinference + :members: + :undoc-members: + :show-inheritance: + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.server.engine.asr.paddleinference.asr_engine diff --git a/docs/source/api/paddlespeech.server.engine.asr.python.asr_engine.rst b/docs/source/api/paddlespeech.server.engine.asr.python.asr_engine.rst new file mode 100644 index 000000000..f4d3dc86f --- /dev/null +++ b/docs/source/api/paddlespeech.server.engine.asr.python.asr_engine.rst @@ -0,0 +1,7 @@ +paddlespeech.server.engine.asr.python.asr\_engine module +======================================================== + +.. automodule:: paddlespeech.server.engine.asr.python.asr_engine + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.server.engine.asr.python.rst b/docs/source/api/paddlespeech.server.engine.asr.python.rst new file mode 100644 index 000000000..7abee63eb --- /dev/null +++ b/docs/source/api/paddlespeech.server.engine.asr.python.rst @@ -0,0 +1,15 @@ +paddlespeech.server.engine.asr.python package +============================================= + +.. automodule:: paddlespeech.server.engine.asr.python + :members: + :undoc-members: + :show-inheritance: + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.server.engine.asr.python.asr_engine diff --git a/docs/source/api/paddlespeech.server.engine.asr.rst b/docs/source/api/paddlespeech.server.engine.asr.rst new file mode 100644 index 000000000..a021b9ffe --- /dev/null +++ b/docs/source/api/paddlespeech.server.engine.asr.rst @@ -0,0 +1,17 @@ +paddlespeech.server.engine.asr package +====================================== + +.. automodule:: paddlespeech.server.engine.asr + :members: + :undoc-members: + :show-inheritance: + +Subpackages +----------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.server.engine.asr.online + paddlespeech.server.engine.asr.paddleinference + paddlespeech.server.engine.asr.python diff --git a/docs/source/api/paddlespeech.server.engine.base_engine.rst b/docs/source/api/paddlespeech.server.engine.base_engine.rst new file mode 100644 index 000000000..c55150a03 --- /dev/null +++ b/docs/source/api/paddlespeech.server.engine.base_engine.rst @@ -0,0 +1,7 @@ +paddlespeech.server.engine.base\_engine module +============================================== + +.. automodule:: paddlespeech.server.engine.base_engine + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.server.engine.cls.paddleinference.cls_engine.rst b/docs/source/api/paddlespeech.server.engine.cls.paddleinference.cls_engine.rst new file mode 100644 index 000000000..fbf808694 --- /dev/null +++ b/docs/source/api/paddlespeech.server.engine.cls.paddleinference.cls_engine.rst @@ -0,0 +1,7 @@ +paddlespeech.server.engine.cls.paddleinference.cls\_engine module +================================================================= + +.. 
automodule:: paddlespeech.server.engine.cls.paddleinference.cls_engine + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.server.engine.cls.paddleinference.rst b/docs/source/api/paddlespeech.server.engine.cls.paddleinference.rst new file mode 100644 index 000000000..1ed752b45 --- /dev/null +++ b/docs/source/api/paddlespeech.server.engine.cls.paddleinference.rst @@ -0,0 +1,15 @@ +paddlespeech.server.engine.cls.paddleinference package +====================================================== + +.. automodule:: paddlespeech.server.engine.cls.paddleinference + :members: + :undoc-members: + :show-inheritance: + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.server.engine.cls.paddleinference.cls_engine diff --git a/docs/source/api/paddlespeech.server.engine.cls.python.cls_engine.rst b/docs/source/api/paddlespeech.server.engine.cls.python.cls_engine.rst new file mode 100644 index 000000000..17673126b --- /dev/null +++ b/docs/source/api/paddlespeech.server.engine.cls.python.cls_engine.rst @@ -0,0 +1,7 @@ +paddlespeech.server.engine.cls.python.cls\_engine module +======================================================== + +.. automodule:: paddlespeech.server.engine.cls.python.cls_engine + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.server.engine.cls.python.rst b/docs/source/api/paddlespeech.server.engine.cls.python.rst new file mode 100644 index 000000000..cd0f9d2fc --- /dev/null +++ b/docs/source/api/paddlespeech.server.engine.cls.python.rst @@ -0,0 +1,15 @@ +paddlespeech.server.engine.cls.python package +============================================= + +.. automodule:: paddlespeech.server.engine.cls.python + :members: + :undoc-members: + :show-inheritance: + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.server.engine.cls.python.cls_engine diff --git a/docs/source/api/paddlespeech.server.engine.cls.rst b/docs/source/api/paddlespeech.server.engine.cls.rst new file mode 100644 index 000000000..c1df3d401 --- /dev/null +++ b/docs/source/api/paddlespeech.server.engine.cls.rst @@ -0,0 +1,16 @@ +paddlespeech.server.engine.cls package +====================================== + +.. automodule:: paddlespeech.server.engine.cls + :members: + :undoc-members: + :show-inheritance: + +Subpackages +----------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.server.engine.cls.paddleinference + paddlespeech.server.engine.cls.python diff --git a/docs/source/api/paddlespeech.server.engine.engine_factory.rst b/docs/source/api/paddlespeech.server.engine.engine_factory.rst new file mode 100644 index 000000000..cdd48af79 --- /dev/null +++ b/docs/source/api/paddlespeech.server.engine.engine_factory.rst @@ -0,0 +1,7 @@ +paddlespeech.server.engine.engine\_factory module +================================================= + +.. automodule:: paddlespeech.server.engine.engine_factory + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.server.engine.engine_pool.rst b/docs/source/api/paddlespeech.server.engine.engine_pool.rst new file mode 100644 index 000000000..33455d39a --- /dev/null +++ b/docs/source/api/paddlespeech.server.engine.engine_pool.rst @@ -0,0 +1,7 @@ +paddlespeech.server.engine.engine\_pool module +============================================== + +.. 
automodule:: paddlespeech.server.engine.engine_pool + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.server.engine.engine_warmup.rst b/docs/source/api/paddlespeech.server.engine.engine_warmup.rst new file mode 100644 index 000000000..f134a87a6 --- /dev/null +++ b/docs/source/api/paddlespeech.server.engine.engine_warmup.rst @@ -0,0 +1,7 @@ +paddlespeech.server.engine.engine\_warmup module +================================================ + +.. automodule:: paddlespeech.server.engine.engine_warmup + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.server.engine.rst b/docs/source/api/paddlespeech.server.engine.rst new file mode 100644 index 000000000..44d5730db --- /dev/null +++ b/docs/source/api/paddlespeech.server.engine.rst @@ -0,0 +1,31 @@ +paddlespeech.server.engine package +================================== + +.. automodule:: paddlespeech.server.engine + :members: + :undoc-members: + :show-inheritance: + +Subpackages +----------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.server.engine.acs + paddlespeech.server.engine.asr + paddlespeech.server.engine.cls + paddlespeech.server.engine.text + paddlespeech.server.engine.tts + paddlespeech.server.engine.vector + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.server.engine.base_engine + paddlespeech.server.engine.engine_factory + paddlespeech.server.engine.engine_pool + paddlespeech.server.engine.engine_warmup diff --git a/docs/source/api/paddlespeech.server.engine.text.python.rst b/docs/source/api/paddlespeech.server.engine.text.python.rst new file mode 100644 index 000000000..4661b649b --- /dev/null +++ b/docs/source/api/paddlespeech.server.engine.text.python.rst @@ -0,0 +1,15 @@ +paddlespeech.server.engine.text.python package +============================================== + +.. automodule:: paddlespeech.server.engine.text.python + :members: + :undoc-members: + :show-inheritance: + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.server.engine.text.python.text_engine diff --git a/docs/source/api/paddlespeech.server.engine.text.python.text_engine.rst b/docs/source/api/paddlespeech.server.engine.text.python.text_engine.rst new file mode 100644 index 000000000..97c870dd9 --- /dev/null +++ b/docs/source/api/paddlespeech.server.engine.text.python.text_engine.rst @@ -0,0 +1,7 @@ +paddlespeech.server.engine.text.python.text\_engine module +========================================================== + +.. automodule:: paddlespeech.server.engine.text.python.text_engine + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.server.engine.text.rst b/docs/source/api/paddlespeech.server.engine.text.rst new file mode 100644 index 000000000..4d207800e --- /dev/null +++ b/docs/source/api/paddlespeech.server.engine.text.rst @@ -0,0 +1,15 @@ +paddlespeech.server.engine.text package +======================================= + +.. automodule:: paddlespeech.server.engine.text + :members: + :undoc-members: + :show-inheritance: + +Subpackages +----------- + +.. 
toctree:: + :maxdepth: 4 + + paddlespeech.server.engine.text.python diff --git a/docs/source/api/paddlespeech.server.engine.tts.online.onnx.rst b/docs/source/api/paddlespeech.server.engine.tts.online.onnx.rst new file mode 100644 index 000000000..32ac10971 --- /dev/null +++ b/docs/source/api/paddlespeech.server.engine.tts.online.onnx.rst @@ -0,0 +1,15 @@ +paddlespeech.server.engine.tts.online.onnx package +================================================== + +.. automodule:: paddlespeech.server.engine.tts.online.onnx + :members: + :undoc-members: + :show-inheritance: + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.server.engine.tts.online.onnx.tts_engine diff --git a/docs/source/api/paddlespeech.server.engine.tts.online.onnx.tts_engine.rst b/docs/source/api/paddlespeech.server.engine.tts.online.onnx.tts_engine.rst new file mode 100644 index 000000000..1d0a7c825 --- /dev/null +++ b/docs/source/api/paddlespeech.server.engine.tts.online.onnx.tts_engine.rst @@ -0,0 +1,7 @@ +paddlespeech.server.engine.tts.online.onnx.tts\_engine module +============================================================= + +.. automodule:: paddlespeech.server.engine.tts.online.onnx.tts_engine + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.server.engine.tts.online.python.rst b/docs/source/api/paddlespeech.server.engine.tts.online.python.rst new file mode 100644 index 000000000..c4ac5a6da --- /dev/null +++ b/docs/source/api/paddlespeech.server.engine.tts.online.python.rst @@ -0,0 +1,15 @@ +paddlespeech.server.engine.tts.online.python package +==================================================== + +.. automodule:: paddlespeech.server.engine.tts.online.python + :members: + :undoc-members: + :show-inheritance: + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.server.engine.tts.online.python.tts_engine diff --git a/docs/source/api/paddlespeech.server.engine.tts.online.python.tts_engine.rst b/docs/source/api/paddlespeech.server.engine.tts.online.python.tts_engine.rst new file mode 100644 index 000000000..22c69e0fc --- /dev/null +++ b/docs/source/api/paddlespeech.server.engine.tts.online.python.tts_engine.rst @@ -0,0 +1,7 @@ +paddlespeech.server.engine.tts.online.python.tts\_engine module +=============================================================== + +.. automodule:: paddlespeech.server.engine.tts.online.python.tts_engine + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.server.engine.tts.online.rst b/docs/source/api/paddlespeech.server.engine.tts.online.rst new file mode 100644 index 000000000..f8b573e2f --- /dev/null +++ b/docs/source/api/paddlespeech.server.engine.tts.online.rst @@ -0,0 +1,16 @@ +paddlespeech.server.engine.tts.online package +============================================= + +.. automodule:: paddlespeech.server.engine.tts.online + :members: + :undoc-members: + :show-inheritance: + +Subpackages +----------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.server.engine.tts.online.onnx + paddlespeech.server.engine.tts.online.python diff --git a/docs/source/api/paddlespeech.server.engine.tts.paddleinference.rst b/docs/source/api/paddlespeech.server.engine.tts.paddleinference.rst new file mode 100644 index 000000000..42ab7912b --- /dev/null +++ b/docs/source/api/paddlespeech.server.engine.tts.paddleinference.rst @@ -0,0 +1,15 @@ +paddlespeech.server.engine.tts.paddleinference package +====================================================== + +.. 
automodule:: paddlespeech.server.engine.tts.paddleinference + :members: + :undoc-members: + :show-inheritance: + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.server.engine.tts.paddleinference.tts_engine diff --git a/docs/source/api/paddlespeech.server.engine.tts.paddleinference.tts_engine.rst b/docs/source/api/paddlespeech.server.engine.tts.paddleinference.tts_engine.rst new file mode 100644 index 000000000..eb49ff34c --- /dev/null +++ b/docs/source/api/paddlespeech.server.engine.tts.paddleinference.tts_engine.rst @@ -0,0 +1,7 @@ +paddlespeech.server.engine.tts.paddleinference.tts\_engine module +================================================================= + +.. automodule:: paddlespeech.server.engine.tts.paddleinference.tts_engine + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.server.engine.tts.python.rst b/docs/source/api/paddlespeech.server.engine.tts.python.rst new file mode 100644 index 000000000..bd011c504 --- /dev/null +++ b/docs/source/api/paddlespeech.server.engine.tts.python.rst @@ -0,0 +1,15 @@ +paddlespeech.server.engine.tts.python package +============================================= + +.. automodule:: paddlespeech.server.engine.tts.python + :members: + :undoc-members: + :show-inheritance: + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.server.engine.tts.python.tts_engine diff --git a/docs/source/api/paddlespeech.server.engine.tts.python.tts_engine.rst b/docs/source/api/paddlespeech.server.engine.tts.python.tts_engine.rst new file mode 100644 index 000000000..1f1b401c1 --- /dev/null +++ b/docs/source/api/paddlespeech.server.engine.tts.python.tts_engine.rst @@ -0,0 +1,7 @@ +paddlespeech.server.engine.tts.python.tts\_engine module +======================================================== + +.. automodule:: paddlespeech.server.engine.tts.python.tts_engine + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.server.engine.tts.rst b/docs/source/api/paddlespeech.server.engine.tts.rst new file mode 100644 index 000000000..71ce74ea6 --- /dev/null +++ b/docs/source/api/paddlespeech.server.engine.tts.rst @@ -0,0 +1,17 @@ +paddlespeech.server.engine.tts package +====================================== + +.. automodule:: paddlespeech.server.engine.tts + :members: + :undoc-members: + :show-inheritance: + +Subpackages +----------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.server.engine.tts.online + paddlespeech.server.engine.tts.paddleinference + paddlespeech.server.engine.tts.python diff --git a/docs/source/api/paddlespeech.server.engine.vector.python.rst b/docs/source/api/paddlespeech.server.engine.vector.python.rst new file mode 100644 index 000000000..8edd65b15 --- /dev/null +++ b/docs/source/api/paddlespeech.server.engine.vector.python.rst @@ -0,0 +1,15 @@ +paddlespeech.server.engine.vector.python package +================================================ + +.. automodule:: paddlespeech.server.engine.vector.python + :members: + :undoc-members: + :show-inheritance: + +Submodules +---------- + +.. 
toctree:: + :maxdepth: 4 + + paddlespeech.server.engine.vector.python.vector_engine diff --git a/docs/source/api/paddlespeech.server.engine.vector.python.vector_engine.rst b/docs/source/api/paddlespeech.server.engine.vector.python.vector_engine.rst new file mode 100644 index 000000000..637ff3815 --- /dev/null +++ b/docs/source/api/paddlespeech.server.engine.vector.python.vector_engine.rst @@ -0,0 +1,7 @@ +paddlespeech.server.engine.vector.python.vector\_engine module +============================================================== + +.. automodule:: paddlespeech.server.engine.vector.python.vector_engine + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.server.engine.vector.rst b/docs/source/api/paddlespeech.server.engine.vector.rst new file mode 100644 index 000000000..a641dc619 --- /dev/null +++ b/docs/source/api/paddlespeech.server.engine.vector.rst @@ -0,0 +1,15 @@ +paddlespeech.server.engine.vector package +========================================= + +.. automodule:: paddlespeech.server.engine.vector + :members: + :undoc-members: + :show-inheritance: + +Subpackages +----------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.server.engine.vector.python diff --git a/docs/source/api/paddlespeech.server.entry.rst b/docs/source/api/paddlespeech.server.entry.rst new file mode 100644 index 000000000..e27b7a242 --- /dev/null +++ b/docs/source/api/paddlespeech.server.entry.rst @@ -0,0 +1,7 @@ +paddlespeech.server.entry module +================================ + +.. automodule:: paddlespeech.server.entry + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.server.executor.rst b/docs/source/api/paddlespeech.server.executor.rst new file mode 100644 index 000000000..7557fa9b9 --- /dev/null +++ b/docs/source/api/paddlespeech.server.executor.rst @@ -0,0 +1,7 @@ +paddlespeech.server.executor module +=================================== + +.. automodule:: paddlespeech.server.executor + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.server.restful.acs_api.rst b/docs/source/api/paddlespeech.server.restful.acs_api.rst new file mode 100644 index 000000000..7dd9c48ac --- /dev/null +++ b/docs/source/api/paddlespeech.server.restful.acs_api.rst @@ -0,0 +1,7 @@ +paddlespeech.server.restful.acs\_api module +=========================================== + +.. automodule:: paddlespeech.server.restful.acs_api + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.server.restful.api.rst b/docs/source/api/paddlespeech.server.restful.api.rst new file mode 100644 index 000000000..57347755f --- /dev/null +++ b/docs/source/api/paddlespeech.server.restful.api.rst @@ -0,0 +1,7 @@ +paddlespeech.server.restful.api module +====================================== + +.. automodule:: paddlespeech.server.restful.api + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.server.restful.asr_api.rst b/docs/source/api/paddlespeech.server.restful.asr_api.rst new file mode 100644 index 000000000..bc7291abf --- /dev/null +++ b/docs/source/api/paddlespeech.server.restful.asr_api.rst @@ -0,0 +1,7 @@ +paddlespeech.server.restful.asr\_api module +=========================================== + +.. 
automodule:: paddlespeech.server.restful.asr_api + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.server.restful.cls_api.rst b/docs/source/api/paddlespeech.server.restful.cls_api.rst new file mode 100644 index 000000000..40abe5ca3 --- /dev/null +++ b/docs/source/api/paddlespeech.server.restful.cls_api.rst @@ -0,0 +1,7 @@ +paddlespeech.server.restful.cls\_api module +=========================================== + +.. automodule:: paddlespeech.server.restful.cls_api + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.server.restful.request.rst b/docs/source/api/paddlespeech.server.restful.request.rst new file mode 100644 index 000000000..a39877e64 --- /dev/null +++ b/docs/source/api/paddlespeech.server.restful.request.rst @@ -0,0 +1,7 @@ +paddlespeech.server.restful.request module +========================================== + +.. automodule:: paddlespeech.server.restful.request + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.server.restful.response.rst b/docs/source/api/paddlespeech.server.restful.response.rst new file mode 100644 index 000000000..89c0cf05f --- /dev/null +++ b/docs/source/api/paddlespeech.server.restful.response.rst @@ -0,0 +1,7 @@ +paddlespeech.server.restful.response module +=========================================== + +.. automodule:: paddlespeech.server.restful.response + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.server.restful.rst b/docs/source/api/paddlespeech.server.restful.rst new file mode 100644 index 000000000..340e6090a --- /dev/null +++ b/docs/source/api/paddlespeech.server.restful.rst @@ -0,0 +1,23 @@ +paddlespeech.server.restful package +=================================== + +.. automodule:: paddlespeech.server.restful + :members: + :undoc-members: + :show-inheritance: + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.server.restful.acs_api + paddlespeech.server.restful.api + paddlespeech.server.restful.asr_api + paddlespeech.server.restful.cls_api + paddlespeech.server.restful.request + paddlespeech.server.restful.response + paddlespeech.server.restful.text_api + paddlespeech.server.restful.tts_api + paddlespeech.server.restful.vector_api diff --git a/docs/source/api/paddlespeech.server.restful.text_api.rst b/docs/source/api/paddlespeech.server.restful.text_api.rst new file mode 100644 index 000000000..6894fdbb3 --- /dev/null +++ b/docs/source/api/paddlespeech.server.restful.text_api.rst @@ -0,0 +1,7 @@ +paddlespeech.server.restful.text\_api module +============================================ + +.. automodule:: paddlespeech.server.restful.text_api + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.server.restful.tts_api.rst b/docs/source/api/paddlespeech.server.restful.tts_api.rst new file mode 100644 index 000000000..53259b2c8 --- /dev/null +++ b/docs/source/api/paddlespeech.server.restful.tts_api.rst @@ -0,0 +1,7 @@ +paddlespeech.server.restful.tts\_api module +=========================================== + +.. 
automodule:: paddlespeech.server.restful.tts_api + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.server.restful.vector_api.rst b/docs/source/api/paddlespeech.server.restful.vector_api.rst new file mode 100644 index 000000000..0cf7a8ba4 --- /dev/null +++ b/docs/source/api/paddlespeech.server.restful.vector_api.rst @@ -0,0 +1,7 @@ +paddlespeech.server.restful.vector\_api module +============================================== + +.. automodule:: paddlespeech.server.restful.vector_api + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.server.rst b/docs/source/api/paddlespeech.server.rst new file mode 100644 index 000000000..169dd62e1 --- /dev/null +++ b/docs/source/api/paddlespeech.server.rst @@ -0,0 +1,31 @@ +paddlespeech.server package +=========================== + +.. automodule:: paddlespeech.server + :members: + :undoc-members: + :show-inheritance: + +Subpackages +----------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.server.bin + paddlespeech.server.engine + paddlespeech.server.restful + paddlespeech.server.tests + paddlespeech.server.utils + paddlespeech.server.ws + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.server.base_commands + paddlespeech.server.entry + paddlespeech.server.executor + paddlespeech.server.util diff --git a/docs/source/api/paddlespeech.server.tests.asr.offline.http_client.rst b/docs/source/api/paddlespeech.server.tests.asr.offline.http_client.rst new file mode 100644 index 000000000..1acc51538 --- /dev/null +++ b/docs/source/api/paddlespeech.server.tests.asr.offline.http_client.rst @@ -0,0 +1,7 @@ +paddlespeech.server.tests.asr.offline.http\_client module +========================================================= + +.. automodule:: paddlespeech.server.tests.asr.offline.http_client + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.server.tests.asr.offline.rst b/docs/source/api/paddlespeech.server.tests.asr.offline.rst new file mode 100644 index 000000000..f9178e0a6 --- /dev/null +++ b/docs/source/api/paddlespeech.server.tests.asr.offline.rst @@ -0,0 +1,15 @@ +paddlespeech.server.tests.asr.offline package +============================================= + +.. automodule:: paddlespeech.server.tests.asr.offline + :members: + :undoc-members: + :show-inheritance: + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.server.tests.asr.offline.http_client diff --git a/docs/source/api/paddlespeech.server.tests.asr.rst b/docs/source/api/paddlespeech.server.tests.asr.rst new file mode 100644 index 000000000..a3ee49b77 --- /dev/null +++ b/docs/source/api/paddlespeech.server.tests.asr.rst @@ -0,0 +1,15 @@ +paddlespeech.server.tests.asr package +===================================== + +.. automodule:: paddlespeech.server.tests.asr + :members: + :undoc-members: + :show-inheritance: + +Subpackages +----------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.server.tests.asr.offline diff --git a/docs/source/api/paddlespeech.server.tests.rst b/docs/source/api/paddlespeech.server.tests.rst new file mode 100644 index 000000000..275f0f4ea --- /dev/null +++ b/docs/source/api/paddlespeech.server.tests.rst @@ -0,0 +1,15 @@ +paddlespeech.server.tests package +================================= + +.. automodule:: paddlespeech.server.tests + :members: + :undoc-members: + :show-inheritance: + +Subpackages +----------- + +.. 
toctree:: + :maxdepth: 4 + + paddlespeech.server.tests.asr diff --git a/docs/source/api/paddlespeech.server.util.rst b/docs/source/api/paddlespeech.server.util.rst new file mode 100644 index 000000000..6d38c1bc2 --- /dev/null +++ b/docs/source/api/paddlespeech.server.util.rst @@ -0,0 +1,7 @@ +paddlespeech.server.util module +=============================== + +.. automodule:: paddlespeech.server.util + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.server.utils.audio_handler.rst b/docs/source/api/paddlespeech.server.utils.audio_handler.rst new file mode 100644 index 000000000..334245d1c --- /dev/null +++ b/docs/source/api/paddlespeech.server.utils.audio_handler.rst @@ -0,0 +1,7 @@ +paddlespeech.server.utils.audio\_handler module +=============================================== + +.. automodule:: paddlespeech.server.utils.audio_handler + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.server.utils.audio_process.rst b/docs/source/api/paddlespeech.server.utils.audio_process.rst new file mode 100644 index 000000000..f96304ac0 --- /dev/null +++ b/docs/source/api/paddlespeech.server.utils.audio_process.rst @@ -0,0 +1,7 @@ +paddlespeech.server.utils.audio\_process module +=============================================== + +.. automodule:: paddlespeech.server.utils.audio_process + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.server.utils.buffer.rst b/docs/source/api/paddlespeech.server.utils.buffer.rst new file mode 100644 index 000000000..e45d62b80 --- /dev/null +++ b/docs/source/api/paddlespeech.server.utils.buffer.rst @@ -0,0 +1,7 @@ +paddlespeech.server.utils.buffer module +======================================= + +.. automodule:: paddlespeech.server.utils.buffer + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.server.utils.config.rst b/docs/source/api/paddlespeech.server.utils.config.rst new file mode 100644 index 000000000..db1a26930 --- /dev/null +++ b/docs/source/api/paddlespeech.server.utils.config.rst @@ -0,0 +1,7 @@ +paddlespeech.server.utils.config module +======================================= + +.. automodule:: paddlespeech.server.utils.config + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.server.utils.errors.rst b/docs/source/api/paddlespeech.server.utils.errors.rst new file mode 100644 index 000000000..38c2ce3eb --- /dev/null +++ b/docs/source/api/paddlespeech.server.utils.errors.rst @@ -0,0 +1,7 @@ +paddlespeech.server.utils.errors module +======================================= + +.. automodule:: paddlespeech.server.utils.errors + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.server.utils.exception.rst b/docs/source/api/paddlespeech.server.utils.exception.rst new file mode 100644 index 000000000..440665eaa --- /dev/null +++ b/docs/source/api/paddlespeech.server.utils.exception.rst @@ -0,0 +1,7 @@ +paddlespeech.server.utils.exception module +========================================== + +.. automodule:: paddlespeech.server.utils.exception + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.server.utils.log.rst b/docs/source/api/paddlespeech.server.utils.log.rst new file mode 100644 index 000000000..453b4a61f --- /dev/null +++ b/docs/source/api/paddlespeech.server.utils.log.rst @@ -0,0 +1,7 @@ +paddlespeech.server.utils.log module +==================================== + +.. 
automodule:: paddlespeech.server.utils.log + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.server.utils.onnx_infer.rst b/docs/source/api/paddlespeech.server.utils.onnx_infer.rst new file mode 100644 index 000000000..8bea8bbd3 --- /dev/null +++ b/docs/source/api/paddlespeech.server.utils.onnx_infer.rst @@ -0,0 +1,7 @@ +paddlespeech.server.utils.onnx\_infer module +============================================ + +.. automodule:: paddlespeech.server.utils.onnx_infer + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.server.utils.paddle_predictor.rst b/docs/source/api/paddlespeech.server.utils.paddle_predictor.rst new file mode 100644 index 000000000..c48147a4f --- /dev/null +++ b/docs/source/api/paddlespeech.server.utils.paddle_predictor.rst @@ -0,0 +1,7 @@ +paddlespeech.server.utils.paddle\_predictor module +================================================== + +.. automodule:: paddlespeech.server.utils.paddle_predictor + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.server.utils.rst b/docs/source/api/paddlespeech.server.utils.rst new file mode 100644 index 000000000..9d1166392 --- /dev/null +++ b/docs/source/api/paddlespeech.server.utils.rst @@ -0,0 +1,25 @@ +paddlespeech.server.utils package +================================= + +.. automodule:: paddlespeech.server.utils + :members: + :undoc-members: + :show-inheritance: + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.server.utils.audio_handler + paddlespeech.server.utils.audio_process + paddlespeech.server.utils.buffer + paddlespeech.server.utils.config + paddlespeech.server.utils.errors + paddlespeech.server.utils.exception + paddlespeech.server.utils.log + paddlespeech.server.utils.onnx_infer + paddlespeech.server.utils.paddle_predictor + paddlespeech.server.utils.util + paddlespeech.server.utils.vad diff --git a/docs/source/api/paddlespeech.server.utils.util.rst b/docs/source/api/paddlespeech.server.utils.util.rst new file mode 100644 index 000000000..ca060ab6d --- /dev/null +++ b/docs/source/api/paddlespeech.server.utils.util.rst @@ -0,0 +1,7 @@ +paddlespeech.server.utils.util module +===================================== + +.. automodule:: paddlespeech.server.utils.util + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.server.utils.vad.rst b/docs/source/api/paddlespeech.server.utils.vad.rst new file mode 100644 index 000000000..8caa82b7a --- /dev/null +++ b/docs/source/api/paddlespeech.server.utils.vad.rst @@ -0,0 +1,7 @@ +paddlespeech.server.utils.vad module +==================================== + +.. automodule:: paddlespeech.server.utils.vad + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.server.ws.api.rst b/docs/source/api/paddlespeech.server.ws.api.rst new file mode 100644 index 000000000..0504cb7ab --- /dev/null +++ b/docs/source/api/paddlespeech.server.ws.api.rst @@ -0,0 +1,7 @@ +paddlespeech.server.ws.api module +================================= + +.. automodule:: paddlespeech.server.ws.api + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.server.ws.asr_api.rst b/docs/source/api/paddlespeech.server.ws.asr_api.rst new file mode 100644 index 000000000..16806adb2 --- /dev/null +++ b/docs/source/api/paddlespeech.server.ws.asr_api.rst @@ -0,0 +1,7 @@ +paddlespeech.server.ws.asr\_api module +====================================== + +.. 
automodule:: paddlespeech.server.ws.asr_api + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.server.ws.rst b/docs/source/api/paddlespeech.server.ws.rst new file mode 100644 index 000000000..b3914cc05 --- /dev/null +++ b/docs/source/api/paddlespeech.server.ws.rst @@ -0,0 +1,17 @@ +paddlespeech.server.ws package +============================== + +.. automodule:: paddlespeech.server.ws + :members: + :undoc-members: + :show-inheritance: + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.server.ws.api + paddlespeech.server.ws.asr_api + paddlespeech.server.ws.tts_api diff --git a/docs/source/api/paddlespeech.server.ws.tts_api.rst b/docs/source/api/paddlespeech.server.ws.tts_api.rst new file mode 100644 index 000000000..788ce6aa3 --- /dev/null +++ b/docs/source/api/paddlespeech.server.ws.tts_api.rst @@ -0,0 +1,7 @@ +paddlespeech.server.ws.tts\_api module +====================================== + +.. automodule:: paddlespeech.server.ws.tts_api + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.audio.audio.rst b/docs/source/api/paddlespeech.t2s.audio.audio.rst new file mode 100644 index 000000000..b9b216493 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.audio.audio.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.audio.audio module +=================================== + +.. automodule:: paddlespeech.t2s.audio.audio + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.audio.codec.rst b/docs/source/api/paddlespeech.t2s.audio.codec.rst new file mode 100644 index 000000000..8d4652b76 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.audio.codec.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.audio.codec module +=================================== + +.. automodule:: paddlespeech.t2s.audio.codec + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.audio.rst b/docs/source/api/paddlespeech.t2s.audio.rst new file mode 100644 index 000000000..383b470c2 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.audio.rst @@ -0,0 +1,17 @@ +paddlespeech.t2s.audio package +============================== + +.. automodule:: paddlespeech.t2s.audio + :members: + :undoc-members: + :show-inheritance: + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.t2s.audio.audio + paddlespeech.t2s.audio.codec + paddlespeech.t2s.audio.spec_normalizer diff --git a/docs/source/api/paddlespeech.t2s.audio.spec_normalizer.rst b/docs/source/api/paddlespeech.t2s.audio.spec_normalizer.rst new file mode 100644 index 000000000..eab1c81c2 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.audio.spec_normalizer.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.audio.spec\_normalizer module +============================================== + +.. automodule:: paddlespeech.t2s.audio.spec_normalizer + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.datasets.am_batch_fn.rst b/docs/source/api/paddlespeech.t2s.datasets.am_batch_fn.rst new file mode 100644 index 000000000..8484cf8fc --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.datasets.am_batch_fn.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.datasets.am\_batch\_fn module +============================================== + +.. 
automodule:: paddlespeech.t2s.datasets.am_batch_fn + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.datasets.batch.rst b/docs/source/api/paddlespeech.t2s.datasets.batch.rst new file mode 100644 index 000000000..e759ae118 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.datasets.batch.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.datasets.batch module +====================================== + +.. automodule:: paddlespeech.t2s.datasets.batch + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.datasets.data_table.rst b/docs/source/api/paddlespeech.t2s.datasets.data_table.rst new file mode 100644 index 000000000..8db602dc6 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.datasets.data_table.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.datasets.data\_table module +============================================ + +.. automodule:: paddlespeech.t2s.datasets.data_table + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.datasets.dataset.rst b/docs/source/api/paddlespeech.t2s.datasets.dataset.rst new file mode 100644 index 000000000..109d40a5d --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.datasets.dataset.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.datasets.dataset module +======================================== + +.. automodule:: paddlespeech.t2s.datasets.dataset + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.datasets.get_feats.rst b/docs/source/api/paddlespeech.t2s.datasets.get_feats.rst new file mode 100644 index 000000000..4a6676a03 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.datasets.get_feats.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.datasets.get\_feats module +=========================================== + +.. automodule:: paddlespeech.t2s.datasets.get_feats + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.datasets.ljspeech.rst b/docs/source/api/paddlespeech.t2s.datasets.ljspeech.rst new file mode 100644 index 000000000..61da37821 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.datasets.ljspeech.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.datasets.ljspeech module +========================================= + +.. automodule:: paddlespeech.t2s.datasets.ljspeech + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.datasets.preprocess_utils.rst b/docs/source/api/paddlespeech.t2s.datasets.preprocess_utils.rst new file mode 100644 index 000000000..9bed6cf2d --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.datasets.preprocess_utils.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.datasets.preprocess\_utils module +================================================== + +.. automodule:: paddlespeech.t2s.datasets.preprocess_utils + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.datasets.rst b/docs/source/api/paddlespeech.t2s.datasets.rst new file mode 100644 index 000000000..b40eb2bf1 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.datasets.rst @@ -0,0 +1,22 @@ +paddlespeech.t2s.datasets package +================================= + +.. automodule:: paddlespeech.t2s.datasets + :members: + :undoc-members: + :show-inheritance: + +Submodules +---------- + +.. 
toctree:: + :maxdepth: 4 + + paddlespeech.t2s.datasets.am_batch_fn + paddlespeech.t2s.datasets.batch + paddlespeech.t2s.datasets.data_table + paddlespeech.t2s.datasets.dataset + paddlespeech.t2s.datasets.get_feats + paddlespeech.t2s.datasets.ljspeech + paddlespeech.t2s.datasets.preprocess_utils + paddlespeech.t2s.datasets.vocoder_batch_fn diff --git a/docs/source/api/paddlespeech.t2s.datasets.vocoder_batch_fn.rst b/docs/source/api/paddlespeech.t2s.datasets.vocoder_batch_fn.rst new file mode 100644 index 000000000..5f89725d7 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.datasets.vocoder_batch_fn.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.datasets.vocoder\_batch\_fn module +=================================================== + +.. automodule:: paddlespeech.t2s.datasets.vocoder_batch_fn + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.exps.fastspeech2.gen_gta_mel.rst b/docs/source/api/paddlespeech.t2s.exps.fastspeech2.gen_gta_mel.rst new file mode 100644 index 000000000..353320a90 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.exps.fastspeech2.gen_gta_mel.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.exps.fastspeech2.gen\_gta\_mel module +====================================================== + +.. automodule:: paddlespeech.t2s.exps.fastspeech2.gen_gta_mel + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.exps.fastspeech2.normalize.rst b/docs/source/api/paddlespeech.t2s.exps.fastspeech2.normalize.rst new file mode 100644 index 000000000..1bea8e521 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.exps.fastspeech2.normalize.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.exps.fastspeech2.normalize module +================================================== + +.. automodule:: paddlespeech.t2s.exps.fastspeech2.normalize + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.exps.fastspeech2.preprocess.rst b/docs/source/api/paddlespeech.t2s.exps.fastspeech2.preprocess.rst new file mode 100644 index 000000000..a25b16ca3 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.exps.fastspeech2.preprocess.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.exps.fastspeech2.preprocess module +=================================================== + +.. automodule:: paddlespeech.t2s.exps.fastspeech2.preprocess + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.exps.fastspeech2.rst b/docs/source/api/paddlespeech.t2s.exps.fastspeech2.rst new file mode 100644 index 000000000..3c98aa882 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.exps.fastspeech2.rst @@ -0,0 +1,18 @@ +paddlespeech.t2s.exps.fastspeech2 package +========================================= + +.. automodule:: paddlespeech.t2s.exps.fastspeech2 + :members: + :undoc-members: + :show-inheritance: + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.t2s.exps.fastspeech2.gen_gta_mel + paddlespeech.t2s.exps.fastspeech2.normalize + paddlespeech.t2s.exps.fastspeech2.preprocess + paddlespeech.t2s.exps.fastspeech2.train diff --git a/docs/source/api/paddlespeech.t2s.exps.fastspeech2.train.rst b/docs/source/api/paddlespeech.t2s.exps.fastspeech2.train.rst new file mode 100644 index 000000000..66c055252 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.exps.fastspeech2.train.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.exps.fastspeech2.train module +============================================== + +.. 
automodule:: paddlespeech.t2s.exps.fastspeech2.train + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.exps.gan_vocoder.hifigan.rst b/docs/source/api/paddlespeech.t2s.exps.gan_vocoder.hifigan.rst new file mode 100644 index 000000000..bfac832ad --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.exps.gan_vocoder.hifigan.rst @@ -0,0 +1,15 @@ +paddlespeech.t2s.exps.gan\_vocoder.hifigan package +================================================== + +.. automodule:: paddlespeech.t2s.exps.gan_vocoder.hifigan + :members: + :undoc-members: + :show-inheritance: + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.t2s.exps.gan_vocoder.hifigan.train diff --git a/docs/source/api/paddlespeech.t2s.exps.gan_vocoder.hifigan.train.rst b/docs/source/api/paddlespeech.t2s.exps.gan_vocoder.hifigan.train.rst new file mode 100644 index 000000000..237685a6b --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.exps.gan_vocoder.hifigan.train.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.exps.gan\_vocoder.hifigan.train module +======================================================= + +.. automodule:: paddlespeech.t2s.exps.gan_vocoder.hifigan.train + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.exps.gan_vocoder.multi_band_melgan.rst b/docs/source/api/paddlespeech.t2s.exps.gan_vocoder.multi_band_melgan.rst new file mode 100644 index 000000000..00a965cd0 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.exps.gan_vocoder.multi_band_melgan.rst @@ -0,0 +1,15 @@ +paddlespeech.t2s.exps.gan\_vocoder.multi\_band\_melgan package +============================================================== + +.. automodule:: paddlespeech.t2s.exps.gan_vocoder.multi_band_melgan + :members: + :undoc-members: + :show-inheritance: + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.t2s.exps.gan_vocoder.multi_band_melgan.train diff --git a/docs/source/api/paddlespeech.t2s.exps.gan_vocoder.multi_band_melgan.train.rst b/docs/source/api/paddlespeech.t2s.exps.gan_vocoder.multi_band_melgan.train.rst new file mode 100644 index 000000000..c59490b00 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.exps.gan_vocoder.multi_band_melgan.train.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.exps.gan\_vocoder.multi\_band\_melgan.train module +=================================================================== + +.. automodule:: paddlespeech.t2s.exps.gan_vocoder.multi_band_melgan.train + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.exps.gan_vocoder.normalize.rst b/docs/source/api/paddlespeech.t2s.exps.gan_vocoder.normalize.rst new file mode 100644 index 000000000..64a50c2ee --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.exps.gan_vocoder.normalize.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.exps.gan\_vocoder.normalize module +=================================================== + +.. automodule:: paddlespeech.t2s.exps.gan_vocoder.normalize + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.exps.gan_vocoder.parallelwave_gan.rst b/docs/source/api/paddlespeech.t2s.exps.gan_vocoder.parallelwave_gan.rst new file mode 100644 index 000000000..21b6225dc --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.exps.gan_vocoder.parallelwave_gan.rst @@ -0,0 +1,16 @@ +paddlespeech.t2s.exps.gan\_vocoder.parallelwave\_gan package +============================================================ + +.. 
automodule:: paddlespeech.t2s.exps.gan_vocoder.parallelwave_gan + :members: + :undoc-members: + :show-inheritance: + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.t2s.exps.gan_vocoder.parallelwave_gan.synthesize_from_wav + paddlespeech.t2s.exps.gan_vocoder.parallelwave_gan.train diff --git a/docs/source/api/paddlespeech.t2s.exps.gan_vocoder.parallelwave_gan.synthesize_from_wav.rst b/docs/source/api/paddlespeech.t2s.exps.gan_vocoder.parallelwave_gan.synthesize_from_wav.rst new file mode 100644 index 000000000..6057fb62f --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.exps.gan_vocoder.parallelwave_gan.synthesize_from_wav.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.exps.gan\_vocoder.parallelwave\_gan.synthesize\_from\_wav module +================================================================================= + +.. automodule:: paddlespeech.t2s.exps.gan_vocoder.parallelwave_gan.synthesize_from_wav + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.exps.gan_vocoder.parallelwave_gan.train.rst b/docs/source/api/paddlespeech.t2s.exps.gan_vocoder.parallelwave_gan.train.rst new file mode 100644 index 000000000..b1731d2ab --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.exps.gan_vocoder.parallelwave_gan.train.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.exps.gan\_vocoder.parallelwave\_gan.train module +================================================================= + +.. automodule:: paddlespeech.t2s.exps.gan_vocoder.parallelwave_gan.train + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.exps.gan_vocoder.preprocess.rst b/docs/source/api/paddlespeech.t2s.exps.gan_vocoder.preprocess.rst new file mode 100644 index 000000000..b6b8f1664 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.exps.gan_vocoder.preprocess.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.exps.gan\_vocoder.preprocess module +==================================================== + +.. automodule:: paddlespeech.t2s.exps.gan_vocoder.preprocess + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.exps.gan_vocoder.rst b/docs/source/api/paddlespeech.t2s.exps.gan_vocoder.rst new file mode 100644 index 000000000..1b48b4b50 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.exps.gan_vocoder.rst @@ -0,0 +1,28 @@ +paddlespeech.t2s.exps.gan\_vocoder package +========================================== + +.. automodule:: paddlespeech.t2s.exps.gan_vocoder + :members: + :undoc-members: + :show-inheritance: + +Subpackages +----------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.t2s.exps.gan_vocoder.hifigan + paddlespeech.t2s.exps.gan_vocoder.multi_band_melgan + paddlespeech.t2s.exps.gan_vocoder.parallelwave_gan + paddlespeech.t2s.exps.gan_vocoder.style_melgan + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.t2s.exps.gan_vocoder.normalize + paddlespeech.t2s.exps.gan_vocoder.preprocess + paddlespeech.t2s.exps.gan_vocoder.synthesize diff --git a/docs/source/api/paddlespeech.t2s.exps.gan_vocoder.style_melgan.rst b/docs/source/api/paddlespeech.t2s.exps.gan_vocoder.style_melgan.rst new file mode 100644 index 000000000..ef96d6ee9 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.exps.gan_vocoder.style_melgan.rst @@ -0,0 +1,15 @@ +paddlespeech.t2s.exps.gan\_vocoder.style\_melgan package +======================================================== + +.. 
automodule:: paddlespeech.t2s.exps.gan_vocoder.style_melgan + :members: + :undoc-members: + :show-inheritance: + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.t2s.exps.gan_vocoder.style_melgan.train diff --git a/docs/source/api/paddlespeech.t2s.exps.gan_vocoder.style_melgan.train.rst b/docs/source/api/paddlespeech.t2s.exps.gan_vocoder.style_melgan.train.rst new file mode 100644 index 000000000..e1fdd14ba --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.exps.gan_vocoder.style_melgan.train.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.exps.gan\_vocoder.style\_melgan.train module +============================================================= + +.. automodule:: paddlespeech.t2s.exps.gan_vocoder.style_melgan.train + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.exps.gan_vocoder.synthesize.rst b/docs/source/api/paddlespeech.t2s.exps.gan_vocoder.synthesize.rst new file mode 100644 index 000000000..720e71f04 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.exps.gan_vocoder.synthesize.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.exps.gan\_vocoder.synthesize module +==================================================== + +.. automodule:: paddlespeech.t2s.exps.gan_vocoder.synthesize + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.exps.inference.rst b/docs/source/api/paddlespeech.t2s.exps.inference.rst new file mode 100644 index 000000000..7e404402c --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.exps.inference.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.exps.inference module +====================================== + +.. automodule:: paddlespeech.t2s.exps.inference + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.exps.inference_streaming.rst b/docs/source/api/paddlespeech.t2s.exps.inference_streaming.rst new file mode 100644 index 000000000..d531fa7c0 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.exps.inference_streaming.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.exps.inference\_streaming module +================================================= + +.. automodule:: paddlespeech.t2s.exps.inference_streaming + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.exps.ort_predict.rst b/docs/source/api/paddlespeech.t2s.exps.ort_predict.rst new file mode 100644 index 000000000..451aab1d4 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.exps.ort_predict.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.exps.ort\_predict module +========================================= + +.. automodule:: paddlespeech.t2s.exps.ort_predict + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.exps.ort_predict_e2e.rst b/docs/source/api/paddlespeech.t2s.exps.ort_predict_e2e.rst new file mode 100644 index 000000000..81a591929 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.exps.ort_predict_e2e.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.exps.ort\_predict\_e2e module +============================================== + +.. 
automodule:: paddlespeech.t2s.exps.ort_predict_e2e + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.exps.ort_predict_streaming.rst b/docs/source/api/paddlespeech.t2s.exps.ort_predict_streaming.rst new file mode 100644 index 000000000..b72cecfd8 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.exps.ort_predict_streaming.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.exps.ort\_predict\_streaming module +==================================================== + +.. automodule:: paddlespeech.t2s.exps.ort_predict_streaming + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.exps.rst b/docs/source/api/paddlespeech.t2s.exps.rst new file mode 100644 index 000000000..a688435eb --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.exps.rst @@ -0,0 +1,38 @@ +paddlespeech.t2s.exps package +============================= + +.. automodule:: paddlespeech.t2s.exps + :members: + :undoc-members: + :show-inheritance: + +Subpackages +----------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.t2s.exps.fastspeech2 + paddlespeech.t2s.exps.gan_vocoder + paddlespeech.t2s.exps.speedyspeech + paddlespeech.t2s.exps.tacotron2 + paddlespeech.t2s.exps.transformer_tts + paddlespeech.t2s.exps.waveflow + paddlespeech.t2s.exps.wavernn + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.t2s.exps.inference + paddlespeech.t2s.exps.inference_streaming + paddlespeech.t2s.exps.ort_predict + paddlespeech.t2s.exps.ort_predict_e2e + paddlespeech.t2s.exps.ort_predict_streaming + paddlespeech.t2s.exps.syn_utils + paddlespeech.t2s.exps.synthesize + paddlespeech.t2s.exps.synthesize_e2e + paddlespeech.t2s.exps.synthesize_streaming + paddlespeech.t2s.exps.voice_cloning diff --git a/docs/source/api/paddlespeech.t2s.exps.speedyspeech.gen_gta_mel.rst b/docs/source/api/paddlespeech.t2s.exps.speedyspeech.gen_gta_mel.rst new file mode 100644 index 000000000..83f92d23f --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.exps.speedyspeech.gen_gta_mel.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.exps.speedyspeech.gen\_gta\_mel module +======================================================= + +.. automodule:: paddlespeech.t2s.exps.speedyspeech.gen_gta_mel + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.exps.speedyspeech.inference.rst b/docs/source/api/paddlespeech.t2s.exps.speedyspeech.inference.rst new file mode 100644 index 000000000..92a1557e2 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.exps.speedyspeech.inference.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.exps.speedyspeech.inference module +=================================================== + +.. automodule:: paddlespeech.t2s.exps.speedyspeech.inference + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.exps.speedyspeech.normalize.rst b/docs/source/api/paddlespeech.t2s.exps.speedyspeech.normalize.rst new file mode 100644 index 000000000..e48d8a07b --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.exps.speedyspeech.normalize.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.exps.speedyspeech.normalize module +=================================================== + +.. 
automodule:: paddlespeech.t2s.exps.speedyspeech.normalize + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.exps.speedyspeech.preprocess.rst b/docs/source/api/paddlespeech.t2s.exps.speedyspeech.preprocess.rst new file mode 100644 index 000000000..edd0baca7 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.exps.speedyspeech.preprocess.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.exps.speedyspeech.preprocess module +==================================================== + +.. automodule:: paddlespeech.t2s.exps.speedyspeech.preprocess + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.exps.speedyspeech.rst b/docs/source/api/paddlespeech.t2s.exps.speedyspeech.rst new file mode 100644 index 000000000..d57300268 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.exps.speedyspeech.rst @@ -0,0 +1,20 @@ +paddlespeech.t2s.exps.speedyspeech package +========================================== + +.. automodule:: paddlespeech.t2s.exps.speedyspeech + :members: + :undoc-members: + :show-inheritance: + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.t2s.exps.speedyspeech.gen_gta_mel + paddlespeech.t2s.exps.speedyspeech.inference + paddlespeech.t2s.exps.speedyspeech.normalize + paddlespeech.t2s.exps.speedyspeech.preprocess + paddlespeech.t2s.exps.speedyspeech.synthesize_e2e + paddlespeech.t2s.exps.speedyspeech.train diff --git a/docs/source/api/paddlespeech.t2s.exps.speedyspeech.synthesize_e2e.rst b/docs/source/api/paddlespeech.t2s.exps.speedyspeech.synthesize_e2e.rst new file mode 100644 index 000000000..c95f08587 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.exps.speedyspeech.synthesize_e2e.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.exps.speedyspeech.synthesize\_e2e module +========================================================= + +.. automodule:: paddlespeech.t2s.exps.speedyspeech.synthesize_e2e + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.exps.speedyspeech.train.rst b/docs/source/api/paddlespeech.t2s.exps.speedyspeech.train.rst new file mode 100644 index 000000000..7f393ed44 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.exps.speedyspeech.train.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.exps.speedyspeech.train module +=============================================== + +.. automodule:: paddlespeech.t2s.exps.speedyspeech.train + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.exps.syn_utils.rst b/docs/source/api/paddlespeech.t2s.exps.syn_utils.rst new file mode 100644 index 000000000..408ce3571 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.exps.syn_utils.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.exps.syn\_utils module +======================================= + +.. automodule:: paddlespeech.t2s.exps.syn_utils + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.exps.synthesize.rst b/docs/source/api/paddlespeech.t2s.exps.synthesize.rst new file mode 100644 index 000000000..6131301c1 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.exps.synthesize.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.exps.synthesize module +======================================= + +.. 
automodule:: paddlespeech.t2s.exps.synthesize + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.exps.synthesize_e2e.rst b/docs/source/api/paddlespeech.t2s.exps.synthesize_e2e.rst new file mode 100644 index 000000000..afecc1fb0 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.exps.synthesize_e2e.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.exps.synthesize\_e2e module +============================================ + +.. automodule:: paddlespeech.t2s.exps.synthesize_e2e + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.exps.synthesize_streaming.rst b/docs/source/api/paddlespeech.t2s.exps.synthesize_streaming.rst new file mode 100644 index 000000000..d2a65fc99 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.exps.synthesize_streaming.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.exps.synthesize\_streaming module +================================================== + +.. automodule:: paddlespeech.t2s.exps.synthesize_streaming + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.exps.tacotron2.normalize.rst b/docs/source/api/paddlespeech.t2s.exps.tacotron2.normalize.rst new file mode 100644 index 000000000..9d32fc724 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.exps.tacotron2.normalize.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.exps.tacotron2.normalize module +================================================ + +.. automodule:: paddlespeech.t2s.exps.tacotron2.normalize + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.exps.tacotron2.preprocess.rst b/docs/source/api/paddlespeech.t2s.exps.tacotron2.preprocess.rst new file mode 100644 index 000000000..1613b4179 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.exps.tacotron2.preprocess.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.exps.tacotron2.preprocess module +================================================= + +.. automodule:: paddlespeech.t2s.exps.tacotron2.preprocess + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.exps.tacotron2.rst b/docs/source/api/paddlespeech.t2s.exps.tacotron2.rst new file mode 100644 index 000000000..8e0007764 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.exps.tacotron2.rst @@ -0,0 +1,17 @@ +paddlespeech.t2s.exps.tacotron2 package +======================================= + +.. automodule:: paddlespeech.t2s.exps.tacotron2 + :members: + :undoc-members: + :show-inheritance: + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.t2s.exps.tacotron2.normalize + paddlespeech.t2s.exps.tacotron2.preprocess + paddlespeech.t2s.exps.tacotron2.train diff --git a/docs/source/api/paddlespeech.t2s.exps.tacotron2.train.rst b/docs/source/api/paddlespeech.t2s.exps.tacotron2.train.rst new file mode 100644 index 000000000..fa637c583 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.exps.tacotron2.train.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.exps.tacotron2.train module +============================================ + +.. 
automodule:: paddlespeech.t2s.exps.tacotron2.train + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.exps.transformer_tts.normalize.rst b/docs/source/api/paddlespeech.t2s.exps.transformer_tts.normalize.rst new file mode 100644 index 000000000..077ee13f0 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.exps.transformer_tts.normalize.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.exps.transformer\_tts.normalize module +======================================================= + +.. automodule:: paddlespeech.t2s.exps.transformer_tts.normalize + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.exps.transformer_tts.preprocess.rst b/docs/source/api/paddlespeech.t2s.exps.transformer_tts.preprocess.rst new file mode 100644 index 000000000..73f53b763 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.exps.transformer_tts.preprocess.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.exps.transformer\_tts.preprocess module +======================================================== + +.. automodule:: paddlespeech.t2s.exps.transformer_tts.preprocess + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.exps.transformer_tts.rst b/docs/source/api/paddlespeech.t2s.exps.transformer_tts.rst new file mode 100644 index 000000000..3a951f3b4 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.exps.transformer_tts.rst @@ -0,0 +1,19 @@ +paddlespeech.t2s.exps.transformer\_tts package +============================================== + +.. automodule:: paddlespeech.t2s.exps.transformer_tts + :members: + :undoc-members: + :show-inheritance: + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.t2s.exps.transformer_tts.normalize + paddlespeech.t2s.exps.transformer_tts.preprocess + paddlespeech.t2s.exps.transformer_tts.synthesize + paddlespeech.t2s.exps.transformer_tts.synthesize_e2e + paddlespeech.t2s.exps.transformer_tts.train diff --git a/docs/source/api/paddlespeech.t2s.exps.transformer_tts.synthesize.rst b/docs/source/api/paddlespeech.t2s.exps.transformer_tts.synthesize.rst new file mode 100644 index 000000000..20f4b99d0 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.exps.transformer_tts.synthesize.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.exps.transformer\_tts.synthesize module +======================================================== + +.. automodule:: paddlespeech.t2s.exps.transformer_tts.synthesize + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.exps.transformer_tts.synthesize_e2e.rst b/docs/source/api/paddlespeech.t2s.exps.transformer_tts.synthesize_e2e.rst new file mode 100644 index 000000000..1e554e22a --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.exps.transformer_tts.synthesize_e2e.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.exps.transformer\_tts.synthesize\_e2e module +============================================================= + +.. automodule:: paddlespeech.t2s.exps.transformer_tts.synthesize_e2e + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.exps.transformer_tts.train.rst b/docs/source/api/paddlespeech.t2s.exps.transformer_tts.train.rst new file mode 100644 index 000000000..d6d1d2d29 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.exps.transformer_tts.train.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.exps.transformer\_tts.train module +=================================================== + +.. 
automodule:: paddlespeech.t2s.exps.transformer_tts.train + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.exps.voice_cloning.rst b/docs/source/api/paddlespeech.t2s.exps.voice_cloning.rst new file mode 100644 index 000000000..b4a600aad --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.exps.voice_cloning.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.exps.voice\_cloning module +=========================================== + +.. automodule:: paddlespeech.t2s.exps.voice_cloning + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.exps.waveflow.config.rst b/docs/source/api/paddlespeech.t2s.exps.waveflow.config.rst new file mode 100644 index 000000000..f4de16fe0 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.exps.waveflow.config.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.exps.waveflow.config module +============================================ + +.. automodule:: paddlespeech.t2s.exps.waveflow.config + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.exps.waveflow.ljspeech.rst b/docs/source/api/paddlespeech.t2s.exps.waveflow.ljspeech.rst new file mode 100644 index 000000000..425adf13d --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.exps.waveflow.ljspeech.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.exps.waveflow.ljspeech module +============================================== + +.. automodule:: paddlespeech.t2s.exps.waveflow.ljspeech + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.exps.waveflow.preprocess.rst b/docs/source/api/paddlespeech.t2s.exps.waveflow.preprocess.rst new file mode 100644 index 000000000..387aade94 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.exps.waveflow.preprocess.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.exps.waveflow.preprocess module +================================================ + +.. automodule:: paddlespeech.t2s.exps.waveflow.preprocess + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.exps.waveflow.rst b/docs/source/api/paddlespeech.t2s.exps.waveflow.rst new file mode 100644 index 000000000..5e643fb7a --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.exps.waveflow.rst @@ -0,0 +1,19 @@ +paddlespeech.t2s.exps.waveflow package +====================================== + +.. automodule:: paddlespeech.t2s.exps.waveflow + :members: + :undoc-members: + :show-inheritance: + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.t2s.exps.waveflow.config + paddlespeech.t2s.exps.waveflow.ljspeech + paddlespeech.t2s.exps.waveflow.preprocess + paddlespeech.t2s.exps.waveflow.synthesize + paddlespeech.t2s.exps.waveflow.train diff --git a/docs/source/api/paddlespeech.t2s.exps.waveflow.synthesize.rst b/docs/source/api/paddlespeech.t2s.exps.waveflow.synthesize.rst new file mode 100644 index 000000000..a7a98cf2e --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.exps.waveflow.synthesize.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.exps.waveflow.synthesize module +================================================ + +.. 
automodule:: paddlespeech.t2s.exps.waveflow.synthesize + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.exps.waveflow.train.rst b/docs/source/api/paddlespeech.t2s.exps.waveflow.train.rst new file mode 100644 index 000000000..c23376315 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.exps.waveflow.train.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.exps.waveflow.train module +=========================================== + +.. automodule:: paddlespeech.t2s.exps.waveflow.train + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.exps.wavernn.rst b/docs/source/api/paddlespeech.t2s.exps.wavernn.rst new file mode 100644 index 000000000..392e3906e --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.exps.wavernn.rst @@ -0,0 +1,16 @@ +paddlespeech.t2s.exps.wavernn package +===================================== + +.. automodule:: paddlespeech.t2s.exps.wavernn + :members: + :undoc-members: + :show-inheritance: + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.t2s.exps.wavernn.synthesize + paddlespeech.t2s.exps.wavernn.train diff --git a/docs/source/api/paddlespeech.t2s.exps.wavernn.synthesize.rst b/docs/source/api/paddlespeech.t2s.exps.wavernn.synthesize.rst new file mode 100644 index 000000000..911402f0e --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.exps.wavernn.synthesize.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.exps.wavernn.synthesize module +=============================================== + +.. automodule:: paddlespeech.t2s.exps.wavernn.synthesize + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.exps.wavernn.train.rst b/docs/source/api/paddlespeech.t2s.exps.wavernn.train.rst new file mode 100644 index 000000000..373f8e48f --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.exps.wavernn.train.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.exps.wavernn.train module +========================================== + +.. automodule:: paddlespeech.t2s.exps.wavernn.train + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.frontend.arpabet.rst b/docs/source/api/paddlespeech.t2s.frontend.arpabet.rst new file mode 100644 index 000000000..199fb8094 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.frontend.arpabet.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.frontend.arpabet module +======================================== + +.. automodule:: paddlespeech.t2s.frontend.arpabet + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.frontend.generate_lexicon.rst b/docs/source/api/paddlespeech.t2s.frontend.generate_lexicon.rst new file mode 100644 index 000000000..355dfb539 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.frontend.generate_lexicon.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.frontend.generate\_lexicon module +================================================== + +.. automodule:: paddlespeech.t2s.frontend.generate_lexicon + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.frontend.normalizer.abbrrviation.rst b/docs/source/api/paddlespeech.t2s.frontend.normalizer.abbrrviation.rst new file mode 100644 index 000000000..22b79160f --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.frontend.normalizer.abbrrviation.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.frontend.normalizer.abbrrviation module +======================================================== + +.. 
automodule:: paddlespeech.t2s.frontend.normalizer.abbrrviation + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.frontend.normalizer.acronyms.rst b/docs/source/api/paddlespeech.t2s.frontend.normalizer.acronyms.rst new file mode 100644 index 000000000..7020e2070 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.frontend.normalizer.acronyms.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.frontend.normalizer.acronyms module +==================================================== + +.. automodule:: paddlespeech.t2s.frontend.normalizer.acronyms + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.frontend.normalizer.normalizer.rst b/docs/source/api/paddlespeech.t2s.frontend.normalizer.normalizer.rst new file mode 100644 index 000000000..1a649daf2 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.frontend.normalizer.normalizer.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.frontend.normalizer.normalizer module +====================================================== + +.. automodule:: paddlespeech.t2s.frontend.normalizer.normalizer + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.frontend.normalizer.numbers.rst b/docs/source/api/paddlespeech.t2s.frontend.normalizer.numbers.rst new file mode 100644 index 000000000..438b56845 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.frontend.normalizer.numbers.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.frontend.normalizer.numbers module +=================================================== + +.. automodule:: paddlespeech.t2s.frontend.normalizer.numbers + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.frontend.normalizer.rst b/docs/source/api/paddlespeech.t2s.frontend.normalizer.rst new file mode 100644 index 000000000..083e6badb --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.frontend.normalizer.rst @@ -0,0 +1,19 @@ +paddlespeech.t2s.frontend.normalizer package +============================================ + +.. automodule:: paddlespeech.t2s.frontend.normalizer + :members: + :undoc-members: + :show-inheritance: + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.t2s.frontend.normalizer.abbrrviation + paddlespeech.t2s.frontend.normalizer.acronyms + paddlespeech.t2s.frontend.normalizer.normalizer + paddlespeech.t2s.frontend.normalizer.numbers + paddlespeech.t2s.frontend.normalizer.width diff --git a/docs/source/api/paddlespeech.t2s.frontend.normalizer.width.rst b/docs/source/api/paddlespeech.t2s.frontend.normalizer.width.rst new file mode 100644 index 000000000..188659fcd --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.frontend.normalizer.width.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.frontend.normalizer.width module +================================================= + +.. automodule:: paddlespeech.t2s.frontend.normalizer.width + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.frontend.phonectic.rst b/docs/source/api/paddlespeech.t2s.frontend.phonectic.rst new file mode 100644 index 000000000..8483b2d96 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.frontend.phonectic.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.frontend.phonectic module +========================================== + +.. 
automodule:: paddlespeech.t2s.frontend.phonectic + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.frontend.punctuation.rst b/docs/source/api/paddlespeech.t2s.frontend.punctuation.rst new file mode 100644 index 000000000..692c6f031 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.frontend.punctuation.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.frontend.punctuation module +============================================ + +.. automodule:: paddlespeech.t2s.frontend.punctuation + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.frontend.rst b/docs/source/api/paddlespeech.t2s.frontend.rst new file mode 100644 index 000000000..8fbf1e6eb --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.frontend.rst @@ -0,0 +1,30 @@ +paddlespeech.t2s.frontend package +================================= + +.. automodule:: paddlespeech.t2s.frontend + :members: + :undoc-members: + :show-inheritance: + +Subpackages +----------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.t2s.frontend.normalizer + paddlespeech.t2s.frontend.zh_normalization + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.t2s.frontend.arpabet + paddlespeech.t2s.frontend.generate_lexicon + paddlespeech.t2s.frontend.phonectic + paddlespeech.t2s.frontend.punctuation + paddlespeech.t2s.frontend.tone_sandhi + paddlespeech.t2s.frontend.vocab + paddlespeech.t2s.frontend.zh_frontend diff --git a/docs/source/api/paddlespeech.t2s.frontend.tone_sandhi.rst b/docs/source/api/paddlespeech.t2s.frontend.tone_sandhi.rst new file mode 100644 index 000000000..3ea4ba49c --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.frontend.tone_sandhi.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.frontend.tone\_sandhi module +============================================= + +.. automodule:: paddlespeech.t2s.frontend.tone_sandhi + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.frontend.vocab.rst b/docs/source/api/paddlespeech.t2s.frontend.vocab.rst new file mode 100644 index 000000000..34f597bfb --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.frontend.vocab.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.frontend.vocab module +====================================== + +.. automodule:: paddlespeech.t2s.frontend.vocab + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.frontend.zh_frontend.rst b/docs/source/api/paddlespeech.t2s.frontend.zh_frontend.rst new file mode 100644 index 000000000..6327839e5 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.frontend.zh_frontend.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.frontend.zh\_frontend module +============================================= + +.. automodule:: paddlespeech.t2s.frontend.zh_frontend + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.frontend.zh_normalization.char_convert.rst b/docs/source/api/paddlespeech.t2s.frontend.zh_normalization.char_convert.rst new file mode 100644 index 000000000..9cfc3406b --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.frontend.zh_normalization.char_convert.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.frontend.zh\_normalization.char\_convert module +================================================================ + +.. 
automodule:: paddlespeech.t2s.frontend.zh_normalization.char_convert + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.frontend.zh_normalization.chronology.rst b/docs/source/api/paddlespeech.t2s.frontend.zh_normalization.chronology.rst new file mode 100644 index 000000000..386df9885 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.frontend.zh_normalization.chronology.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.frontend.zh\_normalization.chronology module +============================================================= + +.. automodule:: paddlespeech.t2s.frontend.zh_normalization.chronology + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.frontend.zh_normalization.constants.rst b/docs/source/api/paddlespeech.t2s.frontend.zh_normalization.constants.rst new file mode 100644 index 000000000..53c181b59 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.frontend.zh_normalization.constants.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.frontend.zh\_normalization.constants module +============================================================ + +.. automodule:: paddlespeech.t2s.frontend.zh_normalization.constants + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.frontend.zh_normalization.num.rst b/docs/source/api/paddlespeech.t2s.frontend.zh_normalization.num.rst new file mode 100644 index 000000000..7eb624d18 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.frontend.zh_normalization.num.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.frontend.zh\_normalization.num module +====================================================== + +.. automodule:: paddlespeech.t2s.frontend.zh_normalization.num + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.frontend.zh_normalization.phonecode.rst b/docs/source/api/paddlespeech.t2s.frontend.zh_normalization.phonecode.rst new file mode 100644 index 000000000..67d47f4e2 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.frontend.zh_normalization.phonecode.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.frontend.zh\_normalization.phonecode module +============================================================ + +.. automodule:: paddlespeech.t2s.frontend.zh_normalization.phonecode + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.frontend.zh_normalization.quantifier.rst b/docs/source/api/paddlespeech.t2s.frontend.zh_normalization.quantifier.rst new file mode 100644 index 000000000..8c4d359c3 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.frontend.zh_normalization.quantifier.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.frontend.zh\_normalization.quantifier module +============================================================= + +.. automodule:: paddlespeech.t2s.frontend.zh_normalization.quantifier + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.frontend.zh_normalization.rst b/docs/source/api/paddlespeech.t2s.frontend.zh_normalization.rst new file mode 100644 index 000000000..ebe3f9d95 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.frontend.zh_normalization.rst @@ -0,0 +1,21 @@ +paddlespeech.t2s.frontend.zh\_normalization package +=================================================== + +.. automodule:: paddlespeech.t2s.frontend.zh_normalization + :members: + :undoc-members: + :show-inheritance: + +Submodules +---------- + +.. 
toctree:: + :maxdepth: 4 + + paddlespeech.t2s.frontend.zh_normalization.char_convert + paddlespeech.t2s.frontend.zh_normalization.chronology + paddlespeech.t2s.frontend.zh_normalization.constants + paddlespeech.t2s.frontend.zh_normalization.num + paddlespeech.t2s.frontend.zh_normalization.phonecode + paddlespeech.t2s.frontend.zh_normalization.quantifier + paddlespeech.t2s.frontend.zh_normalization.text_normlization diff --git a/docs/source/api/paddlespeech.t2s.frontend.zh_normalization.text_normlization.rst b/docs/source/api/paddlespeech.t2s.frontend.zh_normalization.text_normlization.rst new file mode 100644 index 000000000..a875dc8bf --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.frontend.zh_normalization.text_normlization.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.frontend.zh\_normalization.text\_normlization module +===================================================================== + +.. automodule:: paddlespeech.t2s.frontend.zh_normalization.text_normlization + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.models.ernie_sat.mlm.rst b/docs/source/api/paddlespeech.t2s.models.ernie_sat.mlm.rst new file mode 100644 index 000000000..f0e8fd11a --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.models.ernie_sat.mlm.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.models.ernie\_sat.mlm module +============================================= + +.. automodule:: paddlespeech.t2s.models.ernie_sat.mlm + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.models.ernie_sat.rst b/docs/source/api/paddlespeech.t2s.models.ernie_sat.rst new file mode 100644 index 000000000..680a85dea --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.models.ernie_sat.rst @@ -0,0 +1,15 @@ +paddlespeech.t2s.models.ernie\_sat package +========================================== + +.. automodule:: paddlespeech.t2s.models.ernie_sat + :members: + :undoc-members: + :show-inheritance: + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.t2s.models.ernie_sat.mlm diff --git a/docs/source/api/paddlespeech.t2s.models.fastspeech2.fastspeech2.rst b/docs/source/api/paddlespeech.t2s.models.fastspeech2.fastspeech2.rst new file mode 100644 index 000000000..0bcba5481 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.models.fastspeech2.fastspeech2.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.models.fastspeech2.fastspeech2 module +====================================================== + +.. automodule:: paddlespeech.t2s.models.fastspeech2.fastspeech2 + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.models.fastspeech2.fastspeech2_updater.rst b/docs/source/api/paddlespeech.t2s.models.fastspeech2.fastspeech2_updater.rst new file mode 100644 index 000000000..bd6f941d8 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.models.fastspeech2.fastspeech2_updater.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.models.fastspeech2.fastspeech2\_updater module +=============================================================== + +.. automodule:: paddlespeech.t2s.models.fastspeech2.fastspeech2_updater + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.models.fastspeech2.rst b/docs/source/api/paddlespeech.t2s.models.fastspeech2.rst new file mode 100644 index 000000000..89a540c3d --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.models.fastspeech2.rst @@ -0,0 +1,16 @@ +paddlespeech.t2s.models.fastspeech2 package +=========================================== + +.. 
automodule:: paddlespeech.t2s.models.fastspeech2 + :members: + :undoc-members: + :show-inheritance: + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.t2s.models.fastspeech2.fastspeech2 + paddlespeech.t2s.models.fastspeech2.fastspeech2_updater diff --git a/docs/source/api/paddlespeech.t2s.models.hifigan.hifigan.rst b/docs/source/api/paddlespeech.t2s.models.hifigan.hifigan.rst new file mode 100644 index 000000000..4c043067b --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.models.hifigan.hifigan.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.models.hifigan.hifigan module +============================================== + +.. automodule:: paddlespeech.t2s.models.hifigan.hifigan + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.models.hifigan.hifigan_updater.rst b/docs/source/api/paddlespeech.t2s.models.hifigan.hifigan_updater.rst new file mode 100644 index 000000000..514993295 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.models.hifigan.hifigan_updater.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.models.hifigan.hifigan\_updater module +======================================================= + +.. automodule:: paddlespeech.t2s.models.hifigan.hifigan_updater + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.models.hifigan.rst b/docs/source/api/paddlespeech.t2s.models.hifigan.rst new file mode 100644 index 000000000..ac78e287b --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.models.hifigan.rst @@ -0,0 +1,16 @@ +paddlespeech.t2s.models.hifigan package +======================================= + +.. automodule:: paddlespeech.t2s.models.hifigan + :members: + :undoc-members: + :show-inheritance: + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.t2s.models.hifigan.hifigan + paddlespeech.t2s.models.hifigan.hifigan_updater diff --git a/docs/source/api/paddlespeech.t2s.models.melgan.melgan.rst b/docs/source/api/paddlespeech.t2s.models.melgan.melgan.rst new file mode 100644 index 000000000..663782e84 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.models.melgan.melgan.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.models.melgan.melgan module +============================================ + +.. automodule:: paddlespeech.t2s.models.melgan.melgan + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.models.melgan.multi_band_melgan_updater.rst b/docs/source/api/paddlespeech.t2s.models.melgan.multi_band_melgan_updater.rst new file mode 100644 index 000000000..46b8c71f8 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.models.melgan.multi_band_melgan_updater.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.models.melgan.multi\_band\_melgan\_updater module +================================================================== + +.. automodule:: paddlespeech.t2s.models.melgan.multi_band_melgan_updater + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.models.melgan.rst b/docs/source/api/paddlespeech.t2s.models.melgan.rst new file mode 100644 index 000000000..ed4bd6167 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.models.melgan.rst @@ -0,0 +1,18 @@ +paddlespeech.t2s.models.melgan package +====================================== + +.. automodule:: paddlespeech.t2s.models.melgan + :members: + :undoc-members: + :show-inheritance: + +Submodules +---------- + +.. 
toctree:: + :maxdepth: 4 + + paddlespeech.t2s.models.melgan.melgan + paddlespeech.t2s.models.melgan.multi_band_melgan_updater + paddlespeech.t2s.models.melgan.style_melgan + paddlespeech.t2s.models.melgan.style_melgan_updater diff --git a/docs/source/api/paddlespeech.t2s.models.melgan.style_melgan.rst b/docs/source/api/paddlespeech.t2s.models.melgan.style_melgan.rst new file mode 100644 index 000000000..449c5142f --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.models.melgan.style_melgan.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.models.melgan.style\_melgan module +=================================================== + +.. automodule:: paddlespeech.t2s.models.melgan.style_melgan + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.models.melgan.style_melgan_updater.rst b/docs/source/api/paddlespeech.t2s.models.melgan.style_melgan_updater.rst new file mode 100644 index 000000000..2a6cb9fb5 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.models.melgan.style_melgan_updater.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.models.melgan.style\_melgan\_updater module +============================================================ + +.. automodule:: paddlespeech.t2s.models.melgan.style_melgan_updater + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.models.parallel_wavegan.parallel_wavegan.rst b/docs/source/api/paddlespeech.t2s.models.parallel_wavegan.parallel_wavegan.rst new file mode 100644 index 000000000..06bc0dadd --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.models.parallel_wavegan.parallel_wavegan.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.models.parallel\_wavegan.parallel\_wavegan module +================================================================== + +.. automodule:: paddlespeech.t2s.models.parallel_wavegan.parallel_wavegan + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.models.parallel_wavegan.parallel_wavegan_updater.rst b/docs/source/api/paddlespeech.t2s.models.parallel_wavegan.parallel_wavegan_updater.rst new file mode 100644 index 000000000..4a652673d --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.models.parallel_wavegan.parallel_wavegan_updater.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.models.parallel\_wavegan.parallel\_wavegan\_updater module +=========================================================================== + +.. automodule:: paddlespeech.t2s.models.parallel_wavegan.parallel_wavegan_updater + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.models.parallel_wavegan.rst b/docs/source/api/paddlespeech.t2s.models.parallel_wavegan.rst new file mode 100644 index 000000000..ff25b6c24 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.models.parallel_wavegan.rst @@ -0,0 +1,16 @@ +paddlespeech.t2s.models.parallel\_wavegan package +================================================= + +.. automodule:: paddlespeech.t2s.models.parallel_wavegan + :members: + :undoc-members: + :show-inheritance: + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.t2s.models.parallel_wavegan.parallel_wavegan + paddlespeech.t2s.models.parallel_wavegan.parallel_wavegan_updater diff --git a/docs/source/api/paddlespeech.t2s.models.rst b/docs/source/api/paddlespeech.t2s.models.rst new file mode 100644 index 000000000..5e681fa88 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.models.rst @@ -0,0 +1,32 @@ +paddlespeech.t2s.models package +=============================== + +.. 
automodule:: paddlespeech.t2s.models + :members: + :undoc-members: + :show-inheritance: + +Subpackages +----------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.t2s.models.ernie_sat + paddlespeech.t2s.models.fastspeech2 + paddlespeech.t2s.models.hifigan + paddlespeech.t2s.models.melgan + paddlespeech.t2s.models.parallel_wavegan + paddlespeech.t2s.models.speedyspeech + paddlespeech.t2s.models.tacotron2 + paddlespeech.t2s.models.transformer_tts + paddlespeech.t2s.models.vits + paddlespeech.t2s.models.wavernn + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.t2s.models.waveflow diff --git a/docs/source/api/paddlespeech.t2s.models.speedyspeech.rst b/docs/source/api/paddlespeech.t2s.models.speedyspeech.rst new file mode 100644 index 000000000..a9cded2f0 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.models.speedyspeech.rst @@ -0,0 +1,16 @@ +paddlespeech.t2s.models.speedyspeech package +============================================ + +.. automodule:: paddlespeech.t2s.models.speedyspeech + :members: + :undoc-members: + :show-inheritance: + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.t2s.models.speedyspeech.speedyspeech + paddlespeech.t2s.models.speedyspeech.speedyspeech_updater diff --git a/docs/source/api/paddlespeech.t2s.models.speedyspeech.speedyspeech.rst b/docs/source/api/paddlespeech.t2s.models.speedyspeech.speedyspeech.rst new file mode 100644 index 000000000..0cfa30892 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.models.speedyspeech.speedyspeech.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.models.speedyspeech.speedyspeech module +======================================================== + +.. automodule:: paddlespeech.t2s.models.speedyspeech.speedyspeech + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.models.speedyspeech.speedyspeech_updater.rst b/docs/source/api/paddlespeech.t2s.models.speedyspeech.speedyspeech_updater.rst new file mode 100644 index 000000000..c0a33d4ce --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.models.speedyspeech.speedyspeech_updater.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.models.speedyspeech.speedyspeech\_updater module +================================================================= + +.. automodule:: paddlespeech.t2s.models.speedyspeech.speedyspeech_updater + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.models.tacotron2.rst b/docs/source/api/paddlespeech.t2s.models.tacotron2.rst new file mode 100644 index 000000000..69133b849 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.models.tacotron2.rst @@ -0,0 +1,16 @@ +paddlespeech.t2s.models.tacotron2 package +========================================= + +.. automodule:: paddlespeech.t2s.models.tacotron2 + :members: + :undoc-members: + :show-inheritance: + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.t2s.models.tacotron2.tacotron2 + paddlespeech.t2s.models.tacotron2.tacotron2_updater diff --git a/docs/source/api/paddlespeech.t2s.models.tacotron2.tacotron2.rst b/docs/source/api/paddlespeech.t2s.models.tacotron2.tacotron2.rst new file mode 100644 index 000000000..44e2c68c5 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.models.tacotron2.tacotron2.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.models.tacotron2.tacotron2 module +================================================== + +.. 
automodule:: paddlespeech.t2s.models.tacotron2.tacotron2 + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.models.tacotron2.tacotron2_updater.rst b/docs/source/api/paddlespeech.t2s.models.tacotron2.tacotron2_updater.rst new file mode 100644 index 000000000..c1bf1d2d5 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.models.tacotron2.tacotron2_updater.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.models.tacotron2.tacotron2\_updater module +=========================================================== + +.. automodule:: paddlespeech.t2s.models.tacotron2.tacotron2_updater + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.models.transformer_tts.rst b/docs/source/api/paddlespeech.t2s.models.transformer_tts.rst new file mode 100644 index 000000000..89d6a4e3e --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.models.transformer_tts.rst @@ -0,0 +1,16 @@ +paddlespeech.t2s.models.transformer\_tts package +================================================ + +.. automodule:: paddlespeech.t2s.models.transformer_tts + :members: + :undoc-members: + :show-inheritance: + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.t2s.models.transformer_tts.transformer_tts + paddlespeech.t2s.models.transformer_tts.transformer_tts_updater diff --git a/docs/source/api/paddlespeech.t2s.models.transformer_tts.transformer_tts.rst b/docs/source/api/paddlespeech.t2s.models.transformer_tts.transformer_tts.rst new file mode 100644 index 000000000..b061caa46 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.models.transformer_tts.transformer_tts.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.models.transformer\_tts.transformer\_tts module +================================================================ + +.. automodule:: paddlespeech.t2s.models.transformer_tts.transformer_tts + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.models.transformer_tts.transformer_tts_updater.rst b/docs/source/api/paddlespeech.t2s.models.transformer_tts.transformer_tts_updater.rst new file mode 100644 index 000000000..aacc8479d --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.models.transformer_tts.transformer_tts_updater.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.models.transformer\_tts.transformer\_tts\_updater module +========================================================================= + +.. automodule:: paddlespeech.t2s.models.transformer_tts.transformer_tts_updater + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.models.vits.duration_predictor.rst b/docs/source/api/paddlespeech.t2s.models.vits.duration_predictor.rst new file mode 100644 index 000000000..59239c7e7 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.models.vits.duration_predictor.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.models.vits.duration\_predictor module +======================================================= + +.. automodule:: paddlespeech.t2s.models.vits.duration_predictor + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.models.vits.flow.rst b/docs/source/api/paddlespeech.t2s.models.vits.flow.rst new file mode 100644 index 000000000..59be7f6a4 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.models.vits.flow.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.models.vits.flow module +======================================== + +.. 
automodule:: paddlespeech.t2s.models.vits.flow + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.models.vits.generator.rst b/docs/source/api/paddlespeech.t2s.models.vits.generator.rst new file mode 100644 index 000000000..7b91b2855 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.models.vits.generator.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.models.vits.generator module +============================================= + +.. automodule:: paddlespeech.t2s.models.vits.generator + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.models.vits.posterior_encoder.rst b/docs/source/api/paddlespeech.t2s.models.vits.posterior_encoder.rst new file mode 100644 index 000000000..4710681c4 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.models.vits.posterior_encoder.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.models.vits.posterior\_encoder module +====================================================== + +.. automodule:: paddlespeech.t2s.models.vits.posterior_encoder + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.models.vits.residual_coupling.rst b/docs/source/api/paddlespeech.t2s.models.vits.residual_coupling.rst new file mode 100644 index 000000000..31d8c51e6 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.models.vits.residual_coupling.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.models.vits.residual\_coupling module +====================================================== + +.. automodule:: paddlespeech.t2s.models.vits.residual_coupling + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.models.vits.rst b/docs/source/api/paddlespeech.t2s.models.vits.rst new file mode 100644 index 000000000..3146094b0 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.models.vits.rst @@ -0,0 +1,32 @@ +paddlespeech.t2s.models.vits package +==================================== + +.. automodule:: paddlespeech.t2s.models.vits + :members: + :undoc-members: + :show-inheritance: + +Subpackages +----------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.t2s.models.vits.monotonic_align + paddlespeech.t2s.models.vits.wavenet + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.t2s.models.vits.duration_predictor + paddlespeech.t2s.models.vits.flow + paddlespeech.t2s.models.vits.generator + paddlespeech.t2s.models.vits.posterior_encoder + paddlespeech.t2s.models.vits.residual_coupling + paddlespeech.t2s.models.vits.text_encoder + paddlespeech.t2s.models.vits.transform + paddlespeech.t2s.models.vits.vits + paddlespeech.t2s.models.vits.vits_updater diff --git a/docs/source/api/paddlespeech.t2s.models.vits.text_encoder.rst b/docs/source/api/paddlespeech.t2s.models.vits.text_encoder.rst new file mode 100644 index 000000000..50d2fda90 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.models.vits.text_encoder.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.models.vits.text\_encoder module +================================================= + +.. automodule:: paddlespeech.t2s.models.vits.text_encoder + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.models.vits.transform.rst b/docs/source/api/paddlespeech.t2s.models.vits.transform.rst new file mode 100644 index 000000000..a2ffde0d1 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.models.vits.transform.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.models.vits.transform module +============================================= + +.. 
automodule:: paddlespeech.t2s.models.vits.transform + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.models.vits.vits.rst b/docs/source/api/paddlespeech.t2s.models.vits.vits.rst new file mode 100644 index 000000000..39af2ba5a --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.models.vits.vits.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.models.vits.vits module +======================================== + +.. automodule:: paddlespeech.t2s.models.vits.vits + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.models.vits.vits_updater.rst b/docs/source/api/paddlespeech.t2s.models.vits.vits_updater.rst new file mode 100644 index 000000000..f84646647 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.models.vits.vits_updater.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.models.vits.vits\_updater module +================================================= + +.. automodule:: paddlespeech.t2s.models.vits.vits_updater + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.models.vits.wavenet.residual_block.rst b/docs/source/api/paddlespeech.t2s.models.vits.wavenet.residual_block.rst new file mode 100644 index 000000000..33b2006f6 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.models.vits.wavenet.residual_block.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.models.vits.wavenet.residual\_block module +=========================================================== + +.. automodule:: paddlespeech.t2s.models.vits.wavenet.residual_block + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.models.vits.wavenet.rst b/docs/source/api/paddlespeech.t2s.models.vits.wavenet.rst new file mode 100644 index 000000000..694f7be26 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.models.vits.wavenet.rst @@ -0,0 +1,16 @@ +paddlespeech.t2s.models.vits.wavenet package +============================================ + +.. automodule:: paddlespeech.t2s.models.vits.wavenet + :members: + :undoc-members: + :show-inheritance: + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.t2s.models.vits.wavenet.residual_block + paddlespeech.t2s.models.vits.wavenet.wavenet diff --git a/docs/source/api/paddlespeech.t2s.models.vits.wavenet.wavenet.rst b/docs/source/api/paddlespeech.t2s.models.vits.wavenet.wavenet.rst new file mode 100644 index 000000000..a0de1c0ec --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.models.vits.wavenet.wavenet.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.models.vits.wavenet.wavenet module +=================================================== + +.. automodule:: paddlespeech.t2s.models.vits.wavenet.wavenet + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.models.waveflow.rst b/docs/source/api/paddlespeech.t2s.models.waveflow.rst new file mode 100644 index 000000000..f5d5b9687 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.models.waveflow.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.models.waveflow module +======================================= + +.. automodule:: paddlespeech.t2s.models.waveflow + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.models.wavernn.rst b/docs/source/api/paddlespeech.t2s.models.wavernn.rst new file mode 100644 index 000000000..d52d4b888 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.models.wavernn.rst @@ -0,0 +1,16 @@ +paddlespeech.t2s.models.wavernn package +======================================= + +.. 
automodule:: paddlespeech.t2s.models.wavernn + :members: + :undoc-members: + :show-inheritance: + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.t2s.models.wavernn.wavernn + paddlespeech.t2s.models.wavernn.wavernn_updater diff --git a/docs/source/api/paddlespeech.t2s.models.wavernn.wavernn.rst b/docs/source/api/paddlespeech.t2s.models.wavernn.wavernn.rst new file mode 100644 index 000000000..679e67687 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.models.wavernn.wavernn.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.models.wavernn.wavernn module +============================================== + +.. automodule:: paddlespeech.t2s.models.wavernn.wavernn + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.models.wavernn.wavernn_updater.rst b/docs/source/api/paddlespeech.t2s.models.wavernn.wavernn_updater.rst new file mode 100644 index 000000000..ea4ca2cad --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.models.wavernn.wavernn_updater.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.models.wavernn.wavernn\_updater module +======================================================= + +.. automodule:: paddlespeech.t2s.models.wavernn.wavernn_updater + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.modules.activation.rst b/docs/source/api/paddlespeech.t2s.modules.activation.rst new file mode 100644 index 000000000..ab15fb9bd --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.modules.activation.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.modules.activation module +========================================== + +.. automodule:: paddlespeech.t2s.modules.activation + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.modules.causal_conv.rst b/docs/source/api/paddlespeech.t2s.modules.causal_conv.rst new file mode 100644 index 000000000..94ab95da6 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.modules.causal_conv.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.modules.causal\_conv module +============================================ + +.. automodule:: paddlespeech.t2s.modules.causal_conv + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.modules.conformer.convolution.rst b/docs/source/api/paddlespeech.t2s.modules.conformer.convolution.rst new file mode 100644 index 000000000..072af9ab8 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.modules.conformer.convolution.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.modules.conformer.convolution module +===================================================== + +.. automodule:: paddlespeech.t2s.modules.conformer.convolution + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.modules.conformer.encoder_layer.rst b/docs/source/api/paddlespeech.t2s.modules.conformer.encoder_layer.rst new file mode 100644 index 000000000..88f2f1a35 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.modules.conformer.encoder_layer.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.modules.conformer.encoder\_layer module +======================================================== + +.. 
automodule:: paddlespeech.t2s.modules.conformer.encoder_layer + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.modules.conformer.rst b/docs/source/api/paddlespeech.t2s.modules.conformer.rst new file mode 100644 index 000000000..796079df8 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.modules.conformer.rst @@ -0,0 +1,16 @@ +paddlespeech.t2s.modules.conformer package +========================================== + +.. automodule:: paddlespeech.t2s.modules.conformer + :members: + :undoc-members: + :show-inheritance: + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.t2s.modules.conformer.convolution + paddlespeech.t2s.modules.conformer.encoder_layer diff --git a/docs/source/api/paddlespeech.t2s.modules.conv.rst b/docs/source/api/paddlespeech.t2s.modules.conv.rst new file mode 100644 index 000000000..532803b11 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.modules.conv.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.modules.conv module +==================================== + +.. automodule:: paddlespeech.t2s.modules.conv + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.modules.geometry.rst b/docs/source/api/paddlespeech.t2s.modules.geometry.rst new file mode 100644 index 000000000..8d1f39956 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.modules.geometry.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.modules.geometry module +======================================== + +.. automodule:: paddlespeech.t2s.modules.geometry + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.modules.layer_norm.rst b/docs/source/api/paddlespeech.t2s.modules.layer_norm.rst new file mode 100644 index 000000000..2dc1668c0 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.modules.layer_norm.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.modules.layer\_norm module +=========================================== + +.. automodule:: paddlespeech.t2s.modules.layer_norm + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.modules.losses.rst b/docs/source/api/paddlespeech.t2s.modules.losses.rst new file mode 100644 index 000000000..ad262e0a6 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.modules.losses.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.modules.losses module +====================================== + +.. automodule:: paddlespeech.t2s.modules.losses + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.modules.masked_fill.rst b/docs/source/api/paddlespeech.t2s.modules.masked_fill.rst new file mode 100644 index 000000000..afde1a7a4 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.modules.masked_fill.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.modules.masked\_fill module +============================================ + +.. automodule:: paddlespeech.t2s.modules.masked_fill + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.modules.nets_utils.rst b/docs/source/api/paddlespeech.t2s.modules.nets_utils.rst new file mode 100644 index 000000000..c9a680625 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.modules.nets_utils.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.modules.nets\_utils module +=========================================== + +.. 
automodule:: paddlespeech.t2s.modules.nets_utils + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.modules.normalizer.rst b/docs/source/api/paddlespeech.t2s.modules.normalizer.rst new file mode 100644 index 000000000..e0a166cac --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.modules.normalizer.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.modules.normalizer module +========================================== + +.. automodule:: paddlespeech.t2s.modules.normalizer + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.modules.positional_encoding.rst b/docs/source/api/paddlespeech.t2s.modules.positional_encoding.rst new file mode 100644 index 000000000..2ba65e6ed --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.modules.positional_encoding.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.modules.positional\_encoding module +==================================================== + +.. automodule:: paddlespeech.t2s.modules.positional_encoding + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.modules.pqmf.rst b/docs/source/api/paddlespeech.t2s.modules.pqmf.rst new file mode 100644 index 000000000..2b89a4506 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.modules.pqmf.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.modules.pqmf module +==================================== + +.. automodule:: paddlespeech.t2s.modules.pqmf + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.modules.predictor.duration_predictor.rst b/docs/source/api/paddlespeech.t2s.modules.predictor.duration_predictor.rst new file mode 100644 index 000000000..5b55e9129 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.modules.predictor.duration_predictor.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.modules.predictor.duration\_predictor module +============================================================= + +.. automodule:: paddlespeech.t2s.modules.predictor.duration_predictor + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.modules.predictor.length_regulator.rst b/docs/source/api/paddlespeech.t2s.modules.predictor.length_regulator.rst new file mode 100644 index 000000000..fbbec39f8 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.modules.predictor.length_regulator.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.modules.predictor.length\_regulator module +=========================================================== + +.. automodule:: paddlespeech.t2s.modules.predictor.length_regulator + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.modules.predictor.rst b/docs/source/api/paddlespeech.t2s.modules.predictor.rst new file mode 100644 index 000000000..2c6ab0884 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.modules.predictor.rst @@ -0,0 +1,17 @@ +paddlespeech.t2s.modules.predictor package +========================================== + +.. automodule:: paddlespeech.t2s.modules.predictor + :members: + :undoc-members: + :show-inheritance: + +Submodules +---------- + +.. 
toctree:: + :maxdepth: 4 + + paddlespeech.t2s.modules.predictor.duration_predictor + paddlespeech.t2s.modules.predictor.length_regulator + paddlespeech.t2s.modules.predictor.variance_predictor diff --git a/docs/source/api/paddlespeech.t2s.modules.predictor.variance_predictor.rst b/docs/source/api/paddlespeech.t2s.modules.predictor.variance_predictor.rst new file mode 100644 index 000000000..a3c339cfa --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.modules.predictor.variance_predictor.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.modules.predictor.variance\_predictor module +============================================================= + +.. automodule:: paddlespeech.t2s.modules.predictor.variance_predictor + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.modules.residual_block.rst b/docs/source/api/paddlespeech.t2s.modules.residual_block.rst new file mode 100644 index 000000000..1286e7ec9 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.modules.residual_block.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.modules.residual\_block module +=============================================== + +.. automodule:: paddlespeech.t2s.modules.residual_block + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.modules.residual_stack.rst b/docs/source/api/paddlespeech.t2s.modules.residual_stack.rst new file mode 100644 index 000000000..6608e8ece --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.modules.residual_stack.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.modules.residual\_stack module +=============================================== + +.. automodule:: paddlespeech.t2s.modules.residual_stack + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.modules.rst b/docs/source/api/paddlespeech.t2s.modules.rst new file mode 100644 index 000000000..70e023f79 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.modules.rst @@ -0,0 +1,41 @@ +paddlespeech.t2s.modules package +================================ + +.. automodule:: paddlespeech.t2s.modules + :members: + :undoc-members: + :show-inheritance: + +Subpackages +----------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.t2s.modules.conformer + paddlespeech.t2s.modules.predictor + paddlespeech.t2s.modules.tacotron2 + paddlespeech.t2s.modules.transformer + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.t2s.modules.activation + paddlespeech.t2s.modules.causal_conv + paddlespeech.t2s.modules.conv + paddlespeech.t2s.modules.geometry + paddlespeech.t2s.modules.layer_norm + paddlespeech.t2s.modules.losses + paddlespeech.t2s.modules.masked_fill + paddlespeech.t2s.modules.nets_utils + paddlespeech.t2s.modules.normalizer + paddlespeech.t2s.modules.positional_encoding + paddlespeech.t2s.modules.pqmf + paddlespeech.t2s.modules.residual_block + paddlespeech.t2s.modules.residual_stack + paddlespeech.t2s.modules.style_encoder + paddlespeech.t2s.modules.tade_res_block + paddlespeech.t2s.modules.upsample diff --git a/docs/source/api/paddlespeech.t2s.modules.style_encoder.rst b/docs/source/api/paddlespeech.t2s.modules.style_encoder.rst new file mode 100644 index 000000000..b2d97fdbd --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.modules.style_encoder.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.modules.style\_encoder module +============================================== + +.. 
automodule:: paddlespeech.t2s.modules.style_encoder + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.modules.tacotron2.attentions.rst b/docs/source/api/paddlespeech.t2s.modules.tacotron2.attentions.rst new file mode 100644 index 000000000..f6c8f61ca --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.modules.tacotron2.attentions.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.modules.tacotron2.attentions module +==================================================== + +.. automodule:: paddlespeech.t2s.modules.tacotron2.attentions + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.modules.tacotron2.decoder.rst b/docs/source/api/paddlespeech.t2s.modules.tacotron2.decoder.rst new file mode 100644 index 000000000..20471a28c --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.modules.tacotron2.decoder.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.modules.tacotron2.decoder module +================================================= + +.. automodule:: paddlespeech.t2s.modules.tacotron2.decoder + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.modules.tacotron2.encoder.rst b/docs/source/api/paddlespeech.t2s.modules.tacotron2.encoder.rst new file mode 100644 index 000000000..eaf941c13 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.modules.tacotron2.encoder.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.modules.tacotron2.encoder module +================================================= + +.. automodule:: paddlespeech.t2s.modules.tacotron2.encoder + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.modules.tacotron2.rst b/docs/source/api/paddlespeech.t2s.modules.tacotron2.rst new file mode 100644 index 000000000..66fdd6c78 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.modules.tacotron2.rst @@ -0,0 +1,17 @@ +paddlespeech.t2s.modules.tacotron2 package +========================================== + +.. automodule:: paddlespeech.t2s.modules.tacotron2 + :members: + :undoc-members: + :show-inheritance: + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.t2s.modules.tacotron2.attentions + paddlespeech.t2s.modules.tacotron2.decoder + paddlespeech.t2s.modules.tacotron2.encoder diff --git a/docs/source/api/paddlespeech.t2s.modules.tade_res_block.rst b/docs/source/api/paddlespeech.t2s.modules.tade_res_block.rst new file mode 100644 index 000000000..1711335dc --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.modules.tade_res_block.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.modules.tade\_res\_block module +================================================ + +.. automodule:: paddlespeech.t2s.modules.tade_res_block + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.modules.transformer.attention.rst b/docs/source/api/paddlespeech.t2s.modules.transformer.attention.rst new file mode 100644 index 000000000..ae2cdb542 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.modules.transformer.attention.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.modules.transformer.attention module +===================================================== + +.. 
automodule:: paddlespeech.t2s.modules.transformer.attention + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.modules.transformer.decoder.rst b/docs/source/api/paddlespeech.t2s.modules.transformer.decoder.rst new file mode 100644 index 000000000..4d6036c29 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.modules.transformer.decoder.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.modules.transformer.decoder module +=================================================== + +.. automodule:: paddlespeech.t2s.modules.transformer.decoder + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.modules.transformer.decoder_layer.rst b/docs/source/api/paddlespeech.t2s.modules.transformer.decoder_layer.rst new file mode 100644 index 000000000..01ef289df --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.modules.transformer.decoder_layer.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.modules.transformer.decoder\_layer module +========================================================== + +.. automodule:: paddlespeech.t2s.modules.transformer.decoder_layer + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.modules.transformer.embedding.rst b/docs/source/api/paddlespeech.t2s.modules.transformer.embedding.rst new file mode 100644 index 000000000..04f816e8b --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.modules.transformer.embedding.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.modules.transformer.embedding module +===================================================== + +.. automodule:: paddlespeech.t2s.modules.transformer.embedding + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.modules.transformer.encoder.rst b/docs/source/api/paddlespeech.t2s.modules.transformer.encoder.rst new file mode 100644 index 000000000..6b3cd0b4b --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.modules.transformer.encoder.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.modules.transformer.encoder module +=================================================== + +.. automodule:: paddlespeech.t2s.modules.transformer.encoder + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.modules.transformer.encoder_layer.rst b/docs/source/api/paddlespeech.t2s.modules.transformer.encoder_layer.rst new file mode 100644 index 000000000..ce1fb2a32 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.modules.transformer.encoder_layer.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.modules.transformer.encoder\_layer module +========================================================== + +.. automodule:: paddlespeech.t2s.modules.transformer.encoder_layer + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.modules.transformer.lightconv.rst b/docs/source/api/paddlespeech.t2s.modules.transformer.lightconv.rst new file mode 100644 index 000000000..75050ff4f --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.modules.transformer.lightconv.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.modules.transformer.lightconv module +===================================================== + +.. 
automodule:: paddlespeech.t2s.modules.transformer.lightconv + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.modules.transformer.mask.rst b/docs/source/api/paddlespeech.t2s.modules.transformer.mask.rst new file mode 100644 index 000000000..e381ca9d0 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.modules.transformer.mask.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.modules.transformer.mask module +================================================ + +.. automodule:: paddlespeech.t2s.modules.transformer.mask + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.modules.transformer.multi_layer_conv.rst b/docs/source/api/paddlespeech.t2s.modules.transformer.multi_layer_conv.rst new file mode 100644 index 000000000..5d2a3d165 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.modules.transformer.multi_layer_conv.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.modules.transformer.multi\_layer\_conv module +============================================================== + +.. automodule:: paddlespeech.t2s.modules.transformer.multi_layer_conv + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.modules.transformer.positionwise_feed_forward.rst b/docs/source/api/paddlespeech.t2s.modules.transformer.positionwise_feed_forward.rst new file mode 100644 index 000000000..57656a6fc --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.modules.transformer.positionwise_feed_forward.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.modules.transformer.positionwise\_feed\_forward module +======================================================================= + +.. automodule:: paddlespeech.t2s.modules.transformer.positionwise_feed_forward + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.modules.transformer.repeat.rst b/docs/source/api/paddlespeech.t2s.modules.transformer.repeat.rst new file mode 100644 index 000000000..795756a13 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.modules.transformer.repeat.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.modules.transformer.repeat module +================================================== + +.. automodule:: paddlespeech.t2s.modules.transformer.repeat + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.modules.transformer.rst b/docs/source/api/paddlespeech.t2s.modules.transformer.rst new file mode 100644 index 000000000..e0e0ccd22 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.modules.transformer.rst @@ -0,0 +1,26 @@ +paddlespeech.t2s.modules.transformer package +============================================ + +.. automodule:: paddlespeech.t2s.modules.transformer + :members: + :undoc-members: + :show-inheritance: + +Submodules +---------- + +.. 
toctree:: + :maxdepth: 4 + + paddlespeech.t2s.modules.transformer.attention + paddlespeech.t2s.modules.transformer.decoder + paddlespeech.t2s.modules.transformer.decoder_layer + paddlespeech.t2s.modules.transformer.embedding + paddlespeech.t2s.modules.transformer.encoder + paddlespeech.t2s.modules.transformer.encoder_layer + paddlespeech.t2s.modules.transformer.lightconv + paddlespeech.t2s.modules.transformer.mask + paddlespeech.t2s.modules.transformer.multi_layer_conv + paddlespeech.t2s.modules.transformer.positionwise_feed_forward + paddlespeech.t2s.modules.transformer.repeat + paddlespeech.t2s.modules.transformer.subsampling diff --git a/docs/source/api/paddlespeech.t2s.modules.transformer.subsampling.rst b/docs/source/api/paddlespeech.t2s.modules.transformer.subsampling.rst new file mode 100644 index 000000000..852fedb23 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.modules.transformer.subsampling.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.modules.transformer.subsampling module +======================================================= + +.. automodule:: paddlespeech.t2s.modules.transformer.subsampling + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.modules.upsample.rst b/docs/source/api/paddlespeech.t2s.modules.upsample.rst new file mode 100644 index 000000000..74820386a --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.modules.upsample.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.modules.upsample module +======================================== + +.. automodule:: paddlespeech.t2s.modules.upsample + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.rst b/docs/source/api/paddlespeech.t2s.rst new file mode 100644 index 000000000..59dd619f1 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.rst @@ -0,0 +1,22 @@ +paddlespeech.t2s package +======================== + +.. automodule:: paddlespeech.t2s + :members: + :undoc-members: + :show-inheritance: + +Subpackages +----------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.t2s.audio + paddlespeech.t2s.datasets + paddlespeech.t2s.exps + paddlespeech.t2s.frontend + paddlespeech.t2s.models + paddlespeech.t2s.modules + paddlespeech.t2s.training + paddlespeech.t2s.utils diff --git a/docs/source/api/paddlespeech.t2s.training.cli.rst b/docs/source/api/paddlespeech.t2s.training.cli.rst new file mode 100644 index 000000000..4f9a8efdb --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.training.cli.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.training.cli module +==================================== + +.. automodule:: paddlespeech.t2s.training.cli + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.training.default_config.rst b/docs/source/api/paddlespeech.t2s.training.default_config.rst new file mode 100644 index 000000000..a79e633a5 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.training.default_config.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.training.default\_config module +================================================ + +.. automodule:: paddlespeech.t2s.training.default_config + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.training.experiment.rst b/docs/source/api/paddlespeech.t2s.training.experiment.rst new file mode 100644 index 000000000..9d5760318 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.training.experiment.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.training.experiment module +=========================================== + +.. 
automodule:: paddlespeech.t2s.training.experiment + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.training.extension.rst b/docs/source/api/paddlespeech.t2s.training.extension.rst new file mode 100644 index 000000000..e8c96db20 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.training.extension.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.training.extension module +========================================== + +.. automodule:: paddlespeech.t2s.training.extension + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.training.extensions.evaluator.rst b/docs/source/api/paddlespeech.t2s.training.extensions.evaluator.rst new file mode 100644 index 000000000..ce0241fab --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.training.extensions.evaluator.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.training.extensions.evaluator module +===================================================== + +.. automodule:: paddlespeech.t2s.training.extensions.evaluator + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.training.extensions.rst b/docs/source/api/paddlespeech.t2s.training.extensions.rst new file mode 100644 index 000000000..c145b58d9 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.training.extensions.rst @@ -0,0 +1,17 @@ +paddlespeech.t2s.training.extensions package +============================================ + +.. automodule:: paddlespeech.t2s.training.extensions + :members: + :undoc-members: + :show-inheritance: + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.t2s.training.extensions.evaluator + paddlespeech.t2s.training.extensions.snapshot + paddlespeech.t2s.training.extensions.visualizer diff --git a/docs/source/api/paddlespeech.t2s.training.extensions.snapshot.rst b/docs/source/api/paddlespeech.t2s.training.extensions.snapshot.rst new file mode 100644 index 000000000..eb509ff26 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.training.extensions.snapshot.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.training.extensions.snapshot module +==================================================== + +.. automodule:: paddlespeech.t2s.training.extensions.snapshot + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.training.extensions.visualizer.rst b/docs/source/api/paddlespeech.t2s.training.extensions.visualizer.rst new file mode 100644 index 000000000..13cd16ae7 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.training.extensions.visualizer.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.training.extensions.visualizer module +====================================================== + +.. automodule:: paddlespeech.t2s.training.extensions.visualizer + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.training.optimizer.rst b/docs/source/api/paddlespeech.t2s.training.optimizer.rst new file mode 100644 index 000000000..1a5487feb --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.training.optimizer.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.training.optimizer module +========================================== + +.. 
automodule:: paddlespeech.t2s.training.optimizer + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.training.reporter.rst b/docs/source/api/paddlespeech.t2s.training.reporter.rst new file mode 100644 index 000000000..c6eca7cbc --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.training.reporter.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.training.reporter module +========================================= + +.. automodule:: paddlespeech.t2s.training.reporter + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.training.rst b/docs/source/api/paddlespeech.t2s.training.rst new file mode 100644 index 000000000..48d4679f3 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.training.rst @@ -0,0 +1,34 @@ +paddlespeech.t2s.training package +================================= + +.. automodule:: paddlespeech.t2s.training + :members: + :undoc-members: + :show-inheritance: + +Subpackages +----------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.t2s.training.extensions + paddlespeech.t2s.training.triggers + paddlespeech.t2s.training.updaters + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.t2s.training.cli + paddlespeech.t2s.training.default_config + paddlespeech.t2s.training.experiment + paddlespeech.t2s.training.extension + paddlespeech.t2s.training.optimizer + paddlespeech.t2s.training.reporter + paddlespeech.t2s.training.seeding + paddlespeech.t2s.training.trainer + paddlespeech.t2s.training.trigger + paddlespeech.t2s.training.updater diff --git a/docs/source/api/paddlespeech.t2s.training.seeding.rst b/docs/source/api/paddlespeech.t2s.training.seeding.rst new file mode 100644 index 000000000..d1889fbd7 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.training.seeding.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.training.seeding module +======================================== + +.. automodule:: paddlespeech.t2s.training.seeding + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.training.trainer.rst b/docs/source/api/paddlespeech.t2s.training.trainer.rst new file mode 100644 index 000000000..e3480910c --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.training.trainer.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.training.trainer module +======================================== + +.. automodule:: paddlespeech.t2s.training.trainer + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.training.trigger.rst b/docs/source/api/paddlespeech.t2s.training.trigger.rst new file mode 100644 index 000000000..9e3696039 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.training.trigger.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.training.trigger module +======================================== + +.. automodule:: paddlespeech.t2s.training.trigger + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.training.triggers.interval_trigger.rst b/docs/source/api/paddlespeech.t2s.training.triggers.interval_trigger.rst new file mode 100644 index 000000000..99a212501 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.training.triggers.interval_trigger.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.training.triggers.interval\_trigger module +=========================================================== + +.. 
automodule:: paddlespeech.t2s.training.triggers.interval_trigger + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.training.triggers.limit_trigger.rst b/docs/source/api/paddlespeech.t2s.training.triggers.limit_trigger.rst new file mode 100644 index 000000000..34144b17e --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.training.triggers.limit_trigger.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.training.triggers.limit\_trigger module +======================================================== + +.. automodule:: paddlespeech.t2s.training.triggers.limit_trigger + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.training.triggers.rst b/docs/source/api/paddlespeech.t2s.training.triggers.rst new file mode 100644 index 000000000..9c8536166 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.training.triggers.rst @@ -0,0 +1,17 @@ +paddlespeech.t2s.training.triggers package +========================================== + +.. automodule:: paddlespeech.t2s.training.triggers + :members: + :undoc-members: + :show-inheritance: + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.t2s.training.triggers.interval_trigger + paddlespeech.t2s.training.triggers.limit_trigger + paddlespeech.t2s.training.triggers.time_trigger diff --git a/docs/source/api/paddlespeech.t2s.training.triggers.time_trigger.rst b/docs/source/api/paddlespeech.t2s.training.triggers.time_trigger.rst new file mode 100644 index 000000000..6220544b2 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.training.triggers.time_trigger.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.training.triggers.time\_trigger module +======================================================= + +.. automodule:: paddlespeech.t2s.training.triggers.time_trigger + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.training.updater.rst b/docs/source/api/paddlespeech.t2s.training.updater.rst new file mode 100644 index 000000000..79aa7feb4 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.training.updater.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.training.updater module +======================================== + +.. automodule:: paddlespeech.t2s.training.updater + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.training.updaters.rst b/docs/source/api/paddlespeech.t2s.training.updaters.rst new file mode 100644 index 000000000..d4062b000 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.training.updaters.rst @@ -0,0 +1,15 @@ +paddlespeech.t2s.training.updaters package +========================================== + +.. automodule:: paddlespeech.t2s.training.updaters + :members: + :undoc-members: + :show-inheritance: + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.t2s.training.updaters.standard_updater diff --git a/docs/source/api/paddlespeech.t2s.training.updaters.standard_updater.rst b/docs/source/api/paddlespeech.t2s.training.updaters.standard_updater.rst new file mode 100644 index 000000000..6202ccaee --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.training.updaters.standard_updater.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.training.updaters.standard\_updater module +=========================================================== + +.. 
automodule:: paddlespeech.t2s.training.updaters.standard_updater + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.utils.checkpoint.rst b/docs/source/api/paddlespeech.t2s.utils.checkpoint.rst new file mode 100644 index 000000000..9f7758ea4 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.utils.checkpoint.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.utils.checkpoint module +======================================== + +.. automodule:: paddlespeech.t2s.utils.checkpoint + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.utils.display.rst b/docs/source/api/paddlespeech.t2s.utils.display.rst new file mode 100644 index 000000000..b1f9520f5 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.utils.display.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.utils.display module +===================================== + +.. automodule:: paddlespeech.t2s.utils.display + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.utils.error_rate.rst b/docs/source/api/paddlespeech.t2s.utils.error_rate.rst new file mode 100644 index 000000000..843171fec --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.utils.error_rate.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.utils.error\_rate module +========================================= + +.. automodule:: paddlespeech.t2s.utils.error_rate + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.utils.h5_utils.rst b/docs/source/api/paddlespeech.t2s.utils.h5_utils.rst new file mode 100644 index 000000000..0784ebf6d --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.utils.h5_utils.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.utils.h5\_utils module +======================================= + +.. automodule:: paddlespeech.t2s.utils.h5_utils + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.utils.internals.rst b/docs/source/api/paddlespeech.t2s.utils.internals.rst new file mode 100644 index 000000000..d6236151d --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.utils.internals.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.utils.internals module +======================================= + +.. automodule:: paddlespeech.t2s.utils.internals + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.utils.layer_tools.rst b/docs/source/api/paddlespeech.t2s.utils.layer_tools.rst new file mode 100644 index 000000000..d16e3843d --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.utils.layer_tools.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.utils.layer\_tools module +========================================== + +.. automodule:: paddlespeech.t2s.utils.layer_tools + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.utils.mp_tools.rst b/docs/source/api/paddlespeech.t2s.utils.mp_tools.rst new file mode 100644 index 000000000..cdb6ca913 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.utils.mp_tools.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.utils.mp\_tools module +======================================= + +.. 
automodule:: paddlespeech.t2s.utils.mp_tools + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.utils.profiler.rst b/docs/source/api/paddlespeech.t2s.utils.profiler.rst new file mode 100644 index 000000000..31a26934f --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.utils.profiler.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.utils.profiler module +====================================== + +.. automodule:: paddlespeech.t2s.utils.profiler + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.t2s.utils.rst b/docs/source/api/paddlespeech.t2s.utils.rst new file mode 100644 index 000000000..2fd9df2a6 --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.utils.rst @@ -0,0 +1,23 @@ +paddlespeech.t2s.utils package +============================== + +.. automodule:: paddlespeech.t2s.utils + :members: + :undoc-members: + :show-inheritance: + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.t2s.utils.checkpoint + paddlespeech.t2s.utils.display + paddlespeech.t2s.utils.error_rate + paddlespeech.t2s.utils.h5_utils + paddlespeech.t2s.utils.internals + paddlespeech.t2s.utils.layer_tools + paddlespeech.t2s.utils.mp_tools + paddlespeech.t2s.utils.profiler + paddlespeech.t2s.utils.scheduler diff --git a/docs/source/api/paddlespeech.t2s.utils.scheduler.rst b/docs/source/api/paddlespeech.t2s.utils.scheduler.rst new file mode 100644 index 000000000..2066f31ac --- /dev/null +++ b/docs/source/api/paddlespeech.t2s.utils.scheduler.rst @@ -0,0 +1,7 @@ +paddlespeech.t2s.utils.scheduler module +======================================= + +.. automodule:: paddlespeech.t2s.utils.scheduler + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.text.exps.ernie_linear.avg_model.rst b/docs/source/api/paddlespeech.text.exps.ernie_linear.avg_model.rst new file mode 100644 index 000000000..25bffda84 --- /dev/null +++ b/docs/source/api/paddlespeech.text.exps.ernie_linear.avg_model.rst @@ -0,0 +1,7 @@ +paddlespeech.text.exps.ernie\_linear.avg\_model module +====================================================== + +.. automodule:: paddlespeech.text.exps.ernie_linear.avg_model + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.text.exps.ernie_linear.punc_restore.rst b/docs/source/api/paddlespeech.text.exps.ernie_linear.punc_restore.rst new file mode 100644 index 000000000..92f1926d1 --- /dev/null +++ b/docs/source/api/paddlespeech.text.exps.ernie_linear.punc_restore.rst @@ -0,0 +1,7 @@ +paddlespeech.text.exps.ernie\_linear.punc\_restore module +========================================================= + +.. automodule:: paddlespeech.text.exps.ernie_linear.punc_restore + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.text.exps.ernie_linear.rst b/docs/source/api/paddlespeech.text.exps.ernie_linear.rst new file mode 100644 index 000000000..7b38c8f10 --- /dev/null +++ b/docs/source/api/paddlespeech.text.exps.ernie_linear.rst @@ -0,0 +1,18 @@ +paddlespeech.text.exps.ernie\_linear package +============================================ + +.. automodule:: paddlespeech.text.exps.ernie_linear + :members: + :undoc-members: + :show-inheritance: + +Submodules +---------- + +.. 
toctree:: + :maxdepth: 4 + + paddlespeech.text.exps.ernie_linear.avg_model + paddlespeech.text.exps.ernie_linear.punc_restore + paddlespeech.text.exps.ernie_linear.test + paddlespeech.text.exps.ernie_linear.train diff --git a/docs/source/api/paddlespeech.text.exps.ernie_linear.test.rst b/docs/source/api/paddlespeech.text.exps.ernie_linear.test.rst new file mode 100644 index 000000000..ac61440da --- /dev/null +++ b/docs/source/api/paddlespeech.text.exps.ernie_linear.test.rst @@ -0,0 +1,7 @@ +paddlespeech.text.exps.ernie\_linear.test module +================================================ + +.. automodule:: paddlespeech.text.exps.ernie_linear.test + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.text.exps.ernie_linear.train.rst b/docs/source/api/paddlespeech.text.exps.ernie_linear.train.rst new file mode 100644 index 000000000..c26425c3b --- /dev/null +++ b/docs/source/api/paddlespeech.text.exps.ernie_linear.train.rst @@ -0,0 +1,7 @@ +paddlespeech.text.exps.ernie\_linear.train module +================================================= + +.. automodule:: paddlespeech.text.exps.ernie_linear.train + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.text.exps.rst b/docs/source/api/paddlespeech.text.exps.rst new file mode 100644 index 000000000..76143476f --- /dev/null +++ b/docs/source/api/paddlespeech.text.exps.rst @@ -0,0 +1,15 @@ +paddlespeech.text.exps package +============================== + +.. automodule:: paddlespeech.text.exps + :members: + :undoc-members: + :show-inheritance: + +Subpackages +----------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.text.exps.ernie_linear diff --git a/docs/source/api/paddlespeech.text.models.ernie_crf.model.rst b/docs/source/api/paddlespeech.text.models.ernie_crf.model.rst new file mode 100644 index 000000000..fffdbe0fb --- /dev/null +++ b/docs/source/api/paddlespeech.text.models.ernie_crf.model.rst @@ -0,0 +1,7 @@ +paddlespeech.text.models.ernie\_crf.model module +================================================ + +.. automodule:: paddlespeech.text.models.ernie_crf.model + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.text.models.ernie_crf.rst b/docs/source/api/paddlespeech.text.models.ernie_crf.rst new file mode 100644 index 000000000..526b89c77 --- /dev/null +++ b/docs/source/api/paddlespeech.text.models.ernie_crf.rst @@ -0,0 +1,15 @@ +paddlespeech.text.models.ernie\_crf package +=========================================== + +.. automodule:: paddlespeech.text.models.ernie_crf + :members: + :undoc-members: + :show-inheritance: + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.text.models.ernie_crf.model diff --git a/docs/source/api/paddlespeech.text.models.ernie_linear.dataset.rst b/docs/source/api/paddlespeech.text.models.ernie_linear.dataset.rst new file mode 100644 index 000000000..f291f4b54 --- /dev/null +++ b/docs/source/api/paddlespeech.text.models.ernie_linear.dataset.rst @@ -0,0 +1,7 @@ +paddlespeech.text.models.ernie\_linear.dataset module +===================================================== + +.. 
automodule:: paddlespeech.text.models.ernie_linear.dataset + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.text.models.ernie_linear.ernie_linear.rst b/docs/source/api/paddlespeech.text.models.ernie_linear.ernie_linear.rst new file mode 100644 index 000000000..b0a59c8ab --- /dev/null +++ b/docs/source/api/paddlespeech.text.models.ernie_linear.ernie_linear.rst @@ -0,0 +1,7 @@ +paddlespeech.text.models.ernie\_linear.ernie\_linear module +=========================================================== + +.. automodule:: paddlespeech.text.models.ernie_linear.ernie_linear + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.text.models.ernie_linear.ernie_linear_updater.rst b/docs/source/api/paddlespeech.text.models.ernie_linear.ernie_linear_updater.rst new file mode 100644 index 000000000..8d6df2601 --- /dev/null +++ b/docs/source/api/paddlespeech.text.models.ernie_linear.ernie_linear_updater.rst @@ -0,0 +1,7 @@ +paddlespeech.text.models.ernie\_linear.ernie\_linear\_updater module +==================================================================== + +.. automodule:: paddlespeech.text.models.ernie_linear.ernie_linear_updater + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.text.models.ernie_linear.rst b/docs/source/api/paddlespeech.text.models.ernie_linear.rst new file mode 100644 index 000000000..6a6f7faa4 --- /dev/null +++ b/docs/source/api/paddlespeech.text.models.ernie_linear.rst @@ -0,0 +1,17 @@ +paddlespeech.text.models.ernie\_linear package +============================================== + +.. automodule:: paddlespeech.text.models.ernie_linear + :members: + :undoc-members: + :show-inheritance: + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.text.models.ernie_linear.dataset + paddlespeech.text.models.ernie_linear.ernie_linear + paddlespeech.text.models.ernie_linear.ernie_linear_updater diff --git a/docs/source/api/paddlespeech.text.models.rst b/docs/source/api/paddlespeech.text.models.rst new file mode 100644 index 000000000..cc4e5d617 --- /dev/null +++ b/docs/source/api/paddlespeech.text.models.rst @@ -0,0 +1,16 @@ +paddlespeech.text.models package +================================ + +.. automodule:: paddlespeech.text.models + :members: + :undoc-members: + :show-inheritance: + +Subpackages +----------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.text.models.ernie_crf + paddlespeech.text.models.ernie_linear diff --git a/docs/source/api/paddlespeech.text.rst b/docs/source/api/paddlespeech.text.rst new file mode 100644 index 000000000..7ffde89f5 --- /dev/null +++ b/docs/source/api/paddlespeech.text.rst @@ -0,0 +1,16 @@ +paddlespeech.text package +========================= + +.. automodule:: paddlespeech.text + :members: + :undoc-members: + :show-inheritance: + +Subpackages +----------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.text.exps + paddlespeech.text.models diff --git a/docs/source/api/paddlespeech.vector.cluster.diarization.rst b/docs/source/api/paddlespeech.vector.cluster.diarization.rst new file mode 100644 index 000000000..cd3a5a1fd --- /dev/null +++ b/docs/source/api/paddlespeech.vector.cluster.diarization.rst @@ -0,0 +1,7 @@ +paddlespeech.vector.cluster.diarization module +============================================== + +.. 
automodule:: paddlespeech.vector.cluster.diarization + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.vector.cluster.plda.rst b/docs/source/api/paddlespeech.vector.cluster.plda.rst new file mode 100644 index 000000000..e3e9e23d3 --- /dev/null +++ b/docs/source/api/paddlespeech.vector.cluster.plda.rst @@ -0,0 +1,7 @@ +paddlespeech.vector.cluster.plda module +======================================= + +.. automodule:: paddlespeech.vector.cluster.plda + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.vector.cluster.rst b/docs/source/api/paddlespeech.vector.cluster.rst new file mode 100644 index 000000000..288465c43 --- /dev/null +++ b/docs/source/api/paddlespeech.vector.cluster.rst @@ -0,0 +1,16 @@ +paddlespeech.vector.cluster package +=================================== + +.. automodule:: paddlespeech.vector.cluster + :members: + :undoc-members: + :show-inheritance: + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.vector.cluster.diarization + paddlespeech.vector.cluster.plda diff --git a/docs/source/api/paddlespeech.vector.exps.ge2e.audio_processor.rst b/docs/source/api/paddlespeech.vector.exps.ge2e.audio_processor.rst new file mode 100644 index 000000000..654faf2ec --- /dev/null +++ b/docs/source/api/paddlespeech.vector.exps.ge2e.audio_processor.rst @@ -0,0 +1,7 @@ +paddlespeech.vector.exps.ge2e.audio\_processor module +===================================================== + +.. automodule:: paddlespeech.vector.exps.ge2e.audio_processor + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.vector.exps.ge2e.config.rst b/docs/source/api/paddlespeech.vector.exps.ge2e.config.rst new file mode 100644 index 000000000..d97e8f9f6 --- /dev/null +++ b/docs/source/api/paddlespeech.vector.exps.ge2e.config.rst @@ -0,0 +1,7 @@ +paddlespeech.vector.exps.ge2e.config module +=========================================== + +.. automodule:: paddlespeech.vector.exps.ge2e.config + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.vector.exps.ge2e.dataset_processors.rst b/docs/source/api/paddlespeech.vector.exps.ge2e.dataset_processors.rst new file mode 100644 index 000000000..9386aaf0c --- /dev/null +++ b/docs/source/api/paddlespeech.vector.exps.ge2e.dataset_processors.rst @@ -0,0 +1,7 @@ +paddlespeech.vector.exps.ge2e.dataset\_processors module +======================================================== + +.. automodule:: paddlespeech.vector.exps.ge2e.dataset_processors + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.vector.exps.ge2e.inference.rst b/docs/source/api/paddlespeech.vector.exps.ge2e.inference.rst new file mode 100644 index 000000000..4e63754fd --- /dev/null +++ b/docs/source/api/paddlespeech.vector.exps.ge2e.inference.rst @@ -0,0 +1,7 @@ +paddlespeech.vector.exps.ge2e.inference module +============================================== + +.. automodule:: paddlespeech.vector.exps.ge2e.inference + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.vector.exps.ge2e.preprocess.rst b/docs/source/api/paddlespeech.vector.exps.ge2e.preprocess.rst new file mode 100644 index 000000000..31fd8c8ba --- /dev/null +++ b/docs/source/api/paddlespeech.vector.exps.ge2e.preprocess.rst @@ -0,0 +1,7 @@ +paddlespeech.vector.exps.ge2e.preprocess module +=============================================== + +.. 
automodule:: paddlespeech.vector.exps.ge2e.preprocess + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.vector.exps.ge2e.random_cycle.rst b/docs/source/api/paddlespeech.vector.exps.ge2e.random_cycle.rst new file mode 100644 index 000000000..96c2c14ae --- /dev/null +++ b/docs/source/api/paddlespeech.vector.exps.ge2e.random_cycle.rst @@ -0,0 +1,7 @@ +paddlespeech.vector.exps.ge2e.random\_cycle module +================================================== + +.. automodule:: paddlespeech.vector.exps.ge2e.random_cycle + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.vector.exps.ge2e.rst b/docs/source/api/paddlespeech.vector.exps.ge2e.rst new file mode 100644 index 000000000..00cae3a71 --- /dev/null +++ b/docs/source/api/paddlespeech.vector.exps.ge2e.rst @@ -0,0 +1,22 @@ +paddlespeech.vector.exps.ge2e package +===================================== + +.. automodule:: paddlespeech.vector.exps.ge2e + :members: + :undoc-members: + :show-inheritance: + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.vector.exps.ge2e.audio_processor + paddlespeech.vector.exps.ge2e.config + paddlespeech.vector.exps.ge2e.dataset_processors + paddlespeech.vector.exps.ge2e.inference + paddlespeech.vector.exps.ge2e.preprocess + paddlespeech.vector.exps.ge2e.random_cycle + paddlespeech.vector.exps.ge2e.speaker_verification_dataset + paddlespeech.vector.exps.ge2e.train diff --git a/docs/source/api/paddlespeech.vector.exps.ge2e.speaker_verification_dataset.rst b/docs/source/api/paddlespeech.vector.exps.ge2e.speaker_verification_dataset.rst new file mode 100644 index 000000000..be2d8334b --- /dev/null +++ b/docs/source/api/paddlespeech.vector.exps.ge2e.speaker_verification_dataset.rst @@ -0,0 +1,7 @@ +paddlespeech.vector.exps.ge2e.speaker\_verification\_dataset module +=================================================================== + +.. automodule:: paddlespeech.vector.exps.ge2e.speaker_verification_dataset + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.vector.exps.ge2e.train.rst b/docs/source/api/paddlespeech.vector.exps.ge2e.train.rst new file mode 100644 index 000000000..d3d4edde3 --- /dev/null +++ b/docs/source/api/paddlespeech.vector.exps.ge2e.train.rst @@ -0,0 +1,7 @@ +paddlespeech.vector.exps.ge2e.train module +========================================== + +.. automodule:: paddlespeech.vector.exps.ge2e.train + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.vector.exps.rst b/docs/source/api/paddlespeech.vector.exps.rst new file mode 100644 index 000000000..ea857a0a8 --- /dev/null +++ b/docs/source/api/paddlespeech.vector.exps.rst @@ -0,0 +1,15 @@ +paddlespeech.vector.exps package +================================ + +.. automodule:: paddlespeech.vector.exps + :members: + :undoc-members: + :show-inheritance: + +Subpackages +----------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.vector.exps.ge2e diff --git a/docs/source/api/paddlespeech.vector.io.augment.rst b/docs/source/api/paddlespeech.vector.io.augment.rst new file mode 100644 index 000000000..b76c15311 --- /dev/null +++ b/docs/source/api/paddlespeech.vector.io.augment.rst @@ -0,0 +1,7 @@ +paddlespeech.vector.io.augment module +===================================== + +.. 
automodule:: paddlespeech.vector.io.augment + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.vector.io.batch.rst b/docs/source/api/paddlespeech.vector.io.batch.rst new file mode 100644 index 000000000..e0bdb5f2b --- /dev/null +++ b/docs/source/api/paddlespeech.vector.io.batch.rst @@ -0,0 +1,7 @@ +paddlespeech.vector.io.batch module +=================================== + +.. automodule:: paddlespeech.vector.io.batch + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.vector.io.dataset.rst b/docs/source/api/paddlespeech.vector.io.dataset.rst new file mode 100644 index 000000000..a4618e19a --- /dev/null +++ b/docs/source/api/paddlespeech.vector.io.dataset.rst @@ -0,0 +1,7 @@ +paddlespeech.vector.io.dataset module +===================================== + +.. automodule:: paddlespeech.vector.io.dataset + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.vector.io.dataset_from_json.rst b/docs/source/api/paddlespeech.vector.io.dataset_from_json.rst new file mode 100644 index 000000000..8b0773a98 --- /dev/null +++ b/docs/source/api/paddlespeech.vector.io.dataset_from_json.rst @@ -0,0 +1,7 @@ +paddlespeech.vector.io.dataset\_from\_json module +================================================= + +.. automodule:: paddlespeech.vector.io.dataset_from_json + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.vector.io.embedding_norm.rst b/docs/source/api/paddlespeech.vector.io.embedding_norm.rst new file mode 100644 index 000000000..dc85976bd --- /dev/null +++ b/docs/source/api/paddlespeech.vector.io.embedding_norm.rst @@ -0,0 +1,7 @@ +paddlespeech.vector.io.embedding\_norm module +============================================= + +.. automodule:: paddlespeech.vector.io.embedding_norm + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.vector.io.rst b/docs/source/api/paddlespeech.vector.io.rst new file mode 100644 index 000000000..720628680 --- /dev/null +++ b/docs/source/api/paddlespeech.vector.io.rst @@ -0,0 +1,20 @@ +paddlespeech.vector.io package +============================== + +.. automodule:: paddlespeech.vector.io + :members: + :undoc-members: + :show-inheritance: + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.vector.io.augment + paddlespeech.vector.io.batch + paddlespeech.vector.io.dataset + paddlespeech.vector.io.dataset_from_json + paddlespeech.vector.io.embedding_norm + paddlespeech.vector.io.signal_processing diff --git a/docs/source/api/paddlespeech.vector.io.signal_processing.rst b/docs/source/api/paddlespeech.vector.io.signal_processing.rst new file mode 100644 index 000000000..e03b8c2c8 --- /dev/null +++ b/docs/source/api/paddlespeech.vector.io.signal_processing.rst @@ -0,0 +1,7 @@ +paddlespeech.vector.io.signal\_processing module +================================================ + +.. automodule:: paddlespeech.vector.io.signal_processing + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.vector.models.ecapa_tdnn.rst b/docs/source/api/paddlespeech.vector.models.ecapa_tdnn.rst new file mode 100644 index 000000000..2a1b16878 --- /dev/null +++ b/docs/source/api/paddlespeech.vector.models.ecapa_tdnn.rst @@ -0,0 +1,7 @@ +paddlespeech.vector.models.ecapa\_tdnn module +============================================= + +.. 
automodule:: paddlespeech.vector.models.ecapa_tdnn + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.vector.models.lstm_speaker_encoder.rst b/docs/source/api/paddlespeech.vector.models.lstm_speaker_encoder.rst new file mode 100644 index 000000000..3a19e2b7d --- /dev/null +++ b/docs/source/api/paddlespeech.vector.models.lstm_speaker_encoder.rst @@ -0,0 +1,7 @@ +paddlespeech.vector.models.lstm\_speaker\_encoder module +======================================================== + +.. automodule:: paddlespeech.vector.models.lstm_speaker_encoder + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.vector.models.rst b/docs/source/api/paddlespeech.vector.models.rst new file mode 100644 index 000000000..aab7c0dc4 --- /dev/null +++ b/docs/source/api/paddlespeech.vector.models.rst @@ -0,0 +1,16 @@ +paddlespeech.vector.models package +================================== + +.. automodule:: paddlespeech.vector.models + :members: + :undoc-members: + :show-inheritance: + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.vector.models.ecapa_tdnn + paddlespeech.vector.models.lstm_speaker_encoder diff --git a/docs/source/api/paddlespeech.vector.modules.layer.rst b/docs/source/api/paddlespeech.vector.modules.layer.rst new file mode 100644 index 000000000..bb7e8bb1f --- /dev/null +++ b/docs/source/api/paddlespeech.vector.modules.layer.rst @@ -0,0 +1,7 @@ +paddlespeech.vector.modules.layer module +======================================== + +.. automodule:: paddlespeech.vector.modules.layer + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.vector.modules.loss.rst b/docs/source/api/paddlespeech.vector.modules.loss.rst new file mode 100644 index 000000000..744ffd236 --- /dev/null +++ b/docs/source/api/paddlespeech.vector.modules.loss.rst @@ -0,0 +1,7 @@ +paddlespeech.vector.modules.loss module +======================================= + +.. automodule:: paddlespeech.vector.modules.loss + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.vector.modules.rst b/docs/source/api/paddlespeech.vector.modules.rst new file mode 100644 index 000000000..7895e9bbf --- /dev/null +++ b/docs/source/api/paddlespeech.vector.modules.rst @@ -0,0 +1,17 @@ +paddlespeech.vector.modules package +=================================== + +.. automodule:: paddlespeech.vector.modules + :members: + :undoc-members: + :show-inheritance: + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.vector.modules.layer + paddlespeech.vector.modules.loss + paddlespeech.vector.modules.sid_model diff --git a/docs/source/api/paddlespeech.vector.modules.sid_model.rst b/docs/source/api/paddlespeech.vector.modules.sid_model.rst new file mode 100644 index 000000000..03b65247d --- /dev/null +++ b/docs/source/api/paddlespeech.vector.modules.sid_model.rst @@ -0,0 +1,7 @@ +paddlespeech.vector.modules.sid\_model module +============================================= + +.. automodule:: paddlespeech.vector.modules.sid_model + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.vector.rst b/docs/source/api/paddlespeech.vector.rst new file mode 100644 index 000000000..eeeea10a2 --- /dev/null +++ b/docs/source/api/paddlespeech.vector.rst @@ -0,0 +1,21 @@ +paddlespeech.vector package +=========================== + +.. automodule:: paddlespeech.vector + :members: + :undoc-members: + :show-inheritance: + +Subpackages +----------- + +.. 
toctree:: + :maxdepth: 4 + + paddlespeech.vector.cluster + paddlespeech.vector.exps + paddlespeech.vector.io + paddlespeech.vector.models + paddlespeech.vector.modules + paddlespeech.vector.training + paddlespeech.vector.utils diff --git a/docs/source/api/paddlespeech.vector.training.rst b/docs/source/api/paddlespeech.vector.training.rst new file mode 100644 index 000000000..eed51903f --- /dev/null +++ b/docs/source/api/paddlespeech.vector.training.rst @@ -0,0 +1,16 @@ +paddlespeech.vector.training package +==================================== + +.. automodule:: paddlespeech.vector.training + :members: + :undoc-members: + :show-inheritance: + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.vector.training.scheduler + paddlespeech.vector.training.seeding diff --git a/docs/source/api/paddlespeech.vector.training.scheduler.rst b/docs/source/api/paddlespeech.vector.training.scheduler.rst new file mode 100644 index 000000000..1811a3489 --- /dev/null +++ b/docs/source/api/paddlespeech.vector.training.scheduler.rst @@ -0,0 +1,7 @@ +paddlespeech.vector.training.scheduler module +============================================= + +.. automodule:: paddlespeech.vector.training.scheduler + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.vector.training.seeding.rst b/docs/source/api/paddlespeech.vector.training.seeding.rst new file mode 100644 index 000000000..ea1324a68 --- /dev/null +++ b/docs/source/api/paddlespeech.vector.training.seeding.rst @@ -0,0 +1,7 @@ +paddlespeech.vector.training.seeding module +=========================================== + +.. automodule:: paddlespeech.vector.training.seeding + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.vector.utils.rst b/docs/source/api/paddlespeech.vector.utils.rst new file mode 100644 index 000000000..fcab72544 --- /dev/null +++ b/docs/source/api/paddlespeech.vector.utils.rst @@ -0,0 +1,16 @@ +paddlespeech.vector.utils package +================================= + +.. automodule:: paddlespeech.vector.utils + :members: + :undoc-members: + :show-inheritance: + +Submodules +---------- + +.. toctree:: + :maxdepth: 4 + + paddlespeech.vector.utils.time + paddlespeech.vector.utils.vector_utils diff --git a/docs/source/api/paddlespeech.vector.utils.time.rst b/docs/source/api/paddlespeech.vector.utils.time.rst new file mode 100644 index 000000000..19ce5fd31 --- /dev/null +++ b/docs/source/api/paddlespeech.vector.utils.time.rst @@ -0,0 +1,7 @@ +paddlespeech.vector.utils.time module +===================================== + +.. automodule:: paddlespeech.vector.utils.time + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/api/paddlespeech.vector.utils.vector_utils.rst b/docs/source/api/paddlespeech.vector.utils.vector_utils.rst new file mode 100644 index 000000000..c740700f2 --- /dev/null +++ b/docs/source/api/paddlespeech.vector.utils.vector_utils.rst @@ -0,0 +1,7 @@ +paddlespeech.vector.utils.vector\_utils module +============================================== + +.. automodule:: paddlespeech.vector.utils.vector_utils + :members: + :undoc-members: + :show-inheritance: diff --git a/docs/source/conf.py b/docs/source/conf.py index e6431c7c4..c94cf0b86 100644 --- a/docs/source/conf.py +++ b/docs/source/conf.py @@ -22,6 +22,9 @@ # documentation root, use os.path.abspath to make it absolute, like shown here. 
import recommonmark.parser import sphinx_rtd_theme +import sys +import os +sys.path.insert(0, os.path.abspath('../..')) autodoc_mock_imports = ["soundfile", "librosa"] diff --git a/docs/source/index.rst b/docs/source/index.rst index fc1649eb3..83474c528 100644 --- a/docs/source/index.rst +++ b/docs/source/index.rst @@ -64,3 +64,18 @@ Contents :caption: Acknowledgement asr/reference + + +.. toctree:: + :maxdepth: 2 + :caption: API Reference + + paddlespeech.audio + paddlespeech.cli + paddlespeech.cls + paddlespeech.kws + paddlespeech.s2t + paddlespeech.server + paddlespeech.t2s + paddlespeech.text + paddlespeech.vector diff --git a/docs/source/install.md b/docs/source/install.md index ac48d88ba..6a9ff3bc8 100644 --- a/docs/source/install.md +++ b/docs/source/install.md @@ -5,12 +5,12 @@ There are 3 ways to use `PaddleSpeech`. According to the degree of difficulty, t | Way | Function | Support| |:---- |:----------------------------------------------------------- |:----| | Easy | (1) Use command-line functions of PaddleSpeech.
(2) Experience PaddleSpeech on Ai Studio. | Linux, Mac(not support M1 chip),Windows ( For more information about installation, see [#1195](https://github.com/PaddlePaddle/PaddleSpeech/discussions/1195)) | -| Medium | Support major functions ,such as using the` ready-made `examples and using PaddleSpeech to train your model. | Linux | -| Hard | Support full function of Paddlespeech, including using join ctc decoder with kaldi, training n-gram language model, Montreal-Forced-Aligner, and so on. And you are more able to be a developer! | Ubuntu | +| Medium | Support major functions, such as using the `ready-made` examples and using PaddleSpeech to train your model. | Linux, Mac (does not support M1 chip or training models), Windows (does not support training models) | +| Hard | Support the full function of PaddleSpeech, including using the joint CTC decoder with Kaldi ([asr2](../../examples/librispeech/asr2)), training n-gram language models, Montreal-Forced-Aligner, and so on. And you can become a developer! | Ubuntu | ## Prerequisites - Python >= 3.7 -- PaddlePaddle latest version (please refer to the [Installation Guide] (https://www.paddlepaddle.org.cn/documentation/docs/en/beginners_guide/index_en.html)) +- PaddlePaddle latest version (please refer to the [Installation Guide](https://www.paddlepaddle.org.cn/documentation/docs/en/beginners_guide/index_en.html)) - C++ compilation environment - Hip: For Linux and Mac, do not use command `sh` instead of command `bash` in installation document. - Hip: We recommand you to install `paddlepaddle` from https://mirror.baidu.com/pypi/simple and install `paddlespeech` from https://pypi.tuna.tsinghua.edu.cn/simple. @@ -63,9 +63,9 @@ pip install paddlespeech -i https://pypi.tuna.tsinghua.edu.cn/simple ``` > If you encounter problem with downloading **nltk_data** while using paddlespeech, it maybe due to your poor network, we suggest you download the [nltk_data](https://paddlespeech.bj.bcebos.com/Parakeet/tools/nltk_data.tar.gz) provided by us, and extract it to your `${HOME}`. -> If you fail to install paddlespeech-ctcdecoders, it doesn't matter. +> If you fail to install paddlespeech-ctcdecoders, only deepspeech2 model inference becomes unavailable; other models are not affected. -## Medium: Get the Major Functions (Support Linux) +## Medium: Get the Major Functions (Supports Linux; Mac and Windows do not support training) If you want to get the major function of `paddlespeech`, you need to do following steps: ### Git clone PaddleSpeech You need to `git clone` this repository at first. @@ -75,7 +75,7 @@ cd PaddleSpeech ``` ### Install Conda -Conda is a management system of the environment. You can go to [minicoda](https://docs.conda.io/en/latest/miniconda.html) to select a version (py>=3.7) and install it by yourself or you can use the following command: +Conda is an environment management system. You can go to [miniconda](https://docs.conda.io/en/latest/miniconda.html) to select a version (py>=3.7). On Windows, you can follow the installation guide step by step; on Linux and Mac, you can use the following commands: ```bash # download the miniconda wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -P tools/ @@ -117,9 +117,9 @@ conda install -y -c gcc_linux-64=8.4.0 gxx_linux-64=8.4.0 ``` (Hip: Do not use the last script if you want to install by **Hard** way): ### Install PaddlePaddle -You can choose the `PaddlePaddle` version based on your system. 
For example, for CUDA 10.2, CuDNN7.5 install paddlepaddle-gpu 2.2.0: +You can choose the `PaddlePaddle` version based on your system. For example, for CUDA 10.2, CuDNN7.5 install paddlepaddle-gpu 2.3.1: ```bash -python3 -m pip install paddlepaddle-gpu==2.2.0 -i https://mirror.baidu.com/pypi/simple +python3 -m pip install paddlepaddle-gpu==2.3.1 -i https://mirror.baidu.com/pypi/simple ``` ### Install PaddleSpeech You can install `paddlespeech` by the following command,then you can use the `ready-made` examples in `paddlespeech` : @@ -180,9 +180,9 @@ Some users may fail to install `kaldiio` due to the default download source, you ```bash pip install pytest-runner -i https://pypi.tuna.tsinghua.edu.cn/simple ``` -Make sure you have GPU and the paddlepaddle version is right. For example, for CUDA 10.2, CuDNN7.5 install paddle 2.2.0: +Make sure you have GPU and the paddlepaddle version is right. For example, for CUDA 10.2, CuDNN7.5 install paddle 2.3.1: ```bash -python3 -m pip install paddlepaddle-gpu==2.2.0 -i https://mirror.baidu.com/pypi/simple +python3 -m pip install paddlepaddle-gpu==2.3.1 -i https://mirror.baidu.com/pypi/simple ``` ### Install PaddleSpeech in Developing Mode ```bash diff --git a/docs/source/install_cn.md b/docs/source/install_cn.md index 345e79bb5..9f49ebad6 100644 --- a/docs/source/install_cn.md +++ b/docs/source/install_cn.md @@ -4,8 +4,8 @@ | 方式 | 功能 | 支持系统 | | :--- | :----------------------------------------------------------- | :------------------ | | 简单 | (1) 使用 PaddleSpeech 的命令行功能.
(2) 在 Aistudio上体验 PaddleSpeech. | Linux, Mac(不支持M1芯片),Windows (安装详情查看[#1195](https://github.com/PaddlePaddle/PaddleSpeech/discussions/1195)) | -| 中等 | 支持 PaddleSpeech 主要功能,比如使用已有 examples 中的模型和使用 PaddleSpeech 来训练自己的模型. | Linux | -| 困难 | 支持 PaddleSpeech 的各项功能,包含结合kaldi使用 join ctc decoder 方式解码,训练语言模型,使用强制对齐等。并且你更能成为一名开发者! | Ubuntu | +| 中等 | 支持 PaddleSpeech 主要功能,比如使用已有 examples 中的模型和使用 PaddleSpeech 来训练自己的模型. | Linux, Mac(不支持M1芯片,不支持训练), Windows(不支持训练) | +| 困难 | 支持 PaddleSpeech 的各项功能,包含结合 kaldi 使用 join ctc decoder 方式解码 ([asr2](../../examples/librispeech/asr2 )),训练语言模型,使用强制对齐等。并且你更能成为一名开发者! | Ubuntu | ## 先决条件 - Python >= 3.7 - 最新版本的 PaddlePaddle (请看 [安装向导](https://www.paddlepaddle.org.cn/documentation/docs/en/beginners_guide/index_en.html)) @@ -60,9 +60,9 @@ pip install paddlespeech -i https://pypi.tuna.tsinghua.edu.cn/simple ``` > 如果您在使用 paddlespeech 的过程中遇到关于下载 **nltk_data** 的问题,可能是您的网络不佳,我们建议您下载我们提供的 [nltk_data](https://paddlespeech.bj.bcebos.com/Parakeet/tools/nltk_data.tar.gz) 并解压缩到您的 `${HOME}` 目录下。 -> 如果出现 paddlespeech-ctcdecoders 无法安装的问题,无须担心,这不影响使用。 +> 如果出现 paddlespeech-ctcdecoders 无法安装的问题,无须担心,这个只影响 deepspeech2 模型的推理,不影响其他模型的使用。 -## 中等: 获取主要功能(支持 Linux) +## 中等: 获取主要功能(支持 Linux, Mac 和 Windows 不支持训练) 如果你想要使用 `paddlespeech` 的主要功能。你需要完成以下几个步骤 ### Git clone PaddleSpeech 你需要先 git clone 本仓库 @@ -71,7 +71,7 @@ git clone https://github.com/PaddlePaddle/PaddleSpeech.git cd PaddleSpeech ``` ### 安装 Conda -Conda 是一个包管理的环境。你可以前往 [minicoda](https://docs.conda.io/en/latest/miniconda.html) 去下载并安装 conda(请下载 py>=3.7 的版本)。你可以尝试自己安装,或者使用以下的命令: +Conda 是一个包管理的环境。你可以前往 [minicoda](https://docs.conda.io/en/latest/miniconda.html) 去下载并安装 conda(请下载 py>=3.7 的版本)。windows 系统可以使用 conda 的向导安装,linux 和 mac 可以使用以下的命令: ```bash # 下载 miniconda wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -P tools/ @@ -111,14 +111,14 @@ conda install -y -c gcc_linux-64=8.4.0 gxx_linux-64=8.4.0 ``` (提示: 如果你想使用**困难**方式完成安装,请不要使用最后一条命令) ### 安装 PaddlePaddle -你可以根据系统配置选择 PaddlePaddle 版本,例如系统使用 CUDA 10.2, CuDNN7.5 ,你可以安装 paddlepaddle-gpu 2.2.0: +你可以根据系统配置选择 PaddlePaddle 版本,例如系统使用 CUDA 10.2, CuDNN7.5 ,你可以安装 paddlepaddle-gpu 2.3.1: ```bash -python3 -m pip install paddlepaddle-gpu==2.2.0 -i https://mirror.baidu.com/pypi/simple +python3 -m pip install paddlepaddle-gpu==2.3.1 -i https://mirror.baidu.com/pypi/simple ``` ### 安装 PaddleSpeech -最后安装 `paddlespeech`,这样你就可以使用 `paddlespeech`中已有的 examples: +最后安装 `paddlespeech`,这样你就可以使用 `paddlespeech` 中已有的 examples: ```bash -# 部分用户系统由于默认源的问题,安装中会出现kaldiio安转出错的问题,建议首先安装pytest-runner: +# 部分用户系统由于默认源的问题,安装中会出现 kaldiio 安转出错的问题,建议首先安装pytest-runner: pip install pytest-runner -i https://pypi.tuna.tsinghua.edu.cn/simple # 请确保目前处于PaddleSpeech项目的根目录 pip install . 
-i https://pypi.tuna.tsinghua.edu.cn/simple @@ -137,7 +137,7 @@ Docker 是一种开源工具,用于在和系统本身环境相隔离的环境 在 [Docker Hub](https://hub.docker.com/repository/docker/paddlecloud/paddlespeech) 中获取这些镜像及相应的使用指南,包括 CPU、GPU、ROCm 版本。 -如果您对自动化制作docker镜像感兴趣,或有自定义需求,请访问 [PaddlePaddle/PaddleCloud](https://github.com/PaddlePaddle/PaddleCloud/tree/main/tekton) 做进一步了解。 +如果您对自动化制作 docker 镜像感兴趣,或有自定义需求,请访问 [PaddlePaddle/PaddleCloud](https://github.com/PaddlePaddle/PaddleCloud/tree/main/tekton) 做进一步了解。 完成这些以后,你就可以在 docker 容器中执行训练、推理和超参 fine-tune。 ### 选择2: 使用有 root 权限的 Ubuntu - 使用apt安装 `build-essential` @@ -168,12 +168,12 @@ conda activate tools/venv conda install -y -c conda-forge sox libsndfile swig bzip2 libflac bc ``` ### 安装 PaddlePaddle -请确认你系统是否有 GPU,并且使用了正确版本的 paddlepaddle。例如系统使用 CUDA 10.2, CuDNN7.5 ,你可以安装 paddlepaddle-gpu 2.2.0: +请确认你系统是否有 GPU,并且使用了正确版本的 paddlepaddle。例如系统使用 CUDA 10.2, CuDNN7.5 ,你可以安装 paddlepaddle-gpu 2.3.1: ```bash -python3 -m pip install paddlepaddle-gpu==2.2.0 -i https://mirror.baidu.com/pypi/simple +python3 -m pip install paddlepaddle-gpu==2.3.1 -i https://mirror.baidu.com/pypi/simple ``` ### 用开发者模式安装 PaddleSpeech -部分用户系统由于默认源的问题,安装中会出现kaldiio安转出错的问题,建议首先安装pytest-runner: +部分用户系统由于默认源的问题,安装中会出现 kaldiio 安转出错的问题,建议首先安装 pytest-runner: ```bash pip install pytest-runner -i https://pypi.tuna.tsinghua.edu.cn/simple ``` diff --git a/docs/source/reference.md b/docs/source/reference.md index ed91c2066..0d36d96f7 100644 --- a/docs/source/reference.md +++ b/docs/source/reference.md @@ -40,3 +40,6 @@ We borrowed a lot of code from these repos to build `model` and `engine`, thanks * [ThreadPool](https://github.com/progschj/ThreadPool/blob/master/COPYING) - zlib License - ThreadPool + +* [g2pW](https://github.com/GitYCC/g2pW/blob/master/LICENCE) +- Apache-2.0 license diff --git a/docs/source/released_model.md b/docs/source/released_model.md index 5afd3c478..a1e3eb879 100644 --- a/docs/source/released_model.md +++ b/docs/source/released_model.md @@ -1,22 +1,21 @@ - # Released Models ## Speech-to-Text Models ### Speech Recognition Model -Acoustic Model | Training Data | Token-based | Size | Descriptions | CER | WER | Hours of speech | Example Link -:-------------:| :------------:| :-----: | -----: | :-----: |:-----:| :-----: | :-----: | :-----: -[Ds2 Online Wenetspeech ASR0 Model](https://paddlespeech.bj.bcebos.com/s2t/wenetspeech/asr0/asr0_deepspeech2_online_wenetspeech_ckpt_1.0.2.model.tar.gz) | Wenetspeech Dataset | Char-based | 1.2 GB | 2 Conv + 5 LSTM layers | 0.152 (test\_net, w/o LM)
0.2417 (test\_meeting, w/o LM)
0.053 (aishell, w/ LM) |-| 10000 h |- -[Ds2 Online Aishell ASR0 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_online_aishell_fbank161_ckpt_0.2.1.model.tar.gz) | Aishell Dataset | Char-based | 491 MB | 2 Conv + 5 LSTM layers | 0.0666 |-| 151 h | [D2 Online Aishell ASR0](../../examples/aishell/asr0) -[Ds2 Offline Aishell ASR0 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_offline_aishell_ckpt_1.0.1.model.tar.gz)| Aishell Dataset | Char-based | 1.4 GB | 2 Conv + 5 bidirectional LSTM layers| 0.0554 |-| 151 h | [Ds2 Offline Aishell ASR0](../../examples/aishell/asr0) -[Conformer Online Wenetspeech ASR1 Model](https://paddlespeech.bj.bcebos.com/s2t/wenetspeech/asr1/asr1_chunk_conformer_wenetspeech_ckpt_1.0.0a.model.tar.gz) | WenetSpeech Dataset | Char-based | 457 MB | Encoder:Conformer, Decoder:Transformer, Decoding method: Attention rescoring| 0.11 (test\_net) 0.1879 (test\_meeting) |-| 10000 h |- -[Conformer Online Aishell ASR1 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr1/asr1_chunk_conformer_aishell_ckpt_0.2.0.model.tar.gz) | Aishell Dataset | Char-based | 189 MB | Encoder:Conformer, Decoder:Transformer, Decoding method: Attention rescoring| 0.0544 |-| 151 h | [Conformer Online Aishell ASR1](../../examples/aishell/asr1) -[Conformer Offline Aishell ASR1 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr1/asr1_conformer_aishell_ckpt_0.1.2.model.tar.gz) | Aishell Dataset | Char-based | 189 MB | Encoder:Conformer, Decoder:Transformer, Decoding method: Attention rescoring | 0.0464 |-| 151 h | [Conformer Offline Aishell ASR1](../../examples/aishell/asr1) -[Transformer Aishell ASR1 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr1/asr1_transformer_aishell_ckpt_0.1.1.model.tar.gz) | Aishell Dataset | Char-based | 128 MB | Encoder:Transformer, Decoder:Transformer, Decoding method: Attention rescoring | 0.0523 || 151 h | [Transformer Aishell ASR1](../../examples/aishell/asr1) -[Ds2 Offline Librispeech ASR0 Model](https://paddlespeech.bj.bcebos.com/s2t/librispeech/asr0/asr0_deepspeech2_offline_librispeech_ckpt_1.0.1.model.tar.gz)| Librispeech Dataset | Char-based | 1.3 GB | 2 Conv + 5 bidirectional LSTM layers| - |0.0467| 960 h | [Ds2 Offline Librispeech ASR0](../../examples/librispeech/asr0) -[Conformer Librispeech ASR1 Model](https://paddlespeech.bj.bcebos.com/s2t/librispeech/asr1/asr1_conformer_librispeech_ckpt_0.1.1.model.tar.gz) | Librispeech Dataset | subword-based | 191 MB | Encoder:Conformer, Decoder:Transformer, Decoding method: Attention rescoring |-| 0.0338 | 960 h | [Conformer Librispeech ASR1](../../examples/librispeech/asr1) -[Transformer Librispeech ASR1 Model](https://paddlespeech.bj.bcebos.com/s2t/librispeech/asr1/asr1_transformer_librispeech_ckpt_0.1.1.model.tar.gz) | Librispeech Dataset | subword-based | 131 MB | Encoder:Transformer, Decoder:Transformer, Decoding method: Attention rescoring |-| 0.0381 | 960 h | [Transformer Librispeech ASR1](../../examples/librispeech/asr1) -[Transformer Librispeech ASR2 Model](https://paddlespeech.bj.bcebos.com/s2t/librispeech/asr2/asr2_transformer_librispeech_ckpt_0.1.1.model.tar.gz) | Librispeech Dataset | subword-based | 131 MB | Encoder:Transformer, Decoder:Transformer, Decoding method: JoinCTC w/ LM |-| 0.0240 | 960 h | [Transformer Librispeech ASR2](../../examples/librispeech/asr2) +Acoustic Model | Training Data | Token-based | Size | Descriptions | CER | WER | Hours of speech | Example Link | Inference Type | +:-------------:| :------------:| :-----: | 
-----: | :-----: |:-----:| :-----: | :-----: | :-----: | :-----: | +[Ds2 Online Wenetspeech ASR0 Model](https://paddlespeech.bj.bcebos.com/s2t/wenetspeech/asr0/asr0_deepspeech2_online_wenetspeech_ckpt_1.0.4.model.tar.gz) | Wenetspeech Dataset | Char-based | 1.2 GB | 2 Conv + 5 LSTM layers | 0.152 (test\_net, w/o LM)
0.2417 (test\_meeting, w/o LM)
0.053 (aishell, w/ LM) |-| 10000 h | - | onnx/inference/python | +[Ds2 Online Aishell ASR0 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_online_aishell_fbank161_ckpt_0.2.1.model.tar.gz) | Aishell Dataset | Char-based | 491 MB | 2 Conv + 5 LSTM layers | 0.0666 |-| 151 h | [D2 Online Aishell ASR0](../../examples/aishell/asr0) | onnx/inference/python | +[Ds2 Offline Aishell ASR0 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_offline_aishell_ckpt_1.0.1.model.tar.gz)| Aishell Dataset | Char-based | 1.4 GB | 2 Conv + 5 bidirectional LSTM layers| 0.0554 |-| 151 h | [Ds2 Offline Aishell ASR0](../../examples/aishell/asr0) | inference/python | +[Conformer Online Wenetspeech ASR1 Model](https://paddlespeech.bj.bcebos.com/s2t/wenetspeech/asr1/asr1_chunk_conformer_wenetspeech_ckpt_1.0.0a.model.tar.gz) | WenetSpeech Dataset | Char-based | 457 MB | Encoder:Conformer, Decoder:Transformer, Decoding method: Attention rescoring| 0.11 (test\_net) 0.1879 (test\_meeting) |-| 10000 h |- | python | +[Conformer Online Aishell ASR1 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr1/asr1_chunk_conformer_aishell_ckpt_0.2.0.model.tar.gz) | Aishell Dataset | Char-based | 189 MB | Encoder:Conformer, Decoder:Transformer, Decoding method: Attention rescoring| 0.0544 |-| 151 h | [Conformer Online Aishell ASR1](../../examples/aishell/asr1) | python | +[Conformer Offline Aishell ASR1 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr1/asr1_conformer_aishell_ckpt_1.0.1.model.tar.gz) | Aishell Dataset | Char-based | 189 MB | Encoder:Conformer, Decoder:Transformer, Decoding method: Attention rescoring | 0.0460 |-| 151 h | [Conformer Offline Aishell ASR1](../../examples/aishell/asr1) | python | +[Transformer Aishell ASR1 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr1/asr1_transformer_aishell_ckpt_0.1.1.model.tar.gz) | Aishell Dataset | Char-based | 128 MB | Encoder:Transformer, Decoder:Transformer, Decoding method: Attention rescoring | 0.0523 || 151 h | [Transformer Aishell ASR1](../../examples/aishell/asr1) | python | +[Ds2 Offline Librispeech ASR0 Model](https://paddlespeech.bj.bcebos.com/s2t/librispeech/asr0/asr0_deepspeech2_offline_librispeech_ckpt_1.0.1.model.tar.gz)| Librispeech Dataset | Char-based | 1.3 GB | 2 Conv + 5 bidirectional LSTM layers| - |0.0467| 960 h | [Ds2 Offline Librispeech ASR0](../../examples/librispeech/asr0) | inference/python | +[Conformer Librispeech ASR1 Model](https://paddlespeech.bj.bcebos.com/s2t/librispeech/asr1/asr1_conformer_librispeech_ckpt_0.1.1.model.tar.gz) | Librispeech Dataset | subword-based | 191 MB | Encoder:Conformer, Decoder:Transformer, Decoding method: Attention rescoring |-| 0.0338 | 960 h | [Conformer Librispeech ASR1](../../examples/librispeech/asr1) | python | +[Transformer Librispeech ASR1 Model](https://paddlespeech.bj.bcebos.com/s2t/librispeech/asr1/asr1_transformer_librispeech_ckpt_0.1.1.model.tar.gz) | Librispeech Dataset | subword-based | 131 MB | Encoder:Transformer, Decoder:Transformer, Decoding method: Attention rescoring |-| 0.0381 | 960 h | [Transformer Librispeech ASR1](../../examples/librispeech/asr1) | python | +[Transformer Librispeech ASR2 Model](https://paddlespeech.bj.bcebos.com/s2t/librispeech/asr2/asr2_transformer_librispeech_ckpt_0.1.1.model.tar.gz) | Librispeech Dataset | subword-based | 131 MB | Encoder:Transformer, Decoder:Transformer, Decoding method: JoinCTC w/ LM |-| 0.0240 | 960 h | [Transformer Librispeech ASR2](../../examples/librispeech/asr2) | python | ### 
Language Model based on NGram Language Model | Training Data | Token-based | Size | Descriptions @@ -34,32 +33,33 @@ Language Model | Training Data | Token-based | Size | Descriptions ## Text-to-Speech Models ### Acoustic Models -Model Type | Dataset| Example Link | Pretrained Models|Static Models|Size (static) +Model Type | Dataset| Example Link | Pretrained Models|Static/ONNX Models|Size (static) :-------------:| :------------:| :-----: | :-----:| :-----:| :-----: Tacotron2|LJSpeech|[tacotron2-ljspeech](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/ljspeech/tts0)|[tacotron2_ljspeech_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/tacotron2/tacotron2_ljspeech_ckpt_0.2.0.zip)||| Tacotron2|CSMSC|[tacotron2-csmsc](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/tts0)|[tacotron2_csmsc_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/tacotron2/tacotron2_csmsc_ckpt_0.2.0.zip)|[tacotron2_csmsc_static_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/tacotron2/tacotron2_csmsc_static_0.2.0.zip)|103MB| TransformerTTS| LJSpeech| [transformer-ljspeech](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/ljspeech/tts1)|[transformer_tts_ljspeech_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/transformer_tts/transformer_tts_ljspeech_ckpt_0.4.zip)||| -SpeedySpeech| CSMSC | [speedyspeech-csmsc](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/tts2)|[speedyspeech_csmsc_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/speedyspeech/speedyspeech_csmsc_ckpt_0.2.0.zip)|[speedyspeech_csmsc_static_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/speedyspeech/speedyspeech_csmsc_static_0.2.0.zip)|12MB| -FastSpeech2| CSMSC |[fastspeech2-csmsc](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/tts3)|[fastspeech2_nosil_baker_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_baker_ckpt_0.4.zip)|[fastspeech2_csmsc_static_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_csmsc_static_0.2.0.zip)|157MB| +SpeedySpeech| CSMSC | [speedyspeech-csmsc](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/tts2)|[speedyspeech_csmsc_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/speedyspeech/speedyspeech_csmsc_ckpt_0.2.0.zip)|[speedyspeech_csmsc_static_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/speedyspeech/speedyspeech_csmsc_static_0.2.0.zip)
[speedyspeech_csmsc_onnx_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/speedyspeech/speedyspeech_csmsc_onnx_0.2.0.zip)|13MB| +FastSpeech2| CSMSC |[fastspeech2-csmsc](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/tts3)|[fastspeech2_nosil_baker_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_baker_ckpt_0.4.zip)|[fastspeech2_csmsc_static_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_csmsc_static_0.2.0.zip)
[fastspeech2_csmsc_onnx_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_csmsc_onnx_0.2.0.zip)|157MB| FastSpeech2-Conformer| CSMSC |[fastspeech2-csmsc](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/tts3)|[fastspeech2_conformer_baker_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_conformer_baker_ckpt_0.5.zip)||| -FastSpeech2| AISHELL-3 |[fastspeech2-aishell3](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/tts3)|[fastspeech2_nosil_aishell3_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_aishell3_ckpt_0.4.zip)||| -FastSpeech2| LJSpeech |[fastspeech2-ljspeech](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/ljspeech/tts3)|[fastspeech2_nosil_ljspeech_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_ljspeech_ckpt_0.5.zip)||| -FastSpeech2| VCTK |[fastspeech2-vctk](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/vctk/tts3)|[fastspeech2_nosil_vctk_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_vctk_ckpt_0.5.zip)||| +FastSpeech2-CNNDecoder| CSMSC| [fastspeech2-csmsc](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/tts3)| [fastspeech2_cnndecoder_csmsc_ckpt_1.0.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_cnndecoder_csmsc_ckpt_1.0.0.zip) | [fastspeech2_cnndecoder_csmsc_static_1.0.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_cnndecoder_csmsc_static_1.0.0.zip)
[fastspeech2_cnndecoder_csmsc_streaming_static_1.0.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_cnndecoder_csmsc_streaming_static_1.0.0.zip)
[fastspeech2_cnndecoder_csmsc_onnx_1.0.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_cnndecoder_csmsc_onnx_1.0.0.zip)
[fastspeech2_cnndecoder_csmsc_streaming_onnx_1.0.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_cnndecoder_csmsc_streaming_onnx_1.0.0.zip) | 84MB| +FastSpeech2| AISHELL-3 |[fastspeech2-aishell3](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/tts3)|[fastspeech2_nosil_aishell3_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_aishell3_ckpt_0.4.zip)|[fastspeech2_aishell3_static_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_aishell3_static_1.1.0.zip)
[fastspeech2_aishell3_onnx_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_aishell3_onnx_1.1.0.zip)|147MB| +FastSpeech2| LJSpeech |[fastspeech2-ljspeech](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/ljspeech/tts3)|[fastspeech2_nosil_ljspeech_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_ljspeech_ckpt_0.5.zip)|[fastspeech2_ljspeech_static_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_ljspeech_static_1.1.0.zip)
[fastspeech2_ljspeech_onnx_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_ljspeech_onnx_1.1.0.zip)|145MB| +FastSpeech2| VCTK |[fastspeech2-vctk](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/vctk/tts3)|[fastspeech2_nosil_vctk_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_vctk_ckpt_0.5.zip)|[fastspeech2_vctk_static_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_vctk_static_1.1.0.zip)
[fastspeech2_vctk_onnx_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_vctk_onnx_1.1.0.zip) | 145MB| ### Vocoders -Model Type | Dataset| Example Link | Pretrained Models| Static Models|Size (static) +Model Type | Dataset| Example Link | Pretrained Models| Static/ONNX Models|Size (static) :-----:| :-----:| :-----: | :-----:| :-----:| :-----: WaveFlow| LJSpeech |[waveflow-ljspeech](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/ljspeech/voc0)|[waveflow_ljspeech_ckpt_0.3.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/waveflow/waveflow_ljspeech_ckpt_0.3.zip)||| -Parallel WaveGAN| CSMSC |[PWGAN-csmsc](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/voc1)|[pwg_baker_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_baker_ckpt_0.4.zip)|[pwg_baker_static_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_baker_static_0.4.zip)|5.1MB| -Parallel WaveGAN| LJSpeech |[PWGAN-ljspeech](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/ljspeech/voc1)|[pwg_ljspeech_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_ljspeech_ckpt_0.5.zip)||| -Parallel WaveGAN| AISHELL-3 |[PWGAN-aishell3](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/voc1)|[pwg_aishell3_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_aishell3_ckpt_0.5.zip)||| -Parallel WaveGAN| VCTK |[PWGAN-vctk](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/vctk/voc1)|[pwg_vctk_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_vctk_ckpt_0.5.zip)||| -|Multi Band MelGAN | CSMSC |[MB MelGAN-csmsc](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/voc3) | [mb_melgan_csmsc_ckpt_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/mb_melgan/mb_melgan_csmsc_ckpt_0.1.1.zip)
[mb_melgan_baker_finetune_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/mb_melgan/mb_melgan_baker_finetune_ckpt_0.5.zip)|[mb_melgan_csmsc_static_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/mb_melgan/mb_melgan_csmsc_static_0.1.1.zip) |8.2MB| +Parallel WaveGAN| CSMSC |[PWGAN-csmsc](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/voc1)|[pwg_baker_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_baker_ckpt_0.4.zip)|[pwg_baker_static_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_baker_static_0.4.zip)
[pwgan_csmsc_onnx_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwgan_csmsc_onnx_0.2.0.zip)|4.8MB| +Parallel WaveGAN| LJSpeech |[PWGAN-ljspeech](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/ljspeech/voc1)|[pwg_ljspeech_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_ljspeech_ckpt_0.5.zip)|[pwgan_ljspeech_static_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwgan_ljspeech_static_1.1.0.zip)
[pwgan_ljspeech_onnx_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwgan_ljspeech_onnx_1.1.0.zip)|4.8MB| +Parallel WaveGAN| AISHELL-3 |[PWGAN-aishell3](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/voc1)|[pwg_aishell3_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_aishell3_ckpt_0.5.zip)| [pwgan_aishell3_static_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwgan_aishell3_static_1.1.0.zip)
[pwgan_aishell3_onnx_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwgan_aishell3_onnx_1.1.0.zip)|4.8MB| +Parallel WaveGAN| VCTK |[PWGAN-vctk](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/vctk/voc1)|[pwg_vctk_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_vctk_ckpt_0.5.zip)|[pwgan_vctk_static_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwgan_vctk_static_1.1.0.zip)
[pwgan_vctk_onnx_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwgan_vctk_onnx_1.1.0.zip)|4.8MB| +|Multi Band MelGAN | CSMSC |[MB MelGAN-csmsc](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/voc3) | [mb_melgan_csmsc_ckpt_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/mb_melgan/mb_melgan_csmsc_ckpt_0.1.1.zip)
[mb_melgan_baker_finetune_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/mb_melgan/mb_melgan_baker_finetune_ckpt_0.5.zip)|[mb_melgan_csmsc_static_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/mb_melgan/mb_melgan_csmsc_static_0.1.1.zip)
[mb_melgan_csmsc_onnx_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/mb_melgan/mb_melgan_csmsc_onnx_0.2.0.zip)|7.6MB| Style MelGAN | CSMSC |[Style MelGAN-csmsc](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/voc4)|[style_melgan_csmsc_ckpt_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/style_melgan/style_melgan_csmsc_ckpt_0.1.1.zip)| | | -HiFiGAN | CSMSC |[HiFiGAN-csmsc](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/voc5)|[hifigan_csmsc_ckpt_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_csmsc_ckpt_0.1.1.zip)|[hifigan_csmsc_static_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_csmsc_static_0.1.1.zip)|50MB| -HiFiGAN | LJSpeech |[HiFiGAN-ljspeech](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/ljspeech/voc5)|[hifigan_ljspeech_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_ljspeech_ckpt_0.2.0.zip)||| -HiFiGAN | AISHELL-3 |[HiFiGAN-aishell3](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/voc5)|[hifigan_aishell3_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_aishell3_ckpt_0.2.0.zip)||| -HiFiGAN | VCTK |[HiFiGAN-vctk](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/vctk/voc5)|[hifigan_vctk_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_vctk_ckpt_0.2.0.zip)||| +HiFiGAN | CSMSC |[HiFiGAN-csmsc](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/voc5)|[hifigan_csmsc_ckpt_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_csmsc_ckpt_0.1.1.zip)|[hifigan_csmsc_static_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_csmsc_static_0.1.1.zip)
[hifigan_csmsc_onnx_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_csmsc_onnx_0.2.0.zip)|46MB| +HiFiGAN | LJSpeech |[HiFiGAN-ljspeech](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/ljspeech/voc5)|[hifigan_ljspeech_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_ljspeech_ckpt_0.2.0.zip)|[hifigan_ljspeech_static_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_ljspeech_static_1.1.0.zip)
[hifigan_ljspeech_onnx_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_ljspeech_onnx_1.1.0.zip) |49MB| +HiFiGAN | AISHELL-3 |[HiFiGAN-aishell3](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/voc5)|[hifigan_aishell3_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_aishell3_ckpt_0.2.0.zip)|[hifigan_aishell3_static_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_aishell3_static_1.1.0.zip)
[hifigan_aishell3_onnx_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_aishell3_onnx_1.1.0.zip)|46MB| +HiFiGAN | VCTK |[HiFiGAN-vctk](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/vctk/voc5)|[hifigan_vctk_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_vctk_ckpt_0.2.0.zip)|[hifigan_vctk_static_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_vctk_static_1.1.0.zip)
[hifigan_vctk_onnx_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_vctk_onnx_1.1.0.zip)|46MB| WaveRNN | CSMSC |[WaveRNN-csmsc](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/voc6)|[wavernn_csmsc_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/wavernn/wavernn_csmsc_ckpt_0.2.0.zip)|[wavernn_csmsc_static_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/wavernn/wavernn_csmsc_static_0.2.0.zip)|18MB| diff --git a/docs/topic/package_release/python_package_release.md b/docs/topic/package_release/python_package_release.md index 3e3f9dbf6..cb1029e7b 100644 --- a/docs/topic/package_release/python_package_release.md +++ b/docs/topic/package_release/python_package_release.md @@ -150,7 +150,7 @@ manylinux1 支持 Centos5以上, manylinux2010 支持 Centos 6 以上,manyli ### 拉取 manylinux2010 ```bash -docker pull quay.io/pypa/manylinux1_x86_64 +docker pull quay.io/pypa/manylinux2010_x86_64 ``` ### 使用 manylinux2010 diff --git a/examples/aishell/asr1/README.md b/examples/aishell/asr1/README.md index 25b28ede8..a7390fd68 100644 --- a/examples/aishell/asr1/README.md +++ b/examples/aishell/asr1/README.md @@ -1,5 +1,5 @@ # Transformer/Conformer ASR with Aishell -This example contains code used to train a Transformer or [Conformer](http://arxiv.org/abs/2008.03802) model with [Aishell dataset](http://www.openslr.org/resources/33) +This example contains code used to train a [u2](https://arxiv.org/pdf/2012.05481.pdf) model (Transformer or [Conformer](https://arxiv.org/pdf/2005.08100.pdf) model) with [Aishell dataset](http://www.openslr.org/resources/33) ## Overview All the scripts you need are in `run.sh`. There are several stages in `run.sh`, and each stage has its function. 
| Stage | Function | diff --git a/examples/aishell/asr1/RESULTS.md b/examples/aishell/asr1/RESULTS.md index f16d423a2..79c695b1b 100644 --- a/examples/aishell/asr1/RESULTS.md +++ b/examples/aishell/asr1/RESULTS.md @@ -2,13 +2,13 @@ ## Conformer paddle version: 2.2.2 -paddlespeech version: 0.2.0 +paddlespeech version: 1.0.1 | Model | Params | Config | Augmentation| Test set | Decode method | Loss | CER | | --- | --- | --- | --- | --- | --- | --- | --- | -| conformer | 47.07M | conf/conformer.yaml | spec_aug | test | attention | - | 0.0530 | -| conformer | 47.07M | conf/conformer.yaml | spec_aug | test | ctc_greedy_search | - | 0.0495 | -| conformer | 47.07M | conf/conformer.yaml | spec_aug| test | ctc_prefix_beam_search | - | 0.0494 | -| conformer | 47.07M | conf/conformer.yaml | spec_aug | test | attention_rescoring | - | 0.0464 | +| conformer | 47.07M | conf/conformer.yaml | spec_aug | test | attention | - | 0.0522 | +| conformer | 47.07M | conf/conformer.yaml | spec_aug | test | ctc_greedy_search | - | 0.0481 | +| conformer | 47.07M | conf/conformer.yaml | spec_aug| test | ctc_prefix_beam_search | - | 0.0480 | +| conformer | 47.07M | conf/conformer.yaml | spec_aug | test | attention_rescoring | - | 0.0460 | ## Conformer Streaming diff --git a/examples/aishell/asr1/conf/conformer.yaml b/examples/aishell/asr1/conf/conformer.yaml index 2419d07a4..0d12a9ef8 100644 --- a/examples/aishell/asr1/conf/conformer.yaml +++ b/examples/aishell/asr1/conf/conformer.yaml @@ -57,7 +57,7 @@ feat_dim: 80 stride_ms: 10.0 window_ms: 25.0 sortagrad: 0 # Feed samples from shortest to longest ; -1: enabled for all epochs, 0: disabled, other: enabled for 'other' epochs -batch_size: 64 +batch_size: 32 maxlen_in: 512 # if input length > maxlen-in, batchsize is automatically reduced maxlen_out: 150 # if output length > maxlen-out, batchsize is automatically reduced minibatches: 0 # for debug @@ -73,10 +73,10 @@ num_encs: 1 ########################################### # Training # ########################################### -n_epoch: 240 -accum_grad: 2 +n_epoch: 150 +accum_grad: 8 global_grad_clip: 5.0 -dist_sampler: True +dist_sampler: False optim: adam optim_conf: lr: 0.002 diff --git a/examples/aishell3/ernie_sat/README.md b/examples/aishell3/ernie_sat/README.md new file mode 100644 index 000000000..8086d007c --- /dev/null +++ b/examples/aishell3/ernie_sat/README.md @@ -0,0 +1 @@ +# ERNIE SAT with AISHELL3 dataset diff --git a/examples/aishell3/ernie_sat/conf/default.yaml b/examples/aishell3/ernie_sat/conf/default.yaml new file mode 100644 index 000000000..fdc767fb0 --- /dev/null +++ b/examples/aishell3/ernie_sat/conf/default.yaml @@ -0,0 +1,283 @@ +########################################################### +# FEATURE EXTRACTION SETTING # +########################################################### + +fs: 24000 # sr +n_fft: 2048 # FFT size (samples). +n_shift: 300 # Hop size (samples). 12.5ms +win_length: 1200 # Window length (samples). 50ms + # If set to null, it will be the same as fft_size. +window: "hann" # Window function. + +# Only used for feats_type != raw + +fmin: 80 # Minimum frequency of Mel basis. +fmax: 7600 # Maximum frequency of Mel basis. +n_mels: 80 # The number of mel basis. 
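+# (editorial note) the frame settings above are consistent with fs = 24000 Hz:
+# n_shift / fs = 300 / 24000 = 12.5 ms and win_length / fs = 1200 / 24000 = 50 ms.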
+ +mean_phn_span: 8 +mlm_prob: 0.8 + +########################################################### +# DATA SETTING # +########################################################### +batch_size: 20 +num_workers: 2 + +########################################################### +# MODEL SETTING # +########################################################### +model: + text_masking: false + postnet_layers: 5 + postnet_filts: 5 + postnet_chans: 256 + encoder_type: conformer + decoder_type: conformer + enc_input_layer: sega_mlm + enc_pre_speech_layer: 0 + enc_cnn_module_kernel: 7 + enc_attention_dim: 384 + enc_attention_heads: 2 + enc_linear_units: 1536 + enc_num_blocks: 4 + enc_dropout_rate: 0.2 + enc_positional_dropout_rate: 0.2 + enc_attention_dropout_rate: 0.2 + enc_normalize_before: true + enc_macaron_style: true + enc_use_cnn_module: true + enc_selfattention_layer_type: legacy_rel_selfattn + enc_activation_type: swish + enc_pos_enc_layer_type: legacy_rel_pos + enc_positionwise_layer_type: conv1d + enc_positionwise_conv_kernel_size: 3 + dec_cnn_module_kernel: 31 + dec_attention_dim: 384 + dec_attention_heads: 2 + dec_linear_units: 1536 + dec_num_blocks: 4 + dec_dropout_rate: 0.2 + dec_positional_dropout_rate: 0.2 + dec_attention_dropout_rate: 0.2 + dec_macaron_style: true + dec_use_cnn_module: true + dec_selfattention_layer_type: legacy_rel_selfattn + dec_activation_type: swish + dec_pos_enc_layer_type: legacy_rel_pos + dec_positionwise_layer_type: conv1d + dec_positionwise_conv_kernel_size: 3 + +########################################################### +# OPTIMIZER SETTING # +########################################################### +scheduler_params: + d_model: 384 + warmup_steps: 4000 +grad_clip: 1.0 + +########################################################### +# TRAINING SETTING # +########################################################### +max_epoch: 1500 +num_snapshots: 50 + +########################################################### +# OTHER SETTING # +########################################################### +seed: 0 + +token_list: +- +- +- d +- sp +- sh +- ii +- j +- zh +- l +- x +- b +- g +- uu +- e5 +- h +- q +- m +- i1 +- t +- z +- ch +- f +- s +- u4 +- ix4 +- i4 +- n +- i3 +- iu3 +- vv +- ian4 +- ix2 +- r +- e4 +- ai4 +- k +- ing2 +- a1 +- en2 +- ui4 +- ong1 +- uo3 +- u2 +- u3 +- ao4 +- ee +- p +- an1 +- eng2 +- i2 +- in1 +- c +- ai2 +- ian2 +- e2 +- an4 +- ing4 +- v4 +- ai3 +- a5 +- ian3 +- eng1 +- ong4 +- ang4 +- ian1 +- ing1 +- iy4 +- ao3 +- ang1 +- uo4 +- u1 +- iao4 +- iu4 +- a4 +- van2 +- ie4 +- ang2 +- ou4 +- iang4 +- ix1 +- er4 +- iy1 +- e1 +- en1 +- ui2 +- an3 +- ei4 +- ong2 +- uo1 +- ou3 +- uo2 +- iao1 +- ou1 +- an2 +- uan4 +- ia4 +- ia1 +- ang3 +- v3 +- iu2 +- iao3 +- in4 +- a3 +- ei3 +- iang3 +- v2 +- eng4 +- en3 +- aa +- uan1 +- v1 +- ao1 +- ve4 +- ie3 +- ai1 +- ing3 +- iang1 +- a2 +- ui1 +- en4 +- en5 +- in3 +- uan3 +- e3 +- ie1 +- ve2 +- ei2 +- in2 +- ix3 +- uan2 +- iang2 +- ie2 +- ua4 +- ou2 +- uai4 +- er2 +- eng3 +- uang3 +- un1 +- ong3 +- uang4 +- vn4 +- un2 +- iy3 +- iz4 +- ui3 +- iao2 +- iong4 +- un4 +- van4 +- ao2 +- uang1 +- iy5 +- o2 +- ei1 +- ua1 +- iu1 +- uang2 +- er5 +- o1 +- un3 +- vn1 +- vn2 +- o4 +- ve1 +- van3 +- ua2 +- er3 +- iong3 +- van1 +- ia2 +- iy2 +- ia3 +- iong1 +- uo5 +- oo +- ve3 +- ou5 +- uai3 +- ian5 +- iong2 +- uai2 +- uai1 +- ua3 +- vn3 +- ia5 +- ie5 +- ueng1 +- o5 +- o3 +- iang5 +- ei5 +- \ No newline at end of file diff --git a/examples/aishell3/ernie_sat/local/preprocess.sh b/examples/aishell3/ernie_sat/local/preprocess.sh new 
file mode 100755 index 000000000..d7a91d08f --- /dev/null +++ b/examples/aishell3/ernie_sat/local/preprocess.sh @@ -0,0 +1,61 @@ +#!/bin/bash + +stage=0 +stop_stage=100 + +config_path=$1 + +if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then + # get durations from MFA's result + echo "Generate durations.txt from MFA results ..." + python3 ${MAIN_ROOT}/utils/gen_duration_from_textgrid.py \ + --inputdir=./aishell3_alignment_tone \ + --output durations.txt \ + --config=${config_path} +fi + +if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then + # extract features + echo "Extract features ..." + python3 ${BIN_DIR}/preprocess.py \ + --dataset=aishell3 \ + --rootdir=~/datasets/data_aishell3/ \ + --dumpdir=dump \ + --dur-file=durations.txt \ + --config=${config_path} \ + --num-cpu=20 \ + --cut-sil=True +fi + +if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then + # get features' stats(mean and std) + echo "Get features' stats ..." + python3 ${MAIN_ROOT}/utils/compute_statistics.py \ + --metadata=dump/train/raw/metadata.jsonl \ + --field-name="speech" +fi + +if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then + # normalize and covert phone/speaker to id, dev and test should use train's stats + echo "Normalize ..." + python3 ${BIN_DIR}/normalize.py \ + --metadata=dump/train/raw/metadata.jsonl \ + --dumpdir=dump/train/norm \ + --speech-stats=dump/train/speech_stats.npy \ + --phones-dict=dump/phone_id_map.txt \ + --speaker-dict=dump/speaker_id_map.txt + + python3 ${BIN_DIR}/normalize.py \ + --metadata=dump/dev/raw/metadata.jsonl \ + --dumpdir=dump/dev/norm \ + --speech-stats=dump/train/speech_stats.npy \ + --phones-dict=dump/phone_id_map.txt \ + --speaker-dict=dump/speaker_id_map.txt + + python3 ${BIN_DIR}/normalize.py \ + --metadata=dump/test/raw/metadata.jsonl \ + --dumpdir=dump/test/norm \ + --speech-stats=dump/train/speech_stats.npy \ + --phones-dict=dump/phone_id_map.txt \ + --speaker-dict=dump/speaker_id_map.txt +fi diff --git a/examples/aishell3/ernie_sat/local/synthesize.sh b/examples/aishell3/ernie_sat/local/synthesize.sh new file mode 100755 index 000000000..3e907427c --- /dev/null +++ b/examples/aishell3/ernie_sat/local/synthesize.sh @@ -0,0 +1,42 @@ +#!/bin/bash + +config_path=$1 +train_output_path=$2 +ckpt_name=$3 + +stage=1 +stop_stage=1 + +# pwgan +if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then + FLAGS_allocator_strategy=naive_best_fit \ + FLAGS_fraction_of_gpu_memory_to_use=0.01 \ + python3 ${BIN_DIR}/synthesize.py \ + --erniesat_config=${config_path} \ + --erniesat_ckpt=${train_output_path}/checkpoints/${ckpt_name} \ + --erniesat_stat=dump/train/speech_stats.npy \ + --voc=pwgan_aishell3 \ + --voc_config=pwg_aishell3_ckpt_0.5/default.yaml \ + --voc_ckpt=pwg_aishell3_ckpt_0.5/snapshot_iter_1000000.pdz \ + --voc_stat=pwg_aishell3_ckpt_0.5/feats_stats.npy \ + --test_metadata=dump/test/norm/metadata.jsonl \ + --output_dir=${train_output_path}/test \ + --phones_dict=dump/phone_id_map.txt +fi + +# hifigan +if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then + FLAGS_allocator_strategy=naive_best_fit \ + FLAGS_fraction_of_gpu_memory_to_use=0.01 \ + python3 ${BIN_DIR}/synthesize.py \ + --erniesat_config=${config_path} \ + --erniesat_ckpt=${train_output_path}/checkpoints/${ckpt_name} \ + --erniesat_stat=dump/train/speech_stats.npy \ + --voc=hifigan_aishell3 \ + --voc_config=hifigan_aishell3_ckpt_0.2.0/default.yaml \ + --voc_ckpt=hifigan_aishell3_ckpt_0.2.0/snapshot_iter_2500000.pdz \ + --voc_stat=hifigan_aishell3_ckpt_0.2.0/feats_stats.npy \ + 
--test_metadata=dump/test/norm/metadata.jsonl \ + --output_dir=${train_output_path}/test \ + --phones_dict=dump/phone_id_map.txt +fi diff --git a/examples/aishell3/ernie_sat/local/train.sh b/examples/aishell3/ernie_sat/local/train.sh new file mode 100755 index 000000000..30720e8f5 --- /dev/null +++ b/examples/aishell3/ernie_sat/local/train.sh @@ -0,0 +1,12 @@ +#!/bin/bash + +config_path=$1 +train_output_path=$2 + +python3 ${BIN_DIR}/train.py \ + --train-metadata=dump/train/norm/metadata.jsonl \ + --dev-metadata=dump/dev/norm/metadata.jsonl \ + --config=${config_path} \ + --output-dir=${train_output_path} \ + --ngpu=2 \ + --phones-dict=dump/phone_id_map.txt \ No newline at end of file diff --git a/examples/aishell3/ernie_sat/path.sh b/examples/aishell3/ernie_sat/path.sh new file mode 100755 index 000000000..4ecab0251 --- /dev/null +++ b/examples/aishell3/ernie_sat/path.sh @@ -0,0 +1,13 @@ +#!/bin/bash +export MAIN_ROOT=`realpath ${PWD}/../../../` + +export PATH=${MAIN_ROOT}:${MAIN_ROOT}/utils:${PATH} +export LC_ALL=C + +export PYTHONDONTWRITEBYTECODE=1 +# Use UTF-8 in Python to avoid UnicodeDecodeError when LC_ALL=C +export PYTHONIOENCODING=UTF-8 +export PYTHONPATH=${MAIN_ROOT}:${PYTHONPATH} + +MODEL=ernie_sat +export BIN_DIR=${MAIN_ROOT}/paddlespeech/t2s/exps/${MODEL} \ No newline at end of file diff --git a/examples/aishell3/ernie_sat/run.sh b/examples/aishell3/ernie_sat/run.sh new file mode 100755 index 000000000..d75a19f23 --- /dev/null +++ b/examples/aishell3/ernie_sat/run.sh @@ -0,0 +1,32 @@ +#!/bin/bash + +set -e +source path.sh + +gpus=0,1 +stage=0 +stop_stage=100 + +conf_path=conf/default.yaml +train_output_path=exp/default +ckpt_name=snapshot_iter_153.pdz + +# with the following command, you can choose the stage range you want to run +# such as `./run.sh --stage 0 --stop-stage 0` +# this can not be mixed use with `$1`, `$2` ... 
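+# e.g. to run only the training stage (stage 1 below) once the data has been preprocessed:
+# ./run.sh --stage 1 --stop-stage 1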
+source ${MAIN_ROOT}/utils/parse_options.sh || exit 1 + +if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then + # prepare data + ./local/preprocess.sh ${conf_path} || exit -1 +fi + +if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then + # train model, all `ckpt` under `train_output_path/checkpoints/` dir + CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path} || exit -1 +fi + +if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then + # synthesize, vocoder is pwgan + CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name} || exit -1 +fi diff --git a/examples/aishell3/tts3/README.md b/examples/aishell3/tts3/README.md index 31c99898c..21bad51ec 100644 --- a/examples/aishell3/tts3/README.md +++ b/examples/aishell3/tts3/README.md @@ -220,6 +220,12 @@ Pretrained FastSpeech2 model with no silence in the edge of audios: - [fastspeech2_nosil_aishell3_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_aishell3_ckpt_0.4.zip) - [fastspeech2_conformer_aishell3_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_conformer_aishell3_ckpt_0.2.0.zip) (Thanks for [@awmmmm](https://github.com/awmmmm)'s contribution) +The static model can be downloaded here: +- [fastspeech2_aishell3_static_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_aishell3_static_1.1.0.zip) + +The ONNX model can be downloaded here: +- [fastspeech2_aishell3_onnx_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_aishell3_onnx_1.1.0.zip) + FastSpeech2 checkpoint contains files listed below. ```text diff --git a/examples/aishell3/tts3/conf/conformer.yaml b/examples/aishell3/tts3/conf/conformer.yaml index ea73593d7..0834bfe3f 100644 --- a/examples/aishell3/tts3/conf/conformer.yaml +++ b/examples/aishell3/tts3/conf/conformer.yaml @@ -94,8 +94,8 @@ updater: # OPTIMIZER SETTING # ########################################################### optimizer: - optim: adam # optimizer type - learning_rate: 0.001 # learning rate + optim: adam # optimizer type + learning_rate: 0.001 # learning rate ########################################################### # TRAINING SETTING # diff --git a/examples/aishell3/tts3/conf/default.yaml b/examples/aishell3/tts3/conf/default.yaml index ac4956742..e65b5d0ec 100644 --- a/examples/aishell3/tts3/conf/default.yaml +++ b/examples/aishell3/tts3/conf/default.yaml @@ -88,8 +88,8 @@ updater: # OPTIMIZER SETTING # ########################################################### optimizer: - optim: adam # optimizer type - learning_rate: 0.001 # learning rate + optim: adam # optimizer type + learning_rate: 0.001 # learning rate ########################################################### # TRAINING SETTING # diff --git a/examples/aishell3/tts3/local/inference.sh b/examples/aishell3/tts3/local/inference.sh index 3b03b53ce..dc05ec592 100755 --- a/examples/aishell3/tts3/local/inference.sh +++ b/examples/aishell3/tts3/local/inference.sh @@ -17,3 +17,14 @@ if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then --spk_id=0 fi +if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then + python3 ${BIN_DIR}/../inference.py \ + --inference_dir=${train_output_path}/inference \ + --am=fastspeech2_aishell3 \ + --voc=hifigan_aishell3 \ + --text=${BIN_DIR}/../sentences.txt \ + --output_dir=${train_output_path}/pd_infer_out \ + --phones_dict=dump/phone_id_map.txt \ + 
--speaker_dict=dump/speaker_id_map.txt \ + --spk_id=0 +fi diff --git a/examples/aishell3/tts3/local/ort_predict.sh b/examples/aishell3/tts3/local/ort_predict.sh new file mode 100755 index 000000000..24e66f689 --- /dev/null +++ b/examples/aishell3/tts3/local/ort_predict.sh @@ -0,0 +1,32 @@ +train_output_path=$1 + +stage=0 +stop_stage=0 + +# e2e, synthesize from text +if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then + python3 ${BIN_DIR}/../ort_predict_e2e.py \ + --inference_dir=${train_output_path}/inference_onnx \ + --am=fastspeech2_aishell3 \ + --voc=pwgan_aishell3 \ + --output_dir=${train_output_path}/onnx_infer_out_e2e \ + --text=${BIN_DIR}/../csmsc_test.txt \ + --phones_dict=dump/phone_id_map.txt \ + --device=cpu \ + --cpu_threads=2 \ + --spk_id=0 + +fi + +if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then + python3 ${BIN_DIR}/../ort_predict_e2e.py \ + --inference_dir=${train_output_path}/inference_onnx \ + --am=fastspeech2_aishell3 \ + --voc=hifigan_aishell3 \ + --output_dir=${train_output_path}/onnx_infer_out_e2e \ + --text=${BIN_DIR}/../csmsc_test.txt \ + --phones_dict=dump/phone_id_map.txt \ + --device=cpu \ + --cpu_threads=2 \ + --spk_id=0 +fi diff --git a/examples/aishell3/tts3/local/paddle2onnx.sh b/examples/aishell3/tts3/local/paddle2onnx.sh new file mode 120000 index 000000000..8d5dbef4c --- /dev/null +++ b/examples/aishell3/tts3/local/paddle2onnx.sh @@ -0,0 +1 @@ +../../../csmsc/tts3/local/paddle2onnx.sh \ No newline at end of file diff --git a/examples/aishell3/tts3/local/synthesize.sh b/examples/aishell3/tts3/local/synthesize.sh index d3978833f..9134e0426 100755 --- a/examples/aishell3/tts3/local/synthesize.sh +++ b/examples/aishell3/tts3/local/synthesize.sh @@ -37,7 +37,7 @@ if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then --am_stat=dump/train/speech_stats.npy \ --voc=hifigan_aishell3 \ --voc_config=hifigan_aishell3_ckpt_0.2.0/default.yaml \ - --voc_ckpt=hifigan_aishell3_ckpt_0.2.0/snapshot_iter_2500000.pd \ + --voc_ckpt=hifigan_aishell3_ckpt_0.2.0/snapshot_iter_2500000.pdz \ --voc_stat=hifigan_aishell3_ckpt_0.2.0/feats_stats.npy \ --test_metadata=dump/test/norm/metadata.jsonl \ --output_dir=${train_output_path}/test \ diff --git a/examples/aishell3/tts3/run.sh b/examples/aishell3/tts3/run.sh index b375f2159..24715fee1 100755 --- a/examples/aishell3/tts3/run.sh +++ b/examples/aishell3/tts3/run.sh @@ -27,11 +27,34 @@ if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then fi if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then - # synthesize, vocoder is pwgan + # synthesize, vocoder is pwgan by default CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name} || exit -1 fi if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then - # synthesize_e2e, vocoder is pwgan + # synthesize_e2e, vocoder is pwgan by default CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize_e2e.sh ${conf_path} ${train_output_path} ${ckpt_name} || exit -1 fi + +if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then + # inference with static model, vocoder is pwgan by default + CUDA_VISIBLE_DEVICES=${gpus} ./local/inference.sh ${train_output_path} || exit -1 +fi + +if [ ${stage} -le 5 ] && [ ${stop_stage} -ge 5 ]; then + # install paddle2onnx + version=$(echo `pip list |grep "paddle2onnx"` |awk -F" " '{print $2}') + if [[ -z "$version" || ${version} != '0.9.8' ]]; then + pip install paddle2onnx==0.9.8 + fi + ./local/paddle2onnx.sh ${train_output_path} inference inference_onnx fastspeech2_aishell3 + # considering the balance between speed and quality, we 
recommend that you use hifigan as vocoder + ./local/paddle2onnx.sh ${train_output_path} inference inference_onnx pwgan_aishell3 + # ./local/paddle2onnx.sh ${train_output_path} inference inference_onnx hifigan_aishell3 + +fi + +# inference with onnxruntime, use fastspeech2 + pwgan by default +if [ ${stage} -le 6 ] && [ ${stop_stage} -ge 6 ]; then + ./local/ort_predict.sh ${train_output_path} +fi diff --git a/examples/aishell3/voc1/README.md b/examples/aishell3/voc1/README.md index a3daf3dfd..bc25f43cf 100644 --- a/examples/aishell3/voc1/README.md +++ b/examples/aishell3/voc1/README.md @@ -133,6 +133,12 @@ optional arguments: Pretrained models can be downloaded here: - [pwg_aishell3_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_aishell3_ckpt_0.5.zip) +The static model can be downloaded here: +- [pwgan_aishell3_static_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwgan_aishell3_static_1.1.0.zip) + +The ONNX model can be downloaded here: +- [pwgan_aishell3_onnx_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwgan_aishell3_onnx_1.1.0.zip) + Model | Step | eval/generator_loss | eval/log_stft_magnitude_loss:| eval/spectral_convergence_loss :-------------:| :------------:| :-----: | :-----: | :--------: default| 1(gpu) x 400000|1.968762|0.759008|0.218524 diff --git a/examples/aishell3/voc5/README.md b/examples/aishell3/voc5/README.md index c3e3197d6..7f99a52e3 100644 --- a/examples/aishell3/voc5/README.md +++ b/examples/aishell3/voc5/README.md @@ -116,6 +116,11 @@ optional arguments: The pretrained model can be downloaded here: - [hifigan_aishell3_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_aishell3_ckpt_0.2.0.zip) +The static model can be downloaded here: +- [hifigan_aishell3_static_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_aishell3_static_1.1.0.zip) + +The ONNX model can be downloaded here: +- [hifigan_aishell3_onnx_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_aishell3_onnx_1.1.0.zip) Model | Step | eval/generator_loss | eval/mel_loss| eval/feature_matching_loss :-------------:| :------------:| :-----: | :-----: | :--------: diff --git a/examples/aishell3_vctk/README.md b/examples/aishell3_vctk/README.md new file mode 100644 index 000000000..330b25934 --- /dev/null +++ b/examples/aishell3_vctk/README.md @@ -0,0 +1 @@ +# Mixed Chinese and English TTS with AISHELL3 and VCTK datasets diff --git a/examples/aishell3_vctk/ernie_sat/README.md b/examples/aishell3_vctk/ernie_sat/README.md new file mode 100644 index 000000000..1c6bbe230 --- /dev/null +++ b/examples/aishell3_vctk/ernie_sat/README.md @@ -0,0 +1 @@ +# ERNIE SAT with AISHELL3 and VCTK dataset diff --git a/examples/aishell3_vctk/ernie_sat/conf/default.yaml b/examples/aishell3_vctk/ernie_sat/conf/default.yaml new file mode 100644 index 000000000..abb69fcc0 --- /dev/null +++ b/examples/aishell3_vctk/ernie_sat/conf/default.yaml @@ -0,0 +1,352 @@ +########################################################### +# FEATURE EXTRACTION SETTING # +########################################################### + +fs: 24000 # sr +n_fft: 2048 # FFT size (samples). +n_shift: 300 # Hop size (samples). 12.5ms +win_length: 1200 # Window length (samples). 50ms + # If set to null, it will be the same as fft_size. +window: "hann" # Window function. + +# Only used for feats_type != raw + +fmin: 80 # Minimum frequency of Mel basis. 
+fmax: 7600 # Maximum frequency of Mel basis. +n_mels: 80 # The number of mel basis. + +mean_phn_span: 8 +mlm_prob: 0.8 + +########################################################### +# DATA SETTING # +########################################################### +batch_size: 20 +num_workers: 2 + +########################################################### +# MODEL SETTING # +########################################################### +model: + text_masking: true + postnet_layers: 5 + postnet_filts: 5 + postnet_chans: 256 + encoder_type: conformer + decoder_type: conformer + enc_input_layer: sega_mlm + enc_pre_speech_layer: 0 + enc_cnn_module_kernel: 7 + enc_attention_dim: 384 + enc_attention_heads: 2 + enc_linear_units: 1536 + enc_num_blocks: 4 + enc_dropout_rate: 0.2 + enc_positional_dropout_rate: 0.2 + enc_attention_dropout_rate: 0.2 + enc_normalize_before: true + enc_macaron_style: true + enc_use_cnn_module: true + enc_selfattention_layer_type: legacy_rel_selfattn + enc_activation_type: swish + enc_pos_enc_layer_type: legacy_rel_pos + enc_positionwise_layer_type: conv1d + enc_positionwise_conv_kernel_size: 3 + dec_cnn_module_kernel: 31 + dec_attention_dim: 384 + dec_attention_heads: 2 + dec_linear_units: 1536 + dec_num_blocks: 4 + dec_dropout_rate: 0.2 + dec_positional_dropout_rate: 0.2 + dec_attention_dropout_rate: 0.2 + dec_macaron_style: true + dec_use_cnn_module: true + dec_selfattention_layer_type: legacy_rel_selfattn + dec_activation_type: swish + dec_pos_enc_layer_type: legacy_rel_pos + dec_positionwise_layer_type: conv1d + dec_positionwise_conv_kernel_size: 3 + +########################################################### +# OPTIMIZER SETTING # +########################################################### +scheduler_params: + d_model: 384 + warmup_steps: 4000 +grad_clip: 1.0 + +########################################################### +# TRAINING SETTING # +########################################################### +max_epoch: 700 +num_snapshots: 50 + +########################################################### +# OTHER SETTING # +########################################################### +seed: 0 + +token_list: +- +- +- AH0 +- T +- N +- sp +- S +- R +- D +- L +- Z +- DH +- IH1 +- K +- W +- M +- EH1 +- AE1 +- ER0 +- B +- IY1 +- P +- V +- IY0 +- F +- HH +- AA1 +- AY1 +- AH1 +- EY1 +- IH0 +- AO1 +- OW1 +- UW1 +- G +- NG +- SH +- Y +- TH +- ER1 +- JH +- UH1 +- AW1 +- CH +- IH2 +- OW0 +- OW2 +- EY2 +- EH2 +- UW0 +- OY1 +- ZH +- EH0 +- AY2 +- AW2 +- AA2 +- AE2 +- IY2 +- AH2 +- AE0 +- AO2 +- AY0 +- AO0 +- UW2 +- UH2 +- AA0 +- EY0 +- AW0 +- UH0 +- ER2 +- OY2 +- OY0 +- d +- sh +- ii +- j +- zh +- l +- x +- b +- g +- uu +- e5 +- h +- q +- m +- i1 +- t +- z +- ch +- f +- s +- u4 +- ix4 +- i4 +- n +- i3 +- iu3 +- vv +- ian4 +- ix2 +- r +- e4 +- ai4 +- k +- ing2 +- a1 +- en2 +- ui4 +- ong1 +- uo3 +- u2 +- u3 +- ao4 +- ee +- p +- an1 +- eng2 +- i2 +- in1 +- c +- ai2 +- ian2 +- e2 +- an4 +- ing4 +- v4 +- ai3 +- a5 +- ian3 +- eng1 +- ong4 +- ang4 +- ian1 +- ing1 +- iy4 +- ao3 +- ang1 +- uo4 +- u1 +- iao4 +- iu4 +- a4 +- van2 +- ie4 +- ang2 +- ou4 +- iang4 +- ix1 +- er4 +- iy1 +- e1 +- en1 +- ui2 +- an3 +- ei4 +- ong2 +- uo1 +- ou3 +- uo2 +- iao1 +- ou1 +- an2 +- uan4 +- ia4 +- ia1 +- ang3 +- v3 +- iu2 +- iao3 +- in4 +- a3 +- ei3 +- iang3 +- v2 +- eng4 +- en3 +- aa +- uan1 +- v1 +- ao1 +- ve4 +- ie3 +- ai1 +- ing3 +- iang1 +- a2 +- ui1 +- en4 +- en5 +- in3 +- uan3 +- e3 +- ie1 +- ve2 +- ei2 +- in2 +- ix3 +- uan2 +- iang2 +- ie2 +- ua4 +- ou2 +- uai4 +- er2 +- eng3 +- uang3 +- un1 +- ong3 +- 
uang4 +- vn4 +- un2 +- iy3 +- iz4 +- ui3 +- iao2 +- iong4 +- un4 +- van4 +- ao2 +- uang1 +- iy5 +- o2 +- ei1 +- ua1 +- iu1 +- uang2 +- er5 +- o1 +- un3 +- vn1 +- vn2 +- o4 +- ve1 +- van3 +- ua2 +- er3 +- iong3 +- van1 +- ia2 +- iy2 +- ia3 +- iong1 +- uo5 +- oo +- ve3 +- ou5 +- uai3 +- ian5 +- iong2 +- uai2 +- uai1 +- ua3 +- vn3 +- ia5 +- ie5 +- ueng1 +- o5 +- o3 +- iang5 +- ei5 +- diff --git a/examples/aishell3_vctk/ernie_sat/local/preprocess.sh b/examples/aishell3_vctk/ernie_sat/local/preprocess.sh new file mode 100755 index 000000000..783fd6333 --- /dev/null +++ b/examples/aishell3_vctk/ernie_sat/local/preprocess.sh @@ -0,0 +1,89 @@ +#!/bin/bash + +stage=0 +stop_stage=100 + +config_path=$1 + +if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then + # get durations from MFA's result + echo "Generate durations.txt from MFA results for aishell3 ..." + python3 ${MAIN_ROOT}/utils/gen_duration_from_textgrid.py \ + --inputdir=./aishell3_alignment_tone \ + --output durations_aishell3.txt \ + --config=${config_path} +fi + +if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then + # get durations from MFA's result + echo "Generate durations.txt from MFA results for vctk ..." + python3 ${MAIN_ROOT}/utils/gen_duration_from_textgrid.py \ + --inputdir=./vctk_alignment \ + --output durations_vctk.txt \ + --config=${config_path} +fi + +if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then + # get durations from MFA's result + echo "concat durations_aishell3.txt and durations_vctk.txt to durations.txt" + cat durations_aishell3.txt durations_vctk.txt > durations.txt +fi + +if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then + # extract features + echo "Extract features ..." + python3 ${BIN_DIR}/preprocess.py \ + --dataset=aishell3 \ + --rootdir=~/datasets/data_aishell3/ \ + --dumpdir=dump \ + --dur-file=durations.txt \ + --config=${config_path} \ + --num-cpu=20 \ + --cut-sil=True +fi + +if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then + # extract features + echo "Extract features ..." + python3 ${BIN_DIR}/preprocess.py \ + --dataset=vctk \ + --rootdir=~/datasets/VCTK-Corpus-0.92/ \ + --dumpdir=dump \ + --dur-file=durations.txt \ + --config=${config_path} \ + --num-cpu=20 \ + --cut-sil=True +fi + +if [ ${stage} -le 5 ] && [ ${stop_stage} -ge 5 ]; then + # get features' stats(mean and std) + echo "Get features' stats ..." + python3 ${MAIN_ROOT}/utils/compute_statistics.py \ + --metadata=dump/train/raw/metadata.jsonl \ + --field-name="speech" +fi + +if [ ${stage} -le 6 ] && [ ${stop_stage} -ge 6 ]; then + # normalize and covert phone/speaker to id, dev and test should use train's stats + echo "Normalize ..." 
+ python3 ${BIN_DIR}/normalize.py \ + --metadata=dump/train/raw/metadata.jsonl \ + --dumpdir=dump/train/norm \ + --speech-stats=dump/train/speech_stats.npy \ + --phones-dict=dump/phone_id_map.txt \ + --speaker-dict=dump/speaker_id_map.txt + + python3 ${BIN_DIR}/normalize.py \ + --metadata=dump/dev/raw/metadata.jsonl \ + --dumpdir=dump/dev/norm \ + --speech-stats=dump/train/speech_stats.npy \ + --phones-dict=dump/phone_id_map.txt \ + --speaker-dict=dump/speaker_id_map.txt + + python3 ${BIN_DIR}/normalize.py \ + --metadata=dump/test/raw/metadata.jsonl \ + --dumpdir=dump/test/norm \ + --speech-stats=dump/train/speech_stats.npy \ + --phones-dict=dump/phone_id_map.txt \ + --speaker-dict=dump/speaker_id_map.txt +fi diff --git a/examples/aishell3_vctk/ernie_sat/local/synthesize.sh b/examples/aishell3_vctk/ernie_sat/local/synthesize.sh new file mode 100755 index 000000000..3e907427c --- /dev/null +++ b/examples/aishell3_vctk/ernie_sat/local/synthesize.sh @@ -0,0 +1,42 @@ +#!/bin/bash + +config_path=$1 +train_output_path=$2 +ckpt_name=$3 + +stage=1 +stop_stage=1 + +# pwgan +if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then + FLAGS_allocator_strategy=naive_best_fit \ + FLAGS_fraction_of_gpu_memory_to_use=0.01 \ + python3 ${BIN_DIR}/synthesize.py \ + --erniesat_config=${config_path} \ + --erniesat_ckpt=${train_output_path}/checkpoints/${ckpt_name} \ + --erniesat_stat=dump/train/speech_stats.npy \ + --voc=pwgan_aishell3 \ + --voc_config=pwg_aishell3_ckpt_0.5/default.yaml \ + --voc_ckpt=pwg_aishell3_ckpt_0.5/snapshot_iter_1000000.pdz \ + --voc_stat=pwg_aishell3_ckpt_0.5/feats_stats.npy \ + --test_metadata=dump/test/norm/metadata.jsonl \ + --output_dir=${train_output_path}/test \ + --phones_dict=dump/phone_id_map.txt +fi + +# hifigan +if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then + FLAGS_allocator_strategy=naive_best_fit \ + FLAGS_fraction_of_gpu_memory_to_use=0.01 \ + python3 ${BIN_DIR}/synthesize.py \ + --erniesat_config=${config_path} \ + --erniesat_ckpt=${train_output_path}/checkpoints/${ckpt_name} \ + --erniesat_stat=dump/train/speech_stats.npy \ + --voc=hifigan_aishell3 \ + --voc_config=hifigan_aishell3_ckpt_0.2.0/default.yaml \ + --voc_ckpt=hifigan_aishell3_ckpt_0.2.0/snapshot_iter_2500000.pdz \ + --voc_stat=hifigan_aishell3_ckpt_0.2.0/feats_stats.npy \ + --test_metadata=dump/test/norm/metadata.jsonl \ + --output_dir=${train_output_path}/test \ + --phones_dict=dump/phone_id_map.txt +fi diff --git a/examples/aishell3_vctk/ernie_sat/local/train.sh b/examples/aishell3_vctk/ernie_sat/local/train.sh new file mode 100755 index 000000000..30720e8f5 --- /dev/null +++ b/examples/aishell3_vctk/ernie_sat/local/train.sh @@ -0,0 +1,12 @@ +#!/bin/bash + +config_path=$1 +train_output_path=$2 + +python3 ${BIN_DIR}/train.py \ + --train-metadata=dump/train/norm/metadata.jsonl \ + --dev-metadata=dump/dev/norm/metadata.jsonl \ + --config=${config_path} \ + --output-dir=${train_output_path} \ + --ngpu=2 \ + --phones-dict=dump/phone_id_map.txt \ No newline at end of file diff --git a/examples/aishell3_vctk/ernie_sat/path.sh b/examples/aishell3_vctk/ernie_sat/path.sh new file mode 100755 index 000000000..4ecab0251 --- /dev/null +++ b/examples/aishell3_vctk/ernie_sat/path.sh @@ -0,0 +1,13 @@ +#!/bin/bash +export MAIN_ROOT=`realpath ${PWD}/../../../` + +export PATH=${MAIN_ROOT}:${MAIN_ROOT}/utils:${PATH} +export LC_ALL=C + +export PYTHONDONTWRITEBYTECODE=1 +# Use UTF-8 in Python to avoid UnicodeDecodeError when LC_ALL=C +export PYTHONIOENCODING=UTF-8 +export PYTHONPATH=${MAIN_ROOT}:${PYTHONPATH} + 
+MODEL=ernie_sat +export BIN_DIR=${MAIN_ROOT}/paddlespeech/t2s/exps/${MODEL} \ No newline at end of file diff --git a/examples/aishell3_vctk/ernie_sat/run.sh b/examples/aishell3_vctk/ernie_sat/run.sh new file mode 100755 index 000000000..d75a19f23 --- /dev/null +++ b/examples/aishell3_vctk/ernie_sat/run.sh @@ -0,0 +1,32 @@ +#!/bin/bash + +set -e +source path.sh + +gpus=0,1 +stage=0 +stop_stage=100 + +conf_path=conf/default.yaml +train_output_path=exp/default +ckpt_name=snapshot_iter_153.pdz + +# with the following command, you can choose the stage range you want to run +# such as `./run.sh --stage 0 --stop-stage 0` +# this cannot be mixed with `$1`, `$2` ... +source ${MAIN_ROOT}/utils/parse_options.sh || exit 1 + +if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then + # prepare data + ./local/preprocess.sh ${conf_path} || exit -1 +fi + +if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then + # train model, all `ckpt` under `train_output_path/checkpoints/` dir + CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path} || exit -1 +fi + +if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then + # synthesize, vocoder is pwgan + CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name} || exit -1 +fi diff --git a/examples/callcenter/README.md b/examples/callcenter/README.md index 1c715cb69..6d5211461 100644 --- a/examples/callcenter/README.md +++ b/examples/callcenter/README.md @@ -1,20 +1,3 @@ # Callcenter 8k sample rate -Data distribution: - -``` -676048 utts -491.4004722221223 h -4357792.0 text -2.4633630739178654 text/sec -2.6167397877068495 sec/utt -``` - -train/dev/test partition: - -``` - 33802 manifest.dev - 67606 manifest.test - 574640 manifest.train - 676048 total -``` +This recipe only provides the model and data configs for 8k ASR; users need to prepare the data and generate the manifest metafiles themselves. You can refer to the Aishell or Librispeech recipes.
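The manifest metafile mentioned above is a JSON-lines file with one utterance per line. Below is a minimal sketch of producing a single entry; it is illustrative only (not something this patch adds), and the field names `utt`, `feat`, `feat_shape` and `text` are assumed to match the Aishell/Librispeech data-prep scripts, so verify them against the recipe you copy from.

```python
# Minimal sketch of writing one manifest entry (illustrative; field names assumed to
# follow the Aishell/Librispeech data-prep scripts -- double-check before training).
import json

import soundfile


def add_manifest_entry(fout, utt_id, wav_path, transcript):
    audio, sample_rate = soundfile.read(wav_path)
    duration = len(audio) / sample_rate  # audio length in seconds
    entry = {
        'utt': utt_id,  # utterance id (hypothetical naming scheme)
        'feat': wav_path,  # path to the 8k wav file
        'feat_shape': (duration, ),  # duration in seconds
        'text': transcript,  # reference transcription
    }
    fout.write(json.dumps(entry, ensure_ascii=False) + '\n')


with open('manifest.train', 'w', encoding='utf-8') as fout:
    add_manifest_entry(fout, 'CALL0001_UTT0001',
                       '/data/callcenter/wav/CALL0001_UTT0001.wav',
                       '您好 请问有什么可以帮您')
```

With `manifest.train` / `manifest.dev` / `manifest.test` prepared this way, the 8k configs in this recipe can be pointed at them in the same way the Aishell recipe does.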
diff --git a/examples/csmsc/tts2/conf/default.yaml b/examples/csmsc/tts2/conf/default.yaml index a3366b8f9..a5a258b7e 100644 --- a/examples/csmsc/tts2/conf/default.yaml +++ b/examples/csmsc/tts2/conf/default.yaml @@ -21,22 +21,22 @@ num_workers: 4 # MODEL SETTING # ########################################################### model: - encoder_hidden_size: 128 - encoder_kernel_size: 3 - encoder_dilations: [1, 3, 9, 27, 1, 3, 9, 27, 1, 1] - duration_predictor_hidden_size: 128 - decoder_hidden_size: 128 - decoder_output_size: 80 - decoder_kernel_size: 3 - decoder_dilations: [1, 3, 9, 27, 1, 3, 9, 27, 1, 3, 9, 27, 1, 3, 9, 27, 1, 1] + encoder_hidden_size: 128 + encoder_kernel_size: 3 + encoder_dilations: [1, 3, 9, 27, 1, 3, 9, 27, 1, 1] + duration_predictor_hidden_size: 128 + decoder_hidden_size: 128 + decoder_output_size: 80 + decoder_kernel_size: 3 + decoder_dilations: [1, 3, 9, 27, 1, 3, 9, 27, 1, 3, 9, 27, 1, 3, 9, 27, 1, 1] ########################################################### # OPTIMIZER SETTING # ########################################################### optimizer: - optim: adam # optimizer type - learning_rate: 0.002 # learning rate - max_grad_norm: 1 + optim: adam # optimizer type + learning_rate: 0.002 # learning rate + max_grad_norm: 1 ########################################################### # TRAINING SETTING # diff --git a/examples/csmsc/tts2/local/ort_predict.sh b/examples/csmsc/tts2/local/ort_predict.sh index 46b0409b8..8ca4c0e9b 100755 --- a/examples/csmsc/tts2/local/ort_predict.sh +++ b/examples/csmsc/tts2/local/ort_predict.sh @@ -3,22 +3,34 @@ train_output_path=$1 stage=0 stop_stage=0 -# only support default_fastspeech2/speedyspeech + hifigan/mb_melgan now! - -# synthesize from metadata +# e2e, synthesize from text if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then - python3 ${BIN_DIR}/../ort_predict.py \ + python3 ${BIN_DIR}/../ort_predict_e2e.py \ --inference_dir=${train_output_path}/inference_onnx \ --am=speedyspeech_csmsc \ - --voc=hifigan_csmsc \ - --test_metadata=dump/test/norm/metadata.jsonl \ - --output_dir=${train_output_path}/onnx_infer_out \ + --voc=pwgan_csmsc \ + --output_dir=${train_output_path}/onnx_infer_out_e2e \ + --text=${BIN_DIR}/../csmsc_test.txt \ + --phones_dict=dump/phone_id_map.txt \ + --tones_dict=dump/tone_id_map.txt \ --device=cpu \ --cpu_threads=2 fi -# e2e, synthesize from text if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then + python3 ${BIN_DIR}/../ort_predict_e2e.py \ + --inference_dir=${train_output_path}/inference_onnx \ + --am=speedyspeech_csmsc \ + --voc=mb_melgan_csmsc \ + --output_dir=${train_output_path}/onnx_infer_out_e2e \ + --text=${BIN_DIR}/../csmsc_test.txt \ + --phones_dict=dump/phone_id_map.txt \ + --tones_dict=dump/tone_id_map.txt \ + --device=cpu \ + --cpu_threads=2 +fi + +if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then python3 ${BIN_DIR}/../ort_predict_e2e.py \ --inference_dir=${train_output_path}/inference_onnx \ --am=speedyspeech_csmsc \ @@ -30,3 +42,15 @@ if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then --device=cpu \ --cpu_threads=2 fi + +# synthesize from metadata +if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then + python3 ${BIN_DIR}/../ort_predict.py \ + --inference_dir=${train_output_path}/inference_onnx \ + --am=speedyspeech_csmsc \ + --voc=hifigan_csmsc \ + --test_metadata=dump/test/norm/metadata.jsonl \ + --output_dir=${train_output_path}/onnx_infer_out \ + --device=cpu \ + --cpu_threads=2 +fi diff --git a/examples/csmsc/tts2/run.sh b/examples/csmsc/tts2/run.sh index 
1d67a5c91..e51913496 100755 --- a/examples/csmsc/tts2/run.sh +++ b/examples/csmsc/tts2/run.sh @@ -27,12 +27,12 @@ if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then fi if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then - # synthesize, vocoder is pwgan + # synthesize, vocoder is pwgan by default CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name} || exit -1 fi if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then - # synthesize_e2e, vocoder is pwgan + # synthesize_e2e, vocoder is pwgan by default CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize_e2e.sh ${conf_path} ${train_output_path} ${ckpt_name} || exit -1 fi @@ -46,19 +46,17 @@ fi if [ ${stage} -le 5 ] && [ ${stop_stage} -ge 5 ]; then # install paddle2onnx version=$(echo `pip list |grep "paddle2onnx"` |awk -F" " '{print $2}') - if [[ -z "$version" || ${version} != '0.9.5' ]]; then - pip install paddle2onnx==0.9.5 + if [[ -z "$version" || ${version} != '0.9.8' ]]; then + pip install paddle2onnx==0.9.8 fi ./local/paddle2onnx.sh ${train_output_path} inference inference_onnx speedyspeech_csmsc - ./local/paddle2onnx.sh ${train_output_path} inference inference_onnx hifigan_csmsc + # considering the balance between speed and quality, we recommend that you use hifigan as vocoder + ./local/paddle2onnx.sh ${train_output_path} inference inference_onnx pwgan_csmsc + # ./local/paddle2onnx.sh ${train_output_path} inference inference_onnx mb_melgan_csmsc + # ./local/paddle2onnx.sh ${train_output_path} inference inference_onnx hifigan_csmsc fi -# inference with onnxruntime, use fastspeech2 + hifigan by default +# inference with onnxruntime if [ ${stage} -le 6 ] && [ ${stop_stage} -ge 6 ]; then - # install onnxruntime - version=$(echo `pip list |grep "onnxruntime"` |awk -F" " '{print $2}') - if [[ -z "$version" || ${version} != '1.10.0' ]]; then - pip install onnxruntime==1.10.0 - fi ./local/ort_predict.sh ${train_output_path} fi diff --git a/examples/csmsc/tts3/local/ort_predict.sh b/examples/csmsc/tts3/local/ort_predict.sh index 96350c06c..e16c7bd05 100755 --- a/examples/csmsc/tts3/local/ort_predict.sh +++ b/examples/csmsc/tts3/local/ort_predict.sh @@ -3,22 +3,32 @@ train_output_path=$1 stage=0 stop_stage=0 -# only support default_fastspeech2/speedyspeech + hifigan/mb_melgan now! 
- -# synthesize from metadata +# e2e, synthesize from text if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then - python3 ${BIN_DIR}/../ort_predict.py \ + python3 ${BIN_DIR}/../ort_predict_e2e.py \ --inference_dir=${train_output_path}/inference_onnx \ --am=fastspeech2_csmsc \ - --voc=hifigan_csmsc \ - --test_metadata=dump/test/norm/metadata.jsonl \ - --output_dir=${train_output_path}/onnx_infer_out \ + --voc=pwgan_csmsc \ + --output_dir=${train_output_path}/onnx_infer_out_e2e \ + --text=${BIN_DIR}/../csmsc_test.txt \ + --phones_dict=dump/phone_id_map.txt \ --device=cpu \ --cpu_threads=2 fi -# e2e, synthesize from text if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then + python3 ${BIN_DIR}/../ort_predict_e2e.py \ + --inference_dir=${train_output_path}/inference_onnx \ + --am=fastspeech2_csmsc \ + --voc=mb_melgan_csmsc \ + --output_dir=${train_output_path}/onnx_infer_out_e2e \ + --text=${BIN_DIR}/../csmsc_test.txt \ + --phones_dict=dump/phone_id_map.txt \ + --device=cpu \ + --cpu_threads=2 +fi + +if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then python3 ${BIN_DIR}/../ort_predict_e2e.py \ --inference_dir=${train_output_path}/inference_onnx \ --am=fastspeech2_csmsc \ @@ -29,3 +39,15 @@ if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then --device=cpu \ --cpu_threads=2 fi + +# synthesize from metadata, take hifigan as an example +if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then + python3 ${BIN_DIR}/../ort_predict.py \ + --inference_dir=${train_output_path}/inference_onnx \ + --am=fastspeech2_csmsc \ + --voc=hifigan_csmsc \ + --test_metadata=dump/test/norm/metadata.jsonl \ + --output_dir=${train_output_path}/onnx_infer_out \ + --device=cpu \ + --cpu_threads=2 +fi \ No newline at end of file diff --git a/examples/csmsc/tts3/local/ort_predict_streaming.sh b/examples/csmsc/tts3/local/ort_predict_streaming.sh index 502ec912a..743935816 100755 --- a/examples/csmsc/tts3/local/ort_predict_streaming.sh +++ b/examples/csmsc/tts3/local/ort_predict_streaming.sh @@ -5,6 +5,34 @@ stop_stage=0 # e2e, synthesize from text if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then + python3 ${BIN_DIR}/../ort_predict_streaming.py \ + --inference_dir=${train_output_path}/inference_onnx_streaming \ + --am=fastspeech2_csmsc \ + --am_stat=dump/train/speech_stats.npy \ + --voc=pwgan_csmsc \ + --output_dir=${train_output_path}/onnx_infer_out_streaming \ + --text=${BIN_DIR}/../csmsc_test.txt \ + --phones_dict=dump/phone_id_map.txt \ + --device=cpu \ + --cpu_threads=2 \ + --am_streaming=True +fi + +if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then + python3 ${BIN_DIR}/../ort_predict_streaming.py \ + --inference_dir=${train_output_path}/inference_onnx_streaming \ + --am=fastspeech2_csmsc \ + --am_stat=dump/train/speech_stats.npy \ + --voc=mb_melgan_csmsc \ + --output_dir=${train_output_path}/onnx_infer_out_streaming \ + --text=${BIN_DIR}/../csmsc_test.txt \ + --phones_dict=dump/phone_id_map.txt \ + --device=cpu \ + --cpu_threads=2 \ + --am_streaming=True +fi + +if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then python3 ${BIN_DIR}/../ort_predict_streaming.py \ --inference_dir=${train_output_path}/inference_onnx_streaming \ --am=fastspeech2_csmsc \ diff --git a/examples/csmsc/tts3/local/synthesize_streaming.sh b/examples/csmsc/tts3/local/synthesize_streaming.sh index b135db76d..366a88db9 100755 --- a/examples/csmsc/tts3/local/synthesize_streaming.sh +++ b/examples/csmsc/tts3/local/synthesize_streaming.sh @@ -24,7 +24,8 @@ if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then 
--text=${BIN_DIR}/../sentences.txt \ --output_dir=${train_output_path}/test_e2e_streaming \ --phones_dict=dump/phone_id_map.txt \ - --am_streaming=True + --am_streaming=True \ + --inference_dir=${train_output_path}/inference_streaming fi # for more GAN Vocoders @@ -45,7 +46,8 @@ if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then --text=${BIN_DIR}/../sentences.txt \ --output_dir=${train_output_path}/test_e2e_streaming \ --phones_dict=dump/phone_id_map.txt \ - --am_streaming=True + --am_streaming=True \ + --inference_dir=${train_output_path}/inference_streaming fi # the pretrained models haven't release now diff --git a/examples/csmsc/tts3/run.sh b/examples/csmsc/tts3/run.sh index f0afcc895..2662b5811 100755 --- a/examples/csmsc/tts3/run.sh +++ b/examples/csmsc/tts3/run.sh @@ -27,17 +27,17 @@ if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then fi if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then - # synthesize, vocoder is pwgan + # synthesize, vocoder is pwgan by default CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name} || exit -1 fi if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then - # synthesize_e2e, vocoder is pwgan + # synthesize_e2e, vocoder is pwgan by default CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize_e2e.sh ${conf_path} ${train_output_path} ${ckpt_name} || exit -1 fi if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then - # inference with static model + # inference with static model, vocoder is pwgan by default CUDA_VISIBLE_DEVICES=${gpus} ./local/inference.sh ${train_output_path} || exit -1 fi @@ -46,15 +46,18 @@ fi if [ ${stage} -le 5 ] && [ ${stop_stage} -ge 5 ]; then # install paddle2onnx version=$(echo `pip list |grep "paddle2onnx"` |awk -F" " '{print $2}') - if [[ -z "$version" || ${version} != '0.9.5' ]]; then - pip install paddle2onnx==0.9.5 + if [[ -z "$version" || ${version} != '0.9.8' ]]; then + pip install paddle2onnx==0.9.8 fi ./local/paddle2onnx.sh ${train_output_path} inference inference_onnx fastspeech2_csmsc - ./local/paddle2onnx.sh ${train_output_path} inference inference_onnx hifigan_csmsc - ./local/paddle2onnx.sh ${train_output_path} inference inference_onnx mb_melgan_csmsc + # considering the balance between speed and quality, we recommend that you use hifigan as vocoder + ./local/paddle2onnx.sh ${train_output_path} inference inference_onnx pwgan_csmsc + # ./local/paddle2onnx.sh ${train_output_path} inference inference_onnx mb_melgan_csmsc + # ./local/paddle2onnx.sh ${train_output_path} inference inference_onnx hifigan_csmsc + fi -# inference with onnxruntime, use fastspeech2 + hifigan by default +# inference with onnxruntime if [ ${stage} -le 6 ] && [ ${stop_stage} -ge 6 ]; then ./local/ort_predict.sh ${train_output_path} fi diff --git a/examples/csmsc/tts3/run_cnndecoder.sh b/examples/csmsc/tts3/run_cnndecoder.sh index c8dd8545b..c5ce41a9c 100755 --- a/examples/csmsc/tts3/run_cnndecoder.sh +++ b/examples/csmsc/tts3/run_cnndecoder.sh @@ -33,25 +33,25 @@ fi # synthesize_e2e non-streaming if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then - # synthesize_e2e, vocoder is pwgan + # synthesize_e2e, vocoder is pwgan by default CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize_e2e.sh ${conf_path} ${train_output_path} ${ckpt_name} || exit -1 fi # inference non-streaming if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then - # inference with static model + # inference with static model, vocoder is pwgan by default CUDA_VISIBLE_DEVICES=${gpus} ./local/inference.sh ${train_output_path} || exit -1 fi 
# synthesize_e2e streaming if [ ${stage} -le 5 ] && [ ${stop_stage} -ge 5 ]; then - # synthesize_e2e, vocoder is pwgan + # synthesize_e2e, vocoder is pwgan by default CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize_streaming.sh ${conf_path} ${train_output_path} ${ckpt_name} || exit -1 fi # inference streaming if [ ${stage} -le 6 ] && [ ${stop_stage} -ge 6 ]; then - # inference with static model + # inference with static model, vocoder is pwgan by default CUDA_VISIBLE_DEVICES=${gpus} ./local/inference_streaming.sh ${train_output_path} || exit -1 fi @@ -59,32 +59,37 @@ fi if [ ${stage} -le 7 ] && [ ${stop_stage} -ge 7 ]; then # install paddle2onnx version=$(echo `pip list |grep "paddle2onnx"` |awk -F" " '{print $2}') - if [[ -z "$version" || ${version} != '0.9.5' ]]; then - pip install paddle2onnx==0.9.5 + if [[ -z "$version" || ${version} != '0.9.8' ]]; then + pip install paddle2onnx==0.9.8 fi ./local/paddle2onnx.sh ${train_output_path} inference inference_onnx fastspeech2_csmsc - ./local/paddle2onnx.sh ${train_output_path} inference inference_onnx hifigan_csmsc + # considering the balance between speed and quality, we recommend that you use hifigan as vocoder + ./local/paddle2onnx.sh ${train_output_path} inference inference_onnx pwgan_csmsc + # ./local/paddle2onnx.sh ${train_output_path} inference inference_onnx mb_melgan_csmsc + # ./local/paddle2onnx.sh ${train_output_path} inference inference_onnx hifigan_csmsc fi # onnxruntime non streaming -# inference with onnxruntime, use fastspeech2 + hifigan by default if [ ${stage} -le 8 ] && [ ${stop_stage} -ge 8 ]; then ./local/ort_predict.sh ${train_output_path} fi # paddle2onnx streaming + if [ ${stage} -le 9 ] && [ ${stop_stage} -ge 9 ]; then # install paddle2onnx version=$(echo `pip list |grep "paddle2onnx"` |awk -F" " '{print $2}') - if [[ -z "$version" || ${version} != '0.9.5' ]]; then - pip install paddle2onnx==0.9.5 + if [[ -z "$version" || ${version} != '0.9.8' ]]; then + pip install paddle2onnx==0.9.8 fi # streaming acoustic model ./local/paddle2onnx.sh ${train_output_path} inference_streaming inference_onnx_streaming fastspeech2_csmsc_am_encoder_infer ./local/paddle2onnx.sh ${train_output_path} inference_streaming inference_onnx_streaming fastspeech2_csmsc_am_decoder ./local/paddle2onnx.sh ${train_output_path} inference_streaming inference_onnx_streaming fastspeech2_csmsc_am_postnet - # vocoder - ./local/paddle2onnx.sh ${train_output_path} inference_streaming inference_onnx_streaming hifigan_csmsc + # considering the balance between speed and quality, we recommend that you use hifigan as vocoder + ./local/paddle2onnx.sh ${train_output_path} inference_streaming inference_onnx_streaming pwgan_csmsc + # ./local/paddle2onnx.sh ${train_output_path} inference_streaming inference_onnx_streaming mb_melgan_csmsc + # ./local/paddle2onnx.sh ${train_output_path} inference_streaming inference_onnx_streaming hifigan_csmsc fi # onnxruntime streaming diff --git a/examples/csmsc/vits/README.md b/examples/csmsc/vits/README.md index 0c16840a0..8f223e07b 100644 --- a/examples/csmsc/vits/README.md +++ b/examples/csmsc/vits/README.md @@ -144,3 +144,34 @@ optional arguments: 6. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu. ## Pretrained Model + +The pretrained model can be downloaded here: + +- [vits_csmsc_ckpt_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/vits/vits_csmsc_ckpt_1.1.0.zip) (add_blank=true) + +VITS checkpoint contains files listed below. 
+```text +vits_csmsc_ckpt_1.1.0 +├── default.yaml # default config used to train vits +├── phone_id_map.txt # phone vocabulary file when training vits +└── snapshot_iter_333000.pdz # model parameters and optimizer states +``` + +Note: this checkpoint is not good enough yet; a better one is still being trained. + +You can use the following script to synthesize waveforms from `${BIN_DIR}/../sentences.txt` using the pretrained VITS model. + +```bash +source path.sh +add_blank=true + +FLAGS_allocator_strategy=naive_best_fit \ +FLAGS_fraction_of_gpu_memory_to_use=0.01 \ +python3 ${BIN_DIR}/synthesize_e2e.py \ + --config=vits_csmsc_ckpt_1.1.0/default.yaml \ + --ckpt=vits_csmsc_ckpt_1.1.0/snapshot_iter_333000.pdz \ + --phones_dict=vits_csmsc_ckpt_1.1.0/phone_id_map.txt \ + --output_dir=exp/default/test_e2e \ + --text=${BIN_DIR}/../sentences.txt \ + --add-blank=${add_blank} +``` diff --git a/examples/csmsc/vits/conf/default.yaml b/examples/csmsc/vits/conf/default.yaml index 47af780dc..a2aef998d 100644 --- a/examples/csmsc/vits/conf/default.yaml +++ b/examples/csmsc/vits/conf/default.yaml @@ -178,6 +178,8 @@ generator_first: False # whether to start updating generator first ########################################################## # OTHER TRAINING SETTING # ########################################################## -max_epoch: 1000 # number of epochs -num_snapshots: 10 # max number of snapshots to keep while training -seed: 777 # random seed number +num_snapshots: 10 # max number of snapshots to keep while training +train_max_steps: 350000 # Number of training steps. == total_iters / ngpus, total_iters = 1000000 +save_interval_steps: 1000 # Interval steps to save checkpoint. +eval_interval_steps: 250 # Interval steps to evaluate the network. +seed: 777 # random seed number diff --git a/examples/csmsc/vits/local/preprocess.sh b/examples/csmsc/vits/local/preprocess.sh index 1d3ae5937..1cd6d1f9b 100755 --- a/examples/csmsc/vits/local/preprocess.sh +++ b/examples/csmsc/vits/local/preprocess.sh @@ -4,6 +4,7 @@ stage=0 stop_stage=100 config_path=$1 +add_blank=$2 if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then # get durations from MFA's result @@ -44,6 +45,7 @@ if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then --feats-stats=dump/train/feats_stats.npy \ --phones-dict=dump/phone_id_map.txt \ --speaker-dict=dump/speaker_id_map.txt \ + --add-blank=${add_blank} \ --skip-wav-copy python3 ${BIN_DIR}/normalize.py \ @@ -52,6 +54,7 @@ if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then --feats-stats=dump/train/feats_stats.npy \ --phones-dict=dump/phone_id_map.txt \ --speaker-dict=dump/speaker_id_map.txt \ + --add-blank=${add_blank} \ --skip-wav-copy python3 ${BIN_DIR}/normalize.py \ @@ -60,5 +63,6 @@ if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then --feats-stats=dump/train/feats_stats.npy \ --phones-dict=dump/phone_id_map.txt \ --speaker-dict=dump/speaker_id_map.txt \ + --add-blank=${add_blank} \ --skip-wav-copy fi diff --git a/examples/csmsc/vits/local/synthesize.sh b/examples/csmsc/vits/local/synthesize.sh index c15d5f99f..a4b35ec0a 100755 --- a/examples/csmsc/vits/local/synthesize.sh +++ b/examples/csmsc/vits/local/synthesize.sh @@ -15,4 +15,4 @@ if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then --phones_dict=dump/phone_id_map.txt \ --test_metadata=dump/test/norm/metadata.jsonl \ --output_dir=${train_output_path}/test -fi \ No newline at end of file +fi diff --git a/examples/csmsc/vits/local/synthesize_e2e.sh b/examples/csmsc/vits/local/synthesize_e2e.sh index edbb07bfc..3f3bf6517 100755 ---
a/examples/csmsc/vits/local/synthesize_e2e.sh +++ b/examples/csmsc/vits/local/synthesize_e2e.sh @@ -3,9 +3,12 @@ config_path=$1 train_output_path=$2 ckpt_name=$3 +add_blank=$4 + stage=0 stop_stage=0 + if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then FLAGS_allocator_strategy=naive_best_fit \ FLAGS_fraction_of_gpu_memory_to_use=0.01 \ @@ -14,5 +17,6 @@ if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then --ckpt=${train_output_path}/checkpoints/${ckpt_name} \ --phones_dict=dump/phone_id_map.txt \ --output_dir=${train_output_path}/test_e2e \ - --text=${BIN_DIR}/../sentences.txt + --text=${BIN_DIR}/../sentences.txt \ + --add-blank=${add_blank} fi diff --git a/examples/csmsc/vits/local/train.sh b/examples/csmsc/vits/local/train.sh index 42fff26ca..289837a5d 100755 --- a/examples/csmsc/vits/local/train.sh +++ b/examples/csmsc/vits/local/train.sh @@ -3,6 +3,11 @@ config_path=$1 train_output_path=$2 +# install monotonic_align +cd ${MAIN_ROOT}/paddlespeech/t2s/models/vits/monotonic_align +python3 setup.py build_ext --inplace +cd - + python3 ${BIN_DIR}/train.py \ --train-metadata=dump/train/norm/metadata.jsonl \ --dev-metadata=dump/dev/norm/metadata.jsonl \ diff --git a/examples/csmsc/vits/run.sh b/examples/csmsc/vits/run.sh index 80e56e7c1..c284b7b23 100755 --- a/examples/csmsc/vits/run.sh +++ b/examples/csmsc/vits/run.sh @@ -10,6 +10,7 @@ stop_stage=100 conf_path=conf/default.yaml train_output_path=exp/default ckpt_name=snapshot_iter_153.pdz +add_blank=true # with the following command, you can choose the stage range you want to run # such as `./run.sh --stage 0 --stop-stage 0` @@ -18,7 +19,7 @@ source ${MAIN_ROOT}/utils/parse_options.sh || exit 1 if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then # prepare data - ./local/preprocess.sh ${conf_path} || exit -1 + ./local/preprocess.sh ${conf_path} ${add_blank}|| exit -1 fi if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then @@ -32,5 +33,5 @@ fi if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then # synthesize_e2e, vocoder is pwgan - CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize_e2e.sh ${conf_path} ${train_output_path} ${ckpt_name} || exit -1 + CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize_e2e.sh ${conf_path} ${train_output_path} ${ckpt_name} ${add_blank}|| exit -1 fi diff --git a/examples/csmsc/voc3/conf/default.yaml b/examples/csmsc/voc3/conf/default.yaml index fbff54f19..a5ee17808 100644 --- a/examples/csmsc/voc3/conf/default.yaml +++ b/examples/csmsc/voc3/conf/default.yaml @@ -29,7 +29,7 @@ generator_params: out_channels: 4 # Number of output channels. kernel_size: 7 # Kernel size of initial and final conv layers. channels: 384 # Initial number of channels for conv layers. - upsample_scales: [5, 5, 3] # List of Upsampling scales. prod(upsample_scales) == n_shift + upsample_scales: [5, 5, 3] # List of Upsampling scales. prod(upsample_scales) x out_channels == n_shift stack_kernel_size: 3 # Kernel size of dilated conv layers in residual stack. stacks: 4 # Number of stacks in a single residual stack module. use_weight_norm: True # Whether to use weight normalization. diff --git a/examples/csmsc/voc3/conf/finetune.yaml b/examples/csmsc/voc3/conf/finetune.yaml index 0a38c2820..8c37ac302 100644 --- a/examples/csmsc/voc3/conf/finetune.yaml +++ b/examples/csmsc/voc3/conf/finetune.yaml @@ -29,7 +29,7 @@ generator_params: out_channels: 4 # Number of output channels. kernel_size: 7 # Kernel size of initial and final conv layers. channels: 384 # Initial number of channels for conv layers. 
- upsample_scales: [5, 5, 3] # List of Upsampling scales. prod(upsample_scales) == n_shift + upsample_scales: [5, 5, 3] # List of Upsampling scales. prod(upsample_scales) x out_channels == n_shift stack_kernel_size: 3 # Kernel size of dilated conv layers in residual stack. stacks: 4 # Number of stacks in a single residual stack module. use_weight_norm: True # Whether to use weight normalization. diff --git a/examples/ernie_sat/.meta/framework.png b/examples/ernie_sat/.meta/framework.png new file mode 100644 index 000000000..c68f62467 Binary files /dev/null and b/examples/ernie_sat/.meta/framework.png differ diff --git a/examples/ernie_sat/README.md b/examples/ernie_sat/README.md new file mode 100644 index 000000000..d3bd13372 --- /dev/null +++ b/examples/ernie_sat/README.md @@ -0,0 +1,137 @@ +ERNIE-SAT 是可以同时处理中英文的跨语言的语音-语言跨模态大模型,其在语音编辑、个性化语音合成以及跨语言的语音合成等多个任务取得了领先效果。可以应用于语音编辑、个性化合成、语音克隆、同传翻译等一系列场景,该项目供研究使用。 + +## 模型框架 +ERNIE-SAT 中我们提出了两项创新: +- 在预训练过程中将中英双语对应的音素作为输入,实现了跨语言、个性化的软音素映射 +- 采用语言和语音的联合掩码学习实现了语言和语音的对齐 + +[外链图片转存失败,源站可能有防盗链机制,建议将图片保存下来直接上传(img-3lOXKJXE-1655380879339)(.meta/framework.png)] + +## 使用说明 + +### 1.安装飞桨与环境依赖 + +- 本项目的代码基于 Paddle(version>=2.0) +- 本项目开放提供加载 torch 版本的 vocoder 的功能 + - torch version>=1.8 + +- 安装 htk: 在[官方地址](https://htk.eng.cam.ac.uk/)注册完成后,即可进行下载较新版本的 htk (例如 3.4.1)。同时提供[历史版本 htk 下载地址](https://htk.eng.cam.ac.uk/ftp/software/) + + - 1.注册账号,下载 htk + - 2.解压 htk 文件,**放入项目根目录的 tools 文件夹中, 以 htk 文件夹名称放入** + - 3.**注意**: 如果您下载的是 3.4.1 或者更高版本, 需要进入 HTKLib/HRec.c 文件中, **修改 1626 行和 1650 行**, 即把**以下两行的 dur<=0 都修改为 dur<0**,如下所示: + ```bash + 以htk3.4.1版本举例: + (1)第1626行: if (dur<=0 && labid != splabid) HError(8522,"LatFromPaths: Align have dur<=0"); + 修改为: if (dur<0 && labid != splabid) HError(8522,"LatFromPaths: Align have dur<0"); + + (2)1650行: if (dur<=0 && labid != splabid) HError(8522,"LatFromPaths: Align have dur<=0 "); + 修改为: if (dur<0 && labid != splabid) HError(8522,"LatFromPaths: Align have dur<0 "); + ``` + - 4.**编译**: 详情参见解压后的 htk 中的 README 文件(如果未编译, 则无法正常运行) + + + +- 安装 ParallelWaveGAN: 参见[官方地址](https://github.com/kan-bayashi/ParallelWaveGAN):按照该官方链接的安装流程,直接在**项目的根目录下** git clone ParallelWaveGAN 项目并且安装相关依赖即可。 + + +- 安装其他依赖: **sox, libsndfile**等 + +### 2.预训练模型 +预训练模型 ERNIE-SAT 的模型如下所示: +- [ERNIE-SAT_ZH](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/ernie_sat/old/model-ernie-sat-base-zh.tar.gz) +- [ERNIE-SAT_EN](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/ernie_sat/old/model-ernie-sat-base-en.tar.gz) +- [ERNIE-SAT_ZH_and_EN](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/ernie_sat/old/model-ernie-sat-base-en_zh.tar.gz) + + +创建 pretrained_model 文件夹,下载上述 ERNIE-SAT 预训练模型并将其解压: +```bash +mkdir pretrained_model +cd pretrained_model +tar -zxvf model-ernie-sat-base-en.tar.gz +tar -zxvf model-ernie-sat-base-zh.tar.gz +tar -zxvf model-ernie-sat-base-en_zh.tar.gz +``` + +### 3.下载 + +1. 本项目使用 parallel wavegan 作为声码器(vocoder): + - [pwg_aishell3_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_aishell3_ckpt_0.5.zip) + + 创建 download 文件夹,下载上述预训练的声码器(vocoder)模型并将其解压: + + ```bash + mkdir download + cd download + unzip pwg_aishell3_ckpt_0.5.zip + ``` + +2. 
本项目使用 [FastSpeech2](https://arxiv.org/abs/2006.04558) 作为音素(phoneme)的持续时间预测器: + - [fastspeech2_conformer_baker_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_conformer_baker_ckpt_0.5.zip) 中文场景下使用 + - [fastspeech2_nosil_ljspeech_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_ljspeech_ckpt_0.5.zip) 英文场景下使用 + + 下载上述预训练的 fastspeech2 模型并将其解压: + + ```bash + cd download + unzip fastspeech2_conformer_baker_ckpt_0.5.zip + unzip fastspeech2_nosil_ljspeech_ckpt_0.5.zip + ``` + +3. 本项目使用 HTK 获取输入音频和文本的对齐信息: + + - [aligner.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/ernie_sat/old/aligner.zip) + + 下载上述文件到 tools 文件夹并将其解压: + ```bash + cd tools + unzip aligner.zip + ``` + + +### 4.推理 + +本项目当前开源了语音编辑、个性化语音合成、跨语言语音合成的推理代码,后续会逐步开源。 +注:当前英文场下的合成语音采用的声码器默认为 vctk_parallel_wavegan.v1.long, 可在[该链接](https://github.com/kan-bayashi/ParallelWaveGAN)中找到; 若 use_pt_vocoder 参数设置为 False,则英文场景下使用 paddle 版本的声码器。 + +我们提供特定音频文件, 以及其对应的文本、音素相关文件: +- prompt_wav: 提供的音频文件 +- prompt/dev: 基于上述特定音频对应的文本、音素相关文件 + + +```text +prompt_wav +├── p299_096.wav # 样例语音文件1 +├── p243_313.wav # 样例语音文件2 +└── ... +``` + +```text +prompt/dev +├── text # 样例语音对应文本 +├── wav.scp # 样例语音路径 +├── mfa_text # 样例语音对应音素 +├── mfa_start # 样例语音中各个音素的开始时间 +└── mfa_end # 样例语音中各个音素的结束时间 +``` +1. `--am` 声学模型格式符合 {model_name}_{dataset} +2. `--am_config`, `--am_checkpoint`, `--am_stat` 和 `--phones_dict` 是声学模型的参数,对应于 fastspeech2 预训练模型中的 4 个文件。 +3. `--voc` 声码器(vocoder)格式是否符合 {model_name}_{dataset} +4. `--voc_config`, `--voc_checkpoint`, `--voc_stat` 是声码器的参数,对应于 parallel wavegan 预训练模型中的 3 个文件。 +5. `--lang` 对应模型的语言可以是 `zh` 或 `en` 。 +6. `--ngpu` 要使用的 GPU 数,如果 ngpu==0,则使用 cpu。 +7. `--model_name` 模型名称 +8. `--uid` 特定提示(prompt)语音的 id +9. `--new_str` 输入的文本(本次开源暂时先设置特定的文本) +10. `--prefix` 特定音频对应的文本、音素相关文件的地址 +11. `--source_lang` , 源语言 +12. `--target_lang` , 目标语言 +13. `--output_name` , 合成语音名称 +14. `--task_name` , 任务名称, 包括:语音编辑任务、个性化语音合成任务、跨语言语音合成任务 + +运行以下脚本即可进行实验 +```shell +./run_sedit_en.sh # 语音编辑任务(英文) +./run_gen_en.sh # 个性化语音合成任务(英文) +./run_clone_en_to_zh.sh # 跨语言语音合成任务(英文到中文的语音克隆) +``` diff --git a/examples/ernie_sat/local/align.py b/examples/ernie_sat/local/align.py new file mode 100755 index 000000000..ff47cac5b --- /dev/null +++ b/examples/ernie_sat/local/align.py @@ -0,0 +1,454 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" Usage: + align.py wavfile trsfile outwordfile outphonefile +""" +import os +import sys + +PHONEME = 'tools/aligner/english_envir/english2phoneme/phoneme' +MODEL_DIR_EN = 'tools/aligner/english' +MODEL_DIR_ZH = 'tools/aligner/mandarin' +HVITE = 'tools/htk/HTKTools/HVite' +HCOPY = 'tools/htk/HTKTools/HCopy' + + +def get_unk_phns(word_str: str): + tmpbase = '/tmp/tp.' 
+ f = open(tmpbase + 'temp.words', 'w') + f.write(word_str) + f.close() + os.system(PHONEME + ' ' + tmpbase + 'temp.words' + ' ' + tmpbase + + 'temp.phons') + f = open(tmpbase + 'temp.phons', 'r') + lines2 = f.readline().strip().split() + f.close() + phns = [] + for phn in lines2: + phons = phn.replace('\n', '').replace(' ', '') + seq = [] + j = 0 + while (j < len(phons)): + if (phons[j] > 'Z'): + if (phons[j] == 'j'): + seq.append('JH') + elif (phons[j] == 'h'): + seq.append('HH') + else: + seq.append(phons[j].upper()) + j += 1 + else: + p = phons[j:j + 2] + if (p == 'WH'): + seq.append('W') + elif (p in ['TH', 'SH', 'HH', 'DH', 'CH', 'ZH', 'NG']): + seq.append(p) + elif (p == 'AX'): + seq.append('AH0') + else: + seq.append(p + '1') + j += 2 + phns.extend(seq) + return phns + + +def words2phns(line: str): + ''' + Args: + line (str): input text. + eg: for that reason cover is impossible to be given. + Returns: + List[str]: phones of input text. + eg: + ['F', 'AO1', 'R', 'DH', 'AE1', 'T', 'R', 'IY1', 'Z', 'AH0', 'N', 'K', 'AH1', 'V', 'ER0', + 'IH1', 'Z', 'IH2', 'M', 'P', 'AA1', 'S', 'AH0', 'B', 'AH0', 'L', 'T', 'UW1', 'B', 'IY1', + 'G', 'IH1', 'V', 'AH0', 'N'] + + Dict(str, str): key - idx_word + value - phones + eg: + {'0_FOR': ['F', 'AO1', 'R'], '1_THAT': ['DH', 'AE1', 'T'], '2_REASON': ['R', 'IY1', 'Z', 'AH0', 'N'], + '3_COVER': ['K', 'AH1', 'V', 'ER0'], '4_IS': ['IH1', 'Z'], '5_IMPOSSIBLE': ['IH2', 'M', 'P', 'AA1', 'S', 'AH0', 'B', 'AH0', 'L'], + '6_TO': ['T', 'UW1'], '7_BE': ['B', 'IY1'], '8_GIVEN': ['G', 'IH1', 'V', 'AH0', 'N']} + ''' + dictfile = MODEL_DIR_EN + '/dict' + line = line.strip() + words = [] + for pun in [',', '.', ':', ';', '!', '?', '"', '(', ')', '--', '---']: + line = line.replace(pun, ' ') + for wrd in line.split(): + if (wrd[-1] == '-'): + wrd = wrd[:-1] + if (wrd[0] == "'"): + wrd = wrd[1:] + if wrd: + words.append(wrd) + ds = set([]) + word2phns_dict = {} + with open(dictfile, 'r') as fid: + for line in fid: + word = line.split()[0] + ds.add(word) + if word not in word2phns_dict.keys(): + word2phns_dict[word] = " ".join(line.split()[1:]) + + phns = [] + wrd2phns = {} + for index, wrd in enumerate(words): + if wrd == '[MASK]': + wrd2phns[str(index) + "_" + wrd] = [wrd] + phns.append(wrd) + elif (wrd.upper() not in ds): + wrd2phns[str(index) + "_" + wrd.upper()] = get_unk_phns(wrd) + phns.extend(get_unk_phns(wrd)) + else: + wrd2phns[str(index) + + "_" + wrd.upper()] = word2phns_dict[wrd.upper()].split() + phns.extend(word2phns_dict[wrd.upper()].split()) + return phns, wrd2phns + + +def words2phns_zh(line: str): + dictfile = MODEL_DIR_ZH + '/dict' + line = line.strip() + words = [] + for pun in [ + ',', '.', ':', ';', '!', '?', '"', '(', ')', '--', '---', u',', + u'。', u':', u';', u'!', u'?', u'(', u')' + ]: + line = line.replace(pun, ' ') + for wrd in line.split(): + if (wrd[-1] == '-'): + wrd = wrd[:-1] + if (wrd[0] == "'"): + wrd = wrd[1:] + if wrd: + words.append(wrd) + + ds = set([]) + word2phns_dict = {} + with open(dictfile, 'r') as fid: + for line in fid: + word = line.split()[0] + ds.add(word) + if word not in word2phns_dict.keys(): + word2phns_dict[word] = " ".join(line.split()[1:]) + + phns = [] + wrd2phns = {} + for index, wrd in enumerate(words): + if wrd == '[MASK]': + wrd2phns[str(index) + "_" + wrd] = [wrd] + phns.append(wrd) + elif (wrd.upper() not in ds): + print("出现非法词错误,请输入正确的文本...") + else: + wrd2phns[str(index) + "_" + wrd] = word2phns_dict[wrd].split() + phns.extend(word2phns_dict[wrd].split()) + + return phns, wrd2phns + + +def 
prep_txt_zh(line: str, tmpbase: str, dictfile: str): + + words = [] + line = line.strip() + for pun in [ + ',', '.', ':', ';', '!', '?', '"', '(', ')', '--', '---', u',', + u'。', u':', u';', u'!', u'?', u'(', u')' + ]: + line = line.replace(pun, ' ') + for wrd in line.split(): + if (wrd[-1] == '-'): + wrd = wrd[:-1] + if (wrd[0] == "'"): + wrd = wrd[1:] + if wrd: + words.append(wrd) + + ds = set([]) + with open(dictfile, 'r') as fid: + for line in fid: + ds.add(line.split()[0]) + + unk_words = set([]) + with open(tmpbase + '.txt', 'w') as fwid: + for wrd in words: + if (wrd not in ds): + unk_words.add(wrd) + fwid.write(wrd + ' ') + fwid.write('\n') + return unk_words + + +def prep_txt_en(line: str, tmpbase, dictfile): + + words = [] + + line = line.strip() + for pun in [',', '.', ':', ';', '!', '?', '"', '(', ')', '--', '---']: + line = line.replace(pun, ' ') + for wrd in line.split(): + if (wrd[-1] == '-'): + wrd = wrd[:-1] + if (wrd[0] == "'"): + wrd = wrd[1:] + if wrd: + words.append(wrd) + + ds = set([]) + with open(dictfile, 'r') as fid: + for line in fid: + ds.add(line.split()[0]) + + unk_words = set([]) + with open(tmpbase + '.txt', 'w') as fwid: + for wrd in words: + if (wrd.upper() not in ds): + unk_words.add(wrd.upper()) + fwid.write(wrd + ' ') + fwid.write('\n') + + #generate pronounciations for unknows words using 'letter to sound' + with open(tmpbase + '_unk.words', 'w') as fwid: + for unk in unk_words: + fwid.write(unk + '\n') + try: + os.system(PHONEME + ' ' + tmpbase + '_unk.words' + ' ' + tmpbase + + '_unk.phons') + except Exception: + print('english2phoneme error!') + sys.exit(1) + + #add unknown words to the standard dictionary, generate a tmp dictionary for alignment + fw = open(tmpbase + '.dict', 'w') + with open(dictfile, 'r') as fid: + for line in fid: + fw.write(line) + f = open(tmpbase + '_unk.words', 'r') + lines1 = f.readlines() + f.close() + f = open(tmpbase + '_unk.phons', 'r') + lines2 = f.readlines() + f.close() + for i in range(len(lines1)): + wrd = lines1[i].replace('\n', '') + phons = lines2[i].replace('\n', '').replace(' ', '') + seq = [] + j = 0 + while (j < len(phons)): + if (phons[j] > 'Z'): + if (phons[j] == 'j'): + seq.append('JH') + elif (phons[j] == 'h'): + seq.append('HH') + else: + seq.append(phons[j].upper()) + j += 1 + else: + p = phons[j:j + 2] + if (p == 'WH'): + seq.append('W') + elif (p in ['TH', 'SH', 'HH', 'DH', 'CH', 'ZH', 'NG']): + seq.append(p) + elif (p == 'AX'): + seq.append('AH0') + else: + seq.append(p + '1') + j += 2 + + fw.write(wrd + ' ') + for s in seq: + fw.write(' ' + s) + fw.write('\n') + fw.close() + + +def prep_mlf(txt: str, tmpbase: str): + + with open(tmpbase + '.mlf', 'w') as fwid: + fwid.write('#!MLF!#\n') + fwid.write('"' + tmpbase + '.lab"\n') + fwid.write('sp\n') + wrds = txt.split() + for wrd in wrds: + fwid.write(wrd.upper() + '\n') + fwid.write('sp\n') + fwid.write('.\n') + + +def _get_user(): + return os.path.expanduser('~').split("/")[-1] + + +def alignment(wav_path: str, text: str): + ''' + intervals: List[phn, start, end] + ''' + tmpbase = '/tmp/' + _get_user() + '_' + str(os.getpid()) + + #prepare wav and trs files + try: + os.system('sox ' + wav_path + ' -r 16000 ' + tmpbase + '.wav remix -') + except Exception: + print('sox error!') + return None + + #prepare clean_transcript file + try: + prep_txt_en(line=text, tmpbase=tmpbase, dictfile=MODEL_DIR_EN + '/dict') + except Exception: + print('prep_txt error!') + return None + + #prepare mlf file + try: + with open(tmpbase + '.txt', 'r') as fid: + txt = 
fid.readline() + prep_mlf(txt, tmpbase) + except Exception: + print('prep_mlf error!') + return None + + #prepare scp + try: + os.system(HCOPY + ' -C ' + MODEL_DIR_EN + '/16000/config ' + tmpbase + + '.wav' + ' ' + tmpbase + '.plp') + except Exception: + print('HCopy error!') + return None + + #run alignment + try: + os.system(HVITE + ' -a -m -t 10000.0 10000.0 100000.0 -I ' + tmpbase + + '.mlf -H ' + MODEL_DIR_EN + '/16000/macros -H ' + MODEL_DIR_EN + + '/16000/hmmdefs -i ' + tmpbase + '.aligned ' + tmpbase + + '.dict ' + MODEL_DIR_EN + '/monophones ' + tmpbase + + '.plp 2>&1 > /dev/null') + except Exception: + print('HVite error!') + return None + + with open(tmpbase + '.txt', 'r') as fid: + words = fid.readline().strip().split() + words = txt.strip().split() + words.reverse() + + with open(tmpbase + '.aligned', 'r') as fid: + lines = fid.readlines() + i = 2 + intervals = [] + word2phns = {} + current_word = '' + index = 0 + while (i < len(lines)): + splited_line = lines[i].strip().split() + if (len(splited_line) >= 4) and (splited_line[0] != splited_line[1]): + phn = splited_line[2] + pst = (int(splited_line[0]) / 1000 + 125) / 10000 + pen = (int(splited_line[1]) / 1000 + 125) / 10000 + intervals.append([phn, pst, pen]) + # splited_line[-1]!='sp' + if len(splited_line) == 5: + current_word = str(index) + '_' + splited_line[-1] + word2phns[current_word] = phn + index += 1 + elif len(splited_line) == 4: + word2phns[current_word] += ' ' + phn + i += 1 + return intervals, word2phns + + +def alignment_zh(wav_path: str, text: str): + tmpbase = '/tmp/' + _get_user() + '_' + str(os.getpid()) + + #prepare wav and trs files + try: + os.system('sox ' + wav_path + ' -r 16000 -b 16 ' + tmpbase + + '.wav remix -') + + except Exception: + print('sox error!') + return None + + #prepare clean_transcript file + try: + unk_words = prep_txt_zh( + line=text, tmpbase=tmpbase, dictfile=MODEL_DIR_ZH + '/dict') + if unk_words: + print('Error! 
Please add the following words to dictionary:') + for unk in unk_words: + print("非法words: ", unk) + except Exception: + print('prep_txt error!') + return None + + #prepare mlf file + try: + with open(tmpbase + '.txt', 'r') as fid: + txt = fid.readline() + prep_mlf(txt, tmpbase) + except Exception: + print('prep_mlf error!') + return None + + #prepare scp + try: + os.system(HCOPY + ' -C ' + MODEL_DIR_ZH + '/16000/config ' + tmpbase + + '.wav' + ' ' + tmpbase + '.plp') + except Exception: + print('HCopy error!') + return None + + #run alignment + try: + os.system(HVITE + ' -a -m -t 10000.0 10000.0 100000.0 -I ' + tmpbase + + '.mlf -H ' + MODEL_DIR_ZH + '/16000/macros -H ' + MODEL_DIR_ZH + + '/16000/hmmdefs -i ' + tmpbase + '.aligned ' + MODEL_DIR_ZH + + '/dict ' + MODEL_DIR_ZH + '/monophones ' + tmpbase + + '.plp 2>&1 > /dev/null') + + except Exception: + print('HVite error!') + return None + + with open(tmpbase + '.txt', 'r') as fid: + words = fid.readline().strip().split() + words = txt.strip().split() + words.reverse() + + with open(tmpbase + '.aligned', 'r') as fid: + lines = fid.readlines() + + i = 2 + intervals = [] + word2phns = {} + current_word = '' + index = 0 + while (i < len(lines)): + splited_line = lines[i].strip().split() + if (len(splited_line) >= 4) and (splited_line[0] != splited_line[1]): + phn = splited_line[2] + pst = (int(splited_line[0]) / 1000 + 125) / 10000 + pen = (int(splited_line[1]) / 1000 + 125) / 10000 + intervals.append([phn, pst, pen]) + # splited_line[-1]!='sp' + if len(splited_line) == 5: + current_word = str(index) + '_' + splited_line[-1] + word2phns[current_word] = phn + index += 1 + elif len(splited_line) == 4: + word2phns[current_word] += ' ' + phn + i += 1 + return intervals, word2phns diff --git a/examples/ernie_sat/local/inference.py b/examples/ernie_sat/local/inference.py new file mode 100644 index 000000000..e6a0788fd --- /dev/null +++ b/examples/ernie_sat/local/inference.py @@ -0,0 +1,609 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
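+# Overview: given a prompt utterance and an edited or appended text, this script aligns the +# original text to the audio with HTK (see align.py), locates the mel-frame span that has to be +# regenerated, predicts the masked frames with the ERNIE-SAT MLM model, converts only that span +# back to a waveform with the vocoder, and splices it into the original recording. It backs the +# speech editing, personalized synthesis and cross-lingual synthesis tasks described in the README.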
+import os +import random +from typing import Dict +from typing import List + +import librosa +import numpy as np +import paddle +import soundfile as sf +from align import alignment +from align import alignment_zh +from align import words2phns +from align import words2phns_zh +from paddle import nn +from sedit_arg_parser import parse_args +from utils import eval_durs +from utils import get_voc_out +from utils import is_chinese +from utils import load_num_sequence_text +from utils import read_2col_text + +from paddlespeech.t2s.datasets.am_batch_fn import build_mlm_collate_fn +from paddlespeech.t2s.models.ernie_sat.mlm import build_model_from_file + +random.seed(0) +np.random.seed(0) + + +def get_wav(wav_path: str, + source_lang: str='english', + target_lang: str='english', + model_name: str="paddle_checkpoint_en", + old_str: str="", + new_str: str="", + non_autoreg: bool=True): + wav_org, output_feat, old_span_bdy, new_span_bdy, fs, hop_length = get_mlm_output( + source_lang=source_lang, + target_lang=target_lang, + model_name=model_name, + wav_path=wav_path, + old_str=old_str, + new_str=new_str, + use_teacher_forcing=non_autoreg) + + masked_feat = output_feat[new_span_bdy[0]:new_span_bdy[1]] + + alt_wav = get_voc_out(masked_feat) + + old_time_bdy = [hop_length * x for x in old_span_bdy] + + wav_replaced = np.concatenate( + [wav_org[:old_time_bdy[0]], alt_wav, wav_org[old_time_bdy[1]:]]) + + data_dict = {"origin": wav_org, "output": wav_replaced} + + return data_dict + + +def load_model(model_name: str="paddle_checkpoint_en"): + config_path = './pretrained_model/{}/config.yaml'.format(model_name) + model_path = './pretrained_model/{}/model.pdparams'.format(model_name) + mlm_model, conf = build_model_from_file( + config_file=config_path, model_file=model_path) + return mlm_model, conf + + +def read_data(uid: str, prefix: os.PathLike): + # 获取 uid 对应的文本 + mfa_text = read_2col_text(prefix + '/text')[uid] + # 获取 uid 对应的音频路径 + mfa_wav_path = read_2col_text(prefix + '/wav.scp')[uid] + if not os.path.isabs(mfa_wav_path): + mfa_wav_path = prefix + mfa_wav_path + return mfa_text, mfa_wav_path + + +def get_align_data(uid: str, prefix: os.PathLike): + mfa_path = prefix + "mfa_" + mfa_text = read_2col_text(mfa_path + 'text')[uid] + mfa_start = load_num_sequence_text( + mfa_path + 'start', loader_type='text_float')[uid] + mfa_end = load_num_sequence_text( + mfa_path + 'end', loader_type='text_float')[uid] + mfa_wav_path = read_2col_text(mfa_path + 'wav.scp')[uid] + return mfa_text, mfa_start, mfa_end, mfa_wav_path + + +# 获取需要被 mask 的 mel 帧的范围 +def get_masked_mel_bdy(mfa_start: List[float], + mfa_end: List[float], + fs: int, + hop_length: int, + span_to_repl: List[List[int]]): + align_start = np.array(mfa_start) + align_end = np.array(mfa_end) + align_start = np.floor(fs * align_start / hop_length).astype('int') + align_end = np.floor(fs * align_end / hop_length).astype('int') + if span_to_repl[0] >= len(mfa_start): + span_bdy = [align_end[-1], align_end[-1]] + else: + span_bdy = [ + align_start[span_to_repl[0]], align_end[span_to_repl[1] - 1] + ] + return span_bdy, align_start, align_end + + +def recover_dict(word2phns: Dict[str, str], tp_word2phns: Dict[str, str]): + dic = {} + keys_to_del = [] + exist_idx = [] + sp_count = 0 + add_sp_count = 0 + for key in word2phns.keys(): + idx, wrd = key.split('_') + if wrd == 'sp': + sp_count += 1 + exist_idx.append(int(idx)) + else: + keys_to_del.append(key) + + for key in keys_to_del: + del word2phns[key] + + cur_id = 0 + for key in tp_word2phns.keys(): + if 
cur_id in exist_idx: + dic[str(cur_id) + "_sp"] = 'sp' + cur_id += 1 + add_sp_count += 1 + idx, wrd = key.split('_') + dic[str(cur_id) + "_" + wrd] = tp_word2phns[key] + cur_id += 1 + + if add_sp_count + 1 == sp_count: + dic[str(cur_id) + "_sp"] = 'sp' + add_sp_count += 1 + + assert add_sp_count == sp_count, "sp are not added in dic" + return dic + + +def get_max_idx(dic): + return sorted([int(key.split('_')[0]) for key in dic.keys()])[-1] + + +def get_phns_and_spans(wav_path: str, + old_str: str="", + new_str: str="", + source_lang: str="english", + target_lang: str="english"): + is_append = (old_str == new_str[:len(old_str)]) + old_phns, mfa_start, mfa_end = [], [], [] + # source + if source_lang == "english": + intervals, word2phns = alignment(wav_path, old_str) + elif source_lang == "chinese": + intervals, word2phns = alignment_zh(wav_path, old_str) + _, tp_word2phns = words2phns_zh(old_str) + + for key, value in tp_word2phns.items(): + idx, wrd = key.split('_') + cur_val = " ".join(value) + tp_word2phns[key] = cur_val + + word2phns = recover_dict(word2phns, tp_word2phns) + else: + assert source_lang == "chinese" or source_lang == "english", \ + "source_lang is wrong..." + + for item in intervals: + old_phns.append(item[0]) + mfa_start.append(float(item[1])) + mfa_end.append(float(item[2])) + # target + if is_append and (source_lang != target_lang): + cross_lingual_clone = True + else: + cross_lingual_clone = False + + if cross_lingual_clone: + str_origin = new_str[:len(old_str)] + str_append = new_str[len(old_str):] + + if target_lang == "chinese": + phns_origin, origin_word2phns = words2phns(str_origin) + phns_append, append_word2phns_tmp = words2phns_zh(str_append) + + elif target_lang == "english": + # 原始句子 + phns_origin, origin_word2phns = words2phns_zh(str_origin) + # clone 句子 + phns_append, append_word2phns_tmp = words2phns(str_append) + else: + assert target_lang == "chinese" or target_lang == "english", \ + "cloning is not support for this language, please check it." + + new_phns = phns_origin + phns_append + + append_word2phns = {} + length = len(origin_word2phns) + for key, value in append_word2phns_tmp.items(): + idx, wrd = key.split('_') + append_word2phns[str(int(idx) + length) + '_' + wrd] = value + new_word2phns = origin_word2phns.copy() + new_word2phns.update(append_word2phns) + + else: + if source_lang == target_lang and target_lang == "english": + new_phns, new_word2phns = words2phns(new_str) + elif source_lang == target_lang and target_lang == "chinese": + new_phns, new_word2phns = words2phns_zh(new_str) + else: + assert source_lang == target_lang, \ + "source language is not same with target language..." 
+ + span_to_repl = [0, len(old_phns) - 1] + span_to_add = [0, len(new_phns) - 1] + left_idx = 0 + new_phns_left = [] + sp_count = 0 + # find the left different index + for key in word2phns.keys(): + idx, wrd = key.split('_') + if wrd == 'sp': + sp_count += 1 + new_phns_left.append('sp') + else: + idx = str(int(idx) - sp_count) + if idx + '_' + wrd in new_word2phns: + left_idx += len(new_word2phns[idx + '_' + wrd]) + new_phns_left.extend(word2phns[key].split()) + else: + span_to_repl[0] = len(new_phns_left) + span_to_add[0] = len(new_phns_left) + break + + # reverse word2phns and new_word2phns + right_idx = 0 + new_phns_right = [] + sp_count = 0 + word2phns_max_idx = get_max_idx(word2phns) + new_word2phns_max_idx = get_max_idx(new_word2phns) + new_phns_mid = [] + if is_append: + new_phns_right = [] + new_phns_mid = new_phns[left_idx:] + span_to_repl[0] = len(new_phns_left) + span_to_add[0] = len(new_phns_left) + span_to_add[1] = len(new_phns_left) + len(new_phns_mid) + span_to_repl[1] = len(old_phns) - len(new_phns_right) + # speech edit + else: + for key in list(word2phns.keys())[::-1]: + idx, wrd = key.split('_') + if wrd == 'sp': + sp_count += 1 + new_phns_right = ['sp'] + new_phns_right + else: + idx = str(new_word2phns_max_idx - (word2phns_max_idx - int(idx) + - sp_count)) + if idx + '_' + wrd in new_word2phns: + right_idx -= len(new_word2phns[idx + '_' + wrd]) + new_phns_right = word2phns[key].split() + new_phns_right + else: + span_to_repl[1] = len(old_phns) - len(new_phns_right) + new_phns_mid = new_phns[left_idx:right_idx] + span_to_add[1] = len(new_phns_left) + len(new_phns_mid) + if len(new_phns_mid) == 0: + span_to_add[1] = min(span_to_add[1] + 1, len(new_phns)) + span_to_add[0] = max(0, span_to_add[0] - 1) + span_to_repl[0] = max(0, span_to_repl[0] - 1) + span_to_repl[1] = min(span_to_repl[1] + 1, + len(old_phns)) + break + new_phns = new_phns_left + new_phns_mid + new_phns_right + ''' + For that reason cover should not be given. + For that reason cover is impossible to be given. 
+ span_to_repl: [17, 23] "should not" + span_to_add: [17, 30] "is impossible to" + ''' + return mfa_start, mfa_end, old_phns, new_phns, span_to_repl, span_to_add + + +# mfa 获得的 duration 和 fs2 的 duration_predictor 获取的 duration 可能不同 +# 此处获得一个缩放比例, 用于预测值和真实值之间的缩放 +def get_dur_adj_factor(orig_dur: List[int], + pred_dur: List[int], + phns: List[str]): + length = 0 + factor_list = [] + for orig, pred, phn in zip(orig_dur, pred_dur, phns): + if pred == 0 or phn == 'sp': + continue + else: + factor_list.append(orig / pred) + factor_list = np.array(factor_list) + factor_list.sort() + if len(factor_list) < 5: + return 1 + length = 2 + avg = np.average(factor_list[length:-length]) + return avg + + +def prep_feats_with_dur(wav_path: str, + source_lang: str="English", + target_lang: str="English", + old_str: str="", + new_str: str="", + mask_reconstruct: bool=False, + duration_adjust: bool=True, + start_end_sp: bool=False, + fs: int=24000, + hop_length: int=300): + ''' + Returns: + np.ndarray: new wav, replace the part to be edited in original wav with 0 + List[str]: new phones + List[float]: mfa start of new wav + List[float]: mfa end of new wav + List[int]: masked mel boundary of original wav + List[int]: masked mel boundary of new wav + ''' + wav_org, _ = librosa.load(wav_path, sr=fs) + + mfa_start, mfa_end, old_phns, new_phns, span_to_repl, span_to_add = get_phns_and_spans( + wav_path=wav_path, + old_str=old_str, + new_str=new_str, + source_lang=source_lang, + target_lang=target_lang) + + if start_end_sp: + if new_phns[-1] != 'sp': + new_phns = new_phns + ['sp'] + # 中文的 phns 不一定都在 fastspeech2 的字典里, 用 sp 代替 + if target_lang == "english" or target_lang == "chinese": + old_durs = eval_durs(old_phns, target_lang=source_lang) + else: + assert target_lang == "chinese" or target_lang == "english", \ + "calculate duration_predict is not support for this language..." + + orig_old_durs = [e - s for e, s in zip(mfa_end, mfa_start)] + if '[MASK]' in new_str: + new_phns = old_phns + span_to_add = span_to_repl + d_factor_left = get_dur_adj_factor( + orig_dur=orig_old_durs[:span_to_repl[0]], + pred_dur=old_durs[:span_to_repl[0]], + phns=old_phns[:span_to_repl[0]]) + d_factor_right = get_dur_adj_factor( + orig_dur=orig_old_durs[span_to_repl[1]:], + pred_dur=old_durs[span_to_repl[1]:], + phns=old_phns[span_to_repl[1]:]) + d_factor = (d_factor_left + d_factor_right) / 2 + new_durs_adjusted = [d_factor * i for i in old_durs] + else: + if duration_adjust: + d_factor = get_dur_adj_factor( + orig_dur=orig_old_durs, pred_dur=old_durs, phns=old_phns) + d_factor = d_factor * 1.25 + else: + d_factor = 1 + + if target_lang == "english" or target_lang == "chinese": + new_durs = eval_durs(new_phns, target_lang=target_lang) + else: + assert target_lang == "chinese" or target_lang == "english", \ + "calculate duration_predict is not support for this language..." 
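+    # The FastSpeech2 durations predicted for new_phns are rescaled by
+    # d_factor, the trimmed average ratio of MFA durations to predicted
+    # durations computed above, so the inserted span roughly keeps the
+    # original speaking rate. Everything after the edited span is then
+    # shifted by dur_offset, the difference between the new and old span
+    # lengths. Hypothetical numbers for illustration: if the old span lasted
+    # 0.50 s and the rescaled new span lasts 0.65 s, dur_offset is 0.15 s and
+    # all later MFA start/end times move 0.15 s later.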
+ + new_durs_adjusted = [d_factor * i for i in new_durs] + + new_span_dur_sum = sum(new_durs_adjusted[span_to_add[0]:span_to_add[1]]) + old_span_dur_sum = sum(orig_old_durs[span_to_repl[0]:span_to_repl[1]]) + dur_offset = new_span_dur_sum - old_span_dur_sum + new_mfa_start = mfa_start[:span_to_repl[0]] + new_mfa_end = mfa_end[:span_to_repl[0]] + for i in new_durs_adjusted[span_to_add[0]:span_to_add[1]]: + if len(new_mfa_end) == 0: + new_mfa_start.append(0) + new_mfa_end.append(i) + else: + new_mfa_start.append(new_mfa_end[-1]) + new_mfa_end.append(new_mfa_end[-1] + i) + new_mfa_start += [i + dur_offset for i in mfa_start[span_to_repl[1]:]] + new_mfa_end += [i + dur_offset for i in mfa_end[span_to_repl[1]:]] + + # 3. get new wav + # 在原始句子后拼接 + if span_to_repl[0] >= len(mfa_start): + left_idx = len(wav_org) + right_idx = left_idx + # 在原始句子中间替换 + else: + left_idx = int(np.floor(mfa_start[span_to_repl[0]] * fs)) + right_idx = int(np.ceil(mfa_end[span_to_repl[1] - 1] * fs)) + blank_wav = np.zeros( + (int(np.ceil(new_span_dur_sum * fs)), ), dtype=wav_org.dtype) + # 原始音频,需要编辑的部分替换成空音频,空音频的时间由 fs2 的 duration_predictor 决定 + new_wav = np.concatenate( + [wav_org[:left_idx], blank_wav, wav_org[right_idx:]]) + + # 4. get old and new mel span to be mask + # [92, 92] + + old_span_bdy, mfa_start, mfa_end = get_masked_mel_bdy( + mfa_start=mfa_start, + mfa_end=mfa_end, + fs=fs, + hop_length=hop_length, + span_to_repl=span_to_repl) + # [92, 174] + # new_mfa_start, new_mfa_end 时间级别的开始和结束时间 -> 帧级别 + new_span_bdy, new_mfa_start, new_mfa_end = get_masked_mel_bdy( + mfa_start=new_mfa_start, + mfa_end=new_mfa_end, + fs=fs, + hop_length=hop_length, + span_to_repl=span_to_add) + + # old_span_bdy, new_span_bdy 是帧级别的范围 + return new_wav, new_phns, new_mfa_start, new_mfa_end, old_span_bdy, new_span_bdy + + +def prep_feats(wav_path: str, + source_lang: str="english", + target_lang: str="english", + old_str: str="", + new_str: str="", + duration_adjust: bool=True, + start_end_sp: bool=False, + mask_reconstruct: bool=False, + fs: int=24000, + hop_length: int=300, + token_list: List[str]=[]): + wav, phns, mfa_start, mfa_end, old_span_bdy, new_span_bdy = prep_feats_with_dur( + source_lang=source_lang, + target_lang=target_lang, + old_str=old_str, + new_str=new_str, + wav_path=wav_path, + duration_adjust=duration_adjust, + start_end_sp=start_end_sp, + mask_reconstruct=mask_reconstruct, + fs=fs, + hop_length=hop_length) + + token_to_id = {item: i for i, item in enumerate(token_list)} + text = np.array( + list(map(lambda x: token_to_id.get(x, token_to_id['']), phns))) + span_bdy = np.array(new_span_bdy) + + batch = [('1', { + "speech": wav, + "align_start": mfa_start, + "align_end": mfa_end, + "text": text, + "span_bdy": span_bdy + })] + + return batch, old_span_bdy, new_span_bdy + + +def decode_with_model(mlm_model: nn.Layer, + collate_fn, + wav_path: str, + source_lang: str="english", + target_lang: str="english", + old_str: str="", + new_str: str="", + use_teacher_forcing: bool=False, + duration_adjust: bool=True, + start_end_sp: bool=False, + fs: int=24000, + hop_length: int=300, + token_list: List[str]=[]): + batch, old_span_bdy, new_span_bdy = prep_feats( + source_lang=source_lang, + target_lang=target_lang, + wav_path=wav_path, + old_str=old_str, + new_str=new_str, + duration_adjust=duration_adjust, + start_end_sp=start_end_sp, + fs=fs, + hop_length=hop_length, + token_list=token_list) + + feats = collate_fn(batch)[1] + + if 'text_masked_pos' in feats.keys(): + feats.pop('text_masked_pos') + + output = 
mlm_model.inference( + text=feats['text'], + speech=feats['speech'], + masked_pos=feats['masked_pos'], + speech_mask=feats['speech_mask'], + text_mask=feats['text_mask'], + speech_seg_pos=feats['speech_seg_pos'], + text_seg_pos=feats['text_seg_pos'], + span_bdy=new_span_bdy, + use_teacher_forcing=use_teacher_forcing) + + # 拼接音频 + output_feat = paddle.concat(x=output, axis=0) + wav_org, _ = librosa.load(wav_path, sr=fs) + return wav_org, output_feat, old_span_bdy, new_span_bdy, fs, hop_length + + +def get_mlm_output(wav_path: str, + model_name: str="paddle_checkpoint_en", + source_lang: str="english", + target_lang: str="english", + old_str: str="", + new_str: str="", + use_teacher_forcing: bool=False, + duration_adjust: bool=True, + start_end_sp: bool=False): + mlm_model, train_conf = load_model(model_name) + mlm_model.eval() + + collate_fn = build_mlm_collate_fn( + sr=train_conf.feats_extract_conf['fs'], + n_fft=train_conf.feats_extract_conf['n_fft'], + hop_length=train_conf.feats_extract_conf['hop_length'], + win_length=train_conf.feats_extract_conf['win_length'], + n_mels=train_conf.feats_extract_conf['n_mels'], + fmin=train_conf.feats_extract_conf['fmin'], + fmax=train_conf.feats_extract_conf['fmax'], + mlm_prob=train_conf['mlm_prob'], + mean_phn_span=train_conf['mean_phn_span'], + seg_emb=train_conf.encoder_conf['input_layer'] == 'sega_mlm') + + return decode_with_model( + source_lang=source_lang, + target_lang=target_lang, + mlm_model=mlm_model, + collate_fn=collate_fn, + wav_path=wav_path, + old_str=old_str, + new_str=new_str, + use_teacher_forcing=use_teacher_forcing, + duration_adjust=duration_adjust, + start_end_sp=start_end_sp, + fs=train_conf.feats_extract_conf['fs'], + hop_length=train_conf.feats_extract_conf['hop_length'], + token_list=train_conf.token_list) + + +def evaluate(uid: str, + source_lang: str="english", + target_lang: str="english", + prefix: os.PathLike="./prompt/dev/", + model_name: str="paddle_checkpoint_en", + new_str: str="", + prompt_decoding: bool=False, + task_name: str=None): + + # get origin text and path of origin wav + old_str, wav_path = read_data(uid=uid, prefix=prefix) + + if task_name == 'edit': + new_str = new_str + elif task_name == 'synthesize': + new_str = old_str + new_str + else: + new_str = old_str + ' '.join([ch for ch in new_str if is_chinese(ch)]) + + print('new_str is ', new_str) + + results_dict = get_wav( + source_lang=source_lang, + target_lang=target_lang, + model_name=model_name, + wav_path=wav_path, + old_str=old_str, + new_str=new_str) + return results_dict + + +if __name__ == "__main__": + # parse config and args + args = parse_args() + + data_dict = evaluate( + uid=args.uid, + source_lang=args.source_lang, + target_lang=args.target_lang, + prefix=args.prefix, + model_name=args.model_name, + new_str=args.new_str, + task_name=args.task_name) + sf.write(args.output_name, data_dict['output'], samplerate=24000) + print("finished...") diff --git a/examples/ernie_sat/local/inference_new.py b/examples/ernie_sat/local/inference_new.py new file mode 100644 index 000000000..525967eb1 --- /dev/null +++ b/examples/ernie_sat/local/inference_new.py @@ -0,0 +1,622 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import os +import random +from typing import Dict +from typing import List + +import librosa +import numpy as np +import paddle +import soundfile as sf +import yaml +from align import alignment +from align import alignment_zh +from align import words2phns +from align import words2phns_zh +from paddle import nn +from sedit_arg_parser import parse_args +from utils import eval_durs +from utils import get_voc_out +from utils import is_chinese +from utils import load_num_sequence_text +from utils import read_2col_text +from yacs.config import CfgNode + +from paddlespeech.t2s.datasets.am_batch_fn import build_mlm_collate_fn +from paddlespeech.t2s.models.ernie_sat.ernie_sat import ErnieSAT + +random.seed(0) +np.random.seed(0) + + +def get_wav(wav_path: str, + source_lang: str='english', + target_lang: str='english', + model_name: str="paddle_checkpoint_en", + old_str: str="", + new_str: str="", + non_autoreg: bool=True): + wav_org, output_feat, old_span_bdy, new_span_bdy, fs, hop_length = get_mlm_output( + source_lang=source_lang, + target_lang=target_lang, + model_name=model_name, + wav_path=wav_path, + old_str=old_str, + new_str=new_str, + use_teacher_forcing=non_autoreg) + + masked_feat = output_feat[new_span_bdy[0]:new_span_bdy[1]] + + alt_wav = get_voc_out(masked_feat) + + old_time_bdy = [hop_length * x for x in old_span_bdy] + + wav_replaced = np.concatenate( + [wav_org[:old_time_bdy[0]], alt_wav, wav_org[old_time_bdy[1]:]]) + + data_dict = {"origin": wav_org, "output": wav_replaced} + + return data_dict + + +def load_model(model_name: str="paddle_checkpoint_en"): + config_path = './pretrained_model/{}/default.yaml'.format(model_name) + model_path = './pretrained_model/{}/model.pdparams'.format(model_name) + with open(config_path) as f: + conf = CfgNode(yaml.safe_load(f)) + token_list = list(conf.token_list) + vocab_size = len(token_list) + odim = conf.n_mels + mlm_model = ErnieSAT(idim=vocab_size, odim=odim, **conf["model"]) + state_dict = paddle.load(model_path) + new_state_dict = {} + for key, value in state_dict.items(): + new_key = "model." 
+ key + new_state_dict[new_key] = value + mlm_model.set_state_dict(new_state_dict) + mlm_model.eval() + + return mlm_model, conf + + +def read_data(uid: str, prefix: os.PathLike): + # 获取 uid 对应的文本 + mfa_text = read_2col_text(prefix + '/text')[uid] + # 获取 uid 对应的音频路径 + mfa_wav_path = read_2col_text(prefix + '/wav.scp')[uid] + if not os.path.isabs(mfa_wav_path): + mfa_wav_path = prefix + mfa_wav_path + return mfa_text, mfa_wav_path + + +def get_align_data(uid: str, prefix: os.PathLike): + mfa_path = prefix + "mfa_" + mfa_text = read_2col_text(mfa_path + 'text')[uid] + mfa_start = load_num_sequence_text( + mfa_path + 'start', loader_type='text_float')[uid] + mfa_end = load_num_sequence_text( + mfa_path + 'end', loader_type='text_float')[uid] + mfa_wav_path = read_2col_text(mfa_path + 'wav.scp')[uid] + return mfa_text, mfa_start, mfa_end, mfa_wav_path + + +# 获取需要被 mask 的 mel 帧的范围 +def get_masked_mel_bdy(mfa_start: List[float], + mfa_end: List[float], + fs: int, + hop_length: int, + span_to_repl: List[List[int]]): + align_start = np.array(mfa_start) + align_end = np.array(mfa_end) + align_start = np.floor(fs * align_start / hop_length).astype('int') + align_end = np.floor(fs * align_end / hop_length).astype('int') + if span_to_repl[0] >= len(mfa_start): + span_bdy = [align_end[-1], align_end[-1]] + else: + span_bdy = [ + align_start[span_to_repl[0]], align_end[span_to_repl[1] - 1] + ] + return span_bdy, align_start, align_end + + +def recover_dict(word2phns: Dict[str, str], tp_word2phns: Dict[str, str]): + dic = {} + keys_to_del = [] + exist_idx = [] + sp_count = 0 + add_sp_count = 0 + for key in word2phns.keys(): + idx, wrd = key.split('_') + if wrd == 'sp': + sp_count += 1 + exist_idx.append(int(idx)) + else: + keys_to_del.append(key) + + for key in keys_to_del: + del word2phns[key] + + cur_id = 0 + for key in tp_word2phns.keys(): + if cur_id in exist_idx: + dic[str(cur_id) + "_sp"] = 'sp' + cur_id += 1 + add_sp_count += 1 + idx, wrd = key.split('_') + dic[str(cur_id) + "_" + wrd] = tp_word2phns[key] + cur_id += 1 + + if add_sp_count + 1 == sp_count: + dic[str(cur_id) + "_sp"] = 'sp' + add_sp_count += 1 + + assert add_sp_count == sp_count, "sp are not added in dic" + return dic + + +def get_max_idx(dic): + return sorted([int(key.split('_')[0]) for key in dic.keys()])[-1] + + +def get_phns_and_spans(wav_path: str, + old_str: str="", + new_str: str="", + source_lang: str="english", + target_lang: str="english"): + is_append = (old_str == new_str[:len(old_str)]) + old_phns, mfa_start, mfa_end = [], [], [] + # source + if source_lang == "english": + intervals, word2phns = alignment(wav_path, old_str) + elif source_lang == "chinese": + intervals, word2phns = alignment_zh(wav_path, old_str) + _, tp_word2phns = words2phns_zh(old_str) + + for key, value in tp_word2phns.items(): + idx, wrd = key.split('_') + cur_val = " ".join(value) + tp_word2phns[key] = cur_val + + word2phns = recover_dict(word2phns, tp_word2phns) + else: + assert source_lang == "chinese" or source_lang == "english", \ + "source_lang is wrong..." 
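+    # Each MFA interval collected below holds one phoneme of old_str together
+    # with its start and end time in seconds. is_append combined with a
+    # source/target language mismatch marks the cross-lingual cloning case,
+    # where new_str extends old_str in the other language and the two parts
+    # are converted to phonemes with different front ends.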
+ + for item in intervals: + old_phns.append(item[0]) + mfa_start.append(float(item[1])) + mfa_end.append(float(item[2])) + # target + if is_append and (source_lang != target_lang): + cross_lingual_clone = True + else: + cross_lingual_clone = False + + if cross_lingual_clone: + str_origin = new_str[:len(old_str)] + str_append = new_str[len(old_str):] + + if target_lang == "chinese": + phns_origin, origin_word2phns = words2phns(str_origin) + phns_append, append_word2phns_tmp = words2phns_zh(str_append) + + elif target_lang == "english": + # 原始句子 + phns_origin, origin_word2phns = words2phns_zh(str_origin) + # clone 句子 + phns_append, append_word2phns_tmp = words2phns(str_append) + else: + assert target_lang == "chinese" or target_lang == "english", \ + "cloning is not support for this language, please check it." + + new_phns = phns_origin + phns_append + + append_word2phns = {} + length = len(origin_word2phns) + for key, value in append_word2phns_tmp.items(): + idx, wrd = key.split('_') + append_word2phns[str(int(idx) + length) + '_' + wrd] = value + new_word2phns = origin_word2phns.copy() + new_word2phns.update(append_word2phns) + + else: + if source_lang == target_lang and target_lang == "english": + new_phns, new_word2phns = words2phns(new_str) + elif source_lang == target_lang and target_lang == "chinese": + new_phns, new_word2phns = words2phns_zh(new_str) + else: + assert source_lang == target_lang, \ + "source language is not same with target language..." + + span_to_repl = [0, len(old_phns) - 1] + span_to_add = [0, len(new_phns) - 1] + left_idx = 0 + new_phns_left = [] + sp_count = 0 + # find the left different index + for key in word2phns.keys(): + idx, wrd = key.split('_') + if wrd == 'sp': + sp_count += 1 + new_phns_left.append('sp') + else: + idx = str(int(idx) - sp_count) + if idx + '_' + wrd in new_word2phns: + left_idx += len(new_word2phns[idx + '_' + wrd]) + new_phns_left.extend(word2phns[key].split()) + else: + span_to_repl[0] = len(new_phns_left) + span_to_add[0] = len(new_phns_left) + break + + # reverse word2phns and new_word2phns + right_idx = 0 + new_phns_right = [] + sp_count = 0 + word2phns_max_idx = get_max_idx(word2phns) + new_word2phns_max_idx = get_max_idx(new_word2phns) + new_phns_mid = [] + if is_append: + new_phns_right = [] + new_phns_mid = new_phns[left_idx:] + span_to_repl[0] = len(new_phns_left) + span_to_add[0] = len(new_phns_left) + span_to_add[1] = len(new_phns_left) + len(new_phns_mid) + span_to_repl[1] = len(old_phns) - len(new_phns_right) + # speech edit + else: + for key in list(word2phns.keys())[::-1]: + idx, wrd = key.split('_') + if wrd == 'sp': + sp_count += 1 + new_phns_right = ['sp'] + new_phns_right + else: + idx = str(new_word2phns_max_idx - (word2phns_max_idx - int(idx) + - sp_count)) + if idx + '_' + wrd in new_word2phns: + right_idx -= len(new_word2phns[idx + '_' + wrd]) + new_phns_right = word2phns[key].split() + new_phns_right + else: + span_to_repl[1] = len(old_phns) - len(new_phns_right) + new_phns_mid = new_phns[left_idx:right_idx] + span_to_add[1] = len(new_phns_left) + len(new_phns_mid) + if len(new_phns_mid) == 0: + span_to_add[1] = min(span_to_add[1] + 1, len(new_phns)) + span_to_add[0] = max(0, span_to_add[0] - 1) + span_to_repl[0] = max(0, span_to_repl[0] - 1) + span_to_repl[1] = min(span_to_repl[1] + 1, + len(old_phns)) + break + new_phns = new_phns_left + new_phns_mid + new_phns_right + ''' + For that reason cover should not be given. + For that reason cover is impossible to be given. 
+ span_to_repl: [17, 23] "should not" + span_to_add: [17, 30] "is impossible to" + ''' + return mfa_start, mfa_end, old_phns, new_phns, span_to_repl, span_to_add + + +# mfa 获得的 duration 和 fs2 的 duration_predictor 获取的 duration 可能不同 +# 此处获得一个缩放比例, 用于预测值和真实值之间的缩放 +def get_dur_adj_factor(orig_dur: List[int], + pred_dur: List[int], + phns: List[str]): + length = 0 + factor_list = [] + for orig, pred, phn in zip(orig_dur, pred_dur, phns): + if pred == 0 or phn == 'sp': + continue + else: + factor_list.append(orig / pred) + factor_list = np.array(factor_list) + factor_list.sort() + if len(factor_list) < 5: + return 1 + length = 2 + avg = np.average(factor_list[length:-length]) + return avg + + +def prep_feats_with_dur(wav_path: str, + source_lang: str="English", + target_lang: str="English", + old_str: str="", + new_str: str="", + mask_reconstruct: bool=False, + duration_adjust: bool=True, + start_end_sp: bool=False, + fs: int=24000, + hop_length: int=300): + ''' + Returns: + np.ndarray: new wav, replace the part to be edited in original wav with 0 + List[str]: new phones + List[float]: mfa start of new wav + List[float]: mfa end of new wav + List[int]: masked mel boundary of original wav + List[int]: masked mel boundary of new wav + ''' + wav_org, _ = librosa.load(wav_path, sr=fs) + + mfa_start, mfa_end, old_phns, new_phns, span_to_repl, span_to_add = get_phns_and_spans( + wav_path=wav_path, + old_str=old_str, + new_str=new_str, + source_lang=source_lang, + target_lang=target_lang) + + if start_end_sp: + if new_phns[-1] != 'sp': + new_phns = new_phns + ['sp'] + # 中文的 phns 不一定都在 fastspeech2 的字典里, 用 sp 代替 + if target_lang == "english" or target_lang == "chinese": + old_durs = eval_durs(old_phns, target_lang=source_lang) + else: + assert target_lang == "chinese" or target_lang == "english", \ + "calculate duration_predict is not support for this language..." + + orig_old_durs = [e - s for e, s in zip(mfa_end, mfa_start)] + if '[MASK]' in new_str: + new_phns = old_phns + span_to_add = span_to_repl + d_factor_left = get_dur_adj_factor( + orig_dur=orig_old_durs[:span_to_repl[0]], + pred_dur=old_durs[:span_to_repl[0]], + phns=old_phns[:span_to_repl[0]]) + d_factor_right = get_dur_adj_factor( + orig_dur=orig_old_durs[span_to_repl[1]:], + pred_dur=old_durs[span_to_repl[1]:], + phns=old_phns[span_to_repl[1]:]) + d_factor = (d_factor_left + d_factor_right) / 2 + new_durs_adjusted = [d_factor * i for i in old_durs] + else: + if duration_adjust: + d_factor = get_dur_adj_factor( + orig_dur=orig_old_durs, pred_dur=old_durs, phns=old_phns) + d_factor = d_factor * 1.25 + else: + d_factor = 1 + + if target_lang == "english" or target_lang == "chinese": + new_durs = eval_durs(new_phns, target_lang=target_lang) + else: + assert target_lang == "chinese" or target_lang == "english", \ + "calculate duration_predict is not support for this language..." 
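+    # new_mfa_start / new_mfa_end built below form a synthetic timeline (in
+    # seconds) for the edited utterance: predicted durations fill the edited
+    # span and the untouched tail is shifted by dur_offset. They are later
+    # converted to frame indices by get_masked_mel_bdy using fs and
+    # hop_length.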
+ + new_durs_adjusted = [d_factor * i for i in new_durs] + + new_span_dur_sum = sum(new_durs_adjusted[span_to_add[0]:span_to_add[1]]) + old_span_dur_sum = sum(orig_old_durs[span_to_repl[0]:span_to_repl[1]]) + dur_offset = new_span_dur_sum - old_span_dur_sum + new_mfa_start = mfa_start[:span_to_repl[0]] + new_mfa_end = mfa_end[:span_to_repl[0]] + for i in new_durs_adjusted[span_to_add[0]:span_to_add[1]]: + if len(new_mfa_end) == 0: + new_mfa_start.append(0) + new_mfa_end.append(i) + else: + new_mfa_start.append(new_mfa_end[-1]) + new_mfa_end.append(new_mfa_end[-1] + i) + new_mfa_start += [i + dur_offset for i in mfa_start[span_to_repl[1]:]] + new_mfa_end += [i + dur_offset for i in mfa_end[span_to_repl[1]:]] + + # 3. get new wav + # 在原始句子后拼接 + if span_to_repl[0] >= len(mfa_start): + left_idx = len(wav_org) + right_idx = left_idx + # 在原始句子中间替换 + else: + left_idx = int(np.floor(mfa_start[span_to_repl[0]] * fs)) + right_idx = int(np.ceil(mfa_end[span_to_repl[1] - 1] * fs)) + blank_wav = np.zeros( + (int(np.ceil(new_span_dur_sum * fs)), ), dtype=wav_org.dtype) + # 原始音频,需要编辑的部分替换成空音频,空音频的时间由 fs2 的 duration_predictor 决定 + new_wav = np.concatenate( + [wav_org[:left_idx], blank_wav, wav_org[right_idx:]]) + + # 4. get old and new mel span to be mask + # [92, 92] + + old_span_bdy, mfa_start, mfa_end = get_masked_mel_bdy( + mfa_start=mfa_start, + mfa_end=mfa_end, + fs=fs, + hop_length=hop_length, + span_to_repl=span_to_repl) + # [92, 174] + # new_mfa_start, new_mfa_end 时间级别的开始和结束时间 -> 帧级别 + new_span_bdy, new_mfa_start, new_mfa_end = get_masked_mel_bdy( + mfa_start=new_mfa_start, + mfa_end=new_mfa_end, + fs=fs, + hop_length=hop_length, + span_to_repl=span_to_add) + + # old_span_bdy, new_span_bdy 是帧级别的范围 + return new_wav, new_phns, new_mfa_start, new_mfa_end, old_span_bdy, new_span_bdy + + +def prep_feats(wav_path: str, + source_lang: str="english", + target_lang: str="english", + old_str: str="", + new_str: str="", + duration_adjust: bool=True, + start_end_sp: bool=False, + mask_reconstruct: bool=False, + fs: int=24000, + hop_length: int=300, + token_list: List[str]=[]): + wav, phns, mfa_start, mfa_end, old_span_bdy, new_span_bdy = prep_feats_with_dur( + source_lang=source_lang, + target_lang=target_lang, + old_str=old_str, + new_str=new_str, + wav_path=wav_path, + duration_adjust=duration_adjust, + start_end_sp=start_end_sp, + mask_reconstruct=mask_reconstruct, + fs=fs, + hop_length=hop_length) + + token_to_id = {item: i for i, item in enumerate(token_list)} + text = np.array( + list(map(lambda x: token_to_id.get(x, token_to_id['']), phns))) + span_bdy = np.array(new_span_bdy) + + batch = [('1', { + "speech": wav, + "align_start": mfa_start, + "align_end": mfa_end, + "text": text, + "span_bdy": span_bdy + })] + + return batch, old_span_bdy, new_span_bdy + + +def decode_with_model(mlm_model: nn.Layer, + collate_fn, + wav_path: str, + source_lang: str="english", + target_lang: str="english", + old_str: str="", + new_str: str="", + use_teacher_forcing: bool=False, + duration_adjust: bool=True, + start_end_sp: bool=False, + fs: int=24000, + hop_length: int=300, + token_list: List[str]=[]): + batch, old_span_bdy, new_span_bdy = prep_feats( + source_lang=source_lang, + target_lang=target_lang, + wav_path=wav_path, + old_str=old_str, + new_str=new_str, + duration_adjust=duration_adjust, + start_end_sp=start_end_sp, + fs=fs, + hop_length=hop_length, + token_list=token_list) + + feats = collate_fn(batch)[1] + + if 'text_masked_pos' in feats.keys(): + feats.pop('text_masked_pos') + + output = 
mlm_model.inference( + text=feats['text'], + speech=feats['speech'], + masked_pos=feats['masked_pos'], + speech_mask=feats['speech_mask'], + text_mask=feats['text_mask'], + speech_seg_pos=feats['speech_seg_pos'], + text_seg_pos=feats['text_seg_pos'], + span_bdy=new_span_bdy, + use_teacher_forcing=use_teacher_forcing) + + # 拼接音频 + output_feat = paddle.concat(x=output, axis=0) + wav_org, _ = librosa.load(wav_path, sr=fs) + return wav_org, output_feat, old_span_bdy, new_span_bdy, fs, hop_length + + +def get_mlm_output(wav_path: str, + model_name: str="paddle_checkpoint_en", + source_lang: str="english", + target_lang: str="english", + old_str: str="", + new_str: str="", + use_teacher_forcing: bool=False, + duration_adjust: bool=True, + start_end_sp: bool=False): + mlm_model, train_conf = load_model(model_name) + + collate_fn = build_mlm_collate_fn( + sr=train_conf.fs, + n_fft=train_conf.n_fft, + hop_length=train_conf.n_shift, + win_length=train_conf.win_length, + n_mels=train_conf.n_mels, + fmin=train_conf.fmin, + fmax=train_conf.fmax, + mlm_prob=train_conf.mlm_prob, + mean_phn_span=train_conf.mean_phn_span, + seg_emb=train_conf.model['enc_input_layer'] == 'sega_mlm') + + return decode_with_model( + source_lang=source_lang, + target_lang=target_lang, + mlm_model=mlm_model, + collate_fn=collate_fn, + wav_path=wav_path, + old_str=old_str, + new_str=new_str, + use_teacher_forcing=use_teacher_forcing, + duration_adjust=duration_adjust, + start_end_sp=start_end_sp, + fs=train_conf.fs, + hop_length=train_conf.n_shift, + token_list=train_conf.token_list) + + +def evaluate(uid: str, + source_lang: str="english", + target_lang: str="english", + prefix: os.PathLike="./prompt/dev/", + model_name: str="paddle_checkpoint_en", + new_str: str="", + prompt_decoding: bool=False, + task_name: str=None): + + # get origin text and path of origin wav + old_str, wav_path = read_data(uid=uid, prefix=prefix) + + if task_name == 'edit': + new_str = new_str + elif task_name == 'synthesize': + new_str = old_str + new_str + else: + new_str = old_str + ' '.join([ch for ch in new_str if is_chinese(ch)]) + + print('new_str is ', new_str) + + results_dict = get_wav( + source_lang=source_lang, + target_lang=target_lang, + model_name=model_name, + wav_path=wav_path, + old_str=old_str, + new_str=new_str) + return results_dict + + +if __name__ == "__main__": + # parse config and args + args = parse_args() + + data_dict = evaluate( + uid=args.uid, + source_lang=args.source_lang, + target_lang=args.target_lang, + prefix=args.prefix, + model_name=args.model_name, + new_str=args.new_str, + task_name=args.task_name) + sf.write(args.output_name, data_dict['output'], samplerate=24000) + print("finished...") diff --git a/examples/ernie_sat/local/sedit_arg_parser.py b/examples/ernie_sat/local/sedit_arg_parser.py new file mode 100644 index 000000000..ad7e57191 --- /dev/null +++ b/examples/ernie_sat/local/sedit_arg_parser.py @@ -0,0 +1,97 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. +import argparse + + +def parse_args(): + # parse args and config and redirect to train_sp + parser = argparse.ArgumentParser( + description="Synthesize with acoustic model & vocoder") + # acoustic model + parser.add_argument( + '--am', + type=str, + default='fastspeech2_csmsc', + choices=[ + 'speedyspeech_csmsc', 'fastspeech2_csmsc', 'fastspeech2_ljspeech', + 'fastspeech2_aishell3', 'fastspeech2_vctk', 'tacotron2_csmsc', + 'tacotron2_ljspeech', 'tacotron2_aishell3' + ], + help='Choose acoustic model type of tts task.') + parser.add_argument( + '--am_config', + type=str, + default=None, + help='Config of acoustic model. Use deault config when it is None.') + parser.add_argument( + '--am_ckpt', + type=str, + default=None, + help='Checkpoint file of acoustic model.') + parser.add_argument( + "--am_stat", + type=str, + default=None, + help="mean and standard deviation used to normalize spectrogram when training acoustic model." + ) + parser.add_argument( + "--phones_dict", type=str, default=None, help="phone vocabulary file.") + parser.add_argument( + "--tones_dict", type=str, default=None, help="tone vocabulary file.") + parser.add_argument( + "--speaker_dict", type=str, default=None, help="speaker id map file.") + + # vocoder + parser.add_argument( + '--voc', + type=str, + default='pwgan_aishell3', + choices=[ + 'pwgan_csmsc', 'pwgan_ljspeech', 'pwgan_aishell3', 'pwgan_vctk', + 'mb_melgan_csmsc', 'wavernn_csmsc', 'hifigan_csmsc', + 'hifigan_ljspeech', 'hifigan_aishell3', 'hifigan_vctk', + 'style_melgan_csmsc' + ], + help='Choose vocoder type of tts task.') + parser.add_argument( + '--voc_config', + type=str, + default=None, + help='Config of voc. Use deault config when it is None.') + parser.add_argument( + '--voc_ckpt', type=str, default=None, help='Checkpoint file of voc.') + parser.add_argument( + "--voc_stat", + type=str, + default=None, + help="mean and standard deviation used to normalize spectrogram when training voc." + ) + # other + parser.add_argument( + "--ngpu", type=int, default=1, help="if ngpu == 0, use cpu.") + + parser.add_argument("--model_name", type=str, help="model name") + parser.add_argument("--uid", type=str, help="uid") + parser.add_argument("--new_str", type=str, help="new string") + parser.add_argument("--prefix", type=str, help="prefix") + parser.add_argument( + "--source_lang", type=str, default="english", help="source language") + parser.add_argument( + "--target_lang", type=str, default="english", help="target language") + parser.add_argument("--output_name", type=str, help="output name") + parser.add_argument("--task_name", type=str, help="task name") + + # pre + args = parser.parse_args() + return args diff --git a/examples/ernie_sat/local/utils.py b/examples/ernie_sat/local/utils.py new file mode 100644 index 000000000..f2dce504a --- /dev/null +++ b/examples/ernie_sat/local/utils.py @@ -0,0 +1,175 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and
+# limitations under the License.
+from pathlib import Path
+from typing import Dict
+from typing import List
+from typing import Union
+
+import numpy as np
+import paddle
+import yaml
+from sedit_arg_parser import parse_args
+from yacs.config import CfgNode
+
+from paddlespeech.t2s.exps.syn_utils import get_am_inference
+from paddlespeech.t2s.exps.syn_utils import get_voc_inference
+
+
+def read_2col_text(path: Union[Path, str]) -> Dict[str, str]:
+    """Read a text file having 2 columns as a dict object.
+
+    Examples:
+        wav.scp:
+            key1 /some/path/a.wav
+            key2 /some/path/b.wav
+
+        >>> read_2col_text('wav.scp')
+        {'key1': '/some/path/a.wav', 'key2': '/some/path/b.wav'}
+
+    """
+
+    data = {}
+    with Path(path).open("r", encoding="utf-8") as f:
+        for linenum, line in enumerate(f, 1):
+            sps = line.rstrip().split(maxsplit=1)
+            if len(sps) == 1:
+                k, v = sps[0], ""
+            else:
+                k, v = sps
+            if k in data:
+                raise RuntimeError(f"{k} is duplicated ({path}:{linenum})")
+            data[k] = v
+    return data
+
+
+def load_num_sequence_text(path: Union[Path, str], loader_type: str="csv_int"
+                           ) -> Dict[str, List[Union[float, int]]]:
+    """Read a text file indicating sequences of numbers.
+
+    Examples:
+        key1 1 2 3
+        key2 34 5 6
+
+        >>> d = load_num_sequence_text('text')
+        >>> np.testing.assert_array_equal(d["key1"], np.array([1, 2, 3]))
+    """
+    if loader_type == "text_int":
+        delimiter = " "
+        dtype = int
+    elif loader_type == "text_float":
+        delimiter = " "
+        dtype = float
+    elif loader_type == "csv_int":
+        delimiter = ","
+        dtype = int
+    elif loader_type == "csv_float":
+        delimiter = ","
+        dtype = float
+    else:
+        raise ValueError(f"Not supported loader_type={loader_type}")
+
+    # path looks like:
+    #   utta 1,0
+    #   uttb 3,4,5
+    # -> return {'utta': np.ndarray([1, 0]),
+    #            'uttb': np.ndarray([3, 4, 5])}
+    d = read_2col_text(path)
+    # Using for-loop instead of dict-comprehension for debuggability
+    retval = {}
+    for k, v in d.items():
+        try:
+            retval[k] = [dtype(i) for i in v.split(delimiter)]
+        except TypeError:
+            print(f'Error happened with path="{path}", id="{k}", value="{v}"')
+            raise
+    return retval
+
+
+def is_chinese(ch):
+    if u'\u4e00' <= ch <= u'\u9fff':
+        return True
+    else:
+        return False
+
+
+def get_voc_out(mel):
+    # vocoder
+    args = parse_args()
+    with open(args.voc_config) as f:
+        voc_config = CfgNode(yaml.safe_load(f))
+    voc_inference = get_voc_inference(
+        voc=args.voc,
+        voc_config=voc_config,
+        voc_ckpt=args.voc_ckpt,
+        voc_stat=args.voc_stat)
+
+    with paddle.no_grad():
+        wav = voc_inference(mel)
+    return np.squeeze(wav)
+
+
+def eval_durs(phns, target_lang="chinese", fs=24000, hop_length=300):
+    args = parse_args()
+
+    if target_lang == 'english':
+        args.am = "fastspeech2_ljspeech"
+        args.am_config = "download/fastspeech2_nosil_ljspeech_ckpt_0.5/default.yaml"
+        args.am_ckpt = "download/fastspeech2_nosil_ljspeech_ckpt_0.5/snapshot_iter_100000.pdz"
+        args.am_stat = "download/fastspeech2_nosil_ljspeech_ckpt_0.5/speech_stats.npy"
+        args.phones_dict = "download/fastspeech2_nosil_ljspeech_ckpt_0.5/phone_id_map.txt"
+
+    elif target_lang == 'chinese':
+        args.am = "fastspeech2_csmsc"
+        args.am_config = "download/fastspeech2_conformer_baker_ckpt_0.5/conformer.yaml"
+        args.am_ckpt = "download/fastspeech2_conformer_baker_ckpt_0.5/snapshot_iter_76000.pdz"
+        args.am_stat = "download/fastspeech2_conformer_baker_ckpt_0.5/speech_stats.npy"
+        args.phones_dict = "download/fastspeech2_conformer_baker_ckpt_0.5/phone_id_map.txt"
+
+    if args.ngpu == 0:
paddle.set_device("cpu") + elif args.ngpu > 0: + paddle.set_device("gpu") + else: + print("ngpu should >= 0 !") + + # Init body. + with open(args.am_config) as f: + am_config = CfgNode(yaml.safe_load(f)) + + am_inference, am = get_am_inference( + am=args.am, + am_config=am_config, + am_ckpt=args.am_ckpt, + am_stat=args.am_stat, + phones_dict=args.phones_dict, + tones_dict=args.tones_dict, + speaker_dict=args.speaker_dict, + return_am=True) + + vocab_phones = {} + with open(args.phones_dict, "r") as f: + phn_id = [line.strip().split() for line in f.readlines()] + for tone, id in phn_id: + vocab_phones[tone] = int(id) + vocab_size = len(vocab_phones) + phonemes = [phn if phn in vocab_phones else "sp" for phn in phns] + + phone_ids = [vocab_phones[item] for item in phonemes] + phone_ids.append(vocab_size - 1) + phone_ids = paddle.to_tensor(np.array(phone_ids, np.int64)) + _, d_outs, _, _ = am.inference(phone_ids, spk_id=None, spk_emb=None) + pre_d_outs = d_outs + phu_durs_new = pre_d_outs * hop_length / fs + phu_durs_new = phu_durs_new.tolist()[:-1] + return phu_durs_new diff --git a/examples/ernie_sat/path.sh b/examples/ernie_sat/path.sh new file mode 100755 index 000000000..d46d2f612 --- /dev/null +++ b/examples/ernie_sat/path.sh @@ -0,0 +1,13 @@ +#!/bin/bash +export MAIN_ROOT=`realpath ${PWD}/../../` + +export PATH=${MAIN_ROOT}:${MAIN_ROOT}/utils:${PATH} +export LC_ALL=C + +export PYTHONDONTWRITEBYTECODE=1 +# Use UTF-8 in Python to avoid UnicodeDecodeError when LC_ALL=C +export PYTHONIOENCODING=UTF-8 +export PYTHONPATH=${MAIN_ROOT}:${PYTHONPATH} + +MODEL=ernie_sat +export BIN_DIR=${MAIN_ROOT}/paddlespeech/t2s/exps/${MODEL} \ No newline at end of file diff --git a/examples/ernie_sat/prompt/dev/text b/examples/ernie_sat/prompt/dev/text new file mode 100644 index 000000000..f79cdcb42 --- /dev/null +++ b/examples/ernie_sat/prompt/dev/text @@ -0,0 +1,3 @@ +p243_new For that reason cover should not be given. +Prompt_003_new This was not the show for me. +p299_096 We are trying to establish a date. diff --git a/examples/ernie_sat/prompt/dev/wav.scp b/examples/ernie_sat/prompt/dev/wav.scp new file mode 100644 index 000000000..eb0e8e48d --- /dev/null +++ b/examples/ernie_sat/prompt/dev/wav.scp @@ -0,0 +1,3 @@ +p243_new ../../prompt_wav/p243_313.wav +Prompt_003_new ../../prompt_wav/this_was_not_the_show_for_me.wav +p299_096 ../../prompt_wav/p299_096.wav diff --git a/examples/ernie_sat/run_clone_en_to_zh.sh b/examples/ernie_sat/run_clone_en_to_zh.sh new file mode 100755 index 000000000..68b1c7544 --- /dev/null +++ b/examples/ernie_sat/run_clone_en_to_zh.sh @@ -0,0 +1,27 @@ +#!/bin/bash + +set -e +source path.sh + +# en --> zh 的 语音合成 +# 根据 Prompt_003_new 作为提示语音: This was not the show for me. 来合成: '今天天气很好' +# 注: 输入的 new_str 需为中文汉字, 否则会通过预处理只保留中文汉字, 即合成预处理后的中文语音。 + +python local/inference.py \ + --task_name=cross-lingual_clone \ + --model_name=paddle_checkpoint_dual_mask_enzh \ + --uid=Prompt_003_new \ + --new_str='今天天气很好.' 
\ + --prefix='./prompt/dev/' \ + --source_lang=english \ + --target_lang=chinese \ + --output_name=pred_clone.wav \ + --voc=pwgan_aishell3 \ + --voc_config=download/pwg_aishell3_ckpt_0.5/default.yaml \ + --voc_ckpt=download/pwg_aishell3_ckpt_0.5/snapshot_iter_1000000.pdz \ + --voc_stat=download/pwg_aishell3_ckpt_0.5/feats_stats.npy \ + --am=fastspeech2_csmsc \ + --am_config=download/fastspeech2_conformer_baker_ckpt_0.5/conformer.yaml \ + --am_ckpt=download/fastspeech2_conformer_baker_ckpt_0.5/snapshot_iter_76000.pdz \ + --am_stat=download/fastspeech2_conformer_baker_ckpt_0.5/speech_stats.npy \ + --phones_dict=download/fastspeech2_conformer_baker_ckpt_0.5/phone_id_map.txt diff --git a/examples/ernie_sat/run_clone_en_to_zh_new.sh b/examples/ernie_sat/run_clone_en_to_zh_new.sh new file mode 100755 index 000000000..12fdf23f1 --- /dev/null +++ b/examples/ernie_sat/run_clone_en_to_zh_new.sh @@ -0,0 +1,27 @@ +#!/bin/bash + +set -e +source path.sh + +# en --> zh 的 语音合成 +# 根据 Prompt_003_new 作为提示语音: This was not the show for me. 来合成: '今天天气很好' +# 注: 输入的 new_str 需为中文汉字, 否则会通过预处理只保留中文汉字, 即合成预处理后的中文语音。 + +python local/inference_new.py \ + --task_name=cross-lingual_clone \ + --model_name=paddle_checkpoint_dual_mask_enzh \ + --uid=Prompt_003_new \ + --new_str='今天天气很好.' \ + --prefix='./prompt/dev/' \ + --source_lang=english \ + --target_lang=chinese \ + --output_name=pred_clone.wav \ + --voc=pwgan_aishell3 \ + --voc_config=download/pwg_aishell3_ckpt_0.5/default.yaml \ + --voc_ckpt=download/pwg_aishell3_ckpt_0.5/snapshot_iter_1000000.pdz \ + --voc_stat=download/pwg_aishell3_ckpt_0.5/feats_stats.npy \ + --am=fastspeech2_csmsc \ + --am_config=download/fastspeech2_conformer_baker_ckpt_0.5/conformer.yaml \ + --am_ckpt=download/fastspeech2_conformer_baker_ckpt_0.5/snapshot_iter_76000.pdz \ + --am_stat=download/fastspeech2_conformer_baker_ckpt_0.5/speech_stats.npy \ + --phones_dict=download/fastspeech2_conformer_baker_ckpt_0.5/phone_id_map.txt diff --git a/examples/ernie_sat/run_gen_en.sh b/examples/ernie_sat/run_gen_en.sh new file mode 100755 index 000000000..a0641bc7f --- /dev/null +++ b/examples/ernie_sat/run_gen_en.sh @@ -0,0 +1,26 @@ +#!/bin/bash + +set -e +source path.sh + +# 纯英文的语音合成 +# 样例为根据 p299_096 对应的语音作为提示语音: This was not the show for me. 来合成: 'I enjoy my life.' + +python local/inference.py \ + --task_name=synthesize \ + --model_name=paddle_checkpoint_en \ + --uid=p299_096 \ + --new_str='I enjoy my life, do you?' \ + --prefix='./prompt/dev/' \ + --source_lang=english \ + --target_lang=english \ + --output_name=pred_gen.wav \ + --voc=pwgan_aishell3 \ + --voc_config=download/pwg_aishell3_ckpt_0.5/default.yaml \ + --voc_ckpt=download/pwg_aishell3_ckpt_0.5/snapshot_iter_1000000.pdz \ + --voc_stat=download/pwg_aishell3_ckpt_0.5/feats_stats.npy \ + --am=fastspeech2_ljspeech \ + --am_config=download/fastspeech2_nosil_ljspeech_ckpt_0.5/default.yaml \ + --am_ckpt=download/fastspeech2_nosil_ljspeech_ckpt_0.5/snapshot_iter_100000.pdz \ + --am_stat=download/fastspeech2_nosil_ljspeech_ckpt_0.5/speech_stats.npy \ + --phones_dict=download/fastspeech2_nosil_ljspeech_ckpt_0.5/phone_id_map.txt diff --git a/examples/ernie_sat/run_gen_en_new.sh b/examples/ernie_sat/run_gen_en_new.sh new file mode 100755 index 000000000..d76b00430 --- /dev/null +++ b/examples/ernie_sat/run_gen_en_new.sh @@ -0,0 +1,26 @@ +#!/bin/bash + +set -e +source path.sh + +# 纯英文的语音合成 +# 样例为根据 p299_096 对应的语音作为提示语音: This was not the show for me. 来合成: 'I enjoy my life.' 
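+# The command below assumes the checkpoints referenced by --model_name,
+# --am_* and --voc_* have already been unpacked under ./pretrained_model and
+# ./download, and that ./prompt/dev provides text and wav.scp entries for the
+# chosen --uid (see examples/ernie_sat/prompt/dev in this patch).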
+ +python local/inference_new.py \ + --task_name=synthesize \ + --model_name=paddle_checkpoint_en \ + --uid=p299_096 \ + --new_str='I enjoy my life, do you?' \ + --prefix='./prompt/dev/' \ + --source_lang=english \ + --target_lang=english \ + --output_name=pred_gen.wav \ + --voc=pwgan_aishell3 \ + --voc_config=download/pwg_aishell3_ckpt_0.5/default.yaml \ + --voc_ckpt=download/pwg_aishell3_ckpt_0.5/snapshot_iter_1000000.pdz \ + --voc_stat=download/pwg_aishell3_ckpt_0.5/feats_stats.npy \ + --am=fastspeech2_ljspeech \ + --am_config=download/fastspeech2_nosil_ljspeech_ckpt_0.5/default.yaml \ + --am_ckpt=download/fastspeech2_nosil_ljspeech_ckpt_0.5/snapshot_iter_100000.pdz \ + --am_stat=download/fastspeech2_nosil_ljspeech_ckpt_0.5/speech_stats.npy \ + --phones_dict=download/fastspeech2_nosil_ljspeech_ckpt_0.5/phone_id_map.txt diff --git a/examples/ernie_sat/run_sedit_en.sh b/examples/ernie_sat/run_sedit_en.sh new file mode 100755 index 000000000..eec7d6402 --- /dev/null +++ b/examples/ernie_sat/run_sedit_en.sh @@ -0,0 +1,27 @@ +#!/bin/bash + +set -e +source path.sh + +# 纯英文的语音编辑 +# 样例为把 p243_new 对应的原始语音: For that reason cover should not be given.编辑成 'for that reason cover is impossible to be given.' 对应的语音 +# NOTE: 语音编辑任务暂支持句子中 1 个位置的替换或者插入文本操作 + +python local/inference.py \ + --task_name=edit \ + --model_name=paddle_checkpoint_en \ + --uid=p243_new \ + --new_str='for that reason cover is impossible to be given.' \ + --prefix='./prompt/dev/' \ + --source_lang=english \ + --target_lang=english \ + --output_name=pred_edit.wav \ + --voc=pwgan_aishell3 \ + --voc_config=download/pwg_aishell3_ckpt_0.5/default.yaml \ + --voc_ckpt=download/pwg_aishell3_ckpt_0.5/snapshot_iter_1000000.pdz \ + --voc_stat=download/pwg_aishell3_ckpt_0.5/feats_stats.npy \ + --am=fastspeech2_ljspeech \ + --am_config=download/fastspeech2_nosil_ljspeech_ckpt_0.5/default.yaml \ + --am_ckpt=download/fastspeech2_nosil_ljspeech_ckpt_0.5/snapshot_iter_100000.pdz \ + --am_stat=download/fastspeech2_nosil_ljspeech_ckpt_0.5/speech_stats.npy \ + --phones_dict=download/fastspeech2_nosil_ljspeech_ckpt_0.5/phone_id_map.txt diff --git a/examples/ernie_sat/run_sedit_en_new.sh b/examples/ernie_sat/run_sedit_en_new.sh new file mode 100755 index 000000000..0952d280c --- /dev/null +++ b/examples/ernie_sat/run_sedit_en_new.sh @@ -0,0 +1,27 @@ +#!/bin/bash + +set -e +source path.sh + +# 纯英文的语音编辑 +# 样例为把 p243_new 对应的原始语音: For that reason cover should not be given.编辑成 'for that reason cover is impossible to be given.' 对应的语音 +# NOTE: 语音编辑任务暂支持句子中 1 个位置的替换或者插入文本操作 + +python local/inference_new.py \ + --task_name=edit \ + --model_name=paddle_checkpoint_en \ + --uid=p243_new \ + --new_str='for that reason cover is impossible to be given.' 
\ + --prefix='./prompt/dev/' \ + --source_lang=english \ + --target_lang=english \ + --output_name=pred_edit.wav \ + --voc=pwgan_aishell3 \ + --voc_config=download/pwg_aishell3_ckpt_0.5/default.yaml \ + --voc_ckpt=download/pwg_aishell3_ckpt_0.5/snapshot_iter_1000000.pdz \ + --voc_stat=download/pwg_aishell3_ckpt_0.5/feats_stats.npy \ + --am=fastspeech2_ljspeech \ + --am_config=download/fastspeech2_nosil_ljspeech_ckpt_0.5/default.yaml \ + --am_ckpt=download/fastspeech2_nosil_ljspeech_ckpt_0.5/snapshot_iter_100000.pdz \ + --am_stat=download/fastspeech2_nosil_ljspeech_ckpt_0.5/speech_stats.npy \ + --phones_dict=download/fastspeech2_nosil_ljspeech_ckpt_0.5/phone_id_map.txt diff --git a/examples/ernie_sat/test_run.sh b/examples/ernie_sat/test_run.sh new file mode 100755 index 000000000..75b6a5691 --- /dev/null +++ b/examples/ernie_sat/test_run.sh @@ -0,0 +1,6 @@ +#!/bin/bash + +rm -rf *.wav +./run_sedit_en.sh # 语音编辑任务(英文) +./run_gen_en.sh # 个性化语音合成任务(英文) +./run_clone_en_to_zh.sh # 跨语言语音合成任务(英文到中文的语音克隆) \ No newline at end of file diff --git a/examples/ernie_sat/test_run_new.sh b/examples/ernie_sat/test_run_new.sh new file mode 100755 index 000000000..bf8a4e02d --- /dev/null +++ b/examples/ernie_sat/test_run_new.sh @@ -0,0 +1,6 @@ +#!/bin/bash + +rm -rf *.wav +./run_sedit_en_new.sh # 语音编辑任务(英文) +./run_gen_en_new.sh # 个性化语音合成任务(英文) +./run_clone_en_to_zh_new.sh # 跨语言语音合成任务(英文到中文的语音克隆) \ No newline at end of file diff --git a/examples/ernie_sat/tools/.gitkeep b/examples/ernie_sat/tools/.gitkeep new file mode 100644 index 000000000..e69de29bb diff --git a/examples/iwslt2012/punc0/conf/default.yaml b/examples/iwslt2012/punc0/conf/default.yaml index 74ced9932..e88ce2ff1 100644 --- a/examples/iwslt2012/punc0/conf/default.yaml +++ b/examples/iwslt2012/punc0/conf/default.yaml @@ -29,7 +29,7 @@ optimizer_params: scheduler_params: learning_rate: 1.0e-5 # learning rate. - gamma: 1.0 # scheduler gamma. + gamma: 0.9999 # scheduler gamma must between(0.0, 1.0) and closer to 1.0 is better. ########################################################### # TRAINING SETTING # diff --git a/examples/librispeech/asr1/README.md b/examples/librispeech/asr1/README.md index ae252a58b..ca0081444 100644 --- a/examples/librispeech/asr1/README.md +++ b/examples/librispeech/asr1/README.md @@ -1,5 +1,5 @@ # Transformer/Conformer ASR with Librispeech -This example contains code used to train a Transformer or [Conformer](http://arxiv.org/abs/2008.03802) model with [Librispeech dataset](http://www.openslr.org/resources/12) +This example contains code used to train [u2](https://arxiv.org/pdf/2012.05481.pdf) model (Transformer or [Conformer](https://arxiv.org/pdf/2005.08100.pdf) model) with [Librispeech dataset](http://www.openslr.org/resources/12) ## Overview All the scripts you need are in `run.sh`. There are several stages in `run.sh`, and each stage has its function. | Stage | Function | diff --git a/examples/librispeech/asr2/README.md b/examples/librispeech/asr2/README.md index 5bc7185a9..26978520d 100644 --- a/examples/librispeech/asr2/README.md +++ b/examples/librispeech/asr2/README.md @@ -1,6 +1,6 @@ # Transformer/Conformer ASR with Librispeech ASR2 -This example contains code used to train a Transformer or [Conformer](http://arxiv.org/abs/2008.03802) model with [Librispeech dataset](http://www.openslr.org/resources/12) and use some functions in kaldi. 
+This example contains code used to train a [u2](https://arxiv.org/pdf/2012.05481.pdf) model (Transformer or [Conformer](https://arxiv.org/pdf/2005.08100.pdf) model) with [Librispeech dataset](http://www.openslr.org/resources/12) and use some functions in kaldi. To use this example, you need to install Kaldi first. diff --git a/examples/ljspeech/tts3/README.md b/examples/ljspeech/tts3/README.md index 81a0580c0..d786c1571 100644 --- a/examples/ljspeech/tts3/README.md +++ b/examples/ljspeech/tts3/README.md @@ -215,6 +215,13 @@ optional arguments: Pretrained FastSpeech2 model with no silence in the edge of audios: - [fastspeech2_nosil_ljspeech_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_ljspeech_ckpt_0.5.zip) +The static model can be downloaded here: +- [fastspeech2_ljspeech_static_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_ljspeech_static_1.1.0.zip) + +The ONNX model can be downloaded here: +- [fastspeech2_ljspeech_onnx_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_ljspeech_onnx_1.1.0.zip) + + Model | Step | eval/loss | eval/l1_loss | eval/duration_loss | eval/pitch_loss| eval/energy_loss :-------------:| :------------:| :-----: | :-----: | :--------: |:--------:|:---------: default| 2(gpu) x 100000| 1.505682|0.612104| 0.045505| 0.62792| 0.220147 diff --git a/examples/ljspeech/tts3/local/inference.sh b/examples/ljspeech/tts3/local/inference.sh new file mode 100755 index 000000000..ff192f3e3 --- /dev/null +++ b/examples/ljspeech/tts3/local/inference.sh @@ -0,0 +1,30 @@ +#!/bin/bash + +train_output_path=$1 + +stage=0 +stop_stage=0 + +# pwgan +if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then + python3 ${BIN_DIR}/../inference.py \ + --inference_dir=${train_output_path}/inference \ + --am=fastspeech2_ljspeech \ + --voc=pwgan_ljspeech \ + --text=${BIN_DIR}/../sentences_en.txt \ + --output_dir=${train_output_path}/pd_infer_out \ + --phones_dict=dump/phone_id_map.txt \ + --lang=en +fi + +# hifigan +if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then + python3 ${BIN_DIR}/../inference.py \ + --inference_dir=${train_output_path}/inference \ + --am=fastspeech2_ljspeech \ + --voc=hifigan_ljspeech \ + --text=${BIN_DIR}/../sentences_en.txt \ + --output_dir=${train_output_path}/pd_infer_out \ + --phones_dict=dump/phone_id_map.txt \ + --lang=en +fi diff --git a/examples/ljspeech/tts3/local/ort_predict.sh b/examples/ljspeech/tts3/local/ort_predict.sh new file mode 100755 index 000000000..b4716f70e --- /dev/null +++ b/examples/ljspeech/tts3/local/ort_predict.sh @@ -0,0 +1,32 @@ +train_output_path=$1 + +stage=0 +stop_stage=0 + +# e2e, synthesize from text +if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then + python3 ${BIN_DIR}/../ort_predict_e2e.py \ + --inference_dir=${train_output_path}/inference_onnx \ + --am=fastspeech2_ljspeech \ + --voc=pwgan_ljspeech\ + --output_dir=${train_output_path}/onnx_infer_out_e2e \ + --text=${BIN_DIR}/../sentences_en.txt \ + --phones_dict=dump/phone_id_map.txt \ + --device=cpu \ + --cpu_threads=2 \ + --lang=en + +fi + +if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then + python3 ${BIN_DIR}/../ort_predict_e2e.py \ + --inference_dir=${train_output_path}/inference_onnx \ + --am=fastspeech2_ljspeech \ + --voc=hifigan_ljspeech \ + --output_dir=${train_output_path}/onnx_infer_out_e2e \ + --text=${BIN_DIR}/../sentences_en.txt \ + --phones_dict=dump/phone_id_map.txt \ + --device=cpu \ + --cpu_threads=2 \ + 
--lang=en +fi diff --git a/examples/ljspeech/tts3/local/paddle2onnx.sh b/examples/ljspeech/tts3/local/paddle2onnx.sh new file mode 120000 index 000000000..8d5dbef4c --- /dev/null +++ b/examples/ljspeech/tts3/local/paddle2onnx.sh @@ -0,0 +1 @@ +../../../csmsc/tts3/local/paddle2onnx.sh \ No newline at end of file diff --git a/examples/ljspeech/tts3/run.sh b/examples/ljspeech/tts3/run.sh index c64fa8883..260f06c8b 100755 --- a/examples/ljspeech/tts3/run.sh +++ b/examples/ljspeech/tts3/run.sh @@ -27,11 +27,35 @@ if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then fi if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then - # synthesize, vocoder is pwgan + # synthesize, vocoder is pwgan by default CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name} || exit -1 fi if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then - # synthesize_e2e, vocoder is pwgan + # synthesize_e2e, vocoder is pwgan by default CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize_e2e.sh ${conf_path} ${train_output_path} ${ckpt_name} || exit -1 fi + +if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then + # inference with static model, vocoder is pwgan by default + CUDA_VISIBLE_DEVICES=${gpus} ./local/inference.sh ${train_output_path} || exit -1 +fi + +# paddle2onnx, please make sure the static models are in ${train_output_path}/inference first +# we have only tested the following models so far +if [ ${stage} -le 5 ] && [ ${stop_stage} -ge 5 ]; then + # install paddle2onnx + version=$(echo `pip list |grep "paddle2onnx"` |awk -F" " '{print $2}') + if [[ -z "$version" || ${version} != '0.9.8' ]]; then + pip install paddle2onnx==0.9.8 + fi + ./local/paddle2onnx.sh ${train_output_path} inference inference_onnx fastspeech2_ljspeech + # considering the balance between speed and quality, we recommend that you use hifigan as vocoder + ./local/paddle2onnx.sh ${train_output_path} inference inference_onnx pwgan_ljspeech + # ./local/paddle2onnx.sh ${train_output_path} inference inference_onnx hifigan_ljspeech +fi + +# inference with onnxruntime, use fastspeech2 + pwgan by default +if [ ${stage} -le 6 ] && [ ${stop_stage} -ge 6 ]; then + ./local/ort_predict.sh ${train_output_path} +fi diff --git a/examples/ljspeech/voc0/local/synthesize.sh b/examples/ljspeech/voc0/local/synthesize.sh index 1d5e11836..11874e499 100755 --- a/examples/ljspeech/voc0/local/synthesize.sh +++ b/examples/ljspeech/voc0/local/synthesize.sh @@ -8,5 +8,4 @@ python ${BIN_DIR}/synthesize.py \ --input=${input_mel_path} \ --output=${train_output_path}/wavs/ \ --checkpoint_path=${train_output_path}/checkpoints/${ckpt_name} \ - --ngpu=1 \ - --verbose \ No newline at end of file + --ngpu=1 \ No newline at end of file diff --git a/examples/ljspeech/voc1/README.md b/examples/ljspeech/voc1/README.md index d16c0e35f..ad6cd2982 100644 --- a/examples/ljspeech/voc1/README.md +++ b/examples/ljspeech/voc1/README.md @@ -130,6 +130,13 @@ optional arguments: Pretrained models can be downloaded here: - [pwg_ljspeech_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_ljspeech_ckpt_0.5.zip) +The static model can be downloaded here: +- [pwgan_ljspeech_static_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwgan_ljspeech_static_1.1.0.zip) + +The ONNX model can be downloaded here: +- [pwgan_ljspeech_onnx_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwgan_ljspeech_onnx_1.1.0.zip) + + Parallel WaveGAN checkpoint contains files listed below. 
```text diff --git a/examples/ljspeech/voc5/README.md b/examples/ljspeech/voc5/README.md index d856cfecf..eaa51e507 100644 --- a/examples/ljspeech/voc5/README.md +++ b/examples/ljspeech/voc5/README.md @@ -115,6 +115,12 @@ optional arguments: The pretrained model can be downloaded here: - [hifigan_ljspeech_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_ljspeech_ckpt_0.2.0.zip) +The static model can be downloaded here: +- [hifigan_ljspeech_static_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_ljspeech_static_1.1.0.zip) + +The ONNX model can be downloaded here: +- [hifigan_ljspeech_onnx_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_ljspeech_onnx_1.1.0.zip) + Model | Step | eval/generator_loss | eval/mel_loss| eval/feature_matching_loss :-------------:| :------------:| :-----: | :-----: | :--------: diff --git a/examples/other/g2p/README.md b/examples/other/g2p/README.md index 141f7f741..84f5fe234 100644 --- a/examples/other/g2p/README.md +++ b/examples/other/g2p/README.md @@ -10,11 +10,15 @@ Run the command below to get the results of the test. ```bash ./run.sh ``` -The `avg WER` of g2p is: 0.026014352515701198 + +The `avg WER` of g2p is: 0.028952373312476395 + ```text ,--------------------------------------------------------------------. - | | # Snt # Wrd | Corr Sub Del Ins Err S.Err | + | ./exp/g2p/text.g2p | + |--------------------------------------------------------------------| + | SPKR | # Snt # Wrd | Corr Sub Del Ins Err S.Err | |--------+-----------------+-----------------------------------------| - | Sum/Avg| 9996 299181 | 97.3 2.7 0.0 0.0 2.7 52.2 | + | Sum/Avg| 9996 299181 | 97.2 2.8 0.0 0.1 2.9 53.3 | `--------------------------------------------------------------------' ``` diff --git a/examples/tiny/asr1/README.md b/examples/tiny/asr1/README.md index 6a4999aa6..cfa266704 100644 --- a/examples/tiny/asr1/README.md +++ b/examples/tiny/asr1/README.md @@ -1,5 +1,5 @@ # Transformer/Conformer ASR with Tiny -This example contains code used to train a Transformer or [Conformer](http://arxiv.org/abs/2008.03802) model Tiny dataset(a part of [[Librispeech dataset](http://www.openslr.org/resources/12)](http://www.openslr.org/resources/33)) +This example contains code used to train a [u2](https://arxiv.org/pdf/2012.05481.pdf) model (Transformer or [Conformer](https://arxiv.org/pdf/2005.08100.pdf) model) with Tiny dataset(a part of [[Librispeech dataset](http://www.openslr.org/resources/12)](http://www.openslr.org/resources/33)) ## Overview All the scripts you need are in `run.sh`. There are several stages in `run.sh`, and each stage has its function. | Stage | Function | diff --git a/examples/vctk/ernie_sat/README.md b/examples/vctk/ernie_sat/README.md new file mode 100644 index 000000000..055e7903d --- /dev/null +++ b/examples/vctk/ernie_sat/README.md @@ -0,0 +1 @@ +# ERNIE SAT with VCTK dataset diff --git a/examples/vctk/ernie_sat/conf/default.yaml b/examples/vctk/ernie_sat/conf/default.yaml new file mode 100644 index 000000000..672f937ef --- /dev/null +++ b/examples/vctk/ernie_sat/conf/default.yaml @@ -0,0 +1,163 @@ +########################################################### +# FEATURE EXTRACTION SETTING # +########################################################### + +fs: 24000 # sr +n_fft: 2048 # FFT size (samples). +n_shift: 300 # Hop size (samples). 12.5ms +win_length: 1200 # Window length (samples). 50ms + # If set to null, it will be the same as fft_size. 
+window: "hann" # Window function. + +# Only used for feats_type != raw + +fmin: 80 # Minimum frequency of Mel basis. +fmax: 7600 # Maximum frequency of Mel basis. +n_mels: 80 # The number of mel basis. + +mean_phn_span: 8 +mlm_prob: 0.8 + +########################################################### +# DATA SETTING # +########################################################### +batch_size: 20 +num_workers: 2 + +########################################################### +# MODEL SETTING # +########################################################### +model: + text_masking: false + postnet_layers: 5 + postnet_filts: 5 + postnet_chans: 256 + encoder_type: conformer + decoder_type: conformer + enc_input_layer: sega_mlm + enc_pre_speech_layer: 0 + enc_cnn_module_kernel: 7 + enc_attention_dim: 384 + enc_attention_heads: 2 + enc_linear_units: 1536 + enc_num_blocks: 4 + enc_dropout_rate: 0.2 + enc_positional_dropout_rate: 0.2 + enc_attention_dropout_rate: 0.2 + enc_normalize_before: true + enc_macaron_style: true + enc_use_cnn_module: true + enc_selfattention_layer_type: legacy_rel_selfattn + enc_activation_type: swish + enc_pos_enc_layer_type: legacy_rel_pos + enc_positionwise_layer_type: conv1d + enc_positionwise_conv_kernel_size: 3 + dec_cnn_module_kernel: 31 + dec_attention_dim: 384 + dec_attention_heads: 2 + dec_linear_units: 1536 + dec_num_blocks: 4 + dec_dropout_rate: 0.2 + dec_positional_dropout_rate: 0.2 + dec_attention_dropout_rate: 0.2 + dec_macaron_style: true + dec_use_cnn_module: true + dec_selfattention_layer_type: legacy_rel_selfattn + dec_activation_type: swish + dec_pos_enc_layer_type: legacy_rel_pos + dec_positionwise_layer_type: conv1d + dec_positionwise_conv_kernel_size: 3 + +########################################################### +# OPTIMIZER SETTING # +########################################################### +scheduler_params: + d_model: 384 + warmup_steps: 4000 +grad_clip: 1.0 + +########################################################### +# TRAINING SETTING # +########################################################### +max_epoch: 1500 +num_snapshots: 50 + +########################################################### +# OTHER SETTING # +########################################################### +seed: 0 + +token_list: +- +- +- AH0 +- T +- N +- sp +- D +- S +- R +- L +- IH1 +- DH +- AE1 +- M +- EH1 +- K +- Z +- W +- HH +- ER0 +- AH1 +- IY1 +- P +- V +- F +- B +- AY1 +- IY0 +- EY1 +- AA1 +- AO1 +- UW1 +- IH0 +- OW1 +- NG +- G +- SH +- ER1 +- Y +- TH +- AW1 +- CH +- UH1 +- IH2 +- JH +- OW0 +- EH2 +- OY1 +- AY2 +- EH0 +- EY2 +- UW0 +- AE2 +- AA2 +- OW2 +- AH2 +- ZH +- AO2 +- IY2 +- AE0 +- UW2 +- AY0 +- AA0 +- AO0 +- AW2 +- EY0 +- UH2 +- ER2 +- OY2 +- UH0 +- AW0 +- OY0 +- diff --git a/examples/vctk/ernie_sat/local/preprocess.sh b/examples/vctk/ernie_sat/local/preprocess.sh new file mode 100755 index 000000000..a0a3881f0 --- /dev/null +++ b/examples/vctk/ernie_sat/local/preprocess.sh @@ -0,0 +1,61 @@ +#!/bin/bash + +stage=0 +stop_stage=100 + +config_path=$1 + +if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then + # get durations from MFA's result + echo "Generate durations.txt from MFA results ..." + python3 ${MAIN_ROOT}/utils/gen_duration_from_textgrid.py \ + --inputdir=./vctk_alignment \ + --output durations.txt \ + --config=${config_path} +fi + +if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then + # extract features + echo "Extract features ..." 
+ python3 ${BIN_DIR}/preprocess.py \ + --dataset=vctk \ + --rootdir=~/datasets/VCTK-Corpus-0.92/ \ + --dumpdir=dump \ + --dur-file=durations.txt \ + --config=${config_path} \ + --num-cpu=20 \ + --cut-sil=True +fi + +if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then + # get features' stats(mean and std) + echo "Get features' stats ..." + python3 ${MAIN_ROOT}/utils/compute_statistics.py \ + --metadata=dump/train/raw/metadata.jsonl \ + --field-name="speech" +fi + +if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then + # normalize and covert phone/speaker to id, dev and test should use train's stats + echo "Normalize ..." + python3 ${BIN_DIR}/normalize.py \ + --metadata=dump/train/raw/metadata.jsonl \ + --dumpdir=dump/train/norm \ + --speech-stats=dump/train/speech_stats.npy \ + --phones-dict=dump/phone_id_map.txt \ + --speaker-dict=dump/speaker_id_map.txt + + python3 ${BIN_DIR}/normalize.py \ + --metadata=dump/dev/raw/metadata.jsonl \ + --dumpdir=dump/dev/norm \ + --speech-stats=dump/train/speech_stats.npy \ + --phones-dict=dump/phone_id_map.txt \ + --speaker-dict=dump/speaker_id_map.txt + + python3 ${BIN_DIR}/normalize.py \ + --metadata=dump/test/raw/metadata.jsonl \ + --dumpdir=dump/test/norm \ + --speech-stats=dump/train/speech_stats.npy \ + --phones-dict=dump/phone_id_map.txt \ + --speaker-dict=dump/speaker_id_map.txt +fi diff --git a/examples/vctk/ernie_sat/local/synthesize.sh b/examples/vctk/ernie_sat/local/synthesize.sh new file mode 100755 index 000000000..b24db018a --- /dev/null +++ b/examples/vctk/ernie_sat/local/synthesize.sh @@ -0,0 +1,45 @@ +#!/bin/bash + +config_path=$1 +train_output_path=$2 +ckpt_name=$3 + +stage=1 +stop_stage=1 + +# use am to predict duration here +# 增加 am_phones_dict am_tones_dict 等,也可以用新的方式构造 am, 不需要这么多参数了就 + +# pwgan +if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then + FLAGS_allocator_strategy=naive_best_fit \ + FLAGS_fraction_of_gpu_memory_to_use=0.01 \ + python3 ${BIN_DIR}/synthesize.py \ + --erniesat_config=${config_path} \ + --erniesat_ckpt=${train_output_path}/checkpoints/${ckpt_name} \ + --erniesat_stat=dump/train/speech_stats.npy \ + --voc=pwgan_vctk \ + --voc_config=pwg_vctk_ckpt_0.1.1/default.yaml \ + --voc_ckpt=pwg_vctk_ckpt_0.1.1/snapshot_iter_1500000.pdz \ + --voc_stat=pwg_vctk_ckpt_0.1.1/feats_stats.npy \ + --test_metadata=dump/test/norm/metadata.jsonl \ + --output_dir=${train_output_path}/test \ + --phones_dict=dump/phone_id_map.txt +fi + +# hifigan +if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then + FLAGS_allocator_strategy=naive_best_fit \ + FLAGS_fraction_of_gpu_memory_to_use=0.01 \ + python3 ${BIN_DIR}/synthesize.py \ + --erniesat_config=${config_path} \ + --erniesat_ckpt=${train_output_path}/checkpoints/${ckpt_name} \ + --erniesat_stat=dump/train/speech_stats.npy \ + --voc=hifigan_vctk \ + --voc_config=hifigan_vctk_ckpt_0.2.0/default.yaml \ + --voc_ckpt=hifigan_vctk_ckpt_0.2.0/snapshot_iter_2500000.pdz \ + --voc_stat=hifigan_vctk_ckpt_0.2.0/feats_stats.npy \ + --test_metadata=dump/test/norm/metadata.jsonl \ + --output_dir=${train_output_path}/test \ + --phones_dict=dump/phone_id_map.txt +fi diff --git a/examples/vctk/ernie_sat/local/train.sh b/examples/vctk/ernie_sat/local/train.sh new file mode 100755 index 000000000..30720e8f5 --- /dev/null +++ b/examples/vctk/ernie_sat/local/train.sh @@ -0,0 +1,12 @@ +#!/bin/bash + +config_path=$1 +train_output_path=$2 + +python3 ${BIN_DIR}/train.py \ + --train-metadata=dump/train/norm/metadata.jsonl \ + --dev-metadata=dump/dev/norm/metadata.jsonl \ + --config=${config_path} \ + 
--output-dir=${train_output_path} \ + --ngpu=2 \ + --phones-dict=dump/phone_id_map.txt \ No newline at end of file diff --git a/examples/vctk/ernie_sat/path.sh b/examples/vctk/ernie_sat/path.sh new file mode 100755 index 000000000..4ecab0251 --- /dev/null +++ b/examples/vctk/ernie_sat/path.sh @@ -0,0 +1,13 @@ +#!/bin/bash +export MAIN_ROOT=`realpath ${PWD}/../../../` + +export PATH=${MAIN_ROOT}:${MAIN_ROOT}/utils:${PATH} +export LC_ALL=C + +export PYTHONDONTWRITEBYTECODE=1 +# Use UTF-8 in Python to avoid UnicodeDecodeError when LC_ALL=C +export PYTHONIOENCODING=UTF-8 +export PYTHONPATH=${MAIN_ROOT}:${PYTHONPATH} + +MODEL=ernie_sat +export BIN_DIR=${MAIN_ROOT}/paddlespeech/t2s/exps/${MODEL} \ No newline at end of file diff --git a/examples/vctk/ernie_sat/run.sh b/examples/vctk/ernie_sat/run.sh new file mode 100755 index 000000000..d75a19f23 --- /dev/null +++ b/examples/vctk/ernie_sat/run.sh @@ -0,0 +1,32 @@ +#!/bin/bash + +set -e +source path.sh + +gpus=0,1 +stage=0 +stop_stage=100 + +conf_path=conf/default.yaml +train_output_path=exp/default +ckpt_name=snapshot_iter_153.pdz + +# with the following command, you can choose the stage range you want to run +# such as `./run.sh --stage 0 --stop-stage 0` +# this can not be mixed use with `$1`, `$2` ... +source ${MAIN_ROOT}/utils/parse_options.sh || exit 1 + +if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then + # prepare data + ./local/preprocess.sh ${conf_path} || exit -1 +fi + +if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then + # train model, all `ckpt` under `train_output_path/checkpoints/` dir + CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path} || exit -1 +fi + +if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then + # synthesize, vocoder is pwgan + CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name} || exit -1 +fi diff --git a/examples/vctk/tts3/README.md b/examples/vctk/tts3/README.md index 0b0ce0934..9c0d75616 100644 --- a/examples/vctk/tts3/README.md +++ b/examples/vctk/tts3/README.md @@ -218,6 +218,12 @@ optional arguments: Pretrained FastSpeech2 model with no silence in the edge of audios: - [fastspeech2_nosil_vctk_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_vctk_ckpt_0.5.zip) +The static model can be downloaded here: +- [fastspeech2_vctk_static_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_vctk_static_1.1.0.zip) + +The ONNX model can be downloaded here: +- [fastspeech2_vctk_onnx_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_vctk_onnx_1.1.0.zip) + FastSpeech2 checkpoint contains files listed below. ```text fastspeech2_nosil_vctk_ckpt_0.5 diff --git a/examples/vctk/tts3/conf/default.yaml b/examples/vctk/tts3/conf/default.yaml index 1bca9107b..a75658d3d 100644 --- a/examples/vctk/tts3/conf/default.yaml +++ b/examples/vctk/tts3/conf/default.yaml @@ -24,7 +24,7 @@ f0max: 400 # Maximum f0 for pitch extraction. 
# DATA SETTING # ########################################################### batch_size: 64 -num_workers: 4 +num_workers: 2 ########################################################### @@ -88,8 +88,8 @@ updater: # OPTIMIZER SETTING # ########################################################### optimizer: - optim: adam # optimizer type - learning_rate: 0.001 # learning rate + optim: adam # optimizer type + learning_rate: 0.001 # learning rate ########################################################### # TRAINING SETTING # diff --git a/examples/vctk/tts3/local/inference.sh b/examples/vctk/tts3/local/inference.sh index caef89d8b..9c4426146 100755 --- a/examples/vctk/tts3/local/inference.sh +++ b/examples/vctk/tts3/local/inference.sh @@ -18,3 +18,15 @@ if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then --lang=en fi +if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then + python3 ${BIN_DIR}/../inference.py \ + --inference_dir=${train_output_path}/inference \ + --am=fastspeech2_vctk \ + --voc=hifigan_vctk \ + --text=${BIN_DIR}/../sentences_en.txt \ + --output_dir=${train_output_path}/pd_infer_out \ + --phones_dict=dump/phone_id_map.txt \ + --speaker_dict=dump/speaker_id_map.txt \ + --spk_id=0 \ + --lang=en +fi diff --git a/examples/vctk/tts3/local/ort_predict.sh b/examples/vctk/tts3/local/ort_predict.sh new file mode 100755 index 000000000..4019e17fa --- /dev/null +++ b/examples/vctk/tts3/local/ort_predict.sh @@ -0,0 +1,34 @@ +train_output_path=$1 + +stage=0 +stop_stage=0 + +# e2e, synthesize from text +if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then + python3 ${BIN_DIR}/../ort_predict_e2e.py \ + --inference_dir=${train_output_path}/inference_onnx \ + --am=fastspeech2_vctk \ + --voc=pwgan_vctk \ + --output_dir=${train_output_path}/onnx_infer_out_e2e \ + --text=${BIN_DIR}/../sentences_en.txt \ + --phones_dict=dump/phone_id_map.txt \ + --device=cpu \ + --cpu_threads=2 \ + --spk_id=0 \ + --lang=en + +fi + +if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then + python3 ${BIN_DIR}/../ort_predict_e2e.py \ + --inference_dir=${train_output_path}/inference_onnx \ + --am=fastspeech2_vctk \ + --voc=hifigan_vctk \ + --output_dir=${train_output_path}/onnx_infer_out_e2e \ + --text=${BIN_DIR}/../sentences_en.txt \ + --phones_dict=dump/phone_id_map.txt \ + --device=cpu \ + --cpu_threads=2 \ + --spk_id=0 \ + --lang=en +fi diff --git a/examples/vctk/tts3/local/paddle2onnx.sh b/examples/vctk/tts3/local/paddle2onnx.sh new file mode 120000 index 000000000..8d5dbef4c --- /dev/null +++ b/examples/vctk/tts3/local/paddle2onnx.sh @@ -0,0 +1 @@ +../../../csmsc/tts3/local/paddle2onnx.sh \ No newline at end of file diff --git a/examples/vctk/tts3/local/synthesize.sh b/examples/vctk/tts3/local/synthesize.sh index 9e03f9b8a..87145959f 100755 --- a/examples/vctk/tts3/local/synthesize.sh +++ b/examples/vctk/tts3/local/synthesize.sh @@ -31,7 +31,7 @@ if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then FLAGS_allocator_strategy=naive_best_fit \ FLAGS_fraction_of_gpu_memory_to_use=0.01 \ python3 ${BIN_DIR}/../synthesize.py \ - --am=fastspeech2_aishell3 \ + --am=fastspeech2_vctk \ --am_config=${config_path} \ --am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \ --am_stat=dump/train/speech_stats.npy \ diff --git a/examples/vctk/tts3/run.sh b/examples/vctk/tts3/run.sh index a2b849bc8..b45afd7be 100755 --- a/examples/vctk/tts3/run.sh +++ b/examples/vctk/tts3/run.sh @@ -27,11 +27,34 @@ if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then fi if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then - # synthesize, 
vocoder is pwgan + # synthesize, vocoder is pwgan by default CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name} || exit -1 fi if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then - # synthesize_e2e, vocoder is pwgan + # synthesize_e2e, vocoder is pwgan by default CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize_e2e.sh ${conf_path} ${train_output_path} ${ckpt_name} || exit -1 fi + +if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then + # inference with static model, vocoder is pwgan by default + CUDA_VISIBLE_DEVICES=${gpus} ./local/inference.sh ${train_output_path} || exit -1 +fi + +if [ ${stage} -le 5 ] && [ ${stop_stage} -ge 5 ]; then + # install paddle2onnx + version=$(echo `pip list |grep "paddle2onnx"` |awk -F" " '{print $2}') + if [[ -z "$version" || ${version} != '0.9.8' ]]; then + pip install paddle2onnx==0.9.8 + fi + ./local/paddle2onnx.sh ${train_output_path} inference inference_onnx fastspeech2_vctk + # considering the balance between speed and quality, we recommend that you use hifigan as vocoder + ./local/paddle2onnx.sh ${train_output_path} inference inference_onnx pwgan_vctk + # ./local/paddle2onnx.sh ${train_output_path} inference inference_onnx hifigan_vctk + +fi + +# inference with onnxruntime, use fastspeech2 + pwgan by default +if [ ${stage} -le 6 ] && [ ${stop_stage} -ge 6 ]; then + ./local/ort_predict.sh ${train_output_path} +fi diff --git a/examples/vctk/voc1/README.md b/examples/vctk/voc1/README.md index a0e06a420..2d80e7563 100644 --- a/examples/vctk/voc1/README.md +++ b/examples/vctk/voc1/README.md @@ -135,6 +135,13 @@ optional arguments: Pretrained models can be downloaded here: - [pwg_vctk_ckpt_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_vctk_ckpt_0.1.1.zip) +The static model can be downloaded here: +- [pwgan_vctk_static_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwgan_vctk_static_1.1.0.zip) + +The ONNX model can be downloaded here: +- [pwgan_vctk_onnx_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwgan_vctk_onnx_1.1.0.zip) + + Parallel WaveGAN checkpoint contains files listed below. 
```text diff --git a/examples/vctk/voc5/README.md b/examples/vctk/voc5/README.md index f2cbf27d2..e937679b5 100644 --- a/examples/vctk/voc5/README.md +++ b/examples/vctk/voc5/README.md @@ -121,6 +121,12 @@ optional arguments: The pretrained model can be downloaded here: - [hifigan_vctk_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_vctk_ckpt_0.2.0.zip) +The static model can be downloaded here: +- [hifigan_vctk_static_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_vctk_static_1.1.0.zip) + +The ONNX model can be downloaded here: +- [hifigan_vctk_onnx_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_vctk_onnx_1.1.0.zip) + Model | Step | eval/generator_loss | eval/mel_loss| eval/feature_matching_loss :-------------:| :------------:| :-----: | :-----: | :--------: diff --git a/examples/voxceleb/sv0/local/convert.sh b/examples/voxceleb/sv0/local/convert.sh new file mode 100755 index 000000000..f03ac3dd3 --- /dev/null +++ b/examples/voxceleb/sv0/local/convert.sh @@ -0,0 +1,27 @@ +# copy this to root directory of data and +# chmod a+x convert.sh +# ./convert.sh +# https://unix.stackexchange.com/questions/103920/parallelize-a-bash-for-loop +dir=$1 +open_sem(){ + mkfifo pipe-$$ + exec 3<>pipe-$$ + rm pipe-$$ + local i=$1 + for((;i>0;i--)); do + printf %s 000 >&3 + done +} +run_with_lock(){ + local x + read -u 3 -n 3 x && ((0==x)) || exit $x + ( + ( "$@"; ) + printf '%.3d' $? >&3 + )& +} +N=32 # number of vCPU +open_sem $N +for f in $(find ${dir} -name "*.m4a"); do + run_with_lock ffmpeg -loglevel panic -i "$f" -ar 16000 "${f%.*}.wav" +done diff --git a/examples/voxceleb/sv0/local/data.sh b/examples/voxceleb/sv0/local/data.sh index d6010ec66..366397484 100755 --- a/examples/voxceleb/sv0/local/data.sh +++ b/examples/voxceleb/sv0/local/data.sh @@ -74,7 +74,7 @@ if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then # convert the m4a to wav # and we will not delete the original m4a file echo "start to convert the m4a to wav" - bash local/convert.sh ${TARGET_DIR}/voxceleb/vox2/test/ || exit 1; + bash local/convert.sh ${TARGET_DIR}/voxceleb/vox2/ || exit 1; if [ $? -ne 0 ]; then echo "Convert voxceleb2 dataset from m4a to wav failed. Terminated." 
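The `convert.sh` helper added above parallelizes the m4a-to-wav conversion with a fifo-based semaphore (at most `N=32` concurrent ffmpeg jobs) and leaves the original m4a files in place. A minimal manual invocation, assuming VoxCeleb2 has already been downloaded under `${TARGET_DIR}/voxceleb/vox2/` as in `local/data.sh`:
```bash
# convert every *.m4a under vox2/ to a 16 kHz wav written next to the original file
chmod a+x local/convert.sh
bash local/convert.sh ${TARGET_DIR}/voxceleb/vox2/
```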
diff --git a/examples/wenetspeech/asr1/conf/conformer.yaml b/examples/wenetspeech/asr1/conf/conformer.yaml index 6c2bbca41..8a44db1e8 100644 --- a/examples/wenetspeech/asr1/conf/conformer.yaml +++ b/examples/wenetspeech/asr1/conf/conformer.yaml @@ -1,7 +1,6 @@ ############################################ # Network Architecture # ############################################ -cmvn_file: cmvn_file_type: "json" # encoder related encoder: conformer @@ -38,45 +37,48 @@ model_conf: ctc_weight: 0.3 lsm_weight: 0.1 # label smoothing option length_normalized_loss: false + init_type: 'kaiming_uniform' # !Warning: need to convergence # https://yaml.org/type/float.html ########################################### # Data # ########################################### -train_manifest: data/manifest.train -dev_manifest: data/manifest.dev -test_manifest: data/manifest.test +train_manifest: data/train_l/data.list +dev_manifest: data/dev/data.list +test_manifest: data/test_meeting/data.list ########################################### # Dataloader # ########################################### -vocab_filepath: data/lang_char/vocab.txt +use_stream_data: True unit_type: 'char' +vocab_filepath: data/lang_char/vocab.txt preprocess_config: conf/preprocess.yaml +cmvn_file: data/mean_std.json spm_model_prefix: '' feat_dim: 80 stride_ms: 10.0 window_ms: 25.0 +dither: 0.1 sortagrad: 0 # Feed samples from shortest to longest ; -1: enabled for all epochs, 0: disabled, other: enabled for 'other' epochs -batch_size: 64 -maxlen_in: 512 # if input length > maxlen-in, batchsize is automatically reduced -maxlen_out: 150 # if output length > maxlen-out, batchsize is automatically reduced -minibatches: 0 # for debug -batch_count: auto -batch_bins: 0 -batch_frames_in: 0 -batch_frames_out: 0 -batch_frames_inout: 0 -num_workers: 0 -subsampling_factor: 1 +batch_size: 32 +minlen_in: 10 +maxlen_in: 1200 # if input length(number of frames) > maxlen-in, data is automatically removed +minlen_out: 0 +maxlen_out: 150 # if output length(number of tokens) > maxlen-out, data is automatically removed +resample_rate: 16000 +shuffle_size: 1500 # read number of 'shuffle_size' data as a chunk, shuffle the data in the chunk +sort_size: 1000 # read number of 'sort_size' data as a chunk, sort the data in the chunk +num_workers: 8 +prefetch_factor: 10 +dist_sampler: True num_encs: 1 - ########################################### # Training # ########################################### -n_epoch: 240 -accum_grad: 16 +n_epoch: 32 +accum_grad: 32 global_grad_clip: 5.0 log_interval: 100 checkpoint: diff --git a/examples/wenetspeech/asr1/local/data.sh b/examples/wenetspeech/asr1/local/data.sh index d216dd84a..62579ba32 100755 --- a/examples/wenetspeech/asr1/local/data.sh +++ b/examples/wenetspeech/asr1/local/data.sh @@ -2,6 +2,8 @@ # Copyright 2021 Mobvoi Inc(Author: Di Wu, Binbin Zhang) # NPU, ASLP Group (Author: Qijie Shao) +# +# Modified from wenet(https://github.com/wenet-e2e/wenet) stage=-1 stop_stage=100 @@ -30,7 +32,7 @@ mkdir -p data TARGET_DIR=${MAIN_ROOT}/dataset mkdir -p ${TARGET_DIR} -if [ ${stage} -le -2 ] && [ ${stop_stage} -ge -2 ]; then +if [ ${stage} -le -1 ] && [ ${stop_stage} -ge -1 ]; then # download data echo "Please follow https://github.com/wenet-e2e/WenetSpeech to download the data." 
exit 0; @@ -44,86 +46,57 @@ if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then data || exit 1; fi -if [ ${stage} -le -1 ] && [ ${stop_stage} -ge -1 ]; then - # generate manifests - python3 ${TARGET_DIR}/aishell/aishell.py \ - --manifest_prefix="data/manifest" \ - --target_dir="${TARGET_DIR}/aishell" - - if [ $? -ne 0 ]; then - echo "Prepare Aishell failed. Terminated." - exit 1 - fi - - for dataset in train dev test; do - mv data/manifest.${dataset} data/manifest.${dataset}.raw - done -fi - -if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then - # compute mean and stddev for normalizer - if $cmvn; then - full_size=`cat data/${train_set}/wav.scp | wc -l` - sampling_size=$((full_size / cmvn_sampling_divisor)) - shuf -n $sampling_size data/$train_set/wav.scp \ - > data/$train_set/wav.scp.sampled - num_workers=$(nproc) - - python3 ${MAIN_ROOT}/utils/compute_mean_std.py \ - --manifest_path="data/manifest.train.raw" \ - --spectrum_type="fbank" \ - --feat_dim=80 \ - --delta_delta=false \ - --stride_ms=10 \ - --window_ms=25 \ - --sample_rate=16000 \ - --use_dB_normalization=False \ - --num_samples=-1 \ - --num_workers=${num_workers} \ - --output_path="data/mean_std.json" - - if [ $? -ne 0 ]; then - echo "Compute mean and stddev failed. Terminated." - exit 1 - fi - fi -fi - -dict=data/dict/lang_char.txt +dict=data/lang_char/vocab.txt if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then - # download data, generate manifests - # build vocabulary - python3 ${MAIN_ROOT}/utils/build_vocab.py \ - --unit_type="char" \ - --count_threshold=0 \ - --vocab_path="data/lang_char/vocab.txt" \ - --manifest_paths "data/manifest.train.raw" - - if [ $? -ne 0 ]; then - echo "Build vocabulary failed. Terminated." - exit 1 - fi + echo "Make a dictionary" + echo "dictionary: ${dict}" + mkdir -p $(dirname $dict) + echo "" > ${dict} # 0 will be used for "blank" in CTC + echo "" >> ${dict} # must be 1 + echo "▁" >> ${dict} # ▁ is for space + utils/text2token.py -s 1 -n 1 --space "▁" data/${train_set}/text \ + | cut -f 2- -d" " | tr " " "\n" \ + | sort | uniq | grep -a -v -e '^\s*$' \ + | grep -v "▁" \ + | awk '{print $0}' >> ${dict} \ + || exit 1; + echo "" >> $dict fi if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then - # format manifest with tokenids, vocab size - for dataset in train dev test; do - { - python3 ${MAIN_ROOT}/utils/format_data.py \ - --cmvn_path "data/mean_std.json" \ - --unit_type "char" \ - --vocab_path="data/vocab.txt" \ - --manifest_path="data/manifest.${dataset}.raw" \ - --output_path="data/manifest.${dataset}" + echo "Compute cmvn" + # Here we use all the training data, you can sample some some data to save time + # BUG!!! We should use the segmented data for CMVN + if $cmvn; then + full_size=`cat data/${train_set}/wav.scp | wc -l` + sampling_size=$((full_size / cmvn_sampling_divisor)) + shuf -n $sampling_size data/$train_set/wav.scp \ + > data/$train_set/wav.scp.sampled + python3 utils/compute_cmvn_stats.py \ + --num_workers 16 \ + --train_config $train_config \ + --in_scp data/$train_set/wav.scp.sampled \ + --out_cmvn data/$train_set/mean_std.json \ + || exit 1; + fi +fi - if [ $? -ne 0 ]; then - echo "Formt mnaifest failed. Terminated." - exit 1 - fi - } & - done - wait +if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then + echo "Making shards, please wait..." 
+ RED='\033[0;31m' + NOCOLOR='\033[0m' + echo -e "It requires ${RED}1.2T ${NOCOLOR}space for $shards_dir, please make sure you have enough space" + echo -e "It takes about ${RED}12 ${NOCOLOR}hours with 32 threads" + for x in $dev_set $test_sets ${train_set}; do + dst=$shards_dir/$x + mkdir -p $dst + utils/make_filted_shard_list.py --num_node 1 --num_gpus_per_node 8 --num_utts_per_shard 1000 \ + --do_filter --resample 16000 \ + --num_threads 32 --segments data/$x/segments \ + data/$x/wav.scp data/$x/text \ + $(realpath $dst) data/$x/data.list + done fi -echo "Aishell data preparation done." +echo "Wenetspeech data preparation done." exit 0 diff --git a/examples/wenetspeech/asr1/local/export.sh b/examples/wenetspeech/asr1/local/export.sh new file mode 100755 index 000000000..6b646b469 --- /dev/null +++ b/examples/wenetspeech/asr1/local/export.sh @@ -0,0 +1,28 @@ +#!/bin/bash + +if [ $# != 3 ];then + echo "usage: $0 config_path ckpt_prefix jit_model_path" + exit -1 +fi + +ngpu=$(echo $CUDA_VISIBLE_DEVICES | awk -F "," '{print NF}') +echo "using $ngpu gpus..." + +config_path=$1 +ckpt_path_prefix=$2 +jit_model_export_path=$3 + +python3 -u ${BIN_DIR}/export.py \ +--ngpu ${ngpu} \ +--config ${config_path} \ +--checkpoint_path ${ckpt_path_prefix} \ +--export_path ${jit_model_export_path} + + +if [ $? -ne 0 ]; then + echo "Failed in export!" + exit 1 +fi + + +exit 0 diff --git a/examples/wenetspeech/asr1/local/train.sh b/examples/wenetspeech/asr1/local/train.sh new file mode 100755 index 000000000..01af00b61 --- /dev/null +++ b/examples/wenetspeech/asr1/local/train.sh @@ -0,0 +1,68 @@ +#!/bin/bash + +profiler_options= +benchmark_batch_size=0 +benchmark_max_step=0 + +# seed may break model convergence +seed=0 + +source ${MAIN_ROOT}/utils/parse_options.sh || exit 1; + +ngpu=$(echo $CUDA_VISIBLE_DEVICES | awk -F "," '{print NF}') +echo "using $ngpu gpus..." + +if [ ${seed} != 0 ]; then + export FLAGS_cudnn_deterministic=True + echo "using seed $seed & FLAGS_cudnn_deterministic=True ..." +fi + +if [ $# -lt 2 ] && [ $# -gt 3 ];then + echo "usage: CUDA_VISIBLE_DEVICES=0 ${0} config_path ckpt_name ips(optional)" + exit -1 +fi + +config_path=$1 +ckpt_name=$2 +ips=$3 + +if [ ! $ips ];then + ips_config= +else + ips_config="--ips="${ips} +fi +echo ${ips_config} + +mkdir -p exp + +if [ ${ngpu} == 0 ]; then +python3 -u ${BIN_DIR}/train.py \ +--ngpu ${ngpu} \ +--seed ${seed} \ +--config ${config_path} \ +--output exp/${ckpt_name} \ +--profiler-options "${profiler_options}" \ +--benchmark-batch-size ${benchmark_batch_size} \ +--benchmark-max-step ${benchmark_max_step} +else +NCCL_SOCKET_IFNAME=eth0 python3 -m paddle.distributed.launch --gpus=${CUDA_VISIBLE_DEVICES} ${ips_config} ${BIN_DIR}/train.py \ +--ngpu ${ngpu} \ +--seed ${seed} \ +--config ${config_path} \ +--output exp/${ckpt_name} \ +--profiler-options "${profiler_options}" \ +--benchmark-batch-size ${benchmark_batch_size} \ +--benchmark-max-step ${benchmark_max_step} +fi + + +if [ ${seed} != 0 ]; then + unset FLAGS_cudnn_deterministic +fi + +if [ $? -ne 0 ]; then + echo "Failed in training!" + exit 1 +fi + +exit 0 diff --git a/examples/wenetspeech/asr1/local/wenetspeech_data_prep.sh b/examples/wenetspeech/asr1/local/wenetspeech_data_prep.sh index 858530534..baa2b32df 100755 --- a/examples/wenetspeech/asr1/local/wenetspeech_data_prep.sh +++ b/examples/wenetspeech/asr1/local/wenetspeech_data_prep.sh @@ -24,7 +24,7 @@ stage=1 prefix= train_subset=L -. ./tools/parse_options.sh || exit 1; +. 
./utils/parse_options.sh || exit 1; filter_by_id () { idlist=$1 @@ -132,4 +132,4 @@ done fi -echo "$0: Done" \ No newline at end of file +echo "$0: Done" diff --git a/examples/wenetspeech/asr1/run.sh b/examples/wenetspeech/asr1/run.sh index 9995bc63e..ddce0a9c8 100644 --- a/examples/wenetspeech/asr1/run.sh +++ b/examples/wenetspeech/asr1/run.sh @@ -7,6 +7,7 @@ gpus=0,1,2,3,4,5,6,7 stage=0 stop_stage=100 conf_path=conf/conformer.yaml +ips= #xxx.xxx.xxx.xxx,xxx.xxx.xxx.xxx decode_conf_path=conf/tuning/decode.yaml average_checkpoint=true avg_num=10 @@ -26,7 +27,7 @@ fi if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then # train model, all `ckpt` under `exp` dir - CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${ckpt} + CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${ckpt} ${ips} fi if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then diff --git a/examples/zh_en_tts/tts3/README.md b/examples/zh_en_tts/tts3/README.md new file mode 100644 index 000000000..131d7f2c4 --- /dev/null +++ b/examples/zh_en_tts/tts3/README.md @@ -0,0 +1,298 @@ + +# Mixed Chinese and English TTS with CSMSC, LJSpeech-1.1, AISHELL-3 and VCTK datasets + +This example contains code used to train a [Fastspeech2](https://arxiv.org/abs/2006.04558) model with [CSMSC](https://www.data-baker.com/open_source.html), [LJSpeech-1.1](https://keithito.com/LJ-Speech-Dataset/), [AISHELL3](http://www.aishelltech.com/aishell_3) and [VCTK](https://datashare.ed.ac.uk/handle/10283/3443) datasets. + + +## Dataset +### Download and Extract +Download all datasets and extract them to `~/datasets`: +- The CSMSC dataset is in the directory `~/datasets/BZNSYP` +- The Ljspeech dataset is in the directory `~/datasets/LJSpeech-1.1` +- The aishell3 dataset is in the directory `~/datasets/data_aishell3` +- The vctk dataset is in the directory `~/datasets/VCTK-Corpus-0.92` + +### Get MFA Result and Extract +We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get durations for the fastspeech2 training. +You can download them from here: +- [baker_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/BZNSYP/with_tone/baker_alignment_tone.tar.gz) +- [ljspeech_alignment.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/LJSpeech-1.1/ljspeech_alignment.tar.gz) +- [aishell3_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/AISHELL-3/with_tone/aishell3_alignment_tone.tar.gz) +- [vctk_alignment.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/VCTK-Corpus-0.92/vctk_alignment.tar.gz) + +Or train your own MFA model by referring to the [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) (use MFA1.x now) in our repo. + +## Get Started +Assume the paths to the datasets are: +- `~/datasets/BZNSYP` +- `~/datasets/LJSpeech-1.1` +- `~/datasets/data_aishell3` +- `~/datasets/VCTK-Corpus-0.92` + +Assume the paths to the MFA results of the datasets are: +- `./mfa_results/baker_alignment_tone` +- `./mfa_results/ljspeech_alignment` +- `./mfa_results/aishell3_alignment_tone` +- `./mfa_results/vctk_alignment` + +Run the command below to +1. **source path**. +2. preprocess the dataset. +3. train the model. +4. synthesize wavs. + - synthesize waveform from `metadata.jsonl`. + - synthesize waveform from a text file. +```bash +./run.sh +``` + +You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage. For example, running the following command will only preprocess the dataset.
+```bash +./run.sh --stage 0 --stop-stage 0 +``` + +### Data Preprocessing +```bash +./local/preprocess.sh ${conf_path} ${datasets_root_dir} ${mfa_root_dir} +``` +When it is done, a `dump` folder is created in the current directory. The structure of the dump folder is listed below. +```text +dump +├── dev +│ ├── norm +│ └── raw +├── phone_id_map.txt +├── speaker_id_map.txt +├── test +│ ├── norm +│ └── raw +└── train + ├── energy_stats.npy + ├── norm + ├── pitch_stats.npy + ├── raw + └── speech_stats.npy +``` +The dataset is split into 3 parts, namely `train`, `dev`, and `test`, each of which contains a `norm` and `raw` subfolder. The raw folder contains speech, pitch and energy features of each utterance, while the norm folder contains normalized ones. The statistics used to normalize features are computed from the training set, which are located in `dump/train/*_stats.npy`. + +Also, there is a `metadata.jsonl` in each subfolder. It is a table-like file that contains phones, text_lengths, speech_lengths, durations, the path of speech features, the path of pitch features, the path of energy features, speaker, and id of each utterance. + + +### Model Training +`./local/train.sh` calls `${BIN_DIR}/train.py`. +```bash +CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path} +``` +Here's the complete help message. +```text +usage: train.py [-h] [--config CONFIG] [--train-metadata TRAIN_METADATA] + [--dev-metadata DEV_METADATA] [--output-dir OUTPUT_DIR] + [--ngpu NGPU] [--phones-dict PHONES_DICT] + [--speaker-dict SPEAKER_DICT] [--voice-cloning VOICE_CLONING] + +Train a FastSpeech2 model. + +optional arguments: + -h, --help show this help message and exit + --config CONFIG fastspeech2 config file. + --train-metadata TRAIN_METADATA + training data. + --dev-metadata DEV_METADATA + dev data. + --output-dir OUTPUT_DIR + output dir. + --ngpu NGPU if ngpu=0, use cpu. + --phones-dict PHONES_DICT + phone vocabulary file. + --speaker-dict SPEAKER_DICT + speaker id map file for multiple speaker model. + --voice-cloning VOICE_CLONING + whether training voice cloning model. +``` +1. `--config` is a config file in yaml format to overwrite the default config, which can be found at `conf/default.yaml`. +2. `--train-metadata` and `--dev-metadata` should be the metadata file in the normalized subfolder of `train` and `dev` in the `dump` folder. +3. `--output-dir` is the directory to save the results of the experiment. Checkpoints are saved in `checkpoints/` inside this directory. +4. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu. +5. `--phones-dict` is the path of the phone vocabulary file. +6. `--speaker-dict` is the path of the speaker id map file when training a multi-speaker FastSpeech2. + + +### Synthesizing +We use [parallel wavegan](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/voc1) as the default neural vocoder. +Download the pretrained parallel wavegan model from [pwg_aishell3_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_aishell3_ckpt_0.5.zip) and unzip it. + +When the speaker is `174` (csmsc), csmsc's vocoder works better than aishell3's, so we recommend using [hifigan_csmsc_ckpt_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_csmsc_ckpt_0.1.1.zip); please check `stage 2` of `synthesize_e2e.sh`.
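+
+For example, to try that combination you can fetch and unpack the recommended csmsc vocoder next to the other checkpoints (a short sketch; the link is the one given above):
+```bash
+# download the csmsc hifigan vocoder recommended for spk_id=174 and unzip it
+wget https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_csmsc_ckpt_0.1.1.zip
+unzip hifigan_csmsc_ckpt_0.1.1.zip
+```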
+ +But if the speaker is `175` (ljspeech), we **don't** recommend using ljspeech's vocoder, because ljspeech's vocoders are trained at a 22.05 kHz sample rate while this acoustic model is trained at 24 kHz. You can use csmsc's vocoder instead, since ljspeech and csmsc are both female speakers. + +For speakers in aishell3 and vctk, we recommend aishell3's or vctk's vocoders: since ljspeech and csmsc are both female speakers, their vocoders may not perform well for male speakers in aishell3 and vctk. You can check speaker names and spk_ids in `dump/speaker_id_map.txt`, check speakers' information (Age / Gender / Accents / region, etc.) in [this issue](https://github.com/PaddlePaddle/PaddleSpeech/issues/1620), and choose the `spk_id` you want. + + +```bash +unzip pwg_aishell3_ckpt_0.5.zip +``` +Parallel WaveGAN checkpoint contains files listed below. +```text +pwg_aishell3_ckpt_0.5 +├── default.yaml # default config used to train parallel wavegan +├── feats_stats.npy # statistics used to normalize spectrogram when training parallel wavegan +└── snapshot_iter_1000000.pdz # generator parameters of parallel wavegan +``` +`./local/synthesize.sh` calls `${BIN_DIR}/../synthesize.py`, which can synthesize waveform from `metadata.jsonl`. +```bash +CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name} +``` +```text +usage: synthesize.py [-h] + [--am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk,tacotron2_csmsc,tacotron2_ljspeech,tacotron2_aishell3, fastspeech2_mix}] + [--am_config AM_CONFIG] [--am_ckpt AM_CKPT] + [--am_stat AM_STAT] [--phones_dict PHONES_DICT] + [--tones_dict TONES_DICT] [--speaker_dict SPEAKER_DICT] + [--voice-cloning VOICE_CLONING] + [--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc,wavernn_csmsc,hifigan_csmsc,hifigan_ljspeech,hifigan_aishell3,hifigan_vctk,style_melgan_csmsc}] + [--voc_config VOC_CONFIG] [--voc_ckpt VOC_CKPT] + [--voc_stat VOC_STAT] [--ngpu NGPU] + [--test_metadata TEST_METADATA] [--output_dir OUTPUT_DIR] + +Synthesize with acoustic model & vocoder + +optional arguments: + -h, --help show this help message and exit + --am {speedyspeech_csmsc,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk,tacotron2_csmsc,tacotron2_ljspeech,tacotron2_aishell3, fastspeech2_mix} + Choose acoustic model type of tts task. + --am_config AM_CONFIG + Config of acoustic model. + --am_ckpt AM_CKPT Checkpoint file of acoustic model. + --am_stat AM_STAT mean and standard deviation used to normalize + spectrogram when training acoustic model. + --phones_dict PHONES_DICT + phone vocabulary file. + --tones_dict TONES_DICT + tone vocabulary file. + --speaker_dict SPEAKER_DICT + speaker id map file. + --voice-cloning VOICE_CLONING + whether training voice cloning model. + --voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc,wavernn_csmsc,hifigan_csmsc,hifigan_ljspeech,hifigan_aishell3,hifigan_vctk,style_melgan_csmsc} + Choose vocoder type of tts task. + --voc_config VOC_CONFIG + Config of voc. + --voc_ckpt VOC_CKPT Checkpoint file of voc. + --voc_stat VOC_STAT mean and standard deviation used to normalize + spectrogram when training voc. + --ngpu NGPU if ngpu == 0, use cpu. + --test_metadata TEST_METADATA + test metadata. + --output_dir OUTPUT_DIR + output dir. + + +``` +`./local/synthesize_e2e.sh` calls `${BIN_DIR}/../synthesize_e2e.py`, which can synthesize waveform from a text file.
+```bash +CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize_e2e.sh ${conf_path} ${train_output_path} ${ckpt_name} +``` +```text +usage: synthesize_e2e.py [-h] + [--am {speedyspeech_csmsc,speedyspeech_aishell3,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk,tacotron2_csmsc,tacotron2_ljspeech, fastspeech2_mix}] + [--am_config AM_CONFIG] [--am_ckpt AM_CKPT] + [--am_stat AM_STAT] [--phones_dict PHONES_DICT] + [--tones_dict TONES_DICT] + [--speaker_dict SPEAKER_DICT] [--spk_id SPK_ID] + [--voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc,style_melgan_csmsc,hifigan_csmsc,hifigan_ljspeech,hifigan_aishell3,hifigan_vctk,wavernn_csmsc}] + [--voc_config VOC_CONFIG] [--voc_ckpt VOC_CKPT] + [--voc_stat VOC_STAT] [--lang LANG] + [--inference_dir INFERENCE_DIR] [--ngpu NGPU] + [--text TEXT] [--output_dir OUTPUT_DIR] + +Synthesize with acoustic model & vocoder + +optional arguments: + -h, --help show this help message and exit + --am {speedyspeech_csmsc,speedyspeech_aishell3,fastspeech2_csmsc,fastspeech2_ljspeech,fastspeech2_aishell3,fastspeech2_vctk,tacotron2_csmsc,tacotron2_ljspeech, fastspeech2_mix} + Choose acoustic model type of tts task. + --am_config AM_CONFIG + Config of acoustic model. + --am_ckpt AM_CKPT Checkpoint file of acoustic model. + --am_stat AM_STAT mean and standard deviation used to normalize + spectrogram when training acoustic model. + --phones_dict PHONES_DICT + phone vocabulary file. + --tones_dict TONES_DICT + tone vocabulary file. + --speaker_dict SPEAKER_DICT + speaker id map file. + --spk_id SPK_ID spk id for multi speaker acoustic model + --voc {pwgan_csmsc,pwgan_ljspeech,pwgan_aishell3,pwgan_vctk,mb_melgan_csmsc,style_melgan_csmsc,hifigan_csmsc,hifigan_ljspeech,hifigan_aishell3,hifigan_vctk,wavernn_csmsc} + Choose vocoder type of tts task. + --voc_config VOC_CONFIG + Config of voc. + --voc_ckpt VOC_CKPT Checkpoint file of voc. + --voc_stat VOC_STAT mean and standard deviation used to normalize + spectrogram when training voc. + --lang LANG Choose model language. zh or en or mix + --inference_dir INFERENCE_DIR + dir to save inference models + --ngpu NGPU if ngpu == 0, use cpu. + --text TEXT text to synthesize, a 'utt_id sentence' pair per line. + --output_dir OUTPUT_DIR + output dir. +``` +1. `--am` is acoustic model type with the format {model_name}_{dataset} +2. `--am_config`, `--am_ckpt`, `--am_stat`, `--phones_dict` `--speaker_dict` are arguments for acoustic model, which correspond to the 5 files in the fastspeech2 pretrained model. +3. `--voc` is vocoder type with the format {model_name}_{dataset} +4. `--voc_config`, `--voc_ckpt`, `--voc_stat` are arguments for vocoder, which correspond to the 3 files in the parallel wavegan pretrained model. +5. `--lang` is the model language, which can be `zh` or `en` or `mix`. +6. `--test_metadata` should be the metadata file in the normalized subfolder of `test` in the `dump` folder. +7. `--text` is the text file, which contains sentences to synthesize. +8. `--output_dir` is the directory to save synthesized audio files. +9. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu. 
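+
+If you want to synthesize your own sentences instead of `${BIN_DIR}/../sentences_mix.txt`, the file passed to `--text` just needs one `utt_id sentence` pair per line. A minimal hand-written example (the ids and sentences below are only placeholders):
+```bash
+# write two mixed Chinese-English test sentences to a file, then pass it via --text
+cat > my_sentences_mix.txt <<EOF
+001 欢迎使用 PaddleSpeech 语音合成,this is a mixed Chinese and English sentence.
+002 Text to speech is fun, 语音合成很有趣。
+EOF
+```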
+ + +## Pretrained Model +Pretrained FastSpeech2 model with no silence in the edge of audios: +- [fastspeech2_mix_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/t2s/chinse_english_mixed/models/fastspeech2_mix_ckpt_0.2.0.zip) + +The static model can be downloaded here: +- [fastspeech2_mix_static_0.2.0.zip](https://paddlespeech.bj.bcebos.com/t2s/chinse_english_mixed/models/fastspeech2_mix_static_0.2.0.zip) + +The ONNX model can be downloaded here: +- [fastspeech2_mix_onnx_0.2.0.zip](https://paddlespeech.bj.bcebos.com/t2s/chinse_english_mixed/models/fastspeech2_mix_onnx_0.2.0.zip) + +FastSpeech2 checkpoint contains files listed below. + +```text +fastspeech2_mix_ckpt_0.2.0 +├── default.yaml # default config used to train fastspeech2 +├── phone_id_map.txt # phone vocabulary file when training fastspeech2 +├── snapshot_iter_99200.pdz # model parameters and optimizer states +├── speaker_id_map.txt # speaker id map file when training a multi-speaker fastspeech2 +└── speech_stats.npy # statistics used to normalize spectrogram when training fastspeech2 +``` + + +You can use the following scripts to synthesize for `${BIN_DIR}/../sentences_mix.txt` using pretrained fastspeech2 and parallel wavegan models. +`174` means baker speaker, `175` means ljspeech speaker. For other speaker information, please see `speaker_id_map.txt`. + +```bash +source path.sh + +FLAGS_allocator_strategy=naive_best_fit \ +FLAGS_fraction_of_gpu_memory_to_use=0.01 \ +python3 ${BIN_DIR}/../synthesize_e2e.py \ + --am=fastspeech2_mix \ + --am_config=fastspeech2_mix_ckpt_0.2.0/default.yaml \ + --am_ckpt=fastspeech2_mix_ckpt_0.2.0/snapshot_iter_99200.pdz \ + --am_stat=fastspeech2_mix_ckpt_0.2.0/speech_stats.npy \ + --voc=pwgan_aishell3 \ + --voc_config=pwg_aishell3_ckpt_0.5/default.yaml \ + --voc_ckpt=pwg_aishell3_ckpt_0.5/snapshot_iter_1000000.pdz \ + --voc_stat=pwg_aishell3_ckpt_0.5/feats_stats.npy \ + --lang=mix \ + --text=${BIN_DIR}/../sentences_mix.txt \ + --output_dir=exp/default/test_e2e \ + --phones_dict=fastspeech2_mix_ckpt_0.2.0/phone_id_map.txt \ + --speaker_dict=fastspeech2_mix_ckpt_0.2.0/speaker_id_map.txt \ + --spk_id=174 \ + --inference_dir=exp/default/inference +``` diff --git a/examples/zh_en_tts/tts3/conf/default.yaml b/examples/zh_en_tts/tts3/conf/default.yaml new file mode 100644 index 000000000..e65b5d0ec --- /dev/null +++ b/examples/zh_en_tts/tts3/conf/default.yaml @@ -0,0 +1,104 @@ +########################################################### +# FEATURE EXTRACTION SETTING # +########################################################### + +fs: 24000 # sr +n_fft: 2048 # FFT size (samples). +n_shift: 300 # Hop size (samples). 12.5ms +win_length: 1200 # Window length (samples). 50ms + # If set to null, it will be the same as fft_size. +window: "hann" # Window function. + +# Only used for feats_type != raw + +fmin: 80 # Minimum frequency of Mel basis. +fmax: 7600 # Maximum frequency of Mel basis. +n_mels: 80 # The number of mel basis. + +# Only used for the model using pitch features (e.g. FastSpeech2) +f0min: 80 # Minimum f0 for pitch extraction. +f0max: 400 # Maximum f0 for pitch extraction. 
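+
+# Note: with fs = 24000, the hop and window durations quoted above follow directly from these sizes:
+#   n_shift / fs    = 300 / 24000  = 0.0125 s (12.5 ms)
+#   win_length / fs = 1200 / 24000 = 0.05 s   (50 ms)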
+ + +########################################################### +# DATA SETTING # +########################################################### +batch_size: 64 +num_workers: 2 + + +########################################################### +# MODEL SETTING # +########################################################### +model: + adim: 384 # attention dimension + aheads: 2 # number of attention heads + elayers: 4 # number of encoder layers + eunits: 1536 # number of encoder ff units + dlayers: 4 # number of decoder layers + dunits: 1536 # number of decoder ff units + positionwise_layer_type: conv1d # type of position-wise layer + positionwise_conv_kernel_size: 3 # kernel size of position wise conv layer + duration_predictor_layers: 2 # number of layers of duration predictor + duration_predictor_chans: 256 # number of channels of duration predictor + duration_predictor_kernel_size: 3 # filter size of duration predictor + postnet_layers: 5 # number of layers of postnset + postnet_filts: 5 # filter size of conv layers in postnet + postnet_chans: 256 # number of channels of conv layers in postnet + use_scaled_pos_enc: True # whether to use scaled positional encoding + encoder_normalize_before: True # whether to perform layer normalization before the input + decoder_normalize_before: True # whether to perform layer normalization before the input + reduction_factor: 1 # reduction factor + init_type: xavier_uniform # initialization type + init_enc_alpha: 1.0 # initial value of alpha of encoder scaled position encoding + init_dec_alpha: 1.0 # initial value of alpha of decoder scaled position encoding + transformer_enc_dropout_rate: 0.2 # dropout rate for transformer encoder layer + transformer_enc_positional_dropout_rate: 0.2 # dropout rate for transformer encoder positional encoding + transformer_enc_attn_dropout_rate: 0.2 # dropout rate for transformer encoder attention layer + transformer_dec_dropout_rate: 0.2 # dropout rate for transformer decoder layer + transformer_dec_positional_dropout_rate: 0.2 # dropout rate for transformer decoder positional encoding + transformer_dec_attn_dropout_rate: 0.2 # dropout rate for transformer decoder attention layer + pitch_predictor_layers: 5 # number of conv layers in pitch predictor + pitch_predictor_chans: 256 # number of channels of conv layers in pitch predictor + pitch_predictor_kernel_size: 5 # kernel size of conv leyers in pitch predictor + pitch_predictor_dropout: 0.5 # dropout rate in pitch predictor + pitch_embed_kernel_size: 1 # kernel size of conv embedding layer for pitch + pitch_embed_dropout: 0.0 # dropout rate after conv embedding layer for pitch + stop_gradient_from_pitch_predictor: True # whether to stop the gradient from pitch predictor to encoder + energy_predictor_layers: 2 # number of conv layers in energy predictor + energy_predictor_chans: 256 # number of channels of conv layers in energy predictor + energy_predictor_kernel_size: 3 # kernel size of conv leyers in energy predictor + energy_predictor_dropout: 0.5 # dropout rate in energy predictor + energy_embed_kernel_size: 1 # kernel size of conv embedding layer for energy + energy_embed_dropout: 0.0 # dropout rate after conv embedding layer for energy + stop_gradient_from_energy_predictor: False # whether to stop the gradient from energy predictor to encoder + spk_embed_dim: 256 # speaker embedding dimension + spk_embed_integration_type: concat # speaker embedding integration type + + + +########################################################### +# UPDATER SETTING # 
+########################################################### +updater: + use_masking: True # whether to apply masking for padded part in loss calculation + + +########################################################### +# OPTIMIZER SETTING # +########################################################### +optimizer: + optim: adam # optimizer type + learning_rate: 0.001 # learning rate + +########################################################### +# TRAINING SETTING # +########################################################### +max_epoch: 200 +num_snapshots: 5 + + +########################################################### +# OTHER SETTING # +########################################################### +seed: 10086 diff --git a/examples/zh_en_tts/tts3/local/inference.sh b/examples/zh_en_tts/tts3/local/inference.sh new file mode 100755 index 000000000..16499ed01 --- /dev/null +++ b/examples/zh_en_tts/tts3/local/inference.sh @@ -0,0 +1,54 @@ +#!/bin/bash + +train_output_path=$1 + +stage=0 +stop_stage=0 + +# voc: pwgan_aishell3 +# the spk_id=174 means baker speaker, default +# the spk_id=175 means ljspeech speaker +if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then + python3 ${BIN_DIR}/../inference.py \ + --inference_dir=${train_output_path}/inference \ + --am=fastspeech2_mix \ + --voc=pwgan_aishell3 \ + --text=${BIN_DIR}/../sentences_mix.txt \ + --output_dir=${train_output_path}/pd_infer_out \ + --phones_dict=dump/phone_id_map.txt \ + --speaker_dict=dump/speaker_id_map.txt \ + --lang=mix \ + --spk_id=174 +fi + + +# voc: hifigan_aishell3 +# the spk_id=174 means baker speaker, default +# the spk_id=175 means ljspeech speaker +if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then + python3 ${BIN_DIR}/../inference.py \ + --inference_dir=${train_output_path}/inference \ + --am=fastspeech2_mix \ + --voc=hifigan_aishell3 \ + --text=${BIN_DIR}/../sentences_mix.txt \ + --output_dir=${train_output_path}/pd_infer_out \ + --phones_dict=dump/phone_id_map.txt \ + --speaker_dict=dump/speaker_id_map.txt \ + --lang=mix \ + --spk_id=174 +fi + +# voc: hifigan_csmsc +# when speaker is 174 (csmsc), use csmsc's vocoder is better than aishell3's +if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then + python3 ${BIN_DIR}/../inference.py \ + --inference_dir=${train_output_path}/inference \ + --am=fastspeech2_mix \ + --voc=hifigan_csmsc \ + --text=${BIN_DIR}/../sentences_mix.txt \ + --output_dir=${train_output_path}/pd_infer_out \ + --phones_dict=dump/phone_id_map.txt \ + --speaker_dict=dump/speaker_id_map.txt \ + --lang=mix \ + --spk_id=174 +fi diff --git a/examples/zh_en_tts/tts3/local/ort_predict.sh b/examples/zh_en_tts/tts3/local/ort_predict.sh new file mode 100755 index 000000000..d80da9c91 --- /dev/null +++ b/examples/zh_en_tts/tts3/local/ort_predict.sh @@ -0,0 +1,54 @@ +train_output_path=$1 + +stage=0 +stop_stage=0 + +# e2e, synthesize from text +# voc: pwgan_aishell3 +# the spk_id=174 means baker speaker, default +# the spk_id=175 means ljspeech speaker +if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then + python3 ${BIN_DIR}/../ort_predict_e2e.py \ + --inference_dir=${train_output_path}/inference_onnx \ + --am=fastspeech2_mix \ + --voc=pwgan_aishell3 \ + --output_dir=${train_output_path}/onnx_infer_out_e2e \ + --text=${BIN_DIR}/../sentences_mix.txt \ + --phones_dict=dump/phone_id_map.txt \ + --device=cpu \ + --cpu_threads=4 \ + --lang=mix \ + --spk_id=174 +fi + + +# voc: hifigan_aishell3 +# the spk_id=174 means baker speaker, default +# the spk_id=175 means ljspeech speaker +if [ ${stage} -le 1 ] && [ 
${stop_stage} -ge 1 ]; then + python3 ${BIN_DIR}/../ort_predict_e2e.py \ + --inference_dir=${train_output_path}/inference_onnx \ + --am=fastspeech2_mix \ + --voc=hifigan_aishell3 \ + --output_dir=${train_output_path}/onnx_infer_out_e2e \ + --text=${BIN_DIR}/../sentences_mix.txt \ + --phones_dict=dump/phone_id_map.txt \ + --device=cpu \ + --cpu_threads=4 \ + --lang=mix \ + --spk_id=174 +fi + +if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then + python3 ${BIN_DIR}/../ort_predict_e2e.py \ + --inference_dir=${train_output_path}/inference_onnx \ + --am=fastspeech2_mix \ + --voc=hifigan_csmsc \ + --output_dir=${train_output_path}/onnx_infer_out_e2e \ + --text=${BIN_DIR}/../sentences_mix.txt \ + --phones_dict=dump/phone_id_map.txt \ + --device=cpu \ + --cpu_threads=4 \ + --lang=mix \ + --spk_id=174 +fi diff --git a/examples/zh_en_tts/tts3/local/paddle2onnx.sh b/examples/zh_en_tts/tts3/local/paddle2onnx.sh new file mode 120000 index 000000000..8d5dbef4c --- /dev/null +++ b/examples/zh_en_tts/tts3/local/paddle2onnx.sh @@ -0,0 +1 @@ +../../../csmsc/tts3/local/paddle2onnx.sh \ No newline at end of file diff --git a/examples/zh_en_tts/tts3/local/preprocess.sh b/examples/zh_en_tts/tts3/local/preprocess.sh new file mode 100755 index 000000000..a938f5243 --- /dev/null +++ b/examples/zh_en_tts/tts3/local/preprocess.sh @@ -0,0 +1,149 @@ +#!/bin/bash + +stage=0 +stop_stage=100 + +config_path=$1 +datasets_root_dir=$2 +mfa_root_dir=$3 + +# 1. get durations from MFA's result +if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then + echo "Generate durations_baker.txt from MFA results ..." + python3 ${MAIN_ROOT}/utils/gen_duration_from_textgrid.py \ + --inputdir=${mfa_root_dir}/baker_alignment_tone \ + --output durations_baker.txt \ + --config=${config_path} +fi + +if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then + echo "Generate durations_ljspeech.txt from MFA results ..." + python3 ${MAIN_ROOT}/utils/gen_duration_from_textgrid.py \ + --inputdir=${mfa_root_dir}/ljspeech_alignment \ + --output durations_ljspeech.txt \ + --config=${config_path} +fi + +if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then + echo "Generate durations_aishell3.txt from MFA results ..." + python3 ${MAIN_ROOT}/utils/gen_duration_from_textgrid.py \ + --inputdir=${mfa_root_dir}/aishell3_alignment_tone \ + --output durations_aishell3.txt \ + --config=${config_path} +fi + +if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then + echo "Generate durations_vctk.txt from MFA results ..." + python3 ${MAIN_ROOT}/utils/gen_duration_from_textgrid.py \ + --inputdir=${mfa_root_dir}/vctk_alignment \ + --output durations_vctk.txt \ + --config=${config_path} +fi + +if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then + # concat duration file + echo "concat durations_baker.txt, durations_ljspeech.txt, durations_aishell3.txt and durations_vctk.txt to durations.txt" + cat durations_baker.txt durations_ljspeech.txt durations_aishell3.txt durations_vctk.txt > durations.txt +fi + +# 2. extract features +if [ ${stage} -le 5 ] && [ ${stop_stage} -ge 5 ]; then + echo "Extract baker features ..." + python3 ${BIN_DIR}/preprocess.py \ + --dataset=baker \ + --rootdir=${datasets_root_dir}/BZNSYP/ \ + --dumpdir=dump \ + --dur-file=durations.txt \ + --config=${config_path} \ + --num-cpu=20 \ + --cut-sil=True \ + --write_metadata_method=a +fi + +if [ ${stage} -le 6 ] && [ ${stop_stage} -ge 6 ]; then + echo "Extract ljspeech features ..." 
+ python3 ${BIN_DIR}/preprocess.py \ + --dataset=ljspeech \ + --rootdir=${datasets_root_dir}/LJSpeech-1.1/ \ + --dumpdir=dump \ + --dur-file=durations.txt \ + --config=${config_path} \ + --num-cpu=20 \ + --cut-sil=True \ + --write_metadata_method=a +fi + +if [ ${stage} -le 7 ] && [ ${stop_stage} -ge 7 ]; then + echo "Extract aishell3 features ..." + python3 ${BIN_DIR}/preprocess.py \ + --dataset=aishell3 \ + --rootdir=${datasets_root_dir}/data_aishell3/ \ + --dumpdir=dump \ + --dur-file=durations.txt \ + --config=${config_path} \ + --num-cpu=20 \ + --cut-sil=True \ + --write_metadata_method=a +fi + +if [ ${stage} -le 8 ] && [ ${stop_stage} -ge 8 ]; then + echo "Extract vctk features ..." + python3 ${BIN_DIR}/preprocess.py \ + --dataset=vctk \ + --rootdir=${datasets_root_dir}/VCTK-Corpus-0.92/ \ + --dumpdir=dump \ + --dur-file=durations.txt \ + --config=${config_path} \ + --num-cpu=20 \ + --cut-sil=True \ + --write_metadata_method=a +fi + + +# 3. get features' stats(mean and std) +if [ ${stage} -le 9 ] && [ ${stop_stage} -ge 9 ]; then + echo "Get features' stats ..." + python3 ${MAIN_ROOT}/utils/compute_statistics.py \ + --metadata=dump/train/raw/metadata.jsonl \ + --field-name="speech" + + python3 ${MAIN_ROOT}/utils/compute_statistics.py \ + --metadata=dump/train/raw/metadata.jsonl \ + --field-name="pitch" + + python3 ${MAIN_ROOT}/utils/compute_statistics.py \ + --metadata=dump/train/raw/metadata.jsonl \ + --field-name="energy" +fi + + +# 4. normalize and covert phone/speaker to id, dev and test should use train's stats +if [ ${stage} -le 10 ] && [ ${stop_stage} -ge 10 ]; then + echo "Normalize ..." + python3 ${BIN_DIR}/normalize.py \ + --metadata=dump/train/raw/metadata.jsonl \ + --dumpdir=dump/train/norm \ + --speech-stats=dump/train/speech_stats.npy \ + --pitch-stats=dump/train/pitch_stats.npy \ + --energy-stats=dump/train/energy_stats.npy \ + --phones-dict=dump/phone_id_map.txt \ + --speaker-dict=dump/speaker_id_map.txt + + python3 ${BIN_DIR}/normalize.py \ + --metadata=dump/dev/raw/metadata.jsonl \ + --dumpdir=dump/dev/norm \ + --speech-stats=dump/train/speech_stats.npy \ + --pitch-stats=dump/train/pitch_stats.npy \ + --energy-stats=dump/train/energy_stats.npy \ + --phones-dict=dump/phone_id_map.txt \ + --speaker-dict=dump/speaker_id_map.txt + + python3 ${BIN_DIR}/normalize.py \ + --metadata=dump/test/raw/metadata.jsonl \ + --dumpdir=dump/test/norm \ + --speech-stats=dump/train/speech_stats.npy \ + --pitch-stats=dump/train/pitch_stats.npy \ + --energy-stats=dump/train/energy_stats.npy \ + --phones-dict=dump/phone_id_map.txt \ + --speaker-dict=dump/speaker_id_map.txt +fi diff --git a/examples/zh_en_tts/tts3/local/synthesize.sh b/examples/zh_en_tts/tts3/local/synthesize.sh new file mode 100755 index 000000000..5bb947466 --- /dev/null +++ b/examples/zh_en_tts/tts3/local/synthesize.sh @@ -0,0 +1,47 @@ +#!/bin/bash + +config_path=$1 +train_output_path=$2 +ckpt_name=$3 + +stage=0 +stop_stage=0 + +# voc: pwgan_aishell3 +if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then + FLAGS_allocator_strategy=naive_best_fit \ + FLAGS_fraction_of_gpu_memory_to_use=0.01 \ + python3 ${BIN_DIR}/../synthesize.py \ + --am=fastspeech2_mix \ + --am_config=${config_path} \ + --am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \ + --am_stat=dump/train/speech_stats.npy \ + --voc=pwgan_aishell3 \ + --voc_config=pwg_aishell3_ckpt_0.5/default.yaml \ + --voc_ckpt=pwg_aishell3_ckpt_0.5/snapshot_iter_1000000.pdz \ + --voc_stat=pwg_aishell3_ckpt_0.5/feats_stats.npy \ + --test_metadata=dump/test/norm/metadata.jsonl 
\ + --output_dir=${train_output_path}/test \ + --phones_dict=dump/phone_id_map.txt \ + --speaker_dict=dump/speaker_id_map.txt +fi + + +# voc: hifigan_aishell3 +if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then + FLAGS_allocator_strategy=naive_best_fit \ + FLAGS_fraction_of_gpu_memory_to_use=0.01 \ + python3 ${BIN_DIR}/../synthesize.py \ + --am=fastspeech2_mix \ + --am_config=${config_path} \ + --am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \ + --am_stat=dump/train/speech_stats.npy \ + --voc=hifigan_aishell3 \ + --voc_config=hifigan_aishell3_ckpt_0.2.0/default.yaml \ + --voc_ckpt=hifigan_aishell3_ckpt_0.2.0/snapshot_iter_2500000.pdz \ + --voc_stat=hifigan_aishell3_ckpt_0.2.0/feats_stats.npy \ + --test_metadata=dump/test/norm/metadata.jsonl \ + --output_dir=${train_output_path}/test \ + --phones_dict=dump/phone_id_map.txt \ + --speaker_dict=dump/speaker_id_map.txt +fi diff --git a/examples/zh_en_tts/tts3/local/synthesize_e2e.sh b/examples/zh_en_tts/tts3/local/synthesize_e2e.sh new file mode 100755 index 000000000..f6ee04aef --- /dev/null +++ b/examples/zh_en_tts/tts3/local/synthesize_e2e.sh @@ -0,0 +1,82 @@ +#!/bin/bash + +config_path=$1 +train_output_path=$2 +ckpt_name=$3 + +stage=0 +stop_stage=0 + +# voc: pwgan_aishell3 +# the spk_id=174 means baker speaker, default. +# the spk_id=175 means ljspeech speaker +if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then + FLAGS_allocator_strategy=naive_best_fit \ + FLAGS_fraction_of_gpu_memory_to_use=0.01 \ + python3 ${BIN_DIR}/../synthesize_e2e.py \ + --am=fastspeech2_mix \ + --am_config=${config_path} \ + --am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \ + --am_stat=dump/train/speech_stats.npy \ + --voc=pwgan_aishell3 \ + --voc_config=pwg_aishell3_ckpt_0.5/default.yaml \ + --voc_ckpt=pwg_aishell3_ckpt_0.5/snapshot_iter_1000000.pdz \ + --voc_stat=pwg_aishell3_ckpt_0.5/feats_stats.npy \ + --lang=mix \ + --text=${BIN_DIR}/../sentences_mix.txt \ + --output_dir=${train_output_path}/test_e2e \ + --phones_dict=dump/phone_id_map.txt \ + --speaker_dict=dump/speaker_id_map.txt \ + --spk_id=174 \ + --inference_dir=${train_output_path}/inference +fi + +# voc: hifigan_aishell3 +# the spk_id=174 means baker speaker, default +# the spk_id=175 means ljspeech speaker +if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then + echo "in hifigan syn_e2e" + FLAGS_allocator_strategy=naive_best_fit \ + FLAGS_fraction_of_gpu_memory_to_use=0.01 \ + python3 ${BIN_DIR}/../synthesize_e2e.py \ + --am=fastspeech2_mix \ + --am_config=${config_path} \ + --am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \ + --am_stat=dump/train/speech_stats.npy \ + --voc=hifigan_aishell3 \ + --voc_config=hifigan_aishell3_ckpt_0.2.0/default.yaml \ + --voc_ckpt=hifigan_aishell3_ckpt_0.2.0/snapshot_iter_2500000.pdz \ + --voc_stat=hifigan_aishell3_ckpt_0.2.0/feats_stats.npy \ + --lang=mix \ + --text=${BIN_DIR}/../sentences_mix.txt \ + --output_dir=${train_output_path}/test_e2e \ + --phones_dict=dump/phone_id_map.txt \ + --speaker_dict=dump/speaker_id_map.txt \ + --spk_id=174 \ + --inference_dir=${train_output_path}/inference +fi + + +# voc: hifigan_csmsc +# when speaker is 174 (csmsc), use csmsc's vocoder is better than aishell3's +if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then + echo "in csmsc's hifigan syn_e2e" + FLAGS_allocator_strategy=naive_best_fit \ + FLAGS_fraction_of_gpu_memory_to_use=0.01 \ + python3 ${BIN_DIR}/../synthesize_e2e.py \ + --am=fastspeech2_mix \ + --am_config=${config_path} \ + --am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \ + 
--am_stat=dump/train/speech_stats.npy \ + --voc=hifigan_csmsc \ + --voc_config=hifigan_csmsc_ckpt_0.1.1/default.yaml \ + --voc_ckpt=hifigan_csmsc_ckpt_0.1.1/snapshot_iter_2500000.pdz \ + --voc_stat=hifigan_csmsc_ckpt_0.1.1/feats_stats.npy \ + --lang=mix \ + --text=${BIN_DIR}/../sentences_mix.txt \ + --output_dir=${train_output_path}/test_e2e \ + --phones_dict=dump/phone_id_map.txt \ + --speaker_dict=dump/speaker_id_map.txt \ + --spk_id=174 \ + --inference_dir=${train_output_path}/inference +fi \ No newline at end of file diff --git a/examples/zh_en_tts/tts3/local/train.sh b/examples/zh_en_tts/tts3/local/train.sh new file mode 100755 index 000000000..1da72f117 --- /dev/null +++ b/examples/zh_en_tts/tts3/local/train.sh @@ -0,0 +1,13 @@ +#!/bin/bash + +config_path=$1 +train_output_path=$2 + +python3 ${BIN_DIR}/train.py \ + --train-metadata=dump/train/norm/metadata.jsonl \ + --dev-metadata=dump/dev/norm/metadata.jsonl \ + --config=${config_path} \ + --output-dir=${train_output_path} \ + --ngpu=2 \ + --phones-dict=dump/phone_id_map.txt \ + --speaker-dict=dump/speaker_id_map.txt diff --git a/examples/zh_en_tts/tts3/path.sh b/examples/zh_en_tts/tts3/path.sh new file mode 100755 index 000000000..fb7e8411c --- /dev/null +++ b/examples/zh_en_tts/tts3/path.sh @@ -0,0 +1,13 @@ +#!/bin/bash +export MAIN_ROOT=`realpath ${PWD}/../../../` + +export PATH=${MAIN_ROOT}:${MAIN_ROOT}/utils:${PATH} +export LC_ALL=C + +export PYTHONDONTWRITEBYTECODE=1 +# Use UTF-8 in Python to avoid UnicodeDecodeError when LC_ALL=C +export PYTHONIOENCODING=UTF-8 +export PYTHONPATH=${MAIN_ROOT}:${PYTHONPATH} + +MODEL=fastspeech2 +export BIN_DIR=${MAIN_ROOT}/paddlespeech/t2s/exps/${MODEL} diff --git a/examples/zh_en_tts/tts3/run.sh b/examples/zh_en_tts/tts3/run.sh new file mode 100755 index 000000000..204042b12 --- /dev/null +++ b/examples/zh_en_tts/tts3/run.sh @@ -0,0 +1,63 @@ +#!/bin/bash + +set -e +source path.sh + +gpus=0,1 +stage=0 +stop_stage=100 + +datasets_root_dir=~/datasets +mfa_root_dir=./mfa_results/ +conf_path=conf/default.yaml +train_output_path=exp/default +ckpt_name=snapshot_iter_99200.pdz + + +# with the following command, you can choose the stage range you want to run +# such as `./run.sh --stage 0 --stop-stage 0` +# this can not be mixed use with `$1`, `$2` ... 
+source ${MAIN_ROOT}/utils/parse_options.sh || exit 1 + +if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then + # prepare data + ./local/preprocess.sh ${conf_path} ${datasets_root_dir} ${mfa_root_dir} || exit -1 +fi + +if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then + # train model, all `ckpt` under `train_output_path/checkpoints/` dir + CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path} || exit -1 +fi + +if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then + # synthesize, vocoder is pwgan by default + CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name} || exit -1 +fi + +if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then + # synthesize_e2e, vocoder is pwgan by default + CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize_e2e.sh ${conf_path} ${train_output_path} ${ckpt_name} || exit -1 +fi + +if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then + # inference with static model, vocoder is pwgan by default + CUDA_VISIBLE_DEVICES=${gpus} ./local/inference.sh ${train_output_path} || exit -1 +fi + +if [ ${stage} -le 5 ] && [ ${stop_stage} -ge 5 ]; then + # install paddle2onnx + version=$(echo `pip list |grep "paddle2onnx"` |awk -F" " '{print $2}') + if [[ -z "$version" || ${version} != '0.9.8' ]]; then + pip install paddle2onnx==0.9.8 + fi + ./local/paddle2onnx.sh ${train_output_path} inference inference_onnx fastspeech2_mix + # considering the balance between speed and quality, we recommend that you use hifigan as vocoder + ./local/paddle2onnx.sh ${train_output_path} inference inference_onnx pwgan_aishell3 + # ./local/paddle2onnx.sh ${train_output_path} inference inference_onnx hifigan_aishell3 + # ./local/paddle2onnx.sh ${train_output_path} inference inference_onnx hifigan_csmsc +fi + +# inference with onnxruntime, use fastspeech2 + pwgan by default +if [ ${stage} -le 6 ] && [ ${stop_stage} -ge 6 ]; then + ./local/ort_predict.sh ${train_output_path} +fi diff --git a/paddlespeech/audio/__init__.py b/paddlespeech/audio/__init__.py index 8a231ae5b..f79f3d773 100644 --- a/paddlespeech/audio/__init__.py +++ b/paddlespeech/audio/__init__.py @@ -16,6 +16,9 @@ from . import _extension from . import compliance from . import datasets from . import features +from . import text +from . import transform +from . import streamdata from . import functional from . import io from . import metric diff --git a/paddlespeech/audio/streamdata/__init__.py b/paddlespeech/audio/streamdata/__init__.py new file mode 100644 index 000000000..753fcc11b --- /dev/null +++ b/paddlespeech/audio/streamdata/__init__.py @@ -0,0 +1,70 @@ +# Copyright (c) 2017-2019 NVIDIA CORPORATION. All rights reserved. +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# See the LICENSE file for licensing terms (BSD-style). 
+# Modified from https://github.com/webdataset/webdataset +# +# flake8: noqa + +from .cache import ( + cached_tarfile_samples, + cached_tarfile_to_samples, + lru_cleanup, + pipe_cleaner, +) +from .compat import WebDataset, WebLoader, FluidWrapper +from .extradatasets import MockDataset, with_epoch, with_length +from .filters import ( + associate, + batched, + decode, + detshuffle, + extract_keys, + getfirst, + info, + map, + map_dict, + map_tuple, + pipelinefilter, + rename, + rename_keys, + audio_resample, + select, + shuffle, + slice, + to_tuple, + transform_with, + unbatched, + xdecode, + audio_data_filter, + audio_tokenize, + audio_resample, + audio_compute_fbank, + audio_spec_aug, + sort, + audio_padding, + audio_cmvn, + placeholder, +) +from .handlers import ( + ignore_and_continue, + ignore_and_stop, + reraise_exception, + warn_and_continue, + warn_and_stop, +) +from .pipeline import DataPipeline +from .shardlists import ( + MultiShardSample, + ResampledShards, + SimpleShardList, + non_empty, + resampled, + shardspec, + single_node_only, + split_by_node, + split_by_worker, +) +from .tariterators import tarfile_samples, tarfile_to_samples +from .utils import PipelineStage, repeatedly +from .writer import ShardWriter, TarWriter, numpy_dumps +from .mix import RandomMix, RoundRobin diff --git a/paddlespeech/audio/streamdata/autodecode.py b/paddlespeech/audio/streamdata/autodecode.py new file mode 100644 index 000000000..ca0e2ea2f --- /dev/null +++ b/paddlespeech/audio/streamdata/autodecode.py @@ -0,0 +1,445 @@ +# +# Copyright (c) 2017-2021 NVIDIA CORPORATION. All rights reserved. +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# This file is part of the WebDataset library. +# See the LICENSE file for licensing terms (BSD-style). +# Modified from https://github.com/webdataset/webdataset +# + +"""Automatically decode webdataset samples.""" + +import io, json, os, pickle, re, tempfile +from functools import partial + +import numpy as np + +"""Extensions passed on to the image decoder.""" +image_extensions = "jpg jpeg png ppm pgm pbm pnm".split() + + +################################################################ +# handle basic datatypes +################################################################ + + +def paddle_loads(data): + """Load data using paddle.loads, importing paddle only if needed. + + :param data: data to be decoded + """ + import io + + import paddle + + stream = io.BytesIO(data) + return paddle.load(stream) + + +def tenbin_loads(data): + from . 
import tenbin + + return tenbin.decode_buffer(data) + + +def msgpack_loads(data): + import msgpack + + return msgpack.unpackb(data) + + +def npy_loads(data): + import numpy.lib.format + + stream = io.BytesIO(data) + return numpy.lib.format.read_array(stream) + + +def cbor_loads(data): + import cbor + + return cbor.loads(data) + + +decoders = { + "txt": lambda data: data.decode("utf-8"), + "text": lambda data: data.decode("utf-8"), + "transcript": lambda data: data.decode("utf-8"), + "cls": lambda data: int(data), + "cls2": lambda data: int(data), + "index": lambda data: int(data), + "inx": lambda data: int(data), + "id": lambda data: int(data), + "json": lambda data: json.loads(data), + "jsn": lambda data: json.loads(data), + "pyd": lambda data: pickle.loads(data), + "pickle": lambda data: pickle.loads(data), + "pdparams": lambda data: paddle_loads(data), + "ten": tenbin_loads, + "tb": tenbin_loads, + "mp": msgpack_loads, + "msg": msgpack_loads, + "npy": npy_loads, + "npz": lambda data: np.load(io.BytesIO(data)), + "cbor": cbor_loads, +} + + +def basichandlers(key, data): + """Handle basic file decoding. + + This function is usually part of the post= decoders. + This handles the following forms of decoding: + + - txt -> unicode string + - cls cls2 class count index inx id -> int + - json jsn -> JSON decoding + - pyd pickle -> pickle decoding + - pdparams -> paddle.loads + - ten tenbin -> fast tensor loading + - mp messagepack msg -> messagepack decoding + - npy -> Python NPY decoding + + :param key: file name extension + :param data: binary data to be decoded + """ + extension = re.sub(r".*[.]", "", key) + + if extension in decoders: + return decoders[extension](data) + + return None + + +################################################################ +# Generic extension handler. +################################################################ + + +def call_extension_handler(key, data, f, extensions): + """Call the function f with the given data if the key matches the extensions. + + :param key: actual key found in the sample + :param data: binary data + :param f: decoder function + :param extensions: list of matching extensions + """ + extension = key.lower().split(".") + for target in extensions: + target = target.split(".") + if len(target) > len(extension): + continue + if extension[-len(target) :] == target: + return f(data) + return None + + +def handle_extension(extensions, f): + """Return a decoder function for the list of extensions. + + Extensions can be a space separated list of extensions. + Extensions can contain dots, in which case the corresponding number + of extension components must be present in the key given to f. + Comparisons are case insensitive. 
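For orientation, a minimal sketch of how basichandlers dispatches on the extension table above, assuming the module path paddlespeech.audio.streamdata.autodecode introduced by this patch; the keys and payloads are invented, and unknown extensions fall through to None so later handlers can try.

from paddlespeech.audio.streamdata.autodecode import basichandlers

print(basichandlers("utt1.cls", b"7"))              # 7, via the int decoder
print(basichandlers("utt1.json", b'{"spk": 174}'))  # {'spk': 174}, via json.loads
print(basichandlers("utt1.wav", b"RIFF..."))        # None: no entry for "wav", left for other handlers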
+ + Examples: + handle_extension("jpg jpeg", my_decode_jpg) # invoked for any file.jpg + handle_extension("seg.jpg", special_case_jpg) # invoked only for file.seg.jpg + """ + extensions = extensions.lower().split() + return partial(call_extension_handler, f=f, extensions=extensions) + + +################################################################ +# handle images +################################################################ + +imagespecs = { + "l8": ("numpy", "uint8", "l"), + "rgb8": ("numpy", "uint8", "rgb"), + "rgba8": ("numpy", "uint8", "rgba"), + "l": ("numpy", "float", "l"), + "rgb": ("numpy", "float", "rgb"), + "rgba": ("numpy", "float", "rgba"), + "paddlel8": ("paddle", "uint8", "l"), + "paddlergb8": ("paddle", "uint8", "rgb"), + "paddlergba8": ("paddle", "uint8", "rgba"), + "paddlel": ("paddle", "float", "l"), + "paddlergb": ("paddle", "float", "rgb"), + "paddle": ("paddle", "float", "rgb"), + "paddlergba": ("paddle", "float", "rgba"), + "pill": ("pil", None, "l"), + "pil": ("pil", None, "rgb"), + "pilrgb": ("pil", None, "rgb"), + "pilrgba": ("pil", None, "rgba"), +} + + +class ImageHandler: + """Decode image data using the given `imagespec`. + + The `imagespec` specifies whether the image is decoded + to numpy/paddle/pi, decoded to uint8/float, and decoded + to l/rgb/rgba: + + - l8: numpy uint8 l + - rgb8: numpy uint8 rgb + - rgba8: numpy uint8 rgba + - l: numpy float l + - rgb: numpy float rgb + - rgba: numpy float rgba + - paddlel8: paddle uint8 l + - paddlergb8: paddle uint8 rgb + - paddlergba8: paddle uint8 rgba + - paddlel: paddle float l + - paddlergb: paddle float rgb + - paddle: paddle float rgb + - paddlergba: paddle float rgba + - pill: pil None l + - pil: pil None rgb + - pilrgb: pil None rgb + - pilrgba: pil None rgba + + """ + + def __init__(self, imagespec, extensions=image_extensions): + """Create an image handler. + + :param imagespec: short string indicating the type of decoding + :param extensions: list of extensions the image handler is invoked for + """ + if imagespec not in list(imagespecs.keys()): + raise ValueError("Unknown imagespec: %s" % imagespec) + self.imagespec = imagespec.lower() + self.extensions = extensions + + def __call__(self, key, data): + """Perform image decoding. + + :param key: file name extension + :param data: binary data + """ + import PIL.Image + + extension = re.sub(r".*[.]", "", key) + if extension.lower() not in self.extensions: + return None + imagespec = self.imagespec + atype, etype, mode = imagespecs[imagespec] + with io.BytesIO(data) as stream: + img = PIL.Image.open(stream) + img.load() + img = img.convert(mode.upper()) + if atype == "pil": + return img + elif atype == "numpy": + result = np.asarray(img) + if result.dtype != np.uint8: + raise ValueError("ImageHandler: numpy image must be uint8") + if etype == "uint8": + return result + else: + return result.astype("f") / 255.0 + elif atype == "paddle": + import paddle + + result = np.asarray(img) + if result.dtype != np.uint8: + raise ValueError("ImageHandler: paddle image must be uint8") + if etype == "uint8": + result = np.array(result.transpose(2, 0, 1)) + return paddle.tensor(result) + else: + result = np.array(result.transpose(2, 0, 1)) + return paddle.tensor(result) / 255.0 + return None + + +def imagehandler(imagespec, extensions=image_extensions): + """Create an image handler. + + This is just a lower case alias for ImageHander. 
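A small, hedged sketch of the ImageHandler defined above: it round-trips an in-memory PNG through the "rgb8" spec (numpy, uint8, RGB). The file name and the 4x4 red image are invented purely for illustration.

import io
import PIL.Image
from paddlespeech.audio.streamdata.autodecode import ImageHandler

buf = io.BytesIO()
PIL.Image.new("RGB", (4, 4), color=(255, 0, 0)).save(buf, format="PNG")
decode_rgb8 = ImageHandler("rgb8")                 # numpy array, uint8, RGB
img = decode_rgb8("sample.png", buf.getvalue())
print(img.shape, img.dtype)                        # (4, 4, 3) uint8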
+ + :param imagespec: textual image spec + :param extensions: list of extensions the handler should be applied for + """ + return ImageHandler(imagespec, extensions) + + +################################################################ +# torch video +################################################################ + +''' +def torch_video(key, data): + """Decode video using the torchvideo library. + + :param key: file name extension + :param data: data to be decoded + """ + extension = re.sub(r".*[.]", "", key) + if extension not in "mp4 ogv mjpeg avi mov h264 mpg webm wmv".split(): + return None + + import torchvision.io + + with tempfile.TemporaryDirectory() as dirname: + fname = os.path.join(dirname, f"file.{extension}") + with open(fname, "wb") as stream: + stream.write(data) + return torchvision.io.read_video(fname, pts_unit="sec") +''' + + +################################################################ +# paddlespeech.audio +################################################################ + + +def paddle_audio(key, data): + """Decode audio using the paddlespeech.audio library. + + :param key: file name extension + :param data: data to be decoded + """ + extension = re.sub(r".*[.]", "", key) + if extension not in ["flac", "mp3", "sox", "wav", "m4a", "ogg", "wma"]: + return None + + import paddlespeech.audio + + with tempfile.TemporaryDirectory() as dirname: + fname = os.path.join(dirname, f"file.{extension}") + with open(fname, "wb") as stream: + stream.write(data) + return paddlespeech.audio.load(fname) + + +################################################################ +# special class for continuing decoding +################################################################ + + +class Continue: + """Special class for continuing decoding. + + This is mostly used for decompression, as in: + + def decompressor(key, data): + if key.endswith(".gz"): + return Continue(key[:-3], decompress(data)) + return None + """ + + def __init__(self, key, data): + """__init__. + + :param key: + :param data: + """ + self.key, self.data = key, data + + +def gzfilter(key, data): + """Decode .gz files. + + This decodes compressed files and the continues decoding. + + :param key: file name extension + :param data: binary data + """ + import gzip + + if not key.endswith(".gz"): + return None + decompressed = gzip.open(io.BytesIO(data)).read() + return Continue(key[:-3], decompressed) + + +################################################################ +# decode entire training amples +################################################################ + + +default_pre_handlers = [gzfilter] +default_post_handlers = [basichandlers] + + +class Decoder: + """Decode samples using a list of handlers. + + For each key/data item, this iterates through the list of + handlers until some handler returns something other than None. + """ + + def __init__(self, handlers, pre=None, post=None, only=None, partial=False): + """Create a Decoder. 
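The Continue mechanism is easiest to see with the gzfilter defined above: it strips the .gz suffix and hands the decompressed bytes back so a later handler (here basichandlers) finishes the decoding. A hedged sketch with an invented key:

import gzip
from paddlespeech.audio.streamdata.autodecode import Continue, basichandlers, gzfilter

payload = gzip.compress("hello webdataset".encode("utf-8"))
step = gzfilter("utt1.txt.gz", payload)
assert isinstance(step, Continue) and step.key == "utt1.txt"
print(basichandlers(step.key, step.data))          # 'hello webdataset'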
+ + :param handlers: main list of handlers + :param pre: handlers called before the main list (.gz handler by default) + :param post: handlers called after the main list (default handlers by default) + :param only: a list of extensions; when give, only ignores files with those extensions + :param partial: allow partial decoding (i.e., don't decode fields that aren't of type bytes) + """ + if isinstance(only, str): + only = only.split() + self.only = only if only is None else set(only) + if pre is None: + pre = default_pre_handlers + if post is None: + post = default_post_handlers + assert all(callable(h) for h in handlers), f"one of {handlers} not callable" + assert all(callable(h) for h in pre), f"one of {pre} not callable" + assert all(callable(h) for h in post), f"one of {post} not callable" + self.handlers = pre + handlers + post + self.partial = partial + + def decode1(self, key, data): + """Decode a single field of a sample. + + :param key: file name extension + :param data: binary data + """ + key = "." + key + for f in self.handlers: + result = f(key, data) + if isinstance(result, Continue): + key, data = result.key, result.data + continue + if result is not None: + return result + return data + + def decode(self, sample): + """Decode an entire sample. + + :param sample: the sample, a dictionary of key value pairs + """ + result = {} + assert isinstance(sample, dict), sample + for k, v in list(sample.items()): + if k[0] == "_": + if isinstance(v, bytes): + v = v.decode("utf-8") + result[k] = v + continue + if self.only is not None and k not in self.only: + result[k] = v + continue + assert v is not None + if self.partial: + if isinstance(v, bytes): + result[k] = self.decode1(k, v) + else: + result[k] = v + else: + assert isinstance(v, bytes) + result[k] = self.decode1(k, v) + return result + + def __call__(self, sample): + """Decode an entire sample. + + :param sample: the sample + """ + assert isinstance(sample, dict), (len(sample), sample) + return self.decode(sample) diff --git a/paddlespeech/audio/streamdata/cache.py b/paddlespeech/audio/streamdata/cache.py new file mode 100644 index 000000000..e7bbffa1b --- /dev/null +++ b/paddlespeech/audio/streamdata/cache.py @@ -0,0 +1,190 @@ +# Copyright (c) 2017-2019 NVIDIA CORPORATION. All rights reserved. +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# See the LICENSE file for licensing terms (BSD-style). +# Modified from https://github.com/webdataset/webdataset +import itertools, os, random, re, sys +from urllib.parse import urlparse + +from . import filters +from . 
import gopen +from .handlers import reraise_exception +from .tariterators import tar_file_and_group_expander + +default_cache_dir = os.environ.get("WDS_CACHE", "./_cache") +default_cache_size = float(os.environ.get("WDS_CACHE_SIZE", "1e18")) + + +def lru_cleanup(cache_dir, cache_size, keyfn=os.path.getctime, verbose=False): + """Performs cleanup of the file cache in cache_dir using an LRU strategy, + keeping the total size of all remaining files below cache_size.""" + if not os.path.exists(cache_dir): + return + total_size = 0 + for dirpath, dirnames, filenames in os.walk(cache_dir): + for filename in filenames: + total_size += os.path.getsize(os.path.join(dirpath, filename)) + if total_size <= cache_size: + return + # sort files by last access time + files = [] + for dirpath, dirnames, filenames in os.walk(cache_dir): + for filename in filenames: + files.append(os.path.join(dirpath, filename)) + files.sort(key=keyfn, reverse=True) + # delete files until we're under the cache size + while len(files) > 0 and total_size > cache_size: + fname = files.pop() + total_size -= os.path.getsize(fname) + if verbose: + print("# deleting %s" % fname, file=sys.stderr) + os.remove(fname) + + +def download(url, dest, chunk_size=1024 ** 2, verbose=False): + """Download a file from `url` to `dest`.""" + temp = dest + f".temp{os.getpid()}" + with gopen.gopen(url) as stream: + with open(temp, "wb") as f: + while True: + data = stream.read(chunk_size) + if not data: + break + f.write(data) + os.rename(temp, dest) + + +def pipe_cleaner(spec): + """Guess the actual URL from a "pipe:" specification.""" + if spec.startswith("pipe:"): + spec = spec[5:] + words = spec.split(" ") + for word in words: + if re.match(r"^(https?|gs|ais|s3)", word): + return word + return spec + + +def get_file_cached( + spec, + cache_size=-1, + cache_dir=None, + url_to_name=pipe_cleaner, + verbose=False, +): + if cache_size == -1: + cache_size = default_cache_size + if cache_dir is None: + cache_dir = default_cache_dir + url = url_to_name(spec) + parsed = urlparse(url) + dirname, filename = os.path.split(parsed.path) + dirname = dirname.lstrip("/") + dirname = re.sub(r"[:/|;]", "_", dirname) + destdir = os.path.join(cache_dir, dirname) + os.makedirs(destdir, exist_ok=True) + dest = os.path.join(cache_dir, dirname, filename) + if not os.path.exists(dest): + if verbose: + print("# downloading %s to %s" % (url, dest), file=sys.stderr) + lru_cleanup(cache_dir, cache_size, verbose=verbose) + download(spec, dest, verbose=verbose) + return dest + + +def get_filetype(fname): + with os.popen("file '%s'" % fname) as f: + ftype = f.read() + return ftype + + +def check_tar_format(fname): + """Check whether a file is a tar archive.""" + ftype = get_filetype(fname) + return "tar archive" in ftype or "gzip compressed" in ftype + + +verbose_cache = int(os.environ.get("WDS_VERBOSE_CACHE", "0")) + + +def cached_url_opener( + data, + handler=reraise_exception, + cache_size=-1, + cache_dir=None, + url_to_name=pipe_cleaner, + validator=check_tar_format, + verbose=False, + always=False, +): + """Given a stream of url names (packaged in `dict(url=url)`), yield opened streams.""" + verbose = verbose or verbose_cache + for sample in data: + assert isinstance(sample, dict), sample + assert "url" in sample + url = sample["url"] + attempts = 5 + try: + if not always and os.path.exists(url): + dest = url + else: + dest = get_file_cached( + url, + cache_size=cache_size, + cache_dir=cache_dir, + url_to_name=url_to_name, + verbose=verbose, + ) + if verbose: + print("# 
opening %s" % dest, file=sys.stderr) + assert os.path.exists(dest) + if not validator(dest): + ftype = get_filetype(dest) + with open(dest, "rb") as f: + data = f.read(200) + os.remove(dest) + raise ValueError( + "%s (%s) is not a tar archive, but a %s, contains %s" + % (dest, url, ftype, repr(data)) + ) + try: + stream = open(dest, "rb") + sample.update(stream=stream) + yield sample + except FileNotFoundError as exn: + # dealing with race conditions in lru_cleanup + attempts -= 1 + if attempts > 0: + time.sleep(random.random() * 10) + continue + raise exn + except Exception as exn: + exn.args = exn.args + (url,) + if handler(exn): + continue + else: + break + + +def cached_tarfile_samples( + src, + handler=reraise_exception, + cache_size=-1, + cache_dir=None, + verbose=False, + url_to_name=pipe_cleaner, + always=False, +): + streams = cached_url_opener( + src, + handler=handler, + cache_size=cache_size, + cache_dir=cache_dir, + verbose=verbose, + url_to_name=url_to_name, + always=always, + ) + samples = tar_file_and_group_expander(streams, handler=handler) + return samples + + +cached_tarfile_to_samples = filters.pipelinefilter(cached_tarfile_samples) diff --git a/paddlespeech/audio/streamdata/compat.py b/paddlespeech/audio/streamdata/compat.py new file mode 100644 index 000000000..deda53384 --- /dev/null +++ b/paddlespeech/audio/streamdata/compat.py @@ -0,0 +1,170 @@ +# Copyright (c) 2017-2019 NVIDIA CORPORATION. All rights reserved. +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# See the LICENSE file for licensing terms (BSD-style). +# Modified from https://github.com/webdataset/webdataset +from dataclasses import dataclass +from itertools import islice +from typing import List + +import braceexpand, yaml + +from . import autodecode +from . 
import cache, filters, shardlists, tariterators +from .filters import reraise_exception +from .pipeline import DataPipeline +from .paddle_utils import DataLoader, IterableDataset + + +class FluidInterface: + def batched(self, batchsize): + return self.compose(filters.batched(batchsize)) + + def dynamic_batched(self, max_frames_in_batch): + return self.compose(filter.dynamic_batched(max_frames_in_batch)) + + def unbatched(self): + return self.compose(filters.unbatched()) + + def listed(self, batchsize, partial=True): + return self.compose(filters.batched(), batchsize=batchsize, collation_fn=None) + + def unlisted(self): + return self.compose(filters.unlisted()) + + def log_keys(self, logfile=None): + return self.compose(filters.log_keys(logfile)) + + def shuffle(self, size, **kw): + if size < 1: + return self + else: + return self.compose(filters.shuffle(size, **kw)) + + def map(self, f, handler=reraise_exception): + return self.compose(filters.map(f, handler=handler)) + + def decode(self, *args, pre=None, post=None, only=None, partial=False, handler=reraise_exception): + handlers = [autodecode.ImageHandler(x) if isinstance(x, str) else x for x in args] + decoder = autodecode.Decoder(handlers, pre=pre, post=post, only=only, partial=partial) + return self.map(decoder, handler=handler) + + def map_dict(self, handler=reraise_exception, **kw): + return self.compose(filters.map_dict(handler=handler, **kw)) + + def select(self, predicate, **kw): + return self.compose(filters.select(predicate, **kw)) + + def to_tuple(self, *args, handler=reraise_exception): + return self.compose(filters.to_tuple(*args, handler=handler)) + + def map_tuple(self, *args, handler=reraise_exception): + return self.compose(filters.map_tuple(*args, handler=handler)) + + def slice(self, *args): + return self.compose(filters.slice(*args)) + + def rename(self, **kw): + return self.compose(filters.rename(**kw)) + + def rsample(self, p=0.5): + return self.compose(filters.rsample(p)) + + def rename_keys(self, *args, **kw): + return self.compose(filters.rename_keys(*args, **kw)) + + def extract_keys(self, *args, **kw): + return self.compose(filters.extract_keys(*args, **kw)) + + def xdecode(self, *args, **kw): + return self.compose(filters.xdecode(*args, **kw)) + + def audio_data_filter(self, *args, **kw): + return self.compose(filters.audio_data_filter(*args, **kw)) + + def audio_tokenize(self, *args, **kw): + return self.compose(filters.audio_tokenize(*args, **kw)) + + def resample(self, *args, **kw): + return self.compose(filters.resample(*args, **kw)) + + def audio_compute_fbank(self, *args, **kw): + return self.compose(filters.audio_compute_fbank(*args, **kw)) + + def audio_spec_aug(self, *args, **kw): + return self.compose(filters.audio_spec_aug(*args, **kw)) + + def sort(self, size=500): + return self.compose(filters.sort(size)) + + def audio_padding(self): + return self.compose(filters.audio_padding()) + + def audio_cmvn(self, cmvn_file): + return self.compose(filters.audio_cmvn(cmvn_file)) + +class WebDataset(DataPipeline, FluidInterface): + """Small fluid-interface wrapper for DataPipeline.""" + + def __init__( + self, + urls, + handler=reraise_exception, + resampled=False, + repeat=False, + shardshuffle=None, + cache_size=0, + cache_dir=None, + detshuffle=False, + nodesplitter=shardlists.single_node_only, + verbose=False, + ): + super().__init__() + if isinstance(urls, IterableDataset): + assert not resampled + self.append(urls) + elif isinstance(urls, str) and (urls.endswith(".yaml") or urls.endswith(".yml")): + with 
(open(urls)) as stream: + spec = yaml.safe_load(stream) + assert "datasets" in spec + self.append(shardlists.MultiShardSample(spec)) + elif isinstance(urls, dict): + assert "datasets" in urls + self.append(shardlists.MultiShardSample(urls)) + elif resampled: + self.append(shardlists.ResampledShards(urls)) + else: + self.append(shardlists.SimpleShardList(urls)) + self.append(nodesplitter) + self.append(shardlists.split_by_worker) + if shardshuffle is True: + shardshuffle = 100 + if shardshuffle is not None: + if detshuffle: + self.append(filters.detshuffle(shardshuffle)) + else: + self.append(filters.shuffle(shardshuffle)) + if cache_size == 0: + self.append(tariterators.tarfile_to_samples(handler=handler)) + else: + assert cache_size == -1 or cache_size > 0 + self.append( + cache.cached_tarfile_to_samples( + handler=handler, + verbose=verbose, + cache_size=cache_size, + cache_dir=cache_dir, + ) + ) + + +class FluidWrapper(DataPipeline, FluidInterface): + """Small fluid-interface wrapper for DataPipeline.""" + + def __init__(self, initial): + super().__init__() + self.append(initial) + + +class WebLoader(DataPipeline, FluidInterface): + def __init__(self, *args, **kw): + super().__init__(DataLoader(*args, **kw)) diff --git a/paddlespeech/audio/streamdata/extradatasets.py b/paddlespeech/audio/streamdata/extradatasets.py new file mode 100644 index 000000000..e6d617724 --- /dev/null +++ b/paddlespeech/audio/streamdata/extradatasets.py @@ -0,0 +1,141 @@ +# +# Copyright (c) 2017-2021 NVIDIA CORPORATION. All rights reserved. +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# This file is part of the WebDataset library. +# See the LICENSE file for licensing terms (BSD-style). +# Modified from https://github.com/webdataset/webdataset +# + + +"""Train PyTorch models directly from POSIX tar archive. + +Code works locally or over HTTP connections. +""" + +import itertools as itt +import os +import random +import sys + +import braceexpand + +from . import utils +from .paddle_utils import IterableDataset +from .utils import PipelineStage + + +class MockDataset(IterableDataset): + """MockDataset. + + A mock dataset for performance testing and unit testing. + """ + + def __init__(self, sample, length): + """Create a mock dataset instance. + + :param sample: the sample to be returned repeatedly + :param length: the length of the mock dataset + """ + self.sample = sample + self.length = length + + def __iter__(self): + """Return an iterator over this mock dataset.""" + for i in range(self.length): + yield self.sample + + +class repeatedly(IterableDataset, PipelineStage): + """Repeatedly yield samples from a dataset.""" + + def __init__(self, source, nepochs=None, nbatches=None, length=None): + """Create an instance of Repeatedly. + + :param nepochs: repeat for a maximum of nepochs + :param nbatches: repeat for a maximum of nbatches + """ + self.source = source + self.length = length + self.nbatches = nbatches + + def invoke(self, source): + """Return an iterator that iterates repeatedly over a source.""" + return utils.repeatedly( + source, + nepochs=self.nepochs, + nbatches=self.nbatches, + ) + + +class with_epoch(IterableDataset): + """Change the actual and nominal length of an IterableDataset. + + This will continuously iterate through the original dataset, but + impose new epoch boundaries at the given length/nominal. + This exists mainly as a workaround for the odd logic in DataLoader. + It is also useful for choosing smaller nominal epoch sizes with + very large datasets. 
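A rough usage sketch of the fluid interface above, not a guaranteed recipe: the shard pattern "train-{000..009}.tar" is hypothetical, and the shards are assumed to contain WebDataset-style samples with ".txt" members; the point is only how the stages compose.

import paddlespeech.audio.streamdata as streamdata

dataset = (
    streamdata.WebDataset("train-{000..009}.tar", shardshuffle=True)
    .shuffle(1000)       # sample-level shuffle with a 1000-item buffer
    .decode()            # default handlers: gzfilter followed by basichandlers
    .to_tuple("__key__", "txt")
)
# Once the shards exist, iterating `dataset` (or wrapping it in streamdata.WebLoader)
# yields (key, text) pairs.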
+ + """ + + def __init__(self, dataset, length): + """Chop the dataset to the given length. + + :param dataset: IterableDataset + :param length: declared length of the dataset + :param nominal: nominal length of dataset (if different from declared) + """ + super().__init__() + self.length = length + self.source = None + + def __getstate__(self): + """Return the pickled state of the dataset. + + This resets the dataset iterator, since that can't be pickled. + """ + result = dict(self.__dict__) + result["source"] = None + return result + + def invoke(self, dataset): + """Return an iterator over the dataset. + + This iterator returns as many samples as given by the `length` + parameter. + """ + if self.source is None: + self.source = iter(dataset) + for i in range(self.length): + try: + sample = next(self.source) + except StopIteration: + self.source = iter(dataset) + try: + sample = next(self.source) + except StopIteration: + return + yield sample + self.source = None + + +class with_length(IterableDataset, PipelineStage): + """Repeatedly yield samples from a dataset.""" + + def __init__(self, dataset, length): + """Create an instance of Repeatedly. + + :param dataset: source dataset + :param length: stated length + """ + super().__init__() + self.dataset = dataset + self.length = length + + def invoke(self, dataset): + """Return an iterator that iterates repeatedly over a source.""" + return iter(dataset) + + def __len__(self): + """Return the user specified length.""" + return self.length diff --git a/paddlespeech/audio/streamdata/filters.py b/paddlespeech/audio/streamdata/filters.py new file mode 100644 index 000000000..82b9c6bab --- /dev/null +++ b/paddlespeech/audio/streamdata/filters.py @@ -0,0 +1,935 @@ +# Copyright (c) 2017-2021 NVIDIA CORPORATION. All rights reserved. +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# This file is part of the WebDataset library. +# See the LICENSE file for licensing terms (BSD-style). +# + +# Modified from https://github.com/webdataset/webdataset +# Modified from wenet(https://github.com/wenet-e2e/wenet) +"""A collection of iterators for data transformations. + +These functions are plain iterator functions. You can find curried versions +in webdataset.filters, and you can find IterableDataset wrappers in +webdataset.processing. +""" + +import io +from fnmatch import fnmatch +import re +import itertools, os, random, sys, time +from functools import reduce, wraps + +import numpy as np + +from . import autodecode +from . import utils +from .paddle_utils import PaddleTensor +from .utils import PipelineStage + +from .. import backends +from ..compliance import kaldi +import paddle +from ..transform.cmvn import GlobalCMVN +from ..utils.tensor_utils import pad_sequence +from ..transform.spec_augment import time_warp +from ..transform.spec_augment import time_mask +from ..transform.spec_augment import freq_mask + +class FilterFunction(object): + """Helper class for currying pipeline stages. + + We use this roundabout construct becauce it can be pickled. 
+ """ + + def __init__(self, f, *args, **kw): + """Create a curried function.""" + self.f = f + self.args = args + self.kw = kw + + def __call__(self, data): + """Call the curried function with the given argument.""" + return self.f(data, *self.args, **self.kw) + + def __str__(self): + """Compute a string representation.""" + return f"<{self.f.__name__} {self.args} {self.kw}>" + + def __repr__(self): + """Compute a string representation.""" + return f"<{self.f.__name__} {self.args} {self.kw}>" + + +class RestCurried(object): + """Helper class for currying pipeline stages. + + We use this roundabout construct because it can be pickled. + """ + + def __init__(self, f): + """Store the function for future currying.""" + self.f = f + + def __call__(self, *args, **kw): + """Curry with the given arguments.""" + return FilterFunction(self.f, *args, **kw) + + +def pipelinefilter(f): + """Turn the decorated function into one that is partially applied for + all arguments other than the first.""" + result = RestCurried(f) + return result + + +def reraise_exception(exn): + """Reraises the given exception; used as a handler. + + :param exn: exception + """ + raise exn + + +def identity(x): + """Return the argument.""" + return x + + +def compose2(f, g): + """Compose two functions, g(f(x)).""" + return lambda x: g(f(x)) + + +def compose(*args): + """Compose a sequence of functions (left-to-right).""" + return reduce(compose2, args) + + +def pipeline(source, *args): + """Write an input pipeline; first argument is source, rest are filters.""" + if len(args) == 0: + return source + return compose(*args)(source) + + +def getfirst(a, keys, default=None, missing_is_error=True): + """Get the first matching key from a dictionary. + + Keys can be specified as a list, or as a string of keys separated by ';'. + """ + if isinstance(keys, str): + assert " " not in keys + keys = keys.split(";") + for k in keys: + if k in a: + return a[k] + if missing_is_error: + raise ValueError(f"didn't find {keys} in {list(a.keys())}") + return default + + +def parse_field_spec(fields): + """Parse a specification for a list of fields to be extracted. + + Keys are separated by spaces in the spec. Each key can itself + be composed of key alternatives separated by ';'. + """ + if isinstance(fields, str): + fields = fields.split() + return [field.split(";") for field in fields] + + +def transform_with(sample, transformers): + """Transform a list of values using a list of functions. + + sample: list of values + transformers: list of functions + + If there are fewer transformers than inputs, or if a transformer + function is None, then the identity function is used for the + corresponding sample fields. + """ + if transformers is None or len(transformers) == 0: + return sample + result = list(sample) + assert len(transformers) <= len(sample) + for i in range(len(transformers)): # skipcq: PYL-C0200 + f = transformers[i] + if f is not None: + result[i] = f(sample[i]) + return result + +### +# Iterators +### + +def _info(data, fmt=None, n=3, every=-1, width=50, stream=sys.stderr, name=""): + """Print information about the samples that are passing through. 
+ + :param data: source iterator + :param fmt: format statement (using sample dict as keyword) + :param n: when to stop + :param every: how often to print + :param width: maximum width + :param stream: output stream + :param name: identifier printed before any output + """ + for i, sample in enumerate(data): + if i < n or (every > 0 and (i + 1) % every == 0): + if fmt is None: + print("---", name, file=stream) + for k, v in sample.items(): + print(k, repr(v)[:width], file=stream) + else: + print(fmt.format(**sample), file=stream) + yield sample + + +info = pipelinefilter(_info) + + +def pick(buf, rng): + k = rng.randint(0, len(buf) - 1) + sample = buf[k] + buf[k] = buf[-1] + buf.pop() + return sample + + +def _shuffle(data, bufsize=1000, initial=100, rng=None, handler=None): + """Shuffle the data in the stream. + + This uses a buffer of size `bufsize`. Shuffling at + startup is less random; this is traded off against + yielding samples quickly. + + data: iterator + bufsize: buffer size for shuffling + returns: iterator + rng: either random module or random.Random instance + + """ + if rng is None: + rng = random.Random(int((os.getpid() + time.time()) * 1e9)) + initial = min(initial, bufsize) + buf = [] + for sample in data: + buf.append(sample) + if len(buf) < bufsize: + try: + buf.append(next(data)) # skipcq: PYL-R1708 + except StopIteration: + pass + if len(buf) >= initial: + yield pick(buf, rng) + while len(buf) > 0: + yield pick(buf, rng) + + +shuffle = pipelinefilter(_shuffle) + + +class detshuffle(PipelineStage): + def __init__(self, bufsize=1000, initial=100, seed=0, epoch=-1): + self.bufsize = bufsize + self.initial = initial + self.seed = seed + self.epoch = epoch + + def run(self, src): + self.epoch += 1 + rng = random.Random() + rng.seed((self.seed, self.epoch)) + return _shuffle(src, self.bufsize, self.initial, rng) + + +def _select(data, predicate): + """Select samples based on a predicate. 
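A quick, hedged illustration of the buffered shuffle above on a toy stream: a fixed Random seed keeps the run repeatable, and the output is a permutation of the input because every buffered sample is eventually drained.

import random
from paddlespeech.audio.streamdata import shuffle

stream = iter(range(10))
shuffled = list(shuffle(bufsize=4, initial=4, rng=random.Random(0))(stream))
print(sorted(shuffled) == list(range(10)))   # True: same items, new order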
+ + :param data: source iterator + :param predicate: predicate (function) + """ + for sample in data: + if predicate(sample): + yield sample + + +select = pipelinefilter(_select) + + +def _log_keys(data, logfile=None): + import fcntl + + if logfile is None or logfile == "": + for sample in data: + yield sample + else: + with open(logfile, "a") as stream: + for i, sample in enumerate(data): + buf = f"{i}\t{sample.get('__worker__')}\t{sample.get('__rank__')}\t{sample.get('__key__')}\n" + try: + fcntl.flock(stream.fileno(), fcntl.LOCK_EX) + stream.write(buf) + finally: + fcntl.flock(stream.fileno(), fcntl.LOCK_UN) + yield sample + + +log_keys = pipelinefilter(_log_keys) + + +def _decode(data, *args, handler=reraise_exception, **kw): + """Decode data based on the decoding functions given as arguments.""" + + decoder = lambda x: autodecode.imagehandler(x) if isinstance(x, str) else x + handlers = [decoder(x) for x in args] + f = autodecode.Decoder(handlers, **kw) + + for sample in data: + assert isinstance(sample, dict), sample + try: + decoded = f(sample) + except Exception as exn: # skipcq: PYL-W0703 + if handler(exn): + continue + else: + break + yield decoded + + +decode = pipelinefilter(_decode) + + +def _map(data, f, handler=reraise_exception): + """Map samples.""" + for sample in data: + try: + result = f(sample) + except Exception as exn: + if handler(exn): + continue + else: + break + if result is None: + continue + if isinstance(sample, dict) and isinstance(result, dict): + result["__key__"] = sample.get("__key__") + yield result + + +map = pipelinefilter(_map) + + +def _rename(data, handler=reraise_exception, keep=True, **kw): + """Rename samples based on keyword arguments.""" + for sample in data: + try: + if not keep: + yield {k: getfirst(sample, v, missing_is_error=True) for k, v in kw.items()} + else: + + def listify(v): + return v.split(";") if isinstance(v, str) else v + + to_be_replaced = {x for v in kw.values() for x in listify(v)} + result = {k: v for k, v in sample.items() if k not in to_be_replaced} + result.update({k: getfirst(sample, v, missing_is_error=True) for k, v in kw.items()}) + yield result + except Exception as exn: + if handler(exn): + continue + else: + break + + +rename = pipelinefilter(_rename) + + +def _associate(data, associator, **kw): + """Associate additional data with samples.""" + for sample in data: + if callable(associator): + extra = associator(sample["__key__"]) + else: + extra = associator.get(sample["__key__"], {}) + sample.update(extra) # destructive + yield sample + + +associate = pipelinefilter(_associate) + + +def _map_dict(data, handler=reraise_exception, **kw): + """Map the entries in a dict sample with individual functions.""" + assert len(list(kw.keys())) > 0 + for key, f in kw.items(): + assert callable(f), (key, f) + + for sample in data: + assert isinstance(sample, dict) + try: + for k, f in kw.items(): + sample[k] = f(sample[k]) + except Exception as exn: + if handler(exn): + continue + else: + break + yield sample + + +map_dict = pipelinefilter(_map_dict) + + +def _to_tuple(data, *args, handler=reraise_exception, missing_is_error=True, none_is_error=None): + """Convert dict samples to tuples.""" + if none_is_error is None: + none_is_error = missing_is_error + if len(args) == 1 and isinstance(args[0], str) and " " in args[0]: + args = args[0].split() + + for sample in data: + try: + result = tuple([getfirst(sample, f, missing_is_error=missing_is_error) for f in args]) + if none_is_error and any(x is None for x in result): + raise 
ValueError(f"to_tuple {args} got {sample.keys()}") + yield result + except Exception as exn: + if handler(exn): + continue + else: + break + + +to_tuple = pipelinefilter(_to_tuple) + + +def _map_tuple(data, *args, handler=reraise_exception): + """Map the entries of a tuple with individual functions.""" + args = [f if f is not None else utils.identity for f in args] + for f in args: + assert callable(f), f + for sample in data: + assert isinstance(sample, (list, tuple)) + sample = list(sample) + n = min(len(args), len(sample)) + try: + for i in range(n): + sample[i] = args[i](sample[i]) + except Exception as exn: + if handler(exn): + continue + else: + break + yield tuple(sample) + + +map_tuple = pipelinefilter(_map_tuple) + + +def _unlisted(data): + """Turn batched data back into unbatched data.""" + for batch in data: + assert isinstance(batch, list), sample + for sample in batch: + yield sample + + +unlisted = pipelinefilter(_unlisted) + + +def _unbatched(data): + """Turn batched data back into unbatched data.""" + for sample in data: + assert isinstance(sample, (tuple, list)), sample + assert len(sample) > 0 + for i in range(len(sample[0])): + yield tuple(x[i] for x in sample) + + +unbatched = pipelinefilter(_unbatched) + + +def _rsample(data, p=0.5): + """Randomly subsample a stream of data.""" + assert p >= 0.0 and p <= 1.0 + for sample in data: + if random.uniform(0.0, 1.0) < p: + yield sample + + +rsample = pipelinefilter(_rsample) + +slice = pipelinefilter(itertools.islice) + + +def _extract_keys(source, *patterns, duplicate_is_error=True, ignore_missing=False): + for sample in source: + result = [] + for pattern in patterns: + pattern = pattern.split(";") if isinstance(pattern, str) else pattern + matches = [x for x in sample.keys() if any(fnmatch("." 
+ x, p) for p in pattern)] + if len(matches) == 0: + if ignore_missing: + continue + else: + raise ValueError(f"Cannot find {pattern} in sample keys {sample.keys()}.") + if len(matches) > 1 and duplicate_is_error: + raise ValueError(f"Multiple sample keys {sample.keys()} match {pattern}.") + value = sample[matches[0]] + result.append(value) + yield tuple(result) + + +extract_keys = pipelinefilter(_extract_keys) + + +def _rename_keys(source, *args, keep_unselected=False, must_match=True, duplicate_is_error=True, **kw): + renamings = [(pattern, output) for output, pattern in args] + renamings += [(pattern, output) for output, pattern in kw.items()] + for sample in source: + new_sample = {} + matched = {k: False for k, _ in renamings} + for path, value in sample.items(): + fname = re.sub(r".*/", "", path) + new_name = None + for pattern, name in renamings[::-1]: + if fnmatch(fname.lower(), pattern): + matched[pattern] = True + new_name = name + break + if new_name is None: + if keep_unselected: + new_sample[path] = value + continue + if new_name in new_sample: + if duplicate_is_error: + raise ValueError(f"Duplicate value in sample {sample.keys()} after rename.") + continue + new_sample[new_name] = value + if must_match and not all(matched.values()): + raise ValueError(f"Not all patterns ({matched}) matched sample keys ({sample.keys()}).") + + yield new_sample + + +rename_keys = pipelinefilter(_rename_keys) + + +def decode_bin(stream): + return stream.read() + + +def decode_text(stream): + binary = stream.read() + return binary.decode("utf-8") + + +def decode_pickle(stream): + return pickle.load(stream) + + +default_decoders = [ + ("*.bin", decode_bin), + ("*.txt", decode_text), + ("*.pyd", decode_pickle), +] + + +def find_decoder(decoders, path): + fname = re.sub(r".*/", "", path) + if fname.startswith("__"): + return lambda x: x + for pattern, fun in decoders[::-1]: + if fnmatch(fname.lower(), pattern) or fnmatch("." + fname.lower(), pattern): + return fun + return None + + +def _xdecode( + source, + *args, + must_decode=True, + defaults=default_decoders, + **kw, +): + decoders = list(defaults) + list(args) + decoders += [("*." + k, v) for k, v in kw.items()] + for sample in source: + new_sample = {} + for path, data in sample.items(): + if path.startswith("__"): + new_sample[path] = data + continue + decoder = find_decoder(decoders, path) + if decoder is False: + value = data + elif decoder is None: + if must_decode: + raise ValueError(f"No decoder found for {path}.") + value = data + else: + if isinstance(data, bytes): + data = io.BytesIO(data) + value = decoder(data) + new_sample[path] = value + yield new_sample + +xdecode = pipelinefilter(_xdecode) + + + +def _audio_data_filter(source, + frame_shift=10, + max_length=10240, + min_length=10, + token_max_length=200, + token_min_length=1, + min_output_input_ratio=0.0005, + max_output_input_ratio=1): + """ Filter sample according to feature and label length + Inplace operation. 
+ + Args:: + source: Iterable[{fname, wav, label, sample_rate}] + frame_shift: length of frame shift (ms) + max_length: drop utterance which is greater than max_length(10ms) + min_length: drop utterance which is less than min_length(10ms) + token_max_length: drop utterance which is greater than + token_max_length, especially when use char unit for + english modeling + token_min_length: drop utterance which is + less than token_max_length + min_output_input_ratio: minimal ration of + token_length / feats_length(10ms) + max_output_input_ratio: maximum ration of + token_length / feats_length(10ms) + + Returns: + Iterable[{fname, wav, label, sample_rate}] + """ + for sample in source: + assert 'sample_rate' in sample + assert 'wav' in sample + assert 'label' in sample + # sample['wav'] is paddle.Tensor, we have 100 frames every second (default) + num_frames = sample['wav'].shape[1] / sample['sample_rate'] * (1000 / frame_shift) + if num_frames < min_length: + continue + if num_frames > max_length: + continue + if len(sample['label']) < token_min_length: + continue + if len(sample['label']) > token_max_length: + continue + if num_frames != 0: + if len(sample['label']) / num_frames < min_output_input_ratio: + continue + if len(sample['label']) / num_frames > max_output_input_ratio: + continue + yield sample + +audio_data_filter = pipelinefilter(_audio_data_filter) + +def _audio_tokenize(source, + symbol_table, + bpe_model=None, + non_lang_syms=None, + split_with_space=False): + """ Decode text to chars or BPE + Inplace operation + + Args: + source: Iterable[{fname, wav, txt, sample_rate}] + + Returns: + Iterable[{fname, wav, txt, tokens, label, sample_rate}] + """ + if non_lang_syms is not None: + non_lang_syms_pattern = re.compile(r"(\[[^\[\]]+\]|<[^<>]+>|{[^{}]+})") + else: + non_lang_syms = {} + non_lang_syms_pattern = None + + if bpe_model is not None: + import sentencepiece as spm + sp = spm.SentencePieceProcessor() + sp.load(bpe_model) + else: + sp = None + + for sample in source: + assert 'txt' in sample + txt = sample['txt'].strip() + if non_lang_syms_pattern is not None: + parts = non_lang_syms_pattern.split(txt.upper()) + parts = [w for w in parts if len(w.strip()) > 0] + else: + parts = [txt] + + label = [] + tokens = [] + for part in parts: + if part in non_lang_syms: + tokens.append(part) + else: + if bpe_model is not None: + tokens.extend(__tokenize_by_bpe_model(sp, part)) + else: + if split_with_space: + part = part.split(" ") + for ch in part: + if ch == ' ': + ch = "" + tokens.append(ch) + + for ch in tokens: + if ch in symbol_table: + label.append(symbol_table[ch]) + elif '' in symbol_table: + label.append(symbol_table['']) + + sample['tokens'] = tokens + sample['label'] = label + yield sample + +audio_tokenize = pipelinefilter(_audio_tokenize) + +def _audio_resample(source, resample_rate=16000): + """ Resample data. + Inplace operation. 
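To make the frame bookkeeping in audio_data_filter above concrete: with the default frame_shift of 10 ms, a 3 s, 16 kHz utterance has 300 frames and passes the default thresholds, while a 0.05 s clip has only 5 frames and is dropped. A hedged sketch with invented sample dicts:

import paddle
from paddlespeech.audio.streamdata import audio_data_filter

def make_sample(seconds, label="nihao"):
    # minimal fake sample: silent waveform of the given duration at 16 kHz
    sr = 16000
    return {"fname": "utt", "sample_rate": sr, "label": list(label),
            "wav": paddle.zeros([1, int(seconds * sr)])}

kept = list(audio_data_filter()([make_sample(3.0), make_sample(0.05)]))
print(len(kept))   # 1: only the 3-second utterance survives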
+ + Args: + data: Iterable[{fname, wav, label, sample_rate}] + resample_rate: target resample rate + + Returns: + Iterable[{fname, wav, label, sample_rate}] + """ + for sample in source: + assert 'sample_rate' in sample + assert 'wav' in sample + sample_rate = sample['sample_rate'] + waveform = sample['wav'] + if sample_rate != resample_rate: + sample['sample_rate'] = resample_rate + sample['wav'] = paddle.to_tensor(backends.soundfile_backend.resample( + waveform.numpy(), src_sr = sample_rate, target_sr = resample_rate + )) + yield sample + +audio_resample = pipelinefilter(_audio_resample) + +def _audio_compute_fbank(source, + num_mel_bins=80, + frame_length=25, + frame_shift=10, + dither=0.0): + """ Extract fbank + + Args: + source: Iterable[{fname, wav, label, sample_rate}] + num_mel_bins: number of mel filter bank + frame_length: length of one frame (ms) + frame_shift: length of frame shift (ms) + dither: value of dither + + Returns: + Iterable[{fname, feat, label}] + """ + for sample in source: + assert 'sample_rate' in sample + assert 'wav' in sample + assert 'fname' in sample + assert 'label' in sample + sample_rate = sample['sample_rate'] + waveform = sample['wav'] + waveform = waveform * (1 << 15) + # Only keep fname, feat, label + mat = kaldi.fbank(waveform, + n_mels=num_mel_bins, + frame_length=frame_length, + frame_shift=frame_shift, + dither=dither, + energy_floor=0.0, + sr=sample_rate) + yield dict(fname=sample['fname'], label=sample['label'], feat=mat) + + +audio_compute_fbank = pipelinefilter(_audio_compute_fbank) + +def _audio_spec_aug(source, + max_w=5, + w_inplace=True, + w_mode="PIL", + max_f=30, + num_f_mask=2, + f_inplace=True, + f_replace_with_zero=False, + max_t=40, + num_t_mask=2, + t_inplace=True, + t_replace_with_zero=False,): + """ Do spec augmentation + Inplace operation + + Args: + source: Iterable[{fname, feat, label}] + max_w: max width of time warp + w_inplace: whether to inplace the original data while time warping + w_mode: time warp mode + max_f: max width of freq mask + num_f_mask: number of freq mask to apply + f_inplace: whether to inplace the original data while frequency masking + f_replace_with_zero: use zero to mask + max_t: max width of time mask + num_t_mask: number of time mask to apply + t_inplace: whether to inplace the original data while time masking + t_replace_with_zero: use zero to mask + + Returns + Iterable[{fname, feat, label}] + """ + for sample in source: + x = sample['feat'] + x = x.numpy() + x = time_warp(x, max_time_warp=max_w, inplace = w_inplace, mode= w_mode) + x = freq_mask(x, F = max_f, n_mask = num_f_mask, inplace = f_inplace, replace_with_zero = f_replace_with_zero) + x = time_mask(x, T = max_t, n_mask = num_t_mask, inplace = t_inplace, replace_with_zero = t_replace_with_zero) + sample['feat'] = paddle.to_tensor(x, dtype=paddle.float32) + yield sample + +audio_spec_aug = pipelinefilter(_audio_spec_aug) + + +def _sort(source, sort_size=500): + """ Sort the data by feature length. 
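# --- Editor's note: a hedged sketch (not part of the patch) of how the audio
# stages above are meant to be chained.  Each pipelinefilter(...) wrapper turns
# the generator into a curried stage, so a feature front-end would look roughly
# like the following; `shard_source` and the parameter values are illustrative,
# not taken from a shipped config.
#
#   dataset = DataPipeline(
#       shard_source,                          # yields {fname, wav, txt, sample_rate}
#       audio_resample(resample_rate=16000),
#       audio_compute_fbank(num_mel_bins=80, frame_length=25, frame_shift=10),
#       audio_spec_aug(max_f=30, max_t=40),    # training only
#   )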
+ Sort is used after shuffle and before batch, so we can group + utts with similar lengths into a batch, and `sort_size` should + be less than `shuffle_size` + + Args: + source: Iterable[{fname, feat, label}] + sort_size: buffer size for sort + + Returns: + Iterable[{fname, feat, label}] + """ + + buf = [] + for sample in source: + buf.append(sample) + if len(buf) >= sort_size: + buf.sort(key=lambda x: x['feat'].shape[0]) + for x in buf: + yield x + buf = [] + # The sample left over + buf.sort(key=lambda x: x['feat'].shape[0]) + for x in buf: + yield x + +sort = pipelinefilter(_sort) + +def _batched(source, batch_size=16): + """ Static batch the data by `batch_size` + + Args: + data: Iterable[{fname, feat, label}] + batch_size: batch size + + Returns: + Iterable[List[{fname, feat, label}]] + """ + buf = [] + for sample in source: + buf.append(sample) + if len(buf) >= batch_size: + yield buf + buf = [] + if len(buf) > 0: + yield buf + +batched = pipelinefilter(_batched) + +def dynamic_batched(source, max_frames_in_batch=12000): + """ Dynamic batch the data until the total frames in batch + reach `max_frames_in_batch` + + Args: + source: Iterable[{fname, feat, label}] + max_frames_in_batch: max_frames in one batch + + Returns: + Iterable[List[{fname, feat, label}]] + """ + buf = [] + longest_frames = 0 + for sample in source: + assert 'feat' in sample + assert isinstance(sample['feat'], paddle.Tensor) + new_sample_frames = sample['feat'].size(0) + longest_frames = max(longest_frames, new_sample_frames) + frames_after_padding = longest_frames * (len(buf) + 1) + if frames_after_padding > max_frames_in_batch: + yield buf + buf = [sample] + longest_frames = new_sample_frames + else: + buf.append(sample) + if len(buf) > 0: + yield buf + + +def _audio_padding(source): + """ Padding the data into training data + + Args: + source: Iterable[List[{fname, feat, label}]] + + Returns: + Iterable[Tuple(fname, feats, labels, feats lengths, label lengths)] + """ + for sample in source: + assert isinstance(sample, list) + feats_length = paddle.to_tensor([x['feat'].shape[0] for x in sample], + dtype="int64") + order = paddle.argsort(feats_length, descending=True) + feats_lengths = paddle.to_tensor( + [sample[i]['feat'].shape[0] for i in order], dtype="int64") + sorted_feats = [sample[i]['feat'] for i in order] + sorted_keys = [sample[i]['fname'] for i in order] + sorted_labels = [ + paddle.to_tensor(sample[i]['label'], dtype="int32") for i in order + ] + label_lengths = paddle.to_tensor([x.shape[0] for x in sorted_labels], + dtype="int64") + padded_feats = pad_sequence(sorted_feats, + batch_first=True, + padding_value=0) + padding_labels = pad_sequence(sorted_labels, + batch_first=True, + padding_value=-1) + + yield (sorted_keys, padded_feats, feats_lengths, padding_labels, + label_lengths) + +audio_padding = pipelinefilter(_audio_padding) + +def _audio_cmvn(source, cmvn_file): + global_cmvn = GlobalCMVN(cmvn_file) + for batch in source: + sorted_keys, padded_feats, feats_lengths, padding_labels, label_lengths = batch + padded_feats = padded_feats.numpy() + padded_feats = global_cmvn(padded_feats) + padded_feats = paddle.to_tensor(padded_feats, dtype=paddle.float32) + yield (sorted_keys, padded_feats, feats_lengths, padding_labels, + label_lengths) + +audio_cmvn = pipelinefilter(_audio_cmvn) + +def _placeholder(source): + for data in source: + yield data + +placeholder = pipelinefilter(_placeholder) diff --git a/paddlespeech/audio/streamdata/gopen.py b/paddlespeech/audio/streamdata/gopen.py new file mode 
100644 index 000000000..457d048a6 --- /dev/null +++ b/paddlespeech/audio/streamdata/gopen.py @@ -0,0 +1,340 @@ +# +# Copyright (c) 2017-2021 NVIDIA CORPORATION. All rights reserved. +# This file is part of the WebDataset library. +# See the LICENSE file for licensing terms (BSD-style). +# + + +"""Open URLs by calling subcommands.""" + +import os, sys, re +from subprocess import PIPE, Popen +from urllib.parse import urlparse + +# global used for printing additional node information during verbose output +info = {} + + +class Pipe: + """Wrapper class for subprocess.Pipe. + + This class looks like a stream from the outside, but it checks + subprocess status and handles timeouts with exceptions. + This way, clients of the class do not need to know that they are + dealing with subprocesses. + + :param *args: passed to `subprocess.Pipe` + :param **kw: passed to `subprocess.Pipe` + :param timeout: timeout for closing/waiting + :param ignore_errors: don't raise exceptions on subprocess errors + :param ignore_status: list of status codes to ignore + """ + + def __init__( + self, + *args, + mode=None, + timeout=7200.0, + ignore_errors=False, + ignore_status=[], + **kw, + ): + """Create an IO Pipe.""" + self.ignore_errors = ignore_errors + self.ignore_status = [0] + ignore_status + self.timeout = timeout + self.args = (args, kw) + if mode[0] == "r": + self.proc = Popen(*args, stdout=PIPE, **kw) + self.stream = self.proc.stdout + if self.stream is None: + raise ValueError(f"{args}: couldn't open") + elif mode[0] == "w": + self.proc = Popen(*args, stdin=PIPE, **kw) + self.stream = self.proc.stdin + if self.stream is None: + raise ValueError(f"{args}: couldn't open") + self.status = None + + def __str__(self): + return f"" + + def check_status(self): + """Poll the process and handle any errors.""" + status = self.proc.poll() + if status is not None: + self.wait_for_child() + + def wait_for_child(self): + """Check the status variable and raise an exception if necessary.""" + verbose = int(os.environ.get("GOPEN_VERBOSE", 0)) + if self.status is not None and verbose: + # print(f"(waiting again [{self.status} {os.getpid()}:{self.proc.pid}])", file=sys.stderr) + return + self.status = self.proc.wait() + if verbose: + print( + f"pipe exit [{self.status} {os.getpid()}:{self.proc.pid}] {self.args} {info}", + file=sys.stderr, + ) + if self.status not in self.ignore_status and not self.ignore_errors: + raise Exception(f"{self.args}: exit {self.status} (read) {info}") + + def read(self, *args, **kw): + """Wrap stream.read and checks status.""" + result = self.stream.read(*args, **kw) + self.check_status() + return result + + def write(self, *args, **kw): + """Wrap stream.write and checks status.""" + result = self.stream.write(*args, **kw) + self.check_status() + return result + + def readLine(self, *args, **kw): + """Wrap stream.readLine and checks status.""" + result = self.stream.readLine(*args, **kw) + self.status = self.proc.poll() + self.check_status() + return result + + def close(self): + """Wrap stream.close, wait for the subprocess, and handle errors.""" + self.stream.close() + self.status = self.proc.wait(self.timeout) + self.wait_for_child() + + def __enter__(self): + """Context handler.""" + return self + + def __exit__(self, etype, value, traceback): + """Context handler.""" + self.close() + + +def set_options( + obj, timeout=None, ignore_errors=None, ignore_status=None, handler=None +): + """Set options for Pipes. + + This function can be called on any stream. 
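# --- Editor's note: a minimal usage sketch (not part of the patch) for the Pipe
# wrapper above.  Pipe hides a subprocess behind a file-like interface; exit
# status 141 (SIGPIPE) is commonly added to ignore_status because readers may
# close the stream early.  The shard name is illustrative only.
#
#   with Pipe("gunzip -c shard-000000.tar.gz", mode="rb", shell=True,
#             ignore_status=[141]) as stream:
#       header = stream.read(512)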
It will set pipe options only + when its argument is a pipe. + + :param obj: any kind of stream + :param timeout: desired timeout + :param ignore_errors: desired ignore_errors setting + :param ignore_status: desired ignore_status setting + :param handler: desired error handler + """ + if not isinstance(obj, Pipe): + return False + if timeout is not None: + obj.timeout = timeout + if ignore_errors is not None: + obj.ignore_errors = ignore_errors + if ignore_status is not None: + obj.ignore_status = ignore_status + if handler is not None: + obj.handler = handler + return True + + +def gopen_file(url, mode="rb", bufsize=8192): + """Open a file. + + This works for local files, files over HTTP, and pipe: files. + + :param url: URL to be opened + :param mode: mode to open it with + :param bufsize: requested buffer size + """ + return open(url, mode) + + +def gopen_pipe(url, mode="rb", bufsize=8192): + """Use gopen to open a pipe. + + :param url: a pipe: URL + :param mode: desired mode + :param bufsize: desired buffer size + """ + assert url.startswith("pipe:") + cmd = url[5:] + if mode[0] == "r": + return Pipe( + cmd, + mode=mode, + shell=True, + bufsize=bufsize, + ignore_status=[141], + ) # skipcq: BAN-B604 + elif mode[0] == "w": + return Pipe( + cmd, + mode=mode, + shell=True, + bufsize=bufsize, + ignore_status=[141], + ) # skipcq: BAN-B604 + else: + raise ValueError(f"{mode}: unknown mode") + + +def gopen_curl(url, mode="rb", bufsize=8192): + """Open a URL with `curl`. + + :param url: url (usually, http:// etc.) + :param mode: file mode + :param bufsize: buffer size + """ + if mode[0] == "r": + cmd = f"curl -s -L '{url}'" + return Pipe( + cmd, + mode=mode, + shell=True, + bufsize=bufsize, + ignore_status=[141, 23], + ) # skipcq: BAN-B604 + elif mode[0] == "w": + cmd = f"curl -s -L -T - '{url}'" + return Pipe( + cmd, + mode=mode, + shell=True, + bufsize=bufsize, + ignore_status=[141, 26], + ) # skipcq: BAN-B604 + else: + raise ValueError(f"{mode}: unknown mode") + + +def gopen_htgs(url, mode="rb", bufsize=8192): + """Open a URL with `curl`. + + :param url: url (usually, http:// etc.) + :param mode: file mode + :param bufsize: buffer size + """ + if mode[0] == "r": + url = re.sub(r"(?i)^htgs://", "gs://", url) + cmd = f"curl -s -L '{url}'" + return Pipe( + cmd, + mode=mode, + shell=True, + bufsize=bufsize, + ignore_status=[141, 23], + ) # skipcq: BAN-B604 + elif mode[0] == "w": + raise ValueError(f"{mode}: cannot write") + else: + raise ValueError(f"{mode}: unknown mode") + + + +def gopen_gsutil(url, mode="rb", bufsize=8192): + """Open a URL with `curl`. + + :param url: url (usually, http:// etc.) + :param mode: file mode + :param bufsize: buffer size + """ + if mode[0] == "r": + cmd = f"gsutil cat '{url}'" + return Pipe( + cmd, + mode=mode, + shell=True, + bufsize=bufsize, + ignore_status=[141, 23], + ) # skipcq: BAN-B604 + elif mode[0] == "w": + cmd = f"gsutil cp - '{url}'" + return Pipe( + cmd, + mode=mode, + shell=True, + bufsize=bufsize, + ignore_status=[141, 26], + ) # skipcq: BAN-B604 + else: + raise ValueError(f"{mode}: unknown mode") + + + +def gopen_error(url, *args, **kw): + """Raise a value error. 
+ + :param url: url + :param args: other arguments + :param kw: other keywords + """ + raise ValueError(f"{url}: no gopen handler defined") + + +"""A dispatch table mapping URL schemes to handlers.""" +gopen_schemes = dict( + __default__=gopen_error, + pipe=gopen_pipe, + http=gopen_curl, + https=gopen_curl, + sftp=gopen_curl, + ftps=gopen_curl, + scp=gopen_curl, + gs=gopen_gsutil, + htgs=gopen_htgs, +) + + +def gopen(url, mode="rb", bufsize=8192, **kw): + """Open the URL. + + This uses the `gopen_schemes` dispatch table to dispatch based + on scheme. + + Support for the following schemes is built-in: pipe, file, + http, https, sftp, ftps, scp. + + When no scheme is given the url is treated as a file. + + You can use the OPEN_VERBOSE argument to get info about + files being opened. + + :param url: the source URL + :param mode: the mode ("rb", "r") + :param bufsize: the buffer size + """ + global fallback_gopen + verbose = int(os.environ.get("GOPEN_VERBOSE", 0)) + if verbose: + print("GOPEN", url, info, file=sys.stderr) + assert mode in ["rb", "wb"], mode + if url == "-": + if mode == "rb": + return sys.stdin.buffer + elif mode == "wb": + return sys.stdout.buffer + else: + raise ValueError(f"unknown mode {mode}") + pr = urlparse(url) + if pr.scheme == "": + bufsize = int(os.environ.get("GOPEN_BUFFER", -1)) + return open(url, mode, buffering=bufsize) + if pr.scheme == "file": + bufsize = int(os.environ.get("GOPEN_BUFFER", -1)) + return open(pr.path, mode, buffering=bufsize) + handler = gopen_schemes["__default__"] + handler = gopen_schemes.get(pr.scheme, handler) + return handler(url, mode, bufsize, **kw) + + +def reader(url, **kw): + """Open url with gopen and mode "rb". + + :param url: source URL + :param kw: other keywords forwarded to gopen + """ + return gopen(url, "rb", **kw) diff --git a/paddlespeech/audio/streamdata/handlers.py b/paddlespeech/audio/streamdata/handlers.py new file mode 100644 index 000000000..7f3d28b62 --- /dev/null +++ b/paddlespeech/audio/streamdata/handlers.py @@ -0,0 +1,47 @@ +# +# Copyright (c) 2017-2021 NVIDIA CORPORATION. All rights reserved. +# This file is part of the WebDataset library. +# See the LICENSE file for licensing terms (BSD-style). +# + +"""Pluggable exception handlers. + +These are functions that take an exception as an argument and then return... + +- the exception (in order to re-raise it) +- True (in order to continue and ignore the exception) +- False (in order to ignore the exception and stop processing) + +They are used as handler= arguments in much of the library. 
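# --- Editor's note: a hedged sketch (not part of the patch) of the dispatch
# above.  gopen() looks at the URL scheme and falls back to a plain open() for
# scheme-less paths, so all of the following go through the same entry point
# (the URLs are illustrative):
#
#   local = gopen("data/train-000000.tar")           # plain file
#   http  = gopen("https://example.com/shard.tar")   # curl pipe
#   piped = gopen("pipe:cat data/train-000000.tar")  # arbitrary command
#   stdin = gopen("-")                               # sys.stdin.buffer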
+""" + +import time, warnings + + +def reraise_exception(exn): + """Call in an exception handler to re-raise the exception.""" + raise exn + + +def ignore_and_continue(exn): + """Call in an exception handler to ignore any exception and continue.""" + return True + + +def warn_and_continue(exn): + """Call in an exception handler to ignore any exception, isssue a warning, and continue.""" + warnings.warn(repr(exn)) + time.sleep(0.5) + return True + + +def ignore_and_stop(exn): + """Call in an exception handler to ignore any exception and stop further processing.""" + return False + + +def warn_and_stop(exn): + """Call in an exception handler to ignore any exception and stop further processing.""" + warnings.warn(repr(exn)) + time.sleep(0.5) + return False diff --git a/paddlespeech/audio/streamdata/mix.py b/paddlespeech/audio/streamdata/mix.py new file mode 100644 index 000000000..7d790f00f --- /dev/null +++ b/paddlespeech/audio/streamdata/mix.py @@ -0,0 +1,85 @@ +# +# Copyright (c) 2017-2021 NVIDIA CORPORATION. All rights reserved. +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# This file is part of the WebDataset library. +# See the LICENSE file for licensing terms (BSD-style). +# Modified from https://github.com/webdataset/webdataset +# + +"""Classes for mixing samples from multiple sources.""" + +import itertools, os, random, time, sys +from functools import reduce, wraps + +import numpy as np + +from . import autodecode, utils +from .paddle_utils import PaddleTensor, IterableDataset +from .utils import PipelineStage + + +def round_robin_shortest(*sources): + i = 0 + while True: + try: + sample = next(sources[i % len(sources)]) + yield sample + except StopIteration: + break + i += 1 + + +def round_robin_longest(*sources): + i = 0 + while len(sources) > 0: + try: + sample = next(sources[i]) + i += 1 + yield sample + except StopIteration: + del sources[i] + + +class RoundRobin(IterableDataset): + def __init__(self, datasets, longest=False): + self.datasets = datasets + self.longest = longest + + def __iter__(self): + """Return an iterator over the sources.""" + sources = [iter(d) for d in self.datasets] + if self.longest: + return round_robin_longest(*sources) + else: + return round_robin_shortest(*sources) + + +def random_samples(sources, probs=None, longest=False): + if probs is None: + probs = [1] * len(sources) + else: + probs = list(probs) + while len(sources) > 0: + cum = (np.array(probs) / np.sum(probs)).cumsum() + r = random.random() + i = np.searchsorted(cum, r) + try: + yield next(sources[i]) + except StopIteration: + if longest: + del sources[i] + del probs[i] + else: + break + + +class RandomMix(IterableDataset): + def __init__(self, datasets, probs=None, longest=False): + self.datasets = datasets + self.probs = probs + self.longest = longest + + def __iter__(self): + """Return an iterator over the sources.""" + sources = [iter(d) for d in self.datasets] + return random_samples(sources, self.probs, longest=self.longest) diff --git a/paddlespeech/audio/streamdata/paddle_utils.py b/paddlespeech/audio/streamdata/paddle_utils.py new file mode 100644 index 000000000..02bc4c841 --- /dev/null +++ b/paddlespeech/audio/streamdata/paddle_utils.py @@ -0,0 +1,33 @@ +# +# Copyright (c) 2017-2021 NVIDIA CORPORATION. All rights reserved. +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# This file is part of the WebDataset library. +# See the LICENSE file for licensing terms (BSD-style). 
+# Modified from https://github.com/webdataset/webdataset +# + +"""Mock implementations of paddle interfaces when paddle is not available.""" + + +try: + from paddle.io import DataLoader, IterableDataset +except ModuleNotFoundError: + + class IterableDataset: + """Empty implementation of IterableDataset when paddle is not available.""" + + pass + + class DataLoader: + """Empty implementation of DataLoader when paddle is not available.""" + + pass + +try: + from paddle import Tensor as PaddleTensor +except ModuleNotFoundError: + + class TorchTensor: + """Empty implementation of PaddleTensor when paddle is not available.""" + + pass diff --git a/paddlespeech/audio/streamdata/pipeline.py b/paddlespeech/audio/streamdata/pipeline.py new file mode 100644 index 000000000..7339a762a --- /dev/null +++ b/paddlespeech/audio/streamdata/pipeline.py @@ -0,0 +1,132 @@ +# Copyright (c) 2017-2019 NVIDIA CORPORATION. All rights reserved. +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# See the LICENSE file for licensing terms (BSD-style). +# Modified from https://github.com/webdataset/webdataset +#%% +import copy, os, random, sys, time +from dataclasses import dataclass +from itertools import islice +from typing import List + +import braceexpand, yaml + +from .handlers import reraise_exception +from .paddle_utils import DataLoader, IterableDataset +from .utils import PipelineStage + + +def add_length_method(obj): + def length(self): + return self.size + + Combined = type( + obj.__class__.__name__ + "_Length", + (obj.__class__, IterableDataset), + {"__len__": length}, + ) + obj.__class__ = Combined + return obj + + +class DataPipeline(IterableDataset, PipelineStage): + """A pipeline starting with an IterableDataset and a series of filters.""" + + def __init__(self, *args, **kwargs): + super().__init__() + self.pipeline = [] + self.length = -1 + self.repetitions = 1 + self.nsamples = -1 + for arg in args: + if arg is None: + continue + if isinstance(arg, list): + self.pipeline.extend(arg) + else: + self.pipeline.append(arg) + + def invoke(self, f, *args, **kwargs): + """Apply a pipeline stage, possibly to the output of a previous stage.""" + if isinstance(f, PipelineStage): + return f.run(*args, **kwargs) + if isinstance(f, (IterableDataset, DataLoader)) and len(args) == 0: + return iter(f) + if isinstance(f, list): + return iter(f) + if callable(f): + result = f(*args, **kwargs) + return result + raise ValueError(f"{f}: not a valid pipeline stage") + + def iterator1(self): + """Create an iterator through one epoch in the pipeline.""" + source = self.invoke(self.pipeline[0]) + for step in self.pipeline[1:]: + source = self.invoke(step, source) + return source + + def iterator(self): + """Create an iterator through the entire dataset, using the given number of repetitions.""" + for i in range(self.repetitions): + for sample in self.iterator1(): + yield sample + + def __iter__(self): + """Create an iterator through the pipeline, repeating and slicing as requested.""" + if self.repetitions != 1: + if self.nsamples > 0: + return islice(self.iterator(), self.nsamples) + else: + return self.iterator() + else: + return self.iterator() + + def stage(self, i): + """Return pipeline stage i.""" + return self.pipeline[i] + + def append(self, f): + """Append a pipeline stage (modifies the object).""" + self.pipeline.append(f) + return self + + def append_list(self, *args): + for arg in args: + self.pipeline.append(arg) + return self + + def compose(self, *args): + """Append a pipeline stage to a copy of 
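# --- Editor's note: a hedged sketch (not part of the patch) of how DataPipeline
# is assembled.  Stages are either iterables (used as the source) or callables
# that take the previous stage's iterator; the shard pattern and parameters
# below are illustrative.
#
#   pipeline = DataPipeline(
#       SimpleShardList("shards/train-{000000..000099}.tar"),
#       split_by_node,
#       split_by_worker,
#       tarfile_to_samples(),
#   )
#   pipeline = pipeline.compose(audio_data_filter(max_length=10240))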
the pipeline and returns the copy.""" + result = copy.copy(self) + for arg in args: + result.append(arg) + return result + + def with_length(self, n): + """Add a __len__ method returning the desired value. + + This does not change the actual number of samples in an epoch. + PyTorch IterableDataset should not have a __len__ method. + This is provided only as a workaround for some broken training environments + that require a __len__ method. + """ + self.size = n + return add_length_method(self) + + def with_epoch(self, nsamples=-1, nbatches=-1): + """Change the epoch to return the given number of samples/batches. + + The two arguments mean the same thing.""" + self.repetitions = sys.maxsize + self.nsamples = max(nsamples, nbatches) + return self + + def repeat(self, nepochs=-1, nbatches=-1): + """Repeat iterating through the dataset for the given #epochs up to the given #samples.""" + if nepochs > 0: + self.repetitions = nepochs + self.nsamples = nbatches + else: + self.repetitions = sys.maxsize + self.nsamples = nbatches + return self diff --git a/paddlespeech/audio/streamdata/shardlists.py b/paddlespeech/audio/streamdata/shardlists.py new file mode 100644 index 000000000..cfaf9a64b --- /dev/null +++ b/paddlespeech/audio/streamdata/shardlists.py @@ -0,0 +1,261 @@ +# +# Copyright (c) 2017-2021 NVIDIA CORPORATION. All rights reserved. +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# This file is part of the WebDataset library. +# See the LICENSE file for licensing terms (BSD-style). +# + +# Modified from https://github.com/webdataset/webdataset + +"""Train PyTorch models directly from POSIX tar archive. + +Code works locally or over HTTP connections. +""" + +import os, random, sys, time +from dataclasses import dataclass, field +from itertools import islice +from typing import List + +import braceexpand, yaml + +from . import utils +from .filters import pipelinefilter +from .paddle_utils import IterableDataset + + +from ..utils.log import Logger +logger = Logger(__name__) +def expand_urls(urls): + if isinstance(urls, str): + urllist = urls.split("::") + result = [] + for url in urllist: + result.extend(braceexpand.braceexpand(url)) + return result + else: + return list(urls) + + +class SimpleShardList(IterableDataset): + """An iterable dataset yielding a list of urls.""" + + def __init__(self, urls, seed=None): + """Iterate through the list of shards. 
+ + :param urls: a list of URLs as a Python list or brace notation string + """ + super().__init__() + urls = expand_urls(urls) + self.urls = urls + assert isinstance(self.urls[0], str) + self.seed = seed + + def __len__(self): + return len(self.urls) + + def __iter__(self): + """Return an iterator over the shards.""" + urls = self.urls.copy() + if self.seed is not None: + random.Random(self.seed).shuffle(urls) + for url in urls: + yield dict(url=url) + + +def split_by_node(src, group=None): + rank, world_size, worker, num_workers = utils.paddle_worker_info(group=group) + logger.info(f"world_size:{world_size}, rank:{rank}") + if world_size > 1: + for s in islice(src, rank, None, world_size): + yield s + else: + for s in src: + yield s + + +def single_node_only(src, group=None): + rank, world_size, worker, num_workers = utils.paddle_worker_info(group=group) + if world_size > 1: + raise ValueError("input pipeline needs to be reconfigured for multinode training") + for s in src: + yield s + + +def split_by_worker(src): + rank, world_size, worker, num_workers = utils.paddle_worker_info() + logger.info(f"num_workers:{num_workers}, worker:{worker}") + if num_workers > 1: + for s in islice(src, worker, None, num_workers): + yield s + else: + for s in src: + yield s + + +def resampled_(src, n=sys.maxsize): + import random + + seed = time.time() + try: + seed = open("/dev/random", "rb").read(20) + except Exception as exn: + print(repr(exn)[:50], file=sys.stderr) + rng = random.Random(seed) + print("# resampled loading", file=sys.stderr) + items = list(src) + print(f"# resampled got {len(items)} samples, yielding {n}", file=sys.stderr) + for i in range(n): + yield rng.choice(items) + + +resampled = pipelinefilter(resampled_) + + +def non_empty(src): + count = 0 + for s in src: + yield s + count += 1 + if count == 0: + raise ValueError("pipeline stage received no data at all and this was declared as an error") + + +@dataclass +class MSSource: + """Class representing a data source.""" + + name: str = "" + perepoch: int = -1 + resample: bool = False + urls: List[str] = field(default_factory=list) + + +default_rng = random.Random() + + +def expand(s): + return os.path.expanduser(os.path.expandvars(s)) + + +class MultiShardSample(IterableDataset): + def __init__(self, fname): + """Construct a shardlist from multiple sources using a YAML spec.""" + self.epoch = -1 +class MultiShardSample(IterableDataset): + def __init__(self, fname): + """Construct a shardlist from multiple sources using a YAML spec.""" + self.epoch = -1 + self.parse_spec(fname) + + def parse_spec(self, fname): + self.rng = default_rng # capture default_rng if we fork + if isinstance(fname, dict): + spec = fname + fname = "{dict}" + else: + with open(fname) as stream: + spec = yaml.safe_load(stream) + assert set(spec.keys()).issubset(set("prefix datasets buckets".split())), list(spec.keys()) + prefix = expand(spec.get("prefix", "")) + self.sources = [] + for ds in spec["datasets"]: + assert set(ds.keys()).issubset(set("buckets name shards resample choose".split())), list( + ds.keys() + ) + buckets = ds.get("buckets", spec.get("buckets", [])) + if isinstance(buckets, str): + buckets = [buckets] + buckets = [expand(s) for s in buckets] + if buckets == []: + buckets = [""] + assert len(buckets) == 1, f"{buckets}: FIXME support for multiple buckets unimplemented" + bucket = buckets[0] + name = ds.get("name", "@" + bucket) + urls = ds["shards"] + if isinstance(urls, str): + urls = [urls] + # urls = [u for url in urls for u in 
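# --- Editor's note (not part of the patch): split_by_node and split_by_worker
# above simply stride through the shard stream, so every (rank, worker) pair
# reads a disjoint subset of shards.  For example, with 4 ranks, rank 1 sees
# shards 1, 5, 9, ... and each DataLoader worker then strides again over what
# its rank received.
from itertools import islice
shards = list(range(12))
rank, world_size = 1, 4
assert list(islice(shards, rank, None, world_size)) == [1, 5, 9]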
braceexpand.braceexpand(url)] + urls = [ + prefix + os.path.join(bucket, u) for url in urls for u in braceexpand.braceexpand(expand(url)) + ] + resample = ds.get("resample", -1) + nsample = ds.get("choose", -1) + if nsample > len(urls): + raise ValueError(f"perepoch {nsample} must be no greater than the number of shards") + if (nsample > 0) and (resample > 0): + raise ValueError("specify only one of perepoch or choose") + entry = MSSource(name=name, urls=urls, perepoch=nsample, resample=resample) + self.sources.append(entry) + print(f"# {name} {len(urls)} {nsample}", file=sys.stderr) + + def set_epoch(self, seed): + """Set the current epoch (for consistent shard selection among nodes).""" + self.rng = random.Random(seed) + + def get_shards_for_epoch(self): + result = [] + for source in self.sources: + if source.resample > 0: + # sample with replacement + l = self.rng.choices(source.urls, k=source.resample) + elif source.perepoch > 0: + # sample without replacement + l = list(source.urls) + self.rng.shuffle(l) + l = l[: source.perepoch] + else: + l = list(source.urls) + result += l + self.rng.shuffle(result) + return result + + def __iter__(self): + shards = self.get_shards_for_epoch() + for shard in shards: + yield dict(url=shard) + + +def shardspec(spec): + if spec.endswith(".yaml"): + return MultiShardSample(spec) + else: + return SimpleShardList(spec) + + +class ResampledShards(IterableDataset): + """An iterable dataset yielding a list of urls.""" + + def __init__( + self, + urls, + nshards=sys.maxsize, + worker_seed=None, + deterministic=False, + ): + """Sample shards from the shard list with replacement. + + :param urls: a list of URLs as a Python list or brace notation string + """ + super().__init__() + urls = expand_urls(urls) + self.urls = urls + assert isinstance(self.urls[0], str) + self.nshards = nshards + self.worker_seed = utils.paddle_worker_seed if worker_seed is None else worker_seed + self.deterministic = deterministic + self.epoch = -1 + + def __iter__(self): + """Return an iterator over the shards.""" + self.epoch += 1 + if self.deterministic: + seed = utils.make_seed(self.worker_seed(), self.epoch) + else: + seed = utils.make_seed(self.worker_seed(), self.epoch, os.getpid(), time.time_ns(), os.urandom(4)) + if os.environ.get("WDS_SHOW_SEED", "0") == "1": + print(f"# ResampledShards seed {seed}") + self.rng = random.Random(seed) + for _ in range(self.nshards): + index = self.rng.randint(0, len(self.urls) - 1) + yield dict(url=self.urls[index]) diff --git a/paddlespeech/audio/streamdata/tariterators.py b/paddlespeech/audio/streamdata/tariterators.py new file mode 100644 index 000000000..b1616918c --- /dev/null +++ b/paddlespeech/audio/streamdata/tariterators.py @@ -0,0 +1,283 @@ +# +# Copyright (c) 2017-2021 NVIDIA CORPORATION. All rights reserved. +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# This file is part of the WebDataset library. +# See the LICENSE file for licensing terms (BSD-style). + +# Modified from https://github.com/webdataset/webdataset +# Modified from wenet(https://github.com/wenet-e2e/wenet) + +"""Low level iteration functions for tar archives.""" + +import random, re, tarfile + +import braceexpand + +from . import filters +from . 
import gopen +from .handlers import reraise_exception + +trace = False +meta_prefix = "__" +meta_suffix = "__" + +import paddlespeech +import paddle +import numpy as np + +AUDIO_FORMAT_SETS = set(['flac', 'mp3', 'm4a', 'ogg', 'opus', 'wav', 'wma']) + +def base_plus_ext(path): + """Split off all file extensions. + + Returns base, allext. + + :param path: path with extensions + :param returns: path with all extensions removed + + """ + match = re.match(r"^((?:.*/|)[^.]+)[.]([^/]*)$", path) + if not match: + return None, None + return match.group(1), match.group(2) + + +def valid_sample(sample): + """Check whether a sample is valid. + + :param sample: sample to be checked + """ + return ( + sample is not None + and isinstance(sample, dict) + and len(list(sample.keys())) > 0 + and not sample.get("__bad__", False) + ) + + +# FIXME: UNUSED +def shardlist(urls, *, shuffle=False): + """Given a list of URLs, yields that list, possibly shuffled.""" + if isinstance(urls, str): + urls = braceexpand.braceexpand(urls) + else: + urls = list(urls) + if shuffle: + random.shuffle(urls) + for url in urls: + yield dict(url=url) + + +def url_opener(data, handler=reraise_exception, **kw): + """Given a stream of url names (packaged in `dict(url=url)`), yield opened streams.""" + for sample in data: + assert isinstance(sample, dict), sample + assert "url" in sample + url = sample["url"] + try: + stream = gopen.gopen(url, **kw) + sample.update(stream=stream) + yield sample + except Exception as exn: + exn.args = exn.args + (url,) + if handler(exn): + continue + else: + break + + +def tar_file_iterator( + fileobj, skip_meta=r"__[^/]*__($|/)", handler=reraise_exception +): + """Iterate over tar file, yielding filename, content pairs for the given tar stream. + + :param fileobj: byte stream suitable for tarfile + :param skip_meta: regexp for keys that are skipped entirely (Default value = r"__[^/]*__($|/)") + + """ + stream = tarfile.open(fileobj=fileobj, mode="r:*") + for tarinfo in stream: + fname = tarinfo.name + try: + if not tarinfo.isreg(): + continue + if fname is None: + continue + if ( + "/" not in fname + and fname.startswith(meta_prefix) + and fname.endswith(meta_suffix) + ): + # skipping metadata for now + continue + if skip_meta is not None and re.match(skip_meta, fname): + continue + + name = tarinfo.name + pos = name.rfind('.') + assert pos > 0 + prefix, postfix = name[:pos], name[pos + 1:] + if postfix == 'wav': + waveform, sample_rate = paddlespeech.audio.load(stream.extractfile(tarinfo), normal=False) + result = dict(fname=prefix, wav=waveform, sample_rate = sample_rate) + else: + txt = stream.extractfile(tarinfo).read().decode('utf8').strip() + result = dict(fname=prefix, txt=txt) + #result = dict(fname=fname, data=data) + yield result + stream.members = [] + except Exception as exn: + if hasattr(exn, "args") and len(exn.args) > 0: + exn.args = (exn.args[0] + " @ " + str(fileobj),) + exn.args[1:] + if handler(exn): + continue + else: + break + del stream + +def tar_file_and_group_iterator( + fileobj, skip_meta=r"__[^/]*__($|/)", handler=reraise_exception +): + """ Expand a stream of open tar files into a stream of tar file contents. 
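# --- Editor's note: a small check (not part of the patch) of the key splitting
# used when grouping tar members: everything up to the last "/" and first "." in
# the basename is the sample key, the rest is the extension chain.  The utterance
# id below is illustrative.
assert base_plus_ext("data/BAC009S0002W0122.wav") == ("data/BAC009S0002W0122", "wav")
assert base_plus_ext("noext") == (None, None)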
+ And groups the file with same prefix + + Args: + data: Iterable[{src, stream}] + + Returns: + Iterable[{key, wav, txt, sample_rate}] + """ + stream = tarfile.open(fileobj=fileobj, mode="r:*") + prev_prefix = None + example = {} + valid = True + for tarinfo in stream: + name = tarinfo.name + pos = name.rfind('.') + assert pos > 0 + prefix, postfix = name[:pos], name[pos + 1:] + if prev_prefix is not None and prefix != prev_prefix: + example['fname'] = prev_prefix + if valid: + yield example + example = {} + valid = True + with stream.extractfile(tarinfo) as file_obj: + try: + if postfix == 'txt': + example['txt'] = file_obj.read().decode('utf8').strip() + elif postfix in AUDIO_FORMAT_SETS: + waveform, sample_rate = paddlespeech.audio.load(file_obj, normal=False) + waveform = paddle.to_tensor(np.expand_dims(np.array(waveform),0), dtype=paddle.float32) + + example['wav'] = waveform + example['sample_rate'] = sample_rate + else: + example[postfix] = file_obj.read() + except Exception as exn: + if hasattr(exn, "args") and len(exn.args) > 0: + exn.args = (exn.args[0] + " @ " + str(fileobj),) + exn.args[1:] + if handler(exn): + continue + else: + break + valid = False + # logging.warning('error to parse {}'.format(name)) + prev_prefix = prefix + if prev_prefix is not None: + example['fname'] = prev_prefix + yield example + stream.close() + +def tar_file_expander(data, handler=reraise_exception): + """Expand a stream of open tar files into a stream of tar file contents. + + This returns an iterator over (filename, file_contents). + """ + for source in data: + url = source["url"] + try: + assert isinstance(source, dict) + assert "stream" in source + for sample in tar_file_iterator(source["stream"]): + assert ( + isinstance(sample, dict) and "data" in sample and "fname" in sample + ) + sample["__url__"] = url + yield sample + except Exception as exn: + exn.args = exn.args + (source.get("stream"), source.get("url")) + if handler(exn): + continue + else: + break + + + + +def tar_file_and_group_expander(data, handler=reraise_exception): + """Expand a stream of open tar files into a stream of tar file contents. + + This returns an iterator over (filename, file_contents). + """ + for source in data: + url = source["url"] + try: + assert isinstance(source, dict) + assert "stream" in source + for sample in tar_file_and_group_iterator(source["stream"]): + assert ( + isinstance(sample, dict) and "wav" in sample and "txt" in sample and "fname" in sample + ) + sample["__url__"] = url + yield sample + except Exception as exn: + exn.args = exn.args + (source.get("stream"), source.get("url")) + if handler(exn): + continue + else: + break + + +def group_by_keys(data, keys=base_plus_ext, lcase=True, suffixes=None, handler=None): + """Return function over iterator that groups key, value pairs into samples. 
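# --- Editor's note (not part of the patch): the grouping iterator above assumes
# each utterance appears in the tar as adjacent members such as "<utt_id>.wav"
# plus "<utt_id>.txt"; when the prefix changes, one sample is emitted roughly as
#   {"fname": "<utt_id>", "wav": Tensor[1, num_samples], "sample_rate": 16000, "txt": "..."}
# Members that fail to load mark the sample invalid and it is skipped.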
+ + :param keys: function that splits the key into key and extension (base_plus_ext) + :param lcase: convert suffixes to lower case (Default value = True) + """ + current_sample = None + for filesample in data: + assert isinstance(filesample, dict) + fname, value = filesample["fname"], filesample["data"] + prefix, suffix = keys(fname) + if trace: + print( + prefix, + suffix, + current_sample.keys() if isinstance(current_sample, dict) else None, + ) + if prefix is None: + continue + if lcase: + suffix = suffix.lower() + if current_sample is None or prefix != current_sample["__key__"]: + if valid_sample(current_sample): + yield current_sample + current_sample = dict(__key__=prefix, __url__=filesample["__url__"]) + if suffix in current_sample: + raise ValueError( + f"{fname}: duplicate file name in tar file {suffix} {current_sample.keys()}" + ) + if suffixes is None or suffix in suffixes: + current_sample[suffix] = value + if valid_sample(current_sample): + yield current_sample + + +def tarfile_samples(src, handler=reraise_exception): + streams = url_opener(src, handler=handler) + samples = tar_file_and_group_expander(streams, handler=handler) + return samples + + +tarfile_to_samples = filters.pipelinefilter(tarfile_samples) diff --git a/paddlespeech/audio/streamdata/utils.py b/paddlespeech/audio/streamdata/utils.py new file mode 100644 index 000000000..c7294f2bf --- /dev/null +++ b/paddlespeech/audio/streamdata/utils.py @@ -0,0 +1,132 @@ +# +# Copyright (c) 2017-2021 NVIDIA CORPORATION. All rights reserved. +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# This file is part of the WebDataset library. +# See the LICENSE file for licensing terms (BSD-style). +# + +# Modified from https://github.com/webdataset/webdataset + +"""Miscellaneous utility functions.""" + +import importlib +import itertools as itt +import os +import re +import sys +from typing import Any, Callable, Iterator, Optional, Union + +from ..utils.log import Logger + +logger = Logger(__name__) + +def make_seed(*args): + seed = 0 + for arg in args: + seed = (seed * 31 + hash(arg)) & 0x7FFFFFFF + return seed + + +class PipelineStage: + def invoke(self, *args, **kw): + raise NotImplementedError + + +def identity(x: Any) -> Any: + """Return the argument as is.""" + return x + + +def safe_eval(s: str, expr: str = "{}"): + """Evaluate the given expression more safely.""" + if re.sub("[^A-Za-z0-9_]", "", s) != s: + raise ValueError(f"safe_eval: illegal characters in: '{s}'") + return eval(expr.format(s)) + + +def lookup_sym(sym: str, modules: list): + """Look up a symbol in a list of modules.""" + for mname in modules: + module = importlib.import_module(mname, package="webdataset") + result = getattr(module, sym, None) + if result is not None: + return result + return None + + +def repeatedly0( + loader: Iterator, nepochs: int = sys.maxsize, nbatches: int = sys.maxsize +): + """Repeatedly returns batches from a DataLoader.""" + for epoch in range(nepochs): + for sample in itt.islice(loader, nbatches): + yield sample + + +def guess_batchsize(batch: Union[tuple, list]): + """Guess the batch size by looking at the length of the first element in a tuple.""" + return len(batch[0]) + + +def repeatedly( + source: Iterator, + nepochs: int = None, + nbatches: int = None, + nsamples: int = None, + batchsize: Callable[..., int] = guess_batchsize, +): + """Repeatedly yield samples from an iterator.""" + epoch = 0 + batch = 0 + total = 0 + while True: + for sample in source: + yield sample + batch += 1 + if nbatches is not None 
and batch >= nbatches: + return + if nsamples is not None: + total += guess_batchsize(sample) + if total >= nsamples: + return + epoch += 1 + if nepochs is not None and epoch >= nepochs: + return + +def paddle_worker_info(group=None): + """Return node and worker info for PyTorch and some distributed environments.""" + rank = 0 + world_size = 1 + worker = 0 + num_workers = 1 + if "RANK" in os.environ and "WORLD_SIZE" in os.environ: + rank = int(os.environ["RANK"]) + world_size = int(os.environ["WORLD_SIZE"]) + else: + try: + import paddle.distributed + group = group or paddle.distributed.get_group() + rank = paddle.distributed.get_rank() + world_size = paddle.distributed.get_world_size() + except ModuleNotFoundError: + pass + if "WORKER" in os.environ and "NUM_WORKERS" in os.environ: + worker = int(os.environ["WORKER"]) + num_workers = int(os.environ["NUM_WORKERS"]) + else: + try: + from paddle.io import get_worker_info + worker_info = paddle.io.get_worker_info() + if worker_info is not None: + worker = worker_info.id + num_workers = worker_info.num_workers + except ModuleNotFoundError as E: + logger.info(f"not found {E}") + exit(-1) + + return rank, world_size, worker, num_workers + +def paddle_worker_seed(group=None): + """Compute a distinct, deterministic RNG seed for each worker and node.""" + rank, world_size, worker, num_workers = paddle_worker_info(group=group) + return rank * 1000 + worker diff --git a/paddlespeech/audio/streamdata/writer.py b/paddlespeech/audio/streamdata/writer.py new file mode 100644 index 000000000..7d4f7703b --- /dev/null +++ b/paddlespeech/audio/streamdata/writer.py @@ -0,0 +1,450 @@ +# +# Copyright (c) 2017-2021 NVIDIA CORPORATION. All rights reserved. +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# This file is part of the WebDataset library. +# See the LICENSE file for licensing terms (BSD-style). +# Modified from https://github.com/webdataset/webdataset +# + +"""Classes and functions for writing tar files and WebDataset files.""" + +import io, json, pickle, re, tarfile, time +from typing import Any, Callable, Optional, Union + +import numpy as np + +from . import gopen + + +def imageencoder(image: Any, format: str = "PNG"): # skipcq: PYL-W0622 + """Compress an image using PIL and return it as a string. + + Can handle float or uint8 images. + + :param image: ndarray representing an image + :param format: compression format (PNG, JPEG, PPM) + + """ + import PIL + + assert isinstance(image, (PIL.Image.Image, np.ndarray)), type(image) + + if isinstance(image, np.ndarray): + if image.dtype in [np.dtype("f"), np.dtype("d")]: + if not (np.amin(image) > -0.001 and np.amax(image) < 1.001): + raise ValueError( + f"image values out of range {np.amin(image)} {np.amax(image)}" + ) + image = np.clip(image, 0.0, 1.0) + image = np.array(image * 255.0, "uint8") + assert image.ndim in [2, 3] + if image.ndim == 3: + assert image.shape[2] in [1, 3] + image = PIL.Image.fromarray(image) + if format.upper() == "JPG": + format = "JPEG" + elif format.upper() in ["IMG", "IMAGE"]: + format = "PPM" + if format == "JPEG": + opts = dict(quality=100) + else: + opts = {} + with io.BytesIO() as result: + image.save(result, format=format, **opts) + return result.getvalue() + + +def bytestr(data: Any): + """Convert data into a bytestring. + + Uses str and ASCII encoding for data that isn't already in string format. 
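# --- Editor's note: a hedged sketch (not part of the patch) of how
# paddle_worker_info() above resolves rank/worker.  Environment variables win
# over framework queries, which makes single-process debugging easy:
#
#   os.environ["RANK"], os.environ["WORLD_SIZE"] = "0", "1"
#   os.environ["WORKER"], os.environ["NUM_WORKERS"] = "0", "1"
#   rank, world_size, worker, num_workers = paddle_worker_info()
#   seed = paddle_worker_seed()     # rank * 1000 + worker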
+ + :param data: data + """ + if isinstance(data, bytes): + return data + if isinstance(data, str): + return data.encode("ascii") + return str(data).encode("ascii") + +def paddle_dumps(data: Any): + """Dump data into a bytestring using paddle.dumps. + + This delays importing paddle until needed. + + :param data: data to be dumped + """ + import io + + import paddle + + stream = io.BytesIO() + paddle.save(data, stream) + return stream.getvalue() + +def numpy_dumps(data: np.ndarray): + """Dump data into a bytestring using numpy npy format. + + :param data: data to be dumped + """ + import io + + import numpy.lib.format + + stream = io.BytesIO() + numpy.lib.format.write_array(stream, data) + return stream.getvalue() + + +def numpy_npz_dumps(data: np.ndarray): + """Dump data into a bytestring using numpy npz format. + + :param data: data to be dumped + """ + import io + + stream = io.BytesIO() + np.savez_compressed(stream, **data) + return stream.getvalue() + + +def tenbin_dumps(x): + from . import tenbin + + if isinstance(x, list): + return memoryview(tenbin.encode_buffer(x)) + else: + return memoryview(tenbin.encode_buffer([x])) + + +def cbor_dumps(x): + import cbor + + return cbor.dumps(x) + + +def mp_dumps(x): + import msgpack + + return msgpack.packb(x) + + +def add_handlers(d, keys, value): + if isinstance(keys, str): + keys = keys.split() + for k in keys: + d[k] = value + + +def make_handlers(): + """Create a list of handlers for encoding data.""" + handlers = {} + add_handlers( + handlers, "cls cls2 class count index inx id", lambda x: str(x).encode("ascii") + ) + add_handlers(handlers, "txt text transcript", lambda x: x.encode("utf-8")) + add_handlers(handlers, "html htm", lambda x: x.encode("utf-8")) + add_handlers(handlers, "pyd pickle", pickle.dumps) + add_handlers(handlers, "pdparams", paddle_dumps) + add_handlers(handlers, "npy", numpy_dumps) + add_handlers(handlers, "npz", numpy_npz_dumps) + add_handlers(handlers, "ten tenbin tb", tenbin_dumps) + add_handlers(handlers, "json jsn", lambda x: json.dumps(x).encode("utf-8")) + add_handlers(handlers, "mp msgpack msg", mp_dumps) + add_handlers(handlers, "cbor", cbor_dumps) + add_handlers(handlers, "jpg jpeg img image", lambda data: imageencoder(data, "jpg")) + add_handlers(handlers, "png", lambda data: imageencoder(data, "png")) + add_handlers(handlers, "pbm", lambda data: imageencoder(data, "pbm")) + add_handlers(handlers, "pgm", lambda data: imageencoder(data, "pgm")) + add_handlers(handlers, "ppm", lambda data: imageencoder(data, "ppm")) + return handlers + + +default_handlers = make_handlers() + + +def encode_based_on_extension1(data: Any, tname: str, handlers: dict): + """Encode data based on its extension and a dict of handlers. + + :param data: data + :param tname: file extension + :param handlers: handlers + """ + if tname[0] == "_": + if not isinstance(data, str): + raise ValueError("the values of metadata must be of string type") + return data + extension = re.sub(r".*\.", "", tname).lower() + if isinstance(data, bytes): + return data + if isinstance(data, str): + return data.encode("utf-8") + handler = handlers.get(extension) + if handler is None: + raise ValueError(f"no handler found for {extension}") + return handler(data) + + +def encode_based_on_extension(sample: dict, handlers: dict): + """Encode an entire sample with a collection of handlers. 
+ + :param sample: data sample (a dict) + :param handlers: handlers for encoding + """ + return { + k: encode_based_on_extension1(v, k, handlers) for k, v in list(sample.items()) + } + + +def make_encoder(spec: Union[bool, str, dict, Callable]): + """Make an encoder function from a specification. + + :param spec: specification + """ + if spec is False or spec is None: + + def encoder(x): + """Do not encode at all.""" + return x + + elif callable(spec): + encoder = spec + elif isinstance(spec, dict): + + def f(sample): + """Encode based on extension.""" + return encode_based_on_extension(sample, spec) + + encoder = f + + elif spec is True: + handlers = default_handlers + + def g(sample): + """Encode based on extension.""" + return encode_based_on_extension(sample, handlers) + + encoder = g + + else: + raise ValueError(f"{spec}: unknown decoder spec") + if not callable(encoder): + raise ValueError(f"{spec} did not yield a callable encoder") + return encoder + + +class TarWriter: + """A class for writing dictionaries to tar files. + + :param fileobj: fileobj: file name for tar file (.tgz/.tar) or open file descriptor + :param encoder: sample encoding (Default value = True) + :param compress: (Default value = None) + + `True` will use an encoder that behaves similar to the automatic + decoder for `Dataset`. `False` disables encoding and expects byte strings + (except for metadata, which must be strings). The `encoder` argument can + also be a `callable`, or a dictionary mapping extensions to encoders. + + The following code will add two file to the tar archive: `a/b.png` and + `a/b.output.png`. + + ```Python + tarwriter = TarWriter(stream) + image = imread("b.jpg") + image2 = imread("b.out.jpg") + sample = {"__key__": "a/b", "png": image, "output.png": image2} + tarwriter.write(sample) + ``` + """ + + def __init__( + self, + fileobj, + user: str = "bigdata", + group: str = "bigdata", + mode: int = 0o0444, + compress: Optional[bool] = None, + encoder: Union[None, bool, Callable] = True, + keep_meta: bool = False, + ): + """Create a tar writer. + + :param fileobj: stream to write data to + :param user: user for tar files + :param group: group for tar files + :param mode: mode for tar files + :param compress: desired compression + :param encoder: encoder function + :param keep_meta: keep metadata (entries starting with "_") + """ + if isinstance(fileobj, str): + if compress is False: + tarmode = "w|" + elif compress is True: + tarmode = "w|gz" + else: + tarmode = "w|gz" if fileobj.endswith("gz") else "w|" + fileobj = gopen.gopen(fileobj, "wb") + self.own_fileobj = fileobj + else: + tarmode = "w|gz" if compress is True else "w|" + self.own_fileobj = None + self.encoder = make_encoder(encoder) + self.keep_meta = keep_meta + self.stream = fileobj + self.tarstream = tarfile.open(fileobj=fileobj, mode=tarmode) + + self.user = user + self.group = group + self.mode = mode + self.compress = compress + + def __enter__(self): + """Enter context.""" + return self + + def __exit__(self, exc_type, exc_val, exc_tb): + """Exit context.""" + self.close() + + def close(self): + """Close the tar file.""" + self.tarstream.close() + if self.own_fileobj is not None: + self.own_fileobj.close() + self.own_fileobj = None + + def write(self, obj): + """Write a dictionary to the tar file. 
+ + :param obj: dictionary of objects to be stored + :returns: size of the entry + + """ + total = 0 + obj = self.encoder(obj) + if "__key__" not in obj: + raise ValueError("object must contain a __key__") + for k, v in list(obj.items()): + if k[0] == "_": + continue + if not isinstance(v, (bytes, bytearray, memoryview)): + raise ValueError( + f"{k} doesn't map to a bytes after encoding ({type(v)})" + ) + key = obj["__key__"] + for k in sorted(obj.keys()): + if k == "__key__": + continue + if not self.keep_meta and k[0] == "_": + continue + v = obj[k] + if isinstance(v, str): + v = v.encode("utf-8") + now = time.time() + ti = tarfile.TarInfo(key + "." + k) + ti.size = len(v) + ti.mtime = now + ti.mode = self.mode + ti.uname = self.user + ti.gname = self.group + if not isinstance(v, (bytes, bytearray, memoryview)): + raise ValueError(f"converter didn't yield bytes: {k}, {type(v)}") + stream = io.BytesIO(v) + self.tarstream.addfile(ti, stream) + total += ti.size + return total + + +class ShardWriter: + """Like TarWriter but splits into multiple shards.""" + + def __init__( + self, + pattern: str, + maxcount: int = 100000, + maxsize: float = 3e9, + post: Optional[Callable] = None, + start_shard: int = 0, + **kw, + ): + """Create a ShardWriter. + + :param pattern: output file pattern + :param maxcount: maximum number of records per shard (Default value = 100000) + :param maxsize: maximum size of each shard (Default value = 3e9) + :param kw: other options passed to TarWriter + """ + self.verbose = 1 + self.kw = kw + self.maxcount = maxcount + self.maxsize = maxsize + self.post = post + + self.tarstream = None + self.shard = start_shard + self.pattern = pattern + self.total = 0 + self.count = 0 + self.size = 0 + self.fname = None + self.next_stream() + + def next_stream(self): + """Close the current stream and move to the next.""" + self.finish() + self.fname = self.pattern % self.shard + if self.verbose: + print( + "# writing", + self.fname, + self.count, + "%.1f GB" % (self.size / 1e9), + self.total, + ) + self.shard += 1 + stream = open(self.fname, "wb") + self.tarstream = TarWriter(stream, **self.kw) + self.count = 0 + self.size = 0 + + def write(self, obj): + """Write a sample. + + :param obj: sample to be written + """ + if ( + self.tarstream is None + or self.count >= self.maxcount + or self.size >= self.maxsize + ): + self.next_stream() + size = self.tarstream.write(obj) + self.count += 1 + self.total += 1 + self.size += size + + def finish(self): + """Finish all writing (use close instead).""" + if self.tarstream is not None: + self.tarstream.close() + assert self.fname is not None + if callable(self.post): + self.post(self.fname) + self.tarstream = None + + def close(self): + """Close the stream.""" + self.finish() + del self.tarstream + del self.shard + del self.count + del self.size + + def __enter__(self): + """Enter context.""" + return self + + def __exit__(self, *args, **kw): + """Exit context.""" + self.close() diff --git a/paddlespeech/s2t/transform/__init__.py b/paddlespeech/audio/text/__init__.py similarity index 100% rename from paddlespeech/s2t/transform/__init__.py rename to paddlespeech/audio/text/__init__.py diff --git a/paddlespeech/audio/text/text_featurizer.py b/paddlespeech/audio/text/text_featurizer.py new file mode 100644 index 000000000..91c4d75c3 --- /dev/null +++ b/paddlespeech/audio/text/text_featurizer.py @@ -0,0 +1,235 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. 
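# --- Editor's note: a usage sketch (not part of the patch) for the writers
# above.  ShardWriter rolls over to a new tar after `maxcount` samples or
# `maxsize` bytes; the keys, pattern, and data below are illustrative.  With the
# default encoder, values that are already bytes are written through unchanged
# and str values are UTF-8 encoded.
#
#   with ShardWriter("shards/train-%06d.tar", maxcount=1000) as sink:
#       sink.write({"__key__": "utt0001",
#                   "wav": wav_bytes,     # raw bytes
#                   "txt": "ni hao"})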
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Contains the text featurizer class.""" +from pprint import pformat +from typing import Union + +import sentencepiece as spm + +from .utility import BLANK +from .utility import EOS +from .utility import load_dict +from .utility import MASKCTC +from .utility import SOS +from .utility import SPACE +from .utility import UNK +from ..utils.log import Logger + +logger = Logger(__name__) + +__all__ = ["TextFeaturizer"] + + +class TextFeaturizer(): + def __init__(self, unit_type, vocab, spm_model_prefix=None, maskctc=False): + """Text featurizer, for processing or extracting features from text. + + Currently, it supports char/word/sentence-piece level tokenizing and conversion into + a list of token indices. Note that the token indexing order follows the + given vocabulary file. + + Args: + unit_type (str): unit type, e.g. char, word, spm + vocab Option[str, list]: Filepath to load vocabulary for token indices conversion, or vocab list. + spm_model_prefix (str, optional): spm model prefix. Defaults to None. + """ + assert unit_type in ('char', 'spm', 'word') + self.unit_type = unit_type + self.unk = UNK + self.maskctc = maskctc + + if vocab: + self.vocab_dict, self._id2token, self.vocab_list, self.unk_id, self.eos_id, self.blank_id = self._load_vocabulary_from_file( + vocab, maskctc) + self.vocab_size = len(self.vocab_list) + else: + logger.warning("TextFeaturizer: not have vocab file or vocab list.") + + if unit_type == 'spm': + spm_model = spm_model_prefix + '.model' + self.sp = spm.SentencePieceProcessor() + self.sp.Load(spm_model) + + def tokenize(self, text, replace_space=True): + if self.unit_type == 'char': + tokens = self.char_tokenize(text, replace_space) + elif self.unit_type == 'word': + tokens = self.word_tokenize(text) + else: # spm + tokens = self.spm_tokenize(text) + return tokens + + def detokenize(self, tokens): + if self.unit_type == 'char': + text = self.char_detokenize(tokens) + elif self.unit_type == 'word': + text = self.word_detokenize(tokens) + else: # spm + text = self.spm_detokenize(tokens) + return text + + def featurize(self, text): + """Convert text string to a list of token indices. + + Args: + text (str): Text to process. + + Returns: + List[int]: List of token indices. + """ + tokens = self.tokenize(text) + ids = [] + for token in tokens: + if token not in self.vocab_dict: + logger.debug(f"Text Token: {token} -> {self.unk}") + token = self.unk + ids.append(self.vocab_dict[token]) + return ids + + def defeaturize(self, idxs): + """Convert a list of token indices to text string, + ignore index after eos_id. + + Args: + idxs (List[int]): List of token indices. + + Returns: + str: Text. + """ + tokens = [] + for idx in idxs: + if idx == self.eos_id: + break + tokens.append(self._id2token[idx]) + text = self.detokenize(tokens) + return text + + def char_tokenize(self, text, replace_space=True): + """Character tokenizer. + + Args: + text (str): text string. + replace_space (bool): False only used by build_vocab.py. 
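# --- Editor's note: a hedged usage sketch (not part of the patch) for the
# featurizer above.  The vocab path is illustrative only:
#
#   featurizer = TextFeaturizer(unit_type='char',
#                               vocab='data/lang_char/vocab.txt')
#   ids = featurizer.featurize("ni hao")   # text -> token ids (unknowns map to <unk>)
#   txt = featurizer.defeaturize(ids)      # ids  -> text, stops at the eos id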
+ + Returns: + List[str]: tokens. + """ + text = text.strip() + if replace_space: + text_list = [SPACE if item == " " else item for item in list(text)] + else: + text_list = list(text) + return text_list + + def char_detokenize(self, tokens): + """Character detokenizer. + + Args: + tokens (List[str]): tokens. + + Returns: + str: text string. + """ + tokens = [t.replace(SPACE, " ") for t in tokens] + return "".join(tokens) + + def word_tokenize(self, text): + """Word tokenizer, separate by .""" + return text.strip().split() + + def word_detokenize(self, tokens): + """Word detokenizer, separate by .""" + return " ".join(tokens) + + def spm_tokenize(self, text): + """spm tokenize. + + Args: + text (str): text string. + + Returns: + List[str]: sentence pieces str code + """ + stats = {"num_empty": 0, "num_filtered": 0} + + def valid(line): + return True + + def encode(l): + return self.sp.EncodeAsPieces(l) + + def encode_line(line): + line = line.strip() + if len(line) > 0: + line = encode(line) + if valid(line): + return line + else: + stats["num_filtered"] += 1 + else: + stats["num_empty"] += 1 + return None + + enc_line = encode_line(text) + return enc_line + + def spm_detokenize(self, tokens, input_format='piece'): + """spm detokenize. + + Args: + ids (List[str]): tokens. + + Returns: + str: text + """ + if input_format == "piece": + + def decode(l): + return "".join(self.sp.DecodePieces(l)) + elif input_format == "id": + + def decode(l): + return "".join(self.sp.DecodeIds(l)) + + return decode(tokens) + + def _load_vocabulary_from_file(self, vocab: Union[str, list], + maskctc: bool): + """Load vocabulary from file.""" + if isinstance(vocab, list): + vocab_list = vocab + else: + vocab_list = load_dict(vocab, maskctc) + assert vocab_list is not None + logger.debug(f"Vocab: {pformat(vocab_list)}") + + id2token = dict( + [(idx, token) for (idx, token) in enumerate(vocab_list)]) + token2id = dict( + [(token, idx) for (idx, token) in enumerate(vocab_list)]) + + blank_id = vocab_list.index(BLANK) if BLANK in vocab_list else -1 + maskctc_id = vocab_list.index(MASKCTC) if MASKCTC in vocab_list else -1 + unk_id = vocab_list.index(UNK) if UNK in vocab_list else -1 + eos_id = vocab_list.index(EOS) if EOS in vocab_list else -1 + sos_id = vocab_list.index(SOS) if SOS in vocab_list else -1 + space_id = vocab_list.index(SPACE) if SPACE in vocab_list else -1 + + logger.info(f"BLANK id: {blank_id}") + logger.info(f"UNK id: {unk_id}") + logger.info(f"EOS id: {eos_id}") + logger.info(f"SOS id: {sos_id}") + logger.info(f"SPACE id: {space_id}") + logger.info(f"MASKCTC id: {maskctc_id}") + return token2id, id2token, vocab_list, unk_id, eos_id, blank_id diff --git a/paddlespeech/audio/text/utility.py b/paddlespeech/audio/text/utility.py new file mode 100644 index 000000000..d35785db6 --- /dev/null +++ b/paddlespeech/audio/text/utility.py @@ -0,0 +1,393 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
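# ---------------------------------------------------------------------------
# Editor's note -- illustrative sketch, not part of the diff above: a rough
# usage of the TextFeaturizer added in paddlespeech/audio/text/text_featurizer.py,
# with a small in-memory vocab list. BLANK/SPACE/UNK/EOS are the special marker
# tokens defined in paddlespeech/audio/text/utility.py.
from paddlespeech.audio.text.text_featurizer import TextFeaturizer
from paddlespeech.audio.text.utility import BLANK, EOS, SPACE, UNK

vocab = [BLANK, SPACE, UNK, "h", "e", "l", "o", EOS]
featurizer = TextFeaturizer(unit_type="char", vocab=vocab)

ids = featurizer.featurize("he llo")   # spaces become the SPACE marker token
text = featurizer.defeaturize(ids)     # stops at EOS, then detokenizes back
print(ids, text)
# ---------------------------------------------------------------------------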
+"""Contains data helper functions.""" +import json +import math +import tarfile +from collections import namedtuple +from typing import List +from typing import Optional +from typing import Text + +import jsonlines +import numpy as np + +from paddlespeech.s2t.utils.log import Log + +logger = Log(__name__).getlog() + +__all__ = [ + "load_dict", "load_cmvn", "read_manifest", "rms_to_db", "rms_to_dbfs", + "max_dbfs", "mean_dbfs", "gain_db_to_ratio", "normalize_audio", "SOS", + "EOS", "UNK", "BLANK", "MASKCTC", "SPACE", "convert_samples_to_float32", + "convert_samples_from_float32" +] + +IGNORE_ID = -1 +# `sos` and `eos` using same token +SOS = "" +EOS = SOS +UNK = "" +BLANK = "" +MASKCTC = "" +SPACE = "" + + +def load_dict(dict_path: Optional[Text], maskctc=False) -> Optional[List[Text]]: + if dict_path is None: + return None + + with open(dict_path, "r") as f: + dictionary = f.readlines() + # first token is `` + # multi line: ` 0\n` + # one line: `` + # space is relpace with + char_list = [entry[:-1].split(" ")[0] for entry in dictionary] + if BLANK not in char_list: + char_list.insert(0, BLANK) + if EOS not in char_list: + char_list.append(EOS) + # for non-autoregressive maskctc model + if maskctc and MASKCTC not in char_list: + char_list.append(MASKCTC) + return char_list + + +def read_manifest( + manifest_path, + max_input_len=float('inf'), + min_input_len=0.0, + max_output_len=float('inf'), + min_output_len=0.0, + max_output_input_ratio=float('inf'), + min_output_input_ratio=0.0, ): + """Load and parse manifest file. + + Args: + manifest_path ([type]): Manifest file to load and parse. + max_input_len ([type], optional): maximum output seq length, + in seconds for raw wav, in frame numbers for feature data. + Defaults to float('inf'). + min_input_len (float, optional): minimum input seq length, + in seconds for raw wav, in frame numbers for feature data. + Defaults to 0.0. + max_output_len (float, optional): maximum input seq length, + in modeling units. Defaults to 500.0. + min_output_len (float, optional): minimum input seq length, + in modeling units. Defaults to 0.0. + max_output_input_ratio (float, optional): + maximum output seq length/output seq length ratio. Defaults to 10.0. + min_output_input_ratio (float, optional): + minimum output seq length/output seq length ratio. Defaults to 0.05. + + Raises: + IOError: If failed to parse the manifest. + + Returns: + List[dict]: Manifest parsing results. + """ + manifest = [] + with jsonlines.open(manifest_path, 'r') as reader: + for json_data in reader: + feat_len = json_data["input"][0]["shape"][ + 0] if "input" in json_data and "shape" in json_data["input"][ + 0] else 1.0 + token_len = json_data["output"][0]["shape"][ + 0] if "output" in json_data and "shape" in json_data["output"][ + 0] else 1.0 + conditions = [ + feat_len >= min_input_len, + feat_len <= max_input_len, + token_len >= min_output_len, + token_len <= max_output_len, + token_len / feat_len >= min_output_input_ratio, + token_len / feat_len <= max_output_input_ratio, + ] + if all(conditions): + manifest.append(json_data) + return manifest + + +# Tar File read +TarLocalData = namedtuple('TarLocalData', ['tar2info', 'tar2object']) + + +def parse_tar(file): + """Parse a tar file to get a tarfile object + and a map containing tarinfoes + """ + result = {} + f = tarfile.open(file) + for tarinfo in f.getmembers(): + result[tarinfo.name] = tarinfo + return f, result + + +def subfile_from_tar(file, local_data=None): + """Get subfile object from tar. 
+ + tar:tarpath#filename + + It will return a subfile object from tar file + and cached tar file info for next reading request. + """ + tarpath, filename = file.split(':', 1)[1].split('#', 1) + + if local_data is None: + local_data = TarLocalData(tar2info={}, tar2object={}) + + assert isinstance(local_data, TarLocalData) + + if 'tar2info' not in local_data.__dict__: + local_data.tar2info = {} + if 'tar2object' not in local_data.__dict__: + local_data.tar2object = {} + + if tarpath not in local_data.tar2info: + fobj, infos = parse_tar(tarpath) + local_data.tar2info[tarpath] = infos + local_data.tar2object[tarpath] = fobj + else: + fobj = local_data.tar2object[tarpath] + infos = local_data.tar2info[tarpath] + return fobj.extractfile(infos[filename]) + + +def rms_to_db(rms: float): + """Root Mean Square to dB. + + Args: + rms ([float]): root mean square + + Returns: + float: dB + """ + return 20.0 * math.log10(max(1e-16, rms)) + + +def rms_to_dbfs(rms: float): + """Root Mean Square to dBFS. + https://fireattack.wordpress.com/2017/02/06/replaygain-loudness-normalization-and-applications/ + Audio is mix of sine wave, so 1 amp sine wave's Full scale is 0.7071, equal to -3.0103dB. + + dB = dBFS + 3.0103 + dBFS = db - 3.0103 + e.g. 0 dB = -3.0103 dBFS + + Args: + rms ([float]): root mean square + + Returns: + float: dBFS + """ + return rms_to_db(rms) - 3.0103 + + +def max_dbfs(sample_data: np.ndarray): + """Peak dBFS based on the maximum energy sample. + + Args: + sample_data ([np.ndarray]): float array, [-1, 1]. + + Returns: + float: dBFS + """ + # Peak dBFS based on the maximum energy sample. Will prevent overdrive if used for normalization. + return rms_to_dbfs(max(abs(np.min(sample_data)), abs(np.max(sample_data)))) + + +def mean_dbfs(sample_data): + """Peak dBFS based on the RMS energy. + + Args: + sample_data ([np.ndarray]): float array, [-1, 1]. + + Returns: + float: dBFS + """ + return rms_to_dbfs( + math.sqrt(np.mean(np.square(sample_data, dtype=np.float64)))) + + +def gain_db_to_ratio(gain_db: float): + """dB to ratio + + Args: + gain_db (float): gain in dB + + Returns: + float: scale in amp + """ + return math.pow(10.0, gain_db / 20.0) + + +def normalize_audio(sample_data: np.ndarray, dbfs: float=-3.0103): + """Nomalize audio to dBFS. + + Args: + sample_data (np.ndarray): input wave samples, [-1, 1]. + dbfs (float, optional): target dBFS. Defaults to -3.0103. 
+ + Returns: + np.ndarray: normalized wave + """ + return np.maximum( + np.minimum(sample_data * gain_db_to_ratio(dbfs - max_dbfs(sample_data)), + 1.0), -1.0) + + +def _load_json_cmvn(json_cmvn_file): + """ Load the json format cmvn stats file and calculate cmvn + + Args: + json_cmvn_file: cmvn stats file in json format + + Returns: + a numpy array of [means, vars] + """ + with open(json_cmvn_file) as f: + cmvn_stats = json.load(f) + + means = cmvn_stats['mean_stat'] + variance = cmvn_stats['var_stat'] + count = cmvn_stats['frame_num'] + for i in range(len(means)): + means[i] /= count + variance[i] = variance[i] / count - means[i] * means[i] + if variance[i] < 1.0e-20: + variance[i] = 1.0e-20 + variance[i] = 1.0 / math.sqrt(variance[i]) + cmvn = np.array([means, variance]) + return cmvn + + +def _load_kaldi_cmvn(kaldi_cmvn_file): + """ Load the kaldi format cmvn stats file and calculate cmvn + + Args: + kaldi_cmvn_file: kaldi text style global cmvn file, which + is generated by: + compute-cmvn-stats --binary=false scp:feats.scp global_cmvn + + Returns: + a numpy array of [means, vars] + """ + means = [] + variance = [] + with open(kaldi_cmvn_file, 'r') as fid: + # kaldi binary file start with '\0B' + if fid.read(2) == '\0B': + logger.error('kaldi cmvn binary file is not supported, please ' + 'recompute it by: compute-cmvn-stats --binary=false ' + ' scp:feats.scp global_cmvn') + sys.exit(1) + fid.seek(0) + arr = fid.read().split() + assert (arr[0] == '[') + assert (arr[-2] == '0') + assert (arr[-1] == ']') + feat_dim = int((len(arr) - 2 - 2) / 2) + for i in range(1, feat_dim + 1): + means.append(float(arr[i])) + count = float(arr[feat_dim + 1]) + for i in range(feat_dim + 2, 2 * feat_dim + 2): + variance.append(float(arr[i])) + + for i in range(len(means)): + means[i] /= count + variance[i] = variance[i] / count - means[i] * means[i] + if variance[i] < 1.0e-20: + variance[i] = 1.0e-20 + variance[i] = 1.0 / math.sqrt(variance[i]) + cmvn = np.array([means, variance]) + return cmvn + + +def load_cmvn(cmvn_file: str, filetype: str): + """load cmvn from file. + + Args: + cmvn_file (str): cmvn path. + filetype (str): file type, optional[npz, json, kaldi]. + + Raises: + ValueError: file type not support. + + Returns: + Tuple[np.ndarray, np.ndarray]: mean, istd + """ + assert filetype in ['npz', 'json', 'kaldi'], filetype + filetype = filetype.lower() + if filetype == "json": + cmvn = _load_json_cmvn(cmvn_file) + elif filetype == "kaldi": + cmvn = _load_kaldi_cmvn(cmvn_file) + elif filetype == "npz": + eps = 1e-14 + npzfile = np.load(cmvn_file) + mean = np.squeeze(npzfile["mean"]) + std = np.squeeze(npzfile["std"]) + istd = 1 / (std + eps) + cmvn = [mean, istd] + else: + raise ValueError(f"cmvn file type no support: {filetype}") + return cmvn[0], cmvn[1] + + +def convert_samples_to_float32(samples): + """Convert sample type to float32. + + Audio sample type is usually integer or float-point. + Integers will be scaled to [-1, 1] in float32. + + PCM16 -> PCM32 + """ + float32_samples = samples.astype('float32') + if samples.dtype in np.sctypes['int']: + bits = np.iinfo(samples.dtype).bits + float32_samples *= (1. / 2**(bits - 1)) + elif samples.dtype in np.sctypes['float']: + pass + else: + raise TypeError("Unsupported sample type: %s." % samples.dtype) + return float32_samples + + +def convert_samples_from_float32(samples, dtype): + """Convert sample type from float32 to dtype. + + Audio sample type is usually integer or float-point. 
For integer + type, float32 will be rescaled from [-1, 1] to the maximum range + supported by the integer type. + + PCM32 -> PCM16 + """ + dtype = np.dtype(dtype) + output_samples = samples.copy() + if dtype in np.sctypes['int']: + bits = np.iinfo(dtype).bits + output_samples *= (2**(bits - 1) / 1.) + min_val = np.iinfo(dtype).min + max_val = np.iinfo(dtype).max + output_samples[output_samples > max_val] = max_val + output_samples[output_samples < min_val] = min_val + elif samples.dtype in np.sctypes['float']: + min_val = np.finfo(dtype).min + max_val = np.finfo(dtype).max + output_samples[output_samples > max_val] = max_val + output_samples[output_samples < min_val] = min_val + else: + raise TypeError("Unsupported sample type: %s." % samples.dtype) + return output_samples.astype(dtype) diff --git a/paddlespeech/audio/transform/__init__.py b/paddlespeech/audio/transform/__init__.py new file mode 100644 index 000000000..185a92b8d --- /dev/null +++ b/paddlespeech/audio/transform/__init__.py @@ -0,0 +1,13 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. diff --git a/paddlespeech/s2t/transform/add_deltas.py b/paddlespeech/audio/transform/add_deltas.py similarity index 100% rename from paddlespeech/s2t/transform/add_deltas.py rename to paddlespeech/audio/transform/add_deltas.py diff --git a/paddlespeech/s2t/transform/channel_selector.py b/paddlespeech/audio/transform/channel_selector.py similarity index 100% rename from paddlespeech/s2t/transform/channel_selector.py rename to paddlespeech/audio/transform/channel_selector.py diff --git a/paddlespeech/s2t/transform/cmvn.py b/paddlespeech/audio/transform/cmvn.py similarity index 100% rename from paddlespeech/s2t/transform/cmvn.py rename to paddlespeech/audio/transform/cmvn.py diff --git a/paddlespeech/s2t/transform/functional.py b/paddlespeech/audio/transform/functional.py similarity index 94% rename from paddlespeech/s2t/transform/functional.py rename to paddlespeech/audio/transform/functional.py index ccb500819..271819adb 100644 --- a/paddlespeech/s2t/transform/functional.py +++ b/paddlespeech/audio/transform/functional.py @@ -14,8 +14,8 @@ # Modified from espnet(https://github.com/espnet/espnet) import inspect -from paddlespeech.s2t.transform.transform_interface import TransformInterface -from paddlespeech.s2t.utils.check_kwargs import check_kwargs +from paddlespeech.audio.transform.transform_interface import TransformInterface +from paddlespeech.audio.utils.check_kwargs import check_kwargs class FuncTrans(TransformInterface): diff --git a/paddlespeech/s2t/transform/perturb.py b/paddlespeech/audio/transform/perturb.py similarity index 86% rename from paddlespeech/s2t/transform/perturb.py rename to paddlespeech/audio/transform/perturb.py index b18caefb8..8044dc36f 100644 --- a/paddlespeech/s2t/transform/perturb.py +++ b/paddlespeech/audio/transform/perturb.py @@ -17,8 +17,97 @@ import numpy import scipy import soundfile -from paddlespeech.s2t.io.reader import SoundHDF5File +import io 
+import os +import h5py +import numpy as np +class SoundHDF5File(): + """Collecting sound files to a HDF5 file + + >>> f = SoundHDF5File('a.flac.h5', mode='a') + >>> array = np.random.randint(0, 100, 100, dtype=np.int16) + >>> f['id'] = (array, 16000) + >>> array, rate = f['id'] + + + :param: str filepath: + :param: str mode: + :param: str format: The type used when saving wav. flac, nist, htk, etc. + :param: str dtype: + + """ + + def __init__(self, + filepath, + mode="r+", + format=None, + dtype="int16", + **kwargs): + self.filepath = filepath + self.mode = mode + self.dtype = dtype + + self.file = h5py.File(filepath, mode, **kwargs) + if format is None: + # filepath = a.flac.h5 -> format = flac + second_ext = os.path.splitext(os.path.splitext(filepath)[0])[1] + format = second_ext[1:] + if format.upper() not in soundfile.available_formats(): + # If not found, flac is selected + format = "flac" + + # This format affects only saving + self.format = format + + def __repr__(self): + return ''.format( + self.filepath, self.mode, self.format, self.dtype) + + def create_dataset(self, name, shape=None, data=None, **kwds): + f = io.BytesIO() + array, rate = data + soundfile.write(f, array, rate, format=self.format) + self.file.create_dataset( + name, shape=shape, data=np.void(f.getvalue()), **kwds) + + def __setitem__(self, name, data): + self.create_dataset(name, data=data) + + def __getitem__(self, key): + data = self.file[key][()] + f = io.BytesIO(data.tobytes()) + array, rate = soundfile.read(f, dtype=self.dtype) + return array, rate + + def keys(self): + return self.file.keys() + + def values(self): + for k in self.file: + yield self[k] + + def items(self): + for k in self.file: + yield k, self[k] + + def __iter__(self): + return iter(self.file) + + def __contains__(self, item): + return item in self.file + + def __len__(self, item): + return len(self.file) + + def __enter__(self): + return self + + def __exit__(self, exc_type, exc_val, exc_tb): + self.file.close() + + def close(self): + self.file.close() class SpeedPerturbation(): """SpeedPerturbation @@ -469,3 +558,4 @@ class RIRConvolve(): [scipy.convolve(x, r, mode="same") for r in rir], axis=-1) else: return scipy.convolve(x, rir, mode="same") + diff --git a/paddlespeech/s2t/transform/spec_augment.py b/paddlespeech/audio/transform/spec_augment.py similarity index 97% rename from paddlespeech/s2t/transform/spec_augment.py rename to paddlespeech/audio/transform/spec_augment.py index 5ce950851..029e7b8f5 100644 --- a/paddlespeech/s2t/transform/spec_augment.py +++ b/paddlespeech/audio/transform/spec_augment.py @@ -14,12 +14,10 @@ # Modified from espnet(https://github.com/espnet/espnet) """Spec Augment module for preprocessing i.e., data augmentation""" import random - import numpy from PIL import Image -from PIL.Image import BICUBIC -from paddlespeech.s2t.transform.functional import FuncTrans +from .functional import FuncTrans def time_warp(x, max_time_warp=80, inplace=False, mode="PIL"): @@ -46,9 +44,10 @@ def time_warp(x, max_time_warp=80, inplace=False, mode="PIL"): warped = random.randrange(center - window, center + window) + 1 # 1 ... 
t - 1 - left = Image.fromarray(x[:center]).resize((x.shape[1], warped), BICUBIC) + left = Image.fromarray(x[:center]).resize((x.shape[1], warped), + Image.BICUBIC) right = Image.fromarray(x[center:]).resize((x.shape[1], t - warped), - BICUBIC) + Image.BICUBIC) if inplace: x[:warped] = left x[warped:] = right diff --git a/paddlespeech/s2t/transform/spectrogram.py b/paddlespeech/audio/transform/spectrogram.py similarity index 99% rename from paddlespeech/s2t/transform/spectrogram.py rename to paddlespeech/audio/transform/spectrogram.py index 19f0237bf..864f3f994 100644 --- a/paddlespeech/s2t/transform/spectrogram.py +++ b/paddlespeech/audio/transform/spectrogram.py @@ -17,7 +17,7 @@ import numpy as np import paddle from python_speech_features import logfbank -import paddlespeech.audio.compliance.kaldi as kaldi +from ..compliance import kaldi def stft(x, diff --git a/paddlespeech/s2t/transform/transform_interface.py b/paddlespeech/audio/transform/transform_interface.py similarity index 100% rename from paddlespeech/s2t/transform/transform_interface.py rename to paddlespeech/audio/transform/transform_interface.py diff --git a/paddlespeech/s2t/transform/transformation.py b/paddlespeech/audio/transform/transformation.py similarity index 75% rename from paddlespeech/s2t/transform/transformation.py rename to paddlespeech/audio/transform/transformation.py index 3b433cb0b..d24d6437c 100644 --- a/paddlespeech/s2t/transform/transformation.py +++ b/paddlespeech/audio/transform/transformation.py @@ -22,32 +22,32 @@ from inspect import signature import yaml -from paddlespeech.s2t.utils.dynamic_import import dynamic_import +from ..utils.dynamic_import import dynamic_import import_alias = dict( - identity="paddlespeech.s2t.transform.transform_interface:Identity", - time_warp="paddlespeech.s2t.transform.spec_augment:TimeWarp", - time_mask="paddlespeech.s2t.transform.spec_augment:TimeMask", - freq_mask="paddlespeech.s2t.transform.spec_augment:FreqMask", - spec_augment="paddlespeech.s2t.transform.spec_augment:SpecAugment", - speed_perturbation="paddlespeech.s2t.transform.perturb:SpeedPerturbation", - speed_perturbation_sox="paddlespeech.s2t.transform.perturb:SpeedPerturbationSox", - volume_perturbation="paddlespeech.s2t.transform.perturb:VolumePerturbation", - noise_injection="paddlespeech.s2t.transform.perturb:NoiseInjection", - bandpass_perturbation="paddlespeech.s2t.transform.perturb:BandpassPerturbation", - rir_convolve="paddlespeech.s2t.transform.perturb:RIRConvolve", - delta="paddlespeech.s2t.transform.add_deltas:AddDeltas", - cmvn="paddlespeech.s2t.transform.cmvn:CMVN", - utterance_cmvn="paddlespeech.s2t.transform.cmvn:UtteranceCMVN", - fbank="paddlespeech.s2t.transform.spectrogram:LogMelSpectrogram", - spectrogram="paddlespeech.s2t.transform.spectrogram:Spectrogram", - stft="paddlespeech.s2t.transform.spectrogram:Stft", - istft="paddlespeech.s2t.transform.spectrogram:IStft", - stft2fbank="paddlespeech.s2t.transform.spectrogram:Stft2LogMelSpectrogram", - wpe="paddlespeech.s2t.transform.wpe:WPE", - channel_selector="paddlespeech.s2t.transform.channel_selector:ChannelSelector", - fbank_kaldi="paddlespeech.s2t.transform.spectrogram:LogMelSpectrogramKaldi", - cmvn_json="paddlespeech.s2t.transform.cmvn:GlobalCMVN") + identity="paddlespeech.audio.transform.transform_interface:Identity", + time_warp="paddlespeech.audio.transform.spec_augment:TimeWarp", + time_mask="paddlespeech.audio.transform.spec_augment:TimeMask", + freq_mask="paddlespeech.audio.transform.spec_augment:FreqMask", + 
spec_augment="paddlespeech.audio.transform.spec_augment:SpecAugment", + speed_perturbation="paddlespeech.audio.transform.perturb:SpeedPerturbation", + speed_perturbation_sox="paddlespeech.audio.transform.perturb:SpeedPerturbationSox", + volume_perturbation="paddlespeech.audio.transform.perturb:VolumePerturbation", + noise_injection="paddlespeech.audio.transform.perturb:NoiseInjection", + bandpass_perturbation="paddlespeech.audio.transform.perturb:BandpassPerturbation", + rir_convolve="paddlespeech.audio.transform.perturb:RIRConvolve", + delta="paddlespeech.audio.transform.add_deltas:AddDeltas", + cmvn="paddlespeech.audio.transform.cmvn:CMVN", + utterance_cmvn="paddlespeech.audio.transform.cmvn:UtteranceCMVN", + fbank="paddlespeech.audio.transform.spectrogram:LogMelSpectrogram", + spectrogram="paddlespeech.audio.transform.spectrogram:Spectrogram", + stft="paddlespeech.audio.transform.spectrogram:Stft", + istft="paddlespeech.audio.transform.spectrogram:IStft", + stft2fbank="paddlespeech.audio.transform.spectrogram:Stft2LogMelSpectrogram", + wpe="paddlespeech.audio.transform.wpe:WPE", + channel_selector="paddlespeech.audio.transform.channel_selector:ChannelSelector", + fbank_kaldi="paddlespeech.audio.transform.spectrogram:LogMelSpectrogramKaldi", + cmvn_json="paddlespeech.audio.transform.cmvn:GlobalCMVN") class Transformation(): diff --git a/paddlespeech/s2t/transform/wpe.py b/paddlespeech/audio/transform/wpe.py similarity index 100% rename from paddlespeech/s2t/transform/wpe.py rename to paddlespeech/audio/transform/wpe.py diff --git a/paddlespeech/audio/utils/__init__.py b/paddlespeech/audio/utils/__init__.py index 2c9a17969..18c59ff11 100644 --- a/paddlespeech/audio/utils/__init__.py +++ b/paddlespeech/audio/utils/__init__.py @@ -11,8 +11,8 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. -from ...cli.utils import DATA_HOME -from ...cli.utils import MODEL_HOME +from ...utils.env import DATA_HOME +from ...utils.env import MODEL_HOME from .download import decompress from .download import download_and_decompress from .download import load_state_dict_from_url diff --git a/paddlespeech/audio/utils/check_kwargs.py b/paddlespeech/audio/utils/check_kwargs.py new file mode 100644 index 000000000..0aa839aca --- /dev/null +++ b/paddlespeech/audio/utils/check_kwargs.py @@ -0,0 +1,35 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
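# ---------------------------------------------------------------------------
# Editor's note -- illustrative sketch, not part of the diff above: the
# import_alias table is resolved through dynamic_import (added below in
# paddlespeech/audio/utils/dynamic_import.py), so the short names used in
# preprocess configs map to their implementing transform classes.
from paddlespeech.audio.transform.transformation import import_alias
from paddlespeech.audio.utils.dynamic_import import dynamic_import

cls = dynamic_import("fbank_kaldi", import_alias)
print(cls)  # <class 'paddlespeech.audio.transform.spectrogram.LogMelSpectrogramKaldi'>

# The Transformation class builds a pipeline of such ops from a config (an
# espnet-style {"process": [{"type": "fbank_kaldi", ...}, ...]} layout is
# assumed here) and is applied as, for example:
#     preprocessing = Transformation(preprocess_conf)
#     feats = preprocessing(audio, train=False)
# which matches how paddlespeech/cli/asr/infer.py uses it later in this diff.
# ---------------------------------------------------------------------------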
+# Modified from espnet(https://github.com/espnet/espnet) +import inspect + + +def check_kwargs(func, kwargs, name=None): + """check kwargs are valid for func + + If kwargs are invalid, raise TypeError as same as python default + :param function func: function to be validated + :param dict kwargs: keyword arguments for func + :param str name: name used in TypeError (default is func name) + """ + try: + params = inspect.signature(func).parameters + except ValueError: + return + if name is None: + name = func.__name__ + for k in kwargs.keys(): + if k not in params: + raise TypeError( + f"{name}() got an unexpected keyword argument '{k}'") diff --git a/paddlespeech/audio/utils/dynamic_import.py b/paddlespeech/audio/utils/dynamic_import.py new file mode 100644 index 000000000..99f93356f --- /dev/null +++ b/paddlespeech/audio/utils/dynamic_import.py @@ -0,0 +1,38 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# Modified from espnet(https://github.com/espnet/espnet) +import importlib + +__all__ = ["dynamic_import"] + + +def dynamic_import(import_path, alias=dict()): + """dynamic import module and class + + :param str import_path: syntax 'module_name:class_name' + e.g., 'paddlespeech.s2t.models.u2:U2Model' + :param dict alias: shortcut for registered class + :return: imported class + """ + if import_path not in alias and ":" not in import_path: + raise ValueError( + "import_path should be one of {} or " + 'include ":", e.g. "paddlespeech.s2t.models.u2:U2Model" : ' + "{}".format(set(alias), import_path)) + if ":" not in import_path: + import_path = alias[import_path] + + module_name, objname = import_path.split(":") + m = importlib.import_module(module_name) + return getattr(m, objname) diff --git a/paddlespeech/audio/utils/log.py b/paddlespeech/audio/utils/log.py index 5656b286a..0a25bbd5f 100644 --- a/paddlespeech/audio/utils/log.py +++ b/paddlespeech/audio/utils/log.py @@ -65,6 +65,7 @@ class Logger(object): def __init__(self, name: str=None): name = 'PaddleAudio' if not name else name + self.name = name self.logger = logging.getLogger(name) for key, conf in log_config.items(): @@ -101,7 +102,7 @@ class Logger(object): if not self.is_enable: return - self.logger.log(log_level, msg) + self.logger.log(log_level, self.name + " | " + msg) @contextlib.contextmanager def use_terminator(self, terminator: str): diff --git a/paddlespeech/audio/utils/tensor_utils.py b/paddlespeech/audio/utils/tensor_utils.py new file mode 100644 index 000000000..16f60810e --- /dev/null +++ b/paddlespeech/audio/utils/tensor_utils.py @@ -0,0 +1,192 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Unility functions for Transformer.""" +from typing import List +from typing import Tuple + +import paddle + +from .log import Logger + +__all__ = ["pad_sequence", "add_sos_eos", "th_accuracy", "has_tensor"] + +logger = Logger(__name__) + + +def has_tensor(val): + if isinstance(val, (list, tuple)): + for item in val: + if has_tensor(item): + return True + elif isinstance(val, dict): + for k, v in val.items(): + print(k) + if has_tensor(v): + return True + else: + return paddle.is_tensor(val) + + +def pad_sequence(sequences: List[paddle.Tensor], + batch_first: bool=False, + padding_value: float=0.0) -> paddle.Tensor: + r"""Pad a list of variable length Tensors with ``padding_value`` + + ``pad_sequence`` stacks a list of Tensors along a new dimension, + and pads them to equal length. For example, if the input is list of + sequences with size ``L x *`` and if batch_first is False, and ``T x B x *`` + otherwise. + + `B` is batch size. It is equal to the number of elements in ``sequences``. + `T` is length of the longest sequence. + `L` is length of the sequence. + `*` is any number of trailing dimensions, including none. + + Example: + >>> from paddle.nn.utils.rnn import pad_sequence + >>> a = paddle.ones(25, 300) + >>> b = paddle.ones(22, 300) + >>> c = paddle.ones(15, 300) + >>> pad_sequence([a, b, c]).shape + paddle.Tensor([25, 3, 300]) + + Note: + This function returns a Tensor of size ``T x B x *`` or ``B x T x *`` + where `T` is the length of the longest sequence. This function assumes + trailing dimensions and type of all the Tensors in sequences are same. + + Args: + sequences (list[Tensor]): list of variable length sequences. + batch_first (bool, optional): output will be in ``B x T x *`` if True, or in + ``T x B x *`` otherwise + padding_value (float, optional): value for padded elements. Default: 0. + + Returns: + Tensor of size ``T x B x *`` if :attr:`batch_first` is ``False``. + Tensor of size ``B x T x *`` otherwise + """ + + # assuming trailing dimensions and type of all the Tensors + # in sequences are same and fetching those from sequences[0] + max_size = paddle.shape(sequences[0]) + # (TODO Hui Zhang): slice not supprot `end==start` + # trailing_dims = max_size[1:] + trailing_dims = tuple( + max_size[1:].numpy().tolist()) if sequences[0].ndim >= 2 else () + max_len = max([s.shape[0] for s in sequences]) + if batch_first: + out_dims = (len(sequences), max_len) + trailing_dims + else: + out_dims = (max_len, len(sequences)) + trailing_dims + out_tensor = paddle.full(out_dims, padding_value, sequences[0].dtype) + for i, tensor in enumerate(sequences): + length = tensor.shape[0] + # use index notation to prevent duplicate references to the tensor + if batch_first: + # TODO (Hui Zhang): set_value op not supprot `end==start` + # TODO (Hui Zhang): set_value op not support int16 + # TODO (Hui Zhang): set_varbase 2 rank not support [0,0,...] + # out_tensor[i, :length, ...] = tensor + if length != 0: + out_tensor[i, :length] = tensor + else: + out_tensor[i, length] = tensor + else: + # TODO (Hui Zhang): set_value op not supprot `end==start` + # out_tensor[:length, i, ...] 
= tensor + if length != 0: + out_tensor[:length, i] = tensor + else: + out_tensor[length, i] = tensor + + return out_tensor + + +def add_sos_eos(ys_pad: paddle.Tensor, sos: int, eos: int, + ignore_id: int) -> Tuple[paddle.Tensor, paddle.Tensor]: + """Add and labels. + Args: + ys_pad (paddle.Tensor): batch of padded target sequences (B, Lmax) + sos (int): index of + eos (int): index of + ignore_id (int): index of padding + Returns: + ys_in (paddle.Tensor) : (B, Lmax + 1) + ys_out (paddle.Tensor) : (B, Lmax + 1) + Examples: + >>> sos_id = 10 + >>> eos_id = 11 + >>> ignore_id = -1 + >>> ys_pad + tensor([[ 1, 2, 3, 4, 5], + [ 4, 5, 6, -1, -1], + [ 7, 8, 9, -1, -1]], dtype=paddle.int32) + >>> ys_in,ys_out=add_sos_eos(ys_pad, sos_id , eos_id, ignore_id) + >>> ys_in + tensor([[10, 1, 2, 3, 4, 5], + [10, 4, 5, 6, 11, 11], + [10, 7, 8, 9, 11, 11]]) + >>> ys_out + tensor([[ 1, 2, 3, 4, 5, 11], + [ 4, 5, 6, 11, -1, -1], + [ 7, 8, 9, 11, -1, -1]]) + """ + # TODO(Hui Zhang): using comment code, + #_sos = paddle.to_tensor( + # [sos], dtype=paddle.long, stop_gradient=True, place=ys_pad.place) + #_eos = paddle.to_tensor( + # [eos], dtype=paddle.long, stop_gradient=True, place=ys_pad.place) + #ys = [y[y != ignore_id] for y in ys_pad] # parse padded ys + #ys_in = [paddle.cat([_sos, y], dim=0) for y in ys] + #ys_out = [paddle.cat([y, _eos], dim=0) for y in ys] + #return pad_sequence(ys_in, padding_value=eos), pad_sequence(ys_out, padding_value=ignore_id) + B = ys_pad.shape[0] + _sos = paddle.ones([B, 1], dtype=ys_pad.dtype) * sos + _eos = paddle.ones([B, 1], dtype=ys_pad.dtype) * eos + ys_in = paddle.cat([_sos, ys_pad], dim=1) + mask_pad = (ys_in == ignore_id) + ys_in = ys_in.masked_fill(mask_pad, eos) + + ys_out = paddle.cat([ys_pad, _eos], dim=1) + ys_out = ys_out.masked_fill(mask_pad, eos) + mask_eos = (ys_out == ignore_id) + ys_out = ys_out.masked_fill(mask_eos, eos) + ys_out = ys_out.masked_fill(mask_pad, ignore_id) + return ys_in, ys_out + + +def th_accuracy(pad_outputs: paddle.Tensor, + pad_targets: paddle.Tensor, + ignore_label: int) -> float: + """Calculate accuracy. + Args: + pad_outputs (Tensor): Prediction tensors (B * Lmax, D). + pad_targets (LongTensor): Target label tensors (B, Lmax, D). + ignore_label (int): Ignore label id. + Returns: + float: Accuracy value (0.0 - 1.0). 
+ """ + pad_pred = pad_outputs.view(pad_targets.shape[0], pad_targets.shape[1], + pad_outputs.shape[1]).argmax(2) + mask = pad_targets != ignore_label + #TODO(Hui Zhang): sum not support bool type + # numerator = paddle.sum( + # pad_pred.masked_select(mask) == pad_targets.masked_select(mask)) + numerator = ( + pad_pred.masked_select(mask) == pad_targets.masked_select(mask)) + numerator = paddle.sum(numerator.type_as(pad_targets)) + #TODO(Hui Zhang): sum not support bool type + # denominator = paddle.sum(mask) + denominator = paddle.sum(mask.type_as(pad_targets)) + return float(numerator) / float(denominator) diff --git a/paddlespeech/cli/asr/infer.py b/paddlespeech/cli/asr/infer.py index a943ccfa7..f9b4439ec 100644 --- a/paddlespeech/cli/asr/infer.py +++ b/paddlespeech/cli/asr/infer.py @@ -26,15 +26,15 @@ import paddle import soundfile from yacs.config import CfgNode +from ...utils.env import MODEL_HOME from ..download import get_path_from_url from ..executor import BaseExecutor from ..log import logger from ..utils import CLI_TIMER -from ..utils import MODEL_HOME from ..utils import stats_wrapper from ..utils import timer_register +from paddlespeech.audio.transform.transformation import Transformation from paddlespeech.s2t.frontend.featurizer.text_featurizer import TextFeaturizer -from paddlespeech.s2t.transform.transformation import Transformation from paddlespeech.s2t.utils.utility import UpdateConfig __all__ = ['ASRExecutor'] @@ -133,11 +133,11 @@ class ASRExecutor(BaseExecutor): """ Init model and other resources from a specific path. """ - logger.info("start to init the model") + logger.debug("start to init the model") # default max_len: unit:second self.max_len = 50 if hasattr(self, 'model'): - logger.info('Model had been initialized.') + logger.debug('Model had been initialized.') return if cfg_path is None or ckpt_path is None: @@ -151,15 +151,15 @@ class ASRExecutor(BaseExecutor): self.ckpt_path = os.path.join( self.res_path, self.task_resource.res_dict['ckpt_path'] + ".pdparams") - logger.info(self.res_path) + logger.debug(self.res_path) else: self.cfg_path = os.path.abspath(cfg_path) self.ckpt_path = os.path.abspath(ckpt_path + ".pdparams") self.res_path = os.path.dirname( os.path.dirname(os.path.abspath(self.cfg_path))) - logger.info(self.cfg_path) - logger.info(self.ckpt_path) + logger.debug(self.cfg_path) + logger.debug(self.ckpt_path) #Init body. 
self.config = CfgNode(new_allowed=True) @@ -216,7 +216,7 @@ class ASRExecutor(BaseExecutor): max_len = self.config.encoder_conf.max_len self.max_len = frame_shift_ms * max_len * subsample_rate - logger.info( + logger.debug( f"The asr server limit max duration len: {self.max_len}") def preprocess(self, model_type: str, input: Union[str, os.PathLike]): @@ -227,15 +227,15 @@ class ASRExecutor(BaseExecutor): audio_file = input if isinstance(audio_file, (str, os.PathLike)): - logger.info("Preprocess audio_file:" + audio_file) + logger.debug("Preprocess audio_file:" + audio_file) # Get the object for feature extraction if "deepspeech2" in model_type or "conformer" in model_type or "transformer" in model_type: - logger.info("get the preprocess conf") + logger.debug("get the preprocess conf") preprocess_conf = self.config.preprocess_config preprocess_args = {"train": False} preprocessing = Transformation(preprocess_conf) - logger.info("read the audio file") + logger.debug("read the audio file") audio, audio_sample_rate = soundfile.read( audio_file, dtype="int16", always_2d=True) if self.change_format: @@ -255,7 +255,7 @@ class ASRExecutor(BaseExecutor): else: audio = audio[:, 0] - logger.info(f"audio shape: {audio.shape}") + logger.debug(f"audio shape: {audio.shape}") # fbank audio = preprocessing(audio, **preprocess_args) @@ -264,19 +264,19 @@ class ASRExecutor(BaseExecutor): self._inputs["audio"] = audio self._inputs["audio_len"] = audio_len - logger.info(f"audio feat shape: {audio.shape}") + logger.debug(f"audio feat shape: {audio.shape}") else: raise Exception("wrong type") - logger.info("audio feat process success") + logger.debug("audio feat process success") @paddle.no_grad() def infer(self, model_type: str): """ Model inference and result stored in self.output. """ - logger.info("start to infer the model to get the output") + logger.debug("start to infer the model to get the output") cfg = self.config.decode audio = self._inputs["audio"] audio_len = self._inputs["audio_len"] @@ -293,7 +293,7 @@ class ASRExecutor(BaseExecutor): self._outputs["result"] = result_transcripts[0] elif "conformer" in model_type or "transformer" in model_type: - logger.info( + logger.debug( f"we will use the transformer like model : {model_type}") try: result_transcripts = self.model.decode( @@ -352,7 +352,7 @@ class ASRExecutor(BaseExecutor): logger.error("Please input the right audio file path") return False - logger.info("checking the audio file format......") + logger.debug("checking the audio file format......") try: audio, audio_sample_rate = soundfile.read( audio_file, dtype="int16", always_2d=True) @@ -365,7 +365,7 @@ class ASRExecutor(BaseExecutor): except Exception as e: logger.exception(e) logger.error( - "can not open the audio file, please check the audio file format is 'wav'. \n \ + f"can not open the audio file, please check the audio file({audio_file}) format is 'wav'. 
\n \ you can try to use sox to change the file format.\n \ For example: \n \ sample rate: 16k \n \ @@ -374,7 +374,7 @@ class ASRExecutor(BaseExecutor): sox input_audio.xx --rate 8k --bits 16 --channels 1 output_audio.wav \n \ ") return False - logger.info("The sample rate is %d" % audio_sample_rate) + logger.debug("The sample rate is %d" % audio_sample_rate) if audio_sample_rate != self.sample_rate: logger.warning("The sample rate of the input file is not {}.\n \ The program will resample the wav file to {}.\n \ @@ -383,28 +383,28 @@ class ASRExecutor(BaseExecutor): ".format(self.sample_rate, self.sample_rate)) if force_yes is False: while (True): - logger.info( + logger.debug( "Whether to change the sample rate and the channel. Y: change the sample. N: exit the prgream." ) content = input("Input(Y/N):") if content.strip() == "Y" or content.strip( ) == "y" or content.strip() == "yes" or content.strip( ) == "Yes": - logger.info( + logger.debug( "change the sampele rate, channel to 16k and 1 channel" ) break elif content.strip() == "N" or content.strip( ) == "n" or content.strip() == "no" or content.strip( ) == "No": - logger.info("Exit the program") + logger.debug("Exit the program") return False else: logger.warning("Not regular input, please input again") self.change_format = True else: - logger.info("The audio file format is right") + logger.debug("The audio file format is right") self.change_format = False return True diff --git a/paddlespeech/cli/base_commands.py b/paddlespeech/cli/base_commands.py index f5e2246d8..f9e2a55f8 100644 --- a/paddlespeech/cli/base_commands.py +++ b/paddlespeech/cli/base_commands.py @@ -94,7 +94,7 @@ class StatsCommand: def __init__(self): self.parser = argparse.ArgumentParser( prog='paddlespeech.stats', add_help=True) - self.task_choices = ['asr', 'cls', 'st', 'text', 'tts', 'vector'] + self.task_choices = ['asr', 'cls', 'st', 'text', 'tts', 'vector', 'kws'] self.parser.add_argument( '--task', type=str, @@ -138,6 +138,7 @@ _commands = { 'text': ['Text command.', 'TextExecutor'], 'tts': ['Text to Speech infer command.', 'TTSExecutor'], 'vector': ['Speech to vector embedding infer command.', 'VectorExecutor'], + 'kws': ['Keyword Spotting infer command.', 'KWSExecutor'], } for com, info in _commands.items(): diff --git a/paddlespeech/cli/cls/infer.py b/paddlespeech/cli/cls/infer.py index 942dc3b92..c869e28bf 100644 --- a/paddlespeech/cli/cls/infer.py +++ b/paddlespeech/cli/cls/infer.py @@ -92,7 +92,7 @@ class CLSExecutor(BaseExecutor): Init model and other resources from a specific path. """ if hasattr(self, 'model'): - logger.info('Model had been initialized.') + logger.debug('Model had been initialized.') return if label_file is None or ckpt_path is None: @@ -135,14 +135,14 @@ class CLSExecutor(BaseExecutor): Input content can be a text(tts), a file(asr, cls) or a streaming(not supported yet). 
""" feat_conf = self._conf['feature'] - logger.info(feat_conf) + logger.debug(feat_conf) waveform, _ = load( file=audio_file, sr=feat_conf['sample_rate'], mono=True, dtype='float32') if isinstance(audio_file, (str, os.PathLike)): - logger.info("Preprocessing audio_file:" + audio_file) + logger.debug("Preprocessing audio_file:" + audio_file) # Feature extraction feature_extractor = LogMelSpectrogram( diff --git a/paddlespeech/cli/download.py b/paddlespeech/cli/download.py index ec7258747..5661f18f9 100644 --- a/paddlespeech/cli/download.py +++ b/paddlespeech/cli/download.py @@ -61,7 +61,7 @@ def _get_unique_endpoints(trainer_endpoints): continue ips.add(ip) unique_endpoints.add(endpoint) - logger.info("unique_endpoints {}".format(unique_endpoints)) + logger.debug("unique_endpoints {}".format(unique_endpoints)) return unique_endpoints @@ -96,7 +96,7 @@ def get_path_from_url(url, # data, and the same ip will only download data once. unique_endpoints = _get_unique_endpoints(ParallelEnv().trainer_endpoints[:]) if osp.exists(fullpath) and check_exist and _md5check(fullpath, md5sum): - logger.info("Found {}".format(fullpath)) + logger.debug("Found {}".format(fullpath)) else: if ParallelEnv().current_endpoint in unique_endpoints: fullpath = _download(url, root_dir, md5sum, method=method) @@ -118,7 +118,7 @@ def _get_download(url, fullname): try: req = requests.get(url, stream=True) except Exception as e: # requests.exceptions.ConnectionError - logger.info("Downloading {} from {} failed with exception {}".format( + logger.debug("Downloading {} from {} failed with exception {}".format( fname, url, str(e))) return False @@ -190,7 +190,7 @@ def _download(url, path, md5sum=None, method='get'): fullname = osp.join(path, fname) retry_cnt = 0 - logger.info("Downloading {} from {}".format(fname, url)) + logger.debug("Downloading {} from {}".format(fname, url)) while not (osp.exists(fullname) and _md5check(fullname, md5sum)): if retry_cnt < DOWNLOAD_RETRY_LIMIT: retry_cnt += 1 @@ -209,7 +209,7 @@ def _md5check(fullname, md5sum=None): if md5sum is None: return True - logger.info("File {} md5 checking...".format(fullname)) + logger.debug("File {} md5 checking...".format(fullname)) md5 = hashlib.md5() with open(fullname, 'rb') as f: for chunk in iter(lambda: f.read(4096), b""): @@ -217,8 +217,8 @@ def _md5check(fullname, md5sum=None): calc_md5sum = md5.hexdigest() if calc_md5sum != md5sum: - logger.info("File {} md5 check failed, {}(calc) != " - "{}(base)".format(fullname, calc_md5sum, md5sum)) + logger.debug("File {} md5 check failed, {}(calc) != " + "{}(base)".format(fullname, calc_md5sum, md5sum)) return False return True @@ -227,7 +227,7 @@ def _decompress(fname): """ Decompress for zip and tar file """ - logger.info("Decompressing {}...".format(fname)) + logger.debug("Decompressing {}...".format(fname)) # For protecting decompressing interupted, # decompress to fpath_tmp directory firstly, if decompress diff --git a/paddlespeech/cli/executor.py b/paddlespeech/cli/executor.py index d390f947d..3800c36db 100644 --- a/paddlespeech/cli/executor.py +++ b/paddlespeech/cli/executor.py @@ -108,19 +108,20 @@ class BaseExecutor(ABC): Dict[str, Union[str, os.PathLike]]: A dict with ids and inputs. """ if self._is_job_input(input_): + # .job/.scp/.txt file ret = self._get_job_contents(input_) else: + # job from stdin ret = OrderedDict() - if input_ is None: # Take input from stdin if not sys.stdin.isatty( ): # Avoid getting stuck when stdin is empty. 
for i, line in enumerate(sys.stdin): line = line.strip() - if len(line.split(' ')) == 1: + if len(line.split()) == 1: ret[str(i + 1)] = line - elif len(line.split(' ')) == 2: - id_, info = line.split(' ') + elif len(line.split()) == 2: + id_, info = line.split() ret[id_] = info else: # No valid input info from one line. continue @@ -170,7 +171,8 @@ class BaseExecutor(ABC): bool: return `True` for job input, `False` otherwise. """ return input_ and os.path.isfile(input_) and (input_.endswith('.job') or - input_.endswith('.txt')) + input_.endswith('.txt') or + input_.endswith('.scp')) def _get_job_contents( self, job_input: os.PathLike) -> Dict[str, Union[str, os.PathLike]]: @@ -189,7 +191,7 @@ class BaseExecutor(ABC): line = line.strip() if not line: continue - k, v = line.split(' ') + k, v = line.split() # space or \t job_contents[k] = v return job_contents @@ -217,7 +219,7 @@ class BaseExecutor(ABC): logging.getLogger(name) for name in logging.root.manager.loggerDict ] for l in loggers: - l.disabled = True + l.setLevel(logging.ERROR) def show_rtf(self, info: Dict[str, List[float]]): """ diff --git a/paddlespeech/cli/kws/__init__.py b/paddlespeech/cli/kws/__init__.py new file mode 100644 index 000000000..db7bd50eb --- /dev/null +++ b/paddlespeech/cli/kws/__init__.py @@ -0,0 +1,14 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +from .infer import KWSExecutor diff --git a/paddlespeech/cli/kws/infer.py b/paddlespeech/cli/kws/infer.py new file mode 100644 index 000000000..111cfd754 --- /dev/null +++ b/paddlespeech/cli/kws/infer.py @@ -0,0 +1,219 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
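# ---------------------------------------------------------------------------
# Editor's note -- illustrative sketch, not part of the diff above: Python-API
# usage of the KWSExecutor defined below. The audio path is a placeholder; with
# no ckpt_path given, the pretrained 'mdtc_heysnips' model is fetched and used.
from paddlespeech.cli.kws import KWSExecutor

kws = KWSExecutor()
result = kws(audio_file="./hey_snips_sample.wav", threshold=0.8)
print(result)  # e.g. "Score: 0.923, Threshold: 0.8, Is keyword: True"

# With the executor.py changes above, --input may also be a .job/.txt/.scp file
# whose lines are "<utt_id> <wav_path>" (split on any whitespace), so a whole
# list of utterances can be scored in one CLI run, for example:
#     paddlespeech kws --input wav.scp --job_dump_result
# ---------------------------------------------------------------------------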
+import argparse +import os +from collections import OrderedDict +from typing import List +from typing import Optional +from typing import Union + +import paddle +import yaml + +from ..executor import BaseExecutor +from ..log import logger +from ..utils import stats_wrapper +from paddlespeech.audio import load +from paddlespeech.audio.compliance.kaldi import fbank as kaldi_fbank + +__all__ = ['KWSExecutor'] + + +class KWSExecutor(BaseExecutor): + def __init__(self): + super().__init__(task='kws') + self.parser = argparse.ArgumentParser( + prog='paddlespeech.kws', add_help=True) + self.parser.add_argument( + '--input', + type=str, + default=None, + help='Audio file to keyword spotting.') + self.parser.add_argument( + '--threshold', + type=float, + default=0.8, + help='Score threshold for keyword spotting.') + self.parser.add_argument( + '--model', + type=str, + default='mdtc_heysnips', + choices=[ + tag[:tag.index('-')] + for tag in self.task_resource.pretrained_models.keys() + ], + help='Choose model type of kws task.') + self.parser.add_argument( + '--config', + type=str, + default=None, + help='Config of kws task. Use deault config when it is None.') + self.parser.add_argument( + '--ckpt_path', + type=str, + default=None, + help='Checkpoint file of model.') + self.parser.add_argument( + '--device', + type=str, + default=paddle.get_device(), + help='Choose device to execute model inference.') + self.parser.add_argument( + '-d', + '--job_dump_result', + action='store_true', + help='Save job result into file.') + self.parser.add_argument( + '-v', + '--verbose', + action='store_true', + help='Increase logger verbosity of current task.') + + def _init_from_path(self, + model_type: str='mdtc_heysnips', + cfg_path: Optional[os.PathLike]=None, + ckpt_path: Optional[os.PathLike]=None): + """ + Init model and other resources from a specific path. + """ + if hasattr(self, 'model'): + logger.debug('Model had been initialized.') + return + + if ckpt_path is None: + tag = model_type + '-' + '16k' + self.task_resource.set_task_model(tag) + self.cfg_path = os.path.join( + self.task_resource.res_dir, + self.task_resource.res_dict['cfg_path']) + self.ckpt_path = os.path.join( + self.task_resource.res_dir, + self.task_resource.res_dict['ckpt_path'] + '.pdparams') + else: + self.cfg_path = os.path.abspath(cfg_path) + self.ckpt_path = os.path.abspath(ckpt_path) + + # config + with open(self.cfg_path, 'r') as f: + config = yaml.safe_load(f) + + # model + backbone_class = self.task_resource.get_model_class( + model_type.split('_')[0]) + model_class = self.task_resource.get_model_class( + model_type.split('_')[0] + '_for_kws') + backbone = backbone_class( + stack_num=config['stack_num'], + stack_size=config['stack_size'], + in_channels=config['in_channels'], + res_channels=config['res_channels'], + kernel_size=config['kernel_size'], + causal=True, ) + self.model = model_class( + backbone=backbone, num_keywords=config['num_keywords']) + model_dict = paddle.load(self.ckpt_path) + self.model.set_state_dict(model_dict) + self.model.eval() + + self.feature_extractor = lambda x: kaldi_fbank( + x, sr=config['sample_rate'], + frame_shift=config['frame_shift'], + frame_length=config['frame_length'], + n_mels=config['n_mels'] + ) + + def preprocess(self, audio_file: Union[str, os.PathLike]): + """ + Input preprocess and return paddle.Tensor stored in self.input. + Input content can be a text(tts), a file(asr, cls) or a streaming(not supported yet). 
+ """ + assert os.path.isfile(audio_file) + waveform, _ = load(audio_file) + if isinstance(audio_file, (str, os.PathLike)): + logger.debug("Preprocessing audio_file:" + audio_file) + + # Feature extraction + waveform = paddle.to_tensor(waveform).unsqueeze(0) + self._inputs['feats'] = self.feature_extractor(waveform).unsqueeze(0) + + @paddle.no_grad() + def infer(self): + """ + Model inference and result stored in self.output. + """ + self._outputs['logits'] = self.model(self._inputs['feats']) + + def postprocess(self, threshold: float) -> Union[str, os.PathLike]: + """ + Output postprocess and return human-readable results such as texts and audio files. + """ + kws_score = max(self._outputs['logits'][0, :, 0]).item() + return 'Score: {:.3f}, Threshold: {}, Is keyword: {}'.format( + kws_score, threshold, kws_score > threshold) + + def execute(self, argv: List[str]) -> bool: + """ + Command line entry. + """ + parser_args = self.parser.parse_args(argv) + + model_type = parser_args.model + cfg_path = parser_args.config + ckpt_path = parser_args.ckpt_path + device = parser_args.device + threshold = parser_args.threshold + + if not parser_args.verbose: + self.disable_task_loggers() + + task_source = self.get_input_source(parser_args.input) + task_results = OrderedDict() + has_exceptions = False + + for id_, input_ in task_source.items(): + try: + res = self(input_, threshold, model_type, cfg_path, ckpt_path, + device) + task_results[id_] = res + except Exception as e: + has_exceptions = True + task_results[id_] = f'{e.__class__.__name__}: {e}' + + self.process_task_results(parser_args.input, task_results, + parser_args.job_dump_result) + + if has_exceptions: + return False + else: + return True + + @stats_wrapper + def __call__(self, + audio_file: os.PathLike, + threshold: float=0.8, + model: str='mdtc_heysnips', + config: Optional[os.PathLike]=None, + ckpt_path: Optional[os.PathLike]=None, + device: str=paddle.get_device()): + """ + Python API to call an executor. 
+ """ + audio_file = os.path.abspath(os.path.expanduser(audio_file)) + paddle.set_device(device) + self._init_from_path(model, config, ckpt_path) + self.preprocess(audio_file) + self.infer() + res = self.postprocess(threshold) + + return res diff --git a/paddlespeech/cli/log.py b/paddlespeech/cli/log.py index 8644064c7..8b33e71e1 100644 --- a/paddlespeech/cli/log.py +++ b/paddlespeech/cli/log.py @@ -49,7 +49,7 @@ class Logger(object): self.handler.setFormatter(self.format) self.logger.addHandler(self.handler) - self.logger.setLevel(logging.DEBUG) + self.logger.setLevel(logging.INFO) self.logger.propagate = False def __call__(self, log_level: str, msg: str): diff --git a/paddlespeech/cli/st/infer.py b/paddlespeech/cli/st/infer.py index e1ce181af..bc2bdd1ac 100644 --- a/paddlespeech/cli/st/infer.py +++ b/paddlespeech/cli/st/infer.py @@ -26,10 +26,10 @@ import soundfile from kaldiio import WriteHelper from yacs.config import CfgNode +from ...utils.env import MODEL_HOME from ..executor import BaseExecutor from ..log import logger from ..utils import download_and_decompress -from ..utils import MODEL_HOME from ..utils import stats_wrapper from paddlespeech.s2t.frontend.featurizer.text_featurizer import TextFeaturizer from paddlespeech.s2t.utils.utility import UpdateConfig @@ -110,7 +110,7 @@ class STExecutor(BaseExecutor): """ decompressed_path = download_and_decompress(self.kaldi_bins, MODEL_HOME) decompressed_path = os.path.abspath(decompressed_path) - logger.info("Kaldi_bins stored in: {}".format(decompressed_path)) + logger.debug("Kaldi_bins stored in: {}".format(decompressed_path)) if "LD_LIBRARY_PATH" in os.environ: os.environ["LD_LIBRARY_PATH"] += f":{decompressed_path}" else: @@ -128,7 +128,7 @@ class STExecutor(BaseExecutor): Init model and other resources from a specific path. """ if hasattr(self, 'model'): - logger.info('Model had been initialized.') + logger.debug('Model had been initialized.') return if cfg_path is None or ckpt_path is None: @@ -140,8 +140,8 @@ class STExecutor(BaseExecutor): self.ckpt_path = os.path.join( self.task_resource.res_dir, self.task_resource.res_dict['ckpt_path']) - logger.info(self.cfg_path) - logger.info(self.ckpt_path) + logger.debug(self.cfg_path) + logger.debug(self.ckpt_path) res_path = self.task_resource.res_dir else: self.cfg_path = os.path.abspath(cfg_path) @@ -192,7 +192,7 @@ class STExecutor(BaseExecutor): Input content can be a file(wav). """ audio_file = os.path.abspath(wav_file) - logger.info("Preprocess audio_file:" + audio_file) + logger.debug("Preprocess audio_file:" + audio_file) if "fat_st" in model_type: cmvn = self.config.cmvn_path diff --git a/paddlespeech/cli/text/infer.py b/paddlespeech/cli/text/infer.py index 7b8faf99c..24b8c9c25 100644 --- a/paddlespeech/cli/text/infer.py +++ b/paddlespeech/cli/text/infer.py @@ -98,7 +98,7 @@ class TextExecutor(BaseExecutor): Init model and other resources from a specific path. 
""" if hasattr(self, 'model'): - logger.info('Model had been initialized.') + logger.debug('Model had been initialized.') return self.task = task diff --git a/paddlespeech/cli/tts/infer.py b/paddlespeech/cli/tts/infer.py index 4e0337bcc..3eb597156 100644 --- a/paddlespeech/cli/tts/infer.py +++ b/paddlespeech/cli/tts/infer.py @@ -29,11 +29,21 @@ from yacs.config import CfgNode from ..executor import BaseExecutor from ..log import logger from ..utils import stats_wrapper -from paddlespeech.t2s.frontend import English -from paddlespeech.t2s.frontend.zh_frontend import Frontend -from paddlespeech.t2s.modules.normalizer import ZScore +from paddlespeech.resource import CommonTaskResource +from paddlespeech.t2s.exps.syn_utils import get_am_inference +from paddlespeech.t2s.exps.syn_utils import get_frontend +from paddlespeech.t2s.exps.syn_utils import get_sess +from paddlespeech.t2s.exps.syn_utils import get_voc_inference +from paddlespeech.t2s.exps.syn_utils import run_frontend +from paddlespeech.t2s.utils import str2bool __all__ = ['TTSExecutor'] +ONNX_SUPPORT_SET = { + 'speedyspeech_csmsc', 'fastspeech2_csmsc', 'fastspeech2_ljspeech', + 'fastspeech2_aishell3', 'fastspeech2_vctk', 'pwgan_csmsc', 'pwgan_ljspeech', + 'pwgan_aishell3', 'pwgan_vctk', 'mb_melgan_csmsc', 'hifigan_csmsc', + 'hifigan_ljspeech', 'hifigan_aishell3', 'hifigan_vctk' +} class TTSExecutor(BaseExecutor): @@ -54,6 +64,7 @@ class TTSExecutor(BaseExecutor): 'fastspeech2_ljspeech', 'fastspeech2_aishell3', 'fastspeech2_vctk', + 'fastspeech2_mix', 'tacotron2_csmsc', 'tacotron2_ljspeech', ], @@ -98,7 +109,7 @@ class TTSExecutor(BaseExecutor): self.parser.add_argument( '--voc', type=str, - default='pwgan_csmsc', + default='hifigan_csmsc', choices=[ 'pwgan_csmsc', 'pwgan_ljspeech', @@ -135,13 +146,15 @@ class TTSExecutor(BaseExecutor): '--lang', type=str, default='zh', - help='Choose model language. zh or en') + help='Choose model language. zh or en or mix') self.parser.add_argument( '--device', type=str, default=paddle.get_device(), help='Choose device to execute model inference.') + self.parser.add_argument('--cpu_threads', type=int, default=2) + self.parser.add_argument( '--output', type=str, default='output.wav', help='output file name') self.parser.add_argument( @@ -154,6 +167,16 @@ class TTSExecutor(BaseExecutor): '--verbose', action='store_true', help='Increase logger verbosity of current task.') + self.parser.add_argument( + "--use_onnx", + type=str2bool, + default=False, + help="whether to usen onnxruntime inference.") + self.parser.add_argument( + '--fs', + type=int, + default=24000, + help='sample rate for onnx models when use specified model files.') def _init_from_path( self, @@ -164,7 +187,7 @@ class TTSExecutor(BaseExecutor): phones_dict: Optional[os.PathLike]=None, tones_dict: Optional[os.PathLike]=None, speaker_dict: Optional[os.PathLike]=None, - voc: str='pwgan_csmsc', + voc: str='hifigan_csmsc', voc_config: Optional[os.PathLike]=None, voc_ckpt: Optional[os.PathLike]=None, voc_stat: Optional[os.PathLike]=None, @@ -173,16 +196,23 @@ class TTSExecutor(BaseExecutor): Init model and other resources from a specific path. 
""" if hasattr(self, 'am_inference') and hasattr(self, 'voc_inference'): - logger.info('Models had been initialized.') + logger.debug('Models had been initialized.') return + # am + if am_ckpt is None or am_config is None or am_stat is None or phones_dict is None: + use_pretrained_am = True + else: + use_pretrained_am = False + am_tag = am + '-' + lang self.task_resource.set_task_model( model_tag=am_tag, model_type=0, # am + skip_download=not use_pretrained_am, version=None, # default version ) - if am_ckpt is None or am_config is None or am_stat is None or phones_dict is None: + if use_pretrained_am: self.am_res_path = self.task_resource.res_dir self.am_config = os.path.join(self.am_res_path, self.task_resource.res_dict['config']) @@ -193,15 +223,15 @@ class TTSExecutor(BaseExecutor): # must have phones_dict in acoustic self.phones_dict = os.path.join( self.am_res_path, self.task_resource.res_dict['phones_dict']) - logger.info(self.am_res_path) - logger.info(self.am_config) - logger.info(self.am_ckpt) + logger.debug(self.am_res_path) + logger.debug(self.am_config) + logger.debug(self.am_ckpt) else: self.am_config = os.path.abspath(am_config) self.am_ckpt = os.path.abspath(am_ckpt) self.am_stat = os.path.abspath(am_stat) self.phones_dict = os.path.abspath(phones_dict) - self.am_res_path = os.path.dirname(os.path.abspath(self.am_config)) + self.am_res_path = os.path.dirname(self.am_config) # for speedyspeech self.tones_dict = None @@ -220,13 +250,22 @@ class TTSExecutor(BaseExecutor): self.speaker_dict = speaker_dict # voc - voc_tag = voc + '-' + lang + if voc_ckpt is None or voc_config is None or voc_stat is None: + use_pretrained_voc = True + else: + use_pretrained_voc = False + voc_lang = lang + # When speaker is 174 (csmsc), use csmsc's vocoder is better than aishell3's + if lang == 'mix': + voc_lang = 'zh' + voc_tag = voc + '-' + voc_lang self.task_resource.set_task_model( model_tag=voc_tag, model_type=1, # vocoder + skip_download=not use_pretrained_voc, version=None, # default version ) - if voc_ckpt is None or voc_config is None or voc_stat is None: + if use_pretrained_voc: self.voc_res_path = self.task_resource.voc_res_dir self.voc_config = os.path.join( self.voc_res_path, self.task_resource.voc_res_dict['config']) @@ -235,9 +274,9 @@ class TTSExecutor(BaseExecutor): self.voc_stat = os.path.join( self.voc_res_path, self.task_resource.voc_res_dict['speech_stats']) - logger.info(self.voc_res_path) - logger.info(self.voc_config) - logger.info(self.voc_ckpt) + logger.debug(self.voc_res_path) + logger.debug(self.voc_config) + logger.debug(self.voc_ckpt) else: self.voc_config = os.path.abspath(voc_config) self.voc_ckpt = os.path.abspath(voc_ckpt) @@ -254,87 +293,128 @@ class TTSExecutor(BaseExecutor): with open(self.phones_dict, "r") as f: phn_id = [line.strip().split() for line in f.readlines()] vocab_size = len(phn_id) - print("vocab_size:", vocab_size) tone_size = None if self.tones_dict: with open(self.tones_dict, "r") as f: tone_id = [line.strip().split() for line in f.readlines()] tone_size = len(tone_id) - print("tone_size:", tone_size) spk_num = None if self.speaker_dict: with open(self.speaker_dict, 'rt') as f: spk_id = [line.strip().split() for line in f.readlines()] spk_num = len(spk_id) - print("spk_num:", spk_num) # frontend - if lang == 'zh': - self.frontend = Frontend( - phone_vocab_path=self.phones_dict, - tone_vocab_path=self.tones_dict) - - elif lang == 'en': - self.frontend = English(phone_vocab_path=self.phones_dict) - print("frontend done!") + self.frontend = 
get_frontend( + lang=lang, phones_dict=self.phones_dict, tones_dict=self.tones_dict) # acoustic model - odim = self.am_config.n_mels - # model: {model_name}_{dataset} - am_name = am[:am.rindex('_')] - - am_class = self.task_resource.get_model_class(am_name) - am_inference_class = self.task_resource.get_model_class(am_name + - '_inference') - - if am_name == 'fastspeech2': - am = am_class( - idim=vocab_size, - odim=odim, - spk_num=spk_num, - **self.am_config["model"]) - elif am_name == 'speedyspeech': - am = am_class( - vocab_size=vocab_size, - tone_size=tone_size, - **self.am_config["model"]) - elif am_name == 'tacotron2': - am = am_class(idim=vocab_size, odim=odim, **self.am_config["model"]) - - am.set_state_dict(paddle.load(self.am_ckpt)["main_params"]) - am.eval() - am_mu, am_std = np.load(self.am_stat) - am_mu = paddle.to_tensor(am_mu) - am_std = paddle.to_tensor(am_std) - am_normalizer = ZScore(am_mu, am_std) - self.am_inference = am_inference_class(am_normalizer, am) - self.am_inference.eval() - print("acoustic model done!") + self.am_inference = get_am_inference( + am=am, + am_config=self.am_config, + am_ckpt=self.am_ckpt, + am_stat=self.am_stat, + phones_dict=self.phones_dict, + tones_dict=self.tones_dict, + speaker_dict=self.speaker_dict) # vocoder - # model: {model_name}_{dataset} - voc_name = voc[:voc.rindex('_')] - voc_class = self.task_resource.get_model_class(voc_name) - voc_inference_class = self.task_resource.get_model_class(voc_name + - '_inference') - if voc_name != 'wavernn': - voc = voc_class(**self.voc_config["generator_params"]) - voc.set_state_dict(paddle.load(self.voc_ckpt)["generator_params"]) - voc.remove_weight_norm() - voc.eval() + self.voc_inference = get_voc_inference( + voc=voc, + voc_config=self.voc_config, + voc_ckpt=self.voc_ckpt, + voc_stat=self.voc_stat) + + def _init_from_path_onnx(self, + am: str='fastspeech2_csmsc', + am_ckpt: Optional[os.PathLike]=None, + phones_dict: Optional[os.PathLike]=None, + tones_dict: Optional[os.PathLike]=None, + speaker_dict: Optional[os.PathLike]=None, + voc: str='hifigan_csmsc', + voc_ckpt: Optional[os.PathLike]=None, + lang: str='zh', + device: str='cpu', + cpu_threads: int=2, + fs: int=24000): + if hasattr(self, 'am_sess') and hasattr(self, 'voc_sess'): + logger.debug('Models had been initialized.') + return + + # am + if am_ckpt is None or phones_dict is None: + use_pretrained_am = True + else: + use_pretrained_am = False + + am_tag = am + '_onnx' + '-' + lang + self.task_resource.set_task_model( + model_tag=am_tag, + model_type=0, # am + skip_download=not use_pretrained_am, + version=None, # default version + ) + if use_pretrained_am: + self.am_res_path = self.task_resource.res_dir + self.am_ckpt = os.path.join(self.am_res_path, + self.task_resource.res_dict['ckpt']) + # must have phones_dict in acoustic + self.phones_dict = os.path.join( + self.am_res_path, self.task_resource.res_dict['phones_dict']) + self.am_fs = self.task_resource.res_dict['sample_rate'] + logger.debug(self.am_res_path) + logger.debug(self.am_ckpt) + else: + self.am_ckpt = os.path.abspath(am_ckpt) + self.phones_dict = os.path.abspath(phones_dict) + self.am_res_path = os.path.dirname(self.am_ckpt) + self.am_fs = fs + + # for speedyspeech + self.tones_dict = None + if 'tones_dict' in self.task_resource.res_dict: + self.tones_dict = os.path.join( + self.am_res_path, self.task_resource.res_dict['tones_dict']) + if tones_dict: + self.tones_dict = tones_dict + + # voc + if voc_ckpt is None: + use_pretrained_voc = True + else: + use_pretrained_voc = 
False + voc_lang = lang + # we must use ljspeech's voc for mix am now! + if lang == 'mix': + voc_lang = 'en' + voc_tag = voc + '_onnx' + '-' + voc_lang + self.task_resource.set_task_model( + model_tag=voc_tag, + model_type=1, # vocoder + skip_download=not use_pretrained_voc, + version=None, # default version + ) + if use_pretrained_voc: + self.voc_res_path = self.task_resource.voc_res_dir + self.voc_ckpt = os.path.join( + self.voc_res_path, self.task_resource.voc_res_dict['ckpt']) + logger.debug(self.voc_res_path) + logger.debug(self.voc_ckpt) else: - voc = voc_class(**self.voc_config["model"]) - voc.set_state_dict(paddle.load(self.voc_ckpt)["main_params"]) - voc.eval() - voc_mu, voc_std = np.load(self.voc_stat) - voc_mu = paddle.to_tensor(voc_mu) - voc_std = paddle.to_tensor(voc_std) - voc_normalizer = ZScore(voc_mu, voc_std) - self.voc_inference = voc_inference_class(voc_normalizer, voc) - self.voc_inference.eval() - print("voc done!") + self.voc_ckpt = os.path.abspath(voc_ckpt) + self.voc_res_path = os.path.dirname(os.path.abspath(self.voc_ckpt)) + + # frontend + self.frontend = get_frontend( + lang=lang, phones_dict=self.phones_dict, tones_dict=self.tones_dict) + self.am_sess = get_sess( + model_path=self.am_ckpt, device=device, cpu_threads=cpu_threads) + + # vocoder + self.voc_sess = get_sess( + model_path=self.voc_ckpt, device=device, cpu_threads=cpu_threads) def preprocess(self, input: Any, *args, **kwargs): """ @@ -357,41 +437,33 @@ class TTSExecutor(BaseExecutor): """ am_name = am[:am.rindex('_')] am_dataset = am[am.rindex('_') + 1:] - get_tone_ids = False merge_sentences = False - frontend_st = time.time() + get_tone_ids = False if am_name == 'speedyspeech': get_tone_ids = True - if lang == 'zh': - input_ids = self.frontend.get_input_ids( - text, - merge_sentences=merge_sentences, - get_tone_ids=get_tone_ids) - phone_ids = input_ids["phone_ids"] - if get_tone_ids: - tone_ids = input_ids["tone_ids"] - elif lang == 'en': - input_ids = self.frontend.get_input_ids( - text, merge_sentences=merge_sentences) - phone_ids = input_ids["phone_ids"] - else: - print("lang should in {'zh', 'en'}!") + frontend_st = time.time() + frontend_dict = run_frontend( + frontend=self.frontend, + text=text, + merge_sentences=merge_sentences, + get_tone_ids=get_tone_ids, + lang=lang) self.frontend_time = time.time() - frontend_st - self.am_time = 0 self.voc_time = 0 flags = 0 + phone_ids = frontend_dict['phone_ids'] for i in range(len(phone_ids)): am_st = time.time() part_phone_ids = phone_ids[i] # am if am_name == 'speedyspeech': - part_tone_ids = tone_ids[i] + part_tone_ids = frontend_dict['tone_ids'][i] mel = self.am_inference(part_phone_ids, part_tone_ids) # fastspeech2 else: # multi speaker - if am_dataset in {"aishell3", "vctk"}: + if am_dataset in {'aishell3', 'vctk', 'mix'}: mel = self.am_inference( part_phone_ids, spk_id=paddle.to_tensor(spk_id)) else: @@ -408,6 +480,62 @@ class TTSExecutor(BaseExecutor): self.voc_time += (time.time() - voc_st) self._outputs['wav'] = wav_all + def infer_onnx(self, + text: str, + lang: str='zh', + am: str='fastspeech2_csmsc', + spk_id: int=0): + am_name = am[:am.rindex('_')] + am_dataset = am[am.rindex('_') + 1:] + merge_sentences = False + get_tone_ids = False + if am_name == 'speedyspeech': + get_tone_ids = True + am_input_feed = {} + frontend_st = time.time() + frontend_dict = run_frontend( + frontend=self.frontend, + text=text, + merge_sentences=merge_sentences, + get_tone_ids=get_tone_ids, + lang=lang, + to_tensor=False) + self.frontend_time = time.time() - 
frontend_st + phone_ids = frontend_dict['phone_ids'] + self.am_time = 0 + self.voc_time = 0 + flags = 0 + for i in range(len(phone_ids)): + am_st = time.time() + part_phone_ids = phone_ids[i] + if am_name == 'fastspeech2': + am_input_feed.update({'text': part_phone_ids}) + if am_dataset in {"aishell3", "vctk"}: + # NOTE: 'spk_id' should be List[int] rather than int here!! + am_input_feed.update({'spk_id': [spk_id]}) + elif am_name == 'speedyspeech': + part_tone_ids = frontend_dict['tone_ids'][i] + am_input_feed.update({ + 'phones': part_phone_ids, + 'tones': part_tone_ids + }) + mel = self.am_sess.run(output_names=None, input_feed=am_input_feed) + mel = mel[0] + self.am_time += (time.time() - am_st) + # voc + voc_st = time.time() + wav = self.voc_sess.run( + output_names=None, input_feed={'logmel': mel}) + wav = wav[0] + if flags == 0: + wav_all = wav + flags = 1 + else: + wav_all = np.concatenate([wav_all, wav]) + self.voc_time += (time.time() - voc_st) + + self._outputs['wav'] = wav_all + def postprocess(self, output: str='output.wav') -> Union[str, os.PathLike]: """ Output postprocess and return results. @@ -421,6 +549,20 @@ class TTSExecutor(BaseExecutor): output, self._outputs['wav'].numpy(), samplerate=self.am_config.fs) return output + def postprocess_onnx(self, + output: str='output.wav') -> Union[str, os.PathLike]: + """ + Output postprocess and return results. + This method get model output from self._outputs and convert it into human-readable results. + + Returns: + Union[str, os.PathLike]: Human-readable results such as texts and audio files. + """ + output = os.path.abspath(os.path.expanduser(output)) + sf.write(output, self._outputs['wav'], samplerate=self.am_fs) + return output + + # 命令行的入口是这里 def execute(self, argv: List[str]) -> bool: """ Command line entry. @@ -442,6 +584,9 @@ class TTSExecutor(BaseExecutor): lang = args.lang device = args.device spk_id = args.spk_id + use_onnx = args.use_onnx + cpu_threads = args.cpu_threads + fs = args.fs if not args.verbose: self.disable_task_loggers() @@ -478,7 +623,10 @@ class TTSExecutor(BaseExecutor): # other lang=lang, device=device, - output=output) + output=output, + use_onnx=use_onnx, + cpu_threads=cpu_threads, + fs=fs) task_results[id_] = res except Exception as e: has_exceptions = True @@ -492,6 +640,7 @@ class TTSExecutor(BaseExecutor): else: return True + # pyton api 的入口是这里 @stats_wrapper def __call__(self, text: str, @@ -503,33 +652,59 @@ class TTSExecutor(BaseExecutor): phones_dict: Optional[os.PathLike]=None, tones_dict: Optional[os.PathLike]=None, speaker_dict: Optional[os.PathLike]=None, - voc: str='pwgan_csmsc', + voc: str='hifigan_csmsc', voc_config: Optional[os.PathLike]=None, voc_ckpt: Optional[os.PathLike]=None, voc_stat: Optional[os.PathLike]=None, lang: str='zh', device: str=paddle.get_device(), - output: str='output.wav'): + output: str='output.wav', + use_onnx: bool=False, + cpu_threads: int=2, + fs: int=24000): """ Python API to call an executor. 
""" - paddle.set_device(device) - self._init_from_path( - am=am, - am_config=am_config, - am_ckpt=am_ckpt, - am_stat=am_stat, - phones_dict=phones_dict, - tones_dict=tones_dict, - speaker_dict=speaker_dict, - voc=voc, - voc_config=voc_config, - voc_ckpt=voc_ckpt, - voc_stat=voc_stat, - lang=lang) - - self.infer(text=text, lang=lang, am=am, spk_id=spk_id) - - res = self.postprocess(output=output) - - return res + if not use_onnx: + paddle.set_device(device) + self._init_from_path( + am=am, + am_config=am_config, + am_ckpt=am_ckpt, + am_stat=am_stat, + phones_dict=phones_dict, + tones_dict=tones_dict, + speaker_dict=speaker_dict, + voc=voc, + voc_config=voc_config, + voc_ckpt=voc_ckpt, + voc_stat=voc_stat, + lang=lang) + + self.infer(text=text, lang=lang, am=am, spk_id=spk_id) + res = self.postprocess(output=output) + return res + else: + # use onnx + # we use `cpu` for onnxruntime by default + # please see description in https://github.com/PaddlePaddle/PaddleSpeech/pull/2220 + self.task_resource = CommonTaskResource( + task='tts', model_format='onnx') + assert ( + am in ONNX_SUPPORT_SET and voc in ONNX_SUPPORT_SET + ), f'the am and voc you choose, they should be in {ONNX_SUPPORT_SET}' + self._init_from_path_onnx( + am=am, + am_ckpt=am_ckpt, + phones_dict=phones_dict, + tones_dict=tones_dict, + speaker_dict=speaker_dict, + voc=voc, + voc_ckpt=voc_ckpt, + lang=lang, + device=device, + cpu_threads=cpu_threads, + fs=fs) + self.infer_onnx(text=text, lang=lang, am=am, spk_id=spk_id) + res = self.postprocess_onnx(output=output) + return res diff --git a/paddlespeech/cli/utils.py b/paddlespeech/cli/utils.py index 0161629e8..60f56f424 100644 --- a/paddlespeech/cli/utils.py +++ b/paddlespeech/cli/utils.py @@ -30,6 +30,7 @@ import yaml from paddle.framework import load from . import download +from ..utils.env import CONF_HOME from .entry import commands try: from .. import __version__ @@ -161,38 +162,6 @@ def load_state_dict_from_url(url: str, path: str, md5: str=None) -> os.PathLike: return load(os.path.join(path, os.path.basename(url))) -def _get_user_home(): - return os.path.expanduser('~') - - -def _get_paddlespcceh_home(): - if 'PPSPEECH_HOME' in os.environ: - home_path = os.environ['PPSPEECH_HOME'] - if os.path.exists(home_path): - if os.path.isdir(home_path): - return home_path - else: - raise RuntimeError( - 'The environment variable PPSPEECH_HOME {} is not a directory.'. 
- format(home_path)) - else: - return home_path - return os.path.join(_get_user_home(), '.paddlespeech') - - -def _get_sub_home(directory): - home = os.path.join(_get_paddlespcceh_home(), directory) - if not os.path.exists(home): - os.makedirs(home) - return home - - -PPSPEECH_HOME = _get_paddlespcceh_home() -MODEL_HOME = _get_sub_home('models') -CONF_HOME = _get_sub_home('conf') -DATA_HOME = _get_sub_home('datasets') - - def _md5(text: str): '''Calculate the md5 value of the input text.''' md5code = hashlib.md5(text.encode()) diff --git a/paddlespeech/cli/vector/infer.py b/paddlespeech/cli/vector/infer.py index f0eb3ae22..7fb7b4955 100644 --- a/paddlespeech/cli/vector/infer.py +++ b/paddlespeech/cli/vector/infer.py @@ -117,7 +117,7 @@ class VectorExecutor(BaseExecutor): # stage 2: read the input data and store them as a list task_source = self.get_input_source(parser_args.input) - logger.info(f"task source: {task_source}") + logger.debug(f"task source: {task_source}") # stage 3: process the audio one by one # we do action according the task type @@ -127,13 +127,13 @@ class VectorExecutor(BaseExecutor): try: # extract the speaker audio embedding if parser_args.task == "spk": - logger.info("do vector spk task") + logger.debug("do vector spk task") res = self(input_, model, sample_rate, config, ckpt_path, device) task_result[id_] = res elif parser_args.task == "score": - logger.info("do vector score task") - logger.info(f"input content {input_}") + logger.debug("do vector score task") + logger.debug(f"input content {input_}") if len(input_.split()) != 2: logger.error( f"vector score task input {input_} wav num is not two," @@ -142,7 +142,7 @@ class VectorExecutor(BaseExecutor): # get the enroll and test embedding enroll_audio, test_audio = input_.split() - logger.info( + logger.debug( f"score task, enroll audio: {enroll_audio}, test audio: {test_audio}" ) enroll_embedding = self(enroll_audio, model, sample_rate, @@ -158,8 +158,8 @@ class VectorExecutor(BaseExecutor): has_exceptions = True task_result[id_] = f'{e.__class__.__name__}: {e}' - logger.info("task result as follows: ") - logger.info(f"{task_result}") + logger.debug("task result as follows: ") + logger.debug(f"{task_result}") # stage 4: process the all the task results self.process_task_results(parser_args.input, task_result, @@ -207,7 +207,7 @@ class VectorExecutor(BaseExecutor): """ if not hasattr(self, "score_func"): self.score_func = paddle.nn.CosineSimilarity(axis=0) - logger.info("create the cosine score function ") + logger.debug("create the cosine score function ") score = self.score_func( paddle.to_tensor(enroll_embedding), @@ -244,7 +244,7 @@ class VectorExecutor(BaseExecutor): sys.exit(-1) # stage 1: set the paddle runtime host device - logger.info(f"device type: {device}") + logger.debug(f"device type: {device}") paddle.device.set_device(device) # stage 2: read the specific pretrained model @@ -283,7 +283,7 @@ class VectorExecutor(BaseExecutor): # stage 0: avoid to init the mode again self.task = task if hasattr(self, "model"): - logger.info("Model has been initialized") + logger.debug("Model has been initialized") return # stage 1: get the model and config path @@ -294,7 +294,7 @@ class VectorExecutor(BaseExecutor): sample_rate_str = "16k" if sample_rate == 16000 else "8k" tag = model_type + "-" + sample_rate_str self.task_resource.set_task_model(tag, version=None) - logger.info(f"load the pretrained model: {tag}") + logger.debug(f"load the pretrained model: {tag}") # get the model from the pretrained list # we download 
the pretrained model and store it in the res_path self.res_path = self.task_resource.res_dir @@ -312,19 +312,19 @@ class VectorExecutor(BaseExecutor): self.res_path = os.path.dirname( os.path.dirname(os.path.abspath(self.cfg_path))) - logger.info(f"start to read the ckpt from {self.ckpt_path}") - logger.info(f"read the config from {self.cfg_path}") - logger.info(f"get the res path {self.res_path}") + logger.debug(f"start to read the ckpt from {self.ckpt_path}") + logger.debug(f"read the config from {self.cfg_path}") + logger.debug(f"get the res path {self.res_path}") # stage 2: read and config and init the model body self.config = CfgNode(new_allowed=True) self.config.merge_from_file(self.cfg_path) # stage 3: get the model name to instance the model network with dynamic_import - logger.info("start to dynamic import the model class") + logger.debug("start to dynamic import the model class") model_name = model_type[:model_type.rindex('_')] model_class = self.task_resource.get_model_class(model_name) - logger.info(f"model name {model_name}") + logger.debug(f"model name {model_name}") model_conf = self.config.model backbone = model_class(**model_conf) model = SpeakerIdetification( @@ -333,11 +333,11 @@ class VectorExecutor(BaseExecutor): self.model.eval() # stage 4: load the model parameters - logger.info("start to set the model parameters to model") + logger.debug("start to set the model parameters to model") model_dict = paddle.load(self.ckpt_path) self.model.set_state_dict(model_dict) - logger.info("create the model instance success") + logger.debug("create the model instance success") @paddle.no_grad() def infer(self, model_type: str): @@ -349,14 +349,14 @@ class VectorExecutor(BaseExecutor): # stage 0: get the feat and length from _inputs feats = self._inputs["feats"] lengths = self._inputs["lengths"] - logger.info("start to do backbone network model forward") - logger.info( + logger.debug("start to do backbone network model forward") + logger.debug( f"feats shape:{feats.shape}, lengths shape: {lengths.shape}") # stage 1: get the audio embedding # embedding from (1, emb_size, 1) -> (emb_size) embedding = self.model.backbone(feats, lengths).squeeze().numpy() - logger.info(f"embedding size: {embedding.shape}") + logger.debug(f"embedding size: {embedding.shape}") # stage 2: put the embedding and dim info to _outputs property # the embedding type is numpy.array @@ -380,12 +380,13 @@ class VectorExecutor(BaseExecutor): """ audio_file = input_file if isinstance(audio_file, (str, os.PathLike)): - logger.info(f"Preprocess audio file: {audio_file}") + logger.debug(f"Preprocess audio file: {audio_file}") # stage 1: load the audio sample points # Note: this process must match the training process waveform, sr = load_audio(audio_file) - logger.info(f"load the audio sample points, shape is: {waveform.shape}") + logger.debug( + f"load the audio sample points, shape is: {waveform.shape}") # stage 2: get the audio feat # Note: Now we only support fbank feature @@ -396,9 +397,9 @@ class VectorExecutor(BaseExecutor): n_mels=self.config.n_mels, window_size=self.config.window_size, hop_length=self.config.hop_size) - logger.info(f"extract the audio feat, shape is: {feat.shape}") + logger.debug(f"extract the audio feat, shape is: {feat.shape}") except Exception as e: - logger.info(f"feat occurs exception {e}") + logger.debug(f"feat occurs exception {e}") sys.exit(-1) feat = paddle.to_tensor(feat).unsqueeze(0) @@ -411,11 +412,11 @@ class VectorExecutor(BaseExecutor): # stage 4: store the feat and length in the 
_inputs, # which will be used in other function - logger.info(f"feats shape: {feat.shape}") + logger.debug(f"feats shape: {feat.shape}") self._inputs["feats"] = feat self._inputs["lengths"] = lengths - logger.info("audio extract the feat success") + logger.debug("audio extract the feat success") def _check(self, audio_file: str, sample_rate: int): """Check if the model sample match the audio sample rate @@ -441,7 +442,7 @@ class VectorExecutor(BaseExecutor): logger.error("Please input the right audio file path") return False - logger.info("checking the aduio file format......") + logger.debug("checking the aduio file format......") try: audio, audio_sample_rate = soundfile.read( audio_file, dtype="float32", always_2d=True) @@ -458,7 +459,7 @@ class VectorExecutor(BaseExecutor): ") return False - logger.info(f"The sample rate is {audio_sample_rate}") + logger.debug(f"The sample rate is {audio_sample_rate}") if audio_sample_rate != self.sample_rate: logger.error("The sample rate of the input file is not {}.\n \ @@ -468,6 +469,6 @@ class VectorExecutor(BaseExecutor): ".format(self.sample_rate, self.sample_rate)) sys.exit(-1) else: - logger.info("The audio file format is right") + logger.debug("The audio file format is right") return True diff --git a/paddlespeech/cls/models/panns/panns.py b/paddlespeech/cls/models/panns/panns.py index 4befe7aa4..37deae80c 100644 --- a/paddlespeech/cls/models/panns/panns.py +++ b/paddlespeech/cls/models/panns/panns.py @@ -16,8 +16,8 @@ import os import paddle.nn as nn import paddle.nn.functional as F -from paddlespeech.audio.utils import MODEL_HOME from paddlespeech.audio.utils.download import load_state_dict_from_url +from paddlespeech.utils.env import MODEL_HOME __all__ = ['CNN14', 'CNN10', 'CNN6', 'cnn14', 'cnn10', 'cnn6'] diff --git a/paddlespeech/kws/exps/__init__.py b/paddlespeech/kws/exps/__init__.py new file mode 100644 index 000000000..e69de29bb diff --git a/paddlespeech/kws/exps/mdtc/__init__.py b/paddlespeech/kws/exps/mdtc/__init__.py new file mode 100644 index 000000000..e69de29bb diff --git a/paddlespeech/resource/model_alias.py b/paddlespeech/resource/model_alias.py index 5309fd86f..9c76dd4b3 100644 --- a/paddlespeech/resource/model_alias.py +++ b/paddlespeech/resource/model_alias.py @@ -83,4 +83,10 @@ model_alias = { # ------------ Vector ------------- # --------------------------------- "ecapatdnn": ["paddlespeech.vector.models.ecapa_tdnn:EcapaTdnn"], + + # --------------------------------- + # -------------- kws -------------- + # --------------------------------- + "mdtc": ["paddlespeech.kws.models.mdtc:MDTC"], + "mdtc_for_kws": ["paddlespeech.kws.models.mdtc:KWSModel"], } diff --git a/paddlespeech/resource/pretrained_models.py b/paddlespeech/resource/pretrained_models.py index 37303331b..9d9be0aca 100644 --- a/paddlespeech/resource/pretrained_models.py +++ b/paddlespeech/resource/pretrained_models.py @@ -155,6 +155,26 @@ asr_dynamic_pretrained_models = { 'lm_md5': '29e02312deb2e59b3c8686c7966d4fe3' }, + '1.0.4': { + 'url': + 'http://paddlespeech.bj.bcebos.com/s2t/wenetspeech/asr0/asr0_deepspeech2_online_wenetspeech_ckpt_1.0.4.model.tar.gz', + 'md5': + 'c595cb76902b5a5d01409171375989f4', + 'cfg_path': + 'model.yaml', + 'ckpt_path': + 'exp/deepspeech2_online/checkpoints/avg_10', + 'model': + 'exp/deepspeech2_online/checkpoints/avg_10.jit.pdmodel', + 'params': + 'exp/deepspeech2_online/checkpoints/avg_10.jit.pdiparams', + 'onnx_model': + 'onnx/model.onnx', + 'lm_url': + 'https://deepspeech.bj.bcebos.com/zh_lm/zh_giga.no_cna_cmn.prune01244.klm', 
+ 'lm_md5': + '29e02312deb2e59b3c8686c7966d4fe3' + }, }, "deepspeech2offline_aishell-zh-16k": { '1.0': { @@ -294,6 +314,26 @@ asr_static_pretrained_models = { 'lm_md5': '29e02312deb2e59b3c8686c7966d4fe3' }, + '1.0.4': { + 'url': + 'http://paddlespeech.bj.bcebos.com/s2t/wenetspeech/asr0/asr0_deepspeech2_online_wenetspeech_ckpt_1.0.4.model.tar.gz', + 'md5': + 'c595cb76902b5a5d01409171375989f4', + 'cfg_path': + 'model.yaml', + 'ckpt_path': + 'exp/deepspeech2_online/checkpoints/avg_10', + 'model': + 'exp/deepspeech2_online/checkpoints/avg_10.jit.pdmodel', + 'params': + 'exp/deepspeech2_online/checkpoints/avg_10.jit.pdiparams', + 'onnx_model': + 'onnx/model.onnx', + 'lm_url': + 'https://deepspeech.bj.bcebos.com/zh_lm/zh_giga.no_cna_cmn.prune01244.klm', + 'lm_md5': + '29e02312deb2e59b3c8686c7966d4fe3' + }, }, } @@ -341,6 +381,26 @@ asr_onnx_pretrained_models = { 'lm_md5': '29e02312deb2e59b3c8686c7966d4fe3' }, + '1.0.4': { + 'url': + 'http://paddlespeech.bj.bcebos.com/s2t/wenetspeech/asr0/asr0_deepspeech2_online_wenetspeech_ckpt_1.0.4.model.tar.gz', + 'md5': + 'c595cb76902b5a5d01409171375989f4', + 'cfg_path': + 'model.yaml', + 'ckpt_path': + 'exp/deepspeech2_online/checkpoints/avg_10', + 'model': + 'exp/deepspeech2_online/checkpoints/avg_10.jit.pdmodel', + 'params': + 'exp/deepspeech2_online/checkpoints/avg_10.jit.pdiparams', + 'onnx_model': + 'onnx/model.onnx', + 'lm_url': + 'https://deepspeech.bj.bcebos.com/zh_lm/zh_giga.no_cna_cmn.prune01244.klm', + 'lm_md5': + '29e02312deb2e59b3c8686c7966d4fe3' + }, }, } @@ -579,6 +639,56 @@ tts_dynamic_pretrained_models = { 'speaker_id_map.txt', }, }, + "fastspeech2_cnndecoder_csmsc-zh": { + '1.0': { + 'url': + 'https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_cnndecoder_csmsc_ckpt_1.0.0.zip', + 'md5': + '6eb28e22ace73e0ebe7845f86478f89f', + 'config': + 'cnndecoder.yaml', + 'ckpt': + 'snapshot_iter_153000.pdz', + 'speech_stats': + 'speech_stats.npy', + 'phones_dict': + 'phone_id_map.txt', + }, + }, + "fastspeech2_mix-mix": { + '1.0': { + 'url': + 'https://paddlespeech.bj.bcebos.com/t2s/chinse_english_mixed/models/fastspeech2_csmscljspeech_add-zhen.zip', + 'md5': + '77d9d4b5a79ed6203339ead7ef6c74f9', + 'config': + 'default.yaml', + 'ckpt': + 'snapshot_iter_94000.pdz', + 'speech_stats': + 'speech_stats.npy', + 'phones_dict': + 'phone_id_map.txt', + 'speaker_dict': + 'speaker_id_map.txt', + }, + '2.0': { + 'url': + 'https://paddlespeech.bj.bcebos.com/t2s/chinse_english_mixed/models/fastspeech2_mix_ckpt_0.2.0.zip', + 'md5': + '1d938e104e972386c8bfcbcc98a91587', + 'config': + 'default.yaml', + 'ckpt': + 'snapshot_iter_99200.pdz', + 'speech_stats': + 'speech_stats.npy', + 'phones_dict': + 'phone_id_map.txt', + 'speaker_dict': + 'speaker_id_map.txt', + }, + }, # tacotron2 "tacotron2_csmsc-zh": { '1.0': { @@ -771,22 +881,6 @@ tts_dynamic_pretrained_models = { 'feats_stats.npy', }, }, - "fastspeech2_cnndecoder_csmsc-zh": { - '1.0': { - 'url': - 'https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_cnndecoder_csmsc_ckpt_1.0.0.zip', - 'md5': - '6eb28e22ace73e0ebe7845f86478f89f', - 'config': - 'cnndecoder.yaml', - 'ckpt': - 'snapshot_iter_153000.pdz', - 'speech_stats': - 'speech_stats.npy', - 'phones_dict': - 'phone_id_map.txt', - }, - }, } tts_static_pretrained_models = { @@ -826,6 +920,58 @@ tts_static_pretrained_models = { 24000, }, }, + "fastspeech2_ljspeech-en": { + '1.0': { + 'url': + 'https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_ljspeech_static_1.1.0.zip', + 
'md5': + 'c49f70b52973423ec45aaa6184fb5bc6', + 'model': + 'fastspeech2_ljspeech.pdmodel', + 'params': + 'fastspeech2_ljspeech.pdiparams', + 'phones_dict': + 'phone_id_map.txt', + 'sample_rate': + 22050, + }, + }, + "fastspeech2_aishell3-zh": { + '1.0': { + 'url': + 'https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_aishell3_static_1.1.0.zip', + 'md5': + '695af44679f48eb4abc159977ddaee16', + 'model': + 'fastspeech2_aishell3.pdmodel', + 'params': + 'fastspeech2_aishell3.pdiparams', + 'phones_dict': + 'phone_id_map.txt', + 'speaker_dict': + 'speaker_id_map.txt', + 'sample_rate': + 24000, + }, + }, + "fastspeech2_vctk-en": { + '1.0': { + 'url': + 'https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_vctk_static_1.1.0.zip', + 'md5': + '92d8c082f180bda2fd05a534fb4a1b62', + 'model': + 'fastspeech2_vctk.pdmodel', + 'params': + 'fastspeech2_vctk.pdiparams', + 'phones_dict': + 'phone_id_map.txt', + 'speaker_dict': + 'speaker_id_map.txt', + 'sample_rate': + 24000, + }, + }, # pwgan "pwgan_csmsc-zh": { '1.0': { @@ -841,6 +987,48 @@ tts_static_pretrained_models = { 24000, }, }, + "pwgan_ljspeech-en": { + '1.0': { + 'url': + 'https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwgan_ljspeech_static_1.1.0.zip', + 'md5': + '6f457a069da99c6814ac1fb4677281e4', + 'model': + 'pwgan_ljspeech.pdmodel', + 'params': + 'pwgan_ljspeech.pdiparams', + 'sample_rate': + 22050, + }, + }, + "pwgan_aishell3-zh": { + '1.0': { + 'url': + 'https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwgan_aishell3_static_1.1.0.zip', + 'md5': + '199f64010238275fbdacb326a5cf82d1', + 'model': + 'pwgan_aishell3.pdmodel', + 'params': + 'pwgan_aishell3.pdiparams', + 'sample_rate': + 24000, + }, + }, + "pwgan_vctk-en": { + '1.0': { + 'url': + 'https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwgan_vctk_static_1.1.0.zip', + 'md5': + 'ee0fc571ad5a7fbe4ca20e49df22b819', + 'model': + 'pwgan_vctk.pdmodel', + 'params': + 'pwgan_vctk.pdiparams', + 'sample_rate': + 24000, + }, + }, # mb_melgan "mb_melgan_csmsc-zh": { '1.0': { @@ -871,9 +1059,68 @@ tts_static_pretrained_models = { 24000, }, }, + "hifigan_ljspeech-en": { + '1.0': { + 'url': + 'https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_ljspeech_static_1.1.0.zip', + 'md5': + '8c674e79be7c45f6eda74825316438a0', + 'model': + 'hifigan_ljspeech.pdmodel', + 'params': + 'hifigan_ljspeech.pdiparams', + 'sample_rate': + 22050, + }, + }, + "hifigan_aishell3-zh": { + '1.0': { + 'url': + 'https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_aishell3_static_1.1.0.zip', + 'md5': + '7a10ec5d8d851e2000128f040d30cc01', + 'model': + 'hifigan_aishell3.pdmodel', + 'params': + 'hifigan_aishell3.pdiparams', + 'sample_rate': + 24000, + }, + }, + "hifigan_vctk-en": { + '1.0': { + 'url': + 'https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_vctk_static_1.1.0.zip', + 'md5': + '130f791dfac84ccdd44ccbdfb67bf08e', + 'model': + 'hifigan_vctk.pdmodel', + 'params': + 'hifigan_vctk.pdiparams', + 'sample_rate': + 24000, + }, + }, } tts_onnx_pretrained_models = { + # speedyspeech + "speedyspeech_csmsc_onnx-zh": { + '1.0': { + 'url': + 'https://paddlespeech.bj.bcebos.com/Parakeet/released_models/speedyspeech/speedyspeech_csmsc_onnx_0.2.0.zip', + 'md5': + '3e9c45af9ef70675fc1968ed5074fc88', + 'ckpt': + 'speedyspeech_csmsc.onnx', + 'phones_dict': + 'phone_id_map.txt', + 'tones_dict': + 'tone_id_map.txt', + 'sample_rate': + 24000, + }, + }, # 
fastspeech2 "fastspeech2_csmsc_onnx-zh": { '1.0': { @@ -881,9 +1128,56 @@ tts_onnx_pretrained_models = { 'https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_csmsc_onnx_0.2.0.zip', 'md5': 'fd3ad38d83273ad51f0ea4f4abf3ab4e', - 'ckpt': ['fastspeech2_csmsc.onnx'], + 'ckpt': + 'fastspeech2_csmsc.onnx', + 'phones_dict': + 'phone_id_map.txt', + 'sample_rate': + 24000, + }, + }, + "fastspeech2_ljspeech_onnx-en": { + '1.0': { + 'url': + 'https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_ljspeech_onnx_1.1.0.zip', + 'md5': + '00754307636a48c972a5f3e65cda3d18', + 'ckpt': + 'fastspeech2_ljspeech.onnx', + 'phones_dict': + 'phone_id_map.txt', + 'sample_rate': + 22050, + }, + }, + "fastspeech2_aishell3_onnx-zh": { + '1.0': { + 'url': + 'https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_aishell3_onnx_1.1.0.zip', + 'md5': + 'a1d6ee21de897ce394f5469e2bb4df0d', + 'ckpt': + 'fastspeech2_aishell3.onnx', + 'phones_dict': + 'phone_id_map.txt', + 'speaker_dict': + 'speaker_id_map.txt', + 'sample_rate': + 24000, + }, + }, + "fastspeech2_vctk_onnx-en": { + '1.0': { + 'url': + 'https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_vctk_onnx_1.1.0.zip', + 'md5': + 'd9c3a9b02204a2070504dd99f5f959bf', + 'ckpt': + 'fastspeech2_vctk.onnx', 'phones_dict': 'phone_id_map.txt', + 'speaker_dict': + 'speaker_id_map.txt', 'sample_rate': 24000, }, @@ -907,6 +1201,55 @@ tts_onnx_pretrained_models = { 24000, }, }, + # pwgan + "pwgan_csmsc_onnx-zh": { + '1.0': { + 'url': + 'https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwgan_csmsc_onnx_0.2.0.zip', + 'md5': + '711d0ade33e73f3b721efc9f20669f9c', + 'ckpt': + 'pwgan_csmsc.onnx', + 'sample_rate': + 24000, + }, + }, + "pwgan_ljspeech_onnx-en": { + '1.0': { + 'url': + 'https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwgan_ljspeech_onnx_1.1.0.zip', + 'md5': + '73cdeeccb77f2ea6ed4d07e71d8ac8b8', + 'ckpt': + 'pwgan_ljspeech.onnx', + 'sample_rate': + 22050, + }, + }, + "pwgan_aishell3_onnx-zh": { + '1.0': { + 'url': + 'https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwgan_aishell3_onnx_1.1.0.zip', + 'md5': + '096ab64e152a4fa476aff79ebdadb01b', + 'ckpt': + 'pwgan_aishell3.onnx', + 'sample_rate': + 24000, + }, + }, + "pwgan_vctk_onnx-en": { + '1.0': { + 'url': + 'https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwgan_vctk_onnx_1.1.0.zip', + 'md5': + '4e754d42cf85f6428f0af887c923d86c', + 'ckpt': + 'pwgan_vctk.onnx', + 'sample_rate': + 24000, + }, + }, # mb_melgan "mb_melgan_csmsc_onnx-zh": { '1.0': { @@ -933,6 +1276,42 @@ tts_onnx_pretrained_models = { 24000, }, }, + "hifigan_ljspeech_onnx-en": { + '1.0': { + 'url': + 'https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_ljspeech_onnx_1.1.0.zip', + 'md5': + '062f54b79c1135a50adb5fc8406260b2', + 'ckpt': + 'hifigan_ljspeech.onnx', + 'sample_rate': + 22050, + }, + }, + "hifigan_aishell3_onnx-zh": { + '1.0': { + 'url': + 'https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_aishell3_onnx_1.1.0.zip', + 'md5': + 'd6c0d684ad148583ca57837d5e870167', + 'ckpt': + 'hifigan_aishell3.onnx', + 'sample_rate': + 24000, + }, + }, + "hifigan_vctk_onnx-en": { + '1.0': { + 'url': + 'https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_vctk_onnx_1.1.0.zip', + 'md5': + 'fd714df3be283c0efbefc8510160ff6d', + 'ckpt': + 'hifigan_vctk.onnx', + 'sample_rate': + 24000, + }, + }, } # 
--------------------------------- @@ -954,3 +1333,35 @@ vector_dynamic_pretrained_models = { }, }, } + +# --------------------------------- +# ------------- KWS --------------- +# --------------------------------- +kws_dynamic_pretrained_models = { + 'mdtc_heysnips-16k': { + '1.0': { + 'url': + 'https://paddlespeech.bj.bcebos.com/kws/heysnips/kws0_mdtc_heysnips_ckpt.tar.gz', + 'md5': + 'c0de0a9520d66c3c8d6679460893578f', + 'cfg_path': + 'conf/mdtc.yaml', + 'ckpt_path': + 'ckpt/model', + }, + }, +} + +# --------------------------------- +# ------------- G2PW --------------- +# --------------------------------- +g2pw_onnx_models = { + 'G2PWModel': { + '1.0': { + 'url': + 'https://paddlespeech.bj.bcebos.com/Parakeet/released_models/g2p/G2PWModel.tar', + 'md5': + '63bc0894af15a5a591e58b2130a2bcac', + }, + }, +} diff --git a/paddlespeech/resource/resource.py b/paddlespeech/resource/resource.py index 15112ba7d..8e9914b2e 100644 --- a/paddlespeech/resource/resource.py +++ b/paddlespeech/resource/resource.py @@ -18,11 +18,11 @@ from typing import List from typing import Optional from ..cli.utils import download_and_decompress -from ..cli.utils import MODEL_HOME from ..utils.dynamic_import import dynamic_import +from ..utils.env import MODEL_HOME from .model_alias import model_alias -task_supported = ['asr', 'cls', 'st', 'text', 'tts', 'vector'] +task_supported = ['asr', 'cls', 'st', 'text', 'tts', 'vector', 'kws'] model_format_supported = ['dynamic', 'static', 'onnx'] inference_mode_supported = ['online', 'offline'] @@ -60,6 +60,7 @@ class CommonTaskResource: def set_task_model(self, model_tag: str, model_type: int=0, + skip_download: bool=False, version: Optional[str]=None): """Set model tag and version of current task. @@ -83,16 +84,18 @@ class CommonTaskResource: self.version = version self.res_dict = self.pretrained_models[model_tag][version] self._format_path(self.res_dict) - self.res_dir = self._fetch(self.res_dict, - self._get_model_dir(model_type)) + if not skip_download: + self.res_dir = self._fetch(self.res_dict, + self._get_model_dir(model_type)) else: assert self.task == 'tts', 'Vocoder will only be used in tts task.' self.voc_model_tag = model_tag self.voc_version = version self.voc_res_dict = self.pretrained_models[model_tag][version] self._format_path(self.voc_res_dict) - self.voc_res_dir = self._fetch(self.voc_res_dict, - self._get_model_dir(model_type)) + if not skip_download: + self.voc_res_dir = self._fetch(self.voc_res_dict, + self._get_model_dir(model_type)) @staticmethod def get_model_class(model_name) -> List[object]: @@ -164,7 +167,6 @@ class CommonTaskResource: try: import_models = '{}_{}_pretrained_models'.format(self.task, self.model_format) - print(f"from .pretrained_models import {import_models}") exec('from .pretrained_models import {}'.format(import_models)) models = OrderedDict(locals()[import_models]) except Exception as e: diff --git a/paddlespeech/s2t/__init__.py b/paddlespeech/s2t/__init__.py index 2da68435c..f6476b9aa 100644 --- a/paddlespeech/s2t/__init__.py +++ b/paddlespeech/s2t/__init__.py @@ -18,7 +18,6 @@ from typing import Union import paddle from paddle import nn -from paddle.fluid import core from paddle.nn import functional as F from paddlespeech.s2t.utils.log import Log @@ -39,46 +38,6 @@ paddle.long = 'int64' paddle.uint16 = 'uint16' paddle.cdouble = 'complex128' - -def convert_dtype_to_string(tensor_dtype): - """ - Convert the data type in numpy to the data type in Paddle - Args: - tensor_dtype(core.VarDesc.VarType): the data type in numpy. 
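The new `skip_download` flag in `CommonTaskResource.set_task_model` lets an executor register a model tag without pulling the pretrained archive when the user already supplies local checkpoint paths. A small sketch under that assumption, using one of the ONNX tags added to the registry above:

```python
# Sketch: resolve a TTS ONNX model tag without downloading, mirroring what the
# executors now do when local --am_ckpt / --voc_ckpt paths are provided.
from paddlespeech.resource import CommonTaskResource

res = CommonTaskResource(task='tts', model_format='onnx')
res.set_task_model(
    model_tag='fastspeech2_csmsc_onnx-zh',
    model_type=0,          # 0 = acoustic model, 1 = vocoder
    skip_download=True,    # skip fetching the archive into MODEL_HOME
    version=None)          # None selects the default version
```

The same flag applies to vocoders via `model_type=1`, whose resolved resource dict is stored in `voc_res_dict` instead.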
- Returns: - core.VarDesc.VarType: the data type in Paddle. - """ - dtype = tensor_dtype - if dtype == core.VarDesc.VarType.FP32: - return paddle.float32 - elif dtype == core.VarDesc.VarType.FP64: - return paddle.float64 - elif dtype == core.VarDesc.VarType.FP16: - return paddle.float16 - elif dtype == core.VarDesc.VarType.INT32: - return paddle.int32 - elif dtype == core.VarDesc.VarType.INT16: - return paddle.int16 - elif dtype == core.VarDesc.VarType.INT64: - return paddle.int64 - elif dtype == core.VarDesc.VarType.BOOL: - return paddle.bool - elif dtype == core.VarDesc.VarType.BF16: - # since there is still no support for bfloat16 in NumPy, - # uint16 is used for casting bfloat16 - return paddle.uint16 - elif dtype == core.VarDesc.VarType.UINT8: - return paddle.uint8 - elif dtype == core.VarDesc.VarType.INT8: - return paddle.int8 - elif dtype == core.VarDesc.VarType.COMPLEX64: - return paddle.complex64 - elif dtype == core.VarDesc.VarType.COMPLEX128: - return paddle.complex128 - else: - raise ValueError("Not supported tensor dtype %s" % dtype) - - if not hasattr(paddle, 'softmax'): logger.debug("register user softmax to paddle, remove this when fixed!") setattr(paddle, 'softmax', paddle.nn.functional.softmax) @@ -155,28 +114,6 @@ if not hasattr(paddle.Tensor, 'new_full'): paddle.Tensor.new_full = new_full paddle.static.Variable.new_full = new_full - -def eq(xs: paddle.Tensor, ys: Union[paddle.Tensor, float]) -> paddle.Tensor: - if convert_dtype_to_string(xs.dtype) == paddle.bool: - xs = xs.astype(paddle.int) - return xs.equal( - paddle.to_tensor( - ys, dtype=convert_dtype_to_string(xs.dtype), place=xs.place)) - - -if not hasattr(paddle.Tensor, 'eq'): - logger.debug( - "override eq of paddle.Tensor if exists or register, remove this when fixed!" - ) - paddle.Tensor.eq = eq - paddle.static.Variable.eq = eq - -if not hasattr(paddle, 'eq'): - logger.debug( - "override eq of paddle if exists or register, remove this when fixed!") - paddle.eq = eq - - def contiguous(xs: paddle.Tensor) -> paddle.Tensor: return xs @@ -219,13 +156,22 @@ def is_broadcastable(shp1, shp2): return True +def broadcast_shape(shp1, shp2): + result = [] + for a, b in zip(shp1[::-1], shp2[::-1]): + result.append(max(a, b)) + return result[::-1] + + def masked_fill(xs: paddle.Tensor, mask: paddle.Tensor, value: Union[float, int]): - assert is_broadcastable(xs.shape, mask.shape) is True, (xs.shape, - mask.shape) - bshape = paddle.broadcast_shape(xs.shape, mask.shape) - mask = mask.broadcast_to(bshape) + bshape = broadcast_shape(xs.shape, mask.shape) + mask.stop_gradient = True + tmp = paddle.ones(shape=[len(bshape)], dtype='int32') + for index in range(len(bshape)): + tmp[index] = bshape[index] + mask = mask.broadcast_to(tmp) trues = paddle.ones_like(xs) * value xs = paddle.where(mask, trues, xs) return xs diff --git a/paddlespeech/s2t/exps/deepspeech2/bin/export.py b/paddlespeech/s2t/exps/deepspeech2/bin/export.py index 049e7b688..8acd46dfc 100644 --- a/paddlespeech/s2t/exps/deepspeech2/bin/export.py +++ b/paddlespeech/s2t/exps/deepspeech2/bin/export.py @@ -35,12 +35,6 @@ if __name__ == "__main__": # save jit model to parser.add_argument( "--export_path", type=str, help="path of the jit model to save") - parser.add_argument( - '--nxpu', - type=int, - default=0, - choices=[0, 1], - help="if nxpu == 0 and ngpu == 0, use cpu.") args = parser.parse_args() print_arguments(args) diff --git a/paddlespeech/s2t/exps/deepspeech2/bin/test.py b/paddlespeech/s2t/exps/deepspeech2/bin/test.py index a9828f6e7..030168a9a 100644 --- 
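The reworked `masked_fill` above computes the broadcast shape by hand via the new `broadcast_shape` helper instead of asserting broadcastability and calling `paddle.broadcast_shape`. The helper itself is patched onto paddle elsewhere in this module, so the following is only a standalone sketch that mirrors its logic:

```python
# Standalone sketch mirroring the reworked masked_fill above: broadcast the boolean
# mask to the data shape, then fill masked positions with a constant value.
import numpy as np
import paddle

xs = paddle.to_tensor(np.arange(6, dtype='float32').reshape(2, 3))
mask = paddle.to_tensor(np.array([[True, False, True]]))   # shape (1, 3), broadcasts over rows

# same computation as broadcast_shape(): elementwise max over right-aligned dims
bshape = [max(a, b) for a, b in zip(xs.shape[::-1], mask.shape[::-1])][::-1]
filled = paddle.where(mask.broadcast_to(bshape), paddle.ones_like(xs) * -1e9, xs)
print(filled.numpy())   # masked columns 0 and 2 replaced by -1e9
```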
a/paddlespeech/s2t/exps/deepspeech2/bin/test.py +++ b/paddlespeech/s2t/exps/deepspeech2/bin/test.py @@ -35,12 +35,6 @@ if __name__ == "__main__": # save asr result to parser.add_argument( "--result_file", type=str, help="path of save the asr result") - parser.add_argument( - '--nxpu', - type=int, - default=0, - choices=[0, 1], - help="if nxpu == 0 and ngpu == 0, use cpu.") args = parser.parse_args() print_arguments(args, globals()) diff --git a/paddlespeech/s2t/exps/deepspeech2/bin/test_export.py b/paddlespeech/s2t/exps/deepspeech2/bin/test_export.py index 8db081e7b..d7a9402b9 100644 --- a/paddlespeech/s2t/exps/deepspeech2/bin/test_export.py +++ b/paddlespeech/s2t/exps/deepspeech2/bin/test_export.py @@ -38,12 +38,6 @@ if __name__ == "__main__": #load jit model from parser.add_argument( "--export_path", type=str, help="path of the jit model to save") - parser.add_argument( - '--nxpu', - type=int, - default=0, - choices=[0, 1], - help="if nxpu == 0 and ngpu == 0, use cpu.") parser.add_argument( "--enable-auto-log", action="store_true", help="use auto log") args = parser.parse_args() diff --git a/paddlespeech/s2t/exps/deepspeech2/bin/train.py b/paddlespeech/s2t/exps/deepspeech2/bin/train.py index fee7079d9..2c9942f9b 100644 --- a/paddlespeech/s2t/exps/deepspeech2/bin/train.py +++ b/paddlespeech/s2t/exps/deepspeech2/bin/train.py @@ -31,12 +31,6 @@ def main(config, args): if __name__ == "__main__": parser = default_argument_parser() - parser.add_argument( - '--nxpu', - type=int, - default=0, - choices=[0, 1], - help="if nxpu == 0 and ngpu == 0, use cpu.") args = parser.parse_args() print_arguments(args, globals()) diff --git a/paddlespeech/s2t/exps/deepspeech2/model.py b/paddlespeech/s2t/exps/deepspeech2/model.py index 511997a7c..7ab8cf853 100644 --- a/paddlespeech/s2t/exps/deepspeech2/model.py +++ b/paddlespeech/s2t/exps/deepspeech2/model.py @@ -23,7 +23,7 @@ import paddle from paddle import distributed as dist from paddle import inference -from paddlespeech.s2t.frontend.featurizer.text_featurizer import TextFeaturizer +from paddlespeech.audio.text.text_featurizer import TextFeaturizer from paddlespeech.s2t.io.dataloader import BatchDataLoader from paddlespeech.s2t.models.ds2 import DeepSpeech2InferModel from paddlespeech.s2t.models.ds2 import DeepSpeech2Model diff --git a/paddlespeech/s2t/exps/u2/bin/test_wav.py b/paddlespeech/s2t/exps/u2/bin/test_wav.py index 86c3db89f..887ec7a6d 100644 --- a/paddlespeech/s2t/exps/u2/bin/test_wav.py +++ b/paddlespeech/s2t/exps/u2/bin/test_wav.py @@ -20,10 +20,10 @@ import paddle import soundfile from yacs.config import CfgNode +from paddlespeech.audio.transform.transformation import Transformation from paddlespeech.s2t.frontend.featurizer.text_featurizer import TextFeaturizer from paddlespeech.s2t.models.u2 import U2Model from paddlespeech.s2t.training.cli import default_argument_parser -from paddlespeech.s2t.transform.transformation import Transformation from paddlespeech.s2t.utils.log import Log from paddlespeech.s2t.utils.utility import UpdateConfig logger = Log(__name__).getlog() diff --git a/paddlespeech/s2t/exps/u2/model.py b/paddlespeech/s2t/exps/u2/model.py index efcc9629f..cdad3b8f7 100644 --- a/paddlespeech/s2t/exps/u2/model.py +++ b/paddlespeech/s2t/exps/u2/model.py @@ -26,6 +26,8 @@ from paddle import distributed as dist from paddlespeech.s2t.frontend.featurizer import TextFeaturizer from paddlespeech.s2t.io.dataloader import BatchDataLoader +from paddlespeech.s2t.io.dataloader import StreamDataLoader +from paddlespeech.s2t.io.dataloader import 
DataLoaderFactory from paddlespeech.s2t.models.u2 import U2Model from paddlespeech.s2t.training.optimizer import OptimizerFactory from paddlespeech.s2t.training.reporter import ObsScope @@ -106,7 +108,8 @@ class U2Trainer(Trainer): @paddle.no_grad() def valid(self): self.model.eval() - logger.info(f"Valid Total Examples: {len(self.valid_loader.dataset)}") + if not self.use_streamdata: + logger.info(f"Valid Total Examples: {len(self.valid_loader.dataset)}") valid_losses = defaultdict(list) num_seen_utts = 1 total_loss = 0.0 @@ -132,7 +135,8 @@ class U2Trainer(Trainer): msg = f"Valid: Rank: {dist.get_rank()}, " msg += "epoch: {}, ".format(self.epoch) msg += "step: {}, ".format(self.iteration) - msg += "batch: {}/{}, ".format(i + 1, len(self.valid_loader)) + if not self.use_streamdata: + msg += "batch: {}/{}, ".format(i + 1, len(self.valid_loader)) msg += ', '.join('{}: {:>.6f}'.format(k, v) for k, v in valid_dump.items()) logger.info(msg) @@ -152,7 +156,8 @@ class U2Trainer(Trainer): self.before_train() - logger.info(f"Train Total Examples: {len(self.train_loader.dataset)}") + if not self.use_streamdata: + logger.info(f"Train Total Examples: {len(self.train_loader.dataset)}") while self.epoch < self.config.n_epoch: with Timer("Epoch-Train Time Cost: {}"): self.model.train() @@ -170,7 +175,8 @@ class U2Trainer(Trainer): self.train_batch(batch_index, batch, msg) self.after_train_batch() report('iter', batch_index + 1) - report('total', len(self.train_loader)) + if not self.use_streamdata: + report('total', len(self.train_loader)) report('reader_cost', dataload_time) observation['batch_cost'] = observation[ 'reader_cost'] + observation['step_cost'] @@ -191,7 +197,6 @@ class U2Trainer(Trainer): except Exception as e: logger.error(e) raise e - with Timer("Eval Time Cost: {}"): total_loss, num_seen_utts = self.valid() if dist.get_world_size() > 1: @@ -218,92 +223,16 @@ class U2Trainer(Trainer): def setup_dataloader(self): config = self.config.clone() - + self.use_streamdata = config.get("use_stream_data", False) if self.train: - # train/valid dataset, return token ids - self.train_loader = BatchDataLoader( - json_file=config.train_manifest, - train_mode=True, - sortagrad=config.sortagrad, - batch_size=config.batch_size, - maxlen_in=config.maxlen_in, - maxlen_out=config.maxlen_out, - minibatches=config.minibatches, - mini_batch_size=self.args.ngpu, - batch_count=config.batch_count, - batch_bins=config.batch_bins, - batch_frames_in=config.batch_frames_in, - batch_frames_out=config.batch_frames_out, - batch_frames_inout=config.batch_frames_inout, - preprocess_conf=config.preprocess_config, - n_iter_processes=config.num_workers, - subsampling_factor=1, - num_encs=1, - dist_sampler=config.get('dist_sampler', False), - shortest_first=False) - - self.valid_loader = BatchDataLoader( - json_file=config.dev_manifest, - train_mode=False, - sortagrad=False, - batch_size=config.batch_size, - maxlen_in=float('inf'), - maxlen_out=float('inf'), - minibatches=0, - mini_batch_size=self.args.ngpu, - batch_count='auto', - batch_bins=0, - batch_frames_in=0, - batch_frames_out=0, - batch_frames_inout=0, - preprocess_conf=config.preprocess_config, - n_iter_processes=config.num_workers, - subsampling_factor=1, - num_encs=1, - dist_sampler=config.get('dist_sampler', False), - shortest_first=False) + self.train_loader = DataLoaderFactory.get_dataloader('train', config, self.args) + self.valid_loader = DataLoaderFactory.get_dataloader('valid', config, self.args) logger.info("Setup train/valid Dataloader!") else: 
decode_batch_size = config.get('decode', dict()).get( 'decode_batch_size', 1) - # test dataset, return raw text - self.test_loader = BatchDataLoader( - json_file=config.test_manifest, - train_mode=False, - sortagrad=False, - batch_size=decode_batch_size, - maxlen_in=float('inf'), - maxlen_out=float('inf'), - minibatches=0, - mini_batch_size=1, - batch_count='auto', - batch_bins=0, - batch_frames_in=0, - batch_frames_out=0, - batch_frames_inout=0, - preprocess_conf=config.preprocess_config, - n_iter_processes=1, - subsampling_factor=1, - num_encs=1) - - self.align_loader = BatchDataLoader( - json_file=config.test_manifest, - train_mode=False, - sortagrad=False, - batch_size=decode_batch_size, - maxlen_in=float('inf'), - maxlen_out=float('inf'), - minibatches=0, - mini_batch_size=1, - batch_count='auto', - batch_bins=0, - batch_frames_in=0, - batch_frames_out=0, - batch_frames_inout=0, - preprocess_conf=config.preprocess_config, - n_iter_processes=1, - subsampling_factor=1, - num_encs=1) + self.test_loader = DataLoaderFactory.get_dataloader('test', config, self.args) + self.align_loader = DataLoaderFactory.get_dataloader('align', config, self.args) logger.info("Setup test/align Dataloader!") def setup_model(self): @@ -452,7 +381,8 @@ class U2Tester(U2Trainer): def test(self): assert self.args.result_file self.model.eval() - logger.info(f"Test Total Examples: {len(self.test_loader.dataset)}") + if not self.use_streamdata: + logger.info(f"Test Total Examples: {len(self.test_loader.dataset)}") stride_ms = self.config.stride_ms error_rate_type = None diff --git a/paddlespeech/s2t/exps/u2_kaldi/model.py b/paddlespeech/s2t/exps/u2_kaldi/model.py index bc995977a..cb015c116 100644 --- a/paddlespeech/s2t/exps/u2_kaldi/model.py +++ b/paddlespeech/s2t/exps/u2_kaldi/model.py @@ -25,7 +25,7 @@ from paddle import distributed as dist from paddlespeech.s2t.frontend.featurizer import TextFeaturizer from paddlespeech.s2t.frontend.utility import load_dict -from paddlespeech.s2t.io.dataloader import BatchDataLoader +from paddlespeech.s2t.io.dataloader import DataLoaderFactory from paddlespeech.s2t.models.u2 import U2Model from paddlespeech.s2t.training.optimizer import OptimizerFactory from paddlespeech.s2t.training.scheduler import LRSchedulerFactory @@ -104,7 +104,8 @@ class U2Trainer(Trainer): @paddle.no_grad() def valid(self): self.model.eval() - logger.info(f"Valid Total Examples: {len(self.valid_loader.dataset)}") + if not self.use_streamdata: + logger.info(f"Valid Total Examples: {len(self.valid_loader.dataset)}") valid_losses = defaultdict(list) num_seen_utts = 1 total_loss = 0.0 @@ -131,7 +132,8 @@ class U2Trainer(Trainer): msg = f"Valid: Rank: {dist.get_rank()}, " msg += "epoch: {}, ".format(self.epoch) msg += "step: {}, ".format(self.iteration) - msg += "batch: {}/{}, ".format(i + 1, len(self.valid_loader)) + if not self.use_streamdata: + msg += "batch: {}/{}, ".format(i + 1, len(self.valid_loader)) msg += ', '.join('{}: {:>.6f}'.format(k, v) for k, v in valid_dump.items()) logger.info(msg) @@ -150,8 +152,8 @@ class U2Trainer(Trainer): # paddle.jit.save(script_model, script_model_path) self.before_train() - - logger.info(f"Train Total Examples: {len(self.train_loader.dataset)}") + if not self.use_streamdata: + logger.info(f"Train Total Examples: {len(self.train_loader.dataset)}") while self.epoch < self.config.n_epoch: with Timer("Epoch-Train Time Cost: {}"): self.model.train() @@ -162,7 +164,8 @@ class U2Trainer(Trainer): msg = "Train: Rank: {}, ".format(dist.get_rank()) msg += "epoch: {}, 
".format(self.epoch) msg += "step: {}, ".format(self.iteration) - msg += "batch : {}/{}, ".format(batch_index + 1, + if not self.use_streamdata: + msg += "batch : {}/{}, ".format(batch_index + 1, len(self.train_loader)) msg += "lr: {:>.8f}, ".format(self.lr_scheduler()) msg += "data time: {:>.3f}s, ".format(dataload_time) @@ -198,87 +201,23 @@ class U2Trainer(Trainer): self.new_epoch() def setup_dataloader(self): - config = self.config.clone() - # train/valid dataset, return token ids - self.train_loader = BatchDataLoader( - json_file=config.train_manifest, - train_mode=True, - sortagrad=False, - batch_size=config.batch_size, - maxlen_in=float('inf'), - maxlen_out=float('inf'), - minibatches=0, - mini_batch_size=self.args.ngpu, - batch_count='auto', - batch_bins=0, - batch_frames_in=0, - batch_frames_out=0, - batch_frames_inout=0, - preprocess_conf=config.preprocess_config, - n_iter_processes=config.num_workers, - subsampling_factor=1, - num_encs=1) - - self.valid_loader = BatchDataLoader( - json_file=config.dev_manifest, - train_mode=False, - sortagrad=False, - batch_size=config.batch_size, - maxlen_in=float('inf'), - maxlen_out=float('inf'), - minibatches=0, - mini_batch_size=self.args.ngpu, - batch_count='auto', - batch_bins=0, - batch_frames_in=0, - batch_frames_out=0, - batch_frames_inout=0, - preprocess_conf=None, - n_iter_processes=config.num_workers, - subsampling_factor=1, - num_encs=1) - - decode_batch_size = config.get('decode', dict()).get( - 'decode_batch_size', 1) - # test dataset, return raw text - self.test_loader = BatchDataLoader( - json_file=config.test_manifest, - train_mode=False, - sortagrad=False, - batch_size=decode_batch_size, - maxlen_in=float('inf'), - maxlen_out=float('inf'), - minibatches=0, - mini_batch_size=1, - batch_count='auto', - batch_bins=0, - batch_frames_in=0, - batch_frames_out=0, - batch_frames_inout=0, - preprocess_conf=None, - n_iter_processes=1, - subsampling_factor=1, - num_encs=1) - - self.align_loader = BatchDataLoader( - json_file=config.test_manifest, - train_mode=False, - sortagrad=False, - batch_size=decode_batch_size, - maxlen_in=float('inf'), - maxlen_out=float('inf'), - minibatches=0, - mini_batch_size=1, - batch_count='auto', - batch_bins=0, - batch_frames_in=0, - batch_frames_out=0, - batch_frames_inout=0, - preprocess_conf=None, - n_iter_processes=1, - subsampling_factor=1, - num_encs=1) - logger.info("Setup train/valid/test/align Dataloader!") + self.use_streamdata = config.get("use_stream_data", False) + if self.train: + config = self.config.clone() + self.train_loader = DataLoaderFactory.get_dataloader('train', config, self.args) + config = self.config.clone() + config['preprocess_config'] = None + self.valid_loader = DataLoaderFactory.get_dataloader('valid', config, self.args) + logger.info("Setup train/valid Dataloader!") + else: + config = self.config.clone() + config['preprocess_config'] = None + self.test_loader = DataLoaderFactory.get_dataloader('test', config, self.args) + config = self.config.clone() + config['preprocess_config'] = None + self.align_loader = DataLoaderFactory.get_dataloader('align', config, self.args) + logger.info("Setup test/align Dataloader!") + def setup_model(self): config = self.config @@ -406,7 +345,8 @@ class U2Tester(U2Trainer): def test(self): assert self.args.result_file self.model.eval() - logger.info(f"Test Total Examples: {len(self.test_loader.dataset)}") + if not self.use_streamdata: + logger.info(f"Test Total Examples: {len(self.test_loader.dataset)}") stride_ms = self.config.stride_ms 
error_rate_type = None diff --git a/paddlespeech/s2t/exps/u2_st/model.py b/paddlespeech/s2t/exps/u2_st/model.py index 6a32eda77..603825435 100644 --- a/paddlespeech/s2t/exps/u2_st/model.py +++ b/paddlespeech/s2t/exps/u2_st/model.py @@ -25,7 +25,7 @@ import paddle from paddle import distributed as dist from paddlespeech.s2t.frontend.featurizer import TextFeaturizer -from paddlespeech.s2t.io.dataloader import BatchDataLoader +from paddlespeech.s2t.io.dataloader import DataLoaderFactory from paddlespeech.s2t.models.u2_st import U2STModel from paddlespeech.s2t.training.optimizer import OptimizerFactory from paddlespeech.s2t.training.reporter import ObsScope @@ -120,7 +120,8 @@ class U2STTrainer(Trainer): @paddle.no_grad() def valid(self): self.model.eval() - logger.info(f"Valid Total Examples: {len(self.valid_loader.dataset)}") + if not self.use_streamdata: + logger.info(f"Valid Total Examples: {len(self.valid_loader.dataset)}") valid_losses = defaultdict(list) num_seen_utts = 1 total_loss = 0.0 @@ -153,7 +154,8 @@ class U2STTrainer(Trainer): msg = f"Valid: Rank: {dist.get_rank()}, " msg += "epoch: {}, ".format(self.epoch) msg += "step: {}, ".format(self.iteration) - msg += "batch: {}/{}, ".format(i + 1, len(self.valid_loader)) + if not self.use_streamdata: + msg += "batch: {}/{}, ".format(i + 1, len(self.valid_loader)) msg += ', '.join('{}: {:>.6f}'.format(k, v) for k, v in valid_dump.items()) logger.info(msg) @@ -172,8 +174,8 @@ class U2STTrainer(Trainer): # paddle.jit.save(script_model, script_model_path) self.before_train() - - logger.info(f"Train Total Examples: {len(self.train_loader.dataset)}") + if not self.use_streamdata: + logger.info(f"Train Total Examples: {len(self.train_loader.dataset)}") while self.epoch < self.config.n_epoch: with Timer("Epoch-Train Time Cost: {}"): self.model.train() @@ -191,7 +193,8 @@ class U2STTrainer(Trainer): self.train_batch(batch_index, batch, msg) self.after_train_batch() report('iter', batch_index + 1) - report('total', len(self.train_loader)) + if not self.use_streamdata: + report('total', len(self.train_loader)) report('reader_cost', dataload_time) observation['batch_cost'] = observation[ 'reader_cost'] + observation['step_cost'] @@ -241,79 +244,18 @@ class U2STTrainer(Trainer): load_transcript = True if config.model_conf.asr_weight > 0 else False + config = self.config.clone() + config['load_transcript'] = load_transcript + self.use_streamdata = config.get("use_stream_data", False) if self.train: - # train/valid dataset, return token ids - self.train_loader = BatchDataLoader( - json_file=config.train_manifest, - train_mode=True, - sortagrad=False, - batch_size=config.batch_size, - maxlen_in=config.maxlen_in, - maxlen_out=config.maxlen_out, - minibatches=0, - mini_batch_size=1, - batch_count='auto', - batch_bins=0, - batch_frames_in=0, - batch_frames_out=0, - batch_frames_inout=0, - preprocess_conf=config. - preprocess_config, # aug will be off when train_mode=False - n_iter_processes=config.num_workers, - subsampling_factor=1, - load_aux_output=load_transcript, - num_encs=1, - dist_sampler=True) - - self.valid_loader = BatchDataLoader( - json_file=config.dev_manifest, - train_mode=False, - sortagrad=False, - batch_size=config.batch_size, - maxlen_in=float('inf'), - maxlen_out=float('inf'), - minibatches=0, - mini_batch_size=1, - batch_count='auto', - batch_bins=0, - batch_frames_in=0, - batch_frames_out=0, - batch_frames_inout=0, - preprocess_conf=config. 
- preprocess_config, # aug will be off when train_mode=False - n_iter_processes=config.num_workers, - subsampling_factor=1, - load_aux_output=load_transcript, - num_encs=1, - dist_sampler=False) + self.train_loader = DataLoaderFactory.get_dataloader('train', config, self.args) + self.valid_loader = DataLoaderFactory.get_dataloader('valid', config, self.args) logger.info("Setup train/valid Dataloader!") else: - # test dataset, return raw text - decode_batch_size = config.get('decode', dict()).get( - 'decode_batch_size', 1) - self.test_loader = BatchDataLoader( - json_file=config.test_manifest, - train_mode=False, - sortagrad=False, - batch_size=decode_batch_size, - maxlen_in=float('inf'), - maxlen_out=float('inf'), - minibatches=0, - mini_batch_size=1, - batch_count='auto', - batch_bins=0, - batch_frames_in=0, - batch_frames_out=0, - batch_frames_inout=0, - preprocess_conf=config. - preprocess_config, # aug will be off when train_mode=False - n_iter_processes=config.num_workers, - subsampling_factor=1, - num_encs=1, - dist_sampler=False) - + self.test_loader = DataLoaderFactory.get_dataloader('test', config, self.args) logger.info("Setup test Dataloader!") + def setup_model(self): config = self.config model_conf = config @@ -468,7 +410,8 @@ class U2STTester(U2STTrainer): def test(self): assert self.args.result_file self.model.eval() - logger.info(f"Test Total Examples: {len(self.test_loader.dataset)}") + if not self.use_streamdata: + logger.info(f"Test Total Examples: {len(self.test_loader.dataset)}") decode_cfg = self.config.decode bleu_func = bleu_score.char_bleu if decode_cfg.error_rate_type == 'char-bleu' else bleu_score.bleu diff --git a/paddlespeech/s2t/frontend/augmentor/spec_augment.py b/paddlespeech/s2t/frontend/augmentor/spec_augment.py index e91cfdce4..380712851 100644 --- a/paddlespeech/s2t/frontend/augmentor/spec_augment.py +++ b/paddlespeech/s2t/frontend/augmentor/spec_augment.py @@ -16,7 +16,6 @@ import random import numpy as np from PIL import Image -from PIL.Image import BICUBIC from paddlespeech.s2t.frontend.augmentor.base import AugmentorBase from paddlespeech.s2t.utils.log import Log @@ -164,9 +163,9 @@ class SpecAugmentor(AugmentorBase): window) + 1 # 1 ... 
t - 1 left = Image.fromarray(x[:center]).resize((x.shape[1], warped), - BICUBIC) + Image.BICUBIC) right = Image.fromarray(x[center:]).resize((x.shape[1], t - warped), - BICUBIC) + Image.BICUBIC) if self.inplace: x[:warped] = left x[warped:] = right diff --git a/paddlespeech/s2t/frontend/featurizer/text_featurizer.py b/paddlespeech/s2t/frontend/featurizer/text_featurizer.py index 0c0fa5e2f..982c6b8fe 100644 --- a/paddlespeech/s2t/frontend/featurizer/text_featurizer.py +++ b/paddlespeech/s2t/frontend/featurizer/text_featurizer.py @@ -226,10 +226,10 @@ class TextFeaturizer(): sos_id = vocab_list.index(SOS) if SOS in vocab_list else -1 space_id = vocab_list.index(SPACE) if SPACE in vocab_list else -1 - logger.info(f"BLANK id: {blank_id}") - logger.info(f"UNK id: {unk_id}") - logger.info(f"EOS id: {eos_id}") - logger.info(f"SOS id: {sos_id}") - logger.info(f"SPACE id: {space_id}") - logger.info(f"MASKCTC id: {maskctc_id}") + logger.debug(f"BLANK id: {blank_id}") + logger.debug(f"UNK id: {unk_id}") + logger.debug(f"EOS id: {eos_id}") + logger.debug(f"SOS id: {sos_id}") + logger.debug(f"SPACE id: {space_id}") + logger.debug(f"MASKCTC id: {maskctc_id}") return token2id, id2token, vocab_list, unk_id, eos_id, blank_id diff --git a/paddlespeech/s2t/io/dataloader.py b/paddlespeech/s2t/io/dataloader.py index 55aa13ff1..735d29da2 100644 --- a/paddlespeech/s2t/io/dataloader.py +++ b/paddlespeech/s2t/io/dataloader.py @@ -18,6 +18,7 @@ from typing import Text import jsonlines import numpy as np +import paddle from paddle.io import BatchSampler from paddle.io import DataLoader from paddle.io import DistributedBatchSampler @@ -28,7 +29,11 @@ from paddlespeech.s2t.io.dataset import TransformDataset from paddlespeech.s2t.io.reader import LoadInputsAndTargets from paddlespeech.s2t.utils.log import Log -__all__ = ["BatchDataLoader"] +import paddlespeech.audio.streamdata as streamdata +from paddlespeech.audio.text.text_featurizer import TextFeaturizer +from yacs.config import CfgNode + +__all__ = ["BatchDataLoader", "StreamDataLoader"] logger = Log(__name__).getlog() @@ -56,6 +61,136 @@ def batch_collate(x): """ return x[0] +def read_preprocess_cfg(preprocess_conf_file): + augment_conf = dict() + preprocess_cfg = CfgNode(new_allowed=True) + preprocess_cfg.merge_from_file(preprocess_conf_file) + for idx, process in enumerate(preprocess_cfg["process"]): + opts = dict(process) + process_type = opts.pop("type") + if process_type == 'time_warp': + augment_conf['max_w'] = process['max_time_warp'] + augment_conf['w_inplace'] = process['inplace'] + augment_conf['w_mode'] = process['mode'] + if process_type == 'freq_mask': + augment_conf['max_f'] = process['F'] + augment_conf['num_f_mask'] = process['n_mask'] + augment_conf['f_inplace'] = process['inplace'] + augment_conf['f_replace_with_zero'] = process['replace_with_zero'] + if process_type == 'time_mask': + augment_conf['max_t'] = process['T'] + augment_conf['num_t_mask'] = process['n_mask'] + augment_conf['t_inplace'] = process['inplace'] + augment_conf['t_replace_with_zero'] = process['replace_with_zero'] + return augment_conf + +class StreamDataLoader(): + def __init__(self, + manifest_file: str, + train_mode: bool, + unit_type: str='char', + batch_size: int=0, + preprocess_conf=None, + num_mel_bins=80, + frame_length=25, + frame_shift=10, + dither=0.0, + minlen_in: float=0.0, + maxlen_in: float=float('inf'), + minlen_out: float=0.0, + maxlen_out: float=float('inf'), + resample_rate: int=16000, + shuffle_size: int=10000, + sort_size: int=1000, + n_iter_processes: 
int=1, + prefetch_factor: int=2, + dist_sampler: bool=False, + cmvn_file="data/mean_std.json", + vocab_filepath='data/lang_char/vocab.txt'): + self.manifest_file = manifest_file + self.train_mode = train_mode + self.batch_size = batch_size + self.prefetch_factor = prefetch_factor + self.dist_sampler = dist_sampler + self.n_iter_processes = n_iter_processes + + text_featurizer = TextFeaturizer(unit_type, vocab_filepath) + symbol_table = text_featurizer.vocab_dict + self.feat_dim = num_mel_bins + self.vocab_size = text_featurizer.vocab_size + + augment_conf = read_preprocess_cfg(preprocess_conf) + + # The list of shards + shardlist = [] + with open(manifest_file, "r") as f: + for line in f.readlines(): + shardlist.append(line.strip()) + world_size = 1 + try: + world_size = paddle.distributed.get_world_size() + except Exception as e: + logger.warning(e) + logger.warning("cannot get world_size using paddle.distributed.get_world_size(), use world_size=1") + assert len(shardlist) >= world_size, "the length of the shard list should be >= the number of gpus/xpus/..." + + update_n_iter_processes = int(max(min(len(shardlist)/world_size - 1, self.n_iter_processes), 0)) + logger.info(f"update_n_iter_processes {update_n_iter_processes}") + if update_n_iter_processes != self.n_iter_processes: + self.n_iter_processes = update_n_iter_processes + logger.info(f"change num_workers to {self.n_iter_processes}") + + if self.dist_sampler: + base_dataset = streamdata.DataPipeline( + streamdata.SimpleShardList(shardlist), + streamdata.split_by_node if train_mode else streamdata.placeholder(), + streamdata.split_by_worker, + streamdata.tarfile_to_samples(streamdata.reraise_exception) + ) + else: + base_dataset = streamdata.DataPipeline( + streamdata.SimpleShardList(shardlist), + streamdata.split_by_worker, + streamdata.tarfile_to_samples(streamdata.reraise_exception) + ) + + self.dataset = base_dataset.append_list( + streamdata.audio_tokenize(symbol_table), + streamdata.audio_data_filter(frame_shift=frame_shift, max_length=maxlen_in, min_length=minlen_in, token_max_length=maxlen_out, token_min_length=minlen_out), + streamdata.audio_resample(resample_rate=resample_rate), + streamdata.audio_compute_fbank(num_mel_bins=num_mel_bins, frame_length=frame_length, frame_shift=frame_shift, dither=dither), + streamdata.audio_spec_aug(**augment_conf) if train_mode else streamdata.placeholder(), # num_t_mask=2, num_f_mask=2, max_t=40, max_f=30, max_w=80) + streamdata.shuffle(shuffle_size), + streamdata.sort(sort_size=sort_size), + streamdata.batched(batch_size), + streamdata.audio_padding(), + streamdata.audio_cmvn(cmvn_file) + ) + + if paddle.__version__ >= '2.3.2': + self.loader = streamdata.WebLoader( + self.dataset, + num_workers=self.n_iter_processes, + prefetch_factor=self.prefetch_factor, + batch_size=None + ) + else: + self.loader = streamdata.WebLoader( + self.dataset, + num_workers=self.n_iter_processes, + batch_size=None + ) + + def __iter__(self): + return self.loader.__iter__() + + def __call__(self): + return self.__iter__() + + def __len__(self): + logger.info("Stream dataloader does not support calculating the length of the dataset") + return -1 + class BatchDataLoader(): def __init__(self, @@ -199,3 +334,120 @@ class BatchDataLoader(): echo += f"shortest_first: {self.shortest_first}, " echo += f"file: {self.json_file}" return echo + + +class DataLoaderFactory(): + @staticmethod + def get_dataloader(mode: str, config, args): + config = config.clone() + use_streamdata = config.get("use_stream_data", False) + if
use_streamdata: + if mode == 'train': + config['manifest'] = config.train_manifest + config['train_mode'] = True + elif mode == 'valid': + config['manifest'] = config.dev_manifest + config['train_mode'] = False + elif mode == 'test' or mode == 'align': + config['manifest'] = config.test_manifest + config['train_mode'] = False + config['dither'] = 0.0 + config['minlen_in'] = 0.0 + config['maxlen_in'] = float('inf') + config['minlen_out'] = 0 + config['maxlen_out'] = float('inf') + config['dist_sampler'] = False + else: + raise KeyError("Invalid mode type! Please input one of 'train', 'valid', 'test', 'align'.") + return StreamDataLoader( + manifest_file=config.manifest, + train_mode=config.train_mode, + unit_type=config.unit_type, + preprocess_conf=config.preprocess_config, + batch_size=config.batch_size, + num_mel_bins=config.feat_dim, + frame_length=config.window_ms, + frame_shift=config.stride_ms, + dither=config.dither, + minlen_in=config.minlen_in, + maxlen_in=config.maxlen_in, + minlen_out=config.minlen_out, + maxlen_out=config.maxlen_out, + resample_rate=config.resample_rate, + shuffle_size=config.shuffle_size, + sort_size=config.sort_size, + n_iter_processes=config.num_workers, + prefetch_factor=config.prefetch_factor, + dist_sampler=config.dist_sampler, + cmvn_file=config.cmvn_file, + vocab_filepath=config.vocab_filepath, + ) + else: + if mode == 'train': + config['manifest'] = config.train_manifest + config['train_mode'] = True + config['mini_batch_size'] = args.ngpu + config['subsampling_factor'] = 1 + config['num_encs'] = 1 + config['shortest_first'] = False + elif mode == 'valid': + config['manifest'] = config.dev_manifest + config['train_mode'] = False + config['sortagrad'] = False + config['maxlen_in'] = float('inf') + config['maxlen_out'] = float('inf') + config['minibatches'] = 0 + config['mini_batch_size'] = args.ngpu + config['batch_count'] = 'auto' + config['batch_bins'] = 0 + config['batch_frames_in'] = 0 + config['batch_frames_out'] = 0 + config['batch_frames_inout'] = 0 + config['subsampling_factor'] = 1 + config['num_encs'] = 1 + config['shortest_first'] = False + elif mode == 'test' or mode == 'align': + config['manifest'] = config.test_manifest + config['train_mode'] = False + config['sortagrad'] = False + config['batch_size'] = config.get('decode', dict()).get( + 'decode_batch_size', 1) + config['maxlen_in'] = float('inf') + config['maxlen_out'] = float('inf') + config['minibatches'] = 0 + config['mini_batch_size'] = 1 + config['batch_count'] = 'auto' + config['batch_bins'] = 0 + config['batch_frames_in'] = 0 + config['batch_frames_out'] = 0 + config['batch_frames_inout'] = 0 + config['num_workers'] = 1 + config['subsampling_factor'] = 1 + config['num_encs'] = 1 + config['dist_sampler'] = False + config['shortest_first'] = False + else: + raise KeyError("Invalid mode type! Please input one of 'train', 'valid', 'test', 'align'.") + + return BatchDataLoader( + json_file=config.manifest, + train_mode=config.train_mode, + sortagrad=config.sortagrad, + batch_size=config.batch_size, + maxlen_in=config.maxlen_in, + maxlen_out=config.maxlen_out, + minibatches=config.minibatches, + mini_batch_size=config.mini_batch_size, + batch_count=config.batch_count, + batch_bins=config.batch_bins, + batch_frames_in=config.batch_frames_in, + batch_frames_out=config.batch_frames_out, + batch_frames_inout=config.batch_frames_inout, + preprocess_conf=config.preprocess_config, + n_iter_processes=config.num_workers, + subsampling_factor=config.subsampling_factor,
load_aux_output=config.get('load_transcript', None), + num_encs=config.num_encs, + dist_sampler=config.dist_sampler, + shortest_first=config.shortest_first) + diff --git a/paddlespeech/s2t/io/reader.py b/paddlespeech/s2t/io/reader.py index 4e136bdce..5e018befb 100644 --- a/paddlespeech/s2t/io/reader.py +++ b/paddlespeech/s2t/io/reader.py @@ -19,7 +19,7 @@ import numpy as np import soundfile from .utility import feat_type -from paddlespeech.s2t.transform.transformation import Transformation +from paddlespeech.audio.transform.transformation import Transformation from paddlespeech.s2t.utils.log import Log # from paddlespeech.s2t.frontend.augmentor.augmentation import AugmentationPipeline as Transformation diff --git a/paddlespeech/s2t/models/u2/u2.py b/paddlespeech/s2t/models/u2/u2.py index b4b61666f..e19f411cf 100644 --- a/paddlespeech/s2t/models/u2/u2.py +++ b/paddlespeech/s2t/models/u2/u2.py @@ -29,6 +29,9 @@ import paddle from paddle import jit from paddle import nn +from paddlespeech.audio.utils.tensor_utils import add_sos_eos +from paddlespeech.audio.utils.tensor_utils import pad_sequence +from paddlespeech.audio.utils.tensor_utils import th_accuracy from paddlespeech.s2t.decoders.scorers.ctc import CTCPrefixScorer from paddlespeech.s2t.frontend.utility import IGNORE_ID from paddlespeech.s2t.frontend.utility import load_cmvn @@ -48,9 +51,6 @@ from paddlespeech.s2t.utils import checkpoint from paddlespeech.s2t.utils import layer_tools from paddlespeech.s2t.utils.ctc_utils import remove_duplicates_and_blank from paddlespeech.s2t.utils.log import Log -from paddlespeech.s2t.utils.tensor_utils import add_sos_eos -from paddlespeech.s2t.utils.tensor_utils import pad_sequence -from paddlespeech.s2t.utils.tensor_utils import th_accuracy from paddlespeech.s2t.utils.utility import log_add from paddlespeech.s2t.utils.utility import UpdateConfig @@ -318,7 +318,7 @@ class U2BaseModel(ASRInterface, nn.Layer): dim=1) # (B*N, i+1) # 2.6 Update end flag - end_flag = paddle.eq(hyps[:, -1], self.eos).view(-1, 1) + end_flag = paddle.equal(hyps[:, -1], self.eos).view(-1, 1) # 3. Select best of best scores = scores.view(batch_size, beam_size) @@ -605,29 +605,42 @@ class U2BaseModel(ASRInterface, nn.Layer): xs: paddle.Tensor, offset: int, required_cache_size: int, - subsampling_cache: Optional[paddle.Tensor]=None, - elayers_output_cache: Optional[List[paddle.Tensor]]=None, - conformer_cnn_cache: Optional[List[paddle.Tensor]]=None, - ) -> Tuple[paddle.Tensor, paddle.Tensor, List[paddle.Tensor], List[ - paddle.Tensor]]: + att_cache: paddle.Tensor=paddle.zeros([0, 0, 0, 0]), + cnn_cache: paddle.Tensor=paddle.zeros([0, 0, 0, 0]), + ) -> Tuple[paddle.Tensor, paddle.Tensor, paddle.Tensor]: """ Export interface for c++ call, give input chunk xs, and return output from time 0 to current chunk. 
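Editorial sketch, not part of the patch: the three separate caches of the old interface collapse into two stacked tensors that the caller simply threads through the chunk loop, exactly as BaseEncoder.forward_chunk_by_chunk does further down in this diff. Here `encoder` is a trained encoder from this repo and `chunks` is an iterator of (1, time, mel-dim) feature slices; both are assumptions, as are the chunk sizes.

import paddle

att_cache = paddle.zeros([0, 0, 0, 0])   # becomes (elayers, head, cache_t1, d_k*2) after the 1st chunk
cnn_cache = paddle.zeros([0, 0, 0, 0])   # becomes (elayers, B=1, hidden-dim, cache_t2) after the 1st chunk
required_cache_size = 16 * 4              # chunk_size * num_decoding_left_chunks (assumed values)
offset, outputs = 0, []
for chunk_xs in chunks:                   # hypothetical chunk iterator
    y, att_cache, cnn_cache = encoder.forward_chunk(
        chunk_xs, offset, required_cache_size, att_cache, cnn_cache)
    outputs.append(y)
    offset += y.shape[1]
ys = paddle.concat(outputs, axis=1)       # output for frames 0 .. current chunk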
+ Args: - xs (paddle.Tensor): chunk input - subsampling_cache (Optional[paddle.Tensor]): subsampling cache - elayers_output_cache (Optional[List[paddle.Tensor]]): - transformer/conformer encoder layers output cache - conformer_cnn_cache (Optional[List[paddle.Tensor]]): conformer - cnn cache + xs (paddle.Tensor): chunk input, with shape (b=1, time, mel-dim), + where `time == (chunk_size - 1) * subsample_rate + \ + subsample.right_context + 1` + offset (int): current offset in encoder output time stamp + required_cache_size (int): cache size required for next chunk + compuation + >=0: actual cache size + <0: means all history cache is required + att_cache (paddle.Tensor): cache tensor for KEY & VALUE in + transformer/conformer attention, with shape + (elayers, head, cache_t1, d_k * 2), where + `head * d_k == hidden-dim` and + `cache_t1 == chunk_size * num_decoding_left_chunks`. + `d_k * 2` for att key & value. + cnn_cache (paddle.Tensor): cache tensor for cnn_module in conformer, + (elayers, b=1, hidden-dim, cache_t2), where + `cache_t2 == cnn.lorder - 1`. + Returns: - paddle.Tensor: output, it ranges from time 0 to current chunk. - paddle.Tensor: subsampling cache - List[paddle.Tensor]: attention cache - List[paddle.Tensor]: conformer cnn cache + paddle.Tensor: output of current input xs, + with shape (b=1, chunk_size, hidden-dim). + paddle.Tensor: new attention cache required for next chunk, with + dynamic shape (elayers, head, T(?), d_k * 2) + depending on required_cache_size. + paddle.Tensor: new conformer cnn cache required for next chunk, with + same shape as the original cnn_cache. """ - return self.encoder.forward_chunk( - xs, offset, required_cache_size, subsampling_cache, - elayers_output_cache, conformer_cnn_cache) + return self.encoder.forward_chunk(xs, offset, required_cache_size, + att_cache, cnn_cache) # @jit.to_static def ctc_activation(self, xs: paddle.Tensor) -> paddle.Tensor: @@ -827,7 +840,7 @@ class U2Model(U2DecodeModel): # encoder encoder_type = configs.get('encoder', 'transformer') - logger.info(f"U2 Encoder type: {encoder_type}") + logger.debug(f"U2 Encoder type: {encoder_type}") if encoder_type == 'transformer': encoder = TransformerEncoder( input_dim, global_cmvn=global_cmvn, **configs['encoder_conf']) @@ -894,7 +907,7 @@ class U2Model(U2DecodeModel): if checkpoint_path: infos = checkpoint.Checkpoint().load_parameters( model, checkpoint_path=checkpoint_path) - logger.info(f"checkpoint info: {infos}") + logger.debug(f"checkpoint info: {infos}") layer_tools.summary(model) return model diff --git a/paddlespeech/s2t/models/u2_st/u2_st.py b/paddlespeech/s2t/models/u2_st/u2_st.py index 6447753c5..e86bbedfa 100644 --- a/paddlespeech/s2t/models/u2_st/u2_st.py +++ b/paddlespeech/s2t/models/u2_st/u2_st.py @@ -38,8 +38,8 @@ from paddlespeech.s2t.modules.mask import subsequent_mask from paddlespeech.s2t.utils import checkpoint from paddlespeech.s2t.utils import layer_tools from paddlespeech.s2t.utils.log import Log -from paddlespeech.s2t.utils.tensor_utils import add_sos_eos -from paddlespeech.s2t.utils.tensor_utils import th_accuracy +from paddlespeech.audio.utils.tensor_utils import add_sos_eos +from paddlespeech.audio.utils.tensor_utils import th_accuracy from paddlespeech.s2t.utils.utility import UpdateConfig __all__ = ["U2STModel", "U2STInferModel"] @@ -401,29 +401,42 @@ class U2STBaseModel(nn.Layer): xs: paddle.Tensor, offset: int, required_cache_size: int, - subsampling_cache: Optional[paddle.Tensor]=None, - elayers_output_cache: Optional[List[paddle.Tensor]]=None, - 
conformer_cnn_cache: Optional[List[paddle.Tensor]]=None, - ) -> Tuple[paddle.Tensor, paddle.Tensor, List[paddle.Tensor], List[ - paddle.Tensor]]: + att_cache: paddle.Tensor = paddle.zeros([0, 0, 0, 0]), + cnn_cache: paddle.Tensor = paddle.zeros([0, 0, 0, 0]), + ) -> Tuple[paddle.Tensor, paddle.Tensor, paddle.Tensor]: """ Export interface for c++ call, give input chunk xs, and return output from time 0 to current chunk. + Args: - xs (paddle.Tensor): chunk input - subsampling_cache (Optional[paddle.Tensor]): subsampling cache - elayers_output_cache (Optional[List[paddle.Tensor]]): - transformer/conformer encoder layers output cache - conformer_cnn_cache (Optional[List[paddle.Tensor]]): conformer - cnn cache + xs (paddle.Tensor): chunk input, with shape (b=1, time, mel-dim), + where `time == (chunk_size - 1) * subsample_rate + \ + subsample.right_context + 1` + offset (int): current offset in encoder output time stamp + required_cache_size (int): cache size required for next chunk + compuation + >=0: actual cache size + <0: means all history cache is required + att_cache (paddle.Tensor): cache tensor for KEY & VALUE in + transformer/conformer attention, with shape + (elayers, head, cache_t1, d_k * 2), where + `head * d_k == hidden-dim` and + `cache_t1 == chunk_size * num_decoding_left_chunks`. + `d_k * 2` for att key & value. + cnn_cache (paddle.Tensor): cache tensor for cnn_module in conformer, + (elayers, b=1, hidden-dim, cache_t2), where + `cache_t2 == cnn.lorder - 1` + Returns: - paddle.Tensor: output, it ranges from time 0 to current chunk. - paddle.Tensor: subsampling cache - List[paddle.Tensor]: attention cache - List[paddle.Tensor]: conformer cnn cache + paddle.Tensor: output of current input xs, + with shape (b=1, chunk_size, hidden-dim). + paddle.Tensor: new attention cache required for next chunk, with + dynamic shape (elayers, head, T(?), d_k * 2) + depending on required_cache_size. + paddle.Tensor: new conformer cnn cache required for next chunk, with + same shape as the original cnn_cache. """ return self.encoder.forward_chunk( - xs, offset, required_cache_size, subsampling_cache, - elayers_output_cache, conformer_cnn_cache) + xs, offset, required_cache_size, att_cache, cnn_cache) # @jit.to_static def ctc_activation(self, xs: paddle.Tensor) -> paddle.Tensor: diff --git a/paddlespeech/s2t/modules/align.py b/paddlespeech/s2t/modules/align.py index ad71ee021..cacda2461 100644 --- a/paddlespeech/s2t/modules/align.py +++ b/paddlespeech/s2t/modules/align.py @@ -13,8 +13,7 @@ # limitations under the License. import paddle from paddle import nn - -from paddlespeech.s2t.modules.initializer import KaimingUniform +import math """ To align the initializer between paddle and torch, the API below are set defalut initializer with priority higger than global initializer. 
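Editorial sketch, not part of the patch: with the local KaimingUniform wrapper removed, the wrapped layers below build their ParamAttr from Paddle's built-in initializer. Shown standalone for clarity, assuming a Paddle release that accepts the negative_slope/nonlinearity arguments (the patch itself depends on this); the layer sizes are arbitrary.

import math
import paddle
from paddle import nn

# Same initializer call the patched Linear/Conv1D/Conv2D wrappers use when
# global_init_type == "kaiming_uniform".
init = nn.initializer.KaimingUniform(
    fan_in=None, negative_slope=math.sqrt(5), nonlinearity='leaky_relu')
linear = nn.Linear(256, 256,
                   weight_attr=paddle.ParamAttr(initializer=init),
                   bias_attr=paddle.ParamAttr(initializer=init))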
@@ -82,10 +81,10 @@ class Linear(nn.Linear): name=None): if weight_attr is None: if global_init_type == "kaiming_uniform": - weight_attr = paddle.ParamAttr(initializer=KaimingUniform()) + weight_attr = paddle.ParamAttr(initializer=nn.initializer.KaimingUniform(fan_in=None, negative_slope=math.sqrt(5), nonlinearity='leaky_relu')) if bias_attr is None: if global_init_type == "kaiming_uniform": - bias_attr = paddle.ParamAttr(initializer=KaimingUniform()) + bias_attr = paddle.ParamAttr(initializer=nn.initializer.KaimingUniform(fan_in=None, negative_slope=math.sqrt(5), nonlinearity='leaky_relu')) super(Linear, self).__init__(in_features, out_features, weight_attr, bias_attr, name) @@ -105,10 +104,10 @@ class Conv1D(nn.Conv1D): data_format='NCL'): if weight_attr is None: if global_init_type == "kaiming_uniform": - weight_attr = paddle.ParamAttr(initializer=KaimingUniform()) + weight_attr = paddle.ParamAttr(initializer=nn.initializer.KaimingUniform(fan_in=None, negative_slope=math.sqrt(5), nonlinearity='leaky_relu')) if bias_attr is None: if global_init_type == "kaiming_uniform": - bias_attr = paddle.ParamAttr(initializer=KaimingUniform()) + bias_attr = paddle.ParamAttr(initializer=nn.initializer.KaimingUniform(fan_in=None, negative_slope=math.sqrt(5), nonlinearity='leaky_relu')) super(Conv1D, self).__init__( in_channels, out_channels, kernel_size, stride, padding, dilation, groups, padding_mode, weight_attr, bias_attr, data_format) @@ -129,10 +128,10 @@ class Conv2D(nn.Conv2D): data_format='NCHW'): if weight_attr is None: if global_init_type == "kaiming_uniform": - weight_attr = paddle.ParamAttr(initializer=KaimingUniform()) + weight_attr = paddle.ParamAttr(initializer=nn.initializer.KaimingUniform(fan_in=None, negative_slope=math.sqrt(5), nonlinearity='leaky_relu')) if bias_attr is None: if global_init_type == "kaiming_uniform": - bias_attr = paddle.ParamAttr(initializer=KaimingUniform()) + bias_attr = paddle.ParamAttr(initializer=nn.initializer.KaimingUniform(fan_in=None, negative_slope=math.sqrt(5), nonlinearity='leaky_relu')) super(Conv2D, self).__init__( in_channels, out_channels, kernel_size, stride, padding, dilation, groups, padding_mode, weight_attr, bias_attr, data_format) diff --git a/paddlespeech/s2t/modules/attention.py b/paddlespeech/s2t/modules/attention.py index 438efd2a1..b6d615867 100644 --- a/paddlespeech/s2t/modules/attention.py +++ b/paddlespeech/s2t/modules/attention.py @@ -84,9 +84,10 @@ class MultiHeadedAttention(nn.Layer): return q, k, v def forward_attention(self, - value: paddle.Tensor, - scores: paddle.Tensor, - mask: Optional[paddle.Tensor]) -> paddle.Tensor: + value: paddle.Tensor, + scores: paddle.Tensor, + mask: paddle.Tensor = paddle.ones([0, 0, 0], dtype=paddle.bool), + ) -> paddle.Tensor: """Compute attention context vector. Args: value (paddle.Tensor): Transformed value, size @@ -94,14 +95,23 @@ class MultiHeadedAttention(nn.Layer): scores (paddle.Tensor): Attention score, size (#batch, n_head, time1, time2). mask (paddle.Tensor): Mask, size (#batch, 1, time2) or - (#batch, time1, time2). + (#batch, time1, time2), (0, 0, 0) means fake mask. Returns: - paddle.Tensor: Transformed value weighted - by the attention score, (#batch, time1, d_model). + paddle.Tensor: Transformed value (#batch, time1, d_model) + weighted by the attention score (#batch, time1, time2). """ n_batch = value.shape[0] - if mask is not None: - mask = mask.unsqueeze(1).eq(0) # (batch, 1, *, time2) + + # When `if mask.size(2) > 0` be True: + # 1. training. + # 2. 
oonx(16/4, chunk_size/history_size), feed real cache and real mask for the 1st chunk. + # When will `if mask.size(2) > 0` be False? + # 1. onnx(16/-1, -1/-1, 16/0) + # 2. jit (16/-1, -1/-1, 16/0, 16/4) + if paddle.shape(mask)[2] > 0: # time2 > 0 + mask = mask.unsqueeze(1).equal(0) # (batch, 1, *, time2) + # for last chunk, time2 might be larger than scores.size(-1) + mask = mask[:, :, :, :paddle.shape(scores)[-1]] scores = scores.masked_fill(mask, -float('inf')) attn = paddle.softmax( scores, axis=-1).masked_fill(mask, @@ -121,21 +131,66 @@ class MultiHeadedAttention(nn.Layer): query: paddle.Tensor, key: paddle.Tensor, value: paddle.Tensor, - mask: Optional[paddle.Tensor]) -> paddle.Tensor: + mask: paddle.Tensor = paddle.ones([0,0,0], dtype=paddle.bool), + pos_emb: paddle.Tensor = paddle.empty([0]), + cache: paddle.Tensor = paddle.zeros([0,0,0,0]) + ) -> Tuple[paddle.Tensor, paddle.Tensor]: """Compute scaled dot product attention. - Args: - query (torch.Tensor): Query tensor (#batch, time1, size). - key (torch.Tensor): Key tensor (#batch, time2, size). - value (torch.Tensor): Value tensor (#batch, time2, size). - mask (torch.Tensor): Mask tensor (#batch, 1, time2) or + Args: + query (paddle.Tensor): Query tensor (#batch, time1, size). + key (paddle.Tensor): Key tensor (#batch, time2, size). + value (paddle.Tensor): Value tensor (#batch, time2, size). + mask (paddle.Tensor): Mask tensor (#batch, 1, time2) or (#batch, time1, time2). + 1.When applying cross attention between decoder and encoder, + the batch padding mask for input is in (#batch, 1, T) shape. + 2.When applying self attention of encoder, + the mask is in (#batch, T, T) shape. + 3.When applying self attention of decoder, + the mask is in (#batch, L, L) shape. + 4.If the different position in decoder see different block + of the encoder, such as Mocha, the passed in mask could be + in (#batch, L, T) shape. But there is no such case in current + Wenet. + cache (paddle.Tensor): Cache tensor (1, head, cache_t, d_k * 2), + where `cache_t == chunk_size * num_decoding_left_chunks` + and `head * d_k == size` Returns: - torch.Tensor: Output tensor (#batch, time1, d_model). + paddle.Tensor: Output tensor (#batch, time1, d_model). + paddle.Tensor: Cache tensor (1, head, cache_t + time1, d_k * 2) + where `cache_t == chunk_size * num_decoding_left_chunks` + and `head * d_k == size` + """ q, k, v = self.forward_qkv(query, key, value) + + # when export onnx model, for 1st chunk, we feed + # cache(1, head, 0, d_k * 2) (16/-1, -1/-1, 16/0 mode) + # or cache(1, head, real_cache_t, d_k * 2) (16/4 mode). + # In all modes, `if cache.size(0) > 0` will alwayse be `True` + # and we will always do splitting and + # concatnation(this will simplify onnx export). Note that + # it's OK to concat & split zero-shaped tensors(see code below). + # when export jit model, for 1st chunk, we always feed + # cache(0, 0, 0, 0) since jit supports dynamic if-branch. + # >>> a = torch.ones((1, 2, 0, 4)) + # >>> b = torch.ones((1, 2, 3, 4)) + # >>> c = torch.cat((a, b), dim=2) + # >>> torch.equal(b, c) # True + # >>> d = torch.split(a, 2, dim=-1) + # >>> torch.equal(d[0], d[1]) # True + if paddle.shape(cache)[0] > 0: + # last dim `d_k * 2` for (key, val) + key_cache, value_cache = paddle.split(cache, 2, axis=-1) + k = paddle.concat([key_cache, k], axis=2) + v = paddle.concat([value_cache, v], axis=2) + # We do cache slicing in encoder.forward_chunk, since it's + # non-trivial to calculate `next_cache_start` here. 
+ new_cache = paddle.concat((k, v), axis=-1) + scores = paddle.matmul(q, k.transpose([0, 1, 3, 2])) / math.sqrt(self.d_k) - return self.forward_attention(v, scores, mask) + return self.forward_attention(v, scores, mask), new_cache class RelPositionMultiHeadedAttention(MultiHeadedAttention): @@ -192,23 +247,55 @@ class RelPositionMultiHeadedAttention(MultiHeadedAttention): query: paddle.Tensor, key: paddle.Tensor, value: paddle.Tensor, - pos_emb: paddle.Tensor, - mask: Optional[paddle.Tensor]): + mask: paddle.Tensor = paddle.ones([0,0,0], dtype=paddle.bool), + pos_emb: paddle.Tensor = paddle.empty([0]), + cache: paddle.Tensor = paddle.zeros([0,0,0,0]) + ) -> Tuple[paddle.Tensor, paddle.Tensor]: """Compute 'Scaled Dot Product Attention' with rel. positional encoding. Args: query (paddle.Tensor): Query tensor (#batch, time1, size). key (paddle.Tensor): Key tensor (#batch, time2, size). value (paddle.Tensor): Value tensor (#batch, time2, size). - pos_emb (paddle.Tensor): Positional embedding tensor - (#batch, time1, size). mask (paddle.Tensor): Mask tensor (#batch, 1, time2) or - (#batch, time1, time2). + (#batch, time1, time2), (0, 0, 0) means fake mask. + pos_emb (paddle.Tensor): Positional embedding tensor + (#batch, time2, size). + cache (paddle.Tensor): Cache tensor (1, head, cache_t, d_k * 2), + where `cache_t == chunk_size * num_decoding_left_chunks` + and `head * d_k == size` Returns: paddle.Tensor: Output tensor (#batch, time1, d_model). + paddle.Tensor: Cache tensor (1, head, cache_t + time1, d_k * 2) + where `cache_t == chunk_size * num_decoding_left_chunks` + and `head * d_k == size` """ q, k, v = self.forward_qkv(query, key, value) q = q.transpose([0, 2, 1, 3]) # (batch, time1, head, d_k) + # when export onnx model, for 1st chunk, we feed + # cache(1, head, 0, d_k * 2) (16/-1, -1/-1, 16/0 mode) + # or cache(1, head, real_cache_t, d_k * 2) (16/4 mode). + # In all modes, `if cache.size(0) > 0` will alwayse be `True` + # and we will always do splitting and + # concatnation(this will simplify onnx export). Note that + # it's OK to concat & split zero-shaped tensors(see code below). + # when export jit model, for 1st chunk, we always feed + # cache(0, 0, 0, 0) since jit supports dynamic if-branch. + # >>> a = torch.ones((1, 2, 0, 4)) + # >>> b = torch.ones((1, 2, 3, 4)) + # >>> c = torch.cat((a, b), dim=2) + # >>> torch.equal(b, c) # True + # >>> d = torch.split(a, 2, dim=-1) + # >>> torch.equal(d[0], d[1]) # True + if paddle.shape(cache)[0] > 0: + # last dim `d_k * 2` for (key, val) + key_cache, value_cache = paddle.split(cache, 2, axis=-1) + k = paddle.concat([key_cache, k], axis=2) + v = paddle.concat([value_cache, v], axis=2) + # We do cache slicing in encoder.forward_chunk, since it's + # non-trivial to calculate `next_cache_start` here. 
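Editorial sketch, not part of the patch: a Paddle translation of the torch sanity check quoted in the comments above. Concatenating a zero-length cache along the time axis leaves k/v unchanged and splitting it back is equally harmless, which is why the export path can always run the cache branch; the behaviour is assumed to match torch, as this patch relies on it.

import paddle

empty_cache = paddle.ones([1, 2, 0, 4])          # (1, head, cache_t=0, d_k*2)
k_new = paddle.ones([1, 2, 3, 4])
k = paddle.concat([empty_cache, k_new], axis=2)  # concat with a zero-length cache
print(bool(paddle.equal_all(k, k_new)))          # True: k is unchanged

key_cache, value_cache = paddle.split(empty_cache, 2, axis=-1)
print(key_cache.shape, value_cache.shape)        # [1, 2, 0, 2] [1, 2, 0, 2]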
+ new_cache = paddle.concat((k, v), axis=-1) + n_batch_pos = pos_emb.shape[0] p = self.linear_pos(pos_emb).view(n_batch_pos, -1, self.h, self.d_k) p = p.transpose([0, 2, 1, 3]) # (batch, head, time1, d_k) @@ -234,4 +321,4 @@ class RelPositionMultiHeadedAttention(MultiHeadedAttention): scores = (matrix_ac + matrix_bd) / math.sqrt( self.d_k) # (batch, head, time1, time2) - return self.forward_attention(v, scores, mask) + return self.forward_attention(v, scores, mask), new_cache diff --git a/paddlespeech/s2t/modules/conformer_convolution.py b/paddlespeech/s2t/modules/conformer_convolution.py index 89e652688..c384b9c78 100644 --- a/paddlespeech/s2t/modules/conformer_convolution.py +++ b/paddlespeech/s2t/modules/conformer_convolution.py @@ -108,15 +108,17 @@ class ConvolutionModule(nn.Layer): def forward(self, x: paddle.Tensor, - mask_pad: Optional[paddle.Tensor]=None, - cache: Optional[paddle.Tensor]=None + mask_pad: paddle.Tensor= paddle.ones([0,0,0], dtype=paddle.bool), + cache: paddle.Tensor= paddle.zeros([0,0,0]), ) -> Tuple[paddle.Tensor, paddle.Tensor]: """Compute convolution module. Args: x (paddle.Tensor): Input tensor (#batch, time, channels). - mask_pad (paddle.Tensor): used for batch padding, (#batch, channels, time). + mask_pad (paddle.Tensor): used for batch padding (#batch, 1, time), + (0, 0, 0) means fake mask. cache (paddle.Tensor): left context cache, it is only - used in causal convolution. (#batch, channels, time') + used in causal convolution (#batch, channels, cache_t), + (0, 0, 0) meas fake cache. Returns: paddle.Tensor: Output tensor (#batch, time, channels). paddle.Tensor: Output cache tensor (#batch, channels, time') @@ -125,11 +127,11 @@ class ConvolutionModule(nn.Layer): x = x.transpose([0, 2, 1]) # [B, C, T] # mask batch padding - if mask_pad is not None: + if paddle.shape(mask_pad)[2] > 0: # time > 0 x = x.masked_fill(mask_pad, 0.0) if self.lorder > 0: - if cache is None: + if paddle.shape(cache)[2] == 0: # cache_t == 0 x = nn.functional.pad( x, [self.lorder, 0], 'constant', 0.0, data_format='NCL') else: @@ -143,7 +145,7 @@ class ConvolutionModule(nn.Layer): # It's better we just return None if no cache is requried, # However, for JIT export, here we just fake one tensor instead of # None. 
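Editorial sketch, not part of the patch: since Optional/None arguments are gone, tensors with a zero-sized time dimension act as "no cache" / "no mask" sentinels, so the branch condition becomes a plain shape test that survives dynamic-to-static export. The helper name below is hypothetical; only the sentinel shapes come from this patch.

import paddle

def has_content(t: paddle.Tensor, axis: int = 2) -> bool:
    # Sentinel convention used throughout this patch: a tensor whose `axis`
    # dimension is 0 (e.g. paddle.zeros([0, 0, 0])) stands in for "None".
    return int(paddle.shape(t)[axis]) > 0

mask_pad = paddle.zeros([0, 0, 0], dtype=paddle.bool)   # fake mask
cache = paddle.zeros([1, 256, 7])                        # real left-context cache (B, C, lorder)
print(has_content(mask_pad))   # False -> skip masked_fill
print(has_content(cache))      # True  -> prepend cache before the causal conv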
- new_cache = paddle.zeros([1], dtype=x.dtype) + new_cache = paddle.zeros([0, 0, 0], dtype=x.dtype) # GLU mechanism x = self.pointwise_conv1(x) # (batch, 2*channel, dim) @@ -159,7 +161,7 @@ class ConvolutionModule(nn.Layer): x = self.pointwise_conv2(x) # mask batch padding - if mask_pad is not None: + if paddle.shape(mask_pad)[2] > 0: # time > 0 x = x.masked_fill(mask_pad, 0.0) x = x.transpose([0, 2, 1]) # [B, T, C] diff --git a/paddlespeech/s2t/modules/decoder_layer.py b/paddlespeech/s2t/modules/decoder_layer.py index b7f8694c1..37b124e84 100644 --- a/paddlespeech/s2t/modules/decoder_layer.py +++ b/paddlespeech/s2t/modules/decoder_layer.py @@ -121,11 +121,11 @@ class DecoderLayer(nn.Layer): if self.concat_after: tgt_concat = paddle.cat( - (tgt_q, self.self_attn(tgt_q, tgt, tgt, tgt_q_mask)), dim=-1) + (tgt_q, self.self_attn(tgt_q, tgt, tgt, tgt_q_mask)[0]), dim=-1) x = residual + self.concat_linear1(tgt_concat) else: x = residual + self.dropout( - self.self_attn(tgt_q, tgt, tgt, tgt_q_mask)) + self.self_attn(tgt_q, tgt, tgt, tgt_q_mask)[0]) if not self.normalize_before: x = self.norm1(x) @@ -134,11 +134,11 @@ class DecoderLayer(nn.Layer): x = self.norm2(x) if self.concat_after: x_concat = paddle.cat( - (x, self.src_attn(x, memory, memory, memory_mask)), dim=-1) + (x, self.src_attn(x, memory, memory, memory_mask)[0]), dim=-1) x = residual + self.concat_linear2(x_concat) else: x = residual + self.dropout( - self.src_attn(x, memory, memory, memory_mask)) + self.src_attn(x, memory, memory, memory_mask)[0]) if not self.normalize_before: x = self.norm2(x) diff --git a/paddlespeech/s2t/modules/embedding.py b/paddlespeech/s2t/modules/embedding.py index 51e558eb8..3aeebd29b 100644 --- a/paddlespeech/s2t/modules/embedding.py +++ b/paddlespeech/s2t/modules/embedding.py @@ -131,7 +131,7 @@ class PositionalEncoding(nn.Layer, PositionalEncodingInterface): offset (int): start offset size (int): requried size of position encoding Returns: - paddle.Tensor: Corresponding position encoding + paddle.Tensor: Corresponding position encoding, #[1, T, D]. 
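Editorial worked example, not part of the patch: under the new cache layout, the positional encoding requested by forward_chunk (further down in this diff) must span the cached frames plus the current chunk, so it starts `cache_t1` positions before the running offset. The sizes below are assumptions.

chunk_size = 16
cache_t1 = 16 * 4                            # frames already held in att_cache
offset = 80                                  # encoder frames emitted so far
attention_key_size = cache_t1 + chunk_size   # 80 keys/values attended over
pos_start = offset - cache_t1                # 16: position of the first cached frame
# pos_emb = self.embed.position_encoding(offset=pos_start, size=attention_key_size)
# -> positions 16..95, covering both the cached frames and the new chunk.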
""" assert offset + size < self.max_len return self.dropout(self.pe[:, offset:offset + size]) diff --git a/paddlespeech/s2t/modules/encoder.py b/paddlespeech/s2t/modules/encoder.py index 4d31acf1a..bff2d69bb 100644 --- a/paddlespeech/s2t/modules/encoder.py +++ b/paddlespeech/s2t/modules/encoder.py @@ -177,7 +177,7 @@ class BaseEncoder(nn.Layer): decoding_chunk_size, self.static_chunk_size, num_decoding_left_chunks) for layer in self.encoders: - xs, chunk_masks, _ = layer(xs, chunk_masks, pos_emb, mask_pad) + xs, chunk_masks, _, _ = layer(xs, chunk_masks, pos_emb, mask_pad) if self.normalize_before: xs = self.after_norm(xs) # Here we assume the mask is not changed in encoder layers, so just @@ -190,30 +190,31 @@ class BaseEncoder(nn.Layer): xs: paddle.Tensor, offset: int, required_cache_size: int, - subsampling_cache: Optional[paddle.Tensor]=None, - elayers_output_cache: Optional[List[paddle.Tensor]]=None, - conformer_cnn_cache: Optional[List[paddle.Tensor]]=None, - ) -> Tuple[paddle.Tensor, paddle.Tensor, List[paddle.Tensor], List[ - paddle.Tensor]]: + att_cache: paddle.Tensor = paddle.zeros([0,0,0,0]), + cnn_cache: paddle.Tensor = paddle.zeros([0,0,0,0]), + att_mask: paddle.Tensor = paddle.ones([0,0,0], dtype=paddle.bool), + ) -> Tuple[paddle.Tensor, paddle.Tensor, paddle.Tensor]: """ Forward just one chunk Args: - xs (paddle.Tensor): chunk input, [B=1, T, D] + xs (paddle.Tensor): chunk audio feat input, [B=1, T, D], where + `T==(chunk_size-1)*subsampling_rate + subsample.right_context + 1` offset (int): current offset in encoder output time stamp required_cache_size (int): cache size required for next chunk compuation >=0: actual cache size <0: means all history cache is required - subsampling_cache (Optional[paddle.Tensor]): subsampling cache - elayers_output_cache (Optional[List[paddle.Tensor]]): - transformer/conformer encoder layers output cache - conformer_cnn_cache (Optional[List[paddle.Tensor]]): conformer - cnn cache + att_cache(paddle.Tensor): cache tensor for key & val in + transformer/conformer attention. Shape is + (elayers, head, cache_t1, d_k * 2), where`head * d_k == hidden-dim` + and `cache_t1 == chunk_size * num_decoding_left_chunks`. 
+ cnn_cache (paddle.Tensor): cache tensor for cnn_module in conformer, + (elayers, B=1, hidden-dim, cache_t2), where `cache_t2 == cnn.lorder - 1` Returns: - paddle.Tensor: output of current input xs - paddle.Tensor: subsampling cache required for next chunk computation - List[paddle.Tensor]: encoder layers output cache required for next - chunk computation - List[paddle.Tensor]: conformer cnn cache + paddle.Tensor: output of current input xs, (B=1, chunk_size, hidden-dim) + paddle.Tensor: new attention cache required for next chunk, dyanmic shape + (elayers, head, T, d_k*2) depending on required_cache_size + paddle.Tensor: new conformer cnn cache required for next chunk, with + same shape as the original cnn_cache """ assert xs.shape[0] == 1 # batch size must be one # tmp_masks is just for interface compatibility @@ -225,50 +226,50 @@ class BaseEncoder(nn.Layer): if self.global_cmvn is not None: xs = self.global_cmvn(xs) - xs, pos_emb, _ = self.embed( - xs, tmp_masks, offset=offset) #xs=(B, T, D), pos_emb=(B=1, T, D) + # before embed, xs=(B, T, D1), pos_emb=(B=1, T, D) + xs, pos_emb, _ = self.embed(xs, tmp_masks, offset=offset) + # after embed, xs=(B=1, chunk_size, hidden-dim) - if subsampling_cache is not None: - cache_size = subsampling_cache.shape[1] #T - xs = paddle.cat((subsampling_cache, xs), dim=1) - else: - cache_size = 0 + elayers = paddle.shape(att_cache)[0] + cache_t1 = paddle.shape(att_cache)[2] + chunk_size = paddle.shape(xs)[1] + attention_key_size = cache_t1 + chunk_size # only used when using `RelPositionMultiHeadedAttention` pos_emb = self.embed.position_encoding( - offset=offset - cache_size, size=xs.shape[1]) + offset=offset - cache_t1, size=attention_key_size) if required_cache_size < 0: next_cache_start = 0 elif required_cache_size == 0: - next_cache_start = xs.shape[1] + next_cache_start = attention_key_size else: - next_cache_start = xs.shape[1] - required_cache_size - r_subsampling_cache = xs[:, next_cache_start:, :] - - # Real mask for transformer/conformer layers - masks = paddle.ones([1, xs.shape[1]], dtype=paddle.bool) - masks = masks.unsqueeze(1) #[B=1, L'=1, T] - r_elayers_output_cache = [] - r_conformer_cnn_cache = [] + next_cache_start = max(attention_key_size - required_cache_size, 0) + + r_att_cache = [] + r_cnn_cache = [] for i, layer in enumerate(self.encoders): - attn_cache = None if elayers_output_cache is None else elayers_output_cache[ - i] - cnn_cache = None if conformer_cnn_cache is None else conformer_cnn_cache[ - i] - xs, _, new_cnn_cache = layer( - xs, - masks, - pos_emb, - output_cache=attn_cache, - cnn_cache=cnn_cache) - r_elayers_output_cache.append(xs[:, next_cache_start:, :]) - r_conformer_cnn_cache.append(new_cnn_cache) + # att_cache[i:i+1] = (1, head, cache_t1, d_k*2) + # cnn_cache[i:i+1] = (1, B=1, hidden-dim, cache_t2) + xs, _, new_att_cache, new_cnn_cache = layer( + xs, att_mask, pos_emb, + att_cache=att_cache[i:i+1] if elayers > 0 else att_cache, + cnn_cache=cnn_cache[i:i+1] if paddle.shape(cnn_cache)[0] > 0 else cnn_cache, + ) + # new_att_cache = (1, head, attention_key_size, d_k*2) + # new_cnn_cache = (B=1, hidden-dim, cache_t2) + r_att_cache.append(new_att_cache[:,:, next_cache_start:, :]) + r_cnn_cache.append(new_cnn_cache.unsqueeze(0)) # add elayer dim + if self.normalize_before: xs = self.after_norm(xs) - return (xs[:, cache_size:, :], r_subsampling_cache, - r_elayers_output_cache, r_conformer_cnn_cache) + # r_att_cache (elayers, head, T, d_k*2) + # r_cnn_cache (elayers, B=1, hidden-dim, cache_t2) + r_att_cache = 
paddle.concat(r_att_cache, axis=0) + r_cnn_cache = paddle.concat(r_cnn_cache, axis=0) + return xs, r_att_cache, r_cnn_cache + def forward_chunk_by_chunk( self, @@ -313,25 +314,24 @@ class BaseEncoder(nn.Layer): num_frames = xs.shape[1] required_cache_size = decoding_chunk_size * num_decoding_left_chunks - subsampling_cache: Optional[paddle.Tensor] = None - elayers_output_cache: Optional[List[paddle.Tensor]] = None - conformer_cnn_cache: Optional[List[paddle.Tensor]] = None + + att_cache: paddle.Tensor = paddle.zeros([0,0,0,0]) + cnn_cache: paddle.Tensor = paddle.zeros([0,0,0,0]) + outputs = [] offset = 0 # Feed forward overlap input step by step for cur in range(0, num_frames - context + 1, stride): end = min(cur + decoding_window, num_frames) chunk_xs = xs[:, cur:end, :] - (y, subsampling_cache, elayers_output_cache, - conformer_cnn_cache) = self.forward_chunk( - chunk_xs, offset, required_cache_size, subsampling_cache, - elayers_output_cache, conformer_cnn_cache) + + (y, att_cache, cnn_cache) = self.forward_chunk( + chunk_xs, offset, required_cache_size, att_cache, cnn_cache) + outputs.append(y) offset += y.shape[1] ys = paddle.cat(outputs, 1) - # fake mask, just for jit script and compatibility with `forward` api - masks = paddle.ones([1, ys.shape[1]], dtype=paddle.bool) - masks = masks.unsqueeze(1) + masks = paddle.ones([1, 1, ys.shape[1]], dtype=paddle.bool) return ys, masks diff --git a/paddlespeech/s2t/modules/encoder_layer.py b/paddlespeech/s2t/modules/encoder_layer.py index e80a298d6..5f810dfde 100644 --- a/paddlespeech/s2t/modules/encoder_layer.py +++ b/paddlespeech/s2t/modules/encoder_layer.py @@ -75,49 +75,43 @@ class TransformerEncoderLayer(nn.Layer): self, x: paddle.Tensor, mask: paddle.Tensor, - pos_emb: Optional[paddle.Tensor]=None, - mask_pad: Optional[paddle.Tensor]=None, - output_cache: Optional[paddle.Tensor]=None, - cnn_cache: Optional[paddle.Tensor]=None, - ) -> Tuple[paddle.Tensor, paddle.Tensor, paddle.Tensor]: + pos_emb: paddle.Tensor, + mask_pad: paddle.Tensor=paddle.ones([0, 0, 0], dtype=paddle.bool), + att_cache: paddle.Tensor=paddle.zeros([0, 0, 0, 0]), + cnn_cache: paddle.Tensor=paddle.zeros([0, 0, 0, 0]), + ) -> Tuple[paddle.Tensor, paddle.Tensor, paddle.Tensor, paddle.Tensor]: """Compute encoded features. Args: - x (paddle.Tensor): Input tensor (#batch, time, size). - mask (paddle.Tensor): Mask tensor for the input (#batch, time). + x (paddle.Tensor): (#batch, time, size) + mask (paddle.Tensor): Mask tensor for the input (#batch, time,time), + (0, 0, 0) means fake mask. pos_emb (paddle.Tensor): just for interface compatibility to ConformerEncoderLayer - mask_pad (paddle.Tensor): not used here, it's for interface - compatibility to ConformerEncoderLayer - output_cache (paddle.Tensor): Cache tensor of the output - (#batch, time2, size), time2 < time in x. - cnn_cache (paddle.Tensor): not used here, it's for interface - compatibility to ConformerEncoderLayer + mask_pad (paddle.Tensor): does not used in transformer layer, + just for unified api with conformer. + att_cache (paddle.Tensor): Cache tensor of the KEY & VALUE + (#batch=1, head, cache_t1, d_k * 2), head * d_k == size. + cnn_cache (paddle.Tensor): Convolution cache in conformer layer + (#batch=1, size, cache_t2), not used here, it's for interface + compatibility to ConformerEncoderLayer. Returns: paddle.Tensor: Output tensor (#batch, time, size). - paddle.Tensor: Mask tensor (#batch, time). - paddle.Tensor: Fake cnn cache tensor for api compatibility with Conformer (#batch, channels, time'). 
+ paddle.Tensor: Mask tensor (#batch, time, time). + paddle.Tensor: att_cache tensor, + (#batch=1, head, cache_t1 + time, d_k * 2). + paddle.Tensor: cnn_cahce tensor (#batch=1, size, cache_t2). """ residual = x if self.normalize_before: x = self.norm1(x) - if output_cache is None: - x_q = x - else: - assert output_cache.shape[0] == x.shape[0] - assert output_cache.shape[1] < x.shape[1] - assert output_cache.shape[2] == self.size - chunk = x.shape[1] - output_cache.shape[1] - x_q = x[:, -chunk:, :] - residual = residual[:, -chunk:, :] - mask = mask[:, -chunk:, :] + x_att, new_att_cache = self.self_attn(x, x, x, mask, cache=att_cache) if self.concat_after: - x_concat = paddle.concat( - (x, self.self_attn(x_q, x, x, mask)), axis=-1) + x_concat = paddle.concat((x, x_att), axis=-1) x = residual + self.concat_linear(x_concat) else: - x = residual + self.dropout(self.self_attn(x_q, x, x, mask)) + x = residual + self.dropout(x_att) if not self.normalize_before: x = self.norm1(x) @@ -128,11 +122,8 @@ class TransformerEncoderLayer(nn.Layer): if not self.normalize_before: x = self.norm2(x) - if output_cache is not None: - x = paddle.concat([output_cache, x], axis=1) - - fake_cnn_cache = paddle.zeros([1], dtype=x.dtype) - return x, mask, fake_cnn_cache + fake_cnn_cache = paddle.zeros([0, 0, 0], dtype=x.dtype) + return x, mask, new_att_cache, fake_cnn_cache class ConformerEncoderLayer(nn.Layer): @@ -192,32 +183,44 @@ class ConformerEncoderLayer(nn.Layer): self.size = size self.normalize_before = normalize_before self.concat_after = concat_after - self.concat_linear = Linear(size + size, size) + if self.concat_after: + self.concat_linear = Linear(size + size, size) + else: + self.concat_linear = nn.Identity() def forward( self, x: paddle.Tensor, mask: paddle.Tensor, pos_emb: paddle.Tensor, - mask_pad: Optional[paddle.Tensor]=None, - output_cache: Optional[paddle.Tensor]=None, - cnn_cache: Optional[paddle.Tensor]=None, - ) -> Tuple[paddle.Tensor, paddle.Tensor, paddle.Tensor]: + mask_pad: paddle.Tensor=paddle.ones([0, 0, 0], dtype=paddle.bool), + att_cache: paddle.Tensor=paddle.zeros([0, 0, 0, 0]), + cnn_cache: paddle.Tensor=paddle.zeros([0, 0, 0, 0]), + ) -> Tuple[paddle.Tensor, paddle.Tensor, paddle.Tensor, paddle.Tensor]: """Compute encoded features. Args: - x (paddle.Tensor): (#batch, time, size) - mask (paddle.Tensor): Mask tensor for the input (#batch, time,time). - pos_emb (paddle.Tensor): positional encoding, must not be None - for ConformerEncoderLayer. - mask_pad (paddle.Tensor): batch padding mask used for conv module, (B, 1, T). - output_cache (paddle.Tensor): Cache tensor of the encoder output - (#batch, time2, size), time2 < time in x. + x (paddle.Tensor): Input tensor (#batch, time, size). + mask (paddle.Tensor): Mask tensor for the input (#batch, time, time). + (0,0,0) means fake mask. + pos_emb (paddle.Tensor): postional encoding, must not be None + for ConformerEncoderLayer + mask_pad (paddle.Tensor): batch padding mask used for conv module. + (#batch, 1,time), (0, 0, 0) means fake mask. + att_cache (paddle.Tensor): Cache tensor of the KEY & VALUE + (#batch=1, head, cache_t1, d_k * 2), head * d_k == size. cnn_cache (paddle.Tensor): Convolution cache in conformer layer + (1, #batch=1, size, cache_t2). First dim will not be used, just + for dy2st. Returns: - paddle.Tensor: Output tensor (#batch, time, size). - paddle.Tensor: Mask tensor (#batch, time). - paddle.Tensor: New cnn cache tensor (#batch, channels, time'). + paddle.Tensor: Output tensor (#batch, time, size). 
+ paddle.Tensor: Mask tensor (#batch, time, time). + paddle.Tensor: att_cache tensor, + (#batch=1, head, cache_t1 + time, d_k * 2). + paddle.Tensor: cnn_cahce tensor (#batch, size, cache_t2). """ + # (1, #batch=1, size, cache_t2) -> (#batch=1, size, cache_t2) + cnn_cache = paddle.squeeze(cnn_cache, axis=0) + # whether to use macaron style FFN if self.feed_forward_macaron is not None: residual = x @@ -233,18 +236,8 @@ class ConformerEncoderLayer(nn.Layer): if self.normalize_before: x = self.norm_mha(x) - if output_cache is None: - x_q = x - else: - assert output_cache.shape[0] == x.shape[0] - assert output_cache.shape[1] < x.shape[1] - assert output_cache.shape[2] == self.size - chunk = x.shape[1] - output_cache.shape[1] - x_q = x[:, -chunk:, :] - residual = residual[:, -chunk:, :] - mask = mask[:, -chunk:, :] - - x_att = self.self_attn(x_q, x, x, pos_emb, mask) + x_att, new_att_cache = self.self_attn( + x, x, x, mask, pos_emb, cache=att_cache) if self.concat_after: x_concat = paddle.concat((x, x_att), axis=-1) @@ -257,7 +250,7 @@ class ConformerEncoderLayer(nn.Layer): # convolution module # Fake new cnn cache here, and then change it in conv_module - new_cnn_cache = paddle.zeros([1], dtype=x.dtype) + new_cnn_cache = paddle.zeros([0, 0, 0], dtype=x.dtype) if self.conv_module is not None: residual = x if self.normalize_before: @@ -282,7 +275,4 @@ class ConformerEncoderLayer(nn.Layer): if self.conv_module is not None: x = self.norm_final(x) - if output_cache is not None: - x = paddle.concat([output_cache, x], axis=1) - - return x, mask, new_cnn_cache + return x, mask, new_att_cache, new_cnn_cache diff --git a/paddlespeech/s2t/modules/initializer.py b/paddlespeech/s2t/modules/initializer.py index 30a04e44f..cdcf2e052 100644 --- a/paddlespeech/s2t/modules/initializer.py +++ b/paddlespeech/s2t/modules/initializer.py @@ -12,142 +12,6 @@ # See the License for the specific language governing permissions and # limitations under the License. import numpy as np -from paddle.fluid import framework -from paddle.fluid import unique_name -from paddle.fluid.core import VarDesc -from paddle.fluid.initializer import MSRAInitializer - -__all__ = ['KaimingUniform'] - - -class KaimingUniform(MSRAInitializer): - r"""Implements the Kaiming Uniform initializer - - This class implements the weight initialization from the paper - `Delving Deep into Rectifiers: Surpassing Human-Level Performance on - ImageNet Classification `_ - by Kaiming He, Xiangyu Zhang, Shaoqing Ren and Jian Sun. This is a - robust initialization method that particularly considers the rectifier - nonlinearities. - - In case of Uniform distribution, the range is [-x, x], where - - .. math:: - - x = \sqrt{\frac{1.0}{fan\_in}} - - In case of Normal distribution, the mean is 0 and the standard deviation - is - - .. math:: - - \sqrt{\\frac{2.0}{fan\_in}} - - Args: - fan_in (float32|None): fan_in for Kaiming uniform Initializer. If None, it is\ - inferred from the variable. default is None. - - Note: - It is recommended to set fan_in to None for most cases. - - Examples: - .. code-block:: python - - import paddle - import paddle.nn as nn - - linear = nn.Linear(2, - 4, - weight_attr=nn.initializer.KaimingUniform()) - data = paddle.rand([30, 10, 2], dtype='float32') - res = linear(data) - - """ - - def __init__(self, fan_in=None): - super(KaimingUniform, self).__init__( - uniform=True, fan_in=fan_in, seed=0) - - def __call__(self, var, block=None): - """Initialize the input tensor with MSRA initialization. 
- - Args: - var(Tensor): Tensor that needs to be initialized. - block(Block, optional): The block in which initialization ops - should be added. Used in static graph only, default None. - - Returns: - The initialization op - """ - block = self._check_block(block) - - assert isinstance(var, framework.Variable) - assert isinstance(block, framework.Block) - f_in, f_out = self._compute_fans(var) - - # If fan_in is passed, use it - fan_in = f_in if self._fan_in is None else self._fan_in - - if self._seed == 0: - self._seed = block.program.random_seed - - # to be compatible of fp16 initalizers - if var.dtype == VarDesc.VarType.FP16 or ( - var.dtype == VarDesc.VarType.BF16 and not self._uniform): - out_dtype = VarDesc.VarType.FP32 - out_var = block.create_var( - name=unique_name.generate( - ".".join(['masra_init', var.name, 'tmp'])), - shape=var.shape, - dtype=out_dtype, - type=VarDesc.VarType.LOD_TENSOR, - persistable=False) - else: - out_dtype = var.dtype - out_var = var - - if self._uniform: - limit = np.sqrt(1.0 / float(fan_in)) - op = block.append_op( - type="uniform_random", - inputs={}, - outputs={"Out": out_var}, - attrs={ - "shape": out_var.shape, - "dtype": int(out_dtype), - "min": -limit, - "max": limit, - "seed": self._seed - }, - stop_gradient=True) - - else: - std = np.sqrt(2.0 / float(fan_in)) - op = block.append_op( - type="gaussian_random", - outputs={"Out": out_var}, - attrs={ - "shape": out_var.shape, - "dtype": int(out_dtype), - "mean": 0.0, - "std": std, - "seed": self._seed - }, - stop_gradient=True) - - if var.dtype == VarDesc.VarType.FP16 or ( - var.dtype == VarDesc.VarType.BF16 and not self._uniform): - block.append_op( - type="cast", - inputs={"X": out_var}, - outputs={"Out": var}, - attrs={"in_dtype": out_var.dtype, - "out_dtype": var.dtype}) - - if not framework.in_dygraph_mode(): - var.op = op - return op - class DefaultInitializerContext(object): """ diff --git a/paddlespeech/s2t/modules/loss.py b/paddlespeech/s2t/modules/loss.py index c7d9bd45d..884fb70c1 100644 --- a/paddlespeech/s2t/modules/loss.py +++ b/paddlespeech/s2t/modules/loss.py @@ -37,9 +37,9 @@ class CTCLoss(nn.Layer): self.loss = nn.CTCLoss(blank=blank, reduction=reduction) self.batch_average = batch_average - logger.info( + logger.debug( f"CTCLoss Loss reduction: {reduction}, div-bs: {batch_average}") - logger.info(f"CTCLoss Grad Norm Type: {grad_norm_type}") + logger.debug(f"CTCLoss Grad Norm Type: {grad_norm_type}") assert grad_norm_type in ('instance', 'batch', 'frame', None) self.norm_by_times = False @@ -70,7 +70,8 @@ class CTCLoss(nn.Layer): param = {} self._kwargs = {k: v for k, v in kwargs.items() if k in param} _notin = {k: v for k, v in kwargs.items() if k not in param} - logger.info(f"{self.loss} kwargs:{self._kwargs}, not support: {_notin}") + logger.debug( + f"{self.loss} kwargs:{self._kwargs}, not support: {_notin}") def forward(self, logits, ys_pad, hlens, ys_lens): """Compute CTC loss. diff --git a/paddlespeech/s2t/training/cli.py b/paddlespeech/s2t/training/cli.py index bb85732a6..1b6bec8a8 100644 --- a/paddlespeech/s2t/training/cli.py +++ b/paddlespeech/s2t/training/cli.py @@ -82,6 +82,12 @@ def default_argument_parser(parser=None): type=int, default=1, help="number of parallel processes. 
0 for cpu.") + train_group.add_argument( + '--nxpu', + type=int, + default=0, + choices=[0, 1], + help="if nxpu == 0 and ngpu == 0, use cpu.") train_group.add_argument( "--config", metavar="CONFIG_FILE", help="config file.") train_group.add_argument( diff --git a/paddlespeech/s2t/utils/log.py b/paddlespeech/s2t/utils/log.py index 4f51b7f05..b711ab739 100644 --- a/paddlespeech/s2t/utils/log.py +++ b/paddlespeech/s2t/utils/log.py @@ -99,7 +99,7 @@ class Log(): _call_from_cli = False _frame = inspect.currentframe() while _frame: - if 'paddlespeech/cli/__init__.py' in _frame.f_code.co_filename or 'paddlespeech/t2s' in _frame.f_code.co_filename: + if 'paddlespeech/cli/entry.py' in _frame.f_code.co_filename or 'paddlespeech/t2s' in _frame.f_code.co_filename: _call_from_cli = True break _frame = _frame.f_back diff --git a/paddlespeech/s2t/utils/tensor_utils.py b/paddlespeech/s2t/utils/tensor_utils.py index f9a843ea1..422d4f82a 100644 --- a/paddlespeech/s2t/utils/tensor_utils.py +++ b/paddlespeech/s2t/utils/tensor_utils.py @@ -94,7 +94,7 @@ def pad_sequence(sequences: List[paddle.Tensor], for i, tensor in enumerate(sequences): length = tensor.shape[0] # use index notation to prevent duplicate references to the tensor - logger.info( + logger.debug( f"length {length}, out_tensor {out_tensor.shape}, tensor {tensor.shape}" ) if batch_first: diff --git a/paddlespeech/server/bin/paddlespeech_client.py b/paddlespeech/server/bin/paddlespeech_client.py index fb521b309..f5dc368dd 100644 --- a/paddlespeech/server/bin/paddlespeech_client.py +++ b/paddlespeech/server/bin/paddlespeech_client.py @@ -123,7 +123,6 @@ class TTSClientExecutor(BaseExecutor): time_end = time.time() time_consume = time_end - time_start response_dict = res.json() - logger.info(response_dict["message"]) logger.info("Save synthesized audio successfully on %s." % (output)) logger.info("Audio duration: %f s." 
% (response_dict['result']['duration'])) @@ -192,23 +191,10 @@ class TTSOnlineClientExecutor(BaseExecutor): self.parser.add_argument( '--spk_id', type=int, default=0, help='Speaker id') self.parser.add_argument( - '--speed', - type=float, - default=1.0, - help='Audio speed, the value should be set between 0 and 3') - self.parser.add_argument( - '--volume', - type=float, - default=1.0, - help='Audio volume, the value should be set between 0 and 3') - self.parser.add_argument( - '--sample_rate', - type=int, - default=0, - choices=[0, 8000, 16000], - help='Sampling rate, the default is the same as the model') - self.parser.add_argument( - '--output', type=str, default=None, help='Synthesized audio file') + '--output', + type=str, + default=None, + help='Client saves synthesized audio') self.parser.add_argument( "--play", type=bool, help="whether to play audio", default=False) @@ -219,9 +205,6 @@ class TTSOnlineClientExecutor(BaseExecutor): port = args.port protocol = args.protocol spk_id = args.spk_id - speed = args.speed - volume = args.volume - sample_rate = args.sample_rate output = args.output play = args.play @@ -232,9 +215,6 @@ class TTSOnlineClientExecutor(BaseExecutor): port=port, protocol=protocol, spk_id=spk_id, - speed=speed, - volume=volume, - sample_rate=sample_rate, output=output, play=play) return True @@ -250,9 +230,6 @@ class TTSOnlineClientExecutor(BaseExecutor): port: int=8092, protocol: str="http", spk_id: int=0, - speed: float=1.0, - volume: float=1.0, - sample_rate: int=0, output: str=None, play: bool=False): """ @@ -264,7 +241,7 @@ class TTSOnlineClientExecutor(BaseExecutor): from paddlespeech.server.utils.audio_handler import TTSHttpHandler handler = TTSHttpHandler(server_ip, port, play) first_response, final_response, duration, save_audio_success, receive_time_list, chunk_duration_list = handler.run( - input, spk_id, speed, volume, sample_rate, output) + input, spk_id, output) delay_time_list = compute_delay(receive_time_list, chunk_duration_list) @@ -274,7 +251,7 @@ class TTSOnlineClientExecutor(BaseExecutor): handler = TTSWsHandler(server_ip, port, play) loop = asyncio.get_event_loop() first_response, final_response, duration, save_audio_success, receive_time_list, chunk_duration_list = loop.run_until_complete( - handler.run(input, output)) + handler.run(input, spk_id, output)) delay_time_list = compute_delay(receive_time_list, chunk_duration_list) @@ -573,11 +550,9 @@ class CLSClientExecutor(BaseExecutor): """ Python API to call an executor. """ - url = 'http://' + server_ip + ":" + str(port) + '/paddlespeech/cls' audio = wav2base64(input) data = {"audio": audio, "topk": topk} - res = requests.post(url=url, data=json.dumps(data)) return res @@ -702,7 +677,7 @@ class VectorClientExecutor(BaseExecutor): test_audio=args.test, task=task) time_end = time.time() - logger.info(f"The vector: {res}") + logger.info(res.json()) logger.info("Response time %f s." 
% (time_end - time_start)) return True except Exception as e: @@ -751,7 +726,6 @@ class VectorClientExecutor(BaseExecutor): handler = VectorScoreHttpHandler(server_ip=server_ip, port=port) res = handler.run(enroll_audio, test_audio, audio_format, sample_rate) - logger.info(f"The vector score is: {res}") return res else: logger.error(f"Sorry, we have not support such task {task}") diff --git a/paddlespeech/server/bin/paddlespeech_server.py b/paddlespeech/server/bin/paddlespeech_server.py index 11f50655f..175e8ffb6 100644 --- a/paddlespeech/server/bin/paddlespeech_server.py +++ b/paddlespeech/server/bin/paddlespeech_server.py @@ -18,6 +18,7 @@ from typing import List import uvicorn from fastapi import FastAPI +from starlette.middleware.cors import CORSMiddleware from prettytable import PrettyTable from starlette.middleware.cors import CORSMiddleware @@ -45,7 +46,6 @@ app.add_middleware( allow_methods=["*"], allow_headers=["*"]) - @cli_server_register( name='paddlespeech_server.start', description='Start the service') class ServerExecutor(BaseExecutor): diff --git a/paddlespeech/server/conf/ws_ds2_application.yaml b/paddlespeech/server/conf/ws_ds2_application.yaml index 909c2f187..ac20b2a23 100644 --- a/paddlespeech/server/conf/ws_ds2_application.yaml +++ b/paddlespeech/server/conf/ws_ds2_application.yaml @@ -18,12 +18,13 @@ engine_list: ['asr_online-onnx'] # ENGINE CONFIG # ################################################################################# + ################################### ASR ######################################### -################### speech task: asr; engine_type: online-inference ####################### -asr_online-inference: +################### speech task: asr; engine_type: online-onnx ####################### +asr_online-onnx: model_type: 'deepspeech2online_wenetspeech' - am_model: # the pdmodel file of am static model [optional] - am_params: # the pdiparams file of am static model [optional] + am_model: # the pdmodel file of onnx am static model [optional] + am_params: # the pdiparams file of am static model [optional] lang: 'zh' sample_rate: 16000 cfg_path: @@ -32,14 +33,17 @@ asr_online-inference: force_yes: True device: 'cpu' # cpu or gpu:id + # https://onnxruntime.ai/docs/api/python/api_summary.html#inferencesession am_predictor_conf: - device: # set 'gpu:id' or 'cpu' - switch_ir_optim: True - glog_info: False # True -> print glog - summary: True # False -> do not show predictor config + device: 'cpu' # set 'gpu:id' or 'cpu' + graph_optimization_level: 0 + intra_op_num_threads: 0 # Sets the number of threads used to parallelize the execution within nodes. + inter_op_num_threads: 0 # Sets the number of threads used to parallelize the execution of the graph (across nodes). + log_severity_level: 2 # Log severity level. Applies to session load, initialization, etc. 0:Verbose, 1:Info, 2:Warning. 3:Error, 4:Fatal. Default is 2. + log_verbosity_level: 0 # VLOG level if DEBUG build and session_log_severity_level is 0. Applies to session load, initialization, etc. Default is 0. 
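# --- Editor's example (not part of the patch): what the am_predictor_conf keys above mean.
# The asr_online-onnx engine forwards these options to onnxruntime. A sketch of the
# equivalent SessionOptions, assuming a direct mapping and that graph_optimization_level: 0
# corresponds to ORT_DISABLE_ALL; "am.onnx" is a placeholder path:
import onnxruntime as ort

sess_options = ort.SessionOptions()
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_DISABLE_ALL  # yaml: 0
sess_options.intra_op_num_threads = 0   # 0 = let onnxruntime choose the intra-op thread count
sess_options.inter_op_num_threads = 0   # 0 = let onnxruntime choose the inter-op thread count
sess_options.log_severity_level = 2     # warnings and above
sess_options.log_verbosity_level = 0    # VLOG level, only effective in debug builds

sess = ort.InferenceSession(
    "am.onnx", sess_options=sess_options,
    providers=["CPUExecutionProvider"])  # device: 'cpu' in the yaml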
chunk_buffer_conf: - frame_duration_ms: 80 + frame_duration_ms: 85 shift_ms: 40 sample_rate: 16000 sample_width: 2 @@ -49,13 +53,12 @@ asr_online-inference: shift_ms: 10 # ms - ################################### ASR ######################################### -################### speech task: asr; engine_type: online-onnx ####################### -asr_online-onnx: +################### speech task: asr; engine_type: online-inference ####################### +asr_online-inference: model_type: 'deepspeech2online_wenetspeech' - am_model: # the pdmodel file of onnx am static model [optional] - am_params: # the pdiparams file of am static model [optional] + am_model: # the pdmodel file of am static model [optional] + am_params: # the pdiparams file of am static model [optional] lang: 'zh' sample_rate: 16000 cfg_path: @@ -64,14 +67,11 @@ asr_online-onnx: force_yes: True device: 'cpu' # cpu or gpu:id - # https://onnxruntime.ai/docs/api/python/api_summary.html#inferencesession am_predictor_conf: - device: 'cpu' # set 'gpu:id' or 'cpu' - graph_optimization_level: 0 - intra_op_num_threads: 0 # Sets the number of threads used to parallelize the execution within nodes. - inter_op_num_threads: 0 # Sets the number of threads used to parallelize the execution of the graph (across nodes). - log_severity_level: 2 # Log severity level. Applies to session load, initialization, etc. 0:Verbose, 1:Info, 2:Warning. 3:Error, 4:Fatal. Default is 2. - log_verbosity_level: 0 # VLOG level if DEBUG build and session_log_severity_level is 0. Applies to session load, initialization, etc. Default is 0. + device: # set 'gpu:id' or 'cpu' + switch_ir_optim: True + glog_info: False # True -> print glog + summary: True # False -> do not show predictor config chunk_buffer_conf: frame_duration_ms: 85 @@ -81,4 +81,4 @@ asr_online-onnx: window_n: 7 # frame shift_n: 4 # frame window_ms: 25 # ms - shift_ms: 10 # ms + shift_ms: 10 # ms \ No newline at end of file diff --git a/paddlespeech/server/engine/acs/python/acs_engine.py b/paddlespeech/server/engine/acs/python/acs_engine.py index 930101ac9..a607aa07a 100644 --- a/paddlespeech/server/engine/acs/python/acs_engine.py +++ b/paddlespeech/server/engine/acs/python/acs_engine.py @@ -30,7 +30,7 @@ class ACSEngine(BaseEngine): """The ACSEngine Engine """ super(ACSEngine, self).__init__() - logger.info("Create the ACSEngine Instance") + logger.debug("Create the ACSEngine Instance") self.word_list = [] def init(self, config: dict): @@ -42,7 +42,7 @@ class ACSEngine(BaseEngine): Returns: bool: The engine instance flag """ - logger.info("Init the acs engine") + logger.debug("Init the acs engine") try: self.config = config self.device = self.config.get("device", paddle.get_device()) @@ -50,7 +50,7 @@ class ACSEngine(BaseEngine): # websocket default ping timeout is 20 seconds self.ping_timeout = self.config.get("ping_timeout", 20) paddle.set_device(self.device) - logger.info(f"ACS Engine set the device: {self.device}") + logger.debug(f"ACS Engine set the device: {self.device}") except BaseException as e: logger.error( @@ -66,7 +66,9 @@ class ACSEngine(BaseEngine): self.url = "ws://" + self.config.asr_server_ip + ":" + str( self.config.asr_server_port) + "/paddlespeech/asr/streaming" - logger.info("Init the acs engine successfully") + logger.info("Initialize acs server engine successfully on device: %s." 
% + (self.device)) + return True def read_search_words(self): @@ -95,12 +97,12 @@ class ACSEngine(BaseEngine): Returns: _type_: _description_ """ - logger.info("send a message to the server") + logger.debug("send a message to the server") if self.url is None: logger.error("No asr server, please input valid ip and port") return "" ws = websocket.WebSocket() - logger.info(f"set the ping timeout: {self.ping_timeout} seconds") + logger.debug(f"set the ping timeout: {self.ping_timeout} seconds") ws.connect(self.url, ping_timeout=self.ping_timeout) audio_info = json.dumps( { @@ -123,7 +125,7 @@ class ACSEngine(BaseEngine): logger.info(f"audio result: {msg}") # 3. send chunk audio data to engine - logger.info("send the end signal") + logger.debug("send the end signal") audio_info = json.dumps( { "name": "test.wav", @@ -190,14 +192,17 @@ class ACSEngine(BaseEngine): # search for each word in self.word_list offset = self.config.offset + # last time in time_stamp max_ed = time_stamp[-1]['ed'] for w in self.word_list: # search the w in asr_result and the index in asr_result + # https://docs.python.org/3/library/re.html#re.finditer for m in re.finditer(w, asr_result): + # match start and end char index in timestamp + # https://docs.python.org/3/library/re.html#re.Match.start start = max(time_stamp[m.start(0)]['bg'] - offset, 0) - end = min(time_stamp[m.end(0) - 1]['ed'] + offset, max_ed) - logger.info(f'start: {start}, end: {end}') + logger.debug(f'start: {start}, end: {end}') acs_result.append({'w': w, 'bg': start, 'ed': end}) return acs_result, asr_result @@ -212,7 +217,7 @@ class ACSEngine(BaseEngine): Returns: acs_result, asr_result: the acs result and the asr result """ - logger.info("start to process the audio content search") + logger.debug("start to process the audio content search") msg = self.get_asr_content(io.BytesIO(audio_data)) acs_result, asr_result = self.get_macthed_word(msg) diff --git a/paddlespeech/server/engine/asr/online/ctc_endpoint.py b/paddlespeech/server/engine/asr/online/ctc_endpoint.py index 2dba36417..b87dbe805 100644 --- a/paddlespeech/server/engine/asr/online/ctc_endpoint.py +++ b/paddlespeech/server/engine/asr/online/ctc_endpoint.py @@ -39,10 +39,10 @@ class OnlineCTCEndpoingOpt: # rule1 times out after 5 seconds of silence, even if we decoded nothing. rule1: OnlineCTCEndpointRule = OnlineCTCEndpointRule(False, 5000, 0) - # rule4 times out after 1.0 seconds of silence after decoding something, + # rule2 times out after 1.0 seconds of silence after decoding something, # even if we did not reach a final-state at all. rule2: OnlineCTCEndpointRule = OnlineCTCEndpointRule(True, 1000, 0) - # rule5 times out after the utterance is 20 seconds long, regardless of + # rule3 times out after the utterance is 20 seconds long, regardless of # anything else. 
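# --- Editor's example (not part of the patch): the endpoint rules above, restated.
# rule1 fires after 5000 ms of trailing silence even if nothing was decoded, rule2 after
# 1000 ms of trailing silence once something was decoded, rule3 once the utterance reaches
# 20000 ms. The standalone function below mirrors that logic, including the corrected
# requirement (added in the next hunk) that "decoding something" also implies more frames
# were decoded than the trailing silence; thresholds are the defaults shown above.
def endpoint_detected(num_frames_decoded: int,
                      trailing_silence_frames: int,
                      frame_shift_in_ms: int,
                      decoding_something: bool) -> bool:
    decoding_something = (num_frames_decoded > trailing_silence_frames) and decoding_something
    utterance_ms = num_frames_decoded * frame_shift_in_ms
    trailing_silence_ms = trailing_silence_frames * frame_shift_in_ms

    rule1 = trailing_silence_ms >= 5000                          # fires even if nothing decoded
    rule2 = decoding_something and trailing_silence_ms >= 1000   # silence after a partial result
    rule3 = utterance_ms >= 20000                                # utterance too long
    return rule1 or rule2 or rule3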
rule3: OnlineCTCEndpointRule = OnlineCTCEndpointRule(False, 0, 20000) @@ -102,7 +102,8 @@ class OnlineCTCEndpoint: assert self.num_frames_decoded >= self.trailing_silence_frames assert self.frame_shift_in_ms > 0 - + + decoding_something = (self.num_frames_decoded > self.trailing_silence_frames) and decoding_something utterance_length = self.num_frames_decoded * self.frame_shift_in_ms trailing_silence = self.trailing_silence_frames * self.frame_shift_in_ms diff --git a/paddlespeech/server/engine/asr/online/ctc_search.py b/paddlespeech/server/engine/asr/online/ctc_search.py index 46f310c80..06adb9ccc 100644 --- a/paddlespeech/server/engine/asr/online/ctc_search.py +++ b/paddlespeech/server/engine/asr/online/ctc_search.py @@ -83,11 +83,11 @@ class CTCPrefixBeamSearch: # cur_hyps: (prefix, (blank_ending_score, none_blank_ending_score)) # 0. blank_ending_score, # 1. none_blank_ending_score, - # 2. viterbi_blank ending, - # 3. viterbi_non_blank, + # 2. viterbi_blank ending score, + # 3. viterbi_non_blank score, # 4. current_token_prob, - # 5. times_viterbi_blank, - # 6. times_titerbi_non_blank + # 5. times_viterbi_blank, times_b + # 6. times_titerbi_non_blank, times_nb if self.cur_hyps is None: self.cur_hyps = [(tuple(), (0.0, -float('inf'), 0.0, 0.0, -float('inf'), [], []))] @@ -106,69 +106,69 @@ class CTCPrefixBeamSearch: for s in top_k_index: s = s.item() ps = logp[s].item() - for prefix, (pb, pnb, v_b_s, v_nb_s, cur_token_prob, times_s, - times_ns) in self.cur_hyps: + for prefix, (pb, pnb, v_b_s, v_nb_s, cur_token_prob, times_b, + times_nb) in self.cur_hyps: last = prefix[-1] if len(prefix) > 0 else None if s == blank_id: # blank - n_pb, n_pnb, n_v_s, n_v_ns, n_cur_token_prob, n_times_s, n_times_ns = next_hyps[ + n_pb, n_pnb, n_v_b, n_v_nb, n_cur_token_prob, n_times_b, n_times_nb = next_hyps[ prefix] n_pb = log_add([n_pb, pb + ps, pnb + ps]) - pre_times = times_s if v_b_s > v_nb_s else times_ns - n_times_s = copy.deepcopy(pre_times) + pre_times = times_b if v_b_s > v_nb_s else times_nb + n_times_b = copy.deepcopy(pre_times) viterbi_score = v_b_s if v_b_s > v_nb_s else v_nb_s - n_v_s = viterbi_score + ps - next_hyps[prefix] = (n_pb, n_pnb, n_v_s, n_v_ns, - n_cur_token_prob, n_times_s, - n_times_ns) + n_v_b = viterbi_score + ps + next_hyps[prefix] = (n_pb, n_pnb, n_v_b, n_v_nb, + n_cur_token_prob, n_times_b, + n_times_nb) elif s == last: # Update *ss -> *s; # case1: *a + a => *a - n_pb, n_pnb, n_v_s, n_v_ns, n_cur_token_prob, n_times_s, n_times_ns = next_hyps[ + n_pb, n_pnb, n_v_b, n_v_nb, n_cur_token_prob, n_times_b, n_times_nb = next_hyps[ prefix] n_pnb = log_add([n_pnb, pnb + ps]) - if n_v_ns < v_nb_s + ps: - n_v_ns = v_nb_s + ps + if n_v_nb < v_nb_s + ps: + n_v_nb = v_nb_s + ps if n_cur_token_prob < ps: n_cur_token_prob = ps - n_times_ns = copy.deepcopy(times_ns) - n_times_ns[ + n_times_nb = copy.deepcopy(times_nb) + n_times_nb[ -1] = self.abs_time_step # 注意,这里要重新使用绝对时间 - next_hyps[prefix] = (n_pb, n_pnb, n_v_s, n_v_ns, - n_cur_token_prob, n_times_s, - n_times_ns) + next_hyps[prefix] = (n_pb, n_pnb, n_v_b, n_v_nb, + n_cur_token_prob, n_times_b, + n_times_nb) # Update *s-s -> *ss, - is for blank # Case 2: *aε + a => *aa n_prefix = prefix + (s, ) - n_pb, n_pnb, n_v_s, n_v_ns, n_cur_token_prob, n_times_s, n_times_ns = next_hyps[ + n_pb, n_pnb, n_v_b, n_v_nb, n_cur_token_prob, n_times_b, n_times_nb = next_hyps[ n_prefix] - if n_v_ns < v_b_s + ps: - n_v_ns = v_b_s + ps + if n_v_nb < v_b_s + ps: + n_v_nb = v_b_s + ps n_cur_token_prob = ps - n_times_ns = copy.deepcopy(times_s) - 
n_times_ns.append(self.abs_time_step) + n_times_nb = copy.deepcopy(times_b) + n_times_nb.append(self.abs_time_step) n_pnb = log_add([n_pnb, pb + ps]) - next_hyps[n_prefix] = (n_pb, n_pnb, n_v_s, n_v_ns, - n_cur_token_prob, n_times_s, - n_times_ns) + next_hyps[n_prefix] = (n_pb, n_pnb, n_v_b, n_v_nb, + n_cur_token_prob, n_times_b, + n_times_nb) else: # Case 3: *a + b => *ab, *aε + b => *ab n_prefix = prefix + (s, ) - n_pb, n_pnb, n_v_s, n_v_ns, n_cur_token_prob, n_times_s, n_times_ns = next_hyps[ + n_pb, n_pnb, n_v_b, n_v_nb, n_cur_token_prob, n_times_b, n_times_nb = next_hyps[ n_prefix] viterbi_score = v_b_s if v_b_s > v_nb_s else v_nb_s - pre_times = times_s if v_b_s > v_nb_s else times_ns - if n_v_ns < viterbi_score + ps: - n_v_ns = viterbi_score + ps + pre_times = times_b if v_b_s > v_nb_s else times_nb + if n_v_nb < viterbi_score + ps: + n_v_nb = viterbi_score + ps n_cur_token_prob = ps - n_times_ns = copy.deepcopy(pre_times) - n_times_ns.append(self.abs_time_step) + n_times_nb = copy.deepcopy(pre_times) + n_times_nb.append(self.abs_time_step) n_pnb = log_add([n_pnb, pb + ps, pnb + ps]) - next_hyps[n_prefix] = (n_pb, n_pnb, n_v_s, n_v_ns, - n_cur_token_prob, n_times_s, - n_times_ns) + next_hyps[n_prefix] = (n_pb, n_pnb, n_v_b, n_v_nb, + n_cur_token_prob, n_times_b, + n_times_nb) # 2.2 Second beam prune next_hyps = sorted( diff --git a/paddlespeech/server/engine/asr/online/onnx/asr_engine.py b/paddlespeech/server/engine/asr/online/onnx/asr_engine.py index aab29f78e..ab4f11305 100644 --- a/paddlespeech/server/engine/asr/online/onnx/asr_engine.py +++ b/paddlespeech/server/engine/asr/online/onnx/asr_engine.py @@ -23,14 +23,14 @@ from yacs.config import CfgNode from paddlespeech.cli.asr.infer import ASRExecutor from paddlespeech.cli.log import logger -from paddlespeech.cli.utils import MODEL_HOME from paddlespeech.resource import CommonTaskResource from paddlespeech.s2t.frontend.featurizer.text_featurizer import TextFeaturizer from paddlespeech.s2t.modules.ctc import CTCDecoder -from paddlespeech.s2t.transform.transformation import Transformation +from paddlespeech.audio.transform.transformation import Transformation from paddlespeech.s2t.utils.utility import UpdateConfig from paddlespeech.server.engine.base_engine import BaseEngine from paddlespeech.server.utils import onnx_infer +from paddlespeech.utils.env import MODEL_HOME __all__ = ['PaddleASRConnectionHanddler', 'ASRServerExecutor', 'ASREngine'] @@ -44,7 +44,7 @@ class PaddleASRConnectionHanddler: asr_engine (ASREngine): the global asr engine """ super().__init__() - logger.info( + logger.debug( "create an paddle asr connection handler to process the websocket connection" ) self.config = asr_engine.config # server config @@ -152,12 +152,12 @@ class PaddleASRConnectionHanddler: self.output_reset() def extract_feat(self, samples: ByteString): - logger.info("Online ASR extract the feat") + logger.debug("Online ASR extract the feat") samples = np.frombuffer(samples, dtype=np.int16) assert samples.ndim == 1 self.num_samples += samples.shape[0] - logger.info( + logger.debug( f"This package receive {samples.shape[0]} pcm data. 
Global samples:{self.num_samples}" ) @@ -168,7 +168,7 @@ class PaddleASRConnectionHanddler: else: assert self.remained_wav.ndim == 1 # (T,) self.remained_wav = np.concatenate([self.remained_wav, samples]) - logger.info( + logger.debug( f"The concatenation of remain and now audio samples length is: {self.remained_wav.shape}" ) @@ -202,14 +202,14 @@ class PaddleASRConnectionHanddler: # update remained wav self.remained_wav = self.remained_wav[self.n_shift * num_frames:] - logger.info( + logger.debug( f"process the audio feature success, the cached feat shape: {self.cached_feat.shape}" ) - logger.info( + logger.debug( f"After extract feat, the cached remain the audio samples: {self.remained_wav.shape}" ) - logger.info(f"global samples: {self.num_samples}") - logger.info(f"global frames: {self.num_frames}") + logger.debug(f"global samples: {self.num_samples}") + logger.debug(f"global frames: {self.num_frames}") def decode(self, is_finished=False): """advance decoding @@ -237,7 +237,7 @@ class PaddleASRConnectionHanddler: return num_frames = self.cached_feat.shape[1] - logger.info( + logger.debug( f"Required decoding window {decoding_window} frames, and the connection has {num_frames} frames" ) @@ -355,7 +355,7 @@ class ASRServerExecutor(ASRExecutor): lm_url = self.task_resource.res_dict['lm_url'] lm_md5 = self.task_resource.res_dict['lm_md5'] - logger.info(f"Start to load language model {lm_url}") + logger.debug(f"Start to load language model {lm_url}") self.download_lm( lm_url, os.path.dirname(self.config.decode.lang_model_path), lm_md5) @@ -367,7 +367,7 @@ class ASRServerExecutor(ASRExecutor): if "deepspeech2" in self.model_type: # AM predictor - logger.info("ASR engine start to init the am predictor") + logger.debug("ASR engine start to init the am predictor") self.am_predictor = onnx_infer.get_sess( model_path=self.am_model, sess_conf=self.am_predictor_conf) else: @@ -400,7 +400,7 @@ class ASRServerExecutor(ASRExecutor): self.num_decoding_left_chunks = num_decoding_left_chunks # conf for paddleinference predictor or onnx self.am_predictor_conf = am_predictor_conf - logger.info(f"model_type: {self.model_type}") + logger.debug(f"model_type: {self.model_type}") sample_rate_str = '16k' if sample_rate == 16000 else '8k' tag = model_type + '-' + lang + '-' + sample_rate_str @@ -422,12 +422,11 @@ class ASRServerExecutor(ASRExecutor): # self.res_path, self.task_resource.res_dict[ # 'params']) if am_params is None else os.path.abspath(am_params) - logger.info("Load the pretrained model:") - logger.info(f" tag = {tag}") - logger.info(f" res_path: {self.res_path}") - logger.info(f" cfg path: {self.cfg_path}") - logger.info(f" am_model path: {self.am_model}") - # logger.info(f" am_params path: {self.am_params}") + logger.debug("Load the pretrained model:") + logger.debug(f" tag = {tag}") + logger.debug(f" res_path: {self.res_path}") + logger.debug(f" cfg path: {self.cfg_path}") + logger.debug(f" am_model path: {self.am_model}") #Init body. 
self.config = CfgNode(new_allowed=True) @@ -436,7 +435,7 @@ class ASRServerExecutor(ASRExecutor): if self.config.spm_model_prefix: self.config.spm_model_prefix = os.path.join( self.res_path, self.config.spm_model_prefix) - logger.info(f"spm model path: {self.config.spm_model_prefix}") + logger.debug(f"spm model path: {self.config.spm_model_prefix}") self.vocab = self.config.vocab_filepath @@ -450,7 +449,7 @@ class ASRServerExecutor(ASRExecutor): # AM predictor self.init_model() - logger.info(f"create the {model_type} model success") + logger.debug(f"create the {model_type} model success") return True @@ -501,7 +500,7 @@ class ASREngine(BaseEngine): "If all GPU or XPU is used, you can set the server to 'cpu'") sys.exit(-1) - logger.info(f"paddlespeech_server set the device: {self.device}") + logger.debug(f"paddlespeech_server set the device: {self.device}") if not self.init_model(): logger.error( @@ -509,7 +508,8 @@ class ASREngine(BaseEngine): ) return False - logger.info("Initialize ASR server engine successfully.") + logger.info("Initialize ASR server engine successfully on device: %s." % + (self.device)) return True def new_handler(self): diff --git a/paddlespeech/server/engine/asr/online/paddleinference/asr_engine.py b/paddlespeech/server/engine/asr/online/paddleinference/asr_engine.py index a450e430b..182e64180 100644 --- a/paddlespeech/server/engine/asr/online/paddleinference/asr_engine.py +++ b/paddlespeech/server/engine/asr/online/paddleinference/asr_engine.py @@ -23,14 +23,14 @@ from yacs.config import CfgNode from paddlespeech.cli.asr.infer import ASRExecutor from paddlespeech.cli.log import logger -from paddlespeech.cli.utils import MODEL_HOME from paddlespeech.resource import CommonTaskResource +from paddlespeech.audio.transform.transformation import Transformation from paddlespeech.s2t.frontend.featurizer.text_featurizer import TextFeaturizer from paddlespeech.s2t.modules.ctc import CTCDecoder -from paddlespeech.s2t.transform.transformation import Transformation from paddlespeech.s2t.utils.utility import UpdateConfig from paddlespeech.server.engine.base_engine import BaseEngine from paddlespeech.server.utils.paddle_predictor import init_predictor +from paddlespeech.utils.env import MODEL_HOME __all__ = ['PaddleASRConnectionHanddler', 'ASRServerExecutor', 'ASREngine'] @@ -44,7 +44,7 @@ class PaddleASRConnectionHanddler: asr_engine (ASREngine): the global asr engine """ super().__init__() - logger.info( + logger.debug( "create an paddle asr connection handler to process the websocket connection" ) self.config = asr_engine.config # server config @@ -157,7 +157,7 @@ class PaddleASRConnectionHanddler: assert samples.ndim == 1 self.num_samples += samples.shape[0] - logger.info( + logger.debug( f"This package receive {samples.shape[0]} pcm data. 
Global samples:{self.num_samples}" ) @@ -168,7 +168,7 @@ class PaddleASRConnectionHanddler: else: assert self.remained_wav.ndim == 1 # (T,) self.remained_wav = np.concatenate([self.remained_wav, samples]) - logger.info( + logger.debug( f"The concatenation of remain and now audio samples length is: {self.remained_wav.shape}" ) @@ -202,14 +202,14 @@ class PaddleASRConnectionHanddler: # update remained wav self.remained_wav = self.remained_wav[self.n_shift * num_frames:] - logger.info( + logger.debug( f"process the audio feature success, the cached feat shape: {self.cached_feat.shape}" ) - logger.info( + logger.debug( f"After extract feat, the cached remain the audio samples: {self.remained_wav.shape}" ) - logger.info(f"global samples: {self.num_samples}") - logger.info(f"global frames: {self.num_frames}") + logger.debug(f"global samples: {self.num_samples}") + logger.debug(f"global frames: {self.num_frames}") def decode(self, is_finished=False): """advance decoding @@ -237,13 +237,13 @@ class PaddleASRConnectionHanddler: return num_frames = self.cached_feat.shape[1] - logger.info( + logger.debug( f"Required decoding window {decoding_window} frames, and the connection has {num_frames} frames" ) # the cached feat must be larger decoding_window if num_frames < decoding_window and not is_finished: - logger.info( + logger.debug( f"frame feat num is less than {decoding_window}, please input more pcm data" ) return None, None @@ -294,7 +294,7 @@ class PaddleASRConnectionHanddler: Returns: logprob: poster probability. """ - logger.info("start to decoce one chunk for deepspeech2") + logger.debug("start to decoce one chunk for deepspeech2") input_names = self.am_predictor.get_input_names() audio_handle = self.am_predictor.get_input_handle(input_names[0]) audio_len_handle = self.am_predictor.get_input_handle(input_names[1]) @@ -369,7 +369,7 @@ class ASRServerExecutor(ASRExecutor): lm_url = self.task_resource.res_dict['lm_url'] lm_md5 = self.task_resource.res_dict['lm_md5'] - logger.info(f"Start to load language model {lm_url}") + logger.debug(f"Start to load language model {lm_url}") self.download_lm( lm_url, os.path.dirname(self.config.decode.lang_model_path), lm_md5) @@ -381,7 +381,7 @@ class ASRServerExecutor(ASRExecutor): if "deepspeech2" in self.model_type: # AM predictor - logger.info("ASR engine start to init the am predictor") + logger.debug("ASR engine start to init the am predictor") self.am_predictor = init_predictor( model_file=self.am_model, params_file=self.am_params, @@ -415,7 +415,7 @@ class ASRServerExecutor(ASRExecutor): self.num_decoding_left_chunks = num_decoding_left_chunks # conf for paddleinference predictor or onnx self.am_predictor_conf = am_predictor_conf - logger.info(f"model_type: {self.model_type}") + logger.debug(f"model_type: {self.model_type}") sample_rate_str = '16k' if sample_rate == 16000 else '8k' tag = model_type + '-' + lang + '-' + sample_rate_str @@ -437,12 +437,12 @@ class ASRServerExecutor(ASRExecutor): self.res_path = os.path.dirname( os.path.dirname(os.path.abspath(self.cfg_path))) - logger.info("Load the pretrained model:") - logger.info(f" tag = {tag}") - logger.info(f" res_path: {self.res_path}") - logger.info(f" cfg path: {self.cfg_path}") - logger.info(f" am_model path: {self.am_model}") - logger.info(f" am_params path: {self.am_params}") + logger.debug("Load the pretrained model:") + logger.debug(f" tag = {tag}") + logger.debug(f" res_path: {self.res_path}") + logger.debug(f" cfg path: {self.cfg_path}") + logger.debug(f" am_model path: {self.am_model}") 
+ logger.debug(f" am_params path: {self.am_params}") #Init body. self.config = CfgNode(new_allowed=True) @@ -451,7 +451,7 @@ class ASRServerExecutor(ASRExecutor): if self.config.spm_model_prefix: self.config.spm_model_prefix = os.path.join( self.res_path, self.config.spm_model_prefix) - logger.info(f"spm model path: {self.config.spm_model_prefix}") + logger.debug(f"spm model path: {self.config.spm_model_prefix}") self.vocab = self.config.vocab_filepath @@ -465,7 +465,7 @@ class ASRServerExecutor(ASRExecutor): # AM predictor self.init_model() - logger.info(f"create the {model_type} model success") + logger.debug(f"create the {model_type} model success") return True @@ -516,7 +516,7 @@ class ASREngine(BaseEngine): "If all GPU or XPU is used, you can set the server to 'cpu'") sys.exit(-1) - logger.info(f"paddlespeech_server set the device: {self.device}") + logger.debug(f"paddlespeech_server set the device: {self.device}") if not self.init_model(): logger.error( @@ -524,7 +524,9 @@ class ASREngine(BaseEngine): ) return False - logger.info("Initialize ASR server engine successfully.") + logger.info("Initialize ASR server engine successfully on device: %s." % + (self.device)) + return True def new_handler(self): diff --git a/paddlespeech/server/engine/asr/online/python/asr_engine.py b/paddlespeech/server/engine/asr/online/python/asr_engine.py index c22cbbe5f..4df38f09d 100644 --- a/paddlespeech/server/engine/asr/online/python/asr_engine.py +++ b/paddlespeech/server/engine/asr/online/python/asr_engine.py @@ -23,11 +23,10 @@ from yacs.config import CfgNode from paddlespeech.cli.asr.infer import ASRExecutor from paddlespeech.cli.log import logger -from paddlespeech.cli.utils import MODEL_HOME from paddlespeech.resource import CommonTaskResource +from paddlespeech.audio.transform.transformation import Transformation from paddlespeech.s2t.frontend.featurizer.text_featurizer import TextFeaturizer from paddlespeech.s2t.modules.ctc import CTCDecoder -from paddlespeech.s2t.transform.transformation import Transformation from paddlespeech.s2t.utils.tensor_utils import add_sos_eos from paddlespeech.s2t.utils.tensor_utils import pad_sequence from paddlespeech.s2t.utils.utility import UpdateConfig @@ -36,6 +35,7 @@ from paddlespeech.server.engine.asr.online.ctc_endpoint import OnlineCTCEndpoint from paddlespeech.server.engine.asr.online.ctc_search import CTCPrefixBeamSearch from paddlespeech.server.engine.base_engine import BaseEngine from paddlespeech.server.utils.paddle_predictor import init_predictor +from paddlespeech.utils.env import MODEL_HOME __all__ = ['PaddleASRConnectionHanddler', 'ASRServerExecutor', 'ASREngine'] @@ -49,7 +49,7 @@ class PaddleASRConnectionHanddler: asr_engine (ASREngine): the global asr engine """ super().__init__() - logger.info( + logger.debug( "create an paddle asr connection handler to process the websocket connection" ) self.config = asr_engine.config # server config @@ -107,7 +107,7 @@ class PaddleASRConnectionHanddler: # acoustic model self.model = self.asr_engine.executor.model self.continuous_decoding = self.config.continuous_decoding - logger.info(f"continue decoding: {self.continuous_decoding}") + logger.debug(f"continue decoding: {self.continuous_decoding}") # ctc decoding config self.ctc_decode_config = self.asr_engine.executor.config.decode @@ -130,9 +130,9 @@ class PaddleASRConnectionHanddler: ## conformer # cache for conformer online - self.subsampling_cache = None - self.elayers_output_cache = None - self.conformer_cnn_cache = None + self.att_cache = 
paddle.zeros([0,0,0,0]) + self.cnn_cache = paddle.zeros([0,0,0,0]) + self.encoder_out = None # conformer decoding state self.offset = 0 # global offset in decoding frame unit @@ -207,7 +207,7 @@ class PaddleASRConnectionHanddler: assert samples.ndim == 1 self.num_samples += samples.shape[0] - logger.info( + logger.debug( f"This package receive {samples.shape[0]} pcm data. Global samples:{self.num_samples}" ) @@ -218,7 +218,7 @@ class PaddleASRConnectionHanddler: else: assert self.remained_wav.ndim == 1 # (T,) self.remained_wav = np.concatenate([self.remained_wav, samples]) - logger.info( + logger.debug( f"The concatenation of remain and now audio samples length is: {self.remained_wav.shape}" ) @@ -252,14 +252,14 @@ class PaddleASRConnectionHanddler: # update remained wav self.remained_wav = self.remained_wav[self.n_shift * num_frames:] - logger.info( + logger.debug( f"process the audio feature success, the cached feat shape: {self.cached_feat.shape}" ) - logger.info( + logger.debug( f"After extract feat, the cached remain the audio samples: {self.remained_wav.shape}" ) - logger.info(f"global samples: {self.num_samples}") - logger.info(f"global frames: {self.num_frames}") + logger.debug(f"global samples: {self.num_samples}") + logger.debug(f"global frames: {self.num_frames}") def decode(self, is_finished=False): """advance decoding @@ -283,24 +283,24 @@ class PaddleASRConnectionHanddler: stride = subsampling * decoding_chunk_size if self.cached_feat is None: - logger.info("no audio feat, please input more pcm data") + logger.debug("no audio feat, please input more pcm data") return num_frames = self.cached_feat.shape[1] - logger.info( + logger.debug( f"Required decoding window {decoding_window} frames, and the connection has {num_frames} frames" ) # the cached feat must be larger decoding_window if num_frames < decoding_window and not is_finished: - logger.info( + logger.debug( f"frame feat num is less than {decoding_window}, please input more pcm data" ) return None, None # if is_finished=True, we need at least context frames if num_frames < context: - logger.info( + logger.debug( "flast {num_frames} is less than context {context} frames, and we cannot do model forward" ) return None, None @@ -354,7 +354,7 @@ class PaddleASRConnectionHanddler: Returns: logprob: poster probability. 
""" - logger.info("start to decoce one chunk for deepspeech2") + logger.debug("start to decoce one chunk for deepspeech2") input_names = self.am_predictor.get_input_names() audio_handle = self.am_predictor.get_input_handle(input_names[0]) audio_len_handle = self.am_predictor.get_input_handle(input_names[1]) @@ -391,7 +391,7 @@ class PaddleASRConnectionHanddler: self.decoder.next(output_chunk_probs, output_chunk_lens) trans_best, trans_beam = self.decoder.decode() - logger.info(f"decode one best result for deepspeech2: {trans_best[0]}") + logger.debug(f"decode one best result for deepspeech2: {trans_best[0]}") return trans_best[0] @paddle.no_grad() @@ -402,7 +402,7 @@ class PaddleASRConnectionHanddler: # reset endpiont state self.endpoint_state = False - logger.info( + logger.debug( "Conformer/Transformer: start to decode with advanced_decoding method" ) cfg = self.ctc_decode_config @@ -427,25 +427,25 @@ class PaddleASRConnectionHanddler: stride = subsampling * decoding_chunk_size if self.cached_feat is None: - logger.info("no audio feat, please input more pcm data") + logger.debug("no audio feat, please input more pcm data") return # (B=1,T,D) num_frames = self.cached_feat.shape[1] - logger.info( + logger.debug( f"Required decoding window {decoding_window} frames, and the connection has {num_frames} frames" ) # the cached feat must be larger decoding_window if num_frames < decoding_window and not is_finished: - logger.info( + logger.debug( f"frame feat num is less than {decoding_window}, please input more pcm data" ) return None, None # if is_finished=True, we need at least context frames if num_frames < context: - logger.info( + logger.debug( "flast {num_frames} is less than context {context} frames, and we cannot do model forward" ) return None, None @@ -474,11 +474,9 @@ class PaddleASRConnectionHanddler: # cur chunk chunk_xs = self.cached_feat[:, cur:end, :] # forward chunk - (y, self.subsampling_cache, self.elayers_output_cache, - self.conformer_cnn_cache) = self.model.encoder.forward_chunk( + (y, self.att_cache, self.cnn_cache) = self.model.encoder.forward_chunk( chunk_xs, self.offset, required_cache_size, - self.subsampling_cache, self.elayers_output_cache, - self.conformer_cnn_cache) + self.att_cache, self.cnn_cache) outputs.append(y) # update the global offset, in decoding frame unit @@ -489,7 +487,7 @@ class PaddleASRConnectionHanddler: self.encoder_out = ys else: self.encoder_out = paddle.concat([self.encoder_out, ys], axis=1) - logger.info( + logger.debug( f"This connection handler encoder out shape: {self.encoder_out.shape}" ) @@ -513,7 +511,8 @@ class PaddleASRConnectionHanddler: if self.endpointer.endpoint_detected(ctc_probs.numpy(), decoding_something): self.endpoint_state = True - logger.info(f"Endpoint is detected at {self.num_frames} frame.") + logger.debug( + f"Endpoint is detected at {self.num_frames} frame.") # advance cache of feat assert self.cached_feat.shape[0] == 1 #(B=1,T,D) @@ -526,7 +525,7 @@ class PaddleASRConnectionHanddler: def update_result(self): """Conformer/Transformer hyps to result. """ - logger.info("update the final result") + logger.debug("update the final result") hyps = self.hyps # output results and tokenids @@ -560,16 +559,16 @@ class PaddleASRConnectionHanddler: only for conformer and transformer model. 
""" if "deepspeech2" in self.model_type: - logger.info("deepspeech2 not support rescoring decoding.") + logger.debug("deepspeech2 not support rescoring decoding.") return if "attention_rescoring" != self.ctc_decode_config.decoding_method: - logger.info( + logger.debug( f"decoding method not match: {self.ctc_decode_config.decoding_method}, need attention_rescoring" ) return - logger.info("rescoring the final result") + logger.debug("rescoring the final result") # last decoding for last audio self.searcher.finalize_search() @@ -685,7 +684,6 @@ class PaddleASRConnectionHanddler: "bg": global_offset_in_sec + start, "ed": global_offset_in_sec + end }) - # logger.info(f"{word_time_stamp[-1]}") self.word_time_stamp = word_time_stamp logger.info(f"word time stamp: {self.word_time_stamp}") @@ -707,13 +705,13 @@ class ASRServerExecutor(ASRExecutor): lm_url = self.task_resource.res_dict['lm_url'] lm_md5 = self.task_resource.res_dict['lm_md5'] - logger.info(f"Start to load language model {lm_url}") + logger.debug(f"Start to load language model {lm_url}") self.download_lm( lm_url, os.path.dirname(self.config.decode.lang_model_path), lm_md5) elif "conformer" in self.model_type or "transformer" in self.model_type: with UpdateConfig(self.config): - logger.info("start to create the stream conformer asr engine") + logger.debug("start to create the stream conformer asr engine") # update the decoding method if self.decode_method: self.config.decode.decoding_method = self.decode_method @@ -726,7 +724,7 @@ class ASRServerExecutor(ASRExecutor): if self.config.decode.decoding_method not in [ "ctc_prefix_beam_search", "attention_rescoring" ]: - logger.info( + logger.debug( "we set the decoding_method to attention_rescoring") self.config.decode.decoding_method = "attention_rescoring" @@ -739,7 +737,7 @@ class ASRServerExecutor(ASRExecutor): def init_model(self) -> None: if "deepspeech2" in self.model_type: # AM predictor - logger.info("ASR engine start to init the am predictor") + logger.debug("ASR engine start to init the am predictor") self.am_predictor = init_predictor( model_file=self.am_model, params_file=self.am_params, @@ -748,7 +746,7 @@ class ASRServerExecutor(ASRExecutor): # load model # model_type: {model_name}_{dataset} model_name = self.model_type[:self.model_type.rindex('_')] - logger.info(f"model name: {model_name}") + logger.debug(f"model name: {model_name}") model_class = self.task_resource.get_model_class(model_name) model = model_class.from_config(self.config) self.model = model @@ -782,7 +780,7 @@ class ASRServerExecutor(ASRExecutor): self.num_decoding_left_chunks = num_decoding_left_chunks # conf for paddleinference predictor or onnx self.am_predictor_conf = am_predictor_conf - logger.info(f"model_type: {self.model_type}") + logger.debug(f"model_type: {self.model_type}") sample_rate_str = '16k' if sample_rate == 16000 else '8k' tag = model_type + '-' + lang + '-' + sample_rate_str @@ -804,12 +802,12 @@ class ASRServerExecutor(ASRExecutor): self.res_path = os.path.dirname( os.path.dirname(os.path.abspath(self.cfg_path))) - logger.info("Load the pretrained model:") - logger.info(f" tag = {tag}") - logger.info(f" res_path: {self.res_path}") - logger.info(f" cfg path: {self.cfg_path}") - logger.info(f" am_model path: {self.am_model}") - logger.info(f" am_params path: {self.am_params}") + logger.debug("Load the pretrained model:") + logger.debug(f" tag = {tag}") + logger.debug(f" res_path: {self.res_path}") + logger.debug(f" cfg path: {self.cfg_path}") + logger.debug(f" am_model path: 
{self.am_model}") + logger.debug(f" am_params path: {self.am_params}") #Init body. self.config = CfgNode(new_allowed=True) @@ -818,7 +816,7 @@ class ASRServerExecutor(ASRExecutor): if self.config.spm_model_prefix: self.config.spm_model_prefix = os.path.join( self.res_path, self.config.spm_model_prefix) - logger.info(f"spm model path: {self.config.spm_model_prefix}") + logger.debug(f"spm model path: {self.config.spm_model_prefix}") self.vocab = self.config.vocab_filepath @@ -832,7 +830,7 @@ class ASRServerExecutor(ASRExecutor): # AM predictor self.init_model() - logger.info(f"create the {model_type} model success") + logger.debug(f"create the {model_type} model success") return True @@ -883,7 +881,7 @@ class ASREngine(BaseEngine): "If all GPU or XPU is used, you can set the server to 'cpu'") sys.exit(-1) - logger.info(f"paddlespeech_server set the device: {self.device}") + logger.debug(f"paddlespeech_server set the device: {self.device}") if not self.init_model(): logger.error( @@ -891,7 +889,9 @@ class ASREngine(BaseEngine): ) return False - logger.info("Initialize ASR server engine successfully.") + logger.info("Initialize ASR server engine successfully on device: %s." % + (self.device)) + return True def new_handler(self): diff --git a/paddlespeech/server/engine/asr/paddleinference/asr_engine.py b/paddlespeech/server/engine/asr/paddleinference/asr_engine.py index 1a3b4620a..6df666ce8 100644 --- a/paddlespeech/server/engine/asr/paddleinference/asr_engine.py +++ b/paddlespeech/server/engine/asr/paddleinference/asr_engine.py @@ -21,7 +21,6 @@ from yacs.config import CfgNode from paddlespeech.cli.asr.infer import ASRExecutor from paddlespeech.cli.log import logger -from paddlespeech.cli.utils import MODEL_HOME from paddlespeech.resource import CommonTaskResource from paddlespeech.s2t.frontend.featurizer.text_featurizer import TextFeaturizer from paddlespeech.s2t.modules.ctc import CTCDecoder @@ -29,6 +28,7 @@ from paddlespeech.s2t.utils.utility import UpdateConfig from paddlespeech.server.engine.base_engine import BaseEngine from paddlespeech.server.utils.paddle_predictor import init_predictor from paddlespeech.server.utils.paddle_predictor import run_model +from paddlespeech.utils.env import MODEL_HOME __all__ = ['ASREngine', 'PaddleASRConnectionHandler'] @@ -65,10 +65,10 @@ class ASRServerExecutor(ASRExecutor): self.task_resource.res_dict['model']) self.am_params = os.path.join(self.res_path, self.task_resource.res_dict['params']) - logger.info(self.res_path) - logger.info(self.cfg_path) - logger.info(self.am_model) - logger.info(self.am_params) + logger.debug(self.res_path) + logger.debug(self.cfg_path) + logger.debug(self.am_model) + logger.debug(self.am_params) else: self.cfg_path = os.path.abspath(cfg_path) self.am_model = os.path.abspath(am_model) @@ -236,16 +236,16 @@ class PaddleASRConnectionHandler(ASRServerExecutor): if self._check( io.BytesIO(audio_data), self.asr_engine.config.sample_rate, self.asr_engine.config.force_yes): - logger.info("start running asr engine") + logger.debug("start running asr engine") self.preprocess(self.asr_engine.config.model_type, io.BytesIO(audio_data)) st = time.time() self.infer(self.asr_engine.config.model_type) infer_time = time.time() - st self.output = self.postprocess() # Retrieve result of asr. 
- logger.info("end inferring asr engine") + logger.debug("end inferring asr engine") else: - logger.info("file check failed!") + logger.error("file check failed!") self.output = None logger.info("inference time: {}".format(infer_time)) diff --git a/paddlespeech/server/engine/asr/python/asr_engine.py b/paddlespeech/server/engine/asr/python/asr_engine.py index f9cc3a665..02c40fd12 100644 --- a/paddlespeech/server/engine/asr/python/asr_engine.py +++ b/paddlespeech/server/engine/asr/python/asr_engine.py @@ -104,7 +104,7 @@ class PaddleASRConnectionHandler(ASRServerExecutor): if self._check( io.BytesIO(audio_data), self.asr_engine.config.sample_rate, self.asr_engine.config.force_yes): - logger.info("start run asr engine") + logger.debug("start run asr engine") self.preprocess(self.asr_engine.config.model, io.BytesIO(audio_data)) st = time.time() @@ -112,7 +112,7 @@ class PaddleASRConnectionHandler(ASRServerExecutor): infer_time = time.time() - st self.output = self.postprocess() # Retrieve result of asr. else: - logger.info("file check failed!") + logger.error("file check failed!") self.output = None logger.info("inference time: {}".format(infer_time)) diff --git a/paddlespeech/server/engine/cls/paddleinference/cls_engine.py b/paddlespeech/server/engine/cls/paddleinference/cls_engine.py index 389d56055..fa62ba67c 100644 --- a/paddlespeech/server/engine/cls/paddleinference/cls_engine.py +++ b/paddlespeech/server/engine/cls/paddleinference/cls_engine.py @@ -67,22 +67,22 @@ class CLSServerExecutor(CLSExecutor): self.params_path = os.path.abspath(params_path) self.label_file = os.path.abspath(label_file) - logger.info(self.cfg_path) - logger.info(self.model_path) - logger.info(self.params_path) - logger.info(self.label_file) + logger.debug(self.cfg_path) + logger.debug(self.model_path) + logger.debug(self.params_path) + logger.debug(self.label_file) # config with open(self.cfg_path, 'r') as f: self._conf = yaml.safe_load(f) - logger.info("Read cfg file successfully.") + logger.debug("Read cfg file successfully.") # labels self._label_list = [] with open(self.label_file, 'r') as f: for line in f: self._label_list.append(line.strip()) - logger.info("Read label file successfully.") + logger.debug("Read label file successfully.") # Create predictor self.predictor_conf = predictor_conf @@ -90,7 +90,7 @@ class CLSServerExecutor(CLSExecutor): model_file=self.model_path, params_file=self.params_path, predictor_conf=self.predictor_conf) - logger.info("Create predictor successfully.") + logger.debug("Create predictor successfully.") @paddle.no_grad() def infer(self): @@ -148,7 +148,8 @@ class CLSEngine(BaseEngine): logger.error(e) return False - logger.info("Initialize CLS server engine successfully.") + logger.info("Initialize CLS server engine successfully on device: %s." 
% + (self.device)) return True @@ -160,7 +161,7 @@ class PaddleCLSConnectionHandler(CLSServerExecutor): cls_engine (CLSEngine): The CLS engine """ super().__init__() - logger.info( + logger.debug( "Create PaddleCLSConnectionHandler to process the cls request") self._inputs = OrderedDict() @@ -183,7 +184,7 @@ class PaddleCLSConnectionHandler(CLSServerExecutor): self.infer() infer_time = time.time() - st - logger.info("inference time: {}".format(infer_time)) + logger.debug("inference time: {}".format(infer_time)) logger.info("cls engine type: inference") def postprocess(self, topk: int): diff --git a/paddlespeech/server/engine/cls/python/cls_engine.py b/paddlespeech/server/engine/cls/python/cls_engine.py index f8d8f20ef..210f4cbbb 100644 --- a/paddlespeech/server/engine/cls/python/cls_engine.py +++ b/paddlespeech/server/engine/cls/python/cls_engine.py @@ -88,7 +88,7 @@ class PaddleCLSConnectionHandler(CLSServerExecutor): cls_engine (CLSEngine): The CLS engine """ super().__init__() - logger.info( + logger.debug( "Create PaddleCLSConnectionHandler to process the cls request") self._inputs = OrderedDict() @@ -110,7 +110,7 @@ class PaddleCLSConnectionHandler(CLSServerExecutor): self.infer() infer_time = time.time() - st - logger.info("inference time: {}".format(infer_time)) + logger.debug("inference time: {}".format(infer_time)) logger.info("cls engine type: python") def postprocess(self, topk: int): diff --git a/paddlespeech/server/engine/engine_factory.py b/paddlespeech/server/engine/engine_factory.py index 6a66a002e..c4f3f9803 100644 --- a/paddlespeech/server/engine/engine_factory.py +++ b/paddlespeech/server/engine/engine_factory.py @@ -13,7 +13,7 @@ # limitations under the License. from typing import Text -from ..utils.log import logger +from paddlespeech.cli.log import logger __all__ = ['EngineFactory'] diff --git a/paddlespeech/server/engine/engine_warmup.py b/paddlespeech/server/engine/engine_warmup.py index 5f548f71d..3751554c2 100644 --- a/paddlespeech/server/engine/engine_warmup.py +++ b/paddlespeech/server/engine/engine_warmup.py @@ -45,7 +45,7 @@ def warm_up(engine_and_type: str, warm_up_time: int=3) -> bool: logger.error("Please check tte engine type.") try: - logger.info("Start to warm up tts engine.") + logger.debug("Start to warm up tts engine.") for i in range(warm_up_time): connection_handler = PaddleTTSConnectionHandler(tts_engine) if flag_online: @@ -53,16 +53,19 @@ def warm_up(engine_and_type: str, warm_up_time: int=3) -> bool: text=sentence, lang=tts_engine.lang, am=tts_engine.config.am): - logger.info( + logger.debug( f"The first response time of the {i} warm up: {connection_handler.first_response_time} s" ) break else: st = time.time() - connection_handler.infer(text=sentence) + connection_handler.infer( + text=sentence, + lang=tts_engine.lang, + am=tts_engine.config.am) et = time.time() - logger.info( + logger.debug( f"The response time of the {i} warm up: {et - st} s") except Exception as e: logger.error("Failed to warm up on tts engine.") diff --git a/paddlespeech/server/engine/text/python/text_engine.py b/paddlespeech/server/engine/text/python/text_engine.py index 73cf8737b..6167e7784 100644 --- a/paddlespeech/server/engine/text/python/text_engine.py +++ b/paddlespeech/server/engine/text/python/text_engine.py @@ -28,7 +28,7 @@ class PaddleTextConnectionHandler: text_engine (TextEngine): The Text engine """ super().__init__() - logger.info( + logger.debug( "Create PaddleTextConnectionHandler to process the text request") self.text_engine = text_engine self.task = 
self.text_engine.executor.task @@ -130,7 +130,7 @@ class TextEngine(BaseEngine): """The Text Engine """ super(TextEngine, self).__init__() - logger.info("Create the TextEngine Instance") + logger.debug("Create the TextEngine Instance") def init(self, config: dict): """Init the Text Engine @@ -141,7 +141,7 @@ class TextEngine(BaseEngine): Returns: bool: The engine instance flag """ - logger.info("Init the text engine") + logger.debug("Init the text engine") try: self.config = config if self.config.device: @@ -150,7 +150,7 @@ class TextEngine(BaseEngine): self.device = paddle.get_device() paddle.set_device(self.device) - logger.info(f"Text Engine set the device: {self.device}") + logger.debug(f"Text Engine set the device: {self.device}") except BaseException as e: logger.error( "Set device failed, please check if device is already used and the parameter 'device' in the yaml file" @@ -168,5 +168,6 @@ class TextEngine(BaseEngine): ckpt_path=config.ckpt_path, vocab_file=config.vocab_file) - logger.info("Init the text engine successfully") + logger.info("Initialize Text server engine successfully on device: %s." + % (self.device)) return True diff --git a/paddlespeech/server/engine/tts/online/onnx/tts_engine.py b/paddlespeech/server/engine/tts/online/onnx/tts_engine.py index cb9155a2d..0995a55da 100644 --- a/paddlespeech/server/engine/tts/online/onnx/tts_engine.py +++ b/paddlespeech/server/engine/tts/online/onnx/tts_engine.py @@ -62,19 +62,20 @@ class TTSServerExecutor(TTSExecutor): (hasattr(self, 'am_encoder_infer_sess') and hasattr(self, 'am_decoder_sess') and hasattr( self, 'am_postnet_sess'))) and hasattr(self, 'voc_inference'): - logger.info('Models had been initialized.') + logger.debug('Models had been initialized.') return + # am am_tag = am + '-' + lang - self.task_resource.set_task_model( - model_tag=am_tag, - model_type=0, # am - version=None, # default version - ) - self.am_res_path = self.task_resource.res_dir if am == "fastspeech2_csmsc_onnx": # get model info if am_ckpt is None or phones_dict is None: + self.task_resource.set_task_model( + model_tag=am_tag, + model_type=0, # am + version=None, # default version + ) + self.am_res_path = self.task_resource.res_dir self.am_ckpt = os.path.join( self.am_res_path, self.task_resource.res_dict['ckpt'][0]) # must have phones_dict in acoustic @@ -85,14 +86,19 @@ class TTSServerExecutor(TTSExecutor): else: self.am_ckpt = os.path.abspath(am_ckpt[0]) self.phones_dict = os.path.abspath(phones_dict) - self.am_res_path = os.path.dirname( - os.path.abspath(self.am_ckpt)) + self.am_res_path = os.path.dirname(os.path.abspath(am_ckpt)) # create am sess self.am_sess = get_sess(self.am_ckpt, am_sess_conf) elif am == "fastspeech2_cnndecoder_csmsc_onnx": if am_ckpt is None or am_stat is None or phones_dict is None: + self.task_resource.set_task_model( + model_tag=am_tag, + model_type=0, # am + version=None, # default version + ) + self.am_res_path = self.task_resource.res_dir self.am_encoder_infer = os.path.join( self.am_res_path, self.task_resource.res_dict['ckpt'][0]) self.am_decoder = os.path.join( @@ -113,8 +119,7 @@ class TTSServerExecutor(TTSExecutor): self.am_postnet = os.path.abspath(am_ckpt[2]) self.phones_dict = os.path.abspath(phones_dict) self.am_stat = os.path.abspath(am_stat) - self.am_res_path = os.path.dirname( - os.path.abspath(self.am_ckpt)) + self.am_res_path = os.path.dirname(os.path.abspath(am_ckpt[0])) # create am sess self.am_encoder_infer_sess = get_sess(self.am_encoder_infer, @@ -124,34 +129,35 @@ class 
TTSServerExecutor(TTSExecutor): self.am_mu, self.am_std = np.load(self.am_stat) - logger.info(f"self.phones_dict: {self.phones_dict}") - logger.info(f"am model dir: {self.am_res_path}") - logger.info("Create am sess successfully.") + logger.debug(f"self.phones_dict: {self.phones_dict}") + logger.debug(f"am model dir: {self.am_res_path}") + logger.debug("Create am sess successfully.") # voc model info voc_tag = voc + '-' + lang - self.task_resource.set_task_model( - model_tag=voc_tag, - model_type=1, # vocoder - version=None, # default version - ) + if voc_ckpt is None: + self.task_resource.set_task_model( + model_tag=voc_tag, + model_type=1, # vocoder + version=None, # default version + ) self.voc_res_path = self.task_resource.voc_res_dir self.voc_ckpt = os.path.join( self.voc_res_path, self.task_resource.voc_res_dict['ckpt']) else: self.voc_ckpt = os.path.abspath(voc_ckpt) self.voc_res_path = os.path.dirname(os.path.abspath(self.voc_ckpt)) - logger.info(self.voc_res_path) + logger.debug(self.voc_res_path) # create voc sess self.voc_sess = get_sess(self.voc_ckpt, voc_sess_conf) - logger.info("Create voc sess successfully.") + logger.debug("Create voc sess successfully.") with open(self.phones_dict, "r") as f: phn_id = [line.strip().split() for line in f.readlines()] self.vocab_size = len(phn_id) - logger.info(f"vocab_size: {self.vocab_size}") + logger.debug(f"vocab_size: {self.vocab_size}") # frontend self.tones_dict = None @@ -162,7 +168,7 @@ class TTSServerExecutor(TTSExecutor): elif lang == 'en': self.frontend = English(phone_vocab_path=self.phones_dict) - logger.info("frontend done!") + logger.debug("frontend done!") class TTSEngine(BaseEngine): @@ -206,6 +212,8 @@ class TTSEngine(BaseEngine): self.config.voc_sample_rate == self.config.am_sample_rate ), "The sample rate of AM and Vocoder model are different, please check model." + self.sample_rate = self.config.voc_sample_rate + try: if self.config.am_sess_conf.device is not None: self.device = self.config.am_sess_conf.device @@ -260,7 +268,7 @@ class PaddleTTSConnectionHandler: tts_engine (TTSEngine): The TTS engine """ super().__init__() - logger.info( + logger.debug( "Create PaddleTTSConnectionHandler to process the tts request") self.tts_engine = tts_engine @@ -434,33 +442,13 @@ class PaddleTTSConnectionHandler: self.final_response_time = time.time() - frontend_st - def preprocess(self, text_bese64: str=None, text_bytes: bytes=None): - # Convert byte to text - if text_bese64: - text_bytes = base64.b64decode(text_bese64) # base64 to bytes - text = text_bytes.decode('utf-8') # bytes to text - - return text - - def run(self, - sentence: str, - spk_id: int=0, - speed: float=1.0, - volume: float=1.0, - sample_rate: int=0, - save_path: str=None): + def run(self, sentence: str, spk_id: int=0): """ run include inference and postprocess. Args: sentence (str): text to be synthesized spk_id (int, optional): speaker id for multi-speaker speech synthesis. Defaults to 0. - speed (float, optional): speed. Defaults to 1.0. - volume (float, optional): volume. Defaults to 1.0. - sample_rate (int, optional): target sample rate for synthesized audio, - 0 means the same as the model sampling rate. Defaults to 0. - save_path (str, optional): The save path of the synthesized audio. - None means do not save audio. Defaults to None. - + Returns: wav_base64: The base64 format of the synthesized audio. 
""" @@ -481,7 +469,7 @@ class PaddleTTSConnectionHandler: yield wav_base64 wav_all = np.concatenate(wav_list, axis=0) - duration = len(wav_all) / self.config.voc_sample_rate + duration = len(wav_all) / self.tts_engine.sample_rate logger.info(f"sentence: {sentence}") logger.info(f"The durations of audio is: {duration} s") logger.info(f"first response time: {self.first_response_time} s") diff --git a/paddlespeech/server/engine/tts/online/python/tts_engine.py b/paddlespeech/server/engine/tts/online/python/tts_engine.py index 2e8997e0f..a46b84bd9 100644 --- a/paddlespeech/server/engine/tts/online/python/tts_engine.py +++ b/paddlespeech/server/engine/tts/online/python/tts_engine.py @@ -102,16 +102,22 @@ class TTSServerExecutor(TTSExecutor): Init model and other resources from a specific path. """ if hasattr(self, 'am_inference') and hasattr(self, 'voc_inference'): - logger.info('Models had been initialized.') + logger.debug('Models had been initialized.') return # am model info + if am_ckpt is None or am_config is None or am_stat is None or phones_dict is None: + use_pretrained_am = True + else: + use_pretrained_am = False + am_tag = am + '-' + lang self.task_resource.set_task_model( model_tag=am_tag, model_type=0, # am + skip_download=not use_pretrained_am, version=None, # default version ) - if am_ckpt is None or am_config is None or am_stat is None or phones_dict is None: + if use_pretrained_am: self.am_res_path = self.task_resource.res_dir self.am_config = os.path.join(self.am_res_path, self.task_resource.res_dict['config']) @@ -122,29 +128,33 @@ class TTSServerExecutor(TTSExecutor): # must have phones_dict in acoustic self.phones_dict = os.path.join( self.am_res_path, self.task_resource.res_dict['phones_dict']) - print("self.phones_dict:", self.phones_dict) - logger.info(self.am_res_path) - logger.info(self.am_config) - logger.info(self.am_ckpt) + logger.debug(self.am_res_path) + logger.debug(self.am_config) + logger.debug(self.am_ckpt) else: self.am_config = os.path.abspath(am_config) self.am_ckpt = os.path.abspath(am_ckpt) self.am_stat = os.path.abspath(am_stat) self.phones_dict = os.path.abspath(phones_dict) self.am_res_path = os.path.dirname(os.path.abspath(self.am_config)) - print("self.phones_dict:", self.phones_dict) self.tones_dict = None self.speaker_dict = None # voc model info + if voc_ckpt is None or voc_config is None or voc_stat is None: + use_pretrained_voc = True + else: + use_pretrained_voc = False + voc_tag = voc + '-' + lang self.task_resource.set_task_model( model_tag=voc_tag, model_type=1, # vocoder + skip_download=not use_pretrained_voc, version=None, # default version ) - if voc_ckpt is None or voc_config is None or voc_stat is None: + if use_pretrained_voc: self.voc_res_path = self.task_resource.voc_res_dir self.voc_config = os.path.join( self.voc_res_path, self.task_resource.voc_res_dict['config']) @@ -153,9 +163,9 @@ class TTSServerExecutor(TTSExecutor): self.voc_stat = os.path.join( self.voc_res_path, self.task_resource.voc_res_dict['speech_stats']) - logger.info(self.voc_res_path) - logger.info(self.voc_config) - logger.info(self.voc_ckpt) + logger.debug(self.voc_res_path) + logger.debug(self.voc_config) + logger.debug(self.voc_ckpt) else: self.voc_config = os.path.abspath(voc_config) self.voc_ckpt = os.path.abspath(voc_ckpt) @@ -172,7 +182,6 @@ class TTSServerExecutor(TTSExecutor): with open(self.phones_dict, "r") as f: phn_id = [line.strip().split() for line in f.readlines()] self.vocab_size = len(phn_id) - print("vocab_size:", self.vocab_size) # frontend if 
lang == 'zh': @@ -182,7 +191,6 @@ class TTSServerExecutor(TTSExecutor): elif lang == 'en': self.frontend = English(phone_vocab_path=self.phones_dict) - print("frontend done!") # am infer info self.am_name = am[:am.rindex('_')] @@ -197,7 +205,6 @@ class TTSServerExecutor(TTSExecutor): self.am_name + '_inference') self.am_inference = am_inference_class(am_normalizer, am) self.am_inference.eval() - print("acoustic model done!") # voc infer info self.voc_name = voc[:voc.rindex('_')] @@ -208,7 +215,6 @@ class TTSServerExecutor(TTSExecutor): '_inference') self.voc_inference = voc_inference_class(voc_normalizer, voc) self.voc_inference.eval() - print("voc done!") class TTSEngine(BaseEngine): @@ -276,6 +282,12 @@ class TTSEngine(BaseEngine): logger.error(e) return False + assert ( + self.executor.am_config.fs == self.executor.voc_config.fs + ), "The sample rate of AM and Vocoder model are different, please check model." + + self.sample_rate = self.executor.am_config.fs + self.am_block = self.config.am_block self.am_pad = self.config.am_pad self.voc_block = self.config.voc_block @@ -297,7 +309,7 @@ class PaddleTTSConnectionHandler: tts_engine (TTSEngine): The TTS engine """ super().__init__() - logger.info( + logger.debug( "Create PaddleTTSConnectionHandler to process the tts request") self.tts_engine = tts_engine @@ -357,7 +369,7 @@ class PaddleTTSConnectionHandler: text, merge_sentences=merge_sentences) phone_ids = input_ids["phone_ids"] else: - print("lang should in {'zh', 'en'}!") + logger.error("lang should in {'zh', 'en'}!") frontend_et = time.time() self.frontend_time = frontend_et - frontend_st @@ -459,32 +471,15 @@ class PaddleTTSConnectionHandler: self.final_response_time = time.time() - frontend_st - def preprocess(self, text_bese64: str=None, text_bytes: bytes=None): - # Convert byte to text - if text_bese64: - text_bytes = base64.b64decode(text_bese64) # base64 to bytes - text = text_bytes.decode('utf-8') # bytes to text - - return text - - def run(self, + def run( + self, sentence: str, - spk_id: int=0, - speed: float=1.0, - volume: float=1.0, - sample_rate: int=0, - save_path: str=None): + spk_id: int=0, ): """ run include inference and postprocess. Args: sentence (str): text to be synthesized spk_id (int, optional): speaker id for multi-speaker speech synthesis. Defaults to 0. - speed (float, optional): speed. Defaults to 1.0. - volume (float, optional): volume. Defaults to 1.0. - sample_rate (int, optional): target sample rate for synthesized audio, - 0 means the same as the model sampling rate. Defaults to 0. - save_path (str, optional): The save path of the synthesized audio. - None means do not save audio. Defaults to None. Returns: wav_base64: The base64 format of the synthesized audio. @@ -507,7 +502,7 @@ class PaddleTTSConnectionHandler: yield wav_base64 wav_all = np.concatenate(wav_list, axis=0) - duration = len(wav_all) / self.executor.am_config.fs + duration = len(wav_all) / self.tts_engine.sample_rate logger.info(f"sentence: {sentence}") logger.info(f"The durations of audio is: {duration} s") diff --git a/paddlespeech/server/engine/tts/paddleinference/tts_engine.py b/paddlespeech/server/engine/tts/paddleinference/tts_engine.py index ab5b721ff..43b0df407 100644 --- a/paddlespeech/server/engine/tts/paddleinference/tts_engine.py +++ b/paddlespeech/server/engine/tts/paddleinference/tts_engine.py @@ -65,16 +65,22 @@ class TTSServerExecutor(TTSExecutor): Init model and other resources from a specific path. 
""" if hasattr(self, 'am_predictor') and hasattr(self, 'voc_predictor'): - logger.info('Models had been initialized.') + logger.debug('Models had been initialized.') return # am + if am_model is None or am_params is None or phones_dict is None: + use_pretrained_am = True + else: + use_pretrained_am = False + am_tag = am + '-' + lang self.task_resource.set_task_model( model_tag=am_tag, model_type=0, # am + skip_download=not use_pretrained_am, version=None, # default version ) - if am_model is None or am_params is None or phones_dict is None: + if use_pretrained_am: self.am_res_path = self.task_resource.res_dir self.am_model = os.path.join(self.am_res_path, self.task_resource.res_dict['model']) @@ -85,16 +91,16 @@ class TTSServerExecutor(TTSExecutor): self.am_res_path, self.task_resource.res_dict['phones_dict']) self.am_sample_rate = self.task_resource.res_dict['sample_rate'] - logger.info(self.am_res_path) - logger.info(self.am_model) - logger.info(self.am_params) + logger.debug(self.am_res_path) + logger.debug(self.am_model) + logger.debug(self.am_params) else: self.am_model = os.path.abspath(am_model) self.am_params = os.path.abspath(am_params) self.phones_dict = os.path.abspath(phones_dict) self.am_sample_rate = am_sample_rate self.am_res_path = os.path.dirname(os.path.abspath(self.am_model)) - logger.info("self.phones_dict: {}".format(self.phones_dict)) + logger.debug("self.phones_dict: {}".format(self.phones_dict)) # for speedyspeech self.tones_dict = None @@ -113,13 +119,19 @@ class TTSServerExecutor(TTSExecutor): self.speaker_dict = speaker_dict # voc + if voc_model is None or voc_params is None: + use_pretrained_voc = True + else: + use_pretrained_voc = False + voc_tag = voc + '-' + lang self.task_resource.set_task_model( model_tag=voc_tag, model_type=1, # vocoder + skip_download=not use_pretrained_voc, version=None, # default version ) - if voc_model is None or voc_params is None: + if use_pretrained_voc: self.voc_res_path = self.task_resource.voc_res_dir self.voc_model = os.path.join( self.voc_res_path, self.task_resource.voc_res_dict['model']) @@ -127,9 +139,9 @@ class TTSServerExecutor(TTSExecutor): self.voc_res_path, self.task_resource.voc_res_dict['params']) self.voc_sample_rate = self.task_resource.voc_res_dict[ 'sample_rate'] - logger.info(self.voc_res_path) - logger.info(self.voc_model) - logger.info(self.voc_params) + logger.debug(self.voc_res_path) + logger.debug(self.voc_model) + logger.debug(self.voc_params) else: self.voc_model = os.path.abspath(voc_model) self.voc_params = os.path.abspath(voc_params) @@ -144,21 +156,21 @@ class TTSServerExecutor(TTSExecutor): with open(self.phones_dict, "r") as f: phn_id = [line.strip().split() for line in f.readlines()] vocab_size = len(phn_id) - logger.info("vocab_size: {}".format(vocab_size)) + logger.debug("vocab_size: {}".format(vocab_size)) tone_size = None if self.tones_dict: with open(self.tones_dict, "r") as f: tone_id = [line.strip().split() for line in f.readlines()] tone_size = len(tone_id) - logger.info("tone_size: {}".format(tone_size)) + logger.debug("tone_size: {}".format(tone_size)) spk_num = None if self.speaker_dict: with open(self.speaker_dict, 'rt') as f: spk_id = [line.strip().split() for line in f.readlines()] spk_num = len(spk_id) - logger.info("spk_num: {}".format(spk_num)) + logger.debug("spk_num: {}".format(spk_num)) # frontend if lang == 'zh': @@ -168,7 +180,7 @@ class TTSServerExecutor(TTSExecutor): elif lang == 'en': self.frontend = English(phone_vocab_path=self.phones_dict) - logger.info("frontend 
done!") + logger.debug("frontend done!") # Create am predictor self.am_predictor_conf = am_predictor_conf @@ -176,7 +188,7 @@ class TTSServerExecutor(TTSExecutor): model_file=self.am_model, params_file=self.am_params, predictor_conf=self.am_predictor_conf) - logger.info("Create AM predictor successfully.") + logger.debug("Create AM predictor successfully.") # Create voc predictor self.voc_predictor_conf = voc_predictor_conf @@ -184,7 +196,7 @@ class TTSServerExecutor(TTSExecutor): model_file=self.voc_model, params_file=self.voc_params, predictor_conf=self.voc_predictor_conf) - logger.info("Create Vocoder predictor successfully.") + logger.debug("Create Vocoder predictor successfully.") @paddle.no_grad() def infer(self, @@ -316,7 +328,8 @@ class TTSEngine(BaseEngine): logger.error(e) return False - logger.info("Initialize TTS server engine successfully.") + logger.info("Initialize TTS server engine successfully on device: %s." % + (self.device)) return True @@ -328,7 +341,7 @@ class PaddleTTSConnectionHandler(TTSServerExecutor): tts_engine (TTSEngine): The TTS engine """ super().__init__() - logger.info( + logger.debug( "Create PaddleTTSConnectionHandler to process the tts request") self.tts_engine = tts_engine @@ -366,23 +379,23 @@ class PaddleTTSConnectionHandler(TTSServerExecutor): if target_fs == 0 or target_fs > original_fs: target_fs = original_fs wav_tar_fs = wav - logger.info( + logger.debug( "The sample rate of synthesized audio is the same as model, which is {}Hz". format(original_fs)) else: wav_tar_fs = librosa.resample( np.squeeze(wav), original_fs, target_fs) - logger.info( + logger.debug( "The sample rate of model is {}Hz and the target sample rate is {}Hz. Converting the sample rate of the synthesized audio successfully.". format(original_fs, target_fs)) # transform volume wav_vol = wav_tar_fs * volume - logger.info("Transform the volume of the audio successfully.") + logger.debug("Transform the volume of the audio successfully.") # transform speed try: # windows not support soxbindings wav_speed = change_speed(wav_vol, speed, target_fs) - logger.info("Transform the speed of the audio successfully.") + logger.debug("Transform the speed of the audio successfully.") except ServerBaseException: raise ServerBaseException( ErrorCode.SERVER_INTERNAL_ERR, @@ -399,7 +412,7 @@ class PaddleTTSConnectionHandler(TTSServerExecutor): wavfile.write(buf, target_fs, wav_speed) base64_bytes = base64.b64encode(buf.read()) wav_base64 = base64_bytes.decode('utf-8') - logger.info("Audio to string successfully.") + logger.debug("Audio to string successfully.") # save audio if audio_path is not None: @@ -487,15 +500,15 @@ class PaddleTTSConnectionHandler(TTSServerExecutor): logger.error(e) sys.exit(-1) - logger.info("AM model: {}".format(self.config.am)) - logger.info("Vocoder model: {}".format(self.config.voc)) - logger.info("Language: {}".format(lang)) + logger.debug("AM model: {}".format(self.config.am)) + logger.debug("Vocoder model: {}".format(self.config.voc)) + logger.debug("Language: {}".format(lang)) logger.info("tts engine type: python") logger.info("audio duration: {}".format(duration)) - logger.info("frontend inference time: {}".format(self.frontend_time)) - logger.info("AM inference time: {}".format(self.am_time)) - logger.info("Vocoder inference time: {}".format(self.voc_time)) + logger.debug("frontend inference time: {}".format(self.frontend_time)) + logger.debug("AM inference time: {}".format(self.am_time)) + logger.debug("Vocoder inference time: {}".format(self.voc_time)) 
logger.info("total inference time: {}".format(infer_time)) logger.info( "postprocess (change speed, volume, target sample rate) time: {}". @@ -503,6 +516,6 @@ class PaddleTTSConnectionHandler(TTSServerExecutor): logger.info("total generate audio time: {}".format(infer_time + postprocess_time)) logger.info("RTF: {}".format(rtf)) - logger.info("device: {}".format(self.tts_engine.device)) + logger.debug("device: {}".format(self.tts_engine.device)) return lang, target_sample_rate, duration, wav_base64 diff --git a/paddlespeech/server/engine/tts/python/tts_engine.py b/paddlespeech/server/engine/tts/python/tts_engine.py index b048b01a4..4d1801006 100644 --- a/paddlespeech/server/engine/tts/python/tts_engine.py +++ b/paddlespeech/server/engine/tts/python/tts_engine.py @@ -105,7 +105,7 @@ class PaddleTTSConnectionHandler(TTSServerExecutor): tts_engine (TTSEngine): The TTS engine """ super().__init__() - logger.info( + logger.debug( "Create PaddleTTSConnectionHandler to process the tts request") self.tts_engine = tts_engine @@ -143,23 +143,23 @@ class PaddleTTSConnectionHandler(TTSServerExecutor): if target_fs == 0 or target_fs > original_fs: target_fs = original_fs wav_tar_fs = wav - logger.info( + logger.debug( "The sample rate of synthesized audio is the same as model, which is {}Hz". format(original_fs)) else: wav_tar_fs = librosa.resample( np.squeeze(wav), original_fs, target_fs) - logger.info( + logger.debug( "The sample rate of model is {}Hz and the target sample rate is {}Hz. Converting the sample rate of the synthesized audio successfully.". format(original_fs, target_fs)) # transform volume wav_vol = wav_tar_fs * volume - logger.info("Transform the volume of the audio successfully.") + logger.debug("Transform the volume of the audio successfully.") # transform speed try: # windows not support soxbindings wav_speed = change_speed(wav_vol, speed, target_fs) - logger.info("Transform the speed of the audio successfully.") + logger.debug("Transform the speed of the audio successfully.") except ServerBaseException: raise ServerBaseException( ErrorCode.SERVER_INTERNAL_ERR, @@ -176,7 +176,7 @@ class PaddleTTSConnectionHandler(TTSServerExecutor): wavfile.write(buf, target_fs, wav_speed) base64_bytes = base64.b64encode(buf.read()) wav_base64 = base64_bytes.decode('utf-8') - logger.info("Audio to string successfully.") + logger.debug("Audio to string successfully.") # save audio if audio_path is not None: @@ -264,15 +264,15 @@ class PaddleTTSConnectionHandler(TTSServerExecutor): logger.error(e) sys.exit(-1) - logger.info("AM model: {}".format(self.config.am)) - logger.info("Vocoder model: {}".format(self.config.voc)) - logger.info("Language: {}".format(lang)) + logger.debug("AM model: {}".format(self.config.am)) + logger.debug("Vocoder model: {}".format(self.config.voc)) + logger.debug("Language: {}".format(lang)) logger.info("tts engine type: python") logger.info("audio duration: {}".format(duration)) - logger.info("frontend inference time: {}".format(self.frontend_time)) - logger.info("AM inference time: {}".format(self.am_time)) - logger.info("Vocoder inference time: {}".format(self.voc_time)) + logger.debug("frontend inference time: {}".format(self.frontend_time)) + logger.debug("AM inference time: {}".format(self.am_time)) + logger.debug("Vocoder inference time: {}".format(self.voc_time)) logger.info("total inference time: {}".format(infer_time)) logger.info( "postprocess (change speed, volume, target sample rate) time: {}". 
@@ -280,6 +280,6 @@ class PaddleTTSConnectionHandler(TTSServerExecutor): logger.info("total generate audio time: {}".format(infer_time + postprocess_time)) logger.info("RTF: {}".format(rtf)) - logger.info("device: {}".format(self.tts_engine.device)) + logger.debug("device: {}".format(self.tts_engine.device)) return lang, target_sample_rate, duration, wav_base64 diff --git a/paddlespeech/server/engine/vector/python/vector_engine.py b/paddlespeech/server/engine/vector/python/vector_engine.py index 056833dfe..309796452 100644 --- a/paddlespeech/server/engine/vector/python/vector_engine.py +++ b/paddlespeech/server/engine/vector/python/vector_engine.py @@ -33,7 +33,7 @@ class PaddleVectorConnectionHandler: vector_engine (VectorEngine): The Vector engine """ super().__init__() - logger.info( + logger.debug( "Create PaddleVectorConnectionHandler to process the vector request") self.vector_engine = vector_engine self.executor = self.vector_engine.executor @@ -54,7 +54,7 @@ class PaddleVectorConnectionHandler: Returns: str: the punctuation text """ - logger.info( + logger.debug( f"start to extract the do vector {self.task} from the http request") if self.task == "spk" and task == "spk": embedding = self.extract_audio_embedding(audio_data) @@ -81,17 +81,17 @@ class PaddleVectorConnectionHandler: Returns: float: the score between enroll and test audio """ - logger.info("start to extract the enroll audio embedding") + logger.debug("start to extract the enroll audio embedding") enroll_emb = self.extract_audio_embedding(enroll_audio) - logger.info("start to extract the test audio embedding") + logger.debug("start to extract the test audio embedding") test_emb = self.extract_audio_embedding(test_audio) - logger.info( + logger.debug( "start to get the score between the enroll and test embedding") score = self.executor.get_embeddings_score(enroll_emb, test_emb) - logger.info(f"get the enroll vs test score: {score}") + logger.debug(f"get the enroll vs test score: {score}") return score @paddle.no_grad() @@ -106,11 +106,12 @@ class PaddleVectorConnectionHandler: # because the soundfile will change the io.BytesIO(audio) to the end # thus we should convert the base64 string to io.BytesIO when we need the audio data if not self.executor._check(io.BytesIO(audio), sample_rate): - logger.info("check the audio sample rate occurs error") + logger.debug("check the audio sample rate occurs error") return np.array([0.0]) waveform, sr = load_audio(io.BytesIO(audio)) - logger.info(f"load the audio sample points, shape is: {waveform.shape}") + logger.debug( + f"load the audio sample points, shape is: {waveform.shape}") # stage 2: get the audio feat # Note: Now we only support fbank feature @@ -121,9 +122,9 @@ class PaddleVectorConnectionHandler: n_mels=self.config.n_mels, window_size=self.config.window_size, hop_length=self.config.hop_size) - logger.info(f"extract the audio feats, shape is: {feats.shape}") + logger.debug(f"extract the audio feats, shape is: {feats.shape}") except Exception as e: - logger.info(f"feats occurs exception {e}") + logger.error(f"feats occurs exception {e}") sys.exit(-1) feats = paddle.to_tensor(feats).unsqueeze(0) @@ -159,7 +160,7 @@ class VectorEngine(BaseEngine): """The Vector Engine """ super(VectorEngine, self).__init__() - logger.info("Create the VectorEngine Instance") + logger.debug("Create the VectorEngine Instance") def init(self, config: dict): """Init the Vector Engine @@ -170,7 +171,7 @@ class VectorEngine(BaseEngine): Returns: bool: The engine instance flag """ - logger.info("Init 
the vector engine") + logger.debug("Init the vector engine") try: self.config = config if self.config.device: @@ -179,7 +180,7 @@ class VectorEngine(BaseEngine): self.device = paddle.get_device() paddle.set_device(self.device) - logger.info(f"Vector Engine set the device: {self.device}") + logger.debug(f"Vector Engine set the device: {self.device}") except BaseException as e: logger.error( "Set device failed, please check if device is already used and the parameter 'device' in the yaml file" @@ -196,5 +197,7 @@ class VectorEngine(BaseEngine): ckpt_path=config.ckpt_path, task=config.task) - logger.info("Init the Vector engine successfully") + logger.info( + "Initialize Vector server engine successfully on device: %s." % + (self.device)) return True diff --git a/paddlespeech/server/restful/tts_api.py b/paddlespeech/server/restful/tts_api.py index 53fe159fd..61e4c49f3 100644 --- a/paddlespeech/server/restful/tts_api.py +++ b/paddlespeech/server/restful/tts_api.py @@ -140,7 +140,9 @@ def tts(request_body: TTSRequest): @router.post("/paddlespeech/tts/streaming") async def stream_tts(request_body: TTSRequest): + # get params text = request_body.text + spk_id = request_body.spk_id engine_pool = get_engine_pool() tts_engine = engine_pool['tts'] @@ -156,4 +158,24 @@ async def stream_tts(request_body: TTSRequest): connection_handler = PaddleTTSConnectionHandler(tts_engine) - return StreamingResponse(connection_handler.run(sentence=text)) + return StreamingResponse( + connection_handler.run(sentence=text, spk_id=spk_id)) + + +@router.get("/paddlespeech/tts/streaming/samplerate") +def get_samplerate(): + try: + engine_pool = get_engine_pool() + tts_engine = engine_pool['tts'] + logger.info("Get tts engine successfully.") + sample_rate = tts_engine.sample_rate + + response = {"sample_rate": sample_rate} + + except ServerBaseException as e: + response = failed_response(e.error_code, e.msg) + except BaseException: + response = failed_response(ErrorCode.SERVER_UNKOWN_ERR) + traceback.print_exc() + + return response diff --git a/paddlespeech/server/utils/audio_handler.py b/paddlespeech/server/utils/audio_handler.py index e3d90d469..4df651337 100644 --- a/paddlespeech/server/utils/audio_handler.py +++ b/paddlespeech/server/utils/audio_handler.py @@ -138,7 +138,7 @@ class ASRWsAudioHandler: Returns: str: the final asr result """ - logging.info("send a message to the server") + logging.debug("send a message to the server") if self.url is None: logger.error("No asr server, please input valid ip and port") @@ -167,11 +167,11 @@ class ASRWsAudioHandler: await ws.send(chunk_data.tobytes()) msg = await ws.recv() msg = json.loads(msg) - - if self.punc_server and len(msg["result"]) > 0: - msg["result"] = self.punc_server.run(msg["result"]) logger.info("client receive msg={}".format(msg)) - + #client start to punctuation restore + if self.punc_server and len(msg['result']) > 0: + msg["result"] = self.punc_server.run(msg["result"]) + logger.info("client punctuation restored msg={}".format(msg)) # 4. 
we must send finished signal to the server audio_info = json.dumps( { @@ -266,6 +266,12 @@ class TTSWsHandler: self.url = "ws://" + self.server + ":" + str( self.port) + "/paddlespeech/tts/streaming" self.play = play + + # get model sample rate + self.url_get_sr = "http://" + str(self.server) + ":" + str( + self.port) + "/paddlespeech/tts/streaming/samplerate" + self.sample_rate = requests.get(self.url_get_sr).json()["sample_rate"] + if self.play: import pyaudio self.buffer = b'' @@ -273,7 +279,7 @@ class TTSWsHandler: self.stream = self.p.open( format=self.p.get_format_from_width(2), channels=1, - rate=24000, + rate=self.sample_rate, output=True) self.mutex = threading.Lock() self.start_play = True @@ -293,12 +299,13 @@ class TTSWsHandler: self.buffer = b'' self.mutex.release() - async def run(self, text: str, output: str=None): + async def run(self, text: str, spk_id=0, output: str=None): """Send a text to online server Args: text (str): sentence to be synthesized - output (str): save audio path + spk_id (int, optional): speaker id. Defaults to 0. + output (str, optional): client save audio path. Defaults to None. """ all_bytes = b'' receive_time_list = [] @@ -315,11 +322,16 @@ class TTSWsHandler: session = msg["session"] # 3. send speech synthesis request - text_base64 = str(base64.b64encode((text).encode('utf-8')), "UTF8") - request = json.dumps({"text": text_base64}) + #text_base64 = str(base64.b64encode((text).encode('utf-8')), "UTF8") + params = { + "text": text, + "spk_id": spk_id, + } + + request = json.dumps(params) st = time.time() await ws.send(request) - logging.info("send a message to the server") + logging.debug("send a message to the server") # 4. Process the received response message = await ws.recv() @@ -341,10 +353,11 @@ class TTSWsHandler: # Rerutn last packet normally, no audio information elif status == 2: final_response = time.time() - st - duration = len(all_bytes) / 2.0 / 24000 + duration = len(all_bytes) / 2.0 / self.sample_rate if output is not None: - save_audio_success = save_audio(all_bytes, output) + save_audio_success = save_audio(all_bytes, output, + self.sample_rate) else: save_audio_success = False @@ -362,7 +375,8 @@ class TTSWsHandler: receive_time_list.append(time.time()) audio = message["audio"] audio = base64.b64decode(audio) # bytes - chunk_duration_list.append(len(audio) / 2.0 / 24000) + chunk_duration_list.append( + len(audio) / 2.0 / self.sample_rate) all_bytes += audio if self.play: self.mutex.acquire() @@ -403,19 +417,26 @@ class TTSHttpHandler: self.port) + "/paddlespeech/tts/streaming" self.play = play + # get model sample rate + self.url_get_sr = "http://" + str(self.server) + ":" + str( + self.port) + "/paddlespeech/tts/streaming/samplerate" + self.sample_rate = requests.get(self.url_get_sr).json()["sample_rate"] + if self.play: import pyaudio self.buffer = b'' self.p = pyaudio.PyAudio() + self.start_play = True + self.max_fail = 50 + self.stream = self.p.open( format=self.p.get_format_from_width(2), channels=1, - rate=24000, + rate=self.sample_rate, output=True) self.mutex = threading.Lock() - self.start_play = True self.t = threading.Thread(target=self.play_audio) - self.max_fail = 50 + logger.info(f"endpoint: {self.url}") def play_audio(self): @@ -430,31 +451,19 @@ class TTSHttpHandler: self.buffer = b'' self.mutex.release() - def run(self, - text: str, - spk_id=0, - speed=1.0, - volume=1.0, - sample_rate=0, - output: str=None): + def run(self, text: str, spk_id=0, output: str=None): """Send a text to tts online server Args: text (str): 
sentence to be synthesized. spk_id (int, optional): speaker id. Defaults to 0. - speed (float, optional): audio speed. Defaults to 1.0. - volume (float, optional): audio volume. Defaults to 1.0. - sample_rate (int, optional): audio sample rate, 0 means the same as model. Defaults to 0. - output (str, optional): save audio path. Defaults to None. + output (str, optional): client save audio path. Defaults to None. """ + # 1. Create request params = { "text": text, "spk_id": spk_id, - "speed": speed, - "volume": volume, - "sample_rate": sample_rate, - "save_path": output } all_bytes = b'' @@ -482,14 +491,14 @@ class TTSHttpHandler: self.t.start() self.start_play = False all_bytes += audio - chunk_duration_list.append(len(audio) / 2.0 / 24000) + chunk_duration_list.append(len(audio) / 2.0 / self.sample_rate) final_response = time.time() - st - duration = len(all_bytes) / 2.0 / 24000 + duration = len(all_bytes) / 2.0 / self.sample_rate html.close() # when stream=True if output is not None: - save_audio_success = save_audio(all_bytes, output) + save_audio_success = save_audio(all_bytes, output, self.sample_rate) else: save_audio_success = False @@ -543,10 +552,9 @@ class VectorHttpHandler: "sample_rate": sample_rate, } - logger.info(self.url) res = requests.post(url=self.url, data=json.dumps(data)) - return res.json() + return res class VectorScoreHttpHandler: @@ -594,4 +602,4 @@ class VectorScoreHttpHandler: res = requests.post(url=self.url, data=json.dumps(data)) - return res.json() + return res diff --git a/paddlespeech/server/utils/audio_process.py b/paddlespeech/server/utils/audio_process.py index 416d77ac4..ae5383979 100644 --- a/paddlespeech/server/utils/audio_process.py +++ b/paddlespeech/server/utils/audio_process.py @@ -169,7 +169,7 @@ def save_audio(bytes_data, audio_path, sample_rate: int=24000) -> bool: sample_rate=sample_rate) os.remove("./tmp.pcm") else: - print("Only supports saved audio format is pcm or wav") + logger.error("Only supports saved audio format is pcm or wav") return False return True diff --git a/paddlespeech/server/utils/log.py b/paddlespeech/server/utils/log.py deleted file mode 100644 index 8644064c7..000000000 --- a/paddlespeech/server/utils/log.py +++ /dev/null @@ -1,59 +0,0 @@ -# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. 
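The local `paddlespeech/server/utils/log.py` module removed below is superseded by the shared CLI logger that the rest of this patch imports, with per-request details demoted from `info` to `debug`. A rough sketch of the logging convention the patch converges on; the messages are adapted from the diff and only illustrative:

```python
from paddlespeech.cli.log import logger

# High-level, user-facing status stays at info level ...
logger.info("Initialize TTS server engine successfully on device: cpu.")
# ... while per-request detail such as timings moves to debug level.
logger.debug("inference time: 0.123")
```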
-import functools -import logging - -__all__ = [ - 'logger', -] - - -class Logger(object): - def __init__(self, name: str=None): - name = 'PaddleSpeech' if not name else name - self.logger = logging.getLogger(name) - - log_config = { - 'DEBUG': 10, - 'INFO': 20, - 'TRAIN': 21, - 'EVAL': 22, - 'WARNING': 30, - 'ERROR': 40, - 'CRITICAL': 50, - 'EXCEPTION': 100, - } - for key, level in log_config.items(): - logging.addLevelName(level, key) - if key == 'EXCEPTION': - self.__dict__[key.lower()] = self.logger.exception - else: - self.__dict__[key.lower()] = functools.partial(self.__call__, - level) - - self.format = logging.Formatter( - fmt='[%(asctime)-15s] [%(levelname)8s] - %(message)s') - - self.handler = logging.StreamHandler() - self.handler.setFormatter(self.format) - - self.logger.addHandler(self.handler) - self.logger.setLevel(logging.DEBUG) - self.logger.propagate = False - - def __call__(self, log_level: str, msg: str): - self.logger.log(log_level, msg) - - -logger = Logger() diff --git a/paddlespeech/server/utils/onnx_infer.py b/paddlespeech/server/utils/onnx_infer.py index 1c9d878f8..25802f627 100644 --- a/paddlespeech/server/utils/onnx_infer.py +++ b/paddlespeech/server/utils/onnx_infer.py @@ -16,11 +16,11 @@ from typing import Optional import onnxruntime as ort -from .log import logger +from paddlespeech.cli.log import logger def get_sess(model_path: Optional[os.PathLike]=None, sess_conf: dict=None): - logger.info(f"ort sessconf: {sess_conf}") + logger.debug(f"ort sessconf: {sess_conf}") sess_options = ort.SessionOptions() sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL if sess_conf.get('graph_optimization_level', 99) == 0: @@ -30,11 +30,13 @@ def get_sess(model_path: Optional[os.PathLike]=None, sess_conf: dict=None): # "gpu:0" providers = ['CPUExecutionProvider'] if "gpu" in sess_conf.get("device", ""): - providers = ['CUDAExecutionProvider'] + device_id = int(sess_conf["device"].split(":")[1]) + providers = [('CUDAExecutionProvider', {'device_id': device_id})] + # fastspeech2/mb_melgan can't use trt now! 
if sess_conf.get("use_trt", 0): providers = ['TensorrtExecutionProvider'] - logger.info(f"ort providers: {providers}") + logger.debug(f"ort providers: {providers}") if 'cpu_threads' in sess_conf: sess_options.intra_op_num_threads = sess_conf.get("cpu_threads", 0) diff --git a/paddlespeech/server/utils/util.py b/paddlespeech/server/utils/util.py index 061b213c7..826d923ed 100644 --- a/paddlespeech/server/utils/util.py +++ b/paddlespeech/server/utils/util.py @@ -13,6 +13,8 @@ import base64 import math +from paddlespeech.cli.log import logger + def wav2base64(wav_file: str): """ @@ -61,7 +63,7 @@ def get_chunks(data, block_size, pad_size, step): elif step == "voc": data_len = data.shape[0] else: - print("Please set correct type to get chunks, am or voc") + logger.error("Please set correct type to get chunks, am or voc") chunks = [] n = math.ceil(data_len / block_size) @@ -73,7 +75,7 @@ def get_chunks(data, block_size, pad_size, step): elif step == "voc": chunks.append(data[start:end, :]) else: - print("Please set correct type to get chunks, am or voc") + logger.error("Please set correct type to get chunks, am or voc") return chunks diff --git a/paddlespeech/server/ws/tts_api.py b/paddlespeech/server/ws/tts_api.py index 3d8b222ea..275711f58 100644 --- a/paddlespeech/server/ws/tts_api.py +++ b/paddlespeech/server/ws/tts_api.py @@ -87,12 +87,12 @@ async def websocket_endpoint(websocket: WebSocket): # speech synthesis request elif 'text' in message: - text_bese64 = message["text"] - sentence = connection_handler.preprocess( - text_bese64=text_bese64) + text = message["text"] + spk_id = message["spk_id"] # run - wav_generator = connection_handler.run(sentence) + wav_generator = connection_handler.run( + sentence=text, spk_id=spk_id) while True: try: @@ -116,3 +116,22 @@ async def websocket_endpoint(websocket: WebSocket): except Exception as e: logger.error(e) + + +@router.get("/paddlespeech/tts/streaming/samplerate") +def get_samplerate(): + try: + engine_pool = get_engine_pool() + tts_engine = engine_pool['tts'] + logger.info("Get tts engine successfully.") + sample_rate = tts_engine.sample_rate + + response = {"sample_rate": sample_rate} + + except ServerBaseException as e: + response = failed_response(e.error_code, e.msg) + except BaseException: + response = failed_response(ErrorCode.SERVER_UNKOWN_ERR) + traceback.print_exc() + + return response diff --git a/paddlespeech/t2s/datasets/am_batch_fn.py b/paddlespeech/t2s/datasets/am_batch_fn.py index 0b278abaf..2cb7a11a2 100644 --- a/paddlespeech/t2s/datasets/am_batch_fn.py +++ b/paddlespeech/t2s/datasets/am_batch_fn.py @@ -11,10 +11,165 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. 
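The patch above adds a `GET /paddlespeech/tts/streaming/samplerate` route to both the RESTful (`paddlespeech/server/restful/tts_api.py`) and WebSocket (`paddlespeech/server/ws/tts_api.py`) servers, and reduces the streaming request body to `text` and `spk_id`. A minimal client-side sketch of the new flow; the host and port are placeholders, not values taken from the patch:

```python
import json

import requests

SERVER = "http://127.0.0.1:8092"  # placeholder: use the address the TTS server was started with

# 1. Fetch the model sample rate first, as TTSWsHandler/TTSHttpHandler now do,
#    so playback can be configured instead of hard-coding 24000 Hz.
sample_rate = requests.get(
    SERVER + "/paddlespeech/tts/streaming/samplerate").json()["sample_rate"]

# 2. The streaming request body now carries only the text and a speaker id;
#    speed, volume, sample_rate and save_path are no longer accepted.
params = {"text": "您好,欢迎使用语音合成服务。", "spk_id": 0}
response = requests.post(
    SERVER + "/paddlespeech/tts/streaming", data=json.dumps(params), stream=True)
# The body then streams base64-encoded audio chunks; see TTSHttpHandler above
# for how the reference client decodes and concatenates them.
```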
+from typing import Collection +from typing import Dict +from typing import List +from typing import Tuple + import numpy as np import paddle from paddlespeech.t2s.datasets.batch import batch_sequences +from paddlespeech.t2s.datasets.get_feats import LogMelFBank +from paddlespeech.t2s.modules.nets_utils import get_seg_pos +from paddlespeech.t2s.modules.nets_utils import make_non_pad_mask +from paddlespeech.t2s.modules.nets_utils import pad_list +from paddlespeech.t2s.modules.nets_utils import phones_masking +from paddlespeech.t2s.modules.nets_utils import phones_text_masking + + +# 因为要传参数,所以需要额外构建 +def build_erniesat_collate_fn(mlm_prob: float=0.8, + mean_phn_span: int=8, + seg_emb: bool=False, + text_masking: bool=False): + + return ErnieSATCollateFn( + mlm_prob=mlm_prob, + mean_phn_span=mean_phn_span, + seg_emb=seg_emb, + text_masking=text_masking) + + +class ErnieSATCollateFn: + """Functor class of common_collate_fn()""" + + def __init__(self, + mlm_prob: float=0.8, + mean_phn_span: int=8, + seg_emb: bool=False, + text_masking: bool=False): + self.mlm_prob = mlm_prob + self.mean_phn_span = mean_phn_span + self.seg_emb = seg_emb + self.text_masking = text_masking + + def __call__(self, exmaples): + return erniesat_batch_fn( + exmaples, + mlm_prob=self.mlm_prob, + mean_phn_span=self.mean_phn_span, + seg_emb=self.seg_emb, + text_masking=self.text_masking) + + +def erniesat_batch_fn(examples, + mlm_prob: float=0.8, + mean_phn_span: int=8, + seg_emb: bool=False, + text_masking: bool=False): + # fields = ["text", "text_lengths", "speech", "speech_lengths", "align_start", "align_end"] + text = [np.array(item["text"], dtype=np.int64) for item in examples] + speech = [np.array(item["speech"], dtype=np.float32) for item in examples] + + text_lengths = [ + np.array(item["text_lengths"], dtype=np.int64) for item in examples + ] + speech_lengths = [ + np.array(item["speech_lengths"], dtype=np.int64) for item in examples + ] + + align_start = [ + np.array(item["align_start"], dtype=np.int64) for item in examples + ] + + align_end = [ + np.array(item["align_end"], dtype=np.int64) for item in examples + ] + + align_start_lengths = [ + np.array(len(item["align_start"]), dtype=np.int64) for item in examples + ] + + # add_pad + text = batch_sequences(text) + speech = batch_sequences(speech) + align_start = batch_sequences(align_start) + align_end = batch_sequences(align_end) + + # convert each batch to paddle.Tensor + text = paddle.to_tensor(text) + speech = paddle.to_tensor(speech) + text_lengths = paddle.to_tensor(text_lengths) + speech_lengths = paddle.to_tensor(speech_lengths) + align_start_lengths = paddle.to_tensor(align_start_lengths) + + speech_pad = speech + text_pad = text + + text_mask = make_non_pad_mask( + text_lengths, text_pad, length_dim=1).unsqueeze(-2) + speech_mask = make_non_pad_mask( + speech_lengths, speech_pad[:, :, 0], length_dim=1).unsqueeze(-2) + + # for training + span_bdy = None + # for inference + if 'span_bdy' in examples[0].keys(): + span_bdy = [ + np.array(item["span_bdy"], dtype=np.int64) for item in examples + ] + span_bdy = paddle.to_tensor(span_bdy) + + # dual_mask 的是混合中英时候同时 mask 语音和文本 + # ernie sat 在实现跨语言的时候都 mask 了 + if text_masking: + masked_pos, text_masked_pos = phones_text_masking( + xs_pad=speech_pad, + src_mask=speech_mask, + text_pad=text_pad, + text_mask=text_mask, + align_start=align_start, + align_end=align_end, + align_start_lens=align_start_lengths, + mlm_prob=mlm_prob, + mean_phn_span=mean_phn_span, + span_bdy=span_bdy) + # 训练纯中文和纯英文的 -> a3t 没有对 phoneme 
做 mask, 只对语音 mask 了 + # a3t 和 ernie sat 的区别主要在于做 mask 的时候 + else: + masked_pos = phones_masking( + xs_pad=speech_pad, + src_mask=speech_mask, + align_start=align_start, + align_end=align_end, + align_start_lens=align_start_lengths, + mlm_prob=mlm_prob, + mean_phn_span=mean_phn_span, + span_bdy=span_bdy) + text_masked_pos = paddle.zeros(paddle.shape(text_pad)) + + speech_seg_pos, text_seg_pos = get_seg_pos( + speech_pad=speech_pad, + text_pad=text_pad, + align_start=align_start, + align_end=align_end, + align_start_lens=align_start_lengths, + seg_emb=seg_emb) + + batch = { + "text": text, + "speech": speech, + # need to generate + "masked_pos": masked_pos, + "speech_mask": speech_mask, + "text_mask": text_mask, + "speech_seg_pos": speech_seg_pos, + "text_seg_pos": text_seg_pos, + "text_masked_pos": text_masked_pos + } + + return batch def tacotron2_single_spk_batch_fn(examples): @@ -335,3 +490,182 @@ def vits_single_spk_batch_fn(examples): "speech": speech } return batch + + +# for ERNIE SAT +class MLMCollateFn: + """Functor class of common_collate_fn()""" + + def __init__( + self, + feats_extract, + mlm_prob: float=0.8, + mean_phn_span: int=8, + seg_emb: bool=False, + text_masking: bool=False, + attention_window: int=0, + not_sequence: Collection[str]=(), ): + self.mlm_prob = mlm_prob + self.mean_phn_span = mean_phn_span + self.feats_extract = feats_extract + self.not_sequence = set(not_sequence) + self.attention_window = attention_window + self.seg_emb = seg_emb + self.text_masking = text_masking + + def __call__(self, data: Collection[Tuple[str, Dict[str, np.ndarray]]] + ) -> Tuple[List[str], Dict[str, paddle.Tensor]]: + return mlm_collate_fn( + data, + feats_extract=self.feats_extract, + mlm_prob=self.mlm_prob, + mean_phn_span=self.mean_phn_span, + seg_emb=self.seg_emb, + text_masking=self.text_masking, + not_sequence=self.not_sequence) + + +def mlm_collate_fn( + data: Collection[Tuple[str, Dict[str, np.ndarray]]], + feats_extract=None, + mlm_prob: float=0.8, + mean_phn_span: int=8, + seg_emb: bool=False, + text_masking: bool=False, + pad_value: int=0, + not_sequence: Collection[str]=(), +) -> Tuple[List[str], Dict[str, paddle.Tensor]]: + uttids = [u for u, _ in data] + data = [d for _, d in data] + + assert all(set(data[0]) == set(d) for d in data), "dict-keys mismatching" + assert all(not k.endswith("_lens") + for k in data[0]), f"*_lens is reserved: {list(data[0])}" + + output = {} + for key in data[0]: + + array_list = [d[key] for d in data] + + # Assume the first axis is length: + # tensor_list: Batch x (Length, ...) + tensor_list = [paddle.to_tensor(a) for a in array_list] + # tensor: (Batch, Length, ...) 
+ tensor = pad_list(tensor_list, pad_value) + output[key] = tensor + + # lens: (Batch,) + if key not in not_sequence: + lens = paddle.to_tensor( + [d[key].shape[0] for d in data], dtype=paddle.int64) + output[key + "_lens"] = lens + + feats = feats_extract.get_log_mel_fbank(np.array(output["speech"][0])) + feats = paddle.to_tensor(feats) + print("feats.shape:", feats.shape) + feats_lens = paddle.shape(feats)[0] + feats = paddle.unsqueeze(feats, 0) + + text = output["text"] + text_lens = output["text_lens"] + align_start = output["align_start"] + align_start_lens = output["align_start_lens"] + align_end = output["align_end"] + + max_tlen = max(text_lens) + max_slen = max(feats_lens) + + speech_pad = feats[:, :max_slen] + + text_pad = text + text_mask = make_non_pad_mask( + text_lens, text_pad, length_dim=1).unsqueeze(-2) + speech_mask = make_non_pad_mask( + feats_lens, speech_pad[:, :, 0], length_dim=1).unsqueeze(-2) + + span_bdy = None + if 'span_bdy' in output.keys(): + span_bdy = output['span_bdy'] + + # dual_mask 的是混合中英时候同时 mask 语音和文本 + # ernie sat 在实现跨语言的时候都 mask 了 + if text_masking: + masked_pos, text_masked_pos = phones_text_masking( + xs_pad=speech_pad, + src_mask=speech_mask, + text_pad=text_pad, + text_mask=text_mask, + align_start=align_start, + align_end=align_end, + align_start_lens=align_start_lens, + mlm_prob=mlm_prob, + mean_phn_span=mean_phn_span, + span_bdy=span_bdy) + # 训练纯中文和纯英文的 -> a3t 没有对 phoneme 做 mask, 只对语音 mask 了 + # a3t 和 ernie sat 的区别主要在于做 mask 的时候 + else: + masked_pos = phones_masking( + xs_pad=speech_pad, + src_mask=speech_mask, + align_start=align_start, + align_end=align_end, + align_start_lens=align_start_lens, + mlm_prob=mlm_prob, + mean_phn_span=mean_phn_span, + span_bdy=span_bdy) + text_masked_pos = paddle.zeros(paddle.shape(text_pad)) + + output_dict = {} + + speech_seg_pos, text_seg_pos = get_seg_pos( + speech_pad=speech_pad, + text_pad=text_pad, + align_start=align_start, + align_end=align_end, + align_start_lens=align_start_lens, + seg_emb=seg_emb) + output_dict['speech'] = speech_pad + output_dict['text'] = text_pad + output_dict['masked_pos'] = masked_pos + output_dict['text_masked_pos'] = text_masked_pos + output_dict['speech_mask'] = speech_mask + output_dict['text_mask'] = text_mask + output_dict['speech_seg_pos'] = speech_seg_pos + output_dict['text_seg_pos'] = text_seg_pos + output = (uttids, output_dict) + return output + + +def build_mlm_collate_fn( + sr: int=24000, + n_fft: int=2048, + hop_length: int=300, + win_length: int=None, + n_mels: int=80, + fmin: int=80, + fmax: int=7600, + mlm_prob: float=0.8, + mean_phn_span: int=8, + seg_emb: bool=False, + epoch: int=-1, ): + feats_extract_class = LogMelFBank + + feats_extract = feats_extract_class( + sr=sr, + n_fft=n_fft, + hop_length=hop_length, + win_length=win_length, + n_mels=n_mels, + fmin=fmin, + fmax=fmax) + + if epoch == -1: + mlm_prob_factor = 1 + else: + mlm_prob_factor = 0.8 + + return MLMCollateFn( + feats_extract=feats_extract, + mlm_prob=mlm_prob * mlm_prob_factor, + mean_phn_span=mean_phn_span, + seg_emb=seg_emb) diff --git a/paddlespeech/t2s/exps/ernie_sat/__init__.py b/paddlespeech/t2s/exps/ernie_sat/__init__.py new file mode 100644 index 000000000..e69de29bb diff --git a/paddlespeech/t2s/exps/ernie_sat/align.py b/paddlespeech/t2s/exps/ernie_sat/align.py new file mode 100755 index 000000000..529a8221c --- /dev/null +++ b/paddlespeech/t2s/exps/ernie_sat/align.py @@ -0,0 +1,386 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. 
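As a usage note for the ERNIE-SAT batch functions added to `am_batch_fn.py` above: `build_erniesat_collate_fn` returns a functor intended to be passed as `collate_fn` to a `paddle.io.DataLoader`. A short sketch, with the dataset itself left out as an assumption; the expected example fields are the ones listed in `erniesat_batch_fn`:

```python
from paddlespeech.t2s.datasets.am_batch_fn import build_erniesat_collate_fn

# Each dataset example is expected to be a dict with "text", "text_lengths",
# "speech", "speech_lengths", "align_start" and "align_end"
# ("span_bdy" is additionally consumed at inference time).
collate_fn = build_erniesat_collate_fn(
    mlm_prob=0.8, mean_phn_span=8, seg_emb=False, text_masking=False)
# Typical use (dataset omitted here):
#   loader = paddle.io.DataLoader(dataset, batch_size=8, collate_fn=collate_fn)
```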
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import os +import shutil +from pathlib import Path + +import librosa +import numpy as np +import pypinyin +from praatio import textgrid +from paddlespeech.t2s.exps.ernie_sat.utils import get_tmp_name +from paddlespeech.t2s.exps.ernie_sat.utils import get_dict + + +DICT_EN = 'tools/aligner/cmudict-0.7b' +DICT_ZH = 'tools/aligner/simple.lexicon' +MODEL_DIR_EN = 'tools/aligner/vctk_model.zip' +MODEL_DIR_ZH = 'tools/aligner/aishell3_model.zip' +MFA_PATH = 'tools/montreal-forced-aligner/bin' +os.environ['PATH'] = MFA_PATH + '/:' + os.environ['PATH'] + +def _get_max_idx(dic): + return sorted([int(key.split('_')[0]) for key in dic.keys()])[-1] + + +def _readtg(tg_path: str, lang: str='en', fs: int=24000, n_shift: int=300): + alignment = textgrid.openTextgrid(tg_path, includeEmptyIntervals=True) + phones = [] + ends = [] + words = [] + + for interval in alignment.tierDict['words'].entryList: + word = interval.label + if word: + words.append(word) + for interval in alignment.tierDict['phones'].entryList: + phone = interval.label + phones.append(phone) + ends.append(interval.end) + frame_pos = librosa.time_to_frames(ends, sr=fs, hop_length=n_shift) + durations = np.diff(frame_pos, prepend=0) + assert len(durations) == len(phones) + # merge '' and sp in the end + if phones[-1] == '' and len(phones) > 1 and phones[-2] == 'sp': + phones = phones[:-1] + durations[-2] += durations[-1] + durations = durations[:-1] + + # replace ' and 'sil' with 'sp' + phones = ['sp' if (phn == '' or phn == 'sil') else phn for phn in phones] + + if lang == 'en': + DICT = DICT_EN + + elif lang == 'zh': + DICT = DICT_ZH + + word2phns_dict = get_dict(DICT) + + phn2word_dict = [] + for word in words: + if lang == 'en': + word = word.upper() + phn2word_dict.append([word2phns_dict[word].split(), word]) + + non_sp_idx = 0 + word_idx = 0 + i = 0 + word2phns = {} + while i < len(phones): + phn = phones[i] + if phn == 'sp': + word2phns[str(word_idx) + '_sp'] = ['sp'] + i += 1 + else: + phns, word = phn2word_dict[non_sp_idx] + word2phns[str(word_idx) + '_' + word] = phns + non_sp_idx += 1 + i += len(phns) + word_idx += 1 + sum_phn = sum(len(word2phns[k]) for k in word2phns) + assert sum_phn == len(phones) + + results = '' + for (p, d) in zip(phones, durations): + results += p + ' ' + str(d) + ' ' + return results.strip(), word2phns + + +def alignment(wav_path: str, + text: str, + fs: int=24000, + lang='en', + n_shift: int=300): + wav_name = os.path.basename(wav_path) + utt = wav_name.split('.')[0] + # prepare data for MFA + tmp_name = get_tmp_name(text=text) + tmpbase = './tmp_dir/' + tmp_name + tmpbase = Path(tmpbase) + tmpbase.mkdir(parents=True, exist_ok=True) + print("tmp_name in alignment:",tmp_name) + + shutil.copyfile(wav_path, tmpbase / wav_name) + txt_name = utt + '.txt' + txt_path = tmpbase / txt_name + with open(txt_path, 'w') as wf: + wf.write(text + '\n') + # MFA + if lang == 'en': + DICT = DICT_EN + MODEL_DIR = MODEL_DIR_EN + + elif lang == 'zh': + DICT = DICT_ZH + MODEL_DIR 
= MODEL_DIR_ZH + else: + print('please input right lang!!') + + CMD = 'mfa_align' + ' ' + str( + tmpbase) + ' ' + DICT + ' ' + MODEL_DIR + ' ' + str(tmpbase) + os.system(CMD) + tg_path = str(tmpbase) + '/' + tmp_name + '/' + utt + '.TextGrid' + phn_dur, word2phns = _readtg(tg_path, lang=lang) + phn_dur = phn_dur.split() + phns = phn_dur[::2] + durs = phn_dur[1::2] + durs = [int(d) for d in durs] + assert len(phns) == len(durs) + return phns, durs, word2phns + + +def words2phns(text: str, lang='en'): + ''' + Args: + text (str): + input text. + eg: for that reason cover is impossible to be given. + lang (str): + 'en' or 'zh' + Returns: + List[str]: phones of input text. + eg: + ['F', 'AO1', 'R', 'DH', 'AE1', 'T', 'R', 'IY1', 'Z', 'AH0', 'N', 'K', 'AH1', 'V', 'ER0', + 'IH1', 'Z', 'IH2', 'M', 'P', 'AA1', 'S', 'AH0', 'B', 'AH0', 'L', 'T', 'UW1', 'B', 'IY1', + 'G', 'IH1', 'V', 'AH0', 'N'] + + Dict(str, str): key - idx_word + value - phones + eg: + {'0_FOR': ['F', 'AO1', 'R'], '1_THAT': ['DH', 'AE1', 'T'], + '2_REASON': ['R', 'IY1', 'Z', 'AH0', 'N'],'3_COVER': ['K', 'AH1', 'V', 'ER0'], '4_IS': ['IH1', 'Z'], + '5_IMPOSSIBLE': ['IH2', 'M', 'P', 'AA1', 'S', 'AH0', 'B', 'AH0', 'L'], + '6_TO': ['T', 'UW1'], '7_BE': ['B', 'IY1'], '8_GIVEN': ['G', 'IH1', 'V', 'AH0', 'N']} + ''' + text = text.strip() + words = [] + for pun in [ + ',', '.', ':', ';', '!', '?', '"', '(', ')', '--', '---', u',', + u'。', u':', u';', u'!', u'?', u'(', u')' + ]: + text = text.replace(pun, ' ') + for wrd in text.split(): + if (wrd[-1] == '-'): + wrd = wrd[:-1] + if (wrd[0] == "'"): + wrd = wrd[1:] + if wrd: + words.append(wrd) + if lang == 'en': + dictfile = DICT_EN + elif lang == 'zh': + dictfile = DICT_ZH + else: + print('please input right lang!!') + + word2phns_dict = get_dict(dictfile) + ds = word2phns_dict.keys() + phns = [] + wrd2phns = {} + for index, wrd in enumerate(words): + if lang == 'en': + wrd = wrd.upper() + if (wrd not in ds): + wrd2phns[str(index) + '_' + wrd] = 'spn' + phns.extend('spn') + else: + wrd2phns[str(index) + '_' + wrd] = word2phns_dict[wrd].split() + phns.extend(word2phns_dict[wrd].split()) + return phns, wrd2phns + + +def get_phns_spans(wav_path: str, + old_str: str='', + new_str: str='', + source_lang: str='en', + target_lang: str='en', + fs: int=24000, + n_shift: int=300): + is_append = (old_str == new_str[:len(old_str)]) + old_phns, mfa_start, mfa_end = [], [], [] + # source + lang = source_lang + phn, dur, w2p = alignment( + wav_path=wav_path, text=old_str, lang=lang, fs=fs, n_shift=n_shift) + + new_d_cumsum = np.pad(np.array(dur).cumsum(0), (1, 0), 'constant').tolist() + mfa_start = new_d_cumsum[:-1] + mfa_end = new_d_cumsum[1:] + old_phns = phn + + # target + if is_append and (source_lang != target_lang): + cross_lingual_clone = True + else: + cross_lingual_clone = False + + if cross_lingual_clone: + str_origin = new_str[:len(old_str)] + str_append = new_str[len(old_str):] + + if target_lang == 'zh': + phns_origin, origin_w2p = words2phns(str_origin, lang='en') + phns_append, append_w2p_tmp = words2phns(str_append, lang='zh') + elif target_lang == 'en': + # 原始句子 + phns_origin, origin_w2p = words2phns(str_origin, lang='zh') + # clone 句子 + phns_append, append_w2p_tmp = words2phns(str_append, lang='en') + else: + assert target_lang == 'zh' or target_lang == 'en', \ + 'cloning is not support for this language, please check it.' 
+ + new_phns = phns_origin + phns_append + + append_w2p = {} + length = len(origin_w2p) + for key, value in append_w2p_tmp.items(): + idx, wrd = key.split('_') + append_w2p[str(int(idx) + length) + '_' + wrd] = value + new_w2p = origin_w2p.copy() + new_w2p.update(append_w2p) + + else: + if source_lang == target_lang: + new_phns, new_w2p = words2phns(new_str, lang=source_lang) + else: + assert source_lang == target_lang, \ + 'source language is not same with target language...' + + span_to_repl = [0, len(old_phns) - 1] + span_to_add = [0, len(new_phns) - 1] + left_idx = 0 + new_phns_left = [] + sp_count = 0 + # find the left different index + # 因为可能 align 时候的 words2phns 和直接 words2phns, 前者会有 sp? + for key in w2p.keys(): + idx, wrd = key.split('_') + if wrd == 'sp': + sp_count += 1 + new_phns_left.append('sp') + else: + idx = str(int(idx) - sp_count) + if idx + '_' + wrd in new_w2p: + # 是 new_str phn 序列的 index + left_idx += len(new_w2p[idx + '_' + wrd]) + # old phn 序列 + new_phns_left.extend(w2p[key]) + else: + span_to_repl[0] = len(new_phns_left) + span_to_add[0] = len(new_phns_left) + break + + # reverse w2p and new_w2p + right_idx = 0 + new_phns_right = [] + sp_count = 0 + w2p_max_idx = _get_max_idx(w2p) + new_w2p_max_idx = _get_max_idx(new_w2p) + new_phns_mid = [] + if is_append: + new_phns_right = [] + new_phns_mid = new_phns[left_idx:] + span_to_repl[0] = len(new_phns_left) + span_to_add[0] = len(new_phns_left) + span_to_add[1] = len(new_phns_left) + len(new_phns_mid) + span_to_repl[1] = len(old_phns) - len(new_phns_right) + # speech edit + else: + for key in list(w2p.keys())[::-1]: + idx, wrd = key.split('_') + if wrd == 'sp': + sp_count += 1 + new_phns_right = ['sp'] + new_phns_right + else: + idx = str(new_w2p_max_idx - (w2p_max_idx - int(idx) - sp_count)) + if idx + '_' + wrd in new_w2p: + right_idx -= len(new_w2p[idx + '_' + wrd]) + new_phns_right = w2p[key] + new_phns_right + else: + span_to_repl[1] = len(old_phns) - len(new_phns_right) + new_phns_mid = new_phns[left_idx:right_idx] + span_to_add[1] = len(new_phns_left) + len(new_phns_mid) + if len(new_phns_mid) == 0: + span_to_add[1] = min(span_to_add[1] + 1, len(new_phns)) + span_to_add[0] = max(0, span_to_add[0] - 1) + span_to_repl[0] = max(0, span_to_repl[0] - 1) + span_to_repl[1] = min(span_to_repl[1] + 1, + len(old_phns)) + break + new_phns = new_phns_left + new_phns_mid + new_phns_right + ''' + For that reason cover should not be given. + For that reason cover is impossible to be given. + span_to_repl: [17, 23] "should not" + span_to_add: [17, 30] "is impossible to" + ''' + outs = {} + outs['mfa_start'] = mfa_start + outs['mfa_end'] = mfa_end + outs['old_phns'] = old_phns + outs['new_phns'] = new_phns + outs['span_to_repl'] = span_to_repl + outs['span_to_add'] = span_to_add + + return outs + + +if __name__ == '__main__': + text = "For that reason cover should not be given." 
+ phn, dur, word2phns = alignment("exp/p243_313.wav", text, lang='en') + print(phn, dur) + print(word2phns) + print("---------------------------------") + # 这里可以用我们的中文前端得到 pinyin 序列 + text_zh = "卡尔普陪外孙玩滑梯。" + text_zh = pypinyin.lazy_pinyin( + text_zh, + neutral_tone_with_five=True, + style=pypinyin.Style.TONE3, + tone_sandhi=True) + text_zh = " ".join(text_zh) + phn, dur, word2phns = alignment("exp/000001.wav", text_zh, lang='zh') + print(phn, dur) + print(word2phns) + print("---------------------------------") + phns, wrd2phns = words2phns(text, lang='en') + print("phns:", phns) + print("wrd2phns:", wrd2phns) + print("---------------------------------") + + phns, wrd2phns = words2phns(text_zh, lang='zh') + print("phns:", phns) + print("wrd2phns:", wrd2phns) + print("---------------------------------") + + outs = get_phns_spans( + wav_path="exp/p243_313.wav", + old_str="For that reason cover should not be given.", + new_str="for that reason cover is impossible to be given.") + + mfa_start = outs["mfa_start"] + mfa_end = outs["mfa_end"] + old_phns = outs["old_phns"] + new_phns = outs["new_phns"] + span_to_repl = outs["span_to_repl"] + span_to_add = outs["span_to_add"] + print("mfa_start:", mfa_start) + print("mfa_end:", mfa_end) + print("old_phns:", old_phns) + print("new_phns:", new_phns) + print("span_to_repl:", span_to_repl) + print("span_to_add:", span_to_add) + print("---------------------------------") diff --git a/paddlespeech/t2s/exps/ernie_sat/normalize.py b/paddlespeech/t2s/exps/ernie_sat/normalize.py new file mode 100644 index 000000000..74cdae2a6 --- /dev/null +++ b/paddlespeech/t2s/exps/ernie_sat/normalize.py @@ -0,0 +1,130 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Normalize feature files and dump them.""" +import argparse +import logging +from operator import itemgetter +from pathlib import Path + +import jsonlines +import numpy as np +from sklearn.preprocessing import StandardScaler +from tqdm import tqdm + +from paddlespeech.t2s.datasets.data_table import DataTable + + +def main(): + """Run preprocessing process.""" + parser = argparse.ArgumentParser( + description="Normalize dumped raw features (See detail in parallel_wavegan/bin/normalize.py)." + ) + parser.add_argument( + "--metadata", + type=str, + required=True, + help="directory including feature files to be normalized. 
" + "you need to specify either *-scp or rootdir.") + + parser.add_argument( + "--dumpdir", + type=str, + required=True, + help="directory to dump normalized feature files.") + parser.add_argument( + "--speech-stats", + type=str, + required=True, + help="speech statistics file.") + parser.add_argument( + "--phones-dict", type=str, default=None, help="phone vocabulary file.") + parser.add_argument( + "--speaker-dict", type=str, default=None, help="speaker id map file.") + + args = parser.parse_args() + + dumpdir = Path(args.dumpdir).expanduser() + # use absolute path + dumpdir = dumpdir.resolve() + dumpdir.mkdir(parents=True, exist_ok=True) + + # get dataset + with jsonlines.open(args.metadata, 'r') as reader: + metadata = list(reader) + dataset = DataTable( + metadata, converters={ + "speech": np.load, + }) + logging.info(f"The number of files = {len(dataset)}.") + + # restore scaler + speech_scaler = StandardScaler() + speech_scaler.mean_ = np.load(args.speech_stats)[0] + speech_scaler.scale_ = np.load(args.speech_stats)[1] + speech_scaler.n_features_in_ = speech_scaler.mean_.shape[0] + + vocab_phones = {} + with open(args.phones_dict, 'rt') as f: + phn_id = [line.strip().split() for line in f.readlines()] + for phn, id in phn_id: + vocab_phones[phn] = int(id) + + vocab_speaker = {} + with open(args.speaker_dict, 'rt') as f: + spk_id = [line.strip().split() for line in f.readlines()] + for spk, id in spk_id: + vocab_speaker[spk] = int(id) + + # process each file + output_metadata = [] + + for item in tqdm(dataset): + utt_id = item['utt_id'] + speech = item['speech'] + + # normalize + speech = speech_scaler.transform(speech) + speech_dir = dumpdir / "data_speech" + speech_dir.mkdir(parents=True, exist_ok=True) + speech_path = speech_dir / f"{utt_id}_speech.npy" + np.save(speech_path, speech.astype(np.float32), allow_pickle=False) + + phone_ids = [vocab_phones[p] for p in item['phones']] + spk_id = vocab_speaker[item["speaker"]] + record = { + "utt_id": item['utt_id'], + "spk_id": spk_id, + "text": phone_ids, + "text_lengths": item['text_lengths'], + "speech_lengths": item['speech_lengths'], + "durations": item['durations'], + "speech": str(speech_path), + "align_start": item['align_start'], + "align_end": item['align_end'], + } + # add spk_emb for voice cloning + if "spk_emb" in item: + record["spk_emb"] = str(item["spk_emb"]) + + output_metadata.append(record) + output_metadata.sort(key=itemgetter('utt_id')) + output_metadata_path = Path(args.dumpdir) / "metadata.jsonl" + with jsonlines.open(output_metadata_path, 'w') as writer: + for item in output_metadata: + writer.write(item) + logging.info(f"metadata dumped into {output_metadata_path}") + + +if __name__ == "__main__": + main() diff --git a/paddlespeech/t2s/exps/ernie_sat/preprocess.py b/paddlespeech/t2s/exps/ernie_sat/preprocess.py new file mode 100644 index 000000000..fc9e0888b --- /dev/null +++ b/paddlespeech/t2s/exps/ernie_sat/preprocess.py @@ -0,0 +1,342 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. +import argparse +import os +from concurrent.futures import ThreadPoolExecutor +from operator import itemgetter +from pathlib import Path +from typing import Any +from typing import Dict +from typing import List + +import jsonlines +import librosa +import numpy as np +import tqdm +import yaml +from yacs.config import CfgNode + +from paddlespeech.t2s.datasets.get_feats import LogMelFBank +from paddlespeech.t2s.datasets.preprocess_utils import compare_duration_and_mel_length +from paddlespeech.t2s.datasets.preprocess_utils import get_input_token +from paddlespeech.t2s.datasets.preprocess_utils import get_phn_dur +from paddlespeech.t2s.datasets.preprocess_utils import get_spk_id_map +from paddlespeech.t2s.datasets.preprocess_utils import merge_silence +from paddlespeech.t2s.utils import str2bool + + +def process_sentence(config: Dict[str, Any], + fp: Path, + sentences: Dict, + output_dir: Path, + mel_extractor=None, + cut_sil: bool=True, + spk_emb_dir: Path=None): + utt_id = fp.stem + # for vctk + if utt_id.endswith("_mic2"): + utt_id = utt_id[:-5] + record = None + if utt_id in sentences: + # reading, resampling may occur + wav, _ = librosa.load(str(fp), sr=config.fs) + if len(wav.shape) != 1: + return record + max_value = np.abs(wav).max() + if max_value > 1.0: + wav = wav / max_value + assert len(wav.shape) == 1, f"{utt_id} is not a mono-channel audio." + assert np.abs(wav).max( + ) <= 1.0, f"{utt_id} is seems to be different that 16 bit PCM." + phones = sentences[utt_id][0] + durations = sentences[utt_id][1] + speaker = sentences[utt_id][2] + d_cumsum = np.pad(np.array(durations).cumsum(0), (1, 0), 'constant') + + # little imprecise than use *.TextGrid directly + times = librosa.frames_to_time( + d_cumsum, sr=config.fs, hop_length=config.n_shift) + if cut_sil: + start = 0 + end = d_cumsum[-1] + if phones[0] == "sil" and len(durations) > 1: + start = times[1] + durations = durations[1:] + phones = phones[1:] + if phones[-1] == 'sil' and len(durations) > 1: + end = times[-2] + durations = durations[:-1] + phones = phones[:-1] + sentences[utt_id][0] = phones + sentences[utt_id][1] = durations + start, end = librosa.time_to_samples([start, end], sr=config.fs) + wav = wav[start:end] + + # extract mel feats + logmel = mel_extractor.get_log_mel_fbank(wav) + # change duration according to mel_length + compare_duration_and_mel_length(sentences, utt_id, logmel) + # utt_id may be popped in compare_duration_and_mel_length + if utt_id not in sentences: + return None + phones = sentences[utt_id][0] + durations = sentences[utt_id][1] + num_frames = logmel.shape[0] + assert sum(durations) == num_frames + + new_d_cumsum = np.pad(np.array(durations).cumsum(0), (1, 0), 'constant') + align_start = new_d_cumsum[:-1] + align_end = new_d_cumsum[1:] + assert len(align_start) == len(align_end) == len(durations) + + mel_dir = output_dir / "data_speech" + mel_dir.mkdir(parents=True, exist_ok=True) + mel_path = mel_dir / (utt_id + "_speech.npy") + np.save(mel_path, logmel) + # align_start_lengths == text_lengths + record = { + "utt_id": utt_id, + "phones": phones, + "text_lengths": len(phones), + "speech_lengths": num_frames, + "durations": durations, + "speech": str(mel_path), + "speaker": speaker, + "align_start": align_start.tolist(), + "align_end": align_end.tolist(), + } + if spk_emb_dir: + if speaker in os.listdir(spk_emb_dir): + embed_name = utt_id + ".npy" + embed_path = spk_emb_dir / speaker / embed_name 
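+            # keep the utterance only if a per-utterance speaker embedding file exists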
+ if embed_path.is_file(): + record["spk_emb"] = str(embed_path) + else: + return None + return record + + +def process_sentences(config, + fps: List[Path], + sentences: Dict, + output_dir: Path, + mel_extractor=None, + nprocs: int=1, + cut_sil: bool=True, + spk_emb_dir: Path=None): + if nprocs == 1: + results = [] + for fp in tqdm.tqdm(fps, total=len(fps)): + record = process_sentence( + config=config, + fp=fp, + sentences=sentences, + output_dir=output_dir, + mel_extractor=mel_extractor, + cut_sil=cut_sil, + spk_emb_dir=spk_emb_dir) + if record: + results.append(record) + else: + with ThreadPoolExecutor(nprocs) as pool: + futures = [] + with tqdm.tqdm(total=len(fps)) as progress: + for fp in fps: + future = pool.submit(process_sentence, config, fp, + sentences, output_dir, mel_extractor, + cut_sil, spk_emb_dir) + future.add_done_callback(lambda p: progress.update()) + futures.append(future) + + results = [] + for ft in futures: + record = ft.result() + if record: + results.append(record) + + results.sort(key=itemgetter("utt_id")) + # replace 'w' with 'a' to write from the end of file + with jsonlines.open(output_dir / "metadata.jsonl", 'a') as writer: + for item in results: + writer.write(item) + print("Done") + + +def main(): + # parse config and args + parser = argparse.ArgumentParser( + description="Preprocess audio and then extract features.") + + parser.add_argument( + "--dataset", + default="baker", + type=str, + help="name of dataset, should in {baker, aishell3, ljspeech, vctk} now") + + parser.add_argument( + "--rootdir", default=None, type=str, help="directory to dataset.") + + parser.add_argument( + "--dumpdir", + type=str, + required=True, + help="directory to dump feature files.") + parser.add_argument( + "--dur-file", default=None, type=str, help="path to durations.txt.") + + parser.add_argument("--config", type=str, help="fastspeech2 config file.") + + parser.add_argument( + "--num-cpu", type=int, default=1, help="number of process.") + + parser.add_argument( + "--cut-sil", + type=str2bool, + default=True, + help="whether cut sil in the edge of audio") + + parser.add_argument( + "--spk_emb_dir", + default=None, + type=str, + help="directory to speaker embedding files.") + args = parser.parse_args() + + rootdir = Path(args.rootdir).expanduser() + dumpdir = Path(args.dumpdir).expanduser() + # use absolute path + dumpdir = dumpdir.resolve() + dumpdir.mkdir(parents=True, exist_ok=True) + dur_file = Path(args.dur_file).expanduser() + + if args.spk_emb_dir: + spk_emb_dir = Path(args.spk_emb_dir).expanduser().resolve() + else: + spk_emb_dir = None + + assert rootdir.is_dir() + assert dur_file.is_file() + + with open(args.config, 'rt') as f: + config = CfgNode(yaml.safe_load(f)) + + sentences, speaker_set = get_phn_dur(dur_file) + + merge_silence(sentences) + phone_id_map_path = dumpdir / "phone_id_map.txt" + speaker_id_map_path = dumpdir / "speaker_id_map.txt" + get_input_token(sentences, phone_id_map_path, args.dataset) + get_spk_id_map(speaker_set, speaker_id_map_path) + + if args.dataset == "baker": + wav_files = sorted(list((rootdir / "Wave").rglob("*.wav"))) + # split data into 3 sections + num_train = 9800 + num_dev = 100 + train_wav_files = wav_files[:num_train] + dev_wav_files = wav_files[num_train:num_train + num_dev] + test_wav_files = wav_files[num_train + num_dev:] + elif args.dataset == "aishell3": + sub_num_dev = 5 + wav_dir = rootdir / "train" / "wav" + train_wav_files = [] + dev_wav_files = [] + test_wav_files = [] + for speaker in os.listdir(wav_dir): + wav_files 
= sorted(list((wav_dir / speaker).rglob("*.wav"))) + if len(wav_files) > 100: + train_wav_files += wav_files[:-sub_num_dev * 2] + dev_wav_files += wav_files[-sub_num_dev * 2:-sub_num_dev] + test_wav_files += wav_files[-sub_num_dev:] + else: + train_wav_files += wav_files + + elif args.dataset == "ljspeech": + wav_files = sorted(list((rootdir / "wavs").rglob("*.wav"))) + # split data into 3 sections + num_train = 12900 + num_dev = 100 + train_wav_files = wav_files[:num_train] + dev_wav_files = wav_files[num_train:num_train + num_dev] + test_wav_files = wav_files[num_train + num_dev:] + elif args.dataset == "vctk": + sub_num_dev = 5 + wav_dir = rootdir / "wav48_silence_trimmed" + train_wav_files = [] + dev_wav_files = [] + test_wav_files = [] + for speaker in os.listdir(wav_dir): + wav_files = sorted(list((wav_dir / speaker).rglob("*_mic2.flac"))) + if len(wav_files) > 100: + train_wav_files += wav_files[:-sub_num_dev * 2] + dev_wav_files += wav_files[-sub_num_dev * 2:-sub_num_dev] + test_wav_files += wav_files[-sub_num_dev:] + else: + train_wav_files += wav_files + + else: + print("dataset should in {baker, aishell3, ljspeech, vctk} now!") + + train_dump_dir = dumpdir / "train" / "raw" + train_dump_dir.mkdir(parents=True, exist_ok=True) + dev_dump_dir = dumpdir / "dev" / "raw" + dev_dump_dir.mkdir(parents=True, exist_ok=True) + test_dump_dir = dumpdir / "test" / "raw" + test_dump_dir.mkdir(parents=True, exist_ok=True) + + # Extractor + mel_extractor = LogMelFBank( + sr=config.fs, + n_fft=config.n_fft, + hop_length=config.n_shift, + win_length=config.win_length, + window=config.window, + n_mels=config.n_mels, + fmin=config.fmin, + fmax=config.fmax) + + # process for the 3 sections + if train_wav_files: + process_sentences( + config=config, + fps=train_wav_files, + sentences=sentences, + output_dir=train_dump_dir, + mel_extractor=mel_extractor, + nprocs=args.num_cpu, + cut_sil=args.cut_sil, + spk_emb_dir=spk_emb_dir) + if dev_wav_files: + process_sentences( + config=config, + fps=dev_wav_files, + sentences=sentences, + output_dir=dev_dump_dir, + mel_extractor=mel_extractor, + cut_sil=args.cut_sil, + spk_emb_dir=spk_emb_dir) + if test_wav_files: + process_sentences( + config=config, + fps=test_wav_files, + sentences=sentences, + output_dir=test_dump_dir, + mel_extractor=mel_extractor, + nprocs=args.num_cpu, + cut_sil=args.cut_sil, + spk_emb_dir=spk_emb_dir) + + +if __name__ == "__main__": + main() diff --git a/paddlespeech/t2s/exps/ernie_sat/synthesize.py b/paddlespeech/t2s/exps/ernie_sat/synthesize.py new file mode 100644 index 000000000..2e3582948 --- /dev/null +++ b/paddlespeech/t2s/exps/ernie_sat/synthesize.py @@ -0,0 +1,200 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+import argparse +import logging +from pathlib import Path + +import jsonlines +import numpy as np +import paddle +import soundfile as sf +import yaml +from yacs.config import CfgNode + +from paddlespeech.t2s.datasets.am_batch_fn import build_erniesat_collate_fn +from paddlespeech.t2s.exps.syn_utils import denorm +from paddlespeech.t2s.exps.syn_utils import get_am_inference +from paddlespeech.t2s.exps.syn_utils import get_test_dataset +from paddlespeech.t2s.exps.syn_utils import get_voc_inference + + +def evaluate(args): + # dataloader has been too verbose + logging.getLogger("DataLoader").disabled = True + + # construct dataset for evaluation + with jsonlines.open(args.test_metadata, 'r') as reader: + test_metadata = list(reader) + + # Init body. + with open(args.erniesat_config) as f: + erniesat_config = CfgNode(yaml.safe_load(f)) + with open(args.voc_config) as f: + voc_config = CfgNode(yaml.safe_load(f)) + + print("========Args========") + print(yaml.safe_dump(vars(args))) + print("========Config========") + print(erniesat_config) + print(voc_config) + + # ernie sat model + erniesat_inference = get_am_inference( + am='erniesat_dataset', + am_config=erniesat_config, + am_ckpt=args.erniesat_ckpt, + am_stat=args.erniesat_stat, + phones_dict=args.phones_dict) + + test_dataset = get_test_dataset( + test_metadata=test_metadata, am='erniesat_dataset') + + # vocoder + voc_inference = get_voc_inference( + voc=args.voc, + voc_config=voc_config, + voc_ckpt=args.voc_ckpt, + voc_stat=args.voc_stat) + + output_dir = Path(args.output_dir) + output_dir.mkdir(parents=True, exist_ok=True) + collate_fn = build_erniesat_collate_fn( + mlm_prob=erniesat_config.mlm_prob, + mean_phn_span=erniesat_config.mean_phn_span, + seg_emb=erniesat_config.model['enc_input_layer'] == 'sega_mlm', + text_masking=False) + + gen_raw = True + erniesat_mu, erniesat_std = np.load(args.erniesat_stat) + + for datum in test_dataset: + # collate function and dataloader + utt_id = datum["utt_id"] + speech_len = datum["speech_lengths"] + + # mask the middle 1/3 speech + left_bdy, right_bdy = speech_len // 3, 2 * speech_len // 3 + span_bdy = [left_bdy, right_bdy] + datum.update({"span_bdy": span_bdy}) + + batch = collate_fn([datum]) + with paddle.no_grad(): + out_mels = erniesat_inference( + speech=batch["speech"], + text=batch["text"], + masked_pos=batch["masked_pos"], + speech_mask=batch["speech_mask"], + text_mask=batch["text_mask"], + speech_seg_pos=batch["speech_seg_pos"], + text_seg_pos=batch["text_seg_pos"], + span_bdy=span_bdy) + + # vocoder + wav_list = [] + for mel in out_mels: + part_wav = voc_inference(mel) + wav_list.append(part_wav) + wav = paddle.concat(wav_list) + wav = wav.numpy() + if gen_raw: + speech = datum['speech'] + denorm_mel = denorm(speech, erniesat_mu, erniesat_std) + denorm_mel = paddle.to_tensor(denorm_mel) + wav_raw = voc_inference(denorm_mel) + wav_raw = wav_raw.numpy() + + sf.write( + str(output_dir / (utt_id + ".wav")), + wav, + samplerate=erniesat_config.fs) + if gen_raw: + sf.write( + str(output_dir / (utt_id + "_raw" + ".wav")), + wav_raw, + samplerate=erniesat_config.fs) + + print(f"{utt_id} done!") + + +def parse_args(): + # parse args and config + parser = argparse.ArgumentParser( + description="Synthesize with acoustic model & vocoder") + # ernie sat + + parser.add_argument( + '--erniesat_config', + type=str, + default=None, + help='Config of acoustic model.') + parser.add_argument( + '--erniesat_ckpt', + type=str, + default=None, + help='Checkpoint file of acoustic model.') + 
parser.add_argument( + "--erniesat_stat", + type=str, + default=None, + help="mean and standard deviation used to normalize spectrogram when training acoustic model." + ) + parser.add_argument( + "--phones_dict", type=str, default=None, help="phone vocabulary file.") + # vocoder + parser.add_argument( + '--voc', + type=str, + default='pwgan_csmsc', + choices=[ + 'pwgan_aishell3', + 'pwgan_vctk', + 'hifigan_aishell3', + 'hifigan_vctk', + ], + help='Choose vocoder type of tts task.') + parser.add_argument( + '--voc_config', type=str, default=None, help='Config of voc.') + parser.add_argument( + '--voc_ckpt', type=str, default=None, help='Checkpoint file of voc.') + parser.add_argument( + "--voc_stat", + type=str, + default=None, + help="mean and standard deviation used to normalize spectrogram when training voc." + ) + # other + parser.add_argument( + "--ngpu", type=int, default=1, help="if ngpu == 0, use cpu.") + parser.add_argument("--test_metadata", type=str, help="test metadata.") + parser.add_argument("--output_dir", type=str, help="output dir.") + + args = parser.parse_args() + return args + + +def main(): + + args = parse_args() + if args.ngpu == 0: + paddle.set_device("cpu") + elif args.ngpu > 0: + paddle.set_device("gpu") + else: + print("ngpu should >= 0 !") + + evaluate(args) + + +if __name__ == "__main__": + main() diff --git a/paddlespeech/t2s/exps/ernie_sat/synthesize_e2e.py b/paddlespeech/t2s/exps/ernie_sat/synthesize_e2e.py new file mode 100644 index 000000000..95b07367c --- /dev/null +++ b/paddlespeech/t2s/exps/ernie_sat/synthesize_e2e.py @@ -0,0 +1,346 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+import os
+from pathlib import Path
+from typing import List
+
+import librosa
+import numpy as np
+import paddle
+import soundfile as sf
+import yaml
+from paddle import nn
+from yacs.config import CfgNode
+
+from paddlespeech.t2s.datasets.am_batch_fn import build_erniesat_collate_fn
+from paddlespeech.t2s.datasets.get_feats import LogMelFBank
+from paddlespeech.t2s.exps.ernie_sat.align import get_phns_spans
+from paddlespeech.t2s.exps.ernie_sat.utils import eval_durs
+from paddlespeech.t2s.exps.ernie_sat.utils import get_dur_adj_factor
+from paddlespeech.t2s.exps.ernie_sat.utils import get_span_bdy
+from paddlespeech.t2s.exps.ernie_sat.utils import get_tmp_name
+from paddlespeech.t2s.exps.syn_utils import get_frontend
+from paddlespeech.t2s.exps.syn_utils import norm
+
+
+def _p2id(phonemes: List[str]) -> np.ndarray:
+    # replace unk phone with sp
+    phonemes = [phn if phn in vocab_phones else "sp" for phn in phonemes]
+    phone_ids = [vocab_phones[item] for item in phonemes]
+    return np.array(phone_ids, np.int64)
+
+
+def prep_feats_with_dur(wav_path: str,
+                        old_str: str='',
+                        new_str: str='',
+                        source_lang: str='en',
+                        target_lang: str='en',
+                        duration_adjust: bool=True,
+                        fs: int=24000,
+                        n_shift: int=300):
+    '''
+    Returns:
+        np.ndarray: new wav, replace the part to be edited in original wav with 0
+        List[str]: new phones
+        List[float]: mfa start of new wav
+        List[float]: mfa end of new wav
+        List[int]: masked mel boundary of original wav
+        List[int]: masked mel boundary of new wav
+    '''
+    wav_org, _ = librosa.load(wav_path, sr=fs)
+    phns_spans_outs = get_phns_spans(
+        wav_path=wav_path,
+        old_str=old_str,
+        new_str=new_str,
+        source_lang=source_lang,
+        target_lang=target_lang,
+        fs=fs,
+        n_shift=n_shift)
+
+    mfa_start = phns_spans_outs["mfa_start"]
+    mfa_end = phns_spans_outs["mfa_end"]
+    old_phns = phns_spans_outs["old_phns"]
+    new_phns = phns_spans_outs["new_phns"]
+    span_to_repl = phns_spans_outs["span_to_repl"]
+    span_to_add = phns_spans_outs["span_to_add"]
+
+    # Chinese phones may be missing from the fastspeech2 dict, they are replaced with 'sp'
+    if target_lang in {'en', 'zh'}:
+        old_durs = eval_durs(old_phns, target_lang=source_lang)
+    else:
+        assert target_lang in {'en', 'zh'}, \
+            "duration prediction is not supported for this language..."
+
+    orig_old_durs = [e - s for e, s in zip(mfa_end, mfa_start)]
+
+    if duration_adjust:
+        d_factor = get_dur_adj_factor(
+            orig_dur=orig_old_durs, pred_dur=old_durs, phns=old_phns)
+        d_factor = d_factor * 1.25
+    else:
+        d_factor = 1
+
+    if target_lang in {'en', 'zh'}:
+        new_durs = eval_durs(new_phns, target_lang=target_lang)
+    else:
+        assert target_lang == "zh" or target_lang == "en", \
+            "duration prediction is not supported for this language..."
+
+    # durations must be integers
+    new_durs_adjusted = [int(np.ceil(d_factor * i)) for i in new_durs]
+
+    new_span_dur_sum = sum(new_durs_adjusted[span_to_add[0]:span_to_add[1]])
+    old_span_dur_sum = sum(orig_old_durs[span_to_repl[0]:span_to_repl[1]])
+    dur_offset = new_span_dur_sum - old_span_dur_sum
+    new_mfa_start = mfa_start[:span_to_repl[0]]
+    new_mfa_end = mfa_end[:span_to_repl[0]]
+
+    for dur in new_durs_adjusted[span_to_add[0]:span_to_add[1]]:
+        if len(new_mfa_end) == 0:
+            new_mfa_start.append(0)
+            new_mfa_end.append(dur)
+        else:
+            new_mfa_start.append(new_mfa_end[-1])
+            new_mfa_end.append(new_mfa_end[-1] + dur)
+
+    new_mfa_start += [i + dur_offset for i in mfa_start[span_to_repl[1]:]]
+    new_mfa_end += [i + dur_offset for i in mfa_end[span_to_repl[1]:]]
+
+    # 3.
get new wav + # 在原始句子后拼接 + if span_to_repl[0] >= len(mfa_start): + wav_left_idx = len(wav_org) + wav_right_idx = wav_left_idx + # 在原始句子中间替换 + else: + wav_left_idx = int(np.floor(mfa_start[span_to_repl[0]] * n_shift)) + wav_right_idx = int(np.ceil(mfa_end[span_to_repl[1] - 1] * n_shift)) + blank_wav = np.zeros( + (int(np.ceil(new_span_dur_sum * n_shift)), ), dtype=wav_org.dtype) + # 原始音频,需要编辑的部分替换成空音频,空音频的时间由 fs2 的 duration_predictor 决定 + new_wav = np.concatenate( + [wav_org[:wav_left_idx], blank_wav, wav_org[wav_right_idx:]]) + + # 音频是正常遮住了 + sf.write(str("new_wav.wav"), new_wav, samplerate=fs) + + # 4. get old and new mel span to be mask + old_span_bdy = get_span_bdy( + mfa_start=mfa_start, mfa_end=mfa_end, span_to_repl=span_to_repl) + + new_span_bdy = get_span_bdy( + mfa_start=new_mfa_start, mfa_end=new_mfa_end, span_to_repl=span_to_add) + + # old_span_bdy, new_span_bdy 是帧级别的范围 + outs = {} + outs['new_wav'] = new_wav + outs['new_phns'] = new_phns + outs['new_mfa_start'] = new_mfa_start + outs['new_mfa_end'] = new_mfa_end + outs['old_span_bdy'] = old_span_bdy + outs['new_span_bdy'] = new_span_bdy + return outs + + + + +def prep_feats(wav_path: str, + old_str: str='', + new_str: str='', + source_lang: str='en', + target_lang: str='en', + duration_adjust: bool=True, + fs: int=24000, + n_shift: int=300): + + outs = prep_feats_with_dur( + wav_path=wav_path, + old_str=old_str, + new_str=new_str, + source_lang=source_lang, + target_lang=target_lang, + duration_adjust=duration_adjust, + fs=fs, + n_shift=n_shift) + + wav_name = os.path.basename(wav_path) + utt_id = wav_name.split('.')[0] + + wav = outs['new_wav'] + phns = outs['new_phns'] + mfa_start = outs['new_mfa_start'] + mfa_end = outs['new_mfa_end'] + old_span_bdy = outs['old_span_bdy'] + new_span_bdy = outs['new_span_bdy'] + span_bdy = np.array(new_span_bdy) + + text = _p2id(phns) + mel = mel_extractor.get_log_mel_fbank(wav) + erniesat_mean, erniesat_std = np.load(erniesat_stat) + normed_mel = norm(mel, erniesat_mean, erniesat_std) + tmp_name = get_tmp_name(text=old_str) + tmpbase = './tmp_dir/' + tmp_name + tmpbase = Path(tmpbase) + tmpbase.mkdir(parents=True, exist_ok=True) + print("tmp_name in synthesize_e2e:",tmp_name) + + mel_path = tmpbase / 'mel.npy' + print("mel_path:",mel_path) + np.save(mel_path, logmel) + durations = [e - s for e, s in zip(mfa_end, mfa_start)] + + datum={ + "utt_id": utt_id, + "spk_id": 0, + "text": text, + "text_lengths": len(text), + "speech_lengths": 115, + "durations": durations, + "speech": mel_path, + "align_start": mfa_start, + "align_end": mfa_end, + "span_bdy": span_bdy + } + + batch = collate_fn([datum]) + print("batch:",batch) + + return batch, old_span_bdy, new_span_bdy + + +def decode_with_model(mlm_model: nn.Layer, + collate_fn, + wav_path: str, + old_str: str='', + new_str: str='', + source_lang: str='en', + target_lang: str='en', + use_teacher_forcing: bool=False, + duration_adjust: bool=True, + fs: int=24000, + n_shift: int=300, + token_list: List[str]=[]): + batch, old_span_bdy, new_span_bdy = prep_feats( + source_lang=source_lang, + target_lang=target_lang, + wav_path=wav_path, + old_str=old_str, + new_str=new_str, + duration_adjust=duration_adjust, + fs=fs, + n_shift=n_shift, + token_list=token_list) + + + + feats = collate_fn(batch)[1] + + if 'text_masked_pos' in feats.keys(): + feats.pop('text_masked_pos') + + output = mlm_model.inference( + text=feats['text'], + speech=feats['speech'], + masked_pos=feats['masked_pos'], + speech_mask=feats['speech_mask'], + text_mask=feats['text_mask'], + 
speech_seg_pos=feats['speech_seg_pos'], + text_seg_pos=feats['text_seg_pos'], + span_bdy=new_span_bdy, + use_teacher_forcing=use_teacher_forcing) + + # 拼接音频 + output_feat = paddle.concat(x=output, axis=0) + wav_org, _ = librosa.load(wav_path, sr=fs) + return wav_org, output_feat, old_span_bdy, new_span_bdy, fs, hop_length + + +if __name__ == '__main__': + fs = 24000 + n_shift = 300 + wav_path = "exp/p243_313.wav" + old_str = "For that reason cover should not be given." + # for edit + # new_str = "for that reason cover is impossible to be given." + # for synthesize + append_str = "do you love me i love you so much" + new_str = old_str + append_str + + ''' + outs = prep_feats_with_dur( + wav_path=wav_path, + old_str=old_str, + new_str=new_str, + fs=fs, + n_shift=n_shift) + + new_wav = outs['new_wav'] + new_phns = outs['new_phns'] + new_mfa_start = outs['new_mfa_start'] + new_mfa_end = outs['new_mfa_end'] + old_span_bdy = outs['old_span_bdy'] + new_span_bdy = outs['new_span_bdy'] + + print("---------------------------------") + + print("new_wav:", new_wav) + print("new_phns:", new_phns) + print("new_mfa_start:", new_mfa_start) + print("new_mfa_end:", new_mfa_end) + print("old_span_bdy:", old_span_bdy) + print("new_span_bdy:", new_span_bdy) + print("---------------------------------") + ''' + + erniesat_config = "/home/yuantian01/PaddleSpeech_ERNIE_SAT/PaddleSpeech/examples/vctk/ernie_sat/local/default.yaml" + + with open(erniesat_config) as f: + erniesat_config = CfgNode(yaml.safe_load(f)) + + erniesat_stat = "/home/yuantian01/PaddleSpeech_ERNIE_SAT/PaddleSpeech/examples/vctk/ernie_sat/dump/train/speech_stats.npy" + + # Extractor + mel_extractor = LogMelFBank( + sr=erniesat_config.fs, + n_fft=erniesat_config.n_fft, + hop_length=erniesat_config.n_shift, + win_length=erniesat_config.win_length, + window=erniesat_config.window, + n_mels=erniesat_config.n_mels, + fmin=erniesat_config.fmin, + fmax=erniesat_config.fmax) + + + + collate_fn = build_erniesat_collate_fn( + mlm_prob=erniesat_config.mlm_prob, + mean_phn_span=erniesat_config.mean_phn_span, + seg_emb=erniesat_config.model['enc_input_layer'] == 'sega_mlm', + text_masking=False) + + phones_dict='/home/yuantian01/PaddleSpeech_ERNIE_SAT/PaddleSpeech/examples/vctk/ernie_sat/dump/phone_id_map.txt' + vocab_phones = {} + + with open(phones_dict, 'rt') as f: + phn_id = [line.strip().split() for line in f.readlines()] + for phn, id in phn_id: + vocab_phones[phn] = int(id) + + prep_feats(wav_path=wav_path, + old_str=old_str, + new_str=new_str, + fs=fs, + n_shift=n_shift) + + + diff --git a/paddlespeech/t2s/exps/ernie_sat/train.py b/paddlespeech/t2s/exps/ernie_sat/train.py new file mode 100644 index 000000000..ccd1245e1 --- /dev/null +++ b/paddlespeech/t2s/exps/ernie_sat/train.py @@ -0,0 +1,203 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+import argparse +import logging +import os +import shutil +from pathlib import Path + +import jsonlines +import numpy as np +import paddle +import yaml +from paddle import DataParallel +from paddle import distributed as dist +from paddle import nn +from paddle.io import DataLoader +from paddle.io import DistributedBatchSampler +from paddle.optimizer import Adam +from yacs.config import CfgNode + +from paddlespeech.t2s.datasets.am_batch_fn import build_erniesat_collate_fn +from paddlespeech.t2s.datasets.data_table import DataTable +from paddlespeech.t2s.models.ernie_sat import ErnieSAT +from paddlespeech.t2s.models.ernie_sat import ErnieSATEvaluator +from paddlespeech.t2s.models.ernie_sat import ErnieSATUpdater +from paddlespeech.t2s.training.extensions.snapshot import Snapshot +from paddlespeech.t2s.training.extensions.visualizer import VisualDL +from paddlespeech.t2s.training.seeding import seed_everything +from paddlespeech.t2s.training.trainer import Trainer + + +def train_sp(args, config): + # decides device type and whether to run in parallel + # setup running environment correctly + if (not paddle.is_compiled_with_cuda()) or args.ngpu == 0: + paddle.set_device("cpu") + else: + paddle.set_device("gpu") + world_size = paddle.distributed.get_world_size() + if world_size > 1: + paddle.distributed.init_parallel_env() + + # set the random seed, it is a must for multiprocess training + seed_everything(config.seed) + + print( + f"rank: {dist.get_rank()}, pid: {os.getpid()}, parent_pid: {os.getppid()}", + ) + fields = [ + "text", "text_lengths", "speech", "speech_lengths", "align_start", + "align_end" + ] + converters = {"speech": np.load} + # dataloader has been too verbose + logging.getLogger("DataLoader").disabled = True + + # construct dataset for training and validation + with jsonlines.open(args.train_metadata, 'r') as reader: + train_metadata = list(reader) + train_dataset = DataTable( + data=train_metadata, + fields=fields, + converters=converters, ) + with jsonlines.open(args.dev_metadata, 'r') as reader: + dev_metadata = list(reader) + dev_dataset = DataTable( + data=dev_metadata, + fields=fields, + converters=converters, ) + + # collate function and dataloader + collate_fn = build_erniesat_collate_fn( + mlm_prob=config.mlm_prob, + mean_phn_span=config.mean_phn_span, + seg_emb=config.model['enc_input_layer'] == 'sega_mlm', + text_masking=config["model"]["text_masking"]) + + train_sampler = DistributedBatchSampler( + train_dataset, + batch_size=config.batch_size, + shuffle=True, + drop_last=True) + + print("samplers done!") + + train_dataloader = DataLoader( + train_dataset, + batch_sampler=train_sampler, + collate_fn=collate_fn, + num_workers=config.num_workers) + + dev_dataloader = DataLoader( + dev_dataset, + shuffle=False, + drop_last=False, + batch_size=config.batch_size, + collate_fn=collate_fn, + num_workers=config.num_workers) + print("dataloaders done!") + + with open(args.phones_dict, "r") as f: + phn_id = [line.strip().split() for line in f.readlines()] + vocab_size = len(phn_id) + print("vocab_size:", vocab_size) + + odim = config.n_mels + model = ErnieSAT(idim=vocab_size, odim=odim, **config["model"]) + + if world_size > 1: + model = DataParallel(model) + print("model done!") + + scheduler = paddle.optimizer.lr.NoamDecay( + d_model=config["scheduler_params"]["d_model"], + warmup_steps=config["scheduler_params"]["warmup_steps"]) + grad_clip = nn.ClipGradByGlobalNorm(config["grad_clip"]) + optimizer = Adam( + learning_rate=scheduler, + grad_clip=grad_clip, + 
parameters=model.parameters()) + + print("optimizer done!") + + output_dir = Path(args.output_dir) + output_dir.mkdir(parents=True, exist_ok=True) + if dist.get_rank() == 0: + config_name = args.config.split("/")[-1] + # copy conf to output_dir + shutil.copyfile(args.config, output_dir / config_name) + + updater = ErnieSATUpdater( + model=model, + optimizer=optimizer, + scheduler=scheduler, + dataloader=train_dataloader, + text_masking=config["model"]["text_masking"], + odim=odim, + vocab_size=vocab_size, + output_dir=output_dir) + + trainer = Trainer(updater, (config.max_epoch, 'epoch'), output_dir) + + evaluator = ErnieSATEvaluator( + model=model, + dataloader=dev_dataloader, + text_masking=config["model"]["text_masking"], + odim=odim, + vocab_size=vocab_size, + output_dir=output_dir, ) + + if dist.get_rank() == 0: + trainer.extend(evaluator, trigger=(1, "epoch")) + trainer.extend(VisualDL(output_dir), trigger=(1, "iteration")) + trainer.extend( + Snapshot(max_size=config.num_snapshots), trigger=(1, 'epoch')) + trainer.run() + + +def main(): + # parse args and config and redirect to train_sp + parser = argparse.ArgumentParser(description="Train an ErnieSAT model.") + parser.add_argument("--config", type=str, help="ErnieSAT config file.") + parser.add_argument("--train-metadata", type=str, help="training data.") + parser.add_argument("--dev-metadata", type=str, help="dev data.") + parser.add_argument("--output-dir", type=str, help="output dir.") + parser.add_argument( + "--ngpu", type=int, default=1, help="if ngpu=0, use cpu.") + parser.add_argument( + "--phones-dict", type=str, default=None, help="phone vocabulary file.") + + args = parser.parse_args() + + with open(args.config) as f: + config = CfgNode(yaml.safe_load(f)) + + print("========Args========") + print(yaml.safe_dump(vars(args))) + print("========Config========") + print(config) + print( + f"master see the word size: {dist.get_world_size()}, from pid: {os.getpid()}" + ) + + # dispatch + if args.ngpu > 1: + dist.spawn(train_sp, (args, config), nprocs=args.ngpu) + else: + train_sp(args, config) + + +if __name__ == "__main__": + main() diff --git a/paddlespeech/t2s/exps/ernie_sat/utils.py b/paddlespeech/t2s/exps/ernie_sat/utils.py new file mode 100644 index 000000000..9169efa36 --- /dev/null +++ b/paddlespeech/t2s/exps/ernie_sat/utils.py @@ -0,0 +1,216 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+from pathlib import Path +from typing import Dict +from typing import List +from typing import Union +import os + +import numpy as np +import paddle +import yaml +from yacs.config import CfgNode +import hashlib + + +from paddlespeech.t2s.exps.syn_utils import get_am_inference +from paddlespeech.t2s.exps.syn_utils import get_voc_inference + +def _get_user(): + return os.path.expanduser('~').split('/')[-1] + +def str2md5(string): + md5_val = hashlib.md5(string.encode('utf8')).hexdigest() + return md5_val + +def get_tmp_name(text:str): + return _get_user() + '_' + str(os.getpid()) + '_' + str2md5(text) + +def get_dict(dictfile: str): + word2phns_dict = {} + with open(dictfile, 'r') as fid: + for line in fid: + line_lst = line.split() + word, phn_lst = line_lst[0], line.split()[1:] + if word not in word2phns_dict.keys(): + word2phns_dict[word] = ' '.join(phn_lst) + return word2phns_dict + + +# 获取需要被 mask 的 mel 帧的范围 +def get_span_bdy(mfa_start: List[float], + mfa_end: List[float], + span_to_repl: List[List[int]]): + if span_to_repl[0] >= len(mfa_start): + span_bdy = [mfa_end[-1], mfa_end[-1]] + else: + span_bdy = [mfa_start[span_to_repl[0]], mfa_end[span_to_repl[1] - 1]] + return span_bdy + + +# mfa 获得的 duration 和 fs2 的 duration_predictor 获取的 duration 可能不同 +# 此处获得一个缩放比例, 用于预测值和真实值之间的缩放 +def get_dur_adj_factor(orig_dur: List[int], + pred_dur: List[int], + phns: List[str]): + length = 0 + factor_list = [] + for orig, pred, phn in zip(orig_dur, pred_dur, phns): + if pred == 0 or phn == 'sp': + continue + else: + factor_list.append(orig / pred) + factor_list = np.array(factor_list) + factor_list.sort() + if len(factor_list) < 5: + return 1 + length = 2 + avg = np.average(factor_list[length:-length]) + return avg + + +def read_2col_text(path: Union[Path, str]) -> Dict[str, str]: + """Read a text file having 2 column as dict object. 
+ + Examples: + wav.scp: + key1 /some/path/a.wav + key2 /some/path/b.wav + + >>> read_2col_text('wav.scp') + {'key1': '/some/path/a.wav', 'key2': '/some/path/b.wav'} + + """ + + data = {} + with Path(path).open("r", encoding="utf-8") as f: + for linenum, line in enumerate(f, 1): + sps = line.rstrip().split(maxsplit=1) + if len(sps) == 1: + k, v = sps[0], "" + else: + k, v = sps + if k in data: + raise RuntimeError(f"{k} is duplicated ({path}:{linenum})") + data[k] = v + return data + + +def load_num_sequence_text(path: Union[Path, str], loader_type: str="csv_int" + ) -> Dict[str, List[Union[float, int]]]: + """Read a text file indicating sequences of number + + Examples: + key1 1 2 3 + key2 34 5 6 + + >>> d = load_num_sequence_text('text') + >>> np.testing.assert_array_equal(d["key1"], np.array([1, 2, 3])) + """ + if loader_type == "text_int": + delimiter = " " + dtype = int + elif loader_type == "text_float": + delimiter = " " + dtype = float + elif loader_type == "csv_int": + delimiter = "," + dtype = int + elif loader_type == "csv_float": + delimiter = "," + dtype = float + else: + raise ValueError(f"Not supported loader_type={loader_type}") + + # path looks like: + # utta 1,0 + # uttb 3,4,5 + # -> return {'utta': np.ndarray([1, 0]), + # 'uttb': np.ndarray([3, 4, 5])} + d = read_2column_text(path) + # Using for-loop instead of dict-comprehension for debuggability + retval = {} + for k, v in d.items(): + try: + retval[k] = [dtype(i) for i in v.split(delimiter)] + except TypeError: + print(f'Error happened with path="{path}", id="{k}", value="{v}"') + raise + return retval + + +def is_chinese(ch): + if u'\u4e00' <= ch <= u'\u9fff': + return True + else: + return False + + +def get_voc_out(mel): + # vocoder + args = parse_args() + with open(args.voc_config) as f: + voc_config = CfgNode(yaml.safe_load(f)) + voc_inference = get_voc_inference( + voc=args.voc, + voc_config=voc_config, + voc_ckpt=args.voc_ckpt, + voc_stat=args.voc_stat) + + with paddle.no_grad(): + wav = voc_inference(mel) + return np.squeeze(wav) + + +def eval_durs(phns, target_lang: str='zh', fs: int=24000, n_shift: int=300): + + if target_lang == 'en': + am = "fastspeech2_ljspeech" + am_config = "download/fastspeech2_nosil_ljspeech_ckpt_0.5/default.yaml" + am_ckpt = "download/fastspeech2_nosil_ljspeech_ckpt_0.5/snapshot_iter_100000.pdz" + am_stat = "download/fastspeech2_nosil_ljspeech_ckpt_0.5/speech_stats.npy" + phones_dict = "download/fastspeech2_nosil_ljspeech_ckpt_0.5/phone_id_map.txt" + + elif target_lang == 'zh': + am = "fastspeech2_csmsc" + am_config = "download/fastspeech2_conformer_baker_ckpt_0.5/conformer.yaml" + am_ckpt = "download/fastspeech2_conformer_baker_ckpt_0.5/snapshot_iter_76000.pdz" + am_stat = "download/fastspeech2_conformer_baker_ckpt_0.5/speech_stats.npy" + phones_dict = "download/fastspeech2_conformer_baker_ckpt_0.5/phone_id_map.txt" + + # Init body. 
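+    # build the FastSpeech2 inference object and use its duration predictor
+    # to get a per-phone duration (in frames) for the given phone sequence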
+ with open(am_config) as f: + am_config = CfgNode(yaml.safe_load(f)) + + am_inference, am = get_am_inference( + am=am, + am_config=am_config, + am_ckpt=am_ckpt, + am_stat=am_stat, + phones_dict=phones_dict, + return_am=True) + + vocab_phones = {} + with open(phones_dict, "r") as f: + phn_id = [line.strip().split() for line in f.readlines()] + for tone, id in phn_id: + vocab_phones[tone] = int(id) + vocab_size = len(vocab_phones) + phonemes = [phn if phn in vocab_phones else "sp" for phn in phns] + + phone_ids = [vocab_phones[item] for item in phonemes] + phone_ids = paddle.to_tensor(np.array(phone_ids, np.int64)) + _, d_outs, _, _ = am.inference(phone_ids) + d_outs = d_outs.tolist() + return d_outs diff --git a/paddlespeech/t2s/exps/fastspeech2/normalize.py b/paddlespeech/t2s/exps/fastspeech2/normalize.py index 8ec20ebf0..92d10832b 100644 --- a/paddlespeech/t2s/exps/fastspeech2/normalize.py +++ b/paddlespeech/t2s/exps/fastspeech2/normalize.py @@ -58,30 +58,8 @@ def main(): "--phones-dict", type=str, default=None, help="phone vocabulary file.") parser.add_argument( "--speaker-dict", type=str, default=None, help="speaker id map file.") - parser.add_argument( - "--verbose", - type=int, - default=1, - help="logging level. higher is more logging. (default=1)") - args = parser.parse_args() - # set logger - if args.verbose > 1: - logging.basicConfig( - level=logging.DEBUG, - format="%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s" - ) - elif args.verbose > 0: - logging.basicConfig( - level=logging.INFO, - format="%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s" - ) - else: - logging.basicConfig( - level=logging.WARN, - format="%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s" - ) - logging.warning('Skip DEBUG/INFO messages') + args = parser.parse_args() dumpdir = Path(args.dumpdir).expanduser() # use absolute path diff --git a/paddlespeech/t2s/exps/fastspeech2/preprocess.py b/paddlespeech/t2s/exps/fastspeech2/preprocess.py index eac75f982..f4acdc60b 100644 --- a/paddlespeech/t2s/exps/fastspeech2/preprocess.py +++ b/paddlespeech/t2s/exps/fastspeech2/preprocess.py @@ -144,7 +144,8 @@ def process_sentences(config, energy_extractor=None, nprocs: int=1, cut_sil: bool=True, - spk_emb_dir: Path=None): + spk_emb_dir: Path=None, + write_metadata_method: str='w'): if nprocs == 1: results = [] for fp in tqdm.tqdm(fps, total=len(fps)): @@ -179,7 +180,8 @@ def process_sentences(config, results.append(record) results.sort(key=itemgetter("utt_id")) - with jsonlines.open(output_dir / "metadata.jsonl", 'w') as writer: + with jsonlines.open(output_dir / "metadata.jsonl", + write_metadata_method) as writer: for item in results: writer.write(item) print("Done") @@ -209,11 +211,6 @@ def main(): parser.add_argument("--config", type=str, help="fastspeech2 config file.") - parser.add_argument( - "--verbose", - type=int, - default=1, - help="logging level. higher is more logging. 
(default=1)") parser.add_argument( "--num-cpu", type=int, default=1, help="number of process.") @@ -228,6 +225,13 @@ def main(): default=None, type=str, help="directory to speaker embedding files.") + + parser.add_argument( + "--write_metadata_method", + default="w", + type=str, + choices=["w", "a"], + help="How the metadata.jsonl file is written.") args = parser.parse_args() rootdir = Path(args.rootdir).expanduser() @@ -248,10 +252,6 @@ def main(): with open(args.config, 'rt') as f: config = CfgNode(yaml.safe_load(f)) - if args.verbose > 1: - print(vars(args)) - print(config) - sentences, speaker_set = get_phn_dur(dur_file) merge_silence(sentences) @@ -349,7 +349,8 @@ def main(): energy_extractor=energy_extractor, nprocs=args.num_cpu, cut_sil=args.cut_sil, - spk_emb_dir=spk_emb_dir) + spk_emb_dir=spk_emb_dir, + write_metadata_method=args.write_metadata_method) if dev_wav_files: process_sentences( config=config, @@ -360,7 +361,8 @@ def main(): pitch_extractor=pitch_extractor, energy_extractor=energy_extractor, cut_sil=args.cut_sil, - spk_emb_dir=spk_emb_dir) + spk_emb_dir=spk_emb_dir, + write_metadata_method=args.write_metadata_method) if test_wav_files: process_sentences( config=config, @@ -372,7 +374,8 @@ def main(): energy_extractor=energy_extractor, nprocs=args.num_cpu, cut_sil=args.cut_sil, - spk_emb_dir=spk_emb_dir) + spk_emb_dir=spk_emb_dir, + write_metadata_method=args.write_metadata_method) if __name__ == "__main__": diff --git a/paddlespeech/t2s/exps/gan_vocoder/normalize.py b/paddlespeech/t2s/exps/gan_vocoder/normalize.py index ba95d3ed6..4cb7e41c5 100644 --- a/paddlespeech/t2s/exps/gan_vocoder/normalize.py +++ b/paddlespeech/t2s/exps/gan_vocoder/normalize.py @@ -47,30 +47,8 @@ def main(): default=False, action="store_true", help="whether to skip the copy of wav files.") - parser.add_argument( - "--verbose", - type=int, - default=1, - help="logging level. higher is more logging. (default=1)") - args = parser.parse_args() - # set logger - if args.verbose > 1: - logging.basicConfig( - level=logging.DEBUG, - format="%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s" - ) - elif args.verbose > 0: - logging.basicConfig( - level=logging.INFO, - format="%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s" - ) - else: - logging.basicConfig( - level=logging.WARN, - format="%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s" - ) - logging.warning('Skip DEBUG/INFO messages') + args = parser.parse_args() dumpdir = Path(args.dumpdir).expanduser() # use absolute path diff --git a/paddlespeech/t2s/exps/gan_vocoder/preprocess.py b/paddlespeech/t2s/exps/gan_vocoder/preprocess.py index 546367964..05c657682 100644 --- a/paddlespeech/t2s/exps/gan_vocoder/preprocess.py +++ b/paddlespeech/t2s/exps/gan_vocoder/preprocess.py @@ -167,11 +167,6 @@ def main(): required=True, help="directory to dump feature files.") parser.add_argument("--config", type=str, help="vocoder config file.") - parser.add_argument( - "--verbose", - type=int, - default=1, - help="logging level. higher is more logging. 
(default=1)") parser.add_argument( "--num-cpu", type=int, default=1, help="number of process.") parser.add_argument( @@ -197,10 +192,6 @@ def main(): with open(args.config, 'rt') as f: config = CfgNode(yaml.safe_load(f)) - if args.verbose > 1: - print(vars(args)) - print(config) - sentences, speaker_set = get_phn_dur(dur_file) merge_silence(sentences) diff --git a/paddlespeech/t2s/exps/inference.py b/paddlespeech/t2s/exps/inference.py index 98e73e102..5840c0699 100644 --- a/paddlespeech/t2s/exps/inference.py +++ b/paddlespeech/t2s/exps/inference.py @@ -35,8 +35,13 @@ def parse_args(): type=str, default='fastspeech2_csmsc', choices=[ - 'speedyspeech_csmsc', 'fastspeech2_csmsc', 'fastspeech2_aishell3', - 'fastspeech2_vctk', 'tacotron2_csmsc' + 'speedyspeech_csmsc', + 'fastspeech2_csmsc', + 'fastspeech2_aishell3', + 'fastspeech2_ljspeech', + 'fastspeech2_vctk', + 'tacotron2_csmsc', + 'fastspeech2_mix', ], help='Choose acoustic model type of tts task.') parser.add_argument( @@ -56,8 +61,16 @@ def parse_args(): type=str, default='pwgan_csmsc', choices=[ - 'pwgan_csmsc', 'mb_melgan_csmsc', 'hifigan_csmsc', 'pwgan_aishell3', - 'pwgan_vctk', 'wavernn_csmsc' + 'pwgan_csmsc', + 'pwgan_aishell3', + 'pwgan_ljspeech', + 'pwgan_vctk', + 'mb_melgan_csmsc', + 'hifigan_csmsc', + 'hifigan_aishell3', + 'hifigan_ljspeech', + 'hifigan_vctk', + 'wavernn_csmsc', ], help='Choose vocoder type of tts task.') # other @@ -65,7 +78,7 @@ def parse_args(): '--lang', type=str, default='zh', - help='Choose model language. zh or en') + help='Choose model language. zh or en or mix') parser.add_argument( "--text", type=str, @@ -74,11 +87,6 @@ def parse_args(): "--inference_dir", type=str, help="dir to save inference models") parser.add_argument("--output_dir", type=str, help="output dir") # inference - parser.add_argument( - "--use_trt", - type=str2bool, - default=False, - help="Whether to use inference engin TensorRT.", ) parser.add_argument( "--int8", type=str2bool, @@ -144,7 +152,8 @@ def main(): frontend=frontend, lang=args.lang, merge_sentences=merge_sentences, - speaker_dict=args.speaker_dict, ) + speaker_dict=args.speaker_dict, + spk_id=args.spk_id, ) wav = get_voc_output( voc_predictor=voc_predictor, input=am_output_data) speed = wav.size / t.elapse @@ -166,7 +175,8 @@ def main(): frontend=frontend, lang=args.lang, merge_sentences=merge_sentences, - speaker_dict=args.speaker_dict, ) + speaker_dict=args.speaker_dict, + spk_id=args.spk_id, ) wav = get_voc_output( voc_predictor=voc_predictor, input=am_output_data) @@ -175,7 +185,7 @@ def main(): speed = wav.size / t.elapse rtf = fs / speed - sf.write(output_dir / (utt_id + ".wav"), wav, samplerate=24000) + sf.write(output_dir / (utt_id + ".wav"), wav, samplerate=fs) print( f"{utt_id}, mel: {am_output_data.shape}, wave: {wav.shape}, time: {t.elapse}s, Hz: {speed}, RTF: {rtf}." 
) diff --git a/paddlespeech/t2s/exps/inference_streaming.py b/paddlespeech/t2s/exps/inference_streaming.py index 624defc6a..5e2ce89db 100644 --- a/paddlespeech/t2s/exps/inference_streaming.py +++ b/paddlespeech/t2s/exps/inference_streaming.py @@ -27,6 +27,7 @@ from paddlespeech.t2s.exps.syn_utils import get_predictor from paddlespeech.t2s.exps.syn_utils import get_sentences from paddlespeech.t2s.exps.syn_utils import get_streaming_am_output from paddlespeech.t2s.exps.syn_utils import get_voc_output +from paddlespeech.t2s.exps.syn_utils import run_frontend from paddlespeech.t2s.utils import str2bool @@ -175,14 +176,13 @@ def main(): for utt_id, sentence in sentences: with timer() as t: # frontend - if args.lang == 'zh': - input_ids = frontend.get_input_ids( - sentence, - merge_sentences=merge_sentences, - get_tone_ids=get_tone_ids) - phone_ids = input_ids["phone_ids"] - else: - print("lang should be 'zh' here!") + frontend_dict = run_frontend( + frontend=frontend, + text=sentence, + merge_sentences=merge_sentences, + get_tone_ids=get_tone_ids, + lang=args.lang) + phone_ids = frontend_dict['phone_ids'] phones = phone_ids[0].numpy() # acoustic model orig_hs = get_am_sublayer_output( diff --git a/paddlespeech/t2s/exps/ort_predict.py b/paddlespeech/t2s/exps/ort_predict.py index 2e8596ded..bd89f74d2 100644 --- a/paddlespeech/t2s/exps/ort_predict.py +++ b/paddlespeech/t2s/exps/ort_predict.py @@ -41,17 +41,17 @@ def ort_predict(args): # am am_sess = get_sess( - model_dir=args.inference_dir, - model_file=args.am + ".onnx", + model_path=str(Path(args.inference_dir) / (args.am + '.onnx')), device=args.device, - cpu_threads=args.cpu_threads) + cpu_threads=args.cpu_threads, + use_trt=args.use_trt) # vocoder voc_sess = get_sess( - model_dir=args.inference_dir, - model_file=args.voc + ".onnx", + model_path=str(Path(args.inference_dir) / (args.voc + '.onnx')), device=args.device, - cpu_threads=args.cpu_threads) + cpu_threads=args.cpu_threads, + use_trt=args.use_trt) # am warmup for T in [27, 38, 54]: diff --git a/paddlespeech/t2s/exps/ort_predict_e2e.py b/paddlespeech/t2s/exps/ort_predict_e2e.py index a2ef8e4c6..75284f7bb 100644 --- a/paddlespeech/t2s/exps/ort_predict_e2e.py +++ b/paddlespeech/t2s/exps/ort_predict_e2e.py @@ -22,6 +22,7 @@ from timer import timer from paddlespeech.t2s.exps.syn_utils import get_frontend from paddlespeech.t2s.exps.syn_utils import get_sentences from paddlespeech.t2s.exps.syn_utils import get_sess +from paddlespeech.t2s.exps.syn_utils import run_frontend from paddlespeech.t2s.utils import str2bool @@ -42,31 +43,42 @@ def ort_predict(args): fs = 24000 if am_dataset != 'ljspeech' else 22050 am_sess = get_sess( - model_dir=args.inference_dir, - model_file=args.am + ".onnx", + model_path=str(Path(args.inference_dir) / (args.am + '.onnx')), device=args.device, - cpu_threads=args.cpu_threads) + cpu_threads=args.cpu_threads, + use_trt=args.use_trt) # vocoder voc_sess = get_sess( - model_dir=args.inference_dir, - model_file=args.voc + ".onnx", + model_path=str(Path(args.inference_dir) / (args.voc + '.onnx')), device=args.device, - cpu_threads=args.cpu_threads) + cpu_threads=args.cpu_threads, + use_trt=args.use_trt) + + merge_sentences = True # frontend warmup # Loading model cost 0.5+ seconds if args.lang == 'zh': - frontend.get_input_ids("你好,欢迎使用飞桨框架进行深度学习研究!", merge_sentences=True) + frontend.get_input_ids( + "你好,欢迎使用飞桨框架进行深度学习研究!", merge_sentences=merge_sentences) else: - print("lang should in be 'zh' here!") + frontend.get_input_ids( + "hello, thank you, thank you very much", + 
merge_sentences=merge_sentences) # am warmup + spk_id = [args.spk_id] for T in [27, 38, 54]: am_input_feed = {} if am_name == 'fastspeech2': - phone_ids = np.random.randint(1, 266, size=(T, )) + if args.lang == 'en': + phone_ids = np.random.randint(1, 78, size=(T, )) + else: + phone_ids = np.random.randint(1, 266, size=(T, )) am_input_feed.update({'text': phone_ids}) + if am_dataset in {"aishell3", "vctk", "mix"}: + am_input_feed.update({'spk_id': spk_id}) elif am_name == 'speedyspeech': phone_ids = np.random.randint(1, 92, size=(T, )) tone_ids = np.random.randint(1, 5, size=(T, )) @@ -81,44 +93,51 @@ def ort_predict(args): N = 0 T = 0 - merge_sentences = True + merge_sentences = False get_tone_ids = False - am_input_feed = {} if am_name == 'speedyspeech': get_tone_ids = True + am_input_feed = {} for utt_id, sentence in sentences: with timer() as t: - if args.lang == 'zh': - input_ids = frontend.get_input_ids( - sentence, - merge_sentences=merge_sentences, - get_tone_ids=get_tone_ids) - phone_ids = input_ids["phone_ids"] - if get_tone_ids: - tone_ids = input_ids["tone_ids"] - else: - print("lang should in be 'zh' here!") - # merge_sentences=True here, so we only use the first item of phone_ids - phone_ids = phone_ids[0].numpy() - if am_name == 'fastspeech2': - am_input_feed.update({'text': phone_ids}) - elif am_name == 'speedyspeech': - tone_ids = tone_ids[0].numpy() - am_input_feed.update({'phones': phone_ids, 'tones': tone_ids}) - mel = am_sess.run(output_names=None, input_feed=am_input_feed) - mel = mel[0] - wav = voc_sess.run(output_names=None, input_feed={'logmel': mel}) - - N += len(wav[0]) - T += t.elapse - speed = len(wav[0]) / t.elapse - rtf = fs / speed - sf.write( - str(output_dir / (utt_id + ".wav")), - np.array(wav)[0], - samplerate=fs) + frontend_dict = run_frontend( + frontend=frontend, + text=sentence, + merge_sentences=merge_sentences, + get_tone_ids=get_tone_ids, + lang=args.lang) + phone_ids = frontend_dict['phone_ids'] + flags = 0 + for i in range(len(phone_ids)): + part_phone_ids = phone_ids[i].numpy() + if am_name == 'fastspeech2': + am_input_feed.update({'text': part_phone_ids}) + if am_dataset in {"aishell3", "vctk", "mix"}: + am_input_feed.update({'spk_id': spk_id}) + elif am_name == 'speedyspeech': + part_tone_ids = frontend_dict['tone_ids'][i].numpy() + am_input_feed.update({ + 'phones': part_phone_ids, + 'tones': part_tone_ids + }) + mel = am_sess.run(output_names=None, input_feed=am_input_feed) + mel = mel[0] + wav = voc_sess.run( + output_names=None, input_feed={'logmel': mel}) + wav = wav[0] + if flags == 0: + wav_all = wav + flags = 1 + else: + wav_all = np.concatenate([wav_all, wav]) + wav = wav_all + N += len(wav) + T += t.elapse + speed = len(wav) / t.elapse + rtf = fs / speed + sf.write(str(output_dir / (utt_id + ".wav")), wav, samplerate=fs) print( - f"{utt_id}, mel: {mel.shape}, wave: {len(wav[0])}, time: {t.elapse}s, Hz: {speed}, RTF: {rtf}." + f"{utt_id}, mel: {mel.shape}, wave: {len(wav)}, time: {t.elapse}s, Hz: {speed}, RTF: {rtf}." 
) print(f"generation speed: {N / T}Hz, RTF: {fs / (N / T) }") @@ -130,19 +149,41 @@ def parse_args(): '--am', type=str, default='fastspeech2_csmsc', - choices=['fastspeech2_csmsc', 'speedyspeech_csmsc'], + choices=[ + 'fastspeech2_csmsc', + 'fastspeech2_aishell3', + 'fastspeech2_ljspeech', + 'fastspeech2_vctk', + 'speedyspeech_csmsc', + 'fastspeech2_mix', + ], help='Choose acoustic model type of tts task.') parser.add_argument( "--phones_dict", type=str, default=None, help="phone vocabulary file.") parser.add_argument( "--tones_dict", type=str, default=None, help="tone vocabulary file.") + parser.add_argument( + '--spk_id', + type=int, + default=0, + help='spk id for multi speaker acoustic model') # voc parser.add_argument( '--voc', type=str, default='hifigan_csmsc', - choices=['hifigan_csmsc', 'mb_melgan_csmsc', 'pwgan_csmsc'], + choices=[ + 'pwgan_csmsc', + 'pwgan_aishell3', + 'pwgan_ljspeech', + 'pwgan_vctk', + 'hifigan_csmsc', + 'hifigan_aishell3', + 'hifigan_ljspeech', + 'hifigan_vctk', + 'mb_melgan_csmsc', + ], help='Choose vocoder type of tts task.') # other parser.add_argument( diff --git a/paddlespeech/t2s/exps/ort_predict_streaming.py b/paddlespeech/t2s/exps/ort_predict_streaming.py index d5241f1c6..0d07dcf37 100644 --- a/paddlespeech/t2s/exps/ort_predict_streaming.py +++ b/paddlespeech/t2s/exps/ort_predict_streaming.py @@ -24,6 +24,7 @@ from paddlespeech.t2s.exps.syn_utils import get_chunks from paddlespeech.t2s.exps.syn_utils import get_frontend from paddlespeech.t2s.exps.syn_utils import get_sentences from paddlespeech.t2s.exps.syn_utils import get_sess +from paddlespeech.t2s.exps.syn_utils import run_frontend from paddlespeech.t2s.utils import str2bool @@ -45,29 +46,33 @@ def ort_predict(args): # streaming acoustic model am_encoder_infer_sess = get_sess( - model_dir=args.inference_dir, - model_file=args.am + "_am_encoder_infer" + ".onnx", + model_path=str( + Path(args.inference_dir) / + (args.am + '_am_encoder_infer' + '.onnx')), device=args.device, - cpu_threads=args.cpu_threads) + cpu_threads=args.cpu_threads, + use_trt=args.use_trt) am_decoder_sess = get_sess( - model_dir=args.inference_dir, - model_file=args.am + "_am_decoder" + ".onnx", + model_path=str( + Path(args.inference_dir) / (args.am + '_am_decoder' + '.onnx')), device=args.device, - cpu_threads=args.cpu_threads) + cpu_threads=args.cpu_threads, + use_trt=args.use_trt) am_postnet_sess = get_sess( - model_dir=args.inference_dir, - model_file=args.am + "_am_postnet" + ".onnx", + model_path=str( + Path(args.inference_dir) / (args.am + '_am_postnet' + '.onnx')), device=args.device, - cpu_threads=args.cpu_threads) + cpu_threads=args.cpu_threads, + use_trt=args.use_trt) am_mu, am_std = np.load(args.am_stat) # vocoder voc_sess = get_sess( - model_dir=args.inference_dir, - model_file=args.voc + ".onnx", + model_path=str(Path(args.inference_dir) / (args.voc + '.onnx')), device=args.device, - cpu_threads=args.cpu_threads) + cpu_threads=args.cpu_threads, + use_trt=args.use_trt) # frontend warmup # Loading model cost 0.5+ seconds @@ -102,14 +107,13 @@ def ort_predict(args): for utt_id, sentence in sentences: with timer() as t: - if args.lang == 'zh': - input_ids = frontend.get_input_ids( - sentence, - merge_sentences=merge_sentences, - get_tone_ids=get_tone_ids) - phone_ids = input_ids["phone_ids"] - else: - print("lang should in be 'zh' here!") + frontend_dict = run_frontend( + frontend=frontend, + text=sentence, + merge_sentences=merge_sentences, + get_tone_ids=get_tone_ids, + lang=args.lang) + phone_ids = 
frontend_dict['phone_ids'] # merge_sentences=True here, so we only use the first item of phone_ids phone_ids = phone_ids[0].numpy() orig_hs = am_encoder_infer_sess.run( diff --git a/paddlespeech/t2s/exps/sentences_mix.txt b/paddlespeech/t2s/exps/sentences_mix.txt new file mode 100644 index 000000000..06e97d14a --- /dev/null +++ b/paddlespeech/t2s/exps/sentences_mix.txt @@ -0,0 +1,8 @@ +001 你好,欢迎使用 Paddle Speech 中英文混合 T T S 功能,开始你的合成之旅吧! +002 我们的声学模型使用了 Fast Speech Two, 声码器使用了 Parallel Wave GAN and Hifi GAN. +003 Paddle N L P 发布 ERNIE Tiny 全系列中文预训练小模型,快速提升预训练模型部署效率,通用信息抽取技术 U I E Tiny 系列模型全新升级,支持速度更快效果更好的 U I E 小模型。 +004 Paddle Speech 发布 P P A S R 流式语音识别系统、P P T T S 流式语音合成系统、P P V P R 全链路声纹识别系统。 +005 Paddle Bo Bo: 使用 Paddle Speech 的语音合成模块生成虚拟人的声音。 +006 热烈欢迎您在 Discussions 中提交问题,并在 Issues 中指出发现的 bug。此外,我们非常希望您参与到 Paddle Speech 的开发中! +007 我喜欢 eat apple, 你喜欢 drink milk。 +008 我们要去云南 team building, 非常非常 happy. \ No newline at end of file diff --git a/paddlespeech/t2s/exps/speedyspeech/normalize.py b/paddlespeech/t2s/exps/speedyspeech/normalize.py index 249a4d6d8..f29466f65 100644 --- a/paddlespeech/t2s/exps/speedyspeech/normalize.py +++ b/paddlespeech/t2s/exps/speedyspeech/normalize.py @@ -50,11 +50,6 @@ def main(): "--tones-dict", type=str, default=None, help="tone vocabulary file.") parser.add_argument( "--speaker-dict", type=str, default=None, help="speaker id map file.") - parser.add_argument( - "--verbose", - type=int, - default=1, - help="logging level. higher is more logging. (default=1)") parser.add_argument( "--use-relative-path", @@ -63,24 +58,6 @@ def main(): help="whether use relative path in metadata") args = parser.parse_args() - # set logger - if args.verbose > 1: - logging.basicConfig( - level=logging.DEBUG, - format="%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s" - ) - elif args.verbose > 0: - logging.basicConfig( - level=logging.INFO, - format="%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s" - ) - else: - logging.basicConfig( - level=logging.WARN, - format="%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s" - ) - logging.warning('Skip DEBUG/INFO messages') - dumpdir = Path(args.dumpdir).expanduser() # use absolute path dumpdir = dumpdir.resolve() diff --git a/paddlespeech/t2s/exps/speedyspeech/preprocess.py b/paddlespeech/t2s/exps/speedyspeech/preprocess.py index aa7608d6b..e4084c142 100644 --- a/paddlespeech/t2s/exps/speedyspeech/preprocess.py +++ b/paddlespeech/t2s/exps/speedyspeech/preprocess.py @@ -195,11 +195,6 @@ def main(): parser.add_argument("--config", type=str, help="fastspeech2 config file.") - parser.add_argument( - "--verbose", - type=int, - default=1, - help="logging level. higher is more logging. (default=1)") parser.add_argument( "--num-cpu", type=int, default=1, help="number of process.") @@ -230,10 +225,6 @@ def main(): with open(args.config, 'rt') as f: config = CfgNode(yaml.safe_load(f)) - if args.verbose > 1: - print(vars(args)) - print(config) - sentences, speaker_set = get_phn_dur(dur_file) merge_silence(sentences) diff --git a/paddlespeech/t2s/exps/stream_play_tts.py b/paddlespeech/t2s/exps/stream_play_tts.py new file mode 100644 index 000000000..4dcf4794f --- /dev/null +++ b/paddlespeech/t2s/exps/stream_play_tts.py @@ -0,0 +1,181 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# stream play TTS +# Before first execution, download and decompress the models in the execution directory +# wget https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_cnndecoder_csmsc_streaming_onnx_1.0.0.zip +# wget https://paddlespeech.bj.bcebos.com/Parakeet/released_models/mb_melgan/mb_melgan_csmsc_onnx_0.2.0.zip +# unzip fastspeech2_cnndecoder_csmsc_streaming_onnx_1.0.0.zip +# unzip mb_melgan_csmsc_onnx_0.2.0.zip +import math +import time + +import numpy as np +import onnxruntime as ort +import pyaudio +import soundfile as sf + +from paddlespeech.server.utils.audio_process import float2pcm +from paddlespeech.server.utils.util import denorm +from paddlespeech.server.utils.util import get_chunks +from paddlespeech.t2s.frontend.zh_frontend import Frontend + +voc_block = 36 +voc_pad = 14 +am_block = 72 +am_pad = 12 +voc_upsample = 300 + +phones_dict = "fastspeech2_cnndecoder_csmsc_streaming_onnx_1.0.0/phone_id_map.txt" +frontend = Frontend(phone_vocab_path=phones_dict, tone_vocab_path=None) + +am_stat_path = "fastspeech2_cnndecoder_csmsc_streaming_onnx_1.0.0/speech_stats.npy" +am_mu, am_std = np.load(am_stat_path) + +# 模型路径 +onnx_am_encoder = "fastspeech2_cnndecoder_csmsc_streaming_onnx_1.0.0/fastspeech2_csmsc_am_encoder_infer.onnx" +onnx_am_decoder = "fastspeech2_cnndecoder_csmsc_streaming_onnx_1.0.0/fastspeech2_csmsc_am_decoder.onnx" +onnx_am_postnet = "fastspeech2_cnndecoder_csmsc_streaming_onnx_1.0.0/fastspeech2_csmsc_am_postnet.onnx" +onnx_voc_melgan = "mb_melgan_csmsc_onnx_0.2.0/mb_melgan_csmsc.onnx" + +# 用CPU推理 +providers = ['CPUExecutionProvider'] + +# 配置ort session +sess_options = ort.SessionOptions() + +# 创建session +am_encoder_infer_sess = ort.InferenceSession( + onnx_am_encoder, providers=providers, sess_options=sess_options) +am_decoder_sess = ort.InferenceSession( + onnx_am_decoder, providers=providers, sess_options=sess_options) +am_postnet_sess = ort.InferenceSession( + onnx_am_postnet, providers=providers, sess_options=sess_options) +voc_melgan_sess = ort.InferenceSession( + onnx_voc_melgan, providers=providers, sess_options=sess_options) + + +def depadding(data, chunk_num, chunk_id, block, pad, upsample): + """ + Streaming inference removes the result of pad inference + """ + front_pad = min(chunk_id * block, pad) + # first chunk + if chunk_id == 0: + data = data[:block * upsample] + # last chunk + elif chunk_id == chunk_num - 1: + data = data[front_pad * upsample:] + # middle chunk + else: + data = data[front_pad * upsample:(front_pad + block) * upsample] + + return data + + +def inference_stream(text): + input_ids = frontend.get_input_ids( + text, merge_sentences=False, get_tone_ids=False) + phone_ids = input_ids["phone_ids"] + for i in range(len(phone_ids)): + part_phone_ids = phone_ids[i].numpy() + voc_chunk_id = 0 + + orig_hs = am_encoder_infer_sess.run( + None, input_feed={'text': part_phone_ids}) + orig_hs = orig_hs[0] + + # streaming voc chunk info + mel_len = orig_hs.shape[1] + voc_chunk_num = math.ceil(mel_len / voc_block) + start = 0 + end = min(voc_block + voc_pad, mel_len) + + # streaming am + hss = get_chunks(orig_hs, 
am_block, am_pad, "am") + am_chunk_num = len(hss) + for i, hs in enumerate(hss): + am_decoder_output = am_decoder_sess.run(None, input_feed={'xs': hs}) + am_postnet_output = am_postnet_sess.run( + None, + input_feed={ + 'xs': np.transpose(am_decoder_output[0], (0, 2, 1)) + }) + am_output_data = am_decoder_output + np.transpose( + am_postnet_output[0], (0, 2, 1)) + normalized_mel = am_output_data[0][0] + + sub_mel = denorm(normalized_mel, am_mu, am_std) + sub_mel = depadding(sub_mel, am_chunk_num, i, am_block, am_pad, 1) + + if i == 0: + mel_streaming = sub_mel + else: + mel_streaming = np.concatenate((mel_streaming, sub_mel), axis=0) + + # streaming voc + # 当流式AM推理的mel帧数大于流式voc推理的chunk size,开始进行流式voc 推理 + while (mel_streaming.shape[0] >= end and + voc_chunk_id < voc_chunk_num): + voc_chunk = mel_streaming[start:end, :] + + sub_wav = voc_melgan_sess.run( + output_names=None, input_feed={'logmel': voc_chunk}) + sub_wav = depadding(sub_wav[0], voc_chunk_num, voc_chunk_id, + voc_block, voc_pad, voc_upsample) + + yield sub_wav + + voc_chunk_id += 1 + start = max(0, voc_chunk_id * voc_block - voc_pad) + end = min((voc_chunk_id + 1) * voc_block + voc_pad, mel_len) + + +if __name__ == '__main__': + + text = "欢迎使用飞桨语音合成系统,测试一下合成效果。" + # warm up + # onnxruntime 第一次时间会长一些,建议先 warmup 一下 + for sub_wav in inference_stream(text="哈哈哈哈"): + continue + + # pyaudio 播放 + p = pyaudio.PyAudio() + stream = p.open( + format=p.get_format_from_width(2), # int16 + channels=1, + rate=24000, + output=True) + + # 计时 + wavs = [] + t1 = time.time() + for sub_wav in inference_stream(text): + print("响应时间:", time.time() - t1) + t1 = time.time() + wavs.append(sub_wav.flatten()) + # float32 to int16 + wav = float2pcm(sub_wav) + # to bytes + wav_bytes = wav.tobytes() + stream.write(wav_bytes) + + # 关闭 pyaudio 播放器 + stream.stop_stream() + stream.close() + p.terminate() + + # 流式合成的结果导出 + wav = np.concatenate(wavs) + print(wav.shape) + sf.write("demo_stream.wav", data=wav, samplerate=24000) diff --git a/paddlespeech/t2s/exps/syn_utils.py b/paddlespeech/t2s/exps/syn_utils.py index 6b9f41a6b..127e1a3ba 100644 --- a/paddlespeech/t2s/exps/syn_utils.py +++ b/paddlespeech/t2s/exps/syn_utils.py @@ -29,9 +29,12 @@ from yacs.config import CfgNode from paddlespeech.t2s.datasets.data_table import DataTable from paddlespeech.t2s.frontend import English +from paddlespeech.t2s.frontend.mix_frontend import MixFrontend from paddlespeech.t2s.frontend.zh_frontend import Frontend from paddlespeech.t2s.modules.normalizer import ZScore from paddlespeech.utils.dynamic_import import dynamic_import +# remove [W:onnxruntime: xxx] from ort +ort.set_default_logger_severity(3) model_alias = { # acoustic model @@ -68,6 +71,10 @@ model_alias = { "paddlespeech.t2s.models.wavernn:WaveRNN", "wavernn_inference": "paddlespeech.t2s.models.wavernn:WaveRNNInference", + "erniesat": + "paddlespeech.t2s.models.ernie_sat:ErnieSAT", + "erniesat_inference": + "paddlespeech.t2s.models.ernie_sat:ErnieSATInference", } @@ -98,6 +105,8 @@ def get_sentences(text_file: Optional[os.PathLike], lang: str='zh'): sentence = "".join(items[1:]) elif lang == 'en': sentence = " ".join(items[1:]) + elif lang == 'mix': + sentence = " ".join(items[1:]) sentences.append((utt_id, sentence)) return sentences @@ -109,9 +118,11 @@ def get_test_dataset(test_metadata: List[Dict[str, Any]], # model: {model_name}_{dataset} am_name = am[:am.rindex('_')] am_dataset = am[am.rindex('_') + 1:] + converters = {} if am_name == 'fastspeech2': fields = ["utt_id", "text"] - if am_dataset in {"aishell3", 
"vctk"} and speaker_dict is not None: + if am_dataset in {"aishell3", "vctk", + "mix"} and speaker_dict is not None: print("multiple speaker fastspeech2!") fields += ["spk_id"] elif voice_cloning: @@ -126,8 +137,17 @@ def get_test_dataset(test_metadata: List[Dict[str, Any]], if voice_cloning: print("voice cloning!") fields += ["spk_emb"] + elif am_name == 'erniesat': + fields = [ + "utt_id", "text", "text_lengths", "speech", "speech_lengths", + "align_start", "align_end" + ] + converters = {"speech": np.load} + else: + print("wrong am, please input right am!!!") - test_dataset = DataTable(data=test_metadata, fields=fields) + test_dataset = DataTable( + data=test_metadata, fields=fields, converters=converters) return test_dataset @@ -140,48 +160,73 @@ def get_frontend(lang: str='zh', phone_vocab_path=phones_dict, tone_vocab_path=tones_dict) elif lang == 'en': frontend = English(phone_vocab_path=phones_dict) + elif lang == 'mix': + frontend = MixFrontend( + phone_vocab_path=phones_dict, tone_vocab_path=tones_dict) else: print("wrong lang!") - print("frontend done!") return frontend +def run_frontend(frontend: object, + text: str, + merge_sentences: bool=False, + get_tone_ids: bool=False, + lang: str='zh', + to_tensor: bool=True): + outs = dict() + if lang == 'zh': + input_ids = frontend.get_input_ids( + text, + merge_sentences=merge_sentences, + get_tone_ids=get_tone_ids, + to_tensor=to_tensor) + phone_ids = input_ids["phone_ids"] + if get_tone_ids: + tone_ids = input_ids["tone_ids"] + outs.update({'tone_ids': tone_ids}) + elif lang == 'en': + input_ids = frontend.get_input_ids( + text, merge_sentences=merge_sentences, to_tensor=to_tensor) + phone_ids = input_ids["phone_ids"] + elif lang == 'mix': + input_ids = frontend.get_input_ids( + text, merge_sentences=merge_sentences, to_tensor=to_tensor) + phone_ids = input_ids["phone_ids"] + else: + print("lang should in {'zh', 'en', 'mix'}!") + outs.update({'phone_ids': phone_ids}) + return outs + + # dygraph -def get_am_inference( - am: str='fastspeech2_csmsc', - am_config: CfgNode=None, - am_ckpt: Optional[os.PathLike]=None, - am_stat: Optional[os.PathLike]=None, - phones_dict: Optional[os.PathLike]=None, - tones_dict: Optional[os.PathLike]=None, - speaker_dict: Optional[os.PathLike]=None, ): +def get_am_inference(am: str='fastspeech2_csmsc', + am_config: CfgNode=None, + am_ckpt: Optional[os.PathLike]=None, + am_stat: Optional[os.PathLike]=None, + phones_dict: Optional[os.PathLike]=None, + tones_dict: Optional[os.PathLike]=None, + speaker_dict: Optional[os.PathLike]=None, + return_am: bool=False): with open(phones_dict, "r") as f: phn_id = [line.strip().split() for line in f.readlines()] vocab_size = len(phn_id) - print("vocab_size:", vocab_size) - tone_size = None if tones_dict is not None: with open(tones_dict, "r") as f: tone_id = [line.strip().split() for line in f.readlines()] tone_size = len(tone_id) - print("tone_size:", tone_size) - spk_num = None if speaker_dict is not None: with open(speaker_dict, 'rt') as f: spk_id = [line.strip().split() for line in f.readlines()] spk_num = len(spk_id) - print("spk_num:", spk_num) - odim = am_config.n_mels # model: {model_name}_{dataset} am_name = am[:am.rindex('_')] am_dataset = am[am.rindex('_') + 1:] - am_class = dynamic_import(am_name, model_alias) am_inference_class = dynamic_import(am_name + '_inference', model_alias) - if am_name == 'fastspeech2': am = am_class( idim=vocab_size, odim=odim, spk_num=spk_num, **am_config["model"]) @@ -193,6 +238,10 @@ def get_am_inference( **am_config["model"]) 
elif am_name == 'tacotron2': am = am_class(idim=vocab_size, odim=odim, **am_config["model"]) + elif am_name == 'erniesat': + am = am_class(idim=vocab_size, odim=odim, **am_config["model"]) + else: + print("wrong am, please input right am!!!") am.set_state_dict(paddle.load(am_ckpt)["main_params"]) am.eval() @@ -202,8 +251,10 @@ def get_am_inference( am_normalizer = ZScore(am_mu, am_std) am_inference = am_inference_class(am_normalizer, am) am_inference.eval() - print("acoustic model done!") - return am_inference + if return_am: + return am_inference, am + else: + return am_inference def get_voc_inference( @@ -231,7 +282,6 @@ def get_voc_inference( voc_normalizer = ZScore(voc_mu, voc_std) voc_inference = voc_inference_class(voc_normalizer, voc) voc_inference.eval() - print("voc done!") return voc_inference @@ -244,7 +294,8 @@ def am_to_static(am_inference, am_name = am[:am.rindex('_')] am_dataset = am[am.rindex('_') + 1:] if am_name == 'fastspeech2': - if am_dataset in {"aishell3", "vctk"} and speaker_dict is not None: + if am_dataset in {"aishell3", "vctk", "mix" + } and speaker_dict is not None: am_inference = jit.to_static( am_inference, input_spec=[ @@ -256,7 +307,8 @@ def am_to_static(am_inference, am_inference, input_spec=[InputSpec([-1], dtype=paddle.int64)]) elif am_name == 'speedyspeech': - if am_dataset in {"aishell3", "vctk"} and speaker_dict is not None: + if am_dataset in {"aishell3", "vctk", "mix" + } and speaker_dict is not None: am_inference = jit.to_static( am_inference, input_spec=[ @@ -313,9 +365,9 @@ def get_predictor(model_dir: Optional[os.PathLike]=None, def get_am_output( input: str, - am_predictor, - am, - frontend, + am_predictor: paddle.nn.Layer, + am: str, + frontend: object, lang: str='zh', merge_sentences: bool=True, speaker_dict: Optional[os.PathLike]=None, @@ -323,26 +375,23 @@ def get_am_output( am_name = am[:am.rindex('_')] am_dataset = am[am.rindex('_') + 1:] am_input_names = am_predictor.get_input_names() - get_tone_ids = False get_spk_id = False + get_tone_ids = False if am_name == 'speedyspeech': get_tone_ids = True - if am_dataset in {"aishell3", "vctk"} and speaker_dict: + if am_dataset in {"aishell3", "vctk", "mix"} and speaker_dict: get_spk_id = True spk_id = np.array([spk_id]) - if lang == 'zh': - input_ids = frontend.get_input_ids( - input, merge_sentences=merge_sentences, get_tone_ids=get_tone_ids) - phone_ids = input_ids["phone_ids"] - elif lang == 'en': - input_ids = frontend.get_input_ids( - input, merge_sentences=merge_sentences) - phone_ids = input_ids["phone_ids"] - else: - print("lang should in {'zh', 'en'}!") + + frontend_dict = run_frontend( + frontend=frontend, + text=input, + merge_sentences=merge_sentences, + get_tone_ids=get_tone_ids, + lang=lang) if get_tone_ids: - tone_ids = input_ids["tone_ids"] + tone_ids = frontend_dict['tone_ids'] tones = tone_ids[0].numpy() tones_handle = am_predictor.get_input_handle(am_input_names[1]) tones_handle.reshape(tones.shape) @@ -351,6 +400,7 @@ def get_am_output( spk_id_handle = am_predictor.get_input_handle(am_input_names[1]) spk_id_handle.reshape(spk_id.shape) spk_id_handle.copy_from_cpu(spk_id) + phone_ids = frontend_dict['phone_ids'] phones = phone_ids[0].numpy() phones_handle = am_predictor.get_input_handle(am_input_names[0]) phones_handle.reshape(phones.shape) @@ -399,13 +449,13 @@ def get_streaming_am_output(input: str, lang: str='zh', merge_sentences: bool=True): get_tone_ids = False - if lang == 'zh': - input_ids = frontend.get_input_ids( - input, merge_sentences=merge_sentences, 
get_tone_ids=get_tone_ids) - phone_ids = input_ids["phone_ids"] - else: - print("lang should be 'zh' here!") - + frontend_dict = run_frontend( + frontend=frontend, + text=input, + merge_sentences=merge_sentences, + get_tone_ids=get_tone_ids, + lang=lang) + phone_ids = frontend_dict['phone_ids'] phones = phone_ids[0].numpy() am_encoder_infer_output = get_am_sublayer_output( am_encoder_infer_predictor, input=phones) @@ -422,26 +472,25 @@ def get_streaming_am_output(input: str, # onnx -def get_sess(model_dir: Optional[os.PathLike]=None, - model_file: Optional[os.PathLike]=None, +def get_sess(model_path: Optional[os.PathLike], device: str='cpu', cpu_threads: int=1, use_trt: bool=False): - - model_dir = str(Path(model_dir) / model_file) sess_options = ort.SessionOptions() sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL sess_options.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL - - if device == "gpu": + if 'gpu' in device.lower(): + device_id = int(device.split(':')[1]) if len( + device.split(':')) == 2 else 0 # fastspeech2/mb_melgan can't use trt now! if use_trt: - providers = ['TensorrtExecutionProvider'] + provider_name = 'TensorrtExecutionProvider' else: - providers = ['CUDAExecutionProvider'] - elif device == "cpu": + provider_name = 'CUDAExecutionProvider' + providers = [(provider_name, {'device_id': device_id})] + elif device.lower() == 'cpu': providers = ['CPUExecutionProvider'] sess_options.intra_op_num_threads = cpu_threads sess = ort.InferenceSession( - model_dir, providers=providers, sess_options=sess_options) + model_path, providers=providers, sess_options=sess_options) return sess diff --git a/paddlespeech/t2s/exps/synthesize.py b/paddlespeech/t2s/exps/synthesize.py index 9ddab726e..a8e18150e 100644 --- a/paddlespeech/t2s/exps/synthesize.py +++ b/paddlespeech/t2s/exps/synthesize.py @@ -136,7 +136,7 @@ def parse_args(): choices=[ 'speedyspeech_csmsc', 'fastspeech2_csmsc', 'fastspeech2_ljspeech', 'fastspeech2_aishell3', 'fastspeech2_vctk', 'tacotron2_csmsc', - 'tacotron2_ljspeech', 'tacotron2_aishell3' + 'tacotron2_ljspeech', 'tacotron2_aishell3', 'fastspeech2_mix' ], help='Choose acoustic model type of tts task.') parser.add_argument( diff --git a/paddlespeech/t2s/exps/synthesize_e2e.py b/paddlespeech/t2s/exps/synthesize_e2e.py index 28657eb27..9ce8286fb 100644 --- a/paddlespeech/t2s/exps/synthesize_e2e.py +++ b/paddlespeech/t2s/exps/synthesize_e2e.py @@ -25,6 +25,7 @@ from paddlespeech.t2s.exps.syn_utils import get_am_inference from paddlespeech.t2s.exps.syn_utils import get_frontend from paddlespeech.t2s.exps.syn_utils import get_sentences from paddlespeech.t2s.exps.syn_utils import get_voc_inference +from paddlespeech.t2s.exps.syn_utils import run_frontend from paddlespeech.t2s.exps.syn_utils import voc_to_static @@ -49,6 +50,7 @@ def evaluate(args): lang=args.lang, phones_dict=args.phones_dict, tones_dict=args.tones_dict) + print("frontend done!") # acoustic model am_name = args.am[:args.am.rindex('_')] @@ -62,13 +64,14 @@ def evaluate(args): phones_dict=args.phones_dict, tones_dict=args.tones_dict, speaker_dict=args.speaker_dict) - + print("acoustic model done!") # vocoder voc_inference = get_voc_inference( voc=args.voc, voc_config=voc_config, voc_ckpt=args.voc_ckpt, voc_stat=args.voc_stat) + print("voc done!") # whether dygraph to static if args.inference_dir: @@ -78,7 +81,6 @@ def evaluate(args): am=args.am, inference_dir=args.inference_dir, speaker_dict=args.speaker_dict) - # vocoder voc_inference = voc_to_static( 
voc_inference=voc_inference, @@ -101,20 +103,13 @@ def evaluate(args): T = 0 for utt_id, sentence in sentences: with timer() as t: - if args.lang == 'zh': - input_ids = frontend.get_input_ids( - sentence, - merge_sentences=merge_sentences, - get_tone_ids=get_tone_ids) - phone_ids = input_ids["phone_ids"] - if get_tone_ids: - tone_ids = input_ids["tone_ids"] - elif args.lang == 'en': - input_ids = frontend.get_input_ids( - sentence, merge_sentences=merge_sentences) - phone_ids = input_ids["phone_ids"] - else: - print("lang should in {'zh', 'en'}!") + frontend_dict = run_frontend( + frontend=frontend, + text=sentence, + merge_sentences=merge_sentences, + get_tone_ids=get_tone_ids, + lang=args.lang) + phone_ids = frontend_dict['phone_ids'] with paddle.no_grad(): flags = 0 for i in range(len(phone_ids)): @@ -122,14 +117,14 @@ def evaluate(args): # acoustic model if am_name == 'fastspeech2': # multi speaker - if am_dataset in {"aishell3", "vctk"}: + if am_dataset in {"aishell3", "vctk", "mix"}: spk_id = paddle.to_tensor(args.spk_id) mel = am_inference(part_phone_ids, spk_id) else: mel = am_inference(part_phone_ids) elif am_name == 'speedyspeech': - part_tone_ids = tone_ids[i] - if am_dataset in {"aishell3", "vctk"}: + part_tone_ids = frontend_dict['tone_ids'][i] + if am_dataset in {"aishell3", "vctk", "mix"}: spk_id = paddle.to_tensor(args.spk_id) mel = am_inference(part_phone_ids, part_tone_ids, spk_id) @@ -170,7 +165,7 @@ def parse_args(): choices=[ 'speedyspeech_csmsc', 'speedyspeech_aishell3', 'fastspeech2_csmsc', 'fastspeech2_ljspeech', 'fastspeech2_aishell3', 'fastspeech2_vctk', - 'tacotron2_csmsc', 'tacotron2_ljspeech' + 'tacotron2_csmsc', 'tacotron2_ljspeech', 'fastspeech2_mix' ], help='Choose acoustic model type of tts task.') parser.add_argument( @@ -231,7 +226,7 @@ def parse_args(): '--lang', type=str, default='zh', - help='Choose model language. zh or en') + help='Choose model language. 
zh or en or mix') parser.add_argument( "--inference_dir", diff --git a/paddlespeech/t2s/exps/synthesize_streaming.py b/paddlespeech/t2s/exps/synthesize_streaming.py index d8b23f1ad..6f86cc2b2 100644 --- a/paddlespeech/t2s/exps/synthesize_streaming.py +++ b/paddlespeech/t2s/exps/synthesize_streaming.py @@ -30,6 +30,7 @@ from paddlespeech.t2s.exps.syn_utils import get_frontend from paddlespeech.t2s.exps.syn_utils import get_sentences from paddlespeech.t2s.exps.syn_utils import get_voc_inference from paddlespeech.t2s.exps.syn_utils import model_alias +from paddlespeech.t2s.exps.syn_utils import run_frontend from paddlespeech.t2s.exps.syn_utils import voc_to_static from paddlespeech.t2s.utils import str2bool from paddlespeech.utils.dynamic_import import dynamic_import @@ -138,15 +139,13 @@ def evaluate(args): for utt_id, sentence in sentences: with timer() as t: - if args.lang == 'zh': - input_ids = frontend.get_input_ids( - sentence, - merge_sentences=merge_sentences, - get_tone_ids=get_tone_ids) - - phone_ids = input_ids["phone_ids"] - else: - print("lang should be 'zh' here!") + frontend_dict = run_frontend( + frontend=frontend, + text=sentence, + merge_sentences=merge_sentences, + get_tone_ids=get_tone_ids, + lang=args.lang) + phone_ids = frontend_dict['phone_ids'] # merge_sentences=True here, so we only use the first item of phone_ids phone_ids = phone_ids[0] with paddle.no_grad(): diff --git a/paddlespeech/t2s/exps/tacotron2/preprocess.py b/paddlespeech/t2s/exps/tacotron2/preprocess.py index 6137da7f1..c27b9769b 100644 --- a/paddlespeech/t2s/exps/tacotron2/preprocess.py +++ b/paddlespeech/t2s/exps/tacotron2/preprocess.py @@ -184,11 +184,6 @@ def main(): parser.add_argument("--config", type=str, help="fastspeech2 config file.") - parser.add_argument( - "--verbose", - type=int, - default=1, - help="logging level. higher is more logging. (default=1)") parser.add_argument( "--num-cpu", type=int, default=1, help="number of process.") @@ -223,10 +218,6 @@ def main(): with open(args.config, 'rt') as f: config = CfgNode(yaml.safe_load(f)) - if args.verbose > 1: - print(vars(args)) - print(config) - sentences, speaker_set = get_phn_dur(dur_file) merge_silence(sentences) diff --git a/paddlespeech/t2s/exps/transformer_tts/normalize.py b/paddlespeech/t2s/exps/transformer_tts/normalize.py index 87e975b88..e5f052c60 100644 --- a/paddlespeech/t2s/exps/transformer_tts/normalize.py +++ b/paddlespeech/t2s/exps/transformer_tts/normalize.py @@ -51,30 +51,8 @@ def main(): "--phones-dict", type=str, default=None, help="phone vocabulary file.") parser.add_argument( "--speaker-dict", type=str, default=None, help="speaker id map file.") - parser.add_argument( - "--verbose", - type=int, - default=1, - help="logging level. higher is more logging. 
(default=1)") - args = parser.parse_args() - # set logger - if args.verbose > 1: - logging.basicConfig( - level=logging.DEBUG, - format="%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s" - ) - elif args.verbose > 0: - logging.basicConfig( - level=logging.INFO, - format="%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s" - ) - else: - logging.basicConfig( - level=logging.WARN, - format="%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s" - ) - logging.warning('Skip DEBUG/INFO messages') + args = parser.parse_args() # check directory existence dumpdir = Path(args.dumpdir).resolve() diff --git a/paddlespeech/t2s/exps/transformer_tts/preprocess.py b/paddlespeech/t2s/exps/transformer_tts/preprocess.py index 28ca3de6e..2ebd5ecc2 100644 --- a/paddlespeech/t2s/exps/transformer_tts/preprocess.py +++ b/paddlespeech/t2s/exps/transformer_tts/preprocess.py @@ -186,11 +186,6 @@ def main(): type=str, help="yaml format configuration file.") - parser.add_argument( - "--verbose", - type=int, - default=1, - help="logging level. higher is more logging. (default=1)") parser.add_argument( "--num-cpu", type=int, default=1, help="number of process.") @@ -210,10 +205,6 @@ def main(): _C = Configuration(_C) config = _C.clone() - if args.verbose > 1: - print(vars(args)) - print(config) - phone_id_map_path = dumpdir / "phone_id_map.txt" speaker_id_map_path = dumpdir / "speaker_id_map.txt" diff --git a/paddlespeech/t2s/exps/vits/normalize.py b/paddlespeech/t2s/exps/vits/normalize.py index 6fc8adb06..5881ae95c 100644 --- a/paddlespeech/t2s/exps/vits/normalize.py +++ b/paddlespeech/t2s/exps/vits/normalize.py @@ -16,6 +16,7 @@ import argparse import logging from operator import itemgetter from pathlib import Path +from typing import List import jsonlines import numpy as np @@ -23,6 +24,50 @@ from sklearn.preprocessing import StandardScaler from tqdm import tqdm from paddlespeech.t2s.datasets.data_table import DataTable +from paddlespeech.t2s.utils import str2bool + +INITIALS = [ + 'b', 'p', 'm', 'f', 'd', 't', 'n', 'l', 'g', 'k', 'h', 'zh', 'ch', 'sh', + 'r', 'z', 'c', 's', 'j', 'q', 'x' +] +INITIALS += ['y', 'w', 'sp', 'spl', 'spn', 'sil'] + + +def intersperse(lst, item): + result = [item] * (len(lst) * 2 + 1) + result[1::2] = lst + return result + + +def insert_after_character(lst, item): + result = [item] + for phone in lst: + result.append(phone) + if phone not in INITIALS: + # finals has tones + assert phone[-1] in "12345" + result.append(item) + return result + + +def add_blank(phones: List[str], + filed: str="character", + blank_token: str=""): + if filed == "phone": + """ + add blank after phones + input: ["n", "i3", "h", "ao3", "m", "a5"] + output: ["n", "", "i3", "", "h", "", "ao3", "", "m", "", "a5"] + """ + phones = intersperse(phones, blank_token) + elif filed == "character": + """ + add blank after characters + input: ["n", "i3", "h", "ao3"] + output: ["n", "i3", "", "h", "ao3", "", "m", "a5"] + """ + phones = insert_after_character(phones, blank_token) + return phones def main(): @@ -58,29 +103,12 @@ def main(): parser.add_argument( "--speaker-dict", type=str, default=None, help="speaker id map file.") parser.add_argument( - "--verbose", - type=int, - default=1, - help="logging level. higher is more logging. 
(default=1)") - args = parser.parse_args() + "--add-blank", + type=str2bool, + default=True, + help="whether to add blank between phones") - # set logger - if args.verbose > 1: - logging.basicConfig( - level=logging.DEBUG, - format="%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s" - ) - elif args.verbose > 0: - logging.basicConfig( - level=logging.INFO, - format="%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s" - ) - else: - logging.basicConfig( - level=logging.WARN, - format="%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s" - ) - logging.warning('Skip DEBUG/INFO messages') + args = parser.parse_args() dumpdir = Path(args.dumpdir).expanduser() # use absolute path @@ -135,13 +163,19 @@ def main(): else: wav_path = wave - phone_ids = [vocab_phones[p] for p in item['phones']] + phones = item['phones'] + text_lengths = item['text_lengths'] + if args.add_blank: + phones = add_blank(phones, filed="character") + text_lengths = len(phones) + + phone_ids = [vocab_phones[p] for p in phones] spk_id = vocab_speaker[item["speaker"]] record = { "utt_id": item['utt_id'], "text": phone_ids, - "text_lengths": item['text_lengths'], + "text_lengths": text_lengths, 'feats': str(feats_path), "feats_lengths": item['feats_lengths'], "wave": str(wav_path), diff --git a/paddlespeech/t2s/exps/vits/preprocess.py b/paddlespeech/t2s/exps/vits/preprocess.py index 6aa139fb5..f89ab356f 100644 --- a/paddlespeech/t2s/exps/vits/preprocess.py +++ b/paddlespeech/t2s/exps/vits/preprocess.py @@ -197,11 +197,6 @@ def main(): parser.add_argument("--config", type=str, help="fastspeech2 config file.") - parser.add_argument( - "--verbose", - type=int, - default=1, - help="logging level. higher is more logging. (default=1)") parser.add_argument( "--num-cpu", type=int, default=1, help="number of process.") @@ -236,10 +231,6 @@ def main(): with open(args.config, 'rt') as f: config = CfgNode(yaml.safe_load(f)) - if args.verbose > 1: - print(vars(args)) - print(config) - sentences, speaker_set = get_phn_dur(dur_file) merge_silence(sentences) diff --git a/paddlespeech/t2s/exps/vits/synthesize_e2e.py b/paddlespeech/t2s/exps/vits/synthesize_e2e.py index c82e5c039..33a413751 100644 --- a/paddlespeech/t2s/exps/vits/synthesize_e2e.py +++ b/paddlespeech/t2s/exps/vits/synthesize_e2e.py @@ -23,6 +23,7 @@ from yacs.config import CfgNode from paddlespeech.t2s.exps.syn_utils import get_frontend from paddlespeech.t2s.exps.syn_utils import get_sentences from paddlespeech.t2s.models.vits import VITS +from paddlespeech.t2s.utils import str2bool def evaluate(args): @@ -55,6 +56,7 @@ def evaluate(args): output_dir = Path(args.output_dir) output_dir.mkdir(parents=True, exist_ok=True) merge_sentences = False + add_blank = args.add_blank N = 0 T = 0 @@ -62,7 +64,9 @@ def evaluate(args): with timer() as t: if args.lang == 'zh': input_ids = frontend.get_input_ids( - sentence, merge_sentences=merge_sentences) + sentence, + merge_sentences=merge_sentences, + add_blank=add_blank) phone_ids = input_ids["phone_ids"] elif args.lang == 'en': input_ids = frontend.get_input_ids( @@ -125,6 +129,12 @@ def parse_args(): help="text to synthesize, a 'utt_id sentence' pair per line.") parser.add_argument("--output_dir", type=str, help="output dir.") + parser.add_argument( + "--add-blank", + type=str2bool, + default=True, + help="whether to add blank between phones") + args = parser.parse_args() return args diff --git a/paddlespeech/t2s/exps/vits/train.py b/paddlespeech/t2s/exps/vits/train.py index dbda8b717..1a68d1326 100644 --- 
a/paddlespeech/t2s/exps/vits/train.py +++ b/paddlespeech/t2s/exps/vits/train.py @@ -211,13 +211,18 @@ def train_sp(args, config): generator_first=config.generator_first, output_dir=output_dir) - trainer = Trainer(updater, (config.max_epoch, 'epoch'), output_dir) + trainer = Trainer( + updater, + stop_trigger=(config.train_max_steps, "iteration"), + out=output_dir) if dist.get_rank() == 0: - trainer.extend(evaluator, trigger=(1, "epoch")) - trainer.extend(VisualDL(output_dir), trigger=(1, "iteration")) + trainer.extend( + evaluator, trigger=(config.eval_interval_steps, 'iteration')) + trainer.extend(VisualDL(output_dir), trigger=(1, 'iteration')) trainer.extend( - Snapshot(max_size=config.num_snapshots), trigger=(1, 'epoch')) + Snapshot(max_size=config.num_snapshots), + trigger=(config.save_interval_steps, 'iteration')) print("Trainer Done!") trainer.run() diff --git a/paddlespeech/t2s/exps/waveflow/preprocess.py b/paddlespeech/t2s/exps/waveflow/preprocess.py index ef3a29175..c7034aeab 100644 --- a/paddlespeech/t2s/exps/waveflow/preprocess.py +++ b/paddlespeech/t2s/exps/waveflow/preprocess.py @@ -143,8 +143,6 @@ if __name__ == "__main__": nargs=argparse.REMAINDER, help="options to overwrite --config file and the default config, passing in KEY VALUE pairs" ) - parser.add_argument( - "-v", "--verbose", action="store_true", help="print msg") config = get_cfg_defaults() args = parser.parse_args() @@ -153,8 +151,5 @@ if __name__ == "__main__": if args.opts: config.merge_from_list(args.opts) config.freeze() - if args.verbose: - print(config.data) - print(args) create_dataset(config.data, args.input, args.output) diff --git a/paddlespeech/t2s/exps/waveflow/synthesize.py b/paddlespeech/t2s/exps/waveflow/synthesize.py index 53715b01e..a3190c6e5 100644 --- a/paddlespeech/t2s/exps/waveflow/synthesize.py +++ b/paddlespeech/t2s/exps/waveflow/synthesize.py @@ -72,8 +72,6 @@ if __name__ == "__main__": nargs=argparse.REMAINDER, help="options to overwrite --config file and the default config, passing in KEY VALUE pairs" ) - parser.add_argument( - "-v", "--verbose", action="store_true", help="print msg") args = parser.parse_args() if args.config: diff --git a/paddlespeech/t2s/frontend/g2pw/__init__.py b/paddlespeech/t2s/frontend/g2pw/__init__.py new file mode 100644 index 000000000..6e1ee0db8 --- /dev/null +++ b/paddlespeech/t2s/frontend/g2pw/__init__.py @@ -0,0 +1,2 @@ +from paddlespeech.t2s.frontend.g2pw.onnx_api import G2PWOnnxConverter + diff --git a/paddlespeech/t2s/frontend/g2pw/dataset.py b/paddlespeech/t2s/frontend/g2pw/dataset.py new file mode 100644 index 000000000..ab715dc36 --- /dev/null +++ b/paddlespeech/t2s/frontend/g2pw/dataset.py @@ -0,0 +1,164 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
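The add_blank helpers introduced above for VITS normalization are easiest to read with a concrete input. A small sketch of their behaviour; note the parameter really is spelled "filed" in the patch, and the blank symbol used here is only an illustrative choice:

# Sketch only: blank insertion applied when --add-blank is enabled for VITS.
from paddlespeech.t2s.exps.vits.normalize import add_blank
from paddlespeech.t2s.exps.vits.normalize import intersperse

phones = ["n", "i3", "h", "ao3", "m", "a5"]

# one blank per character (an initial and its final stay together)
print(add_blank(phones, filed="character", blank_token="<blank>"))
# ['<blank>', 'n', 'i3', '<blank>', 'h', 'ao3', '<blank>', 'm', 'a5', '<blank>']

# one blank between every phone
print(intersperse(phones, "<blank>"))
# ['<blank>', 'n', '<blank>', 'i3', '<blank>', 'h', '<blank>', 'ao3', '<blank>', 'm', '<blank>', 'a5', '<blank>']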
+""" +Credits + This code is modified from https://github.com/GitYCC/g2pW +""" +import numpy as np + +from paddlespeech.t2s.frontend.g2pw.utils import tokenize_and_map + +ANCHOR_CHAR = '▁' + + +def prepare_onnx_input(tokenizer, + labels, + char2phonemes, + chars, + texts, + query_ids, + phonemes=None, + pos_tags=None, + use_mask=False, + use_char_phoneme=False, + use_pos=False, + window_size=None, + max_len=512): + if window_size is not None: + truncated_texts, truncated_query_ids = _truncate_texts(window_size, + texts, query_ids) + + input_ids = [] + token_type_ids = [] + attention_masks = [] + phoneme_masks = [] + char_ids = [] + position_ids = [] + + for idx in range(len(texts)): + text = (truncated_texts if window_size else texts)[idx].lower() + query_id = (truncated_query_ids if window_size else query_ids)[idx] + + try: + tokens, text2token, token2text = tokenize_and_map(tokenizer, text) + except Exception: + print(f'warning: text "{text}" is invalid') + return {} + + text, query_id, tokens, text2token, token2text = _truncate( + max_len, text, query_id, tokens, text2token, token2text) + + processed_tokens = ['[CLS]'] + tokens + ['[SEP]'] + + input_id = list( + np.array(tokenizer.convert_tokens_to_ids(processed_tokens))) + token_type_id = list(np.zeros((len(processed_tokens), ), dtype=int)) + attention_mask = list(np.ones((len(processed_tokens), ), dtype=int)) + + query_char = text[query_id] + phoneme_mask = [1 if i in char2phonemes[query_char] else 0 for i in range(len(labels))] \ + if use_mask else [1] * len(labels) + char_id = chars.index(query_char) + position_id = text2token[ + query_id] + 1 # [CLS] token locate at first place + + input_ids.append(input_id) + token_type_ids.append(token_type_id) + attention_masks.append(attention_mask) + phoneme_masks.append(phoneme_mask) + char_ids.append(char_id) + position_ids.append(position_id) + + outputs = { + 'input_ids': np.array(input_ids), + 'token_type_ids': np.array(token_type_ids), + 'attention_masks': np.array(attention_masks), + 'phoneme_masks': np.array(phoneme_masks).astype(np.float32), + 'char_ids': np.array(char_ids), + 'position_ids': np.array(position_ids), + } + return outputs + + +def _truncate_texts(window_size, texts, query_ids): + truncated_texts = [] + truncated_query_ids = [] + for text, query_id in zip(texts, query_ids): + start = max(0, query_id - window_size // 2) + end = min(len(text), query_id + window_size // 2) + truncated_text = text[start:end] + truncated_texts.append(truncated_text) + + truncated_query_id = query_id - start + truncated_query_ids.append(truncated_query_id) + return truncated_texts, truncated_query_ids + + +def _truncate(max_len, text, query_id, tokens, text2token, token2text): + truncate_len = max_len - 2 + if len(tokens) <= truncate_len: + return (text, query_id, tokens, text2token, token2text) + + token_position = text2token[query_id] + + token_start = token_position - truncate_len // 2 + token_end = token_start + truncate_len + font_exceed_dist = -token_start + back_exceed_dist = token_end - len(tokens) + if font_exceed_dist > 0: + token_start += font_exceed_dist + token_end += font_exceed_dist + elif back_exceed_dist > 0: + token_start -= back_exceed_dist + token_end -= back_exceed_dist + + start = token2text[token_start][0] + end = token2text[token_end - 1][1] + + return (text[start:end], query_id - start, tokens[token_start:token_end], [ + i - token_start if i is not None else None + for i in text2token[start:end] + ], [(s - start, e - start) for s, e in 
token2text[token_start:token_end]]) + + +def prepare_data(sent_path, lb_path=None): + raw_texts = open(sent_path).read().rstrip().split('\n') + query_ids = [raw.index(ANCHOR_CHAR) for raw in raw_texts] + texts = [raw.replace(ANCHOR_CHAR, '') for raw in raw_texts] + if lb_path is None: + return texts, query_ids + else: + phonemes = open(lb_path).read().rstrip().split('\n') + return texts, query_ids, phonemes + + +def get_phoneme_labels(polyphonic_chars): + labels = sorted(list(set([phoneme for char, phoneme in polyphonic_chars]))) + char2phonemes = {} + for char, phoneme in polyphonic_chars: + if char not in char2phonemes: + char2phonemes[char] = [] + char2phonemes[char].append(labels.index(phoneme)) + return labels, char2phonemes + + +def get_char_phoneme_labels(polyphonic_chars): + labels = sorted( + list(set([f'{char} {phoneme}' for char, phoneme in polyphonic_chars]))) + char2phonemes = {} + for char, phoneme in polyphonic_chars: + if char not in char2phonemes: + char2phonemes[char] = [] + char2phonemes[char].append(labels.index(f'{char} {phoneme}')) + return labels, char2phonemes diff --git a/paddlespeech/t2s/frontend/g2pw/onnx_api.py b/paddlespeech/t2s/frontend/g2pw/onnx_api.py new file mode 100644 index 000000000..3a406ad20 --- /dev/null +++ b/paddlespeech/t2s/frontend/g2pw/onnx_api.py @@ -0,0 +1,199 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
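For the label helpers in g2pw/dataset.py, a toy table makes the returned structures concrete. A minimal sketch; the two example characters and their candidate pronunciations are illustrative, the real entries come from POLYPHONIC_CHARS.txt:

# Sketch only: label bookkeeping consumed by prepare_onnx_input.
from paddlespeech.t2s.frontend.g2pw.dataset import get_phoneme_labels

polyphonic_chars = [('行', 'xing2'), ('行', 'hang2'), ('乐', 'le4'), ('乐', 'yue4')]
labels, char2phonemes = get_phoneme_labels(polyphonic_chars)
print(labels)          # sorted list of all candidate pronunciations
print(char2phonemes)   # each character -> indices of its candidates in labels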
+""" +Credits + This code is modified from https://github.com/GitYCC/g2pW +""" +import json +import os + +import numpy as np +import onnxruntime +from opencc import OpenCC +from paddlenlp.transformers import BertTokenizer +from pypinyin import pinyin +from pypinyin import Style + +from paddlespeech.cli.utils import download_and_decompress +from paddlespeech.resource.pretrained_models import g2pw_onnx_models +from paddlespeech.t2s.frontend.g2pw.dataset import get_char_phoneme_labels +from paddlespeech.t2s.frontend.g2pw.dataset import get_phoneme_labels +from paddlespeech.t2s.frontend.g2pw.dataset import prepare_onnx_input +from paddlespeech.t2s.frontend.g2pw.utils import load_config +from paddlespeech.utils.env import MODEL_HOME + + +def predict(session, onnx_input, labels): + all_preds = [] + all_confidences = [] + probs = session.run([], { + "input_ids": onnx_input['input_ids'], + "token_type_ids": onnx_input['token_type_ids'], + "attention_mask": onnx_input['attention_masks'], + "phoneme_mask": onnx_input['phoneme_masks'], + "char_ids": onnx_input['char_ids'], + "position_ids": onnx_input['position_ids'] + })[0] + + preds = np.argmax(probs, axis=1).tolist() + max_probs = [] + for index, arr in zip(preds, probs.tolist()): + max_probs.append(arr[index]) + all_preds += [labels[pred] for pred in preds] + all_confidences += max_probs + + return all_preds, all_confidences + + +class G2PWOnnxConverter: + def __init__(self, + model_dir=MODEL_HOME, + style='bopomofo', + model_source=None, + enable_non_tradional_chinese=False): + if not os.path.exists(os.path.join(model_dir, 'G2PWModel/g2pW.onnx')): + uncompress_path = download_and_decompress( + g2pw_onnx_models['G2PWModel']['1.0'], model_dir) + + sess_options = onnxruntime.SessionOptions() + sess_options.graph_optimization_level = onnxruntime.GraphOptimizationLevel.ORT_ENABLE_ALL + sess_options.execution_mode = onnxruntime.ExecutionMode.ORT_SEQUENTIAL + sess_options.intra_op_num_threads = 2 + self.session_g2pW = onnxruntime.InferenceSession( + os.path.join(model_dir, 'G2PWModel/g2pW.onnx'), + sess_options=sess_options) + self.config = load_config( + os.path.join(model_dir, 'G2PWModel/config.py'), use_default=True) + + self.model_source = model_source if model_source else self.config.model_source + self.enable_opencc = enable_non_tradional_chinese + + self.tokenizer = BertTokenizer.from_pretrained(self.config.model_source) + + polyphonic_chars_path = os.path.join(model_dir, + 'G2PWModel/POLYPHONIC_CHARS.txt') + monophonic_chars_path = os.path.join(model_dir, + 'G2PWModel/MONOPHONIC_CHARS.txt') + self.polyphonic_chars = [ + line.split('\t') + for line in open(polyphonic_chars_path, encoding='utf-8').read() + .strip().split('\n') + ] + self.monophonic_chars = [ + line.split('\t') + for line in open(monophonic_chars_path, encoding='utf-8').read() + .strip().split('\n') + ] + self.labels, self.char2phonemes = get_char_phoneme_labels( + self.polyphonic_chars + ) if self.config.use_char_phoneme else get_phoneme_labels( + self.polyphonic_chars) + + self.chars = sorted(list(self.char2phonemes.keys())) + self.pos_tags = [ + 'UNK', 'A', 'C', 'D', 'I', 'N', 'P', 'T', 'V', 'DE', 'SHI' + ] + + with open( + os.path.join(model_dir, + 'G2PWModel/bopomofo_to_pinyin_wo_tune_dict.json'), + 'r', + encoding='utf-8') as fr: + self.bopomofo_convert_dict = json.load(fr) + self.style_convert_func = { + 'bopomofo': lambda x: x, + 'pinyin': self._convert_bopomofo_to_pinyin, + }[style] + + with open( + os.path.join(model_dir, 'G2PWModel/char_bopomofo_dict.json'), + 'r', + 
encoding='utf-8') as fr: + self.char_bopomofo_dict = json.load(fr) + + if self.enable_opencc: + self.cc = OpenCC('s2tw') + + def _convert_bopomofo_to_pinyin(self, bopomofo): + tone = bopomofo[-1] + assert tone in '12345' + component = self.bopomofo_convert_dict.get(bopomofo[:-1]) + if component: + return component + tone + else: + print(f'Warning: "{bopomofo}" cannot convert to pinyin') + return None + + def __call__(self, sentences): + if isinstance(sentences, str): + sentences = [sentences] + + if self.enable_opencc: + translated_sentences = [] + for sent in sentences: + translated_sent = self.cc.convert(sent) + assert len(translated_sent) == len(sent) + translated_sentences.append(translated_sent) + sentences = translated_sentences + + texts, query_ids, sent_ids, partial_results = self._prepare_data( + sentences) + if len(texts) == 0: + # sentences no polyphonic words + return partial_results + + onnx_input = prepare_onnx_input( + self.tokenizer, + self.labels, + self.char2phonemes, + self.chars, + texts, + query_ids, + use_mask=self.config.use_mask, + use_char_phoneme=self.config.use_char_phoneme, + window_size=None) + + preds, confidences = predict(self.session_g2pW, onnx_input, self.labels) + if self.config.use_char_phoneme: + preds = [pred.split(' ')[1] for pred in preds] + + results = partial_results + for sent_id, query_id, pred in zip(sent_ids, query_ids, preds): + results[sent_id][query_id] = self.style_convert_func(pred) + + return results + + def _prepare_data(self, sentences): + polyphonic_chars = set(self.chars) + monophonic_chars_dict = { + char: phoneme + for char, phoneme in self.monophonic_chars + } + texts, query_ids, sent_ids, partial_results = [], [], [], [] + for sent_id, sent in enumerate(sentences): + pypinyin_result = pinyin(sent, style=Style.TONE3) + partial_result = [None] * len(sent) + for i, char in enumerate(sent): + if char in polyphonic_chars: + texts.append(sent) + query_ids.append(i) + sent_ids.append(sent_id) + elif char in monophonic_chars_dict: + partial_result[i] = self.style_convert_func( + monophonic_chars_dict[char]) + elif char in self.char_bopomofo_dict: + partial_result[i] = pypinyin_result[i][0] + # partial_result[i] = self.style_convert_func(self.char_bopomofo_dict[char][0]) + partial_results.append(partial_result) + return texts, query_ids, sent_ids, partial_results diff --git a/paddlespeech/t2s/frontend/g2pw/utils.py b/paddlespeech/t2s/frontend/g2pw/utils.py new file mode 100644 index 000000000..ad02c4c1d --- /dev/null +++ b/paddlespeech/t2s/frontend/g2pw/utils.py @@ -0,0 +1,144 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
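The converter in onnx_api.py can also be used on its own, outside the TTS frontend. A minimal usage sketch; the first instantiation downloads G2PWModel into MODEL_HOME, and the sample sentence is illustrative:

# Sketch only: standalone polyphone-aware g2p with the ONNX g2pW model.
from paddlespeech.t2s.frontend.g2pw import G2PWOnnxConverter

converter = G2PWOnnxConverter(
    style='pinyin', enable_non_tradional_chinese=True)
# returns one list per sentence with one entry per character;
# entries are None for characters (e.g. punctuation) with no pronunciation
print(converter("我爱音乐,也爱银行。"))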
+""" +Credits + This code is modified from https://github.com/GitYCC/g2pW +""" +import re + + +def wordize_and_map(text): + words = [] + index_map_from_text_to_word = [] + index_map_from_word_to_text = [] + while len(text) > 0: + match_space = re.match(r'^ +', text) + if match_space: + space_str = match_space.group(0) + index_map_from_text_to_word += [None] * len(space_str) + text = text[len(space_str):] + continue + + match_en = re.match(r'^[a-zA-Z0-9]+', text) + if match_en: + en_word = match_en.group(0) + + word_start_pos = len(index_map_from_text_to_word) + word_end_pos = word_start_pos + len(en_word) + index_map_from_word_to_text.append((word_start_pos, word_end_pos)) + + index_map_from_text_to_word += [len(words)] * len(en_word) + + words.append(en_word) + text = text[len(en_word):] + else: + word_start_pos = len(index_map_from_text_to_word) + word_end_pos = word_start_pos + 1 + index_map_from_word_to_text.append((word_start_pos, word_end_pos)) + + index_map_from_text_to_word += [len(words)] + + words.append(text[0]) + text = text[1:] + return words, index_map_from_text_to_word, index_map_from_word_to_text + + +def tokenize_and_map(tokenizer, text): + words, text2word, word2text = wordize_and_map(text) + + tokens = [] + index_map_from_token_to_text = [] + for word, (word_start, word_end) in zip(words, word2text): + word_tokens = tokenizer.tokenize(word) + + if len(word_tokens) == 0 or word_tokens == ['[UNK]']: + index_map_from_token_to_text.append((word_start, word_end)) + tokens.append('[UNK]') + else: + current_word_start = word_start + for word_token in word_tokens: + word_token_len = len(re.sub(r'^##', '', word_token)) + index_map_from_token_to_text.append( + (current_word_start, current_word_start + word_token_len)) + current_word_start = current_word_start + word_token_len + tokens.append(word_token) + + index_map_from_text_to_token = text2word + for i, (token_start, token_end) in enumerate(index_map_from_token_to_text): + for token_pos in range(token_start, token_end): + index_map_from_text_to_token[token_pos] = i + + return tokens, index_map_from_text_to_token, index_map_from_token_to_text + + +def _load_config(config_path): + import importlib.util + spec = importlib.util.spec_from_file_location('__init__', config_path) + config = importlib.util.module_from_spec(spec) + spec.loader.exec_module(config) + return config + + +default_config_dict = { + 'manual_seed': 1313, + 'model_source': 'bert-base-chinese', + 'window_size': 32, + 'num_workers': 2, + 'use_mask': True, + 'use_char_phoneme': False, + 'use_conditional': True, + 'param_conditional': { + 'affect_location': 'softmax', + 'bias': True, + 'char-linear': True, + 'pos-linear': False, + 'char+pos-second': True, + 'char+pos-second_lowrank': False, + 'lowrank_size': 0, + 'char+pos-second_fm': False, + 'fm_size': 0, + 'fix_mode': None, + 'count_json': 'train.count.json' + }, + 'lr': 5e-5, + 'val_interval': 200, + 'num_iter': 10000, + 'use_focal': False, + 'param_focal': { + 'alpha': 0.0, + 'gamma': 0.7 + }, + 'use_pos': True, + 'param_pos ': { + 'weight': 0.1, + 'pos_joint_training': True, + 'train_pos_path': 'train.pos', + 'valid_pos_path': 'dev.pos', + 'test_pos_path': 'test.pos' + } +} + + +def load_config(config_path, use_default=False): + config = _load_config(config_path) + if use_default: + for attr, val in default_config_dict.items(): + if not hasattr(config, attr): + setattr(config, attr, val) + elif isinstance(val, dict): + d = getattr(config, attr) + for dict_k, dict_v in val.items(): + if dict_k not in d: + 
d[dict_k] = dict_v + return config diff --git a/paddlespeech/t2s/frontend/mix_frontend.py b/paddlespeech/t2s/frontend/mix_frontend.py new file mode 100644 index 000000000..5f145098e --- /dev/null +++ b/paddlespeech/t2s/frontend/mix_frontend.py @@ -0,0 +1,181 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import re +from typing import Dict +from typing import List + +import paddle + +from paddlespeech.t2s.frontend import English +from paddlespeech.t2s.frontend.zh_frontend import Frontend + + +class MixFrontend(): + def __init__(self, + g2p_model="pypinyin", + phone_vocab_path=None, + tone_vocab_path=None): + + self.zh_frontend = Frontend( + phone_vocab_path=phone_vocab_path, tone_vocab_path=tone_vocab_path) + self.en_frontend = English(phone_vocab_path=phone_vocab_path) + self.SENTENCE_SPLITOR = re.compile(r'([:、,;。?!,;?!][”’]?)') + self.sp_id = self.zh_frontend.vocab_phones["sp"] + self.sp_id_tensor = paddle.to_tensor([self.sp_id]) + + def is_chinese(self, char): + if char >= '\u4e00' and char <= '\u9fa5': + return True + else: + return False + + def is_alphabet(self, char): + if (char >= '\u0041' and char <= '\u005a') or (char >= '\u0061' and + char <= '\u007a'): + return True + else: + return False + + def is_number(self, char): + if char >= '\u0030' and char <= '\u0039': + return True + else: + return False + + def is_other(self, char): + if not (self.is_chinese(char) or self.is_number(char) or + self.is_alphabet(char)): + return True + else: + return False + + def _split(self, text: str) -> List[str]: + text = re.sub(r'[《》【】<=>{}()()#&@“”^_|…\\]', '', text) + text = self.SENTENCE_SPLITOR.sub(r'\1\n', text) + text = text.strip() + sentences = [sentence.strip() for sentence in re.split(r'\n+', text)] + return sentences + + def _distinguish(self, text: str) -> List[str]: + # sentence --> [ch_part, en_part, ch_part, ...] + + segments = [] + types = [] + + flag = 0 + temp_seg = "" + temp_lang = "" + + # Determine the type of each character. type: blank, chinese, alphabet, number, unk. 
+ for ch in text: + if self.is_chinese(ch): + types.append("zh") + elif self.is_alphabet(ch): + types.append("en") + elif ch == " ": + types.append("blank") + elif self.is_number(ch): + types.append("num") + else: + types.append("unk") + + assert len(types) == len(text) + + for i in range(len(types)): + + # find the first char of the seg + if flag == 0: + if types[i] != "unk" and types[i] != "blank": + temp_seg += text[i] + temp_lang = types[i] + flag = 1 + + else: + if types[i] == temp_lang or types[i] == "num": + temp_seg += text[i] + + elif temp_lang == "num" and types[i] != "unk": + temp_seg += text[i] + if types[i] == "zh" or types[i] == "en": + temp_lang = types[i] + + elif temp_lang == "en" and types[i] == "blank": + temp_seg += text[i] + + elif types[i] == "unk": + pass + + else: + segments.append((temp_seg, temp_lang)) + + if types[i] != "unk" and types[i] != "blank": + temp_seg = text[i] + temp_lang = types[i] + flag = 1 + else: + flag = 0 + temp_seg = "" + temp_lang = "" + + segments.append((temp_seg, temp_lang)) + + return segments + + def get_input_ids(self, + sentence: str, + merge_sentences: bool=True, + get_tone_ids: bool=False, + add_sp: bool=True, + to_tensor: bool=True) -> Dict[str, List[paddle.Tensor]]: + + sentences = self._split(sentence) + phones_list = [] + result = {} + + for text in sentences: + phones_seg = [] + segments = self._distinguish(text) + for seg in segments: + content = seg[0] + lang = seg[1] + if lang == "zh": + input_ids = self.zh_frontend.get_input_ids( + content, + merge_sentences=True, + get_tone_ids=get_tone_ids, + to_tensor=to_tensor) + + elif lang == "en": + input_ids = self.en_frontend.get_input_ids( + content, merge_sentences=True, to_tensor=to_tensor) + + phones_seg.append(input_ids["phone_ids"][0]) + if add_sp: + phones_seg.append(self.sp_id_tensor) + + phones = paddle.concat(phones_seg) + phones_list.append(phones) + + if merge_sentences: + merge_list = paddle.concat(phones_list) + # rm the last 'sp' to avoid the noise at the end + # cause in the training data, no 'sp' in the end + if merge_list[-1] == self.sp_id_tensor: + merge_list = merge_list[:-1] + phones_list = [] + phones_list.append(merge_list) + + result["phone_ids"] = phones_list + + return result diff --git a/paddlespeech/t2s/frontend/phonectic.py b/paddlespeech/t2s/frontend/phonectic.py index 8e9f11737..261db80a8 100644 --- a/paddlespeech/t2s/frontend/phonectic.py +++ b/paddlespeech/t2s/frontend/phonectic.py @@ -82,8 +82,10 @@ class English(Phonetics): phone_ids = [self.vocab_phones[item] for item in phonemes] return np.array(phone_ids, np.int64) - def get_input_ids(self, sentence: str, - merge_sentences: bool=False) -> paddle.Tensor: + def get_input_ids(self, + sentence: str, + merge_sentences: bool=False, + to_tensor: bool=True) -> paddle.Tensor: result = {} sentences = self.text_normalizer._split(sentence, lang="en") phones_list = [] @@ -99,7 +101,8 @@ class English(Phonetics): if (phn in self.vocab_phones and phn not in self.punc) else "sp" for phn in phones ] - phones_list.append(phones) + if len(phones) != 0: + phones_list.append(phones) if merge_sentences: merge_list = sum(phones_list, []) @@ -112,7 +115,8 @@ class English(Phonetics): for part_phones_list in phones_list: phone_ids = self._p2id(part_phones_list) - phone_ids = paddle.to_tensor(phone_ids) + if to_tensor: + phone_ids = paddle.to_tensor(phone_ids) temp_phone_ids.append(phone_ids) result["phone_ids"] = temp_phone_ids return result diff --git a/paddlespeech/t2s/frontend/polyphonic.yaml 
b/paddlespeech/t2s/frontend/polyphonic.yaml new file mode 100644 index 000000000..629bcd262 --- /dev/null +++ b/paddlespeech/t2s/frontend/polyphonic.yaml @@ -0,0 +1,26 @@ +polyphonic: + 湖泊: ['hu2','po1'] + 地壳: ['di4','qiao4'] + 柏树: ['bai3','shu4'] + 曝光: ['bao4','guang1'] + 弹力: ['tan2','li4'] + 字帖: ['zi4','tie4'] + 口吃: ['kou3','chi1'] + 包扎: ['bao1','za1'] + 哪吒: ['ne2','zha1'] + 说服: ['shuo1','fu2'] + 识字: ['shi2','zi4'] + 骨头: ['gu3','tou5'] + 对称: ['dui4','chen4'] + 口供: ['kou3','gong4'] + 抹布: ['ma1','bu4'] + 露背: ['lu4','bei4'] + 圈养: ['juan4', 'yang3'] + 眼眶: ['yan3', 'kuang4'] + 品行: ['pin3','xing2'] + 颤抖: ['chan4','dou3'] + 差不多: ['cha4','bu5','duo1'] + 鸭绿江: ['ya1','lu4','jiang1'] + 撒切尔: ['sa4','qie4','er3'] + 比比皆是: ['bi3','bi3','jie1','shi4'] + 身无长物: ['shen1','wu2','chang2','wu4'] \ No newline at end of file diff --git a/paddlespeech/t2s/frontend/zh_frontend.py b/paddlespeech/t2s/frontend/zh_frontend.py index 129aa944e..9513a459c 100644 --- a/paddlespeech/t2s/frontend/zh_frontend.py +++ b/paddlespeech/t2s/frontend/zh_frontend.py @@ -11,6 +11,7 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. +import os import re from typing import Dict from typing import List @@ -18,6 +19,7 @@ from typing import List import jieba.posseg as psg import numpy as np import paddle +import yaml from g2pM import G2pM from pypinyin import lazy_pinyin from pypinyin import load_phrases_dict @@ -25,25 +27,77 @@ from pypinyin import load_single_dict from pypinyin import Style from pypinyin_dict.phrase_pinyin_data import large_pinyin +from paddlespeech.t2s.frontend.g2pw import G2PWOnnxConverter from paddlespeech.t2s.frontend.generate_lexicon import generate_lexicon from paddlespeech.t2s.frontend.tone_sandhi import ToneSandhi from paddlespeech.t2s.frontend.zh_normalization.text_normlization import TextNormalizer +INITIALS = [ + 'b', 'p', 'm', 'f', 'd', 't', 'n', 'l', 'g', 'k', 'h', 'zh', 'ch', 'sh', + 'r', 'z', 'c', 's', 'j', 'q', 'x' +] +INITIALS += ['y', 'w', 'sp', 'spl', 'spn', 'sil'] + + +def intersperse(lst, item): + result = [item] * (len(lst) * 2 + 1) + result[1::2] = lst + return result + + +def insert_after_character(lst, item): + result = [item] + for phone in lst: + result.append(phone) + if phone not in INITIALS: + # finals has tones + # assert phone[-1] in "12345" + result.append(item) + return result + + +class Polyphonic(): + def __init__(self): + with open( + os.path.join( + os.path.dirname(os.path.abspath(__file__)), + 'polyphonic.yaml'), + 'r', + encoding='utf-8') as polyphonic_file: + # 解析yaml + polyphonic_dict = yaml.load(polyphonic_file, Loader=yaml.FullLoader) + self.polyphonic_words = polyphonic_dict["polyphonic"] + + def correct_pronunciation(self, word, pinyin): + # 词汇被词典收录则返回纠正后的读音 + if word in self.polyphonic_words.keys(): + pinyin = self.polyphonic_words[word] + # 否则返回原读音 + return pinyin + class Frontend(): def __init__(self, - g2p_model="pypinyin", + g2p_model="g2pW", phone_vocab_path=None, tone_vocab_path=None): self.tone_modifier = ToneSandhi() self.text_normalizer = TextNormalizer() self.punc = ":,;。?!“”‘’':,;.?!" 
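+ # Editorial note (summary of the new code path, not from the upstream patch):
+ # with the new default g2p_model="g2pW", whole sentences go through the ONNX
+ # G2PWOnnxConverter for polyphone disambiguation, the polyphonic.yaml table
+ # above overrides known hard cases (e.g. 地壳 -> ['di4','qiao4']), and g2pM
+ # stays available as a fallback when g2pW cannot handle the input.
+ # A minimal usage sketch (the vocab file path is an assumption, not shipped here):
+ #   frontend = Frontend(g2p_model="g2pW", phone_vocab_path="phone_id_map.txt")
+ #   outs = frontend.get_input_ids("你好世界", merge_sentences=True, add_blank=True)
+ #   phone_ids = outs["phone_ids"][0]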
- # g2p_model can be pypinyin and g2pM + # g2p_model can be pypinyin and g2pM and g2pW self.g2p_model = g2p_model if self.g2p_model == "g2pM": self.g2pM_model = G2pM() self.pinyin2phone = generate_lexicon( with_tone=True, with_erhua=False) + elif self.g2p_model == "g2pW": + self.corrector = Polyphonic() + self.g2pM_model = G2pM() + self.g2pW_model = G2PWOnnxConverter( + style='pinyin', enable_non_tradional_chinese=True) + self.pinyin2phone = generate_lexicon( + with_tone=True, with_erhua=False) + else: self.__init__pypinyin() self.must_erhua = {"小院儿", "胡同儿", "范儿", "老汉儿", "撒欢儿", "寻老礼儿", "妥妥儿"} @@ -133,18 +187,65 @@ class Frontend(): initials = [] finals = [] seg_cut = self.tone_modifier.pre_merge_for_modify(seg_cut) - for word, pos in seg_cut: - if pos == 'eng': - continue - sub_initials, sub_finals = self._get_initials_finals(word) - sub_finals = self.tone_modifier.modified_tone(word, pos, - sub_finals) - if with_erhua: - sub_initials, sub_finals = self._merge_erhua( - sub_initials, sub_finals, word, pos) - initials.append(sub_initials) - finals.append(sub_finals) - # assert len(sub_initials) == len(sub_finals) == len(word) + # To get better results for polyphonic words, the whole sentence is predicted here + if self.g2p_model == "g2pW": + try: + pinyins = self.g2pW_model(seg)[0] + except Exception: + # the g2pW model takes Traditional Chinese input; simplified words it cannot cover fall back to g2pM + print("[%s] not in g2pW dict, use g2pM" % seg) + pinyins = self.g2pM_model(seg, tone=True, char_split=False) + pre_word_length = 0 + for word, pos in seg_cut: + sub_initials = [] + sub_finals = [] + now_word_length = pre_word_length + len(word) + if pos == 'eng': + pre_word_length = now_word_length + continue + word_pinyins = pinyins[pre_word_length:now_word_length] + # correct the pronunciation + word_pinyins = self.corrector.correct_pronunciation( + word, word_pinyins) + for pinyin, char in zip(word_pinyins, word): + if pinyin is None: + pinyin = char + pinyin = pinyin.replace("u:", "v") + if pinyin in self.pinyin2phone: + initial_final_list = self.pinyin2phone[ + pinyin].split(" ") + if len(initial_final_list) == 2: + sub_initials.append(initial_final_list[0]) + sub_finals.append(initial_final_list[1]) + elif len(initial_final_list) == 1: + sub_initials.append('') + sub_finals.append(initial_final_list[1]) + else: + # If it's not pinyin (possibly punctuation) or no conversion is required + sub_initials.append(pinyin) + sub_finals.append(pinyin) + pre_word_length = now_word_length + sub_finals = self.tone_modifier.modified_tone(word, pos, + sub_finals) + if with_erhua: + sub_initials, sub_finals = self._merge_erhua( + sub_initials, sub_finals, word, pos) + initials.append(sub_initials) + finals.append(sub_finals) + # assert len(sub_initials) == len(sub_finals) == len(word) + else: + for word, pos in seg_cut: + if pos == 'eng': + continue + sub_initials, sub_finals = self._get_initials_finals(word) + sub_finals = self.tone_modifier.modified_tone(word, pos, + sub_finals) + if with_erhua: + sub_initials, sub_finals = self._merge_erhua( + sub_initials, sub_finals, word, pos) + initials.append(sub_initials) + finals.append(sub_finals) + # assert len(sub_initials) == len(sub_finals) == len(word) initials = sum(initials, []) finals = sum(finals, []) @@ -285,7 +386,10 @@ class Frontend(): merge_sentences: bool=True, get_tone_ids: bool=False, robot: bool=False, - print_info: bool=False) -> Dict[str, List[paddle.Tensor]]: + print_info: bool=False, + add_blank: bool=False, + blank_token: str="", + to_tensor: bool=True) -> Dict[str, List[paddle.Tensor]]: phonemes = self.get_phonemes( sentence, merge_sentences=merge_sentences, @@ -296,16
+400,22 @@ class Frontend(): tones = [] temp_phone_ids = [] temp_tone_ids = [] + for part_phonemes in phonemes: phones, tones = self._get_phone_tone( part_phonemes, get_tone_ids=get_tone_ids) + if add_blank: + phones = insert_after_character(phones, blank_token) if tones: tone_ids = self._t2id(tones) - tone_ids = paddle.to_tensor(tone_ids) + if to_tensor: + tone_ids = paddle.to_tensor(tone_ids) temp_tone_ids.append(tone_ids) if phones: phone_ids = self._p2id(phones) - phone_ids = paddle.to_tensor(phone_ids) + # if use paddle.to_tensor() in onnxruntime, the first time will be too low + if to_tensor: + phone_ids = paddle.to_tensor(phone_ids) temp_phone_ids.append(phone_ids) if temp_tone_ids: result["tone_ids"] = temp_tone_ids diff --git a/paddlespeech/t2s/models/__init__.py b/paddlespeech/t2s/models/__init__.py index 0b6f29119..d8df4368a 100644 --- a/paddlespeech/t2s/models/__init__.py +++ b/paddlespeech/t2s/models/__init__.py @@ -11,6 +11,7 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. +from .ernie_sat import * from .fastspeech2 import * from .hifigan import * from .melgan import * diff --git a/paddlespeech/t2s/models/ernie_sat/__init__.py b/paddlespeech/t2s/models/ernie_sat/__init__.py new file mode 100644 index 000000000..7e795370e --- /dev/null +++ b/paddlespeech/t2s/models/ernie_sat/__init__.py @@ -0,0 +1,16 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +from .ernie_sat import * +from .ernie_sat_updater import * +from .mlm import * diff --git a/paddlespeech/t2s/models/ernie_sat/ernie_sat.py b/paddlespeech/t2s/models/ernie_sat/ernie_sat.py new file mode 100644 index 000000000..54f5d542d --- /dev/null +++ b/paddlespeech/t2s/models/ernie_sat/ernie_sat.py @@ -0,0 +1,705 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
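+# Editorial note (inferred from the code below rather than upstream docs): this
+# module implements the masked-language-model style acoustic model behind
+# ERNIE-SAT. A conformer MLMEncoder embeds speech frames and phone tokens
+# jointly, MLMEncAsDecoder / MLMDualMaksing reconstruct the masked
+# mel-spectrogram (and optionally masked text) spans, and inference() returns
+# the original context together with the re-predicted span given span_bdy.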
+from typing import Dict +from typing import List +from typing import Optional + +import paddle +from paddle import nn + +from paddlespeech.t2s.modules.activation import get_activation +from paddlespeech.t2s.modules.conformer.convolution import ConvolutionModule +from paddlespeech.t2s.modules.conformer.encoder_layer import EncoderLayer +from paddlespeech.t2s.modules.layer_norm import LayerNorm +from paddlespeech.t2s.modules.masked_fill import masked_fill +from paddlespeech.t2s.modules.nets_utils import initialize +from paddlespeech.t2s.modules.tacotron2.decoder import Postnet +from paddlespeech.t2s.modules.transformer.attention import LegacyRelPositionMultiHeadedAttention +from paddlespeech.t2s.modules.transformer.attention import MultiHeadedAttention +from paddlespeech.t2s.modules.transformer.attention import RelPositionMultiHeadedAttention +from paddlespeech.t2s.modules.transformer.embedding import LegacyRelPositionalEncoding +from paddlespeech.t2s.modules.transformer.embedding import PositionalEncoding +from paddlespeech.t2s.modules.transformer.embedding import RelPositionalEncoding +from paddlespeech.t2s.modules.transformer.embedding import ScaledPositionalEncoding +from paddlespeech.t2s.modules.transformer.multi_layer_conv import Conv1dLinear +from paddlespeech.t2s.modules.transformer.multi_layer_conv import MultiLayeredConv1d +from paddlespeech.t2s.modules.transformer.positionwise_feed_forward import PositionwiseFeedForward +from paddlespeech.t2s.modules.transformer.repeat import repeat +from paddlespeech.t2s.modules.transformer.subsampling import Conv2dSubsampling + + +# MLM -> Mask Language Model +class mySequential(nn.Sequential): + def forward(self, *inputs): + for module in self._sub_layers.values(): + if type(inputs) == tuple: + inputs = module(*inputs) + else: + inputs = module(inputs) + return inputs + + +class MaskInputLayer(nn.Layer): + def __init__(self, out_features: int) -> None: + super().__init__() + self.mask_feature = paddle.create_parameter( + shape=(1, 1, out_features), + dtype=paddle.float32, + default_initializer=paddle.nn.initializer.Assign( + paddle.normal(shape=(1, 1, out_features)))) + + def forward(self, input: paddle.Tensor, + masked_pos: paddle.Tensor=None) -> paddle.Tensor: + masked_pos = paddle.expand_as(paddle.unsqueeze(masked_pos, -1), input) + masked_input = masked_fill(input, masked_pos, 0) + masked_fill( + paddle.expand_as(self.mask_feature, input), ~masked_pos, 0) + return masked_input + + +class MLMEncoder(nn.Layer): + """Conformer encoder module. + + Args: + idim (int): Input dimension. + attention_dim (int): Dimension of attention. + attention_heads (int): The number of heads of multi head attention. + linear_units (int): The number of units of position-wise feed forward. + num_blocks (int): The number of decoder blocks. + dropout_rate (float): Dropout rate. + positional_dropout_rate (float): Dropout rate after adding positional encoding. + attention_dropout_rate (float): Dropout rate in attention. + input_layer (Union[str, paddle.nn.Layer]): Input layer type. + normalize_before (bool): Whether to use layer_norm before the first block. + concat_after (bool): Whether to concat attention layer's input and output. + if True, additional linear will be applied. + i.e. x -> x + linear(concat(x, att(x))) + if False, no additional linear will be applied. i.e. x -> x + att(x) + positionwise_layer_type (str): "linear", "conv1d", or "conv1d-linear". + positionwise_conv_kernel_size (int): Kernel size of positionwise conv1d layer. 
+ macaron_style (bool): Whether to use macaron style for positionwise layer. + pos_enc_layer_type (str): Encoder positional encoding layer type. + selfattention_layer_type (str): Encoder attention layer type. + activation_type (str): Encoder activation function type. + use_cnn_module (bool): Whether to use convolution module. + zero_triu (bool): Whether to zero the upper triangular part of attention matrix. + cnn_module_kernel (int): Kernerl size of convolution module. + padding_idx (int): Padding idx for input_layer=embed. + stochastic_depth_rate (float): Maximum probability to skip the encoder layer. + + """ + + def __init__(self, + idim: int, + vocab_size: int=0, + pre_speech_layer: int=0, + attention_dim: int=256, + attention_heads: int=4, + linear_units: int=2048, + num_blocks: int=6, + dropout_rate: float=0.1, + positional_dropout_rate: float=0.1, + attention_dropout_rate: float=0.0, + input_layer: str="conv2d", + normalize_before: bool=True, + concat_after: bool=False, + positionwise_layer_type: str="linear", + positionwise_conv_kernel_size: int=1, + macaron_style: bool=False, + pos_enc_layer_type: str="abs_pos", + pos_enc_class=None, + selfattention_layer_type: str="selfattn", + activation_type: str="swish", + use_cnn_module: bool=False, + zero_triu: bool=False, + cnn_module_kernel: int=31, + padding_idx: int=-1, + stochastic_depth_rate: float=0.0, + text_masking: bool=False): + """Construct an Encoder object.""" + super().__init__() + self._output_size = attention_dim + self.text_masking = text_masking + if self.text_masking: + self.text_masking_layer = MaskInputLayer(attention_dim) + activation = get_activation(activation_type) + if pos_enc_layer_type == "abs_pos": + pos_enc_class = PositionalEncoding + elif pos_enc_layer_type == "scaled_abs_pos": + pos_enc_class = ScaledPositionalEncoding + elif pos_enc_layer_type == "rel_pos": + assert selfattention_layer_type == "rel_selfattn" + pos_enc_class = RelPositionalEncoding + elif pos_enc_layer_type == "legacy_rel_pos": + pos_enc_class = LegacyRelPositionalEncoding + assert selfattention_layer_type == "legacy_rel_selfattn" + else: + raise ValueError("unknown pos_enc_layer: " + pos_enc_layer_type) + + self.conv_subsampling_factor = 1 + if input_layer == "linear": + self.embed = nn.Sequential( + nn.Linear(idim, attention_dim), + nn.LayerNorm(attention_dim), + nn.Dropout(dropout_rate), + nn.ReLU(), + pos_enc_class(attention_dim, positional_dropout_rate), ) + elif input_layer == "conv2d": + self.embed = Conv2dSubsampling( + idim, + attention_dim, + dropout_rate, + pos_enc_class(attention_dim, positional_dropout_rate), ) + self.conv_subsampling_factor = 4 + elif input_layer == "embed": + self.embed = nn.Sequential( + nn.Embedding(idim, attention_dim, padding_idx=padding_idx), + pos_enc_class(attention_dim, positional_dropout_rate), ) + elif input_layer == "mlm": + self.segment_emb = None + self.speech_embed = mySequential( + MaskInputLayer(idim), + nn.Linear(idim, attention_dim), + nn.LayerNorm(attention_dim), + nn.ReLU(), + pos_enc_class(attention_dim, positional_dropout_rate)) + self.text_embed = nn.Sequential( + nn.Embedding( + vocab_size, attention_dim, padding_idx=padding_idx), + pos_enc_class(attention_dim, positional_dropout_rate), ) + elif input_layer == "sega_mlm": + self.segment_emb = nn.Embedding( + 500, attention_dim, padding_idx=padding_idx) + self.speech_embed = mySequential( + MaskInputLayer(idim), + nn.Linear(idim, attention_dim), + nn.LayerNorm(attention_dim), + nn.ReLU(), + pos_enc_class(attention_dim, 
positional_dropout_rate)) + self.text_embed = nn.Sequential( + nn.Embedding( + vocab_size, attention_dim, padding_idx=padding_idx), + pos_enc_class(attention_dim, positional_dropout_rate), ) + elif isinstance(input_layer, nn.Layer): + self.embed = nn.Sequential( + input_layer, + pos_enc_class(attention_dim, positional_dropout_rate), ) + elif input_layer is None: + self.embed = nn.Sequential( + pos_enc_class(attention_dim, positional_dropout_rate)) + else: + raise ValueError("unknown input_layer: " + input_layer) + self.normalize_before = normalize_before + + # self-attention module definition + if selfattention_layer_type == "selfattn": + encoder_selfattn_layer = MultiHeadedAttention + encoder_selfattn_layer_args = (attention_heads, attention_dim, + attention_dropout_rate, ) + elif selfattention_layer_type == "legacy_rel_selfattn": + assert pos_enc_layer_type == "legacy_rel_pos" + encoder_selfattn_layer = LegacyRelPositionMultiHeadedAttention + encoder_selfattn_layer_args = (attention_heads, attention_dim, + attention_dropout_rate, ) + elif selfattention_layer_type == "rel_selfattn": + assert pos_enc_layer_type == "rel_pos" + encoder_selfattn_layer = RelPositionMultiHeadedAttention + encoder_selfattn_layer_args = (attention_heads, attention_dim, + attention_dropout_rate, zero_triu, ) + else: + raise ValueError("unknown encoder_attn_layer: " + + selfattention_layer_type) + + # feed-forward module definition + if positionwise_layer_type == "linear": + positionwise_layer = PositionwiseFeedForward + positionwise_layer_args = (attention_dim, linear_units, + dropout_rate, activation, ) + elif positionwise_layer_type == "conv1d": + positionwise_layer = MultiLayeredConv1d + positionwise_layer_args = (attention_dim, linear_units, + positionwise_conv_kernel_size, + dropout_rate, ) + elif positionwise_layer_type == "conv1d-linear": + positionwise_layer = Conv1dLinear + positionwise_layer_args = (attention_dim, linear_units, + positionwise_conv_kernel_size, + dropout_rate, ) + else: + raise NotImplementedError("Support only linear or conv1d.") + + # convolution module definition + convolution_layer = ConvolutionModule + convolution_layer_args = (attention_dim, cnn_module_kernel, activation) + + self.encoders = repeat( + num_blocks, + lambda lnum: EncoderLayer( + attention_dim, + encoder_selfattn_layer(*encoder_selfattn_layer_args), + positionwise_layer(*positionwise_layer_args), + positionwise_layer(*positionwise_layer_args) if macaron_style else None, + convolution_layer(*convolution_layer_args) if use_cnn_module else None, + dropout_rate, + normalize_before, + concat_after, + stochastic_depth_rate * float(1 + lnum) / num_blocks, ), ) + self.pre_speech_layer = pre_speech_layer + self.pre_speech_encoders = repeat( + self.pre_speech_layer, + lambda lnum: EncoderLayer( + attention_dim, + encoder_selfattn_layer(*encoder_selfattn_layer_args), + positionwise_layer(*positionwise_layer_args), + positionwise_layer(*positionwise_layer_args) if macaron_style else None, + convolution_layer(*convolution_layer_args) if use_cnn_module else None, + dropout_rate, + normalize_before, + concat_after, + stochastic_depth_rate * float(1 + lnum) / self.pre_speech_layer, ), + ) + if self.normalize_before: + self.after_norm = LayerNorm(attention_dim) + + def forward(self, + speech: paddle.Tensor, + text: paddle.Tensor, + masked_pos: paddle.Tensor, + speech_mask: paddle.Tensor=None, + text_mask: paddle.Tensor=None, + speech_seg_pos: paddle.Tensor=None, + text_seg_pos: paddle.Tensor=None): + """Encode input sequence. 
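+ Editorial note (inferred from the implementation, not upstream docs):
+ speech is embedded together with its masked positions, text is embedded
+ separately, optional segment embeddings are added, the optional
+ pre-speech encoder layers run on the speech stream alone, and finally
+ both streams are concatenated and encoded by the shared conformer blocks.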
+ + """ + if masked_pos is not None: + speech = self.speech_embed(speech, masked_pos) + else: + speech = self.speech_embed(speech) + if text is not None: + text = self.text_embed(text) + if speech_seg_pos is not None and text_seg_pos is not None and self.segment_emb: + speech_seg_emb = self.segment_emb(speech_seg_pos) + text_seg_emb = self.segment_emb(text_seg_pos) + text = (text[0] + text_seg_emb, text[1]) + speech = (speech[0] + speech_seg_emb, speech[1]) + if self.pre_speech_encoders: + speech, _ = self.pre_speech_encoders(speech, speech_mask) + + if text is not None: + xs = paddle.concat([speech[0], text[0]], axis=1) + xs_pos_emb = paddle.concat([speech[1], text[1]], axis=1) + masks = paddle.concat([speech_mask, text_mask], axis=-1) + else: + xs = speech[0] + xs_pos_emb = speech[1] + masks = speech_mask + + xs, masks = self.encoders((xs, xs_pos_emb), masks) + + if isinstance(xs, tuple): + xs = xs[0] + if self.normalize_before: + xs = self.after_norm(xs) + + return xs, masks + + +class MLMDecoder(MLMEncoder): + def forward(self, xs: paddle.Tensor, masks: paddle.Tensor): + """Encode input sequence. + + Args: + xs (paddle.Tensor): Input tensor (#batch, time, idim). + masks (paddle.Tensor): Mask tensor (#batch, time). + + Returns: + paddle.Tensor: Output tensor (#batch, time, attention_dim). + paddle.Tensor: Mask tensor (#batch, time). + + """ + xs = self.embed(xs) + xs, masks = self.encoders(xs, masks) + + if isinstance(xs, tuple): + xs = xs[0] + if self.normalize_before: + xs = self.after_norm(xs) + + return xs, masks + + +# encoder and decoder is nn.Layer, not str +class MLM(nn.Layer): + def __init__(self, + odim: int, + encoder: nn.Layer, + decoder: Optional[nn.Layer], + postnet_layers: int=0, + postnet_chans: int=0, + postnet_filts: int=0, + text_masking: bool=False): + + super().__init__() + self.odim = odim + self.encoder = encoder + self.decoder = decoder + self.vocab_size = encoder.text_embed[0]._num_embeddings + + if self.decoder is None or not (hasattr(self.decoder, + 'output_layer') and + self.decoder.output_layer is not None): + self.sfc = nn.Linear(self.encoder._output_size, odim) + else: + self.sfc = None + if text_masking: + self.text_sfc = nn.Linear( + self.encoder.text_embed[0]._embedding_dim, + self.vocab_size, + weight_attr=self.encoder.text_embed[0]._weight_attr) + else: + self.text_sfc = None + + self.postnet = (None if postnet_layers == 0 else Postnet( + idim=self.encoder._output_size, + odim=odim, + n_layers=postnet_layers, + n_chans=postnet_chans, + n_filts=postnet_filts, + use_batch_norm=True, + dropout_rate=0.5, )) + + def inference( + self, + speech: paddle.Tensor, + text: paddle.Tensor, + masked_pos: paddle.Tensor, + speech_mask: paddle.Tensor, + text_mask: paddle.Tensor, + speech_seg_pos: paddle.Tensor, + text_seg_pos: paddle.Tensor, + span_bdy: List[int], + use_teacher_forcing: bool=False, ) -> List[paddle.Tensor]: + ''' + Args: + speech (paddle.Tensor): input speech (1, Tmax, D). + text (paddle.Tensor): input text (1, Tmax2). + masked_pos (paddle.Tensor): masked position of input speech (1, Tmax) + speech_mask (paddle.Tensor): mask of speech (1, 1, Tmax). + text_mask (paddle.Tensor): mask of text (1, 1, Tmax2). + speech_seg_pos (paddle.Tensor): n-th phone of each mel, 0<=n<=Tmax2 (1, Tmax). + text_seg_pos (paddle.Tensor): n-th phone of each phone, 0<=n<=Tmax2 (1, Tmax2). 
+ span_bdy (List[int]): masked mel boundary of input speech (2,) + use_teacher_forcing (bool): whether to use teacher forcing + Returns: + List[Tensor]: + eg: + [Tensor(shape=[1, 181, 80]), Tensor(shape=[80, 80]), Tensor(shape=[1, 67, 80])] + ''' + + z_cache = None + if use_teacher_forcing: + before_outs, zs, *_ = self.forward( + speech=speech, + text=text, + masked_pos=masked_pos, + speech_mask=speech_mask, + text_mask=text_mask, + speech_seg_pos=speech_seg_pos, + text_seg_pos=text_seg_pos) + if zs is None: + zs = before_outs + + speech = speech.squeeze(0) + outs = [speech[:span_bdy[0]]] + outs += [zs[0][span_bdy[0]:span_bdy[1]]] + outs += [speech[span_bdy[1]:]] + return outs + return None + + +class MLMEncAsDecoder(MLM): + def forward(self, + speech: paddle.Tensor, + text: paddle.Tensor, + masked_pos: paddle.Tensor, + speech_mask: paddle.Tensor, + text_mask: paddle.Tensor, + speech_seg_pos: paddle.Tensor, + text_seg_pos: paddle.Tensor): + # feats: (Batch, Length, Dim) + # -> encoder_out: (Batch, Length2, Dim2) + encoder_out, h_masks = self.encoder( + speech=speech, + text=text, + masked_pos=masked_pos, + speech_mask=speech_mask, + text_mask=text_mask, + speech_seg_pos=speech_seg_pos, + text_seg_pos=text_seg_pos) + if self.decoder is not None: + zs, _ = self.decoder(encoder_out, h_masks) + else: + zs = encoder_out + speech_hidden_states = zs[:, :paddle.shape(speech)[1], :] + if self.sfc is not None: + before_outs = paddle.reshape( + self.sfc(speech_hidden_states), + (paddle.shape(speech_hidden_states)[0], -1, self.odim)) + else: + before_outs = speech_hidden_states + if self.postnet is not None: + after_outs = before_outs + paddle.transpose( + self.postnet(paddle.transpose(before_outs, [0, 2, 1])), + [0, 2, 1]) + else: + after_outs = None + return before_outs, after_outs, None + + +class MLMDualMaksing(MLM): + def forward(self, + speech: paddle.Tensor, + text: paddle.Tensor, + masked_pos: paddle.Tensor, + speech_mask: paddle.Tensor, + text_mask: paddle.Tensor, + speech_seg_pos: paddle.Tensor, + text_seg_pos: paddle.Tensor): + # feats: (Batch, Length, Dim) + # -> encoder_out: (Batch, Length2, Dim2) + encoder_out, h_masks = self.encoder( + speech=speech, + text=text, + masked_pos=masked_pos, + speech_mask=speech_mask, + text_mask=text_mask, + speech_seg_pos=speech_seg_pos, + text_seg_pos=text_seg_pos) + if self.decoder is not None: + zs, _ = self.decoder(encoder_out, h_masks) + else: + zs = encoder_out + speech_hidden_states = zs[:, :paddle.shape(speech)[1], :] + if self.text_sfc: + text_hiddent_states = zs[:, paddle.shape(speech)[1]:, :] + text_outs = paddle.reshape( + self.text_sfc(text_hiddent_states), + (paddle.shape(text_hiddent_states)[0], -1, self.vocab_size)) + if self.sfc is not None: + before_outs = paddle.reshape( + self.sfc(speech_hidden_states), + (paddle.shape(speech_hidden_states)[0], -1, self.odim)) + else: + before_outs = speech_hidden_states + if self.postnet is not None: + after_outs = before_outs + paddle.transpose( + self.postnet(paddle.transpose(before_outs, [0, 2, 1])), + [0, 2, 1]) + else: + after_outs = None + return before_outs, after_outs, text_outs + + +class ErnieSAT(nn.Layer): + def __init__( + self, + # network structure related + idim: int, + odim: int, + postnet_layers: int=5, + postnet_filts: int=5, + postnet_chans: int=256, + use_scaled_pos_enc: bool=False, + encoder_type: str='conformer', + decoder_type: str='conformer', + enc_input_layer: str='sega_mlm', + enc_pre_speech_layer: int=0, + enc_cnn_module_kernel: int=7, + enc_attention_dim: int=384, + 
enc_attention_heads: int=2, + enc_linear_units: int=1536, + enc_num_blocks: int=4, + enc_dropout_rate: float=0.2, + enc_positional_dropout_rate: float=0.2, + enc_attention_dropout_rate: float=0.2, + enc_normalize_before: bool=True, + enc_macaron_style: bool=True, + enc_use_cnn_module: bool=True, + enc_selfattention_layer_type: str='legacy_rel_selfattn', + enc_activation_type: str='swish', + enc_pos_enc_layer_type: str='legacy_rel_pos', + enc_positionwise_layer_type: str='conv1d', + enc_positionwise_conv_kernel_size: int=3, + text_masking: bool=False, + dec_cnn_module_kernel: int=31, + dec_attention_dim: int=384, + dec_attention_heads: int=2, + dec_linear_units: int=1536, + dec_num_blocks: int=4, + dec_dropout_rate: float=0.2, + dec_positional_dropout_rate: float=0.2, + dec_attention_dropout_rate: float=0.2, + dec_macaron_style: bool=True, + dec_use_cnn_module: bool=True, + dec_selfattention_layer_type: str='legacy_rel_selfattn', + dec_activation_type: str='swish', + dec_pos_enc_layer_type: str='legacy_rel_pos', + dec_positionwise_layer_type: str='conv1d', + dec_positionwise_conv_kernel_size: int=3, + init_type: str="xavier_uniform", ): + super().__init__() + # store hyperparameters + self.odim = odim + + self.use_scaled_pos_enc = use_scaled_pos_enc + + # initialize parameters + initialize(self, init_type) + + # Encoder + if encoder_type == "conformer": + encoder = MLMEncoder( + idim=odim, + vocab_size=idim, + pre_speech_layer=enc_pre_speech_layer, + attention_dim=enc_attention_dim, + attention_heads=enc_attention_heads, + linear_units=enc_linear_units, + num_blocks=enc_num_blocks, + dropout_rate=enc_dropout_rate, + positional_dropout_rate=enc_positional_dropout_rate, + attention_dropout_rate=enc_attention_dropout_rate, + input_layer=enc_input_layer, + normalize_before=enc_normalize_before, + positionwise_layer_type=enc_positionwise_layer_type, + positionwise_conv_kernel_size=enc_positionwise_conv_kernel_size, + macaron_style=enc_macaron_style, + pos_enc_layer_type=enc_pos_enc_layer_type, + selfattention_layer_type=enc_selfattention_layer_type, + activation_type=enc_activation_type, + use_cnn_module=enc_use_cnn_module, + cnn_module_kernel=enc_cnn_module_kernel, + text_masking=text_masking) + else: + raise ValueError(f"{encoder_type} is not supported.") + + # Decoder + if decoder_type != 'no_decoder': + decoder = MLMDecoder( + idim=0, + input_layer=None, + cnn_module_kernel=dec_cnn_module_kernel, + attention_dim=dec_attention_dim, + attention_heads=dec_attention_heads, + linear_units=dec_linear_units, + num_blocks=dec_num_blocks, + dropout_rate=dec_dropout_rate, + positional_dropout_rate=dec_positional_dropout_rate, + macaron_style=dec_macaron_style, + use_cnn_module=dec_use_cnn_module, + selfattention_layer_type=dec_selfattention_layer_type, + activation_type=dec_activation_type, + pos_enc_layer_type=dec_pos_enc_layer_type, + positionwise_layer_type=dec_positionwise_layer_type, + positionwise_conv_kernel_size=dec_positionwise_conv_kernel_size) + + else: + decoder = None + + model_class = MLMDualMaksing if text_masking else MLMEncAsDecoder + + self.model = model_class( + odim=odim, + encoder=encoder, + decoder=decoder, + postnet_layers=postnet_layers, + postnet_filts=postnet_filts, + postnet_chans=postnet_chans, + text_masking=text_masking) + + nn.initializer.set_global_initializer(None) + + def forward(self, + speech: paddle.Tensor, + text: paddle.Tensor, + masked_pos: paddle.Tensor, + speech_mask: paddle.Tensor, + text_mask: paddle.Tensor, + speech_seg_pos: paddle.Tensor, + text_seg_pos: 
paddle.Tensor): + return self.model( + speech=speech, + text=text, + masked_pos=masked_pos, + speech_mask=speech_mask, + text_mask=text_mask, + speech_seg_pos=speech_seg_pos, + text_seg_pos=text_seg_pos) + + def inference( + self, + speech: paddle.Tensor, + text: paddle.Tensor, + masked_pos: paddle.Tensor, + speech_mask: paddle.Tensor, + text_mask: paddle.Tensor, + speech_seg_pos: paddle.Tensor, + text_seg_pos: paddle.Tensor, + span_bdy: List[int], + use_teacher_forcing: bool=False, ) -> Dict[str, paddle.Tensor]: + return self.model.inference( + speech=speech, + text=text, + masked_pos=masked_pos, + speech_mask=speech_mask, + text_mask=text_mask, + speech_seg_pos=speech_seg_pos, + text_seg_pos=text_seg_pos, + span_bdy=span_bdy, + use_teacher_forcing=use_teacher_forcing) + + +class ErnieSATInference(nn.Layer): + def __init__(self, normalizer, model): + super().__init__() + self.normalizer = normalizer + self.acoustic_model = model + + def forward( + self, + speech: paddle.Tensor, + text: paddle.Tensor, + masked_pos: paddle.Tensor, + speech_mask: paddle.Tensor, + text_mask: paddle.Tensor, + speech_seg_pos: paddle.Tensor, + text_seg_pos: paddle.Tensor, + span_bdy: List[int], + use_teacher_forcing: bool=True, ): + outs = self.acoustic_model.inference( + speech=speech, + text=text, + masked_pos=masked_pos, + speech_mask=speech_mask, + text_mask=text_mask, + speech_seg_pos=speech_seg_pos, + text_seg_pos=text_seg_pos, + span_bdy=span_bdy, + use_teacher_forcing=use_teacher_forcing) + + normed_mel_pre, normed_mel_masked, normed_mel_post = outs + logmel_pre = self.normalizer.inverse(normed_mel_pre) + logmel_masked = self.normalizer.inverse(normed_mel_masked) + logmel_post = self.normalizer.inverse(normed_mel_post) + return logmel_pre, logmel_masked, logmel_post diff --git a/paddlespeech/t2s/models/ernie_sat/ernie_sat_updater.py b/paddlespeech/t2s/models/ernie_sat/ernie_sat_updater.py new file mode 100644 index 000000000..219341c88 --- /dev/null +++ b/paddlespeech/t2s/models/ernie_sat/ernie_sat_updater.py @@ -0,0 +1,158 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
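+# Editorial note (summary of the code below): ErnieSATUpdater and ErnieSATEvaluator
+# drive ERNIE-SAT training and evaluation with the MLMLoss criterion; the reported
+# loss is mlm_loss plus text_mlm_loss when dual (speech + text) masking is enabled,
+# and mlm_loss alone otherwise.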
+import logging +from pathlib import Path + +from paddle import distributed as dist +from paddle.io import DataLoader +from paddle.nn import Layer +from paddle.optimizer import Optimizer +from paddle.optimizer.lr import LRScheduler + +from paddlespeech.t2s.modules.losses import MLMLoss +from paddlespeech.t2s.training.extensions.evaluator import StandardEvaluator +from paddlespeech.t2s.training.reporter import report +from paddlespeech.t2s.training.updaters.standard_updater import StandardUpdater +logging.basicConfig( + format='%(asctime)s [%(levelname)s] [%(filename)s:%(lineno)d] %(message)s', + datefmt='[%Y-%m-%d %H:%M:%S]') +logger = logging.getLogger(__name__) +logger.setLevel(logging.INFO) + + +class ErnieSATUpdater(StandardUpdater): + def __init__(self, + model: Layer, + optimizer: Optimizer, + scheduler: LRScheduler, + dataloader: DataLoader, + init_state=None, + text_masking: bool=False, + odim: int=80, + vocab_size: int=100, + output_dir: Path=None): + super().__init__(model, optimizer, dataloader, init_state=None) + self.scheduler = scheduler + + self.criterion = MLMLoss( + text_masking=text_masking, odim=odim, vocab_size=vocab_size) + + log_file = output_dir / 'worker_{}.log'.format(dist.get_rank()) + self.filehandler = logging.FileHandler(str(log_file)) + logger.addHandler(self.filehandler) + self.logger = logger + self.msg = "" + + def update_core(self, batch): + self.msg = "Rank: {}, ".format(dist.get_rank()) + losses_dict = {} + + before_outs, after_outs, text_outs = self.model( + speech=batch["speech"], + text=batch["text"], + masked_pos=batch["masked_pos"], + speech_mask=batch["speech_mask"], + text_mask=batch["text_mask"], + speech_seg_pos=batch["speech_seg_pos"], + text_seg_pos=batch["text_seg_pos"]) + + mlm_loss, text_mlm_loss = self.criterion( + speech=batch["speech"], + before_outs=before_outs, + after_outs=after_outs, + masked_pos=batch["masked_pos"], + text=batch["text"], + # maybe None + text_outs=text_outs, + # maybe None + text_masked_pos=batch["text_masked_pos"]) + + loss = mlm_loss + text_mlm_loss if text_mlm_loss is not None else mlm_loss + + self.optimizer.clear_grad() + + loss.backward() + self.optimizer.step() + self.scheduler.step() + scheduler_msg = 'lr: {}'.format(self.scheduler.last_lr) + + report("train/loss", float(loss)) + report("train/mlm_loss", float(mlm_loss)) + if text_mlm_loss is not None: + report("train/text_mlm_loss", float(text_mlm_loss)) + losses_dict["text_mlm_loss"] = float(text_mlm_loss) + + losses_dict["mlm_loss"] = float(mlm_loss) + losses_dict["loss"] = float(loss) + self.msg += ', '.join('{}: {:>.6f}'.format(k, v) + for k, v in losses_dict.items()) + self.msg += ', ' + scheduler_msg + + +class ErnieSATEvaluator(StandardEvaluator): + def __init__(self, + model: Layer, + dataloader: DataLoader, + text_masking: bool=False, + odim: int=80, + vocab_size: int=100, + output_dir: Path=None): + super().__init__(model, dataloader) + + log_file = output_dir / 'worker_{}.log'.format(dist.get_rank()) + self.filehandler = logging.FileHandler(str(log_file)) + logger.addHandler(self.filehandler) + self.logger = logger + self.msg = "" + + self.criterion = MLMLoss( + text_masking=text_masking, odim=odim, vocab_size=vocab_size) + + def evaluate_core(self, batch): + self.msg = "Evaluate: " + losses_dict = {} + + before_outs, after_outs, text_outs = self.model( + speech=batch["speech"], + text=batch["text"], + masked_pos=batch["masked_pos"], + speech_mask=batch["speech_mask"], + text_mask=batch["text_mask"], + speech_seg_pos=batch["speech_seg_pos"], + 
text_seg_pos=batch["text_seg_pos"]) + + mlm_loss, text_mlm_loss = self.criterion( + speech=batch["speech"], + before_outs=before_outs, + after_outs=after_outs, + masked_pos=batch["masked_pos"], + text=batch["text"], + # maybe None + text_outs=text_outs, + # maybe None + text_masked_pos=batch["text_masked_pos"]) + loss = mlm_loss + text_mlm_loss if text_mlm_loss is not None else mlm_loss + + report("eval/loss", float(loss)) + report("eval/mlm_loss", float(mlm_loss)) + if text_mlm_loss is not None: + report("eval/text_mlm_loss", float(text_mlm_loss)) + losses_dict["text_mlm_loss"] = float(text_mlm_loss) + + losses_dict["mlm_loss"] = float(mlm_loss) + losses_dict["loss"] = float(loss) + + self.msg += ', '.join('{}: {:>.6f}'.format(k, v) + for k, v in losses_dict.items()) + self.logger.info(self.msg) diff --git a/paddlespeech/t2s/models/ernie_sat/mlm.py b/paddlespeech/t2s/models/ernie_sat/mlm.py new file mode 100644 index 000000000..647fdd9b4 --- /dev/null +++ b/paddlespeech/t2s/models/ernie_sat/mlm.py @@ -0,0 +1,579 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import argparse +from typing import Dict +from typing import List +from typing import Optional + +import paddle +import yaml +from paddle import nn +from yacs.config import CfgNode + +from paddlespeech.t2s.modules.activation import get_activation +from paddlespeech.t2s.modules.conformer.convolution import ConvolutionModule +from paddlespeech.t2s.modules.conformer.encoder_layer import EncoderLayer +from paddlespeech.t2s.modules.layer_norm import LayerNorm +from paddlespeech.t2s.modules.masked_fill import masked_fill +from paddlespeech.t2s.modules.nets_utils import initialize +from paddlespeech.t2s.modules.tacotron2.decoder import Postnet +from paddlespeech.t2s.modules.transformer.attention import LegacyRelPositionMultiHeadedAttention +from paddlespeech.t2s.modules.transformer.attention import MultiHeadedAttention +from paddlespeech.t2s.modules.transformer.attention import RelPositionMultiHeadedAttention +from paddlespeech.t2s.modules.transformer.embedding import LegacyRelPositionalEncoding +from paddlespeech.t2s.modules.transformer.embedding import PositionalEncoding +from paddlespeech.t2s.modules.transformer.embedding import RelPositionalEncoding +from paddlespeech.t2s.modules.transformer.embedding import ScaledPositionalEncoding +from paddlespeech.t2s.modules.transformer.multi_layer_conv import Conv1dLinear +from paddlespeech.t2s.modules.transformer.multi_layer_conv import MultiLayeredConv1d +from paddlespeech.t2s.modules.transformer.positionwise_feed_forward import PositionwiseFeedForward +from paddlespeech.t2s.modules.transformer.repeat import repeat +from paddlespeech.t2s.modules.transformer.subsampling import Conv2dSubsampling + + +# MLM -> Mask Language Model +class mySequential(nn.Sequential): + def forward(self, *inputs): + for module in self._sub_layers.values(): + if type(inputs) == tuple: + inputs = module(*inputs) + else: + inputs = module(inputs) + return 
inputs + + +class MaskInputLayer(nn.Layer): + def __init__(self, out_features: int) -> None: + super().__init__() + self.mask_feature = paddle.create_parameter( + shape=(1, 1, out_features), + dtype=paddle.float32, + default_initializer=paddle.nn.initializer.Assign( + paddle.normal(shape=(1, 1, out_features)))) + + def forward(self, input: paddle.Tensor, + masked_pos: paddle.Tensor=None) -> paddle.Tensor: + masked_pos = paddle.expand_as(paddle.unsqueeze(masked_pos, -1), input) + masked_input = masked_fill(input, masked_pos, 0) + masked_fill( + paddle.expand_as(self.mask_feature, input), ~masked_pos, 0) + return masked_input + + +class MLMEncoder(nn.Layer): + """Conformer encoder module. + + Args: + idim (int): Input dimension. + attention_dim (int): Dimension of attention. + attention_heads (int): The number of heads of multi head attention. + linear_units (int): The number of units of position-wise feed forward. + num_blocks (int): The number of decoder blocks. + dropout_rate (float): Dropout rate. + positional_dropout_rate (float): Dropout rate after adding positional encoding. + attention_dropout_rate (float): Dropout rate in attention. + input_layer (Union[str, paddle.nn.Layer]): Input layer type. + normalize_before (bool): Whether to use layer_norm before the first block. + concat_after (bool): Whether to concat attention layer's input and output. + if True, additional linear will be applied. + i.e. x -> x + linear(concat(x, att(x))) + if False, no additional linear will be applied. i.e. x -> x + att(x) + positionwise_layer_type (str): "linear", "conv1d", or "conv1d-linear". + positionwise_conv_kernel_size (int): Kernel size of positionwise conv1d layer. + macaron_style (bool): Whether to use macaron style for positionwise layer. + pos_enc_layer_type (str): Encoder positional encoding layer type. + selfattention_layer_type (str): Encoder attention layer type. + activation_type (str): Encoder activation function type. + use_cnn_module (bool): Whether to use convolution module. + zero_triu (bool): Whether to zero the upper triangular part of attention matrix. + cnn_module_kernel (int): Kernerl size of convolution module. + padding_idx (int): Padding idx for input_layer=embed. + stochastic_depth_rate (float): Maximum probability to skip the encoder layer. 
+ + """ + + def __init__(self, + idim: int, + vocab_size: int=0, + pre_speech_layer: int=0, + attention_dim: int=256, + attention_heads: int=4, + linear_units: int=2048, + num_blocks: int=6, + dropout_rate: float=0.1, + positional_dropout_rate: float=0.1, + attention_dropout_rate: float=0.0, + input_layer: str="conv2d", + normalize_before: bool=True, + concat_after: bool=False, + positionwise_layer_type: str="linear", + positionwise_conv_kernel_size: int=1, + macaron_style: bool=False, + pos_enc_layer_type: str="abs_pos", + selfattention_layer_type: str="selfattn", + activation_type: str="swish", + use_cnn_module: bool=False, + zero_triu: bool=False, + cnn_module_kernel: int=31, + padding_idx: int=-1, + stochastic_depth_rate: float=0.0, + text_masking: bool=False): + """Construct an Encoder object.""" + super().__init__() + self._output_size = attention_dim + self.text_masking = text_masking + if self.text_masking: + self.text_masking_layer = MaskInputLayer(attention_dim) + activation = get_activation(activation_type) + if pos_enc_layer_type == "abs_pos": + pos_enc_class = PositionalEncoding + elif pos_enc_layer_type == "scaled_abs_pos": + pos_enc_class = ScaledPositionalEncoding + elif pos_enc_layer_type == "rel_pos": + assert selfattention_layer_type == "rel_selfattn" + pos_enc_class = RelPositionalEncoding + elif pos_enc_layer_type == "legacy_rel_pos": + pos_enc_class = LegacyRelPositionalEncoding + assert selfattention_layer_type == "legacy_rel_selfattn" + else: + raise ValueError("unknown pos_enc_layer: " + pos_enc_layer_type) + + self.conv_subsampling_factor = 1 + if input_layer == "linear": + self.embed = nn.Sequential( + nn.Linear(idim, attention_dim), + nn.LayerNorm(attention_dim), + nn.Dropout(dropout_rate), + nn.ReLU(), + pos_enc_class(attention_dim, positional_dropout_rate), ) + elif input_layer == "conv2d": + self.embed = Conv2dSubsampling( + idim, + attention_dim, + dropout_rate, + pos_enc_class(attention_dim, positional_dropout_rate), ) + self.conv_subsampling_factor = 4 + elif input_layer == "embed": + self.embed = nn.Sequential( + nn.Embedding(idim, attention_dim, padding_idx=padding_idx), + pos_enc_class(attention_dim, positional_dropout_rate), ) + elif input_layer == "mlm": + self.segment_emb = None + self.speech_embed = mySequential( + MaskInputLayer(idim), + nn.Linear(idim, attention_dim), + nn.LayerNorm(attention_dim), + nn.ReLU(), + pos_enc_class(attention_dim, positional_dropout_rate)) + self.text_embed = nn.Sequential( + nn.Embedding( + vocab_size, attention_dim, padding_idx=padding_idx), + pos_enc_class(attention_dim, positional_dropout_rate), ) + elif input_layer == "sega_mlm": + self.segment_emb = nn.Embedding( + 500, attention_dim, padding_idx=padding_idx) + self.speech_embed = mySequential( + MaskInputLayer(idim), + nn.Linear(idim, attention_dim), + nn.LayerNorm(attention_dim), + nn.ReLU(), + pos_enc_class(attention_dim, positional_dropout_rate)) + self.text_embed = nn.Sequential( + nn.Embedding( + vocab_size, attention_dim, padding_idx=padding_idx), + pos_enc_class(attention_dim, positional_dropout_rate), ) + elif isinstance(input_layer, nn.Layer): + self.embed = nn.Sequential( + input_layer, + pos_enc_class(attention_dim, positional_dropout_rate), ) + elif input_layer is None: + self.embed = nn.Sequential( + pos_enc_class(attention_dim, positional_dropout_rate)) + else: + raise ValueError("unknown input_layer: " + input_layer) + self.normalize_before = normalize_before + + # self-attention module definition + if selfattention_layer_type == "selfattn": + 
encoder_selfattn_layer = MultiHeadedAttention + encoder_selfattn_layer_args = (attention_heads, attention_dim, + attention_dropout_rate, ) + elif selfattention_layer_type == "legacy_rel_selfattn": + assert pos_enc_layer_type == "legacy_rel_pos" + encoder_selfattn_layer = LegacyRelPositionMultiHeadedAttention + encoder_selfattn_layer_args = (attention_heads, attention_dim, + attention_dropout_rate, ) + elif selfattention_layer_type == "rel_selfattn": + assert pos_enc_layer_type == "rel_pos" + encoder_selfattn_layer = RelPositionMultiHeadedAttention + encoder_selfattn_layer_args = (attention_heads, attention_dim, + attention_dropout_rate, zero_triu, ) + else: + raise ValueError("unknown encoder_attn_layer: " + + selfattention_layer_type) + + # feed-forward module definition + if positionwise_layer_type == "linear": + positionwise_layer = PositionwiseFeedForward + positionwise_layer_args = (attention_dim, linear_units, + dropout_rate, activation, ) + elif positionwise_layer_type == "conv1d": + positionwise_layer = MultiLayeredConv1d + positionwise_layer_args = (attention_dim, linear_units, + positionwise_conv_kernel_size, + dropout_rate, ) + elif positionwise_layer_type == "conv1d-linear": + positionwise_layer = Conv1dLinear + positionwise_layer_args = (attention_dim, linear_units, + positionwise_conv_kernel_size, + dropout_rate, ) + else: + raise NotImplementedError("Support only linear or conv1d.") + + # convolution module definition + convolution_layer = ConvolutionModule + convolution_layer_args = (attention_dim, cnn_module_kernel, activation) + + self.encoders = repeat( + num_blocks, + lambda lnum: EncoderLayer( + attention_dim, + encoder_selfattn_layer(*encoder_selfattn_layer_args), + positionwise_layer(*positionwise_layer_args), + positionwise_layer(*positionwise_layer_args) if macaron_style else None, + convolution_layer(*convolution_layer_args) if use_cnn_module else None, + dropout_rate, + normalize_before, + concat_after, + stochastic_depth_rate * float(1 + lnum) / num_blocks, ), ) + self.pre_speech_layer = pre_speech_layer + self.pre_speech_encoders = repeat( + self.pre_speech_layer, + lambda lnum: EncoderLayer( + attention_dim, + encoder_selfattn_layer(*encoder_selfattn_layer_args), + positionwise_layer(*positionwise_layer_args), + positionwise_layer(*positionwise_layer_args) if macaron_style else None, + convolution_layer(*convolution_layer_args) if use_cnn_module else None, + dropout_rate, + normalize_before, + concat_after, + stochastic_depth_rate * float(1 + lnum) / self.pre_speech_layer, ), + ) + if self.normalize_before: + self.after_norm = LayerNorm(attention_dim) + + def forward(self, + speech: paddle.Tensor, + text: paddle.Tensor, + masked_pos: paddle.Tensor, + speech_mask: paddle.Tensor=None, + text_mask: paddle.Tensor=None, + speech_seg_pos: paddle.Tensor=None, + text_seg_pos: paddle.Tensor=None): + """Encode input sequence. 
+ + """ + if masked_pos is not None: + speech = self.speech_embed(speech, masked_pos) + else: + speech = self.speech_embed(speech) + if text is not None: + text = self.text_embed(text) + if speech_seg_pos is not None and text_seg_pos is not None and self.segment_emb: + speech_seg_emb = self.segment_emb(speech_seg_pos) + text_seg_emb = self.segment_emb(text_seg_pos) + text = (text[0] + text_seg_emb, text[1]) + speech = (speech[0] + speech_seg_emb, speech[1]) + if self.pre_speech_encoders: + speech, _ = self.pre_speech_encoders(speech, speech_mask) + + if text is not None: + xs = paddle.concat([speech[0], text[0]], axis=1) + xs_pos_emb = paddle.concat([speech[1], text[1]], axis=1) + masks = paddle.concat([speech_mask, text_mask], axis=-1) + else: + xs = speech[0] + xs_pos_emb = speech[1] + masks = speech_mask + + xs, masks = self.encoders((xs, xs_pos_emb), masks) + + if isinstance(xs, tuple): + xs = xs[0] + if self.normalize_before: + xs = self.after_norm(xs) + + return xs, masks + + +class MLMDecoder(MLMEncoder): + def forward(self, xs: paddle.Tensor, masks: paddle.Tensor): + """Encode input sequence. + + Args: + xs (paddle.Tensor): Input tensor (#batch, time, idim). + masks (paddle.Tensor): Mask tensor (#batch, time). + + Returns: + paddle.Tensor: Output tensor (#batch, time, attention_dim). + paddle.Tensor: Mask tensor (#batch, time). + + """ + xs = self.embed(xs) + xs, masks = self.encoders(xs, masks) + + if isinstance(xs, tuple): + xs = xs[0] + if self.normalize_before: + xs = self.after_norm(xs) + + return xs, masks + + +# encoder and decoder is nn.Layer, not str +class MLM(nn.Layer): + def __init__(self, + odim: int, + encoder: nn.Layer, + decoder: Optional[nn.Layer], + postnet_layers: int=0, + postnet_chans: int=0, + postnet_filts: int=0, + text_masking: bool=False): + + super().__init__() + self.odim = odim + self.encoder = encoder + self.decoder = decoder + self.vocab_size = encoder.text_embed[0]._num_embeddings + + if self.decoder is None or not (hasattr(self.decoder, + 'output_layer') and + self.decoder.output_layer is not None): + self.sfc = nn.Linear(self.encoder._output_size, odim) + else: + self.sfc = None + if text_masking: + self.text_sfc = nn.Linear( + self.encoder.text_embed[0]._embedding_dim, + self.vocab_size, + weight_attr=self.encoder.text_embed[0]._weight_attr) + else: + self.text_sfc = None + + self.postnet = (None if postnet_layers == 0 else Postnet( + idim=self.encoder._output_size, + odim=odim, + n_layers=postnet_layers, + n_chans=postnet_chans, + n_filts=postnet_filts, + use_batch_norm=True, + dropout_rate=0.5, )) + + def inference( + self, + speech: paddle.Tensor, + text: paddle.Tensor, + masked_pos: paddle.Tensor, + speech_mask: paddle.Tensor, + text_mask: paddle.Tensor, + speech_seg_pos: paddle.Tensor, + text_seg_pos: paddle.Tensor, + span_bdy: List[int], + use_teacher_forcing: bool=False, ) -> Dict[str, paddle.Tensor]: + ''' + Args: + speech (paddle.Tensor): input speech (1, Tmax, D). + text (paddle.Tensor): input text (1, Tmax2). + masked_pos (paddle.Tensor): masked position of input speech (1, Tmax) + speech_mask (paddle.Tensor): mask of speech (1, 1, Tmax). + text_mask (paddle.Tensor): mask of text (1, 1, Tmax2). + speech_seg_pos (paddle.Tensor): n-th phone of each mel, 0<=n<=Tmax2 (1, Tmax). + text_seg_pos (paddle.Tensor): n-th phone of each phone, 0<=n<=Tmax2 (1, Tmax2). 
+ span_bdy (List[int]): masked mel boundary of input speech (2,) + use_teacher_forcing (bool): whether to use teacher forcing + Returns: + List[Tensor]: + eg: + [Tensor(shape=[1, 181, 80]), Tensor(shape=[80, 80]), Tensor(shape=[1, 67, 80])] + ''' + + z_cache = None + if use_teacher_forcing: + before_outs, zs, *_ = self.forward( + speech=speech, + text=text, + masked_pos=masked_pos, + speech_mask=speech_mask, + text_mask=text_mask, + speech_seg_pos=speech_seg_pos, + text_seg_pos=text_seg_pos) + if zs is None: + zs = before_outs + + speech = speech.squeeze(0) + outs = [speech[:span_bdy[0]]] + outs += [zs[0][span_bdy[0]:span_bdy[1]]] + outs += [speech[span_bdy[1]:]] + return outs + return None + + +class MLMEncAsDecoder(MLM): + def forward(self, + speech: paddle.Tensor, + text: paddle.Tensor, + masked_pos: paddle.Tensor, + speech_mask: paddle.Tensor, + text_mask: paddle.Tensor, + speech_seg_pos: paddle.Tensor, + text_seg_pos: paddle.Tensor): + # feats: (Batch, Length, Dim) + # -> encoder_out: (Batch, Length2, Dim2) + encoder_out, h_masks = self.encoder( + speech=speech, + text=text, + masked_pos=masked_pos, + speech_mask=speech_mask, + text_mask=text_mask, + speech_seg_pos=speech_seg_pos, + text_seg_pos=text_seg_pos) + if self.decoder is not None: + zs, _ = self.decoder(encoder_out, h_masks) + else: + zs = encoder_out + speech_hidden_states = zs[:, :paddle.shape(speech)[1], :] + if self.sfc is not None: + before_outs = paddle.reshape( + self.sfc(speech_hidden_states), + (paddle.shape(speech_hidden_states)[0], -1, self.odim)) + else: + before_outs = speech_hidden_states + if self.postnet is not None: + after_outs = before_outs + paddle.transpose( + self.postnet(paddle.transpose(before_outs, [0, 2, 1])), + [0, 2, 1]) + else: + after_outs = None + return before_outs, after_outs, None + + +class MLMDualMaksing(MLM): + def forward(self, + speech: paddle.Tensor, + text: paddle.Tensor, + masked_pos: paddle.Tensor, + speech_mask: paddle.Tensor, + text_mask: paddle.Tensor, + speech_seg_pos: paddle.Tensor, + text_seg_pos: paddle.Tensor): + # feats: (Batch, Length, Dim) + # -> encoder_out: (Batch, Length2, Dim2) + encoder_out, h_masks = self.encoder( + speech=speech, + text=text, + masked_pos=masked_pos, + speech_mask=speech_mask, + text_mask=text_mask, + speech_seg_pos=speech_seg_pos, + text_seg_pos=text_seg_pos) + if self.decoder is not None: + zs, _ = self.decoder(encoder_out, h_masks) + else: + zs = encoder_out + speech_hidden_states = zs[:, :paddle.shape(speech)[1], :] + if self.text_sfc: + text_hiddent_states = zs[:, paddle.shape(speech)[1]:, :] + text_outs = paddle.reshape( + self.text_sfc(text_hiddent_states), + (paddle.shape(text_hiddent_states)[0], -1, self.vocab_size)) + if self.sfc is not None: + before_outs = paddle.reshape( + self.sfc(speech_hidden_states), + (paddle.shape(speech_hidden_states)[0], -1, self.odim)) + else: + before_outs = speech_hidden_states + if self.postnet is not None: + after_outs = before_outs + paddle.transpose( + self.postnet(paddle.transpose(before_outs, [0, 2, 1])), + [0, 2, 1]) + else: + after_outs = None + return before_outs, after_outs, text_outs + + +def build_model_from_file(config_file, model_file): + + state_dict = paddle.load(model_file) + model_class = MLMDualMaksing if 'conformer_combine_vctk_aishell3_dual_masking' in config_file \ + else MLMEncAsDecoder + + # 构建模型 + with open(config_file) as f: + conf = CfgNode(yaml.safe_load(f)) + model = build_model(conf, model_class) + model.set_state_dict(state_dict) + return model, conf + + +# select encoder and 
decoder here +def build_model(args: argparse.Namespace, model_class=MLMEncAsDecoder) -> MLM: + if isinstance(args.token_list, str): + with open(args.token_list, encoding="utf-8") as f: + token_list = [line.rstrip() for line in f] + + # Overwriting token_list to keep it as "portable". + args.token_list = list(token_list) + elif isinstance(args.token_list, (tuple, list)): + token_list = list(args.token_list) + else: + raise RuntimeError("token_list must be str or list") + + vocab_size = len(token_list) + odim = 80 + + # Encoder + encoder_class = MLMEncoder + + if 'text_masking' in args.model_conf.keys() and args.model_conf[ + 'text_masking']: + args.encoder_conf['text_masking'] = True + else: + args.encoder_conf['text_masking'] = False + + encoder = encoder_class( + args.input_size, vocab_size=vocab_size, **args.encoder_conf) + + # Decoder + if args.decoder != 'no_decoder': + decoder_class = MLMDecoder + decoder = decoder_class( + idim=0, + input_layer=None, + **args.decoder_conf, ) + else: + decoder = None + + # Build model + model = model_class( + odim=odim, + encoder=encoder, + decoder=decoder, + **args.model_conf, ) + + # Initialize + if args.init is not None: + initialize(model, args.init) + + return model diff --git a/paddlespeech/t2s/models/fastspeech2/fastspeech2.py b/paddlespeech/t2s/models/fastspeech2/fastspeech2.py index 48595bb25..9905765db 100644 --- a/paddlespeech/t2s/models/fastspeech2/fastspeech2.py +++ b/paddlespeech/t2s/models/fastspeech2/fastspeech2.py @@ -141,80 +141,140 @@ class FastSpeech2(nn.Layer): init_dec_alpha: float=1.0, ): """Initialize FastSpeech2 module. Args: - idim (int): Dimension of the inputs. - odim (int): Dimension of the outputs. - adim (int): Attention dimension. - aheads (int): Number of attention heads. - elayers (int): Number of encoder layers. - eunits (int): Number of encoder hidden units. - dlayers (int): Number of decoder layers. - dunits (int): Number of decoder hidden units. - postnet_layers (int): Number of postnet layers. - postnet_chans (int): Number of postnet channels. - postnet_filts (int): Kernel size of postnet. - postnet_dropout_rate (float): Dropout rate in postnet. - use_scaled_pos_enc (bool): Whether to use trainable scaled pos encoding. - use_batch_norm (bool): Whether to use batch normalization in encoder prenet. - encoder_normalize_before (bool): Whether to apply layernorm layer before encoder block. - decoder_normalize_before (bool): Whether to apply layernorm layer before decoder block. - encoder_concat_after (bool): Whether to concatenate attention layer's input and output in encoder. - decoder_concat_after (bool): Whether to concatenate attention layer's input and output in decoder. - reduction_factor (int): Reduction factor. - encoder_type (str): Encoder type ("transformer" or "conformer"). - decoder_type (str): Decoder type ("transformer" or "conformer"). - transformer_enc_dropout_rate (float): Dropout rate in encoder except attention and positional encoding. - transformer_enc_positional_dropout_rate (float): Dropout rate after encoder positional encoding. - transformer_enc_attn_dropout_rate (float): Dropout rate in encoder self-attention module. - transformer_dec_dropout_rate (float): Dropout rate in decoder except attention & positional encoding. - transformer_dec_positional_dropout_rate (float): Dropout rate after decoder positional encoding. - transformer_dec_attn_dropout_rate (float): Dropout rate in decoder self-attention module. - conformer_pos_enc_layer_type (str): Pos encoding layer type in conformer. 
- conformer_self_attn_layer_type (str): Self-attention layer type in conformer - conformer_activation_type (str): Activation function type in conformer. - use_macaron_style_in_conformer (bool): Whether to use macaron style FFN. - use_cnn_in_conformer (bool): Whether to use CNN in conformer. - zero_triu (bool): Whether to use zero triu in relative self-attention module. - conformer_enc_kernel_size (int): Kernel size of encoder conformer. - conformer_dec_kernel_size (int): Kernel size of decoder conformer. - duration_predictor_layers (int): Number of duration predictor layers. - duration_predictor_chans (int): Number of duration predictor channels. - duration_predictor_kernel_size (int): Kernel size of duration predictor. - duration_predictor_dropout_rate (float): Dropout rate in duration predictor. - pitch_predictor_layers (int): Number of pitch predictor layers. - pitch_predictor_chans (int): Number of pitch predictor channels. - pitch_predictor_kernel_size (int): Kernel size of pitch predictor. - pitch_predictor_dropout_rate (float): Dropout rate in pitch predictor. - pitch_embed_kernel_size (float): Kernel size of pitch embedding. - pitch_embed_dropout_rate (float): Dropout rate for pitch embedding. - stop_gradient_from_pitch_predictor (bool): Whether to stop gradient from pitch predictor to encoder. - energy_predictor_layers (int): Number of energy predictor layers. - energy_predictor_chans (int): Number of energy predictor channels. - energy_predictor_kernel_size (int): Kernel size of energy predictor. - energy_predictor_dropout_rate (float): Dropout rate in energy predictor. - energy_embed_kernel_size (float): Kernel size of energy embedding. - energy_embed_dropout_rate (float): Dropout rate for energy embedding. - stop_gradient_from_energy_predictor(bool): Whether to stop gradient from energy predictor to encoder. - spk_num (Optional[int]): Number of speakers. If not None, assume that the spk_embed_dim is not None, + idim (int): + Dimension of the inputs. + odim (int): + Dimension of the outputs. + adim (int): + Attention dimension. + aheads (int): + Number of attention heads. + elayers (int): + Number of encoder layers. + eunits (int): + Number of encoder hidden units. + dlayers (int): + Number of decoder layers. + dunits (int): + Number of decoder hidden units. + postnet_layers (int): + Number of postnet layers. + postnet_chans (int): + Number of postnet channels. + postnet_filts (int): + Kernel size of postnet. + postnet_dropout_rate (float): + Dropout rate in postnet. + use_scaled_pos_enc (bool): + Whether to use trainable scaled pos encoding. + use_batch_norm (bool): + Whether to use batch normalization in encoder prenet. + encoder_normalize_before (bool): + Whether to apply layernorm layer before encoder block. + decoder_normalize_before (bool): + Whether to apply layernorm layer before decoder block. + encoder_concat_after (bool): + Whether to concatenate attention layer's input and output in encoder. + decoder_concat_after (bool): + Whether to concatenate attention layer's input and output in decoder. + reduction_factor (int): + Reduction factor. + encoder_type (str): + Encoder type ("transformer" or "conformer"). + decoder_type (str): + Decoder type ("transformer" or "conformer"). + transformer_enc_dropout_rate (float): + Dropout rate in encoder except attention and positional encoding. + transformer_enc_positional_dropout_rate (float): + Dropout rate after encoder positional encoding. 
+ transformer_enc_attn_dropout_rate (float): + Dropout rate in encoder self-attention module. + transformer_dec_dropout_rate (float): + Dropout rate in decoder except attention & positional encoding. + transformer_dec_positional_dropout_rate (float): + Dropout rate after decoder positional encoding. + transformer_dec_attn_dropout_rate (float): + Dropout rate in decoder self-attention module. + conformer_pos_enc_layer_type (str): + Pos encoding layer type in conformer. + conformer_self_attn_layer_type (str): + Self-attention layer type in conformer + conformer_activation_type (str): + Activation function type in conformer. + use_macaron_style_in_conformer (bool): + Whether to use macaron style FFN. + use_cnn_in_conformer (bool): + Whether to use CNN in conformer. + zero_triu (bool): + Whether to use zero triu in relative self-attention module. + conformer_enc_kernel_size (int): + Kernel size of encoder conformer. + conformer_dec_kernel_size (int): + Kernel size of decoder conformer. + duration_predictor_layers (int): + Number of duration predictor layers. + duration_predictor_chans (int): + Number of duration predictor channels. + duration_predictor_kernel_size (int): + Kernel size of duration predictor. + duration_predictor_dropout_rate (float): + Dropout rate in duration predictor. + pitch_predictor_layers (int): + Number of pitch predictor layers. + pitch_predictor_chans (int): + Number of pitch predictor channels. + pitch_predictor_kernel_size (int): + Kernel size of pitch predictor. + pitch_predictor_dropout_rate (float): + Dropout rate in pitch predictor. + pitch_embed_kernel_size (float): + Kernel size of pitch embedding. + pitch_embed_dropout_rate (float): + Dropout rate for pitch embedding. + stop_gradient_from_pitch_predictor (bool): + Whether to stop gradient from pitch predictor to encoder. + energy_predictor_layers (int): + Number of energy predictor layers. + energy_predictor_chans (int): + Number of energy predictor channels. + energy_predictor_kernel_size (int): + Kernel size of energy predictor. + energy_predictor_dropout_rate (float): + Dropout rate in energy predictor. + energy_embed_kernel_size (float): + Kernel size of energy embedding. + energy_embed_dropout_rate (float): + Dropout rate for energy embedding. + stop_gradient_from_energy_predictor(bool): + Whether to stop gradient from energy predictor to encoder. + spk_num (Optional[int]): + Number of speakers. If not None, assume that the spk_embed_dim is not None, spk_ids will be provided as the input and use spk_embedding_table. - spk_embed_dim (Optional[int]): Speaker embedding dimension. If not None, + spk_embed_dim (Optional[int]): + Speaker embedding dimension. If not None, assume that spk_emb will be provided as the input or spk_num is not None. - spk_embed_integration_type (str): How to integrate speaker embedding. - tone_num (Optional[int]): Number of tones. If not None, assume that the + spk_embed_integration_type (str): + How to integrate speaker embedding. + tone_num (Optional[int]): + Number of tones. If not None, assume that the tone_ids will be provided as the input and use tone_embedding_table. - tone_embed_dim (Optional[int]): Tone embedding dimension. If not None, assume that tone_num is not None. - tone_embed_integration_type (str): How to integrate tone embedding. - init_type (str): How to initialize transformer parameters. - init_enc_alpha (float): Initial value of alpha in scaled pos encoding of the encoder. - init_dec_alpha (float): Initial value of alpha in scaled pos encoding of the decoder. 
+ tone_embed_dim (Optional[int]): + Tone embedding dimension. If not None, assume that tone_num is not None. + tone_embed_integration_type (str): + How to integrate tone embedding. + init_type (str): + How to initialize transformer parameters. + init_enc_alpha (float): + Initial value of alpha in scaled pos encoding of the encoder. + init_dec_alpha (float): + Initial value of alpha in scaled pos encoding of the decoder. """ assert check_argument_types() super().__init__() # store hyperparameters - self.idim = idim self.odim = odim - self.eos = idim - 1 self.reduction_factor = reduction_factor self.encoder_type = encoder_type self.decoder_type = decoder_type @@ -258,7 +318,6 @@ class FastSpeech2(nn.Layer): padding_idx=self.padding_idx) if encoder_type == "transformer": - print("encoder_type is transformer") self.encoder = TransformerEncoder( idim=idim, attention_dim=adim, @@ -275,7 +334,6 @@ class FastSpeech2(nn.Layer): positionwise_layer_type=positionwise_layer_type, positionwise_conv_kernel_size=positionwise_conv_kernel_size, ) elif encoder_type == "conformer": - print("encoder_type is conformer") self.encoder = ConformerEncoder( idim=idim, attention_dim=adim, @@ -362,7 +420,6 @@ class FastSpeech2(nn.Layer): # NOTE: we use encoder as decoder # because fastspeech's decoder is the same as encoder if decoder_type == "transformer": - print("decoder_type is transformer") self.decoder = TransformerEncoder( idim=0, attention_dim=adim, @@ -380,7 +437,6 @@ class FastSpeech2(nn.Layer): positionwise_layer_type=positionwise_layer_type, positionwise_conv_kernel_size=positionwise_conv_kernel_size, ) elif decoder_type == "conformer": - print("decoder_type is conformer") self.decoder = ConformerEncoder( idim=0, attention_dim=adim, @@ -453,20 +509,29 @@ class FastSpeech2(nn.Layer): """Calculate forward propagation. Args: - text(Tensor(int64)): Batch of padded token ids (B, Tmax). - text_lengths(Tensor(int64)): Batch of lengths of each input (B,). - speech(Tensor): Batch of padded target features (B, Lmax, odim). - speech_lengths(Tensor(int64)): Batch of the lengths of each target (B,). - durations(Tensor(int64)): Batch of padded durations (B, Tmax). - pitch(Tensor): Batch of padded token-averaged pitch (B, Tmax, 1). - energy(Tensor): Batch of padded token-averaged energy (B, Tmax, 1). - tone_id(Tensor, optional(int64)): Batch of padded tone ids (B, Tmax). - spk_emb(Tensor, optional): Batch of speaker embeddings (B, spk_embed_dim). - spk_id(Tnesor, optional(int64)): Batch of speaker ids (B,) + text(Tensor(int64)): + Batch of padded token ids (B, Tmax). + text_lengths(Tensor(int64)): + Batch of lengths of each input (B,). + speech(Tensor): + Batch of padded target features (B, Lmax, odim). + speech_lengths(Tensor(int64)): + Batch of the lengths of each target (B,). + durations(Tensor(int64)): + Batch of padded durations (B, Tmax). + pitch(Tensor): + Batch of padded token-averaged pitch (B, Tmax, 1). + energy(Tensor): + Batch of padded token-averaged energy (B, Tmax, 1). + tone_id(Tensor, optional(int64)): + Batch of padded tone ids (B, Tmax). + spk_emb(Tensor, optional): + Batch of speaker embeddings (B, spk_embed_dim). + spk_id(Tnesor, optional(int64)): + Batch of speaker ids (B,) Returns: - """ # input of embedding must be int64 @@ -662,20 +727,28 @@ class FastSpeech2(nn.Layer): """Generate the sequence of features given the sequences of characters. Args: - text(Tensor(int64)): Input sequence of characters (T,). - durations(Tensor, optional (int64)): Groundtruth of duration (T,). 
- pitch(Tensor, optional): Groundtruth of token-averaged pitch (T, 1). - energy(Tensor, optional): Groundtruth of token-averaged energy (T, 1). - alpha(float, optional): Alpha to control the speed. - use_teacher_forcing(bool, optional): Whether to use teacher forcing. + text(Tensor(int64)): + Input sequence of characters (T,). + durations(Tensor, optional (int64)): + Groundtruth of duration (T,). + pitch(Tensor, optional): + Groundtruth of token-averaged pitch (T, 1). + energy(Tensor, optional): + Groundtruth of token-averaged energy (T, 1). + alpha(float, optional): + Alpha to control the speed. + use_teacher_forcing(bool, optional): + Whether to use teacher forcing. If true, groundtruth of duration, pitch and energy will be used. - spk_emb(Tensor, optional, optional): peaker embedding vector (spk_embed_dim,). (Default value = None) - spk_id(Tensor, optional(int64), optional): spk ids (1,). (Default value = None) - tone_id(Tensor, optional(int64), optional): tone ids (T,). (Default value = None) + spk_emb(Tensor, optional, optional): + Speaker embedding vector (spk_embed_dim,). (Default value = None) + spk_id(Tensor, optional(int64), optional): + spk ids (1,). (Default value = None) + tone_id(Tensor, optional(int64), optional): + tone ids (T,). (Default value = None) Returns: - """ # input of embedding must be int64 x = paddle.cast(text, 'int64') @@ -724,8 +797,10 @@ class FastSpeech2(nn.Layer): """Integrate speaker embedding with hidden states. Args: - hs(Tensor): Batch of hidden state sequences (B, Tmax, adim). - spk_emb(Tensor): Batch of speaker embeddings (B, spk_embed_dim). + hs(Tensor): + Batch of hidden state sequences (B, Tmax, adim). + spk_emb(Tensor): + Batch of speaker embeddings (B, spk_embed_dim). Returns: @@ -749,8 +824,10 @@ class FastSpeech2(nn.Layer): """Integrate speaker embedding with hidden states. Args: - hs(Tensor): Batch of hidden state sequences (B, Tmax, adim). - tone_embs(Tensor): Batch of speaker embeddings (B, Tmax, tone_embed_dim). + hs(Tensor): + Batch of hidden state sequences (B, Tmax, adim). + tone_embs(Tensor): + Batch of tone embeddings (B, Tmax, tone_embed_dim). Returns: @@ -773,10 +850,12 @@ class FastSpeech2(nn.Layer): """Make masks for self-attention. Args: - ilens(Tensor): Batch of lengths (B,). + ilens(Tensor): + Batch of lengths (B,). Returns: - Tensor: Mask tensor for self-attention. dtype=paddle.bool + Tensor: + Mask tensor for self-attention. dtype=paddle.bool Examples: >>> ilens = [5, 3] @@ -858,19 +937,32 @@ class StyleFastSpeech2Inference(FastSpeech2Inference): """ Args: - text(Tensor(int64)): Input sequence of characters (T,). - durations(paddle.Tensor/np.ndarray, optional (int64)): Groundtruth of duration (T,), this will overwrite the set of durations_scale and durations_bias + text(Tensor(int64)): + Input sequence of characters (T,). + durations(paddle.Tensor/np.ndarray, optional (int64)): + Groundtruth of duration (T,), this will overwrite the set of durations_scale and durations_bias durations_scale(int/float, optional): + durations_bias(int/float, optional): - pitch(paddle.Tensor/np.ndarray, optional): Groundtruth of token-averaged pitch (T, 1), this will overwrite the set of pitch_scale and pitch_bias - pitch_scale(int/float, optional): In denormed HZ domain. - pitch_bias(int/float, optional): In denormed HZ domain. - energy(paddle.Tensor/np.ndarray, optional): Groundtruth of token-averaged energy (T, 1), this will overwrite the set of energy_scale and energy_bias - energy_scale(int/float, optional): In denormed domain. 
- energy_bias(int/float, optional): In denormed domain. - robot: bool: (Default value = False) - spk_emb: (Default value = None) - spk_id: (Default value = None) + + pitch(paddle.Tensor/np.ndarray, optional): + Groundtruth of token-averaged pitch (T, 1), this will overwrite the set of pitch_scale and pitch_bias + pitch_scale(int/float, optional): + In denormed HZ domain. + pitch_bias(int/float, optional): + In denormed HZ domain. + energy(paddle.Tensor/np.ndarray, optional): + Groundtruth of token-averaged energy (T, 1), this will overwrite the set of energy_scale and energy_bias + energy_scale(int/float, optional): + In denormed domain. + energy_bias(int/float, optional): + In denormed domain. + robot(bool) (Default value = False): + + spk_emb(Default value = None): + + spk_id(Default value = None): + Returns: Tensor: logmel @@ -949,8 +1041,10 @@ class FastSpeech2Loss(nn.Layer): use_weighted_masking: bool=False): """Initialize feed-forward Transformer loss module. Args: - use_masking (bool): Whether to apply masking for padded part in loss calculation. - use_weighted_masking (bool): Whether to weighted masking in loss calculation. + use_masking (bool): + Whether to apply masking for padded part in loss calculation. + use_weighted_masking (bool): + Whether to weighted masking in loss calculation. """ assert check_argument_types() super().__init__() @@ -982,17 +1076,28 @@ class FastSpeech2Loss(nn.Layer): """Calculate forward propagation. Args: - after_outs(Tensor): Batch of outputs after postnets (B, Lmax, odim). - before_outs(Tensor): Batch of outputs before postnets (B, Lmax, odim). - d_outs(Tensor): Batch of outputs of duration predictor (B, Tmax). - p_outs(Tensor): Batch of outputs of pitch predictor (B, Tmax, 1). - e_outs(Tensor): Batch of outputs of energy predictor (B, Tmax, 1). - ys(Tensor): Batch of target features (B, Lmax, odim). - ds(Tensor): Batch of durations (B, Tmax). - ps(Tensor): Batch of target token-averaged pitch (B, Tmax, 1). - es(Tensor): Batch of target token-averaged energy (B, Tmax, 1). - ilens(Tensor): Batch of the lengths of each input (B,). - olens(Tensor): Batch of the lengths of each target (B,). + after_outs(Tensor): + Batch of outputs after postnets (B, Lmax, odim). + before_outs(Tensor): + Batch of outputs before postnets (B, Lmax, odim). + d_outs(Tensor): + Batch of outputs of duration predictor (B, Tmax). + p_outs(Tensor): + Batch of outputs of pitch predictor (B, Tmax, 1). + e_outs(Tensor): + Batch of outputs of energy predictor (B, Tmax, 1). + ys(Tensor): + Batch of target features (B, Lmax, odim). + ds(Tensor): + Batch of durations (B, Tmax). + ps(Tensor): + Batch of target token-averaged pitch (B, Tmax, 1). + es(Tensor): + Batch of target token-averaged energy (B, Tmax, 1). + ilens(Tensor): + Batch of the lengths of each input (B,). + olens(Tensor): + Batch of the lengths of each target (B,). Returns: diff --git a/paddlespeech/t2s/models/hifigan/hifigan.py b/paddlespeech/t2s/models/hifigan/hifigan.py index bea9dd9a3..7a01840e2 100644 --- a/paddlespeech/t2s/models/hifigan/hifigan.py +++ b/paddlespeech/t2s/models/hifigan/hifigan.py @@ -50,20 +50,34 @@ class HiFiGANGenerator(nn.Layer): init_type: str="xavier_uniform", ): """Initialize HiFiGANGenerator module. Args: - in_channels (int): Number of input channels. - out_channels (int): Number of output channels. - channels (int): Number of hidden representation channels. - global_channels (int): Number of global conditioning channels. - kernel_size (int): Kernel size of initial and final conv layer. 
- upsample_scales (list): List of upsampling scales. - upsample_kernel_sizes (list): List of kernel sizes for upsampling layers. - resblock_kernel_sizes (list): List of kernel sizes for residual blocks. - resblock_dilations (list): List of dilation list for residual blocks. - use_additional_convs (bool): Whether to use additional conv layers in residual blocks. - bias (bool): Whether to add bias parameter in convolution layers. - nonlinear_activation (str): Activation function module name. - nonlinear_activation_params (dict): Hyperparameters for activation function. - use_weight_norm (bool): Whether to use weight norm. + in_channels (int): + Number of input channels. + out_channels (int): + Number of output channels. + channels (int): + Number of hidden representation channels. + global_channels (int): + Number of global conditioning channels. + kernel_size (int): + Kernel size of initial and final conv layer. + upsample_scales (list): + List of upsampling scales. + upsample_kernel_sizes (list): + List of kernel sizes for upsampling layers. + resblock_kernel_sizes (list): + List of kernel sizes for residual blocks. + resblock_dilations (list): + List of dilation list for residual blocks. + use_additional_convs (bool): + Whether to use additional conv layers in residual blocks. + bias (bool): + Whether to add bias parameter in convolution layers. + nonlinear_activation (str): + Activation function module name. + nonlinear_activation_params (dict): + Hyperparameters for activation function. + use_weight_norm (bool): + Whether to use weight norm. If set to true, it will be applied to all of the conv layers. """ super().__init__() @@ -199,9 +213,10 @@ class HiFiGANGenerator(nn.Layer): def inference(self, c, g: Optional[paddle.Tensor]=None): """Perform inference. Args: - c (Tensor): Input tensor (T, in_channels). - normalize_before (bool): Whether to perform normalization. - g (Optional[Tensor]): Global conditioning tensor (global_channels, 1). + c (Tensor): + Input tensor (T, in_channels). + g (Optional[Tensor]): + Global conditioning tensor (global_channels, 1). Returns: Tensor: Output tensor (T ** prod(upsample_scales), out_channels). @@ -233,20 +248,33 @@ class HiFiGANPeriodDiscriminator(nn.Layer): """Initialize HiFiGANPeriodDiscriminator module. Args: - in_channels (int): Number of input channels. - out_channels (int): Number of output channels. - period (int): Period. - kernel_sizes (list): Kernel sizes of initial conv layers and the final conv layer. - channels (int): Number of initial channels. - downsample_scales (list): List of downsampling scales. - max_downsample_channels (int): Number of maximum downsampling channels. - use_additional_convs (bool): Whether to use additional conv layers in residual blocks. - bias (bool): Whether to add bias parameter in convolution layers. - nonlinear_activation (str): Activation function module name. - nonlinear_activation_params (dict): Hyperparameters for activation function. - use_weight_norm (bool): Whether to use weight norm. + in_channels (int): + Number of input channels. + out_channels (int): + Number of output channels. + period (int): + Period. + kernel_sizes (list): + Kernel sizes of initial conv layers and the final conv layer. + channels (int): + Number of initial channels. + downsample_scales (list): + List of downsampling scales. + max_downsample_channels (int): + Number of maximum downsampling channels. + use_additional_convs (bool): + Whether to use additional conv layers in residual blocks. 
+ bias (bool): + Whether to add bias parameter in convolution layers. + nonlinear_activation (str): + Activation function module name. + nonlinear_activation_params (dict): + Hyperparameters for activation function. + use_weight_norm (bool): + Whether to use weight norm. If set to true, it will be applied to all of the conv layers. - use_spectral_norm (bool): Whether to use spectral norm. + use_spectral_norm (bool): + Whether to use spectral norm. If set to true, it will be applied to all of the conv layers. """ super().__init__() @@ -298,7 +326,8 @@ class HiFiGANPeriodDiscriminator(nn.Layer): """Calculate forward propagation. Args: - c (Tensor): Input tensor (B, in_channels, T). + c (Tensor): + Input tensor (B, in_channels, T). Returns: list: List of each layer's tensors. """ @@ -367,8 +396,10 @@ class HiFiGANMultiPeriodDiscriminator(nn.Layer): """Initialize HiFiGANMultiPeriodDiscriminator module. Args: - periods (list): List of periods. - discriminator_params (dict): Parameters for hifi-gan period discriminator module. + periods (list): + List of periods. + discriminator_params (dict): + Parameters for hifi-gan period discriminator module. The period parameter will be overwritten. """ super().__init__() @@ -385,7 +416,8 @@ class HiFiGANMultiPeriodDiscriminator(nn.Layer): """Calculate forward propagation. Args: - x (Tensor): Input noise signal (B, 1, T). + x (Tensor): + Input noise signal (B, 1, T). Returns: List: List of list of each discriminator outputs, which consists of each layer output tensors. """ @@ -417,16 +449,25 @@ class HiFiGANScaleDiscriminator(nn.Layer): """Initilize HiFiGAN scale discriminator module. Args: - in_channels (int): Number of input channels. - out_channels (int): Number of output channels. - kernel_sizes (list): List of four kernel sizes. The first will be used for the first conv layer, + in_channels (int): + Number of input channels. + out_channels (int): + Number of output channels. + kernel_sizes (list): + List of four kernel sizes. The first will be used for the first conv layer, and the second is for downsampling part, and the remaining two are for output layers. - channels (int): Initial number of channels for conv layer. - max_downsample_channels (int): Maximum number of channels for downsampling layers. - bias (bool): Whether to add bias parameter in convolution layers. - downsample_scales (list): List of downsampling scales. - nonlinear_activation (str): Activation function module name. - nonlinear_activation_params (dict): Hyperparameters for activation function. + channels (int): + Initial number of channels for conv layer. + max_downsample_channels (int): + Maximum number of channels for downsampling layers. + bias (bool): + Whether to add bias parameter in convolution layers. + downsample_scales (list): + List of downsampling scales. + nonlinear_activation (str): + Activation function module name. + nonlinear_activation_params (dict): + Hyperparameters for activation function. use_weight_norm (bool): Whether to use weight norm. If set to true, it will be applied to all of the conv layers. use_spectral_norm (bool): Whether to use spectral norm. @@ -614,7 +655,8 @@ class HiFiGANMultiScaleDiscriminator(nn.Layer): """Calculate forward propagation. Args: - x (Tensor): Input noise signal (B, 1, T). + x (Tensor): + Input noise signal (B, 1, T). Returns: List: List of list of each discriminator outputs, which consists of each layer output tensors. 
""" @@ -675,14 +717,21 @@ class HiFiGANMultiScaleMultiPeriodDiscriminator(nn.Layer): """Initilize HiFiGAN multi-scale + multi-period discriminator module. Args: - scales (int): Number of multi-scales. - scale_downsample_pooling (str): Pooling module name for downsampling of the inputs. - scale_downsample_pooling_params (dict): Parameters for the above pooling module. - scale_discriminator_params (dict): Parameters for hifi-gan scale discriminator module. - follow_official_norm (bool): Whether to follow the norm setting of the official implementaion. + scales (int): + Number of multi-scales. + scale_downsample_pooling (str): + Pooling module name for downsampling of the inputs. + scale_downsample_pooling_params (dict): + Parameters for the above pooling module. + scale_discriminator_params (dict): + Parameters for hifi-gan scale discriminator module. + follow_official_norm (bool): + Whether to follow the norm setting of the official implementaion. The first discriminator uses spectral norm and the other discriminators use weight norm. - periods (list): List of periods. - period_discriminator_params (dict): Parameters for hifi-gan period discriminator module. + periods (list): + List of periods. + period_discriminator_params (dict): + Parameters for hifi-gan period discriminator module. The period parameter will be overwritten. """ super().__init__() @@ -704,7 +753,8 @@ class HiFiGANMultiScaleMultiPeriodDiscriminator(nn.Layer): """Calculate forward propagation. Args: - x (Tensor): Input noise signal (B, 1, T). + x (Tensor): + Input noise signal (B, 1, T). Returns: List: List of list of each discriminator outputs, diff --git a/paddlespeech/t2s/models/melgan/melgan.py b/paddlespeech/t2s/models/melgan/melgan.py index 22d8fd9e7..058cf40d9 100644 --- a/paddlespeech/t2s/models/melgan/melgan.py +++ b/paddlespeech/t2s/models/melgan/melgan.py @@ -53,24 +53,38 @@ class MelGANGenerator(nn.Layer): """Initialize MelGANGenerator module. Args: - in_channels (int): Number of input channels. - out_channels (int): Number of output channels, + in_channels (int): + Number of input channels. + out_channels (int): + Number of output channels, the number of sub-band is out_channels in multi-band melgan. - kernel_size (int): Kernel size of initial and final conv layer. - channels (int): Initial number of channels for conv layer. - bias (bool): Whether to add bias parameter in convolution layers. - upsample_scales (List[int]): List of upsampling scales. - stack_kernel_size (int): Kernel size of dilated conv layers in residual stack. - stacks (int): Number of stacks in a single residual stack. - nonlinear_activation (Optional[str], optional): Non linear activation in upsample network, by default None - nonlinear_activation_params (Dict[str, Any], optional): Parameters passed to the linear activation in the upsample network, - by default {} - pad (str): Padding function module name before dilated convolution layer. - pad_params (dict): Hyperparameters for padding function. - use_final_nonlinear_activation (nn.Layer): Activation function for the final layer. - use_weight_norm (bool): Whether to use weight norm. + kernel_size (int): + Kernel size of initial and final conv layer. + channels (int): + Initial number of channels for conv layer. + bias (bool): + Whether to add bias parameter in convolution layers. + upsample_scales (List[int]): + List of upsampling scales. + stack_kernel_size (int): + Kernel size of dilated conv layers in residual stack. + stacks (int): + Number of stacks in a single residual stack. 
+ nonlinear_activation (Optional[str], optional): + Non linear activation in upsample network, by default None + nonlinear_activation_params (Dict[str, Any], optional): + Parameters passed to the linear activation in the upsample network, by default {} + pad (str): + Padding function module name before dilated convolution layer. + pad_params (dict): + Hyperparameters for padding function. + use_final_nonlinear_activation (nn.Layer): + Activation function for the final layer. + use_weight_norm (bool): + Whether to use weight norm. If set to true, it will be applied to all of the conv layers. - use_causal_conv (bool): Whether to use causal convolution. + use_causal_conv (bool): + Whether to use causal convolution. """ super().__init__() @@ -194,7 +208,8 @@ class MelGANGenerator(nn.Layer): """Calculate forward propagation. Args: - c (Tensor): Input tensor (B, in_channels, T). + c (Tensor): + Input tensor (B, in_channels, T). Returns: Tensor: Output tensor (B, out_channels, T ** prod(upsample_scales)). """ @@ -244,7 +259,8 @@ class MelGANGenerator(nn.Layer): """Perform inference. Args: - c (Union[Tensor, ndarray]): Input tensor (T, in_channels). + c (Union[Tensor, ndarray]): + Input tensor (T, in_channels). Returns: Tensor: Output tensor (out_channels*T ** prod(upsample_scales), 1). """ @@ -279,20 +295,30 @@ class MelGANDiscriminator(nn.Layer): """Initilize MelGAN discriminator module. Args: - in_channels (int): Number of input channels. - out_channels (int): Number of output channels. + in_channels (int): + Number of input channels. + out_channels (int): + Number of output channels. kernel_sizes (List[int]): List of two kernel sizes. The prod will be used for the first conv layer, and the first and the second kernel sizes will be used for the last two layers. For example if kernel_sizes = [5, 3], the first layer kernel size will be 5 * 3 = 15, the last two layers' kernel size will be 5 and 3, respectively. - channels (int): Initial number of channels for conv layer. - max_downsample_channels (int): Maximum number of channels for downsampling layers. - bias (bool): Whether to add bias parameter in convolution layers. - downsample_scales (List[int]): List of downsampling scales. - nonlinear_activation (str): Activation function module name. - nonlinear_activation_params (dict): Hyperparameters for activation function. - pad (str): Padding function module name before dilated convolution layer. - pad_params (dict): Hyperparameters for padding function. + channels (int): + Initial number of channels for conv layer. + max_downsample_channels (int): + Maximum number of channels for downsampling layers. + bias (bool): + Whether to add bias parameter in convolution layers. + downsample_scales (List[int]): + List of downsampling scales. + nonlinear_activation (str): + Activation function module name. + nonlinear_activation_params (dict): + Hyperparameters for activation function. + pad (str): + Padding function module name before dilated convolution layer. + pad_params (dict): + Hyperparameters for padding function. """ super().__init__() @@ -364,7 +390,8 @@ class MelGANDiscriminator(nn.Layer): def forward(self, x): """Calculate forward propagation. Args: - x (Tensor): Input noise signal (B, 1, T). + x (Tensor): + Input noise signal (B, 1, T). Returns: List: List of output tensors of each layer (for feat_match_loss). """ @@ -406,22 +433,37 @@ class MelGANMultiScaleDiscriminator(nn.Layer): """Initilize MelGAN multi-scale discriminator module. Args: - in_channels (int): Number of input channels. 
- out_channels (int): Number of output channels. - scales (int): Number of multi-scales. - downsample_pooling (str): Pooling module name for downsampling of the inputs. - downsample_pooling_params (dict): Parameters for the above pooling module. - kernel_sizes (List[int]): List of two kernel sizes. The sum will be used for the first conv layer, + in_channels (int): + Number of input channels. + out_channels (int): + Number of output channels. + scales (int): + Number of multi-scales. + downsample_pooling (str): + Pooling module name for downsampling of the inputs. + downsample_pooling_params (dict): + Parameters for the above pooling module. + kernel_sizes (List[int]): + List of two kernel sizes. The sum will be used for the first conv layer, and the first and the second kernel sizes will be used for the last two layers. - channels (int): Initial number of channels for conv layer. - max_downsample_channels (int): Maximum number of channels for downsampling layers. - bias (bool): Whether to add bias parameter in convolution layers. - downsample_scales (List[int]): List of downsampling scales. - nonlinear_activation (str): Activation function module name. - nonlinear_activation_params (dict): Hyperparameters for activation function. - pad (str): Padding function module name before dilated convolution layer. - pad_params (dict): Hyperparameters for padding function. - use_causal_conv (bool): Whether to use causal convolution. + channels (int): + Initial number of channels for conv layer. + max_downsample_channels (int): + Maximum number of channels for downsampling layers. + bias (bool): + Whether to add bias parameter in convolution layers. + downsample_scales (List[int]): + List of downsampling scales. + nonlinear_activation (str): + Activation function module name. + nonlinear_activation_params (dict): + Hyperparameters for activation function. + pad (str): + Padding function module name before dilated convolution layer. + pad_params (dict): + Hyperparameters for padding function. + use_causal_conv (bool): + Whether to use causal convolution. """ super().__init__() @@ -464,7 +506,8 @@ class MelGANMultiScaleDiscriminator(nn.Layer): def forward(self, x): """Calculate forward propagation. Args: - x (Tensor): Input noise signal (B, 1, T). + x (Tensor): + Input noise signal (B, 1, T). Returns: List: List of list of each discriminator outputs, which consists of each layer output tensors. """ diff --git a/paddlespeech/t2s/models/melgan/style_melgan.py b/paddlespeech/t2s/models/melgan/style_melgan.py index 40a2f1009..d902a4b01 100644 --- a/paddlespeech/t2s/models/melgan/style_melgan.py +++ b/paddlespeech/t2s/models/melgan/style_melgan.py @@ -54,20 +54,34 @@ class StyleMelGANGenerator(nn.Layer): """Initilize Style MelGAN generator. Args: - in_channels (int): Number of input noise channels. - aux_channels (int): Number of auxiliary input channels. - channels (int): Number of channels for conv layer. - out_channels (int): Number of output channels. - kernel_size (int): Kernel size of conv layers. - dilation (int): Dilation factor for conv layers. - bias (bool): Whether to add bias parameter in convolution layers. - noise_upsample_scales (list): List of noise upsampling scales. - noise_upsample_activation (str): Activation function module name for noise upsampling. - noise_upsample_activation_params (dict): Hyperparameters for the above activation function. - upsample_scales (list): List of upsampling scales. - upsample_mode (str): Upsampling mode in TADE layer. 
- gated_function (str): Gated function in TADEResBlock ("softmax" or "sigmoid"). - use_weight_norm (bool): Whether to use weight norm. + in_channels (int): + Number of input noise channels. + aux_channels (int): + Number of auxiliary input channels. + channels (int): + Number of channels for conv layer. + out_channels (int): + Number of output channels. + kernel_size (int): + Kernel size of conv layers. + dilation (int): + Dilation factor for conv layers. + bias (bool): + Whether to add bias parameter in convolution layers. + noise_upsample_scales (list): + List of noise upsampling scales. + noise_upsample_activation (str): + Activation function module name for noise upsampling. + noise_upsample_activation_params (dict): + Hyperparameters for the above activation function. + upsample_scales (list): + List of upsampling scales. + upsample_mode (str): + Upsampling mode in TADE layer. + gated_function (str): + Gated function in TADEResBlock ("softmax" or "sigmoid"). + use_weight_norm (bool): + Whether to use weight norm. If set to true, it will be applied to all of the conv layers. """ super().__init__() @@ -194,7 +208,8 @@ class StyleMelGANGenerator(nn.Layer): def inference(self, c): """Perform inference. Args: - c (Tensor): Input tensor (T, in_channels). + c (Tensor): + Input tensor (T, in_channels). Returns: Tensor: Output tensor (T ** prod(upsample_scales), out_channels). """ @@ -258,11 +273,16 @@ class StyleMelGANDiscriminator(nn.Layer): """Initilize Style MelGAN discriminator. Args: - repeats (int): Number of repititons to apply RWD. - window_sizes (list): List of random window sizes. - pqmf_params (list): List of list of Parameters for PQMF modules - discriminator_params (dict): Parameters for base discriminator module. - use_weight_nom (bool): Whether to apply weight normalization. + repeats (int): + Number of repetitions to apply RWD. + window_sizes (list): + List of random window sizes. + pqmf_params (list): + List of lists of parameters for PQMF modules. + discriminator_params (dict): + Parameters for base discriminator module. + use_weight_norm (bool): + Whether to apply weight normalization. """ super().__init__() @@ -299,7 +319,8 @@ class StyleMelGANDiscriminator(nn.Layer): def forward(self, x): """Calculate forward propagation. Args: - x (Tensor): Input tensor (B, 1, T). + x (Tensor): + Input tensor (B, 1, T). Returns: List: List of discriminator outputs, #items in the list will be equal to repeats * #discriminators. 
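Taken together, the MelGAN, StyleMelGAN and HiFiGAN hunks above document one vocoder inference contract: the generator's inference() consumes a mel-spectrogram tensor of shape (T, in_channels) and returns a waveform of shape (T * prod(upsample_scales), out_channels). The snippet below is only a minimal usage sketch of that documented interface; the import path follows the file paths in this diff, and the assumption that the default constructor arguments (e.g. 80 mel channels in) are directly usable is not verified here.

import paddle
from paddlespeech.t2s.models.hifigan.hifigan import HiFiGANGenerator

# Generator built from its default hyperparameters (assumed usable for this sketch).
generator = HiFiGANGenerator()
generator.eval()

# Dummy mel spectrogram of shape (T, in_channels); in practice it comes from an acoustic model.
mel = paddle.randn([100, 80])

with paddle.no_grad():
    wav = generator.inference(mel)  # (T * prod(upsample_scales), out_channels)
print(wav.shape)
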
diff --git a/paddlespeech/t2s/models/parallel_wavegan/parallel_wavegan.py b/paddlespeech/t2s/models/parallel_wavegan/parallel_wavegan.py index cc8460e4d..be306d9cc 100644 --- a/paddlespeech/t2s/models/parallel_wavegan/parallel_wavegan.py +++ b/paddlespeech/t2s/models/parallel_wavegan/parallel_wavegan.py @@ -32,29 +32,45 @@ class PWGGenerator(nn.Layer): """Wave Generator for Parallel WaveGAN Args: - in_channels (int, optional): Number of channels of the input waveform, by default 1 - out_channels (int, optional): Number of channels of the output waveform, by default 1 - kernel_size (int, optional): Kernel size of the residual blocks inside, by default 3 - layers (int, optional): Number of residual blocks inside, by default 30 - stacks (int, optional): The number of groups to split the residual blocks into, by default 3 + in_channels (int, optional): + Number of channels of the input waveform, by default 1 + out_channels (int, optional): + Number of channels of the output waveform, by default 1 + kernel_size (int, optional): + Kernel size of the residual blocks inside, by default 3 + layers (int, optional): + Number of residual blocks inside, by default 30 + stacks (int, optional): + The number of groups to split the residual blocks into, by default 3 Within each group, the dilation of the residual block grows exponentially. - residual_channels (int, optional): Residual channel of the residual blocks, by default 64 - gate_channels (int, optional): Gate channel of the residual blocks, by default 128 - skip_channels (int, optional): Skip channel of the residual blocks, by default 64 - aux_channels (int, optional): Auxiliary channel of the residual blocks, by default 80 - aux_context_window (int, optional): The context window size of the first convolution applied to the - auxiliary input, by default 2 - dropout (float, optional): Dropout of the residual blocks, by default 0. - bias (bool, optional): Whether to use bias in residual blocks, by default True - use_weight_norm (bool, optional): Whether to use weight norm in all convolutions, by default True - use_causal_conv (bool, optional): Whether to use causal padding in the upsample network and residual - blocks, by default False - upsample_scales (List[int], optional): Upsample scales of the upsample network, by default [4, 4, 4, 4] - nonlinear_activation (Optional[str], optional): Non linear activation in upsample network, by default None - nonlinear_activation_params (Dict[str, Any], optional): Parameters passed to the linear activation in the upsample network, - by default {} - interpolate_mode (str, optional): Interpolation mode of the upsample network, by default "nearest" - freq_axis_kernel_size (int, optional): Kernel size along the frequency axis of the upsample network, by default 1 + residual_channels (int, optional): + Residual channel of the residual blocks, by default 64 + gate_channels (int, optional): + Gate channel of the residual blocks, by default 128 + skip_channels (int, optional): + Skip channel of the residual blocks, by default 64 + aux_channels (int, optional): + Auxiliary channel of the residual blocks, by default 80 + aux_context_window (int, optional): + The context window size of the first convolution applied to the auxiliary input, by default 2 + dropout (float, optional): + Dropout of the residual blocks, by default 0. 
+ bias (bool, optional): + Whether to use bias in residual blocks, by default True + use_weight_norm (bool, optional): + Whether to use weight norm in all convolutions, by default True + use_causal_conv (bool, optional): + Whether to use causal padding in the upsample network and residual blocks, by default False + upsample_scales (List[int], optional): + Upsample scales of the upsample network, by default [4, 4, 4, 4] + nonlinear_activation (Optional[str], optional): + Non linear activation in upsample network, by default None + nonlinear_activation_params (Dict[str, Any], optional): + Parameters passed to the linear activation in the upsample network, by default {} + interpolate_mode (str, optional): + Interpolation mode of the upsample network, by default "nearest" + freq_axis_kernel_size (int, optional): + Kernel size along the frequency axis of the upsample network, by default 1 """ def __init__( @@ -147,9 +163,11 @@ class PWGGenerator(nn.Layer): """Generate waveform. Args: - x(Tensor): Shape (N, C_in, T), The input waveform. - c(Tensor): Shape (N, C_aux, T'). The auxiliary input (e.g. spectrogram). It - is upsampled to match the time resolution of the input. + x(Tensor): + Shape (N, C_in, T), The input waveform. + c(Tensor): + Shape (N, C_aux, T'). The auxiliary input (e.g. spectrogram). + It is upsampled to match the time resolution of the input. Returns: Tensor: Shape (N, C_out, T), the generated waveform. @@ -195,8 +213,10 @@ class PWGGenerator(nn.Layer): """Waveform generation. This function is used for single instance inference. Args: - c(Tensor, optional, optional): Shape (T', C_aux), the auxiliary input, by default None - x(Tensor, optional): Shape (T, C_in), the noise waveform, by default None + c(Tensor, optional, optional): + Shape (T', C_aux), the auxiliary input, by default None + x(Tensor, optional): + Shape (T, C_in), the noise waveform, by default None Returns: Tensor: Shape (T, C_out), the generated waveform @@ -214,20 +234,28 @@ class PWGDiscriminator(nn.Layer): """A convolutional discriminator for audio. 
Args: - in_channels (int, optional): Number of channels of the input audio, by default 1 - out_channels (int, optional): Output feature size, by default 1 - kernel_size (int, optional): Kernel size of convolutional sublayers, by default 3 - layers (int, optional): Number of layers, by default 10 - conv_channels (int, optional): Feature size of the convolutional sublayers, by default 64 - dilation_factor (int, optional): The factor with which dilation of each convolutional sublayers grows + in_channels (int, optional): + Number of channels of the input audio, by default 1 + out_channels (int, optional): + Output feature size, by default 1 + kernel_size (int, optional): + Kernel size of convolutional sublayers, by default 3 + layers (int, optional): + Number of layers, by default 10 + conv_channels (int, optional): + Feature size of the convolutional sublayers, by default 64 + dilation_factor (int, optional): + The factor with which dilation of each convolutional sublayers grows exponentially if it is greater than 1, else the dilation of each convolutional sublayers grows linearly, by default 1 - nonlinear_activation (str, optional): The activation after each convolutional sublayer, by default "leakyrelu" - nonlinear_activation_params (Dict[str, Any], optional): The parameters passed to the activation's initializer, by default - {"negative_slope": 0.2} - bias (bool, optional): Whether to use bias in convolutional sublayers, by default True - use_weight_norm (bool, optional): Whether to use weight normalization at all convolutional sublayers, - by default True + nonlinear_activation (str, optional): + The activation after each convolutional sublayer, by default "leakyrelu" + nonlinear_activation_params (Dict[str, Any], optional): + The parameters passed to the activation's initializer, by default {"negative_slope": 0.2} + bias (bool, optional): + Whether to use bias in convolutional sublayers, by default True + use_weight_norm (bool, optional): + Whether to use weight normalization at all convolutional sublayers, by default True """ def __init__( @@ -290,7 +318,8 @@ class PWGDiscriminator(nn.Layer): """ Args: - x (Tensor): Shape (N, in_channels, num_samples), the input audio. + x (Tensor): + Shape (N, in_channels, num_samples), the input audio. Returns: Tensor: Shape (N, out_channels, num_samples), the predicted logits. @@ -318,24 +347,35 @@ class ResidualPWGDiscriminator(nn.Layer): """A wavenet-style discriminator for audio. 
Args: - in_channels (int, optional): Number of channels of the input audio, by default 1 - out_channels (int, optional): Output feature size, by default 1 - kernel_size (int, optional): Kernel size of residual blocks, by default 3 - layers (int, optional): Number of residual blocks, by default 30 - stacks (int, optional): Number of groups of residual blocks, within which the dilation + in_channels (int, optional): + Number of channels of the input audio, by default 1 + out_channels (int, optional): + Output feature size, by default 1 + kernel_size (int, optional): + Kernel size of residual blocks, by default 3 + layers (int, optional): + Number of residual blocks, by default 30 + stacks (int, optional): + Number of groups of residual blocks, within which the dilation of each residual blocks grows exponentially, by default 3 - residual_channels (int, optional): Residual channels of residual blocks, by default 64 - gate_channels (int, optional): Gate channels of residual blocks, by default 128 - skip_channels (int, optional): Skip channels of residual blocks, by default 64 - dropout (float, optional): Dropout probability of residual blocks, by default 0. - bias (bool, optional): Whether to use bias in residual blocks, by default True - use_weight_norm (bool, optional): Whether to use weight normalization in all convolutional layers, - by default True - use_causal_conv (bool, optional): Whether to use causal convolution in residual blocks, by default False - nonlinear_activation (str, optional): Activation after convolutions other than those in residual blocks, - by default "leakyrelu" - nonlinear_activation_params (Dict[str, Any], optional): Parameters to pass to the activation, - by default {"negative_slope": 0.2} + residual_channels (int, optional): + Residual channels of residual blocks, by default 64 + gate_channels (int, optional): + Gate channels of residual blocks, by default 128 + skip_channels (int, optional): + Skip channels of residual blocks, by default 64 + dropout (float, optional): + Dropout probability of residual blocks, by default 0. + bias (bool, optional): + Whether to use bias in residual blocks, by default True + use_weight_norm (bool, optional): + Whether to use weight normalization in all convolutional layers, by default True + use_causal_conv (bool, optional): + Whether to use causal convolution in residual blocks, by default False + nonlinear_activation (str, optional): + Activation after convolutions other than those in residual blocks, by default "leakyrelu" + nonlinear_activation_params (Dict[str, Any], optional): + Parameters to pass to the activation, by default {"negative_slope": 0.2} """ def __init__( @@ -405,7 +445,8 @@ class ResidualPWGDiscriminator(nn.Layer): def forward(self, x): """ Args: - x(Tensor): Shape (N, in_channels, num_samples), the input audio.↩ + x(Tensor): + Shape (N, in_channels, num_samples), the input audio.↩ Returns: Tensor: Shape (N, out_channels, num_samples), the predicted logits. diff --git a/paddlespeech/t2s/models/speedyspeech/speedyspeech.py b/paddlespeech/t2s/models/speedyspeech/speedyspeech.py index ed7c0b7e4..395ad6917 100644 --- a/paddlespeech/t2s/models/speedyspeech/speedyspeech.py +++ b/paddlespeech/t2s/models/speedyspeech/speedyspeech.py @@ -29,10 +29,14 @@ class ResidualBlock(nn.Layer): n: int=2): """SpeedySpeech encoder module. Args: - channels (int, optional): Feature size of the residual output(and also the input). - kernel_size (int, optional): Kernel size of the 1D convolution. 
- dilation (int, optional): Dilation of the 1D convolution. - n (int): Number of blocks. + channels (int, optional): + Feature size of the residual output(and also the input). + kernel_size (int, optional): + Kernel size of the 1D convolution. + dilation (int, optional): + Dilation of the 1D convolution. + n (int): + Number of blocks. """ super().__init__() @@ -57,7 +61,8 @@ class ResidualBlock(nn.Layer): def forward(self, x: paddle.Tensor): """Calculate forward propagation. Args: - x(Tensor): Batch of input sequences (B, hidden_size, Tmax). + x(Tensor): + Batch of input sequences (B, hidden_size, Tmax). Returns: Tensor: The residual output (B, hidden_size, Tmax). """ @@ -89,8 +94,10 @@ class TextEmbedding(nn.Layer): def forward(self, text: paddle.Tensor, tone: paddle.Tensor=None): """Calculate forward propagation. Args: - text(Tensor(int64)): Batch of padded token ids (B, Tmax). - tones(Tensor, optional(int64)): Batch of padded tone ids (B, Tmax). + text(Tensor(int64)): + Batch of padded token ids (B, Tmax). + tones(Tensor, optional(int64)): + Batch of padded tone ids (B, Tmax). Returns: Tensor: The residual output (B, Tmax, embedding_size). """ @@ -109,12 +116,18 @@ class TextEmbedding(nn.Layer): class SpeedySpeechEncoder(nn.Layer): """SpeedySpeech encoder module. Args: - vocab_size (int): Dimension of the inputs. - tone_size (Optional[int]): Number of tones. - hidden_size (int): Number of encoder hidden units. - kernel_size (int): Kernel size of encoder. - dilations (List[int]): Dilations of encoder. - spk_num (Optional[int]): Number of speakers. + vocab_size (int): + Dimension of the inputs. + tone_size (Optional[int]): + Number of tones. + hidden_size (int): + Number of encoder hidden units. + kernel_size (int): + Kernel size of encoder. + dilations (List[int]): + Dilations of encoder. + spk_num (Optional[int]): + Number of speakers. """ def __init__(self, @@ -161,9 +174,12 @@ class SpeedySpeechEncoder(nn.Layer): spk_id: paddle.Tensor=None): """Encoder input sequence. Args: - text(Tensor(int64)): Batch of padded token ids (B, Tmax). - tones(Tensor, optional(int64)): Batch of padded tone ids (B, Tmax). - spk_id(Tnesor, optional(int64)): Batch of speaker ids (B,) + text(Tensor(int64)): + Batch of padded token ids (B, Tmax). + tones(Tensor, optional(int64)): + Batch of padded tone ids (B, Tmax). + spk_id(Tnesor, optional(int64)): + Batch of speaker ids (B,) Returns: Tensor: Output tensor (B, Tmax, hidden_size). @@ -192,7 +208,8 @@ class DurationPredictor(nn.Layer): def forward(self, x: paddle.Tensor): """Calculate forward propagation. Args: - x(Tensor): Batch of input sequences (B, Tmax, hidden_size). + x(Tensor): + Batch of input sequences (B, Tmax, hidden_size). Returns: Tensor: Batch of predicted durations in log domain (B, Tmax). @@ -212,10 +229,14 @@ class SpeedySpeechDecoder(nn.Layer): ]): """SpeedySpeech decoder module. Args: - hidden_size (int): Number of decoder hidden units. - kernel_size (int): Kernel size of decoder. - output_size (int): Dimension of the outputs. - dilations (List[int]): Dilations of decoder. + hidden_size (int): + Number of decoder hidden units. + kernel_size (int): + Kernel size of decoder. + output_size (int): + Dimension of the outputs. + dilations (List[int]): + Dilations of decoder. """ super().__init__() res_blocks = [ @@ -230,7 +251,8 @@ class SpeedySpeechDecoder(nn.Layer): def forward(self, x): """Decoder input sequence. Args: - x(Tensor): Input tensor (B, time, hidden_size). + x(Tensor): + Input tensor (B, time, hidden_size). 
Returns: Tensor: Output tensor (B, time, output_size). @@ -261,18 +283,30 @@ class SpeedySpeech(nn.Layer): positional_dropout_rate: int=0.1): """Initialize SpeedySpeech module. Args: - vocab_size (int): Dimension of the inputs. - encoder_hidden_size (int): Number of encoder hidden units. - encoder_kernel_size (int): Kernel size of encoder. - encoder_dilations (List[int]): Dilations of encoder. - duration_predictor_hidden_size (int): Number of duration predictor hidden units. - decoder_hidden_size (int): Number of decoder hidden units. - decoder_kernel_size (int): Kernel size of decoder. - decoder_dilations (List[int]): Dilations of decoder. - decoder_output_size (int): Dimension of the outputs. - tone_size (Optional[int]): Number of tones. - spk_num (Optional[int]): Number of speakers. - init_type (str): How to initialize transformer parameters. + vocab_size (int): + Dimension of the inputs. + encoder_hidden_size (int): + Number of encoder hidden units. + encoder_kernel_size (int): + Kernel size of encoder. + encoder_dilations (List[int]): + Dilations of encoder. + duration_predictor_hidden_size (int): + Number of duration predictor hidden units. + decoder_hidden_size (int): + Number of decoder hidden units. + decoder_kernel_size (int): + Kernel size of decoder. + decoder_dilations (List[int]): + Dilations of decoder. + decoder_output_size (int): + Dimension of the outputs. + tone_size (Optional[int]): + Number of tones. + spk_num (Optional[int]): + Number of speakers. + init_type (str): + How to initialize transformer parameters. """ super().__init__() @@ -304,14 +338,20 @@ class SpeedySpeech(nn.Layer): spk_id: paddle.Tensor=None): """Calculate forward propagation. Args: - text(Tensor(int64)): Batch of padded token ids (B, Tmax). - durations(Tensor(int64)): Batch of padded durations (B, Tmax). - tones(Tensor, optional(int64)): Batch of padded tone ids (B, Tmax). - spk_id(Tnesor, optional(int64)): Batch of speaker ids (B,) + text(Tensor(int64)): + Batch of padded token ids (B, Tmax). + durations(Tensor(int64)): + Batch of padded durations (B, Tmax). + tones(Tensor, optional(int64)): + Batch of padded tone ids (B, Tmax). + spk_id(Tnesor, optional(int64)): + Batch of speaker ids (B,) Returns: - Tensor: Output tensor (B, T_frames, decoder_output_size). - Tensor: Predicted durations (B, Tmax). + Tensor: + Output tensor (B, T_frames, decoder_output_size). + Tensor: + Predicted durations (B, Tmax). """ # input of embedding must be int64 text = paddle.cast(text, 'int64') @@ -336,10 +376,14 @@ class SpeedySpeech(nn.Layer): spk_id: paddle.Tensor=None): """Generate the sequence of features given the sequences of characters. Args: - text(Tensor(int64)): Input sequence of characters (T,). - tones(Tensor, optional(int64)): Batch of padded tone ids (T, ). - durations(Tensor, optional (int64)): Groundtruth of duration (T,). - spk_id(Tensor, optional(int64), optional): spk ids (1,). (Default value = None) + text(Tensor(int64)): + Input sequence of characters (T,). + tones(Tensor, optional(int64)): + Batch of padded tone ids (T, ). + durations(Tensor, optional (int64)): + Groundtruth of duration (T,). + spk_id(Tensor, optional(int64), optional): + spk ids (1,). (Default value = None) Returns: Tensor: logmel (T, decoder_output_size). 
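The SpeedySpeech.inference docstring above reduces to a simple contract: int64 token ids of shape (T,) in, a log-mel spectrogram of shape (T_frames, decoder_output_size) out, with tones, ground-truth durations and speaker ids as optional inputs. The sketch below only illustrates that contract; the vocab_size and tone_size values are made-up placeholders, and it assumes the remaining constructor arguments can be left at their defaults, which this diff does not show in full.

import paddle
from paddlespeech.t2s.models.speedyspeech.speedyspeech import SpeedySpeech

# Tiny model with placeholder vocabulary/tone sizes; real values come from the dataset's phone and tone dictionaries.
model = SpeedySpeech(vocab_size=10, tone_size=5)
model.eval()

phone_ids = paddle.to_tensor([1, 4, 2, 7], dtype='int64')  # (T,) token ids
tone_ids = paddle.to_tensor([0, 1, 2, 0], dtype='int64')   # (T,) tone ids

with paddle.no_grad():
    mel = model.inference(phone_ids, tones=tone_ids)  # (T_frames, decoder_output_size)
print(mel.shape)
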
diff --git a/paddlespeech/t2s/models/tacotron2/tacotron2.py b/paddlespeech/t2s/models/tacotron2/tacotron2.py index 7b306e482..25b5c932a 100644 --- a/paddlespeech/t2s/models/tacotron2/tacotron2.py +++ b/paddlespeech/t2s/models/tacotron2/tacotron2.py @@ -83,38 +83,67 @@ class Tacotron2(nn.Layer): init_type: str="xavier_uniform", ): """Initialize Tacotron2 module. Args: - idim (int): Dimension of the inputs. - odim (int): Dimension of the outputs. - embed_dim (int): Dimension of the token embedding. - elayers (int): Number of encoder blstm layers. - eunits (int): Number of encoder blstm units. - econv_layers (int): Number of encoder conv layers. - econv_filts (int): Number of encoder conv filter size. - econv_chans (int): Number of encoder conv filter channels. - dlayers (int): Number of decoder lstm layers. - dunits (int): Number of decoder lstm units. - prenet_layers (int): Number of prenet layers. - prenet_units (int): Number of prenet units. - postnet_layers (int): Number of postnet layers. - postnet_filts (int): Number of postnet filter size. - postnet_chans (int): Number of postnet filter channels. - output_activation (str): Name of activation function for outputs. - adim (int): Number of dimension of mlp in attention. - aconv_chans (int): Number of attention conv filter channels. - aconv_filts (int): Number of attention conv filter size. - cumulate_att_w (bool): Whether to cumulate previous attention weight. - use_batch_norm (bool): Whether to use batch normalization. - use_concate (bool): Whether to concat enc outputs w/ dec lstm outputs. - reduction_factor (int): Reduction factor. - spk_num (Optional[int]): Number of speakers. If set to > 1, assume that the + idim (int): + Dimension of the inputs. + odim (int): + Dimension of the outputs. + embed_dim (int): + Dimension of the token embedding. + elayers (int): + Number of encoder blstm layers. + eunits (int): + Number of encoder blstm units. + econv_layers (int): + Number of encoder conv layers. + econv_filts (int): + Number of encoder conv filter size. + econv_chans (int): + Number of encoder conv filter channels. + dlayers (int): + Number of decoder lstm layers. + dunits (int): + Number of decoder lstm units. + prenet_layers (int): + Number of prenet layers. + prenet_units (int): + Number of prenet units. + postnet_layers (int): + Number of postnet layers. + postnet_filts (int): + Number of postnet filter size. + postnet_chans (int): + Number of postnet filter channels. + output_activation (str): + Name of activation function for outputs. + adim (int): + Number of dimension of mlp in attention. + aconv_chans (int): + Number of attention conv filter channels. + aconv_filts (int): + Number of attention conv filter size. + cumulate_att_w (bool): + Whether to cumulate previous attention weight. + use_batch_norm (bool): + Whether to use batch normalization. + use_concate (bool): + Whether to concat enc outputs w/ dec lstm outputs. + reduction_factor (int): + Reduction factor. + spk_num (Optional[int]): + Number of speakers. If set to > 1, assume that the sids will be provided as the input and use sid embedding layer. - lang_num (Optional[int]): Number of languages. If set to > 1, assume that the + lang_num (Optional[int]): + Number of languages. If set to > 1, assume that the lids will be provided as the input and use sid embedding layer. - spk_embed_dim (Optional[int]): Speaker embedding dimension. If set to > 0, + spk_embed_dim (Optional[int]): + Speaker embedding dimension. 
If set to > 0, assume that spk_emb will be provided as the input. - spk_embed_integration_type (str): How to integrate speaker embedding. - dropout_rate (float): Dropout rate. - zoneout_rate (float): Zoneout rate. + spk_embed_integration_type (str): + How to integrate speaker embedding. + dropout_rate (float): + Dropout rate. + zoneout_rate (float): + Zoneout rate. """ assert check_argument_types() super().__init__() @@ -230,18 +259,28 @@ class Tacotron2(nn.Layer): """Calculate forward propagation. Args: - text (Tensor(int64)): Batch of padded character ids (B, T_text). - text_lengths (Tensor(int64)): Batch of lengths of each input batch (B,). - speech (Tensor): Batch of padded target features (B, T_feats, odim). - speech_lengths (Tensor(int64)): Batch of the lengths of each target (B,). - spk_emb (Optional[Tensor]): Batch of speaker embeddings (B, spk_embed_dim). - spk_id (Optional[Tensor]): Batch of speaker IDs (B, 1). - lang_id (Optional[Tensor]): Batch of language IDs (B, 1). + text (Tensor(int64)): + Batch of padded character ids (B, T_text). + text_lengths (Tensor(int64)): + Batch of lengths of each input batch (B,). + speech (Tensor): + Batch of padded target features (B, T_feats, odim). + speech_lengths (Tensor(int64)): + Batch of the lengths of each target (B,). + spk_emb (Optional[Tensor]): + Batch of speaker embeddings (B, spk_embed_dim). + spk_id (Optional[Tensor]): + Batch of speaker IDs (B, 1). + lang_id (Optional[Tensor]): + Batch of language IDs (B, 1). Returns: - Tensor: Loss scalar value. - Dict: Statistics to be monitored. - Tensor: Weight value if not joint training else model outputs. + Tensor: + Loss scalar value. + Dict: + Statistics to be monitored. + Tensor: + Weight value if not joint training else model outputs. """ text = text[:, :text_lengths.max()] @@ -329,18 +368,30 @@ class Tacotron2(nn.Layer): """Generate the sequence of features given the sequences of characters. Args: - text (Tensor(int64)): Input sequence of characters (T_text,). - speech (Optional[Tensor]): Feature sequence to extract style (N, idim). - spk_emb (ptional[Tensor]): Speaker embedding (spk_embed_dim,). - spk_id (Optional[Tensor]): Speaker ID (1,). - lang_id (Optional[Tensor]): Language ID (1,). - threshold (float): Threshold in inference. - minlenratio (float): Minimum length ratio in inference. - maxlenratio (float): Maximum length ratio in inference. - use_att_constraint (bool): Whether to apply attention constraint. - backward_window (int): Backward window in attention constraint. - forward_window (int): Forward window in attention constraint. - use_teacher_forcing (bool): Whether to use teacher forcing. + text (Tensor(int64)): + Input sequence of characters (T_text,). + speech (Optional[Tensor]): + Feature sequence to extract style (N, idim). + spk_emb (ptional[Tensor]): + Speaker embedding (spk_embed_dim,). + spk_id (Optional[Tensor]): + Speaker ID (1,). + lang_id (Optional[Tensor]): + Language ID (1,). + threshold (float): + Threshold in inference. + minlenratio (float): + Minimum length ratio in inference. + maxlenratio (float): + Maximum length ratio in inference. + use_att_constraint (bool): + Whether to apply attention constraint. + backward_window (int): + Backward window in attention constraint. + forward_window (int): + Forward window in attention constraint. + use_teacher_forcing (bool): + Whether to use teacher forcing. 
Returns: Dict[str, Tensor] diff --git a/paddlespeech/t2s/models/transformer_tts/transformer_tts.py b/paddlespeech/t2s/models/transformer_tts/transformer_tts.py index 92754c30a..355fceb16 100644 --- a/paddlespeech/t2s/models/transformer_tts/transformer_tts.py +++ b/paddlespeech/t2s/models/transformer_tts/transformer_tts.py @@ -49,66 +49,124 @@ class TransformerTTS(nn.Layer): https://arxiv.org/pdf/1809.08895.pdf Args: - idim (int): Dimension of the inputs. - odim (int): Dimension of the outputs. - embed_dim (int, optional): Dimension of character embedding. - eprenet_conv_layers (int, optional): Number of encoder prenet convolution layers. - eprenet_conv_chans (int, optional): Number of encoder prenet convolution channels. - eprenet_conv_filts (int, optional): Filter size of encoder prenet convolution. - dprenet_layers (int, optional): Number of decoder prenet layers. - dprenet_units (int, optional): Number of decoder prenet hidden units. - elayers (int, optional): Number of encoder layers. - eunits (int, optional): Number of encoder hidden units. - adim (int, optional): Number of attention transformation dimensions. - aheads (int, optional): Number of heads for multi head attention. - dlayers (int, optional): Number of decoder layers. - dunits (int, optional): Number of decoder hidden units. - postnet_layers (int, optional): Number of postnet layers. - postnet_chans (int, optional): Number of postnet channels. - postnet_filts (int, optional): Filter size of postnet. - use_scaled_pos_enc (pool, optional): Whether to use trainable scaled positional encoding. - use_batch_norm (bool, optional): Whether to use batch normalization in encoder prenet. - encoder_normalize_before (bool, optional): Whether to perform layer normalization before encoder block. - decoder_normalize_before (bool, optional): Whether to perform layer normalization before decoder block. - encoder_concat_after (bool, optional): Whether to concatenate attention layer's input and output in encoder. - decoder_concat_after (bool, optional): Whether to concatenate attention layer's input and output in decoder. - positionwise_layer_type (str, optional): Position-wise operation type. - positionwise_conv_kernel_size (int, optional): Kernel size in position wise conv 1d. - reduction_factor (int, optional): Reduction factor. - spk_embed_dim (int, optional): Number of speaker embedding dimenstions. - spk_embed_integration_type (str, optional): How to integrate speaker embedding. - use_gst (str, optional): Whether to use global style token. - gst_tokens (int, optional): The number of GST embeddings. - gst_heads (int, optional): The number of heads in GST multihead attention. - gst_conv_layers (int, optional): The number of conv layers in GST. - gst_conv_chans_list (Sequence[int], optional): List of the number of channels of conv layers in GST. - gst_conv_kernel_size (int, optional): Kernal size of conv layers in GST. - gst_conv_stride (int, optional): Stride size of conv layers in GST. - gst_gru_layers (int, optional): The number of GRU layers in GST. - gst_gru_units (int, optional): The number of GRU units in GST. - transformer_lr (float, optional): Initial value of learning rate. - transformer_warmup_steps (int, optional): Optimizer warmup steps. - transformer_enc_dropout_rate (float, optional): Dropout rate in encoder except attention and positional encoding. - transformer_enc_positional_dropout_rate (float, optional): Dropout rate after encoder positional encoding. 
- transformer_enc_attn_dropout_rate (float, optional): Dropout rate in encoder self-attention module. - transformer_dec_dropout_rate (float, optional): Dropout rate in decoder except attention & positional encoding. - transformer_dec_positional_dropout_rate (float, optional): Dropout rate after decoder positional encoding. - transformer_dec_attn_dropout_rate (float, optional): Dropout rate in deocoder self-attention module. - transformer_enc_dec_attn_dropout_rate (float, optional): Dropout rate in encoder-deocoder attention module. - init_type (str, optional): How to initialize transformer parameters. - init_enc_alpha (float, optional): Initial value of alpha in scaled pos encoding of the encoder. - init_dec_alpha (float, optional): Initial value of alpha in scaled pos encoding of the decoder. - eprenet_dropout_rate (float, optional): Dropout rate in encoder prenet. - dprenet_dropout_rate (float, optional): Dropout rate in decoder prenet. - postnet_dropout_rate (float, optional): Dropout rate in postnet. - use_masking (bool, optional): Whether to apply masking for padded part in loss calculation. - use_weighted_masking (bool, optional): Whether to apply weighted masking in loss calculation. - bce_pos_weight (float, optional): Positive sample weight in bce calculation (only for use_masking=true). - loss_type (str, optional): How to calculate loss. - use_guided_attn_loss (bool, optional): Whether to use guided attention loss. - num_heads_applied_guided_attn (int, optional): Number of heads in each layer to apply guided attention loss. - num_layers_applied_guided_attn (int, optional): Number of layers to apply guided attention loss. - List of module names to apply guided attention loss. + idim (int): + Dimension of the inputs. + odim (int): + Dimension of the outputs. + embed_dim (int, optional): + Dimension of character embedding. + eprenet_conv_layers (int, optional): + Number of encoder prenet convolution layers. + eprenet_conv_chans (int, optional): + Number of encoder prenet convolution channels. + eprenet_conv_filts (int, optional): + Filter size of encoder prenet convolution. + dprenet_layers (int, optional): + Number of decoder prenet layers. + dprenet_units (int, optional): + Number of decoder prenet hidden units. + elayers (int, optional): + Number of encoder layers. + eunits (int, optional): + Number of encoder hidden units. + adim (int, optional): + Number of attention transformation dimensions. + aheads (int, optional): + Number of heads for multi head attention. + dlayers (int, optional): + Number of decoder layers. + dunits (int, optional): + Number of decoder hidden units. + postnet_layers (int, optional): + Number of postnet layers. + postnet_chans (int, optional): + Number of postnet channels. + postnet_filts (int, optional): + Filter size of postnet. + use_scaled_pos_enc (pool, optional): + Whether to use trainable scaled positional encoding. + use_batch_norm (bool, optional): + Whether to use batch normalization in encoder prenet. + encoder_normalize_before (bool, optional): + Whether to perform layer normalization before encoder block. + decoder_normalize_before (bool, optional): + Whether to perform layer normalization before decoder block. + encoder_concat_after (bool, optional): + Whether to concatenate attention layer's input and output in encoder. + decoder_concat_after (bool, optional): + Whether to concatenate attention layer's input and output in decoder. + positionwise_layer_type (str, optional): + Position-wise operation type. 
+ positionwise_conv_kernel_size (int, optional): + Kernel size in position wise conv 1d. + reduction_factor (int, optional): + Reduction factor. + spk_embed_dim (int, optional): + Number of speaker embedding dimensions. + spk_embed_integration_type (str, optional): + How to integrate speaker embedding. + use_gst (str, optional): + Whether to use global style token. + gst_tokens (int, optional): + The number of GST embeddings. + gst_heads (int, optional): + The number of heads in GST multihead attention. + gst_conv_layers (int, optional): + The number of conv layers in GST. + gst_conv_chans_list (Sequence[int], optional): + List of the number of channels of conv layers in GST. + gst_conv_kernel_size (int, optional): + Kernel size of conv layers in GST. + gst_conv_stride (int, optional): + Stride size of conv layers in GST. + gst_gru_layers (int, optional): + The number of GRU layers in GST. + gst_gru_units (int, optional): + The number of GRU units in GST. + transformer_lr (float, optional): + Initial value of learning rate. + transformer_warmup_steps (int, optional): + Optimizer warmup steps. + transformer_enc_dropout_rate (float, optional): + Dropout rate in encoder except attention and positional encoding. + transformer_enc_positional_dropout_rate (float, optional): + Dropout rate after encoder positional encoding. + transformer_enc_attn_dropout_rate (float, optional): + Dropout rate in encoder self-attention module. + transformer_dec_dropout_rate (float, optional): + Dropout rate in decoder except attention & positional encoding. + transformer_dec_positional_dropout_rate (float, optional): + Dropout rate after decoder positional encoding. + transformer_dec_attn_dropout_rate (float, optional): + Dropout rate in decoder self-attention module. + transformer_enc_dec_attn_dropout_rate (float, optional): + Dropout rate in encoder-decoder attention module. + init_type (str, optional): + How to initialize transformer parameters. + init_enc_alpha (float, optional): + Initial value of alpha in scaled pos encoding of the encoder. + init_dec_alpha (float, optional): + Initial value of alpha in scaled pos encoding of the decoder. + eprenet_dropout_rate (float, optional): + Dropout rate in encoder prenet. + dprenet_dropout_rate (float, optional): + Dropout rate in decoder prenet. + postnet_dropout_rate (float, optional): + Dropout rate in postnet. + use_masking (bool, optional): + Whether to apply masking for padded part in loss calculation. + use_weighted_masking (bool, optional): + Whether to apply weighted masking in loss calculation. + bce_pos_weight (float, optional): + Positive sample weight in bce calculation (only for use_masking=true). + loss_type (str, optional): + How to calculate loss. + use_guided_attn_loss (bool, optional): + Whether to use guided attention loss. + num_heads_applied_guided_attn (int, optional): + Number of heads in each layer to apply guided attention loss. + num_layers_applied_guided_attn (int, optional): + Number of layers to apply guided attention loss. """ def __init__( diff --git a/paddlespeech/t2s/models/vits/vits.py b/paddlespeech/t2s/models/vits/vits.py index ab8eda26d..5c476be77 100644 --- a/paddlespeech/t2s/models/vits/vits.py +++ b/paddlespeech/t2s/models/vits/vits.py @@ -227,11 +227,7 @@ class VITS(nn.Layer): lids (Optional[Tensor]): Language index tensor (B,) or (B, 1). forward_generator (bool): Whether to forward generator. Returns: - Dict[str, Any]: - - loss (Tensor): Loss scalar tensor. - - stats (Dict[str, float]): Statistics to be monitored.
- - weight (Tensor): Weight tensor to summarize losses. - - optim_idx (int): Optimizer index (0 for G and 1 for D). + """ if forward_generator: return self._forward_generator( diff --git a/paddlespeech/t2s/models/waveflow.py b/paddlespeech/t2s/models/waveflow.py index 52e6005be..8e2ce822f 100644 --- a/paddlespeech/t2s/models/waveflow.py +++ b/paddlespeech/t2s/models/waveflow.py @@ -33,8 +33,10 @@ def fold(x, n_group): """Fold audio or spectrogram's temporal dimension in to groups. Args: - x(Tensor): The input tensor. shape=(*, time_steps) - n_group(int): The size of a group. + x(Tensor): + The input tensor. shape=(*, time_steps) + n_group(int): + The size of a group. Returns: Tensor: Folded tensor. shape=(*, time_steps // n_group, group) @@ -53,7 +55,8 @@ class UpsampleNet(nn.LayerList): on mel and time dimension. Args: - upscale_factors(List[int], optional): Time upsampling factors for each Conv2DTranspose Layer. + upscale_factors(List[int], optional): + Time upsampling factors for each Conv2DTranspose Layer. The ``UpsampleNet`` contains ``len(upscale_factor)`` Conv2DTranspose Layers. Each upscale_factor is used as the ``stride`` for the corresponding Conv2DTranspose. Defaults to [16, 16], this the default @@ -94,8 +97,10 @@ class UpsampleNet(nn.LayerList): """Forward pass of the ``UpsampleNet`` Args: - x(Tensor): The input spectrogram. shape=(batch_size, input_channels, time_steps) - trim_conv_artifact(bool, optional, optional): Trim deconvolution artifact at each layer. Defaults to False. + x(Tensor): + The input spectrogram. shape=(batch_size, input_channels, time_steps) + trim_conv_artifact(bool, optional, optional): + Trim deconvolution artifact at each layer. Defaults to False. Returns: Tensor: The upsampled spectrogram. shape=(batch_size, input_channels, time_steps * upsample_factor) @@ -123,10 +128,14 @@ class ResidualBlock(nn.Layer): and output. Args: - channels (int): Feature size of the input. - cond_channels (int): Featuer size of the condition. - kernel_size (Tuple[int]): Kernel size of the Convolution2d applied to the input. - dilations (int): Dilations of the Convolution2d applied to the input. + channels (int): + Feature size of the input. + cond_channels (int): + Featuer size of the condition. + kernel_size (Tuple[int]): + Kernel size of the Convolution2d applied to the input. + dilations (int): + Dilations of the Convolution2d applied to the input. """ def __init__(self, channels, cond_channels, kernel_size, dilations): @@ -173,12 +182,16 @@ class ResidualBlock(nn.Layer): """Compute output for a whole folded sequence. Args: - x (Tensor): The input. [shape=(batch_size, channel, height, width)] - condition (Tensor [shape=(batch_size, condition_channel, height, width)]): The local condition. + x (Tensor): + The input. [shape=(batch_size, channel, height, width)] + condition (Tensor [shape=(batch_size, condition_channel, height, width)]): + The local condition. Returns: - res (Tensor): The residual output. [shape=(batch_size, channel, height, width)] - skip (Tensor): The skip output. [shape=(batch_size, channel, height, width)] + res (Tensor): + The residual output. [shape=(batch_size, channel, height, width)] + skip (Tensor): + The skip output. [shape=(batch_size, channel, height, width)] """ x_in = x x = self.conv(x) @@ -216,12 +229,16 @@ class ResidualBlock(nn.Layer): """Compute the output for a row and update the buffer. Args: - x_row (Tensor): A row of the input. shape=(batch_size, channel, 1, width) - condition_row (Tensor): A row of the condition. 
shape=(batch_size, condition_channel, 1, width) + x_row (Tensor): + A row of the input. shape=(batch_size, channel, 1, width) + condition_row (Tensor): + A row of the condition. shape=(batch_size, condition_channel, 1, width) Returns: - res (Tensor): A row of the the residual output. shape=(batch_size, channel, 1, width) - skip (Tensor): A row of the skip output. shape=(batch_size, channel, 1, width) + res (Tensor): + A row of the the residual output. shape=(batch_size, channel, 1, width) + skip (Tensor): + A row of the skip output. shape=(batch_size, channel, 1, width) """ x_row_in = x_row @@ -258,11 +275,16 @@ class ResidualNet(nn.LayerList): """A stack of several ResidualBlocks. It merges condition at each layer. Args: - n_layer (int): Number of ResidualBlocks in the ResidualNet. - residual_channels (int): Feature size of each ResidualBlocks. - condition_channels (int): Feature size of the condition. - kernel_size (Tuple[int]): Kernel size of each ResidualBlock. - dilations_h (List[int]): Dilation in height dimension of every ResidualBlock. + n_layer (int): + Number of ResidualBlocks in the ResidualNet. + residual_channels (int): + Feature size of each ResidualBlocks. + condition_channels (int): + Feature size of the condition. + kernel_size (Tuple[int]): + Kernel size of each ResidualBlock. + dilations_h (List[int]): + Dilation in height dimension of every ResidualBlock. Raises: ValueError: If the length of dilations_h does not equals n_layers. @@ -288,11 +310,13 @@ class ResidualNet(nn.LayerList): """Comput the output of given the input and the condition. Args: - x (Tensor): The input. shape=(batch_size, channel, height, width) - condition (Tensor): The local condition. shape=(batch_size, condition_channel, height, width) + x (Tensor): + The input. shape=(batch_size, channel, height, width) + condition (Tensor): + The local condition. shape=(batch_size, condition_channel, height, width) Returns: - Tensor : The output, which is an aggregation of all the skip outputs. shape=(batch_size, channel, height, width) + Tensor: The output, which is an aggregation of all the skip outputs. shape=(batch_size, channel, height, width) """ skip_connections = [] @@ -312,12 +336,16 @@ class ResidualNet(nn.LayerList): """Compute the output for a row and update the buffers. Args: - x_row (Tensor): A row of the input. shape=(batch_size, channel, 1, width) - condition_row (Tensor): A row of the condition. shape=(batch_size, condition_channel, 1, width) + x_row (Tensor): + A row of the input. shape=(batch_size, channel, 1, width) + condition_row (Tensor): + A row of the condition. shape=(batch_size, condition_channel, 1, width) Returns: - res (Tensor): A row of the the residual output. shape=(batch_size, channel, 1, width) - skip (Tensor): A row of the skip output. shape=(batch_size, channel, 1, width) + res (Tensor): + A row of the the residual output. shape=(batch_size, channel, 1, width) + skip (Tensor): + A row of the skip output. shape=(batch_size, channel, 1, width) """ skip_connections = [] @@ -337,11 +365,16 @@ class Flow(nn.Layer): sampling. Args: - n_layers (int): Number of ResidualBlocks in the Flow. - channels (int): Feature size of the ResidualBlocks. - mel_bands (int): Feature size of the mel spectrogram (mel bands). - kernel_size (Tuple[int]): Kernel size of each ResisualBlocks in the Flow. - n_group (int): Number of timesteps to the folded into a group. + n_layers (int): + Number of ResidualBlocks in the Flow. + channels (int): + Feature size of the ResidualBlocks. 
+ mel_bands (int): + Feature size of the mel spectrogram (mel bands). + kernel_size (Tuple[int]): + Kernel size of each ResidualBlock in the Flow. + n_group (int): + Number of timesteps to be folded into a group. """ dilations_dict = { 8: [1, 1, 1, 1, 1, 1, 1, 1], @@ -393,11 +426,14 @@ class Flow(nn.Layer): a sample from p(X) into a sample from p(Z). Args: - x (Tensor): A input sample of the distribution p(X). shape=(batch, 1, height, width) - condition (Tensor): The local condition. shape=(batch, condition_channel, height, width) + x (Tensor): + An input sample of the distribution p(X). shape=(batch, 1, height, width) + condition (Tensor): + The local condition. shape=(batch, condition_channel, height, width) Returns: - z (Tensor): shape(batch, 1, height, width), the transformed sample. + z (Tensor): + shape(batch, 1, height, width), the transformed sample. Tuple[Tensor, Tensor]: The parameter of the transformation. logs (Tensor): shape(batch, 1, height - 1, width), the log scale of the transformation from x to z. @@ -433,8 +469,10 @@ class Flow(nn.Layer): p(Z) and transform the sample. It is a auto regressive transformation. Args: - z(Tensor): A sample of the distribution p(Z). shape=(batch, 1, time_steps - condition(Tensor): The local condition. shape=(batch, condition_channel, time_steps) + z(Tensor): + A sample of the distribution p(Z). shape=(batch, 1, time_steps) + condition(Tensor): + The local condition. shape=(batch, condition_channel, time_steps) Returns: Tensor: The transformed sample. shape=(batch, 1, height, width) @@ -462,12 +500,18 @@ class WaveFlow(nn.LayerList): flows. Args: - n_flows (int): Number of flows in the WaveFlow model. - n_layers (int): Number of ResidualBlocks in each Flow. - n_group (int): Number of timesteps to fold as a group. - channels (int): Feature size of each ResidualBlock. - mel_bands (int): Feature size of mel spectrogram (mel bands). - kernel_size (Union[int, List[int]]): Kernel size of the convolution layer in each ResidualBlock. + n_flows (int): + Number of flows in the WaveFlow model. + n_layers (int): + Number of ResidualBlocks in each Flow. + n_group (int): + Number of timesteps to fold as a group. + channels (int): + Feature size of each ResidualBlock. + mel_bands (int): + Feature size of mel spectrogram (mel bands). + kernel_size (Union[int, List[int]]): + Kernel size of the convolution layer in each ResidualBlock. """ def __init__(self, n_flows, n_layers, n_group, channels, mel_bands, @@ -518,12 +562,16 @@ class WaveFlow(nn.LayerList): condition. Args: - x (Tensor): The audio. shape=(batch_size, time_steps) - condition (Tensor): The local condition (mel spectrogram here). shape=(batch_size, condition channel, time_steps) + x (Tensor): + The audio. shape=(batch_size, time_steps) + condition (Tensor): + The local condition (mel spectrogram here). shape=(batch_size, condition channel, time_steps) Returns: - Tensor: The transformed random variable. shape=(batch_size, time_steps) - Tensor: The log determinant of the jacobian of the transformation from x to z. shape=(1,) + Tensor: + The transformed random variable. shape=(batch_size, time_steps) + Tensor: + The log determinant of the jacobian of the transformation from x to z. shape=(1,) """ # x: (B, T) # condition: (B, C, T) upsampled condition @@ -559,12 +607,13 @@ class WaveFlow(nn.LayerList): autoregressive manner. Args: - z (Tensor): A sample of the distribution p(Z). shape=(batch, 1, time_steps - condition (Tensor): The local condition.
shape=(batch, condition_channel, time_steps) + z (Tensor): + A sample of the distribution p(Z). shape=(batch, 1, time_steps + condition (Tensor): + The local condition. shape=(batch, condition_channel, time_steps) Returns: Tensor: The transformed sample (audio here). shape=(batch_size, time_steps) - """ z, condition = self._trim(z, condition) @@ -590,13 +639,20 @@ class ConditionalWaveFlow(nn.LayerList): """ConditionalWaveFlow, a UpsampleNet with a WaveFlow model. Args: - upsample_factors (List[int]): Upsample factors for the upsample net. - n_flows (int): Number of flows in the WaveFlow model. - n_layers (int): Number of ResidualBlocks in each Flow. - n_group (int): Number of timesteps to fold as a group. - channels (int): Feature size of each ResidualBlock. - n_mels (int): Feature size of mel spectrogram (mel bands). - kernel_size (Union[int, List[int]]): Kernel size of the convolution layer in each ResidualBlock. + upsample_factors (List[int]): + Upsample factors for the upsample net. + n_flows (int): + Number of flows in the WaveFlow model. + n_layers (int): + Number of ResidualBlocks in each Flow. + n_group (int): + Number of timesteps to fold as a group. + channels (int): + Feature size of each ResidualBlock. + n_mels (int): + Feature size of mel spectrogram (mel bands). + kernel_size (Union[int, List[int]]): + Kernel size of the convolution layer in each ResidualBlock. """ def __init__(self, @@ -622,12 +678,16 @@ class ConditionalWaveFlow(nn.LayerList): the determinant of the jacobian of the transformation from x to z. Args: - audio(Tensor): The audio. shape=(B, T) - mel(Tensor): The mel spectrogram. shape=(B, C_mel, T_mel) + audio(Tensor): + The audio. shape=(B, T) + mel(Tensor): + The mel spectrogram. shape=(B, C_mel, T_mel) Returns: - Tensor: The inversely transformed random variable z (x to z). shape=(B, T) - Tensor: the log of the determinant of the jacobian of the transformation from x to z. shape=(1,) + Tensor: + The inversely transformed random variable z (x to z). shape=(B, T) + Tensor: + the log of the determinant of the jacobian of the transformation from x to z. shape=(1,) """ condition = self.encoder(mel) z, log_det_jacobian = self.decoder(audio, condition) @@ -638,10 +698,12 @@ class ConditionalWaveFlow(nn.LayerList): """Generate raw audio given mel spectrogram. Args: - mel(np.ndarray): Mel spectrogram of an utterance(in log-magnitude). shape=(C_mel, T_mel) + mel(np.ndarray): + Mel spectrogram of an utterance(in log-magnitude). shape=(C_mel, T_mel) Returns: - Tensor: The synthesized audio, where``T <= T_mel * upsample_factors``. shape=(B, T) + Tensor: + The synthesized audio, where``T <= T_mel * upsample_factors``. shape=(B, T) """ start = time.time() condition = self.encoder(mel, trim_conv_artifact=True) # (B, C, T) @@ -657,7 +719,8 @@ class ConditionalWaveFlow(nn.LayerList): """Generate raw audio given mel spectrogram. Args: - mel(np.ndarray): Mel spectrogram of an utterance(in log-magnitude). shape=(C_mel, T_mel) + mel(np.ndarray): + Mel spectrogram of an utterance(in log-magnitude). shape=(C_mel, T_mel) Returns: np.ndarray: The synthesized audio. shape=(T,) @@ -673,8 +736,10 @@ class ConditionalWaveFlow(nn.LayerList): """Build a ConditionalWaveFlow model from a pretrained model. 
Args: - config(yacs.config.CfgNode): model configs - checkpoint_path(Path or str): the path of pretrained model checkpoint, without extension name + config(yacs.config.CfgNode): + model configs + checkpoint_path(Path or str): + the path of pretrained model checkpoint, without extension name Returns: ConditionalWaveFlow The model built from pretrained result. @@ -694,8 +759,8 @@ class WaveFlowLoss(nn.Layer): """Criterion of a WaveFlow model. Args: - sigma (float): The standard deviation of the gaussian noise used in WaveFlow, - by default 1.0. + sigma (float): + The standard deviation of the gaussian noise used in WaveFlow, by default 1.0. """ def __init__(self, sigma=1.0): @@ -708,8 +773,10 @@ class WaveFlowLoss(nn.Layer): log_det_jacobian of transformation from x to z. Args: - z(Tensor): The transformed random variable (x to z). shape=(B, T) - log_det_jacobian(Tensor): The log of the determinant of the jacobian matrix of the + z(Tensor): + The transformed random variable (x to z). shape=(B, T) + log_det_jacobian(Tensor): + The log of the determinant of the jacobian matrix of the transformation from x to z. shape=(1,) Returns: @@ -726,7 +793,8 @@ class ConditionalWaveFlow2Infer(ConditionalWaveFlow): """Generate raw audio given mel spectrogram. Args: - mel (np.ndarray): Mel spectrogram of an utterance(in log-magnitude). shape=(C_mel, T_mel) + mel (np.ndarray): + Mel spectrogram of an utterance(in log-magnitude). shape=(C_mel, T_mel) Returns: np.ndarray: The synthesized audio. shape=(T,) diff --git a/paddlespeech/t2s/models/wavernn/wavernn.py b/paddlespeech/t2s/models/wavernn/wavernn.py index eb892eda5..254edbb2d 100644 --- a/paddlespeech/t2s/models/wavernn/wavernn.py +++ b/paddlespeech/t2s/models/wavernn/wavernn.py @@ -165,19 +165,29 @@ class WaveRNN(nn.Layer): init_type: str="xavier_uniform", ): ''' Args: - rnn_dims (int, optional): Hidden dims of RNN Layers. - fc_dims (int, optional): Dims of FC Layers. - bits (int, optional): bit depth of signal. - aux_context_window (int, optional): The context window size of the first convolution applied to the - auxiliary input, by default 2 - upsample_scales (List[int], optional): Upsample scales of the upsample network. - aux_channels (int, optional): Auxiliary channel of the residual blocks. - compute_dims (int, optional): Dims of Conv1D in MelResNet. - res_out_dims (int, optional): Dims of output in MelResNet. - res_blocks (int, optional): Number of residual blocks. - mode (str, optional): Output mode of the WaveRNN vocoder. + rnn_dims (int, optional): + Hidden dims of RNN Layers. + fc_dims (int, optional): + Dims of FC Layers. + bits (int, optional): + bit depth of signal. + aux_context_window (int, optional): + The context window size of the first convolution applied to the auxiliary input, by default 2 + upsample_scales (List[int], optional): + Upsample scales of the upsample network. + aux_channels (int, optional): + Auxiliary channel of the residual blocks. + compute_dims (int, optional): + Dims of Conv1D in MelResNet. + res_out_dims (int, optional): + Dims of output in MelResNet. + res_blocks (int, optional): + Number of residual blocks. + mode (str, optional): + Output mode of the WaveRNN vocoder. `MOL` for Mixture of Logistic Distribution, and `RAW` for quantized bits as the model's output. - init_type (str): How to initialize parameters. + init_type (str): + How to initialize parameters. 
''' super().__init__() self.mode = mode @@ -226,8 +236,10 @@ class WaveRNN(nn.Layer): def forward(self, x, c): ''' Args: - x (Tensor): wav sequence, [B, T] - c (Tensor): mel spectrogram [B, C_aux, T'] + x (Tensor): + wav sequence, [B, T] + c (Tensor): + mel spectrogram [B, C_aux, T'] T = (T' - 2 * aux_context_window ) * hop_length Returns: @@ -280,10 +292,14 @@ class WaveRNN(nn.Layer): gen_display: bool=False): """ Args: - c(Tensor): input mels, (T', C_aux) - batched(bool): generate in batch or not - target(int): target number of samples to be generated in each batch entry - overlap(int): number of samples for crossfading between batches + c(Tensor): + input mels, (T', C_aux) + batched(bool): + generate in batch or not + target(int): + target number of samples to be generated in each batch entry + overlap(int): + number of samples for crossfading between batches mu_law(bool) Returns: wav sequence: Output (T' * prod(upsample_scales), out_channels, C_out). @@ -404,7 +420,8 @@ class WaveRNN(nn.Layer): def pad_tensor(self, x, pad, side='both'): ''' Args: - x(Tensor): mel, [1, n_frames, 80] + x(Tensor): + mel, [1, n_frames, 80] pad(int): side(str, optional): (Default value = 'both') @@ -428,12 +445,15 @@ class WaveRNN(nn.Layer): Overlap will be used for crossfading in xfade_and_unfold() Args: - x(Tensor): Upsampled conditioning features. mels or aux + x(Tensor): + Upsampled conditioning features. mels or aux shape=(1, T, features) mels: [1, T, 80] aux: [1, T, 128] - target(int): Target timesteps for each index of batch - overlap(int): Timesteps for both xfade and rnn warmup + target(int): + Target timesteps for each index of batch + overlap(int): + Timesteps for both xfade and rnn warmup Returns: Tensor: diff --git a/paddlespeech/t2s/modules/causal_conv.py b/paddlespeech/t2s/modules/causal_conv.py index 3abccc15f..337ee2383 100644 --- a/paddlespeech/t2s/modules/causal_conv.py +++ b/paddlespeech/t2s/modules/causal_conv.py @@ -42,7 +42,8 @@ class CausalConv1D(nn.Layer): def forward(self, x): """Calculate forward propagation. Args: - x (Tensor): Input tensor (B, in_channels, T). + x (Tensor): + Input tensor (B, in_channels, T). Returns: Tensor: Output tensor (B, out_channels, T). """ @@ -67,7 +68,8 @@ class CausalConv1DTranspose(nn.Layer): def forward(self, x): """Calculate forward propagation. Args: - x (Tensor): Input tensor (B, in_channels, T_in). + x (Tensor): + Input tensor (B, in_channels, T_in). Returns: Tensor: Output tensor (B, out_channels, T_out). """ diff --git a/paddlespeech/t2s/modules/conformer/convolution.py b/paddlespeech/t2s/modules/conformer/convolution.py index 185c62fb3..dadda0640 100644 --- a/paddlespeech/t2s/modules/conformer/convolution.py +++ b/paddlespeech/t2s/modules/conformer/convolution.py @@ -20,8 +20,10 @@ class ConvolutionModule(nn.Layer): """ConvolutionModule in Conformer model. Args: - channels (int): The number of channels of conv layers. - kernel_size (int): Kernerl size of conv layers. + channels (int): + The number of channels of conv layers. + kernel_size (int): + Kernerl size of conv layers. """ def __init__(self, channels, kernel_size, activation=nn.ReLU(), bias=True): @@ -59,7 +61,8 @@ class ConvolutionModule(nn.Layer): """Compute convolution module. Args: - x (Tensor): Input tensor (#batch, time, channels). + x (Tensor): + Input tensor (#batch, time, channels). Returns: Tensor: Output tensor (#batch, time, channels). 
""" diff --git a/paddlespeech/t2s/modules/conformer/encoder_layer.py b/paddlespeech/t2s/modules/conformer/encoder_layer.py index 61c326125..26a354565 100644 --- a/paddlespeech/t2s/modules/conformer/encoder_layer.py +++ b/paddlespeech/t2s/modules/conformer/encoder_layer.py @@ -23,25 +23,34 @@ class EncoderLayer(nn.Layer): """Encoder layer module. Args: - size (int): Input dimension. - self_attn (nn.Layer): Self-attention module instance. + size (int): + Input dimension. + self_attn (nn.Layer): + Self-attention module instance. `MultiHeadedAttention` or `RelPositionMultiHeadedAttention` instance can be used as the argument. - feed_forward (nn.Layer): Feed-forward module instance. + feed_forward (nn.Layer): + Feed-forward module instance. `PositionwiseFeedForward`, `MultiLayeredConv1d`, or `Conv1dLinear` instance can be used as the argument. - feed_forward_macaron (nn.Layer): Additional feed-forward module instance. + feed_forward_macaron (nn.Layer): + Additional feed-forward module instance. `PositionwiseFeedForward`, `MultiLayeredConv1d`, or `Conv1dLinear` instance can be used as the argument. - conv_module (nn.Layer): Convolution module instance. + conv_module (nn.Layer): + Convolution module instance. `ConvlutionModule` instance can be used as the argument. - dropout_rate (float): Dropout rate. - normalize_before (bool): Whether to use layer_norm before the first block. - concat_after (bool): Whether to concat attention layer's input and output. + dropout_rate (float): + Dropout rate. + normalize_before (bool): + Whether to use layer_norm before the first block. + concat_after (bool): + Whether to concat attention layer's input and output. if True, additional linear will be applied. i.e. x -> x + linear(concat(x, att(x))) if False, no additional linear will be applied. i.e. x -> x + att(x) - stochastic_depth_rate (float): Proability to skip this layer. + stochastic_depth_rate (float): + Proability to skip this layer. During training, the layer may skip residual computation and return input as-is with given probability. """ @@ -86,15 +95,19 @@ class EncoderLayer(nn.Layer): """Compute encoded features. Args: - x_input(Union[Tuple, Tensor]): Input tensor w/ or w/o pos emb. + x_input(Union[Tuple, Tensor]): + Input tensor w/ or w/o pos emb. - w/ pos emb: Tuple of tensors [(#batch, time, size), (1, time, size)]. - w/o pos emb: Tensor (#batch, time, size). - mask(Tensor): Mask tensor for the input (#batch, time). + mask(Tensor): + Mask tensor for the input (#batch, time). cache (Tensor): Returns: - Tensor: Output tensor (#batch, time, size). - Tensor: Mask tensor (#batch, time). + Tensor: + Output tensor (#batch, time, size). + Tensor: + Mask tensor (#batch, time). """ if isinstance(x_input, tuple): x, pos_emb = x_input[0], x_input[1] diff --git a/paddlespeech/t2s/modules/conv.py b/paddlespeech/t2s/modules/conv.py index aa875bd50..922af03f2 100644 --- a/paddlespeech/t2s/modules/conv.py +++ b/paddlespeech/t2s/modules/conv.py @@ -42,13 +42,19 @@ class Conv1dCell(nn.Conv1D): class. Args: - in_channels (int): The feature size of the input. - out_channels (int): The feature size of the output. - kernel_size (int or Tuple[int]): The size of the kernel. - dilation (int or Tuple[int]): The dilation of the convolution, by default 1 - weight_attr (ParamAttr, Initializer, str or bool, optional) : The parameter attribute of the convolution kernel, + in_channels (int): + The feature size of the input. + out_channels (int): + The feature size of the output. 
+ kernel_size (int or Tuple[int]): + The size of the kernel. + dilation (int or Tuple[int]): + The dilation of the convolution, by default 1 + weight_attr (ParamAttr, Initializer, str or bool, optional): + The parameter attribute of the convolution kernel, by default None. - bias_attr (ParamAttr, Initializer, str or bool, optional):The parameter attribute of the bias. + bias_attr (ParamAttr, Initializer, str or bool, optional): + The parameter attribute of the bias. If ``False``, this layer does not have a bias, by default None. Examples: @@ -122,7 +128,8 @@ class Conv1dCell(nn.Conv1D): """Initialize the buffer for the step input. Args: - x_t (Tensor): The step input. shape=(batch_size, in_channels) + x_t (Tensor): + The step input. shape=(batch_size, in_channels) """ batch_size, _ = x_t.shape @@ -134,7 +141,8 @@ class Conv1dCell(nn.Conv1D): """Shift the buffer by one step. Args: - x_t (Tensor): The step input. shape=(batch_size, in_channels) + x_t (Tensor): + The step input. shape=(batch_size, in_channels) """ self._buffer = paddle.concat( @@ -144,10 +152,12 @@ class Conv1dCell(nn.Conv1D): """Add step input and compute step output. Args: - x_t (Tensor): The step input. shape=(batch_size, in_channels) + x_t (Tensor): + The step input. shape=(batch_size, in_channels) Returns: - y_t (Tensor): The step output. shape=(batch_size, out_channels) + y_t (Tensor): + The step output. shape=(batch_size, out_channels) """ batch_size = x_t.shape[0] @@ -173,10 +183,14 @@ class Conv1dBatchNorm(nn.Layer): """A Conv1D Layer followed by a BatchNorm1D. Args: - in_channels (int): The feature size of the input. - out_channels (int): The feature size of the output. - kernel_size (int): The size of the convolution kernel. - stride (int, optional): The stride of the convolution, by default 1. + in_channels (int): + The feature size of the input. + out_channels (int): + The feature size of the output. + kernel_size (int): + The size of the convolution kernel. + stride (int, optional): + The stride of the convolution, by default 1. padding (int, str or Tuple[int], optional): The padding of the convolution. If int, a symmetrical padding is applied before convolution; @@ -189,9 +203,12 @@ class Conv1dBatchNorm(nn.Layer): bias_attr (ParamAttr, Initializer, str or bool, optional): The parameter attribute of the bias of the convolution, by defaultNone. - data_format (str ["NCL" or "NLC"], optional): The data layout of the input, by default "NCL" - momentum (float, optional): The momentum of the BatchNorm1D layer, by default 0.9 - epsilon (float, optional): The epsilon of the BatchNorm1D layer, by default 1e-05 + data_format (str ["NCL" or "NLC"], optional): + The data layout of the input, by default "NCL" + momentum (float, optional): + The momentum of the BatchNorm1D layer, by default 0.9 + epsilon (float, optional): + The epsilon of the BatchNorm1D layer, by default 1e-05 """ def __init__(self, @@ -225,12 +242,13 @@ class Conv1dBatchNorm(nn.Layer): """Forward pass of the Conv1dBatchNorm layer. Args: - x (Tensor): The input tensor. Its data layout depends on ``data_format``. - shape=(B, C_in, T_in) or (B, T_in, C_in) + x (Tensor): + The input tensor. Its data layout depends on ``data_format``. + shape=(B, C_in, T_in) or (B, T_in, C_in) Returns: - Tensor: The output tensor. - shape=(B, C_out, T_out) or (B, T_out, C_out) + Tensor: + The output tensor.
shape=(B, C_out, T_out) or (B, T_out, C_out) """ x = self.conv(x) diff --git a/paddlespeech/t2s/modules/geometry.py b/paddlespeech/t2s/modules/geometry.py index 01eb5ad0a..80c872a81 100644 --- a/paddlespeech/t2s/modules/geometry.py +++ b/paddlespeech/t2s/modules/geometry.py @@ -19,8 +19,10 @@ def shuffle_dim(x, axis, perm=None): """Permute input tensor along aixs given the permutation or randomly. Args: - x (Tensor): The input tensor. - axis (int): The axis to shuffle. + x (Tensor): + The input tensor. + axis (int): + The axis to shuffle. perm (List[int], ndarray, optional): The order to reorder the tensor along the ``axis``-th dimension. It is a permutation of ``[0, d)``, where d is the size of the diff --git a/paddlespeech/t2s/modules/layer_norm.py b/paddlespeech/t2s/modules/layer_norm.py index 088b98e02..9e2add293 100644 --- a/paddlespeech/t2s/modules/layer_norm.py +++ b/paddlespeech/t2s/modules/layer_norm.py @@ -19,8 +19,10 @@ from paddle import nn class LayerNorm(nn.LayerNorm): """Layer normalization module. Args: - nout (int): Output dim size. - dim (int): Dimension to be normalized. + nout (int): + Output dim size. + dim (int): + Dimension to be normalized. """ def __init__(self, nout, dim=-1): @@ -32,7 +34,8 @@ class LayerNorm(nn.LayerNorm): """Apply layer normalization. Args: - x (Tensor):Input tensor. + x (Tensor): + Input tensor. Returns: Tensor: Normalized tensor. diff --git a/paddlespeech/t2s/modules/losses.py b/paddlespeech/t2s/modules/losses.py index e6ab93513..1a43f5ef3 100644 --- a/paddlespeech/t2s/modules/losses.py +++ b/paddlespeech/t2s/modules/losses.py @@ -269,8 +269,10 @@ class GuidedAttentionLoss(nn.Layer): """Make masks indicating non-padded part. Args: - ilens(Tensor(int64) or List): Batch of lengths (B,). - olens(Tensor(int64) or List): Batch of lengths (B,). + ilens(Tensor(int64) or List): + Batch of lengths (B,). + olens(Tensor(int64) or List): + Batch of lengths (B,). Returns: Tensor: Mask tensor indicating non-padded part. @@ -322,9 +324,12 @@ class GuidedMultiHeadAttentionLoss(GuidedAttentionLoss): """Calculate forward propagation. Args: - att_ws(Tensor): Batch of multi head attention weights (B, H, T_max_out, T_max_in). - ilens(Tensor): Batch of input lenghts (B,). - olens(Tensor): Batch of output lenghts (B,). + att_ws(Tensor): + Batch of multi head attention weights (B, H, T_max_out, T_max_in). + ilens(Tensor): + Batch of input lenghts (B,). + olens(Tensor): + Batch of output lenghts (B,). Returns: Tensor: Guided attention loss value. @@ -354,9 +359,12 @@ class Tacotron2Loss(nn.Layer): """Initialize Tactoron2 loss module. Args: - use_masking (bool): Whether to apply masking for padded part in loss calculation. - use_weighted_masking (bool): Whether to apply weighted masking in loss calculation. - bce_pos_weight (float): Weight of positive sample of stop token. + use_masking (bool): + Whether to apply masking for padded part in loss calculation. + use_weighted_masking (bool): + Whether to apply weighted masking in loss calculation. + bce_pos_weight (float): + Weight of positive sample of stop token. """ super().__init__() assert (use_masking != use_weighted_masking) or not use_masking @@ -374,17 +382,25 @@ class Tacotron2Loss(nn.Layer): """Calculate forward propagation. Args: - after_outs(Tensor): Batch of outputs after postnets (B, Lmax, odim). - before_outs(Tensor): Batch of outputs before postnets (B, Lmax, odim). - logits(Tensor): Batch of stop logits (B, Lmax). - ys(Tensor): Batch of padded target features (B, Lmax, odim). 
- stop_labels(Tensor(int64)): Batch of the sequences of stop token labels (B, Lmax). + after_outs(Tensor): + Batch of outputs after postnets (B, Lmax, odim). + before_outs(Tensor): + Batch of outputs before postnets (B, Lmax, odim). + logits(Tensor): + Batch of stop logits (B, Lmax). + ys(Tensor): + Batch of padded target features (B, Lmax, odim). + stop_labels(Tensor(int64)): + Batch of the sequences of stop token labels (B, Lmax). olens(Tensor(int64)): Returns: - Tensor: L1 loss value. - Tensor: Mean square error loss value. - Tensor: Binary cross entropy loss value. + Tensor: + L1 loss value. + Tensor: + Mean square error loss value. + Tensor: + Binary cross entropy loss value. """ # make mask and apply it if self.use_masking: @@ -437,16 +453,24 @@ def stft(x, pad_mode='reflect'): """Perform STFT and convert to magnitude spectrogram. Args: - x(Tensor): Input signal tensor (B, T). - fft_size(int): FFT size. - hop_size(int): Hop size. - win_length(int, optional): window : str, optional (Default value = None) - window(str, optional): Name of window function, see `scipy.signal.get_window` for more - details. Defaults to "hann". - center(bool, optional, optional): center (bool, optional): Whether to pad `x` to make that the + x(Tensor): + Input signal tensor (B, T). + fft_size(int): + FFT size. + hop_size(int): + Hop size. + win_length(int, optional): + window (str, optional): + (Default value = None) + window(str, optional): + Name of window function, see `scipy.signal.get_window` for more details. Defaults to "hann". + center(bool, optional, optional): center (bool, optional): + Whether to pad `x` to make that the :math:`t \times hop\\_length` at the center of :math:`t`-th frame. Default: `True`. - pad_mode(str, optional, optional): (Default value = 'reflect') - hop_length: (Default value = None) + pad_mode(str, optional, optional): + (Default value = 'reflect') + hop_length: + (Default value = None) Returns: Tensor: Magnitude spectrogram (B, #frames, fft_size // 2 + 1). @@ -480,8 +504,10 @@ class SpectralConvergenceLoss(nn.Layer): def forward(self, x_mag, y_mag): """Calculate forward propagation. Args: - x_mag (Tensor): Magnitude spectrogram of predicted signal (B, #frames, #freq_bins). - y_mag (Tensor): Magnitude spectrogram of groundtruth signal (B, #frames, #freq_bins). + x_mag (Tensor): + Magnitude spectrogram of predicted signal (B, #frames, #freq_bins). + y_mag (Tensor): + Magnitude spectrogram of groundtruth signal (B, #frames, #freq_bins). Returns: Tensor: Spectral convergence loss value. """ @@ -501,8 +527,10 @@ class LogSTFTMagnitudeLoss(nn.Layer): def forward(self, x_mag, y_mag): """Calculate forward propagation. Args: - x_mag (Tensor): Magnitude spectrogram of predicted signal (B, #frames, #freq_bins). - y_mag (Tensor): Magnitude spectrogram of groundtruth signal (B, #frames, #freq_bins). + x_mag (Tensor): + Magnitude spectrogram of predicted signal (B, #frames, #freq_bins). + y_mag (Tensor): + Magnitude spectrogram of groundtruth signal (B, #frames, #freq_bins). Returns: Tensor: Log STFT magnitude loss value. """ @@ -531,11 +559,15 @@ class STFTLoss(nn.Layer): def forward(self, x, y): """Calculate forward propagation. Args: - x (Tensor): Predicted signal (B, T). - y (Tensor): Groundtruth signal (B, T). + x (Tensor): + Predicted signal (B, T). + y (Tensor): + Groundtruth signal (B, T). Returns: - Tensor: Spectral convergence loss value. - Tensor: Log STFT magnitude loss value. + Tensor: + Spectral convergence loss value. + Tensor: + Log STFT magnitude loss value. 
""" x_mag = stft(x, self.fft_size, self.shift_size, self.win_length, self.window) @@ -558,10 +590,14 @@ class MultiResolutionSTFTLoss(nn.Layer): window="hann", ): """Initialize Multi resolution STFT loss module. Args: - fft_sizes (list): List of FFT sizes. - hop_sizes (list): List of hop sizes. - win_lengths (list): List of window lengths. - window (str): Window function type. + fft_sizes (list): + List of FFT sizes. + hop_sizes (list): + List of hop sizes. + win_lengths (list): + List of window lengths. + window (str): + Window function type. """ super().__init__() assert len(fft_sizes) == len(hop_sizes) == len(win_lengths) @@ -573,11 +609,15 @@ class MultiResolutionSTFTLoss(nn.Layer): """Calculate forward propagation. Args: - x (Tensor): Predicted signal (B, T) or (B, #subband, T). - y (Tensor): Groundtruth signal (B, T) or (B, #subband, T). + x (Tensor): + Predicted signal (B, T) or (B, #subband, T). + y (Tensor): + Groundtruth signal (B, T) or (B, #subband, T). Returns: - Tensor: Multi resolution spectral convergence loss value. - Tensor: Multi resolution log STFT magnitude loss value. + Tensor: + Multi resolution spectral convergence loss value. + Tensor: + Multi resolution log STFT magnitude loss value. """ if len(x.shape) == 3: # (B, C, T) -> (B x C, T) @@ -615,9 +655,11 @@ class GeneratorAdversarialLoss(nn.Layer): def forward(self, outputs): """Calcualate generator adversarial loss. Args: - outputs (Tensor or List): Discriminator outputs or list of discriminator outputs. + outputs (Tensor or List): + Discriminator outputs or list of discriminator outputs. Returns: - Tensor: Generator adversarial loss value. + Tensor: + Generator adversarial loss value. """ if isinstance(outputs, (tuple, list)): adv_loss = 0.0 @@ -659,13 +701,15 @@ class DiscriminatorAdversarialLoss(nn.Layer): """Calcualate discriminator adversarial loss. Args: - outputs_hat (Tensor or list): Discriminator outputs or list of - discriminator outputs calculated from generator outputs. - outputs (Tensor or list): Discriminator outputs or list of - discriminator outputs calculated from groundtruth. + outputs_hat (Tensor or list): + Discriminator outputs or list of discriminator outputs calculated from generator outputs. + outputs (Tensor or list): + Discriminator outputs or list of discriminator outputs calculated from groundtruth. Returns: - Tensor: Discriminator real loss value. - Tensor: Discriminator fake loss value. + Tensor: + Discriminator real loss value. + Tensor: + Discriminator fake loss value. """ if isinstance(outputs, (tuple, list)): real_loss = 0.0 @@ -766,9 +810,12 @@ def masked_l1_loss(prediction, target, mask): """Compute maksed L1 loss. Args: - prediction(Tensor): The prediction. - target(Tensor): The target. The shape should be broadcastable to ``prediction``. - mask(Tensor): The mask. The shape should be broadcatable to the broadcasted shape of + prediction(Tensor): + The prediction. + target(Tensor): + The target. The shape should be broadcastable to ``prediction``. + mask(Tensor): + The mask. The shape should be broadcatable to the broadcasted shape of ``prediction`` and ``target``. Returns: @@ -916,8 +963,10 @@ class MelSpectrogramLoss(nn.Layer): def forward(self, y_hat, y): """Calculate Mel-spectrogram loss. Args: - y_hat(Tensor): Generated single tensor (B, 1, T). - y(Tensor): Groundtruth single tensor (B, 1, T). + y_hat(Tensor): + Generated single tensor (B, 1, T). + y(Tensor): + Groundtruth single tensor (B, 1, T). Returns: Tensor: Mel-spectrogram loss value. 
@@ -947,9 +996,11 @@ class FeatureMatchLoss(nn.Layer): """Calcualate feature matching loss. Args: - feats_hat(list): List of list of discriminator outputs + feats_hat(list): + List of list of discriminator outputs calcuated from generater outputs. - feats(list): List of list of discriminator outputs + feats(list): + List of list of discriminator outputs Returns: Tensor: Feature matching loss value. @@ -986,11 +1037,16 @@ class KLDivergenceLoss(nn.Layer): """Calculate KL divergence loss. Args: - z_p (Tensor): Flow hidden representation (B, H, T_feats). - logs_q (Tensor): Posterior encoder projected scale (B, H, T_feats). - m_p (Tensor): Expanded text encoder projected mean (B, H, T_feats). - logs_p (Tensor): Expanded text encoder projected scale (B, H, T_feats). - z_mask (Tensor): Mask tensor (B, 1, T_feats). + z_p (Tensor): + Flow hidden representation (B, H, T_feats). + logs_q (Tensor): + Posterior encoder projected scale (B, H, T_feats). + m_p (Tensor): + Expanded text encoder projected mean (B, H, T_feats). + logs_p (Tensor): + Expanded text encoder projected scale (B, H, T_feats). + z_mask (Tensor): + Mask tensor (B, 1, T_feats). Returns: Tensor: KL divergence loss. @@ -1007,3 +1063,66 @@ class KLDivergenceLoss(nn.Layer): loss = kl / paddle.sum(z_mask) return loss + + +# loss for ERNIE SAT +class MLMLoss(nn.Layer): + def __init__(self, + odim: int, + vocab_size: int=0, + lsm_weight: float=0.1, + ignore_id: int=-1, + text_masking: bool=False): + super().__init__() + if text_masking: + self.text_mlm_loss = nn.CrossEntropyLoss(ignore_index=ignore_id) + if lsm_weight > 50: + self.l1_loss_func = nn.MSELoss() + else: + self.l1_loss_func = nn.L1Loss(reduction='none') + self.text_masking = text_masking + self.odim = odim + self.vocab_size = vocab_size + + def forward( + self, + speech: paddle.Tensor, + before_outs: paddle.Tensor, + after_outs: paddle.Tensor, + masked_pos: paddle.Tensor, + # for text_loss when text_masking == True + text: paddle.Tensor=None, + text_outs: paddle.Tensor=None, + text_masked_pos: paddle.Tensor=None): + + xs_pad = speech + mlm_loss_pos = masked_pos > 0 + loss = paddle.sum( + self.l1_loss_func( + paddle.reshape(before_outs, (-1, self.odim)), + paddle.reshape(xs_pad, (-1, self.odim))), + axis=-1) + if after_outs is not None: + loss += paddle.sum( + self.l1_loss_func( + paddle.reshape(after_outs, (-1, self.odim)), + paddle.reshape(xs_pad, (-1, self.odim))), + axis=-1) + mlm_loss = paddle.sum((loss * paddle.reshape( + mlm_loss_pos, [-1]))) / paddle.sum((mlm_loss_pos) + 1e-10) + + text_mlm_loss = None + + if self.text_masking: + assert text is not None + assert text_outs is not None + assert text_masked_pos is not None + text_outs = paddle.reshape(text_outs, [-1, self.vocab_size]) + text = paddle.reshape(text, [-1]) + text_mlm_loss = self.text_mlm_loss(text_outs, text) + text_masked_pos_reshape = paddle.reshape(text_masked_pos, [-1]) + text_mlm_loss = paddle.sum( + text_mlm_loss * + text_masked_pos_reshape) / paddle.sum((text_masked_pos) + 1e-10) + + return mlm_loss, text_mlm_loss diff --git a/paddlespeech/t2s/modules/nets_utils.py b/paddlespeech/t2s/modules/nets_utils.py index 598b63164..798e4dee8 100644 --- a/paddlespeech/t2s/modules/nets_utils.py +++ b/paddlespeech/t2s/modules/nets_utils.py @@ -12,8 +12,10 @@ # See the License for the specific language governing permissions and # limitations under the License. 
# Modified from espnet(https://github.com/espnet/espnet) +import math from typing import Tuple +import numpy as np import paddle from paddle import nn from typeguard import check_argument_types @@ -23,8 +25,10 @@ def pad_list(xs, pad_value): """Perform padding for the list of tensors. Args: - xs (List[Tensor]): List of Tensors [(T_1, `*`), (T_2, `*`), ..., (T_B, `*`)]. - pad_value (float): Value for padding. + xs (List[Tensor]): + List of Tensors [(T_1, `*`), (T_2, `*`), ..., (T_B, `*`)]. + pad_value (float): + Value for padding. Returns: Tensor: Padded tensor (B, Tmax, `*`). @@ -40,7 +44,8 @@ def pad_list(xs, pad_value): """ n_batch = len(xs) max_len = max(x.shape[0] for x in xs) - pad = paddle.full([n_batch, max_len, *xs[0].shape[1:]], pad_value) + pad = paddle.full( + [n_batch, max_len, *xs[0].shape[1:]], pad_value, dtype=xs[0].dtype) for i in range(n_batch): pad[i, :xs[i].shape[0]] = xs[i] @@ -48,13 +53,20 @@ def pad_list(xs, pad_value): return pad -def make_pad_mask(lengths, length_dim=-1): +def make_pad_mask(lengths, xs=None, length_dim=-1): """Make mask tensor containing indices of padded part. Args: - lengths (Tensor(int64)): Batch of lengths (B,). + lengths (Tensor(int64)): + Batch of lengths (B,). + xs (Tensor, optional): + The reference tensor. + If set, masks will be the same shape as this tensor. + length_dim (int, optional): + Dimension indicator of the above tensor. + See the example. - Returns: + Returns: Tensor(bool): Mask tensor containing indices of padded part bool. Examples: @@ -63,45 +75,187 @@ def make_pad_mask(lengths, length_dim=-1): >>> lengths = [5, 3, 2] >>> make_non_pad_mask(lengths) masks = [[0, 0, 0, 0 ,0], - [0, 0, 0, 1, 1], - [0, 0, 1, 1, 1]] + [0, 0, 0, 1, 1], + [0, 0, 1, 1, 1]] + + With the reference tensor. + + >>> xs = paddle.zeros((3, 2, 4)) + >>> make_pad_mask(lengths, xs) + tensor([[[0, 0, 0, 0], + [0, 0, 0, 0]], + [[0, 0, 0, 1], + [0, 0, 0, 1]], + [[0, 0, 1, 1], + [0, 0, 1, 1]]]) + >>> xs = paddle.zeros((3, 2, 6)) + >>> make_pad_mask(lengths, xs) + tensor([[[0, 0, 0, 0, 0, 1], + [0, 0, 0, 0, 0, 1]], + [[0, 0, 0, 1, 1, 1], + [0, 0, 0, 1, 1, 1]], + [[0, 0, 1, 1, 1, 1], + [0, 0, 1, 1, 1, 1]]]) + + With the reference tensor and dimension indicator. 
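The docstring examples above can be reproduced with the core comparison `make_pad_mask` relies on. This is a hedged sketch of the lengths-only path (no `xs` reference tensor):

```python
import paddle

lengths = paddle.to_tensor([5, 3, 2], dtype='int64')
maxlen = int(lengths.max())
seq_range = paddle.arange(0, maxlen, dtype='int64')                    # (maxlen,)
pad_mask = seq_range.unsqueeze(0).expand([lengths.shape[0], maxlen]) \
    >= lengths.unsqueeze(-1)                                           # (B, maxlen), bool
print(pad_mask.astype('int64'))
# [[0 0 0 0 0]
#  [0 0 0 1 1]
#  [0 0 1 1 1]]
print(paddle.logical_not(pad_mask).astype('int64'))                    # make_non_pad_mask analogue
```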
+ + >>> xs = paddle.zeros((3, 6, 6)) + >>> make_pad_mask(lengths, xs, 1) + tensor([[[0, 0, 0, 0, 0, 0], + [0, 0, 0, 0, 0, 0], + [0, 0, 0, 0, 0, 0], + [0, 0, 0, 0, 0, 0], + [0, 0, 0, 0, 0, 0], + [1, 1, 1, 1, 1, 1]], + [[0, 0, 0, 0, 0, 0], + [0, 0, 0, 0, 0, 0], + [0, 0, 0, 0, 0, 0], + [1, 1, 1, 1, 1, 1], + [1, 1, 1, 1, 1, 1], + [1, 1, 1, 1, 1, 1]], + [[0, 0, 0, 0, 0, 0], + [0, 0, 0, 0, 0, 0], + [1, 1, 1, 1, 1, 1], + [1, 1, 1, 1, 1, 1], + [1, 1, 1, 1, 1, 1], + [1, 1, 1, 1, 1, 1]]]) + >>> make_pad_mask(lengths, xs, 2) + tensor([[[0, 0, 0, 0, 0, 1], + [0, 0, 0, 0, 0, 1], + [0, 0, 0, 0, 0, 1], + [0, 0, 0, 0, 0, 1], + [0, 0, 0, 0, 0, 1], + [0, 0, 0, 0, 0, 1]], + [[0, 0, 0, 1, 1, 1], + [0, 0, 0, 1, 1, 1], + [0, 0, 0, 1, 1, 1], + [0, 0, 0, 1, 1, 1], + [0, 0, 0, 1, 1, 1], + [0, 0, 0, 1, 1, 1]], + [[0, 0, 1, 1, 1, 1], + [0, 0, 1, 1, 1, 1], + [0, 0, 1, 1, 1, 1], + [0, 0, 1, 1, 1, 1], + [0, 0, 1, 1, 1, 1], + [0, 0, 1, 1, 1, 1]]],) + """ if length_dim == 0: raise ValueError("length_dim cannot be 0: {}".format(length_dim)) bs = paddle.shape(lengths)[0] - maxlen = lengths.max() + if xs is None: + maxlen = lengths.max() + else: + maxlen = paddle.shape(xs)[length_dim] + seq_range = paddle.arange(0, maxlen, dtype=paddle.int64) seq_range_expand = seq_range.unsqueeze(0).expand([bs, maxlen]) seq_length_expand = lengths.unsqueeze(-1) - mask = seq_range_expand >= seq_length_expand - + mask = seq_range_expand >= seq_length_expand.cast(seq_range_expand.dtype) + + if xs is not None: + assert paddle.shape(xs)[0] == bs, (paddle.shape(xs)[0], bs) + + if length_dim < 0: + length_dim = len(paddle.shape(xs)) + length_dim + # ind = (:, None, ..., None, :, , None, ..., None) + ind = tuple( + slice(None) if i in (0, length_dim) else None + for i in range(len(paddle.shape(xs)))) + mask = paddle.expand(mask[ind], paddle.shape(xs)) return mask -def make_non_pad_mask(lengths, length_dim=-1): +def make_non_pad_mask(lengths, xs=None, length_dim=-1): """Make mask tensor containing indices of non-padded part. Args: - lengths (Tensor(int64) or List): Batch of lengths (B,). - xs (Tensor, optional): The reference tensor. + lengths (Tensor(int64) or List): + Batch of lengths (B,). + xs (Tensor, optional): + The reference tensor. If set, masks will be the same shape as this tensor. - length_dim (int, optional): Dimension indicator of the above tensor. + length_dim (int, optional): + Dimension indicator of the above tensor. See the example. Returns: - Tensor(bool): mask tensor containing indices of padded part bool. + Tensor(bool): + mask tensor containing indices of padded part bool. - Examples: + Examples: With only lengths. >>> lengths = [5, 3, 2] >>> make_non_pad_mask(lengths) masks = [[1, 1, 1, 1 ,1], - [1, 1, 1, 0, 0], - [1, 1, 0, 0, 0]] + [1, 1, 1, 0, 0], + [1, 1, 0, 0, 0]] + + With the reference tensor. + + >>> xs = paddle.zeros((3, 2, 4)) + >>> make_non_pad_mask(lengths, xs) + tensor([[[1, 1, 1, 1], + [1, 1, 1, 1]], + [[1, 1, 1, 0], + [1, 1, 1, 0]], + [[1, 1, 0, 0], + [1, 1, 0, 0]]]) + >>> xs = paddle.zeros((3, 2, 6)) + >>> make_non_pad_mask(lengths, xs) + tensor([[[1, 1, 1, 1, 1, 0], + [1, 1, 1, 1, 1, 0]], + [[1, 1, 1, 0, 0, 0], + [1, 1, 1, 0, 0, 0]], + [[1, 1, 0, 0, 0, 0], + [1, 1, 0, 0, 0, 0]]]) + + With the reference tensor and dimension indicator. 
+ + >>> xs = paddle.zeros((3, 6, 6)) + >>> make_non_pad_mask(lengths, xs, 1) + tensor([[[1, 1, 1, 1, 1, 1], + [1, 1, 1, 1, 1, 1], + [1, 1, 1, 1, 1, 1], + [1, 1, 1, 1, 1, 1], + [1, 1, 1, 1, 1, 1], + [0, 0, 0, 0, 0, 0]], + [[1, 1, 1, 1, 1, 1], + [1, 1, 1, 1, 1, 1], + [1, 1, 1, 1, 1, 1], + [0, 0, 0, 0, 0, 0], + [0, 0, 0, 0, 0, 0], + [0, 0, 0, 0, 0, 0]], + [[1, 1, 1, 1, 1, 1], + [1, 1, 1, 1, 1, 1], + [0, 0, 0, 0, 0, 0], + [0, 0, 0, 0, 0, 0], + [0, 0, 0, 0, 0, 0], + [0, 0, 0, 0, 0, 0]]]) + >>> make_non_pad_mask(lengths, xs, 2) + tensor([[[1, 1, 1, 1, 1, 0], + [1, 1, 1, 1, 1, 0], + [1, 1, 1, 1, 1, 0], + [1, 1, 1, 1, 1, 0], + [1, 1, 1, 1, 1, 0], + [1, 1, 1, 1, 1, 0]], + [[1, 1, 1, 0, 0, 0], + [1, 1, 1, 0, 0, 0], + [1, 1, 1, 0, 0, 0], + [1, 1, 1, 0, 0, 0], + [1, 1, 1, 0, 0, 0], + [1, 1, 1, 0, 0, 0]], + [[1, 1, 0, 0, 0, 0], + [1, 1, 0, 0, 0, 0], + [1, 1, 0, 0, 0, 0], + [1, 1, 0, 0, 0, 0], + [1, 1, 0, 0, 0, 0], + [1, 1, 0, 0, 0, 0]]]) + """ - return paddle.logical_not(make_pad_mask(lengths, length_dim)) + return paddle.logical_not(make_pad_mask(lengths, xs, length_dim)) def initialize(model: nn.Layer, init: str): @@ -112,8 +266,10 @@ def initialize(model: nn.Layer, init: str): Custom initialization routines can be implemented into submodules Args: - model (nn.Layer): Target. - init (str): Method of initialization. + model (nn.Layer): + Target. + init (str): + Method of initialization. """ assert check_argument_types() @@ -140,12 +296,17 @@ def get_random_segments( segment_size: int, ) -> Tuple[paddle.Tensor, paddle.Tensor]: """Get random segments. Args: - x (Tensor): Input tensor (B, C, T). - x_lengths (Tensor): Length tensor (B,). - segment_size (int): Segment size. + x (Tensor): + Input tensor (B, C, T). + x_lengths (Tensor): + Length tensor (B,). + segment_size (int): + Segment size. Returns: - Tensor: Segmented tensor (B, C, segment_size). - Tensor: Start index tensor (B,). + Tensor: + Segmented tensor (B, C, segment_size). + Tensor: + Start index tensor (B,). """ b, c, t = paddle.shape(x) max_start_idx = x_lengths - segment_size @@ -161,9 +322,12 @@ def get_segments( segment_size: int, ) -> paddle.Tensor: """Get segments. Args: - x (Tensor): Input tensor (B, C, T). - start_idxs (Tensor): Start index tensor (B,). - segment_size (int): Segment size. + x (Tensor): + Input tensor (B, C, T). + start_idxs (Tensor): + Start index tensor (B,). + segment_size (int): + Segment size. Returns: Tensor: Segmented tensor (B, C, segment_size). """ @@ -194,3 +358,294 @@ def paddle_gather(x, dim, index): ind2 = paddle.transpose(paddle.stack(nd_index), [1, 0]).astype("int64") paddle_out = paddle.gather_nd(x, ind2).reshape(index_shape) return paddle_out + + +# for ERNIE SAT +# mask phones +def phones_masking(xs_pad: paddle.Tensor, + src_mask: paddle.Tensor, + align_start: paddle.Tensor, + align_end: paddle.Tensor, + align_start_lens: paddle.Tensor, + mlm_prob: float=0.8, + mean_phn_span: int=8, + span_bdy: paddle.Tensor=None): + ''' + Args: + xs_pad (paddle.Tensor): + input speech (B, Tmax, D). + src_mask (paddle.Tensor): + mask of speech (B, 1, Tmax). + align_start (paddle.Tensor): + frame level phone alignment start (B, Tmax2). + align_end (paddle.Tensor): + frame level phone alignment end (B, Tmax2). + align_start_lens (paddle.Tensor): + length of align_start (B, ). + mlm_prob (float): + mean_phn_span (int): + span_bdy (paddle.Tensor): + masked mel boundary of input speech (B, 2). + Returns: + paddle.Tensor[bool]: masked position of input speech (B, Tmax). 
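`get_random_segments` / `get_segments` above slice fixed-size training windows out of variable-length utterances (as VITS does for its waveform decoder). A hedged, loop-based re-statement with illustrative shapes:

```python
import paddle

def random_segments(x, x_lengths, segment_size):
    """x: (B, C, T); x_lengths: (B,); returns (B, C, segment_size) segments and (B,) starts."""
    max_start = paddle.clip(x_lengths - segment_size, min=0).astype('float32')
    start_idxs = (paddle.rand([x.shape[0]]) * max_start).astype('int64')   # per-example starts
    segments = paddle.stack([x[i, :, int(s):int(s) + segment_size]
                             for i, s in enumerate(start_idxs.numpy())])
    return segments, start_idxs

x = paddle.randn([4, 80, 200])
lens = paddle.to_tensor([200, 180, 150, 120], dtype='int64')
segs, starts = random_segments(x, lens, segment_size=64)
print(segs.shape, starts.numpy())
```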
+ ''' + bz, sent_len, _ = paddle.shape(xs_pad) + masked_pos = paddle.zeros((bz, sent_len)) + if mlm_prob == 1.0: + masked_pos += 1 + elif mean_phn_span == 0: + # only speech + length = sent_len + mean_phn_span = min(length * mlm_prob // 3, 50) + masked_phn_idxs = random_spans_noise_mask( + length=length, mlm_prob=mlm_prob, + mean_phn_span=mean_phn_span).nonzero() + masked_pos[:, masked_phn_idxs] = 1 + else: + for idx in range(bz): + # for inference + if span_bdy is not None: + for s, e in zip(span_bdy[idx][::2], span_bdy[idx][1::2]): + masked_pos[idx, s:e] = 1 + # for training + else: + length = align_start_lens[idx] + if length < 2: + continue + masked_phn_idxs = random_spans_noise_mask( + length=length, + mlm_prob=mlm_prob, + mean_phn_span=mean_phn_span).nonzero() + masked_start = align_start[idx][masked_phn_idxs].tolist() + masked_end = align_end[idx][masked_phn_idxs].tolist() + for s, e in zip(masked_start, masked_end): + masked_pos[idx, s:e] = 1 + non_eos_mask = paddle.reshape(src_mask, paddle.shape(xs_pad)[:2]) + masked_pos = masked_pos * non_eos_mask + masked_pos = paddle.cast(masked_pos, 'bool') + + return masked_pos + + +# mask speech and phones +def phones_text_masking(xs_pad: paddle.Tensor, + src_mask: paddle.Tensor, + text_pad: paddle.Tensor, + text_mask: paddle.Tensor, + align_start: paddle.Tensor, + align_end: paddle.Tensor, + align_start_lens: paddle.Tensor, + mlm_prob: float=0.8, + mean_phn_span: int=8, + span_bdy: paddle.Tensor=None): + ''' + Args: + xs_pad (paddle.Tensor): + input speech (B, Tmax, D). + src_mask (paddle.Tensor): + mask of speech (B, 1, Tmax). + text_pad (paddle.Tensor): + input text (B, Tmax2). + text_mask (paddle.Tensor): + mask of text (B, 1, Tmax2). + align_start (paddle.Tensor): + frame level phone alignment start (B, Tmax2). + align_end (paddle.Tensor): + frame level phone alignment end (B, Tmax2). + align_start_lens (paddle.Tensor): + length of align_start (B, ). + mlm_prob (float): + mean_phn_span (int): + span_bdy (paddle.Tensor): + masked mel boundary of input speech (B, 2). + Returns: + paddle.Tensor[bool]: + masked position of input speech (B, Tmax). + paddle.Tensor[bool]: + masked position of input text (B, Tmax2). 
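The inner loop of `phones_masking` above expands the selected phone indices to frame-level positions through the alignment boundaries. A toy NumPy walk-through (all values illustrative):

```python
import numpy as np

align_start = np.array([0, 12, 20, 35, 50])    # frame where each phone begins (toy values)
align_end = np.array([12, 20, 35, 50, 64])     # frame where each phone ends
masked_phn_idxs = [1, 3]                        # phones selected for masking

masked_pos = np.zeros(64, dtype=np.int64)
for s, e in zip(align_start[masked_phn_idxs], align_end[masked_phn_idxs]):
    masked_pos[s:e] = 1                         # mark every frame covered by a masked phone
print(masked_pos.reshape(4, 16))                # frames 12-19 and 35-49 are masked
```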
+ ''' + bz, sent_len, _ = paddle.shape(xs_pad) + masked_pos = paddle.zeros((bz, sent_len)) + _, text_len = paddle.shape(text_pad) + text_mask_num_lower = math.ceil(text_len * (1 - mlm_prob) * 0.5) + text_masked_pos = paddle.zeros((bz, text_len)) + + if mlm_prob == 1.0: + masked_pos += 1 + elif mean_phn_span == 0: + # only speech + length = sent_len + mean_phn_span = min(length * mlm_prob // 3, 50) + masked_phn_idxs = random_spans_noise_mask( + length=length, mlm_prob=mlm_prob, + mean_phn_span=mean_phn_span).nonzero() + masked_pos[:, masked_phn_idxs] = 1 + else: + for idx in range(bz): + # for inference + if span_bdy is not None: + for s, e in zip(span_bdy[idx][::2], span_bdy[idx][1::2]): + masked_pos[idx, s:e] = 1 + # for training + else: + length = align_start_lens[idx] + if length < 2: + continue + masked_phn_idxs = random_spans_noise_mask( + length=length, + mlm_prob=mlm_prob, + mean_phn_span=mean_phn_span).nonzero() + unmasked_phn_idxs = list( + set(range(length)) - set(masked_phn_idxs[0].tolist())) + np.random.shuffle(unmasked_phn_idxs) + masked_text_idxs = unmasked_phn_idxs[:text_mask_num_lower] + text_masked_pos[idx, masked_text_idxs] = 1 + masked_start = align_start[idx][masked_phn_idxs].tolist() + masked_end = align_end[idx][masked_phn_idxs].tolist() + for s, e in zip(masked_start, masked_end): + masked_pos[idx, s:e] = 1 + non_eos_mask = paddle.reshape(src_mask, shape=paddle.shape(xs_pad)[:2]) + masked_pos = masked_pos * non_eos_mask + non_eos_text_mask = paddle.reshape( + text_mask, shape=paddle.shape(text_pad)[:2]) + text_masked_pos = text_masked_pos * non_eos_text_mask + masked_pos = paddle.cast(masked_pos, 'bool') + text_masked_pos = paddle.cast(text_masked_pos, 'bool') + + return masked_pos, text_masked_pos + + +def get_seg_pos(speech_pad: paddle.Tensor, + text_pad: paddle.Tensor, + align_start: paddle.Tensor, + align_end: paddle.Tensor, + align_start_lens: paddle.Tensor, + seg_emb: bool=False): + ''' + Args: + speech_pad (paddle.Tensor): + input speech (B, Tmax, D). + text_pad (paddle.Tensor): + input text (B, Tmax2). + align_start (paddle.Tensor): + frame level phone alignment start (B, Tmax2). + align_end (paddle.Tensor): + frame level phone alignment end (B, Tmax2). + align_start_lens (paddle.Tensor): + length of align_start (B, ). + seg_emb (bool): + whether to use segment embedding. + Returns: + paddle.Tensor[int]: n-th phone of each mel, 0<=n<=Tmax2 (B, Tmax). 
+ eg: + Tensor(shape=[1, 328], dtype=int64, place=Place(gpu:0), stop_gradient=True, + [[0 , 1 , 1 , 1 , 1 , 1 , 1 , 1 , 1 , 1 , 1 , 1 , 1 , 1 , 1 , 1 , 1 , 1 , + 1 , 1 , 1 , 1 , 1 , 1 , 1 , 1 , 1 , 1 , 1 , 1 , 1 , 1 , 1 , 1 , 1 , 1 , + 1 , 1 , 1 , 1 , 1 , 1 , 1 , 1 , 1 , 1 , 1 , 1 , 1 , 1 , 1 , 1 , 1 , 1 , + 1 , 1 , 1 , 1 , 1 , 1 , 1 , 1 , 1 , 1 , 1 , 1 , 1 , 1 , 1 , 1 , 1 , 1 , + 1 , 1 , 1 , 1 , 1 , 1 , 1 , 1 , 1 , 2 , 2 , 2 , 3 , 3 , 3 , 4 , 4 , 4 , + 5 , 5 , 5 , 6 , 6 , 6 , 6 , 6 , 6 , 6 , 6 , 7 , 7 , 7 , 7 , 7 , 7 , 7 , + 7 , 8 , 8 , 8 , 8 , 9 , 9 , 9 , 9 , 9 , 9 , 9 , 9 , 10, 10, 10, 10, 10, + 10, 10, 10, 11, 11, 11, 11, 12, 12, 12, 12, 12, 12, 12, 12, 12, 12, 13, + 13, 13, 13, 13, 13, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 14, 15, 15, + 15, 15, 15, 15, 15, 16, 16, 16, 16, 16, 16, 17, 17, 17, 17, 17, 17, 17, + 17, 18, 18, 18, 18, 18, 18, 19, 19, 19, 19, 19, 19, 19, 20, 20, 20, 20, + 20, 20, 21, 21, 21, 21, 21, 21, 21, 22, 22, 22, 22, 22, 22, 22, 23, 23, + 23, 23, 23, 23, 23, 23, 24, 24, 24, 24, 24, 24, 24, 24, 24, 25, 25, 25, + 25, 26, 26, 26, 27, 27, 27, 27, 27, 28, 28, 28, 28, 28, 28, 28, 28, 29, + 29, 29, 29, 29, 29, 30, 30, 30, 30, 31, 31, 31, 31, 31, 31, 31, 31, 32, + 32, 32, 32, 32, 33, 33, 33, 33, 33, 33, 33, 33, 34, 34, 34, 34, 35, 35, + 35, 35, 35, 35, 35, 35, 36, 36, 37, 37, 37, 37, 37, 37, 37, 37, 37, 37, + 37, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, 38, + 38, 38, 0 , 0 ]]) + paddle.Tensor[int]: n-th phone of each phone, 0<=n<=Tmax2 (B, Tmax2). + eg: + Tensor(shape=[1, 38], dtype=int64, place=Place(gpu:0), stop_gradient=True, + [[1 , 2 , 3 , 4 , 5 , 6 , 7 , 8 , 9 , 10, 11, 12, 13, 14, 15, 16, 17, + 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, + 36, 37, 38]]) + ''' + + bz, speech_len, _ = paddle.shape(speech_pad) + _, text_len = paddle.shape(text_pad) + + text_seg_pos = paddle.zeros((bz, text_len), dtype='int64') + speech_seg_pos = paddle.zeros((bz, speech_len), dtype='int64') + + if not seg_emb: + return speech_seg_pos, text_seg_pos + for idx in range(bz): + align_length = align_start_lens[idx] + for j in range(align_length): + s, e = align_start[idx][j], align_end[idx][j] + speech_seg_pos[idx, s:e] = j + 1 + text_seg_pos[idx, j] = j + 1 + + return speech_seg_pos, text_seg_pos + + +# randomly select the range of speech and text to mask during training +def random_spans_noise_mask(length: int, + mlm_prob: float=0.8, + mean_phn_span: float=8): + """This function is copy of `random_spans_helper + `__ . + Noise mask consisting of random spans of noise tokens. + The number of noise tokens and the number of noise spans and non-noise spans + are determined deterministically as follows: + num_noise_tokens = round(length * noise_density) + num_nonnoise_spans = num_noise_spans = round(num_noise_tokens / mean_noise_span_length) + Spans alternate between non-noise and noise, beginning with non-noise. + Subject to the above restrictions, all masks are equally likely. + Args: + length: an int32 scalar (length of the incoming token sequence) + noise_density: a float - approximate density of output mask + mean_noise_span_length: a number + Returns: + np.ndarray: a boolean tensor with shape [length] + """ + + orig_length = length + + num_noise_tokens = int(np.round(length * mlm_prob)) + # avoid degeneracy by ensuring positive numbers of noise and nonnoise tokens. 
+ num_noise_tokens = min(max(num_noise_tokens, 1), length - 1) + num_noise_spans = int(np.round(num_noise_tokens / mean_phn_span)) + + # avoid degeneracy by ensuring positive number of noise spans + num_noise_spans = max(num_noise_spans, 1) + num_nonnoise_tokens = length - num_noise_tokens + + # pick the lengths of the noise spans and the non-noise spans + def _random_seg(num_items, num_segs): + """Partition a sequence of items randomly into non-empty segments. + Args: + num_items: + an integer scalar > 0 + num_segs: + an integer scalar in [1, num_items] + Returns: + a Tensor with shape [num_segs] containing positive integers that add + up to num_items + """ + mask_idxs = np.arange(num_items - 1) < (num_segs - 1) + np.random.shuffle(mask_idxs) + first_in_seg = np.pad(mask_idxs, [[1, 0]]) + segment_id = np.cumsum(first_in_seg) + # count length of sub segments assuming that list is sorted + _, segment_length = np.unique(segment_id, return_counts=True) + return segment_length + + noise_span_lens = _random_seg(num_noise_tokens, num_noise_spans) + nonnoise_span_lens = _random_seg(num_nonnoise_tokens, num_noise_spans) + + interleaved_span_lens = np.reshape( + np.stack([nonnoise_span_lens, noise_span_lens], axis=1), + [num_noise_spans * 2]) + span_starts = np.cumsum(interleaved_span_lens)[:-1] + span_start_indicator = np.zeros((length, ), dtype=np.int8) + span_start_indicator[span_starts] = True + span_num = np.cumsum(span_start_indicator) + is_noise = np.equal(span_num % 2, 1) + + return is_noise[:orig_length] diff --git a/paddlespeech/t2s/modules/pqmf.py b/paddlespeech/t2s/modules/pqmf.py index 9860da906..7b42409d8 100644 --- a/paddlespeech/t2s/modules/pqmf.py +++ b/paddlespeech/t2s/modules/pqmf.py @@ -26,9 +26,12 @@ def design_prototype_filter(taps=62, cutoff_ratio=0.142, beta=9.0): filters of cosine modulated filterbanks`_. Args: - taps (int): The number of filter taps. - cutoff_ratio (float): Cut-off frequency ratio. - beta (float): Beta coefficient for kaiser window. + taps (int): + The number of filter taps. + cutoff_ratio (float): + Cut-off frequency ratio. + beta (float): + Beta coefficient for kaiser window. Returns: ndarray: Impluse response of prototype filter (taps + 1,). @@ -66,10 +69,14 @@ class PQMF(nn.Layer): See dicussion in https://github.com/kan-bayashi/ParallelWaveGAN/issues/195. Args: - subbands (int): The number of subbands. - taps (int): The number of filter taps. - cutoff_ratio (float): Cut-off frequency ratio. - beta (float): Beta coefficient for kaiser window. + subbands (int): + The number of subbands. + taps (int): + The number of filter taps. + cutoff_ratio (float): + Cut-off frequency ratio. + beta (float): + Beta coefficient for kaiser window. """ super().__init__() @@ -103,7 +110,8 @@ class PQMF(nn.Layer): def analysis(self, x): """Analysis with PQMF. Args: - x (Tensor): Input tensor (B, 1, T). + x (Tensor): + Input tensor (B, 1, T). Returns: Tensor: Output tensor (B, subbands, T // subbands). """ @@ -113,7 +121,8 @@ class PQMF(nn.Layer): def synthesis(self, x): """Synthesis with PQMF. Args: - x (Tensor): Input tensor (B, subbands, T // subbands). + x (Tensor): + Input tensor (B, subbands, T // subbands). Returns: Tensor: Output tensor (B, 1, T). 
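`design_prototype_filter` (documented above) follows the classic windowed-sinc recipe: an ideal low-pass impulse response truncated to `taps + 1` samples and shaped by a Kaiser window. A hedged NumPy re-derivation using the defaults from the signature; this is an illustrative reconstruction, not the library function itself:

```python
import numpy as np

def prototype_lowpass(taps=62, cutoff_ratio=0.142, beta=9.0):
    """Kaiser-windowed sinc low-pass of length taps + 1 (taps assumed even)."""
    n = np.arange(taps + 1) - 0.5 * taps           # centred sample indices
    omega_c = np.pi * cutoff_ratio
    h_ideal = np.full(taps + 1, cutoff_ratio)      # limit of sin(w_c n)/(pi n) at n = 0
    nz = n != 0
    h_ideal[nz] = np.sin(omega_c * n[nz]) / (np.pi * n[nz])
    return h_ideal * np.kaiser(taps + 1, beta)     # apply the Kaiser window

h = prototype_lowpass()
print(h.shape, float(h.sum()))                     # (63,) and the DC gain of the filter
```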
""" diff --git a/paddlespeech/t2s/modules/predictor/duration_predictor.py b/paddlespeech/t2s/modules/predictor/duration_predictor.py index 33ed575b4..cb38fd5b4 100644 --- a/paddlespeech/t2s/modules/predictor/duration_predictor.py +++ b/paddlespeech/t2s/modules/predictor/duration_predictor.py @@ -50,12 +50,18 @@ class DurationPredictor(nn.Layer): """Initilize duration predictor module. Args: - idim (int):Input dimension. - n_layers (int, optional): Number of convolutional layers. - n_chans (int, optional): Number of channels of convolutional layers. - kernel_size (int, optional): Kernel size of convolutional layers. - dropout_rate (float, optional): Dropout rate. - offset (float, optional): Offset value to avoid nan in log domain. + idim (int): + Input dimension. + n_layers (int, optional): + Number of convolutional layers. + n_chans (int, optional): + Number of channels of convolutional layers. + kernel_size (int, optional): + Kernel size of convolutional layers. + dropout_rate (float, optional): + Dropout rate. + offset (float, optional): + Offset value to avoid nan in log domain. """ super().__init__() @@ -99,8 +105,10 @@ class DurationPredictor(nn.Layer): def forward(self, xs, x_masks=None): """Calculate forward propagation. Args: - xs(Tensor): Batch of input sequences (B, Tmax, idim). - x_masks(ByteTensor, optional, optional): Batch of masks indicating padded part (B, Tmax). (Default value = None) + xs(Tensor): + Batch of input sequences (B, Tmax, idim). + x_masks(ByteTensor, optional, optional): + Batch of masks indicating padded part (B, Tmax). (Default value = None) Returns: Tensor: Batch of predicted durations in log domain (B, Tmax). @@ -110,8 +118,10 @@ class DurationPredictor(nn.Layer): def inference(self, xs, x_masks=None): """Inference duration. Args: - xs(Tensor): Batch of input sequences (B, Tmax, idim). - x_masks(Tensor(bool), optional, optional): Batch of masks indicating padded part (B, Tmax). (Default value = None) + xs(Tensor): + Batch of input sequences (B, Tmax, idim). + x_masks(Tensor(bool), optional, optional): + Batch of masks indicating padded part (B, Tmax). (Default value = None) Returns: Tensor: Batch of predicted durations in linear domain int64 (B, Tmax). @@ -140,8 +150,10 @@ class DurationPredictorLoss(nn.Layer): """Calculate forward propagation. Args: - outputs(Tensor): Batch of prediction durations in log domain (B, T) - targets(Tensor): Batch of groundtruth durations in linear domain (B, T) + outputs(Tensor): + Batch of prediction durations in log domain (B, T) + targets(Tensor): + Batch of groundtruth durations in linear domain (B, T) Returns: Tensor: Mean squared error loss value. diff --git a/paddlespeech/t2s/modules/predictor/length_regulator.py b/paddlespeech/t2s/modules/predictor/length_regulator.py index e4fbf5491..bdfa18391 100644 --- a/paddlespeech/t2s/modules/predictor/length_regulator.py +++ b/paddlespeech/t2s/modules/predictor/length_regulator.py @@ -36,7 +36,8 @@ class LengthRegulator(nn.Layer): """Initilize length regulator module. Args: - pad_value (float, optional): Value used for padding. + pad_value (float, optional): + Value used for padding. """ super().__init__() @@ -97,9 +98,12 @@ class LengthRegulator(nn.Layer): """Calculate forward propagation. Args: - xs (Tensor): Batch of sequences of char or phoneme embeddings (B, Tmax, D). - ds (Tensor(int64)): Batch of durations of each frame (B, T). - alpha (float, optional): Alpha value to control speed of speech. 
+ xs (Tensor): + Batch of sequences of char or phoneme embeddings (B, Tmax, D). + ds (Tensor(int64)): + Batch of durations of each frame (B, T). + alpha (float, optional): + Alpha value to control speed of speech. Returns: Tensor: replicated input tensor based on durations (B, T*, D). diff --git a/paddlespeech/t2s/modules/predictor/variance_predictor.py b/paddlespeech/t2s/modules/predictor/variance_predictor.py index 8afbf2576..4c2a67cc4 100644 --- a/paddlespeech/t2s/modules/predictor/variance_predictor.py +++ b/paddlespeech/t2s/modules/predictor/variance_predictor.py @@ -43,11 +43,16 @@ class VariancePredictor(nn.Layer): """Initilize duration predictor module. Args: - idim (int): Input dimension. - n_layers (int, optional): Number of convolutional layers. - n_chans (int, optional): Number of channels of convolutional layers. - kernel_size (int, optional): Kernel size of convolutional layers. - dropout_rate (float, optional): Dropout rate. + idim (int): + Input dimension. + n_layers (int, optional): + Number of convolutional layers. + n_chans (int, optional): + Number of channels of convolutional layers. + kernel_size (int, optional): + Kernel size of convolutional layers. + dropout_rate (float, optional): + Dropout rate. """ assert check_argument_types() super().__init__() @@ -74,11 +79,14 @@ class VariancePredictor(nn.Layer): """Calculate forward propagation. Args: - xs (Tensor): Batch of input sequences (B, Tmax, idim). - x_masks (Tensor(bool), optional): Batch of masks indicating padded part (B, Tmax, 1). + xs (Tensor): + Batch of input sequences (B, Tmax, idim). + x_masks (Tensor(bool), optional): + Batch of masks indicating padded part (B, Tmax, 1). Returns: - Tensor: Batch of predicted sequences (B, Tmax, 1). + Tensor: + Batch of predicted sequences (B, Tmax, 1). """ # (B, idim, Tmax) xs = xs.transpose([0, 2, 1]) diff --git a/paddlespeech/t2s/modules/residual_block.py b/paddlespeech/t2s/modules/residual_block.py index 5965a7203..f21eedecb 100644 --- a/paddlespeech/t2s/modules/residual_block.py +++ b/paddlespeech/t2s/modules/residual_block.py @@ -29,15 +29,24 @@ class WaveNetResidualBlock(nn.Layer): refer to `WaveNet: A Generative Model for Raw Audio `_. Args: - kernel_size (int, optional): Kernel size of the 1D convolution, by default 3 - residual_channels (int, optional): Feature size of the residual output(and also the input), by default 64 - gate_channels (int, optional): Output feature size of the 1D convolution, by default 128 - skip_channels (int, optional): Feature size of the skip output, by default 64 - aux_channels (int, optional): Feature size of the auxiliary input (e.g. spectrogram), by default 80 - dropout (float, optional): Probability of the dropout before the 1D convolution, by default 0. - dilation (int, optional): Dilation of the 1D convolution, by default 1 - bias (bool, optional): Whether to use bias in the 1D convolution, by default True - use_causal_conv (bool, optional): Whether to use causal padding for the 1D convolution, by default False + kernel_size (int, optional): + Kernel size of the 1D convolution, by default 3 + residual_channels (int, optional): + Feature size of the residual output(and also the input), by default 64 + gate_channels (int, optional): + Output feature size of the 1D convolution, by default 128 + skip_channels (int, optional): + Feature size of the skip output, by default 64 + aux_channels (int, optional): + Feature size of the auxiliary input (e.g. 
spectrogram), by default 80 + dropout (float, optional): + Probability of the dropout before the 1D convolution, by default 0. + dilation (int, optional): + Dilation of the 1D convolution, by default 1 + bias (bool, optional): + Whether to use bias in the 1D convolution, by default True + use_causal_conv (bool, optional): + Whether to use causal padding for the 1D convolution, by default False """ def __init__(self, @@ -81,13 +90,17 @@ class WaveNetResidualBlock(nn.Layer): def forward(self, x, c): """ Args: - x (Tensor): the input features. Shape (N, C_res, T) - c (Tensor): the auxiliary input. Shape (N, C_aux, T) + x (Tensor): + the input features. Shape (N, C_res, T) + c (Tensor): + the auxiliary input. Shape (N, C_aux, T) Returns: - res (Tensor): Shape (N, C_res, T), the residual output, which is used as the + res (Tensor): + Shape (N, C_res, T), the residual output, which is used as the input of the next ResidualBlock in a stack of ResidualBlocks. - skip (Tensor): Shape (N, C_skip, T), the skip output, which is collected among + skip (Tensor): + Shape (N, C_skip, T), the skip output, which is collected among each layer in a stack of ResidualBlocks. """ x_input = x @@ -121,13 +134,20 @@ class HiFiGANResidualBlock(nn.Layer): ): """Initialize HiFiGANResidualBlock module. Args: - kernel_size (int): Kernel size of dilation convolution layer. - channels (int): Number of channels for convolution layer. - dilations (List[int]): List of dilation factors. - use_additional_convs (bool): Whether to use additional convolution layers. - bias (bool): Whether to add bias parameter in convolution layers. - nonlinear_activation (str): Activation function module name. - nonlinear_activation_params (dict): Hyperparameters for activation function. + kernel_size (int): + Kernel size of dilation convolution layer. + channels (int): + Number of channels for convolution layer. + dilations (List[int]): + List of dilation factors. + use_additional_convs (bool): + Whether to use additional convolution layers. + bias (bool): + Whether to add bias parameter in convolution layers. + nonlinear_activation (str): + Activation function module name. + nonlinear_activation_params (dict): + Hyperparameters for activation function. """ super().__init__() @@ -167,7 +187,8 @@ class HiFiGANResidualBlock(nn.Layer): def forward(self, x): """Calculate forward propagation. Args: - x (Tensor): Input tensor (B, channels, T). + x (Tensor): + Input tensor (B, channels, T). Returns: Tensor: Output tensor (B, channels, T). """ diff --git a/paddlespeech/t2s/modules/residual_stack.py b/paddlespeech/t2s/modules/residual_stack.py index 0d949b563..98f5db3cf 100644 --- a/paddlespeech/t2s/modules/residual_stack.py +++ b/paddlespeech/t2s/modules/residual_stack.py @@ -39,15 +39,24 @@ class ResidualStack(nn.Layer): """Initialize ResidualStack module. Args: - kernel_size (int): Kernel size of dilation convolution layer. - channels (int): Number of channels of convolution layers. - dilation (int): Dilation factor. - bias (bool): Whether to add bias parameter in convolution layers. - nonlinear_activation (str): Activation function module name. - nonlinear_activation_params (Dict[str,Any]): Hyperparameters for activation function. - pad (str): Padding function module name before dilated convolution layer. - pad_params (Dict[str, Any]): Hyperparameters for padding function. - use_causal_conv (bool): Whether to use causal convolution. + kernel_size (int): + Kernel size of dilation convolution layer. 
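A quick, illustrative helper (not part of the patch) for reasoning about the dilation factors listed in these residual blocks: each dilated 1-D convolution adds `(kernel_size - 1) * dilation` samples of context.

```python
def receptive_field(kernel_size, dilations):
    """Receptive field in samples of a stack of dilated 1-D convolutions."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)

print(receptive_field(3, (1, 3, 5)))        # 19: kernel 3 with dilations 1, 3, 5
print(receptive_field(3, (1, 3, 9, 27)))    # 81: exponentially growing dilations
```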
+ channels (int): + Number of channels of convolution layers. + dilation (int): + Dilation factor. + bias (bool): + Whether to add bias parameter in convolution layers. + nonlinear_activation (str): + Activation function module name. + nonlinear_activation_params (Dict[str,Any]): + Hyperparameters for activation function. + pad (str): + Padding function module name before dilated convolution layer. + pad_params (Dict[str, Any]): + Hyperparameters for padding function. + use_causal_conv (bool): + Whether to use causal convolution. """ super().__init__() # for compatibility @@ -95,7 +104,8 @@ class ResidualStack(nn.Layer): """Calculate forward propagation. Args: - c (Tensor): Input tensor (B, channels, T). + c (Tensor): + Input tensor (B, channels, T). Returns: Tensor: Output tensor (B, chennels, T). """ diff --git a/paddlespeech/t2s/modules/style_encoder.py b/paddlespeech/t2s/modules/style_encoder.py index 49091eac8..b558e7693 100644 --- a/paddlespeech/t2s/modules/style_encoder.py +++ b/paddlespeech/t2s/modules/style_encoder.py @@ -32,16 +32,26 @@ class StyleEncoder(nn.Layer): Speech Synthesis`: https://arxiv.org/abs/1803.09017 Args: - idim (int, optional): Dimension of the input mel-spectrogram. - gst_tokens (int, optional): The number of GST embeddings. - gst_token_dim (int, optional): Dimension of each GST embedding. - gst_heads (int, optional): The number of heads in GST multihead attention. - conv_layers (int, optional): The number of conv layers in the reference encoder. - conv_chans_list (Sequence[int], optional): List of the number of channels of conv layers in the referece encoder. - conv_kernel_size (int, optional): Kernal size of conv layers in the reference encoder. - conv_stride (int, optional): Stride size of conv layers in the reference encoder. - gru_layers (int, optional): The number of GRU layers in the reference encoder. - gru_units (int, optional):The number of GRU units in the reference encoder. + idim (int, optional): + Dimension of the input mel-spectrogram. + gst_tokens (int, optional): + The number of GST embeddings. + gst_token_dim (int, optional): + Dimension of each GST embedding. + gst_heads (int, optional): + The number of heads in GST multihead attention. + conv_layers (int, optional): + The number of conv layers in the reference encoder. + conv_chans_list (Sequence[int], optional): + List of the number of channels of conv layers in the referece encoder. + conv_kernel_size (int, optional): + Kernal size of conv layers in the reference encoder. + conv_stride (int, optional): + Stride size of conv layers in the reference encoder. + gru_layers (int, optional): + The number of GRU layers in the reference encoder. + gru_units (int, optional): + The number of GRU units in the reference encoder. Todo: * Support manual weight specification in inference. @@ -82,7 +92,8 @@ class StyleEncoder(nn.Layer): """Calculate forward propagation. Args: - speech (Tensor): Batch of padded target features (B, Lmax, odim). + speech (Tensor): + Batch of padded target features (B, Lmax, odim). Returns: Tensor: Style token embeddings (B, token_dim). @@ -104,13 +115,20 @@ class ReferenceEncoder(nn.Layer): Speech Synthesis`: https://arxiv.org/abs/1803.09017 Args: - idim (int, optional): Dimension of the input mel-spectrogram. - conv_layers (int, optional): The number of conv layers in the reference encoder. - conv_chans_list: (Sequence[int], optional): List of the number of channels of conv layers in the referece encoder. 
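The reference encoder's GRU input size is determined by how the strided conv stack shrinks the mel-frequency axis. A hedged sizing helper, using commonly seen defaults (idim 80, kernel 3, stride 2, padding `(k - 1) // 2`, channels 32/32/64/64/128/128) purely as assumptions:

```python
def conv_out_len(length, kernel_size, stride, padding):
    """Standard conv output-length formula, applied here to the frequency axis."""
    return (length + 2 * padding - kernel_size) // stride + 1

idim, kernel_size, stride = 80, 3, 2
conv_chans_list = (32, 32, 64, 64, 128, 128)
freq = idim
for _ in conv_chans_list:
    freq = conv_out_len(freq, kernel_size, stride, (kernel_size - 1) // 2)
print(freq, conv_chans_list[-1] * freq)   # 2 256 -> GRU input units under these assumptions
```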
- conv_kernel_size (int, optional): Kernal size of conv layers in the reference encoder. - conv_stride (int, optional): Stride size of conv layers in the reference encoder. - gru_layers (int, optional): The number of GRU layers in the reference encoder. - gru_units (int, optional): The number of GRU units in the reference encoder. + idim (int, optional): + Dimension of the input mel-spectrogram. + conv_layers (int, optional): + The number of conv layers in the reference encoder. + conv_chans_list: (Sequence[int], optional): + List of the number of channels of conv layers in the referece encoder. + conv_kernel_size (int, optional): + Kernal size of conv layers in the reference encoder. + conv_stride (int, optional): + Stride size of conv layers in the reference encoder. + gru_layers (int, optional): + The number of GRU layers in the reference encoder. + gru_units (int, optional): + The number of GRU units in the reference encoder. """ @@ -168,7 +186,8 @@ class ReferenceEncoder(nn.Layer): def forward(self, speech: paddle.Tensor) -> paddle.Tensor: """Calculate forward propagation. Args: - speech (Tensor): Batch of padded target features (B, Lmax, idim). + speech (Tensor): + Batch of padded target features (B, Lmax, idim). Returns: Tensor: Reference embedding (B, gru_units) @@ -200,11 +219,16 @@ class StyleTokenLayer(nn.Layer): .. _`Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis`: https://arxiv.org/abs/1803.09017 Args: - ref_embed_dim (int, optional): Dimension of the input reference embedding. - gst_tokens (int, optional): The number of GST embeddings. - gst_token_dim (int, optional): Dimension of each GST embedding. - gst_heads (int, optional): The number of heads in GST multihead attention. - dropout_rate (float, optional): Dropout rate in multi-head attention. + ref_embed_dim (int, optional): + Dimension of the input reference embedding. + gst_tokens (int, optional): + The number of GST embeddings. + gst_token_dim (int, optional): + Dimension of each GST embedding. + gst_heads (int, optional): + The number of heads in GST multihead attention. + dropout_rate (float, optional): + Dropout rate in multi-head attention. """ @@ -236,7 +260,8 @@ class StyleTokenLayer(nn.Layer): """Calculate forward propagation. Args: - ref_embs (Tensor): Reference embeddings (B, ref_embed_dim). + ref_embs (Tensor): + Reference embeddings (B, ref_embed_dim). Returns: Tensor: Style token embeddings (B, gst_token_dim). diff --git a/paddlespeech/t2s/modules/tacotron2/attentions.py b/paddlespeech/t2s/modules/tacotron2/attentions.py index a6fde742d..cdaef4608 100644 --- a/paddlespeech/t2s/modules/tacotron2/attentions.py +++ b/paddlespeech/t2s/modules/tacotron2/attentions.py @@ -31,10 +31,14 @@ def _apply_attention_constraint(e, Text-to-Speech with Convolutional Sequence Learning`_. Args: - e(Tensor): Attention energy before applying softmax (1, T). - last_attended_idx(int): The index of the inputs of the last attended [0, T]. - backward_window(int, optional, optional): Backward window size in attention constraint. (Default value = 1) - forward_window(int, optional, optional): Forward window size in attetion constraint. (Default value = 3) + e(Tensor): + Attention energy before applying softmax (1, T). + last_attended_idx(int): + The index of the inputs of the last attended [0, T]. + backward_window(int, optional, optional): + Backward window size in attention constraint. 
(Default value = 1) + forward_window(int, optional, optional): + Forward window size in attetion constraint. (Default value = 3) Returns: Tensor: Monotonic constrained attention energy (1, T). @@ -62,12 +66,18 @@ class AttLoc(nn.Layer): (https://arxiv.org/pdf/1506.07503.pdf) Args: - eprojs (int): projection-units of encoder - dunits (int): units of decoder - att_dim (int): attention dimension - aconv_chans (int): channels of attention convolution - aconv_filts (int): filter size of attention convolution - han_mode (bool): flag to swith on mode of hierarchical attention and not store pre_compute_enc_h + eprojs (int): + projection-units of encoder + dunits (int): + units of decoder + att_dim (int): + attention dimension + aconv_chans (int): + channels of attention convolution + aconv_filts (int): + filter size of attention convolution + han_mode (bool): + flag to swith on mode of hierarchical attention and not store pre_compute_enc_h """ def __init__(self, @@ -117,18 +127,29 @@ class AttLoc(nn.Layer): forward_window=3, ): """Calculate AttLoc forward propagation. Args: - enc_hs_pad(Tensor): padded encoder hidden state (B, T_max, D_enc) - enc_hs_len(Tensor): padded encoder hidden state length (B) - dec_z(Tensor dec_z): decoder hidden state (B, D_dec) - att_prev(Tensor): previous attention weight (B, T_max) - scaling(float, optional): scaling parameter before applying softmax (Default value = 2.0) - forward_window(Tensor, optional): forward window size when constraining attention (Default value = 3) - last_attended_idx(int, optional): index of the inputs of the last attended (Default value = None) - backward_window(int, optional): backward window size in attention constraint (Default value = 1) - forward_window(int, optional): forward window size in attetion constraint (Default value = 3) + enc_hs_pad(Tensor): + padded encoder hidden state (B, T_max, D_enc) + enc_hs_len(Tensor): + padded encoder hidden state length (B) + dec_z(Tensor dec_z): + decoder hidden state (B, D_dec) + att_prev(Tensor): + previous attention weight (B, T_max) + scaling(float, optional): + scaling parameter before applying softmax (Default value = 2.0) + forward_window(Tensor, optional): + forward window size when constraining attention (Default value = 3) + last_attended_idx(int, optional): + index of the inputs of the last attended (Default value = None) + backward_window(int, optional): + backward window size in attention constraint (Default value = 1) + forward_window(int, optional): + forward window size in attetion constraint (Default value = 3) Returns: - Tensor: attention weighted encoder state (B, D_enc) - Tensor: previous attention weights (B, T_max) + Tensor: + attention weighted encoder state (B, D_enc) + Tensor: + previous attention weights (B, T_max) """ batch = paddle.shape(enc_hs_pad)[0] # pre-compute all h outside the decoder loop @@ -192,11 +213,16 @@ class AttForward(nn.Layer): (https://arxiv.org/pdf/1807.06736.pdf) Args: - eprojs (int): projection-units of encoder - dunits (int): units of decoder - att_dim (int): attention dimension - aconv_chans (int): channels of attention convolution - aconv_filts (int): filter size of attention convolution + eprojs (int): + projection-units of encoder + dunits (int): + units of decoder + att_dim (int): + attention dimension + aconv_chans (int): + channels of attention convolution + aconv_filts (int): + filter size of attention convolution """ def __init__(self, eprojs, dunits, att_dim, aconv_chans, aconv_filts): @@ -239,18 +265,28 @@ class AttForward(nn.Layer): 
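The constraint from `_apply_attention_constraint` above keeps inference roughly monotonic by pushing energies outside a small window around the last attended index to `-inf` before the softmax. A hedged Paddle sketch with illustrative window sizes:

```python
import paddle
import paddle.nn.functional as F

def constrain_energies(e, last_attended_idx, backward_window=1, forward_window=3):
    """e: (1, T) attention energies; keep only [last - backward, last + forward)."""
    T = e.shape[1]
    lo = max(last_attended_idx - backward_window, 0)
    hi = min(last_attended_idx + forward_window, T)
    allowed = paddle.zeros([1, T])
    allowed[:, lo:hi] = 1.0
    return paddle.where(allowed > 0, e, paddle.full_like(e, float('-inf')))

e = paddle.randn([1, 10])
w = F.softmax(constrain_energies(e, last_attended_idx=4), axis=-1)
print(w.numpy().round(3))     # nonzero weights only at indices 3..6
```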
"""Calculate AttForward forward propagation. Args: - enc_hs_pad(Tensor): padded encoder hidden state (B, T_max, D_enc) - enc_hs_len(list): padded encoder hidden state length (B,) - dec_z(Tensor): decoder hidden state (B, D_dec) - att_prev(Tensor): attention weights of previous step (B, T_max) - scaling(float, optional): scaling parameter before applying softmax (Default value = 1.0) - last_attended_idx(int, optional): index of the inputs of the last attended (Default value = None) - backward_window(int, optional): backward window size in attention constraint (Default value = 1) - forward_window(int, optional): (Default value = 3) + enc_hs_pad(Tensor): + padded encoder hidden state (B, T_max, D_enc) + enc_hs_len(list): + padded encoder hidden state length (B,) + dec_z(Tensor): + decoder hidden state (B, D_dec) + att_prev(Tensor): + attention weights of previous step (B, T_max) + scaling(float, optional): + scaling parameter before applying softmax (Default value = 1.0) + last_attended_idx(int, optional): + index of the inputs of the last attended (Default value = None) + backward_window(int, optional): + backward window size in attention constraint (Default value = 1) + forward_window(int, optional): + (Default value = 3) Returns: - Tensor: attention weighted encoder state (B, D_enc) - Tensor: previous attention weights (B, T_max) + Tensor: + attention weighted encoder state (B, D_enc) + Tensor: + previous attention weights (B, T_max) """ batch = len(enc_hs_pad) # pre-compute all h outside the decoder loop @@ -321,12 +357,18 @@ class AttForwardTA(nn.Layer): (https://arxiv.org/pdf/1807.06736.pdf) Args: - eunits (int): units of encoder - dunits (int): units of decoder - att_dim (int): attention dimension - aconv_chans (int): channels of attention convolution - aconv_filts (int): filter size of attention convolution - odim (int): output dimension + eunits (int): + units of encoder + dunits (int): + units of decoder + att_dim (int): + attention dimension + aconv_chans (int): + channels of attention convolution + aconv_filts (int): + filter size of attention convolution + odim (int): + output dimension """ def __init__(self, eunits, dunits, att_dim, aconv_chans, aconv_filts, odim): @@ -372,19 +414,30 @@ class AttForwardTA(nn.Layer): """Calculate AttForwardTA forward propagation. 
Args: - enc_hs_pad(Tensor): padded encoder hidden state (B, Tmax, eunits) - enc_hs_len(list Tensor): padded encoder hidden state length (B,) - dec_z(Tensor): decoder hidden state (B, dunits) - att_prev(Tensor): attention weights of previous step (B, T_max) - out_prev(Tensor): decoder outputs of previous step (B, odim) - scaling(float, optional): scaling parameter before applying softmax (Default value = 1.0) - last_attended_idx(int, optional): index of the inputs of the last attended (Default value = None) - backward_window(int, optional): backward window size in attention constraint (Default value = 1) - forward_window(int, optional): (Default value = 3) + enc_hs_pad(Tensor): + padded encoder hidden state (B, Tmax, eunits) + enc_hs_len(list Tensor): + padded encoder hidden state length (B,) + dec_z(Tensor): + decoder hidden state (B, dunits) + att_prev(Tensor): + attention weights of previous step (B, T_max) + out_prev(Tensor): + decoder outputs of previous step (B, odim) + scaling(float, optional): + scaling parameter before applying softmax (Default value = 1.0) + last_attended_idx(int, optional): + index of the inputs of the last attended (Default value = None) + backward_window(int, optional): + backward window size in attention constraint (Default value = 1) + forward_window(int, optional): + (Default value = 3) Returns: - Tensor: attention weighted encoder state (B, dunits) - Tensor: previous attention weights (B, Tmax) + Tensor: + attention weighted encoder state (B, dunits) + Tensor: + previous attention weights (B, Tmax) """ batch = len(enc_hs_pad) # pre-compute all h outside the decoder loop diff --git a/paddlespeech/t2s/modules/tacotron2/decoder.py b/paddlespeech/t2s/modules/tacotron2/decoder.py index ebdfa3879..41c94b63f 100644 --- a/paddlespeech/t2s/modules/tacotron2/decoder.py +++ b/paddlespeech/t2s/modules/tacotron2/decoder.py @@ -45,10 +45,14 @@ class Prenet(nn.Layer): """Initialize prenet module. Args: - idim (int): Dimension of the inputs. - odim (int): Dimension of the outputs. - n_layers (int, optional): The number of prenet layers. - n_units (int, optional): The number of prenet units. + idim (int): + Dimension of the inputs. + odim (int): + Dimension of the outputs. + n_layers (int, optional): + The number of prenet layers. + n_units (int, optional): + The number of prenet units. """ super().__init__() self.dropout_rate = dropout_rate @@ -62,7 +66,8 @@ class Prenet(nn.Layer): """Calculate forward propagation. Args: - x (Tensor): Batch of input tensors (B, ..., idim). + x (Tensor): + Batch of input tensors (B, ..., idim). Returns: Tensor: Batch of output tensors (B, ..., odim). @@ -212,7 +217,8 @@ class ZoneOutCell(nn.Layer): """Calculate forward propagation. Args: - inputs (Tensor): Batch of input tensor (B, input_size). + inputs (Tensor): + Batch of input tensor (B, input_size). hidden (tuple): - Tensor: Batch of initial hidden states (B, hidden_size). - Tensor: Batch of initial cell states (B, hidden_size). @@ -277,26 +283,39 @@ class Decoder(nn.Layer): """Initialize Tacotron2 decoder module. Args: - idim (int): Dimension of the inputs. - odim (int): Dimension of the outputs. - att (nn.Layer): Instance of attention class. - dlayers (int, optional): The number of decoder lstm layers. - dunits (int, optional): The number of decoder lstm units. - prenet_layers (int, optional): The number of prenet layers. - prenet_units (int, optional): The number of prenet units. - postnet_layers (int, optional): The number of postnet layers. 
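`ZoneOutCell` above regularises the decoder LSTM with zoneout: during training a random subset of hidden units keeps its previous value, and at inference the expectation is used instead. A hedged sketch of that update rule (names illustrative):

```python
import paddle

def zoneout(h_prev, h_next, rate=0.1, training=True):
    """Blend previous and new hidden states; `rate` is the probability of keeping the old value."""
    if training:
        keep_prev = paddle.cast(paddle.rand(h_next.shape) < rate, h_next.dtype)
        return keep_prev * h_prev + (1.0 - keep_prev) * h_next
    return rate * h_prev + (1.0 - rate) * h_next

h_prev = paddle.zeros([2, 4])
h_next = paddle.ones([2, 4])
print(zoneout(h_prev, h_next, rate=0.5))          # roughly half the units stay at 0 per row
```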
- postnet_filts (int, optional): The number of postnet filter size. - postnet_chans (int, optional): The number of postnet filter channels. - output_activation_fn (nn.Layer, optional): Activation function for outputs. - cumulate_att_w (bool, optional): Whether to cumulate previous attention weight. - use_batch_norm (bool, optional): Whether to use batch normalization. - use_concate : bool, optional + idim (int): + Dimension of the inputs. + odim (int): + Dimension of the outputs. + att (nn.Layer): + Instance of attention class. + dlayers (int, optional): + The number of decoder lstm layers. + dunits (int, optional): + The number of decoder lstm units. + prenet_layers (int, optional): + The number of prenet layers. + prenet_units (int, optional): + The number of prenet units. + postnet_layers (int, optional): + The number of postnet layers. + postnet_filts (int, optional): + The number of postnet filter size. + postnet_chans (int, optional): + The number of postnet filter channels. + output_activation_fn (nn.Layer, optional): + Activation function for outputs. + cumulate_att_w (bool, optional): + Whether to cumulate previous attention weight. + use_batch_norm (bool, optional): + Whether to use batch normalization. + use_concate (bool, optional): Whether to concatenate encoder embedding with decoder lstm outputs. - dropout_rate : float, optional + dropout_rate (float, optional): Dropout rate. - zoneout_rate : float, optional + zoneout_rate (float, optional): Zoneout rate. - reduction_factor : int, optional + reduction_factor (int, optional): Reduction factor. """ super().__init__() @@ -363,15 +382,22 @@ class Decoder(nn.Layer): """Calculate forward propagation. Args: - hs (Tensor): Batch of the sequences of padded hidden states (B, Tmax, idim). - hlens (Tensor(int64) padded): Batch of lengths of each input batch (B,). - ys (Tensor): Batch of the sequences of padded target features (B, Lmax, odim). + hs (Tensor): + Batch of the sequences of padded hidden states (B, Tmax, idim). + hlens (Tensor(int64) padded): + Batch of lengths of each input batch (B,). + ys (Tensor): + Batch of the sequences of padded target features (B, Lmax, odim). Returns: - Tensor: Batch of output tensors after postnet (B, Lmax, odim). - Tensor: Batch of output tensors before postnet (B, Lmax, odim). - Tensor: Batch of logits of stop prediction (B, Lmax). - Tensor: Batch of attention weights (B, Lmax, Tmax). + Tensor: + Batch of output tensors after postnet (B, Lmax, odim). + Tensor: + Batch of output tensors before postnet (B, Lmax, odim). + Tensor: + Batch of logits of stop prediction (B, Lmax). + Tensor: + Batch of attention weights (B, Lmax, Tmax). Note: This computation is performed in teacher-forcing manner. @@ -471,20 +497,30 @@ class Decoder(nn.Layer): forward_window=None, ): """Generate the sequence of features given the sequences of characters. Args: - h(Tensor): Input sequence of encoder hidden states (T, C). - threshold(float, optional, optional): Threshold to stop generation. (Default value = 0.5) - minlenratio(float, optional, optional): Minimum length ratio. If set to 1.0 and the length of input is 10, + h(Tensor): + Input sequence of encoder hidden states (T, C). + threshold(float, optional, optional): + Threshold to stop generation. (Default value = 0.5) + minlenratio(float, optional, optional): + Minimum length ratio. If set to 1.0 and the length of input is 10, the minimum length of outputs will be 10 * 1 = 10. (Default value = 0.0) - maxlenratio(float, optional, optional): Minimum length ratio. 
If set to 10 and the length of input is 10, + maxlenratio(float, optional, optional): + Minimum length ratio. If set to 10 and the length of input is 10, the maximum length of outputs will be 10 * 10 = 100. (Default value = 0.0) - use_att_constraint(bool, optional): Whether to apply attention constraint introduced in `Deep Voice 3`_. (Default value = False) - backward_window(int, optional): Backward window size in attention constraint. (Default value = None) - forward_window(int, optional): (Default value = None) + use_att_constraint(bool, optional): + Whether to apply attention constraint introduced in `Deep Voice 3`_. (Default value = False) + backward_window(int, optional): + Backward window size in attention constraint. (Default value = None) + forward_window(int, optional): + (Default value = None) Returns: - Tensor: Output sequence of features (L, odim). - Tensor: Output sequence of stop probabilities (L,). - Tensor: Attention weights (L, T). + Tensor: + Output sequence of features (L, odim). + Tensor: + Output sequence of stop probabilities (L,). + Tensor: + Attention weights (L, T). Note: This computation is performed in auto-regressive manner. @@ -625,9 +661,12 @@ class Decoder(nn.Layer): """Calculate all of the attention weights. Args: - hs (Tensor): Batch of the sequences of padded hidden states (B, Tmax, idim). - hlens (Tensor(int64)): Batch of lengths of each input batch (B,). - ys (Tensor): Batch of the sequences of padded target features (B, Lmax, odim). + hs (Tensor): + Batch of the sequences of padded hidden states (B, Tmax, idim). + hlens (Tensor(int64)): + Batch of lengths of each input batch (B,). + ys (Tensor): + Batch of the sequences of padded target features (B, Lmax, odim). Returns: numpy.ndarray: diff --git a/paddlespeech/t2s/modules/tacotron2/encoder.py b/paddlespeech/t2s/modules/tacotron2/encoder.py index db102a115..224c82400 100644 --- a/paddlespeech/t2s/modules/tacotron2/encoder.py +++ b/paddlespeech/t2s/modules/tacotron2/encoder.py @@ -46,17 +46,28 @@ class Encoder(nn.Layer): padding_idx=0, ): """Initialize Tacotron2 encoder module. Args: - idim (int): Dimension of the inputs. - input_layer (str): Input layer type. - embed_dim (int, optional): Dimension of character embedding. - elayers (int, optional): The number of encoder blstm layers. - eunits (int, optional): The number of encoder blstm units. - econv_layers (int, optional): The number of encoder conv layers. - econv_filts (int, optional): The number of encoder conv filter size. - econv_chans (int, optional): The number of encoder conv filter channels. - use_batch_norm (bool, optional): Whether to use batch normalization. - use_residual (bool, optional): Whether to use residual connection. - dropout_rate (float, optional): Dropout rate. + idim (int): + Dimension of the inputs. + input_layer (str): + Input layer type. + embed_dim (int, optional): + Dimension of character embedding. + elayers (int, optional): + The number of encoder blstm layers. + eunits (int, optional): + The number of encoder blstm units. + econv_layers (int, optional): + The number of encoder conv layers. + econv_filts (int, optional): + The number of encoder conv filter size. + econv_chans (int, optional): + The number of encoder conv filter channels. + use_batch_norm (bool, optional): + Whether to use batch normalization. + use_residual (bool, optional): + Whether to use residual connection. + dropout_rate (float, optional): + Dropout rate. 
""" super().__init__() @@ -127,14 +138,18 @@ class Encoder(nn.Layer): """Calculate forward propagation. Args: - xs (Tensor): Batch of the padded sequence. Either character ids (B, Tmax) + xs (Tensor): + Batch of the padded sequence. Either character ids (B, Tmax) or acoustic feature (B, Tmax, idim * encoder_reduction_factor). Padded value should be 0. - ilens (Tensor(int64)): Batch of lengths of each input batch (B,). + ilens (Tensor(int64)): + Batch of lengths of each input batch (B,). Returns: - Tensor: Batch of the sequences of encoder states(B, Tmax, eunits). - Tensor(int64): Batch of lengths of each sequence (B,) + Tensor: + Batch of the sequences of encoder states(B, Tmax, eunits). + Tensor(int64): + Batch of lengths of each sequence (B,) """ xs = self.embed(xs).transpose([0, 2, 1]) if self.convs is not None: @@ -161,8 +176,8 @@ class Encoder(nn.Layer): """Inference. Args: - x (Tensor): The sequeunce of character ids (T,) - or acoustic feature (T, idim * encoder_reduction_factor). + x (Tensor): + The sequeunce of character ids (T,) or acoustic feature (T, idim * encoder_reduction_factor). Returns: Tensor: The sequences of encoder states(T, eunits). diff --git a/paddlespeech/t2s/modules/tade_res_block.py b/paddlespeech/t2s/modules/tade_res_block.py index b2275e236..799cbe9fd 100644 --- a/paddlespeech/t2s/modules/tade_res_block.py +++ b/paddlespeech/t2s/modules/tade_res_block.py @@ -60,11 +60,15 @@ class TADELayer(nn.Layer): def forward(self, x, c): """Calculate forward propagation. Args: - x (Tensor): Input tensor (B, in_channels, T). - c (Tensor): Auxiliary input tensor (B, aux_channels, T). + x (Tensor): + Input tensor (B, in_channels, T). + c (Tensor): + Auxiliary input tensor (B, aux_channels, T). Returns: - Tensor: Output tensor (B, in_channels, T * upsample_factor). - Tensor: Upsampled aux tensor (B, in_channels, T * upsample_factor). + Tensor: + Output tensor (B, in_channels, T * upsample_factor). + Tensor: + Upsampled aux tensor (B, in_channels, T * upsample_factor). """ x = self.norm(x) @@ -138,11 +142,15 @@ class TADEResBlock(nn.Layer): """Calculate forward propagation. Args: - x (Tensor): Input tensor (B, in_channels, T). - c (Tensor): Auxiliary input tensor (B, aux_channels, T). + x (Tensor): + Input tensor (B, in_channels, T). + c (Tensor): + Auxiliary input tensor (B, aux_channels, T). Returns: - Tensor: Output tensor (B, in_channels, T * upsample_factor). - Tensor: Upsampled auxirialy tensor (B, in_channels, T * upsample_factor). + Tensor: + Output tensor (B, in_channels, T * upsample_factor). + Tensor: + Upsampled auxirialy tensor (B, in_channels, T * upsample_factor). """ residual = x x, c = self.tade1(x, c) diff --git a/paddlespeech/t2s/modules/transformer/attention.py b/paddlespeech/t2s/modules/transformer/attention.py index cdb95b211..d7a032445 100644 --- a/paddlespeech/t2s/modules/transformer/attention.py +++ b/paddlespeech/t2s/modules/transformer/attention.py @@ -25,9 +25,12 @@ from paddlespeech.t2s.modules.masked_fill import masked_fill class MultiHeadedAttention(nn.Layer): """Multi-Head Attention layer. Args: - n_head (int): The number of heads. - n_feat (int): The number of features. - dropout_rate (float): Dropout rate. + n_head (int): + The number of heads. + n_feat (int): + The number of features. + dropout_rate (float): + Dropout rate. """ def __init__(self, n_head, n_feat, dropout_rate): @@ -48,14 +51,20 @@ class MultiHeadedAttention(nn.Layer): """Transform query, key and value. Args: - query(Tensor): query tensor (#batch, time1, size). 
- key(Tensor): Key tensor (#batch, time2, size). - value(Tensor): Value tensor (#batch, time2, size). + query(Tensor): + query tensor (#batch, time1, size). + key(Tensor): + Key tensor (#batch, time2, size). + value(Tensor): + Value tensor (#batch, time2, size). Returns: - Tensor: Transformed query tensor (#batch, n_head, time1, d_k). - Tensor: Transformed key tensor (#batch, n_head, time2, d_k). - Tensor: Transformed value tensor (#batch, n_head, time2, d_k). + Tensor: + Transformed query tensor (#batch, n_head, time1, d_k). + Tensor: + Transformed key tensor (#batch, n_head, time2, d_k). + Tensor: + Transformed value tensor (#batch, n_head, time2, d_k). """ n_batch = paddle.shape(query)[0] @@ -77,9 +86,12 @@ class MultiHeadedAttention(nn.Layer): """Compute attention context vector. Args: - value(Tensor): Transformed value (#batch, n_head, time2, d_k). - scores(Tensor): Attention score (#batch, n_head, time1, time2). - mask(Tensor, optional): Mask (#batch, 1, time2) or (#batch, time1, time2). (Default value = None) + value(Tensor): + Transformed value (#batch, n_head, time2, d_k). + scores(Tensor): + Attention score (#batch, n_head, time1, time2). + mask(Tensor, optional): + Mask (#batch, 1, time2) or (#batch, time1, time2). (Default value = None) Returns: Tensor: Transformed value (#batch, time1, d_model) weighted by the attention score (#batch, time1, time2). @@ -113,10 +125,14 @@ class MultiHeadedAttention(nn.Layer): """Compute scaled dot product attention. Args: - query(Tensor): Query tensor (#batch, time1, size). - key(Tensor): Key tensor (#batch, time2, size). - value(Tensor): Value tensor (#batch, time2, size). - mask(Tensor, optional): Mask tensor (#batch, 1, time2) or (#batch, time1, time2). (Default value = None) + query(Tensor): + Query tensor (#batch, time1, size). + key(Tensor): + Key tensor (#batch, time2, size). + value(Tensor): + Value tensor (#batch, time2, size). + mask(Tensor, optional): + Mask tensor (#batch, 1, time2) or (#batch, time1, time2). (Default value = None) Returns: Tensor: Output tensor (#batch, time1, d_model). @@ -134,10 +150,14 @@ class RelPositionMultiHeadedAttention(MultiHeadedAttention): Paper: https://arxiv.org/abs/1901.02860 Args: - n_head (int): The number of heads. - n_feat (int): The number of features. - dropout_rate (float): Dropout rate. - zero_triu (bool): Whether to zero the upper triangular part of attention matrix. + n_head (int): + The number of heads. + n_feat (int): + The number of features. + dropout_rate (float): + Dropout rate. + zero_triu (bool): + Whether to zero the upper triangular part of attention matrix. """ def __init__(self, n_head, n_feat, dropout_rate, zero_triu=False): @@ -161,10 +181,11 @@ class RelPositionMultiHeadedAttention(MultiHeadedAttention): def rel_shift(self, x): """Compute relative positional encoding. Args: - x(Tensor): Input tensor (batch, head, time1, 2*time1-1). + x(Tensor): + Input tensor (batch, head, time1, 2*time1-1). Returns: - Tensor:Output tensor. + Tensor: Output tensor. """ b, h, t1, t2 = paddle.shape(x) zero_pad = paddle.zeros((b, h, t1, 1)) @@ -183,11 +204,16 @@ class RelPositionMultiHeadedAttention(MultiHeadedAttention): """Compute 'Scaled Dot Product Attention' with rel. positional encoding. Args: - query(Tensor): Query tensor (#batch, time1, size). - key(Tensor): Key tensor (#batch, time2, size). - value(Tensor): Value tensor (#batch, time2, size). - pos_emb(Tensor): Positional embedding tensor (#batch, 2*time1-1, size). 
- mask(Tensor): Mask tensor (#batch, 1, time2) or (#batch, time1, time2). + query(Tensor): + Query tensor (#batch, time1, size). + key(Tensor): + Key tensor (#batch, time2, size). + value(Tensor): + Value tensor (#batch, time2, size). + pos_emb(Tensor): + Positional embedding tensor (#batch, 2*time1-1, size). + mask(Tensor): + Mask tensor (#batch, 1, time2) or (#batch, time1, time2). Returns: Tensor: Output tensor (#batch, time1, d_model). @@ -220,3 +246,103 @@ class RelPositionMultiHeadedAttention(MultiHeadedAttention): scores = (matrix_ac + matrix_bd) / math.sqrt(self.d_k) return self.forward_attention(v, scores, mask) + + +class LegacyRelPositionMultiHeadedAttention(MultiHeadedAttention): + """Multi-Head Attention layer with relative position encoding (old version). + Details can be found in https://github.com/espnet/espnet/pull/2816. + Paper: https://arxiv.org/abs/1901.02860 + + Args: + n_head (int): + The number of heads. + n_feat (int): + The number of features. + dropout_rate (float): + Dropout rate. + zero_triu (bool): + Whether to zero the upper triangular part of attention matrix. + """ + + def __init__(self, n_head, n_feat, dropout_rate, zero_triu=False): + """Construct an RelPositionMultiHeadedAttention object.""" + super().__init__(n_head, n_feat, dropout_rate) + self.zero_triu = zero_triu + # linear transformation for positional encoding + self.linear_pos = nn.Linear(n_feat, n_feat, bias_attr=False) + # these two learnable bias are used in matrix c and matrix d + # as described in https://arxiv.org/abs/1901.02860 Section 3.3 + + self.pos_bias_u = paddle.create_parameter( + shape=(self.h, self.d_k), + dtype='float32', + default_initializer=paddle.nn.initializer.XavierUniform()) + self.pos_bias_v = paddle.create_parameter( + shape=(self.h, self.d_k), + dtype='float32', + default_initializer=paddle.nn.initializer.XavierUniform()) + + def rel_shift(self, x): + """Compute relative positional encoding. + Args: + x(Tensor): + Input tensor (batch, head, time1, time2). + Returns: + Tensor:Output tensor. + """ + b, h, t1, t2 = paddle.shape(x) + zero_pad = paddle.zeros((b, h, t1, 1)) + x_padded = paddle.concat([zero_pad, x], axis=-1) + x_padded = paddle.reshape(x_padded, [b, h, t2 + 1, t1]) + # only keep the positions from 0 to time2 + x = paddle.reshape(x_padded[:, :, 1:], [b, h, t1, t2]) + + if self.zero_triu: + ones = paddle.ones((t1, t2)) + x = x * paddle.tril(ones, t2 - 1)[None, None, :, :] + + return x + + def forward(self, query, key, value, pos_emb, mask): + """Compute 'Scaled Dot Product Attention' with rel. positional encoding. + + Args: + query(Tensor): Query tensor (#batch, time1, size). + key(Tensor): Key tensor (#batch, time2, size). + value(Tensor): Value tensor (#batch, time2, size). + pos_emb(Tensor): Positional embedding tensor (#batch, time1, size). + mask(Tensor): Mask tensor (#batch, 1, time2) or (#batch, time1, time2). + + Returns: + Tensor: Output tensor (#batch, time1, d_model). 
+ """ + q, k, v = self.forward_qkv(query, key, value) + # (batch, time1, head, d_k) + q = paddle.transpose(q, [0, 2, 1, 3]) + + n_batch_pos = paddle.shape(pos_emb)[0] + p = paddle.reshape( + self.linear_pos(pos_emb), [n_batch_pos, -1, self.h, self.d_k]) + # (batch, head, time1, d_k) + p = paddle.transpose(p, [0, 2, 1, 3]) + # (batch, head, time1, d_k) + q_with_bias_u = paddle.transpose((q + self.pos_bias_u), [0, 2, 1, 3]) + # (batch, head, time1, d_k) + q_with_bias_v = paddle.transpose((q + self.pos_bias_v), [0, 2, 1, 3]) + + # compute attention score + # first compute matrix a and matrix c + # as described in https://arxiv.org/abs/1901.02860 Section 3.3 + # (batch, head, time1, time2) + matrix_ac = paddle.matmul(q_with_bias_u, + paddle.transpose(k, [0, 1, 3, 2])) + + # compute matrix b and matrix d + # (batch, head, time1, time1) + matrix_bd = paddle.matmul(q_with_bias_v, + paddle.transpose(p, [0, 1, 3, 2])) + matrix_bd = self.rel_shift(matrix_bd) + # (batch, head, time1, time2) + scores = (matrix_ac + matrix_bd) / math.sqrt(self.d_k) + + return self.forward_attention(v, scores, mask) diff --git a/paddlespeech/t2s/modules/transformer/decoder.py b/paddlespeech/t2s/modules/transformer/decoder.py index a8db7345a..e68487678 100644 --- a/paddlespeech/t2s/modules/transformer/decoder.py +++ b/paddlespeech/t2s/modules/transformer/decoder.py @@ -37,28 +37,46 @@ class Decoder(nn.Layer): """Transfomer decoder module. Args: - odim (int): Output diminsion. - self_attention_layer_type (str): Self-attention layer type. - attention_dim (int): Dimention of attention. - attention_heads (int): The number of heads of multi head attention. - conv_wshare (int): The number of kernel of convolution. Only used in + odim (int): + Output diminsion. + self_attention_layer_type (str): + Self-attention layer type. + attention_dim (int): + Dimention of attention. + attention_heads (int): + The number of heads of multi head attention. + conv_wshare (int): + The number of kernel of convolution. Only used in self_attention_layer_type == "lightconv*" or "dynamiconv*". - conv_kernel_length (Union[int, str]):Kernel size str of convolution + conv_kernel_length (Union[int, str]): + Kernel size str of convolution (e.g. 71_71_71_71_71_71). Only used in self_attention_layer_type == "lightconv*" or "dynamiconv*". - conv_usebias (bool): Whether to use bias in convolution. Only used in + conv_usebias (bool): + Whether to use bias in convolution. Only used in self_attention_layer_type == "lightconv*" or "dynamiconv*". - linear_units(int): The number of units of position-wise feed forward. - num_blocks (int): The number of decoder blocks. - dropout_rate (float): Dropout rate. - positional_dropout_rate (float): Dropout rate after adding positional encoding. - self_attention_dropout_rate (float): Dropout rate in self-attention. - src_attention_dropout_rate (float): Dropout rate in source-attention. - input_layer (Union[str, nn.Layer]): Input layer type. - use_output_layer (bool): Whether to use output layer. - pos_enc_class (nn.Layer): Positional encoding module class. + linear_units(int): + The number of units of position-wise feed forward. + num_blocks (int): + The number of decoder blocks. + dropout_rate (float): + Dropout rate. + positional_dropout_rate (float): + Dropout rate after adding positional encoding. + self_attention_dropout_rate (float): + Dropout rate in self-attention. + src_attention_dropout_rate (float): + Dropout rate in source-attention. + input_layer (Union[str, nn.Layer]): + Input layer type. 
+ use_output_layer (bool): + Whether to use output layer. + pos_enc_class (nn.Layer): + Positional encoding module class. `PositionalEncoding `or `ScaledPositionalEncoding` - normalize_before (bool): Whether to use layer_norm before the first block. - concat_after (bool): Whether to concat attention layer's input and output. + normalize_before (bool): + Whether to use layer_norm before the first block. + concat_after (bool): + Whether to concat attention layer's input and output. if True, additional linear will be applied. i.e. x -> x + linear(concat(x, att(x))) if False, no additional linear will be applied. i.e. x -> x + att(x) @@ -143,17 +161,22 @@ class Decoder(nn.Layer): def forward(self, tgt, tgt_mask, memory, memory_mask): """Forward decoder. Args: - tgt(Tensor): Input token ids, int64 (#batch, maxlen_out) if input_layer == "embed". + tgt(Tensor): + Input token ids, int64 (#batch, maxlen_out) if input_layer == "embed". In the other case, input tensor (#batch, maxlen_out, odim). - tgt_mask(Tensor): Input token mask (#batch, maxlen_out). - memory(Tensor): Encoded memory, float32 (#batch, maxlen_in, feat). - memory_mask(Tensor): Encoded memory mask (#batch, maxlen_in). + tgt_mask(Tensor): + Input token mask (#batch, maxlen_out). + memory(Tensor): + Encoded memory, float32 (#batch, maxlen_in, feat). + memory_mask(Tensor): + Encoded memory mask (#batch, maxlen_in). Returns: Tensor: Decoded token score before softmax (#batch, maxlen_out, odim) if use_output_layer is True. In the other case,final block outputs (#batch, maxlen_out, attention_dim). - Tensor: Score mask before softmax (#batch, maxlen_out). + Tensor: + Score mask before softmax (#batch, maxlen_out). """ x = self.embed(tgt) @@ -169,14 +192,20 @@ class Decoder(nn.Layer): """Forward one step. Args: - tgt(Tensor): Input token ids, int64 (#batch, maxlen_out). - tgt_mask(Tensor): Input token mask (#batch, maxlen_out). - memory(Tensor): Encoded memory, float32 (#batch, maxlen_in, feat). - cache((List[Tensor]), optional): List of cached tensors. (Default value = None) + tgt(Tensor): + Input token ids, int64 (#batch, maxlen_out). + tgt_mask(Tensor): + Input token mask (#batch, maxlen_out). + memory(Tensor): + Encoded memory, float32 (#batch, maxlen_in, feat). + cache((List[Tensor]), optional): + List of cached tensors. (Default value = None) Returns: - Tensor: Output tensor (batch, maxlen_out, odim). - List[Tensor]: List of cache tensors of each decoder layer. + Tensor: + Output tensor (batch, maxlen_out, odim). + List[Tensor]: + List of cache tensors of each decoder layer. """ x = self.embed(tgt) @@ -219,9 +248,12 @@ class Decoder(nn.Layer): """Score new token batch (required). Args: - ys(Tensor): paddle.int64 prefix tokens (n_batch, ylen). - states(List[Any]): Scorer states for prefix tokens. - xs(Tensor): The encoder feature that generates ys (n_batch, xlen, n_feat). + ys(Tensor): + paddle.int64 prefix tokens (n_batch, ylen). + states(List[Any]): + Scorer states for prefix tokens. + xs(Tensor): + The encoder feature that generates ys (n_batch, xlen, n_feat). Returns: tuple[Tensor, List[Any]]: diff --git a/paddlespeech/t2s/modules/transformer/decoder_layer.py b/paddlespeech/t2s/modules/transformer/decoder_layer.py index 9a13cd794..0a79e9548 100644 --- a/paddlespeech/t2s/modules/transformer/decoder_layer.py +++ b/paddlespeech/t2s/modules/transformer/decoder_layer.py @@ -24,16 +24,23 @@ class DecoderLayer(nn.Layer): Args: - size (int): Input dimension. - self_attn (nn.Layer): Self-attention module instance. 
+ size (int): + Input dimension. + self_attn (nn.Layer): + Self-attention module instance. `MultiHeadedAttention` instance can be used as the argument. - src_attn (nn.Layer): Self-attention module instance. + src_attn (nn.Layer): + Self-attention module instance. `MultiHeadedAttention` instance can be used as the argument. - feed_forward (nn.Layer): Feed-forward module instance. + feed_forward (nn.Layer): + Feed-forward module instance. `PositionwiseFeedForward`, `MultiLayeredConv1d`, or `Conv1dLinear` instance can be used as the argument. - dropout_rate (float): Dropout rate. - normalize_before (bool): Whether to use layer_norm before the first block. - concat_after (bool): Whether to concat attention layer's input and output. + dropout_rate (float): + Dropout rate. + normalize_before (bool): + Whether to use layer_norm before the first block. + concat_after (bool): + Whether to concat attention layer's input and output. if True, additional linear will be applied. i.e. x -> x + linear(concat(x, att(x))) if False, no additional linear will be applied. i.e. x -> x + att(x) @@ -69,11 +76,16 @@ class DecoderLayer(nn.Layer): """Compute decoded features. Args: - tgt(Tensor): Input tensor (#batch, maxlen_out, size). - tgt_mask(Tensor): Mask for input tensor (#batch, maxlen_out). - memory(Tensor): Encoded memory, float32 (#batch, maxlen_in, size). - memory_mask(Tensor): Encoded memory mask (#batch, maxlen_in). - cache(List[Tensor], optional): List of cached tensors. + tgt(Tensor): + Input tensor (#batch, maxlen_out, size). + tgt_mask(Tensor): + Mask for input tensor (#batch, maxlen_out). + memory(Tensor): + Encoded memory, float32 (#batch, maxlen_in, size). + memory_mask(Tensor): + Encoded memory mask (#batch, maxlen_in). + cache(List[Tensor], optional): + List of cached tensors. Each tensor shape should be (#batch, maxlen_out - 1, size). (Default value = None) Returns: Tensor diff --git a/paddlespeech/t2s/modules/transformer/embedding.py b/paddlespeech/t2s/modules/transformer/embedding.py index d9339d20b..7ba301cbd 100644 --- a/paddlespeech/t2s/modules/transformer/embedding.py +++ b/paddlespeech/t2s/modules/transformer/embedding.py @@ -23,11 +23,16 @@ class PositionalEncoding(nn.Layer): """Positional encoding. Args: - d_model (int): Embedding dimension. - dropout_rate (float): Dropout rate. - max_len (int): Maximum input length. - reverse (bool): Whether to reverse the input position. - type (str): dtype of param + d_model (int): + Embedding dimension. + dropout_rate (float): + Dropout rate. + max_len (int): + Maximum input length. + reverse (bool): + Whether to reverse the input position. + type (str): + dtype of param """ def __init__(self, @@ -68,7 +73,8 @@ class PositionalEncoding(nn.Layer): """Add positional encoding. Args: - x (Tensor): Input tensor (batch, time, `*`). + x (Tensor): + Input tensor (batch, time, `*`). Returns: Tensor: Encoded tensor (batch, time, `*`). @@ -84,10 +90,14 @@ class ScaledPositionalEncoding(PositionalEncoding): See Sec. 3.2 https://arxiv.org/abs/1809.08895 Args: - d_model (int): Embedding dimension. - dropout_rate (float): Dropout rate. - max_len (int): Maximum input length. - dtype (str): dtype of param + d_model (int): + Embedding dimension. + dropout_rate (float): + Dropout rate. + max_len (int): + Maximum input length. + dtype (str): + dtype of param """ def __init__(self, d_model, dropout_rate, max_len=5000, dtype="float32"): @@ -111,7 +121,8 @@ class ScaledPositionalEncoding(PositionalEncoding): """Add positional encoding. 
Args: - x (Tensor): Input tensor (batch, time, `*`). + x (Tensor): + Input tensor (batch, time, `*`). Returns: Tensor: Encoded tensor (batch, time, `*`). """ @@ -127,9 +138,12 @@ class RelPositionalEncoding(nn.Layer): See : Appendix B in https://arxiv.org/abs/1901.02860 Args: - d_model (int): Embedding dimension. - dropout_rate (float): Dropout rate. - max_len (int): Maximum input length. + d_model (int): + Embedding dimension. + dropout_rate (float): + Dropout rate. + max_len (int): + Maximum input length. """ def __init__(self, d_model, dropout_rate, max_len=5000, dtype="float32"): @@ -175,7 +189,8 @@ class RelPositionalEncoding(nn.Layer): def forward(self, x: paddle.Tensor): """Add positional encoding. Args: - x (Tensor):Input tensor (batch, time, `*`). + x (Tensor): + Input tensor (batch, time, `*`). Returns: Tensor: Encoded tensor (batch, time, `*`). """ @@ -185,3 +200,70 @@ class RelPositionalEncoding(nn.Layer): pe_size = paddle.shape(self.pe) pos_emb = self.pe[:, pe_size[1] // 2 - T + 1:pe_size[1] // 2 + T, ] return self.dropout(x), self.dropout(pos_emb) + + +class LegacyRelPositionalEncoding(PositionalEncoding): + """Relative positional encoding module (old version). + + Details can be found in https://github.com/espnet/espnet/pull/2816. + + See : Appendix B in https://arxiv.org/abs/1901.02860 + + Args: + d_model (int): + Embedding dimension. + dropout_rate (float): + Dropout rate. + max_len (int): + Maximum input length. + + """ + + def __init__(self, d_model: int, dropout_rate: float, max_len: int=5000): + """ + Args: + d_model (int): + Embedding dimension. + dropout_rate (float): + Dropout rate. + max_len (int, optional): + [Maximum input length.]. Defaults to 5000. + """ + super().__init__(d_model, dropout_rate, max_len, reverse=True) + + def extend_pe(self, x): + """Reset the positional encodings.""" + if self.pe is not None: + if paddle.shape(self.pe)[1] >= paddle.shape(x)[1]: + return + pe = paddle.zeros((paddle.shape(x)[1], self.d_model)) + if self.reverse: + position = paddle.arange( + paddle.shape(x)[1] - 1, -1, -1.0, + dtype=paddle.float32).unsqueeze(1) + else: + position = paddle.arange( + 0, paddle.shape(x)[1], dtype=paddle.float32).unsqueeze(1) + div_term = paddle.exp( + paddle.arange(0, self.d_model, 2, dtype=paddle.float32) * + -(math.log(10000.0) / self.d_model)) + pe[:, 0::2] = paddle.sin(position * div_term) + pe[:, 1::2] = paddle.cos(position * div_term) + pe = pe.unsqueeze(0) + self.pe = pe + + def forward(self, x: paddle.Tensor): + """Compute positional encoding. + Args: + x (Tensor): + Input tensor (batch, time, `*`). + Returns: + Tensor: + Encoded tensor (batch, time, `*`). + Tensor: + Positional embedding tensor (1, time, `*`). + """ + self.extend_pe(x) + x = x * self.xscale + pos_emb = self.pe[:, :paddle.shape(x)[1]] + return self.dropout(x), self.dropout(pos_emb) diff --git a/paddlespeech/t2s/modules/transformer/encoder.py b/paddlespeech/t2s/modules/transformer/encoder.py index 11986360a..f2aed5892 100644 --- a/paddlespeech/t2s/modules/transformer/encoder.py +++ b/paddlespeech/t2s/modules/transformer/encoder.py @@ -38,32 +38,55 @@ class BaseEncoder(nn.Layer): """Base Encoder module. Args: - idim (int): Input dimension. - attention_dim (int): Dimention of attention. - attention_heads (int): The number of heads of multi head attention. - linear_units (int): The number of units of position-wise feed forward. - num_blocks (int): The number of decoder blocks. - dropout_rate (float): Dropout rate. 
- positional_dropout_rate (float): Dropout rate after adding positional encoding. - attention_dropout_rate (float): Dropout rate in attention. - input_layer (Union[str, nn.Layer]): Input layer type. - normalize_before (bool): Whether to use layer_norm before the first block. - concat_after (bool): Whether to concat attention layer's input and output. + idim (int): + Input dimension. + attention_dim (int): + Dimention of attention. + attention_heads (int): + The number of heads of multi head attention. + linear_units (int): + The number of units of position-wise feed forward. + num_blocks (int): + The number of decoder blocks. + dropout_rate (float): + Dropout rate. + positional_dropout_rate (float): + Dropout rate after adding positional encoding. + attention_dropout_rate (float): + Dropout rate in attention. + input_layer (Union[str, nn.Layer]): + Input layer type. + normalize_before (bool): + Whether to use layer_norm before the first block. + concat_after (bool): + Whether to concat attention layer's input and output. if True, additional linear will be applied. i.e. x -> x + linear(concat(x, att(x))) if False, no additional linear will be applied. i.e. x -> x + att(x) - positionwise_layer_type (str): "linear", "conv1d", or "conv1d-linear". - positionwise_conv_kernel_size (int): Kernel size of positionwise conv1d layer. - macaron_style (bool): Whether to use macaron style for positionwise layer. - pos_enc_layer_type (str): Encoder positional encoding layer type. - selfattention_layer_type (str): Encoder attention layer type. - activation_type (str): Encoder activation function type. - use_cnn_module (bool): Whether to use convolution module. - zero_triu (bool): Whether to zero the upper triangular part of attention matrix. - cnn_module_kernel (int): Kernerl size of convolution module. - padding_idx (int): Padding idx for input_layer=embed. - stochastic_depth_rate (float): Maximum probability to skip the encoder layer. - intermediate_layers (Union[List[int], None]): indices of intermediate CTC layer. + positionwise_layer_type (str): + "linear", "conv1d", or "conv1d-linear". + positionwise_conv_kernel_size (int): + Kernel size of positionwise conv1d layer. + macaron_style (bool): + Whether to use macaron style for positionwise layer. + pos_enc_layer_type (str): + Encoder positional encoding layer type. + selfattention_layer_type (str): + Encoder attention layer type. + activation_type (str): + Encoder activation function type. + use_cnn_module (bool): + Whether to use convolution module. + zero_triu (bool): + Whether to zero the upper triangular part of attention matrix. + cnn_module_kernel (int): + Kernerl size of convolution module. + padding_idx (int): + Padding idx for input_layer=embed. + stochastic_depth_rate (float): + Maximum probability to skip the encoder layer. + intermediate_layers (Union[List[int], None]): + indices of intermediate CTC layer. indices start from 1. if not None, intermediate outputs are returned (which changes return type signature.) @@ -266,12 +289,16 @@ class BaseEncoder(nn.Layer): """Encode input sequence. Args: - xs (Tensor): Input tensor (#batch, time, idim). - masks (Tensor): Mask tensor (#batch, 1, time). + xs (Tensor): + Input tensor (#batch, time, idim). + masks (Tensor): + Mask tensor (#batch, 1, time). Returns: - Tensor: Output tensor (#batch, time, attention_dim). - Tensor: Mask tensor (#batch, 1, time). + Tensor: + Output tensor (#batch, time, attention_dim). + Tensor: + Mask tensor (#batch, 1, time). 
""" xs = self.embed(xs) xs, masks = self.encoders(xs, masks) @@ -284,26 +311,43 @@ class TransformerEncoder(BaseEncoder): """Transformer encoder module. Args: - idim (int): Input dimension. - attention_dim (int): Dimention of attention. - attention_heads (int): The number of heads of multi head attention. - linear_units (int): The number of units of position-wise feed forward. - num_blocks (int): The number of decoder blocks. - dropout_rate (float): Dropout rate. - positional_dropout_rate (float): Dropout rate after adding positional encoding. - attention_dropout_rate (float): Dropout rate in attention. - input_layer (Union[str, paddle.nn.Layer]): Input layer type. - pos_enc_layer_type (str): Encoder positional encoding layer type. - normalize_before (bool): Whether to use layer_norm before the first block. - concat_after (bool): Whether to concat attention layer's input and output. + idim (int): + Input dimension. + attention_dim (int): + Dimention of attention. + attention_heads (int): + The number of heads of multi head attention. + linear_units (int): + The number of units of position-wise feed forward. + num_blocks (int): + The number of decoder blocks. + dropout_rate (float): + Dropout rate. + positional_dropout_rate (float): + Dropout rate after adding positional encoding. + attention_dropout_rate (float): + Dropout rate in attention. + input_layer (Union[str, paddle.nn.Layer]): + Input layer type. + pos_enc_layer_type (str): + Encoder positional encoding layer type. + normalize_before (bool): + Whether to use layer_norm before the first block. + concat_after (bool): + Whether to concat attention layer's input and output. if True, additional linear will be applied. i.e. x -> x + linear(concat(x, att(x))) if False, no additional linear will be applied. i.e. x -> x + att(x) - positionwise_layer_type (str): "linear", "conv1d", or "conv1d-linear". - positionwise_conv_kernel_size (int): Kernel size of positionwise conv1d layer. - selfattention_layer_type (str): Encoder attention layer type. - activation_type (str): Encoder activation function type. - padding_idx (int): Padding idx for input_layer=embed. + positionwise_layer_type (str): + "linear", "conv1d", or "conv1d-linear". + positionwise_conv_kernel_size (int): + Kernel size of positionwise conv1d layer. + selfattention_layer_type (str): + Encoder attention layer type. + activation_type (str): + Encoder activation function type. + padding_idx (int): + Padding idx for input_layer=embed. """ def __init__( @@ -350,12 +394,16 @@ class TransformerEncoder(BaseEncoder): """Encoder input sequence. Args: - xs(Tensor): Input tensor (#batch, time, idim). - masks(Tensor): Mask tensor (#batch, 1, time). + xs(Tensor): + Input tensor (#batch, time, idim). + masks(Tensor): + Mask tensor (#batch, 1, time). Returns: - Tensor: Output tensor (#batch, time, attention_dim). - Tensor: Mask tensor (#batch, 1, time). + Tensor: + Output tensor (#batch, time, attention_dim). + Tensor: + Mask tensor (#batch, 1, time). """ xs = self.embed(xs) xs, masks = self.encoders(xs, masks) @@ -367,14 +415,20 @@ class TransformerEncoder(BaseEncoder): """Encode input frame. Args: - xs (Tensor): Input tensor. - masks (Tensor): Mask tensor. - cache (List[Tensor]): List of cache tensors. + xs (Tensor): + Input tensor. + masks (Tensor): + Mask tensor. + cache (List[Tensor]): + List of cache tensors. Returns: - Tensor: Output tensor. - Tensor: Mask tensor. - List[Tensor]: List of new cache tensors. + Tensor: + Output tensor. + Tensor: + Mask tensor. 
+ List[Tensor]: + List of new cache tensors. """ xs = self.embed(xs) @@ -393,32 +447,55 @@ class ConformerEncoder(BaseEncoder): """Conformer encoder module. Args: - idim (int): Input dimension. - attention_dim (int): Dimention of attention. - attention_heads (int): The number of heads of multi head attention. - linear_units (int): The number of units of position-wise feed forward. - num_blocks (int): The number of decoder blocks. - dropout_rate (float): Dropout rate. - positional_dropout_rate (float): Dropout rate after adding positional encoding. - attention_dropout_rate (float): Dropout rate in attention. - input_layer (Union[str, nn.Layer]): Input layer type. - normalize_before (bool): Whether to use layer_norm before the first block. - concat_after (bool):Whether to concat attention layer's input and output. + idim (int): + Input dimension. + attention_dim (int): + Dimention of attention. + attention_heads (int): + The number of heads of multi head attention. + linear_units (int): + The number of units of position-wise feed forward. + num_blocks (int): + The number of decoder blocks. + dropout_rate (float): + Dropout rate. + positional_dropout_rate (float): + Dropout rate after adding positional encoding. + attention_dropout_rate (float): + Dropout rate in attention. + input_layer (Union[str, nn.Layer]): + Input layer type. + normalize_before (bool): + Whether to use layer_norm before the first block. + concat_after (bool): + Whether to concat attention layer's input and output. if True, additional linear will be applied. i.e. x -> x + linear(concat(x, att(x))) if False, no additional linear will be applied. i.e. x -> x + att(x) - positionwise_layer_type (str): "linear", "conv1d", or "conv1d-linear". - positionwise_conv_kernel_size (int): Kernel size of positionwise conv1d layer. - macaron_style (bool): Whether to use macaron style for positionwise layer. - pos_enc_layer_type (str): Encoder positional encoding layer type. - selfattention_layer_type (str): Encoder attention layer type. - activation_type (str): Encoder activation function type. - use_cnn_module (bool): Whether to use convolution module. - zero_triu (bool): Whether to zero the upper triangular part of attention matrix. - cnn_module_kernel (int): Kernerl size of convolution module. - padding_idx (int): Padding idx for input_layer=embed. - stochastic_depth_rate (float): Maximum probability to skip the encoder layer. - intermediate_layers (Union[List[int], None]):indices of intermediate CTC layer. indices start from 1. + positionwise_layer_type (str): + "linear", "conv1d", or "conv1d-linear". + positionwise_conv_kernel_size (int): + Kernel size of positionwise conv1d layer. + macaron_style (bool): + Whether to use macaron style for positionwise layer. + pos_enc_layer_type (str): + Encoder positional encoding layer type. + selfattention_layer_type (str): + Encoder attention layer type. + activation_type (str): + Encoder activation function type. + use_cnn_module (bool): + Whether to use convolution module. + zero_triu (bool): + Whether to zero the upper triangular part of attention matrix. + cnn_module_kernel (int): + Kernerl size of convolution module. + padding_idx (int): + Padding idx for input_layer=embed. + stochastic_depth_rate (float): + Maximum probability to skip the encoder layer. + intermediate_layers (Union[List[int], None]): + indices of intermediate CTC layer. indices start from 1. if not None, intermediate outputs are returned (which changes return type signature.) 
""" @@ -478,11 +555,15 @@ class ConformerEncoder(BaseEncoder): """Encode input sequence. Args: - xs (Tensor): Input tensor (#batch, time, idim). - masks (Tensor): Mask tensor (#batch, 1, time). + xs (Tensor): + Input tensor (#batch, time, idim). + masks (Tensor): + Mask tensor (#batch, 1, time). Returns: - Tensor: Output tensor (#batch, time, attention_dim). - Tensor: Mask tensor (#batch, 1, time). + Tensor: + Output tensor (#batch, time, attention_dim). + Tensor: + Mask tensor (#batch, 1, time). """ if isinstance(self.embed, (Conv2dSubsampling)): xs, masks = self.embed(xs, masks) @@ -539,7 +620,8 @@ class Conv1dResidualBlock(nn.Layer): def forward(self, xs): """Encode input sequence. Args: - xs (Tensor): Input tensor (#batch, idim, T). + xs (Tensor): + Input tensor (#batch, idim, T). Returns: Tensor: Output tensor (#batch, odim, T). """ @@ -582,8 +664,10 @@ class CNNDecoder(nn.Layer): def forward(self, xs, masks=None): """Encode input sequence. Args: - xs (Tensor): Input tensor (#batch, time, idim). - masks (Tensor): Mask tensor (#batch, 1, time). + xs (Tensor): + Input tensor (#batch, time, idim). + masks (Tensor): + Mask tensor (#batch, 1, time). Returns: Tensor: Output tensor (#batch, time, odim). """ @@ -629,8 +713,10 @@ class CNNPostnet(nn.Layer): def forward(self, xs, masks=None): """Encode input sequence. Args: - xs (Tensor): Input tensor (#batch, odim, time). - masks (Tensor): Mask tensor (#batch, 1, time). + xs (Tensor): + Input tensor (#batch, odim, time). + masks (Tensor): + Mask tensor (#batch, 1, time). Returns: Tensor: Output tensor (#batch, odim, time). """ diff --git a/paddlespeech/t2s/modules/transformer/encoder_layer.py b/paddlespeech/t2s/modules/transformer/encoder_layer.py index 72372b69b..63494b0de 100644 --- a/paddlespeech/t2s/modules/transformer/encoder_layer.py +++ b/paddlespeech/t2s/modules/transformer/encoder_layer.py @@ -21,14 +21,20 @@ class EncoderLayer(nn.Layer): """Encoder layer module. Args: - size (int): Input dimension. - self_attn (nn.Layer): Self-attention module instance. + size (int): + Input dimension. + self_attn (nn.Layer): + Self-attention module instance. `MultiHeadedAttention` instance can be used as the argument. - feed_forward (nn.Layer): Feed-forward module instance. + feed_forward (nn.Layer): + Feed-forward module instance. `PositionwiseFeedForward`, `MultiLayeredConv1d`, or `Conv1dLinear` instance can be used as the argument. - dropout_rate (float): Dropout rate. - normalize_before (bool): Whether to use layer_norm before the first block. - concat_after (bool): Whether to concat attention layer's input and output. + dropout_rate (float): + Dropout rate. + normalize_before (bool): + Whether to use layer_norm before the first block. + concat_after (bool): + Whether to concat attention layer's input and output. if True, additional linear will be applied. i.e. x -> x + linear(concat(x, att(x))) if False, no additional linear will be applied. i.e. x -> x + att(x) @@ -59,13 +65,18 @@ class EncoderLayer(nn.Layer): """Compute encoded features. Args: - x(Tensor): Input tensor (#batch, time, size). - mask(Tensor): Mask tensor for the input (#batch, time). - cache(Tensor, optional): Cache tensor of the input (#batch, time - 1, size). + x(Tensor): + Input tensor (#batch, time, size). + mask(Tensor): + Mask tensor for the input (#batch, time). + cache(Tensor, optional): + Cache tensor of the input (#batch, time - 1, size). Returns: - Tensor: Output tensor (#batch, time, size). - Tensor: Mask tensor (#batch, time). 
+ Tensor: + Output tensor (#batch, time, size). + Tensor: + Mask tensor (#batch, time). """ residual = x if self.normalize_before: diff --git a/paddlespeech/t2s/modules/transformer/lightconv.py b/paddlespeech/t2s/modules/transformer/lightconv.py index 9bcc1acfb..22217d50f 100644 --- a/paddlespeech/t2s/modules/transformer/lightconv.py +++ b/paddlespeech/t2s/modules/transformer/lightconv.py @@ -31,12 +31,18 @@ class LightweightConvolution(nn.Layer): https://github.com/pytorch/fairseq/tree/master/fairseq Args: - wshare (int): the number of kernel of convolution - n_feat (int): the number of features - dropout_rate (float): dropout_rate - kernel_size (int): kernel size (length) - use_kernel_mask (bool): Use causal mask or not for convolution kernel - use_bias (bool): Use bias term or not. + wshare (int): + the number of kernel of convolution + n_feat (int): + the number of features + dropout_rate (float): + dropout_rate + kernel_size (int): + kernel size (length) + use_kernel_mask (bool): + Use causal mask or not for convolution kernel + use_bias (bool): + Use bias term or not. """ @@ -94,10 +100,14 @@ class LightweightConvolution(nn.Layer): This is just for compatibility with self-attention layer (attention.py) Args: - query (Tensor): input tensor. (batch, time1, d_model) - key (Tensor): NOT USED. (batch, time2, d_model) - value (Tensor): NOT USED. (batch, time2, d_model) - mask : (Tensor): (batch, time1, time2) mask + query (Tensor): + input tensor. (batch, time1, d_model) + key (Tensor): + NOT USED. (batch, time2, d_model) + value (Tensor): + NOT USED. (batch, time2, d_model) + mask : (Tensor): + (batch, time1, time2) mask Return: Tensor: ouput. (batch, time1, d_model) diff --git a/paddlespeech/t2s/modules/transformer/mask.py b/paddlespeech/t2s/modules/transformer/mask.py index c10e6add2..71dd37975 100644 --- a/paddlespeech/t2s/modules/transformer/mask.py +++ b/paddlespeech/t2s/modules/transformer/mask.py @@ -19,8 +19,10 @@ def subsequent_mask(size, dtype=paddle.bool): """Create mask for subsequent steps (size, size). Args: - size (int): size of mask - dtype (paddle.dtype): result dtype + size (int): + size of mask + dtype (paddle.dtype): + result dtype Return: Tensor: >>> subsequent_mask(3) @@ -36,9 +38,12 @@ def target_mask(ys_in_pad, ignore_id, dtype=paddle.bool): """Create mask for decoder self-attention. Args: - ys_pad (Tensor): batch of padded target sequences (B, Lmax) - ignore_id (int): index of padding - dtype (paddle.dtype): result dtype + ys_pad (Tensor): + batch of padded target sequences (B, Lmax) + ignore_id (int): + index of padding + dtype (paddle.dtype): + result dtype Return: Tensor: (B, Lmax, Lmax) """ diff --git a/paddlespeech/t2s/modules/transformer/multi_layer_conv.py b/paddlespeech/t2s/modules/transformer/multi_layer_conv.py index d3285b65f..91d67ca58 100644 --- a/paddlespeech/t2s/modules/transformer/multi_layer_conv.py +++ b/paddlespeech/t2s/modules/transformer/multi_layer_conv.py @@ -32,10 +32,14 @@ class MultiLayeredConv1d(nn.Layer): """Initialize MultiLayeredConv1d module. Args: - in_chans (int): Number of input channels. - hidden_chans (int): Number of hidden channels. - kernel_size (int): Kernel size of conv1d. - dropout_rate (float): Dropout rate. + in_chans (int): + Number of input channels. + hidden_chans (int): + Number of hidden channels. + kernel_size (int): + Kernel size of conv1d. + dropout_rate (float): + Dropout rate. """ super().__init__() @@ -58,7 +62,8 @@ class MultiLayeredConv1d(nn.Layer): """Calculate forward propagation. 
Args: - x (Tensor): Batch of input tensors (B, T, in_chans). + x (Tensor): + Batch of input tensors (B, T, in_chans). Returns: Tensor: Batch of output tensors (B, T, in_chans). @@ -79,10 +84,14 @@ class Conv1dLinear(nn.Layer): """Initialize Conv1dLinear module. Args: - in_chans (int): Number of input channels. - hidden_chans (int): Number of hidden channels. - kernel_size (int): Kernel size of conv1d. - dropout_rate (float): Dropout rate. + in_chans (int): + Number of input channels. + hidden_chans (int): + Number of hidden channels. + kernel_size (int): + Kernel size of conv1d. + dropout_rate (float): + Dropout rate. """ super().__init__() self.w_1 = nn.Conv1D( @@ -99,7 +108,8 @@ class Conv1dLinear(nn.Layer): """Calculate forward propagation. Args: - x (Tensor): Batch of input tensors (B, T, in_chans). + x (Tensor): + Batch of input tensors (B, T, in_chans). Returns: Tensor: Batch of output tensors (B, T, in_chans). diff --git a/paddlespeech/t2s/modules/transformer/positionwise_feed_forward.py b/paddlespeech/t2s/modules/transformer/positionwise_feed_forward.py index 92af6851c..45ea279bf 100644 --- a/paddlespeech/t2s/modules/transformer/positionwise_feed_forward.py +++ b/paddlespeech/t2s/modules/transformer/positionwise_feed_forward.py @@ -21,9 +21,12 @@ class PositionwiseFeedForward(nn.Layer): """Positionwise feed forward layer. Args: - idim (int): Input dimenstion. - hidden_units (int): The number of hidden units. - dropout_rate (float): Dropout rate. + idim (int): + Input dimenstion. + hidden_units (int): + The number of hidden units. + dropout_rate (float): + Dropout rate. """ def __init__(self, diff --git a/paddlespeech/t2s/modules/transformer/repeat.py b/paddlespeech/t2s/modules/transformer/repeat.py index 1e946adf7..43d11e9f9 100644 --- a/paddlespeech/t2s/modules/transformer/repeat.py +++ b/paddlespeech/t2s/modules/transformer/repeat.py @@ -30,8 +30,10 @@ def repeat(N, fn): """Repeat module N times. Args: - N (int): Number of repeat time. - fn (Callable): Function to generate module. + N (int): + Number of repeat time. + fn (Callable): + Function to generate module. Returns: MultiSequential: Repeated model instance. diff --git a/paddlespeech/t2s/modules/transformer/subsampling.py b/paddlespeech/t2s/modules/transformer/subsampling.py index 07439705a..a17278c0b 100644 --- a/paddlespeech/t2s/modules/transformer/subsampling.py +++ b/paddlespeech/t2s/modules/transformer/subsampling.py @@ -23,10 +23,14 @@ class Conv2dSubsampling(nn.Layer): """Convolutional 2D subsampling (to 1/4 length). Args: - idim (int): Input dimension. - odim (int): Output dimension. - dropout_rate (float): Dropout rate. - pos_enc (nn.Layer): Custom position encoding layer. + idim (int): + Input dimension. + odim (int): + Output dimension. + dropout_rate (float): + Dropout rate. + pos_enc (nn.Layer): + Custom position encoding layer. """ def __init__(self, idim, odim, dropout_rate, pos_enc=None): @@ -45,11 +49,15 @@ class Conv2dSubsampling(nn.Layer): def forward(self, x, x_mask): """Subsample x. Args: - x (Tensor): Input tensor (#batch, time, idim). - x_mask (Tensor): Input mask (#batch, 1, time). + x (Tensor): + Input tensor (#batch, time, idim). + x_mask (Tensor): + Input mask (#batch, 1, time). Returns: - Tensor: Subsampled tensor (#batch, time', odim), where time' = time // 4. - Tensor: Subsampled mask (#batch, 1, time'), where time' = time // 4. + Tensor: + Subsampled tensor (#batch, time', odim), where time' = time // 4. + Tensor: + Subsampled mask (#batch, 1, time'), where time' = time // 4. 
""" # (b, c, t, f) x = x.unsqueeze(1) diff --git a/paddlespeech/t2s/modules/upsample.py b/paddlespeech/t2s/modules/upsample.py index 65e78a892..164db65dd 100644 --- a/paddlespeech/t2s/modules/upsample.py +++ b/paddlespeech/t2s/modules/upsample.py @@ -28,9 +28,12 @@ class Stretch2D(nn.Layer): """Strech an image (or image-like object) with some interpolation. Args: - w_scale (int): Scalar of width. - h_scale (int): Scalar of the height. - mode (str, optional): Interpolation mode, modes suppored are "nearest", "bilinear", + w_scale (int): + Scalar of width. + h_scale (int): + Scalar of the height. + mode (str, optional): + Interpolation mode, modes suppored are "nearest", "bilinear", "trilinear", "bicubic", "linear" and "area",by default "nearest" For more details about interpolation, see `paddle.nn.functional.interpolate `_. @@ -44,11 +47,12 @@ class Stretch2D(nn.Layer): """ Args: - x (Tensor): Shape (N, C, H, W) + x (Tensor): + Shape (N, C, H, W) Returns: - Tensor: The stretched image. - Shape (N, C, H', W'), where ``H'=h_scale * H``, ``W'=w_scale * W``. + Tensor: + The stretched image. Shape (N, C, H', W'), where ``H'=h_scale * H``, ``W'=w_scale * W``. """ out = F.interpolate( @@ -61,12 +65,18 @@ class UpsampleNet(nn.Layer): convolutions. Args: - upsample_scales (List[int]): Upsampling factors for each strech. - nonlinear_activation (Optional[str], optional): Activation after each convolution, by default None - nonlinear_activation_params (Dict[str, Any], optional): Parameters passed to construct the activation, by default {} - interpolate_mode (str, optional): Interpolation mode of the strech, by default "nearest" - freq_axis_kernel_size (int, optional): Convolution kernel size along the frequency axis, by default 1 - use_causal_conv (bool, optional): Whether to use causal padding before convolution, by default False + upsample_scales (List[int]): + Upsampling factors for each strech. + nonlinear_activation (Optional[str], optional): + Activation after each convolution, by default None + nonlinear_activation_params (Dict[str, Any], optional): + Parameters passed to construct the activation, by default {} + interpolate_mode (str, optional): + Interpolation mode of the strech, by default "nearest" + freq_axis_kernel_size (int, optional): + Convolution kernel size along the frequency axis, by default 1 + use_causal_conv (bool, optional): + Whether to use causal padding before convolution, by default False If True, Causal padding is used along the time axis, i.e. padding amount is ``receptive field - 1`` and 0 for before and after, respectively. If False, "same" padding is used along the time axis. @@ -106,7 +116,8 @@ class UpsampleNet(nn.Layer): def forward(self, c): """ Args: - c (Tensor): spectrogram. Shape (N, F, T) + c (Tensor): + spectrogram. Shape (N, F, T) Returns: Tensor: upsampled spectrogram. @@ -126,17 +137,25 @@ class ConvInUpsampleNet(nn.Layer): UpsampleNet. Args: - upsample_scales (List[int]): Upsampling factors for each strech. 
- nonlinear_activation (Optional[str], optional): Activation after each convolution, by default None - nonlinear_activation_params (Dict[str, Any], optional): Parameters passed to construct the activation, by default {} - interpolate_mode (str, optional): Interpolation mode of the strech, by default "nearest" - freq_axis_kernel_size (int, optional): Convolution kernel size along the frequency axis, by default 1 - aux_channels (int, optional): Feature size of the input, by default 80 - aux_context_window (int, optional): Context window of the first 1D convolution applied to the input. It + upsample_scales (List[int]): + Upsampling factors for each strech. + nonlinear_activation (Optional[str], optional): + Activation after each convolution, by default None + nonlinear_activation_params (Dict[str, Any], optional): + Parameters passed to construct the activation, by default {} + interpolate_mode (str, optional): + Interpolation mode of the strech, by default "nearest" + freq_axis_kernel_size (int, optional): + Convolution kernel size along the frequency axis, by default 1 + aux_channels (int, optional): + Feature size of the input, by default 80 + aux_context_window (int, optional): + Context window of the first 1D convolution applied to the input. It related to the kernel size of the convolution, by default 0 If use causal convolution, the kernel size is ``window + 1``, else the kernel size is ``2 * window + 1``. - use_causal_conv (bool, optional): Whether to use causal padding before convolution, by default False + use_causal_conv (bool, optional): + Whether to use causal padding before convolution, by default False If True, Causal padding is used along the time axis, i.e. padding amount is ``receptive field - 1`` and 0 for before and after, respectively. If False, "same" padding is used along the time axis. @@ -171,7 +190,8 @@ class ConvInUpsampleNet(nn.Layer): def forward(self, c): """ Args: - c (Tensor): spectrogram. Shape (N, F, T) + c (Tensor): + spectrogram. Shape (N, F, T) Returns: Tensors: upsampled spectrogram. Shape (N, F, T'), where ``T' = upsample_factor * T``, diff --git a/paddlespeech/t2s/training/experiment.py b/paddlespeech/t2s/training/experiment.py index 05a363ff2..1eba826df 100644 --- a/paddlespeech/t2s/training/experiment.py +++ b/paddlespeech/t2s/training/experiment.py @@ -58,8 +58,10 @@ class ExperimentBase(object): need. Args: - config (yacs.config.CfgNode): The configuration used for the experiment. - args (argparse.Namespace): The parsed command line arguments. + config (yacs.config.CfgNode): + The configuration used for the experiment. + args (argparse.Namespace): + The parsed command line arguments. Examples: >>> def main_sp(config, args): diff --git a/paddlespeech/t2s/utils/checkpoint.py b/paddlespeech/t2s/utils/checkpoint.py index 1e222c50c..a3a19c0a0 100644 --- a/paddlespeech/t2s/utils/checkpoint.py +++ b/paddlespeech/t2s/utils/checkpoint.py @@ -25,7 +25,8 @@ def _load_latest_checkpoint(checkpoint_dir: str) -> int: """Get the iteration number corresponding to the latest saved checkpoint. Args: - checkpoint_dir (str): the directory where checkpoint is saved. + checkpoint_dir (str): + the directory where checkpoint is saved. Returns: int: the latest iteration number. @@ -46,8 +47,10 @@ def _save_checkpoint(checkpoint_dir: str, iteration: int): """Save the iteration number of the latest model to be checkpointed. Args: - checkpoint_dir (str): the directory where checkpoint is saved. - iteration (int): the latest iteration number. 
+ checkpoint_dir (str): + the directory where checkpoint is saved. + iteration (int): + the latest iteration number. Returns: None @@ -65,11 +68,14 @@ def load_parameters(model, """Load a specific model checkpoint from disk. Args: - model (Layer): model to load parameters. - optimizer (Optimizer, optional): optimizer to load states if needed. - Defaults to None. - checkpoint_dir (str, optional): the directory where checkpoint is saved. - checkpoint_path (str, optional): if specified, load the checkpoint + model (Layer): + model to load parameters. + optimizer (Optimizer, optional): + optimizer to load states if needed. Defaults to None. + checkpoint_dir (str, optional): + the directory where checkpoint is saved. + checkpoint_path (str, optional): + if specified, load the checkpoint stored in the checkpoint_path and the argument 'checkpoint_dir' will be ignored. Defaults to None. @@ -113,11 +119,14 @@ def save_parameters(checkpoint_dir, iteration, model, optimizer=None): """Checkpoint the latest trained model parameters. Args: - checkpoint_dir (str): the directory where checkpoint is saved. - iteration (int): the latest iteration number. - model (Layer): model to be checkpointed. - optimizer (Optimizer, optional): optimizer to be checkpointed. - Defaults to None. + checkpoint_dir (str): + the directory where checkpoint is saved. + iteration (int): + the latest iteration number. + model (Layer): + model to be checkpointed. + optimizer (Optimizer, optional): + optimizer to be checkpointed. Defaults to None. Returns: None diff --git a/paddlespeech/t2s/utils/error_rate.py b/paddlespeech/t2s/utils/error_rate.py index 41b13b75f..76a4f45be 100644 --- a/paddlespeech/t2s/utils/error_rate.py +++ b/paddlespeech/t2s/utils/error_rate.py @@ -71,10 +71,14 @@ def word_errors(reference, hypothesis, ignore_case=False, delimiter=' '): hypothesis sequence in word-level. Args: - reference (str): The reference sentence. - hypothesis (str): The hypothesis sentence. - ignore_case (bool): Whether case-sensitive or not. - delimiter (char(str)): Delimiter of input sentences. + reference (str): + The reference sentence. + hypothesis (str): + The hypothesis sentence. + ignore_case (bool): + Whether case-sensitive or not. + delimiter (char(str)): + Delimiter of input sentences. Returns: list: Levenshtein distance and word number of reference sentence. diff --git a/paddlespeech/t2s/utils/h5_utils.py b/paddlespeech/t2s/utils/h5_utils.py index 75c2e4488..7558e046a 100644 --- a/paddlespeech/t2s/utils/h5_utils.py +++ b/paddlespeech/t2s/utils/h5_utils.py @@ -24,8 +24,10 @@ import numpy as np def read_hdf5(filename: Union[Path, str], dataset_name: str) -> Any: """Read a dataset from a HDF5 file. Args: - filename (Union[Path, str]): Path of the HDF5 file. - dataset_name (str): Name of the dataset to read. + filename (Union[Path, str]): + Path of the HDF5 file. + dataset_name (str): + Name of the dataset to read. Returns: Any: The retrieved dataset. diff --git a/paddlespeech/t2s/utils/internals.py b/paddlespeech/t2s/utils/internals.py index 6c10bd2d5..830e8a80f 100644 --- a/paddlespeech/t2s/utils/internals.py +++ b/paddlespeech/t2s/utils/internals.py @@ -22,7 +22,8 @@ def convert_dtype_to_np_dtype_(dtype): Convert paddle's data type to corrsponding numpy data type. Args: - dtype(np.dtype): the data type in paddle. + dtype(np.dtype): + the data type in paddle. Returns: type: the data type in numpy. 
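The checkpoint helpers documented above (`load_parameters` / `save_parameters`) are normally used as a pair around a training loop. The snippet below is only an illustrative sketch, not part of this patch: the toy `Linear` model, the optimizer, and the `exp/checkpoints` directory are assumptions, and only the argument names given in the docstrings above are relied on.

```python
import paddle

from paddlespeech.t2s.utils import checkpoint

# Any Layer/Optimizer pair works the same way; a toy linear model keeps the sketch small.
model = paddle.nn.Linear(80, 80)
optimizer = paddle.optimizer.Adam(parameters=model.parameters())

# Restore the latest checkpoint found in the directory, if one exists.
checkpoint.load_parameters(
    model, optimizer=optimizer, checkpoint_dir="exp/checkpoints")

# ... run some training steps ...

# Save model (and optimizer) states and record the iteration number,
# so a later load_parameters call can resume from this point.
checkpoint.save_parameters(
    "exp/checkpoints", iteration=1000, model=model, optimizer=optimizer)
```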
diff --git a/paddlespeech/utils/env.py b/paddlespeech/utils/env.py new file mode 100644 index 000000000..03c8757bc --- /dev/null +++ b/paddlespeech/utils/env.py @@ -0,0 +1,46 @@ +# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import os + + +def _get_user_home(): + return os.path.expanduser('~') + + +def _get_paddlespcceh_home(): + if 'PPSPEECH_HOME' in os.environ: + home_path = os.environ['PPSPEECH_HOME'] + if os.path.exists(home_path): + if os.path.isdir(home_path): + return home_path + else: + raise RuntimeError( + 'The environment variable PPSPEECH_HOME {} is not a directory.'. + format(home_path)) + else: + return home_path + return os.path.join(_get_user_home(), '.paddlespeech') + + +def _get_sub_home(directory): + home = os.path.join(_get_paddlespcceh_home(), directory) + if not os.path.exists(home): + os.makedirs(home) + return home + + +PPSPEECH_HOME = _get_paddlespcceh_home() +MODEL_HOME = _get_sub_home('models') +CONF_HOME = _get_sub_home('conf') +DATA_HOME = _get_sub_home('datasets') diff --git a/setup.py b/setup.py index 903bba64e..0f950774c 100644 --- a/setup.py +++ b/setup.py @@ -37,20 +37,57 @@ VERSION = '0.0.0' COMMITID = 'none' base = [ - "editdistance", "g2p_en", "g2pM", "h5py", "inflect", "jieba", "jsonlines", - "kaldiio", "librosa==0.8.1", "loguru", "matplotlib", "nara_wpe", - "onnxruntime", "pandas", "paddlenlp", "paddlespeech_feat", "praatio==5.0.0", - "pypinyin", "pypinyin-dict", "python-dateutil", "pyworld", "resampy==0.2.2", - "sacrebleu", "scipy", "sentencepiece~=0.1.96", "soundfile~=0.10", - "textgrid", "timer", "tqdm", "typeguard", "visualdl", "webrtcvad", - "yacs~=0.1.8", "prettytable", "zhon", "colorlog", "pathos == 0.2.8", "Ninja" + "editdistance", + "g2p_en", + "g2pM", + "h5py", + "inflect", + "jieba", + "jsonlines", + "kaldiio", + "librosa==0.8.1", + "loguru", + "matplotlib", + "nara_wpe", + "onnxruntime==1.10.0", + "opencc", + "pandas", + "paddlenlp", + "paddlespeech_feat", + "Pillow>=9.0.0", + "praatio==5.0.0", + "protobuf>=3.1.0, <=3.20.0", + "pypinyin", + "pypinyin-dict", + "python-dateutil", + "pyworld==0.2.12", + "resampy==0.2.2", + "sacrebleu", + "scipy", + "sentencepiece~=0.1.96", + "soundfile~=0.10", + "textgrid", + "timer", + "tqdm", + "typeguard", + "visualdl", + "webrtcvad", + "yacs~=0.1.8", + "prettytable", + "zhon", + "colorlog", + "pathos == 0.2.8", + "braceexpand", + "pyyaml", + "pybind11", + "Ninja", ] server = [ "fastapi", "uvicorn", "pattern_singleton", - "websockets", + "websockets" ] requirements = { @@ -62,8 +99,6 @@ requirements = { "gpustat", "paddlespeech_ctcdecoders", "phkit", - "Pillow", - "pybind11", "pypi-kenlm", "snakeviz", "sox", diff --git a/speechx/examples/ds2_ol/onnx/README.md b/speechx/examples/ds2_ol/onnx/README.md index eaea8b6e8..e6ab953c8 100644 --- a/speechx/examples/ds2_ol/onnx/README.md +++ b/speechx/examples/ds2_ol/onnx/README.md @@ -1,9 +1,11 @@ -# DeepSpeech2 ONNX model +# DeepSpeech2 to ONNX model 1. 
convert deepspeech2 model to ONNX, using Paddle2ONNX. 2. check paddleinference and onnxruntime output equal. 3. optimize onnx model 4. check paddleinference and optimized onnxruntime output equal. +5. quantize onnx model +6. check paddleinference and quantized onnxruntime output equal. Please make sure [Paddle2ONNX](https://github.com/PaddlePaddle/Paddle2ONNX) and [onnx-simplifier](https://github.com/zh794390558/onnx-simplifier/tree/dyn_time_shape) version is correct. @@ -26,12 +28,27 @@ onnxruntime 1.11.0 ## Using ``` -bash run.sh +bash run.sh --stage 0 --stop_stage 5 ``` For more details please see `run.sh`. ## Outputs -The optimized onnx model is `exp/model.opt.onnx`. +The optimized onnx model is `exp/model.opt.onnx`, the quantized model is `$exp/model.optset11.quant.onnx`. To show the graph, please using `local/netron.sh`. + + +## [Results](https://github.com/PaddlePaddle/PaddleSpeech/wiki/ASR-Benchmark#streaming-asr) + +Test machine: `CPU:Intel(R) Xeon(R) Gold 6271C CPU @ 2.60GHz` +Test script: `Streaming Server` + +| Acoustic Model | Model Size | engine | decoding_method | ctc_weight | decoding_chunk_size | num_decoding_left_chunk | RTF | +|:-------------:| :-----: | :-----: | :------------:| :-----: | :-----: | :-----: |:-----:| +| deepspeech2online_wenetspeech | 659MB | inference | ctc_prefix_beam_search | - | 1 | - | 1.9108175171428279 (utts=80) | +| deepspeech2online_wenetspeech | 659MB | onnx | ctc_prefix_beam_search | - | 1 | - | 0.5617182449999291 (utts=80) | +| deepspeech2online_wenetspeech | 166MB | onnx quant | ctc_prefix_beam_search | - | 1 | - | 0.44507715475808385 (utts=80) | + +> Whether quantization works depends on the machine; not all machines support it. Instruction sets supported by the machine used for the ONNX quant test: +> Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss ht syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology eagerfpu pni pclmulqdq ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm 3dnowprefetch invpcid_single ssbd ibrs ibpb fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx avx512f avx512dq rdseed adx smap clflushopt clwb avx512cd avx512bw avx512vl xsaveopt xsavec xgetbv1 arat umip pku ospke avx512_vnni spec_ctrl diff --git a/speechx/examples/ds2_ol/onnx/local/onnx_convert_opset.py b/speechx/examples/ds2_ol/onnx/local/onnx_convert_opset.py new file mode 100755 index 000000000..00b5cf775 --- /dev/null +++ b/speechx/examples/ds2_ol/onnx/local/onnx_convert_opset.py @@ -0,0 +1,37 @@ +#!/usr/bin/env python3 +import argparse + +import onnx +from onnx import version_converter + +if __name__ == '__main__': + parser = argparse.ArgumentParser(prog=__doc__) + parser.add_argument( + "--model-file", type=str, required=True, help='path/to/the/model.onnx.') + parser.add_argument( + "--save-model", + type=str, + required=True, + help='path/to/saved/model.onnx.') + # Models must be opset10 or higher to be quantized. + parser.add_argument( + "--target-opset", type=int, default=11, help='target opset version to convert the model to, e.g. 11.') + + args = parser.parse_args() + + print(f"to opset: {args.target_opset}") + + # Preprocessing: load the model to be converted.
+ model_path = args.model_file + original_model = onnx.load(model_path) + + # print('The model before conversion:\n{}'.format(original_model)) + + # A full list of supported adapters can be found here: + # https://github.com/onnx/onnx/blob/main/onnx/version_converter.py#L21 + # Apply the version conversion on the original model + converted_model = version_converter.convert_version(original_model, + args.target_opset) + + # print('The model after conversion:\n{}'.format(converted_model)) + onnx.save(converted_model, args.save_model) diff --git a/speechx/examples/ds2_ol/onnx/local/onnx_infer_shape.py b/speechx/examples/ds2_ol/onnx/local/onnx_infer_shape.py index 838b67510..c41e66b72 100755 --- a/speechx/examples/ds2_ol/onnx/local/onnx_infer_shape.py +++ b/speechx/examples/ds2_ol/onnx/local/onnx_infer_shape.py @@ -492,22 +492,6 @@ class SymbolicShapeInference: skip_infer = node.op_type in [ 'If', 'Loop', 'Scan', 'SplitToSequence', 'ZipMap', \ # contrib ops - - - - - - - - - - - - - - - - 'Attention', 'BiasGelu', \ 'EmbedLayerNormalization', \ 'FastGelu', 'Gelu', 'LayerNormalization', \ diff --git a/speechx/examples/ds2_ol/onnx/local/ort_dyanmic_quant.py b/speechx/examples/ds2_ol/onnx/local/ort_dyanmic_quant.py new file mode 100755 index 000000000..2c5692369 --- /dev/null +++ b/speechx/examples/ds2_ol/onnx/local/ort_dyanmic_quant.py @@ -0,0 +1,48 @@ +#!/usr/bin/env python3 +import argparse + +from onnxruntime.quantization import quantize_dynamic +from onnxruntime.quantization import QuantType + + +def quantize_onnx_model(onnx_model_path, + quantized_model_path, + nodes_to_exclude=[]): + print("Starting quantization...") + + quantize_dynamic( + onnx_model_path, + quantized_model_path, + weight_type=QuantType.QInt8, + nodes_to_exclude=nodes_to_exclude) + + print(f"Quantized model saved to: {quantized_model_path}") + + +def main(): + parser = argparse.ArgumentParser() + parser.add_argument( + "--model-in", + type=str, + required=True, + help="ONNX model", ) + parser.add_argument( + "--model-out", + type=str, + required=True, + default='model.quant.onnx', + help="ONNX model", ) + parser.add_argument( + "--nodes-to-exclude", + type=str, + required=True, + help="nodes to exclude. e.g. conv,linear.", ) + + args = parser.parse_args() + + nodes_to_exclude = args.nodes_to_exclude.split(',') + quantize_onnx_model(args.model_in, args.model_out, nodes_to_exclude) + + +if __name__ == "__main__": + main() diff --git a/speechx/examples/ds2_ol/onnx/local/pd_infer_shape.py b/speechx/examples/ds2_ol/onnx/local/pd_infer_shape.py deleted file mode 100755 index c6e693c6b..000000000 --- a/speechx/examples/ds2_ol/onnx/local/pd_infer_shape.py +++ /dev/null @@ -1,111 +0,0 @@ -#!/usr/bin/env python3 -W ignore::DeprecationWarning -# https://github.com/jiangjiajun/PaddleUtils/blob/main/paddle/README.md#2-%E4%BF%AE%E6%94%B9paddle%E6%A8%A1%E5%9E%8B%E8%BE%93%E5%85%A5shape -import argparse - -# paddle inference shape - - -def process_old_ops_desc(program): - """set matmul op head_number attr to 1 is not exist. 
- - Args: - program (_type_): _description_ - """ - for i in range(len(program.blocks[0].ops)): - if program.blocks[0].ops[i].type == "matmul": - if not program.blocks[0].ops[i].has_attr("head_number"): - program.blocks[0].ops[i]._set_attr("head_number", 1) - - -def infer_shape(program, input_shape_dict): - # 2002002 - model_version = program.desc._version() - # 2.2.2 - paddle_version = paddle.__version__ - major_ver = model_version // 1000000 - minor_ver = (model_version - major_ver * 1000000) // 1000 - patch_ver = model_version - major_ver * 1000000 - minor_ver * 1000 - model_version = "{}.{}.{}".format(major_ver, minor_ver, patch_ver) - if model_version != paddle_version: - print( - f"[WARNING] The model is saved by paddlepaddle v{model_version}, but now your paddlepaddle is version of {paddle_version}, this difference may cause error, it is recommend you reinstall a same version of paddlepaddle for this model" - ) - - OP_WITHOUT_KERNEL_SET = { - 'feed', 'fetch', 'recurrent', 'go', 'rnn_memory_helper_grad', - 'conditional_block', 'while', 'send', 'recv', 'listen_and_serv', - 'fl_listen_and_serv', 'ncclInit', 'select', 'checkpoint_notify', - 'gen_bkcl_id', 'c_gen_bkcl_id', 'gen_nccl_id', 'c_gen_nccl_id', - 'c_comm_init', 'c_sync_calc_stream', 'c_sync_comm_stream', - 'queue_generator', 'dequeue', 'enqueue', 'heter_listen_and_serv', - 'c_wait_comm', 'c_wait_compute', 'c_gen_hccl_id', 'c_comm_init_hccl', - 'copy_cross_scope' - } - - for k, v in input_shape_dict.items(): - program.blocks[0].var(k).desc.set_shape(v) - - for i in range(len(program.blocks)): - for j in range(len(program.blocks[0].ops)): - # for ops - if program.blocks[i].ops[j].type in OP_WITHOUT_KERNEL_SET: - print(f"not infer: {program.blocks[i].ops[j].type} op") - continue - print(f"infer: {program.blocks[i].ops[j].type} op") - program.blocks[i].ops[j].desc.infer_shape(program.blocks[i].desc) - - -def parse_arguments(): - # python pd_infer_shape.py --model_dir data/exp/deepspeech2_online/checkpoints \ - # --model_filename avg_1.jit.pdmodel\ - # --params_filename avg_1.jit.pdiparams \ - # --save_dir . 
\ - # --input_shape_dict="{'audio_chunk':[1,-1,161], 'audio_chunk_lens':[1], 'chunk_state_c_box':[5, 1, 1024], 'chunk_state_h_box':[5,1,1024]}" - parser = argparse.ArgumentParser() - parser.add_argument( - '--model_dir', - required=True, - help='Path of directory saved the input model.') - parser.add_argument( - '--model_filename', required=True, help='model.pdmodel.') - parser.add_argument( - '--params_filename', required=True, help='model.pdiparams.') - parser.add_argument( - '--save_dir', - required=True, - help='directory to save the exported model.') - parser.add_argument( - '--input_shape_dict', required=True, help="The new shape information.") - return parser.parse_args() - - -if __name__ == '__main__': - args = parse_arguments() - - import paddle - paddle.enable_static() - import paddle.fluid as fluid - - input_shape_dict_str = args.input_shape_dict - input_shape_dict = eval(input_shape_dict_str) - - print("Start to load paddle model...") - exe = fluid.Executor(fluid.CPUPlace()) - - prog, ipts, outs = fluid.io.load_inference_model( - args.model_dir, - exe, - model_filename=args.model_filename, - params_filename=args.params_filename) - - process_old_ops_desc(prog) - infer_shape(prog, input_shape_dict) - - fluid.io.save_inference_model( - args.save_dir, - ipts, - outs, - exe, - prog, - model_filename=args.model_filename, - params_filename=args.params_filename) diff --git a/speechx/examples/ds2_ol/onnx/local/pd_prune_model.py b/speechx/examples/ds2_ol/onnx/local/pd_prune_model.py deleted file mode 100755 index 5386a971a..000000000 --- a/speechx/examples/ds2_ol/onnx/local/pd_prune_model.py +++ /dev/null @@ -1,158 +0,0 @@ -#!/usr/bin/env python3 -W ignore::DeprecationWarning -# https://github.com/jiangjiajun/PaddleUtils/blob/main/paddle/README.md#1-%E8%A3%81%E5%89%AApaddle%E6%A8%A1%E5%9E%8B -import argparse -import sys -from typing import List - -# paddle prune model. - - -def prepend_feed_ops(program, - feed_target_names: List[str], - feed_holder_name='feed'): - import paddle.fluid.core as core - if len(feed_target_names) == 0: - return - - global_block = program.global_block() - feed_var = global_block.create_var( - name=feed_holder_name, - type=core.VarDesc.VarType.FEED_MINIBATCH, - persistable=True, ) - - for i, name in enumerate(feed_target_names, 0): - if not global_block.has_var(name): - print( - f"The input[{i}]: '{name}' doesn't exist in pruned inference program, which will be ignored in new saved model." - ) - continue - - out = global_block.var(name) - global_block._prepend_op( - type='feed', - inputs={'X': [feed_var]}, - outputs={'Out': [out]}, - attrs={'col': i}, ) - - -def append_fetch_ops(program, - fetch_target_names: List[str], - fetch_holder_name='fetch'): - """in the place, we will add the fetch op - - Args: - program (_type_): inference program - fetch_target_names (List[str]): target names - fetch_holder_name (str, optional): fetch op name. Defaults to 'fetch'. 
- """ - import paddle.fluid.core as core - global_block = program.global_block() - fetch_var = global_block.create_var( - name=fetch_holder_name, - type=core.VarDesc.VarType.FETCH_LIST, - persistable=True, ) - - print(f"the len of fetch_target_names: {len(fetch_target_names)}") - - for i, name in enumerate(fetch_target_names): - global_block.append_op( - type='fetch', - inputs={'X': [name]}, - outputs={'Out': [fetch_var]}, - attrs={'col': i}, ) - - -def insert_fetch(program, - fetch_target_names: List[str], - fetch_holder_name='fetch'): - """in the place, we will add the fetch op - - Args: - program (_type_): inference program - fetch_target_names (List[str]): target names - fetch_holder_name (str, optional): fetch op name. Defaults to 'fetch'. - """ - global_block = program.global_block() - - # remove fetch - need_to_remove_op_index = [] - for i, op in enumerate(global_block.ops): - if op.type == 'fetch': - need_to_remove_op_index.append(i) - - for index in reversed(need_to_remove_op_index): - global_block._remove_op(index) - - program.desc.flush() - - # append new fetch - append_fetch_ops(program, fetch_target_names, fetch_holder_name) - - -def parse_arguments(): - parser = argparse.ArgumentParser() - parser.add_argument( - '--model_dir', - required=True, - help='Path of directory saved the input model.') - parser.add_argument( - '--model_filename', required=True, help='model.pdmodel.') - parser.add_argument( - '--params_filename', required=True, help='model.pdiparams.') - parser.add_argument( - '--output_names', - required=True, - help='The outputs of model. sep by comma') - parser.add_argument( - '--save_dir', - required=True, - help='directory to save the exported model.') - parser.add_argument('--debug', default=False, help='output debug info.') - return parser.parse_args() - - -if __name__ == '__main__': - args = parse_arguments() - - args.output_names = args.output_names.split(",") - - if len(set(args.output_names)) < len(args.output_names): - print( - f"[ERROR] There's dumplicate name in --output_names {args.output_names}, which is not allowed." 
- ) - sys.exit(-1) - - import paddle - paddle.enable_static() - # hack prepend_feed_ops - paddle.fluid.io.prepend_feed_ops = prepend_feed_ops - - import paddle.fluid as fluid - - print("start to load paddle model") - exe = fluid.Executor(fluid.CPUPlace()) - prog, ipts, outs = fluid.io.load_inference_model( - args.model_dir, - exe, - model_filename=args.model_filename, - params_filename=args.params_filename) - - print("start to load insert fetch op") - new_outputs = [] - insert_fetch(prog, args.output_names) - for out_name in args.output_names: - new_outputs.append(prog.global_block().var(out_name)) - - # not equal to paddle.static.save_inference_model - fluid.io.save_inference_model( - args.save_dir, - ipts, - new_outputs, - exe, - prog, - model_filename=args.model_filename, - params_filename=args.params_filename) - - if args.debug: - for op in prog.global_block().ops: - print(op) diff --git a/speechx/examples/ds2_ol/onnx/local/prune.sh b/speechx/examples/ds2_ol/onnx/local/prune.sh deleted file mode 100755 index 64636bccf..000000000 --- a/speechx/examples/ds2_ol/onnx/local/prune.sh +++ /dev/null @@ -1,23 +0,0 @@ -#!/bin/bash - -set -e - -if [ $# != 5 ]; then - # local/prune.sh data/exp/deepspeech2_online/checkpoints avg_1.jit.pdmodel avg_1.jit.pdiparams softmax_0.tmp_0,tmp_5,concat_0.tmp_0,concat_1.tmp_0 $PWD - echo "usage: $0 model_dir model_filename param_filename outputs_names save_dir" - exit 1 -fi - -dir=$1 -model=$2 -param=$3 -outputs=$4 -save_dir=$5 - - -python local/pd_prune_model.py \ - --model_dir $dir \ - --model_filename $model \ - --params_filename $param \ - --output_names $outputs \ - --save_dir $save_dir \ No newline at end of file diff --git a/speechx/examples/ds2_ol/onnx/local/tonnx.sh b/speechx/examples/ds2_ol/onnx/local/tonnx.sh index ffedf001c..104872303 100755 --- a/speechx/examples/ds2_ol/onnx/local/tonnx.sh +++ b/speechx/examples/ds2_ol/onnx/local/tonnx.sh @@ -15,11 +15,12 @@ pip install paddle2onnx pip install onnx # https://github.com/PaddlePaddle/Paddle2ONNX#%E5%91%BD%E4%BB%A4%E8%A1%8C%E8%BD%AC%E6%8D%A2 + # opset10 support quantize paddle2onnx --model_dir $dir \ --model_filename $model \ --params_filename $param \ --save_file $output \ --enable_dev_version True \ - --opset_version 9 \ + --opset_version 11 \ --enable_onnx_checker True \ No newline at end of file diff --git a/speechx/examples/ds2_ol/onnx/run.sh b/speechx/examples/ds2_ol/onnx/run.sh index 583abda4e..3dc5e9100 100755 --- a/speechx/examples/ds2_ol/onnx/run.sh +++ b/speechx/examples/ds2_ol/onnx/run.sh @@ -39,41 +39,10 @@ if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ];then popd fi -output_names=softmax_0.tmp_0,tmp_5,concat_0.tmp_0,concat_1.tmp_0 -if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ];then - # prune model by outputs - mkdir -p $exp/prune - - # prune model deps on output_names. 
- ./local/prune.sh $dir $model $param $output_names $exp/prune -fi - -# aishell rnn hidden is 1024 -# wenetspeech rnn hiddn is 2048 -if [ $model_type == 'aishell' ];then - input_shape_dict="{'audio_chunk':[1,-1,161], 'audio_chunk_lens':[1], 'chunk_state_c_box':[5, 1, 1024], 'chunk_state_h_box':[5,1,1024]}" -elif [ $model_type == 'wenetspeech' ];then - input_shape_dict="{'audio_chunk':[1,-1,161], 'audio_chunk_lens':[1], 'chunk_state_c_box':[5, 1, 2048], 'chunk_state_h_box':[5,1,2048]}" -else - echo "not support: $model_type" - exit -1 -fi -if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ];then - # infer shape by new shape - mkdir -p $exp/shape - echo $input_shape_dict - python3 local/pd_infer_shape.py \ - --model_dir $dir \ - --model_filename $model \ - --params_filename $param \ - --save_dir $exp/shape \ - --input_shape_dict="${input_shape_dict}" -fi - input_file=$exp/static_ds2online_inputs.pickle test -e $input_file -if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ];then +if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ];then # to onnx ./local/tonnx.sh $dir $model $param $exp/model.onnx @@ -81,7 +50,7 @@ if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ];then fi -if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ] ;then +if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ] ;then # ort graph optmize ./local/ort_opt.py --model_in $exp/model.onnx --opt_level 0 --model_out $exp/model.ort.opt.onnx @@ -89,6 +58,18 @@ if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ] ;then fi +if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ];then + # convert opset_num to 11 + ./local/onnx_convert_opset.py --target-opset 11 --model-file $exp/model.ort.opt.onnx --save-model $exp/model.optset11.onnx + + # quant model + nodes_to_exclude='p2o.Conv.0,p2o.Conv.2' + ./local/ort_dyanmic_quant.py --model-in $exp/model.optset11.onnx --model-out $exp/model.optset11.quant.onnx --nodes-to-exclude "${nodes_to_exclude}" + + ./local/infer_check.py --input_file $input_file --model_type $model_type --model_dir $dir --model_prefix $model_prefix --onnx_model $exp/model.optset11.quant.onnx +fi + + # aishell rnn hidden is 1024 # wenetspeech rnn hiddn is 2048 if [ $model_type == 'aishell' ];then diff --git a/tests/test_tipc/configs/mdtc/train_infer_python.txt b/tests/test_tipc/configs/mdtc/train_infer_python.txt new file mode 100644 index 000000000..7a5f658ee --- /dev/null +++ b/tests/test_tipc/configs/mdtc/train_infer_python.txt @@ -0,0 +1,57 @@ +===========================train_params=========================== +model_name:mdtc +python:python3.7 +gpu_list:0|0,1 +null:null +null:null +--benchmark-max-step:50 +null:null +--benchmark-batch-size:16 +null:null +null:null +null:null +null:null +## +trainer:norm_train +norm_train: ../paddlespeech/kws/exps/mdtc/train.py --config=../examples/hey_snips/kws0/conf/mdtc.yaml +pact_train:null +fpgm_train:null +distill_train:null +null:null +null:null +## +===========================eval_params=========================== +eval:null +null:null +## +===========================infer_params=========================== +null:null +null:null +norm_export: null +quant_export:null +fpgm_export:null +distill_export:null +export1:null +export2:null +null:null +infer_model:null +infer_export:null +infer_quant:null +inference:null +null:null +null:null +null:null +null:null +null:null +null:null +null:null +null:null +null:null +null:null +null:null +===========================train_benchmark_params========================== +batch_size:16|30 +fp_items:fp32 +iteration:50 
+--profiler-options:"batch_range=[10,35];state=GPU;tracer_option=Default;profile_path=model.profile"
+flags:null
diff --git a/tests/test_tipc/prepare.sh b/tests/test_tipc/prepare.sh
index a13938017..b38bbcba1 100644
--- a/tests/test_tipc/prepare.sh
+++ b/tests/test_tipc/prepare.sh
@@ -80,4 +80,13 @@ if [ ${MODE} = "benchmark_train" ];then
        python ../paddlespeech/t2s/exps/gan_vocoder/normalize.py --metadata=dump/test/raw/metadata.jsonl --dumpdir=dump/test/norm --stats=dump/train/feats_stats.npy
    fi
+    if [ ${model_name} == "mdtc" ]; then
+        # Download the Snips dataset and extract it
+        wget -nc https://paddlespeech.bj.bcebos.com/datasets/hey_snips_kws_4.0.tar.gz.1 https://paddlespeech.bj.bcebos.com/datasets/hey_snips_kws_4.0.tar.gz.2
+        cat hey_snips_kws_4.0.tar.gz.* > hey_snips_kws_4.0.tar.gz
+        rm hey_snips_kws_4.0.tar.gz.*
+        tar -xzf hey_snips_kws_4.0.tar.gz
+        # Extracted data directory: ./hey_snips_research_6k_en_train_eval_clean_ter
+    fi
+
fi
diff --git a/tests/unit/cli/test_cli.sh b/tests/unit/cli/test_cli.sh
index 6879c4d64..15604961d 100755
--- a/tests/unit/cli/test_cli.sh
+++ b/tests/unit/cli/test_cli.sh
@@ -43,7 +43,7 @@ paddlespeech tts --input "你好,欢迎使用百度飞桨深度学习框架!"
paddlespeech tts --am speedyspeech_csmsc --input "你好,欢迎使用百度飞桨深度学习框架!"
paddlespeech tts --voc mb_melgan_csmsc --input "你好,欢迎使用百度飞桨深度学习框架!"
paddlespeech tts --voc style_melgan_csmsc --input "你好,欢迎使用百度飞桨深度学习框架!"
-paddlespeech tts --voc hifigan_csmsc --input "你好,欢迎使用百度飞桨深度学习框架!"
+paddlespeech tts --voc pwgan_csmsc --input "你好,欢迎使用百度飞桨深度学习框架!"
paddlespeech tts --am fastspeech2_aishell3 --voc pwgan_aishell3 --input "你好,欢迎使用百度飞桨深度学习框架!" --spk_id 0
paddlespeech tts --am fastspeech2_aishell3 --voc hifigan_aishell3 --input "你好,欢迎使用百度飞桨深度学习框架!" --spk_id 0
paddlespeech tts --am fastspeech2_ljspeech --voc pwgan_ljspeech --lang en --input "Life was like a box of chocolates, you never know what you're gonna get."
@@ -53,6 +53,16 @@ paddlespeech tts --am fastspeech2_vctk --voc hifigan_vctk --input "Life was like
paddlespeech tts --am tacotron2_csmsc --input "你好,欢迎使用百度飞桨深度学习框架!"
paddlespeech tts --am tacotron2_csmsc --voc wavernn_csmsc --input "你好,欢迎使用百度飞桨深度学习框架!"
paddlespeech tts --am tacotron2_ljspeech --voc pwgan_ljspeech --lang en --input "Life was like a box of chocolates, you never know what you're gonna get."
+# mix tts
+# The `am` must be `fastspeech2_mix`!
+# The `lang` must be `mix`!
+# The `voc` must be a Chinese dataset's vocoder for now!
+# spk 174 is csmsc, spk 175 is ljspeech
+paddlespeech tts --am fastspeech2_mix --voc hifigan_csmsc --lang mix --input "热烈欢迎您在 Discussions 中提交问题,并在 Issues 中指出发现的 bug。此外,我们非常希望您参与到 Paddle Speech 的开发中!" --spk_id 174 --output mix_spk174.wav
+paddlespeech tts --am fastspeech2_mix --voc hifigan_aishell3 --lang mix --input "热烈欢迎您在 Discussions 中提交问题,并在 Issues 中指出发现的 bug。此外,我们非常希望您参与到 Paddle Speech 的开发中!" --spk_id 174 --output mix_spk174_aishell3.wav
+paddlespeech tts --am fastspeech2_mix --voc pwgan_csmsc --lang mix --input "我们的声学模型使用了 Fast Speech Two, 声码器使用了 Parallel Wave GAN and Hifi GAN." --spk_id 175 --output mix_spk175_pwgan.wav
+paddlespeech tts --am fastspeech2_mix --voc hifigan_csmsc --lang mix --input "我们的声学模型使用了 Fast Speech Two, 声码器使用了 Parallel Wave GAN and Hifi GAN." --spk_id 175 --output mix_spk175.wav
+
# Speech Translation (only support linux)
paddlespeech st --input ./en.wav

diff --git a/third_party/README.md b/third_party/README.md
index 843d0d3b2..98e03b0a3 100644
--- a/third_party/README.md
+++ b/third_party/README.md
@@ -1,27 +1,26 @@
-* [python_kaldi_features](https://github.com/ZitengWang/python_kaldi_features)
+# python_kaldi_features
+
+[python_kaldi_features](https://github.com/ZitengWang/python_kaldi_features)
commit: fc1bd6240c2008412ab64dc25045cd872f5e126c
ref: https://zhuanlan.zhihu.com/p/55371926
license: MIT

-* [python-pinyin](https://github.com/mozillazg/python-pinyin.git)
-commit: 55e524aa1b7b8eec3d15c5306043c6cdd5938b03
-license: MIT
+# Install ctc_decoder for Windows

-* [zhon](https://github.com/tsroten/zhon)
-commit: 09bf543696277f71de502506984661a60d24494c
-license: MIT
+`install_win_ctc.bat` is a bat script to install paddlespeech_ctc_decoders on Windows.

-* [pymmseg-cpp](https://github.com/pluskid/pymmseg-cpp.git)
-commit: b76465045717fbb4f118c4fbdd24ce93bab10a6d
-license: MIT
+## Prepare your environment

-* [chinese_text_normalization](https://github.com/speechio/chinese_text_normalization.git)
-commit: 9e92c7bf2d6b5a7974305406d8e240045beac51c
-license: MIT
+Ensure your environment meets the following requirements:

-* [phkit](https://github.com/KuangDD/phkit.git)
-commit: b2100293c1e36da531d7f30bd52c9b955a649522
-license: None
+* gcc: version >= 12.1.0
+* cmake: version >= 3.24.0
+* make: version >= 3.82.90
+* visual studio: version >= 2019

-* [nnAudio](https://github.com/KinWaiCheuk/nnAudio.git)
-license: MIT
+## Start your bat script
+
+```shell
+start install_win_ctc.bat
+
+```
diff --git a/third_party/ctc_decoders/scorer.cpp b/third_party/ctc_decoders/scorer.cpp
index 6c1d96be3..6e7f68cf6 100644
--- a/third_party/ctc_decoders/scorer.cpp
+++ b/third_party/ctc_decoders/scorer.cpp
@@ -13,7 +13,8 @@
#include "decoder_utils.h"

using namespace lm::ngram;
-
+// if your platform is Windows, you need to add this define
+#define F_OK 0
Scorer::Scorer(double alpha,
               double beta,
               const std::string& lm_path,
diff --git a/third_party/ctc_decoders/setup.py b/third_party/ctc_decoders/setup.py
index ce2787e3f..9a8b292a0 100644
--- a/third_party/ctc_decoders/setup.py
+++ b/third_party/ctc_decoders/setup.py
@@ -89,10 +89,11 @@ FILES = [
    or fn.endswith('unittest.cc'))
]
# yapf: enable
-
LIBS = ['stdc++']
if platform.system() != 'Darwin':
    LIBS.append('rt')
+if platform.system() == 'Windows':
+    LIBS = ['-static-libstdc++']

ARGS = ['-O3', '-DNDEBUG', '-DKENLM_MAX_ORDER=6', '-std=c++11']
diff --git a/third_party/install_win_ctc.bat b/third_party/install_win_ctc.bat
new file mode 100644
index 000000000..0bf1e7bb1
--- /dev/null
+++ b/third_party/install_win_ctc.bat
@@ -0,0 +1,21 @@
+@echo off
+
+cd ctc_decoders
+if not exist kenlm (
+    git clone https://github.com/Doubledongli/kenlm.git
+    @echo.
+)
+
+if not exist openfst-1.6.3 (
+    echo "Download and extract openfst ..."
+    git clone https://gitee.com/koala999/openfst.git
+    ren openfst openfst-1.6.3
+    @echo.
+)
+
+if not exist ThreadPool (
+    git clone https://github.com/progschj/ThreadPool.git
+    @echo.
+)
+echo "Install decoders ..."
+python setup.py install --num_processes 4
\ No newline at end of file
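For reference on the ONNX changes above: the check that `run.sh` stage 3 performs via `local/infer_check.py` amounts to running the same inputs through the quantized model with onnxruntime and comparing against Paddle inference. Below is a minimal sketch only, assuming the streaming DeepSpeech2 input names used earlier in this example (`audio_chunk`, `audio_chunk_lens`, `chunk_state_c_box`, `chunk_state_h_box`) and the wenetspeech hidden size of 2048; the dtypes and tolerance are assumptions, not taken from `infer_check.py`.

```python
import numpy as np
import onnxruntime as ort

# Load the dynamically quantized model produced by run.sh stage 3.
sess = ort.InferenceSession("exp/model.optset11.quant.onnx",
                            providers=["CPUExecutionProvider"])

# One 1 x T x 161 fbank chunk plus zero-initialized streaming RNN states
# (5 layers x 1 batch x 2048 hidden for the wenetspeech model). Dtypes here
# are assumptions; check the exported graph for the exact expected types.
feeds = {
    "audio_chunk": np.random.randn(1, 100, 161).astype(np.float32),
    "audio_chunk_lens": np.array([100], dtype=np.int64),
    "chunk_state_c_box": np.zeros((5, 1, 2048), dtype=np.float32),
    "chunk_state_h_box": np.zeros((5, 1, 2048), dtype=np.float32),
}
onnx_outputs = sess.run(None, feeds)

# In the real check the same feeds also go through paddle.inference and the
# two output sets are compared; int8 quantization shifts the numerics, so a
# loose tolerance such as np.allclose(a, b, atol=1e-1) is the kind of
# criterion one would use rather than exact equality.
print([o.shape for o in onnx_outputs])
```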