
@ -25,14 +25,16 @@
| <a href="#documents"> Documents </a>
| <a href="#model-list"> Models List </a>
| <a href="https://aistudio.baidu.com/aistudio/education/group/info/25130"> AIStudio Courses </a>
| <a href="https://arxiv.org/abs/2205.12007"> Paper </a>
| <a href="https://arxiv.org/abs/2205.12007"> NAACL2022 Best Demo Award Paper </a>
| <a href="https://gitee.com/paddlepaddle/PaddleSpeech"> Gitee </a>
</h4>
</div>
------------------------------------------------------------------------------------
**PaddleSpeech** is an open-source toolkit on the [PaddlePaddle](https://github.com/PaddlePaddle/Paddle) platform for a variety of critical tasks in speech and audio, with state-of-the-art and influential models.
**PaddleSpeech** won the [NAACL2022 Best Demo Award](https://2022.naacl.org/blog/best-demo-award/); please check out our paper on [arXiv](https://arxiv.org/abs/2205.12007).
##### Speech Recognition
@ -177,7 +179,7 @@ Via the easy-to-use, efficient, flexible and scalable implementation, our vision
## Installation
We strongly recommend our users to install PaddleSpeech in **Linux** with *python>=3.7*.
We strongly recommend our users to install PaddleSpeech in **Linux** with *python>=3.7* and *paddlepaddle>=2.3.1*.
Up to now, **Linux** supports CLI for all our tasks; **Mac OSX** and **Windows** only support the PaddleSpeech CLI for Audio Classification, Speech-to-Text and Text-to-Speech. To install `PaddleSpeech`, please see [installation](./docs/source/install.md).
@ -706,6 +708,7 @@ You are warmly welcome to submit questions in [discussions](https://github.com/P
- Many thanks to [phecda-xu](https://github.com/phecda-xu)/[PaddleDubbing](https://github.com/phecda-xu/PaddleDubbing) for developing a dubbing tool with GUI based on PaddleSpeech TTS model.
- Many thanks to [jerryuhoo](https://github.com/jerryuhoo)/[VTuberTalk](https://github.com/jerryuhoo/VTuberTalk) for developing a GUI tool based on PaddleSpeech TTS and code for making datasets from videos based on PaddleSpeech ASR.
- Many thanks to [vpegasus](https://github.com/vpegasus)/[xuesebot](https://github.com/vpegasus/xuesebot) for developing a Rasa chatbot, which is able to speak and listen thanks to PaddleSpeech.
- Many thanks to [chenkui164](https://github.com/chenkui164)/[FastASR](https://github.com/chenkui164/FastASR) for the C++ inference implementation of PaddleSpeech ASR.
Besides, PaddleSpeech depends on a lot of open source repositories. See [references](./docs/source/reference.md) for more information.

@ -26,7 +26,7 @@
| <a href="#教程文档"> 教程文档 </a>
| <a href="#模型列表"> 模型列表 </a>
| <a href="https://aistudio.baidu.com/aistudio/education/group/info/25130"> AIStudio 课程 </a>
| <a href="https://arxiv.org/abs/2205.12007"> 论文 </a>
| <a href="https://arxiv.org/abs/2205.12007"> NAACL2022 论文 </a>
| <a href="https://gitee.com/paddlepaddle/PaddleSpeech"> Gitee
</h4>
</div>
@ -35,6 +35,9 @@
------------------------------------------------------------------------------------
**PaddleSpeech** is an open-source speech model library based on [PaddlePaddle](https://github.com/PaddlePaddle/Paddle), used for the development of a variety of critical tasks in speech and audio; it contains many influential models built on cutting-edge deep learning. Some typical application examples are shown below:
**PaddleSpeech** won the [NAACL2022 Best Demo Award](https://2022.naacl.org/blog/best-demo-award/); please check out our paper on [arXiv](https://arxiv.org/abs/2205.12007).
##### Speech Recognition
<div align = "center">
@ -702,7 +705,7 @@ PaddleSpeech's **speech synthesis** mainly consists of three modules: text frontend, acoustic
- Many thanks to [jerryuhoo](https://github.com/jerryuhoo)/[VTuberTalk](https://github.com/jerryuhoo/VTuberTalk) for the TTS GUI based on PaddleSpeech and the related code for building datasets with PaddleSpeech ASR.
- Many thanks to [vpegasus](https://github.com/vpegasus)/[xuesebot](https://github.com/vpegasus/xuesebot) for the chatbot that can listen and speak, designed with PaddleSpeech ASR and TTS.
- Many thanks to [chenkui164](https://github.com/chenkui164)/[FastASR](https://github.com/chenkui164/FastASR) for the C++ inference implementation of PaddleSpeech ASR.
In addition, PaddleSpeech depends on many open-source repositories. See [references](./docs/source/reference.md) for more information.

@ -12,6 +12,7 @@ This directory contains many speech applications in multiple scenarios.
* speech recognition - recognize text of an audio file
* speech server - server for speech tasks, e.g. ASR, TTS, CLS
* streaming asr server - receive an audio stream over websocket and recognize it into a transcript.
* streaming tts server - receive text over http or websocket and stream back the synthesized audio.
* speech translation - end to end speech translation
* story talker - book reader based on OCR and TTS
* style_fs2 - multi style control for FastSpeech2 model

@ -10,8 +10,9 @@
* metaverse - 2D augmented reality based on speech synthesis.
* punctuation restoration - usually a text post-processing task for speech recognition, adding punctuation to a piece of unpunctuated plain text.
* speech recognition - recognize the text contained in a piece of audio.
* speech server - offline speech services, including ASR, TTS, CLS, etc
* streaming asr server - recognize the text in a streaming audio input
* speech server - offline speech services, including ASR, TTS, CLS, etc.
* streaming asr server - recognize the text in a streaming audio input.
* streaming tts server - stream out synthesized audio generated from the input text.
* speech translation - recognize the speech in audio in real time and simultaneously translate it into the target language.
* story talker - a talking storybook based on OCR and speech synthesis.
* personalized speech synthesis - personalized speech synthesis based on the FastSpeech2 model.

@ -1,6 +1,7 @@
#!/bin/bash
wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespeech.bj.bcebos.com/PaddleAudio/en.wav
wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav
wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/en.wav
# asr
paddlespeech asr --input ./zh.wav
@ -8,3 +9,18 @@ paddlespeech asr --input ./zh.wav
# asr + punc
paddlespeech asr --input ./zh.wav | paddlespeech text --task punc
# asr help
paddlespeech asr --help
# english asr
paddlespeech asr --lang en --model transformer_librispeech --input ./en.wav
# model stats
paddlespeech stats --task asr
# paddlespeech help
paddlespeech --help
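For reference, the same pipeline is exposed through the documented Python executors; a minimal sketch, assuming the `zh.wav` downloaded by this script:

```python
from paddlespeech.cli.asr.infer import ASRExecutor
from paddlespeech.cli.text.infer import TextExecutor

asr = ASRExecutor()
text_punc = TextExecutor()  # punctuation restoration is the default task

# mirrors `paddlespeech asr --input ./zh.wav | paddlespeech text --task punc`
transcript = asr(audio_file="./zh.wav")
print(text_punc(text=transcript))
```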

@ -11,7 +11,8 @@ This demo is an implementation of starting the voice service and accessing the s
see [installation](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install.md).
It is recommended to use **paddlepaddle 2.2.2** or above.
You can choose one way from medium and hard to install paddlespeech.
You can choose one way from easy, medium and hard to install paddlespeech.
**If you install in easy mode, you need to prepare the yaml file by yourself; you can refer to the yaml files in the conf directory.**
### 2. Prepare config File
The configuration file can be found in `conf/application.yaml`.

@ -3,7 +3,7 @@
# Speech Server
## Introduction
This demo is an implementation of starting the offline speech service and accessing it. It can be done with a single command using `paddlespeech_server` and `paddlespeech_client`, or with a few lines of Python code.
## Usage
@ -11,13 +11,14 @@
See the [installation documentation](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install.md).
It is recommended to use **paddlepaddle 2.2.2** or above.
You can choose one way from medium and hard to install PaddleSpeech.
You can choose one way from easy, medium and hard to install PaddleSpeech.
**If you install in easy mode, you need to prepare the yaml file by yourself; you can refer to the yaml files in the conf directory.**
### 2. Prepare config File
The configuration file can be found in `conf/application.yaml`.
Among them, `engine_list` indicates the speech engines that the service to be started will include, in the format <speech task>_<engine type>.
The speech tasks currently integrated into the service are: asr (speech recognition), tts (speech synthesis), cls (audio classification), vector (speaker recognition) and text (text processing).
Two engine types are currently supported: python and inference (Paddle Inference).
**Note:** If the service starts normally in the container but the client cannot reach its ip, try replacing the `host` address in the config file with the local ip address.
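For reference, the demo also documents a Python client for the offline server; a minimal sketch, assuming an asr engine is enabled in `conf/application.yaml` and the server is reachable on port 8090 (the exact result format may differ by version):

```python
from paddlespeech.server.bin.paddlespeech_client import ASRClientExecutor

asrclient = ASRClientExecutor()
res = asrclient(
    input="./zh.wav",
    server_ip="127.0.0.1",
    port=8090,
    sample_rate=16000,
    lang="zh_cn",
    audio_format="wav")
print(res)
```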

@ -1,3 +1,3 @@
#!/bin/bash
paddlespeech_server start --config_file ./conf/application.yaml
paddlespeech_server start --config_file ./conf/application.yaml &> server.log &

@ -0,0 +1,10 @@
#!/bin/bash
wget -c https://paddlespeech.bj.bcebos.com/vector/audio/85236145389.wav
wget -c https://paddlespeech.bj.bcebos.com/vector/audio/123456789.wav
# sid extract
paddlespeech_client vector --server_ip 127.0.0.1 --port 8090 --task spk --input ./85236145389.wav
# sid score
paddlespeech_client vector --server_ip 127.0.0.1 --port 8090 --task score --enroll ./85236145389.wav --test ./123456789.wav
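The same request can be made from Python via the HTTP handler that the client executor uses internally (its `run(input, audio_format, sample_rate)` call appears in the `VectorClientExecutor` hunk later in this PR); a minimal sketch:

```python
from paddlespeech.server.utils.audio_handler import VectorHttpHandler

handler = VectorHttpHandler(server_ip="127.0.0.1", port=8090)
# returns the speaker embedding for the given utterance
res = handler.run("./85236145389.wav", "wav", 16000)
print(res)
```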

@ -0,0 +1,4 @@
#!/bin/bash
paddlespeech_client text --server_ip 127.0.0.1 --port 8090 --input 今天的天气真好啊你下午有空吗我想约你一起去吃饭

@ -747,9 +747,9 @@
}
},
"node_modules/moment": {
"version": "2.29.3",
"resolved": "https://registry.npmmirror.com/moment/-/moment-2.29.3.tgz",
"integrity": "sha512-c6YRvhEo//6T2Jz/vVtYzqBzwvPT95JBQ+smCytzf7c50oMZRsR/a4w88aD34I+/QVSfnoAnSBFPJHItlOMJVw==",
"version": "2.29.4",
"resolved": "https://registry.npmjs.org/moment/-/moment-2.29.4.tgz",
"integrity": "sha512-5LC9SOxjSc2HF6vO2CyuTDNivEdoz2IvyJJGj6X8DJ0eFyfszE0QiEd+iXmBvUP3WHxSjFH/vIsA0EN00cgr8w==",
"engines": {
"node": "*"
}
@ -1636,9 +1636,9 @@
"optional": true
},
"moment": {
"version": "2.29.3",
"resolved": "https://registry.npmmirror.com/moment/-/moment-2.29.3.tgz",
"integrity": "sha512-c6YRvhEo//6T2Jz/vVtYzqBzwvPT95JBQ+smCytzf7c50oMZRsR/a4w88aD34I+/QVSfnoAnSBFPJHItlOMJVw=="
"version": "2.29.4",
"resolved": "https://registry.npmjs.org/moment/-/moment-2.29.4.tgz",
"integrity": "sha512-5LC9SOxjSc2HF6vO2CyuTDNivEdoz2IvyJJGj6X8DJ0eFyfszE0QiEd+iXmBvUP3WHxSjFH/vIsA0EN00cgr8w=="
},
"nanoid": {
"version": "3.3.2",

@ -587,9 +587,9 @@ mime@^1.4.1:
integrity sha512-x0Vn8spI+wuJ1O6S7gnbaQg8Pxh4NNHb7KSINmEWKiPE4RKOplvijn+NkmYmmRgP68mc70j2EbeTFRsrswaQeg==
moment@^2.27.0:
version "2.29.3"
resolved "https://registry.npmmirror.com/moment/-/moment-2.29.3.tgz"
integrity sha512-c6YRvhEo//6T2Jz/vVtYzqBzwvPT95JBQ+smCytzf7c50oMZRsR/a4w88aD34I+/QVSfnoAnSBFPJHItlOMJVw==
version "2.29.4"
resolved "https://registry.yarnpkg.com/moment/-/moment-2.29.4.tgz#3dbe052889fe7c1b2ed966fcb3a77328964ef108"
integrity sha512-5LC9SOxjSc2HF6vO2CyuTDNivEdoz2IvyJJGj6X8DJ0eFyfszE0QiEd+iXmBvUP3WHxSjFH/vIsA0EN00cgr8w==
ms@^2.1.1:
version "2.1.3"

@ -12,7 +12,8 @@ Streaming ASR server only supports the `websocket` protocol, and doesn't support `htt
see [installation](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install.md).
It is recommended to use **paddlepaddle 2.2.1** or above.
You can choose one way from medium and hard to install paddlespeech.
You can choose one way from easy, medium and hard to install paddlespeech.
**If you install in easy mode, you need to prepare the yaml file by yourself; you can refer to the yaml files in the conf directory.**
### 2. Prepare config File
The configuration file can be found in `conf/ws_application.yaml` or `conf/ws_conformer_wenetspeech_application.yaml`.

@ -12,8 +12,8 @@
For the detailed process of installing PaddleSpeech, see the [installation documentation](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install.md).
It is recommended to use **paddlepaddle 2.2.1** or above.
You can choose one way from medium and hard to install PaddleSpeech.
You can choose one way from easy, medium and hard to install PaddleSpeech.
**If you install in easy mode, you need to prepare the yaml file by yourself; you can refer to the yaml files in the conf directory.**
### 2. Prepare config File

@ -1,9 +1,8 @@
export CUDA_VISIBLE_DEVICE=0,1,2,3
export CUDA_VISIBLE_DEVICE=0,1,2,3
#export CUDA_VISIBLE_DEVICE=0,1,2,3
# nohup python3 punc_server.py --config_file conf/punc_application.yaml > punc.log 2>&1 &
# nohup python3 local/punc_server.py --config_file conf/punc_application.yaml > punc.log 2>&1 &
paddlespeech_server start --config_file conf/punc_application.yaml &> punc.log &
# nohup python3 streaming_asr_server.py --config_file conf/ws_conformer_wenetspeech_application.yaml > streaming_asr.log 2>&1 &
# nohup python3 local/streaming_asr_server.py --config_file conf/ws_conformer_wenetspeech_application.yaml > streaming_asr.log 2>&1 &
paddlespeech_server start --config_file conf/ws_conformer_wenetspeech_application.yaml &> streaming_asr.log &

@ -7,5 +7,5 @@ paddlespeech_client asr_online --server_ip 127.0.0.1 --port 8090 --input ./zh.wa
# read the wav and call streaming and punc service
# If `127.0.0.1` is not accessible, you need to use the actual service IP address.
paddlespeech_client asr_online --server_ip 127.0.0.1 --port 8290 --punc.server_ip 127.0.0.1 --punc.port 8190 --input ./zh.wav
paddlespeech_client asr_online --server_ip 127.0.0.1 --port 8090 --punc.server_ip 127.0.0.1 --punc.port 8190 --input ./zh.wav
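A Python counterpart of this client is also documented for the streaming demo; a minimal sketch, assuming the streaming server from the script above is listening on port 8090 (parameter names follow the demo docs):

```python
from paddlespeech.server.bin.paddlespeech_client import ASROnlineClientExecutor

executor = ASROnlineClientExecutor()
res = executor(
    input="./zh.wav",
    server_ip="127.0.0.1",
    port=8090,
    sample_rate=16000,
    lang="zh_cn",
    audio_format="wav")
print(res)
```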

@ -11,7 +11,8 @@ This demo is an implementation of starting the streaming speech synthesis servic
see [installation](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install.md).
It is recommended to use **paddlepaddle 2.2.2** or above.
You can choose one way from medium and hard to install paddlespeech.
You can choose one way from easy, medium and hard to install paddlespeech.
**If you install in easy mode, you need to prepare the yaml file by yourself; you can refer to the yaml files in the conf directory.**
### 2. Prepare config File

@ -11,8 +11,8 @@
See the [installation documentation](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install.md).
It is recommended to use **paddlepaddle 2.2.2** or above.
You can choose one way from medium and hard to install PaddleSpeech.
You can choose one way from easy, medium and hard to install PaddleSpeech.
**If you install in easy mode, you need to prepare the yaml file by yourself; you can refer to the yaml files in the conf directory.**
### 2. Prepare config File
The configuration file can be found in `conf/tts_online_application.yaml`.

@ -2,8 +2,8 @@
# http client test
# If `127.0.0.1` is not accessible, you need to use the actual service IP address.
paddlespeech_client tts_online --server_ip 127.0.0.1 --port 8092 --protocol http --input "您好,欢迎使用百度飞桨语音合成服务。" --output output.wav
paddlespeech_client tts_online --server_ip 127.0.0.1 --port 8092 --protocol http --input "您好,欢迎使用百度飞桨语音合成服务。" --output output.http.wav
# websocket client test
# If `127.0.0.1` is not accessible, you need to use the actual service IP address.
# paddlespeech_client tts_online --server_ip 127.0.0.1 --port 8092 --protocol websocket --input "您好,欢迎使用百度飞桨语音合成服务。" --output output.wav
paddlespeech_client tts_online --server_ip 127.0.0.1 --port 8192 --protocol websocket --input "您好,欢迎使用百度飞桨语音合成服务。" --output output.ws.wav

@ -0,0 +1,103 @@
# This is the parameter configuration file for the streaming tts server.
#################################################################################
#                             SERVER SETTING                                    #
#################################################################################
host: 0.0.0.0
port: 8192
# The task format in the engine_list is: <speech task>_<engine type>
# engine_list choices = ['tts_online', 'tts_online-onnx'], the inference speed of tts_online-onnx is faster than tts_online.
# protocol choices = ['websocket', 'http']
protocol: 'websocket'
engine_list: ['tts_online-onnx']
#################################################################################
#                               ENGINE CONFIG                                   #
#################################################################################
################################### TTS #########################################
################### speech task: tts; engine_type: online #######################
tts_online:
    # am (acoustic model) choices=['fastspeech2_csmsc', 'fastspeech2_cnndecoder_csmsc']
    # fastspeech2_cnndecoder_csmsc supports streaming am inference.
    am: 'fastspeech2_csmsc'
    am_config:
    am_ckpt:
    am_stat:
    phones_dict:
    tones_dict:
    speaker_dict:
    spk_id: 0
    # voc (vocoder) choices=['mb_melgan_csmsc', 'hifigan_csmsc']
    # both mb_melgan_csmsc and hifigan_csmsc support streaming voc inference
    voc: 'mb_melgan_csmsc'
    voc_config:
    voc_ckpt:
    voc_stat:
    # others
    lang: 'zh'
    device: 'cpu' # set 'gpu:id' or 'cpu'
    # am_block and am_pad are only used by the fastspeech2_cnndecoder_onnx model for streaming am inference;
    # when am_pad is set to 12, the streaming synthesized audio is the same as the non-streaming synthesized audio
    am_block: 72
    am_pad: 12
    # voc_block and voc_pad are used by the voc model for streaming voc inference;
    # when the voc model is mb_melgan_csmsc, setting voc_pad to 14 makes the streaming synthesized audio identical to the non-streaming audio; voc_pad can be set as low as 7 and the streaming synthesized audio still sounds normal
    # when the voc model is hifigan_csmsc, setting voc_pad to 19 makes the streaming synthesized audio identical to the non-streaming audio; with voc_pad set to 14, the streaming synthesized audio sounds normal
    voc_block: 36
    voc_pad: 14
#################################################################################
#                               ENGINE CONFIG                                   #
#################################################################################
################################### TTS #########################################
################### speech task: tts; engine_type: online-onnx #######################
tts_online-onnx:
    # am (acoustic model) choices=['fastspeech2_csmsc_onnx', 'fastspeech2_cnndecoder_csmsc_onnx']
    # fastspeech2_cnndecoder_csmsc_onnx supports streaming am inference.
    am: 'fastspeech2_cnndecoder_csmsc_onnx'
    # am_ckpt is a list; if am is fastspeech2_cnndecoder_csmsc_onnx, am_ckpt = [encoder model, decoder model, postnet model];
    # if am is fastspeech2_csmsc_onnx, am_ckpt = [ckpt model]
    am_ckpt: # list
    am_stat:
    phones_dict:
    tones_dict:
    speaker_dict:
    spk_id: 0
    am_sample_rate: 24000
    am_sess_conf:
        device: "cpu" # set 'gpu:id' or 'cpu'
        use_trt: False
        cpu_threads: 4
    # voc (vocoder) choices=['mb_melgan_csmsc_onnx', 'hifigan_csmsc_onnx']
    # both mb_melgan_csmsc_onnx and hifigan_csmsc_onnx support streaming voc inference
    voc: 'hifigan_csmsc_onnx'
    voc_ckpt:
    voc_sample_rate: 24000
    voc_sess_conf:
        device: "cpu" # set 'gpu:id' or 'cpu'
        use_trt: False
        cpu_threads: 4
    # others
    lang: 'zh'
    # am_block and am_pad are only used by the fastspeech2_cnndecoder_onnx model for streaming am inference;
    # when am_pad is set to 12, the streaming synthesized audio is the same as the non-streaming synthesized audio
    am_block: 72
    am_pad: 12
    # voc_block and voc_pad are used by the voc model for streaming voc inference;
    # when the voc model is mb_melgan_csmsc_onnx, setting voc_pad to 14 makes the streaming synthesized audio identical to the non-streaming audio; voc_pad can be set as low as 7 and the streaming synthesized audio still sounds normal
    # when the voc model is hifigan_csmsc_onnx, setting voc_pad to 19 makes the streaming synthesized audio identical to the non-streaming audio; with voc_pad set to 14, the streaming synthesized audio sounds normal
    voc_block: 36
    voc_pad: 14
    # voc_upsample should be the same as n_shift in the voc config.
    voc_upsample: 300
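A client matching this config (websocket protocol on port 8192) can be driven from Python as documented in the streaming TTS demo; a minimal sketch (result-handling options vary by version):

```python
from paddlespeech.server.bin.paddlespeech_client import TTSOnlineClientExecutor

executor = TTSOnlineClientExecutor()
executor(
    input="您好,欢迎使用百度飞桨语音合成服务。",
    server_ip="127.0.0.1",
    port=8192,             # matches the `port` above
    protocol="websocket",  # matches the `protocol` above
    output="./output.wav")
```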

@ -0,0 +1,10 @@
#!/bin/bash
# http server
paddlespeech_server start --config_file ./conf/tts_online_application.yaml &> tts.http.log &
# websocket server
paddlespeech_server start --config_file ./conf/tts_online_ws_application.yaml &> tts.ws.log &

@ -1,3 +0,0 @@
#!/bin/bash
# start server
paddlespeech_server start --config_file ./conf/tts_online_application.yaml

@ -4,4 +4,10 @@
paddlespeech tts --input 今天的天气不错啊
# Batch process
echo -e "1 欢迎光临。\n2 谢谢惠顾。" | paddlespeech tts
echo -e "1 欢迎光临。\n2 谢谢惠顾。" | paddlespeech tts
# Text Frontend
paddlespeech tts --input 今天是2022/10/29,最低温度是-3℃.
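These CLI calls also have a documented Python counterpart; a minimal sketch:

```python
from paddlespeech.cli.tts.infer import TTSExecutor

tts = TTSExecutor()
# the text frontend normalizes dates and temperatures such as 2022/10/29 and -3℃
tts(text="今天是2022/10/29,最低温度是-3℃.", output="output.wav")
```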

@ -0,0 +1,77 @@
FROM nvidia/cuda:11.2.2-cudnn8-runtime-ubuntu16.04
RUN echo "deb [trusted=true] http://mirrors.tuna.tsinghua.edu.cn/ubuntu/ xenial main restricted \n\
deb [trusted=true] http://mirrors.tuna.tsinghua.edu.cn/ubuntu/ xenial-updates main restricted \n\
deb [trusted=true] http://mirrors.tuna.tsinghua.edu.cn/ubuntu/ xenial universe \n\
deb [trusted=true] http://mirrors.tuna.tsinghua.edu.cn/ubuntu/ xenial-updates universe \n\
deb [trusted=true] http://mirrors.tuna.tsinghua.edu.cn/ubuntu/ xenial multiverse \n\
deb [trusted=true] http://mirrors.tuna.tsinghua.edu.cn/ubuntu/ xenial-updates multiverse \n\
deb [trusted=true] http://mirrors.tuna.tsinghua.edu.cn/ubuntu/ xenial-backports main restricted universe multiverse \n\
deb [trusted=true] http://mirrors.tuna.tsinghua.edu.cn/ubuntu/ xenial-security main restricted \n\
deb [trusted=true] http://mirrors.tuna.tsinghua.edu.cn/ubuntu/ xenial-security universe \n\
deb [trusted=true] http://mirrors.tuna.tsinghua.edu.cn/ubuntu/ xenial-security multiverse" > /etc/apt/sources.list
RUN apt-get update && apt-get install -y inetutils-ping wget vim curl cmake git sox libsndfile1 libpng12-dev \
libpng-dev swig libzip-dev openssl bc libflac* libgdk-pixbuf2.0-dev libpango1.0-dev libcairo2-dev \
libgtk2.0-dev pkg-config zip unzip zlib1g-dev libreadline-dev libbz2-dev liblapack-dev libjpeg-turbo8-dev \
sudo lrzsz libsqlite3-dev libx11-dev libsm6 apt-utils libopencv-dev libavcodec-dev libavformat-dev \
libswscale-dev locales liblzma-dev python-lzma m4 libxext-dev strace libibverbs-dev libpcre3 libpcre3-dev \
build-essential libncurses5-dev libgdbm-dev libnss3-dev libssl-dev libreadline-dev libffi-dev xz-utils \
libfreetype6-dev libxslt1-dev libxml2-dev libgeos-3.5.0 libgeos-dev && apt-get install -y --allow-downgrades \
--allow-change-held-packages libnccl2 libnccl-dev && DEBIAN_FRONTEND=noninteractive apt-get install -y tzdata \
&& /bin/cp /usr/share/zoneinfo/Asia/Shanghai /etc/localtime && dpkg-reconfigure -f noninteractive tzdata && \
cd /usr/lib/x86_64-linux-gnu && ln -s libcudnn.so.8 libcudnn.so && \
cd /usr/local/cuda-11.2/targets/x86_64-linux/lib && ln -s libcublas.so.11.4.1.1043 libcublas.so && \
ln -s libcusolver.so.11.1.0.152 libcusolver.so && ln -s libcusparse.so.11 libcusparse.so && \
ln -s libcufft.so.10.4.1.152 libcufft.so
RUN echo "set meta-flag on" >> /etc/inputrc && echo "set convert-meta off" >> /etc/inputrc && \
locale-gen en_US.UTF-8 && /sbin/ldconfig -v && groupadd -g 10001 paddle && \
useradd -m -s /bin/bash -N -u 10001 paddle -g paddle && chmod g+w /etc/passwd && \
echo "paddle ALL=(ALL) NOPASSWD: ALL" >> /etc/sudoers
ENV LANG=en_US.UTF-8 LC_ALL=en_US.UTF-8 LANGUAGE=en_US.UTF-8 TZ=Asia/Shanghai
# official download site: https://www.python.org/ftp/python/3.7.13/Python-3.7.13.tgz
RUN wget https://cdn.npmmirror.com/binaries/python/3.7.13/Python-3.7.13.tgz && tar xvf Python-3.7.13.tgz && \
cd Python-3.7.13 && ./configure --prefix=/home/paddle/python3.7 && make -j8 && make install && \
rm -rf ../Python-3.7.13 ../Python-3.7.13.tgz && chown -R paddle:paddle /home/paddle/python3.7
RUN cd /tmp && wget https://mirrors.sjtug.sjtu.edu.cn/gnu/gmp/gmp-6.1.0.tar.bz2 && tar xvf gmp-6.1.0.tar.bz2 && \
cd gmp-6.1.0 && ./configure --prefix=/usr/local && make -j8 && make install && \
rm -rf ../gmp-6.1.0.tar.bz2 ../gmp-6.1.0 && cd /tmp && \
wget https://www.mpfr.org/mpfr-3.1.4/mpfr-3.1.4.tar.bz2 && tar xvf mpfr-3.1.4.tar.bz2 && cd mpfr-3.1.4 && \
./configure --prefix=/usr/local && make -j8 && make install && rm -rf ../mpfr-3.1.4.tar.bz2 ../mpfr-3.1.4 && \
cd /tmp && wget https://mirrors.sjtug.sjtu.edu.cn/gnu/mpc/mpc-1.0.3.tar.gz && tar xvf mpc-1.0.3.tar.gz && \
cd mpc-1.0.3 && ./configure --prefix=/usr/local && make -j8 && make install && \
rm -rf ../mpc-1.0.3.tar.gz ../mpc-1.0.3 && cd /tmp && \
wget http://www.mirrorservice.org/sites/sourceware.org/pub/gcc/infrastructure/isl-0.18.tar.bz2 && \
tar xvf isl-0.18.tar.bz2 && cd isl-0.18 && ./configure --prefix=/usr/local && make -j8 && make install \
&& rm -rf ../isl-0.18.tar.bz2 ../isl-0.18 && cd /tmp && \
wget http://mirrors.ustc.edu.cn/gnu/gcc/gcc-8.2.0/gcc-8.2.0.tar.gz --no-check-certificate && \
tar xvf gcc-8.2.0.tar.gz && cd gcc-8.2.0 && unset LIBRARY_PATH && ./configure --prefix=/home/paddle/gcc82 \
--enable-threads=posix --disable-checking --disable-multilib --enable-languages=c,c++ --with-gmp=/usr/local \
--with-mpfr=/usr/local --with-mpc=/usr/local --with-isl=/usr/local && make -j8 && make install && \
rm -rf ../gcc-8.2.0.tar.gz ../gcc-8.2.0 && chown -R paddle:paddle /home/paddle/gcc82
WORKDIR /home/paddle
ENV PATH=/home/paddle/python3.7/bin:/home/paddle/gcc82/bin:${PATH} \
LD_LIBRARY_PATH=/usr/lib/x86_64-linux-gnu:/usr/local/cuda-11.2/targets/x86_64-linux/lib:${LD_LIBRARY_PATH}
RUN mkdir -p ~/.pip && echo "[global]" > ~/.pip/pip.conf && \
echo "index-url=https://mirror.baidu.com/pypi/simple" >> ~/.pip/pip.conf && \
echo "trusted-host=mirror.baidu.com" >> ~/.pip/pip.conf && \
python3 -m pip install --upgrade pip && \
pip install paddlepaddle-gpu==2.3.1.post112 -f https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html && \
rm -rf ~/.cache/pip
RUN git clone https://github.com/PaddlePaddle/PaddleSpeech.git && cd PaddleSpeech && \
pip3 install pytest-runner paddleaudio -i https://pypi.tuna.tsinghua.edu.cn/simple && \
pip3 install -e .[develop] -i https://pypi.tuna.tsinghua.edu.cn/simple && \
pip3 install importlib-metadata==4.2.0 urllib3==1.25.10 -i https://pypi.tuna.tsinghua.edu.cn/simple && \
rm -rf ~/.cache/pip && \
sudo cp -f /home/paddle/gcc82/lib64/libstdc++.so.6.0.25 /usr/lib/x86_64-linux-gnu/libstdc++.so.6 && \
chown -R paddle:paddle /home/paddle/PaddleSpeech
USER paddle
CMD ["bash"]

@ -10,7 +10,7 @@ There are 3 ways to use `PaddleSpeech`. According to the degree of difficulty, t
## Prerequisites
- Python >= 3.7
- PaddlePaddle latest version (please refer to the [Installation Guide] (https://www.paddlepaddle.org.cn/documentation/docs/en/beginners_guide/index_en.html))
- PaddlePaddle latest version (please refer to the [Installation Guide](https://www.paddlepaddle.org.cn/documentation/docs/en/beginners_guide/index_en.html))
- C++ compilation environment
- Tip: For Linux and Mac, do not use `sh` instead of `bash` when following the installation document.
- Tip: We recommend you install `paddlepaddle` from https://mirror.baidu.com/pypi/simple and install `paddlespeech` from https://pypi.tuna.tsinghua.edu.cn/simple.
@ -117,9 +117,9 @@ conda install -y -c gcc_linux-64=8.4.0 gxx_linux-64=8.4.0
```
(Tip: Do not use the last command if you want to install the **Hard** way):
### Install PaddlePaddle
You can choose the `PaddlePaddle` version based on your system. For example, for CUDA 10.2 and CuDNN 7.5, install paddlepaddle-gpu 2.2.0:
You can choose the `PaddlePaddle` version based on your system. For example, for CUDA 10.2 and CuDNN 7.5, install paddlepaddle-gpu 2.3.1:
```bash
python3 -m pip install paddlepaddle-gpu==2.2.0 -i https://mirror.baidu.com/pypi/simple
python3 -m pip install paddlepaddle-gpu==2.3.1 -i https://mirror.baidu.com/pypi/simple
```
### Install PaddleSpeech
You can install `paddlespeech` with the following command; then you can use the ready-made examples in `paddlespeech`:
@ -180,9 +180,9 @@ Some users may fail to install `kaldiio` due to the default download source, you
```bash
pip install pytest-runner -i https://pypi.tuna.tsinghua.edu.cn/simple
```
Make sure you have a GPU and that the paddlepaddle version is right. For example, for CUDA 10.2 and CuDNN 7.5, install paddle 2.2.0:
Make sure you have a GPU and that the paddlepaddle version is right. For example, for CUDA 10.2 and CuDNN 7.5, install paddle 2.3.1:
```bash
python3 -m pip install paddlepaddle-gpu==2.2.0 -i https://mirror.baidu.com/pypi/simple
python3 -m pip install paddlepaddle-gpu==2.3.1 -i https://mirror.baidu.com/pypi/simple
```
### Install PaddleSpeech in Developing Mode
```bash

@ -111,9 +111,9 @@ conda install -y -c gcc_linux-64=8.4.0 gxx_linux-64=8.4.0
```
(Tip: Do not use the last command if you want to install the **Hard** way.)
### Install PaddlePaddle
You can choose the PaddlePaddle version based on your system configuration; for example, with CUDA 10.2 and CuDNN 7.5 you can install paddlepaddle-gpu 2.2.0:
You can choose the PaddlePaddle version based on your system configuration; for example, with CUDA 10.2 and CuDNN 7.5 you can install paddlepaddle-gpu 2.3.1:
```bash
python3 -m pip install paddlepaddle-gpu==2.2.0 -i https://mirror.baidu.com/pypi/simple
python3 -m pip install paddlepaddle-gpu==2.3.1 -i https://mirror.baidu.com/pypi/simple
```
### Install PaddleSpeech
Finally, install `paddlespeech`, so that you can use the ready-made examples in `paddlespeech`:
@ -168,9 +168,9 @@ conda activate tools/venv
conda install -y -c conda-forge sox libsndfile swig bzip2 libflac bc
```
### Install PaddlePaddle
Make sure your system has a GPU and that you use the matching paddlepaddle version. For example, with CUDA 10.2 and CuDNN 7.5 you can install paddlepaddle-gpu 2.2.0:
Make sure your system has a GPU and that you use the matching paddlepaddle version. For example, with CUDA 10.2 and CuDNN 7.5 you can install paddlepaddle-gpu 2.3.1:
```bash
python3 -m pip install paddlepaddle-gpu==2.2.0 -i https://mirror.baidu.com/pypi/simple
python3 -m pip install paddlepaddle-gpu==2.3.1 -i https://mirror.baidu.com/pypi/simple
```
### Install PaddleSpeech in Developing Mode
On some systems, `kaldiio` may fail to install because of the default package source; it is recommended to install pytest-runner first:

@ -150,7 +150,7 @@ manylinux1 supports CentOS 5 and above, manylinux2010 supports CentOS 6 and above, manyli
### Pull manylinux2010
```bash
docker pull quay.io/pypa/manylinux1_x86_64
docker pull quay.io/pypa/manylinux2010_x86_64
```
### Use manylinux2010

@ -1,5 +1,5 @@
# Transformer/Conformer ASR with Aishell
This example contains code used to train a Transformer or [Conformer](http://arxiv.org/abs/2008.03802) model with [Aishell dataset](http://www.openslr.org/resources/33)
This example contains code used to train a [u2](https://arxiv.org/pdf/2012.05481.pdf) model (Transformer or [Conformer](https://arxiv.org/pdf/2005.08100.pdf) model) with [Aishell dataset](http://www.openslr.org/resources/33)
## Overview
All the scripts you need are in `run.sh`. There are several stages in `run.sh`, and each stage has its function.
| Stage | Function |

@ -1,20 +1,3 @@
# Callcenter 8k sample rate
Data distribution:
```
676048 utts
491.4004722221223 h
4357792.0 text
2.4633630739178654 text/sec
2.6167397877068495 sec/utt
```
train/dev/test partition:
```
33802 manifest.dev
67606 manifest.test
574640 manifest.train
676048 total
```
This recipe only has the model/data config for 8k ASR; users need to prepare the data and generate the manifest metafiles. You can refer to the Aishell or Librispeech recipes.

@ -154,7 +154,7 @@ VITS checkpoint contains files listed below.
vits_csmsc_ckpt_1.1.0
├── default.yaml # default config used to train vitx
├── phone_id_map.txt # phone vocabulary file when training vits
└── snapshot_iter_350000.pdz # model parameters and optimizer states
└── snapshot_iter_333000.pdz # model parameters and optimizer states
```
ps: This ckpt is not good enough; a better one is still being trained
@ -169,7 +169,7 @@ FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/synthesize_e2e.py \
--config=vits_csmsc_ckpt_1.1.0/default.yaml \
--ckpt=vits_csmsc_ckpt_1.1.0/snapshot_iter_350000.pdz \
--ckpt=vits_csmsc_ckpt_1.1.0/snapshot_iter_333000.pdz \
--phones_dict=vits_csmsc_ckpt_1.1.0/phone_id_map.txt \
--output_dir=exp/default/test_e2e \
--text=${BIN_DIR}/../sentences.txt \

@ -179,7 +179,7 @@ generator_first: False # whether to start updating generator first
# OTHER TRAINING SETTING #
##########################################################
num_snapshots: 10 # max number of snapshots to keep while training
train_max_steps: 250000 # Number of training steps. == total_iters / ngpus, total_iters = 1000000
train_max_steps: 350000 # Number of training steps. == total_iters / ngpus, total_iters = 1000000
save_interval_steps: 1000 # Interval steps to save checkpoint.
eval_interval_steps: 250 # Interval steps to evaluate the network.
seed: 777 # random seed number

@ -1,5 +1,5 @@
# Transformer/Conformer ASR with Librispeech
This example contains code used to train a Transformer or [Conformer](http://arxiv.org/abs/2008.03802) model with [Librispeech dataset](http://www.openslr.org/resources/12)
This example contains code used to train a [u2](https://arxiv.org/pdf/2012.05481.pdf) model (Transformer or [Conformer](https://arxiv.org/pdf/2005.08100.pdf) model) with [Librispeech dataset](http://www.openslr.org/resources/12)
## Overview
All the scripts you need are in `run.sh`. There are several stages in `run.sh`, and each stage has its function.
| Stage | Function |

@ -1,6 +1,6 @@
# Transformer/Conformer ASR with Librispeech ASR2
This example contains code used to train a Transformer or [Conformer](http://arxiv.org/abs/2008.03802) model with [Librispeech dataset](http://www.openslr.org/resources/12) and use some functions in kaldi.
This example contains code used to train a [u2](https://arxiv.org/pdf/2012.05481.pdf) model (Transformer or [Conformer](https://arxiv.org/pdf/2005.08100.pdf) model) with [Librispeech dataset](http://www.openslr.org/resources/12) and use some functions in kaldi.
To use this example, you need to install Kaldi first.

@ -1,5 +1,5 @@
# Transformer/Conformer ASR with Tiny
This example contains code used to train a Transformer or [Conformer](http://arxiv.org/abs/2008.03802) model Tiny dataset(a part of [[Librispeech dataset](http://www.openslr.org/resources/12)](http://www.openslr.org/resources/33))
This example contains code used to train a [u2](https://arxiv.org/pdf/2012.05481.pdf) model (Transformer or [Conformer](https://arxiv.org/pdf/2005.08100.pdf) model) with the Tiny dataset (a part of the [Librispeech dataset](http://www.openslr.org/resources/12))
## Overview
All the scripts you need are in `run.sh`. There are several stages in `run.sh`, and each stage has its function.
| Stage | Function |

@ -31,7 +31,7 @@ if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize.py \
--am=fastspeech2_aishell3 \
--am=fastspeech2_vctk \
--am_config=${config_path} \
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--am_stat=dump/train/speech_stats.npy \

@ -14,3 +14,5 @@
import _locale
_locale._getdefaultlocale = (lambda *args: ['en_US', 'utf8'])

@ -14,6 +14,9 @@
from . import compliance
from . import datasets
from . import features
from . import text
from . import transform
from . import streamdata
from . import functional
from . import io
from . import metric

@ -0,0 +1,13 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

@ -18,7 +18,6 @@ from typing import Union
import paddle
from paddle import nn
from paddle.fluid import core
from paddle.nn import functional as F
from paddlespeech.s2t.utils.log import Log
@ -39,46 +38,6 @@ paddle.long = 'int64'
paddle.uint16 = 'uint16'
paddle.cdouble = 'complex128'
def convert_dtype_to_string(tensor_dtype):
"""
Convert the data type in numpy to the data type in Paddle
Args:
tensor_dtype(core.VarDesc.VarType): the data type in numpy.
Returns:
core.VarDesc.VarType: the data type in Paddle.
"""
dtype = tensor_dtype
if dtype == core.VarDesc.VarType.FP32:
return paddle.float32
elif dtype == core.VarDesc.VarType.FP64:
return paddle.float64
elif dtype == core.VarDesc.VarType.FP16:
return paddle.float16
elif dtype == core.VarDesc.VarType.INT32:
return paddle.int32
elif dtype == core.VarDesc.VarType.INT16:
return paddle.int16
elif dtype == core.VarDesc.VarType.INT64:
return paddle.int64
elif dtype == core.VarDesc.VarType.BOOL:
return paddle.bool
elif dtype == core.VarDesc.VarType.BF16:
# since there is still no support for bfloat16 in NumPy,
# uint16 is used for casting bfloat16
return paddle.uint16
elif dtype == core.VarDesc.VarType.UINT8:
return paddle.uint8
elif dtype == core.VarDesc.VarType.INT8:
return paddle.int8
elif dtype == core.VarDesc.VarType.COMPLEX64:
return paddle.complex64
elif dtype == core.VarDesc.VarType.COMPLEX128:
return paddle.complex128
else:
raise ValueError("Not supported tensor dtype %s" % dtype)
if not hasattr(paddle, 'softmax'):
logger.debug("register user softmax to paddle, remove this when fixed!")
setattr(paddle, 'softmax', paddle.nn.functional.softmax)
@ -156,25 +115,6 @@ if not hasattr(paddle.Tensor, 'new_full'):
paddle.static.Variable.new_full = new_full
def eq(xs: paddle.Tensor, ys: Union[paddle.Tensor, float]) -> paddle.Tensor:
if convert_dtype_to_string(xs.dtype) == paddle.bool:
xs = xs.astype(paddle.int)
return xs.equal(ys)
if not hasattr(paddle.Tensor, 'eq'):
logger.debug(
"override eq of paddle.Tensor if exists or register, remove this when fixed!"
)
paddle.Tensor.eq = eq
paddle.static.Variable.eq = eq
if not hasattr(paddle, 'eq'):
logger.debug(
"override eq of paddle if exists or register, remove this when fixed!")
paddle.eq = eq
def contiguous(xs: paddle.Tensor) -> paddle.Tensor:
return xs

@ -318,7 +318,7 @@ class U2BaseModel(ASRInterface, nn.Layer):
dim=1) # (B*N, i+1)
# 2.6 Update end flag
end_flag = paddle.eq(hyps[:, -1], self.eos).view(-1, 1)
end_flag = paddle.equal(hyps[:, -1], self.eos).view(-1, 1)
# 3. Select best of best
scores = scores.view(batch_size, beam_size)

@ -13,8 +13,7 @@
# limitations under the License.
import paddle
from paddle import nn
from paddlespeech.s2t.modules.initializer import KaimingUniform
import math
"""
To align the initializers between paddle and torch,
the APIs below set a default initializer with a priority higher than the global initializer.
@ -82,10 +81,10 @@ class Linear(nn.Linear):
name=None):
if weight_attr is None:
if global_init_type == "kaiming_uniform":
weight_attr = paddle.ParamAttr(initializer=KaimingUniform())
weight_attr = paddle.ParamAttr(initializer=nn.initializer.KaimingUniform(fan_in=None, negative_slope=math.sqrt(5), nonlinearity='leaky_relu'))
if bias_attr is None:
if global_init_type == "kaiming_uniform":
bias_attr = paddle.ParamAttr(initializer=KaimingUniform())
bias_attr = paddle.ParamAttr(initializer=nn.initializer.KaimingUniform(fan_in=None, negative_slope=math.sqrt(5), nonlinearity='leaky_relu'))
super(Linear, self).__init__(in_features, out_features, weight_attr,
bias_attr, name)
@ -105,10 +104,10 @@ class Conv1D(nn.Conv1D):
data_format='NCL'):
if weight_attr is None:
if global_init_type == "kaiming_uniform":
weight_attr = paddle.ParamAttr(initializer=KaimingUniform())
weight_attr = paddle.ParamAttr(initializer=nn.initializer.KaimingUniform(fan_in=None, negative_slope=math.sqrt(5), nonlinearity='leaky_relu'))
if bias_attr is None:
if global_init_type == "kaiming_uniform":
bias_attr = paddle.ParamAttr(initializer=KaimingUniform())
bias_attr = paddle.ParamAttr(initializer=nn.initializer.KaimingUniform(fan_in=None, negative_slope=math.sqrt(5), nonlinearity='leaky_relu'))
super(Conv1D, self).__init__(
in_channels, out_channels, kernel_size, stride, padding, dilation,
groups, padding_mode, weight_attr, bias_attr, data_format)
@ -129,10 +128,10 @@ class Conv2D(nn.Conv2D):
data_format='NCHW'):
if weight_attr is None:
if global_init_type == "kaiming_uniform":
weight_attr = paddle.ParamAttr(initializer=KaimingUniform())
weight_attr = paddle.ParamAttr(initializer=nn.initializer.KaimingUniform(fan_in=None, negative_slope=math.sqrt(5), nonlinearity='leaky_relu'))
if bias_attr is None:
if global_init_type == "kaiming_uniform":
bias_attr = paddle.ParamAttr(initializer=KaimingUniform())
bias_attr = paddle.ParamAttr(initializer=nn.initializer.KaimingUniform(fan_in=None, negative_slope=math.sqrt(5), nonlinearity='leaky_relu'))
super(Conv2D, self).__init__(
in_channels, out_channels, kernel_size, stride, padding, dilation,
groups, padding_mode, weight_attr, bias_attr, data_format)
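The replacement relies on the stock `paddle.nn.initializer.KaimingUniform`, passing the same `negative_slope=sqrt(5)`/`leaky_relu` arguments that torch's default `kaiming_uniform_` uses for linear and conv layers. A standalone sketch:

```python
import math

import paddle
from paddle import nn

# torch-style kaiming uniform: fan_in inferred, a = sqrt(5), leaky_relu
init = nn.initializer.KaimingUniform(
    fan_in=None, negative_slope=math.sqrt(5), nonlinearity='leaky_relu')
linear = nn.Linear(2, 4, weight_attr=paddle.ParamAttr(initializer=init))
print(linear.weight.shape)  # [2, 4]
```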

@ -109,7 +109,7 @@ class MultiHeadedAttention(nn.Layer):
# 1. onnx(16/-1, -1/-1, 16/0)
# 2. jit (16/-1, -1/-1, 16/0, 16/4)
if paddle.shape(mask)[2] > 0: # time2 > 0
mask = mask.unsqueeze(1).eq(0) # (batch, 1, *, time2)
mask = mask.unsqueeze(1).equal(0) # (batch, 1, *, time2)
# for last chunk, time2 might be larger than scores.size(-1)
mask = mask[:, :, :, :paddle.shape(scores)[-1]]
scores = scores.masked_fill(mask, -float('inf'))
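The `.eq(0)` to `.equal(0)` switch replaces the monkey-patched alias (removed earlier in this PR) with the stock Paddle op. A self-contained sketch of the masking idea, using `paddle.where` in place of the repo's patched `masked_fill` helper:

```python
import paddle
import paddle.nn.functional as F

scores = paddle.randn([1, 1, 1, 4])            # (batch, head, time1, time2)
mask = paddle.to_tensor([[1, 1, 0, 0]])        # 0 marks padded frames
pad = mask.unsqueeze(1).unsqueeze(1).equal(0)  # (batch, 1, 1, time2), bool
scores = paddle.where(pad, paddle.full_like(scores, -float('inf')), scores)
attn = F.softmax(scores, axis=-1)              # padded positions get zero weight
```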
@ -321,4 +321,4 @@ class RelPositionMultiHeadedAttention(MultiHeadedAttention):
scores = (matrix_ac + matrix_bd) / math.sqrt(
self.d_k) # (batch, head, time1, time2)
return self.forward_attention(v, scores, mask), new_cache

@ -12,142 +12,6 @@
# See the License for the specific language governing permissions and
# limitations under the License.
import numpy as np
from paddle.fluid import framework
from paddle.fluid import unique_name
from paddle.fluid.core import VarDesc
from paddle.fluid.initializer import MSRAInitializer
__all__ = ['KaimingUniform']
class KaimingUniform(MSRAInitializer):
r"""Implements the Kaiming Uniform initializer
This class implements the weight initialization from the paper
`Delving Deep into Rectifiers: Surpassing Human-Level Performance on
ImageNet Classification <https://arxiv.org/abs/1502.01852>`_
by Kaiming He, Xiangyu Zhang, Shaoqing Ren and Jian Sun. This is a
robust initialization method that particularly considers the rectifier
nonlinearities.
In case of Uniform distribution, the range is [-x, x], where
.. math::
x = \sqrt{\frac{1.0}{fan\_in}}
In case of Normal distribution, the mean is 0 and the standard deviation
is
.. math::
\sqrt{\\frac{2.0}{fan\_in}}
Args:
fan_in (float32|None): fan_in for Kaiming uniform Initializer. If None, it is\
inferred from the variable. default is None.
Note:
It is recommended to set fan_in to None for most cases.
Examples:
.. code-block:: python
import paddle
import paddle.nn as nn
linear = nn.Linear(2,
4,
weight_attr=nn.initializer.KaimingUniform())
data = paddle.rand([30, 10, 2], dtype='float32')
res = linear(data)
"""
def __init__(self, fan_in=None):
super(KaimingUniform, self).__init__(
uniform=True, fan_in=fan_in, seed=0)
def __call__(self, var, block=None):
"""Initialize the input tensor with MSRA initialization.
Args:
var(Tensor): Tensor that needs to be initialized.
block(Block, optional): The block in which initialization ops
should be added. Used in static graph only, default None.
Returns:
The initialization op
"""
block = self._check_block(block)
assert isinstance(var, framework.Variable)
assert isinstance(block, framework.Block)
f_in, f_out = self._compute_fans(var)
# If fan_in is passed, use it
fan_in = f_in if self._fan_in is None else self._fan_in
if self._seed == 0:
self._seed = block.program.random_seed
# to be compatible of fp16 initalizers
if var.dtype == VarDesc.VarType.FP16 or (
var.dtype == VarDesc.VarType.BF16 and not self._uniform):
out_dtype = VarDesc.VarType.FP32
out_var = block.create_var(
name=unique_name.generate(
".".join(['masra_init', var.name, 'tmp'])),
shape=var.shape,
dtype=out_dtype,
type=VarDesc.VarType.LOD_TENSOR,
persistable=False)
else:
out_dtype = var.dtype
out_var = var
if self._uniform:
limit = np.sqrt(1.0 / float(fan_in))
op = block.append_op(
type="uniform_random",
inputs={},
outputs={"Out": out_var},
attrs={
"shape": out_var.shape,
"dtype": int(out_dtype),
"min": -limit,
"max": limit,
"seed": self._seed
},
stop_gradient=True)
else:
std = np.sqrt(2.0 / float(fan_in))
op = block.append_op(
type="gaussian_random",
outputs={"Out": out_var},
attrs={
"shape": out_var.shape,
"dtype": int(out_dtype),
"mean": 0.0,
"std": std,
"seed": self._seed
},
stop_gradient=True)
if var.dtype == VarDesc.VarType.FP16 or (
var.dtype == VarDesc.VarType.BF16 and not self._uniform):
block.append_op(
type="cast",
inputs={"X": out_var},
outputs={"Out": var},
attrs={"in_dtype": out_var.dtype,
"out_dtype": var.dtype})
if not framework.in_dygraph_mode():
var.op = op
return op
class DefaultInitializerContext(object):
"""

@ -718,6 +718,7 @@ class VectorClientExecutor(BaseExecutor):
logger.info(f"the input audio: {input}")
handler = VectorHttpHandler(server_ip=server_ip, port=port)
res = handler.run(input, audio_format, sample_rate)
logger.info(f"The spk embedding is: {res}")
return res
elif task == "score":
from paddlespeech.server.utils.audio_handler import VectorScoreHttpHandler

@ -192,12 +192,15 @@ class ACSEngine(BaseEngine):
# search for each word in self.word_list
offset = self.config.offset
# last time in time_stamp
max_ed = time_stamp[-1]['ed']
for w in self.word_list:
# search the w in asr_result and the index in asr_result
# https://docs.python.org/3/library/re.html#re.finditer
for m in re.finditer(w, asr_result):
# match start and end char index in timestamp
# https://docs.python.org/3/library/re.html#re.Match.start
start = max(time_stamp[m.start(0)]['bg'] - offset, 0)
end = min(time_stamp[m.end(0) - 1]['ed'] + offset, max_ed)
logger.debug(f'start: {start}, end: {end}')
acs_result.append({'w': w, 'bg': start, 'ed': end})
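The `max(..., 0)` / `min(..., max_ed)` clamps keep the padded window inside the utterance. A toy example of the same lookup with made-up per-character timestamps:

```python
import re

asr_result = "今天天气很好"
# one {bg, ed} entry (ms) per character, as produced upstream; values are made up
time_stamp = [{'bg': i * 100, 'ed': i * 100 + 100} for i in range(len(asr_result))]
offset = 50
max_ed = time_stamp[-1]['ed']

for m in re.finditer("天气", asr_result):
    start = max(time_stamp[m.start(0)]['bg'] - offset, 0)
    end = min(time_stamp[m.end(0) - 1]['ed'] + offset, max_ed)
    print({'w': '天气', 'bg': start, 'ed': end})  # {'w': '天气', 'bg': 150, 'ed': 450}
```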

@ -39,10 +39,10 @@ class OnlineCTCEndpoingOpt:
# rule1 times out after 5 seconds of silence, even if we decoded nothing.
rule1: OnlineCTCEndpointRule = OnlineCTCEndpointRule(False, 5000, 0)
# rule4 times out after 1.0 seconds of silence after decoding something,
# rule2 times out after 1.0 seconds of silence after decoding something,
# even if we did not reach a final-state at all.
rule2: OnlineCTCEndpointRule = OnlineCTCEndpointRule(True, 1000, 0)
# rule5 times out after the utterance is 20 seconds long, regardless of
# rule3 times out after the utterance is 20 seconds long, regardless of
# anything else.
rule3: OnlineCTCEndpointRule = OnlineCTCEndpointRule(False, 0, 20000)
@ -102,7 +102,8 @@ class OnlineCTCEndpoint:
assert self.num_frames_decoded >= self.trailing_silence_frames
assert self.frame_shift_in_ms > 0
decoding_something = (self.num_frames_decoded > self.trailing_silence_frames) and decoding_something
utterance_length = self.num_frames_decoded * self.frame_shift_in_ms
trailing_silence = self.trailing_silence_frames * self.frame_shift_in_ms
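For intuition, both quantities are just frame counts converted to milliseconds, and the rules above compare `trailing_silence` against their thresholds. A worked example with made-up numbers:

```python
# made-up values: 10 ms frame shift, 600 decoded frames, 150 trailing silence frames
frame_shift_in_ms = 10
num_frames_decoded = 600
trailing_silence_frames = 150

utterance_length = num_frames_decoded * frame_shift_in_ms       # 6000 ms
trailing_silence = trailing_silence_frames * frame_shift_in_ms  # 1500 ms

decoding_something = num_frames_decoded > trailing_silence_frames  # True
# rule2 (requires decoding, 1000 ms min trailing silence) would trigger here:
print(decoding_something and trailing_silence >= 1000)  # True
```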

@ -83,11 +83,11 @@ class CTCPrefixBeamSearch:
# cur_hyps: (prefix, (blank_ending_score, none_blank_ending_score))
# 0. blank_ending_score,
# 1. none_blank_ending_score,
# 2. viterbi_blank ending,
# 3. viterbi_non_blank,
# 2. viterbi_blank ending score,
# 3. viterbi_non_blank score,
# 4. current_token_prob,
# 5. times_viterbi_blank,
# 6. times_viterbi_non_blank
# 5. times_viterbi_blank, times_b
# 6. times_viterbi_non_blank, times_nb
if self.cur_hyps is None:
self.cur_hyps = [(tuple(), (0.0, -float('inf'), 0.0, 0.0,
-float('inf'), [], []))]
@ -106,69 +106,69 @@ class CTCPrefixBeamSearch:
for s in top_k_index:
s = s.item()
ps = logp[s].item()
for prefix, (pb, pnb, v_b_s, v_nb_s, cur_token_prob, times_s,
times_ns) in self.cur_hyps:
for prefix, (pb, pnb, v_b_s, v_nb_s, cur_token_prob, times_b,
times_nb) in self.cur_hyps:
last = prefix[-1] if len(prefix) > 0 else None
if s == blank_id: # blank
n_pb, n_pnb, n_v_s, n_v_ns, n_cur_token_prob, n_times_s, n_times_ns = next_hyps[
n_pb, n_pnb, n_v_b, n_v_nb, n_cur_token_prob, n_times_b, n_times_nb = next_hyps[
prefix]
n_pb = log_add([n_pb, pb + ps, pnb + ps])
pre_times = times_s if v_b_s > v_nb_s else times_ns
n_times_s = copy.deepcopy(pre_times)
pre_times = times_b if v_b_s > v_nb_s else times_nb
n_times_b = copy.deepcopy(pre_times)
viterbi_score = v_b_s if v_b_s > v_nb_s else v_nb_s
n_v_s = viterbi_score + ps
next_hyps[prefix] = (n_pb, n_pnb, n_v_s, n_v_ns,
n_cur_token_prob, n_times_s,
n_times_ns)
n_v_b = viterbi_score + ps
next_hyps[prefix] = (n_pb, n_pnb, n_v_b, n_v_nb,
n_cur_token_prob, n_times_b,
n_times_nb)
elif s == last:
# Update *ss -> *s;
# case1: *a + a => *a
n_pb, n_pnb, n_v_s, n_v_ns, n_cur_token_prob, n_times_s, n_times_ns = next_hyps[
n_pb, n_pnb, n_v_b, n_v_nb, n_cur_token_prob, n_times_b, n_times_nb = next_hyps[
prefix]
n_pnb = log_add([n_pnb, pnb + ps])
if n_v_ns < v_nb_s + ps:
n_v_ns = v_nb_s + ps
if n_v_nb < v_nb_s + ps:
n_v_nb = v_nb_s + ps
if n_cur_token_prob < ps:
n_cur_token_prob = ps
n_times_ns = copy.deepcopy(times_ns)
n_times_ns[
n_times_nb = copy.deepcopy(times_nb)
n_times_nb[
-1] = self.abs_time_step  # note: the absolute time step must be reused here
next_hyps[prefix] = (n_pb, n_pnb, n_v_s, n_v_ns,
n_cur_token_prob, n_times_s,
n_times_ns)
next_hyps[prefix] = (n_pb, n_pnb, n_v_b, n_v_nb,
n_cur_token_prob, n_times_b,
n_times_nb)
# Update *s-s -> *ss, - is for blank
# Case 2: *aε + a => *aa
n_prefix = prefix + (s, )
n_pb, n_pnb, n_v_s, n_v_ns, n_cur_token_prob, n_times_s, n_times_ns = next_hyps[
n_pb, n_pnb, n_v_b, n_v_nb, n_cur_token_prob, n_times_b, n_times_nb = next_hyps[
n_prefix]
if n_v_ns < v_b_s + ps:
n_v_ns = v_b_s + ps
if n_v_nb < v_b_s + ps:
n_v_nb = v_b_s + ps
n_cur_token_prob = ps
n_times_ns = copy.deepcopy(times_s)
n_times_ns.append(self.abs_time_step)
n_times_nb = copy.deepcopy(times_b)
n_times_nb.append(self.abs_time_step)
n_pnb = log_add([n_pnb, pb + ps])
next_hyps[n_prefix] = (n_pb, n_pnb, n_v_s, n_v_ns,
n_cur_token_prob, n_times_s,
n_times_ns)
next_hyps[n_prefix] = (n_pb, n_pnb, n_v_b, n_v_nb,
n_cur_token_prob, n_times_b,
n_times_nb)
else:
# Case 3: *a + b => *ab, *aε + b => *ab
n_prefix = prefix + (s, )
n_pb, n_pnb, n_v_s, n_v_ns, n_cur_token_prob, n_times_s, n_times_ns = next_hyps[
n_pb, n_pnb, n_v_b, n_v_nb, n_cur_token_prob, n_times_b, n_times_nb = next_hyps[
n_prefix]
viterbi_score = v_b_s if v_b_s > v_nb_s else v_nb_s
pre_times = times_s if v_b_s > v_nb_s else times_ns
if n_v_ns < viterbi_score + ps:
n_v_ns = viterbi_score + ps
pre_times = times_b if v_b_s > v_nb_s else times_nb
if n_v_nb < viterbi_score + ps:
n_v_nb = viterbi_score + ps
n_cur_token_prob = ps
n_times_ns = copy.deepcopy(pre_times)
n_times_ns.append(self.abs_time_step)
n_times_nb = copy.deepcopy(pre_times)
n_times_nb.append(self.abs_time_step)
n_pnb = log_add([n_pnb, pb + ps, pnb + ps])
next_hyps[n_prefix] = (n_pb, n_pnb, n_v_s, n_v_ns,
n_cur_token_prob, n_times_s,
n_times_ns)
next_hyps[n_prefix] = (n_pb, n_pnb, n_v_b, n_v_nb,
n_cur_token_prob, n_times_b,
n_times_nb)
# 2.2 Second beam prune
next_hyps = sorted(
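`log_add` above merges path probabilities in log space; the repo's helper lives in `paddlespeech.s2t.utils.utility`, and a standard numerically stable version looks like this:

```python
import math
from typing import List

def log_add(args: List[float]) -> float:
    """Stable log(sum(exp(a) for a in args))."""
    if all(a == -float('inf') for a in args):
        return -float('inf')
    a_max = max(args)
    return a_max + math.log(sum(math.exp(a - a_max) for a in args))

print(log_add([math.log(0.2), math.log(0.3)]))  # ~= log(0.5)
```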

@ -30,7 +30,9 @@ def get_sess(model_path: Optional[os.PathLike]=None, sess_conf: dict=None):
# "gpu:0"
providers = ['CPUExecutionProvider']
if "gpu" in sess_conf.get("device", ""):
providers = ['CUDAExecutionProvider']
device_id = int(sess_conf["device"].split(":")[1])
providers = [('CUDAExecutionProvider', {'device_id': device_id})]
# fastspeech2/mb_melgan can't use trt now!
if sess_conf.get("use_trt", 0):
providers = ['TensorrtExecutionProvider']
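With this change the CUDA provider is pinned to the GPU parsed from `sess_conf['device']` instead of defaulting to device 0. A minimal sketch of how such a providers list is consumed by ONNX Runtime (the model path is a placeholder):

```python
import onnxruntime as ort

sess_conf = {"device": "gpu:1", "use_trt": 0, "cpu_threads": 4}

providers = ['CPUExecutionProvider']
if "gpu" in sess_conf.get("device", ""):
    device_id = int(sess_conf["device"].split(":")[1])
    providers = [('CUDAExecutionProvider', {'device_id': device_id})]

sess_options = ort.SessionOptions()
sess_options.intra_op_num_threads = sess_conf["cpu_threads"]
sess = ort.InferenceSession("model.onnx", sess_options, providers=providers)
```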

@ -29,6 +29,7 @@ from yacs.config import CfgNode
from paddlespeech.t2s.datasets.data_table import DataTable
from paddlespeech.t2s.frontend import English
from paddlespeech.t2s.frontend.mix_frontend import MixFrontend
from paddlespeech.t2s.frontend.zh_frontend import Frontend
from paddlespeech.t2s.modules.normalizer import ZScore
from paddlespeech.utils.dynamic_import import dynamic_import
@ -98,6 +99,8 @@ def get_sentences(text_file: Optional[os.PathLike], lang: str='zh'):
sentence = "".join(items[1:])
elif lang == 'en':
sentence = " ".join(items[1:])
elif lang == 'mix':
sentence = " ".join(items[1:])
sentences.append((utt_id, sentence))
return sentences
@ -111,7 +114,8 @@ def get_test_dataset(test_metadata: List[Dict[str, Any]],
am_dataset = am[am.rindex('_') + 1:]
if am_name == 'fastspeech2':
fields = ["utt_id", "text"]
if am_dataset in {"aishell3", "vctk"} and speaker_dict is not None:
if am_dataset in {"aishell3", "vctk",
"mix"} and speaker_dict is not None:
print("multiple speaker fastspeech2!")
fields += ["spk_id"]
elif voice_cloning:
@ -140,6 +144,10 @@ def get_frontend(lang: str='zh',
phone_vocab_path=phones_dict, tone_vocab_path=tones_dict)
elif lang == 'en':
frontend = English(phone_vocab_path=phones_dict)
elif lang == 'mix':
frontend = MixFrontend(
phone_vocab_path=phones_dict, tone_vocab_path=tones_dict)
else:
print("wrong lang!")
print("frontend done!")
@ -341,8 +349,12 @@ def get_am_output(
input_ids = frontend.get_input_ids(
input, merge_sentences=merge_sentences)
phone_ids = input_ids["phone_ids"]
elif lang == 'mix':
input_ids = frontend.get_input_ids(
input, merge_sentences=merge_sentences)
phone_ids = input_ids["phone_ids"]
else:
print("lang should in {'zh', 'en'}!")
print("lang should in {'zh', 'en', 'mix'}!")
if get_tone_ids:
tone_ids = input_ids["tone_ids"]

@ -113,8 +113,12 @@ def evaluate(args):
input_ids = frontend.get_input_ids(
sentence, merge_sentences=merge_sentences)
phone_ids = input_ids["phone_ids"]
elif args.lang == 'mix':
input_ids = frontend.get_input_ids(
sentence, merge_sentences=merge_sentences)
phone_ids = input_ids["phone_ids"]
else:
print("lang should in {'zh', 'en'}!")
print("lang should in {'zh', 'en', 'mix'}!")
with paddle.no_grad():
flags = 0
for i in range(len(phone_ids)):
@ -122,7 +126,7 @@ def evaluate(args):
# acoustic model
if am_name == 'fastspeech2':
# multi speaker
if am_dataset in {"aishell3", "vctk"}:
if am_dataset in {"aishell3", "vctk", "mix"}:
spk_id = paddle.to_tensor(args.spk_id)
mel = am_inference(part_phone_ids, spk_id)
else:
@ -170,7 +174,7 @@ def parse_args():
choices=[
'speedyspeech_csmsc', 'speedyspeech_aishell3', 'fastspeech2_csmsc',
'fastspeech2_ljspeech', 'fastspeech2_aishell3', 'fastspeech2_vctk',
'tacotron2_csmsc', 'tacotron2_ljspeech'
'tacotron2_csmsc', 'tacotron2_ljspeech', 'fastspeech2_mix'
],
help='Choose acoustic model type of tts task.')
parser.add_argument(
@ -231,7 +235,7 @@ def parse_args():
'--lang',
type=str,
default='zh',
help='Choose model language. zh or en')
help='Choose model language. zh or en or mix')
parser.add_argument(
"--inference_dir",

@ -0,0 +1,179 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import re
from typing import Dict
from typing import List
import paddle
from paddlespeech.t2s.frontend import English
from paddlespeech.t2s.frontend.zh_frontend import Frontend
class MixFrontend():
def __init__(self,
g2p_model="pypinyin",
phone_vocab_path=None,
tone_vocab_path=None):
self.zh_frontend = Frontend(
phone_vocab_path=phone_vocab_path, tone_vocab_path=tone_vocab_path)
self.en_frontend = English(phone_vocab_path=phone_vocab_path)
self.SENTENCE_SPLITOR = re.compile(r'([:、,;。?!,;?!][”’]?)')
self.sp_id = self.zh_frontend.vocab_phones["sp"]
self.sp_id_tensor = paddle.to_tensor([self.sp_id])
def is_chinese(self, char):
if char >= '\u4e00' and char <= '\u9fa5':
return True
else:
return False
def is_alphabet(self, char):
if (char >= '\u0041' and char <= '\u005a') or (char >= '\u0061' and
char <= '\u007a'):
return True
else:
return False
def is_number(self, char):
if char >= '\u0030' and char <= '\u0039':
return True
else:
return False
def is_other(self, char):
if not (self.is_chinese(char) or self.is_number(char) or
self.is_alphabet(char)):
return True
else:
return False
def _split(self, text: str) -> List[str]:
text = re.sub(r'[《》【】<=>{}()#&@“”^_|…\\]', '', text)
text = self.SENTENCE_SPLITOR.sub(r'\1\n', text)
text = text.strip()
sentences = [sentence.strip() for sentence in re.split(r'\n+', text)]
return sentences
def _distinguish(self, text: str) -> List[str]:
# sentence --> [ch_part, en_part, ch_part, ...]
segments = []
types = []
flag = 0
temp_seg = ""
temp_lang = ""
# Determine the type of each character. type: blank, chinese, alphabet, number, unk.
for ch in text:
if self.is_chinese(ch):
types.append("zh")
elif self.is_alphabet(ch):
types.append("en")
elif ch == " ":
types.append("blank")
elif self.is_number(ch):
types.append("num")
else:
types.append("unk")
assert len(types) == len(text)
for i in range(len(types)):
# find the first char of the seg
if flag == 0:
if types[i] != "unk" and types[i] != "blank":
temp_seg += text[i]
temp_lang = types[i]
flag = 1
else:
if types[i] == temp_lang or types[i] == "num":
temp_seg += text[i]
elif temp_lang == "num" and types[i] != "unk":
temp_seg += text[i]
if types[i] == "zh" or types[i] == "en":
temp_lang = types[i]
elif temp_lang == "en" and types[i] == "blank":
temp_seg += text[i]
elif types[i] == "unk":
pass
else:
segments.append((temp_seg, temp_lang))
if types[i] != "unk" and types[i] != "blank":
temp_seg = text[i]
temp_lang = types[i]
flag = 1
else:
flag = 0
temp_seg = ""
temp_lang = ""
segments.append((temp_seg, temp_lang))
return segments
def get_input_ids(self,
sentence: str,
merge_sentences: bool=True,
get_tone_ids: bool=False,
add_sp: bool=True) -> Dict[str, List[paddle.Tensor]]:
sentences = self._split(sentence)
phones_list = []
result = {}
for text in sentences:
phones_seg = []
segments = self._distinguish(text)
for seg in segments:
content = seg[0]
lang = seg[1]
if lang == "zh":
input_ids = self.zh_frontend.get_input_ids(
content,
merge_sentences=True,
get_tone_ids=get_tone_ids)
elif lang == "en":
input_ids = self.en_frontend.get_input_ids(
content, merge_sentences=True)
phones_seg.append(input_ids["phone_ids"][0])
if add_sp:
phones_seg.append(self.sp_id_tensor)
phones = paddle.concat(phones_seg)
phones_list.append(phones)
if merge_sentences:
merge_list = paddle.concat(phones_list)
# rm the last 'sp' to avoid the noise at the end
# cause in the training data, no 'sp' in the end
if merge_list[-1] == self.sp_id_tensor:
merge_list = merge_list[:-1]
phones_list = []
phones_list.append(merge_list)
result["phone_ids"] = phones_list
return result
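A usage sketch for the new frontend (the vocab path is a placeholder; it normally comes from the downloaded fastspeech2_mix resources):

```python
from paddlespeech.t2s.frontend.mix_frontend import MixFrontend

frontend = MixFrontend(phone_vocab_path="phone_id_map.txt")
result = frontend.get_input_ids("我喜欢 deep learning 呀", merge_sentences=True)
print(result["phone_ids"])  # a list holding one concatenated phone-id tensor
```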

@ -50,6 +50,7 @@ base = [
"paddlespeech_feat",
"Pillow>=9.0.0"
"praatio==5.0.0",
"protobuf>=3.1.0, <=3.20.0",
"pypinyin",
"pypinyin-dict",
"python-dateutil",
