Merge branch 'develop' into v0.3

pull/1784/head · commit f72cbc9b6d · Honei, committed via GitHub

@@ -15,12 +15,21 @@ You can choose one way from medium and hard to install paddlespeech.

### 2. Prepare config File
The configuration file can be found in `conf/tts_online_application.yaml`.
- `protocol` indicates the network protocol used by the streaming TTS service. Currently, both http and websocket are supported.
- `engine_list` indicates the speech engine that will be included in the service to be started, in the format of `<speech task>_<engine type>`.
  - This demo mainly introduces the streaming speech synthesis service, so the speech task should be set to `tts`.
  - The engine type supports two forms: **online** and **online-onnx**. `online` indicates an engine that uses Python for dynamic-graph inference; `online-onnx` indicates an engine that uses onnxruntime for inference. The inference speed of online-onnx is faster.
- Streaming TTS engine AM model support: **fastspeech2** and **fastspeech2_cnndecoder**; Voc model support: **hifigan** and **mb_melgan**.
- In streaming AM inference, one chunk of data is inferred at a time to achieve a streaming effect (see the sketch after this list). `am_block` indicates the number of valid frames in the chunk, and `am_pad` indicates the number of frames added before and after am_block in a chunk. am_pad exists to eliminate the errors introduced by streaming inference, so that streaming does not degrade the quality of the synthesized audio.
  - fastspeech2 does not support streaming AM inference, so am_pad and am_block have no effect on it.
  - fastspeech2_cnndecoder supports streaming inference. When am_pad=12, the audio synthesized by streaming inference is consistent with the non-streaming result.
- In streaming Voc inference, one chunk of data is inferred at a time to achieve a streaming effect. `voc_block` indicates the number of valid frames in the chunk, and `voc_pad` indicates the number of frames added before and after voc_block in a chunk. voc_pad exists to eliminate the errors introduced by streaming inference, so that streaming does not degrade the quality of the synthesized audio.
  - Both hifigan and mb_melgan support streaming Voc inference.
  - When the Voc model is mb_melgan and voc_pad=14, the audio synthesized by streaming inference is consistent with the non-streaming result; voc_pad can be set as low as 7 with no audible artifacts, while below 7 the audio sounds abnormal.
  - When the Voc model is hifigan and voc_pad=20, the audio synthesized by streaming inference is consistent with the non-streaming result; with voc_pad=14, the audio has no audible artifacts.
- Inference speed: mb_melgan > hifigan; audio quality: mb_melgan < hifigan.
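
The block/pad mechanism described above is easiest to see in code. Below is a minimal, hypothetical sketch (not PaddleSpeech's actual implementation; `infer` stands in for one AM or Voc forward pass and is assumed to emit one output row per input frame): each chunk is widened by `pad` frames of context on both sides, inferred, and trimmed back to its `block` valid frames.

```python
import numpy as np

def stream_chunks(frames, block, pad, infer):
    """Yield streaming outputs chunk by chunk.

    frames: (T, D) array of input frames.
    block:  valid frames per chunk (am_block / voc_block).
    pad:    context frames added on each side (am_pad / voc_pad).
    infer:  callable running one forward pass on a padded chunk.
    """
    total = frames.shape[0]
    for start in range(0, total, block):
        end = min(start + block, total)
        left = max(0, start - pad)     # widen by pad frames of left context
        right = min(total, end + pad)  # widen by pad frames of right context
        out = infer(frames[left:right])
        # keep only the rows corresponding to the valid block
        yield out[start - left:start - left + (end - start)]

# With an identity "model", trimming reassembles the input exactly.
frames = np.random.rand(100, 80).astype('float32')
chunks = list(stream_chunks(frames, block=42, pad=12, infer=lambda x: x))
assert np.concatenate(chunks).shape == frames.shape
```
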
### 3. Server Usage

@@ -16,11 +16,19 @@

### 2. Prepare the config file
The configuration file can be found in `conf/tts_online_application.yaml`.
- `protocol` indicates the network protocol used by the streaming TTS service; currently http and websocket are supported.
- `engine_list` indicates the speech engines the started service will include, in the format <speech task>_<engine type>.
  - This demo mainly introduces the streaming speech synthesis service, so the speech task should be set to tts.
  - Two engine types are currently supported: **online**, an engine that uses Python dynamic-graph inference, and **online-onnx**, an engine that uses onnxruntime for inference; online-onnx is faster.
- The AM models supported by the streaming TTS engine are fastspeech2 and fastspeech2_cnndecoder; the supported Voc models are hifigan and mb_melgan.
- In streaming AM inference, one chunk of data is inferred at a time to achieve a streaming effect. `am_block` is the number of valid frames in a chunk, and `am_pad` is the number of frames added before and after am_block in a chunk. am_pad exists to eliminate the errors introduced by streaming inference, so that streaming does not degrade the quality of the synthesized audio.
  - fastspeech2 does not support streaming AM inference, so am_pad and am_block have no effect on it.
  - fastspeech2_cnndecoder supports streaming inference; when am_pad=12, streaming synthesis is consistent with non-streaming synthesis.
- In streaming Voc inference, one chunk of data is inferred at a time to achieve a streaming effect. `voc_block` is the number of valid frames in a chunk, and `voc_pad` is the number of frames added before and after voc_block in a chunk. voc_pad exists to eliminate the errors introduced by streaming inference, so that streaming does not degrade the quality of the synthesized audio.
  - Both hifigan and mb_melgan support streaming Voc inference.
  - When the Voc model is mb_melgan and voc_pad=14, streaming synthesis is consistent with non-streaming synthesis; voc_pad can be set as low as 7 with no audible artifacts, while below 7 the synthesized audio sounds abnormal.
  - When the Voc model is hifigan and voc_pad=20, streaming synthesis is consistent with non-streaming synthesis; with voc_pad=14 there are no audible artifacts.
- Inference speed: mb_melgan > hifigan; audio quality: mb_melgan < hifigan.

### 3. Server Usage
- Command line (recommended)

@@ -1,4 +1,4 @@
# This is the parameter configuration file for the streaming tts server.

#################################################################################
#                                 SERVER SETTING                                #
@@ -7,8 +7,8 @@ host: 127.0.0.1
port: 8092

# The task format in the engine_list is: <speech task>_<engine type>
# engine_list choices = ['tts_online', 'tts_online-onnx']; the inference speed of tts_online-onnx is faster than tts_online.
# protocol choices = ['websocket', 'http']
protocol: 'http'
engine_list: ['tts_online-onnx']
@@ -20,7 +20,8 @@ engine_list: ['tts_online-onnx']
################################### TTS #########################################
################### speech task: tts; engine_type: online #######################
tts_online:
    # am (acoustic model) choices=['fastspeech2_csmsc', 'fastspeech2_cnndecoder_csmsc']
    # fastspeech2_cnndecoder_csmsc supports streaming am inference.
    am: 'fastspeech2_csmsc'
    am_config:
    am_ckpt:
@@ -31,6 +32,7 @@ tts_online:
    spk_id: 0

    # voc (vocoder) choices=['mb_melgan_csmsc, hifigan_csmsc']
    # Both mb_melgan_csmsc and hifigan_csmsc support streaming voc inference.
    voc: 'mb_melgan_csmsc'
    voc_config:
    voc_ckpt:
@@ -39,8 +41,13 @@ tts_online:
    # others
    lang: 'zh'
    device: 'cpu' # set 'gpu:id' or 'cpu'
    # am_block and am_pad are only used by fastspeech2_cnndecoder_csmsc for streaming am inference;
    # when am_pad is set to 12, streaming synthetic audio is the same as non-streaming synthetic audio.
    am_block: 42
    am_pad: 12
    # voc_block and voc_pad are used by the voc model for streaming voc inference;
    # when the voc model is mb_melgan_csmsc and voc_pad is set to 14, streaming synthetic audio is the same as non-streaming synthetic audio; the minimum voc_pad is 7, at which the streaming synthetic audio still sounds normal.
    # when the voc model is hifigan_csmsc and voc_pad is set to 20, streaming synthetic audio is the same as non-streaming synthetic audio; with voc_pad set to 14, streaming synthetic audio sounds normal.
    voc_block: 14
    voc_pad: 14
@@ -53,7 +60,8 @@ tts_online:
################################### TTS #########################################
################### speech task: tts; engine_type: online-onnx #######################
tts_online-onnx:
    # am (acoustic model) choices=['fastspeech2_csmsc_onnx', 'fastspeech2_cnndecoder_csmsc_onnx']
    # fastspeech2_cnndecoder_csmsc_onnx supports streaming am inference.
    am: 'fastspeech2_cnndecoder_csmsc_onnx'
    # am_ckpt is a list; if am is fastspeech2_cnndecoder_csmsc_onnx, am_ckpt = [encoder model, decoder model, postnet model];
    # if am is fastspeech2_csmsc_onnx, am_ckpt = [ckpt model];
@@ -70,6 +78,7 @@ tts_online-onnx:
    cpu_threads: 4

    # voc (vocoder) choices=['mb_melgan_csmsc_onnx, hifigan_csmsc_onnx']
    # Both mb_melgan_csmsc_onnx and hifigan_csmsc_onnx support streaming voc inference.
    voc: 'hifigan_csmsc_onnx'
    voc_ckpt:
    voc_sample_rate: 24000
@@ -80,9 +89,15 @@ tts_online-onnx:
    # others
    lang: 'zh'
    # am_block and am_pad are only used by fastspeech2_cnndecoder_csmsc_onnx for streaming am inference;
    # when am_pad is set to 12, streaming synthetic audio is the same as non-streaming synthetic audio.
    am_block: 42
    am_pad: 12
    # voc_block and voc_pad are used by the voc model for streaming voc inference;
    # when the voc model is mb_melgan_csmsc_onnx and voc_pad is set to 14, streaming synthetic audio is the same as non-streaming synthetic audio; the minimum voc_pad is 7, at which the streaming synthetic audio still sounds normal.
    # when the voc model is hifigan_csmsc_onnx and voc_pad is set to 20, streaming synthetic audio is the same as non-streaming synthetic audio; with voc_pad set to 14, streaming synthetic audio sounds normal.
    voc_block: 14
    voc_pad: 14
    # voc_upsample should be the same as n_shift in the voc config.
    voc_upsample: 300
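
As a rough sanity check on the values above (a sketch, not part of the config; it assumes `voc_upsample` equals the vocoder's `n_shift`, as the comment states), each streaming vocoder chunk corresponds to a fixed slice of output audio:

```python
# Rough latency arithmetic for the streaming vocoder settings above.
voc_sample_rate = 24000  # Hz
voc_upsample = 300       # output samples per vocoder frame (n_shift)
voc_block = 14           # valid frames per streaming chunk

samples_per_chunk = voc_block * voc_upsample             # 4200 samples
seconds_per_chunk = samples_per_chunk / voc_sample_rate  # 0.175 s
print(f"{samples_per_chunk} samples = {seconds_per_chunk * 1000:.0f} ms of audio per chunk")
```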

@@ -6,7 +6,7 @@

### Speech Recognition Model

Acoustic Model | Training Data | Token-based | Size | Descriptions | CER | WER | Hours of speech | Example Link
:-------------:| :------------:| :-----: | -----: | :-----: |:-----:| :-----: | :-----: | :-----:
[Ds2 Online Aishell ASR0 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_online_aishell_fbank161_ckpt_0.2.0.model.tar.gz) | Aishell Dataset | Char-based | 479 MB | 2 Conv + 5 LSTM layers with only forward direction | 0.0718 | - | 151 h | [D2 Online Aishell ASR0](../../examples/aishell/asr0)
[Ds2 Offline Aishell ASR0 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_aishell_ckpt_0.1.1.model.tar.gz) | Aishell Dataset | Char-based | 306 MB | 2 Conv + 3 bidirectional GRU layers | 0.064 | - | 151 h | [Ds2 Offline Aishell ASR0](../../examples/aishell/asr0)
[Conformer Online Aishell ASR1 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr1/asr1_chunk_conformer_aishell_ckpt_0.2.0.model.tar.gz) | Aishell Dataset | Char-based | 189 MB | Encoder: Conformer, Decoder: Transformer, Decoding method: Attention rescoring | 0.0544 | - | 151 h | [Conformer Online Aishell ASR1](../../examples/aishell/asr1)
[Conformer Offline Aishell ASR1 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr1/asr1_conformer_aishell_ckpt_0.1.2.model.tar.gz) | Aishell Dataset | Char-based | 189 MB | Encoder: Conformer, Decoder: Transformer, Decoding method: Attention rescoring | 0.0464 | - | 151 h | [Conformer Offline Aishell ASR1](../../examples/aishell/asr1)

@@ -4,6 +4,7 @@

| Model | Number of Params | Release | Config | Test set | Valid Loss | CER |
| --- | --- | --- | --- | --- | --- | --- |
| DeepSpeech2 | 45.18M | r0.2.0 | conf/deepspeech2_online.yaml + spec aug + fbank161 | test | 7.679287910461426 | 0.0718 |
| DeepSpeech2 | 45.18M | r0.2.0 | conf/deepspeech2_online.yaml + spec aug | test | 7.708217620849609 | 0.078 |
| DeepSpeech2 | 45.18M | v2.2.0 | conf/deepspeech2_online.yaml + spec aug | test | 7.994938373565674 | 0.080 |

@@ -1,39 +1,49 @@
# https://yaml.org/type/float.html
###########################################
#                  Data                   #
###########################################
dataset: 'paddleaudio.datasets:HeySnips'
data_dir: '/PATH/TO/DATA/hey_snips_research_6k_en_train_eval_clean_ter'

############################################
#           Network Architecture           #
############################################
backbone: 'paddlespeech.kws.models:MDTC'
num_keywords: 1
stack_num: 3
stack_size: 4
in_channels: 80
res_channels: 32
kernel_size: 5

###########################################
#                 Feature                 #
###########################################
feat_type: 'kaldi_fbank'
sample_rate: 16000
frame_shift: 10
frame_length: 25
n_mels: 80

###########################################
#                 Training                #
###########################################
epochs: 100
num_workers: 16
batch_size: 100
checkpoint_dir: './checkpoint'
save_freq: 10
log_freq: 10
learning_rate: 0.001
weight_decay: 0.00005
grad_clip: 5.0

###########################################
#                 Scoring                 #
###########################################
batch_size: 100
num_workers: 16
checkpoint: './checkpoint/epoch_100/model.pdparams'
score_file: './scores.txt'
stats_file: './stats.0.txt'
img_file: './det.png'
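
With the sections flattened into top-level keys, the refactored KWS entry points below read this file through yacs instead of raw `yaml.safe_load`. A minimal sketch of that pattern (assuming yacs is installed and the file above is saved as `conf/mdtc.yaml`):

```python
from yacs.config import CfgNode

# new_allowed=True lets arbitrary keys from the YAML file be merged in;
# the https://yaml.org/type/float.html comment above flags YAML float parsing quirks.
config = CfgNode(new_allowed=True)
config.merge_from_file('conf/mdtc.yaml')

print(config['feat_type'], config['n_mels'])  # kaldi_fbank 80
```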

@@ -1,2 +1,25 @@
#!/bin/bash
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

if [ $# != 3 ];then
    echo "usage: ${0} keyword stats_file img_file"
    exit -1
fi

keyword=$1
stats_file=$2
img_file=$3

python3 ${BIN_DIR}/plot_det_curve.py --keyword_label ${keyword} --stats_file ${stats_file} --img_file ${img_file}

@@ -1,5 +1,27 @@
#!/bin/bash
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

if [ $# != 4 ];then
    echo "usage: ${0} config_path checkpoint score_file stats_file"
    exit -1
fi

cfg_path=$1
ckpt=$2
score_file=$3
stats_file=$4

python3 ${BIN_DIR}/score.py --config ${cfg_path} --ckpt ${ckpt} --score_file ${score_file} || exit -1

python3 ${BIN_DIR}/compute_det.py --config ${cfg_path} --score_file ${score_file} --stats_file ${stats_file} || exit -1

@@ -1,13 +1,31 @@
#!/bin/bash
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

if [ $# != 2 ];then
    echo "usage: ${0} num_gpus config_path"
    exit -1
fi

ngpu=$1
cfg_path=$2

if [ ${ngpu} -gt 0 ]; then
    python3 -m paddle.distributed.launch --gpus $CUDA_VISIBLE_DEVICES ${BIN_DIR}/train.py \
    --config ${cfg_path}
else
    echo "set CUDA_VISIBLE_DEVICES to enable multi-gpu training."
    python3 ${BIN_DIR}/train.py \
    --config ${cfg_path}
fi

@@ -32,10 +32,16 @@ if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
    ./local/train.sh ${ngpu} ${cfg_path} || exit -1
fi

ckpt=./checkpoint/epoch_100/model.pdparams
score_file=./scores.txt
stats_file=./stats.0.txt
img_file=./det.png

if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
    ./local/score.sh ${cfg_path} ${ckpt} ${score_file} ${stats_file} || exit -1
fi

keyword=HeySnips
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
    ./local/plot.sh ${keyword} ${stats_file} ${img_file} || exit -1
fi

@@ -73,9 +73,9 @@ pretrained_models = {
    },
    "deepspeech2online_aishell-zh-16k": {
        'url':
        'https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_online_aishell_fbank161_ckpt_0.2.0.model.tar.gz',
        'md5':
        'd314960e83cc10dcfa6b04269f3054d4',
        'cfg_path':
        'model.yaml',
        'ckpt_path':

@@ -22,6 +22,8 @@ from typing import Union

import paddle
import soundfile
from paddleaudio.backends import load as load_audio
from paddleaudio.compliance.librosa import melspectrogram
from yacs.config import CfgNode

from ..executor import BaseExecutor
@@ -30,8 +32,6 @@ from ..utils import cli_register
from ..utils import stats_wrapper
from .pretrained_models import model_alias
from .pretrained_models import pretrained_models
from paddlespeech.s2t.utils.dynamic_import import dynamic_import
from paddlespeech.vector.io.batch import feature_normalize
from paddlespeech.vector.modules.sid_model import SpeakerIdetification

@@ -12,24 +12,15 @@
# See the License for the specific language governing permissions and
# limitations under the License.
# Modified from wekws(https://github.com/wenet-e2e/wekws)
import os

import paddle
from tqdm import tqdm
from yacs.config import CfgNode

from paddlespeech.s2t.training.cli import default_argument_parser
from paddlespeech.s2t.utils.dynamic_import import dynamic_import


def load_label_and_score(keyword_index: int,
                         ds: paddle.io.Dataset,
@@ -61,26 +52,52 @@ def load_label_and_score(keyword_index: int,

if __name__ == '__main__':
    parser = default_argument_parser()
    parser.add_argument(
        '--keyword_index', type=int, default=0, help='keyword index')
    parser.add_argument(
        '--step',
        type=float,
        default=0.01,
        help='threshold step of trigger score')
    parser.add_argument(
        '--window_shift',
        type=int,
        default=50,
        help='window_shift is used to skip the frames after triggered')
    parser.add_argument(
        "--score_file",
        type=str,
        required=True,
        help='output file of trigger scores')
    parser.add_argument(
        '--stats_file',
        type=str,
        default='./stats.0.txt',
        help='output file of detection error tradeoff')
    args = parser.parse_args()

    # https://yaml.org/type/float.html
    config = CfgNode(new_allowed=True)
    if args.config:
        config.merge_from_file(args.config)

    # Dataset
    ds_class = dynamic_import(config['dataset'])
    test_ds = ds_class(
        data_dir=config['data_dir'],
        mode='test',
        feat_type=config['feat_type'],
        sample_rate=config['sample_rate'],
        frame_shift=config['frame_shift'],
        frame_length=config['frame_length'],
        n_mels=config['n_mels'], )

    keyword_table, filler_table, filler_duration = load_label_and_score(
        args.keyword_index, test_ds, args.score_file)
    print('Filler total duration Hours: {}'.format(filler_duration / 3600.0))
    pbar = tqdm(total=int(1.0 / args.step))
    with open(args.stats_file, 'w', encoding='utf8') as fout:
        keyword_index = args.keyword_index
        threshold = 0.0
        while threshold <= 1.0:
@@ -113,4 +130,4 @@ if __name__ == '__main__':
            pbar.update(1)
    pbar.close()
    print('DET saved to: {}'.format(args.stats_file))
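
For reference, each line the script writes to `--stats_file` is one operating point of the detection error tradeoff: a false-reject rate paired with a false-alarm count over the filler audio at a given threshold. A toy illustration with made-up counts (not the script's exact bookkeeping):

```python
# One hypothetical DET point at a fixed trigger-score threshold.
num_keyword_utts = 500        # keyword utterances in the test set
num_missed = 25               # utterances whose peak score stayed below the threshold
num_false_alarms = 12         # trigger events inside filler (non-keyword) audio
filler_duration_hours = 10.0  # total filler audio, in hours

false_reject_rate = num_missed / num_keyword_utts                 # 0.05
false_alarms_per_hour = num_false_alarms / filler_duration_hours  # 1.2
```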

@@ -17,12 +17,12 @@ import os

import matplotlib.pyplot as plt
import numpy as np

# yapf: disable
parser = argparse.ArgumentParser(__doc__)
parser.add_argument('--keyword_label', type=str, required=True, help='keyword string shown on image')
parser.add_argument('--stats_file', type=str, required=True, help='output file of detection error tradeoff')
parser.add_argument('--img_file', type=str, default='./det.png', help='output det image')
args = parser.parse_args()
# yapf: enable

@@ -61,14 +61,8 @@ def plot_det_curve(keywords, stats_file, figure_file, xlim, x_step, ylim,

if __name__ == '__main__':
    img_file = os.path.abspath(args.img_file)
    stats_file = os.path.abspath(args.stats_file)
    plot_det_curve([args.keyword_label], stats_file, img_file, 10, 2, 10, 2)

    print('DET curve image saved to: {}'.format(img_file))

@@ -12,55 +12,67 @@
# See the License for the specific language governing permissions and
# limitations under the License.
# Modified from wekws(https://github.com/wenet-e2e/wekws)
import paddle
from tqdm import tqdm
from yacs.config import CfgNode

from paddlespeech.kws.exps.mdtc.collate import collate_features
from paddlespeech.kws.models.mdtc import KWSModel
from paddlespeech.s2t.training.cli import default_argument_parser
from paddlespeech.s2t.utils.dynamic_import import dynamic_import

if __name__ == '__main__':
    parser = default_argument_parser()
    parser.add_argument(
        "--ckpt",
        type=str,
        required=True,
        help='model checkpoint for evaluation.')
    parser.add_argument(
        "--score_file",
        type=str,
        default='./scores.txt',
        help='output file of trigger scores')
    args = parser.parse_args()

    # https://yaml.org/type/float.html
    config = CfgNode(new_allowed=True)
    if args.config:
        config.merge_from_file(args.config)

    # Dataset
    ds_class = dynamic_import(config['dataset'])
    test_ds = ds_class(
        data_dir=config['data_dir'],
        mode='test',
        feat_type=config['feat_type'],
        sample_rate=config['sample_rate'],
        frame_shift=config['frame_shift'],
        frame_length=config['frame_length'],
        n_mels=config['n_mels'], )

    test_sampler = paddle.io.BatchSampler(
        test_ds, batch_size=config['batch_size'], drop_last=False)
    test_loader = paddle.io.DataLoader(
        test_ds,
        batch_sampler=test_sampler,
        num_workers=config['num_workers'],
        return_list=True,
        use_buffer_reader=True,
        collate_fn=collate_features, )

    # Model
    backbone_class = dynamic_import(config['backbone'])
    backbone = backbone_class(
        stack_num=config['stack_num'],
        stack_size=config['stack_size'],
        in_channels=config['in_channels'],
        res_channels=config['res_channels'],
        kernel_size=config['kernel_size'], )
    model = KWSModel(backbone=backbone, num_keywords=config['num_keywords'])
    model.set_state_dict(paddle.load(args.ckpt))
    model.eval()

    with paddle.no_grad(), open(args.score_file, 'w', encoding='utf8') as f:
        for batch_idx, batch in enumerate(
                tqdm(test_loader, total=len(test_loader))):
            keys, feats, labels, lengths = batch
@@ -73,7 +85,6 @@ if __name__ == '__main__':
            keyword_scores = score[:, keyword_i]
            score_frames = ' '.join(
                ['{:.6f}'.format(x) for x in keyword_scores.tolist()])
            f.write('{} {} {}\n'.format(key, keyword_i, score_frames))
    print('Result saved to: {}'.format(args.score_file))

@@ -11,77 +11,88 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os

import paddle
from paddleaudio.utils import logger
from paddleaudio.utils import Timer
from yacs.config import CfgNode

from paddlespeech.kws.exps.mdtc.collate import collate_features
from paddlespeech.kws.models.loss import max_pooling_loss
from paddlespeech.kws.models.mdtc import KWSModel
from paddlespeech.s2t.training.cli import default_argument_parser
from paddlespeech.s2t.utils.dynamic_import import dynamic_import

if __name__ == '__main__':
    parser = default_argument_parser()
    args = parser.parse_args()

    # https://yaml.org/type/float.html
    config = CfgNode(new_allowed=True)
    if args.config:
        config.merge_from_file(args.config)

    nranks = paddle.distributed.get_world_size()
    if paddle.distributed.get_world_size() > 1:
        paddle.distributed.init_parallel_env()
    local_rank = paddle.distributed.get_rank()

    # Dataset
    ds_class = dynamic_import(config['dataset'])
    train_ds = ds_class(
        data_dir=config['data_dir'],
        mode='train',
        feat_type=config['feat_type'],
        sample_rate=config['sample_rate'],
        frame_shift=config['frame_shift'],
        frame_length=config['frame_length'],
        n_mels=config['n_mels'], )
    dev_ds = ds_class(
        data_dir=config['data_dir'],
        mode='dev',
        feat_type=config['feat_type'],
        sample_rate=config['sample_rate'],
        frame_shift=config['frame_shift'],
        frame_length=config['frame_length'],
        n_mels=config['n_mels'], )

    train_sampler = paddle.io.DistributedBatchSampler(
        train_ds,
        batch_size=config['batch_size'],
        shuffle=True,
        drop_last=False)
    train_loader = paddle.io.DataLoader(
        train_ds,
        batch_sampler=train_sampler,
        num_workers=config['num_workers'],
        return_list=True,
        use_buffer_reader=True,
        collate_fn=collate_features, )

    # Model
    backbone_class = dynamic_import(config['backbone'])
    backbone = backbone_class(
        stack_num=config['stack_num'],
        stack_size=config['stack_size'],
        in_channels=config['in_channels'],
        res_channels=config['res_channels'],
        kernel_size=config['kernel_size'], )
    model = KWSModel(backbone=backbone, num_keywords=config['num_keywords'])
    model = paddle.DataParallel(model)
    clip = paddle.nn.ClipGradByGlobalNorm(config['grad_clip'])
    optimizer = paddle.optimizer.Adam(
        learning_rate=config['learning_rate'],
        weight_decay=config['weight_decay'],
        parameters=model.parameters(),
        grad_clip=clip)
    criterion = max_pooling_loss

    steps_per_epoch = len(train_sampler)
    timer = Timer(steps_per_epoch * config['epochs'])
    timer.start()

    for epoch in range(1, config['epochs'] + 1):
        model.train()
        avg_loss = 0
@@ -107,15 +118,13 @@ if __name__ == '__main__':
            timer.count()

            if (batch_idx + 1) % config['log_freq'] == 0 and local_rank == 0:
                lr = optimizer.get_lr()
                avg_loss /= config['log_freq']
                avg_acc = num_corrects / num_samples

                print_msg = 'Epoch={}/{}, Step={}/{}'.format(
                    epoch, config['epochs'], batch_idx + 1, steps_per_epoch)
                print_msg += ' loss={:.4f}'.format(avg_loss)
                print_msg += ' acc={:.4f}'.format(avg_acc)
                print_msg += ' lr={:.6f} step/sec={:.2f} | ETA {}'.format(
@@ -126,17 +135,17 @@ if __name__ == '__main__':
                num_corrects = 0
                num_samples = 0

        if epoch % config[
                'save_freq'] == 0 and batch_idx + 1 == steps_per_epoch and local_rank == 0:
            dev_sampler = paddle.io.BatchSampler(
                dev_ds,
                batch_size=config['batch_size'],
                shuffle=False,
                drop_last=False)
            dev_loader = paddle.io.DataLoader(
                dev_ds,
                batch_sampler=dev_sampler,
                num_workers=config['num_workers'],
                return_list=True,
                use_buffer_reader=True,
                collate_fn=collate_features, )
@@ -159,7 +168,7 @@ if __name__ == '__main__':
            logger.eval(print_msg)

            # Save model
            save_dir = os.path.join(config['checkpoint_dir'],
                                    'epoch_{}'.format(epoch))
            logger.info('Saving model checkpoint to {}'.format(save_dir))
            paddle.save(model.state_dict(),

@@ -13,8 +13,9 @@
# limitations under the License.
"""Contains the audio featurizer class."""
import numpy as np
import paddle
import paddleaudio.compliance.kaldi as kaldi
from python_speech_features import delta
from python_speech_features import mfcc
@@ -345,19 +346,17 @@ class AudioFeaturizer():
            raise ValueError("Stride size must not be greater than "
                             "window size.")
        # (T, D)
        waveform = paddle.to_tensor(
            np.expand_dims(samples, 0), dtype=paddle.float32)
        mat = kaldi.fbank(
            waveform,
            n_mels=feat_dim,
            frame_length=window_ms,  # default: 25
            frame_shift=stride_ms,  # default: 10
            dither=dither,
            energy_floor=0.0,
            sr=sample_rate)
        fbank_feat = np.squeeze(mat.numpy())
        if delta_delta:
            fbank_feat = self._concat_delta_delta(fbank_feat)
        return fbank_feat
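
The same fbank call can be exercised standalone. A small sketch using the parameter names visible in the diff above (the 161-dim setting matches the new fbank161 checkpoint; the silent input and `dither=0.0` are stand-in assumptions for illustration):

```python
import numpy as np
import paddle
import paddleaudio.compliance.kaldi as kaldi

samples = np.zeros(16000, dtype=np.float32)  # 1 s of "audio" at 16 kHz
waveform = paddle.to_tensor(np.expand_dims(samples, 0), dtype=paddle.float32)

mat = kaldi.fbank(
    waveform,
    n_mels=161,        # feat_dim
    frame_length=25,   # window_ms
    frame_shift=10,    # stride_ms
    dither=0.0,        # disabled here for determinism
    energy_floor=0.0,
    sr=16000)
fbank_feat = np.squeeze(mat.numpy())
print(fbank_feat.shape)  # roughly (98, 161): ~98 frames, 161 mel bins
```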

@@ -1,4 +1,4 @@
# This is the parameter configuration file for the streaming tts server.

#################################################################################
#                                 SERVER SETTING                                #
@@ -7,8 +7,8 @@ host: 127.0.0.1
port: 8092

# The task format in the engine_list is: <speech task>_<engine type>
# engine_list choices = ['tts_online', 'tts_online-onnx']; the inference speed of tts_online-onnx is faster than tts_online.
# protocol choices = ['websocket', 'http']
protocol: 'http'
engine_list: ['tts_online-onnx']
@@ -20,8 +20,9 @@ engine_list: ['tts_online-onnx']
################################### TTS #########################################
################### speech task: tts; engine_type: online #######################
tts_online:
    # am (acoustic model) choices=['fastspeech2_csmsc', 'fastspeech2_cnndecoder_csmsc']
    # fastspeech2_cnndecoder_csmsc supports streaming am inference.
    am: 'fastspeech2_csmsc'
    am_config:
    am_ckpt:
    am_stat:
@@ -31,6 +32,7 @@ tts_online:
    spk_id: 0

    # voc (vocoder) choices=['mb_melgan_csmsc, hifigan_csmsc']
    # Both mb_melgan_csmsc and hifigan_csmsc support streaming voc inference.
    voc: 'mb_melgan_csmsc'
    voc_config:
    voc_ckpt:
@@ -39,8 +41,13 @@ tts_online:
    # others
    lang: 'zh'
    device: 'cpu' # set 'gpu:id' or 'cpu'
    # am_block and am_pad are only used by fastspeech2_cnndecoder_csmsc for streaming am inference;
    # when am_pad is set to 12, streaming synthetic audio is the same as non-streaming synthetic audio.
    am_block: 42
    am_pad: 12
    # voc_block and voc_pad are used by the voc model for streaming voc inference;
    # when the voc model is mb_melgan_csmsc and voc_pad is set to 14, streaming synthetic audio is the same as non-streaming synthetic audio; the minimum voc_pad is 7, at which the streaming synthetic audio still sounds normal.
    # when the voc model is hifigan_csmsc and voc_pad is set to 20, streaming synthetic audio is the same as non-streaming synthetic audio; with voc_pad set to 14, streaming synthetic audio sounds normal.
    voc_block: 14
    voc_pad: 14
@@ -53,7 +60,8 @@ tts_online:
################################### TTS #########################################
################### speech task: tts; engine_type: online-onnx #######################
tts_online-onnx:
    # am (acoustic model) choices=['fastspeech2_csmsc_onnx', 'fastspeech2_cnndecoder_csmsc_onnx']
    # fastspeech2_cnndecoder_csmsc_onnx supports streaming am inference.
    am: 'fastspeech2_cnndecoder_csmsc_onnx'
    # am_ckpt is a list; if am is fastspeech2_cnndecoder_csmsc_onnx, am_ckpt = [encoder model, decoder model, postnet model];
    # if am is fastspeech2_csmsc_onnx, am_ckpt = [ckpt model];
@@ -70,6 +78,7 @@ tts_online-onnx:
    cpu_threads: 4

    # voc (vocoder) choices=['mb_melgan_csmsc_onnx, hifigan_csmsc_onnx']
    # Both mb_melgan_csmsc_onnx and hifigan_csmsc_onnx support streaming voc inference.
    voc: 'hifigan_csmsc_onnx'
    voc_ckpt:
    voc_sample_rate: 24000
@@ -80,9 +89,15 @@ tts_online-onnx:
    # others
    lang: 'zh'
    # am_block and am_pad are only used by fastspeech2_cnndecoder_csmsc_onnx for streaming am inference;
    # when am_pad is set to 12, streaming synthetic audio is the same as non-streaming synthetic audio.
    am_block: 42
    am_pad: 12
    # voc_block and voc_pad are used by the voc model for streaming voc inference;
    # when the voc model is mb_melgan_csmsc_onnx and voc_pad is set to 14, streaming synthetic audio is the same as non-streaming synthetic audio; the minimum voc_pad is 7, at which the streaming synthetic audio still sounds normal.
    # when the voc model is hifigan_csmsc_onnx and voc_pad is set to 20, streaming synthetic audio is the same as non-streaming synthetic audio; with voc_pad set to 14, streaming synthetic audio sounds normal.
    voc_block: 14
    voc_pad: 14
    # voc_upsample should be the same as n_shift in the voc config.
    voc_upsample: 300

@@ -43,9 +43,9 @@ __all__ = ['ASREngine']

pretrained_models = {
    "deepspeech2online_aishell-zh-16k": {
        'url':
        'https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_online_aishell_fbank161_ckpt_0.2.0.model.tar.gz',
        'md5':
        'd314960e83cc10dcfa6b04269f3054d4',
        'cfg_path':
        'model.yaml',
        'ckpt_path':

@@ -24,11 +24,11 @@ from typing import Any
from typing import Dict

import paddle
import requests
import yaml
from paddle.framework import load

import paddleaudio

from . import download
from .entry import client_commands
from .entry import server_commands

@@ -94,6 +94,7 @@ class ASRWsAudioHandler:
            self.url = "ws://" + self.url + ":" + str(
                self.port) + endpoint
        self.punc_server = TextHttpHandler(punc_server_ip, punc_server_port)
        logger.info(f"endpoint: {self.url}")

    def read_wave(self, wavfile_path: str):
        """read the audio file from specific wavfile path
@@ -157,17 +158,18 @@
                separators=(',', ': '))
            await ws.send(audio_info)
            msg = await ws.recv()
            logger.info("client receive msg={}".format(msg))

            # 3. send chunk audio data to engine
            for chunk_data in self.read_wave(wavfile_path):
                await ws.send(chunk_data.tobytes())
                msg = await ws.recv()
                msg = json.loads(msg)

                if self.punc_server and len(msg["result"]) > 0:
                    msg["result"] = self.punc_server.run(
                        msg["result"])
                logger.info("client receive msg={}".format(msg))

            # 4. we must send finished signal to the server
            audio_info = json.dumps(
@@ -184,9 +186,11 @@
            # 5. decode the bytes to str
            msg = json.loads(msg)

            if self.punc_server:
                msg["result"] = self.punc_server.run(msg["result"])

            logger.info("client final receive msg={}".format(msg))
            result = msg

            return result
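
End to end, the handler above implements a simple start/chunks/end protocol. A hedged client sketch of that flow (the endpoint path and the JSON field names are assumptions inferred from this handler, not confirmed by the diff):

```python
import asyncio
import json

import websockets  # client counterpart to the handler above

async def recognize(pcm_chunks):
    uri = "ws://127.0.0.1:8090/paddlespeech/asr/streaming"  # assumed endpoint
    async with websockets.connect(uri) as ws:
        # 1. announce the start of an utterance (field names are assumptions)
        await ws.send(json.dumps({"name": "test.wav", "signal": "start", "nbest": 1}))
        print(await ws.recv())  # expect a server_ready message

        # 2. stream raw pcm chunks; the server answers each with a partial result
        for chunk in pcm_chunks:
            await ws.send(chunk)
            print(json.loads(await ws.recv()))

        # 3. announce the end and collect the final result
        await ws.send(json.dumps({"name": "test.wav", "signal": "end", "nbest": 1}))
        return json.loads(await ws.recv())
```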

@@ -73,8 +73,6 @@ server = [
    "uvicorn",
    "pattern_singleton",
    "websockets",
]

requirements = {

@@ -24,8 +24,6 @@ docker run --privileged --net=host --ipc=host -it --rm -v $PWD:/workspace --nam
* More `Paddle` docker images can be found [here](https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/install/docker/linux-docker.html).

2. Build `speechx` and `examples`.

@@ -79,6 +79,7 @@ if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
        --feature_wspecifier=ark,scp:$data/split${nj}/JOB/feat.ark,$data/split${nj}/JOB/feat.scp \
        --cmvn_file=$cmvn \
        --streaming_chunk=0.36
    echo "feature making has finished!!!"
fi

if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
@@ -94,6 +95,8 @@ if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
    cat $data/split${nj}/*/result > $exp/${label_file}
    utils/compute-wer.py --char=1 --v=1 $text $exp/${label_file} > $exp/${wer}
    echo "ctc-prefix-beam-search-decoder-ol without lm has finished!!!"
    echo "please check the result in ${exp}/${wer}"
fi

if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
@@ -110,6 +113,8 @@ if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
    cat $data/split${nj}/*/result_lm > $exp/${label_file}_lm
    utils/compute-wer.py --char=1 --v=1 $text $exp/${label_file}_lm > $exp/${wer}.lm
    echo "ctc-prefix-beam-search-decoder-ol with lm test has finished!!!"
    echo "please check the result in ${exp}/${wer}.lm"
fi

wfst=$data/wfst/
@@ -139,6 +144,8 @@ if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
    cat $data/split${nj}/*/result_tlg > $exp/${label_file}_tlg
    utils/compute-wer.py --char=1 --v=1 $text $exp/${label_file}_tlg > $exp/${wer}.tlg
    echo "wfst-decoder-ol has finished!!!"
    echo "please check the result in ${exp}/${wer}.tlg"
fi

if [ ${stage} -le 5 ] && [ ${stop_stage} -ge 5 ]; then
@@ -159,4 +166,6 @@ if [ ${stage} -le 5 ] && [ ${stop_stage} -ge 5 ]; then
    cat $data/split${nj}/*/result_recognizer > $exp/${label_file}_recognizer
    utils/compute-wer.py --char=1 --v=1 $text $exp/${label_file}_recognizer > $exp/${wer}.recognizer
    echo "recognizer test has finished!!!"
    echo "please check the result in ${exp}/${wer}.recognizer"
fi

@@ -27,7 +27,7 @@ ConnectionHandler::ConnectionHandler(
    : ws_(std::move(socket)), recognizer_resource_(recognizer_resource) {}

void ConnectionHandler::OnSpeechStart() {
    LOG(INFO) << "Server: Received speech start signal, start reading speech";
    got_start_tag_ = true;
    json::value rv = {{"status", "ok"}, {"type", "server_ready"}};
    ws_.text(true);
@@ -39,14 +39,14 @@ void ConnectionHandler::OnSpeechStart() {
}

void ConnectionHandler::OnSpeechEnd() {
    LOG(INFO) << "Server: Received speech end signal";
    CHECK(recognizer_ != nullptr);
    recognizer_->SetFinished();
    got_end_tag_ = true;
}

void ConnectionHandler::OnFinalResult(const std::string& result) {
    LOG(INFO) << "Server: Final result: " << result;
    json::value rv = {
        {"status", "ok"}, {"type", "final_result"}, {"result", result}};
    ws_.text(true);
@@ -69,10 +69,16 @@ void ConnectionHandler::OnSpeechData(const beast::flat_buffer& buffer) {
        pcm_data(i) = static_cast<float>(*pdata);
        pdata++;
    }
    VLOG(2) << "Server: Received " << num_samples << " samples";
    LOG(INFO) << "Server: Received " << num_samples << " samples";
    CHECK(recognizer_ != nullptr);
    recognizer_->Accept(pcm_data);

    // TODO: return partial result
    json::value rv = {
        {"status", "ok"}, {"type", "partial_result"}, {"result", "TODO"}};
    ws_.text(true);
    ws_.write(asio::buffer(json::serialize(rv)));
}

void ConnectionHandler::DecodeThreadFunc() {
@@ -80,9 +86,9 @@ void ConnectionHandler::DecodeThreadFunc() {
    while (true) {
        recognizer_->Decode();
        if (recognizer_->IsFinished()) {
            LOG(INFO) << "Server: enter finish";
            recognizer_->Decode();
            LOG(INFO) << "Server: finish";
            std::string result = recognizer_->GetFinalResult();
            OnFinalResult(result);
            OnFinish();
@@ -135,7 +141,7 @@ void ConnectionHandler::operator()() {
        ws_.read(buffer);
        if (ws_.got_text()) {
            std::string message = beast::buffers_to_string(buffer.data());
            LOG(INFO) << "Server: Text: " << message;
            OnText(message);
            if (got_end_tag_) {
                break;
            }
@@ -152,7 +158,7 @@ void ConnectionHandler::operator()() {
        }
    }

    LOG(INFO) << "Server: Read all pcm data, wait for decoding thread";
    if (decode_thread_ != nullptr) {
        decode_thread_->join();
    }

@@ -20,11 +20,17 @@ paddlespeech asr --model deepspeech2online_aishell --input ./zh.wav
paddlespeech asr --model deepspeech2offline_librispeech --lang en --input ./en.wav

# long audio restriction
{
wget -c https://paddlespeech.bj.bcebos.com/datasets/single_wav/zh/test_long_audio_01.wav
paddlespeech asr --input test_long_audio_01.wav
if [ $? -ne 255 ]; then
    echo -e "\e[1;31mTime restriction not passed\e[0m"
    exit 1
fi
} &&
{
    echo -e "\033[32mTime restriction passed\033[0m"
}

# Text To Speech
paddlespeech tts --input "你好,欢迎使用百度飞桨深度学习框架!"
@@ -71,4 +77,4 @@ paddlespeech stats --task vector
paddlespeech stats --task st

echo -e "\033[32mTest success !!!\033[0m"
