Merge branch 'PaddlePaddle:develop' into develop

liangym (committed by GitHub)
commit eef87bb7d4

@ -157,13 +157,15 @@ Via the easy-to-use, efficient, flexible and scalable implementation, our vision
- 🧩 *Cascaded models application*: as an extension of the typical traditional audio tasks, we combine the workflows of the aforementioned tasks with other fields like Natural language processing (NLP) and Computer Vision (CV).
### Recent Update
- 🔥 2022.11.07: [U2/U2++ C++ High-Performance Streaming ASR Deployment](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/speechx/examples/u2pp_ol/wenetspeech).
- 👑 2022.11.01: Add [Adversarial Loss](https://arxiv.org/pdf/1907.04448.pdf) for [Chinese English mixed TTS](./examples/zh_en_tts/tts3).
- 🔥 2022.10.26: Add [Prosody Prediction](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/rhy) for TTS.
- 🎉 2022.10.21: Add [SSML](https://github.com/PaddlePaddle/PaddleSpeech/discussions/2538) for TTS Chinese Text Frontend.
- 👑 2022.10.11: Add [Wav2vec2ASR](./examples/librispeech/asr3), wav2vec2.0 fine-tuning for ASR on LibriSpeech.
- 🔥 2022.09.26: Add Voice Cloning, TTS finetune, and [ERNIE-SAT](https://arxiv.org/abs/2211.03545) in [PaddleSpeech Web Demo](./demos/speech_web).
- ⚡ 2022.09.09: Add AISHELL-3 Voice Cloning [example](./examples/aishell3/vc2) with ECAPA-TDNN speaker encoder.
- ⚡ 2022.08.25: Release TTS [finetune](./examples/other/tts_finetune/tts3) example.
- 🔥 2022.08.22: Add [ERNIE-SAT](https://arxiv.org/abs/2211.03545) models: [ERNIE-SAT-vctk](./examples/vctk/ernie_sat)、[ERNIE-SAT-aishell3](./examples/aishell3/ernie_sat)、[ERNIE-SAT-zh_en](./examples/aishell3_vctk/ernie_sat).
- 🔥 2022.08.15: Add [g2pW](https://github.com/GitYCC/g2pW) into TTS Chinese Text Frontend.
- 🔥 2022.08.09: Release [Chinese English mixed TTS](./examples/zh_en_tts/tts3).
- ⚡ 2022.08.03: Add ONNXRuntime infer for TTS CLI.
@ -578,7 +580,7 @@ PaddleSpeech supports a series of most popular models. They are summarized in [r
</td>
</tr>
<tr>
<td><a href = "https://arxiv.org/abs/2211.03545">ERNIE-SAT</a></td>
<td>VCTK / AISHELL-3 / ZH_EN</td>
<td>
<a href = "./examples/vctk/ernie_sat">ERNIE-SAT-vctk</a> / <a href = "./examples/aishell3/ernie_sat">ERNIE-SAT-aishell3</a> / <a href = "./examples/aishell3_vctk/ernie_sat">ERNIE-SAT-zh_en</a>
@ -716,9 +718,9 @@ PaddleSpeech supports a series of most popular models. They are summarized in [r
<tr>
<td>Keyword Spotting</td>
<td>hey-snips</td>
<td>MDTC</td>
<td>
<a href = "./examples/hey_snips/kws0">pann-hey-snips</a>
<a href = "./examples/hey_snips/kws0">mdtc-hey-snips</a>
</td>
</tr>
</tbody>

@ -164,13 +164,14 @@
### Recent Update
- 👑 2022.11.01: Add [Adversarial Loss](https://arxiv.org/pdf/1907.04448.pdf) for [Chinese English mixed TTS](./examples/zh_en_tts/tts3).
- 🔥 2022.10.26: Add [Prosody Prediction](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/rhy) for TTS.
- 🎉 2022.10.21: Add [SSML](https://github.com/PaddlePaddle/PaddleSpeech/discussions/2538) for the TTS Chinese text frontend.
- 👑 2022.10.11: Add [Wav2vec2ASR](./examples/librispeech/asr3), wav2vec2.0 fine-tuned for ASR on LibriSpeech.
- 🔥 2022.09.26: Add Voice Cloning, TTS finetune, and [ERNIE-SAT](https://arxiv.org/abs/2211.03545) to the [PaddleSpeech Web Demo](./demos/speech_web).
- ⚡ 2022.09.09: Add AISHELL-3 Voice Cloning [example](./examples/aishell3/vc2) with the ECAPA-TDNN speaker encoder.
- ⚡ 2022.08.25: Release the TTS [finetune](./examples/other/tts_finetune/tts3) example.
- 🔥 2022.08.22: Add [ERNIE-SAT](https://arxiv.org/abs/2211.03545) models: [ERNIE-SAT-vctk](./examples/vctk/ernie_sat), [ERNIE-SAT-aishell3](./examples/aishell3/ernie_sat), [ERNIE-SAT-zh_en](./examples/aishell3_vctk/ernie_sat).
- 🔥 2022.08.15: Add [g2pW](https://github.com/GitYCC/g2pW) into the TTS Chinese text frontend.
- 🔥 2022.08.09: Release [Chinese English mixed TTS](./examples/zh_en_tts/tts3).
- ⚡ 2022.08.03: Add ONNXRuntime inference for the TTS CLI.
@ -575,7 +576,7 @@ PaddleSpeech **speech synthesis** mainly consists of three modules: text frontend, acoustic
</td>
</tr>
<tr>
<td><a href = "https://arxiv.org/abs/2211.03545">ERNIE-SAT</a></td>
<td>VCTK / AISHELL-3 / ZH_EN</td>
<td>
<a href = "./examples/vctk/ernie_sat">ERNIE-SAT-vctk</a> / <a href = "./examples/aishell3/ernie_sat">ERNIE-SAT-aishell3</a> / <a href = "./examples/aishell3_vctk/ernie_sat">ERNIE-SAT-zh_en</a>
@ -696,9 +697,9 @@ PaddleSpeech **speech synthesis** mainly consists of three modules: text frontend, acoustic
</table>
<a name="唤醒模型"></a>
<a name="语音唤醒模型"></a>
**Keyword Spotting**
<table style="width:100%">
<thead>
@ -711,11 +712,11 @@ PaddleSpeech **speech synthesis** mainly consists of three modules: text frontend, acoustic
</thead>
<tbody>
<tr>
<td>Keyword Spotting</td>
<td>hey-snips</td>
<td>MDTC</td>
<td>
<a href = "./examples/hey_snips/kws0">pann-hey-snips</a>
<a href = "./examples/hey_snips/kws0">mdtc-hey-snips</a>
</td>
</tr>
</tbody>

@ -0,0 +1,100 @@
([简体中文](./README_cn.md)|English)
# ASR Deployment by SpeechX
## Introduction
This ASR deployment supports U2/U2++/DeepSpeech2 ASR models in C++, which is common practice for industrial deployment.
For more information about SpeechX, please see [here](../../speechx/README.md).
## Usage
### 1. Environment
* python - 3.7
* docker - `registry.baidubce.com/paddlepaddle/paddle:2.2.2-gpu-cuda10.2-cudnn7`
* os - Ubuntu 16.04.7 LTS
* gcc/g++/gfortran - 8.2.0
* cmake - 3.16.0
For more information, please see [here](../../speechx/README.md).
### 2. Compile SpeechX
Please see [here](../../speechx/README.md).
### 3. Usage
For the U2++ ASR deployment example, please see [here](../../speechx/examples/u2pp_ol/wenetspeech/).
First, go to the `speechx/speechx/examples/u2pp_ol/wenetspeech` directory.
- Source `path.sh`
```bash
source path.sh
```
- Download the model, prepare the test data and the CMVN file
```bash
./run.sh --stage 0 --stop_stage 1
```
- Decode with WAV
```bash
# FP32
./local/recognizer.sh
# INT8
./local/recognizer_quant.sh
```
Output:
```bash
I1026 16:13:24.683531 48038 u2_recognizer_main.cc:55] utt: BAC009S0916W0495
I1026 16:13:24.683578 48038 u2_recognizer_main.cc:56] wav dur: 4.17119 sec.
I1026 16:13:24.683595 48038 u2_recognizer_main.cc:64] wav len (sample): 66739
I1026 16:13:25.037652 48038 u2_recognizer_main.cc:87] Pratial result: 3 这令
I1026 16:13:25.043697 48038 u2_recognizer_main.cc:87] Pratial result: 4 这令
I1026 16:13:25.222124 48038 u2_recognizer_main.cc:87] Pratial result: 5 这令被贷款
I1026 16:13:25.228385 48038 u2_recognizer_main.cc:87] Pratial result: 6 这令被贷款
I1026 16:13:25.414669 48038 u2_recognizer_main.cc:87] Pratial result: 7 这令被贷款的员工
I1026 16:13:25.420714 48038 u2_recognizer_main.cc:87] Pratial result: 8 这令被贷款的员工
I1026 16:13:25.608129 48038 u2_recognizer_main.cc:87] Pratial result: 9 这令被贷款的员工们请
I1026 16:13:25.801620 48038 u2_recognizer_main.cc:87] Pratial result: 10 这令被贷款的员工们请食难安
I1026 16:13:25.804101 48038 feature_cache.h:44] set finished
I1026 16:13:25.804128 48038 feature_cache.h:51] compute last feats done.
I1026 16:13:25.948771 48038 u2_recognizer_main.cc:87] Pratial result: 11 这令被贷款的员工们请食难安
I1026 16:13:26.246963 48038 u2_recognizer_main.cc:113] BAC009S0916W0495 这令被贷款的员工们请食难安
```
## Results
> CER is computed on the aishell test set.
> RTF is computed including both feature extraction and decoding, which is closer to end-to-end.
> Machine: Intel(R) Xeon(R) Gold 6271C CPU @ 2.60GHz with `avx512_vnni`.
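For reference, RTF (real-time factor) is total processing time divided by total audio duration; a minimal sketch of the computation, using the totals reported by `u2_recognizer_main` in the RESULTS log further down in this change:

```python
# real-time factor = processing time / audio duration
total_wav_sec = 36108.9     # total duration of the aishell test set (from the log)
total_decode_sec = 9737.63  # total decode cost reported for the INT8 run
print(round(total_decode_sec / total_wav_sec, 6))  # 0.269674, i.e. faster than real time
```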
### FP32
```
Overall -> 5.75 % N=104765 C=99035 S=5587 D=143 I=294
Mandarin -> 5.75 % N=104762 C=99035 S=5584 D=143 I=294
English -> 0.00 % N=0 C=0 S=0 D=0 I=0
Other -> 100.00 % N=3 C=0 S=3 D=0 I=0
```
```
RTF is: 0.315337
```
### INT8
```
Overall -> 5.83 % N=104765 C=98943 S=5675 D=147 I=286
Mandarin -> 5.83 % N=104762 C=98943 S=5672 D=147 I=286
English -> 0.00 % N=0 C=0 S=0 D=0 I=0
Other -> 100.00 % N=3 C=0 S=3 D=0 I=0
```
```
RTF is: 0.269674
```

@ -0,0 +1,96 @@
([简体中文](./README_cn.md)|English)
# ASR Deployment Based on SpeechX
## Introduction
C++ deployment of U2/U2++/DeepSpeech2 models is supported, which is common practice in industrial deployment.
For more information about SpeechX, please see the [documentation](../../speechx/README.md).
## Usage
### 1. Environment
* python - 3.7
* docker - `registry.baidubce.com/paddlepaddle/paddle:2.2.2-gpu-cuda10.2-cudnn7`
* os - Ubuntu 16.04.7 LTS
* gcc/g++/gfortran - 8.2.0
* cmake - 3.16.0
For more information, please see the [documentation](../../speechx/README.md).
### 2. Compile SpeechX
For more information, please see the [documentation](../../speechx/README.md).
### 3. Example
For the U2++ ASR deployment, see [here](../../speechx/examples/u2pp_ol/wenetspeech/).
The following commands are run in the `speechx/speechx/examples/u2pp_ol/wenetspeech` directory.
- Source `path.sh`
```bash
source path.sh
```
- Download the model, prepare the test data and the CMVN file
```bash
./run.sh --stage 0 --stop_stage 1
```
- Decode
```bash
# FP32
./local/recognizer.sh
# INT8
./local/recognizer_quant.sh
```
Output:
```bash
I1026 16:13:24.683531 48038 u2_recognizer_main.cc:55] utt: BAC009S0916W0495
I1026 16:13:24.683578 48038 u2_recognizer_main.cc:56] wav dur: 4.17119 sec.
I1026 16:13:24.683595 48038 u2_recognizer_main.cc:64] wav len (sample): 66739
I1026 16:13:25.037652 48038 u2_recognizer_main.cc:87] Pratial result: 3 这令
I1026 16:13:25.043697 48038 u2_recognizer_main.cc:87] Pratial result: 4 这令
I1026 16:13:25.222124 48038 u2_recognizer_main.cc:87] Pratial result: 5 这令被贷款
I1026 16:13:25.228385 48038 u2_recognizer_main.cc:87] Pratial result: 6 这令被贷款
I1026 16:13:25.414669 48038 u2_recognizer_main.cc:87] Pratial result: 7 这令被贷款的员工
I1026 16:13:25.420714 48038 u2_recognizer_main.cc:87] Pratial result: 8 这令被贷款的员工
I1026 16:13:25.608129 48038 u2_recognizer_main.cc:87] Pratial result: 9 这令被贷款的员工们请
I1026 16:13:25.801620 48038 u2_recognizer_main.cc:87] Pratial result: 10 这令被贷款的员工们请食难安
I1026 16:13:25.804101 48038 feature_cache.h:44] set finished
I1026 16:13:25.804128 48038 feature_cache.h:51] compute last feats done.
I1026 16:13:25.948771 48038 u2_recognizer_main.cc:87] Pratial result: 11 这令被贷款的员工们请食难安
I1026 16:13:26.246963 48038 u2_recognizer_main.cc:113] BAC009S0916W0495 这令被贷款的员工们请食难安
```
## Results
> CER is computed on the aishell test set.
> RTF is computed including both feature extraction and decoding.
> Test machine: Intel(R) Xeon(R) Gold 6271C CPU @ 2.60GHz with `avx512_vnni`.
### FP32
```
Overall -> 5.75 % N=104765 C=99035 S=5587 D=143 I=294
Mandarin -> 5.75 % N=104762 C=99035 S=5584 D=143 I=294
English -> 0.00 % N=0 C=0 S=0 D=0 I=0
Other -> 100.00 % N=3 C=0 S=3 D=0 I=0
```
```
RTF is: 0.315337
```
### INT8
```
Overall -> 5.87 % N=104765 C=98909 S=5711 D=145 I=289
Mandarin -> 5.86 % N=104762 C=98909 S=5708 D=145 I=289
English -> 0.00 % N=0 C=0 S=0 D=0 I=0
Other -> 100.00 % N=3 C=0 S=3 D=0 I=0
```

@ -108,7 +108,7 @@ for epoch in range(1, epochs + 1):
optimizer.clear_grad()
# Calculate loss
avg_loss = float(loss)
# Calculate metrics
preds = paddle.argmax(logits, axis=1)
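The switch from `loss.numpy()[0]` to `float(loss)` avoids indexing into what newer Paddle releases return as a zero-dimensional array; a minimal sketch of the pattern (illustrative only):

```python
import paddle

# a scalar (0-D) loss tensor, as returned by typical loss functions
loss = paddle.nn.functional.mse_loss(paddle.ones([4]), paddle.zeros([4]))

# old: avg_loss = loss.numpy()[0]  # breaks once .numpy() yields a 0-D array
avg_loss = float(loss)             # works for any single-element tensor
print(avg_loss)                    # 1.0
```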

@ -22,7 +22,7 @@ Acoustic Model | Training Data | Token-based | Size | Descriptions | CER | WER |
Model | Pre-Train Method | Pre-Train Data | Finetune Data | Size | Descriptions | CER | WER | Example Link |
:-------------:| :------------:| :-----: | -----: | :-----: |:-----:| :-----: | :-----: | :-----: |
[Wav2vec2-large-960h-lv60-self Model](https://paddlespeech.bj.bcebos.com/wav2vec/wav2vec2-large-960h-lv60-self.pdparams) | wav2vec2 | Librispeech and LV-60k Dataset (5.3w h) | - | 1.18 GB |Pre-trained Wav2vec2.0 Model | - | - | - |
[Wav2vec2ASR-large-960h-librispeech Model](https://paddlespeech.bj.bcebos.com/s2t/librispeech/asr3/wav2vec2ASR-large-960h-librispeech_ckpt_1.3.1.model.tar.gz) | wav2vec2 | Librispeech and LV-60k Dataset (5.3w h) | Librispeech (960 h) | 718 MB |Encoder: Wav2vec2.0, Decoder: CTC, Decoding method: Greedy search | - | 0.0189 | [Wav2vecASR Librispeech ASR3](../../examples/librispeech/asr3) |
### Language Model based on NGram
Language Model | Training Data | Token-based | Size | Descriptions

@ -509,7 +509,7 @@
" optimizer.clear_grad()\n",
"\n",
" # Calculate loss\n",
" avg_loss += loss.numpy()[0]\n",
" avg_loss += float(loss)\n",
"\n",
" # Calculate metrics\n",
" preds = paddle.argmax(logits, axis=1)\n",

@ -1,5 +1,5 @@
# ERNIE-SAT with AISHELL-3 dataset
[ERNIE-SAT](https://arxiv.org/abs/2211.03545) is a speech-text joint pretraining framework that achieves SOTA results in cross-lingual multi-speaker speech synthesis and cross-lingual speech editing tasks. It can be applied to a series of scenarios such as Speech Editing, personalized Speech Synthesis, and Voice Cloning.
## Model Framework
In ERNIE-SAT, we propose two innovations:

@ -0,0 +1 @@
../../../csmsc/tts3/local/export2lite.sh

@ -58,3 +58,13 @@ fi
if [ ${stage} -le 6 ] && [ ${stop_stage} -ge 6 ]; then
./local/ort_predict.sh ${train_output_path}
fi
if [ ${stage} -le 7 ] && [ ${stop_stage} -ge 7 ]; then
# This model is not supported, because 3 ops are not supported on 'arm'. These unsupported ops are: 'round, set_value, share_data'.
# This model is not supported, because 4 ops are not supported on 'x86'. These unsupported ops are: 'matmul_v2, round, set_value, share_data'.
# ./local/export2lite.sh ${train_output_path} inference pdlite fastspeech2_aishell3 x86
# x86 ok, arm Segmentation fault
# ./local/export2lite.sh ${train_output_path} inference pdlite pwgan_aishell3 x86
# x86 ok, arm ok
./local/export2lite.sh ${train_output_path} inference pdlite hifigan_aishell3 x86
fi

@ -1,5 +1,5 @@
# ERNIE-SAT with AISHELL-3 and VCTK dataset
ERNIE-SAT speech-text joint pretraining framework, which achieves SOTA results in cross-lingual multi-speaker speech synthesis and cross-lingual speech editing tasks, It can be applied to a series of scenarios such as Speech Editing, personalized Speech Synthesis, and Voice Cloning.
[ERNIE-SAT](https://arxiv.org/abs/2211.03545) speech-text joint pretraining framework, which achieves SOTA results in cross-lingual multi-speaker speech synthesis and cross-lingual speech editing tasks, It can be applied to a series of scenarios such as Speech Editing, personalized Speech Synthesis, and Voice Cloning.
## Model Framework
In ERNIE-SAT, we propose two innovations:

@ -0,0 +1 @@
../../tts3/local/export2lite.sh

@ -60,3 +60,16 @@ fi
if [ ${stage} -le 6 ] && [ ${stop_stage} -ge 6 ]; then
./local/ort_predict.sh ${train_output_path}
fi
# must run after stage 3 (the stage that generates the static models)
if [ ${stage} -le 7 ] && [ ${stop_stage} -ge 7 ]; then
# This model is not supported, because 3 ops are not supported on 'arm'. These unsupported ops are: 'round, set_value, share_data'.
# This model is not supported, because 4 ops are not supported on 'x86'. These unsupported ops are: 'matmul_v2, round, set_value, share_data'.
./local/export2lite.sh ${train_output_path} inference pdlite speedyspeech_csmsc x86
# x86 ok, arm Segmentation fault
# ./local/export2lite.sh ${train_output_path} inference pdlite pwgan_csmsc x86
# x86 ok, arm Segmentation fault
# ./local/export2lite.sh ${train_output_path} inference pdlite mb_melgan_csmsc x86
# x86 ok, arm ok
# ./local/export2lite.sh ${train_output_path} inference pdlite hifigan_csmsc x86
fi

@ -0,0 +1,18 @@
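# Convert an exported static-graph TTS model to a Paddle-Lite model via paddle_lite_opt.
# Positional args: <train_output_path> <model_dir> <output_dir> <model> <valid_targets>,
# e.g. ./local/export2lite.sh ${train_output_path} inference pdlite fastspeech2_csmsc x86
# (matching the stage-7 calls in the run.sh snippets of this change).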
train_output_path=$1
model_dir=$2
output_dir=$3
model=$4
valid_targets=$5
model_name=${model%_*}
echo model_name: ${model_name}
mkdir -p ${train_output_path}/${output_dir}
paddle_lite_opt \
--model_file ${train_output_path}/${model_dir}/${model}.pdmodel \
--param_file ${train_output_path}/${model_dir}/${model}.pdiparams \
--optimize_out ${train_output_path}/${output_dir}/${model}_${valid_targets} \
--valid_targets ${valid_targets}

@ -61,3 +61,16 @@ fi
if [ ${stage} -le 6 ] && [ ${stop_stage} -ge 6 ]; then
./local/ort_predict.sh ${train_output_path}
fi
# must run after stage 3 (the stage that generates the static models)
if [ ${stage} -le 7 ] && [ ${stop_stage} -ge 7 ]; then
# This model is not supported, because 3 ops are not supported on 'arm'. These unsupported ops are: 'round, set_value, share_data'.
# This model is not supported, because 4 ops are not supported on 'x86'. These unsupported ops are: 'matmul_v2, round, set_value, share_data'.
./local/export2lite.sh ${train_output_path} inference pdlite fastspeech2_csmsc x86
# x86 ok, arm Segmentation fault
# ./local/export2lite.sh ${train_output_path} inference pdlite pwgan_csmsc x86
# x86 ok, arm Segmentation fault
# ./local/export2lite.sh ${train_output_path} inference pdlite mb_melgan_csmsc x86
# x86 ok, arm ok
# ./local/export2lite.sh ${train_output_path} inference pdlite hifigan_csmsc x86
fi

@ -75,7 +75,6 @@ if [ ${stage} -le 8 ] && [ ${stop_stage} -ge 8 ]; then
fi
# paddle2onnx streaming
if [ ${stage} -le 9 ] && [ ${stop_stage} -ge 9 ]; then
# install paddle2onnx
version=$(echo `pip list |grep "paddle2onnx"` |awk -F" " '{print $2}')
@ -97,3 +96,34 @@ if [ ${stage} -le 10 ] && [ ${stop_stage} -ge 10 ]; then
./local/ort_predict_streaming.sh ${train_output_path}
fi
# must run after stage 3 (the stage that generates the static models)
if [ ${stage} -le 11 ] && [ ${stop_stage} -ge 11 ]; then
# This model is not supported, because 3 ops are not supported on 'arm'. These unsupported ops are: 'round, set_value, share_data'.
# This model is not supported, because 4 ops are not supported on 'x86'. These unsupported ops are: 'matmul_v2, round, set_value, share_data'.
./local/export2lite.sh ${train_output_path} inference pdlite fastspeech2_csmsc x86
# x86 ok, arm Segmentation fault
# ./local/export2lite.sh ${train_output_path} inference pdlite pwgan_csmsc x86
# x86 ok, arm Segmentation fault
# ./local/export2lite.sh ${train_output_path} inference pdlite mb_melgan_csmsc x86
# x86 ok, arm ok
# ./local/export2lite.sh ${train_output_path} inference pdlite hifigan_csmsc x86
fi
# must run after stage 5 (the stage that generates the static models)
if [ ${stage} -le 12 ] && [ ${stop_stage} -ge 12 ]; then
# streaming acoustic model
# This model is not supported, because 3 ops are not supported on 'arm'. These unsupported ops are: 'round, set_value, share_data'.
# This model is not supported, because 4 ops are not supported on 'x86'. These unsupported ops are: 'matmul_v2, round, set_value, share_data'.
# ./local/export2lite.sh ${train_output_path} inference pdlite fastspeech2_csmsc x86
./local/export2lite.sh ${train_output_path} inference_streaming pdlite_streaming fastspeech2_csmsc_am_encoder_infer x86
# x86 ok, arm Segmentation fault
./local/export2lite.sh ${train_output_path} inference_streaming pdlite_streaming fastspeech2_csmsc_am_decoder x86
# x86 ok, arm Segmentation fault
./local/export2lite.sh ${train_output_path} inference_streaming pdlite_streaming fastspeech2_csmsc_am_postnet x86
# x86 ok, arm Segmentation fault
# ./local/export2lite.sh ${train_output_path} inference_streaming pdlite_streaming pwgan_csmsc x86
# x86 ok, arm Segmentation fault
# ./local/export2lite.sh ${train_output_path} inference_streaming pdlite_streaming mb_melgan_csmsc x86
# x86 ok, arm ok
# ./local/export2lite.sh ${train_output_path} inference_streaming pdlite_streaming hifigan_csmsc x86
fi

@ -70,7 +70,6 @@ train_manifest: data/manifest.train
dev_manifest: data/manifest.dev
test_manifest: data/manifest.test-clean
###########################################
# Dataloader #
###########################################
@ -95,6 +94,12 @@ dist_sampler: True
shortest_first: True
return_lens_rate: True
############################################
# Data Augmentation #
############################################
audio_augment: # for raw audio
sample_rate: 16000
speeds: [95, 100, 105]
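# 'speeds' are resampling ratios in percent of the original rate (95/100/105, i.e. +-5% speed
# perturbation); together with 'sample_rate' they are passed through to TimeDomainSpecAugment.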
###########################################
# Training #
@ -115,6 +120,3 @@ log_interval: 1
checkpoint:
kbest_n: 50
latest_n: 5

@ -0,0 +1 @@
../../../csmsc/tts3/local/export2lite.sh

@ -59,3 +59,14 @@ fi
if [ ${stage} -le 6 ] && [ ${stop_stage} -ge 6 ]; then
./local/ort_predict.sh ${train_output_path}
fi
# must run after stage 3 (the stage that generates the static models)
if [ ${stage} -le 7 ] && [ ${stop_stage} -ge 7 ]; then
# This model is not supported, because 3 ops are not supported on 'arm'. These unsupported ops are: 'round, set_value, share_data'.
# This model is not supported, because 4 ops are not supported on 'x86'. These unsupported ops are: 'matmul_v2, round, set_value, share_data'.
./local/export2lite.sh ${train_output_path} inference pdlite fastspeech2_ljspeech x86
# x86 ok, arm Segmentation fault
# ./local/export2lite.sh ${train_output_path} inference pdlite pwgan_ljspeech x86
# x86 ok, arm ok
# ./local/export2lite.sh ${train_output_path} inference pdlite hifigan_ljspeech x86
fi

@ -4,3 +4,6 @@ Run the following script to get started, for more detail, please see `run.sh`.
```bash
./run.sh
```
# Rhythm tags for MFA
If you want to get rhythm tags with duration through the MFA tool, you may add the flag `--rhy-with-duration` to the first two commands in `run.sh`, as illustrated below.
Note that only the CSMSC dataset is supported so far, and we replace `#` with `sp` in rhythm tags for MFA.
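For illustration, a hypothetical invocation of the data-reorganization step with the new flag (the script name and paths are placeholders; check the actual commands in `run.sh`):

```bash
# hypothetical example: append --rhy-with-duration to the Baker reorganization command
python3 local/reorganize_baker.py \
    --root-dir=~/datasets/BZNSYP \
    --output-dir=./baker_corpus \
    --resample-audio \
    --rhy-with-duration
```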

@ -182,12 +182,17 @@ if __name__ == "__main__":
"--with-tone", action="store_true", help="whether to consider tone.")
parser.add_argument(
"--with-r", action="store_true", help="whether to consider erhua.")
parser.add_argument(
"--rhy-with-duration",
action="store_true", )
args = parser.parse_args()
lexicon = generate_lexicon(args.with_tone, args.with_r)
symbols = generate_symbols(lexicon)
with open(args.output + ".lexicon", 'wt') as f:
if args.rhy_with_duration:
f.write("sp1 sp1\nsp2 sp2\nsp3 sp3\nsp4 sp4\n")
for k, v in lexicon.items():
f.write(f"{k} {v}\n")

@ -23,6 +23,7 @@ for more details.
"""
import argparse
import os
import re
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
@ -32,6 +33,22 @@ import librosa
import soundfile as sf
from tqdm import tqdm
repalce_dict = {
"": "",
"": "",
"": "",
"": "",
"": "",
"": "",
"": "",
"": "",
"": "",
"": "",
"": "",
"": "",
"": ""
}
def get_transcripts(path: Union[str, Path]):
transcripts = {}
@ -55,8 +72,12 @@ def resample_and_save(source, target, sr=16000):
def reorganize_baker(root_dir: Union[str, Path],
output_dir: Union[str, Path]=None,
resample_audio=False,
rhy_dur=False):
root_dir = Path(root_dir).expanduser()
if rhy_dur:
transcript_path = root_dir / "ProsodyLabeling" / "000001-010000_rhy.txt"
else:
transcript_path = root_dir / "ProsodyLabeling" / "000001-010000.txt"
transcriptions = get_transcripts(transcript_path)
@ -92,6 +113,46 @@ def reorganize_baker(root_dir: Union[str, Path],
print("Done!")
def insert_rhy(sentence_first, sentence_second):
sub = '#'
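# sentence_first: prosody-labelled text such as "...#2...#3...#4"; sentence_second: its pinyin syllables.
# For every "#N" tag found in the text, an "spN" token is inserted at the matching position in the
# pinyin sequence (indices are shifted to account for the two characters each "#N" tag occupies).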
return_words = []
sentence_first = sentence_first.translate(str.maketrans(repalce_dict))
rhy_idx = [substr.start() for substr in re.finditer(sub, sentence_first)]
re_rhy_idx = []
sentence_first_ = sentence_first.replace("#1", "").replace(
"#2", "").replace("#3", "").replace("#4", "")
sentence_seconds = sentence_second.split(" ")
for i, w in enumerate(rhy_idx):
re_rhy_idx.append(w - i * 2)
i = 0
# print("re_rhy_idx: ", re_rhy_idx)
for sentence_s in (sentence_seconds):
return_words.append(sentence_s)
if i < len(re_rhy_idx) and len(return_words) - i == re_rhy_idx[i]:
return_words.append("sp" + sentence_first[rhy_idx[i] + 1:rhy_idx[i]
+ 2])
i = i + 1
return return_words
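# hypothetical example (not taken from the dataset):
# insert_rhy("卡尔普#2陪外孙#4玩滑梯", "ka2 er2 pu3 pei2 wai4 sun1 wan2 hua2 ti1")
# -> ['ka2', 'er2', 'pu3', 'sp2', 'pei2', 'wai4', 'sun1', 'sp4', 'wan2', 'hua2', 'ti1']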
def normalize_rhy(root_dir: Union[str, Path]):
root_dir = Path(root_dir).expanduser()
transcript_path = root_dir / "ProsodyLabeling" / "000001-010000.txt"
target_transcript_path = root_dir / "ProsodyLabeling" / "000001-010000_rhy.txt"
with open(transcript_path) as f:
lines = f.readlines()
with open(target_transcript_path, 'wt') as f:
for i in range(0, len(lines), 2):
sentence_first = lines[i]  # write the first line (text with prosody tags) through unchanged
f.write(sentence_first)
transcription = lines[i + 1].strip()
f.write("\t" + " ".join(
insert_rhy(sentence_first.split('\t')[1], transcription)) +
"\n")
if __name__ == "__main__":
parser = argparse.ArgumentParser(
description="Reorganize Baker dataset for MFA")
@ -104,6 +165,12 @@ if __name__ == "__main__":
"--resample-audio",
action="store_true",
help="To resample audio files or just copy them")
parser.add_argument(
"--rhy-with-duration",
action="store_true", )
args = parser.parse_args()
reorganize_baker(args.root_dir, args.output_dir, args.resample_audio)
if args.rhy_with_duration:
normalize_rhy(args.root_dir)
reorganize_baker(args.root_dir, args.output_dir, args.resample_audio,
args.rhy_with_duration)

@ -123,3 +123,5 @@ iPad Pro的秒控键盘这次也推出白色版本。|iPad Pro的秒控键盘这
985|九八五
12~23|十二到二十三
12-23|十二到二十三
25cm²|二十五平方厘米
25m|米

@ -1,5 +1,5 @@
# ERNIE-SAT with VCTK dataset
[ERNIE-SAT](https://arxiv.org/abs/2211.03545) is a speech-text joint pretraining framework that achieves SOTA results in cross-lingual multi-speaker speech synthesis and cross-lingual speech editing tasks. It can be applied to a series of scenarios such as Speech Editing, personalized Speech Synthesis, and Voice Cloning.
## Model Framework
In ERNIE-SAT, we propose two innovations:

@ -0,0 +1 @@
../../../csmsc/tts3/local/export2lite.sh

@ -58,3 +58,14 @@ fi
if [ ${stage} -le 6 ] && [ ${stop_stage} -ge 6 ]; then
./local/ort_predict.sh ${train_output_path}
fi
# must run after stage 3 (the stage that generates the static models)
if [ ${stage} -le 7 ] && [ ${stop_stage} -ge 7 ]; then
# This model is not supported, because 3 ops are not supported on 'arm'. These unsupported ops are: 'round, set_value, share_data'.
# This model is not supported, because 4 ops are not supported on 'x86'. These unsupported ops are: 'matmul_v2, round, set_value, share_data'.
./local/export2lite.sh ${train_output_path} inference pdlite fastspeech2_vctk x86
# x86 ok, arm Segmentation fault
# ./local/export2lite.sh ${train_output_path} inference pdlite pwgan_vctk x86
# x86 ok, arm ok
# ./local/export2lite.sh ${train_output_path} inference pdlite hifigan_vctk x86
fi

@ -101,7 +101,7 @@ if __name__ == "__main__":
optimizer.clear_grad()
# Calculate loss
avg_loss += float(loss)
# Calculate metrics
preds = paddle.argmax(logits, axis=1)

@ -110,7 +110,7 @@ if __name__ == '__main__':
optimizer.clear_grad()
# Calculate loss
avg_loss += float(loss)
# Calculate metrics
num_corrects += corrects

@ -71,6 +71,7 @@ class Wav2Vec2ASRTrainer(Trainer):
wavs_lens_rate = wavs_lens / wav.shape[1]
target_lens_rate = target_lens / target.shape[1]
wav = wav[:, :, 0]
if hasattr(train_conf, 'speech_augment'):
wav = self.speech_augmentation(wav, wavs_lens_rate)
loss = self.model(wav, wavs_lens_rate, target, target_lens_rate)
# loss div by `batch_size * accum_grad`
@ -277,7 +278,9 @@ class Wav2Vec2ASRTrainer(Trainer):
logger.info("Setup model!")
# setup speech augmentation for wav2vec2
if hasattr(config, 'audio_augment') and self.train:
self.speech_augmentation = TimeDomainSpecAugment(
**config.audio_augment)
if not self.train:
return
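For reference, the `audio_augment` keys added to the config above are unpacked directly into the augmentation module; a minimal sketch (assuming `TimeDomainSpecAugment` is imported from the wav2vec2 speech-augmentation module shown later in this diff):

```python
# build the augmentation only when the config provides an 'audio_augment' block
audio_augment = {"sample_rate": 16000, "speeds": [95, 100, 105]}  # mirrors the YAML block above
speech_augmentation = TimeDomainSpecAugment(**audio_augment)
# during training, raw waveforms are perturbed before being fed to wav2vec2:
# wav = speech_augmentation(wav, wavs_lens_rate)
```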

@ -641,14 +641,11 @@ class DropChunk(nn.Layer):
class TimeDomainSpecAugment(nn.Layer):
"""A time-domain approximation of the SpecAugment algorithm.
This augmentation module implements three augmentations in
the time-domain.
1. Drop chunks of the audio (zero amplitude or white noise)
2. Drop frequency bands (with band-drop filters)
3. Speed peturbation (via resampling to slightly different rate)
Arguments
---------
perturb_prob : float from 0 to 1
@ -677,7 +674,6 @@ class TimeDomainSpecAugment(nn.Layer):
drop_chunk_noise_factor : float
The noise factor used to scale the white noise inserted, relative to
the average amplitude of the utterance. Default 0 (no noise inserted).
Example
-------
>>> inputs = paddle.randn([10, 16000])
@ -718,7 +714,6 @@ class TimeDomainSpecAugment(nn.Layer):
def forward(self, waveforms, lengths):
"""Returns the distorted waveforms.
Arguments
---------
waveforms : tensor

@ -0,0 +1,10 @@
0001 考古人员<speak>西<say-as pinyin='zang4'>藏</say-as>布达拉宫里发现一个被隐<say-as pinyin="cang2">藏</say-as>的装有宝<say-as pinyin="zang4">藏</say-as></speak>箱子。
0002 <speak>有人询问中国银<say-as pinyin='hang2'>行</say-as>北京分<say-as pinyin='hang2 hang2'>行行</say-as>长是否叫任我<say-as pinyin='xing2'>行</say-as></speak>。
0003 <speak>市委书记亲自<say-as pinyin='shuai4'>率</say-as>领审计员对这家公司进行财务审计,发现企业的利润<say-as pinyin='lv4'>率</say-as>数据虚假</speak>。
0004 <speak>学生们对代<say-as pinyin='shu4'>数</say-as>理解不深刻,特别是小<say-as pinyin='shu4'>数</say-as>点,在<say-as pinyin='shu3 shu4'>数数</say-as>时容易弄错</speak>。
0005 <speak>赵<say-as pinyin='chang2'>长</say-as>军从小学习武术,擅<say-as pinyin='chang2'>长</say-as>散打,<say-as pinyin='zhang3'>长</say-as>大后参军,担任连<say-as pinyin='zhang3'>长</say-as></speak>。
0006 <speak>我说她<say-as pinyin='zhang3'>涨</say-as>了工资,她就<say-as pinyin='zhang4'>涨</say-as>红着脸,摇头否认</speak>。
0007 <speak>请把这封信交<say-as pinyin='gei3'>给</say-as>团长,告诉他,前线的供<say-as pinyin='ji3'>给</say-as>一定要有保障</speak>。
0008 <speak>矿下的<say-as pinyin='hang4'>巷</say-as>道,与北京四合院的小<say-as pinyin='xiang4'>巷</say-as>有点相似</speak>。
0009 <speak>他常叹自己命<say-as pinyin='bo2'>薄</say-as>,几亩<say-as pinyin='bao2'>薄</say-as>田,种点<say-as pinyin='bo4'>薄</say-as>荷</speak>。
0010 <speak>小明对天相很有研究,在<say-as pinyin='su4'>宿</say-as>舍说了一<say-as pinyin='xiu3'>宿</say-as>有关星<say-as pinyin='xiu4'>宿</say-as>的常识</speak>。

@ -18,6 +18,25 @@ from .num import num2str
# temperature expressions; the temperature value affects how the minus sign is read
# e.g. -3°C -> 零下三度
RE_TEMPERATURE = re.compile(r'(-?)(\d+(\.\d+)?)(°C|℃|度|摄氏度)')
measure_dict = {
"cm2": "平方厘米",
"cm²": "平方厘米",
"cm3": "立方厘米",
"cm³": "立方厘米",
"cm": "厘米",
"db": "分贝",
"ds": "毫秒",
"kg": "千克",
"km": "千米",
"m2": "平方米",
"": "平方米",
"": "立方米",
"m3": "立方米",
"ml": "毫升",
"m": "",
"mm": "毫米",
"s": ""
}
def replace_temperature(match) -> str:
@ -35,3 +54,10 @@ def replace_temperature(match) -> str:
unit: str = "摄氏度" if unit == "摄氏度" else "度"
result = f"{sign}{temperature}{unit}"
return result
def replace_measure(sentence) -> str:
for q_notation in measure_dict:
if q_notation in sentence:
sentence = sentence.replace(q_notation, measure_dict[q_notation])
return sentence
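A quick illustration of the new rule (a minimal sketch; the import path is assumed from the package layout shown below):

```python
from paddlespeech.t2s.frontend.zh_normalization.quantifier import replace_measure  # path assumed

print(replace_measure("25cm²"))  # -> 25平方厘米 (digits are expanded to Chinese later in the pipeline)
print(replace_measure("0.5kg"))  # -> 0.5千克
```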

@ -46,6 +46,7 @@ from .phonecode import RE_TELEPHONE
from .phonecode import replace_mobile
from .phonecode import replace_phone
from .quantifier import RE_TEMPERATURE
from .quantifier import replace_measure
from .quantifier import replace_temperature
@ -91,6 +92,7 @@ class TextNormalizer():
sentence = RE_TIME.sub(replace_time, sentence)
sentence = RE_TEMPERATURE.sub(replace_temperature, sentence)
sentence = replace_measure(sentence)
sentence = RE_FRAC.sub(replace_frac, sentence)
sentence = RE_PERCENTAGE.sub(replace_percentage, sentence)
sentence = RE_MOBILE_PHONE.sub(replace_mobile, sentence)

@ -2,10 +2,10 @@
## Testing with Aishell Test Data
### Download wav and model
```
./run.sh --stop_stage 0
```
### compute feature
@ -22,7 +22,6 @@ run.sh --stop_stage 0
### decoding using wav
```
./run.sh --stage 3 --stop_stage 3
```

@ -2,9 +2,11 @@
7176 utts, duration 36108.9 sec.
## U2++ Attention Rescore
> Intel(R) Xeon(R) Gold 6271C CPU @ 2.60GHz, support `avx512_vnni`
> RTF with feature and decoder which is more end to end.
### FP32
#### CER
@ -17,20 +19,29 @@ Other -> 100.00 % N=3 C=0 S=3 D=0 I=0
#### RTF
```
I1027 10:52:38.662868 51665 u2_recognizer_main.cc:122] total wav duration is: 36108.9 sec
I1027 10:52:38.662858 51665 u2_recognizer_main.cc:121] total cost:11169.1 sec
I1027 10:52:38.662876 51665 u2_recognizer_main.cc:123] RTF is: 0.309318
```
### INT8
> RTF improves by about 12.8% relative to FP32, counting both feature extraction and decoding time.
#### CER
```
Overall -> 5.83 % N=104765 C=98943 S=5675 D=147 I=286
Mandarin -> 5.83 % N=104762 C=98943 S=5672 D=147 I=286
English -> 0.00 % N=0 C=0 S=0 D=0 I=0
Other -> 100.00 % N=3 C=0 S=3 D=0 I=0
```
#### RTF
```
I1110 09:59:52.551712 37249 u2_recognizer_main.cc:122] total wav duration is: 36108.9 sec
I1110 09:59:52.551717 37249 u2_recognizer_main.cc:123] total decode cost:9737.63 sec
I1110 09:59:52.551723 37249 u2_recognizer_main.cc:124] RTF is: 0.269674
```

@ -10,7 +10,7 @@ mkdir -p $exp
ckpt_dir=./data/model
model_dir=$ckpt_dir/asr1_chunk_conformer_u2pp_wenetspeech_static_1.3.0.model/
utils/run.pl JOB=1:$nj $data/split${nj}/JOB/decoder.log \
ctc_prefix_beam_search_decoder_main \
--model_path=$model_dir/export.jit \
--vocab_path=$model_dir/unit.txt \

@ -1,18 +1,21 @@
#!/bin/bash
set -x
set -e
. path.sh
nj=20
data=data
exp=exp
mkdir -p $exp
ckpt_dir=./data/model
model_dir=$ckpt_dir/asr1_chunk_conformer_u2pp_wenetspeech_static_1.3.0.model/
utils/run.pl JOB=1:$nj $data/split${nj}/JOB/nnet.log \
u2_nnet_main \
--model_path=$model_dir/export.jit \
--vocab_path=$model_dir/unit.txt \
--feature_rspecifier=ark,t:${data}/split${nj}/JOB/fbank.ark \
--nnet_decoder_chunk=16 \
--receptive_field_length=7 \
--subsampling_rate=4 \
@ -20,4 +23,3 @@ u2_nnet_main \
--nnet_encoder_outs_wspecifier=ark,t:$exp/encoder_outs.ark \
--nnet_prob_wspecifier=ark,t:$exp/logprobs.ark
echo "u2 nnet decode."

@ -24,8 +24,6 @@ fi
ckpt_dir=$data/model
model_dir=$ckpt_dir/asr1_chunk_conformer_u2pp_wenetspeech_static_1.3.0.model/
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ];then
# download u2pp model

@ -32,7 +32,6 @@ class DataCache : public FrontendInterface {
// accept waves/feats
void Accept(const kaldi::VectorBase<kaldi::BaseFloat>& inputs) override {
data_ = inputs;
}
bool Read(kaldi::Vector<kaldi::BaseFloat>* feats) override {
@ -41,7 +40,6 @@ class DataCache : public FrontendInterface {
}
(*feats) = data_;
data_.Resize(0);
return true;
}

@ -71,6 +71,7 @@ bool Decodable::AdvanceChunk() {
VLOG(3) << "decodable exit;";
return false;
}
CHECK_GE(frontend_->Dim(), 0);
VLOG(1) << "AdvanceChunk feat cost: " << timer.Elapsed() << " sec.";
VLOG(2) << "Forward in " << features.Dim() / frontend_->Dim() << " feats.";

@ -13,6 +13,7 @@
# limitations under the License.
import paddle
import torch
from paddle.device.cuda import synchronize
from parallel_wavegan.layers import residual_block
from parallel_wavegan.layers import upsample
from parallel_wavegan.models import parallel_wavegan as pwgan
@ -24,7 +25,6 @@ from paddlespeech.t2s.models.parallel_wavegan import PWGGenerator
from paddlespeech.t2s.models.parallel_wavegan import ResidualBlock
from paddlespeech.t2s.models.parallel_wavegan import ResidualPWGDiscriminator
from paddlespeech.t2s.utils.layer_tools import summary
paddle.set_device("gpu:0")
device = torch.device("cuda:0")
