Merge branch 'PaddlePaddle:develop' into develop

commit eef87bb7d4 by liangym (committed via GitHub)

@@ -157,13 +157,15 @@ Via the easy-to-use, efficient, flexible and scalable implementation, our vision
- 🧩 *Cascaded models application*: as an extension of the typical traditional audio tasks, we combine the workflows of the aforementioned tasks with other fields like Natural language processing (NLP) and Computer Vision (CV).
### Recent Update
+ - 🔥 2022.11.07: [U2/U2++ C++ High Performance Streaming ASR Deployment](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/speechx/examples/u2pp_ol/wenetspeech).
+ - 👑 2022.11.01: Add [Adversarial Loss](https://arxiv.org/pdf/1907.04448.pdf) for [Chinese English mixed TTS](./examples/zh_en_tts/tts3).
- 🔥 2022.10.26: Add [Prosody Prediction](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/rhy) for TTS.
- 🎉 2022.10.21: Add [SSML](https://github.com/PaddlePaddle/PaddleSpeech/discussions/2538) for TTS Chinese Text Frontend.
- 👑 2022.10.11: Add [Wav2vec2ASR](./examples/librispeech/asr3), wav2vec2.0 fine-tuning for ASR on LibriSpeech.
- - 🔥 2022.09.26: Add Voice Cloning, TTS finetune, and ERNIE-SAT in [PaddleSpeech Web Demo](./demos/speech_web).
+ - 🔥 2022.09.26: Add Voice Cloning, TTS finetune, and [ERNIE-SAT](https://arxiv.org/abs/2211.03545) in [PaddleSpeech Web Demo](./demos/speech_web).
- ⚡ 2022.09.09: Add AISHELL-3 Voice Cloning [example](./examples/aishell3/vc2) with ECAPA-TDNN speaker encoder.
- ⚡ 2022.08.25: Release TTS [finetune](./examples/other/tts_finetune/tts3) example.
- - 🔥 2022.08.22: Add ERNIE-SAT models: [ERNIE-SAT-vctk](./examples/vctk/ernie_sat)、[ERNIE-SAT-aishell3](./examples/aishell3/ernie_sat)、[ERNIE-SAT-zh_en](./examples/aishell3_vctk/ernie_sat).
+ - 🔥 2022.08.22: Add [ERNIE-SAT](https://arxiv.org/abs/2211.03545) models: [ERNIE-SAT-vctk](./examples/vctk/ernie_sat)、[ERNIE-SAT-aishell3](./examples/aishell3/ernie_sat)、[ERNIE-SAT-zh_en](./examples/aishell3_vctk/ernie_sat).
- 🔥 2022.08.15: Add [g2pW](https://github.com/GitYCC/g2pW) into TTS Chinese Text Frontend.
- 🔥 2022.08.09: Release [Chinese English mixed TTS](./examples/zh_en_tts/tts3).
- ⚡ 2022.08.03: Add ONNXRuntime infer for TTS CLI.
@@ -578,7 +580,7 @@ PaddleSpeech supports a series of most popular models. They are summarized in [r
</td>
</tr>
<tr>
- <td>ERNIE-SAT</td>
+ <td><a href = "https://arxiv.org/abs/2211.03545">ERNIE-SAT</a></td>
<td>VCTK / AISHELL-3 / ZH_EN</td>
<td>
<a href = "./examples/vctk/ernie_sat">ERNIE-SAT-vctk</a> / <a href = "./examples/aishell3/ernie_sat">ERNIE-SAT-aishell3</a> / <a href = "./examples/aishell3_vctk/ernie_sat">ERNIE-SAT-zh_en</a>
@@ -716,9 +718,9 @@ PaddleSpeech supports a series of most popular models. They are summarized in [r
<tr>
<td>Keyword Spotting</td>
<td>hey-snips</td>
- <td>PANN</td>
+ <td>MDTC</td>
<td>
- <a href = "./examples/hey_snips/kws0">pann-hey-snips</a>
+ <a href = "./examples/hey_snips/kws0">mdtc-hey-snips</a>
</td>
</tr>
</tbody>

@@ -164,13 +164,14 @@
### Recent Update
+ - 👑 2022.11.01: Add an [Adversarial Loss](https://arxiv.org/pdf/1907.04448.pdf) module to [Chinese-English mixed TTS](./examples/zh_en_tts/tts3).
- 🔥 2022.10.26: Add [Prosody Prediction](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/rhy) to TTS.
- 🎉 2022.10.21: Add [SSML](https://github.com/PaddlePaddle/PaddleSpeech/discussions/2538) support to the TTS Chinese text frontend.
- 👑 2022.10.11: Add [Wav2vec2ASR](./examples/librispeech/asr3), fine-tuning wav2vec2.0 for ASR on LibriSpeech.
- - 🔥 2022.09.26: Add Voice Cloning, TTS finetune, and ERNIE-SAT to the [PaddleSpeech web demo](./demos/speech_web).
+ - 🔥 2022.09.26: Add Voice Cloning, TTS finetune, and [ERNIE-SAT](https://arxiv.org/abs/2211.03545) to the [PaddleSpeech web demo](./demos/speech_web).
- ⚡ 2022.09.09: Add an AISHELL-3 Voice Cloning [example](./examples/aishell3/vc2) based on the ECAPA-TDNN speaker encoder.
- ⚡ 2022.08.25: Release the TTS [finetune](./examples/other/tts_finetune/tts3) example.
- - 🔥 2022.08.22: Add ERNIE-SAT models: [ERNIE-SAT-vctk](./examples/vctk/ernie_sat)、[ERNIE-SAT-aishell3](./examples/aishell3/ernie_sat)、[ERNIE-SAT-zh_en](./examples/aishell3_vctk/ernie_sat).
+ - 🔥 2022.08.22: Add [ERNIE-SAT](https://arxiv.org/abs/2211.03545) models: [ERNIE-SAT-vctk](./examples/vctk/ernie_sat)、[ERNIE-SAT-aishell3](./examples/aishell3/ernie_sat)、[ERNIE-SAT-zh_en](./examples/aishell3_vctk/ernie_sat).
- 🔥 2022.08.15: Introduce [g2pW](https://github.com/GitYCC/g2pW) into the TTS Chinese text frontend.
- 🔥 2022.08.09: Release [Chinese-English mixed TTS](./examples/zh_en_tts/tts3).
- ⚡ 2022.08.03: Add ONNXRuntime inference to the TTS CLI.
@@ -575,7 +576,7 @@ PaddleSpeech 的 **语音合成** 主要包含三个模块:文本前端、声
</td>
</tr>
<tr>
- <td>ERNIE-SAT</td>
+ <td><a href = "https://arxiv.org/abs/2211.03545">ERNIE-SAT</a></td>
<td>VCTK / AISHELL-3 / ZH_EN</td>
<td>
<a href = "./examples/vctk/ernie_sat">ERNIE-SAT-vctk</a> / <a href = "./examples/aishell3/ernie_sat">ERNIE-SAT-aishell3</a> / <a href = "./examples/aishell3_vctk/ernie_sat">ERNIE-SAT-zh_en</a>
@@ -696,9 +697,9 @@ PaddleSpeech 的 **语音合成** 主要包含三个模块:文本前端、声
</table>
- <a name="唤醒模型"></a>
+ <a name="语音唤醒模型"></a>
- **唤醒**
+ **语音唤醒**
<table style="width:100%">
<thead>
@@ -711,11 +712,11 @@ PaddleSpeech 的 **语音合成** 主要包含三个模块:文本前端、声
</thead>
<tbody>
<tr>
- <td>唤醒</td>
+ <td>语音唤醒</td>
<td>hey-snips</td>
- <td>PANN</td>
+ <td>MDTC</td>
<td>
- <a href = "./examples/hey_snips/kws0">pann-hey-snips</a>
+ <a href = "./examples/hey_snips/kws0">mdtc-hey-snips</a>
</td>
</tr>
</tbody>

@@ -0,0 +1,100 @@
([简体中文](./README_cn.md)|English)
# ASR Deployment by SpeechX
## Introduction
ASR deployment supports U2/U2++/DeepSpeech2 ASR models in C++, which is common practice for industrial deployment.
For more information about SpeechX, please see [here](../../speechx/README.md).
## Usage
### 1. Environment
* python - 3.7
* docker - `registry.baidubce.com/paddlepaddle/paddle:2.2.2-gpu-cuda10.2-cudnn7`
* os - Ubuntu 16.04.7 LTS
* gcc/g++/gfortran - 8.2.0
* cmake - 3.16.0
For more information, please see [here](../../speechx/README.md).
### 2. Compile SpeechX
Please see [here](../../speechx/README.md).
### 3. Usage
For the U2++ ASR deployment example, please see [here](../../speechx/examples/u2pp_ol/wenetspeech/).
First, go to the `speechx/speechx/examples/u2pp_ol/wenetspeech` directory.
- Source path.sh
```bash
source path.sh
```
- Download the model, prepare the test data and the CMVN file
```bash
run.sh --stage 0 --stop_stage 1
```
- Decode WAV files
```bash
# FP32
./local/recognizer.sh
# INT8
./local/recognizer_quant.sh
```
Output:
```bash
I1026 16:13:24.683531 48038 u2_recognizer_main.cc:55] utt: BAC009S0916W0495
I1026 16:13:24.683578 48038 u2_recognizer_main.cc:56] wav dur: 4.17119 sec.
I1026 16:13:24.683595 48038 u2_recognizer_main.cc:64] wav len (sample): 66739
I1026 16:13:25.037652 48038 u2_recognizer_main.cc:87] Pratial result: 3 这令
I1026 16:13:25.043697 48038 u2_recognizer_main.cc:87] Pratial result: 4 这令
I1026 16:13:25.222124 48038 u2_recognizer_main.cc:87] Pratial result: 5 这令被贷款
I1026 16:13:25.228385 48038 u2_recognizer_main.cc:87] Pratial result: 6 这令被贷款
I1026 16:13:25.414669 48038 u2_recognizer_main.cc:87] Pratial result: 7 这令被贷款的员工
I1026 16:13:25.420714 48038 u2_recognizer_main.cc:87] Pratial result: 8 这令被贷款的员工
I1026 16:13:25.608129 48038 u2_recognizer_main.cc:87] Pratial result: 9 这令被贷款的员工们请
I1026 16:13:25.801620 48038 u2_recognizer_main.cc:87] Pratial result: 10 这令被贷款的员工们请食难安
I1026 16:13:25.804101 48038 feature_cache.h:44] set finished
I1026 16:13:25.804128 48038 feature_cache.h:51] compute last feats done.
I1026 16:13:25.948771 48038 u2_recognizer_main.cc:87] Pratial result: 11 这令被贷款的员工们请食难安
I1026 16:13:26.246963 48038 u2_recognizer_main.cc:113] BAC009S0916W0495 这令被贷款的员工们请食难安
```
## Result
> CER is computed on the aishell test set.
> RTF is computed including feature extraction and decoding, i.e., closer to end-to-end.
> Machine: Intel(R) Xeon(R) Gold 6271C CPU @ 2.60GHz with avx512_vnni.
### FP32
```
Overall -> 5.75 % N=104765 C=99035 S=5587 D=143 I=294
Mandarin -> 5.75 % N=104762 C=99035 S=5584 D=143 I=294
English -> 0.00 % N=0 C=0 S=0 D=0 I=0
Other -> 100.00 % N=3 C=0 S=3 D=0 I=0
```
```
RTF is: 0.315337
```
### INT8
```
Overall -> 5.83 % N=104765 C=98943 S=5675 D=147 I=286
Mandarin -> 5.83 % N=104762 C=98943 S=5672 D=147 I=286
English -> 0.00 % N=0 C=0 S=0 D=0 I=0
Other -> 100.00 % N=3 C=0 S=3 D=0 I=0
```
```
RTF is: 0.269674
```

@@ -0,0 +1,96 @@
([简体中文](./README_cn.md)|English)
# ASR Deployment Based on SpeechX
## Introduction
C++ deployment of U2/U2++/DeepSpeech2 ASR models is supported; it is widely used in industrial practice.
For more information about SpeechX, see the [documentation](../../speechx/README.md).
## Usage
### 1. Environment
* python - 3.7
* docker - `registry.baidubce.com/paddlepaddle/paddle:2.2.2-gpu-cuda10.2-cudnn7`
* os - Ubuntu 16.04.7 LTS
* gcc/g++/gfortran - 8.2.0
* cmake - 3.16.0
For more information, see the [documentation](../../speechx/README.md).
### 2. Compile SpeechX
See the [documentation](../../speechx/README.md).
### 3. Example
For the U2++ recognition deployment example, see [here](../../speechx/examples/u2pp_ol/wenetspeech/).
The following commands are run from the `speechx/speechx/examples/u2pp_ol/wenetspeech` directory.
- Source path.sh
```bash
source path.sh
```
- Download the model, prepare the test data and the CMVN file
```bash
run.sh --stage 0 --stop_stage 1
```
- Decode
```bash
# FP32
./local/recognizer.sh
# INT8
./local/recognizer_quant.sh
```
Output:
```bash
I1026 16:13:24.683531 48038 u2_recognizer_main.cc:55] utt: BAC009S0916W0495
I1026 16:13:24.683578 48038 u2_recognizer_main.cc:56] wav dur: 4.17119 sec.
I1026 16:13:24.683595 48038 u2_recognizer_main.cc:64] wav len (sample): 66739
I1026 16:13:25.037652 48038 u2_recognizer_main.cc:87] Pratial result: 3 这令
I1026 16:13:25.043697 48038 u2_recognizer_main.cc:87] Pratial result: 4 这令
I1026 16:13:25.222124 48038 u2_recognizer_main.cc:87] Pratial result: 5 这令被贷款
I1026 16:13:25.228385 48038 u2_recognizer_main.cc:87] Pratial result: 6 这令被贷款
I1026 16:13:25.414669 48038 u2_recognizer_main.cc:87] Pratial result: 7 这令被贷款的员工
I1026 16:13:25.420714 48038 u2_recognizer_main.cc:87] Pratial result: 8 这令被贷款的员工
I1026 16:13:25.608129 48038 u2_recognizer_main.cc:87] Pratial result: 9 这令被贷款的员工们请
I1026 16:13:25.801620 48038 u2_recognizer_main.cc:87] Pratial result: 10 这令被贷款的员工们请食难安
I1026 16:13:25.804101 48038 feature_cache.h:44] set finished
I1026 16:13:25.804128 48038 feature_cache.h:51] compute last feats done.
I1026 16:13:25.948771 48038 u2_recognizer_main.cc:87] Pratial result: 11 这令被贷款的员工们请食难安
I1026 16:13:26.246963 48038 u2_recognizer_main.cc:113] BAC009S0916W0495 这令被贷款的员工们请食难安
```
## Result
> CER is computed on the aishell test set.
> RTF includes feature extraction and decoding.
> Test machine: Intel(R) Xeon(R) Gold 6271C CPU @ 2.60GHz with avx512_vnni.
### FP32
```
Overall -> 5.75 % N=104765 C=99035 S=5587 D=143 I=294
Mandarin -> 5.75 % N=104762 C=99035 S=5584 D=143 I=294
English -> 0.00 % N=0 C=0 S=0 D=0 I=0
Other -> 100.00 % N=3 C=0 S=3 D=0 I=0
```
```
RTF is: 0.315337
```
### INT8
```
Overall -> 5.87 % N=104765 C=98909 S=5711 D=145 I=289
Mandarin -> 5.86 % N=104762 C=98909 S=5708 D=145 I=289
English -> 0.00 % N=0 C=0 S=0 D=0 I=0
Other -> 100.00 % N=3 C=0 S=3 D=0 I=0
```

@@ -108,7 +108,7 @@ for epoch in range(1, epochs + 1):
    optimizer.clear_grad()
    # Calculate loss
-     avg_loss = loss.numpy()[0]
+     avg_loss = float(loss)
    # Calculate metrics
    preds = paddle.argmax(logits, axis=1)
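A note on why this one-liner is needed (my illustration, not part of the patch): newer Paddle releases return reduced losses as 0-D tensors, for which `loss.numpy()[0]` raises an IndexError, while `float(loss)` works for both 0-D and shape-`[1]` tensors.

```python
import paddle

loss = paddle.to_tensor(0.25)   # a reduced loss: 0-D on newer Paddle, shape [1] on older
print(float(loss))              # 0.25 -- works in both cases
# loss.numpy()[0]               # breaks with IndexError once the loss is a 0-D tensor
```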

@@ -22,7 +22,7 @@ Acoustic Model | Training Data | Token-based | Size | Descriptions | CER | WER |
Model | Pre-Train Method | Pre-Train Data | Finetune Data | Size | Descriptions | CER | WER | Example Link |
:-------------:| :------------:| :-----: | -----: | :-----: |:-----:| :-----: | :-----: | :-----: |
[Wav2vec2-large-960h-lv60-self Model](https://paddlespeech.bj.bcebos.com/wav2vec/wav2vec2-large-960h-lv60-self.pdparams) | wav2vec2 | Librispeech and LV-60k Dataset (5.3w h) | - | 1.18 GB | Pre-trained Wav2vec2.0 Model | - | - | - |
- [Wav2vec2ASR-large-960h-librispeech Model](https://paddlespeech.bj.bcebos.com/s2t/librispeech/asr3/wav2vec2ASR-large-960h-librispeech_ckpt_1.3.0.model.tar.gz) | wav2vec2 | Librispeech and LV-60k Dataset (5.3w h) | Librispeech (960 h) | 1.18 GB | Encoder: Wav2vec2.0, Decoder: CTC, Decoding method: Greedy search | - | 0.0189 | [Wav2vecASR Librispeech ASR3](../../examples/librispeech/asr3) |
+ [Wav2vec2ASR-large-960h-librispeech Model](https://paddlespeech.bj.bcebos.com/s2t/librispeech/asr3/wav2vec2ASR-large-960h-librispeech_ckpt_1.3.1.model.tar.gz) | wav2vec2 | Librispeech and LV-60k Dataset (5.3w h) | Librispeech (960 h) | 718 MB | Encoder: Wav2vec2.0, Decoder: CTC, Decoding method: Greedy search | - | 0.0189 | [Wav2vecASR Librispeech ASR3](../../examples/librispeech/asr3) |
### Language Model based on NGram
Language Model | Training Data | Token-based | Size | Descriptions

@@ -509,7 +509,7 @@
" optimizer.clear_grad()\n",
"\n",
" # Calculate loss\n",
- " avg_loss += loss.numpy()[0]\n",
+ " avg_loss += float(loss)\n",
"\n",
" # Calculate metrics\n",
" preds = paddle.argmax(logits, axis=1)\n",

@@ -1,5 +1,5 @@
# ERNIE-SAT with AISHELL-3 dataset
- ERNIE-SAT is a speech-text joint pretraining framework that achieves SOTA results in cross-lingual multi-speaker speech synthesis and cross-lingual speech editing tasks. It can be applied to a series of scenarios such as Speech Editing, personalized Speech Synthesis, and Voice Cloning.
+ [ERNIE-SAT](https://arxiv.org/abs/2211.03545) is a speech-text joint pretraining framework that achieves SOTA results in cross-lingual multi-speaker speech synthesis and cross-lingual speech editing tasks. It can be applied to a series of scenarios such as Speech Editing, personalized Speech Synthesis, and Voice Cloning.
## Model Framework
In ERNIE-SAT, we propose two innovations:

@@ -0,0 +1 @@
../../../csmsc/tts3/local/export2lite.sh

@@ -58,3 +58,13 @@ fi
if [ ${stage} -le 6 ] && [ ${stop_stage} -ge 6 ]; then
./local/ort_predict.sh ${train_output_path}
fi
if [ ${stage} -le 7 ] && [ ${stop_stage} -ge 7 ]; then
# This model is not supported, because 3 ops are not supported on 'arm'. These unsupported ops are: 'round, set_value, share_data'.
# This model is not supported, because 4 ops are not supported on 'x86'. These unsupported ops are: 'matmul_v2, round, set_value, share_data'.
# ./local/export2lite.sh ${train_output_path} inference pdlite fastspeech2_aishell3 x86
# x86 ok, arm Segmentation fault
# ./local/export2lite.sh ${train_output_path} inference pdlite pwgan_aishell3 x86
# x86 ok, arm ok
./local/export2lite.sh ${train_output_path} inference pdlite hifigan_aishell3 x86
fi

@@ -1,5 +1,5 @@
# ERNIE-SAT with AISHELL-3 and VCTK dataset
- ERNIE-SAT is a speech-text joint pretraining framework that achieves SOTA results in cross-lingual multi-speaker speech synthesis and cross-lingual speech editing tasks. It can be applied to a series of scenarios such as Speech Editing, personalized Speech Synthesis, and Voice Cloning.
+ [ERNIE-SAT](https://arxiv.org/abs/2211.03545) is a speech-text joint pretraining framework that achieves SOTA results in cross-lingual multi-speaker speech synthesis and cross-lingual speech editing tasks. It can be applied to a series of scenarios such as Speech Editing, personalized Speech Synthesis, and Voice Cloning.
## Model Framework
In ERNIE-SAT, we propose two innovations:

@@ -0,0 +1 @@
../../tts3/local/export2lite.sh

@@ -60,3 +60,16 @@ fi
if [ ${stage} -le 6 ] && [ ${stop_stage} -ge 6 ]; then
./local/ort_predict.sh ${train_output_path}
fi
# must run after stage 3 (which stage generated static models)
if [ ${stage} -le 7 ] && [ ${stop_stage} -ge 7 ]; then
# This model is not supported, because 3 ops are not supported on 'arm'. These unsupported ops are: 'round, set_value, share_data'.
# This model is not supported, because 4 ops are not supported on 'x86'. These unsupported ops are: 'matmul_v2, round, set_value, share_data'.
./local/export2lite.sh ${train_output_path} inference pdlite speedyspeech_csmsc x86
# x86 ok, arm Segmentation fault
# ./local/export2lite.sh ${train_output_path} inference pdlite pwgan_csmsc x86
# x86 ok, arm Segmentation fault
# ./local/export2lite.sh ${train_output_path} inference pdlite mb_melgan_csmsc x86
# x86 ok, arm ok
# ./local/export2lite.sh ${train_output_path} inference pdlite hifigan_csmsc x86
fi

@@ -0,0 +1,18 @@
train_output_path=$1
model_dir=$2
output_dir=$3
model=$4
valid_targets=$5
model_name=${model%_*}
echo model_name: ${model_name}
mkdir -p ${train_output_path}/${output_dir}
paddle_lite_opt \
--model_file ${train_output_path}/${model_dir}/${model}.pdmodel \
--param_file ${train_output_path}/${model_dir}/${model}.pdiparams \
--optimize_out ${train_output_path}/${output_dir}/${model}_${valid_targets} \
--valid_targets ${valid_targets}

@@ -61,3 +61,16 @@ fi
if [ ${stage} -le 6 ] && [ ${stop_stage} -ge 6 ]; then
./local/ort_predict.sh ${train_output_path}
fi
# must run after stage 3 (which stage generated static models)
if [ ${stage} -le 7 ] && [ ${stop_stage} -ge 7 ]; then
# This model is not supported, because 3 ops are not supported on 'arm'. These unsupported ops are: 'round, set_value, share_data'.
# This model is not supported, because 4 ops are not supported on 'x86'. These unsupported ops are: 'matmul_v2, round, set_value, share_data'.
./local/export2lite.sh ${train_output_path} inference pdlite fastspeech2_csmsc x86
# x86 ok, arm Segmentation fault
# ./local/export2lite.sh ${train_output_path} inference pdlite pwgan_csmsc x86
# x86 ok, arm Segmentation fault
# ./local/export2lite.sh ${train_output_path} inference pdlite mb_melgan_csmsc x86
# x86 ok, arm ok
# ./local/export2lite.sh ${train_output_path} inference pdlite hifigan_csmsc x86
fi

@@ -75,7 +75,6 @@ if [ ${stage} -le 8 ] && [ ${stop_stage} -ge 8 ]; then
fi
# paddle2onnx streaming
if [ ${stage} -le 9 ] && [ ${stop_stage} -ge 9 ]; then
# install paddle2onnx
version=$(echo `pip list |grep "paddle2onnx"` |awk -F" " '{print $2}')
@@ -97,3 +96,34 @@ if [ ${stage} -le 10 ] && [ ${stop_stage} -ge 10 ]; then
./local/ort_predict_streaming.sh ${train_output_path}
fi
# must run after stage 3 (which stage generated static models)
if [ ${stage} -le 11 ] && [ ${stop_stage} -ge 11 ]; then
# This model is not supported, because 3 ops are not supported on 'arm'. These unsupported ops are: 'round, set_value, share_data'.
# This model is not supported, because 4 ops are not supported on 'x86'. These unsupported ops are: 'matmul_v2, round, set_value, share_data'.
./local/export2lite.sh ${train_output_path} inference pdlite fastspeech2_csmsc x86
# x86 ok, arm Segmentation fault
# ./local/export2lite.sh ${train_output_path} inference pdlite pwgan_csmsc x86
# x86 ok, arm Segmentation fault
# ./local/export2lite.sh ${train_output_path} inference pdlite mb_melgan_csmsc x86
# x86 ok, arm ok
# ./local/export2lite.sh ${train_output_path} inference pdlite hifigan_csmsc x86
fi
# must run after stage 5 (which stage generated static models)
if [ ${stage} -le 12 ] && [ ${stop_stage} -ge 12 ]; then
# streaming acoustic model
# This model is not supported, because 3 ops are not supported on 'arm'. These unsupported ops are: 'round, set_value, share_data'.
# This model is not supported, because 4 ops are not supported on 'x86'. These unsupported ops are: 'matmul_v2, round, set_value, share_data'.
# ./local/export2lite.sh ${train_output_path} inference pdlite fastspeech2_csmsc x86
./local/export2lite.sh ${train_output_path} inference_streaming pdlite_streaming fastspeech2_csmsc_am_encoder_infer x86
# x86 ok, arm Segmentation fault
./local/export2lite.sh ${train_output_path} inference_streaming pdlite_streaming fastspeech2_csmsc_am_decoder x86
# x86 ok, arm Segmentation fault
./local/export2lite.sh ${train_output_path} inference_streaming pdlite_streaming fastspeech2_csmsc_am_postnet x86
# x86 ok, arm Segmentation fault
# ./local/export2lite.sh ${train_output_path} inference_streaming pdlite_streaming pwgan_csmsc x86
# x86 ok, arm Segmentation fault
# ./local/export2lite.sh ${train_output_path} inference_streaming pdlite_streaming mb_melgan_csmsc x86
# x86 ok, arm ok
# ./local/export2lite.sh ${train_output_path} inference_streaming pdlite_streaming hifigan_csmsc x86
fi

@@ -70,7 +70,6 @@ train_manifest: data/manifest.train
dev_manifest: data/manifest.dev
test_manifest: data/manifest.test-clean
###########################################
# Dataloader                               #
###########################################
@@ -95,6 +94,12 @@ dist_sampler: True
shortest_first: True
return_lens_rate: True
+ ############################################
+ # Data Augmentation                         #
+ ############################################
+ audio_augment: # for raw audio
+   sample_rate: 16000
+   speeds: [95, 100, 105]
###########################################
# Training                                 #
@@ -115,6 +120,3 @@ log_interval: 1
checkpoint:
kbest_n: 50
latest_n: 5
- augment: True

@@ -0,0 +1 @@
../../../csmsc/tts3/local/export2lite.sh

@@ -59,3 +59,14 @@ fi
if [ ${stage} -le 6 ] && [ ${stop_stage} -ge 6 ]; then
./local/ort_predict.sh ${train_output_path}
fi
# must run after stage 3 (which stage generated static models)
if [ ${stage} -le 7 ] && [ ${stop_stage} -ge 7 ]; then
# This model is not supported, because 3 ops are not supported on 'arm'. These unsupported ops are: 'round, set_value, share_data'.
# This model is not supported, because 4 ops are not supported on 'x86'. These unsupported ops are: 'matmul_v2, round, set_value, share_data'.
./local/export2lite.sh ${train_output_path} inference pdlite fastspeech2_ljspeech x86
# x86 ok, arm Segmentation fault
# ./local/export2lite.sh ${train_output_path} inference pdlite pwgan_ljspeech x86
# x86 ok, arm ok
# ./local/export2lite.sh ${train_output_path} inference pdlite hifigan_ljspeech x86
fi

@@ -4,3 +4,6 @@ Run the following script to get started, for more detail, please see `run.sh`.
```bash
./run.sh
```
# Rhythm tags for MFA
If you want to get rhythm tags with duration through the MFA tool, you may add the flag `--rhy-with-duration` to the first two commands in `run.sh`.
Note that only the CSMSC dataset is supported so far, and we replace `#` with `sp` in rhythm tags for MFA.

@@ -182,12 +182,17 @@ if __name__ == "__main__":
        "--with-tone", action="store_true", help="whether to consider tone.")
    parser.add_argument(
        "--with-r", action="store_true", help="whether to consider erhua.")
+     parser.add_argument(
+         "--rhy-with-duration",
+         action="store_true", )
    args = parser.parse_args()
    lexicon = generate_lexicon(args.with_tone, args.with_r)
    symbols = generate_symbols(lexicon)
    with open(args.output + ".lexicon", 'wt') as f:
+         if args.rhy_with_duration:
+             f.write("sp1 sp1\nsp2 sp2\nsp3 sp3\nsp4 sp4\n")
        for k, v in lexicon.items():
            f.write(f"{k} {v}\n")

@@ -23,6 +23,7 @@ for more details.
"""
import argparse
import os
+ import re
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
@@ -32,6 +33,22 @@ import librosa
import soundfile as sf
from tqdm import tqdm
+ repalce_dict = {
+     "": "",
+     "": "",
+     "": "",
+     "": "",
+     "": "",
+     "": "",
+     "": "",
+     "": "",
+     "": "",
+     "": "",
+     "": "",
+     "": "",
+     "": ""
+ }
def get_transcripts(path: Union[str, Path]):
    transcripts = {}
@@ -55,9 +72,13 @@ def resample_and_save(source, target, sr=16000):
def reorganize_baker(root_dir: Union[str, Path],
                     output_dir: Union[str, Path]=None,
-                      resample_audio=False):
+                      resample_audio=False,
+                      rhy_dur=False):
    root_dir = Path(root_dir).expanduser()
-     transcript_path = root_dir / "ProsodyLabeling" / "000001-010000.txt"
+     if rhy_dur:
+         transcript_path = root_dir / "ProsodyLabeling" / "000001-010000_rhy.txt"
+     else:
+         transcript_path = root_dir / "ProsodyLabeling" / "000001-010000.txt"
    transcriptions = get_transcripts(transcript_path)
    wave_dir = root_dir / "Wave"
@@ -92,6 +113,46 @@ def reorganize_baker(root_dir: Union[str, Path],
    print("Done!")
+ def insert_rhy(sentence_first, sentence_second):
+     sub = '#'
+     return_words = []
+     sentence_first = sentence_first.translate(str.maketrans(repalce_dict))
+     rhy_idx = [substr.start() for substr in re.finditer(sub, sentence_first)]
+     re_rhy_idx = []
+     sentence_first_ = sentence_first.replace("#1", "").replace(
+         "#2", "").replace("#3", "").replace("#4", "")
+     sentence_seconds = sentence_second.split(" ")
+     for i, w in enumerate(rhy_idx):
+         re_rhy_idx.append(w - i * 2)
+     i = 0
+     # print("re_rhy_idx: ", re_rhy_idx)
+     for sentence_s in (sentence_seconds):
+         return_words.append(sentence_s)
+         if i < len(re_rhy_idx) and len(return_words) - i == re_rhy_idx[i]:
+             return_words.append("sp" + sentence_first[rhy_idx[i] + 1:rhy_idx[i]
+                                                       + 2])
+             i = i + 1
+     return return_words
+ def normalize_rhy(root_dir: Union[str, Path]):
+     root_dir = Path(root_dir).expanduser()
+     transcript_path = root_dir / "ProsodyLabeling" / "000001-010000.txt"
+     target_transcript_path = root_dir / "ProsodyLabeling" / "000001-010000_rhy.txt"
+     with open(transcript_path) as f:
+         lines = f.readlines()
+     with open(target_transcript_path, 'wt') as f:
+         for i in range(0, len(lines), 2):
+             sentence_first = lines[i]  # the first line (hanzi with rhythm tags) is written unchanged
+             f.write(sentence_first)
+             transcription = lines[i + 1].strip()
+             f.write("\t" + " ".join(
+                 insert_rhy(sentence_first.split('\t')[1], transcription)) +
+                     "\n")
if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Reorganize Baker dataset for MFA")
@@ -104,6 +165,12 @@ if __name__ == "__main__":
        "--resample-audio",
        action="store_true",
        help="To resample audio files or just copy them")
+     parser.add_argument(
+         "--rhy-with-duration",
+         action="store_true", )
    args = parser.parse_args()
-     reorganize_baker(args.root_dir, args.output_dir, args.resample_audio)
+     if args.rhy_with_duration:
+         normalize_rhy(args.root_dir)
+     reorganize_baker(args.root_dir, args.output_dir, args.resample_audio,
+                      args.rhy_with_duration)
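To make the intent of `insert_rhy` concrete, here is a small usage illustration (mine, not part of the patch) traced against the function above; the input is a typical Baker/CSMSC prosody line, and the break levels `#1`-`#4` become `sp1`-`sp4` tokens between pinyin syllables:

```python
# Usage sketch for insert_rhy() defined above (illustrative input line).
text = "卡尔普#2陪外孙#1玩滑梯#4"                      # hanzi with prosodic break levels
pinyin = "ka2 er3 pu3 pei2 wai4 sun1 wan2 hua2 ti1"   # one syllable per hanzi
print(" ".join(insert_rhy(text, pinyin)))
# ka2 er3 pu3 sp2 pei2 wai4 sun1 sp1 wan2 hua2 ti1 sp4
```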

@@ -123,3 +123,5 @@ iPad Pro的秒控键盘这次也推出白色版本。|iPad Pro的秒控键盘这次也推出白色版本
985|九八五
12~23|十二到二十三
12-23|十二到二十三
25cm²|二十五平方厘米
25m|二十五米

@@ -1,5 +1,5 @@
# ERNIE-SAT with VCTK dataset
- ERNIE-SAT is a speech-text joint pretraining framework that achieves SOTA results in cross-lingual multi-speaker speech synthesis and cross-lingual speech editing tasks. It can be applied to a series of scenarios such as Speech Editing, personalized Speech Synthesis, and Voice Cloning.
+ [ERNIE-SAT](https://arxiv.org/abs/2211.03545) is a speech-text joint pretraining framework that achieves SOTA results in cross-lingual multi-speaker speech synthesis and cross-lingual speech editing tasks. It can be applied to a series of scenarios such as Speech Editing, personalized Speech Synthesis, and Voice Cloning.
## Model Framework
In ERNIE-SAT, we propose two innovations:

@@ -0,0 +1 @@
../../../csmsc/tts3/local/export2lite.sh

@@ -58,3 +58,14 @@ fi
if [ ${stage} -le 6 ] && [ ${stop_stage} -ge 6 ]; then
./local/ort_predict.sh ${train_output_path}
fi
# must run after stage 3 (which stage generated static models)
if [ ${stage} -le 7 ] && [ ${stop_stage} -ge 7 ]; then
# This model is not supported, because 3 ops are not supported on 'arm'. These unsupported ops are: 'round, set_value, share_data'.
# This model is not supported, because 4 ops are not supported on 'x86'. These unsupported ops are: 'matmul_v2, round, set_value, share_data'.
./local/export2lite.sh ${train_output_path} inference pdlite fastspeech2_vctk x86
# x86 ok, arm Segmentation fault
# ./local/export2lite.sh ${train_output_path} inference pdlite pwgan_vctk x86
# x86 ok, arm ok
# ./local/export2lite.sh ${train_output_path} inference pdlite hifigan_vctk x86
fi

@@ -101,7 +101,7 @@ if __name__ == "__main__":
    optimizer.clear_grad()
    # Calculate loss
-     avg_loss += loss.numpy()[0]
+     avg_loss += float(loss)
    # Calculate metrics
    preds = paddle.argmax(logits, axis=1)

@@ -110,7 +110,7 @@ if __name__ == '__main__':
    optimizer.clear_grad()
    # Calculate loss
-     avg_loss += loss.numpy()[0]
+     avg_loss += float(loss)
    # Calculate metrics
    num_corrects += corrects

@@ -71,7 +71,8 @@ class Wav2Vec2ASRTrainer(Trainer):
        wavs_lens_rate = wavs_lens / wav.shape[1]
        target_lens_rate = target_lens / target.shape[1]
        wav = wav[:, :, 0]
-         wav = self.speech_augmentation(wav, wavs_lens_rate)
+         if hasattr(train_conf, 'speech_augment'):
+             wav = self.speech_augmentation(wav, wavs_lens_rate)
        loss = self.model(wav, wavs_lens_rate, target, target_lens_rate)
        # loss div by `batch_size * accum_grad`
        loss /= train_conf.accum_grad
@@ -277,7 +278,9 @@ class Wav2Vec2ASRTrainer(Trainer):
        logger.info("Setup model!")
        # setup speech augmentation for wav2vec2
-         self.speech_augmentation = TimeDomainSpecAugment()
+         if hasattr(config, 'audio_augment') and self.train:
+             self.speech_augmentation = TimeDomainSpecAugment(
+                 **config.audio_augment)
        if not self.train:
            return
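For context, a minimal self-contained sketch (mine, not the PaddleSpeech API) of the same gating idea: the augmentation object is only built when the config actually carries an `audio_augment` section, with `dict` standing in for `TimeDomainSpecAugment`:

```python
from types import SimpleNamespace

def build_augmentation(config, train=True, factory=dict):
    # Only build the augmentation when the config has an `audio_augment`
    # section and we are in training mode; otherwise skip it entirely.
    if hasattr(config, "audio_augment") and train:
        return factory(**config.audio_augment)
    return None

cfg = SimpleNamespace(audio_augment=dict(sample_rate=16000, speeds=[95, 100, 105]))
print(build_augmentation(cfg))                # {'sample_rate': 16000, 'speeds': [95, 100, 105]}
print(build_augmentation(SimpleNamespace()))  # None -> augmentation is simply skipped
```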

@@ -641,14 +641,11 @@ class DropChunk(nn.Layer):
class TimeDomainSpecAugment(nn.Layer):
    """A time-domain approximation of the SpecAugment algorithm.
    This augmentation module implements three augmentations in
    the time-domain.
    1. Drop chunks of the audio (zero amplitude or white noise)
    2. Drop frequency bands (with band-drop filters)
    3. Speed perturbation (via resampling to slightly different rate)
    Arguments
    ---------
    perturb_prob : float from 0 to 1
@@ -677,7 +674,6 @@ class TimeDomainSpecAugment(nn.Layer):
    drop_chunk_noise_factor : float
        The noise factor used to scale the white noise inserted, relative to
        the average amplitude of the utterance. Default 0 (no noise inserted).
    Example
    -------
    >>> inputs = paddle.randn([10, 16000])
@@ -718,7 +714,6 @@ class TimeDomainSpecAugment(nn.Layer):
    def forward(self, waveforms, lengths):
        """Returns the distorted waveforms.
        Arguments
        ---------
        waveforms : tensor
@@ -0,0 +1,10 @@
0001 考古人员<speak>西<say-as pinyin='zang4'>藏</say-as>布达拉宫里发现一个被隐<say-as pinyin="cang2">藏</say-as>的装有宝<say-as pinyin="zang4">藏</say-as></speak>箱子。
0002 <speak>有人询问中国银<say-as pinyin='hang2'>行</say-as>北京分<say-as pinyin='hang2 hang2'>行行</say-as>长是否叫任我<say-as pinyin='xing2'>行</say-as></speak>。
0003 <speak>市委书记亲自<say-as pinyin='shuai4'>率</say-as>领审计员对这家公司进行财务审计,发现企业的利润<say-as pinyin='lv4'>率</say-as>数据虚假</speak>。
0004 <speak>学生们对代<say-as pinyin='shu4'>数</say-as>理解不深刻,特别是小<say-as pinyin='shu4'>数</say-as>点,在<say-as pinyin='shu3 shu4'>数数</say-as>时容易弄错</speak>。
0005 <speak>赵<say-as pinyin='chang2'>长</say-as>军从小学习武术,擅<say-as pinyin='chang2'>长</say-as>散打,<say-as pinyin='zhang3'>长</say-as>大后参军,担任连<say-as pinyin='zhang3'>长</say-as></speak>。
0006 <speak>我说她<say-as pinyin='zhang3'>涨</say-as>了工资,她就<say-as pinyin='zhang4'>涨</say-as>红着脸,摇头否认</speak>。
0007 <speak>请把这封信交<say-as pinyin='gei3'>给</say-as>团长,告诉他,前线的供<say-as pinyin='ji3'>给</say-as>一定要有保障</speak>。
0008 <speak>矿下的<say-as pinyin='hang4'>巷</say-as>道,与北京四合院的小<say-as pinyin='xiang4'>巷</say-as>有点相似</speak>。
0009 <speak>他常叹自己命<say-as pinyin='bo2'>薄</say-as>,几亩<say-as pinyin='bao2'>薄</say-as>田,种点<say-as pinyin='bo4'>薄</say-as>荷</speak>。
0010 <speak>小明对天相很有研究,在<say-as pinyin='su4'>宿</say-as>舍说了一<say-as pinyin='xiu3'>宿</say-as>有关星<say-as pinyin='xiu4'>宿</say-as>的常识</speak>。
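As a toy illustration of what this markup enables (my sketch, not the PaddleSpeech SSML frontend), the forced pinyin for polyphonic characters such as 藏 can be pulled out of the `<say-as>` tags with a small regex:

```python
import re

line = "<speak>西<say-as pinyin='zang4'>藏</say-as>布达拉宫里发现一个被隐<say-as pinyin=\"cang2\">藏</say-as>的装有宝<say-as pinyin=\"zang4\">藏</say-as></speak>"
for pinyin, text in re.findall(r"<say-as pinyin=['\"]([^'\"]+)['\"]>(.*?)</say-as>", line):
    print(text, "->", pinyin)
# 藏 -> zang4
# 藏 -> cang2
# 藏 -> zang4
```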

@@ -18,6 +18,25 @@ from .num import num2str
# Temperature expressions; the temperature affects how the minus sign is read
# -3°C 零下三度
RE_TEMPERATURE = re.compile(r'(-?)(\d+(\.\d+)?)(°C|℃|度|摄氏度)')
+ measure_dict = {
+     "cm2": "平方厘米",
+     "cm²": "平方厘米",
+     "cm3": "立方厘米",
+     "cm³": "立方厘米",
+     "cm": "厘米",
+     "db": "分贝",
+     "ds": "毫秒",
+     "kg": "千克",
+     "km": "千米",
+     "m2": "平方米",
+     "㎡": "平方米",
+     "㎥": "立方米",
+     "m3": "立方米",
+     "ml": "毫升",
+     "m": "米",
+     "mm": "毫米",
+     "s": "秒"
+ }
def replace_temperature(match) -> str:
@@ -35,3 +54,10 @@ def replace_temperature(match) -> str:
    unit: str = "摄氏度" if unit == "摄氏度" else "度"
    result = f"{sign}{temperature}{unit}"
    return result
+ def replace_measure(sentence) -> str:
+     for q_notation in measure_dict:
+         if q_notation in sentence:
+             sentence = sentence.replace(q_notation, measure_dict[q_notation])
+     return sentence
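A doctest-style sketch of the new rule (mine, not from the patch): `replace_measure` does plain substring substitution of unit abbreviations, and the digits are verbalized later by the other `TextNormalizer` rules. The dictionary below is an abbreviated copy just for the demo.

```python
measure_dict = {"cm²": "平方厘米", "km": "千米", "m": "米"}   # abbreviated copy for the demo

def replace_measure(sentence: str) -> str:
    for q_notation in measure_dict:
        if q_notation in sentence:
            sentence = sentence.replace(q_notation, measure_dict[q_notation])
    return sentence

print(replace_measure("25cm²"))   # 25平方厘米
print(replace_measure("3km"))     # 3千米
```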

@@ -46,6 +46,7 @@ from .phonecode import RE_TELEPHONE
from .phonecode import replace_mobile
from .phonecode import replace_phone
from .quantifier import RE_TEMPERATURE
+ from .quantifier import replace_measure
from .quantifier import replace_temperature
@@ -91,6 +92,7 @@ class TextNormalizer():
        sentence = RE_TIME.sub(replace_time, sentence)
        sentence = RE_TEMPERATURE.sub(replace_temperature, sentence)
+         sentence = replace_measure(sentence)
        sentence = RE_FRAC.sub(replace_frac, sentence)
        sentence = RE_PERCENTAGE.sub(replace_percentage, sentence)
        sentence = RE_MOBILE_PHONE.sub(replace_mobile, sentence)

@@ -2,10 +2,10 @@
## Testing with Aishell Test Data
- ## Download wav and model
+ ### Download wav and model
```
- run.sh --stop_stage 0
+ ./run.sh --stop_stage 0
```
### compute feature
@@ -22,7 +22,6 @@ run.sh --stop_stage 0
### decoding using wav
```
./run.sh --stage 3 --stop_stage 3
```

@@ -2,9 +2,11 @@
7176 utts, duration 36108.9 sec.
- ## Attention Rescore
+ ## U2++ Attention Rescore
- ### u2++ FP32
+ > Intel(R) Xeon(R) Gold 6271C CPU @ 2.60GHz, with `avx512_vnni` support
+ > RTF is computed including feature extraction and decoding (closer to end-to-end).
+ ### FP32
#### CER
@@ -17,20 +19,29 @@ Other -> 100.00 % N=3 C=0 S=3 D=0 I=0
#### RTF
- > RTF with feature and decoder which is more end to end.
- * Intel(R) Xeon(R) Gold 6271C CPU @ 2.60GHz, support `avx512_vnni`
```
I1027 10:52:38.662868 51665 u2_recognizer_main.cc:122] total wav duration is: 36108.9 sec
I1027 10:52:38.662858 51665 u2_recognizer_main.cc:121] total cost:11169.1 sec
I1027 10:52:38.662876 51665 u2_recognizer_main.cc:123] RTF is: 0.309318
```
- * Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz, not support `avx512_vnni`
+ ### INT8
+ > RTF improves by 12.8% relative, counting feature extraction and decoding time.
+ #### CER
+ ```
+ Overall -> 5.83 % N=104765 C=98943 S=5675 D=147 I=286
+ Mandarin -> 5.83 % N=104762 C=98943 S=5672 D=147 I=286
+ English -> 0.00 % N=0 C=0 S=0 D=0 I=0
+ Other -> 100.00 % N=3 C=0 S=3 D=0 I=0
+ ```
+ #### RTF
```
- I1026 16:13:26.247121 48038 u2_recognizer_main.cc:123] total wav duration is: 36108.9 sec
+ I1110 09:59:52.551712 37249 u2_recognizer_main.cc:122] total wav duration is: 36108.9 sec
- I1026 16:13:26.247130 48038 u2_recognizer_main.cc:124] total decode cost:13656.7 sec
+ I1110 09:59:52.551717 37249 u2_recognizer_main.cc:123] total decode cost:9737.63 sec
- I1026 16:13:26.247138 48038 u2_recognizer_main.cc:125] RTF is: 0.378208
+ I1110 09:59:52.551723 37249 u2_recognizer_main.cc:124] RTF is: 0.269674
```
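As a quick sanity check on the numbers above (my own arithmetic, not part of the patch), RTF is simply processing time divided by total audio duration, and the INT8 run is about 12.8% faster relative to FP32:

```python
total_audio = 36108.9            # seconds of test audio, from the logs above
fp32_cost, int8_cost = 11169.1, 9737.63

print(fp32_cost / total_audio)   # ~0.3093 (log reports RTF 0.309318)
print(int8_cost / total_audio)   # ~0.2697 (log reports RTF 0.269674)
print(1 - int8_cost / fp32_cost) # ~0.128  -> the "12.8% relative" improvement
```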

@@ -10,7 +10,7 @@ mkdir -p $exp
ckpt_dir=./data/model
model_dir=$ckpt_dir/asr1_chunk_conformer_u2pp_wenetspeech_static_1.3.0.model/
- utils/run.pl JOB=1:$nj $data/split${nj}/JOB/decoder.fbank.wolm.log \
+ utils/run.pl JOB=1:$nj $data/split${nj}/JOB/decoder.log \
ctc_prefix_beam_search_decoder_main \
--model_path=$model_dir/export.jit \
--vocab_path=$model_dir/unit.txt \

@@ -1,18 +1,21 @@
#!/bin/bash
+ set -x
set -e
. path.sh
+ nj=20
data=data
exp=exp
mkdir -p $exp
ckpt_dir=./data/model
model_dir=$ckpt_dir/asr1_chunk_conformer_u2pp_wenetspeech_static_1.3.0.model/
+ utils/run.pl JOB=1:$nj $data/split${nj}/JOB/nnet.log \
u2_nnet_main \
--model_path=$model_dir/export.jit \
- --feature_rspecifier=ark,t:$exp/fbank.ark \
+ --vocab_path=$model_dir/unit.txt \
+ --feature_rspecifier=ark,t:${data}/split${nj}/JOB/fbank.ark \
--nnet_decoder_chunk=16 \
--receptive_field_length=7 \
--subsampling_rate=4 \
@@ -20,4 +23,3 @@ u2_nnet_main \
--nnet_encoder_outs_wspecifier=ark,t:$exp/encoder_outs.ark \
--nnet_prob_wspecifier=ark,t:$exp/logprobs.ark
echo "u2 nnet decode."

@@ -24,8 +24,6 @@ fi
ckpt_dir=$data/model
- model_dir=$ckpt_dir/asr1_chunk_conformer_u2pp_wenetspeech_static_1.3.0.model/
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ];then
# download u2pp model

@@ -32,7 +32,6 @@ class DataCache : public FrontendInterface {
    // accept waves/feats
    void Accept(const kaldi::VectorBase<kaldi::BaseFloat>& inputs) override {
        data_ = inputs;
-         SetDim(data_.Dim());
    }
    bool Read(kaldi::Vector<kaldi::BaseFloat>* feats) override {
@@ -41,7 +40,6 @@ class DataCache : public FrontendInterface {
        }
        (*feats) = data_;
        data_.Resize(0);
-         SetDim(data_.Dim());
        return true;
    }

@@ -71,6 +71,7 @@ bool Decodable::AdvanceChunk() {
        VLOG(3) << "decodable exit;";
        return false;
    }
+     CHECK_GE(frontend_->Dim(), 0);
    VLOG(1) << "AdvanceChunk feat cost: " << timer.Elapsed() << " sec.";
    VLOG(2) << "Forward in " << features.Dim() / frontend_->Dim() << " feats.";

@@ -13,6 +13,7 @@
# limitations under the License.
import paddle
import torch
+ from paddle.device.cuda import synchronize
from parallel_wavegan.layers import residual_block
from parallel_wavegan.layers import upsample
from parallel_wavegan.models import parallel_wavegan as pwgan
@@ -24,7 +25,6 @@ from paddlespeech.t2s.models.parallel_wavegan import PWGGenerator
from paddlespeech.t2s.models.parallel_wavegan import ResidualBlock
from paddlespeech.t2s.models.parallel_wavegan import ResidualPWGDiscriminator
from paddlespeech.t2s.utils.layer_tools import summary
- from paddlespeech.t2s.utils.profile import synchronize
paddle.set_device("gpu:0")
device = torch.device("cuda:0")
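For context, a minimal sketch of mine (assuming a CUDA build of Paddle and an available GPU): `paddle.device.cuda.synchronize()` blocks until queued GPU work finishes, which is what makes wall-clock timing of the converted models meaningful.

```python
import time

import paddle
from paddle.device.cuda import synchronize

paddle.set_device("gpu:0")
x = paddle.randn([4, 80, 100])

start = time.time()
y = paddle.matmul(x, paddle.transpose(x, [0, 2, 1]))  # stand-in for a model forward pass
synchronize()                      # wait for the GPU before reading the clock
print(f"elapsed: {time.time() - start:.4f} s")
```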
