Merge branch 'develop' of github.com:SmileGoat/PaddleSpeech into add_aishell_eg

Yang Zhou · 3 years ago · commit c18846635e (pull/1676/head)

@@ -280,10 +280,14 @@ paddlespeech_client cls --server_ip 127.0.0.1 --port 8090 --input input.wav
For more information about server command lines, please see: [speech server demos](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/demos/speech_server)
<a name="ModelList"></a>
## Model List
PaddleSpeech supports a series of the most popular models. They are summarized in [released models](./docs/source/released_model.md) and attached with available pretrained models.
<a name="SpeechToText"></a>
**Speech-to-Text** contains *Acoustic Model*, *Language Model*, and *Speech Translation*, with the following details:
<table style="width:100%">
@@ -357,6 +361,8 @@ PaddleSpeech supports a series of most popular models. They are summarized in [r
  </tbody>
</table>
<a name="TextToSpeech"></a>
**Text-to-Speech** in PaddleSpeech mainly contains three modules: *Text Frontend*, *Acoustic Model* and *Vocoder*. Acoustic Model and Vocoder models are listed as follows:
<table>
@@ -457,10 +463,10 @@ PaddleSpeech supports a series of most popular models. They are summarized in [r
      </td>
    </tr>
    <tr>
      <td>GE2E + Tacotron2</td>
      <td>AISHELL-3</td>
      <td>
        <a href = "./examples/aishell3/vc0">ge2e-tacotron2-aishell3</a>
      </td>
    </tr>
    <tr>
@@ -473,6 +479,8 @@ PaddleSpeech supports a series of most popular models. They are summarized in [r
  </tbody>
</table>
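The **Text-to-Speech** pipeline above chains its three modules: the text frontend normalizes raw text into phonemes, the acoustic model maps phonemes to a mel spectrogram, and the vocoder converts the spectrogram into a waveform. A minimal sketch of running the whole chain through the CLI executor (assuming the `TTSExecutor` interface from `paddlespeech.cli`; default models and arguments may differ):

```python
from paddlespeech.cli.tts.infer import TTSExecutor

# one call runs text frontend -> acoustic model -> vocoder
tts = TTSExecutor()
tts(text="今天天气十分不错。", output="output.wav")
```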
<a name="AudioClassification"></a>
**Audio Classification** **Audio Classification**
<table style="width:100%"> <table style="width:100%">
@ -496,6 +504,8 @@ PaddleSpeech supports a series of most popular models. They are summarized in [r
</tbody> </tbody>
</table> </table>
<a name="SpeakerVerification"></a>
**Speaker Verification** **Speaker Verification**
<table style="width:100%"> <table style="width:100%">
@ -519,6 +529,8 @@ PaddleSpeech supports a series of most popular models. They are summarized in [r
</tbody> </tbody>
</table> </table>
<a name="PunctuationRestoration"></a>
**Punctuation Restoration** **Punctuation Restoration**
<table style="width:100%"> <table style="width:100%">
@ -559,10 +571,18 @@ Normally, [Speech SoTA](https://paperswithcode.com/area/speech), [Audio SoTA](ht
- [Advanced Usage](./docs/source/tts/advanced_usage.md) - [Advanced Usage](./docs/source/tts/advanced_usage.md)
- [Chinese Rule Based Text Frontend](./docs/source/tts/zh_text_frontend.md) - [Chinese Rule Based Text Frontend](./docs/source/tts/zh_text_frontend.md)
- [Test Audio Samples](https://paddlespeech.readthedocs.io/en/latest/tts/demo.html) - [Test Audio Samples](https://paddlespeech.readthedocs.io/en/latest/tts/demo.html)
- [Audio Classification](./demos/audio_tagging/README.md) - Speaker Verification
- [Audio Searching](./demos/audio_searching/README.md)
- [Speaker Verification](./demos/speaker_verification/README.md) - [Speaker Verification](./demos/speaker_verification/README.md)
- [Audio Classification](./demos/audio_tagging/README.md)
- [Speech Translation](./demos/speech_translation/README.md) - [Speech Translation](./demos/speech_translation/README.md)
- [Speech Server](./demos/speech_server/README.md)
- [Released Models](./docs/source/released_model.md) - [Released Models](./docs/source/released_model.md)
- [Speech-to-Text](#SpeechToText)
- [Text-to-Speech](#TextToSpeech)
- [Audio Classification](#AudioClassification)
- [Speaker Verification](#SpeakerVerification)
- [Punctuation Restoration](#PunctuationRestoration)
- [Community](#Community) - [Community](#Community)
- [Welcome to contribute](#contribution) - [Welcome to contribute](#contribution)
- [License](#License) - [License](#License)

@@ -273,6 +273,8 @@ paddlespeech_client cls --server_ip 127.0.0.1 --port 8090 --input input.wav
## Model List
PaddleSpeech supports many mainstream models and provides pretrained models for them; see the [model list](./docs/source/released_model.md) for details.
<a name="语音识别模型"></a>
**Speech-to-Text** in PaddleSpeech contains acoustic models and language models for speech recognition, plus speech translation, as detailed below:
<table style="width:100%">
@@ -347,6 +349,7 @@ PaddleSpeech 的 **语音转文本** 包含语音识别声学模型、语音识
</table>
<a name="语音合成模型"></a>
**Text-to-Speech** in PaddleSpeech mainly contains three modules: *Text Frontend*, *Acoustic Model*, and *Vocoder*. The acoustic models and vocoders are listed below:
<table>
@@ -447,10 +450,10 @@ PaddleSpeech 的 **语音合成** 主要包含三个模块:文本前端、声
      </td>
    </tr>
    <tr>
      <td>GE2E + Tacotron2</td>
      <td>AISHELL-3</td>
      <td>
        <a href = "./examples/aishell3/vc0">ge2e-tacotron2-aishell3</a>
      </td>
    </tr>
    <tr>
@@ -488,6 +491,8 @@ PaddleSpeech 的 **语音合成** 主要包含三个模块:文本前端、声
</table>
<a name="声纹识别模型"></a>
**Speaker Verification**
<table style="width:100%">
@@ -511,6 +516,8 @@ PaddleSpeech 的 **语音合成** 主要包含三个模块:文本前端、声
  </tbody>
</table>
<a name="标点恢复模型"></a>
**Punctuation Restoration**
<table style="width:100%">
@@ -556,13 +563,18 @@ PaddleSpeech 的 **语音合成** 主要包含三个模块:文本前端、声
- [Advanced Usage](./docs/source/tts/advanced_usage.md)
- [Chinese Text Frontend](./docs/source/tts/zh_text_frontend.md)
- [Test Audio Samples](https://paddlespeech.readthedocs.io/en/latest/tts/demo.html)
- Speaker Verification
  - [Speaker Verification](./demos/speaker_verification/README_cn.md)
  - [Audio Searching](./demos/audio_searching/README_cn.md)
- [Audio Classification](./demos/audio_tagging/README_cn.md)
- [Speech Translation](./demos/speech_translation/README_cn.md)
- [Speech Server](./demos/speech_server/README_cn.md)
- [Model List](#模型列表)
  - [Speech Recognition](#语音识别模型)
  - [Text-to-Speech](#语音合成模型)
  - [Audio Classification](#声音分类模型)
  - [Speaker Verification](#声纹识别模型)
  - [Punctuation Restoration](#标点恢复模型)
- [Community](#技术交流群)
- [Welcome to contribute](#欢迎贡献)
- [License](#License)

@@ -30,6 +30,11 @@ wget -c https://paddlespeech.bj.bcebos.com/vector/audio/85236145389.wav
paddlespeech vector --task spk --input vec.job
echo -e "demo2 85236145389.wav \n demo3 85236145389.wav" | paddlespeech vector --task spk
paddlespeech vector --task score --input "./85236145389.wav ./123456789.wav"
echo -e "demo4 85236145389.wav 85236145389.wav \n demo5 85236145389.wav 123456789.wav" > vec.job
paddlespeech vector --task score --input vec.job
```
Usage:
@@ -38,6 +43,7 @@ wget -c https://paddlespeech.bj.bcebos.com/vector/audio/85236145389.wav
```
Arguments:
- `input` (required): Audio file to recognize.
- `task` (required): Specify the `vector` task, `spk` or `score`. Default: `spk`.
- `model`: Model type of vector task. Default: `ecapatdnn_voxceleb12`.
- `sample_rate`: Sample rate of the model. Default: `16000`.
- `config`: Config of vector task. Use the pretrained model's config when it is `None`. Default: `None`.
@@ -47,45 +53,45 @@ wget -c https://paddlespeech.bj.bcebos.com/vector/audio/85236145389.wav
Output:
```bash
demo [ 1.4217498 5.626253 -5.342073 1.1773866 3.308055
1.756596 5.167894 10.80636 -3.8226728 -5.6141334
2.623845 -0.8072968 1.9635103 -7.3128724 0.01103897
-9.723131 0.6619743 -6.976803 10.213478 7.494748
2.9105635 3.8949256 3.7999806 7.1061673 16.905321
-7.1493764 8.733103 3.4230042 -4.831653 -11.403367
11.232214 7.1274667 -4.2828417 2.452362 -5.130748
-18.177666 -2.6116815 -11.000337 -6.7314315 1.6564683
0.7618269 1.1253023 -2.083836 4.725744 -8.782597
-3.539873 3.814236 5.1420674 2.162061 4.096431
-6.4162116 12.747448 1.9429878 -15.152943 6.417416
16.097002 -9.716668 -1.9920526 -3.3649497 -1.871939
11.567354 3.69788 11.258265 7.442363 9.183411
4.5281515 -1.2417862 4.3959084 6.6727695 5.8898783
7.627124 -0.66919386 -11.889693 -9.208865 -7.4274073
-3.7776625 6.917234 -9.848748 -2.0944717 -5.135116
0.49563864 9.317534 -5.9141874 -1.8098574 -0.11738578
-7.169265 -1.0578263 -5.7216787 -5.1173844 16.137651
-4.473626 7.6624317 -0.55381083 9.631587 -6.4704556
-8.548508 4.3716145 -0.79702514 4.478997 -2.9758704
3.272176 2.8382776 5.134597 -9.190781 -0.5657382
-4.8745747 2.3165567 -5.984303 -2.1798875 0.35541576
-0.31784213 9.493548 2.1144536 4.358092 -12.089823
8.451689 -7.925461 4.6242585 4.4289427 18.692003
-2.6204622 -5.149185 -0.35821092 8.488551 4.981496
-9.32683 -2.2544234 6.6417594 1.2119585 10.977129
16.555033 3.3238444 9.551863 -1.6676947 -0.79539716
-8.605674 -0.47356385 2.6741948 -5.359179 -2.6673796
0.66607 15.443222 4.740594 -3.4725387 11.592567
-2.054497 1.7361217 -8.265324 -9.30447 5.4068313
-1.5180256 -7.746615 -6.089606 0.07112726 -0.34904733
-8.649895 -9.998958 -2.564841 -0.53999114 2.601808
-0.31927416 -1.8815292 -2.07215 -3.4105783 -8.2998085
1.483641 -15.365992 -8.288208 3.8847756 -3.4876456
7.3629923 0.4657332 3.132599 12.438889 -1.8337058
4.532936 2.7264361 10.145339 -6.521951 2.897153
-3.3925855 5.079156 7.759716 4.677565 5.8457737
2.402413 7.7071047 3.9711342 -6.390043 6.1268735
-3.7760346 -11.118123 ]
```
- Python API
@@ -97,56 +103,113 @@ wget -c https://paddlespeech.bj.bcebos.com/vector/audio/85236145389.wav
audio_emb = vector_executor(
    model='ecapatdnn_voxceleb12',
    sample_rate=16000,
    config=None,  # Set `config` and `ckpt_path` to None to use pretrained model.
    ckpt_path=None,
    audio_file='./85236145389.wav',
    device=paddle.get_device())
print('Audio embedding Result: \n{}'.format(audio_emb))

test_emb = vector_executor(
    model='ecapatdnn_voxceleb12',
    sample_rate=16000,
    config=None,  # Set `config` and `ckpt_path` to None to use pretrained model.
    ckpt_path=None,
    audio_file='./123456789.wav',
    device=paddle.get_device())
print('Test embedding Result: \n{}'.format(test_emb))

# score range [0, 1]
score = vector_executor.get_embeddings_score(audio_emb, test_emb)
print(f"Embeddings Score: {score}")
```
Output:
```bash
# Vector Result:
Audio embedding Result:
[ 1.4217498 5.626253 -5.342073 1.1773866 3.308055
1.756596 5.167894 10.80636 -3.8226728 -5.6141334
2.623845 -0.8072968 1.9635103 -7.3128724 0.01103897
-9.723131 0.6619743 -6.976803 10.213478 7.494748
2.9105635 3.8949256 3.7999806 7.1061673 16.905321
-7.1493764 8.733103 3.4230042 -4.831653 -11.403367
11.232214 7.1274667 -4.2828417 2.452362 -5.130748
-18.177666 -2.6116815 -11.000337 -6.7314315 1.6564683
0.7618269 1.1253023 -2.083836 4.725744 -8.782597
-3.539873 3.814236 5.1420674 2.162061 4.096431
-6.4162116 12.747448 1.9429878 -15.152943 6.417416
16.097002 -9.716668 -1.9920526 -3.3649497 -1.871939
11.567354 3.69788 11.258265 7.442363 9.183411
4.5281515 -1.2417862 4.3959084 6.6727695 5.8898783
7.627124 -0.66919386 -11.889693 -9.208865 -7.4274073
-3.7776625 6.917234 -9.848748 -2.0944717 -5.135116
0.49563864 9.317534 -5.9141874 -1.8098574 -0.11738578
-7.169265 -1.0578263 -5.7216787 -5.1173844 16.137651
-4.473626 7.6624317 -0.55381083 9.631587 -6.4704556
-8.548508 4.3716145 -0.79702514 4.478997 -2.9758704
3.272176 2.8382776 5.134597 -9.190781 -0.5657382
-4.8745747 2.3165567 -5.984303 -2.1798875 0.35541576
-0.31784213 9.493548 2.1144536 4.358092 -12.089823
8.451689 -7.925461 4.6242585 4.4289427 18.692003
-2.6204622 -5.149185 -0.35821092 8.488551 4.981496
-9.32683 -2.2544234 6.6417594 1.2119585 10.977129
16.555033 3.3238444 9.551863 -1.6676947 -0.79539716
-8.605674 -0.47356385 2.6741948 -5.359179 -2.6673796
0.66607 15.443222 4.740594 -3.4725387 11.592567
-2.054497 1.7361217 -8.265324 -9.30447 5.4068313
-1.5180256 -7.746615 -6.089606 0.07112726 -0.34904733
-8.649895 -9.998958 -2.564841 -0.53999114 2.601808
-0.31927416 -1.8815292 -2.07215 -3.4105783 -8.2998085
1.483641 -15.365992 -8.288208 3.8847756 -3.4876456
7.3629923 0.4657332 3.132599 12.438889 -1.8337058
4.532936 2.7264361 10.145339 -6.521951 2.897153
-3.3925855 5.079156 7.759716 4.677565 5.8457737
2.402413 7.7071047 3.9711342 -6.390043 6.1268735
-3.7760346 -11.118123 ]
# get the test embedding
Test embedding Result:
[ -1.902964 2.0690894 -8.034194 3.5472693 0.18089125
6.9085927 1.4097427 -1.9487704 -10.021278 -0.20755845
-8.04332 4.344489 2.3200977 -14.306299 5.184692
-11.55602 -3.8497238 0.6444722 1.2833948 2.6766639
0.5878921 0.7946299 1.7207596 2.5791872 14.998469
-1.3385371 15.031221 -0.8006958 1.99287 -9.52007
2.435466 4.003221 -4.33817 -4.898601 -5.304714
-18.033886 10.790787 -12.784645 -5.641755 2.9761686
-10.566622 1.4839455 6.152458 -5.7195854 2.8603241
6.112133 8.489869 5.5958056 1.2836679 -1.2293907
0.89927405 7.0288725 -2.854029 -0.9782962 5.8255906
14.905906 -5.025907 0.7866458 -4.2444224 -16.354029
10.521315 0.9604709 -3.3257897 7.144871 -13.592733
-8.568869 -1.7953678 0.26313916 10.916714 -6.9374123
1.857403 -6.2746415 2.8154466 -7.2338667 -2.293357
-0.05452765 5.4287076 5.0849075 -6.690375 -1.6183422
3.654291 0.94352573 -9.200294 -5.4749465 -3.5235846
1.3420814 4.240421 -2.772944 -2.8451524 16.311104
4.2969875 -1.762936 -12.5758915 8.595198 -0.8835239
-1.5708797 1.568961 1.1413603 3.5032008 -0.45251232
-6.786333 16.89443 5.3366146 -8.789056 0.6355629
3.2579517 -3.328322 7.5969577 0.66025066 -6.550468
-9.148656 2.020372 -0.4615173 1.1965656 -3.8764873
11.6562195 -6.0750933 12.182899 3.2218833 0.81969476
5.570001 -3.8459578 -7.205299 7.9262037 -7.6611166
-5.249467 -2.2671914 7.2658715 -13.298164 4.821147
-2.7263982 11.691089 -3.8918593 -2.838112 -1.0336838
-3.8034165 2.8536487 -5.60398 -1.1972581 1.3455094
-3.4903061 2.2408795 5.5010734 -3.970756 11.99696
-7.8858757 0.43160373 -5.5059714 4.3426995 16.322706
11.635366 0.72157705 -9.245714 -3.91465 -4.449838
-1.5716927 7.713747 -2.2430465 -6.198303 -13.481864
2.8156567 -5.7812386 5.1456156 2.7289324 -14.505571
13.270688 3.448231 -7.0659585 4.5886116 -4.466099
-0.296428 -11.463529 -2.6076477 14.110243 -6.9725137
-1.9962958 2.7119343 19.391657 0.01961198 14.607133
-1.6695905 -4.391516 1.3131028 -6.670972 -5.888604
12.0612335 5.9285784 3.3715196 1.492534 10.723728
-0.95514804 -12.085431 ]
# get the score between enroll and test
Embeddings Score: 0.4292638301849365
```
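The score compares the enrollment embedding with the test embedding and is mapped into [0, 1]. As a rough sketch of such a scoring function (an illustrative assumption, not necessarily the exact formula inside `get_embeddings_score`), cosine similarity rescaled from [-1, 1] to [0, 1] looks like this:

```python
import numpy as np

def embeddings_score(enroll_emb, test_emb):
    # Illustrative only: cosine similarity rescaled to [0, 1].
    # The actual get_embeddings_score implementation may differ.
    enroll = np.asarray(enroll_emb, dtype=np.float32)
    test = np.asarray(test_emb, dtype=np.float32)
    cos = float(np.dot(enroll, test) /
                (np.linalg.norm(enroll) * np.linalg.norm(test)))
    return (cos + 1.0) / 2.0
```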
### 4. Pretrained Models

@@ -29,6 +29,11 @@ wget -c https://paddlespeech.bj.bcebos.com/vector/audio/85236145389.wav
paddlespeech vector --task spk --input vec.job
echo -e "demo2 85236145389.wav \n demo3 85236145389.wav" | paddlespeech vector --task spk
paddlespeech vector --task score --input "./85236145389.wav ./123456789.wav"
echo -e "demo4 85236145389.wav 85236145389.wav \n demo5 85236145389.wav 123456789.wav" > vec.job
paddlespeech vector --task score --input vec.job
```
Usage:
@@ -37,6 +42,7 @@ wget -c https://paddlespeech.bj.bcebos.com/vector/audio/85236145389.wav
```
Arguments:
- `input` (required): Audio file to recognize.
- `task` (required): Specify the `vector` task, `spk` or `score`. Default: `spk`.
- `model`: Model of the vector task. Default: `ecapatdnn_voxceleb12`.
- `sample_rate`: Sample rate of the audio. Default: `16000`.
- `config`: Config file of the vector task; if not set, the pretrained model's default config is used. Default: `None`.
@@ -45,45 +51,45 @@ wget -c https://paddlespeech.bj.bcebos.com/vector/audio/85236145389.wav
Output:
```bash
demo [ 1.4217498 5.626253 -5.342073 1.1773866 3.308055
1.756596 5.167894 10.80636 -3.8226728 -5.6141334
2.623845 -0.8072968 1.9635103 -7.3128724 0.01103897
-9.723131 0.6619743 -6.976803 10.213478 7.494748
2.9105635 3.8949256 3.7999806 7.1061673 16.905321
-7.1493764 8.733103 3.4230042 -4.831653 -11.403367
11.232214 7.1274667 -4.2828417 2.452362 -5.130748
-18.177666 -2.6116815 -11.000337 -6.7314315 1.6564683
0.7618269 1.1253023 -2.083836 4.725744 -8.782597
-3.539873 3.814236 5.1420674 2.162061 4.096431
-6.4162116 12.747448 1.9429878 -15.152943 6.417416
16.097002 -9.716668 -1.9920526 -3.3649497 -1.871939
11.567354 3.69788 11.258265 7.442363 9.183411
4.5281515 -1.2417862 4.3959084 6.6727695 5.8898783
7.627124 -0.66919386 -11.889693 -9.208865 -7.4274073
-3.7776625 6.917234 -9.848748 -2.0944717 -5.135116
0.49563864 9.317534 -5.9141874 -1.8098574 -0.11738578
-7.169265 -1.0578263 -5.7216787 -5.1173844 16.137651
-4.473626 7.6624317 -0.55381083 9.631587 -6.4704556
-8.548508 4.3716145 -0.79702514 4.478997 -2.9758704
3.272176 2.8382776 5.134597 -9.190781 -0.5657382
-4.8745747 2.3165567 -5.984303 -2.1798875 0.35541576
-0.31784213 9.493548 2.1144536 4.358092 -12.089823
8.451689 -7.925461 4.6242585 4.4289427 18.692003
-2.6204622 -5.149185 -0.35821092 8.488551 4.981496
-9.32683 -2.2544234 6.6417594 1.2119585 10.977129
16.555033 3.3238444 9.551863 -1.6676947 -0.79539716
-8.605674 -0.47356385 2.6741948 -5.359179 -2.6673796
0.66607 15.443222 4.740594 -3.4725387 11.592567
-2.054497 1.7361217 -8.265324 -9.30447 5.4068313
-1.5180256 -7.746615 -6.089606 0.07112726 -0.34904733
-8.649895 -9.998958 -2.564841 -0.53999114 2.601808
-0.31927416 -1.8815292 -2.07215 -3.4105783 -8.2998085
1.483641 -15.365992 -8.288208 3.8847756 -3.4876456
7.3629923 0.4657332 3.132599 12.438889 -1.8337058
4.532936 2.7264361 10.145339 -6.521951 2.897153
-3.3925855 5.079156 7.759716 4.677565 5.8457737
2.402413 7.7071047 3.9711342 -6.390043 6.1268735
-3.7760346 -11.118123 ]
```
- Python API
@@ -98,53 +104,109 @@ wget -c https://paddlespeech.bj.bcebos.com/vector/audio/85236145389.wav
    config=None,  # Set `config` and `ckpt_path` to None to use pretrained model.
    ckpt_path=None,
    audio_file='./85236145389.wav',
    device=paddle.get_device())
print('Audio embedding Result: \n{}'.format(audio_emb))

test_emb = vector_executor(
    model='ecapatdnn_voxceleb12',
    sample_rate=16000,
    config=None,  # Set `config` and `ckpt_path` to None to use pretrained model.
    ckpt_path=None,
    audio_file='./123456789.wav',
    device=paddle.get_device())
print('Test embedding Result: \n{}'.format(test_emb))

# score range [0, 1]
score = vector_executor.get_embeddings_score(audio_emb, test_emb)
print(f"Embeddings Score: {score}")
```
Output:
```bash
# Vector Result:
Audio embedding Result:
[ 1.4217498 5.626253 -5.342073 1.1773866 3.308055
1.756596 5.167894 10.80636 -3.8226728 -5.6141334
2.623845 -0.8072968 1.9635103 -7.3128724 0.01103897
-9.723131 0.6619743 -6.976803 10.213478 7.494748
2.9105635 3.8949256 3.7999806 7.1061673 16.905321
-7.1493764 8.733103 3.4230042 -4.831653 -11.403367
11.232214 7.1274667 -4.2828417 2.452362 -5.130748
-18.177666 -2.6116815 -11.000337 -6.7314315 1.6564683
0.7618269 1.1253023 -2.083836 4.725744 -8.782597
-3.539873 3.814236 5.1420674 2.162061 4.096431
-6.4162116 12.747448 1.9429878 -15.152943 6.417416
16.097002 -9.716668 -1.9920526 -3.3649497 -1.871939
11.567354 3.69788 11.258265 7.442363 9.183411
4.5281515 -1.2417862 4.3959084 6.6727695 5.8898783
7.627124 -0.66919386 -11.889693 -9.208865 -7.4274073
-3.7776625 6.917234 -9.848748 -2.0944717 -5.135116
0.49563864 9.317534 -5.9141874 -1.8098574 -0.11738578
-7.169265 -1.0578263 -5.7216787 -5.1173844 16.137651
-4.473626 7.6624317 -0.55381083 9.631587 -6.4704556
-8.548508 4.3716145 -0.79702514 4.478997 -2.9758704
3.272176 2.8382776 5.134597 -9.190781 -0.5657382
-4.8745747 2.3165567 -5.984303 -2.1798875 0.35541576
-0.31784213 9.493548 2.1144536 4.358092 -12.089823
8.451689 -7.925461 4.6242585 4.4289427 18.692003
-2.6204622 -5.149185 -0.35821092 8.488551 4.981496
-9.32683 -2.2544234 6.6417594 1.2119585 10.977129
16.555033 3.3238444 9.551863 -1.6676947 -0.79539716
-8.605674 -0.47356385 2.6741948 -5.359179 -2.6673796
0.66607 15.443222 4.740594 -3.4725387 11.592567
-2.054497 1.7361217 -8.265324 -9.30447 5.4068313
-1.5180256 -7.746615 -6.089606 0.07112726 -0.34904733
-8.649895 -9.998958 -2.564841 -0.53999114 2.601808
-0.31927416 -1.8815292 -2.07215 -3.4105783 -8.2998085
1.483641 -15.365992 -8.288208 3.8847756 -3.4876456
7.3629923 0.4657332 3.132599 12.438889 -1.8337058
4.532936 2.7264361 10.145339 -6.521951 2.897153
-3.3925855 5.079156 7.759716 4.677565 5.8457737
2.402413 7.7071047 3.9711342 -6.390043 6.1268735
-3.7760346 -11.118123 ]
# get the test embedding
Test embedding Result:
[ -1.902964 2.0690894 -8.034194 3.5472693 0.18089125
6.9085927 1.4097427 -1.9487704 -10.021278 -0.20755845
-8.04332 4.344489 2.3200977 -14.306299 5.184692
-11.55602 -3.8497238 0.6444722 1.2833948 2.6766639
0.5878921 0.7946299 1.7207596 2.5791872 14.998469
-1.3385371 15.031221 -0.8006958 1.99287 -9.52007
2.435466 4.003221 -4.33817 -4.898601 -5.304714
-18.033886 10.790787 -12.784645 -5.641755 2.9761686
-10.566622 1.4839455 6.152458 -5.7195854 2.8603241
6.112133 8.489869 5.5958056 1.2836679 -1.2293907
0.89927405 7.0288725 -2.854029 -0.9782962 5.8255906
14.905906 -5.025907 0.7866458 -4.2444224 -16.354029
10.521315 0.9604709 -3.3257897 7.144871 -13.592733
-8.568869 -1.7953678 0.26313916 10.916714 -6.9374123
1.857403 -6.2746415 2.8154466 -7.2338667 -2.293357
-0.05452765 5.4287076 5.0849075 -6.690375 -1.6183422
3.654291 0.94352573 -9.200294 -5.4749465 -3.5235846
1.3420814 4.240421 -2.772944 -2.8451524 16.311104
4.2969875 -1.762936 -12.5758915 8.595198 -0.8835239
-1.5708797 1.568961 1.1413603 3.5032008 -0.45251232
-6.786333 16.89443 5.3366146 -8.789056 0.6355629
3.2579517 -3.328322 7.5969577 0.66025066 -6.550468
-9.148656 2.020372 -0.4615173 1.1965656 -3.8764873
11.6562195 -6.0750933 12.182899 3.2218833 0.81969476
5.570001 -3.8459578 -7.205299 7.9262037 -7.6611166
-5.249467 -2.2671914 7.2658715 -13.298164 4.821147
-2.7263982 11.691089 -3.8918593 -2.838112 -1.0336838
-3.8034165 2.8536487 -5.60398 -1.1972581 1.3455094
-3.4903061 2.2408795 5.5010734 -3.970756 11.99696
-7.8858757 0.43160373 -5.5059714 4.3426995 16.322706
11.635366 0.72157705 -9.245714 -3.91465 -4.449838
-1.5716927 7.713747 -2.2430465 -6.198303 -13.481864
2.8156567 -5.7812386 5.1456156 2.7289324 -14.505571
13.270688 3.448231 -7.0659585 4.5886116 -4.466099
-0.296428 -11.463529 -2.6076477 14.110243 -6.9725137
-1.9962958 2.7119343 19.391657 0.01961198 14.607133
-1.6695905 -4.391516 1.3131028 -6.670972 -5.888604
12.0612335 5.9285784 3.3715196 1.492534 10.723728
-0.95514804 -12.085431 ]
# get the score between enroll and test
Embeddings Score: 0.4292638301849365
```
### 4. Pretrained Models

@@ -1,6 +1,9 @@
#!/bin/bash
wget -c https://paddlespeech.bj.bcebos.com/vector/audio/85236145389.wav
wget -c https://paddlespeech.bj.bcebos.com/vector/audio/123456789.wav
# vector
paddlespeech vector --task spk --input ./85236145389.wav
paddlespeech vector --task score --input "./85236145389.wav ./123456789.wav"

@@ -6,7 +6,7 @@
### Speech Recognition Model
Acoustic Model | Training Data | Token-based | Size | Descriptions | CER | WER | Hours of speech | Example Link
:-------------:| :------------:| :-----: | -----: | :-----: |:-----:| :-----: | :-----: | :-----:
[Ds2 Online Aishell ASR0 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_online_aishell_ckpt_0.2.0.model.tar.gz) | Aishell Dataset | Char-based | 345 MB | 2 Conv + 5 LSTM layers with only forward direction | 0.078 |-| 151 h | [Ds2 Online Aishell ASR0](../../examples/aishell/asr0)
[Ds2 Offline Aishell ASR0 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_aishell_ckpt_0.1.1.model.tar.gz)| Aishell Dataset | Char-based | 306 MB | 2 Conv + 3 bidirectional GRU layers| 0.064 |-| 151 h | [Ds2 Offline Aishell ASR0](../../examples/aishell/asr0)
[Conformer Online Aishell ASR1 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr1/asr1_chunk_conformer_aishell_ckpt_0.1.2.model.tar.gz) | Aishell Dataset | Char-based | 189 MB | Encoder:Conformer, Decoder:Transformer, Decoding method: Attention rescoring | 0.0565 |-| 151 h | [Conformer Online Aishell ASR1](../../examples/aishell/asr1)
[Conformer Offline Aishell ASR1 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr1/asr1_conformer_aishell_ckpt_0.1.2.model.tar.gz) | Aishell Dataset | Char-based | 189 MB | Encoder:Conformer, Decoder:Transformer, Decoding method: Attention rescoring | 0.0483 |-| 151 h | [Conformer Offline Aishell ASR1](../../examples/aishell/asr1)
@@ -37,8 +37,8 @@ Model Type | Dataset| Example Link | Pretrained Models|Static Models|Size (stati
Tacotron2|LJSpeech|[tacotron2-ljspeech](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/ljspeech/tts0)|[tacotron2_ljspeech_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/tacotron2/tacotron2_ljspeech_ckpt_0.2.0.zip)|||
Tacotron2|CSMSC|[tacotron2-csmsc](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/tts0)|[tacotron2_csmsc_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/tacotron2/tacotron2_csmsc_ckpt_0.2.0.zip)|[tacotron2_csmsc_static_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/tacotron2/tacotron2_csmsc_static_0.2.0.zip)|103MB|
TransformerTTS| LJSpeech| [transformer-ljspeech](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/ljspeech/tts1)|[transformer_tts_ljspeech_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/transformer_tts/transformer_tts_ljspeech_ckpt_0.4.zip)|||
SpeedySpeech| CSMSC | [speedyspeech-csmsc](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/tts2) |[speedyspeech_nosil_baker_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/speedyspeech/speedyspeech_nosil_baker_ckpt_0.5.zip)|[speedyspeech_csmsc_static_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/speedyspeech/speedyspeech_csmsc_static_0.2.0.zip)|12MB|
FastSpeech2| CSMSC |[fastspeech2-csmsc](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/tts3)|[fastspeech2_nosil_baker_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_baker_ckpt_0.4.zip)|[fastspeech2_csmsc_static_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_csmsc_static_0.2.0.zip)|157MB|
FastSpeech2-Conformer| CSMSC |[fastspeech2-csmsc](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/tts3)|[fastspeech2_conformer_baker_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_conformer_baker_ckpt_0.5.zip)|||
FastSpeech2| AISHELL-3 |[fastspeech2-aishell3](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/tts3)|[fastspeech2_nosil_aishell3_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_aishell3_ckpt_0.4.zip)|||
FastSpeech2| LJSpeech |[fastspeech2-ljspeech](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/ljspeech/tts3)|[fastspeech2_nosil_ljspeech_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_ljspeech_ckpt_0.5.zip)|||
@@ -80,7 +80,7 @@ PANN | ESC-50 |[pann-esc50](../../examples/esc50/cls0)|[esc50_cnn6.tar.gz](https
Model Type | Dataset| Example Link | Pretrained Models | Static Models
:-------------:| :------------:| :-----: | :-----: | :-----:
ECAPA-TDNN | VoxCeleb| [voxceleb_ecapatdnn](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/voxceleb/sv0) | [ecapatdnn.tar.gz](https://paddlespeech.bj.bcebos.com/vector/voxceleb/sv0_ecapa_tdnn_voxceleb12_ckpt_0_2_0.tar.gz) | -
## Punctuation Restoration Models
Model Type | Dataset| Example Link | Pretrained Models

@@ -173,12 +173,7 @@ bash local/data.sh --stage 2 --stop_stage 2
CUDA_VISIBLE_DEVICES= ./local/test.sh conf/deepspeech2.yaml exp/deepspeech2/checkpoints/avg_1
```
The performance of the released models is shown in [RESULTS.md](./RESULTS.md).
## Stage 4: Static graph model Export
This stage is to transform dygraph to static graph.
```bash

@@ -4,15 +4,16 @@
| Model | Number of Params | Release | Config | Test set | Valid Loss | CER |
| --- | --- | --- | --- | --- | --- | --- |
| DeepSpeech2 | 45.18M | r0.2.0 | conf/deepspeech2_online.yaml + spec aug | test | 7.708217620849609 | 0.078 |
| DeepSpeech2 | 45.18M | v2.2.0 | conf/deepspeech2_online.yaml + spec aug | test | 7.994938373565674 | 0.080 |
## Deepspeech2 Non-Streaming
| Model | Number of Params | Release | Config | Test set | Valid Loss | CER |
| --- | --- | --- | --- | --- | --- | --- |
| DeepSpeech2 | 58.4M | v2.2.0 | conf/deepspeech2.yaml + spec aug | test | 5.738585948944092 | 0.064000 |
| DeepSpeech2 | 58.4M | v2.1.0 | conf/deepspeech2.yaml + spec aug | test | 7.483316898345947 | 0.077860 |
| DeepSpeech2 | 58.4M | v2.1.0 | conf/deepspeech2.yaml | test | 7.299022197723389 | 0.078671 |
| DeepSpeech2 | 58.4M | v2.0.0 | conf/deepspeech2.yaml | test | - | 0.078977 |
| DeepSpeech2 | 58.4M | v1.8.5 | - | test | - | 0.080447 |

@@ -118,7 +118,7 @@ CUDA_VISIBLE_DEVICES=${gpus} ./local/voice_cloning.sh ${conf_path} ${train_outpu
```
## Pretrained Model
- [tacotron2_aishell3_ckpt_vc0_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/tacotron2/tacotron2_aishell3_ckpt_vc0_0.2.0.zip)
Model | Step | eval/loss | eval/l1_loss | eval/mse_loss | eval/bce_loss| eval/attn_loss

@@ -119,7 +119,7 @@ ref_audio
CUDA_VISIBLE_DEVICES=${gpus} ./local/voice_cloning.sh ${conf_path} ${train_output_path} ${ckpt_name} ${ge2e_params_path} ${ref_audio_dir}
```
## Pretrained Model
- [fastspeech2_nosil_aishell3_vc1_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_aishell3_vc1_ckpt_0.5.zip)
Model | Step | eval/loss | eval/l1_loss | eval/duration_loss | eval/pitch_loss| eval/energy_loss
:-------------:| :------------:| :-----: | :-----: | :--------: |:--------:|:---------:

@@ -137,7 +137,8 @@ optional arguments:
5. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.
## Pretrained Models
Pretrained models can be downloaded here:
- [pwg_aishell3_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_aishell3_ckpt_0.5.zip)
Model | Step | eval/generator_loss | eval/log_stft_magnitude_loss| eval/spectral_convergence_loss
:-------------:| :------------:| :-----: | :-----: | :--------:

@@ -136,7 +136,8 @@ optional arguments:
4. `--output-dir` is the directory to save the synthesized audio files.
5. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.
## Pretrained Models
The pretrained model can be downloaded here:
- [hifigan_aishell3_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_aishell3_ckpt_0.2.0.zip)
Model | Step | eval/generator_loss | eval/mel_loss| eval/feature_matching_loss

@@ -212,7 +212,8 @@ optional arguments:
Pretrained Tacotron2 model with no silence in the edge of audios:
- [tacotron2_csmsc_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/tacotron2/tacotron2_csmsc_ckpt_0.2.0.zip)
The static model can be downloaded here:
- [tacotron2_csmsc_static_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/tacotron2/tacotron2_csmsc_static_0.2.0.zip)
Model | Step | eval/loss | eval/l1_loss | eval/mse_loss | eval/bce_loss| eval/attn_loss

@@ -221,9 +221,12 @@ CUDA_VISIBLE_DEVICES=${gpus} ./local/inference.sh ${train_output_path}
```
## Pretrained Model
Pretrained SpeedySpeech model with no silence in the edge of audios:
- [speedyspeech_nosil_baker_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/speedyspeech/speedyspeech_nosil_baker_ckpt_0.5.zip)
The static model can be downloaded here:
- [speedyspeech_nosil_baker_static_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/speedyspeech/speedyspeech_nosil_baker_static_0.5.zip)
- [speedyspeech_csmsc_static_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/speedyspeech/speedyspeech_csmsc_static_0.2.0.zip)
Model | Step | eval/loss | eval/l1_loss | eval/duration_loss | eval/ssim_loss
:-------------:| :------------:| :-----: | :-----: | :--------:|:--------:

@@ -232,6 +232,9 @@ The static model can be downloaded here:
- [fastspeech2_nosil_baker_static_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_baker_static_0.4.zip)
- [fastspeech2_csmsc_static_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_csmsc_static_0.2.0.zip)
The ONNX model can be downloaded here:
- [fastspeech2_csmsc_onnx_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_csmsc_onnx_0.2.0.zip)
Model | Step | eval/loss | eval/l1_loss | eval/duration_loss | eval/pitch_loss| eval/energy_loss
:-------------:| :------------:| :-----: | :-----: | :--------: |:--------:|:---------:
default| 2(gpu) x 76000|1.0991|0.59132|0.035815|0.31915|0.15287|

@@ -0,0 +1,31 @@
train_output_path=$1
stage=0
stop_stage=0
# only default_fastspeech2 + hifigan/mb_melgan are supported for now!
# synthesize from metadata
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
python3 ${BIN_DIR}/../ort_predict.py \
--inference_dir=${train_output_path}/inference_onnx \
--am=fastspeech2_csmsc \
--voc=hifigan_csmsc \
--test_metadata=dump/test/norm/metadata.jsonl \
--output_dir=${train_output_path}/onnx_infer_out \
--device=cpu \
--cpu_threads=2
fi
# e2e, synthesize from text
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
python3 ${BIN_DIR}/../ort_predict_e2e.py \
--inference_dir=${train_output_path}/inference_onnx \
--am=fastspeech2_csmsc \
--voc=hifigan_csmsc \
--output_dir=${train_output_path}/onnx_infer_out_e2e \
--text=${BIN_DIR}/../csmsc_test.txt \
--phones_dict=dump/phone_id_map.txt \
--device=cpu \
--cpu_threads=2
fi
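Under the hood, this kind of ONNX inference boils down to one ONNX Runtime session per exported model. The sketch below shows the bare mechanics under stated assumptions (illustrative file paths, a plain phone-id input for the acoustic model, a mel input for the vocoder; the real `ort_predict.py` handles text processing and input names itself):

```python
import numpy as np
import onnxruntime as ort

# cap CPU threads, matching --cpu_threads=2 above
opts = ort.SessionOptions()
opts.intra_op_num_threads = 2

# illustrative paths; the script reads models from --inference_dir
am = ort.InferenceSession("inference_onnx/fastspeech2_csmsc.onnx", opts,
                          providers=["CPUExecutionProvider"])
voc = ort.InferenceSession("inference_onnx/hifigan_csmsc.onnx", opts,
                           providers=["CPUExecutionProvider"])

phone_ids = np.array([5, 37, 81, 12], dtype=np.int64)  # dummy phone-id sequence
mel = am.run(None, {am.get_inputs()[0].name: phone_ids})[0]
wav = voc.run(None, {voc.get_inputs()[0].name: mel})[0]
print(wav.shape)
```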

@@ -0,0 +1,22 @@
train_output_path=$1
model_dir=$2
output_dir=$3
model=$4
enable_dev_version=True
model_name=${model%_*}
echo model_name: ${model_name}
if [ ${model_name} = 'mb_melgan' ] ;then
enable_dev_version=False
fi
mkdir -p ${train_output_path}/${output_dir}
paddle2onnx \
--model_dir ${train_output_path}/${model_dir} \
--model_filename ${model}.pdmodel \
--params_filename ${model}.pdiparams \
--save_file ${train_output_path}/${output_dir}/${model}.onnx \
--enable_dev_version ${enable_dev_version}
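The helper takes four positional arguments: the training output path, the directory holding the static model, the output directory for the ONNX file, and the model name. In `run.sh` below it is invoked as, e.g., `./local/paddle2onnx.sh ${train_output_path} inference inference_onnx fastspeech2_csmsc`; note that for `mb_melgan` the script flips `enable_dev_version` to `False`, while the other models are converted with the default `True`.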

@@ -41,3 +41,25 @@ if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
CUDA_VISIBLE_DEVICES=${gpus} ./local/inference.sh ${train_output_path} || exit -1
fi
# paddle2onnx, please make sure the static models are in ${train_output_path}/inference first
# we have only tested the following models so far
if [ ${stage} -le 5 ] && [ ${stop_stage} -ge 5 ]; then
# install paddle2onnx
version=$(echo `pip list |grep "paddle2onnx"` |awk -F" " '{print $2}')
if [[ -z "$version" || ${version} != '0.9.4' ]]; then
pip install paddle2onnx==0.9.4
fi
./local/paddle2onnx.sh ${train_output_path} inference inference_onnx fastspeech2_csmsc
./local/paddle2onnx.sh ${train_output_path} inference inference_onnx hifigan_csmsc
./local/paddle2onnx.sh ${train_output_path} inference inference_onnx mb_melgan_csmsc
fi
# inference with onnxruntime, use fastspeech2 + hifigan by default
if [ ${stage} -le 6 ] && [ ${stop_stage} -ge 6 ]; then
# install onnxruntime
version=$(echo `pip list |grep "onnxruntime"` |awk -F" " '{print $2}')
if [[ -z "$version" || ${version} != '1.10.0' ]]; then
pip install onnxruntime==1.10.0
fi
./local/ort_predict.sh ${train_output_path}
fi

@@ -127,9 +127,11 @@ optional arguments:
5. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.
## Pretrained Models
The pretrained model can be downloaded here:
- [pwg_baker_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_baker_ckpt_0.4.zip)
The static model can be downloaded here:
- [pwg_baker_static_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_baker_static_0.4.zip)
Model | Step | eval/generator_loss | eval/log_stft_magnitude_loss| eval/spectral_convergence_loss
:-------------:| :------------:| :-----: | :-----: | :--------:

@@ -152,11 +152,17 @@ TODO:
The hyperparameters in `finetune.yaml` are not good enough; a smaller `learning_rate` should be used (and more `milestones` should be set).
## Pretrained Models
The pretrained model can be downloaded here:
- [mb_melgan_csmsc_ckpt_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/mb_melgan/mb_melgan_csmsc_ckpt_0.1.1.zip)
The finetuned model can be downloaded here:
- [mb_melgan_baker_finetune_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/mb_melgan/mb_melgan_baker_finetune_ckpt_0.5.zip)
The static model can be downloaded here:
- [mb_melgan_csmsc_static_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/mb_melgan/mb_melgan_csmsc_static_0.1.1.zip)
The ONNX model can be downloaded here:
- [mb_melgan_csmsc_onnx_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/mb_melgan/mb_melgan_csmsc_onnx_0.2.0.zip)
Model | Step | eval/generator_loss | eval/log_stft_magnitude_loss|eval/spectral_convergence_loss |eval/sub_log_stft_magnitude_loss|eval/sub_spectral_convergence_loss
:-------------:| :------------:| :-----: | :-----: | :--------:| :--------:| :--------:

@ -112,7 +112,8 @@ optional arguments:
5. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.
## Pretrained Models
The pretrained model can be downloaded here:
- [style_melgan_csmsc_ckpt_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/style_melgan/style_melgan_csmsc_ckpt_0.1.1.zip)
The static model of Style MelGAN is not available now.

@ -112,9 +112,14 @@ optional arguments:
5. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.
## Pretrained Models
The pretrained model can be downloaded here:
- [hifigan_csmsc_ckpt_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_csmsc_ckpt_0.1.1.zip)
The static model can be downloaded here:
- [hifigan_csmsc_static_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_csmsc_static_0.1.1.zip)
The ONNX model can be downloaded here:
- [hifigan_csmsc_onnx_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_csmsc_onnx_0.2.0.zip)
Model | Step | eval/generator_loss | eval/mel_loss | eval/feature_matching_loss
:-------------:| :------------:| :-----: | :-----: | :--------:

@ -109,9 +109,11 @@ optional arguments:
5. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.
## Pretrained Models
The pretrained model can be downloaded here:
- [wavernn_csmsc_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/wavernn/wavernn_csmsc_ckpt_0.2.0.zip)
The static model can be downloaded here:
- [wavernn_csmsc_static_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/wavernn/wavernn_csmsc_static_0.2.0.zip)
Model | Step | eval/loss
:-------------:|:------------:| :------------:

@ -21,7 +21,7 @@
The pretrained model can be downloaded here [ernie_linear_p3_iwslt2012_zh_ckpt_0.1.1.zip](https://paddlespeech.bj.bcebos.com/text/ernie_linear_p3_iwslt2012_zh_ckpt_0.1.1.zip).
### Test Result
- Ernie
| |COMMA | PERIOD | QUESTION | OVERALL|
|:-----:|:-----:|:-----:|:-----:|:-----:|
|Precision |0.510955 |0.526462 |0.820755 |0.619391|

@ -0,0 +1,9 @@
# iwslt2012
## Ernie
| |COMMA | PERIOD | QUESTION | OVERALL|
|:-----:|:-----:|:-----:|:-----:|:-----:|
|Precision |0.510955 |0.526462 |0.820755 |0.619391|
|Recall |0.517433 |0.564179 |0.861386 |0.647666|
|F1 |0.514173 |0.544669 |0.840580 |0.633141|
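As a quick sanity check, the F1 row follows from the precision and recall rows via F1 = 2PR / (P + R); for the COMMA column:

```python
# COMMA column from the table above (values are themselves rounded)
p, r = 0.510955, 0.517433
f1 = 2 * p * r / (p + r)
print(round(f1, 6))  # ~0.514174, agreeing with the table's 0.514173 up to rounding
```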

@ -171,7 +171,8 @@ optional arguments:
6. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.
## Pretrained Model
The pretrained model can be downloaded here:
- [transformer_tts_ljspeech_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/transformer_tts/transformer_tts_ljspeech_ckpt_0.4.zip)
The TransformerTTS checkpoint contains the files listed below.
```text

@ -214,7 +214,8 @@ optional arguments:
9. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.
## Pretrained Model
Pretrained FastSpeech2 model with no silence at the edges of the audio:
- [fastspeech2_nosil_ljspeech_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_ljspeech_ckpt_0.5.zip)
Model | Step | eval/loss | eval/l1_loss | eval/duration_loss | eval/pitch_loss | eval/energy_loss
:-------------:| :------------:| :-----: | :-----: | :--------: |:--------:|:---------:

@ -50,4 +50,5 @@ Synthesize waveform.
6. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.
## Pretrained Model
The pretrained model with a residual channel size of 128 can be downloaded here:
- [waveflow_ljspeech_ckpt_0.3.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/waveflow/waveflow_ljspeech_ckpt_0.3.zip)

@ -127,7 +127,8 @@ optional arguments:
5. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.
## Pretrained Model
Pretrained models can be downloaded here:
- [pwg_ljspeech_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_ljspeech_ckpt_0.5.zip)
Parallel WaveGAN checkpoint contains files listed below.

@ -127,7 +127,8 @@ optional arguments:
5. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.
## Pretrained Model
The pretrained model can be downloaded here:
- [hifigan_ljspeech_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_ljspeech_ckpt_0.2.0.zip)
Model | Step | eval/generator_loss | eval/mel_loss | eval/feature_matching_loss
@ -143,6 +144,5 @@ hifigan_ljspeech_ckpt_0.2.0
└── snapshot_iter_2500000.pdz # generator parameters of hifigan
```
## Acknowledgement
We adapted some code from https://github.com/kan-bayashi/ParallelWaveGAN.

@ -217,7 +217,8 @@ optional arguments:
9. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.
## Pretrained Model
Pretrained FastSpeech2 model with no silence at the edges of the audio:
- [fastspeech2_nosil_vctk_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_vctk_ckpt_0.5.zip)
FastSpeech2 checkpoint contains files listed below.
```text

@ -132,7 +132,8 @@ optional arguments:
5. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.
## Pretrained Model
Pretrained models can be downloaded here:
- [pwg_vctk_ckpt_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_vctk_ckpt_0.1.1.zip)
Parallel WaveGAN checkpoint contains files listed below.

@ -133,7 +133,8 @@ optional arguments:
5. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.
## Pretrained Model
The pretrained model can be downloaded here:
- [hifigan_vctk_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_vctk_ckpt_0.2.0.zip)
Model | Step | eval/generator_loss | eval/mel_loss | eval/feature_matching_loss

@ -4,4 +4,4 @@
| Model | Number of Params | Release | Config | dim | Test set | Cosine | Cosine + S-Norm |
| --- | --- | --- | --- | --- | --- | --- | ---- |
| ECAPA-TDNN | 85M | 0.2.0 | conf/ecapa_tdnn.yaml | 192 | test | 1.02 | 0.95 |

@ -0,0 +1,30 @@
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import numpy as np


def pcm16to32(audio: np.ndarray) -> np.ndarray:
    """pcm int16 to float32
    Args:
        audio (np.ndarray): Waveform with dtype of int16.
    Returns:
        np.ndarray: Waveform with dtype of float32.
    """
    if audio.dtype == np.int16:
        audio = audio.astype("float32")
        bits = np.iinfo(np.int16).bits
        audio = audio / (2**(bits - 1))
    return audio
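A minimal usage sketch of `pcm16to32` (the wav path is illustrative; reading with `dtype="int16"` matches how the ASR executor below loads audio):

```python
import soundfile

# illustrative file name; any 16-bit PCM wav works
audio, sr = soundfile.read("input.wav", dtype="int16")
audio = pcm16to32(audio)  # now float32, scaled into [-1.0, 1.0)
```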

@ -80,9 +80,9 @@ pretrained_models = {
    },
    "deepspeech2online_aishell-zh-16k": {
        'url':
        'https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_online_aishell_ckpt_0.2.0.model.tar.gz',
        'md5':
        '23e16c69730a1cb5d735c98c83c21e16',
        'cfg_path':
        'model.yaml',
        'ckpt_path':
@ -426,6 +426,11 @@ class ASRExecutor(BaseExecutor):
        try:
            audio, audio_sample_rate = soundfile.read(
                audio_file, dtype="int16", always_2d=True)
            audio_duration = audio.shape[0] / audio_sample_rate
            max_duration = 50.0
            if audio_duration >= max_duration:
                logger.error("Please input an audio file less than 50 seconds.\n")
                return
        except Exception as e:
            logger.exception(e)
            logger.error(

@ -15,6 +15,7 @@ import argparse
import os
import sys
from collections import OrderedDict
from typing import Dict
from typing import List
from typing import Optional
from typing import Union
@ -42,9 +43,9 @@ pretrained_models = {
# "paddlespeech vector --task spk --model ecapatdnn_voxceleb12-16k --sr 16000 --input ./input.wav" # "paddlespeech vector --task spk --model ecapatdnn_voxceleb12-16k --sr 16000 --input ./input.wav"
"ecapatdnn_voxceleb12-16k": { "ecapatdnn_voxceleb12-16k": {
'url': 'url':
'https://paddlespeech.bj.bcebos.com/vector/voxceleb/sv0_ecapa_tdnn_voxceleb12_ckpt_0_1_1.tar.gz', 'https://paddlespeech.bj.bcebos.com/vector/voxceleb/sv0_ecapa_tdnn_voxceleb12_ckpt_0_2_0.tar.gz',
'md5': 'md5':
'a1c0dba7d4de997187786ff517d5b4ec', 'cc33023c54ab346cd318408f43fcaf95',
'cfg_path': 'cfg_path':
'conf/model.yaml', # the yaml config path 'conf/model.yaml', # the yaml config path
'ckpt_path': 'ckpt_path':
@ -79,7 +80,7 @@ class VectorExecutor(BaseExecutor):
"--task", "--task",
type=str, type=str,
default="spk", default="spk",
choices=["spk"], choices=["spk", "score"],
help="task type in vector domain") help="task type in vector domain")
self.parser.add_argument( self.parser.add_argument(
"--input", "--input",
@ -147,13 +148,40 @@ class VectorExecutor(BaseExecutor):
logger.info(f"task source: {task_source}") logger.info(f"task source: {task_source}")
# stage 3: process the audio one by one # stage 3: process the audio one by one
# we do action according the task type
task_result = OrderedDict() task_result = OrderedDict()
has_exceptions = False has_exceptions = False
for id_, input_ in task_source.items(): for id_, input_ in task_source.items():
try: try:
# extract the speaker audio embedding
if parser_args.task == "spk":
logger.info("do vector spk task")
res = self(input_, model, sample_rate, config, ckpt_path, res = self(input_, model, sample_rate, config, ckpt_path,
device) device)
task_result[id_] = res task_result[id_] = res
elif parser_args.task == "score":
logger.info("do vector score task")
logger.info(f"input content {input_}")
if len(input_.split()) != 2:
logger.error(
f"vector score task input {input_} wav num is not two,"
"that is {len(input_.split())}")
sys.exit(-1)
# get the enroll and test embedding
enroll_audio, test_audio = input_.split()
logger.info(
f"score task, enroll audio: {enroll_audio}, test audio: {test_audio}"
)
enroll_embedding = self(enroll_audio, model, sample_rate,
config, ckpt_path, device)
test_embedding = self(test_audio, model, sample_rate,
config, ckpt_path, device)
# get the score
res = self.get_embeddings_score(enroll_embedding,
test_embedding)
task_result[id_] = res
except Exception as e: except Exception as e:
has_exceptions = True has_exceptions = True
task_result[id_] = f'{e.__class__.__name__}: {e}' task_result[id_] = f'{e.__class__.__name__}: {e}'
@ -172,6 +200,49 @@ class VectorExecutor(BaseExecutor):
        else:
            return True
    def _get_job_contents(
            self, job_input: os.PathLike) -> Dict[str, Union[str, os.PathLike]]:
        """
        Read a job input file and return its contents in a dictionary.
        Refactored from Executor._get_job_contents.

        Args:
            job_input (os.PathLike): The job input file.

        Returns:
            Dict[str, str]: Contents of job input.
        """
        job_contents = OrderedDict()
        with open(job_input) as f:
            for line in f:
                line = line.strip()
                if not line:
                    continue
                k = line.split(' ')[0]
                v = ' '.join(line.split(' ')[1:])
                job_contents[k] = v
        return job_contents
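For the `score` task, each job line is expected to carry one utterance id followed by two wav paths; a small sketch of how the split above behaves (file names are hypothetical):

```python
# one line of a hypothetical job file for --task score
line = "utt_0 enroll_0.wav test_0.wav"
k = line.split(' ')[0]             # "utt_0"
v = ' '.join(line.split(' ')[1:])  # "enroll_0.wav test_0.wav"
# the score branch later splits v into the enroll and test wavs
enroll_audio, test_audio = v.split()
```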
    def get_embeddings_score(self, enroll_embedding, test_embedding):
        """get the enroll embedding and test embedding score

        Args:
            enroll_embedding (numpy.array): shape: (emb_size), enroll audio embedding
            test_embedding (numpy.array): shape: (emb_size), test audio embedding

        Returns:
            score: the score between enroll embedding and test embedding
        """
        if not hasattr(self, "score_func"):
            self.score_func = paddle.nn.CosineSimilarity(axis=0)
            logger.info("create the cosine score function")

        score = self.score_func(
            paddle.to_tensor(enroll_embedding),
            paddle.to_tensor(test_embedding))

        return score.item()
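The score itself is plain cosine similarity between the two embeddings; a standalone sketch of the same computation, with random vectors standing in for real 192-dim embeddings (dimension taken from the ECAPA-TDNN table above):

```python
import numpy as np
import paddle

# random stand-ins for enroll/test speaker embeddings
enroll = np.random.randn(192).astype("float32")
test = np.random.randn(192).astype("float32")

score_func = paddle.nn.CosineSimilarity(axis=0)
score = score_func(paddle.to_tensor(enroll), paddle.to_tensor(test))
print(score.item())  # in [-1, 1]; higher means more likely the same speaker
```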
    @stats_wrapper
    def __call__(self,
                 audio_file: os.PathLike,

@ -36,7 +36,7 @@ pretrained_models = {
        'url':
        'https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_online_aishell_ckpt_0.1.1.model.tar.gz',
        'md5':
        '23e16c69730a1cb5d735c98c83c21e16',
        'cfg_path':
        'model.yaml',
        'ckpt_path':

@ -86,6 +86,9 @@ def process_sentence(config: Dict[str, Any],
        logmel = mel_extractor.get_log_mel_fbank(wav)
        # change duration according to mel_length
        compare_duration_and_mel_length(sentences, utt_id, logmel)
        # utt_id may be popped in compare_duration_and_mel_length
        if utt_id not in sentences:
            return None
        phones = sentences[utt_id][0]
        durations = sentences[utt_id][1]
        num_frames = logmel.shape[0]

@ -104,7 +104,7 @@ def get_voc_output(args, voc_predictor, input):
def parse_args():
    parser = argparse.ArgumentParser(
        description="Paddle Inference with acoustic model & vocoder.")
    # acoustic model
    parser.add_argument(
        '--am',

@ -0,0 +1,156 @@
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
from pathlib import Path
import jsonlines
import numpy as np
import onnxruntime as ort
import soundfile as sf
from timer import timer
from paddlespeech.t2s.exps.syn_utils import get_test_dataset
from paddlespeech.t2s.utils import str2bool
def get_sess(args, filed='am'):
    full_name = ''
    if filed == 'am':
        full_name = args.am
    elif filed == 'voc':
        full_name = args.voc
    model_dir = str(Path(args.inference_dir) / (full_name + ".onnx"))
    sess_options = ort.SessionOptions()
    sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
    sess_options.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL

    if args.device == "gpu":
        # fastspeech2/mb_melgan can't use trt now!
        if args.use_trt:
            providers = ['TensorrtExecutionProvider']
        else:
            providers = ['CUDAExecutionProvider']
    elif args.device == "cpu":
        providers = ['CPUExecutionProvider']
        sess_options.intra_op_num_threads = args.cpu_threads
    sess = ort.InferenceSession(
        model_dir, providers=providers, sess_options=sess_options)
    return sess


def ort_predict(args):
    # construct dataset for evaluation
    with jsonlines.open(args.test_metadata, 'r') as reader:
        test_metadata = list(reader)
    am_name = args.am[:args.am.rindex('_')]
    am_dataset = args.am[args.am.rindex('_') + 1:]
    test_dataset = get_test_dataset(args, test_metadata, am_name, am_dataset)

    output_dir = Path(args.output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    fs = 24000 if am_dataset != 'ljspeech' else 22050

    # am
    am_sess = get_sess(args, filed='am')

    # vocoder
    voc_sess = get_sess(args, filed='voc')

    # am warmup
    for T in [27, 38, 54]:
        data = np.random.randint(1, 266, size=(T, ))
        am_sess.run(None, {"text": data})

    # voc warmup
    for T in [227, 308, 544]:
        data = np.random.rand(T, 80).astype("float32")
        voc_sess.run(None, {"logmel": data})
    print("warm up done!")

    N = 0
    T = 0
    for example in test_dataset:
        utt_id = example['utt_id']
        phone_ids = example["text"]
        with timer() as t:
            mel = am_sess.run(output_names=None, input_feed={'text': phone_ids})
            mel = mel[0]
            wav = voc_sess.run(output_names=None, input_feed={'logmel': mel})
        N += len(wav[0])
        T += t.elapse
        speed = len(wav[0]) / t.elapse
        rtf = fs / speed
        sf.write(
            str(output_dir / (utt_id + ".wav")),
            np.array(wav)[0],
            samplerate=fs)
        print(
            f"{utt_id}, mel: {mel.shape}, wave: {len(wav[0])}, time: {t.elapse}s, Hz: {speed}, RTF: {rtf}."
        )
    print(f"generation speed: {N / T}Hz, RTF: {fs / (N / T)}")


def parse_args():
    parser = argparse.ArgumentParser(description="Inference with onnxruntime.")
    # acoustic model
    parser.add_argument(
        '--am',
        type=str,
        default='fastspeech2_csmsc',
        choices=[
            'fastspeech2_csmsc',
        ],
        help='Choose acoustic model type of tts task.')

    # voc
    parser.add_argument(
        '--voc',
        type=str,
        default='hifigan_csmsc',
        choices=['hifigan_csmsc', 'mb_melgan_csmsc'],
        help='Choose vocoder type of tts task.')
    # other
    parser.add_argument(
        "--inference_dir", type=str, help="dir to save inference models")
    parser.add_argument("--test_metadata", type=str, help="test metadata.")
    parser.add_argument("--output_dir", type=str, help="output dir")
    # inference
    parser.add_argument(
        "--use_trt",
        type=str2bool,
        default=False,
        help="Whether to use inference engine TensorRT.", )
    parser.add_argument(
        "--device",
        default="gpu",
        choices=["gpu", "cpu"],
        help="Device selected for inference.", )
    parser.add_argument('--cpu_threads', type=int, default=1)

    args, _ = parser.parse_known_args()
    return args


def main():
    args = parse_args()
    ort_predict(args)


if __name__ == "__main__":
    main()
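The RTF reported above is the ratio of compute time to generated-audio duration; a tiny worked example of the same arithmetic (all numbers made up):

```python
# made-up numbers to illustrate the speed/RTF computation used above
fs = 24000                # output sample rate
wav_len = 48000           # 2 s of generated audio
elapse = 0.5              # wall-clock seconds spent generating it
speed = wav_len / elapse  # 96000 samples per second of compute
rtf = fs / speed          # 0.25, i.e. 4x faster than real time
```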

@ -0,0 +1,183 @@
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
from pathlib import Path
import numpy as np
import onnxruntime as ort
import soundfile as sf
from timer import timer
from paddlespeech.t2s.exps.syn_utils import get_frontend
from paddlespeech.t2s.exps.syn_utils import get_sentences
from paddlespeech.t2s.utils import str2bool
def get_sess(args, filed='am'):
    full_name = ''
    if filed == 'am':
        full_name = args.am
    elif filed == 'voc':
        full_name = args.voc
    model_dir = str(Path(args.inference_dir) / (full_name + ".onnx"))
    sess_options = ort.SessionOptions()
    sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
    sess_options.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL

    if args.device == "gpu":
        # fastspeech2/mb_melgan can't use trt now!
        if args.use_trt:
            providers = ['TensorrtExecutionProvider']
        else:
            providers = ['CUDAExecutionProvider']
    elif args.device == "cpu":
        providers = ['CPUExecutionProvider']
        sess_options.intra_op_num_threads = args.cpu_threads
    sess = ort.InferenceSession(
        model_dir, providers=providers, sess_options=sess_options)
    return sess


def ort_predict(args):
    # frontend
    frontend = get_frontend(args)

    output_dir = Path(args.output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    sentences = get_sentences(args)

    am_name = args.am[:args.am.rindex('_')]
    am_dataset = args.am[args.am.rindex('_') + 1:]
    fs = 24000 if am_dataset != 'ljspeech' else 22050

    # am
    am_sess = get_sess(args, filed='am')

    # vocoder
    voc_sess = get_sess(args, filed='voc')

    # am warmup
    for T in [27, 38, 54]:
        data = np.random.randint(1, 266, size=(T, ))
        am_sess.run(None, {"text": data})

    # voc warmup
    for T in [227, 308, 544]:
        data = np.random.rand(T, 80).astype("float32")
        voc_sess.run(None, {"logmel": data})
    print("warm up done!")

    # frontend warmup
    # Loading model cost 0.5+ seconds
    if args.lang == 'zh':
        frontend.get_input_ids("你好,欢迎使用飞桨框架进行深度学习研究!", merge_sentences=True)
    else:
        print("lang should be 'zh' here!")

    N = 0
    T = 0
    merge_sentences = True
    for utt_id, sentence in sentences:
        with timer() as t:
            if args.lang == 'zh':
                input_ids = frontend.get_input_ids(
                    sentence, merge_sentences=merge_sentences)
                phone_ids = input_ids["phone_ids"]
            else:
                print("lang should be 'zh' here!")
            # merge_sentences=True here, so we only use the first item of phone_ids
            phone_ids = phone_ids[0].numpy()
            mel = am_sess.run(output_names=None, input_feed={'text': phone_ids})
            mel = mel[0]
            wav = voc_sess.run(output_names=None, input_feed={'logmel': mel})
        N += len(wav[0])
        T += t.elapse
        speed = len(wav[0]) / t.elapse
        rtf = fs / speed
        sf.write(
            str(output_dir / (utt_id + ".wav")),
            np.array(wav)[0],
            samplerate=fs)
        print(
            f"{utt_id}, mel: {mel.shape}, wave: {len(wav[0])}, time: {t.elapse}s, Hz: {speed}, RTF: {rtf}."
        )
    print(f"generation speed: {N / T}Hz, RTF: {fs / (N / T)}")


def parse_args():
    parser = argparse.ArgumentParser(description="Inference with onnxruntime.")
    # acoustic model
    parser.add_argument(
        '--am',
        type=str,
        default='fastspeech2_csmsc',
        choices=[
            'fastspeech2_csmsc',
        ],
        help='Choose acoustic model type of tts task.')
    parser.add_argument(
        "--phones_dict", type=str, default=None, help="phone vocabulary file.")
    parser.add_argument(
        "--tones_dict", type=str, default=None, help="tone vocabulary file.")

    # voc
    parser.add_argument(
        '--voc',
        type=str,
        default='hifigan_csmsc',
        choices=['hifigan_csmsc', 'mb_melgan_csmsc'],
        help='Choose vocoder type of tts task.')
    # other
    parser.add_argument(
        "--inference_dir", type=str, help="dir to save inference models")
    parser.add_argument(
        "--text",
        type=str,
        help="text to synthesize, a 'utt_id sentence' pair per line")
    parser.add_argument("--output_dir", type=str, help="output dir")
    parser.add_argument(
        '--lang',
        type=str,
        default='zh',
        help='Choose model language. zh or en')

    # inference
    parser.add_argument(
        "--use_trt",
        type=str2bool,
        default=False,
        help="Whether to use inference engine TensorRT.", )
    parser.add_argument(
        "--device",
        default="gpu",
        choices=["gpu", "cpu"],
        help="Device selected for inference.", )
    parser.add_argument('--cpu_threads', type=int, default=1)

    args, _ = parser.parse_known_args()
    return args


def main():
    args = parse_args()
    ort_predict(args)


if __name__ == "__main__":
    main()

@ -79,6 +79,9 @@ def process_sentence(config: Dict[str, Any],
        logmel = mel_extractor.get_log_mel_fbank(wav)
        # change duration according to mel_length
        compare_duration_and_mel_length(sentences, utt_id, logmel)
        # utt_id may be popped in compare_duration_and_mel_length
        if utt_id not in sentences:
            return None
        labels = sentences[utt_id][0]
        # extract phone and duration
        phones = []

@ -90,6 +90,7 @@ def evaluate(args):
    output_dir = Path(args.output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    merge_sentences = True
    get_tone_ids = False

    N = 0
    T = 0
@ -98,8 +99,6 @@ def evaluate(args):
    for utt_id, sentence in sentences:
        with timer() as t:
            if args.lang == 'zh':
                input_ids = frontend.get_input_ids(
                    sentence,

@ -82,6 +82,9 @@ def process_sentence(config: Dict[str, Any],
        logmel = mel_extractor.get_log_mel_fbank(wav)
        # change duration according to mel_length
        compare_duration_and_mel_length(sentences, utt_id, logmel)
        # utt_id may be popped in compare_duration_and_mel_length
        if utt_id not in sentences:
            return None
        phones = sentences[utt_id][0]
        durations = sentences[utt_id][1]
        num_frames = logmel.shape[0]

@ -31,8 +31,9 @@ def sinusoid_position_encoding(num_positions: int,
    channel = paddle.arange(0, feature_size, 2, dtype=dtype)
    index = paddle.arange(start_pos, start_pos + num_positions, 1, dtype=dtype)
    denominator = channel / float(feature_size)
    denominator = paddle.to_tensor([10000.0], dtype='float32')**denominator
    p = (paddle.unsqueeze(index, -1) * omega) / denominator
    encodings = paddle.zeros([num_positions, feature_size], dtype=dtype)
    encodings[:, 0::2] = paddle.sin(p)
    encodings[:, 1::2] = paddle.cos(p)
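The refactor above only splits the `10000**(channel / feature_size)` denominator into explicit tensor steps; a numpy sketch of the equivalence (illustrative sizes, with `omega` left out since it multiplies both forms identically):

```python
import numpy as np

feature_size = 8
channel = np.arange(0, feature_size, 2, dtype="float32")

fused = 10000.0**(channel / float(feature_size))  # original one-liner
den = channel / float(feature_size)               # refactored two-step form
den = 10000.0**den
assert np.allclose(fused, den)
```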

@ -79,6 +79,20 @@ class Conv1d(nn.Layer):
            bias_attr=bias, )

    def forward(self, x):
        """Do conv1d forward

        Args:
            x (paddle.Tensor): [N, C, L] input data,
                N is the batch,
                C is the data dimension,
                L is the time

        Raises:
            ValueError: only support the same padding type

        Returns:
            paddle.Tensor: the value of conv1d
        """
        if self.padding == "same":
            x = self._manage_padding(x, self.kernel_size, self.dilation,
                                     self.stride)
@ -88,6 +102,20 @@ class Conv1d(nn.Layer):
        return self.conv(x)

    def _manage_padding(self, x, kernel_size: int, dilation: int, stride: int):
        """Pad the input data

        Args:
            x (paddle.Tensor): [N, C, L] input data,
                N is the batch,
                C is the data dimension,
                L is the time
            kernel_size (int): 1-d convolution kernel size
            dilation (int): 1-d convolution dilation
            stride (int): 1-d convolution stride

        Returns:
            paddle.Tensor: the padded input data
        """
        L_in = x.shape[-1]  # Detecting input shape
        padding = self._get_padding_elem(L_in, stride, kernel_size,
                                         dilation)  # Time padding
@ -101,6 +129,17 @@ class Conv1d(nn.Layer):
                          stride: int,
                          kernel_size: int,
                          dilation: int):
        """Calculate the padding value in "same" mode

        Args:
            L_in (int): length of the input data along the time axis
            stride (int): 1-d convolution stride
            kernel_size (int): 1-d convolution kernel size
            dilation (int): 1-d convolution dilation

        Returns:
            int: the padding value in "same" mode
        """
        if stride > 1:
            n_steps = math.ceil(((L_in - kernel_size * dilation) / stride) + 1)
            L_out = stride * (n_steps - 1) + kernel_size * dilation
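A quick worked example of the `stride > 1` branch above, with illustrative numbers:

```python
import math

# illustrative numbers for the stride > 1 branch of _get_padding_elem
L_in, stride, kernel_size, dilation = 100, 2, 3, 1
n_steps = math.ceil(((L_in - kernel_size * dilation) / stride) + 1)  # 50
L_out = stride * (n_steps - 1) + kernel_size * dilation              # 101
print(L_out - L_in)  # 1 -> one extra frame of padding keeps every step in bounds
```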
@ -245,6 +284,13 @@ class SEBlock(nn.Layer):
class AttentiveStatisticsPooling(nn.Layer):
    def __init__(self, channels, attention_channels=128, global_context=True):
        """Compute the speaker verification statistics

        The details are in section 3.1 of https://arxiv.org/pdf/1709.01507.pdf

        Args:
            channels (int): input data channel or data dimension
            attention_channels (int, optional): attention dimension. Defaults to 128.
            global_context (bool, optional): whether to use global context information. Defaults to True.
        """
        super().__init__()

        self.eps = 1e-12
