Merge branch 'develop' of github.com:PaddlePaddle/PaddleSpeech into add_onnx

pull/1665/head
TianYuan 3 years ago
commit c765fca6b4

@ -180,7 +180,7 @@ Via the easy-to-use, efficient, flexible and scalable implementation, our vision
2021.12.14: We would like to offer online courses that introduce the basics and research of speech, as well as code practice with `paddlespeech`. Please keep an eye on our [Calendar](https://www.paddlepaddle.org.cn/live). 2021.12.14: We would like to offer online courses that introduce the basics and research of speech, as well as code practice with `paddlespeech`. Please keep an eye on our [Calendar](https://www.paddlepaddle.org.cn/live).
---> --->
- 👏🏻 2022.03.28: PaddleSpeech Server is available for Audio Classification, Automatic Speech Recognition and Text-to-Speech. - 👏🏻 2022.03.28: PaddleSpeech Server is available for Audio Classification, Automatic Speech Recognition and Text-to-Speech.
- 👏🏻 2022.03.28: PaddleSpeech CLI is available for Speaker Verfication. - 👏🏻 2022.03.28: PaddleSpeech CLI is available for Speaker Verification.
- 🤗 2021.12.14: Our PaddleSpeech [ASR](https://huggingface.co/spaces/KPatrick/PaddleSpeechASR) and [TTS](https://huggingface.co/spaces/KPatrick/PaddleSpeechTTS) Demos on Hugging Face Spaces are available! - 🤗 2021.12.14: Our PaddleSpeech [ASR](https://huggingface.co/spaces/KPatrick/PaddleSpeechASR) and [TTS](https://huggingface.co/spaces/KPatrick/PaddleSpeechTTS) Demos on Hugging Face Spaces are available!
- 👏🏻 2021.12.10: PaddleSpeech CLI is available for Audio Classification, Automatic Speech Recognition, Speech Translation (English to Chinese) and Text-to-Speech. - 👏🏻 2021.12.10: PaddleSpeech CLI is available for Audio Classification, Automatic Speech Recognition, Speech Translation (English to Chinese) and Text-to-Speech.
@ -280,10 +280,14 @@ paddlespeech_client cls --server_ip 127.0.0.1 --port 8090 --input input.wav
For more information about server command lines, please see: [speech server demos](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/demos/speech_server) For more information about server command lines, please see: [speech server demos](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/demos/speech_server)
<a name="ModelList"></a>
## Model List ## Model List
PaddleSpeech supports a range of popular models. They are summarized in [released models](./docs/source/released_model.md), together with the available pretrained models. PaddleSpeech supports a range of popular models. They are summarized in [released models](./docs/source/released_model.md), together with the available pretrained models.
<a name="SpeechToText"></a>
**Speech-to-Text** contains *Acoustic Model*, *Language Model*, and *Speech Translation*, with the following details: **Speech-to-Text** contains *Acoustic Model*, *Language Model*, and *Speech Translation*, with the following details:
<table style="width:100%"> <table style="width:100%">
@ -357,6 +361,8 @@ PaddleSpeech supports a series of most popular models. They are summarized in [r
</tbody> </tbody>
</table> </table>
<a name="TextToSpeech"></a>
**Text-to-Speech** in PaddleSpeech mainly contains three modules: *Text Frontend*, *Acoustic Model* and *Vocoder*. The Acoustic Model and Vocoder models are listed as follows: **Text-to-Speech** in PaddleSpeech mainly contains three modules: *Text Frontend*, *Acoustic Model* and *Vocoder*. The Acoustic Model and Vocoder models are listed as follows:
<table> <table>
@ -473,6 +479,8 @@ PaddleSpeech supports a series of most popular models. They are summarized in [r
</tbody> </tbody>
</table> </table>
<a name="AudioClassification"></a>
**Audio Classification** **Audio Classification**
<table style="width:100%"> <table style="width:100%">
@ -496,6 +504,8 @@ PaddleSpeech supports a series of most popular models. They are summarized in [r
</tbody> </tbody>
</table> </table>
<a name="SpeakerVerification"></a>
**Speaker Verification** **Speaker Verification**
<table style="width:100%"> <table style="width:100%">
@ -519,6 +529,8 @@ PaddleSpeech supports a series of most popular models. They are summarized in [r
</tbody> </tbody>
</table> </table>
<a name="PunctuationRestoration"></a>
**Punctuation Restoration** **Punctuation Restoration**
<table style="width:100%"> <table style="width:100%">
@ -559,10 +571,18 @@ Normally, [Speech SoTA](https://paperswithcode.com/area/speech), [Audio SoTA](ht
- [Advanced Usage](./docs/source/tts/advanced_usage.md) - [Advanced Usage](./docs/source/tts/advanced_usage.md)
- [Chinese Rule Based Text Frontend](./docs/source/tts/zh_text_frontend.md) - [Chinese Rule Based Text Frontend](./docs/source/tts/zh_text_frontend.md)
- [Test Audio Samples](https://paddlespeech.readthedocs.io/en/latest/tts/demo.html) - [Test Audio Samples](https://paddlespeech.readthedocs.io/en/latest/tts/demo.html)
- [Audio Classification](./demos/audio_tagging/README.md) - Speaker Verification
- [Audio Searching](./demos/audio_searching/README.md)
- [Speaker Verification](./demos/speaker_verification/README.md) - [Speaker Verification](./demos/speaker_verification/README.md)
- [Audio Classification](./demos/audio_tagging/README.md)
- [Speech Translation](./demos/speech_translation/README.md) - [Speech Translation](./demos/speech_translation/README.md)
- [Speech Server](./demos/speech_server/README.md)
- [Released Models](./docs/source/released_model.md) - [Released Models](./docs/source/released_model.md)
- [Speech-to-Text](#SpeechToText)
- [Text-to-Speech](#TextToSpeech)
- [Audio Classification](#AudioClassification)
- [Speaker Verification](#SpeakerVerification)
- [Punctuation Restoration](#PunctuationRestoration)
- [Community](#Community) - [Community](#Community)
- [Welcome to contribute](#contribution) - [Welcome to contribute](#contribution)
- [License](#License) - [License](#License)

@ -273,6 +273,8 @@ paddlespeech_client cls --server_ip 127.0.0.1 --port 8090 --input input.wav
## 模型列表 ## 模型列表
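PaddleSpeech supports many mainstream models and provides pretrained models. For details, see the [model list](./docs/source/released_model.md). PaddleSpeech supports many mainstream models and provides pretrained models. For details, see the [model list](./docs/source/released_model.md).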
<a name="语音识别模型"></a>
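**Speech-to-Text** in PaddleSpeech contains the speech recognition acoustic model, the speech recognition language model, and speech translation, with the following details: **Speech-to-Text** in PaddleSpeech contains the speech recognition acoustic model, the speech recognition language model, and speech translation, with the following details: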
<table style="width:100%"> <table style="width:100%">
@ -347,6 +349,7 @@ PaddleSpeech 的 **语音转文本** 包含语音识别声学模型、语音识
</table> </table>
<a name="语音合成模型"></a> <a name="语音合成模型"></a>
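**Text-to-Speech** in PaddleSpeech mainly contains three modules: text frontend, acoustic model, and vocoder. The acoustic models and vocoder models are listed as follows: **Text-to-Speech** in PaddleSpeech mainly contains three modules: text frontend, acoustic model, and vocoder. The acoustic models and vocoder models are listed as follows: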
<table> <table>
@ -488,6 +491,8 @@ PaddleSpeech 的 **语音合成** 主要包含三个模块:文本前端、声
</table> </table>
<a name="声纹识别模型"></a>
**Speaker Verification** **Speaker Verification**
<table style="width:100%"> <table style="width:100%">
@ -511,6 +516,8 @@ PaddleSpeech 的 **语音合成** 主要包含三个模块:文本前端、声
</tbody> </tbody>
</table> </table>
<a name="标点恢复模型"></a>
**Punctuation Restoration** **Punctuation Restoration**
<table style="width:100%"> <table style="width:100%">
@ -556,13 +563,18 @@ PaddleSpeech 的 **语音合成** 主要包含三个模块:文本前端、声
- [Advanced Usage](./docs/source/tts/advanced_usage.md) - [Advanced Usage](./docs/source/tts/advanced_usage.md)
- [Chinese Text Frontend](./docs/source/tts/zh_text_frontend.md) - [Chinese Text Frontend](./docs/source/tts/zh_text_frontend.md)
- [Test Audio Samples](https://paddlespeech.readthedocs.io/en/latest/tts/demo.html) - [Test Audio Samples](https://paddlespeech.readthedocs.io/en/latest/tts/demo.html)
- [Audio Classification](./demos/audio_tagging/README_cn.md) - Speaker Verification
- [Speaker Verification](./demos/speaker_verification/README_cn.md) - [Speaker Verification](./demos/speaker_verification/README_cn.md)
- [Audio Searching](./demos/audio_searching/README_cn.md)
- [Audio Classification](./demos/audio_tagging/README_cn.md)
- [Speech Translation](./demos/speech_translation/README_cn.md) - [Speech Translation](./demos/speech_translation/README_cn.md)
- [Speech Server](./demos/speech_server/README_cn.md)
- [Model List](#模型列表) - [Model List](#模型列表)
- [Speech Recognition](#语音识别模型) - [Speech Recognition](#语音识别模型)
- [Text-to-Speech](#语音合成模型) - [Text-to-Speech](#语音合成模型)
- [Audio Classification](#声音分类模型) - [Audio Classification](#声音分类模型)
- [Speaker Verification](#声纹识别模型)
- [Punctuation Restoration](#标点恢复模型)
- [Technical Discussion Group](#技术交流群) - [Technical Discussion Group](#技术交流群)
- [Welcome to contribute](#欢迎贡献) - [Welcome to contribute](#欢迎贡献)
- [License](#License) - [License](#License)

@ -38,6 +38,7 @@ wget -c https://paddlespeech.bj.bcebos.com/vector/audio/85236145389.wav
``` ```
Arguments: Arguments:
- `input`(required): Audio file to recognize. - `input`(required): Audio file to recognize.
- `task` (required): The specific `vector` task to run. Default: `spk`.
- `model`: Model type of vector task. Default: `ecapatdnn_voxceleb12`. - `model`: Model type of vector task. Default: `ecapatdnn_voxceleb12`.
- `sample_rate`: Sample rate of the model. Default: `16000`. - `sample_rate`: Sample rate of the model. Default: `16000`.
- `config`: Config of vector task. Use pretrained model when it is None. Default: `None`. - `config`: Config of vector task. Use pretrained model when it is None. Default: `None`.
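The arguments above map onto a command line. The following is a minimal sketch, assuming the `paddlespeech` CLI is installed and on `PATH`. Only the `--task` and `--input` flags are used, since both appear in this demo's run script; `build_vector_command` is a hypothetical helper name introduced for illustration and is not part of PaddleSpeech.

```python
import shlex


def build_vector_command(input_path, task="spk"):
    """Return the argv list for a `paddlespeech vector` call (hypothetical helper)."""
    # Only --task and --input are used here; both appear in the demo's run script.
    return ["paddlespeech", "vector", "--task", task, "--input", input_path]


cmd = build_vector_command("./85236145389.wav")
# Print a copy-pasteable shell command instead of executing it,
# so the sketch runs even where PaddleSpeech is not installed.
print(shlex.join(cmd))
```

Building the command as a list keeps it safe to pass to `subprocess.run(cmd)` without shell quoting problems.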
@ -47,45 +48,45 @@ wget -c https://paddlespeech.bj.bcebos.com/vector/audio/85236145389.wav
Output: Output:
```bash ```bash
demo [ -5.749211 9.505463 -8.200284 -5.2075014 5.3940268 demo [ 1.4217498 5.626253 -5.342073 1.1773866 3.308055
-3.04878 1.611095 10.127234 -10.534177 -15.821609 1.756596 5.167894 10.80636 -3.8226728 -5.6141334
1.2032688 -0.35080156 1.2629458 -12.643498 -2.5758228 2.623845 -0.8072968 1.9635103 -7.3128724 0.01103897
-11.343508 2.3385992 -8.719341 14.213509 15.404744 -9.723131 0.6619743 -6.976803 10.213478 7.494748
-0.39327756 6.338786 2.688887 8.7104025 17.469526 2.9105635 3.8949256 3.7999806 7.1061673 16.905321
-8.77959 7.0576906 4.648855 -1.3089896 -23.294737 -7.1493764 8.733103 3.4230042 -4.831653 -11.403367
8.013747 13.891729 -9.926753 5.655307 -5.9422326 11.232214 7.1274667 -4.2828417 2.452362 -5.130748
-22.842539 0.6293588 -18.46266 -10.811862 9.8192625 -18.177666 -2.6116815 -11.000337 -6.7314315 1.6564683
3.0070958 3.8072643 -2.3861165 3.0821571 -14.739942 0.7618269 1.1253023 -2.083836 4.725744 -8.782597
1.7594414 -0.6485091 4.485623 2.0207152 7.264915 -3.539873 3.814236 5.1420674 2.162061 4.096431
-6.40137 23.63524 2.9711294 -22.708025 9.93719 -6.4162116 12.747448 1.9429878 -15.152943 6.417416
20.354511 -10.324688 -0.700492 -8.783211 -5.27593 16.097002 -9.716668 -1.9920526 -3.3649497 -1.871939
15.999649 3.3004563 12.747926 15.429879 4.7849145 11.567354 3.69788 11.258265 7.442363 9.183411
5.6699696 -2.3826702 10.605882 3.9112158 3.1500628 4.5281515 -1.2417862 4.3959084 6.6727695 5.8898783
15.859915 -2.1832209 -23.908653 -6.4799504 -4.5365124 7.627124 -0.66919386 -11.889693 -9.208865 -7.4274073
-9.224193 14.568347 -10.568833 4.982321 -4.342062 -3.7776625 6.917234 -9.848748 -2.0944717 -5.135116
0.0914714 12.645902 -5.74285 -3.2141201 -2.7173362 0.49563864 9.317534 -5.9141874 -1.8098574 -0.11738578
-6.680575 0.4757669 -5.035051 -6.7964664 16.865469 -7.169265 -1.0578263 -5.7216787 -5.1173844 16.137651
-11.54324 7.681869 0.44475392 9.708182 -8.932846 -4.473626 7.6624317 -0.55381083 9.631587 -6.4704556
0.4123232 -4.361452 1.3948607 9.511665 0.11667654 -8.548508 4.3716145 -0.79702514 4.478997 -2.9758704
2.9079323 6.049952 9.275183 -18.078873 6.2983274 3.272176 2.8382776 5.134597 -9.190781 -0.5657382
-0.7500531 -2.725033 -7.6027865 3.3404543 2.990815 -4.8745747 2.3165567 -5.984303 -2.1798875 0.35541576
4.010979 11.000591 -2.8873312 7.1352735 -16.79663 -0.31784213 9.493548 2.1144536 4.358092 -12.089823
18.495346 -14.293832 7.89578 2.2714825 22.976387 8.451689 -7.925461 4.6242585 4.4289427 18.692003
-4.875734 -3.0836344 -2.9999814 13.751918 6.448228 -2.6204622 -5.149185 -0.35821092 8.488551 4.981496
-11.924197 2.171869 2.0423572 -6.173772 10.778437 -9.32683 -2.2544234 6.6417594 1.2119585 10.977129
25.77281 -4.9495463 14.57806 0.3044315 2.6132357 16.555033 3.3238444 9.551863 -1.6676947 -0.79539716
-7.591999 -2.076944 9.025118 1.7834753 -3.1799617 -8.605674 -0.47356385 2.6741948 -5.359179 -2.6673796
-4.9401326 23.465864 5.1685796 -9.018578 9.037825 0.66607 15.443222 4.740594 -3.4725387 11.592567
-4.4150195 6.859591 -12.274467 -0.88911164 5.186309 -2.054497 1.7361217 -8.265324 -9.30447 5.4068313
-3.9988663 -13.638606 -9.925445 -0.06329413 -3.6709652 -1.5180256 -7.746615 -6.089606 0.07112726 -0.34904733
-12.397416 -12.719869 -1.395601 2.1150916 5.7381287 -8.649895 -9.998958 -2.564841 -0.53999114 2.601808
-4.4691963 -3.82819 -0.84233856 -1.1604277 -13.490127 -0.31927416 -1.8815292 -2.07215 -3.4105783 -8.2998085
8.731719 -20.778936 -11.495662 5.8033476 -4.752041 1.483641 -15.365992 -8.288208 3.8847756 -3.4876456
10.833007 -6.717991 4.504732 13.4244375 1.1306485 7.3629923 0.4657332 3.132599 12.438889 -1.8337058
7.3435574 1.400918 14.704036 -9.501399 7.2315617 4.532936 2.7264361 10.145339 -6.521951 2.897153
-6.417456 1.3333273 11.872697 -0.30664724 8.8845 -3.3925855 5.079156 7.759716 4.677565 5.8457737
6.5569253 4.7948146 0.03662816 -8.704245 6.224871 2.402413 7.7071047 3.9711342 -6.390043 6.1268735
-3.2701402 -11.508579 ] -3.7760346 -11.118123 ]
``` ```
- Python API - Python API
@ -97,56 +98,57 @@ wget -c https://paddlespeech.bj.bcebos.com/vector/audio/85236145389.wav
audio_emb = vector_executor( audio_emb = vector_executor(
model='ecapatdnn_voxceleb12', model='ecapatdnn_voxceleb12',
sample_rate=16000, sample_rate=16000,
config=None, config=None, # Set `config` and `ckpt_path` to None to use pretrained model.
ckpt_path=None, ckpt_path=None,
audio_file='./85236145389.wav', audio_file='./85236145389.wav',
force_yes=False,
device=paddle.get_device()) device=paddle.get_device())
print('Audio embedding Result: \n{}'.format(audio_emb)) print('Audio embedding Result: \n{}'.format(audio_emb))
``` ```
Output: Output
```bash ```bash
# Vector Result: # Vector Result:
[ -5.749211 9.505463 -8.200284 -5.2075014 5.3940268 Audio embedding Result:
-3.04878 1.611095 10.127234 -10.534177 -15.821609 [ 1.4217498 5.626253 -5.342073 1.1773866 3.308055
1.2032688 -0.35080156 1.2629458 -12.643498 -2.5758228 1.756596 5.167894 10.80636 -3.8226728 -5.6141334
-11.343508 2.3385992 -8.719341 14.213509 15.404744 2.623845 -0.8072968 1.9635103 -7.3128724 0.01103897
-0.39327756 6.338786 2.688887 8.7104025 17.469526 -9.723131 0.6619743 -6.976803 10.213478 7.494748
-8.77959 7.0576906 4.648855 -1.3089896 -23.294737 2.9105635 3.8949256 3.7999806 7.1061673 16.905321
8.013747 13.891729 -9.926753 5.655307 -5.9422326 -7.1493764 8.733103 3.4230042 -4.831653 -11.403367
-22.842539 0.6293588 -18.46266 -10.811862 9.8192625 11.232214 7.1274667 -4.2828417 2.452362 -5.130748
3.0070958 3.8072643 -2.3861165 3.0821571 -14.739942 -18.177666 -2.6116815 -11.000337 -6.7314315 1.6564683
1.7594414 -0.6485091 4.485623 2.0207152 7.264915 0.7618269 1.1253023 -2.083836 4.725744 -8.782597
-6.40137 23.63524 2.9711294 -22.708025 9.93719 -3.539873 3.814236 5.1420674 2.162061 4.096431
20.354511 -10.324688 -0.700492 -8.783211 -5.27593 -6.4162116 12.747448 1.9429878 -15.152943 6.417416
15.999649 3.3004563 12.747926 15.429879 4.7849145 16.097002 -9.716668 -1.9920526 -3.3649497 -1.871939
5.6699696 -2.3826702 10.605882 3.9112158 3.1500628 11.567354 3.69788 11.258265 7.442363 9.183411
15.859915 -2.1832209 -23.908653 -6.4799504 -4.5365124 4.5281515 -1.2417862 4.3959084 6.6727695 5.8898783
-9.224193 14.568347 -10.568833 4.982321 -4.342062 7.627124 -0.66919386 -11.889693 -9.208865 -7.4274073
0.0914714 12.645902 -5.74285 -3.2141201 -2.7173362 -3.7776625 6.917234 -9.848748 -2.0944717 -5.135116
-6.680575 0.4757669 -5.035051 -6.7964664 16.865469 0.49563864 9.317534 -5.9141874 -1.8098574 -0.11738578
-11.54324 7.681869 0.44475392 9.708182 -8.932846 -7.169265 -1.0578263 -5.7216787 -5.1173844 16.137651
0.4123232 -4.361452 1.3948607 9.511665 0.11667654 -4.473626 7.6624317 -0.55381083 9.631587 -6.4704556
2.9079323 6.049952 9.275183 -18.078873 6.2983274 -8.548508 4.3716145 -0.79702514 4.478997 -2.9758704
-0.7500531 -2.725033 -7.6027865 3.3404543 2.990815 3.272176 2.8382776 5.134597 -9.190781 -0.5657382
4.010979 11.000591 -2.8873312 7.1352735 -16.79663 -4.8745747 2.3165567 -5.984303 -2.1798875 0.35541576
18.495346 -14.293832 7.89578 2.2714825 22.976387 -0.31784213 9.493548 2.1144536 4.358092 -12.089823
-4.875734 -3.0836344 -2.9999814 13.751918 6.448228 8.451689 -7.925461 4.6242585 4.4289427 18.692003
-11.924197 2.171869 2.0423572 -6.173772 10.778437 -2.6204622 -5.149185 -0.35821092 8.488551 4.981496
25.77281 -4.9495463 14.57806 0.3044315 2.6132357 -9.32683 -2.2544234 6.6417594 1.2119585 10.977129
-7.591999 -2.076944 9.025118 1.7834753 -3.1799617 16.555033 3.3238444 9.551863 -1.6676947 -0.79539716
-4.9401326 23.465864 5.1685796 -9.018578 9.037825 -8.605674 -0.47356385 2.6741948 -5.359179 -2.6673796
-4.4150195 6.859591 -12.274467 -0.88911164 5.186309 0.66607 15.443222 4.740594 -3.4725387 11.592567
-3.9988663 -13.638606 -9.925445 -0.06329413 -3.6709652 -2.054497 1.7361217 -8.265324 -9.30447 5.4068313
-12.397416 -12.719869 -1.395601 2.1150916 5.7381287 -1.5180256 -7.746615 -6.089606 0.07112726 -0.34904733
-4.4691963 -3.82819 -0.84233856 -1.1604277 -13.490127 -8.649895 -9.998958 -2.564841 -0.53999114 2.601808
8.731719 -20.778936 -11.495662 5.8033476 -4.752041 -0.31927416 -1.8815292 -2.07215 -3.4105783 -8.2998085
10.833007 -6.717991 4.504732 13.4244375 1.1306485 1.483641 -15.365992 -8.288208 3.8847756 -3.4876456
7.3435574 1.400918 14.704036 -9.501399 7.2315617 7.3629923 0.4657332 3.132599 12.438889 -1.8337058
-6.417456 1.3333273 11.872697 -0.30664724 8.8845 4.532936 2.7264361 10.145339 -6.521951 2.897153
6.5569253 4.7948146 0.03662816 -8.704245 6.224871 -3.3925855 5.079156 7.759716 4.677565 5.8457737
-3.2701402 -11.508579 ] 2.402413 7.7071047 3.9711342 -6.390043 6.1268735
-3.7760346 -11.118123 ]
``` ```
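Speaker verification usually compares two such embeddings rather than inspecting one in isolation. The following is a minimal sketch, assuming cosine similarity as the scoring function and an illustrative decision threshold. The short vectors are made-up stand-ins for `audio_emb`, not real model output, and neither the function names nor the threshold come from PaddleSpeech.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def same_speaker(emb_a, emb_b, threshold=0.5):
    # The threshold is an illustrative assumption; in practice it is tuned on held-out trials.
    return cosine_similarity(emb_a, emb_b) >= threshold


# Toy vectors standing in for two embeddings (made up for illustration).
enroll = [1.0, 2.0, -1.0, 0.5]
test = [0.9, 2.1, -0.8, 0.4]
print(round(cosine_similarity(enroll, test), 4))
print(same_speaker(enroll, test))
```

Cosine similarity ignores vector magnitude, which is why it is a common choice for comparing embeddings whose scale varies between utterances.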
### 4.Pretrained Models ### 4.Pretrained Models

@ -37,6 +37,7 @@ wget -c https://paddlespeech.bj.bcebos.com/vector/audio/85236145389.wav
``` ```
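Arguments: Arguments:
- `input`(required): Audio file to recognize. - `input`(required): Audio file to recognize.
- `task` (required): The specific `vector` task to run. Default: `spk`.
- `model`: Model of the vector task. Default: `ecapatdnn_voxceleb12`. - `model`: Model of the vector task. Default: `ecapatdnn_voxceleb12`.
- `sample_rate`: Audio sample rate. Default: `16000`. - `sample_rate`: Audio sample rate. Default: `16000`.
- `config`: Config file of the vector task. If not set, the default config of the pretrained model is used. Default: `None`. - `config`: Config file of the vector task. If not set, the default config of the pretrained model is used. Default: `None`.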
@ -45,45 +46,45 @@ wget -c https://paddlespeech.bj.bcebos.com/vector/audio/85236145389.wav
Output: Output:
```bash ```bash
demo [ -5.749211 9.505463 -8.200284 -5.2075014 5.3940268 demo [ 1.4217498 5.626253 -5.342073 1.1773866 3.308055
-3.04878 1.611095 10.127234 -10.534177 -15.821609 1.756596 5.167894 10.80636 -3.8226728 -5.6141334
1.2032688 -0.35080156 1.2629458 -12.643498 -2.5758228 2.623845 -0.8072968 1.9635103 -7.3128724 0.01103897
-11.343508 2.3385992 -8.719341 14.213509 15.404744 -9.723131 0.6619743 -6.976803 10.213478 7.494748
-0.39327756 6.338786 2.688887 8.7104025 17.469526 2.9105635 3.8949256 3.7999806 7.1061673 16.905321
-8.77959 7.0576906 4.648855 -1.3089896 -23.294737 -7.1493764 8.733103 3.4230042 -4.831653 -11.403367
8.013747 13.891729 -9.926753 5.655307 -5.9422326 11.232214 7.1274667 -4.2828417 2.452362 -5.130748
-22.842539 0.6293588 -18.46266 -10.811862 9.8192625 -18.177666 -2.6116815 -11.000337 -6.7314315 1.6564683
3.0070958 3.8072643 -2.3861165 3.0821571 -14.739942 0.7618269 1.1253023 -2.083836 4.725744 -8.782597
1.7594414 -0.6485091 4.485623 2.0207152 7.264915 -3.539873 3.814236 5.1420674 2.162061 4.096431
-6.40137 23.63524 2.9711294 -22.708025 9.93719 -6.4162116 12.747448 1.9429878 -15.152943 6.417416
20.354511 -10.324688 -0.700492 -8.783211 -5.27593 16.097002 -9.716668 -1.9920526 -3.3649497 -1.871939
15.999649 3.3004563 12.747926 15.429879 4.7849145 11.567354 3.69788 11.258265 7.442363 9.183411
5.6699696 -2.3826702 10.605882 3.9112158 3.1500628 4.5281515 -1.2417862 4.3959084 6.6727695 5.8898783
15.859915 -2.1832209 -23.908653 -6.4799504 -4.5365124 7.627124 -0.66919386 -11.889693 -9.208865 -7.4274073
-9.224193 14.568347 -10.568833 4.982321 -4.342062 -3.7776625 6.917234 -9.848748 -2.0944717 -5.135116
0.0914714 12.645902 -5.74285 -3.2141201 -2.7173362 0.49563864 9.317534 -5.9141874 -1.8098574 -0.11738578
-6.680575 0.4757669 -5.035051 -6.7964664 16.865469 -7.169265 -1.0578263 -5.7216787 -5.1173844 16.137651
-11.54324 7.681869 0.44475392 9.708182 -8.932846 -4.473626 7.6624317 -0.55381083 9.631587 -6.4704556
0.4123232 -4.361452 1.3948607 9.511665 0.11667654 -8.548508 4.3716145 -0.79702514 4.478997 -2.9758704
2.9079323 6.049952 9.275183 -18.078873 6.2983274 3.272176 2.8382776 5.134597 -9.190781 -0.5657382
-0.7500531 -2.725033 -7.6027865 3.3404543 2.990815 -4.8745747 2.3165567 -5.984303 -2.1798875 0.35541576
4.010979 11.000591 -2.8873312 7.1352735 -16.79663 -0.31784213 9.493548 2.1144536 4.358092 -12.089823
18.495346 -14.293832 7.89578 2.2714825 22.976387 8.451689 -7.925461 4.6242585 4.4289427 18.692003
-4.875734 -3.0836344 -2.9999814 13.751918 6.448228 -2.6204622 -5.149185 -0.35821092 8.488551 4.981496
-11.924197 2.171869 2.0423572 -6.173772 10.778437 -9.32683 -2.2544234 6.6417594 1.2119585 10.977129
25.77281 -4.9495463 14.57806 0.3044315 2.6132357 16.555033 3.3238444 9.551863 -1.6676947 -0.79539716
-7.591999 -2.076944 9.025118 1.7834753 -3.1799617 -8.605674 -0.47356385 2.6741948 -5.359179 -2.6673796
-4.9401326 23.465864 5.1685796 -9.018578 9.037825 0.66607 15.443222 4.740594 -3.4725387 11.592567
-4.4150195 6.859591 -12.274467 -0.88911164 5.186309 -2.054497 1.7361217 -8.265324 -9.30447 5.4068313
-3.9988663 -13.638606 -9.925445 -0.06329413 -3.6709652 -1.5180256 -7.746615 -6.089606 0.07112726 -0.34904733
-12.397416 -12.719869 -1.395601 2.1150916 5.7381287 -8.649895 -9.998958 -2.564841 -0.53999114 2.601808
-4.4691963 -3.82819 -0.84233856 -1.1604277 -13.490127 -0.31927416 -1.8815292 -2.07215 -3.4105783 -8.2998085
8.731719 -20.778936 -11.495662 5.8033476 -4.752041 1.483641 -15.365992 -8.288208 3.8847756 -3.4876456
10.833007 -6.717991 4.504732 13.4244375 1.1306485 7.3629923 0.4657332 3.132599 12.438889 -1.8337058
7.3435574 1.400918 14.704036 -9.501399 7.2315617 4.532936 2.7264361 10.145339 -6.521951 2.897153
-6.417456 1.3333273 11.872697 -0.30664724 8.8845 -3.3925855 5.079156 7.759716 4.677565 5.8457737
6.5569253 4.7948146 0.03662816 -8.704245 6.224871 2.402413 7.7071047 3.9711342 -6.390043 6.1268735
-3.2701402 -11.508579 ] -3.7760346 -11.118123 ]
``` ```
- Python API - Python API
@ -98,7 +99,6 @@ wget -c https://paddlespeech.bj.bcebos.com/vector/audio/85236145389.wav
config=None, # Set `config` and `ckpt_path` to None to use pretrained model. config=None, # Set `config` and `ckpt_path` to None to use pretrained model.
ckpt_path=None, ckpt_path=None,
audio_file='./85236145389.wav', audio_file='./85236145389.wav',
force_yes=False,
device=paddle.get_device()) device=paddle.get_device())
print('Audio embedding Result: \n{}'.format(audio_emb)) print('Audio embedding Result: \n{}'.format(audio_emb))
``` ```
@ -106,45 +106,46 @@ wget -c https://paddlespeech.bj.bcebos.com/vector/audio/85236145389.wav
Output: Output:
```bash ```bash
# Vector Result: # Vector Result:
[ -5.749211 9.505463 -8.200284 -5.2075014 5.3940268 Audio embedding Result:
-3.04878 1.611095 10.127234 -10.534177 -15.821609 [ 1.4217498 5.626253 -5.342073 1.1773866 3.308055
1.2032688 -0.35080156 1.2629458 -12.643498 -2.5758228 1.756596 5.167894 10.80636 -3.8226728 -5.6141334
-11.343508 2.3385992 -8.719341 14.213509 15.404744 2.623845 -0.8072968 1.9635103 -7.3128724 0.01103897
-0.39327756 6.338786 2.688887 8.7104025 17.469526 -9.723131 0.6619743 -6.976803 10.213478 7.494748
-8.77959 7.0576906 4.648855 -1.3089896 -23.294737 2.9105635 3.8949256 3.7999806 7.1061673 16.905321
8.013747 13.891729 -9.926753 5.655307 -5.9422326 -7.1493764 8.733103 3.4230042 -4.831653 -11.403367
-22.842539 0.6293588 -18.46266 -10.811862 9.8192625 11.232214 7.1274667 -4.2828417 2.452362 -5.130748
3.0070958 3.8072643 -2.3861165 3.0821571 -14.739942 -18.177666 -2.6116815 -11.000337 -6.7314315 1.6564683
1.7594414 -0.6485091 4.485623 2.0207152 7.264915 0.7618269 1.1253023 -2.083836 4.725744 -8.782597
-6.40137 23.63524 2.9711294 -22.708025 9.93719 -3.539873 3.814236 5.1420674 2.162061 4.096431
20.354511 -10.324688 -0.700492 -8.783211 -5.27593 -6.4162116 12.747448 1.9429878 -15.152943 6.417416
15.999649 3.3004563 12.747926 15.429879 4.7849145 16.097002 -9.716668 -1.9920526 -3.3649497 -1.871939
5.6699696 -2.3826702 10.605882 3.9112158 3.1500628 11.567354 3.69788 11.258265 7.442363 9.183411
15.859915 -2.1832209 -23.908653 -6.4799504 -4.5365124 4.5281515 -1.2417862 4.3959084 6.6727695 5.8898783
-9.224193 14.568347 -10.568833 4.982321 -4.342062 7.627124 -0.66919386 -11.889693 -9.208865 -7.4274073
0.0914714 12.645902 -5.74285 -3.2141201 -2.7173362 -3.7776625 6.917234 -9.848748 -2.0944717 -5.135116
-6.680575 0.4757669 -5.035051 -6.7964664 16.865469 0.49563864 9.317534 -5.9141874 -1.8098574 -0.11738578
-11.54324 7.681869 0.44475392 9.708182 -8.932846 -7.169265 -1.0578263 -5.7216787 -5.1173844 16.137651
0.4123232 -4.361452 1.3948607 9.511665 0.11667654 -4.473626 7.6624317 -0.55381083 9.631587 -6.4704556
2.9079323 6.049952 9.275183 -18.078873 6.2983274 -8.548508 4.3716145 -0.79702514 4.478997 -2.9758704
-0.7500531 -2.725033 -7.6027865 3.3404543 2.990815 3.272176 2.8382776 5.134597 -9.190781 -0.5657382
4.010979 11.000591 -2.8873312 7.1352735 -16.79663 -4.8745747 2.3165567 -5.984303 -2.1798875 0.35541576
18.495346 -14.293832 7.89578 2.2714825 22.976387 -0.31784213 9.493548 2.1144536 4.358092 -12.089823
-4.875734 -3.0836344 -2.9999814 13.751918 6.448228 8.451689 -7.925461 4.6242585 4.4289427 18.692003
-11.924197 2.171869 2.0423572 -6.173772 10.778437 -2.6204622 -5.149185 -0.35821092 8.488551 4.981496
25.77281 -4.9495463 14.57806 0.3044315 2.6132357 -9.32683 -2.2544234 6.6417594 1.2119585 10.977129
-7.591999 -2.076944 9.025118 1.7834753 -3.1799617 16.555033 3.3238444 9.551863 -1.6676947 -0.79539716
-4.9401326 23.465864 5.1685796 -9.018578 9.037825 -8.605674 -0.47356385 2.6741948 -5.359179 -2.6673796
-4.4150195 6.859591 -12.274467 -0.88911164 5.186309 0.66607 15.443222 4.740594 -3.4725387 11.592567
-3.9988663 -13.638606 -9.925445 -0.06329413 -3.6709652 -2.054497 1.7361217 -8.265324 -9.30447 5.4068313
-12.397416 -12.719869 -1.395601 2.1150916 5.7381287 -1.5180256 -7.746615 -6.089606 0.07112726 -0.34904733
-4.4691963 -3.82819 -0.84233856 -1.1604277 -13.490127 -8.649895 -9.998958 -2.564841 -0.53999114 2.601808
8.731719 -20.778936 -11.495662 5.8033476 -4.752041 -0.31927416 -1.8815292 -2.07215 -3.4105783 -8.2998085
10.833007 -6.717991 4.504732 13.4244375 1.1306485 1.483641 -15.365992 -8.288208 3.8847756 -3.4876456
7.3435574 1.400918 14.704036 -9.501399 7.2315617 7.3629923 0.4657332 3.132599 12.438889 -1.8337058
-6.417456 1.3333273 11.872697 -0.30664724 8.8845 4.532936 2.7264361 10.145339 -6.521951 2.897153
6.5569253 4.7948146 0.03662816 -8.704245 6.224871 -3.3925855 5.079156 7.759716 4.677565 5.8457737
-3.2701402 -11.508579 ] 2.402413 7.7071047 3.9711342 -6.390043 6.1268735
-3.7760346 -11.118123 ]
``` ```
### 4.Pretrained Models ### 4.Pretrained Models

@ -2,5 +2,5 @@
wget -c https://paddlespeech.bj.bcebos.com/vector/audio/85236145389.wav wget -c https://paddlespeech.bj.bcebos.com/vector/audio/85236145389.wav
# asr # vector
paddlespeech vector --task spk --input ./85236145389.wav paddlespeech vector --task spk --input ./85236145389.wav

@ -6,7 +6,7 @@
### Speech Recognition Model ### Speech Recognition Model
Acoustic Model | Training Data | Token-based | Size | Descriptions | CER | WER | Hours of speech | Example Link Acoustic Model | Training Data | Token-based | Size | Descriptions | CER | WER | Hours of speech | Example Link
:-------------:| :------------:| :-----: | -----: | :-----: |:-----:| :-----: | :-----: | :-----: :-------------:| :------------:| :-----: | -----: | :-----: |:-----:| :-----: | :-----: | :-----:
[Ds2 Online Aishell ASR0 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_online_aishell_ckpt_0.1.1.model.tar.gz) | Aishell Dataset | Char-based | 345 MB | 2 Conv + 5 LSTM layers with only forward direction | 0.080 |-| 151 h | [Ds2 Online Aishell ASR0](../../examples/aishell/asr0) [Ds2 Online Aishell ASR0 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_online_aishell_ckpt_0.2.0.model.tar.gz) | Aishell Dataset | Char-based | 345 MB | 2 Conv + 5 LSTM layers with only forward direction | 0.078 |-| 151 h | [Ds2 Online Aishell ASR0](../../examples/aishell/asr0)
[Ds2 Offline Aishell ASR0 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_aishell_ckpt_0.1.1.model.tar.gz)| Aishell Dataset | Char-based | 306 MB | 2 Conv + 3 bidirectional GRU layers| 0.064 |-| 151 h | [Ds2 Offline Aishell ASR0](../../examples/aishell/asr0) [Ds2 Offline Aishell ASR0 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_aishell_ckpt_0.1.1.model.tar.gz)| Aishell Dataset | Char-based | 306 MB | 2 Conv + 3 bidirectional GRU layers| 0.064 |-| 151 h | [Ds2 Offline Aishell ASR0](../../examples/aishell/asr0)
[Conformer Online Aishell ASR1 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr1/asr1_chunk_conformer_aishell_ckpt_0.1.2.model.tar.gz) | Aishell Dataset | Char-based | 189 MB | Encoder:Conformer, Decoder:Transformer, Decoding method: Attention rescoring | 0.0565 |-| 151 h | [Conformer Online Aishell ASR1](../../examples/aishell/asr1) [Conformer Online Aishell ASR1 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr1/asr1_chunk_conformer_aishell_ckpt_0.1.2.model.tar.gz) | Aishell Dataset | Char-based | 189 MB | Encoder:Conformer, Decoder:Transformer, Decoding method: Attention rescoring | 0.0565 |-| 151 h | [Conformer Online Aishell ASR1](../../examples/aishell/asr1)
[Conformer Offline Aishell ASR1 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr1/asr1_conformer_aishell_ckpt_0.1.2.model.tar.gz) | Aishell Dataset | Char-based | 189 MB | Encoder:Conformer, Decoder:Transformer, Decoding method: Attention rescoring | 0.0483 |-| 151 h | [Conformer Offline Aishell ASR1](../../examples/aishell/asr1) [Conformer Offline Aishell ASR1 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr1/asr1_conformer_aishell_ckpt_0.1.2.model.tar.gz) | Aishell Dataset | Char-based | 189 MB | Encoder:Conformer, Decoder:Transformer, Decoding method: Attention rescoring | 0.0483 |-| 151 h | [Conformer Offline Aishell ASR1](../../examples/aishell/asr1)
@ -80,7 +80,7 @@ PANN | ESC-50 |[pann-esc50](../../examples/esc50/cls0)|[esc50_cnn6.tar.gz](https
Model Type | Dataset| Example Link | Pretrained Models | Static Models Model Type | Dataset| Example Link | Pretrained Models | Static Models
:-------------:| :------------:| :-----: | :-----: | :-----: :-------------:| :------------:| :-----: | :-----: | :-----:
PANN | VoxCeleb| [voxceleb_ecapatdnn](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/voxceleb/sv0) | [ecapatdnn.tar.gz](https://paddlespeech.bj.bcebos.com/vector/voxceleb/sv0_ecapa_tdnn_voxceleb12_ckpt_0_1_1.tar.gz) | - PANN | VoxCeleb| [voxceleb_ecapatdnn](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/voxceleb/sv0) | [ecapatdnn.tar.gz](https://paddlespeech.bj.bcebos.com/vector/voxceleb/sv0_ecapa_tdnn_voxceleb12_ckpt_0_2_0.tar.gz) | -
## Punctuation Restoration Models ## Punctuation Restoration Models
Model Type | Dataset| Example Link | Pretrained Models Model Type | Dataset| Example Link | Pretrained Models

@ -173,12 +173,7 @@ bash local/data.sh --stage 2 --stop_stage 2
CUDA_VISIBLE_DEVICES= ./local/test.sh conf/deepspeech2.yaml exp/deepspeech2/checkpoints/avg_1 CUDA_VISIBLE_DEVICES= ./local/test.sh conf/deepspeech2.yaml exp/deepspeech2/checkpoints/avg_1
``` ```
The performance of the released models is shown below: The performance of the released models is shown in [RESULTS.md](./RESULTS.md).
| Acoustic Model | Training Data | Token-based | Size | Descriptions | CER | WER | Hours of speech |
| :----------------------------: | :-------------: | :---------: | -----: | :------------------------------------------------- | :---- | :--- | :-------------- |
| Ds2 Online Aishell ASR0 Model | Aishell Dataset | Char-based | 345 MB | 2 Conv + 5 LSTM layers with only forward direction | 0.080 | - | 151 h |
| Ds2 Offline Aishell ASR0 Model | Aishell Dataset | Char-based | 306 MB | 2 Conv + 3 bidirectional GRU layers | 0.064 | - | 151 h |
## Stage 4: Static graph model Export ## Stage 4: Static graph model Export
This stage converts the dygraph (dynamic graph) model to a static graph. This stage converts the dygraph (dynamic graph) model to a static graph.
```bash ```bash

@ -4,15 +4,16 @@
| Model | Number of Params | Release | Config | Test set | Valid Loss | CER | | Model | Number of Params | Release | Config | Test set | Valid Loss | CER |
| --- | --- | --- | --- | --- | --- | --- | | --- | --- | --- | --- | --- | --- | --- |
| DeepSpeech2 | 45.18M | 2.2.0 | conf/deepspeech2_online.yaml + spec aug | test | 7.994938373565674 | 0.080 | | DeepSpeech2 | 45.18M | r0.2.0 | conf/deepspeech2_online.yaml + spec aug | test | 7.708217620849609| 0.078 |
| DeepSpeech2 | 45.18M | v2.2.0 | conf/deepspeech2_online.yaml + spec aug | test | 7.994938373565674 | 0.080 |
## Deepspeech2 Non-Streaming ## Deepspeech2 Non-Streaming
| Model | Number of Params | Release | Config | Test set | Valid Loss | CER | | Model | Number of Params | Release | Config | Test set | Valid Loss | CER |
| --- | --- | --- | --- | --- | --- | --- | | --- | --- | --- | --- | --- | --- | --- |
| DeepSpeech2 | 58.4M | 2.2.0 | conf/deepspeech2.yaml + spec aug | test | 5.738585948944092 | 0.064000 | | DeepSpeech2 | 58.4M | v2.2.0 | conf/deepspeech2.yaml + spec aug | test | 5.738585948944092 | 0.064000 |
| DeepSpeech2 | 58.4M | 2.1.0 | conf/deepspeech2.yaml + spec aug | test | 7.483316898345947 | 0.077860 | | DeepSpeech2 | 58.4M | v2.1.0 | conf/deepspeech2.yaml + spec aug | test | 7.483316898345947 | 0.077860 |
| DeepSpeech2 | 58.4M | 2.1.0 | conf/deepspeech2.yaml | test | 7.299022197723389 | 0.078671 | | DeepSpeech2 | 58.4M | v2.1.0 | conf/deepspeech2.yaml | test | 7.299022197723389 | 0.078671 |
| DeepSpeech2 | 58.4M | 2.0.0 | conf/deepspeech2.yaml | test | - | 0.078977 | | DeepSpeech2 | 58.4M | v2.0.0 | conf/deepspeech2.yaml | test | - | 0.078977 |
| --- | --- | --- | --- | --- | --- | --- | | --- | --- | --- | --- | --- | --- | --- |
| DeepSpeech2 | 58.4M | 1.8.5 | - | test | - | 0.080447 | | DeepSpeech2 | 58.4M | v1.8.5 | - | test | - | 0.080447 |

@ -4,7 +4,7 @@
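For sound classification, a common approach in traditional machine learning is to first hand-extract a variety of time-domain and frequency-domain features from the audio, apply feature selection, combination and transformation, and then classify with an SVM or a decision tree. End-to-end deep learning instead typically uses deep networks such as RNNs and CNNs to perform representation learning and classification directly on the waveform or on time-frequency features. For sound classification, a common approach in traditional machine learning is to first hand-extract a variety of time-domain and frequency-domain features from the audio, apply feature selection, combination and transformation, and then classify with an SVM or a decision tree. End-to-end deep learning instead typically uses deep networks such as RNNs and CNNs to perform representation learning and classification directly on the waveform or on time-frequency features.
At IEEE ICASSP 2017, Google released a large-scale audio dataset, [Audioset](https://research.google.com/audioset/). The dataset covers 632 audio classes and 2,084,320 human-labeled sound clips of 10 seconds each, taken from YouTube videos. It currently contains 2.1 million labeled videos and 5,800 hours of audio, and the labeled sound samples span 527 label classes. At IEEE ICASSP 2017, Google released a large-scale audio dataset, [Audioset](https://research.google.com/audioset/). The dataset covers 632 audio classes and 2,084,320 human-labeled sound clips of **10 seconds** each, taken from YouTube videos. It currently contains 2.1 million labeled videos and 5,800 hours of audio, and the labeled sound samples span 527 label classes.
`PANNs` ([PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition](https://arxiv.org/pdf/1912.10211.pdf)) are sound classification/recognition models trained on the Audioset dataset. After pretraining, the models can be used to extract audio embeddings. This example fine-tunes a pretrained `PANNs` model to perform sound classification. `PANNs` ([PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition](https://arxiv.org/pdf/1912.10211.pdf)) are sound classification/recognition models trained on the Audioset dataset. After pretraining, the models can be used to extract audio embeddings. This example fine-tunes a pretrained `PANNs` model to perform sound classification.
@ -19,7 +19,7 @@ PaddleAudio provides pretrained CNN14, CNN10 and CNN6 PANNs models for
## Dataset ## Dataset
[ESC-50: Dataset for Environmental Sound Classification](https://github.com/karolpiczak/ESC-50) is a collection of 2,000 labeled environmental sound samples, stored as single-channel audio files with a 44,100 Hz sampling rate. The samples are divided by label into 50 classes, with 40 samples per class. [ESC-50: Dataset for Environmental Sound Classification](https://github.com/karolpiczak/ESC-50) is a collection of 2,000 labeled environmental sound samples, each **5 seconds** long, stored as single-channel audio files with a 44,100 Hz sampling rate. The samples are divided by label into 50 classes, with 40 samples per class.
## Model Metrics ## Model Metrics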

@ -21,7 +21,7 @@
The pretrained model can be downloaded here [ernie_linear_p3_iwslt2012_zh_ckpt_0.1.1.zip](https://paddlespeech.bj.bcebos.com/text/ernie_linear_p3_iwslt2012_zh_ckpt_0.1.1.zip). The pretrained model can be downloaded here [ernie_linear_p3_iwslt2012_zh_ckpt_0.1.1.zip](https://paddlespeech.bj.bcebos.com/text/ernie_linear_p3_iwslt2012_zh_ckpt_0.1.1.zip).
### Test Result ### Test Result
- Ernie Linear - Ernie
| |COMMA | PERIOD | QUESTION | OVERALL| | |COMMA | PERIOD | QUESTION | OVERALL|
|:-----:|:-----:|:-----:|:-----:|:-----:| |:-----:|:-----:|:-----:|:-----:|:-----:|
|Precision |0.510955 |0.526462 |0.820755 |0.619391| |Precision |0.510955 |0.526462 |0.820755 |0.619391|

@ -0,0 +1,9 @@
# iwslt2012
## Ernie
| |COMMA | PERIOD | QUESTION | OVERALL|
|:-----:|:-----:|:-----:|:-----:|:-----:|
|Precision |0.510955 |0.526462 |0.820755 |0.619391|
|Recall |0.517433 |0.564179 |0.861386 |0.647666|
|F1 |0.514173 |0.544669 |0.840580 |0.633141|
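The table does not say how the OVERALL column is derived. The sketch below checks one reading that is consistent with the published numbers: each per-class F1 is the harmonic mean of that class's precision and recall, and OVERALL is the unweighted (macro) average over the three classes. The function and variable names are illustrative and do not come from the repository.

```python
# Precision and recall copied from the iwslt2012 Ernie table above.
precision = {"COMMA": 0.510955, "PERIOD": 0.526462, "QUESTION": 0.820755}
recall = {"COMMA": 0.517433, "PERIOD": 0.564179, "QUESTION": 0.861386}


def f1_score(p: float, r: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * p * r / (p + r)


def macro_average(values) -> float:
    """Unweighted mean over classes (assumed meaning of OVERALL)."""
    values = list(values)
    return sum(values) / len(values)


per_class_f1 = {name: f1_score(precision[name], recall[name]) for name in precision}
overall_precision = macro_average(precision.values())
overall_recall = macro_average(recall.values())
overall_f1 = macro_average(per_class_f1.values())

for name, value in per_class_f1.items():
    print(f"{name:9s} F1 = {value:.6f}")
print(f"OVERALL   P = {overall_precision:.6f}  R = {overall_recall:.6f}  F1 = {overall_f1:.6f}")
```

Under this reading the computed values agree with the table to within rounding of the six-digit inputs.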

@ -4,4 +4,4 @@
| Model | Number of Params | Release | Config | dim | Test set | Cosine | Cosine + S-Norm | | Model | Number of Params | Release | Config | dim | Test set | Cosine | Cosine + S-Norm |
| --- | --- | --- | --- | --- | --- | --- | ---- | | --- | --- | --- | --- | --- | --- | --- | ---- |
| ECAPA-TDNN | 85M | 0.1.1 | conf/ecapa_tdnn.yaml |192 | test | 1.15 | 1.06 | | ECAPA-TDNN | 85M | 0.2.0 | conf/ecapa_tdnn.yaml |192 | test | 1.02 | 0.95 |

@ -14,4 +14,3 @@
from .dtw import dtw_distance from .dtw import dtw_distance
from .eer import compute_eer from .eer import compute_eer
from .eer import compute_minDCF from .eer import compute_minDCF
from .mcd import mcd_distance

@ -1,63 +0,0 @@
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import Callable

import mcd.metrics_fast as mt
import numpy as np
from mcd import dtw

__all__ = [
    'mcd_distance',
]


def mcd_distance(xs: np.ndarray,
                 ys: np.ndarray,
                 cost_fn: Callable=mt.logSpecDbDist) -> float:
    """Mel cepstral distortion (MCD), dtw distance.

    Dynamic Time Warping.
    Uses dynamic programming to compute:

    Examples:
        .. code-block:: python

            wps[i, j] = cost_fn(xs[i], ys[j]) + min(
                wps[i-1, j ],  // vertical / insertion / expansion
                wps[i , j-1],  // horizontal / deletion / compression
                wps[i-1, j-1])  // diagonal / match

            dtw = sqrt(wps[-1, -1])

    Cost Function:
        Examples:
            .. code-block:: python

                logSpecDbConst = 10.0 / math.log(10.0) * math.sqrt(2.0)

                def logSpecDbDist(x, y):
                    diff = x - y
                    return logSpecDbConst * math.sqrt(np.inner(diff, diff))

    Args:
        xs (np.ndarray): ref sequence, [T,D]
        ys (np.ndarray): hyp sequence, [T,D]
        cost_fn (Callable, optional): Cost function. Defaults to mt.logSpecDbDist.

    Returns:
        float: dtw distance
    """
    min_cost, path = dtw.dtw(xs, ys, cost_fn)
    return min_cost
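The recurrence in the removed docstring can be reproduced with NumPy alone. The following is a minimal sketch of that recurrence, not the `mcd` package's implementation: `dtw_cost` and `log_spec_db_dist` are hypothetical names, and the sketch returns the accumulated cost `wps[-1, -1]` without the final square root mentioned in the docstring, since the diff does not show what `dtw.dtw` does internally.

```python
import math

import numpy as np

# Constant from the removed docstring: converts a Euclidean distance between
# log-spectral (cepstral) vectors into decibels.
LOG_SPEC_DB_CONST = 10.0 / math.log(10.0) * math.sqrt(2.0)


def log_spec_db_dist(x: np.ndarray, y: np.ndarray) -> float:
    """Frame-level cost, as written in the docstring."""
    diff = x - y
    return LOG_SPEC_DB_CONST * math.sqrt(float(np.inner(diff, diff)))


def dtw_cost(xs: np.ndarray, ys: np.ndarray, cost_fn=log_spec_db_dist) -> float:
    """Accumulated cost of the cheapest monotonic alignment of xs and ys."""
    n, m = len(xs), len(ys)
    # One extra row and column of +inf so the borders need no special casing.
    wps = np.full((n + 1, m + 1), np.inf)
    wps[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            wps[i, j] = cost_fn(xs[i - 1], ys[j - 1]) + min(
                wps[i - 1, j],      # vertical
                wps[i, j - 1],      # horizontal
                wps[i - 1, j - 1])  # diagonal
    return float(wps[n, m])


ref = np.array([[0.0], [1.0]])
stretched = np.array([[0.0], [0.0], [1.0]])  # same content, one frame repeated
print(dtw_cost(ref, ref), dtw_cost(ref, stretched))
```

Both calls give a cost of zero, because warping absorbs the repeated frame. The double loop is quadratic in sequence length, which is acceptable for a sketch but slow for long utterances.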

@ -0,0 +1,30 @@
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import numpy as np
def pcm16to32(audio: np.ndarray) -> np.ndarray:
    """pcm int16 to float32

    Args:
        audio (np.ndarray): Waveform with dtype of int16.

    Returns:
        np.ndarray: Waveform with dtype of float32.
    """
    if audio.dtype == np.int16:
        audio = audio.astype("float32")
        bits = np.iinfo(np.int16).bits
        audio = audio / (2**(bits - 1))
    return audio
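A quick check of the scaling may help here. The block repeats the function from the hunk above so that it runs on its own, then converts a few hand-picked samples. `np.iinfo(np.int16).bits` is 16, so the divisor is 2**15 = 32768 and the output lies in [-1.0, 1.0). The sample values are made up for illustration.

```python
import numpy as np


def pcm16to32(audio: np.ndarray) -> np.ndarray:
    """Copy of the function added in this commit."""
    if audio.dtype == np.int16:
        audio = audio.astype("float32")
        bits = np.iinfo(np.int16).bits
        audio = audio / (2**(bits - 1))
    return audio


# Extremes of the int16 range plus zero and a mid-scale value.
samples = np.array([-32768, 0, 16384, 32767], dtype=np.int16)
converted = pcm16to32(samples)
print(converted.dtype, converted)

# Input that is not int16 is returned unchanged.
already_float = np.array([0.25, -0.5], dtype=np.float32)
untouched = pcm16to32(already_float)
```

Because the positive int16 maximum is 32767 rather than 32768, the largest converted value stays just below 1.0.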

@ -19,7 +19,7 @@ from setuptools.command.install import install
from setuptools.command.test import test from setuptools.command.test import test
# set the version here # set the version here
VERSION = '0.2.0' VERSION = '0.2.1'
# Inspired by the example at https://pytest.org/latest/goodpractises.html # Inspired by the example at https://pytest.org/latest/goodpractises.html
@ -83,8 +83,7 @@ setuptools.setup(
python_requires='>=3.6', python_requires='>=3.6',
install_requires=[ install_requires=[
'numpy >= 1.15.0', 'scipy >= 1.0.0', 'resampy >= 0.2.2', 'numpy >= 1.15.0', 'scipy >= 1.0.0', 'resampy >= 0.2.2',
'soundfile >= 0.9.0', 'colorlog', 'dtaidistance == 2.3.1', 'mcd >= 0.4', 'soundfile >= 0.9.0', 'colorlog', 'dtaidistance == 2.3.1', 'pathos'
'pathos'
], ],
extras_require={ extras_require={
'test': [ 'test': [

@ -80,9 +80,9 @@ pretrained_models = {
}, },
"deepspeech2online_aishell-zh-16k": { "deepspeech2online_aishell-zh-16k": {
'url': 'url':
'https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_online_aishell_ckpt_0.1.1.model.tar.gz', 'https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_online_aishell_ckpt_0.2.0.model.tar.gz',
'md5': 'md5':
'd5e076217cf60486519f72c217d21b9b', '23e16c69730a1cb5d735c98c83c21e16',
'cfg_path': 'cfg_path':
'model.yaml', 'model.yaml',
'ckpt_path': 'ckpt_path':
@ -426,6 +426,11 @@ class ASRExecutor(BaseExecutor):
try: try:
audio, audio_sample_rate = soundfile.read( audio, audio_sample_rate = soundfile.read(
audio_file, dtype="int16", always_2d=True) audio_file, dtype="int16", always_2d=True)
audio_duration = audio.shape[0] / audio_sample_rate
max_duration = 50.0
if audio_duration >= max_duration:
logger.error("Please input an audio file shorter than 50 seconds.\n")
return
except Exception as e: except Exception as e:
logger.exception(e) logger.exception(e)
logger.error( logger.error(
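The duration check in the hunk above divides the number of frames by the sample rate and rejects anything at or above 50 seconds. The sketch below isolates that arithmetic on in-memory arrays, so it does not need `soundfile` or an audio file. It assumes the `(frames, channels)` layout produced by `always_2d=True`; `is_short_enough` is a hypothetical helper name, not part of `ASRExecutor`.

```python
import numpy as np

MAX_DURATION_S = 50.0  # same threshold as in the hunk above


def audio_duration_seconds(audio: np.ndarray, sample_rate: int) -> float:
    """Duration of a (frames, channels) array in seconds."""
    return audio.shape[0] / sample_rate


def is_short_enough(audio: np.ndarray,
                    sample_rate: int,
                    max_duration: float = MAX_DURATION_S) -> bool:
    """True when the clip is strictly shorter than max_duration."""
    return audio_duration_seconds(audio, sample_rate) < max_duration


sample_rate = 16000
ok_clip = np.zeros((sample_rate * 49, 1), dtype=np.int16)    # 49 s of silence
long_clip = np.zeros((sample_rate * 50, 1), dtype=np.int16)  # exactly 50 s

print(is_short_enough(ok_clip, sample_rate), is_short_enough(long_clip, sample_rate))
```

A clip of exactly 50 seconds is rejected, matching the `>=` comparison in the diff.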

@ -42,9 +42,9 @@ pretrained_models = {
# "paddlespeech vector --task spk --model ecapatdnn_voxceleb12-16k --sr 16000 --input ./input.wav" # "paddlespeech vector --task spk --model ecapatdnn_voxceleb12-16k --sr 16000 --input ./input.wav"
"ecapatdnn_voxceleb12-16k": { "ecapatdnn_voxceleb12-16k": {
'url': 'url':
'https://paddlespeech.bj.bcebos.com/vector/voxceleb/sv0_ecapa_tdnn_voxceleb12_ckpt_0_1_1.tar.gz', 'https://paddlespeech.bj.bcebos.com/vector/voxceleb/sv0_ecapa_tdnn_voxceleb12_ckpt_0_2_0.tar.gz',
'md5': 'md5':
'a1c0dba7d4de997187786ff517d5b4ec', 'cc33023c54ab346cd318408f43fcaf95',
'cfg_path': 'cfg_path':
'conf/model.yaml', # the yaml config path 'conf/model.yaml', # the yaml config path
'ckpt_path': 'ckpt_path':
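Both model entries above change the `url` and `md5` fields together. The diff does not show how PaddleSpeech uses `md5`. Assuming it is the MD5 digest of the downloaded archive, a checksum comparison could look like the sketch below, which relies only on the standard `hashlib` module; `md5_of_file` and `matches_md5` are hypothetical helper names.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_md5(path: str, expected: str) -> bool:
    """Compare a file's digest with an expected hex string, ignoring case."""
    return md5_of_file(path) == expected.lower()
```

With such a helper, an archive downloaded from the new ECAPA-TDNN URL would be compared against `'cc33023c54ab346cd318408f43fcaf95'`, and a mismatch would suggest a stale or corrupted download.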

@ -86,6 +86,9 @@ def process_sentence(config: Dict[str, Any],
logmel = mel_extractor.get_log_mel_fbank(wav) logmel = mel_extractor.get_log_mel_fbank(wav)
# change duration according to mel_length # change duration according to mel_length
compare_duration_and_mel_length(sentences, utt_id, logmel) compare_duration_and_mel_length(sentences, utt_id, logmel)
# utt_id may be popped in compare_duration_and_mel_length
if utt_id not in sentences:
return None
phones = sentences[utt_id][0] phones = sentences[utt_id][0]
durations = sentences[utt_id][1] durations = sentences[utt_id][1]
num_frames = logmel.shape[0] num_frames = logmel.shape[0]

@ -79,6 +79,9 @@ def process_sentence(config: Dict[str, Any],
logmel = mel_extractor.get_log_mel_fbank(wav) logmel = mel_extractor.get_log_mel_fbank(wav)
# change duration according to mel_length # change duration according to mel_length
compare_duration_and_mel_length(sentences, utt_id, logmel) compare_duration_and_mel_length(sentences, utt_id, logmel)
# utt_id may be popped in compare_duration_and_mel_length
if utt_id not in sentences:
return None
labels = sentences[utt_id][0] labels = sentences[utt_id][0]
# extract phone and duration # extract phone and duration
phones = [] phones = []

@ -82,6 +82,9 @@ def process_sentence(config: Dict[str, Any],
logmel = mel_extractor.get_log_mel_fbank(wav) logmel = mel_extractor.get_log_mel_fbank(wav)
# change duration according to mel_length # change duration according to mel_length
compare_duration_and_mel_length(sentences, utt_id, logmel) compare_duration_and_mel_length(sentences, utt_id, logmel)
# utt_id may be popped in compare_duration_and_mel_length
if utt_id not in sentences:
return None
phones = sentences[utt_id][0] phones = sentences[utt_id][0]
durations = sentences[utt_id][1] durations = sentences[utt_id][1]
num_frames = logmel.shape[0] num_frames = logmel.shape[0]
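The guard added in the three `process_sentence` hunks protects against a `KeyError` when the helper removes the utterance from `sentences`. The sketch below reproduces that pattern in isolation. `drop_if_mismatched` is a hypothetical stand-in for `compare_duration_and_mel_length`: its tolerance rule is an assumption made for illustration, and the real helper's logic is not shown in this diff.

```python
from typing import Dict, List, Optional, Tuple

import numpy as np

Sentence = Tuple[List[str], List[int]]  # (phones, durations in frames)


def drop_if_mismatched(sentences: Dict[str, Sentence],
                       utt_id: str,
                       logmel: np.ndarray,
                       tolerance: int = 3) -> None:
    """Hypothetical stand-in: pop the utterance when the summed durations
    differ from the mel length by more than `tolerance` frames."""
    durations = sentences[utt_id][1]
    if abs(sum(durations) - logmel.shape[0]) > tolerance:
        sentences.pop(utt_id)


def process_sentence(sentences: Dict[str, Sentence],
                     utt_id: str,
                     logmel: np.ndarray) -> Optional[dict]:
    drop_if_mismatched(sentences, utt_id, logmel)
    # utt_id may have been popped by the helper, so check before indexing.
    if utt_id not in sentences:
        return None
    phones, durations = sentences[utt_id]
    return {"utt_id": utt_id, "phones": phones,
            "durations": durations, "num_frames": logmel.shape[0]}


sentences = {
    "good": (["a", "b"], [4, 6]),  # 10 frames, matches the mel length below
    "bad": (["a", "b"], [4, 26]),  # 30 frames, far from 10
}
logmel = np.zeros((10, 80))
records = [process_sentence(sentences, utt_id, logmel) for utt_id in ("good", "bad")]
kept = [r for r in records if r is not None]
```

Returning `None` shifts the responsibility to the caller, which has to filter out the dropped utterances as in the last line.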
