Merge branch 'develop' of github.com:SmileGoat/PaddleSpeech into add_aishell_eg

3 years ago · c18846635e
parent 3e2ec3a7da 2f97b81346
commit c18846635e
47 changed files with 998 additions and 226 deletions
--- a/README.md
+++ b/README.md
@@ -280,10 +280,14 @@ paddlespeech_client cls --server_ip 127.0.0.1 --port 8090 --input input.wav
 For more information about server command lines, please see: [speech server demos](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/demos/speech_server)


+<a name="ModelList"></a>
+
 ## Model List

 PaddleSpeech supports a series of most popular models. They are summarized in [released models](./docs/source/released_model.md) and attached with available pretrained models.

+<a name="SpeechToText"></a>
+
 **Speech-to-Text** contains *Acoustic Model*, *Language Model*, and *Speech Translation*, with the following details:

 <table style="width:100%">
@@ -357,6 +361,8 @@ PaddleSpeech supports a series of most popular models. They are summarized in [r
  </tbody>
 </table>

+<a name="TextToSpeech"></a>
+
 **Text-to-Speech** in PaddleSpeech mainly contains three modules: *Text Frontend*, *Acoustic Model* and *Vocoder*. Acoustic Model and Vocoder models are listed as follows:

 <table>
@@ -457,10 +463,10 @@ PaddleSpeech supports a series of most popular models. They are summarized in [r
      </td>
    </tr>
    <tr>
-      <td>GE2E + Tactron2</td>
+      <td>GE2E + Tacotron2</td>
      <td>AISHELL-3</td>
      <td>
-      <a href = "./examples/aishell3/vc0">ge2e-tactron2-aishell3</a>
+      <a href = "./examples/aishell3/vc0">ge2e-tacotron2-aishell3</a>
      </td>
    </tr>
    <tr>
@@ -473,6 +479,8 @@ PaddleSpeech supports a series of most popular models. They are summarized in [r
  </tbody>
 </table>

+<a name="AudioClassification"></a>
+
 **Audio Classification**

 <table style="width:100%">
@@ -496,6 +504,8 @@ PaddleSpeech supports a series of most popular models. They are summarized in [r
  </tbody>
 </table>

+<a name="SpeakerVerification"></a>
+
 **Speaker Verification**

 <table style="width:100%">
@@ -519,6 +529,8 @@ PaddleSpeech supports a series of most popular models. They are summarized in [r
  </tbody>
 </table>

+<a name="PunctuationRestoration"></a>
+
 **Punctuation Restoration**

 <table style="width:100%">
@@ -559,10 +571,18 @@ Normally, [Speech SoTA](https://paperswithcode.com/area/speech), [Audio SoTA](ht
    - [Advanced Usage](./docs/source/tts/advanced_usage.md)
    - [Chinese Rule Based Text Frontend](./docs/source/tts/zh_text_frontend.md)
    - [Test Audio Samples](https://paddlespeech.readthedocs.io/en/latest/tts/demo.html)
+  - Speaker Verification
+    - [Audio Searching](./demos/audio_searching/README.md)
+    - [Speaker Verification](./demos/speaker_verification/README.md)
  - [Audio Classification](./demos/audio_tagging/README.md)
-  - [Speaker Verification](./demos/speaker_verification/README.md)
  - [Speech Translation](./demos/speech_translation/README.md)
+  - [Speech Server](./demos/speech_server/README.md)
 - [Released Models](./docs/source/released_model.md)
+  - [Speech-to-Text](#SpeechToText)
+  - [Text-to-Speech](#TextToSpeech)
+  - [Audio Classification](#AudioClassification)
+  - [Speaker Verification](#SpeakerVerification)
+  - [Punctuation Restoration](#PunctuationRestoration)
 - [Community](#Community)
 - [Welcome to contribute](#contribution)
 - [License](#License)
--- a/README_cn.md
+++ b/README_cn.md
@@ -273,6 +273,8 @@ paddlespeech_client cls --server_ip 127.0.0.1 --port 8090 --input input.wav
 ## 模型列表
 PaddleSpeech 支持很多主流的模型，并提供了预训练模型，详情请见[模型列表](./docs/source/released_model.md)。

+<a name="语音识别模型"></a>
+
 PaddleSpeech 的 **语音转文本** 包含语音识别声学模型、语音识别语言模型和语音翻译, 详情如下：

 <table style="width:100%">
@@ -347,6 +349,7 @@ PaddleSpeech 的 **语音转文本** 包含语音识别声学模型、语音识
 </table>

 <a name="语音合成模型"></a>
+
 PaddleSpeech 的 **语音合成** 主要包含三个模块：文本前端、声学模型和声码器。声学模型和声码器模型如下：

 <table>
@@ -447,10 +450,10 @@ PaddleSpeech 的 **语音合成** 主要包含三个模块：文本前端、声
      </td>
    </tr>
    <tr>
-      <td>GE2E + Tactron2</td>
+      <td>GE2E + Tacotron2</td>
      <td>AISHELL-3</td>
      <td>
-      <a href = "./examples/aishell3/vc0">ge2e-tactron2-aishell3</a>
+      <a href = "./examples/aishell3/vc0">ge2e-tacotron2-aishell3</a>
      </td>
    </tr>
    <tr>
@@ -488,6 +491,8 @@ PaddleSpeech 的 **语音合成** 主要包含三个模块：文本前端、声
 </table>


+<a name="声纹识别模型"></a>
+
 **声纹识别**

 <table style="width:100%">
@@ -511,6 +516,8 @@ PaddleSpeech 的 **语音合成** 主要包含三个模块：文本前端、声
  </tbody>
 </table>

+<a name="标点恢复模型"></a>
+
 **标点恢复**

 <table style="width:100%">
@@ -556,13 +563,18 @@ PaddleSpeech 的 **语音合成** 主要包含三个模块：文本前端、声
    - [进阶用法](./docs/source/tts/advanced_usage.md)
    - [中文文本前端](./docs/source/tts/zh_text_frontend.md)
    - [测试语音样本](https://paddlespeech.readthedocs.io/en/latest/tts/demo.html)
+  - 声纹识别
+    - [声纹识别](./demos/speaker_verification/README_cn.md)
+    - [音频检索](./demos/audio_searching/README_cn.md)
  - [声音分类](./demos/audio_tagging/README_cn.md)
-  - [声纹识别](./demos/speaker_verification/README_cn.md)
  - [语音翻译](./demos/speech_translation/README_cn.md)
+  - [服务化部署](./demos/speech_server/README_cn.md)
 - [模型列表](#模型列表)
  - [语音识别](#语音识别模型)
  - [语音合成](#语音合成模型)
  - [声音分类](#声音分类模型)
+  - [声纹识别](#声纹识别模型)
+  - [标点恢复](#标点恢复模型)
 - [技术交流群](#技术交流群)
 - [欢迎贡献](#欢迎贡献)
 - [License](#License)
--- a/demos/speaker_verification/README.md
+++ b/demos/speaker_verification/README.md
@@ -30,6 +30,11 @@ wget -c https://paddlespeech.bj.bcebos.com/vector/audio/85236145389.wav
  paddlespeech vector --task spk --input vec.job

  echo -e "demo2 85236145389.wav \n demo3 85236145389.wav" | paddlespeech vector --task spk
+
+  paddlespeech vector --task score --input "./85236145389.wav ./123456789.wav"
+  
+  echo -e "demo4 85236145389.wav 85236145389.wav \n demo5 85236145389.wav 123456789.wav" > vec.job
+  paddlespeech vector --task score --input vec.job
  ```
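
  The `vec.job` file written by the `echo -e` command above appears to hold one trial per line: an utterance id, an enrollment audio path, and a test audio path, separated by single spaces. The following is a minimal sketch for generating such a file from Python. The helper name `write_score_job` is hypothetical and is not part of PaddleSpeech, and the three-field layout is inferred only from the example above, not from the CLI's documentation.

  ```python
  def write_score_job(trials, path="vec.job"):
      """Write score trials as '<utt_id> <enroll_wav> <test_wav>' lines.

      Hypothetical helper: the three-field layout is inferred from the
      `echo -e ... > vec.job` example, not from PaddleSpeech itself.
      """
      lines = []
      for utt_id, enroll_wav, test_wav in trials:
          for field in (utt_id, enroll_wav, test_wav):
              # Fields are space separated, so none may be empty or contain whitespace.
              if not field or any(ch.isspace() for ch in field):
                  raise ValueError(f"invalid field: {field!r}")
          lines.append(f"{utt_id} {enroll_wav} {test_wav}")
      with open(path, "w", encoding="utf-8") as f:
          f.write("\n".join(lines) + "\n")
      return lines


  if __name__ == "__main__":
      write_score_job([
          ("demo4", "85236145389.wav", "85236145389.wav"),
          ("demo5", "85236145389.wav", "123456789.wav"),
      ])
  ```

  The resulting file can then be passed to `paddlespeech vector --task score --input vec.job` as shown above.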
  
  Usage:
@@ -38,6 +43,7 @@ wget -c https://paddlespeech.bj.bcebos.com/vector/audio/85236145389.wav
  ```
  Arguments:
   - `input` (required): Audio file to recognize.
+  - `task` (required): Specifies which `vector` task to run. Default: `spk`.
  - `model`: Model type of vector task. Default: `ecapatdnn_voxceleb12`.
  - `sample_rate`: Sample rate of the model. Default: `16000`.
  - `config`: Config of vector task. Use pretrained model when it is None. Default: `None`.
@@ -47,45 +53,45 @@ wget -c https://paddlespeech.bj.bcebos.com/vector/audio/85236145389.wav
  Output:

  ```bash
-    demo [ -5.749211     9.505463    -8.200284    -5.2075014    5.3940268
-      -3.04878      1.611095    10.127234   -10.534177   -15.821609
-      1.2032688   -0.35080156   1.2629458  -12.643498    -2.5758228
-    -11.343508     2.3385992   -8.719341    14.213509    15.404744
-      -0.39327756   6.338786     2.688887     8.7104025   17.469526
-      -8.77959      7.0576906    4.648855    -1.3089896  -23.294737
-      8.013747    13.891729    -9.926753     5.655307    -5.9422326
-    -22.842539     0.6293588  -18.46266    -10.811862     9.8192625
-      3.0070958    3.8072643   -2.3861165    3.0821571  -14.739942
-      1.7594414   -0.6485091    4.485623     2.0207152    7.264915
-      -6.40137     23.63524      2.9711294  -22.708025     9.93719
-      20.354511   -10.324688    -0.700492    -8.783211    -5.27593
-      15.999649     3.3004563   12.747926    15.429879     4.7849145
-      5.6699696   -2.3826702   10.605882     3.9112158    3.1500628
-      15.859915    -2.1832209  -23.908653    -6.4799504   -4.5365124
-      -9.224193    14.568347   -10.568833     4.982321    -4.342062
-      0.0914714   12.645902    -5.74285     -3.2141201   -2.7173362
-      -6.680575     0.4757669   -5.035051    -6.7964664   16.865469
-    -11.54324      7.681869     0.44475392   9.708182    -8.932846
-      0.4123232   -4.361452     1.3948607    9.511665     0.11667654
-      2.9079323    6.049952     9.275183   -18.078873     6.2983274
-      -0.7500531   -2.725033    -7.6027865    3.3404543    2.990815
-      4.010979    11.000591    -2.8873312    7.1352735  -16.79663
-      18.495346   -14.293832     7.89578      2.2714825   22.976387
-      -4.875734    -3.0836344   -2.9999814   13.751918     6.448228
-    -11.924197     2.171869     2.0423572   -6.173772    10.778437
-      25.77281     -4.9495463   14.57806      0.3044315    2.6132357
-      -7.591999    -2.076944     9.025118     1.7834753   -3.1799617
-      -4.9401326   23.465864     5.1685796   -9.018578     9.037825
-      -4.4150195    6.859591   -12.274467    -0.88911164   5.186309
-      -3.9988663  -13.638606    -9.925445    -0.06329413  -3.6709652
-    -12.397416   -12.719869    -1.395601     2.1150916    5.7381287
-      -4.4691963   -3.82819     -0.84233856  -1.1604277  -13.490127
-      8.731719   -20.778936   -11.495662     5.8033476   -4.752041
-      10.833007    -6.717991     4.504732    13.4244375    1.1306485
-      7.3435574    1.400918    14.704036    -9.501399     7.2315617
-      -6.417456     1.3333273   11.872697    -0.30664724   8.8845
-      6.5569253    4.7948146    0.03662816  -8.704245     6.224871
-      -3.2701402  -11.508579  ]
+    demo [  1.4217498    5.626253    -5.342073     1.1773866    3.308055
+    1.756596     5.167894    10.80636     -3.8226728   -5.6141334
+    2.623845    -0.8072968    1.9635103   -7.3128724    0.01103897
+    -9.723131     0.6619743   -6.976803    10.213478     7.494748
+    2.9105635    3.8949256    3.7999806    7.1061673   16.905321
+    -7.1493764    8.733103     3.4230042   -4.831653   -11.403367
+    11.232214     7.1274667   -4.2828417    2.452362    -5.130748
+    -18.177666    -2.6116815  -11.000337    -6.7314315    1.6564683
+    0.7618269    1.1253023   -2.083836     4.725744    -8.782597
+    -3.539873     3.814236     5.1420674    2.162061     4.096431
+    -6.4162116   12.747448     1.9429878  -15.152943     6.417416
+    16.097002    -9.716668    -1.9920526   -3.3649497   -1.871939
+    11.567354     3.69788     11.258265     7.442363     9.183411
+    4.5281515   -1.2417862    4.3959084    6.6727695    5.8898783
+    7.627124    -0.66919386 -11.889693    -9.208865    -7.4274073
+    -3.7776625    6.917234    -9.848748    -2.0944717   -5.135116
+    0.49563864   9.317534    -5.9141874   -1.8098574   -0.11738578
+    -7.169265    -1.0578263   -5.7216787   -5.1173844   16.137651
+    -4.473626     7.6624317   -0.55381083   9.631587    -6.4704556
+    -8.548508     4.3716145   -0.79702514   4.478997    -2.9758704
+    3.272176     2.8382776    5.134597    -9.190781    -0.5657382
+    -4.8745747    2.3165567   -5.984303    -2.1798875    0.35541576
+    -0.31784213   9.493548     2.1144536    4.358092   -12.089823
+    8.451689    -7.925461     4.6242585    4.4289427   18.692003
+    -2.6204622   -5.149185    -0.35821092   8.488551     4.981496
+    -9.32683     -2.2544234    6.6417594    1.2119585   10.977129
+    16.555033     3.3238444    9.551863    -1.6676947   -0.79539716
+    -8.605674    -0.47356385   2.6741948   -5.359179    -2.6673796
+    0.66607     15.443222     4.740594    -3.4725387   11.592567
+    -2.054497     1.7361217   -8.265324    -9.30447      5.4068313
+    -1.5180256   -7.746615    -6.089606     0.07112726  -0.34904733
+    -8.649895    -9.998958    -2.564841    -0.53999114   2.601808
+    -0.31927416  -1.8815292   -2.07215     -3.4105783   -8.2998085
+    1.483641   -15.365992    -8.288208     3.8847756   -3.4876456
+    7.3629923    0.4657332    3.132599    12.438889    -1.8337058
+    4.532936     2.7264361   10.145339    -6.521951     2.897153
+    -3.3925855    5.079156     7.759716     4.677565     5.8457737
+    2.402413     7.7071047    3.9711342   -6.390043     6.1268735
+    -3.7760346  -11.118123  ]
  ```

 - Python API
@@ -97,56 +103,113 @@ wget -c https://paddlespeech.bj.bcebos.com/vector/audio/85236145389.wav
  audio_emb = vector_executor(
      model='ecapatdnn_voxceleb12',
      sample_rate=16000,
-      config=None, 
+      config=None,  # Set `config` and `ckpt_path` to None to use pretrained model.
      ckpt_path=None,
      audio_file='./85236145389.wav',
-      force_yes=False,
      device=paddle.get_device())
  print('Audio embedding Result: \n{}'.format(audio_emb))
+
+  test_emb = vector_executor(
+      model='ecapatdnn_voxceleb12',
+      sample_rate=16000,
+      config=None,  # Set `config` and `ckpt_path` to None to use pretrained model.
+      ckpt_path=None,
+      audio_file='./123456789.wav',
+      device=paddle.get_device())
+  print('Test embedding Result: \n{}'.format(test_emb))
+
+  # score range [0, 1]
+  score = vector_executor.get_embeddings_score(audio_emb, test_emb)
+  print(f"Embeddings Score: {score}")
  ```

-  Output:
+  Output:
+
  ```bash
  # Vector Result:
-   [ -5.749211     9.505463    -8.200284    -5.2075014    5.3940268
-      -3.04878      1.611095    10.127234   -10.534177   -15.821609
-      1.2032688   -0.35080156   1.2629458  -12.643498    -2.5758228
-    -11.343508     2.3385992   -8.719341    14.213509    15.404744
-      -0.39327756   6.338786     2.688887     8.7104025   17.469526
-      -8.77959      7.0576906    4.648855    -1.3089896  -23.294737
-      8.013747    13.891729    -9.926753     5.655307    -5.9422326
-    -22.842539     0.6293588  -18.46266    -10.811862     9.8192625
-      3.0070958    3.8072643   -2.3861165    3.0821571  -14.739942
-      1.7594414   -0.6485091    4.485623     2.0207152    7.264915
-      -6.40137     23.63524      2.9711294  -22.708025     9.93719
-      20.354511   -10.324688    -0.700492    -8.783211    -5.27593
-      15.999649     3.3004563   12.747926    15.429879     4.7849145
-      5.6699696   -2.3826702   10.605882     3.9112158    3.1500628
-      15.859915    -2.1832209  -23.908653    -6.4799504   -4.5365124
-      -9.224193    14.568347   -10.568833     4.982321    -4.342062
-      0.0914714   12.645902    -5.74285     -3.2141201   -2.7173362
-      -6.680575     0.4757669   -5.035051    -6.7964664   16.865469
-    -11.54324      7.681869     0.44475392   9.708182    -8.932846
-      0.4123232   -4.361452     1.3948607    9.511665     0.11667654
-      2.9079323    6.049952     9.275183   -18.078873     6.2983274
-      -0.7500531   -2.725033    -7.6027865    3.3404543    2.990815
-      4.010979    11.000591    -2.8873312    7.1352735  -16.79663
-      18.495346   -14.293832     7.89578      2.2714825   22.976387
-      -4.875734    -3.0836344   -2.9999814   13.751918     6.448228
-    -11.924197     2.171869     2.0423572   -6.173772    10.778437
-      25.77281     -4.9495463   14.57806      0.3044315    2.6132357
-      -7.591999    -2.076944     9.025118     1.7834753   -3.1799617
-      -4.9401326   23.465864     5.1685796   -9.018578     9.037825
-      -4.4150195    6.859591   -12.274467    -0.88911164   5.186309
-      -3.9988663  -13.638606    -9.925445    -0.06329413  -3.6709652
-    -12.397416   -12.719869    -1.395601     2.1150916    5.7381287
-      -4.4691963   -3.82819     -0.84233856  -1.1604277  -13.490127
-      8.731719   -20.778936   -11.495662     5.8033476   -4.752041
-      10.833007    -6.717991     4.504732    13.4244375    1.1306485
-      7.3435574    1.400918    14.704036    -9.501399     7.2315617
-      -6.417456     1.3333273   11.872697    -0.30664724   8.8845
-      6.5569253    4.7948146    0.03662816  -8.704245     6.224871
-      -3.2701402  -11.508579  ]
+   Audio embedding Result:
+    [  1.4217498    5.626253    -5.342073     1.1773866    3.308055
+    1.756596     5.167894    10.80636     -3.8226728   -5.6141334
+    2.623845    -0.8072968    1.9635103   -7.3128724    0.01103897
+    -9.723131     0.6619743   -6.976803    10.213478     7.494748
+    2.9105635    3.8949256    3.7999806    7.1061673   16.905321
+    -7.1493764    8.733103     3.4230042   -4.831653   -11.403367
+    11.232214     7.1274667   -4.2828417    2.452362    -5.130748
+    -18.177666    -2.6116815  -11.000337    -6.7314315    1.6564683
+    0.7618269    1.1253023   -2.083836     4.725744    -8.782597
+    -3.539873     3.814236     5.1420674    2.162061     4.096431
+    -6.4162116   12.747448     1.9429878  -15.152943     6.417416
+    16.097002    -9.716668    -1.9920526   -3.3649497   -1.871939
+    11.567354     3.69788     11.258265     7.442363     9.183411
+    4.5281515   -1.2417862    4.3959084    6.6727695    5.8898783
+    7.627124    -0.66919386 -11.889693    -9.208865    -7.4274073
+    -3.7776625    6.917234    -9.848748    -2.0944717   -5.135116
+    0.49563864   9.317534    -5.9141874   -1.8098574   -0.11738578
+    -7.169265    -1.0578263   -5.7216787   -5.1173844   16.137651
+    -4.473626     7.6624317   -0.55381083   9.631587    -6.4704556
+    -8.548508     4.3716145   -0.79702514   4.478997    -2.9758704
+    3.272176     2.8382776    5.134597    -9.190781    -0.5657382
+    -4.8745747    2.3165567   -5.984303    -2.1798875    0.35541576
+    -0.31784213   9.493548     2.1144536    4.358092   -12.089823
+    8.451689    -7.925461     4.6242585    4.4289427   18.692003
+    -2.6204622   -5.149185    -0.35821092   8.488551     4.981496
+    -9.32683     -2.2544234    6.6417594    1.2119585   10.977129
+    16.555033     3.3238444    9.551863    -1.6676947   -0.79539716
+    -8.605674    -0.47356385   2.6741948   -5.359179    -2.6673796
+    0.66607     15.443222     4.740594    -3.4725387   11.592567
+    -2.054497     1.7361217   -8.265324    -9.30447      5.4068313
+    -1.5180256   -7.746615    -6.089606     0.07112726  -0.34904733
+    -8.649895    -9.998958    -2.564841    -0.53999114   2.601808
+    -0.31927416  -1.8815292   -2.07215     -3.4105783   -8.2998085
+    1.483641   -15.365992    -8.288208     3.8847756   -3.4876456
+    7.3629923    0.4657332    3.132599    12.438889    -1.8337058
+    4.532936     2.7264361   10.145339    -6.521951     2.897153
+    -3.3925855    5.079156     7.759716     4.677565     5.8457737
+    2.402413     7.7071047    3.9711342   -6.390043     6.1268735
+    -3.7760346  -11.118123  ]
+    # get the test embedding
+    Test embedding Result:
+    [ -1.902964     2.0690894   -8.034194     3.5472693    0.18089125
+      6.9085927    1.4097427   -1.9487704  -10.021278    -0.20755845
+      -8.04332      4.344489     2.3200977  -14.306299     5.184692
+    -11.55602     -3.8497238    0.6444722    1.2833948    2.6766639
+      0.5878921    0.7946299    1.7207596    2.5791872   14.998469
+      -1.3385371   15.031221    -0.8006958    1.99287     -9.52007
+      2.435466     4.003221    -4.33817     -4.898601    -5.304714
+    -18.033886    10.790787   -12.784645    -5.641755     2.9761686
+    -10.566622     1.4839455    6.152458    -5.7195854    2.8603241
+      6.112133     8.489869     5.5958056    1.2836679   -1.2293907
+      0.89927405   7.0288725   -2.854029    -0.9782962    5.8255906
+      14.905906    -5.025907     0.7866458   -4.2444224  -16.354029
+      10.521315     0.9604709   -3.3257897    7.144871   -13.592733
+      -8.568869    -1.7953678    0.26313916  10.916714    -6.9374123
+      1.857403    -6.2746415    2.8154466   -7.2338667   -2.293357
+      -0.05452765   5.4287076    5.0849075   -6.690375    -1.6183422
+      3.654291     0.94352573  -9.200294    -5.4749465   -3.5235846
+      1.3420814    4.240421    -2.772944    -2.8451524   16.311104
+      4.2969875   -1.762936   -12.5758915    8.595198    -0.8835239
+      -1.5708797    1.568961     1.1413603    3.5032008   -0.45251232
+      -6.786333    16.89443      5.3366146   -8.789056     0.6355629
+      3.2579517   -3.328322     7.5969577    0.66025066  -6.550468
+      -9.148656     2.020372    -0.4615173    1.1965656   -3.8764873
+      11.6562195   -6.0750933   12.182899     3.2218833    0.81969476
+      5.570001    -3.8459578   -7.205299     7.9262037   -7.6611166
+      -5.249467    -2.2671914    7.2658715  -13.298164     4.821147
+      -2.7263982   11.691089    -3.8918593   -2.838112    -1.0336838
+      -3.8034165    2.8536487   -5.60398     -1.1972581    1.3455094
+      -3.4903061    2.2408795    5.5010734   -3.970756    11.99696
+      -7.8858757    0.43160373  -5.5059714    4.3426995   16.322706
+      11.635366     0.72157705  -9.245714    -3.91465     -4.449838
+      -1.5716927    7.713747    -2.2430465   -6.198303   -13.481864
+      2.8156567   -5.7812386    5.1456156    2.7289324  -14.505571
+      13.270688     3.448231    -7.0659585    4.5886116   -4.466099
+      -0.296428   -11.463529    -2.6076477   14.110243    -6.9725137
+      -1.9962958    2.7119343   19.391657     0.01961198  14.607133
+      -1.6695905   -4.391516     1.3131028   -6.670972    -5.888604
+      12.0612335    5.9285784    3.3715196    1.492534    10.723728
+      -0.95514804 -12.085431  ]
+    # get the score between enroll and test
+    Embeddings Score: 0.4292638301849365
  ```
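
  The diff does not show how `get_embeddings_score` compares the two embeddings. A common choice for speaker verification is cosine similarity, so the sketch below assumes that rule purely for illustration. The function names `cosine_score` and `rescale_to_unit` are hypothetical and are not PaddleSpeech APIs. Note that plain cosine similarity lies in [-1, 1], whereas the comment in the example states a [0, 1] range, so the real implementation may rescale or differ in other ways.

  ```python
  import math


  def cosine_score(enroll_emb, test_emb):
      """Cosine similarity between two equal-length embeddings (assumed scoring rule)."""
      enroll = [float(x) for x in enroll_emb]
      test = [float(x) for x in test_emb]
      if len(enroll) != len(test):
          raise ValueError("embeddings must have the same length")
      dot = sum(a * b for a, b in zip(enroll, test))
      norm = math.sqrt(sum(a * a for a in enroll)) * math.sqrt(sum(b * b for b in test))
      if norm == 0.0:
          raise ValueError("embeddings must be non-zero")
      return dot / norm


  def rescale_to_unit(score):
      """Map a cosine score from [-1, 1] to [0, 1]; one hypothetical normalization."""
      return (score + 1.0) / 2.0
  ```

  With the embeddings printed above, `cosine_score(audio_emb, test_emb)` would give a value to compare against a decision threshold. Whether it matches the reported score depends on the library's actual scoring rule.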

 ### 4.Pretrained Models
--- a/demos/speaker_verification/README_cn.md
+++ b/demos/speaker_verification/README_cn.md
@@ -29,6 +29,11 @@ wget -c https://paddlespeech.bj.bcebos.com/vector/audio/85236145389.wav
  paddlespeech vector --task spk --input vec.job

  echo -e "demo2 85236145389.wav \n demo3 85236145389.wav" | paddlespeech vector --task spk
+  
+  paddlespeech vector --task score --input "./85236145389.wav ./123456789.wav"
+  
+  echo -e "demo4 85236145389.wav 85236145389.wav \n demo5 85236145389.wav 123456789.wav" > vec.job
+  paddlespeech vector --task score --input vec.job
  ```
  
  使用方法：
@@ -37,6 +42,7 @@ wget -c https://paddlespeech.bj.bcebos.com/vector/audio/85236145389.wav
  ```
  参数：
  - `input`(必须输入)：用于识别的音频文件。
+  - `task` (必须输入): 用于指定 `vector` 处理的具体任务，默认是 `spk`。
  - `model`：声纹任务的模型，默认值：`ecapatdnn_voxceleb12`。
  - `sample_rate`：音频采样率，默认值：`16000`。
  - `config`：声纹任务的参数文件，若不设置则使用预训练模型中的默认配置，默认值：`None`。
@@ -45,45 +51,45 @@ wget -c https://paddlespeech.bj.bcebos.com/vector/audio/85236145389.wav

  输出：
  ```bash
-  demo  [ -5.749211     9.505463    -8.200284    -5.2075014    5.3940268
-      -3.04878      1.611095    10.127234   -10.534177   -15.821609
-      1.2032688   -0.35080156   1.2629458  -12.643498    -2.5758228
-    -11.343508     2.3385992   -8.719341    14.213509    15.404744
-      -0.39327756   6.338786     2.688887     8.7104025   17.469526
-      -8.77959      7.0576906    4.648855    -1.3089896  -23.294737
-      8.013747    13.891729    -9.926753     5.655307    -5.9422326
-    -22.842539     0.6293588  -18.46266    -10.811862     9.8192625
-      3.0070958    3.8072643   -2.3861165    3.0821571  -14.739942
-      1.7594414   -0.6485091    4.485623     2.0207152    7.264915
-      -6.40137     23.63524      2.9711294  -22.708025     9.93719
-      20.354511   -10.324688    -0.700492    -8.783211    -5.27593
-      15.999649     3.3004563   12.747926    15.429879     4.7849145
-      5.6699696   -2.3826702   10.605882     3.9112158    3.1500628
-      15.859915    -2.1832209  -23.908653    -6.4799504   -4.5365124
-      -9.224193    14.568347   -10.568833     4.982321    -4.342062
-      0.0914714   12.645902    -5.74285     -3.2141201   -2.7173362
-      -6.680575     0.4757669   -5.035051    -6.7964664   16.865469
-    -11.54324      7.681869     0.44475392   9.708182    -8.932846
-      0.4123232   -4.361452     1.3948607    9.511665     0.11667654
-      2.9079323    6.049952     9.275183   -18.078873     6.2983274
-      -0.7500531   -2.725033    -7.6027865    3.3404543    2.990815
-      4.010979    11.000591    -2.8873312    7.1352735  -16.79663
-      18.495346   -14.293832     7.89578      2.2714825   22.976387
-      -4.875734    -3.0836344   -2.9999814   13.751918     6.448228
-    -11.924197     2.171869     2.0423572   -6.173772    10.778437
-      25.77281     -4.9495463   14.57806      0.3044315    2.6132357
-      -7.591999    -2.076944     9.025118     1.7834753   -3.1799617
-      -4.9401326   23.465864     5.1685796   -9.018578     9.037825
-      -4.4150195    6.859591   -12.274467    -0.88911164   5.186309
-      -3.9988663  -13.638606    -9.925445    -0.06329413  -3.6709652
-    -12.397416   -12.719869    -1.395601     2.1150916    5.7381287
-      -4.4691963   -3.82819     -0.84233856  -1.1604277  -13.490127
-      8.731719   -20.778936   -11.495662     5.8033476   -4.752041
-      10.833007    -6.717991     4.504732    13.4244375    1.1306485
-      7.3435574    1.400918    14.704036    -9.501399     7.2315617
-      -6.417456     1.3333273   11.872697    -0.30664724   8.8845
-      6.5569253    4.7948146    0.03662816  -8.704245     6.224871
-      -3.2701402  -11.508579  ]
+  demo  [  1.4217498    5.626253    -5.342073     1.1773866    3.308055
+    1.756596     5.167894    10.80636     -3.8226728   -5.6141334
+    2.623845    -0.8072968    1.9635103   -7.3128724    0.01103897
+    -9.723131     0.6619743   -6.976803    10.213478     7.494748
+    2.9105635    3.8949256    3.7999806    7.1061673   16.905321
+    -7.1493764    8.733103     3.4230042   -4.831653   -11.403367
+    11.232214     7.1274667   -4.2828417    2.452362    -5.130748
+    -18.177666    -2.6116815  -11.000337    -6.7314315    1.6564683
+    0.7618269    1.1253023   -2.083836     4.725744    -8.782597
+    -3.539873     3.814236     5.1420674    2.162061     4.096431
+    -6.4162116   12.747448     1.9429878  -15.152943     6.417416
+    16.097002    -9.716668    -1.9920526   -3.3649497   -1.871939
+    11.567354     3.69788     11.258265     7.442363     9.183411
+    4.5281515   -1.2417862    4.3959084    6.6727695    5.8898783
+    7.627124    -0.66919386 -11.889693    -9.208865    -7.4274073
+    -3.7776625    6.917234    -9.848748    -2.0944717   -5.135116
+    0.49563864   9.317534    -5.9141874   -1.8098574   -0.11738578
+    -7.169265    -1.0578263   -5.7216787   -5.1173844   16.137651
+    -4.473626     7.6624317   -0.55381083   9.631587    -6.4704556
+    -8.548508     4.3716145   -0.79702514   4.478997    -2.9758704
+    3.272176     2.8382776    5.134597    -9.190781    -0.5657382
+    -4.8745747    2.3165567   -5.984303    -2.1798875    0.35541576
+    -0.31784213   9.493548     2.1144536    4.358092   -12.089823
+    8.451689    -7.925461     4.6242585    4.4289427   18.692003
+    -2.6204622   -5.149185    -0.35821092   8.488551     4.981496
+    -9.32683     -2.2544234    6.6417594    1.2119585   10.977129
+    16.555033     3.3238444    9.551863    -1.6676947   -0.79539716
+    -8.605674    -0.47356385   2.6741948   -5.359179    -2.6673796
+    0.66607     15.443222     4.740594    -3.4725387   11.592567
+    -2.054497     1.7361217   -8.265324    -9.30447      5.4068313
+    -1.5180256   -7.746615    -6.089606     0.07112726  -0.34904733
+    -8.649895    -9.998958    -2.564841    -0.53999114   2.601808
+    -0.31927416  -1.8815292   -2.07215     -3.4105783   -8.2998085
+    1.483641   -15.365992    -8.288208     3.8847756   -3.4876456
+    7.3629923    0.4657332    3.132599    12.438889    -1.8337058
+    4.532936     2.7264361   10.145339    -6.521951     2.897153
+    -3.3925855    5.079156     7.759716     4.677565     5.8457737
+    2.402413     7.7071047    3.9711342   -6.390043     6.1268735
+    -3.7760346  -11.118123  ]
  ```

 - Python API
@@ -98,53 +104,109 @@ wget -c https://paddlespeech.bj.bcebos.com/vector/audio/85236145389.wav
      config=None,  # Set `config` and `ckpt_path` to None to use pretrained model.
      ckpt_path=None,
      audio_file='./85236145389.wav',
-      force_yes=False,
      device=paddle.get_device())
  print('Audio embedding Result: \n{}'.format(audio_emb))
+
+  test_emb = vector_executor(
+      model='ecapatdnn_voxceleb12',
+      sample_rate=16000,
+      config=None,  # Set `config` and `ckpt_path` to None to use pretrained model.
+      ckpt_path=None,
+      audio_file='./123456789.wav',
+      device=paddle.get_device())
+  print('Test embedding Result: \n{}'.format(test_emb))
+
+  # score range [0, 1]
+  score = vector_executor.get_embeddings_score(audio_emb, test_emb)
+  print(f"Embeddings Score: {score}")
  ```

  输出：
  ```bash
  # Vector Result:
-   [ -5.749211     9.505463    -8.200284    -5.2075014    5.3940268
-      -3.04878      1.611095    10.127234   -10.534177   -15.821609
-      1.2032688   -0.35080156   1.2629458  -12.643498    -2.5758228
-    -11.343508     2.3385992   -8.719341    14.213509    15.404744
-      -0.39327756   6.338786     2.688887     8.7104025   17.469526
-      -8.77959      7.0576906    4.648855    -1.3089896  -23.294737
-      8.013747    13.891729    -9.926753     5.655307    -5.9422326
-    -22.842539     0.6293588  -18.46266    -10.811862     9.8192625
-      3.0070958    3.8072643   -2.3861165    3.0821571  -14.739942
-      1.7594414   -0.6485091    4.485623     2.0207152    7.264915
-      -6.40137     23.63524      2.9711294  -22.708025     9.93719
-      20.354511   -10.324688    -0.700492    -8.783211    -5.27593
-      15.999649     3.3004563   12.747926    15.429879     4.7849145
-      5.6699696   -2.3826702   10.605882     3.9112158    3.1500628
-      15.859915    -2.1832209  -23.908653    -6.4799504   -4.5365124
-      -9.224193    14.568347   -10.568833     4.982321    -4.342062
-      0.0914714   12.645902    -5.74285     -3.2141201   -2.7173362
-      -6.680575     0.4757669   -5.035051    -6.7964664   16.865469
-    -11.54324      7.681869     0.44475392   9.708182    -8.932846
-      0.4123232   -4.361452     1.3948607    9.511665     0.11667654
-      2.9079323    6.049952     9.275183   -18.078873     6.2983274
-      -0.7500531   -2.725033    -7.6027865    3.3404543    2.990815
-      4.010979    11.000591    -2.8873312    7.1352735  -16.79663
-      18.495346   -14.293832     7.89578      2.2714825   22.976387
-      -4.875734    -3.0836344   -2.9999814   13.751918     6.448228
-    -11.924197     2.171869     2.0423572   -6.173772    10.778437
-      25.77281     -4.9495463   14.57806      0.3044315    2.6132357
-      -7.591999    -2.076944     9.025118     1.7834753   -3.1799617
-      -4.9401326   23.465864     5.1685796   -9.018578     9.037825
-      -4.4150195    6.859591   -12.274467    -0.88911164   5.186309
-      -3.9988663  -13.638606    -9.925445    -0.06329413  -3.6709652
-    -12.397416   -12.719869    -1.395601     2.1150916    5.7381287
-      -4.4691963   -3.82819     -0.84233856  -1.1604277  -13.490127
-      8.731719   -20.778936   -11.495662     5.8033476   -4.752041
-      10.833007    -6.717991     4.504732    13.4244375    1.1306485
-      7.3435574    1.400918    14.704036    -9.501399     7.2315617
-      -6.417456     1.3333273   11.872697    -0.30664724   8.8845
-      6.5569253    4.7948146    0.03662816  -8.704245     6.224871
-      -3.2701402  -11.508579  ]
+   Audio embedding Result:
+    [  1.4217498    5.626253    -5.342073     1.1773866    3.308055
+    1.756596     5.167894    10.80636     -3.8226728   -5.6141334
+    2.623845    -0.8072968    1.9635103   -7.3128724    0.01103897
+    -9.723131     0.6619743   -6.976803    10.213478     7.494748
+    2.9105635    3.8949256    3.7999806    7.1061673   16.905321
+    -7.1493764    8.733103     3.4230042   -4.831653   -11.403367
+    11.232214     7.1274667   -4.2828417    2.452362    -5.130748
+    -18.177666    -2.6116815  -11.000337    -6.7314315    1.6564683
+    0.7618269    1.1253023   -2.083836     4.725744    -8.782597
+    -3.539873     3.814236     5.1420674    2.162061     4.096431
+    -6.4162116   12.747448     1.9429878  -15.152943     6.417416
+    16.097002    -9.716668    -1.9920526   -3.3649497   -1.871939
+    11.567354     3.69788     11.258265     7.442363     9.183411
+    4.5281515   -1.2417862    4.3959084    6.6727695    5.8898783
+    7.627124    -0.66919386 -11.889693    -9.208865    -7.4274073
+    -3.7776625    6.917234    -9.848748    -2.0944717   -5.135116
+    0.49563864   9.317534    -5.9141874   -1.8098574   -0.11738578
+    -7.169265    -1.0578263   -5.7216787   -5.1173844   16.137651
+    -4.473626     7.6624317   -0.55381083   9.631587    -6.4704556
+    -8.548508     4.3716145   -0.79702514   4.478997    -2.9758704
+    3.272176     2.8382776    5.134597    -9.190781    -0.5657382
+    -4.8745747    2.3165567   -5.984303    -2.1798875    0.35541576
+    -0.31784213   9.493548     2.1144536    4.358092   -12.089823
+    8.451689    -7.925461     4.6242585    4.4289427   18.692003
+    -2.6204622   -5.149185    -0.35821092   8.488551     4.981496
+    -9.32683     -2.2544234    6.6417594    1.2119585   10.977129
+    16.555033     3.3238444    9.551863    -1.6676947   -0.79539716
+    -8.605674    -0.47356385   2.6741948   -5.359179    -2.6673796
+    0.66607     15.443222     4.740594    -3.4725387   11.592567
+    -2.054497     1.7361217   -8.265324    -9.30447      5.4068313
+    -1.5180256   -7.746615    -6.089606     0.07112726  -0.34904733
+    -8.649895    -9.998958    -2.564841    -0.53999114   2.601808
+    -0.31927416  -1.8815292   -2.07215     -3.4105783   -8.2998085
+    1.483641   -15.365992    -8.288208     3.8847756   -3.4876456
+    7.3629923    0.4657332    3.132599    12.438889    -1.8337058
+    4.532936     2.7264361   10.145339    -6.521951     2.897153
+    -3.3925855    5.079156     7.759716     4.677565     5.8457737
+    2.402413     7.7071047    3.9711342   -6.390043     6.1268735
+    -3.7760346  -11.118123  ]
+    # get the test embedding
+    Test embedding Result:
+    [ -1.902964     2.0690894   -8.034194     3.5472693    0.18089125
+      6.9085927    1.4097427   -1.9487704  -10.021278    -0.20755845
+      -8.04332      4.344489     2.3200977  -14.306299     5.184692
+    -11.55602     -3.8497238    0.6444722    1.2833948    2.6766639
+      0.5878921    0.7946299    1.7207596    2.5791872   14.998469
+      -1.3385371   15.031221    -0.8006958    1.99287     -9.52007
+      2.435466     4.003221    -4.33817     -4.898601    -5.304714
+    -18.033886    10.790787   -12.784645    -5.641755     2.9761686
+    -10.566622     1.4839455    6.152458    -5.7195854    2.8603241
+      6.112133     8.489869     5.5958056    1.2836679   -1.2293907
+      0.89927405   7.0288725   -2.854029    -0.9782962    5.8255906
+      14.905906    -5.025907     0.7866458   -4.2444224  -16.354029
+      10.521315     0.9604709   -3.3257897    7.144871   -13.592733
+      -8.568869    -1.7953678    0.26313916  10.916714    -6.9374123
+      1.857403    -6.2746415    2.8154466   -7.2338667   -2.293357
+      -0.05452765   5.4287076    5.0849075   -6.690375    -1.6183422
+      3.654291     0.94352573  -9.200294    -5.4749465   -3.5235846
+      1.3420814    4.240421    -2.772944    -2.8451524   16.311104
+      4.2969875   -1.762936   -12.5758915    8.595198    -0.8835239
+      -1.5708797    1.568961     1.1413603    3.5032008   -0.45251232
+      -6.786333    16.89443      5.3366146   -8.789056     0.6355629
+      3.2579517   -3.328322     7.5969577    0.66025066  -6.550468
+      -9.148656     2.020372    -0.4615173    1.1965656   -3.8764873
+      11.6562195   -6.0750933   12.182899     3.2218833    0.81969476
+      5.570001    -3.8459578   -7.205299     7.9262037   -7.6611166
+      -5.249467    -2.2671914    7.2658715  -13.298164     4.821147
+      -2.7263982   11.691089    -3.8918593   -2.838112    -1.0336838
+      -3.8034165    2.8536487   -5.60398     -1.1972581    1.3455094
+      -3.4903061    2.2408795    5.5010734   -3.970756    11.99696
+      -7.8858757    0.43160373  -5.5059714    4.3426995   16.322706
+      11.635366     0.72157705  -9.245714    -3.91465     -4.449838
+      -1.5716927    7.713747    -2.2430465   -6.198303   -13.481864
+      2.8156567   -5.7812386    5.1456156    2.7289324  -14.505571
+      13.270688     3.448231    -7.0659585    4.5886116   -4.466099
+      -0.296428   -11.463529    -2.6076477   14.110243    -6.9725137
+      -1.9962958    2.7119343   19.391657     0.01961198  14.607133
+      -1.6695905   -4.391516     1.3131028   -6.670972    -5.888604
+      12.0612335    5.9285784    3.3715196    1.492534    10.723728
+      -0.95514804 -12.085431  ]
+    # get the score between enroll and test
+    Eembeddings Score: 0.4292638301849365
  ```

 ### 4. Pretrained Models
--- a/demos/speaker_verification/run.sh
+++ b/demos/speaker_verification/run.sh
@ -1,6 +1,9 @@
 #!/bin/bash

 wget -c https://paddlespeech.bj.bcebos.com/vector/audio/85236145389.wav
+wget -c https://paddlespeech.bj.bcebos.com/vector/audio/123456789.wav

-# asr
-paddlespeech vector --task spk --input ./85236145389.wav
+# vector
+paddlespeech vector --task spk --input ./85236145389.wav
+
+paddlespeech vector --task score --input "./85236145389.wav ./123456789.wav"
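The `--task score` call compares the embeddings of the two downloaded utterances and prints a single number. A minimal sketch of that comparison, assuming the score is the cosine similarity of the two embedding vectors (the diff only shows the printed result, not the formula). `cosine_score` and the toy vectors are hypothetical and not part of PaddleSpeech:

```python
import math


def cosine_score(enroll, test):
    """Cosine similarity between two equal-length embedding vectors."""
    if len(enroll) != len(test):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(a * b for a, b in zip(enroll, test))
    norm_enroll = math.sqrt(sum(a * a for a in enroll))
    norm_test = math.sqrt(sum(b * b for b in test))
    return dot / (norm_enroll * norm_test)


# Toy 3-dimensional vectors for illustration, not real model output.
enroll_embedding = [1.0, 2.0, 3.0]
test_embedding = [2.0, 4.0, 6.5]
score = cosine_score(enroll_embedding, test_embedding)
print(f"score: {score:.4f}")
```

Under this assumption the score lies in [-1, 1], and values closer to 1 suggest the two utterances are more likely from the same speaker; the decision threshold depends on the application.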
--- a/docs/source/released_model.md
+++ b/docs/source/released_model.md
@ -6,7 +6,7 @@
 ### Speech Recognition Model
 Acoustic Model | Training Data | Token-based | Size | Descriptions | CER | WER | Hours of speech | Example Link 
 :-------------:| :------------:| :-----: | -----: | :-----: |:-----:| :-----:  | :-----:  | :-----: 
-[Ds2 Online Aishell ASR0 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_online_aishell_ckpt_0.1.1.model.tar.gz) | Aishell Dataset | Char-based | 345 MB  | 2 Conv + 5 LSTM layers with only forward direction | 0.080 |-| 151 h | [Ds2 Online Aishell ASR0](../../examples/aishell/asr0) 
+[Ds2 Online Aishell ASR0 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_online_aishell_ckpt_0.2.0.model.tar.gz) | Aishell Dataset | Char-based | 345 MB  | 2 Conv + 5 LSTM layers with only forward direction | 0.078 |-| 151 h | [Ds2 Online Aishell ASR0](../../examples/aishell/asr0) 
 [Ds2 Offline Aishell ASR0 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_aishell_ckpt_0.1.1.model.tar.gz)| Aishell Dataset | Char-based | 306 MB | 2 Conv + 3 bidirectional GRU layers| 0.064 |-| 151 h | [Ds2 Offline Aishell ASR0](../../examples/aishell/asr0) 
 [Conformer Online Aishell ASR1 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr1/asr1_chunk_conformer_aishell_ckpt_0.1.2.model.tar.gz) | Aishell Dataset | Char-based | 189 MB  | Encoder:Conformer, Decoder:Transformer, Decoding method: Attention rescoring | 0.0565 |-| 151 h | [Conformer Online Aishell ASR1](../../examples/aishell/asr1) 
 [Conformer Offline Aishell ASR1 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr1/asr1_conformer_aishell_ckpt_0.1.2.model.tar.gz) | Aishell Dataset | Char-based | 189 MB  | Encoder:Conformer, Decoder:Transformer, Decoding method: Attention rescoring | 0.0483 |-| 151 h | [Conformer Offline Aishell ASR1](../../examples/aishell/asr1) 
@ -37,8 +37,8 @@ Model Type | Dataset| Example Link | Pretrained Models|Static Models|Size (stati
 Tacotron2|LJSpeech|[tacotron2-ljspeech](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/ljspeech/tts0)|[tacotron2_ljspeech_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/tacotron2/tacotron2_ljspeech_ckpt_0.2.0.zip)|||
 Tacotron2|CSMSC|[tacotron2-csmsc](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/tts0)|[tacotron2_csmsc_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/tacotron2/tacotron2_csmsc_ckpt_0.2.0.zip)|[tacotron2_csmsc_static_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/tacotron2/tacotron2_csmsc_static_0.2.0.zip)|103MB|
 TransformerTTS| LJSpeech| [transformer-ljspeech](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/ljspeech/tts1)|[transformer_tts_ljspeech_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/transformer_tts/transformer_tts_ljspeech_ckpt_0.4.zip)|||
-SpeedySpeech| CSMSC | [speedyspeech-csmsc](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/tts2) |[speedyspeech_nosil_baker_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/speedyspeech/speedyspeech_nosil_baker_ckpt_0.5.zip)|[speedyspeech_nosil_baker_static_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/speedyspeech/speedyspeech_nosil_baker_static_0.5.zip)|12MB|
-FastSpeech2| CSMSC |[fastspeech2-csmsc](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/tts3)|[fastspeech2_nosil_baker_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_baker_ckpt_0.4.zip)|[fastspeech2_nosil_baker_static_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_baker_static_0.4.zip)|157MB|
+SpeedySpeech| CSMSC | [speedyspeech-csmsc](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/tts2) |[speedyspeech_nosil_baker_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/speedyspeech/speedyspeech_nosil_baker_ckpt_0.5.zip)|[speedyspeech_csmsc_static_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/speedyspeech/speedyspeech_csmsc_static_0.2.0.zip)|12MB|
+FastSpeech2| CSMSC |[fastspeech2-csmsc](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/tts3)|[fastspeech2_nosil_baker_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_baker_ckpt_0.4.zip)|[fastspeech2_csmsc_static_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_csmsc_static_0.2.0.zip)|157MB|
 FastSpeech2-Conformer| CSMSC |[fastspeech2-csmsc](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/tts3)|[fastspeech2_conformer_baker_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_conformer_baker_ckpt_0.5.zip)|||
 FastSpeech2| AISHELL-3 |[fastspeech2-aishell3](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/tts3)|[fastspeech2_nosil_aishell3_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_aishell3_ckpt_0.4.zip)|||
 FastSpeech2| LJSpeech |[fastspeech2-ljspeech](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/ljspeech/tts3)|[fastspeech2_nosil_ljspeech_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_ljspeech_ckpt_0.5.zip)|||
@ -80,7 +80,7 @@ PANN | ESC-50 |[pann-esc50](../../examples/esc50/cls0)|[esc50_cnn6.tar.gz](https

 Model Type | Dataset| Example Link | Pretrained Models | Static Models 
 :-------------:| :------------:| :-----: | :-----: | :-----:
-ECAPA-TDNN | VoxCeleb| [voxceleb_ecapatdnn](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/voxceleb/sv0) | [ecapatdnn.tar.gz](https://paddlespeech.bj.bcebos.com/vector/voxceleb/sv0_ecapa_tdnn_voxceleb12_ckpt_0_1_1.tar.gz) | -
+ECAPA-TDNN | VoxCeleb| [voxceleb_ecapatdnn](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/voxceleb/sv0) | [ecapatdnn.tar.gz](https://paddlespeech.bj.bcebos.com/vector/voxceleb/sv0_ecapa_tdnn_voxceleb12_ckpt_0_2_0.tar.gz) | -

 ## Punctuation Restoration Models
 Model Type | Dataset| Example Link | Pretrained Models
--- a/examples/aishell/asr0/README.md
+++ b/examples/aishell/asr0/README.md
@ -173,12 +173,7 @@ bash local/data.sh --stage 2  --stop_stage 2

 CUDA_VISIBLE_DEVICES= ./local/test.sh conf/deepspeech2.yaml exp/deepspeech2/checkpoints/avg_1
 ```
-The performance of the released models are shown below:
-
-|         Acoustic Model         |  Training Data  | Token-based |   Size | Descriptions                                       | CER   | WER  | Hours of speech |
-| :----------------------------: | :-------------: | :---------: | -----: | :------------------------------------------------- | :---- | :--- | :-------------- |
-| Ds2 Online Aishell ASR0 Model  | Aishell Dataset | Char-based  | 345 MB | 2 Conv + 5 LSTM layers with only forward direction | 0.080 | -    | 151 h           |
-| Ds2 Offline Aishell ASR0 Model | Aishell Dataset | Char-based  | 306 MB | 2 Conv + 3 bidirectional GRU layers                | 0.064 | -    | 151 h           |
+The performance of the released models is shown in [RESULTS.md](./RESULTS.md).
 ## Stage 4: Static graph model Export
 This stage is to transform dygraph to static graph.
 ```bash
--- a/examples/aishell/asr0/RESULTS.md
+++ b/examples/aishell/asr0/RESULTS.md
@ -4,15 +4,16 @@

 | Model | Number of Params | Release | Config | Test set | Valid Loss | CER | 
 | --- | --- | --- | --- | --- | --- | --- | 
-| DeepSpeech2 | 45.18M | 2.2.0 | conf/deepspeech2_online.yaml + spec aug | test | 7.994938373565674 | 0.080 |  
+| DeepSpeech2 | 45.18M | r0.2.0 | conf/deepspeech2_online.yaml + spec aug | test | 7.708217620849609 | 0.078 |
+| DeepSpeech2 | 45.18M | v2.2.0 | conf/deepspeech2_online.yaml + spec aug | test | 7.994938373565674 | 0.080 |  

 ## Deepspeech2 Non-Streaming

 | Model | Number of Params | Release | Config | Test set | Valid Loss | CER |  
 | --- | --- | --- | --- | --- | --- | --- |  
-| DeepSpeech2 | 58.4M | 2.2.0 | conf/deepspeech2.yaml + spec aug | test | 5.738585948944092 | 0.064000 |  
-| DeepSpeech2 | 58.4M | 2.1.0 | conf/deepspeech2.yaml + spec aug | test | 7.483316898345947 | 0.077860 |  
-| DeepSpeech2 | 58.4M | 2.1.0 | conf/deepspeech2.yaml | test | 7.299022197723389 | 0.078671 |
-| DeepSpeech2 | 58.4M | 2.0.0 | conf/deepspeech2.yaml | test | - | 0.078977 |  
+| DeepSpeech2 | 58.4M | v2.2.0 | conf/deepspeech2.yaml + spec aug | test | 5.738585948944092 | 0.064000 |  
+| DeepSpeech2 | 58.4M | v2.1.0 | conf/deepspeech2.yaml + spec aug | test | 7.483316898345947 | 0.077860 |  
+| DeepSpeech2 | 58.4M | v2.1.0 | conf/deepspeech2.yaml | test | 7.299022197723389 | 0.078671 |
+| DeepSpeech2 | 58.4M | v2.0.0 | conf/deepspeech2.yaml | test | - | 0.078977 |  
 | --- | --- | --- | --- | --- | --- | --- |  
-| DeepSpeech2 | 58.4M | 1.8.5 | - | test | - | 0.080447 |
+| DeepSpeech2 | 58.4M | v1.8.5 | - | test | - | 0.080447 |
--- a/examples/aishell3/vc0/README.md
+++ b/examples/aishell3/vc0/README.md
@ -118,7 +118,7 @@ CUDA_VISIBLE_DEVICES=${gpus} ./local/voice_cloning.sh ${conf_path} ${train_outpu
 ```

 ## Pretrained Model
-[tacotron2_aishell3_ckpt_vc0_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/tacotron2/tacotron2_aishell3_ckpt_vc0_0.2.0.zip)
+- [tacotron2_aishell3_ckpt_vc0_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/tacotron2/tacotron2_aishell3_ckpt_vc0_0.2.0.zip)


 Model | Step | eval/loss | eval/l1_loss | eval/mse_loss | eval/bce_loss| eval/attn_loss
--- a/examples/aishell3/vc1/README.md
+++ b/examples/aishell3/vc1/README.md
@ -119,7 +119,7 @@ ref_audio
 CUDA_VISIBLE_DEVICES=${gpus} ./local/voice_cloning.sh ${conf_path} ${train_output_path} ${ckpt_name} ${ge2e_params_path} ${ref_audio_dir}
 ```
 ## Pretrained Model
-[fastspeech2_nosil_aishell3_vc1_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_aishell3_vc1_ckpt_0.5.zip)
+- [fastspeech2_nosil_aishell3_vc1_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_aishell3_vc1_ckpt_0.5.zip)

 Model | Step | eval/loss | eval/l1_loss | eval/duration_loss | eval/pitch_loss| eval/energy_loss 
 :-------------:| :------------:| :-----: | :-----: | :--------: |:--------:|:---------:
--- a/examples/aishell3/voc1/README.md
+++ b/examples/aishell3/voc1/README.md
@ -137,7 +137,8 @@ optional arguments:
 5. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.

 ## Pretrained Models
-Pretrained models can be downloaded here [pwg_aishell3_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_aishell3_ckpt_0.5.zip).
+Pretrained models can be downloaded here:
+- [pwg_aishell3_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_aishell3_ckpt_0.5.zip)

 Model | Step | eval/generator_loss | eval/log_stft_magnitude_loss:| eval/spectral_convergence_loss 
 :-------------:| :------------:| :-----: | :-----: | :--------:
--- a/examples/aishell3/voc5/README.md
+++ b/examples/aishell3/voc5/README.md
@ -136,7 +136,8 @@ optional arguments:
 4. `--output-dir` is the directory to save the synthesized audio files.
 5. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.
 ## Pretrained Models
-The pretrained model can be downloaded here [hifigan_aishell3_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_aishell3_ckpt_0.2.0.zip).
+The pretrained model can be downloaded here:
+- [hifigan_aishell3_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_aishell3_ckpt_0.2.0.zip)


 Model | Step | eval/generator_loss | eval/mel_loss| eval/feature_matching_loss
--- a/examples/csmsc/tts0/README.md
+++ b/examples/csmsc/tts0/README.md
@ -212,7 +212,8 @@ optional arguments:
 Pretrained Tacotron2 model with no silence in the edge of audios:
 - [tacotron2_csmsc_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/tacotron2/tacotron2_csmsc_ckpt_0.2.0.zip)

-The static model can be downloaded here [tacotron2_csmsc_static_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/tacotron2/tacotron2_csmsc_static_0.2.0.zip).
+The static model can be downloaded here:
+- [tacotron2_csmsc_static_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/tacotron2/tacotron2_csmsc_static_0.2.0.zip)


 Model | Step | eval/loss | eval/l1_loss | eval/mse_loss | eval/bce_loss| eval/attn_loss 
--- a/examples/csmsc/tts2/README.md
+++ b/examples/csmsc/tts2/README.md
@ -221,9 +221,12 @@ CUDA_VISIBLE_DEVICES=${gpus} ./local/inference.sh ${train_output_path}
 ```

 ## Pretrained Model
-Pretrained SpeedySpeech model with no silence in the edge of audios[speedyspeech_nosil_baker_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/speedyspeech/speedyspeech_nosil_baker_ckpt_0.5.zip).
+Pretrained SpeedySpeech model with no silence in the edge of audios:
+- [speedyspeech_nosil_baker_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/speedyspeech/speedyspeech_nosil_baker_ckpt_0.5.zip)

-The static model can be downloaded here [speedyspeech_nosil_baker_static_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/speedyspeech/speedyspeech_nosil_baker_static_0.5.zip).
+The static model can be downloaded here:
+- [speedyspeech_nosil_baker_static_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/speedyspeech/speedyspeech_nosil_baker_static_0.5.zip)
+- [speedyspeech_csmsc_static_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/speedyspeech/speedyspeech_csmsc_static_0.2.0.zip)

 Model | Step | eval/loss | eval/l1_loss | eval/duration_loss | eval/ssim_loss
 :-------------:| :------------:| :-----: | :-----: | :--------:|:--------:
--- a/examples/csmsc/tts3/README.md
+++ b/examples/csmsc/tts3/README.md
@ -232,6 +232,9 @@ The static model can be downloaded here:
 - [fastspeech2_nosil_baker_static_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_baker_static_0.4.zip)
 - [fastspeech2_csmsc_static_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_csmsc_static_0.2.0.zip)

+The ONNX model can be downloaded here:
+- [fastspeech2_csmsc_onnx_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_csmsc_onnx_0.2.0.zip)
+
 Model | Step | eval/loss | eval/l1_loss | eval/duration_loss | eval/pitch_loss| eval/energy_loss 
 :-------------:| :------------:| :-----: | :-----: | :--------: |:--------:|:---------:
 default| 2(gpu) x 76000|1.0991|0.59132|0.035815|0.31915|0.15287|
--- a/examples/csmsc/tts3/local/ort_predict.sh
+++ b/examples/csmsc/tts3/local/ort_predict.sh
@ -0,0 +1,31 @@
+train_output_path=$1
+
+stage=0
+stop_stage=0
+
+# only support default_fastspeech2 + hifigan/mb_melgan now!
+
+# synthesize from metadata
+if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
+    python3 ${BIN_DIR}/../ort_predict.py \
+        --inference_dir=${train_output_path}/inference_onnx \
+        --am=fastspeech2_csmsc \
+        --voc=hifigan_csmsc \
+        --test_metadata=dump/test/norm/metadata.jsonl \
+        --output_dir=${train_output_path}/onnx_infer_out \
+        --device=cpu \
+        --cpu_threads=2
+fi
+
+# e2e, synthesize from text
+if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
+    python3 ${BIN_DIR}/../ort_predict_e2e.py \
+        --inference_dir=${train_output_path}/inference_onnx \
+        --am=fastspeech2_csmsc \
+        --voc=hifigan_csmsc \
+        --output_dir=${train_output_path}/onnx_infer_out_e2e \
+        --text=${BIN_DIR}/../csmsc_test.txt \
+        --phones_dict=dump/phone_id_map.txt \
+        --device=cpu \
+        --cpu_threads=2
+fi
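The `stage` / `stop_stage` guards in `ort_predict.sh` select a contiguous range of steps: step N runs only when `stage <= N <= stop_stage`. A small sketch of that rule, using a hypothetical helper `should_run` and the defaults from the script:

```python
def should_run(step, stage, stop_stage):
    """Mirror `[ ${stage} -le N ] && [ ${stop_stage} -ge N ]` for step N."""
    return stage <= step <= stop_stage


# Defaults from ort_predict.sh: stage=0, stop_stage=0.
stage, stop_stage = 0, 0
steps = {0: "synthesize from metadata", 1: "e2e, synthesize from text"}
selected = [name for step, name in steps.items()
            if should_run(step, stage, stop_stage)]
print(selected)  # -> ['synthesize from metadata']
```

With the defaults only the metadata step runs; setting `stop_stage=1` in the script would also run the end-to-end step.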
--- a/examples/csmsc/tts3/local/paddle2onnx.sh
+++ b/examples/csmsc/tts3/local/paddle2onnx.sh
@ -0,0 +1,22 @@
+train_output_path=$1
+model_dir=$2
+output_dir=$3
+model=$4
+
+enable_dev_version=True
+
+model_name=${model%_*}
+echo model_name: ${model_name}
+
+if [ ${model_name} = 'mb_melgan' ] ;then
+    enable_dev_version=False
+fi
+
+mkdir -p ${train_output_path}/${output_dir}
+
+paddle2onnx \
+    --model_dir ${train_output_path}/${model_dir} \
+    --model_filename ${model}.pdmodel \
+    --params_filename ${model}.pdiparams \
+    --save_file ${train_output_path}/${output_dir}/${model}.onnx \
+    --enable_dev_version ${enable_dev_version}
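The `${model%_*}` expansion in `paddle2onnx.sh` strips the shortest trailing `_suffix` from the model tag, so `mb_melgan_csmsc` becomes `mb_melgan`, and that name decides the value passed to `--enable_dev_version`. The same logic can be sketched in Python; the helper names below are illustrative only and do not exist in the repository:

```python
def model_name(model):
    """Drop the last `_suffix`, like the shell expansion `${model%_*}`."""
    return model.rsplit("_", 1)[0]


def enable_dev_version(model):
    """The script turns the dev version off only for mb_melgan."""
    return model_name(model) != "mb_melgan"


# The three model tags that run.sh passes to paddle2onnx.sh.
for model in ("fastspeech2_csmsc", "hifigan_csmsc", "mb_melgan_csmsc"):
    print(model, model_name(model), enable_dev_version(model))
```

For the three models converted in `run.sh`, only `mb_melgan_csmsc` ends up with the dev version disabled.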
--- a/examples/csmsc/tts3/run.sh
+++ b/examples/csmsc/tts3/run.sh
@ -41,3 +41,25 @@ if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
    CUDA_VISIBLE_DEVICES=${gpus} ./local/inference.sh ${train_output_path} || exit -1
 fi

+# paddle2onnx, please make sure the static models are in ${train_output_path}/inference first
+# we have only tested the following models so far
+if [ ${stage} -le 5 ] && [ ${stop_stage} -ge 5 ]; then
+    # install paddle2onnx
+    version=$(echo `pip list |grep "paddle2onnx"` |awk -F" " '{print $2}')
+    if [[ -z "$version" || ${version} != '0.9.4' ]]; then
+        pip install paddle2onnx==0.9.4
+    fi
+    ./local/paddle2onnx.sh ${train_output_path} inference inference_onnx fastspeech2_csmsc
+    ./local/paddle2onnx.sh ${train_output_path} inference inference_onnx hifigan_csmsc
+    ./local/paddle2onnx.sh ${train_output_path} inference inference_onnx mb_melgan_csmsc
+fi
+
+# inference with onnxruntime, use fastspeech2 + hifigan by default
+if [ ${stage} -le 6 ] && [ ${stop_stage} -ge 6 ]; then
+    # install onnxruntime
+    version=$(echo `pip list |grep "onnxruntime"` |awk -F" " '{print $2}')
+    if [[ -z "$version" || ${version} != '1.10.0' ]]; then
+        pip install onnxruntime==1.10.0
+    fi
+    ./local/ort_predict.sh ${train_output_path}
+fi
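Stages 5 and 6 reinstall `paddle2onnx` and `onnxruntime` only when the installed version differs from the pinned one. A rough Python equivalent of that check, using the standard-library `importlib.metadata` (Python 3.8+); `needs_install` is a hypothetical helper, and the sketch only reports what would be installed rather than calling `pip`:

```python
from importlib import metadata


def needs_install(package, pinned_version):
    """True when the package is missing or not at the pinned version."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        return True
    return installed != pinned_version


# Pins taken from run.sh.
for package, pinned in (("paddle2onnx", "0.9.4"), ("onnxruntime", "1.10.0")):
    if needs_install(package, pinned):
        print(f"would run: pip install {package}=={pinned}")
    else:
        print(f"{package}=={pinned} already installed")
```

Unlike the `pip list | grep` pipeline, this looks up the exact distribution name, so a different package whose name merely contains the same substring cannot be picked up by mistake.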
--- a/examples/csmsc/voc1/README.md
+++ b/examples/csmsc/voc1/README.md
@ -127,9 +127,11 @@ optional arguments:
 5. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.

 ## Pretrained Models
-The pretrained model can be downloaded here [pwg_baker_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_baker_ckpt_0.4.zip).
+The pretrained model can be downloaded here:
+- [pwg_baker_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_baker_ckpt_0.4.zip)

-The static model can be downloaded here [pwg_baker_static_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_baker_static_0.4.zip).
+The static model can be downloaded here:
+- [pwg_baker_static_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_baker_static_0.4.zip)

 Model | Step | eval/generator_loss | eval/log_stft_magnitude_loss| eval/spectral_convergence_loss
 :-------------:| :------------:| :-----: | :-----: | :--------:
--- a/examples/csmsc/voc3/README.md
+++ b/examples/csmsc/voc3/README.md
@ -152,11 +152,17 @@ TODO:
 The hyperparameter of `finetune.yaml` is not good enough, a smaller `learning_rate` should be used (more `milestones` should be set).

 ## Pretrained Models
-The pretrained model can be downloaded here [mb_melgan_csmsc_ckpt_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/mb_melgan/mb_melgan_csmsc_ckpt_0.1.1.zip).
+The pretrained model can be downloaded here:
+- [mb_melgan_csmsc_ckpt_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/mb_melgan/mb_melgan_csmsc_ckpt_0.1.1.zip)

-The finetuned model can be downloaded here [mb_melgan_baker_finetune_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/mb_melgan/mb_melgan_baker_finetune_ckpt_0.5.zip).
+The finetuned model can be downloaded here:
+- [mb_melgan_baker_finetune_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/mb_melgan/mb_melgan_baker_finetune_ckpt_0.5.zip)

-The static model can be downloaded here [mb_melgan_csmsc_static_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/mb_melgan/mb_melgan_csmsc_static_0.1.1.zip)
+The static model can be downloaded here:
+- [mb_melgan_csmsc_static_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/mb_melgan/mb_melgan_csmsc_static_0.1.1.zip)
+
+The ONNX model can be downloaded here:
+- [mb_melgan_csmsc_onnx_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/mb_melgan/mb_melgan_csmsc_onnx_0.2.0.zip)

 Model | Step | eval/generator_loss | eval/log_stft_magnitude_loss|eval/spectral_convergence_loss |eval/sub_log_stft_magnitude_loss|eval/sub_spectral_convergence_loss
 :-------------:| :------------:| :-----: | :-----: | :--------:| :--------:| :--------:
--- a/examples/csmsc/voc4/README.md
+++ b/examples/csmsc/voc4/README.md
@ -112,7 +112,8 @@ optional arguments:
 5. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.

 ## Pretrained Models
-The pretrained model can be downloaded here [style_melgan_csmsc_ckpt_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/style_melgan/style_melgan_csmsc_ckpt_0.1.1.zip).
+The pretrained model can be downloaded here:
+- [style_melgan_csmsc_ckpt_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/style_melgan/style_melgan_csmsc_ckpt_0.1.1.zip)

 The static model of Style MelGAN is not available now.

--- a/examples/csmsc/voc5/README.md
+++ b/examples/csmsc/voc5/README.md
@ -112,9 +112,14 @@ optional arguments:
 5. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.

 ## Pretrained Models
-The pretrained model can be downloaded here [hifigan_csmsc_ckpt_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_csmsc_ckpt_0.1.1.zip).
+The pretrained model can be downloaded here:
+- [hifigan_csmsc_ckpt_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_csmsc_ckpt_0.1.1.zip)

-The static model can be downloaded here [hifigan_csmsc_static_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_csmsc_static_0.1.1.zip).
+The static model can be downloaded here:
+- [hifigan_csmsc_static_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_csmsc_static_0.1.1.zip)
+
+The ONNX model can be downloaded here:
+- [hifigan_csmsc_onnx_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_csmsc_onnx_0.2.0.zip)

 Model | Step | eval/generator_loss | eval/mel_loss| eval/feature_matching_loss
 :-------------:| :------------:| :-----: | :-----: | :--------:
--- a/examples/csmsc/voc6/README.md
+++ b/examples/csmsc/voc6/README.md
@ -109,9 +109,11 @@ optional arguments:
 5. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.

 ## Pretrained Models
-The pretrained model can be downloaded here [wavernn_csmsc_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/wavernn/wavernn_csmsc_ckpt_0.2.0.zip).
+The pretrained model can be downloaded here:
+- [wavernn_csmsc_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/wavernn/wavernn_csmsc_ckpt_0.2.0.zip)

-The static model can be downloaded here [wavernn_csmsc_static_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/wavernn/wavernn_csmsc_static_0.2.0.zip).
+The static model can be downloaded here:
+- [wavernn_csmsc_static_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/wavernn/wavernn_csmsc_static_0.2.0.zip)

 Model | Step | eval/loss
 :-------------:|:------------:| :------------:
--- a/examples/iwslt2012/punc0/README.md
+++ b/examples/iwslt2012/punc0/README.md
@ -21,7 +21,7 @@
 The pretrained model can be downloaded here [ernie_linear_p3_iwslt2012_zh_ckpt_0.1.1.zip](https://paddlespeech.bj.bcebos.com/text/ernie_linear_p3_iwslt2012_zh_ckpt_0.1.1.zip).

 ### Test Result
- Ernie Linear
+- Ernie
    |       |COMMA  |  PERIOD | QUESTION | OVERALL|
    |:-----:|:-----:|:-----:|:-----:|:-----:|  
    |Precision  |0.510955  |0.526462  |0.820755  |0.619391|
--- a/examples/iwslt2012/punc0/RESULTS.md
+++ b/examples/iwslt2012/punc0/RESULTS.md
@ -0,0 +1,9 @@
+# iwslt2012
+
+## Ernie
+
+|       |COMMA  |  PERIOD | QUESTION | OVERALL|
+|:-----:|:-----:|:-----:|:-----:|:-----:|  
+|Precision  |0.510955  |0.526462  |0.820755  |0.619391|
+|Recall     |0.517433  |0.564179  |0.861386  |0.647666|
+|F1         |0.514173  |0.544669  |0.840580  |0.633141|
--- a/examples/ljspeech/tts1/README.md
+++ b/examples/ljspeech/tts1/README.md
@ -171,7 +171,8 @@ optional arguments:
 6. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.

 ## Pretrained Model
-Pretrained Model can be downloaded here. [transformer_tts_ljspeech_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/transformer_tts/transformer_tts_ljspeech_ckpt_0.4.zip)
+Pretrained Model can be downloaded here:
+- [transformer_tts_ljspeech_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/transformer_tts/transformer_tts_ljspeech_ckpt_0.4.zip)

 TransformerTTS  checkpoint contains files listed below.
 ```text
--- a/examples/ljspeech/tts3/README.md
+++ b/examples/ljspeech/tts3/README.md
@ -214,7 +214,8 @@ optional arguments:
 9. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.

 ## Pretrained Model
-Pretrained FastSpeech2 model with no silence in the edge of audios. [fastspeech2_nosil_ljspeech_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_ljspeech_ckpt_0.5.zip)
+Pretrained FastSpeech2 model with no silence in the edge of audios:
+- [fastspeech2_nosil_ljspeech_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_ljspeech_ckpt_0.5.zip)

 Model | Step | eval/loss | eval/l1_loss | eval/duration_loss | eval/pitch_loss| eval/energy_loss 
 :-------------:| :------------:| :-----: | :-----: | :--------: |:--------:|:---------:
--- a/examples/ljspeech/voc0/README.md
+++ b/examples/ljspeech/voc0/README.md
@ -50,4 +50,5 @@ Synthesize waveform.
 6. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.

 ## Pretrained Model
-Pretrained Model with residual channel equals 128 can be downloaded here. [waveflow_ljspeech_ckpt_0.3.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/waveflow/waveflow_ljspeech_ckpt_0.3.zip).
+Pretrained Model with residual channel equals 128 can be downloaded here:
+- [waveflow_ljspeech_ckpt_0.3.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/waveflow/waveflow_ljspeech_ckpt_0.3.zip)
--- a/examples/ljspeech/voc1/README.md
+++ b/examples/ljspeech/voc1/README.md
@ -127,7 +127,8 @@ optional arguments:
 5. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.

 ## Pretrained Model
-Pretrained models can be downloaded here. [pwg_ljspeech_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_ljspeech_ckpt_0.5.zip)
+Pretrained models can be downloaded here:
+- [pwg_ljspeech_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_ljspeech_ckpt_0.5.zip)

 Parallel WaveGAN checkpoint contains files listed below.

--- a/examples/ljspeech/voc5/README.md
+++ b/examples/ljspeech/voc5/README.md
@ -127,7 +127,8 @@ optional arguments:
 5. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.

 ## Pretrained Model
-The pretrained model can be downloaded here [hifigan_ljspeech_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_ljspeech_ckpt_0.2.0.zip).
+The pretrained model can be downloaded here:
+- [hifigan_ljspeech_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_ljspeech_ckpt_0.2.0.zip)


 Model | Step | eval/generator_loss | eval/mel_loss| eval/feature_matching_loss
@ -143,6 +144,5 @@ hifigan_ljspeech_ckpt_0.2.0
 └── snapshot_iter_2500000.pdz     # generator parameters of hifigan
 ```

-
 ## Acknowledgement
 We adapted some code from https://github.com/kan-bayashi/ParallelWaveGAN.
--- a/examples/vctk/tts3/README.md
+++ b/examples/vctk/tts3/README.md
@ -217,7 +217,8 @@ optional arguments:
 9. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.

 ## Pretrained Model
-Pretrained FastSpeech2 model with no silence in the edge of audios. [fastspeech2_nosil_vctk_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_vctk_ckpt_0.5.zip)
+Pretrained FastSpeech2 model with no silence at the edges of audios:
+- [fastspeech2_nosil_vctk_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_vctk_ckpt_0.5.zip)

 FastSpeech2 checkpoint contains files listed below.
 ```text
--- a/examples/vctk/voc1/README.md
+++ b/examples/vctk/voc1/README.md
@ -132,7 +132,8 @@ optional arguments:
 5. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.

 ## Pretrained Model
-Pretrained models can be downloaded here [pwg_vctk_ckpt_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_vctk_ckpt_0.1.1.zip).
+Pretrained models can be downloaded here:
+- [pwg_vctk_ckpt_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_vctk_ckpt_0.1.1.zip)

 Parallel WaveGAN checkpoint contains files listed below.

--- a/examples/vctk/voc5/README.md
+++ b/examples/vctk/voc5/README.md
@ -133,7 +133,8 @@ optional arguments:
 5. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.

 ## Pretrained Model
-The pretrained model can be downloaded here [hifigan_vctk_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_vctk_ckpt_0.2.0.zip).
+The pretrained model can be downloaded here:
+- [hifigan_vctk_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_vctk_ckpt_0.2.0.zip)


 Model | Step | eval/generator_loss | eval/mel_loss| eval/feature_matching_loss
--- a/examples/voxceleb/sv0/RESULT.md
+++ b/examples/voxceleb/sv0/RESULT.md
@ -4,4 +4,4 @@

 | Model | Number of Params | Release | Config | dim | Test set |  Cosine | Cosine + S-Norm | 
 | --- | --- | --- | --- | --- | --- | --- | ---- |
-| ECAPA-TDNN | 85M | 0.1.1 | conf/ecapa_tdnn.yaml |192 | test | 1.15 |  1.06 | 
+| ECAPA-TDNN | 85M | 0.2.0 | conf/ecapa_tdnn.yaml |192 | test | 1.02 |  0.95 | 
--- a/paddleaudio/paddleaudio/utils/numeric.py
+++ b/paddleaudio/paddleaudio/utils/numeric.py
@ -0,0 +1,30 @@
+# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import numpy as np
+
+
+def pcm16to32(audio: np.ndarray) -> np.ndarray:
+    """Convert an int16 PCM waveform to float32 scaled to [-1.0, 1.0).
+
+    Args:
+        audio (np.ndarray): Waveform with dtype of int16.
+
+    Returns:
+        np.ndarray: Waveform with dtype of float32. Input of any other dtype
+            is returned unchanged.
+    """
+    if audio.dtype == np.int16:
+        audio = audio.astype("float32")
+        bits = np.iinfo(np.int16).bits
+        audio = audio / (2**(bits - 1))
+    return audio
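
A minimal usage sketch for the new `pcm16to32` helper, assuming only NumPy. The function body is copied from the hunk above so the snippet runs on its own; the sample values are made up for illustration.

```python
import numpy as np


def pcm16to32(audio: np.ndarray) -> np.ndarray:
    """Same logic as the function added in the hunk above."""
    if audio.dtype == np.int16:
        audio = audio.astype("float32")
        bits = np.iinfo(np.int16).bits   # 16 for int16
        audio = audio / (2**(bits - 1))  # divide by 32768
    return audio


# Illustrative samples covering the int16 range.
samples = np.array([-32768, 0, 16384, 32767], dtype=np.int16)
scaled = pcm16to32(samples)

# Arrays that are not int16 pass through untouched.
already_float = np.array([0.25, -0.5], dtype=np.float32)
untouched = pcm16to32(already_float)
```

Dividing by 2^15 maps the most negative sample to exactly -1.0, while the most positive sample lands just below 1.0.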
--- a/paddlespeech/cli/asr/infer.py
+++ b/paddlespeech/cli/asr/infer.py
@ -80,9 +80,9 @@ pretrained_models = {
    },
    "deepspeech2online_aishell-zh-16k": {
        'url':
-        'https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_online_aishell_ckpt_0.1.1.model.tar.gz',
+        'https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_online_aishell_ckpt_0.2.0.model.tar.gz',
        'md5':
-        'd5e076217cf60486519f72c217d21b9b',
+        '23e16c69730a1cb5d735c98c83c21e16',
        'cfg_path':
        'model.yaml',
        'ckpt_path':
@ -426,6 +426,11 @@ class ASRExecutor(BaseExecutor):
        try:
            audio, audio_sample_rate = soundfile.read(
                audio_file, dtype="int16", always_2d=True)
+            audio_duration = audio.shape[0] / audio_sample_rate
+            max_duration = 50.0
+            if audio_duration >= max_duration:
+                logger.error("Please input audio file less than 50 seconds.\n")
+                return
        except Exception as e:
            logger.exception(e)
            logger.error(
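
The duration guard added to `ASRExecutor` can be sketched in isolation as follows. This is an illustration under assumptions: `audio_duration` and `is_too_long` are hypothetical helper names that do not exist in the repository, and the arrays stand in for what `soundfile.read(..., always_2d=True)` would return.

```python
import numpy as np

MAX_DURATION = 50.0  # seconds, the limit hard-coded in the hunk above


def audio_duration(audio: np.ndarray, sample_rate: int) -> float:
    """Duration in seconds of a (frames, channels) array."""
    return audio.shape[0] / sample_rate


def is_too_long(audio: np.ndarray, sample_rate: int,
                max_duration: float = MAX_DURATION) -> bool:
    # Mirrors the `>=` comparison in the diff: exactly 50 s is rejected too.
    return audio_duration(audio, sample_rate) >= max_duration


short_clip = np.zeros((16000 * 10, 1), dtype=np.int16)  # 10 s at 16 kHz
long_clip = np.zeros((16000 * 50, 1), dtype=np.int16)   # exactly 50 s
```

Because the frame count is taken from `shape[0]`, the check does not depend on the number of channels.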
--- a/paddlespeech/cli/vector/infer.py
+++ b/paddlespeech/cli/vector/infer.py
@ -15,6 +15,7 @@ import argparse
 import os
 import sys
 from collections import OrderedDict
+from typing import Dict
 from typing import List
 from typing import Optional
 from typing import Union
@ -42,9 +43,9 @@ pretrained_models = {
    # "paddlespeech vector --task spk --model ecapatdnn_voxceleb12-16k --sr 16000 --input ./input.wav"
    "ecapatdnn_voxceleb12-16k": {
        'url':
-        'https://paddlespeech.bj.bcebos.com/vector/voxceleb/sv0_ecapa_tdnn_voxceleb12_ckpt_0_1_1.tar.gz',
+        'https://paddlespeech.bj.bcebos.com/vector/voxceleb/sv0_ecapa_tdnn_voxceleb12_ckpt_0_2_0.tar.gz',
        'md5':
-        'a1c0dba7d4de997187786ff517d5b4ec',
+        'cc33023c54ab346cd318408f43fcaf95',
        'cfg_path':
        'conf/model.yaml',  # the yaml config path
        'ckpt_path':
@ -79,7 +80,7 @@ class VectorExecutor(BaseExecutor):
            "--task",
            type=str,
            default="spk",
-            choices=["spk"],
+            choices=["spk", "score"],
            help="task type in vector domain")
        self.parser.add_argument(
            "--input",
@ -147,13 +148,40 @@ class VectorExecutor(BaseExecutor):
        logger.info(f"task source: {task_source}")

        # stage 3: process the audio one by one
+        # dispatch according to the task type
        task_result = OrderedDict()
        has_exceptions = False
        for id_, input_ in task_source.items():
            try:
-                res = self(input_, model, sample_rate, config, ckpt_path,
-                           device)
-                task_result[id_] = res
+                # extract the speaker audio embedding
+                if parser_args.task == "spk":
+                    logger.info("do vector spk task")
+                    res = self(input_, model, sample_rate, config, ckpt_path,
+                               device)
+                    task_result[id_] = res
+                elif parser_args.task == "score":
+                    logger.info("do vector score task")
+                    logger.info(f"input content {input_}")
+                    if len(input_.split()) != 2:
+                        logger.error(
+                            f"vector score task input {input_} wav num is not two, "
+                            f"that is {len(input_.split())}")
+                        sys.exit(-1)
+
+                    # get the enroll and test embedding
+                    enroll_audio, test_audio = input_.split()
+                    logger.info(
+                        f"score task, enroll audio: {enroll_audio}, test audio: {test_audio}"
+                    )
+                    enroll_embedding = self(enroll_audio, model, sample_rate,
+                                            config, ckpt_path, device)
+                    test_embedding = self(test_audio, model, sample_rate,
+                                          config, ckpt_path, device)
+
+                    # get the score
+                    res = self.get_embeddings_score(enroll_embedding,
+                                                    test_embedding)
+                    task_result[id_] = res
            except Exception as e:
                has_exceptions = True
                task_result[id_] = f'{e.__class__.__name__}: {e}'
@ -172,6 +200,49 @@ class VectorExecutor(BaseExecutor):
        else:
            return True

+    def _get_job_contents(
+            self, job_input: os.PathLike) -> Dict[str, Union[str, os.PathLike]]:
+        """
+        Read a job input file and return its contents in a dictionary.
+        Refactored from Executor._get_job_contents.
+
+        Args:
+            job_input (os.PathLike): The job input file.
+
+        Returns:
+            Dict[str, str]: Contents of job input.
+        """
+        job_contents = OrderedDict()
+        with open(job_input) as f:
+            for line in f:
+                line = line.strip()
+                if not line:
+                    continue
+                k = line.split(' ')[0]
+                v = ' '.join(line.split(' ')[1:])
+                job_contents[k] = v
+        return job_contents
+
+    def get_embeddings_score(self, enroll_embedding, test_embedding):
+        """Compute the cosine similarity score between the enroll and test embeddings
+
+        Args:
+            enroll_embedding (numpy.array): shape: (emb_size), enroll audio embedding
+            test_embedding (numpy.array): shape: (emb_size), test audio embedding
+
+        Returns:
+            float: the cosine similarity between enroll embedding and test embedding
+        """
+        if not hasattr(self, "score_func"):
+            self.score_func = paddle.nn.CosineSimilarity(axis=0)
+            logger.info("create the cosine score function ")
+
+        score = self.score_func(
+            paddle.to_tensor(enroll_embedding),
+            paddle.to_tensor(test_embedding))
+
+        return score.item()
+
    @stats_wrapper
    def __call__(self,
                 audio_file: os.PathLike,
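
The `score` task reads a job line of the form `utt_id enroll.wav test.wav`, splits it into two audio paths, and compares the two embeddings by cosine similarity. The sketch below reproduces that flow with NumPy swapped in for `paddle.nn.CosineSimilarity` so it runs without Paddle. The names `parse_job_line` and `cosine_score`, the `eps` guard, and the file names are assumptions made for illustration.

```python
import numpy as np


def parse_job_line(line: str):
    """Split 'utt_id enroll.wav test.wav' into (utt_id, 'enroll.wav test.wav')."""
    line = line.strip()
    key = line.split(' ')[0]
    value = ' '.join(line.split(' ')[1:])
    return key, value


def cosine_score(enroll_embedding, test_embedding, eps: float = 1e-8) -> float:
    """NumPy stand-in for the cosine similarity of two 1-D embeddings."""
    a = np.asarray(enroll_embedding, dtype=np.float32)
    b = np.asarray(test_embedding, dtype=np.float32)
    # Guard against a zero norm; the eps value is an assumption of this sketch.
    denom = max(float(np.linalg.norm(a) * np.linalg.norm(b)), eps)
    return float(np.dot(a, b)) / denom


# Hypothetical job line, same layout as the score task expects.
utt_id, pair = parse_job_line("utt1 enroll.wav test.wav\n")
enroll_audio, test_audio = pair.split()

same = cosine_score([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])  # parallel vectors
orthogonal = cosine_score([1.0, 0.0], [0.0, 1.0])
```

Parallel embeddings score close to 1 and orthogonal ones score 0, which is the range a verification threshold would be applied to.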
--- a/paddlespeech/server/engine/asr/online/asr_engine.py
+++ b/paddlespeech/server/engine/asr/online/asr_engine.py
@ -36,7 +36,7 @@ pretrained_models = {
        'url':
        'https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_online_aishell_ckpt_0.1.1.model.tar.gz',
        'md5':
-        'd5e076217cf60486519f72c217d21b9b',
+        '23e16c69730a1cb5d735c98c83c21e16',
        'cfg_path':
        'model.yaml',
        'ckpt_path':
--- a/paddlespeech/t2s/exps/fastspeech2/preprocess.py
+++ b/paddlespeech/t2s/exps/fastspeech2/preprocess.py
@ -86,6 +86,9 @@ def process_sentence(config: Dict[str, Any],
        logmel = mel_extractor.get_log_mel_fbank(wav)
        # change duration according to mel_length
        compare_duration_and_mel_length(sentences, utt_id, logmel)
+        # utt_id may be popped in compare_duration_and_mel_length
+        if utt_id not in sentences:
+            return None
        phones = sentences[utt_id][0]
        durations = sentences[utt_id][1]
        num_frames = logmel.shape[0]
--- a/paddlespeech/t2s/exps/inference.py
+++ b/paddlespeech/t2s/exps/inference.py
@ -104,7 +104,7 @@ def get_voc_output(args, voc_predictor, input):

 def parse_args():
    parser = argparse.ArgumentParser(
-        description="Paddle Infernce with speedyspeech & parallel wavegan.")
+        description="Paddle Inference with acoustic model & vocoder.")
    # acoustic model
    parser.add_argument(
        '--am',
--- a/paddlespeech/t2s/exps/ort_predict.py
+++ b/paddlespeech/t2s/exps/ort_predict.py
@ -0,0 +1,156 @@
+# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import argparse
+from pathlib import Path
+
+import jsonlines
+import numpy as np
+import onnxruntime as ort
+import soundfile as sf
+from timer import timer
+
+from paddlespeech.t2s.exps.syn_utils import get_test_dataset
+from paddlespeech.t2s.utils import str2bool
+
+
+def get_sess(args, filed='am'):
+    full_name = ''
+    if filed == 'am':
+        full_name = args.am
+    elif filed == 'voc':
+        full_name = args.voc
+    model_dir = str(Path(args.inference_dir) / (full_name + ".onnx"))
+    sess_options = ort.SessionOptions()
+    sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
+    sess_options.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL
+
+    if args.device == "gpu":
+        # fastspeech2/mb_melgan can't use trt now!
+        if args.use_trt:
+            providers = ['TensorrtExecutionProvider']
+        else:
+            providers = ['CUDAExecutionProvider']
+    elif args.device == "cpu":
+        providers = ['CPUExecutionProvider']
+    sess_options.intra_op_num_threads = args.cpu_threads
+    sess = ort.InferenceSession(
+        model_dir, providers=providers, sess_options=sess_options)
+    return sess
+
+
+def ort_predict(args):
+    # construct dataset for evaluation
+    with jsonlines.open(args.test_metadata, 'r') as reader:
+        test_metadata = list(reader)
+    am_name = args.am[:args.am.rindex('_')]
+    am_dataset = args.am[args.am.rindex('_') + 1:]
+    test_dataset = get_test_dataset(args, test_metadata, am_name, am_dataset)
+
+    output_dir = Path(args.output_dir)
+    output_dir.mkdir(parents=True, exist_ok=True)
+
+    fs = 24000 if am_dataset != 'ljspeech' else 22050
+
+    # am
+    am_sess = get_sess(args, filed='am')
+
+    # vocoder
+    voc_sess = get_sess(args, filed='voc')
+
+    # am warmup
+    for T in [27, 38, 54]:
+        data = np.random.randint(1, 266, size=(T, ))
+        am_sess.run(None, {"text": data})
+
+    # voc warmup
+    for T in [227, 308, 544]:
+        data = np.random.rand(T, 80).astype("float32")
+        voc_sess.run(None, {"logmel": data})
+    print("warm up done!")
+
+    N = 0
+    T = 0
+    for example in test_dataset:
+        utt_id = example['utt_id']
+        phone_ids = example["text"]
+        with timer() as t:
+            mel = am_sess.run(output_names=None, input_feed={'text': phone_ids})
+            mel = mel[0]
+            wav = voc_sess.run(output_names=None, input_feed={'logmel': mel})
+
+            N += len(wav[0])
+            T += t.elapse
+            speed = len(wav[0]) / t.elapse
+            rtf = fs / speed
+        sf.write(
+            str(output_dir / (utt_id + ".wav")),
+            np.array(wav)[0],
+            samplerate=fs)
+        print(
+            f"{utt_id}, mel: {mel.shape}, wave: {len(wav[0])}, time: {t.elapse}s, Hz: {speed}, RTF: {rtf}."
+        )
+    print(f"generation speed: {N / T}Hz, RTF: {fs / (N / T) }")
+
+
+def parse_args():
+    parser = argparse.ArgumentParser(description="Inference with onnxruntime.")
+    # acoustic model
+    parser.add_argument(
+        '--am',
+        type=str,
+        default='fastspeech2_csmsc',
+        choices=[
+            'fastspeech2_csmsc',
+        ],
+        help='Choose acoustic model type of tts task.')
+
+    # voc
+    parser.add_argument(
+        '--voc',
+        type=str,
+        default='hifigan_csmsc',
+        choices=['hifigan_csmsc', 'mb_melgan_csmsc'],
+        help='Choose vocoder type of tts task.')
+    # other
+    parser.add_argument(
+        "--inference_dir", type=str, help="dir of the ONNX inference models")
+    parser.add_argument("--test_metadata", type=str, help="test metadata.")
+    parser.add_argument("--output_dir", type=str, help="output dir")
+
+    # inference
+    parser.add_argument(
+        "--use_trt",
+        type=str2bool,
+        default=False,
+        help="Whether to use the TensorRT inference engine.", )
+
+    parser.add_argument(
+        "--device",
+        default="gpu",
+        choices=["gpu", "cpu"],
+        help="Device selected for inference.", )
+    parser.add_argument('--cpu_threads', type=int, default=1)
+
+    args, _ = parser.parse_known_args()
+    return args
+
+
+def main():
+    args = parse_args()
+
+    ort_predict(args)
+
+
+if __name__ == "__main__":
+    main()
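
Two small pieces of arithmetic in `ort_predict.py` are easy to misread: the model name is split on its last underscore, and the reported RTF is the sample rate divided by the generation speed. A plain-Python sketch follows; the helper names `split_model_name` and `speed_and_rtf` and the timing numbers are hypothetical.

```python
def split_model_name(full_name: str):
    """'fastspeech2_csmsc' -> ('fastspeech2', 'csmsc'), split on the last '_'."""
    cut = full_name.rindex('_')
    return full_name[:cut], full_name[cut + 1:]


def speed_and_rtf(num_samples: int, elapsed: float, fs: int):
    """Return (samples generated per second, real-time factor)."""
    speed = num_samples / elapsed
    rtf = fs / speed  # equivalent to elapsed / (num_samples / fs)
    return speed, rtf


am_name, am_dataset = split_model_name("fastspeech2_csmsc")
voc_name, voc_dataset = split_model_name("mb_melgan_csmsc")

# Same rule as the script: 22050 Hz for ljspeech, 24000 Hz otherwise.
fs = 24000 if am_dataset != 'ljspeech' else 22050

# Hypothetical numbers: 2 s of audio (48000 samples) synthesized in 0.5 s.
speed, rtf = speed_and_rtf(num_samples=48000, elapsed=0.5, fs=fs)
```

An RTF below 1 means synthesis runs faster than playback; splitting on the last underscore keeps names such as `mb_melgan` intact.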
--- a/paddlespeech/t2s/exps/ort_predict_e2e.py
+++ b/paddlespeech/t2s/exps/ort_predict_e2e.py
@ -0,0 +1,183 @@
+# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+#     http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import argparse
+from pathlib import Path
+
+import numpy as np
+import onnxruntime as ort
+import soundfile as sf
+from timer import timer
+
+from paddlespeech.t2s.exps.syn_utils import get_frontend
+from paddlespeech.t2s.exps.syn_utils import get_sentences
+from paddlespeech.t2s.utils import str2bool
+
+
+def get_sess(args, filed='am'):
+    full_name = ''
+    if filed == 'am':
+        full_name = args.am
+    elif filed == 'voc':
+        full_name = args.voc
+    model_dir = str(Path(args.inference_dir) / (full_name + ".onnx"))
+    sess_options = ort.SessionOptions()
+    sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
+    sess_options.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL
+
+    if args.device == "gpu":
+        # fastspeech2/mb_melgan can't use trt now!
+        if args.use_trt:
+            providers = ['TensorrtExecutionProvider']
+        else:
+            providers = ['CUDAExecutionProvider']
+    elif args.device == "cpu":
+        providers = ['CPUExecutionProvider']
+    sess_options.intra_op_num_threads = args.cpu_threads
+    sess = ort.InferenceSession(
+        model_dir, providers=providers, sess_options=sess_options)
+    return sess
+
+
+def ort_predict(args):
+
+    # frontend
+    frontend = get_frontend(args)
+
+    output_dir = Path(args.output_dir)
+    output_dir.mkdir(parents=True, exist_ok=True)
+    sentences = get_sentences(args)
+
+    am_name = args.am[:args.am.rindex('_')]
+    am_dataset = args.am[args.am.rindex('_') + 1:]
+    fs = 24000 if am_dataset != 'ljspeech' else 22050
+
+    # am
+    am_sess = get_sess(args, filed='am')
+
+    # vocoder
+    voc_sess = get_sess(args, filed='voc')
+
+    # am warmup
+    for T in [27, 38, 54]:
+        data = np.random.randint(1, 266, size=(T, ))
+        am_sess.run(None, {"text": data})
+
+    # voc warmup
+    for T in [227, 308, 544]:
+        data = np.random.rand(T, 80).astype("float32")
+        voc_sess.run(None, {"logmel": data})
+    print("warm up done!")
+
+    # frontend warmup
+    # Loading model cost 0.5+ seconds
+    if args.lang == 'zh':
+        frontend.get_input_ids("你好，欢迎使用飞桨框架进行深度学习研究！", merge_sentences=True)
+    else:
+        print("lang should be 'zh' here!")
+
+    N = 0
+    T = 0
+    merge_sentences = True
+    for utt_id, sentence in sentences:
+        with timer() as t:
+            if args.lang == 'zh':
+                input_ids = frontend.get_input_ids(
+                    sentence, merge_sentences=merge_sentences)
+
+                phone_ids = input_ids["phone_ids"]
+            else:
+                print("lang should be 'zh' here!")
+            # merge_sentences=True here, so we only use the first item of phone_ids
+            phone_ids = phone_ids[0].numpy()
+            mel = am_sess.run(output_names=None, input_feed={'text': phone_ids})
+            mel = mel[0]
+            wav = voc_sess.run(output_names=None, input_feed={'logmel': mel})
+
+            N += len(wav[0])
+            T += t.elapse
+            speed = len(wav[0]) / t.elapse
+            rtf = fs / speed
+        sf.write(
+            str(output_dir / (utt_id + ".wav")),
+            np.array(wav)[0],
+            samplerate=fs)
+        print(
+            f"{utt_id}, mel: {mel.shape}, wave: {len(wav[0])}, time: {t.elapse}s, Hz: {speed}, RTF: {rtf}."
+        )
+    print(f"generation speed: {N / T}Hz, RTF: {fs / (N / T) }")
+
+
+def parse_args():
+    parser = argparse.ArgumentParser(description="Inference with onnxruntime.")
+    # acoustic model
+    parser.add_argument(
+        '--am',
+        type=str,
+        default='fastspeech2_csmsc',
+        choices=[
+            'fastspeech2_csmsc',
+        ],
+        help='Choose acoustic model type of tts task.')
+    parser.add_argument(
+        "--phones_dict", type=str, default=None, help="phone vocabulary file.")
+    parser.add_argument(
+        "--tones_dict", type=str, default=None, help="tone vocabulary file.")
+
+    # voc
+    parser.add_argument(
+        '--voc',
+        type=str,
+        default='hifigan_csmsc',
+        choices=['hifigan_csmsc', 'mb_melgan_csmsc'],
+        help='Choose vocoder type of tts task.')
+    # other
+    parser.add_argument(
+        "--inference_dir", type=str, help="dir of the ONNX inference models")
+    parser.add_argument(
+        "--text",
+        type=str,
+        help="text to synthesize, a 'utt_id sentence' pair per line")
+    parser.add_argument("--output_dir", type=str, help="output dir")
+    parser.add_argument(
+        '--lang',
+        type=str,
+        default='zh',
+        help='Choose model language. zh or en')
+
+    # inference
+    parser.add_argument(
+        "--use_trt",
+        type=str2bool,
+        default=False,
+        help="Whether to use the TensorRT inference engine.", )
+
+    parser.add_argument(
+        "--device",
+        default="gpu",
+        choices=["gpu", "cpu"],
+        help="Device selected for inference.", )
+    parser.add_argument('--cpu_threads', type=int, default=1)
+
+    args, _ = parser.parse_known_args()
+    return args
+
+
+def main():
+    args = parse_args()
+
+    ort_predict(args)
+
+
+if __name__ == "__main__":
+    main()
--- a/paddlespeech/t2s/exps/speedyspeech/preprocess.py
+++ b/paddlespeech/t2s/exps/speedyspeech/preprocess.py
@ -79,6 +79,9 @@ def process_sentence(config: Dict[str, Any],
        logmel = mel_extractor.get_log_mel_fbank(wav)
        # change duration according to mel_length
        compare_duration_and_mel_length(sentences, utt_id, logmel)
+        # utt_id may be popped in compare_duration_and_mel_length
+        if utt_id not in sentences:
+            return None
        labels = sentences[utt_id][0]
        # extract phone and duration
        phones = []
--- a/paddlespeech/t2s/exps/synthesize_streaming.py
+++ b/paddlespeech/t2s/exps/synthesize_streaming.py
@ -90,6 +90,7 @@ def evaluate(args):
    output_dir = Path(args.output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    merge_sentences = True
+    get_tone_ids = False

    N = 0
    T = 0
@ -98,8 +99,6 @@ def evaluate(args):

    for utt_id, sentence in sentences:
        with timer() as t:
-            get_tone_ids = False
-
            if args.lang == 'zh':
                input_ids = frontend.get_input_ids(
                    sentence,
--- a/paddlespeech/t2s/exps/tacotron2/preprocess.py
+++ b/paddlespeech/t2s/exps/tacotron2/preprocess.py
@ -82,6 +82,9 @@ def process_sentence(config: Dict[str, Any],
        logmel = mel_extractor.get_log_mel_fbank(wav)
        # change duration according to mel_length
        compare_duration_and_mel_length(sentences, utt_id, logmel)
+        # utt_id may be popped in compare_duration_and_mel_length
+        if utt_id not in sentences:
+            return None
        phones = sentences[utt_id][0]
        durations = sentences[utt_id][1]
        num_frames = logmel.shape[0]
--- a/paddlespeech/t2s/modules/positional_encoding.py
+++ b/paddlespeech/t2s/modules/positional_encoding.py
@ -31,8 +31,9 @@ def sinusoid_position_encoding(num_positions: int,

    channel = paddle.arange(0, feature_size, 2, dtype=dtype)
    index = paddle.arange(start_pos, start_pos + num_positions, 1, dtype=dtype)
-    p = (paddle.unsqueeze(index, -1) *
-         omega) / (10000.0**(channel / float(feature_size)))
+    exponent = channel / float(feature_size)
+    denominator = paddle.to_tensor([10000.0], dtype='float32')**exponent
+    p = (paddle.unsqueeze(index, -1) * omega) / denominator
    encodings = paddle.zeros([num_positions, feature_size], dtype=dtype)
    encodings[:, 0::2] = paddle.sin(p)
    encodings[:, 1::2] = paddle.cos(p)
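
For reference, the computation in this hunk can be sketched in NumPy. This is not the Paddle implementation: the default values for `omega` and `start_pos` are assumptions, and `feature_size` is assumed to be even so the sine and cosine halves have the same width.

```python
import numpy as np


def sinusoid_position_encoding(num_positions: int,
                               feature_size: int,
                               omega: float = 1.0,
                               start_pos: int = 0) -> np.ndarray:
    """NumPy sketch of the sinusoidal encoding; feature_size must be even."""
    channel = np.arange(0, feature_size, 2, dtype=np.float32)
    index = np.arange(start_pos, start_pos + num_positions, 1, dtype=np.float32)
    # 10000 ** (2i / feature_size), one value per sin/cos channel pair
    denominator = np.float32(10000.0)**(channel / float(feature_size))
    p = (index[:, None] * omega) / denominator
    encodings = np.zeros([num_positions, feature_size], dtype=np.float32)
    encodings[:, 0::2] = np.sin(p)  # even columns hold the sine terms
    encodings[:, 1::2] = np.cos(p)  # odd columns hold the cosine terms
    return encodings


pe = sinusoid_position_encoding(num_positions=4, feature_size=6)
```

Position 0 yields 0 in every sine column and 1 in every cosine column, which is a quick sanity check for any port of this function.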
--- a/paddlespeech/vector/models/ecapa_tdnn.py
+++ b/paddlespeech/vector/models/ecapa_tdnn.py
@ -79,6 +79,20 @@ class Conv1d(nn.Layer):
            bias_attr=bias, )

    def forward(self, x):
+        """Do conv1d forward
+
+        Args:
+            x (paddle.Tensor): [N, C, L] input data,
+                                N is the batch size,
+                                C is the data dimension,
+                                L is the time length
+
+        Raises:
+            ValueError: only the "same" padding type is supported
+
+        Returns:
+            paddle.Tensor: the output of conv1d
+        """
        if self.padding == "same":
            x = self._manage_padding(x, self.kernel_size, self.dilation,
                                     self.stride)
@ -88,6 +102,20 @@ class Conv1d(nn.Layer):
        return self.conv(x)

    def _manage_padding(self, x, kernel_size: int, dilation: int, stride: int):
+        """Pad the input data
+
+        Args:
+            x (paddle.Tensor): [N, C, L] input data,
+                                N is the batch size,
+                                C is the data dimension,
+                                L is the time length
+            kernel_size (int): 1-d convolution kernel size
+            dilation (int): 1-d convolution dilation
+            stride (int): 1-d convolution stride
+
+        Returns:
+            paddle.Tensor: the padded input data
+        """
        L_in = x.shape[-1]  # Detecting input shape
        padding = self._get_padding_elem(L_in, stride, kernel_size,
                                         dilation)  # Time padding
@ -101,6 +129,17 @@ class Conv1d(nn.Layer):
                          stride: int,
                          kernel_size: int,
                          dilation: int):
+        """Calculate the padding value in same mode
+
+        Args:
+            L_in (int): the time length of the input data
+            stride (int): 1-d convolution stride
+            kernel_size (int): 1-d convolution kernel size
+            dilation (int): 1-d convolution dilation
+
+        Returns:
+            int: return the padding value in same mode
+        """
        if stride > 1:
            n_steps = math.ceil(((L_in - kernel_size * dilation) / stride) + 1)
            L_out = stride * (n_steps - 1) + kernel_size * dilation
@ -245,6 +284,13 @@ class SEBlock(nn.Layer):

 class AttentiveStatisticsPooling(nn.Layer):
    def __init__(self, channels, attention_channels=128, global_context=True):
+        """Compute the attentive statistics used for speaker verification
+           The details are in section 3.1 of https://arxiv.org/pdf/1709.01507.pdf
+        Args:
+            channels (int): input data channel or data dimension
+            attention_channels (int, optional): attention dimension. Defaults to 128.
+            global_context (bool, optional): If use the global context information. Defaults to True.
+        """
        super().__init__()

        self.eps = 1e-12