pull/1696/head
commit d3f8715b0a
@@ -1,6 +1,9 @@
 #!/bin/bash
 
 wget -c https://paddlespeech.bj.bcebos.com/vector/audio/85236145389.wav
+wget -c https://paddlespeech.bj.bcebos.com/vector/audio/123456789.wav
 
-# asr
+# vector
 paddlespeech vector --task spk --input ./85236145389.wav
+
+paddlespeech vector --task score --input "./85236145389.wav ./123456789.wav"
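For context, the new `--task score` command compares two utterances by scoring their speaker embeddings against each other. A minimal sketch of the underlying idea, assuming the embeddings have already been extracted as numpy vectors (the names here are illustrative, not PaddleSpeech API):

import numpy as np

def cosine_score(emb1: np.ndarray, emb2: np.ndarray) -> float:
    """Cosine similarity between two speaker embeddings."""
    return float(np.dot(emb1, emb2) /
                 (np.linalg.norm(emb1) * np.linalg.norm(emb2)))

# Illustrative 192-dim embeddings (emb_dim: 192 in the configs below).
emb_a = np.random.randn(192)
emb_b = np.random.randn(192)
print(cosine_score(emb_a, emb_b))  # close to 1.0 for the same speaker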
@@ -0,0 +1,62 @@
###########################################################
#                AMI DATA PREPARE SETTING                 #
###########################################################
split_type: 'full_corpus_asr'
skip_TNO: True
# Options for mic_type: 'Mix-Lapel', 'Mix-Headset', 'Array1', 'Array1-01', 'BeamformIt'
mic_type: 'Mix-Headset'
vad_type: 'oracle'
max_subseg_dur: 3.0
overlap: 1.5
# Some more exp folders (for cleaner structure).
embedding_dir: emb #!ref <save_folder>/emb
meta_data_dir: metadata #!ref <save_folder>/metadata
ref_rttm_dir: ref_rttms #!ref <save_folder>/ref_rttms
sys_rttm_dir: sys_rttms #!ref <save_folder>/sys_rttms
der_dir: DER #!ref <save_folder>/DER


###########################################################
#              FEATURE EXTRACTION SETTING                 #
###########################################################
# currently, we only support fbank
sr: 16000 # sample rate
n_mels: 80
window_size: 400 # 25 ms; at 16 kHz, 25 * 16000 / 1000 = 400 samples
hop_size: 160 # 10 ms; at 16 kHz, 10 * 16000 / 1000 = 160 samples
#left_frames: 0
#right_frames: 0
#deltas: False


###########################################################
#                     MODEL SETTING                       #
###########################################################
# currently, we only support ecapa-tdnn in the ecapa_tdnn.yaml
# if you want to use another model, please choose another configuration yaml file
seed: 1234
emb_dim: 192
batch_size: 16
model:
  input_size: 80
  channels: [1024, 1024, 1024, 1024, 3072]
  kernel_sizes: [5, 3, 3, 3, 1]
  dilations: [1, 2, 3, 4, 1]
  attention_channels: 128
  lin_neurons: 192
# Will automatically download the ECAPA-TDNN model (best).

###########################################################
#               SPECTRAL CLUSTERING SETTING               #
###########################################################
backend: 'SC' # options: 'kmeans' # Note: kmeans goes only with cos affinity
affinity: 'cos' # options: cos, nn
max_num_spkrs: 10
oracle_n_spkrs: True


###########################################################
#                 DER EVALUATION SETTING                  #
###########################################################
ignore_overlap: True
forgiveness_collar: 0.25
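The `max_subseg_dur` and `overlap` values above control how each VAD speech segment is cut into overlapping sub-segments before embedding extraction. A minimal sketch of that windowing logic, assuming segments arrive as (start, end) times in seconds; this helper is illustrative, not the library's own code:

def split_subsegments(start, end, max_dur=3.0, overlap=1.5):
    """Cut [start, end) into max_dur-second windows shifted by (max_dur - overlap)."""
    shift = max_dur - overlap  # 1.5 s shift for a 3.0 s window with 1.5 s overlap
    subsegs = []
    t = start
    while t + max_dur < end:
        subsegs.append((t, t + max_dur))
        t += shift
    subsegs.append((t, end))  # last (possibly shorter) window
    return subsegs

print(split_subsegments(0.0, 7.0))  # [(0.0, 3.0), (1.5, 4.5), (3.0, 6.0), (4.5, 7.0)]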
@@ -0,0 +1,231 @@
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import json
import os
import pickle
import sys

import numpy as np
import paddle
from paddle.io import BatchSampler
from paddle.io import DataLoader
from tqdm.contrib import tqdm
from yacs.config import CfgNode

from paddlespeech.s2t.utils.log import Log
from paddlespeech.vector.cluster.diarization import EmbeddingMeta
from paddlespeech.vector.io.batch import batch_feature_normalize
from paddlespeech.vector.io.dataset_from_json import JSONDataset
from paddlespeech.vector.models.ecapa_tdnn import EcapaTdnn
from paddlespeech.vector.modules.sid_model import SpeakerIdetification
from paddlespeech.vector.training.seeding import seed_everything

# Logger setup
logger = Log(__name__).getlog()


def prepare_subset_json(full_meta_data, rec_id, out_meta_file):
    """Prepares metadata for a given recording ID.

    Arguments
    ---------
    full_meta_data : json
        Full meta (json) containing all the recordings
    rec_id : str
        The recording ID for which meta (json) has to be prepared
    out_meta_file : str
        Path of the output meta (json) file.
    """
    subset = {}
    for key in full_meta_data:
        k = str(key)
        if k.startswith(rec_id):
            subset[key] = full_meta_data[key]

    with open(out_meta_file, mode="w") as json_f:
        json.dump(subset, json_f, indent=2)


def create_dataloader(json_file, batch_size):
    """Creates the datasets and their data processing pipelines.
    This is used for multi-mic processing.
    """
    # create datasets
    dataset = JSONDataset(
        json_file=json_file,
        feat_type='melspectrogram',
        n_mels=config.n_mels,
        window_size=config.window_size,
        hop_length=config.hop_size)

    # create dataloader
    batch_sampler = BatchSampler(dataset, batch_size=batch_size, shuffle=True)
    dataloader = DataLoader(dataset,
                            batch_sampler=batch_sampler,
                            collate_fn=lambda x: batch_feature_normalize(
                                x, mean_norm=True, std_norm=False),
                            return_list=True)

    return dataloader


def main(args, config):
    # set the training device, cpu or gpu
    paddle.set_device(args.device)
    # set the random seed
    seed_everything(config.seed)

    # stage1: build the dnn backbone model network
    ecapa_tdnn = EcapaTdnn(**config.model)

    # stage2: build the speaker verification eval instance with backbone model
    model = SpeakerIdetification(backbone=ecapa_tdnn, num_class=1)

    # stage3: load the pre-trained model
    # we get the last model from the epoch and save_interval
    args.load_checkpoint = os.path.abspath(
        os.path.expanduser(args.load_checkpoint))

    # load model checkpoint to sid model
    state_dict = paddle.load(
        os.path.join(args.load_checkpoint, 'model.pdparams'))
    model.set_state_dict(state_dict)
    logger.info(f'Checkpoint loaded from {args.load_checkpoint}')

    # set the model to eval mode
    model.eval()

    # load meta data
    meta_file = os.path.join(
        args.data_dir,
        config.meta_data_dir,
        "ami_" + args.dataset + "." + config.mic_type + ".subsegs.json", )
    with open(meta_file, "r") as f:
        full_meta = json.load(f)

    # get all the recording IDs in this dataset.
    all_keys = full_meta.keys()
    A = [word.rstrip().split("_")[0] for word in all_keys]
    all_rec_ids = list(set(A[1:]))
    all_rec_ids.sort()
    split = "AMI_" + args.dataset
    i = 1

    msg = "Extracting embeddings for the " + args.dataset + " set"
    logger.info(msg)

    if len(all_rec_ids) <= 0:
        msg = "No recording IDs found! Please check if the meta_data json file is properly generated."
        logger.error(msg)
        sys.exit()

    # extract embeddings for the different recordings in the dataset.
    for rec_id in tqdm(all_rec_ids):
        # This tag will be displayed in the log.
        tag = ("[" + str(args.dataset) + ": " + str(i) + "/" +
               str(len(all_rec_ids)) + "]")
        i = i + 1

        # log message.
        msg = "Embedding %s : %s " % (tag, rec_id)
        logger.debug(msg)

        # embedding directory.
        if not os.path.exists(
                os.path.join(args.data_dir, config.embedding_dir, split)):
            os.makedirs(
                os.path.join(args.data_dir, config.embedding_dir, split))

        # file to store embeddings.
        emb_file_name = rec_id + "." + config.mic_type + ".emb_stat.pkl"
        diary_stat_emb_file = os.path.join(args.data_dir, config.embedding_dir,
                                           split, emb_file_name)

        # prepare a metadata (json) for one recording. This is basically a subset of full_meta.
        # let's keep this meta-info in the embedding directory itself.
        json_file_name = rec_id + "." + config.mic_type + ".json"
        meta_per_rec_file = os.path.join(args.data_dir, config.embedding_dir,
                                         split, json_file_name)

        # write subset (meta for one recording) json metadata.
        prepare_subset_json(full_meta, rec_id, meta_per_rec_file)

        # prepare data loader.
        diary_set_loader = create_dataloader(meta_per_rec_file,
                                             config.batch_size)

        # extract embeddings (skip if already done).
        if not os.path.isfile(diary_stat_emb_file):
            logger.debug("Extracting deep embeddings")
            embeddings = np.empty(shape=[0, config.emb_dim], dtype=np.float64)
            segset = []

            for batch_idx, batch in enumerate(tqdm(diary_set_loader)):
                # extract the audio embedding
                ids, feats, lengths = batch['ids'], batch['feats'], batch[
                    'lengths']
                seg = [x for x in ids]
                segset = segset + seg
                emb = model.backbone(feats, lengths).squeeze(
                    -1).numpy()  # (N, emb_size, 1) -> (N, emb_size)
                embeddings = np.concatenate((embeddings, emb), axis=0)

            segset = np.array(segset, dtype="|O")
            stat_obj = EmbeddingMeta(
                segset=segset,
                stats=embeddings, )
            logger.debug("Saving Embeddings...")
            with open(diary_stat_emb_file, "wb") as output:
                pickle.dump(stat_obj, output)

        else:
            logger.debug("Skipping embedding extraction (already present).")


# Begin experiment!
if __name__ == "__main__":
    parser = argparse.ArgumentParser(__doc__)
    parser.add_argument(
        '--device',
        default="gpu",
        help="Select which device to perform diarization, defaults to gpu.")
    parser.add_argument(
        "--config", default=None, type=str, help="configuration file")
    parser.add_argument(
        "--data-dir",
        default="../save/",
        type=str,
        help="processed data directory")
    parser.add_argument(
        "--dataset",
        choices=['dev', 'eval'],
        default="dev",
        type=str,
        help="Select which dataset to extract embeddings from, defaults to dev")
    parser.add_argument(
        "--load-checkpoint",
        type=str,
        default='',
        help="Directory to load the model checkpoint from to compute embeddings.")
    args = parser.parse_args()
    config = CfgNode(new_allowed=True)
    if args.config:
        config.merge_from_file(args.config)

    config.freeze()

    main(args, config)
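Each recording's embeddings land in a per-recording pickle. A minimal sketch of inspecting one of those `.emb_stat.pkl` files afterwards, assuming the attribute names `segset` and `stats` match the `EmbeddingMeta` constructor arguments used above (inferred from this script, not verified against the library); the file path is a hypothetical example of the naming scheme:

import pickle

# Hypothetical path following the scheme above:
# <data_dir>/<embedding_dir>/AMI_dev/<rec_id>.Mix-Headset.emb_stat.pkl
with open("ES2011a.Mix-Headset.emb_stat.pkl", "rb") as f:
    stat_obj = pickle.load(f)

print(stat_obj.segset.shape)  # one segment id per sub-segment
print(stat_obj.stats.shape)   # (num_subsegments, emb_dim), e.g. (N, 192)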
@@ -1,49 +0,0 @@
#!/bin/bash

stage=1

TARGET_DIR=${MAIN_ROOT}/dataset/ami
data_folder=${TARGET_DIR}/amicorpus #e.g., /path/to/amicorpus/
manual_annot_folder=${TARGET_DIR}/ami_public_manual_1.6.2 #e.g., /path/to/ami_public_manual_1.6.2/

save_folder=${MAIN_ROOT}/examples/ami/sd0/data
ref_rttm_dir=${save_folder}/ref_rttms
meta_data_dir=${save_folder}/metadata

set=L

. ${MAIN_ROOT}/utils/parse_options.sh || exit 1;
set -u
set -o pipefail

mkdir -p ${save_folder}

if [ ${stage} -le 0 ]; then
    # Download AMI corpus. You need around 10GB of free space to get the whole data
    # The signals are too large to package in this way,
    # so you need to use the chooser to indicate which ones you wish to download
    echo "Please follow https://groups.inf.ed.ac.uk/ami/download/ to download the data."
    echo "Annotations: AMI manual annotations v1.6.2 "
    echo "Signals: "
    echo "1) Select one or more AMI meetings: the IDs please follow ./ami_split.py"
    echo "2) Select media streams: Just select Headset mix"
    exit 0;
fi

if [ ${stage} -le 1 ]; then
    echo "AMI Data preparation"

    python local/ami_prepare.py --data_folder ${data_folder} \
            --manual_annot_folder ${manual_annot_folder} \
            --save_folder ${save_folder} --ref_rttm_dir ${ref_rttm_dir} \
            --meta_data_dir ${meta_data_dir}

    if [ $? -ne 0 ]; then
        echo "Prepare AMI failed. Please check log message."
        exit 1
    fi
fi

echo "AMI data preparation done."
exit 0
@@ -0,0 +1,428 @@
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import glob
import json
import os
import pickle
import shutil
import sys

import numpy as np
from tqdm.contrib import tqdm
from yacs.config import CfgNode

from paddlespeech.s2t.utils.log import Log
from paddlespeech.vector.cluster import diarization as diar
from utils.DER import DER

# Logger setup
logger = Log(__name__).getlog()


def diarize_dataset(
        full_meta,
        split_type,
        n_lambdas,
        pval,
        save_dir,
        config,
        n_neighbors=10, ):
    """This function diarizes all the recordings in a given dataset. It computes
    embeddings and clusters them using spectral clustering (or other backends).
    The output speaker boundary file is stored in the RTTM format.
    """

    # prepare `spkr_info` only once when the oracle num of speakers is selected.
    # spkr_info is essential to obtain the number of speakers from the groundtruth.
    if config.oracle_n_spkrs is True:
        full_ref_rttm_file = os.path.join(save_dir, config.ref_rttm_dir,
                                          "fullref_ami_" + split_type + ".rttm")
        rttm = diar.read_rttm(full_ref_rttm_file)

        spkr_info = list(  # noqa F841
            filter(lambda x: x.startswith("SPKR-INFO"), rttm))

    # get all the recording IDs in this dataset.
    all_keys = full_meta.keys()
    A = [word.rstrip().split("_")[0] for word in all_keys]
    all_rec_ids = list(set(A[1:]))
    all_rec_ids.sort()
    split = "AMI_" + split_type
    i = 1

    # adding tag for the directory path.
    type_of_num_spkr = "oracle" if config.oracle_n_spkrs else "est"
    tag = (type_of_num_spkr + "_" + str(config.affinity) + "_" + config.backend)

    # make the output rttm dir
    out_rttm_dir = os.path.join(save_dir, config.sys_rttm_dir, config.mic_type,
                                split, tag)
    if not os.path.exists(out_rttm_dir):
        os.makedirs(out_rttm_dir)

    # diarizing the different recordings in the dataset.
    for rec_id in tqdm(all_rec_ids):
        # this tag will be displayed in the log.
        tag = ("[" + str(split_type) + ": " + str(i) + "/" +
               str(len(all_rec_ids)) + "]")
        i = i + 1

        # log message.
        msg = "Diarizing %s : %s " % (tag, rec_id)
        logger.debug(msg)

        # load embeddings.
        emb_file_name = rec_id + "." + config.mic_type + ".emb_stat.pkl"
        diary_stat_emb_file = os.path.join(save_dir, config.embedding_dir,
                                           split, emb_file_name)
        if not os.path.isfile(diary_stat_emb_file):
            msg = "Embedding file %s not found! Please check if the embedding file is properly generated." % (
                diary_stat_emb_file)
            logger.error(msg)
            sys.exit()
        with open(diary_stat_emb_file, "rb") as in_file:
            diary_obj = pickle.load(in_file)

        out_rttm_file = out_rttm_dir + "/" + rec_id + ".rttm"

        # processing starts from here.
        if config.oracle_n_spkrs is True:
            # oracle num of speakers.
            num_spkrs = diar.get_oracle_num_spkrs(rec_id, spkr_info)
        else:
            if config.affinity == "nn":
                # num of speakers tuned on the dev set (only for nn affinity).
                num_spkrs = n_lambdas
            else:
                # the num of speakers will be estimated using the max eigen gap for cos based affinity,
                # so we add None here and use it later on.
                num_spkrs = None

        if config.backend == "kmeans":
            diar.do_kmeans_clustering(
                diary_obj,
                out_rttm_file,
                rec_id,
                num_spkrs,
                pval, )

        if config.backend == "SC":
            # go for Spectral Clustering (SC).
            diar.do_spec_clustering(
                diary_obj,
                out_rttm_file,
                rec_id,
                num_spkrs,
                pval,
                config.affinity,
                n_neighbors, )

        # can be used for AHC later. Likewise, one can add different backends here.
        if config.backend == "AHC":
            # call AHC
            threshold = pval  # pval for AHC is nothing but the threshold.
            diar.do_AHC(diary_obj, out_rttm_file, rec_id, num_spkrs, threshold)

    # once all RTTM outputs are generated, concatenate the individual RTTM files into a single RTTM file.
    # this is not strictly needed, but it follows the standards.
    concate_rttm_file = out_rttm_dir + "/sys_output.rttm"
    logger.debug("Concatenating individual RTTM files...")
    with open(concate_rttm_file, "w") as cat_file:
        for f in glob.glob(out_rttm_dir + "/*.rttm"):
            if f == concate_rttm_file:
                continue
            with open(f, "r") as indi_rttm_file:
                shutil.copyfileobj(indi_rttm_file, cat_file)

    msg = "The system generated RTTM file for the %s set : %s" % (
        split_type, concate_rttm_file, )
    logger.debug(msg)

    return concate_rttm_file


def dev_pval_tuner(full_meta, save_dir, config):
    """Tuning the p_value for the affinity matrix.
    The p_value is used so that only p% of the values in each row are retained.
    """

    DER_list = []
    prange = np.arange(0.002, 0.015, 0.001)

    n_lambdas = None  # using it as a flag later.
    for p_v in prange:
        # Process the whole dataset for this value of p_v.
        concate_rttm_file = diarize_dataset(full_meta, "dev", n_lambdas, p_v,
                                            save_dir, config)

        ref_rttm_file = os.path.join(save_dir, config.ref_rttm_dir,
                                     "fullref_ami_dev.rttm")
        sys_rttm_file = concate_rttm_file
        [MS, FA, SER, DER_] = DER(
            ref_rttm_file,
            sys_rttm_file,
            config.ignore_overlap,
            config.forgiveness_collar, )

        DER_list.append(DER_)

        if config.oracle_n_spkrs is True and config.backend == "kmeans":
            # no need for a p_val search. Note p_val is needed for SC for both oracle and est num of speakers.
            # p_val is needed for oracle_n_spkrs=False when using the kmeans backend.
            break

    # Take the p_val that gave the minimum DER on the dev dataset.
    tuned_p_val = prange[DER_list.index(min(DER_list))]

    return tuned_p_val


def dev_ahc_threshold_tuner(full_meta, save_dir, config):
    """Tuning the threshold for the affinity matrix. This function is called when AHC is used as the backend.
    """

    DER_list = []
    prange = np.arange(0.0, 1.0, 0.1)

    n_lambdas = None  # using it as a flag later.

    # Note: p_val is the threshold in the case of AHC.
    for p_v in prange:
        # Process the whole dataset for this value of p_v.
        concate_rttm_file = diarize_dataset(full_meta, "dev", n_lambdas, p_v,
                                            save_dir, config)

        ref_rttm = os.path.join(save_dir, config.ref_rttm_dir,
                                "fullref_ami_dev.rttm")
        sys_rttm = concate_rttm_file
        [MS, FA, SER, DER_] = DER(
            ref_rttm,
            sys_rttm,
            config.ignore_overlap,
            config.forgiveness_collar, )

        DER_list.append(DER_)

        if config.oracle_n_spkrs is True:
            break  # no need for a threshold search.

    # Take the p_val that gave the minimum DER on the dev dataset.
    tuned_p_val = prange[DER_list.index(min(DER_list))]

    return tuned_p_val


def dev_nn_tuner(full_meta, save_dir, config):
    """Tuning n_neighbors on the dev set. Assuming an oracle num of speakers.
    This is used when nn based affinity is selected.
    """

    DER_list = []
    pval = None

    # For now assuming an oracle num of speakers.
    n_lambdas = 4

    for nn in range(5, 15):

        # Process the whole dataset for this value of n_neighbors.
        concate_rttm_file = diarize_dataset(full_meta, "dev", n_lambdas, pval,
                                            save_dir, config, nn)

        ref_rttm = os.path.join(save_dir, config.ref_rttm_dir,
                                "fullref_ami_dev.rttm")
        sys_rttm = concate_rttm_file
        [MS, FA, SER, DER_] = DER(
            ref_rttm,
            sys_rttm,
            config.ignore_overlap,
            config.forgiveness_collar, )

        DER_list.append([nn, DER_])

        if config.oracle_n_spkrs is True and config.backend == "kmeans":
            break

    DER_list.sort(key=lambda x: x[1])
    tuned_nn = DER_list[0]

    return tuned_nn[0]


def dev_tuner(full_meta, save_dir, config):
    """Tuning n_components on the dev set. Used for nn based affinity matrices.
    Note: This is a very basic tuning for nn based affinity.
    This is work in progress until we find a better way.
    """

    DER_list = []
    pval = None
    for n_lambdas in range(1, config.max_num_spkrs + 1):

        # Process the whole dataset for this value of n_lambdas.
        concate_rttm_file = diarize_dataset(full_meta, "dev", n_lambdas, pval,
                                            save_dir, config)

        ref_rttm = os.path.join(save_dir, config.ref_rttm_dir,
                                "fullref_ami_dev.rttm")
        sys_rttm = concate_rttm_file
        [MS, FA, SER, DER_] = DER(
            ref_rttm,
            sys_rttm,
            config.ignore_overlap,
            config.forgiveness_collar, )

        DER_list.append(DER_)

    # Take the n_lambdas with the minimum DER.
    tuned_n_lambdas = DER_list.index(min(DER_list)) + 1

    return tuned_n_lambdas


def main(args, config):
    # AMI Dev Set: Tune hyperparams on the dev set.
    # Read the embedding file for the dev set generated during the embedding computation
    dev_meta_file = os.path.join(
        args.data_dir,
        config.meta_data_dir,
        "ami_dev." + config.mic_type + ".subsegs.json", )
    with open(dev_meta_file, "r") as f:
        meta_dev = json.load(f)

    full_meta = meta_dev

    # Processing starts from here.
    # The following lines select options for the different backends and affinity matrices
    # and find the best hyperparameter values using the dev set.
    ref_rttm_file = os.path.join(args.data_dir, config.ref_rttm_dir,
                                 "fullref_ami_dev.rttm")
    best_nn = None
    if config.affinity == "nn":
        logger.info("Tuning for nn (Multiple iterations over AMI Dev set)")
        best_nn = dev_nn_tuner(full_meta, args.data_dir, config)

    n_lambdas = None
    best_pval = None

    if config.affinity == "cos" and (config.backend == "SC" or
                                     config.backend == "kmeans"):
        # oracle num_spkrs or not, it doesn't matter for the kmeans and SC backends
        # cos: Tune for the best pval for SC/kmeans (for an unknown num of spkrs)
        logger.info(
            "Tuning for p-value for SC (Multiple iterations over AMI Dev set)")
        best_pval = dev_pval_tuner(full_meta, args.data_dir, config)

    elif config.backend == "AHC":
        logger.info("Tuning for threshold-value for AHC")
        best_threshold = dev_ahc_threshold_tuner(full_meta, args.data_dir,
                                                 config)
        best_pval = best_threshold
    else:
        # NN for an unknown num of speakers (can be used in the future)
        if config.oracle_n_spkrs is False:
            # nn: Tune the number of eigen components (to be updated later)
            logger.info(
                "Tuning for number of eigen components for NN (Multiple iterations over AMI Dev set)"
            )
            # dev_tuner is used for tuning the num of components in NN. Can be used in the future.
            n_lambdas = dev_tuner(full_meta, args.data_dir, config)

    # load the 'dev' and 'eval' metadata files.
    full_meta_dev = full_meta  # the current full_meta is for 'dev'
    eval_meta_file = os.path.join(
        args.data_dir,
        config.meta_data_dir,
        "ami_eval." + config.mic_type + ".subsegs.json", )
    with open(eval_meta_file, "r") as f:
        full_meta_eval = json.load(f)

    # tag to be appended to the final output DER files. Writing DERs for individual files.
    type_of_num_spkr = "oracle" if config.oracle_n_spkrs else "est"
    tag = (
        type_of_num_spkr + "_" + str(config.affinity) + "." + config.mic_type)

    # perform the final diarization on 'dev' and 'eval' with the best hyperparams.
    final_DERs = {}
    out_der_dir = os.path.join(args.data_dir, config.der_dir)
    if not os.path.exists(out_der_dir):
        os.makedirs(out_der_dir)

    for split_type in ["dev", "eval"]:
        if split_type == "dev":
            full_meta = full_meta_dev
        else:
            full_meta = full_meta_eval

        # performing diarization.
        msg = "Diarizing using best hyperparams: " + split_type + " set"
        logger.info(msg)
        out_boundaries = diarize_dataset(
            full_meta,
            split_type,
            n_lambdas=n_lambdas,
            pval=best_pval,
            n_neighbors=best_nn,
            save_dir=args.data_dir,
            config=config)

        # computing DER.
        msg = "Computing DERs for " + split_type + " set"
        logger.info(msg)
        ref_rttm = os.path.join(args.data_dir, config.ref_rttm_dir,
                                "fullref_ami_" + split_type + ".rttm")
        sys_rttm = out_boundaries
        [MS, FA, SER, DER_vals] = DER(
            ref_rttm,
            sys_rttm,
            config.ignore_overlap,
            config.forgiveness_collar,
            individual_file_scores=True, )

        # writing DER values to a file. Append the tag.
        der_file_name = split_type + "_DER_" + tag
        out_der_file = os.path.join(out_der_dir, der_file_name)
        msg = "Writing DER file to: " + out_der_file
        logger.info(msg)
        diar.write_ders_file(ref_rttm, DER_vals, out_der_file)

        msg = ("AMI " + split_type + " set DER = %s %%\n" %
               (str(round(DER_vals[-1], 2))))
        logger.info(msg)
        final_DERs[split_type] = round(DER_vals[-1], 2)

    # final print of the DERs
    msg = (
        "Final Diarization Error Rate (%%) on AMI corpus: Dev = %s %% | Eval = %s %%\n"
        % (str(final_DERs["dev"]), str(final_DERs["eval"])))
    logger.info(msg)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(__doc__)
    parser.add_argument(
        "--config", default=None, type=str, help="configuration file")
    parser.add_argument(
        "--data-dir",
        default="../data/",
        type=str,
        help="processed data directory")
    args = parser.parse_args()
    config = CfgNode(new_allowed=True)
    if args.config:
        config.merge_from_file(args.config)

    config.freeze()

    main(args, config)
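dev_pval_tuner above sweeps the p-value that controls how aggressively the affinity matrix is pruned: per its docstring, only p% of the values in each row are retained. A minimal numpy sketch of that row-wise pruning idea, written from the description rather than taken from the library:

import numpy as np

def prune_affinity_rows(affinity: np.ndarray, pval: float) -> np.ndarray:
    """Keep only the largest ceil(pval * n) entries in each row; zero the rest."""
    n = affinity.shape[0]
    n_keep = max(1, int(np.ceil(pval * n)))
    pruned = np.zeros_like(affinity)
    for i, row in enumerate(affinity):
        keep_idx = np.argsort(row)[-n_keep:]  # indices of the largest entries
        pruned[i, keep_idx] = row[keep_idx]
    return pruned

A = np.random.rand(6, 6)
print(prune_affinity_rows(A, pval=0.01))  # with n=6, keeps 1 entry per row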
@@ -0,0 +1,49 @@
#!/bin/bash

stage=0
set=L

. ${MAIN_ROOT}/utils/parse_options.sh || exit 1;
set -o pipefail

data_folder=$1
manual_annot_folder=$2
save_folder=$3
pretrained_model_dir=$4
conf_path=$5
device=$6

ref_rttm_dir=${save_folder}/ref_rttms
meta_data_dir=${save_folder}/metadata

if [ ${stage} -le 0 ]; then
    echo "AMI Data preparation"
    python local/ami_prepare.py --data_folder ${data_folder} \
            --manual_annot_folder ${manual_annot_folder} \
            --save_folder ${save_folder} --ref_rttm_dir ${ref_rttm_dir} \
            --meta_data_dir ${meta_data_dir}

    if [ $? -ne 0 ]; then
        echo "Prepare AMI failed. Please check log message."
        exit 1
    fi
    echo "AMI data preparation done."
fi

if [ ${stage} -le 1 ]; then
    # extract embeddings for the dev and eval datasets
    for name in dev eval; do
        python local/compute_embdding.py --config ${conf_path} \
                --data-dir ${save_folder} \
                --device ${device} \
                --dataset ${name} \
                --load-checkpoint ${pretrained_model_dir}
    done
fi

if [ ${stage} -le 2 ]; then
    # tune hyperparams on the dev set
    # perform the final diarization on 'dev' and 'eval' with the best hyperparams
    python local/experiment.py --config ${conf_path} \
            --data-dir ${save_folder}
fi
@@ -1,14 +1,46 @@
 #!/bin/bash
 
-. path.sh || exit 1;
+. ./path.sh || exit 1;
 set -e
 
-stage=1
+stage=0
+
+#TARGET_DIR=${MAIN_ROOT}/dataset/ami
+TARGET_DIR=/home/dataset/AMI
+data_folder=${TARGET_DIR}/amicorpus #e.g., /path/to/amicorpus/
+manual_annot_folder=${TARGET_DIR}/ami_public_manual_1.6.2 #e.g., /path/to/ami_public_manual_1.6.2/
+
+save_folder=./save
+pretrained_model_dir=${save_folder}/sv0_ecapa_tdnn_voxceleb12_ckpt_0_1_1/model
+conf_path=conf/ecapa_tdnn.yaml
+device=gpu
 
 . ${MAIN_ROOT}/utils/parse_options.sh || exit 1;
 
-if [ ${stage} -le 1 ]; then
-    # prepare data
-    bash ./local/data.sh || exit -1
-fi
+if [ $stage -le 0 ]; then
+    # Prepare data
+    # Download the AMI corpus; you need around 10GB of free space to get the whole data
+    # The signals are too large to package in this way,
+    # so you need to use the chooser to indicate which ones you wish to download
+    echo "Please follow https://groups.inf.ed.ac.uk/ami/download/ to download the data."
+    echo "Annotations: AMI manual annotations v1.6.2"
+    echo "Signals: "
+    echo "1) Select one or more AMI meetings: for the IDs please follow ./ami_split.py"
+    echo "2) Select media streams: just select Headset mix"
+fi
+
+if [ $stage -le 1 ]; then
+    # Download the pretrained model
+    wget https://paddlespeech.bj.bcebos.com/vector/voxceleb/sv0_ecapa_tdnn_voxceleb12_ckpt_0_1_1.tar.gz
+    mkdir -p ${save_folder} && tar -xvf sv0_ecapa_tdnn_voxceleb12_ckpt_0_1_1.tar.gz -C ${save_folder}
+    rm -rf sv0_ecapa_tdnn_voxceleb12_ckpt_0_1_1.tar.gz
+    echo "Downloaded the pretrained ECAPA-TDNN model to: "${pretrained_model_dir}
+fi
+
+if [ $stage -le 2 ]; then
+    # Tune hyperparams on the dev set and perform the final diarization on dev and eval with the best hyperparams.
+    echo ${data_folder} ${manual_annot_folder} ${save_folder} ${pretrained_model_dir} ${conf_path}
+    bash ./local/process.sh ${data_folder} ${manual_annot_folder} \
+        ${save_folder} ${pretrained_model_dir} ${conf_path} ${device} || exit 1
+fi
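The pipeline's headline metric is the DER that local/experiment.py prints at the end of stage 2. A small sketch of the standard DER definition that those [MS, FA, SER, DER] numbers follow; this is illustrative arithmetic, not the project's utils.DER scorer:

def diarization_error_rate(missed, false_alarm, speaker_error, total_speech):
    """DER = (missed + false alarm + speaker confusion) / total scored speech time, in %."""
    return 100.0 * (missed + false_alarm + speaker_error) / total_speech

# e.g. 30 s missed, 20 s false alarm, 50 s confused out of 1000 s of speech
print(diarization_error_rate(30.0, 20.0, 50.0, 1000.0))  # 10.0 (%)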
@@ -0,0 +1,31 @@
train_output_path=$1

stage=0
stop_stage=0

# only support default_fastspeech2 + hifigan/mb_melgan now!

# synthesize from metadata
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
    python3 ${BIN_DIR}/../ort_predict.py \
        --inference_dir=${train_output_path}/inference_onnx \
        --am=fastspeech2_csmsc \
        --voc=hifigan_csmsc \
        --test_metadata=dump/test/norm/metadata.jsonl \
        --output_dir=${train_output_path}/onnx_infer_out \
        --device=cpu \
        --cpu_threads=2
fi

# e2e, synthesize from text
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
    python3 ${BIN_DIR}/../ort_predict_e2e.py \
        --inference_dir=${train_output_path}/inference_onnx \
        --am=fastspeech2_csmsc \
        --voc=hifigan_csmsc \
        --output_dir=${train_output_path}/onnx_infer_out_e2e \
        --text=${BIN_DIR}/../csmsc_test.txt \
        --phones_dict=dump/phone_id_map.txt \
        --device=cpu \
        --cpu_threads=2
fi
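Both scripts above run the exported ONNX models on CPU with two threads. A minimal sketch of what such CPU-bound ONNX inference looks like with onnxruntime; the model path, input name, and dummy input here are placeholders, not the actual exported FastSpeech2 graph:

import numpy as np
import onnxruntime as ort

opts = ort.SessionOptions()
opts.intra_op_num_threads = 2  # mirrors --cpu_threads=2

sess = ort.InferenceSession("fastspeech2_csmsc.onnx", sess_options=opts,
                            providers=["CPUExecutionProvider"])

input_name = sess.get_inputs()[0].name  # placeholder: depends on the export
phone_ids = np.array([1, 2, 3, 4], dtype=np.int64)  # dummy phone id sequence
outputs = sess.run(None, {input_name: phone_ids})
print([o.shape for o in outputs])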
@@ -0,0 +1,22 @@
train_output_path=$1
model_dir=$2
output_dir=$3
model=$4

enable_dev_version=True

model_name=${model%_*}
echo model_name: ${model_name}

if [ ${model_name} = 'mb_melgan' ]; then
    enable_dev_version=False
fi

mkdir -p ${train_output_path}/${output_dir}

paddle2onnx \
    --model_dir ${train_output_path}/${model_dir} \
    --model_filename ${model}.pdmodel \
    --params_filename ${model}.pdiparams \
    --save_file ${train_output_path}/${output_dir}/${model}.onnx \
    --enable_dev_version ${enable_dev_version}
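`model_name=${model%_*}` is bash suffix removal: it strips the last `_<part>`, so `mb_melgan_csmsc` becomes `mb_melgan`, which is what the vocoder check keys on. After the export, the resulting graph can be sanity-checked with the onnx package; a minimal sketch with an illustrative file path:

import onnx

# Path follows the script's --save_file layout; adjust to your run.
model = onnx.load("exp/default/inference_onnx/fastspeech2_csmsc.onnx")
onnx.checker.check_model(model)  # raises if the graph is structurally invalid
print(model.graph.name, len(model.graph.node), "nodes")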
@@ -0,0 +1,9 @@
# iwslt2012

## Ernie

|          | COMMA    | PERIOD   | QUESTION | OVERALL  |
|:--------:|:--------:|:--------:|:--------:|:--------:|
|Precision |0.510955  |0.526462  |0.820755  |0.619391  |
|Recall    |0.517433  |0.564179  |0.861386  |0.647666  |
|F1        |0.514173  |0.544669  |0.840580  |0.633141  |
@@ -0,0 +1,53 @@
###########################################
#                  Data                   #
###########################################
augment: True
batch_size: 16
num_workers: 2
num_speakers: 1211 # 1211 vox1, 5994 vox2, 7205 vox1+2, test speakers: 41
shuffle: True
skip_prep: False
split_ratio: 0.9
chunk_duration: 3.0 # seconds
random_chunk: True
verification_file: data/vox1/veri_test2.txt

###########################################################
#              FEATURE EXTRACTION SETTING                 #
###########################################################
# currently, we only support fbank
sr: 16000 # sample rate
n_mels: 80
window_size: 400 # 25 ms; at 16 kHz, 25 * 16000 / 1000 = 400 samples
hop_size: 160 # 10 ms; at 16 kHz, 10 * 16000 / 1000 = 160 samples

###########################################################
#                     MODEL SETTING                       #
###########################################################
# currently, we only support ecapa-tdnn in the ecapa_tdnn.yaml
# if you want to use another model, please choose another configuration yaml file
model:
  input_size: 80
  channels: [512, 512, 512, 512, 1536]
  kernel_sizes: [5, 3, 3, 3, 1]
  dilations: [1, 2, 3, 4, 1]
  attention_channels: 128
  lin_neurons: 192

###########################################
#                Training                 #
###########################################
seed: 1986 # following the speechbrain configuration
epochs: 100
save_interval: 10
log_interval: 10
learning_rate: 1e-8


###########################################
#                 Testing                 #
###########################################
global_embedding_norm: True
embedding_mean_norm: True
embedding_std_norm: False
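The Testing block enables a global mean normalization of embeddings before scoring. A minimal numpy sketch of the mean (and optional std) normalization that those three flags suggest; this illustrates the idea and is not the trainer's exact implementation:

import numpy as np

def normalize_embeddings(embs, mean_norm=True, std_norm=False):
    """Normalize a batch of embeddings with a shared mean/std, as in the Testing flags."""
    if mean_norm:
        embs = embs - embs.mean(axis=0, keepdims=True)
    if std_norm:
        embs = embs / (embs.std(axis=0, keepdims=True) + 1e-8)
    return embs

embs = np.random.randn(8, 192)  # lin_neurons: 192 gives 192-dim embeddings
print(normalize_embeddings(embs).mean(axis=0)[:3])  # ~0 after mean norm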
@@ -0,0 +1,167 @@
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Convert the PaddleSpeech jsonline format data to csv format data for the voxceleb experiment.
Currently, the Speaker Identification training process uses the csv format.
"""
import argparse
import csv
import os
from typing import List

import tqdm
from yacs.config import CfgNode

from paddleaudio import load as load_audio
from paddlespeech.s2t.utils.log import Log
from paddlespeech.vector.utils.vector_utils import get_chunks

logger = Log(__name__).getlog()


def get_chunks_list(wav_file: str,
                    split_chunks: bool,
                    base_path: str,
                    chunk_duration: float=3.0) -> List[List[str]]:
    """Get a single audio file's segment info list.

    Args:
        wav_file (str): the wav audio file to get the segment info list for
        split_chunks (bool): audio split flag
        base_path (str): the audio base path
        chunk_duration (float): the chunk duration.
            If split_chunks is set, we split the audio into multi-chunk segments.
    """
    waveform, sr = load_audio(wav_file)
    audio_id = wav_file.split("/rir_noise/")[-1].split(".")[0]
    audio_duration = waveform.shape[0] / sr

    ret = []
    if split_chunks and audio_duration > chunk_duration:  # Split into pieces of chunk_duration seconds.
        uniq_chunks_list = get_chunks(chunk_duration, audio_id, audio_duration)

        for idx, chunk in enumerate(uniq_chunks_list):
            s, e = chunk.split("_")[-2:]  # Timestamps of start and end
            start_sample = int(float(s) * sr)
            end_sample = int(float(e) * sr)

            # currently, all vector csv data uses one representation:
            # id, duration, wav, start, stop, label
            # in rirs noise, every label name is 'noise';
            # the label is string type and we will convert it to integer type in training
            ret.append([
                chunk, audio_duration, wav_file, start_sample, end_sample,
                "noise"
            ])
    else:  # Keep whole audio.
        ret.append(
            [audio_id, audio_duration, wav_file, 0, waveform.shape[0], "noise"])
    return ret


def generate_csv(wav_files,
                 output_file: str,
                 base_path: str,
                 split_chunks: bool=True):
    """Prepare the csv file according to the wav files

    Args:
        wav_files (list): all the audio files to prepare the csv file for
        output_file (str): the output csv file
        base_path (str): the audio base path
        split_chunks (bool): audio split flag
    """
    logger.info(f'Generating csv: {output_file}')
    header = ["utt_id", "duration", "wav", "start", "stop", "label"]
    csv_lines = []
    for item in tqdm.tqdm(wav_files):
        csv_lines.extend(
            get_chunks_list(
                item, base_path=base_path, split_chunks=split_chunks))

    if not os.path.exists(os.path.dirname(output_file)):
        os.makedirs(os.path.dirname(output_file))

    with open(output_file, mode="w") as csv_f:
        csv_writer = csv.writer(
            csv_f, delimiter=",", quotechar='"', quoting=csv.QUOTE_MINIMAL)
        csv_writer.writerow(header)
        for line in csv_lines:
            csv_writer.writerow(line)


def prepare_data(args, config):
    """Convert the jsonline format to csv format

    Args:
        args (argparse.Namespace): script args
        config (CfgNode): yaml configuration content
    """
    # if the external config sets the skip_prep flag, we do nothing
    if config.skip_prep:
        return

    base_path = args.noise_dir
    wav_path = os.path.join(base_path, "RIRS_NOISES")
    logger.info(f"base path: {base_path}")
    logger.info(f"wav path: {wav_path}")
    rir_list = os.path.join(wav_path, "real_rirs_isotropic_noises", "rir_list")
    rir_files = []
    with open(rir_list, 'r') as f:
        for line in f.readlines():
            rir_file = line.strip().split(' ')[-1]
            rir_files.append(os.path.join(base_path, rir_file))

    noise_list = os.path.join(wav_path, "pointsource_noises", "noise_list")
    noise_files = []
    with open(noise_list, 'r') as f:
        for line in f.readlines():
            noise_file = line.strip().split(' ')[-1]
            noise_files.append(os.path.join(base_path, noise_file))

    csv_path = os.path.join(args.data_dir, 'csv')
    logger.info(f"csv path: {csv_path}")
    generate_csv(
        rir_files, os.path.join(csv_path, 'rir.csv'), base_path=base_path)
    generate_csv(
        noise_files, os.path.join(csv_path, 'noise.csv'), base_path=base_path)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument(
        "--noise_dir",
        default=None,
        required=True,
        help="The noise dataset directory.")
    parser.add_argument(
        "--data_dir",
        default=None,
        required=True,
        help="The target directory that stores the csv files")
    parser.add_argument(
        "--config",
        default=None,
        required=True,
        type=str,
        help="configuration file")
    args = parser.parse_args()

    # parse the yaml config file
    config = CfgNode(new_allowed=True)
    if args.config:
        config.merge_from_file(args.config)

    # prepare the csv files
    prepare_data(args, config)
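The code above relies on `get_chunks(chunk_duration, audio_id, audio_duration)` returning chunk ids whose last two `_`-separated fields are the start and end times, since that is how `chunk.split("_")[-2:]` consumes them. A sketch of what such a helper plausibly looks like, inferred from that usage rather than copied from paddlespeech.vector.utils.vector_utils:

def get_chunks(chunk_duration: float, audio_id: str, audio_duration: float):
    """Yield ids like '<audio_id>_<start>_<end>' covering the audio in fixed-length chunks."""
    num_chunks = int(audio_duration / chunk_duration)  # all chunks the same length
    return [
        f"{audio_id}_{i * chunk_duration}_{(i + 1) * chunk_duration}"
        for i in range(num_chunks)
    ]

print(get_chunks(3.0, "noise-free-sound-0232", 10.2))
# ['noise-free-sound-0232_0.0_3.0', 'noise-free-sound-0232_3.0_6.0', 'noise-free-sound-0232_6.0_9.0']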
@ -0,0 +1,251 @@
|
|||||||
|
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
"""
|
||||||
|
Convert the PaddleSpeech jsonline format data to csv format data in voxceleb experiment.
|
||||||
|
Currently, Speaker Identificaton Training process use csv format.
|
||||||
|
"""
|
||||||
|
import argparse
|
||||||
|
import csv
|
||||||
|
import json
|
||||||
|
import os
|
||||||
|
import random
|
||||||
|
|
||||||
|
import tqdm
|
||||||
|
from yacs.config import CfgNode
|
||||||
|
|
||||||
|
from paddleaudio import load as load_audio
|
||||||
|
from paddlespeech.s2t.utils.log import Log
|
||||||
|
from paddlespeech.vector.utils.vector_utils import get_chunks
|
||||||
|
|
||||||
|
logger = Log(__name__).getlog()
|
||||||
|
|
||||||
|
|
||||||
|
def prepare_csv(wav_files, output_file, config, split_chunks=True):
|
||||||
|
"""Prepare the csv file according the wav files
|
||||||
|
|
||||||
|
Args:
|
||||||
|
wav_files (list): all the audio list to prepare the csv file
|
||||||
|
output_file (str): the output csv file
|
||||||
|
config (CfgNode): yaml configuration content
|
||||||
|
split_chunks (bool, optional): audio split flag. Defaults to True.
|
||||||
|
"""
|
||||||
|
if not os.path.exists(os.path.dirname(output_file)):
|
||||||
|
os.makedirs(os.path.dirname(output_file))
|
||||||
|
csv_lines = []
|
||||||
|
header = ["utt_id", "duration", "wav", "start", "stop", "label"]
|
||||||
|
# voxceleb meta info for each training utterance segment
|
||||||
|
# we extract a segment from a utterance to train
|
||||||
|
# and the segment' period is between start and stop time point in the original wav file
|
||||||
|
# each field in the meta info means as follows:
|
||||||
|
# utt_id: the utterance segment name, which is uniq in training dataset
|
||||||
|
# duration: the total utterance time
|
||||||
|
# wav: utterance file path, which should be absoulute path
|
||||||
|
# start: start point in the original wav file sample point range
|
||||||
|
# stop: stop point in the original wav file sample point range
|
||||||
|
# label: the utterance segment's label name,
|
||||||
|
# which is speaker name in speaker verification domain
|
||||||
|
for item in tqdm.tqdm(wav_files, total=len(wav_files)):
|
||||||
|
item = json.loads(item.strip())
|
||||||
|
audio_id = item['utt'].replace(".wav",
|
||||||
|
"") # we remove the wav suffix name
|
||||||
|
audio_duration = item['feat_shape'][0]
|
||||||
|
wav_file = item['feat']
|
||||||
|
label = audio_id.split('-')[
|
||||||
|
0] # speaker name in speaker verification domain
|
||||||
|
waveform, sr = load_audio(wav_file)
|
||||||
|
if split_chunks:
|
||||||
|
uniq_chunks_list = get_chunks(config.chunk_duration, audio_id,
|
||||||
|
audio_duration)
|
||||||
|
for chunk in uniq_chunks_list:
|
||||||
|
s, e = chunk.split("_")[-2:] # Timestamps of start and end
|
||||||
|
start_sample = int(float(s) * sr)
|
||||||
|
end_sample = int(float(e) * sr)
|
||||||
|
# id, duration, wav, start, stop, label
|
||||||
|
# in vector, the label in speaker id
|
||||||
|
csv_lines.append([
|
||||||
|
chunk, audio_duration, wav_file, start_sample, end_sample,
|
||||||
|
label
|
||||||
|
])
|
||||||
|
else:
|
||||||
|
csv_lines.append([
|
||||||
|
audio_id, audio_duration, wav_file, 0, waveform.shape[0], label
|
||||||
|
])
|
||||||
|
|
||||||
|
with open(output_file, mode="w") as csv_f:
|
||||||
|
csv_writer = csv.writer(
|
||||||
|
csv_f, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
|
||||||
|
csv_writer.writerow(header)
|
||||||
|
for line in csv_lines:
|
||||||
|
csv_writer.writerow(line)
|
||||||
|
|
||||||
|
|
||||||
|
def get_enroll_test_list(dataset_list, verification_file):
|
||||||
|
"""Get the enroll and test utterance list from all the voxceleb1 test utterance dataset.
|
||||||
|
Generally, we get the enroll and test utterances from the verfification file.
|
||||||
|
The verification file format as follows:
|
||||||
|
target/nontarget enroll-utt test-utt,
|
||||||
|
we set 0 as nontarget and 1 as target, eg:
|
||||||
|
0 a.wav b.wav
|
||||||
|
1 a.wav a.wav
|
||||||
|
|
||||||
|
Args:
|
||||||
|
dataset_list (list): all the dataset to get the test utterances
|
||||||
|
verification_file (str): voxceleb1 trial file
|
||||||
|
"""
|
||||||
|
logger.info(f"verification file: {verification_file}")
|
||||||
|
enroll_audios = set()
|
||||||
|
test_audios = set()
|
||||||
|
with open(verification_file, 'r') as f:
|
||||||
|
for line in f:
|
||||||
|
_, enroll_file, test_file = line.strip().split(' ')
|
||||||
|
enroll_audios.add('-'.join(enroll_file.split('/')))
|
||||||
|
test_audios.add('-'.join(test_file.split('/')))
|
||||||
|
|
||||||
|
enroll_files = []
|
||||||
|
test_files = []
|
||||||
|
for dataset in dataset_list:
|
||||||
|
with open(dataset, 'r') as f:
|
||||||
|
for line in f:
|
||||||
|
# audio_id may be in enroll and test at the same time
|
||||||
|
# eg: 1 a.wav a.wav
|
||||||
|
# the audio a.wav is enroll and test file at the same time
|
||||||
|
audio_id = json.loads(line.strip())['utt']
|
||||||
|
if audio_id in enroll_audios:
|
||||||
|
enroll_files.append(line)
|
||||||
|
if audio_id in test_audios:
|
||||||
|
test_files.append(line)
|
||||||
|
|
||||||
|
enroll_files = sorted(enroll_files)
|
||||||
|
test_files = sorted(test_files)
|
||||||
|
|
||||||
|
return enroll_files, test_files
|
||||||
|
|
||||||
|
|
||||||
|
def get_train_dev_list(dataset_list, target_dir, split_ratio):
|
||||||
|
"""Get the train and dev utterance list from all the training utterance dataset.
|
||||||
|
Generally, we use the split_ratio as the train dataset ratio,
|
||||||
|
    and the remaining utterances (ratio is 1 - split_ratio) make up the dev dataset

    Args:
        dataset_list (list): all the datasets from which to collect the utterances
        target_dir (str): the target train and dev directory,
                          we will create the csv directory to store the {train,dev}.csv files
        split_ratio (float): train dataset ratio in the overall utterance list
    """
    logger.info("start to get train and dev utt list")
    if not os.path.exists(os.path.join(target_dir, "meta")):
        os.makedirs(os.path.join(target_dir, "meta"))

    audio_files = []
    speakers = set()
    for dataset in dataset_list:
        with open(dataset, 'r') as f:
            for line in f:
                # the label is the speaker name
                label_name = json.loads(line.strip())['utt2spk']
                speakers.add(label_name)
                audio_files.append(line.strip())
    speakers = sorted(speakers)
    logger.info(f"we get {len(speakers)} speakers from all the train datasets")

    with open(os.path.join(target_dir, "meta", "label2id.txt"), 'w') as f:
        for label_id, label_name in enumerate(speakers):
            f.write(f'{label_name} {label_id}\n')
    logger.info(
        f'we store the speakers to {os.path.join(target_dir, "meta", "label2id.txt")}'
    )

    # the split_ratio is for the train dataset
    # the remainder is for the dev dataset
    split_idx = int(split_ratio * len(audio_files))
    audio_files = sorted(audio_files)
    random.shuffle(audio_files)
    train_files, dev_files = audio_files[:split_idx], audio_files[split_idx:]
    logger.info(
        f"we get train utterances: {len(train_files)}, dev utterances: {len(dev_files)}"
    )
    return train_files, dev_files


def prepare_data(args, config):
    """Convert the jsonline format to csv format

    Args:
        args (argparse.Namespace): scripts args
        config (CfgNode): yaml configuration content
    """
    # stage 0: set the random seed
    random.seed(config.seed)

    # if the external config sets the skip_prep flag, we do nothing
    if config.skip_prep:
        return

    # stage 1: prepare the enroll and test csv files
    # and generate the speaker-to-label map file label2id.txt
    logger.info("start to prepare the data csv file")
    enroll_files, test_files = get_enroll_test_list(
        [args.test], verification_file=config.verification_file)
    prepare_csv(
        enroll_files,
        os.path.join(args.target_dir, "csv", "enroll.csv"),
        config,
        split_chunks=False)
    prepare_csv(
        test_files,
        os.path.join(args.target_dir, "csv", "test.csv"),
        config,
        split_chunks=False)

    # stage 2: prepare the train and dev csv files
    # we take config.split_ratio of the utterances as the train dataset
    # and the remainder as the dev dataset
    logger.info("start to prepare the data csv file")
    train_files, dev_files = get_train_dev_list(
        args.train, target_dir=args.target_dir, split_ratio=config.split_ratio)
    prepare_csv(train_files,
                os.path.join(args.target_dir, "csv", "train.csv"), config)
    prepare_csv(dev_files,
                os.path.join(args.target_dir, "csv", "dev.csv"), config)


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument(
        "--train",
        required=True,
        nargs='+',
        help="The jsonline files list for train.")
    parser.add_argument(
        "--test", required=True, help="The jsonline file for test.")
    parser.add_argument(
        "--target_dir",
        default=None,
        required=True,
        help="The target directory that stores the csv files and meta files.")
    parser.add_argument(
        "--config",
        default=None,
        required=True,
        type=str,
        help="configuration file")
    args = parser.parse_args()

    # parse the yaml config file
    config = CfgNode(new_allowed=True)
    if args.config:
        config.merge_from_file(args.config)

    # prepare the csv files from the jsonline files
    prepare_data(args, config)
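# Note: a minimal sketch of the expected jsonline input, inferred from the
# 'utt2spk' access in get_train_dev_list; the other field names here are
# assumptions for illustration only:
#   {"utt": "id10001-00001", "feat": "/data/wav/id10001-00001.wav", "utt2spk": "id10001"}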
@ -0,0 +1,30 @@
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import numpy as np


def pcm16to32(audio: np.ndarray) -> np.ndarray:
    """Convert pcm int16 audio to float32.

    Args:
        audio (np.ndarray): Waveform with dtype of int16.

    Returns:
        np.ndarray: Waveform with dtype of float32, scaled to [-1.0, 1.0).
    """
    if audio.dtype == np.int16:
        audio = audio.astype("float32")
        bits = np.iinfo(np.int16).bits
        audio = audio / (2**(bits - 1))
    return audio
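# A quick sanity check of the scaling (assumed usage, not part of the file):
#   >>> pcm16to32(np.array([0, 16384, -32768], dtype=np.int16))
#   array([ 0. ,  0.5, -1. ], dtype=float32)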
@ -0,0 +1,46 @@
# This is the parameter configuration file for PaddleSpeech Serving.

#################################################################################
#                             SERVER SETTING                                    #
#################################################################################
host: 127.0.0.1
port: 8092

# The task format in the engine_list is: <speech task>_<engine type>
# task choices = ['asr_online', 'tts_online']
# protocol = ['websocket', 'http'] (only one can be selected).
protocol: 'http'
engine_list: ['tts_online']


#################################################################################
#                                ENGINE CONFIG                                  #
#################################################################################

################################### TTS #########################################
################### speech task: tts; engine_type: online #######################
tts_online:
    # am (acoustic model) choices=['fastspeech2_csmsc']
    am: 'fastspeech2_csmsc'
    am_config:
    am_ckpt:
    am_stat:
    phones_dict:
    tones_dict:
    speaker_dict:
    spk_id: 0

    # voc (vocoder) choices=['mb_melgan_csmsc']
    voc: 'mb_melgan_csmsc'
    voc_config:
    voc_ckpt:
    voc_stat:

    # others
    lang: 'zh'
    device:  # set 'gpu:id' or 'cpu'
    am_block: 42
    am_pad: 12
    voc_block: 14
    voc_pad: 14
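# To start the server with this config (assumed CLI invocation; the config
# file path here is illustrative):
#   paddlespeech_server start --config_file ./conf/tts_online_application.yaml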
@ -0,0 +1,13 @@
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
@ -0,0 +1,220 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import base64
import time

import numpy as np
import paddle

from paddlespeech.cli.log import logger
from paddlespeech.cli.tts.infer import TTSExecutor
from paddlespeech.server.engine.base_engine import BaseEngine
from paddlespeech.server.utils.audio_process import float2pcm
from paddlespeech.server.utils.util import get_chunks

__all__ = ['TTSEngine']


class TTSServerExecutor(TTSExecutor):
    def __init__(self):
        super().__init__()

    @paddle.no_grad()
    def infer(
            self,
            text: str,
            lang: str='zh',
            am: str='fastspeech2_csmsc',
            spk_id: int=0,
            am_block: int=42,
            am_pad: int=12,
            voc_block: int=14,
            voc_pad: int=14, ):
        """
        Model inference; streams the synthesized sub-waveforms.
        """
        am_name = am[:am.rindex('_')]
        am_dataset = am[am.rindex('_') + 1:]
        get_tone_ids = False
        merge_sentences = False
        frontend_st = time.time()
        if lang == 'zh':
            input_ids = self.frontend.get_input_ids(
                text,
                merge_sentences=merge_sentences,
                get_tone_ids=get_tone_ids)
            phone_ids = input_ids["phone_ids"]
            if get_tone_ids:
                tone_ids = input_ids["tone_ids"]
        elif lang == 'en':
            input_ids = self.frontend.get_input_ids(
                text, merge_sentences=merge_sentences)
            phone_ids = input_ids["phone_ids"]
        else:
            print("lang should be in {'zh', 'en'}!")
        self.frontend_time = time.time() - frontend_st

        for i in range(len(phone_ids)):
            am_st = time.time()
            part_phone_ids = phone_ids[i]
            # am
            if am_name == 'speedyspeech':
                part_tone_ids = tone_ids[i]
                mel = self.am_inference(part_phone_ids, part_tone_ids)
            # fastspeech2
            else:
                # multi speaker
                if am_dataset in {"aishell3", "vctk"}:
                    mel = self.am_inference(
                        part_phone_ids, spk_id=paddle.to_tensor(spk_id))
                else:
                    mel = self.am_inference(part_phone_ids)
            am_et = time.time()

            # voc streaming
            voc_upsample = self.voc_config.n_shift
            mel_chunks = get_chunks(mel, voc_block, voc_pad, "voc")
            chunk_num = len(mel_chunks)
            voc_st = time.time()
            for i, mel_chunk in enumerate(mel_chunks):
                sub_wav = self.voc_inference(mel_chunk)
                front_pad = min(i * voc_block, voc_pad)

                if i == 0:
                    sub_wav = sub_wav[:voc_block * voc_upsample]
                elif i == chunk_num - 1:
                    sub_wav = sub_wav[front_pad * voc_upsample:]
                else:
                    sub_wav = sub_wav[front_pad * voc_upsample:(
                        front_pad + voc_block) * voc_upsample]

                yield sub_wav


class TTSEngine(BaseEngine):
    """TTS server engine

    Args:
        metaclass: Defaults to Singleton.
    """

    def __init__(self, name=None):
        """Initialize the TTS server engine
        """
        super(TTSEngine, self).__init__()

    def init(self, config: dict) -> bool:
        self.executor = TTSServerExecutor()
        self.config = config
        assert "fastspeech2_csmsc" in config.am and (
            config.voc == "hifigan_csmsc-zh" or config.voc == "mb_melgan_csmsc"
        ), 'Please check config: am supports fastspeech2, voc supports hifigan_csmsc-zh or mb_melgan_csmsc.'
        try:
            if self.config.device:
                self.device = self.config.device
            else:
                self.device = paddle.get_device()
            paddle.set_device(self.device)
        except Exception as e:
            logger.error(
                "Set device failed, please check if the device is already in use and the parameter 'device' in the yaml file"
            )
            logger.error("Initialize TTS server engine Failed on device: %s." %
                         (self.device))
            return False

        try:
            self.executor._init_from_path(
                am=self.config.am,
                am_config=self.config.am_config,
                am_ckpt=self.config.am_ckpt,
                am_stat=self.config.am_stat,
                phones_dict=self.config.phones_dict,
                tones_dict=self.config.tones_dict,
                speaker_dict=self.config.speaker_dict,
                voc=self.config.voc,
                voc_config=self.config.voc_config,
                voc_ckpt=self.config.voc_ckpt,
                voc_stat=self.config.voc_stat,
                lang=self.config.lang)
        except Exception as e:
            logger.error("Failed to get model related files.")
            logger.error("Initialize TTS server engine Failed on device: %s." %
                         (self.device))
            return False

        self.am_block = self.config.am_block
        self.am_pad = self.config.am_pad
        self.voc_block = self.config.voc_block
        self.voc_pad = self.config.voc_pad

        logger.info("Initialize TTS server engine successfully on device: %s."
                    % (self.device))
        return True

    def preprocess(self, text_base64: str=None, text_bytes: bytes=None):
        # convert bytes to text
        if text_base64:
            text_bytes = base64.b64decode(text_base64)  # base64 to bytes
        text = text_bytes.decode('utf-8')  # bytes to text

        return text

    def run(self,
            sentence: str,
            spk_id: int=0,
            speed: float=1.0,
            volume: float=1.0,
            sample_rate: int=0,
            save_path: str=None):
        """ run includes inference and postprocessing.

        Args:
            sentence (str): text to be synthesized
            spk_id (int, optional): speaker id for multi-speaker speech synthesis. Defaults to 0.
            speed (float, optional): speed. Defaults to 1.0.
            volume (float, optional): volume. Defaults to 1.0.
            sample_rate (int, optional): target sample rate for the synthesized audio,
                0 means the same as the model sampling rate. Defaults to 0.
            save_path (str, optional): the save path of the synthesized audio.
                None means do not save the audio. Defaults to None.

        Returns:
            wav_base64: the base64 encoding of the synthesized audio.
        """

        lang = self.config.lang
        wav_list = []

        for wav in self.executor.infer(
                text=sentence,
                lang=lang,
                am=self.config.am,
                spk_id=spk_id,
                am_block=self.am_block,
                am_pad=self.am_pad,
                voc_block=self.voc_block,
                voc_pad=self.voc_pad):
            # wav type: <class 'numpy.ndarray'> float32, convert to pcm (base64)
            wav = float2pcm(wav)  # float32 to int16
            wav_bytes = wav.tobytes()  # to bytes
            wav_base64 = base64.b64encode(wav_bytes).decode('utf8')  # to base64
            wav_list.append(wav)

            yield wav_base64

        wav_all = np.concatenate(wav_list, axis=0)
        logger.info("The duration of the audio is: {} s".format(
            len(wav_all) / self.executor.am_config.fs))
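# Worked example of the chunk trimming in TTSServerExecutor.infer above
# (illustrative numbers; n_shift depends on the vocoder config): with
# voc_block=14, voc_pad=14 and voc_upsample=n_shift=300, chunk 0 keeps
# samples [0 : 14*300), a middle chunk i keeps
# [front_pad*300 : (front_pad+14)*300) where front_pad=min(i*14, 14),
# and the last chunk keeps everything from front_pad*300 onward, so the
# concatenated sub_wavs reconstruct the waveform without duplicated padding.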
@ -0,0 +1,100 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import base64
import json
import os
import time

import requests

from paddlespeech.server.utils.audio_process import pcm2wav


def save_audio(buffer, audio_path) -> bool:
    if audio_path.endswith("pcm"):
        with open(audio_path, "wb") as f:
            f.write(buffer)
    elif audio_path.endswith("wav"):
        with open("./tmp.pcm", "wb") as f:
            f.write(buffer)
        pcm2wav("./tmp.pcm", audio_path, channels=1, bits=16, sample_rate=24000)
        os.remove("./tmp.pcm")
    else:
        print("Only pcm and wav formats are supported for saving audio.")
        return False

    return True


def test(args):
    params = {
        "text": args.text,
        "spk_id": args.spk_id,
        "speed": args.speed,
        "volume": args.volume,
        "sample_rate": args.sample_rate,
        "save_path": ''
    }

    buffer = b''
    flag = 1
    url = "http://" + str(args.server) + ":" + str(
        args.port) + "/paddlespeech/streaming/tts"
    st = time.time()
    html = requests.post(url, json.dumps(params), stream=True)
    for chunk in html.iter_content(chunk_size=1024):
        chunk = base64.b64decode(chunk)  # bytes
        if flag:
            first_response = time.time() - st
            print(f"first-packet latency: {first_response} s")
            flag = 0
        buffer += chunk

    final_response = time.time() - st
    duration = len(buffer) / 2.0 / 24000

    print(f"final-packet latency: {final_response} s")
    print(f"audio duration: {duration} s")
    print(f"RTF: {final_response / duration}")

    if args.save_path is not None:
        if save_audio(buffer, args.save_path):
            print("audio saved to:", args.save_path)


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--text',
        type=str,
        default="您好,欢迎使用语音合成服务。",
        help='A sentence to be synthesized')
    parser.add_argument('--spk_id', type=int, default=0, help='Speaker id')
    parser.add_argument('--speed', type=float, default=1.0, help='Audio speed')
    parser.add_argument(
        '--volume', type=float, default=1.0, help='Audio volume')
    parser.add_argument(
        '--sample_rate',
        type=int,
        default=0,
        help='Sampling rate, the default is the same as the model')
    parser.add_argument(
        "--server", type=str, help="server ip", default="127.0.0.1")
    parser.add_argument("--port", type=int, help="server port", default=8092)
    parser.add_argument(
        "--save_path", type=str, help="save audio path", default=None)

    args = parser.parse_args()
    test(args)
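# Assumed invocation of this client (the script name is illustrative):
#   python tts_http_client.py --text "您好" --server 127.0.0.1 --port 8092 \
#       --save_path ./out.wav
# Note the duration math above: the stream is 16-bit mono PCM at 24000 Hz,
# so seconds = bytes / 2 / 24000.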
@ -0,0 +1,112 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import base64
import json
import threading
import time

import pyaudio
import requests

mutex = threading.Lock()
buffer = b''
p = pyaudio.PyAudio()
stream = p.open(
    format=p.get_format_from_width(2), channels=1, rate=24000, output=True)
max_fail = 50


def play_audio():
    global stream
    global buffer
    global max_fail
    while True:
        if not buffer:
            max_fail -= 1
            time.sleep(0.05)
            if max_fail < 0:
                break
        mutex.acquire()
        stream.write(buffer)
        buffer = b''
        mutex.release()


def test(args):
    global mutex
    global buffer
    params = {
        "text": args.text,
        "spk_id": args.spk_id,
        "speed": args.speed,
        "volume": args.volume,
        "sample_rate": args.sample_rate,
        "save_path": ''
    }

    all_bytes = 0.0
    t = threading.Thread(target=play_audio)
    flag = 1
    url = "http://" + str(args.server) + ":" + str(
        args.port) + "/paddlespeech/streaming/tts"
    st = time.time()
    html = requests.post(url, json.dumps(params), stream=True)
    for chunk in html.iter_content(chunk_size=1024):
        mutex.acquire()
        chunk = base64.b64decode(chunk)  # bytes
        buffer += chunk
        mutex.release()
        if flag:
            first_response = time.time() - st
            print(f"first-packet latency: {first_response} s")
            flag = 0
            t.start()
        all_bytes += len(chunk)

    final_response = time.time() - st
    duration = all_bytes / 2 / 24000

    print(f"final-packet latency: {final_response} s")
    print(f"audio duration: {duration} s")
    print(f"RTF: {final_response / duration}")

    t.join()
    stream.stop_stream()
    stream.close()
    p.terminate()


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--text',
        type=str,
        default="您好,欢迎使用语音合成服务。",
        help='A sentence to be synthesized')
    parser.add_argument('--spk_id', type=int, default=0, help='Speaker id')
    parser.add_argument('--speed', type=float, default=1.0, help='Audio speed')
    parser.add_argument(
        '--volume', type=float, default=1.0, help='Audio volume')
    parser.add_argument(
        '--sample_rate',
        type=int,
        default=0,
        help='Sampling rate, the default is the same as the model')
    parser.add_argument(
        "--server", type=str, help="server ip", default="127.0.0.1")
    parser.add_argument("--port", type=int, help="server port", default=8092)

    args = parser.parse_args()
    test(args)
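# Design note on the playback loop above: the receiving loop appends
# base64-decoded PCM to `buffer` under `mutex`, while the play_audio thread
# drains it into the PyAudio stream; playback stops after max_fail empty
# polls (50 * 0.05 s, i.e. about 2.5 s of accumulated silence).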
@ -0,0 +1,126 @@
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import _thread as thread
import argparse
import base64
import json
import ssl
import time

import websocket

flag = 1
st = 0.0
all_bytes = b''


class WsParam(object):
    # initialization
    def __init__(self, text, server="127.0.0.1", port=8090):
        self.server = server
        self.port = port
        self.url = "ws://" + self.server + ":" + str(self.port) + "/ws/tts"
        self.text = text

    # generate the url
    def create_url(self):
        return self.url


def on_message(ws, message):
    global flag
    global st
    global all_bytes

    try:
        message = json.loads(message)
        audio = message["audio"]
        audio = base64.b64decode(audio)  # bytes
        status = message["status"]
        all_bytes += audio

        if status == 0:
            print("created successfully.")
        elif status == 1:
            if flag:
                print(f"first-packet latency: {time.time() - st} s")
                flag = 0
        elif status == 2:
            final_response = time.time() - st
            duration = len(all_bytes) / 2.0 / 24000
            print(f"final-packet latency: {final_response} s")
            print(f"audio duration: {duration} s")
            print(f"RTF: {final_response / duration}")
            with open("./out.pcm", "wb") as f:
                f.write(all_bytes)
            print("ws is closed")
            ws.close()
        else:
            print("infer error")

    except Exception as e:
        print("received a message, but parsing raised an exception:", e)


# handle websocket errors
def on_error(ws, error):
    print("### error:", error)


# handle websocket close
def on_close(ws):
    print("### closed ###")


# handle websocket connection established
def on_open(ws):
    def run(*args):
        global st
        text_base64 = str(
            base64.b64encode((wsParam.text).encode('utf-8')), "UTF8")
        d = {"text": text_base64}
        d = json.dumps(d)
        print("Start sending text data")
        st = time.time()
        ws.send(d)

    thread.start_new_thread(run, ())


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--text",
        type=str,
        help="A sentence to be synthesized",
        default="您好,欢迎使用语音合成服务。")
    parser.add_argument(
        "--server", type=str, help="server ip", default="127.0.0.1")
    parser.add_argument("--port", type=int, help="server port", default=8092)
    args = parser.parse_args()

    print("***************************************")
    print("Server ip: ", args.server)
    print("Server port: ", args.port)
    print("Sentence to be synthesized: ", args.text)
    print("***************************************")

    wsParam = WsParam(text=args.text, server=args.server, port=args.port)

    websocket.enableTrace(False)
    wsUrl = wsParam.create_url()
    ws = websocket.WebSocketApp(
        wsUrl, on_message=on_message, on_error=on_error, on_close=on_close)
    ws.on_open = on_open
    ws.run_forever(sslopt={"cert_reqs": ssl.CERT_NONE})
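# Message protocol used by this client, as read off the handlers above: the
# request is {"text": "<base64 utf-8 text>"} and each server reply is
# {"status": <int>, "audio": "<base64 pcm>"}, where status 0 means the
# connection was set up, 1 carries an audio chunk, and 2 marks the end of
# the stream.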
@ -0,0 +1,160 @@
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import _thread as thread
import argparse
import base64
import json
import ssl
import threading
import time

import pyaudio
import websocket

mutex = threading.Lock()
buffer = b''
p = pyaudio.PyAudio()
stream = p.open(
    format=p.get_format_from_width(2), channels=1, rate=24000, output=True)
flag = 1
st = 0.0
all_bytes = 0.0


class WsParam(object):
    # initialization
    def __init__(self, text, server="127.0.0.1", port=8090):
        self.server = server
        self.port = port
        self.url = "ws://" + self.server + ":" + str(self.port) + "/ws/tts"
        self.text = text

    # generate the url
    def create_url(self):
        return self.url


def play_audio():
    global stream
    global buffer
    while True:
        time.sleep(0.05)
        if not buffer:  # buffer is empty
            break
        mutex.acquire()
        stream.write(buffer)
        buffer = b''
        mutex.release()


t = threading.Thread(target=play_audio)


def on_message(ws, message):
    global flag
    global t
    global buffer
    global st
    global all_bytes

    try:
        message = json.loads(message)
        audio = message["audio"]
        audio = base64.b64decode(audio)  # bytes
        status = message["status"]
        all_bytes += len(audio)

        if status == 0:
            print("created successfully.")
        elif status == 1:
            mutex.acquire()
            buffer += audio
            mutex.release()
            if flag:
                print(f"first-packet latency: {time.time() - st} s")
                flag = 0
                print("Start playing audio")
                t.start()
        elif status == 2:
            final_response = time.time() - st
            duration = all_bytes / 2 / 24000
            print(f"final-packet latency: {final_response} s")
            print(f"audio duration: {duration} s")
            print(f"RTF: {final_response / duration}")
            print("ws is closed")
            ws.close()
        else:
            print("infer error")

    except Exception as e:
        print("received a message, but parsing raised an exception:", e)


# handle websocket errors
def on_error(ws, error):
    print("### error:", error)


# handle websocket close
def on_close(ws):
    print("### closed ###")


# handle websocket connection established
def on_open(ws):
    def run(*args):
        global st
        text_base64 = str(
            base64.b64encode((wsParam.text).encode('utf-8')), "UTF8")
        d = {"text": text_base64}
        d = json.dumps(d)
        print("Start sending text data")
        st = time.time()
        ws.send(d)

    thread.start_new_thread(run, ())


if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--text",
        type=str,
        help="A sentence to be synthesized",
        default="您好,欢迎使用语音合成服务。")
    parser.add_argument(
        "--server", type=str, help="server ip", default="127.0.0.1")
    parser.add_argument("--port", type=int, help="server port", default=8092)
    args = parser.parse_args()

    print("***************************************")
    print("Server ip: ", args.server)
    print("Server port: ", args.port)
    print("Sentence to be synthesized: ", args.text)
    print("***************************************")

    wsParam = WsParam(text=args.text, server=args.server, port=args.port)

    websocket.enableTrace(False)
    wsUrl = wsParam.create_url()
    ws = websocket.WebSocketApp(
        wsUrl, on_message=on_message, on_error=on_error, on_close=on_close)
    ws.on_open = on_open
    ws.run_forever(sslopt={"cert_reqs": ssl.CERT_NONE})

    t.join()
    print("End of playing audio")
    stream.stop_stream()
    stream.close()
    p.terminate()
@ -0,0 +1,62 @@
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import json

from fastapi import APIRouter
from fastapi import WebSocket
from fastapi import WebSocketDisconnect
from starlette.websockets import WebSocketState as WebSocketState

from paddlespeech.cli.log import logger
from paddlespeech.server.engine.engine_pool import get_engine_pool

router = APIRouter()


@router.websocket('/ws/tts')
async def websocket_endpoint(websocket: WebSocket):
    await websocket.accept()

    try:
        # careful here, changed the source code from starlette.websockets
        assert websocket.application_state == WebSocketState.CONNECTED
        message = await websocket.receive()
        websocket._raise_on_disconnect(message)

        # get the engine
        engine_pool = get_engine_pool()
        tts_engine = engine_pool['tts']

        # get the message and decode it to text
        message = json.loads(message["text"])
        text_base64 = message["text"]
        sentence = tts_engine.preprocess(text_base64=text_base64)

        # run
        wav_generator = tts_engine.run(sentence)

        while True:
            try:
                tts_results = next(wav_generator)
                resp = {"status": 1, "audio": tts_results}
                await websocket.send_json(resp)
                logger.info("streaming audio...")
            except StopIteration as e:
                resp = {"status": 2, "audio": ''}
                await websocket.send_json(resp)
                logger.info("Completed the transmission of the audio stream")
                break

    except WebSocketDisconnect:
        pass
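# Note on the loop above: tts_engine.run() is a generator that yields one
# base64 chunk per vocoder block, so the endpoint forwards chunks with
# status 1 until StopIteration, then sends a final status-2 frame that the
# clients use to close the connection.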
@ -0,0 +1,156 @@
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
from pathlib import Path

import jsonlines
import numpy as np
import onnxruntime as ort
import soundfile as sf
from timer import timer

from paddlespeech.t2s.exps.syn_utils import get_test_dataset
from paddlespeech.t2s.utils import str2bool


def get_sess(args, field='am'):
    full_name = ''
    if field == 'am':
        full_name = args.am
    elif field == 'voc':
        full_name = args.voc
    model_dir = str(Path(args.inference_dir) / (full_name + ".onnx"))
    sess_options = ort.SessionOptions()
    sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
    sess_options.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL

    if args.device == "gpu":
        # fastspeech2/mb_melgan can't use trt now!
        if args.use_trt:
            providers = ['TensorrtExecutionProvider']
        else:
            providers = ['CUDAExecutionProvider']
    elif args.device == "cpu":
        providers = ['CPUExecutionProvider']
    sess_options.intra_op_num_threads = args.cpu_threads
    sess = ort.InferenceSession(
        model_dir, providers=providers, sess_options=sess_options)
    return sess


def ort_predict(args):
    # construct dataset for evaluation
    with jsonlines.open(args.test_metadata, 'r') as reader:
        test_metadata = list(reader)
    am_name = args.am[:args.am.rindex('_')]
    am_dataset = args.am[args.am.rindex('_') + 1:]
    test_dataset = get_test_dataset(args, test_metadata, am_name, am_dataset)

    output_dir = Path(args.output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    fs = 24000 if am_dataset != 'ljspeech' else 22050

    # am
    am_sess = get_sess(args, field='am')

    # vocoder
    voc_sess = get_sess(args, field='voc')

    # am warmup
    for T in [27, 38, 54]:
        data = np.random.randint(1, 266, size=(T, ))
        am_sess.run(None, {"text": data})

    # voc warmup
    for T in [227, 308, 544]:
        data = np.random.rand(T, 80).astype("float32")
        voc_sess.run(None, {"logmel": data})
    print("warm up done!")

    N = 0
    T = 0
    for example in test_dataset:
        utt_id = example['utt_id']
        phone_ids = example["text"]
        with timer() as t:
            mel = am_sess.run(output_names=None, input_feed={'text': phone_ids})
            mel = mel[0]
            wav = voc_sess.run(output_names=None, input_feed={'logmel': mel})

        N += len(wav[0])
        T += t.elapse
        speed = len(wav[0]) / t.elapse
        rtf = fs / speed
        sf.write(
            str(output_dir / (utt_id + ".wav")),
            np.array(wav)[0],
            samplerate=fs)
        print(
            f"{utt_id}, mel: {mel.shape}, wave: {len(wav[0])}, time: {t.elapse}s, Hz: {speed}, RTF: {rtf}."
        )
    print(f"generation speed: {N / T}Hz, RTF: {fs / (N / T) }")


def parse_args():
    parser = argparse.ArgumentParser(description="Inference with onnxruntime.")
    # acoustic model
    parser.add_argument(
        '--am',
        type=str,
        default='fastspeech2_csmsc',
        choices=[
            'fastspeech2_csmsc',
        ],
        help='Choose acoustic model type of tts task.')

    # voc
    parser.add_argument(
        '--voc',
        type=str,
        default='hifigan_csmsc',
        choices=['hifigan_csmsc', 'mb_melgan_csmsc'],
        help='Choose vocoder type of tts task.')
    # other
    parser.add_argument(
        "--inference_dir", type=str, help="dir to save inference models")
    parser.add_argument("--test_metadata", type=str, help="test metadata.")
    parser.add_argument("--output_dir", type=str, help="output dir")

    # inference
    parser.add_argument(
        "--use_trt",
        type=str2bool,
        default=False,
        help="Whether to use the inference engine TensorRT.", )

    parser.add_argument(
        "--device",
        default="gpu",
        choices=["gpu", "cpu"],
        help="Device selected for inference.", )
    parser.add_argument('--cpu_threads', type=int, default=1)

    args, _ = parser.parse_known_args()
    return args


def main():
    args = parse_args()

    ort_predict(args)


if __name__ == "__main__":
    main()
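# Worked example of the RTF bookkeeping above (illustrative numbers): if a
# 48000-sample wave at fs=24000 Hz (2 s of audio) takes t.elapse=0.5 s, then
# speed = 48000 / 0.5 = 96000 Hz and rtf = 24000 / 96000 = 0.25, i.e. the
# pipeline synthesizes four times faster than real time.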
@ -0,0 +1,183 @@
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
from pathlib import Path

import numpy as np
import onnxruntime as ort
import soundfile as sf
from timer import timer

from paddlespeech.t2s.exps.syn_utils import get_frontend
from paddlespeech.t2s.exps.syn_utils import get_sentences
from paddlespeech.t2s.utils import str2bool


def get_sess(args, field='am'):
    full_name = ''
    if field == 'am':
        full_name = args.am
    elif field == 'voc':
        full_name = args.voc
    model_dir = str(Path(args.inference_dir) / (full_name + ".onnx"))
    sess_options = ort.SessionOptions()
    sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
    sess_options.execution_mode = ort.ExecutionMode.ORT_SEQUENTIAL

    if args.device == "gpu":
        # fastspeech2/mb_melgan can't use trt now!
        if args.use_trt:
            providers = ['TensorrtExecutionProvider']
        else:
            providers = ['CUDAExecutionProvider']
    elif args.device == "cpu":
        providers = ['CPUExecutionProvider']
    sess_options.intra_op_num_threads = args.cpu_threads
    sess = ort.InferenceSession(
        model_dir, providers=providers, sess_options=sess_options)
    return sess


def ort_predict(args):

    # frontend
    frontend = get_frontend(args)

    output_dir = Path(args.output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    sentences = get_sentences(args)

    am_name = args.am[:args.am.rindex('_')]
    am_dataset = args.am[args.am.rindex('_') + 1:]
    fs = 24000 if am_dataset != 'ljspeech' else 22050

    # am
    am_sess = get_sess(args, field='am')

    # vocoder
    voc_sess = get_sess(args, field='voc')

    # am warmup
    for T in [27, 38, 54]:
        data = np.random.randint(1, 266, size=(T, ))
        am_sess.run(None, {"text": data})

    # voc warmup
    for T in [227, 308, 544]:
        data = np.random.rand(T, 80).astype("float32")
        voc_sess.run(None, {"logmel": data})
    print("warm up done!")

    # frontend warmup
    # loading the frontend model costs 0.5+ seconds
    if args.lang == 'zh':
        frontend.get_input_ids("你好,欢迎使用飞桨框架进行深度学习研究!", merge_sentences=True)
    else:
        print("lang should be 'zh' here!")

    N = 0
    T = 0
    merge_sentences = True
    for utt_id, sentence in sentences:
        with timer() as t:
            if args.lang == 'zh':
                input_ids = frontend.get_input_ids(
                    sentence, merge_sentences=merge_sentences)

                phone_ids = input_ids["phone_ids"]
            else:
                print("lang should be 'zh' here!")
            # merge_sentences=True here, so we only use the first item of phone_ids
            phone_ids = phone_ids[0].numpy()
            mel = am_sess.run(output_names=None, input_feed={'text': phone_ids})
            mel = mel[0]
            wav = voc_sess.run(output_names=None, input_feed={'logmel': mel})

        N += len(wav[0])
        T += t.elapse
        speed = len(wav[0]) / t.elapse
        rtf = fs / speed
        sf.write(
            str(output_dir / (utt_id + ".wav")),
            np.array(wav)[0],
            samplerate=fs)
        print(
            f"{utt_id}, mel: {mel.shape}, wave: {len(wav[0])}, time: {t.elapse}s, Hz: {speed}, RTF: {rtf}."
        )
    print(f"generation speed: {N / T}Hz, RTF: {fs / (N / T) }")


def parse_args():
    parser = argparse.ArgumentParser(description="Inference with onnxruntime.")
    # acoustic model
    parser.add_argument(
        '--am',
        type=str,
        default='fastspeech2_csmsc',
        choices=[
            'fastspeech2_csmsc',
        ],
        help='Choose acoustic model type of tts task.')
    parser.add_argument(
        "--phones_dict", type=str, default=None, help="phone vocabulary file.")
    parser.add_argument(
        "--tones_dict", type=str, default=None, help="tone vocabulary file.")

    # voc
    parser.add_argument(
        '--voc',
        type=str,
        default='hifigan_csmsc',
        choices=['hifigan_csmsc', 'mb_melgan_csmsc'],
        help='Choose vocoder type of tts task.')
    # other
    parser.add_argument(
        "--inference_dir", type=str, help="dir to save inference models")
    parser.add_argument(
        "--text",
        type=str,
        help="text to synthesize, a 'utt_id sentence' pair per line")
    parser.add_argument("--output_dir", type=str, help="output dir")
    parser.add_argument(
        '--lang',
        type=str,
        default='zh',
        help='Choose model language. zh or en')

    # inference
    parser.add_argument(
        "--use_trt",
        type=str2bool,
        default=False,
        help="Whether to use the inference engine TensorRT.", )

    parser.add_argument(
        "--device",
        default="gpu",
        choices=["gpu", "cpu"],
        help="Device selected for inference.", )
    parser.add_argument('--cpu_threads', type=int, default=1)

    args, _ = parser.parse_known_args()
    return args


def main():
    args = parse_args()

    ort_predict(args)


if __name__ == "__main__":
    main()
@ -0,0 +1,192 @@
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import random
from dataclasses import dataclass
from dataclasses import fields

from paddle.io import Dataset

from paddleaudio import load as load_audio
from paddleaudio.compliance.librosa import melspectrogram
from paddlespeech.s2t.utils.log import Log
logger = Log(__name__).getlog()

# the audio meta info in the vector CSVDataset
# utt_id: the utterance segment name
# duration: utterance segment time
# wav: utterance file path
# start: start point in the original wav file
# stop: stop point in the original wav file
# label: the utterance segment's label id


@dataclass
class meta_info:
    """the audio meta info in the vector CSVDataset

    Args:
        utt_id (str): the utterance segment name
        duration (float): utterance segment time
        wav (str): utterance file path
        start (int): start point in the original wav file
        stop (int): stop point in the original wav file
        label (str): the utterance segment's label id
    """
    utt_id: str
    duration: float
    wav: str
    start: int
    stop: int
    label: str


# feature types supported by the csv dataset
# raw: returns the pcm data sample points
# melspectrogram: fbank feature
feat_funcs = {
    'raw': None,
    'melspectrogram': melspectrogram,
}


class CSVDataset(Dataset):
    def __init__(self,
                 csv_path,
                 label2id_path=None,
                 config=None,
                 random_chunk=True,
                 feat_type: str="raw",
                 n_train_snts: int=-1,
                 **kwargs):
        """Implement the CSV Dataset

        Args:
            csv_path (str): csv dataset file path
            label2id_path (str): the utterance-label-to-integer-id map file path
            config (CfgNode): yaml config
            feat_type (str): dataset feature type. if it is raw, it returns pcm data.
            n_train_snts (int): select n_train_snts samples from the dataset.
                                if n_train_snts = -1, the dataset loads all the samples.
                                Default value is -1.
            kwargs : feature type args
        """
        super().__init__()
        self.csv_path = csv_path
        self.label2id_path = label2id_path
        self.config = config
        self.random_chunk = random_chunk
        self.feat_type = feat_type
        self.n_train_snts = n_train_snts
        self.feat_config = kwargs
        self.id2label = {}
        self.label2id = {}
        self.data = self.load_data_csv()
        self.load_speaker_to_label()

    def load_data_csv(self):
        """Load the csv dataset content and store it in the data property.
        The csv dataset's format has six fields, that is, audio_id or utt_id,
        audio duration, wav file path, segment start point, segment stop point
        and utterance label.
        Note that in the training period, the utterance label must have a map
        to an integer id in label2id_path.

        Returns:
            list: the csv data with meta_info type
        """
        data = []

        with open(self.csv_path, 'r') as rf:
            for line in rf.readlines()[1:]:
                audio_id, duration, wav, start, stop, spk_id = line.strip(
                ).split(',')
                data.append(
                    meta_info(audio_id,
                              float(duration), wav,
                              int(start), int(stop), spk_id))
        if self.n_train_snts > 0:
            sample_num = min(self.n_train_snts, len(data))
            data = data[0:sample_num]

        return data

    def load_speaker_to_label(self):
        """Load the utterance-label map content.
        In the vector domain, we call the utterance label the speaker label.
        The speaker label is a real speaker label in the speaker verification domain,
        and a language label in language identification.
        """
        if not self.label2id_path:
            logger.warning("No speaker id to label file")
            return

        with open(self.label2id_path, 'r') as f:
            for line in f.readlines():
                label_name, label_id = line.strip().split(' ')
                self.label2id[label_name] = int(label_id)
                self.id2label[int(label_id)] = label_name

    def convert_to_record(self, idx: int):
        """convert the dataset sample to a training record of the CSV Dataset

        Args:
            idx (int) : the requested index in the whole dataset
        """
        sample = self.data[idx]

        record = {}
        # To show all fields in a namedtuple: `type(sample)._fields`
        for field in fields(sample):
            record[field.name] = getattr(sample, field.name)

        waveform, sr = load_audio(record['wav'])

        # randomly select a chunk of audio samples from the audio
        if self.config and self.config.random_chunk:
            num_wav_samples = waveform.shape[0]
            num_chunk_samples = int(self.config.chunk_duration * sr)
            start = random.randint(0, num_wav_samples - num_chunk_samples - 1)
            stop = start + num_chunk_samples
        else:
            start = record['start']
            stop = record['stop']

        # we only return the waveform as feat
        waveform = waveform[start:stop]

        # all available feature types are in feat_funcs
        assert self.feat_type in feat_funcs.keys(), \
            f"Unknown feat_type: {self.feat_type}, it must be one in {list(feat_funcs.keys())}"
        feat_func = feat_funcs[self.feat_type]
        feat = feat_func(
            waveform, sr=sr, **self.feat_config) if feat_func else waveform

        record.update({'feat': feat})
        if self.label2id:
            record.update({'label': self.label2id[record['label']]})

        return record

    def __getitem__(self, idx):
        """Return the sample at the specific index

        Args:
            idx (int) : the requested index in the whole dataset
        """
        return self.convert_to_record(idx)

    def __len__(self):
        """Return the dataset length

        Returns:
            int: the number of samples in the dataset
        """
        return len(self.data)
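# A minimal sketch of the csv layout parsed by load_data_csv above (the
# header line is skipped; the values are illustrative):
#   utt_id,duration,wav,start,stop,label
#   id10001-00001,3.0,/data/wav/id10001-00001.wav,0,48000,id10001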
@ -0,0 +1,116 @@
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import json
from dataclasses import dataclass
from dataclasses import fields

from paddle.io import Dataset

from paddleaudio import load as load_audio
from paddleaudio.compliance.librosa import melspectrogram
from paddleaudio.compliance.librosa import mfcc


@dataclass
class meta_info:
    """the audio meta info in the vector JSONDataset

    Args:
        id (str): the segment name
        duration (float): segment time
        wav (str): wav file path
        start (int): start point in the original wav file
        stop (int): stop point in the original wav file
        record_id (str): the record id
    """
    id: str
    duration: float
    wav: str
    start: int
    stop: int
    record_id: str


# feature types supported by the JSON dataset
feat_funcs = {
    'raw': None,
    'melspectrogram': melspectrogram,
    'mfcc': mfcc,
}


class JSONDataset(Dataset):
    """
    dataset from json file.
    """

    def __init__(self, json_file: str, feat_type: str='raw', **kwargs):
        """
        Args:
            json_file (:obj:`str`): Data prep JSON file.
            feat_type (:obj:`str`, `optional`, defaults to `raw`):
                It identifies the feature type that the user wants to extract
                from an audio file.
        """
        if feat_type not in feat_funcs.keys():
            raise RuntimeError(
                f"Unknown feat_type: {feat_type}, it must be one in {list(feat_funcs.keys())}"
            )

        self.json_file = json_file
        self.feat_type = feat_type
        self.feat_config = kwargs
        self._data = self._get_data()
        super(JSONDataset, self).__init__()

    def _get_data(self):
        with open(self.json_file, "r") as f:
            meta_data = json.load(f)
        data = []
        for key in meta_data:
            sub_seg = meta_data[key]["wav"]
            wav = sub_seg["file"]
            duration = sub_seg["duration"]
            start = sub_seg["start"]
            stop = sub_seg["stop"]
            rec_id = str(key).rsplit("_", 2)[0]
            data.append(
                meta_info(
                    str(key),
                    float(duration), wav, int(start), int(stop), str(rec_id)))
        return data

    def _convert_to_record(self, idx: int):
        sample = self._data[idx]

        record = {}
        # copy all dataclass fields of the sample into the record dict
        for field in fields(sample):
            record[field.name] = getattr(sample, field.name)

        waveform, sr = load_audio(record['wav'])
        # cut the sub-segment out of the original recording
        waveform = waveform[record['start']:record['stop']]

        feat_func = feat_funcs[self.feat_type]
        feat = feat_func(
            waveform, sr=sr, **self.feat_config) if feat_func else waveform

        record.update({'feat': feat})

        return record

    def __getitem__(self, idx):
        return self._convert_to_record(idx)

    def __len__(self):
        return len(self._data)
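A minimal usage sketch of JSONDataset above; the JSON path and the n_mels keyword are illustrative placeholders, not files or values created by this diff:

# hypothetical usage sketch, not part of this diff
dataset = JSONDataset("./metadata/test.json", feat_type="melspectrogram", n_mels=80)
print(len(dataset))          # number of sub-segments listed in the JSON file
record = dataset[0]          # dict: id, duration, wav, start, stop, record_id, feat
print(record["feat"].shape)  # melspectrogram matrix of the sub-segment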
@ -0,0 +1,214 @@
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import Dict

import paddle


class InputNormalization:
    spk_dict_mean: Dict[int, paddle.Tensor]
    spk_dict_std: Dict[int, paddle.Tensor]
    spk_dict_count: Dict[int, int]

    def __init__(
            self,
            mean_norm=True,
            std_norm=True,
            norm_type="global", ):
        """Do feature or embedding mean and std norm

        Args:
            mean_norm (bool, optional): mean norm flag. Defaults to True.
            std_norm (bool, optional): std norm flag. Defaults to True.
            norm_type (str, optional): norm type. Defaults to "global".
        """
        super().__init__()
        self.training = True
        self.mean_norm = mean_norm
        self.std_norm = std_norm
        self.norm_type = norm_type
        self.glob_mean = paddle.to_tensor([0], dtype="float32")
        self.glob_std = paddle.to_tensor([0], dtype="float32")
        self.spk_dict_mean = {}
        self.spk_dict_std = {}
        self.spk_dict_count = {}
        self.weight = 1.0
        self.count = 0
        self.eps = 1e-10

    def __call__(self,
                 x,
                 lengths,
                 spk_ids=paddle.to_tensor([], dtype="float32")):
        """Normalizes the input tensor.

        Args:
            x (paddle.Tensor): A batch of tensors.
            lengths (paddle.Tensor): A batch of tensors containing the relative length of each
                sentence (e.g, [0.7, 0.9, 1.0]). It is used to avoid
                computing stats on zero-padded steps.
            spk_ids (paddle.Tensor, optional): tensor containing the ids of each speaker (e.g, [0 10 6]).
                It is used to perform per-speaker normalization when
                norm_type='speaker'. Defaults to paddle.to_tensor([], dtype="float32").
        Returns:
            paddle.Tensor: The normalized feature or embedding.
        """
        N_batches = x.shape[0]
        current_means = []
        current_stds = []

        for snt_id in range(N_batches):
            # Avoiding padded time steps:
            # actual_size is the number of real (unpadded) time steps
            actual_size = paddle.round(lengths[snt_id] *
                                       x.shape[1]).astype("int32")
            # computing statistics over the real time steps only
            current_mean, current_std = self._compute_current_stats(
                x[snt_id, 0:actual_size, ...].unsqueeze(0))
            current_means.append(current_mean)
            current_stds.append(current_std)

        if self.norm_type == "global":
            current_mean = paddle.mean(paddle.stack(current_means), axis=0)
            current_std = paddle.mean(paddle.stack(current_stds), axis=0)

            if self.training:
                if self.count == 0:
                    self.glob_mean = current_mean
                    self.glob_std = current_std
                else:
                    # running average over all batches seen so far
                    self.weight = 1 / (self.count + 1)

                    self.glob_mean = (
                        1 - self.weight
                    ) * self.glob_mean + self.weight * current_mean

                    self.glob_std = (
                        1 - self.weight
                    ) * self.glob_std + self.weight * current_std

                # detach the running statistics from the autograd graph
                self.glob_mean = self.glob_mean.detach()
                self.glob_std = self.glob_std.detach()

                self.count = self.count + 1
            x = (x - self.glob_mean) / (self.glob_std)
        return x

    def _compute_current_stats(self, x):
        """Computes the mean and std statistics of the input tensor.

        Args:
            x (paddle.Tensor): A batch of tensors.

        Returns:
            the (mean, std) statistics of the data
        """
        # Compute current mean
        if self.mean_norm:
            current_mean = paddle.mean(x, axis=0).detach()
        else:
            current_mean = paddle.to_tensor([0.0], dtype="float32")

        # Compute current std
        if self.std_norm:
            current_std = paddle.std(x, axis=0).detach()
        else:
            current_std = paddle.to_tensor([1.0], dtype="float32")

        # Improving numerical stability of std
        current_std = paddle.maximum(current_std,
                                     self.eps * paddle.ones_like(current_std))

        return current_mean, current_std

    def _statistics_dict(self):
        """Fills the dictionary containing the normalization statistics.
        """
        state = {}
        state["count"] = self.count
        state["glob_mean"] = self.glob_mean
        state["glob_std"] = self.glob_std
        state["spk_dict_mean"] = self.spk_dict_mean
        state["spk_dict_std"] = self.spk_dict_std
        state["spk_dict_count"] = self.spk_dict_count

        return state

    def _load_statistics_dict(self, state):
        """Loads the dictionary containing the statistics.

        Args:
            state (dict): A dictionary containing the normalization statistics.
        """
        self.count = state["count"]
        self.glob_mean = state["glob_mean"]
        self.glob_std = state["glob_std"]

        # Loading the spk_dict_mean
        self.spk_dict_mean = {}
        for spk in state["spk_dict_mean"]:
            self.spk_dict_mean[spk] = state["spk_dict_mean"][spk]

        # Loading the spk_dict_std
        self.spk_dict_std = {}
        for spk in state["spk_dict_std"]:
            self.spk_dict_std[spk] = state["spk_dict_std"][spk]

        self.spk_dict_count = state["spk_dict_count"]

        return state

    def to(self, device):
        """Puts the needed tensors in the right device.
        """
        self.glob_mean = self.glob_mean.to(device)
        self.glob_std = self.glob_std.to(device)
        for spk in self.spk_dict_mean:
            self.spk_dict_mean[spk] = self.spk_dict_mean[spk].to(device)
            self.spk_dict_std[spk] = self.spk_dict_std[spk].to(device)
        return self

    def save(self, path):
        """Save statistic dictionary.

        Args:
            path (str): A path where to save the dictionary.
        """
        stats = self._statistics_dict()
        paddle.save(stats, path)

    def _load(self, path, end_of_epoch=False, device=None):
        """Load statistic dictionary.

        Args:
            path (str): The path of the statistic dictionary.
            device (str, optional): Unused here; kept for API compatibility.
        """
        del end_of_epoch  # Unused here.
        stats = paddle.load(path)
        self._load_statistics_dict(stats)
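A short sketch of how the normalizer above might be driven; random tensors stand in for real fbank features, and the shapes and save path are illustrative, not part of this diff:

# illustrative sketch, not part of this diff
import paddle

norm = InputNormalization(norm_type="global")
feats = paddle.randn([4, 200, 80])                # (batch, time, n_mels)
lengths = paddle.to_tensor([1.0, 0.9, 0.8, 1.0])  # relative unpadded lengths
normalized = norm(feats, lengths)                 # also updates the running stats
norm.save("./norm_stats.pd")                      # persists stats via paddle.save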
@ -0,0 +1,32 @@
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


def get_chunks(seg_dur, audio_id, audio_duration):
    """Get all chunk segments from an utterance

    Args:
        seg_dur (float): segment chunk duration, in seconds
        audio_id (str): utterance name
        audio_duration (float): utterance duration, in seconds

    Returns:
        List: all the chunk segments
    """
    num_chunks = int(audio_duration / seg_dur)  # all in seconds
    chunk_lst = [
        audio_id + "_" + str(i * seg_dur) + "_" + str(i * seg_dur + seg_dur)
        for i in range(num_chunks)
    ]
    return chunk_lst
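For example, the helper above names each chunk by its start and end time; the utterance id here is made up:

# doctest-style illustration, not part of this diff
>>> get_chunks(3.0, "S0764W0290", 9.5)
['S0764W0290_0.0_3.0', 'S0764W0290_3.0_6.0', 'S0764W0290_6.0_9.0']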
@ -0,0 +1,24 @@
#!/usr/bin/env bash

data=$1
feat_scp=$2
split_feat_name=$3
numsplit=$4


if ! [ "$numsplit" -gt 0 ]; then
    echo "Invalid num-split argument";
    exit 1;
fi

directories=$(for n in `seq $numsplit`; do echo $data/split${numsplit}/$n; done)
feat_split_scp=$(for n in `seq $numsplit`; do echo $data/split${numsplit}/$n/${split_feat_name}; done)
echo $feat_split_scp
# if this mkdir fails due to argument-list being too long, iterate.
if ! mkdir -p $directories >&/dev/null; then
    for n in `seq $numsplit`; do
        mkdir -p $data/split${numsplit}/$n
    done
fi

utils/split_scp.pl $feat_scp $feat_split_scp
@ -0,0 +1,14 @@
# This file contains the locations of the binaries built for running the examples.

SPEECHX_ROOT=$PWD/../..
SPEECHX_EXAMPLES=$SPEECHX_ROOT/build/examples

SPEECHX_TOOLS=$SPEECHX_ROOT/tools
TOOLS_BIN=$SPEECHX_TOOLS/valgrind/install/bin

[ -d $SPEECHX_EXAMPLES ] || { echo "Error: 'build/examples' directory not found. Please ensure the project has been built successfully."; }

export LC_ALL=C

SPEECHX_BIN=$SPEECHX_EXAMPLES/decoder:$SPEECHX_EXAMPLES/feat
export PATH=$PATH:$SPEECHX_BIN:$TOOLS_BIN
@ -0,0 +1,113 @@
#!/bin/bash
set +x
set -e

. path.sh

# 1. compile
if [ ! -d ${SPEECHX_EXAMPLES} ]; then
    pushd ${SPEECHX_ROOT}
    bash build.sh
    popd
fi


# 2. download model
if [ ! -d ../paddle_asr_model ]; then
    wget -c https://paddlespeech.bj.bcebos.com/s2t/paddle_asr_online/paddle_asr_model.tar.gz
    tar xzfv paddle_asr_model.tar.gz
    mv ./paddle_asr_model ../
    # produce wav scp
    echo "utt1 " $PWD/../paddle_asr_model/BAC009S0764W0290.wav > ../paddle_asr_model/wav.scp
fi

mkdir -p data
data=$PWD/data
aishell_wav_scp=aishell_test.scp
if [ ! -d $data/test ]; then
    wget -c https://paddlespeech.bj.bcebos.com/s2t/paddle_asr_online/aishell_test.zip
    unzip -d $data aishell_test.zip
    realpath $data/test/*/*.wav > $data/wavlist
    awk -F '/' '{ print $(NF) }' $data/wavlist | awk -F '.' '{ print $1 }' > $data/utt_id
    paste $data/utt_id $data/wavlist > $data/$aishell_wav_scp
fi

model_dir=$PWD/aishell_ds2_online_model
if [ ! -d $model_dir ]; then
    mkdir -p $model_dir
    wget -P $model_dir -c https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_online_aishell_ckpt_0.2.0.model.tar.gz
    tar xzfv $model_dir/asr0_deepspeech2_online_aishell_ckpt_0.2.0.model.tar.gz -C $model_dir
fi

# 3. make feature
aishell_online_model=$model_dir/exp/deepspeech2_online/checkpoints
lm_model_dir=../paddle_asr_model
label_file=./aishell_result
wer=./aishell_wer

nj=40
export GLOG_logtostderr=1

# split the wav scp into $nj pieces, one per job
./local/split_data.sh $data $data/$aishell_wav_scp $aishell_wav_scp $nj

# generate linear feature
cmvn=$PWD/cmvn.ark
cmvn_json2binary_main --json_file=$model_dir/data/mean_std.json --cmvn_write_path=$cmvn

utils/run.pl JOB=1:$nj $data/split${nj}/JOB/feat_log \
    linear_spectrogram_without_db_norm_main \
    --wav_rspecifier=scp:$data/split${nj}/JOB/${aishell_wav_scp} \
    --feature_wspecifier=ark,scp:$data/split${nj}/JOB/feat.ark,$data/split${nj}/JOB/feat.scp \
    --cmvn_file=$cmvn \
    --streaming_chunk=0.36

text=$data/test/text

# 4. recognizer
utils/run.pl JOB=1:$nj $data/split${nj}/JOB/log \
    offline_decoder_sliding_chunk_main \
    --feature_rspecifier=scp:$data/split${nj}/JOB/feat.scp \
    --model_path=$aishell_online_model/avg_1.jit.pdmodel \
    --param_path=$aishell_online_model/avg_1.jit.pdiparams \
    --model_output_names=softmax_0.tmp_0,tmp_5,concat_0.tmp_0,concat_1.tmp_0 \
    --dict_file=$lm_model_dir/vocab.txt \
    --result_wspecifier=ark,t:$data/split${nj}/JOB/result

cat $data/split${nj}/*/result > ${label_file}
local/compute-wer.py --char=1 --v=1 ${label_file} $text > ${wer}

# 5. decode with lm
utils/run.pl JOB=1:$nj $data/split${nj}/JOB/log_lm \
    offline_decoder_sliding_chunk_main \
    --feature_rspecifier=scp:$data/split${nj}/JOB/feat.scp \
    --model_path=$aishell_online_model/avg_1.jit.pdmodel \
    --param_path=$aishell_online_model/avg_1.jit.pdiparams \
    --model_output_names=softmax_0.tmp_0,tmp_5,concat_0.tmp_0,concat_1.tmp_0 \
    --dict_file=$lm_model_dir/vocab.txt \
    --lm_path=$lm_model_dir/avg_1.jit.klm \
    --result_wspecifier=ark,t:$data/split${nj}/JOB/result_lm

cat $data/split${nj}/*/result_lm > ${label_file}_lm
local/compute-wer.py --char=1 --v=1 ${label_file}_lm $text > ${wer}_lm

graph_dir=./aishell_graph
if [ ! -d $graph_dir ]; then
    wget -c https://paddlespeech.bj.bcebos.com/s2t/paddle_asr_online/aishell_graph.zip
    unzip aishell_graph.zip
fi

# 6. test TLG decoder
utils/run.pl JOB=1:$nj $data/split${nj}/JOB/log_tlg \
    offline_wfst_decoder_main \
    --feature_rspecifier=scp:$data/split${nj}/JOB/feat.scp \
    --model_path=$aishell_online_model/avg_1.jit.pdmodel \
    --param_path=$aishell_online_model/avg_1.jit.pdiparams \
    --word_symbol_table=$graph_dir/words.txt \
    --model_output_names=softmax_0.tmp_0,tmp_5,concat_0.tmp_0,concat_1.tmp_0 \
    --graph_path=$graph_dir/TLG.fst --max_active=7500 \
    --acoustic_scale=1.2 \
    --result_wspecifier=ark,t:$data/split${nj}/JOB/result_tlg

cat $data/split${nj}/*/result_tlg > ${label_file}_tlg
local/compute-wer.py --char=1 --v=1 ${label_file}_tlg $text > ${wer}_tlg
@ -0,0 +1 @@
../../../utils
@ -0,0 +1,158 @@
// Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
//     http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
// See the License for the specific language governing permissions and
// limitations under the License.

// todo refactor, replace with gtest

#include "base/flags.h"
#include "base/log.h"
#include "decoder/ctc_tlg_decoder.h"
#include "frontend/audio/data_cache.h"
#include "kaldi/util/table-types.h"
#include "nnet/decodable.h"
#include "nnet/paddle_nnet.h"

DEFINE_string(feature_rspecifier, "", "test feature rspecifier");
DEFINE_string(result_wspecifier, "", "test result wspecifier");
DEFINE_string(model_path, "avg_1.jit.pdmodel", "paddle nnet model");
DEFINE_string(param_path, "avg_1.jit.pdiparams", "paddle nnet model param");
DEFINE_string(word_symbol_table, "words.txt", "word symbol table");
DEFINE_string(graph_path, "TLG", "decoder graph");
DEFINE_double(acoustic_scale, 1.0, "acoustic scale");
DEFINE_int32(max_active, 7500, "max active states in the decoder");
DEFINE_int32(receptive_field_length,
             7,
             "receptive field of two CNN(kernel=5) downsampling modules.");
DEFINE_int32(downsampling_rate,
             4,
             "two CNN(kernel=5) module downsampling rate.");
DEFINE_string(model_output_names,
              "save_infer_model/scale_0.tmp_1,save_infer_model/"
              "scale_1.tmp_1,save_infer_model/scale_2.tmp_1,save_infer_model/"
              "scale_3.tmp_1",
              "model output names");
DEFINE_string(model_cache_names, "5-1-1024,5-1-1024", "model cache names");

using kaldi::BaseFloat;
using kaldi::Matrix;
using std::vector;

// test TLG decoder by feeding speech features.
int main(int argc, char* argv[]) {
    gflags::ParseCommandLineFlags(&argc, &argv, false);
    google::InitGoogleLogging(argv[0]);

    kaldi::SequentialBaseFloatMatrixReader feature_reader(
        FLAGS_feature_rspecifier);
    kaldi::TokenWriter result_writer(FLAGS_result_wspecifier);
    std::string model_graph = FLAGS_model_path;
    std::string model_params = FLAGS_param_path;
    std::string word_symbol_table = FLAGS_word_symbol_table;
    std::string graph_path = FLAGS_graph_path;
    LOG(INFO) << "model path: " << model_graph;
    LOG(INFO) << "model param: " << model_params;
    LOG(INFO) << "word symbol path: " << word_symbol_table;
    LOG(INFO) << "graph path: " << graph_path;

    int32 num_done = 0, num_err = 0;

    ppspeech::TLGDecoderOptions opts;
    opts.word_symbol_table = word_symbol_table;
    opts.fst_path = graph_path;
    opts.opts.max_active = FLAGS_max_active;
    opts.opts.beam = 15.0;
    opts.opts.lattice_beam = 7.5;
    ppspeech::TLGDecoder decoder(opts);

    ppspeech::ModelOptions model_opts;
    model_opts.model_path = model_graph;
    model_opts.params_path = model_params;
    model_opts.cache_shape = FLAGS_model_cache_names;
    model_opts.output_names = FLAGS_model_output_names;
    std::shared_ptr<ppspeech::PaddleNnet> nnet(
        new ppspeech::PaddleNnet(model_opts));
    std::shared_ptr<ppspeech::DataCache> raw_data(new ppspeech::DataCache());
    std::shared_ptr<ppspeech::Decodable> decodable(
        new ppspeech::Decodable(nnet, raw_data, FLAGS_acoustic_scale));

    int32 chunk_size = FLAGS_receptive_field_length;
    int32 chunk_stride = FLAGS_downsampling_rate;
    int32 receptive_field_length = FLAGS_receptive_field_length;
    LOG(INFO) << "chunk size (frame): " << chunk_size;
    LOG(INFO) << "chunk stride (frame): " << chunk_stride;
    LOG(INFO) << "receptive field (frame): " << receptive_field_length;
    decoder.InitDecoder();

    for (; !feature_reader.Done(); feature_reader.Next()) {
        string utt = feature_reader.Key();
        kaldi::Matrix<BaseFloat> feature = feature_reader.Value();
        raw_data->SetDim(feature.NumCols());
        LOG(INFO) << "process utt: " << utt;
        LOG(INFO) << "rows: " << feature.NumRows();
        LOG(INFO) << "cols: " << feature.NumCols();

        int32 padding_len = 0;
        int32 ori_feature_len = feature.NumRows();
        // pad so that (NumRows - chunk_size) is a multiple of chunk_stride
        if ((feature.NumRows() - chunk_size) % chunk_stride != 0) {
            padding_len =
                chunk_stride - (feature.NumRows() - chunk_size) % chunk_stride;
            feature.Resize(feature.NumRows() + padding_len,
                           feature.NumCols(),
                           kaldi::kCopyData);
        }
        int32 num_chunks = (feature.NumRows() - chunk_size) / chunk_stride + 1;
        for (int chunk_idx = 0; chunk_idx < num_chunks; ++chunk_idx) {
            kaldi::Vector<kaldi::BaseFloat> feature_chunk(chunk_size *
                                                          feature.NumCols());
            int32 feature_chunk_size = 0;
            if (ori_feature_len > chunk_idx * chunk_stride) {
                feature_chunk_size = std::min(
                    ori_feature_len - chunk_idx * chunk_stride, chunk_size);
            }
            if (feature_chunk_size < receptive_field_length) break;

            int32 start = chunk_idx * chunk_stride;
            // copy chunk_size rows of the feature matrix into the flat chunk
            for (int row_id = 0; row_id < chunk_size; ++row_id) {
                kaldi::SubVector<kaldi::BaseFloat> tmp(feature, start);
                kaldi::SubVector<kaldi::BaseFloat> f_chunk_tmp(
                    feature_chunk.Data() + row_id * feature.NumCols(),
                    feature.NumCols());
                f_chunk_tmp.CopyFromVec(tmp);
                ++start;
            }
            raw_data->Accept(feature_chunk);
            if (chunk_idx == num_chunks - 1) {
                raw_data->SetFinished();
            }
            decoder.AdvanceDecode(decodable);
        }
        std::string result;
        result = decoder.GetFinalBestPath();
        decodable->Reset();
        decoder.Reset();
        if (result.empty()) {
            // the TokenWriter can not write an empty string.
            ++num_err;
            KALDI_LOG << " the result of " << utt << " is empty";
            continue;
        }
        KALDI_LOG << " the result of " << utt << " is " << result;
        result_writer.Write(utt, result);
        ++num_done;
    }

    KALDI_LOG << "Done " << num_done << " utterances, " << num_err
              << " with errors.";
    return (num_done != 0 ? 0 : 1);
}
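For reference, the padding and chunking arithmetic in the decode loop above can be sketched in Python; the function name and the example frame count are illustrative only, not part of this diff:

# illustrative sketch of the C++ chunking logic above, not part of this diff
def make_chunks(num_rows, chunk_size=7, chunk_stride=4):
    # pad so that (num_rows - chunk_size) divides evenly by chunk_stride
    padding = 0
    if (num_rows - chunk_size) % chunk_stride != 0:
        padding = chunk_stride - (num_rows - chunk_size) % chunk_stride
    num_chunks = (num_rows + padding - chunk_size) // chunk_stride + 1
    chunks = []
    for idx in range(num_chunks):
        start = idx * chunk_stride
        real_rows = min(num_rows - start, chunk_size)  # rows that are not padding
        if real_rows < chunk_size:  # here receptive_field_length == chunk_size
            break
        chunks.append((start, chunk_size))
    return chunks

print(make_chunks(103))  # [(0, 7), (4, 7), ..., (96, 7)]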