Merge branch 'develop' into CER

3 years ago · 8d1ee8262e
parent 4e431ae269 f39de8d754
commit 8d1ee8262e
358 changed files with 217821 additions and 5924 deletions
--- a/.gitignore
+++ b/.gitignore
@ -33,6 +33,12 @@ tools/Miniconda3-latest-Linux-x86_64.sh
 tools/activate_python.sh
 tools/miniconda.sh
 tools/CRF++-0.58/
 tools/liblbfgs-1.10/
 tools/srilm/
 tools/env.sh
 tools/openfst-1.8.1/
 tools/libsndfile/
 tools/python-soundfile/
 speechx/fc_patch/
--- a/.pre-commit-config.yaml
+++ b/.pre-commit-config.yaml
@ -50,13 +50,13 @@ repos:
        entry: bash .pre-commit-hooks/clang-format.hook -i
        language: system
        files: \.(c|cc|cxx|cpp|cu|h|hpp|hxx|cuh|proto)$
-        exclude: (?=speechx/speechx/kaldi|speechx/patch).*(\.cpp|\.cc|\.h|\.py)$
+        exclude: (?=speechx/speechx/kaldi|speechx/patch|speechx/tools/fstbin|speechx/tools/lmbin).*(\.cpp|\.cc|\.h|\.py)$
    -   id: copyright_checker
        name: copyright_checker
        entry: python .pre-commit-hooks/copyright-check.hook
        language: system
        files: \.(c|cc|cxx|cpp|cu|h|hpp|hxx|proto|py)$
-        exclude: (?=third_party|pypinyin|speechx/speechx/kaldi|speechx/patch).*(\.cpp|\.cc|\.h|\.py)$
+        exclude: (?=third_party|pypinyin|speechx/speechx/kaldi|speechx/patch|speechx/tools/fstbin|speechx/tools/lmbin).*(\.cpp|\.cc|\.h|\.py)$
 -   repo: https://github.com/asottile/reorder_python_imports
    rev: v2.4.0
    hooks:
--- a/README.md
+++ b/README.md
@ -280,10 +280,14 @@ paddlespeech_client cls --server_ip 127.0.0.1 --port 8090 --input input.wav
 For more information about server command lines, please see: [speech server demos](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/demos/speech_server)
 <a name="ModelList"></a>
 ## Model List
 PaddleSpeech supports a series of most popular models. They are summarized in [released models](./docs/source/released_model.md) and attached with available pretrained models.
 <a name="SpeechToText"></a>
 **Speech-to-Text** contains *Acoustic Model*, *Language Model*, and *Speech Translation*, with the following details:
 <table style="width:100%">
@ -357,6 +361,8 @@ PaddleSpeech supports a series of most popular models. They are summarized in [r
  </tbody>
 </table>
 <a name="TextToSpeech"></a>
 **Text-to-Speech** in PaddleSpeech mainly contains three modules: *Text Frontend*, *Acoustic Model* and *Vocoder*. Acoustic Model and Vocoder models are listed as follow:
 <table>
@ -457,10 +463,10 @@ PaddleSpeech supports a series of most popular models. They are summarized in [r
      </td>
    </tr>
    <tr>
-      <td>GE2E + Tactron2</td>
+      <td>GE2E + Tacotron2</td>
      <td>AISHELL-3</td>
      <td>
-      <a href = "./examples/aishell3/vc0">ge2e-tactron2-aishell3</a>
+      <a href = "./examples/aishell3/vc0">ge2e-tacotron2-aishell3</a>
      </td>
    </tr>
    <tr>
@ -473,6 +479,8 @@ PaddleSpeech supports a series of most popular models. They are summarized in [r
  </tbody>
 </table>
 <a name="AudioClassification"></a>
 **Audio Classification**
 <table style="width:100%">
@ -496,6 +504,8 @@ PaddleSpeech supports a series of most popular models. They are summarized in [r
  </tbody>
 </table>
 <a name="SpeakerVerification"></a>
 **Speaker Verification**
 <table style="width:100%">
@ -519,6 +529,8 @@ PaddleSpeech supports a series of most popular models. They are summarized in [r
  </tbody>
 </table>
 <a name="PunctuationRestoration"></a>
 **Punctuation Restoration**
 <table style="width:100%">
@ -559,10 +571,18 @@ Normally, [Speech SoTA](https://paperswithcode.com/area/speech), [Audio SoTA](ht
    - [Advanced Usage](./docs/source/tts/advanced_usage.md)
    - [Chinese Rule Based Text Frontend](./docs/source/tts/zh_text_frontend.md)
    - [Test Audio Samples](https://paddlespeech.readthedocs.io/en/latest/tts/demo.html)
  - Speaker Verification
    - [Audio Searching](./demos/audio_searching/README.md)
    - [Speaker Verification](./demos/speaker_verification/README.md)
  - [Audio Classification](./demos/audio_tagging/README.md)
  - [Speaker Verification](./demos/speaker_verification/README.md)
  - [Speech Translation](./demos/speech_translation/README.md)
  - [Speech Server](./demos/speech_server/README.md)
 - [Released Models](./docs/source/released_model.md)
  - [Speech-to-Text](#SpeechToText)
  - [Text-to-Speech](#TextToSpeech)
  - [Audio Classification](#AudioClassification)
  - [Speaker Verification](#SpeakerVerification)
  - [Punctuation Restoration](#PunctuationRestoration)
 - [Community](#Community)
 - [Welcome to contribute](#contribution)
 - [License](#License)
--- a/README_cn.md
+++ b/README_cn.md
@ -273,6 +273,8 @@ paddlespeech_client cls --server_ip 127.0.0.1 --port 8090 --input input.wav
 ## 模型列表
 PaddleSpeech 支持很多主流的模型，并提供了预训练模型，详情请见[模型列表](./docs/source/released_model.md)。
 <a name="语音识别模型"></a>
 PaddleSpeech 的 **语音转文本** 包含语音识别声学模型、语音识别语言模型和语音翻译, 详情如下：
 <table style="width:100%">
@ -347,6 +349,7 @@ PaddleSpeech 的 **语音转文本** 包含语音识别声学模型、语音识
 </table>
 <a name="语音合成模型"></a>
 PaddleSpeech 的 **语音合成** 主要包含三个模块：文本前端、声学模型和声码器。声学模型和声码器模型如下：
 <table>
@ -447,10 +450,10 @@ PaddleSpeech 的 **语音合成** 主要包含三个模块：文本前端、声
      </td>
    </tr>
    <tr>
-      <td>GE2E + Tactron2</td>
+      <td>GE2E + Tacotron2</td>
      <td>AISHELL-3</td>
      <td>
-      <a href = "./examples/aishell3/vc0">ge2e-tactron2-aishell3</a>
+      <a href = "./examples/aishell3/vc0">ge2e-tacotron2-aishell3</a>
      </td>
    </tr>
    <tr>
@ -488,6 +491,8 @@ PaddleSpeech 的 **语音合成** 主要包含三个模块：文本前端、声
 </table>
 <a name="声纹识别模型"></a>
 **声纹识别**
 <table style="width:100%">
@ -511,6 +516,8 @@ PaddleSpeech 的 **语音合成** 主要包含三个模块：文本前端、声
  </tbody>
 </table>
 <a name="标点恢复模型"></a>
 **标点恢复**
 <table style="width:100%">
@ -556,13 +563,18 @@ PaddleSpeech 的 **语音合成** 主要包含三个模块：文本前端、声
    - [进阶用法](./docs/source/tts/advanced_usage.md)
    - [中文文本前端](./docs/source/tts/zh_text_frontend.md)
    - [测试语音样本](https://paddlespeech.readthedocs.io/en/latest/tts/demo.html)
  - 声纹识别
    - [声纹识别](./demos/speaker_verification/README_cn.md)
    - [音频检索](./demos/audio_searching/README_cn.md)
  - [声音分类](./demos/audio_tagging/README_cn.md)
  - [声纹识别](./demos/speaker_verification/README_cn.md)
  - [语音翻译](./demos/speech_translation/README_cn.md)
  - [服务化部署](./demos/speech_server/README_cn.md)
 - [模型列表](#模型列表)
  - [语音识别](#语音识别模型)
  - [语音合成](#语音合成模型)
  - [声音分类](#声音分类模型)
  - [声纹识别](#声纹识别模型)
  - [标点恢复](#标点恢复模型)
 - [技术交流群](#技术交流群)
 - [欢迎贡献](#欢迎贡献)
 - [License](#License)
--- a/dataset/rir_noise/rir_noise.py
+++ b/dataset/rir_noise/rir_noise.py
@ -34,14 +34,14 @@ from utils.utility import unzip
 DATA_HOME = os.path.expanduser('~/.cache/paddle/dataset/speech')
-URL_ROOT = 'http://www.openslr.org/resources/28'
+URL_ROOT = '--no-check-certificate http://www.openslr.org/resources/28'
 DATA_URL = URL_ROOT + '/rirs_noises.zip'
 MD5_DATA = 'e6f48e257286e05de56413b4779d8ffb'
 parser = argparse.ArgumentParser(description=__doc__)
 parser.add_argument(
    "--target_dir",
-    default=DATA_HOME + "/Aishell",
+    default=DATA_HOME + "/rirs_noise",
    type=str,
    help="Directory to save the dataset. (default: %(default)s)")
 parser.add_argument(
@ -81,6 +81,10 @@ def create_manifest(data_dir, manifest_path_prefix):
                        },
                        ensure_ascii=False))
        manifest_path = manifest_path_prefix + '.' + dtype
        if not os.path.exists(os.path.dirname(manifest_path)):
            os.makedirs(os.path.dirname(manifest_path))
        with codecs.open(manifest_path, 'w', 'utf-8') as fout:
            for line in json_lines:
                fout.write(line + '\n')
--- a/dataset/voxceleb/voxceleb1.py
+++ b/dataset/voxceleb/voxceleb1.py
@ -149,7 +149,7 @@ def prepare_dataset(base_url, data_list, target_dir, manifest_path,
    # we will download the voxceleb1 data to ${target_dir}/vox1/dev/ or ${target_dir}/vox1/test directory 
    if not os.path.exists(os.path.join(target_dir, "wav")):
        # download all dataset part
-        print("start to download the vox1 dev zip package")
+        print(f"start to download the vox1 zip package to {target_dir}")
        for zip_part in data_list.keys():
            download_url = " --no-check-certificate " + base_url + "/" + zip_part
            download(
--- a/dataset/voxceleb/voxceleb2.py
+++ b/dataset/voxceleb/voxceleb2.py
@ -22,10 +22,12 @@ import codecs
 import glob
 import json
 import os
 import subprocess
 from pathlib import Path
 import soundfile
 from utils.utility import check_md5sum
 from utils.utility import download
 from utils.utility import unzip
@ -35,12 +37,22 @@ DATA_HOME = os.path.expanduser('.')
 BASE_URL = "--no-check-certificate https://www.robots.ox.ac.uk/~vgg/data/voxceleb/data/"
 # dev data
-DEV_DATA_URL = BASE_URL + '/vox2_aac.zip'
+DEV_LIST = {
-DEV_MD5SUM = "bbc063c46078a602ca71605645c2a402"
+    "vox2_dev_aac_partaa": "da070494c573e5c0564b1d11c3b20577",
    "vox2_dev_aac_partab": "17fe6dab2b32b48abaf1676429cdd06f",
    "vox2_dev_aac_partac": "1de58e086c5edf63625af1cb6d831528",
    "vox2_dev_aac_partad": "5a043eb03e15c5a918ee6a52aad477f9",
    "vox2_dev_aac_partae": "cea401b624983e2d0b2a87fb5d59aa60",
    "vox2_dev_aac_partaf": "fc886d9ba90ab88e7880ee98effd6ae9",
    "vox2_dev_aac_partag": "d160ecc3f6ee3eed54d55349531cb42e",
    "vox2_dev_aac_partah": "6b84a81b9af72a9d9eecbb3b1f602e65",
 }
 DEV_TARGET_DATA = "vox2_dev_aac_parta* vox2_dev_aac.zip bbc063c46078a602ca71605645c2a402"
 # test data
-TEST_DATA_URL = BASE_URL + '/vox2_test_aac.zip'
+TEST_LIST = {"vox2_test_aac.zip": "0d2b3ea430a821c33263b5ea37ede312"}
-TEST_MD5SUM = "0d2b3ea430a821c33263b5ea37ede312"
+TEST_TARGET_DATA = "vox2_test_aac.zip vox2_test_aac.zip 0d2b3ea430a821c33263b5ea37ede312"
 parser = argparse.ArgumentParser(description=__doc__)
 parser.add_argument(
@ -68,6 +80,14 @@ args = parser.parse_args()
 def create_manifest(data_dir, manifest_path_prefix):
    """Generate the voxceleb2 dataset manifest file.
    We will create the ${manifest_path_prefix}.vox2 as the final manifest file 
    The dev and test wav info will be put in one manifest file.
    Args:
        data_dir (str): voxceleb2 wav directory, which include dev and test subdataset
        manifest_path_prefix (str): manifest file prefix
    """
    print("Creating manifest %s ..." % manifest_path_prefix)
    json_lines = []
    data_path = os.path.join(data_dir, "**", "*.wav")
@ -119,7 +139,19 @@ def create_manifest(data_dir, manifest_path_prefix):
        print(f"{total_sec / total_num} sec/utt", file=f)
-def download_dataset(url, md5sum, target_dir, dataset):
+def download_dataset(base_url, data_list, target_data, target_dir, dataset):
    """Download the voxceleb2 zip package
    Args:
        base_url (str): the voxceleb2 dataset download baseline url
        data_list (dict): the dataset part zip package and the md5 value
        target_data (str): the final dataset zip info
        target_dir (str): the dataset stored directory
        dataset (str): the dataset name, dev or test
    Raises:
        RuntimeError: the md5sum occurs error
    """
    if not os.path.exists(target_dir):
        os.makedirs(target_dir)
@ -129,9 +161,34 @@ def download_dataset(url, md5sum, target_dir, dataset):
    # but the test dataset will unzip to aac
    # so, wo create the ${target_dir}/test and unzip the m4a to test dir
    if not os.path.exists(os.path.join(target_dir, dataset)):
-        filepath = download(url, md5sum, target_dir)
+        print(f"start to download the vox2 zip package to {target_dir}")
        for zip_part in data_list.keys():
            download_url = " --no-check-certificate " + base_url + "/" + zip_part
            download(
                url=download_url,
                md5sum=data_list[zip_part],
                target_dir=target_dir)
        # pack the all part to target zip file
        all_target_part, target_name, target_md5sum = target_data.split()
        target_name = os.path.join(target_dir, target_name)
        if not os.path.exists(target_name):
            pack_part_cmd = "cat {}/{} > {}".format(target_dir, all_target_part,
                                                    target_name)
            subprocess.call(pack_part_cmd, shell=True)
        # check the target zip file md5sum
        if not check_md5sum(target_name, target_md5sum):
            raise RuntimeError("{} MD5 checkssum failed".format(target_name))
        else:
            print("Check {} md5sum successfully".format(target_name))
        if dataset == "test":
-            unzip(filepath, os.path.join(target_dir, "test"))
+            # we need make the test directory
            unzip(target_name, os.path.join(target_dir, "test"))
        else:
            # upzip dev zip pacakge and will create the dev directory
            unzip(target_name, target_dir)
 def main():
@ -142,14 +199,16 @@ def main():
    print("download: {}".format(args.download))
    if args.download:
        download_dataset(
-            url=DEV_DATA_URL,
+            base_url=BASE_URL,
-            md5sum=DEV_MD5SUM,
+            data_list=DEV_LIST,
            target_data=DEV_TARGET_DATA,
            target_dir=args.target_dir,
            dataset="dev")
        download_dataset(
-            url=TEST_DATA_URL,
+            base_url=BASE_URL,
-            md5sum=TEST_MD5SUM,
+            data_list=TEST_LIST,
            target_data=TEST_TARGET_DATA,
            target_dir=args.target_dir,
            dataset="test")
--- a/demos/audio_searching/README.md
+++ b/demos/audio_searching/README.md
@ -90,7 +90,7 @@ Then to start the system server, and it provides HTTP backend services.
  ```bash
  export PYTHONPATH=$PYTHONPATH:./src:../../paddleaudio
-  python src/main.py
+  python src/audio_search.py
  ```
  Then you will see the Application is started:
@ -111,7 +111,7 @@ Then to start the system server, and it provides HTTP backend services.
  ```bash
  wget -c https://www.openslr.org/resources/82/cn-celeb_v2.tar.gz && tar -xvf cn-celeb_v2.tar.gz 
  ```
-  **Note**: If you want to build a quick demo, you can use ./src/test_main.py:download_audio_data function, it downloads 20 audio files , Subsequent results show this collection as an example
+  **Note**: If you want to build a quick demo, you can use ./src/test_audio_search.py:download_audio_data function, it downloads 20 audio files , Subsequent results show this collection as an example
 - Prepare model(Skip this step if you use the default model.)
  ```bash
@ -123,7 +123,7 @@ Then to start the system server, and it provides HTTP backend services.
    The internal process is downloading data, loading the paddlespeech model, extracting embedding, storing library, retrieving and deleting library  
    ```bash
-    python ./src/test_main.py
+    python ./src/test_audio_search.py
    ```
    Output：
--- a/demos/audio_searching/README_cn.md
+++ b/demos/audio_searching/README_cn.md
@ -92,7 +92,7 @@ ffce340b3790  minio/minio:RELEASE.2020-12-03T00-03-10Z  "/usr/bin/docker-ent…"
  ```bash
  export PYTHONPATH=$PYTHONPATH:./src:../../paddleaudio
-  python src/main.py
+  python src/audio_search.py
  ```
  然后你会看到应用程序启动:
@ -113,7 +113,7 @@ ffce340b3790  minio/minio:RELEASE.2020-12-03T00-03-10Z  "/usr/bin/docker-ent…"
  ```bash
  wget -c https://www.openslr.org/resources/82/cn-celeb_v2.tar.gz && tar -xvf cn-celeb_v2.tar.gz 
  ```
-  **注**：如果希望快速搭建 demo，可以采用 ./src/test_main.py:download_audio_data 内部的 20 条音频，另外后续结果展示以该集合为例
+  **注**：如果希望快速搭建 demo，可以采用 ./src/test_audio_search.py:download_audio_data 内部的 20 条音频，另外后续结果展示以该集合为例
 - 准备模型（如果使用默认模型，可以跳过此步骤）
  ```bash
@ -124,7 +124,7 @@ ffce340b3790  minio/minio:RELEASE.2020-12-03T00-03-10Z  "/usr/bin/docker-ent…"
 - 脚本测试（推荐）
    ```bash
-    python ./src/test_main.py
+    python ./src/test_audio_search.py
    ```
    注：内部将依次下载数据，加载 paddlespeech 模型，提取 embedding，存储建库，检索，删库
--- a/demos/audio_searching/src/audio_search.py
+++ b/demos/audio_searching/src/audio_search.py
@ -40,7 +40,6 @@ app.add_middleware(
    allow_methods=["*"],
    allow_headers=["*"])
 MODEL = None
 MILVUS_CLI = MilvusHelper()
 MYSQL_CLI = MySQLHelper()
--- a/demos/audio_searching/src/encode.py
+++ b/demos/audio_searching/src/encode.py
@ -12,8 +12,8 @@
 # See the License for the specific language governing permissions and
 # limitations under the License.
 import numpy as np
 from logs import LOGGER
 from paddlespeech.cli import VectorExecutor
 vector_executor = VectorExecutor()
--- a/demos/audio_searching/src/mysql_helpers.py
+++ b/demos/audio_searching/src/mysql_helpers.py
@ -13,6 +13,7 @@
 # limitations under the License.
 import sys
 import numpy
 import pymysql
 from config import MYSQL_DB
 from config import MYSQL_HOST
@ -69,7 +70,7 @@ class MySQLHelper():
            sys.exit(1)
    def load_data_to_mysql(self, table_name, data):
-        # Batch insert (Milvus_ids, img_path) to mysql
+        # Batch insert (Milvus_ids, audio_path) to mysql
        self.test_connection()
        sql = "insert into " + table_name + " (milvus_id,audio_path) values (%s,%s);"
        try:
@ -82,7 +83,7 @@ class MySQLHelper():
            sys.exit(1)
    def search_by_milvus_ids(self, ids, table_name):
-        # Get the img_path according to the milvus ids
+        # Get the audio_path according to the milvus ids
        self.test_connection()
        str_ids = str(ids).replace('[', '').replace(']', '')
        sql = "select audio_path from " + table_name + " where milvus_id in (" + str_ids + ") order by field (milvus_id," + str_ids + ");"
@ -120,14 +121,83 @@ class MySQLHelper():
            sys.exit(1)
    def count_table(self, table_name):
-        # Get the number of mysql table
+        # Get the number of spk in mysql table
        self.test_connection()
-        sql = "select count(milvus_id) from " + table_name + ";"
+        sql = "select count(spk_id) from " + table_name + ";"
        try:
            self.cursor.execute(sql)
            results = self.cursor.fetchall()
-            LOGGER.debug(f"MYSQL count table:{table_name}")
+            LOGGER.debug(f"MYSQL count table:{results[0][0]}")
            return results[0][0]
        except Exception as e:
            LOGGER.error(f"MYSQL ERROR: {e} with sql: {sql}")
            sys.exit(1)
    def create_mysql_table_vpr(self, table_name):
        # Create mysql table if not exists
        self.test_connection()
        sql = "create table if not exists " + table_name + "(spk_id TEXT, audio_path TEXT, embedding TEXT);"
        try:
            self.cursor.execute(sql)
            LOGGER.debug(f"MYSQL create table: {table_name} with sql: {sql}")
        except Exception as e:
            LOGGER.error(f"MYSQL ERROR: {e} with sql: {sql}")
            sys.exit(1)
    def load_data_to_mysql_vpr(self, table_name, data):
        # Insert (spk, audio, embedding) to mysql
        self.test_connection()
        sql = "insert into " + table_name + " (spk_id,audio_path,embedding) values (%s,%s,%s);"
        try:
            self.cursor.execute(sql, data)
            LOGGER.debug(
                f"MYSQL loads data to table: {table_name} successfully")
        except Exception as e:
            LOGGER.error(f"MYSQL ERROR: {e} with sql: {sql}")
            sys.exit(1)
    def list_vpr(self, table_name):
        # Get all records in mysql
        self.test_connection()
        sql = "select * from " + table_name + " ;"
        try:
            self.cursor.execute(sql)
            results = self.cursor.fetchall()
            self.conn.commit()
            spk_ids = [res[0] for res in results]
            audio_paths = [res[1] for res in results]
            embeddings = [
                numpy.array(
                    str(res[2]).replace('[', '').replace(']', '').split(","))
                for res in results
            ]
            return spk_ids, audio_paths, embeddings
        except Exception as e:
            LOGGER.error(f"MYSQL ERROR: {e} with sql: {sql}")
            sys.exit(1)
    def search_audio_vpr(self, table_name, spk_id):
        # Get the audio_path according to the spk_id
        self.test_connection()
        sql = "select audio_path from " + table_name + " where spk_id='" + spk_id + "' ;"
        try:
            self.cursor.execute(sql)
            results = self.cursor.fetchall()
            LOGGER.debug(
                f"MYSQL search by spk id {spk_id} to get audio {results[0][0]}.")
            return results[0][0]
        except Exception as e:
            LOGGER.error(f"MYSQL ERROR: {e} with sql: {sql}")
            sys.exit(1)
    def delete_data_vpr(self, table_name, spk_id):
        # Delete a record by spk_id in mysql table
        self.test_connection()
        sql = "delete from " + table_name + " where spk_id='" + spk_id + "';"
        try:
            self.cursor.execute(sql)
            LOGGER.debug(
                f"MYSQL delete a record {spk_id} in table {table_name}")
        except Exception as e:
            LOGGER.error(f"MYSQL ERROR: {e} with sql: {sql}")
            sys.exit(1)
--- a/demos/audio_searching/src/operations/count.py
+++ b/demos/audio_searching/src/operations/count.py
@ -31,3 +31,45 @@ def do_count(table_name, milvus_cli):
    except Exception as e:
        LOGGER.error(f"Error attempting to count table {e}")
        sys.exit(1)
 def do_count_vpr(table_name, mysql_cli):
    """
    Returns the total number of spk in the system
    """
    if not table_name:
        table_name = DEFAULT_TABLE
    try:
        num = mysql_cli.count_table(table_name)
        return num
    except Exception as e:
        LOGGER.error(f"Error attempting to count table {e}")
        sys.exit(1)
 def do_list(table_name, mysql_cli):
    """
    Returns the total records of vpr in the system
    """
    if not table_name:
        table_name = DEFAULT_TABLE
    try:
        spk_ids, audio_paths, _ = mysql_cli.list_vpr(table_name)
        return spk_ids, audio_paths
    except Exception as e:
        LOGGER.error(f"Error attempting to count table {e}")
        sys.exit(1)
 def do_get(table_name, spk_id, mysql_cli):
    """
    Returns the audio path by spk_id in the system
    """
    if not table_name:
        table_name = DEFAULT_TABLE
    try:
        audio_apth = mysql_cli.search_audio_vpr(table_name, spk_id)
        return audio_apth
    except Exception as e:
        LOGGER.error(f"Error attempting to count table {e}")
        sys.exit(1)
--- a/demos/audio_searching/src/operations/drop.py
+++ b/demos/audio_searching/src/operations/drop.py
@ -32,3 +32,31 @@ def do_drop(table_name, milvus_cli, mysql_cli):
    except Exception as e:
        LOGGER.error(f"Error attempting to drop table: {e}")
        sys.exit(1)
 def do_drop_vpr(table_name, mysql_cli):
    """
    Delete the table of MySQL
    """
    if not table_name:
        table_name = DEFAULT_TABLE
    try:
        mysql_cli.delete_table(table_name)
        return "OK"
    except Exception as e:
        LOGGER.error(f"Error attempting to drop table: {e}")
        sys.exit(1)
 def do_delete(table_name, spk_id, mysql_cli):
    """
    Delete a record by spk_id in MySQL
    """
    if not table_name:
        table_name = DEFAULT_TABLE
    try:
        mysql_cli.delete_data_vpr(table_name, spk_id)
        return "OK"
    except Exception as e:
        LOGGER.error(f"Error attempting to drop table: {e}")
        sys.exit(1)
--- a/demos/audio_searching/src/operations/load.py
+++ b/demos/audio_searching/src/operations/load.py
@ -82,3 +82,16 @@ def do_load(table_name, audio_dir, milvus_cli, mysql_cli):
    mysql_cli.create_mysql_table(table_name)
    mysql_cli.load_data_to_mysql(table_name, format_data(ids, names))
    return len(ids)
 def do_enroll(table_name, spk_id, audio_path, mysql_cli):
    """
    Import spk_id,audio_path,embedding to Mysql
    """
    if not table_name:
        table_name = DEFAULT_TABLE
    embedding = get_audio_embedding(audio_path)
    mysql_cli.create_mysql_table_vpr(table_name)
    data = (spk_id, audio_path, str(embedding))
    mysql_cli.load_data_to_mysql_vpr(table_name, data)
    return "OK"
--- a/demos/audio_searching/src/operations/search.py
+++ b/demos/audio_searching/src/operations/search.py
@ -13,6 +13,7 @@
 # limitations under the License.
 import sys
 import numpy
 from config import DEFAULT_TABLE
 from config import TOP_K
 from encode import get_audio_embedding
@ -39,3 +40,26 @@ def do_search(host, table_name, audio_path, milvus_cli, mysql_cli):
    except Exception as e:
        LOGGER.error(f"Error with search: {e}")
        sys.exit(1)
 def do_search_vpr(host, table_name, audio_path, mysql_cli):
    """
    Search the uploaded audio in MySQL
    """
    try:
        if not table_name:
            table_name = DEFAULT_TABLE
        emb = get_audio_embedding(audio_path)
        emb = numpy.array(emb)
        spk_ids, paths, vectors = mysql_cli.list_vpr(table_name)
        scores = [numpy.dot(emb, x.astype(numpy.float64)) for x in vectors]
        spk_ids = [str(x) for x in spk_ids]
        paths = [str(x) for x in paths]
        for i in range(len(paths)):
            tmp = "http://" + str(host) + "/data?audio_path=" + str(paths[i])
            paths[i] = tmp
            scores[i] = scores[i] * 100
        return spk_ids, paths, scores
    except Exception as e:
        LOGGER.error(f"Error with search: {e}")
        sys.exit(1)
--- a/demos/audio_searching/src/test_audio_search.py
+++ b/demos/audio_searching/src/test_audio_search.py
@ -11,8 +11,8 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
 from audio_search import app
 from fastapi.testclient import TestClient
 from main import app
 from utils.utility import download
 from utils.utility import unpack
@ -22,7 +22,7 @@ client = TestClient(app)
 def download_audio_data():
    """
-    download audio data
+    Download audio data
    """
    url = "https://paddlespeech.bj.bcebos.com/vector/audio/example_audio.tar.gz"
    md5sum = "52ac69316c1aa1fdef84da7dd2c67b39"
@ -64,7 +64,7 @@ def test_count():
    """
    Returns the total number of vectors in the system
    """
-    response = client.get("audio/count")
+    response = client.get("/audio/count")
    assert response.status_code == 200
    assert response.json() == 20
--- a/demos/audio_searching/src/test_vpr_search.py
+++ b/demos/audio_searching/src/test_vpr_search.py
@ -0,0 +1,115 @@
 # Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
 # You may obtain a copy of the License at
 #
 #     http://www.apache.org/licenses/LICENSE-2.0
 #
 # Unless required by applicable law or agreed to in writing, software
 # distributed under the License is distributed on an "AS IS" BASIS,
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
 from fastapi.testclient import TestClient
 from vpr_search import app
 from utils.utility import download
 from utils.utility import unpack
 client = TestClient(app)
 def download_audio_data():
    """
    Download audio data
    """
    url = "https://paddlespeech.bj.bcebos.com/vector/audio/example_audio.tar.gz"
    md5sum = "52ac69316c1aa1fdef84da7dd2c67b39"
    target_dir = "./"
    filepath = download(url, md5sum, target_dir)
    unpack(filepath, target_dir, True)
 def test_drop():
    """
    Delete the table of MySQL
    """
    response = client.post("/vpr/drop")
    assert response.status_code == 200
 def test_enroll_local(spk: str, audio: str):
    """
    Enroll the audio to MySQL
    """
    response = client.post("/vpr/enroll/local?spk_id=" + spk +
                           "&audio_path=.%2Fexample_audio%2F" + audio + ".wav")
    assert response.status_code == 200
    assert response.json() == {
        'status': True,
        'msg': "Successfully enroll data!"
    }
 def test_search_local():
    """
    Search the spk in MySQL by audio
    """
    response = client.post(
        "/vpr/recog/local?audio_path=.%2Fexample_audio%2Ftest.wav")
    assert response.status_code == 200
 def test_list():
    """
    Get all records in MySQL
    """
    response = client.get("/vpr/list")
    assert response.status_code == 200
 def test_data(spk: str):
    """
    Get the audio file by spk_id in MySQL
    """
    response = client.get("/vpr/data?spk_id=" + spk)
    assert response.status_code == 200
 def test_del(spk: str):
    """
    Delete the record in MySQL by spk_id
    """
    response = client.post("/vpr/del?spk_id=" + spk)
    assert response.status_code == 200
 def test_count():
    """
    Get the number of spk in MySQL
    """
    response = client.get("/vpr/count")
    assert response.status_code == 200
 if __name__ == "__main__":
    download_audio_data()
    test_enroll_local("spk1", "arms_strikes")
    test_enroll_local("spk2", "sword_wielding")
    test_enroll_local("spk3", "test")
    test_list()
    test_data("spk1")
    test_count()
    test_search_local()
    test_del("spk1")
    test_count()
    test_search_local()
    test_enroll_local("spk1", "arms_strikes")
    test_count()
    test_search_local()
    test_drop()
--- a/demos/audio_searching/src/vpr_search.py
+++ b/demos/audio_searching/src/vpr_search.py
@ -0,0 +1,206 @@
 # Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
 # You may obtain a copy of the License at
 #
 #     http://www.apache.org/licenses/LICENSE-2.0
 #
 # Unless required by applicable law or agreed to in writing, software
 # distributed under the License is distributed on an "AS IS" BASIS,
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
 import os
 import uvicorn
 from config import UPLOAD_PATH
 from fastapi import FastAPI
 from fastapi import File
 from fastapi import UploadFile
 from logs import LOGGER
 from mysql_helpers import MySQLHelper
 from operations.count import do_count_vpr
 from operations.count import do_get
 from operations.count import do_list
 from operations.drop import do_delete
 from operations.drop import do_drop_vpr
 from operations.load import do_enroll
 from operations.search import do_search_vpr
 from starlette.middleware.cors import CORSMiddleware
 from starlette.requests import Request
 from starlette.responses import FileResponse
 app = FastAPI()
 app.add_middleware(
    CORSMiddleware,
    allow_origins=["*"],
    allow_credentials=True,
    allow_methods=["*"],
    allow_headers=["*"])
 MYSQL_CLI = MySQLHelper()
 # Mkdir 'tmp/audio-data'
 if not os.path.exists(UPLOAD_PATH):
    os.makedirs(UPLOAD_PATH)
    LOGGER.info(f"Mkdir the path: {UPLOAD_PATH}")
@app.post('/vpr/enroll')
 async def vpr_enroll(table_name: str=None,
                     spk_id: str=None,
                     audio: UploadFile=File(...)):
    # Enroll the uploaded audio with spk-id into MySQL
    try:
        # Save the upload data to server.
        content = await audio.read()
        audio_path = os.path.join(UPLOAD_PATH, audio.filename)
        with open(audio_path, "wb+") as f:
            f.write(content)
        do_enroll(table_name, spk_id, audio_path, MYSQL_CLI)
        LOGGER.info(f"Successfully enrolled {spk_id} online!")
        return {'status': True, 'msg': "Successfully enroll data!"}
    except Exception as e:
        LOGGER.error(e)
        return {'status': False, 'msg': e}, 400
@app.post('/vpr/enroll/local')
 async def vpr_enroll_local(table_name: str=None,
                           spk_id: str=None,
                           audio_path: str=None):
    # Enroll the local audio with spk-id into MySQL
    try:
        do_enroll(table_name, spk_id, audio_path, MYSQL_CLI)
        LOGGER.info(f"Successfully enrolled {spk_id} locally!")
        return {'status': True, 'msg': "Successfully enroll data!"}
    except Exception as e:
        LOGGER.error(e)
        return {'status': False, 'msg': e}, 400
@app.post('/vpr/recog')
 async def vpr_recog(request: Request,
                    table_name: str=None,
                    audio: UploadFile=File(...)):
    # Voice print recognition online
    try:
        # Save the upload data to server.
        content = await audio.read()
        query_audio_path = os.path.join(UPLOAD_PATH, audio.filename)
        with open(query_audio_path, "wb+") as f:
            f.write(content)
        host = request.headers['host']
        spk_ids, paths, scores = do_search_vpr(host, table_name,
                                               query_audio_path, MYSQL_CLI)
        for spk_id, path, score in zip(spk_ids, paths, scores):
            LOGGER.info(f"spk {spk_id}, score {score}, audio path {path}, ")
        res = dict(zip(spk_ids, zip(paths, scores)))
        # Sort results by distance metric, closest distances first
        res = sorted(res.items(), key=lambda item: item[1][1], reverse=True)
        LOGGER.info("Successfully speaker recognition online!")
        return res
    except Exception as e:
        LOGGER.error(e)
        return {'status': False, 'msg': e}, 400
@app.post('/vpr/recog/local')
 async def vpr_recog_local(request: Request,
                          table_name: str=None,
                          audio_path: str=None):
    # Voice print recognition locally
    try:
        host = request.headers['host']
        spk_ids, paths, scores = do_search_vpr(host, table_name, audio_path,
                                               MYSQL_CLI)
        for spk_id, path, score in zip(spk_ids, paths, scores):
            LOGGER.info(f"spk {spk_id}, score {score}, audio path {path}, ")
        res = dict(zip(spk_ids, zip(paths, scores)))
        # Sort results by distance metric, closest distances first
        res = sorted(res.items(), key=lambda item: item[1][1], reverse=True)
        LOGGER.info("Successfully speaker recognition locally!")
        return res
    except Exception as e:
        LOGGER.error(e)
        return {'status': False, 'msg': e}, 400
@app.post('/vpr/del')
 async def vpr_del(table_name: str=None, spk_id: str=None):
    # Delete a record by spk_id in MySQL
    try:
        do_delete(table_name, spk_id, MYSQL_CLI)
        LOGGER.info("Successfully delete a record by spk_id in MySQL")
        return {'status': True, 'msg': "Successfully delete data!"}
    except Exception as e:
        LOGGER.error(e)
        return {'status': False, 'msg': e}, 400
@app.get('/vpr/list')
 async def vpr_list(table_name: str=None):
    # Get all records in MySQL
    try:
        spk_ids, audio_paths = do_list(table_name, MYSQL_CLI)
        for i in range(len(spk_ids)):
            LOGGER.debug(f"spk {spk_ids[i]}, audio path {audio_paths[i]}")
        LOGGER.info("Successfully list all records from mysql!")
        return spk_ids, audio_paths
    except Exception as e:
        LOGGER.error(e)
        return {'status': False, 'msg': e}, 400
@app.get('/vpr/data')
 async def vpr_data(
    table_name: str=None,
    spk_id: str=None, ):
    # Get the audio file from path by spk_id in MySQL
    try:
        audio_path = do_get(table_name, spk_id, MYSQL_CLI)
        LOGGER.info(f"Successfully get audio path {audio_path}!")
        return FileResponse(audio_path)
    except Exception as e:
        LOGGER.error(e)
        return {'status': False, 'msg': e}, 400
@app.get('/vpr/count')
 async def vpr_count(table_name: str=None):
    # Get the total number of spk in MySQL
    try:
        num = do_count_vpr(table_name, MYSQL_CLI)
        LOGGER.info("Successfully count the number of spk!")
        return num
    except Exception as e:
        LOGGER.error(e)
        return {'status': False, 'msg': e}, 400
@app.post('/vpr/drop')
 async def drop_tables(table_name: str=None):
    # Delete the table of MySQL
    try:
        do_drop_vpr(table_name, MYSQL_CLI)
        LOGGER.info("Successfully drop tables in MySQL!")
        return {'status': True, 'msg': "Successfully drop tables!"}
    except Exception as e:
        LOGGER.error(e)
        return {'status': False, 'msg': e}, 400
@app.get('/data')
 def audio_path(audio_path):
    # Get the audio file from path
    try:
        LOGGER.info(f"Successfully get audio: {audio_path}")
        return FileResponse(audio_path)
    except Exception as e:
        LOGGER.error(f"get audio error: {e}")
        return {'status': False, 'msg': e}, 400
 if __name__ == '__main__':
    uvicorn.run(app=app, host='0.0.0.0', port=8002)
--- a/demos/speaker_verification/README.md
+++ b/demos/speaker_verification/README.md
@ -30,6 +30,11 @@ wget -c https://paddlespeech.bj.bcebos.com/vector/audio/85236145389.wav
  paddlespeech vector --task spk --input vec.job
  echo -e "demo2 85236145389.wav \n demo3 85236145389.wav" | paddlespeech vector --task spk
  paddlespeech vector --task score --input "./85236145389.wav ./123456789.wav"
  echo -e "demo4 85236145389.wav 85236145389.wav \n demo5 85236145389.wav 123456789.wav" > vec.job
  paddlespeech vector --task score --input vec.job
  ```
  Usage:
@ -38,6 +43,7 @@ wget -c https://paddlespeech.bj.bcebos.com/vector/audio/85236145389.wav
  ```
  Arguments:
  - `input`(required): Audio file to recognize.
  - `task` (required): Specify `vector` task. Default `spk`。
  - `model`: Model type of vector task. Default: `ecapatdnn_voxceleb12`.
  - `sample_rate`: Sample rate of the model. Default: `16000`.
  - `config`: Config of vector task. Use pretrained model when it is None. Default: `None`.
@ -47,45 +53,45 @@ wget -c https://paddlespeech.bj.bcebos.com/vector/audio/85236145389.wav
  Output:
  ```bash
-    demo [ -5.749211     9.505463    -8.200284    -5.2075014    5.3940268
+    demo [  1.4217498    5.626253    -5.342073     1.1773866    3.308055
-      -3.04878      1.611095    10.127234   -10.534177   -15.821609
+    1.756596     5.167894    10.80636     -3.8226728   -5.6141334
-      1.2032688   -0.35080156   1.2629458  -12.643498    -2.5758228
+    2.623845    -0.8072968    1.9635103   -7.3128724    0.01103897
-    -11.343508     2.3385992   -8.719341    14.213509    15.404744
+    -9.723131     0.6619743   -6.976803    10.213478     7.494748
-      -0.39327756   6.338786     2.688887     8.7104025   17.469526
+    2.9105635    3.8949256    3.7999806    7.1061673   16.905321
-      -8.77959      7.0576906    4.648855    -1.3089896  -23.294737
+    -7.1493764    8.733103     3.4230042   -4.831653   -11.403367
-      8.013747    13.891729    -9.926753     5.655307    -5.9422326
+    11.232214     7.1274667   -4.2828417    2.452362    -5.130748
-    -22.842539     0.6293588  -18.46266    -10.811862     9.8192625
+    -18.177666    -2.6116815  -11.000337    -6.7314315    1.6564683
-      3.0070958    3.8072643   -2.3861165    3.0821571  -14.739942
+    0.7618269    1.1253023   -2.083836     4.725744    -8.782597
-      1.7594414   -0.6485091    4.485623     2.0207152    7.264915
+    -3.539873     3.814236     5.1420674    2.162061     4.096431
-      -6.40137     23.63524      2.9711294  -22.708025     9.93719
+    -6.4162116   12.747448     1.9429878  -15.152943     6.417416
-      20.354511   -10.324688    -0.700492    -8.783211    -5.27593
+    16.097002    -9.716668    -1.9920526   -3.3649497   -1.871939
-      15.999649     3.3004563   12.747926    15.429879     4.7849145
+    11.567354     3.69788     11.258265     7.442363     9.183411
-      5.6699696   -2.3826702   10.605882     3.9112158    3.1500628
+    4.5281515   -1.2417862    4.3959084    6.6727695    5.8898783
-      15.859915    -2.1832209  -23.908653    -6.4799504   -4.5365124
+    7.627124    -0.66919386 -11.889693    -9.208865    -7.4274073
-      -9.224193    14.568347   -10.568833     4.982321    -4.342062
+    -3.7776625    6.917234    -9.848748    -2.0944717   -5.135116
-      0.0914714   12.645902    -5.74285     -3.2141201   -2.7173362
+    0.49563864   9.317534    -5.9141874   -1.8098574   -0.11738578
-      -6.680575     0.4757669   -5.035051    -6.7964664   16.865469
+    -7.169265    -1.0578263   -5.7216787   -5.1173844   16.137651
-    -11.54324      7.681869     0.44475392   9.708182    -8.932846
+    -4.473626     7.6624317   -0.55381083   9.631587    -6.4704556
-      0.4123232   -4.361452     1.3948607    9.511665     0.11667654
+    -8.548508     4.3716145   -0.79702514   4.478997    -2.9758704
-      2.9079323    6.049952     9.275183   -18.078873     6.2983274
+    3.272176     2.8382776    5.134597    -9.190781    -0.5657382
-      -0.7500531   -2.725033    -7.6027865    3.3404543    2.990815
+    -4.8745747    2.3165567   -5.984303    -2.1798875    0.35541576
-      4.010979    11.000591    -2.8873312    7.1352735  -16.79663
+    -0.31784213   9.493548     2.1144536    4.358092   -12.089823
-      18.495346   -14.293832     7.89578      2.2714825   22.976387
+    8.451689    -7.925461     4.6242585    4.4289427   18.692003
-      -4.875734    -3.0836344   -2.9999814   13.751918     6.448228
+    -2.6204622   -5.149185    -0.35821092   8.488551     4.981496
-    -11.924197     2.171869     2.0423572   -6.173772    10.778437
+    -9.32683     -2.2544234    6.6417594    1.2119585   10.977129
-      25.77281     -4.9495463   14.57806      0.3044315    2.6132357
+    16.555033     3.3238444    9.551863    -1.6676947   -0.79539716
-      -7.591999    -2.076944     9.025118     1.7834753   -3.1799617
+    -8.605674    -0.47356385   2.6741948   -5.359179    -2.6673796
-      -4.9401326   23.465864     5.1685796   -9.018578     9.037825
+    0.66607     15.443222     4.740594    -3.4725387   11.592567
-      -4.4150195    6.859591   -12.274467    -0.88911164   5.186309
+    -2.054497     1.7361217   -8.265324    -9.30447      5.4068313
-      -3.9988663  -13.638606    -9.925445    -0.06329413  -3.6709652
+    -1.5180256   -7.746615    -6.089606     0.07112726  -0.34904733
-    -12.397416   -12.719869    -1.395601     2.1150916    5.7381287
+    -8.649895    -9.998958    -2.564841    -0.53999114   2.601808
-      -4.4691963   -3.82819     -0.84233856  -1.1604277  -13.490127
+    -0.31927416  -1.8815292   -2.07215     -3.4105783   -8.2998085
-      8.731719   -20.778936   -11.495662     5.8033476   -4.752041
+    1.483641   -15.365992    -8.288208     3.8847756   -3.4876456
-      10.833007    -6.717991     4.504732    13.4244375    1.1306485
+    7.3629923    0.4657332    3.132599    12.438889    -1.8337058
-      7.3435574    1.400918    14.704036    -9.501399     7.2315617
+    4.532936     2.7264361   10.145339    -6.521951     2.897153
-      -6.417456     1.3333273   11.872697    -0.30664724   8.8845
+    -3.3925855    5.079156     7.759716     4.677565     5.8457737
-      6.5569253    4.7948146    0.03662816  -8.704245     6.224871
+    2.402413     7.7071047    3.9711342   -6.390043     6.1268735
-      -3.2701402  -11.508579  ]
+    -3.7760346  -11.118123  ]
  ```
 - Python API
@ -97,56 +103,113 @@ wget -c https://paddlespeech.bj.bcebos.com/vector/audio/85236145389.wav
  audio_emb = vector_executor(
      model='ecapatdnn_voxceleb12',
      sample_rate=16000,
-      config=None, 
+      config=None,  # Set `config` and `ckpt_path` to None to use pretrained model.
      ckpt_path=None,
      audio_file='./85236145389.wav',
      force_yes=False,
      device=paddle.get_device())
  print('Audio embedding Result: \n{}'.format(audio_emb))
  test_emb = vector_executor(
      model='ecapatdnn_voxceleb12',
      sample_rate=16000,
      config=None,  # Set `config` and `ckpt_path` to None to use pretrained model.
      ckpt_path=None,
      audio_file='./123456789.wav',
      device=paddle.get_device())
  print('Test embedding Result: \n{}'.format(test_emb))
  # score range [0, 1]
  score = vector_executor.get_embeddings_score(audio_emb, test_emb)
  print(f"Eembeddings Score: {score}")
  ```
-  Output:
+  Output：
  ```bash
  # Vector Result:
-   [ -5.749211     9.505463    -8.200284    -5.2075014    5.3940268
+   Audio embedding Result:
-      -3.04878      1.611095    10.127234   -10.534177   -15.821609
+    [  1.4217498    5.626253    -5.342073     1.1773866    3.308055
-      1.2032688   -0.35080156   1.2629458  -12.643498    -2.5758228
+    1.756596     5.167894    10.80636     -3.8226728   -5.6141334
-    -11.343508     2.3385992   -8.719341    14.213509    15.404744
+    2.623845    -0.8072968    1.9635103   -7.3128724    0.01103897
-      -0.39327756   6.338786     2.688887     8.7104025   17.469526
+    -9.723131     0.6619743   -6.976803    10.213478     7.494748
-      -8.77959      7.0576906    4.648855    -1.3089896  -23.294737
+    2.9105635    3.8949256    3.7999806    7.1061673   16.905321
-      8.013747    13.891729    -9.926753     5.655307    -5.9422326
+    -7.1493764    8.733103     3.4230042   -4.831653   -11.403367
-    -22.842539     0.6293588  -18.46266    -10.811862     9.8192625
+    11.232214     7.1274667   -4.2828417    2.452362    -5.130748
-      3.0070958    3.8072643   -2.3861165    3.0821571  -14.739942
+    -18.177666    -2.6116815  -11.000337    -6.7314315    1.6564683
-      1.7594414   -0.6485091    4.485623     2.0207152    7.264915
+    0.7618269    1.1253023   -2.083836     4.725744    -8.782597
-      -6.40137     23.63524      2.9711294  -22.708025     9.93719
+    -3.539873     3.814236     5.1420674    2.162061     4.096431
-      20.354511   -10.324688    -0.700492    -8.783211    -5.27593
+    -6.4162116   12.747448     1.9429878  -15.152943     6.417416
-      15.999649     3.3004563   12.747926    15.429879     4.7849145
+    16.097002    -9.716668    -1.9920526   -3.3649497   -1.871939
-      5.6699696   -2.3826702   10.605882     3.9112158    3.1500628
+    11.567354     3.69788     11.258265     7.442363     9.183411
-      15.859915    -2.1832209  -23.908653    -6.4799504   -4.5365124
+    4.5281515   -1.2417862    4.3959084    6.6727695    5.8898783
-      -9.224193    14.568347   -10.568833     4.982321    -4.342062
+    7.627124    -0.66919386 -11.889693    -9.208865    -7.4274073
-      0.0914714   12.645902    -5.74285     -3.2141201   -2.7173362
+    -3.7776625    6.917234    -9.848748    -2.0944717   -5.135116
-      -6.680575     0.4757669   -5.035051    -6.7964664   16.865469
+    0.49563864   9.317534    -5.9141874   -1.8098574   -0.11738578
-    -11.54324      7.681869     0.44475392   9.708182    -8.932846
+    -7.169265    -1.0578263   -5.7216787   -5.1173844   16.137651
-      0.4123232   -4.361452     1.3948607    9.511665     0.11667654
+    -4.473626     7.6624317   -0.55381083   9.631587    -6.4704556
-      2.9079323    6.049952     9.275183   -18.078873     6.2983274
+    -8.548508     4.3716145   -0.79702514   4.478997    -2.9758704
-      -0.7500531   -2.725033    -7.6027865    3.3404543    2.990815
+    3.272176     2.8382776    5.134597    -9.190781    -0.5657382
-      4.010979    11.000591    -2.8873312    7.1352735  -16.79663
+    -4.8745747    2.3165567   -5.984303    -2.1798875    0.35541576
-      18.495346   -14.293832     7.89578      2.2714825   22.976387
+    -0.31784213   9.493548     2.1144536    4.358092   -12.089823
-      -4.875734    -3.0836344   -2.9999814   13.751918     6.448228
+    8.451689    -7.925461     4.6242585    4.4289427   18.692003
-    -11.924197     2.171869     2.0423572   -6.173772    10.778437
+    -2.6204622   -5.149185    -0.35821092   8.488551     4.981496
-      25.77281     -4.9495463   14.57806      0.3044315    2.6132357
+    -9.32683     -2.2544234    6.6417594    1.2119585   10.977129
-      -7.591999    -2.076944     9.025118     1.7834753   -3.1799617
+    16.555033     3.3238444    9.551863    -1.6676947   -0.79539716
-      -4.9401326   23.465864     5.1685796   -9.018578     9.037825
+    -8.605674    -0.47356385   2.6741948   -5.359179    -2.6673796
-      -4.4150195    6.859591   -12.274467    -0.88911164   5.186309
+    0.66607     15.443222     4.740594    -3.4725387   11.592567
-      -3.9988663  -13.638606    -9.925445    -0.06329413  -3.6709652
+    -2.054497     1.7361217   -8.265324    -9.30447      5.4068313
-    -12.397416   -12.719869    -1.395601     2.1150916    5.7381287
+    -1.5180256   -7.746615    -6.089606     0.07112726  -0.34904733
-      -4.4691963   -3.82819     -0.84233856  -1.1604277  -13.490127
+    -8.649895    -9.998958    -2.564841    -0.53999114   2.601808
-      8.731719   -20.778936   -11.495662     5.8033476   -4.752041
+    -0.31927416  -1.8815292   -2.07215     -3.4105783   -8.2998085
-      10.833007    -6.717991     4.504732    13.4244375    1.1306485
+    1.483641   -15.365992    -8.288208     3.8847756   -3.4876456
-      7.3435574    1.400918    14.704036    -9.501399     7.2315617
+    7.3629923    0.4657332    3.132599    12.438889    -1.8337058
-      -6.417456     1.3333273   11.872697    -0.30664724   8.8845
+    4.532936     2.7264361   10.145339    -6.521951     2.897153
-      6.5569253    4.7948146    0.03662816  -8.704245     6.224871
+    -3.3925855    5.079156     7.759716     4.677565     5.8457737
-      -3.2701402  -11.508579  ]
+    2.402413     7.7071047    3.9711342   -6.390043     6.1268735
    -3.7760346  -11.118123  ]
    # get the test embedding
    Test embedding Result:
    [ -1.902964     2.0690894   -8.034194     3.5472693    0.18089125
      6.9085927    1.4097427   -1.9487704  -10.021278    -0.20755845
      -8.04332      4.344489     2.3200977  -14.306299     5.184692
    -11.55602     -3.8497238    0.6444722    1.2833948    2.6766639
      0.5878921    0.7946299    1.7207596    2.5791872   14.998469
      -1.3385371   15.031221    -0.8006958    1.99287     -9.52007
      2.435466     4.003221    -4.33817     -4.898601    -5.304714
    -18.033886    10.790787   -12.784645    -5.641755     2.9761686
    -10.566622     1.4839455    6.152458    -5.7195854    2.8603241
      6.112133     8.489869     5.5958056    1.2836679   -1.2293907
      0.89927405   7.0288725   -2.854029    -0.9782962    5.8255906
      14.905906    -5.025907     0.7866458   -4.2444224  -16.354029
      10.521315     0.9604709   -3.3257897    7.144871   -13.592733
      -8.568869    -1.7953678    0.26313916  10.916714    -6.9374123
      1.857403    -6.2746415    2.8154466   -7.2338667   -2.293357
      -0.05452765   5.4287076    5.0849075   -6.690375    -1.6183422
      3.654291     0.94352573  -9.200294    -5.4749465   -3.5235846
      1.3420814    4.240421    -2.772944    -2.8451524   16.311104
      4.2969875   -1.762936   -12.5758915    8.595198    -0.8835239
      -1.5708797    1.568961     1.1413603    3.5032008   -0.45251232
      -6.786333    16.89443      5.3366146   -8.789056     0.6355629
      3.2579517   -3.328322     7.5969577    0.66025066  -6.550468
      -9.148656     2.020372    -0.4615173    1.1965656   -3.8764873
      11.6562195   -6.0750933   12.182899     3.2218833    0.81969476
      5.570001    -3.8459578   -7.205299     7.9262037   -7.6611166
      -5.249467    -2.2671914    7.2658715  -13.298164     4.821147
      -2.7263982   11.691089    -3.8918593   -2.838112    -1.0336838
      -3.8034165    2.8536487   -5.60398     -1.1972581    1.3455094
      -3.4903061    2.2408795    5.5010734   -3.970756    11.99696
      -7.8858757    0.43160373  -5.5059714    4.3426995   16.322706
      11.635366     0.72157705  -9.245714    -3.91465     -4.449838
      -1.5716927    7.713747    -2.2430465   -6.198303   -13.481864
      2.8156567   -5.7812386    5.1456156    2.7289324  -14.505571
      13.270688     3.448231    -7.0659585    4.5886116   -4.466099
      -0.296428   -11.463529    -2.6076477   14.110243    -6.9725137
      -1.9962958    2.7119343   19.391657     0.01961198  14.607133
      -1.6695905   -4.391516     1.3131028   -6.670972    -5.888604
      12.0612335    5.9285784    3.3715196    1.492534    10.723728
      -0.95514804 -12.085431  ]
    # get the score between enroll and test
    Eembeddings Score: 0.4292638301849365
  ```
 ### 4.Pretrained Models
--- a/demos/speaker_verification/README_cn.md
+++ b/demos/speaker_verification/README_cn.md
@ -29,6 +29,11 @@ wget -c https://paddlespeech.bj.bcebos.com/vector/audio/85236145389.wav
  paddlespeech vector --task spk --input vec.job
  echo -e "demo2 85236145389.wav \n demo3 85236145389.wav" | paddlespeech vector --task spk
  paddlespeech vector --task score --input "./85236145389.wav ./123456789.wav"
  echo -e "demo4 85236145389.wav 85236145389.wav \n demo5 85236145389.wav 123456789.wav" > vec.job
  paddlespeech vector --task score --input vec.job
  ```
  使用方法：
@ -37,6 +42,7 @@ wget -c https://paddlespeech.bj.bcebos.com/vector/audio/85236145389.wav
  ```
  参数：
  - `input`(必须输入)：用于识别的音频文件。
  - `task` (必须输入): 用于指定 `vector` 处理的具体任务，默认是 `spk`。
  - `model`：声纹任务的模型，默认值：`ecapatdnn_voxceleb12`。
  - `sample_rate`：音频采样率，默认值：`16000`。
  - `config`：声纹任务的参数文件，若不设置则使用预训练模型中的默认配置，默认值：`None`。
@ -45,45 +51,45 @@ wget -c https://paddlespeech.bj.bcebos.com/vector/audio/85236145389.wav
  输出：
  ```bash
-  demo  [ -5.749211     9.505463    -8.200284    -5.2075014    5.3940268
+  demo  [  1.4217498    5.626253    -5.342073     1.1773866    3.308055
-      -3.04878      1.611095    10.127234   -10.534177   -15.821609
+    1.756596     5.167894    10.80636     -3.8226728   -5.6141334
-      1.2032688   -0.35080156   1.2629458  -12.643498    -2.5758228
+    2.623845    -0.8072968    1.9635103   -7.3128724    0.01103897
-    -11.343508     2.3385992   -8.719341    14.213509    15.404744
+    -9.723131     0.6619743   -6.976803    10.213478     7.494748
-      -0.39327756   6.338786     2.688887     8.7104025   17.469526
+    2.9105635    3.8949256    3.7999806    7.1061673   16.905321
-      -8.77959      7.0576906    4.648855    -1.3089896  -23.294737
+    -7.1493764    8.733103     3.4230042   -4.831653   -11.403367
-      8.013747    13.891729    -9.926753     5.655307    -5.9422326
+    11.232214     7.1274667   -4.2828417    2.452362    -5.130748
-    -22.842539     0.6293588  -18.46266    -10.811862     9.8192625
+    -18.177666    -2.6116815  -11.000337    -6.7314315    1.6564683
-      3.0070958    3.8072643   -2.3861165    3.0821571  -14.739942
+    0.7618269    1.1253023   -2.083836     4.725744    -8.782597
-      1.7594414   -0.6485091    4.485623     2.0207152    7.264915
+    -3.539873     3.814236     5.1420674    2.162061     4.096431
-      -6.40137     23.63524      2.9711294  -22.708025     9.93719
+    -6.4162116   12.747448     1.9429878  -15.152943     6.417416
-      20.354511   -10.324688    -0.700492    -8.783211    -5.27593
+    16.097002    -9.716668    -1.9920526   -3.3649497   -1.871939
-      15.999649     3.3004563   12.747926    15.429879     4.7849145
+    11.567354     3.69788     11.258265     7.442363     9.183411
-      5.6699696   -2.3826702   10.605882     3.9112158    3.1500628
+    4.5281515   -1.2417862    4.3959084    6.6727695    5.8898783
-      15.859915    -2.1832209  -23.908653    -6.4799504   -4.5365124
+    7.627124    -0.66919386 -11.889693    -9.208865    -7.4274073
-      -9.224193    14.568347   -10.568833     4.982321    -4.342062
+    -3.7776625    6.917234    -9.848748    -2.0944717   -5.135116
-      0.0914714   12.645902    -5.74285     -3.2141201   -2.7173362
+    0.49563864   9.317534    -5.9141874   -1.8098574   -0.11738578
-      -6.680575     0.4757669   -5.035051    -6.7964664   16.865469
+    -7.169265    -1.0578263   -5.7216787   -5.1173844   16.137651
-    -11.54324      7.681869     0.44475392   9.708182    -8.932846
+    -4.473626     7.6624317   -0.55381083   9.631587    -6.4704556
-      0.4123232   -4.361452     1.3948607    9.511665     0.11667654
+    -8.548508     4.3716145   -0.79702514   4.478997    -2.9758704
-      2.9079323    6.049952     9.275183   -18.078873     6.2983274
+    3.272176     2.8382776    5.134597    -9.190781    -0.5657382
-      -0.7500531   -2.725033    -7.6027865    3.3404543    2.990815
+    -4.8745747    2.3165567   -5.984303    -2.1798875    0.35541576
-      4.010979    11.000591    -2.8873312    7.1352735  -16.79663
+    -0.31784213   9.493548     2.1144536    4.358092   -12.089823
-      18.495346   -14.293832     7.89578      2.2714825   22.976387
+    8.451689    -7.925461     4.6242585    4.4289427   18.692003
-      -4.875734    -3.0836344   -2.9999814   13.751918     6.448228
+    -2.6204622   -5.149185    -0.35821092   8.488551     4.981496
-    -11.924197     2.171869     2.0423572   -6.173772    10.778437
+    -9.32683     -2.2544234    6.6417594    1.2119585   10.977129
-      25.77281     -4.9495463   14.57806      0.3044315    2.6132357
+    16.555033     3.3238444    9.551863    -1.6676947   -0.79539716
-      -7.591999    -2.076944     9.025118     1.7834753   -3.1799617
+    -8.605674    -0.47356385   2.6741948   -5.359179    -2.6673796
-      -4.9401326   23.465864     5.1685796   -9.018578     9.037825
+    0.66607     15.443222     4.740594    -3.4725387   11.592567
-      -4.4150195    6.859591   -12.274467    -0.88911164   5.186309
+    -2.054497     1.7361217   -8.265324    -9.30447      5.4068313
-      -3.9988663  -13.638606    -9.925445    -0.06329413  -3.6709652
+    -1.5180256   -7.746615    -6.089606     0.07112726  -0.34904733
-    -12.397416   -12.719869    -1.395601     2.1150916    5.7381287
+    -8.649895    -9.998958    -2.564841    -0.53999114   2.601808
-      -4.4691963   -3.82819     -0.84233856  -1.1604277  -13.490127
+    -0.31927416  -1.8815292   -2.07215     -3.4105783   -8.2998085
-      8.731719   -20.778936   -11.495662     5.8033476   -4.752041
+    1.483641   -15.365992    -8.288208     3.8847756   -3.4876456
-      10.833007    -6.717991     4.504732    13.4244375    1.1306485
+    7.3629923    0.4657332    3.132599    12.438889    -1.8337058
-      7.3435574    1.400918    14.704036    -9.501399     7.2315617
+    4.532936     2.7264361   10.145339    -6.521951     2.897153
-      -6.417456     1.3333273   11.872697    -0.30664724   8.8845
+    -3.3925855    5.079156     7.759716     4.677565     5.8457737
-      6.5569253    4.7948146    0.03662816  -8.704245     6.224871
+    2.402413     7.7071047    3.9711342   -6.390043     6.1268735
-      -3.2701402  -11.508579  ]
+    -3.7760346  -11.118123  ]
  ```
 - Python API
@ -98,53 +104,109 @@ wget -c https://paddlespeech.bj.bcebos.com/vector/audio/85236145389.wav
      config=None,  # Set `config` and `ckpt_path` to None to use pretrained model.
      ckpt_path=None,
      audio_file='./85236145389.wav',
      force_yes=False,
      device=paddle.get_device())
  print('Audio embedding Result: \n{}'.format(audio_emb))
  test_emb = vector_executor(
      model='ecapatdnn_voxceleb12',
      sample_rate=16000,
      config=None,  # Set `config` and `ckpt_path` to None to use pretrained model.
      ckpt_path=None,
      audio_file='./123456789.wav',
      device=paddle.get_device())
  print('Test embedding Result: \n{}'.format(test_emb))
  # score range [0, 1]
  score = vector_executor.get_embeddings_score(audio_emb, test_emb)
  print(f"Eembeddings Score: {score}")
  ```
  输出：
  ```bash
  # Vector Result:
-   [ -5.749211     9.505463    -8.200284    -5.2075014    5.3940268
+   Audio embedding Result:
-      -3.04878      1.611095    10.127234   -10.534177   -15.821609
+    [  1.4217498    5.626253    -5.342073     1.1773866    3.308055
-      1.2032688   -0.35080156   1.2629458  -12.643498    -2.5758228
+    1.756596     5.167894    10.80636     -3.8226728   -5.6141334
-    -11.343508     2.3385992   -8.719341    14.213509    15.404744
+    2.623845    -0.8072968    1.9635103   -7.3128724    0.01103897
-      -0.39327756   6.338786     2.688887     8.7104025   17.469526
+    -9.723131     0.6619743   -6.976803    10.213478     7.494748
-      -8.77959      7.0576906    4.648855    -1.3089896  -23.294737
+    2.9105635    3.8949256    3.7999806    7.1061673   16.905321
-      8.013747    13.891729    -9.926753     5.655307    -5.9422326
+    -7.1493764    8.733103     3.4230042   -4.831653   -11.403367
-    -22.842539     0.6293588  -18.46266    -10.811862     9.8192625
+    11.232214     7.1274667   -4.2828417    2.452362    -5.130748
-      3.0070958    3.8072643   -2.3861165    3.0821571  -14.739942
+    -18.177666    -2.6116815  -11.000337    -6.7314315    1.6564683
-      1.7594414   -0.6485091    4.485623     2.0207152    7.264915
+    0.7618269    1.1253023   -2.083836     4.725744    -8.782597
-      -6.40137     23.63524      2.9711294  -22.708025     9.93719
+    -3.539873     3.814236     5.1420674    2.162061     4.096431
-      20.354511   -10.324688    -0.700492    -8.783211    -5.27593
+    -6.4162116   12.747448     1.9429878  -15.152943     6.417416
-      15.999649     3.3004563   12.747926    15.429879     4.7849145
+    16.097002    -9.716668    -1.9920526   -3.3649497   -1.871939
-      5.6699696   -2.3826702   10.605882     3.9112158    3.1500628
+    11.567354     3.69788     11.258265     7.442363     9.183411
-      15.859915    -2.1832209  -23.908653    -6.4799504   -4.5365124
+    4.5281515   -1.2417862    4.3959084    6.6727695    5.8898783
-      -9.224193    14.568347   -10.568833     4.982321    -4.342062
+    7.627124    -0.66919386 -11.889693    -9.208865    -7.4274073
-      0.0914714   12.645902    -5.74285     -3.2141201   -2.7173362
+    -3.7776625    6.917234    -9.848748    -2.0944717   -5.135116
-      -6.680575     0.4757669   -5.035051    -6.7964664   16.865469
+    0.49563864   9.317534    -5.9141874   -1.8098574   -0.11738578
-    -11.54324      7.681869     0.44475392   9.708182    -8.932846
+    -7.169265    -1.0578263   -5.7216787   -5.1173844   16.137651
-      0.4123232   -4.361452     1.3948607    9.511665     0.11667654
+    -4.473626     7.6624317   -0.55381083   9.631587    -6.4704556
-      2.9079323    6.049952     9.275183   -18.078873     6.2983274
+    -8.548508     4.3716145   -0.79702514   4.478997    -2.9758704
-      -0.7500531   -2.725033    -7.6027865    3.3404543    2.990815
+    3.272176     2.8382776    5.134597    -9.190781    -0.5657382
-      4.010979    11.000591    -2.8873312    7.1352735  -16.79663
+    -4.8745747    2.3165567   -5.984303    -2.1798875    0.35541576
-      18.495346   -14.293832     7.89578      2.2714825   22.976387
+    -0.31784213   9.493548     2.1144536    4.358092   -12.089823
-      -4.875734    -3.0836344   -2.9999814   13.751918     6.448228
+    8.451689    -7.925461     4.6242585    4.4289427   18.692003
-    -11.924197     2.171869     2.0423572   -6.173772    10.778437
+    -2.6204622   -5.149185    -0.35821092   8.488551     4.981496
-      25.77281     -4.9495463   14.57806      0.3044315    2.6132357
+    -9.32683     -2.2544234    6.6417594    1.2119585   10.977129
-      -7.591999    -2.076944     9.025118     1.7834753   -3.1799617
+    16.555033     3.3238444    9.551863    -1.6676947   -0.79539716
-      -4.9401326   23.465864     5.1685796   -9.018578     9.037825
+    -8.605674    -0.47356385   2.6741948   -5.359179    -2.6673796
-      -4.4150195    6.859591   -12.274467    -0.88911164   5.186309
+    0.66607     15.443222     4.740594    -3.4725387   11.592567
-      -3.9988663  -13.638606    -9.925445    -0.06329413  -3.6709652
+    -2.054497     1.7361217   -8.265324    -9.30447      5.4068313
-    -12.397416   -12.719869    -1.395601     2.1150916    5.7381287
+    -1.5180256   -7.746615    -6.089606     0.07112726  -0.34904733
-      -4.4691963   -3.82819     -0.84233856  -1.1604277  -13.490127
+    -8.649895    -9.998958    -2.564841    -0.53999114   2.601808
-      8.731719   -20.778936   -11.495662     5.8033476   -4.752041
+    -0.31927416  -1.8815292   -2.07215     -3.4105783   -8.2998085
-      10.833007    -6.717991     4.504732    13.4244375    1.1306485
+    1.483641   -15.365992    -8.288208     3.8847756   -3.4876456
-      7.3435574    1.400918    14.704036    -9.501399     7.2315617
+    7.3629923    0.4657332    3.132599    12.438889    -1.8337058
-      -6.417456     1.3333273   11.872697    -0.30664724   8.8845
+    4.532936     2.7264361   10.145339    -6.521951     2.897153
-      6.5569253    4.7948146    0.03662816  -8.704245     6.224871
+    -3.3925855    5.079156     7.759716     4.677565     5.8457737
-      -3.2701402  -11.508579  ]
+    2.402413     7.7071047    3.9711342   -6.390043     6.1268735
    -3.7760346  -11.118123  ]
    # get the test embedding
    Test embedding Result:
    [ -1.902964     2.0690894   -8.034194     3.5472693    0.18089125
      6.9085927    1.4097427   -1.9487704  -10.021278    -0.20755845
      -8.04332      4.344489     2.3200977  -14.306299     5.184692
    -11.55602     -3.8497238    0.6444722    1.2833948    2.6766639
      0.5878921    0.7946299    1.7207596    2.5791872   14.998469
      -1.3385371   15.031221    -0.8006958    1.99287     -9.52007
      2.435466     4.003221    -4.33817     -4.898601    -5.304714
    -18.033886    10.790787   -12.784645    -5.641755     2.9761686
    -10.566622     1.4839455    6.152458    -5.7195854    2.8603241
      6.112133     8.489869     5.5958056    1.2836679   -1.2293907
      0.89927405   7.0288725   -2.854029    -0.9782962    5.8255906
      14.905906    -5.025907     0.7866458   -4.2444224  -16.354029
      10.521315     0.9604709   -3.3257897    7.144871   -13.592733
      -8.568869    -1.7953678    0.26313916  10.916714    -6.9374123
      1.857403    -6.2746415    2.8154466   -7.2338667   -2.293357
      -0.05452765   5.4287076    5.0849075   -6.690375    -1.6183422
      3.654291     0.94352573  -9.200294    -5.4749465   -3.5235846
      1.3420814    4.240421    -2.772944    -2.8451524   16.311104
      4.2969875   -1.762936   -12.5758915    8.595198    -0.8835239
      -1.5708797    1.568961     1.1413603    3.5032008   -0.45251232
      -6.786333    16.89443      5.3366146   -8.789056     0.6355629
      3.2579517   -3.328322     7.5969577    0.66025066  -6.550468
      -9.148656     2.020372    -0.4615173    1.1965656   -3.8764873
      11.6562195   -6.0750933   12.182899     3.2218833    0.81969476
      5.570001    -3.8459578   -7.205299     7.9262037   -7.6611166
      -5.249467    -2.2671914    7.2658715  -13.298164     4.821147
      -2.7263982   11.691089    -3.8918593   -2.838112    -1.0336838
      -3.8034165    2.8536487   -5.60398     -1.1972581    1.3455094
      -3.4903061    2.2408795    5.5010734   -3.970756    11.99696
      -7.8858757    0.43160373  -5.5059714    4.3426995   16.322706
      11.635366     0.72157705  -9.245714    -3.91465     -4.449838
      -1.5716927    7.713747    -2.2430465   -6.198303   -13.481864
      2.8156567   -5.7812386    5.1456156    2.7289324  -14.505571
      13.270688     3.448231    -7.0659585    4.5886116   -4.466099
      -0.296428   -11.463529    -2.6076477   14.110243    -6.9725137
      -1.9962958    2.7119343   19.391657     0.01961198  14.607133
      -1.6695905   -4.391516     1.3131028   -6.670972    -5.888604
      12.0612335    5.9285784    3.3715196    1.492534    10.723728
      -0.95514804 -12.085431  ]
    # get the score between enroll and test
    Eembeddings Score: 0.4292638301849365
  ```
 ### 4.预训练模型
--- a/demos/speaker_verification/run.sh
+++ b/demos/speaker_verification/run.sh
@ -1,6 +1,9 @@
 #!/bin/bash
 wget -c https://paddlespeech.bj.bcebos.com/vector/audio/85236145389.wav
 wget -c https://paddlespeech.bj.bcebos.com/vector/audio/123456789.wav
-# asr
+# vector
 paddlespeech vector --task spk --input ./85236145389.wav
 paddlespeech vector --task score --input "./85236145389.wav ./123456789.wav"
--- a/demos/speech_recognition/run.sh
+++ b/demos/speech_recognition/run.sh
--- a/demos/speech_server/README_cn.md
+++ b/demos/speech_server/README_cn.md
@ -85,6 +85,10 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
 - 命令行 (推荐使用)
   ```
   paddlespeech_client asr --server_ip 127.0.0.1 --port 8090 --input ./zh.wav
   # 流式ASR
   paddlespeech_client asr_online --server_ip 127.0.0.1 --port 8091 --input ./zh.wav
   ```
    使用帮助:
@ -191,7 +195,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
  ```
-  ### 5. CLS 客户端使用方法
+  ### 6. CLS 客户端使用方法
  **注意：** 初次使用客户端时响应时间会略长
  - 命令行 (推荐使用)
   ```
--- a/docs/source/released_model.md
+++ b/docs/source/released_model.md
@ -6,7 +6,7 @@
 ### Speech Recognition Model
 Acoustic Model | Training Data | Token-based | Size | Descriptions | CER | WER | Hours of speech | Example Link 
 :-------------:| :------------:| :-----: | -----: | :-----: |:-----:| :-----:  | :-----:  | :-----: 
-[Ds2 Online Aishell ASR0 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_online_aishell_ckpt_0.1.1.model.tar.gz) | Aishell Dataset | Char-based | 345 MB  | 2 Conv + 5 LSTM layers with only forward direction | 0.080 |-| 151 h | [D2 Online Aishell ASR0](../../examples/aishell/asr0) 
+[Ds2 Online Aishell ASR0 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_online_aishell_ckpt_0.2.0.model.tar.gz) | Aishell Dataset | Char-based | 345 MB  | 2 Conv + 5 LSTM layers with only forward direction | 0.078 |-| 151 h | [D2 Online Aishell ASR0](../../examples/aishell/asr0) 
 [Ds2 Offline Aishell ASR0 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_aishell_ckpt_0.1.1.model.tar.gz)| Aishell Dataset | Char-based | 306 MB | 2 Conv + 3 bidirectional GRU layers| 0.064 |-| 151 h | [Ds2 Offline Aishell ASR0](../../examples/aishell/asr0) 
 [Conformer Online Aishell ASR1 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr1/asr1_chunk_conformer_aishell_ckpt_0.1.2.model.tar.gz) | Aishell Dataset | Char-based | 189 MB  | Encoder:Conformer, Decoder:Transformer, Decoding method: Attention rescoring | 0.0565 |-| 151 h | [Conformer Online Aishell ASR1](../../examples/aishell/asr1) 
 [Conformer Offline Aishell ASR1 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr1/asr1_conformer_aishell_ckpt_0.1.2.model.tar.gz) | Aishell Dataset | Char-based | 189 MB  | Encoder:Conformer, Decoder:Transformer, Decoding method: Attention rescoring | 0.0483 |-| 151 h | [Conformer Offline Aishell ASR1](../../examples/aishell/asr1) 
@ -37,8 +37,8 @@ Model Type | Dataset| Example Link | Pretrained Models|Static Models|Size (stati
 Tacotron2|LJSpeech|[tacotron2-ljspeech](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/ljspeech/tts0)|[tacotron2_ljspeech_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/tacotron2/tacotron2_ljspeech_ckpt_0.2.0.zip)|||
 Tacotron2|CSMSC|[tacotron2-csmsc](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/tts0)|[tacotron2_csmsc_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/tacotron2/tacotron2_csmsc_ckpt_0.2.0.zip)|[tacotron2_csmsc_static_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/tacotron2/tacotron2_csmsc_static_0.2.0.zip)|103MB|
 TransformerTTS| LJSpeech| [transformer-ljspeech](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/ljspeech/tts1)|[transformer_tts_ljspeech_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/transformer_tts/transformer_tts_ljspeech_ckpt_0.4.zip)|||
-SpeedySpeech| CSMSC | [speedyspeech-csmsc](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/tts2) |[speedyspeech_nosil_baker_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/speedyspeech/speedyspeech_nosil_baker_ckpt_0.5.zip)|[speedyspeech_nosil_baker_static_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/speedyspeech/speedyspeech_nosil_baker_static_0.5.zip)|12MB|
+SpeedySpeech| CSMSC | [speedyspeech-csmsc](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/tts2)|[speedyspeech_csmsc_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/speedyspeech/speedyspeech_csmsc_ckpt_0.2.0.zip)|[speedyspeech_csmsc_static_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/speedyspeech/speedyspeech_csmsc_static_0.2.0.zip)|12MB|
-FastSpeech2| CSMSC |[fastspeech2-csmsc](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/tts3)|[fastspeech2_nosil_baker_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_baker_ckpt_0.4.zip)|[fastspeech2_nosil_baker_static_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_baker_static_0.4.zip)|157MB|
+FastSpeech2| CSMSC |[fastspeech2-csmsc](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/tts3)|[fastspeech2_nosil_baker_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_baker_ckpt_0.4.zip)|[fastspeech2_csmsc_static_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_csmsc_static_0.2.0.zip)|157MB|
 FastSpeech2-Conformer| CSMSC |[fastspeech2-csmsc](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/tts3)|[fastspeech2_conformer_baker_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_conformer_baker_ckpt_0.5.zip)|||
 FastSpeech2| AISHELL-3 |[fastspeech2-aishell3](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/tts3)|[fastspeech2_nosil_aishell3_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_aishell3_ckpt_0.4.zip)|||
 FastSpeech2| LJSpeech |[fastspeech2-ljspeech](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/ljspeech/tts3)|[fastspeech2_nosil_ljspeech_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_ljspeech_ckpt_0.5.zip)|||
@ -80,7 +80,7 @@ PANN | ESC-50 |[pann-esc50](../../examples/esc50/cls0)|[esc50_cnn6.tar.gz](https
 Model Type | Dataset| Example Link | Pretrained Models | Static Models 
 :-------------:| :------------:| :-----: | :-----: | :-----:
-PANN | VoxCeleb| [voxceleb_ecapatdnn](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/voxceleb/sv0) | [ecapatdnn.tar.gz](https://paddlespeech.bj.bcebos.com/vector/voxceleb/sv0_ecapa_tdnn_voxceleb12_ckpt_0_1_1.tar.gz) | -
+PANN | VoxCeleb| [voxceleb_ecapatdnn](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/voxceleb/sv0) | [ecapatdnn.tar.gz](https://paddlespeech.bj.bcebos.com/vector/voxceleb/sv0_ecapa_tdnn_voxceleb12_ckpt_0_2_0.tar.gz) | -
 ## Punctuation Restoration Models
 Model Type | Dataset| Example Link | Pretrained Models
--- a/examples/aishell/asr0/README.md
+++ b/examples/aishell/asr0/README.md
@ -151,21 +151,14 @@ avg.sh best exp/deepspeech2/checkpoints 1
 CUDA_VISIBLE_DEVICES= ./local/test.sh conf/deepspeech2.yaml exp/deepspeech2/checkpoints/avg_1
 ```
 ## Pretrained Model
-You can get the pretrained transformer or conformer using the scripts below:
+You can get the pretrained models from [this](../../../docs/source/released_model.md).
 ```bash
 Deepspeech2 offline:
 wget https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/ds2.model.tar.gz
 Deepspeech2 online:
 wget https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/aishell_ds2_online_cer8.00_release.tar.gz
 ```
 using the `tar` scripts to unpack the model and then you can use the script to test the model.
 For example:
 ```
-wget https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/ds2.model.tar.gz
+wget https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_aishell_ckpt_0.1.1.model.tar.gz
-tar xzvf ds2.model.tar.gz
+tar xzvf asr0_deepspeech2_aishell_ckpt_0.1.1.model.tar.gz
 source path.sh
 # If you have process the data and get the manifest file， you can skip the following 2 steps
 bash local/data.sh --stage -1 --stop_stage -1
@ -173,12 +166,7 @@ bash local/data.sh --stage 2  --stop_stage 2
 CUDA_VISIBLE_DEVICES= ./local/test.sh conf/deepspeech2.yaml exp/deepspeech2/checkpoints/avg_1
 ```
-The performance of the released models are shown below:
+The performance of the released models are shown in [this](./RESULTS.md)
 |         Acoustic Model         |  Training Data  | Token-based |   Size | Descriptions                                       | CER   | WER  | Hours of speech |
 | :----------------------------: | :-------------: | :---------: | -----: | :------------------------------------------------- | :---- | :--- | :-------------- |
 | Ds2 Online Aishell ASR0 Model  | Aishell Dataset | Char-based  | 345 MB | 2 Conv + 5 LSTM layers with only forward direction | 0.080 | -    | 151 h           |
 | Ds2 Offline Aishell ASR0 Model | Aishell Dataset | Char-based  | 306 MB | 2 Conv + 3 bidirectional GRU layers                | 0.064 | -    | 151 h           |
 ## Stage 4: Static graph model Export
 This stage is to transform dygraph to static graph.
 ```bash
@ -214,8 +202,8 @@ if [ ${stage} -le 6 ] && [ ${stop_stage} -ge 6 ]; then
 ```
 you can train the model by yourself, or you can download the pretrained model by the script below:
 ```bash
-wget https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/ds2.model.tar.gz
+wget https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_aishell_ckpt_0.1.1.model.tar.gz
-tar xzvf ds2.model.tar.gz
+tar xzvf asr0_deepspeech2_aishell_ckpt_0.1.1.model.tar.gz
 ```
 You can download the audio demo:
 ```bash
--- a/examples/aishell/asr0/RESULTS.md
+++ b/examples/aishell/asr0/RESULTS.md
@ -4,15 +4,16 @@
 | Model | Number of Params | Release | Config | Test set | Valid Loss | CER | 
 | --- | --- | --- | --- | --- | --- | --- | 
-| DeepSpeech2 | 45.18M | 2.2.0 | conf/deepspeech2_online.yaml + spec aug | test | 7.994938373565674 | 0.080 |  
+| DeepSpeech2 | 45.18M | r0.2.0 | conf/deepspeech2_online.yaml + spec aug | test | 7.708217620849609| 0.078 |
 | DeepSpeech2 | 45.18M | v2.2.0 | conf/deepspeech2_online.yaml + spec aug | test | 7.994938373565674 | 0.080 |  
 ## Deepspeech2 Non-Streaming
 | Model | Number of Params | Release | Config | Test set | Valid Loss | CER |  
 | --- | --- | --- | --- | --- | --- | --- |  
-| DeepSpeech2 | 58.4M | 2.2.0 | conf/deepspeech2.yaml + spec aug | test | 5.738585948944092 | 0.064000 |  
+| DeepSpeech2 | 58.4M | v2.2.0 | conf/deepspeech2.yaml + spec aug | test | 5.738585948944092 | 0.064000 |  
-| DeepSpeech2 | 58.4M | 2.1.0 | conf/deepspeech2.yaml + spec aug | test | 7.483316898345947 | 0.077860 |  
+| DeepSpeech2 | 58.4M | v2.1.0 | conf/deepspeech2.yaml + spec aug | test | 7.483316898345947 | 0.077860 |  
-| DeepSpeech2 | 58.4M | 2.1.0 | conf/deepspeech2.yaml | test | 7.299022197723389 | 0.078671 |
+| DeepSpeech2 | 58.4M | v2.1.0 | conf/deepspeech2.yaml | test | 7.299022197723389 | 0.078671 |
-| DeepSpeech2 | 58.4M | 2.0.0 | conf/deepspeech2.yaml | test | - | 0.078977 |  
+| DeepSpeech2 | 58.4M | v2.0.0 | conf/deepspeech2.yaml | test | - | 0.078977 |  
 | --- | --- | --- | --- | --- | --- | --- |  
-| DeepSpeech2 | 58.4M | 1.8.5 | - | test | - | 0.080447 |
+| DeepSpeech2 | 58.4M | v1.8.5 | - | test | - | 0.080447 |
--- a/examples/aishell/asr0/run.sh
+++ b/examples/aishell/asr0/run.sh
@ -5,7 +5,7 @@ source path.sh
 gpus=0,1,2,3
 stage=0
 stop_stage=100
-conf_path=conf/deepspeech2.yaml    #conf/deepspeech2.yaml or conf/deepspeeech2_online.yaml
+conf_path=conf/deepspeech2.yaml    #conf/deepspeech2.yaml or conf/deepspeech2_online.yaml
 decode_conf_path=conf/tuning/decode.yaml
 avg_num=1
 model_type=offline    # offline or online
--- a/examples/aishell/asr1/README.md
+++ b/examples/aishell/asr1/README.md
@ -143,25 +143,14 @@ avg.sh best exp/conformer/checkpoints 20
 CUDA_VISIBLE_DEVICES= ./local/test.sh conf/conformer.yaml exp/conformer/checkpoints/avg_20
 ```
 ## Pretrained Model
-You can get the pretrained transformer or conformer using the scripts below:
+You can get the pretrained transformer or conformer from [this](../../../docs/source/released_model.md)
 ```bash
 # Conformer:
 wget https://deepspeech.bj.bcebos.com/release2.1/aishell/s1/aishell.release.tar.gz
 # Chunk Conformer:
 wget https://deepspeech.bj.bcebos.com/release2.1/aishell/s1/aishell.chunk.release.tar.gz
 # Transformer:
 wget https://paddlespeech.bj.bcebos.com/s2t/aishell/asr1/transformer.model.tar.gz
 ```
 using the `tar` scripts to unpack the model and then you can use the script to test the model.
 For example:
 ```
-wget https://paddlespeech.bj.bcebos.com/s2t/aishell/asr1/transformer.model.tar.gz
+wget https://paddlespeech.bj.bcebos.com/s2t/aishell/asr1/asr1_transformer_aishell_ckpt_0.1.1.model.tar.gz
-tar xzvf transformer.model.tar.gz
+tar xzvf asr1_transformer_aishell_ckpt_0.1.1.model.tar.gz
 source path.sh
 # If you have process the data and get the manifest file， you can skip the following 2 steps
 bash local/data.sh --stage -1 --stop_stage -1
@ -206,7 +195,7 @@ In some situations, you want to use the trained model to do the inference for th
 ```
 you can train the model by yourself using ```bash run.sh --stage 0 --stop_stage 3```, or you can download the pretrained model through the script below:
 ```bash
-wget https://paddlespeech.bj.bcebos.com/s2t/aishell/asr1/transformer.model.tar.gz
+wget https://paddlespeech.bj.bcebos.com/s2t/aishell/asr1/asr1_transformer_aishell_ckpt_0.1.1.model.tar.gz
 tar xzvf transformer.model.tar.gz
 ```
 You can download the audio demo:
--- a/examples/aishell3/vc0/README.md
+++ b/examples/aishell3/vc0/README.md
@ -118,7 +118,7 @@ CUDA_VISIBLE_DEVICES=${gpus} ./local/voice_cloning.sh ${conf_path} ${train_outpu
 ```
 ## Pretrained Model
-[tacotron2_aishell3_ckpt_vc0_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/tacotron2/tacotron2_aishell3_ckpt_vc0_0.2.0.zip)
+- [tacotron2_aishell3_ckpt_vc0_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/tacotron2/tacotron2_aishell3_ckpt_vc0_0.2.0.zip)
 Model | Step | eval/loss | eval/l1_loss | eval/mse_loss | eval/bce_loss| eval/attn_loss
--- a/examples/aishell3/vc1/README.md
+++ b/examples/aishell3/vc1/README.md
@ -119,7 +119,7 @@ ref_audio
 CUDA_VISIBLE_DEVICES=${gpus} ./local/voice_cloning.sh ${conf_path} ${train_output_path} ${ckpt_name} ${ge2e_params_path} ${ref_audio_dir}
 ```
 ## Pretrained Model
-[fastspeech2_nosil_aishell3_vc1_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_aishell3_vc1_ckpt_0.5.zip)
+- [fastspeech2_nosil_aishell3_vc1_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_aishell3_vc1_ckpt_0.5.zip)
 Model | Step | eval/loss | eval/l1_loss | eval/duration_loss | eval/pitch_loss| eval/energy_loss 
 :-------------:| :------------:| :-----: | :-----: | :--------: |:--------:|:---------:
--- a/examples/aishell3/voc1/README.md
+++ b/examples/aishell3/voc1/README.md
@ -137,7 +137,8 @@ optional arguments:
 5. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.
 ## Pretrained Models
-Pretrained models can be downloaded here [pwg_aishell3_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_aishell3_ckpt_0.5.zip).
+Pretrained models can be downloaded here:
 - [pwg_aishell3_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_aishell3_ckpt_0.5.zip)
 Model | Step | eval/generator_loss | eval/log_stft_magnitude_loss:| eval/spectral_convergence_loss 
 :-------------:| :------------:| :-----: | :-----: | :--------:
--- a/examples/aishell3/voc5/README.md
+++ b/examples/aishell3/voc5/README.md
@ -136,7 +136,8 @@ optional arguments:
 4. `--output-dir` is the directory to save the synthesized audio files.
 5. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.
 ## Pretrained Models
-The pretrained model can be downloaded here [hifigan_aishell3_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_aishell3_ckpt_0.2.0.zip).
+The pretrained model can be downloaded here:
 - [hifigan_aishell3_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_aishell3_ckpt_0.2.0.zip)
 Model | Step | eval/generator_loss | eval/mel_loss| eval/feature_matching_loss
--- a/examples/ami/sd0/conf/ecapa_tdnn.yaml
+++ b/examples/ami/sd0/conf/ecapa_tdnn.yaml
@ -0,0 +1,62 @@
 ###########################################################
 #                AMI DATA PREPARE SETTING               #
 ###########################################################
 split_type: 'full_corpus_asr'
 skip_TNO: True
 # Options for mic_type: 'Mix-Lapel', 'Mix-Headset', 'Array1', 'Array1-01', 'BeamformIt'
 mic_type: 'Mix-Headset'
 vad_type: 'oracle'
 max_subseg_dur: 3.0
 overlap: 1.5
 # Some more exp folders (for cleaner structure).
 embedding_dir: emb #!ref <save_folder>/emb
 meta_data_dir: metadata #!ref <save_folder>/metadata
 ref_rttm_dir: ref_rttms #!ref <save_folder>/ref_rttms
 sys_rttm_dir: sys_rttms #!ref <save_folder>/sys_rttms
 der_dir: DER #!ref <save_folder>/DER
 ###########################################################
 #                FEATURE EXTRACTION SETTING               #
 ###########################################################
 # currently, we only support fbank
 sr: 16000           # sample rate
 n_mels: 80
 window_size: 400     #25ms, sample rate 16000, 25 * 16000 / 1000 = 400 
 hop_size: 160        #10ms, sample rate 16000, 10 * 16000 / 1000 = 160
 #left_frames: 0
 #right_frames: 0
 #deltas: False
 ###########################################################
 #                       MODEL SETTING                     #
 ###########################################################
 # currently, we only support ecapa-tdnn in the ecapa_tdnn.yaml
 # if we want use another model, please choose another configuration yaml file
 seed: 1234
 emb_dim: 192
 batch_size: 16
 model:
  input_size: 80
  channels: [1024, 1024, 1024, 1024, 3072]
  kernel_sizes: [5, 3, 3, 3, 1]
  dilations: [1, 2, 3, 4, 1]
  attention_channels: 128
  lin_neurons: 192
 # Will automatically download ECAPA-TDNN model (best).
 ###########################################################
 #               SPECTRAL CLUSTERING SETTING               #
 ###########################################################
 backend: 'SC' # options: 'kmeans' # Note: kmeans goes only with cos affinity
 affinity: 'cos'  # options: cos, nn
 max_num_spkrs: 10
 oracle_n_spkrs: True
 ###########################################################
 #                  DER EVALUATION SETTING                 #
 ###########################################################
 ignore_overlap: True
 forgiveness_collar: 0.25
--- a/examples/ami/sd0/local/compute_embdding.py
+++ b/examples/ami/sd0/local/compute_embdding.py
@ -0,0 +1,231 @@
 # Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
 # You may obtain a copy of the License at
 #
 #     http://www.apache.org/licenses/LICENSE-2.0
 #
 # Unless required by applicable law or agreed to in writing, software
 # distributed under the License is distributed on an "AS IS" BASIS,
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
 import argparse
 import json
 import os
 import pickle
 import sys
 import numpy as np
 import paddle
 from paddle.io import BatchSampler
 from paddle.io import DataLoader
 from tqdm.contrib import tqdm
 from yacs.config import CfgNode
 from paddlespeech.s2t.utils.log import Log
 from paddlespeech.vector.cluster.diarization import EmbeddingMeta
 from paddlespeech.vector.io.batch import batch_feature_normalize
 from paddlespeech.vector.io.dataset_from_json import JSONDataset
 from paddlespeech.vector.models.ecapa_tdnn import EcapaTdnn
 from paddlespeech.vector.modules.sid_model import SpeakerIdetification
 from paddlespeech.vector.training.seeding import seed_everything
 # Logger setup
 logger = Log(__name__).getlog()
 def prepare_subset_json(full_meta_data, rec_id, out_meta_file):
    """Prepares metadata for a given recording ID.
    Arguments
    ---------
    full_meta_data : json
        Full meta (json) containing all the recordings
    rec_id : str
        The recording ID for which meta (json) has to be prepared
    out_meta_file : str
        Path of the output meta (json) file.
    """
    subset = {}
    for key in full_meta_data:
        k = str(key)
        if k.startswith(rec_id):
            subset[key] = full_meta_data[key]
    with open(out_meta_file, mode="w") as json_f:
        json.dump(subset, json_f, indent=2)
 def create_dataloader(json_file, batch_size):
    """Creates the datasets and their data processing pipelines.
    This is used for multi-mic processing.
    """
    # create datasets
    dataset = JSONDataset(
        json_file=json_file,
        feat_type='melspectrogram',
        n_mels=config.n_mels,
        window_size=config.window_size,
        hop_length=config.hop_size)
    # create dataloader
    batch_sampler = BatchSampler(dataset, batch_size=batch_size, shuffle=True)
    dataloader = DataLoader(dataset,
                            batch_sampler=batch_sampler,
                            collate_fn=lambda x: batch_feature_normalize(
                                x, mean_norm=True, std_norm=False),
                            return_list=True)
    return dataloader
 def main(args, config):
    # set the training device, cpu or gpu
    paddle.set_device(args.device)
    # set the random seed
    seed_everything(config.seed)
    # stage1: build the dnn backbone model network
    ecapa_tdnn = EcapaTdnn(**config.model)
    # stage2: build the speaker verification eval instance with backbone model
    model = SpeakerIdetification(backbone=ecapa_tdnn, num_class=1)
    # stage3: load the pre-trained model
    #         we get the last model from the epoch and save_interval
    args.load_checkpoint = os.path.abspath(
        os.path.expanduser(args.load_checkpoint))
    # load model checkpoint to sid model
    state_dict = paddle.load(
        os.path.join(args.load_checkpoint, 'model.pdparams'))
    model.set_state_dict(state_dict)
    logger.info(f'Checkpoint loaded from {args.load_checkpoint}')
    # set the model to eval mode
    model.eval()
    # load meta data
    meta_file = os.path.join(
        args.data_dir,
        config.meta_data_dir,
        "ami_" + args.dataset + "." + config.mic_type + ".subsegs.json", )
    with open(meta_file, "r") as f:
        full_meta = json.load(f)
    # get all the recording IDs in this dataset.
    all_keys = full_meta.keys()
    A = [word.rstrip().split("_")[0] for word in all_keys]
    all_rec_ids = list(set(A[1:]))
    all_rec_ids.sort()
    split = "AMI_" + args.dataset
    i = 1
    msg = "Extra embdding for " + args.dataset + " set"
    logger.info(msg)
    if len(all_rec_ids) <= 0:
        msg = "No recording IDs found! Please check if meta_data json file is properly generated."
        logger.error(msg)
        sys.exit()
    # extra different recordings embdding in a dataset.
    for rec_id in tqdm(all_rec_ids):
        # This tag will be displayed in the log.
        tag = ("[" + str(args.dataset) + ": " + str(i) + "/" +
               str(len(all_rec_ids)) + "]")
        i = i + 1
        # log message.
        msg = "Embdding %s : %s " % (tag, rec_id)
        logger.debug(msg)
        # embedding directory.
        if not os.path.exists(
                os.path.join(args.data_dir, config.embedding_dir, split)):
            os.makedirs(
                os.path.join(args.data_dir, config.embedding_dir, split))
        # file to store embeddings.
        emb_file_name = rec_id + "." + config.mic_type + ".emb_stat.pkl"
        diary_stat_emb_file = os.path.join(args.data_dir, config.embedding_dir,
                                           split, emb_file_name)
        # prepare a metadata (json) for one recording. This is basically a subset of full_meta.
        # lets keep this meta-info in embedding directory itself.
        json_file_name = rec_id + "." + config.mic_type + ".json"
        meta_per_rec_file = os.path.join(args.data_dir, config.embedding_dir,
                                         split, json_file_name)
        # write subset (meta for one recording) json metadata.
        prepare_subset_json(full_meta, rec_id, meta_per_rec_file)
        # prepare data loader.
        diary_set_loader = create_dataloader(meta_per_rec_file,
                                             config.batch_size)
        # extract embeddings (skip if already done).
        if not os.path.isfile(diary_stat_emb_file):
            logger.debug("Extracting deep embeddings")
            embeddings = np.empty(shape=[0, config.emb_dim], dtype=np.float64)
            segset = []
            for batch_idx, batch in enumerate(tqdm(diary_set_loader)):
                # extrac the audio embedding
                ids, feats, lengths = batch['ids'], batch['feats'], batch[
                    'lengths']
                seg = [x for x in ids]
                segset = segset + seg
                emb = model.backbone(feats, lengths).squeeze(
                    -1).numpy()  # (N, emb_size, 1) -> (N, emb_size)
                embeddings = np.concatenate((embeddings, emb), axis=0)
            segset = np.array(segset, dtype="|O")
            stat_obj = EmbeddingMeta(
                segset=segset,
                stats=embeddings, )
            logger.debug("Saving Embeddings...")
            with open(diary_stat_emb_file, "wb") as output:
                pickle.dump(stat_obj, output)
        else:
            logger.debug("Skipping embedding extraction (as already present).")
 # Begin experiment!
 if __name__ == "__main__":
    parser = argparse.ArgumentParser(__doc__)
    parser.add_argument(
        '--device',
        default="gpu",
        help="Select which device to perform diarization, defaults to gpu.")
    parser.add_argument(
        "--config", default=None, type=str, help="configuration file")
    parser.add_argument(
        "--data-dir",
        default="../save/",
        type=str,
        help="processsed data directory")
    parser.add_argument(
        "--dataset",
        choices=['dev', 'eval'],
        default="dev",
        type=str,
        help="Select which dataset to extra embdding, defaults to dev")
    parser.add_argument(
        "--load-checkpoint",
        type=str,
        default='',
        help="Directory to load model checkpoint to compute embeddings.")
    args = parser.parse_args()
    config = CfgNode(new_allowed=True)
    if args.config:
        config.merge_from_file(args.config)
    config.freeze()
    main(args, config)
--- a/examples/ami/sd0/local/data.sh
+++ b/examples/ami/sd0/local/data.sh
@ -1,49 +0,0 @@
 #!/bin/bash
 stage=1
 TARGET_DIR=${MAIN_ROOT}/dataset/ami
 data_folder=${TARGET_DIR}/amicorpus #e.g., /path/to/amicorpus/
 manual_annot_folder=${TARGET_DIR}/ami_public_manual_1.6.2 #e.g., /path/to/ami_public_manual_1.6.2/
 save_folder=${MAIN_ROOT}/examples/ami/sd0/data
 ref_rttm_dir=${save_folder}/ref_rttms
 meta_data_dir=${save_folder}/metadata
 set=L
 . ${MAIN_ROOT}/utils/parse_options.sh || exit 1;
 set -u
 set -o pipefail
 mkdir -p ${save_folder}
 if [ ${stage} -le 0 ]; then
    # Download AMI corpus, You need around 10GB of free space to get whole data
    # The signals are too large to package in this way,
    # so you need to use the chooser to indicate which ones you wish to download
    echo "Please follow https://groups.inf.ed.ac.uk/ami/download/ to download the data."
    echo "Annotations: AMI manual annotations v1.6.2 "
    echo "Signals: "
    echo "1) Select one or more AMI meetings: the IDs please follow ./ami_split.py"
    echo "2) Select media streams: Just select Headset mix"
    exit 0;
 fi
 if [ ${stage} -le 1 ]; then
    echo "AMI Data preparation"
    python local/ami_prepare.py  --data_folder ${data_folder} \
            --manual_annot_folder ${manual_annot_folder} \
            --save_folder ${save_folder} --ref_rttm_dir ${ref_rttm_dir} \
            --meta_data_dir ${meta_data_dir} 
    if [ $? -ne 0 ]; then
        echo "Prepare AMI failed. Please check log message."
        exit 1
    fi
 fi
 echo "AMI data preparation done."
 exit 0
--- a/examples/ami/sd0/local/experiment.py
+++ b/examples/ami/sd0/local/experiment.py
@ -0,0 +1,428 @@
 # Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
 # You may obtain a copy of the License at
 #
 #     http://www.apache.org/licenses/LICENSE-2.0
 #
 # Unless required by applicable law or agreed to in writing, software
 # distributed under the License is distributed on an "AS IS" BASIS,
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
 import argparse
 import glob
 import json
 import os
 import pickle
 import shutil
 import sys
 import numpy as np
 from tqdm.contrib import tqdm
 from yacs.config import CfgNode
 from paddlespeech.s2t.utils.log import Log
 from paddlespeech.vector.cluster import diarization as diar
 from utils.DER import DER
 # Logger setup
 logger = Log(__name__).getlog()
 def diarize_dataset(
        full_meta,
        split_type,
        n_lambdas,
        pval,
        save_dir,
        config,
        n_neighbors=10, ):
    """This function diarizes all the recordings in a given dataset. It performs
    computation of embedding and clusters them using spectral clustering (or other backends).
    The output speaker boundary file is stored in the RTTM format.
    """
    # prepare `spkr_info` only once when Oracle num of speakers is selected.
    # spkr_info is essential to obtain number of speakers from groundtruth.
    if config.oracle_n_spkrs is True:
        full_ref_rttm_file = os.path.join(save_dir, config.ref_rttm_dir,
                                          "fullref_ami_" + split_type + ".rttm")
        rttm = diar.read_rttm(full_ref_rttm_file)
        spkr_info = list(  # noqa F841
            filter(lambda x: x.startswith("SPKR-INFO"), rttm))
    # get all the recording IDs in this dataset.
    all_keys = full_meta.keys()
    A = [word.rstrip().split("_")[0] for word in all_keys]
    all_rec_ids = list(set(A[1:]))
    all_rec_ids.sort()
    split = "AMI_" + split_type
    i = 1
    # adding tag for directory path.
    type_of_num_spkr = "oracle" if config.oracle_n_spkrs else "est"
    tag = (type_of_num_spkr + "_" + str(config.affinity) + "_" + config.backend)
    # make out rttm dir
    out_rttm_dir = os.path.join(save_dir, config.sys_rttm_dir, config.mic_type,
                                split, tag)
    if not os.path.exists(out_rttm_dir):
        os.makedirs(out_rttm_dir)
    # diarizing different recordings in a dataset.
    for rec_id in tqdm(all_rec_ids):
        # this tag will be displayed in the log.
        tag = ("[" + str(split_type) + ": " + str(i) + "/" +
               str(len(all_rec_ids)) + "]")
        i = i + 1
        # log message.
        msg = "Diarizing %s : %s " % (tag, rec_id)
        logger.debug(msg)
        # load embeddings.
        emb_file_name = rec_id + "." + config.mic_type + ".emb_stat.pkl"
        diary_stat_emb_file = os.path.join(save_dir, config.embedding_dir,
                                           split, emb_file_name)
        if not os.path.isfile(diary_stat_emb_file):
            msg = "Embdding file %s not found! Please check if embdding file is properly generated." % (
                diary_stat_emb_file)
            logger.error(msg)
            sys.exit()
        with open(diary_stat_emb_file, "rb") as in_file:
            diary_obj = pickle.load(in_file)
        out_rttm_file = out_rttm_dir + "/" + rec_id + ".rttm"
        # processing starts from here.
        if config.oracle_n_spkrs is True:
            # oracle num of speakers.
            num_spkrs = diar.get_oracle_num_spkrs(rec_id, spkr_info)
        else:
            if config.affinity == "nn":
                # num of speakers tunned on dev set (only for nn affinity).
                num_spkrs = n_lambdas
            else:
                # num of speakers will be estimated using max eigen gap for cos based affinity.
                # so adding None here. Will use this None later-on.
                num_spkrs = None
        if config.backend == "kmeans":
            diar.do_kmeans_clustering(
                diary_obj,
                out_rttm_file,
                rec_id,
                num_spkrs,
                pval, )
        if config.backend == "SC":
            # go for Spectral Clustering (SC).
            diar.do_spec_clustering(
                diary_obj,
                out_rttm_file,
                rec_id,
                num_spkrs,
                pval,
                config.affinity,
                n_neighbors, )
        # can used for AHC later. Likewise one can add different backends here.
        if config.backend == "AHC":
            # call AHC
            threshold = pval  # pval for AHC is nothing but threshold.
            diar.do_AHC(diary_obj, out_rttm_file, rec_id, num_spkrs, threshold)
    # once all RTTM outputs are generated, concatenate individual RTTM files to obtain single RTTM file.
    # this is not needed but just staying with the standards.
    concate_rttm_file = out_rttm_dir + "/sys_output.rttm"
    logger.debug("Concatenating individual RTTM files...")
    with open(concate_rttm_file, "w") as cat_file:
        for f in glob.glob(out_rttm_dir + "/*.rttm"):
            if f == concate_rttm_file:
                continue
            with open(f, "r") as indi_rttm_file:
                shutil.copyfileobj(indi_rttm_file, cat_file)
    msg = "The system generated RTTM file for %s set : %s" % (
        split_type, concate_rttm_file, )
    logger.debug(msg)
    return concate_rttm_file
 def dev_pval_tuner(full_meta, save_dir, config):
    """Tuning p_value for affinity matrix.
    The p_value used so that only p% of the values in each row is retained.
    """
    DER_list = []
    prange = np.arange(0.002, 0.015, 0.001)
    n_lambdas = None  # using it as flag later.
    for p_v in prange:
        # Process whole dataset for value of p_v.
        concate_rttm_file = diarize_dataset(full_meta, "dev", n_lambdas, p_v,
                                            save_dir, config)
        ref_rttm_file = os.path.join(save_dir, config.ref_rttm_dir,
                                     "fullref_ami_dev.rttm")
        sys_rttm_file = concate_rttm_file
        [MS, FA, SER, DER_] = DER(
            ref_rttm_file,
            sys_rttm_file,
            config.ignore_overlap,
            config.forgiveness_collar, )
        DER_list.append(DER_)
        if config.oracle_n_spkrs is True and config.backend == "kmeans":
            # no need of p_val search. Note p_val is needed for SC for both oracle and est num of speakers.
            # p_val is needed in oracle_n_spkr=False when using kmeans backend.
            break
    # Take p_val that gave minmum DER on Dev dataset.
    tuned_p_val = prange[DER_list.index(min(DER_list))]
    return tuned_p_val
 def dev_ahc_threshold_tuner(full_meta, save_dir, config):
    """Tuning threshold for affinity matrix. This function is called when AHC is used as backend.
    """
    DER_list = []
    prange = np.arange(0.0, 1.0, 0.1)
    n_lambdas = None  # using it as flag later.
    # Note: p_val is threshold in case of AHC.
    for p_v in prange:
        # Process whole dataset for value of p_v.
        concate_rttm_file = diarize_dataset(full_meta, "dev", n_lambdas, p_v,
                                            save_dir, config)
        ref_rttm = os.path.join(save_dir, config.ref_rttm_dir,
                                "fullref_ami_dev.rttm")
        sys_rttm = concate_rttm_file
        [MS, FA, SER, DER_] = DER(
            ref_rttm,
            sys_rttm,
            config.ignore_overlap,
            config.forgiveness_collar, )
        DER_list.append(DER_)
        if config.oracle_n_spkrs is True:
            break  # no need of threshold search.
    # Take p_val that gave minmum DER on Dev dataset.
    tuned_p_val = prange[DER_list.index(min(DER_list))]
    return tuned_p_val
 def dev_nn_tuner(full_meta, split_type, save_dir, config):
    """Tuning n_neighbors on dev set. Assuming oracle num of speakers.
    This is used when nn based affinity is selected.
    """
    DER_list = []
    pval = None
    # Now assumming oracle num of speakers.
    n_lambdas = 4
    for nn in range(5, 15):
        # Process whole dataset for value of n_lambdas.
        concate_rttm_file = diarize_dataset(full_meta, "dev", n_lambdas, p_v,
                                            save_dir, config, nn)
        ref_rttm = os.path.join(save_dir, config.ref_rttm_dir,
                                "fullref_ami_dev.rttm")
        sys_rttm = concate_rttm_file
        [MS, FA, SER, DER_] = DER(
            ref_rttm,
            sys_rttm,
            config.ignore_overlap,
            config.forgiveness_collar, )
        DER_list.append([nn, DER_])
        if config.oracle_n_spkrs is True and config.backend == "kmeans":
            break
    DER_list.sort(key=lambda x: x[1])
    tunned_nn = DER_list[0]
    return tunned_nn[0]
 def dev_tuner(full_meta, split_type, save_dir, config):
    """Tuning n_components on dev set. Used for nn based affinity matrix.
    Note: This is a very basic tunning for nn based affinity.
    This is work in progress till we find a better way.
    """
    DER_list = []
    pval = None
    for n_lambdas in range(1, config.max_num_spkrs + 1):
        # Process whole dataset for value of n_lambdas.
        concate_rttm_file = diarize_dataset(full_meta, "dev", n_lambdas, p_v,
                                            save_dir, config)
        ref_rttm = os.path.join(save_dir, config.ref_rttm_dir,
                                "fullref_ami_dev.rttm")
        sys_rttm = concate_rttm_file
        [MS, FA, SER, DER_] = DER(
            ref_rttm,
            sys_rttm,
            config.ignore_overlap,
            config.forgiveness_collar, )
        DER_list.append(DER_)
    # Take n_lambdas with minmum DER.
    tuned_n_lambdas = DER_list.index(min(DER_list)) + 1
    return tuned_n_lambdas
 def main(args, config):
    # AMI Dev Set: Tune hyperparams on dev set.
    # Read the embdding file for dev set generated during embdding compute
    dev_meta_file = os.path.join(
        args.data_dir,
        config.meta_data_dir,
        "ami_dev." + config.mic_type + ".subsegs.json", )
    with open(dev_meta_file, "r") as f:
        meta_dev = json.load(f)
    full_meta = meta_dev
    # Processing starts from here
    # Following few lines selects option for different backend and affinity matrices. Finds best values for hyperameters using dev set.
    ref_rttm_file = os.path.join(args.data_dir, config.ref_rttm_dir,
                                 "fullref_ami_dev.rttm")
    best_nn = None
    if config.affinity == "nn":
        logger.info("Tuning for nn (Multiple iterations over AMI Dev set)")
        best_nn = dev_nn_tuner(full_meta, args.data_dir, config)
    n_lambdas = None
    best_pval = None
    if config.affinity == "cos" and (config.backend == "SC" or
                                     config.backend == "kmeans"):
        # oracle num_spkrs or not, doesn't matter for kmeans and SC backends
        # cos: Tune for the best pval for SC /kmeans (for unknown num of spkrs)
        logger.info(
            "Tuning for p-value for SC (Multiple iterations over AMI Dev set)")
        best_pval = dev_pval_tuner(full_meta, args.data_dir, config)
    elif config.backend == "AHC":
        logger.info("Tuning for threshold-value for AHC")
        best_threshold = dev_ahc_threshold_tuner(full_meta, args.data_dir,
                                                 config)
        best_pval = best_threshold
    else:
        # NN for unknown num of speakers (can be used in future)
        if config.oracle_n_spkrs is False:
            # nn: Tune num of number of components (to be updated later)
            logger.info(
                "Tuning for number of eigen components for NN (Multiple iterations over AMI Dev set)"
            )
            # dev_tuner used for tuning num of components in NN. Can be used in future.
            n_lambdas = dev_tuner(full_meta, args.data_dir, config)
    # load 'dev' and 'eval' metadata files.
    full_meta_dev = full_meta  # current full_meta is for 'dev'
    eval_meta_file = os.path.join(
        args.data_dir,
        config.meta_data_dir,
        "ami_eval." + config.mic_type + ".subsegs.json", )
    with open(eval_meta_file, "r") as f:
        full_meta_eval = json.load(f)
    # tag to be appended to final output DER files. Writing DER for individual files.
    type_of_num_spkr = "oracle" if config.oracle_n_spkrs else "est"
    tag = (
        type_of_num_spkr + "_" + str(config.affinity) + "." + config.mic_type)
    # perform final diarization on 'dev' and 'eval' with best hyperparams.
    final_DERs = {}
    out_der_dir = os.path.join(args.data_dir, config.der_dir)
    if not os.path.exists(out_der_dir):
        os.makedirs(out_der_dir)
    for split_type in ["dev", "eval"]:
        if split_type == "dev":
            full_meta = full_meta_dev
        else:
            full_meta = full_meta_eval
        # performing diarization.
        msg = "Diarizing using best hyperparams: " + split_type + " set"
        logger.info(msg)
        out_boundaries = diarize_dataset(
            full_meta,
            split_type,
            n_lambdas=n_lambdas,
            pval=best_pval,
            n_neighbors=best_nn,
            save_dir=args.data_dir,
            config=config)
        # computing DER.
        msg = "Computing DERs for " + split_type + " set"
        logger.info(msg)
        ref_rttm = os.path.join(args.data_dir, config.ref_rttm_dir,
                                "fullref_ami_" + split_type + ".rttm")
        sys_rttm = out_boundaries
        [MS, FA, SER, DER_vals] = DER(
            ref_rttm,
            sys_rttm,
            config.ignore_overlap,
            config.forgiveness_collar,
            individual_file_scores=True, )
        # writing DER values to a file. Append tag.
        der_file_name = split_type + "_DER_" + tag
        out_der_file = os.path.join(out_der_dir, der_file_name)
        msg = "Writing DER file to: " + out_der_file
        logger.info(msg)
        diar.write_ders_file(ref_rttm, DER_vals, out_der_file)
        msg = ("AMI " + split_type + " set DER = %s %%\n" %
               (str(round(DER_vals[-1], 2))))
        logger.info(msg)
        final_DERs[split_type] = round(DER_vals[-1], 2)
    # final print DERs
    msg = (
        "Final Diarization Error Rate (%%) on AMI corpus: Dev = %s %% | Eval = %s %%\n"
        % (str(final_DERs["dev"]), str(final_DERs["eval"])))
    logger.info(msg)
 if __name__ == "__main__":
    parser = argparse.ArgumentParser(__doc__)
    parser.add_argument(
        "--config", default=None, type=str, help="configuration file")
    parser.add_argument(
        "--data-dir",
        default="../data/",
        type=str,
        help="processsed data directory")
    args = parser.parse_args()
    config = CfgNode(new_allowed=True)
    if args.config:
        config.merge_from_file(args.config)
    config.freeze()
    main(args, config)
--- a/examples/ami/sd0/local/process.sh
+++ b/examples/ami/sd0/local/process.sh
@ -0,0 +1,49 @@
 #!/bin/bash
 stage=0
 set=L
 . ${MAIN_ROOT}/utils/parse_options.sh || exit 1;
 set -o pipefail
 data_folder=$1
 manual_annot_folder=$2
 save_folder=$3
 pretrained_model_dir=$4
 conf_path=$5
 device=$6
 ref_rttm_dir=${save_folder}/ref_rttms
 meta_data_dir=${save_folder}/metadata
 if [ ${stage} -le 0 ]; then
    echo "AMI Data preparation"
    python local/ami_prepare.py  --data_folder ${data_folder} \
            --manual_annot_folder ${manual_annot_folder} \
            --save_folder ${save_folder} --ref_rttm_dir ${ref_rttm_dir} \
            --meta_data_dir ${meta_data_dir} 
    if [ $? -ne 0 ]; then
        echo "Prepare AMI failed. Please check log message."
        exit 1
    fi
    echo "AMI data preparation done."           
 fi
 if [ ${stage} -le 1 ]; then
    # extra embddings for dev and eval dataset
    for name in dev eval; do
        python local/compute_embdding.py --config ${conf_path} \
                --data-dir ${save_folder} \
                --device ${device} \
                --dataset ${name} \
                --load-checkpoint ${pretrained_model_dir}
    done
 fi
 if [ ${stage} -le 2 ]; then
    # tune hyperparams on dev set
    # perform final diarization on 'dev' and 'eval' with best hyperparams
    python local/experiment.py --config ${conf_path} \
            --data-dir ${save_folder}
 fi
--- a/examples/ami/sd0/run.sh
+++ b/examples/ami/sd0/run.sh
@ -1,14 +1,46 @@
 #!/bin/bash
-. path.sh || exit 1;
+. ./path.sh || exit 1;
 set -e
-stage=1
+stage=0
 #TARGET_DIR=${MAIN_ROOT}/dataset/ami
 TARGET_DIR=/home/dataset/AMI
 data_folder=${TARGET_DIR}/amicorpus #e.g., /path/to/amicorpus/
 manual_annot_folder=${TARGET_DIR}/ami_public_manual_1.6.2 #e.g., /path/to/ami_public_manual_1.6.2/
 save_folder=./save
 pretraind_model_dir=${save_folder}/sv0_ecapa_tdnn_voxceleb12_ckpt_0_1_1/model
 conf_path=conf/ecapa_tdnn.yaml
 device=gpu
 . ${MAIN_ROOT}/utils/parse_options.sh || exit 1;
-if [ ${stage} -le 1 ]; then
+if [ $stage -le 0 ]; then
-    # prepare data
+    # Prepare data
-    bash ./local/data.sh || exit -1
+    # Download AMI corpus, You need around 10GB of free space to get whole data
    # The signals are too large to package in this way,
    # so you need to use the chooser to indicate which ones you wish to download
    echo "Please follow https://groups.inf.ed.ac.uk/ami/download/ to download the data."
    echo "Annotations: AMI manual annotations v1.6.2 "
    echo "Signals: "
    echo "1) Select one or more AMI meetings: the IDs please follow ./ami_split.py"
    echo "2) Select media streams: Just select Headset mix"
 fi
 if [ $stage -le 1 ]; then
    # Download the pretrained model
    wget https://paddlespeech.bj.bcebos.com/vector/voxceleb/sv0_ecapa_tdnn_voxceleb12_ckpt_0_1_1.tar.gz
    mkdir -p ${save_folder} && tar -xvf sv0_ecapa_tdnn_voxceleb12_ckpt_0_1_1.tar.gz -C ${save_folder}
    rm -rf sv0_ecapa_tdnn_voxceleb12_ckpt_0_1_1.tar.gz
    echo "download the pretrained ECAPA-TDNN Model to path: "${pretraind_model_dir}
 fi
 if [ $stage -le 2 ]; then
    # Tune hyperparams on dev set and perform final diarization on dev and eval with best hyperparams.
    echo ${data_folder} ${manual_annot_folder} ${save_folder} ${pretraind_model_dir} ${conf_path}
    bash ./local/process.sh ${data_folder} ${manual_annot_folder} \
        ${save_folder} ${pretraind_model_dir} ${conf_path} ${device} || exit 1
 fi
--- a/examples/csmsc/tts0/README.md
+++ b/examples/csmsc/tts0/README.md
@ -212,7 +212,8 @@ optional arguments:
 Pretrained Tacotron2 model with no silence in the edge of audios:
 - [tacotron2_csmsc_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/tacotron2/tacotron2_csmsc_ckpt_0.2.0.zip)
-The static model can be downloaded here [tacotron2_csmsc_static_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/tacotron2/tacotron2_csmsc_static_0.2.0.zip).
+The static model can be downloaded here:
 - [tacotron2_csmsc_static_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/tacotron2/tacotron2_csmsc_static_0.2.0.zip)
 Model | Step | eval/loss | eval/l1_loss | eval/mse_loss | eval/bce_loss| eval/attn_loss 
--- a/examples/csmsc/tts0/local/inference.sh
+++ b/examples/csmsc/tts0/local/inference.sh
@ -27,20 +27,8 @@ if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
        --phones_dict=dump/phone_id_map.txt
 fi
 # style melgan
 # style melgan's Dygraph to Static Graph is not ready now
 if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
    python3 ${BIN_DIR}/../inference.py \
        --inference_dir=${train_output_path}/inference \
        --am=tacotron2_csmsc \
        --voc=style_melgan_csmsc \
        --text=${BIN_DIR}/../sentences.txt \
        --output_dir=${train_output_path}/pd_infer_out \
        --phones_dict=dump/phone_id_map.txt
 fi
 # hifigan
-if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
+if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
    python3 ${BIN_DIR}/../inference.py \
        --inference_dir=${train_output_path}/inference \
        --am=tacotron2_csmsc \
--- a/examples/csmsc/tts2/README.md
+++ b/examples/csmsc/tts2/README.md
@ -221,21 +221,30 @@ CUDA_VISIBLE_DEVICES=${gpus} ./local/inference.sh ${train_output_path}
 ```
 ## Pretrained Model
-Pretrained SpeedySpeech model with no silence in the edge of audios[speedyspeech_nosil_baker_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/speedyspeech/speedyspeech_nosil_baker_ckpt_0.5.zip).
+Pretrained SpeedySpeech model with no silence in the edge of audios:
 - [speedyspeech_nosil_baker_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/speedyspeech/speedyspeech_nosil_baker_ckpt_0.5.zip)
 - [speedyspeech_csmsc_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/speedyspeech/speedyspeech_csmsc_ckpt_0.2.0.zip)
 The static model can be downloaded here:
 - [speedyspeech_nosil_baker_static_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/speedyspeech/speedyspeech_nosil_baker_static_0.5.zip)
 - [speedyspeech_csmsc_static_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/speedyspeech/speedyspeech_csmsc_static_0.2.0.zip)
 The ONNX model can be downloaded here:
 - [speedyspeech_csmsc_onnx_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/speedyspeech/speedyspeech_csmsc_onnx_0.2.0.zip)
 The static model can be downloaded here [speedyspeech_nosil_baker_static_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/speedyspeech/speedyspeech_nosil_baker_static_0.5.zip).
 Model | Step | eval/loss | eval/l1_loss | eval/duration_loss | eval/ssim_loss
 :-------------:| :------------:| :-----: | :-----: | :--------:|:--------:
-default| 1(gpu) x 11400|0.83655|0.42324|0.03211| 0.38119
+default| 1(gpu) x 11400|0.79532|0.400246|0.030259| 0.36482
 SpeedySpeech checkpoint contains files listed below.
 ```text
-speedyspeech_nosil_baker_ckpt_0.5
+speedyspeech_csmsc_ckpt_0.2.0
 ├── default.yaml            # default config used to train speedyspeech
 ├── feats_stats.npy         # statistics used to normalize spectrogram when training speedyspeech
 ├── phone_id_map.txt        # phone vocabulary file when training speedyspeech
-├── snapshot_iter_11400.pdz # model parameters and optimizer states
+├── snapshot_iter_30600.pdz # model parameters and optimizer states
 └── tone_id_map.txt         # tone vocabulary file when training speedyspeech
 ```
 You can use the following scripts to synthesize for `${BIN_DIR}/../sentences.txt` using pretrained speedyspeech and parallel wavegan models.
@ -246,9 +255,9 @@ FLAGS_allocator_strategy=naive_best_fit \
 FLAGS_fraction_of_gpu_memory_to_use=0.01 \
 python3 ${BIN_DIR}/../synthesize_e2e.py \
  --am=speedyspeech_csmsc \
-  --am_config=speedyspeech_nosil_baker_ckpt_0.5/default.yaml \
+  --am_config=speedyspeech_csmsc_ckpt_0.2.0/default.yaml \
-  --am_ckpt=speedyspeech_nosil_baker_ckpt_0.5/snapshot_iter_11400.pdz \
+  --am_ckpt=speedyspeech_csmsc_ckpt_0.2.0/snapshot_iter_30600.pdz \
-  --am_stat=speedyspeech_nosil_baker_ckpt_0.5/feats_stats.npy \
+  --am_stat=speedyspeech_csmsc_ckpt_0.2.0/feats_stats.npy \
  --voc=pwgan_csmsc \
  --voc_config=pwg_baker_ckpt_0.4/pwg_default.yaml \
  --voc_ckpt=pwg_baker_ckpt_0.4/pwg_snapshot_iter_400000.pdz \
@ -257,6 +266,6 @@ python3 ${BIN_DIR}/../synthesize_e2e.py \
  --text=${BIN_DIR}/../sentences.txt \
  --output_dir=exp/default/test_e2e \
  --inference_dir=exp/default/inference \
-  --phones_dict=speedyspeech_nosil_baker_ckpt_0.5/phone_id_map.txt \
+  --phones_dict=speedyspeech_csmsc_ckpt_0.2.0/phone_id_map.txt \
-  --tones_dict=speedyspeech_nosil_baker_ckpt_0.5/tone_id_map.txt
+  --tones_dict=speedyspeech_csmsc_ckpt_0.2.0/tone_id_map.txt
 ```
--- a/examples/csmsc/tts2/local/inference.sh
+++ b/examples/csmsc/tts2/local/inference.sh
@ -30,21 +30,8 @@ if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
        --tones_dict=dump/tone_id_map.txt
 fi
 # style melgan
 # style melgan's Dygraph to Static Graph is not ready now
 if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
    python3 ${BIN_DIR}/../inference.py \
        --inference_dir=${train_output_path}/inference \
        --am=speedyspeech_csmsc \
        --voc=style_melgan_csmsc \
        --text=${BIN_DIR}/../sentences.txt \
        --output_dir=${train_output_path}/pd_infer_out \
        --phones_dict=dump/phone_id_map.txt \
        --tones_dict=dump/tone_id_map.txt
 fi
 # hifigan
-if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
+if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
    python3 ${BIN_DIR}/../inference.py \
        --inference_dir=${train_output_path}/inference \
        --am=speedyspeech_csmsc \
--- a/examples/csmsc/tts2/local/ort_predict.sh
+++ b/examples/csmsc/tts2/local/ort_predict.sh
@ -0,0 +1,32 @@
 train_output_path=$1
 stage=0
 stop_stage=0
 # only support default_fastspeech2/speedyspeech + hifigan/mb_melgan now!
 # synthesize from metadata
 if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
    python3 ${BIN_DIR}/../ort_predict.py \
        --inference_dir=${train_output_path}/inference_onnx \
        --am=speedyspeech_csmsc \
        --voc=hifigan_csmsc \
        --test_metadata=dump/test/norm/metadata.jsonl \
        --output_dir=${train_output_path}/onnx_infer_out \
        --device=cpu \
        --cpu_threads=2
 fi
 # e2e, synthesize from text
 if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
    python3 ${BIN_DIR}/../ort_predict_e2e.py \
        --inference_dir=${train_output_path}/inference_onnx \
        --am=speedyspeech_csmsc \
        --voc=hifigan_csmsc \
        --output_dir=${train_output_path}/onnx_infer_out_e2e \
        --text=${BIN_DIR}/../csmsc_test.txt \
        --phones_dict=dump/phone_id_map.txt \
        --tones_dict=dump/tone_id_map.txt \
        --device=cpu \
        --cpu_threads=2
 fi
--- a/examples/csmsc/tts2/local/paddle2onnx.sh
+++ b/examples/csmsc/tts2/local/paddle2onnx.sh
@ -0,0 +1 @@
 ../../tts3/local/paddle2onnx.sh
--- a/examples/csmsc/tts2/run.sh
+++ b/examples/csmsc/tts2/run.sh
@ -40,3 +40,25 @@ if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
    # inference with static model
    CUDA_VISIBLE_DEVICES=${gpus} ./local/inference.sh ${train_output_path} || exit -1
 fi
 # paddle2onnx, please make sure the static models are in ${train_output_path}/inference first
 # we have only tested the following models so far
 if [ ${stage} -le 5 ] && [ ${stop_stage} -ge 5 ]; then
    # install paddle2onnx
    version=$(echo `pip list |grep "paddle2onnx"` |awk -F" " '{print $2}')
    if [[ -z "$version" || ${version} != '0.9.4' ]]; then
        pip install paddle2onnx==0.9.4
    fi
    ./local/paddle2onnx.sh ${train_output_path} inference inference_onnx speedyspeech_csmsc
    ./local/paddle2onnx.sh ${train_output_path} inference inference_onnx hifigan_csmsc
 fi
 # inference with onnxruntime, use fastspeech2 + hifigan by default
 if [ ${stage} -le 6 ] && [ ${stop_stage} -ge 6 ]; then
    # install onnxruntime
    version=$(echo `pip list |grep "onnxruntime"` |awk -F" " '{print $2}')
    if [[ -z "$version" || ${version} != '1.10.0' ]]; then
        pip install onnxruntime==1.10.0
    fi
    ./local/ort_predict.sh ${train_output_path}
 fi
--- a/examples/csmsc/tts3/README.md
+++ b/examples/csmsc/tts3/README.md
@ -231,11 +231,19 @@ Pretrained FastSpeech2 model with no silence in the edge of audios:
 The static model can be downloaded here:
 - [fastspeech2_nosil_baker_static_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_baker_static_0.4.zip)
 - [fastspeech2_csmsc_static_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_csmsc_static_0.2.0.zip)
 - [fastspeech2_cnndecoder_csmsc_static_1.0.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_cnndecoder_csmsc_static_1.0.0.zip)
 - [fastspeech2_cnndecoder_csmsc_streaming_static_1.0.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_cnndecoder_csmsc_streaming_static_1.0.0.zip)
 The ONNX model can be downloaded here:
 - [fastspeech2_csmsc_onnx_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_csmsc_onnx_0.2.0.zip)
 - [fastspeech2_cnndecoder_csmsc_onnx_1.0.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_cnndecoder_csmsc_onnx_1.0.0.zip)
 - [fastspeech2_cnndecoder_csmsc_streaming_onnx_1.0.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_cnndecoder_csmsc_streaming_onnx_1.0.0.zip)
 Model | Step | eval/loss | eval/l1_loss | eval/duration_loss | eval/pitch_loss| eval/energy_loss 
 :-------------:| :------------:| :-----: | :-----: | :--------: |:--------:|:---------:
 default| 2(gpu) x 76000|1.0991|0.59132|0.035815|0.31915|0.15287|
 conformer| 2(gpu) x 76000|1.0675|0.56103|0.035869|0.31553|0.15509|
 cnndecoder| 1(gpu) x 153000|1.1153|0.61475|0.03380|0.30414|0.14707|
 FastSpeech2 checkpoint contains files listed below.
 ```text
--- a/examples/csmsc/tts3/local/inference.sh
+++ b/examples/csmsc/tts3/local/inference.sh
@ -5,6 +5,7 @@ train_output_path=$1
 stage=0
 stop_stage=0
 # pwgan
 if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
    python3 ${BIN_DIR}/../inference.py \
        --inference_dir=${train_output_path}/inference \
@ -27,20 +28,8 @@ if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
        --phones_dict=dump/phone_id_map.txt
 fi
 # style melgan
 # style melgan's Dygraph to Static Graph is not ready now
 if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
    python3 ${BIN_DIR}/../inference.py \
        --inference_dir=${train_output_path}/inference \
        --am=fastspeech2_csmsc \
        --voc=style_melgan_csmsc \
        --text=${BIN_DIR}/../sentences.txt \
        --output_dir=${train_output_path}/pd_infer_out \
        --phones_dict=dump/phone_id_map.txt
 fi
 # hifigan
-if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
+if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
    python3 ${BIN_DIR}/../inference.py \
        --inference_dir=${train_output_path}/inference \
        --am=fastspeech2_csmsc \
@ -51,7 +40,7 @@ if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
 fi
 # wavernn
-if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
+if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
    python3 ${BIN_DIR}/../inference.py \
        --inference_dir=${train_output_path}/inference \
        --am=fastspeech2_csmsc \
--- a/examples/csmsc/tts3/local/inference_streaming.sh
+++ b/examples/csmsc/tts3/local/inference_streaming.sh
@ -0,0 +1,47 @@
 #!/bin/bash
 train_output_path=$1
 stage=0
 stop_stage=0
 # pwgan
 if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
    python3 ${BIN_DIR}/../inference_streaming.py \
        --inference_dir=${train_output_path}/inference_streaming \
        --am=fastspeech2_csmsc \
        --am_stat=dump/train/speech_stats.npy \
        --voc=pwgan_csmsc \
        --text=${BIN_DIR}/../sentences.txt \
        --output_dir=${train_output_path}/pd_infer_out_streaming \
        --phones_dict=dump/phone_id_map.txt \
        --am_streaming=True
 fi
 # for more GAN Vocoders
 # multi band melgan
 if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
    python3 ${BIN_DIR}/../inference_streaming.py \
        --inference_dir=${train_output_path}/inference_streaming \
        --am=fastspeech2_csmsc \
        --am_stat=dump/train/speech_stats.npy \
        --voc=mb_melgan_csmsc \
        --text=${BIN_DIR}/../sentences.txt \
        --output_dir=${train_output_path}/pd_infer_out_streaming \
        --phones_dict=dump/phone_id_map.txt \
        --am_streaming=True
 fi
 # hifigan
 if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
    python3 ${BIN_DIR}/../inference_streaming.py \
        --inference_dir=${train_output_path}/inference_streaming \
        --am=fastspeech2_csmsc \
        --am_stat=dump/train/speech_stats.npy \
        --voc=hifigan_csmsc \
        --text=${BIN_DIR}/../sentences.txt \
        --output_dir=${train_output_path}/pd_infer_out_streaming \
        --phones_dict=dump/phone_id_map.txt \
        --am_streaming=True
 fi
--- a/examples/csmsc/tts3/local/ort_predict.sh
+++ b/examples/csmsc/tts3/local/ort_predict.sh
@ -0,0 +1,31 @@
 train_output_path=$1
 stage=0
 stop_stage=0
 # only support default_fastspeech2/speedyspeech + hifigan/mb_melgan now!
 # synthesize from metadata
 if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
    python3 ${BIN_DIR}/../ort_predict.py \
        --inference_dir=${train_output_path}/inference_onnx \
        --am=fastspeech2_csmsc \
        --voc=hifigan_csmsc \
        --test_metadata=dump/test/norm/metadata.jsonl \
        --output_dir=${train_output_path}/onnx_infer_out \
        --device=cpu \
        --cpu_threads=2
 fi
 # e2e, synthesize from text
 if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
    python3 ${BIN_DIR}/../ort_predict_e2e.py \
        --inference_dir=${train_output_path}/inference_onnx \
        --am=fastspeech2_csmsc \
        --voc=hifigan_csmsc \
        --output_dir=${train_output_path}/onnx_infer_out_e2e \
        --text=${BIN_DIR}/../csmsc_test.txt \
        --phones_dict=dump/phone_id_map.txt \
        --device=cpu \
        --cpu_threads=2
 fi
--- a/examples/csmsc/tts3/local/ort_predict_streaming.sh
+++ b/examples/csmsc/tts3/local/ort_predict_streaming.sh
@ -0,0 +1,19 @@
 train_output_path=$1
 stage=0
 stop_stage=0
 # e2e, synthesize from text
 if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
    python3 ${BIN_DIR}/../ort_predict_streaming.py \
        --inference_dir=${train_output_path}/inference_onnx_streaming \
        --am=fastspeech2_csmsc \
        --am_stat=dump/train/speech_stats.npy \
        --voc=hifigan_csmsc \
        --output_dir=${train_output_path}/onnx_infer_out_streaming \
        --text=${BIN_DIR}/../csmsc_test.txt \
        --phones_dict=dump/phone_id_map.txt \
        --device=cpu \
        --cpu_threads=2 \
        --am_streaming=True
 fi
--- a/examples/csmsc/tts3/local/paddle2onnx.sh
+++ b/examples/csmsc/tts3/local/paddle2onnx.sh
@ -0,0 +1,23 @@
 train_output_path=$1
 model_dir=$2
 output_dir=$3
 model=$4
 enable_dev_version=True
 model_name=${model%_*}
 echo model_name: ${model_name}
 if [ ${model_name} = 'mb_melgan' ] ;then
    enable_dev_version=False
 fi
 mkdir -p ${train_output_path}/${output_dir}
 paddle2onnx \
    --model_dir ${train_output_path}/${model_dir} \
    --model_filename ${model}.pdmodel \
    --params_filename ${model}.pdiparams \
    --save_file ${train_output_path}/${output_dir}/${model}.onnx \
    --opset_version 11 \
    --enable_dev_version ${enable_dev_version}
--- a/examples/csmsc/tts3/local/synthesize_e2e.sh
+++ b/examples/csmsc/tts3/local/synthesize_e2e.sh
@ -109,6 +109,6 @@ if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
        --lang=zh \
        --text=${BIN_DIR}/../sentences.txt \
        --output_dir=${train_output_path}/test_e2e \
-        --phones_dict=dump/phone_id_map.txt \
+        --phones_dict=dump/phone_id_map.txt #\
-        --inference_dir=${train_output_path}/inference
+        # --inference_dir=${train_output_path}/inference
 fi
--- a/examples/csmsc/tts3/local/synthesize_streaming.sh
+++ b/examples/csmsc/tts3/local/synthesize_streaming.sh
@ -88,5 +88,6 @@ if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
        --text=${BIN_DIR}/../sentences.txt \
        --output_dir=${train_output_path}/test_e2e_streaming \
        --phones_dict=dump/phone_id_map.txt \
-        --am_streaming=True
+        --am_streaming=True \
        --inference_dir=${train_output_path}/inference_streaming
 fi
--- a/examples/csmsc/tts3/run.sh
+++ b/examples/csmsc/tts3/run.sh
@ -41,3 +41,25 @@ if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
    CUDA_VISIBLE_DEVICES=${gpus} ./local/inference.sh ${train_output_path} || exit -1
 fi
 # paddle2onnx, please make sure the static models are in ${train_output_path}/inference first
 # we have only tested the following models so far
 if [ ${stage} -le 5 ] && [ ${stop_stage} -ge 5 ]; then
    # install paddle2onnx
    version=$(echo `pip list |grep "paddle2onnx"` |awk -F" " '{print $2}')
    if [[ -z "$version" || ${version} != '0.9.4' ]]; then
        pip install paddle2onnx==0.9.4
    fi
    ./local/paddle2onnx.sh ${train_output_path} inference inference_onnx fastspeech2_csmsc
    ./local/paddle2onnx.sh ${train_output_path} inference inference_onnx hifigan_csmsc
    ./local/paddle2onnx.sh ${train_output_path} inference inference_onnx mb_melgan_csmsc
 fi
 # inference with onnxruntime, use fastspeech2 + hifigan by default
 if [ ${stage} -le 6 ] && [ ${stop_stage} -ge 6 ]; then
    # install onnxruntime
    version=$(echo `pip list |grep "onnxruntime"` |awk -F" " '{print $2}')
    if [[ -z "$version" || ${version} != '1.10.0' ]]; then
        pip install onnxruntime==1.10.0
    fi
    ./local/ort_predict.sh ${train_output_path}
 fi
--- a/examples/csmsc/tts3/run_cnndecoder.sh
+++ b/examples/csmsc/tts3/run_cnndecoder.sh
@ -31,18 +31,75 @@ if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
    CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name} || exit -1
 fi
 # synthesize_e2e non-streaming
 if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
    # synthesize_e2e, vocoder is pwgan
    CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize_e2e.sh ${conf_path} ${train_output_path} ${ckpt_name} || exit -1
 fi
 # inference non-streaming
 if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
    # inference with static model
    CUDA_VISIBLE_DEVICES=${gpus} ./local/inference.sh ${train_output_path} || exit -1
 fi
 # synthesize_e2e streaming
 if [ ${stage} -le 5 ] && [ ${stop_stage} -ge 5 ]; then
    # synthesize_e2e, vocoder is pwgan
    CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize_streaming.sh ${conf_path} ${train_output_path} ${ckpt_name} || exit -1
 fi
 # inference streaming
 if [ ${stage} -le 6 ] && [ ${stop_stage} -ge 6 ]; then
    # inference with static model
    CUDA_VISIBLE_DEVICES=${gpus} ./local/inference_streaming.sh ${train_output_path} || exit -1
 fi
 # paddle2onnx non streaming
 if [ ${stage} -le 7 ] && [ ${stop_stage} -ge 7 ]; then
    # install paddle2onnx
    version=$(echo `pip list |grep "paddle2onnx"` |awk -F" " '{print $2}')
    if [[ -z "$version" || ${version} != '0.9.4' ]]; then
        pip install paddle2onnx==0.9.4
    fi
    ./local/paddle2onnx.sh ${train_output_path} inference inference_onnx fastspeech2_csmsc
    ./local/paddle2onnx.sh ${train_output_path} inference inference_onnx hifigan_csmsc
 fi
 # onnxruntime non streaming
 # inference with onnxruntime, use fastspeech2 + hifigan by default
 if [ ${stage} -le 8 ] && [ ${stop_stage} -ge 8 ]; then
    # install onnxruntime
    version=$(echo `pip list |grep "onnxruntime"` |awk -F" " '{print $2}')
    if [[ -z "$version" || ${version} != '1.10.0' ]]; then
        pip install onnxruntime==1.10.0
    fi
    ./local/ort_predict.sh ${train_output_path}
 fi
 # paddle2onnx streaming
 if [ ${stage} -le 9 ] && [ ${stop_stage} -ge 9 ]; then
    # install paddle2onnx
    version=$(echo `pip list |grep "paddle2onnx"` |awk -F" " '{print $2}')
    if [[ -z "$version" || ${version} != '0.9.4' ]]; then
        pip install paddle2onnx==0.9.4
    fi
    # streaming acoustic model
    ./local/paddle2onnx.sh ${train_output_path} inference_streaming inference_onnx_streaming fastspeech2_csmsc_am_encoder_infer
    ./local/paddle2onnx.sh ${train_output_path} inference_streaming inference_onnx_streaming fastspeech2_csmsc_am_decoder
    ./local/paddle2onnx.sh ${train_output_path} inference_streaming inference_onnx_streaming fastspeech2_csmsc_am_postnet
    # vocoder
    ./local/paddle2onnx.sh ${train_output_path} inference_streaming inference_onnx_streaming hifigan_csmsc
 fi
 # onnxruntime streaming
 if [ ${stage} -le 10 ] && [ ${stop_stage} -ge 10 ]; then
    # install onnxruntime
    version=$(echo `pip list |grep "onnxruntime"` |awk -F" " '{print $2}')
    if [[ -z "$version" || ${version} != '1.10.0' ]]; then
        pip install onnxruntime==1.10.0
    fi
    ./local/ort_predict_streaming.sh ${train_output_path}
 fi
--- a/examples/csmsc/voc1/README.md
+++ b/examples/csmsc/voc1/README.md
@ -127,9 +127,14 @@ optional arguments:
 5. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.
 ## Pretrained Models
-The pretrained model can be downloaded here [pwg_baker_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_baker_ckpt_0.4.zip).
+The pretrained model can be downloaded here:
 - [pwg_baker_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_baker_ckpt_0.4.zip)
-The static model can be downloaded here [pwg_baker_static_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_baker_static_0.4.zip).
+The static model can be downloaded here:
 - [pwg_baker_static_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_baker_static_0.4.zip)
 The ONNX model can be downloaded here:
 - [pwgan_csmsc_onnx_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwgan_csmsc_onnx_0.2.0.zip)
 Model | Step | eval/generator_loss | eval/log_stft_magnitude_loss| eval/spectral_convergence_loss
 :-------------:| :------------:| :-----: | :-----: | :--------:
--- a/examples/csmsc/voc3/README.md
+++ b/examples/csmsc/voc3/README.md
@ -152,11 +152,17 @@ TODO:
 The hyperparameter of `finetune.yaml` is not good enough, a smaller `learning_rate` should be used (more `milestones` should be set).
 ## Pretrained Models
-The pretrained model can be downloaded here [mb_melgan_csmsc_ckpt_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/mb_melgan/mb_melgan_csmsc_ckpt_0.1.1.zip).
+The pretrained model can be downloaded here:
 - [mb_melgan_csmsc_ckpt_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/mb_melgan/mb_melgan_csmsc_ckpt_0.1.1.zip)
-The finetuned model can be downloaded here [mb_melgan_baker_finetune_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/mb_melgan/mb_melgan_baker_finetune_ckpt_0.5.zip).
+The finetuned model can be downloaded here:
 - [mb_melgan_baker_finetune_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/mb_melgan/mb_melgan_baker_finetune_ckpt_0.5.zip)
-The static model can be downloaded here [mb_melgan_csmsc_static_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/mb_melgan/mb_melgan_csmsc_static_0.1.1.zip)
+The static model can be downloaded here:
 - [mb_melgan_csmsc_static_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/mb_melgan/mb_melgan_csmsc_static_0.1.1.zip)
 The ONNX model can be downloaded here:
 - [mb_melgan_csmsc_onnx_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/mb_melgan/mb_melgan_csmsc_onnx_0.2.0.zip)
 Model | Step | eval/generator_loss | eval/log_stft_magnitude_loss|eval/spectral_convergence_loss |eval/sub_log_stft_magnitude_loss|eval/sub_spectral_convergence_loss
 :-------------:| :------------:| :-----: | :-----: | :--------:| :--------:| :--------:
--- a/examples/csmsc/voc4/README.md
+++ b/examples/csmsc/voc4/README.md
@ -112,7 +112,8 @@ optional arguments:
 5. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.
 ## Pretrained Models
-The pretrained model can be downloaded here [style_melgan_csmsc_ckpt_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/style_melgan/style_melgan_csmsc_ckpt_0.1.1.zip).
+The pretrained model can be downloaded here:
 - [style_melgan_csmsc_ckpt_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/style_melgan/style_melgan_csmsc_ckpt_0.1.1.zip)
 The static model of Style MelGAN is not available now.
--- a/examples/csmsc/voc5/README.md
+++ b/examples/csmsc/voc5/README.md
@ -112,9 +112,14 @@ optional arguments:
 5. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.
 ## Pretrained Models
-The pretrained model can be downloaded here [hifigan_csmsc_ckpt_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_csmsc_ckpt_0.1.1.zip).
+The pretrained model can be downloaded here:
 - [hifigan_csmsc_ckpt_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_csmsc_ckpt_0.1.1.zip)
-The static model can be downloaded here [hifigan_csmsc_static_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_csmsc_static_0.1.1.zip).
+The static model can be downloaded here:
 - [hifigan_csmsc_static_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_csmsc_static_0.1.1.zip)
 The ONNX model can be downloaded here:
 - [hifigan_csmsc_onnx_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_csmsc_onnx_0.2.0.zip)
 Model | Step | eval/generator_loss | eval/mel_loss| eval/feature_matching_loss
 :-------------:| :------------:| :-----: | :-----: | :--------:
--- a/examples/csmsc/voc6/README.md
+++ b/examples/csmsc/voc6/README.md
@ -109,9 +109,11 @@ optional arguments:
 5. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.
 ## Pretrained Models
-The pretrained model can be downloaded here [wavernn_csmsc_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/wavernn/wavernn_csmsc_ckpt_0.2.0.zip).
+The pretrained model can be downloaded here:
 - [wavernn_csmsc_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/wavernn/wavernn_csmsc_ckpt_0.2.0.zip)
-The static model can be downloaded here [wavernn_csmsc_static_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/wavernn/wavernn_csmsc_static_0.2.0.zip).
+The static model can be downloaded here:
 - [wavernn_csmsc_static_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/wavernn/wavernn_csmsc_static_0.2.0.zip)
 Model | Step | eval/loss
 :-------------:|:------------:| :------------:
--- a/examples/esc50/README.md
+++ b/examples/esc50/README.md
@ -4,7 +4,7 @@
 对于声音分类任务，传统机器学习的一个常用做法是首先人工提取音频的时域和频域的多种特征并做特征选择、组合、变换等，然后基于SVM或决策树进行分类。而端到端的深度学习则通常利用深度网络如RNN，CNN等直接对声间波形(waveform)或时频特征(time-frequency)进行特征学习(representation learning)和分类预测。
-在IEEE ICASSP 2017 大会上，谷歌开放了一个大规模的音频数据集[Audioset](https://research.google.com/audioset/)。该数据集包含了 632 类的音频类别以及 2,084,320 条人工标记的每段 10 秒长度的声音剪辑片段（来源于YouTube视频）。目前该数据集已经有210万个已标注的视频数据，5800小时的音频数据，经过标记的声音样本的标签类别为527。
+在IEEE ICASSP 2017 大会上，谷歌开放了一个大规模的音频数据集[Audioset](https://research.google.com/audioset/)。该数据集包含了 632 类的音频类别以及 2,084,320 条人工标记的每段 **10 秒**长度的声音剪辑片段（来源于YouTube视频）。目前该数据集已经有 210万 个已标注的视频数据，5800 小时的音频数据，经过标记的声音样本的标签类别为 527。
 `PANNs`([PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition](https://arxiv.org/pdf/1912.10211.pdf))是基于Audioset数据集训练的声音分类/识别的模型。经过预训练后，模型可以用于提取音频的embbedding。本示例将使用`PANNs`的预训练模型Finetune完成声音分类的任务。
@ -12,14 +12,14 @@
 ## 模型简介
 PaddleAudio提供了PANNs的CNN14、CNN10和CNN6的预训练模型，可供用户选择使用：
- CNN14: 该模型主要包含12个卷积层和2个全连接层，模型参数的数量为79.6M，embbedding维度是2048。
+- CNN14: 该模型主要包含12个卷积层和2个全连接层，模型参数的数量为 79.6M，embbedding维度是 2048。
- CNN10: 该模型主要包含8个卷积层和2个全连接层，模型参数的数量为4.9M，embbedding维度是512。
+- CNN10: 该模型主要包含8个卷积层和2个全连接层，模型参数的数量为 4.9M，embbedding维度是 512。
- CNN6: 该模型主要包含4个卷积层和2个全连接层，模型参数的数量为4.5M，embbedding维度是512。
+- CNN6: 该模型主要包含4个卷积层和2个全连接层，模型参数的数量为 4.5M，embbedding维度是 512。
 ## 数据集
-[ESC-50: Dataset for Environmental Sound Classification](https://github.com/karolpiczak/ESC-50) 是一个包含有 2000 个带标签的环境声音样本，音频样本采样率为 44,100Hz 的单通道音频文件，所有样本根据标签被划分为 50 个类别，每个类别有 40 个样本。
+[ESC-50: Dataset for Environmental Sound Classification](https://github.com/karolpiczak/ESC-50) 是一个包含有 2000 个带标签的时长为 **5 秒**的环境声音样本，音频样本采样率为 44,100Hz 的单通道音频文件，所有样本根据标签被划分为 50 个类别，每个类别有 40 个样本。
 ## 模型指标
@ -43,13 +43,13 @@ $ CUDA_VISIBLE_DEVICES=0 ./run.sh 1 conf/panns.yaml
 ```
 训练的参数可在 `conf/panns.yaml` 的 `training` 中配置，其中：
- `epochs`: 训练轮次，默认为50。
+- `epochs`: 训练轮次，默认为 50。
 - `learning_rate`: Fine-tune的学习率；默认为5e-5。
- `batch_size`: 批处理大小，请结合显存情况进行调整，若出现显存不足，请适当调低这一参数；默认为16。
+- `batch_size`: 批处理大小，请结合显存情况进行调整，若出现显存不足，请适当调低这一参数；默认为 16。
 - `num_workers`: Dataloader获取数据的子进程数。默认为0，加载数据的流程在主进程执行。
 - `checkpoint_dir`: 模型参数文件和optimizer参数文件的保存目录，默认为`./checkpoint`。
- `save_freq`: 训练过程中的模型保存频率，默认为10。
+- `save_freq`: 训练过程中的模型保存频率，默认为 10。
- `log_freq`: 训练过程中的信息打印频率，默认为10。
+- `log_freq`: 训练过程中的信息打印频率，默认为 10。
 示例代码中使用的预训练模型为`CNN14`，如果想更换为其他预训练模型，可通过修改 `conf/panns.yaml` 的 `model` 中配置：
 ```yaml
@ -76,7 +76,7 @@ $ CUDA_VISIBLE_DEVICES=0 ./run.sh 2 conf/panns.yaml
 训练的参数可在 `conf/panns.yaml` 的 `predicting` 中配置，其中：
 - `audio_file`: 指定预测的音频文件。
- `top_k`: 预测显示的top k标签的得分，默认为1。
+- `top_k`: 预测显示的top k标签的得分，默认为 1。
 - `checkpoint`: 模型参数checkpoint文件。
 输出的预测结果如下：
--- a/examples/iwslt2012/punc0/README.md
+++ b/examples/iwslt2012/punc0/README.md
@ -21,7 +21,7 @@
 The pretrained model can be downloaded here [ernie_linear_p3_iwslt2012_zh_ckpt_0.1.1.zip](https://paddlespeech.bj.bcebos.com/text/ernie_linear_p3_iwslt2012_zh_ckpt_0.1.1.zip).
 ### Test Result
- Ernie Linear
+- Ernie
    |       |COMMA  |  PERIOD | QUESTION | OVERALL|
    |:-----:|:-----:|:-----:|:-----:|:-----:|  
    |Precision  |0.510955  |0.526462  |0.820755  |0.619391|
--- a/examples/iwslt2012/punc0/RESULTS.md
+++ b/examples/iwslt2012/punc0/RESULTS.md
@ -0,0 +1,9 @@
 # iwslt2012
 ## Ernie
 |       |COMMA  |  PERIOD | QUESTION | OVERALL|
 |:-----:|:-----:|:-----:|:-----:|:-----:|  
 |Precision  |0.510955  |0.526462  |0.820755  |0.619391|
 |Recall     |0.517433  |0.564179  |0.861386  |0.647666|
 |F1         |0.514173  |0.544669  |0.840580  |0.633141|
--- a/examples/librispeech/asr1/README.md
+++ b/examples/librispeech/asr1/README.md
@ -151,44 +151,22 @@ avg.sh best exp/conformer/checkpoints 20
 CUDA_VISIBLE_DEVICES= ./local/test.sh conf/conformer.yaml exp/conformer/checkpoints/avg_20
 ```
 ## Pretrained Model
-You can get the pretrained transformer or conformer using the scripts below:
+You can get the pretrained transformer or conformer from [this](../../../docs/source/released_model.md).
-```bash
+
 # Conformer:
 wget https://paddlespeech.bj.bcebos.com/s2t/librispeech/asr1/conformer.model.tar.gz
 # Transformer:
 wget https://paddlespeech.bj.bcebos.com/s2t/librispeech/asr1/transformer.model.tar.gz
 ```
 using the `tar` scripts to unpack the model and then you can use the script to test the model.
 For example:
 ```bash
-wget https://paddlespeech.bj.bcebos.com/s2t/librispeech/asr1/conformer.model.tar.gz
+wget https://paddlespeech.bj.bcebos.com/s2t/librispeech/asr1/asr1_conformer_librispeech_ckpt_0.1.1.model.tar.gz
-tar xzvf transformer.model.tar.gz
+tar xzvf asr1_conformer_librispeech_ckpt_0.1.1.model.tar.gz
 source path.sh
 # If you have process the data and get the manifest file， you can skip the following 2 steps
 bash local/data.sh --stage -1 --stop_stage -1
 bash local/data.sh --stage 2 --stop_stage 2
 CUDA_VISIBLE_DEVICES= ./local/test.sh conf/conformer.yaml exp/conformer/checkpoints/avg_20
 ```
-The performance of the released models are shown below:
+The performance of the released models are shown in [here](./RESULTS.md).
 ## Conformer
 train: Epoch 70, 4 V100-32G, best avg: 20
 | Model     | Params  | Config              | Augmentation | Test set   | Decode method          | Loss              | WER      |
 | --------- | ------- | ------------------- | ------------ | ---------- | ---------------------- | ----------------- | -------- |
 | conformer | 47.63 M | conf/conformer.yaml | spec_aug     | test-clean | attention              | 6.433612394332886 | 0.039771 |
 | conformer | 47.63 M | conf/conformer.yaml | spec_aug     | test-clean | ctc_greedy_search      | 6.433612394332886 | 0.040342 |
 | conformer | 47.63 M | conf/conformer.yaml | spec_aug     | test-clean | ctc_prefix_beam_search | 6.433612394332886 | 0.040342 |
 | conformer | 47.63 M | conf/conformer.yaml | spec_aug     | test-clean | attention_rescoring    | 6.433612394332886 | 0.033761 |
 ## Transformer
 train: Epoch 120, 4 V100-32G, 27 Day, best avg: 10
 | Model       | Params  | Config                | Augmentation | Test set   | Decode method          | Loss              | WER      |
 | ----------- | ------- | --------------------- | ------------ | ---------- | ---------------------- | ----------------- | -------- |
 | transformer | 32.52 M | conf/transformer.yaml | spec_aug     | test-clean | attention              | 6.382194232940674 | 0.049661 |
 | transformer | 32.52 M | conf/transformer.yaml | spec_aug     | test-clean | ctc_greedy_search      | 6.382194232940674 | 0.049566 |
 | transformer | 32.52 M | conf/transformer.yaml | spec_aug     | test-clean | ctc_prefix_beam_search | 6.382194232940674 | 0.049585 |
 | transformer | 32.52 M | conf/transformer.yaml | spec_aug     | test-clean | attention_rescoring    | 6.382194232940674 | 0.038135 |
 ## Stage 4: CTC Alignment 
 If you want to get the alignment between the audio and the text, you can use the ctc alignment. The code of this stage is shown below:
 ```bash
@ -227,8 +205,8 @@ In some situations, you want to use the trained model to do the inference for th
 ```
 you can train the model by yourself using ```bash run.sh --stage 0 --stop_stage 3```, or you can download the pretrained model through the script below:
 ```bash
-wget https://paddlespeech.bj.bcebos.com/s2t/librispeech/asr1/conformer.model.tar.gz
+wget https://paddlespeech.bj.bcebos.com/s2t/librispeech/asr1/asr1_conformer_librispeech_ckpt_0.1.1.model.tar.gz
-tar xzvf conformer.model.tar.gz
+tar xzvf asr1_conformer_librispeech_ckpt_0.1.1.model.tar.gz
 ```
 You can download the audio demo:
 ```bash
--- a/examples/librispeech/asr2/README.md
+++ b/examples/librispeech/asr2/README.md
@ -1,4 +1,4 @@
-# Transformer/Conformer ASR with Librispeech Asr2
+# Transformer/Conformer ASR with Librispeech ASR2
 This example contains code used to train a Transformer or [Conformer](http://arxiv.org/abs/2008.03802) model with [Librispeech dataset](http://www.openslr.org/resources/12) and use some functions in kaldi.
@ -213,17 +213,14 @@ avg.sh latest exp/transformer/checkpoints 10
 ./local/recog.sh  --ckpt_prefix exp/transformer/checkpoints/avg_10
 ```
 ## Pretrained Model
-You can get the pretrained transformer using the scripts below:
+You can get the pretrained models from [this](../../../docs/source/released_model.md).
-```bash
+
 # Transformer:
 wget https://paddlespeech.bj.bcebos.com/s2t/librispeech/asr2/transformer.model.tar.gz
 ```
 using the `tar` scripts to unpack the model and then you can use the script to test the model.
 For example:
 ```bash
-wget https://paddlespeech.bj.bcebos.com/s2t/librispeech/asr2/transformer.model.tar.gz
+wget https://paddlespeech.bj.bcebos.com/s2t/librispeech/asr2/asr2_transformer_librispeech_ckpt_0.1.1.model.tar.gz
-tar xzvf transformer.model.tar.gz
+tar xzvf asr2_transformer_librispeech_ckpt_0.1.1.model.tar.gz
 source path.sh
 # If you have process the data and get the manifest file， you can skip the following 2 steps
 bash local/data.sh --stage -1 --stop_stage -1
@ -231,26 +228,7 @@ bash local/data.sh --stage 2 --stop_stage 2
 CUDA_VISIBLE_DEVICES= ./local/test.sh conf/transformer.yaml exp/ctc/checkpoints/avg_10
 ```
-The performance of the released models are shown below:
+The performance of the released models are shown [here](./RESULTS.md).
 ### Transformer
 |    Model    | Params |          GPUS          |  Averaged Model  |        Config         | Augmentation |      Loss       |
 | :---------: | :----: | :--------------------: | :--------------: | :-------------------: | :----------: | :-------------: |
 | transformer | 32.52M | 8 Tesla V100-SXM2-32GB | 10-best val_loss | conf/transformer.yaml |   spec_aug   | 6.3197922706604 |
 #### Attention Rescore
 | Test Set   | Decode Method         | #Snt | #Wrd  | Corr | Sub  | Del  | Ins  | Err  | S.Err |
 | ---------- | --------------------- | ---- | ----- | ---- | ---- | ---- | ---- | ---- | ----- |
 | test-clean | attention             | 2620 | 52576 | 96.4 | 2.5  | 1.1  | 0.4  | 4.0  | 34.7  |
 | test-clean | ctc_greedy_search     | 2620 | 52576 | 95.9 | 3.7  | 0.4  | 0.5  | 4.6  | 48.0  |
 | test-clean | ctc_prefix_beamsearch | 2620 | 52576 | 95.9 | 3.7  | 0.4  | 0.5  | 4.6  | 47.6  |
 | test-clean | attention_rescore     | 2620 | 52576 | 96.8 | 2.9  | 0.3  | 0.4  | 3.7  | 38.0  |
 #### JoinCTC
 | Test Set   | Decode Method     | #Snt | #Wrd  | Corr | Sub  | Del  | Ins  | Err  | S.Err |
 | ---------- | ----------------- | ---- | ----- | ---- | ---- | ---- | ---- | ---- | ----- |
 | test-clean | join_ctc_only_att | 2620 | 52576 | 96.1 | 2.5  | 1.4  | 0.4  | 4.4  | 34.7  |
 | test-clean | join_ctc_w/o_lm   | 2620 | 52576 | 97.2 | 2.6  | 0.3  | 0.4  | 3.2  | 34.9  |
 | test-clean | join_ctc_w_lm     | 2620 | 52576 | 97.9 | 1.8  | 0.2  | 0.3  | 2.4  | 27.8  |
 Compare with [ESPNET](https://github.com/espnet/espnet/blob/master/egs/librispeech/asr1/RESULTS.md#pytorch-large-transformer-with-specaug-4-gpus--transformer-lm-4-gpus) we using 8gpu, but the model size (aheads4-adim256) small than it.
 ## Stage 5: CTC Alignment 
--- a/examples/ljspeech/tts1/README.md
+++ b/examples/ljspeech/tts1/README.md
@ -171,7 +171,8 @@ optional arguments:
 6. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.
 ## Pretrained Model
-Pretrained Model can be downloaded here. [transformer_tts_ljspeech_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/transformer_tts/transformer_tts_ljspeech_ckpt_0.4.zip)
+Pretrained Model can be downloaded here:
 - [transformer_tts_ljspeech_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/transformer_tts/transformer_tts_ljspeech_ckpt_0.4.zip)
 TransformerTTS  checkpoint contains files listed below.
 ```text
--- a/examples/ljspeech/tts3/README.md
+++ b/examples/ljspeech/tts3/README.md
@ -214,7 +214,8 @@ optional arguments:
 9. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.
 ## Pretrained Model
-Pretrained FastSpeech2 model with no silence in the edge of audios. [fastspeech2_nosil_ljspeech_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_ljspeech_ckpt_0.5.zip)
+Pretrained FastSpeech2 model with no silence in the edge of audios:
 - [fastspeech2_nosil_ljspeech_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_ljspeech_ckpt_0.5.zip)
 Model | Step | eval/loss | eval/l1_loss | eval/duration_loss | eval/pitch_loss| eval/energy_loss 
 :-------------:| :------------:| :-----: | :-----: | :--------: |:--------:|:---------:
--- a/examples/ljspeech/tts3/local/synthesize.sh
+++ b/examples/ljspeech/tts3/local/synthesize.sh
@ -26,7 +26,7 @@ if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
 fi
 # hifigan
-if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
+if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
    FLAGS_allocator_strategy=naive_best_fit \
    FLAGS_fraction_of_gpu_memory_to_use=0.01 \
    python3 ${BIN_DIR}/../synthesize.py \
--- a/examples/ljspeech/voc0/README.md
+++ b/examples/ljspeech/voc0/README.md
@ -50,4 +50,5 @@ Synthesize waveform.
 6. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.
 ## Pretrained Model
-Pretrained Model with residual channel equals 128 can be downloaded here. [waveflow_ljspeech_ckpt_0.3.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/waveflow/waveflow_ljspeech_ckpt_0.3.zip).
+Pretrained Model with residual channel equals 128 can be downloaded here:
 - [waveflow_ljspeech_ckpt_0.3.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/waveflow/waveflow_ljspeech_ckpt_0.3.zip)
--- a/examples/ljspeech/voc1/README.md
+++ b/examples/ljspeech/voc1/README.md
@ -127,7 +127,8 @@ optional arguments:
 5. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.
 ## Pretrained Model
-Pretrained models can be downloaded here. [pwg_ljspeech_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_ljspeech_ckpt_0.5.zip)
+Pretrained models can be downloaded here:
 - [pwg_ljspeech_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_ljspeech_ckpt_0.5.zip)
 Parallel WaveGAN checkpoint contains files listed below.
--- a/examples/ljspeech/voc5/README.md
+++ b/examples/ljspeech/voc5/README.md
@ -127,7 +127,8 @@ optional arguments:
 5. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.
 ## Pretrained Model
-The pretrained model can be downloaded here [hifigan_ljspeech_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_ljspeech_ckpt_0.2.0.zip).
+The pretrained model can be downloaded here:
 - [hifigan_ljspeech_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_ljspeech_ckpt_0.2.0.zip)
 Model | Step | eval/generator_loss | eval/mel_loss| eval/feature_matching_loss
@ -143,6 +144,5 @@ hifigan_ljspeech_ckpt_0.2.0
 └── snapshot_iter_2500000.pdz     # generator parameters of hifigan
 ```
 ## Acknowledgement
 We adapted some code from https://github.com/kan-bayashi/ParallelWaveGAN.
--- a/examples/other/1xt2x/src_deepspeech2x/init.py
+++ b/examples/other/1xt2x/src_deepspeech2x/init.py
@ -26,10 +26,10 @@ from paddlespeech.s2t.utils.log import Log
 #TODO(Hui Zhang): remove  fluid import
 logger = Log(__name__).getlog()
-########### hcak logging #############
+########### hack logging #############
 logger.warn = logger.warning
-########### hcak paddle #############
+########### hack paddle #############
 paddle.half = 'float16'
 paddle.float = 'float32'
 paddle.double = 'float64'
@ -110,7 +110,7 @@ if not hasattr(paddle, 'cat'):
    paddle.cat = cat
-########### hcak paddle.Tensor #############
+########### hack paddle.Tensor #############
 def item(x: paddle.Tensor):
    return x.numpy().item()
@ -353,7 +353,7 @@ if not hasattr(paddle.Tensor, 'tolist'):
    setattr(paddle.Tensor, 'tolist', tolist)
-########### hcak paddle.nn #############
+########### hack paddle.nn #############
 class GLU(nn.Layer):
    """Gated Linear Units (GLU) Layer"""
--- a/examples/other/ngram_lm/s0/local/build_zh_lm.sh
+++ b/examples/other/ngram_lm/s0/local/build_zh_lm.sh
@ -27,7 +27,7 @@ arpa=$3
 if [ $stage -le 0 ] && [ $stop_stage -ge 0 ];then
    # text tn & wordseg preprocess
    echo "process text."
-    python3 ${MAIN_ROOT}/utils/zh_tn.py ${type} ${text} ${text}.${type}.tn
+    python3 ${MAIN_ROOT}/utils/zh_tn.py --token_type ${type} ${text} ${text}.${type}.tn
 fi
 if [ $stage -le 1 ] && [ $stop_stage -ge 1 ];then
--- a/examples/other/ngram_lm/s0/local/download_lm_zh.sh
+++ b/examples/other/ngram_lm/s0/local/download_lm_zh.sh
@ -10,6 +10,11 @@ MD5="29e02312deb2e59b3c8686c7966d4fe3"
 TARGET=${DIR}/zh_giga.no_cna_cmn.prune01244.klm
 if [ -e $TARGET ];then
    echo "already have lm"
    exit 0;
 fi
 echo "Download language model ..."
 download $URL $MD5 $TARGET
 if [ $? -ne 0 ]; then
--- a/examples/vctk/tts3/README.md
+++ b/examples/vctk/tts3/README.md
@ -217,7 +217,8 @@ optional arguments:
 9. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.
 ## Pretrained Model
-Pretrained FastSpeech2 model with no silence in the edge of audios. [fastspeech2_nosil_vctk_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_vctk_ckpt_0.5.zip)
+Pretrained FastSpeech2 model with no silence in the edge of audios:
 - [fastspeech2_nosil_vctk_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_vctk_ckpt_0.5.zip)
 FastSpeech2 checkpoint contains files listed below.
 ```text
--- a/examples/vctk/voc1/README.md
+++ b/examples/vctk/voc1/README.md
@ -132,7 +132,8 @@ optional arguments:
 5. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.
 ## Pretrained Model
-Pretrained models can be downloaded here [pwg_vctk_ckpt_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_vctk_ckpt_0.1.1.zip).
+Pretrained models can be downloaded here:
 - [pwg_vctk_ckpt_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_vctk_ckpt_0.1.1.zip)
 Parallel WaveGAN checkpoint contains files listed below.
--- a/examples/vctk/voc5/README.md
+++ b/examples/vctk/voc5/README.md
@ -133,7 +133,8 @@ optional arguments:
 5. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.
 ## Pretrained Model
-The pretrained model can be downloaded here [hifigan_vctk_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_vctk_ckpt_0.2.0.zip).
+The pretrained model can be downloaded here:
 - [hifigan_vctk_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_vctk_ckpt_0.2.0.zip)
 Model | Step | eval/generator_loss | eval/mel_loss| eval/feature_matching_loss
--- a/examples/voxceleb/sv0/RESULT.md
+++ b/examples/voxceleb/sv0/RESULT.md
@ -4,4 +4,4 @@
 | Model | Number of Params | Release | Config | dim | Test set |  Cosine | Cosine + S-Norm | 
 | --- | --- | --- | --- | --- | --- | --- | ---- |
-| ECAPA-TDNN | 85M | 0.1.1 | conf/ecapa_tdnn.yaml |192 | test | 1.15 |  1.06 | 
+| ECAPA-TDNN | 85M | 0.2.0 | conf/ecapa_tdnn.yaml |192 | test | 1.02 |  0.95 | 
--- a/examples/voxceleb/sv0/conf/ecapa_tdnn.yaml
+++ b/examples/voxceleb/sv0/conf/ecapa_tdnn.yaml
@ -1,14 +1,16 @@
 ###########################################
 #                Data                 #
 ###########################################
 # we should explicitly specify the wav path of vox2 audio data converted from m4a
 vox2_base_path: 
 augment: True
-batch_size: 16
+batch_size: 32
 num_workers: 2
 num_speakers: 7205 # 1211 vox1, 5994 vox2, 7205 vox1+2, test speakers: 41
 shuffle: True
 skip_prep: False
 split_ratio: 0.9
 chunk_duration: 3.0 # seconds
 random_chunk: True
 verification_file: data/vox1/veri_test2.txt
 ###########################################################
 #                FEATURE EXTRACTION SETTING               #
@ -26,7 +28,6 @@ hop_size: 160        #10ms, sample rate 16000, 10 * 16000 / 1000 = 160
 # if we want use another model, please choose another configuration yaml file
 model:
  input_size: 80
  # "channels": [512, 512, 512, 512, 1536],
  channels: [1024, 1024, 1024, 1024, 3072]
  kernel_sizes: [5, 3, 3, 3, 1]
  dilations: [1, 2, 3, 4, 1]
@ -38,11 +39,19 @@ model:
 ###########################################
 seed: 1986 # according from speechbrain configuration
 epochs: 10
-save_interval: 1
+save_interval: 10
-log_interval: 1
+log_interval: 10
 learning_rate: 1e-8
 max_lr: 1e-3
 step_size: 140000
 ###########################################
 #                loss                     #
 ###########################################
 margin: 0.2
 scale: 30
 ###########################################
 #                Testing                  #
 ###########################################
--- a/examples/voxceleb/sv0/conf/ecapa_tdnn_small.yaml
+++ b/examples/voxceleb/sv0/conf/ecapa_tdnn_small.yaml
@ -0,0 +1,60 @@
 ###########################################
 #                Data                 #
 ###########################################
 augment: True
 batch_size: 32
 num_workers: 2
 num_speakers: 1211 # 1211 vox1, 5994 vox2, 7205 vox1+2, test speakers: 41
 shuffle: True
 skip_prep: False
 split_ratio: 0.9
 chunk_duration: 3.0 # seconds
 random_chunk: True
 verification_file: data/vox1/veri_test2.txt
 ###########################################################
 #                FEATURE EXTRACTION SETTING               #
 ###########################################################
 # currently, we only support fbank
 sr: 16000           # sample rate
 n_mels: 80
 window_size: 400     #25ms, sample rate 16000, 25 * 16000 / 1000 = 400 
 hop_size: 160        #10ms, sample rate 16000, 10 * 16000 / 1000 = 160
 ###########################################################
 #                       MODEL SETTING                     #
 ###########################################################
 # currently, we only support ecapa-tdnn in the ecapa_tdnn.yaml
 # if we want use another model, please choose another configuration yaml file
 model:
  input_size: 80
  channels: [512, 512, 512, 512, 1536]
  kernel_sizes: [5, 3, 3, 3, 1]
  dilations: [1, 2, 3, 4, 1]
  attention_channels: 128
  lin_neurons: 192
 ###########################################
 #                Training                 #
 ###########################################
 seed: 1986 # according from speechbrain configuration
 epochs: 100
 save_interval: 10
 log_interval: 10
 learning_rate: 1e-8
 max_lr: 1e-3
 step_size: 140000
 ###########################################
 #                loss                     #
 ###########################################
 margin: 0.2
 scale: 30
 ###########################################
 #                Testing                  #
 ###########################################
 global_embedding_norm: True
 embedding_mean_norm: True
 embedding_std_norm: False
--- a/examples/voxceleb/sv0/local/data.sh
+++ b/examples/voxceleb/sv0/local/data.sh
@ -12,7 +12,7 @@
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
-stage=1
+stage=0
 stop_stage=100
 . ${MAIN_ROOT}/utils/parse_options.sh || exit -1;
@ -30,29 +30,114 @@ dir=$1
 conf_path=$2
 mkdir -p ${dir}
-if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
+# Generally the `MAIN_ROOT` refers to the root of PaddleSpeech,
-    # data prepare for vox1 and vox2, vox2 must be converted from m4a to wav
+# which is defined in the path.sh
-    # we should use the local/convert.sh convert m4a to wav
+# And we will download the voxceleb data and rirs noise to ${MAIN_ROOT}/dataset
    python3 local/data_prepare.py \
                        --data-dir ${dir} \
                        --config ${conf_path}
 fi 
 TARGET_DIR=${MAIN_ROOT}/dataset
 mkdir -p ${TARGET_DIR}
 if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
-    # download data, generate manifests
+   # download data, generate manifests
-    python3 ${TARGET_DIR}/voxceleb/voxceleb1.py \
+   # we will generate the manifest.{dev,test} file from ${TARGET_DIR}/voxceleb/vox1/{dev,test} directory
-      --manifest_prefix="data/vox1/manifest" \
+   # and generate the meta info and download the trial file
   # manifest.dev: 148642
   # manifest.test: 4847
   echo "Start to download vox1 dataset and generate the manifest files "
   python3 ${TARGET_DIR}/voxceleb/voxceleb1.py \
      --manifest_prefix="${dir}/vox1/manifest" \
      --target_dir="${TARGET_DIR}/voxceleb/vox1/"
-    if [ $? -ne 0 ]; then
+   if [ $? -ne 0 ]; then
-        echo "Prepare voxceleb failed. Terminated."
+      echo "Prepare voxceleb1 failed. Terminated."
-        exit 1
+      exit 1
-    fi
+   fi
 fi
 if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
   # download voxceleb2 data
   # we will download the data and unzip the package
   # and we will store the m4a file in ${TARGET_DIR}/voxceleb/vox2/{dev,test}
   echo "start to download vox2 dataset"
   python3 ${TARGET_DIR}/voxceleb/voxceleb2.py \
      --download \
      --target_dir="${TARGET_DIR}/voxceleb/vox2/"
   if [ $? -ne 0 ]; then
      echo "Download voxceleb2 dataset failed. Terminated."
      exit 1
   fi
 fi
 if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
   # convert the m4a to wav
   # and we will not delete the original m4a file
   echo "start to convert the m4a to wav"
   bash local/convert.sh ${TARGET_DIR}/voxceleb/vox2/test/ || exit 1;
   if [ $? -ne 0 ]; then
      echo "Convert voxceleb2 dataset from m4a to wav failed. Terminated."
      exit 1
   fi
   echo "m4a convert to wav operation finished"
 fi
 if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
   # generate the vox2 manifest file from wav file
   # we will generate the ${dir}/vox2/manifest.vox2
   # because we use all the vox2 dataset to train, so collect all the vox2 data in one file
   echo "start generate the vox2 manifest files"
   python3 ${TARGET_DIR}/voxceleb/voxceleb2.py \
      --generate \
      --manifest_prefix="${dir}/vox2/manifest" \
      --target_dir="${TARGET_DIR}/voxceleb/vox2/"
   if [ $? -ne 0 ]; then
      echo "Prepare voxceleb2 dataset failed. Terminated."
      exit 1
   fi
 fi
 if [ ${stage} -le 5 ] && [ ${stop_stage} -ge 5 ]; then
   # generate the vox csv file
   # Currently, our training system use csv file for dataset
   echo "convert the json format to csv format to be compatible with training process"
   python3 local/make_vox_csv_dataset_from_json.py\
      --train "${dir}/vox1/manifest.dev" "${dir}/vox2/manifest.vox2"\
      --test "${dir}/vox1/manifest.test" \
      --target_dir "${dir}/vox/" \
      --config ${conf_path}
   if [ $? -ne 0 ]; then
      echo "Prepare voxceleb failed. Terminated."
      exit 1
   fi
 fi
 if [ ${stage} -le 6 ] && [ ${stop_stage} -ge 6 ]; then
   # generate the open rir noise manifest file
   echo "generate the open rir noise manifest file"
   python3 ${TARGET_DIR}/rir_noise/rir_noise.py\
      --manifest_prefix="${dir}/rir_noise/manifest" \
      --target_dir="${TARGET_DIR}/rir_noise/"
   if [ $? -ne 0 ]; then
      echo "Prepare rir_noise failed. Terminated."
      exit 1
   fi
 fi
 if [ ${stage} -le 7 ] && [ ${stop_stage} -ge 7 ]; then
   # generate the open rir noise manifest file
   echo "generate the open rir noise csv file"
   python3 local/make_rirs_noise_csv_dataset_from_json.py \
      --noise_dir="${TARGET_DIR}/rir_noise/" \
      --data_dir="${dir}/rir_noise/" \
      --config ${conf_path}
-   #  for dataset in train dev test; do
+   if [ $? -ne 0 ]; then
-   #      mv data/manifest.${dataset} data/manifest.${dataset}.raw
+      echo "Prepare rir_noise failed. Terminated."
-   #  done
+      exit 1
   fi
 fi
--- a/examples/voxceleb/sv0/local/make_rirs_noise_csv_dataset_from_json.py
+++ b/examples/voxceleb/sv0/local/make_rirs_noise_csv_dataset_from_json.py
@ -0,0 +1,167 @@
 # Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
 # You may obtain a copy of the License at
 #
 #     http://www.apache.org/licenses/LICENSE-2.0
 #
 # Unless required by applicable law or agreed to in writing, software
 # distributed under the License is distributed on an "AS IS" BASIS,
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
 """
 Convert the PaddleSpeech jsonline format data to csv format data in voxceleb experiment.
 Currently, Speaker Identificaton Training process use csv format.
 """
 import argparse
 import csv
 import os
 from typing import List
 import tqdm
 from yacs.config import CfgNode
 from paddleaudio import load as load_audio
 from paddlespeech.s2t.utils.log import Log
 from paddlespeech.vector.utils.vector_utils import get_chunks
 logger = Log(__name__).getlog()
 def get_chunks_list(wav_file: str,
                    split_chunks: bool,
                    base_path: str,
                    chunk_duration: float=3.0) -> List[List[str]]:
    """Get the single audio file info 
    Args:
        wav_file (list): the wav audio file and get this audio segment info list
        split_chunks (bool): audio split flag
        base_path (str): the audio base path 
        chunk_duration (float): the chunk duration. 
                                if set the split_chunks, we split the audio into multi-chunks segment.
    """
    waveform, sr = load_audio(wav_file)
    audio_id = wav_file.split("/rir_noise/")[-1].split(".")[0]
    audio_duration = waveform.shape[0] / sr
    ret = []
    if split_chunks and audio_duration > chunk_duration:  # Split into pieces of self.chunk_duration seconds.
        uniq_chunks_list = get_chunks(chunk_duration, audio_id, audio_duration)
        for idx, chunk in enumerate(uniq_chunks_list):
            s, e = chunk.split("_")[-2:]  # Timestamps of start and end
            start_sample = int(float(s) * sr)
            end_sample = int(float(e) * sr)
            # currently, all vector csv data format use one representation
            # id, duration, wav, start, stop, label
            # in rirs noise, all the label name is 'noise'
            # the label is string type and we will convert it to integer type in training
            ret.append([
                chunk, audio_duration, wav_file, start_sample, end_sample,
                "noise"
            ])
    else:  # Keep whole audio.
        ret.append(
            [audio_id, audio_duration, wav_file, 0, waveform.shape[0], "noise"])
    return ret
 def generate_csv(wav_files,
                 output_file: str,
                 base_path: str,
                 split_chunks: bool=True):
    """Prepare the csv file according the wav files
    Args:
        wav_files (list): all the audio list to prepare the csv file
        output_file (str): the output csv file
        config (CfgNode): yaml configuration content
        split_chunks (bool): audio split flag
    """
    logger.info(f'Generating csv: {output_file}')
    header = ["utt_id", "duration", "wav", "start", "stop", "label"]
    csv_lines = []
    for item in tqdm.tqdm(wav_files):
        csv_lines.extend(
            get_chunks_list(
                item, base_path=base_path, split_chunks=split_chunks))
    if not os.path.exists(os.path.dirname(output_file)):
        os.makedirs(os.path.dirname(output_file))
    with open(output_file, mode="w") as csv_f:
        csv_writer = csv.writer(
            csv_f, delimiter=",", quotechar='"', quoting=csv.QUOTE_MINIMAL)
        csv_writer.writerow(header)
        for line in csv_lines:
            csv_writer.writerow(line)
 def prepare_data(args, config):
    """Convert the jsonline format to csv format
    Args:
        args (argparse.Namespace): scripts args
        config (CfgNode): yaml configuration content
    """
    # if external config set the skip_prep flat, we will do nothing
    if config.skip_prep:
        return
    base_path = args.noise_dir
    wav_path = os.path.join(base_path, "RIRS_NOISES")
    logger.info(f"base path: {base_path}")
    logger.info(f"wav path: {wav_path}")
    rir_list = os.path.join(wav_path, "real_rirs_isotropic_noises", "rir_list")
    rir_files = []
    with open(rir_list, 'r') as f:
        for line in f.readlines():
            rir_file = line.strip().split(' ')[-1]
            rir_files.append(os.path.join(base_path, rir_file))
    noise_list = os.path.join(wav_path, "pointsource_noises", "noise_list")
    noise_files = []
    with open(noise_list, 'r') as f:
        for line in f.readlines():
            noise_file = line.strip().split(' ')[-1]
            noise_files.append(os.path.join(base_path, noise_file))
    csv_path = os.path.join(args.data_dir, 'csv')
    logger.info(f"csv path: {csv_path}")
    generate_csv(
        rir_files, os.path.join(csv_path, 'rir.csv'), base_path=base_path)
    generate_csv(
        noise_files, os.path.join(csv_path, 'noise.csv'), base_path=base_path)
 if __name__ == "__main__":
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument(
        "--noise_dir",
        default=None,
        required=True,
        help="The noise dataset dataset directory.")
    parser.add_argument(
        "--data_dir",
        default=None,
        required=True,
        help="The target directory stores the csv files")
    parser.add_argument(
        "--config",
        default=None,
        required=True,
        type=str,
        help="configuration file")
    args = parser.parse_args()
    # parse the yaml config file
    config = CfgNode(new_allowed=True)
    if args.config:
        config.merge_from_file(args.config)
    # prepare the csv file from jsonlines files
    prepare_data(args, config)
--- a/examples/voxceleb/sv0/local/make_vox_csv_dataset_from_json.py
+++ b/examples/voxceleb/sv0/local/make_vox_csv_dataset_from_json.py
@ -0,0 +1,251 @@
 # Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
 # You may obtain a copy of the License at
 #
 #     http://www.apache.org/licenses/LICENSE-2.0
 #
 # Unless required by applicable law or agreed to in writing, software
 # distributed under the License is distributed on an "AS IS" BASIS,
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
 """
 Convert the PaddleSpeech jsonline format data to csv format data in voxceleb experiment.
 Currently, Speaker Identificaton Training process use csv format.
 """
 import argparse
 import csv
 import json
 import os
 import random
 import tqdm
 from yacs.config import CfgNode
 from paddleaudio import load as load_audio
 from paddlespeech.s2t.utils.log import Log
 from paddlespeech.vector.utils.vector_utils import get_chunks
 logger = Log(__name__).getlog()
 def prepare_csv(wav_files, output_file, config, split_chunks=True):
    """Prepare the csv file according the wav files
    Args:
        wav_files (list): all the audio list to prepare the csv file
        output_file (str): the output csv file
        config (CfgNode): yaml configuration content
        split_chunks (bool, optional): audio split flag. Defaults to True.
    """
    if not os.path.exists(os.path.dirname(output_file)):
        os.makedirs(os.path.dirname(output_file))
    csv_lines = []
    header = ["utt_id", "duration", "wav", "start", "stop", "label"]
    # voxceleb meta info for each training utterance segment
    # we extract a segment from a utterance to train 
    # and the segment' period is between start and stop time point in the original wav file
    # each field in the meta info means as follows:
    # utt_id: the utterance segment name, which is uniq in training dataset
    # duration: the total utterance time
    # wav: utterance file path, which should be absoulute path
    # start: start point in the original wav file sample point range
    # stop: stop point in the original wav file sample point range
    # label: the utterance segment's label name, 
    #        which is speaker name in speaker verification domain
    for item in tqdm.tqdm(wav_files, total=len(wav_files)):
        item = json.loads(item.strip())
        audio_id = item['utt'].replace(".wav",
                                       "")  # we remove the wav suffix name
        audio_duration = item['feat_shape'][0]
        wav_file = item['feat']
        label = audio_id.split('-')[
            0]  # speaker name in speaker verification domain
        waveform, sr = load_audio(wav_file)
        if split_chunks:
            uniq_chunks_list = get_chunks(config.chunk_duration, audio_id,
                                          audio_duration)
            for chunk in uniq_chunks_list:
                s, e = chunk.split("_")[-2:]  # Timestamps of start and end
                start_sample = int(float(s) * sr)
                end_sample = int(float(e) * sr)
                # id, duration, wav, start, stop, label
                # in vector, the label in speaker id
                csv_lines.append([
                    chunk, audio_duration, wav_file, start_sample, end_sample,
                    label
                ])
        else:
            csv_lines.append([
                audio_id, audio_duration, wav_file, 0, waveform.shape[0], label
            ])
    with open(output_file, mode="w") as csv_f:
        csv_writer = csv.writer(
            csv_f, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
        csv_writer.writerow(header)
        for line in csv_lines:
            csv_writer.writerow(line)
 def get_enroll_test_list(dataset_list, verification_file):
    """Get the enroll and test utterance list from all the voxceleb1 test utterance dataset.
       Generally, we get the enroll and test utterances from the verfification file.
       The verification file format as follows:
       target/nontarget enroll-utt test-utt,
       we set 0 as nontarget and 1 as target, eg:
       0 a.wav b.wav
       1 a.wav a.wav
    Args:
        dataset_list (list): all the dataset to get the test utterances
        verification_file (str): voxceleb1 trial file
    """
    logger.info(f"verification file: {verification_file}")
    enroll_audios = set()
    test_audios = set()
    with open(verification_file, 'r') as f:
        for line in f:
            _, enroll_file, test_file = line.strip().split(' ')
            enroll_audios.add('-'.join(enroll_file.split('/')))
            test_audios.add('-'.join(test_file.split('/')))
    enroll_files = []
    test_files = []
    for dataset in dataset_list:
        with open(dataset, 'r') as f:
            for line in f:
                # audio_id may be in enroll and test at the same time
                # eg: 1 a.wav a.wav
                # the audio a.wav is enroll and test file at the same time
                audio_id = json.loads(line.strip())['utt']
                if audio_id in enroll_audios:
                    enroll_files.append(line)
                if audio_id in test_audios:
                    test_files.append(line)
    enroll_files = sorted(enroll_files)
    test_files = sorted(test_files)
    return enroll_files, test_files
 def get_train_dev_list(dataset_list, target_dir, split_ratio):
    """Get the train and dev utterance list from all the training utterance dataset.
       Generally, we use the split_ratio as the train dataset ratio,
       and the remaining utterance (ratio is 1 - split_ratio) is the dev dataset
    Args:
        dataset_list (list): all the dataset to get the all utterances
        target_dir (str): the target train and dev directory, 
                          we will create the csv directory to store the {train,dev}.csv file
        split_ratio (float): train dataset ratio in all utterance list
    """
    logger.info("start to get train and dev utt list")
    if not os.path.exists(os.path.join(target_dir, "meta")):
        os.makedirs(os.path.join(target_dir, "meta"))
    audio_files = []
    speakers = set()
    for dataset in dataset_list:
        with open(dataset, 'r') as f:
            for line in f:
                # the label is speaker name
                label_name = json.loads(line.strip())['utt2spk']
                speakers.add(label_name)
                audio_files.append(line.strip())
    speakers = sorted(speakers)
    logger.info(f"we get {len(speakers)} speakers from all the train dataset")
    with open(os.path.join(target_dir, "meta", "label2id.txt"), 'w') as f:
        for label_id, label_name in enumerate(speakers):
            f.write(f'{label_name} {label_id}\n')
    logger.info(
        f'we store the speakers to {os.path.join(target_dir, "meta", "label2id.txt")}'
    )
    # the split_ratio is for train dataset 
    # the remaining is for dev dataset
    split_idx = int(split_ratio * len(audio_files))
    audio_files = sorted(audio_files)
    random.shuffle(audio_files)
    train_files, dev_files = audio_files[:split_idx], audio_files[split_idx:]
    logger.info(
        f"we get train utterances: {len(train_files)}, dev utterance: {len(dev_files)}"
    )
    return train_files, dev_files
 def prepare_data(args, config):
    """Convert the jsonline format to csv format
    Args:
        args (argparse.Namespace): scripts args
        config (CfgNode): yaml configuration content
    """
    # stage0: set the random seed
    random.seed(config.seed)
    # if external config set the skip_prep flat, we will do nothing
    if config.skip_prep:
        return
    # stage 1: prepare the enroll and test csv file
    #          And we generate the speaker to label file label2id.txt
    logger.info("start to prepare the data csv file")
    enroll_files, test_files = get_enroll_test_list(
        [args.test], verification_file=config.verification_file)
    prepare_csv(
        enroll_files,
        os.path.join(args.target_dir, "csv", "enroll.csv"),
        config,
        split_chunks=False)
    prepare_csv(
        test_files,
        os.path.join(args.target_dir, "csv", "test.csv"),
        config,
        split_chunks=False)
    # stage 2: prepare the train and dev csv file
    #          we get the train dataset ratio as config.split_ratio
    #          and the remaining is dev dataset
    logger.info("start to prepare the data csv file")
    train_files, dev_files = get_train_dev_list(
        args.train, target_dir=args.target_dir, split_ratio=config.split_ratio)
    prepare_csv(train_files,
                os.path.join(args.target_dir, "csv", "train.csv"), config)
    prepare_csv(dev_files,
                os.path.join(args.target_dir, "csv", "dev.csv"), config)
 if __name__ == "__main__":
    parser = argparse.ArgumentParser(description=__doc__)
    parser.add_argument(
        "--train",
        required=True,
        nargs='+',
        help="The jsonline files list for train.")
    parser.add_argument(
        "--test", required=True, help="The jsonline file for test")
    parser.add_argument(
        "--target_dir",
        default=None,
        required=True,
        help="The target directory stores the csv files and meta file.")
    parser.add_argument(
        "--config",
        default=None,
        required=True,
        type=str,
        help="configuration file")
    args = parser.parse_args()
    # parse the yaml config file
    config = CfgNode(new_allowed=True)
    if args.config:
        config.merge_from_file(args.config)
    # prepare the csv file from jsonlines files
    prepare_data(args, config)
--- a/examples/voxceleb/sv0/run.sh
+++ b/examples/voxceleb/sv0/run.sh
@ -18,24 +18,22 @@ set -e
 #######################################################################
 # stage 0: data prepare, including voxceleb1 download and generate {train,dev,enroll,test}.csv
-#          voxceleb2 data is m4a format, so we need user to convert the m4a to wav yourselves as described in Readme.md with the script local/convert.sh
+#          voxceleb2 data is m4a format, so we need convert the m4a to wav yourselves with the script local/convert.sh
 # stage 1: train the speaker identification model
 # stage 2: test speaker identification 
-# stage 3: extract the training embeding to train the LDA and PLDA
+# stage 3: (todo)extract the training embeding to train the LDA and PLDA
 ######################################################################
 # we can set the variable PPAUDIO_HOME to specifiy the root directory of the downloaded vox1 and vox2 dataset 
 # default the dataset will be stored in the ~/.paddleaudio/
 # the vox2 dataset is stored in m4a format, we need to convert the audio from m4a to wav yourself
-# and put all of them to ${PPAUDIO_HOME}/datasets/vox2
+# and put all of them to ${MAIN_ROOT}/datasets/vox2
-# we will find the wav from ${PPAUDIO_HOME}/datasets/vox1/wav and ${PPAUDIO_HOME}/datasets/vox2/wav
+# we will find the wav from ${MAIN_ROOT}/datasets/vox1/{dev,test}/wav and ${MAIN_ROOT}/datasets/vox2/wav
-# export PPAUDIO_HOME=
+
 stage=0
 stop_stage=50
 # data directory
 # if we set the variable ${dir}, we will store the wav info to this directory
-# otherwise, we will store the wav info to vox1 and vox2 directory respectively
+# otherwise, we will store the wav info to data/vox1 and data/vox2 directory respectively
 # vox2 wav path, we must convert the m4a format to wav format    
 dir=data/                                 # data info directory   
@ -64,6 +62,6 @@ if [ $stage -le 2 ] && [ ${stop_stage} -ge 2 ]; then
 fi
 # if [ $stage -le 3 ]; then
-#      # stage 2: extract the training embeding to train the LDA and PLDA
+#      # stage 3: extract the training embeding to train the LDA and PLDA
 #      # todo: extract the training embedding
 # fi 
--- a/paddleaudio/paddleaudio/compliance/librosa.py
+++ b/paddleaudio/paddleaudio/compliance/librosa.py
@ -341,7 +341,7 @@ def stft(x: np.ndarray,
        hop_length (Optional[int], optional): Number of steps to advance between adjacent windows. Defaults to None.
        win_length (Optional[int], optional): The size of window. Defaults to None.
        window (str, optional): A string of window specification. Defaults to "hann".
-        center (bool, optional): Whether to pad `x` to make that the :math:`t \times hop\_length` at the center of `t`-th frame. Defaults to True.
+        center (bool, optional): Whether to pad `x` to make that the :math:`t \times hop\\_length` at the center of `t`-th frame. Defaults to True.
        dtype (type, optional): Data type of STFT results. Defaults to np.complex64.
        pad_mode (str, optional): Choose padding pattern when `center` is `True`. Defaults to "reflect".
@ -509,7 +509,7 @@ def melspectrogram(x: np.ndarray,
        fmin (float, optional): Minimum frequency in Hz. Defaults to 50.0.
        fmax (Optional[float], optional): Maximum frequency in Hz. Defaults to None.
        window (str, optional): A string of window specification. Defaults to "hann".
-        center (bool, optional): Whether to pad `x` to make that the :math:`t \times hop\_length` at the center of `t`-th frame. Defaults to True.
+        center (bool, optional): Whether to pad `x` to make that the :math:`t \times hop\\_length` at the center of `t`-th frame. Defaults to True.
        pad_mode (str, optional): Choose padding pattern when `center` is `True`. Defaults to "reflect".
        power (float, optional): Exponent for the magnitude melspectrogram. Defaults to 2.0.
        to_db (bool, optional): Enable db scale. Defaults to True.
@ -564,7 +564,7 @@ def spectrogram(x: np.ndarray,
        window_size (int, optional): Size of FFT and window length. Defaults to 512.
        hop_length (int, optional): Number of steps to advance between adjacent windows. Defaults to 320.
        window (str, optional): A string of window specification. Defaults to "hann".
-        center (bool, optional): Whether to pad `x` to make that the :math:`t \times hop\_length` at the center of `t`-th frame. Defaults to True.
+        center (bool, optional): Whether to pad `x` to make that the :math:`t \times hop\\_length` at the center of `t`-th frame. Defaults to True.
        pad_mode (str, optional): Choose padding pattern when `center` is `True`. Defaults to "reflect".
        power (float, optional): Exponent for the magnitude melspectrogram. Defaults to 2.0.
--- a/paddleaudio/paddleaudio/datasets/voxceleb.py
+++ b/paddleaudio/paddleaudio/datasets/voxceleb.py
@ -261,7 +261,7 @@ class VoxCeleb(Dataset):
                     output_file: str,
                     split_chunks: bool=True):
        print(f'Generating csv: {output_file}')
-        header = ["ID", "duration", "wav", "start", "stop", "spk_id"]
+        header = ["id", "duration", "wav", "start", "stop", "spk_id"]
        # Note: this may occurs c++ execption, but the program will execute fine
        # so we can ignore the execption 
        with Pool(cpu_count()) as p:
--- a/paddleaudio/paddleaudio/features/layers.py
+++ b/paddleaudio/paddleaudio/features/layers.py
@ -42,7 +42,7 @@ class Spectrogram(nn.Layer):
        win_length (Optional[int], optional): The window length of the short time FFT. If `None`, it is set to same as `n_fft`. Defaults to None.
        window (str, optional): The window function applied to the signal before the Fourier transform. Supported window functions: 'hamming', 'hann', 'kaiser', 'gaussian', 'exponential', 'triang', 'bohman', 'blackman', 'cosine', 'tukey', 'taylor'. Defaults to 'hann'.
        power (float, optional): Exponent for the magnitude spectrogram. Defaults to 2.0.
-        center (bool, optional): Whether to pad `x` to make that the :math:`t \times hop\_length` at the center of `t`-th frame. Defaults to True.
+        center (bool, optional): Whether to pad `x` to make that the :math:`t \times hop\\_length` at the center of `t`-th frame. Defaults to True.
        pad_mode (str, optional): Choose padding pattern when `center` is `True`. Defaults to 'reflect'.
        dtype (str, optional): Data type of input and window. Defaults to 'float32'.
    """
@ -99,7 +99,7 @@ class MelSpectrogram(nn.Layer):
        win_length (Optional[int], optional): The window length of the short time FFT. If `None`, it is set to same as `n_fft`. Defaults to None.
        window (str, optional): The window function applied to the signal before the Fourier transform. Supported window functions: 'hamming', 'hann', 'kaiser', 'gaussian', 'exponential', 'triang', 'bohman', 'blackman', 'cosine', 'tukey', 'taylor'. Defaults to 'hann'.
        power (float, optional): Exponent for the magnitude spectrogram. Defaults to 2.0.
-        center (bool, optional): Whether to pad `x` to make that the :math:`t \times hop\_length` at the center of `t`-th frame. Defaults to True.
+        center (bool, optional): Whether to pad `x` to make that the :math:`t \times hop\\_length` at the center of `t`-th frame. Defaults to True.
        pad_mode (str, optional): Choose padding pattern when `center` is `True`. Defaults to 'reflect'.
        n_mels (int, optional): Number of mel bins. Defaults to 64.
        f_min (float, optional): Minimum frequency in Hz. Defaults to 50.0.
@ -176,7 +176,7 @@ class LogMelSpectrogram(nn.Layer):
        win_length (Optional[int], optional): The window length of the short time FFT. If `None`, it is set to same as `n_fft`. Defaults to None.
        window (str, optional): The window function applied to the signal before the Fourier transform. Supported window functions: 'hamming', 'hann', 'kaiser', 'gaussian', 'exponential', 'triang', 'bohman', 'blackman', 'cosine', 'tukey', 'taylor'. Defaults to 'hann'.
        power (float, optional): Exponent for the magnitude spectrogram. Defaults to 2.0.
-        center (bool, optional): Whether to pad `x` to make that the :math:`t \times hop\_length` at the center of `t`-th frame. Defaults to True.
+        center (bool, optional): Whether to pad `x` to make that the :math:`t \times hop\\_length` at the center of `t`-th frame. Defaults to True.
        pad_mode (str, optional): Choose padding pattern when `center` is `True`. Defaults to 'reflect'.
        n_mels (int, optional): Number of mel bins. Defaults to 64.
        f_min (float, optional): Minimum frequency in Hz. Defaults to 50.0.
@ -257,7 +257,7 @@ class MFCC(nn.Layer):
        win_length (Optional[int], optional): The window length of the short time FFT. If `None`, it is set to same as `n_fft`. Defaults to None.
        window (str, optional): The window function applied to the signal before the Fourier transform. Supported window functions: 'hamming', 'hann', 'kaiser', 'gaussian', 'exponential', 'triang', 'bohman', 'blackman', 'cosine', 'tukey', 'taylor'. Defaults to 'hann'.
        power (float, optional): Exponent for the magnitude spectrogram. Defaults to 2.0.
-        center (bool, optional): Whether to pad `x` to make that the :math:`t \times hop\_length` at the center of `t`-th frame. Defaults to True.
+        center (bool, optional): Whether to pad `x` to make that the :math:`t \times hop\\_length` at the center of `t`-th frame. Defaults to True.
        pad_mode (str, optional): Choose padding pattern when `center` is `True`. Defaults to 'reflect'.
        n_mels (int, optional): Number of mel bins. Defaults to 64.
        f_min (float, optional): Minimum frequency in Hz. Defaults to 50.0.
--- a/paddleaudio/paddleaudio/metric/init.py
+++ b/paddleaudio/paddleaudio/metric/init.py
@ -14,4 +14,3 @@
 from .dtw import dtw_distance
 from .eer import compute_eer
 from .eer import compute_minDCF
 from .mcd import mcd_distance
--- a/paddleaudio/paddleaudio/metric/mcd.py
+++ b/paddleaudio/paddleaudio/metric/mcd.py
@ -1,63 +0,0 @@
 # Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
 # You may obtain a copy of the License at
 #
 #     http://www.apache.org/licenses/LICENSE-2.0
 #
 # Unless required by applicable law or agreed to in writing, software
 # distributed under the License is distributed on an "AS IS" BASIS,
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
 from typing import Callable
 import mcd.metrics_fast as mt
 import numpy as np
 from mcd import dtw
 __all__ = [
    'mcd_distance',
 ]
 def mcd_distance(xs: np.ndarray,
                 ys: np.ndarray,
                 cost_fn: Callable=mt.logSpecDbDist) -> float:
    """Mel cepstral distortion (MCD), dtw distance.
    Dynamic Time Warping.
    Uses dynamic programming to compute:
    Examples:
        .. code-block:: python
            wps[i, j] = cost_fn(xs[i], ys[j]) + min(
                            wps[i-1, j  ],  // vertical   / insertion / expansion
                            wps[i  , j-1],  // horizontal / deletion  / compression
                            wps[i-1, j-1])  // diagonal   / match
            dtw = sqrt(wps[-1, -1])
    Cost Function:
    Examples:
        .. code-block:: python
            logSpecDbConst = 10.0 / math.log(10.0) * math.sqrt(2.0)
            def logSpecDbDist(x, y):
                diff = x - y
                return logSpecDbConst * math.sqrt(np.inner(diff, diff))
    Args:
        xs (np.ndarray): ref sequence, [T,D]
        ys (np.ndarray): hyp sequence, [T,D]
        cost_fn (Callable, optional): Cost function. Defaults to mt.logSpecDbDist.
    Returns:
        float: dtw distance
    """
    min_cost, path = dtw.dtw(xs, ys, cost_fn)
    return min_cost
--- a/paddleaudio/paddleaudio/utils/numeric.py
+++ b/paddleaudio/paddleaudio/utils/numeric.py
@ -0,0 +1,30 @@
 # Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
 # You may obtain a copy of the License at
 #
 #     http://www.apache.org/licenses/LICENSE-2.0
 #
 # Unless required by applicable law or agreed to in writing, software
 # distributed under the License is distributed on an "AS IS" BASIS,
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
 import numpy as np
 def pcm16to32(audio: np.ndarray) -> np.ndarray:
    """pcm int16 to float32
    Args:
        audio (np.ndarray): Waveform with dtype of int16.
    Returns:
        np.ndarray: Waveform with dtype of float32.
    """
    if audio.dtype == np.int16:
        audio = audio.astype("float32")
        bits = np.iinfo(np.int16).bits
        audio = audio / (2**(bits - 1))
    return audio
--- a/paddleaudio/setup.py
+++ b/paddleaudio/setup.py
@ -19,7 +19,7 @@ from setuptools.command.install import install
 from setuptools.command.test import test
 # set the version here
-VERSION = '0.2.0'
+VERSION = '0.2.1'
 # Inspired by the example at https://pytest.org/latest/goodpractises.html
@ -83,8 +83,7 @@ setuptools.setup(
    python_requires='>=3.6',
    install_requires=[
        'numpy >= 1.15.0', 'scipy >= 1.0.0', 'resampy >= 0.2.2',
-        'soundfile >= 0.9.0', 'colorlog', 'dtaidistance == 2.3.1', 'mcd >= 0.4',
+        'soundfile >= 0.9.0', 'colorlog', 'dtaidistance == 2.3.1', 'pathos'
        'pathos'
    ],
    extras_require={
        'test': [
--- a/paddlespeech/cli/asr/infer.py
+++ b/paddlespeech/cli/asr/infer.py
@ -29,9 +29,10 @@ from ..download import get_path_from_url
 from ..executor import BaseExecutor
 from ..log import logger
 from ..utils import cli_register
 from ..utils import download_and_decompress
 from ..utils import MODEL_HOME
 from ..utils import stats_wrapper
 from .pretrained_models import model_alias
 from .pretrained_models import pretrained_models
 from paddlespeech.s2t.frontend.featurizer.text_featurizer import TextFeaturizer
 from paddlespeech.s2t.transform.transformation import Transformation
 from paddlespeech.s2t.utils.dynamic_import import dynamic_import
@ -39,94 +40,13 @@ from paddlespeech.s2t.utils.utility import UpdateConfig
 __all__ = ['ASRExecutor']
 pretrained_models = {
    # The tags for pretrained_models should be "{model_name}[_{dataset}][-{lang}][-...]".
    # e.g. "conformer_wenetspeech-zh-16k" and "panns_cnn6-32k".
    # Command line and python api use "{model_name}[_{dataset}]" as --model, usage:
    # "paddlespeech asr --model conformer_wenetspeech --lang zh --sr 16000 --input ./input.wav"
    "conformer_wenetspeech-zh-16k": {
        'url':
        'https://paddlespeech.bj.bcebos.com/s2t/wenetspeech/asr1_conformer_wenetspeech_ckpt_0.1.1.model.tar.gz',
        'md5':
        '76cb19ed857e6623856b7cd7ebbfeda4',
        'cfg_path':
        'model.yaml',
        'ckpt_path':
        'exp/conformer/checkpoints/wenetspeech',
    },
    "transformer_librispeech-en-16k": {
        'url':
        'https://paddlespeech.bj.bcebos.com/s2t/librispeech/asr1/asr1_transformer_librispeech_ckpt_0.1.1.model.tar.gz',
        'md5':
        '2c667da24922aad391eacafe37bc1660',
        'cfg_path':
        'model.yaml',
        'ckpt_path':
        'exp/transformer/checkpoints/avg_10',
    },
    "deepspeech2offline_aishell-zh-16k": {
        'url':
        'https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_aishell_ckpt_0.1.1.model.tar.gz',
        'md5':
        '932c3593d62fe5c741b59b31318aa314',
        'cfg_path':
        'model.yaml',
        'ckpt_path':
        'exp/deepspeech2/checkpoints/avg_1',
        'lm_url':
        'https://deepspeech.bj.bcebos.com/zh_lm/zh_giga.no_cna_cmn.prune01244.klm',
        'lm_md5':
        '29e02312deb2e59b3c8686c7966d4fe3'
    },
    "deepspeech2online_aishell-zh-16k": {
        'url':
        'https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_online_aishell_ckpt_0.1.1.model.tar.gz',
        'md5':
        'd5e076217cf60486519f72c217d21b9b',
        'cfg_path':
        'model.yaml',
        'ckpt_path':
        'exp/deepspeech2_online/checkpoints/avg_1',
        'lm_url':
        'https://deepspeech.bj.bcebos.com/zh_lm/zh_giga.no_cna_cmn.prune01244.klm',
        'lm_md5':
        '29e02312deb2e59b3c8686c7966d4fe3'
    },
    "deepspeech2offline_librispeech-en-16k": {
        'url':
        'https://paddlespeech.bj.bcebos.com/s2t/librispeech/asr0/asr0_deepspeech2_librispeech_ckpt_0.1.1.model.tar.gz',
        'md5':
        'f5666c81ad015c8de03aac2bc92e5762',
        'cfg_path':
        'model.yaml',
        'ckpt_path':
        'exp/deepspeech2/checkpoints/avg_1',
        'lm_url':
        'https://deepspeech.bj.bcebos.com/en_lm/common_crawl_00.prune01111.trie.klm',
        'lm_md5':
        '099a601759d467cd0a8523ff939819c5'
    },
 }
 model_alias = {
    "deepspeech2offline":
    "paddlespeech.s2t.models.ds2:DeepSpeech2Model",
    "deepspeech2online":
    "paddlespeech.s2t.models.ds2_online:DeepSpeech2ModelOnline",
    "conformer":
    "paddlespeech.s2t.models.u2:U2Model",
    "transformer":
    "paddlespeech.s2t.models.u2:U2Model",
    "wenetspeech":
    "paddlespeech.s2t.models.u2:U2Model",
 }
@cli_register(
    name='paddlespeech.asr', description='Speech to text infer command.')
 class ASRExecutor(BaseExecutor):
    def __init__(self):
-        super(ASRExecutor, self).__init__()
+        super().__init__()
        self.model_alias = model_alias
        self.pretrained_models = pretrained_models
        self.parser = argparse.ArgumentParser(
            prog='paddlespeech.asr', add_help=True)
@ -136,7 +56,9 @@ class ASRExecutor(BaseExecutor):
            '--model',
            type=str,
            default='conformer_wenetspeech',
-            choices=[tag[:tag.index('-')] for tag in pretrained_models.keys()],
+            choices=[
                tag[:tag.index('-')] for tag in self.pretrained_models.keys()
            ],
            help='Choose model type of asr task.')
        self.parser.add_argument(
            '--lang',
@ -192,23 +114,6 @@ class ASRExecutor(BaseExecutor):
            action='store_true',
            help='Increase logger verbosity of current task.')
    def _get_pretrained_path(self, tag: str) -> os.PathLike:
        """
        Download and returns pretrained resources path of current task.
        """
        support_models = list(pretrained_models.keys())
        assert tag in pretrained_models, 'The model "{}" you want to use has not been supported, please choose other models.\nThe support models includes:\n\t\t{}\n'.format(
            tag, '\n\t\t'.join(support_models))
        res_path = os.path.join(MODEL_HOME, tag)
        decompressed_path = download_and_decompress(pretrained_models[tag],
                                                    res_path)
        decompressed_path = os.path.abspath(decompressed_path)
        logger.info(
            'Use pretrained model stored in: {}'.format(decompressed_path))
        return decompressed_path
    def _init_from_path(self,
                        model_type: str='wenetspeech',
                        lang: str='zh',
@ -219,6 +124,7 @@ class ASRExecutor(BaseExecutor):
        """
        Init model and other resources from a specific path.
        """
        logger.info("start to init the model")
        if hasattr(self, 'model'):
            logger.info('Model had been initialized.')
            return
@ -228,18 +134,20 @@ class ASRExecutor(BaseExecutor):
            tag = model_type + '-' + lang + '-' + sample_rate_str
            res_path = self._get_pretrained_path(tag)  # wenetspeech_zh
            self.res_path = res_path
-            self.cfg_path = os.path.join(res_path,
+            self.cfg_path = os.path.join(
-                                         pretrained_models[tag]['cfg_path'])
+                res_path, self.pretrained_models[tag]['cfg_path'])
            self.ckpt_path = os.path.join(
-                res_path, pretrained_models[tag]['ckpt_path'] + ".pdparams")
+                res_path,
                self.pretrained_models[tag]['ckpt_path'] + ".pdparams")
            logger.info(res_path)
-            logger.info(self.cfg_path)
+
            logger.info(self.ckpt_path)
        else:
            self.cfg_path = os.path.abspath(cfg_path)
            self.ckpt_path = os.path.abspath(ckpt_path + ".pdparams")
            self.res_path = os.path.dirname(
                os.path.dirname(os.path.abspath(self.cfg_path)))
        logger.info(self.cfg_path)
        logger.info(self.ckpt_path)
        #Init body.
        self.config = CfgNode(new_allowed=True)
@ -255,8 +163,8 @@ class ASRExecutor(BaseExecutor):
                self.collate_fn_test = SpeechCollator.from_config(self.config)
                self.text_feature = TextFeaturizer(
                    unit_type=self.config.unit_type, vocab=self.vocab)
-                lm_url = pretrained_models[tag]['lm_url']
+                lm_url = self.pretrained_models[tag]['lm_url']
-                lm_md5 = pretrained_models[tag]['lm_md5']
+                lm_md5 = self.pretrained_models[tag]['lm_md5']
                self.download_lm(
                    lm_url,
                    os.path.dirname(self.config.decode.lang_model_path), lm_md5)
@ -269,12 +177,11 @@ class ASRExecutor(BaseExecutor):
                    vocab=self.config.vocab_filepath,
                    spm_model_prefix=self.config.spm_model_prefix)
                self.config.decode.decoding_method = decode_method
            else:
                raise Exception("wrong type")
        model_name = model_type[:model_type.rindex(
            '_')]  # model_type: {model_name}_{dataset}
-        model_class = dynamic_import(model_name, model_alias)
+        model_class = dynamic_import(model_name, self.model_alias)
        model_conf = self.config
        model = model_class.from_config(model_conf)
        self.model = model
@ -347,12 +254,14 @@ class ASRExecutor(BaseExecutor):
        else:
            raise Exception("wrong type")
        logger.info("audio feat process success")
    @paddle.no_grad()
    def infer(self, model_type: str):
        """
        Model inference and result stored in self.output.
        """
-
+        logger.info("start to infer the model to get the output")
        cfg = self.config.decode
        audio = self._inputs["audio"]
        audio_len = self._inputs["audio_len"]
@ -369,17 +278,22 @@ class ASRExecutor(BaseExecutor):
            self._outputs["result"] = result_transcripts[0]
        elif "conformer" in model_type or "transformer" in model_type:
-            result_transcripts = self.model.decode(
+            logger.info(f"we will use the transformer like model : {model_type}")
-                audio,
+            try:
-                audio_len,
+                result_transcripts = self.model.decode(
-                text_feature=self.text_feature,
+                    audio,
-                decoding_method=cfg.decoding_method,
+                    audio_len,
-                beam_size=cfg.beam_size,
+                    text_feature=self.text_feature,
-                ctc_weight=cfg.ctc_weight,
+                    decoding_method=cfg.decoding_method,
-                decoding_chunk_size=cfg.decoding_chunk_size,
+                    beam_size=cfg.beam_size,
-                num_decoding_left_chunks=cfg.num_decoding_left_chunks,
+                    ctc_weight=cfg.ctc_weight,
-                simulate_streaming=cfg.simulate_streaming)
+                    decoding_chunk_size=cfg.decoding_chunk_size,
-            self._outputs["result"] = result_transcripts[0][0]
+                    num_decoding_left_chunks=cfg.num_decoding_left_chunks,
                    simulate_streaming=cfg.simulate_streaming)
                self._outputs["result"] = result_transcripts[0][0]
            except Exception as e:
                logger.exception(e)
        else:
            raise Exception("invalid model name")
@ -426,6 +340,11 @@ class ASRExecutor(BaseExecutor):
        try:
            audio, audio_sample_rate = soundfile.read(
                audio_file, dtype="int16", always_2d=True)
            audio_duration = audio.shape[0] / audio_sample_rate
            max_duration = 50.0
            if audio_duration >= max_duration:
                logger.error("Please input audio file less then 50 seconds.\n")
                return
        except Exception as e:
            logger.exception(e)
            logger.error(
--- a/paddlespeech/cli/asr/pretrained_models.py
+++ b/paddlespeech/cli/asr/pretrained_models.py
@ -0,0 +1,97 @@
 # Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
 # You may obtain a copy of the License at
 #
 #     http://www.apache.org/licenses/LICENSE-2.0
 #
 # Unless required by applicable law or agreed to in writing, software
 # distributed under the License is distributed on an "AS IS" BASIS,
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
 pretrained_models = {
    # The tags for pretrained_models should be "{model_name}[_{dataset}][-{lang}][-...]".
    # e.g. "conformer_wenetspeech-zh-16k" and "panns_cnn6-32k".
    # Command line and python api use "{model_name}[_{dataset}]" as --model, usage:
    # "paddlespeech asr --model conformer_wenetspeech --lang zh --sr 16000 --input ./input.wav"
    "conformer_wenetspeech-zh-16k": {
        'url':
        'https://paddlespeech.bj.bcebos.com/s2t/wenetspeech/asr1_conformer_wenetspeech_ckpt_0.1.1.model.tar.gz',
        'md5':
        '76cb19ed857e6623856b7cd7ebbfeda4',
        'cfg_path':
        'model.yaml',
        'ckpt_path':
        'exp/conformer/checkpoints/wenetspeech',
    },
    "transformer_librispeech-en-16k": {
        'url':
        'https://paddlespeech.bj.bcebos.com/s2t/librispeech/asr1/asr1_transformer_librispeech_ckpt_0.1.1.model.tar.gz',
        'md5':
        '2c667da24922aad391eacafe37bc1660',
        'cfg_path':
        'model.yaml',
        'ckpt_path':
        'exp/transformer/checkpoints/avg_10',
    },
    "deepspeech2offline_aishell-zh-16k": {
        'url':
        'https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_aishell_ckpt_0.1.1.model.tar.gz',
        'md5':
        '932c3593d62fe5c741b59b31318aa314',
        'cfg_path':
        'model.yaml',
        'ckpt_path':
        'exp/deepspeech2/checkpoints/avg_1',
        'lm_url':
        'https://deepspeech.bj.bcebos.com/zh_lm/zh_giga.no_cna_cmn.prune01244.klm',
        'lm_md5':
        '29e02312deb2e59b3c8686c7966d4fe3'
    },
    "deepspeech2online_aishell-zh-16k": {
        'url':
        'https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_online_aishell_ckpt_0.2.0.model.tar.gz',
        'md5':
        '23e16c69730a1cb5d735c98c83c21e16',
        'cfg_path':
        'model.yaml',
        'ckpt_path':
        'exp/deepspeech2_online/checkpoints/avg_1',
        'lm_url':
        'https://deepspeech.bj.bcebos.com/zh_lm/zh_giga.no_cna_cmn.prune01244.klm',
        'lm_md5':
        '29e02312deb2e59b3c8686c7966d4fe3'
    },
    "deepspeech2offline_librispeech-en-16k": {
        'url':
        'https://paddlespeech.bj.bcebos.com/s2t/librispeech/asr0/asr0_deepspeech2_librispeech_ckpt_0.1.1.model.tar.gz',
        'md5':
        'f5666c81ad015c8de03aac2bc92e5762',
        'cfg_path':
        'model.yaml',
        'ckpt_path':
        'exp/deepspeech2/checkpoints/avg_1',
        'lm_url':
        'https://deepspeech.bj.bcebos.com/en_lm/common_crawl_00.prune01111.trie.klm',
        'lm_md5':
        '099a601759d467cd0a8523ff939819c5'
    },
 }
 model_alias = {
    "deepspeech2offline":
    "paddlespeech.s2t.models.ds2:DeepSpeech2Model",
    "deepspeech2online":
    "paddlespeech.s2t.models.ds2_online:DeepSpeech2ModelOnline",
    "conformer":
    "paddlespeech.s2t.models.u2:U2Model",
    "conformer_online":
    "paddlespeech.s2t.models.u2:U2Model",
    "transformer":
    "paddlespeech.s2t.models.u2:U2Model",
    "wenetspeech":
    "paddlespeech.s2t.models.u2:U2Model",
 }
--- a/paddlespeech/cli/cls/infer.py
+++ b/paddlespeech/cli/cls/infer.py
@ -25,55 +25,23 @@ import yaml
 from ..executor import BaseExecutor
 from ..log import logger
 from ..utils import cli_register
 from ..utils import download_and_decompress
 from ..utils import MODEL_HOME
 from ..utils import stats_wrapper
 from .pretrained_models import model_alias
 from .pretrained_models import pretrained_models
 from paddleaudio import load
 from paddleaudio.features import LogMelSpectrogram
 from paddlespeech.s2t.utils.dynamic_import import dynamic_import
 __all__ = ['CLSExecutor']
 pretrained_models = {
    # The tags for pretrained_models should be "{model_name}[_{dataset}][-{lang}][-...]".
    # e.g. "conformer_wenetspeech-zh-16k", "transformer_aishell-zh-16k" and "panns_cnn6-32k".
    # Command line and python api use "{model_name}[_{dataset}]" as --model, usage:
    # "paddlespeech asr --model conformer_wenetspeech --lang zh --sr 16000 --input ./input.wav"
    "panns_cnn6-32k": {
        'url': 'https://paddlespeech.bj.bcebos.com/cls/panns_cnn6.tar.gz',
        'md5': '4cf09194a95df024fd12f84712cf0f9c',
        'cfg_path': 'panns.yaml',
        'ckpt_path': 'cnn6.pdparams',
        'label_file': 'audioset_labels.txt',
    },
    "panns_cnn10-32k": {
        'url': 'https://paddlespeech.bj.bcebos.com/cls/panns_cnn10.tar.gz',
        'md5': 'cb8427b22176cc2116367d14847f5413',
        'cfg_path': 'panns.yaml',
        'ckpt_path': 'cnn10.pdparams',
        'label_file': 'audioset_labels.txt',
    },
    "panns_cnn14-32k": {
        'url': 'https://paddlespeech.bj.bcebos.com/cls/panns_cnn14.tar.gz',
        'md5': 'e3b9b5614a1595001161d0ab95edee97',
        'cfg_path': 'panns.yaml',
        'ckpt_path': 'cnn14.pdparams',
        'label_file': 'audioset_labels.txt',
    },
 }
 model_alias = {
    "panns_cnn6": "paddlespeech.cls.models.panns:CNN6",
    "panns_cnn10": "paddlespeech.cls.models.panns:CNN10",
    "panns_cnn14": "paddlespeech.cls.models.panns:CNN14",
 }
@cli_register(
    name='paddlespeech.cls', description='Audio classification infer command.')
 class CLSExecutor(BaseExecutor):
    def __init__(self):
-        super(CLSExecutor, self).__init__()
+        super().__init__()
        self.model_alias = model_alias
        self.pretrained_models = pretrained_models
        self.parser = argparse.ArgumentParser(
            prog='paddlespeech.cls', add_help=True)
@ -83,7 +51,9 @@ class CLSExecutor(BaseExecutor):
            '--model',
            type=str,
            default='panns_cnn14',
-            choices=[tag[:tag.index('-')] for tag in pretrained_models.keys()],
+            choices=[
                tag[:tag.index('-')] for tag in self.pretrained_models.keys()
            ],
            help='Choose model type of cls task.')
        self.parser.add_argument(
            '--config',
@ -121,23 +91,6 @@ class CLSExecutor(BaseExecutor):
            action='store_true',
            help='Increase logger verbosity of current task.')
    def _get_pretrained_path(self, tag: str) -> os.PathLike:
        """
            Download and returns pretrained resources path of current task.
        """
        support_models = list(pretrained_models.keys())
        assert tag in pretrained_models, 'The model "{}" you want to use has not been supported, please choose other models.\nThe support models includes:\n\t\t{}\n'.format(
            tag, '\n\t\t'.join(support_models))
        res_path = os.path.join(MODEL_HOME, tag)
        decompressed_path = download_and_decompress(pretrained_models[tag],
                                                    res_path)
        decompressed_path = os.path.abspath(decompressed_path)
        logger.info(
            'Use pretrained model stored in: {}'.format(decompressed_path))
        return decompressed_path
    def _init_from_path(self,
                        model_type: str='panns_cnn14',
                        cfg_path: Optional[os.PathLike]=None,
@ -153,12 +106,12 @@ class CLSExecutor(BaseExecutor):
        if label_file is None or ckpt_path is None:
            tag = model_type + '-' + '32k'  # panns_cnn14-32k
            self.res_path = self._get_pretrained_path(tag)
-            self.cfg_path = os.path.join(self.res_path,
+            self.cfg_path = os.path.join(
-                                         pretrained_models[tag]['cfg_path'])
+                self.res_path, self.pretrained_models[tag]['cfg_path'])
-            self.label_file = os.path.join(self.res_path,
+            self.label_file = os.path.join(
-                                           pretrained_models[tag]['label_file'])
+                self.res_path, self.pretrained_models[tag]['label_file'])
-            self.ckpt_path = os.path.join(self.res_path,
+            self.ckpt_path = os.path.join(
-                                          pretrained_models[tag]['ckpt_path'])
+                self.res_path, self.pretrained_models[tag]['ckpt_path'])
        else:
            self.cfg_path = os.path.abspath(cfg_path)
            self.label_file = os.path.abspath(label_file)
@ -175,7 +128,7 @@ class CLSExecutor(BaseExecutor):
                self._label_list.append(line.strip())
        # model
-        model_class = dynamic_import(model_type, model_alias)
+        model_class = dynamic_import(model_type, self.model_alias)
        model_dict = paddle.load(self.ckpt_path)
        self.model = model_class(extract_embedding=False)
        self.model.set_state_dict(model_dict)
--- a/paddlespeech/cli/cls/pretrained_models.py
+++ b/paddlespeech/cli/cls/pretrained_models.py
@ -0,0 +1,47 @@
 # Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
 # You may obtain a copy of the License at
 #
 #     http://www.apache.org/licenses/LICENSE-2.0
 #
 # Unless required by applicable law or agreed to in writing, software
 # distributed under the License is distributed on an "AS IS" BASIS,
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
 pretrained_models = {
    # The tags for pretrained_models should be "{model_name}[_{dataset}][-{lang}][-...]".
    # e.g. "conformer_wenetspeech-zh-16k", "transformer_aishell-zh-16k" and "panns_cnn6-32k".
    # Command line and python api use "{model_name}[_{dataset}]" as --model, usage:
    # "paddlespeech asr --model conformer_wenetspeech --lang zh --sr 16000 --input ./input.wav"
    "panns_cnn6-32k": {
        'url': 'https://paddlespeech.bj.bcebos.com/cls/panns_cnn6.tar.gz',
        'md5': '4cf09194a95df024fd12f84712cf0f9c',
        'cfg_path': 'panns.yaml',
        'ckpt_path': 'cnn6.pdparams',
        'label_file': 'audioset_labels.txt',
    },
    "panns_cnn10-32k": {
        'url': 'https://paddlespeech.bj.bcebos.com/cls/panns_cnn10.tar.gz',
        'md5': 'cb8427b22176cc2116367d14847f5413',
        'cfg_path': 'panns.yaml',
        'ckpt_path': 'cnn10.pdparams',
        'label_file': 'audioset_labels.txt',
    },
    "panns_cnn14-32k": {
        'url': 'https://paddlespeech.bj.bcebos.com/cls/panns_cnn14.tar.gz',
        'md5': 'e3b9b5614a1595001161d0ab95edee97',
        'cfg_path': 'panns.yaml',
        'ckpt_path': 'cnn14.pdparams',
        'label_file': 'audioset_labels.txt',
    },
 }
 model_alias = {
    "panns_cnn6": "paddlespeech.cls.models.panns:CNN6",
    "panns_cnn10": "paddlespeech.cls.models.panns:CNN10",
    "panns_cnn14": "paddlespeech.cls.models.panns:CNN14",
 }
--- a/paddlespeech/cli/executor.py
+++ b/paddlespeech/cli/executor.py
@ -25,6 +25,8 @@ from typing import Union
 import paddle
 from .log import logger
 from .utils import download_and_decompress
 from .utils import MODEL_HOME
 class BaseExecutor(ABC):
@ -35,19 +37,8 @@ class BaseExecutor(ABC):
    def __init__(self):
        self._inputs = OrderedDict()
        self._outputs = OrderedDict()
-
+        self.pretrained_models = OrderedDict()
-    @abstractmethod
+        self.model_alias = OrderedDict()
    def _get_pretrained_path(self, tag: str) -> os.PathLike:
        """
        Download and returns pretrained resources path of current task.
        Args:
            tag (str): A tag of pretrained model.
        Returns:
            os.PathLike: The path on which resources of pretrained model locate. 
        """
        pass
    @abstractmethod
    def _init_from_path(self, *args, **kwargs):
@ -227,3 +218,20 @@ class BaseExecutor(ABC):
        ]
        for l in loggers:
            l.disabled = True
    def _get_pretrained_path(self, tag: str) -> os.PathLike:
        """
        Download and returns pretrained resources path of current task.
        """
        support_models = list(self.pretrained_models.keys())
        assert tag in self.pretrained_models, 'The model "{}" you want to use has not been supported, please choose other models.\nThe support models includes:\n\t\t{}\n'.format(
            tag, '\n\t\t'.join(support_models))
        res_path = os.path.join(MODEL_HOME, tag)
        decompressed_path = download_and_decompress(self.pretrained_models[tag],
                                                    res_path)
        decompressed_path = os.path.abspath(decompressed_path)
        logger.info(
            'Use pretrained model stored in: {}'.format(decompressed_path))
        return decompressed_path
--- a/paddlespeech/cli/st/infer.py
+++ b/paddlespeech/cli/st/infer.py
@ -32,40 +32,24 @@ from ..utils import cli_register
 from ..utils import download_and_decompress
 from ..utils import MODEL_HOME
 from ..utils import stats_wrapper
 from .pretrained_models import kaldi_bins
 from .pretrained_models import model_alias
 from .pretrained_models import pretrained_models
 from paddlespeech.s2t.frontend.featurizer.text_featurizer import TextFeaturizer
 from paddlespeech.s2t.utils.dynamic_import import dynamic_import
 from paddlespeech.s2t.utils.utility import UpdateConfig
 __all__ = ["STExecutor"]
 pretrained_models = {
    "fat_st_ted-en-zh": {
        "url":
        "https://paddlespeech.bj.bcebos.com/s2t/ted_en_zh/st1/st1_transformer_mtl_noam_ted-en-zh_ckpt_0.1.1.model.tar.gz",
        "md5":
        "d62063f35a16d91210a71081bd2dd557",
        "cfg_path":
        "model.yaml",
        "ckpt_path":
        "exp/transformer_mtl_noam/checkpoints/fat_st_ted-en-zh.pdparams",
    }
 }
 model_alias = {"fat_st": "paddlespeech.s2t.models.u2_st:U2STModel"}
 kaldi_bins = {
    "url":
    "https://paddlespeech.bj.bcebos.com/s2t/ted_en_zh/st1/kaldi_bins.tar.gz",
    "md5":
    "c0682303b3f3393dbf6ed4c4e35a53eb",
 }
@cli_register(
    name="paddlespeech.st", description="Speech translation infer command.")
 class STExecutor(BaseExecutor):
    def __init__(self):
-        super(STExecutor, self).__init__()
+        super().__init__()
        self.model_alias = model_alias
        self.pretrained_models = pretrained_models
        self.kaldi_bins = kaldi_bins
        self.parser = argparse.ArgumentParser(
            prog="paddlespeech.st", add_help=True)
@ -75,7 +59,9 @@ class STExecutor(BaseExecutor):
            "--model",
            type=str,
            default="fat_st_ted",
-            choices=[tag[:tag.index('-')] for tag in pretrained_models.keys()],
+            choices=[
                tag[:tag.index('-')] for tag in self.pretrained_models.keys()
            ],
            help="Choose model type of st task.")
        self.parser.add_argument(
            "--src_lang",
@ -119,28 +105,11 @@ class STExecutor(BaseExecutor):
            action='store_true',
            help='Increase logger verbosity of current task.')
    def _get_pretrained_path(self, tag: str) -> os.PathLike:
        """
            Download and returns pretrained resources path of current task.
        """
        support_models = list(pretrained_models.keys())
        assert tag in pretrained_models, 'The model "{}" you want to use has not been supported, please choose other models.\nThe support models includes:\n\t\t{}\n'.format(
            tag, '\n\t\t'.join(support_models))
        res_path = os.path.join(MODEL_HOME, tag)
        decompressed_path = download_and_decompress(pretrained_models[tag],
                                                    res_path)
        decompressed_path = os.path.abspath(decompressed_path)
        logger.info(
            "Use pretrained model stored in: {}".format(decompressed_path))
        return decompressed_path
    def _set_kaldi_bins(self) -> os.PathLike:
        """
            Download and returns kaldi_bins resources path of current task.
        """
-        decompressed_path = download_and_decompress(kaldi_bins, MODEL_HOME)
+        decompressed_path = download_and_decompress(self.kaldi_bins, MODEL_HOME)
        decompressed_path = os.path.abspath(decompressed_path)
        logger.info("Kaldi_bins stored in: {}".format(decompressed_path))
        if "LD_LIBRARY_PATH" in os.environ:
@ -197,7 +166,7 @@ class STExecutor(BaseExecutor):
        model_conf = self.config
        model_name = model_type[:model_type.rindex(
            '_')]  # model_type: {model_name}_{dataset}
-        model_class = dynamic_import(model_name, model_alias)
+        model_class = dynamic_import(model_name, self.model_alias)
        self.model = model_class.from_config(model_conf)
        self.model.eval()
--- a/paddlespeech/cli/st/pretrained_models.py
+++ b/paddlespeech/cli/st/pretrained_models.py
@ -0,0 +1,35 @@
 # Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
 #
 # Licensed under the Apache License, Version 2.0 (the "License");
 # you may not use this file except in compliance with the License.
 # You may obtain a copy of the License at
 #
 #     http://www.apache.org/licenses/LICENSE-2.0
 #
 # Unless required by applicable law or agreed to in writing, software
 # distributed under the License is distributed on an "AS IS" BASIS,
 # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 # See the License for the specific language governing permissions and
 # limitations under the License.
 pretrained_models = {
    "fat_st_ted-en-zh": {
        "url":
        "https://paddlespeech.bj.bcebos.com/s2t/ted_en_zh/st1/st1_transformer_mtl_noam_ted-en-zh_ckpt_0.1.1.model.tar.gz",
        "md5":
        "d62063f35a16d91210a71081bd2dd557",
        "cfg_path":
        "model.yaml",
        "ckpt_path":
        "exp/transformer_mtl_noam/checkpoints/fat_st_ted-en-zh.pdparams",
    }
 }
 model_alias = {"fat_st": "paddlespeech.s2t.models.u2_st:U2STModel"}
 kaldi_bins = {
    "url":
    "https://paddlespeech.bj.bcebos.com/s2t/ted_en_zh/st1/kaldi_bins.tar.gz",
    "md5":
    "c0682303b3f3393dbf6ed4c4e35a53eb",
 }
--- a/paddlespeech/cli/stats/infer.py
+++ b/paddlespeech/cli/stats/infer.py
@ -16,7 +16,6 @@ from typing import List
 from prettytable import PrettyTable
 from ..log import logger
 from ..utils import cli_register
 from ..utils import stats_wrapper
@ -27,7 +26,8 @@ model_name_format = {
    'cls': 'Model-Sample Rate',
    'st': 'Model-Source language-Target language',
    'text': 'Model-Task-Language',
-    'tts': 'Model-Language'
+    'tts': 'Model-Language',
    'vector': 'Model-Sample Rate'
 }
@ -36,18 +36,18 @@ model_name_format = {
    description='Get speech tasks support models list.')
 class StatsExecutor():
    def __init__(self):
-        super(StatsExecutor, self).__init__()
+        super().__init__()
        self.parser = argparse.ArgumentParser(
            prog='paddlespeech.stats', add_help=True)
        self.task_choices = ['asr', 'cls', 'st', 'text', 'tts', 'vector']
        self.parser.add_argument(
            '--task',
            type=str,
            default='asr',
-            choices=['asr', 'cls', 'st', 'text', 'tts'],
+            choices=self.task_choices,
            help='Choose speech task.',
            required=True)
        self.task_choices = ['asr', 'cls', 'st', 'text', 'tts']
    def show_support_models(self, pretrained_models: dict):
        fields = model_name_format[self.task].split("-")
@ -61,73 +61,15 @@ class StatsExecutor():
            Command line entry.
        """
        parser_args = self.parser.parse_args(argv)
-        self.task = parser_args.task
+        has_exceptions = False
-        if self.task not in self.task_choices:
+        try:
-            logger.error(
+            self(parser_args.task)
-                "Please input correct speech task, choices = ['asr', 'cls', 'st', 'text', 'tts']"
+        except Exception as e:
-            )
+            has_exceptions = True
        if has_exceptions:
            return False
-
+        else:
-        elif self.task == 'asr':
+            return True
            try:
                from ..asr.infer import pretrained_models
                logger.info(
                    "Here is the list of ASR pretrained models released by PaddleSpeech that can be used by command line and python API"
                )
                self.show_support_models(pretrained_models)
                return True
            except BaseException:
                logger.error("Failed to get the list of ASR pretrained models.")
                return False
        elif self.task == 'cls':
            try:
                from ..cls.infer import pretrained_models
                logger.info(
                    "Here is the list of CLS pretrained models released by PaddleSpeech that can be used by command line and python API"
                )
                self.show_support_models(pretrained_models)
                return True
            except BaseException:
                logger.error("Failed to get the list of CLS pretrained models.")
                return False
        elif self.task == 'st':
            try:
                from ..st.infer import pretrained_models
                logger.info(
                    "Here is the list of ST pretrained models released by PaddleSpeech that can be used by command line and python API"
                )
                self.show_support_models(pretrained_models)
                return True
            except BaseException:
                logger.error("Failed to get the list of ST pretrained models.")
                return False
        elif self.task == 'text':
            try:
                from ..text.infer import pretrained_models
                logger.info(
                    "Here is the list of TEXT pretrained models released by PaddleSpeech that can be used by command line and python API"
                )
                self.show_support_models(pretrained_models)
                return True
            except BaseException:
                logger.error(
                    "Failed to get the list of TEXT pretrained models.")
                return False
        elif self.task == 'tts':
            try:
                from ..tts.infer import pretrained_models
                logger.info(
                    "Here is the list of TTS pretrained models released by PaddleSpeech that can be used by command line and python API"
                )
                self.show_support_models(pretrained_models)
                return True
            except BaseException:
                logger.error("Failed to get the list of TTS pretrained models.")
                return False
    @stats_wrapper
    def __call__(
@ -138,13 +80,12 @@ class StatsExecutor():
        """
        self.task = task
        if self.task not in self.task_choices:
-            print(
+            print("Please input correct speech task, choices = " + str(
-                "Please input correct speech task, choices = ['asr', 'cls', 'st', 'text', 'tts']"
+                self.task_choices))
            )
        elif self.task == 'asr':
            try:
-                from ..asr.infer import pretrained_models
+                from ..asr.pretrained_models import pretrained_models
                print(
                    "Here is the list of ASR pretrained models released by PaddleSpeech that can be used by command line and python API"
                )
@ -154,7 +95,7 @@ class StatsExecutor():
        elif self.task == 'cls':
            try:
-                from ..cls.infer import pretrained_models
+                from ..cls.pretrained_models import pretrained_models
                print(
                    "Here is the list of CLS pretrained models released by PaddleSpeech that can be used by command line and python API"
                )
@ -164,7 +105,7 @@ class StatsExecutor():
        elif self.task == 'st':
            try:
-                from ..st.infer import pretrained_models
+                from ..st.pretrained_models import pretrained_models
                print(
                    "Here is the list of ST pretrained models released by PaddleSpeech that can be used by command line and python API"
                )
@ -174,7 +115,7 @@ class StatsExecutor():
        elif self.task == 'text':
            try:
-                from ..text.infer import pretrained_models
+                from ..text.pretrained_models import pretrained_models
                print(
                    "Here is the list of TEXT pretrained models released by PaddleSpeech that can be used by command line and python API"
                )
@ -184,10 +125,22 @@ class StatsExecutor():
        elif self.task == 'tts':
            try:
-                from ..tts.infer import pretrained_models
+                from ..tts.pretrained_models import pretrained_models
                print(
                    "Here is the list of TTS pretrained models released by PaddleSpeech that can be used by command line and python API"
                )
                self.show_support_models(pretrained_models)
            except BaseException:
                print("Failed to get the list of TTS pretrained models.")
        elif self.task == 'vector':
            try:
                from ..vector.pretrained_models import pretrained_models
                print(
                    "Here is the list of Speaker Recognition pretrained models released by PaddleSpeech that can be used by command line and python API"
                )
                self.show_support_models(pretrained_models)
            except BaseException:
                print(
                    "Failed to get the list of Speaker Recognition pretrained models."
                )
--- a/Show More
+++ b/Show More