Merge branch 'develop' into CER

pull/1673/head
commit 8d1ee8262e by Jackwaterveg, committed via GitHub

.gitignore

@ -33,6 +33,12 @@ tools/Miniconda3-latest-Linux-x86_64.sh
tools/activate_python.sh
tools/miniconda.sh
tools/CRF++-0.58/
tools/liblbfgs-1.10/
tools/srilm/
tools/env.sh
tools/openfst-1.8.1/
tools/libsndfile/
tools/python-soundfile/
speechx/fc_patch/

@ -50,13 +50,13 @@ repos:
entry: bash .pre-commit-hooks/clang-format.hook -i
language: system
files: \.(c|cc|cxx|cpp|cu|h|hpp|hxx|cuh|proto)$
exclude: (?=speechx/speechx/kaldi|speechx/patch).*(\.cpp|\.cc|\.h|\.py)$
exclude: (?=speechx/speechx/kaldi|speechx/patch|speechx/tools/fstbin|speechx/tools/lmbin).*(\.cpp|\.cc|\.h|\.py)$
- id: copyright_checker
name: copyright_checker
entry: python .pre-commit-hooks/copyright-check.hook
language: system
files: \.(c|cc|cxx|cpp|cu|h|hpp|hxx|proto|py)$
exclude: (?=third_party|pypinyin|speechx/speechx/kaldi|speechx/patch).*(\.cpp|\.cc|\.h|\.py)$
exclude: (?=third_party|pypinyin|speechx/speechx/kaldi|speechx/patch|speechx/tools/fstbin|speechx/tools/lmbin).*(\.cpp|\.cc|\.h|\.py)$
- repo: https://github.com/asottile/reorder_python_imports
rev: v2.4.0
hooks:
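pre-commit treats these `exclude` values as Python regular expressions matched against file paths: the lookahead pins the path prefix, and the trailing group restricts the extension. A quick illustration of what the updated pattern skips (a sketch, not part of the commit):

```python
import re

# the updated exclude pattern of the clang-format hook above
pattern = (r"(?=speechx/speechx/kaldi|speechx/patch|speechx/tools/fstbin|"
           r"speechx/tools/lmbin).*(\.cpp|\.cc|\.h|\.py)$")

print(bool(re.search(pattern, "speechx/tools/fstbin/foo.cc")))  # True: excluded
print(bool(re.search(pattern, "speechx/speechx/base/foo.cc")))  # False: checked
```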

@ -280,10 +280,14 @@ paddlespeech_client cls --server_ip 127.0.0.1 --port 8090 --input input.wav
For more information about server command lines, please see: [speech server demos](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/demos/speech_server)
<a name="ModelList"></a>
## Model List
PaddleSpeech supports a series of the most popular models. They are summarized in [released models](./docs/source/released_model.md), together with the available pretrained models.
<a name="SpeechToText"></a>
**Speech-to-Text** contains *Acoustic Model*, *Language Model*, and *Speech Translation*, with the following details:
<table style="width:100%">
@ -357,6 +361,8 @@ PaddleSpeech supports a series of most popular models. They are summarized in [r
</tbody>
</table>
<a name="TextToSpeech"></a>
**Text-to-Speech** in PaddleSpeech mainly contains three modules: *Text Frontend*, *Acoustic Model* and *Vocoder*. Acoustic Model and Vocoder models are listed as follows:
<table>
@ -457,10 +463,10 @@ PaddleSpeech supports a series of most popular models. They are summarized in [r
</td>
</tr>
<tr>
<td>GE2E + Tactron2</td>
<td>GE2E + Tacotron2</td>
<td>AISHELL-3</td>
<td>
<a href = "./examples/aishell3/vc0">ge2e-tactron2-aishell3</a>
<a href = "./examples/aishell3/vc0">ge2e-tacotron2-aishell3</a>
</td>
</tr>
<tr>
@ -473,6 +479,8 @@ PaddleSpeech supports a series of most popular models. They are summarized in [r
</tbody>
</table>
<a name="AudioClassification"></a>
**Audio Classification**
<table style="width:100%">
@ -496,6 +504,8 @@ PaddleSpeech supports a series of most popular models. They are summarized in [r
</tbody>
</table>
<a name="SpeakerVerification"></a>
**Speaker Verification**
<table style="width:100%">
@ -519,6 +529,8 @@ PaddleSpeech supports a series of most popular models. They are summarized in [r
</tbody>
</table>
<a name="PunctuationRestoration"></a>
**Punctuation Restoration**
<table style="width:100%">
@ -559,10 +571,18 @@ Normally, [Speech SoTA](https://paperswithcode.com/area/speech), [Audio SoTA](ht
- [Advanced Usage](./docs/source/tts/advanced_usage.md)
- [Chinese Rule Based Text Frontend](./docs/source/tts/zh_text_frontend.md)
- [Test Audio Samples](https://paddlespeech.readthedocs.io/en/latest/tts/demo.html)
- Speaker Verification
- [Audio Searching](./demos/audio_searching/README.md)
- [Speaker Verification](./demos/speaker_verification/README.md)
- [Audio Classification](./demos/audio_tagging/README.md)
- [Speaker Verification](./demos/speaker_verification/README.md)
- [Speech Translation](./demos/speech_translation/README.md)
- [Speech Server](./demos/speech_server/README.md)
- [Released Models](./docs/source/released_model.md)
- [Speech-to-Text](#SpeechToText)
- [Text-to-Speech](#TextToSpeech)
- [Audio Classification](#AudioClassification)
- [Speaker Verification](#SpeakerVerification)
- [Punctuation Restoration](#PunctuationRestoration)
- [Community](#Community)
- [Welcome to contribute](#contribution)
- [License](#License)

@ -273,6 +273,8 @@ paddlespeech_client cls --server_ip 127.0.0.1 --port 8090 --input input.wav
## Model List
PaddleSpeech supports many mainstream models and provides pretrained models; see the [Model List](./docs/source/released_model.md) for details.
<a name="语音识别模型"></a>
**Speech-to-Text** in PaddleSpeech contains the ASR acoustic model, the ASR language model, and speech translation; details are as follows:
<table style="width:100%">
@ -347,6 +349,7 @@ **Speech-to-Text** in PaddleSpeech contains the ASR acoustic model, the ASR language model, and speech translation
</table>
<a name="语音合成模型"></a>
**Text-to-Speech** in PaddleSpeech mainly contains three modules: Text Frontend, Acoustic Model, and Vocoder. The acoustic models and vocoders are listed as follows:
<table>
@ -447,10 +450,10 @@ **Text-to-Speech** in PaddleSpeech mainly contains three modules: Text Frontend, Acoustic Model, and Vocoder
</td>
</tr>
<tr>
<td>GE2E + Tactron2</td>
<td>GE2E + Tacotron2</td>
<td>AISHELL-3</td>
<td>
<a href = "./examples/aishell3/vc0">ge2e-tactron2-aishell3</a>
<a href = "./examples/aishell3/vc0">ge2e-tacotron2-aishell3</a>
</td>
</tr>
<tr>
@ -488,6 +491,8 @@ **Text-to-Speech** in PaddleSpeech mainly contains three modules: Text Frontend, Acoustic Model, and Vocoder
</table>
<a name="声纹识别模型"></a>
**Speaker Verification**
<table style="width:100%">
@ -511,6 +516,8 @@ **Text-to-Speech** in PaddleSpeech mainly contains three modules: Text Frontend, Acoustic Model, and Vocoder
</tbody>
</table>
<a name="标点恢复模型"></a>
**Punctuation Restoration**
<table style="width:100%">
@ -556,13 +563,18 @@ **Text-to-Speech** in PaddleSpeech mainly contains three modules: Text Frontend, Acoustic Model, and Vocoder
- [Advanced Usage](./docs/source/tts/advanced_usage.md)
- [Chinese Text Frontend](./docs/source/tts/zh_text_frontend.md)
- [Test Audio Samples](https://paddlespeech.readthedocs.io/en/latest/tts/demo.html)
- Speaker Verification
- [Speaker Verification](./demos/speaker_verification/README_cn.md)
- [Audio Searching](./demos/audio_searching/README_cn.md)
- [Audio Classification](./demos/audio_tagging/README_cn.md)
- [Speaker Verification](./demos/speaker_verification/README_cn.md)
- [Speech Translation](./demos/speech_translation/README_cn.md)
- [Speech Server](./demos/speech_server/README_cn.md)
- [Model List](#模型列表)
- [Speech-to-Text](#语音识别模型)
- [Text-to-Speech](#语音合成模型)
- [Audio Classification](#声音分类模型)
- [Speaker Verification](#声纹识别模型)
- [Punctuation Restoration](#标点恢复模型)
- [Community](#技术交流群)
- [Welcome to Contribute](#欢迎贡献)
- [License](#License)

@ -34,14 +34,14 @@ from utils.utility import unzip
DATA_HOME = os.path.expanduser('~/.cache/paddle/dataset/speech')
URL_ROOT = 'http://www.openslr.org/resources/28'
URL_ROOT = '--no-check-certificate http://www.openslr.org/resources/28'
DATA_URL = URL_ROOT + '/rirs_noises.zip'
MD5_DATA = 'e6f48e257286e05de56413b4779d8ffb'
parser = argparse.ArgumentParser(description=__doc__)
parser.add_argument(
"--target_dir",
default=DATA_HOME + "/Aishell",
default=DATA_HOME + "/rirs_noise",
type=str,
help="Directory to save the dataset. (default: %(default)s)")
parser.add_argument(
@ -81,6 +81,10 @@ def create_manifest(data_dir, manifest_path_prefix):
},
ensure_ascii=False))
manifest_path = manifest_path_prefix + '.' + dtype
if not os.path.exists(os.path.dirname(manifest_path)):
os.makedirs(os.path.dirname(manifest_path))
with codecs.open(manifest_path, 'w', 'utf-8') as fout:
for line in json_lines:
fout.write(line + '\n')
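Each manifest written above is a JSON-lines file: one UTF-8 JSON object per line (note the `ensure_ascii=False` when the lines are built). A minimal sketch of reading a manifest back, assuming the `manifest_path` produced above:

```python
import codecs
import json

def read_manifest(manifest_path):
    # one JSON object per line, utf-8 encoded
    with codecs.open(manifest_path, 'r', 'utf-8') as fin:
        return [json.loads(line) for line in fin if line.strip()]
```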

@ -149,7 +149,7 @@ def prepare_dataset(base_url, data_list, target_dir, manifest_path,
# we will download the voxceleb1 data to ${target_dir}/vox1/dev/ or ${target_dir}/vox1/test directory
if not os.path.exists(os.path.join(target_dir, "wav")):
# download all dataset part
print("start to download the vox1 dev zip package")
print(f"start to download the vox1 zip package to {target_dir}")
for zip_part in data_list.keys():
download_url = " --no-check-certificate " + base_url + "/" + zip_part
download(

@ -22,10 +22,12 @@ import codecs
import glob
import json
import os
import subprocess
from pathlib import Path
import soundfile
from utils.utility import check_md5sum
from utils.utility import download
from utils.utility import unzip
@ -35,12 +37,22 @@ DATA_HOME = os.path.expanduser('.')
BASE_URL = "--no-check-certificate https://www.robots.ox.ac.uk/~vgg/data/voxceleb/data/"
# dev data
DEV_DATA_URL = BASE_URL + '/vox2_aac.zip'
DEV_MD5SUM = "bbc063c46078a602ca71605645c2a402"
DEV_LIST = {
"vox2_dev_aac_partaa": "da070494c573e5c0564b1d11c3b20577",
"vox2_dev_aac_partab": "17fe6dab2b32b48abaf1676429cdd06f",
"vox2_dev_aac_partac": "1de58e086c5edf63625af1cb6d831528",
"vox2_dev_aac_partad": "5a043eb03e15c5a918ee6a52aad477f9",
"vox2_dev_aac_partae": "cea401b624983e2d0b2a87fb5d59aa60",
"vox2_dev_aac_partaf": "fc886d9ba90ab88e7880ee98effd6ae9",
"vox2_dev_aac_partag": "d160ecc3f6ee3eed54d55349531cb42e",
"vox2_dev_aac_partah": "6b84a81b9af72a9d9eecbb3b1f602e65",
}
DEV_TARGET_DATA = "vox2_dev_aac_parta* vox2_dev_aac.zip bbc063c46078a602ca71605645c2a402"
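# Format: '<part glob> <final zip name> <md5 of the final zip>'; download_dataset() splits this on whitespace (see below)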
# test data
TEST_DATA_URL = BASE_URL + '/vox2_test_aac.zip'
TEST_MD5SUM = "0d2b3ea430a821c33263b5ea37ede312"
TEST_LIST = {"vox2_test_aac.zip": "0d2b3ea430a821c33263b5ea37ede312"}
TEST_TARGET_DATA = "vox2_test_aac.zip vox2_test_aac.zip 0d2b3ea430a821c33263b5ea37ede312"
parser = argparse.ArgumentParser(description=__doc__)
parser.add_argument(
@ -68,6 +80,14 @@ args = parser.parse_args()
def create_manifest(data_dir, manifest_path_prefix):
"""Generate the voxceleb2 dataset manifest file.
We will create ${manifest_path_prefix}.vox2 as the final manifest file.
The dev and test wav info will be put in one manifest file.
Args:
data_dir (str): voxceleb2 wav directory, which includes the dev and test sub-datasets
manifest_path_prefix (str): manifest file prefix
"""
print("Creating manifest %s ..." % manifest_path_prefix)
json_lines = []
data_path = os.path.join(data_dir, "**", "*.wav")
@ -119,7 +139,19 @@ def create_manifest(data_dir, manifest_path_prefix):
print(f"{total_sec / total_num} sec/utt", file=f)
def download_dataset(url, md5sum, target_dir, dataset):
def download_dataset(base_url, data_list, target_data, target_dir, dataset):
"""Download the voxceleb2 zip package
Args:
base_url (str): the voxceleb2 dataset download baseline url
data_list (dict): the dataset part zip package and the md5 value
target_data (str): the final dataset zip info
target_dir (str): the dataset stored directory
dataset (str): the dataset name, dev or test
Raises:
RuntimeError: the md5sum occurs error
"""
if not os.path.exists(target_dir):
os.makedirs(target_dir)
@ -129,9 +161,34 @@ def download_dataset(url, md5sum, target_dir, dataset):
# but the test dataset will unzip to aac
# so we create ${target_dir}/test and unzip the m4a files to the test dir
if not os.path.exists(os.path.join(target_dir, dataset)):
filepath = download(url, md5sum, target_dir)
print(f"start to download the vox2 zip package to {target_dir}")
for zip_part in data_list.keys():
download_url = " --no-check-certificate " + base_url + "/" + zip_part
download(
url=download_url,
md5sum=data_list[zip_part],
target_dir=target_dir)
# concatenate all parts into the target zip file
all_target_part, target_name, target_md5sum = target_data.split()
target_name = os.path.join(target_dir, target_name)
if not os.path.exists(target_name):
pack_part_cmd = "cat {}/{} > {}".format(target_dir, all_target_part,
target_name)
subprocess.call(pack_part_cmd, shell=True)
# check the target zip file md5sum
if not check_md5sum(target_name, target_md5sum):
raise RuntimeError("{} MD5 checksum failed".format(target_name))
else:
print("Check {} md5sum successfully".format(target_name))
if dataset == "test":
unzip(filepath, os.path.join(target_dir, "test"))
# we need to make the test directory
unzip(target_name, os.path.join(target_dir, "test"))
else:
# unzip the dev zip package, which will create the dev directory
unzip(target_name, target_dir)
def main():
@ -142,14 +199,16 @@ def main():
print("download: {}".format(args.download))
if args.download:
download_dataset(
url=DEV_DATA_URL,
md5sum=DEV_MD5SUM,
base_url=BASE_URL,
data_list=DEV_LIST,
target_data=DEV_TARGET_DATA,
target_dir=args.target_dir,
dataset="dev")
download_dataset(
url=TEST_DATA_URL,
md5sum=TEST_MD5SUM,
base_url=BASE_URL,
data_list=TEST_LIST,
target_data=TEST_TARGET_DATA,
target_dir=args.target_dir,
dataset="test")

@ -90,7 +90,7 @@ Then start the system server; it provides HTTP backend services.
```bash
export PYTHONPATH=$PYTHONPATH:./src:../../paddleaudio
python src/main.py
python src/audio_search.py
```
Then you will see that the application has started:
@ -111,7 +111,7 @@ Then start the system server; it provides HTTP backend services.
```bash
wget -c https://www.openslr.org/resources/82/cn-celeb_v2.tar.gz && tar -xvf cn-celeb_v2.tar.gz
```
**Note**: If you want to build a quick demo, you can use the `download_audio_data` function in ./src/test_main.py; it downloads 20 audio files, and the results shown below use this collection as an example
**Note**: If you want to build a quick demo, you can use the `download_audio_data` function in ./src/test_audio_search.py; it downloads 20 audio files, and the results shown below use this collection as an example
- Prepare model(Skip this step if you use the default model.)
```bash
@ -123,7 +123,7 @@ Then start the system server; it provides HTTP backend services.
The internal process is: download the data, load the paddlespeech model, extract embeddings, build the library, then retrieve from and delete the library
```bash
python ./src/test_main.py
python ./src/test_audio_search.py
```
Output
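Equivalently, the same backend can be driven in-process with FastAPI's `TestClient`, which is exactly what test_audio_search.py does; a minimal check, assuming the demo audio set has been loaded:

```python
from fastapi.testclient import TestClient

from audio_search import app  # the server module renamed in this commit

client = TestClient(app)
print(client.get("/audio/count").json())  # 20 for the demo collection
```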

@ -92,7 +92,7 @@ ffce340b3790 minio/minio:RELEASE.2020-12-03T00-03-10Z "/usr/bin/docker-ent…"
```bash
export PYTHONPATH=$PYTHONPATH:./src:../../paddleaudio
python src/main.py
python src/audio_search.py
```
Then you will see that the application has started:
@ -113,7 +113,7 @@ ffce340b3790 minio/minio:RELEASE.2020-12-03T00-03-10Z "/usr/bin/docker-ent…"
```bash
wget -c https://www.openslr.org/resources/82/cn-celeb_v2.tar.gz && tar -xvf cn-celeb_v2.tar.gz
```
**Note**: If you want to build a quick demo, you can use the 20 audio files downloaded by ./src/test_main.py:download_audio_data; the results shown below use this collection as an example
**Note**: If you want to build a quick demo, you can use the 20 audio files downloaded by ./src/test_audio_search.py:download_audio_data; the results shown below use this collection as an example
- Prepare the model (skip this step if you use the default model)
```bash
@ -124,7 +124,7 @@ ffce340b3790 minio/minio:RELEASE.2020-12-03T00-03-10Z "/usr/bin/docker-ent…"
- Script test (recommended)
```bash
python ./src/test_main.py
python ./src/test_audio_search.py
```
Note: the script will, in order, download the data, load the paddlespeech model, extract embeddings, build the library, run retrieval, and delete the library

@ -40,7 +40,6 @@ app.add_middleware(
allow_methods=["*"],
allow_headers=["*"])
MODEL = None
MILVUS_CLI = MilvusHelper()
MYSQL_CLI = MySQLHelper()

@ -12,8 +12,8 @@
# See the License for the specific language governing permissions and
# limitations under the License.
import numpy as np
from logs import LOGGER
from paddlespeech.cli import VectorExecutor
vector_executor = VectorExecutor()

@ -13,6 +13,7 @@
# limitations under the License.
import sys
import numpy
import pymysql
from config import MYSQL_DB
from config import MYSQL_HOST
@ -69,7 +70,7 @@ class MySQLHelper():
sys.exit(1)
def load_data_to_mysql(self, table_name, data):
# Batch insert (Milvus_ids, img_path) to mysql
# Batch insert (Milvus_ids, audio_path) to mysql
self.test_connection()
sql = "insert into " + table_name + " (milvus_id,audio_path) values (%s,%s);"
try:
@ -82,7 +83,7 @@ class MySQLHelper():
sys.exit(1)
def search_by_milvus_ids(self, ids, table_name):
# Get the img_path according to the milvus ids
# Get the audio_path according to the milvus ids
self.test_connection()
str_ids = str(ids).replace('[', '').replace(']', '')
sql = "select audio_path from " + table_name + " where milvus_id in (" + str_ids + ") order by field (milvus_id," + str_ids + ");"
@ -120,14 +121,83 @@ class MySQLHelper():
sys.exit(1)
def count_table(self, table_name):
# Get the number of mysql table
# Get the number of spk in mysql table
self.test_connection()
sql = "select count(milvus_id) from " + table_name + ";"
sql = "select count(spk_id) from " + table_name + ";"
try:
self.cursor.execute(sql)
results = self.cursor.fetchall()
LOGGER.debug(f"MYSQL count table:{table_name}")
LOGGER.debug(f"MYSQL count table:{results[0][0]}")
return results[0][0]
except Exception as e:
LOGGER.error(f"MYSQL ERROR: {e} with sql: {sql}")
sys.exit(1)
def create_mysql_table_vpr(self, table_name):
# Create mysql table if not exists
self.test_connection()
sql = "create table if not exists " + table_name + "(spk_id TEXT, audio_path TEXT, embedding TEXT);"
try:
self.cursor.execute(sql)
LOGGER.debug(f"MYSQL create table: {table_name} with sql: {sql}")
except Exception as e:
LOGGER.error(f"MYSQL ERROR: {e} with sql: {sql}")
sys.exit(1)
def load_data_to_mysql_vpr(self, table_name, data):
# Insert (spk, audio, embedding) to mysql
self.test_connection()
sql = "insert into " + table_name + " (spk_id,audio_path,embedding) values (%s,%s,%s);"
try:
self.cursor.execute(sql, data)
LOGGER.debug(
f"MYSQL loads data to table: {table_name} successfully")
except Exception as e:
LOGGER.error(f"MYSQL ERROR: {e} with sql: {sql}")
sys.exit(1)
def list_vpr(self, table_name):
# Get all records in mysql
self.test_connection()
sql = "select * from " + table_name + " ;"
try:
self.cursor.execute(sql)
results = self.cursor.fetchall()
self.conn.commit()
spk_ids = [res[0] for res in results]
audio_paths = [res[1] for res in results]
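# embeddings were stored via str(list); strip the brackets and split on
# commas to recover the values (as strings; cast with astype before use)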
embeddings = [
numpy.array(
str(res[2]).replace('[', '').replace(']', '').split(","))
for res in results
]
return spk_ids, audio_paths, embeddings
except Exception as e:
LOGGER.error(f"MYSQL ERROR: {e} with sql: {sql}")
sys.exit(1)
def search_audio_vpr(self, table_name, spk_id):
# Get the audio_path according to the spk_id
self.test_connection()
sql = "select audio_path from " + table_name + " where spk_id='" + spk_id + "' ;"
try:
self.cursor.execute(sql)
results = self.cursor.fetchall()
LOGGER.debug(
f"MYSQL search by spk id {spk_id} to get audio {results[0][0]}.")
return results[0][0]
except Exception as e:
LOGGER.error(f"MYSQL ERROR: {e} with sql: {sql}")
sys.exit(1)
def delete_data_vpr(self, table_name, spk_id):
# Delete a record by spk_id in mysql table
self.test_connection()
sql = "delete from " + table_name + " where spk_id='" + spk_id + "';"
try:
self.cursor.execute(sql)
LOGGER.debug(
f"MYSQL delete a record {spk_id} in table {table_name}")
except Exception as e:
LOGGER.error(f"MYSQL ERROR: {e} with sql: {sql}")
sys.exit(1)
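Taken together, a minimal round trip through the new VPR helpers might look like the sketch below; the table name and the two-value embedding are illustrative placeholders, not values used by the demo:

```python
from mysql_helpers import MySQLHelper

helper = MySQLHelper()
helper.create_mysql_table_vpr("vpr_demo")  # hypothetical table name
# embeddings are serialized with str(); list_vpr() parses them back
helper.load_data_to_mysql_vpr(
    "vpr_demo", ("spk1", "./example_audio/arms_strikes.wav", str([0.1, 0.2])))
spk_ids, audio_paths, embeddings = helper.list_vpr("vpr_demo")
print(helper.search_audio_vpr("vpr_demo", "spk1"))  # -> the enrolled audio path
```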

@ -31,3 +31,45 @@ def do_count(table_name, milvus_cli):
except Exception as e:
LOGGER.error(f"Error attempting to count table {e}")
sys.exit(1)
def do_count_vpr(table_name, mysql_cli):
"""
Returns the total number of spk in the system
"""
if not table_name:
table_name = DEFAULT_TABLE
try:
num = mysql_cli.count_table(table_name)
return num
except Exception as e:
LOGGER.error(f"Error attempting to count table {e}")
sys.exit(1)
def do_list(table_name, mysql_cli):
"""
Returns all vpr records in the system
"""
if not table_name:
table_name = DEFAULT_TABLE
try:
spk_ids, audio_paths, _ = mysql_cli.list_vpr(table_name)
return spk_ids, audio_paths
except Exception as e:
LOGGER.error(f"Error attempting to count table {e}")
sys.exit(1)
def do_get(table_name, spk_id, mysql_cli):
"""
Returns the audio path by spk_id in the system
"""
if not table_name:
table_name = DEFAULT_TABLE
try:
audio_path = mysql_cli.search_audio_vpr(table_name, spk_id)
return audio_path
except Exception as e:
LOGGER.error(f"Error attempting to count table {e}")
sys.exit(1)

@ -32,3 +32,31 @@ def do_drop(table_name, milvus_cli, mysql_cli):
except Exception as e:
LOGGER.error(f"Error attempting to drop table: {e}")
sys.exit(1)
def do_drop_vpr(table_name, mysql_cli):
"""
Delete the MySQL table
"""
if not table_name:
table_name = DEFAULT_TABLE
try:
mysql_cli.delete_table(table_name)
return "OK"
except Exception as e:
LOGGER.error(f"Error attempting to drop table: {e}")
sys.exit(1)
def do_delete(table_name, spk_id, mysql_cli):
"""
Delete a record by spk_id in MySQL
"""
if not table_name:
table_name = DEFAULT_TABLE
try:
mysql_cli.delete_data_vpr(table_name, spk_id)
return "OK"
except Exception as e:
LOGGER.error(f"Error attempting to drop table: {e}")
sys.exit(1)

@ -82,3 +82,16 @@ def do_load(table_name, audio_dir, milvus_cli, mysql_cli):
mysql_cli.create_mysql_table(table_name)
mysql_cli.load_data_to_mysql(table_name, format_data(ids, names))
return len(ids)
def do_enroll(table_name, spk_id, audio_path, mysql_cli):
"""
Import (spk_id, audio_path, embedding) into MySQL
"""
if not table_name:
table_name = DEFAULT_TABLE
embedding = get_audio_embedding(audio_path)
mysql_cli.create_mysql_table_vpr(table_name)
data = (spk_id, audio_path, str(embedding))
mysql_cli.load_data_to_mysql_vpr(table_name, data)
return "OK"

@ -13,6 +13,7 @@
# limitations under the License.
import sys
import numpy
from config import DEFAULT_TABLE
from config import TOP_K
from encode import get_audio_embedding
@ -39,3 +40,26 @@ def do_search(host, table_name, audio_path, milvus_cli, mysql_cli):
except Exception as e:
LOGGER.error(f"Error with search: {e}")
sys.exit(1)
def do_search_vpr(host, table_name, audio_path, mysql_cli):
"""
Search the uploaded audio in MySQL
"""
try:
if not table_name:
table_name = DEFAULT_TABLE
emb = get_audio_embedding(audio_path)
emb = numpy.array(emb)
spk_ids, paths, vectors = mysql_cli.list_vpr(table_name)
scores = [numpy.dot(emb, x.astype(numpy.float64)) for x in vectors]
spk_ids = [str(x) for x in spk_ids]
paths = [str(x) for x in paths]
for i in range(len(paths)):
tmp = "http://" + str(host) + "/data?audio_path=" + str(paths[i])
paths[i] = tmp
scores[i] = scores[i] * 100
return spk_ids, paths, scores
except Exception as e:
LOGGER.error(f"Error with search: {e}")
sys.exit(1)
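For intuition, the score returned here is a raw dot product of the query embedding with each stored embedding, scaled by 100; it is not the normalized [0, 1] score that `get_embeddings_score` produces in the speaker_verification demo. A toy example:

```python
import numpy

emb = numpy.array([0.6, 0.8])      # query embedding (illustrative values)
stored = numpy.array([0.8, 0.6])   # enrolled embedding (illustrative values)
print(numpy.dot(emb, stored) * 100)  # 96.0
```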

@ -11,8 +11,8 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from audio_search import app
from fastapi.testclient import TestClient
from main import app
from utils.utility import download
from utils.utility import unpack
@ -22,7 +22,7 @@ client = TestClient(app)
def download_audio_data():
"""
download audio data
Download audio data
"""
url = "https://paddlespeech.bj.bcebos.com/vector/audio/example_audio.tar.gz"
md5sum = "52ac69316c1aa1fdef84da7dd2c67b39"
@ -64,7 +64,7 @@ def test_count():
"""
Returns the total number of vectors in the system
"""
response = client.get("audio/count")
response = client.get("/audio/count")
assert response.status_code == 200
assert response.json() == 20

@ -0,0 +1,115 @@
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from fastapi.testclient import TestClient
from vpr_search import app
from utils.utility import download
from utils.utility import unpack
client = TestClient(app)
def download_audio_data():
"""
Download audio data
"""
url = "https://paddlespeech.bj.bcebos.com/vector/audio/example_audio.tar.gz"
md5sum = "52ac69316c1aa1fdef84da7dd2c67b39"
target_dir = "./"
filepath = download(url, md5sum, target_dir)
unpack(filepath, target_dir, True)
def test_drop():
"""
Delete the MySQL table
"""
response = client.post("/vpr/drop")
assert response.status_code == 200
def test_enroll_local(spk: str, audio: str):
"""
Enroll the audio to MySQL
"""
response = client.post("/vpr/enroll/local?spk_id=" + spk +
"&audio_path=.%2Fexample_audio%2F" + audio + ".wav")
assert response.status_code == 200
assert response.json() == {
'status': True,
'msg': "Successfully enroll data!"
}
def test_search_local():
"""
Search the spk in MySQL by audio
"""
response = client.post(
"/vpr/recog/local?audio_path=.%2Fexample_audio%2Ftest.wav")
assert response.status_code == 200
def test_list():
"""
Get all records in MySQL
"""
response = client.get("/vpr/list")
assert response.status_code == 200
def test_data(spk: str):
"""
Get the audio file by spk_id in MySQL
"""
response = client.get("/vpr/data?spk_id=" + spk)
assert response.status_code == 200
def test_del(spk: str):
"""
Delete the record in MySQL by spk_id
"""
response = client.post("/vpr/del?spk_id=" + spk)
assert response.status_code == 200
def test_count():
"""
Get the number of spk in MySQL
"""
response = client.get("/vpr/count")
assert response.status_code == 200
if __name__ == "__main__":
download_audio_data()
test_enroll_local("spk1", "arms_strikes")
test_enroll_local("spk2", "sword_wielding")
test_enroll_local("spk3", "test")
test_list()
test_data("spk1")
test_count()
test_search_local()
test_del("spk1")
test_count()
test_search_local()
test_enroll_local("spk1", "arms_strikes")
test_count()
test_search_local()
test_drop()

@ -0,0 +1,206 @@
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import os
import uvicorn
from config import UPLOAD_PATH
from fastapi import FastAPI
from fastapi import File
from fastapi import UploadFile
from logs import LOGGER
from mysql_helpers import MySQLHelper
from operations.count import do_count_vpr
from operations.count import do_get
from operations.count import do_list
from operations.drop import do_delete
from operations.drop import do_drop_vpr
from operations.load import do_enroll
from operations.search import do_search_vpr
from starlette.middleware.cors import CORSMiddleware
from starlette.requests import Request
from starlette.responses import FileResponse
app = FastAPI()
app.add_middleware(
CORSMiddleware,
allow_origins=["*"],
allow_credentials=True,
allow_methods=["*"],
allow_headers=["*"])
MYSQL_CLI = MySQLHelper()
# Mkdir 'tmp/audio-data'
if not os.path.exists(UPLOAD_PATH):
os.makedirs(UPLOAD_PATH)
LOGGER.info(f"Mkdir the path: {UPLOAD_PATH}")
@app.post('/vpr/enroll')
async def vpr_enroll(table_name: str=None,
spk_id: str=None,
audio: UploadFile=File(...)):
# Enroll the uploaded audio with spk-id into MySQL
try:
# Save the upload data to server.
content = await audio.read()
audio_path = os.path.join(UPLOAD_PATH, audio.filename)
with open(audio_path, "wb+") as f:
f.write(content)
do_enroll(table_name, spk_id, audio_path, MYSQL_CLI)
LOGGER.info(f"Successfully enrolled {spk_id} online!")
return {'status': True, 'msg': "Successfully enroll data!"}
except Exception as e:
LOGGER.error(e)
return {'status': False, 'msg': e}, 400
@app.post('/vpr/enroll/local')
async def vpr_enroll_local(table_name: str=None,
spk_id: str=None,
audio_path: str=None):
# Enroll the local audio with spk-id into MySQL
try:
do_enroll(table_name, spk_id, audio_path, MYSQL_CLI)
LOGGER.info(f"Successfully enrolled {spk_id} locally!")
return {'status': True, 'msg': "Successfully enroll data!"}
except Exception as e:
LOGGER.error(e)
return {'status': False, 'msg': e}, 400
@app.post('/vpr/recog')
async def vpr_recog(request: Request,
table_name: str=None,
audio: UploadFile=File(...)):
# Voice print recognition online
try:
# Save the upload data to server.
content = await audio.read()
query_audio_path = os.path.join(UPLOAD_PATH, audio.filename)
with open(query_audio_path, "wb+") as f:
f.write(content)
host = request.headers['host']
spk_ids, paths, scores = do_search_vpr(host, table_name,
query_audio_path, MYSQL_CLI)
for spk_id, path, score in zip(spk_ids, paths, scores):
LOGGER.info(f"spk {spk_id}, score {score}, audio path {path}, ")
res = dict(zip(spk_ids, zip(paths, scores)))
# Sort results by similarity score, largest first
res = sorted(res.items(), key=lambda item: item[1][1], reverse=True)
LOGGER.info("Successfully speaker recognition online!")
return res
except Exception as e:
LOGGER.error(e)
return {'status': False, 'msg': e}, 400
@app.post('/vpr/recog/local')
async def vpr_recog_local(request: Request,
table_name: str=None,
audio_path: str=None):
# Voice print recognition locally
try:
host = request.headers['host']
spk_ids, paths, scores = do_search_vpr(host, table_name, audio_path,
MYSQL_CLI)
for spk_id, path, score in zip(spk_ids, paths, scores):
LOGGER.info(f"spk {spk_id}, score {score}, audio path {path}, ")
res = dict(zip(spk_ids, zip(paths, scores)))
# Sort results by similarity score, largest first
res = sorted(res.items(), key=lambda item: item[1][1], reverse=True)
LOGGER.info("Successfully speaker recognition locally!")
return res
except Exception as e:
LOGGER.error(e)
return {'status': False, 'msg': e}, 400
@app.post('/vpr/del')
async def vpr_del(table_name: str=None, spk_id: str=None):
# Delete a record by spk_id in MySQL
try:
do_delete(table_name, spk_id, MYSQL_CLI)
LOGGER.info("Successfully delete a record by spk_id in MySQL")
return {'status': True, 'msg': "Successfully delete data!"}
except Exception as e:
LOGGER.error(e)
return {'status': False, 'msg': e}, 400
@app.get('/vpr/list')
async def vpr_list(table_name: str=None):
# Get all records in MySQL
try:
spk_ids, audio_paths = do_list(table_name, MYSQL_CLI)
for i in range(len(spk_ids)):
LOGGER.debug(f"spk {spk_ids[i]}, audio path {audio_paths[i]}")
LOGGER.info("Successfully list all records from mysql!")
return spk_ids, audio_paths
except Exception as e:
LOGGER.error(e)
return {'status': False, 'msg': e}, 400
@app.get('/vpr/data')
async def vpr_data(
table_name: str=None,
spk_id: str=None, ):
# Get the audio file from path by spk_id in MySQL
try:
audio_path = do_get(table_name, spk_id, MYSQL_CLI)
LOGGER.info(f"Successfully get audio path {audio_path}!")
return FileResponse(audio_path)
except Exception as e:
LOGGER.error(e)
return {'status': False, 'msg': e}, 400
@app.get('/vpr/count')
async def vpr_count(table_name: str=None):
# Get the total number of spk in MySQL
try:
num = do_count_vpr(table_name, MYSQL_CLI)
LOGGER.info("Successfully count the number of spk!")
return num
except Exception as e:
LOGGER.error(e)
return {'status': False, 'msg': e}, 400
@app.post('/vpr/drop')
async def drop_tables(table_name: str=None):
# Delete the MySQL table
try:
do_drop_vpr(table_name, MYSQL_CLI)
LOGGER.info("Successfully drop tables in MySQL!")
return {'status': True, 'msg': "Successfully drop tables!"}
except Exception as e:
LOGGER.error(e)
return {'status': False, 'msg': e}, 400
@app.get('/data')
def audio_path(audio_path):
# Get the audio file from path
try:
LOGGER.info(f"Successfully get audio: {audio_path}")
return FileResponse(audio_path)
except Exception as e:
LOGGER.error(f"get audio error: {e}")
return {'status': False, 'msg': e}, 400
if __name__ == '__main__':
uvicorn.run(app=app, host='0.0.0.0', port=8002)
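Once the server is up, the endpoints above can be exercised over plain HTTP; a minimal client sketch with `requests`, assuming the server runs locally on port 8002 (as in the `uvicorn.run` call above) and the example audio from the tests is present:

```python
import requests

BASE = "http://127.0.0.1:8002"

# enroll a local file, then recognize another one and count enrolled speakers
requests.post(BASE + "/vpr/enroll/local",
              params={"spk_id": "spk1",
                      "audio_path": "./example_audio/arms_strikes.wav"})
print(requests.post(BASE + "/vpr/recog/local",
                    params={"audio_path": "./example_audio/test.wav"}).json())
print(requests.get(BASE + "/vpr/count").json())
```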

@ -30,6 +30,11 @@ wget -c https://paddlespeech.bj.bcebos.com/vector/audio/85236145389.wav
paddlespeech vector --task spk --input vec.job
echo -e "demo2 85236145389.wav \n demo3 85236145389.wav" | paddlespeech vector --task spk
paddlespeech vector --task score --input "./85236145389.wav ./123456789.wav"
echo -e "demo4 85236145389.wav 85236145389.wav \n demo5 85236145389.wav 123456789.wav" > vec.job
paddlespeech vector --task score --input vec.job
```
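The `vec.job` file used with `--task score` above holds one label and two wav paths per line; the same pairs can also be scored from Python, a sketch assuming the `VectorExecutor` API shown in the Python API section below:

```python
from paddlespeech.cli import VectorExecutor

vector_executor = VectorExecutor()

def embed(wav):
    # same arguments as in the Python API example below
    return vector_executor(
        model='ecapatdnn_voxceleb12',
        sample_rate=16000,
        config=None,  # use the pretrained model
        ckpt_path=None,
        audio_file=wav)

with open("vec.job") as f:
    for line in f:
        name, enroll_wav, test_wav = line.split()
        # score range [0, 1]
        print(name, vector_executor.get_embeddings_score(
            embed(enroll_wav), embed(test_wav)))
```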
Usage:
@ -38,6 +43,7 @@ wget -c https://paddlespeech.bj.bcebos.com/vector/audio/85236145389.wav
```
Arguments:
- `input` (required): Audio file to recognize.
- `task` (required): Task for `vector` to run. Default: `spk`.
- `model`: Model type of vector task. Default: `ecapatdnn_voxceleb12`.
- `sample_rate`: Sample rate of the model. Default: `16000`.
- `config`: Config of vector task. Use pretrained model when it is None. Default: `None`.
@ -47,45 +53,45 @@ wget -c https://paddlespeech.bj.bcebos.com/vector/audio/85236145389.wav
Output:
```bash
demo [ -5.749211 9.505463 -8.200284 -5.2075014 5.3940268
-3.04878 1.611095 10.127234 -10.534177 -15.821609
1.2032688 -0.35080156 1.2629458 -12.643498 -2.5758228
-11.343508 2.3385992 -8.719341 14.213509 15.404744
-0.39327756 6.338786 2.688887 8.7104025 17.469526
-8.77959 7.0576906 4.648855 -1.3089896 -23.294737
8.013747 13.891729 -9.926753 5.655307 -5.9422326
-22.842539 0.6293588 -18.46266 -10.811862 9.8192625
3.0070958 3.8072643 -2.3861165 3.0821571 -14.739942
1.7594414 -0.6485091 4.485623 2.0207152 7.264915
-6.40137 23.63524 2.9711294 -22.708025 9.93719
20.354511 -10.324688 -0.700492 -8.783211 -5.27593
15.999649 3.3004563 12.747926 15.429879 4.7849145
5.6699696 -2.3826702 10.605882 3.9112158 3.1500628
15.859915 -2.1832209 -23.908653 -6.4799504 -4.5365124
-9.224193 14.568347 -10.568833 4.982321 -4.342062
0.0914714 12.645902 -5.74285 -3.2141201 -2.7173362
-6.680575 0.4757669 -5.035051 -6.7964664 16.865469
-11.54324 7.681869 0.44475392 9.708182 -8.932846
0.4123232 -4.361452 1.3948607 9.511665 0.11667654
2.9079323 6.049952 9.275183 -18.078873 6.2983274
-0.7500531 -2.725033 -7.6027865 3.3404543 2.990815
4.010979 11.000591 -2.8873312 7.1352735 -16.79663
18.495346 -14.293832 7.89578 2.2714825 22.976387
-4.875734 -3.0836344 -2.9999814 13.751918 6.448228
-11.924197 2.171869 2.0423572 -6.173772 10.778437
25.77281 -4.9495463 14.57806 0.3044315 2.6132357
-7.591999 -2.076944 9.025118 1.7834753 -3.1799617
-4.9401326 23.465864 5.1685796 -9.018578 9.037825
-4.4150195 6.859591 -12.274467 -0.88911164 5.186309
-3.9988663 -13.638606 -9.925445 -0.06329413 -3.6709652
-12.397416 -12.719869 -1.395601 2.1150916 5.7381287
-4.4691963 -3.82819 -0.84233856 -1.1604277 -13.490127
8.731719 -20.778936 -11.495662 5.8033476 -4.752041
10.833007 -6.717991 4.504732 13.4244375 1.1306485
7.3435574 1.400918 14.704036 -9.501399 7.2315617
-6.417456 1.3333273 11.872697 -0.30664724 8.8845
6.5569253 4.7948146 0.03662816 -8.704245 6.224871
-3.2701402 -11.508579 ]
demo [ 1.4217498 5.626253 -5.342073 1.1773866 3.308055
1.756596 5.167894 10.80636 -3.8226728 -5.6141334
2.623845 -0.8072968 1.9635103 -7.3128724 0.01103897
-9.723131 0.6619743 -6.976803 10.213478 7.494748
2.9105635 3.8949256 3.7999806 7.1061673 16.905321
-7.1493764 8.733103 3.4230042 -4.831653 -11.403367
11.232214 7.1274667 -4.2828417 2.452362 -5.130748
-18.177666 -2.6116815 -11.000337 -6.7314315 1.6564683
0.7618269 1.1253023 -2.083836 4.725744 -8.782597
-3.539873 3.814236 5.1420674 2.162061 4.096431
-6.4162116 12.747448 1.9429878 -15.152943 6.417416
16.097002 -9.716668 -1.9920526 -3.3649497 -1.871939
11.567354 3.69788 11.258265 7.442363 9.183411
4.5281515 -1.2417862 4.3959084 6.6727695 5.8898783
7.627124 -0.66919386 -11.889693 -9.208865 -7.4274073
-3.7776625 6.917234 -9.848748 -2.0944717 -5.135116
0.49563864 9.317534 -5.9141874 -1.8098574 -0.11738578
-7.169265 -1.0578263 -5.7216787 -5.1173844 16.137651
-4.473626 7.6624317 -0.55381083 9.631587 -6.4704556
-8.548508 4.3716145 -0.79702514 4.478997 -2.9758704
3.272176 2.8382776 5.134597 -9.190781 -0.5657382
-4.8745747 2.3165567 -5.984303 -2.1798875 0.35541576
-0.31784213 9.493548 2.1144536 4.358092 -12.089823
8.451689 -7.925461 4.6242585 4.4289427 18.692003
-2.6204622 -5.149185 -0.35821092 8.488551 4.981496
-9.32683 -2.2544234 6.6417594 1.2119585 10.977129
16.555033 3.3238444 9.551863 -1.6676947 -0.79539716
-8.605674 -0.47356385 2.6741948 -5.359179 -2.6673796
0.66607 15.443222 4.740594 -3.4725387 11.592567
-2.054497 1.7361217 -8.265324 -9.30447 5.4068313
-1.5180256 -7.746615 -6.089606 0.07112726 -0.34904733
-8.649895 -9.998958 -2.564841 -0.53999114 2.601808
-0.31927416 -1.8815292 -2.07215 -3.4105783 -8.2998085
1.483641 -15.365992 -8.288208 3.8847756 -3.4876456
7.3629923 0.4657332 3.132599 12.438889 -1.8337058
4.532936 2.7264361 10.145339 -6.521951 2.897153
-3.3925855 5.079156 7.759716 4.677565 5.8457737
2.402413 7.7071047 3.9711342 -6.390043 6.1268735
-3.7760346 -11.118123 ]
```
- Python API
@ -97,56 +103,113 @@ wget -c https://paddlespeech.bj.bcebos.com/vector/audio/85236145389.wav
audio_emb = vector_executor(
model='ecapatdnn_voxceleb12',
sample_rate=16000,
config=None,
config=None, # Set `config` and `ckpt_path` to None to use pretrained model.
ckpt_path=None,
audio_file='./85236145389.wav',
force_yes=False,
device=paddle.get_device())
print('Audio embedding Result: \n{}'.format(audio_emb))
test_emb = vector_executor(
model='ecapatdnn_voxceleb12',
sample_rate=16000,
config=None, # Set `config` and `ckpt_path` to None to use pretrained model.
ckpt_path=None,
audio_file='./123456789.wav',
device=paddle.get_device())
print('Test embedding Result: \n{}'.format(test_emb))
# score range [0, 1]
score = vector_executor.get_embeddings_score(audio_emb, test_emb)
print(f"Eembeddings Score: {score}")
```
Output:
```bash
# Vector Result:
[ -5.749211 9.505463 -8.200284 -5.2075014 5.3940268
-3.04878 1.611095 10.127234 -10.534177 -15.821609
1.2032688 -0.35080156 1.2629458 -12.643498 -2.5758228
-11.343508 2.3385992 -8.719341 14.213509 15.404744
-0.39327756 6.338786 2.688887 8.7104025 17.469526
-8.77959 7.0576906 4.648855 -1.3089896 -23.294737
8.013747 13.891729 -9.926753 5.655307 -5.9422326
-22.842539 0.6293588 -18.46266 -10.811862 9.8192625
3.0070958 3.8072643 -2.3861165 3.0821571 -14.739942
1.7594414 -0.6485091 4.485623 2.0207152 7.264915
-6.40137 23.63524 2.9711294 -22.708025 9.93719
20.354511 -10.324688 -0.700492 -8.783211 -5.27593
15.999649 3.3004563 12.747926 15.429879 4.7849145
5.6699696 -2.3826702 10.605882 3.9112158 3.1500628
15.859915 -2.1832209 -23.908653 -6.4799504 -4.5365124
-9.224193 14.568347 -10.568833 4.982321 -4.342062
0.0914714 12.645902 -5.74285 -3.2141201 -2.7173362
-6.680575 0.4757669 -5.035051 -6.7964664 16.865469
-11.54324 7.681869 0.44475392 9.708182 -8.932846
0.4123232 -4.361452 1.3948607 9.511665 0.11667654
2.9079323 6.049952 9.275183 -18.078873 6.2983274
-0.7500531 -2.725033 -7.6027865 3.3404543 2.990815
4.010979 11.000591 -2.8873312 7.1352735 -16.79663
18.495346 -14.293832 7.89578 2.2714825 22.976387
-4.875734 -3.0836344 -2.9999814 13.751918 6.448228
-11.924197 2.171869 2.0423572 -6.173772 10.778437
25.77281 -4.9495463 14.57806 0.3044315 2.6132357
-7.591999 -2.076944 9.025118 1.7834753 -3.1799617
-4.9401326 23.465864 5.1685796 -9.018578 9.037825
-4.4150195 6.859591 -12.274467 -0.88911164 5.186309
-3.9988663 -13.638606 -9.925445 -0.06329413 -3.6709652
-12.397416 -12.719869 -1.395601 2.1150916 5.7381287
-4.4691963 -3.82819 -0.84233856 -1.1604277 -13.490127
8.731719 -20.778936 -11.495662 5.8033476 -4.752041
10.833007 -6.717991 4.504732 13.4244375 1.1306485
7.3435574 1.400918 14.704036 -9.501399 7.2315617
-6.417456 1.3333273 11.872697 -0.30664724 8.8845
6.5569253 4.7948146 0.03662816 -8.704245 6.224871
-3.2701402 -11.508579 ]
Audio embedding Result:
[ 1.4217498 5.626253 -5.342073 1.1773866 3.308055
1.756596 5.167894 10.80636 -3.8226728 -5.6141334
2.623845 -0.8072968 1.9635103 -7.3128724 0.01103897
-9.723131 0.6619743 -6.976803 10.213478 7.494748
2.9105635 3.8949256 3.7999806 7.1061673 16.905321
-7.1493764 8.733103 3.4230042 -4.831653 -11.403367
11.232214 7.1274667 -4.2828417 2.452362 -5.130748
-18.177666 -2.6116815 -11.000337 -6.7314315 1.6564683
0.7618269 1.1253023 -2.083836 4.725744 -8.782597
-3.539873 3.814236 5.1420674 2.162061 4.096431
-6.4162116 12.747448 1.9429878 -15.152943 6.417416
16.097002 -9.716668 -1.9920526 -3.3649497 -1.871939
11.567354 3.69788 11.258265 7.442363 9.183411
4.5281515 -1.2417862 4.3959084 6.6727695 5.8898783
7.627124 -0.66919386 -11.889693 -9.208865 -7.4274073
-3.7776625 6.917234 -9.848748 -2.0944717 -5.135116
0.49563864 9.317534 -5.9141874 -1.8098574 -0.11738578
-7.169265 -1.0578263 -5.7216787 -5.1173844 16.137651
-4.473626 7.6624317 -0.55381083 9.631587 -6.4704556
-8.548508 4.3716145 -0.79702514 4.478997 -2.9758704
3.272176 2.8382776 5.134597 -9.190781 -0.5657382
-4.8745747 2.3165567 -5.984303 -2.1798875 0.35541576
-0.31784213 9.493548 2.1144536 4.358092 -12.089823
8.451689 -7.925461 4.6242585 4.4289427 18.692003
-2.6204622 -5.149185 -0.35821092 8.488551 4.981496
-9.32683 -2.2544234 6.6417594 1.2119585 10.977129
16.555033 3.3238444 9.551863 -1.6676947 -0.79539716
-8.605674 -0.47356385 2.6741948 -5.359179 -2.6673796
0.66607 15.443222 4.740594 -3.4725387 11.592567
-2.054497 1.7361217 -8.265324 -9.30447 5.4068313
-1.5180256 -7.746615 -6.089606 0.07112726 -0.34904733
-8.649895 -9.998958 -2.564841 -0.53999114 2.601808
-0.31927416 -1.8815292 -2.07215 -3.4105783 -8.2998085
1.483641 -15.365992 -8.288208 3.8847756 -3.4876456
7.3629923 0.4657332 3.132599 12.438889 -1.8337058
4.532936 2.7264361 10.145339 -6.521951 2.897153
-3.3925855 5.079156 7.759716 4.677565 5.8457737
2.402413 7.7071047 3.9711342 -6.390043 6.1268735
-3.7760346 -11.118123 ]
# get the test embedding
Test embedding Result:
[ -1.902964 2.0690894 -8.034194 3.5472693 0.18089125
6.9085927 1.4097427 -1.9487704 -10.021278 -0.20755845
-8.04332 4.344489 2.3200977 -14.306299 5.184692
-11.55602 -3.8497238 0.6444722 1.2833948 2.6766639
0.5878921 0.7946299 1.7207596 2.5791872 14.998469
-1.3385371 15.031221 -0.8006958 1.99287 -9.52007
2.435466 4.003221 -4.33817 -4.898601 -5.304714
-18.033886 10.790787 -12.784645 -5.641755 2.9761686
-10.566622 1.4839455 6.152458 -5.7195854 2.8603241
6.112133 8.489869 5.5958056 1.2836679 -1.2293907
0.89927405 7.0288725 -2.854029 -0.9782962 5.8255906
14.905906 -5.025907 0.7866458 -4.2444224 -16.354029
10.521315 0.9604709 -3.3257897 7.144871 -13.592733
-8.568869 -1.7953678 0.26313916 10.916714 -6.9374123
1.857403 -6.2746415 2.8154466 -7.2338667 -2.293357
-0.05452765 5.4287076 5.0849075 -6.690375 -1.6183422
3.654291 0.94352573 -9.200294 -5.4749465 -3.5235846
1.3420814 4.240421 -2.772944 -2.8451524 16.311104
4.2969875 -1.762936 -12.5758915 8.595198 -0.8835239
-1.5708797 1.568961 1.1413603 3.5032008 -0.45251232
-6.786333 16.89443 5.3366146 -8.789056 0.6355629
3.2579517 -3.328322 7.5969577 0.66025066 -6.550468
-9.148656 2.020372 -0.4615173 1.1965656 -3.8764873
11.6562195 -6.0750933 12.182899 3.2218833 0.81969476
5.570001 -3.8459578 -7.205299 7.9262037 -7.6611166
-5.249467 -2.2671914 7.2658715 -13.298164 4.821147
-2.7263982 11.691089 -3.8918593 -2.838112 -1.0336838
-3.8034165 2.8536487 -5.60398 -1.1972581 1.3455094
-3.4903061 2.2408795 5.5010734 -3.970756 11.99696
-7.8858757 0.43160373 -5.5059714 4.3426995 16.322706
11.635366 0.72157705 -9.245714 -3.91465 -4.449838
-1.5716927 7.713747 -2.2430465 -6.198303 -13.481864
2.8156567 -5.7812386 5.1456156 2.7289324 -14.505571
13.270688 3.448231 -7.0659585 4.5886116 -4.466099
-0.296428 -11.463529 -2.6076477 14.110243 -6.9725137
-1.9962958 2.7119343 19.391657 0.01961198 14.607133
-1.6695905 -4.391516 1.3131028 -6.670972 -5.888604
12.0612335 5.9285784 3.3715196 1.492534 10.723728
-0.95514804 -12.085431 ]
# get the score between enroll and test
Embeddings Score: 0.4292638301849365
```
### 4. Pretrained Models

@ -29,6 +29,11 @@ wget -c https://paddlespeech.bj.bcebos.com/vector/audio/85236145389.wav
paddlespeech vector --task spk --input vec.job
echo -e "demo2 85236145389.wav \n demo3 85236145389.wav" | paddlespeech vector --task spk
paddlespeech vector --task score --input "./85236145389.wav ./123456789.wav"
echo -e "demo4 85236145389.wav 85236145389.wav \n demo5 85236145389.wav 123456789.wav" > vec.job
paddlespeech vector --task score --input vec.job
```
Usage:
@ -37,6 +42,7 @@ wget -c https://paddlespeech.bj.bcebos.com/vector/audio/85236145389.wav
```
Arguments:
- `input` (required): Audio file to recognize.
- `task` (required): Task for `vector` to run. Default: `spk`.
- `model`: Model type of the vector task. Default: `ecapatdnn_voxceleb12`.
- `sample_rate`: Sample rate of the audio. Default: `16000`.
- `config`: Config of the vector task; if not set, the default config of the pretrained model is used. Default: `None`.
@ -45,45 +51,45 @@ wget -c https://paddlespeech.bj.bcebos.com/vector/audio/85236145389.wav
Output:
```bash
demo [ -5.749211 9.505463 -8.200284 -5.2075014 5.3940268
-3.04878 1.611095 10.127234 -10.534177 -15.821609
1.2032688 -0.35080156 1.2629458 -12.643498 -2.5758228
-11.343508 2.3385992 -8.719341 14.213509 15.404744
-0.39327756 6.338786 2.688887 8.7104025 17.469526
-8.77959 7.0576906 4.648855 -1.3089896 -23.294737
8.013747 13.891729 -9.926753 5.655307 -5.9422326
-22.842539 0.6293588 -18.46266 -10.811862 9.8192625
3.0070958 3.8072643 -2.3861165 3.0821571 -14.739942
1.7594414 -0.6485091 4.485623 2.0207152 7.264915
-6.40137 23.63524 2.9711294 -22.708025 9.93719
20.354511 -10.324688 -0.700492 -8.783211 -5.27593
15.999649 3.3004563 12.747926 15.429879 4.7849145
5.6699696 -2.3826702 10.605882 3.9112158 3.1500628
15.859915 -2.1832209 -23.908653 -6.4799504 -4.5365124
-9.224193 14.568347 -10.568833 4.982321 -4.342062
0.0914714 12.645902 -5.74285 -3.2141201 -2.7173362
-6.680575 0.4757669 -5.035051 -6.7964664 16.865469
-11.54324 7.681869 0.44475392 9.708182 -8.932846
0.4123232 -4.361452 1.3948607 9.511665 0.11667654
2.9079323 6.049952 9.275183 -18.078873 6.2983274
-0.7500531 -2.725033 -7.6027865 3.3404543 2.990815
4.010979 11.000591 -2.8873312 7.1352735 -16.79663
18.495346 -14.293832 7.89578 2.2714825 22.976387
-4.875734 -3.0836344 -2.9999814 13.751918 6.448228
-11.924197 2.171869 2.0423572 -6.173772 10.778437
25.77281 -4.9495463 14.57806 0.3044315 2.6132357
-7.591999 -2.076944 9.025118 1.7834753 -3.1799617
-4.9401326 23.465864 5.1685796 -9.018578 9.037825
-4.4150195 6.859591 -12.274467 -0.88911164 5.186309
-3.9988663 -13.638606 -9.925445 -0.06329413 -3.6709652
-12.397416 -12.719869 -1.395601 2.1150916 5.7381287
-4.4691963 -3.82819 -0.84233856 -1.1604277 -13.490127
8.731719 -20.778936 -11.495662 5.8033476 -4.752041
10.833007 -6.717991 4.504732 13.4244375 1.1306485
7.3435574 1.400918 14.704036 -9.501399 7.2315617
-6.417456 1.3333273 11.872697 -0.30664724 8.8845
6.5569253 4.7948146 0.03662816 -8.704245 6.224871
-3.2701402 -11.508579 ]
demo [ 1.4217498 5.626253 -5.342073 1.1773866 3.308055
1.756596 5.167894 10.80636 -3.8226728 -5.6141334
2.623845 -0.8072968 1.9635103 -7.3128724 0.01103897
-9.723131 0.6619743 -6.976803 10.213478 7.494748
2.9105635 3.8949256 3.7999806 7.1061673 16.905321
-7.1493764 8.733103 3.4230042 -4.831653 -11.403367
11.232214 7.1274667 -4.2828417 2.452362 -5.130748
-18.177666 -2.6116815 -11.000337 -6.7314315 1.6564683
0.7618269 1.1253023 -2.083836 4.725744 -8.782597
-3.539873 3.814236 5.1420674 2.162061 4.096431
-6.4162116 12.747448 1.9429878 -15.152943 6.417416
16.097002 -9.716668 -1.9920526 -3.3649497 -1.871939
11.567354 3.69788 11.258265 7.442363 9.183411
4.5281515 -1.2417862 4.3959084 6.6727695 5.8898783
7.627124 -0.66919386 -11.889693 -9.208865 -7.4274073
-3.7776625 6.917234 -9.848748 -2.0944717 -5.135116
0.49563864 9.317534 -5.9141874 -1.8098574 -0.11738578
-7.169265 -1.0578263 -5.7216787 -5.1173844 16.137651
-4.473626 7.6624317 -0.55381083 9.631587 -6.4704556
-8.548508 4.3716145 -0.79702514 4.478997 -2.9758704
3.272176 2.8382776 5.134597 -9.190781 -0.5657382
-4.8745747 2.3165567 -5.984303 -2.1798875 0.35541576
-0.31784213 9.493548 2.1144536 4.358092 -12.089823
8.451689 -7.925461 4.6242585 4.4289427 18.692003
-2.6204622 -5.149185 -0.35821092 8.488551 4.981496
-9.32683 -2.2544234 6.6417594 1.2119585 10.977129
16.555033 3.3238444 9.551863 -1.6676947 -0.79539716
-8.605674 -0.47356385 2.6741948 -5.359179 -2.6673796
0.66607 15.443222 4.740594 -3.4725387 11.592567
-2.054497 1.7361217 -8.265324 -9.30447 5.4068313
-1.5180256 -7.746615 -6.089606 0.07112726 -0.34904733
-8.649895 -9.998958 -2.564841 -0.53999114 2.601808
-0.31927416 -1.8815292 -2.07215 -3.4105783 -8.2998085
1.483641 -15.365992 -8.288208 3.8847756 -3.4876456
7.3629923 0.4657332 3.132599 12.438889 -1.8337058
4.532936 2.7264361 10.145339 -6.521951 2.897153
-3.3925855 5.079156 7.759716 4.677565 5.8457737
2.402413 7.7071047 3.9711342 -6.390043 6.1268735
-3.7760346 -11.118123 ]
```
- Python API
@ -98,53 +104,109 @@ wget -c https://paddlespeech.bj.bcebos.com/vector/audio/85236145389.wav
config=None, # Set `config` and `ckpt_path` to None to use pretrained model.
ckpt_path=None,
audio_file='./85236145389.wav',
force_yes=False,
device=paddle.get_device())
print('Audio embedding Result: \n{}'.format(audio_emb))
test_emb = vector_executor(
model='ecapatdnn_voxceleb12',
sample_rate=16000,
config=None, # Set `config` and `ckpt_path` to None to use pretrained model.
ckpt_path=None,
audio_file='./123456789.wav',
device=paddle.get_device())
print('Test embedding Result: \n{}'.format(test_emb))
# score range [0, 1]
score = vector_executor.get_embeddings_score(audio_emb, test_emb)
print(f"Eembeddings Score: {score}")
```
Output:
```bash
# Vector Result:
[ -5.749211 9.505463 -8.200284 -5.2075014 5.3940268
-3.04878 1.611095 10.127234 -10.534177 -15.821609
1.2032688 -0.35080156 1.2629458 -12.643498 -2.5758228
-11.343508 2.3385992 -8.719341 14.213509 15.404744
-0.39327756 6.338786 2.688887 8.7104025 17.469526
-8.77959 7.0576906 4.648855 -1.3089896 -23.294737
8.013747 13.891729 -9.926753 5.655307 -5.9422326
-22.842539 0.6293588 -18.46266 -10.811862 9.8192625
3.0070958 3.8072643 -2.3861165 3.0821571 -14.739942
1.7594414 -0.6485091 4.485623 2.0207152 7.264915
-6.40137 23.63524 2.9711294 -22.708025 9.93719
20.354511 -10.324688 -0.700492 -8.783211 -5.27593
15.999649 3.3004563 12.747926 15.429879 4.7849145
5.6699696 -2.3826702 10.605882 3.9112158 3.1500628
15.859915 -2.1832209 -23.908653 -6.4799504 -4.5365124
-9.224193 14.568347 -10.568833 4.982321 -4.342062
0.0914714 12.645902 -5.74285 -3.2141201 -2.7173362
-6.680575 0.4757669 -5.035051 -6.7964664 16.865469
-11.54324 7.681869 0.44475392 9.708182 -8.932846
0.4123232 -4.361452 1.3948607 9.511665 0.11667654
2.9079323 6.049952 9.275183 -18.078873 6.2983274
-0.7500531 -2.725033 -7.6027865 3.3404543 2.990815
4.010979 11.000591 -2.8873312 7.1352735 -16.79663
18.495346 -14.293832 7.89578 2.2714825 22.976387
-4.875734 -3.0836344 -2.9999814 13.751918 6.448228
-11.924197 2.171869 2.0423572 -6.173772 10.778437
25.77281 -4.9495463 14.57806 0.3044315 2.6132357
-7.591999 -2.076944 9.025118 1.7834753 -3.1799617
-4.9401326 23.465864 5.1685796 -9.018578 9.037825
-4.4150195 6.859591 -12.274467 -0.88911164 5.186309
-3.9988663 -13.638606 -9.925445 -0.06329413 -3.6709652
-12.397416 -12.719869 -1.395601 2.1150916 5.7381287
-4.4691963 -3.82819 -0.84233856 -1.1604277 -13.490127
8.731719 -20.778936 -11.495662 5.8033476 -4.752041
10.833007 -6.717991 4.504732 13.4244375 1.1306485
7.3435574 1.400918 14.704036 -9.501399 7.2315617
-6.417456 1.3333273 11.872697 -0.30664724 8.8845
6.5569253 4.7948146 0.03662816 -8.704245 6.224871
-3.2701402 -11.508579 ]
Audio embedding Result:
[ 1.4217498 5.626253 -5.342073 1.1773866 3.308055
1.756596 5.167894 10.80636 -3.8226728 -5.6141334
2.623845 -0.8072968 1.9635103 -7.3128724 0.01103897
-9.723131 0.6619743 -6.976803 10.213478 7.494748
2.9105635 3.8949256 3.7999806 7.1061673 16.905321
-7.1493764 8.733103 3.4230042 -4.831653 -11.403367
11.232214 7.1274667 -4.2828417 2.452362 -5.130748
-18.177666 -2.6116815 -11.000337 -6.7314315 1.6564683
0.7618269 1.1253023 -2.083836 4.725744 -8.782597
-3.539873 3.814236 5.1420674 2.162061 4.096431
-6.4162116 12.747448 1.9429878 -15.152943 6.417416
16.097002 -9.716668 -1.9920526 -3.3649497 -1.871939
11.567354 3.69788 11.258265 7.442363 9.183411
4.5281515 -1.2417862 4.3959084 6.6727695 5.8898783
7.627124 -0.66919386 -11.889693 -9.208865 -7.4274073
-3.7776625 6.917234 -9.848748 -2.0944717 -5.135116
0.49563864 9.317534 -5.9141874 -1.8098574 -0.11738578
-7.169265 -1.0578263 -5.7216787 -5.1173844 16.137651
-4.473626 7.6624317 -0.55381083 9.631587 -6.4704556
-8.548508 4.3716145 -0.79702514 4.478997 -2.9758704
3.272176 2.8382776 5.134597 -9.190781 -0.5657382
-4.8745747 2.3165567 -5.984303 -2.1798875 0.35541576
-0.31784213 9.493548 2.1144536 4.358092 -12.089823
8.451689 -7.925461 4.6242585 4.4289427 18.692003
-2.6204622 -5.149185 -0.35821092 8.488551 4.981496
-9.32683 -2.2544234 6.6417594 1.2119585 10.977129
16.555033 3.3238444 9.551863 -1.6676947 -0.79539716
-8.605674 -0.47356385 2.6741948 -5.359179 -2.6673796
0.66607 15.443222 4.740594 -3.4725387 11.592567
-2.054497 1.7361217 -8.265324 -9.30447 5.4068313
-1.5180256 -7.746615 -6.089606 0.07112726 -0.34904733
-8.649895 -9.998958 -2.564841 -0.53999114 2.601808
-0.31927416 -1.8815292 -2.07215 -3.4105783 -8.2998085
1.483641 -15.365992 -8.288208 3.8847756 -3.4876456
7.3629923 0.4657332 3.132599 12.438889 -1.8337058
4.532936 2.7264361 10.145339 -6.521951 2.897153
-3.3925855 5.079156 7.759716 4.677565 5.8457737
2.402413 7.7071047 3.9711342 -6.390043 6.1268735
-3.7760346 -11.118123 ]
# get the test embedding
Test embedding Result:
[ -1.902964 2.0690894 -8.034194 3.5472693 0.18089125
6.9085927 1.4097427 -1.9487704 -10.021278 -0.20755845
-8.04332 4.344489 2.3200977 -14.306299 5.184692
-11.55602 -3.8497238 0.6444722 1.2833948 2.6766639
0.5878921 0.7946299 1.7207596 2.5791872 14.998469
-1.3385371 15.031221 -0.8006958 1.99287 -9.52007
2.435466 4.003221 -4.33817 -4.898601 -5.304714
-18.033886 10.790787 -12.784645 -5.641755 2.9761686
-10.566622 1.4839455 6.152458 -5.7195854 2.8603241
6.112133 8.489869 5.5958056 1.2836679 -1.2293907
0.89927405 7.0288725 -2.854029 -0.9782962 5.8255906
14.905906 -5.025907 0.7866458 -4.2444224 -16.354029
10.521315 0.9604709 -3.3257897 7.144871 -13.592733
-8.568869 -1.7953678 0.26313916 10.916714 -6.9374123
1.857403 -6.2746415 2.8154466 -7.2338667 -2.293357
-0.05452765 5.4287076 5.0849075 -6.690375 -1.6183422
3.654291 0.94352573 -9.200294 -5.4749465 -3.5235846
1.3420814 4.240421 -2.772944 -2.8451524 16.311104
4.2969875 -1.762936 -12.5758915 8.595198 -0.8835239
-1.5708797 1.568961 1.1413603 3.5032008 -0.45251232
-6.786333 16.89443 5.3366146 -8.789056 0.6355629
3.2579517 -3.328322 7.5969577 0.66025066 -6.550468
-9.148656 2.020372 -0.4615173 1.1965656 -3.8764873
11.6562195 -6.0750933 12.182899 3.2218833 0.81969476
5.570001 -3.8459578 -7.205299 7.9262037 -7.6611166
-5.249467 -2.2671914 7.2658715 -13.298164 4.821147
-2.7263982 11.691089 -3.8918593 -2.838112 -1.0336838
-3.8034165 2.8536487 -5.60398 -1.1972581 1.3455094
-3.4903061 2.2408795 5.5010734 -3.970756 11.99696
-7.8858757 0.43160373 -5.5059714 4.3426995 16.322706
11.635366 0.72157705 -9.245714 -3.91465 -4.449838
-1.5716927 7.713747 -2.2430465 -6.198303 -13.481864
2.8156567 -5.7812386 5.1456156 2.7289324 -14.505571
13.270688 3.448231 -7.0659585 4.5886116 -4.466099
-0.296428 -11.463529 -2.6076477 14.110243 -6.9725137
-1.9962958 2.7119343 19.391657 0.01961198 14.607133
-1.6695905 -4.391516 1.3131028 -6.670972 -5.888604
12.0612335 5.9285784 3.3715196 1.492534 10.723728
-0.95514804 -12.085431 ]
# get the score between enroll and test
Embeddings Score: 0.4292638301849365
```
### 4. Pretrained Models

@ -1,6 +1,9 @@
#!/bin/bash
wget -c https://paddlespeech.bj.bcebos.com/vector/audio/85236145389.wav
wget -c https://paddlespeech.bj.bcebos.com/vector/audio/123456789.wav
# asr
paddlespeech vector --task spk --input ./85236145389.wav
# vector
paddlespeech vector --task spk --input ./85236145389.wav
paddlespeech vector --task score --input "./85236145389.wav ./123456789.wav"

@ -7,4 +7,4 @@ paddlespeech asr --input ./zh.wav
# asr + punc
paddlespeech asr --input ./zh.wav | paddlespeech text --task punc
paddlespeech asr --input ./zh.wav | paddlespeech text --task punc

@ -85,6 +85,10 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
- Command line (recommended)
```
paddlespeech_client asr --server_ip 127.0.0.1 --port 8090 --input ./zh.wav
# streaming ASR
paddlespeech_client asr_online --server_ip 127.0.0.1 --port 8091 --input ./zh.wav
```
Usage help:
@ -191,7 +195,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
```
### 5. CLS Client Usage
### 6. CLS Client Usage
**Note:** The response time will be slightly longer when the client is used for the first time
- Command line (recommended)
```

@ -6,7 +6,7 @@
### Speech Recognition Model
Acoustic Model | Training Data | Token-based | Size | Descriptions | CER | WER | Hours of speech | Example Link
:-------------:| :------------:| :-----: | -----: | :-----: |:-----:| :-----: | :-----: | :-----:
[Ds2 Online Aishell ASR0 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_online_aishell_ckpt_0.1.1.model.tar.gz) | Aishell Dataset | Char-based | 345 MB | 2 Conv + 5 LSTM layers with only forward direction | 0.080 |-| 151 h | [D2 Online Aishell ASR0](../../examples/aishell/asr0)
[Ds2 Online Aishell ASR0 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_online_aishell_ckpt_0.2.0.model.tar.gz) | Aishell Dataset | Char-based | 345 MB | 2 Conv + 5 LSTM layers with only forward direction | 0.078 |-| 151 h | [D2 Online Aishell ASR0](../../examples/aishell/asr0)
[Ds2 Offline Aishell ASR0 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_aishell_ckpt_0.1.1.model.tar.gz)| Aishell Dataset | Char-based | 306 MB | 2 Conv + 3 bidirectional GRU layers| 0.064 |-| 151 h | [Ds2 Offline Aishell ASR0](../../examples/aishell/asr0)
[Conformer Online Aishell ASR1 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr1/asr1_chunk_conformer_aishell_ckpt_0.1.2.model.tar.gz) | Aishell Dataset | Char-based | 189 MB | Encoder:Conformer, Decoder:Transformer, Decoding method: Attention rescoring | 0.0565 |-| 151 h | [Conformer Online Aishell ASR1](../../examples/aishell/asr1)
[Conformer Offline Aishell ASR1 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr1/asr1_conformer_aishell_ckpt_0.1.2.model.tar.gz) | Aishell Dataset | Char-based | 189 MB | Encoder:Conformer, Decoder:Transformer, Decoding method: Attention rescoring | 0.0483 |-| 151 h | [Conformer Offline Aishell ASR1](../../examples/aishell/asr1)
@ -37,8 +37,8 @@ Model Type | Dataset| Example Link | Pretrained Models|Static Models|Size (stati
Tacotron2|LJSpeech|[tacotron2-ljspeech](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/ljspeech/tts0)|[tacotron2_ljspeech_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/tacotron2/tacotron2_ljspeech_ckpt_0.2.0.zip)|||
Tacotron2|CSMSC|[tacotron2-csmsc](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/tts0)|[tacotron2_csmsc_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/tacotron2/tacotron2_csmsc_ckpt_0.2.0.zip)|[tacotron2_csmsc_static_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/tacotron2/tacotron2_csmsc_static_0.2.0.zip)|103MB|
TransformerTTS| LJSpeech| [transformer-ljspeech](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/ljspeech/tts1)|[transformer_tts_ljspeech_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/transformer_tts/transformer_tts_ljspeech_ckpt_0.4.zip)|||
SpeedySpeech| CSMSC | [speedyspeech-csmsc](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/tts2) |[speedyspeech_nosil_baker_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/speedyspeech/speedyspeech_nosil_baker_ckpt_0.5.zip)|[speedyspeech_nosil_baker_static_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/speedyspeech/speedyspeech_nosil_baker_static_0.5.zip)|12MB|
FastSpeech2| CSMSC |[fastspeech2-csmsc](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/tts3)|[fastspeech2_nosil_baker_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_baker_ckpt_0.4.zip)|[fastspeech2_nosil_baker_static_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_baker_static_0.4.zip)|157MB|
SpeedySpeech| CSMSC | [speedyspeech-csmsc](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/tts2)|[speedyspeech_csmsc_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/speedyspeech/speedyspeech_csmsc_ckpt_0.2.0.zip)|[speedyspeech_csmsc_static_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/speedyspeech/speedyspeech_csmsc_static_0.2.0.zip)|12MB|
FastSpeech2| CSMSC |[fastspeech2-csmsc](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/tts3)|[fastspeech2_nosil_baker_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_baker_ckpt_0.4.zip)|[fastspeech2_csmsc_static_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_csmsc_static_0.2.0.zip)|157MB|
FastSpeech2-Conformer| CSMSC |[fastspeech2-csmsc](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/tts3)|[fastspeech2_conformer_baker_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_conformer_baker_ckpt_0.5.zip)|||
FastSpeech2| AISHELL-3 |[fastspeech2-aishell3](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/tts3)|[fastspeech2_nosil_aishell3_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_aishell3_ckpt_0.4.zip)|||
FastSpeech2| LJSpeech |[fastspeech2-ljspeech](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/ljspeech/tts3)|[fastspeech2_nosil_ljspeech_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_ljspeech_ckpt_0.5.zip)|||
@ -80,7 +80,7 @@ PANN | ESC-50 |[pann-esc50](../../examples/esc50/cls0)|[esc50_cnn6.tar.gz](https
Model Type | Dataset| Example Link | Pretrained Models | Static Models
:-------------:| :------------:| :-----: | :-----: | :-----:
ECAPA-TDNN | VoxCeleb| [voxceleb_ecapatdnn](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/voxceleb/sv0) | [ecapatdnn.tar.gz](https://paddlespeech.bj.bcebos.com/vector/voxceleb/sv0_ecapa_tdnn_voxceleb12_ckpt_0_1_1.tar.gz) | -
ECAPA-TDNN | VoxCeleb| [voxceleb_ecapatdnn](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/voxceleb/sv0) | [ecapatdnn.tar.gz](https://paddlespeech.bj.bcebos.com/vector/voxceleb/sv0_ecapa_tdnn_voxceleb12_ckpt_0_2_0.tar.gz) | -
## Punctuation Restoration Models
Model Type | Dataset| Example Link | Pretrained Models

@ -151,21 +151,14 @@ avg.sh best exp/deepspeech2/checkpoints 1
CUDA_VISIBLE_DEVICES= ./local/test.sh conf/deepspeech2.yaml exp/deepspeech2/checkpoints/avg_1
```
## Pretrained Model
You can get the pretrained transformer or conformer using the scripts below:
```bash
Deepspeech2 offline:
wget https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/ds2.model.tar.gz
Deepspeech2 online:
wget https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/aishell_ds2_online_cer8.00_release.tar.gz
You can get the pretrained models from [released_model.md](../../../docs/source/released_model.md).
```
Use the `tar` command to unpack the model, then use the script below to test it.
For example:
```
wget https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/ds2.model.tar.gz
tar xzvf ds2.model.tar.gz
wget https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_aishell_ckpt_0.1.1.model.tar.gz
tar xzvf asr0_deepspeech2_aishell_ckpt_0.1.1.model.tar.gz
source path.sh
# If you have processed the data and generated the manifest file, you can skip the following 2 steps
bash local/data.sh --stage -1 --stop_stage -1
@ -173,12 +166,7 @@ bash local/data.sh --stage 2 --stop_stage 2
CUDA_VISIBLE_DEVICES= ./local/test.sh conf/deepspeech2.yaml exp/deepspeech2/checkpoints/avg_1
```
The performance of the released models are shown below:
| Acoustic Model | Training Data | Token-based | Size | Descriptions | CER | WER | Hours of speech |
| :----------------------------: | :-------------: | :---------: | -----: | :------------------------------------------------- | :---- | :--- | :-------------- |
| Ds2 Online Aishell ASR0 Model | Aishell Dataset | Char-based | 345 MB | 2 Conv + 5 LSTM layers with only forward direction | 0.080 | - | 151 h |
| Ds2 Offline Aishell ASR0 Model | Aishell Dataset | Char-based | 306 MB | 2 Conv + 3 bidirectional GRU layers | 0.064 | - | 151 h |
The performance of the released models is shown in [RESULTS.md](./RESULTS.md).
## Stage 4: Static Graph Model Export
This stage transforms the dynamic graph (dygraph) model into a static graph.
```bash
@ -214,8 +202,8 @@ if [ ${stage} -le 6 ] && [ ${stop_stage} -ge 6 ]; then
```
You can train the model yourself, or download the pretrained model with the script below:
```bash
wget https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/ds2.model.tar.gz
tar xzvf ds2.model.tar.gz
wget https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_aishell_ckpt_0.1.1.model.tar.gz
tar xzvf asr0_deepspeech2_aishell_ckpt_0.1.1.model.tar.gz
```
You can download the audio demo:
```bash

@ -4,15 +4,16 @@
| Model | Number of Params | Release | Config | Test set | Valid Loss | CER |
| --- | --- | --- | --- | --- | --- | --- |
| DeepSpeech2 | 45.18M | 2.2.0 | conf/deepspeech2_online.yaml + spec aug | test | 7.994938373565674 | 0.080 |
| DeepSpeech2 | 45.18M | r0.2.0 | conf/deepspeech2_online.yaml + spec aug | test | 7.708217620849609| 0.078 |
| DeepSpeech2 | 45.18M | v2.2.0 | conf/deepspeech2_online.yaml + spec aug | test | 7.994938373565674 | 0.080 |
## Deepspeech2 Non-Streaming
| Model | Number of Params | Release | Config | Test set | Valid Loss | CER |
| --- | --- | --- | --- | --- | --- | --- |
| DeepSpeech2 | 58.4M | 2.2.0 | conf/deepspeech2.yaml + spec aug | test | 5.738585948944092 | 0.064000 |
| DeepSpeech2 | 58.4M | 2.1.0 | conf/deepspeech2.yaml + spec aug | test | 7.483316898345947 | 0.077860 |
| DeepSpeech2 | 58.4M | 2.1.0 | conf/deepspeech2.yaml | test | 7.299022197723389 | 0.078671 |
| DeepSpeech2 | 58.4M | 2.0.0 | conf/deepspeech2.yaml | test | - | 0.078977 |
| DeepSpeech2 | 58.4M | v2.2.0 | conf/deepspeech2.yaml + spec aug | test | 5.738585948944092 | 0.064000 |
| DeepSpeech2 | 58.4M | v2.1.0 | conf/deepspeech2.yaml + spec aug | test | 7.483316898345947 | 0.077860 |
| DeepSpeech2 | 58.4M | v2.1.0 | conf/deepspeech2.yaml | test | 7.299022197723389 | 0.078671 |
| DeepSpeech2 | 58.4M | v2.0.0 | conf/deepspeech2.yaml | test | - | 0.078977 |
| --- | --- | --- | --- | --- | --- | --- |
| DeepSpeech2 | 58.4M | 1.8.5 | - | test | - | 0.080447 |
| DeepSpeech2 | 58.4M | v1.8.5 | - | test | - | 0.080447 |

@ -5,7 +5,7 @@ source path.sh
gpus=0,1,2,3
stage=0
stop_stage=100
conf_path=conf/deepspeech2.yaml #conf/deepspeech2.yaml or conf/deepspeeech2_online.yaml
conf_path=conf/deepspeech2.yaml #conf/deepspeech2.yaml or conf/deepspeech2_online.yaml
decode_conf_path=conf/tuning/decode.yaml
avg_num=1
model_type=offline # offline or online

@ -143,25 +143,14 @@ avg.sh best exp/conformer/checkpoints 20
CUDA_VISIBLE_DEVICES= ./local/test.sh conf/conformer.yaml exp/conformer/checkpoints/avg_20
```
## Pretrained Model
You can get the pretrained transformer or conformer using the scripts below:
You can get the pretrained transformer or conformer from [released_model.md](../../../docs/source/released_model.md).
```bash
# Conformer:
wget https://deepspeech.bj.bcebos.com/release2.1/aishell/s1/aishell.release.tar.gz
# Chunk Conformer:
wget https://deepspeech.bj.bcebos.com/release2.1/aishell/s1/aishell.chunk.release.tar.gz
# Transformer:
wget https://paddlespeech.bj.bcebos.com/s2t/aishell/asr1/transformer.model.tar.gz
```
Use the `tar` command to unpack the model, then use the script below to test it.
For example:
```
wget https://paddlespeech.bj.bcebos.com/s2t/aishell/asr1/transformer.model.tar.gz
tar xzvf transformer.model.tar.gz
wget https://paddlespeech.bj.bcebos.com/s2t/aishell/asr1/asr1_transformer_aishell_ckpt_0.1.1.model.tar.gz
tar xzvf asr1_transformer_aishell_ckpt_0.1.1.model.tar.gz
source path.sh
# If you have processed the data and generated the manifest file, you can skip the following 2 steps
bash local/data.sh --stage -1 --stop_stage -1
@ -206,7 +195,7 @@ In some situations, you want to use the trained model to do the inference for th
```
You can train the model yourself using `bash run.sh --stage 0 --stop_stage 3`, or download the pretrained model with the script below:
```bash
wget https://paddlespeech.bj.bcebos.com/s2t/aishell/asr1/transformer.model.tar.gz
wget https://paddlespeech.bj.bcebos.com/s2t/aishell/asr1/asr1_transformer_aishell_ckpt_0.1.1.model.tar.gz
tar xzvf asr1_transformer_aishell_ckpt_0.1.1.model.tar.gz
```
You can download the audio demo:

@ -118,7 +118,7 @@ CUDA_VISIBLE_DEVICES=${gpus} ./local/voice_cloning.sh ${conf_path} ${train_outpu
```
## Pretrained Model
[tacotron2_aishell3_ckpt_vc0_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/tacotron2/tacotron2_aishell3_ckpt_vc0_0.2.0.zip)
- [tacotron2_aishell3_ckpt_vc0_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/tacotron2/tacotron2_aishell3_ckpt_vc0_0.2.0.zip)
Model | Step | eval/loss | eval/l1_loss | eval/mse_loss | eval/bce_loss| eval/attn_loss

@ -119,7 +119,7 @@ ref_audio
CUDA_VISIBLE_DEVICES=${gpus} ./local/voice_cloning.sh ${conf_path} ${train_output_path} ${ckpt_name} ${ge2e_params_path} ${ref_audio_dir}
```
## Pretrained Model
[fastspeech2_nosil_aishell3_vc1_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_aishell3_vc1_ckpt_0.5.zip)
- [fastspeech2_nosil_aishell3_vc1_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_aishell3_vc1_ckpt_0.5.zip)
Model | Step | eval/loss | eval/l1_loss | eval/duration_loss | eval/pitch_loss| eval/energy_loss
:-------------:| :------------:| :-----: | :-----: | :--------: |:--------:|:---------:

@ -137,7 +137,8 @@ optional arguments:
5. `--ngpu` is the number of GPUs to use; if `ngpu` == 0, the CPU is used.
## Pretrained Models
Pretrained models can be downloaded here [pwg_aishell3_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_aishell3_ckpt_0.5.zip).
Pretrained models can be downloaded here:
- [pwg_aishell3_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_aishell3_ckpt_0.5.zip)
Model | Step | eval/generator_loss | eval/log_stft_magnitude_loss:| eval/spectral_convergence_loss
:-------------:| :------------:| :-----: | :-----: | :--------:

@ -136,7 +136,8 @@ optional arguments:
4. `--output-dir` is the directory to save the synthesized audio files.
5. `--ngpu` is the number of GPUs to use; if `ngpu` == 0, the CPU is used.
## Pretrained Models
The pretrained model can be downloaded here [hifigan_aishell3_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_aishell3_ckpt_0.2.0.zip).
The pretrained model can be downloaded here:
- [hifigan_aishell3_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_aishell3_ckpt_0.2.0.zip)
Model | Step | eval/generator_loss | eval/mel_loss| eval/feature_matching_loss

@ -0,0 +1,62 @@
###########################################################
# AMI DATA PREPARE SETTING #
###########################################################
split_type: 'full_corpus_asr'
skip_TNO: True
# Options for mic_type: 'Mix-Lapel', 'Mix-Headset', 'Array1', 'Array1-01', 'BeamformIt'
mic_type: 'Mix-Headset'
vad_type: 'oracle'
max_subseg_dur: 3.0
overlap: 1.5
# Some more exp folders (for cleaner structure).
embedding_dir: emb #!ref <save_folder>/emb
meta_data_dir: metadata #!ref <save_folder>/metadata
ref_rttm_dir: ref_rttms #!ref <save_folder>/ref_rttms
sys_rttm_dir: sys_rttms #!ref <save_folder>/sys_rttms
der_dir: DER #!ref <save_folder>/DER
###########################################################
# FEATURE EXTRACTION SETTING #
###########################################################
# currently, we only support fbank
sr: 16000 # sample rate
n_mels: 80
window_size: 400 #25ms, sample rate 16000, 25 * 16000 / 1000 = 400
hop_size: 160 #10ms, sample rate 16000, 10 * 16000 / 1000 = 160
#left_frames: 0
#right_frames: 0
#deltas: False
###########################################################
# MODEL SETTING #
###########################################################
# currently, we only support the ecapa-tdnn model in ecapa_tdnn.yaml
# if you want to use another model, please choose another configuration yaml file
seed: 1234
emb_dim: 192
batch_size: 16
model:
input_size: 80
channels: [1024, 1024, 1024, 1024, 3072]
kernel_sizes: [5, 3, 3, 3, 1]
dilations: [1, 2, 3, 4, 1]
attention_channels: 128
lin_neurons: 192
# Will automatically download ECAPA-TDNN model (best).
###########################################################
# SPECTRAL CLUSTERING SETTING #
###########################################################
backend: 'SC' # options: 'kmeans' # Note: kmeans goes only with cos affinity
affinity: 'cos' # options: cos, nn
max_num_spkrs: 10
oracle_n_spkrs: True
###########################################################
# DER EVALUATION SETTING #
###########################################################
ignore_overlap: True
forgiveness_collar: 0.25
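To make `max_subseg_dur: 3.0` and `overlap: 1.5` above concrete: VAD segments longer than 3.0 s are cut into 3.0 s subsegments whose start times slide by 3.0 - 1.5 = 1.5 s. The sketch below illustrates that windowing under these assumptions; it is not the PaddleSpeech implementation.
```python
def split_segment(start: float, end: float,
                  max_dur: float = 3.0, overlap: float = 1.5):
    """Split [start, end] into overlapping fixed-length subsegments."""
    subsegs = []
    shift = max_dur - overlap  # how far each window advances
    t = start
    while t + max_dur < end:
        subsegs.append((t, t + max_dur))
        t += shift
    subsegs.append((t, end))  # last (possibly shorter) piece
    return subsegs

print(split_segment(0.0, 7.0))
# [(0.0, 3.0), (1.5, 4.5), (3.0, 6.0), (4.5, 7.0)]
```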

@ -0,0 +1,231 @@
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import json
import os
import pickle
import sys
import numpy as np
import paddle
from paddle.io import BatchSampler
from paddle.io import DataLoader
from tqdm.contrib import tqdm
from yacs.config import CfgNode
from paddlespeech.s2t.utils.log import Log
from paddlespeech.vector.cluster.diarization import EmbeddingMeta
from paddlespeech.vector.io.batch import batch_feature_normalize
from paddlespeech.vector.io.dataset_from_json import JSONDataset
from paddlespeech.vector.models.ecapa_tdnn import EcapaTdnn
from paddlespeech.vector.modules.sid_model import SpeakerIdetification
from paddlespeech.vector.training.seeding import seed_everything
# Logger setup
logger = Log(__name__).getlog()
def prepare_subset_json(full_meta_data, rec_id, out_meta_file):
"""Prepares metadata for a given recording ID.
Arguments
---------
full_meta_data : json
Full meta (json) containing all the recordings
rec_id : str
The recording ID for which meta (json) has to be prepared
out_meta_file : str
Path of the output meta (json) file.
"""
subset = {}
for key in full_meta_data:
k = str(key)
if k.startswith(rec_id):
subset[key] = full_meta_data[key]
with open(out_meta_file, mode="w") as json_f:
json.dump(subset, json_f, indent=2)
def create_dataloader(json_file, batch_size):
"""Creates the datasets and their data processing pipelines.
This is used for multi-mic processing.
"""
# create datasets
dataset = JSONDataset(
json_file=json_file,
feat_type='melspectrogram',
n_mels=config.n_mels,
window_size=config.window_size,
hop_length=config.hop_size)
# create dataloader
batch_sampler = BatchSampler(dataset, batch_size=batch_size, shuffle=True)
dataloader = DataLoader(dataset,
batch_sampler=batch_sampler,
collate_fn=lambda x: batch_feature_normalize(
x, mean_norm=True, std_norm=False),
return_list=True)
return dataloader
def main(args, config):
# set the training device, cpu or gpu
paddle.set_device(args.device)
# set the random seed
seed_everything(config.seed)
# stage1: build the dnn backbone model network
ecapa_tdnn = EcapaTdnn(**config.model)
# stage2: build the speaker verification eval instance with backbone model
model = SpeakerIdetification(backbone=ecapa_tdnn, num_class=1)
# stage3: load the pre-trained model
# we get the last model from the epoch and save_interval
args.load_checkpoint = os.path.abspath(
os.path.expanduser(args.load_checkpoint))
# load model checkpoint to sid model
state_dict = paddle.load(
os.path.join(args.load_checkpoint, 'model.pdparams'))
model.set_state_dict(state_dict)
logger.info(f'Checkpoint loaded from {args.load_checkpoint}')
# set the model to eval mode
model.eval()
# load meta data
meta_file = os.path.join(
args.data_dir,
config.meta_data_dir,
"ami_" + args.dataset + "." + config.mic_type + ".subsegs.json", )
with open(meta_file, "r") as f:
full_meta = json.load(f)
# get all the recording IDs in this dataset.
all_keys = full_meta.keys()
A = [word.rstrip().split("_")[0] for word in all_keys]
all_rec_ids = list(set(A[1:]))
all_rec_ids.sort()
split = "AMI_" + args.dataset
i = 1
msg = "Extra embdding for " + args.dataset + " set"
logger.info(msg)
if len(all_rec_ids) <= 0:
msg = "No recording IDs found! Please check if meta_data json file is properly generated."
logger.error(msg)
sys.exit()
# extract embeddings for different recordings in the dataset.
for rec_id in tqdm(all_rec_ids):
# This tag will be displayed in the log.
tag = ("[" + str(args.dataset) + ": " + str(i) + "/" +
str(len(all_rec_ids)) + "]")
i = i + 1
# log message.
msg = "Embdding %s : %s " % (tag, rec_id)
logger.debug(msg)
# embedding directory.
if not os.path.exists(
os.path.join(args.data_dir, config.embedding_dir, split)):
os.makedirs(
os.path.join(args.data_dir, config.embedding_dir, split))
# file to store embeddings.
emb_file_name = rec_id + "." + config.mic_type + ".emb_stat.pkl"
diary_stat_emb_file = os.path.join(args.data_dir, config.embedding_dir,
split, emb_file_name)
# prepare a metadata (json) for one recording. This is basically a subset of full_meta.
# let's keep this meta-info in the embedding directory itself.
json_file_name = rec_id + "." + config.mic_type + ".json"
meta_per_rec_file = os.path.join(args.data_dir, config.embedding_dir,
split, json_file_name)
# write subset (meta for one recording) json metadata.
prepare_subset_json(full_meta, rec_id, meta_per_rec_file)
# prepare data loader.
diary_set_loader = create_dataloader(meta_per_rec_file,
config.batch_size)
# extract embeddings (skip if already done).
if not os.path.isfile(diary_stat_emb_file):
logger.debug("Extracting deep embeddings")
embeddings = np.empty(shape=[0, config.emb_dim], dtype=np.float64)
segset = []
for batch_idx, batch in enumerate(tqdm(diary_set_loader)):
# extract the audio embedding
ids, feats, lengths = batch['ids'], batch['feats'], batch[
'lengths']
seg = [x for x in ids]
segset = segset + seg
emb = model.backbone(feats, lengths).squeeze(
-1).numpy() # (N, emb_size, 1) -> (N, emb_size)
embeddings = np.concatenate((embeddings, emb), axis=0)
segset = np.array(segset, dtype="|O")
stat_obj = EmbeddingMeta(
segset=segset,
stats=embeddings, )
logger.debug("Saving Embeddings...")
with open(diary_stat_emb_file, "wb") as output:
pickle.dump(stat_obj, output)
else:
logger.debug("Skipping embedding extraction (as already present).")
# Begin experiment!
if __name__ == "__main__":
parser = argparse.ArgumentParser(__doc__)
parser.add_argument(
'--device',
default="gpu",
help="Select which device to perform diarization, defaults to gpu.")
parser.add_argument(
"--config", default=None, type=str, help="configuration file")
parser.add_argument(
"--data-dir",
default="../save/",
type=str,
help="processsed data directory")
parser.add_argument(
"--dataset",
choices=['dev', 'eval'],
default="dev",
type=str,
help="Select which dataset to extra embdding, defaults to dev")
parser.add_argument(
"--load-checkpoint",
type=str,
default='',
help="Directory to load model checkpoint to compute embeddings.")
args = parser.parse_args()
config = CfgNode(new_allowed=True)
if args.config:
config.merge_from_file(args.config)
config.freeze()
main(args, config)
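After the script above finishes, each recording has a `<rec_id>.<mic_type>.emb_stat.pkl` file holding an `EmbeddingMeta` whose `segset` lists the subsegment IDs and whose `stats` holds their embeddings. A quick sketch for inspecting one such file; the path is hypothetical, and PaddleSpeech must be importable so that pickle can resolve the `EmbeddingMeta` class.
```python
import pickle

# Hypothetical output path; the script writes
# <data_dir>/<embedding_dir>/AMI_<dataset>/<rec_id>.<mic_type>.emb_stat.pkl
path = "save/emb/AMI_dev/ES2011a.Mix-Headset.emb_stat.pkl"
with open(path, "rb") as f:
    stat_obj = pickle.load(f)  # an EmbeddingMeta instance

print(len(stat_obj.segset))  # number of subsegments in the recording
print(stat_obj.stats.shape)  # (num_subsegments, emb_dim), emb_dim=192 here
```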

@ -1,49 +0,0 @@
#!/bin/bash
stage=1
TARGET_DIR=${MAIN_ROOT}/dataset/ami
data_folder=${TARGET_DIR}/amicorpus #e.g., /path/to/amicorpus/
manual_annot_folder=${TARGET_DIR}/ami_public_manual_1.6.2 #e.g., /path/to/ami_public_manual_1.6.2/
save_folder=${MAIN_ROOT}/examples/ami/sd0/data
ref_rttm_dir=${save_folder}/ref_rttms
meta_data_dir=${save_folder}/metadata
set=L
. ${MAIN_ROOT}/utils/parse_options.sh || exit 1;
set -u
set -o pipefail
mkdir -p ${save_folder}
if [ ${stage} -le 0 ]; then
# Download the AMI corpus. You need around 10 GB of free space for the whole dataset
# The signals are too large to package in this way,
# so you need to use the chooser to indicate which ones you wish to download
echo "Please follow https://groups.inf.ed.ac.uk/ami/download/ to download the data."
echo "Annotations: AMI manual annotations v1.6.2 "
echo "Signals: "
echo "1) Select one or more AMI meetings: the IDs please follow ./ami_split.py"
echo "2) Select media streams: Just select Headset mix"
exit 0;
fi
if [ ${stage} -le 1 ]; then
echo "AMI Data preparation"
python local/ami_prepare.py --data_folder ${data_folder} \
--manual_annot_folder ${manual_annot_folder} \
--save_folder ${save_folder} --ref_rttm_dir ${ref_rttm_dir} \
--meta_data_dir ${meta_data_dir}
if [ $? -ne 0 ]; then
echo "Prepare AMI failed. Please check log message."
exit 1
fi
fi
echo "AMI data preparation done."
exit 0

@ -0,0 +1,428 @@
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import glob
import json
import os
import pickle
import shutil
import sys
import numpy as np
from tqdm.contrib import tqdm
from yacs.config import CfgNode
from paddlespeech.s2t.utils.log import Log
from paddlespeech.vector.cluster import diarization as diar
from utils.DER import DER
# Logger setup
logger = Log(__name__).getlog()
def diarize_dataset(
full_meta,
split_type,
n_lambdas,
pval,
save_dir,
config,
n_neighbors=10, ):
"""This function diarizes all the recordings in a given dataset. It performs
computation of embedding and clusters them using spectral clustering (or other backends).
The output speaker boundary file is stored in the RTTM format.
"""
# prepare `spkr_info` only once when Oracle num of speakers is selected.
# spkr_info is essential to obtain number of speakers from groundtruth.
if config.oracle_n_spkrs is True:
full_ref_rttm_file = os.path.join(save_dir, config.ref_rttm_dir,
"fullref_ami_" + split_type + ".rttm")
rttm = diar.read_rttm(full_ref_rttm_file)
spkr_info = list( # noqa F841
filter(lambda x: x.startswith("SPKR-INFO"), rttm))
# get all the recording IDs in this dataset.
all_keys = full_meta.keys()
A = [word.rstrip().split("_")[0] for word in all_keys]
all_rec_ids = list(set(A[1:]))
all_rec_ids.sort()
split = "AMI_" + split_type
i = 1
# adding tag for directory path.
type_of_num_spkr = "oracle" if config.oracle_n_spkrs else "est"
tag = (type_of_num_spkr + "_" + str(config.affinity) + "_" + config.backend)
# make out rttm dir
out_rttm_dir = os.path.join(save_dir, config.sys_rttm_dir, config.mic_type,
split, tag)
if not os.path.exists(out_rttm_dir):
os.makedirs(out_rttm_dir)
# diarizing different recordings in a dataset.
for rec_id in tqdm(all_rec_ids):
# this tag will be displayed in the log.
tag = ("[" + str(split_type) + ": " + str(i) + "/" +
str(len(all_rec_ids)) + "]")
i = i + 1
# log message.
msg = "Diarizing %s : %s " % (tag, rec_id)
logger.debug(msg)
# load embeddings.
emb_file_name = rec_id + "." + config.mic_type + ".emb_stat.pkl"
diary_stat_emb_file = os.path.join(save_dir, config.embedding_dir,
split, emb_file_name)
if not os.path.isfile(diary_stat_emb_file):
msg = "Embdding file %s not found! Please check if embdding file is properly generated." % (
diary_stat_emb_file)
logger.error(msg)
sys.exit()
with open(diary_stat_emb_file, "rb") as in_file:
diary_obj = pickle.load(in_file)
out_rttm_file = out_rttm_dir + "/" + rec_id + ".rttm"
# processing starts from here.
if config.oracle_n_spkrs is True:
# oracle num of speakers.
num_spkrs = diar.get_oracle_num_spkrs(rec_id, spkr_info)
else:
if config.affinity == "nn":
# num of speakers tunned on dev set (only for nn affinity).
num_spkrs = n_lambdas
else:
# num of speakers will be estimated using max eigen gap for cos based affinity.
# so adding None here. Will use this None later-on.
num_spkrs = None
if config.backend == "kmeans":
diar.do_kmeans_clustering(
diary_obj,
out_rttm_file,
rec_id,
num_spkrs,
pval, )
if config.backend == "SC":
# go for Spectral Clustering (SC).
diar.do_spec_clustering(
diary_obj,
out_rttm_file,
rec_id,
num_spkrs,
pval,
config.affinity,
n_neighbors, )
# can be used for AHC later. Likewise, one can add different backends here.
if config.backend == "AHC":
# call AHC
threshold = pval # pval for AHC is nothing but threshold.
diar.do_AHC(diary_obj, out_rttm_file, rec_id, num_spkrs, threshold)
# once all RTTM outputs are generated, concatenate individual RTTM files to obtain single RTTM file.
# this is not needed but just staying with the standards.
concate_rttm_file = out_rttm_dir + "/sys_output.rttm"
logger.debug("Concatenating individual RTTM files...")
with open(concate_rttm_file, "w") as cat_file:
for f in glob.glob(out_rttm_dir + "/*.rttm"):
if f == concate_rttm_file:
continue
with open(f, "r") as indi_rttm_file:
shutil.copyfileobj(indi_rttm_file, cat_file)
msg = "The system generated RTTM file for %s set : %s" % (
split_type, concate_rttm_file, )
logger.debug(msg)
return concate_rttm_file
def dev_pval_tuner(full_meta, save_dir, config):
"""Tuning p_value for affinity matrix.
The p_value used so that only p% of the values in each row is retained.
"""
DER_list = []
prange = np.arange(0.002, 0.015, 0.001)
n_lambdas = None # using it as flag later.
for p_v in prange:
# Process whole dataset for value of p_v.
concate_rttm_file = diarize_dataset(full_meta, "dev", n_lambdas, p_v,
save_dir, config)
ref_rttm_file = os.path.join(save_dir, config.ref_rttm_dir,
"fullref_ami_dev.rttm")
sys_rttm_file = concate_rttm_file
[MS, FA, SER, DER_] = DER(
ref_rttm_file,
sys_rttm_file,
config.ignore_overlap,
config.forgiveness_collar, )
DER_list.append(DER_)
if config.oracle_n_spkrs is True and config.backend == "kmeans":
# no need of p_val search. Note p_val is needed for SC for both oracle and est num of speakers.
# p_val is needed in oracle_n_spkr=False when using kmeans backend.
break
# Take the p_val that gave the minimum DER on the dev dataset.
tuned_p_val = prange[DER_list.index(min(DER_list))]
return tuned_p_val
def dev_ahc_threshold_tuner(full_meta, save_dir, config):
"""Tuning threshold for affinity matrix. This function is called when AHC is used as backend.
"""
DER_list = []
prange = np.arange(0.0, 1.0, 0.1)
n_lambdas = None # using it as flag later.
# Note: p_val is threshold in case of AHC.
for p_v in prange:
# Process whole dataset for value of p_v.
concate_rttm_file = diarize_dataset(full_meta, "dev", n_lambdas, p_v,
save_dir, config)
ref_rttm = os.path.join(save_dir, config.ref_rttm_dir,
"fullref_ami_dev.rttm")
sys_rttm = concate_rttm_file
[MS, FA, SER, DER_] = DER(
ref_rttm,
sys_rttm,
config.ignore_overlap,
config.forgiveness_collar, )
DER_list.append(DER_)
if config.oracle_n_spkrs is True:
break # no need of threshold search.
# Take the p_val that gave the minimum DER on the dev dataset.
tuned_p_val = prange[DER_list.index(min(DER_list))]
return tuned_p_val
def dev_nn_tuner(full_meta, split_type, save_dir, config):
"""Tuning n_neighbors on dev set. Assuming oracle num of speakers.
This is used when nn based affinity is selected.
"""
DER_list = []
pval = None
# Now assuming oracle num of speakers.
n_lambdas = 4
for nn in range(5, 15):
# Process whole dataset for value of n_lambdas.
concate_rttm_file = diarize_dataset(full_meta, "dev", n_lambdas, pval,
save_dir, config, nn)
ref_rttm = os.path.join(save_dir, config.ref_rttm_dir,
"fullref_ami_dev.rttm")
sys_rttm = concate_rttm_file
[MS, FA, SER, DER_] = DER(
ref_rttm,
sys_rttm,
config.ignore_overlap,
config.forgiveness_collar, )
DER_list.append([nn, DER_])
if config.oracle_n_spkrs is True and config.backend == "kmeans":
break
DER_list.sort(key=lambda x: x[1])
tuned_nn = DER_list[0]
return tuned_nn[0]
def dev_tuner(full_meta, split_type, save_dir, config):
"""Tuning n_components on dev set. Used for nn based affinity matrix.
Note: This is a very basic tuning for nn based affinity.
This is a work in progress until we find a better way.
"""
DER_list = []
pval = None
for n_lambdas in range(1, config.max_num_spkrs + 1):
# Process whole dataset for value of n_lambdas.
concate_rttm_file = diarize_dataset(full_meta, "dev", n_lambdas, pval,
save_dir, config)
ref_rttm = os.path.join(save_dir, config.ref_rttm_dir,
"fullref_ami_dev.rttm")
sys_rttm = concate_rttm_file
[MS, FA, SER, DER_] = DER(
ref_rttm,
sys_rttm,
config.ignore_overlap,
config.forgiveness_collar, )
DER_list.append(DER_)
# Take the n_lambdas with minimum DER.
tuned_n_lambdas = DER_list.index(min(DER_list)) + 1
return tuned_n_lambdas
def main(args, config):
# AMI Dev Set: Tune hyperparams on dev set.
# Read the embedding file for the dev set generated during embedding computation
dev_meta_file = os.path.join(
args.data_dir,
config.meta_data_dir,
"ami_dev." + config.mic_type + ".subsegs.json", )
with open(dev_meta_file, "r") as f:
meta_dev = json.load(f)
full_meta = meta_dev
# Processing starts from here
# The following lines select options for different backends and affinity matrices, and find the best hyperparameter values using the dev set.
ref_rttm_file = os.path.join(args.data_dir, config.ref_rttm_dir,
"fullref_ami_dev.rttm")
best_nn = None
if config.affinity == "nn":
logger.info("Tuning for nn (Multiple iterations over AMI Dev set)")
best_nn = dev_nn_tuner(full_meta, "dev", args.data_dir, config)
n_lambdas = None
best_pval = None
if config.affinity == "cos" and (config.backend == "SC" or
config.backend == "kmeans"):
# oracle num_spkrs or not, doesn't matter for kmeans and SC backends
# cos: Tune for the best pval for SC /kmeans (for unknown num of spkrs)
logger.info(
"Tuning for p-value for SC (Multiple iterations over AMI Dev set)")
best_pval = dev_pval_tuner(full_meta, args.data_dir, config)
elif config.backend == "AHC":
logger.info("Tuning for threshold-value for AHC")
best_threshold = dev_ahc_threshold_tuner(full_meta, args.data_dir,
config)
best_pval = best_threshold
else:
# NN for unknown num of speakers (can be used in future)
if config.oracle_n_spkrs is False:
# nn: Tune the number of components (to be updated later)
logger.info(
"Tuning for number of eigen components for NN (Multiple iterations over AMI Dev set)"
)
# dev_tuner used for tuning num of components in NN. Can be used in future.
n_lambdas = dev_tuner(full_meta, "dev", args.data_dir, config)
# load 'dev' and 'eval' metadata files.
full_meta_dev = full_meta # current full_meta is for 'dev'
eval_meta_file = os.path.join(
args.data_dir,
config.meta_data_dir,
"ami_eval." + config.mic_type + ".subsegs.json", )
with open(eval_meta_file, "r") as f:
full_meta_eval = json.load(f)
# tag to be appended to final output DER files. Writing DER for individual files.
type_of_num_spkr = "oracle" if config.oracle_n_spkrs else "est"
tag = (
type_of_num_spkr + "_" + str(config.affinity) + "." + config.mic_type)
# perform final diarization on 'dev' and 'eval' with best hyperparams.
final_DERs = {}
out_der_dir = os.path.join(args.data_dir, config.der_dir)
if not os.path.exists(out_der_dir):
os.makedirs(out_der_dir)
for split_type in ["dev", "eval"]:
if split_type == "dev":
full_meta = full_meta_dev
else:
full_meta = full_meta_eval
# performing diarization.
msg = "Diarizing using best hyperparams: " + split_type + " set"
logger.info(msg)
out_boundaries = diarize_dataset(
full_meta,
split_type,
n_lambdas=n_lambdas,
pval=best_pval,
n_neighbors=best_nn,
save_dir=args.data_dir,
config=config)
# computing DER.
msg = "Computing DERs for " + split_type + " set"
logger.info(msg)
ref_rttm = os.path.join(args.data_dir, config.ref_rttm_dir,
"fullref_ami_" + split_type + ".rttm")
sys_rttm = out_boundaries
[MS, FA, SER, DER_vals] = DER(
ref_rttm,
sys_rttm,
config.ignore_overlap,
config.forgiveness_collar,
individual_file_scores=True, )
# writing DER values to a file. Append tag.
der_file_name = split_type + "_DER_" + tag
out_der_file = os.path.join(out_der_dir, der_file_name)
msg = "Writing DER file to: " + out_der_file
logger.info(msg)
diar.write_ders_file(ref_rttm, DER_vals, out_der_file)
msg = ("AMI " + split_type + " set DER = %s %%\n" %
(str(round(DER_vals[-1], 2))))
logger.info(msg)
final_DERs[split_type] = round(DER_vals[-1], 2)
# final print DERs
msg = (
"Final Diarization Error Rate (%%) on AMI corpus: Dev = %s %% | Eval = %s %%\n"
% (str(final_DERs["dev"]), str(final_DERs["eval"])))
logger.info(msg)
if __name__ == "__main__":
parser = argparse.ArgumentParser(__doc__)
parser.add_argument(
"--config", default=None, type=str, help="configuration file")
parser.add_argument(
"--data-dir",
default="../data/",
type=str,
help="processsed data directory")
args = parser.parse_args()
config = CfgNode(new_allowed=True)
if args.config:
config.merge_from_file(args.config)
config.freeze()
main(args, config)
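When `affinity == "cos"` and the number of speakers is unknown, `diarize_dataset` above relies on the maximum eigen gap to estimate the speaker count. The sketch below shows the general idea with plain NumPy; it is an assumption about the technique, not the code in `paddlespeech.vector.cluster.diarization`.
```python
import numpy as np

def estimate_num_spkrs(embeddings: np.ndarray, max_spkrs: int = 10) -> int:
    """Estimate the speaker count from the largest Laplacian eigen gap."""
    # Cosine affinity between all pairs; negative similarities clipped to 0.
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    affinity = np.clip(normed @ normed.T, 0.0, None)
    # Unnormalized graph Laplacian L = D - A.
    laplacian = np.diag(affinity.sum(axis=1)) - affinity
    eigvals = np.linalg.eigvalsh(laplacian)  # ascending order
    # The largest gap among the smallest eigenvalues marks the cluster count.
    gaps = np.diff(eigvals[:max_spkrs + 1])
    return int(np.argmax(gaps) + 1)

# Two synthetic "speakers": two well-separated blobs of 192-dim embeddings.
rng = np.random.default_rng(0)
emb = np.vstack([rng.normal(1.0, 0.1, (20, 192)),
                 rng.normal(-1.0, 0.1, (20, 192))])
print(estimate_num_spkrs(emb))  # expected: 2
```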

@ -0,0 +1,49 @@
#!/bin/bash
stage=0
set=L
. ${MAIN_ROOT}/utils/parse_options.sh || exit 1;
set -o pipefail
data_folder=$1
manual_annot_folder=$2
save_folder=$3
pretrained_model_dir=$4
conf_path=$5
device=$6
ref_rttm_dir=${save_folder}/ref_rttms
meta_data_dir=${save_folder}/metadata
if [ ${stage} -le 0 ]; then
echo "AMI Data preparation"
python local/ami_prepare.py --data_folder ${data_folder} \
--manual_annot_folder ${manual_annot_folder} \
--save_folder ${save_folder} --ref_rttm_dir ${ref_rttm_dir} \
--meta_data_dir ${meta_data_dir}
if [ $? -ne 0 ]; then
echo "Prepare AMI failed. Please check log message."
exit 1
fi
echo "AMI data preparation done."
fi
if [ ${stage} -le 1 ]; then
# extract embeddings for the dev and eval datasets
for name in dev eval; do
python local/compute_embdding.py --config ${conf_path} \
--data-dir ${save_folder} \
--device ${device} \
--dataset ${name} \
--load-checkpoint ${pretrained_model_dir}
done
fi
if [ ${stage} -le 2 ]; then
# tune hyperparams on dev set
# perform final diarization on 'dev' and 'eval' with best hyperparams
python local/experiment.py --config ${conf_path} \
--data-dir ${save_folder}
fi

@ -1,14 +1,46 @@
#!/bin/bash
. path.sh || exit 1;
. ./path.sh || exit 1;
set -e
stage=1
stage=0
#TARGET_DIR=${MAIN_ROOT}/dataset/ami
TARGET_DIR=/home/dataset/AMI
data_folder=${TARGET_DIR}/amicorpus #e.g., /path/to/amicorpus/
manual_annot_folder=${TARGET_DIR}/ami_public_manual_1.6.2 #e.g., /path/to/ami_public_manual_1.6.2/
save_folder=./save
pretrained_model_dir=${save_folder}/sv0_ecapa_tdnn_voxceleb12_ckpt_0_1_1/model
conf_path=conf/ecapa_tdnn.yaml
device=gpu
. ${MAIN_ROOT}/utils/parse_options.sh || exit 1;
if [ ${stage} -le 1 ]; then
# prepare data
bash ./local/data.sh || exit -1
fi
if [ $stage -le 0 ]; then
# Prepare data
# Download the AMI corpus. You need around 10 GB of free space for the whole dataset
# The signals are too large to package in this way,
# so you need to use the chooser to indicate which ones you wish to download
echo "Please follow https://groups.inf.ed.ac.uk/ami/download/ to download the data."
echo "Annotations: AMI manual annotations v1.6.2 "
echo "Signals: "
echo "1) Select one or more AMI meetings: the IDs please follow ./ami_split.py"
echo "2) Select media streams: Just select Headset mix"
fi
if [ $stage -le 1 ]; then
# Download the pretrained model
wget https://paddlespeech.bj.bcebos.com/vector/voxceleb/sv0_ecapa_tdnn_voxceleb12_ckpt_0_1_1.tar.gz
mkdir -p ${save_folder} && tar -xvf sv0_ecapa_tdnn_voxceleb12_ckpt_0_1_1.tar.gz -C ${save_folder}
rm -rf sv0_ecapa_tdnn_voxceleb12_ckpt_0_1_1.tar.gz
echo "download the pretrained ECAPA-TDNN Model to path: "${pretraind_model_dir}
fi
if [ $stage -le 2 ]; then
# Tune hyperparams on dev set and perform final diarization on dev and eval with best hyperparams.
echo ${data_folder} ${manual_annot_folder} ${save_folder} ${pretrained_model_dir} ${conf_path}
bash ./local/process.sh ${data_folder} ${manual_annot_folder} \
${save_folder} ${pretrained_model_dir} ${conf_path} ${device} || exit 1
fi
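For reference, the DER reported by stage 2 decomposes into missed speech, false alarm, and speaker error, each normalized by the total scored reference speech time (see `utils/DER.py` for the real implementation). A toy sketch of that arithmetic with made-up numbers:
```python
def der_percent(missed: float, false_alarm: float, speaker_error: float,
                total_ref_speech: float) -> float:
    """DER as a percentage of the total scored reference speech time."""
    return 100.0 * (missed + false_alarm + speaker_error) / total_ref_speech

# Toy values: seconds of missed speech, false alarm, and speaker confusion
# against 60 s of scored reference speech.
print(der_percent(1.2, 0.8, 2.5, 60.0))  # 7.5
```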

@ -212,7 +212,8 @@ optional arguments:
Pretrained Tacotron2 model with no silence at the edges of audios:
- [tacotron2_csmsc_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/tacotron2/tacotron2_csmsc_ckpt_0.2.0.zip)
The static model can be downloaded here [tacotron2_csmsc_static_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/tacotron2/tacotron2_csmsc_static_0.2.0.zip).
The static model can be downloaded here:
- [tacotron2_csmsc_static_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/tacotron2/tacotron2_csmsc_static_0.2.0.zip)
Model | Step | eval/loss | eval/l1_loss | eval/mse_loss | eval/bce_loss| eval/attn_loss

@ -27,20 +27,8 @@ if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
--phones_dict=dump/phone_id_map.txt
fi
# style melgan
# style melgan's Dygraph to Static Graph is not ready now
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
python3 ${BIN_DIR}/../inference.py \
--inference_dir=${train_output_path}/inference \
--am=tacotron2_csmsc \
--voc=style_melgan_csmsc \
--text=${BIN_DIR}/../sentences.txt \
--output_dir=${train_output_path}/pd_infer_out \
--phones_dict=dump/phone_id_map.txt
fi
# hifigan
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
python3 ${BIN_DIR}/../inference.py \
--inference_dir=${train_output_path}/inference \
--am=tacotron2_csmsc \

@ -221,21 +221,30 @@ CUDA_VISIBLE_DEVICES=${gpus} ./local/inference.sh ${train_output_path}
```
## Pretrained Model
Pretrained SpeedySpeech model with no silence in the edge of audios[speedyspeech_nosil_baker_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/speedyspeech/speedyspeech_nosil_baker_ckpt_0.5.zip).
Pretrained SpeedySpeech model with no silence at the edges of audios:
- [speedyspeech_nosil_baker_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/speedyspeech/speedyspeech_nosil_baker_ckpt_0.5.zip)
- [speedyspeech_csmsc_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/speedyspeech/speedyspeech_csmsc_ckpt_0.2.0.zip)
The static model can be downloaded here:
- [speedyspeech_nosil_baker_static_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/speedyspeech/speedyspeech_nosil_baker_static_0.5.zip)
- [speedyspeech_csmsc_static_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/speedyspeech/speedyspeech_csmsc_static_0.2.0.zip)
The ONNX model can be downloaded here:
- [speedyspeech_csmsc_onnx_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/speedyspeech/speedyspeech_csmsc_onnx_0.2.0.zip)
The static model can be downloaded here [speedyspeech_nosil_baker_static_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/speedyspeech/speedyspeech_nosil_baker_static_0.5.zip).
Model | Step | eval/loss | eval/l1_loss | eval/duration_loss | eval/ssim_loss
:-------------:| :------------:| :-----: | :-----: | :--------:|:--------:
default| 1(gpu) x 11400|0.83655|0.42324|0.03211| 0.38119
default| 1(gpu) x 11400|0.79532|0.400246|0.030259| 0.36482
The SpeedySpeech checkpoint contains the files listed below.
```text
speedyspeech_nosil_baker_ckpt_0.5
speedyspeech_csmsc_ckpt_0.2.0
├── default.yaml # default config used to train speedyspeech
├── feats_stats.npy # statistics used to normalize spectrogram when training speedyspeech
├── phone_id_map.txt # phone vocabulary file when training speedyspeech
├── snapshot_iter_11400.pdz # model parameters and optimizer states
├── snapshot_iter_30600.pdz # model parameters and optimizer states
└── tone_id_map.txt # tone vocabulary file when training speedyspeech
```
You can use the following scripts to synthesize speech from `${BIN_DIR}/../sentences.txt` using pretrained speedyspeech and parallel wavegan models.
@ -246,9 +255,9 @@ FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize_e2e.py \
--am=speedyspeech_csmsc \
--am_config=speedyspeech_nosil_baker_ckpt_0.5/default.yaml \
--am_ckpt=speedyspeech_nosil_baker_ckpt_0.5/snapshot_iter_11400.pdz \
--am_stat=speedyspeech_nosil_baker_ckpt_0.5/feats_stats.npy \
--am_config=speedyspeech_csmsc_ckpt_0.2.0/default.yaml \
--am_ckpt=speedyspeech_csmsc_ckpt_0.2.0/snapshot_iter_30600.pdz \
--am_stat=speedyspeech_csmsc_ckpt_0.2.0/feats_stats.npy \
--voc=pwgan_csmsc \
--voc_config=pwg_baker_ckpt_0.4/pwg_default.yaml \
--voc_ckpt=pwg_baker_ckpt_0.4/pwg_snapshot_iter_400000.pdz \
@ -257,6 +266,6 @@ python3 ${BIN_DIR}/../synthesize_e2e.py \
--text=${BIN_DIR}/../sentences.txt \
--output_dir=exp/default/test_e2e \
--inference_dir=exp/default/inference \
--phones_dict=speedyspeech_nosil_baker_ckpt_0.5/phone_id_map.txt \
--tones_dict=speedyspeech_nosil_baker_ckpt_0.5/tone_id_map.txt
--phones_dict=speedyspeech_csmsc_ckpt_0.2.0/phone_id_map.txt \
--tones_dict=speedyspeech_csmsc_ckpt_0.2.0/tone_id_map.txt
```

@ -30,21 +30,8 @@ if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
--tones_dict=dump/tone_id_map.txt
fi
# style melgan
# style melgan's Dygraph to Static Graph is not ready now
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
python3 ${BIN_DIR}/../inference.py \
--inference_dir=${train_output_path}/inference \
--am=speedyspeech_csmsc \
--voc=style_melgan_csmsc \
--text=${BIN_DIR}/../sentences.txt \
--output_dir=${train_output_path}/pd_infer_out \
--phones_dict=dump/phone_id_map.txt \
--tones_dict=dump/tone_id_map.txt
fi
# hifigan
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
python3 ${BIN_DIR}/../inference.py \
--inference_dir=${train_output_path}/inference \
--am=speedyspeech_csmsc \

@ -0,0 +1,32 @@
train_output_path=$1
stage=0
stop_stage=0
# only default_fastspeech2/speedyspeech + hifigan/mb_melgan are supported now!
# synthesize from metadata
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
python3 ${BIN_DIR}/../ort_predict.py \
--inference_dir=${train_output_path}/inference_onnx \
--am=speedyspeech_csmsc \
--voc=hifigan_csmsc \
--test_metadata=dump/test/norm/metadata.jsonl \
--output_dir=${train_output_path}/onnx_infer_out \
--device=cpu \
--cpu_threads=2
fi
# e2e, synthesize from text
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
python3 ${BIN_DIR}/../ort_predict_e2e.py \
--inference_dir=${train_output_path}/inference_onnx \
--am=speedyspeech_csmsc \
--voc=hifigan_csmsc \
--output_dir=${train_output_path}/onnx_infer_out_e2e \
--text=${BIN_DIR}/../csmsc_test.txt \
--phones_dict=dump/phone_id_map.txt \
--tones_dict=dump/tone_id_map.txt \
--device=cpu \
--cpu_threads=2
fi

@ -0,0 +1 @@
../../tts3/local/paddle2onnx.sh

@ -40,3 +40,25 @@ if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
# inference with static model
CUDA_VISIBLE_DEVICES=${gpus} ./local/inference.sh ${train_output_path} || exit -1
fi
# paddle2onnx, please make sure the static models are in ${train_output_path}/inference first
# we have only tested the following models so far
if [ ${stage} -le 5 ] && [ ${stop_stage} -ge 5 ]; then
# install paddle2onnx
version=$(echo `pip list |grep "paddle2onnx"` |awk -F" " '{print $2}')
if [[ -z "$version" || ${version} != '0.9.4' ]]; then
pip install paddle2onnx==0.9.4
fi
./local/paddle2onnx.sh ${train_output_path} inference inference_onnx speedyspeech_csmsc
./local/paddle2onnx.sh ${train_output_path} inference inference_onnx hifigan_csmsc
fi
# inference with onnxruntime, use fastspeech2 + hifigan by default
if [ ${stage} -le 6 ] && [ ${stop_stage} -ge 6 ]; then
# install onnxruntime
version=$(echo `pip list |grep "onnxruntime"` |awk -F" " '{print $2}')
if [[ -z "$version" || ${version} != '1.10.0' ]]; then
pip install onnxruntime==1.10.0
fi
./local/ort_predict.sh ${train_output_path}
fi

@ -231,11 +231,19 @@ Pretrained FastSpeech2 model with no silence in the edge of audios:
The static model can be downloaded here:
- [fastspeech2_nosil_baker_static_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_baker_static_0.4.zip)
- [fastspeech2_csmsc_static_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_csmsc_static_0.2.0.zip)
- [fastspeech2_cnndecoder_csmsc_static_1.0.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_cnndecoder_csmsc_static_1.0.0.zip)
- [fastspeech2_cnndecoder_csmsc_streaming_static_1.0.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_cnndecoder_csmsc_streaming_static_1.0.0.zip)
The ONNX model can be downloaded here:
- [fastspeech2_csmsc_onnx_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_csmsc_onnx_0.2.0.zip)
- [fastspeech2_cnndecoder_csmsc_onnx_1.0.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_cnndecoder_csmsc_onnx_1.0.0.zip)
- [fastspeech2_cnndecoder_csmsc_streaming_onnx_1.0.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_cnndecoder_csmsc_streaming_onnx_1.0.0.zip)
Model | Step | eval/loss | eval/l1_loss | eval/duration_loss | eval/pitch_loss| eval/energy_loss
:-------------:| :------------:| :-----: | :-----: | :--------: |:--------:|:---------:
default| 2(gpu) x 76000|1.0991|0.59132|0.035815|0.31915|0.15287|
conformer| 2(gpu) x 76000|1.0675|0.56103|0.035869|0.31553|0.15509|
cnndecoder| 1(gpu) x 153000|1.1153|0.61475|0.03380|0.30414|0.14707|
FastSpeech2 checkpoint contains files listed below.
```text

@ -5,6 +5,7 @@ train_output_path=$1
stage=0
stop_stage=0
# pwgan
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
python3 ${BIN_DIR}/../inference.py \
--inference_dir=${train_output_path}/inference \
@ -27,20 +28,8 @@ if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
--phones_dict=dump/phone_id_map.txt
fi
# style melgan
# style melgan's Dygraph to Static Graph is not ready now
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
python3 ${BIN_DIR}/../inference.py \
--inference_dir=${train_output_path}/inference \
--am=fastspeech2_csmsc \
--voc=style_melgan_csmsc \
--text=${BIN_DIR}/../sentences.txt \
--output_dir=${train_output_path}/pd_infer_out \
--phones_dict=dump/phone_id_map.txt
fi
# hifigan
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
python3 ${BIN_DIR}/../inference.py \
--inference_dir=${train_output_path}/inference \
--am=fastspeech2_csmsc \
@ -51,7 +40,7 @@ if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
fi
# wavernn
if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
python3 ${BIN_DIR}/../inference.py \
--inference_dir=${train_output_path}/inference \
--am=fastspeech2_csmsc \

@ -0,0 +1,47 @@
#!/bin/bash
train_output_path=$1
stage=0
stop_stage=0
# pwgan
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
python3 ${BIN_DIR}/../inference_streaming.py \
--inference_dir=${train_output_path}/inference_streaming \
--am=fastspeech2_csmsc \
--am_stat=dump/train/speech_stats.npy \
--voc=pwgan_csmsc \
--text=${BIN_DIR}/../sentences.txt \
--output_dir=${train_output_path}/pd_infer_out_streaming \
--phones_dict=dump/phone_id_map.txt \
--am_streaming=True
fi
# for more GAN Vocoders
# multi band melgan
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
python3 ${BIN_DIR}/../inference_streaming.py \
--inference_dir=${train_output_path}/inference_streaming \
--am=fastspeech2_csmsc \
--am_stat=dump/train/speech_stats.npy \
--voc=mb_melgan_csmsc \
--text=${BIN_DIR}/../sentences.txt \
--output_dir=${train_output_path}/pd_infer_out_streaming \
--phones_dict=dump/phone_id_map.txt \
--am_streaming=True
fi
# hifigan
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
python3 ${BIN_DIR}/../inference_streaming.py \
--inference_dir=${train_output_path}/inference_streaming \
--am=fastspeech2_csmsc \
--am_stat=dump/train/speech_stats.npy \
--voc=hifigan_csmsc \
--text=${BIN_DIR}/../sentences.txt \
--output_dir=${train_output_path}/pd_infer_out_streaming \
--phones_dict=dump/phone_id_map.txt \
--am_streaming=True
fi

@ -0,0 +1,31 @@
train_output_path=$1
stage=0
stop_stage=0
# only default_fastspeech2/speedyspeech + hifigan/mb_melgan are supported now!
# synthesize from metadata
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
python3 ${BIN_DIR}/../ort_predict.py \
--inference_dir=${train_output_path}/inference_onnx \
--am=fastspeech2_csmsc \
--voc=hifigan_csmsc \
--test_metadata=dump/test/norm/metadata.jsonl \
--output_dir=${train_output_path}/onnx_infer_out \
--device=cpu \
--cpu_threads=2
fi
# e2e, synthesize from text
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
python3 ${BIN_DIR}/../ort_predict_e2e.py \
--inference_dir=${train_output_path}/inference_onnx \
--am=fastspeech2_csmsc \
--voc=hifigan_csmsc \
--output_dir=${train_output_path}/onnx_infer_out_e2e \
--text=${BIN_DIR}/../csmsc_test.txt \
--phones_dict=dump/phone_id_map.txt \
--device=cpu \
--cpu_threads=2
fi

@ -0,0 +1,19 @@
train_output_path=$1
stage=0
stop_stage=0
# e2e, synthesize from text
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
python3 ${BIN_DIR}/../ort_predict_streaming.py \
--inference_dir=${train_output_path}/inference_onnx_streaming \
--am=fastspeech2_csmsc \
--am_stat=dump/train/speech_stats.npy \
--voc=hifigan_csmsc \
--output_dir=${train_output_path}/onnx_infer_out_streaming \
--text=${BIN_DIR}/../csmsc_test.txt \
--phones_dict=dump/phone_id_map.txt \
--device=cpu \
--cpu_threads=2 \
--am_streaming=True
fi

@ -0,0 +1,23 @@
train_output_path=$1
model_dir=$2
output_dir=$3
model=$4
enable_dev_version=True
model_name=${model%_*}
echo model_name: ${model_name}
if [ ${model_name} = 'mb_melgan' ] ;then
enable_dev_version=False
fi
mkdir -p ${train_output_path}/${output_dir}
paddle2onnx \
--model_dir ${train_output_path}/${model_dir} \
--model_filename ${model}.pdmodel \
--params_filename ${model}.pdiparams \
--save_file ${train_output_path}/${output_dir}/${model}.onnx \
--opset_version 11 \
--enable_dev_version ${enable_dev_version}
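Once `paddle2onnx` has written the `.onnx` files, they are consumed by `ort_predict.py`. As a quick sanity check you can also load one with `onnxruntime` directly; in the sketch below, the model path and the dummy input shape are assumptions you should adapt to what the session actually reports.
```python
import numpy as np
import onnxruntime as ort

# Hypothetical path to a converted vocoder under ${train_output_path}.
sess = ort.InferenceSession("exp/default/inference_onnx/hifigan_csmsc.onnx",
                            providers=["CPUExecutionProvider"])
inp = sess.get_inputs()[0]
print(inp.name, inp.shape)  # inspect the expected mel input layout

# Dummy 80-band mel frames; real inputs come from the acoustic model, and
# the shape must match what `inp.shape` reports.
mel = np.random.randn(100, 80).astype(np.float32)
wav = sess.run(None, {inp.name: mel})[0]
print(wav.shape)
```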

@ -109,6 +109,6 @@ if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
--lang=zh \
--text=${BIN_DIR}/../sentences.txt \
--output_dir=${train_output_path}/test_e2e \
--phones_dict=dump/phone_id_map.txt \
--inference_dir=${train_output_path}/inference
--phones_dict=dump/phone_id_map.txt #\
# --inference_dir=${train_output_path}/inference
fi

@ -88,5 +88,6 @@ if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
--text=${BIN_DIR}/../sentences.txt \
--output_dir=${train_output_path}/test_e2e_streaming \
--phones_dict=dump/phone_id_map.txt \
--am_streaming=True
--am_streaming=True \
--inference_dir=${train_output_path}/inference_streaming
fi

@ -41,3 +41,25 @@ if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
CUDA_VISIBLE_DEVICES=${gpus} ./local/inference.sh ${train_output_path} || exit -1
fi
# paddle2onnx, please make sure the static models are in ${train_output_path}/inference first
# we have only tested the following models so far
if [ ${stage} -le 5 ] && [ ${stop_stage} -ge 5 ]; then
# install paddle2onnx
version=$(echo `pip list |grep "paddle2onnx"` |awk -F" " '{print $2}')
if [[ -z "$version" || ${version} != '0.9.4' ]]; then
pip install paddle2onnx==0.9.4
fi
./local/paddle2onnx.sh ${train_output_path} inference inference_onnx fastspeech2_csmsc
./local/paddle2onnx.sh ${train_output_path} inference inference_onnx hifigan_csmsc
./local/paddle2onnx.sh ${train_output_path} inference inference_onnx mb_melgan_csmsc
fi
# inference with onnxruntime, use fastspeech2 + hifigan by default
if [ ${stage} -le 6 ] && [ ${stop_stage} -ge 6 ]; then
# install onnxruntime
version=$(echo `pip list |grep "onnxruntime"` |awk -F" " '{print $2}')
if [[ -z "$version" || ${version} != '1.10.0' ]]; then
pip install onnxruntime==1.10.0
fi
./local/ort_predict.sh ${train_output_path}
fi

@ -31,18 +31,75 @@ if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name} || exit -1
fi
# synthesize_e2e non-streaming
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
# synthesize_e2e, vocoder is pwgan
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize_e2e.sh ${conf_path} ${train_output_path} ${ckpt_name} || exit -1
fi
# inference non-streaming
if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
# inference with static model
CUDA_VISIBLE_DEVICES=${gpus} ./local/inference.sh ${train_output_path} || exit -1
fi
# synthesize_e2e streaming
if [ ${stage} -le 5 ] && [ ${stop_stage} -ge 5 ]; then
# synthesize_e2e, vocoder is pwgan
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize_streaming.sh ${conf_path} ${train_output_path} ${ckpt_name} || exit -1
fi
# inference streaming
if [ ${stage} -le 6 ] && [ ${stop_stage} -ge 6 ]; then
# inference with static model
CUDA_VISIBLE_DEVICES=${gpus} ./local/inference_streaming.sh ${train_output_path} || exit -1
fi
# paddle2onnx non streaming
if [ ${stage} -le 7 ] && [ ${stop_stage} -ge 7 ]; then
# install paddle2onnx
version=$(echo `pip list |grep "paddle2onnx"` |awk -F" " '{print $2}')
if [[ -z "$version" || ${version} != '0.9.4' ]]; then
pip install paddle2onnx==0.9.4
fi
./local/paddle2onnx.sh ${train_output_path} inference inference_onnx fastspeech2_csmsc
./local/paddle2onnx.sh ${train_output_path} inference inference_onnx hifigan_csmsc
fi
# onnxruntime non streaming
# inference with onnxruntime, use fastspeech2 + hifigan by default
if [ ${stage} -le 8 ] && [ ${stop_stage} -ge 8 ]; then
# install onnxruntime
version=$(echo `pip list |grep "onnxruntime"` |awk -F" " '{print $2}')
if [[ -z "$version" || ${version} != '1.10.0' ]]; then
pip install onnxruntime==1.10.0
fi
./local/ort_predict.sh ${train_output_path}
fi
# paddle2onnx streaming
if [ ${stage} -le 9 ] && [ ${stop_stage} -ge 9 ]; then
# install paddle2onnx
version=$(echo `pip list |grep "paddle2onnx"` |awk -F" " '{print $2}')
if [[ -z "$version" || ${version} != '0.9.4' ]]; then
pip install paddle2onnx==0.9.4
fi
# streaming acoustic model
./local/paddle2onnx.sh ${train_output_path} inference_streaming inference_onnx_streaming fastspeech2_csmsc_am_encoder_infer
./local/paddle2onnx.sh ${train_output_path} inference_streaming inference_onnx_streaming fastspeech2_csmsc_am_decoder
./local/paddle2onnx.sh ${train_output_path} inference_streaming inference_onnx_streaming fastspeech2_csmsc_am_postnet
# vocoder
./local/paddle2onnx.sh ${train_output_path} inference_streaming inference_onnx_streaming hifigan_csmsc
fi
# onnxruntime streaming
if [ ${stage} -le 10 ] && [ ${stop_stage} -ge 10 ]; then
# install onnxruntime
version=$(echo `pip list |grep "onnxruntime"` |awk -F" " '{print $2}')
if [[ -z "$version" || ${version} != '1.10.0' ]]; then
pip install onnxruntime==1.10.0
fi
./local/ort_predict_streaming.sh ${train_output_path}
fi
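As a rough illustration of why stage 9 exports the streaming acoustic model as three ONNX pieces: the encoder (with length regulation) runs once over the full phone sequence, while the decoder, postnet, and vocoder can run chunk by chunk. The sketch below is an assumption-laden outline, not the code in `ort_predict_streaming.sh`; the file names, input names, tensor layouts, and whether the postnet output is a residual all depend on the exported graphs.
```python
import numpy as np
import onnxruntime as ort

def load(name):
    # Hypothetical file layout for the models exported in stage 9.
    return ort.InferenceSession(f"inference_onnx_streaming/{name}.onnx",
                                providers=["CPUExecutionProvider"])

am_encoder = load("fastspeech2_csmsc_am_encoder_infer")
am_decoder = load("fastspeech2_csmsc_am_decoder")
am_postnet = load("fastspeech2_csmsc_am_postnet")
vocoder = load("hifigan_csmsc")

phone_ids = np.array([1, 2, 3, 4, 5], dtype=np.int64)  # dummy phone ids
# Encoder plus length regulator runs once over the whole utterance;
# assume a (1, T, C) hidden sequence comes back.
hidden = am_encoder.run(None, {am_encoder.get_inputs()[0].name: phone_ids})[0]

chunk_size, wav_chunks = 42, []
for start in range(0, hidden.shape[1], chunk_size):
    piece = hidden[:, start:start + chunk_size, :]
    mel = am_decoder.run(None, {am_decoder.get_inputs()[0].name: piece})[0]
    mel = am_postnet.run(None, {am_postnet.get_inputs()[0].name: mel})[0]
    wav_chunks.append(vocoder.run(None, {vocoder.get_inputs()[0].name: mel})[0])
wav = np.concatenate(wav_chunks, axis=0)
```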

@ -127,9 +127,14 @@ optional arguments:
5. `--ngpu` is the number of gpus to use; if ngpu == 0, the cpu is used.
## Pretrained Models
The pretrained model can be downloaded here [pwg_baker_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_baker_ckpt_0.4.zip).
The pretrained model can be downloaded here:
- [pwg_baker_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_baker_ckpt_0.4.zip)
The static model can be downloaded here [pwg_baker_static_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_baker_static_0.4.zip).
The static model can be downloaded here:
- [pwg_baker_static_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_baker_static_0.4.zip)
The ONNX model can be downloaded here:
- [pwgan_csmsc_onnx_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwgan_csmsc_onnx_0.2.0.zip)
Model | Step | eval/generator_loss | eval/log_stft_magnitude_loss| eval/spectral_convergence_loss
:-------------:| :------------:| :-----: | :-----: | :--------:

@ -152,11 +152,17 @@ TODO:
The hyperparameters in `finetune.yaml` are not good enough; a smaller `learning_rate` should be used (and more `milestones` should be set).
## Pretrained Models
The pretrained model can be downloaded here [mb_melgan_csmsc_ckpt_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/mb_melgan/mb_melgan_csmsc_ckpt_0.1.1.zip).
The pretrained model can be downloaded here:
- [mb_melgan_csmsc_ckpt_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/mb_melgan/mb_melgan_csmsc_ckpt_0.1.1.zip)
The finetuned model can be downloaded here [mb_melgan_baker_finetune_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/mb_melgan/mb_melgan_baker_finetune_ckpt_0.5.zip).
The finetuned model can be downloaded here:
- [mb_melgan_baker_finetune_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/mb_melgan/mb_melgan_baker_finetune_ckpt_0.5.zip)
The static model can be downloaded here [mb_melgan_csmsc_static_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/mb_melgan/mb_melgan_csmsc_static_0.1.1.zip)
The static model can be downloaded here:
- [mb_melgan_csmsc_static_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/mb_melgan/mb_melgan_csmsc_static_0.1.1.zip)
The ONNX model can be downloaded here:
- [mb_melgan_csmsc_onnx_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/mb_melgan/mb_melgan_csmsc_onnx_0.2.0.zip)
Model | Step | eval/generator_loss | eval/log_stft_magnitude_loss|eval/spectral_convergence_loss |eval/sub_log_stft_magnitude_loss|eval/sub_spectral_convergence_loss
:-------------:| :------------:| :-----: | :-----: | :--------:| :--------:| :--------:

@ -112,7 +112,8 @@ optional arguments:
5. `--ngpu` is the number of gpus to use; if ngpu == 0, the cpu is used.
## Pretrained Models
The pretrained model can be downloaded here [style_melgan_csmsc_ckpt_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/style_melgan/style_melgan_csmsc_ckpt_0.1.1.zip).
The pretrained model can be downloaded here:
- [style_melgan_csmsc_ckpt_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/style_melgan/style_melgan_csmsc_ckpt_0.1.1.zip)
The static model of Style MelGAN is not available yet.

@ -112,9 +112,14 @@ optional arguments:
5. `--ngpu` is the number of gpus to use; if ngpu == 0, the cpu is used.
## Pretrained Models
The pretrained model can be downloaded here [hifigan_csmsc_ckpt_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_csmsc_ckpt_0.1.1.zip).
The pretrained model can be downloaded here:
- [hifigan_csmsc_ckpt_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_csmsc_ckpt_0.1.1.zip)
The static model can be downloaded here [hifigan_csmsc_static_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_csmsc_static_0.1.1.zip).
The static model can be downloaded here:
- [hifigan_csmsc_static_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_csmsc_static_0.1.1.zip)
The ONNX model can be downloaded here:
- [hifigan_csmsc_onnx_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_csmsc_onnx_0.2.0.zip)
Model | Step | eval/generator_loss | eval/mel_loss| eval/feature_matching_loss
:-------------:| :------------:| :-----: | :-----: | :--------:

@ -109,9 +109,11 @@ optional arguments:
5. `--ngpu` is the number of gpus to use; if ngpu == 0, the cpu is used.
## Pretrained Models
The pretrained model can be downloaded here [wavernn_csmsc_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/wavernn/wavernn_csmsc_ckpt_0.2.0.zip).
The pretrained model can be downloaded here:
- [wavernn_csmsc_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/wavernn/wavernn_csmsc_ckpt_0.2.0.zip)
The static model can be downloaded here [wavernn_csmsc_static_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/wavernn/wavernn_csmsc_static_0.2.0.zip).
The static model can be downloaded here:
- [wavernn_csmsc_static_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/wavernn/wavernn_csmsc_static_0.2.0.zip)
Model | Step | eval/loss
:-------------:|:------------:| :------------:

@ -4,7 +4,7 @@
For sound classification tasks, a common practice in traditional machine learning is to first hand-craft a variety of time-domain and frequency-domain audio features and perform feature selection, combination, and transformation, and then classify with an SVM or a decision tree. End-to-end deep learning, by contrast, typically uses deep networks such as RNNs and CNNs to perform representation learning and classification prediction directly on the sound waveform or time-frequency features.
At the IEEE ICASSP 2017 conference, Google released a large-scale audio dataset, [Audioset](https://research.google.com/audioset/). The dataset contains 632 audio classes and 2,084,320 human-labeled sound clips, each **10 seconds** long, sourced from YouTube videos. It now covers 2.1 million labeled videos and 5,800 hours of audio, with 527 labeled sound classes.
`PANNs` ([PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition](https://arxiv.org/pdf/1912.10211.pdf)) are sound classification/recognition models trained on the Audioset dataset. After pretraining, the models can be used to extract audio embeddings. This example finetunes a pretrained `PANNs` model to perform sound classification.
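A minimal sketch of this finetuning setup: keep the pretrained backbone as an embedding extractor and train a small classification head on top. The `backbone` object below is a hypothetical stand-in for a loaded PANNs model; only the head is new and trainable here.
```python
import paddle.nn as nn

class SoundClassifier(nn.Layer):
    """A pretrained embedding extractor plus a trainable linear head."""

    def __init__(self, backbone, embedding_dim=2048, num_classes=50):
        super().__init__()
        self.backbone = backbone              # e.g. CNN14 -> 2048-dim embedding
        self.dropout = nn.Dropout(0.1)
        self.fc = nn.Linear(embedding_dim, num_classes)

    def forward(self, x):
        emb = self.backbone(x)                # (batch, embedding_dim)
        return self.fc(self.dropout(emb))     # (batch, num_classes) logits
```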
@ -12,14 +12,14 @@
## Model Introduction
PaddleAudio provides pretrained CNN14, CNN10, and CNN6 PANNs models for users to choose from:
- CNN14: mainly consists of 12 convolutional layers and 2 fully connected layers, with 79.6M parameters and an embedding dimension of 2048.
- CNN10: mainly consists of 8 convolutional layers and 2 fully connected layers, with 4.9M parameters and an embedding dimension of 512.
- CNN6: mainly consists of 4 convolutional layers and 2 fully connected layers, with 4.5M parameters and an embedding dimension of 512.
## Dataset
[ESC-50: Dataset for Environmental Sound Classification](https://github.com/karolpiczak/ESC-50) is a collection of 2000 labeled environmental sound samples, each **5 seconds** long, stored as single-channel audio files sampled at 44,100 Hz. All samples are divided into 50 classes by label, with 40 samples per class.
## Model Metrics
@ -43,13 +43,13 @@ $ CUDA_VISIBLE_DEVICES=0 ./run.sh 1 conf/panns.yaml
```
Training parameters can be configured under `training` in `conf/panns.yaml`, where:
- `epochs`: number of training epochs; defaults to 50.
- `learning_rate`: learning rate for finetuning; defaults to 5e-5.
- `batch_size`: batch size; adjust it according to available GPU memory, and lower it if you run out of memory; defaults to 16.
- `num_workers`: number of dataloader worker subprocesses; defaults to 0, meaning data is loaded in the main process.
- `checkpoint_dir`: directory where model parameter and optimizer parameter files are saved; defaults to `./checkpoint`.
- `save_freq`: model saving frequency during training; defaults to 10.
- `log_freq`: logging frequency during training; defaults to 10.
The example code uses the pretrained `CNN14` model; to switch to another pretrained model, modify the `model` section in `conf/panns.yaml`:
```yaml
@ -76,7 +76,7 @@ $ CUDA_VISIBLE_DEVICES=0 ./run.sh 2 conf/panns.yaml
Prediction parameters can be configured under `predicting` in `conf/panns.yaml`, where:
- `audio_file`: the audio file to predict on.
- `top_k`: number of top-k label scores to display; defaults to 1.
- `checkpoint`: model parameter checkpoint file.
The prediction output is as follows:

@ -21,7 +21,7 @@
The pretrained model can be downloaded here [ernie_linear_p3_iwslt2012_zh_ckpt_0.1.1.zip](https://paddlespeech.bj.bcebos.com/text/ernie_linear_p3_iwslt2012_zh_ckpt_0.1.1.zip).
### Test Result
- Ernie Linear
- Ernie
| |COMMA | PERIOD | QUESTION | OVERALL|
|:-----:|:-----:|:-----:|:-----:|:-----:|
|Precision |0.510955 |0.526462 |0.820755 |0.619391|

@ -0,0 +1,9 @@
# iwslt2012
## Ernie
| |COMMA | PERIOD | QUESTION | OVERALL|
|:-----:|:-----:|:-----:|:-----:|:-----:|
|Precision |0.510955 |0.526462 |0.820755 |0.619391|
|Recall |0.517433 |0.564179 |0.861386 |0.647666|
|F1 |0.514173 |0.544669 |0.840580 |0.633141|
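For reference, per-class metrics like the table above can be computed from token-level punctuation labels. A small sketch with made-up labels, not the project's evaluation code:
```python
from sklearn.metrics import precision_recall_fscore_support

classes = ["COMMA", "PERIOD", "QUESTION"]
y_true = ["O", "COMMA", "O", "PERIOD", "QUESTION", "O"]  # reference labels
y_pred = ["O", "COMMA", "O", "COMMA", "QUESTION", "O"]   # predicted labels

# Per-class precision/recall/F1 restricted to the punctuation classes.
p, r, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, labels=classes, zero_division=0)
for name, pi, ri, fi in zip(classes, p, r, f1):
    print(f"{name}: precision={pi:.3f} recall={ri:.3f} f1={fi:.3f}")
```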

@ -151,44 +151,22 @@ avg.sh best exp/conformer/checkpoints 20
CUDA_VISIBLE_DEVICES= ./local/test.sh conf/conformer.yaml exp/conformer/checkpoints/avg_20
```
## Pretrained Model
You can get the pretrained transformer or conformer using the scripts below:
```bash
# Conformer:
wget https://paddlespeech.bj.bcebos.com/s2t/librispeech/asr1/conformer.model.tar.gz
# Transformer:
wget https://paddlespeech.bj.bcebos.com/s2t/librispeech/asr1/transformer.model.tar.gz
```
You can get the pretrained transformer or conformer from [the released models page](../../../docs/source/released_model.md).
Use the `tar` command to unpack the model, and then you can use the scripts below to test it.
For example:
```bash
wget https://paddlespeech.bj.bcebos.com/s2t/librispeech/asr1/conformer.model.tar.gz
tar xzvf transformer.model.tar.gz
wget https://paddlespeech.bj.bcebos.com/s2t/librispeech/asr1/asr1_conformer_librispeech_ckpt_0.1.1.model.tar.gz
tar xzvf asr1_conformer_librispeech_ckpt_0.1.1.model.tar.gz
source path.sh
# If you have processed the data and generated the manifest files, you can skip the following 2 steps
bash local/data.sh --stage -1 --stop_stage -1
bash local/data.sh --stage 2 --stop_stage 2
CUDA_VISIBLE_DEVICES= ./local/test.sh conf/conformer.yaml exp/conformer/checkpoints/avg_20
```
The performance of the released models is shown below:
## Conformer
train: Epoch 70, 4 V100-32G, best avg: 20
| Model | Params | Config | Augmentation | Test set | Decode method | Loss | WER |
| --------- | ------- | ------------------- | ------------ | ---------- | ---------------------- | ----------------- | -------- |
| conformer | 47.63 M | conf/conformer.yaml | spec_aug | test-clean | attention | 6.433612394332886 | 0.039771 |
| conformer | 47.63 M | conf/conformer.yaml | spec_aug | test-clean | ctc_greedy_search | 6.433612394332886 | 0.040342 |
| conformer | 47.63 M | conf/conformer.yaml | spec_aug | test-clean | ctc_prefix_beam_search | 6.433612394332886 | 0.040342 |
| conformer | 47.63 M | conf/conformer.yaml | spec_aug | test-clean | attention_rescoring | 6.433612394332886 | 0.033761 |
## Transformer
train: Epoch 120, 4 V100-32G, 27 Day, best avg: 10
The performance of the released models is shown [here](./RESULTS.md).
| Model | Params | Config | Augmentation | Test set | Decode method | Loss | WER |
| ----------- | ------- | --------------------- | ------------ | ---------- | ---------------------- | ----------------- | -------- |
| transformer | 32.52 M | conf/transformer.yaml | spec_aug | test-clean | attention | 6.382194232940674 | 0.049661 |
| transformer | 32.52 M | conf/transformer.yaml | spec_aug | test-clean | ctc_greedy_search | 6.382194232940674 | 0.049566 |
| transformer | 32.52 M | conf/transformer.yaml | spec_aug | test-clean | ctc_prefix_beam_search | 6.382194232940674 | 0.049585 |
| transformer | 32.52 M | conf/transformer.yaml | spec_aug | test-clean | attention_rescoring | 6.382194232940674 | 0.038135 |
## Stage 4: CTC Alignment
If you want to get the alignment between the audio and the text, you can use CTC alignment. The code for this stage is shown below:
```bash
@ -227,8 +205,8 @@ In some situations, you want to use the trained model to do the inference for th
```
you can train the model by yourself using ```bash run.sh --stage 0 --stop_stage 3```, or you can download the pretrained model through the script below:
```bash
wget https://paddlespeech.bj.bcebos.com/s2t/librispeech/asr1/conformer.model.tar.gz
tar xzvf conformer.model.tar.gz
wget https://paddlespeech.bj.bcebos.com/s2t/librispeech/asr1/asr1_conformer_librispeech_ckpt_0.1.1.model.tar.gz
tar xzvf asr1_conformer_librispeech_ckpt_0.1.1.model.tar.gz
```
You can download the audio demo:
```bash

@ -1,4 +1,4 @@
# Transformer/Conformer ASR with Librispeech Asr2
# Transformer/Conformer ASR with Librispeech ASR2
This example contains code used to train a Transformer or [Conformer](http://arxiv.org/abs/2008.03802) model with [Librispeech dataset](http://www.openslr.org/resources/12) and use some functions in kaldi.
@ -213,17 +213,14 @@ avg.sh latest exp/transformer/checkpoints 10
./local/recog.sh --ckpt_prefix exp/transformer/checkpoints/avg_10
```
## Pretrained Model
You can get the pretrained transformer using the scripts below:
```bash
# Transformer:
wget https://paddlespeech.bj.bcebos.com/s2t/librispeech/asr2/transformer.model.tar.gz
```
You can get the pretrained models from [the released models page](../../../docs/source/released_model.md).
Use the `tar` command to unpack the model, and then you can use the scripts below to test it.
For example:
```bash
wget https://paddlespeech.bj.bcebos.com/s2t/librispeech/asr2/transformer.model.tar.gz
tar xzvf transformer.model.tar.gz
wget https://paddlespeech.bj.bcebos.com/s2t/librispeech/asr2/asr2_transformer_librispeech_ckpt_0.1.1.model.tar.gz
tar xzvf asr2_transformer_librispeech_ckpt_0.1.1.model.tar.gz
source path.sh
# If you have processed the data and generated the manifest files, you can skip the following 2 steps
bash local/data.sh --stage -1 --stop_stage -1
@ -231,26 +228,7 @@ bash local/data.sh --stage 2 --stop_stage 2
CUDA_VISIBLE_DEVICES= ./local/test.sh conf/transformer.yaml exp/ctc/checkpoints/avg_10
```
The performance of the released models is shown below:
### Transformer
| Model | Params | GPUS | Averaged Model | Config | Augmentation | Loss |
| :---------: | :----: | :--------------------: | :--------------: | :-------------------: | :----------: | :-------------: |
| transformer | 32.52M | 8 Tesla V100-SXM2-32GB | 10-best val_loss | conf/transformer.yaml | spec_aug | 6.3197922706604 |
#### Attention Rescore
| Test Set | Decode Method | #Snt | #Wrd | Corr | Sub | Del | Ins | Err | S.Err |
| ---------- | --------------------- | ---- | ----- | ---- | ---- | ---- | ---- | ---- | ----- |
| test-clean | attention | 2620 | 52576 | 96.4 | 2.5 | 1.1 | 0.4 | 4.0 | 34.7 |
| test-clean | ctc_greedy_search | 2620 | 52576 | 95.9 | 3.7 | 0.4 | 0.5 | 4.6 | 48.0 |
| test-clean | ctc_prefix_beamsearch | 2620 | 52576 | 95.9 | 3.7 | 0.4 | 0.5 | 4.6 | 47.6 |
| test-clean | attention_rescore | 2620 | 52576 | 96.8 | 2.9 | 0.3 | 0.4 | 3.7 | 38.0 |
#### JoinCTC
| Test Set | Decode Method | #Snt | #Wrd | Corr | Sub | Del | Ins | Err | S.Err |
| ---------- | ----------------- | ---- | ----- | ---- | ---- | ---- | ---- | ---- | ----- |
| test-clean | join_ctc_only_att | 2620 | 52576 | 96.1 | 2.5 | 1.4 | 0.4 | 4.4 | 34.7 |
| test-clean | join_ctc_w/o_lm | 2620 | 52576 | 97.2 | 2.6 | 0.3 | 0.4 | 3.2 | 34.9 |
| test-clean | join_ctc_w_lm | 2620 | 52576 | 97.9 | 1.8 | 0.2 | 0.3 | 2.4 | 27.8 |
The performance of the released models is shown [here](./RESULTS.md).
Compared with [ESPNET](https://github.com/espnet/espnet/blob/master/egs/librispeech/asr1/RESULTS.md#pytorch-large-transformer-with-specaug-4-gpus--transformer-lm-4-gpus), we use 8 GPUs, while our model size (aheads4-adim256) is smaller.
## Stage 5: CTC Alignment

@ -171,7 +171,8 @@ optional arguments:
6. `--ngpu` is the number of gpus to use; if ngpu == 0, the cpu is used.
## Pretrained Model
Pretrained Model can be downloaded here. [transformer_tts_ljspeech_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/transformer_tts/transformer_tts_ljspeech_ckpt_0.4.zip)
Pretrained Model can be downloaded here:
- [transformer_tts_ljspeech_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/transformer_tts/transformer_tts_ljspeech_ckpt_0.4.zip)
TransformerTTS checkpoint contains files listed below.
```text

@ -214,7 +214,8 @@ optional arguments:
9. `--ngpu` is the number of gpus to use; if ngpu == 0, the cpu is used.
## Pretrained Model
Pretrained FastSpeech2 model with no silence in the edge of audios. [fastspeech2_nosil_ljspeech_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_ljspeech_ckpt_0.5.zip)
Pretrained FastSpeech2 model with no silence at the edges of audios:
- [fastspeech2_nosil_ljspeech_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_ljspeech_ckpt_0.5.zip)
Model | Step | eval/loss | eval/l1_loss | eval/duration_loss | eval/pitch_loss| eval/energy_loss
:-------------:| :------------:| :-----: | :-----: | :--------: |:--------:|:---------:

@ -26,7 +26,7 @@ if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
fi
# hifigan
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize.py \

@ -50,4 +50,5 @@ Synthesize waveform.
6. `--ngpu` is the number of gpus to use; if ngpu == 0, the cpu is used.
## Pretrained Model
Pretrained Model with residual channel equals 128 can be downloaded here. [waveflow_ljspeech_ckpt_0.3.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/waveflow/waveflow_ljspeech_ckpt_0.3.zip).
Pretrained model with 128 residual channels can be downloaded here:
- [waveflow_ljspeech_ckpt_0.3.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/waveflow/waveflow_ljspeech_ckpt_0.3.zip)

@ -127,7 +127,8 @@ optional arguments:
5. `--ngpu` is the number of gpus to use; if ngpu == 0, the cpu is used.
## Pretrained Model
Pretrained models can be downloaded here. [pwg_ljspeech_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_ljspeech_ckpt_0.5.zip)
Pretrained models can be downloaded here:
- [pwg_ljspeech_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_ljspeech_ckpt_0.5.zip)
Parallel WaveGAN checkpoint contains files listed below.

@ -127,7 +127,8 @@ optional arguments:
5. `--ngpu` is the number of gpus to use; if ngpu == 0, the cpu is used.
## Pretrained Model
The pretrained model can be downloaded here [hifigan_ljspeech_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_ljspeech_ckpt_0.2.0.zip).
The pretrained model can be downloaded here:
- [hifigan_ljspeech_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_ljspeech_ckpt_0.2.0.zip)
Model | Step | eval/generator_loss | eval/mel_loss| eval/feature_matching_loss
@ -143,6 +144,5 @@ hifigan_ljspeech_ckpt_0.2.0
└── snapshot_iter_2500000.pdz # generator parameters of hifigan
```
## Acknowledgement
We adapted some code from https://github.com/kan-bayashi/ParallelWaveGAN.

@ -26,10 +26,10 @@ from paddlespeech.s2t.utils.log import Log
#TODO(Hui Zhang): remove fluid import
logger = Log(__name__).getlog()
########### hcak logging #############
########### hack logging #############
logger.warn = logger.warning
########### hcak paddle #############
########### hack paddle #############
paddle.half = 'float16'
paddle.float = 'float32'
paddle.double = 'float64'
@ -110,7 +110,7 @@ if not hasattr(paddle, 'cat'):
paddle.cat = cat
########### hcak paddle.Tensor #############
########### hack paddle.Tensor #############
def item(x: paddle.Tensor):
return x.numpy().item()
@ -353,7 +353,7 @@ if not hasattr(paddle.Tensor, 'tolist'):
setattr(paddle.Tensor, 'tolist', tolist)
########### hcak paddle.nn #############
########### hack paddle.nn #############
class GLU(nn.Layer):
"""Gated Linear Units (GLU) Layer"""

@ -27,7 +27,7 @@ arpa=$3
if [ $stage -le 0 ] && [ $stop_stage -ge 0 ];then
# text tn & wordseg preprocess
echo "process text."
python3 ${MAIN_ROOT}/utils/zh_tn.py ${type} ${text} ${text}.${type}.tn
python3 ${MAIN_ROOT}/utils/zh_tn.py --token_type ${type} ${text} ${text}.${type}.tn
fi
if [ $stage -le 1 ] && [ $stop_stage -ge 1 ];then

@ -10,6 +10,11 @@ MD5="29e02312deb2e59b3c8686c7966d4fe3"
TARGET=${DIR}/zh_giga.no_cna_cmn.prune01244.klm
if [ -e $TARGET ];then
echo "already have lm"
exit 0;
fi
echo "Download language model ..."
download $URL $MD5 $TARGET
if [ $? -ne 0 ]; then

@ -217,7 +217,8 @@ optional arguments:
9. `--ngpu` is the number of gpus to use; if ngpu == 0, the cpu is used.
## Pretrained Model
Pretrained FastSpeech2 model with no silence in the edge of audios. [fastspeech2_nosil_vctk_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_vctk_ckpt_0.5.zip)
Pretrained FastSpeech2 model with no silence at the edges of audios:
- [fastspeech2_nosil_vctk_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_vctk_ckpt_0.5.zip)
FastSpeech2 checkpoint contains files listed below.
```text

@ -132,7 +132,8 @@ optional arguments:
5. `--ngpu` is the number of gpus to use; if ngpu == 0, the cpu is used.
## Pretrained Model
Pretrained models can be downloaded here [pwg_vctk_ckpt_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_vctk_ckpt_0.1.1.zip).
Pretrained models can be downloaded here:
- [pwg_vctk_ckpt_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_vctk_ckpt_0.1.1.zip)
Parallel WaveGAN checkpoint contains files listed below.

@ -133,7 +133,8 @@ optional arguments:
5. `--ngpu` is the number of gpus to use; if ngpu == 0, the cpu is used.
## Pretrained Model
The pretrained model can be downloaded here [hifigan_vctk_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_vctk_ckpt_0.2.0.zip).
The pretrained model can be downloaded here:
- [hifigan_vctk_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_vctk_ckpt_0.2.0.zip)
Model | Step | eval/generator_loss | eval/mel_loss| eval/feature_matching_loss

@ -4,4 +4,4 @@
| Model | Number of Params | Release | Config | dim | Test set | Cosine | Cosine + S-Norm |
| --- | --- | --- | --- | --- | --- | --- | ---- |
| ECAPA-TDNN | 85M | 0.1.1 | conf/ecapa_tdnn.yaml |192 | test | 1.15 | 1.06 |
| ECAPA-TDNN | 85M | 0.2.0 | conf/ecapa_tdnn.yaml |192 | test | 1.02 | 0.95 |
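The `Cosine` column above scores each trial by the cosine similarity of two speaker embeddings, and EER is the operating point where the false-acceptance and false-rejection rates meet. A self-contained sketch with synthetic embeddings (PaddleSpeech's own metric lives in `compute_eer`; this is only an illustration):
```python
import numpy as np

def cosine_score(e1, e2):
    return float(np.dot(e1, e2) / (np.linalg.norm(e1) * np.linalg.norm(e2)))

def eer(scores, labels):
    """Find where the false-acceptance rate crosses the false-rejection rate."""
    order = np.argsort(scores)[::-1]          # sort trials by score, descending
    labels = np.asarray(labels)[order]
    far = np.cumsum(labels == 0) / max((labels == 0).sum(), 1)
    frr = 1.0 - np.cumsum(labels == 1) / max((labels == 1).sum(), 1)
    i = np.argmin(np.abs(far - frr))
    return (far[i] + frr[i]) / 2

rng = np.random.default_rng(0)
enroll = rng.standard_normal((100, 192))      # 192-dim, as in the config
test = enroll + 0.5 * rng.standard_normal((100, 192))    # same-speaker trials
imposter = rng.standard_normal((100, 192))               # different-speaker trials

scores = [cosine_score(e, t) for e, t in zip(enroll, test)]
scores += [cosine_score(e, i) for e, i in zip(enroll, imposter)]
labels = [1] * 100 + [0] * 100                # 1 = same speaker, 0 = different
print(f"EER ~ {eer(scores, labels):.3f}")
```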

@ -1,14 +1,16 @@
###########################################
# Data #
###########################################
# we should explicitly specify the wav path of vox2 audio data converted from m4a
vox2_base_path:
augment: True
batch_size: 16
batch_size: 32
num_workers: 2
num_speakers: 7205 # 1211 vox1, 5994 vox2, 7205 vox1+2, test speakers: 41
shuffle: True
skip_prep: False
split_ratio: 0.9
chunk_duration: 3.0 # seconds
random_chunk: True
verification_file: data/vox1/veri_test2.txt
###########################################################
# FEATURE EXTRACTION SETTING #
@ -26,7 +28,6 @@ hop_size: 160 #10ms, sample rate 16000, 10 * 16000 / 1000 = 160
# if we want to use another model, please choose another configuration yaml file
model:
input_size: 80
# "channels": [512, 512, 512, 512, 1536],
channels: [1024, 1024, 1024, 1024, 3072]
kernel_sizes: [5, 3, 3, 3, 1]
dilations: [1, 2, 3, 4, 1]
@ -38,11 +39,19 @@ model:
###########################################
seed: 1986 # taken from the speechbrain configuration
epochs: 10
save_interval: 1
log_interval: 1
save_interval: 10
log_interval: 10
learning_rate: 1e-8
max_lr: 1e-3
step_size: 140000
###########################################
# loss #
###########################################
margin: 0.2
scale: 30
###########################################
# Testing #
###########################################

@ -0,0 +1,60 @@
###########################################
# Data #
###########################################
augment: True
batch_size: 32
num_workers: 2
num_speakers: 1211 # 1211 vox1, 5994 vox2, 7205 vox1+2, test speakers: 41
shuffle: True
skip_prep: False
split_ratio: 0.9
chunk_duration: 3.0 # seconds
random_chunk: True
verification_file: data/vox1/veri_test2.txt
###########################################################
# FEATURE EXTRACTION SETTING #
###########################################################
# currently, we only support fbank
sr: 16000 # sample rate
n_mels: 80
window_size: 400 #25ms, sample rate 16000, 25 * 16000 / 1000 = 400
hop_size: 160 #10ms, sample rate 16000, 10 * 16000 / 1000 = 160
###########################################################
# MODEL SETTING #
###########################################################
# currently, we only support ecapa-tdnn in the ecapa_tdnn.yaml
# if we want to use another model, please choose another configuration yaml file
model:
input_size: 80
channels: [512, 512, 512, 512, 1536]
kernel_sizes: [5, 3, 3, 3, 1]
dilations: [1, 2, 3, 4, 1]
attention_channels: 128
lin_neurons: 192
###########################################
# Training #
###########################################
seed: 1986 # taken from the speechbrain configuration
epochs: 100
save_interval: 10
log_interval: 10
learning_rate: 1e-8
max_lr: 1e-3
step_size: 140000
###########################################
# loss #
###########################################
margin: 0.2
scale: 30
###########################################
# Testing #
###########################################
global_embedding_norm: True
embedding_mean_norm: True
embedding_std_norm: False
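As a small illustration of the `chunk_duration`/`random_chunk` options above: during training, a random fixed-length window is cut from each utterance, padding clips that are too short. A sketch under those assumptions, not the project's actual loader:
```python
import numpy as np

def random_chunk(waveform: np.ndarray, sr: int, chunk_duration: float = 3.0):
    """Cut a random chunk_duration-second window; pad if the clip is shorter."""
    chunk_len = int(sr * chunk_duration)
    if waveform.shape[0] <= chunk_len:
        return np.pad(waveform, (0, chunk_len - waveform.shape[0]))
    start = np.random.randint(0, waveform.shape[0] - chunk_len + 1)
    return waveform[start:start + chunk_len]

wav = np.zeros(5 * 16000, dtype="float32")  # a dummy 5 s clip at 16 kHz
print(random_chunk(wav, sr=16000).shape)    # (48000,)
```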

@ -12,7 +12,7 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
stage=1
stage=0
stop_stage=100
. ${MAIN_ROOT}/utils/parse_options.sh || exit -1;
@ -30,29 +30,114 @@ dir=$1
conf_path=$2
mkdir -p ${dir}
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
# data prepare for vox1 and vox2, vox2 must be converted from m4a to wav
# we should use local/convert.sh to convert m4a to wav
python3 local/data_prepare.py \
--data-dir ${dir} \
--config ${conf_path}
fi
# Generally the `MAIN_ROOT` refers to the root of PaddleSpeech,
# which is defined in path.sh
# And we will download the voxceleb data and rirs noise to ${MAIN_ROOT}/dataset
TARGET_DIR=${MAIN_ROOT}/dataset
mkdir -p ${TARGET_DIR}
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
# download data, generate manifests
python3 ${TARGET_DIR}/voxceleb/voxceleb1.py \
--manifest_prefix="data/vox1/manifest" \
# download data, generate manifests
# we will generate the manifest.{dev,test} file from ${TARGET_DIR}/voxceleb/vox1/{dev,test} directory
# and generate the meta info and download the trial file
# manifest.dev: 148642
# manifest.test: 4847
echo "Start to download vox1 dataset and generate the manifest files "
python3 ${TARGET_DIR}/voxceleb/voxceleb1.py \
--manifest_prefix="${dir}/vox1/manifest" \
--target_dir="${TARGET_DIR}/voxceleb/vox1/"
if [ $? -ne 0 ]; then
echo "Prepare voxceleb failed. Terminated."
exit 1
fi
if [ $? -ne 0 ]; then
echo "Prepare voxceleb1 failed. Terminated."
exit 1
fi
fi
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
# download voxceleb2 data
# we will download the data and unzip the package
# and we will store the m4a file in ${TARGET_DIR}/voxceleb/vox2/{dev,test}
echo "start to download vox2 dataset"
python3 ${TARGET_DIR}/voxceleb/voxceleb2.py \
--download \
--target_dir="${TARGET_DIR}/voxceleb/vox2/"
if [ $? -ne 0 ]; then
echo "Download voxceleb2 dataset failed. Terminated."
exit 1
fi
fi
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
# convert the m4a to wav
# and we will not delete the original m4a file
echo "start to convert the m4a to wav"
bash local/convert.sh ${TARGET_DIR}/voxceleb/vox2/test/ || exit 1;
if [ $? -ne 0 ]; then
echo "Convert voxceleb2 dataset from m4a to wav failed. Terminated."
exit 1
fi
echo "m4a convert to wav operation finished"
fi
if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
# generate the vox2 manifest file from wav file
# we will generate the ${dir}/vox2/manifest.vox2
# because we use the whole vox2 dataset for training, we collect all the vox2 data in one file
echo "start generate the vox2 manifest files"
python3 ${TARGET_DIR}/voxceleb/voxceleb2.py \
--generate \
--manifest_prefix="${dir}/vox2/manifest" \
--target_dir="${TARGET_DIR}/voxceleb/vox2/"
# for dataset in train dev test; do
# mv data/manifest.${dataset} data/manifest.${dataset}.raw
# done
fi
if [ $? -ne 0 ]; then
echo "Prepare voxceleb2 dataset failed. Terminated."
exit 1
fi
fi
if [ ${stage} -le 5 ] && [ ${stop_stage} -ge 5 ]; then
# generate the vox csv file
# Currently, our training system uses csv files for datasets
echo "convert the json format to csv format to be compatible with training process"
python3 local/make_vox_csv_dataset_from_json.py\
--train "${dir}/vox1/manifest.dev" "${dir}/vox2/manifest.vox2"\
--test "${dir}/vox1/manifest.test" \
--target_dir "${dir}/vox/" \
--config ${conf_path}
if [ $? -ne 0 ]; then
echo "Prepare voxceleb failed. Terminated."
exit 1
fi
fi
if [ ${stage} -le 6 ] && [ ${stop_stage} -ge 6 ]; then
# generate the open rir noise manifest file
echo "generate the open rir noise manifest file"
python3 ${TARGET_DIR}/rir_noise/rir_noise.py\
--manifest_prefix="${dir}/rir_noise/manifest" \
--target_dir="${TARGET_DIR}/rir_noise/"
if [ $? -ne 0 ]; then
echo "Prepare rir_noise failed. Terminated."
exit 1
fi
fi
if [ ${stage} -le 7 ] && [ ${stop_stage} -ge 7 ]; then
# generate the open rir noise csv file
echo "generate the open rir noise csv file"
python3 local/make_rirs_noise_csv_dataset_from_json.py \
--noise_dir="${TARGET_DIR}/rir_noise/" \
--data_dir="${dir}/rir_noise/" \
--config ${conf_path}
if [ $? -ne 0 ]; then
echo "Prepare rir_noise failed. Terminated."
exit 1
fi
fi

@ -0,0 +1,167 @@
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Convert the PaddleSpeech jsonline format data to csv format data for the voxceleb experiment.
Currently, the speaker identification training process uses the csv format.
"""
import argparse
import csv
import os
from typing import List
import tqdm
from yacs.config import CfgNode
from paddleaudio import load as load_audio
from paddlespeech.s2t.utils.log import Log
from paddlespeech.vector.utils.vector_utils import get_chunks
logger = Log(__name__).getlog()
def get_chunks_list(wav_file: str,
split_chunks: bool,
base_path: str,
chunk_duration: float=3.0) -> List[List[str]]:
"""Get the single audio file info
Args:
wav_file (list): the wav audio file and get this audio segment info list
split_chunks (bool): audio split flag
base_path (str): the audio base path
chunk_duration (float): the chunk duration.
if set the split_chunks, we split the audio into multi-chunks segment.
"""
waveform, sr = load_audio(wav_file)
audio_id = wav_file.split("/rir_noise/")[-1].split(".")[0]
audio_duration = waveform.shape[0] / sr
ret = []
if split_chunks and audio_duration > chunk_duration:  # Split into pieces of chunk_duration seconds.
uniq_chunks_list = get_chunks(chunk_duration, audio_id, audio_duration)
for idx, chunk in enumerate(uniq_chunks_list):
s, e = chunk.split("_")[-2:] # Timestamps of start and end
start_sample = int(float(s) * sr)
end_sample = int(float(e) * sr)
# currently, all vector csv data formats use one representation
# id, duration, wav, start, stop, label
# in rirs noise, all the label names are 'noise'
# the label is string type and we will convert it to integer type in training
ret.append([
chunk, audio_duration, wav_file, start_sample, end_sample,
"noise"
])
else: # Keep whole audio.
ret.append(
[audio_id, audio_duration, wav_file, 0, waveform.shape[0], "noise"])
return ret
def generate_csv(wav_files,
output_file: str,
base_path: str,
split_chunks: bool=True):
"""Prepare the csv file according the wav files
Args:
wav_files (list): all the audio list to prepare the csv file
output_file (str): the output csv file
config (CfgNode): yaml configuration content
split_chunks (bool): audio split flag
"""
logger.info(f'Generating csv: {output_file}')
header = ["utt_id", "duration", "wav", "start", "stop", "label"]
csv_lines = []
for item in tqdm.tqdm(wav_files):
csv_lines.extend(
get_chunks_list(
item, base_path=base_path, split_chunks=split_chunks))
if not os.path.exists(os.path.dirname(output_file)):
os.makedirs(os.path.dirname(output_file))
with open(output_file, mode="w") as csv_f:
csv_writer = csv.writer(
csv_f, delimiter=",", quotechar='"', quoting=csv.QUOTE_MINIMAL)
csv_writer.writerow(header)
for line in csv_lines:
csv_writer.writerow(line)
def prepare_data(args, config):
"""Convert the jsonline format to csv format
Args:
args (argparse.Namespace): scripts args
config (CfgNode): yaml configuration content
"""
# if the external config sets the skip_prep flag, we will do nothing
if config.skip_prep:
return
base_path = args.noise_dir
wav_path = os.path.join(base_path, "RIRS_NOISES")
logger.info(f"base path: {base_path}")
logger.info(f"wav path: {wav_path}")
rir_list = os.path.join(wav_path, "real_rirs_isotropic_noises", "rir_list")
rir_files = []
with open(rir_list, 'r') as f:
for line in f.readlines():
rir_file = line.strip().split(' ')[-1]
rir_files.append(os.path.join(base_path, rir_file))
noise_list = os.path.join(wav_path, "pointsource_noises", "noise_list")
noise_files = []
with open(noise_list, 'r') as f:
for line in f.readlines():
noise_file = line.strip().split(' ')[-1]
noise_files.append(os.path.join(base_path, noise_file))
csv_path = os.path.join(args.data_dir, 'csv')
logger.info(f"csv path: {csv_path}")
generate_csv(
rir_files, os.path.join(csv_path, 'rir.csv'), base_path=base_path)
generate_csv(
noise_files, os.path.join(csv_path, 'noise.csv'), base_path=base_path)
if __name__ == "__main__":
parser = argparse.ArgumentParser(description=__doc__)
parser.add_argument(
"--noise_dir",
default=None,
required=True,
help="The noise dataset dataset directory.")
parser.add_argument(
"--data_dir",
default=None,
required=True,
help="The target directory stores the csv files")
parser.add_argument(
"--config",
default=None,
required=True,
type=str,
help="configuration file")
args = parser.parse_args()
# parse the yaml config file
config = CfgNode(new_allowed=True)
if args.config:
config.merge_from_file(args.config)
# prepare the csv file from jsonlines files
prepare_data(args, config)
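The scripts above lean on `get_chunks` from `paddlespeech.vector.utils.vector_utils` and later parse each chunk name back with `chunk.split("_")[-2:]`. A hedged sketch of the naming scheme that parsing implies (the real implementation may differ):
```python
def get_chunks_sketch(seg_dur: float, audio_id: str, audio_duration: float):
    """Name chunks as '<audio_id>_<start>_<end>' over the whole utterance."""
    num_chunks = int(audio_duration / seg_dur)
    return [
        f"{audio_id}_{i * seg_dur}_{i * seg_dur + seg_dur}"
        for i in range(num_chunks)
    ]

print(get_chunks_sketch(3.0, "noise-001", 7.5))
# ['noise-001_0.0_3.0', 'noise-001_3.0_6.0']
```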

@ -0,0 +1,251 @@
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""
Convert the PaddleSpeech jsonline format data to csv format data for the voxceleb experiment.
Currently, the speaker identification training process uses the csv format.
"""
import argparse
import csv
import json
import os
import random
import tqdm
from yacs.config import CfgNode
from paddleaudio import load as load_audio
from paddlespeech.s2t.utils.log import Log
from paddlespeech.vector.utils.vector_utils import get_chunks
logger = Log(__name__).getlog()
def prepare_csv(wav_files, output_file, config, split_chunks=True):
"""Prepare the csv file according the wav files
Args:
wav_files (list): all the audio list to prepare the csv file
output_file (str): the output csv file
config (CfgNode): yaml configuration content
split_chunks (bool, optional): audio split flag. Defaults to True.
"""
if not os.path.exists(os.path.dirname(output_file)):
os.makedirs(os.path.dirname(output_file))
csv_lines = []
header = ["utt_id", "duration", "wav", "start", "stop", "label"]
# voxceleb meta info for each training utterance segment
# we extract a segment from an utterance for training,
# and the segment's period is between the start and stop time points in the original wav file
# each field in the meta info means the following:
# utt_id: the utterance segment name, which is unique in the training dataset
# duration: the total utterance time
# wav: utterance file path, which should be an absolute path
# start: start point of the segment, in sample points of the original wav file
# stop: stop point of the segment, in sample points of the original wav file
# label: the utterance segment's label name,
# which is the speaker name in the speaker verification domain
for item in tqdm.tqdm(wav_files, total=len(wav_files)):
item = json.loads(item.strip())
audio_id = item['utt'].replace(".wav",
"") # we remove the wav suffix name
audio_duration = item['feat_shape'][0]
wav_file = item['feat']
label = audio_id.split('-')[
0] # speaker name in speaker verification domain
waveform, sr = load_audio(wav_file)
if split_chunks:
uniq_chunks_list = get_chunks(config.chunk_duration, audio_id,
audio_duration)
for chunk in uniq_chunks_list:
s, e = chunk.split("_")[-2:] # Timestamps of start and end
start_sample = int(float(s) * sr)
end_sample = int(float(e) * sr)
# id, duration, wav, start, stop, label
# in vector, the label in speaker id
csv_lines.append([
chunk, audio_duration, wav_file, start_sample, end_sample,
label
])
else:
csv_lines.append([
audio_id, audio_duration, wav_file, 0, waveform.shape[0], label
])
with open(output_file, mode="w") as csv_f:
csv_writer = csv.writer(
csv_f, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
csv_writer.writerow(header)
for line in csv_lines:
csv_writer.writerow(line)
def get_enroll_test_list(dataset_list, verification_file):
"""Get the enroll and test utterance list from all the voxceleb1 test utterance dataset.
Generally, we get the enroll and test utterances from the verification file.
The verification file format is as follows:
target/nontarget enroll-utt test-utt,
where we set 0 as nontarget and 1 as target, e.g.:
0 a.wav b.wav
1 a.wav a.wav
Args:
dataset_list (list): all the dataset to get the test utterances
verification_file (str): voxceleb1 trial file
"""
logger.info(f"verification file: {verification_file}")
enroll_audios = set()
test_audios = set()
with open(verification_file, 'r') as f:
for line in f:
_, enroll_file, test_file = line.strip().split(' ')
enroll_audios.add('-'.join(enroll_file.split('/')))
test_audios.add('-'.join(test_file.split('/')))
enroll_files = []
test_files = []
for dataset in dataset_list:
with open(dataset, 'r') as f:
for line in f:
# audio_id may be in enroll and test at the same time
# eg: 1 a.wav a.wav
# the audio a.wav is enroll and test file at the same time
audio_id = json.loads(line.strip())['utt']
if audio_id in enroll_audios:
enroll_files.append(line)
if audio_id in test_audios:
test_files.append(line)
enroll_files = sorted(enroll_files)
test_files = sorted(test_files)
return enroll_files, test_files
def get_train_dev_list(dataset_list, target_dir, split_ratio):
"""Get the train and dev utterance list from all the training utterance dataset.
Generally, we use the split_ratio as the train dataset ratio,
and the remaining utterances (ratio 1 - split_ratio) form the dev dataset
Args:
dataset_list (list): all the dataset to get the all utterances
target_dir (str): the target train and dev directory,
we will create the csv directory to store the {train,dev}.csv file
split_ratio (float): train dataset ratio in all utterance list
"""
logger.info("start to get train and dev utt list")
if not os.path.exists(os.path.join(target_dir, "meta")):
os.makedirs(os.path.join(target_dir, "meta"))
audio_files = []
speakers = set()
for dataset in dataset_list:
with open(dataset, 'r') as f:
for line in f:
# the label is speaker name
label_name = json.loads(line.strip())['utt2spk']
speakers.add(label_name)
audio_files.append(line.strip())
speakers = sorted(speakers)
logger.info(f"we get {len(speakers)} speakers from all the train dataset")
with open(os.path.join(target_dir, "meta", "label2id.txt"), 'w') as f:
for label_id, label_name in enumerate(speakers):
f.write(f'{label_name} {label_id}\n')
logger.info(
f'we store the speakers to {os.path.join(target_dir, "meta", "label2id.txt")}'
)
# the split_ratio is for train dataset
# the remaining is for dev dataset
split_idx = int(split_ratio * len(audio_files))
audio_files = sorted(audio_files)
random.shuffle(audio_files)
train_files, dev_files = audio_files[:split_idx], audio_files[split_idx:]
logger.info(
f"we get train utterances: {len(train_files)}, dev utterance: {len(dev_files)}"
)
return train_files, dev_files
def prepare_data(args, config):
"""Convert the jsonline format to csv format
Args:
args (argparse.Namespace): scripts args
config (CfgNode): yaml configuration content
"""
# stage0: set the random seed
random.seed(config.seed)
# if the external config sets the skip_prep flag, we will do nothing
if config.skip_prep:
return
# stage 1: prepare the enroll and test csv file
# And we generate the speaker to label file label2id.txt
logger.info("start to prepare the data csv file")
enroll_files, test_files = get_enroll_test_list(
[args.test], verification_file=config.verification_file)
prepare_csv(
enroll_files,
os.path.join(args.target_dir, "csv", "enroll.csv"),
config,
split_chunks=False)
prepare_csv(
test_files,
os.path.join(args.target_dir, "csv", "test.csv"),
config,
split_chunks=False)
# stage 2: prepare the train and dev csv file
# we get the train dataset ratio as config.split_ratio
# and the remaining is dev dataset
logger.info("start to prepare the data csv file")
train_files, dev_files = get_train_dev_list(
args.train, target_dir=args.target_dir, split_ratio=config.split_ratio)
prepare_csv(train_files,
os.path.join(args.target_dir, "csv", "train.csv"), config)
prepare_csv(dev_files,
os.path.join(args.target_dir, "csv", "dev.csv"), config)
if __name__ == "__main__":
parser = argparse.ArgumentParser(description=__doc__)
parser.add_argument(
"--train",
required=True,
nargs='+',
help="The jsonline files list for train.")
parser.add_argument(
"--test", required=True, help="The jsonline file for test")
parser.add_argument(
"--target_dir",
default=None,
required=True,
help="The target directory stores the csv files and meta file.")
parser.add_argument(
"--config",
default=None,
required=True,
type=str,
help="configuration file")
args = parser.parse_args()
# parse the yaml config file
config = CfgNode(new_allowed=True)
if args.config:
config.merge_from_file(args.config)
# prepare the csv file from jsonlines files
prepare_data(args, config)

@ -18,24 +18,22 @@ set -e
#######################################################################
# stage 0: data prepare, including voxceleb1 download and generate {train,dev,enroll,test}.csv
# voxceleb2 data is m4a format, so we need user to convert the m4a to wav yourselves as described in Readme.md with the script local/convert.sh
# voxceleb2 data is in m4a format, so we need to convert the m4a files to wav ourselves with the script local/convert.sh
# stage 1: train the speaker identification model
# stage 2: test speaker identification
# stage 3: extract the training embedding to train the LDA and PLDA
# stage 3: (todo) extract the training embedding to train the LDA and PLDA
######################################################################
# we can set the variable PPAUDIO_HOME to specify the root directory of the downloaded vox1 and vox2 datasets
# by default the datasets will be stored in ~/.paddleaudio/
# the vox2 dataset is stored in m4a format; we need to convert the audio from m4a to wav ourselves
# and put all of them to ${PPAUDIO_HOME}/datasets/vox2
# we will find the wav from ${PPAUDIO_HOME}/datasets/vox1/wav and ${PPAUDIO_HOME}/datasets/vox2/wav
# export PPAUDIO_HOME=
# and put all of them to ${MAIN_ROOT}/datasets/vox2
# we will find the wav from ${MAIN_ROOT}/datasets/vox1/{dev,test}/wav and ${MAIN_ROOT}/datasets/vox2/wav
stage=0
stop_stage=50
# data directory
# if we set the variable ${dir}, we will store the wav info to this directory
# otherwise, we will store the wav info to vox1 and vox2 directory respectively
# otherwise, we will store the wav info to data/vox1 and data/vox2 directory respectively
# vox2 wav path, we must convert the m4a format to wav format
dir=data/ # data info directory
@ -64,6 +62,6 @@ if [ $stage -le 2 ] && [ ${stop_stage} -ge 2 ]; then
fi
# if [ $stage -le 3 ]; then
# # stage 2: extract the training embedding to train the LDA and PLDA
# # stage 3: extract the training embedding to train the LDA and PLDA
# # todo: extract the training embedding
# fi

@ -341,7 +341,7 @@ def stft(x: np.ndarray,
hop_length (Optional[int], optional): Number of steps to advance between adjacent windows. Defaults to None.
win_length (Optional[int], optional): The size of window. Defaults to None.
window (str, optional): A string of window specification. Defaults to "hann".
center (bool, optional): Whether to pad `x` to make that the :math:`t \times hop\_length` at the center of `t`-th frame. Defaults to True.
center (bool, optional): Whether to pad `x` to make that the :math:`t \times hop\\_length` at the center of `t`-th frame. Defaults to True.
dtype (type, optional): Data type of STFT results. Defaults to np.complex64.
pad_mode (str, optional): Choose padding pattern when `center` is `True`. Defaults to "reflect".
@ -509,7 +509,7 @@ def melspectrogram(x: np.ndarray,
fmin (float, optional): Minimum frequency in Hz. Defaults to 50.0.
fmax (Optional[float], optional): Maximum frequency in Hz. Defaults to None.
window (str, optional): A string of window specification. Defaults to "hann".
center (bool, optional): Whether to pad `x` to make that the :math:`t \times hop\_length` at the center of `t`-th frame. Defaults to True.
center (bool, optional): Whether to pad `x` to make that the :math:`t \times hop\\_length` at the center of `t`-th frame. Defaults to True.
pad_mode (str, optional): Choose padding pattern when `center` is `True`. Defaults to "reflect".
power (float, optional): Exponent for the magnitude melspectrogram. Defaults to 2.0.
to_db (bool, optional): Enable db scale. Defaults to True.
@ -564,7 +564,7 @@ def spectrogram(x: np.ndarray,
window_size (int, optional): Size of FFT and window length. Defaults to 512.
hop_length (int, optional): Number of steps to advance between adjacent windows. Defaults to 320.
window (str, optional): A string of window specification. Defaults to "hann".
center (bool, optional): Whether to pad `x` to make that the :math:`t \times hop\_length` at the center of `t`-th frame. Defaults to True.
center (bool, optional): Whether to pad `x` to make that the :math:`t \times hop\\_length` at the center of `t`-th frame. Defaults to True.
pad_mode (str, optional): Choose padding pattern when `center` is `True`. Defaults to "reflect".
power (float, optional): Exponent for the magnitude melspectrogram. Defaults to 2.0.

@ -261,7 +261,7 @@ class VoxCeleb(Dataset):
output_file: str,
split_chunks: bool=True):
print(f'Generating csv: {output_file}')
header = ["ID", "duration", "wav", "start", "stop", "spk_id"]
header = ["id", "duration", "wav", "start", "stop", "spk_id"]
# Note: this may cause a C++ exception, but the program will execute fine,
# so we can ignore the exception
with Pool(cpu_count()) as p:

@ -42,7 +42,7 @@ class Spectrogram(nn.Layer):
win_length (Optional[int], optional): The window length of the short time FFT. If `None`, it is set to same as `n_fft`. Defaults to None.
window (str, optional): The window function applied to the signal before the Fourier transform. Supported window functions: 'hamming', 'hann', 'kaiser', 'gaussian', 'exponential', 'triang', 'bohman', 'blackman', 'cosine', 'tukey', 'taylor'. Defaults to 'hann'.
power (float, optional): Exponent for the magnitude spectrogram. Defaults to 2.0.
center (bool, optional): Whether to pad `x` to make that the :math:`t \times hop\_length` at the center of `t`-th frame. Defaults to True.
center (bool, optional): Whether to pad `x` to make that the :math:`t \times hop\\_length` at the center of `t`-th frame. Defaults to True.
pad_mode (str, optional): Choose padding pattern when `center` is `True`. Defaults to 'reflect'.
dtype (str, optional): Data type of input and window. Defaults to 'float32'.
"""
@ -99,7 +99,7 @@ class MelSpectrogram(nn.Layer):
win_length (Optional[int], optional): The window length of the short time FFT. If `None`, it is set to same as `n_fft`. Defaults to None.
window (str, optional): The window function applied to the signal before the Fourier transform. Supported window functions: 'hamming', 'hann', 'kaiser', 'gaussian', 'exponential', 'triang', 'bohman', 'blackman', 'cosine', 'tukey', 'taylor'. Defaults to 'hann'.
power (float, optional): Exponent for the magnitude spectrogram. Defaults to 2.0.
center (bool, optional): Whether to pad `x` to make that the :math:`t \times hop\_length` at the center of `t`-th frame. Defaults to True.
center (bool, optional): Whether to pad `x` to make that the :math:`t \times hop\\_length` at the center of `t`-th frame. Defaults to True.
pad_mode (str, optional): Choose padding pattern when `center` is `True`. Defaults to 'reflect'.
n_mels (int, optional): Number of mel bins. Defaults to 64.
f_min (float, optional): Minimum frequency in Hz. Defaults to 50.0.
@ -176,7 +176,7 @@ class LogMelSpectrogram(nn.Layer):
win_length (Optional[int], optional): The window length of the short time FFT. If `None`, it is set to same as `n_fft`. Defaults to None.
window (str, optional): The window function applied to the signal before the Fourier transform. Supported window functions: 'hamming', 'hann', 'kaiser', 'gaussian', 'exponential', 'triang', 'bohman', 'blackman', 'cosine', 'tukey', 'taylor'. Defaults to 'hann'.
power (float, optional): Exponent for the magnitude spectrogram. Defaults to 2.0.
center (bool, optional): Whether to pad `x` to make that the :math:`t \times hop\_length` at the center of `t`-th frame. Defaults to True.
center (bool, optional): Whether to pad `x` to make that the :math:`t \times hop\\_length` at the center of `t`-th frame. Defaults to True.
pad_mode (str, optional): Choose padding pattern when `center` is `True`. Defaults to 'reflect'.
n_mels (int, optional): Number of mel bins. Defaults to 64.
f_min (float, optional): Minimum frequency in Hz. Defaults to 50.0.
@ -257,7 +257,7 @@ class MFCC(nn.Layer):
win_length (Optional[int], optional): The window length of the short time FFT. If `None`, it is set to same as `n_fft`. Defaults to None.
window (str, optional): The window function applied to the signal before the Fourier transform. Supported window functions: 'hamming', 'hann', 'kaiser', 'gaussian', 'exponential', 'triang', 'bohman', 'blackman', 'cosine', 'tukey', 'taylor'. Defaults to 'hann'.
power (float, optional): Exponent for the magnitude spectrogram. Defaults to 2.0.
center (bool, optional): Whether to pad `x` to make that the :math:`t \times hop\_length` at the center of `t`-th frame. Defaults to True.
center (bool, optional): Whether to pad `x` to make that the :math:`t \times hop\\_length` at the center of `t`-th frame. Defaults to True.
pad_mode (str, optional): Choose padding pattern when `center` is `True`. Defaults to 'reflect'.
n_mels (int, optional): Number of mel bins. Defaults to 64.
f_min (float, optional): Minimum frequency in Hz. Defaults to 50.0.

@ -14,4 +14,3 @@
from .dtw import dtw_distance
from .eer import compute_eer
from .eer import compute_minDCF
from .mcd import mcd_distance

@ -1,63 +0,0 @@
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from typing import Callable
import mcd.metrics_fast as mt
import numpy as np
from mcd import dtw
__all__ = [
'mcd_distance',
]
def mcd_distance(xs: np.ndarray,
ys: np.ndarray,
cost_fn: Callable=mt.logSpecDbDist) -> float:
"""Mel cepstral distortion (MCD), dtw distance.
Dynamic Time Warping.
Uses dynamic programming to compute:
Examples:
.. code-block:: python
wps[i, j] = cost_fn(xs[i], ys[j]) + min(
wps[i-1, j ], // vertical / insertion / expansion
wps[i , j-1], // horizontal / deletion / compression
wps[i-1, j-1]) // diagonal / match
dtw = sqrt(wps[-1, -1])
Cost Function:
Examples:
.. code-block:: python
logSpecDbConst = 10.0 / math.log(10.0) * math.sqrt(2.0)
def logSpecDbDist(x, y):
diff = x - y
return logSpecDbConst * math.sqrt(np.inner(diff, diff))
Args:
xs (np.ndarray): ref sequence, [T,D]
ys (np.ndarray): hyp sequence, [T,D]
cost_fn (Callable, optional): Cost function. Defaults to mt.logSpecDbDist.
Returns:
float: dtw distance
"""
min_cost, path = dtw.dtw(xs, ys, cost_fn)
return min_cost

@ -0,0 +1,30 @@
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import numpy as np
def pcm16to32(audio: np.ndarray) -> np.ndarray:
"""pcm int16 to float32
Args:
audio (np.ndarray): Waveform with dtype of int16.
Returns:
np.ndarray: Waveform with dtype of float32.
"""
if audio.dtype == np.int16:
audio = audio.astype("float32")
bits = np.iinfo(np.int16).bits
audio = audio / (2**(bits - 1))
return audio
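A small usage sketch for `pcm16to32`, assuming an int16 wav file on disk and the `soundfile` package (the file name is hypothetical):
```python
import soundfile as sf

audio_int16, sr = sf.read("input.wav", dtype="int16")  # int16 samples
audio_float32 = pcm16to32(audio_int16)                 # now float32 in [-1.0, 1.0)
```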

@ -19,7 +19,7 @@ from setuptools.command.install import install
from setuptools.command.test import test
# set the version here
VERSION = '0.2.0'
VERSION = '0.2.1'
# Inspired by the example at https://pytest.org/latest/goodpractises.html
@ -83,8 +83,7 @@ setuptools.setup(
python_requires='>=3.6',
install_requires=[
'numpy >= 1.15.0', 'scipy >= 1.0.0', 'resampy >= 0.2.2',
'soundfile >= 0.9.0', 'colorlog', 'dtaidistance == 2.3.1', 'mcd >= 0.4',
'pathos'
'soundfile >= 0.9.0', 'colorlog', 'dtaidistance == 2.3.1', 'pathos'
],
extras_require={
'test': [

@ -29,9 +29,10 @@ from ..download import get_path_from_url
from ..executor import BaseExecutor
from ..log import logger
from ..utils import cli_register
from ..utils import download_and_decompress
from ..utils import MODEL_HOME
from ..utils import stats_wrapper
from .pretrained_models import model_alias
from .pretrained_models import pretrained_models
from paddlespeech.s2t.frontend.featurizer.text_featurizer import TextFeaturizer
from paddlespeech.s2t.transform.transformation import Transformation
from paddlespeech.s2t.utils.dynamic_import import dynamic_import
@ -39,94 +40,13 @@ from paddlespeech.s2t.utils.utility import UpdateConfig
__all__ = ['ASRExecutor']
pretrained_models = {
# The tags for pretrained_models should be "{model_name}[_{dataset}][-{lang}][-...]".
# e.g. "conformer_wenetspeech-zh-16k" and "panns_cnn6-32k".
# Command line and python api use "{model_name}[_{dataset}]" as --model, usage:
# "paddlespeech asr --model conformer_wenetspeech --lang zh --sr 16000 --input ./input.wav"
"conformer_wenetspeech-zh-16k": {
'url':
'https://paddlespeech.bj.bcebos.com/s2t/wenetspeech/asr1_conformer_wenetspeech_ckpt_0.1.1.model.tar.gz',
'md5':
'76cb19ed857e6623856b7cd7ebbfeda4',
'cfg_path':
'model.yaml',
'ckpt_path':
'exp/conformer/checkpoints/wenetspeech',
},
"transformer_librispeech-en-16k": {
'url':
'https://paddlespeech.bj.bcebos.com/s2t/librispeech/asr1/asr1_transformer_librispeech_ckpt_0.1.1.model.tar.gz',
'md5':
'2c667da24922aad391eacafe37bc1660',
'cfg_path':
'model.yaml',
'ckpt_path':
'exp/transformer/checkpoints/avg_10',
},
"deepspeech2offline_aishell-zh-16k": {
'url':
'https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_aishell_ckpt_0.1.1.model.tar.gz',
'md5':
'932c3593d62fe5c741b59b31318aa314',
'cfg_path':
'model.yaml',
'ckpt_path':
'exp/deepspeech2/checkpoints/avg_1',
'lm_url':
'https://deepspeech.bj.bcebos.com/zh_lm/zh_giga.no_cna_cmn.prune01244.klm',
'lm_md5':
'29e02312deb2e59b3c8686c7966d4fe3'
},
"deepspeech2online_aishell-zh-16k": {
'url':
'https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_online_aishell_ckpt_0.1.1.model.tar.gz',
'md5':
'd5e076217cf60486519f72c217d21b9b',
'cfg_path':
'model.yaml',
'ckpt_path':
'exp/deepspeech2_online/checkpoints/avg_1',
'lm_url':
'https://deepspeech.bj.bcebos.com/zh_lm/zh_giga.no_cna_cmn.prune01244.klm',
'lm_md5':
'29e02312deb2e59b3c8686c7966d4fe3'
},
"deepspeech2offline_librispeech-en-16k": {
'url':
'https://paddlespeech.bj.bcebos.com/s2t/librispeech/asr0/asr0_deepspeech2_librispeech_ckpt_0.1.1.model.tar.gz',
'md5':
'f5666c81ad015c8de03aac2bc92e5762',
'cfg_path':
'model.yaml',
'ckpt_path':
'exp/deepspeech2/checkpoints/avg_1',
'lm_url':
'https://deepspeech.bj.bcebos.com/en_lm/common_crawl_00.prune01111.trie.klm',
'lm_md5':
'099a601759d467cd0a8523ff939819c5'
},
}
model_alias = {
"deepspeech2offline":
"paddlespeech.s2t.models.ds2:DeepSpeech2Model",
"deepspeech2online":
"paddlespeech.s2t.models.ds2_online:DeepSpeech2ModelOnline",
"conformer":
"paddlespeech.s2t.models.u2:U2Model",
"transformer":
"paddlespeech.s2t.models.u2:U2Model",
"wenetspeech":
"paddlespeech.s2t.models.u2:U2Model",
}
@cli_register(
name='paddlespeech.asr', description='Speech to text infer command.')
class ASRExecutor(BaseExecutor):
def __init__(self):
super(ASRExecutor, self).__init__()
super().__init__()
self.model_alias = model_alias
self.pretrained_models = pretrained_models
self.parser = argparse.ArgumentParser(
prog='paddlespeech.asr', add_help=True)
@ -136,7 +56,9 @@ class ASRExecutor(BaseExecutor):
'--model',
type=str,
default='conformer_wenetspeech',
choices=[tag[:tag.index('-')] for tag in pretrained_models.keys()],
choices=[
tag[:tag.index('-')] for tag in self.pretrained_models.keys()
],
help='Choose model type of asr task.')
self.parser.add_argument(
'--lang',
@ -192,23 +114,6 @@ class ASRExecutor(BaseExecutor):
action='store_true',
help='Increase logger verbosity of current task.')
def _get_pretrained_path(self, tag: str) -> os.PathLike:
"""
Download and returns pretrained resources path of current task.
"""
support_models = list(pretrained_models.keys())
assert tag in pretrained_models, 'The model "{}" you want to use has not been supported, please choose other models.\nThe support models includes:\n\t\t{}\n'.format(
tag, '\n\t\t'.join(support_models))
res_path = os.path.join(MODEL_HOME, tag)
decompressed_path = download_and_decompress(pretrained_models[tag],
res_path)
decompressed_path = os.path.abspath(decompressed_path)
logger.info(
'Use pretrained model stored in: {}'.format(decompressed_path))
return decompressed_path
def _init_from_path(self,
model_type: str='wenetspeech',
lang: str='zh',
@ -219,6 +124,7 @@ class ASRExecutor(BaseExecutor):
"""
Init model and other resources from a specific path.
"""
logger.info("start to init the model")
if hasattr(self, 'model'):
logger.info('Model had been initialized.')
return
@ -228,19 +134,21 @@ class ASRExecutor(BaseExecutor):
tag = model_type + '-' + lang + '-' + sample_rate_str
res_path = self._get_pretrained_path(tag) # wenetspeech_zh
self.res_path = res_path
self.cfg_path = os.path.join(res_path,
pretrained_models[tag]['cfg_path'])
self.cfg_path = os.path.join(
res_path, self.pretrained_models[tag]['cfg_path'])
self.ckpt_path = os.path.join(
res_path, pretrained_models[tag]['ckpt_path'] + ".pdparams")
res_path,
self.pretrained_models[tag]['ckpt_path'] + ".pdparams")
logger.info(res_path)
logger.info(self.cfg_path)
logger.info(self.ckpt_path)
else:
self.cfg_path = os.path.abspath(cfg_path)
self.ckpt_path = os.path.abspath(ckpt_path + ".pdparams")
self.res_path = os.path.dirname(
os.path.dirname(os.path.abspath(self.cfg_path)))
logger.info(self.cfg_path)
logger.info(self.ckpt_path)
#Init body.
self.config = CfgNode(new_allowed=True)
self.config.merge_from_file(self.cfg_path)
@ -255,8 +163,8 @@ class ASRExecutor(BaseExecutor):
self.collate_fn_test = SpeechCollator.from_config(self.config)
self.text_feature = TextFeaturizer(
unit_type=self.config.unit_type, vocab=self.vocab)
lm_url = pretrained_models[tag]['lm_url']
lm_md5 = pretrained_models[tag]['lm_md5']
lm_url = self.pretrained_models[tag]['lm_url']
lm_md5 = self.pretrained_models[tag]['lm_md5']
self.download_lm(
lm_url,
os.path.dirname(self.config.decode.lang_model_path), lm_md5)
@ -269,12 +177,11 @@ class ASRExecutor(BaseExecutor):
vocab=self.config.vocab_filepath,
spm_model_prefix=self.config.spm_model_prefix)
self.config.decode.decoding_method = decode_method
else:
raise Exception("wrong type")
model_name = model_type[:model_type.rindex(
'_')] # model_type: {model_name}_{dataset}
model_class = dynamic_import(model_name, model_alias)
model_class = dynamic_import(model_name, self.model_alias)
model_conf = self.config
model = model_class.from_config(model_conf)
self.model = model
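
Here dynamic_import turns an alias key (or a "module:Class" string) into the class object. A rough sketch of the behavior the table relies on, paraphrasing rather than quoting the actual helper:

import importlib

def dynamic_import_sketch(import_path: str, alias: dict):
    # Alias keys such as "conformer" map to "module.path:ClassName" strings.
    if ":" not in import_path:
        import_path = alias[import_path]
    module_name, obj_name = import_path.split(":")
    return getattr(importlib.import_module(module_name), obj_name)

# model_type "conformer_wenetspeech" -> model_name "conformer" -> U2Model
model_class = dynamic_import_sketch("conformer", model_alias)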
@ -347,12 +254,14 @@ class ASRExecutor(BaseExecutor):
else:
raise Exception("wrong type")
logger.info("audio feat process success")
@paddle.no_grad()
def infer(self, model_type: str):
"""
Model inference and result stored in self.output.
"""
logger.info("start to infer the model to get the output")
cfg = self.config.decode
audio = self._inputs["audio"]
audio_len = self._inputs["audio_len"]
@ -369,17 +278,22 @@ class ASRExecutor(BaseExecutor):
self._outputs["result"] = result_transcripts[0]
elif "conformer" in model_type or "transformer" in model_type:
result_transcripts = self.model.decode(
audio,
audio_len,
text_feature=self.text_feature,
decoding_method=cfg.decoding_method,
beam_size=cfg.beam_size,
ctc_weight=cfg.ctc_weight,
decoding_chunk_size=cfg.decoding_chunk_size,
num_decoding_left_chunks=cfg.num_decoding_left_chunks,
simulate_streaming=cfg.simulate_streaming)
self._outputs["result"] = result_transcripts[0][0]
logger.info(f"we will use the transformer like model : {model_type}")
try:
result_transcripts = self.model.decode(
audio,
audio_len,
text_feature=self.text_feature,
decoding_method=cfg.decoding_method,
beam_size=cfg.beam_size,
ctc_weight=cfg.ctc_weight,
decoding_chunk_size=cfg.decoding_chunk_size,
num_decoding_left_chunks=cfg.num_decoding_left_chunks,
simulate_streaming=cfg.simulate_streaming)
self._outputs["result"] = result_transcripts[0][0]
except Exception as e:
logger.exception(e)
else:
raise Exception("invalid model name")
@ -426,6 +340,11 @@ class ASRExecutor(BaseExecutor):
try:
audio, audio_sample_rate = soundfile.read(
audio_file, dtype="int16", always_2d=True)
audio_duration = audio.shape[0] / audio_sample_rate
max_duration = 50.0
if audio_duration >= max_duration:
logger.error("Please input audio file less then 50 seconds.\n")
return
except Exception as e:
logger.exception(e)
logger.error(

@ -0,0 +1,97 @@
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
pretrained_models = {
# The tags for pretrained_models should be "{model_name}[_{dataset}][-{lang}][-...]".
# e.g. "conformer_wenetspeech-zh-16k" and "panns_cnn6-32k".
# Command line and python api use "{model_name}[_{dataset}]" as --model, usage:
# "paddlespeech asr --model conformer_wenetspeech --lang zh --sr 16000 --input ./input.wav"
"conformer_wenetspeech-zh-16k": {
'url':
'https://paddlespeech.bj.bcebos.com/s2t/wenetspeech/asr1_conformer_wenetspeech_ckpt_0.1.1.model.tar.gz',
'md5':
'76cb19ed857e6623856b7cd7ebbfeda4',
'cfg_path':
'model.yaml',
'ckpt_path':
'exp/conformer/checkpoints/wenetspeech',
},
"transformer_librispeech-en-16k": {
'url':
'https://paddlespeech.bj.bcebos.com/s2t/librispeech/asr1/asr1_transformer_librispeech_ckpt_0.1.1.model.tar.gz',
'md5':
'2c667da24922aad391eacafe37bc1660',
'cfg_path':
'model.yaml',
'ckpt_path':
'exp/transformer/checkpoints/avg_10',
},
"deepspeech2offline_aishell-zh-16k": {
'url':
'https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_aishell_ckpt_0.1.1.model.tar.gz',
'md5':
'932c3593d62fe5c741b59b31318aa314',
'cfg_path':
'model.yaml',
'ckpt_path':
'exp/deepspeech2/checkpoints/avg_1',
'lm_url':
'https://deepspeech.bj.bcebos.com/zh_lm/zh_giga.no_cna_cmn.prune01244.klm',
'lm_md5':
'29e02312deb2e59b3c8686c7966d4fe3'
},
"deepspeech2online_aishell-zh-16k": {
'url':
'https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_online_aishell_ckpt_0.2.0.model.tar.gz',
'md5':
'23e16c69730a1cb5d735c98c83c21e16',
'cfg_path':
'model.yaml',
'ckpt_path':
'exp/deepspeech2_online/checkpoints/avg_1',
'lm_url':
'https://deepspeech.bj.bcebos.com/zh_lm/zh_giga.no_cna_cmn.prune01244.klm',
'lm_md5':
'29e02312deb2e59b3c8686c7966d4fe3'
},
"deepspeech2offline_librispeech-en-16k": {
'url':
'https://paddlespeech.bj.bcebos.com/s2t/librispeech/asr0/asr0_deepspeech2_librispeech_ckpt_0.1.1.model.tar.gz',
'md5':
'f5666c81ad015c8de03aac2bc92e5762',
'cfg_path':
'model.yaml',
'ckpt_path':
'exp/deepspeech2/checkpoints/avg_1',
'lm_url':
'https://deepspeech.bj.bcebos.com/en_lm/common_crawl_00.prune01111.trie.klm',
'lm_md5':
'099a601759d467cd0a8523ff939819c5'
},
}
model_alias = {
"deepspeech2offline":
"paddlespeech.s2t.models.ds2:DeepSpeech2Model",
"deepspeech2online":
"paddlespeech.s2t.models.ds2_online:DeepSpeech2ModelOnline",
"conformer":
"paddlespeech.s2t.models.u2:U2Model",
"conformer_online":
"paddlespeech.s2t.models.u2:U2Model",
"transformer":
"paddlespeech.s2t.models.u2:U2Model",
"wenetspeech":
"paddlespeech.s2t.models.u2:U2Model",
}
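
Executors consume these tables by assembling a tag from model, language, and sample rate and indexing it; for example (tag format and values as defined above, absolute import path assumed):

from paddlespeech.cli.asr.pretrained_models import pretrained_models

tag = "conformer_wenetspeech" + "-" + "zh" + "-" + "16k"
# The CLI --model choices strip everything from the first '-':
choices = [t[:t.index("-")] for t in pretrained_models]
assert "conformer_wenetspeech" in choices
cfg = pretrained_models[tag]["cfg_path"]    # "model.yaml"
ckpt = pretrained_models[tag]["ckpt_path"]  # "exp/conformer/checkpoints/wenetspeech"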

@ -25,55 +25,23 @@ import yaml
from ..executor import BaseExecutor
from ..log import logger
from ..utils import cli_register
from ..utils import download_and_decompress
from ..utils import MODEL_HOME
from ..utils import stats_wrapper
from .pretrained_models import model_alias
from .pretrained_models import pretrained_models
from paddleaudio import load
from paddleaudio.features import LogMelSpectrogram
from paddlespeech.s2t.utils.dynamic_import import dynamic_import
__all__ = ['CLSExecutor']
pretrained_models = {
# The tags for pretrained_models should be "{model_name}[_{dataset}][-{lang}][-...]".
# e.g. "conformer_wenetspeech-zh-16k", "transformer_aishell-zh-16k" and "panns_cnn6-32k".
# Command line and python api use "{model_name}[_{dataset}]" as --model, usage:
# "paddlespeech asr --model conformer_wenetspeech --lang zh --sr 16000 --input ./input.wav"
"panns_cnn6-32k": {
'url': 'https://paddlespeech.bj.bcebos.com/cls/panns_cnn6.tar.gz',
'md5': '4cf09194a95df024fd12f84712cf0f9c',
'cfg_path': 'panns.yaml',
'ckpt_path': 'cnn6.pdparams',
'label_file': 'audioset_labels.txt',
},
"panns_cnn10-32k": {
'url': 'https://paddlespeech.bj.bcebos.com/cls/panns_cnn10.tar.gz',
'md5': 'cb8427b22176cc2116367d14847f5413',
'cfg_path': 'panns.yaml',
'ckpt_path': 'cnn10.pdparams',
'label_file': 'audioset_labels.txt',
},
"panns_cnn14-32k": {
'url': 'https://paddlespeech.bj.bcebos.com/cls/panns_cnn14.tar.gz',
'md5': 'e3b9b5614a1595001161d0ab95edee97',
'cfg_path': 'panns.yaml',
'ckpt_path': 'cnn14.pdparams',
'label_file': 'audioset_labels.txt',
},
}
model_alias = {
"panns_cnn6": "paddlespeech.cls.models.panns:CNN6",
"panns_cnn10": "paddlespeech.cls.models.panns:CNN10",
"panns_cnn14": "paddlespeech.cls.models.panns:CNN14",
}
@cli_register(
name='paddlespeech.cls', description='Audio classification infer command.')
class CLSExecutor(BaseExecutor):
def __init__(self):
super(CLSExecutor, self).__init__()
super().__init__()
self.model_alias = model_alias
self.pretrained_models = pretrained_models
self.parser = argparse.ArgumentParser(
prog='paddlespeech.cls', add_help=True)
@ -83,7 +51,9 @@ class CLSExecutor(BaseExecutor):
'--model',
type=str,
default='panns_cnn14',
choices=[tag[:tag.index('-')] for tag in pretrained_models.keys()],
choices=[
tag[:tag.index('-')] for tag in self.pretrained_models.keys()
],
help='Choose model type of cls task.')
self.parser.add_argument(
'--config',
@ -121,23 +91,6 @@ class CLSExecutor(BaseExecutor):
action='store_true',
help='Increase logger verbosity of current task.')
def _get_pretrained_path(self, tag: str) -> os.PathLike:
"""
Download and returns pretrained resources path of current task.
"""
support_models = list(pretrained_models.keys())
assert tag in pretrained_models, 'The model "{}" you want to use has not been supported, please choose other models.\nThe support models includes:\n\t\t{}\n'.format(
tag, '\n\t\t'.join(support_models))
res_path = os.path.join(MODEL_HOME, tag)
decompressed_path = download_and_decompress(pretrained_models[tag],
res_path)
decompressed_path = os.path.abspath(decompressed_path)
logger.info(
'Use pretrained model stored in: {}'.format(decompressed_path))
return decompressed_path
def _init_from_path(self,
model_type: str='panns_cnn14',
cfg_path: Optional[os.PathLike]=None,
@ -153,12 +106,12 @@ class CLSExecutor(BaseExecutor):
if label_file is None or ckpt_path is None:
tag = model_type + '-' + '32k' # panns_cnn14-32k
self.res_path = self._get_pretrained_path(tag)
self.cfg_path = os.path.join(self.res_path,
pretrained_models[tag]['cfg_path'])
self.label_file = os.path.join(self.res_path,
pretrained_models[tag]['label_file'])
self.ckpt_path = os.path.join(self.res_path,
pretrained_models[tag]['ckpt_path'])
self.cfg_path = os.path.join(
self.res_path, self.pretrained_models[tag]['cfg_path'])
self.label_file = os.path.join(
self.res_path, self.pretrained_models[tag]['label_file'])
self.ckpt_path = os.path.join(
self.res_path, self.pretrained_models[tag]['ckpt_path'])
else:
self.cfg_path = os.path.abspath(cfg_path)
self.label_file = os.path.abspath(label_file)
@ -175,7 +128,7 @@ class CLSExecutor(BaseExecutor):
self._label_list.append(line.strip())
# model
model_class = dynamic_import(model_type, model_alias)
model_class = dynamic_import(model_type, self.model_alias)
model_dict = paddle.load(self.ckpt_path)
self.model = model_class(extract_embedding=False)
self.model.set_state_dict(model_dict)

@ -0,0 +1,47 @@
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
pretrained_models = {
# The tags for pretrained_models should be "{model_name}[_{dataset}][-{lang}][-...]".
# e.g. "conformer_wenetspeech-zh-16k", "transformer_aishell-zh-16k" and "panns_cnn6-32k".
# Command line and python api use "{model_name}[_{dataset}]" as --model, usage:
# "paddlespeech asr --model conformer_wenetspeech --lang zh --sr 16000 --input ./input.wav"
"panns_cnn6-32k": {
'url': 'https://paddlespeech.bj.bcebos.com/cls/panns_cnn6.tar.gz',
'md5': '4cf09194a95df024fd12f84712cf0f9c',
'cfg_path': 'panns.yaml',
'ckpt_path': 'cnn6.pdparams',
'label_file': 'audioset_labels.txt',
},
"panns_cnn10-32k": {
'url': 'https://paddlespeech.bj.bcebos.com/cls/panns_cnn10.tar.gz',
'md5': 'cb8427b22176cc2116367d14847f5413',
'cfg_path': 'panns.yaml',
'ckpt_path': 'cnn10.pdparams',
'label_file': 'audioset_labels.txt',
},
"panns_cnn14-32k": {
'url': 'https://paddlespeech.bj.bcebos.com/cls/panns_cnn14.tar.gz',
'md5': 'e3b9b5614a1595001161d0ab95edee97',
'cfg_path': 'panns.yaml',
'ckpt_path': 'cnn14.pdparams',
'label_file': 'audioset_labels.txt',
},
}
model_alias = {
"panns_cnn6": "paddlespeech.cls.models.panns:CNN6",
"panns_cnn10": "paddlespeech.cls.models.panns:CNN10",
"panns_cnn14": "paddlespeech.cls.models.panns:CNN14",
}

@ -25,6 +25,8 @@ from typing import Union
import paddle
from .log import logger
from .utils import download_and_decompress
from .utils import MODEL_HOME
class BaseExecutor(ABC):
@ -35,19 +37,8 @@ class BaseExecutor(ABC):
def __init__(self):
self._inputs = OrderedDict()
self._outputs = OrderedDict()
@abstractmethod
def _get_pretrained_path(self, tag: str) -> os.PathLike:
"""
Download and returns pretrained resources path of current task.
Args:
tag (str): A tag of pretrained model.
Returns:
os.PathLike: The path on which resources of pretrained model locate.
"""
pass
self.pretrained_models = OrderedDict()
self.model_alias = OrderedDict()
@abstractmethod
def _init_from_path(self, *args, **kwargs):
@ -227,3 +218,20 @@ class BaseExecutor(ABC):
]
for l in loggers:
l.disabled = True
    def _get_pretrained_path(self, tag: str) -> os.PathLike:
        """
        Download and return the pretrained resources path of the current task.
        """
        support_models = list(self.pretrained_models.keys())
        assert tag in self.pretrained_models, 'The model "{}" you want to use is not supported, please choose another model.\nThe supported models include:\n\t\t{}\n'.format(
            tag, '\n\t\t'.join(support_models))
        res_path = os.path.join(MODEL_HOME, tag)
        decompressed_path = download_and_decompress(self.pretrained_models[tag],
                                                    res_path)
        decompressed_path = os.path.abspath(decompressed_path)
        logger.info(
            'Use pretrained model stored in: {}'.format(decompressed_path))
        return decompressed_path
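
Subclasses now only register their tables in __init__ and inherit this downloader. A hypothetical minimal executor (name, URL, and md5 are placeholders; the remaining abstract methods are omitted):

class DemoExecutor(BaseExecutor):
    def __init__(self):
        super().__init__()
        self.pretrained_models = {
            "demo_model-en-16k": {
                "url": "https://example.com/demo_model.tar.gz",  # placeholder
                "md5": "<archive-md5>",
                "cfg_path": "model.yaml",
                "ckpt_path": "exp/demo/checkpoints/avg_1",
            }
        }

# Downloads to MODEL_HOME/demo_model-en-16k and returns the decompressed dir:
# res_path = DemoExecutor()._get_pretrained_path("demo_model-en-16k")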

@ -32,40 +32,24 @@ from ..utils import cli_register
from ..utils import download_and_decompress
from ..utils import MODEL_HOME
from ..utils import stats_wrapper
from .pretrained_models import kaldi_bins
from .pretrained_models import model_alias
from .pretrained_models import pretrained_models
from paddlespeech.s2t.frontend.featurizer.text_featurizer import TextFeaturizer
from paddlespeech.s2t.utils.dynamic_import import dynamic_import
from paddlespeech.s2t.utils.utility import UpdateConfig
__all__ = ["STExecutor"]
pretrained_models = {
"fat_st_ted-en-zh": {
"url":
"https://paddlespeech.bj.bcebos.com/s2t/ted_en_zh/st1/st1_transformer_mtl_noam_ted-en-zh_ckpt_0.1.1.model.tar.gz",
"md5":
"d62063f35a16d91210a71081bd2dd557",
"cfg_path":
"model.yaml",
"ckpt_path":
"exp/transformer_mtl_noam/checkpoints/fat_st_ted-en-zh.pdparams",
}
}
model_alias = {"fat_st": "paddlespeech.s2t.models.u2_st:U2STModel"}
kaldi_bins = {
"url":
"https://paddlespeech.bj.bcebos.com/s2t/ted_en_zh/st1/kaldi_bins.tar.gz",
"md5":
"c0682303b3f3393dbf6ed4c4e35a53eb",
}
@cli_register(
name="paddlespeech.st", description="Speech translation infer command.")
class STExecutor(BaseExecutor):
def __init__(self):
super(STExecutor, self).__init__()
super().__init__()
self.model_alias = model_alias
self.pretrained_models = pretrained_models
self.kaldi_bins = kaldi_bins
self.parser = argparse.ArgumentParser(
prog="paddlespeech.st", add_help=True)
@ -75,7 +59,9 @@ class STExecutor(BaseExecutor):
"--model",
type=str,
default="fat_st_ted",
choices=[tag[:tag.index('-')] for tag in pretrained_models.keys()],
choices=[
tag[:tag.index('-')] for tag in self.pretrained_models.keys()
],
help="Choose model type of st task.")
self.parser.add_argument(
"--src_lang",
@ -119,28 +105,11 @@ class STExecutor(BaseExecutor):
action='store_true',
help='Increase logger verbosity of current task.')
def _get_pretrained_path(self, tag: str) -> os.PathLike:
"""
Download and returns pretrained resources path of current task.
"""
support_models = list(pretrained_models.keys())
assert tag in pretrained_models, 'The model "{}" you want to use has not been supported, please choose other models.\nThe support models includes:\n\t\t{}\n'.format(
tag, '\n\t\t'.join(support_models))
res_path = os.path.join(MODEL_HOME, tag)
decompressed_path = download_and_decompress(pretrained_models[tag],
res_path)
decompressed_path = os.path.abspath(decompressed_path)
logger.info(
"Use pretrained model stored in: {}".format(decompressed_path))
return decompressed_path
def _set_kaldi_bins(self) -> os.PathLike:
"""
Download and returns kaldi_bins resources path of current task.
"""
decompressed_path = download_and_decompress(kaldi_bins, MODEL_HOME)
decompressed_path = download_and_decompress(self.kaldi_bins, MODEL_HOME)
decompressed_path = os.path.abspath(decompressed_path)
logger.info("Kaldi_bins stored in: {}".format(decompressed_path))
if "LD_LIBRARY_PATH" in os.environ:
@ -197,7 +166,7 @@ class STExecutor(BaseExecutor):
model_conf = self.config
model_name = model_type[:model_type.rindex(
'_')] # model_type: {model_name}_{dataset}
model_class = dynamic_import(model_name, model_alias)
model_class = dynamic_import(model_name, self.model_alias)
self.model = model_class.from_config(model_conf)
self.model.eval()

@ -0,0 +1,35 @@
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
pretrained_models = {
"fat_st_ted-en-zh": {
"url":
"https://paddlespeech.bj.bcebos.com/s2t/ted_en_zh/st1/st1_transformer_mtl_noam_ted-en-zh_ckpt_0.1.1.model.tar.gz",
"md5":
"d62063f35a16d91210a71081bd2dd557",
"cfg_path":
"model.yaml",
"ckpt_path":
"exp/transformer_mtl_noam/checkpoints/fat_st_ted-en-zh.pdparams",
}
}
model_alias = {"fat_st": "paddlespeech.s2t.models.u2_st:U2STModel"}
kaldi_bins = {
"url":
"https://paddlespeech.bj.bcebos.com/s2t/ted_en_zh/st1/kaldi_bins.tar.gz",
"md5":
"c0682303b3f3393dbf6ed4c4e35a53eb",
}

@ -16,7 +16,6 @@ from typing import List
from prettytable import PrettyTable
from ..log import logger
from ..utils import cli_register
from ..utils import stats_wrapper
@ -27,7 +26,8 @@ model_name_format = {
'cls': 'Model-Sample Rate',
'st': 'Model-Source language-Target language',
'text': 'Model-Task-Language',
'tts': 'Model-Language'
'tts': 'Model-Language',
'vector': 'Model-Sample Rate'
}
@ -36,18 +36,18 @@ model_name_format = {
description='Get speech tasks support models list.')
class StatsExecutor():
def __init__(self):
super(StatsExecutor, self).__init__()
super().__init__()
self.parser = argparse.ArgumentParser(
prog='paddlespeech.stats', add_help=True)
self.task_choices = ['asr', 'cls', 'st', 'text', 'tts', 'vector']
self.parser.add_argument(
'--task',
type=str,
default='asr',
choices=['asr', 'cls', 'st', 'text', 'tts'],
choices=self.task_choices,
help='Choose speech task.',
required=True)
self.task_choices = ['asr', 'cls', 'st', 'text', 'tts']
def show_support_models(self, pretrained_models: dict):
fields = model_name_format[self.task].split("-")
@ -61,73 +61,15 @@ class StatsExecutor():
Command line entry.
"""
parser_args = self.parser.parse_args(argv)
self.task = parser_args.task
if self.task not in self.task_choices:
logger.error(
"Please input correct speech task, choices = ['asr', 'cls', 'st', 'text', 'tts']"
)
has_exceptions = False
try:
self(parser_args.task)
except Exception as e:
has_exceptions = True
if has_exceptions:
return False
elif self.task == 'asr':
try:
from ..asr.infer import pretrained_models
logger.info(
"Here is the list of ASR pretrained models released by PaddleSpeech that can be used by command line and python API"
)
self.show_support_models(pretrained_models)
return True
except BaseException:
logger.error("Failed to get the list of ASR pretrained models.")
return False
elif self.task == 'cls':
try:
from ..cls.infer import pretrained_models
logger.info(
"Here is the list of CLS pretrained models released by PaddleSpeech that can be used by command line and python API"
)
self.show_support_models(pretrained_models)
return True
except BaseException:
logger.error("Failed to get the list of CLS pretrained models.")
return False
elif self.task == 'st':
try:
from ..st.infer import pretrained_models
logger.info(
"Here is the list of ST pretrained models released by PaddleSpeech that can be used by command line and python API"
)
self.show_support_models(pretrained_models)
return True
except BaseException:
logger.error("Failed to get the list of ST pretrained models.")
return False
elif self.task == 'text':
try:
from ..text.infer import pretrained_models
logger.info(
"Here is the list of TEXT pretrained models released by PaddleSpeech that can be used by command line and python API"
)
self.show_support_models(pretrained_models)
return True
except BaseException:
logger.error(
"Failed to get the list of TEXT pretrained models.")
return False
elif self.task == 'tts':
try:
from ..tts.infer import pretrained_models
logger.info(
"Here is the list of TTS pretrained models released by PaddleSpeech that can be used by command line and python API"
)
self.show_support_models(pretrained_models)
return True
except BaseException:
logger.error("Failed to get the list of TTS pretrained models.")
return False
else:
return True
@stats_wrapper
def __call__(
@ -138,13 +80,12 @@ class StatsExecutor():
"""
self.task = task
if self.task not in self.task_choices:
print(
"Please input correct speech task, choices = ['asr', 'cls', 'st', 'text', 'tts']"
)
print("Please input correct speech task, choices = " + str(
self.task_choices))
elif self.task == 'asr':
try:
from ..asr.infer import pretrained_models
from ..asr.pretrained_models import pretrained_models
print(
"Here is the list of ASR pretrained models released by PaddleSpeech that can be used by command line and python API"
)
@ -154,7 +95,7 @@ class StatsExecutor():
elif self.task == 'cls':
try:
from ..cls.infer import pretrained_models
from ..cls.pretrained_models import pretrained_models
print(
"Here is the list of CLS pretrained models released by PaddleSpeech that can be used by command line and python API"
)
@ -164,7 +105,7 @@ class StatsExecutor():
elif self.task == 'st':
try:
from ..st.infer import pretrained_models
from ..st.pretrained_models import pretrained_models
print(
"Here is the list of ST pretrained models released by PaddleSpeech that can be used by command line and python API"
)
@ -174,7 +115,7 @@ class StatsExecutor():
elif self.task == 'text':
try:
from ..text.infer import pretrained_models
from ..text.pretrained_models import pretrained_models
print(
"Here is the list of TEXT pretrained models released by PaddleSpeech that can be used by command line and python API"
)
@ -184,10 +125,22 @@ class StatsExecutor():
elif self.task == 'tts':
try:
from ..tts.infer import pretrained_models
from ..tts.pretrained_models import pretrained_models
print(
"Here is the list of TTS pretrained models released by PaddleSpeech that can be used by command line and python API"
)
self.show_support_models(pretrained_models)
except BaseException:
print("Failed to get the list of TTS pretrained models.")
elif self.task == 'vector':
try:
from ..vector.pretrained_models import pretrained_models
print(
"Here is the list of Speaker Recognition pretrained models released by PaddleSpeech that can be used by command line and python API"
)
self.show_support_models(pretrained_models)
except BaseException:
print(
"Failed to get the list of Speaker Recognition pretrained models."
)
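
With the new branch in place, the vector task is queried like the others. Assuming the package layout implied by the relative imports, a usage sketch:

# Python API: prints the Speaker Recognition pretrained-model table.
from paddlespeech.cli.stats.infer import StatsExecutor
StatsExecutor()(task="vector")

# CLI equivalent, assuming the registered command name:
#   paddlespeech stats --task vector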
