commit
c2ee6bc67d
@ -0,0 +1,164 @@
|
|||||||
|
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
"""Prepare VoxCeleb2 dataset
|
||||||
|
|
||||||
|
Download and unpack the voxceleb2 data files.
|
||||||
|
Voxceleb2 data is stored as the m4a format,
|
||||||
|
so we need convert the m4a to wav with the convert.sh scripts
|
||||||
|
"""
|
||||||
|
import argparse
|
||||||
|
import codecs
|
||||||
|
import glob
|
||||||
|
import json
|
||||||
|
import os
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
import soundfile
|
||||||
|
|
||||||
|
from utils.utility import download
|
||||||
|
from utils.utility import unzip
|
||||||
|
|
||||||
|
# all the data will be download in the current data/voxceleb directory default
|
||||||
|
DATA_HOME = os.path.expanduser('.')
|
||||||
|
|
||||||
|
BASE_URL = "--no-check-certificate https://www.robots.ox.ac.uk/~vgg/data/voxceleb/data/"
|
||||||
|
|
||||||
|
# dev data
|
||||||
|
DEV_DATA_URL = BASE_URL + '/vox2_aac.zip'
|
||||||
|
DEV_MD5SUM = "bbc063c46078a602ca71605645c2a402"
|
||||||
|
|
||||||
|
# test data
|
||||||
|
TEST_DATA_URL = BASE_URL + '/vox2_test_aac.zip'
|
||||||
|
TEST_MD5SUM = "0d2b3ea430a821c33263b5ea37ede312"
|
||||||
|
|
||||||
|
parser = argparse.ArgumentParser(description=__doc__)
|
||||||
|
parser.add_argument(
|
||||||
|
"--target_dir",
|
||||||
|
default=DATA_HOME + "/voxceleb2/",
|
||||||
|
type=str,
|
||||||
|
help="Directory to save the voxceleb1 dataset. (default: %(default)s)")
|
||||||
|
parser.add_argument(
|
||||||
|
"--manifest_prefix",
|
||||||
|
default="manifest",
|
||||||
|
type=str,
|
||||||
|
help="Filepath prefix for output manifests. (default: %(default)s)")
|
||||||
|
parser.add_argument(
|
||||||
|
"--download",
|
||||||
|
default=False,
|
||||||
|
action="store_true",
|
||||||
|
help="Download the voxceleb2 dataset. (default: %(default)s)")
|
||||||
|
parser.add_argument(
|
||||||
|
"--generate",
|
||||||
|
default=False,
|
||||||
|
action="store_true",
|
||||||
|
help="Generate the manifest files. (default: %(default)s)")
|
||||||
|
|
||||||
|
args = parser.parse_args()
|
||||||
|
|
||||||
|
|
||||||
|
def create_manifest(data_dir, manifest_path_prefix):
|
||||||
|
print("Creating manifest %s ..." % manifest_path_prefix)
|
||||||
|
json_lines = []
|
||||||
|
data_path = os.path.join(data_dir, "**", "*.wav")
|
||||||
|
total_sec = 0.0
|
||||||
|
total_text = 0.0
|
||||||
|
total_num = 0
|
||||||
|
speakers = set()
|
||||||
|
for audio_path in glob.glob(data_path, recursive=True):
|
||||||
|
audio_id = "-".join(audio_path.split("/")[-3:])
|
||||||
|
utt2spk = audio_path.split("/")[-3]
|
||||||
|
duration = soundfile.info(audio_path).duration
|
||||||
|
text = ""
|
||||||
|
json_lines.append(
|
||||||
|
json.dumps(
|
||||||
|
{
|
||||||
|
"utt": audio_id,
|
||||||
|
"utt2spk": str(utt2spk),
|
||||||
|
"feat": audio_path,
|
||||||
|
"feat_shape": (duration, ),
|
||||||
|
"text": text # compatible with asr data format
|
||||||
|
},
|
||||||
|
ensure_ascii=False))
|
||||||
|
|
||||||
|
total_sec += duration
|
||||||
|
total_text += len(text)
|
||||||
|
total_num += 1
|
||||||
|
speakers.add(utt2spk)
|
||||||
|
|
||||||
|
# data_dir_name refer to dev or test
|
||||||
|
# voxceleb2 is given explicit in the path
|
||||||
|
data_dir_name = Path(data_dir).name
|
||||||
|
manifest_path_prefix = manifest_path_prefix + "." + data_dir_name
|
||||||
|
|
||||||
|
if not os.path.exists(os.path.dirname(manifest_path_prefix)):
|
||||||
|
os.makedirs(os.path.dirname(manifest_path_prefix))
|
||||||
|
with codecs.open(manifest_path_prefix, 'w', encoding='utf-8') as f:
|
||||||
|
for line in json_lines:
|
||||||
|
f.write(line + "\n")
|
||||||
|
|
||||||
|
manifest_dir = os.path.dirname(manifest_path_prefix)
|
||||||
|
meta_path = os.path.join(manifest_dir, "voxceleb2." +
|
||||||
|
data_dir_name) + ".meta"
|
||||||
|
with codecs.open(meta_path, 'w', encoding='utf-8') as f:
|
||||||
|
print(f"{total_num} utts", file=f)
|
||||||
|
print(f"{len(speakers)} speakers", file=f)
|
||||||
|
print(f"{total_sec / (60 * 60)} h", file=f)
|
||||||
|
print(f"{total_text} text", file=f)
|
||||||
|
print(f"{total_text / total_sec} text/sec", file=f)
|
||||||
|
print(f"{total_sec / total_num} sec/utt", file=f)
|
||||||
|
|
||||||
|
|
||||||
|
def download_dataset(url, md5sum, target_dir, dataset):
|
||||||
|
if not os.path.exists(target_dir):
|
||||||
|
os.makedirs(target_dir)
|
||||||
|
|
||||||
|
# wav directory already exists, it need do nothing
|
||||||
|
print("target dir {}".format(os.path.join(target_dir, dataset)))
|
||||||
|
# unzip the dev dataset will create the dev and unzip the m4a to dev dir
|
||||||
|
# but the test dataset will unzip to aac
|
||||||
|
# so, wo create the ${target_dir}/test and unzip the m4a to test dir
|
||||||
|
if not os.path.exists(os.path.join(target_dir, dataset)):
|
||||||
|
filepath = download(url, md5sum, target_dir)
|
||||||
|
if dataset == "test":
|
||||||
|
unzip(filepath, os.path.join(target_dir, "test"))
|
||||||
|
|
||||||
|
|
||||||
|
def main():
|
||||||
|
if args.target_dir.startswith('~'):
|
||||||
|
args.target_dir = os.path.expanduser(args.target_dir)
|
||||||
|
|
||||||
|
# download and unpack the vox2-dev data
|
||||||
|
print("download: {}".format(args.download))
|
||||||
|
if args.download:
|
||||||
|
download_dataset(
|
||||||
|
url=DEV_DATA_URL,
|
||||||
|
md5sum=DEV_MD5SUM,
|
||||||
|
target_dir=args.target_dir,
|
||||||
|
dataset="dev")
|
||||||
|
|
||||||
|
download_dataset(
|
||||||
|
url=TEST_DATA_URL,
|
||||||
|
md5sum=TEST_MD5SUM,
|
||||||
|
target_dir=args.target_dir,
|
||||||
|
dataset="test")
|
||||||
|
|
||||||
|
print("VoxCeleb2 download is done!")
|
||||||
|
|
||||||
|
if args.generate:
|
||||||
|
create_manifest(
|
||||||
|
args.target_dir, manifest_path_prefix=args.manifest_prefix)
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == '__main__':
|
||||||
|
main()
|
Before Width: | Height: | Size: 80 KiB After Width: | Height: | Size: 50 KiB |
Before Width: | Height: | Size: 84 KiB After Width: | Height: | Size: 81 KiB |
@ -1,12 +1,13 @@
|
|||||||
soundfile==0.10.3.post1
|
|
||||||
librosa==0.8.0
|
|
||||||
numpy
|
|
||||||
pymysql
|
|
||||||
fastapi
|
|
||||||
uvicorn
|
|
||||||
diskcache==5.2.1
|
diskcache==5.2.1
|
||||||
|
dtaidistance==2.3.1
|
||||||
|
fastapi
|
||||||
|
librosa==0.8.0
|
||||||
|
numpy==1.21.0
|
||||||
|
pydantic
|
||||||
pymilvus==2.0.1
|
pymilvus==2.0.1
|
||||||
|
pymysql
|
||||||
python-multipart
|
python-multipart
|
||||||
typing
|
soundfile==0.10.3.post1
|
||||||
starlette
|
starlette
|
||||||
pydantic
|
typing
|
||||||
|
uvicorn
|
@ -0,0 +1,158 @@
|
|||||||
|
([简体中文](./README_cn.md)|English)
|
||||||
|
# Speech Verification)
|
||||||
|
|
||||||
|
## Introduction
|
||||||
|
|
||||||
|
Speaker Verification, refers to the problem of getting a speaker embedding from an audio.
|
||||||
|
|
||||||
|
This demo is an implementation to extract speaker embedding from a specific audio file. It can be done by a single command or a few lines in python using `PaddleSpeech`.
|
||||||
|
|
||||||
|
## Usage
|
||||||
|
### 1. Installation
|
||||||
|
see [installation](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/source/install.md).
|
||||||
|
|
||||||
|
You can choose one way from easy, meduim and hard to install paddlespeech.
|
||||||
|
|
||||||
|
### 2. Prepare Input File
|
||||||
|
The input of this demo should be a WAV file(`.wav`), and the sample rate must be the same as the model.
|
||||||
|
|
||||||
|
Here are sample files for this demo that can be downloaded:
|
||||||
|
```bash
|
||||||
|
wget -c https://paddlespeech.bj.bcebos.com/vector/audio/85236145389.wav
|
||||||
|
```
|
||||||
|
|
||||||
|
### 3. Usage
|
||||||
|
- Command Line(Recommended)
|
||||||
|
```bash
|
||||||
|
paddlespeech vector --task spk --input 85236145389.wav
|
||||||
|
|
||||||
|
echo -e "demo1 85236145389.wav" > vec.job
|
||||||
|
paddlespeech vector --task spk --input vec.job
|
||||||
|
|
||||||
|
echo -e "demo2 85236145389.wav \n demo3 85236145389.wav" | paddlespeech vector --task spk
|
||||||
|
```
|
||||||
|
|
||||||
|
Usage:
|
||||||
|
```bash
|
||||||
|
paddlespeech vector --help
|
||||||
|
```
|
||||||
|
Arguments:
|
||||||
|
- `input`(required): Audio file to recognize.
|
||||||
|
- `model`: Model type of vector task. Default: `ecapatdnn_voxceleb12`.
|
||||||
|
- `sample_rate`: Sample rate of the model. Default: `16000`.
|
||||||
|
- `config`: Config of vector task. Use pretrained model when it is None. Default: `None`.
|
||||||
|
- `ckpt_path`: Model checkpoint. Use pretrained model when it is None. Default: `None`.
|
||||||
|
- `device`: Choose device to execute model inference. Default: default device of paddlepaddle in current environment.
|
||||||
|
|
||||||
|
Output:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
demo [ -5.749211 9.505463 -8.200284 -5.2075014 5.3940268
|
||||||
|
-3.04878 1.611095 10.127234 -10.534177 -15.821609
|
||||||
|
1.2032688 -0.35080156 1.2629458 -12.643498 -2.5758228
|
||||||
|
-11.343508 2.3385992 -8.719341 14.213509 15.404744
|
||||||
|
-0.39327756 6.338786 2.688887 8.7104025 17.469526
|
||||||
|
-8.77959 7.0576906 4.648855 -1.3089896 -23.294737
|
||||||
|
8.013747 13.891729 -9.926753 5.655307 -5.9422326
|
||||||
|
-22.842539 0.6293588 -18.46266 -10.811862 9.8192625
|
||||||
|
3.0070958 3.8072643 -2.3861165 3.0821571 -14.739942
|
||||||
|
1.7594414 -0.6485091 4.485623 2.0207152 7.264915
|
||||||
|
-6.40137 23.63524 2.9711294 -22.708025 9.93719
|
||||||
|
20.354511 -10.324688 -0.700492 -8.783211 -5.27593
|
||||||
|
15.999649 3.3004563 12.747926 15.429879 4.7849145
|
||||||
|
5.6699696 -2.3826702 10.605882 3.9112158 3.1500628
|
||||||
|
15.859915 -2.1832209 -23.908653 -6.4799504 -4.5365124
|
||||||
|
-9.224193 14.568347 -10.568833 4.982321 -4.342062
|
||||||
|
0.0914714 12.645902 -5.74285 -3.2141201 -2.7173362
|
||||||
|
-6.680575 0.4757669 -5.035051 -6.7964664 16.865469
|
||||||
|
-11.54324 7.681869 0.44475392 9.708182 -8.932846
|
||||||
|
0.4123232 -4.361452 1.3948607 9.511665 0.11667654
|
||||||
|
2.9079323 6.049952 9.275183 -18.078873 6.2983274
|
||||||
|
-0.7500531 -2.725033 -7.6027865 3.3404543 2.990815
|
||||||
|
4.010979 11.000591 -2.8873312 7.1352735 -16.79663
|
||||||
|
18.495346 -14.293832 7.89578 2.2714825 22.976387
|
||||||
|
-4.875734 -3.0836344 -2.9999814 13.751918 6.448228
|
||||||
|
-11.924197 2.171869 2.0423572 -6.173772 10.778437
|
||||||
|
25.77281 -4.9495463 14.57806 0.3044315 2.6132357
|
||||||
|
-7.591999 -2.076944 9.025118 1.7834753 -3.1799617
|
||||||
|
-4.9401326 23.465864 5.1685796 -9.018578 9.037825
|
||||||
|
-4.4150195 6.859591 -12.274467 -0.88911164 5.186309
|
||||||
|
-3.9988663 -13.638606 -9.925445 -0.06329413 -3.6709652
|
||||||
|
-12.397416 -12.719869 -1.395601 2.1150916 5.7381287
|
||||||
|
-4.4691963 -3.82819 -0.84233856 -1.1604277 -13.490127
|
||||||
|
8.731719 -20.778936 -11.495662 5.8033476 -4.752041
|
||||||
|
10.833007 -6.717991 4.504732 13.4244375 1.1306485
|
||||||
|
7.3435574 1.400918 14.704036 -9.501399 7.2315617
|
||||||
|
-6.417456 1.3333273 11.872697 -0.30664724 8.8845
|
||||||
|
6.5569253 4.7948146 0.03662816 -8.704245 6.224871
|
||||||
|
-3.2701402 -11.508579 ]
|
||||||
|
```
|
||||||
|
|
||||||
|
- Python API
|
||||||
|
```python
|
||||||
|
import paddle
|
||||||
|
from paddlespeech.cli import VectorExecutor
|
||||||
|
|
||||||
|
vector_executor = VectorExecutor()
|
||||||
|
audio_emb = vector_executor(
|
||||||
|
model='ecapatdnn_voxceleb12',
|
||||||
|
sample_rate=16000,
|
||||||
|
config=None,
|
||||||
|
ckpt_path=None,
|
||||||
|
audio_file='./85236145389.wav',
|
||||||
|
force_yes=False,
|
||||||
|
device=paddle.get_device())
|
||||||
|
print('Audio embedding Result: \n{}'.format(audio_emb))
|
||||||
|
```
|
||||||
|
|
||||||
|
Output:
|
||||||
|
```bash
|
||||||
|
# Vector Result:
|
||||||
|
[ -5.749211 9.505463 -8.200284 -5.2075014 5.3940268
|
||||||
|
-3.04878 1.611095 10.127234 -10.534177 -15.821609
|
||||||
|
1.2032688 -0.35080156 1.2629458 -12.643498 -2.5758228
|
||||||
|
-11.343508 2.3385992 -8.719341 14.213509 15.404744
|
||||||
|
-0.39327756 6.338786 2.688887 8.7104025 17.469526
|
||||||
|
-8.77959 7.0576906 4.648855 -1.3089896 -23.294737
|
||||||
|
8.013747 13.891729 -9.926753 5.655307 -5.9422326
|
||||||
|
-22.842539 0.6293588 -18.46266 -10.811862 9.8192625
|
||||||
|
3.0070958 3.8072643 -2.3861165 3.0821571 -14.739942
|
||||||
|
1.7594414 -0.6485091 4.485623 2.0207152 7.264915
|
||||||
|
-6.40137 23.63524 2.9711294 -22.708025 9.93719
|
||||||
|
20.354511 -10.324688 -0.700492 -8.783211 -5.27593
|
||||||
|
15.999649 3.3004563 12.747926 15.429879 4.7849145
|
||||||
|
5.6699696 -2.3826702 10.605882 3.9112158 3.1500628
|
||||||
|
15.859915 -2.1832209 -23.908653 -6.4799504 -4.5365124
|
||||||
|
-9.224193 14.568347 -10.568833 4.982321 -4.342062
|
||||||
|
0.0914714 12.645902 -5.74285 -3.2141201 -2.7173362
|
||||||
|
-6.680575 0.4757669 -5.035051 -6.7964664 16.865469
|
||||||
|
-11.54324 7.681869 0.44475392 9.708182 -8.932846
|
||||||
|
0.4123232 -4.361452 1.3948607 9.511665 0.11667654
|
||||||
|
2.9079323 6.049952 9.275183 -18.078873 6.2983274
|
||||||
|
-0.7500531 -2.725033 -7.6027865 3.3404543 2.990815
|
||||||
|
4.010979 11.000591 -2.8873312 7.1352735 -16.79663
|
||||||
|
18.495346 -14.293832 7.89578 2.2714825 22.976387
|
||||||
|
-4.875734 -3.0836344 -2.9999814 13.751918 6.448228
|
||||||
|
-11.924197 2.171869 2.0423572 -6.173772 10.778437
|
||||||
|
25.77281 -4.9495463 14.57806 0.3044315 2.6132357
|
||||||
|
-7.591999 -2.076944 9.025118 1.7834753 -3.1799617
|
||||||
|
-4.9401326 23.465864 5.1685796 -9.018578 9.037825
|
||||||
|
-4.4150195 6.859591 -12.274467 -0.88911164 5.186309
|
||||||
|
-3.9988663 -13.638606 -9.925445 -0.06329413 -3.6709652
|
||||||
|
-12.397416 -12.719869 -1.395601 2.1150916 5.7381287
|
||||||
|
-4.4691963 -3.82819 -0.84233856 -1.1604277 -13.490127
|
||||||
|
8.731719 -20.778936 -11.495662 5.8033476 -4.752041
|
||||||
|
10.833007 -6.717991 4.504732 13.4244375 1.1306485
|
||||||
|
7.3435574 1.400918 14.704036 -9.501399 7.2315617
|
||||||
|
-6.417456 1.3333273 11.872697 -0.30664724 8.8845
|
||||||
|
6.5569253 4.7948146 0.03662816 -8.704245 6.224871
|
||||||
|
-3.2701402 -11.508579 ]
|
||||||
|
```
|
||||||
|
|
||||||
|
### 4.Pretrained Models
|
||||||
|
|
||||||
|
Here is a list of pretrained models released by PaddleSpeech that can be used by command and python API:
|
||||||
|
|
||||||
|
| Model | Sample Rate
|
||||||
|
| :--- | :---: |
|
||||||
|
| ecapatdnn_voxceleb12 | 16k
|
@ -0,0 +1,6 @@
|
|||||||
|
#!/bin/bash
|
||||||
|
|
||||||
|
wget -c https://paddlespeech.bj.bcebos.com/vector/audio/85236145389.wav
|
||||||
|
|
||||||
|
# asr
|
||||||
|
paddlespeech vector --task spk --input ./85236145389.wav
|
@ -0,0 +1,107 @@
|
|||||||
|
# use CNND
|
||||||
|
###########################################################
|
||||||
|
# FEATURE EXTRACTION SETTING #
|
||||||
|
###########################################################
|
||||||
|
|
||||||
|
fs: 24000 # sr
|
||||||
|
n_fft: 2048 # FFT size (samples).
|
||||||
|
n_shift: 300 # Hop size (samples). 12.5ms
|
||||||
|
win_length: 1200 # Window length (samples). 50ms
|
||||||
|
# If set to null, it will be the same as fft_size.
|
||||||
|
window: "hann" # Window function.
|
||||||
|
|
||||||
|
# Only used for feats_type != raw
|
||||||
|
|
||||||
|
fmin: 80 # Minimum frequency of Mel basis.
|
||||||
|
fmax: 7600 # Maximum frequency of Mel basis.
|
||||||
|
n_mels: 80 # The number of mel basis.
|
||||||
|
|
||||||
|
# Only used for the model using pitch features (e.g. FastSpeech2)
|
||||||
|
f0min: 80 # Minimum f0 for pitch extraction.
|
||||||
|
f0max: 400 # Maximum f0 for pitch extraction.
|
||||||
|
|
||||||
|
|
||||||
|
###########################################################
|
||||||
|
# DATA SETTING #
|
||||||
|
###########################################################
|
||||||
|
batch_size: 64
|
||||||
|
num_workers: 4
|
||||||
|
|
||||||
|
|
||||||
|
###########################################################
|
||||||
|
# MODEL SETTING #
|
||||||
|
###########################################################
|
||||||
|
model:
|
||||||
|
adim: 384 # attention dimension
|
||||||
|
aheads: 2 # number of attention heads
|
||||||
|
elayers: 4 # number of encoder layers
|
||||||
|
eunits: 1536 # number of encoder ff units
|
||||||
|
dlayers: 4 # number of decoder layers
|
||||||
|
dunits: 1536 # number of decoder ff units
|
||||||
|
positionwise_layer_type: conv1d # type of position-wise layer
|
||||||
|
positionwise_conv_kernel_size: 3 # kernel size of position wise conv layer
|
||||||
|
duration_predictor_layers: 2 # number of layers of duration predictor
|
||||||
|
duration_predictor_chans: 256 # number of channels of duration predictor
|
||||||
|
duration_predictor_kernel_size: 3 # filter size of duration predictor
|
||||||
|
postnet_layers: 5 # number of layers of postnset
|
||||||
|
postnet_filts: 5 # filter size of conv layers in postnet
|
||||||
|
postnet_chans: 256 # number of channels of conv layers in postnet
|
||||||
|
use_scaled_pos_enc: True # whether to use scaled positional encoding
|
||||||
|
encoder_normalize_before: True # whether to perform layer normalization before the input
|
||||||
|
decoder_normalize_before: True # whether to perform layer normalization before the input
|
||||||
|
reduction_factor: 1 # reduction factor
|
||||||
|
encoder_type: transformer # encoder type
|
||||||
|
decoder_type: cnndecoder # decoder type
|
||||||
|
init_type: xavier_uniform # initialization type
|
||||||
|
init_enc_alpha: 1.0 # initial value of alpha of encoder scaled position encoding
|
||||||
|
init_dec_alpha: 1.0 # initial value of alpha of decoder scaled position encoding
|
||||||
|
transformer_enc_dropout_rate: 0.2 # dropout rate for transformer encoder layer
|
||||||
|
transformer_enc_positional_dropout_rate: 0.2 # dropout rate for transformer encoder positional encoding
|
||||||
|
transformer_enc_attn_dropout_rate: 0.2 # dropout rate for transformer encoder attention layer
|
||||||
|
cnn_dec_dropout_rate: 0.2 # dropout rate for cnn decoder layer
|
||||||
|
cnn_postnet_dropout_rate: 0.2
|
||||||
|
cnn_postnet_resblock_kernel_sizes: [256, 256] # kernel sizes for residual block of cnn_postnet
|
||||||
|
cnn_postnet_kernel_size: 5 # kernel size of cnn_postnet
|
||||||
|
cnn_decoder_embedding_dim: 256
|
||||||
|
pitch_predictor_layers: 5 # number of conv layers in pitch predictor
|
||||||
|
pitch_predictor_chans: 256 # number of channels of conv layers in pitch predictor
|
||||||
|
pitch_predictor_kernel_size: 5 # kernel size of conv leyers in pitch predictor
|
||||||
|
pitch_predictor_dropout: 0.5 # dropout rate in pitch predictor
|
||||||
|
pitch_embed_kernel_size: 1 # kernel size of conv embedding layer for pitch
|
||||||
|
pitch_embed_dropout: 0.0 # dropout rate after conv embedding layer for pitch
|
||||||
|
stop_gradient_from_pitch_predictor: True # whether to stop the gradient from pitch predictor to encoder
|
||||||
|
energy_predictor_layers: 2 # number of conv layers in energy predictor
|
||||||
|
energy_predictor_chans: 256 # number of channels of conv layers in energy predictor
|
||||||
|
energy_predictor_kernel_size: 3 # kernel size of conv leyers in energy predictor
|
||||||
|
energy_predictor_dropout: 0.5 # dropout rate in energy predictor
|
||||||
|
energy_embed_kernel_size: 1 # kernel size of conv embedding layer for energy
|
||||||
|
energy_embed_dropout: 0.0 # dropout rate after conv embedding layer for energy
|
||||||
|
stop_gradient_from_energy_predictor: False # whether to stop the gradient from energy predictor to encoder
|
||||||
|
|
||||||
|
|
||||||
|
|
||||||
|
###########################################################
|
||||||
|
# UPDATER SETTING #
|
||||||
|
###########################################################
|
||||||
|
updater:
|
||||||
|
use_masking: True # whether to apply masking for padded part in loss calculation
|
||||||
|
|
||||||
|
|
||||||
|
###########################################################
|
||||||
|
# OPTIMIZER SETTING #
|
||||||
|
###########################################################
|
||||||
|
optimizer:
|
||||||
|
optim: adam # optimizer type
|
||||||
|
learning_rate: 0.001 # learning rate
|
||||||
|
|
||||||
|
###########################################################
|
||||||
|
# TRAINING SETTING #
|
||||||
|
###########################################################
|
||||||
|
max_epoch: 1000
|
||||||
|
num_snapshots: 5
|
||||||
|
|
||||||
|
|
||||||
|
###########################################################
|
||||||
|
# OTHER SETTING #
|
||||||
|
###########################################################
|
||||||
|
seed: 10086
|
@ -0,0 +1,92 @@
|
|||||||
|
#!/bin/bash
|
||||||
|
|
||||||
|
config_path=$1
|
||||||
|
train_output_path=$2
|
||||||
|
ckpt_name=$3
|
||||||
|
|
||||||
|
stage=0
|
||||||
|
stop_stage=0
|
||||||
|
|
||||||
|
# pwgan
|
||||||
|
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
|
||||||
|
FLAGS_allocator_strategy=naive_best_fit \
|
||||||
|
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
|
||||||
|
python3 ${BIN_DIR}/../synthesize_streaming.py \
|
||||||
|
--am=fastspeech2_csmsc \
|
||||||
|
--am_config=${config_path} \
|
||||||
|
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
|
||||||
|
--am_stat=dump/train/speech_stats.npy \
|
||||||
|
--voc=pwgan_csmsc \
|
||||||
|
--voc_config=pwg_baker_ckpt_0.4/pwg_default.yaml \
|
||||||
|
--voc_ckpt=pwg_baker_ckpt_0.4/pwg_snapshot_iter_400000.pdz \
|
||||||
|
--voc_stat=pwg_baker_ckpt_0.4/pwg_stats.npy \
|
||||||
|
--lang=zh \
|
||||||
|
--text=${BIN_DIR}/../sentences.txt \
|
||||||
|
--output_dir=${train_output_path}/test_e2e_streaming \
|
||||||
|
--phones_dict=dump/phone_id_map.txt \
|
||||||
|
--am_streaming=True
|
||||||
|
fi
|
||||||
|
|
||||||
|
# for more GAN Vocoders
|
||||||
|
# multi band melgan
|
||||||
|
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
|
||||||
|
FLAGS_allocator_strategy=naive_best_fit \
|
||||||
|
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
|
||||||
|
python3 ${BIN_DIR}/../synthesize_streaming.py \
|
||||||
|
--am=fastspeech2_csmsc \
|
||||||
|
--am_config=${config_path} \
|
||||||
|
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
|
||||||
|
--am_stat=dump/train/speech_stats.npy \
|
||||||
|
--voc=mb_melgan_csmsc \
|
||||||
|
--voc_config=mb_melgan_csmsc_ckpt_0.1.1/default.yaml \
|
||||||
|
--voc_ckpt=mb_melgan_csmsc_ckpt_0.1.1/snapshot_iter_1000000.pdz\
|
||||||
|
--voc_stat=mb_melgan_csmsc_ckpt_0.1.1/feats_stats.npy \
|
||||||
|
--lang=zh \
|
||||||
|
--text=${BIN_DIR}/../sentences.txt \
|
||||||
|
--output_dir=${train_output_path}/test_e2e_streaming \
|
||||||
|
--phones_dict=dump/phone_id_map.txt \
|
||||||
|
--am_streaming=True
|
||||||
|
fi
|
||||||
|
|
||||||
|
# the pretrained models haven't release now
|
||||||
|
# style melgan
|
||||||
|
# style melgan's Dygraph to Static Graph is not ready now
|
||||||
|
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
|
||||||
|
FLAGS_allocator_strategy=naive_best_fit \
|
||||||
|
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
|
||||||
|
python3 ${BIN_DIR}/../synthesize_streaming.py \
|
||||||
|
--am=fastspeech2_csmsc \
|
||||||
|
--am_config=${config_path} \
|
||||||
|
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
|
||||||
|
--am_stat=dump/train/speech_stats.npy \
|
||||||
|
--voc=style_melgan_csmsc \
|
||||||
|
--voc_config=style_melgan_csmsc_ckpt_0.1.1/default.yaml \
|
||||||
|
--voc_ckpt=style_melgan_csmsc_ckpt_0.1.1/snapshot_iter_1500000.pdz \
|
||||||
|
--voc_stat=style_melgan_csmsc_ckpt_0.1.1/feats_stats.npy \
|
||||||
|
--lang=zh \
|
||||||
|
--text=${BIN_DIR}/../sentences.txt \
|
||||||
|
--output_dir=${train_output_path}/test_e2e_streaming \
|
||||||
|
--phones_dict=dump/phone_id_map.txt \
|
||||||
|
--am_streaming=True
|
||||||
|
fi
|
||||||
|
|
||||||
|
# hifigan
|
||||||
|
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
|
||||||
|
echo "in hifigan syn_e2e"
|
||||||
|
FLAGS_allocator_strategy=naive_best_fit \
|
||||||
|
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
|
||||||
|
python3 ${BIN_DIR}/../synthesize_streaming.py \
|
||||||
|
--am=fastspeech2_csmsc \
|
||||||
|
--am_config=${config_path} \
|
||||||
|
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
|
||||||
|
--am_stat=dump/train/speech_stats.npy \
|
||||||
|
--voc=hifigan_csmsc \
|
||||||
|
--voc_config=hifigan_csmsc_ckpt_0.1.1/default.yaml \
|
||||||
|
--voc_ckpt=hifigan_csmsc_ckpt_0.1.1/snapshot_iter_2500000.pdz \
|
||||||
|
--voc_stat=hifigan_csmsc_ckpt_0.1.1/feats_stats.npy \
|
||||||
|
--lang=zh \
|
||||||
|
--text=${BIN_DIR}/../sentences.txt \
|
||||||
|
--output_dir=${train_output_path}/test_e2e_streaming \
|
||||||
|
--phones_dict=dump/phone_id_map.txt \
|
||||||
|
--am_streaming=True
|
||||||
|
fi
|
@ -0,0 +1,48 @@
|
|||||||
|
#!/bin/bash
|
||||||
|
|
||||||
|
set -e
|
||||||
|
source path.sh
|
||||||
|
|
||||||
|
gpus=0,1
|
||||||
|
stage=0
|
||||||
|
stop_stage=100
|
||||||
|
|
||||||
|
conf_path=conf/cnndecoder.yaml
|
||||||
|
train_output_path=exp/cnndecoder
|
||||||
|
ckpt_name=snapshot_iter_153.pdz
|
||||||
|
|
||||||
|
# with the following command, you can choose the stage range you want to run
|
||||||
|
# such as `./run.sh --stage 0 --stop-stage 0`
|
||||||
|
# this can not be mixed use with `$1`, `$2` ...
|
||||||
|
source ${MAIN_ROOT}/utils/parse_options.sh || exit 1
|
||||||
|
|
||||||
|
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
|
||||||
|
# prepare data
|
||||||
|
./local/preprocess.sh ${conf_path} || exit -1
|
||||||
|
fi
|
||||||
|
|
||||||
|
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
|
||||||
|
# train model, all `ckpt` under `train_output_path/checkpoints/` dir
|
||||||
|
CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path} || exit -1
|
||||||
|
fi
|
||||||
|
|
||||||
|
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
|
||||||
|
# synthesize, vocoder is pwgan
|
||||||
|
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name} || exit -1
|
||||||
|
fi
|
||||||
|
|
||||||
|
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
|
||||||
|
# synthesize_e2e, vocoder is pwgan
|
||||||
|
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize_e2e.sh ${conf_path} ${train_output_path} ${ckpt_name} || exit -1
|
||||||
|
fi
|
||||||
|
|
||||||
|
if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
|
||||||
|
# inference with static model
|
||||||
|
CUDA_VISIBLE_DEVICES=${gpus} ./local/inference.sh ${train_output_path} || exit -1
|
||||||
|
fi
|
||||||
|
|
||||||
|
if [ ${stage} -le 5 ] && [ ${stop_stage} -ge 5 ]; then
|
||||||
|
# synthesize_e2e, vocoder is pwgan
|
||||||
|
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize_streaming.sh ${conf_path} ${train_output_path} ${ckpt_name} || exit -1
|
||||||
|
fi
|
||||||
|
|
@ -0,0 +1,7 @@
|
|||||||
|
# VoxCeleb
|
||||||
|
|
||||||
|
## ECAPA-TDNN
|
||||||
|
|
||||||
|
| Model | Number of Params | Release | Config | dim | Test set | Cosine | Cosine + S-Norm |
|
||||||
|
| --- | --- | --- | --- | --- | --- | --- | ---- |
|
||||||
|
| ECAPA-TDNN | 85M | 0.1.1 | conf/ecapa_tdnn.yaml |192 | test | 1.15 | 1.06 |
|
@ -0,0 +1,52 @@
|
|||||||
|
###########################################
|
||||||
|
# Data #
|
||||||
|
###########################################
|
||||||
|
# we should explicitly specify the wav path of vox2 audio data converted from m4a
|
||||||
|
vox2_base_path:
|
||||||
|
augment: True
|
||||||
|
batch_size: 16
|
||||||
|
num_workers: 2
|
||||||
|
num_speakers: 7205 # 1211 vox1, 5994 vox2, 7205 vox1+2, test speakers: 41
|
||||||
|
shuffle: True
|
||||||
|
random_chunk: True
|
||||||
|
|
||||||
|
###########################################################
|
||||||
|
# FEATURE EXTRACTION SETTING #
|
||||||
|
###########################################################
|
||||||
|
# currently, we only support fbank
|
||||||
|
sr: 16000 # sample rate
|
||||||
|
n_mels: 80
|
||||||
|
window_size: 400 #25ms, sample rate 16000, 25 * 16000 / 1000 = 400
|
||||||
|
hop_size: 160 #10ms, sample rate 16000, 10 * 16000 / 1000 = 160
|
||||||
|
|
||||||
|
###########################################################
|
||||||
|
# MODEL SETTING #
|
||||||
|
###########################################################
|
||||||
|
# currently, we only support ecapa-tdnn in the ecapa_tdnn.yaml
|
||||||
|
# if we want use another model, please choose another configuration yaml file
|
||||||
|
model:
|
||||||
|
input_size: 80
|
||||||
|
# "channels": [512, 512, 512, 512, 1536],
|
||||||
|
channels: [1024, 1024, 1024, 1024, 3072]
|
||||||
|
kernel_sizes: [5, 3, 3, 3, 1]
|
||||||
|
dilations: [1, 2, 3, 4, 1]
|
||||||
|
attention_channels: 128
|
||||||
|
lin_neurons: 192
|
||||||
|
|
||||||
|
###########################################
|
||||||
|
# Training #
|
||||||
|
###########################################
|
||||||
|
seed: 1986 # according from speechbrain configuration
|
||||||
|
epochs: 10
|
||||||
|
save_interval: 1
|
||||||
|
log_interval: 1
|
||||||
|
learning_rate: 1e-8
|
||||||
|
|
||||||
|
|
||||||
|
###########################################
|
||||||
|
# Testing #
|
||||||
|
###########################################
|
||||||
|
global_embedding_norm: True
|
||||||
|
embedding_mean_norm: True
|
||||||
|
embedding_std_norm: False
|
||||||
|
|
@ -0,0 +1,58 @@
|
|||||||
|
#!/bin/bash
|
||||||
|
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
stage=1
|
||||||
|
stop_stage=100
|
||||||
|
|
||||||
|
. ${MAIN_ROOT}/utils/parse_options.sh || exit -1;
|
||||||
|
|
||||||
|
if [ $# -ne 2 ] ; then
|
||||||
|
echo "Usage: $0 [options] <data-dir> <conf-path>";
|
||||||
|
echo "e.g.: $0 ./data/ conf/ecapa_tdnn.yaml"
|
||||||
|
echo "Options: "
|
||||||
|
echo " --stage <stage|-1> # Used to run a partially-completed data process from somewhere in the middle."
|
||||||
|
echo " --stop-stage <stop-stage|100> # Used to run a partially-completed data process stop stage in the middle"
|
||||||
|
exit 1;
|
||||||
|
fi
|
||||||
|
|
||||||
|
dir=$1
|
||||||
|
conf_path=$2
|
||||||
|
mkdir -p ${dir}
|
||||||
|
|
||||||
|
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
|
||||||
|
# data prepare for vox1 and vox2, vox2 must be converted from m4a to wav
|
||||||
|
# we should use the local/convert.sh convert m4a to wav
|
||||||
|
python3 local/data_prepare.py \
|
||||||
|
--data-dir ${dir} \
|
||||||
|
--config ${conf_path}
|
||||||
|
fi
|
||||||
|
|
||||||
|
TARGET_DIR=${MAIN_ROOT}/dataset
|
||||||
|
mkdir -p ${TARGET_DIR}
|
||||||
|
|
||||||
|
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
|
||||||
|
# download data, generate manifests
|
||||||
|
python3 ${TARGET_DIR}/voxceleb/voxceleb1.py \
|
||||||
|
--manifest_prefix="data/vox1/manifest" \
|
||||||
|
--target_dir="${TARGET_DIR}/voxceleb/vox1/"
|
||||||
|
|
||||||
|
if [ $? -ne 0 ]; then
|
||||||
|
echo "Prepare voxceleb failed. Terminated."
|
||||||
|
exit 1
|
||||||
|
fi
|
||||||
|
|
||||||
|
# for dataset in train dev test; do
|
||||||
|
# mv data/manifest.${dataset} data/manifest.${dataset}.raw
|
||||||
|
# done
|
||||||
|
fi
|
@ -0,0 +1,70 @@
|
|||||||
|
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
import argparse
|
||||||
|
|
||||||
|
import paddle
|
||||||
|
from yacs.config import CfgNode
|
||||||
|
|
||||||
|
from paddleaudio.datasets.voxceleb import VoxCeleb
|
||||||
|
from paddlespeech.s2t.utils.log import Log
|
||||||
|
from paddlespeech.vector.io.augment import build_augment_pipeline
|
||||||
|
from paddlespeech.vector.training.seeding import seed_everything
|
||||||
|
|
||||||
|
logger = Log(__name__).getlog()
|
||||||
|
|
||||||
|
|
||||||
|
def main(args, config):
|
||||||
|
|
||||||
|
# stage0: set the cpu device, all data prepare process will be done in cpu mode
|
||||||
|
paddle.set_device("cpu")
|
||||||
|
# set the random seed, it is a must for multiprocess training
|
||||||
|
seed_everything(config.seed)
|
||||||
|
|
||||||
|
# stage 1: generate the voxceleb csv file
|
||||||
|
# Note: this may occurs c++ execption, but the program will execute fine
|
||||||
|
# so we ignore the execption
|
||||||
|
# we explicitly pass the vox2 base path to data prepare and generate the audio info
|
||||||
|
logger.info("start to generate the voxceleb dataset info")
|
||||||
|
train_dataset = VoxCeleb(
|
||||||
|
'train', target_dir=args.data_dir, vox2_base_path=config.vox2_base_path)
|
||||||
|
|
||||||
|
# stage 2: generate the augment noise csv file
|
||||||
|
if config.augment:
|
||||||
|
logger.info("start to generate the augment dataset info")
|
||||||
|
augment_pipeline = build_augment_pipeline(target_dir=args.data_dir)
|
||||||
|
|
||||||
|
|
||||||
|
if __name__ == "__main__":
|
||||||
|
# yapf: disable
|
||||||
|
parser = argparse.ArgumentParser(__doc__)
|
||||||
|
parser.add_argument("--data-dir",
|
||||||
|
default="./data/",
|
||||||
|
type=str,
|
||||||
|
help="data directory")
|
||||||
|
parser.add_argument("--config",
|
||||||
|
default=None,
|
||||||
|
type=str,
|
||||||
|
help="configuration file")
|
||||||
|
args = parser.parse_args()
|
||||||
|
# yapf: enable
|
||||||
|
|
||||||
|
# https://yaml.org/type/float.html
|
||||||
|
config = CfgNode(new_allowed=True)
|
||||||
|
if args.config:
|
||||||
|
config.merge_from_file(args.config)
|
||||||
|
|
||||||
|
config.freeze()
|
||||||
|
print(config)
|
||||||
|
|
||||||
|
main(args, config)
|
@ -0,0 +1,51 @@
|
|||||||
|
#!/bin/bash
|
||||||
|
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
|
||||||
|
. ./path.sh
|
||||||
|
|
||||||
|
stage=0
|
||||||
|
stop_stage=100
|
||||||
|
exp_dir=exp/ecapa-tdnn-vox12-big/ # experiment directory
|
||||||
|
conf_path=conf/ecapa_tdnn.yaml
|
||||||
|
audio_path="demo/voxceleb/00001.wav"
|
||||||
|
use_gpu=true
|
||||||
|
|
||||||
|
. ${MAIN_ROOT}/utils/parse_options.sh || exit -1;
|
||||||
|
|
||||||
|
if [ $# -ne 0 ] ; then
|
||||||
|
echo "Usage: $0 [options]";
|
||||||
|
echo "e.g.: $0 ./data/ exp/voxceleb12/ conf/ecapa_tdnn.yaml"
|
||||||
|
echo "Options: "
|
||||||
|
echo " --use-gpu <true,false|true> # specify is gpu is to be used for training"
|
||||||
|
echo " --stage <stage|-1> # Used to run a partially-completed data process from somewhere in the middle."
|
||||||
|
echo " --stop-stage <stop-stage|100> # Used to run a partially-completed data process stop stage in the middle"
|
||||||
|
echo " --exp-dir # experiment directorh, where is has the model.pdparams"
|
||||||
|
echo " --conf-path # configuration file for extracting the embedding"
|
||||||
|
echo " --audio-path # audio-path, which will be processed to extract the embedding"
|
||||||
|
exit 1;
|
||||||
|
fi
|
||||||
|
|
||||||
|
# set the test device
|
||||||
|
device="cpu"
|
||||||
|
if ${use_gpu}; then
|
||||||
|
device="gpu"
|
||||||
|
fi
|
||||||
|
|
||||||
|
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
|
||||||
|
# extract the audio embedding
|
||||||
|
python3 ${BIN_DIR}/extract_emb.py --device ${device} \
|
||||||
|
--config ${conf_path} \
|
||||||
|
--audio-path ${audio_path} --load-checkpoint ${exp_dir}
|
||||||
|
fi
|
@ -0,0 +1,42 @@
|
|||||||
|
#!/bin/bash
|
||||||
|
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
|
||||||
|
stage=1
|
||||||
|
stop_stage=100
|
||||||
|
use_gpu=true # if true, we run on GPU.
|
||||||
|
|
||||||
|
. ${MAIN_ROOT}/utils/parse_options.sh || exit -1;
|
||||||
|
|
||||||
|
if [ $# -ne 3 ] ; then
|
||||||
|
echo "Usage: $0 [options] <data-dir> <exp-dir> <conf-path>";
|
||||||
|
echo "e.g.: $0 ./data/ exp/voxceleb12/ conf/ecapa_tdnn.yaml"
|
||||||
|
echo "Options: "
|
||||||
|
echo " --use-gpu <true,false|true> # specify is gpu is to be used for training"
|
||||||
|
echo " --stage <stage|-1> # Used to run a partially-completed data process from somewhere in the middle."
|
||||||
|
echo " --stop-stage <stop-stage|100> # Used to run a partially-completed data process stop stage in the middle"
|
||||||
|
exit 1;
|
||||||
|
fi
|
||||||
|
|
||||||
|
dir=$1
|
||||||
|
exp_dir=$2
|
||||||
|
conf_path=$3
|
||||||
|
|
||||||
|
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
|
||||||
|
# test the model and compute the eer metrics
|
||||||
|
python3 ${BIN_DIR}/test.py \
|
||||||
|
--data-dir ${dir} \
|
||||||
|
--load-checkpoint ${exp_dir} \
|
||||||
|
--config ${conf_path}
|
||||||
|
fi
|
@ -0,0 +1,61 @@
|
|||||||
|
#!/bin/bash
|
||||||
|
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
|
||||||
|
stage=0
|
||||||
|
stop_stage=100
|
||||||
|
use_gpu=true # if true, we run on GPU.
|
||||||
|
|
||||||
|
. ${MAIN_ROOT}/utils/parse_options.sh || exit -1;
|
||||||
|
|
||||||
|
if [ $# -ne 3 ] ; then
|
||||||
|
echo "Usage: $0 [options] <data-dir> <exp-dir> <conf-path>";
|
||||||
|
echo "e.g.: $0 ./data/ exp/voxceleb12/ conf/ecapa_tdnn.yaml"
|
||||||
|
echo "Options: "
|
||||||
|
echo " --use-gpu <true,false|true> # specify is gpu is to be used for training"
|
||||||
|
echo " --stage <stage|-1> # Used to run a partially-completed data process from somewhere in the middle."
|
||||||
|
echo " --stop-stage <stop-stage|100> # Used to run a partially-completed data process stop stage in the middle"
|
||||||
|
exit 1;
|
||||||
|
fi
|
||||||
|
|
||||||
|
dir=$1
|
||||||
|
exp_dir=$2
|
||||||
|
conf_path=$3
|
||||||
|
|
||||||
|
# get the gpu nums for training
|
||||||
|
ngpu=$(echo $CUDA_VISIBLE_DEVICES | awk -F "," '{print NF}')
|
||||||
|
echo "using $ngpu gpus..."
|
||||||
|
|
||||||
|
# setting training device
|
||||||
|
device="cpu"
|
||||||
|
if ${use_gpu}; then
|
||||||
|
device="gpu"
|
||||||
|
fi
|
||||||
|
|
||||||
|
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
|
||||||
|
# train the speaker identification task with voxceleb data
|
||||||
|
# and we will create the trained model parameters in ${exp_dir}/model.pdparams as the soft link
|
||||||
|
# Note: we will store the log file in exp/log directory
|
||||||
|
python3 -m paddle.distributed.launch --gpus=$CUDA_VISIBLE_DEVICES \
|
||||||
|
${BIN_DIR}/train.py --device ${device} --checkpoint-dir ${exp_dir} \
|
||||||
|
--data-dir ${dir} --config ${conf_path}
|
||||||
|
|
||||||
|
fi
|
||||||
|
|
||||||
|
if [ $? -ne 0 ]; then
|
||||||
|
echo "Failed in training!"
|
||||||
|
exit 1
|
||||||
|
fi
|
||||||
|
|
||||||
|
exit 0
|
@ -0,0 +1,28 @@
|
|||||||
|
#!/bin/bash
|
||||||
|
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
export MAIN_ROOT=`realpath ${PWD}/../../../`
|
||||||
|
|
||||||
|
export PATH=${MAIN_ROOT}:${MAIN_ROOT}/utils:${PATH}
|
||||||
|
export LC_ALL=C
|
||||||
|
|
||||||
|
export PYTHONDONTWRITEBYTECODE=1
|
||||||
|
# Use UTF-8 in Python to avoid UnicodeDecodeError when LC_ALL=C
|
||||||
|
export PYTHONIOENCODING=UTF-8
|
||||||
|
export PYTHONPATH=${MAIN_ROOT}:${PYTHONPATH}
|
||||||
|
|
||||||
|
export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/local/lib/
|
||||||
|
|
||||||
|
MODEL=ecapa_tdnn
|
||||||
|
export BIN_DIR=${MAIN_ROOT}/paddlespeech/vector/exps/${MODEL}
|
@ -0,0 +1,69 @@
|
|||||||
|
#!/bin/bash
|
||||||
|
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
|
||||||
|
. ./path.sh
|
||||||
|
set -e
|
||||||
|
|
||||||
|
#######################################################################
|
||||||
|
# stage 0: data prepare, including voxceleb1 download and generate {train,dev,enroll,test}.csv
|
||||||
|
# voxceleb2 data is m4a format, so we need user to convert the m4a to wav yourselves as described in Readme.md with the script local/convert.sh
|
||||||
|
# stage 1: train the speaker identification model
|
||||||
|
# stage 2: test speaker identification
|
||||||
|
# stage 3: extract the training embeding to train the LDA and PLDA
|
||||||
|
######################################################################
|
||||||
|
|
||||||
|
# we can set the variable PPAUDIO_HOME to specifiy the root directory of the downloaded vox1 and vox2 dataset
|
||||||
|
# default the dataset will be stored in the ~/.paddleaudio/
|
||||||
|
# the vox2 dataset is stored in m4a format, we need to convert the audio from m4a to wav yourself
|
||||||
|
# and put all of them to ${PPAUDIO_HOME}/datasets/vox2
|
||||||
|
# we will find the wav from ${PPAUDIO_HOME}/datasets/vox1/wav and ${PPAUDIO_HOME}/datasets/vox2/wav
|
||||||
|
# export PPAUDIO_HOME=
|
||||||
|
stage=0
|
||||||
|
stop_stage=50
|
||||||
|
|
||||||
|
# data directory
|
||||||
|
# if we set the variable ${dir}, we will store the wav info to this directory
|
||||||
|
# otherwise, we will store the wav info to vox1 and vox2 directory respectively
|
||||||
|
# vox2 wav path, we must convert the m4a format to wav format
|
||||||
|
dir=data/ # data info directory
|
||||||
|
|
||||||
|
exp_dir=exp/ecapa-tdnn-vox12-big/ # experiment directory
|
||||||
|
conf_path=conf/ecapa_tdnn.yaml
|
||||||
|
gpus=0,1,2,3
|
||||||
|
|
||||||
|
source ${MAIN_ROOT}/utils/parse_options.sh || exit 1;
|
||||||
|
|
||||||
|
mkdir -p ${exp_dir}
|
||||||
|
|
||||||
|
if [ $stage -le 0 ] && [ ${stop_stage} -ge 0 ]; then
|
||||||
|
# stage 0: data prepare for vox1 and vox2, vox2 must be converted from m4a to wav
|
||||||
|
bash ./local/data.sh ${dir} ${conf_path}|| exit -1;
|
||||||
|
fi
|
||||||
|
|
||||||
|
if [ $stage -le 1 ] && [ ${stop_stage} -ge 1 ]; then
|
||||||
|
# stage 1: train the speaker identification model
|
||||||
|
CUDA_VISIBLE_DEVICES=${gpus} bash ./local/train.sh ${dir} ${exp_dir} ${conf_path}
|
||||||
|
fi
|
||||||
|
|
||||||
|
if [ $stage -le 2 ] && [ ${stop_stage} -ge 2 ]; then
|
||||||
|
# stage 2: get the speaker verification scores with cosine function
|
||||||
|
# now we only support use cosine to get the scores
|
||||||
|
CUDA_VISIBLE_DEVICES=0 bash ./local/test.sh ${dir} ${exp_dir} ${conf_path}
|
||||||
|
fi
|
||||||
|
|
||||||
|
# if [ $stage -le 3 ]; then
|
||||||
|
# # stage 2: extract the training embeding to train the LDA and PLDA
|
||||||
|
# # todo: extract the training embedding
|
||||||
|
# fi
|
@ -0,0 +1 @@
|
|||||||
|
../../../utils/
|
@ -0,0 +1,2 @@
|
|||||||
|
.eggs
|
||||||
|
*.wav
|
@ -0,0 +1,7 @@
|
|||||||
|
# PaddleAudio
|
||||||
|
|
||||||
|
PaddleAudio is an audio library for PaddlePaddle.
|
||||||
|
|
||||||
|
## Install
|
||||||
|
|
||||||
|
`pip install .`
|
@ -0,0 +1,19 @@
|
|||||||
|
# Minimal makefile for Sphinx documentation
|
||||||
|
#
|
||||||
|
|
||||||
|
# You can set these variables from the command line.
|
||||||
|
SPHINXOPTS =
|
||||||
|
SPHINXBUILD = sphinx-build
|
||||||
|
SOURCEDIR = source
|
||||||
|
BUILDDIR = build
|
||||||
|
|
||||||
|
# Put it first so that "make" without argument is like "make help".
|
||||||
|
help:
|
||||||
|
@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
|
||||||
|
|
||||||
|
.PHONY: help Makefile
|
||||||
|
|
||||||
|
# Catch-all target: route all unknown targets to Sphinx using the new
|
||||||
|
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
|
||||||
|
%: Makefile
|
||||||
|
@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)
|
@ -0,0 +1,24 @@
|
|||||||
|
# Build docs for PaddleAudio
|
||||||
|
|
||||||
|
Execute the following steps in **current directory**.
|
||||||
|
|
||||||
|
## 1. Install
|
||||||
|
|
||||||
|
`pip install Sphinx sphinx_rtd_theme`
|
||||||
|
|
||||||
|
|
||||||
|
## 2. Generate API docs
|
||||||
|
|
||||||
|
Generate API docs from doc string.
|
||||||
|
|
||||||
|
`sphinx-apidoc -fMeT -o source ../paddleaudio ../paddleaudio/utils --templatedir source/_templates`
|
||||||
|
|
||||||
|
|
||||||
|
## 3. Build
|
||||||
|
|
||||||
|
`sphinx-build source _html`
|
||||||
|
|
||||||
|
|
||||||
|
## 4. Preview
|
||||||
|
|
||||||
|
Open `_html/index.html` for page preview.
|
After Width: | Height: | Size: 4.9 KiB |
@ -0,0 +1,35 @@
|
|||||||
|
@ECHO OFF
|
||||||
|
|
||||||
|
pushd %~dp0
|
||||||
|
|
||||||
|
REM Command file for Sphinx documentation
|
||||||
|
|
||||||
|
if "%SPHINXBUILD%" == "" (
|
||||||
|
set SPHINXBUILD=sphinx-build
|
||||||
|
)
|
||||||
|
set SOURCEDIR=source
|
||||||
|
set BUILDDIR=build
|
||||||
|
|
||||||
|
if "%1" == "" goto help
|
||||||
|
|
||||||
|
%SPHINXBUILD% >NUL 2>NUL
|
||||||
|
if errorlevel 9009 (
|
||||||
|
echo.
|
||||||
|
echo.The 'sphinx-build' command was not found. Make sure you have Sphinx
|
||||||
|
echo.installed, then set the SPHINXBUILD environment variable to point
|
||||||
|
echo.to the full path of the 'sphinx-build' executable. Alternatively you
|
||||||
|
echo.may add the Sphinx directory to PATH.
|
||||||
|
echo.
|
||||||
|
echo.If you don't have Sphinx installed, grab it from
|
||||||
|
echo.http://sphinx-doc.org/
|
||||||
|
exit /b 1
|
||||||
|
)
|
||||||
|
|
||||||
|
%SPHINXBUILD% -M %1 %SOURCEDIR% %BUILDDIR% %SPHINXOPTS%
|
||||||
|
goto end
|
||||||
|
|
||||||
|
:help
|
||||||
|
%SPHINXBUILD% -M help %SOURCEDIR% %BUILDDIR% %SPHINXOPTS%
|
||||||
|
|
||||||
|
:end
|
||||||
|
popd
|
@ -0,0 +1,5 @@
|
|||||||
|
.wy-nav-content {
|
||||||
|
max-width: 80%;
|
||||||
|
}
|
||||||
|
.table table{ background:#b9b9b9}
|
||||||
|
.table table td{ background:#FFF; }
|
@ -0,0 +1,9 @@
|
|||||||
|
{%- if show_headings %}
|
||||||
|
{{- basename | e | heading }}
|
||||||
|
|
||||||
|
{% endif -%}
|
||||||
|
.. automodule:: {{ qualname }}
|
||||||
|
{%- for option in automodule_options %}
|
||||||
|
:{{ option }}:
|
||||||
|
{%- endfor %}
|
||||||
|
|
@ -0,0 +1,57 @@
|
|||||||
|
{%- macro automodule(modname, options) -%}
|
||||||
|
.. automodule:: {{ modname }}
|
||||||
|
{%- for option in options %}
|
||||||
|
:{{ option }}:
|
||||||
|
{%- endfor %}
|
||||||
|
{%- endmacro %}
|
||||||
|
|
||||||
|
{%- macro toctree(docnames) -%}
|
||||||
|
.. toctree::
|
||||||
|
:maxdepth: {{ maxdepth }}
|
||||||
|
{% for docname in docnames %}
|
||||||
|
{{ docname }}
|
||||||
|
{%- endfor %}
|
||||||
|
{%- endmacro %}
|
||||||
|
|
||||||
|
{%- if is_namespace %}
|
||||||
|
{{- [pkgname, "namespace"] | join(" ") | e | heading }}
|
||||||
|
{% else %}
|
||||||
|
{{- pkgname | e | heading }}
|
||||||
|
{% endif %}
|
||||||
|
|
||||||
|
{%- if is_namespace %}
|
||||||
|
.. py:module:: {{ pkgname }}
|
||||||
|
{% endif %}
|
||||||
|
|
||||||
|
{%- if modulefirst and not is_namespace %}
|
||||||
|
{{ automodule(pkgname, automodule_options) }}
|
||||||
|
{% endif %}
|
||||||
|
|
||||||
|
{%- if subpackages %}
|
||||||
|
Subpackages
|
||||||
|
-----------
|
||||||
|
|
||||||
|
{{ toctree(subpackages) }}
|
||||||
|
{% endif %}
|
||||||
|
|
||||||
|
{%- if submodules %}
|
||||||
|
Submodules
|
||||||
|
----------
|
||||||
|
{% if separatemodules %}
|
||||||
|
{{ toctree(submodules) }}
|
||||||
|
{% else %}
|
||||||
|
{%- for submodule in submodules %}
|
||||||
|
{% if show_headings %}
|
||||||
|
{{- submodule | e | heading(2) }}
|
||||||
|
{% endif %}
|
||||||
|
{{ automodule(submodule, automodule_options) }}
|
||||||
|
{% endfor %}
|
||||||
|
{%- endif %}
|
||||||
|
{%- endif %}
|
||||||
|
|
||||||
|
{%- if not modulefirst and not is_namespace %}
|
||||||
|
Module contents
|
||||||
|
---------------
|
||||||
|
|
||||||
|
{{ automodule(pkgname, automodule_options) }}
|
||||||
|
{% endif %}
|
@ -0,0 +1,8 @@
|
|||||||
|
{{ header | heading }}
|
||||||
|
|
||||||
|
.. toctree::
|
||||||
|
:maxdepth: {{ maxdepth }}
|
||||||
|
{% for docname in docnames %}
|
||||||
|
{{ docname }}
|
||||||
|
{%- endfor %}
|
||||||
|
|
@ -0,0 +1,181 @@
|
|||||||
|
# -*- coding: utf-8 -*-
|
||||||
|
#
|
||||||
|
# Configuration file for the Sphinx documentation builder.
|
||||||
|
#
|
||||||
|
# This file does only contain a selection of the most common options. For a
|
||||||
|
# full list see the documentation:
|
||||||
|
# http://www.sphinx-doc.org/en/master/config
|
||||||
|
# -- Path setup --------------------------------------------------------------
|
||||||
|
# If extensions (or modules to document with autodoc) are in another directory,
|
||||||
|
# add these directories to sys.path here. If the directory is relative to the
|
||||||
|
# documentation root, use os.path.abspath to make it absolute, like shown here.
|
||||||
|
import os
|
||||||
|
import sys
|
||||||
|
sys.path.insert(0, os.path.abspath('../..'))
|
||||||
|
|
||||||
|
# -- Project information -----------------------------------------------------
|
||||||
|
|
||||||
|
project = 'PaddleAudio'
|
||||||
|
copyright = '2022, PaddlePaddle'
|
||||||
|
author = 'PaddlePaddle'
|
||||||
|
|
||||||
|
# The short X.Y version
|
||||||
|
version = ''
|
||||||
|
# The full version, including alpha/beta/rc tags
|
||||||
|
release = '0.2.0'
|
||||||
|
|
||||||
|
# -- General configuration ---------------------------------------------------
|
||||||
|
|
||||||
|
# If your documentation needs a minimal Sphinx version, state it here.
|
||||||
|
#
|
||||||
|
# needs_sphinx = '1.0'
|
||||||
|
|
||||||
|
# Add any Sphinx extension module names here, as strings. They can be
|
||||||
|
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
|
||||||
|
# ones.
|
||||||
|
extensions = [
|
||||||
|
'sphinx.ext.autodoc',
|
||||||
|
'sphinx.ext.intersphinx',
|
||||||
|
'sphinx.ext.mathjax',
|
||||||
|
'sphinx.ext.viewcode',
|
||||||
|
'sphinx.ext.napoleon',
|
||||||
|
]
|
||||||
|
|
||||||
|
napoleon_google_docstring = True
|
||||||
|
|
||||||
|
# Add any paths that contain templates here, relative to this directory.
|
||||||
|
templates_path = ['_templates']
|
||||||
|
|
||||||
|
# The suffix(es) of source filenames.
|
||||||
|
# You can specify multiple suffix as a list of string:
|
||||||
|
#
|
||||||
|
# source_suffix = ['.rst', '.md']
|
||||||
|
source_suffix = '.rst'
|
||||||
|
|
||||||
|
# The master toctree document.
|
||||||
|
master_doc = 'index'
|
||||||
|
|
||||||
|
# The language for content autogenerated by Sphinx. Refer to documentation
|
||||||
|
# for a list of supported languages.
|
||||||
|
#
|
||||||
|
# This is also used if you do content translation via gettext catalogs.
|
||||||
|
# Usually you set "language" from the command line for these cases.
|
||||||
|
language = None
|
||||||
|
|
||||||
|
# List of patterns, relative to source directory, that match files and
|
||||||
|
# directories to ignore when looking for source files.
|
||||||
|
# This pattern also affects html_static_path and html_extra_path.
|
||||||
|
exclude_patterns = []
|
||||||
|
|
||||||
|
# The name of the Pygments (syntax highlighting) style to use.
|
||||||
|
pygments_style = None
|
||||||
|
|
||||||
|
# -- Options for HTML output -------------------------------------------------
|
||||||
|
|
||||||
|
# The theme to use for HTML and HTML Help pages. See the documentation for
|
||||||
|
# a list of builtin themes.
|
||||||
|
#
|
||||||
|
|
||||||
|
import sphinx_rtd_theme
|
||||||
|
html_theme = 'sphinx_rtd_theme'
|
||||||
|
html_theme_path = [sphinx_rtd_theme.get_html_theme_path()]
|
||||||
|
smartquotes = False
|
||||||
|
|
||||||
|
# Theme options are theme-specific and customize the look and feel of a theme
|
||||||
|
# further. For a list of options available for each theme, see the
|
||||||
|
# documentation.
|
||||||
|
#
|
||||||
|
# html_theme_options = {}
|
||||||
|
|
||||||
|
# Add any paths that contain custom static files (such as style sheets) here,
|
||||||
|
# relative to this directory. They are copied after the builtin static files,
|
||||||
|
# so a file named "default.css" will overwrite the builtin "default.css".
|
||||||
|
html_static_path = ['_static']
|
||||||
|
html_logo = '../images/paddle.png'
|
||||||
|
html_css_files = [
|
||||||
|
'custom.css',
|
||||||
|
]
|
||||||
|
|
||||||
|
# Custom sidebar templates, must be a dictionary that maps document names
|
||||||
|
# to template names.
|
||||||
|
#
|
||||||
|
# The default sidebars (for documents that don't match any pattern) are
|
||||||
|
# defined by theme itself. Builtin themes are using these templates by
|
||||||
|
# default: ``['localtoc.html', 'relations.html', 'sourcelink.html',
|
||||||
|
# 'searchbox.html']``.
|
||||||
|
#
|
||||||
|
# html_sidebars = {}
|
||||||
|
|
||||||
|
# -- Options for HTMLHelp output ---------------------------------------------
|
||||||
|
|
||||||
|
# Output file base name for HTML help builder.
|
||||||
|
htmlhelp_basename = 'PaddleAudiodoc'
|
||||||
|
|
||||||
|
# -- Options for LaTeX output ------------------------------------------------
|
||||||
|
|
||||||
|
latex_elements = {
|
||||||
|
# The paper size ('letterpaper' or 'a4paper').
|
||||||
|
#
|
||||||
|
# 'papersize': 'letterpaper',
|
||||||
|
|
||||||
|
# The font size ('10pt', '11pt' or '12pt').
|
||||||
|
#
|
||||||
|
# 'pointsize': '10pt',
|
||||||
|
|
||||||
|
# Additional stuff for the LaTeX preamble.
|
||||||
|
#
|
||||||
|
# 'preamble': '',
|
||||||
|
|
||||||
|
# Latex figure (float) alignment
|
||||||
|
#
|
||||||
|
# 'figure_align': 'htbp',
|
||||||
|
}
|
||||||
|
|
||||||
|
# Grouping the document tree into LaTeX files. List of tuples
|
||||||
|
# (source start file, target name, title,
|
||||||
|
# author, documentclass [howto, manual, or own class]).
|
||||||
|
latex_documents = [
|
||||||
|
(master_doc, 'PaddleAudio.tex', 'PaddleAudio Documentation', 'PaddlePaddle',
|
||||||
|
'manual'),
|
||||||
|
]
|
||||||
|
|
||||||
|
# -- Options for manual page output ------------------------------------------
|
||||||
|
|
||||||
|
# One entry per manual page. List of tuples
|
||||||
|
# (source start file, name, description, authors, manual section).
|
||||||
|
man_pages = [(master_doc, 'paddleaudio', 'PaddleAudio Documentation', [author],
|
||||||
|
1)]
|
||||||
|
|
||||||
|
# -- Options for Texinfo output ----------------------------------------------
|
||||||
|
|
||||||
|
# Grouping the document tree into Texinfo files. List of tuples
|
||||||
|
# (source start file, target name, title, author,
|
||||||
|
# dir menu entry, description, category)
|
||||||
|
texinfo_documents = [
|
||||||
|
(master_doc, 'PaddleAudio', 'PaddleAudio Documentation', author,
|
||||||
|
'PaddleAudio', 'One line description of project.', 'Miscellaneous'),
|
||||||
|
]
|
||||||
|
|
||||||
|
# -- Options for Epub output -------------------------------------------------
|
||||||
|
|
||||||
|
# Bibliographic Dublin Core info.
|
||||||
|
epub_title = project
|
||||||
|
|
||||||
|
# The unique identifier of the text. This can be a ISBN number
|
||||||
|
# or the project homepage.
|
||||||
|
#
|
||||||
|
# epub_identifier = ''
|
||||||
|
|
||||||
|
# A unique identification for the text.
|
||||||
|
#
|
||||||
|
# epub_uid = ''
|
||||||
|
|
||||||
|
# A list of files that should not be packed into the epub file.
|
||||||
|
epub_exclude_files = ['search.html']
|
||||||
|
|
||||||
|
# -- Extension configuration -------------------------------------------------
|
||||||
|
|
||||||
|
# -- Options for intersphinx extension ---------------------------------------
|
||||||
|
|
||||||
|
# Example configuration for intersphinx: refer to the Python standard library.
|
||||||
|
intersphinx_mapping = {'https://docs.python.org/': None}
|
@ -0,0 +1,22 @@
|
|||||||
|
.. PaddleAudio documentation master file, created by
|
||||||
|
sphinx-quickstart on Tue Mar 22 15:57:16 2022.
|
||||||
|
You can adapt this file completely to your liking, but it should at least
|
||||||
|
contain the root `toctree` directive.
|
||||||
|
|
||||||
|
Welcome to PaddleAudio's documentation!
|
||||||
|
=======================================
|
||||||
|
|
||||||
|
.. toctree::
|
||||||
|
:maxdepth: 1
|
||||||
|
|
||||||
|
Index <self>
|
||||||
|
|
||||||
|
|
||||||
|
API References
|
||||||
|
--------------
|
||||||
|
|
||||||
|
.. toctree::
|
||||||
|
:maxdepth: 2
|
||||||
|
:titlesonly:
|
||||||
|
|
||||||
|
paddleaudio
|
@ -0,0 +1,201 @@
|
|||||||
|
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
import collections
|
||||||
|
import csv
|
||||||
|
import os
|
||||||
|
import random
|
||||||
|
from typing import List
|
||||||
|
|
||||||
|
from paddle.io import Dataset
|
||||||
|
from tqdm import tqdm
|
||||||
|
|
||||||
|
from ..backends import load as load_audio
|
||||||
|
from ..backends import save as save_wav
|
||||||
|
from ..utils import DATA_HOME
|
||||||
|
from ..utils.download import download_and_decompress
|
||||||
|
from .dataset import feat_funcs
|
||||||
|
|
||||||
|
__all__ = ['OpenRIRNoise']
|
||||||
|
|
||||||
|
|
||||||
|
class OpenRIRNoise(Dataset):
|
||||||
|
archieves = [
|
||||||
|
{
|
||||||
|
'url': 'http://www.openslr.org/resources/28/rirs_noises.zip',
|
||||||
|
'md5': 'e6f48e257286e05de56413b4779d8ffb',
|
||||||
|
},
|
||||||
|
]
|
||||||
|
|
||||||
|
sample_rate = 16000
|
||||||
|
meta_info = collections.namedtuple('META_INFO', ('id', 'duration', 'wav'))
|
||||||
|
base_path = os.path.join(DATA_HOME, 'open_rir_noise')
|
||||||
|
wav_path = os.path.join(base_path, 'RIRS_NOISES')
|
||||||
|
csv_path = os.path.join(base_path, 'csv')
|
||||||
|
subsets = ['rir', 'noise']
|
||||||
|
|
||||||
|
def __init__(self,
|
||||||
|
subset: str='rir',
|
||||||
|
feat_type: str='raw',
|
||||||
|
target_dir=None,
|
||||||
|
random_chunk: bool=True,
|
||||||
|
chunk_duration: float=3.0,
|
||||||
|
seed: int=0,
|
||||||
|
**kwargs):
|
||||||
|
|
||||||
|
assert subset in self.subsets, \
|
||||||
|
'Dataset subset must be one in {}, but got {}'.format(self.subsets, subset)
|
||||||
|
|
||||||
|
self.subset = subset
|
||||||
|
self.feat_type = feat_type
|
||||||
|
self.feat_config = kwargs
|
||||||
|
self.random_chunk = random_chunk
|
||||||
|
self.chunk_duration = chunk_duration
|
||||||
|
|
||||||
|
OpenRIRNoise.csv_path = os.path.join(
|
||||||
|
target_dir, "open_rir_noise",
|
||||||
|
"csv") if target_dir else self.csv_path
|
||||||
|
self._data = self._get_data()
|
||||||
|
super(OpenRIRNoise, self).__init__()
|
||||||
|
|
||||||
|
# Set up a seed to reproduce training or predicting result.
|
||||||
|
# random.seed(seed)
|
||||||
|
|
||||||
|
def _get_data(self):
|
||||||
|
# Download audio files.
|
||||||
|
print(f"rirs noises base path: {self.base_path}")
|
||||||
|
if not os.path.isdir(self.base_path):
|
||||||
|
download_and_decompress(
|
||||||
|
self.archieves, self.base_path, decompress=True)
|
||||||
|
else:
|
||||||
|
print(
|
||||||
|
f"{self.base_path} already exists, we will not download and decompress again"
|
||||||
|
)
|
||||||
|
|
||||||
|
# Data preparation.
|
||||||
|
print(f"prepare the csv to {self.csv_path}")
|
||||||
|
if not os.path.isdir(self.csv_path):
|
||||||
|
os.makedirs(self.csv_path)
|
||||||
|
self.prepare_data()
|
||||||
|
|
||||||
|
data = []
|
||||||
|
with open(os.path.join(self.csv_path, f'{self.subset}.csv'), 'r') as rf:
|
||||||
|
for line in rf.readlines()[1:]:
|
||||||
|
audio_id, duration, wav = line.strip().split(',')
|
||||||
|
data.append(self.meta_info(audio_id, float(duration), wav))
|
||||||
|
|
||||||
|
random.shuffle(data)
|
||||||
|
return data
|
||||||
|
|
||||||
|
def _convert_to_record(self, idx: int):
|
||||||
|
sample = self._data[idx]
|
||||||
|
|
||||||
|
record = {}
|
||||||
|
# To show all fields in a namedtuple: `type(sample)._fields`
|
||||||
|
for field in type(sample)._fields:
|
||||||
|
record[field] = getattr(sample, field)
|
||||||
|
|
||||||
|
waveform, sr = load_audio(record['wav'])
|
||||||
|
|
||||||
|
assert self.feat_type in feat_funcs.keys(), \
|
||||||
|
f"Unknown feat_type: {self.feat_type}, it must be one in {list(feat_funcs.keys())}"
|
||||||
|
feat_func = feat_funcs[self.feat_type]
|
||||||
|
feat = feat_func(
|
||||||
|
waveform, sr=sr, **self.feat_config) if feat_func else waveform
|
||||||
|
|
||||||
|
record.update({'feat': feat})
|
||||||
|
return record
|
||||||
|
|
||||||
|
@staticmethod
|
||||||
|
def _get_chunks(seg_dur, audio_id, audio_duration):
|
||||||
|
num_chunks = int(audio_duration / seg_dur) # all in milliseconds
|
||||||
|
|
||||||
|
chunk_lst = [
|
||||||
|
audio_id + "_" + str(i * seg_dur) + "_" + str(i * seg_dur + seg_dur)
|
||||||
|
for i in range(num_chunks)
|
||||||
|
]
|
||||||
|
return chunk_lst
|
||||||
|
|
||||||
|
def _get_audio_info(self, wav_file: str,
|
||||||
|
split_chunks: bool) -> List[List[str]]:
|
||||||
|
waveform, sr = load_audio(wav_file)
|
||||||
|
audio_id = wav_file.split("/open_rir_noise/")[-1].split(".")[0]
|
||||||
|
audio_duration = waveform.shape[0] / sr
|
||||||
|
|
||||||
|
ret = []
|
||||||
|
if split_chunks and audio_duration > self.chunk_duration: # Split into pieces of self.chunk_duration seconds.
|
||||||
|
uniq_chunks_list = self._get_chunks(self.chunk_duration, audio_id,
|
||||||
|
audio_duration)
|
||||||
|
|
||||||
|
for idx, chunk in enumerate(uniq_chunks_list):
|
||||||
|
s, e = chunk.split("_")[-2:] # Timestamps of start and end
|
||||||
|
start_sample = int(float(s) * sr)
|
||||||
|
end_sample = int(float(e) * sr)
|
||||||
|
new_wav_file = os.path.join(self.base_path,
|
||||||
|
audio_id + f'_chunk_{idx+1:02}.wav')
|
||||||
|
save_wav(waveform[start_sample:end_sample], sr, new_wav_file)
|
||||||
|
# id, duration, new_wav
|
||||||
|
ret.append([chunk, self.chunk_duration, new_wav_file])
|
||||||
|
else: # Keep whole audio.
|
||||||
|
ret.append([audio_id, audio_duration, wav_file])
|
||||||
|
return ret
|
||||||
|
|
||||||
|
def generate_csv(self,
|
||||||
|
wav_files: List[str],
|
||||||
|
output_file: str,
|
||||||
|
split_chunks: bool=True):
|
||||||
|
print(f'Generating csv: {output_file}')
|
||||||
|
header = ["id", "duration", "wav"]
|
||||||
|
|
||||||
|
infos = list(
|
||||||
|
tqdm(
|
||||||
|
map(self._get_audio_info, wav_files, [split_chunks] * len(
|
||||||
|
wav_files)),
|
||||||
|
total=len(wav_files)))
|
||||||
|
|
||||||
|
csv_lines = []
|
||||||
|
for info in infos:
|
||||||
|
csv_lines.extend(info)
|
||||||
|
|
||||||
|
with open(output_file, mode="w") as csv_f:
|
||||||
|
csv_writer = csv.writer(
|
||||||
|
csv_f, delimiter=",", quotechar='"', quoting=csv.QUOTE_MINIMAL)
|
||||||
|
csv_writer.writerow(header)
|
||||||
|
for line in csv_lines:
|
||||||
|
csv_writer.writerow(line)
|
||||||
|
|
||||||
|
def prepare_data(self):
|
||||||
|
rir_list = os.path.join(self.wav_path, "real_rirs_isotropic_noises",
|
||||||
|
"rir_list")
|
||||||
|
rir_files = []
|
||||||
|
with open(rir_list, 'r') as f:
|
||||||
|
for line in f.readlines():
|
||||||
|
rir_file = line.strip().split(' ')[-1]
|
||||||
|
rir_files.append(os.path.join(self.base_path, rir_file))
|
||||||
|
|
||||||
|
noise_list = os.path.join(self.wav_path, "pointsource_noises",
|
||||||
|
"noise_list")
|
||||||
|
noise_files = []
|
||||||
|
with open(noise_list, 'r') as f:
|
||||||
|
for line in f.readlines():
|
||||||
|
noise_file = line.strip().split(' ')[-1]
|
||||||
|
noise_files.append(os.path.join(self.base_path, noise_file))
|
||||||
|
|
||||||
|
self.generate_csv(rir_files, os.path.join(self.csv_path, 'rir.csv'))
|
||||||
|
self.generate_csv(noise_files, os.path.join(self.csv_path, 'noise.csv'))
|
||||||
|
|
||||||
|
def __getitem__(self, idx):
|
||||||
|
return self._convert_to_record(idx)
|
||||||
|
|
||||||
|
def __len__(self):
|
||||||
|
return len(self._data)
|
@ -0,0 +1,356 @@
|
|||||||
|
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
import collections
|
||||||
|
import csv
|
||||||
|
import glob
|
||||||
|
import os
|
||||||
|
import random
|
||||||
|
from multiprocessing import cpu_count
|
||||||
|
from typing import List
|
||||||
|
|
||||||
|
from paddle.io import Dataset
|
||||||
|
from pathos.multiprocessing import Pool
|
||||||
|
from tqdm import tqdm
|
||||||
|
|
||||||
|
from ..backends import load as load_audio
|
||||||
|
from ..utils import DATA_HOME
|
||||||
|
from ..utils import decompress
|
||||||
|
from ..utils.download import download_and_decompress
|
||||||
|
from .dataset import feat_funcs
|
||||||
|
|
||||||
|
__all__ = ['VoxCeleb']
|
||||||
|
|
||||||
|
|
||||||
|
class VoxCeleb(Dataset):
|
||||||
|
source_url = 'https://thor.robots.ox.ac.uk/~vgg/data/voxceleb/vox1a/'
|
||||||
|
archieves_audio_dev = [
|
||||||
|
{
|
||||||
|
'url': source_url + 'vox1_dev_wav_partaa',
|
||||||
|
'md5': 'e395d020928bc15670b570a21695ed96',
|
||||||
|
},
|
||||||
|
{
|
||||||
|
'url': source_url + 'vox1_dev_wav_partab',
|
||||||
|
'md5': 'bbfaaccefab65d82b21903e81a8a8020',
|
||||||
|
},
|
||||||
|
{
|
||||||
|
'url': source_url + 'vox1_dev_wav_partac',
|
||||||
|
'md5': '017d579a2a96a077f40042ec33e51512',
|
||||||
|
},
|
||||||
|
{
|
||||||
|
'url': source_url + 'vox1_dev_wav_partad',
|
||||||
|
'md5': '7bb1e9f70fddc7a678fa998ea8b3ba19',
|
||||||
|
},
|
||||||
|
]
|
||||||
|
archieves_audio_test = [
|
||||||
|
{
|
||||||
|
'url': source_url + 'vox1_test_wav.zip',
|
||||||
|
'md5': '185fdc63c3c739954633d50379a3d102',
|
||||||
|
},
|
||||||
|
]
|
||||||
|
archieves_meta = [
|
||||||
|
{
|
||||||
|
'url':
|
||||||
|
'https://www.robots.ox.ac.uk/~vgg/data/voxceleb/meta/veri_test2.txt',
|
||||||
|
'md5':
|
||||||
|
'b73110731c9223c1461fe49cb48dddfc',
|
||||||
|
},
|
||||||
|
]
|
||||||
|
|
||||||
|
num_speakers = 1211 # 1211 vox1, 5994 vox2, 7205 vox1+2, test speakers: 41
|
||||||
|
sample_rate = 16000
|
||||||
|
meta_info = collections.namedtuple(
|
||||||
|
'META_INFO', ('id', 'duration', 'wav', 'start', 'stop', 'spk_id'))
|
||||||
|
base_path = os.path.join(DATA_HOME, 'vox1')
|
||||||
|
wav_path = os.path.join(base_path, 'wav')
|
||||||
|
meta_path = os.path.join(base_path, 'meta')
|
||||||
|
veri_test_file = os.path.join(meta_path, 'veri_test2.txt')
|
||||||
|
csv_path = os.path.join(base_path, 'csv')
|
||||||
|
subsets = ['train', 'dev', 'enroll', 'test']
|
||||||
|
|
||||||
|
def __init__(
|
||||||
|
self,
|
||||||
|
subset: str='train',
|
||||||
|
feat_type: str='raw',
|
||||||
|
random_chunk: bool=True,
|
||||||
|
chunk_duration: float=3.0, # seconds
|
||||||
|
split_ratio: float=0.9, # train split ratio
|
||||||
|
seed: int=0,
|
||||||
|
target_dir: str=None,
|
||||||
|
vox2_base_path=None,
|
||||||
|
**kwargs):
|
||||||
|
"""VoxCeleb data prepare and get the specific dataset audio info
|
||||||
|
|
||||||
|
Args:
|
||||||
|
subset (str, optional): dataset name, such as train, dev, enroll or test. Defaults to 'train'.
|
||||||
|
feat_type (str, optional): feat type, such raw, melspectrogram(fbank) or mfcc . Defaults to 'raw'.
|
||||||
|
random_chunk (bool, optional): random select a duration from audio. Defaults to True.
|
||||||
|
chunk_duration (float, optional): chunk duration if random_chunk flag is set. Defaults to 3.0.
|
||||||
|
target_dir (str, optional): data dir, audio info will be stored in this directory. Defaults to None.
|
||||||
|
vox2_base_path (_type_, optional): vox2 directory. vox2 data must be converted from m4a to wav. Defaults to None.
|
||||||
|
"""
|
||||||
|
assert subset in self.subsets, \
|
||||||
|
'Dataset subset must be one in {}, but got {}'.format(self.subsets, subset)
|
||||||
|
|
||||||
|
self.subset = subset
|
||||||
|
self.spk_id2label = {}
|
||||||
|
self.feat_type = feat_type
|
||||||
|
self.feat_config = kwargs
|
||||||
|
self.random_chunk = random_chunk
|
||||||
|
self.chunk_duration = chunk_duration
|
||||||
|
self.split_ratio = split_ratio
|
||||||
|
self.target_dir = target_dir if target_dir else VoxCeleb.base_path
|
||||||
|
self.vox2_base_path = vox2_base_path
|
||||||
|
|
||||||
|
# if we set the target dir, we will change the vox data info data from base path to target dir
|
||||||
|
VoxCeleb.csv_path = os.path.join(
|
||||||
|
target_dir, "voxceleb", 'csv') if target_dir else VoxCeleb.csv_path
|
||||||
|
VoxCeleb.meta_path = os.path.join(
|
||||||
|
target_dir, "voxceleb",
|
||||||
|
'meta') if target_dir else VoxCeleb.meta_path
|
||||||
|
VoxCeleb.veri_test_file = os.path.join(VoxCeleb.meta_path,
|
||||||
|
'veri_test2.txt')
|
||||||
|
# self._data = self._get_data()[:1000] # KP: Small dataset test.
|
||||||
|
self._data = self._get_data()
|
||||||
|
super(VoxCeleb, self).__init__()
|
||||||
|
|
||||||
|
# Set up a seed to reproduce training or predicting result.
|
||||||
|
# random.seed(seed)
|
||||||
|
|
||||||
|
def _get_data(self):
|
||||||
|
# Download audio files.
|
||||||
|
# We need the users to decompress all vox1/dev/wav and vox1/test/wav/ to vox1/wav/ dir
|
||||||
|
# so, we check the vox1/wav dir status
|
||||||
|
print(f"wav base path: {self.wav_path}")
|
||||||
|
if not os.path.isdir(self.wav_path):
|
||||||
|
print("start to download the voxceleb1 dataset")
|
||||||
|
download_and_decompress( # multi-zip parts concatenate to vox1_dev_wav.zip
|
||||||
|
self.archieves_audio_dev,
|
||||||
|
self.base_path,
|
||||||
|
decompress=False)
|
||||||
|
download_and_decompress( # download the vox1_test_wav.zip and unzip
|
||||||
|
self.archieves_audio_test,
|
||||||
|
self.base_path,
|
||||||
|
decompress=True)
|
||||||
|
|
||||||
|
# Download all parts and concatenate the files into one zip file.
|
||||||
|
dev_zipfile = os.path.join(self.base_path, 'vox1_dev_wav.zip')
|
||||||
|
print(f'Concatenating all parts to: {dev_zipfile}')
|
||||||
|
os.system(
|
||||||
|
f'cat {os.path.join(self.base_path, "vox1_dev_wav_parta*")} > {dev_zipfile}'
|
||||||
|
)
|
||||||
|
|
||||||
|
# Extract all audio files of dev and test set.
|
||||||
|
decompress(dev_zipfile, self.base_path)
|
||||||
|
|
||||||
|
# Download meta files.
|
||||||
|
if not os.path.isdir(self.meta_path):
|
||||||
|
print("prepare the meta data")
|
||||||
|
download_and_decompress(
|
||||||
|
self.archieves_meta, self.meta_path, decompress=False)
|
||||||
|
|
||||||
|
# Data preparation.
|
||||||
|
if not os.path.isdir(self.csv_path):
|
||||||
|
os.makedirs(self.csv_path)
|
||||||
|
self.prepare_data()
|
||||||
|
|
||||||
|
data = []
|
||||||
|
print(
|
||||||
|
f"read the {self.subset} from {os.path.join(self.csv_path, f'{self.subset}.csv')}"
|
||||||
|
)
|
||||||
|
with open(os.path.join(self.csv_path, f'{self.subset}.csv'), 'r') as rf:
|
||||||
|
for line in rf.readlines()[1:]:
|
||||||
|
audio_id, duration, wav, start, stop, spk_id = line.strip(
|
||||||
|
).split(',')
|
||||||
|
data.append(
|
||||||
|
self.meta_info(audio_id,
|
||||||
|
float(duration), wav,
|
||||||
|
int(start), int(stop), spk_id))
|
||||||
|
|
||||||
|
with open(os.path.join(self.meta_path, 'spk_id2label.txt'), 'r') as f:
|
||||||
|
for line in f.readlines():
|
||||||
|
spk_id, label = line.strip().split(' ')
|
||||||
|
self.spk_id2label[spk_id] = int(label)
|
||||||
|
|
||||||
|
return data
|
||||||
|
|
||||||
|
def _convert_to_record(self, idx: int):
|
||||||
|
sample = self._data[idx]
|
||||||
|
|
||||||
|
record = {}
|
||||||
|
# To show all fields in a namedtuple: `type(sample)._fields`
|
||||||
|
for field in type(sample)._fields:
|
||||||
|
record[field] = getattr(sample, field)
|
||||||
|
|
||||||
|
waveform, sr = load_audio(record['wav'])
|
||||||
|
|
||||||
|
# random select a chunk audio samples from the audio
|
||||||
|
if self.random_chunk:
|
||||||
|
num_wav_samples = waveform.shape[0]
|
||||||
|
num_chunk_samples = int(self.chunk_duration * sr)
|
||||||
|
start = random.randint(0, num_wav_samples - num_chunk_samples - 1)
|
||||||
|
stop = start + num_chunk_samples
|
||||||
|
else:
|
||||||
|
start = record['start']
|
||||||
|
stop = record['stop']
|
||||||
|
|
||||||
|
waveform = waveform[start:stop]
|
||||||
|
|
||||||
|
assert self.feat_type in feat_funcs.keys(), \
|
||||||
|
f"Unknown feat_type: {self.feat_type}, it must be one in {list(feat_funcs.keys())}"
|
||||||
|
feat_func = feat_funcs[self.feat_type]
|
||||||
|
feat = feat_func(
|
||||||
|
waveform, sr=sr, **self.feat_config) if feat_func else waveform
|
||||||
|
|
||||||
|
record.update({'feat': feat})
|
||||||
|
if self.subset in ['train',
|
||||||
|
'dev']: # Labels are available in train and dev.
|
||||||
|
record.update({'label': self.spk_id2label[record['spk_id']]})
|
||||||
|
|
||||||
|
return record
|
||||||
|
|
||||||
|
@staticmethod
|
||||||
|
def _get_chunks(seg_dur, audio_id, audio_duration):
|
||||||
|
num_chunks = int(audio_duration / seg_dur) # all in milliseconds
|
||||||
|
|
||||||
|
chunk_lst = [
|
||||||
|
audio_id + "_" + str(i * seg_dur) + "_" + str(i * seg_dur + seg_dur)
|
||||||
|
for i in range(num_chunks)
|
||||||
|
]
|
||||||
|
return chunk_lst
|
||||||
|
|
||||||
|
def _get_audio_info(self, wav_file: str,
|
||||||
|
split_chunks: bool) -> List[List[str]]:
|
||||||
|
waveform, sr = load_audio(wav_file)
|
||||||
|
spk_id, sess_id, utt_id = wav_file.split("/")[-3:]
|
||||||
|
audio_id = '-'.join([spk_id, sess_id, utt_id.split(".")[0]])
|
||||||
|
audio_duration = waveform.shape[0] / sr
|
||||||
|
|
||||||
|
ret = []
|
||||||
|
if split_chunks: # Split into pieces of self.chunk_duration seconds.
|
||||||
|
uniq_chunks_list = self._get_chunks(self.chunk_duration, audio_id,
|
||||||
|
audio_duration)
|
||||||
|
|
||||||
|
for chunk in uniq_chunks_list:
|
||||||
|
s, e = chunk.split("_")[-2:] # Timestamps of start and end
|
||||||
|
start_sample = int(float(s) * sr)
|
||||||
|
end_sample = int(float(e) * sr)
|
||||||
|
# id, duration, wav, start, stop, spk_id
|
||||||
|
ret.append([
|
||||||
|
chunk, audio_duration, wav_file, start_sample, end_sample,
|
||||||
|
spk_id
|
||||||
|
])
|
||||||
|
else: # Keep whole audio.
|
||||||
|
ret.append([
|
||||||
|
audio_id, audio_duration, wav_file, 0, waveform.shape[0], spk_id
|
||||||
|
])
|
||||||
|
return ret
|
||||||
|
|
||||||
|
def generate_csv(self,
|
||||||
|
wav_files: List[str],
|
||||||
|
output_file: str,
|
||||||
|
split_chunks: bool=True):
|
||||||
|
print(f'Generating csv: {output_file}')
|
||||||
|
header = ["ID", "duration", "wav", "start", "stop", "spk_id"]
|
||||||
|
# Note: this may occurs c++ execption, but the program will execute fine
|
||||||
|
# so we can ignore the execption
|
||||||
|
with Pool(cpu_count()) as p:
|
||||||
|
infos = list(
|
||||||
|
tqdm(
|
||||||
|
p.imap(lambda x: self._get_audio_info(x, split_chunks),
|
||||||
|
wav_files),
|
||||||
|
total=len(wav_files)))
|
||||||
|
|
||||||
|
csv_lines = []
|
||||||
|
for info in infos:
|
||||||
|
csv_lines.extend(info)
|
||||||
|
|
||||||
|
with open(output_file, mode="w") as csv_f:
|
||||||
|
csv_writer = csv.writer(
|
||||||
|
csv_f, delimiter=",", quotechar='"', quoting=csv.QUOTE_MINIMAL)
|
||||||
|
csv_writer.writerow(header)
|
||||||
|
for line in csv_lines:
|
||||||
|
csv_writer.writerow(line)
|
||||||
|
|
||||||
|
def prepare_data(self):
|
||||||
|
# Audio of speakers in veri_test_file should not be included in training set.
|
||||||
|
print("start to prepare the data csv file")
|
||||||
|
enroll_files = set()
|
||||||
|
test_files = set()
|
||||||
|
# get the enroll and test audio file path
|
||||||
|
with open(self.veri_test_file, 'r') as f:
|
||||||
|
for line in f.readlines():
|
||||||
|
_, enrol_file, test_file = line.strip().split(' ')
|
||||||
|
enroll_files.add(os.path.join(self.wav_path, enrol_file))
|
||||||
|
test_files.add(os.path.join(self.wav_path, test_file))
|
||||||
|
enroll_files = sorted(enroll_files)
|
||||||
|
test_files = sorted(test_files)
|
||||||
|
|
||||||
|
# get the enroll and test speakers
|
||||||
|
test_spks = set()
|
||||||
|
for file in (enroll_files + test_files):
|
||||||
|
spk = file.split('/wav/')[1].split('/')[0]
|
||||||
|
test_spks.add(spk)
|
||||||
|
|
||||||
|
# get all the train and dev audios file path
|
||||||
|
audio_files = []
|
||||||
|
speakers = set()
|
||||||
|
print("Getting file list...")
|
||||||
|
for path in [self.wav_path, self.vox2_base_path]:
|
||||||
|
# if vox2 directory is not set and vox2 is not a directory
|
||||||
|
# we will not process this directory
|
||||||
|
if not path or not os.path.exists(path):
|
||||||
|
print(f"{path} is an invalid path, please check again, "
|
||||||
|
"and we will ignore the vox2 base path")
|
||||||
|
continue
|
||||||
|
for file in glob.glob(
|
||||||
|
os.path.join(path, "**", "*.wav"), recursive=True):
|
||||||
|
spk = file.split('/wav/')[1].split('/')[0]
|
||||||
|
if spk in test_spks:
|
||||||
|
continue
|
||||||
|
speakers.add(spk)
|
||||||
|
audio_files.append(file)
|
||||||
|
|
||||||
|
print(
|
||||||
|
f"start to generate the {os.path.join(self.meta_path, 'spk_id2label.txt')}"
|
||||||
|
)
|
||||||
|
# encode the train and dev speakers label to spk_id2label.txt
|
||||||
|
with open(os.path.join(self.meta_path, 'spk_id2label.txt'), 'w') as f:
|
||||||
|
for label, spk_id in enumerate(
|
||||||
|
sorted(speakers)): # 1211 vox1, 5994 vox2, 7205 vox1+2
|
||||||
|
f.write(f'{spk_id} {label}\n')
|
||||||
|
|
||||||
|
audio_files = sorted(audio_files)
|
||||||
|
random.shuffle(audio_files)
|
||||||
|
split_idx = int(self.split_ratio * len(audio_files))
|
||||||
|
# split_ratio to train
|
||||||
|
train_files, dev_files = audio_files[:split_idx], audio_files[
|
||||||
|
split_idx:]
|
||||||
|
|
||||||
|
self.generate_csv(train_files, os.path.join(self.csv_path, 'train.csv'))
|
||||||
|
self.generate_csv(dev_files, os.path.join(self.csv_path, 'dev.csv'))
|
||||||
|
|
||||||
|
self.generate_csv(
|
||||||
|
enroll_files,
|
||||||
|
os.path.join(self.csv_path, 'enroll.csv'),
|
||||||
|
split_chunks=False)
|
||||||
|
self.generate_csv(
|
||||||
|
test_files,
|
||||||
|
os.path.join(self.csv_path, 'test.csv'),
|
||||||
|
split_chunks=False)
|
||||||
|
|
||||||
|
def __getitem__(self, idx):
|
||||||
|
return self._convert_to_record(idx)
|
||||||
|
|
||||||
|
def __len__(self):
|
||||||
|
return len(self._data)
|
@ -0,0 +1,100 @@
|
|||||||
|
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
from typing import List
|
||||||
|
|
||||||
|
import numpy as np
|
||||||
|
import paddle
|
||||||
|
from sklearn.metrics import roc_curve
|
||||||
|
|
||||||
|
|
||||||
|
def compute_eer(labels: np.ndarray, scores: np.ndarray) -> List[float]:
|
||||||
|
"""Compute EER and return score threshold.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
labels (np.ndarray): the trial label, shape: [N], one-dimention, N refer to the samples num
|
||||||
|
scores (np.ndarray): the trial scores, shape: [N], one-dimention, N refer to the samples num
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
List[float]: eer and the specific threshold
|
||||||
|
"""
|
||||||
|
fpr, tpr, threshold = roc_curve(y_true=labels, y_score=scores)
|
||||||
|
fnr = 1 - tpr
|
||||||
|
eer_threshold = threshold[np.nanargmin(np.absolute((fnr - fpr)))]
|
||||||
|
eer = fpr[np.nanargmin(np.absolute((fnr - fpr)))]
|
||||||
|
return eer, eer_threshold
|
||||||
|
|
||||||
|
|
||||||
|
def compute_minDCF(positive_scores,
|
||||||
|
negative_scores,
|
||||||
|
c_miss=1.0,
|
||||||
|
c_fa=1.0,
|
||||||
|
p_target=0.01):
|
||||||
|
"""
|
||||||
|
This is modified from SpeechBrain
|
||||||
|
https://github.com/speechbrain/speechbrain/blob/085be635c07f16d42cd1295045bc46c407f1e15b/speechbrain/utils/metric_stats.py#L509
|
||||||
|
Computes the minDCF metric normally used to evaluate speaker verification
|
||||||
|
systems. The min_DCF is the minimum of the following C_det function computed
|
||||||
|
within the defined threshold range:
|
||||||
|
|
||||||
|
C_det = c_miss * p_miss * p_target + c_fa * p_fa * (1 -p_target)
|
||||||
|
|
||||||
|
where p_miss is the missing probability and p_fa is the probability of having
|
||||||
|
a false alarm.
|
||||||
|
|
||||||
|
Args:
|
||||||
|
positive_scores (Paddle.Tensor): The scores from entries of the same class.
|
||||||
|
negative_scores (Paddle.Tensor): The scores from entries of different classes.
|
||||||
|
c_miss (float, optional): Cost assigned to a missing error (default 1.0).
|
||||||
|
c_fa (float, optional): Cost assigned to a false alarm (default 1.0).
|
||||||
|
p_target (float, optional): Prior probability of having a target (default 0.01).
|
||||||
|
|
||||||
|
Returns:
|
||||||
|
List[float]: min dcf and the specific threshold
|
||||||
|
"""
|
||||||
|
# Computing candidate thresholds
|
||||||
|
if len(positive_scores.shape) > 1:
|
||||||
|
positive_scores = positive_scores.squeeze()
|
||||||
|
|
||||||
|
if len(negative_scores.shape) > 1:
|
||||||
|
negative_scores = negative_scores.squeeze()
|
||||||
|
|
||||||
|
thresholds = paddle.sort(paddle.concat([positive_scores, negative_scores]))
|
||||||
|
thresholds = paddle.unique(thresholds)
|
||||||
|
|
||||||
|
# Adding intermediate thresholds
|
||||||
|
interm_thresholds = (thresholds[0:-1] + thresholds[1:]) / 2
|
||||||
|
thresholds = paddle.sort(paddle.concat([thresholds, interm_thresholds]))
|
||||||
|
|
||||||
|
# Computing False Rejection Rate (miss detection)
|
||||||
|
positive_scores = paddle.concat(
|
||||||
|
len(thresholds) * [positive_scores.unsqueeze(0)])
|
||||||
|
pos_scores_threshold = positive_scores.transpose(perm=[1, 0]) <= thresholds
|
||||||
|
p_miss = (pos_scores_threshold.sum(0)
|
||||||
|
).astype("float32") / positive_scores.shape[1]
|
||||||
|
del positive_scores
|
||||||
|
del pos_scores_threshold
|
||||||
|
|
||||||
|
# Computing False Acceptance Rate (false alarm)
|
||||||
|
negative_scores = paddle.concat(
|
||||||
|
len(thresholds) * [negative_scores.unsqueeze(0)])
|
||||||
|
neg_scores_threshold = negative_scores.transpose(perm=[1, 0]) > thresholds
|
||||||
|
p_fa = (neg_scores_threshold.sum(0)
|
||||||
|
).astype("float32") / negative_scores.shape[1]
|
||||||
|
del negative_scores
|
||||||
|
del neg_scores_threshold
|
||||||
|
|
||||||
|
c_det = c_miss * p_miss * p_target + c_fa * p_fa * (1 - p_target)
|
||||||
|
c_min = paddle.min(c_det, axis=0)
|
||||||
|
min_index = paddle.argmin(c_det, axis=0)
|
||||||
|
return float(c_min), float(thresholds[min_index])
|
@ -0,0 +1,14 @@
|
|||||||
|
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
from .infer import VectorExecutor
|
@ -0,0 +1,139 @@
|
|||||||
|
# Copyright (c) 2022 PaddlePaddle Authors. All Rights Reserved.
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
import paddle
|
||||||
|
from paddle import nn
|
||||||
|
|
||||||
|
from paddlespeech.s2t.modules.initializer import KaimingUniform
|
||||||
|
"""
|
||||||
|
To align the initializer between paddle and torch,
|
||||||
|
the API below are set defalut initializer with priority higger than global initializer.
|
||||||
|
"""
|
||||||
|
global_init_type = None
|
||||||
|
|
||||||
|
|
||||||
|
class LayerNorm(nn.LayerNorm):
|
||||||
|
def __init__(self,
|
||||||
|
normalized_shape,
|
||||||
|
epsilon=1e-05,
|
||||||
|
weight_attr=None,
|
||||||
|
bias_attr=None,
|
||||||
|
name=None):
|
||||||
|
if weight_attr is None:
|
||||||
|
weight_attr = paddle.ParamAttr(
|
||||||
|
initializer=nn.initializer.Constant(1.0))
|
||||||
|
if bias_attr is None:
|
||||||
|
bias_attr = paddle.ParamAttr(
|
||||||
|
initializer=nn.initializer.Constant(0.0))
|
||||||
|
super(LayerNorm, self).__init__(normalized_shape, epsilon, weight_attr,
|
||||||
|
bias_attr, name)
|
||||||
|
|
||||||
|
|
||||||
|
class BatchNorm1D(nn.BatchNorm1D):
|
||||||
|
def __init__(self,
|
||||||
|
num_features,
|
||||||
|
momentum=0.9,
|
||||||
|
epsilon=1e-05,
|
||||||
|
weight_attr=None,
|
||||||
|
bias_attr=None,
|
||||||
|
data_format='NCL',
|
||||||
|
name=None):
|
||||||
|
if weight_attr is None:
|
||||||
|
weight_attr = paddle.ParamAttr(
|
||||||
|
initializer=nn.initializer.Constant(1.0))
|
||||||
|
if bias_attr is None:
|
||||||
|
bias_attr = paddle.ParamAttr(
|
||||||
|
initializer=nn.initializer.Constant(0.0))
|
||||||
|
super(BatchNorm1D,
|
||||||
|
self).__init__(num_features, momentum, epsilon, weight_attr,
|
||||||
|
bias_attr, data_format, name)
|
||||||
|
|
||||||
|
|
||||||
|
class Embedding(nn.Embedding):
|
||||||
|
def __init__(self,
|
||||||
|
num_embeddings,
|
||||||
|
embedding_dim,
|
||||||
|
padding_idx=None,
|
||||||
|
sparse=False,
|
||||||
|
weight_attr=None,
|
||||||
|
name=None):
|
||||||
|
if weight_attr is None:
|
||||||
|
weight_attr = paddle.ParamAttr(initializer=nn.initializer.Normal())
|
||||||
|
super(Embedding, self).__init__(num_embeddings, embedding_dim,
|
||||||
|
padding_idx, sparse, weight_attr, name)
|
||||||
|
|
||||||
|
|
||||||
|
class Linear(nn.Linear):
|
||||||
|
def __init__(self,
|
||||||
|
in_features,
|
||||||
|
out_features,
|
||||||
|
weight_attr=None,
|
||||||
|
bias_attr=None,
|
||||||
|
name=None):
|
||||||
|
if weight_attr is None:
|
||||||
|
if global_init_type == "kaiming_uniform":
|
||||||
|
weight_attr = paddle.ParamAttr(initializer=KaimingUniform())
|
||||||
|
if bias_attr is None:
|
||||||
|
if global_init_type == "kaiming_uniform":
|
||||||
|
bias_attr = paddle.ParamAttr(initializer=KaimingUniform())
|
||||||
|
super(Linear, self).__init__(in_features, out_features, weight_attr,
|
||||||
|
bias_attr, name)
|
||||||
|
|
||||||
|
|
||||||
|
class Conv1D(nn.Conv1D):
|
||||||
|
def __init__(self,
|
||||||
|
in_channels,
|
||||||
|
out_channels,
|
||||||
|
kernel_size,
|
||||||
|
stride=1,
|
||||||
|
padding=0,
|
||||||
|
dilation=1,
|
||||||
|
groups=1,
|
||||||
|
padding_mode='zeros',
|
||||||
|
weight_attr=None,
|
||||||
|
bias_attr=None,
|
||||||
|
data_format='NCL'):
|
||||||
|
if weight_attr is None:
|
||||||
|
if global_init_type == "kaiming_uniform":
|
||||||
|
print("set kaiming_uniform")
|
||||||
|
weight_attr = paddle.ParamAttr(initializer=KaimingUniform())
|
||||||
|
if bias_attr is None:
|
||||||
|
if global_init_type == "kaiming_uniform":
|
||||||
|
bias_attr = paddle.ParamAttr(initializer=KaimingUniform())
|
||||||
|
super(Conv1D, self).__init__(
|
||||||
|
in_channels, out_channels, kernel_size, stride, padding, dilation,
|
||||||
|
groups, padding_mode, weight_attr, bias_attr, data_format)
|
||||||
|
|
||||||
|
|
||||||
|
class Conv2D(nn.Conv2D):
|
||||||
|
def __init__(self,
|
||||||
|
in_channels,
|
||||||
|
out_channels,
|
||||||
|
kernel_size,
|
||||||
|
stride=1,
|
||||||
|
padding=0,
|
||||||
|
dilation=1,
|
||||||
|
groups=1,
|
||||||
|
padding_mode='zeros',
|
||||||
|
weight_attr=None,
|
||||||
|
bias_attr=None,
|
||||||
|
data_format='NCHW'):
|
||||||
|
if weight_attr is None:
|
||||||
|
if global_init_type == "kaiming_uniform":
|
||||||
|
weight_attr = paddle.ParamAttr(initializer=KaimingUniform())
|
||||||
|
if bias_attr is None:
|
||||||
|
if global_init_type == "kaiming_uniform":
|
||||||
|
bias_attr = paddle.ParamAttr(initializer=KaimingUniform())
|
||||||
|
super(Conv2D, self).__init__(
|
||||||
|
in_channels, out_channels, kernel_size, stride, padding, dilation,
|
||||||
|
groups, padding_mode, weight_attr, bias_attr, data_format)
|
@ -0,0 +1,172 @@
|
|||||||
|
# Copyright (c) 2018 PaddlePaddle Authors. All Rights Reserved.
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
import numpy as np
|
||||||
|
from paddle.fluid import framework
|
||||||
|
from paddle.fluid import unique_name
|
||||||
|
from paddle.fluid.core import VarDesc
|
||||||
|
from paddle.fluid.initializer import MSRAInitializer
|
||||||
|
|
||||||
|
__all__ = ['KaimingUniform']
|
||||||
|
|
||||||
|
|
||||||
|
class KaimingUniform(MSRAInitializer):
    r"""Implements the Kaiming Uniform initializer

    This class implements the weight initialization from the paper
    `Delving Deep into Rectifiers: Surpassing Human-Level Performance on
    ImageNet Classification <https://arxiv.org/abs/1502.01852>`_
    by Kaiming He, Xiangyu Zhang, Shaoqing Ren and Jian Sun. This is a
    robust initialization method that particularly considers the rectifier
    nonlinearities.

    In case of Uniform distribution, the range is [-x, x], where

    .. math::

        x = \sqrt{\frac{1.0}{fan\_in}}

    In case of Normal distribution, the mean is 0 and the standard deviation
    is

    .. math::

        \sqrt{\frac{2.0}{fan\_in}}

    Args:
        fan_in (float32|None): fan_in for Kaiming uniform Initializer. If None, it is\
            inferred from the variable. default is None.

    Note:
        It is recommended to set fan_in to None for most cases.

    Examples:
        .. code-block:: python

            import paddle
            import paddle.nn as nn

            linear = nn.Linear(2,
                               4,
                               weight_attr=nn.initializer.KaimingUniform())
            data = paddle.rand([30, 10, 2], dtype='float32')
            res = linear(data)

    """

    def __init__(self, fan_in=None):
        # uniform=True selects the Kaiming *uniform* variant of the MSRA
        # scheme; seed=0 means "defer to the program-level random seed"
        # (picked up later in __call__).
        super(KaimingUniform, self).__init__(
            uniform=True, fan_in=fan_in, seed=0)

    def __call__(self, var, block=None):
        """Initialize the input tensor with MSRA initialization.

        Appends the random-initialization op(s) to ``block`` so that ``var``
        is filled when the program runs (static graph), or runs them eagerly
        in dygraph mode.

        Args:
            var(Tensor): Tensor that needs to be initialized.
            block(Block, optional): The block in which initialization ops
                should be added. Used in static graph only, default None.

        Returns:
            The initialization op
        """
        # Resolve the target block (handles the block=None dygraph case).
        block = self._check_block(block)

        assert isinstance(var, framework.Variable)
        assert isinstance(block, framework.Block)
        # fan-out is computed but unused here; only fan-in drives the scale.
        f_in, f_out = self._compute_fans(var)

        # If fan_in is passed, use it
        fan_in = f_in if self._fan_in is None else self._fan_in

        # seed==0 means "unset": fall back to the program-wide seed.
        # NOTE(review): this mutates self._seed, so the first program seed
        # seen is reused by later calls to the same initializer instance —
        # matches the upstream Paddle MSRAInitializer; confirm if reusing
        # one instance across programs.
        if self._seed == 0:
            self._seed = block.program.random_seed

        # to be compatible with fp16 initializers: random ops may not support
        # fp16/bf16 output, so generate in fp32 into a temp var and cast back
        # into `var` afterwards (see the "cast" op below).
        if var.dtype == VarDesc.VarType.FP16 or (
                var.dtype == VarDesc.VarType.BF16 and not self._uniform):
            out_dtype = VarDesc.VarType.FP32
            # NOTE(review): 'masra_init' looks like a typo for 'msra_init';
            # it only affects the generated temp-variable name, not behavior.
            out_var = block.create_var(
                name=unique_name.generate(
                    ".".join(['masra_init', var.name, 'tmp'])),
                shape=var.shape,
                dtype=out_dtype,
                type=VarDesc.VarType.LOD_TENSOR,
                persistable=False)
        else:
            # No dtype workaround needed: write directly into `var`.
            out_dtype = var.dtype
            out_var = var

        if self._uniform:
            # Kaiming-uniform bound: U(-limit, limit), limit = sqrt(1/fan_in).
            limit = np.sqrt(1.0 / float(fan_in))
            op = block.append_op(
                type="uniform_random",
                inputs={},
                outputs={"Out": out_var},
                attrs={
                    "shape": out_var.shape,
                    "dtype": int(out_dtype),
                    "min": -limit,
                    "max": limit,
                    "seed": self._seed
                },
                stop_gradient=True)

        else:
            # Kaiming-normal: N(0, std), std = sqrt(2/fan_in).
            op = block.append_op(
                type="gaussian_random",
                outputs={"Out": out_var},
                attrs={
                    "shape": out_var.shape,
                    "dtype": int(out_dtype),
                    "mean": 0.0,
                    "std": std,
                    "seed": self._seed
                },
                stop_gradient=True)

        # Second half of the fp16/bf16 workaround: cast the fp32 temp result
        # back into the originally-requested dtype of `var`.
        if var.dtype == VarDesc.VarType.FP16 or (
                var.dtype == VarDesc.VarType.BF16 and not self._uniform):
            block.append_op(
                type="cast",
                inputs={"X": out_var},
                outputs={"Out": var},
                attrs={"in_dtype": out_var.dtype,
                       "out_dtype": var.dtype})

        # In static-graph mode, record the producing op on the variable so
        # the framework can track the initialization dependency.
        if not framework.in_dygraph_mode():
            var.op = op
        return op
|
||||||
|
|
||||||
|
|
||||||
|
class DefaultInitializerContext(object):
    """Context manager that installs a process-wide default initializer type.

    On entry it sets ``align.global_init_type`` to ``init_type`` (unless
    ``init_type`` is None, in which case it does nothing); on exit it always
    resets ``align.global_init_type`` back to None.

    egs:
        with DefaultInitializerContext("kaiming_uniform"):
            code for setup_model
    """

    def __init__(self, init_type=None):
        # Name of the initializer to make the global default; None disables
        # the override on entry.
        self.init_type = init_type

    def __enter__(self):
        # Guard clause: a None init_type means "leave the global untouched".
        if self.init_type is not None:
            # Imported lazily to avoid a circular import at module load time.
            from paddlespeech.s2t.modules import align
            align.global_init_type = self.init_type

    def __exit__(self, exc_type, exc_val, exc_tb):
        # Always clear the global on exit, even when __enter__ was a no-op.
        from paddlespeech.s2t.modules import align
        align.global_init_type = None
|
# NOTE(review): the following web-viewer residue was captured with the file
# and is not part of the source:
#   "Some files were not shown because too many files have changed in this diff Show More"
#   "Loading…" / "Reference in new issue"