diff --git a/.gitignore b/.gitignore index cfdf0275..1724bd43 100644 --- a/.gitignore +++ b/.gitignore @@ -24,5 +24,7 @@ tools/montreal-forced-aligner/ tools/Montreal-Forced-Aligner/ tools/sctk tools/sctk-20159b5/ +tools/kaldi +tools/OpenBLAS/ *output/ diff --git a/README.md b/README.md index 809ffe6d..e0769720 100644 --- a/README.md +++ b/README.md @@ -1,31 +1,302 @@ -# PaddlePaddle Speech toolkit +English | [简体中文](README_ch.md) +# PaddleSpeech + + + +

+ +

+
+ +

+ Quick Start + | Tutorials + | Models List + +

+ +------------------------------------------------------------------------------------ ![License](https://img.shields.io/badge/license-Apache%202-red.svg) ![python version](https://img.shields.io/badge/python-3.7+-orange.svg) ![support os](https://img.shields.io/badge/os-linux-yellow.svg) -*DeepSpeech* is an open-source implementation of end-to-end Automatic Speech Recognition engine, with [PaddlePaddle](https://github.com/PaddlePaddle/Paddle) platform. Our vision is to empower both industrial application and academic research on speech recognition, via an easy-to-use, efficient, samller and scalable implementation, including training, inference & testing module, and deployment. + + +**PaddleSpeech** is an open-source toolkit on the [PaddlePaddle](https://github.com/PaddlePaddle/Paddle) platform for two critical tasks in Speech - **Automatic Speech Recognition (ASR)** and **Text-To-Speech Synthesis (TTS)**, with modules involving state-of-the-art and influential models. + +Via an easy-to-use, efficient, flexible and scalable implementation, our vision is to empower both industrial application and academic research, including training, inference & testing modules, and deployment. Besides, this toolkit also features: +- **Fast and Light-weight**: we provide a high-speed and ultra-lightweight model that is convenient for industrial deployment. +- **Rule-based Chinese frontend**: our frontend contains Text Normalization (TN) and Grapheme-to-Phoneme (G2P, including Polyphone and Tone Sandhi). Moreover, we use self-defined linguistic rules to adapt to the Chinese context. +- **Varieties of Functions that Vitalize Research**: + - *Integration of mainstream models and datasets*: the toolkit implements modules that participate in the whole pipeline of both ASR and TTS, and uses datasets like LibriSpeech, LJSpeech, AIShell, etc. See also the [models list](#models-list) for more details. + - *Support of ASR streaming and non-streaming data*: this toolkit contains non-streaming/streaming models like [DeepSpeech2](http://proceedings.mlr.press/v48/amodei16.pdf), [Transformer](https://arxiv.org/abs/1706.03762), [Conformer](https://arxiv.org/abs/2005.08100) and [U2](https://arxiv.org/pdf/2012.05481.pdf). + +Let's install PaddleSpeech with only a few lines of code! + +>Note: The official name is still deepspeech. 2021/10/26 + +``` shell +# 1. Install essential libraries and paddlepaddle first. +# install prerequisites +sudo apt-get install -y sox pkg-config libflac-dev libogg-dev libvorbis-dev libboost-dev swig python3-dev libsndfile1 +# Use `pip install paddlepaddle-gpu` instead if you are using a GPU. +pip install paddlepaddle + +# 2. Then install PaddleSpeech. +git clone https://github.com/PaddlePaddle/DeepSpeech.git +cd DeepSpeech +pip install -e . +``` + + +## Table of Contents + +The contents of this README are as follows: +- [Alternative Installation](#installation) +- [Quick Start](#quick-start) +- [Models List](#models-list) +- [Tutorials](#tutorials) +- [FAQ and Contributing](#faq-and-contributing) +- [License](#license) +- [Acknowledgement](#acknowledgement) + +## Alternative Installation + +The base environment on this page is: +- Ubuntu 16.04 +- python>=3.7 +- paddlepaddle==2.1.2 +If you want to set up PaddleSpeech in another environment, please see the [ASR installation](docs/source/asr/install.md) and [TTS installation](docs/source/tts/install.md) documents for all the alternatives. -## Features +## Quick Start - See [feature list](docs/source/asr/feature_list.md) for more information.
+> Note: `ckptfile` below should be replaced with the real path to your checkpoint file or folder. Similarly, `exp/default` is the folder that contains the pretrained models. -## Setup Try training a tiny ASR DeepSpeech2 model on the toy subset of LibriSpeech: -All tested under: -* Ubuntu 16.04 -* python>=3.7 -* paddlepaddle==2.1.2 +```shell +cd examples/tiny/s0/ +# source the environment +source path.sh +# prepare librispeech dataset +bash local/data.sh +# evaluate the model checkpoint `ckptfile` +bash local/test.sh conf/deepspeech2.yaml ckptfile offline +``` -Please see [install](docs/source/asr/install.md). +For TTS, try FastSpeech2 on LJSpeech: +- Download LJSpeech-1.1 from the [ljspeech official website](https://keithito.com/LJ-Speech-Dataset/) and our prepared durations for fastspeech2 [ljspeech_alignment](https://paddlespeech.bj.bcebos.com/MFA/LJSpeech-1.1/ljspeech_alignment.tar.gz). +- Assuming the dataset and the alignments are at `~/datasets/LJSpeech-1.1` and `./ljspeech_alignment` respectively, preprocess your data and then use our pretrained model to synthesize: +```shell +bash ./local/preprocess.sh conf/default.yaml +bash ./local/synthesize_e2e.sh conf/default.yaml exp/default ckptfile +``` -## Getting Started -Please see [Getting Started](docs/source/asr/getting_started.md) and [tiny egs](examples/tiny/s0/README.md). +If you want to try more functions like training and tuning, please see [ASR getting started](docs/source/asr/getting_started.md) and [TTS Basic Use](/docs/source/tts/basic_usage.md). -## More Information ## Models List + +PaddleSpeech ASR supports a lot of mainstream models, which are summarized as follows. For more information, please refer to [ASR Models](./docs/source/asr/released_model.md).
+| ASR Module Type | Dataset | Model Type | Link |
+| :-------------- | :------ | :--------- | :--- |
+| Acoustic Model | Aishell | 2 Conv + 5 LSTM layers with only forward direction | Ds2 Online Aishell Model |
+| Acoustic Model | Aishell | 2 Conv + 3 bidirectional GRU layers | Ds2 Offline Aishell Model |
+| Acoustic Model | Aishell | Encoder:Conformer, Decoder:Transformer, Decoding method: Attention + CTC | Conformer Offline Aishell Model |
+| Acoustic Model | Aishell | Encoder:Conformer, Decoder:Transformer, Decoding method: Attention | Conformer Librispeech Model |
+| Acoustic Model | Librispeech | Encoder:Conformer, Decoder:Transformer, Decoding method: Attention | Conformer Librispeech Model |
+| Acoustic Model | Librispeech | Encoder:Transformer, Decoder:Transformer, Decoding method: Attention | Transformer Librispeech Model |
+| Language Model | CommonCrawl(en.00) | English Language Model | English Language Model |
+| Language Model | Baidu Internal Corpus | Mandarin Language Model Small | Mandarin Language Model Small |
+| Language Model | Baidu Internal Corpus | Mandarin Language Model Large | Mandarin Language Model Large |
+ + +PaddleSpeech TTS mainly contains three modules: *Text Frontend*, *Acoustic Model* and *Vocoder*. Acoustic Model and Vocoder models are listed as follows:
+| TTS Module Type | Model Type | Dataset | Link |
+| :-------------- | :--------- | :------ | :--- |
+| Text Frontend | | | chinese-frontend |
+| Acoustic Model | Tacotron2 | LJSpeech | tacotron2-vctk |
+| Acoustic Model | TransformerTTS | LJSpeech | transformer-ljspeech |
+| Acoustic Model | SpeedySpeech | CSMSC | speedyspeech-csmsc |
+| Acoustic Model | FastSpeech2 | AISHELL-3 | fastspeech2-aishell3 |
+| Acoustic Model | FastSpeech2 | VCTK | fastspeech2-vctk |
+| Acoustic Model | FastSpeech2 | LJSpeech | fastspeech2-ljspeech |
+| Acoustic Model | FastSpeech2 | CSMSC | fastspeech2-csmsc |
+| Vocoder | WaveFlow | LJSpeech | waveflow-ljspeech |
+| Vocoder | Parallel WaveGAN | LJSpeech | PWGAN-ljspeech |
+| Vocoder | Parallel WaveGAN | VCTK | PWGAN-vctk |
+| Vocoder | Parallel WaveGAN | CSMSC | PWGAN-csmsc |
+| Voice Cloning | GE2E | AISHELL-3, etc. | ge2e |
+| Voice Cloning | GE2E + Tacotron2 | AISHELL-3 | ge2e-tactron2-aishell3 |
+ + +## Tutorials + +Normally, [Speech SoTA](https://paperswithcode.com/area/speech) gives you an overview of the hot academic topics in speech. If you want to focus on the two tasks in PaddleSpeech, you will find the following guidelines helpful for grasping the core ideas. + +The original ASR module is based on [Baidu's DeepSpeech](https://arxiv.org/abs/1412.5567), which is also the basis of an independent product named [DeepSpeech](https://deepspeech.readthedocs.io). However, the toolkit integrates almost all the SoTA modules in the pipeline. Specifically, these modules are * [Data Prepration](docs/source/asr/data_preparation.md) * [Data Augmentation](docs/source/asr/augmentation.md) @@ -33,16 +304,18 @@ Please see [Getting Started](docs/source/asr/getting_started.md) and [tiny egs]( * [Benchmark](docs/source/asr/benchmark.md) * [Relased Model](docs/source/asr/released_model.md) +The TTS module was originally called [Parakeet](https://github.com/PaddlePaddle/Parakeet) and is now merged with DeepSpeech. If you are interested in academic research about this function, please see the [TTS research overview](https://github.com/PaddlePaddle/DeepSpeech/tree/develop/docs/source/tts#overview). Also, [this document](https://paddleparakeet.readthedocs.io/en/latest/released_models.html) is a good guideline for the pipeline components. -## Questions and Help -You are welcome to submit questions in [Github Discussions](https://github.com/PaddlePaddle/DeepSpeech/discussions) and bug reports in [Github Issues](https://github.com/PaddlePaddle/DeepSpeech/issues). You are also welcome to contribute to this project. +## FAQ and Contributing +You are warmly welcomed to submit questions in [discussions](https://github.com/PaddlePaddle/DeepSpeech/discussions) and bug reports in [issues](https://github.com/PaddlePaddle/DeepSpeech/issues)! We also highly appreciate your contributions to this project! ## License -DeepSpeech is provided under the [Apache-2.0 License](./LICENSE). +PaddleSpeech is provided under the [Apache-2.0 License](./LICENSE). ## Acknowledgement -We depends on many open source repos. See [References](docs/source/asr/reference.md) for more information. +PaddleSpeech depends on many open-source repos. See [references](docs/source/asr/reference.md) for more information.
+ diff --git a/deepspeech/decoders/recog.py b/deepspeech/decoders/recog.py index 6dea6b70..bc48e692 100644 --- a/deepspeech/decoders/recog.py +++ b/deepspeech/decoders/recog.py @@ -24,21 +24,23 @@ from .utils import add_results_to_json from deepspeech.exps import dynamic_import_tester from deepspeech.io.reader import LoadInputsAndTargets from deepspeech.models.asr_interface import ASRInterface +from deepspeech.models.lm_interface import dynamic_import_lm from deepspeech.utils.log import Log -# from espnet.asr.asr_utils import get_model_conf -# from espnet.asr.asr_utils import torch_load -# from espnet.nets.lm_interface import dynamic_import_lm logger = Log(__name__).getlog() # NOTE: you need this func to generate our sphinx doc +def get_config(config_path): + confs = CfgNode(new_allowed=True) + confs.merge_from_file(config_path) + return confs + + def load_trained_model(args): args.nprocs = args.ngpu - confs = CfgNode() - confs.set_new_allowed(True) - confs.merge_from_file(args.model_conf) + confs = get_config(args.model_conf) class_obj = dynamic_import_tester(args.model_name) exp = class_obj(confs, args) with exp.eval(): @@ -49,6 +51,16 @@ def load_trained_model(args): return model, char_list, exp, confs +def load_trained_lm(args): + lm_args = get_config(args.rnnlm_conf) + lm_model_module = lm_args.model_module + lm_class = dynamic_import_lm(lm_model_module) + lm = lm_class(**lm_args.model) + model_dict = paddle.load(args.rnnlm) + lm.set_state_dict(model_dict) + return lm + + def recog_v2(args): """Decode with custom models that implements ScorerInterface. @@ -78,12 +90,7 @@ def recog_v2(args): preprocess_args={"train": False}, ) if args.rnnlm: - lm_args = get_model_conf(args.rnnlm, args.rnnlm_conf) - # NOTE: for a compatibility with less than 0.5.0 version models - lm_model_module = getattr(lm_args, "model_module", "default") - lm_class = dynamic_import_lm(lm_model_module, lm_args.backend) - lm = lm_class(len(char_list), lm_args) - torch_load(args.rnnlm, lm) + lm = load_trained_lm(args) lm.eval() else: lm = None diff --git a/deepspeech/decoders/recog_bin.py b/deepspeech/decoders/recog_bin.py index fbf582f7..7c866648 100644 --- a/deepspeech/decoders/recog_bin.py +++ b/deepspeech/decoders/recog_bin.py @@ -21,8 +21,6 @@ from distutils.util import strtobool import configargparse import numpy as np -from deepspeech.decoders.recog import recog_v2 - def get_parser(): """Get default arguments.""" @@ -359,7 +357,7 @@ def main(args): if args.num_encs == 1: # Experimental API that supports custom LMs if args.api == "v2": - + from deepspeech.decoders.recog import recog_v2 recog_v2(args) else: raise ValueError("Only support --api v2") diff --git a/deepspeech/decoders/scorers/ctc_prefix_score.py b/deepspeech/decoders/scorers/ctc_prefix_score.py index c85d546d..13429d49 100644 --- a/deepspeech/decoders/scorers/ctc_prefix_score.py +++ b/deepspeech/decoders/scorers/ctc_prefix_score.py @@ -318,6 +318,18 @@ class CTCPrefixScore(): r[0, 0] = xs[0] r[0, 1] = self.logzero else: + # Although the code does not exactly follow Algorithm 2, + # we don't have to change it because we can assume + # r_t(h)=0 for t < |h| in CTC forward computation + # (Note: we assume here that index t starts with 0). + # The purpose of this difference is to reduce the number of for-loops. + # https://github.com/espnet/espnet/pull/3655 + # where we start to accumulate r_t(h) from t=|h| + # and iterate r_t(h) = (r_{t-1}(h) + ...) to T-1, + # avoiding accumulating zeros for t=1~|h|-1. 
+ # Thus, we need to set r_{|h|-1}(h) = 0, + # i.e., r[output_length-1] = logzero, for initialization. + # This is just for reducing the computation. r[output_length - 1] = self.logzero # prepare forward probabilities for the last label diff --git a/deepspeech/frontend/augmentor/augmentation.py b/deepspeech/frontend/augmentor/augmentation.py index 0de81333..d2316ab1 100644 --- a/deepspeech/frontend/augmentor/augmentation.py +++ b/deepspeech/frontend/augmentor/augmentation.py @@ -13,6 +13,7 @@ # limitations under the License. """Contains the data augmentation pipeline.""" import json +import os from collections.abc import Sequence from inspect import signature from pprint import pformat @@ -90,9 +91,8 @@ class AugmentationPipeline(): effect. Params: - augmentation_config(str): Augmentation configuration in json string. + preprocess_conf(str): Augmentation configuration in `json file` or `json string`. random_seed(int): Random seed. - train(bool): whether is train mode. Raises: ValueError: If the augmentation json config is in incorrect format". @@ -100,11 +100,18 @@ class AugmentationPipeline(): SPEC_TYPES = {'specaug'} - def __init__(self, augmentation_config: str, random_seed: int=0): + def __init__(self, preprocess_conf: str, random_seed: int=0): self._rng = np.random.RandomState(random_seed) self.conf = {'mode': 'sequential', 'process': []} - if augmentation_config: - process = json.loads(augmentation_config) + if preprocess_conf: + if os.path.isfile(preprocess_conf): + # json file + with open(preprocess_conf, 'r') as fin: + json_string = fin.read() + else: + # json string + json_string = preprocess_conf + process = json.loads(json_string) self.conf['process'] += process self._augmentors, self._rates = self._parse_pipeline_from('all') diff --git a/deepspeech/io/collator.py b/deepspeech/io/collator.py index 5f0bc462..b523dfc8 100644 --- a/deepspeech/io/collator.py +++ b/deepspeech/io/collator.py @@ -105,7 +105,7 @@ class SpeechCollatorBase(): self._local_data = TarLocalData(tar2info={}, tar2object={}) self.augmentation = AugmentationPipeline( - augmentation_config=aug_file.read(), random_seed=random_seed) + preprocess_conf=aug_file.read(), random_seed=random_seed) self._normalizer = FeatureNormalizer( mean_std_filepath) if mean_std_filepath else None diff --git a/deepspeech/io/reader.py b/deepspeech/io/reader.py index 5873788b..59098752 100644 --- a/deepspeech/io/reader.py +++ b/deepspeech/io/reader.py @@ -17,7 +17,7 @@ import kaldiio import numpy as np import soundfile -from deepspeech.frontend.augmentor.augmentation import AugmentationPipeline +from deepspeech.frontend.augmentor.augmentation import AugmentationPipeline as Transformation from deepspeech.utils.log import Log __all__ = ["LoadInputsAndTargets"] @@ -66,8 +66,7 @@ class LoadInputsAndTargets(): raise ValueError("Only asr are allowed: mode={}".format(mode)) if preprocess_conf is not None: - with open(preprocess_conf, 'r') as fin: - self.preprocessing = AugmentationPipeline(fin.read()) + self.preprocessing = Transformation(preprocess_conf) logger.warning( "[Experimental feature] Some preprocessing will be done " "for the mini-batch creation using {}".format( diff --git a/deepspeech/models/asr_interface.py b/deepspeech/models/asr_interface.py index 7dac81b4..d86daa0b 100644 --- a/deepspeech/models/asr_interface.py +++ b/deepspeech/models/asr_interface.py @@ -18,7 +18,7 @@ from deepspeech.utils.dynamic_import import dynamic_import class ASRInterface: - """ASR Interface for ESPnet model implementation.""" + """ASR Interface model 
implementation.""" @staticmethod def add_arguments(parser): @@ -103,14 +103,14 @@ class ASRInterface: @property def attention_plot_class(self): """Get attention plot class.""" - from espnet.asr.asr_utils import PlotAttentionReport + from deepspeech.training.extensions.plot import PlotAttentionReport return PlotAttentionReport @property def ctc_plot_class(self): """Get CTC plot class.""" - from espnet.asr.asr_utils import PlotCTCReport + from deepspeech.training.extensions.plot import PlotCTCReport return PlotCTCReport diff --git a/deepspeech/models/lm/__init__.py b/deepspeech/models/lm/__init__.py new file mode 100644 index 00000000..185a92b8 --- /dev/null +++ b/deepspeech/models/lm/__init__.py @@ -0,0 +1,13 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. diff --git a/deepspeech/models/lm/transformer.py b/deepspeech/models/lm/transformer.py new file mode 100644 index 00000000..35ecf678 --- /dev/null +++ b/deepspeech/models/lm/transformer.py @@ -0,0 +1,262 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+from typing import Any +from typing import List +from typing import Tuple + +import numpy as np +import paddle +import paddle.nn as nn +import paddle.nn.functional as F + +from deepspeech.decoders.scorers.scorer_interface import BatchScorerInterface +from deepspeech.models.lm_interface import LMInterface +from deepspeech.modules.encoder import TransformerEncoder +from deepspeech.modules.mask import subsequent_mask +from deepspeech.utils.log import Log + +logger = Log(__name__).getlog() + + +class TransformerLM(nn.Layer, LMInterface, BatchScorerInterface): + def __init__(self, + n_vocab: int, + pos_enc: str=None, + embed_unit: int=128, + att_unit: int=256, + head: int=2, + unit: int=1024, + layer: int=4, + dropout_rate: float=0.5, + emb_dropout_rate: float=0.0, + att_dropout_rate: float=0.0, + tie_weights: bool=False, + **kwargs): + nn.Layer.__init__(self) + + if pos_enc == "sinusoidal": + pos_enc_layer_type = "abs_pos" + elif pos_enc is None: + pos_enc_layer_type = "no_pos" + else: + raise ValueError(f"unknown pos-enc option: {pos_enc}") + + self.embed = nn.Embedding(n_vocab, embed_unit) + + if emb_dropout_rate == 0.0: + self.embed_drop = None + else: + self.embed_drop = nn.Dropout(emb_dropout_rate) + + self.encoder = TransformerEncoder( + input_size=embed_unit, + output_size=att_unit, + attention_heads=head, + linear_units=unit, + num_blocks=layer, + dropout_rate=dropout_rate, + attention_dropout_rate=att_dropout_rate, + input_layer="linear", + pos_enc_layer_type=pos_enc_layer_type, + concat_after=False, + static_chunk_size=1, + use_dynamic_chunk=False, + use_dynamic_left_chunk=False) + + self.decoder = nn.Linear(att_unit, n_vocab) + + logger.info("Tie weights set to {}".format(tie_weights)) + logger.info("Dropout set to {}".format(dropout_rate)) + logger.info("Emb Dropout set to {}".format(emb_dropout_rate)) + logger.info("Att Dropout set to {}".format(att_dropout_rate)) + + if tie_weights: + assert ( + att_unit == embed_unit + ), "Tie Weights: True need embedding and final dimensions to match" + self.decoder.weight = self.embed.weight + + def _target_mask(self, ys_in_pad): + ys_mask = ys_in_pad != 0 + m = subsequent_mask(ys_mask.size(-1)).unsqueeze(0) + return ys_mask.unsqueeze(-2) & m + + def forward(self, x: paddle.Tensor, t: paddle.Tensor + ) -> Tuple[paddle.Tensor, paddle.Tensor, paddle.Tensor]: + """Compute LM loss value from buffer sequences. + + Args: + x (paddle.Tensor): Input ids. (batch, len) + t (paddle.Tensor): Target ids. (batch, len) + + Returns: + tuple[paddle.Tensor, paddle.Tensor, paddle.Tensor]: Tuple of + loss to backward (scalar), + negative log-likelihood of t: -log p(t) (scalar) and + the number of elements in x (scalar) + + Notes: + The last two return values are used + in perplexity: p(t)^{-n} = exp(-log p(t) / n) + + """ + xm = x != 0 + xlen = xm.sum(axis=1) + if self.embed_drop is not None: + emb = self.embed_drop(self.embed(x)) + else: + emb = self.embed(x) + h, _ = self.encoder(emb, xlen) + y = self.decoder(h) + loss = F.cross_entropy( + y.view(-1, y.shape[-1]), t.view(-1), reduction="none") + mask = xm.to(dtype=loss.dtype) + logp = loss * mask.view(-1) + logp = logp.sum() + count = mask.sum() + return logp / count, logp, count + + # beam search API (see ScorerInterface) + def score(self, y: paddle.Tensor, state: Any, + x: paddle.Tensor) -> Tuple[paddle.Tensor, Any]: + """Score new token. + + Args: + y (paddle.Tensor): 1D paddle.int64 prefix tokens. + state: Scorer state for prefix tokens + x (paddle.Tensor): encoder feature that generates ys. 
+ + Returns: + tuple[paddle.Tensor, Any]: Tuple of + paddle.float32 scores for next token (n_vocab) + and next state for ys + + """ + y = y.unsqueeze(0) + + if self.embed_drop is not None: + emb = self.embed_drop(self.embed(y)) + else: + emb = self.embed(y) + + h, _, cache = self.encoder.forward_one_step( + emb, self._target_mask(y), cache=state) + h = self.decoder(h[:, -1]) + logp = F.log_softmax(h).squeeze(0) + return logp, cache + + # batch beam search API (see BatchScorerInterface) + def batch_score(self, + ys: paddle.Tensor, + states: List[Any], + xs: paddle.Tensor) -> Tuple[paddle.Tensor, List[Any]]: + """Score new token batch (required). + + Args: + ys (paddle.Tensor): paddle.int64 prefix tokens (n_batch, ylen). + states (List[Any]): Scorer states for prefix tokens. + xs (paddle.Tensor): + The encoder feature that generates ys (n_batch, xlen, n_feat). + + Returns: + tuple[paddle.Tensor, List[Any]]: Tuple of + batchfied scores for next token with shape of `(n_batch, n_vocab)` + and next state list for ys. + + """ + # merge states + n_batch = len(ys) + n_layers = len(self.encoder.encoders) + if states[0] is None: + batch_state = None + else: + # transpose state of [batch, layer] into [layer, batch] + batch_state = [ + paddle.stack([states[b][i] for b in range(n_batch)]) + for i in range(n_layers) + ] + + if self.embed_drop is not None: + emb = self.embed_drop(self.embed(ys)) + else: + emb = self.embed(ys) + + # batch decoding + h, _, states = self.encoder.forward_one_step( + emb, self._target_mask(ys), cache=batch_state) + h = self.decoder(h[:, -1]) + logp = F.log_softmax(h) + + # transpose state of [layer, batch] into [batch, layer] + state_list = [[states[i][b] for i in range(n_layers)] + for b in range(n_batch)] + return logp, state_list + + +if __name__ == "__main__": + tlm = TransformerLM( + n_vocab=5002, + pos_enc=None, + embed_unit=128, + att_unit=512, + head=8, + unit=2048, + layer=16, + dropout_rate=0.5, ) + + # n_vocab: int, + # pos_enc: str=None, + # embed_unit: int=128, + # att_unit: int=256, + # head: int=2, + # unit: int=1024, + # layer: int=4, + # dropout_rate: float=0.5, + # emb_dropout_rate: float = 0.0, + # att_dropout_rate: float = 0.0, + # tie_weights: bool = False,): + paddle.set_device("cpu") + model_dict = paddle.load("transformerLM.pdparams") + tlm.set_state_dict(model_dict) + + tlm.eval() + #Test the score + input2 = np.array([5]) + input2 = paddle.to_tensor(input2) + state = None + output, state = tlm.score(input2, state, None) + + input3 = np.array([5, 10]) + input3 = paddle.to_tensor(input3) + output, state = tlm.score(input3, state, None) + + input4 = np.array([5, 10, 0]) + input4 = paddle.to_tensor(input4) + output, state = tlm.score(input4, state, None) + print("output", output) + """ + #Test the batch score + batch_size = 2 + inp2 = np.array([[5], [10]]) + inp2 = paddle.to_tensor(inp2) + output, states = tlm.batch_score( + inp2, [(None,None,0)] * batch_size) + inp3 = np.array([[100], [30]]) + inp3 = paddle.to_tensor(inp3) + output, states = tlm.batch_score( + inp3, states) + print("output", output) + #print("cache", cache) + #np.save("output_pd.npy", output) + """ diff --git a/deepspeech/models/lm_interface.py b/deepspeech/models/lm_interface.py new file mode 100644 index 00000000..e2987282 --- /dev/null +++ b/deepspeech/models/lm_interface.py @@ -0,0 +1,82 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Language model interface.""" +import argparse + +from deepspeech.decoders.scorers.scorer_interface import ScorerInterface +from deepspeech.utils.dynamic_import import dynamic_import + + +class LMInterface(ScorerInterface): + """LM Interface model implementation.""" + + @staticmethod + def add_arguments(parser): + """Add arguments to command line argument parser.""" + return parser + + @classmethod + def build(cls, n_vocab: int, **kwargs): + """Initialize this class with python-level args. + + Args: + idim (int): The number of vocabulary. + + Returns: + LMinterface: A new instance of LMInterface. + + """ + args = argparse.Namespace(**kwargs) + return cls(n_vocab, args) + + def forward(self, x, t): + """Compute LM loss value from buffer sequences. + + Args: + x (torch.Tensor): Input ids. (batch, len) + t (torch.Tensor): Target ids. (batch, len) + + Returns: + tuple[torch.Tensor, torch.Tensor, torch.Tensor]: Tuple of + loss to backward (scalar), + negative log-likelihood of t: -log p(t) (scalar) and + the number of elements in x (scalar) + + Notes: + The last two return values are used + in perplexity: p(t)^{-n} = exp(-log p(t) / n) + + """ + raise NotImplementedError("forward method is not implemented") + + +predefined_lms = { + "transformer": "deepspeech.models.lm.transformer:TransformerLM", +} + + +def dynamic_import_lm(module): + """Import LM class dynamically. + + Args: + module (str): module_name:class_name or alias in `predefined_lms` + + Returns: + type: LM class + + """ + model_class = dynamic_import(module, predefined_lms) + assert issubclass(model_class, + LMInterface), f"{module} does not implement LMInterface" + return model_class diff --git a/deepspeech/models/st_interface.py b/deepspeech/models/st_interface.py new file mode 100644 index 00000000..05939f9a --- /dev/null +++ b/deepspeech/models/st_interface.py @@ -0,0 +1,75 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""ST Interface module.""" +from .asr_interface import ASRInterface +from deepspeech.utils.dynamic_import import dynamic_import + + +class STInterface(ASRInterface): + """ST Interface model implementation. + + NOTE: This class is inherited from ASRInterface to enable joint translation + and recognition when performing multi-task learning with the ASR task. + + """ + + def translate(self, + x, + trans_args, + char_list=None, + rnnlm=None, + ensemble_models=[]): + """Recognize x for evaluation. 
+ + :param ndarray x: input acouctic feature (B, T, D) or (T, D) + :param namespace trans_args: argment namespace contraining options + :param list char_list: list of characters + :param paddle.nn.Layer rnnlm: language model module + :return: N-best decoding results + :rtype: list + """ + raise NotImplementedError("translate method is not implemented") + + def translate_batch(self, x, trans_args, char_list=None, rnnlm=None): + """Beam search implementation for batch. + + :param paddle.Tensor x: encoder hidden state sequences (B, Tmax, Henc) + :param namespace trans_args: argument namespace containing options + :param list char_list: list of characters + :param paddle.nn.Layer rnnlm: language model module + :return: N-best decoding results + :rtype: list + """ + raise NotImplementedError("Batch decoding is not supported yet.") + + +predefined_st = { + "transformer": "deepspeech.models.u2_st:U2STModel", +} + + +def dynamic_import_st(module): + """Import ST models dynamically. + + Args: + module (str): module_name:class_name or alias in `predefined_st` + + Returns: + type: ST class + + """ + model_class = dynamic_import(module, predefined_st) + assert issubclass(model_class, + STInterface), f"{module} does not implement STInterface" + return model_class diff --git a/deepspeech/models/u2_st/__init__.py b/deepspeech/models/u2_st/__init__.py new file mode 100644 index 00000000..6b10b083 --- /dev/null +++ b/deepspeech/models/u2_st/__init__.py @@ -0,0 +1,15 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +from .u2_st import U2STInferModel +from .u2_st import U2STModel diff --git a/deepspeech/models/u2_st.py b/deepspeech/models/u2_st/u2_st.py similarity index 100% rename from deepspeech/models/u2_st.py rename to deepspeech/models/u2_st/u2_st.py diff --git a/deepspeech/modules/embedding.py b/deepspeech/modules/embedding.py index fbbda023..64d594c2 100644 --- a/deepspeech/modules/embedding.py +++ b/deepspeech/modules/embedding.py @@ -22,10 +22,52 @@ from deepspeech.utils.log import Log logger = Log(__name__).getlog() -__all__ = ["PositionalEncoding", "RelPositionalEncoding"] +__all__ = [ + "PositionalEncodingInterface", "NoPositionalEncoding", "PositionalEncoding", + "RelPositionalEncoding" +] -class PositionalEncoding(nn.Layer): +class PositionalEncodingInterface: + def forward(self, x: paddle.Tensor, + offset: int=0) -> Tuple[paddle.Tensor, paddle.Tensor]: + """Compute positional encoding. + Args: + x (paddle.Tensor): Input tensor (batch, time, `*`). + Returns: + paddle.Tensor: Encoded tensor (batch, time, `*`). + paddle.Tensor: Positional embedding tensor (1, time, `*`). 
+ """ + raise NotImplementedError("forward method is not implemented") + + def position_encoding(self, offset: int, size: int) -> paddle.Tensor: + """ For getting encoding in a streaming fashion + Args: + offset (int): start offset + size (int): requried size of position encoding + Returns: + paddle.Tensor: Corresponding position encoding + """ + raise NotImplementedError("position_encoding method is not implemented") + + +class NoPositionalEncoding(nn.Layer, PositionalEncodingInterface): + def __init__(self, + d_model: int, + dropout_rate: float, + max_len: int=5000, + reverse: bool=False): + nn.Layer.__init__(self) + + def forward(self, x: paddle.Tensor, + offset: int=0) -> Tuple[paddle.Tensor, paddle.Tensor]: + return x, None + + def position_encoding(self, offset: int, size: int) -> paddle.Tensor: + return None + + +class PositionalEncoding(nn.Layer, PositionalEncodingInterface): def __init__(self, d_model: int, dropout_rate: float, @@ -40,7 +82,7 @@ class PositionalEncoding(nn.Layer): max_len (int, optional): maximum input length. Defaults to 5000. reverse (bool, optional): Not used. Defaults to False. """ - super().__init__() + nn.Layer.__init__(self) self.d_model = d_model self.max_len = max_len self.xscale = paddle.to_tensor(math.sqrt(self.d_model)) @@ -85,7 +127,7 @@ class PositionalEncoding(nn.Layer): offset (int): start offset size (int): requried size of position encoding Returns: - paddle.Tensor: Corresponding encoding + paddle.Tensor: Corresponding position encoding """ assert offset + size < self.max_len return self.dropout(self.pe[:, offset:offset + size]) diff --git a/deepspeech/modules/encoder.py b/deepspeech/modules/encoder.py index 6ffb6465..435b6894 100644 --- a/deepspeech/modules/encoder.py +++ b/deepspeech/modules/encoder.py @@ -24,6 +24,7 @@ from deepspeech.modules.activation import get_activation from deepspeech.modules.attention import MultiHeadedAttention from deepspeech.modules.attention import RelPositionMultiHeadedAttention from deepspeech.modules.conformer_convolution import ConvolutionModule +from deepspeech.modules.embedding import NoPositionalEncoding from deepspeech.modules.embedding import PositionalEncoding from deepspeech.modules.embedding import RelPositionalEncoding from deepspeech.modules.encoder_layer import ConformerEncoderLayer @@ -76,7 +77,7 @@ class BaseEncoder(nn.Layer): input_layer (str): input layer type. optional [linear, conv2d, conv2d6, conv2d8] pos_enc_layer_type (str): Encoder positional encoding layer type. - opitonal [abs_pos, scaled_abs_pos, rel_pos] + opitonal [abs_pos, scaled_abs_pos, rel_pos, no_pos] normalize_before (bool): True: use layer_norm before each sub-block of a layer. False: use layer_norm after each sub-block of a layer. @@ -101,6 +102,8 @@ class BaseEncoder(nn.Layer): pos_enc_class = PositionalEncoding elif pos_enc_layer_type == "rel_pos": pos_enc_class = RelPositionalEncoding + elif pos_enc_layer_type == "no_pos": + pos_enc_class = NoPositionalEncoding else: raise ValueError("unknown pos_enc_layer: " + pos_enc_layer_type) @@ -370,6 +373,41 @@ class TransformerEncoder(BaseEncoder): concat_after=concat_after) for _ in range(num_blocks) ]) + def forward_one_step( + self, + xs: paddle.Tensor, + masks: paddle.Tensor, + cache=None, ) -> Tuple[paddle.Tensor, paddle.Tensor]: + """Encode input frame. + + Args: + xs (paddle.Tensor): (Prefix) Input tensor. (B, T, D) + masks (paddle.Tensor): Mask tensor. (B, T, T) + cache (List[paddle.Tensor]): List of cache tensors. + + Returns: + paddle.Tensor: Output tensor. 
+ paddle.Tensor: Mask tensor. + List[paddle.Tensor]: List of new cache tensors. + """ + if self.global_cmvn is not None: + xs = self.global_cmvn(xs) + + #TODO(Hui Zhang): self.embed(xs, masks, offset=0), stride_slice not support bool tensor + xs, pos_emb, masks = self.embed(xs, masks.astype(xs.dtype), offset=0) + #TODO(Hui Zhang): remove mask.astype, stride_slice not support bool tensor + masks = masks.astype(paddle.bool) + + if cache is None: + cache = [None for _ in range(len(self.encoders))] + new_cache = [] + for c, e in zip(cache, self.encoders): + xs, masks, _ = e(xs, masks, output_cache=c) + new_cache.append(xs) + if self.normalize_before: + xs = self.after_norm(xs) + return xs, masks, new_cache + class ConformerEncoder(BaseEncoder): """Conformer encoder module.""" diff --git a/deepspeech/modules/encoder_layer.py b/deepspeech/modules/encoder_layer.py index 1db556ca..6f49cfc8 100644 --- a/deepspeech/modules/encoder_layer.py +++ b/deepspeech/modules/encoder_layer.py @@ -71,7 +71,7 @@ class TransformerEncoderLayer(nn.Layer): self, x: paddle.Tensor, mask: paddle.Tensor, - pos_emb: paddle.Tensor, + pos_emb: Optional[paddle.Tensor]=None, mask_pad: Optional[paddle.Tensor]=None, output_cache: Optional[paddle.Tensor]=None, cnn_cache: Optional[paddle.Tensor]=None, @@ -82,8 +82,8 @@ class TransformerEncoderLayer(nn.Layer): mask (paddle.Tensor): Mask tensor for the input (#batch, time). pos_emb (paddle.Tensor): just for interface compatibility to ConformerEncoderLayer - mask_pad (paddle.Tensor): does not used in transformer layer, - just for unified api with conformer. + mask_pad (paddle.Tensor): not used here, it's for interface + compatibility to ConformerEncoderLayer output_cache (paddle.Tensor): Cache tensor of the output (#batch, time2, size), time2 < time in x. cnn_cache (paddle.Tensor): not used here, it's for interface diff --git a/deepspeech/modules/subsampling.py b/deepspeech/modules/subsampling.py index 3bed62f3..13e2c8ef 100644 --- a/deepspeech/modules/subsampling.py +++ b/deepspeech/modules/subsampling.py @@ -60,7 +60,8 @@ class LinearNoSubsampling(BaseSubsampling): self.out = nn.Sequential( nn.Linear(idim, odim), nn.LayerNorm(odim, epsilon=1e-12), - nn.Dropout(dropout_rate), ) + nn.Dropout(dropout_rate), + nn.ReLU(), ) self.right_context = 0 self.subsampling_rate = 1 @@ -83,7 +84,12 @@ class LinearNoSubsampling(BaseSubsampling): return x, pos_emb, x_mask -class Conv2dSubsampling4(BaseSubsampling): +class Conv2dSubsampling(BaseSubsampling): + def __init__(self, *args, **kwargs): + super().__init__(*args, **kwargs) + + +class Conv2dSubsampling4(Conv2dSubsampling): """Convolutional 2D subsampling (to 1/4 length).""" def __init__(self, @@ -134,7 +140,7 @@ class Conv2dSubsampling4(BaseSubsampling): return x, pos_emb, x_mask[:, :, :-2:2][:, :, :-2:2] -class Conv2dSubsampling6(BaseSubsampling): +class Conv2dSubsampling6(Conv2dSubsampling): """Convolutional 2D subsampling (to 1/6 length).""" def __init__(self, @@ -187,7 +193,7 @@ class Conv2dSubsampling6(BaseSubsampling): return x, pos_emb, x_mask[:, :, :-2:2][:, :, :-4:3] -class Conv2dSubsampling8(BaseSubsampling): +class Conv2dSubsampling8(Conv2dSubsampling): """Convolutional 2D subsampling (to 1/8 length).""" def __init__(self, diff --git a/deepspeech/training/extensions/plot.py b/deepspeech/training/extensions/plot.py new file mode 100644 index 00000000..6fbb4d4d --- /dev/null +++ b/deepspeech/training/extensions/plot.py @@ -0,0 +1,418 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import copy +import os + +import numpy as np + +from . import extension + + +class PlotAttentionReport(extension.Extension): + """Plot attention reporter. + + Args: + att_vis_fn (espnet.nets.*_backend.e2e_asr.E2E.calculate_all_attentions): + Function of attention visualization. + data (list[tuple(str, dict[str, list[Any]])]): List json utt key items. + outdir (str): Directory to save figures. + converter (espnet.asr.*_backend.asr.CustomConverter): + Function to convert data. + device (int | torch.device): Device. + reverse (bool): If True, input and output length are reversed. + ikey (str): Key to access input + (for ASR/ST ikey="input", for MT ikey="output".) + iaxis (int): Dimension to access input + (for ASR/ST iaxis=0, for MT iaxis=1.) + okey (str): Key to access output + (for ASR/ST okey="input", MT okay="output".) + oaxis (int): Dimension to access output + (for ASR/ST oaxis=0, for MT oaxis=0.) + subsampling_factor (int): subsampling factor in encoder + + """ + + def __init__( + self, + att_vis_fn, + data, + outdir, + converter, + transform, + device, + reverse=False, + ikey="input", + iaxis=0, + okey="output", + oaxis=0, + subsampling_factor=1, ): + self.att_vis_fn = att_vis_fn + self.data = copy.deepcopy(data) + self.data_dict = {k: v for k, v in copy.deepcopy(data)} + # key is utterance ID + self.outdir = outdir + self.converter = converter + self.transform = transform + self.device = device + self.reverse = reverse + self.ikey = ikey + self.iaxis = iaxis + self.okey = okey + self.oaxis = oaxis + self.factor = subsampling_factor + if not os.path.exists(self.outdir): + os.makedirs(self.outdir) + + def __call__(self, trainer): + """Plot and save image file of att_ws matrix.""" + att_ws, uttid_list = self.get_attention_weights() + if isinstance(att_ws, list): # multi-encoder case + num_encs = len(att_ws) - 1 + # atts + for i in range(num_encs): + for idx, att_w in enumerate(att_ws[i]): + filename = "%s/%s.ep.{.updater.epoch}.att%d.png" % ( + self.outdir, uttid_list[idx], i + 1, ) + att_w = self.trim_attention_weight(uttid_list[idx], att_w) + np_filename = "%s/%s.ep.{.updater.epoch}.att%d.npy" % ( + self.outdir, uttid_list[idx], i + 1, ) + np.save(np_filename.format(trainer), att_w) + self._plot_and_save_attention(att_w, + filename.format(trainer)) + # han + for idx, att_w in enumerate(att_ws[num_encs]): + filename = "%s/%s.ep.{.updater.epoch}.han.png" % ( + self.outdir, uttid_list[idx], ) + att_w = self.trim_attention_weight(uttid_list[idx], att_w) + np_filename = "%s/%s.ep.{.updater.epoch}.han.npy" % ( + self.outdir, uttid_list[idx], ) + np.save(np_filename.format(trainer), att_w) + self._plot_and_save_attention( + att_w, filename.format(trainer), han_mode=True) + else: + for idx, att_w in enumerate(att_ws): + filename = "%s/%s.ep.{.updater.epoch}.png" % (self.outdir, + uttid_list[idx], ) + att_w = self.trim_attention_weight(uttid_list[idx], att_w) + np_filename = "%s/%s.ep.{.updater.epoch}.npy" % ( + self.outdir, uttid_list[idx], ) + 
np.save(np_filename.format(trainer), att_w) + self._plot_and_save_attention(att_w, filename.format(trainer)) + + def log_attentions(self, logger, step): + """Add image files of att_ws matrix to the tensorboard.""" + att_ws, uttid_list = self.get_attention_weights() + if isinstance(att_ws, list): # multi-encoder case + num_encs = len(att_ws) - 1 + # atts + for i in range(num_encs): + for idx, att_w in enumerate(att_ws[i]): + att_w = self.trim_attention_weight(uttid_list[idx], att_w) + plot = self.draw_attention_plot(att_w) + logger.add_figure( + "%s_att%d" % (uttid_list[idx], i + 1), + plot.gcf(), + step, ) + # han + for idx, att_w in enumerate(att_ws[num_encs]): + att_w = self.trim_attention_weight(uttid_list[idx], att_w) + plot = self.draw_han_plot(att_w) + logger.add_figure( + "%s_han" % (uttid_list[idx]), + plot.gcf(), + step, ) + else: + for idx, att_w in enumerate(att_ws): + att_w = self.trim_attention_weight(uttid_list[idx], att_w) + plot = self.draw_attention_plot(att_w) + logger.add_figure("%s" % (uttid_list[idx]), plot.gcf(), step) + + def get_attention_weights(self): + """Return attention weights. + + Returns: + numpy.ndarray: attention weights. float. Its shape would be + differ from backend. + * pytorch-> 1) multi-head case => (B, H, Lmax, Tmax), 2) + other case => (B, Lmax, Tmax). + * chainer-> (B, Lmax, Tmax) + + """ + return_batch, uttid_list = self.transform(self.data, return_uttid=True) + batch = self.converter([return_batch], self.device) + if isinstance(batch, tuple): + att_ws = self.att_vis_fn(*batch) + else: + att_ws = self.att_vis_fn(**batch) + return att_ws, uttid_list + + def trim_attention_weight(self, uttid, att_w): + """Transform attention matrix with regard to self.reverse.""" + if self.reverse: + enc_key, enc_axis = self.okey, self.oaxis + dec_key, dec_axis = self.ikey, self.iaxis + else: + enc_key, enc_axis = self.ikey, self.iaxis + dec_key, dec_axis = self.okey, self.oaxis + dec_len = int(self.data_dict[uttid][dec_key][dec_axis]["shape"][0]) + enc_len = int(self.data_dict[uttid][enc_key][enc_axis]["shape"][0]) + if self.factor > 1: + enc_len //= self.factor + if len(att_w.shape) == 3: + att_w = att_w[:, :dec_len, :enc_len] + else: + att_w = att_w[:dec_len, :enc_len] + return att_w + + def draw_attention_plot(self, att_w): + """Plot the att_w matrix. + + Returns: + matplotlib.pyplot: pyplot object with attention matrix image. + + """ + import matplotlib + + matplotlib.use("Agg") + import matplotlib.pyplot as plt + + plt.clf() + att_w = att_w.astype(np.float32) + if len(att_w.shape) == 3: + for h, aw in enumerate(att_w, 1): + plt.subplot(1, len(att_w), h) + plt.imshow(aw, aspect="auto") + plt.xlabel("Encoder Index") + plt.ylabel("Decoder Index") + else: + plt.imshow(att_w, aspect="auto") + plt.xlabel("Encoder Index") + plt.ylabel("Decoder Index") + plt.tight_layout() + return plt + + def draw_han_plot(self, att_w): + """Plot the att_w matrix for hierarchical attention. + + Returns: + matplotlib.pyplot: pyplot object with attention matrix image. 
+ + """ + import matplotlib + + matplotlib.use("Agg") + import matplotlib.pyplot as plt + + plt.clf() + if len(att_w.shape) == 3: + for h, aw in enumerate(att_w, 1): + legends = [] + plt.subplot(1, len(att_w), h) + for i in range(aw.shape[1]): + plt.plot(aw[:, i]) + legends.append("Att{}".format(i)) + plt.ylim([0, 1.0]) + plt.xlim([0, aw.shape[0]]) + plt.grid(True) + plt.ylabel("Attention Weight") + plt.xlabel("Decoder Index") + plt.legend(legends) + else: + legends = [] + for i in range(att_w.shape[1]): + plt.plot(att_w[:, i]) + legends.append("Att{}".format(i)) + plt.ylim([0, 1.0]) + plt.xlim([0, att_w.shape[0]]) + plt.grid(True) + plt.ylabel("Attention Weight") + plt.xlabel("Decoder Index") + plt.legend(legends) + plt.tight_layout() + return plt + + def _plot_and_save_attention(self, att_w, filename, han_mode=False): + if han_mode: + plt = self.draw_han_plot(att_w) + else: + plt = self.draw_attention_plot(att_w) + plt.savefig(filename) + plt.close() + + +class PlotCTCReport(extension.Extension): + """Plot CTC reporter. + + Args: + ctc_vis_fn (espnet.nets.*_backend.e2e_asr.E2E.calculate_all_ctc_probs): + Function of CTC visualization. + data (list[tuple(str, dict[str, list[Any]])]): List json utt key items. + outdir (str): Directory to save figures. + converter (espnet.asr.*_backend.asr.CustomConverter): + Function to convert data. + device (int | torch.device): Device. + reverse (bool): If True, input and output length are reversed. + ikey (str): Key to access input + (for ASR/ST ikey="input", for MT ikey="output".) + iaxis (int): Dimension to access input + (for ASR/ST iaxis=0, for MT iaxis=1.) + okey (str): Key to access output + (for ASR/ST okey="input", MT okay="output".) + oaxis (int): Dimension to access output + (for ASR/ST oaxis=0, for MT oaxis=0.) 
+ subsampling_factor (int): subsampling factor in encoder + + """ + + def __init__( + self, + ctc_vis_fn, + data, + outdir, + converter, + transform, + device, + reverse=False, + ikey="input", + iaxis=0, + okey="output", + oaxis=0, + subsampling_factor=1, ): + self.ctc_vis_fn = ctc_vis_fn + self.data = copy.deepcopy(data) + self.data_dict = {k: v for k, v in copy.deepcopy(data)} + # key is utterance ID + self.outdir = outdir + self.converter = converter + self.transform = transform + self.device = device + self.reverse = reverse + self.ikey = ikey + self.iaxis = iaxis + self.okey = okey + self.oaxis = oaxis + self.factor = subsampling_factor + if not os.path.exists(self.outdir): + os.makedirs(self.outdir) + + def __call__(self, trainer): + """Plot and save image file of ctc prob.""" + ctc_probs, uttid_list = self.get_ctc_probs() + if isinstance(ctc_probs, list): # multi-encoder case + num_encs = len(ctc_probs) - 1 + for i in range(num_encs): + for idx, ctc_prob in enumerate(ctc_probs[i]): + filename = "%s/%s.ep.{.updater.epoch}.ctc%d.png" % ( + self.outdir, uttid_list[idx], i + 1, ) + ctc_prob = self.trim_ctc_prob(uttid_list[idx], ctc_prob) + np_filename = "%s/%s.ep.{.updater.epoch}.ctc%d.npy" % ( + self.outdir, uttid_list[idx], i + 1, ) + np.save(np_filename.format(trainer), ctc_prob) + self._plot_and_save_ctc(ctc_prob, filename.format(trainer)) + else: + for idx, ctc_prob in enumerate(ctc_probs): + filename = "%s/%s.ep.{.updater.epoch}.png" % (self.outdir, + uttid_list[idx], ) + ctc_prob = self.trim_ctc_prob(uttid_list[idx], ctc_prob) + np_filename = "%s/%s.ep.{.updater.epoch}.npy" % ( + self.outdir, uttid_list[idx], ) + np.save(np_filename.format(trainer), ctc_prob) + self._plot_and_save_ctc(ctc_prob, filename.format(trainer)) + + def log_ctc_probs(self, logger, step): + """Add image files of ctc probs to the tensorboard.""" + ctc_probs, uttid_list = self.get_ctc_probs() + if isinstance(ctc_probs, list): # multi-encoder case + num_encs = len(ctc_probs) - 1 + for i in range(num_encs): + for idx, ctc_prob in enumerate(ctc_probs[i]): + ctc_prob = self.trim_ctc_prob(uttid_list[idx], ctc_prob) + plot = self.draw_ctc_plot(ctc_prob) + logger.add_figure( + "%s_ctc%d" % (uttid_list[idx], i + 1), + plot.gcf(), + step, ) + else: + for idx, ctc_prob in enumerate(ctc_probs): + ctc_prob = self.trim_ctc_prob(uttid_list[idx], ctc_prob) + plot = self.draw_ctc_plot(ctc_prob) + logger.add_figure("%s" % (uttid_list[idx]), plot.gcf(), step) + + def get_ctc_probs(self): + """Return CTC probs. + + Returns: + numpy.ndarray: CTC probs. float. Its shape would be + differ from backend. (B, Tmax, vocab). + + """ + return_batch, uttid_list = self.transform(self.data, return_uttid=True) + batch = self.converter([return_batch], self.device) + if isinstance(batch, tuple): + probs = self.ctc_vis_fn(*batch) + else: + probs = self.ctc_vis_fn(**batch) + return probs, uttid_list + + def trim_ctc_prob(self, uttid, prob): + """Trim CTC posteriors accoding to input lengths.""" + enc_len = int(self.data_dict[uttid][self.ikey][self.iaxis]["shape"][0]) + if self.factor > 1: + enc_len //= self.factor + prob = prob[:enc_len] + return prob + + def draw_ctc_plot(self, ctc_prob): + """Plot the ctc_prob matrix. + + Returns: + matplotlib.pyplot: pyplot object with CTC prob matrix image. 
+ + """ + import matplotlib + + matplotlib.use("Agg") + import matplotlib.pyplot as plt + + ctc_prob = ctc_prob.astype(np.float32) + + plt.clf() + topk_ids = np.argsort(ctc_prob, axis=1) + n_frames, vocab = ctc_prob.shape + times_probs = np.arange(n_frames) + + plt.figure(figsize=(20, 8)) + + # NOTE: index 0 is reserved for blank + for idx in set(topk_ids.reshape(-1).tolist()): + if idx == 0: + plt.plot( + times_probs, + ctc_prob[:, 0], + ":", + label="", + color="grey") + else: + plt.plot(times_probs, ctc_prob[:, idx]) + plt.xlabel(u"Input [frame]", fontsize=12) + plt.ylabel("Posteriors", fontsize=12) + plt.xticks(list(range(0, int(n_frames) + 1, 10))) + plt.yticks(list(range(0, 2, 1))) + plt.tight_layout() + return plt + + def _plot_and_save_ctc(self, ctc_prob, filename): + plt = self.draw_ctc_plot(ctc_prob) + plt.savefig(filename) + plt.close() diff --git a/deepspeech/training/triggers/__init__.py b/deepspeech/training/triggers/__init__.py index 1a7c4292..185a92b8 100644 --- a/deepspeech/training/triggers/__init__.py +++ b/deepspeech/training/triggers/__init__.py @@ -11,18 +11,3 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. -from .interval_trigger import IntervalTrigger - - -def never_fail_trigger(trainer): - return False - - -def get_trigger(trigger): - if trigger is None: - return never_fail_trigger - if callable(trigger): - return trigger - else: - trigger = IntervalTrigger(*trigger) - return trigger diff --git a/deepspeech/training/triggers/compare_value_trigger.py b/deepspeech/training/triggers/compare_value_trigger.py new file mode 100644 index 00000000..efb928e2 --- /dev/null +++ b/deepspeech/training/triggers/compare_value_trigger.py @@ -0,0 +1,61 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +from ..reporter import DictSummary +from .utils import get_trigger + + +class CompareValueTrigger(): + """Trigger invoked when key value getting bigger or lower than before. + + Args: + key (str) : Key of value. + compare_fn ((float, float) -> bool) : Function to compare the values. + trigger (tuple(int, str)) : Trigger that decide the comparison interval. 
+ + """ + + def __init__(self, key, compare_fn, trigger=(1, "epoch")): + self._key = key + self._best_value = None + self._interval_trigger = get_trigger(trigger) + self._init_summary() + self._compare_fn = compare_fn + + def __call__(self, trainer): + """Get value related to the key and compare with current value.""" + observation = trainer.observation + summary = self._summary + key = self._key + if key in observation: + summary.add({key: observation[key]}) + + if not self._interval_trigger(trainer): + return False + + stats = summary.compute_mean() + value = float(stats[key]) # copy to CPU + self._init_summary() + + if self._best_value is None: + # initialize best value + self._best_value = value + return False + elif self._compare_fn(self._best_value, value): + return True + else: + self._best_value = value + return False + + def _init_summary(self): + self._summary = DictSummary() diff --git a/deepspeech/training/triggers/time_trigger.py b/deepspeech/training/triggers/time_trigger.py index ea8fe562..e31179a9 100644 --- a/deepspeech/training/triggers/time_trigger.py +++ b/deepspeech/training/triggers/time_trigger.py @@ -30,3 +30,12 @@ class TimeTrigger(): return True else: return False + + def state_dict(self): + state_dict = { + "next_time": self._next_time, + } + return state_dict + + def set_state_dict(self, state_dict): + self._next_time = state_dict['next_time'] diff --git a/deepspeech/training/triggers/utils.py b/deepspeech/training/triggers/utils.py new file mode 100644 index 00000000..1a7c4292 --- /dev/null +++ b/deepspeech/training/triggers/utils.py @@ -0,0 +1,28 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +from .interval_trigger import IntervalTrigger + + +def never_fail_trigger(trainer): + return False + + +def get_trigger(trigger): + if trigger is None: + return never_fail_trigger + if callable(trigger): + return trigger + else: + trigger = IntervalTrigger(*trigger) + return trigger diff --git a/deepspeech/transform/__init__.py b/deepspeech/transform/__init__.py new file mode 100644 index 00000000..185a92b8 --- /dev/null +++ b/deepspeech/transform/__init__.py @@ -0,0 +1,13 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
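As a side note on the trigger helpers added above, the sketch below shows how `get_trigger` and `CompareValueTrigger` are meant to compose. It is a minimal illustration, not part of the patch; the metric key `"eval/loss"` and the `trainer` object (with an `observation` dict) are assumptions for the example.

```python
# Minimal usage sketch for the trigger helpers above (illustrative only).
from deepspeech.training.triggers.utils import get_trigger
from deepspeech.training.triggers.compare_value_trigger import CompareValueTrigger

# A (1, "epoch") tuple is expanded by get_trigger into an IntervalTrigger that
# fires once per epoch; a callable is passed through unchanged.
every_epoch = get_trigger((1, "epoch"))

# Fires once the per-epoch mean of "eval/loss" (an assumed key in
# trainer.observation) becomes larger than the best value seen so far.
loss_stopped_improving = CompareValueTrigger(
    "eval/loss",
    lambda best, current: current > best,
    trigger=(1, "epoch"))

# Inside a training loop both objects are simply called with the trainer:
#     if loss_stopped_improving(trainer):
#         ...  # e.g. decay the learning rate
```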
diff --git a/deepspeech/transform/add_deltas.py b/deepspeech/transform/add_deltas.py new file mode 100644 index 00000000..4cab0084 --- /dev/null +++ b/deepspeech/transform/add_deltas.py @@ -0,0 +1,53 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import numpy as np + + +def delta(feat, window): + assert window > 0 + delta_feat = np.zeros_like(feat) + for i in range(1, window + 1): + delta_feat[:-i] += i * feat[i:] + delta_feat[i:] += -i * feat[:-i] + delta_feat[-i:] += i * feat[-1] + delta_feat[:i] += -i * feat[0] + delta_feat /= 2 * sum(i**2 for i in range(1, window + 1)) + return delta_feat + + +def add_deltas(x, window=2, order=2): + """ + Args: + x (np.ndarray): speech feat, (T, D). + + Return: + np.ndarray: (T, (1+order)*D) + """ + feats = [x] + for _ in range(order): + feats.append(delta(feats[-1], window)) + return np.concatenate(feats, axis=1) + + +class AddDeltas(): + def __init__(self, window=2, order=2): + self.window = window + self.order = order + + def __repr__(self): + return "{name}(window={window}, order={order}".format( + name=self.__class__.__name__, window=self.window, order=self.order) + + def __call__(self, x): + return add_deltas(x, window=self.window, order=self.order) diff --git a/deepspeech/transform/channel_selector.py b/deepspeech/transform/channel_selector.py new file mode 100644 index 00000000..d985b482 --- /dev/null +++ b/deepspeech/transform/channel_selector.py @@ -0,0 +1,56 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
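A quick shape sanity-check for the new delta-feature transform (not part of the diff; the 80-dim input is an arbitrary example):

```python
import numpy as np

from deepspeech.transform.add_deltas import AddDeltas, add_deltas

feats = np.random.randn(100, 80).astype(np.float32)   # (T, D) features

out = add_deltas(feats, window=2, order=2)
assert out.shape == (100, 240)   # static + delta + delta-delta = (T, 3 * D)

# The class wrapper is what the Transformation pipeline instantiates as "delta":
transform = AddDeltas(window=2, order=2)
assert np.allclose(transform(feats), out)
```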
+import numpy + + +class ChannelSelector(): + """Select 1ch from multi-channel signal""" + + def __init__(self, train_channel="random", eval_channel=0, axis=1): + self.train_channel = train_channel + self.eval_channel = eval_channel + self.axis = axis + + def __repr__(self): + return ("{name}(train_channel={train_channel}, " + "eval_channel={eval_channel}, axis={axis})".format( + name=self.__class__.__name__, + train_channel=self.train_channel, + eval_channel=self.eval_channel, + axis=self.axis, )) + + def __call__(self, x, train=True): + # Assuming x: [Time, Channel] by default + + if x.ndim <= self.axis: + # If the dimension is insufficient, then unsqueeze + # (e.g [Time] -> [Time, 1]) + ind = tuple( + slice(None) if i < x.ndim else None + for i in range(self.axis + 1)) + x = x[ind] + + if train: + channel = self.train_channel + else: + channel = self.eval_channel + + if channel == "random": + ch = numpy.random.randint(0, x.shape[self.axis]) + else: + ch = channel + + ind = tuple( + slice(None) if i != self.axis else ch for i in range(x.ndim)) + return x[ind] diff --git a/deepspeech/transform/cmvn.py b/deepspeech/transform/cmvn.py new file mode 100644 index 00000000..5d318590 --- /dev/null +++ b/deepspeech/transform/cmvn.py @@ -0,0 +1,158 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import io + +import h5py +import kaldiio +import numpy as np + + +class CMVN(): + "Apply Global/Spk CMVN/iverserCMVN." + + def __init__( + self, + stats, + norm_means=True, + norm_vars=False, + filetype="mat", + utt2spk=None, + spk2utt=None, + reverse=False, + std_floor=1.0e-20, ): + self.stats_file = stats + self.norm_means = norm_means + self.norm_vars = norm_vars + self.reverse = reverse + + if isinstance(stats, dict): + stats_dict = dict(stats) + else: + # Use for global CMVN + if filetype == "mat": + stats_dict = {None: kaldiio.load_mat(stats)} + # Use for global CMVN + elif filetype == "npy": + stats_dict = {None: np.load(stats)} + # Use for speaker CMVN + elif filetype == "ark": + self.accept_uttid = True + stats_dict = dict(kaldiio.load_ark(stats)) + # Use for speaker CMVN + elif filetype == "hdf5": + self.accept_uttid = True + stats_dict = h5py.File(stats) + else: + raise ValueError("Not supporting filetype={}".format(filetype)) + + if utt2spk is not None: + self.utt2spk = {} + with io.open(utt2spk, "r", encoding="utf-8") as f: + for line in f: + utt, spk = line.rstrip().split(None, 1) + self.utt2spk[utt] = spk + elif spk2utt is not None: + self.utt2spk = {} + with io.open(spk2utt, "r", encoding="utf-8") as f: + for line in f: + spk, utts = line.rstrip().split(None, 1) + for utt in utts.split(): + self.utt2spk[utt] = spk + else: + self.utt2spk = None + + # Kaldi makes a matrix for CMVN which has a shape of (2, feat_dim + 1), + # and the first vector contains the sum of feats and the second is + # the sum of squares. The last value of the first, i.e. stats[0,-1], + # is the number of samples for this statistics. 
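+        # For example, with feat_dim = 2 and N accumulated frames the stats are
+        #   [[sum(x_1),    sum(x_2),    N],
+        #    [sum(x_1^2),  sum(x_2^2),  unused]]
+        # so below: mean = stats[0, :-1] / N and
+        # var = stats[1, :-1] / N - mean ** 2.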
+ self.bias = {} + self.scale = {} + for spk, stats in stats_dict.items(): + assert len(stats) == 2, stats.shape + + count = stats[0, -1] + + # If the feature has two or more dimensions + if not (np.isscalar(count) or isinstance(count, (int, float))): + # The first is only used + count = count.flatten()[0] + + mean = stats[0, :-1] / count + # V(x) = E(x^2) - (E(x))^2 + var = stats[1, :-1] / count - mean * mean + std = np.maximum(np.sqrt(var), std_floor) + self.bias[spk] = -mean + self.scale[spk] = 1 / std + + def __repr__(self): + return ("{name}(stats_file={stats_file}, " + "norm_means={norm_means}, norm_vars={norm_vars}, " + "reverse={reverse})".format( + name=self.__class__.__name__, + stats_file=self.stats_file, + norm_means=self.norm_means, + norm_vars=self.norm_vars, + reverse=self.reverse, )) + + def __call__(self, x, uttid=None): + if self.utt2spk is not None: + spk = self.utt2spk[uttid] + else: + spk = uttid + + if not self.reverse: + # apply cmvn + if self.norm_means: + x = np.add(x, self.bias[spk]) + if self.norm_vars: + x = np.multiply(x, self.scale[spk]) + + else: + # apply reverse cmvn + if self.norm_vars: + x = np.divide(x, self.scale[spk]) + if self.norm_means: + x = np.subtract(x, self.bias[spk]) + + return x + + +class UtteranceCMVN(): + "Apply Utterance CMVN" + + def __init__(self, norm_means=True, norm_vars=False, std_floor=1.0e-20): + self.norm_means = norm_means + self.norm_vars = norm_vars + self.std_floor = std_floor + + def __repr__(self): + return "{name}(norm_means={norm_means}, norm_vars={norm_vars})".format( + name=self.__class__.__name__, + norm_means=self.norm_means, + norm_vars=self.norm_vars, ) + + def __call__(self, x, uttid=None): + # x: [Time, Dim] + square_sums = (x**2).sum(axis=0) + mean = x.mean(axis=0) + + if self.norm_means: + x = np.subtract(x, mean) + + if self.norm_vars: + var = square_sums / x.shape[0] - mean**2 + std = np.maximum(np.sqrt(var), self.std_floor) + x = np.divide(x, std) + + return x diff --git a/deepspeech/transform/functional.py b/deepspeech/transform/functional.py new file mode 100644 index 00000000..914e484e --- /dev/null +++ b/deepspeech/transform/functional.py @@ -0,0 +1,85 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import inspect + +from deepspeech.transform.transform_interface import TransformInterface +from deepspeech.utils.check_kwargs import check_kwargs + + +class FuncTrans(TransformInterface): + """Functional Transformation + + WARNING: + Builtin or C/C++ functions may not work properly + because this class heavily depends on the `inspect` module. + + Usage: + + >>> def foo_bar(x, a=1, b=2): + ... '''Foo bar + ... :param x: input + ... :param int a: default 1 + ... :param int b: default 2 + ... ''' + ... return x + a - b + + + >>> class FooBar(FuncTrans): + ... _func = foo_bar + ... 
__doc__ = foo_bar.__doc__ + """ + + _func = None + + def __init__(self, **kwargs): + self.kwargs = kwargs + check_kwargs(self.func, kwargs) + + def __call__(self, x): + return self.func(x, **self.kwargs) + + @classmethod + def add_arguments(cls, parser): + fname = cls._func.__name__.replace("_", "-") + group = parser.add_argument_group(fname + " transformation setting") + for k, v in cls.default_params().items(): + # TODO(karita): get help and choices from docstring? + attr = k.replace("_", "-") + group.add_argument(f"--{fname}-{attr}", default=v, type=type(v)) + return parser + + @property + def func(self): + return type(self)._func + + @classmethod + def default_params(cls): + try: + d = dict(inspect.signature(cls._func).parameters) + except ValueError: + d = dict() + return { + k: v.default + for k, v in d.items() if v.default != inspect.Parameter.empty + } + + def __repr__(self): + params = self.default_params() + params.update(**self.kwargs) + ret = self.__class__.__name__ + "(" + if len(params) == 0: + return ret + ")" + for k, v in params.items(): + ret += "{}={}, ".format(k, v) + return ret[:-2] + ")" diff --git a/deepspeech/transform/perturb.py b/deepspeech/transform/perturb.py new file mode 100644 index 00000000..e425fd2e --- /dev/null +++ b/deepspeech/transform/perturb.py @@ -0,0 +1,350 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import librosa +import numpy +import scipy +import soundfile + +from deepspeech.io.reader import SoundHDF5File + + +class SpeedPerturbation(): + """SpeedPerturbation + + The speed perturbation in kaldi uses sox-speed instead of sox-tempo, + and sox-speed just to resample the input, + i.e pitch and tempo are changed both. + + "Why use speed option instead of tempo -s in SoX for speed perturbation" + https://groups.google.com/forum/#!topic/kaldi-help/8OOG7eE4sZ8 + + Warning: + This function is very slow because of resampling. + I recommmend to apply speed-perturb outside the training using sox. 
+ + """ + + def __init__( + self, + lower=0.9, + upper=1.1, + utt2ratio=None, + keep_length=True, + res_type="kaiser_best", + seed=None, ): + self.res_type = res_type + self.keep_length = keep_length + self.state = numpy.random.RandomState(seed) + + if utt2ratio is not None: + self.utt2ratio = {} + # Use the scheduled ratio for each utterances + self.utt2ratio_file = utt2ratio + self.lower = None + self.upper = None + self.accept_uttid = True + + with open(utt2ratio, "r") as f: + for line in f: + utt, ratio = line.rstrip().split(None, 1) + ratio = float(ratio) + self.utt2ratio[utt] = ratio + else: + self.utt2ratio = None + # The ratio is given on runtime randomly + self.lower = lower + self.upper = upper + + def __repr__(self): + if self.utt2ratio is None: + return "{}(lower={}, upper={}, " "keep_length={}, res_type={})".format( + self.__class__.__name__, + self.lower, + self.upper, + self.keep_length, + self.res_type, ) + else: + return "{}({}, res_type={})".format( + self.__class__.__name__, self.utt2ratio_file, self.res_type) + + def __call__(self, x, uttid=None, train=True): + if not train: + return x + + x = x.astype(numpy.float32) + if self.accept_uttid: + ratio = self.utt2ratio[uttid] + else: + ratio = self.state.uniform(self.lower, self.upper) + + # Note1: resample requires the sampling-rate of input and output, + # but actually only the ratio is used. + y = librosa.resample(x, ratio, 1, res_type=self.res_type) + + if self.keep_length: + diff = abs(len(x) - len(y)) + if len(y) > len(x): + # Truncate noise + y = y[diff // 2:-((diff + 1) // 2)] + elif len(y) < len(x): + # Assume the time-axis is the first: (Time, Channel) + pad_width = [(diff // 2, (diff + 1) // 2)] + [ + (0, 0) for _ in range(y.ndim - 1) + ] + y = numpy.pad( + y, pad_width=pad_width, constant_values=0, mode="constant") + return y + + +class BandpassPerturbation(): + """BandpassPerturbation + + Randomly dropout along the frequency axis. + + The original idea comes from the following: + "randomly-selected frequency band was cut off under the constraint of + leaving at least 1,000 Hz band within the range of less than 4,000Hz." 
+ (The Hitachi/JHU CHiME-5 system: Advances in speech recognition for + everyday home environments using multiple microphone arrays; + http://spandh.dcs.shef.ac.uk/chime_workshop/papers/CHiME_2018_paper_kanda.pdf) + + """ + + def __init__(self, lower=0.0, upper=0.75, seed=None, axes=(-1, )): + self.lower = lower + self.upper = upper + self.state = numpy.random.RandomState(seed) + # x_stft: (Time, Channel, Freq) + self.axes = axes + + def __repr__(self): + return "{}(lower={}, upper={})".format(self.__class__.__name__, + self.lower, self.upper) + + def __call__(self, x_stft, uttid=None, train=True): + if not train: + return x_stft + + if x_stft.ndim == 1: + raise RuntimeError("Input in time-freq domain: " + "(Time, Channel, Freq) or (Time, Freq)") + + ratio = self.state.uniform(self.lower, self.upper) + axes = [i if i >= 0 else x_stft.ndim - i for i in self.axes] + shape = [s if i in axes else 1 for i, s in enumerate(x_stft.shape)] + + mask = self.state.randn(*shape) > ratio + x_stft *= mask + return x_stft + + +class VolumePerturbation(): + def __init__(self, + lower=-1.6, + upper=1.6, + utt2ratio=None, + dbunit=True, + seed=None): + self.dbunit = dbunit + self.utt2ratio_file = utt2ratio + self.lower = lower + self.upper = upper + self.state = numpy.random.RandomState(seed) + + if utt2ratio is not None: + # Use the scheduled ratio for each utterances + self.utt2ratio = {} + self.lower = None + self.upper = None + self.accept_uttid = True + + with open(utt2ratio, "r") as f: + for line in f: + utt, ratio = line.rstrip().split(None, 1) + ratio = float(ratio) + self.utt2ratio[utt] = ratio + else: + # The ratio is given on runtime randomly + self.utt2ratio = None + + def __repr__(self): + if self.utt2ratio is None: + return "{}(lower={}, upper={}, dbunit={})".format( + self.__class__.__name__, self.lower, self.upper, self.dbunit) + else: + return '{}("{}", dbunit={})'.format( + self.__class__.__name__, self.utt2ratio_file, self.dbunit) + + def __call__(self, x, uttid=None, train=True): + if not train: + return x + + x = x.astype(numpy.float32) + + if self.accept_uttid: + ratio = self.utt2ratio[uttid] + else: + ratio = self.state.uniform(self.lower, self.upper) + if self.dbunit: + ratio = 10**(ratio / 20) + return x * ratio + + +class NoiseInjection(): + """Add isotropic noise""" + + def __init__( + self, + utt2noise=None, + lower=-20, + upper=-5, + utt2ratio=None, + filetype="list", + dbunit=True, + seed=None, ): + self.utt2noise_file = utt2noise + self.utt2ratio_file = utt2ratio + self.filetype = filetype + self.dbunit = dbunit + self.lower = lower + self.upper = upper + self.state = numpy.random.RandomState(seed) + + if utt2ratio is not None: + # Use the scheduled ratio for each utterances + self.utt2ratio = {} + with open(utt2noise, "r") as f: + for line in f: + utt, snr = line.rstrip().split(None, 1) + snr = float(snr) + self.utt2ratio[utt] = snr + else: + # The ratio is given on runtime randomly + self.utt2ratio = None + + if utt2noise is not None: + self.utt2noise = {} + if filetype == "list": + with open(utt2noise, "r") as f: + for line in f: + utt, filename = line.rstrip().split(None, 1) + signal, rate = soundfile.read(filename, dtype="int16") + # Load all files in memory + self.utt2noise[utt] = (signal, rate) + + elif filetype == "sound.hdf5": + self.utt2noise = SoundHDF5File(utt2noise, "r") + else: + raise ValueError(filetype) + else: + self.utt2noise = None + + if utt2noise is not None and utt2ratio is not None: + if set(self.utt2ratio) != set(self.utt2noise): + raise 
RuntimeError("The uttids mismatch between {} and {}". + format(utt2ratio, utt2noise)) + + def __repr__(self): + if self.utt2ratio is None: + return "{}(lower={}, upper={}, dbunit={})".format( + self.__class__.__name__, self.lower, self.upper, self.dbunit) + else: + return '{}("{}", dbunit={})'.format( + self.__class__.__name__, self.utt2ratio_file, self.dbunit) + + def __call__(self, x, uttid=None, train=True): + if not train: + return x + x = x.astype(numpy.float32) + + # 1. Get ratio of noise to signal in sound pressure level + if uttid is not None and self.utt2ratio is not None: + ratio = self.utt2ratio[uttid] + else: + ratio = self.state.uniform(self.lower, self.upper) + + if self.dbunit: + ratio = 10**(ratio / 20) + scale = ratio * numpy.sqrt((x**2).mean()) + + # 2. Get noise + if self.utt2noise is not None: + # Get noise from the external source + if uttid is not None: + noise, rate = self.utt2noise[uttid] + else: + # Randomly select the noise source + noise = self.state.choice(list(self.utt2noise.values())) + # Normalize the level + noise /= numpy.sqrt((noise**2).mean()) + + # Adjust the noise length + diff = abs(len(x) - len(noise)) + offset = self.state.randint(0, diff) + if len(noise) > len(x): + # Truncate noise + noise = noise[offset:-(diff - offset)] + else: + noise = numpy.pad( + noise, pad_width=[offset, diff - offset], mode="wrap") + + else: + # Generate white noise + noise = self.state.normal(0, 1, x.shape) + + # 3. Add noise to signal + return x + noise * scale + + +class RIRConvolve(): + def __init__(self, utt2rir, filetype="list"): + self.utt2rir_file = utt2rir + self.filetype = filetype + + self.utt2rir = {} + if filetype == "list": + with open(utt2rir, "r") as f: + for line in f: + utt, filename = line.rstrip().split(None, 1) + signal, rate = soundfile.read(filename, dtype="int16") + self.utt2rir[utt] = (signal, rate) + + elif filetype == "sound.hdf5": + self.utt2rir = SoundHDF5File(utt2rir, "r") + else: + raise NotImplementedError(filetype) + + def __repr__(self): + return '{}("{}")'.format(self.__class__.__name__, self.utt2rir_file) + + def __call__(self, x, uttid=None, train=True): + if not train: + return x + + x = x.astype(numpy.float32) + + if x.ndim != 1: + # Must be single channel + raise RuntimeError( + "Input x must be one dimensional array, but got {}".format( + x.shape)) + + rir, rate = self.utt2rir[uttid] + if rir.ndim == 2: + # FIXME(kamo): Use chainer.convolution_1d? + # return [Time, Channel] + return numpy.stack( + [scipy.convolve(x, r, mode="same") for r in rir], axis=-1) + else: + return scipy.convolve(x, rir, mode="same") diff --git a/deepspeech/transform/spec_augment.py b/deepspeech/transform/spec_augment.py new file mode 100644 index 00000000..0e5324e7 --- /dev/null +++ b/deepspeech/transform/spec_augment.py @@ -0,0 +1,210 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+"""Spec Augment module for preprocessing i.e., data augmentation""" +import random + +import numpy +from PIL import Image +from PIL.Image import BICUBIC + +from deepspeech.transform.functional import FuncTrans + + +def time_warp(x, max_time_warp=80, inplace=False, mode="PIL"): + """time warp for spec augment + + move random center frame by the random width ~ uniform(-window, window) + :param numpy.ndarray x: spectrogram (time, freq) + :param int max_time_warp: maximum time frames to warp + :param bool inplace: overwrite x with the result + :param str mode: "PIL" (default, fast, not differentiable) or "sparse_image_warp" + (slow, differentiable) + :returns numpy.ndarray: time warped spectrogram (time, freq) + """ + window = max_time_warp + if mode == "PIL": + t = x.shape[0] + if t - window <= window: + return x + # NOTE: randrange(a, b) emits a, a + 1, ..., b - 1 + center = random.randrange(window, t - window) + warped = random.randrange(center - window, center + + window) + 1 # 1 ... t - 1 + + left = Image.fromarray(x[:center]).resize((x.shape[1], warped), BICUBIC) + right = Image.fromarray(x[center:]).resize((x.shape[1], t - warped), + BICUBIC) + if inplace: + x[:warped] = left + x[warped:] = right + return x + return numpy.concatenate((left, right), 0) + elif mode == "sparse_image_warp": + import paddle + + from espnet.utils import spec_augment + + # TODO(karita): make this differentiable again + return spec_augment.time_warp(paddle.to_tensor(x), window).numpy() + else: + raise NotImplementedError("unknown resize mode: " + mode + + ", choose one from (PIL, sparse_image_warp).") + + +class TimeWarp(FuncTrans): + _func = time_warp + __doc__ = time_warp.__doc__ + + def __call__(self, x, train): + if not train: + return x + return super().__call__(x) + + +def freq_mask(x, F=30, n_mask=2, replace_with_zero=True, inplace=False): + """freq mask for spec agument + + :param numpy.ndarray x: (time, freq) + :param int n_mask: the number of masks + :param bool inplace: overwrite + :param bool replace_with_zero: pad zero on mask if true else use mean + """ + if inplace: + cloned = x + else: + cloned = x.copy() + + num_mel_channels = cloned.shape[1] + fs = numpy.random.randint(0, F, size=(n_mask, 2)) + + for f, mask_end in fs: + f_zero = random.randrange(0, num_mel_channels - f) + mask_end += f_zero + + # avoids randrange error if values are equal and range is empty + if f_zero == f_zero + f: + continue + + if replace_with_zero: + cloned[:, f_zero:mask_end] = 0 + else: + cloned[:, f_zero:mask_end] = cloned.mean() + return cloned + + +class FreqMask(FuncTrans): + _func = freq_mask + __doc__ = freq_mask.__doc__ + + def __call__(self, x, train): + if not train: + return x + return super().__call__(x) + + +def time_mask(spec, T=40, n_mask=2, replace_with_zero=True, inplace=False): + """freq mask for spec agument + + :param numpy.ndarray spec: (time, freq) + :param int n_mask: the number of masks + :param bool inplace: overwrite + :param bool replace_with_zero: pad zero on mask if true else use mean + """ + if inplace: + cloned = spec + else: + cloned = spec.copy() + len_spectro = cloned.shape[0] + ts = numpy.random.randint(0, T, size=(n_mask, 2)) + for t, mask_end in ts: + # avoid randint range error + if len_spectro - t <= 0: + continue + t_zero = random.randrange(0, len_spectro - t) + + # avoids randrange error if values are equal and range is empty + if t_zero == t_zero + t: + continue + + mask_end += t_zero + if replace_with_zero: + cloned[t_zero:mask_end] = 0 + else: + cloned[t_zero:mask_end] = 
cloned.mean() + return cloned + + +class TimeMask(FuncTrans): + _func = time_mask + __doc__ = time_mask.__doc__ + + def __call__(self, x, train): + if not train: + return x + return super().__call__(x) + + +def spec_augment( + x, + resize_mode="PIL", + max_time_warp=80, + max_freq_width=27, + n_freq_mask=2, + max_time_width=100, + n_time_mask=2, + inplace=True, + replace_with_zero=True, ): + """spec agument + + apply random time warping and time/freq masking + default setting is based on LD (Librispeech double) in Table 2 + https://arxiv.org/pdf/1904.08779.pdf + + :param numpy.ndarray x: (time, freq) + :param str resize_mode: "PIL" (fast, nondifferentiable) or "sparse_image_warp" + (slow, differentiable) + :param int max_time_warp: maximum frames to warp the center frame in spectrogram (W) + :param int freq_mask_width: maximum width of the random freq mask (F) + :param int n_freq_mask: the number of the random freq mask (m_F) + :param int time_mask_width: maximum width of the random time mask (T) + :param int n_time_mask: the number of the random time mask (m_T) + :param bool inplace: overwrite intermediate array + :param bool replace_with_zero: pad zero on mask if true else use mean + """ + assert isinstance(x, numpy.ndarray) + assert x.ndim == 2 + x = time_warp(x, max_time_warp, inplace=inplace, mode=resize_mode) + x = freq_mask( + x, + max_freq_width, + n_freq_mask, + inplace=inplace, + replace_with_zero=replace_with_zero, ) + x = time_mask( + x, + max_time_width, + n_time_mask, + inplace=inplace, + replace_with_zero=replace_with_zero, ) + return x + + +class SpecAugment(FuncTrans): + _func = spec_augment + __doc__ = spec_augment.__doc__ + + def __call__(self, x, train): + if not train: + return x + return super().__call__(x) diff --git a/deepspeech/transform/spectrogram.py b/deepspeech/transform/spectrogram.py new file mode 100644 index 00000000..e63bd680 --- /dev/null +++ b/deepspeech/transform/spectrogram.py @@ -0,0 +1,305 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import librosa +import numpy as np + + +def stft(x, + n_fft, + n_shift, + win_length=None, + window="hann", + center=True, + pad_mode="reflect"): + # x: [Time, Channel] + if x.ndim == 1: + single_channel = True + # x: [Time] -> [Time, Channel] + x = x[:, None] + else: + single_channel = False + x = x.astype(np.float32) + + # FIXME(kamo): librosa.stft can't use multi-channel? 
+ # x: [Time, Channel, Freq] + x = np.stack( + [ + librosa.stft( + x[:, ch], + n_fft=n_fft, + hop_length=n_shift, + win_length=win_length, + window=window, + center=center, + pad_mode=pad_mode, ).T for ch in range(x.shape[1]) + ], + axis=1, ) + + if single_channel: + # x: [Time, Channel, Freq] -> [Time, Freq] + x = x[:, 0] + return x + + +def istft(x, n_shift, win_length=None, window="hann", center=True): + # x: [Time, Channel, Freq] + if x.ndim == 2: + single_channel = True + # x: [Time, Freq] -> [Time, Channel, Freq] + x = x[:, None, :] + else: + single_channel = False + + # x: [Time, Channel] + x = np.stack( + [ + librosa.istft( + x[:, ch].T, # [Time, Freq] -> [Freq, Time] + hop_length=n_shift, + win_length=win_length, + window=window, + center=center, ) for ch in range(x.shape[1]) + ], + axis=1, ) + + if single_channel: + # x: [Time, Channel] -> [Time] + x = x[:, 0] + return x + + +def stft2logmelspectrogram(x_stft, + fs, + n_mels, + n_fft, + fmin=None, + fmax=None, + eps=1e-10): + # x_stft: (Time, Channel, Freq) or (Time, Freq) + fmin = 0 if fmin is None else fmin + fmax = fs / 2 if fmax is None else fmax + + # spc: (Time, Channel, Freq) or (Time, Freq) + spc = np.abs(x_stft) + # mel_basis: (Mel_freq, Freq) + mel_basis = librosa.filters.mel(fs, n_fft, n_mels, fmin, fmax) + # lmspc: (Time, Channel, Mel_freq) or (Time, Mel_freq) + lmspc = np.log10(np.maximum(eps, np.dot(spc, mel_basis.T))) + + return lmspc + + +def spectrogram(x, n_fft, n_shift, win_length=None, window="hann"): + # x: (Time, Channel) -> spc: (Time, Channel, Freq) + spc = np.abs(stft(x, n_fft, n_shift, win_length, window=window)) + return spc + + +def logmelspectrogram( + x, + fs, + n_mels, + n_fft, + n_shift, + win_length=None, + window="hann", + fmin=None, + fmax=None, + eps=1e-10, + pad_mode="reflect", ): + # stft: (Time, Channel, Freq) or (Time, Freq) + x_stft = stft( + x, + n_fft=n_fft, + n_shift=n_shift, + win_length=win_length, + window=window, + pad_mode=pad_mode, ) + + return stft2logmelspectrogram( + x_stft, + fs=fs, + n_mels=n_mels, + n_fft=n_fft, + fmin=fmin, + fmax=fmax, + eps=eps) + + +class Spectrogram(): + def __init__(self, n_fft, n_shift, win_length=None, window="hann"): + self.n_fft = n_fft + self.n_shift = n_shift + self.win_length = win_length + self.window = window + + def __repr__(self): + return ("{name}(n_fft={n_fft}, n_shift={n_shift}, " + "win_length={win_length}, window={window})".format( + name=self.__class__.__name__, + n_fft=self.n_fft, + n_shift=self.n_shift, + win_length=self.win_length, + window=self.window, )) + + def __call__(self, x): + return spectrogram( + x, + n_fft=self.n_fft, + n_shift=self.n_shift, + win_length=self.win_length, + window=self.window, ) + + +class LogMelSpectrogram(): + def __init__( + self, + fs, + n_mels, + n_fft, + n_shift, + win_length=None, + window="hann", + fmin=None, + fmax=None, + eps=1e-10, ): + self.fs = fs + self.n_mels = n_mels + self.n_fft = n_fft + self.n_shift = n_shift + self.win_length = win_length + self.window = window + self.fmin = fmin + self.fmax = fmax + self.eps = eps + + def __repr__(self): + return ("{name}(fs={fs}, n_mels={n_mels}, n_fft={n_fft}, " + "n_shift={n_shift}, win_length={win_length}, window={window}, " + "fmin={fmin}, fmax={fmax}, eps={eps}))".format( + name=self.__class__.__name__, + fs=self.fs, + n_mels=self.n_mels, + n_fft=self.n_fft, + n_shift=self.n_shift, + win_length=self.win_length, + window=self.window, + fmin=self.fmin, + fmax=self.fmax, + eps=self.eps, )) + + def __call__(self, x): + return logmelspectrogram( + x, 
+ fs=self.fs, + n_mels=self.n_mels, + n_fft=self.n_fft, + n_shift=self.n_shift, + win_length=self.win_length, + window=self.window, ) + + +class Stft2LogMelSpectrogram(): + def __init__(self, fs, n_mels, n_fft, fmin=None, fmax=None, eps=1e-10): + self.fs = fs + self.n_mels = n_mels + self.n_fft = n_fft + self.fmin = fmin + self.fmax = fmax + self.eps = eps + + def __repr__(self): + return ("{name}(fs={fs}, n_mels={n_mels}, n_fft={n_fft}, " + "fmin={fmin}, fmax={fmax}, eps={eps}))".format( + name=self.__class__.__name__, + fs=self.fs, + n_mels=self.n_mels, + n_fft=self.n_fft, + fmin=self.fmin, + fmax=self.fmax, + eps=self.eps, )) + + def __call__(self, x): + return stft2logmelspectrogram( + x, + fs=self.fs, + n_mels=self.n_mels, + n_fft=self.n_fft, + fmin=self.fmin, + fmax=self.fmax, ) + + +class Stft(): + def __init__( + self, + n_fft, + n_shift, + win_length=None, + window="hann", + center=True, + pad_mode="reflect", ): + self.n_fft = n_fft + self.n_shift = n_shift + self.win_length = win_length + self.window = window + self.center = center + self.pad_mode = pad_mode + + def __repr__(self): + return ("{name}(n_fft={n_fft}, n_shift={n_shift}, " + "win_length={win_length}, window={window}," + "center={center}, pad_mode={pad_mode})".format( + name=self.__class__.__name__, + n_fft=self.n_fft, + n_shift=self.n_shift, + win_length=self.win_length, + window=self.window, + center=self.center, + pad_mode=self.pad_mode, )) + + def __call__(self, x): + return stft( + x, + self.n_fft, + self.n_shift, + win_length=self.win_length, + window=self.window, + center=self.center, + pad_mode=self.pad_mode, ) + + +class IStft(): + def __init__(self, n_shift, win_length=None, window="hann", center=True): + self.n_shift = n_shift + self.win_length = win_length + self.window = window + self.center = center + + def __repr__(self): + return ("{name}(n_shift={n_shift}, " + "win_length={win_length}, window={window}," + "center={center})".format( + name=self.__class__.__name__, + n_shift=self.n_shift, + win_length=self.win_length, + window=self.window, + center=self.center, )) + + def __call__(self, x): + return istft( + x, + self.n_shift, + win_length=self.win_length, + window=self.window, + center=self.center, ) diff --git a/deepspeech/transform/transform_interface.py b/deepspeech/transform/transform_interface.py new file mode 100644 index 00000000..7ab29554 --- /dev/null +++ b/deepspeech/transform/transform_interface.py @@ -0,0 +1,33 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +# TODO(karita): add this to all the transform impl. 
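Illustrative parameters for the new spectrogram transforms (a sketch, not defaults taken from the PR; 400/160 correspond to a 25 ms window and 10 ms hop at 16 kHz):

```python
import numpy as np

from deepspeech.transform.spectrogram import LogMelSpectrogram, Stft

wav = np.random.randn(16000).astype(np.float32)   # ~1 s at 16 kHz

fbank = LogMelSpectrogram(fs=16000, n_mels=80, n_fft=400, n_shift=160)(wav)
print(fbank.shape)              # (num_frames, 80)

# The complex STFT is exposed as its own transform as well:
spec = Stft(n_fft=400, n_shift=160)(wav)
print(spec.shape, spec.dtype)   # (num_frames, 201) complex64
```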
+class TransformInterface: + """Transform Interface""" + + def __call__(self, x): + raise NotImplementedError("__call__ method is not implemented") + + @classmethod + def add_arguments(cls, parser): + return parser + + def __repr__(self): + return self.__class__.__name__ + "()" + + +class Identity(TransformInterface): + """Identity Function""" + + def __call__(self, x): + return x diff --git a/deepspeech/transform/transformation.py b/deepspeech/transform/transformation.py new file mode 100644 index 00000000..afb1db28 --- /dev/null +++ b/deepspeech/transform/transformation.py @@ -0,0 +1,156 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Transformation module.""" +import copy +import io +import logging +from collections import OrderedDict +from collections.abc import Sequence +from inspect import signature + +import yaml + +from deepspeech.utils.dynamic_import import dynamic_import + +# TODO(karita): inherit TransformInterface +# TODO(karita): register cmd arguments in asr_train.py +import_alias = dict( + identity="deepspeech.transform.transform_interface:Identity", + time_warp="deepspeech.transform.spec_augment:TimeWarp", + time_mask="deepspeech.transform.spec_augment:TimeMask", + freq_mask="deepspeech.transform.spec_augment:FreqMask", + spec_augment="deepspeech.transform.spec_augment:SpecAugment", + speed_perturbation="deepspeech.transform.perturb:SpeedPerturbation", + volume_perturbation="deepspeech.transform.perturb:VolumePerturbation", + noise_injection="deepspeech.transform.perturb:NoiseInjection", + bandpass_perturbation="deepspeech.transform.perturb:BandpassPerturbation", + rir_convolve="deepspeech.transform.perturb:RIRConvolve", + delta="deepspeech.transform.add_deltas:AddDeltas", + cmvn="deepspeech.transform.cmvn:CMVN", + utterance_cmvn="deepspeech.transform.cmvn:UtteranceCMVN", + fbank="deepspeech.transform.spectrogram:LogMelSpectrogram", + spectrogram="deepspeech.transform.spectrogram:Spectrogram", + stft="deepspeech.transform.spectrogram:Stft", + istft="deepspeech.transform.spectrogram:IStft", + stft2fbank="deepspeech.transform.spectrogram:Stft2LogMelSpectrogram", + wpe="deepspeech.transform.wpe:WPE", + channel_selector="deepspeech.transform.channel_selector:ChannelSelector", ) + + +class Transformation(): + """Apply some functions to the mini-batch + + Examples: + >>> kwargs = {"process": [{"type": "fbank", + ... "n_mels": 80, + ... "fs": 16000}, + ... {"type": "cmvn", + ... "stats": "data/train/cmvn.ark", + ... "norm_vars": True}, + ... {"type": "delta", "window": 2, "order": 2}]} + >>> transform = Transformation(kwargs) + >>> bs = 10 + >>> xs = [np.random.randn(100, 80).astype(np.float32) + ... 
for _ in range(bs)] + >>> xs = transform(xs) + """ + + def __init__(self, conffile=None): + if conffile is not None: + if isinstance(conffile, dict): + self.conf = copy.deepcopy(conffile) + else: + with io.open(conffile, encoding="utf-8") as f: + self.conf = yaml.safe_load(f) + assert isinstance(self.conf, dict), type(self.conf) + else: + self.conf = {"mode": "sequential", "process": []} + + self.functions = OrderedDict() + if self.conf.get("mode", "sequential") == "sequential": + for idx, process in enumerate(self.conf["process"]): + assert isinstance(process, dict), type(process) + opts = dict(process) + process_type = opts.pop("type") + class_obj = dynamic_import(process_type, import_alias) + # TODO(karita): assert issubclass(class_obj, TransformInterface) + try: + self.functions[idx] = class_obj(**opts) + except TypeError: + try: + signa = signature(class_obj) + except ValueError: + # Some function, e.g. built-in function, are failed + pass + else: + logging.error("Expected signature: {}({})".format( + class_obj.__name__, signa)) + raise + else: + raise NotImplementedError( + "Not supporting mode={}".format(self.conf["mode"])) + + def __repr__(self): + rep = "\n" + "\n".join(" {}: {}".format(k, v) + for k, v in self.functions.items()) + return "{}({})".format(self.__class__.__name__, rep) + + def __call__(self, xs, uttid_list=None, **kwargs): + """Return new mini-batch + + :param Union[Sequence[np.ndarray], np.ndarray] xs: + :param Union[Sequence[str], str] uttid_list: + :return: batch: + :rtype: List[np.ndarray] + """ + if not isinstance(xs, Sequence): + is_batch = False + xs = [xs] + else: + is_batch = True + + if isinstance(uttid_list, str): + uttid_list = [uttid_list for _ in range(len(xs))] + + if self.conf.get("mode", "sequential") == "sequential": + for idx in range(len(self.conf["process"])): + func = self.functions[idx] + # TODO(karita): use TrainingTrans and UttTrans to check __call__ args + # Derive only the args which the func has + try: + param = signature(func).parameters + except ValueError: + # Some function, e.g. built-in function, are failed + param = {} + _kwargs = {k: v for k, v in kwargs.items() if k in param} + try: + if uttid_list is not None and "uttid" in param: + xs = [ + func(x, u, **_kwargs) + for x, u in zip(xs, uttid_list) + ] + else: + xs = [func(x, **_kwargs) for x in xs] + except Exception: + logging.fatal("Catch a exception from {}th func: {}".format( + idx, func)) + raise + else: + raise NotImplementedError( + "Not supporting mode={}".format(self.conf["mode"])) + + if is_batch: + return xs + else: + return xs[0] diff --git a/deepspeech/transform/wpe.py b/deepspeech/transform/wpe.py new file mode 100644 index 00000000..d82005f6 --- /dev/null +++ b/deepspeech/transform/wpe.py @@ -0,0 +1,57 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
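To show how the pieces compose, here is a sketch of building a `Transformation` from an in-memory dict (a YAML path works the same way); the parameter values are illustrative only, and stats-dependent steps such as `cmvn` are left out to keep it self-contained:

```python
import numpy as np

from deepspeech.transform.transformation import Transformation

preprocess_conf = {
    "mode": "sequential",
    "process": [
        {"type": "fbank", "fs": 16000, "n_mels": 80, "n_fft": 400, "n_shift": 160},
        {"type": "spec_augment"},          # only applied when train=True is passed
        {"type": "delta", "window": 2, "order": 2},
    ],
}

transform = Transformation(preprocess_conf)
wavs = [np.random.randn(16000).astype(np.float32) for _ in range(4)]

# Keyword args such as train are forwarded only to transforms whose __call__ accepts them.
feats = transform(wavs, train=True)
print(feats[0].shape)                      # (num_frames, 3 * 80)
```

A single array (instead of a list) is also accepted and returns a single array.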
+from nara_wpe.wpe import wpe + + +class WPE(object): + def __init__(self, + taps=10, + delay=3, + iterations=3, + psd_context=0, + statistics_mode="full"): + self.taps = taps + self.delay = delay + self.iterations = iterations + self.psd_context = psd_context + self.statistics_mode = statistics_mode + + def __repr__(self): + return ("{name}(taps={taps}, delay={delay}" + "iterations={iterations}, psd_context={psd_context}, " + "statistics_mode={statistics_mode})".format( + name=self.__class__.__name__, + taps=self.taps, + delay=self.delay, + iterations=self.iterations, + psd_context=self.psd_context, + statistics_mode=self.statistics_mode, )) + + def __call__(self, xs): + """Return enhanced + + :param np.ndarray xs: (Time, Channel, Frequency) + :return: enhanced_xs + :rtype: np.ndarray + + """ + # nara_wpe.wpe: (F, C, T) + xs = wpe( + xs.transpose((2, 1, 0)), + taps=self.taps, + delay=self.delay, + iterations=self.iterations, + psd_context=self.psd_context, + statistics_mode=self.statistics_mode, ) + return xs.transpose(2, 1, 0) diff --git a/deepspeech/utils/asr_utils.py b/deepspeech/utils/asr_utils.py new file mode 100644 index 00000000..6f86e56f --- /dev/null +++ b/deepspeech/utils/asr_utils.py @@ -0,0 +1,52 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import json + +import numpy as np + +__all__ = ["label_smoothing_dist"] + + +# TODO(takaaki-hori): add different smoothing methods +def label_smoothing_dist(odim, lsm_type, transcript=None, blank=0): + """Obtain label distribution for loss smoothing. + + :param odim: + :param lsm_type: + :param blank: + :param transcript: + :return: + """ + if transcript is not None: + with open(transcript, "rb") as f: + trans_json = json.load(f)["utts"] + + if lsm_type == "unigram": + assert transcript is not None, ( + "transcript is required for %s label smoothing" % lsm_type) + labelcount = np.zeros(odim) + for k, v in trans_json.items(): + ids = np.array([int(n) for n in v["output"][0]["tokenid"].split()]) + # to avoid an error when there is no text in an uttrance + if len(ids) > 0: + labelcount[ids] += 1 + labelcount[odim - 1] = len(transcript) # count + labelcount[labelcount == 0] = 1 # flooring + labelcount[blank] = 0 # remove counts for blank + labeldist = labelcount.astype(np.float32) / np.sum(labelcount) + else: + logging.error("Error: unexpected label smoothing type: %s" % lsm_type) + sys.exit() + + return labeldist diff --git a/deepspeech/utils/bleu_score.py b/deepspeech/utils/bleu_score.py index 09646133..ea32fcf9 100644 --- a/deepspeech/utils/bleu_score.py +++ b/deepspeech/utils/bleu_score.py @@ -14,17 +14,17 @@ """This module provides functions to calculate bleu score in different level. e.g. wer for word-level, cer for char-level. """ +import nltk +import numpy as np import sacrebleu -__all__ = ['bleu', 'char_bleu'] +__all__ = ['bleu', 'char_bleu', "ErrorCalculator"] def bleu(hypothesis, reference): """Calculate BLEU. 
BLEU compares reference text and hypothesis text in word-level using scarebleu. - - :param reference: The reference sentences. :type reference: list[list[str]] :param hypothesis: The hypothesis sentence. @@ -39,8 +39,6 @@ def char_bleu(hypothesis, reference): """Calculate BLEU. BLEU compares reference text and hypothesis text in char-level using scarebleu. - - :param reference: The reference sentences. :type reference: list[list[str]] :param hypothesis: The hypothesis sentence. @@ -52,3 +50,70 @@ def char_bleu(hypothesis, reference): for ref in reference] return sacrebleu.corpus_bleu(hypothesis, reference) + + +class ErrorCalculator(): + """Calculate BLEU for ST and MT models during training. + + :param y_hats: numpy array with predicted text + :param y_pads: numpy array with true (target) text + :param char_list: vocabulary list + :param sym_space: space symbol + :param sym_pad: pad symbol + :param report_bleu: report BLUE score if True + """ + + def __init__(self, char_list, sym_space, sym_pad, report_bleu=False): + """Construct an ErrorCalculator object.""" + super().__init__() + self.char_list = char_list + self.space = sym_space + self.pad = sym_pad + self.report_bleu = report_bleu + if self.space in self.char_list: + self.idx_space = self.char_list.index(self.space) + else: + self.idx_space = None + + def __call__(self, ys_hat, ys_pad): + """Calculate corpus-level BLEU score. + + :param torch.Tensor ys_hat: prediction (batch, seqlen) + :param torch.Tensor ys_pad: reference (batch, seqlen) + :return: corpus-level BLEU score in a mini-batch + :rtype float + """ + bleu = None + if not self.report_bleu: + return bleu + + bleu = self.calculate_corpus_bleu(ys_hat, ys_pad) + return bleu + + def calculate_corpus_bleu(self, ys_hat, ys_pad): + """Calculate corpus-level BLEU score in a mini-batch. + + :param torch.Tensor seqs_hat: prediction (batch, seqlen) + :param torch.Tensor seqs_true: reference (batch, seqlen) + :return: corpus-level BLEU score + :rtype float + """ + seqs_hat, seqs_true = [], [] + for i, y_hat in enumerate(ys_hat): + y_true = ys_pad[i] + eos_true = np.where(y_true == -1)[0] + ymax = eos_true[0] if len(eos_true) > 0 else len(y_true) + # NOTE: padding index (-1) in y_true is used to pad y_hat + # because y_hats is not padded with -1 + seq_hat = [self.char_list[int(idx)] for idx in y_hat[:ymax]] + seq_true = [ + self.char_list[int(idx)] for idx in y_true if int(idx) != -1 + ] + seq_hat_text = "".join(seq_hat).replace(self.space, " ") + seq_hat_text = seq_hat_text.replace(self.pad, "") + seq_true_text = "".join(seq_true).replace(self.space, " ") + seqs_hat.append(seq_hat_text) + seqs_true.append(seq_true_text) + bleu = nltk.bleu_score.corpus_bleu([[ref] for ref in seqs_true], + seqs_hat) + return bleu * 100 diff --git a/deepspeech/utils/check_kwargs.py b/deepspeech/utils/check_kwargs.py new file mode 100644 index 00000000..1ee7329b --- /dev/null +++ b/deepspeech/utils/check_kwargs.py @@ -0,0 +1,34 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. +import inspect + + +def check_kwargs(func, kwargs, name=None): + """check kwargs are valid for func + + If kwargs are invalid, raise TypeError as same as python default + :param function func: function to be validated + :param dict kwargs: keyword arguments for func + :param str name: name used in TypeError (default is func name) + """ + try: + params = inspect.signature(func).parameters + except ValueError: + return + if name is None: + name = func.__name__ + for k in kwargs.keys(): + if k not in params: + raise TypeError( + f"{name}() got an unexpected keyword argument '{k}'") diff --git a/deepspeech/utils/cli_readers.py b/deepspeech/utils/cli_readers.py new file mode 100644 index 00000000..72aa2bdb --- /dev/null +++ b/deepspeech/utils/cli_readers.py @@ -0,0 +1,241 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import io +import logging +import sys + +import h5py +import kaldiio +import soundfile + +from deepspeech.io.reader import SoundHDF5File + + +def file_reader_helper( + rspecifier: str, + filetype: str="mat", + return_shape: bool=False, + segments: str=None, ): + """Read uttid and array in kaldi style + + This function might be a bit confusing as "ark" is used + for HDF5 to imitate "kaldi-rspecifier". + + Args: + rspecifier: Give as "ark:feats.ark" or "scp:feats.scp" + filetype: "mat" is kaldi-martix, "hdf5": HDF5 + return_shape: Return the shape of the matrix, + instead of the matrix. This can reduce IO cost for HDF5. + segments (str): The file format is + " \n" + "e.g. call-861225-A-0050-0065 call-861225-A 5.0 6.5\n" + Returns: + Generator[Tuple[str, np.ndarray], None, None]: + + Examples: + Read from kaldi-matrix ark file: + + >>> for u, array in file_reader_helper('ark:feats.ark', 'mat'): + ... array + + Read from HDF5 file: + + >>> for u, array in file_reader_helper('ark:feats.h5', 'hdf5'): + ... array + + """ + if filetype == "mat": + return KaldiReader( + rspecifier, return_shape=return_shape, segments=segments) + elif filetype == "hdf5": + return HDF5Reader(rspecifier, return_shape=return_shape) + elif filetype == "sound.hdf5": + return SoundHDF5Reader(rspecifier, return_shape=return_shape) + elif filetype == "sound": + return SoundReader(rspecifier, return_shape=return_shape) + else: + raise NotImplementedError(f"filetype={filetype}") + + +class KaldiReader: + def __init__(self, rspecifier, return_shape=False, segments=None): + self.rspecifier = rspecifier + self.return_shape = return_shape + self.segments = segments + + def __iter__(self): + with kaldiio.ReadHelper( + self.rspecifier, segments=self.segments) as reader: + for key, array in reader: + if self.return_shape: + array = array.shape + yield key, array + + +class HDF5Reader: + def __init__(self, rspecifier, return_shape=False): + if ":" not in rspecifier: + raise ValueError('Give "rspecifier" such as "ark:some.ark: {}"'. 
+ format(self.rspecifier)) + self.rspecifier = rspecifier + self.ark_or_scp, self.filepath = self.rspecifier.split(":", 1) + if self.ark_or_scp not in ["ark", "scp"]: + raise ValueError(f"Must be scp or ark: {self.ark_or_scp}") + + self.return_shape = return_shape + + def __iter__(self): + if self.ark_or_scp == "scp": + hdf5_dict = {} + with open(self.filepath, "r", encoding="utf-8") as f: + for line in f: + key, value = line.rstrip().split(None, 1) + + if ":" not in value: + raise RuntimeError( + "scp file for hdf5 should be like: " + '"uttid filepath.h5:key": {}({})'.format( + line, self.filepath)) + path, h5_key = value.split(":", 1) + + hdf5_file = hdf5_dict.get(path) + if hdf5_file is None: + try: + hdf5_file = h5py.File(path, "r") + except Exception: + logging.error("Error when loading {}".format(path)) + raise + hdf5_dict[path] = hdf5_file + + try: + data = hdf5_file[h5_key] + except Exception: + logging.error("Error when loading {} with key={}". + format(path, h5_key)) + raise + + if self.return_shape: + yield key, data.shape + else: + yield key, data[()] + + # Closing all files + for k in hdf5_dict: + try: + hdf5_dict[k].close() + except Exception: + pass + + else: + if self.filepath == "-": + # Required h5py>=2.9 + filepath = io.BytesIO(sys.stdin.buffer.read()) + else: + filepath = self.filepath + with h5py.File(filepath, "r") as f: + for key in f: + if self.return_shape: + yield key, f[key].shape + else: + yield key, f[key][()] + + +class SoundHDF5Reader: + def __init__(self, rspecifier, return_shape=False): + if ":" not in rspecifier: + raise ValueError('Give "rspecifier" such as "ark:some.ark: {}"'. + format(rspecifier)) + self.ark_or_scp, self.filepath = rspecifier.split(":", 1) + if self.ark_or_scp not in ["ark", "scp"]: + raise ValueError(f"Must be scp or ark: {self.ark_or_scp}") + self.return_shape = return_shape + + def __iter__(self): + if self.ark_or_scp == "scp": + hdf5_dict = {} + with open(self.filepath, "r", encoding="utf-8") as f: + for line in f: + key, value = line.rstrip().split(None, 1) + + if ":" not in value: + raise RuntimeError( + "scp file for hdf5 should be like: " + '"uttid filepath.h5:key": {}({})'.format( + line, self.filepath)) + path, h5_key = value.split(":", 1) + + hdf5_file = hdf5_dict.get(path) + if hdf5_file is None: + try: + hdf5_file = SoundHDF5File(path, "r") + except Exception: + logging.error("Error when loading {}".format(path)) + raise + hdf5_dict[path] = hdf5_file + + try: + data = hdf5_file[h5_key] + except Exception: + logging.error("Error when loading {} with key={}". + format(path, h5_key)) + raise + + # Change Tuple[ndarray, int] -> Tuple[int, ndarray] + # (soundfile style -> scipy style) + array, rate = data + if self.return_shape: + array = array.shape + yield key, (rate, array) + + # Closing all files + for k in hdf5_dict: + try: + hdf5_dict[k].close() + except Exception: + pass + + else: + if self.filepath == "-": + # Required h5py>=2.9 + filepath = io.BytesIO(sys.stdin.buffer.read()) + else: + filepath = self.filepath + for key, (a, r) in SoundHDF5File(filepath, "r").items(): + if self.return_shape: + a = a.shape + yield key, (r, a) + + +class SoundReader: + def __init__(self, rspecifier, return_shape=False): + if ":" not in rspecifier: + raise ValueError('Give "rspecifier" such as "scp:some.scp: {}"'. 
+ format(rspecifier)) + self.ark_or_scp, self.filepath = rspecifier.split(":", 1) + if self.ark_or_scp != "scp": + raise ValueError('Only supporting "scp" for sound file: {}'.format( + self.ark_or_scp)) + self.return_shape = return_shape + + def __iter__(self): + with open(self.filepath, "r", encoding="utf-8") as f: + for line in f: + key, sound_file_path = line.rstrip().split(None, 1) + # Assume PCM16 + array, rate = soundfile.read(sound_file_path, dtype="int16") + # Change Tuple[ndarray, int] -> Tuple[int, ndarray] + # (soundfile style -> scipy style) + if self.return_shape: + array = array.shape + yield key, (rate, array) diff --git a/deepspeech/utils/cli_utils.py b/deepspeech/utils/cli_utils.py new file mode 100644 index 00000000..f8e1d60b --- /dev/null +++ b/deepspeech/utils/cli_utils.py @@ -0,0 +1,70 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import sys +from collections.abc import Sequence +from distutils.util import strtobool as dist_strtobool + +import numpy + + +def strtobool(x): + # distutils.util.strtobool returns integer, but it's confusing, + return bool(dist_strtobool(x)) + + +def get_commandline_args(): + extra_chars = [ + " ", + ";", + "&", + "(", + ")", + "|", + "^", + "<", + ">", + "?", + "*", + "[", + "]", + "$", + "`", + '"', + "\\", + "!", + "{", + "}", + ] + + # Escape the extra characters for shell + argv = [ + arg.replace("'", "'\\''") if all(char not in arg + for char in extra_chars) else + "'" + arg.replace("'", "'\\''") + "'" for arg in sys.argv + ] + + return sys.executable + " " + " ".join(argv) + + +def is_scipy_wav_style(value): + # If Tuple[int, numpy.ndarray] or not + return (isinstance(value, Sequence) and len(value) == 2 and + isinstance(value[0], int) and isinstance(value[1], numpy.ndarray)) + + +def assert_scipy_wav_style(value): + assert is_scipy_wav_style( + value), "Must be Tuple[int, numpy.ndarray], but got {}".format( + type(value) if not isinstance(value, Sequence) else "{}[{}]".format( + type(value), ", ".join(str(type(v)) for v in value))) diff --git a/deepspeech/utils/cli_writers.py b/deepspeech/utils/cli_writers.py new file mode 100644 index 00000000..e0737193 --- /dev/null +++ b/deepspeech/utils/cli_writers.py @@ -0,0 +1,293 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
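A usage sketch for the Kaldi-style reader helper (the archive/scp paths below are placeholders that must exist on disk):

```python
from deepspeech.utils.cli_readers import file_reader_helper

# Iterate (uttid, matrix) pairs from a Kaldi matrix archive.
for uttid, mat in file_reader_helper("ark:feats.ark", filetype="mat"):
    print(uttid, mat.shape)

# For HDF5-backed features, return_shape=True yields only the shapes,
# which avoids reading the full arrays.
for uttid, shape in file_reader_helper("scp:feats.scp", filetype="hdf5",
                                       return_shape=True):
    print(uttid, shape)
```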
+from pathlib import Path +from typing import Dict + +import h5py +import kaldiio +import numpy +import soundfile + +from deepspeech.io.reader import SoundHDF5File +from deepspeech.utils.cli_utils import assert_scipy_wav_style + + +def file_writer_helper( + wspecifier: str, + filetype: str="mat", + write_num_frames: str=None, + compress: bool=False, + compression_method: int=2, + pcm_format: str="wav", ): + """Write matrices in kaldi style + + Args: + wspecifier: e.g. ark,scp:out.ark,out.scp + filetype: "mat" is kaldi-martix, "hdf5": HDF5 + write_num_frames: e.g. 'ark,t:num_frames.txt' + compress: Compress or not + compression_method: Specify compression level + + Write in kaldi-matrix-ark with "kaldi-scp" file: + + >>> with file_writer_helper('ark,scp:out.ark,out.scp') as f: + >>> f['uttid'] = array + + This "scp" has the following format: + + uttidA out.ark:1234 + uttidB out.ark:2222 + + where, 1234 and 2222 points the strating byte address of the matrix. + (For detail, see official documentation of Kaldi) + + Write in HDF5 with "scp" file: + + >>> with file_writer_helper('ark,scp:out.h5,out.scp', 'hdf5') as f: + >>> f['uttid'] = array + + This "scp" file is created as: + + uttidA out.h5:uttidA + uttidB out.h5:uttidB + + HDF5 can be, unlike "kaldi-ark", accessed to any keys, + so originally "scp" is not required for random-reading. + Nevertheless we create "scp" for HDF5 because it is useful + for some use-case. e.g. Concatenation, Splitting. + + """ + if filetype == "mat": + return KaldiWriter( + wspecifier, + write_num_frames=write_num_frames, + compress=compress, + compression_method=compression_method, ) + elif filetype == "hdf5": + return HDF5Writer( + wspecifier, write_num_frames=write_num_frames, compress=compress) + elif filetype == "sound.hdf5": + return SoundHDF5Writer( + wspecifier, + write_num_frames=write_num_frames, + pcm_format=pcm_format) + elif filetype == "sound": + return SoundWriter( + wspecifier, + write_num_frames=write_num_frames, + pcm_format=pcm_format) + else: + raise NotImplementedError(f"filetype={filetype}") + + +class BaseWriter: + def __setitem__(self, key, value): + raise NotImplementedError + + def __enter__(self): + return self + + def __exit__(self, exc_type, exc_val, exc_tb): + self.close() + + def close(self): + try: + self.writer.close() + except Exception: + pass + + if self.writer_scp is not None: + try: + self.writer_scp.close() + except Exception: + pass + + if self.writer_nframe is not None: + try: + self.writer_nframe.close() + except Exception: + pass + + +def get_num_frames_writer(write_num_frames: str): + """get_num_frames_writer + + Examples: + >>> get_num_frames_writer('ark,t:num_frames.txt') + """ + if write_num_frames is not None: + if ":" not in write_num_frames: + raise ValueError('Must include ":", write_num_frames={}'.format( + write_num_frames)) + + nframes_type, nframes_file = write_num_frames.split(":", 1) + if nframes_type != "ark,t": + raise ValueError("Only supporting text mode. " + "e.g. 
--write-num-frames=ark,t:foo.txt :" + "{}".format(nframes_type)) + + return open(nframes_file, "w", encoding="utf-8") + + +class KaldiWriter(BaseWriter): + def __init__(self, + wspecifier, + write_num_frames=None, + compress=False, + compression_method=2): + if compress: + self.writer = kaldiio.WriteHelper( + wspecifier, compression_method=compression_method) + else: + self.writer = kaldiio.WriteHelper(wspecifier) + self.writer_scp = None + if write_num_frames is not None: + self.writer_nframe = get_num_frames_writer(write_num_frames) + else: + self.writer_nframe = None + + def __setitem__(self, key, value): + self.writer[key] = value + if self.writer_nframe is not None: + self.writer_nframe.write(f"{key} {len(value)}\n") + + +def parse_wspecifier(wspecifier: str) -> Dict[str, str]: + """Parse wspecifier to dict + + Examples: + >>> parse_wspecifier('ark,scp:out.ark,out.scp') + {'ark': 'out.ark', 'scp': 'out.scp'} + + """ + ark_scp, filepath = wspecifier.split(":", 1) + if ark_scp not in ["ark", "scp,ark", "ark,scp"]: + raise ValueError("{} is not allowed: {}".format(ark_scp, wspecifier)) + ark_scps = ark_scp.split(",") + filepaths = filepath.split(",") + if len(ark_scps) != len(filepaths): + raise ValueError("Mismatch: {} and {}".format(ark_scp, filepath)) + spec_dict = dict(zip(ark_scps, filepaths)) + return spec_dict + + +class HDF5Writer(BaseWriter): + """HDF5Writer + + Examples: + >>> with HDF5Writer('ark:out.h5', compress=True) as f: + ... f['key'] = array + """ + + def __init__(self, wspecifier, write_num_frames=None, compress=False): + spec_dict = parse_wspecifier(wspecifier) + self.filename = spec_dict["ark"] + + if compress: + self.kwargs = {"compression": "gzip"} + else: + self.kwargs = {} + self.writer = h5py.File(spec_dict["ark"], "w") + if "scp" in spec_dict: + self.writer_scp = open(spec_dict["scp"], "w", encoding="utf-8") + else: + self.writer_scp = None + if write_num_frames is not None: + self.writer_nframe = get_num_frames_writer(write_num_frames) + else: + self.writer_nframe = None + + def __setitem__(self, key, value): + self.writer.create_dataset(key, data=value, **self.kwargs) + + if self.writer_scp is not None: + self.writer_scp.write(f"{key} {self.filename}:{key}\n") + if self.writer_nframe is not None: + self.writer_nframe.write(f"{key} {len(value)}\n") + + +class SoundHDF5Writer(BaseWriter): + """SoundHDF5Writer + + Examples: + >>> fs = 16000 + >>> with SoundHDF5Writer('ark:out.h5') as f: + ... 
f['key'] = fs, array + """ + + def __init__(self, wspecifier, write_num_frames=None, pcm_format="wav"): + self.pcm_format = pcm_format + spec_dict = parse_wspecifier(wspecifier) + self.filename = spec_dict["ark"] + self.writer = SoundHDF5File( + spec_dict["ark"], "w", format=self.pcm_format) + if "scp" in spec_dict: + self.writer_scp = open(spec_dict["scp"], "w", encoding="utf-8") + else: + self.writer_scp = None + if write_num_frames is not None: + self.writer_nframe = get_num_frames_writer(write_num_frames) + else: + self.writer_nframe = None + + def __setitem__(self, key, value): + assert_scipy_wav_style(value) + # Change Tuple[int, ndarray] -> Tuple[ndarray, int] + # (scipy style -> soundfile style) + value = (value[1], value[0]) + self.writer.create_dataset(key, data=value) + + if self.writer_scp is not None: + self.writer_scp.write(f"{key} {self.filename}:{key}\n") + if self.writer_nframe is not None: + self.writer_nframe.write(f"{key} {len(value[0])}\n") + + +class SoundWriter(BaseWriter): + """SoundWriter + + Examples: + >>> fs = 16000 + >>> with SoundWriter('ark,scp:outdir,out.scp') as f: + ... f['key'] = fs, array + """ + + def __init__(self, wspecifier, write_num_frames=None, pcm_format="wav"): + self.pcm_format = pcm_format + spec_dict = parse_wspecifier(wspecifier) + # e.g. ark,scp:dirname,wav.scp + # -> The wave files are found in dirname/*.wav + self.dirname = spec_dict["ark"] + Path(self.dirname).mkdir(parents=True, exist_ok=True) + self.writer = None + + if "scp" in spec_dict: + self.writer_scp = open(spec_dict["scp"], "w", encoding="utf-8") + else: + self.writer_scp = None + if write_num_frames is not None: + self.writer_nframe = get_num_frames_writer(write_num_frames) + else: + self.writer_nframe = None + + def __setitem__(self, key, value): + assert_scipy_wav_style(value) + rate, signal = value + wavfile = Path(self.dirname) / (key + "." + self.pcm_format) + soundfile.write(wavfile, signal.astype(numpy.int16), rate) + + if self.writer_scp is not None: + self.writer_scp.write(f"{key} {wavfile}\n") + if self.writer_nframe is not None: + self.writer_nframe.write(f"{key} {len(signal)}\n") diff --git a/deepspeech/utils/error_rate.py b/deepspeech/utils/error_rate.py index 81f458b6..548376aa 100644 --- a/deepspeech/utils/error_rate.py +++ b/deepspeech/utils/error_rate.py @@ -14,12 +14,12 @@ """This module provides functions to calculate error rate in different level. e.g. wer for word-level, cer for char-level. """ +from itertools import groupby + import editdistance import numpy as np -__all__ = ['word_errors', 'char_errors', 'wer', 'cer'] - -editdistance.eval("a", "b") +__all__ = ['word_errors', 'char_errors', 'wer', 'cer', "ErrorCalculator"] def _levenshtein_distance(ref, hyp): @@ -211,3 +211,154 @@ def cer(reference, hypothesis, ignore_case=False, remove_space=False): cer = float(edit_distance) / ref_len return cer + + +class ErrorCalculator(): + """Calculate CER and WER for E2E_ASR and CTC models during training. 
+ + :param y_hats: numpy array with predicted text + :param y_pads: numpy array with true (target) text + :param char_list: List[str] + :param sym_space: + :param sym_blank: + :return: + """ + + def __init__(self, + char_list, + sym_space, + sym_blank, + report_cer=False, + report_wer=False): + """Construct an ErrorCalculator object.""" + super().__init__() + + self.report_cer = report_cer + self.report_wer = report_wer + + self.char_list = char_list + self.space = sym_space + self.blank = sym_blank + self.idx_blank = self.char_list.index(self.blank) + if self.space in self.char_list: + self.idx_space = self.char_list.index(self.space) + else: + self.idx_space = None + + def __call__(self, ys_hat, ys_pad, is_ctc=False): + """Calculate sentence-level WER/CER score. + + :param paddle.Tensor ys_hat: prediction (batch, seqlen) + :param paddle.Tensor ys_pad: reference (batch, seqlen) + :param bool is_ctc: calculate CER score for CTC + :return: sentence-level WER score + :rtype float + :return: sentence-level CER score + :rtype float + """ + cer, wer = None, None + if is_ctc: + return self.calculate_cer_ctc(ys_hat, ys_pad) + elif not self.report_cer and not self.report_wer: + return cer, wer + + seqs_hat, seqs_true = self.convert_to_char(ys_hat, ys_pad) + if self.report_cer: + cer = self.calculate_cer(seqs_hat, seqs_true) + + if self.report_wer: + wer = self.calculate_wer(seqs_hat, seqs_true) + return cer, wer + + def calculate_cer_ctc(self, ys_hat, ys_pad): + """Calculate sentence-level CER score for CTC. + + :param paddle.Tensor ys_hat: prediction (batch, seqlen) + :param paddle.Tensor ys_pad: reference (batch, seqlen) + :return: average sentence-level CER score + :rtype float + """ + cers, char_ref_lens = [], [] + for i, y in enumerate(ys_hat): + y_hat = [x[0] for x in groupby(y)] + y_true = ys_pad[i] + seq_hat, seq_true = [], [] + for idx in y_hat: + idx = int(idx) + if idx != -1 and idx != self.idx_blank and idx != self.idx_space: + seq_hat.append(self.char_list[int(idx)]) + + for idx in y_true: + idx = int(idx) + if idx != -1 and idx != self.idx_blank and idx != self.idx_space: + seq_true.append(self.char_list[int(idx)]) + + hyp_chars = "".join(seq_hat) + ref_chars = "".join(seq_true) + if len(ref_chars) > 0: + cers.append(editdistance.eval(hyp_chars, ref_chars)) + char_ref_lens.append(len(ref_chars)) + + cer_ctc = float(sum(cers)) / sum(char_ref_lens) if cers else None + return cer_ctc + + def convert_to_char(self, ys_hat, ys_pad): + """Convert index to character. + + :param paddle.Tensor seqs_hat: prediction (batch, seqlen) + :param paddle.Tensor seqs_true: reference (batch, seqlen) + :return: token list of prediction + :rtype list + :return: token list of reference + :rtype list + """ + seqs_hat, seqs_true = [], [] + for i, y_hat in enumerate(ys_hat): + y_true = ys_pad[i] + eos_true = np.where(y_true == -1)[0] + ymax = eos_true[0] if len(eos_true) > 0 else len(y_true) + # NOTE: padding index (-1) in y_true is used to pad y_hat + seq_hat = [self.char_list[int(idx)] for idx in y_hat[:ymax]] + seq_true = [ + self.char_list[int(idx)] for idx in y_true if int(idx) != -1 + ] + seq_hat_text = "".join(seq_hat).replace(self.space, " ") + seq_hat_text = seq_hat_text.replace(self.blank, "") + seq_true_text = "".join(seq_true).replace(self.space, " ") + seqs_hat.append(seq_hat_text) + seqs_true.append(seq_true_text) + return seqs_hat, seqs_true + + def calculate_cer(self, seqs_hat, seqs_true): + """Calculate sentence-level CER score. 
+ + :param list seqs_hat: prediction + :param list seqs_true: reference + :return: average sentence-level CER score + :rtype float + """ + char_eds, char_ref_lens = [], [] + for i, seq_hat_text in enumerate(seqs_hat): + seq_true_text = seqs_true[i] + hyp_chars = seq_hat_text.replace(" ", "") + ref_chars = seq_true_text.replace(" ", "") + char_eds.append(editdistance.eval(hyp_chars, ref_chars)) + char_ref_lens.append(len(ref_chars)) + return float(sum(char_eds)) / sum(char_ref_lens) + + def calculate_wer(self, seqs_hat, seqs_true): + """Calculate sentence-level WER score. + + :param list seqs_hat: prediction + :param list seqs_true: reference + :return: average sentence-level WER score + :rtype float + """ + word_eds, word_ref_lens = [], [] + for i, seq_hat_text in enumerate(seqs_hat): + seq_true_text = seqs_true[i] + hyp_words = seq_hat_text.split() + ref_words = seq_true_text.split() + word_eds.append(editdistance.eval(hyp_words, ref_words)) + word_ref_lens.append(len(ref_words)) + return float(sum(word_eds)) / sum(word_ref_lens) diff --git a/deepspeech/utils/spec_augment.py b/deepspeech/utils/spec_augment.py new file mode 100644 index 00000000..185a92b8 --- /dev/null +++ b/deepspeech/utils/spec_augment.py @@ -0,0 +1,13 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. diff --git a/docs/images/PaddleSpeech_log.png b/docs/images/PaddleSpeech_log.png new file mode 100644 index 00000000..fb252775 Binary files /dev/null and b/docs/images/PaddleSpeech_log.png differ diff --git a/examples/aishell3/tts3/README.md b/examples/aishell3/tts3/README.md index 130c52e1..c313d922 100644 --- a/examples/aishell3/tts3/README.md +++ b/examples/aishell3/tts3/README.md @@ -17,7 +17,7 @@ tar zxvf data_aishell3.tgz -C data_aishell3 ``` ### Get MFA result of AISHELL-3 and Extract it We use [MFA2.x](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get durations for aishell3_fastspeech2. -You can download from here [aishell3_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/AISHELL-3/with_tone/aishell3_alignment_tone.tar.gz), or train your own MFA model reference to [use_mfa example](https://github.com/PaddlePaddle/Parakeet/tree/develop/examples/use_mfa) (use MFA1.x now) of our repo. +You can download from here [aishell3_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/AISHELL-3/with_tone/aishell3_alignment_tone.tar.gz), or train your own MFA model reference to [use_mfa example](https://github.com/PaddlePaddle/DeepSpeech/tree/develop/examples/other/use_mfa) (use MFA1.x now) of our repo. ## Get Started Assume the path to the dataset is `~/datasets/data_aishell3`. @@ -98,7 +98,7 @@ optional arguments: 7. `--speaker-dict`is the path of the speaker id map file when training a multi-speaker FastSpeech2. ### Synthesize -We use [parallel wavegan](https://github.com/PaddlePaddle/Parakeet/tree/develop/examples/parallelwave_gan/baker) as the neural vocoder. 
+We use [parallel wavegan](https://github.com/PaddlePaddle/DeepSpeech/tree/develop/examples/csmsc/voc1) as the neural vocoder. Download pretrained parallel wavegan model from [pwg_baker_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/pwg_baker_ckpt_0.4.zip) and unzip it. ```bash unzip pwg_baker_ckpt_0.4.zip diff --git a/examples/aishell3/vc0/README.md b/examples/aishell3/vc0/README.md index d5803f64..9364cf00 100644 --- a/examples/aishell3/vc0/README.md +++ b/examples/aishell3/vc0/README.md @@ -1,8 +1,8 @@ # Tacotron2 + AISHELL-3 Voice Cloning This example contains code used to train a [Tacotron2 ](https://arxiv.org/abs/1712.05884) model with [AISHELL-3](http://www.aishelltech.com/aishell_3). The trained model can be used in Voice Cloning Task, We refer to the model structure of [Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis](https://arxiv.org/pdf/1806.04558.pdf) . The general steps are as follows: -1. Speaker Encoder: We use a Speaker Verification to train a speaker encoder. Datasets used in this task are different from those used in Tacotron2, because the transcriptions are not needed, we use more datasets, refer to [ge2e](../../other/ge2e). +1. Speaker Encoder: We use a Speaker Verification to train a speaker encoder. Datasets used in this task are different from those used in Tacotron2, because the transcriptions are not needed, we use more datasets, refer to [ge2e](https://github.com/PaddlePaddle/DeepSpeech/tree/develop/examples/other/ge2e). 2. Synthesizer: Then, we use the trained speaker encoder to generate utterance embedding for each sentence in AISHELL-3. This embedding is a extra input of Tacotron2 which will be concated with encoder outputs. -3. Vocoder: We use WaveFlow as the neural Vocoder, refer to [waveflow](../../ljspeech/voc0). +3. Vocoder: We use WaveFlow as the neural Vocoder, refer to [waveflow](https://github.com/PaddlePaddle/DeepSpeech/tree/develop/examples/ljspeech/voc0). ## Get Started Assume the path to the dataset is `~/datasets/data_aishell3`. @@ -39,9 +39,9 @@ There are silence in the edge of AISHELL-3's wavs, and the audio amplitude is ve We use Montreal Force Aligner 1.0. The label in aishell3 include pinyin,so the lexicon we provided to MFA is pinyin rather than Chinese characters. And the prosody marks(`$` and `%`) need to be removed. You shoud preprocess the dataset into the format which MFA needs, the texts have the same name with wavs and have the suffix `.lab`. -We use [lexicon.txt](./lexicon.txt) as the lexicon. +We use [lexicon.txt](https://github.com/PaddlePaddle/DeepSpeech/blob/develop/parakeet/exps/voice_cloning/tacotron2_ge2e/lexicon.txt) as the lexicon. -You can download the alignment results from here [alignment_aishell3.tar.gz](https://paddlespeech.bj.bcebos.com/Parakeet/alignment_aishell3.tar.gz), or train your own MFA model reference to [use_mfa example](https://github.com/PaddlePaddle/Parakeet/tree/develop/examples/use_mfa) (use MFA1.x now) of our repo. +You can download the alignment results from here [alignment_aishell3.tar.gz](https://paddlespeech.bj.bcebos.com/Parakeet/alignment_aishell3.tar.gz), or train your own MFA model reference to [use_mfa example](https://github.com/PaddlePaddle/DeepSpeech/tree/develop/examples/other/use_mfa) (use MFA1.x now) of our repo. 
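To make the voice-cloning conditioning described in the `vc0` example above concrete: the utterance-level speaker embedding is broadcast over time and concatenated with the Tacotron2 encoder outputs before decoding. A minimal sketch of that step follows (shapes and dimensions are illustrative only, not the actual model code):

```python
import numpy as np

# Illustrative shapes: encoder outputs for one sentence (T, D_enc) and a single
# utterance-level speaker embedding (D_spk,) from the GE2E speaker encoder.
encoder_outputs = np.random.randn(120, 512).astype("float32")
speaker_embedding = np.random.randn(256).astype("float32")

# Repeat the embedding at every time step, then concatenate along the feature
# axis; the decoder attends over the (T, D_enc + D_spk) conditioned memory.
tiled = np.tile(speaker_embedding[None, :], (encoder_outputs.shape[0], 1))
conditioned = np.concatenate([encoder_outputs, tiled], axis=-1)
print(conditioned.shape)  # (120, 768)
```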
```bash if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then diff --git a/examples/csmsc/tts2/README.md b/examples/csmsc/tts2/README.md index 4283e8cc..e73f81fa 100644 --- a/examples/csmsc/tts2/README.md +++ b/examples/csmsc/tts2/README.md @@ -7,7 +7,7 @@ Download CSMSC from it's [Official Website](https://test.data-baker.com/data/ind ### Get MFA result of CSMSC and Extract it We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get durations for SPEEDYSPEECH. -You can download from here [baker_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/BZNSYP/with_tone/baker_alignment_tone.tar.gz), or train your own MFA model reference to [use_mfa example](https://github.com/PaddlePaddle/Parakeet/tree/develop/examples/use_mfa) of our repo. +You can download from here [baker_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/BZNSYP/with_tone/baker_alignment_tone.tar.gz), or train your own MFA model reference to [use_mfa example](https://github.com/PaddlePaddle/DeepSpeech/tree/develop/examples/other/use_mfa) of our repo. ## Get Started Assume the path to the dataset is `~/datasets/BZNSYP`. @@ -91,7 +91,7 @@ optional arguments: 7. `--tones-dict` is the path of the tone vocabulary file. ### Synthesize -We use [parallel wavegan](https://github.com/PaddlePaddle/Parakeet/tree/develop/examples/parallelwave_gan/baker) as the neural vocoder. +We use [parallel wavegan](https://github.com/PaddlePaddle/DeepSpeech/tree/develop/examples/csmsc/voc1) as the neural vocoder. Download pretrained parallel wavegan model from [pwg_baker_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/pwg_baker_ckpt_0.4.zip) and unzip it. ```bash unzip pwg_baker_ckpt_0.4.zip diff --git a/examples/csmsc/tts3/README.md b/examples/csmsc/tts3/README.md index 735ef6d1..42f33faa 100644 --- a/examples/csmsc/tts3/README.md +++ b/examples/csmsc/tts3/README.md @@ -7,7 +7,7 @@ Download CSMSC from it's [Official Website](https://test.data-baker.com/data/ind ### Get MFA result of CSMSC and Extract it We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get durations for fastspeech2. -You can download from here [baker_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/BZNSYP/with_tone/baker_alignment_tone.tar.gz), or train your own MFA model reference to [use_mfa example](https://github.com/PaddlePaddle/Parakeet/tree/develop/examples/use_mfa) of our repo. +You can download from here [baker_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/BZNSYP/with_tone/baker_alignment_tone.tar.gz), or train your own MFA model reference to [use_mfa example](https://github.com/PaddlePaddle/DeepSpeech/tree/develop/examples/other/use_mfa) of our repo. ## Get Started Assume the path to the dataset is `~/datasets/BZNSYP`. @@ -88,7 +88,7 @@ optional arguments: 6. `--phones-dict` is the path of the phone vocabulary file. ### Synthesize -We use [parallel wavegan](https://github.com/PaddlePaddle/Parakeet/tree/develop/examples/parallelwave_gan/baker) as the neural vocoder. +We use [parallel wavegan](https://github.com/PaddlePaddle/DeepSpeech/tree/develop/examples/csmsc/voc1) as the neural vocoder. Download pretrained parallel wavegan model from [pwg_baker_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/pwg_baker_ckpt_0.4.zip) and unzip it. 
```bash unzip pwg_baker_ckpt_0.4.zip diff --git a/examples/csmsc/voc1/README.md b/examples/csmsc/voc1/README.md index 2a7b3185..4b6b6c42 100644 --- a/examples/csmsc/voc1/README.md +++ b/examples/csmsc/voc1/README.md @@ -6,7 +6,7 @@ Download CSMSC from the [official website](https://www.data-baker.com/data/index ### Get MFA results for silence trim We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) results to cut silence in the edge of audio. -You can download from here [baker_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/BZNSYP/with_tone/baker_alignment_tone.tar.gz), or train your own MFA model reference to [use_mfa example](https://github.com/PaddlePaddle/Parakeet/tree/develop/examples/use_mfa) of our repo. +You can download from here [baker_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/BZNSYP/with_tone/baker_alignment_tone.tar.gz), or train your own MFA model reference to [use_mfa example](https://github.com/PaddlePaddle/DeepSpeech/tree/develop/examples/other/use_mfa) of our repo. ## Get Started Assume the path to the dataset is `~/datasets/BZNSYP`. diff --git a/examples/librispeech/s2/.gitignore b/examples/librispeech/s2/.gitignore new file mode 100644 index 00000000..e56b7d34 --- /dev/null +++ b/examples/librispeech/s2/.gitignore @@ -0,0 +1,4 @@ +dump +fbank +exp +data diff --git a/examples/librispeech/s2/README.md b/examples/librispeech/s2/README.md index d5df37d8..9285a183 100644 --- a/examples/librispeech/s2/README.md +++ b/examples/librispeech/s2/README.md @@ -1,8 +1,11 @@ # LibriSpeech -| Model | Params | Config | Augmentation| Loss | -| --- | --- | --- | --- | -| transformer | 32.52 M | conf/transformer.yaml | spec_aug | 6.3197922706604 | + +## Transformer + +| Model | Params | GPUS | Averaged Model | Config | Augmentation| Loss | +| --- | --- | --- | --- | --- | --- | +| transformer | 32.52 M | 8 Tesla V100-SXM2-32GB | 10-best val_loss | conf/transformer.yaml | spec_aug | 6.3197922706604 | | Test Set | Decode Method | #Snt | #Wrd | Corr | Sub | Del | Ins | Err | S.Err | @@ -11,4 +14,14 @@ | test-clean | ctc_greedy_search | 2620 | 52576 | 95.9 | 3.7 | 0.4 | 0.5 | 4.6 | 48.0 | | test-clean | ctc_prefix_beamsearch | 2620 | 52576 | 95.9 | 3.7 | 0.4 | 0.5 | 4.6 | 47.6 | | test-clean | attention_rescore | 2620 | 52576 | 96.8 | 2.9 | 0.3 | 0.4 | 3.7 | 38.0 | + +### JoinCTC + +| Test Set | Decode Method | #Snt | #Wrd | Corr | Sub | Del | Ins | Err | S.Err | +| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | +| test-clean | join_ctc_only_att | 2620 | 52576 | 96.1 | 2.5 | 1.4 | 0.4 | 4.4 | 34.7 | | test-clean | join_ctc_w/o_lm | 2620 | 52576 | 97.2 | 2.6 | 0.3 | 0.4 | 3.2 | 34.9 | +| test-clean | join_ctc_w_lm | 2620 | 52576 | 97.9 | 1.8 | 0.2 | 0.3 | 2.4 | 27.8 | + +Compare with [ESPNET](https://github.com/espnet/espnet/blob/master/egs/librispeech/asr1/RESULTS.md#pytorch-large-transformer-with-specaug-4-gpus--transformer-lm-4-gpus) +we using 8gpu, but model size (aheads4-adim256) small than it. 
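For reference, the Err column in the result tables above is a corpus-level edit-distance rate: summed substitutions, deletions and insertions divided by the total number of reference words (or characters for CER), which the tables report as a percentage. A minimal sketch of that aggregation, mirroring `ErrorCalculator.calculate_wer` / `calculate_cer` added to `deepspeech/utils/error_rate.py` in this diff (toy hypotheses and references only; sclite additionally breaks the errors down into Sub/Del/Ins):

```python
import editdistance

def corpus_wer(hyps, refs):
    # Word-level edit distance summed over utterances, normalized by the total
    # number of reference words -- the same aggregation as calculate_wer.
    eds = [editdistance.eval(h.split(), r.split()) for h, r in zip(hyps, refs)]
    return float(sum(eds)) / sum(len(r.split()) for r in refs)

def corpus_cer(hyps, refs):
    # Character-level variant; spaces are removed first, as in calculate_cer.
    eds = [editdistance.eval(h.replace(" ", ""), r.replace(" ", ""))
           for h, r in zip(hyps, refs)]
    return float(sum(eds)) / sum(len(r.replace(" ", "")) for r in refs)

print(corpus_wer(["the cat sat"], ["the cat sit"]))  # 1/3 ~= 0.333
print(corpus_cer(["the cat sat"], ["the cat sit"]))  # 1/9 ~= 0.111
```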
diff --git a/examples/librispeech/s2/conf/chunk_conformer.yaml b/examples/librispeech/s2/conf/chunk_conformer.yaml deleted file mode 100644 index afd2b051..00000000 --- a/examples/librispeech/s2/conf/chunk_conformer.yaml +++ /dev/null @@ -1,122 +0,0 @@ -# https://yaml.org/type/float.html -data: - train_manifest: data/manifest.train - dev_manifest: data/manifest.dev - test_manifest: data/manifest.test - min_input_len: 0.5 - max_input_len: 20.0 - min_output_len: 0.0 - max_output_len: 400.0 - min_output_input_ratio: 0.05 - max_output_input_ratio: 10.0 - -collator: - vocab_filepath: data/vocab.txt - unit_type: 'spm' - spm_model_prefix: 'data/bpe_unigram_5000' - mean_std_filepath: "" - augmentation_config: conf/augmentation.json - batch_size: 16 - raw_wav: True # use raw_wav or kaldi feature - spectrum_type: fbank #linear, mfcc, fbank - feat_dim: 80 - delta_delta: False - dither: 1.0 - target_sample_rate: 16000 - max_freq: None - n_fft: None - stride_ms: 10.0 - window_ms: 25.0 - use_dB_normalization: True - target_dB: -20 - random_seed: 0 - keep_transcription_text: False - sortagrad: True - shuffle_method: batch_shuffle - num_workers: 2 - - -# network architecture -model: - cmvn_file: "data/mean_std.json" - cmvn_file_type: "json" - # encoder related - encoder: conformer - encoder_conf: - output_size: 256 # dimension of attention - attention_heads: 4 - linear_units: 2048 # the number of units of position-wise feed forward - num_blocks: 12 # the number of encoder blocks - dropout_rate: 0.1 - positional_dropout_rate: 0.1 - attention_dropout_rate: 0.0 - input_layer: conv2d # encoder input type, you can chose conv2d, conv2d6 and conv2d8 - normalize_before: True - use_cnn_module: True - cnn_module_kernel: 15 - activation_type: 'swish' - pos_enc_layer_type: 'rel_pos' - selfattention_layer_type: 'rel_selfattn' - causal: True - use_dynamic_chunk: true - cnn_module_norm: 'layer_norm' # using nn.LayerNorm makes model converge faster - use_dynamic_left_chunk: false - - # decoder related - decoder: transformer - decoder_conf: - attention_heads: 4 - linear_units: 2048 - num_blocks: 6 - dropout_rate: 0.1 - positional_dropout_rate: 0.1 - self_attention_dropout_rate: 0.0 - src_attention_dropout_rate: 0.0 - - # hybrid CTC/attention - model_conf: - ctc_weight: 0.3 - ctc_dropoutrate: 0.0 - ctc_grad_norm_type: null - lsm_weight: 0.1 # label smoothing option - length_normalized_loss: false - - -training: - n_epoch: 240 - accum_grad: 8 - global_grad_clip: 5.0 - optim: adam - optim_conf: - lr: 0.001 - weight_decay: 1e-06 - scheduler: warmuplr # pytorch v1.1.0+ required - scheduler_conf: - warmup_steps: 25000 - lr_decay: 1.0 - log_interval: 100 - checkpoint: - kbest_n: 50 - latest_n: 5 - - -decoding: - batch_size: 128 - error_rate_type: wer - decoding_method: attention # 'attention', 'ctc_greedy_search', 'ctc_prefix_beam_search', 'attention_rescoring' - lang_model_path: data/lm/common_crawl_00.prune01111.trie.klm - alpha: 2.5 - beta: 0.3 - beam_size: 10 - cutoff_prob: 1.0 - cutoff_top_n: 0 - num_proc_bsearch: 8 - ctc_weight: 0.5 # ctc weight for attention rescoring decode mode. - decoding_chunk_size: -1 # decoding chunk size. Defaults to -1. - # <0: for decoding, use full chunk. - # >0: for decoding, use fixed chunk size as set. - # 0: used for training, it's prohibited here. - num_decoding_left_chunks: -1 # number of left chunks for decoding. Defaults to -1. - simulate_streaming: true # simulate streaming inference. Defaults to False. 
- - diff --git a/examples/librispeech/s2/conf/chunk_transformer.yaml b/examples/librispeech/s2/conf/chunk_transformer.yaml deleted file mode 100644 index 721bb7d9..00000000 --- a/examples/librispeech/s2/conf/chunk_transformer.yaml +++ /dev/null @@ -1,115 +0,0 @@ -# https://yaml.org/type/float.html -data: - train_manifest: data/manifest.train - dev_manifest: data/manifest.dev - test_manifest: data/manifest.test - min_input_len: 0.5 # second - max_input_len: 20.0 # second - min_output_len: 0.0 # tokens - max_output_len: 400.0 # tokens - min_output_input_ratio: 0.05 - max_output_input_ratio: 10.0 - -collator: - vocab_filepath: data/vocab.txt - unit_type: 'spm' - spm_model_prefix: 'data/bpe_unigram_5000' - mean_std_filepath: "" - augmentation_config: conf/augmentation.json - batch_size: 64 - raw_wav: True # use raw_wav or kaldi feature - spectrum_type: fbank #linear, mfcc, fbank - feat_dim: 80 - delta_delta: False - dither: 1.0 - target_sample_rate: 16000 - max_freq: None - n_fft: None - stride_ms: 10.0 - window_ms: 25.0 - use_dB_normalization: True - target_dB: -20 - random_seed: 0 - keep_transcription_text: False - sortagrad: True - shuffle_method: batch_shuffle - num_workers: 2 - - -# network architecture -model: - cmvn_file: "data/mean_std.json" - cmvn_file_type: "json" - # encoder related - encoder: transformer - encoder_conf: - output_size: 256 # dimension of attention - attention_heads: 4 - linear_units: 2048 # the number of units of position-wise feed forward - num_blocks: 12 # the number of encoder blocks - dropout_rate: 0.1 - positional_dropout_rate: 0.1 - attention_dropout_rate: 0.0 - input_layer: conv2d # encoder input type, you can chose conv2d, conv2d6 and conv2d8 - normalize_before: true - use_dynamic_chunk: true - use_dynamic_left_chunk: false - - # decoder related - decoder: transformer - decoder_conf: - attention_heads: 4 - linear_units: 2048 - num_blocks: 6 - dropout_rate: 0.1 - positional_dropout_rate: 0.1 - self_attention_dropout_rate: 0.0 - src_attention_dropout_rate: 0.0 - - # hybrid CTC/attention - model_conf: - ctc_weight: 0.3 - ctc_dropoutrate: 0.0 - ctc_grad_norm_type: null - lsm_weight: 0.1 # label smoothing option - length_normalized_loss: false - - -training: - n_epoch: 120 - accum_grad: 1 - global_grad_clip: 5.0 - optim: adam - optim_conf: - lr: 0.001 - weight_decay: 1e-06 - scheduler: warmuplr # pytorch v1.1.0+ required - scheduler_conf: - warmup_steps: 25000 - lr_decay: 1.0 - log_interval: 100 - checkpoint: - kbest_n: 50 - latest_n: 5 - - -decoding: - batch_size: 64 - error_rate_type: wer - decoding_method: attention # 'attention', 'ctc_greedy_search', 'ctc_prefix_beam_search', 'attention_rescoring' - lang_model_path: data/lm/common_crawl_00.prune01111.trie.klm - alpha: 2.5 - beta: 0.3 - beam_size: 10 - cutoff_prob: 1.0 - cutoff_top_n: 0 - num_proc_bsearch: 8 - ctc_weight: 0.5 # ctc weight for attention rescoring decode mode. - decoding_chunk_size: -1 # decoding chunk size. Defaults to -1. - # <0: for decoding, use full chunk. - # >0: for decoding, use fixed chunk size as set. - # 0: used for training, it's prohibited here. - num_decoding_left_chunks: -1 # number of left chunks for decoding. Defaults to -1. - simulate_streaming: true # simulate streaming inference. Defaults to False. 
- - diff --git a/examples/librispeech/s2/conf/conformer.yaml b/examples/librispeech/s2/conf/conformer.yaml deleted file mode 100644 index ef87753c..00000000 --- a/examples/librispeech/s2/conf/conformer.yaml +++ /dev/null @@ -1,118 +0,0 @@ -# https://yaml.org/type/float.html -data: - train_manifest: data/manifest.train - dev_manifest: data/manifest.dev - test_manifest: data/manifest.test-clean - min_input_len: 0.5 # seconds - max_input_len: 20.0 # seconds - min_output_len: 0.0 # tokens - max_output_len: 400.0 # tokens - min_output_input_ratio: 0.05 - max_output_input_ratio: 10.0 - -collator: - vocab_filepath: data/vocab.txt - unit_type: 'spm' - spm_model_prefix: 'data/bpe_unigram_5000' - mean_std_filepath: "" - augmentation_config: conf/augmentation.json - batch_size: 16 - raw_wav: True # use raw_wav or kaldi feature - spectrum_type: fbank #linear, mfcc, fbank - feat_dim: 80 - delta_delta: False - dither: 1.0 - target_sample_rate: 16000 - max_freq: None - n_fft: None - stride_ms: 10.0 - window_ms: 25.0 - use_dB_normalization: True - target_dB: -20 - random_seed: 0 - keep_transcription_text: False - sortagrad: True - shuffle_method: batch_shuffle - num_workers: 2 - - -# network architecture -model: - cmvn_file: "data/mean_std.json" - cmvn_file_type: "json" - # encoder related - encoder: conformer - encoder_conf: - output_size: 256 # dimension of attention - attention_heads: 4 - linear_units: 2048 # the number of units of position-wise feed forward - num_blocks: 12 # the number of encoder blocks - dropout_rate: 0.1 - positional_dropout_rate: 0.1 - attention_dropout_rate: 0.0 - input_layer: conv2d # encoder input type, you can chose conv2d, conv2d6 and conv2d8 - normalize_before: True - use_cnn_module: True - cnn_module_kernel: 15 - activation_type: 'swish' - pos_enc_layer_type: 'rel_pos' - selfattention_layer_type: 'rel_selfattn' - - # decoder related - decoder: transformer - decoder_conf: - attention_heads: 4 - linear_units: 2048 - num_blocks: 6 - dropout_rate: 0.1 - positional_dropout_rate: 0.1 - self_attention_dropout_rate: 0.0 - src_attention_dropout_rate: 0.0 - - # hybrid CTC/attention - model_conf: - ctc_weight: 0.3 - ctc_dropoutrate: 0.0 - ctc_grad_norm_type: null - lsm_weight: 0.1 # label smoothing option - length_normalized_loss: false - - -training: - n_epoch: 120 - accum_grad: 8 - global_grad_clip: 3.0 - optim: adam - optim_conf: - lr: 0.004 - weight_decay: 1e-06 - scheduler: warmuplr # pytorch v1.1.0+ required - scheduler_conf: - warmup_steps: 25000 - lr_decay: 1.0 - log_interval: 100 - checkpoint: - kbest_n: 50 - latest_n: 5 - - -decoding: - batch_size: 64 - error_rate_type: wer - decoding_method: attention # 'attention', 'ctc_greedy_search', 'ctc_prefix_beam_search', 'attention_rescoring' - lang_model_path: data/lm/common_crawl_00.prune01111.trie.klm - alpha: 2.5 - beta: 0.3 - beam_size: 10 - cutoff_prob: 1.0 - cutoff_top_n: 0 - num_proc_bsearch: 8 - ctc_weight: 0.5 # ctc weight for attention rescoring decode mode. - decoding_chunk_size: -1 # decoding chunk size. Defaults to -1. - # <0: for decoding, use full chunk. - # >0: for decoding, use fixed chunk size as set. - # 0: used for training, it's prohibited here. - num_decoding_left_chunks: -1 # number of left chunks for decoding. Defaults to -1. - simulate_streaming: False # simulate streaming inference. Defaults to False. 
- - diff --git a/examples/librispeech/s2/conf/decode/decode.yaml b/examples/librispeech/s2/conf/decode/decode.yaml index 4c702db5..98b36d17 100644 --- a/examples/librispeech/s2/conf/decode/decode.yaml +++ b/examples/librispeech/s2/conf/decode/decode.yaml @@ -1,7 +1,7 @@ batchsize: 0 beam-size: 60 -ctc-weight: 0.0 -lm-weight: 0.0 +ctc-weight: 0.4 +lm-weight: 0.6 maxlenratio: 0.0 minlenratio: 0.0 penalty: 0.0 diff --git a/examples/librispeech/s2/conf/decode/decode_att.yaml b/examples/librispeech/s2/conf/decode/decode_att.yaml new file mode 100644 index 00000000..4c702db5 --- /dev/null +++ b/examples/librispeech/s2/conf/decode/decode_att.yaml @@ -0,0 +1,7 @@ +batchsize: 0 +beam-size: 60 +ctc-weight: 0.0 +lm-weight: 0.0 +maxlenratio: 0.0 +minlenratio: 0.0 +penalty: 0.0 diff --git a/examples/librispeech/s2/conf/decode/decode_all.yaml b/examples/librispeech/s2/conf/decode/decode_ctc.yaml similarity index 73% rename from examples/librispeech/s2/conf/decode/decode_all.yaml rename to examples/librispeech/s2/conf/decode/decode_ctc.yaml index 87d5f6d1..867bf611 100644 --- a/examples/librispeech/s2/conf/decode/decode_all.yaml +++ b/examples/librispeech/s2/conf/decode/decode_ctc.yaml @@ -1,7 +1,7 @@ batchsize: 0 beam-size: 60 ctc-weight: 0.4 -lm-weight: 0.6 +lm-weight: 0.0 maxlenratio: 0.0 minlenratio: 0.0 -penalty: 0.0 \ No newline at end of file +penalty: 0.0 diff --git a/examples/librispeech/s2/conf/fbank.conf b/examples/librispeech/s2/conf/fbank.conf new file mode 100644 index 00000000..82ac7bd0 --- /dev/null +++ b/examples/librispeech/s2/conf/fbank.conf @@ -0,0 +1,2 @@ +--sample-frequency=16000 +--num-mel-bins=80 diff --git a/examples/librispeech/s2/conf/lm/transformer.yaml b/examples/librispeech/s2/conf/lm/transformer.yaml new file mode 100644 index 00000000..4349f795 --- /dev/null +++ b/examples/librispeech/s2/conf/lm/transformer.yaml @@ -0,0 +1,13 @@ +model_module: transformer +model: + n_vocab: 5002 + pos_enc: null + embed_unit: 128 + att_unit: 512 + head: 8 + unit: 2048 + layer: 16 + dropout_rate: 0.5 + emb_dropout_rate: 0.0 + att_dropout_rate: 0.0 + tie_weights: False diff --git a/examples/librispeech/s2/conf/pitch.conf b/examples/librispeech/s2/conf/pitch.conf new file mode 100644 index 00000000..e959a19d --- /dev/null +++ b/examples/librispeech/s2/conf/pitch.conf @@ -0,0 +1 @@ +--sample-frequency=16000 diff --git a/examples/librispeech/s2/conf/transformer.yaml b/examples/librispeech/s2/conf/transformer.yaml index c9eed4f9..b2babca7 100644 --- a/examples/librispeech/s2/conf/transformer.yaml +++ b/examples/librispeech/s2/conf/transformer.yaml @@ -5,9 +5,9 @@ data: test_manifest: data/manifest.test-clean collator: - vocab_filepath: data/bpe_unigram_5000_units.txt + vocab_filepath: data/lang_char/train_960_unigram5000_units.txt unit_type: spm - spm_model_prefix: data/bpe_unigram_5000 + spm_model_prefix: data/lang_char/train_960_unigram5000 feat_dim: 83 stride_ms: 10.0 window_ms: 25.0 diff --git a/examples/librispeech/s2/local/data.sh b/examples/librispeech/s2/local/data.sh index 56fec846..b232f35a 100755 --- a/examples/librispeech/s2/local/data.sh +++ b/examples/librispeech/s2/local/data.sh @@ -2,19 +2,42 @@ stage=-1 stop_stage=100 +nj=32 +debugmode=1 +dumpdir=dump # directory to dump full features +N=0 # number of minibatches to be used (mainly for debugging). "0" uses all minibatches. 
+verbose=0 # verbose option +resume= # Resume the training from snapshot + +# feature configuration +do_delta=false + +# Set this to somewhere where you want to put your data, or where +# someone else has already put it. You'll want to change this +# if you're not on the CLSP grid. +datadir=${MAIN_ROOT}/examples/dataset/ # bpemode (unigram or bpe) nbpe=5000 bpemode=unigram -bpeprefix="data/bpe_${bpemode}_${nbpe}" source ${MAIN_ROOT}/utils/parse_options.sh +# Set bash to 'debug' mode, it will exit on : +# -e 'error', -u 'undefined variable', -o ... 'error in pipeline', -x 'print commands', +set -e +set -u +set -o pipefail + +train_set=train_960 +train_sp=train_sp +train_dev=dev +recog_set="test_clean test_other dev_clean dev_other" + mkdir -p data TARGET_DIR=${MAIN_ROOT}/examples/dataset mkdir -p ${TARGET_DIR} - if [ ${stage} -le -1 ] && [ ${stop_stage} -ge -1 ]; then # download data, generate manifests python3 ${TARGET_DIR}/librispeech/librispeech.py \ @@ -46,63 +69,98 @@ if [ ${stage} -le -1 ] && [ ${stop_stage} -ge -1 ]; then fi if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then - # compute mean and stddev for normalizer - num_workers=$(nproc) - python3 ${MAIN_ROOT}/utils/compute_mean_std.py \ - --manifest_path="data/manifest.train.raw" \ - --num_samples=-1 \ - --spectrum_type="fbank" \ - --feat_dim=80 \ - --delta_delta=false \ - --sample_rate=16000 \ - --stride_ms=10.0 \ - --window_ms=25.0 \ - --use_dB_normalization=False \ - --num_workers=${num_workers} \ - --output_path="data/mean_std.json" - - if [ $? -ne 0 ]; then - echo "Compute mean and stddev failed. Terminated." - exit 1 - fi + ### Task dependent. You have to make data the following preparation part by yourself. + ### But you can utilize Kaldi recipes in most cases + echo "stage 0: Data preparation" + for part in dev-clean test-clean dev-other test-other train-clean-100 train-clean-360 train-other-500; do + # use underscore-separated names in data directories. + local/data_prep.sh ${datadir}/librispeech/${part}/LibriSpeech/${part} data/${part//-/_} + done fi +feat_tr_dir=${dumpdir}/${train_set}/delta${do_delta}; mkdir -p ${feat_tr_dir} +feat_sp_dir=${dumpdir}/${train_sp}/delta${do_delta}; mkdir -p ${feat_sp_dir} +feat_dt_dir=${dumpdir}/${train_dev}/delta${do_delta}; mkdir -p ${feat_dt_dir} if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then - # build vocabulary - python3 ${MAIN_ROOT}/utils/build_vocab.py \ - --unit_type "spm" \ - --spm_vocab_size=${nbpe} \ - --spm_mode ${bpemode} \ - --spm_model_prefix ${bpeprefix} \ - --vocab_path="data/vocab.txt" \ - --manifest_paths="data/manifest.train.raw" + ### Task dependent. You have to design training and dev sets by yourself. + ### But you can utilize Kaldi recipes in most cases + echo "stage 1: Feature Generation" + fbankdir=fbank + # Generate the fbank features; by default 80-dimensional fbanks with pitch on each frame + for x in dev_clean test_clean dev_other test_other train_clean_100 train_clean_360 train_other_500; do + steps/make_fbank_pitch.sh --cmd "$train_cmd" --nj ${nj} --write_utt2num_frames true \ + data/${x} exp/make_fbank/${x} ${fbankdir} + utils/fix_data_dir.sh data/${x} + done - if [ $? -ne 0 ]; then - echo "Build vocabulary failed. Terminated." 
- exit 1 - fi + utils/combine_data.sh --extra_files utt2num_frames data/${train_set}_org data/train_clean_100 data/train_clean_360 data/train_other_500 + utils/combine_data.sh --extra_files utt2num_frames data/${train_dev}_org data/dev_clean data/dev_other + utils/perturb_data_dir_speed.sh 0.9 data/${train_set}_org data/temp1 + utils/perturb_data_dir_speed.sh 1.0 data/${train_set}_org data/temp2 + utils/perturb_data_dir_speed.sh 1.1 data/${train_set}_org data/temp3 + + utils/combine_data.sh --extra-files utt2uniq data/${train_sp}_org data/temp1 data/temp2 data/temp3 + + # remove utt having more than 3000 frames + # remove utt having more than 400 characters + remove_longshortdata.sh --maxframes 3000 --maxchars 400 data/${train_set}_org data/${train_set} + remove_longshortdata.sh --maxframes 3000 --maxchars 400 data/${train_sp}_org data/${train_sp} + remove_longshortdata.sh --maxframes 3000 --maxchars 400 data/${train_dev}_org data/${train_dev} + steps/make_fbank_pitch.sh --cmd "$train_cmd" --nj $nj --write_utt2num_frames true \ + data/train_sp exp/make_fbank/train_sp ${fbankdir} + utils/fix_data_dir.sh data/train_sp + # compute global CMVN + compute-cmvn-stats scp:data/${train_sp}/feats.scp data/${train_sp}/cmvn.ark + + # dump features for training + dump.sh --cmd "$train_cmd" --nj ${nj} --do_delta ${do_delta} \ + data/${train_sp}/feats.scp data/${train_sp}/cmvn.ark exp/dump_feats/train ${feat_sp_dir} + dump.sh --cmd "$train_cmd" --nj ${nj} --do_delta ${do_delta} \ + data/${train_dev}/feats.scp data/${train_sp}/cmvn.ark exp/dump_feats/dev ${feat_dt_dir} + for rtask in ${recog_set}; do + feat_recog_dir=${dumpdir}/${rtask}/delta${do_delta}; mkdir -p ${feat_recog_dir} + dump.sh --cmd "$train_cmd" --nj ${nj} --do_delta ${do_delta} \ + data/${rtask}/feats.scp data/${train_sp}/cmvn.ark exp/dump_feats/recog/${rtask} \ + ${feat_recog_dir} + done fi +dict=data/lang_char/${train_set}_${bpemode}${nbpe}_units.txt +bpemodel=data/lang_char/${train_set}_${bpemode}${nbpe} +echo "dictionary: ${dict}" if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then - # format manifest with tokenids, vocab size - for set in train dev test dev-clean dev-other test-clean test-other; do - { - python3 ${MAIN_ROOT}/utils/format_data.py \ - --feat_type "raw" \ - --cmvn_path "data/mean_std.json" \ - --unit_type "spm" \ - --spm_model_prefix ${bpeprefix} \ - --vocab_path="data/vocab.txt" \ - --manifest_path="data/manifest.${set}.raw" \ - --output_path="data/manifest.${set}" - - if [ $? -ne 0 ]; then - echo "Formt mnaifest failed. Terminated." - exit 1 - fi - }& + ### Task dependent. You have to check non-linguistic symbols used in the corpus. 
+ echo "stage 2: Dictionary and Json Data Preparation" + mkdir -p data/lang_char/ + echo " 1" > ${dict} # must be 1, 0 will be used for "blank" in CTC + cut -f 2- -d" " data/${train_set}/text > data/lang_char/input.txt + spm_train --input=data/lang_char/input.txt --vocab_size=${nbpe} --model_type=${bpemode} --model_prefix=${bpemodel} --input_sentence_size=100000000 + spm_encode --model=${bpemodel}.model --output_format=piece < data/lang_char/input.txt | tr ' ' '\n' | sort | uniq | awk '{print $0 " " NR+1}' >> ${dict} + wc -l ${dict} + + # make json labels + data2json.sh --nj ${nj} --feat ${feat_sp_dir}/feats.scp --bpecode ${bpemodel}.model \ + data/${train_sp} ${dict} > ${feat_sp_dir}/data_${bpemode}${nbpe}.json + data2json.sh --nj ${nj} --feat ${feat_dt_dir}/feats.scp --bpecode ${bpemodel}.model \ + data/${train_dev} ${dict} > ${feat_dt_dir}/data_${bpemode}${nbpe}.json + + for rtask in ${recog_set}; do + feat_recog_dir=${dumpdir}/${rtask}/delta${do_delta} + data2json.sh --nj ${nj} --feat ${feat_recog_dir}/feats.scp --bpecode ${bpemodel}.model \ + data/${rtask} ${dict} > ${feat_recog_dir}/data_${bpemode}${nbpe}.json + done +fi + + +if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then + # make json labels + python3 local/espnet_json_to_manifest.py --json-file ${feat_sp_dir}/data_${bpemode}${nbpe}.json --manifest-file data/manifest.train + python3 local/espnet_json_to_manifest.py --json-file ${feat_dt_dir}/data_${bpemode}${nbpe}.json --manifest-file data/manifest.dev + + for rtask in ${recog_set}; do + feat_recog_dir=${dumpdir}/${rtask}/delta${do_delta} + python3 local/espnet_json_to_manifest.py --json-file ${feat_recog_dir}/data_${bpemode}${nbpe}.json --manifest-file data/manifest.${rtask//_/-} done - wait fi echo "LibriSpeech Data preparation done." diff --git a/examples/librispeech/s2/local/data_prep.sh b/examples/librispeech/s2/local/data_prep.sh new file mode 100755 index 00000000..c903d45b --- /dev/null +++ b/examples/librispeech/s2/local/data_prep.sh @@ -0,0 +1,85 @@ +#!/usr/bin/env bash + +# Copyright 2014 Vassil Panayotov +# 2014 Johns Hopkins University (author: Daniel Povey) +# Apache 2.0 + +if [ "$#" -ne 2 ]; then + echo "Usage: $0 " + echo "e.g.: $0 /export/a15/vpanayotov/data/LibriSpeech/dev-clean data/dev-clean" + exit 1 +fi + +src=$1 +dst=$2 + +# all utterances are FLAC compressed +if ! which flac >&/dev/null; then + echo "Please install 'flac' on ALL worker nodes!" + exit 1 +fi + +spk_file=$src/../SPEAKERS.TXT + +mkdir -p $dst || exit 1 + +[ ! -d $src ] && echo "$0: no such directory $src" && exit 1 +[ ! -f $spk_file ] && echo "$0: expected file $spk_file to exist" && exit 1 + + +wav_scp=$dst/wav.scp; [[ -f "$wav_scp" ]] && rm $wav_scp +trans=$dst/text; [[ -f "$trans" ]] && rm $trans +utt2spk=$dst/utt2spk; [[ -f "$utt2spk" ]] && rm $utt2spk +spk2gender=$dst/spk2gender; [[ -f $spk2gender ]] && rm $spk2gender + +for reader_dir in $(find -L $src -mindepth 1 -maxdepth 1 -type d | sort); do + reader=$(basename $reader_dir) + if ! [ $reader -eq $reader ]; then # not integer. + echo "$0: unexpected subdirectory name $reader" + exit 1 + fi + + reader_gender=$(egrep "^$reader[ ]+\|" $spk_file | awk -F'|' '{gsub(/[ ]+/, ""); print tolower($2)}') + if [ "$reader_gender" != 'm' ] && [ "$reader_gender" != 'f' ]; then + echo "Unexpected gender: '$reader_gender'" + exit 1 + fi + + for chapter_dir in $(find -L $reader_dir/ -mindepth 1 -maxdepth 1 -type d | sort); do + chapter=$(basename $chapter_dir) + if ! 
[ "$chapter" -eq "$chapter" ]; then + echo "$0: unexpected chapter-subdirectory name $chapter" + exit 1 + fi + + find -L $chapter_dir/ -iname "*.flac" | sort | xargs -I% basename % .flac | \ + awk -v "dir=$chapter_dir" '{printf "%s flac -c -d -s %s/%s.flac |\n", $0, dir, $0}' >>$wav_scp|| exit 1 + + chapter_trans=$chapter_dir/${reader}-${chapter}.trans.txt + [ ! -f $chapter_trans ] && echo "$0: expected file $chapter_trans to exist" && exit 1 + cat $chapter_trans >>$trans + + # NOTE: For now we are using per-chapter utt2spk. That is each chapter is considered + # to be a different speaker. This is done for simplicity and because we want + # e.g. the CMVN to be calculated per-chapter + awk -v "reader=$reader" -v "chapter=$chapter" '{printf "%s %s-%s\n", $1, reader, chapter}' \ + <$chapter_trans >>$utt2spk || exit 1 + + # reader -> gender map (again using per-chapter granularity) + echo "${reader}-${chapter} $reader_gender" >>$spk2gender + done +done + +spk2utt=$dst/spk2utt +utils/utt2spk_to_spk2utt.pl <$utt2spk >$spk2utt || exit 1 + +ntrans=$(wc -l <$trans) +nutt2spk=$(wc -l <$utt2spk) +! [ "$ntrans" -eq "$nutt2spk" ] && \ + echo "Inconsistent #transcripts($ntrans) and #utt2spk($nutt2spk)" && exit 1 + +utils/validate_data_dir.sh --no-feats $dst || exit 1 + +echo "$0: successfully prepared data in $dst" + +exit 0 diff --git a/examples/librispeech/s2/local/recog.sh b/examples/librispeech/s2/local/recog.sh index df3846c0..e2578ba6 100755 --- a/examples/librispeech/s2/local/recog.sh +++ b/examples/librispeech/s2/local/recog.sh @@ -11,22 +11,24 @@ tag= decode_config=conf/decode/decode.yaml # lm params -lang_model=rnnlm.model.best -lmexpdir=exp/train_rnnlm_pytorch_lm_transformer_cosine_batchsize32_lr1e-4_layer16_unigram5000_ngpu4/ -lmtag='nolm' +rnnlm_config_path=conf/lm/transformer.yaml +lmexpdir=exp/lm +lang_model=rnnlm.pdparams +lmtag='transformer' +train_set=train_960 recog_set="test-clean test-other dev-clean dev-other" recog_set="test-clean" # bpemode (unigram or bpe) nbpe=5000 bpemode=unigram -bpeprefix="data/bpe_${bpemode}_${nbpe}" +bpeprefix=data/lang_char/${train_set}_${bpemode}${nbpe} bpemodel=${bpeprefix}.model # bin params config_path=conf/transformer.yaml -dict=data/bpe_unigram_5000_units.txt +dict=data/lang_char/${train_set}_${bpemode}${nbpe}_units.txt ckpt_prefix= source ${MAIN_ROOT}/utils/parse_options.sh || exit 1; @@ -51,6 +53,9 @@ if [[ ${config_path} =~ ^.*chunk_.*yaml$ ]];then fi echo "chunk mode: ${chunk_mode}" echo "decode conf: ${decode_config}" +echo "lm conf: ${rnnlm_config_path}" +echo "lm model: ${lmexpdir}/${lang_model}" + # download language model #bash local/download_lm_en.sh @@ -59,6 +64,13 @@ echo "decode conf: ${decode_config}" #fi +# download rnnlm +mkdir -p ${lmexpdir} +if [ ! 
-f ${lmexpdir}/${lang_model} ]; then + wget -c -O ${lmexpdir}/${lang_model} https://deepspeech.bj.bcebos.com/transformer_lm/transformerLM.pdparams +fi + + pids=() # initialize pids for dmethd in join_ctc; do @@ -90,9 +102,9 @@ for dmethd in join_ctc; do --recog-json ${feat_recog_dir}/split${nj}/JOB/manifest.${rtask} \ --result-label ${decode_dir}/data.JOB.json \ --model-conf ${config_path} \ - --model ${ckpt_prefix}.pdparams - - #--rnnlm ${lmexpdir}/${lang_model} \ + --model ${ckpt_prefix}.pdparams \ + --rnnlm-conf ${rnnlm_config_path} \ + --rnnlm ${lmexpdir}/${lang_model} score_sclite.sh --bpe ${nbpe} --bpemodel ${bpemodel} --wer false ${decode_dir} ${dict} diff --git a/examples/librispeech/s2/local/test.sh b/examples/librispeech/s2/local/test.sh index 5f662d29..23670f74 100755 --- a/examples/librispeech/s2/local/test.sh +++ b/examples/librispeech/s2/local/test.sh @@ -8,17 +8,18 @@ nj=32 lmtag='nolm' +train_set=train_960 recog_set="test-clean test-other dev-clean dev-other" recog_set="test-clean" # bpemode (unigram or bpe) nbpe=5000 bpemode=unigram -bpeprefix="data/bpe_${bpemode}_${nbpe}" +bpeprefix=data/lang_char/${train_set}_${bpemode}${nbpe} bpemodel=${bpeprefix}.model config_path=conf/transformer.yaml -dict=data/bpe_unigram_5000_units.txt +dict=data/lang_char/${train_set}_${bpemode}${nbpe}_units.txt ckpt_prefix= source ${MAIN_ROOT}/utils/parse_options.sh || exit 1; diff --git a/examples/librispeech/s2/path.sh b/examples/librispeech/s2/path.sh index eec437b6..32ff28c1 100644 --- a/examples/librispeech/s2/path.sh +++ b/examples/librispeech/s2/path.sh @@ -1,6 +1,6 @@ export MAIN_ROOT=`realpath ${PWD}/../../../` -export PATH=${MAIN_ROOT}:${MAIN_ROOT}/tools/sctk/bin:${PWD}/utils:${PATH} +export PATH=${MAIN_ROOT}:${MAIN_ROOT}/tools/sctk/bin:${MAIN_ROOT}/utils:${PWD}/utils:${PATH} export LC_ALL=C export PYTHONDONTWRITEBYTECODE=1 @@ -13,3 +13,16 @@ export LD_LIBRARY_PATH=${LD_LIBRARY_PATH}:/usr/local/lib/ MODEL=u2_kaldi export BIN_DIR=${MAIN_ROOT}/deepspeech/exps/${MODEL}/bin + +# srilm +export LIBLBFGS=${MAIN_ROOT}/tools/liblbfgs-1.10 +export LD_LIBRARY_PATH=${LD_LIBRARY_PATH:-}:${LIBLBFGS}/lib/.libs +export SRILM=${MAIN_ROOT}/tools/srilm +export PATH=${PATH}:${SRILM}/bin:${SRILM}/bin/i686-m64 + +# Kaldi +export KALDI_ROOT=${MAIN_ROOT}/tools/kaldi +[ -f $KALDI_ROOT/tools/env.sh ] && . $KALDI_ROOT/tools/env.sh +export PATH=$PWD/utils/:$KALDI_ROOT/tools/openfst/bin:$PWD:$PATH +[ ! -f $KALDI_ROOT/tools/config/common_path.sh ] && echo >&2 "The standard file $KALDI_ROOT/tools/config/common_path.sh is not present, can not using Kaldi!" +[ -f $KALDI_ROOT/tools/config/common_path.sh ] && . 
$KALDI_ROOT/tools/config/common_path.sh \ No newline at end of file diff --git a/examples/librispeech/s2/run.sh b/examples/librispeech/s2/run.sh index 3c7569fb..146f133d 100755 --- a/examples/librispeech/s2/run.sh +++ b/examples/librispeech/s2/run.sh @@ -33,16 +33,21 @@ if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then fi if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then - # test ckpt avg_n + # attetion resocre decoder ./local/test.sh ${conf_path} ${dict_path} exp/${ckpt}/checkpoints/${avg_ckpt} || exit -1 fi if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then + # join ctc decoder, use transformerlm to score + ./local/recog.sh --ckpt_prefix exp/${ckpt}/checkpoints/${avg_ckpt} +fi + +if [ ${stage} -le 5 ] && [ ${stop_stage} -ge 5 ]; then # ctc alignment of test data CUDA_VISIBLE_DEVICES=0 ./local/align.sh ${conf_path} ${dict_path} exp/${ckpt}/checkpoints/${avg_ckpt} || exit -1 fi -if [ ${stage} -le 5 ] && [ ${stop_stage} -ge 5 ]; then +if [ ${stage} -le 6 ] && [ ${stop_stage} -ge 6 ]; then # export ckpt avg_n CUDA_VISIBLE_DEVICES= ./local/export.sh ${conf_path} exp/${ckpt}/checkpoints/${avg_ckpt} exp/${ckpt}/checkpoints/${avg_ckpt}.jit fi diff --git a/examples/librispeech/s2/steps b/examples/librispeech/s2/steps new file mode 120000 index 00000000..995eeccb --- /dev/null +++ b/examples/librispeech/s2/steps @@ -0,0 +1 @@ +../../../tools/kaldi/egs/wsj/s5/steps/ \ No newline at end of file diff --git a/examples/librispeech/s2/utils b/examples/librispeech/s2/utils index 256f914a..f49247da 120000 --- a/examples/librispeech/s2/utils +++ b/examples/librispeech/s2/utils @@ -1 +1 @@ -../../../utils/ \ No newline at end of file +../../../tools/kaldi/egs/wsj/s5/utils \ No newline at end of file diff --git a/examples/ljspeech/tts0/README.md b/examples/ljspeech/tts0/README.md index e95f6614..e8e3ebff 100644 --- a/examples/ljspeech/tts0/README.md +++ b/examples/ljspeech/tts0/README.md @@ -77,7 +77,7 @@ optional arguments: config, passing in KEY VALUE pairs -v, --verbose print msg ``` -**Ps.** You can use [waveflow](https://github.com/PaddlePaddle/Parakeet/tree/develop/examples/waveflow) as the neural vocoder to synthesize mels to wavs. (Please refer to `synthesize.sh` in our LJSpeech waveflow example) +**Ps.** You can use [waveflow](https://github.com/PaddlePaddle/DeepSpeech/tree/develop/examples/ljspeech/voc0) as the neural vocoder to synthesize mels to wavs. (Please refer to `synthesize.sh` in our LJSpeech waveflow example) ## Pretrained Models Pretrained Models can be downloaded from links below. We provide 2 models with different configurations. diff --git a/examples/ljspeech/tts1/README.md b/examples/ljspeech/tts1/README.md index 097dc08c..0385fdce 100644 --- a/examples/ljspeech/tts1/README.md +++ b/examples/ljspeech/tts1/README.md @@ -81,7 +81,7 @@ optional arguments: 6. `--phones-dict` is the path of the phone vocabulary file. ## Synthesize -We use [waveflow](https://github.com/PaddlePaddle/Parakeet/tree/develop/examples/waveflow) as the neural vocoder. +We use [waveflow](https://github.com/PaddlePaddle/DeepSpeech/tree/develop/examples/ljspeech/voc0) as the neural vocoder. Download Pretrained WaveFlow Model with residual channel equals 128 from [waveflow_ljspeech_ckpt_0.3.zip](https://paddlespeech.bj.bcebos.com/Parakeet/waveflow_ljspeech_ckpt_0.3.zip) and unzip it. 
```bash unzip waveflow_ljspeech_ckpt_0.3.zip diff --git a/examples/ljspeech/tts3/README.md b/examples/ljspeech/tts3/README.md index f5bea6a9..dc711ce8 100644 --- a/examples/ljspeech/tts3/README.md +++ b/examples/ljspeech/tts3/README.md @@ -7,7 +7,7 @@ Download LJSpeech-1.1 from the [official website](https://keithito.com/LJ-Speech ### Get MFA result of LJSpeech-1.1 and Extract it We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get durations for fastspeech2. -You can download from here [ljspeech_alignment.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/LJSpeech-1.1/ljspeech_alignment.tar.gz), or train your own MFA model reference to [use_mfa example](https://github.com/PaddlePaddle/Parakeet/tree/develop/examples/use_mfa) of our repo. +You can download from here [ljspeech_alignment.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/LJSpeech-1.1/ljspeech_alignment.tar.gz), or train your own MFA model reference to [use_mfa example](https://github.com/PaddlePaddle/DeepSpeech/tree/develop/examples/other/use_mfa) of our repo. ## Get Started Assume the path to the dataset is `~/datasets/LJSpeech-1.1`. @@ -88,7 +88,7 @@ optional arguments: 6. `--phones-dict` is the path of the phone vocabulary file. ### Synthesize -We use [parallel wavegan](https://github.com/PaddlePaddle/Parakeet/tree/develop/examples/parallelwave_gan/ljspeech/) as the neural vocoder. +We use [parallel wavegan](https://github.com/PaddlePaddle/DeepSpeech/tree/develop/examples/ljspeech/voc1) as the neural vocoder. Download pretrained parallel wavegan model from [pwg_ljspeech_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/pwg_ljspeech_ckpt_0.5.zip) and unzip it. ```bash unzip pwg_ljspeech_ckpt_0.5.zip diff --git a/examples/ljspeech/voc1/README.md b/examples/ljspeech/voc1/README.md index 995b4c7c..ba6eb002 100644 --- a/examples/ljspeech/voc1/README.md +++ b/examples/ljspeech/voc1/README.md @@ -5,7 +5,7 @@ This example contains code used to train a [parallel wavegan](http://arxiv.org/a Download LJSpeech-1.1 from the [official website](https://keithito.com/LJ-Speech-Dataset/). ### Get MFA results for silence trim We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) results to cut silence in the edge of audio. -You can download from here [ljspeech_alignment.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/LJSpeech-1.1/ljspeech_alignment.tar.gz), or train your own MFA model reference to [use_mfa example](https://github.com/PaddlePaddle/Parakeet/tree/develop/examples/use_mfa) of our repo. +You can download from here [ljspeech_alignment.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/LJSpeech-1.1/ljspeech_alignment.tar.gz), or train your own MFA model reference to [use_mfa example](https://github.com/PaddlePaddle/DeepSpeech/tree/develop/examples/other/use_mfa) of our repo. ## Get Started Assume the path to the dataset is `~/datasets/LJSpeech-1.1`. diff --git a/examples/other/ge2e/README.md b/examples/other/ge2e/README.md index 89365d63..1fa9677a 100644 --- a/examples/other/ge2e/README.md +++ b/examples/other/ge2e/README.md @@ -1,5 +1,5 @@ # Speaker Encoder -This experiment trains a speaker encoder with speaker verification as its task. It is done as a part of the experiment of transfer learning from speaker verification to multispeaker text-to-speech synthesis, which can be found at [tacotron2_aishell3](../tacotron2_shell3). The trained speaker encoder is used to extract utterance embeddings from utterances. 
+This experiment trains a speaker encoder with speaker verification as its task. It is done as a part of the experiment of transfer learning from speaker verification to multispeaker text-to-speech synthesis, which can be found at [examples/aishell3/vc0](https://github.com/PaddlePaddle/DeepSpeech/tree/develop/examples/aishell3/vc0). The trained speaker encoder is used to extract utterance embeddings from utterances. ## Model The model used in this experiment is the speaker encoder with text independent speaker verification task in [GENERALIZED END-TO-END LOSS FOR SPEAKER VERIFICATION](https://arxiv.org/pdf/1710.10467.pdf). GE2E-softmax loss is used. diff --git a/examples/tiny/s0/README.md b/examples/tiny/s0/README.md index 0f96864c..7dc16dc3 100644 --- a/examples/tiny/s0/README.md +++ b/examples/tiny/s0/README.md @@ -23,12 +23,6 @@ - Case inference with an existing model - ```bash - bash local/infer.sh - ``` - - `infer.sh` will show us some speech-to-text decoding results for several (default: 10) samples with the trained model. The performance might not be good now as the current model is only trained with a toy subset of LibriSpeech. To see the results with a better model, you can download a well-trained (trained for several days, with the complete LibriSpeech) model and do the inference. - - Evaluate an existing model ```bash @@ -44,8 +38,3 @@ bash local/export.sh ckpt_path saved_jit_model_path ``` -- Tune hyper paerameter - - ```bash - bash local/tune.sh - ``` diff --git a/examples/vctk/tts3/README.md b/examples/vctk/tts3/README.md index 2a79cdd6..717ee7ac 100644 --- a/examples/vctk/tts3/README.md +++ b/examples/vctk/tts3/README.md @@ -7,8 +7,8 @@ Download VCTK-0.92 from the [official website](https://datashare.ed.ac.uk/handle ### Get MFA result of VCTK and Extract it We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get durations for fastspeech2. -You can download from here [vctk_alignment.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/VCTK-Corpus-0.92/vctk_alignment.tar.gz), or train your own MFA model reference to [use_mfa example](https://github.com/PaddlePaddle/Parakeet/tree/develop/examples/use_mfa) of our repo. -ps: we remove three speakers in VCTK-0.92 (see [reorganize_vctk.py](https://github.com/PaddlePaddle/Parakeet/tree/develop/examples/use_mfa/local/reorganize_vctk.py)): +You can download from here [vctk_alignment.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/VCTK-Corpus-0.92/vctk_alignment.tar.gz), or train your own MFA model reference to [use_mfa example](https://github.com/PaddlePaddle/DeepSpeech/tree/develop/examples/other/use_mfa) of our repo. +ps: we remove three speakers in VCTK-0.92 (see [reorganize_vctk.py](https://github.com/PaddlePaddle/DeepSpeech/blob/develop/examples/other/use_mfa/local/reorganize_vctk.py)): 1. `p315`, because no txt for it. 2. `p280` and `p362`, because no *_mic2.flac (which is better than *_mic1.flac) for them. @@ -91,7 +91,7 @@ optional arguments: 6. `--phones-dict` is the path of the phone vocabulary file. ### Synthesize -We use [parallel wavegan](https://github.com/PaddlePaddle/Parakeet/tree/develop/examples/parallelwave_gan/baker) as the neural vocoder. +We use [parallel wavegan](https://github.com/PaddlePaddle/DeepSpeech/tree/develop/examples/vctk/voc1) as the neural vocoder. Download pretrained parallel wavegan model from [pwg_vctk_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/pwg_vctk_ckpt_0.5.zip)and unzip it. 
```bash diff --git a/examples/vctk/voc1/README.md b/examples/vctk/voc1/README.md index b74b9d4a..cbfff32d 100644 --- a/examples/vctk/voc1/README.md +++ b/examples/vctk/voc1/README.md @@ -7,8 +7,8 @@ Download VCTK-0.92 from the [official website](https://datashare.ed.ac.uk/handl ### Get MFA results for silence trim We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) results to cut silence in the edge of audio. -You can download from here [vctk_alignment.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/VCTK-Corpus-0.92/vctk_alignment.tar.gz), or train your own MFA model reference to [use_mfa example](https://github.com/PaddlePaddle/Parakeet/tree/develop/examples/use_mfa) of our repo. -ps: we remove three speakers in VCTK-0.92 (see [reorganize_vctk.py](https://github.com/PaddlePaddle/Parakeet/tree/develop/examples/use_mfa/local/reorganize_vctk.py)): +You can download from here [vctk_alignment.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/VCTK-Corpus-0.92/vctk_alignment.tar.gz), or train your own MFA model reference to [use_mfa example](https://github.com/PaddlePaddle/DeepSpeech/tree/develop/examples/other/use_mfa) of our repo. +ps: we remove three speakers in VCTK-0.92 (see [reorganize_vctk.py](https://github.com/PaddlePaddle/DeepSpeech/blob/develop/examples/other/use_mfa/local/reorganize_vctk.py)): 1. `p315`, because no txt for it. 2. `p280` and `p362`, because no *_mic2.flac (which is better than *_mic1.flac) for them. diff --git a/paddleaudio/.gitignore b/paddleaudio/.gitignore new file mode 100644 index 00000000..e649619e --- /dev/null +++ b/paddleaudio/.gitignore @@ -0,0 +1,7 @@ +.ipynb_checkpoints/** +*.ipynb +nohup.out +__pycache__/ +*.wav +*.m4a +obsolete/** diff --git a/paddleaudio/.pre-commit-config.yaml b/paddleaudio/.pre-commit-config.yaml new file mode 100644 index 00000000..4100f348 --- /dev/null +++ b/paddleaudio/.pre-commit-config.yaml @@ -0,0 +1,45 @@ +repos: +- repo: local + hooks: + - id: yapf + name: yapf + entry: yapf + language: system + args: [-i, --style .style.yapf] + files: \.py$ + +- repo: https://github.com/pre-commit/pre-commit-hooks + rev: a11d9314b22d8f8c7556443875b731ef05965464 + hooks: + - id: check-merge-conflict + - id: check-symlinks + - id: end-of-file-fixer + - id: trailing-whitespace + - id: detect-private-key + - id: check-symlinks + - id: check-added-large-files + +- repo: https://github.com/pycqa/isort + rev: 5.8.0 + hooks: + - id: isort + name: isort (python) + - id: isort + name: isort (cython) + types: [cython] + - id: isort + name: isort (pyi) + types: [pyi] + +- repo: local + hooks: + - id: flake8 + name: flake8 + entry: flake8 + language: system + args: + - --count + - --select=E9,F63,F7,F82 + - --show-source + - --statistics + files: \.py$ diff --git a/paddleaudio/.style.yapf b/paddleaudio/.style.yapf new file mode 100644 index 00000000..4741fb4f --- /dev/null +++ b/paddleaudio/.style.yapf @@ -0,0 +1,3 @@ +[style] +based_on_style = pep8 +column_limit = 80 diff --git a/paddleaudio/LICENSE b/paddleaudio/LICENSE new file mode 100644 index 00000000..261eeb9e --- /dev/null +++ b/paddleaudio/LICENSE @@ -0,0 +1,201 @@ + Apache License + Version 2.0, January 2004 + http://www.apache.org/licenses/ + + TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION + + 1. Definitions. + + "License" shall mean the terms and conditions for use, reproduction, + and distribution as defined by Sections 1 through 9 of this document. 
+ + "Licensor" shall mean the copyright owner or entity authorized by + the copyright owner that is granting the License. + + "Legal Entity" shall mean the union of the acting entity and all + other entities that control, are controlled by, or are under common + control with that entity. For the purposes of this definition, + "control" means (i) the power, direct or indirect, to cause the + direction or management of such entity, whether by contract or + otherwise, or (ii) ownership of fifty percent (50%) or more of the + outstanding shares, or (iii) beneficial ownership of such entity. + + "You" (or "Your") shall mean an individual or Legal Entity + exercising permissions granted by this License. + + "Source" form shall mean the preferred form for making modifications, + including but not limited to software source code, documentation + source, and configuration files. + + "Object" form shall mean any form resulting from mechanical + transformation or translation of a Source form, including but + not limited to compiled object code, generated documentation, + and conversions to other media types. + + "Work" shall mean the work of authorship, whether in Source or + Object form, made available under the License, as indicated by a + copyright notice that is included in or attached to the work + (an example is provided in the Appendix below). + + "Derivative Works" shall mean any work, whether in Source or Object + form, that is based on (or derived from) the Work and for which the + editorial revisions, annotations, elaborations, or other modifications + represent, as a whole, an original work of authorship. For the purposes + of this License, Derivative Works shall not include works that remain + separable from, or merely link (or bind by name) to the interfaces of, + the Work and Derivative Works thereof. + + "Contribution" shall mean any work of authorship, including + the original version of the Work and any modifications or additions + to that Work or Derivative Works thereof, that is intentionally + submitted to Licensor for inclusion in the Work by the copyright owner + or by an individual or Legal Entity authorized to submit on behalf of + the copyright owner. For the purposes of this definition, "submitted" + means any form of electronic, verbal, or written communication sent + to the Licensor or its representatives, including but not limited to + communication on electronic mailing lists, source code control systems, + and issue tracking systems that are managed by, or on behalf of, the + Licensor for the purpose of discussing and improving the Work, but + excluding communication that is conspicuously marked or otherwise + designated in writing by the copyright owner as "Not a Contribution." + + "Contributor" shall mean Licensor and any individual or Legal Entity + on behalf of whom a Contribution has been received by Licensor and + subsequently incorporated within the Work. + + 2. Grant of Copyright License. Subject to the terms and conditions of + this License, each Contributor hereby grants to You a perpetual, + worldwide, non-exclusive, no-charge, royalty-free, irrevocable + copyright license to reproduce, prepare Derivative Works of, + publicly display, publicly perform, sublicense, and distribute the + Work and such Derivative Works in Source or Object form. + + 3. Grant of Patent License. 
Subject to the terms and conditions of + this License, each Contributor hereby grants to You a perpetual, + worldwide, non-exclusive, no-charge, royalty-free, irrevocable + (except as stated in this section) patent license to make, have made, + use, offer to sell, sell, import, and otherwise transfer the Work, + where such license applies only to those patent claims licensable + by such Contributor that are necessarily infringed by their + Contribution(s) alone or by combination of their Contribution(s) + with the Work to which such Contribution(s) was submitted. If You + institute patent litigation against any entity (including a + cross-claim or counterclaim in a lawsuit) alleging that the Work + or a Contribution incorporated within the Work constitutes direct + or contributory patent infringement, then any patent licenses + granted to You under this License for that Work shall terminate + as of the date such litigation is filed. + + 4. Redistribution. You may reproduce and distribute copies of the + Work or Derivative Works thereof in any medium, with or without + modifications, and in Source or Object form, provided that You + meet the following conditions: + + (a) You must give any other recipients of the Work or + Derivative Works a copy of this License; and + + (b) You must cause any modified files to carry prominent notices + stating that You changed the files; and + + (c) You must retain, in the Source form of any Derivative Works + that You distribute, all copyright, patent, trademark, and + attribution notices from the Source form of the Work, + excluding those notices that do not pertain to any part of + the Derivative Works; and + + (d) If the Work includes a "NOTICE" text file as part of its + distribution, then any Derivative Works that You distribute must + include a readable copy of the attribution notices contained + within such NOTICE file, excluding those notices that do not + pertain to any part of the Derivative Works, in at least one + of the following places: within a NOTICE text file distributed + as part of the Derivative Works; within the Source form or + documentation, if provided along with the Derivative Works; or, + within a display generated by the Derivative Works, if and + wherever such third-party notices normally appear. The contents + of the NOTICE file are for informational purposes only and + do not modify the License. You may add Your own attribution + notices within Derivative Works that You distribute, alongside + or as an addendum to the NOTICE text from the Work, provided + that such additional attribution notices cannot be construed + as modifying the License. + + You may add Your own copyright statement to Your modifications and + may provide additional or different license terms and conditions + for use, reproduction, or distribution of Your modifications, or + for any such Derivative Works as a whole, provided Your use, + reproduction, and distribution of the Work otherwise complies with + the conditions stated in this License. + + 5. Submission of Contributions. Unless You explicitly state otherwise, + any Contribution intentionally submitted for inclusion in the Work + by You to the Licensor shall be under the terms and conditions of + this License, without any additional terms or conditions. + Notwithstanding the above, nothing herein shall supersede or modify + the terms of any separate license agreement you may have executed + with Licensor regarding such Contributions. + + 6. Trademarks. 
This License does not grant permission to use the trade + names, trademarks, service marks, or product names of the Licensor, + except as required for reasonable and customary use in describing the + origin of the Work and reproducing the content of the NOTICE file. + + 7. Disclaimer of Warranty. Unless required by applicable law or + agreed to in writing, Licensor provides the Work (and each + Contributor provides its Contributions) on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or + implied, including, without limitation, any warranties or conditions + of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A + PARTICULAR PURPOSE. You are solely responsible for determining the + appropriateness of using or redistributing the Work and assume any + risks associated with Your exercise of permissions under this License. + + 8. Limitation of Liability. In no event and under no legal theory, + whether in tort (including negligence), contract, or otherwise, + unless required by applicable law (such as deliberate and grossly + negligent acts) or agreed to in writing, shall any Contributor be + liable to You for damages, including any direct, indirect, special, + incidental, or consequential damages of any character arising as a + result of this License or out of the use or inability to use the + Work (including but not limited to damages for loss of goodwill, + work stoppage, computer failure or malfunction, or any and all + other commercial damages or losses), even if such Contributor + has been advised of the possibility of such damages. + + 9. Accepting Warranty or Additional Liability. While redistributing + the Work or Derivative Works thereof, You may choose to offer, + and charge a fee for, acceptance of support, warranty, indemnity, + or other liability obligations and/or rights consistent with this + License. However, in accepting such obligations, You may act only + on Your own behalf and on Your sole responsibility, not on behalf + of any other Contributor, and only if You agree to indemnify, + defend, and hold each Contributor harmless for any liability + incurred by, or claims asserted against, such Contributor by reason + of your accepting any such warranty or additional liability. + + END OF TERMS AND CONDITIONS + + APPENDIX: How to apply the Apache License to your work. + + To apply the Apache License to your work, attach the following + boilerplate notice, with the fields enclosed by brackets "[]" + replaced with your own identifying information. (Don't include + the brackets!) The text should be enclosed in the appropriate + comment syntax for the file format. We also recommend that a + file or class name and description of purpose be included on the + same "printed page" as the copyright notice for easier + identification within third-party archives. + + Copyright [yyyy] [name of copyright owner] + + Licensed under the Apache License, Version 2.0 (the "License"); + you may not use this file except in compliance with the License. + You may obtain a copy of the License at + + http://www.apache.org/licenses/LICENSE-2.0 + + Unless required by applicable law or agreed to in writing, software + distributed under the License is distributed on an "AS IS" BASIS, + WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + See the License for the specific language governing permissions and + limitations under the License. 
diff --git a/paddleaudio/README.md b/paddleaudio/README.md new file mode 100644 index 00000000..9607fd86 --- /dev/null +++ b/paddleaudio/README.md @@ -0,0 +1,37 @@ +# PaddleAudio: The audio library for PaddlePaddle + +## Introduction +PaddleAudio is the audio toolkit to speed up your audio research and development loop in PaddlePaddle. It currently provides a collection of audio datasets, feature-extraction functions, audio transforms,state-of-the-art pre-trained models in sound tagging/classification and anomaly sound detection. More models and features are on the roadmap. + + + +## Features +- Spectrogram and related features are compatible with librosa. +- State-of-the-art models in sound tagging on Audioset, sound classification on esc50, and more to come. +- Ready-to-use audio embedding with a line of code, includes sound embedding and more on the roadmap. +- Data loading supports for common open source audio in multiple languages including English, Mandarin and so on. + + +## Install +``` +git clone https://github.com/PaddlePaddle/models +cd models/PaddleAudio +pip install . + +``` + +## Quick start +### Audio loading and feature extraction +``` +import paddleaudio as pa +s,r = pa.load(f) +mel_spect = pa.melspectrogram(s,sr=r) +``` + +### Examples +We provide a set of examples to help you get started in using PaddleAudio quickly. +- [PANNs: acoustic scene and events analysis using pre-trained models](./examples/panns) +- [Environmental Sound classification on ESC-50 dataset](./examples/sound_classification) +- [Training a audio-tagging network on Audioset](./examples/audioset_training) + +Please refer to [example directory](./examples) for more details. diff --git a/paddleaudio/examples/panns/README.md b/paddleaudio/examples/panns/README.md new file mode 100644 index 00000000..243ebf8e --- /dev/null +++ b/paddleaudio/examples/panns/README.md @@ -0,0 +1,128 @@ +# Audio Tagging + +声音分类的任务是单标签的分类任务,但是对于一段音频来说,它可以是多标签的。譬如在一般的室内办公环境进行录音,这段音频里可能包含人们说话的声音、键盘敲打的声音、鼠标点击的声音,还有室内的一些其他背景声音。对于通用的声音识别和声音检测场景而言,对一段音频预测多个标签是具有很强的实用性的。 + +在IEEE ICASSP 2017 大会上,谷歌开放了一个大规模的音频数据集[Audioset](https://research.google.com/audioset/)。该数据集包含了 632 类的音频类别以及 2,084,320 条人工标记的每段 10 秒长度的声音剪辑片段(来源于YouTube视频)。目前该数据集已经有210万个已标注的视频数据,5800小时的音频数据,经过标记的声音样本的标签类别为527。 + +`PANNs`([PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition](https://arxiv.org/pdf/1912.10211.pdf))是基于Audioset数据集训练的声音分类/识别的模型。其预训练的任务是多标签的声音识别,因此可用于声音的实时tagging。 + +本示例采用`PANNs`预训练模型,基于Audioset的标签类别对输入音频实时tagging,并最终以文本形式输出对应时刻的top k类别和对应的得分。 + + +## 模型简介 + +PaddleAudio提供了PANNs的CNN14、CNN10和CNN6的预训练模型,可供用户选择使用: +- CNN14: 该模型主要包含12个卷积层和2个全连接层,模型参数的数量为79.6M,embbedding维度是2048。 +- CNN10: 该模型主要包含8个卷积层和2个全连接层,模型参数的数量为4.9M,embbedding维度是512。 +- CNN6: 该模型主要包含4个卷积层和2个全连接层,模型参数的数量为4.5M,embbedding维度是512。 + + +## 快速开始 + +### 模型预测 + +```shell +export CUDA_VISIBLE_DEVICES=0 +python audio_tag.py --device gpu --wav ./cat.wav --sample_duration 2 --hop_duration 0.3 --output_dir ./output_dir +``` + +可支持配置的参数: + +- `device`: 选用什么设备进行训练,可选cpu或gpu,默认为gpu。如使用gpu训练则参数gpus指定GPU卡号。 +- `wav`: 指定预测的音频文件。 +- `sample_duration`: 模型每次预测的音频时间长度,单位为秒,默认为2s。 +- `hop_duration`: 每两个预测音频的时间间隔,单位为秒,默认为0.3s。 +- `output_dir`: 模型预测结果存放的路径,默认为`./output_dir`。 + +示例代码中使用的预训练模型为`CNN14`,如果想更换为其他预训练模型,可通过以下方式执行: +```python +from paddleaudio.models.panns import cnn14, cnn10, cnn6 + +# CNN14 +model = cnn14(pretrained=True, extract_embedding=False) +# CNN10 +model = cnn10(pretrained=True, extract_embedding=False) +# CNN6 +model = cnn6(pretrained=True, extract_embedding=False) 
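+
+# A minimal single-window tagging sketch with the model selected above (an
+# illustrative example only, assuming a local `./cat.wav`; `audio_tag.py`
+# below does the same thing over sliding windows):
+import paddle
+from paddleaudio.backends import load as load_audio
+from paddleaudio.features import melspectrogram
+
+waveform, sr = load_audio('./cat.wav', sr=None)
+feat = melspectrogram(waveform, sr).transpose()          # (num_frames, num_melbins)
+feat = paddle.to_tensor(feat).unsqueeze(0).unsqueeze(1)  # (1, 1, num_frames, num_melbins)
+model.eval()
+scores = model(feat)                                     # Audioset scores for 527 classes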
+``` + +执行结果: +``` +[2021-04-30 19:15:41,025] [ INFO] - Saved tagging results to ./output_dir/audioset_tagging_sr_44100.npz +``` + +执行后得分结果保存在`output_dir`的`.npz`文件中。 + + +### 生成tagging标签文本 +```shell +python parse_result.py --tagging_file ./output_dir/audioset_tagging_sr_44100.npz --top_k 10 --smooth True --smooth_size 5 --label_file ./assets/audioset_labels.txt --output_dir ./output_dir +``` + +可支持配置的参数: + +- `tagging_file`: 模型预测结果文件。 +- `top_k`: 获取预测结果中,得分最高的前top_k个标签,默认为10。 +- `smooth`: 预测结果的后验概率平滑,默认为True,表示应用平滑。 +- `smooth_size`: 平滑计算过程中的样本数量,默认为5。 +- `label_file`: 模型预测结果对应的Audioset类别的文本文件。 +- `output_dir`: 标签文本存放的路径,默认为`./output_dir`。 + +执行结果: +``` +[2021-04-30 19:26:58,743] [ INFO] - Posterior smoothing... +[2021-04-30 19:26:58,746] [ INFO] - Saved tagging labels to ./output_dir/audioset_tagging_sr_44100.txt +``` + +执行后文本结果保存在`output_dir`的`.txt`文件中。 + + +## Tagging标签文本 + +最终输出的文本结果如下所示。 +样本每个时间范围的top k结果用空行分隔。在每一个结果中,第一行是时间信息,数字表示tagging结果在时间起点信息,比例值代表当前时刻`t`与音频总长度`T`的比值;紧接的k行是对应的标签和得分。 + +``` +0.0 +Cat: 0.9144676923751831 +Animal: 0.8855036497116089 +Domestic animals, pets: 0.804577112197876 +Meow: 0.7422927021980286 +Music: 0.19959309697151184 +Inside, small room: 0.12550437450408936 +Caterwaul: 0.021584441885352135 +Purr: 0.020247288048267365 +Speech: 0.018197158351540565 +Vehicle: 0.007446660194545984 + +0.059197544398158296 +Cat: 0.9250872135162354 +Animal: 0.8957151174545288 +Domestic animals, pets: 0.8228275775909424 +Meow: 0.7650775909423828 +Music: 0.20210561156272888 +Inside, small room: 0.12290887534618378 +Caterwaul: 0.029371455311775208 +Purr: 0.018731823191046715 +Speech: 0.017130598425865173 +Vehicle: 0.007748497650027275 + +0.11839508879631659 +Cat: 0.9336574673652649 +Animal: 0.9111202359199524 +Domestic animals, pets: 0.8349071145057678 +Meow: 0.7761964797973633 +Music: 0.20467285811901093 +Inside, small room: 0.10709915310144424 +Caterwaul: 0.05370649695396423 +Purr: 0.018830426037311554 +Speech: 0.017361722886562347 +Vehicle: 0.006929398979991674 + +... +... 
+``` + +以下[Demo](https://bj.bcebos.com/paddleaudio/media/audio_tagging_demo.mp4)展示了一个将tagging标签输出到视频的例子,可以实时地对音频进行多标签预测。 + +![](https://bj.bcebos.com/paddleaudio/media/audio_tagging_demo.gif) diff --git a/paddleaudio/examples/panns/assets/audioset_labels.txt b/paddleaudio/examples/panns/assets/audioset_labels.txt new file mode 100644 index 00000000..6fccf56a --- /dev/null +++ b/paddleaudio/examples/panns/assets/audioset_labels.txt @@ -0,0 +1,527 @@ +Speech +Male speech, man speaking +Female speech, woman speaking +Child speech, kid speaking +Conversation +Narration, monologue +Babbling +Speech synthesizer +Shout +Bellow +Whoop +Yell +Battle cry +Children shouting +Screaming +Whispering +Laughter +Baby laughter +Giggle +Snicker +Belly laugh +Chuckle, chortle +Crying, sobbing +Baby cry, infant cry +Whimper +Wail, moan +Sigh +Singing +Choir +Yodeling +Chant +Mantra +Male singing +Female singing +Child singing +Synthetic singing +Rapping +Humming +Groan +Grunt +Whistling +Breathing +Wheeze +Snoring +Gasp +Pant +Snort +Cough +Throat clearing +Sneeze +Sniff +Run +Shuffle +Walk, footsteps +Chewing, mastication +Biting +Gargling +Stomach rumble +Burping, eructation +Hiccup +Fart +Hands +Finger snapping +Clapping +Heart sounds, heartbeat +Heart murmur +Cheering +Applause +Chatter +Crowd +Hubbub, speech noise, speech babble +Children playing +Animal +Domestic animals, pets +Dog +Bark +Yip +Howl +Bow-wow +Growling +Whimper (dog) +Cat +Purr +Meow +Hiss +Caterwaul +Livestock, farm animals, working animals +Horse +Clip-clop +Neigh, whinny +Cattle, bovinae +Moo +Cowbell +Pig +Oink +Goat +Bleat +Sheep +Fowl +Chicken, rooster +Cluck +Crowing, cock-a-doodle-doo +Turkey +Gobble +Duck +Quack +Goose +Honk +Wild animals +Roaring cats (lions, tigers) +Roar +Bird +Bird vocalization, bird call, bird song +Chirp, tweet +Squawk +Pigeon, dove +Coo +Crow +Caw +Owl +Hoot +Bird flight, flapping wings +Canidae, dogs, wolves +Rodents, rats, mice +Mouse +Patter +Insect +Cricket +Mosquito +Fly, housefly +Buzz +Bee, wasp, etc. 
+Frog +Croak +Snake +Rattle +Whale vocalization +Music +Musical instrument +Plucked string instrument +Guitar +Electric guitar +Bass guitar +Acoustic guitar +Steel guitar, slide guitar +Tapping (guitar technique) +Strum +Banjo +Sitar +Mandolin +Zither +Ukulele +Keyboard (musical) +Piano +Electric piano +Organ +Electronic organ +Hammond organ +Synthesizer +Sampler +Harpsichord +Percussion +Drum kit +Drum machine +Drum +Snare drum +Rimshot +Drum roll +Bass drum +Timpani +Tabla +Cymbal +Hi-hat +Wood block +Tambourine +Rattle (instrument) +Maraca +Gong +Tubular bells +Mallet percussion +Marimba, xylophone +Glockenspiel +Vibraphone +Steelpan +Orchestra +Brass instrument +French horn +Trumpet +Trombone +Bowed string instrument +String section +Violin, fiddle +Pizzicato +Cello +Double bass +Wind instrument, woodwind instrument +Flute +Saxophone +Clarinet +Harp +Bell +Church bell +Jingle bell +Bicycle bell +Tuning fork +Chime +Wind chime +Change ringing (campanology) +Harmonica +Accordion +Bagpipes +Didgeridoo +Shofar +Theremin +Singing bowl +Scratching (performance technique) +Pop music +Hip hop music +Beatboxing +Rock music +Heavy metal +Punk rock +Grunge +Progressive rock +Rock and roll +Psychedelic rock +Rhythm and blues +Soul music +Reggae +Country +Swing music +Bluegrass +Funk +Folk music +Middle Eastern music +Jazz +Disco +Classical music +Opera +Electronic music +House music +Techno +Dubstep +Drum and bass +Electronica +Electronic dance music +Ambient music +Trance music +Music of Latin America +Salsa music +Flamenco +Blues +Music for children +New-age music +Vocal music +A capella +Music of Africa +Afrobeat +Christian music +Gospel music +Music of Asia +Carnatic music +Music of Bollywood +Ska +Traditional music +Independent music +Song +Background music +Theme music +Jingle (music) +Soundtrack music +Lullaby +Video game music +Christmas music +Dance music +Wedding music +Happy music +Funny music +Sad music +Tender music +Exciting music +Angry music +Scary music +Wind +Rustling leaves +Wind noise (microphone) +Thunderstorm +Thunder +Water +Rain +Raindrop +Rain on surface +Stream +Waterfall +Ocean +Waves, surf +Steam +Gurgling +Fire +Crackle +Vehicle +Boat, Water vehicle +Sailboat, sailing ship +Rowboat, canoe, kayak +Motorboat, speedboat +Ship +Motor vehicle (road) +Car +Vehicle horn, car horn, honking +Toot +Car alarm +Power windows, electric windows +Skidding +Tire squeal +Car passing by +Race car, auto racing +Truck +Air brake +Air horn, truck horn +Reversing beeps +Ice cream truck, ice cream van +Bus +Emergency vehicle +Police car (siren) +Ambulance (siren) +Fire engine, fire truck (siren) +Motorcycle +Traffic noise, roadway noise +Rail transport +Train +Train whistle +Train horn +Railroad car, train wagon +Train wheels squealing +Subway, metro, underground +Aircraft +Aircraft engine +Jet engine +Propeller, airscrew +Helicopter +Fixed-wing aircraft, airplane +Bicycle +Skateboard +Engine +Light engine (high frequency) +Dental drill, dentist's drill +Lawn mower +Chainsaw +Medium engine (mid frequency) +Heavy engine (low frequency) +Engine knocking +Engine starting +Idling +Accelerating, revving, vroom +Door +Doorbell +Ding-dong +Sliding door +Slam +Knock +Tap +Squeak +Cupboard open or close +Drawer open or close +Dishes, pots, and pans +Cutlery, silverware +Chopping (food) +Frying (food) +Microwave oven +Blender +Water tap, faucet +Sink (filling or washing) +Bathtub (filling or washing) +Hair dryer +Toilet flush +Toothbrush +Electric toothbrush +Vacuum cleaner +Zipper (clothing) +Keys 
jangling +Coin (dropping) +Scissors +Electric shaver, electric razor +Shuffling cards +Typing +Typewriter +Computer keyboard +Writing +Alarm +Telephone +Telephone bell ringing +Ringtone +Telephone dialing, DTMF +Dial tone +Busy signal +Alarm clock +Siren +Civil defense siren +Buzzer +Smoke detector, smoke alarm +Fire alarm +Foghorn +Whistle +Steam whistle +Mechanisms +Ratchet, pawl +Clock +Tick +Tick-tock +Gears +Pulleys +Sewing machine +Mechanical fan +Air conditioning +Cash register +Printer +Camera +Single-lens reflex camera +Tools +Hammer +Jackhammer +Sawing +Filing (rasp) +Sanding +Power tool +Drill +Explosion +Gunshot, gunfire +Machine gun +Fusillade +Artillery fire +Cap gun +Fireworks +Firecracker +Burst, pop +Eruption +Boom +Wood +Chop +Splinter +Crack +Glass +Chink, clink +Shatter +Liquid +Splash, splatter +Slosh +Squish +Drip +Pour +Trickle, dribble +Gush +Fill (with liquid) +Spray +Pump (liquid) +Stir +Boiling +Sonar +Arrow +Whoosh, swoosh, swish +Thump, thud +Thunk +Electronic tuner +Effects unit +Chorus effect +Basketball bounce +Bang +Slap, smack +Whack, thwack +Smash, crash +Breaking +Bouncing +Whip +Flap +Scratch +Scrape +Rub +Roll +Crushing +Crumpling, crinkling +Tearing +Beep, bleep +Ping +Ding +Clang +Squeal +Creak +Rustle +Whir +Clatter +Sizzle +Clicking +Clickety-clack +Rumble +Plop +Jingle, tinkle +Hum +Zing +Boing +Crunch +Silence +Sine wave +Harmonic +Chirp tone +Sound effect +Pulse +Inside, small room +Inside, large room or hall +Inside, public space +Outside, urban or manmade +Outside, rural or natural +Reverberation +Echo +Noise +Environmental noise +Static +Mains hum +Distortion +Sidetone +Cacophony +White noise +Pink noise +Throbbing +Vibration +Television +Radio +Field recording diff --git a/paddleaudio/examples/panns/audio_tag.py b/paddleaudio/examples/panns/audio_tag.py new file mode 100644 index 00000000..eea14bba --- /dev/null +++ b/paddleaudio/examples/panns/audio_tag.py @@ -0,0 +1,112 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
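+# Audio tagging with a pretrained PANNs cnn14 model: the input waveform is
+# split into overlapping windows (--sample_duration / --hop_duration), each
+# window is converted to a mel spectrogram, and the per-window Audioset scores
+# are saved to an .npz file in --output_dir.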
+import argparse +import os +from typing import List + +import numpy as np +import paddle + +from paddleaudio.backends import load as load_audio +from paddleaudio.features import melspectrogram +from paddleaudio.models.panns import cnn14 +from paddleaudio.utils import logger + +# yapf: disable +parser = argparse.ArgumentParser(__doc__) +parser.add_argument('--device', choices=['cpu', 'gpu'], default='gpu', help='Select which device to predict, defaults to gpu.') +parser.add_argument('--wav', type=str, required=True, help='Audio file to infer.') +parser.add_argument('--sample_duration', type=float, default=2.0, help='Duration(in seconds) of tagging samples to predict.') +parser.add_argument('--hop_duration', type=float, default=0.3, help='Duration(in seconds) between two samples.') +parser.add_argument('--output_dir', type=str, default='./output_dir', help='Directory to save tagging result.') +args = parser.parse_args() +# yapf: enable + + +def split(waveform: np.ndarray, win_size: int, hop_size: int): + """ + Split into N waveforms. + N is decided by win_size and hop_size. + """ + assert isinstance(waveform, np.ndarray) + time = [] + data = [] + for i in range(0, len(waveform), hop_size): + segment = waveform[i:i + win_size] + if len(segment) < win_size: + segment = np.pad(segment, (0, win_size - len(segment))) + data.append(segment) + time.append(i / len(waveform)) + return time, data + + +def batchify(data: List[List[float]], + sample_rate: int, + batch_size: int, + **kwargs): + """ + Extract features from waveforms and create batches. + """ + examples = [] + for waveform in data: + feats = melspectrogram(waveform, sample_rate, **kwargs).transpose() + examples.append(feats) + + # Seperates data into some batches. + one_batch = [] + for example in examples: + one_batch.append(example) + if len(one_batch) == batch_size: + yield one_batch + one_batch = [] + if one_batch: + yield one_batch + + +def predict(model, data: List[List[float]], sample_rate: int, + batch_size: int=1): + """ + Use pretrained model to make predictions. + """ + batches = batchify(data, sample_rate, batch_size) + results = None + model.eval() + for batch in batches: + feats = paddle.to_tensor(batch).unsqueeze(1) \ + # (batch_size, num_frames, num_melbins) -> (batch_size, 1, num_frames, num_melbins) + + audioset_scores = model(feats) + if results is None: + results = audioset_scores.numpy() + else: + results = np.concatenate((results, audioset_scores.numpy())) + + return results + + +if __name__ == '__main__': + paddle.set_device(args.device) + model = cnn14(pretrained=True, extract_embedding=False) + waveform, sr = load_audio(args.wav, sr=None) + time, data = split(waveform, + int(args.sample_duration * sr), + int(args.hop_duration * sr)) + results = predict(model, data, sr, batch_size=8) + + if not os.path.exists(args.output_dir): + os.makedirs(args.output_dir) + time = np.arange(0, 1, int(args.hop_duration * sr) / len(waveform)) + output_file = os.path.join(args.output_dir, f'audioset_tagging_sr_{sr}.npz') + np.savez(output_file, time=time, scores=results) + logger.info(f'Saved tagging results to {output_file}') diff --git a/paddleaudio/examples/panns/parse_result.py b/paddleaudio/examples/panns/parse_result.py new file mode 100644 index 00000000..667489dd --- /dev/null +++ b/paddleaudio/examples/panns/parse_result.py @@ -0,0 +1,84 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import argparse +import ast +import os +from typing import Dict + +import numpy as np + +from paddleaudio.utils import logger + +# yapf: disable +parser = argparse.ArgumentParser(__doc__) +parser.add_argument('--tagging_file', type=str, required=True, help='') +parser.add_argument('--top_k', type=int, default=10, help='Get top k predicted results of audioset labels.') +parser.add_argument('--smooth', type=ast.literal_eval, default=True, help='Set "True" to apply posterior smoothing.') +parser.add_argument('--smooth_size', type=int, default=5, help='Window size of posterior smoothing.') +parser.add_argument('--label_file', type=str, default='./assets/audioset_labels.txt', help='File of audioset labels.') +parser.add_argument('--output_dir', type=str, default='./output_dir', help='Directory to save tagging labels.') +args = parser.parse_args() +# yapf: enable + + +def smooth(results: np.ndarray, win_size: int): + """ + Execute posterior smoothing in-place. + """ + for i in range(len(results) - 1, -1, -1): + if i < win_size - 1: + left = 0 + else: + left = i + 1 - win_size + results[i] = np.sum(results[left:i + 1], axis=0) / (i - left + 1) + + +def generate_topk_label(k: int, label_map: Dict, result: np.ndarray): + """ + Return top k result. 
+ """ + result = np.asarray(result) + topk_idx = (-result).argsort()[:k] + + ret = '' + for idx in topk_idx: + label, score = label_map[idx], result[idx] + ret += f'{label}: {score}\n' + return ret + + +if __name__ == "__main__": + label_map = {} + with open(args.label_file, 'r') as f: + for i, l in enumerate(f.readlines()): + label_map[i] = l.strip() + + results = np.load(args.tagging_file, allow_pickle=True) + times, scores = results['time'], results['scores'] + + if args.smooth: + logger.info('Posterior smoothing...') + smooth(scores, win_size=args.smooth_size) + + if not os.path.exists(args.output_dir): + os.makedirs(args.output_dir) + output_file = os.path.join( + args.output_dir, + os.path.basename(args.tagging_file).split('.')[0] + '.txt') + with open(output_file, 'w') as f: + for time, score in zip(times, scores): + f.write(f'{time}\n') + f.write(generate_topk_label(args.top_k, label_map, score) + '\n') + + logger.info(f'Saved tagging labels to {output_file}') diff --git a/paddleaudio/examples/sound_classification/README.md b/paddleaudio/examples/sound_classification/README.md new file mode 100644 index 00000000..86a54cb3 --- /dev/null +++ b/paddleaudio/examples/sound_classification/README.md @@ -0,0 +1,116 @@ +# 声音分类 + +声音分类和检测是声音算法的一个热门研究方向。 + +对于声音分类任务,传统机器学习的一个常用做法是首先人工提取音频的时域和频域的多种特征并做特征选择、组合、变换等,然后基于SVM或决策树进行分类。而端到端的深度学习则通常利用深度网络如RNN,CNN等直接对声间波形(waveform)或时频特征(time-frequency)进行特征学习(representation learning)和分类预测。 + +在IEEE ICASSP 2017 大会上,谷歌开放了一个大规模的音频数据集[Audioset](https://research.google.com/audioset/)。该数据集包含了 632 类的音频类别以及 2,084,320 条人工标记的每段 10 秒长度的声音剪辑片段(来源于YouTube视频)。目前该数据集已经有210万个已标注的视频数据,5800小时的音频数据,经过标记的声音样本的标签类别为527。 + +`PANNs`([PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition](https://arxiv.org/pdf/1912.10211.pdf))是基于Audioset数据集训练的声音分类/识别的模型。经过预训练后,模型可以用于提取音频的embbedding。本示例将使用`PANNs`的预训练模型Finetune完成声音分类的任务。 + + +## 模型简介 + +PaddleAudio提供了PANNs的CNN14、CNN10和CNN6的预训练模型,可供用户选择使用: +- CNN14: 该模型主要包含12个卷积层和2个全连接层,模型参数的数量为79.6M,embbedding维度是2048。 +- CNN10: 该模型主要包含8个卷积层和2个全连接层,模型参数的数量为4.9M,embbedding维度是512。 +- CNN6: 该模型主要包含4个卷积层和2个全连接层,模型参数的数量为4.5M,embbedding维度是512。 + + +## 快速开始 + +### 模型训练 + +以环境声音分类数据集`ESC50`为示例,运行下面的命令,可在训练集上进行模型的finetune,支持单机的单卡训练和多卡训练。关于如何使用`paddle.distributed.launch`启动多卡训练,请查看[单机多卡训练](https://www.paddlepaddle.org.cn/documentation/docs/zh/guides/02_paddle2.0_develop/06_device_cn.html)。 + +单卡训练: +```shell +$ python train.py --epochs 50 --batch_size 16 --checkpoint_dir ./checkpoint --save_freq 10 +``` + +多卡训练: +```shell +$ unset CUDA_VISIBLE_DEVICES +$ python -m paddle.distributed.launch --gpus "0,1" train.py --epochs 50 --batch_size 16 --num_worker 4 --checkpoint_dir ./checkpoint --save_freq 10 +``` + +可支持配置的参数: + +- `device`: 选用什么设备进行训练,可选cpu或gpu,默认为gpu。如使用gpu训练则参数gpus指定GPU卡号。 +- `epochs`: 训练轮次,默认为50。 +- `learning_rate`: Fine-tune的学习率;默认为5e-5。 +- `batch_size`: 批处理大小,请结合显存情况进行调整,若出现显存不足,请适当调低这一参数;默认为16。 +- `num_workers`: Dataloader获取数据的子进程数。默认为0,加载数据的流程在主进程执行。 +- `checkpoint_dir`: 模型参数文件和optimizer参数文件的保存目录,默认为`./checkpoint`。 +- `save_freq`: 训练过程中的模型保存频率,默认为10。 +- `log_freq`: 训练过程中的信息打印频率,默认为10。 + +示例代码中使用的预训练模型为`CNN14`,如果想更换为其他预训练模型,可通过以下方式执行: +```python +from model import SoundClassifier +from paddleaudio.datasets import ESC50 +from paddleaudio.models.panns import cnn14, cnn10, cnn6 + +# CNN14 +backbone = cnn14(pretrained=True, extract_embedding=True) +model = SoundClassifier(backbone, num_class=len(ESC50.label_list)) + +# CNN10 +backbone = cnn10(pretrained=True, extract_embedding=True) +model = SoundClassifier(backbone, 
num_class=len(ESC50.label_list)) + +# CNN6 +backbone = cnn6(pretrained=True, extract_embedding=True) +model = SoundClassifier(backbone, num_class=len(ESC50.label_list)) +``` + +### 模型预测 + +```shell +python -u predict.py --wav ./dog.wav --top_k 3 --checkpoint ./checkpoint/epoch_50/model.pdparams +``` + +可支持配置的参数: +- `device`: 选用什么设备进行训练,可选cpu或gpu,默认为gpu。如使用gpu训练则参数gpus指定GPU卡号。 +- `wav`: 指定预测的音频文件。 +- `top_k`: 预测显示的top k标签的得分,默认为1。 +- `checkpoint`: 模型参数checkpoint文件。 + +输出的预测结果如下: +``` +[/audio/dog.wav] +Dog: 0.9999538660049438 +Clock tick: 1.3341237718123011e-05 +Cat: 6.579841738130199e-06 +``` + +### 模型部署 + +#### 1. 动转静 + +模型训练结束后,可以将已保存的动态图参数导出成静态图的模型和参数,然后实施静态图的部署。 + +```shell +python -u export_model.py --checkpoint ./checkpoint/epoch_50/model.pdparams --output_dir ./export +``` + +可支持配置的参数: +- `checkpoint`: 模型参数checkpoint文件。 +- `output_dir`: 导出静态图模型和参数文件的保存目录。 + +导出的静态图模型和参数文件如下: +```sh +$ tree export +export +├── inference.pdiparams +├── inference.pdiparams.info +└── inference.pdmodel +``` + +#### 2. 模型部署和预测 + +`deploy/python/predict.py` 脚本使用了`paddle.inference`模块下的api,提供了python端部署的示例: + +```sh +python deploy/python/predict.py --model_dir ./export --device gpu +``` diff --git a/paddleaudio/examples/sound_classification/deploy/python/predict.py b/paddleaudio/examples/sound_classification/deploy/python/predict.py new file mode 100644 index 00000000..f41307c8 --- /dev/null +++ b/paddleaudio/examples/sound_classification/deploy/python/predict.py @@ -0,0 +1,147 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
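+# Static-graph deployment demo: loads the exported inference.pdmodel /
+# inference.pdiparams with paddle.inference, pads a batch of wav files to the
+# same length, extracts mel-spectrogram features and prints the predicted
+# ESC50 label for each file.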
+import argparse +import os + +import numpy as np +from paddle import inference +from scipy.special import softmax + +from paddleaudio.backends import load as load_audio +from paddleaudio.datasets import ESC50 +from paddleaudio.features import melspectrogram + +# yapf: disable +parser = argparse.ArgumentParser() +parser.add_argument("--model_dir", type=str, required=True, default="./export", help="The directory to static model.") +parser.add_argument("--batch_size", type=int, default=2, help="Batch size per GPU/CPU for training.") +parser.add_argument('--device', choices=['cpu', 'gpu', 'xpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument('--use_tensorrt', type=eval, default=False, choices=[True, False], help='Enable to use tensorrt to speed up.') +parser.add_argument("--precision", type=str, default="fp32", choices=["fp32", "fp16"], help='The tensorrt precision.') +parser.add_argument('--cpu_threads', type=int, default=10, help='Number of threads to predict when using cpu.') +parser.add_argument('--enable_mkldnn', type=eval, default=False, choices=[True, False], help='Enable to use mkldnn to speed up when using cpu.') +parser.add_argument("--log_dir", type=str, default="./log", help="The path to save log.") +args = parser.parse_args() +# yapf: enable + + +def extract_features(files: str, **kwargs): + waveforms = [] + srs = [] + max_length = float('-inf') + for file in files: + waveform, sr = load_audio(file, sr=None) + max_length = max(max_length, len(waveform)) + waveforms.append(waveform) + srs.append(sr) + + feats = [] + for i in range(len(waveforms)): + # padding + if len(waveforms[i]) < max_length: + pad_width = max_length - len(waveforms[i]) + waveforms[i] = np.pad(waveforms[i], pad_width=(0, pad_width)) + + feat = melspectrogram(waveforms[i], sr, **kwargs).transpose() + feats.append(feat) + + return np.stack(feats, axis=0) + + +class Predictor(object): + def __init__(self, + model_dir, + device="gpu", + batch_size=1, + use_tensorrt=False, + precision="fp32", + cpu_threads=10, + enable_mkldnn=False): + self.batch_size = batch_size + + model_file = os.path.join(model_dir, "inference.pdmodel") + params_file = os.path.join(model_dir, "inference.pdiparams") + + assert os.path.isfile(model_file) and os.path.isfile( + params_file), 'Please check model and parameter files.' 
+ + config = inference.Config(model_file, params_file) + if device == "gpu": + # set GPU configs accordingly + # such as intialize the gpu memory, enable tensorrt + config.enable_use_gpu(100, 0) + precision_map = { + "fp16": inference.PrecisionType.Half, + "fp32": inference.PrecisionType.Float32, + } + precision_mode = precision_map[precision] + + if use_tensorrt: + config.enable_tensorrt_engine( + max_batch_size=batch_size, + min_subgraph_size=30, + precision_mode=precision_mode) + elif device == "cpu": + # set CPU configs accordingly, + # such as enable_mkldnn, set_cpu_math_library_num_threads + config.disable_gpu() + if enable_mkldnn: + # cache 10 different shapes for mkldnn to avoid memory leak + config.set_mkldnn_cache_capacity(10) + config.enable_mkldnn() + config.set_cpu_math_library_num_threads(cpu_threads) + elif device == "xpu": + # set XPU configs accordingly + config.enable_xpu(100) + + config.switch_use_feed_fetch_ops(False) + self.predictor = inference.create_predictor(config) + self.input_handles = [ + self.predictor.get_input_handle(name) + for name in self.predictor.get_input_names() + ] + self.output_handle = self.predictor.get_output_handle( + self.predictor.get_output_names()[0]) + + def predict(self, wavs): + feats = extract_features(wavs) + + self.input_handles[0].copy_from_cpu(feats) + self.predictor.run() + logits = self.output_handle.copy_to_cpu() + probs = softmax(logits, axis=1) + indices = np.argmax(probs, axis=1) + + return indices + + +if __name__ == "__main__": + # Define predictor to do prediction. + predictor = Predictor(args.model_dir, args.device, args.batch_size, + args.use_tensorrt, args.precision, args.cpu_threads, + args.enable_mkldnn) + + wavs = [ + '~/audio_demo_resource/cat.wav', + '~/audio_demo_resource/dog.wav', + ] + + for i in range(len(wavs)): + wavs[i] = os.path.abspath(os.path.expanduser(wavs[i])) + assert os.path.isfile( + wavs[i]), f'Please check input wave file: {wavs[i]}' + + results = predictor.predict(wavs) + for idx, wav in enumerate(wavs): + print(f'Wav: {wav} \t Label: {ESC50.label_list[results[idx]]}') diff --git a/paddleaudio/examples/sound_classification/export_model.py b/paddleaudio/examples/sound_classification/export_model.py new file mode 100644 index 00000000..f36ae305 --- /dev/null +++ b/paddleaudio/examples/sound_classification/export_model.py @@ -0,0 +1,45 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
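+# Export script: rebuilds the SoundClassifier with a cnn14 backbone, loads the
+# fine-tuned parameters from --checkpoint, converts the model with
+# paddle.jit.to_static and saves the static-graph files to --output_dir.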
+import argparse +import os + +import paddle +from model import SoundClassifier + +from paddleaudio.datasets import ESC50 +from paddleaudio.models.panns import cnn14 + +# yapf: disable +parser = argparse.ArgumentParser(__doc__) +parser.add_argument("--checkpoint", type=str, required=True, help="Checkpoint of model.") +parser.add_argument("--output_dir", type=str, default='./export', help="Path to save static model and its parameters.") +args = parser.parse_args() +# yapf: enable + +if __name__ == '__main__': + model = SoundClassifier( + backbone=cnn14(pretrained=False, extract_embedding=True), + num_class=len(ESC50.label_list)) + model.set_state_dict(paddle.load(args.checkpoint)) + model.eval() + + model = paddle.jit.to_static( + model, + input_spec=[ + paddle.static.InputSpec( + shape=[None, None, 64], dtype=paddle.float32) + ]) + + # Save in static graph model. + paddle.jit.save(model, os.path.join(args.output_dir, "inference")) diff --git a/paddleaudio/examples/sound_classification/model.py b/paddleaudio/examples/sound_classification/model.py new file mode 100644 index 00000000..df64158f --- /dev/null +++ b/paddleaudio/examples/sound_classification/model.py @@ -0,0 +1,36 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import paddle.nn as nn + + +class SoundClassifier(nn.Layer): + """ + Model for sound classification which uses panns pretrained models to extract + embeddings from audio files. + """ + + def __init__(self, backbone, num_class, dropout=0.1): + super(SoundClassifier, self).__init__() + self.backbone = backbone + self.dropout = nn.Dropout(dropout) + self.fc = nn.Linear(self.backbone.emb_size, num_class) + + def forward(self, x): + # x: (batch_size, num_frames, num_melbins) -> (batch_size, 1, num_frames, num_melbins) + x = x.unsqueeze(1) + x = self.backbone(x) + x = self.dropout(x) + logits = self.fc(x) + + return logits diff --git a/paddleaudio/examples/sound_classification/predict.py b/paddleaudio/examples/sound_classification/predict.py new file mode 100644 index 00000000..af15e804 --- /dev/null +++ b/paddleaudio/examples/sound_classification/predict.py @@ -0,0 +1,61 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
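+# Dynamic-graph inference: extracts a mel spectrogram from --wav, restores the
+# fine-tuned SoundClassifier from --checkpoint and prints the top-k ESC50
+# labels with their softmax probabilities.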
+import argparse + +import numpy as np +import paddle +import paddle.nn.functional as F +from model import SoundClassifier + +from paddleaudio.backends import load as load_audio +from paddleaudio.datasets import ESC50 +from paddleaudio.features import melspectrogram +from paddleaudio.models.panns import cnn14 + +# yapf: disable +parser = argparse.ArgumentParser(__doc__) +parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to predict, defaults to gpu.") +parser.add_argument("--wav", type=str, required=True, help="Audio file to infer.") +parser.add_argument("--top_k", type=int, default=1, help="Show top k predicted results") +parser.add_argument("--checkpoint", type=str, required=True, help="Checkpoint of model.") +args = parser.parse_args() +# yapf: enable + + +def extract_features(file: str, **kwargs): + waveform, sr = load_audio(file, sr=None) + feat = melspectrogram(waveform, sr, **kwargs).transpose() + return feat + + +if __name__ == '__main__': + paddle.set_device(args.device) + + model = SoundClassifier( + backbone=cnn14(pretrained=False, extract_embedding=True), + num_class=len(ESC50.label_list)) + model.set_state_dict(paddle.load(args.checkpoint)) + model.eval() + + feat = np.expand_dims(extract_features(args.wav), 0) + feat = paddle.to_tensor(feat) + logits = model(feat) + probs = F.softmax(logits, axis=1).numpy() + + sorted_indices = (-probs[0]).argsort() + + msg = f'[{args.wav}]\n' + for idx in sorted_indices[:args.top_k]: + msg += f'{ESC50.label_list[idx]}: {probs[0][idx]}\n' + print(msg) diff --git a/paddleaudio/examples/sound_classification/train.py b/paddleaudio/examples/sound_classification/train.py new file mode 100644 index 00000000..bd10bc9f --- /dev/null +++ b/paddleaudio/examples/sound_classification/train.py @@ -0,0 +1,149 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
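+# Fine-tuning script: trains a SoundClassifier (pretrained cnn14 backbone) on
+# the ESC50 training split, supports single- and multi-GPU training via
+# paddle.distributed, periodically evaluates on the dev split and saves
+# model/optimizer checkpoints every --save_freq epochs to --checkpoint_dir.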
+import argparse +import os + +import paddle +from model import SoundClassifier + +from paddleaudio.datasets import ESC50 +from paddleaudio.models.panns import cnn14 +from paddleaudio.utils import logger +from paddleaudio.utils import Timer + +# yapf: disable +parser = argparse.ArgumentParser(__doc__) +parser.add_argument('--device', choices=['cpu', 'gpu'], default="gpu", help="Select which device to train model, defaults to gpu.") +parser.add_argument("--epochs", type=int, default=50, help="Number of epoches for fine-tuning.") +parser.add_argument("--learning_rate", type=float, default=5e-5, help="Learning rate used to train with warmup.") +parser.add_argument("--batch_size", type=int, default=16, help="Total examples' number in batch for training.") +parser.add_argument("--num_workers", type=int, default=0, help="Number of workers in dataloader.") +parser.add_argument("--checkpoint_dir", type=str, default='./checkpoint', help="Directory to save model checkpoints.") +parser.add_argument("--save_freq", type=int, default=10, help="Save checkpoint every n epoch.") +parser.add_argument("--log_freq", type=int, default=10, help="Log the training infomation every n steps.") +args = parser.parse_args() +# yapf: enable + +if __name__ == "__main__": + paddle.set_device(args.device) + nranks = paddle.distributed.get_world_size() + if paddle.distributed.get_world_size() > 1: + paddle.distributed.init_parallel_env() + local_rank = paddle.distributed.get_rank() + + backbone = cnn14(pretrained=True, extract_embedding=True) + model = SoundClassifier(backbone, num_class=len(ESC50.label_list)) + model = paddle.DataParallel(model) + optimizer = paddle.optimizer.Adam( + learning_rate=args.learning_rate, parameters=model.parameters()) + criterion = paddle.nn.loss.CrossEntropyLoss() + + train_ds = ESC50(mode='train', feat_type='melspectrogram') + dev_ds = ESC50(mode='dev', feat_type='melspectrogram') + + train_sampler = paddle.io.DistributedBatchSampler( + train_ds, batch_size=args.batch_size, shuffle=True, drop_last=False) + train_loader = paddle.io.DataLoader( + train_ds, + batch_sampler=train_sampler, + num_workers=args.num_workers, + return_list=True, + use_buffer_reader=True, ) + + steps_per_epoch = len(train_sampler) + timer = Timer(steps_per_epoch * args.epochs) + timer.start() + + for epoch in range(1, args.epochs + 1): + model.train() + + avg_loss = 0 + num_corrects = 0 + num_samples = 0 + for batch_idx, batch in enumerate(train_loader): + feats, labels = batch + logits = model(feats) + + loss = criterion(logits, labels) + loss.backward() + optimizer.step() + if isinstance(optimizer._learning_rate, + paddle.optimizer.lr.LRScheduler): + optimizer._learning_rate.step() + optimizer.clear_grad() + + # Calculate loss + avg_loss += loss.numpy()[0] + + # Calculate metrics + preds = paddle.argmax(logits, axis=1) + num_corrects += (preds == labels).numpy().sum() + num_samples += feats.shape[0] + + timer.count() + + if (batch_idx + 1) % args.log_freq == 0 and local_rank == 0: + lr = optimizer.get_lr() + avg_loss /= args.log_freq + avg_acc = num_corrects / num_samples + + print_msg = 'Epoch={}/{}, Step={}/{}'.format( + epoch, args.epochs, batch_idx + 1, steps_per_epoch) + print_msg += ' loss={:.4f}'.format(avg_loss) + print_msg += ' acc={:.4f}'.format(avg_acc) + print_msg += ' lr={:.6f} step/sec={:.2f} | ETA {}'.format( + lr, timer.timing, timer.eta) + logger.train(print_msg) + + avg_loss = 0 + num_corrects = 0 + num_samples = 0 + + if epoch % args.save_freq == 0 and batch_idx + 1 == steps_per_epoch and 
local_rank == 0: + dev_sampler = paddle.io.BatchSampler( + dev_ds, + batch_size=args.batch_size, + shuffle=False, + drop_last=False) + dev_loader = paddle.io.DataLoader( + dev_ds, + batch_sampler=dev_sampler, + num_workers=args.num_workers, + return_list=True, ) + + model.eval() + num_corrects = 0 + num_samples = 0 + with logger.processing('Evaluation on validation dataset'): + for batch_idx, batch in enumerate(dev_loader): + feats, labels = batch + logits = model(feats) + + preds = paddle.argmax(logits, axis=1) + num_corrects += (preds == labels).numpy().sum() + num_samples += feats.shape[0] + + print_msg = '[Evaluation result]' + print_msg += ' dev_acc={:.4f}'.format(num_corrects / num_samples) + + logger.eval(print_msg) + + # Save model + save_dir = os.path.join(args.checkpoint_dir, + 'epoch_{}'.format(epoch)) + logger.info('Saving model checkpoint to {}'.format(save_dir)) + paddle.save(model.state_dict(), + os.path.join(save_dir, 'model.pdparams')) + paddle.save(optimizer.state_dict(), + os.path.join(save_dir, 'model.pdopt')) diff --git a/paddleaudio/paddleaudio/__init__.py b/paddleaudio/paddleaudio/__init__.py new file mode 100644 index 00000000..2685cf57 --- /dev/null +++ b/paddleaudio/paddleaudio/__init__.py @@ -0,0 +1,15 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +from .backends import * +from .features import * diff --git a/paddleaudio/paddleaudio/backends/__init__.py b/paddleaudio/paddleaudio/backends/__init__.py new file mode 100644 index 00000000..f2f77ffe --- /dev/null +++ b/paddleaudio/paddleaudio/backends/__init__.py @@ -0,0 +1,14 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +from .audio import * diff --git a/paddleaudio/paddleaudio/backends/audio.py b/paddleaudio/paddleaudio/backends/audio.py new file mode 100644 index 00000000..4127570e --- /dev/null +++ b/paddleaudio/paddleaudio/backends/audio.py @@ -0,0 +1,303 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
+# See the License for the specific language governing permissions and +# limitations under the License. +import warnings +from typing import Optional +from typing import Tuple +from typing import Union + +import numpy as np +import resampy +import soundfile as sf +from numpy import ndarray as array +from scipy.io import wavfile + +from ..utils import ParameterError + +__all__ = [ + 'resample', + 'to_mono', + 'depth_convert', + 'normalize', + 'save_wav', + 'load', +] +NORMALMIZE_TYPES = ['linear', 'gaussian'] +MERGE_TYPES = ['ch0', 'ch1', 'random', 'average'] +RESAMPLE_MODES = ['kaiser_best', 'kaiser_fast'] +EPS = 1e-8 + + +def resample(y: array, src_sr: int, target_sr: int, + mode: str='kaiser_fast') -> array: + """ Audio resampling + + This function is the same as using resampy.resample(). + + Notes: + The default mode is kaiser_fast. For better audio quality, use mode = 'kaiser_fast' + + """ + + if mode == 'kaiser_best': + warnings.warn( + f'Using resampy in kaiser_best to {src_sr}=>{target_sr}. This function is pretty slow, \ + we recommend the mode kaiser_fast in large scale audio trainning') + + if not isinstance(y, np.ndarray): + raise ParameterError( + 'Only support numpy array, but received y in {type(y)}') + + if mode not in RESAMPLE_MODES: + raise ParameterError(f'resample mode must in {RESAMPLE_MODES}') + + return resampy.resample(y, src_sr, target_sr, filter=mode) + + +def to_mono(y: array, merge_type: str='average') -> array: + """ convert sterior audio to mono + """ + if merge_type not in MERGE_TYPES: + raise ParameterError( + f'Unsupported merge type {merge_type}, available types are {MERGE_TYPES}' + ) + if y.ndim > 2: + raise ParameterError( + f'Unsupported audio array, y.ndim > 2, the shape is {y.shape}') + if y.ndim == 1: # nothing to merge + return y + + if merge_type == 'ch0': + return y[0] + if merge_type == 'ch1': + return y[1] + if merge_type == 'random': + return y[np.random.randint(0, 2)] + + # need to do averaging according to dtype + + if y.dtype == 'float32': + y_out = (y[0] + y[1]) * 0.5 + elif y.dtype == 'int16': + y_out = y.astype('int32') + y_out = (y_out[0] + y_out[1]) // 2 + y_out = np.clip(y_out, np.iinfo(y.dtype).min, + np.iinfo(y.dtype).max).astype(y.dtype) + + elif y.dtype == 'int8': + y_out = y.astype('int16') + y_out = (y_out[0] + y_out[1]) // 2 + y_out = np.clip(y_out, np.iinfo(y.dtype).min, + np.iinfo(y.dtype).max).astype(y.dtype) + else: + raise ParameterError(f'Unsupported dtype: {y.dtype}') + return y_out + + +def _safe_cast(y: array, dtype: Union[type, str]) -> array: + """ data type casting in a safe way, i.e., prevent overflow or underflow + + This function is used internally. + """ + return np.clip(y, np.iinfo(dtype).min, np.iinfo(dtype).max).astype(dtype) + + +def depth_convert(y: array, dtype: Union[type, str], + dithering: bool=True) -> array: + """Convert audio array to target dtype safely + + This function convert audio waveform to a target dtype, with addition steps of + preventing overflow/underflow and preserving audio range. 
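+
+    Example (illustrative sketch, assuming `y` is a float32 numpy waveform in [-1.0, 1.0]):
+        >>> y16 = depth_convert(y, 'int16')  # rescaled to the int16 range and clipped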
+ + """ + + SUPPORT_DTYPE = ['int16', 'int8', 'float32', 'float64'] + if y.dtype not in SUPPORT_DTYPE: + raise ParameterError( + 'Unsupported audio dtype, ' + f'y.dtype is {y.dtype}, supported dtypes are {SUPPORT_DTYPE}') + + if dtype not in SUPPORT_DTYPE: + raise ParameterError( + 'Unsupported audio dtype, ' + f'target dtype is {dtype}, supported dtypes are {SUPPORT_DTYPE}') + + if dtype == y.dtype: + return y + + if dtype == 'float64' and y.dtype == 'float32': + return _safe_cast(y, dtype) + if dtype == 'float32' and y.dtype == 'float64': + return _safe_cast(y, dtype) + + if dtype == 'int16' or dtype == 'int8': + if y.dtype in ['float64', 'float32']: + factor = np.iinfo(dtype).max + y = np.clip(y * factor, np.iinfo(dtype).min, + np.iinfo(dtype).max).astype(dtype) + y = y.astype(dtype) + else: + if dtype == 'int16' and y.dtype == 'int8': + factor = np.iinfo('int16').max / np.iinfo('int8').max - EPS + y = y.astype('float32') * factor + y = y.astype('int16') + + else: # dtype == 'int8' and y.dtype=='int16': + y = y.astype('int32') * np.iinfo('int8').max / \ + np.iinfo('int16').max + y = y.astype('int8') + + if dtype in ['float32', 'float64']: + org_dtype = y.dtype + y = y.astype(dtype) / np.iinfo(org_dtype).max + return y + + +def sound_file_load(file: str, + offset: Optional[float]=None, + dtype: str='int16', + duration: Optional[int]=None) -> Tuple[array, int]: + """Load audio using soundfile library + + This function load audio file using libsndfile. + + Reference: + http://www.mega-nerd.com/libsndfile/#Features + + """ + with sf.SoundFile(file) as sf_desc: + sr_native = sf_desc.samplerate + if offset: + sf_desc.seek(int(offset * sr_native)) + if duration is not None: + frame_duration = int(duration * sr_native) + else: + frame_duration = -1 + y = sf_desc.read(frames=frame_duration, dtype=dtype, always_2d=False).T + + return y, sf_desc.samplerate + + +def audio_file_load(): + """Load audio using audiofile library + + This function load audio file using audiofile. + + Reference: + https://audiofile.68k.org/ + + """ + raise NotImplementedError() + + +def sox_file_load(): + """Load audio using sox library + + This function load audio file using sox. + + Reference: + http://sox.sourceforge.net/ + """ + raise NotImplementedError() + + +def normalize(y: array, norm_type: str='linear', + mul_factor: float=1.0) -> array: + """ normalize an input audio with additional multiplier. + + """ + + if norm_type == 'linear': + amax = np.max(np.abs(y)) + factor = 1.0 / (amax + EPS) + y = y * factor * mul_factor + elif norm_type == 'gaussian': + amean = np.mean(y) + astd = np.std(y) + astd = max(astd, EPS) + y = mul_factor * (y - amean) / astd + else: + raise NotImplementedError(f'norm_type should be in {NORMALMIZE_TYPES}') + + return y + + +def save_wav(y: array, sr: int, file: str) -> None: + """Save audio file to disk. + This function saves audio to disk using scipy.io.wavfile, with additional step + to convert input waveform to int16 unless it already is int16 + + Notes: + It only support raw wav format. 
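+
+    Example (illustrative sketch, assuming `y` is a numpy waveform sampled at 16 kHz):
+        >>> save_wav(y, 16000, 'out.wav')  # non-int16/int8 input is converted to int16 before writing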
+ + """ + if not file.endswith('.wav'): + raise ParameterError( + f'only .wav file supported, but dst file name is: {file}') + + if sr <= 0: + raise ParameterError( + f'Sample rate should be larger than 0, recieved sr = {sr}') + + if y.dtype not in ['int16', 'int8']: + warnings.warn( + f'input data type is {y.dtype}, will convert data to int16 format before saving' + ) + y_out = depth_convert(y, 'int16') + else: + y_out = y + + wavfile.write(file, sr, y_out) + + +def load( + file: str, + sr: Optional[int]=None, + mono: bool=True, + merge_type: str='average', # ch0,ch1,random,average + normal: bool=True, + norm_type: str='linear', + norm_mul_factor: float=1.0, + offset: float=0.0, + duration: Optional[int]=None, + dtype: str='float32', + resample_mode: str='kaiser_fast') -> Tuple[array, int]: + """Load audio file from disk. + This function loads audio from disk using using audio beackend. + + Parameters: + + Notes: + + """ + + y, r = sound_file_load(file, offset=offset, dtype=dtype, duration=duration) + + if not ((y.ndim == 1 and len(y) > 0) or (y.ndim == 2 and len(y[0]) > 0)): + raise ParameterError(f'audio file {file} looks empty') + + if mono: + y = to_mono(y, merge_type) + + if sr is not None and sr != r: + y = resample(y, r, sr, mode=resample_mode) + r = sr + + if normal: + y = normalize(y, norm_type, norm_mul_factor) + elif dtype in ['int8', 'int16']: + # still need to do normalization, before depth convertion + y = normalize(y, 'linear', 1.0) + + y = depth_convert(y, dtype) + return y, r diff --git a/paddleaudio/paddleaudio/datasets/__init__.py b/paddleaudio/paddleaudio/datasets/__init__.py new file mode 100644 index 00000000..e1d2bbc5 --- /dev/null +++ b/paddleaudio/paddleaudio/datasets/__init__.py @@ -0,0 +1,34 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +from .aishell import AISHELL1 +from .dcase import UrbanAcousticScenes +from .dcase import UrbanAudioVisualScenes +from .esc50 import ESC50 +from .gtzan import GTZAN +from .librispeech import LIBRISPEECH +from .ravdess import RAVDESS +from .tess import TESS +from .urban_sound import UrbanSound8K + +__all__ = [ + 'AISHELL1', + 'LIBRISPEECH', + 'ESC50', + 'UrbanSound8K', + 'GTZAN', + 'UrbanAcousticScenes', + 'UrbanAudioVisualScenes', + 'RAVDESS', + 'TESS', +] diff --git a/paddleaudio/paddleaudio/datasets/aishell.py b/paddleaudio/paddleaudio/datasets/aishell.py new file mode 100644 index 00000000..d84d9876 --- /dev/null +++ b/paddleaudio/paddleaudio/datasets/aishell.py @@ -0,0 +1,154 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import codecs +import collections +import json +import os +from typing import Dict + +from paddle.io import Dataset +from tqdm import tqdm + +from ..backends import load as load_audio +from ..utils.download import decompress +from ..utils.download import download_and_decompress +from ..utils.env import DATA_HOME +from ..utils.log import logger +from .dataset import feat_funcs + +__all__ = ['AISHELL1'] + + +class AISHELL1(Dataset): + """ + This Open Source Mandarin Speech Corpus, AISHELL-ASR0009-OS1, is 178 hours long. + It is a part of AISHELL-ASR0009, of which utterance contains 11 domains, including + smart home, autonomous driving, and industrial production. The whole recording was + put in quiet indoor environment, using 3 different devices at the same time: high + fidelity microphone (44.1kHz, 16-bit,); Android-system mobile phone (16kHz, 16-bit), + iOS-system mobile phone (16kHz, 16-bit). Audios in high fidelity were re-sampled + to 16kHz to build AISHELL- ASR0009-OS1. 400 speakers from different accent areas + in China were invited to participate in the recording. The manual transcription + accuracy rate is above 95%, through professional speech annotation and strict + quality inspection. The corpus is divided into training, development and testing + sets. + + Reference: + AISHELL-1: An Open-Source Mandarin Speech Corpus and A Speech Recognition Baseline + https://arxiv.org/abs/1709.05522 + """ + + archieves = [ + { + 'url': 'http://www.openslr.org/resources/33/data_aishell.tgz', + 'md5': '2f494334227864a8a8fec932999db9d8', + }, + ] + text_meta = os.path.join('data_aishell', 'transcript', + 'aishell_transcript_v0.8.txt') + utt_info = collections.namedtuple('META_INFO', + ('file_path', 'utt_id', 'text')) + audio_path = os.path.join('data_aishell', 'wav') + manifest_path = os.path.join('data_aishell', 'manifest') + subset = ['train', 'dev', 'test'] + + def __init__(self, subset: str='train', feat_type: str='raw', **kwargs): + assert subset in self.subset, 'Dataset subset must be one in {}, but got {}'.format( + self.subset, subset) + self.subset = subset + self.feat_type = feat_type + self.feat_config = kwargs + self._data = self._get_data() + super(AISHELL1, self).__init__() + + def _get_text_info(self) -> Dict[str, str]: + ret = {} + with open(os.path.join(DATA_HOME, self.text_meta), 'r') as rf: + for line in rf.readlines()[1:]: + utt_id, text = map(str.strip, line.split(' ', + 1)) # utt_id, text + ret.update({utt_id: ''.join(text.split())}) + return ret + + def _get_data(self): + if not os.path.isdir(os.path.join(DATA_HOME, self.audio_path)) or \ + not os.path.isfile(os.path.join(DATA_HOME, self.text_meta)): + download_and_decompress(self.archieves, DATA_HOME) + # Extract *wav from *.tar.gz. 
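+            # Note: the downloaded archive contains nested *.tar.gz files; each one
+            # is decompressed in place below and then removed to save disk space.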
+ for root, _, files in os.walk( + os.path.join(DATA_HOME, self.audio_path)): + for file in files: + if file.endswith('.tar.gz'): + decompress(os.path.join(root, file)) + os.remove(os.path.join(root, file)) + + text_info = self._get_text_info() + + data = [] + for root, _, files in os.walk( + os.path.join(DATA_HOME, self.audio_path, self.subset)): + for file in files: + if file.endswith('.wav'): + utt_id = os.path.splitext(file)[0] + if utt_id not in text_info: # There are some utt_id that without label + continue + text = text_info[utt_id] + file_path = os.path.join(root, file) + data.append(self.utt_info(file_path, utt_id, text)) + + return data + + def _convert_to_record(self, idx: int): + sample = self._data[idx] + + record = {} + # To show all fields in a namedtuple: `type(sample)._fields` + for field in type(sample)._fields: + record[field] = getattr(sample, field) + + waveform, sr = load_audio( + sample[0]) # The first element of sample is file path + feat_func = feat_funcs[self.feat_type] + feat = feat_func( + waveform, sample_rate=sr, + **self.feat_config) if feat_func else waveform + record.update({'feat': feat, 'duration': len(waveform) / sr}) + return record + + def create_manifest(self, prefix='manifest'): + if not os.path.isdir(os.path.join(DATA_HOME, self.manifest_path)): + os.makedirs(os.path.join(DATA_HOME, self.manifest_path)) + + manifest_file = os.path.join(DATA_HOME, self.manifest_path, + f'{prefix}.{self.subset}') + with codecs.open(manifest_file, 'w', 'utf-8') as f: + for idx in tqdm(range(len(self))): + record = self._convert_to_record(idx) + record_line = json.dumps( + { + 'utt': record['utt_id'], + 'feat': record['file_path'], + 'feat_shape': (record['duration'], ), + 'text': record['text'] + }, + ensure_ascii=False) + f.write(record_line + '\n') + logger.info(f'Manifest file {manifest_file} created.') + + def __getitem__(self, idx): + record = self._convert_to_record(idx) + return tuple(record.values()) + + def __len__(self): + return len(self._data) diff --git a/paddleaudio/paddleaudio/datasets/dataset.py b/paddleaudio/paddleaudio/datasets/dataset.py new file mode 100644 index 00000000..fb521bea --- /dev/null +++ b/paddleaudio/paddleaudio/datasets/dataset.py @@ -0,0 +1,82 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +from typing import List + +import numpy as np +import paddle + +from ..backends import load as load_audio +from ..features import melspectrogram +from ..features import mfcc + +feat_funcs = { + 'raw': None, + 'melspectrogram': melspectrogram, + 'mfcc': mfcc, +} + + +class AudioClassificationDataset(paddle.io.Dataset): + """ + Base class of audio classification dataset. + """ + + def __init__(self, + files: List[str], + labels: List[int], + feat_type: str='raw', + **kwargs): + """ + Ags: + files (:obj:`List[str]`): A list of absolute path of audio files. + labels (:obj:`List[int]`): Labels of audio files. 
+ feat_type (:obj:`str`, `optional`, defaults to `raw`): + It identifies the feature type that user wants to extrace of an audio file. + """ + super(AudioClassificationDataset, self).__init__() + + if feat_type not in feat_funcs.keys(): + raise RuntimeError( + f"Unknown feat_type: {feat_type}, it must be one in {list(feat_funcs.keys())}" + ) + + self.files = files + self.labels = labels + + self.feat_type = feat_type + self.feat_config = kwargs # Pass keyword arguments to customize feature config + + def _get_data(self, input_file: str): + raise NotImplementedError + + def _convert_to_record(self, idx): + file, label = self.files[idx], self.labels[idx] + + waveform, sample_rate = load_audio(file) + feat_func = feat_funcs[self.feat_type] + + record = {} + record['feat'] = feat_func( + waveform, sample_rate, + **self.feat_config) if feat_func else waveform + record['label'] = label + return record + + def __getitem__(self, idx): + record = self._convert_to_record(idx) + return np.array(record['feat']).transpose(), np.array( + record['label'], dtype=np.int64) + + def __len__(self): + return len(self.files) diff --git a/paddleaudio/paddleaudio/datasets/dcase.py b/paddleaudio/paddleaudio/datasets/dcase.py new file mode 100644 index 00000000..47b0c915 --- /dev/null +++ b/paddleaudio/paddleaudio/datasets/dcase.py @@ -0,0 +1,298 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import collections +import os +from typing import List +from typing import Tuple + +from ..utils.download import download_and_decompress +from ..utils.env import DATA_HOME +from .dataset import AudioClassificationDataset + +__all__ = ['UrbanAcousticScenes', 'UrbanAudioVisualScenes'] + + +class UrbanAcousticScenes(AudioClassificationDataset): + """ + TAU Urban Acoustic Scenes 2020 Mobile Development dataset contains recordings from + 12 European cities in 10 different acoustic scenes using 4 different devices. + Additionally, synthetic data for 11 mobile devices was created based on the original + recordings. Of the 12 cities, two are present only in the evaluation set. 
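+
+    Example (illustrative; downloads the dataset on first use):
+        >>> train_ds = UrbanAcousticScenes(mode='train', feat_type='melspectrogram')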
+ + Reference: + A multi-device dataset for urban acoustic scene classification + https://arxiv.org/abs/1807.09840 + """ + + source_url = 'https://zenodo.org/record/3819968/files/' + base_name = 'TAU-urban-acoustic-scenes-2020-mobile-development' + archieves = [ + { + 'url': source_url + base_name + '.meta.zip', + 'md5': '6eae9db553ce48e4ea246e34e50a3cf5', + }, + { + 'url': source_url + base_name + '.audio.1.zip', + 'md5': 'b1e85b8a908d3d6a6ab73268f385d5c8', + }, + { + 'url': source_url + base_name + '.audio.2.zip', + 'md5': '4310a13cc2943d6ce3f70eba7ba4c784', + }, + { + 'url': source_url + base_name + '.audio.3.zip', + 'md5': 'ed38956c4246abb56190c1e9b602b7b8', + }, + { + 'url': source_url + base_name + '.audio.4.zip', + 'md5': '97ab8560056b6816808dedc044dcc023', + }, + { + 'url': source_url + base_name + '.audio.5.zip', + 'md5': 'b50f5e0bfed33cd8e52cb3e7f815c6cb', + }, + { + 'url': source_url + base_name + '.audio.6.zip', + 'md5': 'fbf856a3a86fff7520549c899dc94372', + }, + { + 'url': source_url + base_name + '.audio.7.zip', + 'md5': '0dbffe7b6e45564da649378723284062', + }, + { + 'url': source_url + base_name + '.audio.8.zip', + 'md5': 'bb6f77832bf0bd9f786f965beb251b2e', + }, + { + 'url': source_url + base_name + '.audio.9.zip', + 'md5': 'a65596a5372eab10c78e08a0de797c9e', + }, + { + 'url': source_url + base_name + '.audio.10.zip', + 'md5': '2ad595819ffa1d56d2de4c7ed43205a6', + }, + { + 'url': source_url + base_name + '.audio.11.zip', + 'md5': '0ad29f7040a4e6a22cfd639b3a6738e5', + }, + { + 'url': source_url + base_name + '.audio.12.zip', + 'md5': 'e5f4400c6b9697295fab4cf507155a2f', + }, + { + 'url': source_url + base_name + '.audio.13.zip', + 'md5': '8855ab9f9896422746ab4c5d89d8da2f', + }, + { + 'url': source_url + base_name + '.audio.14.zip', + 'md5': '092ad744452cd3e7de78f988a3d13020', + }, + { + 'url': source_url + base_name + '.audio.15.zip', + 'md5': '4b5eb85f6592aebf846088d9df76b420', + }, + { + 'url': source_url + base_name + '.audio.16.zip', + 'md5': '2e0a89723e58a3836be019e6996ae460', + }, + ] + label_list = [ + 'airport', 'shopping_mall', 'metro_station', 'street_pedestrian', + 'public_square', 'street_traffic', 'tram', 'bus', 'metro', 'park' + ] + + meta = os.path.join(base_name, 'meta.csv') + meta_info = collections.namedtuple('META_INFO', ( + 'filename', 'scene_label', 'identifier', 'source_label')) + subset_meta = { + 'train': os.path.join(base_name, 'evaluation_setup', 'fold1_train.csv'), + 'dev': + os.path.join(base_name, 'evaluation_setup', 'fold1_evaluate.csv'), + 'test': os.path.join(base_name, 'evaluation_setup', 'fold1_test.csv'), + } + subset_meta_info = collections.namedtuple('SUBSET_META_INFO', + ('filename', 'scene_label')) + audio_path = os.path.join(base_name, 'audio') + + def __init__(self, mode: str='train', feat_type: str='raw', **kwargs): + """ + Ags: + mode (:obj:`str`, `optional`, defaults to `train`): + It identifies the dataset mode (train or dev). + feat_type (:obj:`str`, `optional`, defaults to `raw`): + It identifies the feature type that user wants to extrace of an audio file. + """ + files, labels = self._get_data(mode) + super(UrbanAcousticScenes, self).__init__( + files=files, labels=labels, feat_type=feat_type, **kwargs) + + def _get_meta_info(self, subset: str=None, + skip_header: bool=True) -> List[collections.namedtuple]: + if subset is None: + meta_file = self.meta + meta_info = self.meta_info + else: + assert subset in self.subset_meta, f'Subset must be one in {list(self.subset_meta.keys())}, but got {subset}.' 
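+            # A named subset reads its fold CSV under evaluation_setup/ instead of the full meta.csv.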
+ meta_file = self.subset_meta[subset] + meta_info = self.subset_meta_info + + ret = [] + with open(os.path.join(DATA_HOME, meta_file), 'r') as rf: + lines = rf.readlines()[1:] if skip_header else rf.readlines() + for line in lines: + ret.append(meta_info(*line.strip().split('\t'))) + return ret + + def _get_data(self, mode: str) -> Tuple[List[str], List[int]]: + if not os.path.isdir(os.path.join(DATA_HOME, self.audio_path)) or \ + not os.path.isfile(os.path.join(DATA_HOME, self.meta)): + download_and_decompress(self.archieves, DATA_HOME) + + meta_info = self._get_meta_info(subset=mode, skip_header=True) + + files = [] + labels = [] + for sample in meta_info: + filename, label = sample[:2] + filename = os.path.basename(filename) + target = self.label_list.index(label) + + files.append(os.path.join(DATA_HOME, self.audio_path, filename)) + labels.append(int(target)) + + return files, labels + + +class UrbanAudioVisualScenes(AudioClassificationDataset): + """ + TAU Urban Audio Visual Scenes 2021 Development dataset contains synchronized audio + and video recordings from 12 European cities in 10 different scenes. + This dataset consists of 10-seconds audio and video segments from 10 + acoustic scenes. The total amount of audio in the development set is 34 hours. + + Reference: + A Curated Dataset of Urban Scenes for Audio-Visual Scene Analysis + https://arxiv.org/abs/2011.00030 + """ + + source_url = 'https://zenodo.org/record/4477542/files/' + base_name = 'TAU-urban-audio-visual-scenes-2021-development' + + archieves = [ + { + 'url': source_url + base_name + '.meta.zip', + 'md5': '76e3d7ed5291b118372e06379cb2b490', + }, + { + 'url': source_url + base_name + '.audio.1.zip', + 'md5': '186f6273f8f69ed9dbdc18ad65ac234f', + }, + { + 'url': source_url + base_name + '.audio.2.zip', + 'md5': '7fd6bb63127f5785874a55aba4e77aa5', + }, + { + 'url': source_url + base_name + '.audio.3.zip', + 'md5': '61396bede29d7c8c89729a01a6f6b2e2', + }, + { + 'url': source_url + base_name + '.audio.4.zip', + 'md5': '6ddac89717fcf9c92c451868eed77fe1', + }, + { + 'url': source_url + base_name + '.audio.5.zip', + 'md5': 'af4820756cdf1a7d4bd6037dc034d384', + }, + { + 'url': source_url + base_name + '.audio.6.zip', + 'md5': 'ebd11ec24411f2a17a64723bd4aa7fff', + }, + { + 'url': source_url + base_name + '.audio.7.zip', + 'md5': '2be39a76aeed704d5929d020a2909efd', + }, + { + 'url': source_url + base_name + '.audio.8.zip', + 'md5': '972d8afe0874720fc2f28086e7cb22a9', + }, + ] + label_list = [ + 'airport', 'shopping_mall', 'metro_station', 'street_pedestrian', + 'public_square', 'street_traffic', 'tram', 'bus', 'metro', 'park' + ] + + meta_base_path = os.path.join(base_name, base_name + '.meta') + meta = os.path.join(meta_base_path, 'meta.csv') + meta_info = collections.namedtuple('META_INFO', ( + 'filename_audio', 'filename_video', 'scene_label', 'identifier')) + subset_meta = { + 'train': + os.path.join(meta_base_path, 'evaluation_setup', 'fold1_train.csv'), + 'dev': + os.path.join(meta_base_path, 'evaluation_setup', 'fold1_evaluate.csv'), + 'test': + os.path.join(meta_base_path, 'evaluation_setup', 'fold1_test.csv'), + } + subset_meta_info = collections.namedtuple('SUBSET_META_INFO', ( + 'filename_audio', 'filename_video', 'scene_label')) + audio_path = os.path.join(base_name, 'audio') + + def __init__(self, mode: str='train', feat_type: str='raw', **kwargs): + """ + Ags: + mode (:obj:`str`, `optional`, defaults to `train`): + It identifies the dataset mode (train or dev). 
+ feat_type (:obj:`str`, `optional`, defaults to `raw`): + It identifies the feature type that user wants to extrace of an audio file. + """ + files, labels = self._get_data(mode) + super(UrbanAudioVisualScenes, self).__init__( + files=files, labels=labels, feat_type=feat_type, **kwargs) + + def _get_meta_info(self, subset: str=None, + skip_header: bool=True) -> List[collections.namedtuple]: + if subset is None: + meta_file = self.meta + meta_info = self.meta_info + else: + assert subset in self.subset_meta, f'Subset must be one in {list(self.subset_meta.keys())}, but got {subset}.' + meta_file = self.subset_meta[subset] + meta_info = self.subset_meta_info + + ret = [] + with open(os.path.join(DATA_HOME, meta_file), 'r') as rf: + lines = rf.readlines()[1:] if skip_header else rf.readlines() + for line in lines: + ret.append(meta_info(*line.strip().split('\t'))) + return ret + + def _get_data(self, mode: str) -> Tuple[List[str], List[int]]: + if not os.path.isdir(os.path.join(DATA_HOME, self.audio_path)) or \ + not os.path.isfile(os.path.join(DATA_HOME, self.meta)): + download_and_decompress(self.archieves, + os.path.join(DATA_HOME, self.base_name)) + + meta_info = self._get_meta_info(subset=mode, skip_header=True) + + files = [] + labels = [] + for sample in meta_info: + filename, _, label = sample[:3] + filename = os.path.basename(filename) + target = self.label_list.index(label) + + files.append(os.path.join(DATA_HOME, self.audio_path, filename)) + labels.append(int(target)) + + return files, labels diff --git a/paddleaudio/paddleaudio/datasets/esc50.py b/paddleaudio/paddleaudio/datasets/esc50.py new file mode 100644 index 00000000..e7477d40 --- /dev/null +++ b/paddleaudio/paddleaudio/datasets/esc50.py @@ -0,0 +1,152 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import collections +import os +from typing import List +from typing import Tuple + +from ..utils.download import download_and_decompress +from ..utils.env import DATA_HOME +from .dataset import AudioClassificationDataset + +__all__ = ['ESC50'] + + +class ESC50(AudioClassificationDataset): + """ + The ESC-50 dataset is a labeled collection of 2000 environmental audio recordings + suitable for benchmarking methods of environmental sound classification. 
The dataset + consists of 5-second-long recordings organized into 50 semantical classes (with + 40 examples per class) + + Reference: + ESC: Dataset for Environmental Sound Classification + http://dx.doi.org/10.1145/2733373.2806390 + """ + + archieves = [ + { + 'url': + 'https://paddleaudio.bj.bcebos.com/datasets/ESC-50-master.zip', + 'md5': '7771e4b9d86d0945acce719c7a59305a', + }, + ] + label_list = [ + # Animals + 'Dog', + 'Rooster', + 'Pig', + 'Cow', + 'Frog', + 'Cat', + 'Hen', + 'Insects (flying)', + 'Sheep', + 'Crow', + # Natural soundscapes & water sounds + 'Rain', + 'Sea waves', + 'Crackling fire', + 'Crickets', + 'Chirping birds', + 'Water drops', + 'Wind', + 'Pouring water', + 'Toilet flush', + 'Thunderstorm', + # Human, non-speech sounds + 'Crying baby', + 'Sneezing', + 'Clapping', + 'Breathing', + 'Coughing', + 'Footsteps', + 'Laughing', + 'Brushing teeth', + 'Snoring', + 'Drinking, sipping', + # Interior/domestic sounds + 'Door knock', + 'Mouse click', + 'Keyboard typing', + 'Door, wood creaks', + 'Can opening', + 'Washing machine', + 'Vacuum cleaner', + 'Clock alarm', + 'Clock tick', + 'Glass breaking', + # Exterior/urban noises + 'Helicopter', + 'Chainsaw', + 'Siren', + 'Car horn', + 'Engine', + 'Train', + 'Church bells', + 'Airplane', + 'Fireworks', + 'Hand saw', + ] + meta = os.path.join('ESC-50-master', 'meta', 'esc50.csv') + meta_info = collections.namedtuple( + 'META_INFO', + ('filename', 'fold', 'target', 'category', 'esc10', 'src_file', 'take')) + audio_path = os.path.join('ESC-50-master', 'audio') + + def __init__(self, + mode: str='train', + split: int=1, + feat_type: str='raw', + **kwargs): + """ + Ags: + mode (:obj:`str`, `optional`, defaults to `train`): + It identifies the dataset mode (train or dev). + split (:obj:`int`, `optional`, defaults to 1): + It specify the fold of dev dataset. + feat_type (:obj:`str`, `optional`, defaults to `raw`): + It identifies the feature type that user wants to extrace of an audio file. + """ + files, labels = self._get_data(mode, split) + super(ESC50, self).__init__( + files=files, labels=labels, feat_type=feat_type, **kwargs) + + def _get_meta_info(self) -> List[collections.namedtuple]: + ret = [] + with open(os.path.join(DATA_HOME, self.meta), 'r') as rf: + for line in rf.readlines()[1:]: + ret.append(self.meta_info(*line.strip().split(','))) + return ret + + def _get_data(self, mode: str, split: int) -> Tuple[List[str], List[int]]: + if not os.path.isdir(os.path.join(DATA_HOME, self.audio_path)) or \ + not os.path.isfile(os.path.join(DATA_HOME, self.meta)): + download_and_decompress(self.archieves, DATA_HOME) + + meta_info = self._get_meta_info() + + files = [] + labels = [] + for sample in meta_info: + filename, fold, target, _, _, _, _ = sample + if mode == 'train' and int(fold) != split: + files.append(os.path.join(DATA_HOME, self.audio_path, filename)) + labels.append(int(target)) + + if mode != 'train' and int(fold) == split: + files.append(os.path.join(DATA_HOME, self.audio_path, filename)) + labels.append(int(target)) + + return files, labels diff --git a/paddleaudio/paddleaudio/datasets/gtzan.py b/paddleaudio/paddleaudio/datasets/gtzan.py new file mode 100644 index 00000000..cfea6f37 --- /dev/null +++ b/paddleaudio/paddleaudio/datasets/gtzan.py @@ -0,0 +1,115 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. 
+# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import collections +import os +import random +from typing import List +from typing import Tuple + +from ..utils.download import download_and_decompress +from ..utils.env import DATA_HOME +from .dataset import AudioClassificationDataset + +__all__ = ['GTZAN'] + + +class GTZAN(AudioClassificationDataset): + """ + The GTZAN dataset consists of 1000 audio tracks each 30 seconds long. It contains 10 genres, + each represented by 100 tracks. The dataset is the most-used public dataset for evaluation + in machine listening research for music genre recognition (MGR). + + Reference: + Musical genre classification of audio signals + https://ieeexplore.ieee.org/document/1021072/ + """ + + archieves = [ + { + 'url': 'http://opihi.cs.uvic.ca/sound/genres.tar.gz', + 'md5': '5b3d6dddb579ab49814ab86dba69e7c7', + }, + ] + label_list = [ + 'blues', 'classical', 'country', 'disco', 'hiphop', 'jazz', 'metal', + 'pop', 'reggae', 'rock' + ] + meta = os.path.join('genres', 'input.mf') + meta_info = collections.namedtuple('META_INFO', ('file_path', 'label')) + audio_path = 'genres' + + def __init__(self, + mode='train', + seed=0, + n_folds=5, + split=1, + feat_type='raw', + **kwargs): + """ + Ags: + mode (:obj:`str`, `optional`, defaults to `train`): + It identifies the dataset mode (train or dev). + seed (:obj:`int`, `optional`, defaults to 0): + Set the random seed to shuffle samples. + n_folds (:obj:`int`, `optional`, defaults to 5): + Split the dataset into n folds. 1 fold for dev dataset and n-1 for train dataset. + split (:obj:`int`, `optional`, defaults to 1): + It specify the fold of dev dataset. + feat_type (:obj:`str`, `optional`, defaults to `raw`): + It identifies the feature type that user wants to extrace of an audio file. 
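+
+        Example (illustrative):
+            >>> train_ds = GTZAN(mode='train', split=1, feat_type='melspectrogram')
+            >>> dev_ds = GTZAN(mode='dev', split=1, feat_type='melspectrogram')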
+ """ + assert split <= n_folds, f'The selected split should not be larger than n_fold, but got {split} > {n_folds}' + files, labels = self._get_data(mode, seed, n_folds, split) + super(GTZAN, self).__init__( + files=files, labels=labels, feat_type=feat_type, **kwargs) + + def _get_meta_info(self) -> List[collections.namedtuple]: + ret = [] + with open(os.path.join(DATA_HOME, self.meta), 'r') as rf: + for line in rf.readlines(): + ret.append(self.meta_info(*line.strip().split('\t'))) + return ret + + def _get_data(self, mode, seed, n_folds, + split) -> Tuple[List[str], List[int]]: + if not os.path.isdir(os.path.join(DATA_HOME, self.audio_path)) or \ + not os.path.isfile(os.path.join(DATA_HOME, self.meta)): + download_and_decompress(self.archieves, DATA_HOME) + + meta_info = self._get_meta_info() + random.seed(seed) # shuffle samples to split data + random.shuffle( + meta_info + ) # make sure using the same seed to create train and dev dataset + + files = [] + labels = [] + n_samples_per_fold = len(meta_info) // n_folds + for idx, sample in enumerate(meta_info): + file_path, label = sample + filename = os.path.basename(file_path) + target = self.label_list.index(label) + fold = idx // n_samples_per_fold + 1 + + if mode == 'train' and int(fold) != split: + files.append( + os.path.join(DATA_HOME, self.audio_path, label, filename)) + labels.append(target) + + if mode != 'train' and int(fold) == split: + files.append( + os.path.join(DATA_HOME, self.audio_path, label, filename)) + labels.append(target) + + return files, labels diff --git a/paddleaudio/paddleaudio/datasets/librispeech.py b/paddleaudio/paddleaudio/datasets/librispeech.py new file mode 100644 index 00000000..c3b3c83d --- /dev/null +++ b/paddleaudio/paddleaudio/datasets/librispeech.py @@ -0,0 +1,199 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import codecs +import collections +import json +import os +from typing import Dict + +from paddle.io import Dataset +from tqdm import tqdm + +from ..backends import load as load_audio +from ..utils.download import download_and_decompress +from ..utils.env import DATA_HOME +from ..utils.log import logger +from .dataset import feat_funcs + +__all__ = ['LIBRISPEECH'] + + +class LIBRISPEECH(Dataset): + """ + LibriSpeech is a corpus of approximately 1000 hours of 16kHz read English speech, + prepared by Vassil Panayotov with the assistance of Daniel Povey. The data is + derived from read audiobooks from the LibriVox project, and has been carefully + segmented and aligned. 
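+
+    Example (illustrative; downloads the subset on first use):
+        >>> dev_ds = LIBRISPEECH(subset='dev-clean')
+        >>> dev_ds.create_manifest()  # writes a JSON-lines manifest under DATA_HOME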
+ + Reference: + LIBRISPEECH: AN ASR CORPUS BASED ON PUBLIC DOMAIN AUDIO BOOKS + http://www.danielpovey.com/files/2015_icassp_librispeech.pdf + https://arxiv.org/abs/1709.05522 + """ + + source_url = 'http://www.openslr.org/resources/12/' + archieves = [ + { + 'url': source_url + 'train-clean-100.tar.gz', + 'md5': '2a93770f6d5c6c964bc36631d331a522', + }, + { + 'url': source_url + 'train-clean-360.tar.gz', + 'md5': 'c0e676e450a7ff2f54aeade5171606fa', + }, + { + 'url': source_url + 'train-other-500.tar.gz', + 'md5': 'd1a0fd59409feb2c614ce4d30c387708', + }, + { + 'url': source_url + 'dev-clean.tar.gz', + 'md5': '42e2234ba48799c1f50f24a7926300a1', + }, + { + 'url': source_url + 'dev-other.tar.gz', + 'md5': 'c8d0bcc9cca99d4f8b62fcc847357931', + }, + { + 'url': source_url + 'test-clean.tar.gz', + 'md5': '32fa31d27d2e1cad72775fee3f4849a9', + }, + { + 'url': source_url + 'test-other.tar.gz', + 'md5': 'fb5a50374b501bb3bac4815ee91d3135', + }, + ] + speaker_meta = os.path.join('LibriSpeech', 'SPEAKERS.TXT') + utt_info = collections.namedtuple('META_INFO', ( + 'file_path', 'utt_id', 'text', 'spk_id', 'spk_gender')) + audio_path = 'LibriSpeech' + manifest_path = os.path.join('LibriSpeech', 'manifest') + subset = [ + 'train-clean-100', 'train-clean-360', 'train-clean-500', 'dev-clean', + 'dev-other', 'test-clean', 'test-other' + ] + + def __init__(self, + subset: str='train-clean-100', + feat_type: str='raw', + **kwargs): + assert subset in self.subset, 'Dataset subset must be one in {}, but got {}'.format( + self.subset, subset) + self.subset = subset + self.feat_type = feat_type + self.feat_config = kwargs + self._data = self._get_data() + super(LIBRISPEECH, self).__init__() + + def _get_speaker_info(self) -> Dict[str, str]: + ret = {} + with open(os.path.join(DATA_HOME, self.speaker_meta), 'r') as rf: + for line in rf.readlines(): + if ';' in line: # Skip dataset abstract + continue + spk_id, gender = map(str.strip, + line.split('|')[:2]) # spk_id, gender + ret.update({spk_id: gender}) + return ret + + def _get_text_info(self, trans_file) -> Dict[str, str]: + ret = {} + with open(trans_file, 'r') as rf: + for line in rf.readlines(): + utt_id, text = map(str.strip, line.split(' ', + 1)) # utt_id, text + ret.update({utt_id: text}) + return ret + + def _get_data(self): + if not os.path.isdir(os.path.join(DATA_HOME, self.audio_path)) or \ + not os.path.isfile(os.path.join(DATA_HOME, self.speaker_meta)): + download_and_decompress(self.archieves, DATA_HOME, + len(self.archieves)) + + # Speaker info + speaker_info = self._get_speaker_info() + + # Text info + text_info = {} + for root, _, files in os.walk( + os.path.join(DATA_HOME, self.audio_path, self.subset)): + for file in files: + if file.endswith('.trans.txt'): + text_info.update( + self._get_text_info(os.path.join(root, file))) + + data = [] + for root, _, files in os.walk( + os.path.join(DATA_HOME, self.audio_path, self.subset)): + for file in files: + if file.endswith('.flac'): + utt_id = os.path.splitext(file)[0] + spk_id = utt_id.split('-')[0] + if utt_id not in text_info \ + or spk_id not in speaker_info : # Skip samples with incomplete data + continue + file_path = os.path.join(root, file) + text = text_info[utt_id] + spk_gender = speaker_info[spk_id] + data.append( + self.utt_info(file_path, utt_id, text, spk_id, + spk_gender)) + + return data + + def _convert_to_record(self, idx: int): + sample = self._data[idx] + + record = {} + # To show all fields in a namedtuple: `type(sample)._fields` + for field in type(sample)._fields: + record[field] = 
getattr(sample, field) + + waveform, sr = load_audio( + sample[0]) # The first element of sample is file path + feat_func = feat_funcs[self.feat_type] + feat = feat_func( + waveform, sample_rate=sr, + **self.feat_config) if feat_func else waveform + record.update({'feat': feat, 'duration': len(waveform) / sr}) + return record + + def create_manifest(self, prefix='manifest'): + if not os.path.isdir(os.path.join(DATA_HOME, self.manifest_path)): + os.makedirs(os.path.join(DATA_HOME, self.manifest_path)) + + manifest_file = os.path.join(DATA_HOME, self.manifest_path, + f'{prefix}.{self.subset}') + with codecs.open(manifest_file, 'w', 'utf-8') as f: + for idx in tqdm(range(len(self))): + record = self._convert_to_record(idx) + record_line = json.dumps( + { + 'utt': record['utt_id'], + 'feat': record['file_path'], + 'feat_shape': (record['duration'], ), + 'text': record['text'], + 'spk': record['spk_id'], + 'gender': record['spk_gender'], + }, + ensure_ascii=False) + f.write(record_line + '\n') + logger.info(f'Manifest file {manifest_file} created.') + + def __getitem__(self, idx): + record = self._convert_to_record(idx) + return tuple(record.values()) + + def __len__(self): + return len(self._data) diff --git a/paddleaudio/paddleaudio/datasets/ravdess.py b/paddleaudio/paddleaudio/datasets/ravdess.py new file mode 100644 index 00000000..d886aad2 --- /dev/null +++ b/paddleaudio/paddleaudio/datasets/ravdess.py @@ -0,0 +1,136 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import collections +import os +import random +from typing import List +from typing import Tuple + +from ..utils.download import download_and_decompress +from ..utils.env import DATA_HOME +from .dataset import AudioClassificationDataset + +__all__ = ['RAVDESS'] + + +class RAVDESS(AudioClassificationDataset): + """ + The RAVDESS contains 24 professional actors (12 female, 12 male), vocalizing two + lexically-matched statements in a neutral North American accent. Speech emotions + includes calm, happy, sad, angry, fearful, surprise, and disgust expressions. + Each expression is produced at two levels of emotional intensity (normal, strong), + with an additional neutral expression. 
+ + Reference: + The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): + A dynamic, multimodal set of facial and vocal expressions in North American English + https://doi.org/10.1371/journal.pone.0196391 + """ + + archieves = [ + { + 'url': + 'https://zenodo.org/record/1188976/files/Audio_Song_Actors_01-24.zip', + 'md5': + '5411230427d67a21e18aa4d466e6d1b9', + }, + { + 'url': + 'https://zenodo.org/record/1188976/files/Audio_Speech_Actors_01-24.zip', + 'md5': + 'bc696df654c87fed845eb13823edef8a', + }, + ] + label_list = [ + 'neutral', 'calm', 'happy', 'sad', 'angry', 'fearful', 'disgust', + 'surprised' + ] + meta_info = collections.namedtuple( + 'META_INFO', ('modality', 'vocal_channel', 'emotion', + 'emotion_intensity', 'statement', 'repitition', 'actor')) + speech_path = os.path.join(DATA_HOME, 'Audio_Speech_Actors_01-24') + song_path = os.path.join(DATA_HOME, 'Audio_Song_Actors_01-24') + + def __init__(self, + mode='train', + seed=0, + n_folds=5, + split=1, + feat_type='raw', + **kwargs): + """ + Ags: + mode (:obj:`str`, `optional`, defaults to `train`): + It identifies the dataset mode (train or dev). + seed (:obj:`int`, `optional`, defaults to 0): + Set the random seed to shuffle samples. + n_folds (:obj:`int`, `optional`, defaults to 5): + Split the dataset into n folds. 1 fold for dev dataset and n-1 for train dataset. + split (:obj:`int`, `optional`, defaults to 1): + It specify the fold of dev dataset. + feat_type (:obj:`str`, `optional`, defaults to `raw`): + It identifies the feature type that user wants to extrace of an audio file. + """ + assert split <= n_folds, f'The selected split should not be larger than n_fold, but got {split} > {n_folds}' + files, labels = self._get_data(mode, seed, n_folds, split) + super(RAVDESS, self).__init__( + files=files, labels=labels, feat_type=feat_type, **kwargs) + + def _get_meta_info(self, files) -> List[collections.namedtuple]: + ret = [] + for file in files: + basename_without_extend = os.path.basename(file)[:-4] + ret.append(self.meta_info(*basename_without_extend.split('-'))) + return ret + + def _get_data(self, mode, seed, n_folds, + split) -> Tuple[List[str], List[int]]: + if not os.path.isdir(self.speech_path) and not os.path.isdir( + self.song_path): + download_and_decompress(self.archieves, DATA_HOME) + + wav_files = [] + for root, _, files in os.walk(self.speech_path): + for file in files: + if file.endswith('.wav'): + wav_files.append(os.path.join(root, file)) + + for root, _, files in os.walk(self.song_path): + for file in files: + if file.endswith('.wav'): + wav_files.append(os.path.join(root, file)) + + random.seed(seed) # shuffle samples to split data + random.shuffle( + wav_files + ) # make sure using the same seed to create train and dev dataset + meta_info = self._get_meta_info(wav_files) + + files = [] + labels = [] + n_samples_per_fold = len(meta_info) // n_folds + for idx, sample in enumerate(meta_info): + _, _, emotion, _, _, _, _ = sample + target = int(emotion) - 1 + fold = idx // n_samples_per_fold + 1 + + if mode == 'train' and int(fold) != split: + files.append(wav_files[idx]) + labels.append(target) + + if mode != 'train' and int(fold) == split: + files.append(wav_files[idx]) + labels.append(target) + + return files, labels diff --git a/paddleaudio/paddleaudio/datasets/tess.py b/paddleaudio/paddleaudio/datasets/tess.py new file mode 100644 index 00000000..8faab9c3 --- /dev/null +++ b/paddleaudio/paddleaudio/datasets/tess.py @@ -0,0 +1,126 @@ +# Copyright (c) 2021 PaddlePaddle Authors. 
All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import collections +import os +import random +from typing import List +from typing import Tuple + +from ..utils.download import download_and_decompress +from ..utils.env import DATA_HOME +from .dataset import AudioClassificationDataset + +__all__ = ['TESS'] + + +class TESS(AudioClassificationDataset): + """ + TESS is a set of 200 target words were spoken in the carrier phrase + "Say the word _____' by two actresses (aged 26 and 64 years) and + recordings were made of the set portraying each of seven emotions(anger, + disgust, fear, happiness, pleasant surprise, sadness, and neutral). + There are 2800 stimuli in total. + + Reference: + Toronto emotional speech set (TESS) + https://doi.org/10.5683/SP2/E8H2MF + """ + + archieves = [ + { + 'url': + 'https://bj.bcebos.com/paddleaudio/datasets/TESS_Toronto_emotional_speech_set.zip', + 'md5': + '1465311b24d1de704c4c63e4ccc470c7', + }, + ] + label_list = [ + 'angry', + 'disgust', + 'fear', + 'happy', + 'neutral', + 'ps', # pleasant surprise + 'sad', + ] + meta_info = collections.namedtuple('META_INFO', + ('speaker', 'word', 'emotion')) + audio_path = 'TESS_Toronto_emotional_speech_set' + + def __init__(self, + mode='train', + seed=0, + n_folds=5, + split=1, + feat_type='raw', + **kwargs): + """ + Ags: + mode (:obj:`str`, `optional`, defaults to `train`): + It identifies the dataset mode (train or dev). + seed (:obj:`int`, `optional`, defaults to 0): + Set the random seed to shuffle samples. + n_folds (:obj:`int`, `optional`, defaults to 5): + Split the dataset into n folds. 1 fold for dev dataset and n-1 for train dataset. + split (:obj:`int`, `optional`, defaults to 1): + It specify the fold of dev dataset. + feat_type (:obj:`str`, `optional`, defaults to `raw`): + It identifies the feature type that user wants to extrace of an audio file. 
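+
+        Example (illustrative):
+            >>> ds = TESS(mode='train', feat_type='mfcc')
+            >>> feat, label = ds[0]  # transposed feature array and int64 label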
+ """ + assert split <= n_folds, f'The selected split should not be larger than n_fold, but got {split} > {n_folds}' + files, labels = self._get_data(mode, seed, n_folds, split) + super(TESS, self).__init__( + files=files, labels=labels, feat_type=feat_type, **kwargs) + + def _get_meta_info(self, files) -> List[collections.namedtuple]: + ret = [] + for file in files: + basename_without_extend = os.path.basename(file)[:-4] + ret.append(self.meta_info(*basename_without_extend.split('_'))) + return ret + + def _get_data(self, mode, seed, n_folds, + split) -> Tuple[List[str], List[int]]: + if not os.path.isdir(os.path.join(DATA_HOME, self.audio_path)): + download_and_decompress(self.archieves, DATA_HOME) + + wav_files = [] + for root, _, files in os.walk(os.path.join(DATA_HOME, self.audio_path)): + for file in files: + if file.endswith('.wav'): + wav_files.append(os.path.join(root, file)) + + random.seed(seed) # shuffle samples to split data + random.shuffle( + wav_files + ) # make sure using the same seed to create train and dev dataset + meta_info = self._get_meta_info(wav_files) + + files = [] + labels = [] + n_samples_per_fold = len(meta_info) // n_folds + for idx, sample in enumerate(meta_info): + _, _, emotion = sample + target = self.label_list.index(emotion) + fold = idx // n_samples_per_fold + 1 + + if mode == 'train' and int(fold) != split: + files.append(wav_files[idx]) + labels.append(target) + + if mode != 'train' and int(fold) == split: + files.append(wav_files[idx]) + labels.append(target) + + return files, labels diff --git a/paddleaudio/paddleaudio/datasets/urban_sound.py b/paddleaudio/paddleaudio/datasets/urban_sound.py new file mode 100644 index 00000000..d97c4d1d --- /dev/null +++ b/paddleaudio/paddleaudio/datasets/urban_sound.py @@ -0,0 +1,104 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import collections +import os +from typing import List +from typing import Tuple + +from ..utils.download import download_and_decompress +from ..utils.env import DATA_HOME +from .dataset import AudioClassificationDataset + +__all__ = ['UrbanSound8K'] + + +class UrbanSound8K(AudioClassificationDataset): + """ + UrbanSound8K dataset contains 8732 labeled sound excerpts (<=4s) of urban + sounds from 10 classes: air_conditioner, car_horn, children_playing, dog_bark, + drilling, enginge_idling, gun_shot, jackhammer, siren, and street_music. The + classes are drawn from the urban sound taxonomy. 
+ + Reference: + A Dataset and Taxonomy for Urban Sound Research + https://dl.acm.org/doi/10.1145/2647868.2655045 + """ + + archieves = [ + { + 'url': + 'https://zenodo.org/record/1203745/files/UrbanSound8K.tar.gz', + 'md5': '9aa69802bbf37fb986f71ec1483a196e', + }, + ] + label_list = [ + "air_conditioner", "car_horn", "children_playing", "dog_bark", + "drilling", "engine_idling", "gun_shot", "jackhammer", "siren", + "street_music" + ] + meta = os.path.join('UrbanSound8K', 'metadata', 'UrbanSound8K.csv') + meta_info = collections.namedtuple( + 'META_INFO', ('filename', 'fsid', 'start', 'end', 'salience', 'fold', + 'class_id', 'label')) + audio_path = os.path.join('UrbanSound8K', 'audio') + + def __init__(self, + mode: str='train', + split: int=1, + feat_type: str='raw', + **kwargs): + files, labels = self._get_data(mode, split) + super(UrbanSound8K, self).__init__( + files=files, labels=labels, feat_type=feat_type, **kwargs) + """ + Ags: + mode (:obj:`str`, `optional`, defaults to `train`): + It identifies the dataset mode (train or dev). + split (:obj:`int`, `optional`, defaults to 1): + It specify the fold of dev dataset. + feat_type (:obj:`str`, `optional`, defaults to `raw`): + It identifies the feature type that user wants to extrace of an audio file. + """ + + def _get_meta_info(self): + ret = [] + with open(os.path.join(DATA_HOME, self.meta), 'r') as rf: + for line in rf.readlines()[1:]: + ret.append(self.meta_info(*line.strip().split(','))) + return ret + + def _get_data(self, mode: str, split: int) -> Tuple[List[str], List[int]]: + if not os.path.isdir(os.path.join(DATA_HOME, self.audio_path)) or \ + not os.path.isfile(os.path.join(DATA_HOME, self.meta)): + download_and_decompress(self.archieves, DATA_HOME) + + meta_info = self._get_meta_info() + + files = [] + labels = [] + for sample in meta_info: + filename, _, _, _, _, fold, target, _ = sample + if mode == 'train' and int(fold) != split: + files.append( + os.path.join(DATA_HOME, self.audio_path, f'fold{fold}', + filename)) + labels.append(int(target)) + + if mode != 'train' and int(fold) == split: + files.append( + os.path.join(DATA_HOME, self.audio_path, f'fold{fold}', + filename)) + labels.append(int(target)) + + return files, labels diff --git a/paddleaudio/paddleaudio/features/__init__.py b/paddleaudio/paddleaudio/features/__init__.py new file mode 100644 index 00000000..8503cfab --- /dev/null +++ b/paddleaudio/paddleaudio/features/__init__.py @@ -0,0 +1,15 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +from .augment import * +from .core import * diff --git a/paddleaudio/paddleaudio/features/augment.py b/paddleaudio/paddleaudio/features/augment.py new file mode 100644 index 00000000..653e1807 --- /dev/null +++ b/paddleaudio/paddleaudio/features/augment.py @@ -0,0 +1,170 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. 
+# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +from typing import List + +import numpy as np +from numpy import ndarray as array + +from paddleaudio.backends import depth_convert +from paddleaudio.utils import ParameterError + +__all__ = [ + 'depth_augment', + 'spect_augment', + 'random_crop1d', + 'random_crop2d', + 'adaptive_spect_augment', +] + + +def randint(high: int) -> int: + """Generate one random integer in range [0 high) + + This is a helper function for random data augmentaiton + """ + return int(np.random.randint(0, high=high)) + + +def rand() -> float: + """Generate one floating-point number in range [0 1) + + This is a helper function for random data augmentaiton + """ + return float(np.random.rand(1)) + + +def depth_augment(y: array, + choices: List=['int8', 'int16'], + probs: List[float]=[0.5, 0.5]) -> array: + """ Audio depth augmentation + + Do audio depth augmentation to simulate the distortion brought by quantization. + """ + assert len(probs) == len( + choices + ), 'number of choices {} must be equal to size of probs {}'.format( + len(choices), len(probs)) + depth = np.random.choice(choices, p=probs) + src_depth = y.dtype + y1 = depth_convert(y, depth) + y2 = depth_convert(y1, src_depth) + + return y2 + + +def adaptive_spect_augment(spect: array, tempo_axis: int=0, + level: float=0.1) -> array: + """Do adpative spectrogram augmentation + + The level of the augmentation is gowern by the paramter level, + ranging from 0 to 1, with 0 represents no augmentation。 + + """ + assert spect.ndim == 2., 'only supports 2d tensor or numpy array' + if tempo_axis == 0: + nt, nf = spect.shape + else: + nf, nt = spect.shape + + time_mask_width = int(nt * level * 0.5) + freq_mask_width = int(nf * level * 0.5) + + num_time_mask = int(10 * level) + num_freq_mask = int(10 * level) + + if tempo_axis == 0: + for _ in range(num_time_mask): + start = randint(nt - time_mask_width) + spect[start:start + time_mask_width, :] = 0 + for _ in range(num_freq_mask): + start = randint(nf - freq_mask_width) + spect[:, start:start + freq_mask_width] = 0 + else: + for _ in range(num_time_mask): + start = randint(nt - time_mask_width) + spect[:, start:start + time_mask_width] = 0 + for _ in range(num_freq_mask): + start = randint(nf - freq_mask_width) + spect[start:start + freq_mask_width, :] = 0 + + return spect + + +def spect_augment(spect: array, + tempo_axis: int=0, + max_time_mask: int=3, + max_freq_mask: int=3, + max_time_mask_width: int=30, + max_freq_mask_width: int=20) -> array: + """Do spectrogram augmentation in both time and freq axis + + Reference: + + """ + assert spect.ndim == 2., 'only supports 2d tensor or numpy array' + if tempo_axis == 0: + nt, nf = spect.shape + else: + nf, nt = spect.shape + + num_time_mask = randint(max_time_mask) + num_freq_mask = randint(max_freq_mask) + + time_mask_width = randint(max_time_mask_width) + freq_mask_width = randint(max_freq_mask_width) + + if tempo_axis == 0: + for _ in range(num_time_mask): + start = randint(nt - time_mask_width) + spect[start:start + 
time_mask_width, :] = 0
+        for _ in range(num_freq_mask):
+            start = randint(nf - freq_mask_width)
+            spect[:, start:start + freq_mask_width] = 0
+    else:
+        for _ in range(num_time_mask):
+            start = randint(nt - time_mask_width)
+            spect[:, start:start + time_mask_width] = 0
+        for _ in range(num_freq_mask):
+            start = randint(nf - freq_mask_width)
+            spect[start:start + freq_mask_width, :] = 0
+
+    return spect
+
+
+def random_crop1d(y: array, crop_len: int) -> array:
+    """ Do random cropping on 1d input signal
+
+    The input is a 1d signal, typically a sound waveform
+    """
+    if y.ndim != 1:
+        raise ParameterError('only accept 1d tensor or numpy array')
+    n = len(y)
+    idx = randint(n - crop_len)
+    return y[idx:idx + crop_len]
+
+
+def random_crop2d(s: array, crop_len: int, tempo_axis: int=0) -> array:
+    """ Do random cropping for 2D array, typically a spectrogram.
+
+    The cropping is done in temporal direction on the time-freq input signal.
+    """
+    if tempo_axis >= s.ndim:
+        raise ParameterError('axis out of range')
+
+    n = s.shape[tempo_axis]
+    idx = randint(high=n - crop_len)
+    sli = [slice(None) for i in range(s.ndim)]
+    sli[tempo_axis] = slice(idx, idx + crop_len)
+    out = s[tuple(sli)]
+    return out
diff --git a/paddleaudio/paddleaudio/features/core.py b/paddleaudio/paddleaudio/features/core.py
new file mode 100644
index 00000000..7b764797
--- /dev/null
+++ b/paddleaudio/paddleaudio/features/core.py
@@ -0,0 +1,576 @@
+# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
+#
+# Licensed under the Apache License, Version 2.0 (the "License");
+# you may not use this file except in compliance with the License.
+# You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing, software
+# distributed under the License is distributed on an "AS IS" BASIS,
+# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+# See the License for the specific language governing permissions and
+# limitations under the License.
+import warnings
+from typing import List
+from typing import Optional
+from typing import Union
+
+import numpy as np
+import scipy
+from numpy import ndarray as array
+from numpy.lib.stride_tricks import as_strided
+from scipy.signal import get_window
+
+from paddleaudio.utils import ParameterError
+
+__all__ = [
+    'stft',
+    'mfcc',
+    'hz_to_mel',
+    'mel_to_hz',
+    'split_frames',
+    'mel_frequencies',
+    'power_to_db',
+    'compute_fbank_matrix',
+    'melspectrogram',
+    'spectrogram',
+    'mu_encode',
+    'mu_decode',
+]
+
+
+def pad_center(data: array, size: int, axis: int=-1, **kwargs) -> array:
+    """Pad an array to a target length along a target axis.
+
+    This differs from `np.pad` by centering the data prior to padding,
+    analogous to `str.center`
+    """
+
+    kwargs.setdefault("mode", "constant")
+    n = data.shape[axis]
+    lpad = int((size - n) // 2)
+    lengths = [(0, 0)] * data.ndim
+    lengths[axis] = (lpad, int(size - n - lpad))
+
+    if lpad < 0:
+        raise ParameterError((f"Target size ({size:d}) must be "
+                              f"at least input size ({n:d})"))
+
+    return np.pad(data, lengths, **kwargs)
+
+
+def split_frames(x: array, frame_length: int, hop_length: int,
+                 axis: int=-1) -> array:
+    """Slice a data array into (overlapping) frames.
+ + This function is aligned with librosa.frame + """ + + if not isinstance(x, np.ndarray): + raise ParameterError( + f"Input must be of type numpy.ndarray, given type(x)={type(x)}") + + if x.shape[axis] < frame_length: + raise ParameterError(f"Input is too short (n={x.shape[axis]:d})" + f" for frame_length={frame_length:d}") + + if hop_length < 1: + raise ParameterError(f"Invalid hop_length: {hop_length:d}") + + if axis == -1 and not x.flags["F_CONTIGUOUS"]: + warnings.warn(f"librosa.util.frame called with axis={axis} " + "on a non-contiguous input. This will result in a copy.") + x = np.asfortranarray(x) + elif axis == 0 and not x.flags["C_CONTIGUOUS"]: + warnings.warn(f"librosa.util.frame called with axis={axis} " + "on a non-contiguous input. This will result in a copy.") + x = np.ascontiguousarray(x) + + n_frames = 1 + (x.shape[axis] - frame_length) // hop_length + strides = np.asarray(x.strides) + + new_stride = np.prod(strides[strides > 0] // x.itemsize) * x.itemsize + + if axis == -1: + shape = list(x.shape)[:-1] + [frame_length, n_frames] + strides = list(strides) + [hop_length * new_stride] + + elif axis == 0: + shape = [n_frames, frame_length] + list(x.shape)[1:] + strides = [hop_length * new_stride] + list(strides) + + else: + raise ParameterError(f"Frame axis={axis} must be either 0 or -1") + + return as_strided(x, shape=shape, strides=strides) + + +def _check_audio(y, mono=True) -> bool: + """Determine whether a variable contains valid audio data. + + The audio y must be a np.ndarray, ether 1-channel or two channel + """ + if not isinstance(y, np.ndarray): + raise ParameterError("Audio data must be of type numpy.ndarray") + if y.ndim > 2: + raise ParameterError( + f"Invalid shape for audio ndim={y.ndim:d}, shape={y.shape}") + + if mono and y.ndim == 2: + raise ParameterError( + f"Invalid shape for mono audio ndim={y.ndim:d}, shape={y.shape}") + + if (mono and len(y) == 0) or (not mono and y.shape[1] < 0): + raise ParameterError(f"Audio is empty ndim={y.ndim:d}, shape={y.shape}") + + if not np.issubdtype(y.dtype, np.floating): + raise ParameterError("Audio data must be floating-point") + + if not np.isfinite(y).all(): + raise ParameterError("Audio buffer is not finite everywhere") + + return True + + +def hz_to_mel(frequencies: Union[float, List[float], array], + htk: bool=False) -> array: + """Convert Hz to Mels + + This function is aligned with librosa. + """ + freq = np.asanyarray(frequencies) + + if htk: + return 2595.0 * np.log10(1.0 + freq / 700.0) + + # Fill in the linear part + f_min = 0.0 + f_sp = 200.0 / 3 + + mels = (freq - f_min) / f_sp + + # Fill in the log-scale part + + min_log_hz = 1000.0 # beginning of log region (Hz) + min_log_mel = (min_log_hz - f_min) / f_sp # same (Mels) + logstep = np.log(6.4) / 27.0 # step size for log region + + if freq.ndim: + # If we have array data, vectorize + log_t = freq >= min_log_hz + mels[log_t] = min_log_mel + \ + np.log(freq[log_t] / min_log_hz) / logstep + elif freq >= min_log_hz: + # If we have scalar data, heck directly + mels = min_log_mel + np.log(freq / min_log_hz) / logstep + + return mels + + +def mel_to_hz(mels: Union[float, List[float], array], htk: int=False) -> array: + """Convert mel bin numbers to frequencies. + + This function is aligned with librosa. 
+ """ + mel_array = np.asanyarray(mels) + + if htk: + return 700.0 * (10.0**(mel_array / 2595.0) - 1.0) + + # Fill in the linear scale + f_min = 0.0 + f_sp = 200.0 / 3 + freqs = f_min + f_sp * mel_array + + # And now the nonlinear scale + min_log_hz = 1000.0 # beginning of log region (Hz) + min_log_mel = (min_log_hz - f_min) / f_sp # same (Mels) + logstep = np.log(6.4) / 27.0 # step size for log region + + if mel_array.ndim: + # If we have vector data, vectorize + log_t = mel_array >= min_log_mel + freqs[log_t] = min_log_hz * \ + np.exp(logstep * (mel_array[log_t] - min_log_mel)) + elif mel_array >= min_log_mel: + # If we have scalar data, check directly + freqs = min_log_hz * np.exp(logstep * (mel_array - min_log_mel)) + + return freqs + + +def mel_frequencies(n_mels: int=128, + fmin: float=0.0, + fmax: float=11025.0, + htk: bool=False) -> array: + """Compute mel frequencies + + This function is aligned with librosa. + """ + # 'Center freqs' of mel bands - uniformly spaced between limits + min_mel = hz_to_mel(fmin, htk=htk) + max_mel = hz_to_mel(fmax, htk=htk) + + mels = np.linspace(min_mel, max_mel, n_mels) + + return mel_to_hz(mels, htk=htk) + + +def fft_frequencies(sr: int, n_fft: int) -> array: + """Compute fourier frequencies. + + This function is aligned with librosa. + """ + return np.linspace(0, float(sr) / 2, int(1 + n_fft // 2), endpoint=True) + + +def compute_fbank_matrix(sr: int, + n_fft: int, + n_mels: int=128, + fmin: float=0.0, + fmax: Optional[float]=None, + htk: bool=False, + norm: str="slaney", + dtype: type=np.float32): + """Compute fbank matrix. + + This funciton is aligned with librosa. + """ + if norm != "slaney": + raise ParameterError('norm must set to slaney') + + if fmax is None: + fmax = float(sr) / 2 + + # Initialize the weights + n_mels = int(n_mels) + weights = np.zeros((n_mels, int(1 + n_fft // 2)), dtype=dtype) + + # Center freqs of each FFT bin + fftfreqs = fft_frequencies(sr=sr, n_fft=n_fft) + + # 'Center freqs' of mel bands - uniformly spaced between limits + mel_f = mel_frequencies(n_mels + 2, fmin=fmin, fmax=fmax, htk=htk) + + fdiff = np.diff(mel_f) + ramps = np.subtract.outer(mel_f, fftfreqs) + + for i in range(n_mels): + # lower and upper slopes for all bins + lower = -ramps[i] / fdiff[i] + upper = ramps[i + 2] / fdiff[i + 1] + + # .. then intersect them with each other and zero + weights[i] = np.maximum(0, np.minimum(lower, upper)) + + if norm == "slaney": + # Slaney-style mel is scaled to be approx constant energy per channel + enorm = 2.0 / (mel_f[2:n_mels + 2] - mel_f[:n_mels]) + weights *= enorm[:, np.newaxis] + + # Only check weights if f_mel[0] is positive + if not np.all((mel_f[:-2] == 0) | (weights.max(axis=1) > 0)): + # This means we have an empty channel somewhere + warnings.warn("Empty filters detected in mel frequency basis. " + "Some channels will produce empty responses. " + "Try increasing your sampling rate (and fmax) or " + "reducing n_mels.") + + return weights + + +def stft(x: array, + n_fft: int=2048, + hop_length: Optional[int]=None, + win_length: Optional[int]=None, + window: str="hann", + center: bool=True, + dtype: type=np.complex64, + pad_mode: str="reflect") -> array: + """Short-time Fourier transform (STFT). + + This function is aligned with librosa. 
+ """ + _check_audio(x) + # By default, use the entire frame + if win_length is None: + win_length = n_fft + + # Set the default hop, if it's not already specified + if hop_length is None: + hop_length = int(win_length // 4) + + fft_window = get_window(window, win_length, fftbins=True) + + # Pad the window out to n_fft size + fft_window = pad_center(fft_window, n_fft) + + # Reshape so that the window can be broadcast + fft_window = fft_window.reshape((-1, 1)) + + # Pad the time series so that frames are centered + if center: + if n_fft > x.shape[-1]: + warnings.warn( + f"n_fft={n_fft} is too small for input signal of length={x.shape[-1]}" + ) + x = np.pad(x, int(n_fft // 2), mode=pad_mode) + + elif n_fft > x.shape[-1]: + raise ParameterError( + f"n_fft={n_fft} is too small for input signal of length={x.shape[-1]}" + ) + + # Window the time series. + x_frames = split_frames(x, frame_length=n_fft, hop_length=hop_length) + # Pre-allocate the STFT matrix + stft_matrix = np.empty( + (int(1 + n_fft // 2), x_frames.shape[1]), dtype=dtype, order="F") + fft = np.fft # use numpy fft as default + # Constrain STFT block sizes to 256 KB + MAX_MEM_BLOCK = 2**8 * 2**10 + # how many columns can we fit within MAX_MEM_BLOCK? + n_columns = MAX_MEM_BLOCK // (stft_matrix.shape[0] * stft_matrix.itemsize) + n_columns = max(n_columns, 1) + + for bl_s in range(0, stft_matrix.shape[1], n_columns): + bl_t = min(bl_s + n_columns, stft_matrix.shape[1]) + stft_matrix[:, bl_s:bl_t] = fft.rfft( + fft_window * x_frames[:, bl_s:bl_t], axis=0) + + return stft_matrix + + +def power_to_db(spect: array, + ref: float=1.0, + amin: float=1e-10, + top_db: Optional[float]=80.0) -> array: + """Convert a power spectrogram (amplitude squared) to decibel (dB) units + + This computes the scaling ``10 * log10(spect / ref)`` in a numerically + stable way. + + This function is aligned with librosa. + """ + spect = np.asarray(spect) + + if amin <= 0: + raise ParameterError("amin must be strictly positive") + + if np.issubdtype(spect.dtype, np.complexfloating): + warnings.warn( + "power_to_db was called on complex input so phase " + "information will be discarded. To suppress this warning, " + "call power_to_db(np.abs(D)**2) instead.") + magnitude = np.abs(spect) + else: + magnitude = spect + + if callable(ref): + # User supplied a function to calculate reference power + ref_value = ref(magnitude) + else: + ref_value = np.abs(ref) + + log_spec = 10.0 * np.log10(np.maximum(amin, magnitude)) + log_spec -= 10.0 * np.log10(np.maximum(amin, ref_value)) + + if top_db is not None: + if top_db < 0: + raise ParameterError("top_db must be non-negative") + log_spec = np.maximum(log_spec, log_spec.max() - top_db) + + return log_spec + + +def mfcc(x, + sr: int=16000, + spect: Optional[array]=None, + n_mfcc: int=20, + dct_type: int=2, + norm: str="ortho", + lifter: int=0, + **kwargs) -> array: + """Mel-frequency cepstral coefficients (MFCCs) + + This function is NOT strictly aligned with librosa. 
The following example shows how to get the
+    same result with librosa:
+
+    # paddleaudio mfcc:
+    kwargs = {
+        'window_size': 512,
+        'hop_length': 320,
+        'n_mels': 64,
+        'fmin': 50,
+        'to_db': False}
+    a = mfcc(x,
+             spect=None,
+             n_mfcc=20,
+             dct_type=2,
+             norm='ortho',
+             lifter=0,
+             **kwargs)
+
+    # librosa mfcc:
+    spect = librosa.feature.melspectrogram(x, sr=16000, n_fft=512,
+                                           win_length=512,
+                                           hop_length=320,
+                                           n_mels=64, fmin=50)
+    b = librosa.feature.mfcc(x,
+                             sr=16000,
+                             S=spect,
+                             n_mfcc=20,
+                             dct_type=2,
+                             norm='ortho',
+                             lifter=0)
+
+    assert np.mean((a - b)**2) < 1e-8
+
+    """
+    if spect is None:
+        spect = melspectrogram(x, sr=sr, **kwargs)
+
+    M = scipy.fftpack.dct(spect, axis=0, type=dct_type, norm=norm)[:n_mfcc]
+
+    if lifter > 0:
+        factor = np.sin(np.pi * np.arange(1, 1 + n_mfcc, dtype=M.dtype) /
+                        lifter)
+        return M * factor[:, np.newaxis]
+    elif lifter == 0:
+        return M
+    else:
+        raise ParameterError(
+            f"MFCC lifter={lifter} must be a non-negative number")
+
+
+def melspectrogram(x: array,
+                   sr: int=16000,
+                   window_size: int=512,
+                   hop_length: int=320,
+                   n_mels: int=64,
+                   fmin: int=50,
+                   fmax: Optional[float]=None,
+                   window: str='hann',
+                   center: bool=True,
+                   pad_mode: str='reflect',
+                   power: float=2.0,
+                   to_db: bool=True,
+                   ref: float=1.0,
+                   amin: float=1e-10,
+                   top_db: Optional[float]=None) -> array:
+    """Compute mel-spectrogram.
+
+    Parameters:
+        x: numpy.ndarray
+            The input waveform is a numpy array [shape=(n,)]
+
+        window_size: int, typically 512, 1024, 2048, etc.
+            The window size for framing, also used as n_fft for stft
+
+
+    Returns:
+        The mel-spectrogram in power scale or db scale (default)
+
+
+    Notes:
+        1. sr defaults to 16000, which is commonly used in speech/speaker processing.
+        2. when fmax is None, it is set to sr//2.
+        3. this function converts the mel spectrogram to db scale by default, which is
+        different from librosa.
+
+    """
+    _check_audio(x, mono=True)
+    if len(x) <= 0:
+        raise ParameterError('The input waveform is empty')
+
+    if fmax is None:
+        fmax = sr // 2
+    if fmin < 0 or fmin >= fmax:
+        raise ParameterError('fmin and fmax must satisfy 0 <= fmin < fmax')
+
+    s = stft(
+        x,
+        n_fft=window_size,
+        hop_length=hop_length,
+        win_length=window_size,
+        window=window,
+        center=center,
+        pad_mode=pad_mode)
+
+    spect_power = np.abs(s)**power
+    fb_matrix = compute_fbank_matrix(
+        sr=sr, n_fft=window_size, n_mels=n_mels, fmin=fmin, fmax=fmax)
+    mel_spect = np.matmul(fb_matrix, spect_power)
+    if to_db:
+        return power_to_db(mel_spect, ref=ref, amin=amin, top_db=top_db)
+    else:
+        return mel_spect
+
+
+def spectrogram(x: array,
+                sr: int=16000,
+                window_size: int=512,
+                hop_length: int=320,
+                window: str='hann',
+                center: bool=True,
+                pad_mode: str='reflect',
+                power: float=2.0) -> array:
+    """Compute spectrogram from an input waveform.
+
+    This function is a wrapper for stft, with an additional step to
+    compute the magnitude of the complex spectrogram.
+    """
+
+    s = stft(
+        x,
+        n_fft=window_size,
+        hop_length=hop_length,
+        win_length=window_size,
+        window=window,
+        center=center,
+        pad_mode=pad_mode)
+
+    return np.abs(s)**power
+
+
+def mu_encode(x: array, mu: int=255, quantized: bool=True) -> array:
+    """Mu-law encoding.
+
+    Compute the mu-law encoding of the input waveform.
+    When quantized is True, the result will be converted to
+    integer in range [0,mu-1]. Otherwise, the resulting signal
+    is in range [-1,1]
+
+
+    Reference:
+        https://en.wikipedia.org/wiki/%CE%9C-law_algorithm
+
+    """
+    y = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
+    if quantized:
+        y = np.floor((y + 1) / 2 * mu + 0.5)  # convert to [0 , mu-1]
+    return y
+
+
+def mu_decode(y: array, mu: int=255, quantized: bool=True) -> array:
+    """Mu-law decoding.
+
+    Compute the mu-law decoding given an input code.
+ + it assumes that the input y is in + range [0,mu-1] when quantize is True and [-1,1] otherwise + + Reference: + https://en.wikipedia.org/wiki/%CE%9C-law_algorithm + + """ + if mu < 1: + raise ParameterError('mu is typically set as 2**k-1, k=1, 2, 3,...') + + mu = mu - 1 + if quantized: # undo the quantization + y = y * 2 / mu - 1 + x = np.sign(y) / mu * ((1 + mu)**np.abs(y) - 1) + return x diff --git a/paddleaudio/paddleaudio/models/__init__.py b/paddleaudio/paddleaudio/models/__init__.py new file mode 100644 index 00000000..185a92b8 --- /dev/null +++ b/paddleaudio/paddleaudio/models/__init__.py @@ -0,0 +1,13 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. diff --git a/paddleaudio/paddleaudio/models/panns.py b/paddleaudio/paddleaudio/models/panns.py new file mode 100644 index 00000000..1c68f06f --- /dev/null +++ b/paddleaudio/paddleaudio/models/panns.py @@ -0,0 +1,309 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
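The mu-law helpers `mu_encode`/`mu_decode` added in `features/core.py` above compand a waveform in `[-1, 1]` into a small set of integer codes and back. A minimal round-trip sketch, assuming the patched `paddleaudio` package is installed and using the default `mu=255` quantized path (the reconstruction is only approximate because of quantization):

```python
import numpy as np

from paddleaudio.features import mu_encode, mu_decode

# toy waveform in [-1, 1]; any float array in this range would do
x = np.linspace(-1.0, 1.0, num=5).astype('float32')

codes = mu_encode(x, mu=255, quantized=True)      # integer codes
x_hat = mu_decode(codes, mu=255, quantized=True)  # back to the [-1, 1] range

print(codes)
print(np.abs(x - x_hat).max())  # small but non-zero, due to quantization
```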
+import os + +import paddle.nn as nn +import paddle.nn.functional as F + +from ..utils.download import load_state_dict_from_url +from ..utils.env import MODEL_HOME + +__all__ = ['CNN14', 'CNN10', 'CNN6', 'cnn14', 'cnn10', 'cnn6'] + +pretrained_model_urls = { + 'cnn14': 'https://bj.bcebos.com/paddleaudio/models/panns_cnn14.pdparams', + 'cnn10': 'https://bj.bcebos.com/paddleaudio/models/panns_cnn10.pdparams', + 'cnn6': 'https://bj.bcebos.com/paddleaudio/models/panns_cnn6.pdparams', +} + + +class ConvBlock(nn.Layer): + def __init__(self, in_channels, out_channels): + super(ConvBlock, self).__init__() + + self.conv1 = nn.Conv2D( + in_channels=in_channels, + out_channels=out_channels, + kernel_size=(3, 3), + stride=(1, 1), + padding=(1, 1), + bias_attr=False) + self.conv2 = nn.Conv2D( + in_channels=out_channels, + out_channels=out_channels, + kernel_size=(3, 3), + stride=(1, 1), + padding=(1, 1), + bias_attr=False) + self.bn1 = nn.BatchNorm2D(out_channels) + self.bn2 = nn.BatchNorm2D(out_channels) + + def forward(self, x, pool_size=(2, 2), pool_type='avg'): + x = self.conv1(x) + x = self.bn1(x) + x = F.relu(x) + + x = self.conv2(x) + x = self.bn2(x) + x = F.relu(x) + + if pool_type == 'max': + x = F.max_pool2d(x, kernel_size=pool_size) + elif pool_type == 'avg': + x = F.avg_pool2d(x, kernel_size=pool_size) + elif pool_type == 'avg+max': + x = F.avg_pool2d( + x, kernel_size=pool_size) + F.max_pool2d( + x, kernel_size=pool_size) + else: + raise Exception( + f'Pooling type of {pool_type} is not supported. It must be one of "max", "avg" and "avg+max".' + ) + return x + + +class ConvBlock5x5(nn.Layer): + def __init__(self, in_channels, out_channels): + super(ConvBlock5x5, self).__init__() + + self.conv1 = nn.Conv2D( + in_channels=in_channels, + out_channels=out_channels, + kernel_size=(5, 5), + stride=(1, 1), + padding=(2, 2), + bias_attr=False) + self.bn1 = nn.BatchNorm2D(out_channels) + + def forward(self, x, pool_size=(2, 2), pool_type='avg'): + x = self.conv1(x) + x = self.bn1(x) + x = F.relu(x) + + if pool_type == 'max': + x = F.max_pool2d(x, kernel_size=pool_size) + elif pool_type == 'avg': + x = F.avg_pool2d(x, kernel_size=pool_size) + elif pool_type == 'avg+max': + x = F.avg_pool2d( + x, kernel_size=pool_size) + F.max_pool2d( + x, kernel_size=pool_size) + else: + raise Exception( + f'Pooling type of {pool_type} is not supported. It must be one of "max", "avg" and "avg+max".' + ) + return x + + +class CNN14(nn.Layer): + """ + The CNN14(14-layer CNNs) mainly consist of 6 convolutional blocks while each convolutional + block consists of 2 convolutional layers with a kernel size of 3 × 3. 
+ + Reference: + PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition + https://arxiv.org/pdf/1912.10211.pdf + """ + emb_size = 2048 + + def __init__(self, extract_embedding: bool=True): + + super(CNN14, self).__init__() + self.bn0 = nn.BatchNorm2D(64) + self.conv_block1 = ConvBlock(in_channels=1, out_channels=64) + self.conv_block2 = ConvBlock(in_channels=64, out_channels=128) + self.conv_block3 = ConvBlock(in_channels=128, out_channels=256) + self.conv_block4 = ConvBlock(in_channels=256, out_channels=512) + self.conv_block5 = ConvBlock(in_channels=512, out_channels=1024) + self.conv_block6 = ConvBlock(in_channels=1024, out_channels=2048) + + self.fc1 = nn.Linear(2048, self.emb_size) + self.fc_audioset = nn.Linear(self.emb_size, 527) + self.extract_embedding = extract_embedding + + def forward(self, x): + x.stop_gradient = False + x = x.transpose([0, 3, 2, 1]) + x = self.bn0(x) + x = x.transpose([0, 3, 2, 1]) + + x = self.conv_block1(x, pool_size=(2, 2), pool_type='avg') + x = F.dropout(x, p=0.2, training=self.training) + + x = self.conv_block2(x, pool_size=(2, 2), pool_type='avg') + x = F.dropout(x, p=0.2, training=self.training) + + x = self.conv_block3(x, pool_size=(2, 2), pool_type='avg') + x = F.dropout(x, p=0.2, training=self.training) + + x = self.conv_block4(x, pool_size=(2, 2), pool_type='avg') + x = F.dropout(x, p=0.2, training=self.training) + + x = self.conv_block5(x, pool_size=(2, 2), pool_type='avg') + x = F.dropout(x, p=0.2, training=self.training) + + x = self.conv_block6(x, pool_size=(1, 1), pool_type='avg') + x = F.dropout(x, p=0.2, training=self.training) + + x = x.mean(axis=3) + x = x.max(axis=2) + x.mean(axis=2) + + x = F.dropout(x, p=0.5, training=self.training) + x = F.relu(self.fc1(x)) + + if self.extract_embedding: + output = F.dropout(x, p=0.5, training=self.training) + else: + output = F.sigmoid(self.fc_audioset(x)) + return output + + +class CNN10(nn.Layer): + """ + The CNN10(14-layer CNNs) mainly consist of 4 convolutional blocks while each convolutional + block consists of 2 convolutional layers with a kernel size of 3 × 3. 
+ + Reference: + PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition + https://arxiv.org/pdf/1912.10211.pdf + """ + emb_size = 512 + + def __init__(self, extract_embedding: bool=True): + + super(CNN10, self).__init__() + self.bn0 = nn.BatchNorm2D(64) + self.conv_block1 = ConvBlock(in_channels=1, out_channels=64) + self.conv_block2 = ConvBlock(in_channels=64, out_channels=128) + self.conv_block3 = ConvBlock(in_channels=128, out_channels=256) + self.conv_block4 = ConvBlock(in_channels=256, out_channels=512) + + self.fc1 = nn.Linear(512, self.emb_size) + self.fc_audioset = nn.Linear(self.emb_size, 527) + self.extract_embedding = extract_embedding + + def forward(self, x): + x.stop_gradient = False + x = x.transpose([0, 3, 2, 1]) + x = self.bn0(x) + x = x.transpose([0, 3, 2, 1]) + + x = self.conv_block1(x, pool_size=(2, 2), pool_type='avg') + x = F.dropout(x, p=0.2, training=self.training) + + x = self.conv_block2(x, pool_size=(2, 2), pool_type='avg') + x = F.dropout(x, p=0.2, training=self.training) + + x = self.conv_block3(x, pool_size=(2, 2), pool_type='avg') + x = F.dropout(x, p=0.2, training=self.training) + + x = self.conv_block4(x, pool_size=(2, 2), pool_type='avg') + x = F.dropout(x, p=0.2, training=self.training) + + x = x.mean(axis=3) + x = x.max(axis=2) + x.mean(axis=2) + + x = F.dropout(x, p=0.5, training=self.training) + x = F.relu(self.fc1(x)) + + if self.extract_embedding: + output = F.dropout(x, p=0.5, training=self.training) + else: + output = F.sigmoid(self.fc_audioset(x)) + return output + + +class CNN6(nn.Layer): + """ + The CNN14(14-layer CNNs) mainly consist of 4 convolutional blocks while each convolutional + block consists of 1 convolutional layers with a kernel size of 5 × 5. + + Reference: + PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition + https://arxiv.org/pdf/1912.10211.pdf + """ + emb_size = 512 + + def __init__(self, extract_embedding: bool=True): + + super(CNN6, self).__init__() + self.bn0 = nn.BatchNorm2D(64) + self.conv_block1 = ConvBlock5x5(in_channels=1, out_channels=64) + self.conv_block2 = ConvBlock5x5(in_channels=64, out_channels=128) + self.conv_block3 = ConvBlock5x5(in_channels=128, out_channels=256) + self.conv_block4 = ConvBlock5x5(in_channels=256, out_channels=512) + + self.fc1 = nn.Linear(512, self.emb_size) + self.fc_audioset = nn.Linear(self.emb_size, 527) + self.extract_embedding = extract_embedding + + def forward(self, x): + x.stop_gradient = False + x = x.transpose([0, 3, 2, 1]) + x = self.bn0(x) + x = x.transpose([0, 3, 2, 1]) + + x = self.conv_block1(x, pool_size=(2, 2), pool_type='avg') + x = F.dropout(x, p=0.2, training=self.training) + + x = self.conv_block2(x, pool_size=(2, 2), pool_type='avg') + x = F.dropout(x, p=0.2, training=self.training) + + x = self.conv_block3(x, pool_size=(2, 2), pool_type='avg') + x = F.dropout(x, p=0.2, training=self.training) + + x = self.conv_block4(x, pool_size=(2, 2), pool_type='avg') + x = F.dropout(x, p=0.2, training=self.training) + + x = x.mean(axis=3) + x = x.max(axis=2) + x.mean(axis=2) + + x = F.dropout(x, p=0.5, training=self.training) + x = F.relu(self.fc1(x)) + + if self.extract_embedding: + output = F.dropout(x, p=0.5, training=self.training) + else: + output = F.sigmoid(self.fc_audioset(x)) + return output + + +def cnn14(pretrained: bool=False, extract_embedding: bool=True) -> CNN14: + model = CNN14(extract_embedding=extract_embedding) + if pretrained: + state_dict = load_state_dict_from_url( + url=pretrained_model_urls['cnn14'], + 
path=os.path.join(MODEL_HOME, 'panns')) + model.set_state_dict(state_dict) + return model + + +def cnn10(pretrained: bool=False, extract_embedding: bool=True) -> CNN10: + model = CNN10(extract_embedding=extract_embedding) + if pretrained: + state_dict = load_state_dict_from_url( + url=pretrained_model_urls['cnn10'], + path=os.path.join(MODEL_HOME, 'panns')) + model.set_state_dict(state_dict) + return model + + +def cnn6(pretrained: bool=False, extract_embedding: bool=True) -> CNN6: + model = CNN6(extract_embedding=extract_embedding) + if pretrained: + state_dict = load_state_dict_from_url( + url=pretrained_model_urls['cnn6'], + path=os.path.join(MODEL_HOME, 'panns')) + model.set_state_dict(state_dict) + return model diff --git a/paddleaudio/paddleaudio/utils/__init__.py b/paddleaudio/paddleaudio/utils/__init__.py new file mode 100644 index 00000000..1c1b4a90 --- /dev/null +++ b/paddleaudio/paddleaudio/utils/__init__.py @@ -0,0 +1,18 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +from .download import * +from .env import * +from .error import * +from .log import * +from .time import * diff --git a/paddleaudio/paddleaudio/utils/download.py b/paddleaudio/paddleaudio/utils/download.py new file mode 100644 index 00000000..0a36f29b --- /dev/null +++ b/paddleaudio/paddleaudio/utils/download.py @@ -0,0 +1,66 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import os +from typing import Dict +from typing import List + +from paddle.framework import load as load_state_dict +from paddle.utils import download +from pathos.multiprocessing import ProcessPool + +from .log import logger + +download.logger = logger + + +def decompress(file: str): + """ + Extracts all files from a compressed file. + """ + assert os.path.isfile(file), "File: {} not exists.".format(file) + download._decompress(file) + + +def download_and_decompress(archives: List[Dict[str, str]], + path: str, + n_workers: int=0): + """ + Download archieves and decompress to specific path. 
+ """ + if not os.path.isdir(path): + os.makedirs(path) + + if n_workers <= 0: + for archive in archives: + assert 'url' in archive and 'md5' in archive, \ + 'Dictionary keys of "url" and "md5" are required in the archive, but got: {list(archieve.keys())}' + + download.get_path_from_url(archive['url'], path, archive['md5']) + else: + pool = ProcessPool(nodes=n_workers) + pool.imap(download.get_path_from_url, [_['url'] for _ in archives], + [path] * len(archives), [_['md5'] for _ in archives]) + pool.close() + pool.join() + + +def load_state_dict_from_url(url: str, path: str, md5: str=None): + """ + Download and load a state dict from url + """ + if not os.path.isdir(path): + os.makedirs(path) + + download.get_path_from_url(url, path, md5) + return load_state_dict(os.path.join(path, os.path.basename(url))) diff --git a/paddleaudio/paddleaudio/utils/env.py b/paddleaudio/paddleaudio/utils/env.py new file mode 100644 index 00000000..59c6b621 --- /dev/null +++ b/paddleaudio/paddleaudio/utils/env.py @@ -0,0 +1,53 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +''' +This module is used to store environmental variables in PaddleAudio. +PPAUDIO_HOME --> the root directory for storing PaddleAudio related data. Default to ~/.paddleaudio. Users can change the +├ default value through the PPAUDIO_HOME environment variable. +├─ MODEL_HOME --> Store model files. +└─ DATA_HOME --> Store automatically downloaded datasets. +''' +import os + + +def _get_user_home(): + return os.path.expanduser('~') + + +def _get_ppaudio_home(): + if 'PPAUDIO_HOME' in os.environ: + home_path = os.environ['PPAUDIO_HOME'] + if os.path.exists(home_path): + if os.path.isdir(home_path): + return home_path + else: + raise RuntimeError( + 'The environment variable PPAUDIO_HOME {} is not a directory.'. + format(home_path)) + else: + return home_path + return os.path.join(_get_user_home(), '.paddleaudio') + + +def _get_sub_home(directory): + home = os.path.join(_get_ppaudio_home(), directory) + if not os.path.exists(home): + os.makedirs(home) + return home + + +USER_HOME = _get_user_home() +PPAUDIO_HOME = _get_ppaudio_home() +MODEL_HOME = _get_sub_home('models') +DATA_HOME = _get_sub_home('datasets') diff --git a/paddleaudio/paddleaudio/utils/error.py b/paddleaudio/paddleaudio/utils/error.py new file mode 100644 index 00000000..f3977489 --- /dev/null +++ b/paddleaudio/paddleaudio/utils/error.py @@ -0,0 +1,20 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
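The `cnn14`/`cnn10`/`cnn6` builders in `models/panns.py` above fetch pretrained AudioSet weights via `load_state_dict_from_url` and return the backbone either as an embedding extractor or as a 527-way tagger. A minimal embedding-extraction sketch, assuming the patched package is installed; the `(batch, 1, time_steps, 64)` input layout is inferred from the transposes and `BatchNorm2D(64)` in `CNN14.forward`, and a random tensor stands in for a real log-mel feature:

```python
import paddle

from paddleaudio.models.panns import cnn14

model = cnn14(pretrained=True, extract_embedding=True)  # downloads weights on first use
model.eval()

# stand-in for a (batch, 1, time_steps, n_mels=64) log-mel patch
feat = paddle.randn([1, 1, 100, 64])

with paddle.no_grad():
    emb = model(feat)

print(emb.shape)  # expected: [1, 2048] for CNN14
```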
+# See the License for the specific language governing permissions and +# limitations under the License. + +__all__ = ['ParameterError'] + + +class ParameterError(Exception): + """Exception class for Parameter checking""" + pass diff --git a/paddleaudio/paddleaudio/utils/log.py b/paddleaudio/paddleaudio/utils/log.py new file mode 100644 index 00000000..5e7db68a --- /dev/null +++ b/paddleaudio/paddleaudio/utils/log.py @@ -0,0 +1,136 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import contextlib +import functools +import logging +import threading +import time + +import colorlog + +loggers = {} + +log_config = { + 'DEBUG': { + 'level': 10, + 'color': 'purple' + }, + 'INFO': { + 'level': 20, + 'color': 'green' + }, + 'TRAIN': { + 'level': 21, + 'color': 'cyan' + }, + 'EVAL': { + 'level': 22, + 'color': 'blue' + }, + 'WARNING': { + 'level': 30, + 'color': 'yellow' + }, + 'ERROR': { + 'level': 40, + 'color': 'red' + }, + 'CRITICAL': { + 'level': 50, + 'color': 'bold_red' + } +} + + +class Logger(object): + ''' + Deafult logger in PaddleAudio + Args: + name(str) : Logger name, default is 'PaddleAudio' + ''' + + def __init__(self, name: str=None): + name = 'PaddleAudio' if not name else name + self.logger = logging.getLogger(name) + + for key, conf in log_config.items(): + logging.addLevelName(conf['level'], key) + self.__dict__[key] = functools.partial(self.__call__, conf['level']) + self.__dict__[key.lower()] = functools.partial(self.__call__, + conf['level']) + + self.format = colorlog.ColoredFormatter( + '%(log_color)s[%(asctime)-15s] [%(levelname)8s]%(reset)s - %(message)s', + log_colors={key: conf['color'] + for key, conf in log_config.items()}) + + self.handler = logging.StreamHandler() + self.handler.setFormatter(self.format) + + self.logger.addHandler(self.handler) + self.logLevel = 'DEBUG' + self.logger.setLevel(logging.DEBUG) + self.logger.propagate = False + self._is_enable = True + + def disable(self): + self._is_enable = False + + def enable(self): + self._is_enable = True + + @property + def is_enable(self) -> bool: + return self._is_enable + + def __call__(self, log_level: str, msg: str): + if not self.is_enable: + return + + self.logger.log(log_level, msg) + + @contextlib.contextmanager + def use_terminator(self, terminator: str): + old_terminator = self.handler.terminator + self.handler.terminator = terminator + yield + self.handler.terminator = old_terminator + + @contextlib.contextmanager + def processing(self, msg: str, interval: float=0.1): + ''' + Continuously print a progress bar with rotating special effects. + Args: + msg(str): Message to be printed. + interval(float): Rotation interval. Default to 0.1. 
+ ''' + end = False + + def _printer(): + index = 0 + flags = ['\\', '|', '/', '-'] + while not end: + flag = flags[index % len(flags)] + with self.use_terminator('\r'): + self.info('{}: {}'.format(msg, flag)) + time.sleep(interval) + index += 1 + + t = threading.Thread(target=_printer) + t.start() + yield + end = True + + +logger = Logger() diff --git a/paddleaudio/paddleaudio/utils/time.py b/paddleaudio/paddleaudio/utils/time.py new file mode 100644 index 00000000..6f0c7585 --- /dev/null +++ b/paddleaudio/paddleaudio/utils/time.py @@ -0,0 +1,67 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License" +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +import math +import time + + +class Timer(object): + '''Calculate runing speed and estimated time of arrival(ETA)''' + + def __init__(self, total_step: int): + self.total_step = total_step + self.last_start_step = 0 + self.current_step = 0 + self._is_running = True + + def start(self): + self.last_time = time.time() + self.start_time = time.time() + + def stop(self): + self._is_running = False + self.end_time = time.time() + + def count(self) -> int: + if not self.current_step >= self.total_step: + self.current_step += 1 + return self.current_step + + @property + def timing(self) -> float: + run_steps = self.current_step - self.last_start_step + self.last_start_step = self.current_step + time_used = time.time() - self.last_time + self.last_time = time.time() + return run_steps / time_used + + @property + def is_running(self) -> bool: + return self._is_running + + @property + def eta(self) -> str: + if not self.is_running: + return '00:00:00' + scale = self.total_step / self.current_step + remaining_time = (time.time() - self.start_time) * scale + return seconds_to_hms(remaining_time) + + +def seconds_to_hms(seconds: int) -> str: + '''Convert the number of seconds to hh:mm:ss''' + h = math.floor(seconds / 3600) + m = math.floor((seconds - h * 3600) / 60) + s = int(seconds - h * 3600 - m * 60) + hms_str = '{:0>2}:{:0>2}:{:0>2}'.format(h, m, s) + return hms_str diff --git a/paddleaudio/requirements.txt b/paddleaudio/requirements.txt new file mode 100644 index 00000000..9510a1b2 --- /dev/null +++ b/paddleaudio/requirements.txt @@ -0,0 +1,4 @@ +numpy >= 1.15.0 +resampy >= 0.2.2 +scipy >= 1.0.0 +soundfile >= 0.9.0 \ No newline at end of file diff --git a/paddleaudio/setup.py b/paddleaudio/setup.py new file mode 100644 index 00000000..09bd2155 --- /dev/null +++ b/paddleaudio/setup.py @@ -0,0 +1,43 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. 
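The `Timer` added in `utils/time.py` above tracks steps per second and an ETA string for a fixed number of steps, with `seconds_to_hms` formatting the remaining time. A small usage sketch, assuming the patched package is installed (the sleep merely stands in for a training step):

```python
import time

from paddleaudio.utils import Timer

total_steps = 5
timer = Timer(total_step=total_steps)
timer.start()

for step in range(total_steps):
    time.sleep(0.01)  # stand-in for one training/eval step
    timer.count()     # advance the internal step counter
    print(f'step {step + 1}: {timer.timing:.1f} steps/s, eta {timer.eta}')

timer.stop()
```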
+# See the License for the specific language governing permissions and +# limitations under the License. +import setuptools + +# set the version here +version = '0.1.0a' + +with open("README.md", "r") as fh: + long_description = fh.read() +setuptools.setup( + name="paddleaudio", + version=version, + author="", + author_email="", + description="PaddleAudio, in development", + long_description=long_description, + long_description_content_type="text/markdown", + url="", + packages=setuptools.find_packages(exclude=["build*", "test*", "examples*"]), + classifiers=[ + "Programming Language :: Python :: 3", + "License :: OSI Approved :: MIT License", + "Operating System :: OS Independent", + ], + python_requires='>=3.6', + install_requires=[ + 'numpy >= 1.15.0', 'scipy >= 1.0.0', 'resampy >= 0.2.2', + 'soundfile >= 0.9.0' + ], + extras_require={'dev': ['pytest>=3.7', 'librosa>=0.7.2'] + } # for dev only, install: pip install -e .[dev] +) diff --git a/paddleaudio/test/README.md b/paddleaudio/test/README.md new file mode 100644 index 00000000..e5dbc537 --- /dev/null +++ b/paddleaudio/test/README.md @@ -0,0 +1,41 @@ +# PaddleAudio Testing Guide + + + + +# Testing +First clone a version of the project by +``` +git clone https://github.com/PaddlePaddle/models.git + +``` +Then install the project in your virtual environment. +``` +cd models/PaddleAudio +python setup.py bdist_wheel +pip install -e .[dev] +``` +The requirements for testing will be installed along with PaddleAudio. + +Now run +``` +pytest test +``` + +If it goes well, you will see outputs like these: +``` +platform linux -- Python 3.7.10, pytest-6.2.4, py-1.10.0, pluggy-0.13.1 +rootdir: ./models/PaddleAudio +plugins: hydra-core-1.0.6 +collected 16 items + +test/unit_test/test_backend.py ........... [ 68%] +test/unit_test/test_features.py ..... [100%] + +==================================================== warnings summary ==================================================== +. +. +. +-- Docs: https://docs.pytest.org/en/stable/warnings.html +============================================ 16 passed, 11 warnings in 6.76s ============================================= +``` diff --git a/paddleaudio/test/unit_test/test_backend.py b/paddleaudio/test/unit_test/test_backend.py new file mode 100644 index 00000000..a0d5d2db --- /dev/null +++ b/paddleaudio/test/unit_test/test_backend.py @@ -0,0 +1,114 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+import librosa +import numpy as np +import pytest + +import paddleaudio + +TEST_FILE = './test/data/test_audio.wav' + + +def relative_err(a, b, real=True): + """compute relative error of two matrices or vectors""" + if real: + return np.sum((a - b)**2) / (EPS + np.sum(a**2) + np.sum(b**2)) + else: + err = np.sum((a.real - b.real)**2) / \ + (EPS + np.sum(a.real**2) + np.sum(b.real**2)) + err += np.sum((a.imag - b.imag)**2) / \ + (EPS + np.sum(a.imag**2) + np.sum(b.imag**2)) + + return err + + +@pytest.mark.filterwarnings("ignore::DeprecationWarning") +def load_audio(): + x, r = librosa.load(TEST_FILE, sr=16000) + print(f'librosa: mean: {np.mean(x)}, std:{np.std(x)}') + return x, r + + +# start testing +x, r = load_audio() +EPS = 1e-8 + + +def test_load(): + s, r = paddleaudio.load(TEST_FILE, sr=16000) + assert r == 16000 + assert s.dtype == 'float32' + + s, r = paddleaudio.load( + TEST_FILE, sr=16000, offset=1, duration=2, dtype='int16') + assert len(s) / r == 2.0 + assert r == 16000 + assert s.dtype == 'int16' + + +def test_depth_convert(): + y = paddleaudio.depth_convert(x, 'int16') + assert len(y) == len(x) + assert y.dtype == 'int16' + assert np.max(y) <= 32767 + assert np.min(y) >= -32768 + assert np.std(y) > EPS + + y = paddleaudio.depth_convert(x, 'int8') + assert len(y) == len(x) + assert y.dtype == 'int8' + assert np.max(y) <= 127 + assert np.min(y) >= -128 + assert np.std(y) > EPS + + +# test case for resample +rs_test_data = [ + (32000, 'kaiser_fast'), + (16000, 'kaiser_fast'), + (8000, 'kaiser_fast'), + (32000, 'kaiser_best'), + (16000, 'kaiser_best'), + (8000, 'kaiser_best'), + (22050, 'kaiser_best'), + (44100, 'kaiser_best'), +] + + +@pytest.mark.parametrize('sr,mode', rs_test_data) +def test_resample(sr, mode): + y = paddleaudio.resample(x, 16000, sr, mode=mode) + factor = sr / 16000 + err = relative_err(len(y), len(x) * factor) + print('err:', err) + assert err < EPS + + +def test_normalize(): + y = paddleaudio.normalize(x, norm_type='linear', mul_factor=0.5) + assert np.max(y) < 0.5 + EPS + + y = paddleaudio.normalize(x, norm_type='linear', mul_factor=2.0) + assert np.max(y) <= 2.0 + EPS + + y = paddleaudio.normalize(x, norm_type='gaussian', mul_factor=1.0) + print('np.std(y):', np.std(y)) + assert np.abs(np.std(y) - 1.0) < EPS + + +if __name__ == '__main__': + test_load() + test_depth_convert() + test_resample(22050, 'kaiser_fast') + test_normalize() diff --git a/paddleaudio/test/unit_test/test_features.py b/paddleaudio/test/unit_test/test_features.py new file mode 100644 index 00000000..4d68c1b5 --- /dev/null +++ b/paddleaudio/test/unit_test/test_features.py @@ -0,0 +1,144 @@ +# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. 
+import librosa +import numpy as np +import pytest + +import paddleaudio as pa + + +@pytest.mark.filterwarnings("ignore::DeprecationWarning") +def load_audio(): + x, r = librosa.load('./test/data/test_audio.wav') + #x,r = librosa.load('../data/test_audio.wav',sr=16000) + return x, r + + +## start testing +x, r = load_audio() +EPS = 1e-8 + + +def relative_err(a, b, real=True): + """compute relative error of two matrices or vectors""" + if real: + return np.sum((a - b)**2) / (EPS + np.sum(a**2) + np.sum(b**2)) + else: + err = np.sum((a.real - b.real)**2) / ( + EPS + np.sum(a.real**2) + np.sum(b.real**2)) + err += np.sum((a.imag - b.imag)**2) / ( + EPS + np.sum(a.imag**2) + np.sum(b.imag**2)) + + return err + + +@pytest.mark.filterwarnings("ignore::DeprecationWarning") +def test_melspectrogram(): + a = pa.melspectrogram( + x, + window_size=512, + sr=16000, + hop_length=320, + n_mels=64, + fmin=50, + to_db=False, ) + b = librosa.feature.melspectrogram( + x, + sr=16000, + n_fft=512, + win_length=512, + hop_length=320, + n_mels=64, + fmin=50) + assert relative_err(a, b) < EPS + + +@pytest.mark.filterwarnings("ignore::DeprecationWarning") +def test_melspectrogram_db(): + + a = pa.melspectrogram( + x, + window_size=512, + sr=16000, + hop_length=320, + n_mels=64, + fmin=50, + to_db=True, + ref=1.0, + amin=1e-10, + top_db=None) + b = librosa.feature.melspectrogram( + x, + sr=16000, + n_fft=512, + win_length=512, + hop_length=320, + n_mels=64, + fmin=50) + b = pa.power_to_db(b, ref=1.0, amin=1e-10, top_db=None) + assert relative_err(a, b) < EPS + + +@pytest.mark.filterwarnings("ignore::DeprecationWarning") +def test_stft(): + a = pa.stft(x, n_fft=1024, hop_length=320, win_length=512) + b = librosa.stft(x, n_fft=1024, hop_length=320, win_length=512) + assert a.shape == b.shape + assert relative_err(a, b, real=False) < EPS + + +@pytest.mark.filterwarnings("ignore::DeprecationWarning") +def test_split_frames(): + a = librosa.util.frame(x, frame_length=512, hop_length=320) + b = pa.split_frames(x, frame_length=512, hop_length=320) + assert relative_err(a, b) < EPS + + +@pytest.mark.filterwarnings("ignore::DeprecationWarning") +def test_mfcc(): + kwargs = { + 'window_size': 512, + 'hop_length': 320, + 'n_mels': 64, + 'fmin': 50, + 'to_db': False + } + a = pa.mfcc( + x, + #sample_rate=16000, + spect=None, + n_mfcc=20, + dct_type=2, + norm='ortho', + lifter=0, + **kwargs) + S = librosa.feature.melspectrogram( + x, + sr=16000, + n_fft=512, + win_length=512, + hop_length=320, + n_mels=64, + fmin=50) + b = librosa.feature.mfcc( + x, sr=16000, S=S, n_mfcc=20, dct_type=2, norm='ortho', lifter=0) + assert relative_err(a, b) < EPS + + +if __name__ == '__main__': + test_melspectrogram() + test_melspectrogram_db() + test_stft() + test_split_frames() + test_mfcc() diff --git a/parakeet/data/batch.py b/parakeet/data/batch.py index 515074d1..5e7ac399 100644 --- a/parakeet/data/batch.py +++ b/parakeet/data/batch.py @@ -53,8 +53,8 @@ def batch_text_id(minibatch, pad_id=0, dtype=np.int64): peek_example = minibatch[0] assert len(peek_example.shape) == 1, "text example is an 1D tensor" - lengths = [example.shape[0] for example in minibatch - ] # assume (channel, n_samples) or (n_samples, ) + lengths = [example.shape[0] for example in + minibatch] # assume (channel, n_samples) or (n_samples, ) max_len = np.max(lengths) batch = [] diff --git a/parakeet/exps/tacotron2/ljspeech.py b/parakeet/exps/tacotron2/ljspeech.py index 20dc29d3..59c855eb 100644 --- a/parakeet/exps/tacotron2/ljspeech.py +++ 
b/parakeet/exps/tacotron2/ljspeech.py @@ -67,19 +67,16 @@ class LJSpeechCollector(object): # Sort by text_len in descending order texts = [ - i - for i, _ in sorted( + i for i, _ in sorted( zip(texts, text_lens), key=lambda x: x[1], reverse=True) ] mels = [ - i - for i, _ in sorted( + i for i, _ in sorted( zip(mels, text_lens), key=lambda x: x[1], reverse=True) ] mel_lens = [ - i - for i, _ in sorted( + i for i, _ in sorted( zip(mel_lens, text_lens), key=lambda x: x[1], reverse=True) ] diff --git a/requirements.txt b/requirements.txt index a7310a02..08f68449 100644 --- a/requirements.txt +++ b/requirements.txt @@ -13,6 +13,7 @@ librosa llvmlite loguru matplotlib +nara_wpe nltk numba numpy==1.20.0 @@ -23,6 +24,7 @@ praatio~=4.1 pre-commit pybind11 pypinyin +python-dateutil pyworld resampy==0.2.2 sacrebleu diff --git a/setup.py b/setup.py index be17e0a4..a2e4c031 100644 --- a/setup.py +++ b/setup.py @@ -65,13 +65,6 @@ def _remove(files: str): def _post_install(install_lib_dir): - # apt - check_call("apt-get update -y") - check_call("apt-get install -y " + 'vim tig tree sox pkg-config ' + - 'libsndfile1 libflac-dev libogg-dev ' + - 'libvorbis-dev libboost-dev swig python3-dev ') - print("apt install.") - # tools/make tool_dir = HERE / "tools" _remove(tool_dir.glob("*.done")) diff --git a/setup.sh b/setup.sh index d3dd8207..aefdab98 100644 --- a/setup.sh +++ b/setup.sh @@ -10,7 +10,7 @@ fi if [ -e /etc/lsb-release ];then ${SUDO} apt-get update -y - ${SUDO} apt-get install -y jq vim tig tree sox pkg-config libflac-dev libogg-dev libvorbis-dev libboost-dev swig python3-dev + ${SUDO} apt-get install -y bc jq vim tig tree sox pkg-config libflac-dev libogg-dev libvorbis-dev libboost-dev swig python3-dev if [ $? != 0 ]; then error_msg "Please using Ubuntu or install pkg-config libflac-dev libogg-dev libvorbis-dev libboost-dev swig python3-dev by user." exit -1 diff --git a/tools/Makefile b/tools/Makefile index 77b41a48..87107a53 100644 --- a/tools/Makefile +++ b/tools/Makefile @@ -10,7 +10,7 @@ WGET ?= wget --no-check-certificate .PHONY: all clean -all: virtualenv.done kenlm.done sox.done soxbindings.done mfa.done sclite.done +all: virtualenv.done apt.done kenlm.done sox.done soxbindings.done mfa.done sclite.done virtualenv.done: test -d venv || virtualenv -p $(PYTHON) venv @@ -21,6 +21,13 @@ clean: find -iname "*.pyc" -delete rm -rf kenlm + +apt.done: + apt update -y + apt install -y bc flac jq vim tig tree pkg-config libflac-dev libogg-dev libvorbis-dev libboost-dev swig python3-dev + echo "check_certificate = off" >> ~/.wgetrc + touch apt.done + kenlm.done: # Ubuntu 16.04 透過 apt 會安裝 boost 1.58.0 # it seems that boost (1.54.0) requires higher version. After I switched to g++-5 it compiles normally. @@ -48,6 +55,13 @@ mfa.done: tar xvf montreal-forced-aligner_linux.tar.gz touch mfa.done +openblas.done: + bash extras/install_openblas.sh + touch openblas.done + +kaldi.done: openblas.done + bash extras/install_kaldi.sh + touch kaldi.done #== SCTK =============================================================================== # SCTK official repo does not have version tags. Here's the mapping: diff --git a/tools/extras/install_kaldi.sh b/tools/extras/install_kaldi.sh index b87232b0..3cdcd32d 100755 --- a/tools/extras/install_kaldi.sh +++ b/tools/extras/install_kaldi.sh @@ -16,7 +16,7 @@ else echo "$KALDI_DIR already exists!" 
fi -cd "$KALDI_DIR/tools" +pushd "$KALDI_DIR/tools" git pull # Prevent kaldi from switching default python version @@ -28,8 +28,12 @@ touch "python/.use_default_python" make -j4 pushd ../src -./configure --shared --use-cuda=no --static-math --mathlib=OPENBLAS --openblas-root=${KALDI_DIR}/../OpenBLAS/install +OPENBLAS_DIR=${KALDI_DIR}/../OpenBLAS +mkdir -p ${OPENBLAS_DIR}/install +./configure --shared --use-cuda=no --static-math --mathlib=OPENBLAS --openblas-root=${OPENBLAS_DIR}/install make clean -j && make depend -j && make -j4 popd +popd + echo "Done installing Kaldi." diff --git a/utils/__init__.py b/utils/__init__.py old mode 100644 new mode 100755 diff --git a/utils/apply-cmvn.py b/utils/apply-cmvn.py new file mode 100755 index 00000000..f80053fb --- /dev/null +++ b/utils/apply-cmvn.py @@ -0,0 +1,149 @@ +#!/usr/bin/env python3 +import argparse +import logging +from distutils.util import strtobool + +import kaldiio +import numpy + +from deepspeech.transform.cmvn import CMVN +from deepspeech.utils.cli_readers import file_reader_helper +from deepspeech.utils.cli_utils import get_commandline_args +from deepspeech.utils.cli_utils import is_scipy_wav_style +from deepspeech.utils.cli_writers import file_writer_helper + + +def get_parser(): + parser = argparse.ArgumentParser( + description="apply mean-variance normalization to files", + formatter_class=argparse.ArgumentDefaultsHelpFormatter, ) + + parser.add_argument( + "--verbose", "-V", default=0, type=int, help="Verbose option") + parser.add_argument( + "--in-filetype", + type=str, + default="mat", + choices=["mat", "hdf5", "sound.hdf5", "sound"], + help="Specify the file format for the rspecifier. " + '"mat" is the matrix format in kaldi', ) + parser.add_argument( + "--stats-filetype", + type=str, + default="mat", + choices=["mat", "hdf5", "npy"], + help="Specify the file format for the rspecifier. " + '"mat" is the matrix format in kaldi', ) + parser.add_argument( + "--out-filetype", + type=str, + default="mat", + choices=["mat", "hdf5"], + help="Specify the file format for the wspecifier. " + '"mat" is the matrix format in kaldi', ) + + parser.add_argument( + "--norm-means", + type=strtobool, + default=True, + help="Do variance normalization or not.", ) + parser.add_argument( + "--norm-vars", + type=strtobool, + default=False, + help="Do variance normalization or not.", ) + parser.add_argument( + "--reverse", + type=strtobool, + default=False, + help="Do reverse mode or not") + parser.add_argument( + "--spk2utt", + type=str, + help="A text file of speaker to utterance-list map. " + "(Don't give rspecifier format, such as " + '"ark:spk2utt")', ) + parser.add_argument( + "--utt2spk", + type=str, + help="A text file of utterance to speaker map. " + "(Don't give rspecifier format, such as " + '"ark:utt2spk")', ) + parser.add_argument( + "--write-num-frames", + type=str, + help="Specify wspecifer for utt2num_frames") + parser.add_argument( + "--compress", + type=strtobool, + default=False, + help="Save in compressed format") + parser.add_argument( + "--compression-method", + type=int, + default=2, + help="Specify the method(if mat) or " + "gzip-level(if hdf5)", ) + parser.add_argument( + "stats_rspecifier_or_rxfilename", + help="Input stats. e.g. ark:stats.ark or stats.mat", ) + parser.add_argument( + "rspecifier", type=str, help="Read specifier id. e.g. ark:some.ark") + parser.add_argument( + "wspecifier", type=str, help="Write specifier id. e.g. 
ark:some.ark") + return parser + + +def main(): + args = get_parser().parse_args() + + # logging info + logfmt = "%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s" + if args.verbose > 0: + logging.basicConfig(level=logging.INFO, format=logfmt) + else: + logging.basicConfig(level=logging.WARN, format=logfmt) + logging.info(get_commandline_args()) + + if ":" in args.stats_rspecifier_or_rxfilename: + is_rspcifier = True + if args.stats_filetype == "npy": + stats_filetype = "hdf5" + else: + stats_filetype = args.stats_filetype + + stats_dict = dict( + file_reader_helper(args.stats_rspecifier_or_rxfilename, + stats_filetype)) + else: + is_rspcifier = False + if args.stats_filetype == "mat": + stats = kaldiio.load_mat(args.stats_rspecifier_or_rxfilename) + else: + stats = numpy.load(args.stats_rspecifier_or_rxfilename) + stats_dict = {None: stats} + + cmvn = CMVN( + stats=stats_dict, + norm_means=args.norm_means, + norm_vars=args.norm_vars, + utt2spk=args.utt2spk, + spk2utt=args.spk2utt, + reverse=args.reverse, ) + + with file_writer_helper( + args.wspecifier, + filetype=args.out_filetype, + write_num_frames=args.write_num_frames, + compress=args.compress, + compression_method=args.compression_method, ) as writer: + for utt, mat in file_reader_helper(args.rspecifier, args.in_filetype): + if is_scipy_wav_style(mat): + # If data is sound file, then got as Tuple[int, ndarray] + rate, mat = mat + mat = cmvn(mat, utt if is_rspcifier else None) + writer[utt] = mat + + +if __name__ == "__main__": + main() diff --git a/utils/avg_model.py b/utils/avg_model.py index 7c05ec78..6ee16408 100755 --- a/utils/avg_model.py +++ b/utils/avg_model.py @@ -47,8 +47,10 @@ def main(args): beat_val_scores = sorted_val_scores[:args.num, 1] selected_epochs = sorted_val_scores[:args.num, 0].astype(np.int64) + avg_val_score = np.mean(beat_val_scores) print("selected val scores = " + str(beat_val_scores)) print("selected epochs = " + str(selected_epochs)) + print("averaged val score = " + str(avg_val_score)) path_list = [ args.ckpt_dir + '/{}.pdparams'.format(int(epoch)) @@ -80,7 +82,7 @@ def main(args): data = json.dumps({ "mode": 'val_best' if args.val_best else 'latest', "avg_ckpt": args.dst_model, - "val_loss_mean": np.mean(beat_val_scores), + "val_loss_mean": avg_val_score, "ckpts": path_list, "epochs": selected_epochs.tolist(), "val_losses": beat_val_scores.tolist(), diff --git a/utils/caculate_rtf.py b/utils/caculate_rtf.py new file mode 100755 index 00000000..fcc155ed --- /dev/null +++ b/utils/caculate_rtf.py @@ -0,0 +1,65 @@ +#!/usr/bin/env python3 +# encoding: utf-8 +# Copyright 2021 Kyoto University (Hirofumi Inaguma) +# Apache 2.0 (http://www.apache.org/licenses/LICENSE-2.0) +import argparse +import codecs +import glob +import os + +from dateutil import parser + + +def get_parser(): + parser = argparse.ArgumentParser( + description="calculate real time factor (RTF)") + parser.add_argument( + "--log-dir", + type=str, + default=None, + help="path to logging directory", ) + return parser + + +def main(): + + args = get_parser().parse_args() + + audio_sec = 0 + decode_sec = 0 + n_utt = 0 + + audio_durations = [] + start_times = [] + end_times = [] + for x in glob.glob(os.path.join(args.log_dir, "decode.*.log")): + with codecs.open(x, "r", "utf-8") as f: + for line in f: + x = line.strip() + # 2021-10-25 08:22:04.052 | INFO | xxx:recog_v2:188 - feat: (1570, 83) + if "feat:" in x: + dur = int(x.split("(")[1].split(',')[0]) + audio_durations += [dur] + start_times += [parser.parse(x.split("|")[0])] + elif 
"total log probability:" in x: + end_times += [parser.parse(x.split("|")[0])] + assert len(audio_durations) == len(end_times), (len(audio_durations), + len(end_times), ) + assert len(start_times) == len(end_times), (len(start_times), + len(end_times)) + + audio_sec += sum(audio_durations) / 100 # [sec] + decode_sec += sum([(end - start).total_seconds() + for start, end in zip(start_times, end_times)]) + n_utt += len(audio_durations) + + print("Total audio duration: %.3f [sec]" % audio_sec) + print("Total decoding time: %.3f [sec]" % decode_sec) + rtf = decode_sec / audio_sec if audio_sec > 0 else 0 + print("RTF: %.3f" % rtf) + latency = decode_sec * 1000 / n_utt if n_utt > 0 else 0 + print("Latency: %.3f [ms/sentence]" % latency) + + +if __name__ == "__main__": + main() diff --git a/utils/compute-cmvn-stats.py b/utils/compute-cmvn-stats.py new file mode 100755 index 00000000..706d8cd5 --- /dev/null +++ b/utils/compute-cmvn-stats.py @@ -0,0 +1,186 @@ +#!/usr/bin/env python3 +import argparse +import logging + +import kaldiio +import numpy as np + +from deepspeech.transform.transformation import Transformation +from deepspeech.utils.cli_readers import file_reader_helper +from deepspeech.utils.cli_utils import get_commandline_args +from deepspeech.utils.cli_utils import is_scipy_wav_style +from deepspeech.utils.cli_writers import file_writer_helper + + +def get_parser(): + parser = argparse.ArgumentParser( + description="Compute cepstral mean and " + "variance normalization statistics" + "If wspecifier provided: per-utterance by default, " + "or per-speaker if" + "spk2utt option provided; if wxfilename: global", + formatter_class=argparse.ArgumentDefaultsHelpFormatter, ) + parser.add_argument( + "--spk2utt", + type=str, + help="A text file of speaker to utterance-list map. " + "(Don't give rspecifier format, such as " + '"ark:utt2spk")', ) + parser.add_argument( + "--verbose", "-V", default=0, type=int, help="Verbose option") + parser.add_argument( + "--in-filetype", + type=str, + default="mat", + choices=["mat", "hdf5", "sound.hdf5", "sound"], + help="Specify the file format for the rspecifier. " + '"mat" is the matrix format in kaldi', ) + parser.add_argument( + "--out-filetype", + type=str, + default="mat", + choices=["mat", "hdf5", "npy"], + help="Specify the file format for the wspecifier. " + '"mat" is the matrix format in kaldi', ) + parser.add_argument( + "--preprocess-conf", + type=str, + default=None, + help="The configuration file for the pre-processing", ) + parser.add_argument( + "rspecifier", + type=str, + help="Read specifier for feats. e.g. ark:some.ark") + parser.add_argument( + "wspecifier_or_wxfilename", + type=str, + help="Write specifier. e.g. 
ark:some.ark") + return parser + + +def main(): + args = get_parser().parse_args() + + logfmt = "%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s" + if args.verbose > 0: + logging.basicConfig(level=logging.INFO, format=logfmt) + else: + logging.basicConfig(level=logging.WARN, format=logfmt) + logging.info(get_commandline_args()) + + is_wspecifier = ":" in args.wspecifier_or_wxfilename + + if is_wspecifier: + if args.spk2utt is not None: + logging.info("Performing as speaker CMVN mode") + utt2spk_dict = {} + with open(args.spk2utt) as f: + for line in f: + spk, utts = line.rstrip().split(None, 1) + for utt in utts.split(): + utt2spk_dict[utt] = spk + + def utt2spk(x): + return utt2spk_dict[x] + + else: + logging.info("Performing as utterance CMVN mode") + + def utt2spk(x): + return x + + if args.out_filetype == "npy": + logging.warning("--out-filetype npy is allowed only for " + "Global CMVN mode, changing to hdf5") + args.out_filetype = "hdf5" + + else: + logging.info("Performing as global CMVN mode") + if args.spk2utt is not None: + logging.warning("spk2utt is not used for global CMVN mode") + + def utt2spk(x): + return None + + if args.out_filetype == "hdf5": + logging.warning("--out-filetype hdf5 is not allowed for " + "Global CMVN mode, changing to npy") + args.out_filetype = "npy" + + if args.preprocess_conf is not None: + preprocessing = Transformation(args.preprocess_conf) + logging.info("Apply preprocessing: {}".format(preprocessing)) + else: + preprocessing = None + + # Calculate stats for each speaker + counts = {} + sum_feats = {} + square_sum_feats = {} + + idx = 0 + for idx, (utt, matrix) in enumerate( + file_reader_helper(args.rspecifier, args.in_filetype), 1): + if is_scipy_wav_style(matrix): + # If data is sound file, then got as Tuple[int, ndarray] + rate, matrix = matrix + if preprocessing is not None: + matrix = preprocessing(matrix, uttid_list=utt) + + spk = utt2spk(utt) + + # Init at the first seen of the spk + if spk not in counts: + counts[spk] = 0 + feat_shape = matrix.shape[1:] + # Accumulate in double precision + sum_feats[spk] = np.zeros(feat_shape, dtype=np.float64) + square_sum_feats[spk] = np.zeros(feat_shape, dtype=np.float64) + + counts[spk] += matrix.shape[0] + sum_feats[spk] += matrix.sum(axis=0) + square_sum_feats[spk] += (matrix**2).sum(axis=0) + logging.info("Processed {} utterances".format(idx)) + assert idx > 0, idx + + cmvn_stats = {} + for spk in counts: + feat_shape = sum_feats[spk].shape + cmvn_shape = (2, feat_shape[0] + 1) + feat_shape[1:] + _cmvn_stats = np.empty(cmvn_shape, dtype=np.float64) + _cmvn_stats[0, :-1] = sum_feats[spk] + _cmvn_stats[1, :-1] = square_sum_feats[spk] + + _cmvn_stats[0, -1] = counts[spk] + _cmvn_stats[1, -1] = 0.0 + + # You can get the mean and std as following, + # >>> N = _cmvn_stats[0, -1] + # >>> mean = _cmvn_stats[0, :-1] / N + # >>> std = np.sqrt(_cmvn_stats[1, :-1] / N - mean ** 2) + + cmvn_stats[spk] = _cmvn_stats + + # Per utterance or speaker CMVN + if is_wspecifier: + with file_writer_helper( + args.wspecifier_or_wxfilename, + filetype=args.out_filetype) as writer: + for spk, mat in cmvn_stats.items(): + writer[spk] = mat + + # Global CMVN + else: + matrix = cmvn_stats[None] + if args.out_filetype == "npy": + np.save(args.wspecifier_or_wxfilename, matrix) + elif args.out_filetype == "mat": + # Kaldi supports only matrix or vector + kaldiio.save_mat(args.wspecifier_or_wxfilename, matrix) + else: + raise RuntimeError( + "Not supporting: --out-filetype {}".format(args.out_filetype)) + + +if 
__name__ == "__main__": + main() diff --git a/utils/compute_statistics.py b/utils/compute_statistics.py old mode 100644 new mode 100755 diff --git a/utils/copy-feats.py b/utils/copy-feats.py new file mode 100755 index 00000000..7d1b8589 --- /dev/null +++ b/utils/copy-feats.py @@ -0,0 +1,104 @@ +#!/usr/bin/env python3 +import argparse +import logging +from distutils.util import strtobool + +from deepspeech.transform.transformation import Transformation +from deepspeech.utils.cli_readers import file_reader_helper +from deepspeech.utils.cli_utils import get_commandline_args +from deepspeech.utils.cli_utils import is_scipy_wav_style +from deepspeech.utils.cli_writers import file_writer_helper + + +def get_parser(): + parser = argparse.ArgumentParser( + description="copy feature with preprocessing", + formatter_class=argparse.ArgumentDefaultsHelpFormatter, ) + + parser.add_argument( + "--verbose", "-V", default=0, type=int, help="Verbose option") + parser.add_argument( + "--in-filetype", + type=str, + default="mat", + choices=["mat", "hdf5", "sound.hdf5", "sound"], + help="Specify the file format for the rspecifier. " + '"mat" is the matrix format in kaldi', ) + parser.add_argument( + "--out-filetype", + type=str, + default="mat", + choices=["mat", "hdf5", "sound.hdf5", "sound"], + help="Specify the file format for the wspecifier. " + '"mat" is the matrix format in kaldi', ) + parser.add_argument( + "--write-num-frames", + type=str, + help="Specify wspecifer for utt2num_frames") + parser.add_argument( + "--compress", + type=strtobool, + default=False, + help="Save in compressed format") + parser.add_argument( + "--compression-method", + type=int, + default=2, + help="Specify the method(if mat) or " + "gzip-level(if hdf5)", ) + parser.add_argument( + "--preprocess-conf", + type=str, + default=None, + help="The configuration file for the pre-processing", ) + parser.add_argument( + "rspecifier", + type=str, + help="Read specifier for feats. e.g. ark:some.ark") + parser.add_argument( + "wspecifier", type=str, help="Write specifier. e.g. 
ark:some.ark") + return parser + + +def main(): + parser = get_parser() + args = parser.parse_args() + + # logging info + logfmt = "%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s" + if args.verbose > 0: + logging.basicConfig(level=logging.INFO, format=logfmt) + else: + logging.basicConfig(level=logging.WARN, format=logfmt) + logging.info(get_commandline_args()) + + if args.preprocess_conf is not None: + preprocessing = Transformation(args.preprocess_conf) + logging.info("Apply preprocessing: {}".format(preprocessing)) + else: + preprocessing = None + + with file_writer_helper( + args.wspecifier, + filetype=args.out_filetype, + write_num_frames=args.write_num_frames, + compress=args.compress, + compression_method=args.compression_method, ) as writer: + for utt, mat in file_reader_helper(args.rspecifier, args.in_filetype): + if is_scipy_wav_style(mat): + # If data is sound file, then got as Tuple[int, ndarray] + rate, mat = mat + + if preprocessing is not None: + mat = preprocessing(mat, uttid_list=utt) + + # shape = (Time, Channel) + if args.out_filetype in ["sound.hdf5", "sound"]: + # Write Tuple[int, numpy.ndarray] (scipy style) + writer[utt] = (rate, mat) + else: + writer[utt] = mat + + +if __name__ == "__main__": + main() diff --git a/utils/data2json.sh b/utils/data2json.sh new file mode 100755 index 00000000..25131437 --- /dev/null +++ b/utils/data2json.sh @@ -0,0 +1,170 @@ +#!/usr/bin/env bash + +# Copyright 2017 Johns Hopkins University (Shinji Watanabe) +# Apache 2.0 (http://www.apache.org/licenses/LICENSE-2.0) + +echo "$0 $*" >&2 # Print the command line for logging +. ./path.sh + +nj=1 +cmd=run.pl +nlsyms="" +lang="" +feat="" # feat.scp +oov="" +bpecode="" +allow_one_column=false +verbose=0 +trans_type=char +filetype="" +preprocess_conf="" +category="" +out="" # If omitted, write in stdout + +text="" +multilingual=false + +help_message=$(cat << EOF +Usage: $0 +e.g. $0 data/train data/lang_1char/train_units.txt +Options: + --nj # number of parallel jobs + --cmd (utils/run.pl|utils/queue.pl ) # how to run jobs. + --feat # feat.scp or feat1.scp,feat2.scp,... + --oov # Default: + --out # If omitted, write in stdout + --filetype # Specify the format of feats file + --preprocess-conf # Apply preprocess to feats when creating shape.scp + --verbose # Default: 0 +EOF +) +. utils/parse_options.sh + +if [ $# != 2 ]; then + echo "${help_message}" 1>&2 + exit 1; +fi + +set -euo pipefail + +dir=$1 +dic=$2 +tmpdir=$(mktemp -d ${dir}/tmp-XXXXX) +trap 'rm -rf ${tmpdir}' EXIT + +if [ -z ${text} ]; then + text=${dir}/text +fi + +# 1. Create scp files for inputs +# These are not necessary for decoding mode, and make it as an option +input= +if [ -n "${feat}" ]; then + _feat_scps=$(echo "${feat}" | tr ',' ' ' ) + read -r -a feat_scps <<< $_feat_scps + num_feats=${#feat_scps[@]} + + for (( i=1; i<=num_feats; i++ )); do + feat=${feat_scps[$((i-1))]} + mkdir -p ${tmpdir}/input_${i} + input+="input_${i} " + cat ${feat} > ${tmpdir}/input_${i}/feat.scp + + # Dump in the "legacy" style JSON format + if [ -n "${filetype}" ]; then + awk -v filetype=${filetype} '{print $1 " " filetype}' ${feat} \ + > ${tmpdir}/input_${i}/filetype.scp + fi + + feat_to_shape.sh --cmd "${cmd}" --nj ${nj} \ + --filetype "${filetype}" \ + --preprocess-conf "${preprocess_conf}" \ + --verbose ${verbose} ${feat} ${tmpdir}/input_${i}/shape.scp + done +fi + +# 2. 
Create scp files for outputs +mkdir -p ${tmpdir}/output +if [ -n "${bpecode}" ]; then + if [ ${multilingual} = true ]; then + # remove a space before the language ID + paste -d " " <(awk '{print $1}' ${text}) <(cut -f 2- -d" " ${text} \ + | spm_encode --model=${bpecode} --output_format=piece | cut -f 2- -d" ") \ + > ${tmpdir}/output/token.scp + else + paste -d " " <(awk '{print $1}' ${text}) <(cut -f 2- -d" " ${text} \ + | spm_encode --model=${bpecode} --output_format=piece) \ + > ${tmpdir}/output/token.scp + fi +elif [ -n "${nlsyms}" ]; then + text2token.py -s 1 -n 1 -l ${nlsyms} ${text} --trans_type ${trans_type} > ${tmpdir}/output/token.scp +else + text2token.py -s 1 -n 1 ${text} --trans_type ${trans_type} > ${tmpdir}/output/token.scp +fi +< ${tmpdir}/output/token.scp utils/sym2int.pl --map-oov ${oov} -f 2- ${dic} > ${tmpdir}/output/tokenid.scp +# +2 comes from CTC blank and EOS +vocsize=$(tail -n 1 ${dic} | awk '{print $2}') +odim=$(echo "$vocsize + 2" | bc) +< ${tmpdir}/output/tokenid.scp awk -v odim=${odim} '{print $1 " " NF-1 "," odim}' > ${tmpdir}/output/shape.scp + +cat ${text} > ${tmpdir}/output/text.scp + + +# 3. Create scp files for the others +mkdir -p ${tmpdir}/other +if [ ${multilingual} == true ]; then + awk '{ + n = split($1,S,"[-]"); + lang=S[n]; + print $1 " " lang + }' ${text} > ${tmpdir}/other/lang.scp +elif [ -n "${lang}" ]; then + awk -v lang=${lang} '{print $1 " " lang}' ${text} > ${tmpdir}/other/lang.scp +fi + +if [ -n "${category}" ]; then + awk -v category=${category} '{print $1 " " category}' ${dir}/text \ + > ${tmpdir}/other/category.scp +fi +cat ${dir}/utt2spk > ${tmpdir}/other/utt2spk.scp + +# 4. Merge scp files into a JSON file +opts="" +if [ -n "${feat}" ]; then + intypes="${input} output other" +else + intypes="output other" +fi +for intype in ${intypes}; do + if [ -z "$(find "${tmpdir}/${intype}" -name "*.scp")" ]; then + continue + fi + + if [ ${intype} != other ]; then + opts+="--${intype%_*}-scps " + else + opts+="--scps " + fi + + for x in "${tmpdir}/${intype}"/*.scp; do + k=$(basename ${x} .scp) + if [ ${k} = shape ]; then + opts+="shape:${x}:shape " + else + opts+="${k}:${x} " + fi + done +done + +if ${allow_one_column}; then + opts+="--allow-one-column true " +else + opts+="--allow-one-column false " +fi + +if [ -n "${out}" ]; then + opts+="-O ${out}" +fi +merge_scp2json.py --verbose ${verbose} ${opts} + +rm -fr ${tmpdir} diff --git a/utils/dump.sh b/utils/dump.sh new file mode 100755 index 00000000..1f312b3a --- /dev/null +++ b/utils/dump.sh @@ -0,0 +1,95 @@ +#!/usr/bin/env bash + +# Copyright 2017 Nagoya University (Tomoki Hayashi) +# Apache 2.0 (http://www.apache.org/licenses/LICENSE-2.0) + +echo "$0 $*" # Print the command line for logging +. ./path.sh + +cmd=run.pl +do_delta=false +nj=1 +verbose=0 +compress=true +write_utt2num_frames=true +filetype='mat' # mat or hdf5 +help_message="Usage: $0 " + +. utils/parse_options.sh + +scp=$1 +cvmnark=$2 +logdir=$3 +dumpdir=$4 + +if [ $# != 4 ]; then + echo "${help_message}" + exit 1; +fi + +set -euo pipefail + +mkdir -p ${logdir} +mkdir -p ${dumpdir} + +dumpdir=$(perl -e '($dir,$pwd)= @ARGV; if($dir!~m:^/:) { $dir = "$pwd/$dir"; } print $dir; ' ${dumpdir} ${PWD}) + +for n in $(seq ${nj}); do + # the next command does nothing unless $dumpdir/storage/ exists, see + # utils/create_data_link.pl for more info. 
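As a side note on this feature-dumping step: `dump.sh` hands the CMVN statistics produced by `compute-cmvn-stats.py` to Kaldi's `apply-cmvn --norm-vars=true` before copying the features. The following is only a minimal sketch of what that normalization amounts to, assuming the global 2 x (dim + 1) stats layout documented in `compute-cmvn-stats.py` above; the file name and the dummy feature matrix are placeholders, not part of the toolkit.

```python
# Minimal sketch of mean/variance normalization from CMVN stats
# (layout documented in compute-cmvn-stats.py); not apply-cmvn itself.
import numpy as np

stats = np.load("cmvn.npy")          # assumed global stats, shape (2, dim + 1)
count = stats[0, -1]                 # number of accumulated frames
mean = stats[0, :-1] / count
std = np.sqrt(stats[1, :-1] / count - mean ** 2)

feats = np.random.randn(100, stats.shape[1] - 1)  # dummy (frames, dim) matrix
normalized = (feats - mean) / std                 # --norm-vars=true behaviour
```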
+ utils/create_data_link.pl ${dumpdir}/feats.${n}.ark +done + +if ${write_utt2num_frames}; then + write_num_frames_opt="--write-num-frames=ark,t:$dumpdir/utt2num_frames.JOB" +else + write_num_frames_opt= +fi + +# split scp file +split_scps="" +for n in $(seq ${nj}); do + split_scps="$split_scps $logdir/feats.$n.scp" +done + +utils/split_scp.pl ${scp} ${split_scps} || exit 1; + +# dump features +if ${do_delta}; then + ${cmd} JOB=1:${nj} ${logdir}/dump_feature.JOB.log \ + apply-cmvn --norm-vars=true ${cvmnark} scp:${logdir}/feats.JOB.scp ark:- \| \ + add-deltas ark:- ark:- \| \ + copy-feats.py --verbose ${verbose} --out-filetype ${filetype} \ + --compress=${compress} --compression-method=2 ${write_num_frames_opt} \ + ark:- ark,scp:${dumpdir}/feats.JOB.ark,${dumpdir}/feats.JOB.scp \ + || exit 1 +else + ${cmd} JOB=1:${nj} ${logdir}/dump_feature.JOB.log \ + apply-cmvn --norm-vars=true ${cvmnark} scp:${logdir}/feats.JOB.scp ark:- \| \ + copy-feats.py --verbose ${verbose} --out-filetype ${filetype} \ + --compress=${compress} --compression-method=2 ${write_num_frames_opt} \ + ark:- ark,scp:${dumpdir}/feats.JOB.ark,${dumpdir}/feats.JOB.scp \ + || exit 1 +fi + +# concatenate scp files +for n in $(seq ${nj}); do + cat ${dumpdir}/feats.${n}.scp || exit 1; +done > ${dumpdir}/feats.scp || exit 1 + +if ${write_utt2num_frames}; then + for n in $(seq ${nj}); do + cat ${dumpdir}/utt2num_frames.${n} || exit 1; + done > ${dumpdir}/utt2num_frames || exit 1 + rm ${dumpdir}/utt2num_frames.* 2>/dev/null +fi + +# Write the filetype, this will be used for data2json.sh +echo ${filetype} > ${dumpdir}/filetype + + +# remove temp scps +rm ${logdir}/feats.*.scp 2>/dev/null +if [ ${verbose} -eq 1 ]; then + echo "Succeeded dumping features for training" +fi diff --git a/utils/feat-to-shape.py b/utils/feat-to-shape.py new file mode 100755 index 00000000..7b36b7e5 --- /dev/null +++ b/utils/feat-to-shape.py @@ -0,0 +1,82 @@ +#!/usr/bin/env python3 +import argparse +import logging +import sys + +from deepspeech.transform.transformation import Transformation +from deepspeech.utils.cli_readers import file_reader_helper +from deepspeech.utils.cli_utils import get_commandline_args +from deepspeech.utils.cli_utils import is_scipy_wav_style + + +def get_parser(): + parser = argparse.ArgumentParser( + description="convert feature to its shape", + formatter_class=argparse.ArgumentDefaultsHelpFormatter, ) + parser.add_argument( + "--verbose", "-V", default=0, type=int, help="Verbose option") + parser.add_argument( + "--filetype", + type=str, + default="mat", + choices=["mat", "hdf5", "sound.hdf5", "sound"], + help="Specify the file format for the rspecifier. " + '"mat" is the matrix format in kaldi', ) + parser.add_argument( + "--preprocess-conf", + type=str, + default=None, + help="The configuration file for the pre-processing", ) + parser.add_argument( + "rspecifier", + type=str, + help="Read specifier for feats. e.g. ark:some.ark") + parser.add_argument( + "out", + nargs="?", + type=argparse.FileType("w"), + default=sys.stdout, + help="The output filename. 
" + "If omitted, then output to sys.stdout", ) + return parser + + +def main(): + parser = get_parser() + args = parser.parse_args() + + # logging info + logfmt = "%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s" + if args.verbose > 0: + logging.basicConfig(level=logging.INFO, format=logfmt) + else: + logging.basicConfig(level=logging.WARN, format=logfmt) + logging.info(get_commandline_args()) + + if args.preprocess_conf is not None: + preprocessing = Transformation(args.preprocess_conf) + logging.info("Apply preprocessing: {}".format(preprocessing)) + else: + preprocessing = None + + # There are no necessary for matrix without preprocessing, + # so change to file_reader_helper to return shape. + # This make sense only with filetype="hdf5". + for utt, mat in file_reader_helper( + args.rspecifier, args.filetype, return_shape=preprocessing is None): + if preprocessing is not None: + if is_scipy_wav_style(mat): + # If data is sound file, then got as Tuple[int, ndarray] + rate, mat = mat + mat = preprocessing(mat, uttid_list=utt) + shape_str = ",".join(map(str, mat.shape)) + else: + if len(mat) == 2 and isinstance(mat[1], tuple): + # If data is sound file, Tuple[int, Tuple[int, ...]] + rate, mat = mat + shape_str = ",".join(map(str, mat)) + args.out.write("{} {}\n".format(utt, shape_str)) + + +if __name__ == "__main__": + main() diff --git a/utils/feat_to_shape.sh b/utils/feat_to_shape.sh new file mode 100755 index 00000000..7f4668c4 --- /dev/null +++ b/utils/feat_to_shape.sh @@ -0,0 +1,72 @@ +#!/usr/bin/env bash + +# Begin configuration section. +nj=4 +cmd=run.pl +verbose=0 +filetype="" +preprocess_conf="" +# End configuration section. + +help_message=$(cat << EOF +Usage: $0 [options] [] +e.g.: $0 data/train/feats.scp data/train/shape.scp data/train/log +Options: + --nj # number of parallel jobs + --cmd (utils/run.pl|utils/queue.pl ) # how to run jobs. + --filetype # Specify the format of feats file + --preprocess-conf # Apply preprocess to feats when creating shape.scp + --verbose # Default: 0 +EOF +) + +echo "$0 $*" 1>&2 # Print the command line for logging + +. parse_options.sh || exit 1; + +if [ $# -lt 2 ] || [ $# -gt 3 ]; then + echo "${help_message}" 1>&2 + exit 1; +fi + +set -euo pipefail + +scp=$1 +outscp=$2 +data=$(dirname ${scp}) +if [ $# -eq 3 ]; then + logdir=$3 +else + logdir=${data}/log +fi +mkdir -p ${logdir} + +nj=$((nj<$(<"${scp}" wc -l)?nj:$(<"${scp}" wc -l))) +split_scps="" +for n in $(seq ${nj}); do + split_scps="${split_scps} ${logdir}/feats.${n}.scp" +done + +utils/split_scp.pl ${scp} ${split_scps} + +if [ -n "${preprocess_conf}" ]; then + preprocess_opt="--preprocess-conf ${preprocess_conf}" +else + preprocess_opt="" +fi +if [ -n "${filetype}" ]; then + filetype_opt="--filetype ${filetype}" +else + filetype_opt="" +fi + +${cmd} JOB=1:${nj} ${logdir}/feat_to_shape.JOB.log \ + feat-to-shape.py --verbose ${verbose} ${preprocess_opt} ${filetype_opt} \ + scp:${logdir}/feats.JOB.scp ${logdir}/shape.JOB.scp + +# concatenate the .scp files together. 
+for n in $(seq ${nj}); do + cat ${logdir}/shape.${n}.scp +done > ${outscp} + +rm -f ${logdir}/feats.*.scp 2>/dev/null diff --git a/utils/gen_duration_from_textgrid.py b/utils/gen_duration_from_textgrid.py old mode 100644 new mode 100755 diff --git a/utils/merge_scp2json.py b/utils/merge_scp2json.py new file mode 100755 index 00000000..b724a7dd --- /dev/null +++ b/utils/merge_scp2json.py @@ -0,0 +1,289 @@ +#!/usr/bin/env python3 +# encoding: utf-8 +import argparse +import codecs +import json +import logging +import sys +from distutils.util import strtobool +from io import open + +from deepspeech.utils.cli_utils import get_commandline_args + +PY2 = sys.version_info[0] == 2 +sys.stdin = codecs.getreader("utf-8")(sys.stdin if PY2 else sys.stdin.buffer) +sys.stdout = codecs.getwriter("utf-8")(sys.stdout if PY2 else sys.stdout.buffer) + + +# Special types: +def shape(x): + """Change str to List[int] + + >>> shape('3,5') + [3, 5] + >>> shape(' [3, 5] ') + [3, 5] + + """ + + # x: ' [3, 5] ' -> '3, 5' + x = x.strip() + if x[0] == "[": + x = x[1:] + if x[-1] == "]": + x = x[:-1] + + return list(map(int, x.split(","))) + + +def get_parser(): + parser = argparse.ArgumentParser( + description="Given each file paths with such format as " + "::. type> can be omitted and the default " + 'is "str". e.g. {} ' + "--input-scps feat:data/feats.scp shape:data/utt2feat_shape:shape " + "--input-scps feat:data/feats2.scp shape:data/utt2feat2_shape:shape " + "--output-scps text:data/text shape:data/utt2text_shape:shape " + "--scps utt2spk:data/utt2spk".format(sys.argv[0]), + formatter_class=argparse.ArgumentDefaultsHelpFormatter, ) + parser.add_argument( + "--input-scps", + type=str, + nargs="*", + action="append", + default=[], + help="Json files for the inputs", ) + parser.add_argument( + "--output-scps", + type=str, + nargs="*", + action="append", + default=[], + help="Json files for the outputs", ) + parser.add_argument( + "--scps", + type=str, + nargs="+", + default=[], + help="The json files except for the input and outputs", ) + parser.add_argument( + "--verbose", "-V", default=1, type=int, help="Verbose option") + parser.add_argument( + "--allow-one-column", + type=strtobool, + default=False, + help="Allow one column in input scp files. " + "In this case, the value will be empty string.", ) + parser.add_argument( + "--out", + "-O", + type=str, + help="The output filename. " + "If omitted, then output to sys.stdout", ) + return parser + + +if __name__ == "__main__": + parser = get_parser() + args = parser.parse_args() + args.scps = [args.scps] + + # logging info + logfmt = "%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s" + if args.verbose > 0: + logging.basicConfig(level=logging.INFO, format=logfmt) + else: + logging.basicConfig(level=logging.WARN, format=logfmt) + logging.info(get_commandline_args()) + + # List[List[Tuple[str, str, Callable[[str], Any], str, str]]] + input_infos = [] + output_infos = [] + infos = [] + for lis_list, key_scps_list in [ + (input_infos, args.input_scps), + (output_infos, args.output_scps), + (infos, args.scps), + ]: + for key_scps in key_scps_list: + lis = [] + for key_scp in key_scps: + sps = key_scp.split(":") + if len(sps) == 2: + key, scp = sps + type_func = None + type_func_str = "none" + elif len(sps) == 3: + key, scp, type_func_str = sps + fail = False + + try: + # type_func: Callable[[str], Any] + # e.g. 
type_func_str = "int" -> type_func = int + type_func = eval(type_func_str) + except Exception: + raise RuntimeError( + "Unknown type: {}".format(type_func_str)) + + if not callable(type_func): + raise RuntimeError( + "Unknown type: {}".format(type_func_str)) + + else: + raise RuntimeError( + "Format : " + "or :: " + "e.g. feat:data/feat.scp " + "or shape:data/feat.scp:shape: {}".format(key_scp)) + + for item in lis: + if key == item[0]: + raise RuntimeError('The key "{}" is duplicated: {} {}'. + format(key, item[3], key_scp)) + + lis.append((key, scp, type_func, key_scp, type_func_str)) + lis_list.append(lis) + + # Open scp files + input_fscps = [[open(i[1], "r", encoding="utf-8") for i in il] + for il in input_infos] + output_fscps = [[open(i[1], "r", encoding="utf-8") for i in il] + for il in output_infos] + fscps = [[open(i[1], "r", encoding="utf-8") for i in il] for il in infos] + + # Note(kamo): What is done here? + # The final goal is creating a JSON file such as. + # { + # "utts": { + # "sample_id1": {(omitted)}, + # "sample_id2": {(omitted)}, + # .... + # } + # } + # + # To reduce memory usage, reading the input text files for each lines + # and writing JSON elements per samples. + if args.out is None: + out = sys.stdout + else: + out = open(args.out, "w", encoding="utf-8") + out.write('{\n "utts": {\n') + nutt = 0 + while True: + nutt += 1 + # List[List[str]] + input_lines = [[f.readline() for f in fl] for fl in input_fscps] + output_lines = [[f.readline() for f in fl] for fl in output_fscps] + lines = [[f.readline() for f in fl] for fl in fscps] + + # Get the first line + concat = sum(input_lines + output_lines + lines, []) + if len(concat) == 0: + break + first = concat[0] + + # Sanity check: Must be sorted by the first column and have same keys + count = 0 + for ls_list in (input_lines, output_lines, lines): + for ls in ls_list: + for line in ls: + if line == "" or first == "": + if line != first: + concat = sum(input_infos + output_infos + infos, []) + raise RuntimeError("The number of lines mismatch " + 'between: "{}" and "{}"'.format( + concat[0][1], + concat[count][1])) + + elif line.split()[0] != first.split()[0]: + concat = sum(input_infos + output_infos + infos, []) + raise RuntimeError( + "The keys are mismatch at {}th line " + 'between "{}" and "{}":\n>>> {}\n>>> {}'.format( + nutt, + concat[0][1], + concat[count][1], + first.rstrip(), + line.rstrip(), )) + count += 1 + + # The end of file + if first == "": + if nutt != 1: + out.write("\n") + break + if nutt != 1: + out.write(",\n") + + entry = {} + for inout, _lines, _infos in [ + ("input", input_lines, input_infos), + ("output", output_lines, output_infos), + ("other", lines, infos), + ]: + + lis = [] + for idx, (line_list, info_list) in enumerate( + zip(_lines, _infos), 1): + if inout == "input": + d = {"name": "input{}".format(idx)} + elif inout == "output": + d = {"name": "target{}".format(idx)} + else: + d = {} + + # info_list: List[Tuple[str, str, Callable]] + # line_list: List[str] + for line, info in zip(line_list, info_list): + sps = line.split(None, 1) + if len(sps) < 2: + if not args.allow_one_column: + raise RuntimeError( + "Format error {}th line in {}: " + ' Expecting " ":\n>>> {}'.format( + nutt, info[1], line)) + uttid = sps[0] + value = "" + else: + uttid, value = sps + + key = info[0] + type_func = info[2] + value = value.rstrip() + + if type_func is not None: + try: + # type_func: Callable[[str], Any] + value = type_func(value) + except Exception: + logging.error( + '"{}" is an invalid function ' + 
"for the {} th line in {}: \n>>> {}".format( + info[4], nutt, info[1], line)) + raise + + d[key] = value + lis.append(d) + + if inout != "other": + entry[inout] = lis + else: + # If key == 'other'. only has the first item + entry.update(lis[0]) + + entry = json.dumps( + entry, + indent=4, + ensure_ascii=False, + sort_keys=True, + separators=(",", ": ")) + # Add indent + indent = " " * 2 + entry = ("\n" + indent).join(entry.split("\n")) + + uttid = first.split()[0] + out.write(' "{}": {}'.format(uttid, entry)) + + out.write(" }\n}\n") + + logging.info("{} entries in {}".format(nutt, out.name)) diff --git a/utils/reduce_data_dir.sh b/utils/reduce_data_dir.sh new file mode 100755 index 00000000..60c82a7c --- /dev/null +++ b/utils/reduce_data_dir.sh @@ -0,0 +1,59 @@ +#!/usr/bin/env bash + +# koried, 10/29/2012 + +# Reduce a data set based on a list of turn-ids + +help_message="usage: $0 srcdir turnlist destdir" + +if [ $1 == "--help" ]; then + echo "${help_message}" + exit 0; +fi + +if [ $# != 3 ]; then + echo "${help_message}" + exit 1; +fi + +srcdir=$1 +reclist=$2 +destdir=$3 + +if [ ! -f ${srcdir}/utt2spk ]; then +echo "$0: no such file $srcdir/utt2spk" +exit 1; +fi + +function do_filtering { +# assumes the utt2spk and spk2utt files already exist. + [ -f ${srcdir}/feats.scp ] && utils/filter_scp.pl ${destdir}/utt2spk <${srcdir}/feats.scp >${destdir}/feats.scp + [ -f ${srcdir}/wav.scp ] && utils/filter_scp.pl ${destdir}/utt2spk <${srcdir}/wav.scp >${destdir}/wav.scp + [ -f ${srcdir}/text ] && utils/filter_scp.pl ${destdir}/utt2spk <${srcdir}/text >${destdir}/text + [ -f ${srcdir}/utt2num_frames ] && utils/filter_scp.pl ${destdir}/utt2spk <${srcdir}/utt2num_frames >${destdir}/utt2num_frames + [ -f ${srcdir}/spk2gender ] && utils/filter_scp.pl ${destdir}/spk2utt <${srcdir}/spk2gender >${destdir}/spk2gender + [ -f ${srcdir}/cmvn.scp ] && utils/filter_scp.pl ${destdir}/spk2utt <${srcdir}/cmvn.scp >${destdir}/cmvn.scp + if [ -f ${srcdir}/segments ]; then + utils/filter_scp.pl ${destdir}/utt2spk <${srcdir}/segments >${destdir}/segments + awk '{print $2;}' ${destdir}/segments | sort | uniq > ${destdir}/reco # recordings. + # The next line would override the command above for wav.scp, which would be incorrect. + [ -f ${srcdir}/wav.scp ] && utils/filter_scp.pl ${destdir}/reco <${srcdir}/wav.scp >${destdir}/wav.scp + [ -f ${srcdir}/reco2file_and_channel ] && \ + utils/filter_scp.pl ${destdir}/reco <${srcdir}/reco2file_and_channel >${destdir}/reco2file_and_channel + + # Filter the STM file for proper sclite scoring (this will also remove the comments lines) + [ -f ${srcdir}/stm ] && utils/filter_scp.pl ${destdir}/reco < ${srcdir}/stm > ${destdir}/stm + rm ${destdir}/reco + fi + srcutts=$(wc -l < ${srcdir}/utt2spk) + destutts=$(wc -l < ${destdir}/utt2spk) + echo "Reduced #utt from $srcutts to $destutts" +} + +mkdir -p ${destdir} + +# filter the utt2spk based on the set of recordings +utils/filter_scp.pl ${reclist} < ${srcdir}/utt2spk > ${destdir}/utt2spk + +utils/utt2spk_to_spk2utt.pl < ${destdir}/utt2spk > ${destdir}/spk2utt +do_filtering; diff --git a/utils/remove_longshortdata.sh b/utils/remove_longshortdata.sh new file mode 100755 index 00000000..e0b9da09 --- /dev/null +++ b/utils/remove_longshortdata.sh @@ -0,0 +1,62 @@ +#!/usr/bin/env bash + +# Copyright 2017 Johns Hopkins University (Shinji Watanabe) +# Apache 2.0 (http://www.apache.org/licenses/LICENSE-2.0) + +. 
./path.sh + +maxframes=2000 +minframes=10 +maxchars=200 +minchars=0 +nlsyms="" +no_feat=false +trans_type=char + +help_message="usage: $0 olddatadir newdatadir" + +. utils/parse_options.sh || exit 1; + +if [ $# != 2 ]; then + echo "${help_message}" + exit 1; +fi + +sdir=$1 +odir=$2 +mkdir -p ${odir}/tmp + +if [ ${no_feat} = true ]; then + # for machine translation + cut -d' ' -f 1 ${sdir}/text > ${odir}/tmp/reclist1 +else + echo "extract utterances having less than $maxframes or more than $minframes frames" + utils/data/get_utt2num_frames.sh ${sdir} + < ${sdir}/utt2num_frames awk -v maxframes="$maxframes" '{ if ($2 < maxframes) print }' \ + | awk -v minframes="$minframes" '{ if ($2 > minframes) print }' \ + | awk '{print $1}' > ${odir}/tmp/reclist1 +fi + +echo "extract utterances having less than $maxchars or more than $minchars characters" +# counting number of chars. Use (NF - 1) instead of NF to exclude the utterance ID column +if [ -z ${nlsyms} ]; then +text2token.py -s 1 -n 1 ${sdir}/text --trans_type ${trans_type} \ + | awk -v maxchars="$maxchars" '{ if (NF - 1 < maxchars) print }' \ + | awk -v minchars="$minchars" '{ if (NF - 1 > minchars) print }' \ + | awk '{print $1}' > ${odir}/tmp/reclist2 +else +text2token.py -l ${nlsyms} -s 1 -n 1 ${sdir}/text --trans_type ${trans_type} \ + | awk -v maxchars="$maxchars" '{ if (NF - 1 < maxchars) print }' \ + | awk -v minchars="$minchars" '{ if (NF - 1 > minchars) print }' \ + | awk '{print $1}' > ${odir}/tmp/reclist2 +fi + +# extract common lines +comm -12 <(sort ${odir}/tmp/reclist1) <(sort ${odir}/tmp/reclist2) > ${odir}/tmp/reclist + +reduce_data_dir.sh ${sdir} ${odir}/tmp/reclist ${odir} +utils/fix_data_dir.sh ${odir} + +oldnum=$(wc -l ${sdir}/feats.scp | awk '{print $1}') +newnum=$(wc -l ${odir}/feats.scp | awk '{print $1}') +echo "change from $oldnum to $newnum" diff --git a/utils/text2token.py b/utils/text2token.py new file mode 100755 index 00000000..4b25612e --- /dev/null +++ b/utils/text2token.py @@ -0,0 +1,129 @@ +#!/usr/bin/env python3 +# Copyright 2017 Johns Hopkins University (Shinji Watanabe) +# Apache 2.0 (http://www.apache.org/licenses/LICENSE-2.0) +import argparse +import codecs +import re +import sys + +is_python2 = sys.version_info[0] == 2 + + +def exist_or_not(i, match_pos): + start_pos = None + end_pos = None + for pos in match_pos: + if pos[0] <= i < pos[1]: + start_pos = pos[0] + end_pos = pos[1] + break + + return start_pos, end_pos + + +def get_parser(): + parser = argparse.ArgumentParser( + description="convert raw text to tokenized text", + formatter_class=argparse.ArgumentDefaultsHelpFormatter, ) + parser.add_argument( + "--nchar", + "-n", + default=1, + type=int, + help="number of characters to split, i.e., \ + aabb -> a a b b with -n 1 and aa bb with -n 2", ) + parser.add_argument( + "--skip-ncols", "-s", default=0, type=int, help="skip first n columns") + parser.add_argument( + "--space", default="", type=str, help="space symbol") + parser.add_argument( + "--non-lang-syms", + "-l", + default=None, + type=str, + help="list of non-linguistic symobles, e.g., etc.", ) + parser.add_argument( + "text", type=str, default=False, nargs="?", help="input text") + parser.add_argument( + "--trans_type", + "-t", + type=str, + default="char", + choices=["char", "phn"], + help="""Transcript type. char/phn. 
e.g., for TIMIT FADG0_SI1279 - + If trans_type is char, + read from SI1279.WRD file -> "bricks are an alternative" + Else if trans_type is phn, + read from SI1279.PHN file -> "sil b r ih sil k s aa r er n aa l + sil t er n ih sil t ih v sil" """, ) + return parser + + +def main(): + parser = get_parser() + args = parser.parse_args() + + rs = [] + if args.non_lang_syms is not None: + with codecs.open(args.non_lang_syms, "r", encoding="utf-8") as f: + nls = [x.rstrip() for x in f.readlines()] + rs = [re.compile(re.escape(x)) for x in nls] + + if args.text: + f = codecs.open(args.text, encoding="utf-8") + else: + f = codecs.getreader("utf-8")(sys.stdin + if is_python2 else sys.stdin.buffer) + + sys.stdout = codecs.getwriter("utf-8")(sys.stdout + if is_python2 else sys.stdout.buffer) + line = f.readline() + n = args.nchar + while line: + x = line.split() + print(" ".join(x[:args.skip_ncols]), end=" ") + a = " ".join(x[args.skip_ncols:]) + + # get all matched positions + match_pos = [] + for r in rs: + i = 0 + while i >= 0: + m = r.search(a, i) + if m: + match_pos.append([m.start(), m.end()]) + i = m.end() + else: + break + + if args.trans_type == "phn": + a = a.split(" ") + else: + if len(match_pos) > 0: + chars = [] + i = 0 + while i < len(a): + start_pos, end_pos = exist_or_not(i, match_pos) + if start_pos is not None: + chars.append(a[start_pos:end_pos]) + i = end_pos + else: + chars.append(a[i]) + i += 1 + a = chars + + a = [a[j:j + n] for j in range(0, len(a), n)] + + a_flat = [] + for z in a: + a_flat.append("".join(z)) + + a_chars = [z.replace(" ", args.space) for z in a_flat] + if args.trans_type == "phn": + a_chars = [z.replace("sil", args.space) for z in a_chars] + print(" ".join(a_chars)) + line = f.readline() + + +if __name__ == "__main__": + main()
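Finally, a small sketch of the character-splitting idea implemented by `text2token.py`: split a transcript into n-character tokens while keeping non-linguistic symbols intact. This is an illustrative re-implementation, not the script itself; the symbol list, the `<space>` marker and the input string are made up (the script's own `--space` default is the empty string).

```python
import re

# Illustrative sketch of text2token.py's core splitting logic.
def split_chars(text, nchar=1, nlsyms=("<noise>",), space="<space>"):
    pattern = "|".join(map(re.escape, nlsyms))
    pieces = re.split("({})".format(pattern), text) if nlsyms else [text]
    chars = []
    for piece in pieces:
        if piece in nlsyms:
            chars.append(piece)        # keep the non-linguistic symbol whole
        else:
            chars.extend(piece)        # otherwise split into single characters
    tokens = ["".join(chars[i:i + nchar]) for i in range(0, len(chars), nchar)]
    return " ".join(t.replace(" ", space) for t in tokens)

print(split_chars("ab <noise>cd"))     # -> a b <space> <noise> c d
```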