@@ -0,0 +1,30 @@
# .readthedocs.yml
# Read the Docs configuration file
# See https://docs.readthedocs.io/en/stable/config-file/v2.html for details

# Required
version: 2

# Build documentation in the docs/ directory with Sphinx
sphinx:
  configuration: docs/src/conf.py

# Build documentation with MkDocs
#mkdocs:
#  configuration: mkdocs.yml

# Optionally build your docs in additional formats such as PDF
formats: []

# Optionally set the version of Python and requirements required to build your docs
python:
  version: 3.7
  install:
    - method: pip
      path: .
      extra_requirements:
        - doc
    - requirements: docs/requirements.txt
|
@@ -0,0 +1,129 @@
# Parakeet
Parakeet aims to provide a flexible, efficient, and state-of-the-art text-to-speech toolkit for the open-source community. It is built on the PaddlePaddle dynamic graph and includes many influential TTS models.

<div align="center">
  <img src="docs/images/logo.png" width=300 /> <br>
</div>

## News <img src="./docs/images/news_icon.png" width="40"/>
- Oct-12-2021, Refactored the examples code.
- Oct-12-2021, Parallel WaveGAN with LJSpeech. Check [examples/GANVocoder/parallelwave_gan/ljspeech](./examples/GANVocoder/parallelwave_gan/ljspeech).
- Oct-12-2021, FastSpeech2/FastPitch with LJSpeech. Check [examples/fastspeech2/ljspeech](./examples/fastspeech2/ljspeech).
- Sep-14-2021, Reconstruction of TransformerTTS. Check [examples/transformer_tts/ljspeech](./examples/transformer_tts/ljspeech).
- Aug-31-2021, Chinese Text Frontend. Check [examples/text_frontend](./examples/text_frontend).
- Aug-23-2021, FastSpeech2/FastPitch with AISHELL-3. Check [examples/fastspeech2/aishell3](./examples/fastspeech2/aishell3).
- Aug-03-2021, FastSpeech2/FastPitch with CSMSC. Check [examples/fastspeech2/baker](./examples/fastspeech2/baker).
- Jul-19-2021, SpeedySpeech with CSMSC. Check [examples/speedyspeech/baker](./examples/speedyspeech/baker).
- Jul-01-2021, Parallel WaveGAN with CSMSC. Check [examples/GANVocoder/parallelwave_gan/baker](./examples/GANVocoder/parallelwave_gan/baker).
- Jul-01-2021, Montreal-Forced-Aligner. Check [examples/use_mfa](./examples/use_mfa).
- May-07-2021, Voice Cloning in Chinese. Check [examples/tacotron2_aishell3](./examples/tacotron2_aishell3).

## Overview

To make it easy to use existing TTS models and to develop new ones, Parakeet selects typical models and provides their reference implementations in PaddlePaddle. Furthermore, Parakeet abstracts the TTS pipeline and standardizes the procedures of data preprocessing, common module sharing, model configuration, training, and synthesis. The models supported here cover the text frontend, end-to-end acoustic models, and vocoders:

- Text FrontEnd
  - Rule-based Chinese frontend.

- Acoustic Models
  - [【FastSpeech2】FastSpeech 2: Fast and High-Quality End-to-End Text to Speech](https://arxiv.org/abs/2006.04558)
  - [【SpeedySpeech】SpeedySpeech: Efficient Neural Speech Synthesis](https://arxiv.org/abs/2008.03802)
  - [【Transformer TTS】Neural Speech Synthesis with Transformer Network](https://arxiv.org/abs/1809.08895)
  - [【Tacotron2】Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions](https://arxiv.org/abs/1712.05884)
- Vocoders
  - [【Parallel WaveGAN】Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram](https://arxiv.org/abs/1910.11480)
  - [【WaveFlow】WaveFlow: A Compact Flow-based Model for Raw Audio](https://arxiv.org/abs/1912.01219)
- Voice Cloning
  - [Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis](https://arxiv.org/pdf/1806.04558v4.pdf)
  - [【GE2E】Generalized End-to-End Loss for Speaker Verification](https://arxiv.org/abs/1710.10467)

## Setup
It is difficult to install some of this repo's dependencies on Windows, so we recommend **not** using Windows; please use Linux instead.

Make sure the library `libsndfile1` is installed. On Ubuntu, for example:

```bash
sudo apt-get install libsndfile1
```

### Install PaddlePaddle
See [install](https://www.paddlepaddle.org.cn/install/quick) for more details. This repo requires PaddlePaddle **2.1.2** or above.

### Install Parakeet

```bash
git clone https://github.com/PaddlePaddle/Parakeet
cd Parakeet
pip install -e .
```
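
You can quickly check the editable install with a minimal import smoke test (just a sanity command, not an official check):

```bash
python -c "import parakeet; print(parakeet.__file__)"
```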

If some Python dependencies cannot be installed successfully, try running the following command first (replace `python3.6` with your own Python version):
```bash
sudo apt install -y python3.6-dev
```

See [install](https://paddle-parakeet.readthedocs.io/en/latest/install.html) for more details.

## Examples
Entry points to the introduction, training, and synthesis of each example model:

- [>>> Chinese Text Frontend](./examples/text_frontend)
- [>>> FastSpeech2/FastPitch](./examples/fastspeech2)
- [>>> Montreal-Forced-Aligner](./examples/use_mfa)
- [>>> Parallel WaveGAN](./examples/GANVocoder/parallelwave_gan)
- [>>> SpeedySpeech](./examples/speedyspeech)
- [>>> Tacotron2_AISHELL3](./examples/tacotron2_aishell3)
- [>>> GE2E](./examples/ge2e)
- [>>> WaveFlow](./examples/waveflow)
- [>>> TransformerTTS](./examples/transformer_tts)
- [>>> Tacotron2](./examples/tacotron2)

## Audio samples
### TTS models (Acoustic Model + Neural Vocoder)
Check our [website](https://paddleparakeet.readthedocs.io/en/latest/demo.html) for audio samples.

## Released Models

### Acoustic Models

#### FastSpeech2/FastPitch
1. [fastspeech2_nosil_baker_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/fastspeech2_nosil_baker_ckpt_0.4.zip)
2. [fastspeech2_nosil_aishell3_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/fastspeech2_nosil_aishell3_ckpt_0.4.zip)
3. [fastspeech2_nosil_ljspeech_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/fastspeech2_nosil_ljspeech_ckpt_0.5.zip)

#### SpeedySpeech
1. [speedyspeech_nosil_baker_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/speedyspeech_nosil_baker_ckpt_0.5.zip)

#### TransformerTTS
1. [transformer_tts_ljspeech_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/transformer_tts_ljspeech_ckpt_0.4.zip)

#### Tacotron2
1. [tacotron2_ljspeech_ckpt_0.3.zip](https://paddlespeech.bj.bcebos.com/Parakeet/tacotron2_ljspeech_ckpt_0.3.zip)
2. [tacotron2_ljspeech_ckpt_0.3_alternative.zip](https://paddlespeech.bj.bcebos.com/Parakeet/tacotron2_ljspeech_ckpt_0.3_alternative.zip)

### Vocoders

#### WaveFlow
1. [waveflow_ljspeech_ckpt_0.3.zip](https://paddlespeech.bj.bcebos.com/Parakeet/waveflow_ljspeech_ckpt_0.3.zip)

#### Parallel WaveGAN
1. [pwg_baker_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/pwg_baker_ckpt_0.4.zip)
2. [pwg_ljspeech_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/pwg_ljspeech_ckpt_0.5.zip)

### Voice Cloning

#### Tacotron2_AISHELL3
1. [tacotron2_aishell3_ckpt_0.3.zip](https://paddlespeech.bj.bcebos.com/Parakeet/tacotron2_aishell3_ckpt_0.3.zip)

#### GE2E
1. [ge2e_ckpt_0.3.zip](https://paddlespeech.bj.bcebos.com/Parakeet/ge2e_ckpt_0.3.zip)

## License

Parakeet is provided under the [Apache-2.0 license](LICENSE).
@@ -0,0 +1,583 @@
Audio Sample
==================

The main processes of TTS include:

1. Convert the original text into characters/phonemes through the ``text frontend`` module.

2. Convert characters/phonemes into acoustic features, such as linear spectrograms, mel spectrograms, LPC features, etc., through ``Acoustic models``.

3. Convert acoustic features into waveforms through ``Vocoders``.
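
In code, this staged pipeline amounts to three calls in sequence. The sketch below is purely illustrative; the functions are dummy placeholders, not Parakeet's actual API:

.. code-block:: python

   # Dummy stand-ins for the three stages, for illustration only.
   def frontend(text):            # 1. text -> phonemes
       return list(text)

   def acoustic_model(phonemes):  # 2. phonemes -> mel spectrogram frames
       return [[0.0] * 80 for _ in phonemes]

   def vocoder(mel):              # 3. mel frames -> waveform samples
       return [0.0] * (256 * len(mel))

   wav = vocoder(acoustic_model(frontend("hello")))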

When training ``Tacotron2``, ``TransformerTTS`` and ``WaveFlow``, we use the English single-speaker TTS dataset `LJSpeech <https://keithito.com/LJ-Speech-Dataset/>`_ by default. When training ``SpeedySpeech``, ``FastSpeech2`` and ``ParallelWaveGAN``, we use the Chinese single-speaker dataset `CSMSC <https://test.data-baker.com/data/index/source/>`_ by default.

In the future, ``Parakeet`` will mainly use Chinese TTS datasets for its default examples.

Here, we display three types of audio samples:

1. Analysis/synthesis (ground-truth spectrograms + Vocoder)

2. TTS (Acoustic model + Vocoder)

3. Chinese TTS with/without text frontend (mainly tone sandhi)

Analysis/synthesis
--------------------------

Audio samples generated from ground-truth spectrograms with a vocoder.

.. raw:: html

    <b>LJSpeech(English)</b>
    <br>
    <table>
      <tr>
        <th align="left"> GT </th>
        <th align="left"> WaveFlow </th>
      </tr>
      <tr>
        <td>
          <audio controls="controls"><source src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/ljspeech_gt/LJ001-0001.wav" type="audio/wav">Your browser does not support the <code>audio</code> element.</audio>
          <audio controls="controls"><source src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/ljspeech_gt/LJ001-0002.wav" type="audio/wav">Your browser does not support the <code>audio</code> element.</audio>
          <audio controls="controls"><source src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/ljspeech_gt/LJ001-0003.wav" type="audio/wav">Your browser does not support the <code>audio</code> element.</audio>
          <audio controls="controls"><source src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/ljspeech_gt/LJ001-0004.wav" type="audio/wav">Your browser does not support the <code>audio</code> element.</audio>
          <audio controls="controls"><source src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/ljspeech_gt/LJ001-0005.wav" type="audio/wav">Your browser does not support the <code>audio</code> element.</audio>
        </td>
        <td>
          <audio controls="controls"><source src="https://paddlespeech.bj.bcebos.com/Parakeet/waveflow_res128_ljspeech_samples_1.0/step_2000k_sentence_0.wav" type="audio/wav">Your browser does not support the <code>audio</code> element.</audio>
          <audio controls="controls"><source src="https://paddlespeech.bj.bcebos.com/Parakeet/waveflow_res128_ljspeech_samples_1.0/step_2000k_sentence_1.wav" type="audio/wav">Your browser does not support the <code>audio</code> element.</audio>
          <audio controls="controls"><source src="https://paddlespeech.bj.bcebos.com/Parakeet/waveflow_res128_ljspeech_samples_1.0/step_2000k_sentence_2.wav" type="audio/wav">Your browser does not support the <code>audio</code> element.</audio>
          <audio controls="controls"><source src="https://paddlespeech.bj.bcebos.com/Parakeet/waveflow_res128_ljspeech_samples_1.0/step_2000k_sentence_3.wav" type="audio/wav">Your browser does not support the <code>audio</code> element.</audio>
          <audio controls="controls"><source src="https://paddlespeech.bj.bcebos.com/Parakeet/waveflow_res128_ljspeech_samples_1.0/step_2000k_sentence_4.wav" type="audio/wav">Your browser does not support the <code>audio</code> element.</audio>
        </td>
      </tr>
    </table>
    <br>
    <b>CSMSC(Chinese)</b>
    <br>

    <table>
      <tr>
        <th align="left"> GT (converted to 24k) </th>
        <th align="left"> ParallelWaveGAN </th>
      </tr>
      <tr>
        <td>
          <audio controls="controls"><source src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/baker_gt_24k/009901.wav" type="audio/wav">Your browser does not support the <code>audio</code> element.</audio>
          <audio controls="controls"><source src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/baker_gt_24k/009902.wav" type="audio/wav">Your browser does not support the <code>audio</code> element.</audio>
          <audio controls="controls"><source src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/baker_gt_24k/009903.wav" type="audio/wav">Your browser does not support the <code>audio</code> element.</audio>
          <audio controls="controls"><source src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/baker_gt_24k/009904.wav" type="audio/wav">Your browser does not support the <code>audio</code> element.</audio>
          <audio controls="controls"><source src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/baker_gt_24k/009905.wav" type="audio/wav">Your browser does not support the <code>audio</code> element.</audio>
        </td>
        <td>
          <audio controls="controls"><source src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/pwg_baker_ckpt_0.4/009901.wav" type="audio/wav">Your browser does not support the <code>audio</code> element.</audio>
          <audio controls="controls"><source src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/pwg_baker_ckpt_0.4/009902.wav" type="audio/wav">Your browser does not support the <code>audio</code> element.</audio>
          <audio controls="controls"><source src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/pwg_baker_ckpt_0.4/009903.wav" type="audio/wav">Your browser does not support the <code>audio</code> element.</audio>
          <audio controls="controls"><source src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/pwg_baker_ckpt_0.4/009904.wav" type="audio/wav">Your browser does not support the <code>audio</code> element.</audio>
          <audio controls="controls"><source src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/pwg_baker_ckpt_0.4/009905.wav" type="audio/wav">Your browser does not support the <code>audio</code> element.</audio>
        </td>
      </tr>
    </table>

TTS
-------------------

Audio samples generated by a TTS system. Text is first transformed into a spectrogram by a text-to-spectrogram model, then the spectrogram is converted into raw audio by a vocoder.

.. raw:: html

    <table>
      <tr>
        <th align="left"> TransformerTTS + WaveFlow </th>
        <th align="left"> Tacotron2 + WaveFlow </th>
      </tr>
      <tr>
        <td>
          <audio controls="controls"><source src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/transformer_tts_ljspeech_ckpt_0.4_waveflow_ljspeech_ckpt_0.3/001.wav" type="audio/wav">Your browser does not support the <code>audio</code> element.</audio>
          <audio controls="controls"><source src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/transformer_tts_ljspeech_ckpt_0.4_waveflow_ljspeech_ckpt_0.3/002.wav" type="audio/wav">Your browser does not support the <code>audio</code> element.</audio>
          <audio controls="controls"><source src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/transformer_tts_ljspeech_ckpt_0.4_waveflow_ljspeech_ckpt_0.3/003.wav" type="audio/wav">Your browser does not support the <code>audio</code> element.</audio>
          <audio controls="controls"><source src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/transformer_tts_ljspeech_ckpt_0.4_waveflow_ljspeech_ckpt_0.3/004.wav" type="audio/wav">Your browser does not support the <code>audio</code> element.</audio>
          <audio controls="controls"><source src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/transformer_tts_ljspeech_ckpt_0.4_waveflow_ljspeech_ckpt_0.3/005.wav" type="audio/wav">Your browser does not support the <code>audio</code> element.</audio>
          <audio controls="controls"><source src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/transformer_tts_ljspeech_ckpt_0.4_waveflow_ljspeech_ckpt_0.3/006.wav" type="audio/wav">Your browser does not support the <code>audio</code> element.</audio>
          <audio controls="controls"><source src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/transformer_tts_ljspeech_ckpt_0.4_waveflow_ljspeech_ckpt_0.3/007.wav" type="audio/wav">Your browser does not support the <code>audio</code> element.</audio>
          <audio controls="controls"><source src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/transformer_tts_ljspeech_ckpt_0.4_waveflow_ljspeech_ckpt_0.3/008.wav" type="audio/wav">Your browser does not support the <code>audio</code> element.</audio>
          <audio controls="controls"><source src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/transformer_tts_ljspeech_ckpt_0.4_waveflow_ljspeech_ckpt_0.3/009.wav" type="audio/wav">Your browser does not support the <code>audio</code> element.</audio>
        </td>
        <td>
          <audio controls="controls"><source src="https://paddlespeech.bj.bcebos.com/Parakeet/tacotron2_ljspeech_waveflow_samples_0.2/sentence_1.wav" type="audio/wav">Your browser does not support the <code>audio</code> element.</audio>
          <audio controls="controls"><source src="https://paddlespeech.bj.bcebos.com/Parakeet/tacotron2_ljspeech_waveflow_samples_0.2/sentence_2.wav" type="audio/wav">Your browser does not support the <code>audio</code> element.</audio>
          <audio controls="controls"><source src="https://paddlespeech.bj.bcebos.com/Parakeet/tacotron2_ljspeech_waveflow_samples_0.2/sentence_3.wav" type="audio/wav">Your browser does not support the <code>audio</code> element.</audio>
          <audio controls="controls"><source src="https://paddlespeech.bj.bcebos.com/Parakeet/tacotron2_ljspeech_waveflow_samples_0.2/sentence_4.wav" type="audio/wav">Your browser does not support the <code>audio</code> element.</audio>
          <audio controls="controls"><source src="https://paddlespeech.bj.bcebos.com/Parakeet/tacotron2_ljspeech_waveflow_samples_0.2/sentence_5.wav" type="audio/wav">Your browser does not support the <code>audio</code> element.</audio>
          <audio controls="controls"><source src="https://paddlespeech.bj.bcebos.com/Parakeet/tacotron2_ljspeech_waveflow_samples_0.2/sentence_6.wav" type="audio/wav">Your browser does not support the <code>audio</code> element.</audio>
          <audio controls="controls"><source src="https://paddlespeech.bj.bcebos.com/Parakeet/tacotron2_ljspeech_waveflow_samples_0.2/sentence_7.wav" type="audio/wav">Your browser does not support the <code>audio</code> element.</audio>
          <audio controls="controls"><source src="https://paddlespeech.bj.bcebos.com/Parakeet/tacotron2_ljspeech_waveflow_samples_0.2/sentence_8.wav" type="audio/wav">Your browser does not support the <code>audio</code> element.</audio>
          <audio controls="controls"><source src="https://paddlespeech.bj.bcebos.com/Parakeet/tacotron2_ljspeech_waveflow_samples_0.2/sentence_9.wav" type="audio/wav">Your browser does not support the <code>audio</code> element.</audio>
        </td>
      </tr>
    </table>
    <table>
      <tr>
        <th align="left"> SpeedySpeech + ParallelWaveGAN </th>
        <th align="left"> FastSpeech2 + ParallelWaveGAN </th>
      </tr>
      <tr>
        <td>
          <audio controls="controls"><source src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/speedyspeech_baker_ckpt_0.4_pwg_baker_ckpt_0.4/001.wav" type="audio/wav">Your browser does not support the <code>audio</code> element.</audio>
          <audio controls="controls"><source src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/speedyspeech_baker_ckpt_0.4_pwg_baker_ckpt_0.4/002.wav" type="audio/wav">Your browser does not support the <code>audio</code> element.</audio>
          <audio controls="controls"><source src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/speedyspeech_baker_ckpt_0.4_pwg_baker_ckpt_0.4/003.wav" type="audio/wav">Your browser does not support the <code>audio</code> element.</audio>
          <audio controls="controls"><source src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/speedyspeech_baker_ckpt_0.4_pwg_baker_ckpt_0.4/004.wav" type="audio/wav">Your browser does not support the <code>audio</code> element.</audio>
          <audio controls="controls"><source src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/speedyspeech_baker_ckpt_0.4_pwg_baker_ckpt_0.4/005.wav" type="audio/wav">Your browser does not support the <code>audio</code> element.</audio>
          <audio controls="controls"><source src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/speedyspeech_baker_ckpt_0.4_pwg_baker_ckpt_0.4/006.wav" type="audio/wav">Your browser does not support the <code>audio</code> element.</audio>
          <audio controls="controls"><source src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/speedyspeech_baker_ckpt_0.4_pwg_baker_ckpt_0.4/007.wav" type="audio/wav">Your browser does not support the <code>audio</code> element.</audio>
          <audio controls="controls"><source src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/speedyspeech_baker_ckpt_0.4_pwg_baker_ckpt_0.4/008.wav" type="audio/wav">Your browser does not support the <code>audio</code> element.</audio>
          <audio controls="controls"><source src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/speedyspeech_baker_ckpt_0.4_pwg_baker_ckpt_0.4/009.wav" type="audio/wav">Your browser does not support the <code>audio</code> element.</audio>
        </td>
        <td>
          <audio controls="controls"><source src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/fastspeech2_nosil_baker_ckpt_0.4_parallel_wavegan_baker_ckpt_0.4/001.wav" type="audio/wav">Your browser does not support the <code>audio</code> element.</audio>
          <audio controls="controls"><source src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/fastspeech2_nosil_baker_ckpt_0.4_parallel_wavegan_baker_ckpt_0.4/002.wav" type="audio/wav">Your browser does not support the <code>audio</code> element.</audio>
          <audio controls="controls"><source src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/fastspeech2_nosil_baker_ckpt_0.4_parallel_wavegan_baker_ckpt_0.4/003.wav" type="audio/wav">Your browser does not support the <code>audio</code> element.</audio>
          <audio controls="controls"><source src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/fastspeech2_nosil_baker_ckpt_0.4_parallel_wavegan_baker_ckpt_0.4/004.wav" type="audio/wav">Your browser does not support the <code>audio</code> element.</audio>
          <audio controls="controls"><source src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/fastspeech2_nosil_baker_ckpt_0.4_parallel_wavegan_baker_ckpt_0.4/005.wav" type="audio/wav">Your browser does not support the <code>audio</code> element.</audio>
          <audio controls="controls"><source src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/fastspeech2_nosil_baker_ckpt_0.4_parallel_wavegan_baker_ckpt_0.4/006.wav" type="audio/wav">Your browser does not support the <code>audio</code> element.</audio>
          <audio controls="controls"><source src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/fastspeech2_nosil_baker_ckpt_0.4_parallel_wavegan_baker_ckpt_0.4/007.wav" type="audio/wav">Your browser does not support the <code>audio</code> element.</audio>
          <audio controls="controls"><source src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/fastspeech2_nosil_baker_ckpt_0.4_parallel_wavegan_baker_ckpt_0.4/008.wav" type="audio/wav">Your browser does not support the <code>audio</code> element.</audio>
          <audio controls="controls"><source src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/fastspeech2_nosil_baker_ckpt_0.4_parallel_wavegan_baker_ckpt_0.4/009.wav" type="audio/wav">Your browser does not support the <code>audio</code> element.</audio>
        </td>
      </tr>
    </table>

Chinese TTS with/without text frontend
--------------------------------------

We provide a complete Chinese text frontend module in ``Parakeet``. ``Text Normalization`` and ``G2P`` are the most important modules in a text frontend. We assume here that the input texts are already normalized, so we mainly compare the ``G2P`` module.

We use ``FastSpeech2`` + ``ParallelWaveGAN`` here.

.. raw:: html

    <table>
      <tr>
        <th align="left"> With Text Frontend </th>
        <th align="left"> Without Text Frontend </th>
      </tr>
      <tr>
        <td>
          <audio controls="controls"><source src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/with_frontend/001.wav" type="audio/wav">Your browser does not support the <code>audio</code> element.</audio>
          <audio controls="controls"><source src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/with_frontend/002.wav" type="audio/wav">Your browser does not support the <code>audio</code> element.</audio>
          <audio controls="controls"><source src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/with_frontend/003.wav" type="audio/wav">Your browser does not support the <code>audio</code> element.</audio>
          <audio controls="controls"><source src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/with_frontend/004.wav" type="audio/wav">Your browser does not support the <code>audio</code> element.</audio>
          <audio controls="controls"><source src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/with_frontend/005.wav" type="audio/wav">Your browser does not support the <code>audio</code> element.</audio>
          <audio controls="controls"><source src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/with_frontend/006.wav" type="audio/wav">Your browser does not support the <code>audio</code> element.</audio>
          <audio controls="controls"><source src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/with_frontend/007.wav" type="audio/wav">Your browser does not support the <code>audio</code> element.</audio>
          <audio controls="controls"><source src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/with_frontend/008.wav" type="audio/wav">Your browser does not support the <code>audio</code> element.</audio>
          <audio controls="controls"><source src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/with_frontend/009.wav" type="audio/wav">Your browser does not support the <code>audio</code> element.</audio>
          <audio controls="controls"><source src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/with_frontend/010.wav" type="audio/wav">Your browser does not support the <code>audio</code> element.</audio>
        </td>
        <td>
          <audio controls="controls"><source src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/without_frontend/001.wav" type="audio/wav">Your browser does not support the <code>audio</code> element.</audio>
          <audio controls="controls"><source src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/without_frontend/002.wav" type="audio/wav">Your browser does not support the <code>audio</code> element.</audio>
          <audio controls="controls"><source src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/without_frontend/003.wav" type="audio/wav">Your browser does not support the <code>audio</code> element.</audio>
          <audio controls="controls"><source src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/without_frontend/004.wav" type="audio/wav">Your browser does not support the <code>audio</code> element.</audio>
          <audio controls="controls"><source src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/without_frontend/005.wav" type="audio/wav">Your browser does not support the <code>audio</code> element.</audio>
          <audio controls="controls"><source src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/without_frontend/006.wav" type="audio/wav">Your browser does not support the <code>audio</code> element.</audio>
          <audio controls="controls"><source src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/without_frontend/007.wav" type="audio/wav">Your browser does not support the <code>audio</code> element.</audio>
          <audio controls="controls"><source src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/without_frontend/008.wav" type="audio/wav">Your browser does not support the <code>audio</code> element.</audio>
          <audio controls="controls"><source src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/without_frontend/009.wav" type="audio/wav">Your browser does not support the <code>audio</code> element.</audio>
          <audio controls="controls"><source src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/demos/without_frontend/010.wav" type="audio/wav">Your browser does not support the <code>audio</code> element.</audio>
        </td>
      </tr>
    </table>
@@ -0,0 +1,45 @@
.. parakeet documentation master file, created by
   sphinx-quickstart on Fri Sep 10 14:22:24 2021.
   You can adapt this file completely to your liking, but it should at least
   contain the root `toctree` directive.

Parakeet
====================================

``parakeet`` is a deep learning based text-to-speech toolkit built upon the ``paddlepaddle`` framework. It aims to provide a flexible, efficient and state-of-the-art text-to-speech toolkit for the open-source community. It includes many influential TTS models proposed by `Baidu Research <http://research.baidu.com>`_ and other research groups.

``parakeet`` mainly consists of the components below.

#. Implementations of models and commonly used neural network layers.
#. Dataset abstraction and common data preprocessing pipelines.
#. Ready-to-run experiments.

.. toctree::
   :maxdepth: 1
   :caption: Introduction

   introduction

.. toctree::
   :maxdepth: 1
   :caption: Getting started

   install
   basic_usage
   advanced_usage
   cn_text_frontend
   released_models

.. toctree::
   :maxdepth: 1
   :caption: Demos

   demo


Indices and tables
==================

* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`
@@ -0,0 +1,47 @@
# Installation
## Install PaddlePaddle
Parakeet requires PaddlePaddle as its backend. Note that version 2.1.2 or newer of paddle is required.

Since paddlepaddle has multiple packages depending on the device (CPU or GPU) and the dependency libraries, it is recommended to install a proper paddlepaddle package for your device and dependency library versions via `pip`.

Installing paddlepaddle with conda or building paddlepaddle from source is also supported. Please refer to [PaddlePaddle installation](https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/install/pip/linux-pip.html) for more details.

Example instructions to install paddlepaddle via pip are listed below.

### PaddlePaddle with GPU
```bash
# PaddlePaddle for CUDA 10.1
python -m pip install paddlepaddle-gpu==2.1.2.post101 -f https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html
# PaddlePaddle for CUDA 10.2
python -m pip install paddlepaddle-gpu -i https://mirror.baidu.com/pypi/simple
# PaddlePaddle for CUDA 11.0
python -m pip install paddlepaddle-gpu==2.1.2.post110 -f https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html
# PaddlePaddle for CUDA 11.2
python -m pip install paddlepaddle-gpu==2.1.2.post112 -f https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html
```
### PaddlePaddle with CPU
```bash
python -m pip install paddlepaddle==2.1.2 -i https://mirror.baidu.com/pypi/simple
```
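
After installation, you can verify the setup with PaddlePaddle's built-in self check (`paddle.utils.run_check` is part of PaddlePaddle 2.x):

```python
import paddle

# Prints a success message if PaddlePaddle is installed correctly
# and reports whether it can use the GPU.
paddle.utils.run_check()
```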
## Install libsndfile
Experiments in parakeet often involve audio and spectrum processing, so `librosa` and `soundfile` are required. `soundfile` requires an extra C library, `libsndfile`, which is not always handled by pip.

For Windows and Mac users, `libsndfile` is installed together with `soundfile` via pip, but Linux users need to install `libsndfile` via their system package manager. Example commands for popular distributions are listed below.
```bash
# ubuntu, debian
sudo apt-get install libsndfile1
# centos, fedora
sudo yum install libsndfile
# openSUSE
sudo zypper in libsndfile
```
For any problem with the installation of soundfile, please refer to [SoundFile](https://pypi.org/project/SoundFile/).
## Install Parakeet
There are two ways to install parakeet, depending on how you intend to use it.

1. If you want to run the experiments provided by parakeet, or add new models and experiments, it is recommended to clone the project from GitHub (Parakeet) and install it in editable mode.
```bash
git clone https://github.com/PaddlePaddle/Parakeet
cd Parakeet
pip install -e .
```
@@ -0,0 +1,27 @@
# Parakeet - PAddle PARAllel text-to-speech toolKIT

## What is Parakeet?
Parakeet is a deep learning based text-to-speech toolkit built upon the PaddlePaddle framework. It aims to provide a flexible, efficient and state-of-the-art text-to-speech toolkit for the open-source community. It includes many influential TTS models proposed by Baidu Research and other research groups.

## What can Parakeet do?
Parakeet mainly consists of the components below:
- Implementations of models and commonly used neural network layers.
- Dataset abstraction and common data preprocessing pipelines.
- Ready-to-run experiments.

Parakeet provides you with a complete TTS pipeline, including:
- Text FrontEnd
  - Rule-based Chinese frontend.
- Acoustic Models
  - FastSpeech2
  - SpeedySpeech
  - TransformerTTS
  - Tacotron2
- Vocoders
  - Parallel WaveGAN
  - WaveFlow
- Voice Cloning
  - Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis
  - GE2E

Parakeet helps you train TTS models with simple commands; an example is shown below.
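
For instance, the AISHELL-3 preprocessing scripts in this repository are plain command-line programs. The paths below are just those scripts' argparse defaults; adjust them to your dataset location and run the scripts from their example directory:

```bash
# preprocess transcriptions into phone/tone metadata
python preprocess_transcription.py --input=~/datasets/aishell3/train
# extract mel spectrograms from the processed wav files
python extract_mel.py --input=~/datasets/aishell3/train/normalized_wav --output=~/datasets/aishell3/train/mel
```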
@@ -0,0 +1,4 @@
# Aishell3

* tts0 - fastspeech2
* vc0 - tacotron2 voice cloning
@@ -0,0 +1,88 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import pickle
from pathlib import Path

import numpy as np
from paddle.io import Dataset
from parakeet.frontend import Vocab
from parakeet.data import batch_text_id, batch_spec

from preprocess_transcription import _phones, _tones

voc_phones = Vocab(sorted(list(_phones)))
print("vocab_phones:\n", voc_phones)
voc_tones = Vocab(sorted(list(_tones)))
print("vocab_tones:\n", voc_tones)


class AiShell3(Dataset):
    """Processed AiShell3 dataset."""

    def __init__(self, root):
        super().__init__()
        self.root = Path(root).expanduser()
        self.embed_dir = self.root / "embed"
        self.mel_dir = self.root / "mel"

        with open(self.root / "metadata.pickle", 'rb') as f:
            self.records = pickle.load(f)

    def __getitem__(self, index):
        metadatum = self.records[index]
        sentence_id = metadatum["sentence_id"]
        speaker_id = sentence_id[:7]
        phones = metadatum["phones"]
        tones = metadatum["tones"]
        phones = np.array(
            [voc_phones.lookup(item) for item in phones], dtype=np.int64)
        tones = np.array(
            [voc_tones.lookup(item) for item in tones], dtype=np.int64)
        mel = np.load(str(self.mel_dir / speaker_id / (sentence_id + ".npy")))
        embed = np.load(
            str(self.embed_dir / speaker_id / (sentence_id + ".npy")))
        return phones, tones, mel, embed

    def __len__(self):
        return len(self.records)


def collate_aishell3_examples(examples):
    phones, tones, mel, embed = list(zip(*examples))

    text_lengths = np.array([item.shape[0] for item in phones], dtype=np.int64)
    spec_lengths = np.array([item.shape[1] for item in mel], dtype=np.int64)
    T_dec = np.max(spec_lengths)
    # a frame's stop token becomes 1.0 once the decoder index passes that
    # utterance's spectrogram length
    stop_tokens = (
        np.arange(T_dec) >= np.expand_dims(spec_lengths, -1)).astype(np.float32)
    phones, _ = batch_text_id(phones)
    tones, _ = batch_text_id(tones)
    mel, _ = batch_spec(mel)
    mel = np.transpose(mel, (0, 2, 1))
    embed = np.stack(embed)
    # 7 fields
    # (B, T), (B, T), (B, T, C), (B, C), (B,), (B,), (B, T)
    return phones, tones, mel, embed, text_lengths, spec_lengths, stop_tokens


if __name__ == "__main__":
    dataset = AiShell3("~/datasets/aishell3/train")
    example = dataset[0]

    examples = [dataset[i] for i in range(10)]
    batch = collate_aishell3_examples(examples)

    for field in batch:
        print(field.shape, field.dtype)
@@ -0,0 +1,39 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from typing import List, Tuple
from pypinyin import lazy_pinyin, Style
from preprocess_transcription import split_syllable


def convert_to_pinyin(text: str) -> List[str]:
    """Convert text into a list of pinyin syllables. Characters that are not
    Chinese, and thus cannot be converted to pinyin, are split out as-is.
    """
    syllables = lazy_pinyin(
        text, style=Style.TONE3, neutral_tone_with_five=True)
    return syllables


def convert_sentence(text: str) -> Tuple[List[str], List[str]]:
    """Convert a sentence into two lists: phones and tones."""
    syllables = convert_to_pinyin(text)
    phones = []
    tones = []
    for syllable in syllables:
        p, t = split_syllable(syllable)
        phones.extend(p)
        tones.extend(t)

    return phones, tones
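
# Minimal usage sketch (added for illustration; assumes pypinyin is installed
# and preprocess_transcription is importable from the same directory):
if __name__ == "__main__":
    phones, tones = convert_sentence("你好")
    print(phones)  # expected: ['n', 'i', 'h', 'ao']
    print(tones)   # expected: ['0', '3', '0', '3']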
@@ -0,0 +1,82 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from yacs.config import CfgNode as CN

_C = CN()
_C.data = CN(
    dict(
        batch_size=32,  # batch size
        valid_size=64,  # the first N examples are reserved for validation
        sample_rate=22050,  # Hz, sample rate
        n_fft=1024,  # fft frame size
        win_length=1024,  # window size
        hop_length=256,  # hop size between adjacent frames
        fmax=8000,  # Hz, max frequency when converting to mel
        fmin=0,  # Hz, min frequency when converting to mel
        d_mels=80,  # number of mel bands
        padding_idx=0,  # text embedding's padding index
    ))

_C.model = CN(
    dict(
        vocab_size=70,
        n_tones=10,
        reduction_factor=1,  # reduction factor
        d_encoder=512,  # embedding & encoder's internal size
        encoder_conv_layers=3,  # number of conv layers in tacotron2 encoder
        encoder_kernel_size=5,  # kernel size of conv layers in tacotron2 encoder
        d_prenet=256,  # hidden size of decoder prenet
        # hidden size of the first rnn layer in tacotron2 decoder
        d_attention_rnn=1024,
        # hidden size of the second rnn layer in tacotron2 decoder
        d_decoder_rnn=1024,
        d_attention=128,  # hidden size of decoder location linear layer
        attention_filters=32,  # number of filters in decoder location conv layer
        attention_kernel_size=31,  # kernel size of decoder location conv layer
        d_postnet=512,  # hidden size of decoder postnet
        postnet_kernel_size=5,  # kernel size of conv layers in postnet
        postnet_conv_layers=5,  # number of conv layers in decoder postnet
        p_encoder_dropout=0.5,  # dropout probability in encoder
        p_prenet_dropout=0.5,  # dropout probability in decoder prenet

        # dropout probability of first rnn layer in decoder
        p_attention_dropout=0.1,
        # dropout probability of second rnn layer in decoder
        p_decoder_dropout=0.1,
        p_postnet_dropout=0.5,  # dropout probability in decoder postnet
        guided_attention_loss_sigma=0.2,
        d_global_condition=256,

        # whether to use a classifier to predict stop probability
        use_stop_token=False,
        # whether to use guided attention loss in training
        use_guided_attention_loss=True, ))

_C.training = CN(
    dict(
        lr=1e-3,  # learning rate
        weight_decay=1e-6,  # the coeff of weight decay
        grad_clip_thresh=1.0,  # the clip norm of grad clip.
        valid_interval=1000,  # validation interval (in iterations)
        save_interval=1000,  # checkpoint interval (in iterations)
        max_iteration=500000,  # max iterations to train
    ))


def get_cfg_defaults():
    """Get a yacs CfgNode object with default values for my_project."""
    # Return a clone so that the defaults will not be altered
    # This is for the "local variable" use pattern
    return _C.clone()
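

# Typical override pattern (illustrative; merge_from_file and freeze are
# standard yacs CfgNode methods, and the yaml file name is hypothetical):
#   cfg = get_cfg_defaults()
#   cfg.merge_from_file("tacotron2_aishell3.yaml")
#   cfg.freeze()
#   print(cfg.data.sample_rate)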
@@ -0,0 +1,96 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import argparse
import multiprocessing as mp
from functools import partial
from pathlib import Path

import numpy as np
from parakeet.audio import AudioProcessor
from parakeet.audio.spec_normalizer import NormalizerBase, LogMagnitude

import tqdm

from config import get_cfg_defaults


def extract_mel(fname: Path,
                input_dir: Path,
                output_dir: Path,
                p: AudioProcessor,
                n: NormalizerBase):
    relative_path = fname.relative_to(input_dir)
    out_path = (output_dir / relative_path).with_suffix(".npy")
    out_path.parent.mkdir(parents=True, exist_ok=True)
    wav = p.read_wav(fname)
    mel = p.mel_spectrogram(wav)
    mel = n.transform(mel)
    np.save(out_path, mel)


def extract_mel_multispeaker(config, input_dir, output_dir, extension=".wav"):
    input_dir = Path(input_dir).expanduser()
    fnames = list(input_dir.rglob(f"*{extension}"))
    output_dir = Path(output_dir).expanduser()
    output_dir.mkdir(parents=True, exist_ok=True)

    # the default config in config.py names the mel-band count d_mels
    p = AudioProcessor(config.sample_rate, config.n_fft, config.win_length,
                       config.hop_length, config.d_mels, config.fmin,
                       config.fmax)
    n = LogMagnitude(1e-5)

    func = partial(
        extract_mel, input_dir=input_dir, output_dir=output_dir, p=p, n=n)

    with mp.Pool(16) as pool:
        list(
            tqdm.tqdm(
                pool.imap(func, fnames), total=len(fnames), unit="utterance"))


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Extract mel spectrogram from processed wav in AiShell3 training dataset."
    )
    parser.add_argument(
        "--config",
        type=str,
        help="yaml config file to overwrite the default config")
    parser.add_argument(
        "--input",
        type=str,
        default="~/datasets/aishell3/train/normalized_wav",
        help="path of the processed wav folder")
    parser.add_argument(
        "--output",
        type=str,
        default="~/datasets/aishell3/train/mel",
        help="path of the folder to save mel spectrograms")
    parser.add_argument(
        "--opts",
        nargs=argparse.REMAINDER,
        help="options to overwrite --config file and the default config, passing in KEY VALUE pairs"
    )
    default_config = get_cfg_defaults()

    args = parser.parse_args()
    if args.config:
        default_config.merge_from_file(args.config)
    if args.opts:
        default_config.merge_from_list(args.opts)
    default_config.freeze()
    audio_config = default_config.data

    extract_mel_multispeaker(audio_config, args.input, args.output)
@@ -0,0 +1,258 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import argparse
from pathlib import Path
import re
import pickle

import yaml
import tqdm

zh_pattern = re.compile("[\u4e00-\u9fa5]")

_tones = {'<pad>', '<s>', '</s>', '0', '1', '2', '3', '4', '5'}

_pauses = {'%', '$'}

_initials = {
    'b',
    'p',
    'm',
    'f',
    'd',
    't',
    'n',
    'l',
    'g',
    'k',
    'h',
    'j',
    'q',
    'x',
    'zh',
    'ch',
    'sh',
    'r',
    'z',
    'c',
    's',
}

_finals = {
    'ii',
    'iii',
    'a',
    'o',
    'e',
    'ea',
    'ai',
    'ei',
    'ao',
    'ou',
    'an',
    'en',
    'ang',
    'eng',
    'er',
    'i',
    'ia',
    'io',
    'ie',
    'iai',
    'iao',
    'iou',
    'ian',
    'ien',
    'iang',
    'ieng',
    'u',
    'ua',
    'uo',
    'uai',
    'uei',
    'uan',
    'uen',
    'uang',
    'ueng',
    'v',
    've',
    'van',
    'ven',
    'veng',
}

_ernized_symbol = {'&r'}

_specials = {'<pad>', '<unk>', '<s>', '</s>'}

_phones = _initials | _finals | _ernized_symbol | _specials | _pauses


def is_zh(word):
    global zh_pattern
    match = zh_pattern.search(word)
    return match is not None


def ernized(syllable):
    # a tone-annotated syllable such as "huar4" is ernized ("er2" itself is not)
    return syllable[:2] != "er" and syllable[-2] == 'r'


def convert(syllable):
    # expansion of o -> uo
    syllable = re.sub(r"([bpmf])o$", r"\1uo", syllable)
    # syllable = syllable.replace("bo", "buo").replace("po", "puo").replace("mo", "muo").replace("fo", "fuo")
    # expansion for iong, ong
    syllable = syllable.replace("iong", "veng").replace("ong", "ueng")

    # expansion for ing, in
    syllable = syllable.replace("ing", "ieng").replace("in", "ien")

    # expansion for un, ui, iu
    syllable = syllable.replace("un", "uen").replace("ui",
                                                     "uei").replace("iu", "iou")

    # rule for variants of i
    syllable = syllable.replace("zi", "zii").replace("ci", "cii").replace("si", "sii")\
        .replace("zhi", "zhiii").replace("chi", "chiii").replace("shi", "shiii")\
        .replace("ri", "riii")

    # rule for y preceding i, u
    syllable = syllable.replace("yi", "i").replace("yu", "v").replace("y", "i")

    # rule for w
    syllable = syllable.replace("wu", "u").replace("w", "u")

    # rule for v following j, q, x
    syllable = syllable.replace("ju", "jv").replace("qu",
                                                    "qv").replace("xu", "xv")

    return syllable
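

# Worked example (added for illustration): convert("xiong") applies the
# "iong" -> "veng" expansion and returns "xveng"; split_syllable below then
# splits the tone-annotated syllable "xiong1" into phones ['x', 'veng'] and
# tones ['0', '1'].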


def split_syllable(syllable: str):
    """Split a syllable in pinyin into a list of phones and a list of tones.
    Initials have no tone, represented by '0', while finals have tones from
    '1,2,3,4,5'.

    e.g.

    zhang1 -> ['zh', 'ang'], ['0', '1']
    """
    if syllable in _pauses:
        # a pause symbol is its own phone, with a dummy tone
        return [syllable], ['0']

    tone = syllable[-1]
    syllable = convert(syllable[:-1])

    phones = []
    tones = []

    global _initials
    if syllable[:2] in _initials:
        phones.append(syllable[:2])
        tones.append('0')
        phones.append(syllable[2:])
        tones.append(tone)
    elif syllable[0] in _initials:
        phones.append(syllable[0])
        tones.append('0')
        phones.append(syllable[1:])
        tones.append(tone)
    else:
        phones.append(syllable)
        tones.append(tone)
    return phones, tones


def load_aishell3_transcription(line: str):
    sentence_id, pinyin, text = line.strip().split("|")
    syllables = pinyin.strip().split()

    results = []

    for syllable in syllables:
        if syllable in _pauses:
            results.append(syllable)
        elif not ernized(syllable):
            results.append(syllable)
        else:
            # e.g. "huar4" is split into "hua4" plus the ernization mark "&r5"
            results.append(syllable[:-2] + syllable[-1])
            results.append('&r5')

    phones = []
    tones = []
    for syllable in results:
        p, t = split_syllable(syllable)
        phones.extend(p)
        tones.extend(t)
    for p in phones:
        assert p in _phones, p
    return {
        "sentence_id": sentence_id,
        "text": text,
        "syllables": results,
        "phones": phones,
        "tones": tones
    }


def process_aishell3(dataset_root, output_dir):
    dataset_root = Path(dataset_root).expanduser()
    output_dir = Path(output_dir).expanduser()
    output_dir.mkdir(parents=True, exist_ok=True)

    prosody_label_path = dataset_root / "label_train-set.txt"
    with open(prosody_label_path, 'rt') as f:
        lines = [line.strip() for line in f]

    records = lines[5:]

    processed_records = []
    for record in tqdm.tqdm(records):
        new_record = load_aishell3_transcription(record)
        processed_records.append(new_record)
        print(new_record)

    with open(output_dir / "metadata.pickle", 'wb') as f:
        pickle.dump(processed_records, f)

    with open(output_dir / "metadata.yaml", 'wt', encoding="utf-8") as f:
        yaml.safe_dump(
            processed_records, f, default_flow_style=None, allow_unicode=True)

    print("metadata done!")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Preprocess transcription of AiShell3 and save it in a compact file (yaml and pickle)."
    )
    parser.add_argument(
        "--input",
        type=str,
        default="~/datasets/aishell3/train",
        help="path of the training dataset (contains a label_train-set.txt).")
    parser.add_argument(
        "--output",
        type=str,
        help="the directory to save the processed transcription."
        "If not provided, it would be the same as the input.")
    args = parser.parse_args()
    if args.output is None:
        args.output = args.input

    process_aishell3(args.input, args.output)
@ -0,0 +1,95 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import argparse
from pathlib import Path
from multiprocessing import Pool
from functools import partial

import numpy as np
import librosa
import soundfile as sf
from tqdm import tqdm
from praatio import tgio


def get_valid_part(fpath):
    f = tgio.openTextgrid(fpath)

    start = 0
    phone_entry_list = f.tierDict['phones'].entryList
    first_entry = phone_entry_list[0]
    # skip leading silence
    if first_entry.label == "sil":
        start = first_entry.end

    # skip trailing short pause
    last_entry = phone_entry_list[-1]
    if last_entry.label == "sp":
        end = last_entry.start
    else:
        end = last_entry.end
    return start, end


def process_utterance(fpath, source_dir, target_dir, alignment_dir):
    rel_path = fpath.relative_to(source_dir)
    opath = target_dir / rel_path
    apath = (alignment_dir / rel_path).with_suffix(".TextGrid")
    opath.parent.mkdir(parents=True, exist_ok=True)

    start, end = get_valid_part(apath)
    wav, _ = librosa.load(fpath, sr=22050, offset=start, duration=end - start)
    # normalize volume by peak amplitude (use the absolute value so that a
    # negative peak does not cause clipping)
    normalized_wav = wav / np.max(np.abs(wav)) * 0.999
    sf.write(opath, normalized_wav, samplerate=22050, subtype='PCM_16')
    # print(f"{fpath} => {opath}")


def preprocess_aishell3(source_dir, target_dir, alignment_dir):
    source_dir = Path(source_dir).expanduser()
    target_dir = Path(target_dir).expanduser()
    alignment_dir = Path(alignment_dir).expanduser()

    wav_paths = list(source_dir.rglob("*.wav"))
    print(f"there are {len(wav_paths)} audio files in total")
    fx = partial(
        process_utterance,
        source_dir=source_dir,
        target_dir=target_dir,
        alignment_dir=alignment_dir)
    with Pool(16) as p:
        list(
            tqdm(p.imap(fx, wav_paths), total=len(wav_paths), unit="utterance"))


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Process audio in AiShell3, trim silence according to the alignment "
        "files generated by MFA, and normalize volume by peak.")
    parser.add_argument(
        "--input",
        type=str,
        default="~/datasets/aishell3/train/wav",
        help="path of the original audio folder in aishell3.")
    parser.add_argument(
        "--output",
        type=str,
        default="~/datasets/aishell3/train/normalized_wav",
        help="path of the folder to save the processed audio files.")
    parser.add_argument(
        "--alignment",
        type=str,
        default="~/datasets/aishell3/train/alignment",
        help="path of the alignment files.")
    args = parser.parse_args()

    preprocess_aishell3(args.input, args.output, args.alignment)
@ -0,0 +1,262 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import time
from pathlib import Path
from collections import defaultdict

import numpy as np
from matplotlib import pyplot as plt

import paddle
from paddle import distributed as dist
from paddle.io import DataLoader, DistributedBatchSampler

from parakeet.data import dataset
from parakeet.training.cli import default_argument_parser
from parakeet.training.experiment import ExperimentBase
from parakeet.utils import display, mp_tools
from parakeet.models.tacotron2 import Tacotron2, Tacotron2Loss

from config import get_cfg_defaults
from aishell3 import AiShell3, collate_aishell3_examples


class Experiment(ExperimentBase):
    def compute_losses(self, inputs, outputs):
        texts, tones, mel_targets, utterance_embeds, text_lens, output_lens, stop_tokens = inputs

        mel_outputs = outputs["mel_output"]
        mel_outputs_postnet = outputs["mel_outputs_postnet"]
        alignments = outputs["alignments"]

        losses = self.criterion(mel_outputs, mel_outputs_postnet, mel_targets,
                                alignments, output_lens, text_lens)
        return losses

    def train_batch(self):
        start = time.time()
        batch = self.read_batch()
        data_loader_time = time.time() - start

        self.optimizer.clear_grad()
        self.model.train()
        texts, tones, mels, utterance_embeds, text_lens, output_lens, stop_tokens = batch
        outputs = self.model(
            texts,
            text_lens,
            mels,
            output_lens,
            tones=tones,
            global_condition=utterance_embeds)
        losses = self.compute_losses(batch, outputs)
        loss = losses["loss"]
        loss.backward()
        self.optimizer.step()
        iteration_time = time.time() - start

        losses_np = {k: float(v) for k, v in losses.items()}
        # logging
        msg = "Rank: {}, ".format(dist.get_rank())
        msg += "step: {}, ".format(self.iteration)
        msg += "time: {:>.3f}s/{:>.3f}s, ".format(data_loader_time,
                                                  iteration_time)
        msg += ', '.join('{}: {:>.6f}'.format(k, v)
                         for k, v in losses_np.items())
        self.logger.info(msg)

        if dist.get_rank() == 0:
            for key, value in losses_np.items():
                self.visualizer.add_scalar(f"train_loss/{key}", value,
                                           self.iteration)

    @mp_tools.rank_zero_only
    @paddle.no_grad()
    def valid(self):
        valid_losses = defaultdict(list)
        for i, batch in enumerate(self.valid_loader):
            texts, tones, mels, utterance_embeds, text_lens, output_lens, stop_tokens = batch
            outputs = self.model(
                texts,
                text_lens,
                mels,
                output_lens,
                tones=tones,
                global_condition=utterance_embeds)
            losses = self.compute_losses(batch, outputs)
            for key, value in losses.items():
                valid_losses[key].append(float(value))

            attention_weights = outputs["alignments"]
            self.visualizer.add_figure(
                f"valid_sentence_{i}_alignments",
                display.plot_alignment(attention_weights[0].numpy().T),
                self.iteration)
            self.visualizer.add_figure(
                f"valid_sentence_{i}_target_spectrogram",
                display.plot_spectrogram(mels[0].numpy().T), self.iteration)
            mel_pred = outputs['mel_outputs_postnet']
            self.visualizer.add_figure(
                f"valid_sentence_{i}_predicted_spectrogram",
                display.plot_spectrogram(mel_pred[0].numpy().T), self.iteration)

        # write visual log
        valid_losses = {k: np.mean(v) for k, v in valid_losses.items()}

        # logging
        msg = "Valid: "
        msg += "step: {}, ".format(self.iteration)
        msg += ', '.join('{}: {:>.6f}'.format(k, v)
                         for k, v in valid_losses.items())
        self.logger.info(msg)

        for key, value in valid_losses.items():
            self.visualizer.add_scalar(f"valid/{key}", value, self.iteration)

    @mp_tools.rank_zero_only
    @paddle.no_grad()
    def eval(self):
        """Evaluation of Tacotron2 in autoregressive manner."""
        self.model.eval()
        mel_dir = Path(self.output_dir / ("eval_{}".format(self.iteration)))
        mel_dir.mkdir(parents=True, exist_ok=True)
        for i, batch in enumerate(self.test_loader):
            texts, tones, mels, utterance_embeds, *_ = batch
            outputs = self.model.infer(
                texts, tones=tones, global_condition=utterance_embeds)

            display.plot_alignment(outputs["alignments"][0].numpy().T)
            plt.savefig(mel_dir / f"sentence_{i}.png")
            plt.close()
            np.save(mel_dir / f"sentence_{i}",
                    outputs["mel_outputs_postnet"][0].numpy().T)
            print(f"sentence_{i}")

    def setup_model(self):
        config = self.config
        model = Tacotron2(
            vocab_size=config.model.vocab_size,
            n_tones=config.model.n_tones,
            d_mels=config.data.d_mels,
            d_encoder=config.model.d_encoder,
            encoder_conv_layers=config.model.encoder_conv_layers,
            encoder_kernel_size=config.model.encoder_kernel_size,
            d_prenet=config.model.d_prenet,
            d_attention_rnn=config.model.d_attention_rnn,
            d_decoder_rnn=config.model.d_decoder_rnn,
            attention_filters=config.model.attention_filters,
            attention_kernel_size=config.model.attention_kernel_size,
            d_attention=config.model.d_attention,
            d_postnet=config.model.d_postnet,
            postnet_kernel_size=config.model.postnet_kernel_size,
            postnet_conv_layers=config.model.postnet_conv_layers,
            reduction_factor=config.model.reduction_factor,
            p_encoder_dropout=config.model.p_encoder_dropout,
            p_prenet_dropout=config.model.p_prenet_dropout,
            p_attention_dropout=config.model.p_attention_dropout,
            p_decoder_dropout=config.model.p_decoder_dropout,
            p_postnet_dropout=config.model.p_postnet_dropout,
            d_global_condition=config.model.d_global_condition,
            use_stop_token=config.model.use_stop_token, )

        if self.parallel:
            model = paddle.DataParallel(model)

        grad_clip = paddle.nn.ClipGradByGlobalNorm(
            config.training.grad_clip_thresh)
        optimizer = paddle.optimizer.Adam(
            learning_rate=config.training.lr,
            parameters=model.parameters(),
            weight_decay=paddle.regularizer.L2Decay(
                config.training.weight_decay),
            grad_clip=grad_clip)
        criterion = Tacotron2Loss(
            use_stop_token_loss=config.model.use_stop_token,
            use_guided_attention_loss=config.model.use_guided_attention_loss,
            sigma=config.model.guided_attention_loss_sigma)
        self.model = model
        self.optimizer = optimizer
        self.criterion = criterion

    def setup_dataloader(self):
        args = self.args
        config = self.config
        aishell3_dataset = AiShell3(args.data)

        valid_set, train_set = dataset.split(aishell3_dataset,
                                             config.data.valid_size)
        batch_fn = collate_aishell3_examples

        if not self.parallel:
            self.train_loader = DataLoader(
                train_set,
                batch_size=config.data.batch_size,
                shuffle=True,
                drop_last=True,
                collate_fn=batch_fn)
        else:
            sampler = DistributedBatchSampler(
                train_set,
                batch_size=config.data.batch_size,
                shuffle=True,
                drop_last=True)
            self.train_loader = DataLoader(
                train_set, batch_sampler=sampler, collate_fn=batch_fn)

        self.valid_loader = DataLoader(
            valid_set,
            batch_size=config.data.batch_size,
            shuffle=False,
            drop_last=False,
            collate_fn=batch_fn)

        self.test_loader = DataLoader(
            valid_set,
            batch_size=1,
            shuffle=False,
            drop_last=False,
            collate_fn=batch_fn)


def main_sp(config, args):
    exp = Experiment(config, args)
    exp.setup()
    exp.resume_or_load()
    if not args.test:
        exp.run()
    else:
        exp.eval()


def main(config, args):
    if args.nprocs > 1 and args.device == "gpu":
        dist.spawn(main_sp, args=(config, args), nprocs=args.nprocs)
    else:
        main_sp(config, args)


if __name__ == "__main__":
    config = get_cfg_defaults()
    parser = default_argument_parser()
    parser.add_argument("--test", action="store_true")
    args = parser.parse_args()
    if args.config:
        config.merge_from_file(args.config)
    if args.opts:
        config.merge_from_list(args.opts)
    config.freeze()
    print(config)
    print(args)

    main(config, args)
@ -0,0 +1,226 @@
# Speedyspeech with CSMSC

This example contains code used to train a [Speedyspeech](http://arxiv.org/abs/2008.03802) model with the [Chinese Standard Mandarin Speech Corpus](https://www.data-baker.com/open_source.html). NOTE that we only implement the student part of the Speedyspeech model. The ground truth alignment used to train the model is extracted from the dataset using [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner).

## Dataset
### Download and Extract the dataset
Download CSMSC from its [Official Website](https://test.data-baker.com/data/index/source).

### Get MFA result of CSMSC and Extract it
We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get durations for Speedyspeech.
You can download the alignments from [baker_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/BZNSYP/with_tone/baker_alignment_tone.tar.gz), or train your own MFA model by referring to the [use_mfa example](https://github.com/PaddlePaddle/Parakeet/tree/develop/examples/use_mfa) in our repo.

## Preprocess the dataset
Assume the path to the dataset is `~/datasets/BZNSYP`.
Assume the path to the MFA result of CSMSC is `./baker_alignment_tone`.
Run the command below to preprocess the dataset.
```bash
./preprocess.sh
```
When it is done, a `dump` folder is created in the current directory. The structure of the dump folder is listed below.

```text
dump
├── dev
│   ├── norm
│   └── raw
├── test
│   ├── norm
│   └── raw
└── train
    ├── norm
    ├── raw
    └── feats_stats.npy
```

The dataset is split into 3 parts, namely `train`, `dev` and `test`, each of which contains a `norm` and a `raw` subfolder. The `raw` folder contains the log magnitude of the mel spectrogram of each utterance, while the `norm` folder contains the normalized spectrograms. The statistics used to normalize the spectrograms are computed from the training set and stored in `dump/train/feats_stats.npy`.

There is also a `metadata.jsonl` in each subfolder. It is a table-like file that contains the phones, tones, durations, spectrogram path, and id of each utterance; a minimal way to inspect it is shown below.
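The following is a small sketch of reading such a metadata file, assuming preprocessing has already produced `dump/train/norm/metadata.jsonl`; the record keys match what `normalize.py` writes.

```python
import jsonlines

# each line of metadata.jsonl is a JSON object describing one utterance
with jsonlines.open("dump/train/norm/metadata.jsonl") as reader:
    for record in reader:
        print(record["utt_id"], record["num_phones"], record["num_frames"])
        print(record["phones"], record["tones"], record["durations"])
        print(record["feats"])  # path of the saved spectrogram (*.npy)
        break
```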

## Train the model
`./run.sh` calls `../train.py`.
```bash
./run.sh
```
Here's the complete help message.

```text
usage: train.py [-h] [--config CONFIG] [--train-metadata TRAIN_METADATA]
                [--dev-metadata DEV_METADATA] [--output-dir OUTPUT_DIR]
                [--device DEVICE] [--nprocs NPROCS] [--verbose VERBOSE]
                [--use-relative-path USE_RELATIVE_PATH]
                [--phones-dict PHONES_DICT] [--tones-dict TONES_DICT]

Train a Speedyspeech model with a single speaker dataset.

optional arguments:
  -h, --help            show this help message and exit
  --config CONFIG       config file.
  --train-metadata TRAIN_METADATA
                        training data.
  --dev-metadata DEV_METADATA
                        dev data.
  --output-dir OUTPUT_DIR
                        output dir.
  --device DEVICE       device type to use.
  --nprocs NPROCS       number of processes.
  --verbose VERBOSE     verbose.
  --use-relative-path USE_RELATIVE_PATH
                        whether use relative path in metadata
  --phones-dict PHONES_DICT
                        phone vocabulary file.
  --tones-dict TONES_DICT
                        tone vocabulary file.
```

1. `--config` is a config file in yaml format to overwrite the default config, which can be found at `conf/default.yaml`.
2. `--train-metadata` and `--dev-metadata` should be the metadata files in the `norm` subfolders of `train` and `dev` in the `dump` folder.
3. `--output-dir` is the directory to save the results of the experiment. Checkpoints are saved in `checkpoints/` inside this directory.
4. `--device` is the type of the device to run the experiment; 'cpu' and 'gpu' are supported.
5. `--nprocs` is the number of processes to run in parallel; note that nprocs > 1 is only supported when `--device` is 'gpu'.
6. `--phones-dict` is the path of the phone vocabulary file.
7. `--tones-dict` is the path of the tone vocabulary file.

## Pretrained Model
A pretrained SpeedySpeech model, trained on audio with the silence at the edges trimmed: [speedyspeech_nosil_baker_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/speedyspeech_nosil_baker_ckpt_0.5.zip)

The SpeedySpeech checkpoint contains the files listed below.
```text
speedyspeech_nosil_baker_ckpt_0.5
├── default.yaml             # default config used to train speedyspeech
├── feats_stats.npy          # statistics used to normalize spectrogram when training speedyspeech
├── phone_id_map.txt         # phone vocabulary file when training speedyspeech
├── snapshot_iter_11400.pdz  # model parameters and optimizer states
└── tone_id_map.txt          # tone vocabulary file when training speedyspeech
```
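As a quick sanity check, the snippet below is a minimal sketch of restoring the pretrained model from these files; it mirrors what `synthesize_e2e.py` does, and the local paths are assumptions.

```python
import paddle
import yaml
from yacs.config import CfgNode
from parakeet.models.speedyspeech import SpeedySpeech

ckpt_dir = "speedyspeech_nosil_baker_ckpt_0.5"
with open(f"{ckpt_dir}/default.yaml") as f:
    config = CfgNode(yaml.safe_load(f))
# vocabulary sizes are the numbers of lines in the id map files
with open(f"{ckpt_dir}/phone_id_map.txt") as f:
    vocab_size = len(f.readlines())
with open(f"{ckpt_dir}/tone_id_map.txt") as f:
    tone_size = len(f.readlines())

model = SpeedySpeech(
    vocab_size=vocab_size, tone_size=tone_size, **config["model"])
model.set_state_dict(
    paddle.load(f"{ckpt_dir}/snapshot_iter_11400.pdz")["main_params"])
model.eval()
```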

## Synthesize
We use [parallel wavegan](https://github.com/PaddlePaddle/Parakeet/tree/develop/examples/parallelwave_gan/baker) as the neural vocoder.
Download the pretrained parallel wavegan model from [pwg_baker_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/pwg_baker_ckpt_0.4.zip) and unzip it.
```bash
unzip pwg_baker_ckpt_0.4.zip
```
The Parallel WaveGAN checkpoint contains the files listed below.
```text
pwg_baker_ckpt_0.4
├── pwg_default.yaml              # default config used to train parallel wavegan
├── pwg_snapshot_iter_400000.pdz  # model parameters of parallel wavegan
└── pwg_stats.npy                 # statistics used to normalize spectrogram when training parallel wavegan
```
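Along the same lines, here is a minimal sketch of restoring the vocoder from these files, again mirroring `synthesize_e2e.py` (local paths assumed):

```python
import numpy as np
import paddle
import yaml
from yacs.config import CfgNode
from parakeet.models.parallel_wavegan import PWGGenerator
from parakeet.models.parallel_wavegan import PWGInference
from parakeet.modules.normalizer import ZScore

pwg_dir = "pwg_baker_ckpt_0.4"
with open(f"{pwg_dir}/pwg_default.yaml") as f:
    pwg_config = CfgNode(yaml.safe_load(f))

vocoder = PWGGenerator(**pwg_config["generator_params"])
vocoder.set_state_dict(
    paddle.load(f"{pwg_dir}/pwg_snapshot_iter_400000.pdz")["generator_params"])
vocoder.remove_weight_norm()
vocoder.eval()

# mean and std of the training features, used to normalize the input mels
mu, std = np.load(f"{pwg_dir}/pwg_stats.npy")
normalizer = ZScore(paddle.to_tensor(mu), paddle.to_tensor(std))
pwg_inference = PWGInference(normalizer, vocoder)
```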

`synthesize.sh` calls `../synthesize.py`, which can synthesize waveforms from `metadata.jsonl`.
```bash
./synthesize.sh
```
```text
usage: synthesize.py [-h] [--speedyspeech-config SPEEDYSPEECH_CONFIG]
                     [--speedyspeech-checkpoint SPEEDYSPEECH_CHECKPOINT]
                     [--speedyspeech-stat SPEEDYSPEECH_STAT]
                     [--pwg-config PWG_CONFIG]
                     [--pwg-checkpoint PWG_CHECKPOINT] [--pwg-stat PWG_STAT]
                     [--phones-dict PHONES_DICT] [--tones-dict TONES_DICT]
                     [--test-metadata TEST_METADATA] [--output-dir OUTPUT_DIR]
                     [--inference-dir INFERENCE_DIR] [--device DEVICE]
                     [--verbose VERBOSE]

Synthesize with speedyspeech & parallel wavegan.

optional arguments:
  -h, --help            show this help message and exit
  --speedyspeech-config SPEEDYSPEECH_CONFIG
                        config file for speedyspeech.
  --speedyspeech-checkpoint SPEEDYSPEECH_CHECKPOINT
                        speedyspeech checkpoint to load.
  --speedyspeech-stat SPEEDYSPEECH_STAT
                        mean and standard deviation used to normalize
                        spectrogram when training speedyspeech.
  --pwg-config PWG_CONFIG
                        config file for parallelwavegan.
  --pwg-checkpoint PWG_CHECKPOINT
                        parallel wavegan generator parameters to load.
  --pwg-stat PWG_STAT   mean and standard deviation used to normalize
                        spectrogram when training parallel wavegan.
  --phones-dict PHONES_DICT
                        phone vocabulary file.
  --tones-dict TONES_DICT
                        tone vocabulary file.
  --test-metadata TEST_METADATA
                        test metadata
  --output-dir OUTPUT_DIR
                        output dir
  --inference-dir INFERENCE_DIR
                        dir to save inference models
  --device DEVICE       device type to use
  --verbose VERBOSE     verbose
```
`synthesize_e2e.sh` calls `synthesize_e2e.py`, which can synthesize waveforms from a text file.
```bash
./synthesize_e2e.sh
```
```text
usage: synthesize_e2e.py [-h] [--speedyspeech-config SPEEDYSPEECH_CONFIG]
                         [--speedyspeech-checkpoint SPEEDYSPEECH_CHECKPOINT]
                         [--speedyspeech-stat SPEEDYSPEECH_STAT]
                         [--pwg-config PWG_CONFIG]
                         [--pwg-checkpoint PWG_CHECKPOINT]
                         [--pwg-stat PWG_STAT] [--text TEXT]
                         [--phones-dict PHONES_DICT] [--tones-dict TONES_DICT]
                         [--output-dir OUTPUT_DIR]
                         [--inference-dir INFERENCE_DIR] [--device DEVICE]
                         [--verbose VERBOSE]

Synthesize with speedyspeech & parallel wavegan.

optional arguments:
  -h, --help            show this help message and exit
  --speedyspeech-config SPEEDYSPEECH_CONFIG
                        config file for speedyspeech.
  --speedyspeech-checkpoint SPEEDYSPEECH_CHECKPOINT
                        speedyspeech checkpoint to load.
  --speedyspeech-stat SPEEDYSPEECH_STAT
                        mean and standard deviation used to normalize
                        spectrogram when training speedyspeech.
  --pwg-config PWG_CONFIG
                        config file for parallelwavegan.
  --pwg-checkpoint PWG_CHECKPOINT
                        parallel wavegan checkpoint to load.
  --pwg-stat PWG_STAT   mean and standard deviation used to normalize
                        spectrogram when training parallel wavegan.
  --text TEXT           text to synthesize, a 'utt_id sentence' pair per line
  --phones-dict PHONES_DICT
                        phone vocabulary file.
  --tones-dict TONES_DICT
                        tone vocabulary file.
  --output-dir OUTPUT_DIR
                        output dir
  --inference-dir INFERENCE_DIR
                        dir to save inference models
  --device DEVICE       device type to use
  --verbose VERBOSE     verbose
```
1. `--speedyspeech-config`, `--speedyspeech-checkpoint`, `--speedyspeech-stat` are arguments for speedyspeech, which correspond to the 3 files in the speedyspeech pretrained model.
2. `--pwg-config`, `--pwg-checkpoint`, `--pwg-stat` are arguments for parallel wavegan, which correspond to the 3 files in the parallel wavegan pretrained model.
3. `--text` is the text file, which contains sentences to synthesize.
4. `--output-dir` is the directory to save the synthesized audio files.
5. `--inference-dir` is the directory to save the exported models, which can be used with Paddle inference.
6. `--device` is the type of device to run synthesis; 'cpu' and 'gpu' are supported. 'gpu' is recommended for faster synthesis.
7. `--phones-dict` is the path of the phone vocabulary file.
8. `--tones-dict` is the path of the tone vocabulary file (a sketch of how both vocabularies are used by the text frontend follows this list).
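The sketch below shows what the phone/tone vocabularies are for: the Chinese text frontend turns a sentence into phone-id and tone-id sequences, which are the model's inputs. It mirrors the frontend usage in `synthesize_e2e.py`; the vocabulary paths and the example sentence (the first line of `../sentences.txt`) are assumptions.

```python
from parakeet.frontend.zh_frontend import Frontend

frontend = Frontend(
    phone_vocab_path="speedyspeech_nosil_baker_ckpt_0.5/phone_id_map.txt",
    tone_vocab_path="speedyspeech_nosil_baker_ckpt_0.5/tone_id_map.txt")
input_ids = frontend.get_input_ids(
    "凯莫瑞安联合体的经济崩溃,迫在眉睫。", merge_sentences=True, get_tone_ids=True)
phones = input_ids["phone_ids"][0]  # 1D int64 tensor of phone ids
tones = input_ids["tone_ids"][0]    # 1D int64 tensor of tone ids
```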

You can use the following script to synthesize `../sentences.txt` using the pretrained speedyspeech and parallel wavegan models.
```bash
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 synthesize_e2e.py \
  --speedyspeech-config=speedyspeech_nosil_baker_ckpt_0.5/default.yaml \
  --speedyspeech-checkpoint=speedyspeech_nosil_baker_ckpt_0.5/snapshot_iter_11400.pdz \
  --speedyspeech-stat=speedyspeech_nosil_baker_ckpt_0.5/feats_stats.npy \
  --pwg-config=pwg_baker_ckpt_0.4/pwg_default.yaml \
  --pwg-checkpoint=pwg_baker_ckpt_0.4/pwg_snapshot_iter_400000.pdz \
  --pwg-stat=pwg_baker_ckpt_0.4/pwg_stats.npy \
  --text=../sentences.txt \
  --output-dir=exp/default/test_e2e \
  --inference-dir=exp/default/inference \
  --device="gpu" \
  --phones-dict=speedyspeech_nosil_baker_ckpt_0.5/phone_id_map.txt \
  --tones-dict=speedyspeech_nosil_baker_ckpt_0.5/tone_id_map.txt
```
@ -0,0 +1,50 @@
###########################################################
#                FEATURE EXTRACTION SETTING               #
###########################################################
fs: 24000           # Sampling rate.
n_fft: 2048         # FFT size.
n_shift: 300        # Hop size.
win_length: 1200    # Window length.
                    # If set to null, it will be the same as fft_size.
window: "hann"      # Window function.
n_mels: 80          # Number of mel basis.
fmin: 80            # Minimum freq in mel basis calculation.
fmax: 7600          # Maximum frequency in mel basis calculation.

###########################################################
#                       DATA SETTING                      #
###########################################################
batch_size: 64
num_workers: 4

###########################################################
#                      MODEL SETTING                      #
###########################################################
model:
  encoder_hidden_size: 128
  encoder_kernel_size: 3
  encoder_dilations: [1, 3, 9, 27, 1, 3, 9, 27, 1, 1]
  duration_predictor_hidden_size: 128
  decoder_hidden_size: 128
  decoder_output_size: 80
  decoder_kernel_size: 3
  decoder_dilations: [1, 3, 9, 27, 1, 3, 9, 27, 1, 3, 9, 27, 1, 3, 9, 27, 1, 1]

###########################################################
#                     OPTIMIZER SETTING                   #
###########################################################
optimizer:
  optim: adam            # optimizer type
  learning_rate: 0.002   # learning rate
  max_grad_norm: 1

###########################################################
#                     TRAINING SETTING                    #
###########################################################
max_epoch: 200
num_snapshots: 5

###########################################################
#                       OTHER SETTING                     #
###########################################################
seed: 10086
@ -0,0 +1,146 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import argparse
import os
from pathlib import Path

import soundfile as sf
from paddle import inference
from parakeet.frontend.zh_frontend import Frontend


def main():
    parser = argparse.ArgumentParser(
        description="Paddle Inference with speedyspeech & parallel wavegan.")
    parser.add_argument(
        "--inference-dir", type=str, help="dir to save inference models")
    parser.add_argument(
        "--text",
        type=str,
        help="text to synthesize, a 'utt_id sentence' pair per line")
    parser.add_argument("--output-dir", type=str, help="output dir")
    parser.add_argument(
        "--enable-auto-log", action="store_true", help="use auto log")
    parser.add_argument(
        "--phones-dict",
        type=str,
        default="phones.txt",
        help="phone vocabulary file.")
    parser.add_argument(
        "--tones-dict",
        type=str,
        default="tones.txt",
        help="tone vocabulary file.")

    args, _ = parser.parse_known_args()

    frontend = Frontend(
        phone_vocab_path=args.phones_dict, tone_vocab_path=args.tones_dict)
    print("frontend done!")

    speedyspeech_config = inference.Config(
        str(Path(args.inference_dir) / "speedyspeech.pdmodel"),
        str(Path(args.inference_dir) / "speedyspeech.pdiparams"))
    speedyspeech_config.enable_use_gpu(100, 0)
    speedyspeech_config.enable_memory_optim()
    speedyspeech_predictor = inference.create_predictor(speedyspeech_config)

    pwg_config = inference.Config(
        str(Path(args.inference_dir) / "pwg.pdmodel"),
        str(Path(args.inference_dir) / "pwg.pdiparams"))
    pwg_config.enable_use_gpu(100, 0)
    pwg_config.enable_memory_optim()
    pwg_predictor = inference.create_predictor(pwg_config)

    if args.enable_auto_log:
        import auto_log
        os.makedirs("output", exist_ok=True)
        pid = os.getpid()
        logger = auto_log.AutoLogger(
            model_name="speedyspeech",
            model_precision='float32',
            batch_size=1,
            data_shape="dynamic",
            save_path="./output/auto_log.log",
            inference_config=speedyspeech_config,
            pids=pid,
            process_name=None,
            gpu_ids=0,
            time_keys=['preprocess_time', 'inference_time', 'postprocess_time'],
            warmup=0)

    output_dir = Path(args.output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    sentences = []

    with open(args.text, 'rt') as f:
        for line in f:
            utt_id, sentence = line.strip().split()
            sentences.append((utt_id, sentence))

    for utt_id, sentence in sentences:
        if args.enable_auto_log:
            logger.times.start()

        input_ids = frontend.get_input_ids(
            sentence, merge_sentences=True, get_tone_ids=True)
        phone_ids = input_ids["phone_ids"]
        tone_ids = input_ids["tone_ids"]
        # Paddle Inference input handles expect numpy arrays
        phones = phone_ids[0].numpy()
        tones = tone_ids[0].numpy()

        if args.enable_auto_log:
            logger.times.stamp()

        input_names = speedyspeech_predictor.get_input_names()
        phones_handle = speedyspeech_predictor.get_input_handle(input_names[0])
        tones_handle = speedyspeech_predictor.get_input_handle(input_names[1])

        phones_handle.reshape(phones.shape)
        phones_handle.copy_from_cpu(phones)
        tones_handle.reshape(tones.shape)
        tones_handle.copy_from_cpu(tones)

        speedyspeech_predictor.run()
        output_names = speedyspeech_predictor.get_output_names()
        output_handle = speedyspeech_predictor.get_output_handle(
            output_names[0])
        output_data = output_handle.copy_to_cpu()

        input_names = pwg_predictor.get_input_names()
        mel_handle = pwg_predictor.get_input_handle(input_names[0])
        mel_handle.reshape(output_data.shape)
        mel_handle.copy_from_cpu(output_data)

        pwg_predictor.run()
        output_names = pwg_predictor.get_output_names()
        output_handle = pwg_predictor.get_output_handle(output_names[0])
        wav = output_handle.copy_to_cpu()

        if args.enable_auto_log:
            logger.times.stamp()

        sf.write(output_dir / (utt_id + ".wav"), wav, samplerate=24000)

        if args.enable_auto_log:
            logger.times.end(stamp=True)
        print(f"{utt_id} done!")

    if args.enable_auto_log:
        logger.report()


if __name__ == "__main__":
    main()
@ -0,0 +1,8 @@
#!/bin/bash

python3 inference.py \
  --inference-dir=exp/default/inference \
  --text=../sentences.txt \
  --output-dir=exp/default/pd_infer_out \
  --phones-dict=dump/phone_id_map.txt \
  --tones-dict=dump/tone_id_map.txt
@ -0,0 +1,65 @@
#!/bin/bash

stage=0
stop_stage=100

export MAIN_ROOT=`realpath ${PWD}/../../../`

if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
    # get durations from MFA's result
    echo "Generate durations.txt from MFA results ..."
    python3 ${MAIN_ROOT}/utils/gen_duration_from_textgrid.py \
        --inputdir=./baker_alignment_tone \
        --output=durations.txt \
        --config=conf/default.yaml
fi

if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
    echo "Extract features ..."
    python3 ../preprocess.py \
        --dataset=baker \
        --rootdir=~/datasets/BZNSYP/ \
        --dumpdir=dump \
        --dur-file=durations.txt \
        --config=conf/default.yaml \
        --num-cpu=20 \
        --cut-sil=True \
        --use-relative-path=True
fi

if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
    echo "Get features' stats ..."
    python3 ${MAIN_ROOT}/utils/compute_statistics.py \
        --metadata=dump/train/raw/metadata.jsonl \
        --field-name="feats" \
        --use-relative-path=True
fi

if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
    # normalize and convert phone/tone to id; dev and test should use train's stats
    echo "Normalize ..."
    python3 ../normalize.py \
        --metadata=dump/train/raw/metadata.jsonl \
        --dumpdir=dump/train/norm \
        --stats=dump/train/feats_stats.npy \
        --phones-dict=dump/phone_id_map.txt \
        --tones-dict=dump/tone_id_map.txt \
        --use-relative-path=True

    python3 ../normalize.py \
        --metadata=dump/dev/raw/metadata.jsonl \
        --dumpdir=dump/dev/norm \
        --stats=dump/train/feats_stats.npy \
        --phones-dict=dump/phone_id_map.txt \
        --tones-dict=dump/tone_id_map.txt \
        --use-relative-path=True

    python3 ../normalize.py \
        --metadata=dump/test/raw/metadata.jsonl \
        --dumpdir=dump/test/norm \
        --stats=dump/train/feats_stats.npy \
        --phones-dict=dump/phone_id_map.txt \
        --tones-dict=dump/tone_id_map.txt \
        --use-relative-path=True
fi
@ -0,0 +1,12 @@
#!/bin/bash

python ../train.py \
  --train-metadata=dump/train/norm/metadata.jsonl \
  --dev-metadata=dump/dev/norm/metadata.jsonl \
  --config=conf/default.yaml \
  --output-dir=exp/default \
  --nprocs=2 \
  --phones-dict=dump/phone_id_map.txt \
  --tones-dict=dump/tone_id_map.txt \
  --use-relative-path=True
@ -0,0 +1,16 @@
#!/bin/bash
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ../synthesize.py \
  --speedyspeech-config=conf/default.yaml \
  --speedyspeech-checkpoint=exp/default/checkpoints/snapshot_iter_11400.pdz \
  --speedyspeech-stat=dump/train/feats_stats.npy \
  --pwg-config=pwg_baker_ckpt_0.4/pwg_default.yaml \
  --pwg-checkpoint=pwg_baker_ckpt_0.4/pwg_snapshot_iter_400000.pdz \
  --pwg-stat=pwg_baker_ckpt_0.4/pwg_stats.npy \
  --test-metadata=dump/test/norm/metadata.jsonl \
  --output-dir=exp/default/test \
  --inference-dir=exp/default/inference \
  --phones-dict=dump/phone_id_map.txt \
  --tones-dict=dump/tone_id_map.txt \
  --device="gpu"
@ -0,0 +1,196 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import argparse
import logging
import os
from pathlib import Path

import numpy as np
import soundfile as sf
import paddle
import yaml
from paddle import jit
from paddle.static import InputSpec
from parakeet.frontend.zh_frontend import Frontend
from parakeet.models.speedyspeech import SpeedySpeech
from parakeet.models.speedyspeech import SpeedySpeechInference
from parakeet.models.parallel_wavegan import PWGGenerator
from parakeet.models.parallel_wavegan import PWGInference
from parakeet.modules.normalizer import ZScore
from yacs.config import CfgNode


def evaluate(args, speedyspeech_config, pwg_config):
    # the dataloader has been too verbose
    logging.getLogger("DataLoader").disabled = True

    # construct dataset for evaluation
    sentences = []
    with open(args.text, 'rt') as f:
        for line in f:
            utt_id, sentence = line.strip().split()
            sentences.append((utt_id, sentence))

    with open(args.phones_dict, "r") as f:
        phn_id = [line.strip().split() for line in f.readlines()]
    vocab_size = len(phn_id)
    print("vocab_size:", vocab_size)
    with open(args.tones_dict, "r") as f:
        tone_id = [line.strip().split() for line in f.readlines()]
    tone_size = len(tone_id)
    print("tone_size:", tone_size)

    model = SpeedySpeech(
        vocab_size=vocab_size,
        tone_size=tone_size,
        **speedyspeech_config["model"])
    model.set_state_dict(
        paddle.load(args.speedyspeech_checkpoint)["main_params"])
    model.eval()

    vocoder = PWGGenerator(**pwg_config["generator_params"])
    vocoder.set_state_dict(paddle.load(args.pwg_checkpoint)["generator_params"])
    vocoder.remove_weight_norm()
    vocoder.eval()
    print("model done!")

    stat = np.load(args.speedyspeech_stat)
    mu, std = stat
    mu = paddle.to_tensor(mu)
    std = paddle.to_tensor(std)
    speedyspeech_normalizer = ZScore(mu, std)

    stat = np.load(args.pwg_stat)
    mu, std = stat
    mu = paddle.to_tensor(mu)
    std = paddle.to_tensor(std)
    pwg_normalizer = ZScore(mu, std)

    speedyspeech_inference = SpeedySpeechInference(speedyspeech_normalizer,
                                                   model)
    speedyspeech_inference.eval()
    speedyspeech_inference = jit.to_static(
        speedyspeech_inference,
        input_spec=[
            InputSpec([-1], dtype=paddle.int64), InputSpec(
                [-1], dtype=paddle.int64)
        ])
    paddle.jit.save(speedyspeech_inference,
                    os.path.join(args.inference_dir, "speedyspeech"))
    speedyspeech_inference = paddle.jit.load(
        os.path.join(args.inference_dir, "speedyspeech"))

    pwg_inference = PWGInference(pwg_normalizer, vocoder)
    pwg_inference.eval()
    pwg_inference = jit.to_static(
        pwg_inference, input_spec=[
            InputSpec([-1, 80], dtype=paddle.float32),
        ])
    paddle.jit.save(pwg_inference, os.path.join(args.inference_dir, "pwg"))
    pwg_inference = paddle.jit.load(os.path.join(args.inference_dir, "pwg"))

    frontend = Frontend(
        phone_vocab_path=args.phones_dict, tone_vocab_path=args.tones_dict)
    print("frontend done!")

    output_dir = Path(args.output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    for utt_id, sentence in sentences:
        input_ids = frontend.get_input_ids(
            sentence, merge_sentences=True, get_tone_ids=True)
        phone_ids = input_ids["phone_ids"]
        tone_ids = input_ids["tone_ids"]

        # synthesize the sentence part by part and concatenate the wavs
        flags = 0
        for i in range(len(phone_ids)):
            part_phone_ids = phone_ids[i]
            part_tone_ids = tone_ids[i]
            with paddle.no_grad():
                mel = speedyspeech_inference(part_phone_ids, part_tone_ids)
                temp_wav = pwg_inference(mel)
                if flags == 0:
                    wav = temp_wav
                    flags = 1
                else:
                    wav = paddle.concat([wav, temp_wav])
        sf.write(
            output_dir / (utt_id + ".wav"),
            wav.numpy(),
            samplerate=speedyspeech_config.fs)
        print(f"{utt_id} done!")


def main():
    # parse args and config
    parser = argparse.ArgumentParser(
        description="Synthesize with speedyspeech & parallel wavegan.")
    parser.add_argument(
        "--speedyspeech-config", type=str, help="config file for speedyspeech.")
    parser.add_argument(
        "--speedyspeech-checkpoint",
        type=str,
        help="speedyspeech checkpoint to load.")
    parser.add_argument(
        "--speedyspeech-stat",
        type=str,
        help="mean and standard deviation used to normalize spectrogram when training speedyspeech."
    )
    parser.add_argument(
        "--pwg-config", type=str, help="config file for parallelwavegan.")
    parser.add_argument(
        "--pwg-checkpoint",
        type=str,
        help="parallel wavegan checkpoint to load.")
    parser.add_argument(
        "--pwg-stat",
        type=str,
        help="mean and standard deviation used to normalize spectrogram when training parallel wavegan."
    )
    parser.add_argument(
        "--text",
        type=str,
        help="text to synthesize, a 'utt_id sentence' pair per line")
    parser.add_argument(
        "--phones-dict", type=str, default=None, help="phone vocabulary file.")
    parser.add_argument(
        "--tones-dict", type=str, default=None, help="tone vocabulary file.")
    parser.add_argument("--output-dir", type=str, help="output dir")
    parser.add_argument(
        "--inference-dir", type=str, help="dir to save inference models")
    parser.add_argument(
        "--device", type=str, default="gpu", help="device type to use")
    parser.add_argument("--verbose", type=int, default=1, help="verbose")

    args, _ = parser.parse_known_args()

    paddle.set_device(args.device)

    with open(args.speedyspeech_config) as f:
        speedyspeech_config = CfgNode(yaml.safe_load(f))
    with open(args.pwg_config) as f:
        pwg_config = CfgNode(yaml.safe_load(f))

    print("========Args========")
    print(yaml.safe_dump(vars(args)))
    print("========Config========")
    print(speedyspeech_config)
    print(pwg_config)

    evaluate(args, speedyspeech_config, pwg_config)


if __name__ == "__main__":
    main()
@ -0,0 +1,16 @@
#!/bin/bash
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python synthesize_e2e.py \
  --speedyspeech-config=conf/default.yaml \
  --speedyspeech-checkpoint=exp/default/checkpoints/snapshot_iter_11400.pdz \
  --speedyspeech-stat=dump/train/feats_stats.npy \
  --pwg-config=pwg_baker_ckpt_0.4/pwg_default.yaml \
  --pwg-checkpoint=pwg_baker_ckpt_0.4/pwg_snapshot_iter_400000.pdz \
  --pwg-stat=pwg_baker_ckpt_0.4/pwg_stats.npy \
  --text=../sentences.txt \
  --output-dir=exp/default/test_e2e \
  --inference-dir=exp/default/inference \
  --device="gpu" \
  --phones-dict=dump/phone_id_map.txt \
  --tones-dict=dump/tone_id_map.txt
@ -0,0 +1,159 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Normalize feature files and dump them."""

import argparse
import logging
from operator import itemgetter
from pathlib import Path

import jsonlines
import numpy as np
from parakeet.datasets.data_table import DataTable
from sklearn.preprocessing import StandardScaler
from tqdm import tqdm


def main():
    """Run preprocessing process."""
    parser = argparse.ArgumentParser(
        description="Normalize dumped raw features (See detail in parallel_wavegan/bin/normalize.py)."
    )
    parser.add_argument(
        "--metadata",
        type=str,
        required=True,
        help="metadata file of the dumped raw features to be normalized.")
    parser.add_argument(
        "--dumpdir",
        type=str,
        required=True,
        help="directory to dump normalized feature files.")
    parser.add_argument(
        "--stats", type=str, required=True, help="statistics file.")
    parser.add_argument(
        "--phones-dict", type=str, default=None, help="phone vocabulary file.")
    parser.add_argument(
        "--tones-dict", type=str, default=None, help="tone vocabulary file.")

    parser.add_argument(
        "--verbose",
        type=int,
        default=1,
        help="logging level. higher is more logging. (default=1)")

    def str2bool(s):
        return s.lower() == 'true'

    parser.add_argument(
        "--use-relative-path",
        type=str2bool,
        default=False,
        help="whether to use relative path in metadata")
    args = parser.parse_args()

    # set logger
    if args.verbose > 1:
        logging.basicConfig(
            level=logging.DEBUG,
            format="%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s"
        )
    elif args.verbose > 0:
        logging.basicConfig(
            level=logging.INFO,
            format="%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s"
        )
    else:
        logging.basicConfig(
            level=logging.WARN,
            format="%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s"
        )
        logging.warning('Skip DEBUG/INFO messages')

    dumpdir = Path(args.dumpdir).expanduser()
    # use absolute path
    dumpdir = dumpdir.resolve()
    dumpdir.mkdir(parents=True, exist_ok=True)

    # get dataset
    with jsonlines.open(args.metadata, 'r') as reader:
        metadata = list(reader)
    if args.use_relative_path:
        # if use_relative_path was set in preprocess, convert it to an
        # absolute path here
        metadata_dir = Path(args.metadata).parent
        for item in metadata:
            item["feats"] = str(metadata_dir / item["feats"])

    dataset = DataTable(
        metadata, converters={
            'feats': np.load,
        })
    logging.info(f"The number of files = {len(dataset)}.")

    # restore scaler
    scaler = StandardScaler()
    scaler.mean_ = np.load(args.stats)[0]
    scaler.scale_ = np.load(args.stats)[1]
    # from version 0.23.0, this information is needed
    scaler.n_features_in_ = scaler.mean_.shape[0]

    vocab_phones = {}
    with open(args.phones_dict, 'rt') as f:
        phn_id = [line.strip().split() for line in f.readlines()]
    for phn, idx in phn_id:
        vocab_phones[phn] = int(idx)

    vocab_tones = {}
    with open(args.tones_dict, 'rt') as f:
        tone_id = [line.strip().split() for line in f.readlines()]
    for tone, idx in tone_id:
        vocab_tones[tone] = int(idx)

    # process each file
    output_metadata = []

    for item in tqdm(dataset):
        utt_id = item['utt_id']
        mel = item['feats']
        # normalize
        mel = scaler.transform(mel)

        # save
        mel_path = dumpdir / f"{utt_id}_feats.npy"
        np.save(mel_path, mel.astype(np.float32), allow_pickle=False)
        phone_ids = [vocab_phones[p] for p in item['phones']]
        tone_ids = [vocab_tones[t] for t in item['tones']]
        if args.use_relative_path:
            # convert absolute path to relative path
            mel_path = mel_path.relative_to(dumpdir)
        output_metadata.append({
            'utt_id': utt_id,
            'phones': phone_ids,
            'tones': tone_ids,
            'num_phones': item['num_phones'],
            'num_frames': item['num_frames'],
            'durations': item['durations'],
            'feats': str(mel_path),
        })
    output_metadata.sort(key=itemgetter('utt_id'))
    output_metadata_path = Path(args.dumpdir) / "metadata.jsonl"
    with jsonlines.open(output_metadata_path, 'w') as writer:
        for item in output_metadata:
            writer.write(item)
    logging.info(f"metadata dumped into {output_metadata_path}")


if __name__ == "__main__":
    main()
@ -0,0 +1,293 @@
|
||||
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
from operator import itemgetter
|
||||
from typing import Any
|
||||
from typing import Dict
|
||||
from typing import List
|
||||
|
||||
import argparse
|
||||
import jsonlines
|
||||
import librosa
|
||||
import numpy as np
|
||||
import re
|
||||
import tqdm
|
||||
import yaml
|
||||
from concurrent.futures import ThreadPoolExecutor
|
||||
from parakeet.data.get_feats import LogMelFBank
|
||||
from parakeet.datasets.preprocess_utils import compare_duration_and_mel_length
|
||||
from parakeet.datasets.preprocess_utils import get_phones_tones
|
||||
from parakeet.datasets.preprocess_utils import get_phn_dur
|
||||
from parakeet.datasets.preprocess_utils import merge_silence
|
||||
from pathlib import Path
|
||||
from yacs.config import CfgNode
|
||||
|
||||
|
||||
def process_sentence(config: Dict[str, Any],
|
||||
fp: Path,
|
||||
sentences: Dict,
|
||||
output_dir: Path,
|
||||
mel_extractor=None,
|
||||
cut_sil: bool=True):
|
||||
utt_id = fp.stem
|
||||
record = None
|
||||
if utt_id in sentences:
|
||||
# reading, resampling may occur
|
||||
wav, _ = librosa.load(str(fp), sr=config.fs)
|
||||
if len(wav.shape) != 1 or np.abs(wav).max() > 1.0:
|
||||
return record
|
||||
assert len(wav.shape) == 1, f"{utt_id} is not a mono-channel audio."
|
||||
assert np.abs(wav).max(
|
||||
) <= 1.0, f"{utt_id} is seems to be different that 16 bit PCM."
|
||||
phones = sentences[utt_id][0]
|
||||
durations = sentences[utt_id][1]
|
||||
speaker = sentences[utt_id][2]
|
||||
d_cumsum = np.pad(np.array(durations).cumsum(0), (1, 0), 'constant')
|
||||
# little imprecise than use *.TextGrid directly
|
||||
times = librosa.frames_to_time(
|
||||
d_cumsum, sr=config.fs, hop_length=config.n_shift)
|
||||
if cut_sil:
|
||||
start = 0
|
||||
end = d_cumsum[-1]
|
||||
if phones[0] == "sil" and len(durations) > 1:
|
||||
start = times[1]
|
||||
durations = durations[1:]
|
||||
phones = phones[1:]
|
||||
if phones[-1] == 'sil' and len(durations) > 1:
|
||||
end = times[-2]
|
||||
durations = durations[:-1]
|
||||
phones = phones[:-1]
|
||||
sentences[utt_id][0] = phones
|
||||
sentences[utt_id][1] = durations
|
||||
start, end = librosa.time_to_samples([start, end], sr=config.fs)
|
||||
wav = wav[start:end]
|
||||
|
||||
# extract mel feats
|
||||
logmel = mel_extractor.get_log_mel_fbank(wav)
|
||||
# change duration according to mel_length
|
||||
compare_duration_and_mel_length(sentences, utt_id, logmel)
|
||||
labels = sentences[utt_id][0]
|
||||
# extract phone and duration
|
||||
phones = []
|
||||
tones = []
|
||||
for label in labels:
|
||||
# split tone from finals
|
||||
match = re.match(r'^(\w+)([012345])$', label)
|
||||
if match:
|
||||
phones.append(match.group(1))
|
||||
tones.append(match.group(2))
|
||||
else:
|
||||
phones.append(label)
|
||||
tones.append('0')
|
||||
durations = sentences[utt_id][1]
|
||||
num_frames = logmel.shape[0]
|
||||
assert sum(durations) == num_frames
|
||||
assert len(phones) == len(tones) == len(durations)
|
||||
|
||||
mel_path = output_dir / (utt_id + "_feats.npy")
|
||||
np.save(mel_path, logmel) # (num_frames, n_mels)
|
||||
record = {
|
||||
"utt_id": utt_id,
|
||||
"phones": phones,
|
||||
"tones": tones,
|
||||
"num_phones": len(phones),
|
||||
"num_frames": num_frames,
|
||||
"durations": durations,
|
||||
"feats": str(mel_path), # Path object
|
||||
}
|
||||
return record
|
||||
|
||||
|
||||
def process_sentences(config,
|
||||
fps: List[Path],
|
||||
sentences: Dict,
|
||||
output_dir: Path,
|
||||
mel_extractor=None,
|
||||
nprocs: int=1,
|
||||
cut_sil: bool=True,
|
||||
use_relative_path: bool=False):
|
||||
if nprocs == 1:
|
||||
results = []
|
||||
for fp in tqdm.tqdm(fps, total=len(fps)):
|
||||
record = process_sentence(config, fp, sentences, output_dir,
|
||||
mel_extractor, cut_sil)
|
||||
if record:
|
||||
results.append(record)
|
||||
else:
|
||||
with ThreadPoolExecutor(nprocs) as pool:
|
||||
futures = []
|
||||
with tqdm.tqdm(total=len(fps)) as progress:
|
||||
for fp in fps:
|
||||
future = pool.submit(process_sentence, config, fp,
|
||||
sentences, output_dir, mel_extractor,
|
||||
cut_sil)
|
||||
future.add_done_callback(lambda p: progress.update())
|
||||
futures.append(future)
|
||||
|
||||
results = []
|
||||
for ft in futures:
|
||||
record = ft.result()
|
||||
if record:
|
||||
results.append(record)
|
||||
|
||||
results.sort(key=itemgetter("utt_id"))
|
||||
output_dir = Path(output_dir)
|
||||
metadata_path = output_dir / "metadata.jsonl"
|
||||
# NOTE: use relative path to the meta jsonlines file for Full Chain Project
|
||||
with jsonlines.open(metadata_path, 'w') as writer:
|
||||
for item in results:
|
||||
if use_relative_path:
|
||||
item["feats"] = str(Path(item["feats"]).relative_to(output_dir))
|
||||
writer.write(item)
|
||||
print("Done")
|
||||
|
||||
|
||||
def main():
|
||||
# parse config and args
|
||||
parser = argparse.ArgumentParser(
|
||||
description="Preprocess audio and then extract features.")
|
||||
|
||||
parser.add_argument(
|
||||
"--dataset",
|
||||
default="baker",
|
||||
type=str,
|
||||
help="name of dataset, should in {baker} now")
|
||||
|
||||
parser.add_argument(
|
||||
"--rootdir", default=None, type=str, help="directory to dataset.")
|
||||
parser.add_argument(
|
||||
"--dumpdir",
|
||||
type=str,
|
||||
required=True,
|
||||
help="directory to dump feature files.")
|
||||
|
||||
parser.add_argument(
|
||||
"--dur-file",
|
||||
default=None,
|
||||
type=str,
|
||||
help="path to baker durations.txt.")
|
||||
|
||||
parser.add_argument("--config", type=str, help="fastspeech2 config file.")
|
||||
|
||||
parser.add_argument(
|
||||
"--verbose",
|
||||
type=int,
|
||||
default=1,
|
||||
help="logging level. higher is more logging. (default=1)")
|
||||
parser.add_argument(
|
||||
"--num-cpu", type=int, default=1, help="number of process.")
|
||||
|
||||
def str2bool(str):
|
||||
return True if str.lower() == 'true' else False
|
||||
|
||||
parser.add_argument(
|
||||
"--cut-sil",
|
||||
type=str2bool,
|
||||
default=True,
|
||||
help="whether cut sil in the edge of audio")
|
||||
|
||||
parser.add_argument(
|
||||
"--use-relative-path",
|
||||
type=str2bool,
|
||||
default=False,
|
||||
help="whether use relative path in metadata")
|
||||
|
||||
args = parser.parse_args()
|
||||
|
||||
rootdir = Path(args.rootdir).expanduser()
|
||||
dumpdir = Path(args.dumpdir).expanduser()
|
||||
# use absolute path
|
||||
dumpdir = dumpdir.resolve()
|
||||
dumpdir.mkdir(parents=True, exist_ok=True)
|
||||
dur_file = Path(args.dur_file).expanduser()
|
||||
|
||||
assert rootdir.is_dir()
|
||||
assert dur_file.is_file()
|
||||
|
||||
with open(args.config, 'rt') as f:
|
||||
config = CfgNode(yaml.safe_load(f))
|
||||
|
||||
if args.verbose > 1:
|
||||
print(vars(args))
|
||||
print(config)
|
||||
|
||||
sentences, speaker_set = get_phn_dur(dur_file)
|
||||
|
||||
merge_silence(sentences)
|
||||
phone_id_map_path = dumpdir / "phone_id_map.txt"
|
||||
tone_id_map_path = dumpdir / "tone_id_map.txt"
|
||||
get_phones_tones(sentences, phone_id_map_path, tone_id_map_path,
|
||||
args.dataset)
|
||||
|
||||
if args.dataset == "baker":
|
||||
wav_files = sorted(list((rootdir / "Wave").rglob("*.wav")))
|
||||
# split data into 3 sections
|
||||
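# CSMSC (baker) has 10000 utterances in total, so the split is 9800 train / 100 dev / 100 test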
num_train = 9800
|
||||
num_dev = 100
|
||||
train_wav_files = wav_files[:num_train]
|
||||
dev_wav_files = wav_files[num_train:num_train + num_dev]
|
||||
test_wav_files = wav_files[num_train + num_dev:]
|
||||
|
||||
train_dump_dir = dumpdir / "train" / "raw"
|
||||
train_dump_dir.mkdir(parents=True, exist_ok=True)
|
||||
dev_dump_dir = dumpdir / "dev" / "raw"
|
||||
dev_dump_dir.mkdir(parents=True, exist_ok=True)
|
||||
test_dump_dir = dumpdir / "test" / "raw"
|
||||
test_dump_dir.mkdir(parents=True, exist_ok=True)
|
||||
|
||||
# Extractor
|
||||
mel_extractor = LogMelFBank(
|
||||
sr=config.fs,
|
||||
n_fft=config.n_fft,
|
||||
hop_length=config.n_shift,
|
||||
win_length=config.win_length,
|
||||
window=config.window,
|
||||
n_mels=config.n_mels,
|
||||
fmin=config.fmin,
|
||||
fmax=config.fmax)
|
||||
|
||||
# process for the 3 sections
|
||||
if train_wav_files:
|
||||
process_sentences(
|
||||
config,
|
||||
train_wav_files,
|
||||
sentences,
|
||||
train_dump_dir,
|
||||
mel_extractor,
|
||||
nprocs=args.num_cpu,
|
||||
cut_sil=args.cut_sil,
|
||||
use_relative_path=args.use_relative_path)
|
||||
if dev_wav_files:
|
||||
process_sentences(
|
||||
config,
|
||||
dev_wav_files,
|
||||
sentences,
|
||||
dev_dump_dir,
|
||||
mel_extractor,
|
||||
cut_sil=args.cut_sil,
|
||||
use_relative_path=args.use_relative_path)
|
||||
if test_wav_files:
|
||||
process_sentences(
|
||||
config,
|
||||
test_wav_files,
|
||||
sentences,
|
||||
test_dump_dir,
|
||||
mel_extractor,
|
||||
nprocs=args.num_cpu,
|
||||
cut_sil=args.cut_sil,
|
||||
use_relative_path=args.use_relative_path)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
@ -0,0 +1,16 @@
|
||||
001 凯莫瑞安联合体的经济崩溃,迫在眉睫。
|
||||
002 对于所有想要离开那片废土,去寻找更美好生活的人来说。
|
||||
003 克哈,是你们所有人安全的港湾。
|
||||
004 为了保护尤摩扬人民不受异虫的残害,我所做的,比他们自己的领导委员会都多。
|
||||
005 无论他们如何诽谤我,我将继续为所有泰伦人的最大利益,而努力奋斗。
|
||||
006 身为你们的元首,我带领泰伦人实现了人类统治领地和经济的扩张。
|
||||
007 我们将继续成长,用行动回击那些只会说风凉话,不愿意和我们相向而行的害群之马。
|
||||
008 帝国武装力量,无数的优秀儿女,正时刻守卫着我们的家园大门,但是他们孤木难支。
|
||||
009 凡是今天应征入伍者,所获的所有刑罚罪责,减半。
|
||||
010 激进分子和异见者希望你们一听见枪声,就背弃多年的和平与繁荣。
|
||||
011 他们没有勇气和能力,带领人类穿越一个充满危险的星系。
|
||||
012 法治是我们的命脉,然而它却受到前所未有的挑战。
|
||||
013 我将恢复我们帝国的荣光,绝不会向任何外星势力低头。
|
||||
014 我已经驯服了异虫,荡平了星灵。如今它们的创造者,想要夺走我们拥有的一切。
|
||||
015 永远记住,谁才是最能保护你们的人。
|
||||
016 不要听信别人的谗言,我不是什么克隆人。
|
@ -0,0 +1,180 @@
|
||||
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
import os
|
||||
import logging
|
||||
import argparse
|
||||
from pathlib import Path
|
||||
|
||||
import jsonlines
|
||||
import numpy as np
|
||||
import soundfile as sf
|
||||
import paddle
|
||||
import yaml
|
||||
from paddle import jit
|
||||
from paddle.static import InputSpec
|
||||
from yacs.config import CfgNode
|
||||
|
||||
from parakeet.datasets.data_table import DataTable
|
||||
from parakeet.models.speedyspeech import SpeedySpeech
|
||||
from parakeet.models.speedyspeech import SpeedySpeechInference
|
||||
from parakeet.models.parallel_wavegan import PWGGenerator
|
||||
from parakeet.models.parallel_wavegan import PWGInference
|
||||
from parakeet.modules.normalizer import ZScore
|
||||
|
||||
|
||||
def evaluate(args, speedyspeech_config, pwg_config):
|
||||
# the DataLoader logger is too verbose, so disable it
|
||||
logging.getLogger("DataLoader").disabled = True
|
||||
|
||||
# construct dataset for evaluation
|
||||
with jsonlines.open(args.test_metadata, 'r') as reader:
|
||||
test_metadata = list(reader)
|
||||
test_dataset = DataTable(
|
||||
data=test_metadata, fields=["utt_id", "phones", "tones"])
|
||||
|
||||
with open(args.phones_dict, "r") as f:
|
||||
phn_id = [line.strip().split() for line in f.readlines()]
|
||||
vocab_size = len(phn_id)
|
||||
print("vocab_size:", vocab_size)
|
||||
with open(args.tones_dict, "r") as f:
|
||||
tone_id = [line.strip().split() for line in f.readlines()]
|
||||
tone_size = len(tone_id)
|
||||
print("tone_size:", tone_size)
|
||||
|
||||
model = SpeedySpeech(
|
||||
vocab_size=vocab_size,
|
||||
tone_size=tone_size,
|
||||
**speedyspeech_config["model"])
|
||||
model.set_state_dict(
|
||||
paddle.load(args.speedyspeech_checkpoint)["main_params"])
|
||||
model.eval()
|
||||
|
||||
vocoder = PWGGenerator(**pwg_config["generator_params"])
|
||||
vocoder.set_state_dict(paddle.load(args.pwg_checkpoint)["generator_params"])
|
||||
vocoder.remove_weight_norm()
|
||||
vocoder.eval()
|
||||
print("model done!")
|
||||
|
||||
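# the stats file stores the feature mean and std stacked together, so it can
# be unpacked into (mu, std) for z-score normalization below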
stat = np.load(args.speedyspeech_stat)
|
||||
mu, std = stat
|
||||
mu = paddle.to_tensor(mu)
|
||||
std = paddle.to_tensor(std)
|
||||
speedyspeech_normalizer = ZScore(mu, std)
|
||||
speedyspeech_normalizer.eval()
|
||||
|
||||
stat = np.load(args.pwg_stat)
|
||||
mu, std = stat
|
||||
mu = paddle.to_tensor(mu)
|
||||
std = paddle.to_tensor(std)
|
||||
pwg_normalizer = ZScore(mu, std)
|
||||
pwg_normalizer.eval()
|
||||
|
||||
speedyspeech_inference = SpeedySpeechInference(speedyspeech_normalizer,
|
||||
model)
|
||||
speedyspeech_inference.eval()
|
||||
speedyspeech_inference = jit.to_static(
|
||||
speedyspeech_inference,
|
||||
input_spec=[
|
||||
InputSpec([-1], dtype=paddle.int64), InputSpec(
|
||||
[-1], dtype=paddle.int64)
|
||||
])
|
||||
paddle.jit.save(speedyspeech_inference,
|
||||
os.path.join(args.inference_dir, "speedyspeech"))
|
||||
speedyspeech_inference = paddle.jit.load(
|
||||
os.path.join(args.inference_dir, "speedyspeech"))
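# reloading the exported static-graph model ensures that the synthesis loop
# below runs the same model that was saved for deployment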
|
||||
|
||||
pwg_inference = PWGInference(pwg_normalizer, vocoder)
|
||||
pwg_inference.eval()
|
||||
pwg_inference = jit.to_static(
|
||||
pwg_inference, input_spec=[
|
||||
InputSpec([-1, 80], dtype=paddle.float32),
|
||||
])
|
||||
paddle.jit.save(pwg_inference, os.path.join(args.inference_dir, "pwg"))
|
||||
pwg_inference = paddle.jit.load(os.path.join(args.inference_dir, "pwg"))
|
||||
|
||||
output_dir = Path(args.output_dir)
|
||||
output_dir.mkdir(parents=True, exist_ok=True)
|
||||
|
||||
for datum in test_dataset:
|
||||
utt_id = datum["utt_id"]
|
||||
phones = paddle.to_tensor(datum["phones"])
|
||||
tones = paddle.to_tensor(datum["tones"])
|
||||
|
||||
with paddle.no_grad():
|
||||
wav = pwg_inference(speedyspeech_inference(phones, tones))
|
||||
sf.write(
|
||||
output_dir / (utt_id + ".wav"),
|
||||
wav.numpy(),
|
||||
samplerate=speedyspeech_config.fs)
|
||||
print(f"{utt_id} done!")
|
||||
|
||||
|
||||
def main():
|
||||
# parse args and config
|
||||
parser = argparse.ArgumentParser(
|
||||
description="Synthesize with speedyspeech & parallel wavegan.")
|
||||
parser.add_argument(
|
||||
"--speedyspeech-config", type=str, help="config file for speedyspeech.")
|
||||
parser.add_argument(
|
||||
"--speedyspeech-checkpoint",
|
||||
type=str,
|
||||
help="speedyspeech checkpoint to load.")
|
||||
parser.add_argument(
|
||||
"--speedyspeech-stat",
|
||||
type=str,
|
||||
help="mean and standard deviation used to normalize spectrogram when training speedyspeech."
|
||||
)
|
||||
parser.add_argument(
|
||||
"--pwg-config", type=str, help="config file for parallelwavegan.")
|
||||
parser.add_argument(
|
||||
"--pwg-checkpoint",
|
||||
type=str,
|
||||
help="parallel wavegan generator parameters to load.")
|
||||
parser.add_argument(
|
||||
"--pwg-stat",
|
||||
type=str,
|
||||
help="mean and standard deviation used to normalize spectrogram when training speedyspeech."
|
||||
)
|
||||
parser.add_argument(
|
||||
"--phones-dict", type=str, default=None, help="phone vocabulary file.")
|
||||
parser.add_argument(
|
||||
"--tones-dict", type=str, default=None, help="tone vocabulary file.")
|
||||
parser.add_argument("--test-metadata", type=str, help="test metadata")
|
||||
parser.add_argument("--output-dir", type=str, help="output dir")
|
||||
parser.add_argument(
|
||||
"--inference-dir", type=str, help="dir to save inference models")
|
||||
parser.add_argument(
|
||||
"--device", type=str, default="gpu", help="device type to use")
|
||||
parser.add_argument("--verbose", type=int, default=1, help="verbose")
|
||||
|
||||
args, _ = parser.parse_known_args()
|
||||
|
||||
paddle.set_device(args.device)
|
||||
|
||||
with open(args.speedyspeech_config) as f:
|
||||
speedyspeech_config = CfgNode(yaml.safe_load(f))
|
||||
with open(args.pwg_config) as f:
|
||||
pwg_config = CfgNode(yaml.safe_load(f))
|
||||
|
||||
print("========Args========")
|
||||
print(yaml.safe_dump(vars(args)))
|
||||
print("========Config========")
|
||||
print(speedyspeech_config)
|
||||
print(pwg_config)
|
||||
|
||||
evaluate(args, speedyspeech_config, pwg_config)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
@ -0,0 +1,224 @@
|
||||
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
import argparse
|
||||
import logging
|
||||
import os
|
||||
import shutil
|
||||
|
||||
import jsonlines
|
||||
import numpy as np
|
||||
import paddle
|
||||
import yaml
|
||||
from paddle import distributed as dist
|
||||
from paddle import DataParallel
|
||||
from paddle.io import DataLoader
|
||||
from paddle.io import DistributedBatchSampler
|
||||
from parakeet.datasets.data_table import DataTable
|
||||
from parakeet.datasets.am_batch_fn import speedyspeech_batch_fn
|
||||
from parakeet.models.speedyspeech import SpeedySpeech
|
||||
from parakeet.models.speedyspeech import SpeedySpeechEvaluator
|
||||
from parakeet.models.speedyspeech import SpeedySpeechUpdater
|
||||
from parakeet.training.extensions.snapshot import Snapshot
|
||||
from parakeet.training.extensions.visualizer import VisualDL
|
||||
from parakeet.training.optimizer import build_optimizers
|
||||
from parakeet.training.seeding import seed_everything
|
||||
from parakeet.training.trainer import Trainer
|
||||
from pathlib import Path
|
||||
from visualdl import LogWriter
|
||||
from yacs.config import CfgNode
|
||||
|
||||
|
||||
def train_sp(args, config):
|
||||
# decides device type and whether to run in parallel
|
||||
# setup running environment correctly
|
||||
world_size = paddle.distributed.get_world_size()
|
||||
if not paddle.is_compiled_with_cuda():
|
||||
paddle.set_device("cpu")
|
||||
else:
|
||||
paddle.set_device("gpu")
|
||||
if world_size > 1:
|
||||
paddle.distributed.init_parallel_env()
|
||||
|
||||
# set the random seed; it is required for multiprocess training
|
||||
seed_everything(config.seed)
|
||||
|
||||
print(
|
||||
f"rank: {dist.get_rank()}, pid: {os.getpid()}, parent_pid: {os.getppid()}",
|
||||
)
|
||||
|
||||
# the DataLoader logger is too verbose, so disable it
|
||||
logging.getLogger("DataLoader").disabled = True
|
||||
|
||||
# construct dataset for training and validation
|
||||
with jsonlines.open(args.train_metadata, 'r') as reader:
|
||||
train_metadata = list(reader)
|
||||
if args.use_relative_path:
|
||||
# if use_relative_path was set during preprocessing, convert to absolute paths here
|
||||
metadata_dir = Path(args.train_metadata).parent
|
||||
for item in train_metadata:
|
||||
item["feats"] = str(metadata_dir / item["feats"])
|
||||
|
||||
train_dataset = DataTable(
|
||||
data=train_metadata,
|
||||
fields=[
|
||||
"phones", "tones", "num_phones", "num_frames", "feats", "durations"
|
||||
],
|
||||
converters={
|
||||
"feats": np.load,
|
||||
}, )
|
||||
with jsonlines.open(args.dev_metadata, 'r') as reader:
|
||||
dev_metadata = list(reader)
|
||||
if args.use_relative_path:
|
||||
# if use_relative_path was set during preprocessing, convert to absolute paths here
|
||||
metadata_dir = Path(args.dev_metadata).parent
|
||||
for item in dev_metadata:
|
||||
item["feats"] = str(metadata_dir / item["feats"])
|
||||
|
||||
dev_dataset = DataTable(
|
||||
data=dev_metadata,
|
||||
fields=[
|
||||
"phones", "tones", "num_phones", "num_frames", "feats", "durations"
|
||||
],
|
||||
converters={
|
||||
"feats": np.load,
|
||||
}, )
|
||||
|
||||
# collate function and dataloader
|
||||
train_sampler = DistributedBatchSampler(
|
||||
train_dataset,
|
||||
batch_size=config.batch_size,
|
||||
shuffle=True,
|
||||
drop_last=True)
|
||||
print("samplers done!")
|
||||
|
||||
train_dataloader = DataLoader(
|
||||
train_dataset,
|
||||
batch_sampler=train_sampler,
|
||||
collate_fn=speedyspeech_batch_fn,
|
||||
num_workers=config.num_workers)
|
||||
dev_dataloader = DataLoader(
|
||||
dev_dataset,
|
||||
shuffle=False,
|
||||
drop_last=False,
|
||||
batch_size=config.batch_size,
|
||||
collate_fn=speedyspeech_batch_fn,
|
||||
num_workers=config.num_workers)
|
||||
print("dataloaders done!")
|
||||
with open(args.phones_dict, "r") as f:
|
||||
phn_id = [line.strip().split() for line in f.readlines()]
|
||||
vocab_size = len(phn_id)
|
||||
print("vocab_size:", vocab_size)
|
||||
with open(args.tones_dict, "r") as f:
|
||||
tone_id = [line.strip().split() for line in f.readlines()]
|
||||
tone_size = len(tone_id)
|
||||
print("tone_size:", tone_size)
|
||||
|
||||
model = SpeedySpeech(
|
||||
vocab_size=vocab_size, tone_size=tone_size, **config["model"])
|
||||
if world_size > 1:
|
||||
model = DataParallel(model)
|
||||
print("model done!")
|
||||
optimizer = build_optimizers(model, **config["optimizer"])
|
||||
print("optimizer done!")
|
||||
|
||||
output_dir = Path(args.output_dir)
|
||||
output_dir.mkdir(parents=True, exist_ok=True)
|
||||
if dist.get_rank() == 0:
|
||||
config_name = args.config.split("/")[-1]
|
||||
# copy conf to output_dir
|
||||
shutil.copyfile(args.config, output_dir / config_name)
|
||||
|
||||
updater = SpeedySpeechUpdater(
|
||||
model=model,
|
||||
optimizer=optimizer,
|
||||
dataloader=train_dataloader,
|
||||
output_dir=output_dir)
|
||||
|
||||
trainer = Trainer(updater, (config.max_epoch, 'epoch'), output_dir)
|
||||
|
||||
evaluator = SpeedySpeechEvaluator(
|
||||
model, dev_dataloader, output_dir=output_dir)
|
||||
|
||||
if dist.get_rank() == 0:
|
||||
trainer.extend(evaluator, trigger=(1, "epoch"))
|
||||
writer = LogWriter(str(output_dir))
|
||||
trainer.extend(VisualDL(writer), trigger=(1, "iteration"))
|
||||
trainer.extend(
|
||||
Snapshot(max_size=config.num_snapshots), trigger=(1, 'epoch'))
|
||||
trainer.run()
|
||||
|
||||
|
||||
def main():
|
||||
# parse args and config and redirect to train_sp
|
||||
parser = argparse.ArgumentParser(
|
||||
description="Train a Speedyspeech model with sigle speaker dataset.")
|
||||
parser.add_argument("--config", type=str, help="config file.")
|
||||
parser.add_argument("--train-metadata", type=str, help="training data.")
|
||||
parser.add_argument("--dev-metadata", type=str, help="dev data.")
|
||||
parser.add_argument("--output-dir", type=str, help="output dir.")
|
||||
parser.add_argument(
|
||||
"--device", type=str, default="gpu", help="device type to use.")
|
||||
parser.add_argument(
|
||||
"--nprocs", type=int, default=1, help="number of processes.")
|
||||
parser.add_argument("--verbose", type=int, default=1, help="verbose.")
|
||||
|
||||
def str2bool(s):
|
||||
return s.lower() == 'true'
|
||||
|
||||
parser.add_argument(
|
||||
"--use-relative-path",
|
||||
type=str2bool,
|
||||
default=False,
|
||||
help="whether use relative path in metadata")
|
||||
|
||||
parser.add_argument(
|
||||
"--phones-dict", type=str, default=None, help="phone vocabulary file.")
|
||||
|
||||
parser.add_argument(
|
||||
"--tones-dict", type=str, default=None, help="tone vocabulary file.")
|
||||
|
||||
# extra options such as max_epoch can also be passed in here
|
||||
args, rest = parser.parse_known_args()
|
||||
if args.device == "cpu" and args.nprocs > 1:
|
||||
raise RuntimeError("Multiprocess training on CPU is not supported.")
|
||||
with open(args.config) as f:
|
||||
config = CfgNode(yaml.safe_load(f))
|
||||
|
||||
if rest:
|
||||
extra = []
|
||||
# to support key=value format
|
||||
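# e.g. --max_epoch=300 becomes ["max_epoch", "300"] for merge_from_list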
for item in rest:
|
||||
# remove "--"
|
||||
item = item[2:]
|
||||
extra.extend(item.split("=", maxsplit=1))
|
||||
config.merge_from_list(extra)
|
||||
|
||||
print("========Args========")
|
||||
print(yaml.safe_dump(vars(args)))
|
||||
print("========Config========")
|
||||
print(config)
|
||||
print(
|
||||
f"master see the word size: {dist.get_world_size()}, from pid: {os.getpid()}"
|
||||
)
|
||||
|
||||
# dispatch
|
||||
if args.nprocs > 1:
|
||||
dist.spawn(train_sp, (args, config), nprocs=args.nprocs)
|
||||
else:
|
||||
train_sp(args, config)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
main()
|
@ -0,0 +1,6 @@
|
||||
|
||||
# LJSpeech
|
||||
|
||||
* tts0 - Tacotron2
|
||||
* tts1 - TransformerTTS
|
||||
* voc0 - WaveFlow
|
@ -0,0 +1,92 @@
|
||||
# Tacotron2
|
||||
|
||||
PaddlePaddle dynamic graph implementation of Tacotron2, a neural network architecture for speech synthesis directly from text. The implementation is based on [Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions](https://arxiv.org/abs/1712.05884).
|
||||
|
||||
## Project Structure
|
||||
|
||||
```text
|
||||
├── config.py # default configuration file
|
||||
├── ljspeech.py # dataset and dataloader settings for LJSpeech
|
||||
├── preprocess.py # script to preprocess LJSpeech dataset
|
||||
├── synthesize.py # script to synthesize spectrogram from text
|
||||
├── train.py # script for tacotron2 model training
|
||||
└── synthesize.ipynb # notebook example for end-to-end TTS
|
||||
```
|
||||
|
||||
## Dataset
|
||||
|
||||
We experiment with the LJSpeech dataset. Download and unzip [LJSpeech](https://keithito.com/LJ-Speech-Dataset/).
|
||||
|
||||
```bash
|
||||
wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
|
||||
tar xjvf LJSpeech-1.1.tar.bz2
|
||||
```
|
||||
|
||||
Then you need to preprocess the data by running ``preprocess.py``; the preprocessed data will be placed in the ``--output`` directory.
|
||||
|
||||
```bash
|
||||
python preprocess.py \
|
||||
--input=${DATAPATH} \
|
||||
--output=${PREPROCESSEDDATAPATH} \
|
||||
-v \
|
||||
```
|
||||
|
||||
For more help on arguments, run
|
||||
|
||||
``python preprocess.py --help``.
|
||||
|
||||
## Train the model
|
||||
|
||||
The Tacotron2 model can be trained by running ``train.py``.
|
||||
|
||||
```bash
|
||||
python train.py \
|
||||
--data=${PREPROCESSEDDATAPATH} \
|
||||
--output=${OUTPUTPATH} \
|
||||
--device=gpu \
|
||||
```
|
||||
|
||||
If you want to train on CPU, just set ``--device=cpu``.
|
||||
If you want to train on multiple GPUs, set ``--nprocs`` to the number of GPUs.
|
||||
By default, training resumes from the latest checkpoint in ``--output``. If you want to start a new training run, use a new ``${OUTPUTPATH}`` with no checkpoints in it. If you want to resume from another existing checkpoint, set ``--checkpoint_path`` to the checkpoint path you want to load.
|
||||
|
||||
**Note: The checkpoint path cannot contain the file extension.**
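For example, pass something like ``--checkpoint_path=${OUTPUTPATH}/checkpoints/step-10000`` (the step number here is illustrative), not a path ending in ``.pdparams``.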
|
||||
|
||||
For more help on arguments, run
|
||||
|
||||
``python train.py --help``.
|
||||
|
||||
## Synthesize
|
||||
|
||||
After training Tacotron2, spectrograms can be synthesized by running ``synthesize.py``.
|
||||
|
||||
```bash
|
||||
python synthesize.py \
|
||||
--config=${CONFIGPATH} \
|
||||
--checkpoint_path=${CHECKPOINTPATH} \
|
||||
--input=${TEXTPATH} \
|
||||
--output=${OUTPUTPATH} \
|
||||
--device=gpu
|
||||
```
|
||||
|
||||
The ``${CONFIGPATH}`` needs to be matched with ``${CHECKPOINTPATH}``.
|
||||
|
||||
For more help on arguments, run
|
||||
|
||||
``python synthesize.py --help``.
|
||||
|
||||
You can then find the spectrogram files in ``${OUTPUTPATH}``; they can be fed to a vocoder such as [waveflow](../waveflow/README.md#Synthesis) to get audio files.
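
As a quick sanity check, the saved spectrograms can be loaded back with numpy. This is a minimal sketch; the file name follows the ``sentence_{i}.npy`` pattern used by ``synthesize.py``, and the path is relative to ``${OUTPUTPATH}``.

```python
import numpy as np

# spectrograms are saved transposed, with shape (n_mels, num_frames)
mel = np.load("sentence_0.npy")
print(mel.shape)  # e.g. (80, num_frames) with the default 80 mel bands
```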
|
||||
|
||||
|
||||
## Pretrained Models
|
||||
|
||||
Pretrained models can be downloaded from the links below. We provide 2 models with different configurations.
|
||||
|
||||
1. This model uses a binary classifier to predict the stop token. [tacotron2_ljspeech_ckpt_0.3.zip](https://paddlespeech.bj.bcebos.com/Parakeet/tacotron2_ljspeech_ckpt_0.3.zip)
|
||||
|
||||
2. This model does not have a stop token predictor. It uses the attention peak position to decide whether all the content has been uttered. Guided attention loss is also used to speed up training. This model is trained with `configs/alternative.yaml`. [tacotron2_ljspeech_ckpt_0.3_alternative.zip](https://paddlespeech.bj.bcebos.com/Parakeet/tacotron2_ljspeech_ckpt_0.3_alternative.zip)
|
||||
|
||||
|
||||
## Notebook: End-to-end TTS
|
||||
|
||||
See [synthesize.ipynb](./synthesize.ipynb) for details about end-to-end TTS with tacotron2 and waveflow.
|
@ -0,0 +1,76 @@
|
||||
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
from yacs.config import CfgNode as CN
|
||||
|
||||
_C = CN()
|
||||
_C.data = CN(
|
||||
dict(
|
||||
batch_size=32, # batch size
|
||||
valid_size=64, # the first N examples are reserved for validation
|
||||
sample_rate=22050, # Hz, sample rate
|
||||
n_fft=1024, # fft frame size
|
||||
win_length=1024, # window size
|
||||
hop_length=256, # hop size between adjacent frames
|
||||
fmax=8000, # Hz, max frequency when converting to mel
|
||||
fmin=0, # Hz, min frequency when converting to mel
|
||||
n_mels=80, # mel bands
|
||||
padding_idx=0, # text embedding's padding index
|
||||
))
|
||||
|
||||
_C.model = CN(
|
||||
dict(
|
||||
vocab_size=37, # set this according to the frontend's vocab size
|
||||
n_tones=None,
|
||||
reduction_factor=1, # reduction factor
|
||||
d_encoder=512, # embedding & encoder's internal size
|
||||
encoder_conv_layers=3, # number of conv layer in tacotron2 encoder
|
||||
encoder_kernel_size=5, # kernel size of conv layers in tacotron2 encoder
|
||||
d_prenet=256, # hidden size of decoder prenet
|
||||
d_attention_rnn=1024, # hidden size of the first rnn layer in tacotron2 decoder
|
||||
d_decoder_rnn=1024, # hidden size of the second rnn layer in tacotron2 decoder
|
||||
d_attention=128, # hidden size of decoder location linear layer
|
||||
attention_filters=32, # number of filter in decoder location conv layer
|
||||
attention_kernel_size=31, # kernel size of decoder location conv layer
|
||||
d_postnet=512, # hidden size of decoder postnet
|
||||
postnet_kernel_size=5, # kernel size of conv layers in postnet
|
||||
postnet_conv_layers=5, # number of conv layer in decoder postnet
|
||||
p_encoder_dropout=0.5, # dropout probability in encoder
|
||||
p_prenet_dropout=0.5, # dropout probability in decoder prenet
|
||||
p_attention_dropout=0.1, # dropout probability of first rnn layer in decoder
|
||||
p_decoder_dropout=0.1, # dropout probability of second rnn layer in decoder
|
||||
p_postnet_dropout=0.5, # dropout probability in decoder postnet
|
||||
d_global_condition=None,
|
||||
use_stop_token=True, # whether to use a binary classifier to predict when to stop
|
||||
use_guided_attention_loss=False, # whether to use guided attention loss
|
||||
guided_attention_loss_sigma=0.2 # sigma in guided attention loss
|
||||
))
|
||||
|
||||
_C.training = CN(
|
||||
dict(
|
||||
lr=1e-3, # learning rate
|
||||
weight_decay=1e-6, # the coeff of weight decay
|
||||
grad_clip_thresh=1.0, # the clip norm of grad clip.
|
||||
plot_interval=1000, # plot attention and spectrogram
|
||||
valid_interval=1000, # validation
|
||||
save_interval=1000, # checkpoint
|
||||
max_iteration=500000, # max iteration to train
|
||||
))
|
||||
|
||||
|
||||
def get_cfg_defaults():
|
||||
"""Get a yacs CfgNode object with default values for my_project."""
|
||||
# Return a clone so that the defaults will not be altered
|
||||
# This is for the "local variable" use pattern
|
||||
return _C.clone()
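
# Usage sketch (not part of this module): the defaults can be overridden from
# a yaml file or KEY VALUE pairs, mirroring how the scripts in this example
# use the config. For instance:
#     config = get_cfg_defaults()
#     config.merge_from_file("configs/alternative.yaml")  # optional yaml override
#     config.merge_from_list(["training.lr", "5e-4"])     # optional KEY VALUE override
#     config.freeze()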
|
@ -0,0 +1,94 @@
|
||||
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
from pathlib import Path
|
||||
import pickle
|
||||
|
||||
import numpy as np
|
||||
from paddle.io import Dataset
|
||||
|
||||
from parakeet.data.batch import batch_spec, batch_text_id
|
||||
|
||||
|
||||
class LJSpeech(Dataset):
|
||||
"""A simple dataset adaptor for the processed ljspeech dataset."""
|
||||
|
||||
def __init__(self, root):
|
||||
self.root = Path(root).expanduser()
|
||||
records = []
|
||||
with open(self.root / "metadata.pkl", 'rb') as f:
|
||||
metadata = pickle.load(f)
|
||||
for mel_name, text, ids in metadata:
|
||||
mel_name = self.root / "mel" / (mel_name + ".npy")
|
||||
records.append((mel_name, text, ids))
|
||||
self.records = records
|
||||
|
||||
def __getitem__(self, i):
|
||||
mel_name, _, ids = self.records[i]
|
||||
mel = np.load(mel_name)
|
||||
return ids, mel
|
||||
|
||||
def __len__(self):
|
||||
return len(self.records)
|
||||
|
||||
|
||||
class LJSpeechCollector(object):
|
||||
"""A simple callable to batch LJSpeech examples."""
|
||||
|
||||
def __init__(self, padding_idx=0, padding_value=0., padding_stop_token=1.0):
|
||||
self.padding_idx = padding_idx
|
||||
self.padding_value = padding_value
|
||||
self.padding_stop_token = padding_stop_token
|
||||
|
||||
def __call__(self, examples):
|
||||
texts = []
|
||||
mels = []
|
||||
text_lens = []
|
||||
mel_lens = []
|
||||
|
||||
for data in examples:
|
||||
text, mel = data
|
||||
text = np.array(text, dtype=np.int64)
|
||||
text_lens.append(len(text))
|
||||
mels.append(mel)
|
||||
texts.append(text)
|
||||
mel_lens.append(mel.shape[1])
|
||||
|
||||
# Sort by text_len in descending order
|
||||
texts = [
|
||||
i
|
||||
for i, _ in sorted(
|
||||
zip(texts, text_lens), key=lambda x: x[1], reverse=True)
|
||||
]
|
||||
mels = [
|
||||
i
|
||||
for i, _ in sorted(
|
||||
zip(mels, text_lens), key=lambda x: x[1], reverse=True)
|
||||
]
|
||||
|
||||
mel_lens = [
|
||||
i
|
||||
for i, _ in sorted(
|
||||
zip(mel_lens, text_lens), key=lambda x: x[1], reverse=True)
|
||||
]
|
||||
|
||||
mel_lens = np.array(mel_lens, dtype=np.int64)
|
||||
text_lens = np.array(sorted(text_lens, reverse=True), dtype=np.int64)
|
||||
|
||||
# Pad sequence with largest len of the batch
|
||||
texts, _ = batch_text_id(texts, pad_id=self.padding_idx)
|
||||
mels, _ = batch_spec(mels, pad_value=self.padding_value)
|
||||
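# (B, n_mels, T) -> (B, T, n_mels), the layout the model consumes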
mels = np.transpose(mels, axes=(0, 2, 1))
|
||||
|
||||
return texts, mels, text_lens, mel_lens
|
@ -0,0 +1,99 @@
|
||||
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
import os
|
||||
import pickle
|
||||
import argparse
|
||||
from pathlib import Path
|
||||
|
||||
import tqdm
|
||||
import numpy as np
|
||||
|
||||
from parakeet.datasets import LJSpeechMetaData
|
||||
from parakeet.audio import AudioProcessor, LogMagnitude
|
||||
from parakeet.frontend import EnglishCharacter
|
||||
|
||||
from config import get_cfg_defaults
|
||||
|
||||
|
||||
def create_dataset(config, source_path, target_path, verbose=False):
|
||||
# create output dir
|
||||
target_path = Path(target_path).expanduser()
|
||||
mel_path = target_path / "mel"
|
||||
os.makedirs(mel_path, exist_ok=True)
|
||||
|
||||
meta_data = LJSpeechMetaData(source_path)
|
||||
frontend = EnglishCharacter()
|
||||
processor = AudioProcessor(
|
||||
sample_rate=config.data.sample_rate,
|
||||
n_fft=config.data.n_fft,
|
||||
n_mels=config.data.n_mels,
|
||||
win_length=config.data.win_length,
|
||||
hop_length=config.data.hop_length,
|
||||
fmax=config.data.fmax,
|
||||
fmin=config.data.fmin)
|
||||
normalizer = LogMagnitude()
|
||||
|
||||
records = []
|
||||
for (fname, text, _) in tqdm.tqdm(meta_data):
|
||||
wav = processor.read_wav(fname)
|
||||
mel = processor.mel_spectrogram(wav)
|
||||
mel = normalizer.transform(mel)
|
||||
ids = frontend(text)
|
||||
mel_name = os.path.splitext(os.path.basename(fname))[0]
|
||||
|
||||
# save mel spectrogram
|
||||
records.append((mel_name, text, ids))
|
||||
np.save(mel_path / mel_name, mel)
|
||||
if verbose:
|
||||
print("save mel spectrograms into {}".format(mel_path))
|
||||
|
||||
# save meta data as pickle archive
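# metadata.pkl is a pickled list of (mel_name, text, ids) tuples, which is
# exactly what the LJSpeech dataset adaptor in ljspeech.py expects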
|
||||
with open(target_path / "metadata.pkl", 'wb') as f:
|
||||
pickle.dump(records, f)
|
||||
if verbose:
|
||||
print("saved metadata into {}".format(target_path / "metadata.pkl"))
|
||||
|
||||
print("Done.")
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
parser = argparse.ArgumentParser(description="create dataset")
|
||||
parser.add_argument(
|
||||
"--config",
|
||||
type=str,
|
||||
metavar="FILE",
|
||||
help="extra config to overwrite the default config")
|
||||
parser.add_argument(
|
||||
"--input", type=str, help="path of the ljspeech dataset")
|
||||
parser.add_argument(
|
||||
"--output", type=str, help="path to save output dataset")
|
||||
parser.add_argument(
|
||||
"--opts",
|
||||
nargs=argparse.REMAINDER,
|
||||
help="options to overwrite --config file and the default config, passing in KEY VALUE pairs"
|
||||
)
|
||||
parser.add_argument(
|
||||
"-v", "--verbose", action="store_true", help="print msg")
|
||||
|
||||
config = get_cfg_defaults()
|
||||
args = parser.parse_args()
|
||||
if args.config:
|
||||
config.merge_from_file(args.config)
|
||||
if args.opts:
|
||||
config.merge_from_list(args.opts)
|
||||
config.freeze()
|
||||
print(config.data)
|
||||
|
||||
create_dataset(config, args.input, args.output, args.verbose)
|
@ -0,0 +1,95 @@
|
||||
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
import argparse
|
||||
from pathlib import Path
|
||||
|
||||
import paddle
|
||||
import numpy as np
|
||||
from matplotlib import pyplot as plt
|
||||
|
||||
from parakeet.frontend import EnglishCharacter
|
||||
from parakeet.models.tacotron2 import Tacotron2
|
||||
from parakeet.utils import display
|
||||
|
||||
from config import get_cfg_defaults
|
||||
|
||||
|
||||
def main(config, args):
|
||||
paddle.set_device(args.device)
|
||||
|
||||
# model
|
||||
frontend = EnglishCharacter()
|
||||
model = Tacotron2.from_pretrained(config, args.checkpoint_path)
|
||||
model.eval()
|
||||
|
||||
# inputs
|
||||
input_path = Path(args.input).expanduser()
|
||||
with open(input_path, "rt") as f:
|
||||
sentences = f.readlines()
|
||||
|
||||
if args.output is None:
|
||||
output_dir = input_path.parent / "synthesis"
|
||||
else:
|
||||
output_dir = Path(args.output).expanduser()
|
||||
output_dir.mkdir(exist_ok=True)
|
||||
|
||||
for i, sentence in enumerate(sentences):
|
||||
sentence = paddle.to_tensor(frontend(sentence)).unsqueeze(0)
|
||||
outputs = model.infer(sentence)
|
||||
mel_output = outputs["mel_outputs_postnet"][0].numpy().T
|
||||
alignment = outputs["alignments"][0].numpy().T
|
||||
|
||||
np.save(str(output_dir / f"sentence_{i}"), mel_output)
|
||||
display.plot_alignment(alignment)
|
||||
plt.savefig(str(output_dir / f"sentence_{i}.png"))
|
||||
if args.verbose:
|
||||
print("spectrogram saved at {}".format(output_dir /
|
||||
f"sentence_{i}.npy"))
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
config = get_cfg_defaults()
|
||||
|
||||
parser = argparse.ArgumentParser(
|
||||
description="generate mel spectrogram with TransformerTTS.")
|
||||
parser.add_argument(
|
||||
"--config",
|
||||
type=str,
|
||||
metavar="FILE",
|
||||
help="extra config to overwrite the default config")
|
||||
parser.add_argument(
|
||||
"--checkpoint_path", type=str, help="path of the checkpoint to load.")
|
||||
parser.add_argument("--input", type=str, help="path of the text sentences")
|
||||
parser.add_argument("--output", type=str, help="path to save outputs")
|
||||
parser.add_argument(
|
||||
"--device", type=str, default="cpu", help="device type to use.")
|
||||
parser.add_argument(
|
||||
"--opts",
|
||||
nargs=argparse.REMAINDER,
|
||||
help="options to overwrite --config file and the default config, passing in KEY VALUE pairs"
|
||||
)
|
||||
parser.add_argument(
|
||||
"-v", "--verbose", action="store_true", help="print msg")
|
||||
|
||||
args = parser.parse_args()
|
||||
if args.config:
|
||||
config.merge_from_file(args.config)
|
||||
if args.opts:
|
||||
config.merge_from_list(args.opts)
|
||||
config.freeze()
|
||||
print(config)
|
||||
print(args)
|
||||
|
||||
main(config, args)
|
@ -0,0 +1,218 @@
|
||||
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
|
||||
#
|
||||
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||
# you may not use this file except in compliance with the License.
|
||||
# You may obtain a copy of the License at
|
||||
#
|
||||
# http://www.apache.org/licenses/LICENSE-2.0
|
||||
#
|
||||
# Unless required by applicable law or agreed to in writing, software
|
||||
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||
# See the License for the specific language governing permissions and
|
||||
# limitations under the License.
|
||||
|
||||
import time
|
||||
from collections import defaultdict
|
||||
|
||||
import numpy as np
|
||||
import paddle
|
||||
from paddle.io import DataLoader
|
||||
from paddle.io import DistributedBatchSampler
|
||||
from paddle import distributed as dist
|
||||
from parakeet.data import dataset
|
||||
from parakeet.training.cli import default_argument_parser
|
||||
from parakeet.training.experiment import ExperimentBase
|
||||
from parakeet.utils import display, mp_tools
|
||||
from parakeet.models.tacotron2 import Tacotron2, Tacotron2Loss
|
||||
|
||||
from config import get_cfg_defaults
|
||||
from ljspeech import LJSpeech, LJSpeechCollector
|
||||
|
||||
|
||||
class Experiment(ExperimentBase):
|
||||
def compute_losses(self, inputs, outputs):
|
||||
texts, mel_targets, plens, slens = inputs
|
||||
|
||||
mel_outputs = outputs["mel_output"]
|
||||
mel_outputs_postnet = outputs["mel_outputs_postnet"]
|
||||
attention_weight = outputs["alignments"]
|
||||
if self.config.model.use_stop_token:
|
||||
stop_logits = outputs["stop_logits"]
|
||||
else:
|
||||
stop_logits = None
|
||||
|
||||
losses = self.criterion(mel_outputs, mel_outputs_postnet, mel_targets,
|
||||
attention_weight, slens, plens, stop_logits)
|
||||
return losses
|
||||
|
||||
def train_batch(self):
|
||||
start = time.time()
|
||||
batch = self.read_batch()
|
||||
data_loader_time = time.time() - start
|
||||
|
||||
self.optimizer.clear_grad()
|
||||
self.model.train()
|
||||
texts, mels, text_lens, output_lens = batch
|
||||
outputs = self.model(texts, text_lens, mels, output_lens)
|
||||
losses = self.compute_losses(batch, outputs)
|
||||
loss = losses["loss"]
|
||||
loss.backward()
|
||||
self.optimizer.step()
|
||||
iteration_time = time.time() - start
|
||||
|
||||
losses_np = {k: float(v) for k, v in losses.items()}
|
||||
# logging
|
||||
msg = "Rank: {}, ".format(dist.get_rank())
|
||||
msg += "step: {}, ".format(self.iteration)
|
||||
msg += "time: {:>.3f}s/{:>.3f}s, ".format(data_loader_time,
|
||||
iteration_time)
|
||||
msg += ', '.join('{}: {:>.6f}'.format(k, v)
|
||||
for k, v in losses_np.items())
|
||||
self.logger.info(msg)
|
||||
|
||||
if dist.get_rank() == 0:
|
||||
for k, v in losses_np.items():
|
||||
self.visualizer.add_scalar(f"train_loss/{k}", v, self.iteration)
|
||||
|
||||
@mp_tools.rank_zero_only
|
||||
@paddle.no_grad()
|
||||
def valid(self):
|
||||
valid_losses = defaultdict(list)
|
||||
for i, batch in enumerate(self.valid_loader):
|
||||
texts, mels, text_lens, output_lens = batch
|
||||
outputs = self.model(texts, text_lens, mels, output_lens)
|
||||
losses = self.compute_losses(batch, outputs)
|
||||
for k, v in losses.items():
|
||||
valid_losses[k].append(float(v))
|
||||
|
||||
attention_weights = outputs["alignments"]
|
||||
self.visualizer.add_figure(
|
||||
f"valid_sentence_{i}_alignments",
|
||||
display.plot_alignment(attention_weights[0].numpy().T),
|
||||
self.iteration)
|
||||
self.visualizer.add_figure(
|
||||
f"valid_sentence_{i}_target_spectrogram",
|
||||
display.plot_spectrogram(mels[0].numpy().T), self.iteration)
|
||||
self.visualizer.add_figure(
|
||||
f"valid_sentence_{i}_predicted_spectrogram",
|
||||
display.plot_spectrogram(outputs['mel_outputs_postnet'][0]
|
||||
.numpy().T), self.iteration)
|
||||
|
||||
# write visual log
|
||||
valid_losses = {k: np.mean(v) for k, v in valid_losses.items()}
|
||||
|
||||
# logging
|
||||
msg = "Valid: "
|
||||
msg += "step: {}, ".format(self.iteration)
|
||||
msg += ', '.join('{}: {:>.6f}'.format(k, v)
|
||||
for k, v in valid_losses.items())
|
||||
self.logger.info(msg)
|
||||
|
||||
for k, v in valid_losses.items():
|
||||
self.visualizer.add_scalar(f"valid/{k}", v, self.iteration)
|
||||
|
||||
def setup_model(self):
|
||||
config = self.config
|
||||
model = Tacotron2(
|
||||
vocab_size=config.model.vocab_size,
|
||||
d_mels=config.data.n_mels,
|
||||
d_encoder=config.model.d_encoder,
|
||||
encoder_conv_layers=config.model.encoder_conv_layers,
|
||||
encoder_kernel_size=config.model.encoder_kernel_size,
|
||||
d_prenet=config.model.d_prenet,
|
||||
d_attention_rnn=config.model.d_attention_rnn,
|
||||
d_decoder_rnn=config.model.d_decoder_rnn,
|
||||
attention_filters=config.model.attention_filters,
|
||||
attention_kernel_size=config.model.attention_kernel_size,
|
||||
d_attention=config.model.d_attention,
|
||||
d_postnet=config.model.d_postnet,
|
||||
postnet_kernel_size=config.model.postnet_kernel_size,
|
||||
postnet_conv_layers=config.model.postnet_conv_layers,
|
||||
reduction_factor=config.model.reduction_factor,
|
||||
p_encoder_dropout=config.model.p_encoder_dropout,
|
||||
p_prenet_dropout=config.model.p_prenet_dropout,
|
||||
p_attention_dropout=config.model.p_attention_dropout,
|
||||
p_decoder_dropout=config.model.p_decoder_dropout,
|
||||
p_postnet_dropout=config.model.p_postnet_dropout,
|
||||
use_stop_token=config.model.use_stop_token)
|
||||
|
||||
if self.parallel:
|
||||
model = paddle.DataParallel(model)
|
||||
|
||||
grad_clip = paddle.nn.ClipGradByGlobalNorm(
|
||||
config.training.grad_clip_thresh)
|
||||
optimizer = paddle.optimizer.Adam(
|
||||
learning_rate=config.training.lr,
|
||||
parameters=model.parameters(),
|
||||
weight_decay=paddle.regularizer.L2Decay(
|
||||
config.training.weight_decay),
|
||||
grad_clip=grad_clip)
|
||||
criterion = Tacotron2Loss(
|
||||
use_stop_token_loss=config.model.use_stop_token,
|
||||
use_guided_attention_loss=config.model.use_guided_attention_loss,
|
||||
sigma=config.model.guided_attention_loss_sigma)
|
||||
self.model = model
|
||||
self.optimizer = optimizer
|
||||
self.criterion = criterion
|
||||
|
||||
def setup_dataloader(self):
|
||||
args = self.args
|
||||
config = self.config
|
||||
ljspeech_dataset = LJSpeech(args.data)
|
||||
|
||||
valid_set, train_set = dataset.split(ljspeech_dataset,
|
||||
config.data.valid_size)
|
||||
batch_fn = LJSpeechCollector(padding_idx=config.data.padding_idx)
|
||||
|
||||
if not self.parallel:
|
||||
self.train_loader = DataLoader(
|
||||
train_set,
|
||||
batch_size=config.data.batch_size,
|
||||
shuffle=True,
|
||||
drop_last=True,
|
||||
collate_fn=batch_fn)
|
||||
else:
|
||||
sampler = DistributedBatchSampler(
|
||||
train_set,
|
||||
batch_size=config.data.batch_size,
|
||||
shuffle=True,
|
||||
drop_last=True)
|
||||
self.train_loader = DataLoader(
|
||||
train_set, batch_sampler=sampler, collate_fn=batch_fn)
|
||||
|
||||
self.valid_loader = DataLoader(
|
||||
valid_set,
|
||||
batch_size=config.data.batch_size,
|
||||
shuffle=False,
|
||||
drop_last=False,
|
||||
collate_fn=batch_fn)
|
||||
|
||||
|
||||
def main_sp(config, args):
|
||||
exp = Experiment(config, args)
|
||||
exp.setup()
|
||||
exp.resume_or_load()
|
||||
exp.run()
|
||||
|
||||
|
||||
def main(config, args):
|
||||
if args.nprocs > 1 and args.device == "gpu":
|
||||
dist.spawn(main_sp, args=(config, args), nprocs=args.nprocs)
|
||||
else:
|
||||
main_sp(config, args)
|
||||
|
||||
|
||||
if __name__ == "__main__":
|
||||
config = get_cfg_defaults()
|
||||
parser = default_argument_parser()
|
||||
args = parser.parse_args()
|
||||
if args.config:
|
||||
config.merge_from_file(args.config)
|
||||
if args.opts:
|
||||
config.merge_from_list(args.opts)
|
||||
config.freeze()
|
||||
print(config)
|
||||
print(args)
|
||||
|
||||
main(config, args)
|
@ -0,0 +1,194 @@
|
||||
# TransformerTTS with LJSpeech
|
||||
## Dataset
|
||||
### Download the dataset
|
||||
```bash
|
||||
wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
|
||||
```
|
||||
### Extract the dataset
|
||||
```bash
|
||||
tar xjvf LJSpeech-1.1.tar.bz2
|
||||
```
|
||||
### Preprocess the dataset
|
||||
Assume the path to the dataset is `~/datasets/LJSpeech-1.1`.
|
||||
Run the command below to preprocess the dataset.
|
||||
|
||||
```bash
|
||||
./preprocess.sh
|
||||
```
|
||||
When it is done, a `dump` folder is created in the current directory. The structure of the `dump` folder is listed below.
|
||||
```text
|
||||
dump
|
||||
├── dev
|
||||
│ ├── norm
|
||||
│ └── raw
|
||||
├── phone_id_map.txt
|
||||
├── speaker_id_map.txt
|
||||
├── test
|
||||
│ ├── norm
|
||||
│ └── raw
|
||||
└── train
|
||||
├── norm
|
||||
├── raw
|
||||
└── speech_stats.npy
|
||||
```
|
||||
The dataset is split into 3 parts, namely `train`, `dev` and `test`, each of which contains a `norm` and a `raw` subfolder. The `raw` folder contains the speech features of each utterance, while the `norm` folder contains the normalized ones. The statistics used to normalize the features are computed from the training set and stored in `dump/train/speech_stats.npy`.
|
||||
|
||||
There is also a `metadata.jsonl` in each subfolder. It is a table-like file that contains the phones, text lengths, speech lengths, speech feature paths, speaker, and id of each utterance.
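
For example, you can inspect the metadata with a few lines of Python (a minimal sketch; the exact field names follow the description above):

```python
import jsonlines

with jsonlines.open("dump/train/norm/metadata.jsonl") as reader:
    records = list(reader)
print(len(records), "utterances")
print(sorted(records[0].keys()))  # phones, lengths, feature path, speaker, id
```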
|
||||
|
||||
## Train the model
|
||||
`./run.sh` calls `../train.py`.
|
||||
```bash
|
||||
./run.sh
|
||||
```
|
||||
Here's the complete help message.
|
||||
```text
|
||||
usage: train.py [-h] [--config CONFIG] [--train-metadata TRAIN_METADATA]
|
||||
[--dev-metadata DEV_METADATA] [--output-dir OUTPUT_DIR]
|
||||
[--device DEVICE] [--nprocs NPROCS] [--verbose VERBOSE]
|
||||
[--phones-dict PHONES_DICT]
|
||||
|
||||
Train a TransformerTTS model with LJSpeech TTS dataset.
|
||||
|
||||
optional arguments:
|
||||
-h, --help show this help message and exit
|
||||
--config CONFIG config file to overwrite default config.
|
||||
--train-metadata TRAIN_METADATA
|
||||
training data.
|
||||
--dev-metadata DEV_METADATA
|
||||
dev data.
|
||||
--output-dir OUTPUT_DIR
|
||||
output dir.
|
||||
--device DEVICE device type to use.
|
||||
--nprocs NPROCS number of processes.
|
||||
--verbose VERBOSE verbose.
|
||||
--phones-dict PHONES_DICT
|
||||
phone vocabulary file.
|
||||
```
|
||||
1. `--config` is a config file in yaml format to overwrite the default config, which can be found at `conf/default.yaml`.
|
||||
2. `--train-metadata` and `--dev-metadata` should be the metadata file in the normalized subfolder of `train` and `dev` in the `dump` folder.
|
||||
3. `--output-dir` is the directory to save the results of the experiment. Checkpoints are saved in `checkpoints/` inside this directory.
|
||||
4. `--device` is the type of device to run the experiment on; 'cpu' and 'gpu' are supported.
|
||||
5. `--nprocs` is the number of processes to run in parallel; note that nprocs > 1 is only supported when `--device` is 'gpu'.
|
||||
6. `--phones-dict` is the path of the phone vocabulary file.
|
||||
|
||||
## Pretrained Model
|
||||
The pretrained model can be downloaded here: [transformer_tts_ljspeech_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/transformer_tts_ljspeech_ckpt_0.4.zip)
|
||||
|
||||
The TransformerTTS checkpoint contains the files listed below.
|
||||
```text
|
||||
transformer_tts_ljspeech_ckpt_0.4
|
||||
├── default.yaml # default config used to train transformer_tts
|
||||
├── phone_id_map.txt # phone vocabulary file when training transformer_tts
|
||||
├── snapshot_iter_201500.pdz # model parameters and optimizer states
|
||||
└── speech_stats.npy # statistics used to normalize spectrogram when training transformer_tts
|
||||
```
|
||||
## Synthesize
|
||||
We use [waveflow](https://github.com/PaddlePaddle/Parakeet/tree/develop/examples/waveflow) as the neural vocoder.
|
||||
Download the pretrained WaveFlow model with 128 residual channels from [waveflow_ljspeech_ckpt_0.3.zip](https://paddlespeech.bj.bcebos.com/Parakeet/waveflow_ljspeech_ckpt_0.3.zip) and unzip it.
|
||||
```bash
|
||||
unzip waveflow_ljspeech_ckpt_0.3.zip
|
||||
```
|
||||
The WaveFlow checkpoint contains the files listed below.
|
||||
```text
|
||||
waveflow_ljspeech_ckpt_0.3
|
||||
├── config.yaml # default config used to train waveflow
|
||||
└── step-2000000.pdparams # model parameters of waveflow
|
||||
```
|
||||
`synthesize.sh` calls `../synthesize.py`, which can synthesize waveforms from `metadata.jsonl`.
|
||||
```bash
|
||||
./synthesize.sh
|
||||
```
|
||||
```text
|
||||
usage: synthesize.py [-h] [--transformer-tts-config TRANSFORMER_TTS_CONFIG]
|
||||
[--transformer-tts-checkpoint TRANSFORMER_TTS_CHECKPOINT]
|
||||
[--transformer-tts-stat TRANSFORMER_TTS_STAT]
|
||||
[--waveflow-config WAVEFLOW_CONFIG]
|
||||
[--waveflow-checkpoint WAVEFLOW_CHECKPOINT]
|
||||
[--phones-dict PHONES_DICT]
|
||||
[--test-metadata TEST_METADATA] [--output-dir OUTPUT_DIR]
|
||||
[--device DEVICE] [--verbose VERBOSE]
|
||||
|
||||
Synthesize with transformer tts & waveflow.
|
||||
|
||||
optional arguments:
|
||||
-h, --help show this help message and exit
|
||||
--transformer-tts-config TRANSFORMER_TTS_CONFIG
|
||||
transformer tts config file.
|
||||
--transformer-tts-checkpoint TRANSFORMER_TTS_CHECKPOINT
|
||||
transformer tts checkpoint to load.
|
||||
--transformer-tts-stat TRANSFORMER_TTS_STAT
|
||||
mean and standard deviation used to normalize
|
||||
spectrogram when training transformer tts.
|
||||
--waveflow-config WAVEFLOW_CONFIG
|
||||
waveflow config file.
|
||||
--waveflow-checkpoint WAVEFLOW_CHECKPOINT
|
||||
waveflow checkpoint to load.
|
||||
--phones-dict PHONES_DICT
|
||||
phone vocabulary file.
|
||||
--test-metadata TEST_METADATA
|
||||
test metadata.
|
||||
--output-dir OUTPUT_DIR
|
||||
output dir.
|
||||
--device DEVICE device type to use.
|
||||
--verbose VERBOSE verbose.
|
||||
```
|
||||
`synthesize_e2e.sh` calls `synthesize_e2e.py`, which can synthesize waveforms from a text file.
|
||||
```bash
|
||||
./synthesize_e2e.sh
|
||||
```
|
||||
```text
|
||||
usage: synthesize_e2e.py [-h]
|
||||
[--transformer-tts-config TRANSFORMER_TTS_CONFIG]
|
||||
[--transformer-tts-checkpoint TRANSFORMER_TTS_CHECKPOINT]
|
||||
[--transformer-tts-stat TRANSFORMER_TTS_STAT]
|
||||
[--waveflow-config WAVEFLOW_CONFIG]
|
||||
[--waveflow-checkpoint WAVEFLOW_CHECKPOINT]
|
||||
[--phones-dict PHONES_DICT] [--text TEXT]
|
||||
[--output-dir OUTPUT_DIR] [--device DEVICE]
|
||||
[--verbose VERBOSE]
|
||||
|
||||
Synthesize with transformer tts & waveflow.
|
||||
|
||||
optional arguments:
|
||||
-h, --help show this help message and exit
|
||||
--transformer-tts-config TRANSFORMER_TTS_CONFIG
|
||||
transformer tts config file.
|
||||
--transformer-tts-checkpoint TRANSFORMER_TTS_CHECKPOINT
|
||||
transformer tts checkpoint to load.
|
||||
--transformer-tts-stat TRANSFORMER_TTS_STAT
|
||||
mean and standard deviation used to normalize
|
||||
spectrogram when training transformer tts.
|
||||
--waveflow-config WAVEFLOW_CONFIG
|
||||
waveflow config file.
|
||||
--waveflow-checkpoint WAVEFLOW_CHECKPOINT
|
||||
waveflow checkpoint to load.
|
||||
--phones-dict PHONES_DICT
|
||||
phone vocabulary file.
|
||||
--text TEXT text to synthesize, a 'utt_id sentence' pair per line.
|
||||
--output-dir OUTPUT_DIR
|
||||
output dir.
|
||||
--device DEVICE device type to use.
|
||||
--verbose VERBOSE verbose.
|
||||
```
|
||||
1. `--transformer-tts-config`, `--transformer-tts-checkpoint`, `--transformer-tts-stat` and `--phones-dict` are arguments for transformer_tts, which correspond to the 4 files in the transformer_tts pretrained model.
|
||||
2. `--waveflow-config`, `--waveflow-checkpoint` are arguments for waveflow, which correspond to the 2 files in the waveflow pretrained model.
|
||||
3. `--test-metadata` should be the metadata file in the normalized subfolder of `test` in the `dump` folder.
|
||||
4. `--text` is the text file, which contains sentences to synthesize.
|
||||
5. `--output-dir` is the directory to save synthesized audio files.
|
||||
6. `--device` is the type of device to run synthesis, 'cpu' and 'gpu' are supported. 'gpu' is recommended for faster synthesis.
|
||||
|
||||
You can use the following script to synthesize the sentences in `../sentences.txt` using the pretrained transformer_tts and waveflow models.
|
||||
```bash
|
||||
FLAGS_allocator_strategy=naive_best_fit \
|
||||
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
|
||||
python3 synthesize_e2e.py \
|
||||
--transformer-tts-config=transformer_tts_ljspeech_ckpt_0.4/default.yaml \
|
||||
--transformer-tts-checkpoint=transformer_tts_ljspeech_ckpt_0.4/snapshot_iter_201500.pdz \
|
||||
--transformer-tts-stat=transformer_tts_ljspeech_ckpt_0.4/speech_stats.npy \
|
||||
--waveflow-config=waveflow_ljspeech_ckpt_0.3/config.yaml \
|
||||
--waveflow-checkpoint=waveflow_ljspeech_ckpt_0.3/step-2000000.pdparams \
|
||||
--text=../sentences.txt \
|
||||
--output-dir=exp/default/test_e2e \
|
||||
--device="gpu" \
|
||||
--phones-dict=transformer_tts_ljspeech_ckpt_0.4/phone_id_map.txt
|
||||
```
|
@ -0,0 +1,92 @@
|
||||
|
||||
fs : 22050 # Hz, sample rate
|
||||
n_fft : 1024 # fft frame size
|
||||
win_length : 1024 # window size
|
||||
n_shift : 256       # hop size between adjacent frames
|
||||
fmin : 0 # Hz, min frequency when converting to mel
|
||||
fmax : 8000 # Hz, max frequency when converting to mel
|
||||
n_mels : 80 # mel bands
|
||||
window: "hann" # Window function.
|
||||
|
||||
###########################################################
|
||||
# DATA SETTING #
|
||||
###########################################################
|
||||
batch_size: 16
|
||||
num_workers: 2
|
||||
|
||||
##########################################################
|
||||
# TTS MODEL SETTING #
|
||||
##########################################################
|
||||
tts: transformertts # model architecture
|
||||
model: # keyword arguments for the selected model
|
||||
embed_dim: 0 # embedding dimension in encoder prenet
|
||||
eprenet_conv_layers: 0 # number of conv layers in encoder prenet
|
||||
# if set to 0, no encoder prenet will be used
|
||||
eprenet_conv_filts: 0 # filter size of conv layers in encoder prenet
|
||||
eprenet_conv_chans: 0 # number of channels of conv layers in encoder prenet
|
||||
dprenet_layers: 2 # number of layers in decoder prenet
|
||||
dprenet_units: 256 # number of units in decoder prenet
|
||||
adim: 512 # attention dimension
|
||||
aheads: 8 # number of attention heads
|
||||
elayers: 6 # number of encoder layers
|
||||
eunits: 1024 # number of encoder ff units
|
||||
dlayers: 6 # number of decoder layers
|
||||
dunits: 1024 # number of decoder ff units
|
||||
positionwise_layer_type: conv1d # type of position-wise layer
|
||||
positionwise_conv_kernel_size: 1 # kernel size of position wise conv layer
|
||||
postnet_layers: 5                          # number of layers of postnet
|
||||
postnet_filts: 5 # filter size of conv layers in postnet
|
||||
postnet_chans: 256 # number of channels of conv layers in postnet
|
||||
use_scaled_pos_enc: True # whether to use scaled positional encoding
|
||||
encoder_normalize_before: True # whether to perform layer normalization before the input
|
||||
decoder_normalize_before: True # whether to perform layer normalization before the input
|
||||
reduction_factor: 1 # reduction factor
|
||||
init_type: xavier_uniform # initialization type
|
||||
init_enc_alpha: 1.0 # initial value of alpha of encoder scaled position encoding
|
||||
init_dec_alpha: 1.0 # initial value of alpha of decoder scaled position encoding
|
||||
eprenet_dropout_rate: 0.0 # dropout rate for encoder prenet
|
||||
dprenet_dropout_rate: 0.5 # dropout rate for decoder prenet
|
||||
postnet_dropout_rate: 0.5 # dropout rate for postnet
|
||||
transformer_enc_dropout_rate: 0.1 # dropout rate for transformer encoder layer
|
||||
transformer_enc_positional_dropout_rate: 0.1 # dropout rate for transformer encoder positional encoding
|
||||
transformer_enc_attn_dropout_rate: 0.1 # dropout rate for transformer encoder attention layer
|
||||
transformer_dec_dropout_rate: 0.1 # dropout rate for transformer decoder layer
|
||||
transformer_dec_positional_dropout_rate: 0.1 # dropout rate for transformer decoder positional encoding
|
||||
transformer_dec_attn_dropout_rate: 0.1 # dropout rate for transformer decoder attention layer
|
||||
transformer_enc_dec_attn_dropout_rate: 0.1 # dropout rate for transformer encoder-decoder attention layer
|
||||
num_heads_applied_guided_attn: 2 # number of heads to apply guided attention loss
|
||||
num_layers_applied_guided_attn: 2 # number of layers to apply guided attention loss
|
||||
|
||||
|
||||
|
||||
###########################################################
|
||||
# UPDATER SETTING #
|
||||
###########################################################
|
||||
updater:
|
||||
use_masking: true # whether to apply masking for padded part in loss calculation
|
||||
loss_type: L1
|
||||
use_guided_attn_loss: true # whether to use guided attention loss
|
||||
guided_attn_loss_sigma: 0.4 # sigma in guided attention loss
|
||||
guided_attn_loss_lambda: 10.0 # lambda in guided attention loss
|
||||
modules_applied_guided_attn: ["encoder-decoder"] # modules to apply guided attention loss
|
||||
bce_pos_weight: 5.0 # weight of positive sample in binary cross entropy calculation
|
||||
|
||||
|
||||
##########################################################
|
||||
# OPTIMIZER & SCHEDULER SETTING #
|
||||
##########################################################
|
||||
optimizer:
|
||||
optim: adam # optimizer type
|
||||
learning_rate: 0.001 # learning rate
|
||||
|
||||
###########################################################
|
||||
# TRAINING SETTING #
|
||||
###########################################################
|
||||
max_epoch: 500
|
||||
num_snapshots: 5
|
||||
|
||||
|
||||
###########################################################
|
||||
# OTHER SETTING #
|
||||
###########################################################
|
||||
seed: 10086
|
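
For orientation, the feature settings above amount to roughly the following librosa-based extraction. This is a minimal sketch, not the repository's `LogMelFBank` implementation (the exact spectrogram power and log floor may differ), and the wav path is illustrative:

```python
import librosa
import numpy as np

# Load audio at the configured sample rate (fs: 22050). Illustrative path.
wav, _ = librosa.load("LJ001-0001.wav", sr=22050)

# Mel filterbank energies with the settings from conf/default.yaml.
mel = librosa.feature.melspectrogram(
    y=wav,
    sr=22050,
    n_fft=1024,       # n_fft
    hop_length=256,   # n_shift: 256 / 22050 Hz ~ 11.6 ms frame shift
    win_length=1024,  # win_length
    window="hann",    # window
    n_mels=80,        # n_mels
    fmin=0,           # fmin
    fmax=8000)        # fmax
logmel = np.log(np.maximum(mel, 1e-10))  # natural-log mel, shape (80, T)
```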
@ -0,0 +1,50 @@
#!/bin/bash

stage=1
stop_stage=100

export MAIN_ROOT=`realpath ${PWD}/../../../`

if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
    # extract features
    echo "Extract features ..."
    python3 ../preprocess.py \
        --dataset=ljspeech \
        --rootdir=~/datasets/LJSpeech-1.1/ \
        --dumpdir=dump \
        --config-path=conf/default.yaml \
        --num-cpu=8
fi

if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
    # get features' stats (mean and std)
    echo "Get features' stats ..."
    python3 ${MAIN_ROOT}/utils/compute_statistics.py \
        --metadata=dump/train/raw/metadata.jsonl \
        --field-name="speech"
fi

if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
    # normalize and convert phone to id; dev and test should use train's stats
    echo "Normalize ..."
    python3 ../normalize.py \
        --metadata=dump/train/raw/metadata.jsonl \
        --dumpdir=dump/train/norm \
        --speech-stats=dump/train/speech_stats.npy \
        --phones-dict=dump/phone_id_map.txt \
        --speaker-dict=dump/speaker_id_map.txt

    python3 ../normalize.py \
        --metadata=dump/dev/raw/metadata.jsonl \
        --dumpdir=dump/dev/norm \
        --speech-stats=dump/train/speech_stats.npy \
        --phones-dict=dump/phone_id_map.txt \
        --speaker-dict=dump/speaker_id_map.txt

    python3 ../normalize.py \
        --metadata=dump/test/raw/metadata.jsonl \
        --dumpdir=dump/test/norm \
        --speech-stats=dump/train/speech_stats.npy \
        --phones-dict=dump/phone_id_map.txt \
        --speaker-dict=dump/speaker_id_map.txt
fi
@ -0,0 +1,9 @@
#!/bin/bash

python3 ../train.py \
    --train-metadata=dump/train/norm/metadata.jsonl \
    --dev-metadata=dump/dev/norm/metadata.jsonl \
    --config=conf/default.yaml \
    --output-dir=exp/default \
    --nprocs=2 \
    --phones-dict=dump/phone_id_map.txt
@ -0,0 +1,13 @@
#!/bin/bash
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ../synthesize.py \
    --transformer-tts-config=conf/default.yaml \
    --transformer-tts-checkpoint=exp/default/checkpoints/snapshot_iter_201500.pdz \
    --transformer-tts-stat=dump/train/speech_stats.npy \
    --waveflow-config=waveflow_ljspeech_ckpt_0.3/config.yaml \
    --waveflow-checkpoint=waveflow_ljspeech_ckpt_0.3/step-2000000.pdparams \
    --test-metadata=dump/test/norm/metadata.jsonl \
    --output-dir=exp/default/test \
    --device="gpu" \
    --phones-dict=dump/phone_id_map.txt
@ -0,0 +1,161 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import argparse
import logging
from pathlib import Path

import numpy as np
import paddle
import soundfile as sf
import yaml
from yacs.config import CfgNode
from parakeet.frontend import English
from parakeet.models.transformer_tts import TransformerTTS
from parakeet.models.transformer_tts import TransformerTTSInference
from parakeet.models.waveflow import ConditionalWaveFlow
from parakeet.modules.normalizer import ZScore
from parakeet.utils import layer_tools


def evaluate(args, acoustic_model_config, vocoder_config):
    # the DataLoader logger is too verbose, disable it
    logging.getLogger("DataLoader").disabled = True

    # construct dataset for evaluation
    sentences = []
    with open(args.text, 'rt') as f:
        for line in f:
            line_list = line.strip().split()
            utt_id = line_list[0]
            sentence = " ".join(line_list[1:])
            sentences.append((utt_id, sentence))

    with open(args.phones_dict, "r") as f:
        phn_id = [line.strip().split() for line in f.readlines()]

    vocab_size = len(phn_id)
    phone_id_map = {}
    for phn, id in phn_id:
        phone_id_map[phn] = int(id)
    print("vocab_size:", vocab_size)
    odim = acoustic_model_config.n_mels
    model = TransformerTTS(
        idim=vocab_size, odim=odim, **acoustic_model_config["model"])

    model.set_state_dict(
        paddle.load(args.transformer_tts_checkpoint)["main_params"])
    model.eval()

    # remove ".pdparams" in waveflow_checkpoint
    vocoder_checkpoint_path = args.waveflow_checkpoint[:-9] if args.waveflow_checkpoint.endswith(
        ".pdparams") else args.waveflow_checkpoint
    vocoder = ConditionalWaveFlow.from_pretrained(vocoder_config,
                                                  vocoder_checkpoint_path)
    layer_tools.recursively_remove_weight_norm(vocoder)
    vocoder.eval()
    print("model done!")

    frontend = English()
    print("frontend done!")

    stat = np.load(args.transformer_tts_stat)
    mu, std = stat
    mu = paddle.to_tensor(mu)
    std = paddle.to_tensor(std)
    transformer_tts_normalizer = ZScore(mu, std)

    transformer_tts_inference = TransformerTTSInference(
        transformer_tts_normalizer, model)

    output_dir = Path(args.output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    for utt_id, sentence in sentences:
        phones = frontend.phoneticize(sentence)
        # remove start_symbol and end_symbol
        phones = phones[1:-1]
        phones = [phn for phn in phones if not phn.isspace()]
        phones = [phn if phn in phone_id_map else "," for phn in phones]
        phone_ids = [phone_id_map[phn] for phn in phones]
        with paddle.no_grad():
            mel = transformer_tts_inference(paddle.to_tensor(phone_ids))
            # mel's shape is (T, n_mels) and waveflow's input shape is (batch, n_mels, T)
            mel = mel.unsqueeze(0).transpose([0, 2, 1])
            # waveflow's output shape is (B, T)
            wav = vocoder.infer(mel)[0]

        sf.write(
            str(output_dir / (utt_id + ".wav")),
            wav.numpy(),
            samplerate=acoustic_model_config.fs)
        print(f"{utt_id} done!")


def main():
    # parse args and config and redirect to evaluate
    parser = argparse.ArgumentParser(
        description="Synthesize with transformer tts & waveflow.")
    parser.add_argument(
        "--transformer-tts-config",
        type=str,
        help="transformer tts config file.")
    parser.add_argument(
        "--transformer-tts-checkpoint",
        type=str,
        help="transformer tts checkpoint to load.")
    parser.add_argument(
        "--transformer-tts-stat",
        type=str,
        help="mean and standard deviation used to normalize spectrogram when training transformer tts."
    )
    parser.add_argument(
        "--waveflow-config", type=str, help="waveflow config file.")
    # mels are not normalized when training waveflow
    parser.add_argument(
        "--waveflow-checkpoint", type=str, help="waveflow checkpoint to load.")
    parser.add_argument(
        "--phones-dict",
        type=str,
        default="phone_id_map.txt",
        help="phone vocabulary file.")
    parser.add_argument(
        "--text",
        type=str,
        help="text to synthesize, a 'utt_id sentence' pair per line.")
    parser.add_argument("--output-dir", type=str, help="output dir.")
    parser.add_argument(
        "--device", type=str, default="gpu", help="device type to use.")
    parser.add_argument("--verbose", type=int, default=1, help="verbose.")

    args = parser.parse_args()

    paddle.set_device(args.device)

    with open(args.transformer_tts_config) as f:
        transformer_tts_config = CfgNode(yaml.safe_load(f))
    with open(args.waveflow_config) as f:
        waveflow_config = CfgNode(yaml.safe_load(f))

    print("========Args========")
    print(yaml.safe_dump(vars(args)))
    print("========Config========")
    print(transformer_tts_config)
    print(waveflow_config)

    evaluate(args, transformer_tts_config, waveflow_config)


if __name__ == "__main__":
    main()
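
One detail worth noting in the synthesis loop above is the layout change between the two models: the acoustic model emits mels as (T, n_mels), while WaveFlow consumes (batch, n_mels, T). A small numpy sketch of the same reshaping:

```python
import numpy as np

mel = np.zeros((123, 80))  # (T, n_mels) as returned by TransformerTTSInference
mel = mel[np.newaxis, :, :].transpose(0, 2, 1)  # unsqueeze(0) + transpose([0, 2, 1])
assert mel.shape == (1, 80, 123)  # (batch, n_mels, T) as WaveFlow expects
```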
@ -0,0 +1,13 @@
#!/bin/bash
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 synthesize_e2e.py \
    --transformer-tts-config=conf/default.yaml \
    --transformer-tts-checkpoint=exp/default/checkpoints/snapshot_iter_201500.pdz \
    --transformer-tts-stat=dump/train/speech_stats.npy \
    --waveflow-config=waveflow_ljspeech_ckpt_0.3/config.yaml \
    --waveflow-checkpoint=waveflow_ljspeech_ckpt_0.3/step-2000000.pdparams \
    --text=../sentences.txt \
    --output-dir=exp/default/test_e2e \
    --device="gpu" \
    --phones-dict=dump/phone_id_map.txt
@ -0,0 +1,143 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
"""Normalize feature files and dump them."""

import argparse
import logging
from operator import itemgetter
from pathlib import Path

import jsonlines
import numpy as np
from parakeet.datasets.data_table import DataTable
from sklearn.preprocessing import StandardScaler
from tqdm import tqdm


def main():
    """Run preprocessing process."""
    parser = argparse.ArgumentParser(
        description="Normalize dumped raw features (see detail in parallel_wavegan/bin/normalize.py)."
    )
    parser.add_argument(
        "--metadata",
        type=str,
        required=True,
        help="metadata file (jsonl) of the raw features to be normalized.")

    parser.add_argument(
        "--dumpdir",
        type=str,
        required=True,
        help="directory to dump normalized feature files.")
    parser.add_argument(
        "--speech-stats",
        type=str,
        required=True,
        help="speech statistics file.")
    parser.add_argument(
        "--phones-dict", type=str, default=None, help="phone vocabulary file.")
    parser.add_argument(
        "--speaker-dict", type=str, default=None, help="speaker id map file.")
    parser.add_argument(
        "--verbose",
        type=int,
        default=1,
        help="logging level. higher is more logging. (default=1)")
    args = parser.parse_args()

    # set logger
    if args.verbose > 1:
        logging.basicConfig(
            level=logging.DEBUG,
            format="%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s"
        )
    elif args.verbose > 0:
        logging.basicConfig(
            level=logging.INFO,
            format="%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s"
        )
    else:
        logging.basicConfig(
            level=logging.WARN,
            format="%(asctime)s (%(module)s:%(lineno)d) %(levelname)s: %(message)s"
        )
        logging.warning('Skip DEBUG/INFO messages')

    # check directory existence
    dumpdir = Path(args.dumpdir).resolve()
    dumpdir.mkdir(parents=True, exist_ok=True)

    # get dataset
    with jsonlines.open(args.metadata, 'r') as reader:
        metadata = list(reader)
    dataset = DataTable(
        metadata, converters={
            "speech": np.load,
        })
    logging.info(f"The number of files = {len(dataset)}.")

    # restore scaler
    speech_scaler = StandardScaler()
    speech_scaler.mean_ = np.load(args.speech_stats)[0]
    speech_scaler.scale_ = np.load(args.speech_stats)[1]
    speech_scaler.n_features_in_ = speech_scaler.mean_.shape[0]

    vocab_phones = {}
    with open(args.phones_dict, 'rt') as f:
        phn_id = [line.strip().split() for line in f.readlines()]
    for phn, id in phn_id:
        vocab_phones[phn] = int(id)

    vocab_speaker = {}
    with open(args.speaker_dict, 'rt') as f:
        spk_id = [line.strip().split() for line in f.readlines()]
    for spk, id in spk_id:
        vocab_speaker[spk] = int(id)

    # process each file
    output_metadata = []

    for item in tqdm(dataset):
        utt_id = item['utt_id']
        speech = item['speech']
        # normalize
        speech = speech_scaler.transform(speech)
        speech_dir = dumpdir / "data_speech"
        speech_dir.mkdir(parents=True, exist_ok=True)
        speech_path = speech_dir / f"{utt_id}_speech.npy"
        np.save(speech_path, speech.astype(np.float32), allow_pickle=False)

        phone_ids = [vocab_phones[p] for p in item['phones']]
        spk_id = vocab_speaker[item["speaker"]]
        record = {
            "utt_id": item['utt_id'],
            "spk_id": spk_id,
            "text": phone_ids,
            "text_lengths": item['text_lengths'],
            "speech_lengths": item['speech_lengths'],
            "speech": str(speech_path),
        }
        output_metadata.append(record)
    output_metadata.sort(key=itemgetter('utt_id'))
    output_metadata_path = Path(args.dumpdir) / "metadata.jsonl"
    with jsonlines.open(output_metadata_path, 'w') as writer:
        for item in output_metadata:
            writer.write(item)
    logging.info(f"metadata dumped into {output_metadata_path}")


if __name__ == "__main__":
    main()
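
normalize.py above expects `--speech-stats` to be a `.npy` file whose row 0 is the per-dimension mean and row 1 the standard deviation of the training mels. A hypothetical sketch of producing such a file (the recipe uses `utils/compute_statistics.py` for this; the helper below is only illustrative):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler


def dump_speech_stats(mel_paths, out_path="speech_stats.npy"):
    """Fit a StandardScaler over all training mels and save (mean, std)."""
    scaler = StandardScaler()
    for path in mel_paths:  # each file holds a (T, n_mels) array
        scaler.partial_fit(np.load(path))
    # Row 0: mean, row 1: scale (std) -- the layout normalize.py reads back.
    np.save(out_path, np.stack([scaler.mean_, scaler.scale_]))
```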
@ -0,0 +1,281 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import argparse
from concurrent.futures import ThreadPoolExecutor
from operator import itemgetter
from pathlib import Path
from typing import Any
from typing import Dict
from typing import List

import jsonlines
import librosa
import numpy as np
import tqdm
import yaml
from parakeet.data.get_feats import LogMelFBank
from parakeet.frontend import English
from yacs.config import CfgNode as Configuration


def get_lj_sentences(file_name, frontend):
    '''
    read the LJSpeech transcription file (metadata.csv) and
    phoneticize each sentence
    Parameters
    ----------
    file_name : str or Path
    Returns
    ----------
    Dict
        sentence: {'utt': ([phones], speaker)}
    set
        speaker_set
    '''
    f = open(file_name, 'r')
    sentence = {}
    speaker_set = set()
    for line in f:
        line_list = line.strip().split('|')
        utt = line_list[0]
        speaker = utt.split("-")[0][:2]
        speaker_set.add(speaker)
        raw_text = line_list[-1]
        phonemes = frontend.phoneticize(raw_text)
        phonemes = phonemes[1:-1]
        phonemes = [phn for phn in phonemes if not phn.isspace()]
        sentence[utt] = (phonemes, speaker)
    f.close()
    return sentence, speaker_set


def get_input_token(sentence, output_path):
    '''
    get the phone set from training data and save it
    Parameters
    ----------
    sentence : Dict
        sentence: {'utt': ([phones], speaker)}
    output_path : str or Path
        path to save phone_id_map
    '''
    phn_token = set()
    for utt in sentence:
        for phn in sentence[utt][0]:
            if phn != "<eos>":
                phn_token.add(phn)
    phn_token = list(phn_token)
    phn_token.sort()
    phn_token = ["<pad>", "<unk>"] + phn_token
    phn_token += ["<eos>"]

    with open(output_path, 'w') as f:
        for i, phn in enumerate(phn_token):
            f.write(phn + ' ' + str(i) + '\n')


def get_spk_id_map(speaker_set, output_path):
    speakers = sorted(list(speaker_set))
    with open(output_path, 'w') as f:
        for i, spk in enumerate(speakers):
            f.write(spk + ' ' + str(i) + '\n')


def process_sentence(config: Dict[str, Any],
                     fp: Path,
                     sentences: Dict,
                     output_dir: Path,
                     mel_extractor=None):
    utt_id = fp.stem
    record = None
    if utt_id in sentences:
        # reading; resampling may occur
        wav, _ = librosa.load(str(fp), sr=config.fs)
        if len(wav.shape) != 1 or np.abs(wav).max() > 1.0:
            return record
        assert len(wav.shape) == 1, f"{utt_id} is not a mono-channel audio."
        assert np.abs(wav).max(
        ) <= 1.0, f"{utt_id} seems not to be normalized 16-bit PCM audio."
        phones = sentences[utt_id][0]
        speaker = sentences[utt_id][1]
        logmel = mel_extractor.get_log_mel_fbank(wav, base='e')
        num_frames = logmel.shape[0]
        mel_dir = output_dir / "data_speech"
        mel_dir.mkdir(parents=True, exist_ok=True)
        mel_path = mel_dir / (utt_id + "_speech.npy")
        np.save(mel_path, logmel)
        record = {
            "utt_id": utt_id,
            "phones": phones,
            "text_lengths": len(phones),
            "speech_lengths": num_frames,
            "speech": str(mel_path),
            "speaker": speaker
        }
    return record


def process_sentences(config,
                      fps: List[Path],
                      sentences: Dict,
                      output_dir: Path,
                      mel_extractor=None,
                      nprocs: int=1):
    if nprocs == 1:
        results = []
        for fp in tqdm.tqdm(fps, total=len(fps)):
            record = process_sentence(config, fp, sentences, output_dir,
                                      mel_extractor)
            if record:
                results.append(record)
    else:
        with ThreadPoolExecutor(nprocs) as pool:
            futures = []
            with tqdm.tqdm(total=len(fps)) as progress:
                for fp in fps:
                    future = pool.submit(process_sentence, config, fp,
                                         sentences, output_dir, mel_extractor)
                    future.add_done_callback(lambda p: progress.update())
                    futures.append(future)

                results = []
                for ft in futures:
                    record = ft.result()
                    if record:
                        results.append(record)

    results.sort(key=itemgetter("utt_id"))
    with jsonlines.open(output_dir / "metadata.jsonl", 'w') as writer:
        for item in results:
            writer.write(item)
    print("Done")


def main():
    # parse config and args
    parser = argparse.ArgumentParser(
        description="Preprocess audio and then extract features.")

    parser.add_argument(
        "--dataset",
        default="ljspeech",
        type=str,
        help="name of dataset, should be in {ljspeech} now")

    parser.add_argument(
        "--rootdir", default=None, type=str, help="directory to dataset.")

    parser.add_argument(
        "--dumpdir",
        type=str,
        required=True,
        help="directory to dump feature files.")

    parser.add_argument(
        "--config-path",
        default="conf/default.yaml",
        type=str,
        help="yaml format configuration file.")

    parser.add_argument(
        "--verbose",
        type=int,
        default=1,
        help="logging level. higher is more logging. (default=1)")
    parser.add_argument(
        "--num-cpu", type=int, default=1, help="number of process.")

    args = parser.parse_args()

    config_path = Path(args.config_path).resolve()
    root_dir = Path(args.rootdir).expanduser()
    dumpdir = Path(args.dumpdir).expanduser()
    # use absolute path
    dumpdir = dumpdir.resolve()
    dumpdir.mkdir(parents=True, exist_ok=True)

    assert root_dir.is_dir()

    with open(config_path, 'rt') as f:
        _C = yaml.safe_load(f)
        _C = Configuration(_C)
        config = _C.clone()

    if args.verbose > 1:
        print(vars(args))
        print(config)

    phone_id_map_path = dumpdir / "phone_id_map.txt"
    speaker_id_map_path = dumpdir / "speaker_id_map.txt"

    if args.dataset == "ljspeech":
        wav_files = sorted(list((root_dir / "wavs").rglob("*.wav")))
        frontend = English()
        sentences, speaker_set = get_lj_sentences(root_dir / "metadata.csv",
                                                  frontend)
        get_input_token(sentences, phone_id_map_path)
        get_spk_id_map(speaker_set, speaker_id_map_path)
        # split data into 3 sections (LJSpeech-1.1 has 13100 clips in total)
        num_train = 12900
        num_dev = 100
        train_wav_files = wav_files[:num_train]
        dev_wav_files = wav_files[num_train:num_train + num_dev]
        test_wav_files = wav_files[num_train + num_dev:]

    train_dump_dir = dumpdir / "train" / "raw"
    train_dump_dir.mkdir(parents=True, exist_ok=True)
    dev_dump_dir = dumpdir / "dev" / "raw"
    dev_dump_dir.mkdir(parents=True, exist_ok=True)
    test_dump_dir = dumpdir / "test" / "raw"
    test_dump_dir.mkdir(parents=True, exist_ok=True)

    # Extractor
    mel_extractor = LogMelFBank(
        sr=config.fs,
        n_fft=config.n_fft,
        hop_length=config.n_shift,
        win_length=config.win_length,
        window=config.window,
        n_mels=config.n_mels,
        fmin=config.fmin,
        fmax=config.fmax)

    # process for the 3 sections
    if train_wav_files:
        process_sentences(
            config,
            train_wav_files,
            sentences,
            train_dump_dir,
            mel_extractor,
            nprocs=args.num_cpu)
    if dev_wav_files:
        process_sentences(
            config,
            dev_wav_files,
            sentences,
            dev_dump_dir,
            mel_extractor,
            nprocs=args.num_cpu)
    if test_wav_files:
        process_sentences(
            config,
            test_wav_files,
            sentences,
            test_dump_dir,
            mel_extractor,
            nprocs=args.num_cpu)


if __name__ == "__main__":
    main()
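
The vocabulary files written by `get_input_token` and `get_spk_id_map` above are plain text, one `<token> <id>` pair per line, with `<pad>` and `<unk>` pinned to ids 0 and 1 and `<eos>` last. Reading one back (a sketch; the path and example ids are illustrative):

```python
# dump/phone_id_map.txt has one "<phone> <id>" pair per line, e.g.
#   <pad> 0
#   <unk> 1
#   AA0 2
#   ...           (sorted phones)
#   <eos> <last id>
with open("dump/phone_id_map.txt") as f:
    phone_id_map = {phn: int(idx) for phn, idx in (line.split() for line in f)}
```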
@ -0,0 +1,142 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import argparse
import logging
from pathlib import Path

import jsonlines
import numpy as np
import paddle
import soundfile as sf
import yaml
from yacs.config import CfgNode
from parakeet.datasets.data_table import DataTable
from parakeet.models.transformer_tts import TransformerTTS
from parakeet.models.transformer_tts import TransformerTTSInference
from parakeet.models.waveflow import ConditionalWaveFlow
from parakeet.modules.normalizer import ZScore
from parakeet.utils import layer_tools


def evaluate(args, acoustic_model_config, vocoder_config):
    # the DataLoader logger is too verbose, disable it
    logging.getLogger("DataLoader").disabled = True

    # construct dataset for evaluation
    with jsonlines.open(args.test_metadata, 'r') as reader:
        test_metadata = list(reader)
    test_dataset = DataTable(data=test_metadata, fields=["utt_id", "text"])

    with open(args.phones_dict, "r") as f:
        phn_id = [line.strip().split() for line in f.readlines()]
    vocab_size = len(phn_id)
    print("vocab_size:", vocab_size)
    odim = acoustic_model_config.n_mels
    model = TransformerTTS(
        idim=vocab_size, odim=odim, **acoustic_model_config["model"])

    model.set_state_dict(
        paddle.load(args.transformer_tts_checkpoint)["main_params"])
    model.eval()
    # remove ".pdparams" in waveflow_checkpoint
    vocoder_checkpoint_path = args.waveflow_checkpoint[:-9] if args.waveflow_checkpoint.endswith(
        ".pdparams") else args.waveflow_checkpoint
    vocoder = ConditionalWaveFlow.from_pretrained(vocoder_config,
                                                  vocoder_checkpoint_path)
    layer_tools.recursively_remove_weight_norm(vocoder)
    vocoder.eval()
    print("model done!")

    stat = np.load(args.transformer_tts_stat)
    mu, std = stat
    mu = paddle.to_tensor(mu)
    std = paddle.to_tensor(std)
    transformer_tts_normalizer = ZScore(mu, std)

    transformer_tts_inference = TransformerTTSInference(
        transformer_tts_normalizer, model)

    output_dir = Path(args.output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    for datum in test_dataset:
        utt_id = datum["utt_id"]
        text = paddle.to_tensor(datum["text"])

        with paddle.no_grad():
            mel = transformer_tts_inference(text)
            # mel's shape is (T, n_mels) and waveflow's input shape is (batch, n_mels, T)
            mel = mel.unsqueeze(0).transpose([0, 2, 1])
            # waveflow's output shape is (B, T)
            wav = vocoder.infer(mel)[0]

        sf.write(
            str(output_dir / (utt_id + ".wav")),
            wav.numpy(),
            samplerate=acoustic_model_config.fs)
        print(f"{utt_id} done!")


def main():
    # parse args and config and redirect to evaluate
    parser = argparse.ArgumentParser(
        description="Synthesize with transformer tts & waveflow.")
    parser.add_argument(
        "--transformer-tts-config",
        type=str,
        help="transformer tts config file.")
    parser.add_argument(
        "--transformer-tts-checkpoint",
        type=str,
        help="transformer tts checkpoint to load.")
    parser.add_argument(
        "--transformer-tts-stat",
        type=str,
        help="mean and standard deviation used to normalize spectrogram when training transformer tts."
    )
    parser.add_argument(
        "--waveflow-config", type=str, help="waveflow config file.")
    # mels are not normalized when training waveflow
    parser.add_argument(
        "--waveflow-checkpoint", type=str, help="waveflow checkpoint to load.")
    parser.add_argument(
        "--phones-dict", type=str, default=None, help="phone vocabulary file.")

    parser.add_argument("--test-metadata", type=str, help="test metadata.")
    parser.add_argument("--output-dir", type=str, help="output dir.")
    parser.add_argument(
        "--device", type=str, default="gpu", help="device type to use.")
    parser.add_argument("--verbose", type=int, default=1, help="verbose.")

    args = parser.parse_args()

    paddle.set_device(args.device)

    with open(args.transformer_tts_config) as f:
        transformer_tts_config = CfgNode(yaml.safe_load(f))
    with open(args.waveflow_config) as f:
        waveflow_config = CfgNode(yaml.safe_load(f))

    print("========Args========")
    print(yaml.safe_dump(vars(args)))
    print("========Config========")
    print(transformer_tts_config)
    print(waveflow_config)

    evaluate(args, transformer_tts_config, waveflow_config)


if __name__ == "__main__":
    main()
@ -0,0 +1,199 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import argparse
import os
import logging
import shutil
from pathlib import Path

import jsonlines
import numpy as np
import paddle
import yaml
from paddle import DataParallel
from paddle import distributed as dist
from paddle.io import DataLoader
from paddle.io import DistributedBatchSampler
from parakeet.datasets.data_table import DataTable
from parakeet.datasets.am_batch_fn import transformer_single_spk_batch_fn
from parakeet.models.transformer_tts import TransformerTTS
from parakeet.models.transformer_tts import TransformerTTSUpdater
from parakeet.models.transformer_tts import TransformerTTSEvaluator
from parakeet.training.extensions.snapshot import Snapshot
from parakeet.training.extensions.visualizer import VisualDL
from parakeet.training.optimizer import build_optimizers
from parakeet.training.seeding import seed_everything
from parakeet.training.trainer import Trainer
from visualdl import LogWriter
from yacs.config import CfgNode


def train_sp(args, config):
    # decides device type and whether to run in parallel
    # setup running environment correctly
    if not paddle.is_compiled_with_cuda():
        paddle.set_device("cpu")
        world_size = 1
    else:
        paddle.set_device("gpu")
        world_size = paddle.distributed.get_world_size()
        if world_size > 1:
            paddle.distributed.init_parallel_env()

    # set the random seed, it is a must for multiprocess training
    seed_everything(config.seed)

    print(
        f"rank: {dist.get_rank()}, pid: {os.getpid()}, parent_pid: {os.getppid()}",
    )

    # the DataLoader logger is too verbose, disable it
    logging.getLogger("DataLoader").disabled = True

    # construct dataset for training and validation
    with jsonlines.open(args.train_metadata, 'r') as reader:
        train_metadata = list(reader)
    train_dataset = DataTable(
        data=train_metadata,
        fields=[
            "text",
            "text_lengths",
            "speech",
            "speech_lengths",
        ],
        converters={
            "speech": np.load,
        }, )
    with jsonlines.open(args.dev_metadata, 'r') as reader:
        dev_metadata = list(reader)
    dev_dataset = DataTable(
        data=dev_metadata,
        fields=[
            "text",
            "text_lengths",
            "speech",
            "speech_lengths",
        ],
        converters={
            "speech": np.load,
        }, )

    # collate function and dataloader
    train_sampler = DistributedBatchSampler(
        train_dataset,
        batch_size=config.batch_size,
        shuffle=True,
        drop_last=True)

    print("samplers done!")

    train_dataloader = DataLoader(
        train_dataset,
        batch_sampler=train_sampler,
        collate_fn=transformer_single_spk_batch_fn,
        num_workers=config.num_workers)

    dev_dataloader = DataLoader(
        dev_dataset,
        shuffle=False,
        drop_last=False,
        batch_size=config.batch_size,
        collate_fn=transformer_single_spk_batch_fn,
        num_workers=config.num_workers)
    print("dataloaders done!")

    with open(args.phones_dict, "r") as f:
        phn_id = [line.strip().split() for line in f.readlines()]
    vocab_size = len(phn_id)
    print("vocab_size:", vocab_size)

    odim = config.n_mels
    model = TransformerTTS(idim=vocab_size, odim=odim, **config["model"])
    if world_size > 1:
        model = DataParallel(model)
    print("model done!")

    optimizer = build_optimizers(model, **config["optimizer"])
    print("optimizer done!")

    output_dir = Path(args.output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    if dist.get_rank() == 0:
        config_name = args.config.split("/")[-1]
        # copy conf to output_dir
        shutil.copyfile(args.config, output_dir / config_name)

    updater = TransformerTTSUpdater(
        model=model,
        optimizer=optimizer,
        dataloader=train_dataloader,
        output_dir=output_dir,
        **config["updater"])

    trainer = Trainer(updater, (config.max_epoch, 'epoch'), output_dir)

    evaluator = TransformerTTSEvaluator(
        model, dev_dataloader, output_dir=output_dir, **config["updater"])

    if dist.get_rank() == 0:
        trainer.extend(evaluator, trigger=(1, "epoch"))
        writer = LogWriter(str(output_dir))
        trainer.extend(VisualDL(writer), trigger=(1, "iteration"))
        trainer.extend(
            Snapshot(max_size=config.num_snapshots), trigger=(1, 'epoch'))
    trainer.run()


def main():
    # parse args and config and redirect to train_sp
    parser = argparse.ArgumentParser(description="Train a TransformerTTS "
                                     "model with LJSpeech TTS dataset.")
    parser.add_argument(
        "--config", type=str, help="config file to overwrite default config.")
    parser.add_argument("--train-metadata", type=str, help="training data.")
    parser.add_argument("--dev-metadata", type=str, help="dev data.")
    parser.add_argument("--output-dir", type=str, help="output dir.")
    parser.add_argument(
        "--device", type=str, default="gpu", help="device type to use.")
    parser.add_argument(
        "--nprocs", type=int, default=1, help="number of processes.")
    parser.add_argument("--verbose", type=int, default=1, help="verbose.")
    parser.add_argument(
        "--phones-dict", type=str, default=None, help="phone vocabulary file.")

    args = parser.parse_args()
    if args.device == "cpu" and args.nprocs > 1:
        raise RuntimeError("Multiprocess training on CPU is not supported.")

    with open(args.config) as f:
        config = CfgNode(yaml.safe_load(f))

    print("========Args========")
    print(yaml.safe_dump(vars(args)))
    print("========Config========")
    print(config)
    print(
        f"master sees the world size: {dist.get_world_size()}, from pid: {os.getpid()}"
    )

    # dispatch
    if args.nprocs > 1:
        dist.spawn(train_sp, (args, config), nprocs=args.nprocs)
    else:
        train_sp(args, config)


if __name__ == "__main__":
    main()
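
For reference, each line of the normalized `metadata.jsonl` consumed by train.py and synthesize.py above is a JSON object with the fields written by normalize.py; the single-speaker TransformerTTS recipe only reads the text and speech fields. A sketch of one record (values illustrative):

```python
record = {
    "utt_id": "LJ001-0001",
    "spk_id": 0,
    "text": [31, 14, 42],  # phone ids
    "text_lengths": 3,
    "speech_lengths": 425,  # number of mel frames
    "speech": "dump/train/norm/data_speech/LJ001-0001_speech.npy",
}
```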
@ -0,0 +1,52 @@
# WaveFlow with LJSpeech

## Dataset

### Download the dataset.

```bash
wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
```

### Extract the dataset.

```bash
tar xjvf LJSpeech-1.1.tar.bz2
```

### Preprocess the dataset.

Assume the path to save the preprocessed dataset is `ljspeech_waveflow`. Run the command below to preprocess the dataset.

```bash
python preprocess.py --input=LJSpeech-1.1/ --output=ljspeech_waveflow
```

## Train the model

The training script requires 4 command line arguments.
`--data` is the path of the training dataset, `--output` is the path of the output directory (we recommend using a subdirectory in `runs` to manage different experiments).

`--device` should be "cpu" or "gpu", `--nprocs` is the number of processes to train the model in parallel.

```bash
python train.py --data=ljspeech_waveflow/ --output=runs/test --device="gpu" --nprocs=1
```

If you want distributed training, set a larger `--nprocs` (e.g. 4). Note that distributed training on CPU is not supported yet.

## Synthesize

Synthesize waveforms. We assume `--input` is a directory containing several mel spectrograms (log magnitude) in `.npy` format. The output is saved in the `--output` directory, containing several `.wav` files, each with the same name as its source mel spectrogram.

`--checkpoint_path` should be the path of the parameter file (`.pdparams`) to load. Note that the extension name `.pdparams` is not included here.

`--device` specifies the device to run synthesis on.

```bash
python synthesize.py --input=mels/ --output=wavs/ --checkpoint_path='step-2000000' --device="gpu" --verbose
```

## Pretrained Model

A pretrained model with 128 residual channels can be downloaded here: [waveflow_ljspeech_ckpt_0.3.zip](https://paddlespeech.bj.bcebos.com/Parakeet/waveflow_ljspeech_ckpt_0.3.zip).
@ -0,0 +1,56 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from yacs.config import CfgNode as CN

_C = CN()
_C.data = CN(
    dict(
        batch_size=8,  # batch size
        valid_size=16,  # the first N examples are reserved for validation
        sample_rate=22050,  # Hz, sample rate
        n_fft=1024,  # fft frame size
        win_length=1024,  # window size
        hop_length=256,  # hop size between adjacent frames
        fmin=0,  # Hz, min frequency when converting to mel
        fmax=8000,  # Hz, max frequency when converting to mel
        n_mels=80,  # mel bands
        clip_frames=65,  # mel clip frames
    ))

_C.model = CN(
    dict(
        upsample_factors=[16, 16],
        n_flows=8,  # number of flows in WaveFlow
        n_layers=8,  # number of conv blocks in each flow
        n_group=16,  # folding factor of audio and spectrogram
        channels=128,  # residual channels in each flow
        kernel_size=[3, 3],  # kernel size in each conv block
        sigma=1.0,  # stddev of the random noise
    ))

_C.training = CN(
    dict(
        lr=2e-4,  # learning rate
        valid_interval=1000,  # validation interval (in iterations)
        save_interval=10000,  # checkpoint saving interval (in iterations)
        max_iteration=3000000,  # max iteration to train
    ))


def get_cfg_defaults():
    """Get a yacs CfgNode object with default values for my_project."""
    # Return a clone so that the defaults will not be altered
    # This is for the "local variable" use pattern
    return _C.clone()
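
`get_cfg_defaults` follows the usual yacs pattern. A typical way to override the defaults, mirroring what the scripts below do with `--config`/`--opts` (the yaml filename is hypothetical):

```python
from config import get_cfg_defaults

config = get_cfg_defaults()
config.merge_from_file("my_experiment.yaml")     # optional yaml override
config.merge_from_list(["data.batch_size", 16])  # KEY VALUE pairs, as --opts passes them
config.freeze()
print(config.data.batch_size)  # 16
```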
@ -0,0 +1,89 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

from pathlib import Path

import numpy as np
import pandas
from paddle.io import Dataset

from parakeet.data.batch import batch_spec, batch_wav


class LJSpeech(Dataset):
    """A simple dataset adaptor for the processed ljspeech dataset."""

    def __init__(self, root):
        self.root = Path(root).expanduser()
        meta_data = pandas.read_csv(
            str(self.root / "metadata.csv"),
            sep="\t",
            header=None,
            names=["fname", "frames", "samples"])

        records = []
        for row in meta_data.itertuples():
            mel_path = str(self.root / "mel" / (row.fname + ".npy"))
            wav_path = str(self.root / "wav" / (row.fname + ".npy"))
            records.append((mel_path, wav_path))
        self.records = records

    def __getitem__(self, i):
        mel_name, wav_name = self.records[i]
        mel = np.load(mel_name)
        wav = np.load(wav_name)
        return mel, wav

    def __len__(self):
        return len(self.records)


class LJSpeechCollector(object):
    """A simple callable to batch LJSpeech examples."""

    def __init__(self, padding_value=0.):
        self.padding_value = padding_value

    def __call__(self, examples):
        mels = [example[0] for example in examples]
        wavs = [example[1] for example in examples]
        mels, _ = batch_spec(mels, pad_value=self.padding_value)
        wavs, _ = batch_wav(wavs, pad_value=self.padding_value)
        return mels, wavs


class LJSpeechClipCollector(object):
    def __init__(self, clip_frames=65, hop_length=256):
        self.clip_frames = clip_frames
        self.hop_length = hop_length

    def __call__(self, examples):
        mels = []
        wavs = []
        for example in examples:
            mel_clip, wav_clip = self.clip(example)
            mels.append(mel_clip)
            wavs.append(wav_clip)
        mels = np.stack(mels)
        wavs = np.stack(wavs)
        return mels, wavs

    def clip(self, example):
        # randomly pick a fixed-length mel clip (and the matching audio span)
        # so that every example in a batch has the same shape
        mel, wav = example
        frames = mel.shape[-1]
        start = np.random.randint(0, frames - self.clip_frames)
        mel_clip = mel[:, start:start + self.clip_frames]
        wav_clip = wav[start * self.hop_length:(start + self.clip_frames) *
                       self.hop_length]
        return mel_clip, wav_clip
@ -0,0 +1,162 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os
import argparse
from pathlib import Path

import tqdm
import numpy as np
import librosa
import pandas as pd

from parakeet.datasets import LJSpeechMetaData
from parakeet.audio import LogMagnitude

from config import get_cfg_defaults


class Transform(object):
    def __init__(self, sample_rate, n_fft, win_length, hop_length, n_mels,
                 fmin, fmax):
        self.sample_rate = sample_rate
        self.n_fft = n_fft
        self.win_length = win_length
        self.hop_length = hop_length
        self.n_mels = n_mels
        self.fmin = fmin
        self.fmax = fmax

        self.spec_normalizer = LogMagnitude(min=1e-5)

    def __call__(self, example):
        wav_path, _, _ = example

        sr = self.sample_rate
        n_fft = self.n_fft
        win_length = self.win_length
        hop_length = self.hop_length
        n_mels = self.n_mels
        fmin = self.fmin
        fmax = self.fmax

        wav, loaded_sr = librosa.load(wav_path, sr=None)
        assert loaded_sr == sr, "sample rate does not match that in the config, no resampling is applied"

        # Pad audio to the right size.
        frames = int(np.ceil(float(wav.size) / hop_length))
        fft_padding = (n_fft - hop_length) // 2  # padding on each side for the STFT
        desired_length = frames * hop_length + fft_padding * 2
        pad_amount = (desired_length - wav.size) // 2

        if wav.size % 2 == 0:
            wav = np.pad(wav, (pad_amount, pad_amount), mode='reflect')
        else:
            wav = np.pad(wav, (pad_amount, pad_amount + 1), mode='reflect')

        # Normalize audio.
        wav = wav / np.abs(wav).max() * 0.999

        # Compute the spectrogram.
        # Turn center to False to prevent internal padding.
        spectrogram = librosa.core.stft(
            wav,
            hop_length=hop_length,
            win_length=win_length,
            n_fft=n_fft,
            center=False)
        spectrogram_magnitude = np.abs(spectrogram)

        # Compute mel-spectrograms.
        mel_filter_bank = librosa.filters.mel(
            sr=sr, n_fft=n_fft, n_mels=n_mels, fmin=fmin, fmax=fmax)
        mel_spectrogram = np.dot(mel_filter_bank, spectrogram_magnitude)

        # log scale mel_spectrogram.
        mel_spectrogram = self.spec_normalizer.transform(mel_spectrogram)

        # Extract the center of audio that corresponds to mel spectrograms.
        audio = wav[fft_padding:-fft_padding]
        assert mel_spectrogram.shape[1] * hop_length == audio.size

        # there is no clipping here
        return audio, mel_spectrogram


def create_dataset(config, input_dir, output_dir):
    input_dir = Path(input_dir).expanduser()
    dataset = LJSpeechMetaData(input_dir)

    output_dir = Path(output_dir).expanduser()
    output_dir.mkdir(exist_ok=True)

    transform = Transform(config.sample_rate, config.n_fft, config.win_length,
                          config.hop_length, config.n_mels, config.fmin,
                          config.fmax)
    file_names = []

    for example in tqdm.tqdm(dataset):
        fname, _, _ = example
        base_name = os.path.splitext(os.path.basename(fname))[0]
        wav_dir = output_dir / "wav"
        mel_dir = output_dir / "mel"
        wav_dir.mkdir(exist_ok=True)
        mel_dir.mkdir(exist_ok=True)

        audio, mel = transform(example)
        np.save(str(wav_dir / base_name), audio)
        np.save(str(mel_dir / base_name), mel)

        file_names.append((base_name, mel.shape[-1], audio.shape[-1]))

    meta_data = pd.DataFrame.from_records(file_names)
    meta_data.to_csv(
        str(output_dir / "metadata.csv"), sep="\t", index=None, header=None)
    print("saved metadata into {}".format(
        os.path.join(output_dir, "metadata.csv")))

    print("Done!")


if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="create dataset")
    parser.add_argument(
        "--config",
        type=str,
        metavar="FILE",
        help="extra config to overwrite the default config")
    parser.add_argument(
        "--input", type=str, help="path of the ljspeech dataset")
    parser.add_argument(
        "--output", type=str, help="path to save output dataset")
    parser.add_argument(
        "--opts",
        nargs=argparse.REMAINDER,
        help="options to overwrite --config file and the default config, passing in KEY VALUE pairs"
    )
    parser.add_argument(
        "-v", "--verbose", action="store_true", help="print msg")

    config = get_cfg_defaults()
    args = parser.parse_args()
    if args.config:
        config.merge_from_file(args.config)
    if args.opts:
        config.merge_from_list(args.opts)
    config.freeze()
    if args.verbose:
        print(config.data)
        print(args)

    create_dataset(config.data, args.input, args.output)
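
The padding arithmetic in `Transform.__call__` above is worth a worked example: with `n_fft=1024` and `hop_length=256`, `fft_padding` is 384, and the reflect-padded signal yields exactly one STFT frame (with `center=False`) per hop of the original audio. A quick check of the numbers:

```python
import numpy as np

n_fft, hop_length = 1024, 256
size = 10000  # samples in the raw clip (illustrative)
frames = int(np.ceil(float(size) / hop_length))  # 40
fft_padding = (n_fft - hop_length) // 2  # 384
desired_length = frames * hop_length + fft_padding * 2  # 11008
pad_amount = (desired_length - size) // 2  # 504

padded = size + 2 * pad_amount  # 11008
stft_frames = 1 + (padded - n_fft) // hop_length  # 40 frames with center=False
assert stft_frames == frames
# and the trimmed audio matches the mel length exactly:
assert padded - 2 * fft_padding == frames * hop_length  # 10240 == 40 * 256
```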
@ -0,0 +1,83 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import os
import argparse
from pathlib import Path

import numpy as np
import soundfile as sf
import paddle

from parakeet.models.waveflow import ConditionalWaveFlow
from parakeet.utils import layer_tools

from config import get_cfg_defaults


def main(config, args):
    paddle.set_device(args.device)
    model = ConditionalWaveFlow.from_pretrained(config, args.checkpoint_path)
    layer_tools.recursively_remove_weight_norm(model)
    model.eval()

    mel_dir = Path(args.input).expanduser()
    output_dir = Path(args.output).expanduser()
    output_dir.mkdir(parents=True, exist_ok=True)
    for file_path in mel_dir.glob("*.npy"):
        mel = np.load(str(file_path))
        with paddle.amp.auto_cast():
            audio = model.predict(mel)
        audio_path = output_dir / (os.path.splitext(file_path.name)[0] + ".wav")
        sf.write(audio_path, audio, config.data.sample_rate)
        print("[synthesize] {} -> {}".format(file_path, audio_path))


if __name__ == "__main__":
    config = get_cfg_defaults()

    parser = argparse.ArgumentParser(
        description="synthesize waveforms from mel spectrograms with WaveFlow.")
    parser.add_argument(
        "--config",
        type=str,
        metavar="FILE",
        help="extra config to overwrite the default config")
    parser.add_argument(
        "--checkpoint_path", type=str, help="path of the checkpoint to load.")
    parser.add_argument(
        "--input",
        type=str,
        help="path of directory containing mel spectrograms (in .npy format)")
    parser.add_argument("--output", type=str, help="path to save outputs")
    parser.add_argument(
        "--device", type=str, default="cpu", help="device type to use.")
    parser.add_argument(
        "--opts",
        nargs=argparse.REMAINDER,
        help="options to overwrite --config file and the default config, passing in KEY VALUE pairs"
    )
    parser.add_argument(
        "-v", "--verbose", action="store_true", help="print msg")

    args = parser.parse_args()
    if args.config:
        config.merge_from_file(args.config)
    if args.opts:
        config.merge_from_list(args.opts)
    config.freeze()
    print(config)
    print(args)

    main(config, args)
@ -0,0 +1,158 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

import time

import numpy as np
import paddle
from paddle import distributed as dist
from paddle.io import DataLoader, DistributedBatchSampler

from parakeet.data import dataset
from parakeet.models.waveflow import ConditionalWaveFlow, WaveFlowLoss
from parakeet.utils import mp_tools
from parakeet.training.cli import default_argument_parser
from parakeet.training.experiment import ExperimentBase

from config import get_cfg_defaults
from ljspeech import LJSpeech, LJSpeechClipCollector, LJSpeechCollector


class Experiment(ExperimentBase):
    def setup_model(self):
        config = self.config
        model = ConditionalWaveFlow(
            upsample_factors=config.model.upsample_factors,
            n_flows=config.model.n_flows,
            n_layers=config.model.n_layers,
            n_group=config.model.n_group,
            channels=config.model.channels,
            n_mels=config.data.n_mels,
            kernel_size=config.model.kernel_size)

        if self.parallel:
            model = paddle.DataParallel(model)
        optimizer = paddle.optimizer.Adam(
            config.training.lr, parameters=model.parameters())
        criterion = WaveFlowLoss(sigma=config.model.sigma)

        self.model = model
        self.optimizer = optimizer
        self.criterion = criterion

    def setup_dataloader(self):
        config = self.config
        args = self.args

        ljspeech_dataset = LJSpeech(args.data)
        valid_set, train_set = dataset.split(ljspeech_dataset,
                                             config.data.valid_size)

        batch_fn = LJSpeechClipCollector(config.data.clip_frames,
                                         config.data.hop_length)

        if not self.parallel:
            train_loader = DataLoader(
                train_set,
                batch_size=config.data.batch_size,
                shuffle=True,
                drop_last=True,
                collate_fn=batch_fn)
        else:
            sampler = DistributedBatchSampler(
                train_set,
                batch_size=config.data.batch_size,
                num_replicas=dist.get_world_size(),
                rank=dist.get_rank(),
                shuffle=True,
                drop_last=True)
            train_loader = DataLoader(
                train_set, batch_sampler=sampler, collate_fn=batch_fn)

        valid_batch_fn = LJSpeechCollector()
        valid_loader = DataLoader(
            valid_set, batch_size=1, collate_fn=valid_batch_fn)

        self.train_loader = train_loader
        self.valid_loader = valid_loader

    def compute_outputs(self, mel, wav):
        z, log_det_jacobian = self.model(wav, mel)
        return z, log_det_jacobian

    def train_batch(self):
        start = time.time()
        batch = self.read_batch()
        data_loader_time = time.time() - start

        self.model.train()
        self.optimizer.clear_grad()
        mel, wav = batch
        z, log_det_jacobian = self.compute_outputs(mel, wav)
        loss = self.criterion(z, log_det_jacobian)
        loss.backward()
        self.optimizer.step()
        iteration_time = time.time() - start

        loss_value = float(loss)
        msg = "Rank: {}, ".format(dist.get_rank())
        msg += "step: {}, ".format(self.iteration)
        msg += "time: {:>.3f}s/{:>.3f}s, ".format(data_loader_time,
                                                  iteration_time)
        msg += "loss: {:>.6f}".format(loss_value)
        self.logger.info(msg)
        if dist.get_rank() == 0:
            self.visualizer.add_scalar("train/loss", loss_value,
                                       self.iteration)

    @mp_tools.rank_zero_only
    @paddle.no_grad()
    def valid(self):
        valid_iterator = iter(self.valid_loader)
        valid_losses = []
        mel, wav = next(valid_iterator)
        z, log_det_jacobian = self.compute_outputs(mel, wav)
        loss = self.criterion(z, log_det_jacobian)
        valid_losses.append(float(loss))
        valid_loss = np.mean(valid_losses)
        self.visualizer.add_scalar("valid/loss", valid_loss, self.iteration)


def main_sp(config, args):
    exp = Experiment(config, args)
    exp.setup()
    exp.resume_or_load()
    exp.run()


def main(config, args):
    if args.nprocs > 1 and args.device == "gpu":
        dist.spawn(main_sp, args=(config, args), nprocs=args.nprocs)
    else:
        main_sp(config, args)


if __name__ == "__main__":
    config = get_cfg_defaults()
    parser = default_argument_parser()
    args = parser.parse_args()
    if args.config:
        config.merge_from_file(args.config)
    if args.opts:
        config.merge_from_list(args.opts)
    config.freeze()
    print(config)
    print(args)

    main(config, args)