update readme, test=doc_fix (#1156)

pull/1157/head
TianYuan 3 years ago committed by GitHub
parent e9748faa71
commit 69138a2c85

@ -1,7 +1,7 @@
# Audio Tagging
## Introduction
Audio tagging is the task of labelling an audio clip with one or more labels or tags, includeing music tagging, acoustic scene classification, audio event classification, etc.
Audio tagging is the task of labeling an audio clip with one or more labels or tags, including music tagging, acoustic scene classification, audio event classification, etc.
This demo is an implementation to tag an audio file with 527 [AudioSet](https://research.google.com/audioset/) labels. It can be done by a single command or a few lines in python using `PaddleSpeech`.
@ -12,7 +12,7 @@ pip install paddlespeech
```
### 2. Prepare Input File
Input of this demo should be a WAV file(`.wav`).
The input of this demo should be a WAV file (`.wav`).
Here are sample files for this demo that can be downloaded:
```bash
@ -29,13 +29,13 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/cat.wav https://paddlespe
paddlespeech cls --help
```
Arguments:
- `input`(required): Audio file to tag.
- `input` (required): The audio file to tag.
- `model`: Model type of tagging task. Default: `panns_cnn14`.
- `config`: Config of tagging task. Use pretrained model when it is None. Default: `None`.
- `ckpt_path`: Model checkpoint. Use pretrained model when it is None. Default: `None`.
- `label_file`: Label file of tagging task. Use audioset labels when it is None. Default: `None`.
- `topk`: Show topk tagging labels of result. Default: `1`.
- `device`: Choose device to execute model inference. Default: default device of paddlepaddle in current environment.
- `config`: Config of tagging task. Use a pretrained model when it is None. Default: `None`.
- `ckpt_path`: Model checkpoint. Use a pretrained model when it is None. Default: `None`.
- `label_file`: Label file of tagging task. Use AudioSet labels when it is None. Default: `None`.
- `topk`: Show topk tagging labels of the result. Default: `1`.
- `device`: Choose the device to execute model inference. Default: default device of paddlepaddle in the current environment.
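For example, the documented flags can be combined like this (a hedged sketch using the `cat.wav` sample downloaded above):
```bash
# Tag cat.wav and show the top 10 labels
paddlespeech cls --input ./cat.wav --topk 10
```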
Output:
```bash
@ -83,10 +83,9 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/cat.wav https://paddlespe
Bird: 0.006304860580712557
```
### 4.Pretrained Models
Here is a list of pretrained models released by PaddleSpeech that can be used by command and python api:
Here is a list of pretrained models released by PaddleSpeech that can be used by command and python API:
| Model | Sample Rate
| :--- | :---:

@ -1,9 +1,9 @@
# Automatic Video Subtitiles
## Introduction
Automatic video subtitiles can generate subtitiles from a specific video by using Automatic Speech Recognition (ASR) system.
Automatic video subtitles can generate subtitles from a specific video by using the Automatic Speech Recognition (ASR) system.
This demo is an implementation to automatic video subtitiles from a video file. It can be done by a single command or a few lines in python using `PaddleSpeech`.
This demo is an implementation of automatic video subtitling from a video file. It can be done by a single command or a few lines in python using `PaddleSpeech`.
## Usage
### 1. Installation
@ -12,7 +12,7 @@ pip install paddlespeech
```
### 2. Prepare Input
Get a video file with speech of the specific language:
Get a video file with the speech of the specific language:
```bash
wget -c https://paddlespeech.bj.bcebos.com/demos/asr_demos/subtitle_demo1.mp4
```
@ -22,7 +22,6 @@ Extract `.wav` with one channel and 16000 sample rate from the video:
ffmpeg -i subtitle_demo1.mp4 -ac 1 -ar 16000 -vn input.wav
```
### 3. Usage
- Python API

@ -1,13 +1,11 @@
# Metaverse
## Introduction
Metaverse is a new Internet application and social form integrating virtual reality produced by integrating a variety of new technologies.
This demo is an implementation to let a celebrity in an image "speak". With the composition of `TTS` mudule of `PaddleSpeech` and `PaddleGAN`, we integrate the installation and the specific modules in a single shell script.
This demo is an implementation to let a celebrity in an image "speak". With the composition of the `TTS` module of `PaddleSpeech` and `PaddleGAN`, we integrate the installation and the specific modules in a single shell script.
## Usage
You can make your favorite person say the specified content with the `TTS` mudule of `PaddleSpeech` and `PaddleGAN`, and construct your own virtual human.
You can make your favorite person say the specified content with the `TTS` module of `PaddleSpeech` and `PaddleGAN`, and construct your virtual human.
Run `run.sh` to complete all the essential procedures, including the installation.
@ -16,8 +14,8 @@ Run `run.sh` to complete all the essential procedures, including the installatio
```
In `run.sh`, it will execute `source path.sh` first, which will set the environment variables.
If you would like to try your own sentence, please replace the sentence in `sentences.txt`.
If you would like to try your own sentence, please replace the sentence in `sentences.txt`.
If you would like to try your own image, please replace the image `download/Lamarr.png` in the shell script.
If you would like to try your own image, please replace the image `download/Lamarr.png` in the shell script.
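A minimal sketch of that customization (the portrait file name here is a placeholder; `sentences.txt` and `download/Lamarr.png` are the files named above):
```bash
# Put your own text into sentences.txt, swap in your own portrait, then rerun the demo.
vim sentences.txt
cp my_portrait.png download/Lamarr.png
./run.sh
```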
The result has shown on our [notebook](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/tutorial/tts/tts_tutorial.ipynb).
The result is shown in our [notebook](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/tutorial/tts/tts_tutorial.ipynb).

@ -1,19 +1,16 @@
# Punctuation Restoration
## Introduction
Punctuation restoration is a common post-processing problem for Automatic Speech Recognition (ASR) systems. It is important to improve the readability of the transcribed text for the human reader and facilitate NLP tasks.
This demo is an implementation to restore punctuation from a raw text. It can be done by a single command or a few lines in python using `PaddleSpeech`.
This demo is an implementation to restore punctuation from raw text. It can be done by a single command or a few lines in python using `PaddleSpeech`.
## Usage
### 1. Installation
```bash
pip install paddlespeech
```
### 2. Prepare Input
Input of this demo should be a text of the specific language that can be passed via argument.
The input of this demo should be a text of the specific language that can be passed via argument.
### 3. Usage
- Command Line (Recommended)
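A hedged example of the command line usage (the flags are assumed to follow the same pattern as the other demos; the input is raw text without punctuation):
```bash
paddlespeech text --input "今天的天气真不错啊你下午有空吗我想约你一起去吃饭"
```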
@ -63,10 +60,8 @@ Input of this demo should be a text of the specific language that can be passed
今天的天气真不错啊!你下午有空吗?我想约你一起去吃饭。
```
### 4.Pretrained Models
Here is a list of pretrained models released by PaddleSpeech that can be used by command and python api:
Here is a list of pretrained models released by PaddleSpeech that can be used by command and python API:
- Punctuation Restoration
| Model | Language | Number of Punctuation Characters

@ -12,7 +12,7 @@ pip install paddlespeech
```
### 2. Prepare Input File
Input of this demo should be a WAV file(`.wav`), and the sample rate must be same as the model's.
The input of this demo should be a WAV file (`.wav`), and its sample rate must be the same as the model's.
Here are sample files for this demo that can be downloaded:
```bash
@ -65,10 +65,9 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
我认为跑步最重要的就是给我带来了身体健康
```
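For reference, output like the above is produced by the recognition command (a hedged example; `zh.wav` is the Mandarin sample file mentioned in the download step):
```bash
paddlespeech asr --lang zh --input ./zh.wav
```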
### 4.Pretrained Models
Here is a list of pretrained models released by PaddleSpeech that can be used by command and python api:
Here is a list of pretrained models released by PaddleSpeech that can be used by command and python API:
| Model | Language | Sample Rate
| :--- | :---: | :---: |

@ -1,10 +1,10 @@
([简体中文](./README_cn.md)|English)
# Speech Translation
## Introduction
Speech translation is the process by which conversational spoken phrases are instantly translated and spoken aloud in a second language.
This demo is an implementation to recognize text from a specific audio file and translate to target language. It can be done by a single command or a few lines in python using `PaddleSpeech`.
This demo is an implementation to recognize text from a specific audio file and translate it to the target language. It can be done by a single command or a few lines in python using `PaddleSpeech`.
## Usage
### 1. Installation
@ -13,7 +13,7 @@ pip install paddlespeech
```
### 2. Prepare Input File
Input of this demo should be a WAV file(`.wav`).
The input of this demo should be a WAV file (`.wav`).
Here are sample files for this demo that can be downloaded:
```bash
@ -68,10 +68,8 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespee
['我 在 这栋 建筑 的 古老 门上 敲门 。']
```
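A hedged example of the corresponding command (the subcommand and flags are assumed to mirror the other demos; replace the input with the English sample file downloaded above):
```bash
paddlespeech st --input ./your_english_audio.wav
```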
### 4.Pretrained Models
Here is a list of pretrained models released by PaddleSpeech that can be used by command and python api:
Here is a list of pretrained models released by PaddleSpeech that can be used by command and python API:
| Model | Source Language | Target Language
| :--- | :---: | :---: |

@ -2,11 +2,11 @@
## Introduction
Storybooks are very important children's enlightenment books, but parents usually don't have enough time to read storybooks for their children. For very young children, they may not understand the Chinese characters in storybooks. Or sometimes, children just want to "listen" but don't want to "read".
You can use `PaddleOCR` to get the text of a storybook, and read it by the `TTS` mudule of `PaddleSpeech`.
You can use `PaddleOCR` to get the text of a storybook and read it by the `TTS` module of `PaddleSpeech`.
## Usage
Run the following command line to get started:
```
./run.sh
```
The result has shown on our [notebook](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/tutorial/tts/tts_tutorial.ipynb).
The result is shown in our [notebook](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/tutorial/tts/tts_tutorial.ipynb).

@ -1,14 +1,14 @@
# Style FastSpeech2
## Introduction
[FastSpeech2](https://arxiv.org/abs/2006.04558) is a classical acoustic model for Text-to-Speech synthesis, which introduces controllable speech input, including `phoneme duration`、`energy` and `pitch`.
[FastSpeech2](https://arxiv.org/abs/2006.04558) is a classical acoustic model for Text-to-Speech synthesis, which introduces controllable speech input, including `phoneme duration`, `energy`, and `pitch`.
In the prediction phase, you can change these controllable variables to get some interesting results.
For example:
1. The `duration` control in `FastSpeech2` can control the speed of audios will keep the `pitch`. (in some speech tool, increase the speed will increase the pitch, and vice versa.)
1. The `duration` control in `FastSpeech2` can control the speed of audio while keeping the `pitch` unchanged. (In some speech tools, increasing the speed will increase the pitch, and vice versa.)
2. When we set `pitch` of one sentence to a mean value and set `tones` of phones to `1`, we will get a `robot-style` timbre.
2. When we set the `pitch` of one sentence to a mean value and set the `tones` of phones to `1`, we will get a `robot-style` timbre.
3. When we raise the `pitch` of an adult female (with a fixed scale ratio), we will get a `child-style` timbre.
@ -20,7 +20,7 @@ Run the following command line to get started:
```
In `run.sh`, it will execute `source path.sh` first, which will set the environment variables.
If you would like to try your own sentence, please replace the sentence in `sentences.txt`.
If you would like to try your own sentence, please replace the sentence in `sentences.txt`.
For more details, please see `style_syn.py`.

@ -4,7 +4,7 @@
## Introduction
Text-to-speech (TTS) is a natural language modeling process that requires changing units of text into units of speech for audio presentation.
This demo is an implementation to generate an audio from the giving text. It can be done by a single command or a few lines in python using `PaddleSpeech`.
This demo is an implementation to generate audio from the given text. It can be done by a single command or a few lines in python using `PaddleSpeech`.
## Usage
### 1. Installation
@ -13,7 +13,7 @@ pip install paddlespeech
```
### 2. Prepare Input
Input of this demo should be a text of the specific language that can be passed via argument.
The input of this demo should be a text of the specific language that can be passed via argument.
### 3. Usage
- Command Line (Recommended)
- Chinese
@ -22,11 +22,11 @@ Input of this demo should be a text of the specific language that can be passed
```bash
paddlespeech tts --input "你好,欢迎使用百度飞桨深度学习框架!"
```
- Chinese, use `SpeedySpeech` as acoustic model
- Chinese, use `SpeedySpeech` as the acoustic model
```bash
paddlespeech tts --am speedyspeech_csmsc --input "你好,欢迎使用百度飞桨深度学习框架!"
```
- Chinese, multi speaker
- Chinese, multi-speaker
You can change `spk_id` here.
```bash
@ -37,7 +37,7 @@ Input of this demo should be a text of the specific language that can be passed
```bash
paddlespeech tts --am fastspeech2_ljspeech --voc pwgan_ljspeech --lang en --input "hello world"
```
- English, multi speaker
- English, multi-speaker
You can change `spk_id` here.
```bash
@ -104,7 +104,7 @@ Input of this demo should be a text of the specific language that can be passed
### 4. Pretrained Models
Here is a list of pretrained models released by PaddleSpeech that can be used by command and python api:
Here is a list of pretrained models released by PaddleSpeech that can be used by command and python API:
- Acoustic model
| Model | Language

@ -1,9 +1,8 @@
# Data Augmentation Pipeline
Data augmentation has often been a highly effective technique to boost the deep learning performance. We augment our speech data by synthesizing new audios with small random perturbation (label-invariant transformation) added upon raw audios. You don't have to do the syntheses on your own, as it is already embedded into the data provider and is done on the fly, randomly for each epoch during training.
Data augmentation has often been a highly effective technique to boost deep learning performance. We augment our speech data by synthesizing new audios with small random perturbation (label-invariant transformation) added upon raw audios. You don't have to do the syntheses on your own, as it is already embedded into the data provider and is done on the fly, randomly for each epoch during training.
Six optional augmentation components are provided to be selected, configured and inserted into the processing pipeline.
Six optional augmentation components are provided to be selected, configured, and inserted into the processing pipeline.
* Audio
- Volume Perturbation
@ -17,7 +16,7 @@ Six optional augmentation components are provided to be selected, configured and
- SpecAugment
- Adaptive SpecAugment
In order to inform the trainer of what augmentation components are needed and what their processing orders are, it is required to prepare in advance an *augmentation configuration file* in [JSON](http://www.json.org/) format. For example:
To inform the trainer of what augmentation components are needed and what their processing orders are, it is required to prepare in advance an *augmentation configuration file* in [JSON](http://www.json.org/) format. For example:
```
[{
@ -34,8 +33,8 @@ In order to inform the trainer of what augmentation components are needed and wh
}]
```
When the `augment_conf_file` argument is set to the path of the above example configuration file, every audio clip in every epoch will be processed: with 60% of chance, it will first be speed perturbed with a uniformly random sampled speed-rate between 0.95 and 1.05, and then with 80% of chance it will be shifted in time with a random sampled offset between -5 ms and 5 ms. Finally this newly synthesized audio clip will be feed into the feature extractor for further training.
When the `augment_conf_file` argument is set to the path of the above example configuration file, every audio clip in every epoch will be processed: with 60% of chance, it will first be speed perturbed with a uniformly random sampled speed-rate between 0.95 and 1.05, and then with 80% of chance it will be shifted in time with a randomly sampled offset between -5 ms and 5 ms. Finally, this newly synthesized audio clip will be fed into the feature extractor for further training.
For other configuration examples, please refer to `examples/conf/augmentation.example.json`.
Be careful when utilizing the data augmentation technique, as improper augmentation will do harm to the training, due to the enlarged train-test gap.
Be careful when utilizing the data augmentation technique, as improper augmentation will harm the training, due to the enlarged train-test gap.

@ -1,15 +1,13 @@
# Data Preparation
## Generate Manifest
*DeepSpeech2 on PaddlePaddle* accepts a textual **manifest** file as its data set interface. A manifest file summarizes a set of speech data, with each line containing some meta data (e.g. filepath, transcription, duration) of one audio clip, in [JSON](http://www.json.org/) format, such as:
*DeepSpeech2 on PaddlePaddle* accepts a textual **manifest** file as its data set interface. A manifest file summarizes a set of speech data, with each line containing some meta data (e.g. file path, transcription, duration) of one audio clip, in [JSON](http://www.json.org/) format, such as:
```
{"audio_filepath": "/home/work/.cache/paddle/Libri/134686/1089-134686-0001.flac", "duration": 3.275, "text": "stuff it into you his belly counselled him"}
{"audio_filepath": "/home/work/.cache/paddle/Libri/134686/1089-134686-0007.flac", "duration": 4.275, "text": "a cold lucid indifference reigned in his soul"}
```
To use your custom data, you only need to generate such manifest files to summarize the dataset. Given such summarized manifests, training, inference and all other modules can be aware of where to access the audio files, as well as their meta data including the transcription labels.
For how to generate such manifest files, please refer to `examples/librispeech/local/librispeech.py`, which will download data and generate manifest files for LibriSpeech dataset.
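For custom data, a manifest line can also be appended directly, following the format shown above (a minimal sketch; the audio path, duration, and output file are placeholders):
```bash
# Append one manifest entry for your own audio clip (compute the real duration, e.g. with soxi -D)
echo '{"audio_filepath": "/data/my_corpus/utt001.wav", "duration": 2.5, "text": "hello paddle"}' \
    >> data/manifest.custom
```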
@ -26,12 +24,12 @@ python3 utils/compute_mean_std.py \
--output_path examples/librispeech/data/mean_std.npz
```
It will compute the mean and standard deviatio of power spectrum feature with 2000 random sampled audio clips listed in `examples/librispeech/data/manifest.train` and save the results to `examples/librispeech/data/mean_std.npz` for further usage.
It will compute the mean and standard deviation of the power spectrum feature with 2000 randomly sampled audio clips listed in `examples/librispeech/data/manifest.train` and save the results to `examples/librispeech/data/mean_std.npz` for further usage.
## Build Vocabulary
A vocabulary of possible characters is required to convert the transcription into a list of token indices for training, and in decoding, to convert from a list of indices back to text again. Such a character-based vocabulary can be built with `utils/build_vocab.py`.
A vocabulary of possible characters is required to convert the transcription into a list of token indices for training, and in decoding, to convert from a list of indices back to the text again. Such a character-based vocabulary can be built with `utils/build_vocab.py`.
```bash
python3 utils/build_vocab.py \
@ -40,4 +38,4 @@ python3 utils/build_vocab.py \
--manifest_paths examples/librispeech/data/manifest.train
```
It will write a vocabuary file `examples/librispeech/data/vocab.txt` with all transcription text in `examples/librispeech/data/manifest.train`, without vocabulary truncation (`--count_threshold 0`).
It will write a vocabulary file `examples/librispeech/data/vocab.txt` with all transcription text in `examples/librispeech/data/manifest.train`, without vocabulary truncation (`--count_threshold 0`).

@ -36,7 +36,7 @@
### Alignment
* MFA
* CTC Aligment
* CTC Alignment
### Speech Frontend
@ -73,5 +73,5 @@
### Grapheme To Phoneme
* syallable
* syllable
* phoneme

@ -1,7 +1,7 @@
# Models introduction
## Streaming DeepSpeech2
The implemented arcitecure of Deepspeech2 online model is based on [Deepspeech2 model](https://arxiv.org/pdf/1512.02595.pdf) with some changes.
The model is mainly composed of 2D convolution subsampling layer and stacked single direction rnn layers.
The implemented architecture of the DeepSpeech2 online model is based on the [Deepspeech2 model](https://arxiv.org/pdf/1512.02595.pdf) with some changes.
The model is mainly composed of 2D convolution subsampling layers and stacked single-direction rnn layers.
To illustrate the model implementation clearly, 3 parts are described in detail.
- Data Preparation
@ -10,7 +10,7 @@ To illustrate the model implementation clearly, 3 parts are described in detail.
In addition, the training process and the testing process are also introduced.
The arcitecture of the model is shown in Fig.1.
The architecture of the model is shown in Fig.1.
<p align="center">
<img src="https://raw.githubusercontent.com/PaddlePaddle/PaddleSpeech/develop/docs/images/ds2onlineModel.png" width=800>
@ -20,7 +20,7 @@ The arcitecture of the model is shown in Fig.1.
### Data Preparation
#### Vocabulary
For English data, the vocabulary dictionary is composed of 26 English characters with " ' ", space, \<blank\> and \<eos\>. The \<blank\> represents the blank label in CTC, the \<unk\> represents the unknown character and the \<eos\> represents the start and the end characters. For mandarin, the vocabulary dictionary is composed of chinese characters statisticed from the training set and three additional characters are added. The added characters are \<blank\>, \<unk\> and \<eos\>. For both English and mandarin data, we set the default indexs that \<blank\>=0, \<unk\>=1 and \<eos\>= last index.
For English data, the vocabulary dictionary is composed of 26 English characters with " ' ", space, \<blank\> and \<eos\>. The \<blank\> represents the blank label in CTC, the \<unk\> represents the unknown character and the \<eos\> represents the start and the end characters. For Mandarin, the vocabulary dictionary is composed of Chinese characters counted from the training set, and three additional characters are added. The added characters are \<blank\>, \<unk\> and \<eos\>. For both English and Mandarin data, we set the default indexes as \<blank\>=0, \<unk\>=1 and \<eos\>= last index.
```
# The code to build vocabulary
cd examples/aishell/s0
@ -38,7 +38,7 @@ vi examples/librispeech/s0/data/vocab.txt
```
#### CMVN
For CMVN, a subset or the full of traininig set is chosed and be used to compute the feature mean and std.
For CMVN, a subset of the training set (or the full set) is selected and used to compute the feature mean and std.
```
# The code to compute the feature mean and std
cd examples/aishell/s0
@ -58,14 +58,14 @@ python3 ../../../utils/compute_mean_std.py \
#### Feature Extraction
For feature extraction, three methods are implemented, which are linear (FFT without using filter bank), fbank and mfcc.
Currently, the released deepspeech2 online model use the linear feature extraction method.
Currently, the released deepspeech2 online model uses the linear feature extraction method.
```
The code for feature extraction
vi paddlespeech/s2t/frontend/featurizer/audio_featurizer.py
```
### Encoder
The encoder is composed of two 2D convolution subsampling layers and a number of stacked single direction rnn layers. The 2D convolution subsampling layers extract feature representation from the raw audio feature and reduce the length of audio feature at the same time. After passing through the convolution subsampling layers, then the feature representation are input into the stacked rnn layers. For the stacked rnn layers, LSTM cell and GRU cell are provided to use. Adding one fully connected (fc) layer after the stacked rnn layers is optional. If the number of stacked rnn layers is less than 5, adding one fc layer after stacked rnn layers is recommand.
The encoder is composed of two 2D convolution subsampling layers and several stacked single-direction rnn layers. The 2D convolution subsampling layers extract a feature representation from the raw audio feature and reduce the length of the audio feature at the same time. After passing through the convolution subsampling layers, the feature representation is input into the stacked rnn layers. For the stacked rnn layers, both LSTM and GRU cells are available. Adding one fully connected (fc) layer after the stacked rnn layers is optional. If the number of stacked rnn layers is less than 5, adding one fc layer after the stacked rnn layers is recommended.
The code of Encoder is in:
```
@ -73,7 +73,7 @@ vi paddlespeech/s2t/models/ds2_online/deepspeech2.py
```
### Decoder
To got the character possibilities of each frame, the feature representation of each frame output from the encoder are input into a projection layer which is implemented as a dense layer to do feature projection. The output dim of the projection layer is same with the vocabulary size. After projection layer, the softmax function is used to transform the frame-level feature representation be the possibilities of characters. While making model inference, the character possibilities of each frame are input into the CTC decoder to get the final speech recognition results.
To get the character probabilities of each frame, the feature representation of each frame output from the encoder is fed into a projection layer, which is implemented as a dense layer to do feature projection. The output dim of the projection layer is the same as the vocabulary size. After the projection layer, the softmax function is used to transform the frame-level feature representation into the probabilities of characters. During model inference, the character probabilities of each frame are input into the CTC decoder to get the final speech recognition results.
The code of the decoder is in:
```
@ -91,7 +91,7 @@ bash run.sh --stage 0 --stop_stage 2 --model_type online --conf_path conf/deepsp
```
The detail commands are:
```
# The code for training in run.sh
# The code for training in the run.sh
set -e
source path.sh
@ -123,8 +123,7 @@ if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
avg.sh exp/${ckpt}/checkpoints ${avg_num}
fi
```
By using the command above, the training process can be started. There are 5 stages in "run.sh", and the first 3 stages are used for training process. The stage 0 is used for data preparation, in which the dataset will be downloaded, and the manifest files of the datasets, vocabulary dictionary and CMVN file will be generated in "./data/". The stage 1 is used for training the model, the log files and model checkpoint is saved in "exp/deepspeech2_online/". The stage 2 is used to generated final model for predicting by averaging the top-k model parameters based on validation loss.
By using the command above, the training process can be started. There are 5 stages in "run.sh", and the first 3 stages are used for the training process. Stage 0 is used for data preparation, in which the dataset will be downloaded, and the manifest files of the datasets, the vocabulary dictionary, and the CMVN file will be generated in "./data/". Stage 1 is used for training the model; the log files and model checkpoints are saved in "exp/deepspeech2_online/". Stage 2 is used to generate the final model for prediction by averaging the top-k model parameters based on validation loss.
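For instance, a single stage can be rerun on its own by narrowing the stage range (a hedged example reusing the flags from the excerpt above; the config path is a placeholder):
```bash
# Rerun only the data-preparation stage without retraining or averaging
bash run.sh --stage 0 --stop_stage 0 --model_type online --conf_path <path/to/your_config.yaml>
```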
### Testing Process
Using the command below, you can test the deepspeech2 online model.
@ -153,10 +152,10 @@ if [ ${stage} -le 5 ] && [ ${stop_stage} -ge 5 ]; then
CUDA_VISIBLE_DEVICES=0 ./local/test_export.sh ${conf_path} exp/${ckpt}/checkpoints/${avg_ckpt}.jit ${model_type}|| exit -1
fi
```
After the training process, we use stage 3,4,5 for testing process. The stage 3 is for testing the model generated in the stage 2 and provided the CER index of the test set. The stage 4 is for transforming the model from dynamic graph to static graph by using "paddle.jit" library. The stage 5 is for testing the model in static graph.
After the training process, we use stages 3, 4, and 5 for the testing process. Stage 3 is for testing the model generated in stage 2 and provides the CER on the test set. Stage 4 is for transforming the model from a dynamic graph to a static graph by using the "paddle.jit" library. Stage 5 is for testing the model in a static graph.
## Non-Streaming DeepSpeech2
The deepspeech2 offline model is similarity to the deepspeech2 online model. The main difference between them is the offline model use the stacked bi-directional rnn layers while the online model use the single direction rnn layers and the fc layer is not used. For the stacked bi-directional rnn layers in the offline model, the rnn cell and gru cell are provided to use.
The deepspeech2 offline model is similar to the deepspeech2 online model. The main difference between them is the offline model uses the stacked bi-directional rnn layers while the online model uses the single direction rnn layers and the fc layer is not used. For the stacked bi-directional rnn layers in the offline model, the rnn cell and gru cell are provided to use.
The architecture of the model is shown in Fig.2.
<p align="center">
@ -165,14 +164,14 @@ The arcitecture of the model is shown in Fig.2.
</p>
For data preparation and decoder, the deepspeech2 offline model is same with the deepspeech2 online model.
For data preparation and decoder, the deepspeech2 offline model is the same as the deepspeech2 online model.
The code of encoder and decoder for deepspeech2 offline model is in:
```
vi paddlespeech/s2t/models/ds2/deepspeech2.py
```
The training process and testing process of deepspeech2 offline model is very similary to deepspeech2 online model.
The training process and testing process of deepspeech2 offline model is very similar to deepspeech2 online model.
Only some changes should be noticed.
For training and testing, the "model_type" and the "conf_path" must be set.

@ -4,16 +4,15 @@
A language model is required to improve the decoder's performance. We have prepared two language models (with lossy compression) for users to download and try. One is for English and the other is for Mandarin. The bash script to download LM is example's `local/download_lm_*.sh`.
For example, users can simply run this to download the preprared mandarin language models:
For example, users can simply run this to download the prepared mandarin language models:
```bash
cd examples/aishell
source path.sh
bash local/download_lm_ch.sh
```
If you wish to train your own better language model, please refer to [KenLM](https://github.com/kpu/kenlm) for tutorials.
Here we provide some tips to show how we preparing our English and Mandarin language models.
Here we provide some tips to show how we prepare our English and Mandarin language models.
You can take it as a reference when you train your own.
### English LM
@ -24,14 +23,14 @@ The English corpus is from the [Common Crawl Repository](http://commoncrawl.org)
* Repeated whitespace characters are squeezed to one and the beginning whitespace characters are removed. Notice that all transcriptions are lowercase, so all characters are converted to lowercase.
* Top 400,000 most frequent words are selected to build the vocabulary and the rest are replaced with 'UNKNOWNWORD'.
Now the preprocessing is done and we get a clean corpus to train the language model. Our released language model are trained with agruments '-o 5 --prune 0 1 1 1 1'. '-o 5' means the max order of language model is 5. '--prune 0 1 1 1 1' represents count thresholds for each order and more specifically it will prune singletons for orders two and higher. To save disk storage we convert the arpa file to 'trie' binary file with arguments '-a 22 -q 8 -b 8'. '-a' represents the maximum number of leading bits of pointers in 'trie' to chop. '-q -b' are quantization parameters for probability and backoff.
Now the preprocessing is done and we get a clean corpus to train the language model. Our released language model is trained with arguments '-o 5 --prune 0 1 1 1 1'. '-o 5' means the max order of the language model is 5. '--prune 0 1 1 1 1' represents count thresholds for each order and more specifically it will prune singletons for orders two and higher. To save disk storage we convert the ARPA file to 'trie' binary file with arguments '-a 22 -q 8 -b 8'. '-a' represents the maximum number of leading bits of pointers in 'trie' to chop. '-q -b' are quantization parameters for probability and backoff.
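A hedged sketch of the corresponding KenLM commands (`lmplz` and `build_binary` are KenLM's standard tools; the corpus and output paths are placeholders):
```bash
# Train a 5-gram ARPA LM with the pruning thresholds described above
lmplz -o 5 --prune 0 1 1 1 1 < cleaned_corpus.txt > lm.arpa
# Convert the ARPA file to a quantized 'trie' binary
build_binary -a 22 -q 8 -b 8 trie lm.arpa lm.binary
```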
### Mandarin LM
Different from the English language model, Mandarin language model is character-based where each token is a Chinese character. We use internal corpus to train the released Mandarin language models. The corpus contain billions of tokens. The preprocessing has tiny difference from English language model and main steps include:
Different from the English language model, the Mandarin language model is character-based where each token is a Chinese character. We use the internal corpus to train the released Mandarin language models. The corpus contains billions of tokens. The preprocessing has a tiny difference from the English language model and the main steps include:
* The beginning and trailing whitespace characters are removed.
* English punctuations and Chinese punctuations are removed.
* A whitespace character between two tokens is inserted.
Please notice that the released language models only contain Chinese simplified characters. After preprocessing done we can begin to train the language model. The key training arguments for small LM is '-o 5 --prune 0 1 2 4 4' and '-o 5' for large LM. Please refer above section for the meaning of each argument. We also convert the arpa file to binary file using default settings.
Please notice that the released language models only contain simplified Chinese characters. After preprocessing is done we can begin to train the language model. The key training arguments for the small LM are '-o 5 --prune 0 1 2 4 4' and '-o 5' for the large LM. Please refer to the above section for the meaning of each argument. We also convert the ARPA file to a binary file using default settings.

@ -1,11 +1,11 @@
# Quick Start of Speech-to-Text
Several shell scripts provided in `./examples/tiny/local` will help us to quickly give it a try, for most major modules, including data preparation, model training, case inference and model evaluation, with a few public dataset (e.g. [LibriSpeech](http://www.openslr.org/12/), [Aishell](http://www.openslr.org/33)). Reading these examples will also help you to understand how to make it work with your own data.
Several shell scripts provided in `./examples/tiny/local` will help us to quickly give it a try, for most major modules, including data preparation, model training, case inference, and model evaluation, with a few public datasets (e.g. [LibriSpeech](http://www.openslr.org/12/), [Aishell](http://www.openslr.org/33)). Reading these examples will also help you to understand how to make it work with your data.
Some of the scripts in `./examples` are not configured with GPUs. If you want to train with 8 GPUs, please modify `CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7`. If you don't have any GPU available, please set `CUDA_VISIBLE_DEVICES=` to use CPUs instead. Besides, if out-of-memory problem occurs, just reduce `batch_size` to fit.
Some of the scripts in `./examples` are not configured with GPUs. If you want to train with 8 GPUs, please modify `CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7`. If you don't have any GPU available, please set `CUDA_VISIBLE_DEVICES=` to use CPUs instead. Besides, if an out-of-memory problem occurs, just reduce `batch_size` to fit.
Let's take a tiny sampled subset of [LibriSpeech dataset](http://www.openslr.org/12/) for instance.
- Go to directory
- Go to the directory
```bash
cd examples/tiny
@ -16,19 +16,17 @@ Let's take a tiny sampled subset of [LibriSpeech dataset](http://www.openslr.org
source path.sh
```
**Must do this before you start to do anything.**
Set `MAIN_ROOT` as project dir. Using defualt `deepspeech2` model as `MODEL`, you can change this in the script.
- Main entrypoint
Set `MAIN_ROOT` as the project dir. The default `deepspeech2` model is used as `MODEL`; you can change this in the script.
- Main entry point
```bash
bash run.sh
```
This is just a demo, please make sure every `step` works well before next `step`.
This is just a demo, please make sure every `step` works well before the next `step`.
More detailed information are provided in the following sections. Wish you a happy journey with the *DeepSpeech on PaddlePaddle* ASR engine!
More detailed information is provided in the following sections. Wish you a happy journey with the *DeepSpeech on PaddlePaddle* ASR engine!
## Training a model
The key steps of training for Mandarin language are same to that of English language and we have also provided an example for Mandarin training with Aishell in ```examples/aishell/local```. As mentioned above, please execute ```sh data.sh```, ```sh train.sh```and```sh test.sh```to do data preparation, training, and testing correspondingly.
The key steps of training for the Mandarin language are the same as that of the English language and we have also provided an example for Mandarin training with Aishell in `examples/aishell/local`. As mentioned above, please execute `sh data.sh`, `sh train.sh` and `sh test.sh` to do data preparation, training, and testing correspondingly.
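For example (a hedged sketch; the scripts and path come from the sentence above):
```bash
cd examples/aishell/local
sh data.sh    # data preparation
sh train.sh   # training
sh test.sh    # testing
```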
## Evaluate a Model
To evaluate a model's performance quantitatively, please run:
@ -37,4 +35,4 @@ CUDA_VISIBLE_DEVICES=0 bash local/test.sh
```
The error rate (default: word error rate; can be set with `error_rate_type`) will be printed.
We provide two types of CTC decoders: *CTC greedy decoder* and *CTC beam search decoder*. The *CTC greedy decoder* is an implementation of the simple best-path decoding algorithm, selecting at each timestep the most likely token, thus being greedy and locally optimal. The [*CTC beam search decoder*](https://arxiv.org/abs/1408.2873) otherwise utilizes a heuristic breadth-first graph search for reaching a near global optimality; it also requires a pre-trained KenLM language model for better scoring and ranking. The decoder type can be set with argument `decoding_method`.
We provide two types of CTC decoders: *CTC greedy decoder* and *CTC beam search decoder*. The *CTC greedy decoder* is an implementation of the simple best-path decoding algorithm, selecting at each timestep the most likely token, thus being greedy and locally optimal. The [*CTC beam search decoder*](https://arxiv.org/abs/1408.2873) otherwise utilizes a heuristic breadth-first graph search for reaching near-global optimality; it also requires a pre-trained KenLM language model for better scoring and ranking. The decoder type can be set with the argument `decoding_method`.

@ -1,29 +1,19 @@
# The Dependencies
## By apt-get
### The base dependencies:
```
bc flac jq vim tig tree pkg-config libsndfile1 libflac-dev libvorbis-dev libboost-dev swig python3-dev
```
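For example, on Ubuntu/Debian these can be installed in one command (assuming sudo access):
```bash
# One-shot install of the base dependencies listed above
sudo apt-get install -y bc flac jq vim tig tree pkg-config libsndfile1 libflac-dev \
    libvorbis-dev libboost-dev swig python3-dev
```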
### The dependencies of kenlm:
```
```
build-essential cmake libboost-system-dev libboost-thread-dev libboost-program-options-dev libboost-test-dev libeigen3-dev zlib1g-dev libbz2-dev liblzma-dev gcc-5 g++-5
```
### The dependencies of sox:
```
libvorbis-dev libmp3lame-dev libmad-ocaml-dev
```
```
## By make or setup
```
```
kenlm
sox
mfa

@ -1,12 +1,12 @@
([简体中文](./install_cn.md)|English)
# Installation
There are 3 ways to use `PaddleSpeech`. According to the degree of difficulty, the 3 ways can be divided into **Easy**, **Medium** and **Hard**. You can choose one of the 3 ways to install `PaddleSpeech`.
There are 3 ways to use `PaddleSpeech`. According to the degree of difficulty, the 3 ways can be divided into **Easy**, **Medium**, and **Hard**. You can choose one of the 3 ways to install `PaddleSpeech`.
| Way | Function | Support|
| :---- | :----------------------------------------------------------- |:----|
| Easy | (1) Use command line functions of PaddleSpeech. <br> (2) Experience PaddleSpeech on Ai Studio. | Linux, MacWindows |
| Medium | Support major functionsuch as using the` ready-made `examples and using PaddleSpeech to train your own model. | Linux |
| Hard | Support full function of Paddlespeechincluding training n-gram language model, montreal-forced-aligner and so on. And you are more able be a developer! | Ubuntu |
|:---- |:----------------------------------------------------------- |:----|
| Easy | (1) Use command-line functions of PaddleSpeech. <br> (2) Experience PaddleSpeech on AI Studio. | Linux, Mac, Windows |
| Medium | Support major functions, such as using the `ready-made` examples and using PaddleSpeech to train your own model. | Linux |
| Hard | Support the full functions of PaddleSpeech, including training an n-gram language model, Montreal Forced Aligner, and so on. It also prepares you to become a developer! | Ubuntu |
## Prerequisites
- Python >= 3.7
@ -14,9 +14,9 @@ There are 3 ways to use `PaddleSpeech`. According to the degree of difficulty, t
- C++ compilation environment
- Tip: For Linux and Mac, do not use `sh` instead of `bash` in the installation document.
## Easy: Get the Basic Function (Support Linux, Mac and Windows)
- If you are newer to `PaddleSpeech` and want to experience it easily without your own machine. We recommend you to use [AI Studio](https://aistudio.baidu.com/aistudio/index) to experience it. There is a step-by-step tutorial for `PaddleSpeech` and you can use the basic function of `PaddleSpeech` with a free machine.
- If you want to use the command line function of Paddlespeech, you need to complete the following steps to install `PaddleSpeech`. For more information about how to use command line function , you can see the [cli](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/paddlespeech/cli).
## Easy: Get the Basic Function (Support Linux, Mac, and Windows)
- If you are new to `PaddleSpeech` and want to experience it easily without your own machine, we recommend you use [AI Studio](https://aistudio.baidu.com/aistudio/index) to experience it. There is a step-by-step tutorial for `PaddleSpeech` and you can use the basic functions of `PaddleSpeech` with a free machine.
- If you want to use the command line function of Paddlespeech, you need to complete the following steps to install `PaddleSpeech`. For more information about how to use the command line function, you can see the [cli](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/paddlespeech/cli).
### Install Conda
Conda is an environment management system. You can go to [miniconda](https://docs.conda.io/en/latest/miniconda.html) (select a version with py>=3.7) to download and install conda.
Then install the conda dependencies for `paddlespeech`:
@ -25,11 +25,9 @@ And then Install conda dependencies for `paddlespeech` :
conda install -y -c conda-forge sox libsndfile bzip2
```
### Install C++ Compilation Environment
(If you already have a C++ compilation environment, you can skip this step.)
#### Windows
You need to install the `Visual Studio` to make the C++ compilation environment.
You need to install `Visual Studio` to make the C++ compilation environment.
#### Mac
```bash
@ -48,15 +46,12 @@ sudo apt install build-essential
# Others
conda install -y -c gcc_linux-64=8.4.0 gxx_linux-64=8.4.0
```
### Install PaddleSpeech
You can use the following command:
```bash
pip install paddlepaddle paddlespeech
```
## Medium: Get the Major Function (Support Linux)
## Medium: Get the Major Functions (Support Linux)
If you want to get the major functions of `paddlespeech`, there are 4 steps you need to do.
### Install Conda
@ -71,7 +66,7 @@ $HOME/miniconda3/bin/conda init
# activate the conda
bash
```
Then you can create an conda virtual environment using the following command:
Then you can create a conda virtual environment using the following command:
```bash
conda create -y -p tools/venv python=3.7
```
@ -119,9 +114,9 @@ pip install .
- choice 1: working with `Ubuntu` Docker Container.
- choice 2: working on `Ubuntu` with `root` privilege.
To avoid the trouble of environment setup, running in Docker container is highly recommended. Otherwise, if you work on `Ubuntu` with `root` privilege, you can still complete the installation.
To avoid the trouble of environment setup, running in a Docker container is highly recommended. Otherwise, if you work on `Ubuntu` with `root` privilege, you can still complete the installation.
### Choice 1: Running in Docker Container (Recommand)
### Choice 1: Running in Docker Container (Recommended)
Docker is an open-source tool to build, ship, and run distributed applications in an isolated environment. A Docker image for this project has been provided on [hub.docker.com](https://hub.docker.com) with all the dependencies installed. This Docker image requires the support of an NVIDIA GPU, so please make sure it is available and that [nvidia-docker](https://github.com/NVIDIA/nvidia-docker) has been installed.
Take several steps to launch the Docker image:
@ -136,7 +131,6 @@ nvidia-docker pull registry.baidubce.com/paddlepaddle/paddle:2.2.0-gpu-cuda10.2-
git clone https://github.com/PaddlePaddle/PaddleSpeech.git
```
- Run the Docker image
```bash
sudo nvidia-docker run --net=host --ipc=host --rm -it -v $(pwd)/PaddleSpeech:/PaddleSpeech registry.baidubce.com/paddlepaddle/paddle:2.2.0-gpu-cuda10.2-cudnn7 /bin/bash
```
@ -144,7 +138,7 @@ sudo nvidia-docker run --net=host --ipc=host --rm -it -v $(pwd)/PaddleSpeech:/Pa
```bash
cd /PaddleSpeech
```
Now you can execute training, inference and hyper-parameters tuning in Docker container.
Now you can execute training, inference, and hyper-parameters tuning in Docker container.
### Choice 2: Running in Ubuntu with Root Privilege
- Install `build-essential` by apt
@ -165,11 +159,11 @@ bash extras/install_miniconda.sh
popd
# use the "bash" command to make the conda environment work
bash
# create an conda virtual environment
# create a conda virtual environment
conda create -y -p tools/venv python=3.7
# Activate the conda virtual environment:
conda activate tools/venv
# Install the conda packags
# Install the conda packages
conda install -y -c conda-forge sox libsndfile swig bzip2 libflac bc
```
### Install PaddlePaddle
@ -189,4 +183,3 @@ bash extras/install_openblas.sh
bash extras/install_kaldi.sh
popd
```

@ -1,81 +1,55 @@
(简体中文|[English](./install.md))
# 安装方法
`PaddleSpeech`有三种安装方法。根据安装的难易程度,这三种方法可以分为 **简单**, **中等** 和 **困难**.
`PaddleSpeech` 有三种安装方法。根据安装的难易程度,这三种方法可以分为 **简单**, **中等** 和 **困难**.
| 方式 | 功能 | 支持系统 |
| :--- | :----------------------------------------------------------- | :------------------ |
| 简单 | (1) 使用PaddleSpeech的命令行功能. <br> (2) 在 Aistudio上体验PaddleSpeech. | Linux, MacWindows |
| 中等 | 支持PaddleSpeech主要功能比如使用已有examples中的模型和使用PaddleSpeech来训练自己的模型. | Linux |
| 困难 | 支持PaddleSpeech的各项功能包含训练语言模型,使用强制对齐等。并且你更能成为一名开发者! | Ubuntu |
| 简单 | (1) 使用 PaddleSpeech 的命令行功能. <br> (2) 在 AI Studio 上体验 PaddleSpeech. | Linux, Mac, Windows |
| 中等 | 支持 PaddleSpeech 主要功能,比如使用已有 examples 中的模型和使用 PaddleSpeech 来训练自己的模型. | Linux |
| 困难 | 支持 PaddleSpeech 的各项功能,包含训练语言模型,使用强制对齐等。并且你更能成为一名开发者! | Ubuntu |
## 先决条件
- Python >= 3.7
- 最新版本的PaddlePaddle (请看 [安装向导] (https://www.paddlepaddle.org.cn/documentation/docs/en/beginners_guide/index_en.html))
- 最新版本的 PaddlePaddle (请看 [安装向导](https://www.paddlepaddle.org.cn/documentation/docs/en/beginners_guide/index_en.html))
- C++ 编译环境
- 提示: 对于Linux和Max请不要使用`sh`代替安装文档中的`bash`
## 简单: 获取基本功能(支持LinuxMac和Windows)
- 如果你是一个刚刚接触`PaddleSpeech`的新人并且想要很方便地体验一下该项目。我们建议你 体验一下[AI Studio](https://aistudio.baidu.com/aistudio/index)。我们在AI Studio上面建立了一个让你一步一步运行体验来使用`PaddleSpeech`的教程。
- 如果你想使用`PaddleSpeech`的命令行功能,你需要跟随下面的步骤来安装`PaddleSpeech`。如果你想了解更多关于使用`PaddleSpeech`命令行功能的信息,你可以参考 [cli](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/paddlespeech/cli)。
- 提示: 对于 Linux 和 Mac请不要使用 `sh` 代替安装文档中的 `bash`
## 简单: 获取基本功能(支持 Linux、Mac 和 Windows)
- 如果你是一个刚刚接触 `PaddleSpeech` 的新人并且想要很方便地体验一下该项目。我们建议你 体验一下[AI Studio](https://aistudio.baidu.com/aistudio/index)。我们在AI Studio上面建立了一个让你一步一步运行体验来使用`PaddleSpeech`的教程。
- 如果你想使用 `PaddleSpeech` 的命令行功能,你需要跟随下面的步骤来安装 `PaddleSpeech`。如果你想了解更多关于使用 `PaddleSpeech` 命令行功能的信息,你可以参考 [cli](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/paddlespeech/cli)。
### 安装Conda
Conda是一个包管理的环境。你可以前往[minicoda](https://docs.conda.io/en/latest/miniconda.html) 去下载并安装conda请下载py>=3.7的版本)。
然后你需要安装`paddlespeech`的conda依赖:
Conda是一个包管理的环境。你可以前往 [minicoda](https://docs.conda.io/en/latest/miniconda.html) 去下载并安装 conda请下载 py>=3.7 的版本)。
然后你需要安装 `paddlespeech` 的 conda 依赖:
```bash
conda install -y -c conda-forge sox libsndfile bzip2
```
### 安装C++ 编译环境
(如果你系统上已经安装了C++编译环境,请忽略这一步。)
### 安装 C++ 编译环境
(如果你系统上已经安装了 C++ 编译环境,请忽略这一步。)
#### Windows
对于Windows系统需要安装`Visual Studio`来完成C++编译环境的安装。
对于 Windows 系统,需要安装 `Visual Studio` 来完成 C++ 编译环境的安装。
#### Mac
```bash
brew install gcc
```
#### Linux
```bash
# centos
sudo yum install gcc gcc-c++
```
```bash
# ubuntu
sudo apt install build-essential
```
```bash
# Others
conda install -y -c gcc_linux-64=8.4.0 gxx_linux-64=8.4.0
```
### 安装 PaddleSpeech
你可以使用如下命令:
```bash
pip install paddlepaddle paddlespeech
```
## 中等: 获取主要功能支持Linux
如果你想要使用`paddlespeech`的主要功能。你需要完成4个步骤
### 安装Conda
Conda是一个包管理的环境。你可以前往[minicoda](https://docs.conda.io/en/latest/miniconda.html) 去下载并安装conda请下载py>=3.7的版本)。你可以尝试自己安装,或者使用以下的命令:
## 中等: 获取主要功能(支持 Linux)
如果你想要使用 `paddlespeech` 的主要功能,你需要完成 4 个步骤。
### 安装 Conda
Conda 是一个包管理的环境。你可以前往 [minicoda](https://docs.conda.io/en/latest/miniconda.html) 去下载并安装 conda请下载 py>=3.7 的版本)。你可以尝试自己安装,或者使用以下的命令:
```bash
# download the miniconda
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
@ -86,156 +60,110 @@ $HOME/miniconda3/bin/conda init
# activate the conda
bash
```
然后你可以创建一个conda的虚拟环境
然后你可以创建一个 conda 的虚拟环境:
```bash
conda create -y -p tools/venv python=3.7
```
激活conda虚拟环境
激活 conda 虚拟环境:
```bash
conda activate tools/venv
```
安装`paddlespeech`的conda依赖
安装 `paddlespeech` 的 conda 依赖:
```bash
conda install -y -c conda-forge sox libsndfile swig bzip2
```
### 安装C++ 编译环境
(如果你系统上已经安装了C++编译环境,请忽略这一步。)
你可以使用如下的步骤来安装C++的编译环境`gcc` and `gxx`
### 安装 C++ 编译环境
(如果你系统上已经安装了 C++ 编译环境,请忽略这一步。)
你可以使用如下的步骤来安装 C++ 的编译环境 `gcc``gxx`
```bash
# centos
sudo yum install gcc gcc-c++
```
```bash
# ubuntu
sudo apt install build-essential
```
```bash
# Others
conda install -y -c gcc_linux-64=8.4.0 gxx_linux-64=8.4.0
```
(提示: 如果你想使用**困难**方式完成安装,请不要使用最后一条命令)
### 安装 PaddlePaddle
你可以根据系统配置选择PaddlePaddle版本例如系统使用CUDA 10.2, CuDNN7.5 你可以安装paddlepaddle-gpu 2.2.0
你可以根据系统配置选择 PaddlePaddle 版本,例如系统使用 CUDA 10.2、CuDNN 7.5,你可以安装 paddlepaddle-gpu 2.2.0:
```bash
python3 -m pip install paddlepaddle-gpu==2.2.0
```
### 安装 PaddleSpeech
你需要使用`git clone`的方式下载并安装 `paddlespeech`,这样你才可以使用`paddlespeech`中已有的examples
你需要使用 `git clone` 的方式下载并安装 `paddlespeech`,这样你才可以使用 `paddlespeech`中已有的 examples
```bash
https://github.com/PaddlePaddle/PaddleSpeech.git
cd PaddleSpeech
pip install .
```
## 困难: 获取所有功能(支持 Ubuntu)
### 先决条件
- Ubuntu >= 16.04
- 选择 1 使用`Ubuntu` docker。
- 选择 2 使用`Ubuntu` 并且拥有root权限。
为了避免各种环境配置问题我们非常推荐你使用docker容器。如果你不想使用docker但是可以使用拥有root权限的Ubuntu系统你也可以完成**困难**方式的安装。
- 选择 2 使用`Ubuntu` ,并且拥有 root 权限。
为了避免各种环境配置问题,我们非常推荐你使用 docker 容器。如果你不想使用 docker但是可以使用拥有 root 权限的 Ubuntu 系统,你也可以完成**困难**方式的安装。
### 选择1: 使用 Docker 容器(推荐)
Docker 是一种开源工具,用于在和系统本身环境相隔离的环境中构建、发布和运行各类应用程序。你可以访问[hub.docker.com](https://hub.docker.com)来下载各种版本的docker目前已经有适用于`PaddleSpeech`的docker提供在了该网站上。Docker镜像需要使用Nvidia GPU所以你也需要提前安装好[nvidia-docker](https://github.com/NVIDIA/nvidia-docker) 。
Docker 是一种开源工具,用于在和系统本身环境相隔离的环境中构建、发布和运行各类应用程序。你可以访问 [hub.docker.com](https://hub.docker.com) 来下载各种版本的 docker目前已经有适用于 `PaddleSpeech` 的 docker 提供在了该网站上。Docker 镜像需要使用 Nvidia GPU所以你也需要提前安装好 [nvidia-docker](https://github.com/NVIDIA/nvidia-docker) 。
你需要完成几个步骤来启动docker
- 下载docker镜像:
例如拉取paddle2.2.0镜像:
- 下载 docker 镜像:
例如,拉取 paddle2.2.0 镜像:
```bash
nvidia-docker pull registry.baidubce.com/paddlepaddle/paddle:2.2.0-gpu-cuda10.2-cudnn7
```
- 克隆 `PaddleSpeech` 仓库
```bash
git clone https://github.com/PaddlePaddle/PaddleSpeech.git
```
- 启动docker镜像
- 启动 docker 镜像
```bash
sudo nvidia-docker run --net=host --ipc=host --rm -it -v $(pwd)/PaddleSpeech:/PaddleSpeech registry.baidubce.com/paddlepaddle/paddle:2.2.0-gpu-cuda10.2-cudnn7 /bin/bash
```
- 进入PaddleSpeech目录
- 进入 PaddleSpeech 目录
```bash
cd /PaddleSpeech
```
完成这些以后你就可以在docker容器中执行训练、推理和超参fine-tune
### 选择2 使用有root权限的Ubuntu
- 使用apt安装`build-essential`
完成这些以后,你就可以在 docker 容器中执行训练、推理和超参 fine-tune。
### 选择2: 使用有 root 权限的 Ubuntu
- 使用apt安装 `build-essential`
```bash
sudo apt install build-essential
```
- 克隆 `PaddleSpeech` 仓库
```bash
git clone https://github.com/PaddlePaddle/PaddleSpeech.git
# 进入PaddleSpeech目录
cd PaddleSpeech
```
### 安装Conda
### 安装 Conda
```bash
# 下载并安装miniconda
# 下载并安装 miniconda
pushd tools
bash extras/install_miniconda.sh
popd
# 使用"bash" 命令激活Conda环境
# 使用 "bash" 命令激活Conda环境
bash
# 创建Conda虚拟环境
# 创建 Conda 虚拟环境
conda create -y -p tools/venv python=3.7
# 激活Conda虚拟环境:
# 激活 Conda 虚拟环境:
conda activate tools/venv
# 安装Conda包
# 安装 Conda 包
conda install -y -c conda-forge sox libsndfile swig bzip2 libflac bc
```
### 安装PaddlePaddle
请确认你系统是否有GPU并且使用了正确版本的paddlepaddle。例如系统使用CUDA 10.2, CuDNN7.5 你可以安装paddlepaddle-gpu 2.2.0
### 安装 PaddlePaddle
请确认你系统是否有 GPU,并且使用了正确版本的 paddlepaddle。例如系统使用 CUDA 10.2、CuDNN 7.5,你可以安装 paddlepaddle-gpu 2.2.0:
```bash
python3 -m pip install paddlepaddle-gpu==2.2.0
```
### 用开发者模式安装PaddleSpeech
### 用开发者模式安装 PaddleSpeech
```bash
pip install -e .[develop]
```
### 安装Kaldi可选
### 安装 Kaldi可选
```bash
pushd tools
bash extras/install_openblas.sh

@ -1,7 +1,7 @@
# PaddleSpeech
## What is PaddleSpeech?
PaddleSpeech is an open-source toolkit on PaddlePaddle platform for two critical tasks in Speech - Speech-to-Text (Automatic Speech Recognition, ASR) and Text-to-Speech Synthesis (TTS), with modules involving state-of-art and influential models.
PaddleSpeech is an open-source toolkit on the PaddlePaddle platform for two critical tasks in Speech - Speech-to-Text (Automatic Speech Recognition, ASR) and Text-to-Speech Synthesis (TTS), with modules involving state-of-art and influential models.
## What can PaddleSpeech do?
@ -29,7 +29,7 @@ PaddleSpeech ASR provides you with a complete ASR pipeline, including:
- attention decoding (used in Transformer and Conformer)
- attention rescoring (used in Transformer and Conformer)
Speech-to-Text helps you training the ASR model very simply.
Speech-to-Text helps you train the ASR model very simply.
### Text-to-Speech
TTS mainly consists of components below:
@ -53,4 +53,4 @@ PaddleSpeech TTS provides you with a complete TTS pipeline, including:
- Transfer Learning from Speaker Verification to Multispeaker Text-to-Speech Synthesis
- GE2E
Text-to-Speech helps you to train TTS models with simple commands.
Text-to-Speech helps you to train TTS models with simple commands.

@ -1,6 +1,6 @@
# Reference
We borrowed a lot of code from these repos to build `model` and `engine`, thanks for these great works and opensource community!
We borrowed a lot of code from these repos to build `model` and `engine`, thanks for these great works and the open-source community!
* [espnet](https://github.com/espnet/espnet/blob/master/LICENSE)
- Apache-2.0 License
@ -30,7 +30,7 @@ We borrowed a lot of code from these repos to build `model` and `engine`, thanks
* [chainer](https://github.com/chainer/chainer/blob/master/LICENSE)
- MIT License
- Updater, Trainer and some utils.
- Updater, Trainer, and some utils.
* [librosa](https://github.com/librosa/librosa/blob/main/LICENSE.md)
- ISC License

@ -1,5 +1,5 @@
# Parakeet
Parakeet aims to provide a flexible, efficient and state-of-the-art text-to-speech toolkit for the open-source community. It is built on PaddlePaddle dynamic graph and includes many influential TTS models.
Parakeet aims to provide a flexible, efficient, and state-of-the-art text-to-speech toolkit for the open-source community. It is built on PaddlePaddle dynamic graph and includes many influential TTS models.
<div align="center">
<img src="../../images/logo.png" width=300 /> <br>
@ -7,10 +7,10 @@ Parakeet aims to provide a flexible, efficient and state-of-the-art text-to-spee
## Overview
In order to facilitate exploiting the existing TTS models directly and developing the new ones, Parakeet selects typical models and provides their reference implementations in PaddlePaddle. Further more, Parakeet abstracts the TTS pipeline and standardizes the procedure of data preprocessing, common modules sharing, model configuration, and the process of training and synthesis. The models supported here include Text FrontEnd, end-to-end Acoustic models and Vocoders:
To facilitate exploiting the existing TTS models directly and developing the new ones, Parakeet selects typical models and provides their reference implementations in PaddlePaddle. Furthermore, Parakeet abstracts the TTS pipeline and standardizes the procedure of data preprocessing, common modules sharing, model configuration, and the process of training and synthesis. The models supported here include Text FrontEnd, end-to-end Acoustic models, and Vocoders:
- Text FrontEnd
- Rule based Chinese frontend.
- Rule-based Chinese frontend.
- Acoustic Models
- [【FastSpeech2】FastSpeech 2: Fast and High-Quality End-to-End Text to Speech](https://arxiv.org/abs/2006.04558)

@ -1,24 +1,25 @@
# Advanced Usage
This sections covers how to extend TTS by implementing your own models and experiments. Guidelines on implementation are also elaborated.
This section covers how to extend TTS by implementing your own models and experiments. Guidelines on implementation are also elaborated.
For the general deep learning experiment, there are several parts to deal with:
1. Preprocess the data according to the needs of the model, and iterate the dataset by batch.
2. Define the model, optimizer and other components.
2. Define the model, optimizer, and other components.
3. Write out the training process (generally including forward / backward calculation, parameter update, log recording, visualization, periodic evaluation, etc.).
4. Configure and run the experiment.
## PaddleSpeech TTS's Model Components
In order to balance the reusability and function of models, we divide models into several types according to its characteristics.
To balance the reusability and function of models, we divide models into several types according to their characteristics.
For the commonly used modules that can be used as parts of other larger models, we try to implement them to be as simple and universal as possible, because they will be reused. Modules with trainable parameters are generally implemented as subclasses of `paddle.nn.Layer`. Modules without trainable parameters can be implemented directly as functions whose inputs and outputs are `paddle.Tensor`s.
Models for a specific task are implemented as subclasses of `paddle.nn.Layer`. Models could be simple, like a single layer RNN. For complicated models, it is recommended to split the model into different components.
Models for a specific task are implemented as subclasses of `paddle.nn.Layer`. Models could be simple, like a single-layer RNN. For complicated models, it is recommended to split the model into different components.
For a seq-to-seq model, it's natural to split it into encoder and decoder. For a model composed of several similar layers, it's natural to extract the sublayer as a separate layer.
There are two common ways to define a model which consists of several modules.
1. Define a module given the specifications. Here is an example with multilayer perceptron.
1. Define a module given the specifications. Here is an example with a multilayer perceptron.
```python
class MLP(nn.Layer):
def __init__(self, input_size, hidden_size, output_size):
@ -44,11 +45,11 @@ There are two common ways to define a model which consists of several modules.
```
For a module defined in this way, it's harder for the user to initialize an instance. Users have to read the code to check what attributes are used.
Also, code in this style tend to be abused by passing a huge config object to initialize every module used in an experiment, thought each module may not need the whole configuration.
Also, code in this style tends to be abused by passing a huge config object to initialize every module used in an experiment, though each module may not need the whole configuration.
We prefer to be explicit.
2. Define a module as a combination given its components. Here is an example for a sequence-to-sequence model.
2. Define a module as a combination given its components. Here is an example of a sequence-to-sequence model.
```python
class Seq2Seq(nn.Layer):
def __init__(self, encoder, decoder):
@ -65,27 +66,27 @@ There are two common ways to define a model which consists of several modules.
# compose two components
model = Seq2Seq(encoder, decoder)
```
When a model is a complicated and made up of several components, each of which has a separate functionality, and can be replaced by other components with the same functionality, we prefer to define it in this way.
When a model is complicated and made up of several components, each of which has a separate functionality, and can be replaced by other components with the same functionality, we prefer to define it in this way.
In the directory structure of PaddleSpeech TTS, modules with high reusability are placed in `paddlespeech.t2s.modules`, but models for specific tasks are placed in `paddlespeech.t2s.models`. When developing a new model, developers need to consider the feasibility of splitting the modules, and the degree of generality of the modules, and place them in appropriate directories.
In the directory structure of PaddleSpeech TTS, modules with high reusability are placed in `paddlespeech.t2s.modules`, but models for specific tasks are placed in `paddlespeech.t2s.models`. When developing a new model, developers need to consider the feasibility of splitting the modules and their degree of generality, and place them in appropriate directories.
## PaddleSpeech TTS's Data Components
Another critical componnet for a deep learning project is data.
Another critical component for a deep learning project is data.
PaddleSpeech TTS uses the following methods for training data:
1. Preprocess the data.
2. Load the preprocessed data for training.
Previously, we wrote the preprocessing in the `__getitem__` of the Dataset, which will process when accessing a certain batch samples, but encountered some problems:
Previously, we wrote the preprocessing in the `__getitem__` of the Dataset, which processes a sample when it is accessed to build a batch, but we encountered some problems:
1. Efficiency problem. Even if Paddle has a design to load data asynchronously, when the batch size is large, each sample needs to be preprocessed and set up batches, which takes a lot of time , and may even seriously slow down the training process.
2. Data filtering problem. Some filtering conditions depend on the features of the processed sample. For example, filtering samples that are too short according to text length. If the text length can only be known after `__getitem__`, every time you filter, the entire dataset needed to be loaded once! In addition, if you do not pre-filter, A small exception (such as too short text ) in `__getitem__` will cause an exception in the entire data flow, which is not feasible, because `collate_fn ` presupposes that the acquisition of each sample can be normal. Even if some special flags, such as `None`, are used to mark data acquisition failures, and skip `collate_fn`, it will change batch_size .
1. Efficiency problem. Even if Paddle has a design to load data asynchronously, when the batch size is large, each sample needs to be preprocessed and set up batches, which takes a lot of time, and may even seriously slow down the training process.
2. Data filtering problem. Some filtering conditions depend on the features of the processed sample, for example, filtering samples that are too short according to text length. If the text length can only be known after `__getitem__`, the entire dataset needs to be loaded once every time you filter! In addition, if you do not pre-filter, a small exception (such as text that is too short) in `__getitem__` will cause an exception in the entire data flow, which is not feasible, because `collate_fn` presupposes that the acquisition of each sample can be normal. Even if some special flags, such as `None`, are used to mark data acquisition failures and skip them in `collate_fn`, it will change the batch_size.
Therefore, it is not realistic to put preprocessing entirely on `__getitem__`. We use the method mentioned above instead.
During preprocessing, we can do filtering, We can also save more intermediate features, such as text length, audio length, etc., which can be used for subsequent filtering. Because of the habit of TTS field, data is stored in multiple files, and the processed results are stored in `npy` format.
During preprocessing, we can do filtering. We can also save more intermediate features, such as text length, audio length, etc., which can be used for subsequent filtering. Following the convention of the TTS field, data is stored in multiple files, and the processed results are stored in `npy` format.
We use a list-like way to store metadata, with the file paths stored in it, so that you are not restricted by the specific storage location of the file. In addition to the file path, other metadata can also be stored in it, for example, the path of the text, the path of the audio, the path of the spectrum, the number of frames, the number of sampling points, and so on.
Then for the path, there are multiple opening methods, such as `sf.read`, `np.load`, etc., so it's best to use a parameter that can be input, we don't even want to determine the reading method by it's extension, it's best to let the users input it , in this way, users can define their own method to parse the data.
Then, since a path can be opened in multiple ways, such as `sf.read`, `np.load`, etc., it's best to make the reading method an input parameter. We don't even want to determine the reading method by the extension; it's best to let the users provide it, so that they can define their own method to parse the data.
So we learned from the design of `DataFrame`, but our construction method is simpler: it only needs a `list of dicts`, where each dict represents a record, and it's convenient to interact with formats such as `json` and `yaml`. For each selected field, we need to give a parser (called a `converter` in the interface), and that's it.
@ -109,7 +110,7 @@ class DataTable(Dataset):
converters : Dict[str, Callable], optional
Converters used to process each field, by default None
use_cache : bool, optional
Whether to use cache, by default False
Whether to use a cache, by default False
Raises
------
@ -125,11 +126,11 @@ class DataTable(Dataset):
converters: Dict[str, Callable]=None,
use_cache: bool=False):
```
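To make this concrete, here is a minimal usage sketch. The `data` and `fields` parameters, the field names, and the import path are assumptions based on the signature shown above.
```python
import numpy as np
from paddlespeech.t2s.datasets.data_table import DataTable  # assumed import path

# metadata: a list of dicts, one record per utterance (paths and field names are illustrative)
metadata = [
    {"utt_id": "0001", "feats": "dump/0001.npy", "num_frames": 320},
    {"utt_id": "0002", "feats": "dump/0002.npy", "num_frames": 410},
]

table = DataTable(
    data=metadata,
    fields=["utt_id", "feats"],      # only these fields appear in each example
    converters={"feats": np.load},   # how to parse the stored path; "utt_id" is returned as-is
    use_cache=True,
)

example = table[0]                   # -> {"utt_id": "0001", "feats": <np.ndarray>}
```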
It's `__getitem__` method is to parse each field with their own parser, and then compose a dictionary to return.
Its `__getitem__` method parses each field with its converter and then composes the results into a dictionary to return.
```python
def _convert(self, meta_datum: Dict[str, Any]) -> Dict[str, Any]:
"""Convert a meta datum to an example by applying the corresponding
converters to each fields requested.
converters to each field requested.
Parameters
----------
@ -163,23 +164,23 @@ A typical training process includes the following processes:
6. Write logs, visualize, and in some cases save necessary intermediate results.
7. Save the state of the model and optimizer.
Here, we mainly introduce the training related components of TTS in Pa and why we designed it like this.
### Global Repoter
When training and modifying Deep Learning modelslogging is often needed, and it has even become the key to model debugging and modifying. We usually use various visualization toolssuch as , `visualdl` in `paddle`, `tensorboard` in `tensorflow` and `vidsom`, `wnb` ,etc. Besides, `logging` and `print` are usuaally used for different purpose.
Here, we mainly introduce the training-related components of PaddleSpeech TTS and why we designed them like this.
### Global Reporter
When training and modifying deep learning models, logging is often needed, and it has even become key to model debugging and modification. We usually use various visualization tools, such as `visualdl` in `paddle`, `tensorboard` in `tensorflow`, and `visdom`, `wandb`, etc. Besides, `logging` and `print` are usually used for different purposes.
In these tools, `print` is the simplestit doesn't have the concept of `logger` and `handler` in `logging``summarywriter` and `logdir` in `tensorboard`, when printing, there is no need for `global_step` It's light enough to appear anywhere in the code, and it's printed to a common stdout. Of course, its customizability is limited, for example, it is no longer intuitive when printing dictionaries or more complex objects. And it's fleeting, people need to use redirection to save information.
In these tools, `print` is the simplest: it doesn't have the concept of a `logger` and `handler` as in `logging`, or a `summary writer` and `logdir` as in `tensorboard`, and when printing there is no need for a `global_step`. It's light enough to appear anywhere in the code, and it prints to a common stdout. Of course, its customizability is limited; for example, it is no longer intuitive when printing dictionaries or more complex objects. And it's fleeting: people need to use redirection to save the information.
For TTS models developmentwe hope to have a more universal multimedia stdout, which is actually a tool similar to `tensorboard`, which allows many multimedia forms, but it needs a `summary writer` when using, and a `step` when writing information. If the data are images or voices, some format control parameters are needed.
For TTS model development, we hope to have a more universal multimedia stdout, which is actually a tool similar to `tensorboard` that allows many multimedia forms, but needs a `summary writer` when used and a `step` when writing information. If the data are images or voices, some format control parameters are also needed.
This will destroy the modular design to a certain extent. For example, If my model is composed of multiple sublayers, and I want to record some important information in the forward method of some sublayers. For this reason, I may need to pass the `summary writer` to this sublayers, but for the sublayers, its function is calculation, it should not have extra considerations, and it's also difficult for us to tolerate that the initialization of an `nn.Linear` has an optional `visualizer` in the method. And, for a calculation module, **HOW** can it know the global step? These are things related to the training process!
This will destroy the modular design to a certain extent. For example, if my model is composed of multiple sublayers and I want to record some important information in the forward method of some sublayers, I may need to pass the `summary writer` to these sublayers. But a sublayer's function is computation; it should not have extra considerations, and it's also difficult for us to tolerate that the initialization of an `nn.Linear` has an optional `visualizer` in its method. And, for a calculation module, **HOW** can it know the global step? These are things related to the training process!
Therefore, a more common approach is not to put the logging code in the definition of a layer, but to return the values to be logged, obtain them during training, and write them to the `summary writer`. However, the return values need to be modified: the `summary writer` is a broadcaster at the training level, and each module transmits information to it by modifying its return values.
We think this method is a little ugly. We prefer to return the necessary information only rather than change the return values to accommodate visualization and recording. When you need to report some information, you should be able to report it without difficult. So we imitate the design of `chainer` and use the `global repoter`.
We think this method is a little ugly. We prefer to return only the necessary information rather than change the return values to accommodate visualization and recording. When you need to report some information, you should be able to report it without difficulty. So we imitate the design of `chainer` and use the `global reporter`.
It takes advantage of the globality of Python's module level variables and the effect of context manager.
It takes advantage of the globality of Python's module-level variables and the effect of context manager.
There is a module level variable in `paddlespeech/t2s/training/reporter.py` `OBSERVATIONS`which is a `Dict` to store key-value.
There is a module-level variable `OBSERVATIONS` in `paddlespeech/t2s/training/reporter.py`, which is a `Dict` used to store key-value pairs.
```python
# paddlespeech/t2s/training/reporter.py
@ -242,7 +243,7 @@ def test_reporter_scope():
assert third == {'third_begin': 3, 'third_end': 4}
```
In this way, when we write modular components, we can directly call `report`. The caller will decide where to report as long as it's ready for `OBSERVATION`, then it opens a `scope` and calls the component within this `scope`.
In this way, when we write modular components, we can directly call `report`. The caller decides where to report: as long as it has prepared an `OBSERVATION`, it opens a `scope` and calls the component within this `scope`.
The `Trainer` in PaddleSpeech TTS reports the information in this way.
```python
@ -257,11 +258,11 @@ while True:
```
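To make the pattern concrete, here is a minimal sketch. The `report` and `scope` names follow the description above; treat the exact import as an assumption.
```python
from paddlespeech.t2s.training.reporter import report, scope  # assumed import

def compute_loss():
    loss = 0.123                   # stand-in for a real computation
    report("train/loss", loss)     # any component can report; no writer or global_step needed here
    return loss

observation = {}
with scope(observation):           # the caller decides where reported values are collected
    compute_loss()
print(observation)                 # {'train/loss': 0.123}
```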
### Updater: Model Training Process
In order to maintain the purity of function and the reusability of code, we abstract the model code into a subclass of `paddle.nn.Layer`, and write the core computing functions in it.
To maintain the purity of function and the reusability of code, we abstract the model code into a subclass of `paddle.nn.Layer`, and write the core computing functions in it.
We tend to write the forward process of training in `forward()`, but only up to the prediction result, not the loss. Therefore, this module can be called by a larger module.
However, when we compose an experiment, we need to add some other things, such as training process, evaluation process, checkpoint saving, visualization and the like. In this process, we will encounter some things that only exist in the training process, such as `optimizer`, `learning rate scheduler`, `visualizer`, etc. These things are not part of the model, they should **NOT** be written in the model code.
However, when we compose an experiment, we need to add some other things, such as the training process, evaluation process, checkpoint saving, visualization, and the like. In this process, we will encounter some things that only exist in the training process, such as `optimizer`, `learning rate scheduler`, `visualizer`, etc. These things are not part of the model, they should **NOT** be written in the model code.
We made an abstraction for these intermediate processes, namely the `Updater`, which takes the `model`, `optimizer`, and `data stream` as input, and whose function is training. Since the training methods of different models may differ, we tend to write a corresponding `Updater` for each model. This is still different from the final training script: there is a certain degree of encapsulation, which factors out the details of periodic saving, visualization, evaluation, etc., and retains only the most basic function, that is, training the model.
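As an illustration of this division of responsibilities, here is a generic sketch, not the exact PaddleSpeech `Updater` interface; the batch keys and the loss are illustrative.
```python
import paddle
import paddle.nn.functional as F

class ExampleUpdater:
    """Owns the model, optimizer and data stream, and knows how to do one training step."""

    def __init__(self, model, optimizer, dataloader):
        self.model = model
        self.optimizer = optimizer
        self.iterator = iter(dataloader)

    def update(self):
        batch = next(self.iterator)
        pred = self.model(batch["phone_ids"])    # forward() only predicts
        loss = F.mse_loss(pred, batch["spec"])   # the loss lives here, not in the model
        loss.backward()
        self.optimizer.step()
        self.optimizer.clear_grad()
        return float(loss)
```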
@ -273,23 +274,23 @@ Deep learning experiments often have many options to configure. These configurat
1. Data source and data processing mode configuration.
2. Save path configuration of experimental results.
3. Data preprocessing mode configuration.
4. Model structure and hyperparameterconfiguration.
4. Model structure and hyperparameter configuration.
5. Training process configuration.
It's common to change the running configuration to compare results. To keep track of the running configuration, we use `yaml` configuration files.
Also, we want to interact with command line options. Some options that usually change according to running environments is provided by command line arguments. In addition, we want to override an option in the config file without editing it.
Also, we want to interact with command-line options. Some options that usually change according to running environments are provided by command line arguments. In addition, we want to override an option in the config file without editing it.
Taking these requirements in to consideration, we use [yacs](https://github.com/rbgirshick/yacs) as a config management tool. Other tools like [omegaconf](https://github.com/omry/omegaconf) are also powerful and have similar functions.
Taking these requirements into consideration, we use [yacs](https://github.com/rbgirshick/yacs) as a config management tool. Other tools like [omegaconf](https://github.com/omry/omegaconf) are also powerful and have similar functions.
In each example provided, there is a `config.py`, the default config is defined at `conf/default.yaml`. If you want to get the default config, import `config.py` and call `get_cfg_defaults()` to get it. Then it can be updated with `yaml` config file or command line arguments if needed.
In each example provided, there is a `config.py`, and the default config is defined in `conf/default.yaml`. If you want to get the default config, import `config.py` and call `get_cfg_defaults()`. It can then be updated with a `yaml` config file or command-line arguments if needed.
For details about how to use yacs in experiments, see [yacs](https://github.com/rbgirshick/yacs).
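A minimal sketch of the `config.py` pattern described above is given below; the option names are illustrative, not taken from a specific example.
```python
from yacs.config import CfgNode as CN

_C = CN()
_C.data = CN()
_C.data.batch_size = 32
_C.model = CN()
_C.model.hidden_size = 256
_C.training = CN()
_C.training.max_epoch = 200
_C.training.lr = 1e-3

def get_cfg_defaults():
    """Return a clone so that callers can modify it without touching the defaults."""
    return _C.clone()

# typical usage: start from the defaults, then merge a yaml file and command-line overrides
cfg = get_cfg_defaults()
# cfg.merge_from_file("conf/default.yaml")
# cfg.merge_from_list(["training.lr", 2e-4])
```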
The following is the basic `ArgumentParser`:
1. `--config` is used to support configuration file parsing, and the configuration file itself handles the unique options of each experiment.
2. `--train-metadata` is the path to the training data.
3. `--output-dir` is the dir to save the training results.if there are checkpoints in `checkpoints/` of `--output-dir` , it's defalut to reload the newest checkpoint to train)
3. `--output-dir` is the dir to save the training results. (If there are checkpoints in `checkpoints/` of `--output-dir`, it reloads the newest checkpoint to resume training by default.)
4. `--ngpu` determines the operation mode. `--ngpu` refers to the number of training processes; if `ngpu` > 0, GPU is used, otherwise CPU is used.
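A minimal sketch of such a parser follows; the help strings and the default for `--ngpu` are assumptions.
```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser(description="Train a TTS model.")
    parser.add_argument("--config", type=str, help="yaml config file to overwrite the default config")
    parser.add_argument("--train-metadata", type=str, help="path to the training data metadata")
    parser.add_argument("--output-dir", type=str, help="directory to save checkpoints and logs")
    parser.add_argument("--ngpu", type=int, default=1, help="number of GPUs; 0 means CPU only")
    return parser

args = build_parser().parse_args([])  # pass [] here only to show the defaults
print(args.ngpu)                      # -> 1
```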
Developers can refer to the examples in `examples` to write the default configuration file when adding new experiments.
@ -313,13 +314,13 @@ The experimental codes in PaddleSpeech TTS are generally organized as follows:
```
The `*.py` files called by the above `*.sh` scripts are located in `${BIN_DIR}/`.
We add a named argument. `--output-dir` to each training script to specify the output directory. The directory structure is as follows, It's best for developers to follow this specification:
We add a named argument `--output-dir` to each training script to specify the output directory. The directory structure is as follows; developers should follow this specification:
```text
exp/default/
├── checkpoints/
│ ├── records.jsonl (record file)
│ └── snapshot_iter_*.pdz (checkpoint files)
├── config.yaml (config fille of this experiment)
├── config.yaml (config file of this experiment)
├── vdlrecords.*.log (visualdl record file)
├── worker_*.log (text logging, one file per process)
├── validation/ (output dir during training, information_iter_*/ is the output of each step, if necessary)
@ -327,4 +328,4 @@ exp/default/
└── test/ (output dir of synthesis results)
```
You can view the examples we provide in `examples`. These experiments are provided to users as examples which can be run directly. Users are welcome to add new models and experiments and contribute code to PaddleSpeech.
You can view the examples we provide in `examples`. These experiments are provided to users as examples that can be run directly. Users are welcome to add new models and experiments and contribute code to PaddleSpeech.

@ -1,9 +1,9 @@
# Models introduction
TTS system mainly includes three modules: `Text Frontend`, `Acoustic model` and `Vocoder`. We introduce a rule based Chinese text frontend in [cn_text_frontend.md](./cn_text_frontend.md). Here, we will introduce acoustic models and vocoders, which are trainable models.
TTS system mainly includes three modules: `Text Frontend`, `Acoustic model` and `Vocoder`. We introduce a rule-based Chinese text frontend in [cn_text_frontend.md](./cn_text_frontend.md). Here, we will introduce acoustic models and vocoders, which are trainable.
The main processes of TTS include:
1. Convert the original text into characters/phonemes, through `text frontend` module.
2. Convert characters/phonemes into acoustic features , such as linear spectrogram, mel spectrogram, LPC features, etc. through `Acoustic models`.
1. Convert the original text into characters/phonemes, through the `text frontend` module.
2. Convert characters/phonemes into acoustic features, such as linear spectrogram, mel spectrogram, LPC features, etc. through `Acoustic models`.
3. Convert acoustic features into waveforms through `Vocoders`.
A simple text frontend module can be implemented by rules. Acoustic models and vocoders need to be trained. The models provided by PaddleSpeech TTS are acoustic models and vocoders.
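Schematically, the three stages above compose into a single pipeline; the function names below are purely illustrative.
```python
def tts(text, frontend, acoustic_model, vocoder):
    phonemes = frontend(text)        # 1. text frontend: text -> characters/phonemes
    mel = acoustic_model(phonemes)   # 2. acoustic model: phonemes -> acoustic features (e.g. mel spectrogram)
    wav = vocoder(mel)               # 3. vocoder: acoustic features -> waveform
    return wav
```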
@ -59,7 +59,7 @@ At present, there are two mainstream acoustic model structures.
**Advantage of Tacotron:**
- No need for complex text frontend analysis modules.
- No need for additional duration model.
- No need for an additional duration model.
- Greatly simplifies the acoustic model construction process and reduces the dependence of speech synthesis tasks on domain knowledge.
**Disadvantages of Tacotron:**
@ -67,7 +67,7 @@ At present, there are two mainstream acoustic model structures.
- Global soft attention.
- Poor stability for speech synthesis tasks.
- In training, the fewer speech frames predicted at each moment, the more difficult it is to train.
- Phase problem in Griffin-Lim casues speech distortion during wave reconstruction.
- Phase problem in Griffin-Lim causes speech distortion during wave reconstruction.
- The autoregressive decoder cannot be stopped during the generation process.
#### Tacotron2
@ -81,12 +81,12 @@ At present, there are two mainstream acoustic model structures.
- CBHG -> 5 Conv layers.
- The input and output of the PostNet calculate `L2` loss with real Mel spectrogram.
- Residual connection.
- Bad stop in autoregressive decoder.
- Bad stop in an autoregressive decoder.
- Predict whether it should stop at each moment of decoding (stop token).
- Set a threshold to determine whether to stop generating when decoding.
- Stability of attention.
- Location-aware attention.
- The alignment matrix of previous time is considered at the step `t` of decoder.
- The alignment matrix of the previous time is considered at step `t` of the decoder.
<div align="left">
<img src="https://raw.githubusercontent.com/PaddlePaddle/PaddleSpeech/develop/docs/images/tacotron2.png" width=500 /> <br>
@ -96,7 +96,7 @@ You can find PaddleSpeech TTS's tacotron2 with LJSpeech dataset example at [exam
### TransformerTTS
**Disadvantages of the Tacotrons:**
- Encodr and decoder are relatively weak at global information modeling
- Encoder and decoder are relatively weak at global information modeling
- Vanishing gradient of RNN.
- Fixed-length context modeling problem in CNN kernel.
- Training is relatively inefficient.
@ -105,7 +105,7 @@ You can find PaddleSpeech TTS's tacotron2 with LJSpeech dataset example at [exam
Transformer TTS is a combination of Tacotron2 and Transformer.
#### Transformer
[Transformer](https://arxiv.org/abs/1706.03762) is a seq2seq model based entirely on attention mechanism.
[Transformer](https://arxiv.org/abs/1706.03762) is a seq2seq model based entirely on an attention mechanism.
**Features of Transformer:**
- Encoder.
@ -113,7 +113,7 @@ Transformer TTS is a combination of Tacotron2 and Transformer.
- Positional Encoding.
- Decoder.
- `N` blocks based on self-attention mechanism.
- Add Mask to the self-attention in blocks to cover up the information after `t` step.
- Add a mask to the self-attention in blocks to cover up the information after step `t` (see the mask sketch after this list).
- Attentions between encoder and decoder.
- Positional Encoding.
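As referenced in the decoder list above, here is a minimal numpy sketch of such a mask; the shapes and values are illustrative.
```python
import numpy as np

def causal_mask(T: int) -> np.ndarray:
    # position i may attend to positions j <= i only
    return np.tril(np.ones((T, T), dtype=np.float32))

scores = np.random.randn(4, 4).astype(np.float32)                 # toy attention scores
masked = np.where(causal_mask(4) == 1, scores, -1e9)              # hide future steps
weights = np.exp(masked) / np.exp(masked).sum(-1, keepdims=True)  # row-wise softmax
```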
@ -153,34 +153,34 @@ You can find PaddleSpeech TTS's Transformer TTS with LJSpeech dataset example at
**Disadvantage of seq2seq models:**
- In the seq2seq model based on attention, no matter how the attention mechanism is improved, it's difficult to avoid generation errors in the decoding stage.
Frame level acoustic models use duration models to determine the pronunciation duration of phonemes, and the frame level mapping does not have the uncertainty of sequence generation.
Frame-level acoustic models use duration models to determine the pronunciation duration of phonemes, and the frame-level mapping does not have the uncertainty of sequence generation.
In seq2seq models, the concept of duration models is used as the alignment module between the two sequences to replace attention, which can avoid the uncertainty in attention and significantly improve the stability of the seq2seq models.
#### FastSpeech
Instead of using the encoder-attention-decoder based architecture as adopted by most seq2seq based autoregressive and non-autoregressive generation, [FastSpeech](https://arxiv.org/abs/1905.09263) is a novel feed-forward structure, which can generate a target mel spectrogram sequence in parallel.
Instead of using the encoder-attention-decoder based architecture as adopted by most seq2seq based autoregressive and non-autoregressive generation, [FastSpeech](https://arxiv.org/abs/1905.09263) is a novel feed-forward structure, which can generate a target mel spectrogram sequence in parallel.
**Features of FastSpeech:**
- Encoder: based on Transformer.
- Change `FFN` to `CNN` in self-attention.
- Model local dependency.
- Length regulator.
- Use real phoneme durations to expand output frame of encoder during training.
- Non autoregressive decode.
- Use real phoneme durations to expand the output frame of the encoder during training.
- Non-autoregressive decode.
- Improve generation efficiency.
**Length predictor:**
- Pretrain a TransformerTTS model.
- Get alignment matrix of train data.
- Caculate the phoneme durations according to the probability of the alignment matrix.
- Use the output of encoder to predict the phoneme durations and calculate the MSE loss.
- Use real phoneme durations to expand output frame of encoder during training.
- Calculate the phoneme durations according to the probability of the alignment matrix.
- Use the output of the encoder to predict the phoneme durations and calculate the MSE loss.
- Use real phoneme durations to expand the output frame of the encoder during training.
- Use phoneme durations predicted by the duration model to expand the frame during prediction.
- Attention cannot control phoneme durations. Explicit duration modeling can control durations through a duration coefficient (the duration coefficient is `1` during training); a sketch of the length-regulation idea follows this list.
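Here is that minimal sketch: each phoneme's encoder output is repeated `duration` times. The shapes are illustrative, not the repo's implementation.
```python
import numpy as np

def length_regulate(encoder_out: np.ndarray, durations: np.ndarray) -> np.ndarray:
    """Repeat each phoneme's encoder output `duration` times to get frame-level features."""
    # encoder_out: (num_phonemes, hidden), durations: (num_phonemes,) in frames
    return np.repeat(encoder_out, durations, axis=0)

enc = np.arange(6, dtype=np.float32).reshape(3, 2)  # 3 phonemes, hidden size 2
dur = np.array([2, 1, 3])                           # ground-truth (training) or predicted (inference) durations
frames = length_regulate(enc, dur)                  # shape (6, 2)
```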
**Advantages of non-autoregressive decoder:**
- The built-in duration model of the seq2seq model has converted the input length `M` to the output length `N`.
- The length of output is known, `stop token` is no longer used, avoiding the problem of being unable to stop.
- The length of the output is known, `stop token` is no longer used, avoiding the problem of being unable to stop.
- Can be generated in parallel (decoding time is less affected by sequence length).
<div align="left">
@ -198,27 +198,27 @@ Instead of using the encoder-attention-decoder based architecture as adopted by
**Disadvantages of FastSpeech:**
- The teacher-student distillation pipeline is complicated and time-consuming.
- The duration extracted from the teacher model is not accurate enough.
- The target mel spectrograms distilled from teacher model suffer from information loss due to data simplification.
- The target mel spectrograms distilled from the teacher model suffer from information loss due to data simplification.
[FastSpeech2](https://arxiv.org/abs/2006.04558) addresses the issues in FastSpeech and better solves the one-to-many mapping problem in TTS.
**Features of FastSpeech2:**
- Directly training the model with ground-truth target instead of the simplified output from teacher.
- Introducing more variation information of speech as conditional inputs, extract `duration`, `pitch` and `energy` from speech waveform and directly take them as conditional inputs in training and use predicted values in inference.
- Directly train the model with the ground-truth target instead of the simplified output from the teacher.
- Introducing more variation information of speech as conditional inputs, extract `duration`, `pitch`, and `energy` from speech waveform and directly take them as conditional inputs in training and use predicted values in inference.
FastSpeech2 is similar to FastPitch but introduces more variation information of speech.
FastSpeech2 is similar to FastPitch but introduces more variation information of the speech.
<div align="left">
<img src="https://raw.githubusercontent.com/PaddlePaddle/PaddleSpeech/develop/docs/images/fastspeech2.png" width=800 /> <br>
</div>
You can find PaddleSpeech TTS's FastSpeech2/FastPitch with CSMSC dataset example at [examples/csmsc/tts3](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/tts3), We use token-averaged pitch and energy values introduced in FastPitch rather than frame level ones in FastSpeech2.
You can find PaddleSpeech TTS's FastSpeech2/FastPitch with CSMSC dataset example at [examples/csmsc/tts3](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/tts3). We use token-averaged pitch and energy values introduced in FastPitch rather than frame-level ones in FastSpeech2.
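A minimal sketch of token averaging, assuming a frame-level pitch contour and phoneme durations; this is an illustration, not the repo's implementation.
```python
import numpy as np

def token_average(frame_values: np.ndarray, durations: np.ndarray) -> np.ndarray:
    """Average a frame-level sequence (e.g. pitch) over phoneme durations."""
    assert durations.sum() == len(frame_values)
    out, start = [], 0
    for d in durations:
        seg = frame_values[start:start + d]
        out.append(seg.mean() if d > 0 else 0.0)
        start += d
    return np.asarray(out, dtype=np.float32)

# e.g. 3 phonemes covering 2 + 3 + 1 = 6 frames
pitch = np.array([100.0, 110.0, 200.0, 210.0, 190.0, 150.0], dtype=np.float32)
print(token_average(pitch, np.array([2, 3, 1])))  # -> [105. 200. 150.]
```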
### SpeedySpeech
[SpeedySpeech](https://arxiv.org/abs/2008.03802) simplifies the teacher-student architecture of FastSpeech and provides a fast and stable training procedure.
**Features of SpeedySpeech:**
- Use a simpler, smaller and faster-to-train convolutional teacher model ([Deepvoice3](https://arxiv.org/abs/1710.07654) and [DCTTS](https://arxiv.org/abs/1710.08969)) with a single attention layer instead of Transformer used in FastSpeech.
- Use a simpler, smaller, and faster-to-train convolutional teacher model ([Deepvoice3](https://arxiv.org/abs/1710.07654) and [DCTTS](https://arxiv.org/abs/1710.08969)) with a single attention layer instead of Transformer used in FastSpeech.
- Show that self-attention layers in the student network are not needed for high-quality speech synthesis.
- Describe a simple data augmentation technique that can be used early in the training to make the teacher network robust to sequential error propagation.
@ -233,7 +233,7 @@ In speech synthesis, the main task of the vocoder is to convert the spectral par
Taking into account the short-term change frequency of the waveform, the acoustic model usually avoids directly modeling the speech waveform; it first models the spectral features extracted from the speech waveform, and then the decoding part of the vocoder reconstructs the waveform.
A vocoder usually consists of a pair of encoders and decoders for speech analysis and synthesis. The encoder estimate the parameters, and then the decoder restores the speech.
A vocoder usually consists of a pair of encoders and decoders for speech analysis and synthesis. The encoder estimates the parameters, and then the decoder restores the speech.
Vocoders based on neural networks usually learn the mapping relationship from spectral features to waveforms from training data.
@ -262,11 +262,11 @@ Vocoders based on neural networks usually is speech synthesis, which learns the
- DiffWave
**Motivations of GAN-based vocoders:**
- Modeling speech signal by estimating probability distribution usually has high requirements for the expression ability of the model itself. In addition, specific assumptions need to be made about the distribution of waveforms.
- Modeling speech signals by estimating probability distribution usually has high requirements for the expression ability of the model itself. In addition, specific assumptions need to be made about the distribution of waveforms.
- Although autoregressive neural vocoders can obtain high-quality synthetic speech, such models usually have a **slow generation speed**.
- The training of inverse autoregressive flow vocoders is complex, and they also require the modeling capability of long term context information.
- The training of inverse autoregressive flow vocoders is complex, and they also require the modeling capability of long-term context information.
- Vocoders based on Bipartite Transformation converge slowly and are complex.
- GAN-based vocoders don't need to make assumptions about the speech distribution, and train through adversarial learning.
- GAN-based vocoders don't need to make assumptions about the speech distribution and train through adversarial learning.
Here, we introduce a Flow-based vocoder WaveFlow and a GAN-based vocoder Parallel WaveGAN.
@ -274,14 +274,14 @@ Here, we introduce a Flow-based vocoder WaveFlow and a GAN-based vocoder Paralle
[WaveFlow](https://arxiv.org/abs/1912.01219) is proposed by Baidu Research.
**Features of WaveFlow:**
- It can synthesize 22.05 kHz high-fidelity speech around 40x faster than real-time on a Nvidia V100 GPU without engineered inference kernels, which is faster than [WaveGlow](https://github.com/NVIDIA/waveglow) and serveral orders of magnitude faster than WaveNet.
- It is a small-footprint flow-based model for raw audio. It has only 5.9M parameters, which is 15x smalller than WaveGlow (87.9M).
- It can synthesize 22.05 kHz high-fidelity speech around 40x faster than real-time on an Nvidia V100 GPU without engineered inference kernels, which is faster than [WaveGlow](https://github.com/NVIDIA/waveglow) and several orders of magnitude faster than WaveNet.
- It is a small-footprint flow-based model for raw audio. It has only 5.9M parameters, which is 15x smaller than WaveGlow (87.9M).
- It is directly trained with maximum likelihood without probability density distillation and auxiliary losses as used in [Parallel WaveNet](https://arxiv.org/abs/1711.10433) and [ClariNet](https://openreview.net/pdf?id=HklY120cYm), which simplifies the training pipeline and reduces the cost of development.
You can find PaddleSpeech TTS's WaveFlow with LJSpeech dataset example at [examples/ljspeech/voc0](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/ljspeech/voc0).
### Parallel WaveGAN
[Parallel WaveGAN](https://arxiv.org/abs/1910.11480) trains a non-autoregressive WaveNet variant as a generator in a GAN based training method.
[Parallel WaveGAN](https://arxiv.org/abs/1910.11480) trains a non-autoregressive WaveNet variant as a generator in a GAN-based training method.
**Features of Parallel WaveGAN:**

@ -1,9 +1,9 @@
# Quick Start of Text-to-Speech
The examples in PaddleSpeech are mainly classified by datasets, and the TTS datasets we mainly use are:
* CSMSC (Mandarin single speaker)
* AISHELL3 (Mandarin multiple speaker)
* AISHELL3 (Mandarin multiple speakers)
* LJSpeech (English single speaker)
* VCTK (English multiple speaker)
* VCTK (English multiple speakers)
The models in PaddleSpeech TTS have the following mapping relationship:
* tts0 - Tacotron2
@ -14,6 +14,8 @@ The models in PaddleSpeech TTS have the following mapping relationship:
* voc1 - Parallel WaveGAN
* voc2 - MelGAN
* voc3 - MultiBand MelGAN
* voc4 - Style MelGAN
* voc5 - HiFiGAN
* vc0 - Tacotron2 Voice Clone with GE2E
* vc1 - FastSpeech2 Voice Clone with GE2E
@ -22,7 +24,7 @@ The models in PaddleSpeech TTS have the following mapping relationship:
Let's take FastSpeech2 + Parallel WaveGAN with the CSMSC dataset as an example: [examples/csmsc](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc)
### Train Parallel WaveGAN with CSMSC
- Go to directory
- Go to the directory
```bash
cd examples/csmsc/voc1
```
@ -37,9 +39,9 @@ Let's take a FastSpeech2 + Parallel WaveGAN with CSMSC dataset for instance. [ex
```bash
bash run.sh
```
This is just a demo, please make sure source data have been prepared well and every `step` works well before next `step`.
This is just a demo; please make sure the source data has been prepared well and that every `step` works well before moving to the next `step`.
### Train FastSpeech2 with CSMSC
- Go to directory
- Go to the directory
```bash
cd examples/csmsc/tts3
```
@ -49,26 +51,26 @@ Let's take a FastSpeech2 + Parallel WaveGAN with CSMSC dataset for instance. [ex
```
**Must do this before you start to do anything.**
Set `MAIN_ROOT` as the project dir. Use the `fastspeech2` model as `MODEL`.
- Main entrypoint
- Main entry point
```bash
bash run.sh
```
This is just a demo, please make sure source data have been prepared well and every `step` works well before next `step`.
This is just a demo; please make sure the source data has been prepared well and that every `step` works well before moving to the next `step`.
The steps in `run.sh` mainly include:
- source path.
- preprocess the dataset,
- train the model.
- synthesize waveform from metadata.jsonl.
- synthesize waveform from text file. (in acoustic models)
- inference using static model. (optional)
- synthesize waveform from a text file. (in acoustic models)
- inference using a static model. (optional)
For more details , you can see `README.md` in examples.
For more details, you can see `README.md` in examples.
## Pipeline of TTS
This section shows how to use pretrained models provided by TTS and make inference with them.
This section shows how to use pretrained models provided by TTS and make an inference with them.
Pretrained models in TTS are provided in a archive. Extract it to get a folder like this:
Pretrained models in TTS are provided in an archive. Extract it to get a folder like this:
**Acoustic Models:**
```text
checkpoint_name
@ -87,15 +89,15 @@ checkpoint_name
└── stats.npy
```
- `default.yaml` stores the config used to train the model.
- `snapshot_iter_*.pdz` is the chechpoint file, where `*` is the steps it has been trained.
- `*_stats.npy` is the stats file of feature if it has been normalized before training.
- `phone_id_map.txt` is the map of phonemes to phoneme_ids.
- `tone_id_map.txt` is the map of tones to tones_ids, when you split tones and phones before training acoustic models. (for example in our csmsc/speedyspeech example)
- `spk_id_map.txt` is the map of spkeaker to spk_ids in multi-spk acoustic models. (for example in our aishell3/fastspeech2 example)
- `snapshot_iter_*.pdz` is the checkpoint file, where `*` is the steps it has been trained.
- `*_stats.npy` is the stats file of the feature if it has been normalized before training.
- `phone_id_map.txt` is the map of phonemes to phoneme_ids.
- `tone_id_map.txt` is the map of tones to tones_ids, when you split tones and phones before training acoustic models. (for example in our csmsc/speedyspeech example)
- `spk_id_map.txt` is the map of speakers to spk_ids in multi-spk acoustic models. (for example in our aishell3/fastspeech2 example)
The example code below shows how to use the models for prediction.
### Acoustic Models (text to spectrogram)
The code below show how to use a `FastSpeech2` model. After loading the pretrained model, use it and normalizer object to construct a prediction objectthen use `fastspeech2_inferencet(phone_ids)` to generate spectrograms, which can be further used to synthesize raw audio with a vocoder.
The code below shows how to use a `FastSpeech2` model. After loading the pretrained model, use it and the normalizer object to construct a prediction object, then use `fastspeech2_inference(phone_ids)` to generate spectrograms, which can be further used to synthesize raw audio with a vocoder.
```python
from pathlib import Path
@ -153,7 +155,7 @@ for part_phone_ids in phone_ids:
```
### Vocoder (spectrogram to wave)
The code below show how to use a ` Parallel WaveGAN` model. Like the example above, after loading the pretrained model, use it and normalizer object to construct a prediction objectthen use `pwg_inference(mel)` to generate raw audio (in wav format).
The code below shows how to use a `Parallel WaveGAN` model. Like the example above, after loading the pretrained model, use it and the normalizer object to construct a prediction object, then use `pwg_inference(mel)` to generate raw audio (in wav format).
```python
from pathlib import Path

@ -1,4 +1,4 @@
# Chinese Rule Based Text Frontend
# Chinese Rule-Based Text Frontend
A TTS system mainly includes three modules: `Text Frontend`, `Acoustic model` and `Vocoder`. We provide a complete Chinese text frontend module in PaddleSpeech TTS; see examples in [examples/other/tn](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/tn) and [examples/other/g2p](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/g2p).
A text frontend module mainly includes:
@ -42,7 +42,7 @@ Among them, Text Normalization and G2P are the most important modules. We mainly
## Grapheme-to-Phoneme
In Chinese, G2P is a very complex module, which mainly includes **polyphone** and **tone sandhi**.
We use [g2pM](https://github.com/kakaobrain/g2pM) and [pypinyin](https://github.com/mozillazg/python-pinyin) as the defalut g2p tools. They can solve the problem of polyphone to a certain extent. In the future, we intend to use a trainable language model (for example, [BERT](https://arxiv.org/abs/1810.04805)) for polyphone.
We use [g2pM](https://github.com/kakaobrain/g2pM) and [pypinyin](https://github.com/mozillazg/python-pinyin) as the default g2p tools. They can solve the problem of polyphones to a certain extent. In the future, we intend to use a trainable language model (for example, [BERT](https://arxiv.org/abs/1810.04805)) for polyphones.
However, g2pM and pypinyin do not perform well in tone sandhi, we use rules to solve this problem, which requires relevant linguistic knowledge.

@ -1,89 +1,63 @@
# DeepSpeech2 offline/online ASR with Aishell
This example contains code used to train a DeepSpeech2 offline or online model with the [Aishell dataset](http://www.openslr.org/resources/33).
## Overview
All the scirpts you need are in the ```run.sh```. There are several stages in the ```run.sh```, and each stage has its function.
All the scripts you need are in the `run.sh`. There are several stages in the `run.sh`, and each stage has its function.
| Stage | Function |
| :---- | :----------------------------------------------------------- |
| 0 | Process data. It includes: <br> (1) Download the dataset <br> (2) Caculate the CMVN of the train dataset <br> (3) Get the vocabulary file <br> (4) Get the manifest files of the train, development and test dataset |
|:---- |:----------------------------------------------------------- |
| 0 | Process data. It includes: <br> (1) Download the dataset <br> (2) Calculate the CMVN of the train dataset <br> (3) Get the vocabulary file <br> (4) Get the manifest files of the train, development and test dataset |
| 1 | Train the model |
| 2 | Get the final model by averaging the top-k models, set k = 1 means choose the best model |
| 2 | Get the final model by averaging the top-k models, set k = 1 means to choose the best model |
| 3 | Test the final model performance |
| 4 | Export the static graph model |
| 5 | Test the static graph model |
| 6 | Infer the single audio file |
You can choose to run a range of stages by setting the ```stage``` and ```stop_stage ``` .
You can choose to run a range of stages by setting the `stage` and `stop_stage `.
For example, if you want to execute the code in stage 2 and stage 3, you can run this script:
```bash
bash run.sh --stage 2 --stop_stage 3
```
Or you can set ```stage``` equal to ```stop-stage``` to only run one stage.
For example, if you only want to run ```stage 0```, you can use the script below:
Or you can set `stage` equal to `stop-stage` to only run one stage.
For example, if you only want to run `stage 0`, you can use the script below:
```bash
bash run.sh --stage 0 --stop_stage 0
```
The document below will describe the scripts in the ```run.sh``` in detail.
The document below will describe the scripts in the `run.sh` in detail.
## The environment variables
The `path.sh` contains the environment variables.
```bash
source path.sh
```
This script needs to be run firstly.
This script needs to be run first.
And another script is also needed:
```bash
source ${MAIN_ROOT}/utils/parse_options.sh
```
It will support the way of using```--varibale value``` in the shell scripts.
It will support the way of using `--variable value` in the shell scripts.
## The local variables
Some local variables are set in the `run.sh`.
`gpus` denotes the GPU number you want to use. If you set `gpus=`, it means you only use CPU.
Some local variables are set in the ```run.sh```.
```gpus``` denotes the GPU number you want to use. If you set ```gpus=```, it means you only use CPU.
`stage` denotes the number of the stage you want to start from in the experiments.
`stop_stage` denotes the number of the stage you want to end at in the experiments.
`conf_path` denotes the config path of the model.
`avg_num` denotes the number K of top-K models you want to average to get the final model.
`model_type` denotes the model type: offline or online.
`audio_file` denotes the file path of the single file you want to infer in stage 6.
`ckpt` denotes the checkpoint prefix of the model, e.g. "deepspeech2".
```stage``` denotes the number of stage you want to start from in the expriments.
```stop stage```denotes the number of stage you want to end at in the expriments.
```conf_path``` denotes the config path of the model.
```avg_num``` denotes the number K of top-K models you want to average to get the final model.
```model_type```denotes the model type: offline or online
```audio file``` denotes the file path of the single file you want to infer in stage 6
```ckpt``` denotes the checkpoint prefix of the model, e.g. "deepspeech2"
You can set the local variables (except ```ckpt```) when you use the ```run.sh```
For example, you can set the ```gpus``` and ``avg_num`` when you use the command line.:
You can set the local variables (except `ckpt`) when you use `run.sh`.
For example, you can set `gpus` and `avg_num` when you use the command line:
```bash
bash run.sh --gpus 0,1 --avg_num 1
```
## Stage 0: Data processing
To use this example, you need to process data firstly and you can use stage 0 in the ```run.sh``` to do this. The code is shown below:
To use this example, you need to process the data first, and you can use stage 0 in `run.sh` to do this. The code is shown below:
```bash
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
@ -91,24 +65,17 @@ To use this example, you need to process data firstly and you can use stage 0 i
bash ./local/data.sh || exit -1
fi
```
Stage 0 is for processing the data.
If you only want to process the data, you can run:
```bash
bash run.sh --stage 0 --stop_stage 0
```
You can also just run these scripts in your command line.
```bash
source path.sh
bash ./local/data.sh
```
After processing the data, the ``data`` directory will look like this:
After processing the data, the `data` directory will look like this:
```bash
data/
|-- dev.meta
@ -124,54 +91,37 @@ data/
|-- test.meta
`-- train.meta
```
## Stage 1: Model training
If you want to train the model. you can use stage 1 in the ```run.sh```. The code is shown below.
If you want to train the model, you can use stage 1 in `run.sh`. The code is shown below.
```bash
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
# train model, all `ckpt` under `exp` dir
CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${ckpt}
fi
```
If you want to train the model, you can use the script below to execute stage 0 and stage 1:
```bash
bash run.sh --stage 0 --stop_stage 1
```
or you can run these scripts in the command line (only use CPU).
```bash
source path.sh
bash ./local/data.sh
CUDA_VISIBLE_DEVICES= ./local/train.sh conf/deepspeech2.yaml deepspeech2
```
## Stage 2: Top-k Models Averaging
After training the model, we need to get the final model for testing and inference. In every epoch, the model checkpoint is saved, so we can choose the best model from them based on the validation loss or we can sort them and average the parameters of the top-k models to get the final model. We can use stage 2 to do this, and the code is shown below:
After training the model, we need to get the final model for testing and inference. In every epoch, the model checkpoint is saved, so we can choose the best model from them based on the validation loss or we can sort them and average the parameters of the top-k models to get the final model. We can use stage 2 to do this, and the code is shown below:
```bash
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
# avg n best model
avg.sh best exp/${ckpt}/checkpoints ${avg_num}
fi
```
The ```avg.sh``` is in the ```../../../utils/``` which is define in the ```path.sh```.
The `avg.sh` is in `../../../utils/`, which is defined in `path.sh`.
If you want to get the final model, you can use the script below to execute stage 0, stage 1, and stage 2:
```bash
bash run.sh --stage 0 --stop_stage 2
```
or you can run these scripts in the command line (only use CPU).
```bash
@ -180,28 +130,19 @@ bash ./local/data.sh
CUDA_VISIBLE_DEVICES= ./local/train.sh conf/deepspeech2.yaml deepspeech2
avg.sh best exp/deepspeech2/checkpoints 1
```
## Stage 3: Model Testing
The test stage is to evaluate the model performance.. The code of test stage is shown below:
The test stage is to evaluate the model performance. The code of the test stage is shown below:
```bash
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
# test ckpt avg_n
CUDA_VISIBLE_DEVICES=0 ./local/test.sh ${conf_path} exp/${ckpt}/checkpoints/${avg_ckpt} || exit -1
fi
```
If you want to train a model and test it, you can use the script below to execute stage 0, stage 1, stage 2, and stage 3 :
```bash
bash run.sh --stage 0 --stop_stage 3
```
or you can run these scripts in the command line (only use CPU).
```bash
source path.sh
bash ./local/data.sh
@ -209,13 +150,8 @@ CUDA_VISIBLE_DEVICES= ./local/train.sh conf/deepspeech2.yaml deepspeech2
avg.sh best exp/deepspeech2/checkpoints 1
CUDA_VISIBLE_DEVICES= ./local/test.sh conf/deepspeech2.yaml exp/deepspeech2/checkpoints/avg_1
```
## Pretrained Model
You can get the pretrained transfomer or conformer using the scripts below:
You can get the pretrained models using the scripts below:
```bash
Deepspeech2 offline:
wget https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/ds2.model.tar.gz
@ -224,11 +160,9 @@ Deepspeech2 online:
wget https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/aishell_ds2_online_cer8.00_release.tar.gz
```
using the ```tar``` scripts to unpack the model and then you can use the script to test the modle.
Use the `tar` command to unpack the model, and then you can use the script to test the model.
For example:
```
wget https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/ds2.model.tar.gz
tar xzvf ds2.model.tar.gz
@ -239,83 +173,55 @@ bash local/data.sh --stage 2 --stop_stage 2
CUDA_VISIBLE_DEVICES= ./local/test.sh conf/deepspeech2.yaml exp/deepspeech2/checkpoints/avg_1
```
The performance of the released models is shown below:
| Acoustic Model | Training Data | Token-based | Size | Descriptions | CER | WER | Hours of speech |
| :----------------------------: | :-------------: | :---------: | -----: | :------------------------------------------------- | :---- | :--- | :-------------- |
| Ds2 Online Aishell ASR0 Model | Aishell Dataset | Char-based | 345 MB | 2 Conv + 5 LSTM layers with only forward direction | 0.080 | - | 151 h |
| Ds2 Offline Aishell ASR0 Model | Aishell Dataset | Char-based | 306 MB | 2 Conv + 3 bidirectional GRU layers | 0.064 | - | 151 h |
## Stage 4: Static graph model Export
This stage is to transform the dynamic graph model to static graph model.
This stage transforms the dynamic graph (dygraph) model into a static graph model.
```bash
if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
# export ckpt avg_n
CUDA_VISIBLE_DEVICES=0 ./local/export.sh ${conf_path} exp/${ckpt}/checkpoints/${avg_ckpt} exp/${ckpt}/checkpoints/${avg_ckpt}.jit ${model_type}
fi
```
If you already have a dynamic graph model, you can run this script:
```bash
source path.sh
./local/export.sh deepspeech2.yaml exp/deepspeech2/checkpoints/avg_1 exp/deepspeech2/checkpoints/avg_1.jit offline
```
## Stage 5: Static graph Model Testing
Similer to stage 3, static graph model can also be tested.
Similar to stage 3, the static graph model can also be tested.
```bash
if [ ${stage} -le 5 ] && [ ${stop_stage} -ge 5 ]; then
# test export ckpt avg_n
CUDA_VISIBLE_DEVICES=0 ./local/test_export.sh ${conf_path} exp/${ckpt}/checkpoints/${avg_ckpt}.jit ${model_type}|| exit -1
fi
```
If you already have export the static graph, you can run this script:
If you have already exported the static graph, you can run this script:
```bash
CUDA_VISIBLE_DEVICES= ./local/test_export.sh conf/deepspeech2.yaml exp/deepspeech2/checkpoints/avg_1.jit offline
```
## Stage 6: Single Audio File Inference
In some situations, you want to use the trained model to do inference on a single audio file. You can use stage 6. The code is shown below:
```bash
if [ ${stage} -le 6 ] && [ ${stop_stage} -ge 6 ]; then
# test a single .wav file
CUDA_VISIBLE_DEVICES=0 ./local/test_wav.sh ${conf_path} exp/${ckpt}/checkpoints/${avg_ckpt} ${model_type} ${audio_file}
fi
```
You can train the model yourself, or you can download the pretrained model with the script below:
```bash
wget https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/ds2.model.tar.gz
tar xzvf ds2.model.tar.gz
```
You can download the audio demo:
```bash
wget -nc https://paddlespeech.bj.bcebos.com/datasets/single_wav/zh/demo_01_03.wav -P data/
```
You need to prepare an audio file or use the audio demo above. Please confirm that the sample rate of the audio is 16 kHz. You can get the result of the audio demo by running the script below.
```bash
CUDA_VISIBLE_DEVICES= ./local/test_wav.sh conf/deepspeech2.yaml exp/deepspeech2/checkpoints/avg_1 data/demo_01_03.wav
```
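If you are not sure about the sample rate of your own recording, a quick way to check it and, if needed, convert it to 16 kHz is sketched below. This assumes the `sox` tool is installed; it is not part of the example scripts, and the file names are placeholders.
```bash
# print the sample rate of a wav file
sox --i -r data/demo_01_03.wav
# resample an arbitrary recording to 16 kHz mono (output name is just an example)
sox my_audio.wav -r 16000 -c 1 data/my_audio_16k.wav
```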

@ -1,88 +1,55 @@
# Transformer/Conformer ASR with Aishell
This example contains code used to train a Transformer or [Conformer](http://arxiv.org/abs/2008.03802) model with [Aishell dataset](http://www.openslr.org/resources/33)
## Overview
All the scripts you need are in `run.sh`. There are several stages in `run.sh`, and each stage has its function.
| Stage | Function |
|:---- |:----------------------------------------------------------- |
| 0 | Process data. It includes: <br> (1) Download the dataset <br> (2) Calculate the CMVN of the train dataset <br> (3) Get the vocabulary file <br> (4) Get the manifest files of the train, development and test dataset |
| 1 | Train the model |
| 2 | Get the final model by averaging the top-k models; setting k = 1 means choosing the best model |
| 3 | Test the final model performance |
| 4 | Get ctc alignment of test data using the final model |
| 5 | Infer the single audio file |
You can choose to run a range of stages by setting `stage` and `stop_stage`.
For example, if you want to execute the code in stage 2 and stage 3, you can run this script:
```bash
bash run.sh --stage 2 --stop_stage 3
```
Or you can set `stage` equal to `stop-stage` to only run one stage.
For example, if you only want to run `stage 0`, you can use the script below:
```bash
bash run.sh --stage 0 --stop_stage 0
```
The document below will describe the scripts in `run.sh` in detail.
## The Environment Variables
The path.sh contains the environment variables.
```bash
source path.sh
```
This script needs to be run first. Another script is also needed:
```bash
source ${MAIN_ROOT}/utils/parse_options.sh
```
It enables passing options in the `--variable value` style to the shell scripts.
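For example, the options below (all of them are variables that `run.sh` already defines) are parsed by `parse_options.sh` into the shell variables `stage`, `stop_stage`, and `avg_num`:
```bash
# each "--name value" pair becomes a shell variable with the same name
bash run.sh --stage 1 --stop_stage 2 --avg_num 20
```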
## The Local Variables
Some local variables are set in `run.sh`.
`gpus` denotes the GPU number you want to use. If you set `gpus=`, it means you only use CPU.
`stage` denotes the number of the stage you want to start from in the experiments.
`stop_stage` denotes the number of the stage you want to end at in the experiments.
`conf_path` denotes the config path of the model.
`avg_num` denotes the number K of top-K models you want to average to get the final model.
`audio_file` denotes the file path of the single file you want to infer in stage 5.
`ckpt` denotes the checkpoint prefix of the model, e.g. "conformer".
You can set the local variables (except `ckpt`) when you use `run.sh`.
For example, you can set the `gpus` and `avg_num` when you use the command line:
```bash
bash run.sh --gpus 0,1 --avg_num 20
```
## Stage 0: Data Processing
To use this example, you need to process the data first. You can use stage 0 in `run.sh` to do this. The code is shown below:
```bash
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
# prepare data
@ -93,20 +60,15 @@ To use this example, you need to process data firstly and you can use stage 0 in
Stage 0 is for processing the data.
If you only want to process the data, you can run:
```bash
bash run.sh --stage 0 --stop_stage 0
```
You can also just run these scripts in your command line.
```bash
source path.sh
bash ./local/data.sh
```
After processing the data, the `data` directory will look like this:
```bash
data/
|-- dev.meta
@ -122,84 +84,57 @@ data/
|-- test.meta
`-- train.meta
```
## Stage 1: Model Training
If you want to train the model, you can use stage 1 in `run.sh`. The code is shown below.
```bash
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
# train model, all `ckpt` under `exp` dir
CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${ckpt}
fi
```
If you want to train the model, you can use the script below to execute stage 0 and stage 1:
```bash
bash run.sh --stage 0 --stop_stage 1
```
or you can run these scripts in the command line (only use CPU).
```bash
source path.sh
bash ./local/data.sh
CUDA_VISIBLE_DEVICES= ./local/train.sh conf/conformer.yaml conformer
```
## Stage 2: Top-k Models Averaging
After training the model, we need to get the final model for testing and inference. In every epoch, the model checkpoint is saved, so we can choose the best model from them based on the validation loss or we can sort them and average the parameters of the top-k models to get the final model. We can use stage 2 to do this, and the code is shown below:
```bash
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
# avg n best model
avg.sh best exp/${ckpt}/checkpoints ${avg_num}
fi
```
The `avg.sh` is in `../../../utils/`, which is defined in `path.sh`.
If you want to get the final model, you can use the script below to execute stage 0, stage 1, and stage 2:
```bash
bash run.sh --stage 0 --stop_stage 2
```
or you can run these scripts in the command line (only use CPU).
```bash
source path.sh
bash ./local/data.sh
CUDA_VISIBLE_DEVICES= ./local/train.sh conf/conformer.yaml conformer
avg.sh best exp/conformer/checkpoints 20
```
## Stage 3: Model Testing
The test stage is to evaluate the model performance. The code of the test stage is shown below:
```bash
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
# test ckpt avg_n
CUDA_VISIBLE_DEVICES=0 ./local/test.sh ${conf_path} exp/${ckpt}/checkpoints/${avg_ckpt} || exit -1
fi
```
If you want to train a model and test it, you can use the script below to execute stage 0, stage 1, stage 2, and stage 3:
```bash
bash run.sh --stage 0 --stop_stage 3
```
or you can run these scripts in the command line (only use CPU).
```bash
source path.sh
bash ./local/data.sh
@ -207,29 +142,23 @@ CUDA_VISIBLE_DEVICES= ./local/train.sh conf/conformer.yaml conformer
avg.sh best exp/conformer/checkpoints 20
CUDA_VISIBLE_DEVICES= ./local/test.sh conf/conformer.yaml exp/conformer/checkpoints/avg_20
```
## Pretrained Model
You can get the pretrained transformer or conformer using the scripts below:
```bash
# Conformer:
wget https://deepspeech.bj.bcebos.com/release2.1/aishell/s1/aishell.release.tar.gz
# Chunk Conformer:
wget https://deepspeech.bj.bcebos.com/release2.1/aishell/s1/aishell.chunk.release.tar.gz
# Transformer:
wget https://paddlespeech.bj.bcebos.com/s2t/aishell/asr1/transformer.model.tar.gz
```
Use the `tar` command to unpack the model, and then you can use the script to test the model.
For example:
```bash
wget https://paddlespeech.bj.bcebos.com/s2t/aishell/asr1/transformer.model.tar.gz
tar xzvf transformer.model.tar.gz
@ -237,28 +166,18 @@ source path.sh
# If you have processed the data and got the manifest file, you can skip the following 2 steps
bash local/data.sh --stage -1 --stop_stage -1
bash local/data.sh --stage 2 --stop_stage 2
CUDA_VISIBLE_DEVICES= ./local/test.sh conf/transformer.yaml exp/transformer/checkpoints/avg_20
```
The performance of the released models is shown below:
### Conformer
| Model | Params | Config | Augmentation | Test set | Decode method | Loss | CER |
| --------- | ------ | ------------------- | ---------------- | -------- | ---------------------- | ---- | -------- |
| conformer | 47.07M | conf/conformer.yaml | spec_aug + shift | test | attention | - | 0.059858 |
| conformer | 47.07M | conf/conformer.yaml | spec_aug + shift | test | ctc_greedy_search | - | 0.062311 |
| conformer | 47.07M | conf/conformer.yaml | spec_aug + shift | test | ctc_prefix_beam_search | - | 0.062196 |
| conformer | 47.07M | conf/conformer.yaml | spec_aug + shift | test | attention_rescoring | - | 0.054694 |
### Chunk Conformer
You need to set `decoding.decoding_chunk_size=16` when decoding.
| Model | Params | Config | Augmentation | Test set | Decode method | Chunk Size & Left Chunks | Loss | CER |
| --------- | ------ | ------------------------- | ---------------- | -------- | ---------------------- | ------------------------ | ---- | -------- |
| conformer | 47.06M | conf/chunk_conformer.yaml | spec_aug + shift | test | attention | 16, -1 | - | 0.061939 |
@ -266,44 +185,31 @@ Need set `decoding.decoding_chunk_size=16` when decoding.
| conformer | 47.06M | conf/chunk_conformer.yaml | spec_aug + shift | test | ctc_prefix_beam_search | 16, -1 | - | 0.070739 |
| conformer | 47.06M | conf/chunk_conformer.yaml | spec_aug + shift | test | attention_rescoring | 16, -1 | - | 0.059400 |
### Transformer
| Model | Params | Config | Augmentation | Test set | Decode method | Loss | CER |
| ----------- | ------ | --------------------- | ------------ | -------- | ---------------------- | ----------------- | -------- |
| transformer | 31.95M | conf/transformer.yaml | spec_aug | test | attention | 3.858648955821991 | 0.057293 |
| transformer | 31.95M | conf/transformer.yaml | spec_aug | test | ctc_greedy_search | 3.858648955821991 | 0.061837 |
| transformer | 31.95M | conf/transformer.yaml | spec_aug | test | ctc_prefix_beam_search | 3.858648955821991 | 0.061685 |
| transformer | 31.95M | conf/transformer.yaml | spec_aug | test | attention_rescoring | 3.858648955821991 | 0.053844 |
## Stage 4: CTC Alignment
If you want to get the alignment between the audio and the text, you can use the ctc alignment. The code of this stage is shown below:
```bash
if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
# ctc alignment of test data
CUDA_VISIBLE_DEVICES=0 ./local/align.sh ${conf_path} exp/${ckpt}/checkpoints/${avg_ckpt} || exit -1
fi
```
If you want to train the model, test it, and do the alignment, you can use the script below to execute stage 0, stage 1, stage 2, stage 3, and stage 4:
```bash
bash run.sh --stage 0 --stop_stage 4
```
or if you only need to train a model and do the alignment, you can use these scripts to skip stage 3 (the test stage):
```bash
bash run.sh --stage 0 --stop_stage 2
bash run.sh --stage 4 --stop_stage 4
```
or you can also use these scripts in the command line (only use CPU).
```bash
source path.sh
bash ./local/data.sh
@ -313,33 +219,24 @@ avg.sh best exp/conformer/checkpoints 20
CUDA_VISIBLE_DEVICES= ./local/test.sh conf/conformer.yaml exp/conformer/checkpoints/avg_20
CUDA_VISIBLE_DEVICES= ./local/align.sh conf/conformer.yaml exp/conformer/checkpoints/avg_20
```
## Stage 5: Single Audio File Inference
In some situations, you want to use the trained model to do inference on a single audio file. You can use stage 5. The code is shown below:
```bash
if [ ${stage} -le 5 ] && [ ${stop_stage} -ge 5 ]; then
# test a single .wav file
CUDA_VISIBLE_DEVICES=0 ./local/test_wav.sh ${conf_path} exp/${ckpt}/checkpoints/${avg_ckpt} ${audio_file} || exit -1
fi
```
You can train the model by yourself using `bash run.sh --stage 0 --stop_stage 3`, or you can download the pretrained model through the script below:
```bash
wget https://paddlespeech.bj.bcebos.com/s2t/aishell/asr1/transformer.model.tar.gz
tar xzvf transformer.model.tar.gz
```
You can download the audio demo:
```bash
wget -nc https://paddlespeech.bj.bcebos.com/datasets/single_wav/zh/demo_01_03.wav -P data/
```
You need to prepare an audio file or use the audio demo above. Please confirm that the sample rate of the audio is 16 kHz. You can get the result by running the script below.
```bash
CUDA_VISIBLE_DEVICES= ./local/test_wav.sh conf/transformer.yaml exp/transformer/checkpoints/avg_20 data/demo_01_03.wav
```

@ -1,110 +1,78 @@
# DeepSpeech2 offline/online ASR with Librispeech
This example contains code used to train a DeepSpeech2 offline or online model with [Librispeech dataset](http://www.openslr.org/resources/12)
## Overview
All the scripts you need are in the `run.sh`. There are several stages in the `run.sh`, and each stage has its function.
| Stage | Function |
|:---- |:----------------------------------------------------------- |
| 0 | Process data. It includes: <br> (1) Download the dataset <br> (2) Calculate the CMVN of the train dataset <br> (3) Get the vocabulary file <br> (4) Get the manifest files of the train, development and test dataset |
| 1 | Train the model |
| 2 | Get the final model by averaging the top-k models; setting k = 1 means choosing the best model |
| 3 | Test the final model performance |
| 4 | Export the static graph model |
| 5 | Test the static graph model |
| 6 | Infer the single audio file |
You can choose to run a range of stages by setting the `stage` and `stop_stage`.
For example, if you want to execute the code in stage 2 and stage 3, you can run this script:
```bash
bash run.sh --stage 2 --stop_stage 3
```
Or you can set `stage` equal to `stop-stage` to only run one stage.
For example, if you only want to run `stage 0`, you can use the script below:
```bash
bash run.sh --stage 0 --stop_stage 0
```
The document below will describe the scripts in the `run.sh` in detail.
## The environment variables
The path.sh contains the environment variable.
```bash
source path.sh
```
This script needs to be run first.
And another script is also needed:
```bash
source ${MAIN_ROOT}/utils/parse_options.sh
```
It enables passing options in the `--variable value` style to the shell scripts.
## The local variables
Some local variables are set in the `run.sh`.
`gpus` denotes the GPU number you want to use. If you set `gpus=`, it means you only use CPU.
`stage` denotes the number of the stage you want to start from in the experiments.
`stop_stage` denotes the number of the stage you want to end at in the experiments.
`conf_path` denotes the config path of the model.
`avg_num` denotes the number K of top-K models you want to average to get the final model.
`model_type` denotes the model type: offline or online.
`audio_file` denotes the file path of the single file you want to infer in stage 6.
`ckpt` denotes the checkpoint prefix of the model, e.g. "deepspeech2".
You can set the local variables (except `ckpt`) when you use the `run.sh`.
For example, you can set the `gpus` and `avg_num` when you use the command line:
```bash
bash run.sh --gpus 0,1 --avg_num 1
```
## Stage 0: Data processing
To use this example, you need to process the data first. You can use stage 0 in the `run.sh` to do this. The code is shown below:
```bash
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
# prepare data
bash ./local/data.sh || exit -1
fi
```
Stage 0 is for processing the data.
If you only want to process the data, you can run:
```bash
bash run.sh --stage 0 --stop_stage 0
```
You can also just run these scripts in your command line.
```bash
source path.sh
bash ./local/data.sh
```
After processing the data, the `data` directory will look like this:
```bash
data/
|-- dev.meta
@ -120,19 +88,14 @@ data/
|-- test.meta
`-- train.meta
```
## Stage 1: Model training
If you want to train the model, you can use stage 1 in the `run.sh`. The code is shown below.
```bash
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
# train model, all `ckpt` under `exp` dir
CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${ckpt}
fi
```
If you want to train the model, you can use the script below to execute stage 0 and stage 1:
```bash
bash run.sh --stage 0 --stop_stage 1
@ -143,25 +106,19 @@ source path.sh
bash ./local/data.sh
CUDA_VISIBLE_DEVICES= ./local/train.sh conf/deepspeech2.yaml deepspeech2
```
## Stage 2: Top-k Models Averaging
After training the model, we need to get the final model for testing and inference. In every epoch, the model checkpoint is saved, so we can choose the best model from them based on the validation loss or we can sort them and average the parameters of the top-k models to get the final model. We can use stage 2 to do this, and the code is shown below:
```bash
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
# avg n best model
avg.sh best exp/${ckpt}/checkpoints ${avg_num}
fi
```
The `avg.sh` is in `../../../utils/`, which is defined in `path.sh`.
If you want to get the final model, you can use the script below to execute stage 0, stage 1, and stage 2:
```bash
bash run.sh --stage 0 --stop_stage 2
```
or you can run these scripts in the command line (only use CPU).
```bash
source path.sh
@ -169,28 +126,19 @@ bash ./local/data.sh
CUDA_VISIBLE_DEVICES= ./local/train.sh conf/deepspeech2.yaml deepspeech2
avg.sh best exp/deepspeech2/checkpoints 1
```
## Stage 3: Model Testing
The test stage is to evaluate the model performance. The code of the test stage is shown below:
```bash
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
# test ckpt avg_n
CUDA_VISIBLE_DEVICES=0 ./local/test.sh ${conf_path} exp/${ckpt}/checkpoints/${avg_ckpt} || exit -1
fi
```
If you want to train a model and test it, you can use the script below to execute stage 0, stage 1, stage 2, and stage 3:
```bash
bash run.sh --stage 0 --stop_stage 3
```
or you can run these scripts in the command line (only use CPU).
```bash
source path.sh
bash ./local/data.sh
@ -198,70 +146,44 @@ CUDA_VISIBLE_DEVICES= ./local/train.sh conf/deepspeech2.yaml deepspeech2
avg.sh best exp/deepspeech2/checkpoints 1
CUDA_VISIBLE_DEVICES= ./local/test.sh conf/deepspeech2.yaml exp/deepspeech2/checkpoints/avg_1
```
## Stage 4: Static Graph Model Export
This stage transforms the dynamic graph (dygraph) model into a static graph model.
```bash
if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
# export ckpt avg_n
CUDA_VISIBLE_DEVICES=0 ./local/export.sh ${conf_path} exp/${ckpt}/checkpoints/${avg_ckpt} exp/${ckpt}/checkpoints/${avg_ckpt}.jit ${model_type}
fi
```
If you already have a dynamic graph model, you can run this script:
```bash
source path.sh
./local/export.sh deepspeech2.yaml exp/deepspeech2/checkpoints/avg_1 exp/deepspeech2/checkpoints/avg_1.jit offline
```
## Stage 5: Static Graph Model Testing
Similar to stage 3, the static graph model can also be tested.
```bash
if [ ${stage} -le 5 ] && [ ${stop_stage} -ge 5 ]; then
# test export ckpt avg_n
CUDA_VISIBLE_DEVICES=0 ./local/test_export.sh ${conf_path} exp/${ckpt}/checkpoints/${avg_ckpt}.jit ${model_type}|| exit -1
fi
```
If you have already exported the static graph model, you can run this script:
```bash
CUDA_VISIBLE_DEVICES= ./local/test_export.sh conf/deepspeech2.yaml exp/deepspeech2/checkpoints/avg_1.jit offline
```
## Stage 6: Single Audio File Inference
In some situations, you want to use the trained model to do inference on a single audio file. You can use stage 6. The code is shown below:
```bash
if [ ${stage} -le 6 ] && [ ${stop_stage} -ge 6 ]; then
# test a single .wav file
CUDA_VISIBLE_DEVICES=0 ./local/test_wav.sh ${conf_path} exp/${ckpt}/checkpoints/${avg_ckpt} ${model_type} ${audio_file}
fi
```
You can download the audio demo:
```bash
wget -nc https://paddlespeech.bj.bcebos.com/datasets/single_wav/en/demo_002_en.wav -P data/
```
You can train a model by yourself, then you need to prepare an audio file or use the audio demo above. Please confirm that the sample rate of the audio is 16 kHz. You can get the result of the audio demo by running the script below.
```bash
CUDA_VISIBLE_DEVICES= ./local/test_wav.sh conf/deepspeech2.yaml exp/deepspeech2/checkpoints/avg_1 data/demo_002_en.wav
```

@ -1,114 +1,76 @@
# Transformer/Conformer ASR with Librispeech
This example contains code used to train a Transformer or [Conformer](http://arxiv.org/abs/2008.03802) model with [Librispeech dataset](http://www.openslr.org/resources/12)
## Overview
All the scripts you need are in `run.sh`. There are several stages in `run.sh`, and each stage has its function.
| Stage | Function |
|:---- |:----------------------------------------------------------- |
| 0 | Process data. It includes: <br> (1) Download the dataset <br> (2) Calculate the CMVN of the train dataset <br> (3) Get the vocabulary file <br> (4) Get the manifest files of the train, development and test dataset<br> (5) Get the sentencepiece model |
| 1 | Train the model |
| 2 | Get the final model by averaging the top-k models; setting k = 1 means choosing the best model |
| 3 | Test the final model performance |
| 4 | Get ctc alignment of test data using the final model |
| 5 | Infer the single audio file |
You can choose to run a range of stages by setting `stage` and `stop_stage`.
For example, if you want to execute the code in stage 2 and stage 3, you can run this script:
```bash
bash run.sh --stage 2 --stop_stage 3
```
Or you can set `stage` equal to `stop-stage` to only run one stage.
For example, if you only want to run `stage 0`, you can use the script below:
```bash
bash run.sh --stage 0 --stop_stage 0
```
The document below will describe the scripts in `run.sh` in detail.
## The Environment Variables
The path.sh contains the environment variables.
```bash
. ./path.sh
. ./cmd.sh
```
This script needs to be run first. Another script is also needed:
```bash
source ${MAIN_ROOT}/utils/parse_options.sh
```
It enables passing options in the `--variable value` style to the shell scripts.
## The Local Variables
Some local variables are set in `run.sh`.
`gpus` denotes the GPU number you want to use. If you set `gpus=`, it means you only use CPU.
`stage` denotes the number of the stage you want to start from in the experiments.
`stop_stage` denotes the number of the stage you want to end at in the experiments.
`conf_path` denotes the config path of the model.
`avg_num` denotes the number K of top-K models you want to average to get the final model.
`audio_file` denotes the file path of the single file you want to infer in stage 5.
`ckpt` denotes the checkpoint prefix of the model, e.g. "conformer".
You can set the local variables (except `ckpt`) when you use `run.sh`.
For example, you can set the `gpus` and `avg_num` when you use the command line:
```bash
bash run.sh --gpus 0,1 --avg_num 20
```
## Stage 0: Data Processing
To use this example, you need to process the data first. You can use stage 0 in `run.sh` to do this. The code is shown below:
```bash
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
# prepare data
bash ./local/data.sh || exit -1
fi
```
Stage 0 is for processing the data.
If you only want to process the data, you can run:
```bash
bash run.sh --stage 0 --stop_stage 0
```
You can also just run these scripts in your command line.
```bash
. ./path.sh
. ./cmd.sh
bash ./local/data.sh
```
After processing the data, the `data` directory will look like this:
```bash
data/
|-- dev.meta
@ -126,55 +88,38 @@ data/
|-- test.meta
`-- train.meta
```
## Stage 1: Model Training
If you want to train the model, you can use stage 1 in `run.sh`. The code is shown below.
```bash
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
# train model, all `ckpt` under `exp` dir
CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${ckpt}
fi
```
If you want to train the model, you can use the script below to execute stage 0 and stage 1:
```bash
bash run.sh --stage 0 --stop_stage 1
```
or you can run these scripts in the command line (only use CPU).
```bash
. ./path.sh
. ./cmd.sh
bash ./local/data.sh
CUDA_VISIBLE_DEVICES= ./local/train.sh conf/conformer.yaml conformer
```
## Stage 2: Top-k Models Averaging
After training the model, we need to get the final model for testing and inference. In every epoch, the model checkpoint is saved, so we can choose the best model from them based on the validation loss or we can sort them and average the parameters of the top-k models to get the final model. We can use stage 2 to do this, and the code is shown below:
```bash
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
# avg n best model
avg.sh best exp/${ckpt}/checkpoints ${avg_num}
fi
```
The `avg.sh` is in `../../../utils/`, which is defined in `path.sh`.
If you want to get the final model, you can use the script below to execute stage 0, stage 1, and stage 2:
```bash
bash run.sh --stage 0 --stop_stage 2
```
or you can run these scripts in the command line (only use CPU).
```bash
@ -184,28 +129,19 @@ bash ./local/data.sh
CUDA_VISIBLE_DEVICES= ./local/train.sh conf/conformer.yaml conformer
avg.sh best exp/conformer/checkpoints 20
```
## Stage 3: Model Testing
The test stage is to evaluate the model performance. The code of the test stage is shown below:
```bash
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
# test ckpt avg_n
CUDA_VISIBLE_DEVICES=0 ./local/test.sh ${conf_path} exp/${ckpt}/checkpoints/${avg_ckpt} || exit -1
fi
```
If you want to train a model and test it, you can use the script below to execute stage 0, stage 1, stage 2, and stage 3:
```bash
bash run.sh --stage 0 --stop_stage 3
```
or you can run these scripts in the command line (only use CPU).
```bash
. ./path.sh
. ./cmd.sh
@ -214,44 +150,28 @@ CUDA_VISIBLE_DEVICES= ./local/train.sh conf/conformer.yaml conformer
avg.sh best exp/conformer/checkpoints 20
CUDA_VISIBLE_DEVICES= ./local/test.sh conf/conformer.yaml exp/conformer/checkpoints/avg_20
```
## Pretrained Model
You can get the pretrained transformer or conformer using the scripts below:
```bash
# Conformer:
wget https://paddlespeech.bj.bcebos.com/s2t/librispeech/asr1/conformer.model.tar.gz
# Transformer:
wget https://paddlespeech.bj.bcebos.com/s2t/librispeech/asr1/transformer.model.tar.gz
```
Use the `tar` command to unpack the model, and then you can use the script to test the model.
For example:
```bash
wget https://paddlespeech.bj.bcebos.com/s2t/librispeech/asr1/conformer.model.tar.gz
tar xzvf conformer.model.tar.gz
source path.sh
# If you have processed the data and got the manifest file, you can skip the following 2 steps
bash local/data.sh --stage -1 --stop_stage -1
bash local/data.sh --stage 2 --stop_stage 2
CUDA_VISIBLE_DEVICES= ./local/test.sh conf/conformer.yaml exp/conformer/checkpoints/avg_20
```
The performance of the released models is shown below:
## Conformer
train: Epoch 70, 4 V100-32G, best avg: 20
| Model | Params | Config | Augmentation | Test set | Decode method | Loss | WER |
@ -260,10 +180,7 @@ train: Epoch 70, 4 V100-32G, best avg: 20
| conformer | 47.63 M | conf/conformer.yaml | spec_aug | test-clean | ctc_greedy_search | 6.433612394332886 | 0.040342 |
| conformer | 47.63 M | conf/conformer.yaml | spec_aug | test-clean | ctc_prefix_beam_search | 6.433612394332886 | 0.040342 |
| conformer | 47.63 M | conf/conformer.yaml | spec_aug | test-clean | attention_rescoring | 6.433612394332886 | 0.033761 |
## Transformer
train: Epoch 120, 4 V100-32G, 27 Day, best avg: 10
| Model | Params | Config | Augmentation | Test set | Decode method | Loss | WER |
@ -272,35 +189,24 @@ train: Epoch 120, 4 V100-32G, 27 Day, best avg: 10
| transformer | 32.52 M | conf/transformer.yaml | spec_aug | test-clean | ctc_greedy_search | 6.382194232940674 | 0.049566 |
| transformer | 32.52 M | conf/transformer.yaml | spec_aug | test-clean | ctc_prefix_beam_search | 6.382194232940674 | 0.049585 |
| transformer | 32.52 M | conf/transformer.yaml | spec_aug | test-clean | attention_rescoring | 6.382194232940674 | 0.038135 |
## Stage 4: CTC Alignment
If you want to get the alignment between the audio and the text, you can use the ctc alignment. The code of this stage is shown below:
```bash
if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
# ctc alignment of test data
CUDA_VISIBLE_DEVICES=0 ./local/align.sh ${conf_path} exp/${ckpt}/checkpoints/${avg_ckpt} || exit -1
fi
```
If you want to train the model, test it, and do the alignment, you can use the script below to execute stage 0, stage 1, stage 2, stage 3, and stage 4:
```bash
bash run.sh --stage 0 --stop_stage 4
```
or if you only need to train a model and do the alignment, you can use these scripts to skip stage 3 (the test stage):
```bash
bash run.sh --stage 0 --stop_stage 2
bash run.sh --stage 4 --stop_stage 4
```
or you can also use these scripts in the command line (only use CPU).
```bash
. ./path.sh
. ./cmd.sh
@ -311,35 +217,24 @@ avg.sh best exp/conformer/checkpoints 20
CUDA_VISIBLE_DEVICES= ./local/test.sh conf/conformer.yaml exp/conformer/checkpoints/avg_20
CUDA_VISIBLE_DEVICES= ./local/align.sh conf/conformer.yaml exp/conformer/checkpoints/avg_20
```
## Stage 5: Single Audio File Inference
In some situations, you want to use the trained model to do inference on a single audio file. You can use stage 5. The code is shown below:
```bash
if [ ${stage} -le 5 ] && [ ${stop_stage} -ge 5 ]; then
# test a single .wav file
CUDA_VISIBLE_DEVICES=0 ./local/test_wav.sh ${conf_path} exp/${ckpt}/checkpoints/${avg_ckpt} ${audio_file} || exit -1
fi
```
You can train the model by yourself using `bash run.sh --stage 0 --stop_stage 3`, or you can download the pretrained model through the script below:
```bash
wget https://paddlespeech.bj.bcebos.com/s2t/librispeech/asr1/conformer.model.tar.gz
tar xzvf conformer.model.tar.gz
```
You can download the audio demo:
```bash
wget -nc https://paddlespeech.bj.bcebos.com/datasets/single_wav/en/demo_002_en.wav -P data/
```
You need to prepare an audio file or use the audio demo above. Please confirm that the sample rate of the audio is 16 kHz. You can get the result of the audio demo by running the script below.
```bash
CUDA_VISIBLE_DEVICES= ./local/test_wav.sh conf/conformer.yaml exp/conformer/checkpoints/avg_20 data/demo_002_en.wav
```

@ -1,91 +1,65 @@
# Transformer/Conformer ASR with Librispeech Asr2
This example contains code used to train a Transformer or [Conformer](http://arxiv.org/abs/2008.03802) model with [Librispeech dataset](http://www.openslr.org/resources/12) and uses some functions in Kaldi.
To use this example, you need to install Kaldi first.
## Overview
All the scripts you need are in `run.sh`. There are several stages in `run.sh`, and each stage has its function.
| Stage | Function |
|:---- |:----------------------------------------------------------- |
| 0 | Process data. It includes: <br> (1) Download the dataset <br> (2) Calculate the CMVN of the train dataset <br> (3) Get the vocabulary file <br> (4) Get the manifest files of the train, development and test dataset<br> (5) Get the sentencepiece model |
| 1 | Train the model |
| 2 | Get the final model by averaging the top-k models; setting k = 1 means choosing the best model |
| 3 | Test the final model performance |
| 4 | Join ctc decoder and use transformer language model to score |
| 5 | Get ctc alignment of test data using the final model |
| 6 | Calculate the perplexity of the transformer language model |
You can choose to run a range of stages by setting `stage` and `stop_stage`.
For example, if you want to execute the code in stage 2 and stage 3, you can run this script:
```bash
bash run.sh --stage 2 --stop_stage 3
```
Or you can set `stage` equal to `stop-stage` to only run one stage.
For example, if you only want to run `stage 0`, you can use the script below:
```bash
bash run.sh --stage 0 --stop_stage 0
```
The document below will describe the scripts in `run.sh` in detail.
## The Environment Variables
The path.sh contains the environment variables.
```bash
. ./path.sh
. ./cmd.sh
```
This script needs to be run first. Another script is also needed:
```bash
source ${MAIN_ROOT}/utils/parse_options.sh
```
It enables passing options in the `--variable value` style to the shell scripts.
## The Local Variables
Some local variables are set in `run.sh`.
`gpus` denotes the GPU number you want to use. If you set `gpus=`, it means you only use CPU.
`stage` denotes the number of the stage you want to start from in the experiments.
`stop_stage` denotes the number of the stage you want to end at in the experiments.
`conf_path` denotes the config path of the model.
`dict_path` denotes the path of the vocabulary file.
`avg_num` denotes the number K of top-K models you want to average to get the final model.
`ckpt` denotes the checkpoint prefix of the model, e.g. "transformer".
You can set the local variables (except `ckpt`) when you use `run.sh`.
For example, you can set the `gpus` and `avg_num` when you use the command line:
```bash
bash run.sh --gpus 0,1 --avg_num 10
```
## Stage 0: Data Processing
To use this example, you need to process the data first. You can use stage 0 in `run.sh` to do this. The code is shown below:
```bash
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
@ -93,7 +67,6 @@ To use this example, you need to process data firstly and you can use stage 0 in
bash ./local/data.sh || exit -1
fi
```
Stage 0 is for processing the data.
If you only want to process the data, you can run:
@ -156,56 +129,39 @@ data/
└── train_sp_org
```
## Stage 1: Model Training
If you want to train the model, you can use stage 1 in `run.sh`. The code is shown below.
```bash
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
# train model, all `ckpt` under `exp` dir
CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${ckpt}
fi
```
If you want to train the model, you can use the script below to execute stage 0 and stage 1:
```bash
bash run.sh --stage 0 --stop_stage 1
```
or you can run these scripts in the command line (only use CPU).
```bash
. ./path.sh
. ./cmd.sh
bash ./local/data.sh
CUDA_VISIBLE_DEVICES= ./local/train.sh conf/transformer.yaml transformer
```
## Stage 2: Top-k Models Averaging
After training the model, we need to get the final model for testing and inference. In every epoch, the model checkpoint is saved, so we can choose the last K models and average the parameters of the models to get the final model. We can use stage 2 to do this, and the code is shown below:
```bash
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
# avg n best model
avg.sh latest exp/${ckpt}/checkpoints ${avg_num}
fi
```
The `avg.sh` is in `../../../utils/`, which is defined in `path.sh`.
If you want to get the final model, you can use the script below to execute stage 0, stage 1, and stage 2:
```bash
bash run.sh --stage 0 --stop_stage 2
```
or you can run these scripts in the command line (only use CPU).
```bash
. ./path.sh
. ./cmd.sh
@ -213,28 +169,19 @@ bash ./local/data.sh
CUDA_VISIBLE_DEVICES= ./local/train.sh conf/transformer.yaml transformer
avg.sh best exp/transformer/checkpoints 10
```
## Stage 3: Model Testing
Stage 3 is to evaluate the model performance with an attention rescore decoder. The code of this stage is shown below:
```bash
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
# attention rescore decoder
./local/test.sh ${conf_path} ${dict_path} exp/${ckpt}/checkpoints/${avg_ckpt} || exit -1
fi
```
If you want to train a model and test it, you can use the script below to execute stage 0, stage 1, stage 2, and stage 3:
```bash
bash run.sh --stage 0 --stop_stage 3
```
or you can run these scripts in the command line (only use CPU).
```bash
. ./path.sh
. ./cmd.sh
@ -243,29 +190,20 @@ CUDA_VISIBLE_DEVICES= ./local/train.sh conf/transformer.yaml transformer
avg.sh latest exp/transformer/checkpoints 10
CUDA_VISIBLE_DEVICES= ./local/test.sh conf/transformer.yaml data/train_960_unigram5000_units.txt exp/transformer/checkpoints/avg_10
```
## Stage 4: Model Testing with Join CTC Decoder
Stage 4 is to evaluate the model performance with the join ctc decoder. The code of this stage is shown below:
```bash
if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
# join ctc decoder, use transformerlm to score
./local/recog.sh --ckpt_prefix exp/${ckpt}/checkpoints/${avg_ckpt}
fi
```
If you want to train a model and test it, you can use the script below to execute stage 0, stage 1, stage 2, stage 3, and stage 4:
```bash
bash run.sh --stage 0 --stop_stage 3
bash run.sh --stage 4 --stop_stage 4
```
or you can run these scripts in the command line (only use CPU).
```bash
. ./path.sh
. ./cmd.sh
@ -274,24 +212,16 @@ CUDA_VISIBLE_DEVICES= ./local/train.sh conf/transformer.yaml transformer
avg.sh latest exp/transformer/checkpoints 10
./local/recog.sh --ckpt_prefix exp/transformer/checkpoints/avg_10
```
## Pretrained Model
You can get the pretrained transformer using the scripts below:
```bash
# Transformer:
wget https://paddlespeech.bj.bcebos.com/s2t/librispeech/asr2/transformer.model.tar.gz
```
Use the `tar` command to unpack the model, and then you can use the script to test the model.
For example:
```bash
wget https://paddlespeech.bj.bcebos.com/s2t/librispeech/asr2/transformer.model.tar.gz
tar xzvf transformer.model.tar.gz
source path.sh
@ -301,19 +231,13 @@ bash local/data.sh --stage 2 --stop_stage 2
CUDA_VISIBLE_DEVICES= ./local/test.sh conf/transformer.yaml exp/ctc/checkpoints/avg_10
```
The performance of the released models is shown below:
### Transformer
| Model | Params | GPUS | Averaged Model | Config | Augmentation | Loss |
| :---------: | :----: | :--------------------: | :--------------: | :-------------------: | :----------: | :-------------: |
| transformer | 32.52M | 8 Tesla V100-SXM2-32GB | 10-best val_loss | conf/transformer.yaml | spec_aug | 6.3197922706604 |
#### Attention Rescore
| Test Set | Decode Method | #Snt | #Wrd | Corr | Sub | Del | Ins | Err | S.Err |
| ---------- | --------------------- | ---- | ----- | ---- | ---- | ---- | ---- | ---- | ----- |
| test-clean | attention | 2620 | 52576 | 96.4 | 2.5 | 1.1 | 0.4 | 4.0 | 34.7 |
@ -322,43 +246,31 @@ The performance of the released models are shown below:
| test-clean | attention_rescore | 2620 | 52576 | 96.8 | 2.9 | 0.3 | 0.4 | 3.7 | 38.0 |
#### JoinCTC
| Test Set | Decode Method | #Snt | #Wrd | Corr | Sub | Del | Ins | Err | S.Err |
| ---------- | ----------------- | ---- | ----- | ---- | ---- | ---- | ---- | ---- | ----- |
| test-clean | join_ctc_only_att | 2620 | 52576 | 96.1 | 2.5 | 1.4 | 0.4 | 4.4 | 34.7 |
| test-clean | join_ctc_w/o_lm | 2620 | 52576 | 97.2 | 2.6 | 0.3 | 0.4 | 3.2 | 34.9 |
| test-clean | join_ctc_w_lm | 2620 | 52576 | 97.9 | 1.8 | 0.2 | 0.3 | 2.4 | 27.8 |
Compared with [ESPNET](https://github.com/espnet/espnet/blob/master/egs/librispeech/asr1/RESULTS.md#pytorch-large-transformer-with-specaug-4-gpus--transformer-lm-4-gpus), we use 8 GPUs, but our model size (aheads4-adim256) is smaller.
## Stage 5: CTC Alignment
If you want to get the alignment between the audio and the text, you can use the ctc alignment. The code of this stage is shown below:
```bash
if [ ${stage} -le 5 ] && [ ${stop_stage} -ge 5 ]; then
# ctc alignment of test data
CUDA_VISIBLE_DEVICES=0 ./local/align.sh ${conf_path} ${dict_path} exp/${ckpt}/checkpoints/${avg_ckpt} || exit -1
fi
```
If you want to train the model, test it and do the alignment, you can use the script below to execute stage 0, stage 1, stage 2, stage 3, stage 4, and stage 5:
```bash
bash run.sh --stage 0 --stop_stage 5
```
or if you only need to train a model and do the alignment, you can use these scripts to skip stage 3 (the test stage):
```bash
bash run.sh --stage 0 --stop_stage 2
bash run.sh --stage 5 --stop_stage 5
```
or you can also use these scripts in the command line (only use CPU).
```bash
. ./path.sh
. ./cmd.sh
@ -367,22 +279,15 @@ CUDA_VISIBLE_DEVICES= ./local/train.sh conf/transformer.yaml transformer
avg.sh best exp/transformer/checkpoints 20
CUDA_VISIBLE_DEVICES= ./local/align.sh conf/transformer.yaml data/train_960_unigram5000_units.txt exp/transformer/checkpoints/avg_10
```
## Stage 6: Perplexity Calculation
This stage is for calculating the perplexity of the transformer language model. The code of this stage is shown below:
```bash
if [ ${stage} -le 6 ] && [ ${stop_stage} -ge 6 ]; then
./local/cacu_perplexity.sh || exit -1
fi
```
If you only want to calculate the perplexity of the transformer language model, you can use this script:
```bash
bash run.sh --stage 6 --stop_stage 6
```

@ -1,12 +1,12 @@
# G2P
For g2p, we use BZNSYP's phone label as the ground truth and we delete silence tokens in labels and predicted phones.
You should download BZNSYP from its [Official Website](https://test.data-baker.com/data/index/source) and extract it. Assume the path to the dataset is `~/datasets/BZNSYP`.
We use `WER` as an evaluation criterion.
# Start
Run the command below to get the results of the test.
```bash
./run.sh
```

@ -1,30 +1,30 @@
# Speaker Encoder
This experiment trains a speaker encoder with speaker verification as its task. It is done as a part of the experiment of transfer learning from speaker verification to multispeaker text-to-speech synthesis, which can be found at [examples/aishell3/vc0](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/vc0). The trained speaker encoder is used to extract utterance embeddings from utterances.
## Model
The model used in this experiment is the speaker encoder with a text-independent speaker verification task in [GENERALIZED END-TO-END LOSS FOR SPEAKER VERIFICATION](https://arxiv.org/pdf/1710.10467.pdf). GE2E-softmax loss is used.
## Download Datasets
Currently supported datasets are Librispeech-other-500, VoxCeleb, VoxCeleb2, ai-datatang-200zh, and magicdata, which can be downloaded from the corresponding webpages.
1. Librispeech/train-other-500
An English multispeaker dataset [URL](https://www.openslr.org/resources/12/train-other-500.tar.gz); only the `train-other-500` subset is used.
2. VoxCeleb1
An English multispeaker dataset [URL](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox1.html), Audio Files from Dev A to Dev D should be downloaded, combined, and extracted.
3. VoxCeleb2
An English multispeaker dataset [URL](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox1.html), Audio Files from Dev A to Dev H should be downloaded, combined, and extracted.
4. Aidatatang-200zh
A Mandarin Chinese multispeaker dataset [URL](https://www.openslr.org/62/).
5. magicdata
A Mandarin Chinese multispeaker dataset [URL](https://www.openslr.org/68/).
If you want to use other datasets, you can also download and preprocess them as long as they meet the requirements described below.
## Get Started
```bash
./run.sh
```
You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage, for example, running the following command will only preprocess the dataset.
```bash
./run.sh --stage 0 --stop-stage 0
```
@ -33,13 +33,13 @@ You can choose a range of stages you want to run, or set `stage` equal to `stop-
```bash
./local/preprocess.sh ${datasets_root} ${preprocess_path} ${dataset_names}
```
Assume `datasets_root` is `~/datasets/GE2E`, and it has the following structure (we only use `train-other-500` for simplicity):
```Text
GE2E
├── LibriSpeech
└── (other datasets)
```
Multispeaker datasets are used as training data, though the transcriptions are not used. To enlarge the amount of data used for training, several multispeaker datasets are combined. The preprocessed datasets are organized in a file structure described below. The mel spectrogram of each utterance is saved in `.npy` format. The dataset is 2-stratified (speaker-utterance). Since multiple datasets are combined, to avoid conflict in speaker id, the dataset name is prepended to the speaker ids.
```text
dataset_root
@ -63,7 +63,7 @@ dataset_root
In `${BIN_DIR}/preprocess.py`:
1. `--datasets_root` is the directory that contains several extracted datasets.
2. `--output_dir` is the directory to save the preprocessed dataset.
3. `--dataset_names` is the dataset to preprocess. If there are multiple datasets in `--datasets_root` to preprocess, the names can be joined with commas. Currently supported dataset names are librispeech_other, voxceleb1, voxceleb2, aidatatang_200zh, and magicdata.
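For example, preprocessing only the LibriSpeech subset could look like the sketch below (the output directory is an arbitrary example path; it is assumed that `path.sh` in this example sets `BIN_DIR` as in the other recipes):
```bash
# A sketch only: preprocess the train-other-500 subset.
source path.sh
python3 ${BIN_DIR}/preprocess.py \
    --datasets_root ~/datasets/GE2E \
    --output_dir ~/datasets/GE2E/GE2E_output \
    --dataset_names librispeech_other
```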
### Model Training
`./local/train.sh` calls `${BIN_DIR}/train.py`.
@ -72,15 +72,15 @@ CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${preprocess_path} ${train_output_
```
In `${BIN_DIR}/train.py`:
1. `--data` is the path to the preprocessed dataset.
2. `--output` is the directory to save the results, usually a subdirectory of `runs`. It contains visualdl log files, text log files, config files, and a `checkpoints` directory, which contains parameter files and optimizer state files. If `--output` already has some training results in it, the most recent parameter file and optimizer state file are loaded before training.
4. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.
5. `CUDA_VISIBLE_DEVICES` can be used to specify visible devices with cuda.
Other options are described below.
- `--config` is a `.yaml` config file used to override the default config (which is coded in `config.py`).
- `--opts` is a command-line option to further override the config file. It should be the last command-line option, passed as multiple key-value pairs separated by spaces.
- `--checkpoint_path` specifies the checkpoint to load before training; the extension is not included. A parameter file (`.pdparams`) and an optimizer state file (`.pdopt`) with the same name are used. This option has a higher priority than auto-resuming from the `--output` directory.
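As an illustration, a hypothetical run that resumes from a checkpoint and overrides a single config value might look like this (the checkpoint name and the config key below are made-up examples, not values guaranteed to exist in `config.py`):
```bash
# Sketch: resume training from a checkpoint and override one config value.
# Assumes `source path.sh` has been run so that BIN_DIR is set.
CUDA_VISIBLE_DEVICES=0 python3 ${BIN_DIR}/train.py \
    --data ~/datasets/GE2E/GE2E_output \
    --output runs/ge2e_example \
    --ngpu 1 \
    --checkpoint_path runs/ge2e_example/checkpoints/step-5000 \
    --opts training.max_iteration 200000
```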
### Inferencing
When training is done, run the command below to generate an utterance embedding for each utterance in a dataset.
@ -90,7 +90,7 @@ CUDA_VISIBLE_DEVICES=${gpus} ./local/inference.sh ${infer_input} ${infer_output}
```
In `${BIN_DIR}/inference.py`:
1. `--input` is the path of the dataset used for inference.
2. `--output` is the directory to save the processed results. It has the same file structure as the input dataset. Each utterance in the dataset has a corresponding utterance embedding file in `*.npy` format.
3. `--checkpoint_path` is the path of the checkpoint to use, extension not included.
4. `--pattern` is the wildcard pattern to filter audio files for inference, defaults to `*.wav`.
5. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu.
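Putting these options together, a hypothetical direct call could look like this (all paths are examples, and the pattern is overridden only to show the option):
```bash
# Sketch: write one *.npy embedding per matched audio file.
# Assumes `source path.sh` has been run so that BIN_DIR is set.
CUDA_VISIBLE_DEVICES=0 python3 ${BIN_DIR}/inference.py \
    --input ~/datasets/GE2E/LibriSpeech \
    --output exp/ge2e/embeddings \
    --checkpoint_path runs/ge2e_example/checkpoints/step-5000 \
    --pattern "*.flac" \
    --ngpu 1
```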

@ -1,9 +1,9 @@
# Text Normalization
For text normalization, the test data is `data/textnorm_test_cases.txt`; we use `|` as the separator between raw_data and normed_data.
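Each line therefore pairs a raw sentence with its normalized form; a made-up example line (not taken from the actual test file) looks like:
```text
今天是2021年1月1日|今天是二零二一年一月一日
```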
We use `CER` as the evaluation criterion.
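For reference, `CER` here is the standard character error rate, computed from the edit distance between the normalized output and the reference text (this is the general definition, not a formula specific to this repo):
```latex
\mathrm{CER} = \frac{S + D + I}{N}
```
where S, D, and I are the numbers of substituted, deleted, and inserted characters, and N is the number of characters in the reference.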
## Start
Run the command below to get the results of the test:
```bash
./run.sh
```

@ -1,187 +1,127 @@
# Transformer/Conformer ST0 with TED_En_Zh
This example contains code used to train a Transformer or [Conformer](http://arxiv.org/abs/2008.03802) model with the TED_En_Zh dataset.
## Overview
All the scripts you need are in `run.sh`. There are several stages in `run.sh`, and each stage has its function.
You need to download the TED_En_Zh dataset by yourself.
| Stage | Function |
|:---- |:----------------------------------------------------------- |
| 0 | Process data. It includes: <br> (1) Calculate the CMVN of the train dataset <br> (2) Get the vocabulary file <br> (3) Get the manifest files of the train, development, and test datasets<br> |
| 1 | Train the model |
| 2 | Get the final model by averaging the top-k models; setting k = 1 means choosing the best model |
| 3 | Test the final model performance |
You can choose to run a range of stages by setting `stage` and `stop_stage`.
For example, if you want to execute the code in stage 2 and stage 3, you can run this script:
```bash
bash run.sh --stage 2 --stop_stage 3
```
Or you can set `stage` equal to `stop-stage` to only run one stage.
For example, if you only want to run `stage 0`, you can use the script below:
```bash
bash run.sh --stage 0 --stop_stage 0
```
The document below will describe the scripts in `run.sh` in detail.
## The Environment Variables
The `path.sh` contains the environment variables.
```bash
source path.sh
```
This script needs to be run first. And another script is also needed:
```bash
source ${MAIN_ROOT}/utils/parse_options.sh
```
It supports the `--variable value` style of passing options to the shell scripts.
## The Local Variables
Some local variables are set in `run.sh`.
`gpus` denotes the GPU number you want to use. If you set `gpus=`, it means you only use CPU.
`stage` denotes the number of the stage you want to start from in the experiments.
`stop_stage` denotes the number of the stage you want to end at in the experiments.
`conf_path` denotes the config path of the model.
`data_path` denotes the path of the dataset.
`avg_num` denotes the number K of top-K models you want to average to get the final model.
`ckpt` denotes the checkpoint prefix of the model, e.g. "transformer_mtl_noam".
You can set the local variables (except `ckpt`) when you use `run.sh`.
For example, you can set `gpus` and `avg_num` on the command line:
```bash
bash run.sh --gpus 0,1 --avg_num 5
```
## Stage 0: Data Processing
To use this example, you need to process the data first, and you can use stage 0 in `run.sh` to do this. The code is shown below:
```bash
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
# prepare data
bash ./local/data.sh || exit -1
fi
```
Stage 0 is for processing the data.
If you only want to process the data, you can run:
```bash
bash run.sh --stage 0 --stop_stage 0
```
You can also just run these scripts in your command line.
```bash
source path.sh
bash ./local/data.sh
```
## Stage 1: Model Training
If you want to train the model, you can use stage 1 in `run.sh`. The code is shown below:
```bash
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
# train model, all `ckpt` under `exp` dir
CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${ckpt}
fi
```
If you want to train the model, you can use the script below to execute stage 0 and stage 1:
```bash
bash run.sh --stage 0 --stop_stage 1
```
or you can run these scripts in the command line (only use CPU).
```bash
source path.sh
bash ./local/data.sh
CUDA_VISIBLE_DEVICES= ./local/train.sh conf/transformer_mtl_noam.yaml transformer_mtl_noam
```
## Stage 2: Top-k Models Averaging
After training the model, we need to get the final model for testing and inference. In every epoch, the model checkpoint is saved, so we can choose the best model from them based on the validation loss or we can sort them and average the parameters of the top-k models to get the final model. We can use stage 2 to do this, and the code is shown below:
```bash
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
# avg n best model
avg.sh best exp/${ckpt}/checkpoints ${avg_num}
fi
```
The `avg.sh` is in `../../../utils/`, which is defined in `path.sh`.
If you want to get the final model, you can use the script below to execute stage 0, stage 1, and stage 2:
```bash
bash run.sh --stage 0 --stop_stage 2
```
or you can run these scripts in the command line (only use CPU).
```bash
source path.sh
bash ./local/data.sh
CUDA_VISIBLE_DEVICES= ./local/train.sh conf/transformer_mtl_noam.yaml transformer_mtl_noam
avg.sh best exp/transformer_mtl_noam/checkpoints 5
```
## Stage 3: Model Testing
Stage 3 is to evaluate the model performance. The code of this stage is shown below:
```bash
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
# test ckpt avg_n
CUDA_VISIBLE_DEVICES=0 ./local/test.sh ${conf_path} exp/${ckpt}/checkpoints/${avg_ckpt} || exit -1
fi
```
If you want to train a model and test it, you can use the script below to execute stage 0, stage 1, stage 2, and stage 3:
```bash
bash run.sh --stage 0 --stop_stage 3
```
or you can run these scripts in the command line (only use CPU).
```bash
source path.sh
bash ./local/data.sh
@ -189,15 +129,8 @@ CUDA_VISIBLE_DEVICES= ./local/train.sh conf/transformer_mtl_noam.yaml transforme
avg.sh latest exp/transformer_mtl_noam/checkpoints 5
CUDA_VISIBLE_DEVICES= ./local/test.sh conf/transformer_mtl_noam.yaml exp/transformer_mtl_noam/checkpoints/avg_5
```
The performance of the released models is shown below:
### Transformer
| Model | Params | Config | Char-BLEU |
| ------------------- | ------ | -------------------------------- | --------- |
| Transformer+ASR MTL | 50.26M | conf/transformer_joint_noam.yaml | 17.38 |

@ -1,122 +1,81 @@
# Transformer/Conformer ST1 with TED_En_Zh
This example contains code used to train a Transformer or [Conformer](http://arxiv.org/abs/2008.03802) model with the TED_En_Zh dataset.
To use this example, you need to install Kaldi first.
The main difference between st0 and st1 is that st1 uses Kaldi features.
## Overview
All the scripts you need are in `run.sh`. There are several stages in `run.sh`, and each stage has its function.
You need to download the TED_En_Zh dataset by yourself.
| Stage | Function |
|:---- |:----------------------------------------------------------- |
| 0 | Process data. It includes: <br> (1) Calculate the CMVN of the train dataset <br> (2) Get the vocabulary file <br> (3) Get the manifest files of the train, development, and test datasets<br> |
| 1 | Train the model |
| 2 | Get the final model by averaging the top-k models; setting k = 1 means choosing the best model |
| 3 | Test the final model performance |
You can choose to run a range of stages by setting `stage` and `stop_stage`.
For example, if you want to execute the code in stage 2 and stage 3, you can run this script:
```bash
bash run.sh --stage 2 --stop_stage 3
```
Or you can set `stage` equal to `stop-stage` to only run one stage.
For example, if you only want to run `stage 0`, you can use the script below:
```bash
bash run.sh --stage 0 --stop_stage 0
```
The document below will describe the scripts in `run.sh` in detail.
## The Environment Variables
The `path.sh` contains the environment variables.
```bash
. ./path.sh
. ./cmd.sh
```
This script needs to be run first. And another script is also needed:
```bash
source ${MAIN_ROOT}/utils/parse_options.sh
```
It supports the `--variable value` style of passing options to the shell scripts.
## The Local Variables
Some local variables are set in `run.sh`.
`gpus` denotes the GPU number you want to use. If you set `gpus=`, it means you only use CPU.
`stage` denotes the number of the stage you want to start from in the experiments.
`stop_stage` denotes the number of the stage you want to end at in the experiments.
`conf_path` denotes the config path of the model.
`data_path` denotes the path of the dataset.
`avg_num` denotes the number K of top-K models you want to average to get the final model.
`ckpt` denotes the checkpoint prefix of the model, e.g. "transformer_mtl_noam".
You can set the local variables (except `ckpt`) when you use `run.sh`.
For example, you can set `gpus` and `avg_num` on the command line:
```bash
bash run.sh --gpus 0,1 --avg_num 5
```
## Stage 0: Data Processing
To use this example, you need to process the data first, and you can use stage 0 in `run.sh` to do this. The code is shown below:
```bash
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
# prepare data
bash ./local/data.sh || exit -1
fi
```
Stage 0 is for processing the data.
If you only want to process the data, you can run:
```bash
bash run.sh --stage 0 --stop_stage 0
```
You can also just run these scripts in your command line.
```bash
. ./path.sh
. ./cmd.sh
bash ./local/data.sh
```
## Stage 1: Model Training
If you want to train the model, you can use stage 1 in `run.sh`. The code is shown below:
```bash
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
# train model, all `ckpt` under `exp` dir
@ -127,44 +86,31 @@ if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${ckpt} "${ckpt_path}"
fi
```
If you want to train the model, you can use the script below to execute stage 0 and stage 1:
```bash
bash run.sh --stage 0 --stop_stage 1
```
or you can run these scripts in the command line (only use CPU).
```bash
. ./path.sh
. ./cmd.sh
bash ./local/data.sh
CUDA_VISIBLE_DEVICES= ./local/train.sh conf/transformer_mtl_noam.yaml transformer_mtl_noam ""
```
## Stage 2: Top-k Models Averaging
After training the model, we need to get the final model for testing and inference. In every epoch, the model checkpoint is saved, so we can choose the best model from them based on the validation loss or we can sort them and average the parameters of the top-k models to get the final model. We can use stage 2 to do this, and the code is shown below:
```bash
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
# avg n best model
avg.sh best exp/${ckpt}/checkpoints ${avg_num}
fi
```
The `avg.sh` is in `../../../utils/`, which is defined in `path.sh`.
If you want to get the final model, you can use the script below to execute stage 0, stage 1, and stage 2:
```bash
bash run.sh --stage 0 --stop_stage 2
```
or you can run these scripts in the command line (only use CPU).
```bash
. ./path.sh
. ./cmd.sh
@ -172,28 +118,19 @@ bash ./local/data.sh
CUDA_VISIBLE_DEVICES= ./local/train.sh conf/transformer_mtl_noam.yaml transformer_mtl_noam
avg.sh best exp/transformer_mtl_noam/checkpoints 5
```
## Stage 3: Model Testing
Stage 3 is to evaluate the model performance. The code of this stage is shown below:
```bash
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
# test ckpt avg_n
CUDA_VISIBLE_DEVICES=0 ./local/test.sh ${conf_path} exp/${ckpt}/checkpoints/${avg_ckpt} || exit -1
fi
```
If you want to train a model and test it, you can use the script below to execute stage 0, stage 1, stage 2, and stage 3:
```bash
bash run.sh --stage 0 --stop_stage 3
```
or you can run these scripts in the command line (only use CPU).
```bash
. ./path.sh
. ./cmd.sh
@ -202,15 +139,9 @@ CUDA_VISIBLE_DEVICES= ./local/train.sh conf/transformer_mtl_noam.yaml transforme
avg.sh latest exp/transformer_mtl_noam/checkpoints 5
CUDA_VISIBLE_DEVICES= ./local/test.sh conf/transformer_mtl_noam.yaml exp/transformer_mtl_noam/checkpoints/avg_5
```
The performance of the released models is shown below:
### Transformer
| Model | Params | Config | Val loss | Char-BLEU |
| --- | --- | --- | --- | --- |
| FAT + Transformer+ASR MTL | 50.26M | conf/transformer_mtl_noam.yaml | 62.86 | 19.45 |
| FAT + Transformer+ASR MTL with word reward | 50.26M | conf/transformer_mtl_noam.yaml | 62.86 | 20.80 |

@ -1,96 +1,66 @@
# DeepSpeech2 offline/online ASR with Tiny
This example contains code used to train a DeepSpeech2 offline or online model with the Tiny dataset (a part of the [Librispeech dataset](http://www.openslr.org/resources/12)).
## Overview
All the scripts you need are in the `run.sh`. There are several stages in the `run.sh`, and each stage has its function.
| Stage | Function |
|:---- |:----------------------------------------------------------- |
| 0 | Process data. It includes: <br> (1) Download the dataset <br> (2) Calculate the CMVN of the train dataset <br> (3) Get the vocabulary file <br> (4) Get the manifest files of the train, development, and test datasets |
| 1 | Train the model |
| 2 | Get the final model by averaging the top-k models; setting k = 1 means choosing the best model |
| 3 | Test the final model performance |
| 4 | Export the static graph model |
You can choose to run a range of stages by setting the `stage` and `stop_stage`.
For example, if you want to execute the code in stage 2 and stage 3, you can run this script:
```bash
bash run.sh --stage 2 --stop_stage 3
```
Or you can set `stage` equal to `stop-stage` to only run one stage.
For example, if you only want to run `stage 0`, you can use the script below:
```bash
bash run.sh --stage 0 --stop_stage 0
```
The document below will describe the scripts in the `run.sh` in detail.
## The environment variables
The `path.sh` contains the environment variables.
```bash
source path.sh
```
This script needs to be run first.
And another script is also needed:
```bash
source ${MAIN_ROOT}/utils/parse_options.sh
```
It supports the `--variable value` style of passing options to the shell scripts.
## The local variables
Some local variables are set in the `run.sh`.
`gpus` denotes the GPU number you want to use. If you set `gpus=`, it means you only use CPU.
`stage` denotes the number of the stage you want to start from in the experiments.
`stop stage` denotes the number of the stage you want to end at in the experiments.
`conf_path` denotes the config path of the model.
`avg_num` denotes the number K of top-K models you want to average to get the final model.
`model_type` denotes the model type: offline or online.
`ckpt` denotes the checkpoint prefix of the model, e.g. "deepspeech2".
You can set the local variables (except `ckpt`) when you use `run.sh`.
For example, you can set `gpus` and `avg_num` on the command line:
```bash
bash run.sh --gpus 0,1 --avg_num 20
```
## Stage 0: Data processing
To use this example, you need to process the data first, and you can use stage 0 in `run.sh` to do this. The code is shown below:
```bash
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
# prepare data
bash ./local/data.sh || exit -1
fi
```
Stage 0 is for processing the data.
If you only want to process the data, you can run:
@ -98,16 +68,12 @@ If you only want to process the data. You can run
```bash
bash run.sh --stage 0 --stop_stage 0
```
You can also just run these scripts in your command line.
```bash
source path.sh
bash ./local/data.sh
```
After processing the data, the `data` directory will look like this:
```bash
data/
|-- dev.meta
@ -123,54 +89,37 @@ data/
|-- test.meta
`-- train.meta
```
## Stage 1: Model training
If you want to train the model, you can use stage 1 in `run.sh`. The code is shown below:
```bash
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
# train model, all `ckpt` under `exp` dir
CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${ckpt}
fi
```
If you want to train the model, you can use the script below to execute stage 0 and stage 1:
```bash
bash run.sh --stage 0 --stop_stage 1
```
or you can run these scripts in the command line (only use CPU).
```bash
source path.sh
bash ./local/data.sh
CUDA_VISIBLE_DEVICES= ./local/train.sh conf/deepspeech2.yaml deepspeech2
```
## Stage 2: Top-k Models Averaging
After training the model, we need to get the final model for testing and inference. In every epoch, the model checkpoint is saved, so we can choose the best model from them based on the validation loss or we can sort them and average the parameters of the top-k models to get the final model. We can use stage 2 to do this, and the code is shown below:
```bash
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
# avg n best model
avg.sh best exp/${ckpt}/checkpoints ${avg_num}
fi
```
The `avg.sh` is in `../../../utils/`, which is defined in `path.sh`.
If you want to get the final model, you can use the script below to execute stage 0, stage 1, and stage 2:
```bash
bash run.sh --stage 0 --stop_stage 2
```
or you can run these scripts in the command line (only use CPU).
```bash
@ -179,12 +128,8 @@ bash ./local/data.sh
CUDA_VISIBLE_DEVICES= ./local/train.sh conf/deepspeech2.yaml deepspeech2
avg.sh best exp/deepspeech2/checkpoints 1
```
## Stage 3: Model Testing
The test stage is to evaluate the model performance. The code of the test stage is shown below:
```bash
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
@ -192,15 +137,11 @@ The test stage is to evaluate the model performance.. The code of test stage is
CUDA_VISIBLE_DEVICES=0 ./local/test.sh ${conf_path} exp/${ckpt}/checkpoints/${avg_ckpt} || exit -1
fi
```
If you want to train a model and test it, you can use the script below to execute stage 0, stage 1, stage 2, and stage 3:
```bash
bash run.sh --stage 0 --stop_stage 3
```
or you can run these scripts in the command line (only use CPU).
```bash
source path.sh
bash ./local/data.sh
@ -208,23 +149,16 @@ CUDA_VISIBLE_DEVICES= ./local/train.sh conf/deepspeech2.yaml deepspeech2
avg.sh best exp/deepspeech2/checkpoints 1
CUDA_VISIBLE_DEVICES= ./local/test.sh conf/deepspeech2.yaml exp/deepspeech2/checkpoints/avg_1
```
## Stage 4: Static Graph Model Export
This stage is to transform the dynamic graph model into a static graph model.
```bash
if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
# export ckpt avg_n
CUDA_VISIBLE_DEVICES=0 ./local/export.sh ${conf_path} exp/${ckpt}/checkpoints/${avg_ckpt} exp/${ckpt}/checkpoints/${avg_ckpt}.jit ${model_type}
fi
```
If you already have a dynamic graph model, you can run this script:
```bash
source path.sh
./local/export.sh conf/deepspeech2.yaml exp/deepspeech2/checkpoints/avg_1 exp/deepspeech2/checkpoints/avg_1.jit offline
```

@ -1,80 +1,53 @@
# Transformer/Conformer ASR with Tiny
This example contains code used to train a Transformer or [Conformer](http://arxiv.org/abs/2008.03802) model with the Tiny dataset (a part of the [Librispeech dataset](http://www.openslr.org/resources/12)).
## Overview
All the scripts you need are in `run.sh`. There are several stages in `run.sh`, and each stage has its function.
| Stage | Function |
|:---- |:----------------------------------------------------------- |
| 0 | Process data. It includes: <br> (1) Download the dataset <br> (2) Calculate the CMVN of the train dataset <br> (3) Get the vocabulary file <br> (4) Get the manifest files of the train, development, and test datasets<br> (5) Get the sentencepiece model |
| 1 | Train the model |
| 2 | Get the final model by averaging the top-k models; setting k = 1 means choosing the best model |
| 3 | Test the final model performance |
| 4 | Get ctc alignment of test data using the final model |
You can choose to run a range of stages by setting `stage` and `stop_stage`.
For example, if you want to execute the code in stage 2 and stage 3, you can run this script:
```bash
bash run.sh --stage 2 --stop_stage 3
```
Or you can set `stage` equal to `stop-stage` to only run one stage.
For example, if you only want to run `stage 0`, you can use the script below:
```bash
bash run.sh --stage 0 --stop_stage 0
```
The document below will describe the scripts in `run.sh` in detail.
## The Environment Variables
The `path.sh` contains the environment variables.
```bash
. ./path.sh
. ./cmd.sh
```
This script needs to be run first. And another script is also needed:
```bash
source ${MAIN_ROOT}/utils/parse_options.sh
```
It supports the `--variable value` style of passing options to the shell scripts.
## The Local Variables
Some local variables are set in `run.sh`.
`gpus` denotes the GPU number you want to use. If you set `gpus=`, it means you only use CPU.
`stage` denotes the number of the stage you want to start from in the experiments.
`stop_stage` denotes the number of the stage you want to end at in the experiments.
`conf_path` denotes the config path of the model.
`avg_num` denotes the number K of top-K models you want to average to get the final model.
`ckpt` denotes the checkpoint prefix of the model, e.g. "transformer".
You can set the local variables (except `ckpt`) when you use `run.sh`.
For example, you can set `gpus` and `avg_num` on the command line:
```bash
bash run.sh --gpus 0,1 --avg_num 1
```
## Stage 0: Data Processing
To use this example, you need to process the data first, and you can use stage 0 in `run.sh` to do this. The code is shown below:
```bash
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
@ -82,25 +55,19 @@ To use this example, you need to process data firstly and you can use stage 0 in
bash ./local/data.sh || exit -1
fi
```
Stage 0 is for processing the data.
If you only want to process the data, you can run:
```bash
bash run.sh --stage 0 --stop_stage 0
```
You can also just run these scripts in your command line.
```bash
. ./path.sh
. ./cmd.sh
bash ./local/data.sh
```
After processing the data, the `data` directory will look like this:
```bash
data/
|-- dev.meta
@ -118,57 +85,38 @@ data/
|-- test.meta
`-- train.meta
```
## Stage 1: Model Training
If you want to train the model, you can use stage 1 in `run.sh`. The code is shown below:
```bash
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
# train model, all `ckpt` under `exp` dir
CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${ckpt}
fi
```
If you want to train the model, you can use the script below to execute stage 0 and stage 1:
```bash
bash run.sh --stage 0 --stop_stage 1
```
or you can run these scripts in the command line (only use CPU).
```bash
. ./path.sh
. ./cmd.sh
bash ./local/data.sh
CUDA_VISIBLE_DEVICES= ./local/train.sh conf/transformer.yaml transformer
```
## Stage 2: Top-k Models Averaging
After training the model, we need to get the final model for testing and inference. In every epoch, the model checkpoint is saved, so we can choose the best model from them based on the validation loss or we can sort them and average the parameters of the top-k models to get the final model. We can use stage 2 to do this, and the code is shown below:
```bash
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
# avg n best model
avg.sh best exp/${ckpt}/checkpoints ${avg_num}
fi
```
The `avg.sh` is in `../../../utils/`, which is defined in `path.sh`.
If you want to get the final model, you can use the script below to execute stage 0, stage 1, and stage 2:
```bash
bash run.sh --stage 0 --stop_stage 2
```
or you can run these scripts in the command line (only use CPU).
```bash
. ./path.sh
. ./cmd.sh
@ -176,28 +124,19 @@ bash ./local/data.sh
CUDA_VISIBLE_DEVICES= ./local/train.sh conf/transformer.yaml transformer
avg.sh best exp/transformer/checkpoints 1
```
## Stage 3: Model Testing
The test stage is to evaluate the model performance. The code of the test stage is shown below:
```bash
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
# test ckpt avg_n
CUDA_VISIBLE_DEVICES=0 ./local/test.sh ${conf_path} exp/${ckpt}/checkpoints/${avg_ckpt} || exit -1
fi
```
If you want to train a model and test it, you can use the script below to execute stage 0, stage 1, stage 2, and stage 3:
```bash
bash run.sh --stage 0 --stop_stage 3
```
or you can run these scripts in the command line (only use CPU).
```bash
. ./path.sh
. ./cmd.sh
@ -206,34 +145,25 @@ CUDA_VISIBLE_DEVICES= ./local/train.sh conf/transformer.yaml transformer
avg.sh best exp/transformer/checkpoints 1
CUDA_VISIBLE_DEVICES= ./local/test.sh conf/transformer.yaml exp/transformer/checkpoints/avg_1
```
## Stage 4: CTC Alignment
If you want to get the alignment between the audio and the text, you can use the ctc alignment. The code of this stage is shown below:
```bash
if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
# ctc alignment of test data
CUDA_VISIBLE_DEVICES=0 ./local/align.sh ${conf_path} exp/${ckpt}/checkpoints/${avg_ckpt} || exit -1
fi
```
If you want to train the model, test it, and do the alignment, you can use the script below to execute stage 0, stage 1, stage 2, stage 3, and stage 4:
```bash
bash run.sh --stage 0 --stop_stage 4
```
or if you only need to train a model and do the alignment, you can use these scripts to skip stage 3 (the test stage):
```bash
bash run.sh --stage 0 --stop_stage 2
bash run.sh --stage 4 --stop_stage 4
```
or you can also use these scripts in the command line (only use CPU).
```bash
. ./path.sh
. ./cmd.sh
@ -244,4 +174,3 @@ avg.sh best exp/transformer/checkpoints 1
CUDA_VISIBLE_DEVICES= ./local/test.sh conf/transformer.yaml exp/transformer/checkpoints/avg_1
CUDA_VISIBLE_DEVICES= ./local/align.sh conf/transformer.yaml exp/transformer/checkpoints/avg_1
```

@ -41,7 +41,7 @@ We classify the high label into 10 groups according to its domain, speaking styl
| others | 144 | 507.5 | 651.5 |
| Total | 6113 | 3892 | 10005 |
As shown in the following table, we provide 3 training subsets, namely `S`, `M`, and `L` for building ASR systems on different data scales.
| Training Subsets | Confidence | Hours |
|------------------|-------------|-------|
