Audio tagging is the task of labelling an audio clip with one or more labels or tags, includeing music tagging, acoustic scene classification, audio event classification, etc.
Audio tagging is the task of labeling an audio clip with one or more labels or tags, including music tagging, acoustic scene classification, audio event classification, etc.
This demo is an implementation to tag an audio file with 527 [AudioSet](https://research.google.com/audioset/) labels. It can be done by a single command or a few lines in Python using `PaddleSpeech`.
@ -12,7 +12,7 @@ pip install paddlespeech
```
### 2. Prepare Input File
Input of this demo should be a WAV file(`.wav`).
The input of this demo should be a WAV file (`.wav`).
Here are sample files for this demo that can be downloaded:
Automatic video subtitiles can generate subtitiles from a specific video by using Automatic Speech Recognition (ASR) system.
Automatic video subtitling generates subtitles for a given video by using an Automatic Speech Recognition (ASR) system.
This demo is an implementation to automatic video subtitiles from a video file. It can be done by a single command or a few lines in python using `PaddleSpeech`.
This demo is an implementation that generates automatic video subtitles from a video file. It can be done by a single command or a few lines in Python using `PaddleSpeech`.
## Usage
### 1. Installation
@ -12,7 +12,7 @@ pip install paddlespeech
```
### 2. Prepare Input
Get a video file with speech of the specific language:
Get a video file with the speech of the specific language:
The metaverse is a new Internet application and social form that integrates virtual reality and is produced by combining a variety of new technologies.
This demo is an implementation to let a celebrity in an image "speak". With the composition of `TTS` mudule of `PaddleSpeech` and `PaddleGAN`, we integrate the installation and the specific modules in a single shell script.
This demo is an implementation to let a celebrity in an image "speak". With the composition of the `TTS` module of `PaddleSpeech` and `PaddleGAN`, we integrate the installation and the specific modules in a single shell script.
## Usage
You can make your favorite person say the specified content with the `TTS` mudule of `PaddleSpeech` and `PaddleGAN`, and construct your own virtual human.
You can make your favorite person say the specified content with the `TTS` module of `PaddleSpeech` and `PaddleGAN`, and construct your virtual human.
Run `run.sh` to complete all the essential procedures, including the installation.
@ -16,8 +14,8 @@ Run `run.sh` to complete all the essential procedures, including the installatio
```
`run.sh` first executes `source path.sh`, which sets the environment variables.
If you would like to try your own sentence, please replace the sentence in `sentences.txt`.
If you would like to try your sentence, please replace the sentence in `sentences.txt`.
If you would like to try your own image, please replace the image `download/Lamarr.png` in the shell script.
If you would like to try your image, please replace the image `download/Lamarr.png` in the shell script.
The result has shown on our [notebook](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/tutorial/tts/tts_tutorial.ipynb).
The result is shown in our [notebook](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/tutorial/tts/tts_tutorial.ipynb).
Punctuation restoration is a common post-processing problem for Automatic Speech Recognition (ASR) systems. It is important to improve the readability of the transcribed text for the human reader and facilitate NLP tasks.
This demo is an implementation to restore punctuation from a raw text. It can be done by a single command or a few lines in python using `PaddleSpeech`.
This demo is an implementation to restore punctuation from raw text. It can be done by a single command or a few lines in Python using `PaddleSpeech`.
## Usage
### 1. Installation
```bash
pip install paddlespeech
```
### 2. Prepare Input
Input of this demo should be a text of the specific language that can be passed via argument.
The input of this demo should be a text of the specific language that can be passed via argument.
### 3. Usage
- Command Line (Recommended)
@ -63,10 +60,8 @@ Input of this demo should be a text of the specific language that can be passed
今天的天气真不错啊!你下午有空吗?我想约你一起去吃饭。
```
### 4. Pretrained Models
Here is a list of pretrained models released by PaddleSpeech that can be used by command and python api:
Here is a list of pretrained models released by PaddleSpeech that can be used from the command line and the Python API:
- Punctuation Restoration
| Model | Language | Number of Punctuation Characters
Speech translation is the process by which conversational spoken phrases are instantly translated and spoken aloud in a second language.
This demo is an implementation to recognize text from a specific audio file and translate to target language. It can be done by a single command or a few lines in python using `PaddleSpeech`.
This demo is an implementation to recognize text from a specific audio file and translate it to the target language. It can be done by a single command or a few lines in Python using `PaddleSpeech`.
## Usage
### 1. Installation
@ -13,7 +13,7 @@ pip install paddlespeech
```
### 2. Prepare Input File
Input of this demo should be a WAV file(`.wav`).
The input of this demo should be a WAV file (`.wav`).
Here are sample files for this demo that can be downloaded:
Storybooks are very important children's enlightenment books, but parents usually don't have enough time to read storybooks for their children. For very young children, they may not understand the Chinese characters in storybooks. Or sometimes, children just want to "listen" but don't want to "read".
You can use `PaddleOCR` to get the text of a storybook, and read it by the `TTS` mudule of `PaddleSpeech`.
You can use `PaddleOCR` to get the text of a storybook and read it by the `TTS` module of `PaddleSpeech`.
## Usage
Run the following command line to get started:
```
./run.sh
```
The result has shown on our [notebook](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/tutorial/tts/tts_tutorial.ipynb).
The result is shown in our [notebook](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/docs/tutorial/tts/tts_tutorial.ipynb).
[FastSpeech2](https://arxiv.org/abs/2006.04558) is a classical acoustic model for Text-to-Speech synthesis, which introduces controllable speech input, including `phoneme duration`、`energy` and `pitch`.
[FastSpeech2](https://arxiv.org/abs/2006.04558) is a classical acoustic model for Text-to-Speech synthesis, which introduces controllable speech input, including `phoneme duration`, `energy`, and `pitch`.
In the prediction phase, you can change these controllable variables to get some interesting results.
For example:
1. The `duration` control in `FastSpeech2` can control the speed of audios will keep the `pitch`. (in some speech tool, increase the speed will increase the pitch, and vice versa.)
1. The `duration` control in `FastSpeech2` can change the speed of the audio while keeping the `pitch`. (In some speech tools, increasing the speed will increase the pitch and vice versa.)
2. When we set `pitch` of one sentence to a mean value and set `tones` of phones to `1`, we will get a `robot-style` timbre.
2. When we set the `pitch` of one sentence to a mean value and set the `tones` of phones to `1`, we will get a `robot-style` timbre.
3. When we raise the `pitch` of an adult female (with a fixed scale ratio), we will get a `child-style` timbre.
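The first point can be illustrated with a minimal sketch that rescales per-phoneme durations and leaves the pitch values untouched. The function name `scale_durations` and the toy numbers are hypothetical and only serve the illustration; this is not the `FastSpeech2` or `PaddleSpeech` API.

```python
def scale_durations(durations, speed):
    """Return per-phoneme frame counts for a given speed factor.

    speed > 1.0 shortens every phoneme (faster speech), speed < 1.0 lengthens it.
    Each phoneme keeps at least one frame.
    """
    if speed <= 0:
        raise ValueError("speed must be positive")
    return [max(1, round(d / speed)) for d in durations]


# Hypothetical predictions for three phonemes.
durations = [4, 10, 6]         # frames per phoneme
pitch = [210.0, 220.0, 205.0]  # one pitch value per phoneme, in Hz

# Only the number of frames per phoneme changes; the pitch list is not modified.
faster = scale_durations(durations, speed=2.0)
slower = scale_durations(durations, speed=0.5)
```

Because only the frame counts change, the synthesized audio gets shorter or longer without the pitch shift that simple resampling would cause.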
@ -20,7 +20,7 @@ Run the following command line to get started:
```
`run.sh` first executes `source path.sh`, which sets the environment variables.
If you would like to try your own sentence, please replace the sentence in `sentences.txt`.
If you would like to try your sentence, please replace the sentence in `sentences.txt`.
Text-to-speech (TTS) is a natural language modeling process that requires changing units of text into units of speech for audio presentation.
This demo is an implementation to generate an audio from the giving text. It can be done by a single command or a few lines in python using `PaddleSpeech`.
This demo is an implementation to generate audio from the given text. It can be done by a single command or a few lines in Python using `PaddleSpeech`.
## Usage
### 1. Installation
@ -13,7 +13,7 @@ pip install paddlespeech
```
### 2. Prepare Input
Input of this demo should be a text of the specific language that can be passed via argument.
The input of this demo should be a text of the specific language that can be passed via argument.
### 3. Usage
- Command Line (Recommended)
- Chinese
@ -22,11 +22,11 @@ Input of this demo should be a text of the specific language that can be passed
```bash
paddlespeech tts --input "你好,欢迎使用百度飞桨深度学习框架!"
```
- Chinese, use `SpeedySpeech` as acoustic model
- Chinese, use `SpeedySpeech` as the acoustic model
Data augmentation has often been a highly effective technique to boost the deep learning performance. We augment our speech data by synthesizing new audios with small random perturbation (label-invariant transformation) added upon raw audios. You don't have to do the syntheses on your own, as it is already embedded into the data provider and is done on the fly, randomly for each epoch during training.
Data augmentation has often been a highly effective technique to boost deep learning performance. We augment our speech data by synthesizing new audios with small random perturbation (label-invariant transformation) added upon raw audios. You don't have to do the syntheses on your own, as it is already embedded into the data provider and is done on the fly, randomly for each epoch during training.
Six optional augmentation components are provided to be selected, configured and inserted into the processing pipeline.
Six optional augmentation components are provided to be selected, configured, and inserted into the processing pipeline.
* Audio
- Volume Perturbation
@ -17,7 +16,7 @@ Six optional augmentation components are provided to be selected, configured and
- SpecAugment
- Adaptive SpecAugment
In order to inform the trainer of what augmentation components are needed and what their processing orders are, it is required to prepare in advance an *augmentation configuration file* in [JSON](http://www.json.org/) format. For example:
To inform the trainer of what augmentation components are needed and what their processing orders are, it is required to prepare in advance an *augmentation configuration file* in [JSON](http://www.json.org/) format. For example:
```
[{
@ -34,8 +33,8 @@ In order to inform the trainer of what augmentation components are needed and wh
}]
```
When the `augment_conf_file` argument is set to the path of the above example configuration file, every audio clip in every epoch will be processed: with 60% of chance, it will first be speed perturbed with a uniformly random sampled speed-rate between 0.95 and 1.05, and then with 80% of chance it will be shifted in time with a random sampled offset between -5 ms and 5 ms. Finally this newly synthesized audio clip will be feed into the feature extractor for further training.
When the `augment_conf_file` argument is set to the path of the above example configuration file, every audio clip in every epoch will be processed as follows: with a 60% chance, it will first be speed perturbed with a speed rate sampled uniformly at random between 0.95 and 1.05, and then, with an 80% chance, it will be shifted in time by a randomly sampled offset between -5 ms and 5 ms. Finally, this newly synthesized audio clip will be fed into the feature extractor for further training.
For other configuration examples, please refer to `examples/conf/augmentation.example.json`.
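The probabilistic behaviour described above can be sketched in a few lines. The JSON layout and key names below (`type`, `params`, `prob`, `min_speed_rate`, and so on) are assumptions made for illustration, since the full example file is not reproduced here, and the function only decides which components fire rather than touching any audio.

```python
import json
import random

# Assumed configuration layout: a list of components, each with a type,
# its parameters, and the probability of applying it.
CONFIG = json.loads("""
[
  {"type": "speed", "params": {"min_speed_rate": 0.95, "max_speed_rate": 1.05}, "prob": 0.6},
  {"type": "shift", "params": {"min_shift_ms": -5, "max_shift_ms": 5}, "prob": 0.8}
]
""")

# Hypothetical mapping from component type to its (low, high) parameter names.
RANGE_KEYS = {
    "speed": ("min_speed_rate", "max_speed_rate"),
    "shift": ("min_shift_ms", "max_shift_ms"),
}


def plan_augmentation(config, rng):
    """Decide, for one audio clip, which components fire and with what value."""
    plan = []
    for component in config:  # components are visited in configuration order
        if rng.random() >= component["prob"]:
            continue  # this component is skipped for this clip
        low_key, high_key = RANGE_KEYS[component["type"]]
        params = component["params"]
        plan.append((component["type"], rng.uniform(params[low_key], params[high_key])))
    return plan


rng = random.Random(0)
plans = [plan_augmentation(CONFIG, rng) for _ in range(1000)]
speed_share = sum(any(kind == "speed" for kind, _ in plan) for plan in plans) / len(plans)
print(f"share of clips with speed perturbation: {speed_share:.2f}")
```

Over many clips, the share that receives speed perturbation should be close to the configured probability of 0.6.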
Be careful when utilizing the data augmentation technique, as improper augmentation will do harm to the training, due to the enlarged train-test gap.
Be careful when utilizing the data augmentation technique, as improper augmentation will harm the training, due to the enlarged train-test gap.
*DeepSpeech2 on PaddlePaddle* accepts a textual **manifest** file as its data set interface. A manifest file summarizes a set of speech data, with each line containing some meta data (e.g. filepath, transcription, duration) of one audio clip, in [JSON](http://www.json.org/) format, such as:
*DeepSpeech2 on PaddlePaddle* accepts a textual **manifest** file as its data set interface. A manifest file summarizes a set of speech data, with each line containing some meta data (e.g. filepath, transcription, duration) of one audio clip, in [JSON](http://www.json.org/) format, such as:
```
{"audio_filepath": "/home/work/.cache/paddle/Libri/134686/1089-134686-0001.flac", "duration": 3.275, "text": "stuff it into you his belly counselled him"}
{"audio_filepath": "/home/work/.cache/paddle/Libri/134686/1089-134686-0007.flac", "duration": 4.275, "text": "a cold lucid indifference reigned in his soul"}
```
To use your custom data, you only need to generate such manifest files to summarize the dataset. Given such summarized manifests, training, inference, and all other modules know where to access the audio files, as well as their metadata, including the transcription labels.
For how to generate such manifest files, please refer to `examples/librispeech/local/librispeech.py`, which will download the data and generate manifest files for the LibriSpeech dataset.
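For custom data, a manifest in this format can be produced with the standard library alone. The sketch below is a minimal illustration, not the `librispeech.py` script itself; the helper names and the `/path/to/...` file paths are hypothetical.

```python
import json
import os
import tempfile


def write_manifest(entries, path):
    """Write one JSON object per line, which is the layout shown above."""
    with open(path, "w", encoding="utf-8") as f:
        for entry in entries:
            f.write(json.dumps(entry, ensure_ascii=False) + "\n")


def read_manifest(path):
    """Read a manifest back into a list of dictionaries, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Hypothetical clips; a real script would measure the duration of each audio file.
entries = [
    {"audio_filepath": "/path/to/clip_0001.flac", "duration": 3.275,
     "text": "stuff it into you his belly counselled him"},
    {"audio_filepath": "/path/to/clip_0007.flac", "duration": 4.275,
     "text": "a cold lucid indifference reigned in his soul"},
]

with tempfile.TemporaryDirectory() as tmp:
    manifest_path = os.path.join(tmp, "manifest.train")
    write_manifest(entries, manifest_path)
    loaded = read_manifest(manifest_path)

total_duration = sum(entry["duration"] for entry in loaded)
```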
It will compute the mean and standard deviatio of power spectrum feature with 2000 random sampled audio clips listed in `examples/librispeech/data/manifest.train` and save the results to `examples/librispeech/data/mean_std.npz` for further usage.
It will compute the mean and standard deviation of the power spectrum feature with 2000 randomly sampled audio clips listed in `examples/librispeech/data/manifest.train` and save the results to `examples/librispeech/data/mean_std.npz` for further usage.
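The computation itself is straightforward. The following sketch assumes that every clip is already available as a list of feature frames. It uses the population standard deviation and plain Python lists, which may differ from what the actual script does, and the function name `compute_mean_std` is hypothetical.

```python
import math
import random


def compute_mean_std(features_per_clip, num_samples, seed=0):
    """Per-dimension mean and standard deviation over the frames of sampled clips."""
    rng = random.Random(seed)
    # Sample at most num_samples clips without replacement.
    count = min(num_samples, len(features_per_clip))
    sampled = rng.sample(features_per_clip, count)
    frames = [frame for clip in sampled for frame in clip]
    dim = len(frames[0])
    mean = [sum(f[d] for f in frames) / len(frames) for d in range(dim)]
    std = [
        math.sqrt(sum((f[d] - mean[d]) ** 2 for f in frames) / len(frames))
        for d in range(dim)
    ]
    return mean, std


# Two toy clips with two-dimensional features (three frames in total).
clips = [[[1.0, 2.0], [3.0, 2.0]], [[5.0, 2.0]]]
mean, std = compute_mean_std(clips, num_samples=2000)
```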
## Build Vocabulary
A vocabulary of possible characters is required to convert the transcription into a list of token indices for training, and in decoding, to convert from a list of indices back to text again. Such a character-based vocabulary can be built with `utils/build_vocab.py`.
A vocabulary of possible characters is required to convert the transcription into a list of token indices for training, and in decoding, to convert from a list of indices back to the text again. Such a character-based vocabulary can be built with `utils/build_vocab.py`.
It will write a vocabuary file `examples/librispeech/data/vocab.txt` with all transcription text in `examples/librispeech/data/manifest.train`, without vocabulary truncation (`--count_threshold 0`).
It will write a vocabulary file `examples/librispeech/data/vocab.txt` with all transcription text in `examples/librispeech/data/manifest.train`, without vocabulary truncation (`--count_threshold 0`).
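A character-level vocabulary of this kind can be sketched as follows. This is not the code of `utils/build_vocab.py`; the function name and the exact comparison against the threshold are assumptions, chosen so that a threshold of 0 keeps every character.

```python
from collections import Counter


def build_vocab(transcripts, count_threshold=0):
    """Collect characters seen at least count_threshold times, most frequent first."""
    counter = Counter(ch for text in transcripts for ch in text)
    # Sort by descending count, then by character, so the order is deterministic.
    items = sorted(counter.items(), key=lambda kv: (-kv[1], kv[0]))
    return [ch for ch, count in items if count >= count_threshold]


full_vocab = build_vocab(["aab", "b c"])
truncated_vocab = build_vocab(["aab", "b c"], count_threshold=2)
```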
@ -20,7 +20,7 @@ The arcitecture of the model is shown in Fig.1.
### Data Preparation
#### Vocabulary
For English data, the vocabulary dictionary is composed of 26 English characters with " ' ", space, \<blank\> and \<eos\>. The \<blank\> represents the blank label in CTC, the \<unk\> represents the unknown character and the \<eos\> represents the start and the end characters. For mandarin, the vocabulary dictionary is composed of chinese characters statisticed from the training set and three additional characters are added. The added characters are \<blank\>, \<unk\> and \<eos\>. For both English and mandarin data, we set the default indexs that \<blank\>=0, \<unk\>=1 and \<eos\>= last index.
For English data, the vocabulary dictionary is composed of the 26 English characters plus " ' ", space, \<blank\>, \<unk\> and \<eos\>. The \<blank\> represents the blank label in CTC, the \<unk\> represents the unknown character, and the \<eos\> represents the start and the end characters. For Mandarin, the vocabulary dictionary is composed of the Chinese characters collected from the training set, plus three additional characters: \<blank\>, \<unk\> and \<eos\>. For both English and Mandarin data, the default indexes are \<blank\>=0, \<unk\>=1 and \<eos\>=last index.
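The index convention can be made concrete with a small sketch. The helper names `build_token_table` and `encode` are hypothetical and are not taken from the PaddleSpeech code base.

```python
def build_token_table(characters):
    """Place <blank> at index 0, <unk> at index 1, and <eos> at the last index."""
    tokens = ["<blank>", "<unk>"] + list(characters) + ["<eos>"]
    return {token: index for index, token in enumerate(tokens)}


def encode(text, table):
    """Map each character to its index, falling back to <unk> for unseen ones."""
    unk = table["<unk>"]
    return [table.get(ch, unk) for ch in text]


# A tiny English-style character set used only for this illustration.
table = build_token_table("ab' ")
indices = encode("a z", table)  # "z" is not in the table and maps to <unk>
```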
```
# The code to build vocabulary
cd examples/aishell/s0
@ -38,7 +38,7 @@ vi examples/librispeech/s0/data/vocab.txt
```
#### CMVN
For CMVN, a subset or the full of traininig set is chosed and be used to compute the feature mean and std.
For CMVN, a subset or the full training set is selected and used to compute the feature mean and std.
For feature extraction, three methods are implemented, which are linear (FFT without using filter bank), fbank and mfcc.
Currently, the released deepspeech2 online model use the linear feature extraction method.
Currently, the released deepspeech2 online model uses the linear feature extraction method.
```
# The code for feature extraction
vi paddlespeech/s2t/frontend/featurizer/audio_featurizer.py
```
### Encoder
The encoder is composed of two 2D convolution subsampling layers and a number of stacked single direction rnn layers. The 2D convolution subsampling layers extract feature representation from the raw audio feature and reduce the length of audio feature at the same time. After passing through the convolution subsampling layers, then the feature representation are input into the stacked rnn layers. For the stacked rnn layers, LSTM cell and GRU cell are provided to use. Adding one fully connected (fc) layer after the stacked rnn layers is optional. If the number of stacked rnn layers is less than 5, adding one fc layer after stacked rnn layers is recommand.
The encoder is composed of two 2D convolution subsampling layers and several stacked single-direction rnn layers. The 2D convolution subsampling layers extract the feature representation from the raw audio feature and reduce the length of the audio feature at the same time. After passing through the convolution subsampling layers, the feature representation is input into the stacked rnn layers. For the stacked rnn layers, the LSTM cell and the GRU cell are provided. Adding one fully connected (fc) layer after the stacked rnn layers is optional. If the number of stacked rnn layers is less than 5, adding one fc layer after the stacked rnn layers is recommended.
The code of Encoder is in:
```
@ -73,7 +73,7 @@ vi paddlespeech/s2t/models/ds2_online/deepspeech2.py
```
### Decoder
To got the character possibilities of each frame, the feature representation of each frame output from the encoder are input into a projection layer which is implemented as a dense layer to do feature projection. The output dim of the projection layer is same with the vocabulary size. After projection layer, the softmax function is used to transform the frame-level feature representation be the possibilities of characters. While making model inference, the character possibilities of each frame are input into the CTC decoder to get the final speech recognition results.
To get the character probabilities of each frame, the feature representation of each frame output from the encoder is input into a projection layer, which is implemented as a dense layer to do feature projection. The output dim of the projection layer is the same as the vocabulary size. After the projection layer, the softmax function is used to transform the frame-level feature representation into the probabilities of characters. During model inference, the character probabilities of each frame are input into the CTC decoder to get the final speech recognition results.
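The projection and softmax steps can be sketched with plain Python lists. The weights, bias, and frame values below are made-up toy numbers, and the functions are an illustration rather than the model code in `deepspeech2.py`.

```python
import math


def project(frame, weights, bias):
    """Dense projection: one output logit per vocabulary entry."""
    return [sum(w * x for w, x in zip(row, frame)) + b for row, b in zip(weights, bias)]


def softmax(logits):
    """Turn logits into probabilities; subtracting the max keeps exp() stable."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Toy setup: encoder output of size 2, vocabulary of size 3.
weights = [[0.5, -0.2], [0.1, 0.3], [-0.4, 0.8]]
bias = [0.0, 0.1, -0.1]
frame = [1.0, 2.0]

probs = softmax(project(frame, weights, bias))
```

The resulting list has one entry per vocabulary item and sums to one, which is the form a CTC decoder expects for each frame.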
@ -123,8 +123,7 @@ if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
avg.sh exp/${ckpt}/checkpoints ${avg_num}
fi
```
By using the command above, the training process can be started. There are 5 stages in "run.sh", and the first 3 stages are used for training process. The stage 0 is used for data preparation, in which the dataset will be downloaded, and the manifest files of the datasets, vocabulary dictionary and CMVN file will be generated in "./data/". The stage 1 is used for training the model, the log files and model checkpoint is saved in "exp/deepspeech2_online/". The stage 2 is used to generated final model for predicting by averaging the top-k model parameters based on validation loss.
By using the command above, the training process can be started. There are 5 stages in "run.sh", and the first 3 stages are used for the training process. Stage 0 is used for data preparation, in which the dataset will be downloaded, and the manifest files of the datasets, the vocabulary dictionary, and the CMVN file will be generated in "./data/". Stage 1 is used for training the model; the log files and model checkpoints are saved in "exp/deepspeech2_online/". Stage 2 is used to generate the final model for prediction by averaging the top-k model parameters based on validation loss.
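The averaging in stage 2 can be sketched as follows. This is a minimal illustration that stores parameters as plain lists of floats; the function name `average_top_k` and the checkpoint layout are hypothetical and do not describe what `avg.sh` does internally.

```python
def average_top_k(checkpoints, k):
    """Average the parameters of the k checkpoints with the lowest validation loss.

    checkpoints: list of (validation_loss, {parameter_name: [floats]}) tuples.
    """
    best = sorted(checkpoints, key=lambda item: item[0])[:k]
    names = best[0][1].keys()
    return {
        name: [
            sum(values) / len(best)
            for values in zip(*(params[name] for _, params in best))
        ]
        for name in names
    }


# Three toy checkpoints with a single parameter tensor each.
checkpoints = [
    (0.9, {"w": [1.0, 1.0]}),
    (0.5, {"w": [3.0, 5.0]}),
    (0.7, {"w": [5.0, 7.0]}),
]
averaged = average_top_k(checkpoints, k=2)
```

With `k=2`, only the checkpoints with losses 0.5 and 0.7 contribute to the average.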
### Testing Process
Using the command below, you can test the deepspeech2 online model.
@ -153,10 +152,10 @@ if [ ${stage} -le 5 ] && [ ${stop_stage} -ge 5 ]; then
After the training process, we use stage 3,4,5 for testing process. The stage 3 is for testing the model generated in the stage 2 and provided the CER index of the test set. The stage 4 is for transforming the model from dynamic graph to static graph by using "paddle.jit" library. The stage 5 is for testing the model in static graph.
After the training process, stages 3, 4, and 5 are used for the testing process. Stage 3 tests the model generated in stage 2 and provides the CER of the test set. Stage 4 transforms the model from a dynamic graph to a static graph by using the "paddle.jit" library. Stage 5 tests the model in the static graph.
## Non-Streaming DeepSpeech2
The deepspeech2 offline model is similarity to the deepspeech2 online model. The main difference between them is the offline model use the stacked bi-directional rnn layers while the online model use the single direction rnn layers and the fc layer is not used. For the stacked bi-directional rnn layers in the offline model, the rnn cell and gru cell are provided to use.
The deepspeech2 offline model is similar to the deepspeech2 online model. The main difference between them is the offline model uses the stacked bi-directional rnn layers while the online model uses the single direction rnn layers and the fc layer is not used. For the stacked bi-directional rnn layers in the offline model, the rnn cell and gru cell are provided to use.
The architecture of the model is shown in Fig.2.
<p align="center">
@ -165,14 +164,14 @@ The arcitecture of the model is shown in Fig.2.
</p>
For data preparation and decoder, the deepspeech2 offline model is same with the deepspeech2 online model.
For data preparation and decoder, the deepspeech2 offline model is the same as the deepspeech2 online model.
The code of encoder and decoder for deepspeech2 offline model is in:
```
vi paddlespeech/s2t/models/ds2/deepspeech2.py
```
The training process and testing process of deepspeech2 offline model is very similary to deepspeech2 online model.
The training and testing processes of the deepspeech2 offline model are very similar to those of the deepspeech2 online model.
Only a few changes should be noted.
For training and testing, the "model_type" and the "conf_path" must be set.
A language model is required to improve the decoder's performance. We have prepared two language models (with lossy compression) for users to download and try. One is for English and the other is for Mandarin. The bash script to download the LM is `local/download_lm_*.sh` in each example.
For example, users can simply run this to download the preprared mandarin language models:
For example, users can simply run this to download the prepared Mandarin language models:
```bash
cd examples/aishell
source path.sh
bash local/download_lm_ch.sh
```
If you wish to train your own better language model, please refer to [KenLM](https://github.com/kpu/kenlm) for tutorials.
Here we provide some tips to show how we preparing our English and Mandarin language models.
Here we provide some tips to show how we prepare our English and Mandarin language models.
You can take it as a reference when you train your own.
### English LM
@ -24,14 +23,14 @@ The English corpus is from the [Common Crawl Repository](http://commoncrawl.org)
* Repeated whitespace characters are squeezed to one and the beginning whitespace characters are removed. Notice that all transcriptions are lowercase, so all characters are converted to lowercase.
* Top 400,000 most frequent words are selected to build the vocabulary and the rest are replaced with 'UNKNOWNWORD'.
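The two steps listed above can be sketched as follows. This covers only the steps visible here, the helper names are hypothetical, and the toy vocabulary size of 2 stands in for the 400,000 used for the released model.

```python
from collections import Counter


def normalize_line(line):
    """Squeeze repeated whitespace, drop leading whitespace, and lowercase.

    Note that split/join also drops trailing whitespace.
    """
    return " ".join(line.split()).lower()


def replace_rare_words(lines, vocab_size, unknown="UNKNOWNWORD"):
    """Keep the vocab_size most frequent words and replace the rest."""
    counts = Counter(word for line in lines for word in line.split())
    keep = {word for word, _ in counts.most_common(vocab_size)}
    return [
        " ".join(word if word in keep else unknown for word in line.split())
        for line in lines
    ]


cleaned = [normalize_line(line) for line in ["  A  a B", "a C   b "]]
corpus = replace_rare_words(cleaned, vocab_size=2)
```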
Now the preprocessing is done and we get a clean corpus to train the language model. Our released language model are trained with agruments '-o 5 --prune 0 1 1 1 1'. '-o 5' means the max order of language model is 5. '--prune 0 1 1 1 1' represents count thresholds for each order and more specifically it will prune singletons for orders two and higher. To save disk storage we convert the arpa file to 'trie' binary file with arguments '-a 22 -q 8 -b 8'. '-a' represents the maximum number of leading bits of pointers in 'trie' to chop. '-q -b' are quantization parameters for probability and backoff.
Now the preprocessing is done and we get a clean corpus to train the language model. Our released language model is trained with arguments '-o 5 --prune 0 1 1 1 1'. '-o 5' means the max order of the language model is 5. '--prune 0 1 1 1 1' represents count thresholds for each order and more specifically it will prune singletons for orders two and higher. To save disk storage we convert the ARPA file to 'trie' binary file with arguments '-a 22 -q 8 -b 8'. '-a' represents the maximum number of leading bits of pointers in 'trie' to chop. '-q -b' are quantization parameters for probability and backoff.
### Mandarin LM
Different from the English language model, Mandarin language model is character-based where each token is a Chinese character. We use internal corpus to train the released Mandarin language models. The corpus contain billions of tokens. The preprocessing has tiny difference from English language model and main steps include:
Different from the English language model, the Mandarin language model is character-based, where each token is a Chinese character. We use an internal corpus to train the released Mandarin language models. The corpus contains billions of tokens. The preprocessing differs slightly from that of the English language model, and the main steps include:
* The beginning and trailing whitespace characters are removed.
* English punctuations and Chinese punctuations are removed.
* A whitespace character between two tokens is inserted.
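These three steps can be sketched with the standard library. Identifying punctuation through Unicode categories and dropping inner whitespace before re-inserting it are assumptions of this sketch, not a description of the script used for the released models; the function name is hypothetical.

```python
import unicodedata


def preprocess_mandarin(line):
    """Strip the line, remove punctuation, and separate characters with spaces."""
    line = line.strip()
    # Unicode categories starting with "P" cover both English and Chinese punctuation.
    chars = [ch for ch in line if not unicodedata.category(ch).startswith("P")]
    # Drop remaining whitespace so that every kept character is exactly one token.
    chars = [ch for ch in chars if not ch.isspace()]
    return " ".join(chars)


tokens = preprocess_mandarin(" 你好,世界! ")
```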
Please notice that the released language models only contain Chinese simplified characters. After preprocessing done we can begin to train the language model. The key training arguments for small LM is '-o 5 --prune 0 1 2 4 4' and '-o 5' for large LM. Please refer above section for the meaning of each argument. We also convert the arpa file to binary file using default settings.
Please note that the released language models only contain simplified Chinese characters. After preprocessing is done, we can begin to train the language model. The key training arguments are '-o 5 --prune 0 1 2 4 4' for the small LM and '-o 5' for the large LM. Please refer to the section above for the meaning of each argument. We also convert the ARPA file to a binary file using default settings.
Several shell scripts provided in `./examples/tiny/local` will help us to quickly give it a try, for most major modules, including data preparation, model training, case inference and model evaluation, with a few public dataset (e.g. [LibriSpeech](http://www.openslr.org/12/), [Aishell](http://www.openslr.org/33)). Reading these examples will also help you to understand how to make it work with your own data.
Several shell scripts provided in `./examples/tiny/local` will help us to quickly give it a try, for most major modules, including data preparation, model training, case inference, and model evaluation, with a few public datasets (e.g. [LibriSpeech](http://www.openslr.org/12/), [Aishell](http://www.openslr.org/33)). Reading these examples will also help you to understand how to make it work with your data.
Some of the scripts in `./examples` are not configured with GPUs. If you want to train with 8 GPUs, please modify `CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7`. If you don't have any GPU available, please set `CUDA_VISIBLE_DEVICES=` to use CPUs instead. Besides, if out-of-memory problem occurs, just reduce `batch_size` to fit.
Some of the scripts in `./examples` are not configured with GPUs. If you want to train with 8 GPUs, please modify `CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7`. If you don't have any GPU available, please set `CUDA_VISIBLE_DEVICES=` to use CPUs instead. Besides, if an out-of-memory problem occurs, just reduce `batch_size` to fit.
Let's take a tiny sampled subset of [LibriSpeech dataset](http://www.openslr.org/12/) for instance.
- Go to directory
- Go to the directory
```bash
cd examples/tiny
@ -16,19 +16,17 @@ Let's take a tiny sampled subset of [LibriSpeech dataset](http://www.openslr.org
source path.sh
```
**Must do this before you start to do anything.**
Set `MAIN_ROOT` as project dir. Using defualt `deepspeech2` model as `MODEL`, you can change this in the script.
- Main entrypoint
Set `MAIN_ROOT` as project dir. Using the default `deepspeech2` model as `MODEL`, you can change this in the script.
- Main entrypoint
```bash
bash run.sh
```
This is just a demo, please make sure every `step` works well before next `step`.
This is just a demo, please make sure every `step` works well before the next `step`.
More detailed information are provided in the following sections. Wish you a happy journey with the *DeepSpeech on PaddlePaddle* ASR engine!
More detailed information is provided in the following sections. Wish you a happy journey with the *DeepSpeech on PaddlePaddle* ASR engine!
## Training a model
The key steps of training for Mandarin language are same to that of English language and we have also provided an example for Mandarin training with Aishell in ```examples/aishell/local```. As mentioned above, please execute ```sh data.sh```, ```sh train.sh```and```sh test.sh```to do data preparation, training, and testing correspondingly.
The key steps of training for the Mandarin language are the same as those for the English language, and we have also provided an example for Mandarin training with Aishell in `examples/aishell/local`. As mentioned above, please execute `sh data.sh`, `sh train.sh` and `sh test.sh` to do data preparation, training, and testing respectively.
## Evaluate a Model
To evaluate a model's performance quantitatively, please run:
The error rate (default: word error rate; can be set with `error_rate_type`) will be printed.
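Both error rates are edit distances normalized by the reference length. The sketch below illustrates the metric only and is not the evaluation code of the project; the values `"wer"` and `"cer"` for `error_rate_type` are assumptions.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences, keeping one row at a time."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,             # deletion
                cur[j - 1] + 1,          # insertion
                prev[j - 1] + (r != h),  # substitution, or match when equal
            ))
        prev = cur
    return prev[-1]


def error_rate(reference, hypothesis, error_rate_type="wer"):
    """Word error rate by default, character error rate for any other type."""
    if error_rate_type == "wer":
        ref, hyp = reference.split(), hypothesis.split()
    else:
        ref, hyp = list(reference), list(hypothesis)
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


wer = error_rate("a cold lucid indifference", "a cold lucid difference")
cer = error_rate("abcd", "abed", error_rate_type="cer")
```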
We provide two types of CTC decoders: *CTC greedy decoder* and *CTC beam search decoder*. The *CTC greedy decoder* is an implementation of the simple best-path decoding algorithm, selecting at each timestep the most likely token, thus being greedy and locally optimal. The [*CTC beam search decoder*](https://arxiv.org/abs/1408.2873) otherwise utilizes a heuristic breadth-first graph search for reaching a near global optimality; it also requires a pre-trained KenLM language model for better scoring and ranking. The decoder type can be set with argument `decoding_method`.
We provide two types of CTC decoders: *CTC greedy decoder* and *CTC beam search decoder*. The *CTC greedy decoder* is an implementation of the simple best-path decoding algorithm, selecting at each timestep the most likely token, thus being greedy and locally optimal. The [*CTC beam search decoder*](https://arxiv.org/abs/1408.2873) otherwise utilizes a heuristic breadth-first graph search for reaching near-global optimality; it also requires a pre-trained KenLM language model for better scoring and ranking. The decoder type can be set with the argument `decoding_method`.
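Best-path decoding is short enough to sketch in full: take the most likely token at each timestep, merge consecutive repeats, and drop blanks. The function below is an illustration under the assumption that the blank label has index 0; it is not the decoder shipped with the project.

```python
def ctc_greedy_decode(probs, vocabulary, blank_index=0):
    """Best-path CTC decoding over a list of per-frame probability lists."""
    # Step 1: pick the most likely token index at every timestep.
    best_path = [max(range(len(frame)), key=frame.__getitem__) for frame in probs]
    # Step 2: collapse consecutive repeats, then remove blanks.
    decoded = []
    previous = None
    for index in best_path:
        if index != previous and index != blank_index:
            decoded.append(vocabulary[index])
        previous = index
    return "".join(decoded)


vocabulary = ["<blank>", "a", "b"]
probs = [
    [0.1, 0.8, 0.1],    # a
    [0.2, 0.7, 0.1],    # a (repeat, merged)
    [0.9, 0.05, 0.05],  # blank (separates the two "a" tokens)
    [0.1, 0.6, 0.3],    # a
    [0.1, 0.2, 0.7],    # b
    [0.2, 0.1, 0.7],    # b (repeat, merged)
]
text = ctc_greedy_decode(probs, vocabulary)
```

The blank between the second and third "a" frames is what allows the output to contain the same character twice in a row.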
There are 3 ways to use `PaddleSpeech`. According to the degree of difficulty, the 3 ways can be divided into **Easy**, **Medium** and **Hard**. You can choose one of the 3 ways to install `PaddleSpeech`.
There are 3 ways to use `PaddleSpeech`. According to the degree of difficulty, the 3 ways can be divided into **Easy**, **Medium**, and **Hard**. You can choose one of the 3 ways to install `PaddleSpeech`.
| Easy | (1) Use command line functions of PaddleSpeech. <br> (2) Experience PaddleSpeech on Ai Studio. | Linux, Mac,Windows |
| Medium | Support major function,such as using the` ready-made `examples and using PaddleSpeech to train your own model. | Linux |
| Hard | Support full function of Paddlespeech,including training n-gram language model, montreal-forced-aligner and so on. And you are more able be a developer! | Ubuntu |
| Easy | (1) Use command-line functions of PaddleSpeech. <br> (2) Experience PaddleSpeech on AI Studio. | Linux, Mac, Windows |
| Medium | Support major functions, such as using the ready-made examples and using PaddleSpeech to train your model. | Linux |
| Hard | Support the full functionality of PaddleSpeech, including training an n-gram language model, Montreal-Forced-Aligner, and so on. You are also better equipped to be a developer! | Ubuntu |
## Prerequisites
- Python >= 3.7
@ -14,9 +14,9 @@ There are 3 ways to use `PaddleSpeech`. According to the degree of difficulty, t
- C++ compilation environment
- Tip: On Linux and Mac, use the `bash` command rather than `sh` when following the installation document.
## Easy: Get the Basic Function (Support Linux, Mac and Windows)
- If you are newer to `PaddleSpeech` and want to experience it easily without your own machine. We recommend you to use [AI Studio](https://aistudio.baidu.com/aistudio/index) to experience it. There is a step-by-step tutorial for `PaddleSpeech` and you can use the basic function of `PaddleSpeech` with a free machine.
- If you want to use the command line function of Paddlespeech, you need to complete the following steps to install `PaddleSpeech`. For more information about how to use command line function, you can see the [cli](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/paddlespeech/cli).
## Easy: Get the Basic Function (Support Linux, Mac, and Windows)
- If you are new to `PaddleSpeech` and want to try it easily without using your own machine, we recommend [AI Studio](https://aistudio.baidu.com/aistudio/index). There is a step-by-step tutorial for `PaddleSpeech`, and you can use the basic functions of `PaddleSpeech` on a free machine.
- If you want to use the command-line functions of `PaddleSpeech`, you need to complete the following steps to install `PaddleSpeech`. For more information about how to use the command-line functions, see the [cli](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/paddlespeech/cli).
### Install Conda
Conda is an environment management system. You can go to [miniconda](https://docs.conda.io/en/latest/miniconda.html) (select a version py>=3.7) to download and install conda.
Then install the conda dependencies for `paddlespeech`:
@ -25,11 +25,9 @@ And then Install conda dependencies for `paddlespeech` :
## Medium: Get the Major Functions (Support Linux)
If you want to get the major functions of `paddlespeech`, there are 4 steps you need to complete.
### Install Conda
@ -71,7 +66,7 @@ $HOME/miniconda3/bin/conda init
# activate the conda
bash
```
Then you can create an conda virtual environment using the following command:
Then you can create a conda virtual environment using the following command:
```bash
conda create -y -p tools/venv python=3.7
```
@ -119,9 +114,9 @@ pip install .
- choice 1: working with `Ubuntu` Docker Container.
- choice 2: working on `Ubuntu` with `root` privilege.
To avoid the trouble of environment setup, running in Docker container is highly recommended. Otherwise, if you work on `Ubuntu` with `root` privilege, you can still complete the installation.
To avoid the trouble of environment setup, running in a Docker container is highly recommended. Otherwise, if you work on `Ubuntu` with `root` privilege, you can still complete the installation.
### Choice 1: Running in Docker Container (Recommand)
### Choice 1: Running in Docker Container (Recommended)
Docker is an open-source tool to build, ship, and run distributed applications in an isolated environment. A Docker image for this project has been provided on [hub.docker.com](https://hub.docker.com) with all the dependencies installed. This Docker image requires NVIDIA GPU support, so please make sure a GPU is available and that [nvidia-docker](https://github.com/NVIDIA/nvidia-docker) has been installed.
PaddleSpeech is an open-source toolkit on PaddlePaddle platform for two critical tasks in Speech - Speech-to-Text (Automatic Speech Recognition, ASR) and Text-to-Speech Synthesis (TTS), with modules involving state-of-art and influential models.
PaddleSpeech is an open-source toolkit on the PaddlePaddle platform for two critical tasks in Speech - Speech-to-Text (Automatic Speech Recognition, ASR) and Text-to-Speech Synthesis (TTS), with modules involving state-of-the-art and influential models.
## What can PaddleSpeech do?
@ -29,7 +29,7 @@ PaddleSpeech ASR provides you with a complete ASR pipeline, including:
- attention decoding (used in Transformer and Conformer)
- attention rescoring (used in Transformer and Conformer)
Speech-to-Text helps you training the ASR model very simply.
Speech-to-Text helps you train the ASR model very simply.
### Text-to-Speech
TTS mainly consists of components below:
@ -53,4 +53,4 @@ PaddleSpeech TTS provides you with a complete TTS pipeline, including:
- Transfer Learning from Speaker Verification to Multispeaker Text-to-Speech Synthesis
- GE2E
Text-to-Speech helps you to train TTS models with simple commands.
Parakeet aims to provide a flexible, efficient and state-of-the-art text-to-speech toolkit for the open-source community. It is built on PaddlePaddle dynamic graph and includes many influential TTS models.
Parakeet aims to provide a flexible, efficient, and state-of-the-art text-to-speech toolkit for the open-source community. It is built on PaddlePaddle dynamic graph and includes many influential TTS models.
<div align="center">
<img src="../../images/logo.png" width=300 /><br>
@ -7,10 +7,10 @@ Parakeet aims to provide a flexible, efficient and state-of-the-art text-to-spee
## Overview
In order to facilitate exploiting the existing TTS models directly and developing the new ones, Parakeet selects typical models and provides their reference implementations in PaddlePaddle. Furthermore, Parakeet abstracts the TTS pipeline and standardizes the procedure of data preprocessing, common modules sharing, model configuration, and the process of training and synthesis. The models supported here include Text FrontEnd, end-to-end Acoustic models and Vocoders:
To facilitate exploiting the existing TTS models directly and developing the new ones, Parakeet selects typical models and provides their reference implementations in PaddlePaddle. Furthermore, Parakeet abstracts the TTS pipeline and standardizes the procedure of data preprocessing, common modules sharing, model configuration, and the process of training and synthesis. The models supported here include Text FrontEnd, end-to-end Acoustic models, and Vocoders:
- Text FrontEnd
- Rulebased Chinese frontend.
- Rule-based Chinese frontend.
- Acoustic Models
- [【FastSpeech2】FastSpeech 2: Fast and High-Quality End-to-End Text to Speech](https://arxiv.org/abs/2006.04558)
This sections covers how to extend TTS by implementing your own models and experiments. Guidelines on implementation are also elaborated.
This section covers how to extend TTS by implementing your models and experiments. Guidelines on implementation are also elaborated.
For the general deep learning experiment, there are several parts to deal with:
1. Preprocess the data according to the needs of the model, and iterate the dataset by batch.
2. Define the model, optimizer and other components.
2. Define the model, optimizer, and other components.
3. Write out the training process (generally including forward / backward calculation, parameter update, log recording, visualization, periodic evaluation, etc.).
5. Configure and run the experiment.
## PaddleSpeech TTS's Model Components
In order to balance the reusability and function of models, we divide models into several types according to its characteristics.
To balance the reusability and function of models, we divide models into several types according to their characteristics.
For the commonly used modules that can be used as part of other larger models, we try to implement them as simply and universally as possible, because they will be reused. Modules with trainable parameters are generally implemented as subclasses of `paddle.nn.Layer`. Modules without trainable parameters can be directly implemented as functions whose inputs and outputs are `paddle.Tensor`.
Models for a specific task are implemented as subclasses of `paddle.nn.Layer`. Models could be simple, like a singlelayer RNN. For complicated models, it is recommended to split the model into different components.
Models for a specific task are implemented as subclasses of `paddle.nn.Layer`. Models could be simple, like a single-layer RNN. For complicated models, it is recommended to split the model into different components.
For a seq-to-seq model, it's natural to split it into encoder and decoder. For a model composed of several similar layers, it's natural to extract the sublayer as a separate layer.
There are two common ways to define a model which consists of several modules.
1. Define a module given the specifications. Here is an example with multilayer perceptron.
1. Define a module given the specifications. Here is an example with a multilayer perceptron.
@ -44,11 +45,11 @@ There are two common ways to define a model which consists of several modules.
```
For a module defined in this way, it’s harder for the user to initialize an instance. Users have to read the code to check what attributes are used.
Also, code in this style tend to be abused by passing a huge config object to initialize every module used in an experiment, thought each module may not need the whole configuration.
Also, code in this style tends to be abused by passing a huge config object to initialize every module used in an experiment, though each module may not need the whole configuration.
We prefer to be explicit.
2. Define a module as a combination given its components. Here is an example for a sequence-to-sequence model.
2. Define a module as a combination given its components. Here is an example of a sequence-to-sequence model.
```python
class Seq2Seq(nn.Layer):
def __init__(self, encoder, decoder):
@ -65,27 +66,27 @@ There are two common ways to define a model which consists of several modules.
# compose two components
model = Seq2Seq(encoder, decoder)
```
When a model is a complicated and made up of several components, each of which has a separate functionality, and can be replaced by other components with the same functionality, we prefer to define it in this way.
When a model is complicated and made up of several components, each of which has a separate functionality, and can be replaced by other components with the same functionality, we prefer to define it in this way.
In the directory structure of PaddleSpeech TTS, modules with high reusability are placed in `paddlespeech.t2s.modules`, but models for specific tasks are placed in `paddlespeech.t2s.models`. When developing a new model, developers need to consider the feasibility of splitting the modules, and the degree of generality of the modules, and place them in appropriate directories.
In the directory structure of PaddleSpeech TTS, modules with high reusability are placed in `paddlespeech.t2s.modules`, but models for specific tasks are placed in `paddlespeech.t2s.models`. When developing a new model, developers need to consider the feasibility of splitting the modules, and the degree of generality of the modules and place them in appropriate directories.
## PaddleSpeech TTS's Data Components
Another critical componnet for a deep learning project is data.
Another critical component for a deep learning project is data.
PaddleSpeech TTS uses the following methods for training data:
1. Preprocess the data.
2. Load the preprocessed data for training.
Previously, we wrote the preprocessing in the `__getitem__` of the Dataset, which will process when accessing a certain batch samples, but encountered some problems:
Previously, we wrote the preprocessing in the `__getitem__` of the Dataset, so that samples were processed when a batch was accessed, but we encountered some problems:
1. Efficiency problem. Even though Paddle is designed to load data asynchronously, when the batch size is large, each sample needs to be preprocessed and assembled into batches, which takes a lot of time and may even seriously slow down the training process.
2. Data filtering problem. Some filtering conditions depend on the features of the processed sample, for example, filtering out samples that are too short according to text length. If the text length can only be known after `__getitem__`, the entire dataset needs to be loaded every time you filter! In addition, if you do not pre-filter, a small exception (such as text that is too short) in `__getitem__` will cause an exception in the entire data flow, which is not feasible, because `collate_fn` presupposes that each sample can be acquired normally. Even if some special flag, such as `None`, is used to mark data acquisition failures and to skip them in `collate_fn`, it will change the batch size.
Therefore, it is not realistic to put preprocessing entirely on `__getitem__`. We use the method mentioned above instead.
During preprocessing, we can do filtering. We can also save more intermediate features, such as text length, audio length, etc., which can be used for subsequent filtering. Following the convention of the TTS field, data is stored in multiple files, and the processed results are stored in `npy` format.
Use a list-like way to store metadata and store the file path in it, so that you are not restricted by the specific storage location of the file. In addition to the file path, other metadata can also be stored in it. For example, the path of the text, the path of the audio, the path of the spectrum, the number of frames, the number of sampling points, and so on.
Then for the path, there are multiple opening methods, such as `sf.read`, `np.load`, etc., so it's best to use a parameter that can be input, we don't even want to determine the reading method by it's extension, it's best to let the users input it, in this way, users can define their own method to parse the data.
Then for the path, there are multiple opening methods, such as `sf.read`, `np.load`, etc., so it's best to make the reading method a parameter that can be passed in. We don't even want to determine the reading method by the file extension; it's best to let users specify it, so that they can define their own method to parse the data.
So we learned from the design of `DataFrame`, but our construction method is simpler: it only needs a `list of dicts`, where each dict represents a record, which makes it convenient to interact with formats such as `json` and `yaml`. For each selected field, we need to give a parser (called `converter` in the interface), and that's it.
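The idea of a list of dicts plus one converter per field can be sketched without any framework. The class below is a toy stand-in, not the project's `DataTable`; the name `SimpleTable`, the sample records, and the choice of converters are all hypothetical. In practice a converter would more likely be something like `np.load` applied to a stored path.

```python
from typing import Any, Callable, Dict, List, Optional


class SimpleTable:
    """Toy metadata table: a list of dicts plus per-field converters."""

    def __init__(self,
                 data: List[Dict[str, Any]],
                 fields: Optional[List[str]] = None,
                 converters: Optional[Dict[str, Callable]] = None):
        self.data = data
        # Default to every field of the first record.
        self.fields = fields if fields is not None else list(data[0].keys())
        self.converters = converters or {}

    def __len__(self) -> int:
        return len(self.data)

    def __getitem__(self, idx: int) -> Dict[str, Any]:
        record = self.data[idx]
        example = {}
        for name in self.fields:
            value = record[name]
            # Apply the user-supplied parser if one is registered for this field.
            convert = self.converters.get(name)
            example[name] = convert(value) if convert is not None else value
        return example


# Hypothetical metadata, as it might be read from a json or yaml file.
metadata = [
    {"utt_id": "utt_001", "text": "hello world", "num_frames": "120"},
    {"utt_id": "utt_002", "text": "good morning", "num_frames": "95"},
]
table = SimpleTable(
    metadata,
    fields=["utt_id", "text", "num_frames"],
    converters={"text": str.split, "num_frames": int},
)
first = table[0]
```

Since the metadata itself is cheap to scan, filtering on fields such as `num_frames` can happen before any expensive file is opened.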
@ -109,7 +110,7 @@ class DataTable(Dataset):
converters : Dict[str, Callable], optional
Converters used to process each field, by default None
use_cache : bool, optional
Whether to use cache, by default False
Whether to use a cache, by default False
Raises
------
@ -125,11 +126,11 @@ class DataTable(Dataset):
converters: Dict[str, Callable]=None,
use_cache: bool=False):
```
It's `__getitem__` method is to parse each field with their own parser, and then compose a dictionary to return.
Its `__getitem__` method parses each field with its own parser and then composes a dictionary to return.
"""Convert a meta datum to an example by applying the corresponding
converters to each fields requested.
converters to each field requested.
Parameters
----------
@ -163,23 +164,23 @@ A typical training process includes the following processes:
6. Write logs, visualize, and in some cases save necessary intermediate results.
7. Save the state of the model and optimizer.
Here, we mainly introduce the trainingrelated components of TTS in Pa and why we designed it like this.
### Global Repoter
When training and modifying Deep Learning models,logging is often needed, and it has even become the key to model debugging and modifying. We usually use various visualization tools,such as , `visualdl` in `paddle`, `tensorboard` in `tensorflow` and `vidsom`, `wnb` ,etc. Besides, `logging` and `print` are usuaally used for different purpose.
Here, we mainly introduce the training-related components of PaddleSpeech TTS and why we designed them this way.
### Global Reporter
When training and modifying deep learning models, logging is often needed, and it has even become the key to model debugging and modification. We usually use various visualization tools, such as `visualdl` in `paddle`, `tensorboard` in `tensorflow`, `visdom`, `wnb`, etc. Besides, `logging` and `print` are usually used for different purposes.
Among these tools, `print` is the simplest. It has no concept of `logger` and `handler` as in `logging`, or of `summarywriter` and `logdir` as in `tensorboard`, and printing does not need a `global_step`. It's light enough to appear anywhere in the code, and it prints to a common stdout. Of course, its customizability is limited; for example, it is no longer intuitive when printing dictionaries or more complex objects. It's also fleeting: people need to use redirection to save the information.
For TTS models development,we hope to have a more universal multimedia stdout, which is actually a tool similar to `tensorboard`, which allows many multimedia forms, but it needs a `summary writer` when using, and a `step` when writing information. If the data are images or voices, some format control parameters are needed.
For TTS model development, we hope to have a more universal multimedia stdout. A tool like `tensorboard` comes close, since it allows many multimedia forms, but it needs a `summary writer` when in use and a `step` when writing information. If the data are images or audio, some format control parameters are also needed.
This will destroy the modular design to a certain extent. For example, If my model is composed of multiple sublayers, and I want to record some important information in the forward method of some sublayers. For this reason, I may need to pass the `summary writer` to this sublayers, but for the sublayers, its function is calculation, it should not have extra considerations, and it's also difficult for us to tolerate that the initialization of an `nn.Linear` has an optional `visualizer` in the method. And, for a calculation module, **HOW** can it know the global step? These are things related to the training process!
This will destroy the modular design to a certain extent. For example, suppose my model is composed of multiple sublayers, and I want to record some important information in the forward method of some sublayers. For this, I may need to pass the `summary writer` to these sublayers. But a sublayer's job is computation, and it should not have extra concerns; it's also hard to tolerate the initializer of an `nn.Linear` taking an optional `visualizer`. And, for a computation module, **HOW** can it know the global step? These are things related to the training process!
Therefore, a more common approach is not to put the log-writing code in the definition of the layer, but to return the values to be logged, obtain them during training, and write them to the `summary writer`. However, the return values then need to be modified. The `summary writer` is a broadcaster at the training level, and each module transmits information to it by modifying its return values.
We think this method is a little ugly. We prefer to return the necessary information only rather than change the return values to accommodate visualization and recording. When you need to report some information, you should be able to report it without difficult. So we imitate the design of `chainer` and use the `global repoter`.
We think this method is a little ugly. We prefer to return the necessary information only rather than change the return values to accommodate visualization and recording. When you need to report some information, you should be able to report it without difficulty. So we imitate the design of `chainer` and use the `global reporter`.
It takes advantage of the globality of Python's modulelevel variables and the effect of context manager.
It takes advantage of the globality of Python's module-level variables and the effect of context manager.
There is a modulelevel variable in `paddlespeech/t2s/training/reporter.py``OBSERVATIONS`,which is a `Dict` to store key-value.
There is a module-level variable, `OBSERVATIONS`, in `paddlespeech/t2s/training/reporter.py`, which is a `Dict` that stores key-value pairs.
```python
# paddlespeech/t2s/training/reporter.py
@ -242,7 +243,7 @@ def test_reporter_scope():
assert third == {'third_begin': 3, 'third_end': 4}
```
In this way, when we write modular components, we can directly call `report`. The caller decides where the reports go: once it has prepared an `OBSERVATION` dict, it opens a `scope` and calls the component within this `scope`.
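The interplay of a module-level variable and a context manager can be shown in a compact, self-contained sketch. This is a simplified illustration written for this section and should not be read as the exact contents of `reporter.py`; `toy_layer` and `step_log` are hypothetical names.

```python
import contextlib
from typing import Any, Dict, Iterator, Optional

# Module-level slot holding the dict that currently receives reports.
OBSERVATIONS: Optional[Dict[str, Any]] = None


@contextlib.contextmanager
def scope(observations: Dict[str, Any]) -> Iterator[None]:
    """Redirect reports to `observations` for the duration of the block."""
    global OBSERVATIONS
    previous = OBSERVATIONS
    OBSERVATIONS = observations
    try:
        yield
    finally:
        # Restore the outer scope, even if the body raised.
        OBSERVATIONS = previous


def report(name: str, value: Any) -> None:
    """Record a value in the active scope; do nothing when no scope is open."""
    if OBSERVATIONS is not None:
        OBSERVATIONS[name] = value


def toy_layer(x: float) -> float:
    """A hypothetical component that reports without knowing who listens."""
    y = 2.0 * x
    report("toy_layer/output", y)
    return y


step_log: Dict[str, Any] = {}
with scope(step_log):
    toy_layer(1.5)
toy_layer(4.0)  # outside any scope: the report is silently ignored
```

The component never receives a writer or a step counter; only the caller that opens the scope decides what happens to the collected values.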
The `Trainer` in PaddleSpeech TTS reports the information in this way.
```python
@ -257,11 +258,11 @@ while True:
```
### Updater: Model Training Process
In order to maintain the purity of function and the reusability of code, we abstract the model code into a subclass of `paddle.nn.Layer`, and write the core computing functions in it.
To maintain the purity of function and the reusability of code, we abstract the model code into a subclass of `paddle.nn.Layer`, and write the core computing functions in it.
We tend to write the forward process of training in `forward()`, but only up to the prediction result, not the loss. Therefore, this module can be called by a larger module.
However, when we compose an experiment, we need to add some other things, such as training process, evaluation process, checkpoint saving, visualization and the like. In this process, we will encounter some things that only exist in the training process, such as `optimizer`, `learning rate scheduler`, `visualizer`, etc. These things are not part of the model, they should **NOT** be written in the model code.
However, when we compose an experiment, we need to add some other things, such as the training process, evaluation process, checkpoint saving, visualization, and the like. In this process, we will encounter some things that only exist in the training process, such as `optimizer`, `learning rate scheduler`, `visualizer`, etc. These things are not part of the model, they should **NOT** be written in the model code.
We made an abstraction for these intermediate processes, that is, `Updater`, which takes the `model`, `optimizer`, and `data stream` as input, and its function is training. Since there may be differences in training methods of different models, we tend to write a corresponding `Updater` for each model. But this is different from the final training script, there is still a certain degree of encapsulation, just to extract the details of regular saving, visualization, evaluation, etc., and only retain the most basic function, that is, training the model.
@ -273,23 +274,23 @@ Deep learning experiments often have many options to configure. These configurat
1. Data source and data processing mode configuration.
2. Save path configuration of experimental results.
3. Data preprocessing mode configuration.
4. Model structure and hyperparameter configuration.
5. Training process configuration.
It’s common to change the running configuration to compare results. To keep track of running configuration, we use `yaml` configuration files.
Also, we want to interact with command line options. Some options that usually change according to running environments is provided by command line arguments. In addition, we want to override an option in the config file without editing it.
Also, we want to interact with command-line options. Some options that usually change according to running environments are provided by command line arguments. In addition, we want to override an option in the config file without editing it.
Taking these requirements into consideration, we use [yacs](https://github.com/rbgirshick/yacs) as a config management tool. Other tools like [omegaconf](https://github.com/omry/omegaconf) are also powerful and have similar functions.
In each example provided, there is a `config.py`, the default config is defined at `conf/default.yaml`. If you want to get the default config, import `config.py` and call `get_cfg_defaults()` to get it. Then it can be updated with `yaml` config file or commandline arguments if needed.
In each example provided, there is a `config.py`, the default config is defined at `conf/default.yaml`. If you want to get the default config, import `config.py` and call `get_cfg_defaults()` to get it. Then it can be updated with `yaml` config file or command-line arguments if needed.
For details about how to use yacs in experiments, see [yacs](https://github.com/rbgirshick/yacs).
The following is the basic `ArgumentParser`:
1. `--config` is used to support configuration file parsing, and the configuration file itself handles the unique options of each experiment.
2. `--train-metadata` is the path to the training data.
3. `--output-dir` is the dir to save the training results.(if there are checkpoints in `checkpoints/` of `--output-dir` , it's defalut to reload the newest checkpoint to train)
3. `--output-dir` is the directory in which to save the training results. (If there are checkpoints in `checkpoints/` of `--output-dir`, the default is to reload the newest checkpoint and continue training.)
4. `--ngpu` determines the operation mode and refers to the number of training processes. If `ngpu` > 0, GPU is used; otherwise, CPU is used.
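A parser with the four options listed above might look roughly like the following sketch. The option names come from the list; the argument types, help strings, the default of `1` for `--ngpu`, and the helper `select_device` are assumptions made for illustration, not the project's actual training script.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with the four basic options (types and defaults assumed)."""
    parser = argparse.ArgumentParser(description="Train a TTS model (sketch).")
    parser.add_argument("--config", type=str, help="path to the yaml config file")
    parser.add_argument("--train-metadata", type=str, help="path to the training data")
    parser.add_argument("--output-dir", type=str, help="directory for training results")
    parser.add_argument(
        "--ngpu", type=int, default=1,
        help="number of training processes; 0 means CPU")
    return parser


def select_device(ngpu: int) -> str:
    """Hypothetical helper: ngpu > 0 selects GPU, otherwise CPU."""
    return "gpu" if ngpu > 0 else "cpu"


# Parse an explicit argument list so the sketch runs without a real command line.
args = build_parser().parse_args(
    ["--config", "conf/default.yaml", "--output-dir", "exp/default", "--ngpu", "0"])
device = select_device(args.ngpu)
```

Note that `argparse` converts dashes in option names to underscores, so `--output-dir` is read back as `args.output_dir`.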
Developers can refer to the examples in `examples` to write the default configuration file when adding new experiments.
@ -313,13 +314,13 @@ The experimental codes in PaddleSpeech TTS are generally organized as follows:
```
The `*.py` files called by the above `*.sh` scripts are located in `${BIN_DIR}/`.
We add a named argument. `--output-dir` to each training script to specify the output directory. The directory structure is as follows, It's best for developers to follow this specification:
We add a named argument, `--output-dir`, to each training script to specify the output directory. The directory structure is as follows; developers should follow this specification:
```text
exp/default/
├── checkpoints/
│ ├── records.jsonl (record file)
│ └── snapshot_iter_*.pdz (checkpoint files)
├── config.yaml (config fille of this experiment)
├── config.yaml (config file of this experiment)
├── vdlrecords.*.log (visualdl record file)
├── worker_*.log (text logging, one file per process)
├── validation/ (output dir during training, information_iter_*/ is the output of each step, if necessary)
@ -327,4 +328,4 @@ exp/default/
└── test/ (output dir of synthesis results)
```
You can view the examples we provide in `examples`. These experiments are provided to users as examples which can be run directly. Users are welcome to add new models and experiments and contribute code to PaddleSpeech.
You can view the examples we provide in `examples`. These experiments are provided to users as examples that can be run directly. Users are welcome to add new models and experiments and contribute code to PaddleSpeech.
TTS system mainly includes three modules: `Text Frontend`, `Acoustic model` and `Vocoder`. We introduce a rulebased Chinese text frontend in [cn_text_frontend.md](./cn_text_frontend.md). Here, we will introduce acoustic models and vocoders, which are trainable models.
TTS system mainly includes three modules: `Text Frontend`, `Acoustic model` and `Vocoder`. We introduce a rule-based Chinese text frontend in [cn_text_frontend.md](./cn_text_frontend.md). Here, we will introduce acoustic models and vocoders, which are trainable.
The main processes of TTS include:
1. Convert the original text into characters/phonemes, through `text frontend` module.
2. Convert characters/phonemes into acoustic features, such as linear spectrogram, mel spectrogram, LPC features, etc. through `Acoustic models`.
1. Convert the original text into characters/phonemes, through the `text frontend` module.
2. Convert characters/phonemes into acoustic features, such as linear spectrogram, mel spectrogram, LPC features, etc. through `Acoustic models`.
3. Convert acoustic features into waveforms through `Vocoders`.
A simple text frontend module can be implemented by rules. Acoustic models and vocoders need to be trained. The models provided by PaddleSpeech TTS are acoustic models and vocoders.
@ -59,7 +59,7 @@ At present, there are two mainstream acoustic model structures.
**Advantages of Tacotron:**
- No need for complex text frontend analysis modules.
- No need for additional duration model.
- No need for an additional duration model.
- Greatly simplifies the acoustic model construction process and reduces the dependence of speech synthesis tasks on domain knowledge.
**Disadvantages of Tacotron:**
@ -67,7 +67,7 @@ At present, there are two mainstream acoustic model structures.
- Global soft attention.
- Poor stability for speech synthesis tasks.
- In training, the fewer speech frames predicted at each step, the more difficult it is to train.
- Phase problem in Griffin-Lim casues speech distortion during wave reconstruction.
- Phase problem in Griffin-Lim causes speech distortion during wave reconstruction.
- The autoregressive decoder cannot be stopped during the generation process.
#### Tacotron2
@ -81,12 +81,12 @@ At present, there are two mainstream acoustic model structures.
- CBHG -> 5 Conv layers.
- The input and output of the PostNet calculate `L2` loss with real Mel spectrogram.
- Residual connection.
- Bad stop in autoregressive decoder.
- Bad stop in an autoregressive decoder.
- Predict whether it should stop at each moment of decoding (stop token).
- Set a threshold to determine whether to stop generating when decoding.
- Stability of attention.
- Location-aware attention.
- The alignment matrix of previous time is considered at the step `t` of decoder.
- The alignment matrix of the previous time is considered at step `t` of the decoder.
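The stop-token idea above can be sketched as a decoding loop that ends once the predicted stop probability exceeds a threshold. This is a minimal illustration, not the Tacotron2 implementation: `decode_with_stop_token`, `step_fn`, `toy_step` and the default values are all assumptions made for the example.

```python
from typing import Callable, List, Tuple


def decode_with_stop_token(
        step_fn: Callable[[int], Tuple[List[float], float]],
        max_steps: int = 1000,
        stop_threshold: float = 0.5) -> List[List[float]]:
    """Run an autoregressive loop until the stop probability exceeds the threshold.

    step_fn(t) is a hypothetical decoder step returning (frame, stop_probability).
    max_steps is a safety cap in case the stop token never fires.
    """
    frames = []
    for t in range(max_steps):
        frame, stop_prob = step_fn(t)
        frames.append(frame)
        if stop_prob > stop_threshold:
            break
    return frames


def toy_step(t: int) -> Tuple[List[float], float]:
    """Dummy decoder step whose stop probability grows linearly with t."""
    return [float(t)], min(1.0, t / 10.0)
```

The `max_steps` cap matters in practice: without it, a decoder that never predicts a high stop probability would generate forever.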
@ -96,7 +96,7 @@ You can find PaddleSpeech TTS's tacotron2 with LJSpeech dataset example at [exam
### TransformerTTS
**Disadvantages of the Tacotrons:**
- Encodr and decoder are relatively weak at global information modeling
- Encoder and decoder are relatively weak at global information modeling.
- Vanishing gradient of RNN.
- Fixed-length context modeling problem in CNN kernel.
- Training is relatively inefficient.
@ -105,7 +105,7 @@ You can find PaddleSpeech TTS's tacotron2 with LJSpeech dataset example at [exam
Transformer TTS is a combination of Tacotron2 and Transformer.
#### Transformer
[Transformer](https://arxiv.org/abs/1706.03762) is a seq2seq model based entirely on attention mechanism.
[Transformer](https://arxiv.org/abs/1706.03762) is a seq2seq model based entirely on an attention mechanism.
**Features of Transformer:**
- Encoder.
@ -113,7 +113,7 @@ Transformer TTS is a combination of Tacotron2 and Transformer.
- Positional Encoding.
- Decoder.
- `N` blocks based on self-attention mechanism.
- Add Mask to the self-attention in blocks to cover up the information after `t` step.
- Add Mask to the self-attention in blocks to cover up the information after the `t` step.
- Attentions between encoder and decoder.
- Positional Encoding.
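The decoder mask mentioned above can be illustrated with a small pure-Python sketch: position `t` may attend only to positions up to and including `t`. The helper names `causal_mask` and `masked_softmax` are made up for this example and do not come from any library.

```python
import math
from typing import List


def causal_mask(length: int) -> List[List[bool]]:
    """mask[i][j] is True when step i is allowed to attend to step j (j <= i)."""
    return [[j <= i for j in range(length)] for i in range(length)]


def masked_softmax(scores: List[List[float]],
                   mask: List[List[bool]]) -> List[List[float]]:
    """Row-wise softmax in which masked positions get zero attention weight."""
    weights = []
    for row, mask_row in zip(scores, mask):
        visible = [s for s, m in zip(row, mask_row) if m]
        peak = max(visible)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) if m else 0.0 for s, m in zip(row, mask_row)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights
```

With uniform scores, row `i` of the result spreads its weight evenly over the first `i + 1` positions and assigns zero to everything later.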
@ -153,34 +153,34 @@ You can find PaddleSpeech TTS's Transformer TTS with LJSpeech dataset example at
**Disadvantage of seq2seq models:**
- In attention-based seq2seq models, no matter how the attention mechanism is improved, it is difficult to avoid generation errors in the decoding stage.
Framelevel acoustic models use duration models to determine the pronunciation duration of phonemes, and the framelevel mapping does not have the uncertainty of sequence generation.
Frame-level acoustic models use duration models to determine the pronunciation duration of phonemes, and the frame-level mapping does not have the uncertainty of sequence generation.
In seq2seq models, the concept of duration models is used as the alignment module of two sequences to replace attention, which can avoid the uncertainty in attention, and significantly improve the stability of the seq2seq models.
#### FastSpeech
Instead of using the encoder-attention-decoder based architecture as adopted by most seq2seq based autoregressive and non-autoregressive generation, [FastSpeech](https://arxiv.org/abs/1905.09263) is a novel feed-forward structure, which can generate a target mel spectrogram sequence in parallel.
Instead of using the encoder-attention-decoder based architecture as adopted by most seq2seq based autoregressive and non-autoregressive generation, [FastSpeech](https://arxiv.org/abs/1905.09263) is a novel feed-forward structure, which can generate a target mel spectrogram sequence in parallel.
**Features of FastSpeech:**
- Encoder: based on Transformer.
- Change `FFN` to `CNN` in self-attention.
- Model local dependency.
- Length regulator.
- Use real phoneme durations to expand output frame of encoder during training.
- Nonautoregressive decode.
- Use real phoneme durations to expand the output frame of the encoder during training.
- Non-autoregressive decode.
- Improve generation efficiency.
**Length predictor:**
- Pretrain a TransformerTTS model.
- Get alignment matrix of train data.
- Caculate the phoneme durations according to the probability of the alignment matrix.
- Use the output of encoder to predict the phoneme durations and calculate the MSE loss.
- Use real phoneme durations to expand output frame of encoder during training.
- Calculate the phoneme durations according to the probability of the alignment matrix.
- Use the output of the encoder to predict the phoneme durations and calculate the MSE loss.
- Use real phoneme durations to expand the output frame of the encoder during training.
- Use phoneme durations predicted by the duration model to expand the frame during prediction.
- Attention cannot control phoneme durations. Explicit duration modeling can control durations through a duration coefficient (the duration coefficient is `1` during training).
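The length regulator described above can be sketched as repeating each encoder output vector according to its phoneme duration. This is a simplified illustration: the function name `length_regulate`, the `duration_coeff` parameter name and the rounding policy are assumptions, not the FastSpeech code.

```python
from typing import List, Sequence


def length_regulate(encoder_outputs: Sequence[Sequence[float]],
                    durations: Sequence[float],
                    duration_coeff: float = 1.0) -> List[List[float]]:
    """Expand one vector per phoneme into one vector per frame.

    Each phoneme vector is repeated round(duration * duration_coeff) times.
    A coefficient above 1 slows speech down, below 1 speeds it up.
    """
    if len(encoder_outputs) != len(durations):
        raise ValueError("need exactly one duration per encoder output")
    expanded = []
    for vec, dur in zip(encoder_outputs, durations):
        n_frames = max(0, int(round(dur * duration_coeff)))
        expanded.extend(list(vec) for _ in range(n_frames))
    return expanded
```

Because the output length is just the sum of the scaled durations, it is known before decoding starts, which is what makes parallel generation possible.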
**Advantages of non-autoregressive decoder:**
- The built-in duration model of the seq2seq model has converted the input length `M` to the output length `N`.
- The length of output is known, `stop token` is no longer used, avoiding the problem of being unable to stop.
- The length of the output is known, `stop token` is no longer used, avoiding the problem of being unable to stop.
- Can be generated in parallel (decoding time is less affected by sequence length).
<div align="left">
@ -198,27 +198,27 @@ Instead of using the encoder-attention-decoder based architecture as adopted by
**Disadvantages of FastSpeech:**
- The teacher-student distillation pipeline is complicated and time-consuming.
- The duration extracted from the teacher model is not accurate enough.
- The target mel spectrograms distilled from teacher model suffer from information loss due to data simplification.
- The target mel spectrograms distilled from the teacher model suffer from information loss due to data simplification.
[FastSpeech2](https://arxiv.org/abs/2006.04558) addresses the issues in FastSpeech and better solves the one-to-many mapping problem in TTS.
**Features of FastSpeech2:**
- Directly training the model with ground-truth target instead of the simplified output from teacher.
- Introducing more variation information of speech as conditional inputs, extract `duration`, `pitch` and `energy` from speech waveform and directly take them as conditional inputs in training and use predicted values in inference.
- Directly train the model with the ground-truth target instead of the simplified output from the teacher.
- Introducing more variation information of speech as conditional inputs, extract `duration`, `pitch`, and `energy` from speech waveform and directly take them as conditional inputs in training and use predicted values in inference.
FastSpeech2 is similar to FastPitch but introduces more variation information of speech.
FastSpeech2 is similar to FastPitch but introduces more variation information of the speech.
You can find PaddleSpeech TTS's FastSpeech2/FastPitch with CSMSC dataset example at [examples/csmsc/tts3](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/tts3), We use token-averaged pitch and energy values introduced in FastPitch rather than framelevel ones in FastSpeech2.
You can find PaddleSpeech TTS's FastSpeech2/FastPitch with CSMSC dataset example at [examples/csmsc/tts3](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/tts3). We use token-averaged pitch and energy values introduced in FastPitch rather than frame-level ones in FastSpeech2.
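To make "token-averaged" concrete, here is a minimal sketch that collapses frame-level values (pitch or energy) into one value per token using the token durations. The function name `token_average` is hypothetical, and a plain mean over all frames of a token is assumed; a real recipe may treat unvoiced frames differently.

```python
from typing import List, Sequence


def token_average(frame_values: Sequence[float],
                  durations: Sequence[int]) -> List[float]:
    """Average frame-level values over the frames that belong to each token."""
    if sum(durations) != len(frame_values):
        raise ValueError("durations must sum to the number of frames")
    averaged = []
    start = 0
    for dur in durations:
        segment = frame_values[start:start + dur]
        # A zero-length token has no frames; use 0.0 as a placeholder.
        averaged.append(sum(segment) / dur if dur > 0 else 0.0)
        start += dur
    return averaged
```

For example, five frames with durations `[3, 2]` yield two values, one per token.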
### SpeedySpeech
[SpeedySpeech](https://arxiv.org/abs/2008.03802) simplifies the teacher-student architecture of FastSpeech and provides a fast and stable training procedure.
**Features of SpeedySpeech:**
- Use a simpler, smaller and faster-to-train convolutional teacher model ([Deepvoice3](https://arxiv.org/abs/1710.07654) and [DCTTS](https://arxiv.org/abs/1710.08969)) with a single attention layer instead of Transformer used in FastSpeech.
- Use a simpler, smaller, and faster-to-train convolutional teacher model ([Deepvoice3](https://arxiv.org/abs/1710.07654) and [DCTTS](https://arxiv.org/abs/1710.08969)) with a single attention layer instead of Transformer used in FastSpeech.
- Show that self-attention layers in the student network are not needed for high-quality speech synthesis.
- Describe a simple data augmentation technique that can be used early in the training to make the teacher network robust to sequential error propagation.
@ -233,7 +233,7 @@ In speech synthesis, the main task of the vocoder is to convert the spectral par
Taking into account the short-term change frequency of the waveform, the acoustic model usually avoids direct modeling of the speech waveform, but firstly models the spectral features extracted from the speech waveform, and then reconstructs the waveform by the decoding part of the vocoder.
A vocoder usually consists of a pair of encoders and decoders for speech analysis and synthesis. The encoder estimate the parameters, and then the decoder restores the speech.
A vocoder usually consists of a pair of encoders and decoders for speech analysis and synthesis. The encoder estimates the parameters, and then the decoder restores the speech.
Vocoders based on neural networks usually perform speech synthesis directly, learning the mapping from spectral features to waveforms from training data.
@ -262,11 +262,11 @@ Vocoders based on neural networks usually is speech synthesis, which learns the
- DiffWave
**Motivations of GAN-based vocoders:**
- Modeling speech signal by estimating probability distribution usually has high requirements for the expression ability of the model itself. In addition, specific assumptions need to be made about the distribution of waveforms.
- Modeling speech signals by estimating probability distribution usually has high requirements for the expression ability of the model itself. In addition, specific assumptions need to be made about the distribution of waveforms.
- Although autoregressive neural vocoders can obtain high-quality synthetic speech, such models usually have a **slow generation speed**.
- The training of inverse autoregressive flow vocoders is complex, and they also require the modeling capability of longterm context information.
- The training of inverse autoregressive flow vocoders is complex, and they also require the modeling capability of long-term context information.
- Vocoders based on Bipartite Transformation converge slowly and are complex.
- GAN-based vocoders don't need to make assumptions about the speech distribution, and train through adversarial learning.
- GAN-based vocoders don't need to make assumptions about the speech distribution and train through adversarial learning.
Here, we introduce a Flow-based vocoder WaveFlow and a GAN-based vocoder Parallel WaveGAN.
@ -274,14 +274,14 @@ Here, we introduce a Flow-based vocoder WaveFlow and a GAN-based vocoder Paralle
[WaveFlow](https://arxiv.org/abs/1912.01219) is proposed by Baidu Research.
**Features of WaveFlow:**
- It can synthesize 22.05 kHz high-fidelity speech around 40x faster than real-time on a Nvidia V100 GPU without engineered inference kernels, which is faster than [WaveGlow](https://github.com/NVIDIA/waveglow) and serveral orders of magnitude faster than WaveNet.
- It is a small-footprint flow-based model for raw audio. It has only 5.9M parameters, which is 15x smalller than WaveGlow (87.9M).
- It can synthesize 22.05 kHz high-fidelity speech around 40x faster than real-time on an Nvidia V100 GPU without engineered inference kernels, which is faster than [WaveGlow](https://github.com/NVIDIA/waveglow) and several orders of magnitude faster than WaveNet.
- It is a small-footprint flow-based model for raw audio. It has only 5.9M parameters, which is 15x smaller than WaveGlow (87.9M).
- It is directly trained with maximum likelihood without probability density distillation and auxiliary losses as used in [Parallel WaveNet](https://arxiv.org/abs/1711.10433) and [ClariNet](https://openreview.net/pdf?id=HklY120cYm), which simplifies the training pipeline and reduces the cost of development.
You can find PaddleSpeech TTS's WaveFlow with LJSpeech dataset example at [examples/ljspeech/voc0](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/ljspeech/voc0).
### Parallel WaveGAN
[Parallel WaveGAN](https://arxiv.org/abs/1910.11480) trains a non-autoregressive WaveNet variant as a generator in a GANbased training method.
[Parallel WaveGAN](https://arxiv.org/abs/1910.11480) trains a non-autoregressive WaveNet variant as a generator in a GAN-based training method.
The examples in PaddleSpeech are mainly organized by dataset. The TTS datasets we mainly use are:
* CSMSC (Mandarin single speaker)
* AISHELL3 (Mandarin multiple speaker)
* AISHELL3 (Mandarin multiple speakers)
* LJSpeech (English single speaker)
* VCTK (English multiple speaker)
* VCTK (English multiple speakers)
The models in PaddleSpeech TTS have the following mapping relationship:
* tts0 - Tacotron2
@ -14,6 +14,8 @@ The models in PaddleSpeech TTS have the following mapping relationship:
* voc1 - Parallel WaveGAN
* voc2 - MelGAN
* voc3 - MultiBand MelGAN
* voc4 - Style MelGAN
* voc5 - HiFiGAN
* vc0 - Tacotron2 Voice Clone with GE2E
* vc1 - FastSpeech2 Voice Clone with GE2E
@ -22,7 +24,7 @@ The models in PaddleSpeech TTS have the following mapping relationship:
Let's take a FastSpeech2 + Parallel WaveGAN with CSMSC dataset for instance. [examples/csmsc](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc)
### Train Parallel WaveGAN with CSMSC
- Go to directory
- Go to the directory
```bash
cd examples/csmsc/voc1
```
@ -37,9 +39,9 @@ Let's take a FastSpeech2 + Parallel WaveGAN with CSMSC dataset for instance. [ex
```bash
bash run.sh
```
This is just a demo, please make sure source data have been prepared well and every `step` works well before next `step`.
This is just a demo. Please make sure the source data has been prepared and that every `step` works before moving on to the next `step`.
### Train FastSpeech2 with CSMSC
- Go to directory
- Go to the directory
```bash
cd examples/csmsc/tts3
```
@ -49,26 +51,26 @@ Let's take a FastSpeech2 + Parallel WaveGAN with CSMSC dataset for instance. [ex
```
**Must do this before you start to do anything.**
Set `MAIN_ROOT` as project dir. Using `fastspeech2` model as `MODEL`.
- Main entrypoint
- Main entrypoint
```bash
bash run.sh
```
This is just a demo, please make sure source data have been prepared well and every `step` works well before next `step`.
This is just a demo. Please make sure the source data has been prepared and that every `step` works before moving on to the next `step`.
The steps in `run.sh` mainly include:
- source path.
- preprocess the dataset.
- train the model.
- synthesize waveform from metadata.jsonl.
- synthesize waveform from text file. (in acoustic models)
- inference using static model. (optional)
- synthesize waveform from a text file. (in acoustic models)
- inference using a static model. (optional)
For more details, you can see `README.md` in examples.
For more details, you can see `README.md` in examples.
## Pipeline of TTS
This section shows how to use pretrained models provided by TTS and make inference with them.
This section shows how to use pretrained models provided by TTS and run inference with them.
Pretrained models in TTS are provided in a archive. Extract it to get a folder like this:
Pretrained models in TTS are provided in an archive. Extract it to get a folder like this:
**Acoustic Models:**
```text
checkpoint_name
@ -87,15 +89,15 @@ checkpoint_name
└── stats.npy
```
- `default.yaml` stores the config used to train the model.
- `snapshot_iter_*.pdz` is the chechpoint file, where `*` is the steps it has been trained.
- `*_stats.npy` is the stats file of feature if it has been normalized before training.
- `phone_id_map.txt` is the map of phonemes to phoneme_ids.
- `tone_id_map.txt` is the map of tones to tones_ids, when you split tones and phones before training acoustic models. (for example in our csmsc/speedyspeech example)
- `spk_id_map.txt` is the map of spkeaker to spk_ids in multi-spk acoustic models. (for example in our aishell3/fastspeech2 example)
- `snapshot_iter_*.pdz` is the checkpoint file, where `*` is the number of steps it has been trained for.
- `*_stats.npy` is the stats file of the feature if it has been normalized before training.
- `phone_id_map.txt` is the map of phonemes to phoneme_ids.
- `tone_id_map.txt` is the map of tones to tones_ids, when you split tones and phones before training acoustic models. (for example in our csmsc/speedyspeech example)
- `spk_id_map.txt` is the map of speakers to spk_ids in multi-spk acoustic models. (for example in our aishell3/fastspeech2 example)
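As an illustration of how such a map file might be consumed, the sketch below parses lines into a dictionary. It assumes each line holds a symbol and an integer id separated by whitespace; that layout and the helper name `parse_id_map` are assumptions for the example, so check the actual files in your checkpoint before relying on it.

```python
from typing import Dict, Iterable


def parse_id_map(lines: Iterable[str]) -> Dict[str, int]:
    """Parse 'symbol id' lines into a {symbol: id} dictionary.

    Assumed layout: one entry per line, the id is the last
    whitespace-separated field. Blank lines are skipped.
    """
    mapping = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        symbol, idx = line.rsplit(maxsplit=1)
        mapping[symbol] = int(idx)
    return mapping
```

An open file object can be passed directly, for example `parse_id_map(open("phone_id_map.txt", encoding="utf-8"))`, since iterating over a file yields its lines.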
The example code below shows how to use the models for prediction.
### Acoustic Models (text to spectrogram)
The code below show how to use a `FastSpeech2` model. After loading the pretrained model, use it and normalizer object to construct a prediction object,then use `fastspeech2_inferencet(phone_ids)` to generate spectrograms, which can be further used to synthesize raw audio with a vocoder.
The code below shows how to use a `FastSpeech2` model. After loading the pretrained model, use it and the normalizer object to construct a prediction object, then use `fastspeech2_inference(phone_ids)` to generate spectrograms, which can be further used to synthesize raw audio with a vocoder.
```python
from pathlib import Path
@ -153,7 +155,7 @@ for part_phone_ids in phone_ids:
```
### Vocoder (spectrogram to wave)
The code below show how to use a ` Parallel WaveGAN` model. Like the example above, after loading the pretrained model, use it and normalizer object to construct a prediction object,then use `pwg_inference(mel)` to generate raw audio (in wav format).
The code below shows how to use a `Parallel WaveGAN` model. Like the example above, after loading the pretrained model, use it and the normalizer object to construct a prediction object, then use `pwg_inference(mel)` to generate raw audio (in wav format).
A TTS system mainly includes three modules: `Text Frontend`, `Acoustic model` and `Vocoder`. We provide a complete Chinese text frontend module in PaddleSpeech TTS, see examples in [examples/other/tn](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/tn) and [examples/other/g2p](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/g2p).
A text frontend module mainly includes:
@ -42,7 +42,7 @@ Among them, Text Normalization and G2P are the most important modules. We mainly
## Grapheme-to-Phoneme
In Chinese, G2P is a very complex module, which mainly includes **polyphone** and **tone sandhi**.
We use [g2pM](https://github.com/kakaobrain/g2pM) and [pypinyin](https://github.com/mozillazg/python-pinyin) as the defalut g2p tools. They can solve the problem of polyphone to a certain extent. In the future, we intend to use a trainable language model (for example, [BERT](https://arxiv.org/abs/1810.04805)) for polyphone.
We use [g2pM](https://github.com/kakaobrain/g2pM) and [pypinyin](https://github.com/mozillazg/python-pinyin) as the default g2p tools. They can solve the problem of polyphones to a certain extent. In the future, we intend to use a trainable language model (for example, [BERT](https://arxiv.org/abs/1810.04805)) for polyphones.
However, g2pM and pypinyin do not perform well on tone sandhi, so we use rules to solve this problem, which requires relevant linguistic knowledge.
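As a flavor of what a rule-based fix looks like, the sketch below applies the best-known Mandarin rule: when two third-tone syllables are adjacent, the first is pronounced with the second tone. It is a deliberately simplified illustration, not the PaddleSpeech rule set: the function name is hypothetical, syllables are assumed to be pinyin with a trailing tone digit, and word boundaries (which matter for longer third-tone runs) are ignored.

```python
from typing import List, Sequence


def apply_third_tone_sandhi(syllables: Sequence[str]) -> List[str]:
    """Change tone 3 to tone 2 when the following syllable is also tone 3.

    Decisions are based on the original (citation) tones, so a run such as
    3-3-3 becomes 2-2-3. Real systems refine this using word segmentation.
    """
    result = list(syllables)
    for i in range(len(syllables) - 1):
        if syllables[i].endswith("3") and syllables[i + 1].endswith("3"):
            result[i] = syllables[i][:-1] + "2"
    return result
```

For instance, `["ni3", "hao3"]` becomes `["ni2", "hao3"]`, while a third tone followed by any other tone is left unchanged.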
| 0 | Process data. It includes: <br> (1) Download the dataset <br> (2) Caculate the CMVN of the train dataset <br> (3) Get the vocabulary file <br> (4) Get the manifest files of the train, development and test dataset |
| 0 | Process data. It includes: <br> (1) Download the dataset <br> (2) Calculate the CMVN of the train dataset <br> (3) Get the vocabulary file <br> (4) Get the manifest files of the train, development and test dataset |
| 1 | Train the model |
| 2 | Get the final model by averaging the top-k models, set k = 1 means choose the best model |
| 2 | Get the final model by averaging the top-k models; setting k = 1 selects the best model |
| 3 | Test the final model performance |
| 4 | Export the static graph model |
| 5 | Test the static graph model |
| 6 | Infer the single audio file |
You can choose to run a range of stages by setting the ```stage``` and ```stop_stage ``` .
You can choose to run a range of stages by setting `stage` and `stop_stage`.
For example, if you want to execute the code in stage 2 and stage 3, you can run this script:
```bash
bash run.sh --stage 2 --stop_stage 3
```
Or you can set ```stage``` equal to ```stop-stage``` to only run one stage.
For example, if you only want to run ```stage 0```, you can use the script below:
Or you can set `stage` equal to `stop_stage` to only run one stage.
For example, if you only want to run `stage 0`, you can use the script below:
```bash
bash run.sh --stage 0 --stop_stage 0
```
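The selection logic behind `--stage` and `--stop_stage` amounts to an inclusive range filter: stage `k` runs when `stage <= k <= stop_stage`. The Python sketch below only models that behaviour for illustration; `stages_to_run` is a hypothetical helper, and the default of seven stages (0 to 6) matches the table above.

```python
from typing import Iterable, List


def stages_to_run(stage: int, stop_stage: int,
                  all_stages: Iterable[int] = range(0, 7)) -> List[int]:
    """Return the stages k that satisfy stage <= k <= stop_stage."""
    return [k for k in all_stages if stage <= k <= stop_stage]
```

So `--stage 2 --stop_stage 3` selects stages 2 and 3, and setting both to the same number selects a single stage.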
The document below will describe the scripts in the ```run.sh``` in detail.
The document below will describe the scripts in the `run.sh` in detail.
## The environment variables
The `path.sh` script contains the environment variables.
```bash
source path.sh
```
This script needs to be run firstly.
This script needs to be run first.
And another script is also needed:
```bash
source ${MAIN_ROOT}/utils/parse_options.sh
```
It will support the way of using```--varibale value``` in the shell scripts.
It will support the way of using `--variable value` in the shell scripts.
## The local variables
Some local variables are set in the `run.sh`.
`gpus` denotes the GPU number you want to use. If you set `gpus=`, it means you only use CPU.
Some local variables are set in the ```run.sh```.
```gpus``` denotes the GPU number you want to use. If you set ```gpus=```, it means you only use CPU.
`stage` denotes the number of the stage you want to start from in the experiments.
`stop_stage` denotes the number of the stage you want to end at in the experiments.
`conf_path` denotes the config path of the model.
`avg_num` denotes the number K of top-K models you want to average to get the final model.
`model_type` denotes the model type: offline or online
`audio_file` denotes the file path of the single file you want to infer in stage 6
`ckpt` denotes the checkpoint prefix of the model, e.g. "deepspeech2"
```stage``` denotes the number of stage you want to start from in the expriments.
```stop stage```denotes the number of stage you want to end at in the expriments.
```conf_path``` denotes the config path of the model.
```avg_num``` denotes the number K of top-K models you want to average to get the final model.
```model_type```denotes the model type: offline or online
```audio file``` denotes the file path of the single file you want to infer in stage 6
```ckpt``` denotes the checkpoint prefix of the model, e.g. "deepspeech2"
You can set the local variables (except ```ckpt```) when you use the ```run.sh```
For example, you can set the ```gpus``` and ``avg_num`` when you use the command line.:
You can set the local variables (except `ckpt`) when you use the `run.sh`
For example, you can set `gpus` and `avg_num` on the command line:
```bash
bash run.sh --gpus 0,1 --avg_num 1
```
## Stage 0: Data processing
To use this example, you need to process data firstly and you can use stage 0 in the ```run.sh``` to do this. The code is shown below:
To use this example, you need to process the data first, and you can use stage 0 in the `run.sh` to do this. The code is shown below:
```bash
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
@ -91,24 +65,17 @@ To use this example, you need to process data firstly and you can use stage 0 i
bash ./local/data.sh || exit -1
fi
```
Stage 0 is for processing the data.
If you only want to process the data, you can run:
```bash
bash run.sh --stage 0 --stop_stage 0
```
You can also just run these scripts in your command line.
```bash
source path.sh
bash ./local/data.sh
```
After processing the data, the ``data`` directory will look like this:
After processing the data, the `data` directory will look like this:
```bash
data/
|-- dev.meta
@ -124,54 +91,37 @@ data/
|-- test.meta
`-- train.meta
```
## Stage 1: Model training
If you want to train the model. you can use stage 1 in the ```run.sh```. The code is shown below.
If you want to train the model, you can use stage 1 in the `run.sh`. The code is shown below.
```bash
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
After training the model, we need to get the final model for testing and inference. In every epoch, the model checkpoint is saved, so we can choose the best model from them based on the validation loss or we can sort them and average the parameters of the top-k models to get the final model. We can use stage 2 to do this, and the code is shown below:
After training the model, we need to get the final model for testing and inference. In every epoch, the model checkpoint is saved, so we can choose the best model from them based on the validation loss or we can sort them and average the parameters of the top-k models to get the final model. We can use stage 2 to do this, and the code is shown below:
```bash
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
# avg n best model
avg.sh best exp/${ckpt}/checkpoints ${avg_num}
fi
```
The ```avg.sh``` is in the ```../../../utils/``` which is define in the ```path.sh```.
The `avg.sh` script is in `../../../utils/`, which is defined in `path.sh`.
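The idea of averaging the top-k checkpoints can be sketched in a few lines: sort checkpoints by validation loss, keep the best `k`, and average each parameter element-wise. This is a conceptual illustration only, not what `avg.sh` does internally; `average_top_k` and the plain-list parameter format are assumptions made for the example.

```python
from typing import Dict, List, Sequence, Tuple

Params = Dict[str, List[float]]


def average_top_k(checkpoints: Sequence[Tuple[float, Params]], k: int) -> Params:
    """Average the parameters of the k checkpoints with the lowest validation loss.

    checkpoints: (validation_loss, params) pairs, where params maps a
    parameter name to a flat list of floats with the same shape in each pair.
    """
    if k < 1 or not checkpoints:
        raise ValueError("need at least one checkpoint and k >= 1")
    best = sorted(checkpoints, key=lambda item: item[0])[:k]
    averaged = {}
    for name in best[0][1]:
        columns = zip(*(params[name] for _, params in best))
        averaged[name] = [sum(col) / len(best) for col in columns]
    return averaged
```

With `k = 1` the result is simply the single best checkpoint.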
If you want to get the final model, you can use the script below to execute stage 0, stage 1, and stage 2:
```bash
bash run.sh --stage 0 --stop_stage 2
```
or you can run these scripts in the command line (only use CPU).
You need to prepare an audio file or use the audio demo above, please confirm the sample rate of the audio is 16K. You can get the result of audio demo by running the script below.
You need to prepare an audio file or use the audio demo above. Please confirm that the sample rate of the audio is 16 kHz. You can get the result of the audio demo by running the script below.
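One way to confirm the sample rate before running inference is to read the WAV header with Python's standard `wave` module. This is a small sketch, assuming an uncompressed PCM WAV file; the helper names `wav_sample_rate` and `is_16k` are made up for the example.

```python
import io
import wave


def wav_sample_rate(source) -> int:
    """Return the sample rate in Hz of a PCM WAV file (path or file-like object)."""
    with wave.open(source, "rb") as wav_file:
        return wav_file.getframerate()


def is_16k(source) -> bool:
    """True when the WAV file is sampled at 16000 Hz."""
    return wav_sample_rate(source) == 16000
```

Calling `is_16k("demo.wav")` with a placeholder path of your own returns `False` for audio that would first need resampling.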
This example contains code used to train a Transformer or [Conformer](http://arxiv.org/abs/2005.08100) model with [Aishell dataset](http://www.openslr.org/resources/33).
## Overview
All the scirpts you need are in ```run.sh```. There are several stages in ```run.sh```, and each stage has its function.
All the scripts you need are in `run.sh`. There are several stages in `run.sh`, and each stage has its function.
| 0 | Process data. It includes: <br> (1) Download the dataset <br> (2) Caculate the CMVN of the train dataset <br> (3) Get the vocabulary file <br> (4) Get the manifest files of the train, development and test dataset |
| 0 | Process data. It includes: <br> (1) Download the dataset <br> (2) Calculate the CMVN of the train dataset <br> (3) Get the vocabulary file <br> (4) Get the manifest files of the train, development and test dataset |
| 1 | Train the model |
| 2 | Get the final model by averaging the top-k models, set k = 1 means choose the best model |
| 2 | Get the final model by averaging the top-k models; setting k = 1 selects the best model |
| 3 | Test the final model performance |
| 4 | Get ctc alignment of test data using the final model |
| 5 | Infer the single audio file |
You can choose to run a range of stages by setting ```stage``` and ```stop_stage ```.
You can choose to run a range of stages by setting `stage` and `stop_stage`.
For example, if you want to execute the code in stage 2 and stage 3, you can run this script:
```bash
bash run.sh --stage 2 --stop_stage 3
```
Or you can set ```stage``` equal to ```stop-stage``` to only run one stage.
For example, if you only want to run ```stage 0```, you can use the script below:
Or you can set `stage` equal to `stop_stage` to only run one stage.
For example, if you only want to run `stage 0`, you can use the script below:
```bash
bash run.sh --stage 0 --stop_stage 0
```
The document below will describe the scripts in ```run.sh``` in detail.
The document below will describe the scripts in `run.sh` in detail.
## The Environment Variables
The path.sh contains the environment variables.
```bash
source path.sh
```
This script needs to be run firstly. And another script is also needed:
This script needs to be run first. And another script is also needed:
```bash
source ${MAIN_ROOT}/utils/parse_options.sh
```
It will support the way of using```--varibale value``` in the shell scripts.
It will support the way of using `--variable value` in the shell scripts.
## The Local Variables
Some local variables are set in ```run.sh```.
```gpus``` denotes the GPU number you want to use. If you set ```gpus=```, it means you only use CPU.
```stage``` denotes the number of stage you want to start from in the expriments.
```stop stage```denotes the number of stage you want to end at in the expriments.
```conf_path``` denotes the config path of the model.
```avg_num``` denotes the number K of top-K models you want to average to get the final model.
```audio_file``` denotes the file path of the single file you want to infer in stage 5
```ckpt``` denotes the checkpoint prefix of the model, e.g. "conformer"
You can set the local variables (except ```ckpt```) when you use ```run.sh```
For example, you can set the ```gpus``` and ``avg_num`` when you use the command line.:
Some local variables are set in `run.sh`.
`gpus` denotes the GPU number you want to use. If you set `gpus=`, it means you only use CPU.
`stage` denotes the number of the stage you want to start from in the experiments.
`stop_stage` denotes the number of the stage you want to end at in the experiments.
`conf_path` denotes the config path of the model.
`avg_num` denotes the number K of top-K models you want to average to get the final model.
`audio_file` denotes the file path of the single file you want to infer in stage 5
`ckpt` denotes the checkpoint prefix of the model, e.g. "conformer"
You can set the local variables (except `ckpt`) when you use `run.sh`
For example, you can set `gpus` and `avg_num` on the command line:
```bash
bash run.sh --gpus 0,1 --avg_num 20
```
## Stage 0: Data Processing
To use this example, you need to process data firstly and you can use stage 0 in ```run.sh``` to do this. The code is shown below:
To use this example, you need to process the data first, and you can use stage 0 in `run.sh` to do this. The code is shown below:
```bash
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
# prepare data
@ -93,20 +60,15 @@ To use this example, you need to process data firstly and you can use stage 0 in
Stage 0 is for processing the data.
If you only want to process the data, you can run:
```bash
bash run.sh --stage 0 --stop_stage 0
```
You can also just run these scripts in your command line.
```bash
source path.sh
bash ./local/data.sh
```
After processing the data, the ``data`` directory will look like this:
```bash
data/
|-- dev.meta
@ -122,84 +84,57 @@ data/
|-- test.meta
`-- train.meta
```
## Stage 1: Model Training
If you want to train the model. you can use stage 1 in ```run.sh```. The code is shown below.
If you want to train the model, you can use stage 1 in `run.sh`. The code is shown below.
```bash
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
After training the model, we need to get the final model for testing and inference. In every epoch, the model checkpoint is saved, so we can choose the best model from them based on the validation loss or we can sort them and average the parameters of the top-k models to get the final model. We can use stage 2 to do this, and the code is shown below:
```bash
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
# avg n best model
avg.sh best exp/${ckpt}/checkpoints ${avg_num}
fi
```
The ```avg.sh``` is in the ```../../../utils/``` which is define in the ```path.sh```.
`avg.sh` is located in `../../../utils/`, which is defined in `path.sh`.
If you want to get the final model, you can use the script below to execute stage 0, stage 1, and stage 2:
```bash
bash run.sh --stage 0 --stop_stage 2
```
Or you can run these scripts on the command line (CPU only).
You can train the model yourself using `bash run.sh --stage 0 --stop_stage 3`, or you can download the pretrained model through the script below:
You need to prepare an audio file or use the audio demo above. Please confirm that the sample rate of the audio is 16 kHz. You can get the result by running the script below.
This example contains code used to train a DeepSpeech2 offline or online model with the [Librispeech dataset](http://www.openslr.org/resources/12).
## Overview
All the scirpts you need are in the ```run.sh```. There are several stages in the ```run.sh```, and each stage has its function.
All the scripts you need are in `run.sh`. There are several stages in `run.sh`, and each stage has its own function.
| 0 | Process data. It includes: <br> (1) Download the dataset <br> (2) Caculate the CMVN of the train dataset <br> (3) Get the vocabulary file <br> (4) Get the manifest files of the train, development and test dataset |
| 0 | Process data. It includes: <br> (1) Download the dataset <br> (2) Calculate the CMVN of the train dataset <br> (3) Get the vocabulary file <br> (4) Get the manifest files of the train, development and test dataset |
| 1 | Train the model |
| 2 | Get the final model by averaging the top-k models, set k = 1 means choose the best model |
| 2 | Get the final model by averaging the top-k models; setting k = 1 selects the best model |
| 3 | Test the final model performance |
| 4 | Export the static graph model |
| 5 | Test the static graph model |
| 6 | Infer the single audio file |
You can choose to run a range of stages by setting the ```stage``` and ```stop_stage ``` .
You can choose to run a range of stages by setting `stage` and `stop_stage`.
For example, if you want to execute the code in stage 2 and stage 3, you can run this script:
```bash
bash run.sh --stage 2 --stop_stage 3
```
Or you can set ```stage``` equal to ```stop-stage``` to only run one stage.
For example, if you only want to run ```stage 0```, you can use the script below:
Or you can set `stage` equal to `stop_stage` to run only one stage.
For example, if you only want to run `stage 0`, you can use the script below:
```bash
bash run.sh --stage 0 --stop_stage 0
```
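The effect of the stage range can be sketched as follows. This is only an illustration, not the actual `run.sh`: the `run_stages` function and its messages are hypothetical, and only the `-le`/`-ge` guard pattern is taken from the stage snippets shown in this document.

```bash
# Hypothetical sketch: print which of stages 0..3 would run for a given range.
# Every stage uses the same guard: stage <= n and stop_stage >= n.
run_stages() {
    stage="$1"
    stop_stage="$2"
    for n in 0 1 2 3; do
        if [ "${stage}" -le "${n}" ] && [ "${stop_stage}" -ge "${n}" ]; then
            echo "running stage ${n}"
        fi
    done
}

run_stages 2 3   # mirrors: bash run.sh --stage 2 --stop_stage 3
```

Because each guard is evaluated independently, setting both values to the same number runs exactly one stage.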
The document below will describe the scripts in the ```run.sh``` in detail.
The document below will describe the scripts in the `run.sh` in detail.
## The environment variables
`path.sh` contains the environment variables.
```bash
source path.sh
```
This script needs to be run firstly.
This script needs to be run first.
And another script is also needed:
```bash
source ${MAIN_ROOT}/utils/parse_options.sh
```
It will support the way of using```--varibale value``` in the shell scripts.
It enables passing options in the form `--variable value` to the shell scripts.
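The `--variable value` convention can be sketched with a small option loop. This is a simplified, hypothetical stand-in and not the real `parse_options.sh`: the `parse_opts` function, the whitelist of names, and the mapping of dashes to underscores are assumptions made for illustration.

```bash
# Defaults, as a run script might declare them before parsing options.
gpus=0
stage=0
stop_stage=100

# Hypothetical parser: "--name value" overrides the variable called "name".
parse_opts() {
    while [ $# -ge 2 ]; do
        case "$1" in
            --*) ;;
            *) break ;;
        esac
        # Strip the leading "--" and map dashes to underscores (assumption).
        opt_name=$(printf '%s' "${1#--}" | tr '-' '_')
        case "$opt_name" in
            gpus|stage|stop_stage) eval "$opt_name=\$2" ;;
            *) echo "unknown option: $1" >&2; return 1 ;;
        esac
        shift 2
    done
}

parse_opts --gpus 0,1 --stop_stage 2
echo "gpus=${gpus} stage=${stage} stop_stage=${stop_stage}"
```

Restricting overrides to a fixed list of names keeps the `eval` safe in this sketch; variables that are not passed keep their defaults.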
## The local variables
Some local variables are set in the `run.sh`.
`gpus` denotes the GPU number you want to use. If you set `gpus=`, it means you only use CPU.
`stage` denotes the number of the stage you want to start from in the experiments.
`stop_stage` denotes the number of the stage you want to end at in the experiments.
`conf_path` denotes the config path of the model.
`avg_num` denotes the number K of top-K models you want to average to get the final model.
`model_type` denotes the model type: offline or online.
`audio_file` denotes the file path of the single file you want to infer in stage 6.
`ckpt` denotes the checkpoint prefix of the model, e.g. "deepspeech2"
Some local variables are set in the ```run.sh```.
```gpus``` denotes the GPU number you want to use. If you set ```gpus=```, it means you only use CPU.
```stage``` denotes the number of stage you want to start from in the expriments.
```stop stage```denotes the number of stage you want to end at in the expriments.
```conf_path``` denotes the config path of the model.
```avg_num``` denotes the number K of top-K models you want to average to get the final model.
```model_type```denotes the model type: offline or online
```audio file``` denotes the file path of the single file you want to infer in stage 6
```ckpt``` denotes the checkpoint prefix of the model, e.g. "deepspeech2"
You can set the local variables (except ```ckpt```) when you use the ```run.sh```
For example, you can set the ```gpus``` and ``avg_num`` when you use the command line.:
You can set the local variables (except `ckpt`) when you use `run.sh`.
For example, you can set `gpus` and `avg_num` on the command line:
```bash
bash run.sh --gpus 0,1 --avg_num 1
```
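As an illustration of why an empty `gpus=` means CPU only, a script could count the comma-separated device ids as sketched below. The `count_gpus` helper is hypothetical and is not claimed to be what `run.sh` does internally.

```bash
# Hypothetical helper: count the comma-separated device ids in a gpus string.
count_gpus() {
    printf '%s\n' "$1" | awk -F ',' '{ print NF }'
}

count_gpus "0,1"   # two device ids
count_gpus ""      # empty string: zero devices, i.e. CPU only
```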
## Stage 0: Data processing
To use this example, you need to process data firstly and you can use stage 0 in the ```run.sh``` to do this. The code is shown below:
To use this example, you need to process the data first. You can use stage 0 in `run.sh` to do this. The code is shown below:
```bash
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
# prepare data
bash ./local/data.sh || exit -1
fi
```
Stage 0 is for processing the data.
If you only want to process the data, you can run:
```bash
bash run.sh --stage 0 --stop_stage 0
```
You can also just run these scripts in your command line.
```bash
source path.sh
bash ./local/data.sh
```
After processing the data, the ``data`` directory will look like this:
After processing the data, the `data` directory will look like this:
```bash
data/
|-- dev.meta
@ -120,19 +88,14 @@ data/
|-- test.meta
`-- train.meta
```
## Stage 1: Model training
If you want to train the model. you can use stage 1 in the ```run.sh```. The code is shown below.
If you want to train the model, you can use stage 1 in `run.sh`. The code is shown below:
```bash
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
After training the model, we need to get the final model for testing and inference. In every epoch, the model checkpoint is saved, so we can choose the best model from them based on the validation loss or we can sort them and average the parameters of the top-k models to get the final model. We can use stage 2 to do this, and the code is shown below:
After training the model, we need to get the final model for testing and inference. In every epoch, the model checkpoint is saved, so we can choose the best model from them based on the validation loss or we can sort them and average the parameters of the top-k models to get the final model. We can use stage 2 to do this, and the code is shown below:
```bash
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
# avg n best model
avg.sh best exp/${ckpt}/checkpoints ${avg_num}
fi
```
The ```avg.sh``` is in the ```../../../utils/``` which is define in the ```path.sh```.
`avg.sh` is located in `../../../utils/`, which is defined in `path.sh`.
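The "choose the top-k models by validation loss" step can be sketched on its own. The record format (`<checkpoint name> <validation loss>` per line) and the `select_top_k` helper are assumptions for illustration; the real `avg.sh` works on the checkpoint directory and is not reproduced here.

```bash
# Hypothetical input: one "<checkpoint name> <validation loss>" record per line.
# Prints the names of the K records with the lowest loss, best first.
select_top_k() {
    LC_ALL=C sort -k2,2n | head -n "$1" | cut -d ' ' -f 1
}

printf '%s\n' 'epoch_1 0.92' 'epoch_2 0.41' 'epoch_3 0.38' 'epoch_4 0.45' |
    select_top_k 2
```

With K = 1 the selection reduces to picking the single best checkpoint.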
If you want to get the final model, you can use the script below to execute stage 0, stage 1, and stage 2:
```bash
bash run.sh --stage 0 --stop_stage 2
```
Or you can run these scripts on the command line (CPU only).
You can train a model by yourself, then you need to prepare an audio file or use the audio demo above, please confirm the sample rate of the audio is 16K. You can get the result of audio demo by running the script below.
You can train a model by yourself. Then you need to prepare an audio file or use the audio demo above. Please confirm that the sample rate of the audio is 16 kHz. You can get the result of the audio demo by running the script below.
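One possible way to check the sample rate of a WAV file from the shell is to read the 4-byte field at offset 24 of its header. This sketch assumes a canonical 44-byte PCM header (the `fmt ` chunk directly after the RIFF header) and a little-endian machine; `wav_sample_rate` and `make_demo_wav` are hypothetical helpers, and the demo file contains a header only.

```bash
# Read the sample-rate field (bytes 24..27) as an unsigned 32-bit integer.
wav_sample_rate() {
    od -An -t u4 -j 24 -N 4 "$1" | tr -d ' \n'
}

# Write a header-only 16 kHz, mono, 16-bit PCM WAV file for demonstration.
make_demo_wav() {
    printf 'RIFF\044\000\000\000WAVEfmt \020\000\000\000\001\000\001\000\200\076\000\000\000\175\000\000\002\000\020\000data\000\000\000\000' > "$1"
}

demo_wav=$(mktemp)
make_demo_wav "$demo_wav"
echo "sample rate: $(wav_sample_rate "$demo_wav") Hz"
rm -f "$demo_wav"
```

Files with extra chunks before `fmt ` would need a real audio tool instead of a fixed offset.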
This example contains code used to train a Transformer or [Conformer](http://arxiv.org/abs/2008.03802) model with [Librispeech dataset](http://www.openslr.org/resources/12)
## Overview
All the scirpts you need are in ```run.sh```. There are several stages in ```run.sh```, and each stage has its function.
All the scripts you need are in `run.sh`. There are several stages in `run.sh`, and each stage has its function.
| 0 | Process data. It includes: <br> (1) Download the dataset <br> (2) Caculate the CMVN of the train dataset <br> (3) Get the vocabulary file <br> (4) Get the manifest files of the train, development and test dataset<br> (5) Get the sentencepiece model |
| 0 | Process data. It includes: <br> (1) Download the dataset <br> (2) Calculate the CMVN of the train dataset <br> (3) Get the vocabulary file <br> (4) Get the manifest files of the train, development and test dataset<br> (5) Get the sentencepiece model |
| 1 | Train the model |
| 2 | Get the final model by averaging the top-k models, set k = 1 means choose the best model |
| 2 | Get the final model by averaging the top-k models; setting k = 1 selects the best model |
| 3 | Test the final model performance |
| 4 | Get ctc alignment of test data using the final model |
| 5 | Infer the single audio file |
You can choose to run a range of stages by setting ```stage``` and ```stop_stage ```.
You can choose to run a range of stages by setting `stage` and `stop_stage`.
For example, if you want to execute the code in stage 2 and stage 3, you can run this script:
```bash
bash run.sh --stage 2 --stop_stage 3
```
Or you can set ```stage``` equal to ```stop-stage``` to only run one stage.
For example, if you only want to run ```stage 0```, you can use the script below:
Or you can set `stage` equal to `stop_stage` to run only one stage.
For example, if you only want to run `stage 0`, you can use the script below:
```bash
bash run.sh --stage 0 --stop_stage 0
```
The document below will describe the scripts in ```run.sh``` in detail.
The document below will describe the scripts in `run.sh` in detail.
## The Environment Variables
The path.sh contains the environment variables.
```bash
. ./path.sh
. ./cmd.sh
```
This script needs to be run firstly. And another script is also needed:
This script needs to be run first. And another script is also needed:
```bash
source ${MAIN_ROOT}/utils/parse_options.sh
```
It will support the way of using```--varibale value``` in the shell scripts.
It enables passing options in the form `--variable value` to the shell scripts.
## The Local Variables
Some local variables are set in `run.sh`.
`gpus` denotes the GPU number you want to use. If you set `gpus=`, it means you only use CPU.
`stage` denotes the number of the stage you want to start from in the experiments.
`stop_stage` denotes the number of the stage you want to end at in the experiments.
`conf_path` denotes the config path of the model.
`avg_num` denotes the number K of top-K models you want to average to get the final model.
`audio_file` denotes the file path of the single file you want to infer in stage 5.
`ckpt` denotes the checkpoint prefix of the model, e.g. "conformer"
Some local variables are set in ```run.sh```.
```gpus``` denotes the GPU number you want to use. If you set ```gpus=```, it means you only use CPU.
```stage``` denotes the number of stage you want to start from in the expriments.
```stop stage```denotes the number of stage you want to end at in the expriments.
```conf_path``` denotes the config path of the model.
```avg_num``` denotes the number K of top-K models you want to average to get the final model.
```audio file``` denotes the file path of the single file you want to infer in stage 5
```ckpt``` denotes the checkpoint prefix of the model, e.g. "conformer"
You can set the local variables (except ```ckpt```) when you use ```run.sh```
For example, you can set the ```gpus``` and ``avg_num`` when you use the command line.:
You can set the local variables (except `ckpt`) when you use `run.sh`.
For example, you can set the `gpus` and `avg_num` when you use the command line:
```bash
bash run.sh --gpus 0,1 --avg_num 20
```
## Stage 0: Data Processing
To use this example, you need to process data firstly and you can use stage 0 in ```run.sh``` to do this. The code is shown below:
To use this example, you need to process the data first. You can use stage 0 in `run.sh` to do this. The code is shown below:
```bash
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
# prepare data
bash ./local/data.sh || exit -1
fi
```
Stage 0 is for processing the data.
If you only want to process the data, you can run:
```bash
bash run.sh --stage 0 --stop_stage 0
```
You can also just run these scripts in your command line.
```bash
. ./path.sh
. ./cmd.sh
bash ./local/data.sh
```
After processing the data, the ``data`` directory will look like this:
After processing the data, the `data` directory will look like this:
```bash
data/
|-- dev.meta
@ -126,55 +88,38 @@ data/
|-- test.meta
`-- train.meta
```
## Stage 1: Model Training
If you want to train the model. you can use stage 1 in ```run.sh```. The code is shown below.
If you want to train the model, you can use stage 1 in `run.sh`. The code is shown below:
```bash
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
After training the model, we need to get the final model for testing and inference. In every epoch, the model checkpoint is saved, so we can choose the best model from them based on the validation loss or we can sort them and average the parameters of the top-k models to get the final model. We can use stage 2 to do this, and the code is shown below:
```bash
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
# avg n best model
avg.sh best exp/${ckpt}/checkpoints ${avg_num}
fi
```
The ```avg.sh``` is in the ```../../../utils/``` which is define in the ```path.sh```.
`avg.sh` is located in `../../../utils/`, which is defined in `path.sh`.
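The arithmetic behind "average the parameters of the top-k models" is an element-wise mean. The toy sketch below uses plain-text files with one value per line as stand-ins for checkpoints, which is an assumption for illustration; real checkpoints are binary parameter files, and `average_params` is a hypothetical helper rather than part of `avg.sh`.

```bash
# Element-wise mean over files that each hold one parameter value per line.
average_params() {
    paste "$@" |
        LC_ALL=C awk '{ s = 0; for (i = 1; i <= NF; i++) s += $i; printf "%.2f\n", s / NF }'
}

ckpt_a=$(mktemp)
ckpt_b=$(mktemp)
printf '%s\n' 1.0 2.0 > "$ckpt_a"   # toy "checkpoint" with two parameters
printf '%s\n' 3.0 4.0 > "$ckpt_b"
average_params "$ckpt_a" "$ckpt_b"
rm -f "$ckpt_a" "$ckpt_b"
```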
If you want to get the final model, you can use the script below to execute stage 0, stage 1, and stage 2:
```bash
bash run.sh --stage 0 --stop_stage 2
```
Or you can run these scripts on the command line (CPU only).
You can train the model yourself using `bash run.sh --stage 0 --stop_stage 3`, or you can download the pretrained model through the script below:
You need to prepare an audio file or use the audio demo above, please confirm the sample rate of the audio is 16K. You can get the result of audio demo by running the script below.
You need to prepare an audio file or use the audio demo above. Please confirm that the sample rate of the audio is 16 kHz. You can get the result of the audio demo by running the script below.
This example contains code used to train a Transformer or [Conformer](http://arxiv.org/abs/2008.03802) model with [Librispeech dataset](http://www.openslr.org/resources/12) and use some functions in kaldi.
This example contains code used to train a Transformer or [Conformer](http://arxiv.org/abs/2008.03802) model with the [Librispeech dataset](http://www.openslr.org/resources/12) and uses some functions from Kaldi.
To use this example, you need to install Kaldi at first.
To use this example, you need to install Kaldi first.
## Overview
All the scirpts you need are in ```run.sh```. There are several stages in ```run.sh```, and each stage has its function.
All the scripts you need are in `run.sh`. There are several stages in `run.sh`, and each stage has its own function.
| 0 | Process data. It includes: <br> (1) Download the dataset <br> (2) Caculate the CMVN of the train dataset <br> (3) Get the vocabulary file <br> (4) Get the manifest files of the train, development and test dataset<br> (5) Get the sentencepiece model |
| 0 | Process data. It includes: <br> (1) Download the dataset <br> (2) Calculate the CMVN of the train dataset <br> (3) Get the vocabulary file <br> (4) Get the manifest files of the train, development and test dataset<br> (5) Get the sentencepiece model |
| 1 | Train the model |
| 2 | Get the final model by averaging the top-k models, set k = 1 means choose the best model |
| 2 | Get the final model by averaging the top-k models; setting k = 1 selects the best model |
| 3 | Test the final model performance |
| 4 | Join ctc decoder and use transformer language model to score |
| 5 | Get ctc alignment of test data using the final model |
| 6 | Caculate the perplexity of transformer language model |
| 6 | Calculate the perplexity of the transformer language model |
You can choose to run a range of stages by setting ```stage``` and ```stop_stage ```.
You can choose to run a range of stages by setting `stage` and `stop_stage`.
For example, if you want to execute the code in stage 2 and stage 3, you can run this script:
```bash
bash run.sh --stage 2 --stop_stage 3
```
Or you can set ```stage``` equal to ```stop-stage``` to only run one stage.
For example, if you only want to run ```stage 0```, you can use the script below:
Or you can set `stage` equal to `stop_stage` to run only one stage.
For example, if you only want to run `stage 0`, you can use the script below:
```bash
bash run.sh --stage 0 --stop_stage 0
```
The document below will describe the scripts in ```run.sh``` in detail.
The document below will describe the scripts in `run.sh` in detail.
## The Environment Variables
The path.sh contains the environment variables.
```bash
. ./path.sh
. ./cmd.sh
```
This script needs to be run firstly. And another script is also needed:
This script needs to be run first. And another script is also needed:
```bash
source ${MAIN_ROOT}/utils/parse_options.sh
```
It will support the way of using```--varibale value``` in the shell scripts.
It enables passing options in the form `--variable value` to the shell scripts.
## The Local Variables
Some local variables are set in ```run.sh```.
```gpus``` denotes the GPU number you want to use. If you set ```gpus=```, it means you only use CPU.
```stage``` denotes the number of stage you want to start from in the expriments.
```stop stage```denotes the number of stage you want to end at in the expriments.
```conf_path``` denotes the config path of the model.
`dict_path` denotes the path of vocabulary file.
```avg_num``` denotes the number K of top-K models you want to average to get the final model.
```ckpt``` denotes the checkpoint prefix of the model, e.g. "transformer"
You can set the local variables (except ```ckpt```) when you use ```run.sh```
For example, you can set the ```gpus``` and ``avg_num`` when you use the command line.:
Some local variables are set in `run.sh`.
`gpus` denotes the GPU number you want to use. If you set `gpus=`, it means you only use CPU.
`stage` denotes the number of the stage you want to start from in the experiments.
`stop_stage` denotes the number of the stage you want to end at in the experiments.
`conf_path` denotes the config path of the model.
`dict_path` denotes the path of the vocabulary file.
`avg_num` denotes the number K of top-K models you want to average to get the final model.
`ckpt` denotes the checkpoint prefix of the model, e.g. "transformer"
You can set the local variables (except `ckpt`) when you use `run.sh`.
For example, you can set `gpus` and `avg_num` on the command line:
```bash
bash run.sh --gpus 0,1 --avg_num 10
```
## Stage 0: Data Processing
To use this example, you need to process data firstly and you can use stage 0 in ```run.sh``` to do this. The code is shown below:
To use this example, you need to process the data first. You can use stage 0 in `run.sh` to do this. The code is shown below:
```bash
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
@ -93,7 +67,6 @@ To use this example, you need to process data firstly and you can use stage 0 in
bash ./local/data.sh || exit -1
fi
```
Stage 0 is for processing the data.
If you only want to process the data, you can run:
@ -156,56 +129,39 @@ data/
└── train_sp_org
```
## Stage 1: Model Training
If you want to train the model, you can use stage 1 in `run.sh`. The code is shown below:
```bash
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
After training the model, we need to get the final model for testing and inference. In every epoch, the model checkpoint is saved, so we can choose the last K models and average the parameters of the models to get the final model. We can use stage 2 to do this, and the code is shown below:
After training the model, we need to get the final model for testing and inference. In every epoch, the model checkpoint is saved, so we can choose the last K models and average the parameters of the models to get the final model. We can use stage 2 to do this, and the code is shown below:
```bash
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
# avg n best model
avg.sh latest exp/${ckpt}/checkpoints ${avg_num}
fi
```
The ```avg.sh``` is in the ```../../../utils/``` which is define in the ```path.sh```.
`avg.sh` is located in `../../../utils/`, which is defined in `path.sh`.
If you want to get the final model, you can use the script below to execute stage 0, stage 1, and stage 2:
```bash
bash run.sh --stage 0 --stop_stage 2
```
Or you can run these scripts on the command line (CPU only).
Compare with [ESPNET](https://github.com/espnet/espnet/blob/master/egs/librispeech/asr1/RESULTS.md#pytorch-large-transformer-with-specaug-4-gpus--transformer-lm-4-gpus) we using 8gpu, but model size (aheads4-adim256) small than it.
Compared with [ESPNET](https://github.com/espnet/espnet/blob/master/egs/librispeech/asr1/RESULTS.md#pytorch-large-transformer-with-specaug-4-gpus--transformer-lm-4-gpus), we use 8 GPUs, but our model size (aheads4-adim256) is smaller.
## Stage 5: CTC Alignment
If you want to get the alignment between the audio and the text, you can use the ctc alignment. The code of this stage is shown below:
```bash
if [ ${stage} -le 5 ] && [ ${stop_stage} -ge 5 ]; then
If you want to train the model, test it and do the alignment, you can use the script below to execute stage 0, stage 1, stage 2, stage 3, stage 4 and stage 5:
If you want to train the model, test it and do the alignment, you can use the script below to execute stage 0, stage 1, stage 2, stage 3, stage 4, and stage 5:
```bash
bash run.sh --stage 0 --stop_stage 5
```
Or, if you only need to train a model and do the alignment, you can use these scripts to skip stage 3 (the test stage):
```bash
bash run.sh --stage 0 --stop_stage 2
bash run.sh --stage 5 --stop_stage 5
```
Or you can also run these scripts on the command line (CPU only).
For g2p, we use BZNSYP's phone label as the ground truth and we delete silence tokens in labels and predicted phones.
You should Download BZNSYP from it's [Official Website](https://test.data-baker.com/data/index/source) and extract it. Assume the path to the dataset is `~/datasets/BZNSYP`.
You should download BZNSYP from its [Official Website](https://test.data-baker.com/data/index/source) and extract it. Assume the path to the dataset is `~/datasets/BZNSYP`.
We use `WER` as evaluation criterion.
We use `WER` as an evaluation criterion.
# Start
Run the command below to get the results of test.
Run the command below to get the results of the test.
This experiment trains a speaker encoder with speaker verification as its task. It is done as a part of the experiment of transfer learning from speaker verification to multispeaker text-to-speech synthesis, which can be found at [examples/aishell3/vc0](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/vc0). The trained speaker encoder is used to extract utterance embeddings from utterances.
This experiment trains a speaker encoder with speaker verification as its task. It is done as a part of the experiment of transfer learning from speaker verification to multispeaker text-to-speech synthesis, which can be found at [examples/aishell3/vc0](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/vc0). The trained speaker encoder is used to extract utterance embeddings from utterances.
## Model
The model used in this experiment is the speaker encoder with textindependent speaker verification task in [GENERALIZED END-TO-END LOSS FOR SPEAKER VERIFICATION](https://arxiv.org/pdf/1710.10467.pdf). GE2E-softmax loss is used.
The model used in this experiment is the speaker encoder with text-independent speaker verification task in [GENERALIZED END-TO-END LOSS FOR SPEAKER VERIFICATION](https://arxiv.org/pdf/1710.10467.pdf). GE2E-softmax loss is used.
## Download Datasets
Currently supported datasets are Librispeech-other-500, VoxCeleb, VoxCeleb2,ai-datatang-200zh, magicdata, which can be downloaded from corresponding webpage.
Currently supported datasets are Librispeech-other-500, VoxCeleb, VoxCeleb2, ai-datatang-200zh, and magicdata, which can be downloaded from the corresponding webpages.
1. Librispeech/train-other-500
An English multispeaker dataset, [URL](https://www.openslr.org/resources/12/train-other-500.tar.gz); only the `train-other-500` subset is used.
2. VoxCeleb1
An English multispeaker dataset,[URL](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox1.html), Audio Files from Dev A to Dev D should be downloaded, combined and extracted.
An English multispeaker dataset, [URL](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox1.html). Audio files from Dev A to Dev D should be downloaded, combined, and extracted.
3. VoxCeleb2
An English multispeaker dataset,[URL](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox1.html), Audio Files from Dev A to Dev H should be downloaded, combined and extracted.
An English multispeaker dataset, [URL](https://www.robots.ox.ac.uk/~vgg/data/voxceleb/vox1.html). Audio files from Dev A to Dev H should be downloaded, combined, and extracted.
4. Aidatatang-200zh
A Mandarin Chinese multispeaker dataset ,[URL](https://www.openslr.org/62/).
A Mandarin Chinese multispeaker dataset, [URL](https://www.openslr.org/62/).
5. magicdata
A Mandarin Chinese multispeaker dataset ,[URL](https://www.openslr.org/68/).
A Mandarin Chinese multispeaker dataset, [URL](https://www.openslr.org/68/).
If you want to use other datasets, you can also download and preprocess it as long as it meets the requirements described below.
If you want to use other datasets, you can also download and preprocess them as long as they meet the requirements described below.
## Get Started
```bash
./run.sh
```
You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage, for example, run the following command will only preprocess the dataset.
You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to run only one stage. For example, the following command only preprocesses the dataset.
```bash
./run.sh --stage 0 --stop-stage 0
```
@ -33,13 +33,13 @@ You can choose a range of stages you want to run, or set `stage` equal to `stop-
Assume datasets_root is `~/datasets/GE2E`, and it has the follow structure(We only use `train-other-500` for simplicity):
Assume `datasets_root` is `~/datasets/GE2E` and has the following structure (we only use `train-other-500` for simplicity):
```Text
GE2E
├── LibriSpeech
└── (other datasets)
```
Multispeaker datasets are used as training data, though the transcriptions are not used. To enlarge the amount of data used for training, several multispeaker datasets are combined. The preporcessed datasets are organized in a file structure described below. The mel spectrogram of each utterance is save in `.npy` format. The dataset is 2-stratified (speaker-utterance). Since multiple datasets are combined, to avoid conflict in speaker id, dataset name is prepended to the speake ids.
Multispeaker datasets are used as training data, though the transcriptions are not used. To enlarge the amount of data used for training, several multispeaker datasets are combined. The preprocessed datasets are organized in a file structure described below. The mel spectrogram of each utterance is saved in `.npy` format. The dataset is 2-stratified (speaker-utterance). Since multiple datasets are combined, to avoid conflict in speaker id, the dataset name is prepended to the speaker ids.
```text
dataset_root
@ -63,7 +63,7 @@ dataset_root
In `${BIN_DIR}/preprocess.py`:
1. `--datasets_root` is the directory that contains the extracted datasets.
2. `--output_dir` is the directory to save the preprocessed dataset.
3. `--dataset_names` is the dataset to preprocess. If there are multiple datasets in `--datasets_root` to preprocess, the names can be joined with comma. Currently supported dataset names are librispeech_other, voxceleb1, voxceleb2, aidatatang_200zh and magicdata.
3. `--dataset_names` is the dataset to preprocess. If there are multiple datasets in `--datasets_root` to preprocess, their names can be joined with commas. Currently supported dataset names are librispeech_other, voxceleb1, voxceleb2, aidatatang_200zh, and magicdata.
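Before launching preprocessing, a comma-joined `--dataset_names` value could be checked against the supported names listed above. The `check_dataset_names` helper below is a hypothetical sketch and is not part of `preprocess.py`.

```bash
# Supported names, as listed for --dataset_names.
SUPPORTED="librispeech_other voxceleb1 voxceleb2 aidatatang_200zh magicdata"

# Hypothetical helper: succeed only if every comma-separated name is supported.
check_dataset_names() {
    old_ifs=$IFS
    IFS=,
    set -- $1          # split the argument on commas
    IFS=$old_ifs
    for name in "$@"; do
        case " $SUPPORTED " in
            *" $name "*) ;;
            *) echo "unsupported dataset: $name" >&2; return 1 ;;
        esac
    done
}

check_dataset_names "librispeech_other,voxceleb1" && echo "all names supported"
```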
1. `--data` is the path to the preprocessed dataset.
2. `--output` is the directory to save results,usually a subdirectory of `runs`.It contains visualdl log files, text log files, config file and a `checkpoints` directory, which contains parameter file and optimizer state file. If `--output` already has some training results in it, the most recent parameter file and optimizer state file is loaded before training.
2. `--output` is the directory to save results, usually a subdirectory of `runs`. It contains visualdl log files, text log files, config files, and a `checkpoints` directory, which contains parameter files and optimizer state files. If `--output` already has some training results in it, the most recent parameter file and optimizer state file are loaded before training.
4. `--ngpu` is the number of GPUs to use; if `ngpu == 0`, the CPU is used.
5. `CUDA_VISIBLE_DEVICES` can be used to specify the devices visible to CUDA.
Other options are described below.
- `--config` is a `.yaml` config file used to override the default config (which is coded in `config.py`).
- `--opts` is command line options to further override config files. It should be the last comman line options passed with multiple key-value pairs separated by spaces.
- `--checkpoint_path` specifies the checkpoiont to load before training, extension is not included. A parameter file ( `.pdparams`) and an optimizer state file ( `.pdopt`) with the same name is used. This option has a higher priority than auto-resuming from the `--output` directory.
- `--opts` specifies command-line options that further override the config files. It should be the last command-line option passed, followed by multiple key-value pairs separated by spaces.
- `--checkpoint_path` specifies the checkpoint to load before training; the extension is not included. A parameter file (`.pdparams`) and an optimizer state file (`.pdopt`) with the same name are used. This option has a higher priority than auto-resuming from the `--output` directory.
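Since `--checkpoint_path` is given without an extension and expects both a `.pdparams` and a `.pdopt` file with the same name, a quick pre-flight check could look like the sketch below. The `checkpoint_complete` helper and the `step-1000` file name are hypothetical.

```bash
# Hypothetical helper: succeed only if both files for a checkpoint prefix exist.
checkpoint_complete() {
    [ -f "$1.pdparams" ] && [ -f "$1.pdopt" ]
}

demo_dir=$(mktemp -d)
touch "$demo_dir/step-1000.pdparams" "$demo_dir/step-1000.pdopt"   # empty placeholders
if checkpoint_complete "$demo_dir/step-1000"; then
    echo "checkpoint is complete"
fi
rm -rf "$demo_dir"
```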
### Inferencing
When training is done, run the command below to generate utterance embedding for each utterance in a dataset.
1. `--input` is the path of the dataset used for inference.
2. `--output` is the directory to save the processed results. It has the same file structure as the input dataset. Each utterance in the dataset has a corrsponding utterance embedding file in `*.npy` format.
2. `--output` is the directory to save the processed results. It has the same file structure as the input dataset. Each utterance in the dataset has a corresponding utterance embedding file in `*.npy` format.
3. `--checkpoint_path` is the path of the checkpoint to use, extension not included.
4. `--pattern` is the wildcard pattern to filter audio files for inference, defaults to `*.wav`.
5. `--ngpu` is the number of GPUs to use; if `ngpu == 0`, the CPU is used.
| 0 | Process data. It includes: <br> (1) Caculate the CMVN of the train dataset <br> (2) Get the vocabulary file <br> (3) Get the manifest files of the train, development and test dataset<br> |
| 0 | Process data. It includes: <br> (1) Calculate the CMVN of the train dataset <br> (2) Get the vocabulary file <br> (3) Get the manifest files of the train, development and test dataset<br> |
| 1 | Train the model |
| 2 | Get the final model by averaging the top-k models, set k = 1 means choose the best model |
| 2 | Get the final model by averaging the top-k models; setting k = 1 selects the best model |
| 3 | Test the final model performance |
You can choose to run a range of stages by setting ```stage``` and ```stop_stage ```.
You can choose to run a range of stages by setting `stage` and `stop_stage`.
For example, if you want to execute the code in stage 2 and stage 3, you can run this script:
```bash
bash run.sh --stage 2 --stop_stage 3
```
Or you can set ```stage``` equal to ```stop-stage``` to only run one stage.
Or you can set `stage` equal to `stop_stage` to run only one stage.
For example, if you only want to run ```stage 0```, you can use the script below:
```bash
bash run.sh --stage 0 --stop_stage 0
```
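The way `stage` and `stop_stage` select a range can be sketched as follows. The test inside the loop is the same one that guards each stage block in `run.sh`; the `stages_to_run` helper itself is hypothetical and only prints which stage numbers would execute.

```bash
# Hypothetical helper: print the stage numbers that would run for a given
# stage/stop_stage pair, where last_stage is the highest stage in the script.
stages_to_run() {
    stage=$1
    stop_stage=$2
    last_stage=$3
    selected=""
    n=0
    while [ "${n}" -le "${last_stage}" ]; do
        # A stage runs only if it lies inside [stage, stop_stage].
        if [ "${stage}" -le "${n}" ] && [ "${stop_stage}" -ge "${n}" ]; then
            selected="${selected:+${selected} }${n}"
        fi
        n=$((n + 1))
    done
    echo "${selected}"
}

stages_to_run 2 3 3   # stages 2 and 3
stages_to_run 0 0 3   # stage 0 only
```

Because every stage is guarded independently, setting both values to the same number runs exactly one stage.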
The document below will describe the scripts in ```run.sh``` in detail.
The document below will describe the scripts in `run.sh` in detail.
## The Environment Variables
The path.sh contains the environment variables.
```bash
source path.sh
```
This script needs to be run firstly. And another script is also needed:
This script needs to be run first. And another script is also needed:
```bash
source ${MAIN_ROOT}/utils/parse_options.sh
```
It will support the way of using```--varibale value``` in the shell scripts.
It lets you pass options in the form `--variable value` to the shell scripts.
## The Local Variables
Some local variables are set in `run.sh`.
`gpus` denotes the GPUs you want to use, as comma-separated IDs (e.g. `0,1`). If you set `gpus=`, only the CPU is used.
`stage` denotes the number of the stage you want to start from in the experiments.
`stop_stage` denotes the number of the stage you want to end at in the experiments.
`conf_path` denotes the config path of the model.
`data_path` denotes the path of the dataset.
`avg_num` denotes the number K of top-K models you want to average to get the final model.
`ckpt` denotes the checkpoint prefix of the model, e.g. "transformer_mtl_noam".
Some local variables are set in ```run.sh```.
```gpus``` denotes the GPU number you want to use. If you set ```gpus=```, it means you only use CPU.
```stage``` denotes the number of stage you want to start from in the expriments.
```stop stage```denotes the number of stage you want to end at in the expriments.
```conf_path``` denotes the config path of the model.
`data_path` denotes the path of the dataset..
```avg_num``` denotes the number K of top-K models you want to average to get the final model.
```ckpt``` denotes the checkpoint prefix of the model, e.g. "transformer_mtl_noam"
You can set the local variables (except ```ckpt```) when you use ```run.sh```
For example, you can set the ```gpus``` and ``avg_num`` when you use the command line.:
You can set the local variables (except `ckpt`) when you use `run.sh`.
For example, you can set `gpus` and `avg_num` on the command line:
```bash
bash run.sh --gpus 0,1 --avg_num 5
```
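How a value such as `0,1` or an empty `gpus=` could be turned into a device count is sketched below. `count_gpus` is a hypothetical helper written for this illustration; the actual scripts may derive the count differently.

```bash
# Hypothetical helper: count the comma-separated device IDs in $1.
# An empty value means "no GPU", i.e. run on the CPU.
count_gpus() {
    if [ -z "$1" ]; then
        echo 0
        return
    fi
    # NF is the number of comma-separated fields on the line.
    echo "$1" | awk -F ',' '{print NF}'
}

gpus="0,1"
ngpu=$(count_gpus "${gpus}")
if [ "${ngpu}" -eq 0 ]; then
    echo "running on CPU"
else
    echo "running on ${ngpu} GPU(s)"
fi
```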
## Stage 0: Data Processing
To use this example, you need to process data firstly and you can use stage 0 in ```run.sh``` to do this. The code is shown below:
To use this example, you need to process the data first; stage 0 in `run.sh` does this. The code is shown below:
```bash
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
# prepare data
bash ./local/data.sh || exit -1
fi
```
Stage 0 is for processing the data.
If you only want to process the data, you can run:
```bash
bash run.sh --stage 0 --stop_stage 0
```
You can also just run these scripts in your command line.
```bash
source path.sh
bash ./local/data.sh
```
## Stage 1: Model Training
If you want to train the model, you can use stage 1 in `run.sh`. The code is shown below.
```bash
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
After training the model, we need to get the final model for testing and inference. In every epoch, the model checkpoint is saved, so we can choose the best model from them based on the validation loss or we can sort them and average the parameters of the top-k models to get the final model. We can use stage 2 to do this, and the code is shown below:
After training the model, we need to get the final model for testing and inference. In every epoch, the model checkpoint is saved, so we can choose the best model from them based on the validation loss or we can sort them and average the parameters of the top-k models to get the final model. We can use stage 2 to do this, and the code is shown below:
```bash
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
# avg n best model
avg.sh best exp/${ckpt}/checkpoints ${avg_num}
fi
```
The ```avg.sh``` is in the ```../../../utils/``` which is define in the ```path.sh```.
`avg.sh` is in `../../../utils/`, which is defined in `path.sh`.
If you want to get the final model, you can use the script below to execute stage 0, stage 1, and stage 2:
```bash
bash run.sh --stage 0 --stop_stage 2
```
or you can run these scripts in the command line (only use CPU).
| 0 | Process data. It includes: <br> (1) Caculate the CMVN of the train dataset <br> (2) Get the vocabulary file <br> (3) Get the manifest files of the train, development and test dataset<br> |
| 0 | Process data. It includes: <br> (1) Calculate the CMVN of the train dataset <br> (2) Get the vocabulary file <br> (3) Get the manifest files of the train, development and test dataset<br> |
| 1 | Train the model |
| 2 | Get the final model by averaging the top-k models, set k = 1 means choose the best model |
| 2 | Get the final model by averaging the top-k models; setting k = 1 selects the best model |
| 3 | Test the final model performance |
You can choose to run a range of stages by setting ```stage``` and ```stop_stage ```.
You can choose to run a range of stages by setting `stage` and `stop_stage`.
For example, if you want to execute the code in stage 2 and stage 3, you can run this script:
```bash
bash run.sh --stage 2 --stop_stage 3
```
Or you can set ```stage``` equal to ```stop-stage``` to only run one stage.
For example, if you only want to run ```stage 0```, you can use the script below:
Or you can set `stage` equal to `stop_stage` to run only one stage.
For example, if you only want to run `stage 0`, you can use the script below:
```bash
bash run.sh --stage 0 --stop_stage 0
```
The document below will describe the scripts in ```run.sh``` in detail.
The document below will describe the scripts in `run.sh` in detail.
## The Environment Variables
The path.sh contains the environment variables.
```bash
. ./path.sh
. ./cmd.sh
```
This script needs to be run firstly. And another script is also needed:
This script needs to be run first. And another script is also needed:
```bash
source ${MAIN_ROOT}/utils/parse_options.sh
```
It will support the way of using```--varibale value``` in the shell scripts.
It lets you pass options in the form `--variable value` to the shell scripts.
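A minimal sketch of this kind of `--variable value` parsing is shown below. It is not the code of `parse_options.sh`; `parse_opts` is a hypothetical, simplified stand-in that overrides shell variables with the values given on the command line.

```bash
# Defaults that the caller may override from the command line.
stage=0
stop_stage=100
gpus=0

# Hypothetical, simplified parser: "--name value" sets the variable "name".
parse_opts() {
    while [ $# -ge 2 ]; do
        case "$1" in
            --*)
                name=${1#--}   # strip the leading "--"
                # Accept only plain variable names before using eval.
                case "${name}" in
                    ''|[0-9]*|*[!A-Za-z0-9_]*)
                        echo "invalid option: $1" >&2
                        return 1
                        ;;
                esac
                eval "${name}=\"\$2\""
                shift 2
                ;;
            *)
                break   # first non-option argument ends parsing
                ;;
        esac
    done
}

parse_opts --stage 2 --stop_stage 3 --gpus 0,1
echo "stage=${stage} stop_stage=${stop_stage} gpus=${gpus}"
```

Variables that are not mentioned on the command line keep their defaults, which is why the defaults are declared before the parser runs.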
## The Local Variables
Some local variables are set in `run.sh`.
`gpus` denotes the GPUs you want to use, as comma-separated IDs (e.g. `0,1`). If you set `gpus=`, only the CPU is used.
`stage` denotes the number of the stage you want to start from in the experiments.
`stop_stage` denotes the number of the stage you want to end at in the experiments.
`conf_path` denotes the config path of the model.
`data_path` denotes the path of the dataset.
`avg_num` denotes the number K of top-K models you want to average to get the final model.
`ckpt` denotes the checkpoint prefix of the model, e.g. "transformer_mtl_noam".
Some local variables are set in ```run.sh```.
```gpus``` denotes the GPU number you want to use. If you set ```gpus=```, it means you only use CPU.
```stage``` denotes the number of stage you want to start from in the expriments.
```stop stage```denotes the number of stage you want to end at in the expriments.
```conf_path``` denotes the config path of the model.
`data_path` denotes the path of the dataset..
```avg_num``` denotes the number K of top-K models you want to average to get the final model.
```ckpt``` denotes the checkpoint prefix of the model, e.g. "transformer_mtl_noam"
You can set the local variables (except ```ckpt```) when you use ```run.sh```
For example, you can set the ```gpus``` and ``avg_num`` when you use the command line.:
You can set the local variables (except `ckpt`) when you use `run.sh`.
For example, you can set `gpus` and `avg_num` on the command line:
```bash
bash run.sh --gpus 0,1 --avg_num 5
```
## Stage 0: Data Processing
To use this example, you need to process data firstly and you can use stage 0 in ```run.sh``` to do this. The code is shown below:
To use this example, you need to process the data first; stage 0 in `run.sh` does this. The code is shown below:
```bash
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
# prepare data
bash ./local/data.sh || exit -1
fi
```
Stage 0 is for processing the data.
If you only want to process the data, you can run:
```bash
bash run.sh --stage 0 --stop_stage 0
```
You can also just run these scripts in your command line.
```bash
. ./path.sh
. ./cmd.sh
bash ./local/data.sh
```
## Stage 1: Model Training
If you want to train the model, you can use stage 1 in `run.sh`. The code is shown below.
```bash
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
# train model, all `ckpt` under `exp` dir
@ -127,44 +86,31 @@ if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
After training the model, we need to get the final model for testing and inference. In every epoch, the model checkpoint is saved, so we can choose the best model from them based on the validation loss or we can sort them and average the parameters of the top-k models to get the final model. We can use stage 2 to do this, and the code is shown below:
After training the model, we need to get the final model for testing and inference. In every epoch, the model checkpoint is saved, so we can choose the best model from them based on the validation loss or we can sort them and average the parameters of the top-k models to get the final model. We can use stage 2 to do this, and the code is shown below:
```bash
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
# avg n best model
avg.sh best exp/${ckpt}/checkpoints ${avg_num}
fi
```
The ```avg.sh``` is in the ```../../../utils/``` which is define in the ```path.sh```.
`avg.sh` is in `../../../utils/`, which is defined in `path.sh`.
If you want to get the final model, you can use the script below to execute stage 0, stage 1, and stage 2:
```bash
bash run.sh --stage 0 --stop_stage 2
```
or you can run these scripts in the command line (only use CPU).
This example contains code used to train a DeepSpeech2 offline or online model with the Tiny dataset (a part of the [Librispeech dataset](http://www.openslr.org/resources/12)).
## Overview
All the scirpts you need are in the ```run.sh```. There are several stages in the ```run.sh```, and each stage has its function.
All the scripts you need are in `run.sh`. There are several stages in `run.sh`, and each stage has its function.
| 0 | Process data. It includes: <br> (1) Download the dataset <br> (2) Caculate the CMVN of the train dataset <br> (3) Get the vocabulary file <br> (4) Get the manifest files of the train, development and test dataset |
| 0 | Process data. It includes: <br> (1) Download the dataset <br> (2) Calculate the CMVN of the train dataset <br> (3) Get the vocabulary file <br> (4) Get the manifest files of the train, development and test dataset |
| 1 | Train the model |
| 2 | Get the final model by averaging the top-k models, set k = 1 means choose the best model |
| 2 | Get the final model by averaging the top-k models; setting k = 1 selects the best model |
| 3 | Test the final model performance |
| 4 | Export the static graph model |
You can choose to run a range of stages by setting the ```stage``` and ```stop_stage ``` .
You can choose to run a range of stages by setting `stage` and `stop_stage`.
For example, if you want to execute the code in stage 2 and stage 3, you can run this script:
```bash
bash run.sh --stage 2 --stop_stage 3
```
Or you can set ```stage``` equal to ```stop-stage``` to only run one stage.
For example, if you only want to run ```stage 0```, you can use the script below:
Or you can set `stage` equal to `stop_stage` to run only one stage.
For example, if you only want to run `stage 0`, you can use the script below:
```bash
bash run.sh --stage 0 --stop_stage 0
```
The document below will describe the scripts in the ```run.sh``` in detail.
The document below will describe the scripts in the `run.sh` in detail.
## The environment variables
The path.sh contains the environment variable.
```bash
source path.sh
```
This script needs to be run firstly.
This script needs to be run first.
And another script is also needed:
```bash
source ${MAIN_ROOT}/utils/parse_options.sh
```
It will support the way of using```--varibale value``` in the shell scripts.
It lets you pass options in the form `--variable value` to the shell scripts.
## The local variables
Some local variables are set in `run.sh`.
`gpus` denotes the GPUs you want to use, as comma-separated IDs (e.g. `0,1`). If you set `gpus=`, only the CPU is used.
`stage` denotes the number of the stage you want to start from in the experiments.
`stop_stage` denotes the number of the stage you want to end at in the experiments.
`conf_path` denotes the config path of the model.
`avg_num` denotes the number K of top-K models you want to average to get the final model.
`model_type` denotes the model type: offline or online.
`ckpt` denotes the checkpoint prefix of the model, e.g. "deepspeech2".
Some local variables are set in the ```run.sh```.
```gpus``` denotes the GPU number you want to use. If you set ```gpus=```, it means you only use CPU.
```stage``` denotes the number of stage you want to start from in the expriments.
```stop stage```denotes the number of stage you want to end at in the expriments.
```conf_path``` denotes the config path of the model.
```avg_num``` denotes the number K of top-K models you want to average to get the final model.
```model_type```denotes the model type: offline or online
```ckpt``` denotes the checkpoint prefix of the model, e.g. "deepspeech2"
You can set the local variables (except ```ckpt```) when you use the ```run.sh```
For example, you can set the ```gpus``` and ``avg_num`` when you use the command line.:
You can set the local variables (except `ckpt`) when you use `run.sh`.
For example, you can set `gpus` and `avg_num` on the command line:
```bash
bash run.sh --gpus 0,1 --avg_num 20
```
## Stage 0: Data processing
To use this example, you need to process data firstly and you can use stage 0 in the ```run.sh``` to do this. The code is shown below:
To use this example, you need to process the data first; stage 0 in `run.sh` does this. The code is shown below:
```bash
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
# prepare data
bash ./local/data.sh || exit -1
fi
```
Stage 0 is for processing the data.
If you only want to process the data, you can run:
@ -98,16 +68,12 @@ If you only want to process the data. You can run
```bash
bash run.sh --stage 0 --stop_stage 0
```
You can also just run these scripts in your command line.
```bash
source path.sh
bash ./local/data.sh
```
After processing the data, the ``data`` directory will look like this:
After processing the data, the `data` directory will look like this:
```bash
data/
|-- dev.meta
@ -123,54 +89,37 @@ data/
|-- test.meta
`-- train.meta
```
## Stage 1: Model training
If you want to train the model, you can use stage 1 in `run.sh`. The code is shown below.
```bash
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
After training the model, we need to get the final model for testing and inference. In every epoch, the model checkpoint is saved, so we can choose the best model from them based on the validation loss or we can sort them and average the parameters of the top-k models to get the final model. We can use stage 2 to do this, and the code is shown below:
After training the model, we need to get the final model for testing and inference. In every epoch, the model checkpoint is saved, so we can choose the best model from them based on the validation loss or we can sort them and average the parameters of the top-k models to get the final model. We can use stage 2 to do this, and the code is shown below:
```bash
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
# avg n best model
avg.sh best exp/${ckpt}/checkpoints ${avg_num}
fi
```
The ```avg.sh``` is in the ```../../../utils/``` which is define in the ```path.sh```.
`avg.sh` is in `../../../utils/`, which is defined in `path.sh`.
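The selection step that precedes averaging, picking the k checkpoints with the lowest validation loss, can be sketched as follows. This is not what `avg.sh` contains: `select_top_k` is a hypothetical helper, and the input format (`<checkpoint name> <validation loss>` per line) is assumed for the illustration. The parameter averaging itself is not shown.

```bash
# Hypothetical helper: read "<name> <loss>" lines on stdin and print the
# names of the k entries with the lowest loss, best first.
select_top_k() {
    LC_ALL=C sort -k2,2n | head -n "$1" | awk '{print $1}'
}

# Made-up example values: four checkpoints and their validation losses.
printf '%s\n' "0 1.92" "1 1.31" "2 1.05" "3 1.12" | select_top_k 2
```

With `k = 1` the helper returns only the single best checkpoint, which matches the description of stage 2 choosing the best model in that case.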
If you want to get the final model, you can use the script below to execute stage 0, stage 1, and stage 2:
```bash
bash run.sh --stage 0 --stop_stage 2
```
or you can run these scripts in the command line (only use CPU).
This example contains code used to train a Transformer or [Conformer](http://arxiv.org/abs/2008.03802) model with the Tiny dataset (a part of the [Librispeech dataset](http://www.openslr.org/resources/12)).
## Overview
All the scirpts you need are in ```run.sh```. There are several stages in ```run.sh```, and each stage has its function.
All the scripts you need are in `run.sh`. There are several stages in `run.sh`, and each stage has its function.
| 0 | Process data. It includes: <br> (1) Download the dataset <br> (2) Caculate the CMVN of the train dataset <br> (3) Get the vocabulary file <br> (4) Get the manifest files of the train, development and test dataset<br> (5) Get the sentencepiece model |
| 0 | Process data. It includes: <br> (1) Download the dataset <br> (2) Calculate the CMVN of the train dataset <br> (3) Get the vocabulary file <br> (4) Get the manifest files of the train, development and test dataset<br> (5) Get the sentencepiece model |
| 1 | Train the model |
| 2 | Get the final model by averaging the top-k models, set k = 1 means choose the best model |
| 2 | Get the final model by averaging the top-k models; setting k = 1 selects the best model |
| 3 | Test the final model performance |
| 4 | Get ctc alignment of test data using the final model |
You can choose to run a range of stages by setting ```stage``` and ```stop_stage ```.
You can choose to run a range of stages by setting `stage` and `stop_stage`.
For example, if you want to execute the code in stage 2 and stage 3, you can run this script:
```bash
bash run.sh --stage 2 --stop_stage 3
```
Or you can set ```stage``` equal to ```stop-stage``` to only run one stage.
Or you can set `stage` equal to `stop_stage` to run only one stage.
For example, if you only want to run ```stage 0```, you can use the script below:
```bash
bash run.sh --stage 0 --stop_stage 0
```
The document below will describe the scripts in ```run.sh``` in detail.
The document below will describe the scripts in `run.sh` in detail.
## The Environment Variables
The path.sh contains the environment variables.
```bash
. ./path.sh
. ./cmd.sh
```
This script needs to be run firstly. And another script is also needed:
This script needs to be run first. And another script is also needed:
```bash
source ${MAIN_ROOT}/utils/parse_options.sh
```
It will support the way of using```--varibale value``` in the shell scripts.
It lets you pass options in the form `--variable value` to the shell scripts.
## The Local Variables
Some local variables are set in ```run.sh```.
```gpus``` denotes the GPU number you want to use. If you set ```gpus=```, it means you only use CPU.
```stage``` denotes the number of stage you want to start from in the expriments.
```stop stage```denotes the number of stage you want to end at in the expriments.
```conf_path``` denotes the config path of the model.
```avg_num``` denotes the number K of top-K models you want to average to get the final model.
```ckpt``` denotes the checkpoint prefix of the model, e.g. "transformerr"
You can set the local variables (except ```ckpt```) when you use ```run.sh```
For example, you can set the ```gpus``` and ``avg_num`` when you use the command line.:
Some local variables are set in `run.sh`.
`gpus` denotes the GPUs you want to use, as comma-separated IDs (e.g. `0,1`). If you set `gpus=`, only the CPU is used.
`stage` denotes the number of the stage you want to start from in the experiments.
`stop_stage` denotes the number of the stage you want to stop at in the experiments.
`conf_path` denotes the config path of the model.
`avg_num` denotes the number K of top-K models you want to average to get the final model.
`ckpt` denotes the checkpoint prefix of the model, e.g. "transformerr"
You can set the local variables (except `ckpt`) when you use `run.sh`.
For example, you can set `gpus` and `avg_num` on the command line:
```bash
bash run.sh --gpus 0,1 --avg_num 1
```
## Stage 0: Data Processing
To use this example, you need to process data firstly and you can use stage 0 in ```run.sh``` to do this. The code is shown below:
To use this example, you need to process the data first; stage 0 in `run.sh` does this. The code is shown below:
```bash
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
@ -82,25 +55,19 @@ To use this example, you need to process data firstly and you can use stage 0 in
bash ./local/data.sh || exit -1
fi
```
Stage 0 is for processing the data.
If you only want to process the data, you can run:
```bash
bash run.sh --stage 0 --stop_stage 0
```
You can also just run these scripts in your command line.
```bash
. ./path.sh
. ./cmd.sh
bash ./local/data.sh
```
After processing the data, the ``data`` directory will look like this:
```bash
data/
|-- dev.meta
@ -118,57 +85,38 @@ data/
|-- test.meta
`-- train.meta
```
## Stage 1: Model Training
If you want to train the model, you can use stage 1 in `run.sh`. The code is shown below.
```bash
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
After training the model, we need to get the final model for testing and inference. In every epoch, the model checkpoint is saved, so we can choose the best model from them based on the validation loss or we can sort them and average the parameters of the top-k models to get the final model. We can use stage 2 to do this, and the code is shown below:
```bash
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
# avg n best model
avg.sh best exp/${ckpt}/checkpoints ${avg_num}
fi
```
The ```avg.sh``` is in the ```../../../utils/``` which is define in the ```path.sh```.
`avg.sh` is in `../../../utils/`, which is defined in `path.sh`.
If you want to get the final model, you can use the script below to execute stage 0, stage 1, and stage 2:
```bash
bash run.sh --stage 0 --stop_stage 2
```
or you can run these scripts in the command line (only use CPU).