refactor docs

pull/941/head
TianYuan 3 years ago
parent e395462419
commit 3f9e30c9b3

@ -7,7 +7,7 @@ version: 2
# Build documentation in the docs/ directory with Sphinx
sphinx:
  configuration: docs/source/conf.py
# Build documentation with MkDocs
#mkdocs:

(binary image added: 4.9 KiB)

(binary image added: 108 KiB)

@ -0,0 +1,6 @@
myst-parser
recommonmark>=0.5.0
sphinx
sphinx-autobuild
sphinx-markdown-tables
sphinx_rtd_theme

@ -1,6 +1,5 @@
# Models introduction
## Streaming DeepSpeech2
The implemented architecture of the DeepSpeech2 online model is based on the [DeepSpeech2 model](https://arxiv.org/pdf/1512.02595.pdf) with some changes.
The model is mainly composed of a 2D convolution subsampling layer and stacked unidirectional RNN layers.
@ -14,8 +13,8 @@ In addition, the training process and the testing process are also introduced.
The architecture of the model is shown in Fig.1.
<p align="center">
<img src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/images/ds2onlineModel.png" width=800>
<br/>Fig.1 The architecture of the DeepSpeech2 online model
</p>
@ -23,13 +22,13 @@ The arcitecture of the model is shown in Fig.1.
#### Vocabulary
For English data, the vocabulary is composed of the 26 English characters, " ' ", space, \<blank\> and \<eos\>. The \<blank\> represents the blank label in CTC, the \<unk\> represents the unknown character and the \<eos\> represents the start and the end of a sentence. For Mandarin, the vocabulary is composed of the Chinese characters collected from the training set plus three additional characters: \<blank\>, \<unk\> and \<eos\>. For both English and Mandarin data, the default indices are \<blank\>=0, \<unk\>=1 and \<eos\>=last index.
```
# The code to build vocabulary
cd examples/aishell/s0
python3 ../../../utils/build_vocab.py \
    --unit_type="char" \
    --count_threshold=0 \
    --vocab_path="data/vocab.txt" \
    --manifest_paths "data/manifest.train.raw" "data/manifest.dev.raw"

# vocabulary for aishell dataset (Mandarin)
vi examples/aishell/s0/data/vocab.txt
@ -41,29 +40,29 @@ vi examples/librispeech/s0/data/vocab.txt
#### CMVN
For CMVN, a subset of or the full training set is chosen and used to compute the feature mean and std.
```
# The code to compute the feature mean and std
cd examples/aishell/s0
python3 ../../../utils/compute_mean_std.py \
    --manifest_path="data/manifest.train.raw" \
    --spectrum_type="linear" \
    --delta_delta=false \
    --stride_ms=10.0 \
    --window_ms=20.0 \
    --sample_rate=16000 \
    --use_dB_normalization=True \
    --num_samples=2000 \
    --num_workers=10 \
    --output_path="data/mean_std.json"
```
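The statistics are then used to normalize each utterance's features, which is conceptually a per-dimension z-score. A minimal NumPy sketch of the idea (not the project's actual implementation; the toy feature shapes below are made up):
```python
import numpy as np

def apply_cmvn(feats: np.ndarray, mean: np.ndarray, std: np.ndarray, eps: float = 1e-20) -> np.ndarray:
    """Apply cepstral mean and variance normalization to a (time, feat_dim) array."""
    return (feats - mean) / (std + eps)

# toy example: estimate mean/std from a subset of training features
train_subset = [np.random.randn(100, 161) for _ in range(10)]  # fake linear spectrogram features
stacked = np.concatenate(train_subset, axis=0)
mean, std = stacked.mean(axis=0), stacked.std(axis=0)
normalized = apply_cmvn(train_subset[0], mean, std)
```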
#### Feature Extraction
For feature extraction, three methods are implemented: linear (FFT magnitude without a filter bank), fbank and mfcc.
Currently, the released DeepSpeech2 online model uses the linear feature extraction method.
```
The code for feature extraction
vi deepspeech/frontend/featurizer/audio_featurizer.py
```
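For intuition, the linear feature is essentially the magnitude of a short-time Fourier transform without any filter bank. The rough NumPy sketch below uses the window/stride settings from the command above (20 ms window, 10 ms stride at 16 kHz); it is an illustration only, not the code in `audio_featurizer.py`:
```python
import numpy as np

def linear_spectrogram(samples: np.ndarray, sample_rate: int = 16000,
                       window_ms: float = 20.0, stride_ms: float = 10.0) -> np.ndarray:
    """Compute a linear (no filter bank) magnitude spectrogram of shape (frames, fft_bins)."""
    win = int(sample_rate * window_ms / 1000)
    hop = int(sample_rate * stride_ms / 1000)
    n_frames = 1 + (len(samples) - win) // hop
    window = np.hanning(win)
    frames = np.stack([samples[i * hop:i * hop + win] * window for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1))

audio = np.random.randn(16000)          # one second of fake audio
feats = linear_spectrogram(audio)       # ~99 frames x 161 frequency bins
```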
### Encoder
The encoder is composed of two 2D convolution subsampling layers and a number of stacked unidirectional RNN layers. The 2D convolution subsampling layers extract feature representations from the raw audio features and reduce the length of the audio features at the same time. After passing through the convolution subsampling layers, the feature representations are fed into the stacked RNN layers. For the stacked RNN layers, both LSTM and GRU cells are provided. Adding one fully connected (fc) layer after the stacked RNN layers is optional; if the number of stacked RNN layers is less than 5, adding one fc layer after them is recommended.
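As a rough illustration of this layout, a heavily simplified PaddlePaddle sketch is shown below; the layer sizes and shapes are invented for the example, and the real implementation lives in `deepspeech/models/ds2_online/deepspeech2.py`:
```python
import paddle
import paddle.nn as nn

class TinyDS2OnlineEncoder(nn.Layer):
    """Illustrative only: 2 conv subsampling blocks + stacked forward-direction RNN + optional fc."""
    def __init__(self, feat_dim=161, rnn_size=1024, num_rnn_layers=3, use_fc=True):
        super().__init__()
        # two 2D conv layers, each subsampling the time axis by 2
        self.conv = nn.Sequential(
            nn.Conv2D(1, 32, kernel_size=(3, 3), stride=(2, 2), padding=1), nn.ReLU(),
            nn.Conv2D(32, 32, kernel_size=(3, 3), stride=(2, 2), padding=1), nn.ReLU(),
        )
        conv_out_dim = 32 * ((feat_dim + 3) // 4)   # flattened feature size after two stride-2 convs
        self.rnn = nn.LSTM(conv_out_dim, rnn_size, num_layers=num_rnn_layers, direction="forward")
        self.fc = nn.Linear(rnn_size, rnn_size) if use_fc else None

    def forward(self, feats):                       # feats: (batch, time, feat_dim)
        x = feats.unsqueeze(1)                      # (batch, 1, time, feat_dim)
        x = self.conv(x)                            # (batch, 32, ~time/4, ~feat_dim/4)
        b, c, t, f = x.shape
        x = x.transpose([0, 2, 1, 3]).reshape([b, t, c * f])
        x, _ = self.rnn(x)
        return self.fc(x) if self.fc is not None else x

enc = TinyDS2OnlineEncoder()
out = enc(paddle.randn([2, 100, 161]))              # -> shape [2, 25, 1024]
```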
@ -84,11 +83,11 @@ vi deepspeech/models/ds2_online/deepspeech2.py
vi deepspeech/modules/ctc.py
```
### Training Process
Using the command below, you can train the DeepSpeech2 online model.
```
cd examples/aishell/s0
bash run.sh --stage 0 --stop_stage 2 --model_type online --conf_path conf/deepspeech2_online.yaml
```
The detailed commands are:
```
@ -127,11 +126,11 @@ fi
By using the command above, the training process can be started. There are 5 stages in "run.sh", and the first 3 are used for training. Stage 0 is for data preparation: the dataset is downloaded, and the manifest files, vocabulary dictionary and CMVN file are generated in "./data/". Stage 1 trains the model; the log files and model checkpoints are saved in "exp/deepspeech2_online/". Stage 2 generates the final model for prediction by averaging the top-k model parameters based on validation loss. A rough sketch of this averaging step is shown below.
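The averaging is conceptually just an element-wise mean over the parameters of the selected checkpoints; a minimal PaddlePaddle sketch of the idea (the state-dict keys and paths here are toy placeholders, not the files produced by `run.sh`):
```python
import paddle

def average_state_dicts(states):
    """Element-wise average of several model state dicts (same keys and shapes assumed)."""
    n = len(states)
    return {k: sum(s[k] for s in states) / n for k in states[0].keys()}

# toy demonstration; in practice the states come from
# paddle.load("exp/deepspeech2_online/checkpoints/<ckpt>.pdparams") for the top-k checkpoints
states = [{"fc.weight": paddle.full([2, 3], float(i))} for i in range(1, 4)]
avg = average_state_dicts(states)        # fc.weight becomes all 2.0
# paddle.save(avg, "exp/deepspeech2_online/checkpoints/avg_3.pdparams")
```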
### Testing Process
Using the command below, you can test the DeepSpeech2 online model.
```
bash run.sh --stage 3 --stop_stage 5 --model_type online --conf_path conf/deepspeech2_online.yaml
```
The detailed commands are:
```
conf_path=conf/deepspeech2_online.yaml
@ -139,7 +138,7 @@ avg_num=1
model_type=online
avg_ckpt=avg_${avg_num}
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
    # test ckpt avg_n
    CUDA_VISIBLE_DEVICES=2 ./local/test.sh ${conf_path} exp/${ckpt}/checkpoints/${avg_ckpt} ${model_type} || exit -1
fi
@ -156,19 +155,16 @@ fi
```
After the training process, we use stages 3, 4 and 5 for testing. Stage 3 tests the model generated in stage 2 and reports the CER on the test set. Stage 4 transforms the model from a dynamic graph to a static graph using the "paddle.jit" library. Stage 5 tests the model in static-graph mode. A sketch of the dynamic-to-static export is shown below.
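For reference, exporting a dynamic-graph `paddle.nn.Layer` to a static inference model generally looks like the sketch below; the toy model and input spec are placeholders, not the project's actual export code:
```python
import paddle
from paddle.static import InputSpec

class ToyModel(paddle.nn.Layer):
    """Stand-in for the trained dygraph model."""
    def __init__(self):
        super().__init__()
        self.fc = paddle.nn.Linear(161, 10)

    def forward(self, feats):                     # feats: (batch, time, feat_dim)
        return self.fc(feats)

model = ToyModel()
model.eval()
# paddle.jit.save traces the dynamic graph (via to_static) and writes a static inference model
paddle.jit.save(model, "toy_model",
                input_spec=[InputSpec(shape=[None, None, 161], dtype="float32")])
```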
## Non-Streaming DeepSpeech2
The DeepSpeech2 offline model is similar to the DeepSpeech2 online model. The main difference is that the offline model uses stacked bi-directional RNN layers, while the online model uses unidirectional RNN layers and does not use the fc layer. For the stacked bi-directional RNN layers in the offline model, both the RNN cell and the GRU cell are provided.
The architecture of the model is shown in Fig.2.
<p align="center">
<img src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/images/ds2offlineModel.png" width=800>
<br/>Fig.2 The architecture of the DeepSpeech2 offline model
</p>
For data preparation and the decoder, the DeepSpeech2 offline model is the same as the DeepSpeech2 online model.
The code of the encoder and decoder for the DeepSpeech2 offline model is in:
@ -180,7 +176,7 @@ The training process and testing process of deepspeech2 offline model is very si
Only a few changes should be noticed.
For training and testing, the "model_type" and the "conf_path" must be set.
```
# Training offline
cd examples/aishell/s0
bash run.sh --stage 0 --stop_stage 2 --model_type offline --conf_path conf/deepspeech2.yaml

@ -1,5 +1,4 @@
# Quick Start of Speech-To-Text
Several shell scripts provided in `./examples/tiny/local` will help us to quickly give it a try, for most major modules, including data preparation, model training, case inference and model evaluation, with a few public datasets (e.g. [LibriSpeech](http://www.openslr.org/12/), [Aishell](http://www.openslr.org/33)). Reading these examples will also help you to understand how to make it work with your own data.
Some of the scripts in `./examples` are not configured with GPUs. If you want to train with 8 GPUs, please modify `CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7`. If you don't have any GPU available, please set `CUDA_VISIBLE_DEVICES=` to use CPUs instead. Besides, if an out-of-memory problem occurs, just reduce `batch_size` to fit.
@ -11,68 +10,52 @@ Let's take a tiny sampled subset of [LibriSpeech dataset](http://www.openslr.org
```bash
cd examples/tiny
```
Notice that this is only a toy example with a tiny sampled subset of LibriSpeech. If you would like to try with the complete dataset (which would take several days of training), please go to `examples/librispeech` instead.
- Source env
  ```bash
  source path.sh
  ```
  **Must do this before you start to do anything.**
  Set `MAIN_ROOT` as the project dir. The default `deepspeech2` model is used as `MODEL`; you can change this in the script.
- Main entrypoint
  ```bash
  bash run.sh
  ```
  This is just a demo; please make sure every `step` works well before moving on to the next `step`.
More detailed information is provided in the following sections. Wish you a happy journey with the *DeepSpeech on PaddlePaddle* ASR engine!
## Training a model
The key steps of training for Mandarin are the same as those for English, and we have also provided an example for Mandarin training with Aishell in ```examples/aishell/local```. As mentioned above, please execute ```sh data.sh```, ```sh train.sh```, ```sh test.sh``` and ```sh infer.sh``` to do data preparation, training, testing and inference correspondingly. We have also prepared a pre-trained model (downloaded by local/download_model.sh) for users to try with ```sh infer_golden.sh``` and ```sh test_golden.sh```. Notice that, different from the English LM, the Mandarin LM is character-based; please run ```local/tune.sh``` to find an optimal setting.
## Speech-to-text Inference
An inference module caller `infer.py` is provided to infer, decode and visualize speech-to-text results for several given audio clips. It might help to have an intuitive and qualitative evaluation of the ASR model's performance.
```bash
CUDA_VISIBLE_DEVICES=0 bash local/infer.sh
```
We provide two types of CTC decoders: *CTC greedy decoder* and *CTC beam search decoder*. The *CTC greedy decoder* is an implementation of the simple best-path decoding algorithm, selecting at each timestep the most likely token, thus being greedy and locally optimal. The [*CTC beam search decoder*](https://arxiv.org/abs/1408.2873) instead utilizes a heuristic breadth-first graph search to reach near global optimality; it also requires a pre-trained KenLM language model for better scoring and ranking. The decoder type can be set with the argument `decoding_method`.
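For intuition, CTC greedy (best-path) decoding takes the arg-max token at each frame, merges consecutive repeats and drops blanks; a minimal NumPy sketch with a made-up vocabulary:
```python
import numpy as np

def ctc_greedy_decode(log_probs: np.ndarray, vocab: list, blank_id: int = 0) -> str:
    """log_probs: (time, vocab_size) frame-level posteriors from the acoustic model."""
    best_path = log_probs.argmax(axis=1)
    tokens, prev = [], None
    for idx in best_path:
        if idx != blank_id and idx != prev:   # collapse repeats, skip blanks
            tokens.append(vocab[idx])
        prev = idx
    return "".join(tokens)

vocab = ["<blank>", "h", "e", "l", "o"]       # toy vocabulary
fake_posteriors = np.random.rand(20, len(vocab))
print(ctc_greedy_decode(np.log(fake_posteriors), vocab))
```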
## Evaluate a Model
To evaluate a model's performance quantitatively, please run:
```bash
CUDA_VISIBLE_DEVICES=0 bash local/test.sh
```
The error rate (default: word error rate; can be set with `error_rate_type`) will be printed.
## Hyper-parameters Tuning
The hyper-parameters $\alpha$ (language model weight) and $\beta$ (word insertion weight) for the [*CTC beam search decoder*](https://arxiv.org/abs/1408.2873) often have a significant impact on the decoder's performance. It would be better to re-tune them on the validation set when the acoustic model is renewed.
`tune.py` performs a 2-D grid search over the hyper-parameters $\alpha$ and $\beta$. You must provide the ranges of $\alpha$ and $\beta$, as well as the number of attempts for each.
```bash
CUDA_VISIBLE_DEVICES=0 bash local/tune.sh
```
The grid search will print the WER (word error rate) or CER (character error rate) at each point in the hyper-parameter space, and optionally draw the error surface. A proper hyper-parameter range should include the global minimum of the error surface for WER/CER, as illustrated in the following figure.
<p align="center">
<img src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/images/tuning_error_surface.png" width=550>
<br/>An example error surface for tuning on the dev-clean set of LibriSpeech
</p>
Usually, as the figure shows, the variation of the language model weight ($\alpha$) significantly affects the performance of the CTC beam search decoder. A better procedure is to first tune on several data batches (the number can be specified) to find the proper range of hyper-parameters, then switch to the whole validation set to carry out an accurate tuning. A minimal sketch of such a grid search is shown below.
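As a rough illustration of what `tune.py` does, the 2-D grid search can be sketched as follows; the ranges, grid sizes and the `decode_and_score` callback are placeholders for the real decoding-and-scoring step, not the script's actual defaults:
```python
import numpy as np

def grid_search(decode_and_score, alpha_range=(1.0, 3.2), beta_range=(0.1, 0.45),
                num_alphas=45, num_betas=8):
    """Evaluate WER/CER on a grid of (alpha, beta) and return the best point."""
    alphas = np.linspace(*alpha_range, num_alphas)
    betas = np.linspace(*beta_range, num_betas)
    best = (None, None, float("inf"))
    for a in alphas:
        for b in betas:
            err = decode_and_score(alpha=a, beta=b)   # run the beam search decoder, compute WER/CER
            if err < best[2]:
                best = (a, b, err)
    return best

# toy stand-in for the real decoder: a bowl-shaped error surface with its minimum near (2.5, 0.3)
toy = lambda alpha, beta: (alpha - 2.5) ** 2 + (beta - 0.3) ** 2
print(grid_search(toy))
```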

@ -1,28 +0,0 @@
# Released Models
## Acoustic Model Released in paddle 2.X
Acoustic Model | Training Data | Token-based | Size | Descriptions | CER | WER | Hours of speech
:-------------:| :------------:| :-----: | -----: | :----------------- |:--------- | :---------- | :---------
[Ds2 Online Aishell Model](https://deepspeech.bj.bcebos.com/release2.1/aishell/s0/aishell.s0.ds_online.5rnn.debug.tar.gz) | Aishell Dataset | Char-based | 345 MB | 2 Conv + 5 LSTM layers with only forward direction | 0.0824 |-| 151 h
[Ds2 Offline Aishell Model](https://deepspeech.bj.bcebos.com/release2.1/aishell/s0/aishell.s0.ds2.offline.cer6p65.release.tar.gz)| Aishell Dataset | Char-based | 306 MB | 2 Conv + 3 bidirectional GRU layers| 0.065 |-| 151 h
[Conformer Online Aishell Model](https://deepspeech.bj.bcebos.com/release2.1/aishell/s1/aishell.chunk.release.tar.gz) | Aishell Dataset | Char-based | 283 MB | Encoder:Conformer, Decoder:Transformer, Decoding method: Attention + CTC | 0.0594 |-| 151 h
[Conformer Offline Aishell Model](https://deepspeech.bj.bcebos.com/release2.1/aishell/s1/aishell.release.tar.gz) | Aishell Dataset | Char-based | 284 MB | Encoder:Conformer, Decoder:Transformer, Decoding method: Attention | 0.0547 |-| 151 h
[Conformer Librispeech Model](https://deepspeech.bj.bcebos.com/release2.1/librispeech/s1/conformer.release.tar.gz) | Librispeech Dataset | Word-based | 287 MB | Encoder:Conformer, Decoder:Transformer, Decoding method: Attention |-| 0.0325 | 960 h
[Transformer Librispeech Model](https://deepspeech.bj.bcebos.com/release2.1/librispeech/s1/transformer.release.tar.gz) | Librispeech Dataset | Word-based | 195 MB | Encoder:Conformer, Decoder:Transformer, Decoding method: Attention |-| 0.0544 | 960 h
## Acoustic Model Transformed from paddle 1.8
Acoustic Model | Training Data | Token-based | Size | Descriptions | CER | WER | Hours of speech
:-------------:| :------------:| :-----: | -----: | :----------------- | :---------- | :---------- | :---------
[Ds2 Offline Aishell model](https://deepspeech.bj.bcebos.com/mandarin_models/aishell_model_v1.8_to_v2.x.tar.gz)|Aishell Dataset| Char-based| 234 MB| 2 Conv + 3 bidirectional GRU layers| 0.0804 |-| 151 h|
[Ds2 Offline Librispeech model](https://deepspeech.bj.bcebos.com/eng_models/librispeech_v1.8_to_v2.x.tar.gz)|Librispeech Dataset| Word-based| 307 MB| 2 Conv + 3 bidirectional sharing weight RNN layers |-| 0.0685| 960 h|
[Ds2 Offline Baidu en8k model](https://deepspeech.bj.bcebos.com/eng_models/baidu_en8k_v1.8_to_v2.x.tar.gz)|Baidu Internal English Dataset| Word-based| 273 MB| 2 Conv + 3 bidirectional GRU layers |-| 0.0541 | 8628 h|
## Language Model Released
Language Model | Training Data | Token-based | Size | Descriptions
:-------------:| :------------:| :-----: | -----: | :-----------------
[English LM](https://deepspeech.bj.bcebos.com/en_lm/common_crawl_00.prune01111.trie.klm) | [CommonCrawl(en.00)](http://web-language-models.s3-website-us-east-1.amazonaws.com/ngrams/en/deduped/en.00.deduped.xz) | Word-based | 8.3 GB | Pruned with 0 1 1 1 1; <br/> About 1.85 billion n-grams; <br/> 'trie' binary with '-a 22 -q 8 -b 8'
[Mandarin LM Small](https://deepspeech.bj.bcebos.com/zh_lm/zh_giga.no_cna_cmn.prune01244.klm) | Baidu Internal Corpus | Char-based | 2.8 GB | Pruned with 0 1 2 4 4; <br/> About 0.13 billion n-grams; <br/> 'probing' binary with default settings
[Mandarin LM Large](https://deepspeech.bj.bcebos.com/zh_lm/zhidao_giga.klm) | Baidu Internal Corpus | Char-based | 70.4 GB | No Pruning; <br/> About 3.7 billion n-grams; <br/> 'probing' binary with default settings

@ -20,9 +20,15 @@
# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
import os
import sys
import recommonmark.parser
import sphinx_rtd_theme
sys.path.insert(0, os.path.abspath('../..'))
autodoc_mock_imports = ["soundfile", "librosa"]
# -- Project information -----------------------------------------------------
project = 'paddle speech'
@ -46,10 +52,10 @@ pygments_style = 'sphinx'
extensions = [
    'sphinx.ext.autodoc',
    'sphinx.ext.viewcode',
    "sphinx_rtd_theme",
    'sphinx.ext.mathjax',
    'numpydoc',
    'sphinx.ext.autosummary',
    'myst_parser',
]
@ -76,6 +82,7 @@ smartquotes = False
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ['_static']
html_logo = '../images/paddle.png'
# -- Extension configuration -------------------------------------------------
# numpydoc_show_class_members = False

@ -10,34 +10,44 @@ Contents
.. toctree::
   :maxdepth: 1
   :caption: Introduction

   introduction

.. toctree::
   :maxdepth: 1
   :caption: Quick Start

   install
   asr/quick_start
   tts/quick_start

.. toctree::
   :maxdepth: 1
   :caption: Speech-To-Text

   asr/models_introduction
   asr/data_preparation
   asr/augmentation
   asr/feature_list
   asr/ngram_lm

.. toctree::
   :maxdepth: 1
   :caption: Text-To-Speech

   tts/basic_usage
   tts/advanced_usage
   tts/zh_text_frontend
   tts/models_introduction
   tts/gan_vocoder
   tts/demo
   tts/demo_2

.. toctree::
   :maxdepth: 1
   :caption: Released Models

   released_model

.. toctree::
   :maxdepth: 1
@ -45,3 +55,8 @@ Contents
   asr/reference

@ -8,7 +8,7 @@ To avoid the trouble of environment setup, [running in Docker container](#runnin
## Setup (Important)
- Make sure these libraries or tools are installed: `pkg-config`, `flac`, `ogg`, `vorbis`, `boost`, `sox`, and `swig`, e.g. installing them via `apt-get`:
```bash
sudo apt-get install -y sox pkg-config libflac-dev libogg-dev libvorbis-dev libboost-dev swig python3-dev
@ -44,6 +44,14 @@ bash setup.sh
source tools/venv/bin/activate
```
## Simple Setup
```bash
git clone https://github.com/PaddlePaddle/DeepSpeech.git
cd DeepSpeech
pip install -e .
```
## Running in Docker Container (optional)
Docker is an open source tool to build, ship, and run distributed applications in an isolated environment. A Docker image for this project has been provided in [hub.docker.com](https://hub.docker.com) with all the dependencies installed. This Docker image requires the support of an NVIDIA GPU, so please make sure it is available and [nvidia-docker](https://github.com/NVIDIA/nvidia-docker) has been installed.

@ -0,0 +1,33 @@
# PaddleSpeech
## What is PaddleSpeech?
PaddleSpeech is an open-source toolkit on the PaddlePaddle platform for two critical tasks in speech: Speech-To-Text (Automatic Speech Recognition, ASR) and Text-To-Speech Synthesis (TTS), with modules involving state-of-the-art and influential models.
## What can PaddleSpeech do?
### Speech-To-Text
(An introduction to ASR in PaddleSpeech is needed here!)
### Text-To-Speech
TTS mainly consists of components below:
- Implementation of models and commonly used neural network layers.
- Dataset abstraction and common data preprocessing pipelines.
- Ready-to-run experiments.
PaddleSpeech TTS provides you with a complete TTS pipeline, including:
- Text FrontEnd
    - Rule-based Chinese frontend.
- Acoustic Models
    - FastSpeech2
    - SpeedySpeech
    - TransformerTTS
    - Tacotron2
- Vocoders
    - Multi Band MelGAN
    - Parallel WaveGAN
    - WaveFlow
- Voice Cloning
    - Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis
    - GE2E
Text-To-Speech helps you to train TTS models with simple commands.

@ -0,0 +1,55 @@
# Released Models
## Speech-To-Text Models
### Acoustic Model Released in paddle 2.X
Acoustic Model | Training Data | Token-based | Size | Descriptions | CER | WER | Hours of speech
:-------------:| :------------:| :-----: | -----: | :----------------- |:--------- | :---------- | :---------
[Ds2 Online Aishell Model](https://deepspeech.bj.bcebos.com/release2.1/aishell/s0/aishell.s0.ds_online.5rnn.debug.tar.gz) | Aishell Dataset | Char-based | 345 MB | 2 Conv + 5 LSTM layers with only forward direction | 0.0824 |-| 151 h
[Ds2 Offline Aishell Model](https://deepspeech.bj.bcebos.com/release2.1/aishell/s0/aishell.s0.ds2.offline.cer6p65.release.tar.gz)| Aishell Dataset | Char-based | 306 MB | 2 Conv + 3 bidirectional GRU layers| 0.065 |-| 151 h
[Conformer Online Aishell Model](https://deepspeech.bj.bcebos.com/release2.1/aishell/s1/aishell.chunk.release.tar.gz) | Aishell Dataset | Char-based | 283 MB | Encoder:Conformer, Decoder:Transformer, Decoding method: Attention + CTC | 0.0594 |-| 151 h
[Conformer Offline Aishell Model](https://deepspeech.bj.bcebos.com/release2.1/aishell/s1/aishell.release.tar.gz) | Aishell Dataset | Char-based | 284 MB | Encoder:Conformer, Decoder:Transformer, Decoding method: Attention | 0.0547 |-| 151 h
[Conformer Librispeech Model](https://deepspeech.bj.bcebos.com/release2.1/librispeech/s1/conformer.release.tar.gz) | Librispeech Dataset | Word-based | 287 MB | Encoder:Conformer, Decoder:Transformer, Decoding method: Attention |-| 0.0325 | 960 h
[Transformer Librispeech Model](https://deepspeech.bj.bcebos.com/release2.1/librispeech/s1/transformer.release.tar.gz) | Librispeech Dataset | Word-based | 195 MB | Encoder:Transformer, Decoder:Transformer, Decoding method: Attention |-| 0.0544 | 960 h
### Acoustic Model Transformed from paddle 1.8
Acoustic Model | Training Data | Token-based | Size | Descriptions | CER | WER | Hours of speech
:-------------:| :------------:| :-----: | -----: | :----------------- | :---------- | :---------- | :---------
[Ds2 Offline Aishell model](https://deepspeech.bj.bcebos.com/mandarin_models/aishell_model_v1.8_to_v2.x.tar.gz)|Aishell Dataset| Char-based| 234 MB| 2 Conv + 3 bidirectional GRU layers| 0.0804 |-| 151 h|
[Ds2 Offline Librispeech model](https://deepspeech.bj.bcebos.com/eng_models/librispeech_v1.8_to_v2.x.tar.gz)|Librispeech Dataset| Word-based| 307 MB| 2 Conv + 3 bidirectional sharing weight RNN layers |-| 0.0685| 960 h|
[Ds2 Offline Baidu en8k model](https://deepspeech.bj.bcebos.com/eng_models/baidu_en8k_v1.8_to_v2.x.tar.gz)|Baidu Internal English Dataset| Word-based| 273 MB| 2 Conv + 3 bidirectional GRU layers |-| 0.0541 | 8628 h|
### Language Model Released
Language Model | Training Data | Token-based | Size | Descriptions
:-------------:| :------------:| :-----: | -----: | :-----------------
[English LM](https://deepspeech.bj.bcebos.com/en_lm/common_crawl_00.prune01111.trie.klm) | [CommonCrawl(en.00)](http://web-language-models.s3-website-us-east-1.amazonaws.com/ngrams/en/deduped/en.00.deduped.xz) | Word-based | 8.3 GB | Pruned with 0 1 1 1 1; <br/> About 1.85 billion n-grams; <br/> 'trie' binary with '-a 22 -q 8 -b 8'
[Mandarin LM Small](https://deepspeech.bj.bcebos.com/zh_lm/zh_giga.no_cna_cmn.prune01244.klm) | Baidu Internal Corpus | Char-based | 2.8 GB | Pruned with 0 1 2 4 4; <br/> About 0.13 billion n-grams; <br/> 'probing' binary with default settings
[Mandarin LM Large](https://deepspeech.bj.bcebos.com/zh_lm/zhidao_giga.klm) | Baidu Internal Corpus | Char-based | 70.4 GB | No Pruning; <br/> About 3.7 billion n-grams; <br/> 'probing' binary with default settings
## Text-To-Speech Models
### Acoustic Models
Model Type | Dataset| Example Link | Pretrained Models
:-------------:| :------------:| :-----: | :-----
Tacotron2|LJSpeech|[tacotron2-ljspeech](https://github.com/PaddlePaddle/DeepSpeech/tree/develop/examples/ljspeech/tts0)|[tacotron2_ljspeech_ckpt_0.3.zip](https://paddlespeech.bj.bcebos.com/Parakeet/tacotron2_ljspeech_ckpt_0.3.zip)
TransformerTTS| LJSpeech| [transformer-ljspeech](https://github.com/PaddlePaddle/DeepSpeech/tree/develop/examples/ljspeech/tts1)|[transformer_tts_ljspeech_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/transformer_tts_ljspeech_ckpt_0.4.zip)
SpeedySpeech| CSMSC | [speedyspeech-csmsc](https://github.com/PaddlePaddle/DeepSpeech/tree/develop/examples/csmsc/tts2) |[speedyspeech_nosil_baker_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/speedyspeech_nosil_baker_ckpt_0.5.zip)
FastSpeech2| CSMSC |[fastspeech2-csmsc](https://github.com/PaddlePaddle/DeepSpeech/tree/develop/examples/csmsc/tts3)|[fastspeech2_nosil_baker_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/fastspeech2_nosil_baker_ckpt_0.4.zip)
FastSpeech2| AISHELL-3 |[fastspeech2-aishell3](https://github.com/PaddlePaddle/DeepSpeech/tree/develop/examples/aishell3/tts3)|[fastspeech2_nosil_aishell3_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/fastspeech2_nosil_aishell3_ckpt_0.4.zip)
FastSpeech2| LJSpeech |[fastspeech2-ljspeech](https://github.com/PaddlePaddle/DeepSpeech/tree/develop/examples/ljspeech/tts3)|[fastspeech2_nosil_ljspeech_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/fastspeech2_nosil_ljspeech_ckpt_0.5.zip)
FastSpeech2| VCTK |[fastspeech2-vctk](https://github.com/PaddlePaddle/DeepSpeech/tree/develop/examples/vctk/tts3)|[fastspeech2_nosil_vctk_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/fastspeech2_nosil_vctk_ckpt_0.5.zip)
### Vocoders
Model Type | Dataset| Example Link | Pretrained Models
:-------------:| :------------:| :-----: | :-----
WaveFlow| LJSpeech |[waveflow-ljspeech](https://github.com/PaddlePaddle/DeepSpeech/tree/develop/examples/ljspeech/voc0)|[waveflow_ljspeech_ckpt_0.3.zip](https://paddlespeech.bj.bcebos.com/Parakeet/waveflow_ljspeech_ckpt_0.3.zip)
Parallel WaveGAN| CSMSC |[PWGAN-csmsc](https://github.com/PaddlePaddle/DeepSpeech/tree/develop/examples/csmsc/voc1)|[pwg_baker_ckpt_0.4.zip.](https://paddlespeech.bj.bcebos.com/Parakeet/pwg_baker_ckpt_0.4.zip)
Parallel WaveGAN| LJSpeech |[PWGAN-ljspeech](https://github.com/PaddlePaddle/DeepSpeech/tree/develop/examples/ljspeech/voc1)|[pwg_ljspeech_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/pwg_ljspeech_ckpt_0.5.zip)
Parallel WaveGAN| VCTK |[PWGAN-vctk](https://github.com/PaddlePaddle/DeepSpeech/tree/develop/examples/vctk/voc1)|[pwg_vctk_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/pwg_vctk_ckpt_0.5.zip)
### Voice Cloning
Model Type | Dataset| Example Link | Pretrained Models
:-------------:| :------------:| :-----: | :-----
GE2E| AISHELL-3, etc. |[ge2e](https://github.com/PaddlePaddle/DeepSpeech/tree/develop/examples/other/ge2e)|[ge2e_ckpt_0.3.zip](https://paddlespeech.bj.bcebos.com/Parakeet/ge2e_ckpt_0.3.zip)
GE2E + Tacotron2| AISHELL-3 |[ge2e-tacotron2-aishell3](https://github.com/PaddlePaddle/DeepSpeech/tree/develop/examples/aishell3/vc0)|[tacotron2_aishell3_ckpt_0.3.zip](https://paddlespeech.bj.bcebos.com/Parakeet/tacotron2_aishell3_ckpt_0.3.zip)

@ -1,6 +1,5 @@
# Advanced Usage
This section covers how to extend TTS by implementing your own models and experiments. Guidelines on implementation are also elaborated.
For the general deep learning experiment, there are several parts to deal with:
1. Preprocess the data according to the needs of the model, and iterate the dataset by batch.
@ -8,7 +7,7 @@ For the general deep learning experiment, there are several parts to deal with:
3. Write out the training process (generally including forward / backward calculation, parameter update, log recording, visualization, periodic evaluation, etc.).
5. Configure and run the experiment.
## PaddleSpeech TTS's Model Components
In order to balance the reusability and function of models, we divide models into several types according to their characteristics.
For the commonly used modules that can be used as part of other larger models, we try to implement them as simply and universally as possible, because they will be reused. Modules with trainable parameters are generally implemented as subclasses of `paddle.nn.Layer`. Modules without trainable parameters can be directly implemented as functions whose inputs and outputs are `paddle.Tensor`s.
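A minimal sketch of the two styles described above (the module names and shapes are invented for illustration):
```python
import paddle
from paddle import nn

# A module with trainable parameters: implemented as a subclass of paddle.nn.Layer.
class Prenet(nn.Layer):
    def __init__(self, in_dim: int, hidden_dim: int):
        super().__init__()
        self.linear = nn.Linear(in_dim, hidden_dim)

    def forward(self, x: paddle.Tensor) -> paddle.Tensor:
        return paddle.nn.functional.relu(self.linear(x))

# A module without trainable parameters: implemented as a plain function on paddle.Tensor.
def masked_fill(x: paddle.Tensor, mask: paddle.Tensor, value: float) -> paddle.Tensor:
    return paddle.where(mask, paddle.full_like(x, value), x)

x = paddle.randn([4, 8])
y = Prenet(8, 16)(x)
z = masked_fill(x, x > 0, 0.0)
```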
@ -68,11 +67,11 @@ There are two common ways to define a model which consists of several modules.
```
When a model is complicated and made up of several components, each of which has a separate functionality and can be replaced by other components with the same functionality, we prefer to define it in this way.
In the directory structure of PaddleSpeech TTS, modules with high reusability are placed in `parakeet.modules`, while models for specific tasks are placed in `parakeet.models`. When developing a new model, developers need to consider the feasibility of splitting the modules, and the degree of generality of the modules, and place them in appropriate directories.
## PaddleSpeech TTS's Data Components
Another critical component for a deep learning project is data.
PaddleSpeech TTS uses the following methods for training data:
1. Preprocess the data.
2. Load the preprocessed data for training.
@ -154,7 +153,7 @@ def _convert(self, meta_datum: Dict[str, Any]) -> Dict[str, Any]:
    return example
```
## PaddleSpeech TTS's Training Components
A typical training process includes the following steps:
1. Iterate the dataset.
2. Process batch data.
@ -164,7 +163,7 @@ A typical training process includes the following processes:
6. Write logs, visualize, and in some cases save necessary intermediate results.
7. Save the state of the model and optimizer.
Here, we mainly introduce the training-related components of PaddleSpeech TTS and why we designed them like this.
### Global Reporter
When training and modifying deep learning models, logging is often needed, and it has even become the key to model debugging and modification. We usually use various visualization tools, such as `visualdl` in `paddle`, `tensorboard` in `tensorflow`, and `visdom`, `wandb`, etc. Besides, `logging` and `print` are usually used for different purposes.
@ -245,7 +244,7 @@ def test_reporter_scope():
In this way, when we write modular components, we can directly call `report`. The caller will decide where to report as long as it's ready for `OBSERVATION`; then it opens a `scope` and calls the component within this `scope`.
The `Trainer` in PaddleSpeech TTS reports the information in this way.
```python
while True:
    self.observation = {}
@ -269,7 +268,7 @@ We made an abstraction for these intermediate processes, that is, `Updater`, whi
### Visualizer
Because we choose observation as the communication mode, we can simply write the things in the observation into the `visualizer`.
## PaddleSpeech TTS's Configuration Components
Deep learning experiments often have many options to configure. These configurations can be roughly divided into several categories.
1. Data source and data processing mode configuration.
2. Save path configuration of experimental results.
@ -293,28 +292,26 @@ The following is the basic `ArgumentParser`:
3. `--output-dir` is the dir to save the training results. (If there are checkpoints in `checkpoints/` of `--output-dir`, it defaults to reloading the newest checkpoint to continue training.)
4. `--device` and `--nprocs` determine the operation modes: `--device` specifies the type of running device, whether to run on `cpu` or `gpu`; `--nprocs` refers to the number of training processes. If `nprocs` > 1, multi-process parallel training is used. (Note: currently only GPU multi-card multi-process training is supported.) A minimal sketch of such a parser is shown after this list.
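The sketch below is an assumption of what such an `ArgumentParser` looks like; the `--config` flag and the default values are added for completeness and are not taken from the source, while the other flags follow the list above:
```python
import argparse

def default_argument_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Basic experiment arguments.")
    parser.add_argument("--config", type=str,
                        help="path of a YAML config file that overwrites the default config")
    parser.add_argument("--output-dir", type=str, default="exp/default",
                        help="dir to save results; resumes from the newest checkpoint in checkpoints/ if present")
    parser.add_argument("--device", type=str, choices=["cpu", "gpu"], default="gpu",
                        help="type of the running device")
    parser.add_argument("--nprocs", type=int, default=1,
                        help="number of training processes; >1 enables multi-process parallel training (GPU only)")
    return parser

args = default_argument_parser().parse_args([])   # parse an empty list just for demonstration
```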
Developers can refer to the examples in `examples` to write the default configuration file when adding new experiments.
## PaddleSpeech TTS's Experiment template
The experiment code in PaddleSpeech TTS is generally organized as follows:
```text
.
├── README.md (help information)
├── conf
│   └── default.yaml (default config)
├── local
│   ├── preprocess.sh (script to call preprocess.py)
│   ├── synthesize.sh (script to call synthesize.py)
│   ├── synthesize_e2e.sh (script to call synthesize_e2e.py)
│   └── train.sh (script to call train.py)
├── path.sh (script including paths to be sourced)
└── run.sh (script to call scripts in local)
```
The `*.py` files called by the above `*.sh` scripts are located in `${BIN_DIR}/`.
We add a named argument `--output-dir` to each training script to specify the output directory. The directory structure is as follows; it's best for developers to follow this specification:
```text
@ -330,4 +327,4 @@ exp/default/
└── test/ (output dir of synthesis results)
```
You can view the examples we provide in `examples`. These experiments are provided to users as examples which can be run directly. Users are welcome to add new models and experiments and contribute code to PaddleSpeech.

@ -1,115 +0,0 @@
# Basic Usage
This section shows how to use pretrained models provided by parakeet and make inference with them.
Pretrained models in v0.4 are provided in an archive. Extract it to get a folder like this:
```
checkpoint_name/
├──default.yaml
├──snapshot_iter_76000.pdz
├──speech_stats.npy
└──phone_id_map.txt
```
`default.yaml` stores the config used to train the model.
`snapshot_iter_N.pdz` is the checkpoint file, where `N` is the number of steps the model has been trained for.
`*_stats.npy` is the stats file of a feature if it has been normalized before training.
`phone_id_map.txt` is the map of phonemes to phoneme_ids.
The example code below shows how to use the models for prediction.
## Acoustic Models (text to spectrogram)
The code below shows how to use a `FastSpeech2` model. After loading the pretrained model, use it and the normalizer object to construct a prediction object, then use fastspeech2_inference(phone_ids) to generate spectrograms, which can be further used to synthesize raw audio with a vocoder.
```python
from pathlib import Path
import numpy as np
import paddle
import yaml
from yacs.config import CfgNode
from parakeet.models.fastspeech2 import FastSpeech2
from parakeet.models.fastspeech2 import FastSpeech2Inference
from parakeet.modules.normalizer import ZScore
# Parakeet/examples/fastspeech2/baker/frontend.py
from frontend import Frontend
# load the pretrained model
checkpoint_dir = Path("fastspeech2_nosil_baker_ckpt_0.4")
with open(checkpoint_dir / "phone_id_map.txt", "r") as f:
phn_id = [line.strip().split() for line in f.readlines()]
vocab_size = len(phn_id)
with open(checkpoint_dir / "default.yaml") as f:
fastspeech2_config = CfgNode(yaml.safe_load(f))
odim = fastspeech2_config.n_mels
model = FastSpeech2(
idim=vocab_size, odim=odim, **fastspeech2_config["model"])
model.set_state_dict(
    paddle.load(str(checkpoint_dir / "snapshot_iter_76000.pdz"))["main_params"])
model.eval()
# load stats file
stat = np.load(checkpoint_dir / "speech_stats.npy")
mu, std = stat
mu = paddle.to_tensor(mu)
std = paddle.to_tensor(std)
fastspeech2_normalizer = ZScore(mu, std)
# construct a prediction object
fastspeech2_inference = FastSpeech2Inference(fastspeech2_normalizer, model)
# load Chinese Frontend
frontend = Frontend(checkpoint_dir / "phone_id_map.txt")
# text to spectrogram
sentence = "你好吗?"
input_ids = frontend.get_input_ids(sentence, merge_sentences=True)
phone_ids = input_ids["phone_ids"]
flags = 0
# The output of Chinese text frontend is segmented
for part_phone_ids in phone_ids:
with paddle.no_grad():
temp_mel = fastspeech2_inference(part_phone_ids)
if flags == 0:
mel = temp_mel
flags = 1
else:
mel = paddle.concat([mel, temp_mel])
```
## Vocoder (spectrogram to wave)
The code below shows how to use a `Parallel WaveGAN` model. Like the example above, after loading the pretrained model, use it and the normalizer object to construct a prediction object, then use pwg_inference(mel) to generate raw audio (in wav format).
```python
from pathlib import Path
import numpy as np
import paddle
import soundfile as sf
import yaml
from yacs.config import CfgNode
from parakeet.models.parallel_wavegan import PWGGenerator
from parakeet.models.parallel_wavegan import PWGInference
from parakeet.modules.normalizer import ZScore
# load the pretrained model
checkpoint_dir = Path("parallel_wavegan_baker_ckpt_0.4")
with open(checkpoint_dir / "pwg_default.yaml") as f:
pwg_config = CfgNode(yaml.safe_load(f))
vocoder = PWGGenerator(**pwg_config["generator_params"])
vocoder.set_state_dict(paddle.load(args.pwg_params))
vocoder.remove_weight_norm()
vocoder.eval()
# load stats file
stat = np.load(checkpoint_dir / "pwg_stats.npy")
mu, std = stat
mu = paddle.to_tensor(mu)
std = paddle.to_tensor(std)
pwg_normalizer = ZScore(mu, std)
# construct a prediction object
pwg_inference = PWGInference(pwg_normalizer, vocoder)
# spectrogram to wave
wav = pwg_inference(mel)
sf.write(
audio_path,
wav.numpy(),
samplerate=fastspeech2_config.fs)
```

@ -11,7 +11,7 @@ The main processes of TTS include:
When training ``Tacotron2``, ``TransformerTTS`` and ``WaveFlow``, we use the English single speaker TTS dataset `LJSpeech <https://keithito.com/LJ-Speech-Dataset/>`_ by default. However, when training ``SpeedySpeech``, ``FastSpeech2`` and ``ParallelWaveGAN``, we use the Chinese single speaker dataset `CSMSC <https://test.data-baker.com/data/index/source/>`_ by default.

In the future, ``PaddleSpeech TTS`` will mainly use Chinese TTS datasets for default examples.

Here, we will display three types of audio samples:
@ -441,7 +441,7 @@ Audio samples generated by a TTS system. Text is first transformed into spectrog
Chinese TTS with/without text frontend
--------------------------------------

We provide a complete Chinese text frontend module in ``PaddleSpeech TTS``. ``Text Normalization`` and ``G2P`` are the most important modules in the text frontend. We assume that the texts are already normalized, and mainly compare the ``G2P`` module here.

We use ``FastSpeech2`` + ``ParallelWaveGAN`` here.

@ -0,0 +1,7 @@
Audio Sample (PaddleSpeech TTS VS Espnet TTS)
=============================================
This is an audio demo page to contrast PaddleSpeech TTS and Espnet TTS. We use their respective modules (Text Frontend, Acoustic Model and Vocoder) here.
We use Espnet's released models here.
FastSpeech2 + Parallel WaveGAN in CSMSC

@ -0,0 +1,9 @@
# GAN Vocoders
This is a brief introduction to GAN vocoders; we mainly introduce the losses of the different vocoders here.

Model | Generator Loss | Discriminator Loss
:-------------:| :------------:| :-----
Parallel WaveGAN| adversarial loss <br> Feature Matching | Multi-Scale Discriminator |
MelGAN | adversarial loss <br> Multi-resolution STFT loss | adversarial loss|
Multi-Band MelGAN | adversarial loss <br> full-band Multi-resolution STFT loss <br> sub-band Multi-resolution STFT loss | Multi-Scale Discriminator|
HiFi-GAN | adversarial loss <br> Feature Matching <br> Mel-Spectrogram Loss | Multi-Scale Discriminator <br> Multi-Period Discriminator |
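As an illustration of one recurring term, a multi-resolution STFT loss compares generated and reference waveforms at several FFT/hop sizes; the NumPy sketch below shows the idea only, and the resolutions and weighting are assumptions, not the exact settings of these vocoders:
```python
import numpy as np

def _stft_mag(x, n_fft, hop):
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win for i in range(0, len(x) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=1)) + 1e-7

def stft_loss(gen, ref, n_fft, hop):
    """Spectral convergence + log-magnitude L1 at one STFT resolution."""
    sg, sr = _stft_mag(gen, n_fft, hop), _stft_mag(ref, n_fft, hop)
    spectral_convergence = np.linalg.norm(sr - sg) / np.linalg.norm(sr)
    log_mag = np.mean(np.abs(np.log(sr) - np.log(sg)))
    return spectral_convergence + log_mag

def multi_resolution_stft_loss(gen, ref, resolutions=((512, 128), (1024, 256), (2048, 512))):
    return sum(stft_loss(gen, ref, n, h) for n, h in resolutions) / len(resolutions)

print(multi_resolution_stft_loss(np.random.randn(8192), np.random.randn(8192)))
```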

@ -1,45 +0,0 @@
.. parakeet documentation master file, created by
sphinx-quickstart on Fri Sep 10 14:22:24 2021.
You can adapt this file completely to your liking, but it should at least
contain the root `toctree` directive.
Parakeet
====================================
``parakeet`` is a deep learning based text-to-speech toolkit built upon ``paddlepaddle`` framework. It aims to provide a flexible, efficient and state-of-the-art text-to-speech toolkit for the open-source community. It includes many influential TTS models proposed by `Baidu Research <http://research.baidu.com>`_ and other research groups.
``parakeet`` mainly consists of components below.
#. Implementation of models and commonly used neural network layers.
#. Dataset abstraction and common data preprocessing pipelines.
#. Ready-to-run experiments.
.. toctree::
:maxdepth: 1
:caption: Introduction
introduction
.. toctree::
:maxdepth: 1
:caption: Getting started
install
basic_usage
advanced_usage
cn_text_frontend
released_models
.. toctree::
:maxdepth: 1
:caption: Demos
demo
Indices and tables
==================
* :ref:`genindex`
* :ref:`modindex`
* :ref:`search`

@ -1,47 +0,0 @@
# Installation
## Install PaddlePaddle
Parakeet requires PaddlePaddle as its backend. Note that version 2.1.2 or newer of paddle is required.
Since paddlepaddle has multiple packages depending on the device (CPU or GPU) and the dependency libraries, it is recommended to install the appropriate paddlepaddle package for your device and dependency library versions via `pip`.
Installing paddlepaddle with conda or building paddlepaddle from source is also supported. Please refer to [PaddlePaddle installation](https://www.paddlepaddle.org.cn/install/quick?docurl=/documentation/docs/zh/install/pip/linux-pip.html) for more details.
Example instructions to install paddlepaddle via pip are listed below.
### PaddlePaddle with GPU
```bash
# PaddlePaddle for CUDA 10.1
python -m pip install paddlepaddle-gpu==2.1.2.post101 -f https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html
# PaddlePaddle for CUDA 10.2
python -m pip install paddlepaddle-gpu -i https://mirror.baidu.com/pypi/simple
# PaddlePaddle for CUDA 11.0
python -m pip install paddlepaddle-gpu==2.1.2.post110 -f https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html
# PaddlePaddle for CUDA 11.2
python -m pip install paddlepaddle-gpu==2.1.2.post112 -f https://www.paddlepaddle.org.cn/whl/linux/mkl/avx/stable.html
```
### PaddlePaddle with CPU
```bash
python -m pip install paddlepaddle==2.1.2 -i https://mirror.baidu.com/pypi/simple
```
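After installing, you can run paddle's built-in check to confirm that the installation works on your device:
```bash
python -c "import paddle; paddle.utils.run_check()"
```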
## Install libsndfile
Experiments in parakeet often involve audio and spectrum processing, thus `librosa` and `soundfile` are required. `soundfile` requires an extra C library, `libsndfile`, which is not always handled by pip.
For Windows and Mac users, `libsndfile` is installed automatically when installing `soundfile` via pip, but Linux users need to install `libsndfile` via the system package manager. Example commands for popular distributions are listed below.
```bash
# ubuntu, debian
sudo apt-get install libsndfile1
# centos, fedora
sudo yum install libsndfile
# openSUSE
sudo zypper in libsndfile
```
For any problem with the installation of soundfile, please refer to [SoundFile](https://pypi.org/project/SoundFile/).
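A quick way to confirm that `soundfile` and the underlying `libsndfile` work is to write and read back a short wav file:
```python
import numpy as np
import soundfile as sf

# write one second of silence at 16 kHz, then read it back
sf.write("test.wav", np.zeros(16000, dtype=np.float32), 16000)
data, samplerate = sf.read("test.wav")
print(data.shape, samplerate)  # expected: (16000,) 16000
```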
## Install Parakeet
There are two ways to install parakeet according to the purpose of using it.
1. If you want to run the experiments provided by parakeet or add new models and experiments, it is recommended to clone the project from GitHub (Parakeet) and install it in editable mode.
```bash
git clone https://github.com/PaddlePaddle/Parakeet
cd Parakeet
pip install -e .
```
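You can verify that the editable install is picked up by importing the package:
```bash
python -c "import parakeet; print(parakeet.__file__)"
```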

@ -1,27 +0,0 @@
# Parakeet - PAddle PARAllel text-to-speech toolKIT
## What is Parakeet?
Parakeet is a deep learning based text-to-speech toolkit built upon the paddlepaddle framework. It aims to provide a flexible, efficient and state-of-the-art text-to-speech toolkit for the open-source community. It includes many influential TTS models proposed by Baidu Research and other research groups.
## What can Parakeet do?
Parakeet mainly consists of components below:
- Implementation of models and commonly used neural network layers.
- Dataset abstraction and common data preprocessing pipelines.
- Ready-to-run experiments.
Parakeet provides you with a complete TTS pipeline, including:
- Text FrontEnd
- Rule based Chinese frontend.
- Acoustic Models
- FastSpeech2
- SpeedySpeech
- TransformerTTS
- Tacotron2
- Vocoders
- Parallel WaveGAN
- WaveFlow
- Voice Cloning
- Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis
- GE2E
Parakeet helps you to train TTS models with simple commands.

@ -1,12 +1,12 @@
# Released Models # Models introduction
TTS system mainly includes three modules: `text frontend`, `Acoustic model` and `Vocoder`. We introduce a rule based Chinese text frontend in [cn_text_frontend.md](./cn_text_frontend.md). Here, we will introduce acoustic models and vocoders, which are trainable models. TTS system mainly includes three modules: `Text Frontend`, `Acoustic model` and `Vocoder`. We introduce a rule based Chinese text frontend in [cn_text_frontend.md](./cn_text_frontend.md). Here, we will introduce acoustic models and vocoders, which are trainable models.
The main processes of TTS include: The main processes of TTS include:
1. Convert the original text into characters/phonemes, through `text frontend` module. 1. Convert the original text into characters/phonemes, through `text frontend` module.
2. Convert characters/phonemes into acoustic features , such as linear spectrogram, mel spectrogram, LPC features, etc. through `Acoustic models`. 2. Convert characters/phonemes into acoustic features , such as linear spectrogram, mel spectrogram, LPC features, etc. through `Acoustic models`.
3. Convert acoustic features into waveforms through `Vocoders`. 3. Convert acoustic features into waveforms through `Vocoders`.
A simple text frontend module can be implemented by rules. Acoustic models and vocoders need to be trained. The models provided by Parakeet are acoustic models and vocoders. A simple text frontend module can be implemented by rules. Acoustic models and vocoders need to be trained. The models provided by PaddleSpeech TTS are acoustic models and vocoders.
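The three steps above can be summarized as a tiny pseudo-pipeline (the callables below are purely illustrative names, not PaddleSpeech APIs):

```python
# illustrative sketch of the three-stage TTS pipeline described above
phonemes = text_frontend("some normalized text")  # 1. text -> characters/phonemes
mel = acoustic_model(phonemes)                    # 2. phonemes -> acoustic features (e.g. mel spectrogram)
wav = vocoder(mel)                                # 3. acoustic features -> waveform
```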
## Acoustic Models ## Acoustic Models
### Modeling Objectives of Acoustic Models ### Modeling Objectives of Acoustic Models
@ -92,7 +92,7 @@ At present, there are two mainstream acoustic model structures.
<img src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/images/tacotron2.png" width=500 /> <br> <img src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/images/tacotron2.png" width=500 /> <br>
</div> </div>
You can find Parakeet's tacotron2 example at `Parakeet/examples/tacotron2`. You can find PaddleSpeech TTS's tacotron2 with LJSpeech dataset example at [examples/ljspeech/tts0](https://github.com/PaddlePaddle/DeepSpeech/tree/develop/examples/ljspeech/tts0).
### TransformerTTS ### TransformerTTS
**Disadvantages of the Tacotrons:** **Disadvantages of the Tacotrons:**
@ -146,7 +146,7 @@ Transformer TTS is a seq2seq acoustic model based on Transformer and Tacotron2.
- The ability to perceive local information is weak, and local information is more related to pronunciation. - The ability to perceive local information is weak, and local information is more related to pronunciation.
- Stability is worse than Tacotron2. - Stability is worse than Tacotron2.
You can find Parakeet's Transformer TTS example at `Parakeet/examples/transformer_tts`. You can find PaddleSpeech TTS's Transformer TTS with LJSpeech dataset example at [examples/ljspeech/tts1](https://github.com/PaddlePaddle/DeepSpeech/tree/develop/examples/ljspeech/tts1).
### FastSpeech2 ### FastSpeech2
@ -212,7 +212,7 @@ FastSpeech2 is similar to FastPitch but introduces more variation information of
<img src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/images/fastspeech2.png" width=800 /> <br> <img src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/images/fastspeech2.png" width=800 /> <br>
</div> </div>
You can find Parakeet's FastSpeech2/FastPitch example at `Parakeet/examples/fastspeech2`, We use token-averaged pitch and energy values introduced in FastPitch rather than frame level ones in FastSpeech2. You can find PaddleSpeech TTS's FastSpeech2/FastPitch with CSMSC dataset example at [examples/csmsc/tts3](https://github.com/PaddlePaddle/DeepSpeech/tree/develop/examples/csmsc/tts3), We use token-averaged pitch and energy values introduced in FastPitch rather than frame level ones in FastSpeech2.
### SpeedySpeech ### SpeedySpeech
[SpeedySpeech](https://arxiv.org/abs/2008.03802) simplifies the teacher-student architecture of FastSpeech and provides a fast and stable training procedure. [SpeedySpeech](https://arxiv.org/abs/2008.03802) simplifies the teacher-student architecture of FastSpeech and provides a fast and stable training procedure.
@ -226,7 +226,7 @@ You can find Parakeet's FastSpeech2/FastPitch example at `Parakeet/examples/fast
<img src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/images/speedyspeech.png" width=500 /> <br> <img src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/images/speedyspeech.png" width=500 /> <br>
</div> </div>
You can find Parakeet's SpeedySpeech example at `Parakeet/examples/speedyspeech/baker`. You can find PaddleSpeech TTS's SpeedySpeech with CSMSC dataset example at [examples/csmsc/tts2](https://github.com/PaddlePaddle/DeepSpeech/tree/develop/examples/csmsc/tts2).
## Vocoders ## Vocoders
In speech synthesis, the main task of the vocoder is to convert the spectral parameters predicted by the acoustic model into the final speech waveform. In speech synthesis, the main task of the vocoder is to convert the spectral parameters predicted by the acoustic model into the final speech waveform.
@ -276,7 +276,7 @@ Here, we introduce a Flow-based vocoder WaveFlow and a GAN-based vocoder Paralle
- It is a small-footprint flow-based model for raw audio. It has only 5.9M parameters, which is 15x smaller than WaveGlow (87.9M). - It is a small-footprint flow-based model for raw audio. It has only 5.9M parameters, which is 15x smaller than WaveGlow (87.9M).
- It is directly trained with maximum likelihood without probability density distillation and auxiliary losses as used in [Parallel WaveNet](https://arxiv.org/abs/1711.10433) and [ClariNet](https://openreview.net/pdf?id=HklY120cYm), which simplifies the training pipeline and reduces the cost of development. - It is directly trained with maximum likelihood without probability density distillation and auxiliary losses as used in [Parallel WaveNet](https://arxiv.org/abs/1711.10433) and [ClariNet](https://openreview.net/pdf?id=HklY120cYm), which simplifies the training pipeline and reduces the cost of development.
You can find Parakeet's WaveFlow example at `Parakeet/examples/waveflow`. You can find PaddleSpeech TTS's WaveFlow with LJSpeech dataset example at [examples/ljspeech/voc0](https://github.com/PaddlePaddle/DeepSpeech/tree/develop/examples/ljspeech/voc0).
### Parallel WaveGAN ### Parallel WaveGAN
[Parallel WaveGAN](https://arxiv.org/abs/1910.11480) trains a non-autoregressive WaveNet variant as a generator in a GAN based training method. [Parallel WaveGAN](https://arxiv.org/abs/1910.11480) trains a non-autoregressive WaveNet variant as a generator in a GAN based training method.
@ -286,10 +286,10 @@ You can find Parakeet's WaveFlow example at `Parakeet/examples/waveflow`.
- Use non-causal convolution instead of causal convolution. - Use non-causal convolution instead of causal convolution.
- The input is random Gaussian white noise. - The input is random Gaussian white noise.
- The model is non-autoregressive both in training and prediction, which is fast - The model is non-autoregressive both in training and prediction, which is fast
- Multi-resolution STFT loss. - Multi-resolution STFT loss.
<div align="left"> <div align="left">
<img src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/images/pwg.png" width=600 /> <br> <img src="https://paddlespeech.bj.bcebos.com/Parakeet/docs/images/pwg.png" width=600 /> <br>
</div> </div>
You can find Parakeet's Parallel WaveGAN example at `Parakeet/examples/parallelwave_gan/baker`. You can find PaddleSpeech TTS's Parallel WaveGAN with CSMSC example at [examples/csmsc/voc1](https://github.com/PaddlePaddle/DeepSpeech/tree/develop/examples/csmsc/voc1).

@ -0,0 +1,193 @@
# Quick Start of Text-To-Speech
The examples in PaddleSpeech are mainly classified by dataset; the TTS datasets we mainly use are:
* CSMSC (Mandarin, single speaker)
* AISHELL3 (Mandarin, multiple speakers)
* LJSpeech (English, single speaker)
* VCTK (English, multiple speakers)
The models in PaddleSpeech TTS have the following mapping relationship:
* tts0 - Tacotron2
* tts1 - TransformerTTS
* tts2 - SpeedySpeech
* tts3 - FastSpeech2
* voc0 - WaveFlow
* voc1 - Parallel WaveGAN
* voc2 - MelGAN
* voc3 - MultiBand MelGAN
* vc0 - Tacotron2 Voice Cloning with GE2E
## Quick Start
Let's take FastSpeech2 + Parallel WaveGAN with the CSMSC dataset as an example: [examples/csmsc](https://github.com/PaddlePaddle/DeepSpeech/tree/develop/examples/csmsc).
### Train Parallel WaveGAN with CSMSC
- Go to directory
```bash
cd examples/csmsc/voc1
```
- Source env
```bash
source path.sh
```
**You must do this before running anything else.**
This sets `MAIN_ROOT` to the project directory and uses the `parallelwave_gan` model as `MODEL`.
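For reference, `path.sh` is a small shell script; a rough sketch of what it typically does is shown below (the exact variable names and paths are assumptions — check the script in the example directory):
```bash
# sketch only: set the project root and select the model used by run.sh
export MAIN_ROOT=$(realpath "${PWD}/../../../")
export PATH=${MAIN_ROOT}:${MAIN_ROOT}/utils:${PATH}
export PYTHONPATH=${MAIN_ROOT}:${PYTHONPATH}
MODEL=parallelwave_gan
```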
- Main entrypoint
```bash
bash run.sh
```
This is just a demo; please make sure the source data has been prepared well and that every `step` works before moving on to the next `step`.
### Train FastSpeech2 with CSMSC
- Go to directory
```bash
cd examples/csmsc/tts3
```
- Source env
```bash
source path.sh
```
**You must do this before running anything else.**
This sets `MAIN_ROOT` to the project directory and uses the `fastspeech2` model as `MODEL`.
- Main entrypoint
```bash
bash run.sh
```
This is just a demo; please make sure the source data has been prepared well and that every `step` works before moving on to the next `step`.
The steps in `run.sh` mainly include:
- source the path.
- preprocess the dataset.
- train the model.
- synthesize waveforms from `metadata.jsonl`.
- synthesize waveforms from a text file (acoustic models only).
- inference using a static model (optional).

For more details, you can see the `README.md` in each example.
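If you only want to run part of the pipeline, the stages can usually be selected with command-line flags, roughly as follows (the flag names and the stage numbering are assumptions — see `run.sh` and the `README.md` of each example):
```bash
./run.sh --stage 0 --stop-stage 0   # preprocess the dataset only
./run.sh --stage 1 --stop-stage 1   # train the model only
./run.sh --stage 0 --stop-stage 3   # run preprocessing through synthesis
```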
## Pipeline of TTS
This section shows how to use the pretrained models provided by PaddleSpeech TTS and how to run inference with them.
Pretrained models in TTS are provided as archives. Extract one to get a folder like this:
**Acoustic Models:**
```text
checkpoint_name
├── default.yaml
├── snapshot_iter_*.pdz
├── speech_stats.npy
├── phone_id_map.txt
├── spk_id_map.txt (optional)
└── tone_id_map.txt (optional)
```
**Vocoders:**
```text
checkpoint_name
├── default.yaml
├── snapshot_iter_*.pdz
└── stats.npy
```
- `default.yaml` stores the config used to train the model.
- `snapshot_iter_*.pdz` is the checkpoint file, where `*` is the number of steps the model has been trained for.
- `*_stats.npy` is the stats file of a feature, if that feature was normalized before training.
- `phone_id_map.txt` is the map of phonemes to phoneme_ids.
- `tone_id_map.txt` is the map of tones to tone_ids, used when you split tones and phones before training acoustic models (for example, in our csmsc/speedyspeech example).
- `spk_id_map.txt` is the map of speakers to spk_ids in multi-speaker acoustic models (for example, in our aishell3/fastspeech2 example).
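For illustration, `phone_id_map.txt` simply lists one phoneme and its integer id per line, roughly like the made-up snippet below, while `*_stats.npy` stores the mean and standard deviation used for z-score normalization of the corresponding feature.
```text
a1 0
a2 1
a3 2
...
```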
The example code below shows how to use the models for prediction.
### Acoustic Models (text to spectrogram)
The code below shows how to use a `FastSpeech2` model. After loading the pretrained model, use it together with a normalizer object to construct a prediction object, then use `fastspeech2_inference(phone_ids)` to generate spectrograms, which can be further used to synthesize raw audio with a vocoder.
```python
from pathlib import Path
import numpy as np
import paddle
import yaml
from yacs.config import CfgNode
from parakeet.models.fastspeech2 import FastSpeech2
from parakeet.models.fastspeech2 import FastSpeech2Inference
from parakeet.modules.normalizer import ZScore
# Frontend is defined in examples/fastspeech2/baker/frontend.py
from frontend import Frontend

# load the pretrained model
checkpoint_dir = Path("fastspeech2_nosil_baker_ckpt_0.4")
with open(checkpoint_dir / "phone_id_map.txt", "r") as f:
    phn_id = [line.strip().split() for line in f.readlines()]
vocab_size = len(phn_id)
with open(checkpoint_dir / "default.yaml") as f:
    fastspeech2_config = CfgNode(yaml.safe_load(f))
odim = fastspeech2_config.n_mels
model = FastSpeech2(
    idim=vocab_size, odim=odim, **fastspeech2_config["model"])
# replace * with the actual number of training steps of the snapshot in the archive
fastspeech2_checkpoint = checkpoint_dir / "snapshot_iter_*.pdz"
model.set_state_dict(paddle.load(str(fastspeech2_checkpoint))["main_params"])
model.eval()

# load stats file (mean and std used to normalize the target spectrogram)
stat = np.load(checkpoint_dir / "speech_stats.npy")
mu, std = stat
mu = paddle.to_tensor(mu)
std = paddle.to_tensor(std)
fastspeech2_normalizer = ZScore(mu, std)

# construct a prediction object
fastspeech2_inference = FastSpeech2Inference(fastspeech2_normalizer, model)

# load the Chinese frontend
frontend = Frontend(checkpoint_dir / "phone_id_map.txt")

# text to spectrogram
sentence = "你好吗?"
input_ids = frontend.get_input_ids(sentence, merge_sentences=True)
phone_ids = input_ids["phone_ids"]
flags = 0
# the output of the Chinese text frontend is segmented, so concatenate the pieces
for part_phone_ids in phone_ids:
    with paddle.no_grad():
        temp_mel = fastspeech2_inference(part_phone_ids)
        if flags == 0:
            mel = temp_mel
            flags = 1
        else:
            mel = paddle.concat([mel, temp_mel])
```
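The concatenated `mel` produced by this loop is the spectrogram that the vocoder example in the next section consumes.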
### Vocoder (spectrogram to wave)
The code below shows how to use a `Parallel WaveGAN` model. Like the example above, after loading the pretrained model, use it together with a normalizer object to construct a prediction object, then use `pwg_inference(mel)` to generate raw audio (in wav format).
```python
from pathlib import Path
import numpy as np
import paddle
import soundfile as sf
import yaml
from yacs.config import CfgNode
from parakeet.models.parallel_wavegan import PWGGenerator
from parakeet.models.parallel_wavegan import PWGInference
from parakeet.modules.normalizer import ZScore

# load the pretrained model
checkpoint_dir = Path("parallel_wavegan_baker_ckpt_0.4")
with open(checkpoint_dir / "pwg_default.yaml") as f:
    pwg_config = CfgNode(yaml.safe_load(f))
vocoder = PWGGenerator(**pwg_config["generator_params"])
# replace * with the actual number of training steps of the snapshot in the archive
pwg_params = checkpoint_dir / "snapshot_iter_*.pdz"
vocoder.set_state_dict(paddle.load(str(pwg_params)))
vocoder.remove_weight_norm()
vocoder.eval()

# load stats file (mean and std used to normalize the input spectrogram)
stat = np.load(checkpoint_dir / "pwg_stats.npy")
mu, std = stat
mu = paddle.to_tensor(mu)
std = paddle.to_tensor(std)
pwg_normalizer = ZScore(mu, std)

# construct a prediction object
pwg_inference = PWGInference(pwg_normalizer, vocoder)

# spectrogram to wave
# `mel` is the spectrogram generated by the acoustic model example above
wav = pwg_inference(mel)
audio_path = "output.wav"
sf.write(
    audio_path,
    wav.numpy(),
    samplerate=fastspeech2_config.fs)  # sample rate from the acoustic model config above
```
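Putting the two snippets together gives a minimal end-to-end sketch, reusing the objects constructed above (`frontend`, `fastspeech2_inference`, `pwg_inference` and `fastspeech2_config`):
```python
# end-to-end: text -> phonemes -> mel spectrogram -> waveform
sentence = "你好吗?"
input_ids = frontend.get_input_ids(sentence, merge_sentences=True)
with paddle.no_grad():
    mel = fastspeech2_inference(input_ids["phone_ids"][0])
    wav = pwg_inference(mel)
sf.write("output.wav", wav.numpy(), samplerate=fastspeech2_config.fs)
```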

@ -1,5 +1,5 @@
# Chinese Rule Based Text Frontend # Chinese Rule Based Text Frontend
TTS system mainly includes three modules: `text frontend`, `Acoustic model` and `Vocoder`. We provide a complete Chinese text frontend module in Parakeet, see the example in `Parakeet/examples/text_frontend/`. A TTS system mainly includes three modules: `Text Frontend`, `Acoustic model` and `Vocoder`. We provide a complete Chinese text frontend module in PaddleSpeech TTS, see the example in [examples/other/text_frontend/](https://github.com/PaddlePaddle/DeepSpeech/tree/develop/examples/other/text_frontend).
A text frontend module mainly includes: A text frontend module mainly includes:
- Text Segmentation - Text Segmentation

@ -20,5 +20,73 @@ Run the command below to get the results of test.
./run.sh ./run.sh
``` ```
The `avg WER` of g2p is: 0.027495061517943988 The `avg WER` of g2p is: 0.027495061517943988
```text
SYSTEM SUMMARY PERCENTAGES by SPEAKER
,------------------------------------------------------------------------.
| ./exp/g2p/text.g2p |
|------------------------------------------------------------------------|
| SPKR | # Snt # Wrd | Corr Sub Del Ins Err S.Err |
|------+-----------------+-----------------------------------------------|
| bak | 9996 299181 | 290969 8198 14 14 8226 5249 |
|========================================================================|
| Sum | 9996 299181 | 290969 8198 14 14 8226 5249 |
|========================================================================|
| Mean |9996.0 299181.0 |290969.0 8198.0 14.0 14.0 8226.0 5249.0 |
| S.D. | 0.0 0.0 | 0.0 0.0 0.0 0.0 0.0 0.0 |
|Median|9996.0 299181.0 |290969.0 8198.0 14.0 14.0 8226.0 5249.0 |
`------------------------------------------------------------------------'
SYSTEM SUMMARY PERCENTAGES by SPEAKER
,--------------------------------------------------------------------.
| ./exp/g2p/text.g2p |
|--------------------------------------------------------------------|
| SPKR | # Snt # Wrd | Corr Sub Del Ins Err S.Err |
|--------+-----------------+-----------------------------------------|
| bak | 9996 299181 | 97.3 2.7 0.0 0.0 2.7 52.5 |
|====================================================================|
| Sum/Avg| 9996 299181 | 97.3 2.7 0.0 0.0 2.7 52.5 |
|====================================================================|
| Mean |9996.0 299181.0 | 97.3 2.7 0.0 0.0 2.7 52.5 |
| S.D. | 0.0 0.0 | 0.0 0.0 0.0 0.0 0.0 0.0 |
| Median |9996.0 299181.0 | 97.3 2.7 0.0 0.0 2.7 52.5 |
`--------------------------------------------------------------------'
```
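As a sanity check, the reported WER can be recovered from the summary above: (8198 substitutions + 14 deletions + 14 insertions) / 299181 words = 8226 / 299181 ≈ 0.027495, which matches the `avg WER` reported for g2p.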
The `avg CER` of text normalization is: 0.006388318503308237 The `avg CER` of text normalization is: 0.006388318503308237
```text
SYSTEM SUMMARY PERCENTAGES by SPEAKER
,----------------------------------------------------------------.
| ./exp/textnorm/text.tn |
|----------------------------------------------------------------|
| SPKR | # Snt # Wrd | Corr Sub Del Ins Err S.Err |
|------+--------------+------------------------------------------|
| utt | 125 2254 | 2241 2 11 2 15 4 |
|================================================================|
| Sum | 125 2254 | 2241 2 11 2 15 4 |
|================================================================|
| Mean |125.0 2254.0 |2241.0 2.0 11.0 2.0 15.0 4.0 |
| S.D. | 0.0 0.0 | 0.0 0.0 0.0 0.0 0.0 0.0 |
|Median|125.0 2254.0 |2241.0 2.0 11.0 2.0 15.0 4.0 |
`----------------------------------------------------------------'
SYSTEM SUMMARY PERCENTAGES by SPEAKER
,-----------------------------------------------------------------.
| ./exp/textnorm/text.tn |
|-----------------------------------------------------------------|
| SPKR | # Snt # Wrd | Corr Sub Del Ins Err S.Err |
|--------+--------------+-----------------------------------------|
| utt | 125 2254 | 99.4 0.1 0.5 0.1 0.7 3.2 |
|=================================================================|
| Sum/Avg| 125 2254 | 99.4 0.1 0.5 0.1 0.7 3.2 |
|=================================================================|
| Mean |125.0 2254.0 | 99.4 0.1 0.5 0.1 0.7 3.2 |
| S.D. | 0.0 0.0 | 0.0 0.0 0.0 0.0 0.0 0.0 |
| Median |125.0 2254.0 | 99.4 0.1 0.5 0.1 0.7 3.2 |
`-----------------------------------------------------------------'
```
