Merge branch 'develop' of https://github.com/PaddlePaddle/DeepSpeech into doc

3 years ago · 50cf88b7f1
parent 649fcc4c16 f598df0c0b
commit 50cf88b7f1
21 changed files with 989 additions and 290 deletions
--- a/README.md
+++ b/README.md
@@ -124,7 +124,7 @@ avg.sh best exp/deepspeech2/checkpoints 1
 ./local/test.sh conf/deepspeech2.yaml exp/deepspeech2/checkpoints/avg_1 offline
 ```

-For **Text-To-Speech**, try pretrained FastSpeech2 + Parallel WaveGAN on CSMSC:
+For **Text-to-Speech**, try pretrained FastSpeech2 + Parallel WaveGAN on CSMSC:
 ```shell
 cd examples/csmsc/tts3
 # download the pretrained models and unaip them
@@ -150,7 +150,7 @@ python3 ${BIN_DIR}/synthesize_e2e.py \
  --phones-dict=fastspeech2_nosil_baker_ckpt_0.4/phone_id_map.txt
 ```

-If you want to try more functions like training and tuning, please see [Speech-to-Text Quick Start](./docs/source/asr/quick_start.md) and [Text-To-Speech Quick Start](./docs/source/tts/quick_start.md).
+If you want to try more functions like training and tuning, please see [Speech-to-Text Quick Start](./docs/source/asr/quick_start.md) and [Text-to-Speech Quick Start](./docs/source/tts/quick_start.md).

 ## Model List

--- a/docs/source/asr/quick_start.md
+++ b/docs/source/asr/quick_start.md
@@ -1,4 +1,4 @@
-# Quick Start of Speech-To-Text
+# Quick Start of Speech-to-Text
 Several shell scripts provided in `./examples/tiny/local` will help us to quickly give it a try, for most major modules, including data preparation, model training, case inference and model evaluation, with a few public dataset (e.g. [LibriSpeech](http://www.openslr.org/12/), [Aishell](http://www.openslr.org/33)). Reading these examples will also help you to understand how to make it work with your own data.

 Some of the scripts in `./examples` are not configured with GPUs. If you want to train with 8 GPUs, please modify `CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7`. If you don't have any GPU available, please set `CUDA_VISIBLE_DEVICES=` to use CPUs instead. Besides, if out-of-memory problem occurs, just reduce `batch_size` to fit.
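
The `CUDA_VISIBLE_DEVICES` settings described in the quick-start text above can be sketched in Python. This is only an illustration, not code from the repository: `visible_devices` is a hypothetical helper, and the sketch assumes the variable is set before the training framework initializes CUDA.

```python
import os


def visible_devices(num_gpus: int) -> str:
    """Build a CUDA_VISIBLE_DEVICES value for the first `num_gpus` GPUs.

    Hypothetical helper: an empty string hides every GPU, which is the
    CPU-only setting mentioned in the quick start.
    """
    if num_gpus < 0:
        raise ValueError("num_gpus must be non-negative")
    return ",".join(str(i) for i in range(num_gpus))


# Eight GPUs, matching the example in the text; use visible_devices(0) for CPU.
os.environ["CUDA_VISIBLE_DEVICES"] = visible_devices(8)
```

If an out-of-memory error still occurs after choosing the devices, the text's advice applies unchanged: lower `batch_size` in the config.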
--- a/docs/source/introduction.md
+++ b/docs/source/introduction.md
@@ -50,7 +50,7 @@ PaddleSpeech TTS provides you with a complete TTS pipeline, including:
    - Parallel WaveGAN
    - WaveFlow
 - Voice Cloning
-    - Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis
+    - Transfer Learning from Speaker Verification to Multispeaker Text-to-Speech Synthesis
    - GE2E

 Text-to-Speech  helps you to train TTS models with simple commands.
--- a/docs/source/reference.md
+++ b/docs/source/reference.md
@@ -1,13 +1,13 @@
 # Reference

-We borrowed a lot of code from these repos to build `model` and `engine`, thank for these great work and opensource community!
+We borrowed a lot of code from these repos to build `model` and `engine`, thanks for these great works and opensource community!

 * [espnet](https://github.com/espnet/espnet/blob/master/LICENSE)
 - Apache-2.0 License
 - python/shell `utils`
 - kaldi feat preprocessing
-- datapipeline and `transform`
-- a lot of tts model, like `fastspeech2` and GAN-based `vocoder`
+- data pipe line and `transform`
+- some tts models, like `fastspeech2` and GAN-based `vocoder`

 * [wenet](https://github.com/wenet-e2e/wenet/blob/main/LICENSE)
 - Apache-2.0 License
@@ -30,7 +30,7 @@ We borrowed a lot of code from these repos to build `model` and `engine`, thank

 * [chainer](https://github.com/chainer/chainer/blob/master/LICENSE)
 - MIT License
-- Updater, Trainer and more utils.
+- Updater, Trainer and some utils.

 * [librosa](https://github.com/librosa/librosa/blob/main/LICENSE.md)
 - ISC License
--- a/docs/source/tts/README.md
+++ b/docs/source/tts/README.md
@@ -35,7 +35,7 @@ In order to facilitate exploiting the existing TTS models directly and developin
  - [【Parallel WaveGAN】Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram](https://arxiv.org/abs/1910.11480)
  - [【WaveFlow】WaveFlow: A Compact Flow-based Model for Raw Audio](https://arxiv.org/abs/1912.01219)
 - Voice Cloning
-  - [Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis](https://arxiv.org/pdf/1806.04558v4.pdf)
+  - [Transfer Learning from Speaker Verification to Multispeaker Text-to-Speech Synthesis](https://arxiv.org/pdf/1806.04558v4.pdf)
  - [【GE2E】Generalized End-to-End Loss for Speaker Verification](https://arxiv.org/abs/1710.10467)

 ## Setup
--- a/docs/source/tts/quick_start.md
+++ b/docs/source/tts/quick_start.md
@@ -1,4 +1,4 @@
-# Quick Start of Text-To-Speech
+# Quick Start of Text-to-Speech
 The examples in PaddleSpeech are mainly classified by datasets, the TTS datasets we mainly used are:
 * CSMCS (Mandarin single speaker)
 * AISHELL3 (Mandarin multiple speaker)
--- a/docs/tutorial/tts/source/tts-timeline.png
+++ b/docs/tutorial/tts/source/tts-timeline.png
--- a/docs/tutorial/tts/source/wechat-group.png
+++ b/docs/tutorial/tts/source/wechat-group.png
--- a/docs/tutorial/tts/tts_tutorial.ipynb
+++ b/docs/tutorial/tts/tts_tutorial.ipynb
--- a/examples/aishell/asr0/README.md
+++ b/examples/aishell/asr0/README.md
@@ -1,10 +1,10 @@
 # Aishell-1

-## Deepspeech2
+## Deepspeech2 Non-Streaming

 | Model | Params | Release | Config | Test set | Loss | CER |  
 | --- | --- | --- | --- | --- | --- | --- |  
-| DeepSpeech2 | 58.4M | 2.2.0 | conf/deepspeech2.yaml + spec aug | test | 5.71956205368042 | 0.064287 |  
+| DeepSpeech2 | 58.4M | 2.2.0 | conf/deepspeech2.yaml + spec aug | test | 5.738585948944092 | 0.064000 |  
 | DeepSpeech2 | 58.4M | 2.1.0 | conf/deepspeech2.yaml + spec aug | test | 7.483316898345947 | 0.077860 |  
 | DeepSpeech2 | 58.4M | 2.1.0 | conf/deepspeech2.yaml | test | 7.299022197723389 | 0.078671 |
 | DeepSpeech2 | 58.4M | 2.0.0 | conf/deepspeech2.yaml | test | - | 0.078977 |  
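
The CER column in the table above is a character error rate. As a rough sketch of the conventional definition (character-level edit distance divided by reference length), and not the repository's own scoring code, it can be computed as follows. The function names are hypothetical, and corpus-level scores are usually obtained by summing errors and reference lengths over all utterances before dividing.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert/delete/substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate of one hypothesis against one reference."""
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)
```

Under this definition, a CER of 0.064 corresponds to roughly 6.4 character errors per 100 reference characters.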
--- a/examples/aishell/asr0/utils
+++ b/examples/aishell/asr0/utils
@@ -0,0 +1 @@
+../../../utils/
--- a/examples/aishell3/vc1/README.md
+++ b/examples/aishell3/vc1/README.md
@@ -1,3 +1,4 @@
+
 # FastSpeech2 + AISHELL-3 Voice Cloning
 This example contains code used to train a [FastSpeech2](https://arxiv.org/abs/2006.04558) model with [AISHELL-3](http://www.aishelltech.com/aishell_3). The trained model can be used in Voice Cloning Task, We refer to the model structure of  [Transfer Learning from Speaker Veriﬁcation to Multispeaker Text-To-Speech Synthesis](https://arxiv.org/pdf/1806.04558.pdf) . The general steps are as follows:
 1. Speaker Encoder: We  use a Speaker Verification to train a speaker encoder. Datasets used in this task are different from those used in `FastSpeech2`, because the  transcriptions are not needed, we use more datasets, refer to  [ge2e](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/ge2e).
@@ -121,6 +122,10 @@ CUDA_VISIBLE_DEVICES=${gpus} ./local/voice_cloning.sh ${conf_path} ${train_outpu
 ## Pretrained Model
 [fastspeech2_nosil_aishell3_vc1_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_aishell3_vc1_ckpt_0.5.zip)

+Model | Step | eval/loss | eval/l1_loss | eval/duration_loss | eval/pitch_loss| eval/energy_loss 
+:-------------:| :------------:| :-----: | :-----: | :--------: |:--------:|:---------:
+default|2(gpu) x 96400|0.99699|0.62013|0.53057|0.11954| 0.20426|
+
 FastSpeech2 checkpoint contains files listed below.
 (There is no need for `speaker_id_map.txt` here )

--- a/examples/aishell3/voc1/README.md
+++ b/examples/aishell3/voc1/README.md
@@ -138,6 +138,10 @@ optional arguments:
 ## Pretrained Models
 Pretrained models can be downloaded here [pwg_aishell3_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_aishell3_ckpt_0.5.zip).

+Model | Step | eval/generator_loss | eval/log_stft_magnitude_loss:| eval/spectral_convergence_loss 
+:-------------:| :------------:| :-----: | :-----: | :--------:
+default| 1(gpu) x 400000|1.968762|0.759008|0.218524
+
 Parallel WaveGAN checkpoint contains files listed below.

 ```text
--- a/examples/csmsc/tts2/README.md
+++ b/examples/csmsc/tts2/README.md
@@ -216,6 +216,10 @@ Pretrained SpeedySpeech model with no silence in the edge of audios[speedyspeech

 Static model can be downloaded here [speedyspeech_nosil_baker_static_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/speedyspeech/speedyspeech_nosil_baker_static_0.5.zip).

+Model | Step | eval/loss | eval/l1_loss | eval/duration_loss | eval/ssim_loss
+:-------------:| :------------:| :-----: | :-----: | :--------:|:--------:
+default| 1(gpu) x 11400|0.83655|0.42324|0.03211| 0.38119
+
 SpeedySpeech checkpoint contains files listed below.
 ```text
 speedyspeech_nosil_baker_ckpt_0.5
--- a/examples/csmsc/tts3/README.md
+++ b/examples/csmsc/tts3/README.md
@@ -207,6 +207,11 @@ Pretrained FastSpeech2 model with no silence in the edge of audios [fastspeech2_

 Static model can be downloaded here [fastspeech2_nosil_baker_static_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_baker_static_0.4.zip).

+Model | Step | eval/loss | eval/l1_loss | eval/duration_loss | eval/pitch_loss| eval/energy_loss 
+:-------------:| :------------:| :-----: | :-----: | :--------: |:--------:|:---------:
+default| 2(gpu) x 76000|1.0991|0.59132|0.035815| 0.31915| 0.15287|
+conformer| 2(gpu) x 76000||||||
+
 FastSpeech2 checkpoint contains files listed below.
 ```text
 fastspeech2_nosil_baker_ckpt_0.4
--- a/examples/csmsc/voc1/README.md
+++ b/examples/csmsc/voc1/README.md
@@ -130,6 +130,10 @@ Pretrained model can be downloaded here [pwg_baker_ckpt_0.4.zip](https://paddles

 Static model can be downloaded here [pwg_baker_static_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_baker_static_0.4.zip).

+Model | Step | eval/generator_loss | eval/log_stft_magnitude_loss:| eval/spectral_convergence_loss 
+:-------------:| :------------:| :-----: | :-----: | :--------:
+default| 1(gpu) x 400000|1.948763|0.670098|0.248882
+
 Parallel WaveGAN checkpoint contains files listed below.

 ```text
--- a/examples/csmsc/voc3/README.md
+++ b/examples/csmsc/voc3/README.md
@@ -157,6 +157,12 @@ Finetuned model can ben downloaded here [mb_melgan_baker_finetune_ckpt_0.5.zip](

 Static model can be downloaded here [mb_melgan_baker_static_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/mb_melgan/mb_melgan_baker_static_0.5.zip)

+Model | Step | eval/generator_loss | eval/log_stft_magnitude_loss|eval/spectral_convergence_loss |eval/sub_log_stft_magnitude_loss|eval/sub_spectral_convergence_loss
+:-------------:| :------------:| :-----: | :-----: | :--------:| :--------:| :--------:
+default| 1(gpu) x 1000000| ——|—— |—— |—— | ——|
+finetune| 1(gpu) x 1000000|3.196967|0.977804| 0.778484| 0.889576 |0.776756 |
+
+
 Multi Band MelGAN checkpoint contains files listed below.

 ```text
--- a/examples/librispeech/asr1/conf/chunk_transformer.yaml
+++ b/examples/librispeech/asr1/conf/chunk_transformer.yaml
@@ -11,9 +11,9 @@ data:
  max_output_input_ratio: 100.0

 collator:
-  vocab_filepath: data/vocab.txt 
+  vocab_filepath: data/lang_char/vocab.txt 
  unit_type: 'spm'
-  spm_model_prefix: 'data/bpe_unigram_5000'
+  spm_model_prefix: 'data/lang_char/bpe_unigram_5000'
  mean_std_filepath: ""
  augmentation_config: conf/preprocess.yaml
  batch_size: 64
--- a/examples/librispeech/asr2/conf/preprocess.yaml
+++ b/examples/librispeech/asr2/conf/preprocess.yaml
@@ -0,0 +1,16 @@
+process:
+  # these three processes are a.k.a. SpecAugument
+  - type: time_warp
+    max_time_warp: 5
+    inplace: true
+    mode: PIL
+  - type: freq_mask
+    F: 30
+    n_mask: 2
+    inplace: true
+    replace_with_zero: false
+  - type: time_mask
+    T: 40
+    n_mask: 2
+    inplace: true
+    replace_with_zero: false
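
The `freq_mask` and `time_mask` entries in the config above can be illustrated with a small pure-Python sketch. It is not the repository's transform code: `mask_axis` is a hypothetical function, the spectrogram is assumed to be time-major (a list of frames, each a list of frequency bins), and filling with the input mean when `replace_with_zero` is false is an assumption about the intended behaviour. The `time_warp` step is omitted.

```python
import random


def mask_axis(spec, axis, max_width, n_mask, replace_with_zero, rng):
    """Return a copy of `spec` with `n_mask` random bands masked.

    axis is "time" (mask frames) or "freq" (mask bins); each band has a
    random width in [0, max_width], clipped to the axis length.
    """
    n_frames, n_bins = len(spec), len(spec[0])
    size = n_frames if axis == "time" else n_bins
    # Fill value: zero, or (assumed) the mean of the input spectrogram.
    fill = 0.0 if replace_with_zero else sum(map(sum, spec)) / (n_frames * n_bins)
    out = [row[:] for row in spec]  # copy, so the input is left untouched
    for _ in range(n_mask):
        width = rng.randint(0, min(max_width, size))
        start = rng.randint(0, size - width)
        for t in range(n_frames):
            for f in range(n_bins):
                idx = t if axis == "time" else f
                if start <= idx < start + width:
                    out[t][f] = fill
    return out


# Toy 200-frame, 80-bin input with the widths and counts from the config.
rng = random.Random(0)
spec = [[float(t + f) for f in range(80)] for t in range(200)]
masked = mask_axis(spec, "freq", max_width=30, n_mask=2,
                   replace_with_zero=False, rng=rng)   # F: 30, n_mask: 2
masked = mask_axis(masked, "time", max_width=40, n_mask=2,
                   replace_with_zero=False, rng=rng)   # T: 40, n_mask: 2
```

The sketch copies its input rather than modifying it, whereas the config sets `inplace: true`; the copy keeps the example easy to check.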
--- a/examples/librispeech/asr2/conf/transformer.yaml
+++ b/examples/librispeech/asr2/conf/transformer.yaml
@@ -57,7 +57,7 @@ collator:
  batch_frames_in: 0
  batch_frames_out: 0
  batch_frames_inout: 0
-  augmentation_config: conf/augmentation.json
+  augmentation_config: conf/preprocess.yaml 
  num_workers: 0
  subsampling_factor: 1
  num_encs: 1
--- a/examples/ljspeech/tts3/README.md
+++ b/examples/ljspeech/tts3/README.md
@@ -197,6 +197,11 @@ optional arguments:
 ## Pretrained Model
 Pretrained FastSpeech2 model with no silence in the edge of audios. [fastspeech2_nosil_ljspeech_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_ljspeech_ckpt_0.5.zip)

+Model | Step | eval/loss | eval/l1_loss | eval/duration_loss | eval/pitch_loss| eval/energy_loss 
+:-------------:| :------------:| :-----: | :-----: | :--------: |:--------:|:---------:
+default| 2(gpu) x 100000| 1.505682|0.612104| 0.045505| 0.62792| 0.220147
+
+
 FastSpeech2 checkpoint contains files listed below.
 ```text
 fastspeech2_nosil_ljspeech_ckpt_0.5