pull/1021/head
huangyuxin 3 years ago
commit 50cf88b7f1

@ -124,7 +124,7 @@ avg.sh best exp/deepspeech2/checkpoints 1
./local/test.sh conf/deepspeech2.yaml exp/deepspeech2/checkpoints/avg_1 offline
```
For **Text-To-Speech**, try pretrained FastSpeech2 + Parallel WaveGAN on CSMSC:
For **Text-to-Speech**, try pretrained FastSpeech2 + Parallel WaveGAN on CSMSC:
```shell
cd examples/csmsc/tts3
# download the pretrained models and unzip them
@ -150,7 +150,7 @@ python3 ${BIN_DIR}/synthesize_e2e.py \
--phones-dict=fastspeech2_nosil_baker_ckpt_0.4/phone_id_map.txt
```
If you want to try more functions like training and tuning, please see [Speech-to-Text Quick Start](./docs/source/asr/quick_start.md) and [Text-To-Speech Quick Start](./docs/source/tts/quick_start.md).
If you want to try more functions like training and tuning, please see [Speech-to-Text Quick Start](./docs/source/asr/quick_start.md) and [Text-to-Speech Quick Start](./docs/source/tts/quick_start.md).
## Model List

@ -1,4 +1,4 @@
# Quick Start of Speech-To-Text
# Quick Start of Speech-to-Text
Several shell scripts provided in `./examples/tiny/local` help you quickly try out most major modules, including data preparation, model training, case inference and model evaluation, with a few public datasets (e.g. [LibriSpeech](http://www.openslr.org/12/), [Aishell](http://www.openslr.org/33)). Reading these examples will also help you understand how to make it work with your own data.
Some of the scripts in `./examples` are not configured with GPUs. If you want to train with 8 GPUs, please set `CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7`. If you don't have any GPU available, please set `CUDA_VISIBLE_DEVICES=` to use CPUs instead. If an out-of-memory error occurs, reduce `batch_size` to fit.
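As a rough illustration of the two `CUDA_VISIBLE_DEVICES` settings mentioned above, the sketch below parses such a value into a list of GPU indices. The helper name `visible_gpu_ids` is hypothetical and not part of PaddleSpeech, and the sketch assumes the value holds plain integer indices (the variable can also hold device UUIDs, which are not handled here).

```python
import os


def visible_gpu_ids(value):
    """Parse a CUDA_VISIBLE_DEVICES-style string into a list of GPU indices.

    Hypothetical helper for illustration only; assumes integer indices.
    """
    if value is None:
        return None  # variable unset: no restriction is expressed
    # An empty string yields an empty list, i.e. no GPU is visible (CPU only).
    return [int(tok) for tok in value.split(",") if tok.strip()]


if __name__ == "__main__":
    ids = visible_gpu_ids(os.environ.get("CUDA_VISIBLE_DEVICES"))
    if ids is None:
        print("CUDA_VISIBLE_DEVICES is not set")
    elif not ids:
        print("no GPU visible, running on CPU")
    else:
        print(f"{len(ids)} GPU(s) visible: {ids}")
```

With `CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7` the helper returns eight indices; with `CUDA_VISIBLE_DEVICES=` it returns an empty list.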

@ -50,7 +50,7 @@ PaddleSpeech TTS provides you with a complete TTS pipeline, including:
- Parallel WaveGAN
- WaveFlow
- Voice Cloning
- Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis
- Transfer Learning from Speaker Verification to Multispeaker Text-to-Speech Synthesis
- GE2E
Text-to-Speech helps you to train TTS models with simple commands.

@ -1,13 +1,13 @@
# Reference
We borrowed a lot of code from these repos to build `model` and `engine`, thank for these great work and opensource community!
We borrowed a lot of code from these repos to build `model` and `engine`, thanks for these great works and opensource community!
* [espnet](https://github.com/espnet/espnet/blob/master/LICENSE)
- Apache-2.0 License
- python/shell `utils`
- kaldi feat preprocessing
- datapipeline and `transform`
- a lot of tts model, like `fastspeech2` and GAN-based `vocoder`
- data pipe line and `transform`
- some tts models, like `fastspeech2` and GAN-based `vocoder`
* [wenet](https://github.com/wenet-e2e/wenet/blob/main/LICENSE)
- Apache-2.0 License
@ -30,7 +30,7 @@ We borrowed a lot of code from these repos to build `model` and `engine`, thank
* [chainer](https://github.com/chainer/chainer/blob/master/LICENSE)
- MIT License
- Updater, Trainer and more utils.
- Updater, Trainer and some utils.
* [librosa](https://github.com/librosa/librosa/blob/main/LICENSE.md)
- ISC License

@ -35,7 +35,7 @@ In order to facilitate exploiting the existing TTS models directly and developin
- [【Parallel WaveGAN】Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram](https://arxiv.org/abs/1910.11480)
- [【WaveFlow】WaveFlow: A Compact Flow-based Model for Raw Audio](https://arxiv.org/abs/1912.01219)
- Voice Cloning
- [Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis](https://arxiv.org/pdf/1806.04558v4.pdf)
- [Transfer Learning from Speaker Verification to Multispeaker Text-to-Speech Synthesis](https://arxiv.org/pdf/1806.04558v4.pdf)
- [【GE2E】Generalized End-to-End Loss for Speaker Verification](https://arxiv.org/abs/1710.10467)
## Setup

@ -1,4 +1,4 @@
# Quick Start of Text-To-Speech
# Quick Start of Text-to-Speech
The examples in PaddleSpeech are mainly classified by dataset. The TTS datasets we mainly use are:
* CSMSC (Mandarin single speaker)
* AISHELL3 (Mandarin multiple speakers)

Binary file not shown (image, 212 KiB).

Binary file not shown (image, 52 KiB).

File diff suppressed because one or more lines are too long

@ -1,10 +1,10 @@
# Aishell-1
## Deepspeech2
## Deepspeech2 Non-Streaming
| Model | Params | Release | Config | Test set | Loss | CER |
| --- | --- | --- | --- | --- | --- | --- |
| DeepSpeech2 | 58.4M | 2.2.0 | conf/deepspeech2.yaml + spec aug | test | 5.71956205368042 | 0.064287 |
| DeepSpeech2 | 58.4M | 2.2.0 | conf/deepspeech2.yaml + spec aug | test | 5.738585948944092 | 0.064000 |
| DeepSpeech2 | 58.4M | 2.1.0 | conf/deepspeech2.yaml + spec aug | test | 7.483316898345947 | 0.077860 |
| DeepSpeech2 | 58.4M | 2.1.0 | conf/deepspeech2.yaml | test | 7.299022197723389 | 0.078671 |
| DeepSpeech2 | 58.4M | 2.0.0 | conf/deepspeech2.yaml | test | - | 0.078977 |
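For readers unfamiliar with the CER column, the sketch below shows the usual definition of character error rate: the character-level edit distance between reference and hypothesis, divided by the reference length. This is a minimal illustration under that assumption; the function names are hypothetical, and the scoring script used for the table may differ in text normalization or in how it aggregates over the test set.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitute/insert/delete cost 1)."""
    prev = list(range(len(hyp) + 1))  # distances from the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting the first i reference characters
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate of a hypothesis against a non-empty reference."""
    if not ref:
        raise ValueError("reference must be non-empty")
    return edit_distance(ref, hyp) / len(ref)


if __name__ == "__main__":
    # One substituted character out of six reference characters.
    print(cer("今天天气很好", "今天天汽很好"))
```

Under this definition, a CER of 0.064 corresponds to roughly 6.4 character errors per 100 reference characters.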

@ -0,0 +1 @@
../../../utils/

@ -1,3 +1,4 @@
# FastSpeech2 + AISHELL-3 Voice Cloning
This example contains code used to train a [FastSpeech2](https://arxiv.org/abs/2006.04558) model with [AISHELL-3](http://www.aishelltech.com/aishell_3). The trained model can be used for the voice cloning task. We follow the model structure of [Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis](https://arxiv.org/pdf/1806.04558.pdf). The general steps are as follows:
1. Speaker Encoder: We use a speaker verification task to train a speaker encoder. The datasets used in this task differ from those used for `FastSpeech2`: because transcriptions are not needed, we use more datasets. See [ge2e](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/ge2e).
@ -121,6 +122,10 @@ CUDA_VISIBLE_DEVICES=${gpus} ./local/voice_cloning.sh ${conf_path} ${train_outpu
## Pretrained Model
[fastspeech2_nosil_aishell3_vc1_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_aishell3_vc1_ckpt_0.5.zip)
Model | Step | eval/loss | eval/l1_loss | eval/duration_loss | eval/pitch_loss| eval/energy_loss
:-------------:| :------------:| :-----: | :-----: | :--------: |:--------:|:---------:
default|2(gpu) x 96400|0.99699|0.62013|0.53057|0.11954| 0.20426|
FastSpeech2 checkpoint contains files listed below.
(There is no need for `speaker_id_map.txt` here.)

@ -138,6 +138,10 @@ optional arguments:
## Pretrained Models
Pretrained models can be downloaded here [pwg_aishell3_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_aishell3_ckpt_0.5.zip).
Model | Step | eval/generator_loss | eval/log_stft_magnitude_loss | eval/spectral_convergence_loss
:-------------:| :------------:| :-----: | :-----: | :--------:
default| 1(gpu) x 400000|1.968762|0.759008|0.218524
Parallel WaveGAN checkpoint contains files listed below.
```text

@ -216,6 +216,10 @@ Pretrained SpeedySpeech model with no silence in the edge of audios[speedyspeech
Static model can be downloaded here [speedyspeech_nosil_baker_static_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/speedyspeech/speedyspeech_nosil_baker_static_0.5.zip).
Model | Step | eval/loss | eval/l1_loss | eval/duration_loss | eval/ssim_loss
:-------------:| :------------:| :-----: | :-----: | :--------:|:--------:
default| 1(gpu) x 11400|0.83655|0.42324|0.03211| 0.38119
SpeedySpeech checkpoint contains files listed below.
```text
speedyspeech_nosil_baker_ckpt_0.5

@ -207,6 +207,11 @@ Pretrained FastSpeech2 model with no silence in the edge of audios [fastspeech2_
Static model can be downloaded here [fastspeech2_nosil_baker_static_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_baker_static_0.4.zip).
Model | Step | eval/loss | eval/l1_loss | eval/duration_loss | eval/pitch_loss| eval/energy_loss
:-------------:| :------------:| :-----: | :-----: | :--------: |:--------:|:---------:
default| 2(gpu) x 76000|1.0991|0.59132|0.035815| 0.31915| 0.15287|
conformer| 2(gpu) x 76000||||||
FastSpeech2 checkpoint contains files listed below.
```text
fastspeech2_nosil_baker_ckpt_0.4

@ -130,6 +130,10 @@ Pretrained model can be downloaded here [pwg_baker_ckpt_0.4.zip](https://paddles
Static model can be downloaded here [pwg_baker_static_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_baker_static_0.4.zip).
Model | Step | eval/generator_loss | eval/log_stft_magnitude_loss | eval/spectral_convergence_loss
:-------------:| :------------:| :-----: | :-----: | :--------:
default| 1(gpu) x 400000|1.948763|0.670098|0.248882
Parallel WaveGAN checkpoint contains files listed below.
```text

@ -157,6 +157,12 @@ Finetuned model can be downloaded here [mb_melgan_baker_finetune_ckpt_0.5.zip](
Static model can be downloaded here [mb_melgan_baker_static_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/mb_melgan/mb_melgan_baker_static_0.5.zip)
Model | Step | eval/generator_loss | eval/log_stft_magnitude_loss | eval/spectral_convergence_loss | eval/sub_log_stft_magnitude_loss | eval/sub_spectral_convergence_loss
:-------------:| :------------:| :-----: | :-----: | :--------:| :--------:| :--------:
default| 1(gpu) x 1000000| - | - | - | - | - |
finetune| 1(gpu) x 1000000|3.196967|0.977804| 0.778484| 0.889576 |0.776756 |
Multi Band MelGAN checkpoint contains files listed below.
```text

@ -11,9 +11,9 @@ data:
max_output_input_ratio: 100.0
collator:
vocab_filepath: data/vocab.txt
vocab_filepath: data/lang_char/vocab.txt
unit_type: 'spm'
spm_model_prefix: 'data/bpe_unigram_5000'
spm_model_prefix: 'data/lang_char/bpe_unigram_5000'
mean_std_filepath: ""
augmentation_config: conf/preprocess.yaml
batch_size: 64

@ -0,0 +1,16 @@
process:
  # these three processes are a.k.a. SpecAugment
  - type: time_warp
    max_time_warp: 5
    inplace: true
    mode: PIL
  - type: freq_mask
    F: 30
    n_mask: 2
    inplace: true
    replace_with_zero: false
  - type: time_mask
    T: 40
    n_mask: 2
    inplace: true
    replace_with_zero: false
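
To make the `freq_mask` and `time_mask` entries above more concrete, here is a minimal pure-Python sketch of band masking on a spectrogram stored as a list of frames. It is not the implementation this config drives: the function name `mask_along_axis` is hypothetical, the band width is assumed to be drawn uniformly from 0 to the maximum (`F` or `T`), and `replace_with_zero: false` is assumed to mean filling with the spectrogram mean. Time warping is left out.

```python
import random


def mask_along_axis(spec, axis, max_width, n_mask, replace_with_zero=False, rng=random):
    """Return a copy of spec with n_mask random bands overwritten.

    spec is a list of time frames, each a list of frequency bins.
    axis=0 masks time frames (time_mask), axis=1 masks frequency bins (freq_mask).
    """
    n_frames, n_bins = len(spec), len(spec[0])
    size = n_frames if axis == 0 else n_bins
    # Assumption: when not zeroing, masked cells take the mean of the input.
    fill = 0.0 if replace_with_zero else sum(map(sum, spec)) / (n_frames * n_bins)
    out = [row[:] for row in spec]
    for _ in range(n_mask):
        width = rng.randint(0, min(max_width, size))  # band width, inclusive bounds
        start = rng.randint(0, size - width)          # first masked index
        for t in range(n_frames):
            for f in range(n_bins):
                idx = t if axis == 0 else f
                if start <= idx < start + width:
                    out[t][f] = fill
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    spec = [[float(t + f) for f in range(80)] for t in range(100)]
    spec = mask_along_axis(spec, axis=1, max_width=30, n_mask=2, rng=rng)  # F: 30
    spec = mask_along_axis(spec, axis=0, max_width=40, n_mask=2, rng=rng)  # T: 40
    print(len(spec), len(spec[0]))
```

The two calls in the usage part mirror `n_mask: 2` with `F: 30` and `T: 40` from the config; the input shape is preserved, and only the values inside the masked bands change.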

@ -57,7 +57,7 @@ collator:
batch_frames_in: 0
batch_frames_out: 0
batch_frames_inout: 0
augmentation_config: conf/augmentation.json
augmentation_config: conf/preprocess.yaml
num_workers: 0
subsampling_factor: 1
num_encs: 1

@ -197,6 +197,11 @@ optional arguments:
## Pretrained Model
Pretrained FastSpeech2 model with no silence in the edge of audios. [fastspeech2_nosil_ljspeech_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_ljspeech_ckpt_0.5.zip)
Model | Step | eval/loss | eval/l1_loss | eval/duration_loss | eval/pitch_loss| eval/energy_loss
:-------------:| :------------:| :-----: | :-----: | :--------: |:--------:|:---------:
default| 2(gpu) x 100000| 1.505682|0.612104| 0.045505| 0.62792| 0.220147
FastSpeech2 checkpoint contains files listed below.
```text
fastspeech2_nosil_ljspeech_ckpt_0.5
