[TTS]add ljspeech new tacotron2 (#1416)

* add ljspeech new tacotron2, test=tts

* update ljspeech waveflow's synthesize

* add config, test=doc

Co-authored-by: Hui Zhang <zhtclz@foxmail.com>

@@ -1,89 +0,0 @@
# Tacotron2 with LJSpeech
PaddlePaddle dynamic graph implementation of Tacotron2, a neural network architecture for speech synthesis directly from text. The implementation is based on [Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions](https://arxiv.org/abs/1712.05884).
## Dataset
We experiment with the LJSpeech dataset. Download and unzip [LJSpeech](https://keithito.com/LJ-Speech-Dataset/).
```bash
wget https://data.keithito.com/data/speech/LJSpeech-1.1.tar.bz2
tar xjvf LJSpeech-1.1.tar.bz2
```
## Get Started
Assume the path to the dataset is `~/datasets/LJSpeech-1.1`.
Run the command below to
1. **source path**.
2. preprocess the dataset.
3. train the model.
4. synthesize mels.
```bash
./run.sh
```
You can choose a range of stages to run, or set `stage` equal to `stop-stage` to run only one stage. For example, the following command only preprocesses the dataset.
```bash
./run.sh --stage 0 --stop-stage 0
```
### Data Preprocessing
```bash
./local/preprocess.sh ${conf_path}
```
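When preprocessing finishes, the processed data ends up under `dump`. The layout below is a sketch based on the paths used by `local/preprocess.sh` in this commit; check the actual directory after running it.
```text
dump/phone_id_map.txt            # phone-to-id mapping
dump/speaker_id_map.txt          # speaker-to-id mapping
dump/train/raw/metadata.jsonl    # raw features of the train split
dump/train/norm/metadata.jsonl   # normalized features of the train split
dump/train/speech_stats.npy      # mean/std statistics computed on the train split
dump/dev/... , dump/test/...     # same layout, normalized with the train split's stats
```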
### Model Training
`./local/train.sh` calls `${BIN_DIR}/train.py`.
```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path}
```
Here's the complete help message.
```text
usage: train.py [-h] [--config FILE] [--data DATA_DIR] [--output OUTPUT_DIR]
[--checkpoint_path CHECKPOINT_PATH] [--ngpu NGPU] [--opts ...]
optional arguments:
-h, --help show this help message and exit
--config FILE path of the config file to overwrite to default config
with.
--data DATA_DIR path to the dataset.
--output OUTPUT_DIR path to save checkpoint and logs.
--checkpoint_path CHECKPOINT_PATH
path of the checkpoint to load
--ngpu NGPU if ngpu == 0, use cpu.
--opts ... options to overwrite --config file and the default
config, passing in KEY VALUE pairs
```
If you want to train on CPU, just set `--ngpu=0`.
If you want to train on multiple GPUs, just set `--ngpu` to the number of GPUs you want to use.
By default, training resumes from the latest checkpoint in `--output`. If you want to start a new training run, use a new `${OUTPUTPATH}` that contains no checkpoints.
If you want to resume from a different model, set `checkpoint_path` to the path of the checkpoint you want to load.
**Note: The checkpoint path cannot contain the file extension.**
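For example, a CPU run that resumes from an existing checkpoint might look like this (the data and checkpoint paths are illustrative and match the defaults used elsewhere in this example; note that the checkpoint is passed without a file extension):
```bash
# illustrative values; adjust the paths to your own setup
python3 ${BIN_DIR}/train.py \
    --data=preprocessed_ljspeech \
    --output=output \
    --checkpoint_path=output/checkpoints/step-35000 \
    --ngpu=0
```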
### Synthesizing
`./local/synthesize.sh` calls `${BIN_DIR}/synthesize.py`, which synthesizes **mels** from a text file here.
```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${train_output_path} ${ckpt_name}
```
```text
usage: synthesize.py [-h] [--config FILE] [--checkpoint_path CHECKPOINT_PATH]
[--input INPUT] [--output OUTPUT] [--ngpu NGPU]
[--opts ...] [-v]
generate mel spectrogram with TransformerTTS.
optional arguments:
-h, --help show this help message and exit
--config FILE extra config to overwrite the default config
--checkpoint_path CHECKPOINT_PATH
path of the checkpoint to load.
--input INPUT path of the text sentences
--output OUTPUT path to save outputs
--ngpu NGPU if ngpu == 0, use cpu.
--opts ... options to overwrite --config file and the default
config, passing in KEY VALUE pairs
-v, --verbose print msg
```
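A minimal direct invocation, mirroring what `./local/synthesize.sh` passes for these arguments, might look like this (the output directory and checkpoint name are examples):
```bash
python3 ${BIN_DIR}/synthesize.py \
    --config=output/config.yaml \
    --checkpoint_path=output/checkpoints/step-35000 \
    --input=${BIN_DIR}/../sentences_en.txt \
    --output=output/test \
    --ngpu=1
```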
**P.S.** You can use [waveflow](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/ljspeech/voc0) as the neural vocoder to synthesize mels into wavs. (Please refer to `synthesize.sh` in our LJSpeech waveflow example.)
## Pretrained Models
Pretrained models can be downloaded from the links below. We provide two models with different configurations.
1. This model uses a binary classifier to predict the stop token. [tacotron2_ljspeech_ckpt_0.3.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/tacotron2/tacotron2_ljspeech_ckpt_0.3.zip)
2. This model does not have a stop token predictor. It uses the attention peak position to decide whether all the contents have been uttered. Also, guided attention loss is used to speed up training. This model is trained with `configs/alternative.yaml`. [tacotron2_ljspeech_ckpt_0.3_alternative.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/tacotron2/tacotron2_ljspeech_ckpt_0.3_alternative.zip)
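To try a pretrained model, download and unzip one of the archives, then point `--config` and `--checkpoint_path` of `synthesize.py` at the extracted config and checkpoint. The exact file names inside the archive are not listed here, so inspect it after unzipping:
```bash
wget https://paddlespeech.bj.bcebos.com/Parakeet/released_models/tacotron2/tacotron2_ljspeech_ckpt_0.3.zip
unzip tacotron2_ljspeech_ckpt_0.3.zip -d tacotron2_ljspeech_ckpt_0.3
ls tacotron2_ljspeech_ckpt_0.3   # inspect the extracted config and checkpoint files
```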

@@ -0,0 +1,87 @@
# This configuration is for Paddle to train Tacotron 2. Compared to the
# original paper, this configuration additionally uses the guided attention
# loss to accelerate the learning of the diagonal attention. It requires
# only a single GPU with 12 GB memory, and it takes ~1 day to finish the
# training on a Titan V.
###########################################################
# FEATURE EXTRACTION SETTING #
###########################################################
fs: 22050 # Sampling rate.
n_fft: 1024 # FFT size (samples).
n_shift: 256 # Hop size (samples). 11.6ms
win_length: null # Window length (samples).
# If set to null, it will be the same as n_fft.
window: "hann" # Window function.
n_mels: 80 # Number of mel basis.
fmin: 80 # Minimum freq in mel basis calculation. (Hz)
fmax: 7600 # Maximum frequency in mel basis calculation. (Hz)
###########################################################
# DATA SETTING #
###########################################################
batch_size: 64
num_workers: 2
###########################################################
# MODEL SETTING #
###########################################################
model: # keyword arguments for the selected model
embed_dim: 512 # char or phn embedding dimension
elayers: 1 # number of blstm layers in encoder
eunits: 512 # number of blstm units
econv_layers: 3 # number of convolutional layers in encoder
econv_chans: 512 # number of channels in convolutional layer
econv_filts: 5 # filter size of convolutional layer
atype: location # attention function type
adim: 512 # attention dimension
aconv_chans: 32 # number of channels in convolutional layer of attention
aconv_filts: 15 # filter size of convolutional layer of attention
cumulate_att_w: True # whether to cumulate attention weight
dlayers: 2 # number of lstm layers in decoder
dunits: 1024 # number of lstm units in decoder
prenet_layers: 2 # number of layers in prenet
prenet_units: 256 # number of units in prenet
postnet_layers: 5 # number of layers in postnet
postnet_chans: 512 # number of channels in postnet
postnet_filts: 5 # filter size of postnet layer
output_activation: null # activation function for the final output
use_batch_norm: True # whether to use batch normalization in encoder
use_concate: True # whether to concatenate encoder embedding with decoder outputs
use_residual: False # whether to use residual connection in encoder
dropout_rate: 0.5 # dropout rate
zoneout_rate: 0.1 # zoneout rate
reduction_factor: 1 # reduction factor
spk_embed_dim: null # speaker embedding dimension
###########################################################
# UPDATER SETTING #
###########################################################
updater:
use_masking: True # whether to apply masking for padded part in loss calculation
bce_pos_weight: 5.0 # weight of positive sample in binary cross entropy calculation
use_guided_attn_loss: True # whether to use guided attention loss
guided_attn_loss_sigma: 0.4 # sigma of guided attention loss
guided_attn_loss_lambda: 1.0 # strength of guided attention loss
##########################################################
# OPTIMIZER SETTING #
##########################################################
optimizer:
optim: adam # optimizer type
learning_rate: 1.0e-03 # learning rate
epsilon: 1.0e-06 # epsilon
weight_decay: 0.0 # weight decay coefficient
###########################################################
# TRAINING SETTING #
###########################################################
max_epoch: 300
num_snapshots: 5
###########################################################
# OTHER SETTING #
###########################################################
seed: 42

@@ -1,8 +1,62 @@
#!/bin/bash
preprocess_path=$1
stage=0
stop_stage=100
python3 ${BIN_DIR}/preprocess.py \
--input=~/datasets/LJSpeech-1.1 \
--output=${preprocess_path} \
-v \
config_path=$1
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
# get durations from MFA's result
echo "Generate durations.txt from MFA results ..."
python3 ${MAIN_ROOT}/utils/gen_duration_from_textgrid.py \
--inputdir=./ljspeech_alignment \
--output=durations.txt \
--config=${config_path}
fi
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
# extract features
echo "Extract features ..."
python3 ${BIN_DIR}/preprocess.py \
--dataset=ljspeech \
--rootdir=~/datasets/LJSpeech-1.1/ \
--dumpdir=dump \
--dur-file=durations.txt \
--config=${config_path} \
--num-cpu=20 \
--cut-sil=True
fi
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
# get features' stats (mean and std)
echo "Get features' stats ..."
python3 ${MAIN_ROOT}/utils/compute_statistics.py \
--metadata=dump/train/raw/metadata.jsonl \
--field-name="speech"
fi
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
# normalize and convert phone to id; dev and test should use train's stats
echo "Normalize ..."
python3 ${BIN_DIR}/normalize.py \
--metadata=dump/train/raw/metadata.jsonl \
--dumpdir=dump/train/norm \
--speech-stats=dump/train/speech_stats.npy \
--phones-dict=dump/phone_id_map.txt \
--speaker-dict=dump/speaker_id_map.txt
python3 ${BIN_DIR}/normalize.py \
--metadata=dump/dev/raw/metadata.jsonl \
--dumpdir=dump/dev/norm \
--speech-stats=dump/train/speech_stats.npy \
--phones-dict=dump/phone_id_map.txt \
--speaker-dict=dump/speaker_id_map.txt
python3 ${BIN_DIR}/normalize.py \
--metadata=dump/test/raw/metadata.jsonl \
--dumpdir=dump/test/norm \
--speech-stats=dump/train/speech_stats.npy \
--phones-dict=dump/phone_id_map.txt \
--speaker-dict=dump/speaker_id_map.txt
fi

@@ -1,11 +1,20 @@
#!/bin/bash
train_output_path=$1
ckpt_name=$2
config_path=$1
train_output_path=$2
ckpt_name=$3
python3 ${BIN_DIR}/synthesize.py \
--config=${train_output_path}/config.yaml \
--checkpoint_path=${train_output_path}/checkpoints/${ckpt_name} \
--input=${BIN_DIR}/../sentences_en.txt \
--output=${train_output_path}/test \
--ngpu=1
FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize.py \
--am=tacotron2_ljspeech \
--am_config=${config_path} \
--am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
--am_stat=dump/train/speech_stats.npy \
--voc=pwgan_ljspeech \
--voc_config=pwg_ljspeech_ckpt_0.5/pwg_default.yaml \
--voc_ckpt=pwg_ljspeech_ckpt_0.5/pwg_snapshot_iter_400000.pdz \
--voc_stat=pwg_ljspeech_ckpt_0.5/pwg_stats.npy \
--test_metadata=dump/test/norm/metadata.jsonl \
--output_dir=${train_output_path}/test \
--phones_dict=dump/phone_id_map.txt

@@ -1,9 +1,12 @@
#!/bin/bash
preprocess_path=$1
config_path=$1
train_output_path=$2
python3 ${BIN_DIR}/train.py \
--data=${preprocess_path} \
--output=${train_output_path} \
--ngpu=1 \
--train-metadata=dump/train/norm/metadata.jsonl \
--dev-metadata=dump/dev/norm/metadata.jsonl \
--config=${config_path} \
--output-dir=${train_output_path} \
--ngpu=1 \
--phones-dict=dump/phone_id_map.txt

@@ -9,5 +9,5 @@ export PYTHONDONTWRITEBYTECODE=1
export PYTHONIOENCODING=UTF-8
export PYTHONPATH=${MAIN_ROOT}:${PYTHONPATH}
MODEL=tacotron2
MODEL=new_tacotron2
export BIN_DIR=${MAIN_ROOT}/paddlespeech/t2s/exps/${MODEL}

@@ -3,13 +3,13 @@
set -e
source path.sh
gpus=0
gpus=0,1
stage=0
stop_stage=100
preprocess_path=preprocessed_ljspeech
train_output_path=output
ckpt_name=step-35000
conf_path=conf/default.yaml
train_output_path=exp/default
ckpt_name=snapshot_iter_201.pdz
# with the following command, you can choose the stage range you want to run
# such as `./run.sh --stage 0 --stop-stage 0`
@@ -18,16 +18,20 @@ source ${MAIN_ROOT}/utils/parse_options.sh || exit 1
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
# prepare data
./local/preprocess.sh ${preprocess_path} || exit -1
./local/preprocess.sh ${conf_path} || exit -1
fi
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
# train model, all `ckpt` under `train_output_path/checkpoints/` dir
CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${preprocess_path} ${train_output_path} || exit -1
CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path} || exit -1
fi
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
# train model, all `ckpt` under `train_output_path/checkpoints/` dir
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${train_output_path} ${ckpt_name} || exit -1
# synthesize, vocoder is pwgan
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name} || exit -1
fi
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
# synthesize_e2e, vocoder is pwgan
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize_e2e.sh ${conf_path} ${train_output_path} ${ckpt_name} || exit -1
fi

@@ -10,7 +10,7 @@ stop_stage=100
preprocess_path=preprocessed_ljspeech
train_output_path=output
# mel generated by Tacotron2
input_mel_path=../tts0/output/test
input_mel_path=${preprocess_path}/mel_test
ckpt_name=step-10000
# with the following command, you can choose the stage range you want to run
@@ -28,5 +28,7 @@ if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
fi
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
mkdir -p ${preprocess_path}/mel_test
cp ${preprocess_path}/mel/LJ050-001*.npy ${preprocess_path}/mel_test/
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${input_mel_path} ${train_output_path} ${ckpt_name} || exit -1
fi

@@ -207,7 +207,8 @@ def main():
default='fastspeech2_csmsc',
choices=[
'speedyspeech_csmsc', 'fastspeech2_csmsc', 'fastspeech2_ljspeech',
'fastspeech2_aishell3', 'fastspeech2_vctk', 'tacotron2_csmsc', 'tacotron2_aishell3'
'fastspeech2_aishell3', 'fastspeech2_vctk', 'tacotron2_csmsc',
'tacotron2_ljspeech', 'tacotron2_aishell3'
],
help='Choose acoustic model type of tts task.')
parser.add_argument(

@@ -285,7 +285,7 @@ def main():
choices=[
'speedyspeech_csmsc', 'speedyspeech_aishell3', 'fastspeech2_csmsc',
'fastspeech2_ljspeech', 'fastspeech2_aishell3', 'fastspeech2_vctk',
'tacotron2_csmsc'
'tacotron2_csmsc', 'tacotron2_ljspeech'
],
help='Choose acoustic model type of tts task.')
parser.add_argument(
