[tts] add mix finetune (#2525)

* update readme, test=doc

* update yaml and readme, test=tts

* fix batch_size, test=tts

* add mix finetune, test=tts

* update readme, test=tts

@@ -7,7 +7,7 @@ For more information on training Fastspeech2 with AISHELL-3, You can refer [exam
## Prepare
### Download Pretrained model
Assume the path to the model is `./pretrained_models`. <br/>
If you want to finetune Chinese data, you need to download Fastspeech2 pretrained model with AISHELL-3: [fastspeech2_aishell3_ckpt_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_aishell3_ckpt_1.1.0.zip) for finetuning. Download HiFiGAN pretrained model with aishell3: [hifigan_aishell3_ckpt_0.2.0](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_aishell3_ckpt_0.2.0.zip) for synthesis.
If you want to finetune the Chinese pretrained model, you need to download the FastSpeech2 model pretrained on AISHELL-3: [fastspeech2_aishell3_ckpt_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_aishell3_ckpt_1.1.0.zip) for finetuning. Download the HiFiGAN model pretrained on AISHELL-3: [hifigan_aishell3_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_aishell3_ckpt_0.2.0.zip) for synthesis.
```bash
mkdir -p pretrained_models && cd pretrained_models
@@ -21,7 +21,7 @@ cd ../
```
If you want to finetune English data, you need to download Fastspeech2 pretrained model with VCTK: [fastspeech2_vctk_ckpt_1.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_vctk_ckpt_1.2.0.zip) for finetuning. Download HiFiGAN pretrained model with VCTK: [hifigan_vctk_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_vctk_ckpt_0.2.0.zip) for synthesis.
If you want to finetune the English pretrained model, you need to download the FastSpeech2 model pretrained on VCTK: [fastspeech2_vctk_ckpt_1.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_vctk_ckpt_1.2.0.zip) for finetuning. Download the HiFiGAN model pretrained on VCTK: [hifigan_vctk_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_vctk_ckpt_0.2.0.zip) for synthesis.
```bash
mkdir -p pretrained_models && cd pretrained_models
@@ -34,6 +34,59 @@ unzip hifigan_vctk_ckpt_0.2.0.zip
cd ../
```
If you want to finetune the Chinese-English mixed pretrained model, you need to download the FastSpeech2 model pretrained on the mixed datasets: [fastspeech2_mix_ckpt_1.2.0.zip](https://paddlespeech.bj.bcebos.com/t2s/chinse_english_mixed/models/fastspeech2_mix_ckpt_1.2.0.zip) for finetuning. Download the HiFiGAN model pretrained on AISHELL-3: [hifigan_aishell3_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_aishell3_ckpt_0.2.0.zip) for synthesis.
```bash
mkdir -p pretrained_models && cd pretrained_models
# pretrained fastspeech2 model
wget https://paddlespeech.bj.bcebos.com/t2s/chinse_english_mixed/models/fastspeech2_mix_ckpt_1.2.0.zip
unzip fastspeech2_mix_ckpt_1.2.0.zip
# pretrained hifigan model
wget https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_aishell3_ckpt_0.2.0.zip
unzip hifigan_aishell3_ckpt_0.2.0.zip
cd ../
```
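Before preparing data, it can save time to confirm that each unzipped checkpoint directory contains the files the later stages load. A minimal sketch (`check_ckpt_dir` is a hypothetical helper; the expected file lists follow the directory structures shown later in this document):
```python
from pathlib import Path

def check_ckpt_dir(ckpt_dir: str, expected: list):
    """Report any expected file missing from an unzipped checkpoint dir."""
    missing = [pat for pat in expected if not list(Path(ckpt_dir).glob(pat))]
    print(f"{ckpt_dir}: {'OK' if not missing else 'missing ' + str(missing)}")

# fastspeech2 checkpoints ship a config, stats files, id maps, and a .pdz snapshot
check_ckpt_dir("pretrained_models/fastspeech2_mix_ckpt_1.2.0",
               ["default.yaml", "phone_id_map.txt", "speaker_id_map.txt",
                "speech_stats.npy", "pitch_stats.npy", "energy_stats.npy",
                "snapshot_iter_*.pdz"])
# hifigan checkpoints ship a config, feature stats, and a .pdz snapshot
check_ckpt_dir("pretrained_models/hifigan_aishell3_ckpt_0.2.0",
               ["default.yaml", "feats_stats.npy", "snapshot_iter_*.pdz"])
```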
### Prepare your data
Assume the path to the dataset is `./input`, which contains a speaker folder. The speaker folder contains audio files (`*.wav`) and a label file (`labels.txt`). The audio files are in wav format, and each line of the label file has the format `utt_id|pronunciation`. <br/>
If you want to finetune the Chinese pretrained model, you need to prepare Chinese data. A Chinese label example:
```
000001|ka2 er2 pu3 pei2 wai4 sun1 wan2 hua2 ti1
```
Here is an example using the first 200 utterances of CSMSC.
```bash
mkdir -p input && cd input
wget https://paddlespeech.bj.bcebos.com/datasets/csmsc_mini.zip
unzip csmsc_mini.zip
cd ../
```
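The same `utt_id|pronunciation` convention applies to the English and mixed examples below, so a quick sanity check on any speaker folder can catch label problems before the MFA stage. A minimal sketch, assuming the layout described above (`check_labels` is a hypothetical helper):
```python
from pathlib import Path

def check_labels(speaker_dir: str):
    """Check every labels.txt line is utt_id|pronunciation with a matching wav."""
    root = Path(speaker_dir)
    for line in (root / "labels.txt").read_text().splitlines():
        if "|" not in line:
            print(f"bad line (no '|'): {line!r}")
            continue
        utt_id, _pronunciation = line.split("|", 1)
        if not (root / f"{utt_id}.wav").exists():
            print(f"missing audio for {utt_id}")

check_labels("./input/csmsc_mini")
```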
If you want to finetune the English pretrained model, you need to prepare English data. An English label example:
```
LJ001-0001|Printing, in the only sense with which we are at present concerned, differs from most if not from all the arts and crafts represented in the Exhibition
```
Here is an example using the first 200 utterances of LJSpeech.
```bash
mkdir -p input && cd input
wget https://paddlespeech.bj.bcebos.com/datasets/ljspeech_mini.zip
unzip ljspeech_mini.zip
cd ../
```
If you want to finetune the Chinese-English mixed pretrained model, you need to prepare Chinese data or English data. Here is an example using the first 12 utterances of SSB0005 (a speaker from AISHELL-3).
```bash
mkdir -p input && cd input
wget https://paddlespeech.bj.bcebos.com/datasets/SSB0005_mini.zip
unzip SSB0005_mini.zip
cd ../
```
### Download MFA tools and pretrained model
Assume the path to the MFA tool is `./tools`. Download [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner/releases/download/v1.0.1/montreal-forced-aligner_linux.tar.gz).
@@ -46,7 +99,7 @@ cp montreal-forced-aligner/lib/libpython3.6m.so.1.0 montreal-forced-aligner/lib/
mkdir -p aligner && cd aligner
```
If you want to finetune Chinese data, you need to download pretrained MFA models with aishell3: [aishell3_model.zip](https://paddlespeech.bj.bcebos.com/MFA/ernie_sat/aishell3_model.zip) and unzip it.
If you want to get the MFA result of Chinese data, you need to download the pretrained MFA model trained on AISHELL-3: [aishell3_model.zip](https://paddlespeech.bj.bcebos.com/MFA/ernie_sat/aishell3_model.zip) and unzip it.
```bash
# pretrained mfa model for Chinese data
@@ -56,30 +109,17 @@ wget https://paddlespeech.bj.bcebos.com/MFA/AISHELL-3/with_tone/simple.lexicon
cd ../../
```
If you want to finetune English data, you need to download pretrained MFA models with vctk: [vctk_model.zip](https://paddlespeech.bj.bcebos.com/MFA/ernie_sat/vctk_model.zip) and unzip it.
If you want to get the MFA result of English data, you need to download the pretrained MFA model trained on VCTK: [vctk_model.zip](https://paddlespeech.bj.bcebos.com/MFA/ernie_sat/vctk_model.zip) and unzip it.
```bash
# pretrained mfa model for Chinese data
# pretrained mfa model for English data
wget https://paddlespeech.bj.bcebos.com/MFA/ernie_sat/vctk_model.zip
unzip vctk_model.zip
wget https://paddlespeech.bj.bcebos.com/MFA/LJSpeech-1.1/cmudict-0.7b
cd ../../
```
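Stage 1 of the run scripts drives the aligner through `local/get_mfa_result.py`. If you want to invoke the downloaded v1.0.1 binary directly, it takes the corpus, lexicon, acoustic model, and output directory as positional arguments; the sketch below uses the Chinese paths set up above (the argument order is an assumption to verify against your MFA version — swap in `vctk_model.zip` and `cmudict-0.7b` for English):
```python
import subprocess

# mfa_align <corpus_dir> <lexicon> <acoustic_model.zip> <output_dir>
subprocess.run([
    "./tools/montreal-forced-aligner/bin/mfa_align",
    "./input/csmsc_mini/newdir",           # wavs plus per-utterance transcripts
    "./tools/aligner/simple.lexicon",      # pronunciation lexicon
    "./tools/aligner/aishell3_model.zip",  # pretrained acoustic model
    "./mfa_result",                        # TextGrid output directory
], check=True)
```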
### Prepare your data
Assume the path to the dataset is `./input` which contains a speaker folder. Speaker folder contains audio files (*.wav) and label file (labels.txt). The format of the audio file is wav. The format of the label file is: utt_id|pronunciation. </br>
If you want to finetune Chinese data, Chinese label example: 000001|ka2 er2 pu3 pei2 wai4 sun1 wan2 hua2 ti1</br>
Here is an example of the first 200 data of csmsc.
```bash
mkdir -p input && cd input
wget https://paddlespeech.bj.bcebos.com/datasets/csmsc_mini.zip
unzip csmsc_mini.zip
cd ../
```
When "Prepare" done. The structure of the current directory is listed below.
When "Prepare" done. The structure of the current directory is similar to the following.
```text
├── input
│ ├── csmsc_mini
@@ -119,56 +159,6 @@ When "Prepare" done. The structure of the current directory is listed below.
```
If you want to finetune English data, English label example: LJ001-0001|Printing, in the only sense with which we are at present concerned, differs from most if not from all the arts and crafts represented in the Exhibition </br>
Here is an example of the first 200 data of ljspeech.
```bash
mkdir -p input && cd input
wget https://paddlespeech.bj.bcebos.com/datasets/ljspeech_mini.zip
unzip ljspeech_mini.zip
cd ../
```
When "Prepare" done. The structure of the current directory is listed below.
```text
├── input
│ ├── ljspeech_mini
│ │ ├── LJ001-0001.wav
│ │ ├── LJ001-0002.wav
│ │ ├── LJ001-0003.wav
│ │ ├── ...
│ │ ├── LJ002-0014.wav
│ │ ├── labels.txt
│ └── ljspeech_mini.zip
├── pretrained_models
│ ├── fastspeech2_vctk_ckpt_1.2.0
│ │ ├── default.yaml
│ │ ├── energy_stats.npy
│ │ ├── phone_id_map.txt
│ │ ├── pitch_stats.npy
│ │ ├── snapshot_iter_66200.pdz
│ │ ├── speaker_id_map.txt
│ │ └── speech_stats.npy
│ ├── fastspeech2_vctk_ckpt_1.2.0.zip
│ ├── hifigan_vctk_ckpt_0.2.0
│ │ ├── default.yaml
│ │ ├── feats_stats.npy
│ │ └── snapshot_iter_2500000.pdz
│ └── hifigan_vctk_ckpt_0.2.0.zip
└── tools
├── aligner
│ ├── vctk_model
│ ├── vctk_model.zip
│ └── cmudict-0.7b
├── montreal-forced-aligner
│ ├── bin
│ ├── lib
│ └── pretrained_models
└── montreal-forced-aligner_linux.tar.gz
...
```
### Set finetune.yaml
`conf/finetune.yaml` contains some configurations for fine-tuning. You can try various options to get a better result. The value of `frozen_layers` can be changed according to `conf/fastspeech2_layers.txt`, which lists the model layers of fastspeech2.
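If you sweep several freezing schemes, it can be convenient to rewrite `frozen_layers` from a script rather than by hand. A minimal sketch, assuming `finetune.yaml` stores `frozen_layers` as a list of layer names taken from `conf/fastspeech2_layers.txt`:
```python
import yaml  # PyYAML

with open("conf/finetune.yaml") as f:
    cfg = yaml.safe_load(f)

# freeze only the encoder; layer names must come from conf/fastspeech2_layers.txt
cfg["frozen_layers"] = ["encoder"]

# note: safe_dump rewrites the file and drops any comments it contained
with open("conf/finetune.yaml", "w") as f:
    yaml.safe_dump(cfg, f)
```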
@@ -180,7 +170,7 @@ Arguments:
## Get Started
For Chinese data finetune, execute `./run.sh`. For English data finetune, execute `./run_en.sh`. </br>
To finetune the Chinese pretrained model, execute `./run.sh`. To finetune the English pretrained model, execute `./run_en.sh`. To finetune the Chinese-English mixed pretrained model, execute `./run_mix.sh`. <br/>
Run the command below to
1. **source path**.
2. finetune the model.

@@ -56,13 +56,15 @@ def get_stats(pretrained_model_dir: Path):
def get_map(duration_file: Union[str, Path],
dump_dir: Path,
pretrained_model_dir: Path):
pretrained_model_dir: Path,
replace_spkid: int=0):
"""get phone map and speaker map, save on dump_dir
Args:
duration_file (str): durations.txt
dump_dir (Path): dump dir
pretrained_model_dir (Path): pretrained model dir
replace_spkid (int): replace spk id
"""
# copy phone map file from pretrained model path
phones_dict = dump_dir / "phone_id_map.txt"
@@ -75,14 +77,24 @@ def get_map(duration_file: Union[str, Path],
speakers = sorted(list(speaker_set))
num = len(speakers)
speaker_dict = dump_dir / "speaker_id_map.txt"
with open(speaker_dict, 'w') as f, open(pretrained_model_dir /
"speaker_id_map.txt", 'r') as fr:
for i, spk in enumerate(speakers):
f.write(spk + ' ' + str(i) + '\n')
spk_dict = {}
# get raw spkid-spk dict
with open(pretrained_model_dir / "speaker_id_map.txt", 'r') as fr:
for line in fr.readlines():
spk_id = line.strip().split(" ")[-1]
if int(spk_id) >= num:
f.write(line)
spk = line.strip().split(" ")[0]
spk_id = line.strip().split(" ")[1]
spk_dict[spk_id] = spk
# replace spk on spkid-spk dict
assert replace_spkid + num - 1 < len(
spk_dict), "Please set correct replace spk id."
for i, spk in enumerate(speakers):
spk_dict[str(replace_spkid + i)] = spk
# write a new spk map file
with open(speaker_dict, 'w') as f:
for spk_id in spk_dict.keys():
f.write(spk_dict[spk_id] + ' ' + spk_id + '\n')
vocab_phones = {}
with open(phones_dict, 'rt') as f:
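The change above is the core of mix finetuning: previously the new speakers always overwrote ids `0..num-1` of the pretrained `speaker_id_map.txt`; the rewritten `get_map` lets them overwrite the id range starting at any `replace_spkid`, so the finetuned embedding slot lines up with the `--spk_id` used at synthesis time. A self-contained sketch of the same logic (`replace_speakers` is a hypothetical name):
```python
def replace_speakers(pretrained_map_lines, new_speakers, replace_spkid=0):
    """Overwrite ids replace_spkid..replace_spkid+len(new_speakers)-1 with new names."""
    spk_dict = {}  # id (str) -> speaker name, from the pretrained map
    for line in pretrained_map_lines:
        spk, spk_id = line.strip().split(" ")
        spk_dict[spk_id] = spk
    assert replace_spkid + len(new_speakers) - 1 < len(spk_dict), \
        "Please set correct replace spk id."
    for i, spk in enumerate(sorted(new_speakers)):
        spk_dict[str(replace_spkid + i)] = spk
    return [f"{spk} {spk_id}" for spk_id, spk in spk_dict.items()]

# drop one new speaker onto id 2 of a four-speaker toy map
print(replace_speakers(["S0002 0", "S0003 1", "S0004 2", "S0005 3"],
                       {"SSB0005"}, replace_spkid=2))
# -> ['S0002 0', 'S0003 1', 'SSB0005 2', 'S0005 3']
```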
@@ -206,10 +218,11 @@ def extract_feature(duration_file: str,
config,
input_dir: Path,
dump_dir: Path,
pretrained_model_dir: Path):
pretrained_model_dir: Path,
replace_spkid: int=0):
sentences, vocab_phones, vocab_speaker = get_map(duration_file, dump_dir,
pretrained_model_dir)
sentences, vocab_phones, vocab_speaker = get_map(
duration_file, dump_dir, pretrained_model_dir, replace_spkid)
mel_extractor, pitch_extractor, energy_extractor = get_extractor(config)
wav_files = sorted(list((input_dir).rglob("*.wav")))
@@ -315,6 +328,9 @@ if __name__ == '__main__':
default="./pretrained_models/fastspeech2_aishell3_ckpt_1.1.0",
help="Path to pretrained model")
parser.add_argument(
"--replace_spkid", type=int, default=0, help="replace spk id")
args = parser.parse_args()
input_dir = Path(args.input_dir).expanduser()
@@ -332,4 +348,5 @@ if __name__ == '__main__':
config=config,
input_dir=input_dir,
dump_dir=dump_dir,
pretrained_model_dir=pretrained_model_dir)
pretrained_model_dir=pretrained_model_dir,
replace_spkid=args.replace_spkid)

@@ -15,6 +15,7 @@ output_dir=./exp/default
lang=zh
ngpu=1
finetune_config=./conf/finetune.yaml
replace_spkid=0
ckpt=snapshot_iter_96699
@@ -62,7 +63,8 @@ if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
--duration_file="./durations.txt" \
--input_dir=${new_dir} \
--dump_dir=${dump_dir} \
--pretrained_model_dir=${pretrained_model_dir}
--pretrained_model_dir=${pretrained_model_dir} \
--replace_spkid=$replace_spkid
fi
# create finetune env
@@ -102,5 +104,5 @@ if [ ${stage} -le 6 ] && [ ${stop_stage} -ge 6 ]; then
--output_dir=./test_e2e/ \
--phones_dict=${dump_dir}/phone_id_map.txt \
--speaker_dict=${dump_dir}/speaker_id_map.txt \
--spk_id=0
--spk_id=$replace_spkid
fi

@@ -14,6 +14,7 @@ output_dir=./exp/default
lang=en
ngpu=1
finetune_config=./conf/finetune.yaml
replace_spkid=0
ckpt=snapshot_iter_66300
@@ -61,7 +62,8 @@ if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
--duration_file="./durations.txt" \
--input_dir=${new_dir} \
--dump_dir=${dump_dir} \
--pretrained_model_dir=${pretrained_model_dir}
--pretrained_model_dir=${pretrained_model_dir} \
--replace_spkid=$replace_spkid
fi
# create finetune env
@@ -101,5 +103,5 @@ if [ ${stage} -le 6 ] && [ ${stop_stage} -ge 6 ]; then
--output_dir=./test_e2e/ \
--phones_dict=${dump_dir}/phone_id_map.txt \
--speaker_dict=${dump_dir}/speaker_id_map.txt \
--spk_id=0
--spk_id=$replace_spkid
fi

@@ -0,0 +1,110 @@
#!/bin/bash
set -e
source path.sh
input_dir=./input/SSB0005_mini
newdir_name="newdir"
new_dir=${input_dir}/${newdir_name}
pretrained_model_dir=./pretrained_models/fastspeech2_mix_ckpt_1.2.0
mfa_tools=./tools
mfa_dir=./mfa_result
dump_dir=./dump
output_dir=./exp/default
lang=zh
ngpu=1
finetune_config=./conf/finetune.yaml
replace_spkid=174 # csmsc: 174, ljspeech: 175, aishell3: 0~173, vctk: 176
ckpt=snapshot_iter_99300
gpus=1
CUDA_VISIBLE_DEVICES=${gpus}
stage=0
stop_stage=100
# with the following command, you can choose the stage range you want to run
# such as `./run.sh --stage 0 --stop-stage 0`
# this cannot be mixed with positional arguments `$1`, `$2`, ...
source ${MAIN_ROOT}/utils/parse_options.sh || exit 1
# check oov
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
echo "check oov"
python3 local/check_oov.py \
--input_dir=${input_dir} \
--pretrained_model_dir=${pretrained_model_dir} \
--newdir_name=${newdir_name} \
--lang=${lang}
fi
# get mfa result
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
echo "get mfa result"
python3 local/get_mfa_result.py \
--input_dir=${new_dir} \
--mfa_dir=${mfa_dir} \
--lang=${lang}
fi
# generate durations.txt
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
echo "generate durations.txt"
python3 local/generate_duration.py \
--mfa_dir=${mfa_dir}
fi
# extract feature
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
echo "extract feature"
python3 local/extract_feature.py \
--duration_file="./durations.txt" \
--input_dir=${new_dir} \
--dump_dir=${dump_dir} \
--pretrained_model_dir=${pretrained_model_dir} \
--replace_spkid=$replace_spkid
fi
# create finetune env
if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
echo "create finetune env"
python3 local/prepare_env.py \
--pretrained_model_dir=${pretrained_model_dir} \
--output_dir=${output_dir}
fi
# finetune
if [ ${stage} -le 5 ] && [ ${stop_stage} -ge 5 ]; then
echo "finetune..."
python3 local/finetune.py \
--pretrained_model_dir=${pretrained_model_dir} \
--dump_dir=${dump_dir} \
--output_dir=${output_dir} \
--ngpu=${ngpu} \
--epoch=100 \
--finetune_config=${finetune_config}
fi
# synthesize e2e
if [ ${stage} -le 6 ] && [ ${stop_stage} -ge 6 ]; then
echo "in hifigan syn_e2e"
python3 ${BIN_DIR}/../synthesize_e2e.py \
--am=fastspeech2_aishell3 \
--am_config=${pretrained_model_dir}/default.yaml \
--am_ckpt=${output_dir}/checkpoints/${ckpt}.pdz \
--am_stat=${pretrained_model_dir}/speech_stats.npy \
--voc=hifigan_aishell3 \
--voc_config=pretrained_models/hifigan_aishell3_ckpt_0.2.0/default.yaml \
--voc_ckpt=pretrained_models/hifigan_aishell3_ckpt_0.2.0/snapshot_iter_2500000.pdz \
--voc_stat=pretrained_models/hifigan_aishell3_ckpt_0.2.0/feats_stats.npy \
--lang=mix \
--text=${BIN_DIR}/../sentences_mix.txt \
--output_dir=./test_e2e/ \
--phones_dict=${dump_dir}/phone_id_map.txt \
--speaker_dict=${dump_dir}/speaker_id_map.txt \
--spk_id=$replace_spkid
fi
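After stage 3, it is worth confirming that the new speaker actually landed on the id you passed, since stage 6 reuses it via `--spk_id`. A small check, assuming the dump layout produced by the script above:
```python
# print which speaker owns replace_spkid in the dumped speaker map
replace_spkid = 174  # must match the value set in run_mix.sh
with open("./dump/speaker_id_map.txt") as f:
    for line in f:
        spk, spk_id = line.strip().split(" ")
        if spk_id == str(replace_spkid):
            print(f"spk_id {spk_id} -> {spk}")  # expect your finetuning speaker
```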