[tts] add mix finetune (#2525)

* update readme, test=doc

* update yaml and readme, test=tts

* fix batch_size, test=tts

* add mix finetune, test=tts

* update readme, test=tts

@@ -7,7 +7,7 @@ For more information on training Fastspeech2 with AISHELL-3, You can refer [exam
## Prepare
### Download Pretrained model
Assume the path to the model is `./pretrained_models`. <br/>
If you want to finetune Chinese data, you need to download Fastspeech2 pretrained model with AISHELL-3: [fastspeech2_aishell3_ckpt_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_aishell3_ckpt_1.1.0.zip) for finetuning. Download HiFiGAN pretrained model with aishell3: [hifigan_aishell3_ckpt_0.2.0](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_aishell3_ckpt_0.2.0.zip) for synthesis.
If you want to finetune the Chinese pretrained model, you need to download the FastSpeech2 model pretrained on AISHELL-3: [fastspeech2_aishell3_ckpt_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_aishell3_ckpt_1.1.0.zip) for finetuning. Download the HiFiGAN model pretrained on AISHELL-3: [hifigan_aishell3_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_aishell3_ckpt_0.2.0.zip) for synthesis.
```bash
mkdir -p pretrained_models && cd pretrained_models
@@ -21,7 +21,7 @@ cd ../
```
If you want to finetune English data, you need to download Fastspeech2 pretrained model with VCTK: [fastspeech2_vctk_ckpt_1.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_vctk_ckpt_1.2.0.zip) for finetuning. Download HiFiGAN pretrained model with VCTK: [hifigan_vctk_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_vctk_ckpt_0.2.0.zip) for synthesis.
If you want to finetune the English pretrained model, you need to download the FastSpeech2 model pretrained on VCTK: [fastspeech2_vctk_ckpt_1.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_vctk_ckpt_1.2.0.zip) for finetuning. Download the HiFiGAN model pretrained on VCTK: [hifigan_vctk_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_vctk_ckpt_0.2.0.zip) for synthesis.
```bash
mkdir -p pretrained_models && cd pretrained_models
@@ -34,6 +34,59 @@ unzip hifigan_vctk_ckpt_0.2.0.zip
cd ../
```
If you want to finetune the Chinese-English mixed pretrained model, you need to download the FastSpeech2 model pretrained on the mixed datasets: [fastspeech2_mix_ckpt_1.2.0.zip](https://paddlespeech.bj.bcebos.com/t2s/chinse_english_mixed/models/fastspeech2_mix_ckpt_1.2.0.zip) for finetuning. Download the HiFiGAN model pretrained on AISHELL-3: [hifigan_aishell3_ckpt_0.2.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_aishell3_ckpt_0.2.0.zip) for synthesis.
```bash
mkdir -p pretrained_models && cd pretrained_models
# pretrained fastspeech2 model
wget https://paddlespeech.bj.bcebos.com/t2s/chinse_english_mixed/models/fastspeech2_mix_ckpt_1.2.0.zip
unzip fastspeech2_mix_ckpt_1.2.0.zip
# pretrained hifigan model
wget https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_aishell3_ckpt_0.2.0.zip
unzip hifigan_aishell3_ckpt_0.2.0.zip
cd ../
```
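Before preparing data, it can save time to confirm that each unzipped checkpoint directory contains the files the later stages load. A minimal sketch (`check_ckpt_dir` is a hypothetical helper; the expected file lists follow the directory structures shown later in this document):
```python
from pathlib import Path

def check_ckpt_dir(ckpt_dir: str, expected: list):
    """Report any expected file missing from an unzipped checkpoint dir."""
    missing = [pat for pat in expected if not list(Path(ckpt_dir).glob(pat))]
    print(f"{ckpt_dir}: {'OK' if not missing else 'missing ' + str(missing)}")

# fastspeech2 checkpoints ship a config, stats files, id maps, and a .pdz snapshot
check_ckpt_dir("pretrained_models/fastspeech2_mix_ckpt_1.2.0",
               ["default.yaml", "phone_id_map.txt", "speaker_id_map.txt",
                "speech_stats.npy", "pitch_stats.npy", "energy_stats.npy",
                "snapshot_iter_*.pdz"])
# hifigan checkpoints ship a config, feature stats, and a .pdz snapshot
check_ckpt_dir("pretrained_models/hifigan_aishell3_ckpt_0.2.0",
               ["default.yaml", "feats_stats.npy", "snapshot_iter_*.pdz"])
```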
### Prepare your data
Assume the path to the dataset is `./input`, which contains a speaker folder. The speaker folder contains audio files (`*.wav`) and a label file (`labels.txt`). The audio files are in wav format, and each line of the label file has the format `utt_id|pronunciation`. <br/>
If you want to finetune the Chinese pretrained model, you need to prepare Chinese data. A Chinese label example:
```
000001|ka2 er2 pu3 pei2 wai4 sun1 wan2 hua2 ti1
```
Here is an example using the first 200 utterances of CSMSC.
```bash
mkdir -p input && cd input
wget https://paddlespeech.bj.bcebos.com/datasets/csmsc_mini.zip
unzip csmsc_mini.zip
cd ../
```
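The same `utt_id|pronunciation` convention applies to the English and mixed examples below, so a quick sanity check on any speaker folder can catch label problems before the MFA stage. A minimal sketch, assuming the layout described above (`check_labels` is a hypothetical helper):
```python
from pathlib import Path

def check_labels(speaker_dir: str):
    """Check every labels.txt line is utt_id|pronunciation with a matching wav."""
    root = Path(speaker_dir)
    for line in (root / "labels.txt").read_text().splitlines():
        if "|" not in line:
            print(f"bad line (no '|'): {line!r}")
            continue
        utt_id, _pronunciation = line.split("|", 1)
        if not (root / f"{utt_id}.wav").exists():
            print(f"missing audio for {utt_id}")

check_labels("./input/csmsc_mini")
```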
If you want to finetune the English pretrained model, you need to prepare English data. An English label example:
```
LJ001-0001|Printing, in the only sense with which we are at present concerned, differs from most if not from all the arts and crafts represented in the Exhibition
```
Here is an example using the first 200 utterances of LJSpeech.
```bash
mkdir -p input && cd input
wget https://paddlespeech.bj.bcebos.com/datasets/ljspeech_mini.zip
unzip ljspeech_mini.zip
cd ../
```
If you want to finetune the Chinese-English mixed pretrained model, you need to prepare Chinese data or English data. Here is an example using the first 12 utterances of SSB0005 (a speaker from AISHELL-3).
```bash
mkdir -p input && cd input
wget https://paddlespeech.bj.bcebos.com/datasets/SSB0005_mini.zip
unzip SSB0005_mini.zip
cd ../
```
### Download MFA tools and pretrained model
Assume the path to the MFA tool is `./tools`. Download [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner/releases/download/v1.0.1/montreal-forced-aligner_linux.tar.gz).
@@ -46,7 +99,7 @@ cp montreal-forced-aligner/lib/libpython3.6m.so.1.0 montreal-forced-aligner/lib/
mkdir -p aligner && cd aligner
```
If you want to finetune Chinese data, you need to download pretrained MFA models with aishell3: [aishell3_model.zip](https://paddlespeech.bj.bcebos.com/MFA/ernie_sat/aishell3_model.zip) and unzip it.
If you want to get the MFA result of Chinese data, you need to download the pretrained MFA model trained on AISHELL-3: [aishell3_model.zip](https://paddlespeech.bj.bcebos.com/MFA/ernie_sat/aishell3_model.zip) and unzip it.
```bash
# pretrained mfa model for Chinese data
@@ -56,30 +109,17 @@ wget https://paddlespeech.bj.bcebos.com/MFA/AISHELL-3/with_tone/simple.lexicon
cd ../../
```
If you want to finetune English data, you need to download pretrained MFA models with vctk: [vctk_model.zip](https://paddlespeech.bj.bcebos.com/MFA/ernie_sat/vctk_model.zip) and unzip it.
If you want to get the MFA result of English data, you need to download the pretrained MFA model trained on VCTK: [vctk_model.zip](https://paddlespeech.bj.bcebos.com/MFA/ernie_sat/vctk_model.zip) and unzip it.
```bash
# pretrained mfa model for Chinese data
# pretrained mfa model for English data
wget https://paddlespeech.bj.bcebos.com/MFA/ernie_sat/vctk_model.zip
unzip vctk_model.zip
wget https://paddlespeech.bj.bcebos.com/MFA/LJSpeech-1.1/cmudict-0.7b
cd ../../
```
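Stage 1 of the run scripts drives the aligner through `local/get_mfa_result.py`. If you want to invoke the downloaded v1.0.1 binary directly, it takes the corpus, lexicon, acoustic model, and output directory as positional arguments; the sketch below uses the Chinese paths set up above (the argument order is an assumption to verify against your MFA version — swap in `vctk_model.zip` and `cmudict-0.7b` for English):
```python
import subprocess

# mfa_align <corpus_dir> <lexicon> <acoustic_model.zip> <output_dir>
subprocess.run([
    "./tools/montreal-forced-aligner/bin/mfa_align",
    "./input/csmsc_mini/newdir",           # wavs plus per-utterance transcripts
    "./tools/aligner/simple.lexicon",      # pronunciation lexicon
    "./tools/aligner/aishell3_model.zip",  # pretrained acoustic model
    "./mfa_result",                        # TextGrid output directory
], check=True)
```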
### Prepare your data
Assume the path to the dataset is `./input` which contains a speaker folder. Speaker folder contains audio files (*.wav) and label file (labels.txt). The format of the audio file is wav. The format of the label file is: utt_id|pronunciation. </br>
If you want to finetune Chinese data, Chinese label example: 000001|ka2 er2 pu3 pei2 wai4 sun1 wan2 hua2 ti1</br>
Here is an example of the first 200 data of csmsc.
```bash
mkdir -p input && cd input
wget https://paddlespeech.bj.bcebos.com/datasets/csmsc_mini.zip
unzip csmsc_mini.zip
cd ../
```
When "Prepare" done. The structure of the current directory is listed below.
When "Prepare" done. The structure of the current directory is similar to the following.
```text
├── input
│ ├── csmsc_mini
@@ -119,56 +159,6 @@ When "Prepare" done. The structure of the current directory is listed below.
```
If you want to finetune English data, English label example: LJ001-0001|Printing, in the only sense with which we are at present concerned, differs from most if not from all the arts and crafts represented in the Exhibition </br>
Here is an example of the first 200 data of ljspeech.
```bash
mkdir -p input && cd input
wget https://paddlespeech.bj.bcebos.com/datasets/ljspeech_mini.zip
unzip ljspeech_mini.zip
cd ../
```
When "Prepare" done. The structure of the current directory is listed below.
```text
├── input
│ ├── ljspeech_mini
│ │ ├── LJ001-0001.wav
│ │ ├── LJ001-0002.wav
│ │ ├── LJ001-0003.wav
│ │ ├── ...
│ │ ├── LJ002-0014.wav
│ │ ├── labels.txt
│ └── ljspeech_mini.zip
├── pretrained_models
│ ├── fastspeech2_vctk_ckpt_1.2.0
│ │ ├── default.yaml
│ │ ├── energy_stats.npy
│ │ ├── phone_id_map.txt
│ │ ├── pitch_stats.npy
│ │ ├── snapshot_iter_66200.pdz
│ │ ├── speaker_id_map.txt
│ │ └── speech_stats.npy
│ ├── fastspeech2_vctk_ckpt_1.2.0.zip
│ ├── hifigan_vctk_ckpt_0.2.0
│ │ ├── default.yaml
│ │ ├── feats_stats.npy
│ │ └── snapshot_iter_2500000.pdz
│ └── hifigan_vctk_ckpt_0.2.0.zip
└── tools
├── aligner
│ ├── vctk_model
│ ├── vctk_model.zip
│ └── cmudict-0.7b
├── montreal-forced-aligner
│ ├── bin
│ ├── lib
│ └── pretrained_models
└── montreal-forced-aligner_linux.tar.gz
...
```
### Set finetune.yaml
`conf/finetune.yaml` contains some configurations for fine-tuning. You can try various options to get a better result. The value of `frozen_layers` can be changed according to `conf/fastspeech2_layers.txt`, which lists the model layers of fastspeech2.
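If you sweep several freezing schemes, it can be convenient to rewrite `frozen_layers` from a script rather than by hand. A minimal sketch, assuming `finetune.yaml` stores `frozen_layers` as a list of layer names taken from `conf/fastspeech2_layers.txt`:
```python
import yaml  # PyYAML

with open("conf/finetune.yaml") as f:
    cfg = yaml.safe_load(f)

# freeze only the encoder; layer names must come from conf/fastspeech2_layers.txt
cfg["frozen_layers"] = ["encoder"]

# note: safe_dump rewrites the file and drops any comments it contained
with open("conf/finetune.yaml", "w") as f:
    yaml.safe_dump(cfg, f)
```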
@@ -180,7 +170,7 @@ Arguments:
## Get Started
For Chinese data finetune, execute `./run.sh`. For English data finetune, execute `./run_en.sh`. </br>
To finetune the Chinese pretrained model, execute `./run.sh`. To finetune the English pretrained model, execute `./run_en.sh`. To finetune the Chinese-English mixed pretrained model, execute `./run_mix.sh`. <br/>
Run the command below to
1. **source path**.
2. finetune the model.

@@ -56,13 +56,15 @@ def get_stats(pretrained_model_dir: Path):
def get_map(duration_file: Union[str, Path],
dump_dir: Path,
pretrained_model_dir: Path):
pretrained_model_dir: Path,
replace_spkid: int=0):
"""get phone map and speaker map, save on dump_dir
Args:
duration_file (str): durations.txt
dump_dir (Path): dump dir
pretrained_model_dir (Path): pretrained model dir
replace_spkid (int): replace spk id
"""
# copy phone map file from pretrained model path
phones_dict = dump_dir / "phone_id_map.txt"
@@ -75,14 +77,24 @@ def get_map(duration_file: Union[str, Path],
speakers = sorted(list(speaker_set))
num = len(speakers)
speaker_dict = dump_dir / "speaker_id_map.txt"
with open(speaker_dict, 'w') as f, open(pretrained_model_dir /
"speaker_id_map.txt", 'r') as fr:
for i, spk in enumerate(speakers):
f.write(spk + ' ' + str(i) + '\n')
spk_dict = {}
# get raw spkid-spk dict
with open(pretrained_model_dir / "speaker_id_map.txt", 'r') as fr:
for line in fr.readlines():
spk_id = line.strip().split(" ")[-1]
if int(spk_id) >= num:
f.write(line)
spk = line.strip().split(" ")[0]
spk_id = line.strip().split(" ")[1]
spk_dict[spk_id] = spk
# replace spk on spkid-spk dict
assert replace_spkid + num - 1 < len(
spk_dict), "Please set correct replace spk id."
for i, spk in enumerate(speakers):
spk_dict[str(replace_spkid + i)] = spk
# write a new spk map file
with open(speaker_dict, 'w') as f:
for spk_id in spk_dict.keys():
f.write(spk_dict[spk_id] + ' ' + spk_id + '\n')
vocab_phones = {}
with open(phones_dict, 'rt') as f:
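The change above is the core of mix finetuning: previously the new speakers always overwrote ids `0..num-1` of the pretrained `speaker_id_map.txt`; the rewritten `get_map` lets them overwrite the id range starting at any `replace_spkid`, so the finetuned embedding slot lines up with the `--spk_id` used at synthesis time. A self-contained sketch of the same logic (`replace_speakers` is a hypothetical name):
```python
def replace_speakers(pretrained_map_lines, new_speakers, replace_spkid=0):
    """Overwrite ids replace_spkid..replace_spkid+len(new_speakers)-1 with new names."""
    spk_dict = {}  # id (str) -> speaker name, from the pretrained map
    for line in pretrained_map_lines:
        spk, spk_id = line.strip().split(" ")
        spk_dict[spk_id] = spk
    assert replace_spkid + len(new_speakers) - 1 < len(spk_dict), \
        "Please set correct replace spk id."
    for i, spk in enumerate(sorted(new_speakers)):
        spk_dict[str(replace_spkid + i)] = spk
    return [f"{spk} {spk_id}" for spk_id, spk in spk_dict.items()]

# drop one new speaker onto id 2 of a four-speaker toy map
print(replace_speakers(["S0002 0", "S0003 1", "S0004 2", "S0005 3"],
                       {"SSB0005"}, replace_spkid=2))
# -> ['S0002 0', 'S0003 1', 'SSB0005 2', 'S0005 3']
```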
@@ -206,10 +218,11 @@ def extract_feature(duration_file: str,
config,
input_dir: Path,
dump_dir: Path,
pretrained_model_dir: Path):
pretrained_model_dir: Path,
replace_spkid: int=0):
sentences, vocab_phones, vocab_speaker = get_map(duration_file, dump_dir,
pretrained_model_dir)
sentences, vocab_phones, vocab_speaker = get_map(
duration_file, dump_dir, pretrained_model_dir, replace_spkid)
mel_extractor, pitch_extractor, energy_extractor = get_extractor(config)
wav_files = sorted(list((input_dir).rglob("*.wav")))
@@ -315,6 +328,9 @@ if __name__ == '__main__':
default="./pretrained_models/fastspeech2_aishell3_ckpt_1.1.0",
help="Path to pretrained model")
parser.add_argument(
"--replace_spkid", type=int, default=0, help="replace spk id")
args = parser.parse_args()
input_dir = Path(args.input_dir).expanduser()
@@ -332,4 +348,5 @@ if __name__ == '__main__':
config=config,
input_dir=input_dir,
dump_dir=dump_dir,
pretrained_model_dir=pretrained_model_dir)
pretrained_model_dir=pretrained_model_dir,
replace_spkid=args.replace_spkid)

@@ -15,6 +15,7 @@ output_dir=./exp/default
lang=zh
ngpu=1
finetune_config=./conf/finetune.yaml
replace_spkid=0
ckpt=snapshot_iter_96699
@@ -62,7 +63,8 @@ if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
--duration_file="./durations.txt" \
--input_dir=${new_dir} \
--dump_dir=${dump_dir} \
--pretrained_model_dir=${pretrained_model_dir}
--pretrained_model_dir=${pretrained_model_dir} \
--replace_spkid=$replace_spkid
fi
# create finetune env
@@ -102,5 +104,5 @@ if [ ${stage} -le 6 ] && [ ${stop_stage} -ge 6 ]; then
--output_dir=./test_e2e/ \
--phones_dict=${dump_dir}/phone_id_map.txt \
--speaker_dict=${dump_dir}/speaker_id_map.txt \
--spk_id=0
--spk_id=$replace_spkid
fi

@@ -14,6 +14,7 @@ output_dir=./exp/default
lang=en
ngpu=1
finetune_config=./conf/finetune.yaml
replace_spkid=0
ckpt=snapshot_iter_66300
@@ -61,7 +62,8 @@ if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
--duration_file="./durations.txt" \
--input_dir=${new_dir} \
--dump_dir=${dump_dir} \
--pretrained_model_dir=${pretrained_model_dir}
--pretrained_model_dir=${pretrained_model_dir} \
--replace_spkid=$replace_spkid
fi
# create finetune env
@@ -101,5 +103,5 @@ if [ ${stage} -le 6 ] && [ ${stop_stage} -ge 6 ]; then
--output_dir=./test_e2e/ \
--phones_dict=${dump_dir}/phone_id_map.txt \
--speaker_dict=${dump_dir}/speaker_id_map.txt \
--spk_id=0
--spk_id=$replace_spkid
fi

@@ -0,0 +1,110 @@
#!/bin/bash
set -e
source path.sh
input_dir=./input/SSB0005_mini
newdir_name="newdir"
new_dir=${input_dir}/${newdir_name}
pretrained_model_dir=./pretrained_models/fastspeech2_mix_ckpt_1.2.0
mfa_tools=./tools
mfa_dir=./mfa_result
dump_dir=./dump
output_dir=./exp/default
lang=zh
ngpu=1
finetune_config=./conf/finetune.yaml
replace_spkid=174 # csmsc: 174, ljspeech: 175, aishell3: 0~173, vctk: 176
ckpt=snapshot_iter_99300
gpus=1
CUDA_VISIBLE_DEVICES=${gpus}
stage=0
stop_stage=100
# with the following command, you can choose the stage range you want to run
# such as `./run.sh --stage 0 --stop-stage 0`
# this cannot be mixed with positional arguments `$1`, `$2`, ...
source ${MAIN_ROOT}/utils/parse_options.sh || exit 1
# check oov
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
echo "check oov"
python3 local/check_oov.py \
--input_dir=${input_dir} \
--pretrained_model_dir=${pretrained_model_dir} \
--newdir_name=${newdir_name} \
--lang=${lang}
fi
# get mfa result
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
echo "get mfa result"
python3 local/get_mfa_result.py \
--input_dir=${new_dir} \
--mfa_dir=${mfa_dir} \
--lang=${lang}
fi
# generate durations.txt
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
echo "generate durations.txt"
python3 local/generate_duration.py \
--mfa_dir=${mfa_dir}
fi
# extract feature
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
echo "extract feature"
python3 local/extract_feature.py \
--duration_file="./durations.txt" \
--input_dir=${new_dir} \
--dump_dir=${dump_dir} \
--pretrained_model_dir=${pretrained_model_dir} \
--replace_spkid=$replace_spkid
fi
# create finetune env
if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
echo "create finetune env"
python3 local/prepare_env.py \
--pretrained_model_dir=${pretrained_model_dir} \
--output_dir=${output_dir}
fi
# finetune
if [ ${stage} -le 5 ] && [ ${stop_stage} -ge 5 ]; then
echo "finetune..."
python3 local/finetune.py \
--pretrained_model_dir=${pretrained_model_dir} \
--dump_dir=${dump_dir} \
--output_dir=${output_dir} \
--ngpu=${ngpu} \
--epoch=100 \
--finetune_config=${finetune_config}
fi
# synthesize e2e
if [ ${stage} -le 6 ] && [ ${stop_stage} -ge 6 ]; then
echo "in hifigan syn_e2e"
python3 ${BIN_DIR}/../synthesize_e2e.py \
--am=fastspeech2_aishell3 \
--am_config=${pretrained_model_dir}/default.yaml \
--am_ckpt=${output_dir}/checkpoints/${ckpt}.pdz \
--am_stat=${pretrained_model_dir}/speech_stats.npy \
--voc=hifigan_aishell3 \
--voc_config=pretrained_models/hifigan_aishell3_ckpt_0.2.0/default.yaml \
--voc_ckpt=pretrained_models/hifigan_aishell3_ckpt_0.2.0/snapshot_iter_2500000.pdz \
--voc_stat=pretrained_models/hifigan_aishell3_ckpt_0.2.0/feats_stats.npy \
--lang=mix \
--text=${BIN_DIR}/../sentences_mix.txt \
--output_dir=./test_e2e/ \
--phones_dict=${dump_dir}/phone_id_map.txt \
--speaker_dict=${dump_dir}/speaker_id_map.txt \
--spk_id=$replace_spkid
fi
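After stage 3, it is worth confirming that the new speaker actually landed on the id you passed, since stage 6 reuses it via `--spk_id`. A small check, assuming the dump layout produced by the script above:
```python
# print which speaker owns replace_spkid in the dumped speaker map
replace_spkid = 174  # must match the value set in run_mix.sh
with open("./dump/speaker_id_map.txt") as f:
    for line in f:
        spk, spk_id = line.strip().split(" ")
        if spk_id == str(replace_spkid):
            print(f"spk_id {spk_id} -> {spk}")  # expect your finetuning speaker
```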