This example contains code used to train a [DiffSinger](https://arxiv.org/abs/2105.02446) model on the [Opencpop Mandarin singing corpus](https://wenet.org.cn/opencpop/).
## Dataset
### Download and Extract
Download Opencpop from its [official website](https://wenet.org.cn/opencpop/download/) and extract it to `~/datasets`. The dataset is then located in `~/datasets/Opencpop`.
## Get Started
Assume the path to the dataset is `~/datasets/Opencpop`.
Run the command below to
1. **source path**.
2. preprocess the dataset.
3. train the model.
4. synthesize wavs.
    - synthesize waveform from `metadata.jsonl`.
    - (Supporting) synthesize waveform from a text file.
5. (Supporting) inference using the static model.
```bash
./run.sh
```
You can choose a range of stages to run, or set `stage` equal to `stop-stage` to run a single stage. For example, the following command only preprocesses the dataset.
```bash
./run.sh --stage 0 --stop-stage 0
```
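The stage range can be read as an inclusive interval `[stage, stop-stage]`. The following is a minimal Python sketch of that selection logic, not the actual contents of `run.sh`. Only "stage 0 is preprocessing" comes from the text above; the rest of the stage table and the function name `selected_stages` are assumptions made for illustration.

```python
# Hypothetical stage table: only "0 -> preprocess" is stated in this README.
# The other entries are assumed for illustration.
STAGES = {0: "preprocess", 1: "train", 2: "synthesize"}


def selected_stages(stage, stop_stage):
    """Return the names of all stages whose index lies in [stage, stop_stage]."""
    return [
        name
        for idx, name in sorted(STAGES.items())
        if stage <= idx <= stop_stage
    ]


print(selected_stages(0, 0))  # ['preprocess']
print(selected_stages(0, 2))  # ['preprocess', 'train', 'synthesize']
```

Setting both bounds to the same value selects exactly one stage, which is what `--stage 0 --stop-stage 0` does.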
### Data Preprocessing
```bash
./local/preprocess.sh ${conf_path}
```
When it is done, a `dump` folder is created in the current directory. Its structure is listed below.
```text
dump
├── dev
│   ├── norm
│   └── raw
├── phone_id_map.txt
├── speaker_id_map.txt
├── test
│   ├── norm
│   └── raw
└── train
    ├── energy_stats.npy
    ├── norm
    ├── pitch_stats.npy
    ├── raw
    ├── speech_stats.npy
    └── speech_stretchs.npy
```
The dataset is split into 3 parts, namely `train`, `dev`, and `test`, each of which contains a `norm` and a `raw` subfolder. The `raw` folder contains the speech, pitch, and energy features of each utterance, while the `norm` folder contains the normalized ones. The statistics used to normalize features are computed from the training set and are stored in `dump/train/*_stats.npy`. `speech_stretchs.npy` contains the minimum and maximum values of each dimension of the mel spectrum, which are used for linear stretching before training and inference of the diffusion module.
Note: Because training on non-normalized features works better than training on normalized ones, the features saved under `norm` have in fact not been normalized.
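To make the linear stretching concrete, here is a minimal sketch in plain Python. It assumes the stretch maps each mel dimension from `[min, max]` to `[-1, 1]`; the target range is an assumption, since this README only says the min and max values are used for linear stretching. The function names and the toy numbers are hypothetical, and the layout of `speech_stretchs.npy` is not assumed.

```python
def stretch(frame, mel_min, mel_max):
    """Linearly map each mel dimension from [min, max] to [-1, 1] (assumed range)."""
    return [
        (x - lo) / (hi - lo) * 2.0 - 1.0
        for x, lo, hi in zip(frame, mel_min, mel_max)
    ]


def unstretch(frame, mel_min, mel_max):
    """Invert stretch(): map values in [-1, 1] back to [min, max]."""
    return [
        (y + 1.0) / 2.0 * (hi - lo) + lo
        for y, lo, hi in zip(frame, mel_min, mel_max)
    ]


# Toy two-dimensional "mel frame" with made-up per-dimension bounds.
mel_min = [0.0, 0.0]
mel_max = [10.0, 10.0]
stretched = stretch([0.0, 5.0], mel_min, mel_max)
restored = unstretch(stretched, mel_min, mel_max)
print(stretched, restored)
```

The per-dimension bounds play the role of the values stored in `speech_stretchs.npy`; the inverse mapping would be applied to the diffusion output before it is passed on.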
Each subfolder also contains a `metadata.jsonl` file. It is a table-like file that contains the utterance id, speaker id, phones, text_lengths, speech_lengths, phone durations, the paths of the speech, pitch, and energy features, notes, note durations, and slurs.
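Since `metadata.jsonl` stores one JSON object per line, it can be inspected with the standard library alone. The sketch below writes a toy record to a temporary file and reads it back. The helper name `read_jsonl` and the key `utt_id` are hypothetical; only `text_lengths` and `speech_lengths` are field names taken from the description above, and the values are made up.

```python
import json
import tempfile
from pathlib import Path


def read_jsonl(path):
    """Parse a JSON Lines file into a list of dicts, skipping blank lines."""
    records = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                records.append(json.loads(line))
    return records


# Toy record for demonstration; real files contain more fields.
sample = {"utt_id": "demo_0001", "text_lengths": 3, "speech_lengths": 120}

with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "metadata.jsonl"
    path.write_text(json.dumps(sample) + "\n", encoding="utf-8")
    records = read_jsonl(path)

print(len(records), sorted(records[0]))
```

Pointing `read_jsonl` at a file such as `dump/train/norm/metadata.jsonl` should list the actual keys used by the preprocessing step.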
### Model Training
The training step accepts the following arguments:
1. `--config` is a config file in YAML format that overrides the default config, which can be found at `conf/default.yaml`.
2. `--train-metadata` and `--dev-metadata` should be the metadata files in the normalized subfolders of `train` and `dev` in the `dump` folder.
3. `--output-dir` is the directory in which to save the results of the experiment. Checkpoints are saved in `checkpoints/` inside this directory.
4. `--ngpu` is the number of GPUs to use; if `ngpu == 0`, the CPU is used.
5. `--phones-dict` is the path of the phone vocabulary file.
6. `--speech-stretchs` is the path of the file holding the min and max values of the mel spectrum.
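The arguments above map directly onto the `dump` layout produced by preprocessing. As a rough sketch, the helper below assembles them into an argument list. The function name `build_train_args` and the output directory `exp/default` are hypothetical, and the script that consumes these flags is not shown in this section, so the sketch stops at building the list.

```python
def build_train_args(dump_dir, config, output_dir, ngpu):
    """Assemble the documented training flags from a `dump` directory layout."""
    return [
        "--config", config,
        "--train-metadata", f"{dump_dir}/train/norm/metadata.jsonl",
        "--dev-metadata", f"{dump_dir}/dev/norm/metadata.jsonl",
        "--output-dir", output_dir,
        "--ngpu", str(ngpu),
        "--phones-dict", f"{dump_dir}/phone_id_map.txt",
        "--speech-stretchs", f"{dump_dir}/train/speech_stretchs.npy",
    ]


# "exp/default" is a placeholder output directory chosen for this example.
args = build_train_args("dump", "conf/default.yaml", "exp/default", ngpu=1)
print(" ".join(args))
```

Keeping the paths derived from a single `dump_dir` avoids mixing metadata from one preprocessing run with statistics from another.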
### Synthesizing
We use Parallel WaveGAN as the neural vocoder.
Download the pretrained Parallel WaveGAN model from [pwgan_opencpop_ckpt_1.4.0.zip](https://paddlespeech.bj.bcebos.com/t2s/svs/opencpop/pwgan_opencpop_ckpt_1.4.0.zip) and unzip it.
```text
                        {pwgan_opencpop, hifigan_opencpop} Choose vocoder type of svs task.
  --voc_config VOC_CONFIG
                        Config of voc.
  --voc_ckpt VOC_CKPT   Checkpoint file of voc.
  --voc_stat VOC_STAT   mean and standard deviation used to normalize
                        spectrogram when training voc.
  --lang LANG           {zh, en, mix, canton} Choose language type of tts task.
                        {sing} Choose language type of svs task.
  --inference_dir INFERENCE_DIR
                        dir to save inference models
  --ngpu NGPU           if ngpu == 0, use cpu.
  --text TEXT           text to synthesize file, a 'utt_id sentence' pair per line for tts task.
                        A '{ utt_id input_type (is word) text notes note_durs}' or '{utt_id input_type (is phoneme) phones notes note_durs is_slurs}' pair per line for svs task.
  --output_dir OUTPUT_DIR
                        output dir.
  --pinyin_phone PINYIN_PHONE
                        pinyin to phone map file, using on sing_frontend.
  --speech_stretchs SPEECH_STRETCHS
                        The min and max values of the mel spectrum, using on diffusion of diffsinger.
```
1. `--am` is the acoustic model type, in the format `{model_name}_{dataset}`.
2. `--am_config`, `--am_ckpt`, `--am_stat`, and `--phones_dict` are arguments for the acoustic model, which correspond to the 4 files in the DiffSinger pretrained model.
3. `--voc` is the vocoder type, in the format `{model_name}_{dataset}`.
4. `--voc_config`, `--voc_ckpt`, and `--voc_stat` are arguments for the vocoder, which correspond to the 3 files in the Parallel WaveGAN pretrained model.
5. `--lang` is the language: `zh`, `en`, `mix`, or `canton` for the tts task, and `sing` for the svs task.
6. `--test_metadata` should be the metadata file in the normalized subfolder of `test` in the `dump` folder.
7. `--text` is the text file, which contains the sentences to synthesize.
8. `--output_dir` is the directory in which to save the synthesized audio files.
9. `--ngpu` is the number of GPUs to use; if `ngpu == 0`, the CPU is used.
10. `--inference_dir` is the directory in which to save static models. If this argument is omitted, no static model is generated or saved.
11. `--pinyin_phone` is the pinyin-to-phone map file, used by `sing_frontend`.
12. `--speech_stretchs` is the file holding the min and max values of the mel spectrum, used by the diffusion module of DiffSinger.
Note: At present, the DiffSinger model does not support dynamic-to-static conversion, so do not add `--inference_dir`.
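The `{model_name}_{dataset}` naming used by `--am` and `--voc` can be split on the last underscore. The sketch below illustrates this with the two vocoder names listed in the help text. The helper `split_model_tag` is hypothetical, and it assumes the dataset part itself contains no underscore.

```python
def split_model_tag(tag):
    """Split '{model_name}_{dataset}' on the last underscore.

    Assumes the dataset name contains no underscore of its own.
    """
    if "_" not in tag:
        raise ValueError(f"expected '{{model_name}}_{{dataset}}', got {tag!r}")
    model_name, dataset = tag.rsplit("_", 1)
    return model_name, dataset


print(split_model_tag("pwgan_opencpop"))    # ('pwgan', 'opencpop')
print(split_model_tag("hifigan_opencpop"))  # ('hifigan', 'opencpop')
```

Splitting on the last underscore rather than the first keeps model names that themselves contain underscores intact.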