[Hackathon 7th] fix Voc5/Jets/TTS2 with CSMSC (#3906)

* fix Voc5/Jets with CSMSC

* fix Voc5/Jets with CSMSC

* Update README.md

* Update README.md

* Update README.md

* Update iSTFTNet.md

* Apply suggestions from code review

* Apply suggestions from code review

* Apply suggestions from code review
张春乔 3 weeks ago committed by GitHub
parent c33d9bfb50
commit 67ae7c8dd2

@ -3,7 +3,18 @@ This example contains code used to train a [JETS](https://arxiv.org/abs/2203.168
## Dataset
### Download and Extract
Download CSMSC from its [Official Website](https://test.data-baker.com/data/index/source).
Download CSMSC from its [official website](https://test.data-baker.com/data/index/TNtts/) and extract it to `~/datasets`. Then the dataset is in the directory `~/datasets/BZNSYP`.
The structure of the folder is listed below.
```text
└─ Wave
└─ .wav files (audio speech)
└─ PhoneLabeling
└─ .interval files (alignment between phoneme and duration)
└─ ProsodyLabeling
└─ 000001-010000.txt (text with prosody marks in pinyin)
```
### Get MFA Result and Extract
We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get phonemes and durations for JETS.

@ -5,6 +5,17 @@ This example contains code used to train a [SpeedySpeech](http://arxiv.org/abs/2
### Download and Extract
Download CSMSC from its [official website](https://test.data-baker.com/data/index/TNtts/) and extract it to `~/datasets`. Then the dataset is in the directory `~/datasets/BZNSYP`.
The structure of the folder is listed below.
```text
└─ Wave
└─ .wav files (audio speech)
└─ PhoneLabeling
└─ .interval files (alignment between phoneme and duration)
└─ ProsodyLabeling
└─ 000001-010000.txt (text with prosody marks in pinyin)
```
### Get MFA Result and Extract
We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get durations for SpeedySpeech.
You can download it from here: [baker_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/BZNSYP/with_tone/baker_alignment_tone.tar.gz), or train your own MFA model by referring to the [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) in our repo.

@ -4,6 +4,17 @@ This example contains code used to train a [HiFiGAN](https://arxiv.org/abs/2010.
### Download and Extract
Download CSMSC from its [official website](https://test.data-baker.com/data/index/TNtts/) and extract it to `~/datasets`. Then the dataset is in the directory `~/datasets/BZNSYP`.
The structure of the folder is listed below.
```text
└─ Wave
└─ .wav files (audio speech)
└─ PhoneLabeling
└─ .interval files (alignment between phoneme and duration)
└─ ProsodyLabeling
└─ 000001-010000.txt (text with prosody marks in pinyin)
```
### Get MFA Result and Extract
We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) results to cut the silence at the edges of the audio.
You can download it from here: [baker_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/BZNSYP/with_tone/baker_alignment_tone.tar.gz), or train your own MFA model by referring to the [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) in our repo.

@ -6,6 +6,17 @@ This example contains code used to train an [iSTFTNet](https://arxiv.org/abs/2203
### Download and Extract
Download CSMSC from its [official website](https://test.data-baker.com/data/index/TNtts/) and extract it to `~/datasets`. Then the dataset is in the directory `~/datasets/BZNSYP`.
The structure of the folder is listed below.
```text
└─ Wave
└─ .wav files (audio speech)
└─ PhoneLabeling
└─ .interval files (alignment between phoneme and duration)
└─ ProsodyLabeling
└─ 000001-010000.txt (text with prosody marks in pinyin)
```
### Get MFA Result and Extract
We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) results to cut the silence at the edges of the audio.
You can download it from here: [baker_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/BZNSYP/with_tone/baker_alignment_tone.tar.gz), or train your own MFA model by referring to the [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) in our repo.

@ -203,9 +203,9 @@ def main():
    sentences, speaker_set = get_phn_dur(dur_file)
    merge_silence(sentences)
    # split data into 3 sections
    if args.dataset == "baker":
        wav_files = sorted(list((rootdir / "Wave").rglob("*.wav")))
        # split data into 3 sections
        num_train = 9800
        num_dev = 100
        train_wav_files = wav_files[:num_train]
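The baker branch above splits the sorted wav list by fixed counts (9800 train, 100 dev, the rest test). A minimal self-contained sketch of that split, with hypothetical file names standing in for the 10000 CSMSC utterances:

```python
# Hypothetical stand-in for the 10000 CSMSC (BZNSYP) wav files.
wav_files = sorted(f"{i:06d}.wav" for i in range(1, 10001))

# Fixed-count split used for the baker dataset.
num_train = 9800
num_dev = 100
train_wav_files = wav_files[:num_train]
dev_wav_files = wav_files[num_train:num_train + num_dev]
test_wav_files = wav_files[num_train + num_dev:]
print(len(train_wav_files), len(dev_wav_files), len(test_wav_files))  # 9800 100 100
```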

@ -55,7 +55,9 @@ class GaussianUpsampling(nn.Layer):
        if h_masks is not None:
            t = t * paddle.to_tensor(h_masks, dtype="float32")
        c = ds.cumsum(axis=-1) - ds / 2
        ds_cumsum = ds.cumsum(axis=-1)
        ds_half = ds / 2
        c = ds_cumsum.astype(ds_half.dtype) - ds_half
        energy = -1 * self.delta * (t.unsqueeze(-1) - c.unsqueeze(1))**2
        if d_masks is not None:
            d_masks = ~(d_masks.unsqueeze(1))
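The change above splits `c = ds.cumsum(axis=-1) - ds / 2` so the integer cumulative sum is explicitly cast to the dtype of the float half-duration before subtracting. A minimal NumPy sketch (not Paddle) of the same token-center computation, with the cast made explicit:

```python
import numpy as np

# Durations (in frames) for three tokens; integer dtype, as durations often are.
ds = np.array([2, 3, 1], dtype=np.int64)

# Cumulative end positions and float half-durations.
ds_cumsum = np.cumsum(ds, axis=-1)  # [2, 5, 6]
ds_half = ds / 2                    # float64: [1.0, 1.5, 0.5]

# Cast the integer cumsum to the float dtype before subtracting; this mirrors
# the dtype mismatch that the PR's explicit astype() guards against in Paddle.
c = ds_cumsum.astype(ds_half.dtype) - ds_half  # token centers: [1.0, 3.5, 5.5]
print(c)
```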
