@@ -3,7 +3,18 @@ This example contains code used to train a [JETS](https://arxiv.org/abs/2203.168
## Dataset
### Download and Extract
Download CSMSC from its [official website](https://test.data-baker.com/data/index/TNtts/) and extract it to `~/datasets`. Then the dataset is in the directory `~/datasets/BZNSYP`.
The structure of the folder is listed below.
```text
└─ Wave
    └─ .wav files (speech audio)
└─ PhoneLabeling
    └─ .interval files (alignment between phoneme and duration)
└─ ProsodyLabeling
    └─ 000001-010000.txt (text with prosodic annotation and pinyin)
```
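If it helps to script this step, the snippet below is a small sketch (not part of the example's own scripts) that checks the extracted layout matches the structure above before running preprocessing.
```python
from pathlib import Path

# sketch: sanity-check the extracted CSMSC (BZNSYP) layout described above
root = Path.home() / "datasets" / "BZNSYP"
for sub in ("Wave", "PhoneLabeling", "ProsodyLabeling"):
    assert (root / sub).is_dir(), f"missing folder: {root / sub}"

n_wav = len(list((root / "Wave").glob("*.wav")))
n_interval = len(list((root / "PhoneLabeling").glob("*.interval")))
print(f"{n_wav} wav files, {n_interval} interval files")
print("prosody labels found:", (root / "ProsodyLabeling" / "000001-010000.txt").exists())
```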
### Get MFA Result and Extract
We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get phonemes and durations for JETS.
@@ -5,6 +5,17 @@ This example contains code used to train a [SpeedySpeech](http://arxiv.org/abs/2
### Download and Extract
Download CSMSC from its [official website](https://test.data-baker.com/data/index/TNtts/) and extract it to `~/datasets`. Then the dataset is in the directory `~/datasets/BZNSYP`.
The structure of the folder is listed below.
```text
└─ Wave
    └─ .wav files (speech audio)
└─ PhoneLabeling
    └─ .interval files (alignment between phoneme and duration)
└─ ProsodyLabeling
    └─ 000001-010000.txt (text with prosodic annotation and pinyin)
```
### Get MFA Result and Extract
We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get durations for SpeedySpeech.
You can download it from [baker_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/BZNSYP/with_tone/baker_alignment_tone.tar.gz), or train your own MFA model by referring to the [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) in our repo.
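The alignment archive can also be fetched and unpacked from a short script; this is only a sketch, and the extraction target directory is an assumption (point the example's preprocessing at wherever you unpack it).
```python
import tarfile
import urllib.request

# sketch: download and unpack the pre-computed MFA alignments for CSMSC
url = "https://paddlespeech.bj.bcebos.com/MFA/BZNSYP/with_tone/baker_alignment_tone.tar.gz"
archive = "baker_alignment_tone.tar.gz"

urllib.request.urlretrieve(url, archive)
with tarfile.open(archive, "r:gz") as tar:
    tar.extractall(".")  # extraction target is an assumption; adjust as needed
```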
@@ -4,6 +4,17 @@ This example contains code used to train a [HiFiGAN](https://arxiv.org/abs/2010.
### Download and Extract
Download CSMSC from its [official website](https://test.data-baker.com/data/index/TNtts/) and extract it to `~/datasets`. Then the dataset is in the directory `~/datasets/BZNSYP`.
The structure of the folder is listed below.
```text
└─ Wave
    └─ .wav files (speech audio)
└─ PhoneLabeling
    └─ .interval files (alignment between phoneme and duration)
└─ ProsodyLabeling
    └─ 000001-010000.txt (text with prosodic annotation and pinyin)
```
### Get MFA Result and Extract
We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) results to cut the silence at the edges of the audio.
You can download it from [baker_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/BZNSYP/with_tone/baker_alignment_tone.tar.gz), or train your own MFA model by referring to the [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) in our repo.
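To make the silence-trimming step concrete, here is a rough sketch of cutting a wav down to its first and last non-silence boundaries. The boundary times are taken as given (in practice they come from the utterance's MFA alignment), and the use of `soundfile` is an assumption about tooling, not the repository's actual preprocessing code.
```python
import soundfile as sf

def trim_edges(in_wav: str, out_wav: str, start_s: float, end_s: float) -> None:
    """Keep only the [start_s, end_s] segment of an utterance (boundaries from MFA)."""
    audio, sr = sf.read(in_wav)
    sf.write(out_wav, audio[int(start_s * sr):int(end_s * sr)], sr)

# the boundary values below are made up; real ones come from baker_alignment_tone
trim_edges("000001.wav", "000001_trim.wav", start_s=0.21, end_s=4.87)
```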
@@ -6,6 +6,17 @@ This example contains code used to train an [iSTFTNet](https://arxiv.org/abs/2203
### Download and Extract
Download CSMSC from its [official website](https://test.data-baker.com/data/index/TNtts/) and extract it to `~/datasets`. Then the dataset is in the directory `~/datasets/BZNSYP`.
The structure of the folder is listed below.
```text
└─ Wave
    └─ .wav files (speech audio)
└─ PhoneLabeling
    └─ .interval files (alignment between phoneme and duration)
└─ ProsodyLabeling
    └─ 000001-010000.txt (text with prosodic annotation and pinyin)
```
### Get MFA Result and Extract
We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) results to cut the silence at the edges of the audio.
You can download it from [baker_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/BZNSYP/with_tone/baker_alignment_tone.tar.gz), or train your own MFA model by referring to the [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) in our repo.
If you train a model by yourself, you need to prepare an audio file or use the audio demo above; please make sure the sample rate of the audio is 16 kHz. You can get the result of the audio demo by running the script below.
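A quick way to check, and if necessary convert, the sample rate is sketched below; the file name is a placeholder, the snippet assumes a mono file, and the use of `soundfile` and `librosa` is an assumption about available tooling, not part of the example itself.
```python
import soundfile as sf
import librosa

# sketch: make sure the demo audio is 16 kHz before running the script below
audio, sr = sf.read("demo.wav")  # "demo.wav" is a placeholder name
if sr != 16000:
    audio = librosa.resample(audio, orig_sr=sr, target_sr=16000)
    sf.write("demo_16k.wav", audio, 16000)
```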
# NOTE: the backward() in the old optimizer step below is problematic; as more steps are accumulated, the output from wavlm alone becomes the same for all frames
# old optimizer step
if (batch_index + 1) % train_conf.accum_grad == 0:
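# A minimal sketch of the accumulate-then-step pattern this condition guards,
# assuming Paddle-style optimizers named as in this trainer; the loss scaling
# and variable names below are illustrative, not the repository's actual code.
loss = loss / train_conf.accum_grad      # scale so accumulated grads match a full batch
loss.backward()                          # gradients accumulate across accum_grad batches
if (batch_index + 1) % train_conf.accum_grad == 0:
    self.model_optimizer.step()          # update the acoustic model parameters
    self.wavlm_optimizer.step()          # update the fine-tuned wavlm parameters
    self.model_optimizer.clear_grad()    # reset the accumulated gradients
    self.wavlm_optimizer.clear_grad()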
@@ -428,8 +428,7 @@ class WavLMASRTrainer(Trainer):
report("epoch",self.epoch)
report('step',self.iteration)
report("model_lr",self.model_optimizer.get_lr())
report("wavlm_lr",
self.wavlm_optimizer.get_lr())
report("wavlm_lr",self.wavlm_optimizer.get_lr())
self.train_batch(batch_index,batch,msg)
self.after_train_batch()
report('iter',batch_index+1)
@@ -680,8 +679,7 @@ class WavLMASRTrainer(Trainer):
self.extractor_mode: str = "default"  # mode for feature extractor. default has a single group norm with d groups in the first conv block, whereas layer_norm has layer norms in every block (meant to use with normalize=True)
self.encoder_layers: int = 12  # num encoder layers in the transformer
self.encoder_ffn_embed_dim: int = 3072  # encoder embedding dimension for FFN
self.encoder_attention_heads: int = 12  # num encoder attention heads
self.activation_fn: str = "gelu"  # activation function to use
self.layer_norm_first: bool = False  # apply layernorm first in the transformer
self.conv_feature_layers: str = "[(512,10,5)] + [(512,3,2)] * 4 + [(512,2,2)] * 2"  # string describing convolutional feature extraction layers in form of a python list that contains [(dim, kernel_size, stride), ...]
self.conv_bias: bool = False  # include bias in conv encoder
self.feature_grad_mult: float = 1.0  # multiply feature extractor var grads by this
self.normalize: bool = False  # normalize input to have 0 mean and unit variance during training
# dropouts
self.dropout: float = 0.1  # dropout probability for the transformer
self.attention_dropout: float = 0.1  # dropout probability for attention weights
self.activation_dropout: float = 0.0  # dropout probability after activation in FFN
self.encoder_layerdrop: float = 0.0  # probability of dropping a transformer layer
self.dropout_input: float = 0.0  # dropout to apply to the input (after feat extr)
self.dropout_features: float = 0.0  # dropout to apply to the features (after feat extr)
# masking
self.mask_length: int = 10  # mask length
self.mask_prob: float = 0.65  # probability of replacing a token with mask
self.mask_selection: str = "static"  # how to choose mask length
self.mask_other: float = 0  # secondary mask argument (used for more complex distributions), see help in compute_mask_indices
self.no_mask_overlap: bool = False  # whether to allow masks to overlap
self.mask_min_space: int = 1  # min space between spans (if no overlap is enabled)
# channel masking
self.mask_channel_length: int = 10  # length of the mask for features (channels)
self.mask_channel_prob: float = 0.0  # probability of replacing a feature with 0
self.mask_channel_selection: str = "static"  # how to choose mask length for channel masking
self.mask_channel_other: float = 0  # secondary mask argument (used for more complex distributions), see help in compute_mask_indices
self.no_mask_channel_overlap: bool = False  # whether to allow channel masks to overlap
self.mask_channel_min_space: int = 1  # min space between spans (if no overlap is enabled)
# positional embeddings
self.conv_pos: int = 128  # number of filters for convolutional positional embeddings
self.conv_pos_groups: int = 16  # number of groups for convolutional positional embedding
# relative position embedding
self.relative_position_embedding: bool = True  # apply relative position embedding
self.num_buckets: int = 320  # number of buckets for relative position embedding
self.max_distance: int = 1280  # maximum distance for relative position embedding
self.gru_rel_pos: bool = True  # apply gated relative position embedding
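# Usage sketch: assuming the listing above is the __init__ of a WavLM-style config
# class (the names WavLMConfig and WavLM below are assumptions for illustration):
#
#     cfg = WavLMConfig()
#     cfg.mask_prob = 0.5          # override a masking hyper-parameter
#     cfg.encoder_layerdrop = 0.1  # drop transformer layers during fine-tuning
#     model = WavLM(cfg)           # build the encoder from the config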