@ -19,7 +19,7 @@ There are 3 ways to use `PaddleSpeech`. According to the degree of difficulty, t
- If you are new to `PaddleSpeech` and want to try it out easily without setting up your own machine, we recommend using [AI Studio](https://aistudio.baidu.com/aistudio/index). There is a step-by-step [tutorial](https://aistudio.baidu.com/aistudio/education/group/info/25130) for `PaddleSpeech`, and you can use its basic functions on a free machine.
- If you want to use the command line function of `PaddleSpeech`, you need to complete the following steps to install `PaddleSpeech`. For more information about how to use the command line function, see the [cli](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/paddlespeech/cli).
### Install Conda
Conda is an environment management system. You can go to [miniconda](https://docs.conda.io/en/latest/miniconda.html) (select a version with py>=3.7) to download and install conda.
Then install the conda dependencies for `paddlespeech`:
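A minimal sketch of this step on Linux (the exact package list is given in the installation document; the set below is an assumption for illustration):
```bash
# install system-level dependencies into the active conda environment
conda install -y -c conda-forge sox libsndfile swig bzip2
```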
A TTS system mainly includes three modules: `Text Frontend`, `Acoustic model`, and `Vocoder`. We introduce a rule-based Chinese text frontend in [zh_text_frontend](./zh_text_frontend.md). Here, we will introduce acoustic models and vocoders, which are trainable.
The main processes of TTS include:
1. Convert the original text into characters/phonemes through the `text frontend` module.
- [fastspeech2_aishell3_static_pir_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_aishell3_static_pir_1.1.0.zip) (to run the PIR model, set FLAGS_enable_pir_api=1; PIR models only work with paddlepaddle>=3.0.0b2)
- [pwgan_aishell3_static_pir_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwgan_aishell3_static_pir_1.1.0.zip) (to run the PIR model, set FLAGS_enable_pir_api=1; PIR models only work with paddlepaddle>=3.0.0b2)
- [hifigan_aishell3_static_pir_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_aishell3_static_pir_1.1.0.zip) (to run the PIR model, set FLAGS_enable_pir_api=1; PIR models only work with paddlepaddle>=3.0.0b2)
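For the PIR static models listed above, the flag can be exported in the shell before running inference, for example:
```bash
# required for PIR static models; only works with paddlepaddle>=3.0.0b2
export FLAGS_enable_pir_api=1
```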
@ -3,7 +3,18 @@ This example contains code used to train a [JETS](https://arxiv.org/abs/2203.168
## Dataset
### Download and Extract
Download CSMSC from its [official website](https://test.data-baker.com/data/index/TNtts/) and extract it to `~/datasets`. Then the dataset is in the directory `~/datasets/BZNSYP`.
The structure of the folder is listed below.
```text
└─ Wave
└─ .wav files (audio speech)
└─ PhoneLabeling
└─ .interval files (alignment between phoneme and duration)
└─ ProsodyLabeling
└─ 000001-010000.txt (text with prosodic by pinyin)
```
### Get MFA Result and Extract
We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get phonemes and durations for JETS.
@ -5,6 +5,17 @@ This example contains code used to train a [SpeedySpeech](http://arxiv.org/abs/2
### Download and Extract
Download CSMSC from its [official website](https://test.data-baker.com/data/index/TNtts/) and extract it to `~/datasets`. Then the dataset is in the directory `~/datasets/BZNSYP`.
The structure of the folder is listed below.
```text
└─ Wave
└─ .wav files (audio speech)
└─ PhoneLabeling
└─ .interval files (alignment between phoneme and duration)
└─ ProsodyLabeling
└─ 000001-010000.txt (text with prosodic by pinyin)
```
### Get MFA Result and Extract
We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get durations for SpeedySpeech.
You can download it from [baker_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/BZNSYP/with_tone/baker_alignment_tone.tar.gz), or train your own MFA model by following the [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) in our repo.
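For example, the alignment archive can be fetched and unpacked like this (the target directory is up to you):
```bash
wget https://paddlespeech.bj.bcebos.com/MFA/BZNSYP/with_tone/baker_alignment_tone.tar.gz
tar xf baker_alignment_tone.tar.gz
```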
@ -4,6 +4,18 @@ This example contains code used to train a [parallel wavegan](http://arxiv.org/a
### Download and Extract
Download CSMSC from its [official website](https://test.data-baker.com/data/index/TNtts/) and extract it to `~/datasets`. Then the dataset is in the directory `~/datasets/BZNSYP`.
After processing the data, the ``BZNSYP`` directory will look like this:
```text
BZNSYP
├── Wave
│ └─ *.wav files (audio speech)
├── PhoneLabeling
│ └─ *.interval files (alignment between phoneme and duration)
└── ProsodyLabeling
└─ 000001-010000.txt (text with prosodic by pinyin)
```
This experiment only uses the `*.wav` files in the `Wave` directory.
### Get MFA Result and Extract
We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) results to cut the silence at the edges of the audio.
You can download it from [baker_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/BZNSYP/with_tone/baker_alignment_tone.tar.gz), or train your own MFA model by following the [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) in our repo.
@ -17,6 +29,7 @@ Run the command below to
3. train the model.
4. synthesize wavs.
    - synthesize waveform from `metadata.jsonl`.
    - synthesize waveform from a text file.
```bash
./run.sh
```
@ -94,6 +107,18 @@ benchmark:
4. `--ngpu` is the number of gpus to use; if ngpu == 0, use cpu.
### Synthesizing
We use [parallel wavegan](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/voc1) as the neural vocoder.
Download the pretrained parallel wavegan model from [pwg_baker_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_baker_ckpt_0.4.zip) and unzip it.
4. `--output-dir` is the directory to save the synthesized audio files.
5. `--ngpu` is the number of gpus to use; if ngpu == 0, use cpu.
We use [Fastspeech2](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/csmsc/tts3) as the acoustic model.
Download the pretrained fastspeech2_nosil model from [fastspeech2_nosil_baker_ckpt_0.4.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_baker_ckpt_0.4.zip) and unzip it.
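For example, both pretrained models linked above can be downloaded and unpacked as follows:
```bash
wget https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_baker_ckpt_0.4.zip
unzip fastspeech2_nosil_baker_ckpt_0.4.zip
wget https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_baker_ckpt_0.4.zip
unzip pwg_baker_ckpt_0.4.zip
```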
--voc_stat VOC_STAT mean and standard deviation used to normalize
spectrogram when training voc.
--lang LANG Choose model language. zh or en
--inference_dir INFERENCE_DIR
dir to save inference models
--ngpu NGPU if ngpu == 0, use cpu.
--text TEXT text to synthesize, a 'utt_id sentence' pair per line.
--output_dir OUTPUT_DIR
output dir.
```
1. `--am` is the acoustic model type, with the format {model_name}_{dataset}.
2. `--am_config`, `--am_ckpt`, `--am_stat`, and `--phones_dict` are arguments for the acoustic model, which correspond to the 4 files in the fastspeech2 pretrained model.
3. `--voc` is the vocoder type, with the format {model_name}_{dataset}.
4. `--voc_config`, `--voc_ckpt`, and `--voc_stat` are arguments for the vocoder, which correspond to the 3 files in the parallel wavegan pretrained model.
5. `--lang` is the model language, which can be `zh` or `en`.
6. `--test_metadata` should be the metadata file in the normalized subfolder of `test` in the `dump` folder.
7. `--text` is the text file, which contains sentences to synthesize.
8. `--output_dir` is the directory to save synthesized audio files.
9. `--ngpu` is the number of gpus to use; if ngpu == 0, use cpu.
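Putting these arguments together, a text-to-wav invocation might look like the sketch below. The script entry point and the checkpoint/config file names inside the unzipped pretrained models are assumptions for illustration; the exact command lives in the example's `local/` scripts.
```bash
# Sketch only: script path and checkpoint/config file names are placeholders.
python3 synthesize_e2e.py \
    --am=fastspeech2_csmsc \
    --am_config=fastspeech2_nosil_baker_ckpt_0.4/default.yaml \
    --am_ckpt=fastspeech2_nosil_baker_ckpt_0.4/snapshot_iter_76000.pdz \
    --am_stat=fastspeech2_nosil_baker_ckpt_0.4/speech_stats.npy \
    --phones_dict=fastspeech2_nosil_baker_ckpt_0.4/phone_id_map.txt \
    --voc=pwgan_csmsc \
    --voc_config=pwg_baker_ckpt_0.4/pwg_default.yaml \
    --voc_ckpt=pwg_baker_ckpt_0.4/pwg_snapshot_iter_400000.pdz \
    --voc_stat=pwg_baker_ckpt_0.4/pwg_stats.npy \
    --lang=zh \
    --text=sentences.txt \
    --output_dir=output \
    --ngpu=1
```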
- [mb_melgan_csmsc_static_pir_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/mb_melgan/mb_melgan_csmsc_static_pir_0.1.1.zip) (to run the PIR model, set FLAGS_enable_pir_api=1; PIR models only work with paddlepaddle>=3.0.0b2)
@ -4,6 +4,17 @@ This example contains code used to train a [HiFiGAN](https://arxiv.org/abs/2010.
### Download and Extract
Download CSMSC from its [official website](https://test.data-baker.com/data/index/TNtts/) and extract it to `~/datasets`. Then the dataset is in the directory `~/datasets/BZNSYP`.
The structure of the folder is listed below.
```text
└─ Wave
└─ .wav files (audio speech)
└─ PhoneLabeling
└─ .interval files (alignment between phoneme and duration)
└─ ProsodyLabeling
└─ 000001-010000.txt (text with prosodic by pinyin)
```
### Get MFA Result and Extract
We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) results to cut the silence at the edges of the audio.
You can download it from [baker_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/BZNSYP/with_tone/baker_alignment_tone.tar.gz), or train your own MFA model by following the [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) in our repo.
@ -118,6 +129,9 @@ The pretrained model can be downloaded here:
- [hifigan_csmsc_static_pir_0.1.1.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_csmsc_static_pir_0.1.1.zip) (to run the PIR model, set FLAGS_enable_pir_api=1; PIR models only work with paddlepaddle>=3.0.0b2)
@ -6,6 +6,17 @@ This example contains code used to train a [iSTFTNet](https://arxiv.org/abs/2203
### Download and Extract
Download CSMSC from its [official website](https://test.data-baker.com/data/index/TNtts/) and extract it to `~/datasets`. Then the dataset is in the directory `~/datasets/BZNSYP`.
The structure of the folder is listed below.
```text
└─ Wave
└─ .wav files (audio speech)
└─ PhoneLabeling
└─ .interval files (alignment between phoneme and duration)
└─ ProsodyLabeling
└─ 000001-010000.txt (text with prosodic by pinyin)
```
### Get MFA Result and Extract
We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) results to cut the silence at the edges of the audio.
You can download it from [baker_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/BZNSYP/with_tone/baker_alignment_tone.tar.gz), or train your own MFA model by following the [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) in our repo.
You can train a model by yourself; then you need to prepare an audio file or use the audio demo above. Please confirm that the sample rate of the audio is 16 kHz. You can get the result for the audio demo by running the script below.
You need to prepare an audio file or use the audio demo above. Please confirm that the sample rate of the audio is 16 kHz. You can get the result for the audio demo by running the script below.
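If you are not sure about the sample rate, `sox` (assumed to be installed) can check and, if needed, convert it; a minimal sketch:
```bash
# print the sample rate of the file
soxi -r input.wav
# resample to 16 kHz if the rate differs
sox input.wav -r 16000 input_16k.wav
```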
This phoneme-based continuous speech corpus is a collaboration between Texas Instruments, MIT, and SRI International. The [TIMIT](https://catalog.ldc.upenn.edu/docs/LDC93S1/) dataset has an audio sampling rate of 16 kHz and contains a total of 6,300 sentences: 630 speakers from 8 major U.S. dialect regions each read 10 given sentences, and all sentences are manually segmented and labeled at the phone level. Seventy percent of the speakers are male; most of the speakers are white adults.
## Dataset
### Download and Extract
Download TIMIT from its [official website](https://catalog.ldc.upenn.edu/LDC93S1) and extract it to `~/datasets`. We assume the dataset is unzipped into the directory `~/datasets/timit`.
## Overview
All the scripts you need are in `run.sh`. There are several stages in `run.sh`, and each stage has its function.
| Stage | Function |
|:----- |:-------- |
| 0 | Process data. It includes: <br> (1) Download the dataset <br> (2) Calculate the CMVN of the train dataset <br> (3) Get the vocabulary file <br> (4) Get the manifest files of the train, development and test dataset |
| 1 | Train the model |
| 2 | Get the final model by averaging the top-k models, set k = 1 means to choose the best model |
| 3 | Test the final model performance |
| 4 | Get ctc alignment of test data using the final model |
You can choose to run a range of stages by setting `stage` and `stop_stage`.
For example, if you want to execute the code in stage 2 and stage 3, you can run this script:
```bash
bash run.sh --stage 2 --stop_stage 3
```
Or you can set `stage` equal to `stop_stage` to run only one stage.
For example, if you only want to run `stage 0`, you can use the script below:
```bash
bash run.sh --stage 0 --stop_stage 0
```
The document below will describe the scripts in `run.sh` in detail.
## The Environment Variables
The path.sh contains the environment variables.
```bash
source path.sh
```
This script needs to be run first. Another script is also needed:
```bash
source ${MAIN_ROOT}/utils/parse_options.sh
```
It enables the use of `--variable value` options in the shell scripts.
## The Local Variables
Some local variables are set in `run.sh`.
`gpus` denotes the GPU number you want to use. If you set `gpus=`, it means you only use CPU.
`stage` denotes the number of the stage you want to start from in the experiments.
`stop_stage` denotes the number of the stage you want to end at in the experiments.
`conf_path` denotes the config path of the model.
`avg_num` denotes the number K of top-K models you want to average to get the final model.
`audio_file` denotes the file path of the single file you want to infer in stage 5.
`ckpt` denotes the checkpoint prefix of the model, e.g. "conformer".
You can set the local variables (except `ckpt`) when you use `run.sh`.
For example, you can set `gpus` and `avg_num` on the command line:
```bash
bash run.sh --gpus 0,1,2,3 --avg_num 10
```
## Stage 0: Data Processing
To use this example, you need to process data firstly and you can use stage 0 in `run.sh` to do this. The code is shown below:
```bash
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
# prepare data
bash ./local/timit_data_prep.sh ${TIMIT_path}
bash ./local/data.sh || exit -1
fi
```
Stage 0 is for processing the data.
If you only want to process the data, you can run:
```bash
bash run.sh --stage 0 --stop_stage 0
```
You can also just run these scripts in your command line.
```bash
source path.sh
bash ./local/timit_data_prep.sh ${TIMIT_path}
bash ./local/data.sh
```
After processing the data, the ``data`` directory will look like this:
```bash
data/
|-- lang_char
| `-- vocab.txt
|-- local
| `-- dev_sph.flist
| `-- dev_sph.scp
| `-- dev.text
| `-- dev.trans
| `-- dev.uttids
| `-- test_sph.flist
| `-- test_sph.scp
| `-- test.text
| `-- test.trans
| `-- test.uttids
| `-- train_sph.flist
| `-- train_sph.scp
| `-- train.text
| `-- train.trans
| `-- train.uttids
|-- manifest.dev
|-- manifest.dev.raw
|-- manifest.test
|-- manifest.test.raw
|-- manifest.train
|-- manifest.train.raw
|-- mean_std.json
|-- test.meta
```
## Stage 1: Model Training
If you want to train the model, you can use stage 1 in `run.sh`. The code is shown below.
```bash
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
    # train the model; checkpoints are saved under the exp/${ckpt} directory
    # (typical PaddleSpeech example call; see local/train.sh for the exact command)
    CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${ckpt}
fi
```
After training the model, we need to get the final model for testing and inference. A model checkpoint is saved at every epoch, so we can either choose the best checkpoint based on the validation loss, or sort the checkpoints and average the parameters of the top-k models to get the final model. We can use stage 2 to do this, and the code is shown below:
```bash
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
# avg n best model
avg.sh best exp/${ckpt}/checkpoints ${avg_num}
fi
```
The `avg.sh` script is in `../../../utils/`, which is defined in `path.sh`.
If you want to get the final model, you can use the script below to execute stage 0, stage 1, and stage 2:
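Following the same stage-selection pattern as the earlier examples, that is:
```bash
bash run.sh --stage 0 --stop_stage 2
```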
- [hifigan_vctk_static_pir_1.1.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_vctk_static_pir_1.1.0.zip) (to run the PIR model, set FLAGS_enable_pir_api=1; PIR models only work with paddlepaddle>=3.0.0b2)
# NOTE: the code below asserted that the backward() is problematic, and as more steps are accumulated, the output from wavlm alone will be the same for all frames
# optimizer step old
if (batch_index + 1) % train_conf.accum_grad == 0:
@ -428,8 +428,7 @@ class WavLMASRTrainer(Trainer):
report("epoch",self.epoch)
report("epoch",self.epoch)
report('step',self.iteration)
report('step',self.iteration)
report("model_lr",self.model_optimizer.get_lr())
report("model_lr",self.model_optimizer.get_lr())
report("wavlm_lr",
report("wavlm_lr",self.wavlm_optimizer.get_lr())
self.wavlm_optimizer.get_lr())
self.train_batch(batch_index,batch,msg)
self.train_batch(batch_index,batch,msg)
self.after_train_batch()
self.after_train_batch()
report('iter',batch_index+1)
report('iter',batch_index+1)
@ -680,8 +679,7 @@ class WavLMASRTrainer(Trainer):
self.extractor_mode: str = "default"  # mode for feature extractor. default has a single group norm with d groups in the first conv block, whereas layer_norm has layer norms in every block (meant to use with normalize=True)
self.encoder_layers: int = 12  # num encoder layers in the transformer
self.encoder_ffn_embed_dim: int = 3072  # encoder embedding dimension for FFN
self.encoder_attention_heads: int = 12  # num encoder attention heads
self.activation_fn: str = "gelu"  # activation function to use
self.layer_norm_first: bool = False  # apply layernorm first in the transformer
self.conv_feature_layers: str = "[(512,10,5)] + [(512,3,2)] * 4 + [(512,2,2)] * 2"  # string describing convolutional feature extraction layers in form of a python list that contains [(dim, kernel_size, stride), ...]
self.conv_bias: bool = False  # include bias in conv encoder
self.feature_grad_mult: float = 1.0  # multiply feature extractor var grads by this
self.normalize: bool = False  # normalize input to have 0 mean and unit variance during training
# dropouts
self.dropout: float = 0.1  # dropout probability for the transformer
self.attention_dropout: float = 0.1  # dropout probability for attention weights
self.activation_dropout: float = 0.0  # dropout probability after activation in FFN
self.encoder_layerdrop: float = 0.0  # probability of dropping a transformer layer
self.dropout_input: float = 0.0  # dropout to apply to the input (after feat extr)
self.dropout_features: float = 0.0  # dropout to apply to the features (after feat extr)
# masking
self.mask_length: int = 10  # mask length
self.mask_prob: float = 0.65  # probability of replacing a token with mask
self.mask_selection: str = "static"  # how to choose mask length
self.mask_other: float = 0  # secondary mask argument (used for more complex distributions), see help in compute_mask_indices
self.no_mask_overlap: bool = False  # whether to allow masks to overlap
self.mask_min_space: int = 1  # min space between spans (if no overlap is enabled)
# channel masking
self.mask_channel_length: int = 10  # length of the mask for features (channels)
self.mask_channel_prob: float = 0.0  # probability of replacing a feature with 0
self.mask_channel_selection: str = "static"  # how to choose mask length for channel masking
self.mask_channel_other: float = 0  # secondary mask argument (used for more complex distributions), see help in compute_mask_indices
self.no_mask_channel_overlap: bool = False  # whether to allow channel masks to overlap
self.mask_channel_min_space: int = 1  # min space between spans (if no overlap is enabled)
# positional embeddings
self.conv_pos: int = 128  # number of filters for convolutional positional embeddings
self.conv_pos_groups: int = 16  # number of groups for convolutional positional embedding
# relative position embedding
self.relative_position_embedding: bool = True  # apply relative position embedding
self.num_buckets: int = 320  # number of buckets for relative position embedding
self.max_distance: int = 1280  # maximum distance for relative position embedding
self.gru_rel_pos: bool = True  # apply gated relative position embedding