diff --git a/docs/source/tts/README.md b/docs/source/tts/README.md index 18283cb2..3d9ee972 100644 --- a/docs/source/tts/README.md +++ b/docs/source/tts/README.md @@ -5,20 +5,6 @@ Parakeet aims to provide a flexible, efficient and state-of-the-art text-to-spee
- -## News -- Oct-12-2021, Refector examples code. -- Oct-12-2021, Parallel WaveGAN with LJSpeech. Check [examples/GANVocoder/parallelwave_gan/ljspeech](./examples/GANVocoder/parallelwave_gan/ljspeech). -- Oct-12-2021, FastSpeech2/FastPitch with LJSpeech. Check [examples/fastspeech2/ljspeech](./examples/fastspeech2/ljspeech). -- Sep-14-2021, Reconstruction of TransformerTTS. Check [examples/transformer_tts/ljspeech](./examples/transformer_tts/ljspeech). -- Aug-31-2021, Chinese Text Frontend. Check [examples/text_frontend](./examples/text_frontend). -- Aug-23-2021, FastSpeech2/FastPitch with AISHELL-3. Check [examples/fastspeech2/aishell3](./examples/fastspeech2/aishell3). -- Aug-03-2021, FastSpeech2/FastPitch with CSMSC. Check [examples/fastspeech2/baker](./examples/fastspeech2/baker). -- Jul-19-2021, SpeedySpeech with CSMSC. Check [examples/speedyspeech/baker](./examples/speedyspeech/baker). -- Jul-01-2021, Parallel WaveGAN with CSMSC. Check [examples/GANVocoder/parallelwave_gan/baker](./examples/GANVocoder/parallelwave_gan/baker). -- Jul-01-2021, Montreal-Forced-Aligner. Check [examples/use_mfa](./examples/use_mfa). -- May-07-2021, Voice Cloning in Chinese. Check [examples/tacotron2_aishell3](./examples/tacotron2_aishell3). - ## Overview In order to facilitate exploiting the existing TTS models directly and developing the new ones, Parakeet selects typical models and provides their reference implementations in PaddlePaddle. Further more, Parakeet abstracts the TTS pipeline and standardizes the procedure of data preprocessing, common modules sharing, model configuration, and the process of training and synthesis. The models supported here include Text FrontEnd, end-to-end Acoustic models and Vocoders: @@ -38,50 +24,11 @@ In order to facilitate exploiting the existing TTS models directly and developin - [Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis](https://arxiv.org/pdf/1806.04558v4.pdf) - [【GE2E】Generalized End-to-End Loss for Speaker Verification](https://arxiv.org/abs/1710.10467) -## Setup -It's difficult to install some dependent libraries for this repo in Windows system, we recommend that you **DO NOT** use Windows system, please use `Linux`. - -Make sure the library `libsndfile1` is installed, e.g., on Ubuntu. - -```bash -sudo apt-get install libsndfile1 -``` -### Install PaddlePaddle -See [install](https://www.paddlepaddle.org.cn/install/quick) for more details. This repo requires PaddlePaddle **2.1.2** or above. - -### Install Parakeet -```bash -git clone https://github.com/PaddlePaddle/Parakeet -cd Parakeet -pip install -e . -``` - -If some python dependent packages cannot be installed successfully, you can run the following script first. -(replace `python3.6` with your own python version) -```bash -sudo apt install -y python3.6-dev -``` - -See [install](https://paddle-parakeet.readthedocs.io/en/latest/install.html) for more details. 
- -## Examples -Entries to the introduction, and the launch of training and synthsis for different example models: - -- [>>> Chinese Text Frontend](./examples/text_frontend) -- [>>> FastSpeech2/FastPitch](./examples/fastspeech2) -- [>>> Montreal-Forced-Aligner](./examples/use_mfa) -- [>>> Parallel WaveGAN](./examples/GANVocoder/parallelwave_gan) -- [>>> SpeedySpeech](./examples/speedyspeech) -- [>>> Tacotron2_AISHELL3](./examples/tacotron2_aishell3) -- [>>> GE2E](./examples/ge2e) -- [>>> WaveFlow](./examples/waveflow) -- [>>> TransformerTTS](./examples/transformer_tts) -- [>>> Tacotron2](./examples/tacotron2) ## Audio samples -### TTS models (Acoustic Model + Neural Vocoder) -Check our [website](https://paddleparakeet.readthedocs.io/en/latest/demo.html) for audio sampels. + +Check our [website](https://paddlespeech.readthedocs.io/en/latest/tts/demo.html) for audio samples. ## Released Model diff --git a/examples/aishell3/tts3/README.md b/examples/aishell3/tts3/README.md index 82b69ad8..5ab15ffb 100644 --- a/examples/aishell3/tts3/README.md +++ b/examples/aishell3/tts3/README.md @@ -17,7 +17,7 @@ tar zxvf data_aishell3.tgz -C data_aishell3 ``` ### Get MFA result of AISHELL-3 and Extract it We use [MFA2.x](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get durations for aishell3_fastspeech2. -You can download from here [aishell3_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/AISHELL-3/with_tone/aishell3_alignment_tone.tar.gz), or train your own MFA model reference to [use_mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/use_mfa) (use MFA1.x now) of our repo. +You can download from here [aishell3_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/AISHELL-3/with_tone/aishell3_alignment_tone.tar.gz), or train your own MFA model reference to [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) (use MFA1.x now) of our repo. ## Get Started Assume the path to the dataset is `~/datasets/data_aishell3`. diff --git a/examples/aishell3/vc0/README.md b/examples/aishell3/vc0/README.md index 28fea629..cd573c4d 100644 --- a/examples/aishell3/vc0/README.md +++ b/examples/aishell3/vc0/README.md @@ -41,7 +41,8 @@ We use Montreal Force Aligner 1.0. The label in aishell3 include pinyin,so th We use [lexicon.txt](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/paddlespeech/t2s/exps/voice_cloning/tacotron2_ge2e/lexicon.txt) as the lexicon. -You can download the alignment results from here [alignment_aishell3.tar.gz](https://paddlespeech.bj.bcebos.com/Parakeet/alignment_aishell3.tar.gz), or train your own MFA model reference to [use_mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/use_mfa) (use MFA1.x now) of our repo. +You can download the alignment results from here [alignment_aishell3.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/AISHELL-3/alignment_aishell3.tar.gz), or train your own MFA model reference to [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) (use MFA1.x now) of our repo.
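As an illustrative sketch only (it assumes `wget` is available and that the archive is unpacked into the current working directory), downloading and extracting those alignment results could look like this:

```bash
# download the pre-computed MFA alignments for AISHELL-3 (URL taken from the text above)
wget https://paddlespeech.bj.bcebos.com/MFA/AISHELL-3/alignment_aishell3.tar.gz
# unpack the archive; the resulting directory is then passed to the preprocessing scripts as the alignment path
tar zxvf alignment_aishell3.tar.gz
```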
+ ```bash if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then diff --git a/examples/aishell3/vc1/README.md b/examples/aishell3/vc1/README.md index 8c0aec3a..974b84ca 100644 --- a/examples/aishell3/vc1/README.md +++ b/examples/aishell3/vc1/README.md @@ -1,89 +1,138 @@ + # FastSpeech2 + AISHELL-3 Voice Cloning -This example contains code used to train a [Tacotron2 ](https://arxiv.org/abs/1712.05884) model with [AISHELL-3](http://www.aishelltech.com/aishell_3). The trained model can be used in Voice Cloning Task, We refer to the model structure of [Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis](https://arxiv.org/pdf/1806.04558.pdf) . The general steps are as follows: -1. Speaker Encoder: We use a Speaker Verification to train a speaker encoder. Datasets used in this task are different from those used in Tacotron2, because the transcriptions are not needed, we use more datasets, refer to [ge2e](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/ge2e). -2. Synthesizer: Then, we use the trained speaker encoder to generate utterance embedding for each sentence in AISHELL-3. This embedding is a extra input of Tacotron2 which will be concated with encoder outputs. -3. Vocoder: We use WaveFlow as the neural Vocoder, refer to [waveflow](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/ljspeech/voc0). +This example contains code used to train a [FastSpeech2](https://arxiv.org/abs/2006.04558) model with [AISHELL-3](http://www.aishelltech.com/aishell_3). The trained model can be used in the Voice Cloning Task. We refer to the model structure of [Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis](https://arxiv.org/pdf/1806.04558.pdf). The general steps are as follows: +1. Speaker Encoder: We use a speaker verification task to train a speaker encoder. Datasets used in this task are different from those used in `FastSpeech2`; because the transcriptions are not needed, we can use more datasets. Refer to [ge2e](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/ge2e). +2. Synthesizer: We use the trained speaker encoder to generate a speaker embedding for each sentence in AISHELL-3. This embedding is an extra input of `FastSpeech2` which will be concatenated with the encoder outputs. +3. Vocoder: We use [Parallel Wave GAN](http://arxiv.org/abs/1910.11480) as the neural vocoder; refer to [voc1](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/voc1). + +## Dataset +### Download and Extract +Download AISHELL-3. +```bash +wget https://www.openslr.org/resources/93/data_aishell3.tgz +``` +Extract AISHELL-3. +```bash +mkdir data_aishell3 +tar zxvf data_aishell3.tgz -C data_aishell3 +``` +### Get MFA Result and Extract +We use [MFA2.x](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get durations for aishell3_fastspeech2. +You can download from here [aishell3_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/AISHELL-3/with_tone/aishell3_alignment_tone.tar.gz), or train your own MFA model reference to [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) (use MFA1.x now) of our repo. + +## Pretrained GE2E Model +We use a pretrained GE2E model to generate a speaker embedding for each sentence. + +Download the pretrained GE2E model from here [ge2e_ckpt_0.3.zip](https://bj.bcebos.com/paddlespeech/Parakeet/released_models/ge2e/ge2e_ckpt_0.3.zip), and `unzip` it.
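As a concrete sketch (assuming `wget` and `unzip` are installed), downloading and unpacking the GE2E checkpoint could be done as follows:

```bash
# download the pretrained GE2E speaker encoder (URL taken from the text above)
wget https://bj.bcebos.com/paddlespeech/Parakeet/released_models/ge2e/ge2e_ckpt_0.3.zip
# unzip it; ./ge2e_ckpt_0.3 is the path assumed in the "Get Started" section below
unzip ge2e_ckpt_0.3.zip
```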
## Get Started Assume the path to the dataset is `~/datasets/data_aishell3`. -Assume the path to the MFA result of AISHELL-3 is `./alignment`. -Assume the path to the pretrained ge2e model is `ge2e_ckpt_path=./ge2e_ckpt_0.3/step-3000000` +Assume the path to the MFA result of AISHELL-3 is `./aishell3_alignment_tone`. +Assume the path to the pretrained ge2e model is `./ge2e_ckpt_0.3`. + Run the command below to 1. **source path**. -2. preprocess the dataset, +2. preprocess the dataset. 3. train the model. -4. start a voice cloning inference. +4. synthesize waveform from `metadata.jsonl`. +5. start a voice cloning inference. ```bash ./run.sh ``` -### Preprocess the dataset +You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage. For example, running the following command will only preprocess the dataset. ```bash -CUDA_VISIBLE_DEVICES=${gpus} ./local/preprocess.sh ${input} ${preprocess_path} ${alignment} ${ge2e_ckpt_path} +./run.sh --stage 0 --stop-stage 0 ``` -#### generate utterance embedding - Use pretrained GE2E (speaker encoder) to generate utterance embedding for each sentence in AISHELL-3, which has the same file structure with wav files and the format is `.npy`. - +### Data Preprocessing ```bash -if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then - python3 ${BIN_DIR}/../ge2e/inference.py \ - --input=${input} \ - --output=${preprocess_path}/embed \ - --ngpu=1 \ - --checkpoint_path=${ge2e_ckpt_path} -fi +CUDA_VISIBLE_DEVICES=${gpus} ./local/preprocess.sh ${conf_path} ${ge2e_ckpt_path} +``` +When it is done, a `dump` folder is created in the current directory. The structure of the dump folder is listed below. +```text +dump +├── dev +│ ├── norm +│ └── raw +├── embed +│ ├── SSB0005 +│ ├── SSB0009 +│ ├── ... +│ └── ... +├── phone_id_map.txt +├── speaker_id_map.txt +├── test +│ ├── norm +│ └── raw +└── train + ├── energy_stats.npy + ├── norm + ├── pitch_stats.npy + ├── raw + └── speech_stats.npy ``` +The `embed` folder contains the generated speaker embedding for each sentence in AISHELL-3, which has the same file structure as the wav files; the embeddings are stored in `.npy` format. The computing time of utterance embedding can be x hours. -#### process wav -There are silence in the edge of AISHELL-3's wavs, and the audio amplitude is very small, so, we need to remove the silence and normalize the audio. You can the silence remove method based on volume or energy, but the effect is not very good, We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get the alignment of text and speech, then utilize the alignment results to remove the silence. -We use Montreal Force Aligner 1.0. The label in aishell3 include pinyin,so the lexicon we provided to MFA is pinyin rather than Chinese characters. And the prosody marks(`$` and `%`) need to be removed. You shoud preprocess the dataset into the format which MFA needs, the texts have the same name with wavs and have the suffix `.lab`. +The dataset is split into 3 parts, namely `train`, `dev` and `test`, each of which contains `norm` and `raw` subfolders. The raw folder contains the speech, pitch and energy features of each utterance, while the norm folder contains the normalized ones. The statistics used to normalize features are computed from the training set, which is located in `dump/train/*_stats.npy`. -We use [lexicon.txt](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/paddlespeech/t2s/exps/voice_cloning/tacotron2_ge2e/lexicon.txt) as the lexicon.
+There is also a `metadata.jsonl` in each subfolder. It is a table-like file which contains the phones, text_lengths, speech_lengths, durations, the paths of the speech, pitch and energy features, and the speaker and id of each utterance. -You can download the alignment results from here [alignment_aishell3.tar.gz](https://paddlespeech.bj.bcebos.com/Parakeet/alignment_aishell3.tar.gz), or train your own MFA model reference to [use_mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/use_mfa) (use MFA1.x now) of our repo. +The preprocessing step is very similar to that of [tts3](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/tts3), but there is one extra `ge2e/inference` step here. +### Model Training +`./local/train.sh` calls `${BIN_DIR}/train.py`. ```bash -if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then - echo "Process wav ..." - python3 ${BIN_DIR}/process_wav.py \ - --input=${input}/wav \ - --output=${preprocess_path}/normalized_wav \ - --alignment=${alignment} -fi +CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path} ``` +The training step is very similar to that of [tts3](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/tts3), but we should set `--voice-cloning=True` when calling `${BIN_DIR}/train.py`. -#### preprocess transcription -We revert the transcription into `phones` and `tones`. It is worth noting that our processing here is different from that used for MFA, we separated the tones. This is a processing method, of course, you can only segment initials and vowels. - +### Synthesizing +We use [parallel wavegan](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/voc1) as the neural vocoder. +Download the pretrained parallel wavegan model from [pwg_aishell3_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_aishell3_ckpt_0.5.zip) and unzip it. ```bash -if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then - python3 ${BIN_DIR}/preprocess_transcription.py \ - --input=${input} \ - --output=${preprocess_path} -fi +unzip pwg_aishell3_ckpt_0.5.zip ``` -The default input is `~/datasets/data_aishell3/train`,which contains `label_train-set.txt`, the processed results are `metadata.yaml` and `metadata.pickle`. the former is a text format for easy viewing, and the latter is a binary format for direct reading. -#### extract mel -```python -if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then - python3 ${BIN_DIR}/extract_mel.py \ - --input=${preprocess_path}/normalized_wav \ - --output=${preprocess_path}/mel -fi +The Parallel WaveGAN checkpoint contains the files listed below. +```text +pwg_aishell3_ckpt_0.5 +├── default.yaml # default config used to train parallel wavegan +├── feats_stats.npy # statistics used to normalize spectrogram when training parallel wavegan +└── snapshot_iter_1000000.pdz # generator parameters of parallel wavegan ``` - -### Train the model +`./local/synthesize.sh` calls `${BIN_DIR}/synthesize.py`, which can synthesize waveform from `metadata.jsonl`. ```bash -CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${preprocess_path} ${train_output_path} +CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name} ``` +The synthesizing step is very similar to that of [tts3](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/tts3), but we should set `--voice-cloning=True` when calling `${BIN_DIR}/synthesize.py`.
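If you only want to re-run synthesis after training, the stage-range mechanism of `run.sh` shown above can be used for this step as well. A sketch, assuming the stages are numbered in the order they are listed in "Get Started" above (i.e. synthesis from `metadata.jsonl` would be stage 2):

```bash
# run only the synthesis stage; the stage number is an assumption based on the stage list above
./run.sh --stage 2 --stop-stage 2
```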
-Our model remve stop token prediction in Tacotron2, because of the problem of extremely unbalanced proportion of positive and negative samples of stop token prediction, and it's very sensitive to the clip of audio silence. We use the last symbol from the highest point of attention to the encoder side as the termination condition. +### Voice Cloning +Assume there are some reference audio files in `./ref_audio`: +```text +ref_audio +├── 001238.wav +├── LJ015-0254.wav +└── audio_self_test.mp3 +``` +`./local/voice_cloning.sh` calls `${BIN_DIR}/voice_cloning.py`. -In addition, in order to accelerate the convergence of the model, we add `guided attention loss` to induce the alignment between encoder and decoder to show diagonal lines faster. -### Infernece ```bash -CUDA_VISIBLE_DEVICES=${gpus} ./local/voice_cloning.sh ${ge2e_params_path} ${tacotron2_params_path} ${waveflow_params_path} ${vc_input} ${vc_output} +CUDA_VISIBLE_DEVICES=${gpus} ./local/voice_cloning.sh ${conf_path} ${train_output_path} ${ckpt_name} ${ge2e_params_path} ${ref_audio_dir} ``` ## Pretrained Model -[tacotron2_aishell3_ckpt_0.3.zip](https://paddlespeech.bj.bcebos.com/Parakeet/tacotron2_aishell3_ckpt_0.3.zip). +[fastspeech2_nosil_aishell3_vc1_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_aishell3_vc1_ckpt_0.5.zip) + +Model | Step | eval/loss | eval/l1_loss | eval/duration_loss | eval/pitch_loss| eval/energy_loss +:-------------:| :------------:| :-----: | :-----: | :--------: |:--------:|:---------: +default|2(gpu) x 96400|0.99699|0.62013|0.53057|0.11954| 0.20426| + +The FastSpeech2 checkpoint contains the files listed below. +(There is no need for `speaker_id_map.txt` here.) + +```text +fastspeech2_nosil_aishell3_ckpt_vc1_0.5 +├── default.yaml # default config used to train fastspeech2 +├── phone_id_map.txt # phone vocabulary file when training fastspeech2 +├── snapshot_iter_96400.pdz # model parameters and optimizer states +└── speech_stats.npy # statistics used to normalize spectrogram when training fastspeech2 +``` diff --git a/examples/aishell3/voc1/README.md b/examples/aishell3/voc1/README.md index d9e8ce59..0e40c1b5 100644 --- a/examples/aishell3/voc1/README.md +++ b/examples/aishell3/voc1/README.md @@ -15,7 +15,7 @@ tar zxvf data_aishell3.tgz -C data_aishell3 ``` ### Get MFA result of AISHELL-3 and Extract it We use [MFA2.x](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get durations for aishell3_fastspeech2. -You can download from here [aishell3_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/AISHELL-3/with_tone/aishell3_alignment_tone.tar.gz), or train your own MFA model reference to [use_mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/use_mfa) (use MFA1.x now) of our repo. +You can download from here [aishell3_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/AISHELL-3/with_tone/aishell3_alignment_tone.tar.gz), or train your own MFA model reference to [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) (use MFA1.x now) of our repo. ## Get Started Assume the path to the dataset is `~/datasets/data_aishell3`.
diff --git a/examples/csmsc/tts2/README.md b/examples/csmsc/tts2/README.md index 2088ed15..631fffc0 100644 --- a/examples/csmsc/tts2/README.md +++ b/examples/csmsc/tts2/README.md @@ -7,7 +7,7 @@ Download CSMSC from it's [Official Website](https://test.data-baker.com/data/ind ### Get MFA result of CSMSC and Extract it We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get durations for SPEEDYSPEECH. -You can download from here [baker_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/BZNSYP/with_tone/baker_alignment_tone.tar.gz), or train your own MFA model reference to [use_mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/use_mfa) of our repo. +You can download from here [baker_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/BZNSYP/with_tone/baker_alignment_tone.tar.gz), or train your own MFA model reference to [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) of our repo. ## Get Started Assume the path to the dataset is `~/datasets/BZNSYP`. diff --git a/examples/csmsc/tts3/README.md b/examples/csmsc/tts3/README.md index 6e4701df..fcb626ce 100644 --- a/examples/csmsc/tts3/README.md +++ b/examples/csmsc/tts3/README.md @@ -7,7 +7,7 @@ Download CSMSC from it's [Official Website](https://test.data-baker.com/data/ind ### Get MFA result of CSMSC and Extract it We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get durations for fastspeech2. -You can download from here [baker_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/BZNSYP/with_tone/baker_alignment_tone.tar.gz), or train your own MFA model reference to [use_mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/use_mfa) of our repo. +You can download from here [baker_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/BZNSYP/with_tone/baker_alignment_tone.tar.gz), or train your own MFA model reference to [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) of our repo. ## Get Started Assume the path to the dataset is `~/datasets/BZNSYP`. diff --git a/examples/csmsc/voc1/README.md b/examples/csmsc/voc1/README.md index f789cba0..4cce8b2a 100644 --- a/examples/csmsc/voc1/README.md +++ b/examples/csmsc/voc1/README.md @@ -6,7 +6,7 @@ Download CSMSC from the [official website](https://www.data-baker.com/data/index ### Get MFA results for silence trim We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) results to cut silence in the edge of audio. -You can download from here [baker_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/BZNSYP/with_tone/baker_alignment_tone.tar.gz), or train your own MFA model reference to [use_mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/use_mfa) of our repo. +You can download from here [baker_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/BZNSYP/with_tone/baker_alignment_tone.tar.gz), or train your own MFA model reference to [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) of our repo. ## Get Started Assume the path to the dataset is `~/datasets/BZNSYP`. 
diff --git a/examples/csmsc/voc3/README.md b/examples/csmsc/voc3/README.md index 9cb9d34d..0b2872fb 100644 --- a/examples/csmsc/voc3/README.md +++ b/examples/csmsc/voc3/README.md @@ -6,7 +6,7 @@ Download CSMSC from the [official website](https://www.data-baker.com/data/index ### Get MFA results for silence trim We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) results to cut silence in the edge of audio. -You can download from here [baker_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/BZNSYP/with_tone/baker_alignment_tone.tar.gz), or train your own MFA model reference to [use_mfa example](https://github.com/PaddlePaddle/Parakeet/tree/develop/examples/use_mfa) of our repo. +You can download from here [baker_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/BZNSYP/with_tone/baker_alignment_tone.tar.gz), or train your own MFA model reference to [mfa example](https://github.com/PaddlePaddle/Parakeet/tree/develop/examples/mfa) of our repo. ## Get Started Assume the path to the dataset is `~/datasets/BZNSYP`. diff --git a/examples/ljspeech/tts3/README.md b/examples/ljspeech/tts3/README.md index bc38aac6..bfd9dd8c 100644 --- a/examples/ljspeech/tts3/README.md +++ b/examples/ljspeech/tts3/README.md @@ -7,7 +7,7 @@ Download LJSpeech-1.1 from the [official website](https://keithito.com/LJ-Speech ### Get MFA result of LJSpeech-1.1 and Extract it We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get durations for fastspeech2. -You can download from here [ljspeech_alignment.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/LJSpeech-1.1/ljspeech_alignment.tar.gz), or train your own MFA model reference to [use_mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/use_mfa) of our repo. +You can download from here [ljspeech_alignment.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/LJSpeech-1.1/ljspeech_alignment.tar.gz), or train your own MFA model reference to [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) of our repo. ## Get Started Assume the path to the dataset is `~/datasets/LJSpeech-1.1`. diff --git a/examples/ljspeech/voc1/README.md b/examples/ljspeech/voc1/README.md index fdeac632..13cc6ed7 100644 --- a/examples/ljspeech/voc1/README.md +++ b/examples/ljspeech/voc1/README.md @@ -1,26 +1,29 @@ # Parallel WaveGAN with the LJSpeech-1.1 This example contains code used to train a [parallel wavegan](http://arxiv.org/abs/1910.11480) model with [LJSpeech-1.1](https://keithito.com/LJ-Speech-Dataset/). ## Dataset -### Download and Extract the datasaet +### Download and Extract Download LJSpeech-1.1 from the [official website](https://keithito.com/LJ-Speech-Dataset/). -### Get MFA results for silence trim -We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) results to cut silence in the edge of audio. -You can download from here [ljspeech_alignment.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/LJSpeech-1.1/ljspeech_alignment.tar.gz), or train your own MFA model reference to [use_mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/use_mfa) of our repo. +### Get MFA Result and Extract +We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) results to cut silence in the edge of audio. 
+You can download from here [ljspeech_alignment.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/LJSpeech-1.1/ljspeech_alignment.tar.gz), or train your own MFA model reference to [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) of our repo. ## Get Started Assume the path to the dataset is `~/datasets/LJSpeech-1.1`. Assume the path to the MFA result of LJSpeech-1.1 is `./ljspeech_alignment`. Run the command below to 1. **source path**. -2. preprocess the dataset, +2. preprocess the dataset. 3. train the model. 4. synthesize wavs. - synthesize waveform from `metadata.jsonl`. ```bash ./run.sh ``` - -### Preprocess the dataset +You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage, for example, run the following command will only preprocess the dataset. +```bash +./run.sh --stage 0 --stop-stage 0 +``` +### Data Preprocessing ```bash ./local/preprocess.sh ${conf_path} ``` @@ -44,7 +47,7 @@ The dataset is split into 3 parts, namely `train`, `dev` and `test`, each of whi Also there is a `metadata.jsonl` in each subfolder. It is a table-like file which contains id and paths to spectrogam of each utterance. -### Train the model +### Model Training `./local/train.sh` calls `${BIN_DIR}/train.py`. ```bash CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path} @@ -91,7 +94,7 @@ benchmark: 3. `--output-dir` is the directory to save the results of the experiment. Checkpoints are save in `checkpoints/` inside this directory. 4. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu. -### Synthesize +### Synthesizing `./local/synthesize.sh` calls `${BIN_DIR}/synthesize.py`, which can synthesize waveform from `metadata.jsonl`. ```bash CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name} @@ -122,8 +125,8 @@ optional arguments: 4. `--output-dir` is the directory to save the synthesized audio files. 5. `--ngpu` is the number of gpus to use, if ngpu == 0, use cpu. -## Pretrained Models -Pretrained models can be downloaded here. [pwg_ljspeech_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/pwg_ljspeech_ckpt_0.5.zip) +## Pretrained Model +Pretrained models can be downloaded here. [pwg_ljspeech_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_ljspeech_ckpt_0.5.zip) Parallel WaveGAN checkpoint contains files listed below. @@ -134,4 +137,4 @@ pwg_ljspeech_ckpt_0.5 └── pwg_stats.npy # statistics used to normalize spectrogram when training parallel wavegan ``` ## Acknowledgement -We adapted some code from https://github.com/kan-bayashi/ParallelWaveGAN. +We adapted some code from https://github.com/kan-bayashi/ParallelWaveGAN. 
\ No newline at end of file diff --git a/examples/other/use_mfa/README.md b/examples/other/mfa/README.md similarity index 100% rename from examples/other/use_mfa/README.md rename to examples/other/mfa/README.md diff --git a/examples/other/use_mfa/local/cmudict-0.7b b/examples/other/mfa/local/cmudict-0.7b similarity index 100% rename from examples/other/use_mfa/local/cmudict-0.7b rename to examples/other/mfa/local/cmudict-0.7b diff --git a/examples/other/use_mfa/local/detect_oov.py b/examples/other/mfa/local/detect_oov.py similarity index 100% rename from examples/other/use_mfa/local/detect_oov.py rename to examples/other/mfa/local/detect_oov.py diff --git a/examples/other/use_mfa/local/generate_lexicon.py b/examples/other/mfa/local/generate_lexicon.py similarity index 100% rename from examples/other/use_mfa/local/generate_lexicon.py rename to examples/other/mfa/local/generate_lexicon.py diff --git a/examples/other/use_mfa/local/reorganize_aishell3.py b/examples/other/mfa/local/reorganize_aishell3.py similarity index 100% rename from examples/other/use_mfa/local/reorganize_aishell3.py rename to examples/other/mfa/local/reorganize_aishell3.py diff --git a/examples/other/use_mfa/local/reorganize_baker.py b/examples/other/mfa/local/reorganize_baker.py similarity index 100% rename from examples/other/use_mfa/local/reorganize_baker.py rename to examples/other/mfa/local/reorganize_baker.py diff --git a/examples/other/use_mfa/local/reorganize_ljspeech.py b/examples/other/mfa/local/reorganize_ljspeech.py similarity index 100% rename from examples/other/use_mfa/local/reorganize_ljspeech.py rename to examples/other/mfa/local/reorganize_ljspeech.py diff --git a/examples/other/use_mfa/local/reorganize_vctk.py b/examples/other/mfa/local/reorganize_vctk.py similarity index 100% rename from examples/other/use_mfa/local/reorganize_vctk.py rename to examples/other/mfa/local/reorganize_vctk.py diff --git a/examples/other/use_mfa/run.sh b/examples/other/mfa/run.sh similarity index 100% rename from examples/other/use_mfa/run.sh rename to examples/other/mfa/run.sh diff --git a/examples/vctk/tts3/README.md b/examples/vctk/tts3/README.md index 78bfb966..ad4fb7bf 100644 --- a/examples/vctk/tts3/README.md +++ b/examples/vctk/tts3/README.md @@ -7,8 +7,8 @@ Download VCTK-0.92 from the [official website](https://datashare.ed.ac.uk/handle ### Get MFA result of VCTK and Extract it We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get durations for fastspeech2. -You can download from here [vctk_alignment.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/VCTK-Corpus-0.92/vctk_alignment.tar.gz), or train your own MFA model reference to [use_mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/use_mfa) of our repo. -ps: we remove three speakers in VCTK-0.92 (see [reorganize_vctk.py](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/examples/other/use_mfa/local/reorganize_vctk.py)): +You can download from here [vctk_alignment.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/VCTK-Corpus-0.92/vctk_alignment.tar.gz), or train your own MFA model reference to [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) of our repo. +ps: we remove three speakers in VCTK-0.92 (see [reorganize_vctk.py](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/examples/other/mfa/local/reorganize_vctk.py)): 1. `p315`, because no txt for it. 2. `p280` and `p362`, because no *_mic2.flac (which is better than *_mic1.flac) for them. 
diff --git a/examples/vctk/voc1/README.md b/examples/vctk/voc1/README.md index c9ecae0d..5c9d54c9 100644 --- a/examples/vctk/voc1/README.md +++ b/examples/vctk/voc1/README.md @@ -5,10 +5,10 @@ This example contains code used to train a [parallel wavegan](http://arxiv.org/a ### Download and Extract the datasaet Download VCTK-0.92 from the [official website](https://datashare.ed.ac.uk/handle/10283/3443) and extract it to `~/datasets`. Then the dataset is in directory `~/datasets/VCTK-Corpus-0.92`. -### Get MFA results for silence trim +### Get MFA Result and Extract We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) results to cut silence in the edge of audio. -You can download from here [vctk_alignment.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/VCTK-Corpus-0.92/vctk_alignment.tar.gz), or train your own MFA model reference to [use_mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/use_mfa) of our repo. -ps: we remove three speakers in VCTK-0.92 (see [reorganize_vctk.py](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/examples/other/use_mfa/local/reorganize_vctk.py)): +You can download from here [vctk_alignment.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/VCTK-Corpus-0.92/vctk_alignment.tar.gz), or train your own MFA model reference to [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) of our repo. +ps: we remove three speakers in VCTK-0.92 (see [reorganize_vctk.py](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/examples/other/mfa/local/reorganize_vctk.py)): 1. `p315`, because no txt for it. 2. `p280` and `p362`, because no *_mic2.flac (which is better than *_mic1.flac) for them. diff --git a/paddlespeech/t2s/exps/gan_vocoder/multi_band_melgan/train.py b/paddlespeech/t2s/exps/gan_vocoder/multi_band_melgan/train.py index ca3c0a1f..a44d2d3c 100644 --- a/paddlespeech/t2s/exps/gan_vocoder/multi_band_melgan/train.py +++ b/paddlespeech/t2s/exps/gan_vocoder/multi_band_melgan/train.py @@ -36,10 +36,10 @@ from paddlespeech.t2s.models.melgan import MBMelGANEvaluator from paddlespeech.t2s.models.melgan import MBMelGANUpdater from paddlespeech.t2s.models.melgan import MelGANGenerator from paddlespeech.t2s.models.melgan import MelGANMultiScaleDiscriminator -from paddlespeech.t2s.modules.adversarial_loss import DiscriminatorAdversarialLoss -from paddlespeech.t2s.modules.adversarial_loss import GeneratorAdversarialLoss +from paddlespeech.t2s.modules.losses import DiscriminatorAdversarialLoss +from paddlespeech.t2s.modules.losses import GeneratorAdversarialLoss +from paddlespeech.t2s.modules.losses import MultiResolutionSTFTLoss from paddlespeech.t2s.modules.pqmf import PQMF -from paddlespeech.t2s.modules.stft_loss import MultiResolutionSTFTLoss from paddlespeech.t2s.training.extensions.snapshot import Snapshot from paddlespeech.t2s.training.extensions.visualizer import VisualDL from paddlespeech.t2s.training.seeding import seed_everything diff --git a/paddlespeech/t2s/exps/gan_vocoder/parallelwave_gan/train.py b/paddlespeech/t2s/exps/gan_vocoder/parallelwave_gan/train.py index 42ef8830..98b0ed71 100644 --- a/paddlespeech/t2s/exps/gan_vocoder/parallelwave_gan/train.py +++ b/paddlespeech/t2s/exps/gan_vocoder/parallelwave_gan/train.py @@ -36,7 +36,7 @@ from paddlespeech.t2s.models.parallel_wavegan import PWGDiscriminator from paddlespeech.t2s.models.parallel_wavegan import PWGEvaluator from paddlespeech.t2s.models.parallel_wavegan import PWGGenerator from 
paddlespeech.t2s.models.parallel_wavegan import PWGUpdater -from paddlespeech.t2s.modules.stft_loss import MultiResolutionSTFTLoss +from paddlespeech.t2s.modules.losses import MultiResolutionSTFTLoss from paddlespeech.t2s.training.extensions.snapshot import Snapshot from paddlespeech.t2s.training.extensions.visualizer import VisualDL from paddlespeech.t2s.training.seeding import seed_everything diff --git a/paddlespeech/t2s/models/__init__.py b/paddlespeech/t2s/models/__init__.py index 4ce90896..66720649 100644 --- a/paddlespeech/t2s/models/__init__.py +++ b/paddlespeech/t2s/models/__init__.py @@ -12,6 +12,8 @@ # See the License for the specific language governing permissions and # limitations under the License. from .fastspeech2 import * +from .melgan import * +from .parallel_wavegan import * from .tacotron2 import * from .transformer_tts import * from .waveflow import * diff --git a/paddlespeech/t2s/models/fastspeech2/fastspeech2.py b/paddlespeech/t2s/models/fastspeech2/fastspeech2.py index 8ff07fa5..aa42a83d 100644 --- a/paddlespeech/t2s/models/fastspeech2/fastspeech2.py +++ b/paddlespeech/t2s/models/fastspeech2/fastspeech2.py @@ -32,7 +32,8 @@ from paddlespeech.t2s.modules.predictor.duration_predictor import DurationPredic from paddlespeech.t2s.modules.predictor.length_regulator import LengthRegulator from paddlespeech.t2s.modules.predictor.variance_predictor import VariancePredictor from paddlespeech.t2s.modules.tacotron2.decoder import Postnet -from paddlespeech.t2s.modules.transformer.encoder import Encoder +from paddlespeech.t2s.modules.transformer.encoder import ConformerEncoder +from paddlespeech.t2s.modules.transformer.encoder import TransformerEncoder class FastSpeech2(nn.Layer): @@ -306,12 +307,10 @@ class FastSpeech2(nn.Layer): num_embeddings=idim, embedding_dim=adim, padding_idx=self.padding_idx) - # add encoder type here - # 测试模型还能跑通不 - # 记得改 transformer tts + if encoder_type == "transformer": print("encoder_type is transformer") - self.encoder = Encoder( + self.encoder = TransformerEncoder( idim=idim, attention_dim=adim, attention_heads=aheads, @@ -325,11 +324,10 @@ class FastSpeech2(nn.Layer): normalize_before=encoder_normalize_before, concat_after=encoder_concat_after, positionwise_layer_type=positionwise_layer_type, - positionwise_conv_kernel_size=positionwise_conv_kernel_size, - encoder_type=encoder_type) + positionwise_conv_kernel_size=positionwise_conv_kernel_size, ) elif encoder_type == "conformer": print("encoder_type is conformer") - self.encoder = Encoder( + self.encoder = ConformerEncoder( idim=idim, attention_dim=adim, attention_heads=aheads, @@ -349,8 +347,7 @@ class FastSpeech2(nn.Layer): activation_type=conformer_activation_type, use_cnn_module=use_cnn_in_conformer, cnn_module_kernel=conformer_enc_kernel_size, - zero_triu=zero_triu, - encoder_type=encoder_type) + zero_triu=zero_triu, ) else: raise ValueError(f"{encoder_type} is not supported.") @@ -417,7 +414,7 @@ class FastSpeech2(nn.Layer): # because fastspeech's decoder is the same as encoder if decoder_type == "transformer": print("decoder_type is transformer") - self.decoder = Encoder( + self.decoder = TransformerEncoder( idim=0, attention_dim=adim, attention_heads=aheads, @@ -432,11 +429,10 @@ class FastSpeech2(nn.Layer): normalize_before=decoder_normalize_before, concat_after=decoder_concat_after, positionwise_layer_type=positionwise_layer_type, - positionwise_conv_kernel_size=positionwise_conv_kernel_size, - encoder_type=decoder_type) + positionwise_conv_kernel_size=positionwise_conv_kernel_size, ) 
elif decoder_type == "conformer": print("decoder_type is conformer") - self.decoder = Encoder( + self.decoder = ConformerEncoder( idim=0, attention_dim=adim, attention_heads=aheads, @@ -455,8 +451,7 @@ class FastSpeech2(nn.Layer): selfattention_layer_type=conformer_self_attn_layer_type, activation_type=conformer_activation_type, use_cnn_module=use_cnn_in_conformer, - cnn_module_kernel=conformer_dec_kernel_size, - encoder_type=decoder_type) + cnn_module_kernel=conformer_dec_kernel_size, ) else: raise ValueError(f"{decoder_type} is not supported.") diff --git a/paddlespeech/t2s/models/melgan/melgan.py b/paddlespeech/t2s/models/melgan/melgan.py index 80bb1c1b..809403f6 100644 --- a/paddlespeech/t2s/models/melgan/melgan.py +++ b/paddlespeech/t2s/models/melgan/melgan.py @@ -78,7 +78,7 @@ class MelGANGenerator(nn.Layer): Padding function module name before dilated convolution layer. pad_params : dict Hyperparameters for padding function. - use_final_nonlinear_activation : paddle.nn.Layer + use_final_nonlinear_activation : nn.Layer Activation function for the final layer. use_weight_norm : bool Whether to use weight norm. diff --git a/paddlespeech/t2s/models/speedyspeech/speedyspeech.py b/paddlespeech/t2s/models/speedyspeech/speedyspeech.py index 0689ec45..ece5c279 100644 --- a/paddlespeech/t2s/models/speedyspeech/speedyspeech.py +++ b/paddlespeech/t2s/models/speedyspeech/speedyspeech.py @@ -11,13 +11,34 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. +import numpy as np import paddle from paddle import nn -from paddlespeech.t2s.modules.expansion import expand from paddlespeech.t2s.modules.positional_encoding import sinusoid_position_encoding +def expand(encodings: paddle.Tensor, durations: paddle.Tensor) -> paddle.Tensor: + """ + encodings: (B, T, C) + durations: (B, T) + """ + batch_size, t_enc = durations.shape + durations = durations.numpy() + slens = np.sum(durations, -1) + t_dec = np.max(slens) + M = np.zeros([batch_size, t_dec, t_enc]) + for i in range(batch_size): + k = 0 + for j in range(t_enc): + d = durations[i, j] + M[i, k:k + d, j] = 1 + k += d + M = paddle.to_tensor(M, dtype=encodings.dtype) + encodings = paddle.matmul(M, encodings) + return encodings + + class ResidualBlock(nn.Layer): def __init__(self, channels, kernel_size, dilation, n=2): super().__init__() diff --git a/paddlespeech/t2s/models/speedyspeech/speedyspeech_updater.py b/paddlespeech/t2s/models/speedyspeech/speedyspeech_updater.py index 4883a87e..6f9937a5 100644 --- a/paddlespeech/t2s/models/speedyspeech/speedyspeech_updater.py +++ b/paddlespeech/t2s/models/speedyspeech/speedyspeech_updater.py @@ -19,8 +19,8 @@ from paddle.fluid.layers import huber_loss from paddle.nn import functional as F from paddlespeech.t2s.modules.losses import masked_l1_loss +from paddlespeech.t2s.modules.losses import ssim from paddlespeech.t2s.modules.losses import weighted_mean -from paddlespeech.t2s.modules.ssim import ssim from paddlespeech.t2s.training.extensions.evaluator import StandardEvaluator from paddlespeech.t2s.training.reporter import report from paddlespeech.t2s.training.updaters.standard_updater import StandardUpdater diff --git a/paddlespeech/t2s/models/tacotron2.py b/paddlespeech/t2s/models/tacotron2.py index b0946a5b..01ea4f7d 100644 --- a/paddlespeech/t2s/models/tacotron2.py +++ b/paddlespeech/t2s/models/tacotron2.py @@ -20,7 +20,6 @@ from paddle.nn import functional as F from paddle.nn import 
initializer as I from tqdm import trange -from paddlespeech.t2s.modules.attention import LocationSensitiveAttention from paddlespeech.t2s.modules.conv import Conv1dBatchNorm from paddlespeech.t2s.modules.losses import guided_attention_loss from paddlespeech.t2s.utils import checkpoint @@ -28,6 +27,99 @@ from paddlespeech.t2s.utils import checkpoint __all__ = ["Tacotron2", "Tacotron2Loss"] +class LocationSensitiveAttention(nn.Layer): + """Location Sensitive Attention module. + + Reference: `Attention-Based Models for Speech Recognition `_ + + Parameters + ----------- + d_query: int + The feature size of query. + d_key : int + The feature size of key. + d_attention : int + The feature size of dimension. + location_filters : int + Filter size of attention convolution. + location_kernel_size : int + Kernel size of attention convolution. + """ + + def __init__(self, + d_query: int, + d_key: int, + d_attention: int, + location_filters: int, + location_kernel_size: int): + super().__init__() + + self.query_layer = nn.Linear(d_query, d_attention, bias_attr=False) + self.key_layer = nn.Linear(d_key, d_attention, bias_attr=False) + self.value = nn.Linear(d_attention, 1, bias_attr=False) + + # Location Layer + self.location_conv = nn.Conv1D( + 2, + location_filters, + kernel_size=location_kernel_size, + padding=int((location_kernel_size - 1) / 2), + bias_attr=False, + data_format='NLC') + self.location_layer = nn.Linear( + location_filters, d_attention, bias_attr=False) + + def forward(self, + query, + processed_key, + value, + attention_weights_cat, + mask=None): + """Compute context vector and attention weights. + + Parameters + ----------- + query : Tensor [shape=(batch_size, d_query)] + The queries. + processed_key : Tensor [shape=(batch_size, time_steps_k, d_attention)] + The keys after linear layer. + value : Tensor [shape=(batch_size, time_steps_k, d_key)] + The values. + attention_weights_cat : Tensor [shape=(batch_size, time_step_k, 2)] + Attention weights concat. + mask : Tensor, optional + The mask. Shape should be (batch_size, times_steps_k, 1). + Defaults to None. + + Returns + ---------- + attention_context : Tensor [shape=(batch_size, d_attention)] + The context vector. + attention_weights : Tensor [shape=(batch_size, time_steps_k)] + The attention weights. + """ + + processed_query = self.query_layer(paddle.unsqueeze(query, axis=[1])) + processed_attention_weights = self.location_layer( + self.location_conv(attention_weights_cat)) + # (B, T_enc, 1) + alignment = self.value( + paddle.tanh(processed_attention_weights + processed_key + + processed_query)) + + if mask is not None: + alignment = alignment + (1.0 - mask) * -1e9 + + attention_weights = F.softmax(alignment, axis=1) + attention_context = paddle.matmul( + attention_weights, value, transpose_x=True) + + attention_weights = paddle.squeeze(attention_weights, axis=-1) + attention_context = paddle.squeeze(attention_context, axis=1) + + return attention_context, attention_weights + + class DecoderPreNet(nn.Layer): """Decoder prenet module for Tacotron2. 
@@ -197,7 +289,7 @@ class Tacotron2Encoder(nn.Layer): super().__init__() k = math.sqrt(1.0 / (d_hidden * kernel_size)) - self.conv_batchnorms = paddle.nn.LayerList([ + self.conv_batchnorms = nn.LayerList([ Conv1dBatchNorm( d_hidden, d_hidden, @@ -903,7 +995,7 @@ class Tacotron2Loss(nn.Layer): self.use_stop_token_loss = use_stop_token_loss self.use_guided_attention_loss = use_guided_attention_loss self.attn_criterion = guided_attention_loss - self.stop_criterion = paddle.nn.BCEWithLogitsLoss() + self.stop_criterion = nn.BCEWithLogitsLoss() self.sigma = sigma def forward(self, diff --git a/paddlespeech/t2s/models/transformer_tts/transformer_tts.py b/paddlespeech/t2s/models/transformer_tts/transformer_tts.py index e8adafb2..ae6d7365 100644 --- a/paddlespeech/t2s/models/transformer_tts/transformer_tts.py +++ b/paddlespeech/t2s/models/transformer_tts/transformer_tts.py @@ -34,7 +34,7 @@ from paddlespeech.t2s.modules.transformer.attention import MultiHeadedAttention from paddlespeech.t2s.modules.transformer.decoder import Decoder from paddlespeech.t2s.modules.transformer.embedding import PositionalEncoding from paddlespeech.t2s.modules.transformer.embedding import ScaledPositionalEncoding -from paddlespeech.t2s.modules.transformer.encoder import Encoder +from paddlespeech.t2s.modules.transformer.encoder import TransformerEncoder from paddlespeech.t2s.modules.transformer.mask import subsequent_mask @@ -281,7 +281,7 @@ class TransformerTTS(nn.Layer): num_embeddings=idim, embedding_dim=adim, padding_idx=self.padding_idx) - self.encoder = Encoder( + self.encoder = TransformerEncoder( idim=idim, attention_dim=adim, attention_heads=aheads, diff --git a/paddlespeech/t2s/models/waveflow.py b/paddlespeech/t2s/models/waveflow.py index c57429db..e519e0c5 100644 --- a/paddlespeech/t2s/models/waveflow.py +++ b/paddlespeech/t2s/models/waveflow.py @@ -329,7 +329,7 @@ class ResidualNet(nn.LayerList): if len(dilations_h) != n_layer: raise ValueError( "number of dilations_h should equals num of layers") - super(ResidualNet, self).__init__() + super().__init__() for i in range(n_layer): dilation = (dilations_h[i], 2**i) layer = ResidualBlock(residual_channels, condition_channels, diff --git a/paddlespeech/t2s/modules/__init__.py b/paddlespeech/t2s/modules/__init__.py index 5b569f5d..1e331200 100644 --- a/paddlespeech/t2s/modules/__init__.py +++ b/paddlespeech/t2s/modules/__init__.py @@ -14,5 +14,4 @@ from .conv import * from .geometry import * from .losses import * -from .masking import * from .positional_encoding import * diff --git a/paddlespeech/t2s/modules/glu.py b/paddlespeech/t2s/modules/activation.py similarity index 69% rename from paddlespeech/t2s/modules/glu.py rename to paddlespeech/t2s/modules/activation.py index 1669fb36..f5b0af6e 100644 --- a/paddlespeech/t2s/modules/glu.py +++ b/paddlespeech/t2s/modules/activation.py @@ -11,8 +11,9 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. 
+import paddle +import paddle.nn.functional as F from paddle import nn -from paddle.nn import functional as F class GLU(nn.Layer): @@ -24,3 +25,18 @@ class GLU(nn.Layer): def forward(self, xs): return F.glu(xs, axis=self.dim) + + +def get_activation(act): + """Return activation function.""" + + activation_funcs = { + "hardtanh": paddle.nn.Hardtanh, + "tanh": paddle.nn.Tanh, + "relu": paddle.nn.ReLU, + "selu": paddle.nn.SELU, + "swish": paddle.nn.Swish, + "glu": GLU + } + + return activation_funcs[act]() diff --git a/paddlespeech/t2s/modules/adversarial_loss.py b/paddlespeech/t2s/modules/adversarial_loss.py deleted file mode 100644 index d2c8f7a9..00000000 --- a/paddlespeech/t2s/modules/adversarial_loss.py +++ /dev/null @@ -1,125 +0,0 @@ -# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -# Modified from espnet(https://github.com/espnet/espnet) -"""Adversarial loss modules.""" -import paddle -import paddle.nn.functional as F -from paddle import nn - - -class GeneratorAdversarialLoss(nn.Layer): - """Generator adversarial loss module.""" - - def __init__( - self, - average_by_discriminators=True, - loss_type="mse", ): - """Initialize GeneratorAversarialLoss module.""" - super().__init__() - self.average_by_discriminators = average_by_discriminators - assert loss_type in ["mse", "hinge"], f"{loss_type} is not supported." - if loss_type == "mse": - self.criterion = self._mse_loss - else: - self.criterion = self._hinge_loss - - def forward(self, outputs): - """Calcualate generator adversarial loss. - Parameters - ---------- - outputs: Tensor or List - Discriminator outputs or list of discriminator outputs. - Returns - ---------- - Tensor - Generator adversarial loss value. - """ - if isinstance(outputs, (tuple, list)): - adv_loss = 0.0 - for i, outputs_ in enumerate(outputs): - if isinstance(outputs_, (tuple, list)): - # case including feature maps - outputs_ = outputs_[-1] - adv_loss += self.criterion(outputs_) - if self.average_by_discriminators: - adv_loss /= i + 1 - else: - adv_loss = self.criterion(outputs) - - return adv_loss - - def _mse_loss(self, x): - return F.mse_loss(x, paddle.ones_like(x)) - - def _hinge_loss(self, x): - return -x.mean() - - -class DiscriminatorAdversarialLoss(nn.Layer): - """Discriminator adversarial loss module.""" - - def __init__( - self, - average_by_discriminators=True, - loss_type="mse", ): - """Initialize DiscriminatorAversarialLoss module.""" - super().__init__() - self.average_by_discriminators = average_by_discriminators - assert loss_type in ["mse"], f"{loss_type} is not supported." - if loss_type == "mse": - self.fake_criterion = self._mse_fake_loss - self.real_criterion = self._mse_real_loss - - def forward(self, outputs_hat, outputs): - """Calcualate discriminator adversarial loss. - Parameters - ---------- - outputs_hat : Tensor or list - Discriminator outputs or list of - discriminator outputs calculated from generator outputs. 
- outputs : Tensor or list - Discriminator outputs or list of - discriminator outputs calculated from groundtruth. - Returns - ---------- - Tensor - Discriminator real loss value. - Tensor - Discriminator fake loss value. - """ - if isinstance(outputs, (tuple, list)): - real_loss = 0.0 - fake_loss = 0.0 - for i, (outputs_hat_, - outputs_) in enumerate(zip(outputs_hat, outputs)): - if isinstance(outputs_hat_, (tuple, list)): - # case including feature maps - outputs_hat_ = outputs_hat_[-1] - outputs_ = outputs_[-1] - real_loss += self.real_criterion(outputs_) - fake_loss += self.fake_criterion(outputs_hat_) - if self.average_by_discriminators: - fake_loss /= i + 1 - real_loss /= i + 1 - else: - real_loss = self.real_criterion(outputs) - fake_loss = self.fake_criterion(outputs_hat) - - return real_loss, fake_loss - - def _mse_real_loss(self, x): - return F.mse_loss(x, paddle.ones_like(x)) - - def _mse_fake_loss(self, x): - return F.mse_loss(x, paddle.zeros_like(x)) diff --git a/paddlespeech/t2s/modules/attention.py b/paddlespeech/t2s/modules/attention.py deleted file mode 100644 index 154625cc..00000000 --- a/paddlespeech/t2s/modules/attention.py +++ /dev/null @@ -1,348 +0,0 @@ -# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -import math - -import numpy as np -import paddle -from paddle import nn -from paddle.nn import functional as F - - -def scaled_dot_product_attention(q, k, v, mask=None, dropout=0.0, - training=True): - r"""Scaled dot product attention with masking. - - Assume that q, k, v all have the same leading dimensions (denoted as * in - descriptions below). Dropout is applied to attention weights before - weighted sum of values. - - Parameters - ----------- - q : Tensor [shape=(\*, T_q, d)] - the query tensor. - k : Tensor [shape=(\*, T_k, d)] - the key tensor. - v : Tensor [shape=(\*, T_k, d_v)] - the value tensor. - mask : Tensor, [shape=(\*, T_q, T_k) or broadcastable shape], optional - the mask tensor, zeros correspond to paddings. Defaults to None. - - Returns - ---------- - out : Tensor [shape=(\*, T_q, d_v)] - the context vector. - attn_weights : Tensor [shape=(\*, T_q, T_k)] - the attention weights. - """ - d = q.shape[-1] # we only support imperative execution - qk = paddle.matmul(q, k, transpose_y=True) - scaled_logit = paddle.scale(qk, 1.0 / math.sqrt(d)) - - if mask is not None: - scaled_logit += paddle.scale((1.0 - mask), -1e9) # hard coded here - - attn_weights = F.softmax(scaled_logit, axis=-1) - attn_weights = F.dropout(attn_weights, dropout, training=training) - out = paddle.matmul(attn_weights, v) - return out, attn_weights - - -def drop_head(x, drop_n_heads, training=True): - """Drop n context vectors from multiple ones. - - Parameters - ---------- - x : Tensor [shape=(batch_size, num_heads, time_steps, channels)] - The input, multiple context vectors. - drop_n_heads : int [0<= drop_n_heads <= num_heads] - Number of vectors to drop. - training : bool - A flag indicating whether it is in training. 
If `False`, no dropout is - applied. - - Returns - ------- - Tensor - The output. - """ - if not training or (drop_n_heads == 0): - return x - - batch_size, num_heads, _, _ = x.shape - # drop all heads - if num_heads == drop_n_heads: - return paddle.zeros_like(x) - - mask = np.ones([batch_size, num_heads]) - mask[:, :drop_n_heads] = 0 - for subarray in mask: - np.random.shuffle(subarray) - scale = float(num_heads) / (num_heads - drop_n_heads) - mask = scale * np.reshape(mask, [batch_size, num_heads, 1, 1]) - out = x * paddle.to_tensor(mask) - return out - - -def _split_heads(x, num_heads): - batch_size, time_steps, _ = x.shape - x = paddle.reshape(x, [batch_size, time_steps, num_heads, -1]) - x = paddle.transpose(x, [0, 2, 1, 3]) - return x - - -def _concat_heads(x): - batch_size, _, time_steps, _ = x.shape - x = paddle.transpose(x, [0, 2, 1, 3]) - x = paddle.reshape(x, [batch_size, time_steps, -1]) - return x - - -# Standard implementations of Monohead Attention & Multihead Attention -class MonoheadAttention(nn.Layer): - """Monohead Attention module. - - Parameters - ---------- - model_dim : int - Feature size of the query. - dropout : float, optional - Dropout probability of scaled dot product attention and final context - vector. Defaults to 0.0. - k_dim : int, optional - Feature size of the key of each scaled dot product attention. If not - provided, it is set to `model_dim / num_heads`. Defaults to None. - v_dim : int, optional - Feature size of the key of each scaled dot product attention. If not - provided, it is set to `model_dim / num_heads`. Defaults to None. - """ - - def __init__(self, - model_dim: int, - dropout: float=0.0, - k_dim: int=None, - v_dim: int=None): - super(MonoheadAttention, self).__init__() - k_dim = k_dim or model_dim - v_dim = v_dim or model_dim - self.affine_q = nn.Linear(model_dim, k_dim) - self.affine_k = nn.Linear(model_dim, k_dim) - self.affine_v = nn.Linear(model_dim, v_dim) - self.affine_o = nn.Linear(v_dim, model_dim) - - self.model_dim = model_dim - self.dropout = dropout - - def forward(self, q, k, v, mask): - """Compute context vector and attention weights. - - Parameters - ----------- - q : Tensor [shape=(batch_size, time_steps_q, model_dim)] - The queries. - k : Tensor [shape=(batch_size, time_steps_k, model_dim)] - The keys. - v : Tensor [shape=(batch_size, time_steps_k, model_dim)] - The values. - mask : Tensor [shape=(batch_size, times_steps_q, time_steps_k] or broadcastable shape - The mask. - - Returns - ---------- - out : Tensor [shape=(batch_size, time_steps_q, model_dim)] - The context vector. - attention_weights : Tensor [shape=(batch_size, times_steps_q, time_steps_k)] - The attention weights. - """ - q = self.affine_q(q) # (B, T, C) - k = self.affine_k(k) - v = self.affine_v(v) - - context_vectors, attention_weights = scaled_dot_product_attention( - q, k, v, mask, self.dropout, self.training) - - out = self.affine_o(context_vectors) - return out, attention_weights - - -class MultiheadAttention(nn.Layer): - """Multihead Attention module. - - Parameters - ----------- - model_dim: int - The feature size of query. - num_heads : int - The number of attention heads. - dropout : float, optional - Dropout probability of scaled dot product attention and final context - vector. Defaults to 0.0. - k_dim : int, optional - Feature size of the key of each scaled dot product attention. If not - provided, it is set to ``model_dim / num_heads``. Defaults to None. - v_dim : int, optional - Feature size of the key of each scaled dot product attention. 
If not - provided, it is set to ``model_dim / num_heads``. Defaults to None. - - Raises - --------- - ValueError - If ``model_dim`` is not divisible by ``num_heads``. - """ - - def __init__(self, - model_dim: int, - num_heads: int, - dropout: float=0.0, - k_dim: int=None, - v_dim: int=None): - super(MultiheadAttention, self).__init__() - if model_dim % num_heads != 0: - raise ValueError("model_dim must be divisible by num_heads") - depth = model_dim // num_heads - k_dim = k_dim or depth - v_dim = v_dim or depth - self.affine_q = nn.Linear(model_dim, num_heads * k_dim) - self.affine_k = nn.Linear(model_dim, num_heads * k_dim) - self.affine_v = nn.Linear(model_dim, num_heads * v_dim) - self.affine_o = nn.Linear(num_heads * v_dim, model_dim) - - self.num_heads = num_heads - self.model_dim = model_dim - self.dropout = dropout - - def forward(self, q, k, v, mask): - """Compute context vector and attention weights. - - Parameters - ----------- - q : Tensor [shape=(batch_size, time_steps_q, model_dim)] - The queries. - k : Tensor [shape=(batch_size, time_steps_k, model_dim)] - The keys. - v : Tensor [shape=(batch_size, time_steps_k, model_dim)] - The values. - mask : Tensor [shape=(batch_size, times_steps_q, time_steps_k] or broadcastable shape - The mask. - - Returns - ---------- - out : Tensor [shape=(batch_size, time_steps_q, model_dim)] - The context vector. - attention_weights : Tensor [shape=(batch_size, times_steps_q, time_steps_k)] - The attention weights. - """ - q = _split_heads(self.affine_q(q), self.num_heads) # (B, h, T, C) - k = _split_heads(self.affine_k(k), self.num_heads) - v = _split_heads(self.affine_v(v), self.num_heads) - mask = paddle.unsqueeze(mask, 1) # unsqueeze for the h dim - - context_vectors, attention_weights = scaled_dot_product_attention( - q, k, v, mask, self.dropout, self.training) - # NOTE: there is more sophisticated implementation: Scheduled DropHead - context_vectors = _concat_heads(context_vectors) # (B, T, h*C) - out = self.affine_o(context_vectors) - return out, attention_weights - - -class LocationSensitiveAttention(nn.Layer): - """Location Sensitive Attention module. - - Reference: `Attention-Based Models for Speech Recognition `_ - - Parameters - ----------- - d_query: int - The feature size of query. - d_key : int - The feature size of key. - d_attention : int - The feature size of dimension. - location_filters : int - Filter size of attention convolution. - location_kernel_size : int - Kernel size of attention convolution. - """ - - def __init__(self, - d_query: int, - d_key: int, - d_attention: int, - location_filters: int, - location_kernel_size: int): - super().__init__() - - self.query_layer = nn.Linear(d_query, d_attention, bias_attr=False) - self.key_layer = nn.Linear(d_key, d_attention, bias_attr=False) - self.value = nn.Linear(d_attention, 1, bias_attr=False) - - # Location Layer - self.location_conv = nn.Conv1D( - 2, - location_filters, - kernel_size=location_kernel_size, - padding=int((location_kernel_size - 1) / 2), - bias_attr=False, - data_format='NLC') - self.location_layer = nn.Linear( - location_filters, d_attention, bias_attr=False) - - def forward(self, - query, - processed_key, - value, - attention_weights_cat, - mask=None): - """Compute context vector and attention weights. - - Parameters - ----------- - query : Tensor [shape=(batch_size, d_query)] - The queries. - processed_key : Tensor [shape=(batch_size, time_steps_k, d_attention)] - The keys after linear layer. 
- value : Tensor [shape=(batch_size, time_steps_k, d_key)] - The values. - attention_weights_cat : Tensor [shape=(batch_size, time_step_k, 2)] - Attention weights concat. - mask : Tensor, optional - The mask. Shape should be (batch_size, times_steps_k, 1). - Defaults to None. - - Returns - ---------- - attention_context : Tensor [shape=(batch_size, d_attention)] - The context vector. - attention_weights : Tensor [shape=(batch_size, time_steps_k)] - The attention weights. - """ - - processed_query = self.query_layer(paddle.unsqueeze(query, axis=[1])) - processed_attention_weights = self.location_layer( - self.location_conv(attention_weights_cat)) - # (B, T_enc, 1) - alignment = self.value( - paddle.tanh(processed_attention_weights + processed_key + - processed_query)) - - if mask is not None: - alignment = alignment + (1.0 - mask) * -1e9 - - attention_weights = F.softmax(alignment, axis=1) - attention_context = paddle.matmul( - attention_weights, value, transpose_x=True) - - attention_weights = paddle.squeeze(attention_weights, axis=-1) - attention_context = paddle.squeeze(attention_context, axis=1) - - return attention_context, attention_weights diff --git a/paddlespeech/t2s/modules/audio.py b/paddlespeech/t2s/modules/audio.py deleted file mode 100644 index 926ce8f2..00000000 --- a/paddlespeech/t2s/modules/audio.py +++ /dev/null @@ -1,229 +0,0 @@ -# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -import librosa -import numpy as np -import paddle -from librosa.util import pad_center -from paddle import nn -from paddle.nn import functional as F -from scipy import signal - -__all__ = ["quantize", "dequantize", "STFT", "MelScale"] - - -def quantize(values, n_bands): - """Linearlly quantize a float Tensor in [-1, 1) to an interger Tensor in - [0, n_bands). - - Parameters - ----------- - values : Tensor [dtype: flaot32 or float64] - The floating point value. - - n_bands : int - The number of bands. The output integer Tensor's value is in the range - [0, n_bans). - - Returns - ---------- - Tensor [dtype: int 64] - The quantized tensor. - """ - quantized = paddle.cast((values + 1.0) / 2.0 * n_bands, "int64") - return quantized - - -def dequantize(quantized, n_bands, dtype=None): - """Linearlly dequantize an integer Tensor into a float Tensor in the range - [-1, 1). - - Parameters - ----------- - quantized : Tensor [dtype: int] - The quantized value in the range [0, n_bands). - - n_bands : int - Number of bands. The input integer Tensor's value is in the range - [0, n_bans). - - dtype : str, optional - Data type of the output. - - Returns - ----------- - Tensor - The dequantized tensor, dtype is specified by `dtype`. If `dtype` is - not specified, the default float data type is used. - """ - dtype = dtype or paddle.get_default_dtype() - value = (paddle.cast(quantized, dtype) + 0.5) * (2.0 / n_bands) - 1.0 - return value - - -class STFT(nn.Layer): - """A module for computing stft transformation in a differentiable way. 
- - Parameters - ------------ - n_fft : int - Number of samples in a frame. - hop_length : int - Number of samples shifted between adjacent frames. - win_length : int - Length of the window. - window : str, optional - Name of window function, see `scipy.signal.get_window` for more - details. Defaults to "hanning". - center : bool - If True, the signal y is padded so that frame D[:, t] is centered - at y[t * hop_length]. If False, then D[:, t] begins at y[t * hop_length]. - Defaults to True. - pad_mode : string or function - If center=True, this argument is passed to np.pad for padding the edges - of the signal y. By default (pad_mode="reflect"), y is padded on both - sides with its own reflection, mirrored around its first and last - sample respectively. If center=False, this argument is ignored. - - Notes - ----------- - It behaves like ``librosa.core.stft``. See ``librosa.core.stft`` for more - details. - - Given a audio which ``T`` samples, it the STFT transformation outputs a - spectrum with (C, frames) and complex dtype, where ``C = 1 + n_fft / 2`` - and ``frames = 1 + T // hop_lenghth``. - - Ony ``center`` and ``reflect`` padding is supported now. - - """ - - def __init__(self, - n_fft, - hop_length=None, - win_length=None, - window="hanning", - center=True, - pad_mode="reflect"): - super().__init__() - # By default, use the entire frame - if win_length is None: - win_length = n_fft - - # Set the default hop, if it's not already specified - if hop_length is None: - hop_length = int(win_length // 4) - - self.hop_length = hop_length - self.n_bin = 1 + n_fft // 2 - self.n_fft = n_fft - self.center = center - self.pad_mode = pad_mode - - # calculate window - window = signal.get_window(window, win_length, fftbins=True) - - # pad window to n_fft size - if n_fft != win_length: - window = pad_center(window, n_fft, mode="constant") - # lpad = (n_fft - win_length) // 2 - # rpad = n_fft - win_length - lpad - # window = np.pad(window, ((lpad, pad), ), 'constant') - - # calculate weights - # r = np.arange(0, n_fft) - # M = np.expand_dims(r, -1) * np.expand_dims(r, 0) - # w_real = np.reshape(window * - # np.cos(2 * np.pi * M / n_fft)[:self.n_bin], - # (self.n_bin, 1, self.n_fft)) - # w_imag = np.reshape(window * - # np.sin(-2 * np.pi * M / n_fft)[:self.n_bin], - # (self.n_bin, 1, self.n_fft)) - weight = np.fft.fft(np.eye(n_fft))[:self.n_bin] - w_real = weight.real - w_imag = weight.imag - w = np.concatenate([w_real, w_imag], axis=0) - w = w * window - w = np.expand_dims(w, 1) - weight = paddle.cast(paddle.to_tensor(w), paddle.get_default_dtype()) - self.register_buffer("weight", weight) - - def forward(self, x): - """Compute the stft transform. - Parameters - ------------ - x : Tensor [shape=(B, T)] - The input waveform. - Returns - ------------ - real : Tensor [shape=(B, C, frames)] - The real part of the spectrogram. - - imag : Tensor [shape=(B, C, frames)] - The image part of the spectrogram. - """ - x = paddle.unsqueeze(x, axis=1) - if self.center: - x = F.pad( - x, [self.n_fft // 2, self.n_fft // 2], - data_format='NCL', - mode=self.pad_mode) - - # to BCT, C=1 - out = F.conv1d(x, self.weight, stride=self.hop_length) - real, imag = paddle.chunk(out, 2, axis=1) # BCT - return real, imag - - def power(self, x): - """Compute the power spectrum. - Parameters - ------------ - x : Tensor [shape=(B, T)] - The input waveform. - Returns - ------------ - Tensor [shape=(B, C, T)] - The power spectrum. 
- """ - real, imag = self.forward(x) - power = real**2 + imag**2 - return power - - def magnitude(self, x): - """Compute the magnitude of the spectrum. - Parameters - ------------ - x : Tensor [shape=(B, T)] - The input waveform. - Returns - ------------ - Tensor [shape=(B, C, T)] - The magnitude of the spectrum. - """ - power = self.power(x) - magnitude = paddle.sqrt(power) # TODO(chenfeiyu): maybe clipping - return magnitude - - -class MelScale(nn.Layer): - def __init__(self, sr, n_fft, n_mels, fmin, fmax): - super().__init__() - mel_basis = librosa.filters.mel(sr, n_fft, n_mels, fmin, fmax) - # self.weight = paddle.to_tensor(mel_basis) - weight = paddle.to_tensor(mel_basis, dtype=paddle.get_default_dtype()) - self.register_buffer("weight", weight) - - def forward(self, spec): - # (n_mels, n_freq) * (batch_size, n_freq, n_frames) - mel = paddle.matmul(self.weight, spec) - return mel diff --git a/paddlespeech/t2s/modules/causal_conv.py b/paddlespeech/t2s/modules/causal_conv.py index c0dd5b28..c0d4f955 100644 --- a/paddlespeech/t2s/modules/causal_conv.py +++ b/paddlespeech/t2s/modules/causal_conv.py @@ -13,9 +13,10 @@ # limitations under the License. """Causal convolusion layer modules.""" import paddle +from paddle import nn -class CausalConv1D(paddle.nn.Layer): +class CausalConv1D(nn.Layer): """CausalConv1D module with customized initialization.""" def __init__( @@ -31,7 +32,7 @@ class CausalConv1D(paddle.nn.Layer): super().__init__() self.pad = getattr(paddle.nn, pad)((kernel_size - 1) * dilation, **pad_params) - self.conv = paddle.nn.Conv1D( + self.conv = nn.Conv1D( in_channels, out_channels, kernel_size, @@ -52,7 +53,7 @@ class CausalConv1D(paddle.nn.Layer): return self.conv(self.pad(x))[:, :, :x.shape[2]] -class CausalConv1DTranspose(paddle.nn.Layer): +class CausalConv1DTranspose(nn.Layer): """CausalConv1DTranspose module with customized initialization.""" def __init__(self, @@ -63,7 +64,7 @@ class CausalConv1DTranspose(paddle.nn.Layer): bias=True): """Initialize CausalConvTranspose1d module.""" super().__init__() - self.deconv = paddle.nn.Conv1DTranspose( + self.deconv = nn.Conv1DTranspose( in_channels, out_channels, kernel_size, stride, bias_attr=bias) self.stride = stride diff --git a/paddlespeech/t2s/modules/conformer/convolution.py b/paddlespeech/t2s/modules/conformer/convolution.py index 25246736..e4a6c8c6 100644 --- a/paddlespeech/t2s/modules/conformer/convolution.py +++ b/paddlespeech/t2s/modules/conformer/convolution.py @@ -72,8 +72,10 @@ class ConvolutionModule(nn.Layer): x = x.transpose([0, 2, 1]) # GLU mechanism - x = self.pointwise_conv1(x) # (batch, 2*channel, dim) - x = nn.functional.glu(x, axis=1) # (batch, channel, dim) + # (batch, 2*channel, time) + x = self.pointwise_conv1(x) + # (batch, channel, time) + x = nn.functional.glu(x, axis=1) # 1D Depthwise Conv x = self.depthwise_conv(x) diff --git a/paddlespeech/t2s/modules/conformer/encoder_layer.py b/paddlespeech/t2s/modules/conformer/encoder_layer.py index a7a49367..2949dc37 100644 --- a/paddlespeech/t2s/modules/conformer/encoder_layer.py +++ b/paddlespeech/t2s/modules/conformer/encoder_layer.py @@ -25,19 +25,19 @@ class EncoderLayer(nn.Layer): ---------- size : int Input dimension. - self_attn : paddle.nn.Layer + self_attn : nn.Layer Self-attention module instance. `MultiHeadedAttention` or `RelPositionMultiHeadedAttention` instance can be used as the argument. - feed_forward : paddle.nn.Layer + feed_forward : nn.Layer Feed-forward module instance. 
`PositionwiseFeedForward`, `MultiLayeredConv1d`, or `Conv1dLinear` instance can be used as the argument. - feed_forward_macaron : paddle.nn.Layer + feed_forward_macaron : nn.Layer Additional feed-forward module instance. `PositionwiseFeedForward`, `MultiLayeredConv1d`, or `Conv1dLinear` instance can be used as the argument. - conv_module : paddle.nn.Layer + conv_module : nn.Layer Convolution module instance. `ConvlutionModule` instance can be used as the argument. dropout_rate : float @@ -67,7 +67,7 @@ class EncoderLayer(nn.Layer): concat_after=False, stochastic_depth_rate=0.0, ): """Construct an EncoderLayer object.""" - super(EncoderLayer, self).__init__() + super().__init__() self.self_attn = self_attn self.feed_forward = feed_forward self.feed_forward_macaron = feed_forward_macaron diff --git a/paddlespeech/t2s/modules/conv.py b/paddlespeech/t2s/modules/conv.py index d9bd98df..68766d5e 100644 --- a/paddlespeech/t2s/modules/conv.py +++ b/paddlespeech/t2s/modules/conv.py @@ -84,7 +84,7 @@ class Conv1dCell(nn.Conv1D): _kernel_size = kernel_size[0] if isinstance(kernel_size, ( tuple, list)) else kernel_size self._r = 1 + (_kernel_size - 1) * _dilation - super(Conv1dCell, self).__init__( + super().__init__( in_channels, out_channels, kernel_size, @@ -226,7 +226,7 @@ class Conv1dBatchNorm(nn.Layer): data_format="NCL", momentum=0.9, epsilon=1e-05): - super(Conv1dBatchNorm, self).__init__() + super().__init__() self.conv = nn.Conv1D( in_channels, out_channels, diff --git a/paddlespeech/t2s/modules/expansion.py b/paddlespeech/t2s/modules/expansion.py deleted file mode 100644 index e9d4b6fe..00000000 --- a/paddlespeech/t2s/modules/expansion.py +++ /dev/null @@ -1,37 +0,0 @@ -# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -import numpy as np -import paddle -from paddle import Tensor - - -def expand(encodings: Tensor, durations: Tensor) -> Tensor: - """ - encodings: (B, T, C) - durations: (B, T) - """ - batch_size, t_enc = durations.shape - durations = durations.numpy() - slens = np.sum(durations, -1) - t_dec = np.max(slens) - M = np.zeros([batch_size, t_dec, t_enc]) - for i in range(batch_size): - k = 0 - for j in range(t_enc): - d = durations[i, j] - M[i, k:k + d, j] = 1 - k += d - M = paddle.to_tensor(M, dtype=encodings.dtype) - encodings = paddle.matmul(M, encodings) - return encodings diff --git a/paddlespeech/t2s/modules/layer_norm.py b/paddlespeech/t2s/modules/layer_norm.py index a1c775fc..4edd22c9 100644 --- a/paddlespeech/t2s/modules/layer_norm.py +++ b/paddlespeech/t2s/modules/layer_norm.py @@ -13,9 +13,10 @@ # limitations under the License. """Layer normalization module.""" import paddle +from paddle import nn -class LayerNorm(paddle.nn.LayerNorm): +class LayerNorm(nn.LayerNorm): """Layer normalization module. 
Parameters @@ -28,7 +29,7 @@ class LayerNorm(paddle.nn.LayerNorm): def __init__(self, nout, dim=-1): """Construct an LayerNorm object.""" - super(LayerNorm, self).__init__(nout) + super().__init__(nout) self.dim = dim def forward(self, x): diff --git a/paddlespeech/t2s/modules/losses.py b/paddlespeech/t2s/modules/losses.py index ece9e045..6b0ab6b3 100644 --- a/paddlespeech/t2s/modules/losses.py +++ b/paddlespeech/t2s/modules/losses.py @@ -11,18 +11,16 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. +import math + import paddle +from paddle import nn from paddle.fluid.layers import sequence_mask from paddle.nn import functional as F - -__all__ = [ - "guided_attention_loss", - "weighted_mean", - "masked_l1_loss", - "masked_softmax_with_cross_entropy", -] +from scipy import signal +# Loss for Tacotron2 def attention_guide(dec_lens, enc_lens, N, T, g, dtype=None): """Build that W matrix. shape(B, T_dec, T_enc) W[i, n, t] = 1 - exp(-(n/dec_lens[i] - t/enc_lens[i])**2 / (2g**2)) @@ -57,6 +55,367 @@ def guided_attention_loss(attention_weight, dec_lens, enc_lens, g): return loss +# Losses for GAN Vocoder +def stft(x, + fft_size, + hop_length=None, + win_length=None, + window='hann', + center=True, + pad_mode='reflect'): + """Perform STFT and convert to magnitude spectrogram. + Parameters + ---------- + x : Tensor + Input signal tensor (B, T). + fft_size : int + FFT size. + hop_size : int + Hop size. + win_length : int + window : str, optional + window : str + Name of window function, see `scipy.signal.get_window` for more + details. Defaults to "hann". + center : bool, optional + center (bool, optional): Whether to pad `x` to make that the + :math:`t \times hop\_length` at the center of :math:`t`-th frame. Default: `True`. + pad_mode : str, optional + Choose padding pattern when `center` is `True`. + Returns + ---------- + Tensor: + Magnitude spectrogram (B, #frames, fft_size // 2 + 1). + """ + # calculate window + window = signal.get_window(window, win_length, fftbins=True) + window = paddle.to_tensor(window) + x_stft = paddle.signal.stft( + x, + fft_size, + hop_length, + win_length, + window=window, + center=center, + pad_mode=pad_mode) + + real = x_stft.real() + imag = x_stft.imag() + + return paddle.sqrt(paddle.clip(real**2 + imag**2, min=1e-7)).transpose( + [0, 2, 1]) + + +class SpectralConvergenceLoss(nn.Layer): + """Spectral convergence loss module.""" + + def __init__(self): + """Initilize spectral convergence loss module.""" + super().__init__() + + def forward(self, x_mag, y_mag): + """Calculate forward propagation. + Parameters + ---------- + x_mag : Tensor + Magnitude spectrogram of predicted signal (B, #frames, #freq_bins). + y_mag : Tensor) + Magnitude spectrogram of groundtruth signal (B, #frames, #freq_bins). + Returns + ---------- + Tensor + Spectral convergence loss value. + """ + return paddle.norm( + y_mag - x_mag, p="fro") / paddle.clip( + paddle.norm(y_mag, p="fro"), min=1e-10) + + +class LogSTFTMagnitudeLoss(nn.Layer): + """Log STFT magnitude loss module.""" + + def __init__(self, epsilon=1e-7): + """Initilize los STFT magnitude loss module.""" + super().__init__() + self.epsilon = epsilon + + def forward(self, x_mag, y_mag): + """Calculate forward propagation. + Parameters + ---------- + x_mag : Tensor + Magnitude spectrogram of predicted signal (B, #frames, #freq_bins). 
+ y_mag : Tensor + Magnitude spectrogram of groundtruth signal (B, #frames, #freq_bins). + Returns + ---------- + Tensor + Log STFT magnitude loss value. + """ + return F.l1_loss( + paddle.log(paddle.clip(y_mag, min=self.epsilon)), + paddle.log(paddle.clip(x_mag, min=self.epsilon))) + + +class STFTLoss(nn.Layer): + """STFT loss module.""" + + def __init__(self, + fft_size=1024, + shift_size=120, + win_length=600, + window="hann"): + """Initialize STFT loss module.""" + super().__init__() + self.fft_size = fft_size + self.shift_size = shift_size + self.win_length = win_length + self.window = window + self.spectral_convergence_loss = SpectralConvergenceLoss() + self.log_stft_magnitude_loss = LogSTFTMagnitudeLoss() + + def forward(self, x, y): + """Calculate forward propagation. + Parameters + ---------- + x : Tensor + Predicted signal (B, T). + y : Tensor + Groundtruth signal (B, T). + Returns + ---------- + Tensor + Spectral convergence loss value. + Tensor + Log STFT magnitude loss value. + """ + x_mag = stft(x, self.fft_size, self.shift_size, self.win_length, + self.window) + y_mag = stft(y, self.fft_size, self.shift_size, self.win_length, + self.window) + sc_loss = self.spectral_convergence_loss(x_mag, y_mag) + mag_loss = self.log_stft_magnitude_loss(x_mag, y_mag) + + return sc_loss, mag_loss + + +class MultiResolutionSTFTLoss(nn.Layer): + """Multi resolution STFT loss module.""" + + def __init__( + self, + fft_sizes=[1024, 2048, 512], + hop_sizes=[120, 240, 50], + win_lengths=[600, 1200, 240], + window="hann", ): + """Initialize Multi resolution STFT loss module. + Parameters + ---------- + fft_sizes : list + List of FFT sizes. + hop_sizes : list + List of hop sizes. + win_lengths : list + List of window lengths. + window : str + Window function type. + """ + super().__init__() + assert len(fft_sizes) == len(hop_sizes) == len(win_lengths) + self.stft_losses = nn.LayerList() + for fs, ss, wl in zip(fft_sizes, hop_sizes, win_lengths): + self.stft_losses.append(STFTLoss(fs, ss, wl, window)) + + def forward(self, x, y): + """Calculate forward propagation. + Parameters + ---------- + x : Tensor + Predicted signal (B, T) or (B, #subband, T). + y : Tensor + Groundtruth signal (B, T) or (B, #subband, T). + Returns + ---------- + Tensor + Multi resolution spectral convergence loss value. + Tensor + Multi resolution log STFT magnitude loss value. + """ + if len(x.shape) == 3: + # (B, C, T) -> (B x C, T) + x = x.reshape([-1, x.shape[2]]) + # (B, C, T) -> (B x C, T) + y = y.reshape([-1, y.shape[2]]) + sc_loss = 0.0 + mag_loss = 0.0 + for f in self.stft_losses: + sc_l, mag_l = f(x, y) + sc_loss += sc_l + mag_loss += mag_l + sc_loss /= len(self.stft_losses) + mag_loss /= len(self.stft_losses) + + return sc_loss, mag_loss + + +class GeneratorAdversarialLoss(nn.Layer): + """Generator adversarial loss module.""" + + def __init__( + self, + average_by_discriminators=True, + loss_type="mse", ): + """Initialize GeneratorAversarialLoss module.""" + super().__init__() + self.average_by_discriminators = average_by_discriminators + assert loss_type in ["mse", "hinge"], f"{loss_type} is not supported." + if loss_type == "mse": + self.criterion = self._mse_loss + else: + self.criterion = self._hinge_loss + + def forward(self, outputs): + """Calcualate generator adversarial loss. + Parameters + ---------- + outputs: Tensor or List + Discriminator outputs or list of discriminator outputs. + Returns + ---------- + Tensor + Generator adversarial loss value. 
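# --- Editorial usage note (illustrative sketch, not part of this patch) ---
# The multi-resolution STFT loss consolidated into
# paddlespeech/t2s/modules/losses.py returns a spectral-convergence term and a
# log-magnitude term, each averaged over the configured FFT resolutions. A
# minimal sketch with made-up shapes; it assumes the installed PaddlePaddle
# provides paddle.signal.stft, as the stft() helper above requires.
import paddle
from paddlespeech.t2s.modules.losses import MultiResolutionSTFTLoss

y_hat = paddle.randn([4, 24000])          # predicted waveform (B, T)
y = paddle.randn([4, 24000])              # ground-truth waveform (B, T)
criterion = MultiResolutionSTFTLoss()     # defaults: three FFT resolutions
sc_loss, mag_loss = criterion(y_hat, y)   # two scalar Tensors
total_stft_loss = sc_loss + mag_loss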
+ """ + if isinstance(outputs, (tuple, list)): + adv_loss = 0.0 + for i, outputs_ in enumerate(outputs): + if isinstance(outputs_, (tuple, list)): + # case including feature maps + outputs_ = outputs_[-1] + adv_loss += self.criterion(outputs_) + if self.average_by_discriminators: + adv_loss /= i + 1 + else: + adv_loss = self.criterion(outputs) + + return adv_loss + + def _mse_loss(self, x): + return F.mse_loss(x, paddle.ones_like(x)) + + def _hinge_loss(self, x): + return -x.mean() + + +class DiscriminatorAdversarialLoss(nn.Layer): + """Discriminator adversarial loss module.""" + + def __init__( + self, + average_by_discriminators=True, + loss_type="mse", ): + """Initialize DiscriminatorAversarialLoss module.""" + super().__init__() + self.average_by_discriminators = average_by_discriminators + assert loss_type in ["mse"], f"{loss_type} is not supported." + if loss_type == "mse": + self.fake_criterion = self._mse_fake_loss + self.real_criterion = self._mse_real_loss + + def forward(self, outputs_hat, outputs): + """Calcualate discriminator adversarial loss. + Parameters + ---------- + outputs_hat : Tensor or list + Discriminator outputs or list of + discriminator outputs calculated from generator outputs. + outputs : Tensor or list + Discriminator outputs or list of + discriminator outputs calculated from groundtruth. + Returns + ---------- + Tensor + Discriminator real loss value. + Tensor + Discriminator fake loss value. + """ + if isinstance(outputs, (tuple, list)): + real_loss = 0.0 + fake_loss = 0.0 + for i, (outputs_hat_, + outputs_) in enumerate(zip(outputs_hat, outputs)): + if isinstance(outputs_hat_, (tuple, list)): + # case including feature maps + outputs_hat_ = outputs_hat_[-1] + outputs_ = outputs_[-1] + real_loss += self.real_criterion(outputs_) + fake_loss += self.fake_criterion(outputs_hat_) + if self.average_by_discriminators: + fake_loss /= i + 1 + real_loss /= i + 1 + else: + real_loss = self.real_criterion(outputs) + fake_loss = self.fake_criterion(outputs_hat) + + return real_loss, fake_loss + + def _mse_real_loss(self, x): + return F.mse_loss(x, paddle.ones_like(x)) + + def _mse_fake_loss(self, x): + return F.mse_loss(x, paddle.zeros_like(x)) + + +# Losses for SpeedySpeech +# Structural Similarity Index Measure (SSIM) +def gaussian(window_size, sigma): + gauss = paddle.to_tensor([ + math.exp(-(x - window_size // 2)**2 / float(2 * sigma**2)) + for x in range(window_size) + ]) + return gauss / gauss.sum() + + +def create_window(window_size, channel): + _1D_window = gaussian(window_size, 1.5).unsqueeze(1) + _2D_window = paddle.matmul(_1D_window, paddle.transpose( + _1D_window, [1, 0])).unsqueeze([0, 1]) + window = paddle.expand(_2D_window, [channel, 1, window_size, window_size]) + return window + + +def _ssim(img1, img2, window, window_size, channel, size_average=True): + mu1 = F.conv2d(img1, window, padding=window_size // 2, groups=channel) + mu2 = F.conv2d(img2, window, padding=window_size // 2, groups=channel) + + mu1_sq = mu1.pow(2) + mu2_sq = mu2.pow(2) + mu1_mu2 = mu1 * mu2 + + sigma1_sq = F.conv2d( + img1 * img1, window, padding=window_size // 2, groups=channel) - mu1_sq + sigma2_sq = F.conv2d( + img2 * img2, window, padding=window_size // 2, groups=channel) - mu2_sq + sigma12 = F.conv2d( + img1 * img2, window, padding=window_size // 2, groups=channel) - mu1_mu2 + + C1 = 0.01**2 + C2 = 0.03**2 + + ssim_map = ((2 * mu1_mu2 + C1) * (2 * sigma12 + C2)) \ + / ((mu1_sq + mu2_sq + C1) * (sigma1_sq + sigma2_sq + C2)) + + if size_average: + return ssim_map.mean() + else: 
+ return ssim_map.mean(1).mean(1).mean(1) + + +def ssim(img1, img2, window_size=11, size_average=True): + (_, channel, _, _) = img1.shape + window = create_window(window_size, channel) + return _ssim(img1, img2, window, window_size, channel, size_average) + + def weighted_mean(input, weight): """Weighted mean. It can also be used as masked mean. @@ -98,28 +457,3 @@ def masked_l1_loss(prediction, target, mask): abs_error = F.l1_loss(prediction, target, reduction='none') loss = weighted_mean(abs_error, mask) return loss - - -def masked_softmax_with_cross_entropy(logits, label, mask, axis=-1): - """Compute masked softmax with cross entropy loss. - - Parameters - ---------- - logits : Tensor - The logits. The ``axis``-th axis is the class dimension. - label : Tensor [dtype: int] - The label. The size of the ``axis``-th axis should be 1. - mask : Tensor - The mask. The shape should be broadcastable to ``label``. - axis : int, optional - The index of the class dimension in the shape of ``logits``, by default - -1. - - Returns - ------- - Tensor [shape=(1,)] - The masked softmax with cross entropy loss. - """ - ce = F.softmax_with_cross_entropy(logits, label, axis=axis) - loss = weighted_mean(ce, mask) - return loss diff --git a/paddlespeech/t2s/modules/masking.py b/paddlespeech/t2s/modules/masking.py deleted file mode 100644 index 7cf37040..00000000 --- a/paddlespeech/t2s/modules/masking.py +++ /dev/null @@ -1,120 +0,0 @@ -# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -import paddle - -__all__ = [ - "id_mask", - "feature_mask", - "combine_mask", - "future_mask", -] - - -def id_mask(input, padding_index=0, dtype="bool"): - """Generate mask with input ids. - - Those positions where the value equals ``padding_index`` correspond to 0 or - ``False``, otherwise, 1 or ``True``. - - Parameters - ---------- - input : Tensor [dtype: int] - The input tensor. It represents the ids. - padding_index : int, optional - The id which represents padding, by default 0. - dtype : str, optional - Data type of the returned mask, by default "bool". - - Returns - ------- - Tensor - The generate mask. It has the same shape as ``input`` does. - """ - return paddle.cast(input != padding_index, dtype) - - -def feature_mask(input, axis, dtype="bool"): - """Compute mask from input features. - - For a input features, represented as batched feature vectors, those vectors - which all zeros are considerd padding vectors. - - Parameters - ---------- - input : Tensor [dtype: float] - The input tensor which represents featues. - axis : int - The index of the feature dimension in ``input``. Other dimensions are - considered ``spatial`` dimensions. - dtype : str, optional - Data type of the generated mask, by default "bool" - Returns - ------- - Tensor - The geenrated mask with ``spatial`` shape as mentioned above. - - It has one less dimension than ``input`` does. 
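# --- Editorial usage note (illustrative sketch, not part of this patch) ---
# The GAN losses above are used in pairs: the generator loss pushes
# discriminator scores for generated audio towards 1, while the discriminator
# loss separates real (towards 1) and fake (towards 0) scores. Shapes below
# are arbitrary example values.
import paddle
from paddlespeech.t2s.modules.losses import (DiscriminatorAdversarialLoss,
                                             GeneratorAdversarialLoss, ssim)

d_fake = paddle.randn([4, 1, 100])        # discriminator scores for generated audio
d_real = paddle.randn([4, 1, 100])        # discriminator scores for real audio

gen_adv = GeneratorAdversarialLoss(loss_type="mse")
dis_adv = DiscriminatorAdversarialLoss(loss_type="mse")

adv_loss = gen_adv(d_fake)                      # generator adversarial loss
real_loss, fake_loss = dis_adv(d_fake, d_real)  # discriminator real/fake losses
dis_loss = real_loss + fake_loss

# The SSIM helper added for SpeedySpeech expects 4-D (B, C, H, W) inputs,
# e.g. mel-spectrograms with a singleton channel axis:
ssim_value = ssim(paddle.rand([2, 1, 80, 200]), paddle.rand([2, 1, 80, 200]))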
- """ - feature_sum = paddle.sum(paddle.abs(input), axis) - return paddle.cast(feature_sum != 0, dtype) - - -def combine_mask(mask1, mask2): - """Combine two mask with multiplication or logical and. - - Parameters - ----------- - mask1 : Tensor - The first mask. - mask2 : Tensor - The second mask with broadcastable shape with ``mask1``. - Returns - -------- - Tensor - Combined mask. - - Notes - ------ - It is mainly used to combine the padding mask and no future mask for - transformer decoder. - - Padding mask is used to mask padding positions of the decoder inputs and - no future mask is used to prevent the decoder to see future information. - """ - if mask1.dtype == paddle.fluid.core.VarDesc.VarType.BOOL: - return paddle.logical_and(mask1, mask2) - else: - return mask1 * mask2 - - -def future_mask(time_steps, dtype="bool"): - """Generate lower triangular mask. - - It is used at transformer decoder to prevent the decoder to see future - information. - - Parameters - ---------- - time_steps : int - Decoder time steps. - dtype : str, optional - The data type of the generate mask, by default "bool". - - Returns - ------- - Tensor - The generated mask. - """ - mask = paddle.tril(paddle.ones([time_steps, time_steps])) - return paddle.cast(mask, dtype) diff --git a/paddlespeech/t2s/modules/nets_utils.py b/paddlespeech/t2s/modules/nets_utils.py index 879cdba6..3822b33d 100644 --- a/paddlespeech/t2s/modules/nets_utils.py +++ b/paddlespeech/t2s/modules/nets_utils.py @@ -129,7 +129,7 @@ def initialize(model: nn.Layer, init: str): Parameters ---------- - model : paddle.nn.Layer + model : nn.Layer Target. init : str Method of initialization. @@ -150,17 +150,3 @@ def initialize(model: nn.Layer, init: str): nn.initializer.Constant()) else: raise ValueError("Unknown initialization: " + init) - - -def get_activation(act): - """Return activation function.""" - - activation_funcs = { - "hardtanh": paddle.nn.Hardtanh, - "tanh": paddle.nn.Tanh, - "relu": paddle.nn.ReLU, - "selu": paddle.nn.SELU, - "swish": paddle.nn.Swish, - } - - return activation_funcs[act]() diff --git a/paddlespeech/t2s/modules/pqmf.py b/paddlespeech/t2s/modules/pqmf.py index c299fb57..fb850a4d 100644 --- a/paddlespeech/t2s/modules/pqmf.py +++ b/paddlespeech/t2s/modules/pqmf.py @@ -16,6 +16,7 @@ import numpy as np import paddle import paddle.nn.functional as F +from paddle import nn from scipy.signal import kaiser @@ -56,7 +57,7 @@ def design_prototype_filter(taps=62, cutoff_ratio=0.142, beta=9.0): return h -class PQMF(paddle.nn.Layer): +class PQMF(nn.Layer): """PQMF module. This module is based on `Near-perfect-reconstruction pseudo-QMF banks`_. .. _`Near-perfect-reconstruction pseudo-QMF banks`: @@ -105,7 +106,7 @@ class PQMF(paddle.nn.Layer): self.updown_filter = updown_filter self.subbands = subbands # keep padding info - self.pad_fn = paddle.nn.Pad1D(taps // 2, mode='constant', value=0.0) + self.pad_fn = nn.Pad1D(taps // 2, mode='constant', value=0.0) def analysis(self, x): """Analysis with PQMF. diff --git a/paddlespeech/t2s/modules/predictor/duration_predictor.py b/paddlespeech/t2s/modules/predictor/duration_predictor.py index b269b686..6d7adf23 100644 --- a/paddlespeech/t2s/modules/predictor/duration_predictor.py +++ b/paddlespeech/t2s/modules/predictor/duration_predictor.py @@ -65,7 +65,7 @@ class DurationPredictor(nn.Layer): Offset value to avoid nan in log domain. 
""" - super(DurationPredictor, self).__init__() + super().__init__() self.offset = offset self.conv = nn.LayerList() for idx in range(n_layers): @@ -155,7 +155,7 @@ class DurationPredictorLoss(nn.Layer): reduction : str Reduction type in loss calculation. """ - super(DurationPredictorLoss, self).__init__() + super().__init__() self.criterion = nn.MSELoss(reduction=reduction) self.offset = offset diff --git a/paddlespeech/t2s/modules/ssim.py b/paddlespeech/t2s/modules/ssim.py deleted file mode 100644 index c9899cd6..00000000 --- a/paddlespeech/t2s/modules/ssim.py +++ /dev/null @@ -1,80 +0,0 @@ -# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. -# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -from math import exp - -import paddle -import paddle.nn.functional as F -from paddle import nn - - -def gaussian(window_size, sigma): - gauss = paddle.to_tensor([ - exp(-(x - window_size // 2)**2 / float(2 * sigma**2)) - for x in range(window_size) - ]) - return gauss / gauss.sum() - - -def create_window(window_size, channel): - _1D_window = gaussian(window_size, 1.5).unsqueeze(1) - _2D_window = paddle.matmul(_1D_window, paddle.transpose( - _1D_window, [1, 0])).unsqueeze([0, 1]) - window = paddle.expand(_2D_window, [channel, 1, window_size, window_size]) - return window - - -def _ssim(img1, img2, window, window_size, channel, size_average=True): - mu1 = F.conv2d(img1, window, padding=window_size // 2, groups=channel) - mu2 = F.conv2d(img2, window, padding=window_size // 2, groups=channel) - - mu1_sq = mu1.pow(2) - mu2_sq = mu2.pow(2) - mu1_mu2 = mu1 * mu2 - - sigma1_sq = F.conv2d( - img1 * img1, window, padding=window_size // 2, groups=channel) - mu1_sq - sigma2_sq = F.conv2d( - img2 * img2, window, padding=window_size // 2, groups=channel) - mu2_sq - sigma12 = F.conv2d( - img1 * img2, window, padding=window_size // 2, groups=channel) - mu1_mu2 - - C1 = 0.01**2 - C2 = 0.03**2 - - ssim_map = ((2 * mu1_mu2 + C1) * (2 * sigma12 + C2)) \ - / ((mu1_sq + mu2_sq + C1) * (sigma1_sq + sigma2_sq + C2)) - - if size_average: - return ssim_map.mean() - else: - return ssim_map.mean(1).mean(1).mean(1) - - -class SSIM(nn.Layer): - def __init__(self, window_size=11, size_average=True): - super().__init__() - self.window_size = window_size - self.size_average = size_average - self.channel = 1 - self.window = create_window(window_size, self.channel) - - def forward(self, img1, img2): - return _ssim(img1, img2, self.window, self.window_size, self.channel, - self.size_average) - - -def ssim(img1, img2, window_size=11, size_average=True): - (_, channel, _, _) = img1.shape - window = create_window(window_size, channel) - return _ssim(img1, img2, window, window_size, channel, size_average) diff --git a/paddlespeech/t2s/modules/stft_loss.py b/paddlespeech/t2s/modules/stft_loss.py deleted file mode 100644 index 31963e71..00000000 --- a/paddlespeech/t2s/modules/stft_loss.py +++ /dev/null @@ -1,220 +0,0 @@ -# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved. 
-# -# Licensed under the Apache License, Version 2.0 (the "License"); -# you may not use this file except in compliance with the License. -# You may obtain a copy of the License at -# -# http://www.apache.org/licenses/LICENSE-2.0 -# -# Unless required by applicable law or agreed to in writing, software -# distributed under the License is distributed on an "AS IS" BASIS, -# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. -# See the License for the specific language governing permissions and -# limitations under the License. -# Modified from espnet(https://github.com/espnet/espnet) -import paddle -from paddle import nn -from paddle.nn import functional as F -from scipy import signal - - -def stft(x, - fft_size, - hop_length=None, - win_length=None, - window='hann', - center=True, - pad_mode='reflect'): - """Perform STFT and convert to magnitude spectrogram. - Parameters - ---------- - x : Tensor - Input signal tensor (B, T). - fft_size : int - FFT size. - hop_size : int - Hop size. - win_length : int - window : str, optional - window : str - Name of window function, see `scipy.signal.get_window` for more - details. Defaults to "hann". - center : bool, optional - center (bool, optional): Whether to pad `x` to make that the - :math:`t \times hop\_length` at the center of :math:`t`-th frame. Default: `True`. - pad_mode : str, optional - Choose padding pattern when `center` is `True`. - Returns - ---------- - Tensor: - Magnitude spectrogram (B, #frames, fft_size // 2 + 1). - """ - # calculate window - window = signal.get_window(window, win_length, fftbins=True) - window = paddle.to_tensor(window) - x_stft = paddle.signal.stft( - x, - fft_size, - hop_length, - win_length, - window=window, - center=center, - pad_mode=pad_mode) - - real = x_stft.real() - imag = x_stft.imag() - - return paddle.sqrt(paddle.clip(real**2 + imag**2, min=1e-7)).transpose( - [0, 2, 1]) - - -class SpectralConvergenceLoss(nn.Layer): - """Spectral convergence loss module.""" - - def __init__(self): - """Initilize spectral convergence loss module.""" - super().__init__() - - def forward(self, x_mag, y_mag): - """Calculate forward propagation. - Parameters - ---------- - x_mag : Tensor - Magnitude spectrogram of predicted signal (B, #frames, #freq_bins). - y_mag : Tensor) - Magnitude spectrogram of groundtruth signal (B, #frames, #freq_bins). - Returns - ---------- - Tensor - Spectral convergence loss value. - """ - return paddle.norm( - y_mag - x_mag, p="fro") / paddle.clip( - paddle.norm(y_mag, p="fro"), min=1e-10) - - -class LogSTFTMagnitudeLoss(nn.Layer): - """Log STFT magnitude loss module.""" - - def __init__(self, epsilon=1e-7): - """Initilize los STFT magnitude loss module.""" - super().__init__() - self.epsilon = epsilon - - def forward(self, x_mag, y_mag): - """Calculate forward propagation. - Parameters - ---------- - x_mag : Tensor - Magnitude spectrogram of predicted signal (B, #frames, #freq_bins). - y_mag : Tensor - Magnitude spectrogram of groundtruth signal (B, #frames, #freq_bins). - Returns - ---------- - Tensor - Log STFT magnitude loss value. 
- """ - return F.l1_loss( - paddle.log(paddle.clip(y_mag, min=self.epsilon)), - paddle.log(paddle.clip(x_mag, min=self.epsilon))) - - -class STFTLoss(nn.Layer): - """STFT loss module.""" - - def __init__(self, - fft_size=1024, - shift_size=120, - win_length=600, - window="hann"): - """Initialize STFT loss module.""" - super().__init__() - self.fft_size = fft_size - self.shift_size = shift_size - self.win_length = win_length - self.window = window - self.spectral_convergence_loss = SpectralConvergenceLoss() - self.log_stft_magnitude_loss = LogSTFTMagnitudeLoss() - - def forward(self, x, y): - """Calculate forward propagation. - Parameters - ---------- - x : Tensor - Predicted signal (B, T). - y : Tensor - Groundtruth signal (B, T). - Returns - ---------- - Tensor - Spectral convergence loss value. - Tensor - Log STFT magnitude loss value. - """ - x_mag = stft(x, self.fft_size, self.shift_size, self.win_length, - self.window) - y_mag = stft(y, self.fft_size, self.shift_size, self.win_length, - self.window) - sc_loss = self.spectral_convergence_loss(x_mag, y_mag) - mag_loss = self.log_stft_magnitude_loss(x_mag, y_mag) - - return sc_loss, mag_loss - - -class MultiResolutionSTFTLoss(nn.Layer): - """Multi resolution STFT loss module.""" - - def __init__( - self, - fft_sizes=[1024, 2048, 512], - hop_sizes=[120, 240, 50], - win_lengths=[600, 1200, 240], - window="hann", ): - """Initialize Multi resolution STFT loss module. - Parameters - ---------- - fft_sizes : list - List of FFT sizes. - hop_sizes : list - List of hop sizes. - win_lengths : list - List of window lengths. - window : str - Window function type. - """ - super().__init__() - assert len(fft_sizes) == len(hop_sizes) == len(win_lengths) - self.stft_losses = nn.LayerList() - for fs, ss, wl in zip(fft_sizes, hop_sizes, win_lengths): - self.stft_losses.append(STFTLoss(fs, ss, wl, window)) - - def forward(self, x, y): - """Calculate forward propagation. - Parameters - ---------- - x : Tensor - Predicted signal (B, T) or (B, #subband, T). - y : Tensor - Groundtruth signal (B, T) or (B, #subband, T). - Returns - ---------- - Tensor - Multi resolution spectral convergence loss value. - Tensor - Multi resolution log STFT magnitude loss value. - """ - if len(x.shape) == 3: - # (B, C, T) -> (B x C, T) - x = x.reshape([-1, x.shape[2]]) - # (B, C, T) -> (B x C, T) - y = y.reshape([-1, y.shape[2]]) - sc_loss = 0.0 - mag_loss = 0.0 - for f in self.stft_losses: - sc_l, mag_l = f(x, y) - sc_loss += sc_l - mag_loss += mag_l - sc_loss /= len(self.stft_losses) - mag_loss /= len(self.stft_losses) - - return sc_loss, mag_loss diff --git a/paddlespeech/t2s/modules/style_encoder.py b/paddlespeech/t2s/modules/style_encoder.py index 8a23e85c..e76226f3 100644 --- a/paddlespeech/t2s/modules/style_encoder.py +++ b/paddlespeech/t2s/modules/style_encoder.py @@ -74,7 +74,7 @@ class StyleEncoder(nn.Layer): gru_units: int=128, ): """Initilize global style encoder module.""" assert check_argument_types() - super(StyleEncoder, self).__init__() + super().__init__() self.ref_enc = ReferenceEncoder( idim=idim, @@ -93,11 +93,15 @@ class StyleEncoder(nn.Layer): def forward(self, speech: paddle.Tensor) -> paddle.Tensor: """Calculate forward propagation. - Args: - speech (Tensor): Batch of padded target features (B, Lmax, odim). + Parameters + ---------- + speech : Tensor + Batch of padded target features (B, Lmax, odim). - Returns: - Tensor: Style token embeddings (B, token_dim). + Returns + ---------- + Tensor: + Style token embeddings (B, token_dim). 
""" ref_embs = self.ref_enc(speech) @@ -145,7 +149,7 @@ class ReferenceEncoder(nn.Layer): gru_units: int=128, ): """Initilize reference encoder module.""" assert check_argument_types() - super(ReferenceEncoder, self).__init__() + super().__init__() # check hyperparameters are valid assert conv_kernel_size % 2 == 1, "kernel size must be odd." @@ -249,7 +253,7 @@ class StyleTokenLayer(nn.Layer): dropout_rate: float=0.0, ): """Initilize style token layer module.""" assert check_argument_types() - super(StyleTokenLayer, self).__init__() + super().__init__() gst_embs = paddle.randn(shape=[gst_tokens, gst_token_dim // gst_heads]) self.gst_embs = paddle.create_parameter( diff --git a/paddlespeech/t2s/modules/tacotron2/encoder.py b/paddlespeech/t2s/modules/tacotron2/encoder.py index b95e3529..f1889061 100644 --- a/paddlespeech/t2s/modules/tacotron2/encoder.py +++ b/paddlespeech/t2s/modules/tacotron2/encoder.py @@ -73,7 +73,7 @@ class Encoder(nn.Layer): Dropout rate. """ - super(Encoder, self).__init__() + super().__init__() # store the hyperparameters self.idim = idim self.use_residual = use_residual diff --git a/paddlespeech/t2s/modules/transformer/decoder.py b/paddlespeech/t2s/modules/transformer/decoder.py index 072fc813..fe2949f4 100644 --- a/paddlespeech/t2s/modules/transformer/decoder.py +++ b/paddlespeech/t2s/modules/transformer/decoder.py @@ -67,11 +67,11 @@ class Decoder(nn.Layer): Dropout rate in self-attention. src_attention_dropout_rate : float Dropout rate in source-attention. - input_layer : (Union[str, paddle.nn.Layer]) + input_layer : (Union[str, nn.Layer]) Input layer type. use_output_layer : bool Whether to use output layer. - pos_enc_class : paddle.nn.Layer + pos_enc_class : nn.Layer Positional encoding module class. `PositionalEncoding `or `ScaledPositionalEncoding` normalize_before : bool @@ -122,8 +122,7 @@ class Decoder(nn.Layer): input_layer, pos_enc_class(attention_dim, positional_dropout_rate)) else: - raise NotImplementedError( - "only `embed` or paddle.nn.Layer is supported.") + raise NotImplementedError("only `embed` or nn.Layer is supported.") self.normalize_before = normalize_before # self-attention module definition diff --git a/paddlespeech/t2s/modules/transformer/decoder_layer.py b/paddlespeech/t2s/modules/transformer/decoder_layer.py index 0310d83e..44978f1e 100644 --- a/paddlespeech/t2s/modules/transformer/decoder_layer.py +++ b/paddlespeech/t2s/modules/transformer/decoder_layer.py @@ -26,13 +26,13 @@ class DecoderLayer(nn.Layer): ---------- size : int Input dimension. - self_attn : paddle.nn.Layer + self_attn : nn.Layer Self-attention module instance. `MultiHeadedAttention` instance can be used as the argument. - src_attn : paddle.nn.Layer + src_attn : nn.Layer Self-attention module instance. `MultiHeadedAttention` instance can be used as the argument. - feed_forward : paddle.nn.Layer + feed_forward : nn.Layer Feed-forward module instance. `PositionwiseFeedForward`, `MultiLayeredConv1d`, or `Conv1dLinear` instance can be used as the argument. 
dropout_rate : float diff --git a/paddlespeech/t2s/modules/transformer/embedding.py b/paddlespeech/t2s/modules/transformer/embedding.py index 3c3f3616..40ab03ee 100644 --- a/paddlespeech/t2s/modules/transformer/embedding.py +++ b/paddlespeech/t2s/modules/transformer/embedding.py @@ -43,7 +43,7 @@ class PositionalEncoding(nn.Layer): dtype="float32", reverse=False): """Construct an PositionalEncoding object.""" - super(PositionalEncoding, self).__init__() + super().__init__() self.d_model = d_model self.reverse = reverse self.xscale = math.sqrt(self.d_model) @@ -117,7 +117,7 @@ class ScaledPositionalEncoding(PositionalEncoding): self.alpha = paddle.create_parameter( shape=x.shape, dtype=self.dtype, - default_initializer=paddle.nn.initializer.Assign(x)) + default_initializer=nn.initializer.Assign(x)) def reset_parameters(self): """Reset parameters.""" @@ -141,7 +141,7 @@ class ScaledPositionalEncoding(PositionalEncoding): return self.dropout(x) -class RelPositionalEncoding(paddle.nn.Layer): +class RelPositionalEncoding(nn.Layer): """Relative positional encoding module (new implementation). Details can be found in https://github.com/espnet/espnet/pull/2816. See : Appendix B in https://arxiv.org/abs/1901.02860 @@ -157,10 +157,10 @@ class RelPositionalEncoding(paddle.nn.Layer): def __init__(self, d_model, dropout_rate, max_len=5000, dtype="float32"): """Construct an PositionalEncoding object.""" - super(RelPositionalEncoding, self).__init__() + super().__init__() self.d_model = d_model self.xscale = math.sqrt(self.d_model) - self.dropout = paddle.nn.Dropout(p=dropout_rate) + self.dropout = nn.Dropout(p=dropout_rate) self.pe = None self.dtype = dtype self.extend_pe(paddle.expand(paddle.zeros([1]), (1, max_len))) diff --git a/paddlespeech/t2s/modules/transformer/encoder.py b/paddlespeech/t2s/modules/transformer/encoder.py index 2fdf02cf..b422f01d 100644 --- a/paddlespeech/t2s/modules/transformer/encoder.py +++ b/paddlespeech/t2s/modules/transformer/encoder.py @@ -17,10 +17,10 @@ from typing import Union from paddle import nn +from paddlespeech.t2s.modules.activation import get_activation from paddlespeech.t2s.modules.conformer.convolution import ConvolutionModule from paddlespeech.t2s.modules.conformer.encoder_layer import EncoderLayer as ConformerEncoderLayer from paddlespeech.t2s.modules.layer_norm import LayerNorm -from paddlespeech.t2s.modules.nets_utils import get_activation from paddlespeech.t2s.modules.transformer.attention import MultiHeadedAttention from paddlespeech.t2s.modules.transformer.attention import RelPositionMultiHeadedAttention from paddlespeech.t2s.modules.transformer.embedding import PositionalEncoding @@ -34,8 +34,8 @@ from paddlespeech.t2s.modules.transformer.repeat import repeat from paddlespeech.t2s.modules.transformer.subsampling import Conv2dSubsampling -class Encoder(nn.Layer): - """Transformer encoder module. +class BaseEncoder(nn.Layer): + """Base Encoder module. Parameters ---------- @@ -55,7 +55,7 @@ class Encoder(nn.Layer): Dropout rate after adding positional encoding. attention_dropout_rate : float Dropout rate in attention. - input_layer : Union[str, paddle.nn.Layer] + input_layer : Union[str, nn.Layer] Input layer type. normalize_before : bool Whether to use layer_norm before the first block. 
@@ -120,7 +120,7 @@ class Encoder(nn.Layer): stochastic_depth_rate: float=0.0, intermediate_layers: Union[List[int], None]=None, encoder_type: str="transformer"): - """Construct an Encoder object.""" + """Construct an Bae Encoder object.""" super().__init__() activation = get_activation(activation_type) pos_enc_class = self.get_pos_enc_class(pos_enc_layer_type, @@ -264,7 +264,6 @@ class Encoder(nn.Layer): nn.Dropout(dropout_rate), nn.ReLU(), pos_enc_class(attention_dim, positional_dropout_rate), ) - elif input_layer == "conv2d": embed = Conv2dSubsampling( idim, @@ -305,46 +304,118 @@ class Encoder(nn.Layer): paddle.Tensor Mask tensor (#batch, 1, time). """ - if self.encoder_type == "transformer": - xs = self.embed(xs) - xs, masks = self.encoders(xs, masks) - if self.normalize_before: - xs = self.after_norm(xs) - return xs, masks - elif self.encoder_type == "conformer": - if isinstance(self.embed, (Conv2dSubsampling)): - xs, masks = self.embed(xs, masks) - else: - xs = self.embed(xs) - - if self.intermediate_layers is None: - xs, masks = self.encoders(xs, masks) - else: - intermediate_outputs = [] - for layer_idx, encoder_layer in enumerate(self.encoders): - xs, masks = encoder_layer(xs, masks) - - if (self.intermediate_layers is not None and - layer_idx + 1 in self.intermediate_layers): - # intermediate branches also require normalization. - encoder_output = xs - if isinstance(encoder_output, tuple): - encoder_output = encoder_output[0] - if self.normalize_before: - encoder_output = self.after_norm(encoder_output) - intermediate_outputs.append(encoder_output) - - if isinstance(xs, tuple): - xs = xs[0] - - if self.normalize_before: - xs = self.after_norm(xs) - - if self.intermediate_layers is not None: - return xs, masks, intermediate_outputs - return xs, masks - else: - raise ValueError(f"{self.encoder_type} is not supported.") + xs = self.embed(xs) + xs, masks = self.encoders(xs, masks) + if self.normalize_before: + xs = self.after_norm(xs) + return xs, masks + + +class TransformerEncoder(BaseEncoder): + """Transformer encoder module. + Parameters + ---------- + idim : int + Input dimension. + attention_dim : int + Dimention of attention. + attention_heads : int + The number of heads of multi head attention. + linear_units : int + The number of units of position-wise feed forward. + num_blocks : int + The number of decoder blocks. + dropout_rate : float + Dropout rate. + positional_dropout_rate : float + Dropout rate after adding positional encoding. + attention_dropout_rate : float + Dropout rate in attention. + input_layer : Union[str, paddle.nn.Layer] + Input layer type. + pos_enc_layer_type : str + Encoder positional encoding layer type. + normalize_before : bool + Whether to use layer_norm before the first block. + concat_after : bool + Whether to concat attention layer's input and output. + if True, additional linear will be applied. + i.e. x -> x + linear(concat(x, att(x))) + if False, no additional linear will be applied. i.e. x -> x + att(x) + positionwise_layer_type : str + "linear", "conv1d", or "conv1d-linear". + positionwise_conv_kernel_size : int + Kernel size of positionwise conv1d layer. + selfattention_layer_type : str + Encoder attention layer type. + activation_type : str + Encoder activation function type. + padding_idx : int + Padding idx for input_layer=embed. 
+ """ + + def __init__( + self, + idim, + attention_dim: int=256, + attention_heads: int=4, + linear_units: int=2048, + num_blocks: int=6, + dropout_rate: float=0.1, + positional_dropout_rate: float=0.1, + attention_dropout_rate: float=0.0, + input_layer: str="conv2d", + pos_enc_layer_type: str="abs_pos", + normalize_before: bool=True, + concat_after: bool=False, + positionwise_layer_type: str="linear", + positionwise_conv_kernel_size: int=1, + selfattention_layer_type: str="selfattn", + activation_type: str="relu", + padding_idx: int=-1, ): + """Construct an Transformer Encoder object.""" + super().__init__( + idim, + attention_dim=attention_dim, + attention_heads=attention_heads, + linear_units=linear_units, + num_blocks=num_blocks, + dropout_rate=dropout_rate, + positional_dropout_rate=positional_dropout_rate, + attention_dropout_rate=attention_dropout_rate, + input_layer=input_layer, + pos_enc_layer_type=pos_enc_layer_type, + normalize_before=normalize_before, + concat_after=concat_after, + positionwise_layer_type=positionwise_layer_type, + positionwise_conv_kernel_size=positionwise_conv_kernel_size, + selfattention_layer_type=selfattention_layer_type, + activation_type=activation_type, + padding_idx=padding_idx, + encoder_type="transformer") + + def forward(self, xs, masks): + """Encode input sequence. + + Parameters + ---------- + xs : paddle.Tensor + Input tensor (#batch, time, idim). + masks : paddle.Tensor + Mask tensor (#batch, 1, time). + + Returns + ---------- + paddle.Tensor + Output tensor (#batch, time, attention_dim). + paddle.Tensor + Mask tensor (#batch, 1, time). + """ + xs = self.embed(xs) + xs, masks = self.encoders(xs, masks) + if self.normalize_before: + xs = self.after_norm(xs) + return xs, masks def forward_one_step(self, xs, masks, cache=None): """Encode input frame. @@ -378,3 +449,161 @@ class Encoder(nn.Layer): if self.normalize_before: xs = self.after_norm(xs) return xs, masks, new_cache + + +class ConformerEncoder(BaseEncoder): + """Conformer encoder module. + Parameters + ---------- + idim : int + Input dimension. + attention_dim : int + Dimention of attention. + attention_heads : int + The number of heads of multi head attention. + linear_units : int + The number of units of position-wise feed forward. + num_blocks : int + The number of decoder blocks. + dropout_rate : float + Dropout rate. + positional_dropout_rate : float + Dropout rate after adding positional encoding. + attention_dropout_rate : float + Dropout rate in attention. + input_layer : Union[str, nn.Layer] + Input layer type. + normalize_before : bool + Whether to use layer_norm before the first block. + concat_after : bool + Whether to concat attention layer's input and output. + if True, additional linear will be applied. + i.e. x -> x + linear(concat(x, att(x))) + if False, no additional linear will be applied. i.e. x -> x + att(x) + positionwise_layer_type : str + "linear", "conv1d", or "conv1d-linear". + positionwise_conv_kernel_size : int + Kernel size of positionwise conv1d layer. + macaron_style : bool + Whether to use macaron style for positionwise layer. + pos_enc_layer_type : str + Encoder positional encoding layer type. + selfattention_layer_type : str + Encoder attention layer type. + activation_type : str + Encoder activation function type. + use_cnn_module : bool + Whether to use convolution module. + zero_triu : bool + Whether to zero the upper triangular part of attention matrix. + cnn_module_kernel : int + Kernerl size of convolution module. 
+    padding_idx : int
+        Padding idx for input_layer=embed.
+    stochastic_depth_rate : float
+        Maximum probability to skip the encoder layer.
+    intermediate_layers : Union[List[int], None]
+        indices of intermediate CTC layer.
+        indices start from 1.
+        if not None, intermediate outputs are returned (which changes return type
+        signature.)
+    """
+
+    def __init__(
+            self,
+            idim: int,
+            attention_dim: int=256,
+            attention_heads: int=4,
+            linear_units: int=2048,
+            num_blocks: int=6,
+            dropout_rate: float=0.1,
+            positional_dropout_rate: float=0.1,
+            attention_dropout_rate: float=0.0,
+            input_layer: str="conv2d",
+            normalize_before: bool=True,
+            concat_after: bool=False,
+            positionwise_layer_type: str="linear",
+            positionwise_conv_kernel_size: int=1,
+            macaron_style: bool=False,
+            pos_enc_layer_type: str="rel_pos",
+            selfattention_layer_type: str="rel_selfattn",
+            activation_type: str="swish",
+            use_cnn_module: bool=False,
+            zero_triu: bool=False,
+            cnn_module_kernel: int=31,
+            padding_idx: int=-1,
+            stochastic_depth_rate: float=0.0,
+            intermediate_layers: Union[List[int], None]=None, ):
+        """Construct a Conformer Encoder object."""
+        super().__init__(
+            idim=idim,
+            attention_dim=attention_dim,
+            attention_heads=attention_heads,
+            linear_units=linear_units,
+            num_blocks=num_blocks,
+            dropout_rate=dropout_rate,
+            positional_dropout_rate=positional_dropout_rate,
+            attention_dropout_rate=attention_dropout_rate,
+            input_layer=input_layer,
+            normalize_before=normalize_before,
+            concat_after=concat_after,
+            positionwise_layer_type=positionwise_layer_type,
+            positionwise_conv_kernel_size=positionwise_conv_kernel_size,
+            macaron_style=macaron_style,
+            pos_enc_layer_type=pos_enc_layer_type,
+            selfattention_layer_type=selfattention_layer_type,
+            activation_type=activation_type,
+            use_cnn_module=use_cnn_module,
+            zero_triu=zero_triu,
+            cnn_module_kernel=cnn_module_kernel,
+            padding_idx=padding_idx,
+            stochastic_depth_rate=stochastic_depth_rate,
+            intermediate_layers=intermediate_layers,
+            encoder_type="conformer")
+
+    def forward(self, xs, masks):
+        """Encode input sequence.
+        Parameters
+        ----------
+        xs : paddle.Tensor
+            Input tensor (#batch, time, idim).
+        masks : paddle.Tensor
+            Mask tensor (#batch, 1, time).
+        Returns
+        ----------
+        paddle.Tensor
+            Output tensor (#batch, time, attention_dim).
+        paddle.Tensor
+            Mask tensor (#batch, 1, time).
+        """
+        if isinstance(self.embed, (Conv2dSubsampling)):
+            xs, masks = self.embed(xs, masks)
+        else:
+            xs = self.embed(xs)
+
+        if self.intermediate_layers is None:
+            xs, masks = self.encoders(xs, masks)
+        else:
+            intermediate_outputs = []
+            for layer_idx, encoder_layer in enumerate(self.encoders):
+                xs, masks = encoder_layer(xs, masks)
+
+                if (self.intermediate_layers is not None and
+                        layer_idx + 1 in self.intermediate_layers):
+                    # intermediate branches also require normalization.
+ encoder_output = xs + if isinstance(encoder_output, tuple): + encoder_output = encoder_output[0] + if self.normalize_before: + encoder_output = self.after_norm(encoder_output) + intermediate_outputs.append(encoder_output) + + if isinstance(xs, tuple): + xs = xs[0] + + if self.normalize_before: + xs = self.after_norm(xs) + + if self.intermediate_layers is not None: + return xs, masks, intermediate_outputs + return xs, masks diff --git a/paddlespeech/t2s/modules/transformer/encoder_layer.py b/paddlespeech/t2s/modules/transformer/encoder_layer.py index fb2c2e82..f55ded3d 100644 --- a/paddlespeech/t2s/modules/transformer/encoder_layer.py +++ b/paddlespeech/t2s/modules/transformer/encoder_layer.py @@ -24,10 +24,10 @@ class EncoderLayer(nn.Layer): ---------- size : int Input dimension. - self_attn : paddle.nn.Layer + self_attn : nn.Layer Self-attention module instance. `MultiHeadedAttention` instance can be used as the argument. - feed_forward : paddle.nn.Layer + feed_forward : nn.Layer Feed-forward module instance. `PositionwiseFeedForward`, `MultiLayeredConv1d`, or `Conv1dLinear` instance can be used as the argument. dropout_rate : float @@ -50,7 +50,7 @@ class EncoderLayer(nn.Layer): normalize_before=True, concat_after=False, ): """Construct an EncoderLayer object.""" - super(EncoderLayer, self).__init__() + super().__init__() self.self_attn = self_attn self.feed_forward = feed_forward self.norm1 = nn.LayerNorm(size) diff --git a/paddlespeech/t2s/modules/transformer/lightconv.py b/paddlespeech/t2s/modules/transformer/lightconv.py index 1aeb6d6e..ccf84c8a 100644 --- a/paddlespeech/t2s/modules/transformer/lightconv.py +++ b/paddlespeech/t2s/modules/transformer/lightconv.py @@ -18,7 +18,7 @@ import paddle import paddle.nn.functional as F from paddle import nn -from paddlespeech.t2s.modules.glu import GLU +from paddlespeech.t2s.modules.activation import get_activation from paddlespeech.t2s.modules.masked_fill import masked_fill MIN_VALUE = float(numpy.finfo(numpy.float32).min) @@ -56,7 +56,7 @@ class LightweightConvolution(nn.Layer): use_kernel_mask=False, use_bias=False, ): """Construct Lightweight Convolution layer.""" - super(LightweightConvolution, self).__init__() + super().__init__() assert n_feat % wshare == 0 self.wshare = wshare @@ -68,7 +68,7 @@ class LightweightConvolution(nn.Layer): # linear -> GLU -> lightconv -> linear self.linear1 = nn.Linear(n_feat, n_feat * 2) self.linear2 = nn.Linear(n_feat, n_feat) - self.act = GLU() + self.act = get_activation("glu") # lightconv related self.uniform_ = nn.initializer.Uniform() diff --git a/paddlespeech/t2s/modules/transformer/multi_layer_conv.py b/paddlespeech/t2s/modules/transformer/multi_layer_conv.py index 8845b2a2..df8929e3 100644 --- a/paddlespeech/t2s/modules/transformer/multi_layer_conv.py +++ b/paddlespeech/t2s/modules/transformer/multi_layer_conv.py @@ -12,10 +12,10 @@ # See the License for the specific language governing permissions and # limitations under the License. """Layer modules for FFT block in FastSpeech (Feed-forward Transformer).""" -import paddle +from paddle import nn -class MultiLayeredConv1d(paddle.nn.Layer): +class MultiLayeredConv1d(nn.Layer): """Multi-layered conv1d for Transformer block. This is a module of multi-leyered conv1d designed @@ -43,21 +43,21 @@ class MultiLayeredConv1d(paddle.nn.Layer): Dropout rate. 
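+
+        Examples
+        ----------
+        A rough usage sketch; the argument values below are illustrative
+        only and are not taken from this patch:
+
+        >>> import paddle
+        >>> conv = MultiLayeredConv1d(
+        ...     in_chans=256, hidden_chans=1024, kernel_size=3, dropout_rate=0.1)
+        >>> x = paddle.randn([2, 50, 256])  # (batch, time, in_chans)
+        >>> y = conv(x)                     # same shape: (2, 50, 256)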
""" - super(MultiLayeredConv1d, self).__init__() - self.w_1 = paddle.nn.Conv1D( + super().__init__() + self.w_1 = nn.Conv1D( in_chans, hidden_chans, kernel_size, stride=1, padding=(kernel_size - 1) // 2, ) - self.w_2 = paddle.nn.Conv1D( + self.w_2 = nn.Conv1D( hidden_chans, in_chans, kernel_size, stride=1, padding=(kernel_size - 1) // 2, ) - self.dropout = paddle.nn.Dropout(dropout_rate) - self.relu = paddle.nn.ReLU() + self.dropout = nn.Dropout(dropout_rate) + self.relu = nn.ReLU() def forward(self, x): """Calculate forward propagation. @@ -77,7 +77,7 @@ class MultiLayeredConv1d(paddle.nn.Layer): [0, 2, 1]) -class Conv1dLinear(paddle.nn.Layer): +class Conv1dLinear(nn.Layer): """Conv1D + Linear for Transformer block. A variant of MultiLayeredConv1d, which replaces second conv-layer to linear. @@ -98,16 +98,16 @@ class Conv1dLinear(paddle.nn.Layer): dropout_rate : float Dropout rate. """ - super(Conv1dLinear, self).__init__() - self.w_1 = paddle.nn.Conv1D( + super().__init__() + self.w_1 = nn.Conv1D( in_chans, hidden_chans, kernel_size, stride=1, padding=(kernel_size - 1) // 2, ) - self.w_2 = paddle.nn.Linear(hidden_chans, in_chans, bias_attr=True) - self.dropout = paddle.nn.Dropout(dropout_rate) - self.relu = paddle.nn.ReLU() + self.w_2 = nn.Linear(hidden_chans, in_chans, bias_attr=True) + self.dropout = nn.Dropout(dropout_rate) + self.relu = nn.ReLU() def forward(self, x): """Calculate forward propagation. diff --git a/paddlespeech/t2s/modules/transformer/positionwise_feed_forward.py b/paddlespeech/t2s/modules/transformer/positionwise_feed_forward.py index 297a3b4f..28ed1c31 100644 --- a/paddlespeech/t2s/modules/transformer/positionwise_feed_forward.py +++ b/paddlespeech/t2s/modules/transformer/positionwise_feed_forward.py @@ -14,9 +14,10 @@ # Modified from espnet(https://github.com/espnet/espnet) """Positionwise feed forward layer definition.""" import paddle +from paddle import nn -class PositionwiseFeedForward(paddle.nn.Layer): +class PositionwiseFeedForward(nn.Layer): """Positionwise feed forward layer. Parameters @@ -35,7 +36,7 @@ class PositionwiseFeedForward(paddle.nn.Layer): dropout_rate, activation=paddle.nn.ReLU()): """Construct an PositionwiseFeedForward object.""" - super(PositionwiseFeedForward, self).__init__() + super().__init__() self.w_1 = paddle.nn.Linear(idim, hidden_units, bias_attr=True) self.w_2 = paddle.nn.Linear(hidden_units, idim, bias_attr=True) self.dropout = paddle.nn.Dropout(dropout_rate) diff --git a/paddlespeech/t2s/modules/transformer/subsampling.py b/paddlespeech/t2s/modules/transformer/subsampling.py index e1bd75bb..cf0fca8a 100644 --- a/paddlespeech/t2s/modules/transformer/subsampling.py +++ b/paddlespeech/t2s/modules/transformer/subsampling.py @@ -14,11 +14,12 @@ # Modified from espnet(https://github.com/espnet/espnet) """Subsampling layer definition.""" import paddle +from paddle import nn from paddlespeech.t2s.modules.transformer.embedding import PositionalEncoding -class Conv2dSubsampling(paddle.nn.Layer): +class Conv2dSubsampling(nn.Layer): """Convolutional 2D subsampling (to 1/4 length). Parameters ---------- @@ -28,20 +29,20 @@ class Conv2dSubsampling(paddle.nn.Layer): Output dimension. dropout_rate : float Dropout rate. - pos_enc : paddle.nn.Layer + pos_enc : nn.Layer Custom position encoding layer. 
""" def __init__(self, idim, odim, dropout_rate, pos_enc=None): """Construct an Conv2dSubsampling object.""" - super(Conv2dSubsampling, self).__init__() - self.conv = paddle.nn.Sequential( - paddle.nn.Conv2D(1, odim, 3, 2), - paddle.nn.ReLU(), - paddle.nn.Conv2D(odim, odim, 3, 2), - paddle.nn.ReLU(), ) - self.out = paddle.nn.Sequential( - paddle.nn.Linear(odim * (((idim - 1) // 2 - 1) // 2), odim), + super().__init__() + self.conv = nn.Sequential( + nn.Conv2D(1, odim, 3, 2), + nn.ReLU(), + nn.Conv2D(odim, odim, 3, 2), + nn.ReLU(), ) + self.out = nn.Sequential( + nn.Linear(odim * (((idim - 1) // 2 - 1) // 2), odim), pos_enc if pos_enc is not None else PositionalEncoding(odim, dropout_rate), ) diff --git a/paddlespeech/t2s/training/optimizer.py b/paddlespeech/t2s/training/optimizer.py index c6a6944d..907e3daf 100644 --- a/paddlespeech/t2s/training/optimizer.py +++ b/paddlespeech/t2s/training/optimizer.py @@ -12,6 +12,7 @@ # See the License for the specific language governing permissions and # limitations under the License. import paddle +from paddle import nn optim_classes = dict( adadelta=paddle.optimizer.Adadelta, @@ -25,7 +26,7 @@ optim_classes = dict( sgd=paddle.optimizer.SGD, ) -def build_optimizers(model: paddle.nn.Layer, +def build_optimizers(model: nn.Layer, optim='adadelta', max_grad_norm=None, learning_rate=0.01) -> paddle.optimizer: diff --git a/tests/unit/tts/test_stft.py b/tests/unit/tts/test_stft.py index d2d56dca..624226e9 100644 --- a/tests/unit/tts/test_stft.py +++ b/tests/unit/tts/test_stft.py @@ -11,52 +11,11 @@ # WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. # See the License for the specific language governing permissions and # limitations under the License. -import librosa -import numpy as np import paddle import torch from parallel_wavegan.losses import stft_loss as sl -from scipy import signal -from paddlespeech.t2s.modules.stft_loss import MultiResolutionSTFTLoss -from paddlespeech.t2s.modules.stft_loss import STFT - - -def test_stft(): - stft = STFT(n_fft=1024, hop_length=256, win_length=1024) - x = paddle.uniform([4, 46080]) - S = stft.magnitude(x) - window = signal.get_window('hann', 1024, fftbins=True) - D2 = torch.stft( - torch.as_tensor(x.numpy()), - n_fft=1024, - hop_length=256, - win_length=1024, - window=torch.as_tensor(window)) - S2 = (D2**2).sum(-1).sqrt() - S3 = np.abs( - librosa.stft(x.numpy()[0], n_fft=1024, hop_length=256, win_length=1024)) - print(S2.shape) - print(S.numpy()[0]) - print(S2.data.cpu().numpy()[0]) - print(S3) - - -def test_torch_stft(): - # NOTE: torch.stft use no window by default - x = np.random.uniform(-1.0, 1.0, size=(46080, )) - window = signal.get_window('hann', 1024, fftbins=True) - D2 = torch.stft( - torch.as_tensor(x), - n_fft=1024, - hop_length=256, - win_length=1024, - window=torch.as_tensor(window)) - D3 = librosa.stft( - x, n_fft=1024, hop_length=256, win_length=1024, window='hann') - print(D2[:, :, 0].data.cpu().numpy()[:, 30:60]) - print(D3.real[:, 30:60]) - # print(D3.imag[:, 30:60]) +from paddlespeech.t2s.modules.losses import MultiResolutionSTFTLoss def test_multi_resolution_stft_loss():