refactor encoder, rm old code

pull/1015/head
TianYuan 3 years ago
parent bc0dd51149
commit 469329221b

@ -5,20 +5,6 @@ Parakeet aims to provide a flexible, efficient and state-of-the-art text-to-spee
<img src="../../images/logo.png" width=300 /> <br>
</div>
## News <img src="../../images/news_icon.png" width="40"/>
- Oct-12-2021, Refactor examples code.
- Oct-12-2021, Parallel WaveGAN with LJSpeech. Check [examples/GANVocoder/parallelwave_gan/ljspeech](./examples/GANVocoder/parallelwave_gan/ljspeech).
- Oct-12-2021, FastSpeech2/FastPitch with LJSpeech. Check [examples/fastspeech2/ljspeech](./examples/fastspeech2/ljspeech).
- Sep-14-2021, Reconstruction of TransformerTTS. Check [examples/transformer_tts/ljspeech](./examples/transformer_tts/ljspeech).
- Aug-31-2021, Chinese Text Frontend. Check [examples/text_frontend](./examples/text_frontend).
- Aug-23-2021, FastSpeech2/FastPitch with AISHELL-3. Check [examples/fastspeech2/aishell3](./examples/fastspeech2/aishell3).
- Aug-03-2021, FastSpeech2/FastPitch with CSMSC. Check [examples/fastspeech2/baker](./examples/fastspeech2/baker).
- Jul-19-2021, SpeedySpeech with CSMSC. Check [examples/speedyspeech/baker](./examples/speedyspeech/baker).
- Jul-01-2021, Parallel WaveGAN with CSMSC. Check [examples/GANVocoder/parallelwave_gan/baker](./examples/GANVocoder/parallelwave_gan/baker).
- Jul-01-2021, Montreal-Forced-Aligner. Check [examples/use_mfa](./examples/use_mfa).
- May-07-2021, Voice Cloning in Chinese. Check [examples/tacotron2_aishell3](./examples/tacotron2_aishell3).
## Overview
In order to make it easy to use the existing TTS models directly and to develop new ones, Parakeet selects typical models and provides their reference implementations in PaddlePaddle. Furthermore, Parakeet abstracts the TTS pipeline and standardizes the procedures of data preprocessing, common module sharing, model configuration, and training and synthesis. The models supported here include Text Frontend, end-to-end Acoustic models and Vocoders:
@ -38,50 +24,11 @@ In order to facilitate exploiting the existing TTS models directly and developin
- [Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis](https://arxiv.org/pdf/1806.04558v4.pdf)
- [【GE2E】Generalized End-to-End Loss for Speaker Verification](https://arxiv.org/abs/1710.10467)
## Setup
Some dependencies of this repo are difficult to install on Windows, so we recommend that you **DO NOT** use Windows; please use `Linux`.
Make sure the library `libsndfile1` is installed, e.g., on Ubuntu.
```bash
sudo apt-get install libsndfile1
```
### Install PaddlePaddle
See [install](https://www.paddlepaddle.org.cn/install/quick) for more details. This repo requires PaddlePaddle **2.1.2** or above.
### Install Parakeet
```bash
git clone https://github.com/PaddlePaddle/Parakeet
cd Parakeet
pip install -e .
```
If some Python dependencies cannot be installed successfully, you can run the following command first
(replace `python3.6` with your own Python version).
```bash
sudo apt install -y python3.6-dev
```
See [install](https://paddle-parakeet.readthedocs.io/en/latest/install.html) for more details.
## Examples
Entries to the introduction, and to launching training and synthesis, for the different example models:
- [>>> Chinese Text Frontend](./examples/text_frontend)
- [>>> FastSpeech2/FastPitch](./examples/fastspeech2)
- [>>> Montreal-Forced-Aligner](./examples/use_mfa)
- [>>> Parallel WaveGAN](./examples/GANVocoder/parallelwave_gan)
- [>>> SpeedySpeech](./examples/speedyspeech)
- [>>> Tacotron2_AISHELL3](./examples/tacotron2_aishell3)
- [>>> GE2E](./examples/ge2e)
- [>>> WaveFlow](./examples/waveflow)
- [>>> TransformerTTS](./examples/transformer_tts)
- [>>> Tacotron2](./examples/tacotron2)
## Audio samples
### TTS models (Acoustic Model + Neural Vocoder)
Check our [website](https://paddleparakeet.readthedocs.io/en/latest/demo.html) for audio samples.
Check our [website](https://paddlespeech.readthedocs.io/en/latest/tts/demo.html) for audio samples.
## Released Model

@ -17,7 +17,7 @@ tar zxvf data_aishell3.tgz -C data_aishell3
```
### Get MFA result of AISHELL-3 and Extract it
We use [MFA2.x](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get durations for aishell3_fastspeech2.
You can download it from here: [aishell3_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/AISHELL-3/with_tone/aishell3_alignment_tone.tar.gz), or train your own MFA model by referring to the [use_mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/use_mfa) (which uses MFA1.x for now) of our repo.
You can download it from here: [aishell3_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/AISHELL-3/with_tone/aishell3_alignment_tone.tar.gz), or train your own MFA model by referring to the [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) (which uses MFA1.x for now) of our repo.
## Get Started
Assume the path to the dataset is `~/datasets/data_aishell3`.

@ -41,7 +41,8 @@ We use Montreal Forced Aligner 1.0. The labels in aishell3 include pinyin, so th
We use [lexicon.txt](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/paddlespeech/t2s/exps/voice_cloning/tacotron2_ge2e/lexicon.txt) as the lexicon.
You can download the alignment results from here [alignment_aishell3.tar.gz](https://paddlespeech.bj.bcebos.com/Parakeet/alignment_aishell3.tar.gz), or train your own MFA model by referring to the [use_mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/use_mfa) (which uses MFA1.x for now) of our repo.
You can download the alignment results from here [alignment_aishell3.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/AISHELL-3/alignment_aishell3.tar.gz), or train your own MFA model by referring to the [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) (which uses MFA1.x for now) of our repo.
```bash
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then

@ -1,89 +1,138 @@
# FastSpeech2 + AISHELL-3 Voice Cloning
This example contains code used to train a [Tacotron2](https://arxiv.org/abs/1712.05884) model with [AISHELL-3](http://www.aishelltech.com/aishell_3). The trained model can be used in the voice cloning task. We refer to the model structure of [Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis](https://arxiv.org/pdf/1806.04558.pdf). The general steps are as follows:
1. Speaker Encoder: We use speaker verification to train a speaker encoder. The datasets used in this task are different from those used in Tacotron2: because transcriptions are not needed, we can use more datasets; refer to [ge2e](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/ge2e).
2. Synthesizer: Then, we use the trained speaker encoder to generate an utterance embedding for each sentence in AISHELL-3. This embedding is an extra input of Tacotron2 that is concatenated with the encoder outputs.
3. Vocoder: We use WaveFlow as the neural vocoder; refer to [waveflow](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/ljspeech/voc0).
This example contains code used to train a [FastSpeech2](https://arxiv.org/abs/2006.04558) model with [AISHELL-3](http://www.aishelltech.com/aishell_3). The trained model can be used in the voice cloning task. We refer to the model structure of [Transfer Learning from Speaker Verification to Multispeaker Text-To-Speech Synthesis](https://arxiv.org/pdf/1806.04558.pdf). The general steps are as follows:
1. Speaker Encoder: We use speaker verification to train a speaker encoder. The datasets used in this task are different from those used in `FastSpeech2`: because transcriptions are not needed, we can use more datasets; refer to [ge2e](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/ge2e).
2. Synthesizer: We use the trained speaker encoder to generate a speaker embedding for each sentence in AISHELL-3. This embedding is an extra input of `FastSpeech2` that is concatenated with the encoder outputs.
3. Vocoder: We use [Parallel WaveGAN](http://arxiv.org/abs/1910.11480) as the neural vocoder; refer to [voc1](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/voc1).
## Dataset
### Download and Extract
Download AISHELL-3.
```bash
wget https://www.openslr.org/resources/93/data_aishell3.tgz
```
Extract AISHELL-3.
```bash
mkdir data_aishell3
tar zxvf data_aishell3.tgz -C data_aishell3
```
### Get MFA Result and Extract
We use [MFA2.x](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get durations for aishell3_fastspeech2.
You can download it from here: [aishell3_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/AISHELL-3/with_tone/aishell3_alignment_tone.tar.gz), or train your own MFA model by referring to the [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) (which uses MFA1.x for now) of our repo.
## Pretrained GE2E Model
We use a pretrained GE2E model to generate a speaker embedding for each sentence.
Download pretrained GE2E model from here [ge2e_ckpt_0.3.zip](https://bj.bcebos.com/paddlespeech/Parakeet/released_models/ge2e/ge2e_ckpt_0.3.zip), and `unzip` it.
## Get Started
Assume the path to the dataset is `~/datasets/data_aishell3`.
Assume the path to the MFA result of AISHELL-3 is `./alignment`.
Assume the path to the pretrained ge2e model is `ge2e_ckpt_path=./ge2e_ckpt_0.3/step-3000000`
Assume the path to the MFA result of AISHELL-3 is `./aishell3_alignment_tone`.
Assume the path to the pretrained ge2e model is `./ge2e_ckpt_0.3`.
Run the command below to
1. **source path**.
2. preprocess the dataset,
2. preprocess the dataset.
3. train the model.
4. start a voice cloning inference.
4. synthesize waveform from `metadata.jsonl`.
5. start a voice cloning inference.
```bash
./run.sh
```
### Preprocess the dataset
You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage; for example, running the following command will only preprocess the dataset.
```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/preprocess.sh ${input} ${preprocess_path} ${alignment} ${ge2e_ckpt_path}
./run.sh --stage 0 --stop-stage 0
```
#### generate utterance embedding
Use the pretrained GE2E (speaker encoder) to generate an utterance embedding for each sentence in AISHELL-3. The embeddings have the same file structure as the wav files, and the format is `.npy`.
### Data Preprocessing
```bash
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
python3 ${BIN_DIR}/../ge2e/inference.py \
--input=${input} \
--output=${preprocess_path}/embed \
--ngpu=1 \
--checkpoint_path=${ge2e_ckpt_path}
fi
CUDA_VISIBLE_DEVICES=${gpus} ./local/preprocess.sh ${conf_path} ${ge2e_ckpt_path}
```
When it is done, a `dump` folder is created in the current directory. The structure of the dump folder is listed below.
```text
dump
├── dev
│ ├── norm
│ └── raw
├── embed
│ ├── SSB0005
│ ├── SSB0009
│ ├── ...
│ └── ...
├── phone_id_map.txt
├── speaker_id_map.txt
├── test
│ ├── norm
│ └── raw
└── train
├── energy_stats.npy
├── norm
├── pitch_stats.npy
├── raw
└── speech_stats.npy
```
The `embed` folder contains the generated speaker embeddings for each sentence in AISHELL-3, with the same file structure as the wav files; the format is `.npy`.
Computing the utterance embeddings can take x hours.
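For example, a single embedding can be inspected with NumPy (the file path below is illustrative):
```python
import numpy as np

# embeddings mirror the wav layout, e.g. dump/embed/<speaker>/<utt>.npy
embed = np.load("dump/embed/SSB0005/SSB00050001.npy")
print(embed.shape, embed.dtype)  # a fixed-size float vector per utterance
```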
#### process wav
There is silence at the edges of AISHELL-3's wavs, and the audio amplitude is very small, so we need to remove the silence and normalize the audio. You could use a silence-removal method based on volume or energy, but the effect is not very good, so we use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get the alignment of text and speech, then utilize the alignment results to remove the silence.
We use Montreal Forced Aligner 1.0. The labels in aishell3 include pinyin, so the lexicon we provide to MFA is pinyin rather than Chinese characters, and the prosody marks (`$` and `%`) need to be removed. You should preprocess the dataset into the format MFA needs: the texts have the same names as the wavs and the suffix `.lab`.
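As a rough illustration of the idea (a sketch, not the repo's actual script; it assumes the first and last non-silence timestamps have already been read from the MFA alignment):
```python
import soundfile as sf

def trim_with_alignment(wav_path, out_path, start_sec, end_sec, margin=0.05):
    """Keep only [start_sec - margin, end_sec + margin] of the audio; the two
    timestamps come from the MFA alignment (first and last aligned phone)."""
    wav, sr = sf.read(wav_path)
    start = max(int((start_sec - margin) * sr), 0)
    end = min(int((end_sec + margin) * sr), len(wav))
    sf.write(out_path, wav[start:end], sr)
```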
The dataset is split into 3 parts, namely `train`, `dev` and `test`, each of which contains a `norm` and a `raw` subfolder. The raw folder contains the speech, pitch and energy features of each utterance, while the norm folder contains the normalized ones. The statistics used to normalize features are computed from the training set, which is located in `dump/train/*_stats.npy`.
We use [lexicon.txt](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/paddlespeech/t2s/exps/voice_cloning/tacotron2_ge2e/lexicon.txt) as the lexicon.
Also, there is a `metadata.jsonl` in each subfolder. It is a table-like file that contains the phones, text_lengths, speech_lengths, durations, path of speech features, path of pitch features, path of energy features, speaker, and id of each utterance.
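Since `metadata.jsonl` stores one JSON object per line, it can be inspected with a short script (a sketch; the exact field names are the ones listed above):
```python
import json

with open("dump/train/norm/metadata.jsonl") as f:
    records = [json.loads(line) for line in f]

# e.g. print the first record's keys (phones, durations, feature paths, ...)
print(records[0].keys())
```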
You can download the alignment results from here [alignment_aishell3.tar.gz](https://paddlespeech.bj.bcebos.com/Parakeet/alignment_aishell3.tar.gz), or train your own MFA model by referring to the [use_mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/use_mfa) (which uses MFA1.x for now) of our repo.
The preprocessing step is very similar to that one of [tts3](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/tts3), but there is one more `ge2e/inference` step here.
### Model Training
`./local/train.sh` calls `${BIN_DIR}/train.py`.
```bash
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
echo "Process wav ..."
python3 ${BIN_DIR}/process_wav.py \
--input=${input}/wav \
--output=${preprocess_path}/normalized_wav \
--alignment=${alignment}
fi
CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path}
```
The training step is very similar to that of [tts3](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/tts3), but we should set `--voice-cloning=True` when calling `${BIN_DIR}/train.py`.
#### preprocess transcription
We convert the transcription into `phones` and `tones`. It is worth noting that our processing here is different from that used for MFA: we separate the tones. This is just one processing method; of course, you can instead segment only the initials and vowels.
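As a simplified illustration of separating tones (not the repo's actual frontend), a numbered pinyin syllable can be split like this:
```python
def split_tone(syllable: str):
    """'zhong1' -> ('zhong', '1'); syllables without a trailing digit
    are given the neutral tone '5'."""
    if syllable and syllable[-1].isdigit():
        return syllable[:-1], syllable[-1]
    return syllable, "5"

print(split_tone("zhong1"))  # ('zhong', '1')
```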
### Synthesizing
We use [parallel wavegan](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/voc1) as the neural vocoder.
Download pretrained parallel wavegan model from [pwg_aishell3_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_aishell3_ckpt_0.5.zip) and unzip it.
```bash
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
python3 ${BIN_DIR}/preprocess_transcription.py \
--input=${input} \
--output=${preprocess_path}
fi
unzip pwg_aishell3_ckpt_0.5.zip
```
The default input is `~/datasets/data_aishell3/train`, which contains `label_train-set.txt`. The processed results are `metadata.yaml` and `metadata.pickle`: the former is a text format for easy viewing, and the latter is a binary format for direct reading.
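Both files hold the same records, so either can be loaded; a minimal sketch:
```python
import pickle

import yaml

# human-readable form
with open("metadata.yaml") as f:
    records = yaml.safe_load(f)

# binary form, for direct reading
with open("metadata.pickle", "rb") as f:
    records = pickle.load(f)
```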
#### extract mel
```bash
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
python3 ${BIN_DIR}/extract_mel.py \
--input=${preprocess_path}/normalized_wav \
--output=${preprocess_path}/mel
fi
The Parallel WaveGAN checkpoint contains the files listed below.
```text
pwg_aishell3_ckpt_0.5
├── default.yaml # default config used to train parallel wavegan
├── feats_stats.npy # statistics used to normalize spectrogram when training parallel wavegan
└── snapshot_iter_1000000.pdz # generator parameters of parallel wavegan
```
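A checkpoint like this can be inspected with `paddle.load` (a sketch, assuming the `.pdz` archive is a paddle-serialized dict of parameters, which is how these checkpoints are typically saved):
```python
import paddle

state = paddle.load("pwg_aishell3_ckpt_0.5/snapshot_iter_1000000.pdz")
print(list(state.keys())[:5])  # peek at the first few parameter names
```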
### Train the model
`./local/synthesize.sh` calls `${BIN_DIR}/synthesize.py`, which can synthesize waveform from `metadata.jsonl`.
```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${preprocess_path} ${train_output_path}
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name}
```
The synthesizing step is very similar to that of [tts3](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/tts3), but we should set `--voice-cloning=True` when calling `${BIN_DIR}/synthesize.py`.
Our model removes the stop token prediction of Tacotron2, because the proportion of positive and negative samples in stop token prediction is extremely unbalanced, and because it is very sensitive to the clipping of audio silence. Instead, we use the last symbol from the highest point of attention to the encoder side as the termination condition.
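Conceptually (a simplified sketch, not the exact code in this repo), decoding can stop once the attention peak reaches the last encoder symbol:
```python
import paddle

def should_stop(attention_weights: paddle.Tensor, encoder_len: int) -> bool:
    """attention_weights: (T_enc,) alignment for the current decoder step.
    Returns True once the most-attended position is the final symbol."""
    peak = int(paddle.argmax(attention_weights))
    return peak == encoder_len - 1
```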
### Voice Cloning
Assume there are some reference audios in `./ref_audio`:
```text
ref_audio
├── 001238.wav
├── LJ015-0254.wav
└── audio_self_test.mp3
```
`./local/voice_cloning.sh` calls `${BIN_DIR}/voice_cloning.py`.
In addition, in order to accelerate the convergence of the model, we add a `guided attention loss` to induce the alignment between the encoder and the decoder to show diagonal lines faster.
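For intuition, the guided attention loss (Tachibana et al., 2017) penalizes attention mass far from the diagonal via a weight matrix W[n, t] = 1 - exp(-(n/N - t/T)^2 / (2σ^2)); a minimal NumPy sketch, not the repo's exact implementation:
```python
import numpy as np

def guided_attention_mask(t_dec: int, t_enc: int, sigma: float = 0.4) -> np.ndarray:
    """W[n, t] is small near the diagonal (n/t_dec ~ t/t_enc) and grows
    towards 1 away from it; multiplying it with the attention matrix and
    taking the mean penalizes non-diagonal alignments."""
    n = np.arange(t_dec)[:, None] / t_dec
    t = np.arange(t_enc)[None, :] / t_enc
    return 1.0 - np.exp(-((n - t) ** 2) / (2.0 * sigma ** 2))

# loss ~ (guided_attention_mask(T_dec, T_enc) * attention_weights).mean()
```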
### Inference
```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/voice_cloning.sh ${ge2e_params_path} ${tacotron2_params_path} ${waveflow_params_path} ${vc_input} ${vc_output}
CUDA_VISIBLE_DEVICES=${gpus} ./local/voice_cloning.sh ${conf_path} ${train_output_path} ${ckpt_name} ${ge2e_params_path} ${ref_audio_dir}
```
## Pretrained Model
[tacotron2_aishell3_ckpt_0.3.zip](https://paddlespeech.bj.bcebos.com/Parakeet/tacotron2_aishell3_ckpt_0.3.zip).
[fastspeech2_nosil_aishell3_vc1_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/fastspeech2/fastspeech2_nosil_aishell3_vc1_ckpt_0.5.zip)
Model | Step | eval/loss | eval/l1_loss | eval/duration_loss | eval/pitch_loss| eval/energy_loss
:-------------:| :------------:| :-----: | :-----: | :--------: |:--------:|:---------:
default|2(gpu) x 96400|0.99699|0.62013|0.53057|0.11954| 0.20426|
The FastSpeech2 checkpoint contains the files listed below.
(There is no need for `speaker_id_map.txt` here.)
```text
fastspeech2_nosil_aishell3_ckpt_vc1_0.5
├── default.yaml # default config used to train fastspeech2
├── phone_id_map.txt # phone vocabulary file when training fastspeech2
├── snapshot_iter_96400.pdz # model parameters and optimizer states
└── speech_stats.npy # statistics used to normalize spectrogram when training fastspeech2
```

@ -15,7 +15,7 @@ tar zxvf data_aishell3.tgz -C data_aishell3
```
### Get MFA result of AISHELL-3 and Extract it
We use [MFA2.x](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get durations for aishell3_fastspeech2.
You can download it from here: [aishell3_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/AISHELL-3/with_tone/aishell3_alignment_tone.tar.gz), or train your own MFA model by referring to the [use_mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/use_mfa) (which uses MFA1.x for now) of our repo.
You can download it from here: [aishell3_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/AISHELL-3/with_tone/aishell3_alignment_tone.tar.gz), or train your own MFA model by referring to the [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) (which uses MFA1.x for now) of our repo.
## Get Started
Assume the path to the dataset is `~/datasets/data_aishell3`.

@ -7,7 +7,7 @@ Download CSMSC from its [Official Website](https://test.data-baker.com/data/ind
### Get MFA result of CSMSC and Extract it
We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get durations for SPEEDYSPEECH.
You can download it from here: [baker_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/BZNSYP/with_tone/baker_alignment_tone.tar.gz), or train your own MFA model by referring to the [use_mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/use_mfa) of our repo.
You can download it from here: [baker_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/BZNSYP/with_tone/baker_alignment_tone.tar.gz), or train your own MFA model by referring to the [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) of our repo.
## Get Started
Assume the path to the dataset is `~/datasets/BZNSYP`.

@ -7,7 +7,7 @@ Download CSMSC from its [Official Website](https://test.data-baker.com/data/ind
### Get MFA result of CSMSC and Extract it
We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get durations for fastspeech2.
You can download it from here: [baker_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/BZNSYP/with_tone/baker_alignment_tone.tar.gz), or train your own MFA model by referring to the [use_mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/use_mfa) of our repo.
You can download it from here: [baker_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/BZNSYP/with_tone/baker_alignment_tone.tar.gz), or train your own MFA model by referring to the [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) of our repo.
## Get Started
Assume the path to the dataset is `~/datasets/BZNSYP`.

@ -6,7 +6,7 @@ Download CSMSC from the [official website](https://www.data-baker.com/data/index
### Get MFA results for silence trim
We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) results to cut the silence at the edges of the audio.
You can download it from here: [baker_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/BZNSYP/with_tone/baker_alignment_tone.tar.gz), or train your own MFA model by referring to the [use_mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/use_mfa) of our repo.
You can download it from here: [baker_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/BZNSYP/with_tone/baker_alignment_tone.tar.gz), or train your own MFA model by referring to the [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) of our repo.
## Get Started
Assume the path to the dataset is `~/datasets/BZNSYP`.

@ -6,7 +6,7 @@ Download CSMSC from the [official website](https://www.data-baker.com/data/index
### Get MFA results for silence trim
We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) results to cut the silence at the edges of the audio.
You can download it from here: [baker_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/BZNSYP/with_tone/baker_alignment_tone.tar.gz), or train your own MFA model by referring to the [use_mfa example](https://github.com/PaddlePaddle/Parakeet/tree/develop/examples/use_mfa) of our repo.
You can download it from here: [baker_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/BZNSYP/with_tone/baker_alignment_tone.tar.gz), or train your own MFA model by referring to the [mfa example](https://github.com/PaddlePaddle/Parakeet/tree/develop/examples/mfa) of our repo.
## Get Started
Assume the path to the dataset is `~/datasets/BZNSYP`.

@ -7,7 +7,7 @@ Download LJSpeech-1.1 from the [official website](https://keithito.com/LJ-Speech
### Get MFA result of LJSpeech-1.1 and Extract it
We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get durations for fastspeech2.
You can download it from here: [ljspeech_alignment.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/LJSpeech-1.1/ljspeech_alignment.tar.gz), or train your own MFA model by referring to the [use_mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/use_mfa) of our repo.
You can download it from here: [ljspeech_alignment.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/LJSpeech-1.1/ljspeech_alignment.tar.gz), or train your own MFA model by referring to the [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) of our repo.
## Get Started
Assume the path to the dataset is `~/datasets/LJSpeech-1.1`.

@ -1,26 +1,29 @@
# Parallel WaveGAN with LJSpeech-1.1
This example contains code used to train a [parallel wavegan](http://arxiv.org/abs/1910.11480) model with [LJSpeech-1.1](https://keithito.com/LJ-Speech-Dataset/).
## Dataset
### Download and Extract the dataset
### Download and Extract
Download LJSpeech-1.1 from the [official website](https://keithito.com/LJ-Speech-Dataset/).
### Get MFA results for silence trim
We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) results to cut the silence at the edges of the audio.
You can download it from here: [ljspeech_alignment.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/LJSpeech-1.1/ljspeech_alignment.tar.gz), or train your own MFA model by referring to the [use_mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/use_mfa) of our repo.
### Get MFA Result and Extract
We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) results to cut the silence at the edges of the audio.
You can download it from here: [ljspeech_alignment.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/LJSpeech-1.1/ljspeech_alignment.tar.gz), or train your own MFA model by referring to the [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) of our repo.
## Get Started
Assume the path to the dataset is `~/datasets/LJSpeech-1.1`.
Assume the path to the MFA result of LJSpeech-1.1 is `./ljspeech_alignment`.
Run the command below to
1. **source path**.
2. preprocess the dataset,
2. preprocess the dataset.
3. train the model.
4. synthesize wavs.
- synthesize waveform from `metadata.jsonl`.
```bash
./run.sh
```
### Preprocess the dataset
You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage; for example, running the following command will only preprocess the dataset.
```bash
./run.sh --stage 0 --stop-stage 0
```
### Data Preprocessing
```bash
./local/preprocess.sh ${conf_path}
```
@ -44,7 +47,7 @@ The dataset is split into 3 parts, namely `train`, `dev` and `test`, each of whi
Also, there is a `metadata.jsonl` in each subfolder. It is a table-like file that contains the id and the path to the spectrogram of each utterance.
### Train the model
### Model Training
`./local/train.sh` calls `${BIN_DIR}/train.py`.
```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path}
@ -91,7 +94,7 @@ benchmark:
3. `--output-dir` is the directory to save the results of the experiment. Checkpoints are saved in `checkpoints/` inside this directory.
4. `--ngpu` is the number of gpus to use; if ngpu == 0, the cpu is used.
### Synthesize
### Synthesizing
`./local/synthesize.sh` calls `${BIN_DIR}/synthesize.py`, which can synthesize waveform from `metadata.jsonl`.
```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name}
@ -122,8 +125,8 @@ optional arguments:
4. `--output-dir` is the directory to save the synthesized audio files.
5. `--ngpu` is the number of gpus to use; if ngpu == 0, the cpu is used.
## Pretrained Models
Pretrained models can be downloaded here: [pwg_ljspeech_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/pwg_ljspeech_ckpt_0.5.zip)
## Pretrained Model
Pretrained models can be downloaded here: [pwg_ljspeech_ckpt_0.5.zip](https://paddlespeech.bj.bcebos.com/Parakeet/released_models/pwgan/pwg_ljspeech_ckpt_0.5.zip)
The Parallel WaveGAN checkpoint contains the files listed below.
@ -134,4 +137,4 @@ pwg_ljspeech_ckpt_0.5
└── pwg_stats.npy # statistics used to normalize spectrogram when training parallel wavegan
```
## Acknowledgement
We adapted some code from https://github.com/kan-bayashi/ParallelWaveGAN.

@ -7,8 +7,8 @@ Download VCTK-0.92 from the [official website](https://datashare.ed.ac.uk/handle
### Get MFA result of VCTK and Extract it
We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get durations for fastspeech2.
You can download it from here: [vctk_alignment.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/VCTK-Corpus-0.92/vctk_alignment.tar.gz), or train your own MFA model by referring to the [use_mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/use_mfa) of our repo.
P.S. We removed three speakers in VCTK-0.92 (see [reorganize_vctk.py](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/examples/other/use_mfa/local/reorganize_vctk.py)):
You can download it from here: [vctk_alignment.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/VCTK-Corpus-0.92/vctk_alignment.tar.gz), or train your own MFA model by referring to the [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) of our repo.
P.S. We removed three speakers in VCTK-0.92 (see [reorganize_vctk.py](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/examples/other/mfa/local/reorganize_vctk.py)):
1. `p315`, because it has no txt.
2. `p280` and `p362`, because they have no `*_mic2.flac` (which is better than `*_mic1.flac`).

@ -5,10 +5,10 @@ This example contains code used to train a [parallel wavegan](http://arxiv.org/a
### Download and Extract the dataset
Download VCTK-0.92 from the [official website](https://datashare.ed.ac.uk/handle/10283/3443) and extract it to `~/datasets`. Then the dataset is in directory `~/datasets/VCTK-Corpus-0.92`.
### Get MFA results for silence trim
### Get MFA Result and Extract
We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) results to cut silence in the edge of audio.
You can download it from here: [vctk_alignment.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/VCTK-Corpus-0.92/vctk_alignment.tar.gz), or train your own MFA model by referring to the [use_mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/use_mfa) of our repo.
P.S. We removed three speakers in VCTK-0.92 (see [reorganize_vctk.py](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/examples/other/use_mfa/local/reorganize_vctk.py)):
You can download it from here: [vctk_alignment.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/VCTK-Corpus-0.92/vctk_alignment.tar.gz), or train your own MFA model by referring to the [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) of our repo.
P.S. We removed three speakers in VCTK-0.92 (see [reorganize_vctk.py](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/examples/other/mfa/local/reorganize_vctk.py)):
1. `p315`, because it has no txt.
2. `p280` and `p362`, because they have no `*_mic2.flac` (which is better than `*_mic1.flac`).

@ -36,10 +36,10 @@ from paddlespeech.t2s.models.melgan import MBMelGANEvaluator
from paddlespeech.t2s.models.melgan import MBMelGANUpdater
from paddlespeech.t2s.models.melgan import MelGANGenerator
from paddlespeech.t2s.models.melgan import MelGANMultiScaleDiscriminator
from paddlespeech.t2s.modules.adversarial_loss import DiscriminatorAdversarialLoss
from paddlespeech.t2s.modules.adversarial_loss import GeneratorAdversarialLoss
from paddlespeech.t2s.modules.losses import DiscriminatorAdversarialLoss
from paddlespeech.t2s.modules.losses import GeneratorAdversarialLoss
from paddlespeech.t2s.modules.losses import MultiResolutionSTFTLoss
from paddlespeech.t2s.modules.pqmf import PQMF
from paddlespeech.t2s.modules.stft_loss import MultiResolutionSTFTLoss
from paddlespeech.t2s.training.extensions.snapshot import Snapshot
from paddlespeech.t2s.training.extensions.visualizer import VisualDL
from paddlespeech.t2s.training.seeding import seed_everything

@ -36,7 +36,7 @@ from paddlespeech.t2s.models.parallel_wavegan import PWGDiscriminator
from paddlespeech.t2s.models.parallel_wavegan import PWGEvaluator
from paddlespeech.t2s.models.parallel_wavegan import PWGGenerator
from paddlespeech.t2s.models.parallel_wavegan import PWGUpdater
from paddlespeech.t2s.modules.stft_loss import MultiResolutionSTFTLoss
from paddlespeech.t2s.modules.losses import MultiResolutionSTFTLoss
from paddlespeech.t2s.training.extensions.snapshot import Snapshot
from paddlespeech.t2s.training.extensions.visualizer import VisualDL
from paddlespeech.t2s.training.seeding import seed_everything

@ -12,6 +12,8 @@
# See the License for the specific language governing permissions and
# limitations under the License.
from .fastspeech2 import *
from .melgan import *
from .parallel_wavegan import *
from .tacotron2 import *
from .transformer_tts import *
from .waveflow import *

@ -32,7 +32,8 @@ from paddlespeech.t2s.modules.predictor.duration_predictor import DurationPredic
from paddlespeech.t2s.modules.predictor.length_regulator import LengthRegulator
from paddlespeech.t2s.modules.predictor.variance_predictor import VariancePredictor
from paddlespeech.t2s.modules.tacotron2.decoder import Postnet
from paddlespeech.t2s.modules.transformer.encoder import Encoder
from paddlespeech.t2s.modules.transformer.encoder import ConformerEncoder
from paddlespeech.t2s.modules.transformer.encoder import TransformerEncoder
class FastSpeech2(nn.Layer):
@ -306,12 +307,10 @@ class FastSpeech2(nn.Layer):
num_embeddings=idim,
embedding_dim=adim,
padding_idx=self.padding_idx)
# add encoder type here
# check that the model still runs
# remember to update transformer tts
if encoder_type == "transformer":
print("encoder_type is transformer")
self.encoder = Encoder(
self.encoder = TransformerEncoder(
idim=idim,
attention_dim=adim,
attention_heads=aheads,
@ -325,11 +324,10 @@ class FastSpeech2(nn.Layer):
normalize_before=encoder_normalize_before,
concat_after=encoder_concat_after,
positionwise_layer_type=positionwise_layer_type,
positionwise_conv_kernel_size=positionwise_conv_kernel_size,
encoder_type=encoder_type)
positionwise_conv_kernel_size=positionwise_conv_kernel_size, )
elif encoder_type == "conformer":
print("encoder_type is conformer")
self.encoder = Encoder(
self.encoder = ConformerEncoder(
idim=idim,
attention_dim=adim,
attention_heads=aheads,
@ -349,8 +347,7 @@ class FastSpeech2(nn.Layer):
activation_type=conformer_activation_type,
use_cnn_module=use_cnn_in_conformer,
cnn_module_kernel=conformer_enc_kernel_size,
zero_triu=zero_triu,
encoder_type=encoder_type)
zero_triu=zero_triu, )
else:
raise ValueError(f"{encoder_type} is not supported.")
@ -417,7 +414,7 @@ class FastSpeech2(nn.Layer):
# because fastspeech's decoder is the same as encoder
if decoder_type == "transformer":
print("decoder_type is transformer")
self.decoder = Encoder(
self.decoder = TransformerEncoder(
idim=0,
attention_dim=adim,
attention_heads=aheads,
@ -432,11 +429,10 @@ class FastSpeech2(nn.Layer):
normalize_before=decoder_normalize_before,
concat_after=decoder_concat_after,
positionwise_layer_type=positionwise_layer_type,
positionwise_conv_kernel_size=positionwise_conv_kernel_size,
encoder_type=decoder_type)
positionwise_conv_kernel_size=positionwise_conv_kernel_size, )
elif decoder_type == "conformer":
print("decoder_type is conformer")
self.decoder = Encoder(
self.decoder = ConformerEncoder(
idim=0,
attention_dim=adim,
attention_heads=aheads,
@ -455,8 +451,7 @@ class FastSpeech2(nn.Layer):
selfattention_layer_type=conformer_self_attn_layer_type,
activation_type=conformer_activation_type,
use_cnn_module=use_cnn_in_conformer,
cnn_module_kernel=conformer_dec_kernel_size,
encoder_type=decoder_type)
cnn_module_kernel=conformer_dec_kernel_size, )
else:
raise ValueError(f"{decoder_type} is not supported.")

@ -78,7 +78,7 @@ class MelGANGenerator(nn.Layer):
Padding function module name before dilated convolution layer.
pad_params : dict
Hyperparameters for padding function.
use_final_nonlinear_activation : paddle.nn.Layer
use_final_nonlinear_activation : nn.Layer
Activation function for the final layer.
use_weight_norm : bool
Whether to use weight norm.

@ -11,13 +11,34 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import numpy as np
import paddle
from paddle import nn
from paddlespeech.t2s.modules.expansion import expand
from paddlespeech.t2s.modules.positional_encoding import sinusoid_position_encoding
def expand(encodings: paddle.Tensor, durations: paddle.Tensor) -> paddle.Tensor:
    """
    encodings: (B, T, C)
    durations: (B, T)
    """
    batch_size, t_enc = durations.shape
    durations = durations.numpy()
    slens = np.sum(durations, -1)  # target length of each utterance
    t_dec = np.max(slens)  # longest target length in the batch
    # M is a (B, t_dec, t_enc) expansion matrix: M[i, k, j] = 1 means that
    # decoder frame k of sample i is copied from encoder frame j.
    M = np.zeros([batch_size, t_dec, t_enc])
    for i in range(batch_size):
        k = 0
        for j in range(t_enc):
            d = durations[i, j]
            M[i, k:k + d, j] = 1
            k += d
    M = paddle.to_tensor(M, dtype=encodings.dtype)
    # (B, t_dec, t_enc) @ (B, t_enc, C) -> (B, t_dec, C)
    encodings = paddle.matmul(M, encodings)
    return encodings
class ResidualBlock(nn.Layer):
def __init__(self, channels, kernel_size, dilation, n=2):
super().__init__()

@ -19,8 +19,8 @@ from paddle.fluid.layers import huber_loss
from paddle.nn import functional as F
from paddlespeech.t2s.modules.losses import masked_l1_loss
from paddlespeech.t2s.modules.losses import ssim
from paddlespeech.t2s.modules.losses import weighted_mean
from paddlespeech.t2s.modules.ssim import ssim
from paddlespeech.t2s.training.extensions.evaluator import StandardEvaluator
from paddlespeech.t2s.training.reporter import report
from paddlespeech.t2s.training.updaters.standard_updater import StandardUpdater

@ -20,7 +20,6 @@ from paddle.nn import functional as F
from paddle.nn import initializer as I
from tqdm import trange
from paddlespeech.t2s.modules.attention import LocationSensitiveAttention
from paddlespeech.t2s.modules.conv import Conv1dBatchNorm
from paddlespeech.t2s.modules.losses import guided_attention_loss
from paddlespeech.t2s.utils import checkpoint
@ -28,6 +27,99 @@ from paddlespeech.t2s.utils import checkpoint
__all__ = ["Tacotron2", "Tacotron2Loss"]
class LocationSensitiveAttention(nn.Layer):
    """Location Sensitive Attention module.
    Reference: `Attention-Based Models for Speech Recognition <https://arxiv.org/pdf/1506.07503.pdf>`_
    Parameters
    -----------
    d_query: int
        The feature size of query.
    d_key : int
        The feature size of key.
    d_attention : int
        The feature size of the attention.
    location_filters : int
        Filter size of attention convolution.
    location_kernel_size : int
        Kernel size of attention convolution.
    """

    def __init__(self,
                 d_query: int,
                 d_key: int,
                 d_attention: int,
                 location_filters: int,
                 location_kernel_size: int):
        super().__init__()

        self.query_layer = nn.Linear(d_query, d_attention, bias_attr=False)
        self.key_layer = nn.Linear(d_key, d_attention, bias_attr=False)
        self.value = nn.Linear(d_attention, 1, bias_attr=False)

        # Location Layer
        self.location_conv = nn.Conv1D(
            2,
            location_filters,
            kernel_size=location_kernel_size,
            padding=int((location_kernel_size - 1) / 2),
            bias_attr=False,
            data_format='NLC')
        self.location_layer = nn.Linear(
            location_filters, d_attention, bias_attr=False)

    def forward(self,
                query,
                processed_key,
                value,
                attention_weights_cat,
                mask=None):
        """Compute context vector and attention weights.
        Parameters
        -----------
        query : Tensor [shape=(batch_size, d_query)]
            The queries.
        processed_key : Tensor [shape=(batch_size, time_steps_k, d_attention)]
            The keys after linear layer.
        value : Tensor [shape=(batch_size, time_steps_k, d_key)]
            The values.
        attention_weights_cat : Tensor [shape=(batch_size, time_step_k, 2)]
            Attention weights concat.
        mask : Tensor, optional
            The mask. Shape should be (batch_size, times_steps_k, 1).
            Defaults to None.
        Returns
        ----------
        attention_context : Tensor [shape=(batch_size, d_attention)]
            The context vector.
        attention_weights : Tensor [shape=(batch_size, time_steps_k)]
            The attention weights.
        """
        processed_query = self.query_layer(paddle.unsqueeze(query, axis=[1]))
        processed_attention_weights = self.location_layer(
            self.location_conv(attention_weights_cat))
        # (B, T_enc, 1)
        alignment = self.value(
            paddle.tanh(processed_attention_weights + processed_key +
                        processed_query))

        if mask is not None:
            alignment = alignment + (1.0 - mask) * -1e9

        attention_weights = F.softmax(alignment, axis=1)
        attention_context = paddle.matmul(
            attention_weights, value, transpose_x=True)

        attention_weights = paddle.squeeze(attention_weights, axis=-1)
        attention_context = paddle.squeeze(attention_context, axis=1)

        return attention_context, attention_weights
class DecoderPreNet(nn.Layer):
"""Decoder prenet module for Tacotron2.
@ -197,7 +289,7 @@ class Tacotron2Encoder(nn.Layer):
super().__init__()
k = math.sqrt(1.0 / (d_hidden * kernel_size))
self.conv_batchnorms = paddle.nn.LayerList([
self.conv_batchnorms = nn.LayerList([
Conv1dBatchNorm(
d_hidden,
d_hidden,
@ -903,7 +995,7 @@ class Tacotron2Loss(nn.Layer):
self.use_stop_token_loss = use_stop_token_loss
self.use_guided_attention_loss = use_guided_attention_loss
self.attn_criterion = guided_attention_loss
self.stop_criterion = paddle.nn.BCEWithLogitsLoss()
self.stop_criterion = nn.BCEWithLogitsLoss()
self.sigma = sigma
def forward(self,

@ -34,7 +34,7 @@ from paddlespeech.t2s.modules.transformer.attention import MultiHeadedAttention
from paddlespeech.t2s.modules.transformer.decoder import Decoder
from paddlespeech.t2s.modules.transformer.embedding import PositionalEncoding
from paddlespeech.t2s.modules.transformer.embedding import ScaledPositionalEncoding
from paddlespeech.t2s.modules.transformer.encoder import Encoder
from paddlespeech.t2s.modules.transformer.encoder import TransformerEncoder
from paddlespeech.t2s.modules.transformer.mask import subsequent_mask
@ -281,7 +281,7 @@ class TransformerTTS(nn.Layer):
num_embeddings=idim,
embedding_dim=adim,
padding_idx=self.padding_idx)
self.encoder = Encoder(
self.encoder = TransformerEncoder(
idim=idim,
attention_dim=adim,
attention_heads=aheads,

@ -329,7 +329,7 @@ class ResidualNet(nn.LayerList):
if len(dilations_h) != n_layer:
raise ValueError(
"number of dilations_h should equals num of layers")
super(ResidualNet, self).__init__()
super().__init__()
for i in range(n_layer):
dilation = (dilations_h[i], 2**i)
layer = ResidualBlock(residual_channels, condition_channels,

@ -14,5 +14,4 @@
from .conv import *
from .geometry import *
from .losses import *
from .masking import *
from .positional_encoding import *

@ -11,8 +11,9 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import paddle
import paddle.nn.functional as F
from paddle import nn
from paddle.nn import functional as F
class GLU(nn.Layer):
@ -24,3 +25,18 @@ class GLU(nn.Layer):
    def forward(self, xs):
        return F.glu(xs, axis=self.dim)


def get_activation(act):
    """Return activation function."""
    activation_funcs = {
        "hardtanh": paddle.nn.Hardtanh,
        "tanh": paddle.nn.Tanh,
        "relu": paddle.nn.ReLU,
        "selu": paddle.nn.SELU,
        "swish": paddle.nn.Swish,
        "glu": GLU
    }
    return activation_funcs[act]()

@ -1,125 +0,0 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Modified from espnet(https://github.com/espnet/espnet)
"""Adversarial loss modules."""
import paddle
import paddle.nn.functional as F
from paddle import nn
class GeneratorAdversarialLoss(nn.Layer):
"""Generator adversarial loss module."""
def __init__(
self,
average_by_discriminators=True,
loss_type="mse", ):
"""Initialize GeneratorAversarialLoss module."""
super().__init__()
self.average_by_discriminators = average_by_discriminators
assert loss_type in ["mse", "hinge"], f"{loss_type} is not supported."
if loss_type == "mse":
self.criterion = self._mse_loss
else:
self.criterion = self._hinge_loss
def forward(self, outputs):
"""Calcualate generator adversarial loss.
Parameters
----------
outputs: Tensor or List
Discriminator outputs or list of discriminator outputs.
Returns
----------
Tensor
Generator adversarial loss value.
"""
if isinstance(outputs, (tuple, list)):
adv_loss = 0.0
for i, outputs_ in enumerate(outputs):
if isinstance(outputs_, (tuple, list)):
# case including feature maps
outputs_ = outputs_[-1]
adv_loss += self.criterion(outputs_)
if self.average_by_discriminators:
adv_loss /= i + 1
else:
adv_loss = self.criterion(outputs)
return adv_loss
def _mse_loss(self, x):
return F.mse_loss(x, paddle.ones_like(x))
def _hinge_loss(self, x):
return -x.mean()
class DiscriminatorAdversarialLoss(nn.Layer):
"""Discriminator adversarial loss module."""
def __init__(
self,
average_by_discriminators=True,
loss_type="mse", ):
"""Initialize DiscriminatorAversarialLoss module."""
super().__init__()
self.average_by_discriminators = average_by_discriminators
assert loss_type in ["mse"], f"{loss_type} is not supported."
if loss_type == "mse":
self.fake_criterion = self._mse_fake_loss
self.real_criterion = self._mse_real_loss
def forward(self, outputs_hat, outputs):
"""Calcualate discriminator adversarial loss.
Parameters
----------
outputs_hat : Tensor or list
Discriminator outputs or list of
discriminator outputs calculated from generator outputs.
outputs : Tensor or list
Discriminator outputs or list of
discriminator outputs calculated from groundtruth.
Returns
----------
Tensor
Discriminator real loss value.
Tensor
Discriminator fake loss value.
"""
if isinstance(outputs, (tuple, list)):
real_loss = 0.0
fake_loss = 0.0
for i, (outputs_hat_,
outputs_) in enumerate(zip(outputs_hat, outputs)):
if isinstance(outputs_hat_, (tuple, list)):
# case including feature maps
outputs_hat_ = outputs_hat_[-1]
outputs_ = outputs_[-1]
real_loss += self.real_criterion(outputs_)
fake_loss += self.fake_criterion(outputs_hat_)
if self.average_by_discriminators:
fake_loss /= i + 1
real_loss /= i + 1
else:
real_loss = self.real_criterion(outputs)
fake_loss = self.fake_criterion(outputs_hat)
return real_loss, fake_loss
def _mse_real_loss(self, x):
return F.mse_loss(x, paddle.ones_like(x))
def _mse_fake_loss(self, x):
return F.mse_loss(x, paddle.zeros_like(x))

@ -1,348 +0,0 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import math
import numpy as np
import paddle
from paddle import nn
from paddle.nn import functional as F
def scaled_dot_product_attention(q, k, v, mask=None, dropout=0.0,
training=True):
r"""Scaled dot product attention with masking.
Assume that q, k, v all have the same leading dimensions (denoted as * in
descriptions below). Dropout is applied to attention weights before
weighted sum of values.
Parameters
-----------
q : Tensor [shape=(\*, T_q, d)]
the query tensor.
k : Tensor [shape=(\*, T_k, d)]
the key tensor.
v : Tensor [shape=(\*, T_k, d_v)]
the value tensor.
mask : Tensor, [shape=(\*, T_q, T_k) or broadcastable shape], optional
the mask tensor, zeros correspond to paddings. Defaults to None.
Returns
----------
out : Tensor [shape=(\*, T_q, d_v)]
the context vector.
attn_weights : Tensor [shape=(\*, T_q, T_k)]
the attention weights.
"""
d = q.shape[-1] # we only support imperative execution
qk = paddle.matmul(q, k, transpose_y=True)
scaled_logit = paddle.scale(qk, 1.0 / math.sqrt(d))
if mask is not None:
scaled_logit += paddle.scale((1.0 - mask), -1e9) # hard coded here
attn_weights = F.softmax(scaled_logit, axis=-1)
attn_weights = F.dropout(attn_weights, dropout, training=training)
out = paddle.matmul(attn_weights, v)
return out, attn_weights
def drop_head(x, drop_n_heads, training=True):
"""Drop n context vectors from multiple ones.
Parameters
----------
x : Tensor [shape=(batch_size, num_heads, time_steps, channels)]
The input, multiple context vectors.
drop_n_heads : int [0<= drop_n_heads <= num_heads]
Number of vectors to drop.
training : bool
A flag indicating whether it is in training. If `False`, no dropout is
applied.
Returns
-------
Tensor
The output.
"""
if not training or (drop_n_heads == 0):
return x
batch_size, num_heads, _, _ = x.shape
# drop all heads
if num_heads == drop_n_heads:
return paddle.zeros_like(x)
mask = np.ones([batch_size, num_heads])
mask[:, :drop_n_heads] = 0
for subarray in mask:
np.random.shuffle(subarray)
scale = float(num_heads) / (num_heads - drop_n_heads)
mask = scale * np.reshape(mask, [batch_size, num_heads, 1, 1])
out = x * paddle.to_tensor(mask)
return out
def _split_heads(x, num_heads):
batch_size, time_steps, _ = x.shape
x = paddle.reshape(x, [batch_size, time_steps, num_heads, -1])
x = paddle.transpose(x, [0, 2, 1, 3])
return x
def _concat_heads(x):
batch_size, _, time_steps, _ = x.shape
x = paddle.transpose(x, [0, 2, 1, 3])
x = paddle.reshape(x, [batch_size, time_steps, -1])
return x
# Standard implementations of Monohead Attention & Multihead Attention
class MonoheadAttention(nn.Layer):
"""Monohead Attention module.
Parameters
----------
model_dim : int
Feature size of the query.
dropout : float, optional
Dropout probability of scaled dot product attention and final context
vector. Defaults to 0.0.
k_dim : int, optional
Feature size of the key of the scaled dot product attention. If not
provided, it is set to `model_dim`. Defaults to None.
v_dim : int, optional
Feature size of the value of the scaled dot product attention. If not
provided, it is set to `model_dim`. Defaults to None.
"""
def __init__(self,
model_dim: int,
dropout: float=0.0,
k_dim: int=None,
v_dim: int=None):
super(MonoheadAttention, self).__init__()
k_dim = k_dim or model_dim
v_dim = v_dim or model_dim
self.affine_q = nn.Linear(model_dim, k_dim)
self.affine_k = nn.Linear(model_dim, k_dim)
self.affine_v = nn.Linear(model_dim, v_dim)
self.affine_o = nn.Linear(v_dim, model_dim)
self.model_dim = model_dim
self.dropout = dropout
def forward(self, q, k, v, mask):
"""Compute context vector and attention weights.
Parameters
-----------
q : Tensor [shape=(batch_size, time_steps_q, model_dim)]
The queries.
k : Tensor [shape=(batch_size, time_steps_k, model_dim)]
The keys.
v : Tensor [shape=(batch_size, time_steps_k, model_dim)]
The values.
mask : Tensor [shape=(batch_size, time_steps_q, time_steps_k)] or broadcastable shape
The mask.
Returns
----------
out : Tensor [shape=(batch_size, time_steps_q, model_dim)]
The context vector.
attention_weights : Tensor [shape=(batch_size, times_steps_q, time_steps_k)]
The attention weights.
"""
q = self.affine_q(q) # (B, T, C)
k = self.affine_k(k)
v = self.affine_v(v)
context_vectors, attention_weights = scaled_dot_product_attention(
q, k, v, mask, self.dropout, self.training)
out = self.affine_o(context_vectors)
return out, attention_weights
class MultiheadAttention(nn.Layer):
"""Multihead Attention module.
Parameters
-----------
model_dim: int
The feature size of query.
num_heads : int
The number of attention heads.
dropout : float, optional
Dropout probability of scaled dot product attention and final context
vector. Defaults to 0.0.
k_dim : int, optional
Feature size of the key of each scaled dot product attention. If not
provided, it is set to ``model_dim / num_heads``. Defaults to None.
v_dim : int, optional
Feature size of the value of each scaled dot product attention. If not
provided, it is set to ``model_dim / num_heads``. Defaults to None.
Raises
---------
ValueError
If ``model_dim`` is not divisible by ``num_heads``.
"""
def __init__(self,
model_dim: int,
num_heads: int,
dropout: float=0.0,
k_dim: int=None,
v_dim: int=None):
super(MultiheadAttention, self).__init__()
if model_dim % num_heads != 0:
raise ValueError("model_dim must be divisible by num_heads")
depth = model_dim // num_heads
k_dim = k_dim or depth
v_dim = v_dim or depth
self.affine_q = nn.Linear(model_dim, num_heads * k_dim)
self.affine_k = nn.Linear(model_dim, num_heads * k_dim)
self.affine_v = nn.Linear(model_dim, num_heads * v_dim)
self.affine_o = nn.Linear(num_heads * v_dim, model_dim)
self.num_heads = num_heads
self.model_dim = model_dim
self.dropout = dropout
def forward(self, q, k, v, mask):
"""Compute context vector and attention weights.
Parameters
-----------
q : Tensor [shape=(batch_size, time_steps_q, model_dim)]
The queries.
k : Tensor [shape=(batch_size, time_steps_k, model_dim)]
The keys.
v : Tensor [shape=(batch_size, time_steps_k, model_dim)]
The values.
mask : Tensor [shape=(batch_size, time_steps_q, time_steps_k)] or broadcastable shape
The mask.
Returns
----------
out : Tensor [shape=(batch_size, time_steps_q, model_dim)]
The context vector.
attention_weights : Tensor [shape=(batch_size, times_steps_q, time_steps_k)]
The attention weights.
"""
q = _split_heads(self.affine_q(q), self.num_heads) # (B, h, T, C)
k = _split_heads(self.affine_k(k), self.num_heads)
v = _split_heads(self.affine_v(v), self.num_heads)
mask = paddle.unsqueeze(mask, 1) # unsqueeze for the h dim
context_vectors, attention_weights = scaled_dot_product_attention(
q, k, v, mask, self.dropout, self.training)
# NOTE: there is more sophisticated implementation: Scheduled DropHead
context_vectors = _concat_heads(context_vectors) # (B, T, h*C)
out = self.affine_o(context_vectors)
return out, attention_weights
class LocationSensitiveAttention(nn.Layer):
"""Location Sensitive Attention module.
Reference: `Attention-Based Models for Speech Recognition <https://arxiv.org/pdf/1506.07503.pdf>`_
Parameters
-----------
d_query: int
The feature size of query.
d_key : int
The feature size of key.
d_attention : int
The feature size of dimension.
location_filters : int
Filter size of attention convolution.
location_kernel_size : int
Kernel size of attention convolution.
"""
def __init__(self,
d_query: int,
d_key: int,
d_attention: int,
location_filters: int,
location_kernel_size: int):
super().__init__()
self.query_layer = nn.Linear(d_query, d_attention, bias_attr=False)
self.key_layer = nn.Linear(d_key, d_attention, bias_attr=False)
self.value = nn.Linear(d_attention, 1, bias_attr=False)
# Location Layer
self.location_conv = nn.Conv1D(
2,
location_filters,
kernel_size=location_kernel_size,
padding=int((location_kernel_size - 1) / 2),
bias_attr=False,
data_format='NLC')
self.location_layer = nn.Linear(
location_filters, d_attention, bias_attr=False)
def forward(self,
query,
processed_key,
value,
attention_weights_cat,
mask=None):
"""Compute context vector and attention weights.
Parameters
-----------
query : Tensor [shape=(batch_size, d_query)]
The queries.
processed_key : Tensor [shape=(batch_size, time_steps_k, d_attention)]
The keys after linear layer.
value : Tensor [shape=(batch_size, time_steps_k, d_key)]
The values.
attention_weights_cat : Tensor [shape=(batch_size, time_step_k, 2)]
Attention weights concat.
mask : Tensor, optional
The mask. Shape should be (batch_size, times_steps_k, 1).
Defaults to None.
Returns
----------
attention_context : Tensor [shape=(batch_size, d_attention)]
The context vector.
attention_weights : Tensor [shape=(batch_size, time_steps_k)]
The attention weights.
"""
processed_query = self.query_layer(paddle.unsqueeze(query, axis=[1]))
processed_attention_weights = self.location_layer(
self.location_conv(attention_weights_cat))
# (B, T_enc, 1)
alignment = self.value(
paddle.tanh(processed_attention_weights + processed_key +
processed_query))
if mask is not None:
alignment = alignment + (1.0 - mask) * -1e9
attention_weights = F.softmax(alignment, axis=1)
attention_context = paddle.matmul(
attention_weights, value, transpose_x=True)
attention_weights = paddle.squeeze(attention_weights, axis=-1)
attention_context = paddle.squeeze(attention_context, axis=1)
return attention_context, attention_weights

@ -1,229 +0,0 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import librosa
import numpy as np
import paddle
from librosa.util import pad_center
from paddle import nn
from paddle.nn import functional as F
from scipy import signal
__all__ = ["quantize", "dequantize", "STFT", "MelScale"]
def quantize(values, n_bands):
"""Linearlly quantize a float Tensor in [-1, 1) to an interger Tensor in
[0, n_bands).
Parameters
-----------
values : Tensor [dtype: flaot32 or float64]
The floating point value.
n_bands : int
The number of bands. The output integer Tensor's value is in the range
[0, n_bands).
Returns
----------
Tensor [dtype: int64]
The quantized tensor.
"""
quantized = paddle.cast((values + 1.0) / 2.0 * n_bands, "int64")
return quantized
def dequantize(quantized, n_bands, dtype=None):
"""Linearlly dequantize an integer Tensor into a float Tensor in the range
[-1, 1).
Parameters
-----------
quantized : Tensor [dtype: int]
The quantized value in the range [0, n_bands).
n_bands : int
Number of bands. The input integer Tensor's value is in the range
[0, n_bands).
dtype : str, optional
Data type of the output.
Returns
-----------
Tensor
The dequantized tensor, dtype is specified by `dtype`. If `dtype` is
not specified, the default float data type is used.
"""
dtype = dtype or paddle.get_default_dtype()
value = (paddle.cast(quantized, dtype) + 0.5) * (2.0 / n_bands) - 1.0
return value
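A round-trip sketch of `quantize`/`dequantize` (the 256-band choice is illustrative, e.g. 8-bit waveform targets). Dequantized values sit at band centers, so the reconstruction error is bounded by half a band width.

```python
import paddle

wav = paddle.uniform([2, 16000], min=-1.0, max=0.999)  # toy waveform in [-1, 1)
q = quantize(wav, n_bands=256)        # int64 ids in [0, 256)
wav_hat = dequantize(q, n_bands=256)  # floats in [-1, 1), one value per band
assert float(paddle.abs(wav - wav_hat).max()) <= 1.0 / 256
```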
class STFT(nn.Layer):
"""A module for computing stft transformation in a differentiable way.
Parameters
------------
n_fft : int
Number of samples in a frame.
hop_length : int
Number of samples shifted between adjacent frames.
win_length : int
Length of the window.
window : str, optional
Name of window function, see `scipy.signal.get_window` for more
details. Defaults to "hanning".
center : bool
If True, the signal y is padded so that frame D[:, t] is centered
at y[t * hop_length]. If False, then D[:, t] begins at y[t * hop_length].
Defaults to True.
pad_mode : string or function
If center=True, this argument is passed to np.pad for padding the edges
of the signal y. By default (pad_mode="reflect"), y is padded on both
sides with its own reflection, mirrored around its first and last
sample respectively. If center=False, this argument is ignored.
Notes
-----------
It behaves like ``librosa.core.stft``. See ``librosa.core.stft`` for more
details.
Given an audio signal with ``T`` samples, the STFT outputs a spectrum of
shape (C, frames) with complex dtype, where ``C = 1 + n_fft // 2`` and
``frames = 1 + T // hop_length``.
Only ``center`` mode with ``reflect`` padding is supported for now.
"""
def __init__(self,
n_fft,
hop_length=None,
win_length=None,
window="hanning",
center=True,
pad_mode="reflect"):
super().__init__()
# By default, use the entire frame
if win_length is None:
win_length = n_fft
# Set the default hop, if it's not already specified
if hop_length is None:
hop_length = int(win_length // 4)
self.hop_length = hop_length
self.n_bin = 1 + n_fft // 2
self.n_fft = n_fft
self.center = center
self.pad_mode = pad_mode
# calculate window
window = signal.get_window(window, win_length, fftbins=True)
# pad window to n_fft size
if n_fft != win_length:
window = pad_center(window, n_fft, mode="constant")
# lpad = (n_fft - win_length) // 2
# rpad = n_fft - win_length - lpad
# window = np.pad(window, ((lpad, pad), ), 'constant')
# calculate weights
# r = np.arange(0, n_fft)
# M = np.expand_dims(r, -1) * np.expand_dims(r, 0)
# w_real = np.reshape(window *
# np.cos(2 * np.pi * M / n_fft)[:self.n_bin],
# (self.n_bin, 1, self.n_fft))
# w_imag = np.reshape(window *
# np.sin(-2 * np.pi * M / n_fft)[:self.n_bin],
# (self.n_bin, 1, self.n_fft))
weight = np.fft.fft(np.eye(n_fft))[:self.n_bin]
w_real = weight.real
w_imag = weight.imag
w = np.concatenate([w_real, w_imag], axis=0)
w = w * window
w = np.expand_dims(w, 1)
weight = paddle.cast(paddle.to_tensor(w), paddle.get_default_dtype())
self.register_buffer("weight", weight)
def forward(self, x):
"""Compute the stft transform.
Parameters
------------
x : Tensor [shape=(B, T)]
The input waveform.
Returns
------------
real : Tensor [shape=(B, C, frames)]
The real part of the spectrogram.
imag : Tensor [shape=(B, C, frames)]
The imaginary part of the spectrogram.
"""
x = paddle.unsqueeze(x, axis=1)
if self.center:
x = F.pad(
x, [self.n_fft // 2, self.n_fft // 2],
data_format='NCL',
mode=self.pad_mode)
# to BCT, C=1
out = F.conv1d(x, self.weight, stride=self.hop_length)
real, imag = paddle.chunk(out, 2, axis=1) # BCT
return real, imag
def power(self, x):
"""Compute the power spectrum.
Parameters
------------
x : Tensor [shape=(B, T)]
The input waveform.
Returns
------------
Tensor [shape=(B, C, T)]
The power spectrum.
"""
real, imag = self.forward(x)
power = real**2 + imag**2
return power
def magnitude(self, x):
"""Compute the magnitude of the spectrum.
Parameters
------------
x : Tensor [shape=(B, T)]
The input waveform.
Returns
------------
Tensor [shape=(B, C, T)]
The magnitude of the spectrum.
"""
power = self.power(x)
magnitude = paddle.sqrt(power) # TODO(chenfeiyu): maybe clipping
return magnitude
class MelScale(nn.Layer):
def __init__(self, sr, n_fft, n_mels, fmin, fmax):
super().__init__()
mel_basis = librosa.filters.mel(sr, n_fft, n_mels, fmin, fmax)
# self.weight = paddle.to_tensor(mel_basis)
weight = paddle.to_tensor(mel_basis, dtype=paddle.get_default_dtype())
self.register_buffer("weight", weight)
def forward(self, spec):
# (n_mels, n_freq) * (batch_size, n_freq, n_frames)
mel = paddle.matmul(self.weight, spec)
return mel
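Chaining the two modules gives a differentiable log-mel pipeline; the sample rate and filter-bank settings below are illustrative assumptions, not project defaults.

```python
import paddle

stft = STFT(n_fft=1024, hop_length=256, win_length=1024)
mel_scale = MelScale(sr=22050, n_fft=1024, n_mels=80, fmin=0, fmax=8000)

wav = paddle.uniform([4, 22050], min=-1.0, max=1.0)  # (B, T)
mag = stft.magnitude(wav)          # (B, 1 + n_fft // 2, frames)
mel = mel_scale(mag)               # (B, n_mels, frames)
logmel = paddle.log(paddle.clip(mel, min=1e-5))
```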

@ -13,9 +13,10 @@
# limitations under the License.
"""Causal convolusion layer modules."""
import paddle
from paddle import nn
class CausalConv1D(paddle.nn.Layer):
class CausalConv1D(nn.Layer):
"""CausalConv1D module with customized initialization."""
def __init__(
@ -31,7 +32,7 @@ class CausalConv1D(paddle.nn.Layer):
super().__init__()
self.pad = getattr(paddle.nn, pad)((kernel_size - 1) * dilation,
**pad_params)
self.conv = paddle.nn.Conv1D(
self.conv = nn.Conv1D(
in_channels,
out_channels,
kernel_size,
@ -52,7 +53,7 @@ class CausalConv1D(paddle.nn.Layer):
return self.conv(self.pad(x))[:, :, :x.shape[2]]
class CausalConv1DTranspose(paddle.nn.Layer):
class CausalConv1DTranspose(nn.Layer):
"""CausalConv1DTranspose module with customized initialization."""
def __init__(self,
@ -63,7 +64,7 @@ class CausalConv1DTranspose(paddle.nn.Layer):
bias=True):
"""Initialize CausalConvTranspose1d module."""
super().__init__()
self.deconv = paddle.nn.Conv1DTranspose(
self.deconv = nn.Conv1DTranspose(
in_channels, out_channels, kernel_size, stride, bias_attr=bias)
self.stride = stride

@ -72,8 +72,10 @@ class ConvolutionModule(nn.Layer):
x = x.transpose([0, 2, 1])
# GLU mechanism
x = self.pointwise_conv1(x) # (batch, 2*channel, dim)
x = nn.functional.glu(x, axis=1) # (batch, channel, dim)
# (batch, 2*channel, time)
x = self.pointwise_conv1(x)
# (batch, channel, time)
x = nn.functional.glu(x, axis=1)
# 1D Depthwise Conv
x = self.depthwise_conv(x)

@ -25,19 +25,19 @@ class EncoderLayer(nn.Layer):
----------
size : int
Input dimension.
self_attn : paddle.nn.Layer
self_attn : nn.Layer
Self-attention module instance.
`MultiHeadedAttention` or `RelPositionMultiHeadedAttention` instance
can be used as the argument.
feed_forward : paddle.nn.Layer
feed_forward : nn.Layer
Feed-forward module instance.
`PositionwiseFeedForward`, `MultiLayeredConv1d`, or `Conv1dLinear` instance
can be used as the argument.
feed_forward_macaron : paddle.nn.Layer
feed_forward_macaron : nn.Layer
Additional feed-forward module instance.
`PositionwiseFeedForward`, `MultiLayeredConv1d`, or `Conv1dLinear` instance
can be used as the argument.
conv_module : paddle.nn.Layer
conv_module : nn.Layer
Convolution module instance.
`ConvolutionModule` instance can be used as the argument.
dropout_rate : float
@ -67,7 +67,7 @@ class EncoderLayer(nn.Layer):
concat_after=False,
stochastic_depth_rate=0.0, ):
"""Construct an EncoderLayer object."""
super(EncoderLayer, self).__init__()
super().__init__()
self.self_attn = self_attn
self.feed_forward = feed_forward
self.feed_forward_macaron = feed_forward_macaron

@ -84,7 +84,7 @@ class Conv1dCell(nn.Conv1D):
_kernel_size = kernel_size[0] if isinstance(kernel_size, (
tuple, list)) else kernel_size
self._r = 1 + (_kernel_size - 1) * _dilation
super(Conv1dCell, self).__init__(
super().__init__(
in_channels,
out_channels,
kernel_size,
@ -226,7 +226,7 @@ class Conv1dBatchNorm(nn.Layer):
data_format="NCL",
momentum=0.9,
epsilon=1e-05):
super(Conv1dBatchNorm, self).__init__()
super().__init__()
self.conv = nn.Conv1D(
in_channels,
out_channels,

@ -1,37 +0,0 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import numpy as np
import paddle
from paddle import Tensor
def expand(encodings: Tensor, durations: Tensor) -> Tensor:
"""
encodings: (B, T, C)
durations: (B, T)
"""
batch_size, t_enc = durations.shape
durations = durations.numpy()
slens = np.sum(durations, -1)
t_dec = np.max(slens)
M = np.zeros([batch_size, t_dec, t_enc])
for i in range(batch_size):
k = 0
for j in range(t_enc):
d = durations[i, j]
M[i, k:k + d, j] = 1
k += d
M = paddle.to_tensor(M, dtype=encodings.dtype)
encodings = paddle.matmul(M, encodings)
return encodings
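A toy example of `expand`, regulating 3 token-level encodings into 6 frames (durations are per-token frame counts):

```python
import paddle

encodings = paddle.randn([1, 3, 8])                       # (B, T_enc, C)
durations = paddle.to_tensor([[2, 1, 3]], dtype="int64")  # frames per token
frames = expand(encodings, durations)                     # (1, 6, 8)
# frames[0, 0:2] copy token 0, frames[0, 2:3] token 1, frames[0, 3:6] token 2
```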

@ -13,9 +13,10 @@
# limitations under the License.
"""Layer normalization module."""
import paddle
from paddle import nn
class LayerNorm(paddle.nn.LayerNorm):
class LayerNorm(nn.LayerNorm):
"""Layer normalization module.
Parameters
@ -28,7 +29,7 @@ class LayerNorm(paddle.nn.LayerNorm):
def __init__(self, nout, dim=-1):
"""Construct an LayerNorm object."""
super(LayerNorm, self).__init__(nout)
super().__init__(nout)
self.dim = dim
def forward(self, x):

@ -11,18 +11,16 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import math
import paddle
from paddle import nn
from paddle.fluid.layers import sequence_mask
from paddle.nn import functional as F
__all__ = [
"guided_attention_loss",
"weighted_mean",
"masked_l1_loss",
"masked_softmax_with_cross_entropy",
]
from scipy import signal
# Loss for Tacotron2
def attention_guide(dec_lens, enc_lens, N, T, g, dtype=None):
"""Build that W matrix. shape(B, T_dec, T_enc)
W[i, n, t] = 1 - exp(-(n/dec_lens[i] - t/enc_lens[i])**2 / (2g**2))
@ -57,6 +55,367 @@ def guided_attention_loss(attention_weight, dec_lens, enc_lens, g):
return loss
# Losses for GAN Vocoder
def stft(x,
fft_size,
hop_length=None,
win_length=None,
window='hann',
center=True,
pad_mode='reflect'):
"""Perform STFT and convert to magnitude spectrogram.
Parameters
----------
x : Tensor
Input signal tensor (B, T).
fft_size : int
FFT size.
hop_length : int
Number of samples between adjacent frames (hop size).
win_length : int
Window length.
window : str, optional
Name of window function, see `scipy.signal.get_window` for more
details. Defaults to "hann".
center : bool, optional
Whether to pad `x` so that the :math:`t`-th frame is centered at
:math:`t \times hop\_length`. Default: `True`.
pad_mode : str, optional
Choose padding pattern when `center` is `True`.
Returns
----------
Tensor:
Magnitude spectrogram (B, #frames, fft_size // 2 + 1).
"""
# calculate window
window = signal.get_window(window, win_length, fftbins=True)
window = paddle.to_tensor(window)
x_stft = paddle.signal.stft(
x,
fft_size,
hop_length,
win_length,
window=window,
center=center,
pad_mode=pad_mode)
real = x_stft.real()
imag = x_stft.imag()
return paddle.sqrt(paddle.clip(real**2 + imag**2, min=1e-7)).transpose(
[0, 2, 1])
class SpectralConvergenceLoss(nn.Layer):
"""Spectral convergence loss module."""
def __init__(self):
"""Initilize spectral convergence loss module."""
super().__init__()
def forward(self, x_mag, y_mag):
"""Calculate forward propagation.
Parameters
----------
x_mag : Tensor
Magnitude spectrogram of predicted signal (B, #frames, #freq_bins).
y_mag : Tensor
Magnitude spectrogram of groundtruth signal (B, #frames, #freq_bins).
Returns
----------
Tensor
Spectral convergence loss value.
"""
return paddle.norm(
y_mag - x_mag, p="fro") / paddle.clip(
paddle.norm(y_mag, p="fro"), min=1e-10)
class LogSTFTMagnitudeLoss(nn.Layer):
"""Log STFT magnitude loss module."""
def __init__(self, epsilon=1e-7):
"""Initilize los STFT magnitude loss module."""
super().__init__()
self.epsilon = epsilon
def forward(self, x_mag, y_mag):
"""Calculate forward propagation.
Parameters
----------
x_mag : Tensor
Magnitude spectrogram of predicted signal (B, #frames, #freq_bins).
y_mag : Tensor
Magnitude spectrogram of groundtruth signal (B, #frames, #freq_bins).
Returns
----------
Tensor
Log STFT magnitude loss value.
"""
return F.l1_loss(
paddle.log(paddle.clip(y_mag, min=self.epsilon)),
paddle.log(paddle.clip(x_mag, min=self.epsilon)))
class STFTLoss(nn.Layer):
"""STFT loss module."""
def __init__(self,
fft_size=1024,
shift_size=120,
win_length=600,
window="hann"):
"""Initialize STFT loss module."""
super().__init__()
self.fft_size = fft_size
self.shift_size = shift_size
self.win_length = win_length
self.window = window
self.spectral_convergence_loss = SpectralConvergenceLoss()
self.log_stft_magnitude_loss = LogSTFTMagnitudeLoss()
def forward(self, x, y):
"""Calculate forward propagation.
Parameters
----------
x : Tensor
Predicted signal (B, T).
y : Tensor
Groundtruth signal (B, T).
Returns
----------
Tensor
Spectral convergence loss value.
Tensor
Log STFT magnitude loss value.
"""
x_mag = stft(x, self.fft_size, self.shift_size, self.win_length,
self.window)
y_mag = stft(y, self.fft_size, self.shift_size, self.win_length,
self.window)
sc_loss = self.spectral_convergence_loss(x_mag, y_mag)
mag_loss = self.log_stft_magnitude_loss(x_mag, y_mag)
return sc_loss, mag_loss
class MultiResolutionSTFTLoss(nn.Layer):
"""Multi resolution STFT loss module."""
def __init__(
self,
fft_sizes=[1024, 2048, 512],
hop_sizes=[120, 240, 50],
win_lengths=[600, 1200, 240],
window="hann", ):
"""Initialize Multi resolution STFT loss module.
Parameters
----------
fft_sizes : list
List of FFT sizes.
hop_sizes : list
List of hop sizes.
win_lengths : list
List of window lengths.
window : str
Window function type.
"""
super().__init__()
assert len(fft_sizes) == len(hop_sizes) == len(win_lengths)
self.stft_losses = nn.LayerList()
for fs, ss, wl in zip(fft_sizes, hop_sizes, win_lengths):
self.stft_losses.append(STFTLoss(fs, ss, wl, window))
def forward(self, x, y):
"""Calculate forward propagation.
Parameters
----------
x : Tensor
Predicted signal (B, T) or (B, #subband, T).
y : Tensor
Groundtruth signal (B, T) or (B, #subband, T).
Returns
----------
Tensor
Multi resolution spectral convergence loss value.
Tensor
Multi resolution log STFT magnitude loss value.
"""
if len(x.shape) == 3:
# (B, C, T) -> (B x C, T)
x = x.reshape([-1, x.shape[2]])
# (B, C, T) -> (B x C, T)
y = y.reshape([-1, y.shape[2]])
sc_loss = 0.0
mag_loss = 0.0
for f in self.stft_losses:
sc_l, mag_l = f(x, y)
sc_loss += sc_l
mag_loss += mag_l
sc_loss /= len(self.stft_losses)
mag_loss /= len(self.stft_losses)
return sc_loss, mag_loss
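A minimal sketch of the multi-resolution loss in a vocoder training step; the random tensors stand in for generated and reference waveforms.

```python
import paddle

criterion = MultiResolutionSTFTLoss()  # default: 3 (fft, hop, win) resolutions
y_hat = paddle.uniform([4, 8192], min=-1.0, max=1.0)  # generator output (B, T)
y = paddle.uniform([4, 8192], min=-1.0, max=1.0)      # ground truth (B, T)
sc_loss, mag_loss = criterion(y_hat, y)
stft_loss = sc_loss + mag_loss  # typically added to the adversarial loss
```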
class GeneratorAdversarialLoss(nn.Layer):
"""Generator adversarial loss module."""
def __init__(
self,
average_by_discriminators=True,
loss_type="mse", ):
"""Initialize GeneratorAversarialLoss module."""
super().__init__()
self.average_by_discriminators = average_by_discriminators
assert loss_type in ["mse", "hinge"], f"{loss_type} is not supported."
if loss_type == "mse":
self.criterion = self._mse_loss
else:
self.criterion = self._hinge_loss
def forward(self, outputs):
"""Calcualate generator adversarial loss.
Parameters
----------
outputs: Tensor or List
Discriminator outputs or list of discriminator outputs.
Returns
----------
Tensor
Generator adversarial loss value.
"""
if isinstance(outputs, (tuple, list)):
adv_loss = 0.0
for i, outputs_ in enumerate(outputs):
if isinstance(outputs_, (tuple, list)):
# case including feature maps
outputs_ = outputs_[-1]
adv_loss += self.criterion(outputs_)
if self.average_by_discriminators:
adv_loss /= i + 1
else:
adv_loss = self.criterion(outputs)
return adv_loss
def _mse_loss(self, x):
return F.mse_loss(x, paddle.ones_like(x))
def _hinge_loss(self, x):
return -x.mean()
class DiscriminatorAdversarialLoss(nn.Layer):
"""Discriminator adversarial loss module."""
def __init__(
self,
average_by_discriminators=True,
loss_type="mse", ):
"""Initialize DiscriminatorAversarialLoss module."""
super().__init__()
self.average_by_discriminators = average_by_discriminators
assert loss_type in ["mse"], f"{loss_type} is not supported."
if loss_type == "mse":
self.fake_criterion = self._mse_fake_loss
self.real_criterion = self._mse_real_loss
def forward(self, outputs_hat, outputs):
"""Calcualate discriminator adversarial loss.
Parameters
----------
outputs_hat : Tensor or list
Discriminator outputs or list of
discriminator outputs calculated from generator outputs.
outputs : Tensor or list
Discriminator outputs or list of
discriminator outputs calculated from groundtruth.
Returns
----------
Tensor
Discriminator real loss value.
Tensor
Discriminator fake loss value.
"""
if isinstance(outputs, (tuple, list)):
real_loss = 0.0
fake_loss = 0.0
for i, (outputs_hat_,
outputs_) in enumerate(zip(outputs_hat, outputs)):
if isinstance(outputs_hat_, (tuple, list)):
# case including feature maps
outputs_hat_ = outputs_hat_[-1]
outputs_ = outputs_[-1]
real_loss += self.real_criterion(outputs_)
fake_loss += self.fake_criterion(outputs_hat_)
if self.average_by_discriminators:
fake_loss /= i + 1
real_loss /= i + 1
else:
real_loss = self.real_criterion(outputs)
fake_loss = self.fake_criterion(outputs_hat)
return real_loss, fake_loss
def _mse_real_loss(self, x):
return F.mse_loss(x, paddle.ones_like(x))
def _mse_fake_loss(self, x):
return F.mse_loss(x, paddle.zeros_like(x))
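A sketch of how the two adversarial losses pair up in GAN vocoder training; the random tensors stand in for raw discriminator scores (with nested lists, the modules average over discriminators as shown above).

```python
import paddle

gen_adv = GeneratorAdversarialLoss(loss_type="mse")
dis_adv = DiscriminatorAdversarialLoss(loss_type="mse")

outputs_hat = paddle.randn([4, 1, 100])  # D(G(z)): scores for generated audio
outputs = paddle.randn([4, 1, 100])      # D(x): scores for real audio

adv_loss = gen_adv(outputs_hat)                       # generator update
real_loss, fake_loss = dis_adv(outputs_hat, outputs)  # discriminator update
dis_loss = real_loss + fake_loss
```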
# Losses for SpeedySpeech
# Structural Similarity Index Measure (SSIM)
def gaussian(window_size, sigma):
gauss = paddle.to_tensor([
math.exp(-(x - window_size // 2)**2 / float(2 * sigma**2))
for x in range(window_size)
])
return gauss / gauss.sum()
def create_window(window_size, channel):
_1D_window = gaussian(window_size, 1.5).unsqueeze(1)
_2D_window = paddle.matmul(_1D_window, paddle.transpose(
_1D_window, [1, 0])).unsqueeze([0, 1])
window = paddle.expand(_2D_window, [channel, 1, window_size, window_size])
return window
def _ssim(img1, img2, window, window_size, channel, size_average=True):
mu1 = F.conv2d(img1, window, padding=window_size // 2, groups=channel)
mu2 = F.conv2d(img2, window, padding=window_size // 2, groups=channel)
mu1_sq = mu1.pow(2)
mu2_sq = mu2.pow(2)
mu1_mu2 = mu1 * mu2
sigma1_sq = F.conv2d(
img1 * img1, window, padding=window_size // 2, groups=channel) - mu1_sq
sigma2_sq = F.conv2d(
img2 * img2, window, padding=window_size // 2, groups=channel) - mu2_sq
sigma12 = F.conv2d(
img1 * img2, window, padding=window_size // 2, groups=channel) - mu1_mu2
C1 = 0.01**2
C2 = 0.03**2
ssim_map = ((2 * mu1_mu2 + C1) * (2 * sigma12 + C2)) \
/ ((mu1_sq + mu2_sq + C1) * (sigma1_sq + sigma2_sq + C2))
if size_average:
return ssim_map.mean()
else:
return ssim_map.mean(1).mean(1).mean(1)
def ssim(img1, img2, window_size=11, size_average=True):
(_, channel, _, _) = img1.shape
window = create_window(window_size, channel)
return _ssim(img1, img2, window, window_size, channel, size_average)
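Usage sketch: SSIM between two mel-spectrograms treated as single-channel images (as in the SpeedySpeech objective); near-identical inputs score close to 1.0.

```python
import paddle

mel_a = paddle.rand([4, 1, 80, 200])              # (B, C, n_mels, frames)
mel_b = mel_a + 0.01 * paddle.randn(mel_a.shape)  # slightly perturbed copy
score = ssim(mel_a, mel_b, window_size=11)
print(float(score))  # close to 1.0
```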
def weighted_mean(input, weight):
"""Weighted mean. It can also be used as masked mean.
@ -98,28 +457,3 @@ def masked_l1_loss(prediction, target, mask):
abs_error = F.l1_loss(prediction, target, reduction='none')
loss = weighted_mean(abs_error, mask)
return loss
def masked_softmax_with_cross_entropy(logits, label, mask, axis=-1):
"""Compute masked softmax with cross entropy loss.
Parameters
----------
logits : Tensor
The logits. The ``axis``-th axis is the class dimension.
label : Tensor [dtype: int]
The label. The size of the ``axis``-th axis should be 1.
mask : Tensor
The mask. The shape should be broadcastable to ``label``.
axis : int, optional
The index of the class dimension in the shape of ``logits``, by default
-1.
Returns
-------
Tensor [shape=(1,)]
The masked softmax with cross entropy loss.
"""
ce = F.softmax_with_cross_entropy(logits, label, axis=axis)
loss = weighted_mean(ce, mask)
return loss

@ -1,120 +0,0 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import paddle
__all__ = [
"id_mask",
"feature_mask",
"combine_mask",
"future_mask",
]
def id_mask(input, padding_index=0, dtype="bool"):
"""Generate mask with input ids.
Positions where the value equals ``padding_index`` correspond to 0 or
``False``; other positions correspond to 1 or ``True``.
Parameters
----------
input : Tensor [dtype: int]
The input tensor. It represents the ids.
padding_index : int, optional
The id which represents padding, by default 0.
dtype : str, optional
Data type of the returned mask, by default "bool".
Returns
-------
Tensor
The generated mask. It has the same shape as ``input``.
"""
return paddle.cast(input != padding_index, dtype)
def feature_mask(input, axis, dtype="bool"):
"""Compute mask from input features.
For input features represented as batched feature vectors, vectors that
are all zeros are considered padding vectors.
Parameters
----------
input : Tensor [dtype: float]
The input tensor which represents features.
axis : int
The index of the feature dimension in ``input``. Other dimensions are
considered ``spatial`` dimensions.
dtype : str, optional
Data type of the generated mask, by default "bool"
Returns
-------
Tensor
The generated mask with the ``spatial`` shape mentioned above.
It has one less dimension than ``input``.
"""
feature_sum = paddle.sum(paddle.abs(input), axis)
return paddle.cast(feature_sum != 0, dtype)
def combine_mask(mask1, mask2):
"""Combine two mask with multiplication or logical and.
Parameters
-----------
mask1 : Tensor
The first mask.
mask2 : Tensor
The second mask with broadcastable shape with ``mask1``.
Returns
--------
Tensor
Combined mask.
Notes
------
It is mainly used to combine the padding mask and the no-future mask for
the transformer decoder.
The padding mask masks padding positions of the decoder inputs, and the
no-future mask prevents the decoder from seeing future information.
"""
if mask1.dtype == paddle.fluid.core.VarDesc.VarType.BOOL:
return paddle.logical_and(mask1, mask2)
else:
return mask1 * mask2
def future_mask(time_steps, dtype="bool"):
"""Generate lower triangular mask.
It is used in the transformer decoder to prevent it from attending to
future information.
Parameters
----------
time_steps : int
Decoder time steps.
dtype : str, optional
The data type of the generated mask, by default "bool".
Returns
-------
Tensor
The generated mask.
"""
mask = paddle.tril(paddle.ones([time_steps, time_steps]))
return paddle.cast(mask, dtype)
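A short sketch combining these helpers into a decoder self-attention mask: padding positions and future positions are both masked out, broadcasting (B, 1, T) against (T, T).

```python
import paddle

ids = paddle.to_tensor([[3, 5, 7, 0, 0]])  # 0 is the padding id
padding_mask = id_mask(ids)                # (B, T), False at padding
no_future = future_mask(5)                 # (T, T), lower triangular
dec_mask = combine_mask(padding_mask.unsqueeze(1), no_future)  # (B, T, T)
```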

@ -129,7 +129,7 @@ def initialize(model: nn.Layer, init: str):
Parameters
----------
model : paddle.nn.Layer
model : nn.Layer
Target.
init : str
Method of initialization.
@ -150,17 +150,3 @@ def initialize(model: nn.Layer, init: str):
nn.initializer.Constant())
else:
raise ValueError("Unknown initialization: " + init)
def get_activation(act):
"""Return activation function."""
activation_funcs = {
"hardtanh": paddle.nn.Hardtanh,
"tanh": paddle.nn.Tanh,
"relu": paddle.nn.ReLU,
"selu": paddle.nn.SELU,
"swish": paddle.nn.Swish,
}
return activation_funcs[act]()

@ -16,6 +16,7 @@
import numpy as np
import paddle
import paddle.nn.functional as F
from paddle import nn
from scipy.signal import kaiser
@ -56,7 +57,7 @@ def design_prototype_filter(taps=62, cutoff_ratio=0.142, beta=9.0):
return h
class PQMF(paddle.nn.Layer):
class PQMF(nn.Layer):
"""PQMF module.
This module is based on `Near-perfect-reconstruction pseudo-QMF banks`_.
.. _`Near-perfect-reconstruction pseudo-QMF banks`:
@ -105,7 +106,7 @@ class PQMF(paddle.nn.Layer):
self.updown_filter = updown_filter
self.subbands = subbands
# keep padding info
self.pad_fn = paddle.nn.Pad1D(taps // 2, mode='constant', value=0.0)
self.pad_fn = nn.Pad1D(taps // 2, mode='constant', value=0.0)
def analysis(self, x):
"""Analysis with PQMF.

@ -65,7 +65,7 @@ class DurationPredictor(nn.Layer):
Offset value to avoid nan in log domain.
"""
super(DurationPredictor, self).__init__()
super().__init__()
self.offset = offset
self.conv = nn.LayerList()
for idx in range(n_layers):
@ -155,7 +155,7 @@ class DurationPredictorLoss(nn.Layer):
reduction : str
Reduction type in loss calculation.
"""
super(DurationPredictorLoss, self).__init__()
super().__init__()
self.criterion = nn.MSELoss(reduction=reduction)
self.offset = offset

@ -1,80 +0,0 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from math import exp
import paddle
import paddle.nn.functional as F
from paddle import nn
def gaussian(window_size, sigma):
gauss = paddle.to_tensor([
exp(-(x - window_size // 2)**2 / float(2 * sigma**2))
for x in range(window_size)
])
return gauss / gauss.sum()
def create_window(window_size, channel):
_1D_window = gaussian(window_size, 1.5).unsqueeze(1)
_2D_window = paddle.matmul(_1D_window, paddle.transpose(
_1D_window, [1, 0])).unsqueeze([0, 1])
window = paddle.expand(_2D_window, [channel, 1, window_size, window_size])
return window
def _ssim(img1, img2, window, window_size, channel, size_average=True):
mu1 = F.conv2d(img1, window, padding=window_size // 2, groups=channel)
mu2 = F.conv2d(img2, window, padding=window_size // 2, groups=channel)
mu1_sq = mu1.pow(2)
mu2_sq = mu2.pow(2)
mu1_mu2 = mu1 * mu2
sigma1_sq = F.conv2d(
img1 * img1, window, padding=window_size // 2, groups=channel) - mu1_sq
sigma2_sq = F.conv2d(
img2 * img2, window, padding=window_size // 2, groups=channel) - mu2_sq
sigma12 = F.conv2d(
img1 * img2, window, padding=window_size // 2, groups=channel) - mu1_mu2
C1 = 0.01**2
C2 = 0.03**2
ssim_map = ((2 * mu1_mu2 + C1) * (2 * sigma12 + C2)) \
/ ((mu1_sq + mu2_sq + C1) * (sigma1_sq + sigma2_sq + C2))
if size_average:
return ssim_map.mean()
else:
return ssim_map.mean(1).mean(1).mean(1)
class SSIM(nn.Layer):
def __init__(self, window_size=11, size_average=True):
super().__init__()
self.window_size = window_size
self.size_average = size_average
self.channel = 1
self.window = create_window(window_size, self.channel)
def forward(self, img1, img2):
return _ssim(img1, img2, self.window, self.window_size, self.channel,
self.size_average)
def ssim(img1, img2, window_size=11, size_average=True):
(_, channel, _, _) = img1.shape
window = create_window(window_size, channel)
return _ssim(img1, img2, window, window_size, channel, size_average)

@ -1,220 +0,0 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Modified from espnet(https://github.com/espnet/espnet)
import paddle
from paddle import nn
from paddle.nn import functional as F
from scipy import signal
def stft(x,
fft_size,
hop_length=None,
win_length=None,
window='hann',
center=True,
pad_mode='reflect'):
"""Perform STFT and convert to magnitude spectrogram.
Parameters
----------
x : Tensor
Input signal tensor (B, T).
fft_size : int
FFT size.
hop_length : int
Number of samples between adjacent frames (hop size).
win_length : int
Window length.
window : str, optional
Name of window function, see `scipy.signal.get_window` for more
details. Defaults to "hann".
center : bool, optional
Whether to pad `x` so that the :math:`t`-th frame is centered at
:math:`t \times hop\_length`. Default: `True`.
pad_mode : str, optional
Choose padding pattern when `center` is `True`.
Returns
----------
Tensor:
Magnitude spectrogram (B, #frames, fft_size // 2 + 1).
"""
# calculate window
window = signal.get_window(window, win_length, fftbins=True)
window = paddle.to_tensor(window)
x_stft = paddle.signal.stft(
x,
fft_size,
hop_length,
win_length,
window=window,
center=center,
pad_mode=pad_mode)
real = x_stft.real()
imag = x_stft.imag()
return paddle.sqrt(paddle.clip(real**2 + imag**2, min=1e-7)).transpose(
[0, 2, 1])
class SpectralConvergenceLoss(nn.Layer):
"""Spectral convergence loss module."""
def __init__(self):
"""Initilize spectral convergence loss module."""
super().__init__()
def forward(self, x_mag, y_mag):
"""Calculate forward propagation.
Parameters
----------
x_mag : Tensor
Magnitude spectrogram of predicted signal (B, #frames, #freq_bins).
y_mag : Tensor
Magnitude spectrogram of groundtruth signal (B, #frames, #freq_bins).
Returns
----------
Tensor
Spectral convergence loss value.
"""
return paddle.norm(
y_mag - x_mag, p="fro") / paddle.clip(
paddle.norm(y_mag, p="fro"), min=1e-10)
class LogSTFTMagnitudeLoss(nn.Layer):
"""Log STFT magnitude loss module."""
def __init__(self, epsilon=1e-7):
"""Initilize los STFT magnitude loss module."""
super().__init__()
self.epsilon = epsilon
def forward(self, x_mag, y_mag):
"""Calculate forward propagation.
Parameters
----------
x_mag : Tensor
Magnitude spectrogram of predicted signal (B, #frames, #freq_bins).
y_mag : Tensor
Magnitude spectrogram of groundtruth signal (B, #frames, #freq_bins).
Returns
----------
Tensor
Log STFT magnitude loss value.
"""
return F.l1_loss(
paddle.log(paddle.clip(y_mag, min=self.epsilon)),
paddle.log(paddle.clip(x_mag, min=self.epsilon)))
class STFTLoss(nn.Layer):
"""STFT loss module."""
def __init__(self,
fft_size=1024,
shift_size=120,
win_length=600,
window="hann"):
"""Initialize STFT loss module."""
super().__init__()
self.fft_size = fft_size
self.shift_size = shift_size
self.win_length = win_length
self.window = window
self.spectral_convergence_loss = SpectralConvergenceLoss()
self.log_stft_magnitude_loss = LogSTFTMagnitudeLoss()
def forward(self, x, y):
"""Calculate forward propagation.
Parameters
----------
x : Tensor
Predicted signal (B, T).
y : Tensor
Groundtruth signal (B, T).
Returns
----------
Tensor
Spectral convergence loss value.
Tensor
Log STFT magnitude loss value.
"""
x_mag = stft(x, self.fft_size, self.shift_size, self.win_length,
self.window)
y_mag = stft(y, self.fft_size, self.shift_size, self.win_length,
self.window)
sc_loss = self.spectral_convergence_loss(x_mag, y_mag)
mag_loss = self.log_stft_magnitude_loss(x_mag, y_mag)
return sc_loss, mag_loss
class MultiResolutionSTFTLoss(nn.Layer):
"""Multi resolution STFT loss module."""
def __init__(
self,
fft_sizes=[1024, 2048, 512],
hop_sizes=[120, 240, 50],
win_lengths=[600, 1200, 240],
window="hann", ):
"""Initialize Multi resolution STFT loss module.
Parameters
----------
fft_sizes : list
List of FFT sizes.
hop_sizes : list
List of hop sizes.
win_lengths : list
List of window lengths.
window : str
Window function type.
"""
super().__init__()
assert len(fft_sizes) == len(hop_sizes) == len(win_lengths)
self.stft_losses = nn.LayerList()
for fs, ss, wl in zip(fft_sizes, hop_sizes, win_lengths):
self.stft_losses.append(STFTLoss(fs, ss, wl, window))
def forward(self, x, y):
"""Calculate forward propagation.
Parameters
----------
x : Tensor
Predicted signal (B, T) or (B, #subband, T).
y : Tensor
Groundtruth signal (B, T) or (B, #subband, T).
Returns
----------
Tensor
Multi resolution spectral convergence loss value.
Tensor
Multi resolution log STFT magnitude loss value.
"""
if len(x.shape) == 3:
# (B, C, T) -> (B x C, T)
x = x.reshape([-1, x.shape[2]])
# (B, C, T) -> (B x C, T)
y = y.reshape([-1, y.shape[2]])
sc_loss = 0.0
mag_loss = 0.0
for f in self.stft_losses:
sc_l, mag_l = f(x, y)
sc_loss += sc_l
mag_loss += mag_l
sc_loss /= len(self.stft_losses)
mag_loss /= len(self.stft_losses)
return sc_loss, mag_loss

@ -74,7 +74,7 @@ class StyleEncoder(nn.Layer):
gru_units: int=128, ):
"""Initilize global style encoder module."""
assert check_argument_types()
super(StyleEncoder, self).__init__()
super().__init__()
self.ref_enc = ReferenceEncoder(
idim=idim,
@ -93,11 +93,15 @@ class StyleEncoder(nn.Layer):
def forward(self, speech: paddle.Tensor) -> paddle.Tensor:
"""Calculate forward propagation.
Args:
speech (Tensor): Batch of padded target features (B, Lmax, odim).
Parameters
----------
speech : Tensor
Batch of padded target features (B, Lmax, odim).
Returns:
Tensor: Style token embeddings (B, token_dim).
Returns
----------
Tensor:
Style token embeddings (B, token_dim).
"""
ref_embs = self.ref_enc(speech)
@ -145,7 +149,7 @@ class ReferenceEncoder(nn.Layer):
gru_units: int=128, ):
"""Initilize reference encoder module."""
assert check_argument_types()
super(ReferenceEncoder, self).__init__()
super().__init__()
# check hyperparameters are valid
assert conv_kernel_size % 2 == 1, "kernel size must be odd."
@ -249,7 +253,7 @@ class StyleTokenLayer(nn.Layer):
dropout_rate: float=0.0, ):
"""Initilize style token layer module."""
assert check_argument_types()
super(StyleTokenLayer, self).__init__()
super().__init__()
gst_embs = paddle.randn(shape=[gst_tokens, gst_token_dim // gst_heads])
self.gst_embs = paddle.create_parameter(

@ -73,7 +73,7 @@ class Encoder(nn.Layer):
Dropout rate.
"""
super(Encoder, self).__init__()
super().__init__()
# store the hyperparameters
self.idim = idim
self.use_residual = use_residual

@ -67,11 +67,11 @@ class Decoder(nn.Layer):
Dropout rate in self-attention.
src_attention_dropout_rate : float
Dropout rate in source-attention.
input_layer : (Union[str, paddle.nn.Layer])
input_layer : (Union[str, nn.Layer])
Input layer type.
use_output_layer : bool
Whether to use output layer.
pos_enc_class : paddle.nn.Layer
pos_enc_class : nn.Layer
Positional encoding module class.
`PositionalEncoding` or `ScaledPositionalEncoding`
normalize_before : bool
@ -122,8 +122,7 @@ class Decoder(nn.Layer):
input_layer,
pos_enc_class(attention_dim, positional_dropout_rate))
else:
raise NotImplementedError(
"only `embed` or paddle.nn.Layer is supported.")
raise NotImplementedError("only `embed` or nn.Layer is supported.")
self.normalize_before = normalize_before
# self-attention module definition

@ -26,13 +26,13 @@ class DecoderLayer(nn.Layer):
----------
size : int
Input dimension.
self_attn : paddle.nn.Layer
self_attn : nn.Layer
Self-attention module instance.
`MultiHeadedAttention` instance can be used as the argument.
src_attn : paddle.nn.Layer
src_attn : nn.Layer
Source-attention module instance.
`MultiHeadedAttention` instance can be used as the argument.
feed_forward : paddle.nn.Layer
feed_forward : nn.Layer
Feed-forward module instance.
`PositionwiseFeedForward`, `MultiLayeredConv1d`, or `Conv1dLinear` instance can be used as the argument.
dropout_rate : float

@ -43,7 +43,7 @@ class PositionalEncoding(nn.Layer):
dtype="float32",
reverse=False):
"""Construct an PositionalEncoding object."""
super(PositionalEncoding, self).__init__()
super().__init__()
self.d_model = d_model
self.reverse = reverse
self.xscale = math.sqrt(self.d_model)
@ -117,7 +117,7 @@ class ScaledPositionalEncoding(PositionalEncoding):
self.alpha = paddle.create_parameter(
shape=x.shape,
dtype=self.dtype,
default_initializer=paddle.nn.initializer.Assign(x))
default_initializer=nn.initializer.Assign(x))
def reset_parameters(self):
"""Reset parameters."""
@ -141,7 +141,7 @@ class ScaledPositionalEncoding(PositionalEncoding):
return self.dropout(x)
class RelPositionalEncoding(paddle.nn.Layer):
class RelPositionalEncoding(nn.Layer):
"""Relative positional encoding module (new implementation).
Details can be found in https://github.com/espnet/espnet/pull/2816.
See : Appendix B in https://arxiv.org/abs/1901.02860
@ -157,10 +157,10 @@ class RelPositionalEncoding(paddle.nn.Layer):
def __init__(self, d_model, dropout_rate, max_len=5000, dtype="float32"):
"""Construct an PositionalEncoding object."""
super(RelPositionalEncoding, self).__init__()
super().__init__()
self.d_model = d_model
self.xscale = math.sqrt(self.d_model)
self.dropout = paddle.nn.Dropout(p=dropout_rate)
self.dropout = nn.Dropout(p=dropout_rate)
self.pe = None
self.dtype = dtype
self.extend_pe(paddle.expand(paddle.zeros([1]), (1, max_len)))

@ -17,10 +17,10 @@ from typing import Union
from paddle import nn
from paddlespeech.t2s.modules.activation import get_activation
from paddlespeech.t2s.modules.conformer.convolution import ConvolutionModule
from paddlespeech.t2s.modules.conformer.encoder_layer import EncoderLayer as ConformerEncoderLayer
from paddlespeech.t2s.modules.layer_norm import LayerNorm
from paddlespeech.t2s.modules.nets_utils import get_activation
from paddlespeech.t2s.modules.transformer.attention import MultiHeadedAttention
from paddlespeech.t2s.modules.transformer.attention import RelPositionMultiHeadedAttention
from paddlespeech.t2s.modules.transformer.embedding import PositionalEncoding
@ -34,8 +34,8 @@ from paddlespeech.t2s.modules.transformer.repeat import repeat
from paddlespeech.t2s.modules.transformer.subsampling import Conv2dSubsampling
class Encoder(nn.Layer):
"""Transformer encoder module.
class BaseEncoder(nn.Layer):
"""Base Encoder module.
Parameters
----------
@ -55,7 +55,7 @@ class Encoder(nn.Layer):
Dropout rate after adding positional encoding.
attention_dropout_rate : float
Dropout rate in attention.
input_layer : Union[str, paddle.nn.Layer]
input_layer : Union[str, nn.Layer]
Input layer type.
normalize_before : bool
Whether to use layer_norm before the first block.
@ -120,7 +120,7 @@ class Encoder(nn.Layer):
stochastic_depth_rate: float=0.0,
intermediate_layers: Union[List[int], None]=None,
encoder_type: str="transformer"):
"""Construct an Encoder object."""
"""Construct an Bae Encoder object."""
super().__init__()
activation = get_activation(activation_type)
pos_enc_class = self.get_pos_enc_class(pos_enc_layer_type,
@ -264,7 +264,6 @@ class Encoder(nn.Layer):
nn.Dropout(dropout_rate),
nn.ReLU(),
pos_enc_class(attention_dim, positional_dropout_rate), )
elif input_layer == "conv2d":
embed = Conv2dSubsampling(
idim,
@ -305,46 +304,118 @@ class Encoder(nn.Layer):
paddle.Tensor
Mask tensor (#batch, 1, time).
"""
if self.encoder_type == "transformer":
xs = self.embed(xs)
xs, masks = self.encoders(xs, masks)
if self.normalize_before:
xs = self.after_norm(xs)
return xs, masks
elif self.encoder_type == "conformer":
if isinstance(self.embed, (Conv2dSubsampling)):
xs, masks = self.embed(xs, masks)
else:
xs = self.embed(xs)
if self.intermediate_layers is None:
xs, masks = self.encoders(xs, masks)
else:
intermediate_outputs = []
for layer_idx, encoder_layer in enumerate(self.encoders):
xs, masks = encoder_layer(xs, masks)
if (self.intermediate_layers is not None and
layer_idx + 1 in self.intermediate_layers):
# intermediate branches also require normalization.
encoder_output = xs
if isinstance(encoder_output, tuple):
encoder_output = encoder_output[0]
if self.normalize_before:
encoder_output = self.after_norm(encoder_output)
intermediate_outputs.append(encoder_output)
if isinstance(xs, tuple):
xs = xs[0]
if self.normalize_before:
xs = self.after_norm(xs)
if self.intermediate_layers is not None:
return xs, masks, intermediate_outputs
return xs, masks
else:
raise ValueError(f"{self.encoder_type} is not supported.")
xs = self.embed(xs)
xs, masks = self.encoders(xs, masks)
if self.normalize_before:
xs = self.after_norm(xs)
return xs, masks
class TransformerEncoder(BaseEncoder):
"""Transformer encoder module.
Parameters
----------
idim : int
Input dimension.
attention_dim : int
Dimension of attention.
attention_heads : int
The number of heads of multi head attention.
linear_units : int
The number of units of position-wise feed forward.
num_blocks : int
The number of encoder blocks.
dropout_rate : float
Dropout rate.
positional_dropout_rate : float
Dropout rate after adding positional encoding.
attention_dropout_rate : float
Dropout rate in attention.
input_layer : Union[str, paddle.nn.Layer]
Input layer type.
pos_enc_layer_type : str
Encoder positional encoding layer type.
normalize_before : bool
Whether to use layer_norm before the first block.
concat_after : bool
Whether to concat attention layer's input and output.
if True, additional linear will be applied.
i.e. x -> x + linear(concat(x, att(x)))
if False, no additional linear will be applied. i.e. x -> x + att(x)
positionwise_layer_type : str
"linear", "conv1d", or "conv1d-linear".
positionwise_conv_kernel_size : int
Kernel size of positionwise conv1d layer.
selfattention_layer_type : str
Encoder attention layer type.
activation_type : str
Encoder activation function type.
padding_idx : int
Padding idx for input_layer=embed.
"""
def __init__(
self,
idim,
attention_dim: int=256,
attention_heads: int=4,
linear_units: int=2048,
num_blocks: int=6,
dropout_rate: float=0.1,
positional_dropout_rate: float=0.1,
attention_dropout_rate: float=0.0,
input_layer: str="conv2d",
pos_enc_layer_type: str="abs_pos",
normalize_before: bool=True,
concat_after: bool=False,
positionwise_layer_type: str="linear",
positionwise_conv_kernel_size: int=1,
selfattention_layer_type: str="selfattn",
activation_type: str="relu",
padding_idx: int=-1, ):
"""Construct an Transformer Encoder object."""
super().__init__(
idim,
attention_dim=attention_dim,
attention_heads=attention_heads,
linear_units=linear_units,
num_blocks=num_blocks,
dropout_rate=dropout_rate,
positional_dropout_rate=positional_dropout_rate,
attention_dropout_rate=attention_dropout_rate,
input_layer=input_layer,
pos_enc_layer_type=pos_enc_layer_type,
normalize_before=normalize_before,
concat_after=concat_after,
positionwise_layer_type=positionwise_layer_type,
positionwise_conv_kernel_size=positionwise_conv_kernel_size,
selfattention_layer_type=selfattention_layer_type,
activation_type=activation_type,
padding_idx=padding_idx,
encoder_type="transformer")
def forward(self, xs, masks):
"""Encode input sequence.
Parameters
----------
xs : paddle.Tensor
Input tensor (#batch, time, idim).
masks : paddle.Tensor
Mask tensor (#batch, 1, time).
Returns
----------
paddle.Tensor
Output tensor (#batch, time, attention_dim).
paddle.Tensor
Mask tensor (#batch, 1, time).
"""
xs = self.embed(xs)
xs, masks = self.encoders(xs, masks)
if self.normalize_before:
xs = self.after_norm(xs)
return xs, masks
def forward_one_step(self, xs, masks, cache=None):
"""Encode input frame.
@ -378,3 +449,161 @@ class Encoder(nn.Layer):
if self.normalize_before:
xs = self.after_norm(xs)
return xs, masks, new_cache
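A usage sketch of the refactored `TransformerEncoder`; `input_layer="linear"` (the espnet-style projection branch) keeps the time resolution, whereas the `"conv2d"` default subsamples by 4.

```python
import paddle

encoder = TransformerEncoder(
    idim=80, attention_dim=256, num_blocks=6, input_layer="linear")
xs = paddle.randn([2, 50, 80])                 # (B, T, idim)
masks = paddle.ones([2, 1, 50], dtype="bool")  # (B, 1, T)
hs, hs_masks = encoder(xs, masks)              # (B, T, 256), (B, 1, T)
```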
class ConformerEncoder(BaseEncoder):
"""Conformer encoder module.
Parameters
----------
idim : int
Input dimension.
attention_dim : int
Dimension of attention.
attention_heads : int
The number of heads of multi head attention.
linear_units : int
The number of units of position-wise feed forward.
num_blocks : int
The number of encoder blocks.
dropout_rate : float
Dropout rate.
positional_dropout_rate : float
Dropout rate after adding positional encoding.
attention_dropout_rate : float
Dropout rate in attention.
input_layer : Union[str, nn.Layer]
Input layer type.
normalize_before : bool
Whether to use layer_norm before the first block.
concat_after : bool
Whether to concat attention layer's input and output.
if True, additional linear will be applied.
i.e. x -> x + linear(concat(x, att(x)))
if False, no additional linear will be applied. i.e. x -> x + att(x)
positionwise_layer_type : str
"linear", "conv1d", or "conv1d-linear".
positionwise_conv_kernel_size : int
Kernel size of positionwise conv1d layer.
macaron_style : bool
Whether to use macaron style for positionwise layer.
pos_enc_layer_type : str
Encoder positional encoding layer type.
selfattention_layer_type : str
Encoder attention layer type.
activation_type : str
Encoder activation function type.
use_cnn_module : bool
Whether to use convolution module.
zero_triu : bool
Whether to zero the upper triangular part of attention matrix.
cnn_module_kernel : int
Kernel size of the convolution module.
padding_idx : int
Padding idx for input_layer=embed.
stochastic_depth_rate : float
Maximum probability to skip the encoder layer.
intermediate_layers : Union[List[int], None]
Indices of intermediate CTC layers (counted from 1).
If not None, intermediate outputs are also returned, which changes the
return signature.
"""
def __init__(
self,
idim: int,
attention_dim: int=256,
attention_heads: int=4,
linear_units: int=2048,
num_blocks: int=6,
dropout_rate: float=0.1,
positional_dropout_rate: float=0.1,
attention_dropout_rate: float=0.0,
input_layer: str="conv2d",
normalize_before: bool=True,
concat_after: bool=False,
positionwise_layer_type: str="linear",
positionwise_conv_kernel_size: int=1,
macaron_style: bool=False,
pos_enc_layer_type: str="rel_pos",
selfattention_layer_type: str="rel_selfattn",
activation_type: str="swish",
use_cnn_module: bool=False,
zero_triu: bool=False,
cnn_module_kernel: int=31,
padding_idx: int=-1,
stochastic_depth_rate: float=0.0,
intermediate_layers: Union[List[int], None]=None, ):
"""Construct an Conformer Encoder object."""
super().__init__(
idim=idim,
attention_dim=attention_dim,
attention_heads=attention_heads,
linear_units=linear_units,
num_blocks=num_blocks,
dropout_rate=dropout_rate,
positional_dropout_rate=positional_dropout_rate,
attention_dropout_rate=attention_dropout_rate,
input_layer=input_layer,
normalize_before=normalize_before,
concat_after=concat_after,
positionwise_layer_type=positionwise_layer_type,
positionwise_conv_kernel_size=positionwise_conv_kernel_size,
macaron_style=macaron_style,
pos_enc_layer_type=pos_enc_layer_type,
selfattention_layer_type=selfattention_layer_type,
activation_type=activation_type,
use_cnn_module=use_cnn_module,
zero_triu=zero_triu,
cnn_module_kernel=cnn_module_kernel,
padding_idx=padding_idx,
stochastic_depth_rate=stochastic_depth_rate,
intermediate_layers=intermediate_layers,
encoder_type="conformer")
def forward(self, xs, masks):
"""Encode input sequence.
Parameters
----------
xs : paddle.Tensor
Input tensor (#batch, time, idim).
masks : paddle.Tensor
Mask tensor (#batch, 1, time).
Returns
----------
paddle.Tensor
Output tensor (#batch, time, attention_dim).
paddle.Tensor
Mask tensor (#batch, 1, time).
"""
if isinstance(self.embed, (Conv2dSubsampling)):
xs, masks = self.embed(xs, masks)
else:
xs = self.embed(xs)
if self.intermediate_layers is None:
xs, masks = self.encoders(xs, masks)
else:
intermediate_outputs = []
for layer_idx, encoder_layer in enumerate(self.encoders):
xs, masks = encoder_layer(xs, masks)
if (self.intermediate_layers is not None and
layer_idx + 1 in self.intermediate_layers):
# intermediate branches also require normalization.
encoder_output = xs
if isinstance(encoder_output, tuple):
encoder_output = encoder_output[0]
if self.normalize_before:
encoder_output = self.after_norm(encoder_output)
intermediate_outputs.append(encoder_output)
if isinstance(xs, tuple):
xs = xs[0]
if self.normalize_before:
xs = self.after_norm(xs)
if self.intermediate_layers is not None:
return xs, masks, intermediate_outputs
return xs, masks
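And the conformer counterpart, with macaron feed-forward and the convolution module switched on (values are illustrative assumptions; with the default `rel_pos` encoding, the tuple handling above unwraps the output back to a plain tensor).

```python
import paddle

encoder = ConformerEncoder(
    idim=80, attention_dim=256, num_blocks=4, input_layer="linear",
    macaron_style=True, use_cnn_module=True, cnn_module_kernel=7)
xs = paddle.randn([2, 50, 80])
masks = paddle.ones([2, 1, 50], dtype="bool")
hs, hs_masks = encoder(xs, masks)  # (B, T, 256), (B, 1, T)
```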

@ -24,10 +24,10 @@ class EncoderLayer(nn.Layer):
----------
size : int
Input dimension.
self_attn : paddle.nn.Layer
self_attn : nn.Layer
Self-attention module instance.
`MultiHeadedAttention` instance can be used as the argument.
feed_forward : paddle.nn.Layer
feed_forward : nn.Layer
Feed-forward module instance.
`PositionwiseFeedForward`, `MultiLayeredConv1d`, or `Conv1dLinear` instance can be used as the argument.
dropout_rate : float
@ -50,7 +50,7 @@ class EncoderLayer(nn.Layer):
normalize_before=True,
concat_after=False, ):
"""Construct an EncoderLayer object."""
super(EncoderLayer, self).__init__()
super().__init__()
self.self_attn = self_attn
self.feed_forward = feed_forward
self.norm1 = nn.LayerNorm(size)

@ -18,7 +18,7 @@ import paddle
import paddle.nn.functional as F
from paddle import nn
from paddlespeech.t2s.modules.glu import GLU
from paddlespeech.t2s.modules.activation import get_activation
from paddlespeech.t2s.modules.masked_fill import masked_fill
MIN_VALUE = float(numpy.finfo(numpy.float32).min)
@ -56,7 +56,7 @@ class LightweightConvolution(nn.Layer):
use_kernel_mask=False,
use_bias=False, ):
"""Construct Lightweight Convolution layer."""
super(LightweightConvolution, self).__init__()
super().__init__()
assert n_feat % wshare == 0
self.wshare = wshare
@ -68,7 +68,7 @@ class LightweightConvolution(nn.Layer):
# linear -> GLU -> lightconv -> linear
self.linear1 = nn.Linear(n_feat, n_feat * 2)
self.linear2 = nn.Linear(n_feat, n_feat)
self.act = GLU()
self.act = get_activation("glu")
# lightconv related
self.uniform_ = nn.initializer.Uniform()

@ -12,10 +12,10 @@
# See the License for the specific language governing permissions and
# limitations under the License.
"""Layer modules for FFT block in FastSpeech (Feed-forward Transformer)."""
import paddle
from paddle import nn
class MultiLayeredConv1d(paddle.nn.Layer):
class MultiLayeredConv1d(nn.Layer):
"""Multi-layered conv1d for Transformer block.
This is a module of multi-layered conv1d designed
@ -43,21 +43,21 @@ class MultiLayeredConv1d(paddle.nn.Layer):
Dropout rate.
"""
super(MultiLayeredConv1d, self).__init__()
self.w_1 = paddle.nn.Conv1D(
super().__init__()
self.w_1 = nn.Conv1D(
in_chans,
hidden_chans,
kernel_size,
stride=1,
padding=(kernel_size - 1) // 2, )
self.w_2 = paddle.nn.Conv1D(
self.w_2 = nn.Conv1D(
hidden_chans,
in_chans,
kernel_size,
stride=1,
padding=(kernel_size - 1) // 2, )
self.dropout = paddle.nn.Dropout(dropout_rate)
self.relu = paddle.nn.ReLU()
self.dropout = nn.Dropout(dropout_rate)
self.relu = nn.ReLU()
def forward(self, x):
"""Calculate forward propagation.
@ -77,7 +77,7 @@ class MultiLayeredConv1d(paddle.nn.Layer):
[0, 2, 1])
class Conv1dLinear(paddle.nn.Layer):
class Conv1dLinear(nn.Layer):
"""Conv1D + Linear for Transformer block.
A variant of MultiLayeredConv1d, which replaces second conv-layer to linear.
@ -98,16 +98,16 @@ class Conv1dLinear(paddle.nn.Layer):
dropout_rate : float
Dropout rate.
"""
super(Conv1dLinear, self).__init__()
self.w_1 = paddle.nn.Conv1D(
super().__init__()
self.w_1 = nn.Conv1D(
in_chans,
hidden_chans,
kernel_size,
stride=1,
padding=(kernel_size - 1) // 2, )
self.w_2 = paddle.nn.Linear(hidden_chans, in_chans, bias_attr=True)
self.dropout = paddle.nn.Dropout(dropout_rate)
self.relu = paddle.nn.ReLU()
self.w_2 = nn.Linear(hidden_chans, in_chans, bias_attr=True)
self.dropout = nn.Dropout(dropout_rate)
self.relu = nn.ReLU()
def forward(self, x):
"""Calculate forward propagation.

@ -14,9 +14,10 @@
# Modified from espnet(https://github.com/espnet/espnet)
"""Positionwise feed forward layer definition."""
import paddle
from paddle import nn
class PositionwiseFeedForward(paddle.nn.Layer):
class PositionwiseFeedForward(nn.Layer):
"""Positionwise feed forward layer.
Parameters
@ -35,7 +36,7 @@ class PositionwiseFeedForward(paddle.nn.Layer):
dropout_rate,
activation=paddle.nn.ReLU()):
"""Construct an PositionwiseFeedForward object."""
super(PositionwiseFeedForward, self).__init__()
super().__init__()
self.w_1 = paddle.nn.Linear(idim, hidden_units, bias_attr=True)
self.w_2 = paddle.nn.Linear(hidden_units, idim, bias_attr=True)
self.dropout = paddle.nn.Dropout(dropout_rate)

@ -14,11 +14,12 @@
# Modified from espnet(https://github.com/espnet/espnet)
"""Subsampling layer definition."""
import paddle
from paddle import nn
from paddlespeech.t2s.modules.transformer.embedding import PositionalEncoding
class Conv2dSubsampling(paddle.nn.Layer):
class Conv2dSubsampling(nn.Layer):
"""Convolutional 2D subsampling (to 1/4 length).
Parameters
----------
@ -28,20 +29,20 @@ class Conv2dSubsampling(paddle.nn.Layer):
Output dimension.
dropout_rate : float
Dropout rate.
pos_enc : paddle.nn.Layer
pos_enc : nn.Layer
Custom position encoding layer.
"""
def __init__(self, idim, odim, dropout_rate, pos_enc=None):
"""Construct an Conv2dSubsampling object."""
super(Conv2dSubsampling, self).__init__()
self.conv = paddle.nn.Sequential(
paddle.nn.Conv2D(1, odim, 3, 2),
paddle.nn.ReLU(),
paddle.nn.Conv2D(odim, odim, 3, 2),
paddle.nn.ReLU(), )
self.out = paddle.nn.Sequential(
paddle.nn.Linear(odim * (((idim - 1) // 2 - 1) // 2), odim),
super().__init__()
self.conv = nn.Sequential(
nn.Conv2D(1, odim, 3, 2),
nn.ReLU(),
nn.Conv2D(odim, odim, 3, 2),
nn.ReLU(), )
self.out = nn.Sequential(
nn.Linear(odim * (((idim - 1) // 2 - 1) // 2), odim),
pos_enc if pos_enc is not None else
PositionalEncoding(odim, dropout_rate), )
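A shape sketch for `Conv2dSubsampling`: two stride-2 convolutions shrink both the time and feature axes to roughly a quarter, which is why the final Linear consumes `odim * (((idim - 1) // 2 - 1) // 2)` features. It is called as `embed(xs, masks)` inside the encoders above.

```python
import paddle

sub = Conv2dSubsampling(idim=80, odim=256, dropout_rate=0.1)
xs = paddle.randn([2, 100, 80])                 # (B, T, idim)
masks = paddle.ones([2, 1, 100], dtype="bool")
xs_sub, masks_sub = sub(xs, masks)
# time axis: 100 -> ((100 - 1) // 2 - 1) // 2 = 24 frames, feature dim -> 256
```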

@ -12,6 +12,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.
import paddle
from paddle import nn
optim_classes = dict(
adadelta=paddle.optimizer.Adadelta,
@ -25,7 +26,7 @@ optim_classes = dict(
sgd=paddle.optimizer.SGD, )
def build_optimizers(model: paddle.nn.Layer,
def build_optimizers(model: nn.Layer,
optim='adadelta',
max_grad_norm=None,
learning_rate=0.01) -> paddle.optimizer:

@ -11,52 +11,11 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import librosa
import numpy as np
import paddle
import torch
from parallel_wavegan.losses import stft_loss as sl
from scipy import signal
from paddlespeech.t2s.modules.stft_loss import MultiResolutionSTFTLoss
from paddlespeech.t2s.modules.stft_loss import STFT
def test_stft():
stft = STFT(n_fft=1024, hop_length=256, win_length=1024)
x = paddle.uniform([4, 46080])
S = stft.magnitude(x)
window = signal.get_window('hann', 1024, fftbins=True)
D2 = torch.stft(
torch.as_tensor(x.numpy()),
n_fft=1024,
hop_length=256,
win_length=1024,
window=torch.as_tensor(window))
S2 = (D2**2).sum(-1).sqrt()
S3 = np.abs(
librosa.stft(x.numpy()[0], n_fft=1024, hop_length=256, win_length=1024))
print(S2.shape)
print(S.numpy()[0])
print(S2.data.cpu().numpy()[0])
print(S3)
def test_torch_stft():
# NOTE: torch.stft use no window by default
x = np.random.uniform(-1.0, 1.0, size=(46080, ))
window = signal.get_window('hann', 1024, fftbins=True)
D2 = torch.stft(
torch.as_tensor(x),
n_fft=1024,
hop_length=256,
win_length=1024,
window=torch.as_tensor(window))
D3 = librosa.stft(
x, n_fft=1024, hop_length=256, win_length=1024, window='hann')
print(D2[:, :, 0].data.cpu().numpy()[:, 30:60])
print(D3.real[:, 30:60])
# print(D3.imag[:, 30:60])
from paddlespeech.t2s.modules.losses import MultiResolutionSTFTLoss
def test_multi_resolution_stft_loss():
