Merge branch 'develop' into fix-opencpop-svs1

pull/3912/head
Commit c5c3a8a9b5 by cyberslack_lee, committed via GitHub 10 months ago

@@ -3,7 +3,18 @@ This example contains code used to train a [JETS](https://arxiv.org/abs/2203.168
 ## Dataset
 ### Download and Extract
-Download CSMSC from it's [Official Website](https://test.data-baker.com/data/index/source).
+Download CSMSC from it's [official website](https://test.data-baker.com/data/index/TNtts/) and extract it to `~/datasets`. Then the dataset is in the directory `~/datasets/BZNSYP`.
+The structure of the folder is listed below.
+```text
+└─ Wave
+    └─ .wav files (audio speech)
+└─ PhoneLabeling
+    └─ .interval files (alignment between phoneme and duration)
+└─ ProsodyLabeling
+    └─ 000001-010000.txt (text with prosodic by pinyin)
+```
 ### Get MFA Result and Extract
 We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get phonemes and durations for JETS.
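After extraction, a quick sanity check of the BZNSYP layout can catch a misplaced archive before preprocessing. This is a minimal sketch, not part of the PR, assuming the default `~/datasets/BZNSYP` location described above:

```python
from pathlib import Path

# Illustrative check: verify the CSMSC/BZNSYP tree matches the README layout.
root = Path.home() / "datasets" / "BZNSYP"
for sub in ("Wave", "PhoneLabeling", "ProsodyLabeling"):
    assert (root / sub).is_dir(), f"missing {sub} under {root}"
print("found", len(list((root / "Wave").glob("*.wav"))), "wav files")
```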

@@ -5,6 +5,17 @@ This example contains code used to train a [SpeedySpeech](http://arxiv.org/abs/2
 ### Download and Extract
 Download CSMSC from it's [Official Website](https://test.data-baker.com/data/index/TNtts/) and extract it to `~/datasets`. Then the dataset is in the directory `~/datasets/BZNSYP`.
+The structure of the folder is listed below.
+```text
+└─ Wave
+    └─ .wav files (audio speech)
+└─ PhoneLabeling
+    └─ .interval files (alignment between phoneme and duration)
+└─ ProsodyLabeling
+    └─ 000001-010000.txt (text with prosodic by pinyin)
+```
 ### Get MFA Result and Extract
 We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get durations for SPEEDYSPEECH.
 You can download from here [baker_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/BZNSYP/with_tone/baker_alignment_tone.tar.gz), or train your MFA model reference to [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) of our repo.

@@ -4,6 +4,17 @@ This example contains code used to train a [HiFiGAN](https://arxiv.org/abs/2010.
 ### Download and Extract
 Download CSMSC from it's [official website](https://test.data-baker.com/data/index/TNtts/) and extract it to `~/datasets`. Then the dataset is in the directory `~/datasets/BZNSYP`.
+The structure of the folder is listed below.
+```text
+└─ Wave
+    └─ .wav files (audio speech)
+└─ PhoneLabeling
+    └─ .interval files (alignment between phoneme and duration)
+└─ ProsodyLabeling
+    └─ 000001-010000.txt (text with prosodic by pinyin)
+```
 ### Get MFA Result and Extract
 We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) results to cut silence at the edge of audio.
 You can download from here [baker_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/BZNSYP/with_tone/baker_alignment_tone.tar.gz), or train your MFA model reference to [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) of our repo.

@@ -6,6 +6,17 @@ This example contains code used to train a [iSTFTNet](https://arxiv.org/abs/2203
 ### Download and Extract
 Download CSMSC from it's [official website](https://test.data-baker.com/data/index/TNtts/) and extract it to `~/datasets`. Then the dataset is in the directory `~/datasets/BZNSYP`.
+The structure of the folder is listed below.
+```text
+└─ Wave
+    └─ .wav files (audio speech)
+└─ PhoneLabeling
+    └─ .interval files (alignment between phoneme and duration)
+└─ ProsodyLabeling
+    └─ 000001-010000.txt (text with prosodic by pinyin)
+```
 ### Get MFA Result and Extract
 We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) results to cut silence at the edge of audio.
 You can download from here [baker_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/BZNSYP/with_tone/baker_alignment_tone.tar.gz), or train your MFA model reference to [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) of our repo.

@@ -144,7 +144,7 @@ source path.sh
 bash ./local/data.sh
 CUDA_VISIBLE_DEVICES= ./local/train.sh conf/deepspeech2.yaml deepspeech2
 avg.sh best exp/deepspeech2/checkpoints 1
-CUDA_VISIBLE_DEVICES= ./local/test.sh conf/deepspeech2.yaml exp/deepspeech2/checkpoints/avg_1
+CUDA_VISIBLE_DEVICES= ./local/test.sh conf/deepspeech2.yaml conf/tuning/decode.yaml exp/deepspeech2/checkpoints/avg_1
 ```
 ## Stage 4: Static graph model Export
 This stage is to transform dygraph to static graph.
@@ -185,5 +185,5 @@ wget -nc https://paddlespeech.bj.bcebos.com/datasets/single_wav/en/demo_002_en.w
 ```
 You can train a model by yourself, then you need to prepare an audio file or use the audio demo above, please confirm the sample rate of the audio is 16K. You can get the result of the audio demo by running the script below.
 ```bash
-CUDA_VISIBLE_DEVICES= ./local/test_wav.sh conf/deepspeech2.yaml exp/deepspeech2/checkpoints/avg_1 data/demo_002_en.wav
+CUDA_VISIBLE_DEVICES= ./local/test_wav.sh conf/deepspeech2.yaml conf/tuning/decode.yaml exp/deepspeech2/checkpoints/avg_1 data/demo_002_en.wav
 ```
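The test commands above now pass a second YAML (`conf/tuning/decode.yaml`) so decoding options live apart from the model config. As a rough, hypothetical sketch of what the test entry point does with the two files (the exact merging logic lives in the exp scripts, not shown here):

```python
from yacs.config import CfgNode

# Hypothetical sketch: load the model config and the separate decode config,
# then attach the decode options under a `decode` key for later use.
config = CfgNode(new_allowed=True)
config.merge_from_file("conf/deepspeech2.yaml")

decode_config = CfgNode(new_allowed=True)
decode_config.merge_from_file("conf/tuning/decode.yaml")
config.decode = decode_config
```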

@@ -148,7 +148,7 @@ or you can run these scripts in the command line (only use CPU).
 bash ./local/data.sh
 CUDA_VISIBLE_DEVICES= ./local/train.sh conf/conformer.yaml conformer
 avg.sh best exp/conformer/checkpoints 20
-CUDA_VISIBLE_DEVICES= ./local/test.sh conf/conformer.yaml exp/conformer/checkpoints/avg_20
+CUDA_VISIBLE_DEVICES= ./local/test.sh conf/conformer.yaml conf/tuning/decode.yaml exp/conformer/checkpoints/avg_20
 ```
 ## Pretrained Model
 You can get the pretrained transformer or conformer from [this](../../../docs/source/released_model.md).
@@ -163,7 +163,7 @@ source path.sh
 # If you have process the data and get the manifest file you can skip the following 2 steps
 bash local/data.sh --stage -1 --stop_stage -1
 bash local/data.sh --stage 2 --stop_stage 2
-CUDA_VISIBLE_DEVICES= ./local/test.sh conf/conformer.yaml exp/conformer/checkpoints/avg_20
+CUDA_VISIBLE_DEVICES= ./local/test.sh conf/conformer.yaml conf/tuning/decode.yaml exp/conformer/checkpoints/avg_20
 ```
 The performance of the released models are shown in [here](./RESULTS.md).
@@ -192,8 +192,8 @@ bash ./local/data.sh
 CUDA_VISIBLE_DEVICES= ./local/train.sh conf/conformer.yaml conformer
 avg.sh best exp/conformer/checkpoints 20
 # test stage is optional
-CUDA_VISIBLE_DEVICES= ./local/test.sh conf/conformer.yaml exp/conformer/checkpoints/avg_20
-CUDA_VISIBLE_DEVICES= ./local/align.sh conf/conformer.yaml exp/conformer/checkpoints/avg_20
+CUDA_VISIBLE_DEVICES= ./local/test.sh conf/conformer.yaml conf/tuning/decode.yaml exp/conformer/checkpoints/avg_20
+CUDA_VISIBLE_DEVICES= ./local/align.sh conf/conformer.yaml conf/tuning/decode.yaml exp/conformer/checkpoints/avg_20
 ```
 ## Stage 5: Single Audio File Inference
 In some situations, you want to use the trained model to do the inference for the single audio file. You can use stage 5. The code is shown below
@@ -214,5 +214,5 @@ wget -nc https://paddlespeech.bj.bcebos.com/datasets/single_wav/en/demo_002_en.w
 ```
 You need to prepare an audio file or use the audio demo above, please confirm the sample rate of the audio is 16K. You can get the result of the audio demo by running the script below.
 ```bash
-CUDA_VISIBLE_DEVICES= ./local/test_wav.sh conf/conformer.yaml exp/conformer/checkpoints/avg_20 data/demo_002_en.wav
+CUDA_VISIBLE_DEVICES= ./local/test_wav.sh conf/conformer.yaml conf/tuning/decode.yaml exp/conformer/checkpoints/avg_20 data/demo_002_en.wav
 ```

@@ -27,7 +27,6 @@ The document below will describe the scripts in `run.sh` in detail.
 The path.sh contains the environment variables.
 ```bash
 . ./path.sh
-. ./cmd.sh
 ```
 This script needs to be run first. And another script is also needed:
 ```bash
@@ -67,7 +66,6 @@ bash run.sh --stage 0 --stop_stage 0
 You can also just run these scripts in your command line.
 ```bash
 . ./path.sh
-. ./cmd.sh
 bash ./local/data.sh
 ```
 After processing the data, the `data` directory will look like this:
@@ -103,7 +101,6 @@ bash run.sh --stage 0 --stop_stage 1
 or you can run these scripts in the command line (only use CPU).
 ```bash
 . ./path.sh
-. ./cmd.sh
 bash ./local/data.sh
 CUDA_VISIBLE_DEVICES= ./local/train.sh conf/conformer.yaml conformer
 ```
@@ -124,7 +121,6 @@ or you can run these scripts in the command line (only use CPU).
 ```bash
 . ./path.sh
-. ./cmd.sh
 bash ./local/data.sh
 CUDA_VISIBLE_DEVICES= ./local/train.sh conf/conformer.yaml conformer
 avg.sh best exp/conformer/checkpoints 10
@@ -144,11 +140,10 @@ bash run.sh --stage 0 --stop_stage 3
 or you can run these scripts in the command line (only use CPU).
 ```bash
 . ./path.sh
-. ./cmd.sh
 bash ./local/data.sh
 CUDA_VISIBLE_DEVICES= ./local/train.sh conf/conformer.yaml conformer
 avg.sh best exp/conformer/checkpoints 10
-CUDA_VISIBLE_DEVICES= ./local/test.sh conf/conformer.yaml exp/conformer/checkpoints/avg_10
+CUDA_VISIBLE_DEVICES= ./local/test.sh conf/conformer.yaml conf/tuning/decode.yaml exp/conformer/checkpoints/avg_10
 ```
 ## Pretrained Model
 You can get the pretrained transformer or conformer from [this](../../../docs/source/released_model.md).
@@ -163,7 +158,7 @@ source path.sh
 # If you have process the data and get the manifest file you can skip the following 2 steps
 bash local/data.sh --stage -1 --stop_stage -1
 bash local/data.sh --stage 2 --stop_stage 2
-CUDA_VISIBLE_DEVICES= ./local/test.sh conf/conformer.yaml exp/conformer/checkpoints/avg_10
+CUDA_VISIBLE_DEVICES= ./local/test.sh conf/conformer.yaml conf/tuning/decode.yaml exp/conformer/checkpoints/avg_10
 ```
 The performance of the released models are shown in [here](./RESULTS.md).
@@ -186,5 +181,5 @@ wget -nc https://paddlespeech.bj.bcebos.com/datasets/single_wav/zh/demo_01_03.wa
 ```
 You need to prepare an audio file or use the audio demo above, please confirm the sample rate of the audio is 16K. You can get the result of the audio demo by running the script below.
 ```bash
-CUDA_VISIBLE_DEVICES= ./local/test_wav.sh conf/conformer.yaml exp/conformer/checkpoints/avg_10 data/demo_01_03.wav
+CUDA_VISIBLE_DEVICES= ./local/test_wav.sh conf/conformer.yaml conf/tuning/decode.yaml exp/conformer/checkpoints/avg_10 data/demo_01_03.wav
 ```

@@ -75,7 +75,7 @@ class DeepSpeech2Tester_hub():
         feat = self.preprocessing(audio, **self.preprocess_args)
         logger.info(f"feat shape: {feat.shape}")
-        audio_len = paddle.to_tensor(feat.shape[0])
+        audio_len = paddle.to_tensor(feat.shape[0]).unsqueeze(0)
         audio = paddle.to_tensor(feat, dtype='float32').unsqueeze(axis=0)
         result_transcripts = self.compute_result_transcripts(
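The single-utterance entry points above build a batch of one, so the utterance length also needs a leading batch axis; on recent Paddle releases `paddle.to_tensor` of a Python int yields a 0-D tensor, and the explicit `.unsqueeze(0)` gives it the `(1,)`-shaped axis the model expects. A minimal sketch with illustrative shapes:

```python
import paddle

feat = paddle.randn([123, 80])  # one utterance: (time, feature)
audio = paddle.to_tensor(feat, dtype='float32').unsqueeze(axis=0)  # (1, 123, 80)

audio_len = paddle.to_tensor(feat.shape[0])               # scalar (0-D) tensor
audio_len = paddle.to_tensor(feat.shape[0]).unsqueeze(0)  # shape (1,), matches the batch of one
```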

@@ -18,7 +18,7 @@ from yacs.config import CfgNode
 from paddlespeech.s2t.exps.hubert.model import HubertASRTester as Tester
 from paddlespeech.s2t.training.cli import default_argument_parser
-from paddlespeech.s2t.utils.utility import print_arguments
+from paddlespeech.utils.argparse import print_arguments
 def main_sp(config, args):

@@ -19,7 +19,7 @@ from yacs.config import CfgNode
 from paddlespeech.s2t.exps.hubert.model import HubertASRTrainer as Trainer
 from paddlespeech.s2t.training.cli import default_argument_parser
-from paddlespeech.s2t.utils.utility import print_arguments
+from paddlespeech.utils.argparse import print_arguments
 def main_sp(config, args):

@@ -75,7 +75,7 @@ class U2Infer():
         feat = self.preprocessing(audio, **self.preprocess_args)
         logger.info(f"feat shape: {feat.shape}")
-        ilen = paddle.to_tensor(feat.shape[0])
+        ilen = paddle.to_tensor(feat.shape[0]).unsqueeze(0)
         xs = paddle.to_tensor(feat, dtype='float32').unsqueeze(0)
         decode_config = self.config.decode
         logger.info(f"decode cfg: {decode_config}")

@@ -78,7 +78,7 @@ class U2Infer():
             if self.args.debug:
                 np.savetxt("feat.transform.txt", feat)
-            ilen = paddle.to_tensor(feat.shape[0])
+            ilen = paddle.to_tensor(feat.shape[0]).unsqueeze(0)
             xs = paddle.to_tensor(feat, dtype='float32').unsqueeze(0)
             decode_config = self.config.decode
             logger.info(f"decode cfg: {decode_config}")

@@ -33,7 +33,7 @@ from paddlespeech.s2t.io.speechbrain import data_pipeline
 from paddlespeech.s2t.io.speechbrain import dataio
 from paddlespeech.s2t.io.speechbrain import dataset
 from paddlespeech.s2t.io.speechbrain.dataloader import make_dataloader
-from paddlespeech.s2t.models.wavlm.processing.speech_augmentation import TimeDomainSpecAugment
+from paddlespeech.s2t.models.wav2vec2.processing.speech_augmentation import TimeDomainSpecAugment
 from paddlespeech.s2t.models.wavlm.wavlm_asr import WavLMASR
 from paddlespeech.s2t.training.optimizer import OptimizerFactory
 from paddlespeech.s2t.training.reporter import ObsScope
@@ -428,8 +428,7 @@ class WavLMASRTrainer(Trainer):
                 report("epoch", self.epoch)
                 report('step', self.iteration)
                 report("model_lr", self.model_optimizer.get_lr())
-                report("wavlm_lr",
-                       self.wavlm_optimizer.get_lr())
+                report("wavlm_lr", self.wavlm_optimizer.get_lr())
                 self.train_batch(batch_index, batch, msg)
                 self.after_train_batch()
                 report('iter', batch_index + 1)
@@ -680,8 +679,7 @@ class WavLMASRTrainer(Trainer):
         logger.info("optim_model:{},{}", model_optim_type, model_optim_conf)
         wavlm_optim_type = train_config.wavlm_optim
         wavlm_optim_conf = train_config.wavlm_optim_conf
-        logger.info("optim_model:{},{}", wavlm_optim_type,
-                    wavlm_optim_conf)
+        logger.info("optim_model:{},{}", wavlm_optim_type, wavlm_optim_conf)
         model_scheduler_type = train_config.model_scheduler
         model_scheduler_conf = train_config.model_scheduler_conf
@@ -698,8 +696,8 @@ class WavLMASRTrainer(Trainer):
         model_lr_scheduler = LRSchedulerFactory.from_args(model_scheduler_type,
                                                           model_scheduler_args)
-        wavlm_lr_scheduler = LRSchedulerFactory.from_args(
-            wavlm_scheduler_type, wavlm_scheduler_args)
+        wavlm_lr_scheduler = LRSchedulerFactory.from_args(wavlm_scheduler_type,
+                                                          wavlm_scheduler_args)
         def optimizer_args(
                 config,
@@ -716,24 +714,31 @@ class WavLMASRTrainer(Trainer):
             })
             return optim_arg
-        model_optimizer_args = optimizer_args(
-            config, model_optim_type,
-            model_optim_conf,
-            [{'params': model._layers.enc.parameters()}, {'params': model._layers.ctc.parameters()}] if self.parallel else [{'params': model.enc.parameters()}, {'params': model.ctc.parameters()}],
-            model_lr_scheduler
-        )
+        model_optimizer_args = optimizer_args(config, model_optim_type,
+                                              model_optim_conf, [{
+                                                  'params':
+                                                  model._layers.enc.parameters()
+                                              }, {
+                                                  'params':
+                                                  model._layers.ctc.parameters()
+                                              }] if self.parallel else [{
+                                                  'params':
+                                                  model.enc.parameters()
+                                              }, {
+                                                  'params':
+                                                  model.ctc.parameters()
+                                              }], model_lr_scheduler)
         # [{'params': model._layers.ctc.parameters()}] if self.parallel else [{'params': model.ctc.parameters()}], model_lr_scheduler)
         wavlm_optimizer_args = optimizer_args(
             config, wavlm_optim_type, wavlm_optim_conf,
-            model._layers.wavlm.parameters() if self.parallel else
-            model.wavlm.parameters(), wavlm_lr_scheduler)
+            model._layers.wavlm.parameters()
+            if self.parallel else model.wavlm.parameters(), wavlm_lr_scheduler)
         model_optimizer = OptimizerFactory.from_args(model_optim_type,
                                                      model_optimizer_args)
         wavlm_optimizer = OptimizerFactory.from_args(wavlm_optim_type,
                                                      wavlm_optimizer_args)
         self.model_optimizer = model_optimizer
         self.wavlm_optimizer = wavlm_optimizer
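The trainer above builds two optimizers, and the model-side one receives its parameters as two groups (encoder and CTC head). A stripped-down sketch of that parameter-group pattern with Paddle; the module names here are illustrative stand-ins, not the trainer's real layers:

```python
import paddle
import paddle.nn as nn

# Illustrative stand-ins for the encoder and CTC-head parameter groups.
enc = nn.Linear(16, 16)
ctc = nn.Linear(16, 4)

# Paddle optimizers accept a list of parameter groups, each a dict with a
# 'params' entry, mirroring the optimizer_args(...) call in the trainer.
optimizer = paddle.optimizer.Adam(
    learning_rate=1e-4,
    parameters=[{"params": enc.parameters()}, {"params": ctc.parameters()}])
```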

@@ -129,7 +129,7 @@ def _compute_mask_indices(
         [sequence_length for _ in range(batch_size)])
     # SpecAugment mask to fill
-    spec_aug_mask = np.zeros((batch_size, sequence_length), dtype=np.bool)
+    spec_aug_mask = np.zeros((batch_size, sequence_length), dtype=np.bool_)
     spec_aug_mask_idxs = []
     max_num_masked_span = compute_num_masked_span(sequence_length)
@@ -207,9 +207,9 @@ def _sample_negative_indices(features_shape: Tuple,
     sampled_negative_indices = np.zeros(
         shape=(batch_size, sequence_length, num_negatives), dtype=np.int32)
-    mask_time_indices = (mask_time_indices.astype(np.bool)
+    mask_time_indices = (mask_time_indices.astype(np.bool_)
                          if mask_time_indices is not None else
-                         np.ones(features_shape, dtype=np.bool))
+                         np.ones(features_shape, dtype=np.bool_))
     for batch_idx in range(batch_size):
         high = mask_time_indices[batch_idx].sum() - 1

@@ -1476,7 +1476,7 @@ def compute_mask_indices(
                 lens = np.fromiter(
                     (e - s if e - s >= length + min_space else 0
                      for s, e in parts),
-                    np.int, )
+                    np.int_, )
                 l_sum = np.sum(lens)
                 if l_sum == 0:
                     break
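The `np.bool` and `np.int` replacements above track NumPy's removal of the deprecated builtin-shadowing aliases (deprecated in 1.20, removed in 1.24); the trailing-underscore scalar types are the supported spellings. A one-look reminder:

```python
import numpy as np

# np.bool / np.int / np.float were removed in NumPy 1.24; use the
# underscore scalar types (or the Python builtins) instead.
spec_aug_mask = np.zeros((2, 5), dtype=np.bool_)
lens = np.fromiter((n for n in (3, 1, 4)), dtype=np.int_)
f0 = np.asarray([120.0, 0.0, 118.5], dtype=np.float_)
```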

@@ -6,40 +6,38 @@
 # Based on fairseq code bases
 # https://github.com/pytorch/fairseq
 # --------------------------------------------------------
-import math
 import logging
-from typing import List, Optional, Tuple
+import math
+from typing import List
+from typing import Optional
+from typing import Tuple
 import numpy as np
 import paddle
 import paddle.nn as nn
 import paddle.nn.functional as F
-from paddle.nn import LayerNorm
 from paddle import Tensor
-from .modules.modules import (
-    MultiheadAttention,
-    SamePad,
-    get_activation_fn,
-    TransposeLast,
-    GLU_Linear,
-)
+from paddle.nn import LayerNorm
+from .modules.modules import get_activation_fn
+from .modules.modules import GLU_Linear
+from .modules.modules import MultiheadAttention
+from .modules.modules import SamePad
+from .modules.modules import TransposeLast
 logger = logging.getLogger(__name__)
 def compute_mask_indices(
         shape: Tuple[int, int],
         padding_mask: Optional[Tensor],
         mask_prob: float,
         mask_length: int,
-        mask_type: str = "static",
-        mask_other: float = 0.0,
-        min_masks: int = 0,
-        no_overlap: bool = False,
-        min_space: int = 0,
-) -> np.ndarray:
+        mask_type: str="static",
+        mask_other: float=0.0,
+        min_masks: int=0,
+        no_overlap: bool=False,
+        min_space: int=0, ) -> np.ndarray:
""" """
Computes random mask spans for a given shape Computes random mask spans for a given shape
@ -65,9 +63,7 @@ def compute_mask_indices(
all_num_mask = int( all_num_mask = int(
# add a random number for probabilistic rounding # add a random number for probabilistic rounding
mask_prob * all_sz / float(mask_length) mask_prob * all_sz / float(mask_length) + np.random.rand())
+ np.random.rand()
)
all_num_mask = max(min_masks, all_num_mask) all_num_mask = max(min_masks, all_num_mask)
@ -77,9 +73,7 @@ def compute_mask_indices(
sz = all_sz - padding_mask[i].long().sum().item() sz = all_sz - padding_mask[i].long().sum().item()
num_mask = int( num_mask = int(
# add a random number for probabilistic rounding # add a random number for probabilistic rounding
mask_prob * sz / float(mask_length) mask_prob * sz / float(mask_length) + np.random.rand())
+ np.random.rand()
)
num_mask = max(min_masks, num_mask) num_mask = max(min_masks, num_mask)
else: else:
sz = all_sz sz = all_sz
@ -88,7 +82,8 @@ def compute_mask_indices(
if mask_type == "static": if mask_type == "static":
lengths = np.full(num_mask, mask_length) lengths = np.full(num_mask, mask_length)
elif mask_type == "uniform": elif mask_type == "uniform":
lengths = np.random.randint(mask_other, mask_length * 2 + 1, size=num_mask) lengths = np.random.randint(
mask_other, mask_length * 2 + 1, size=num_mask)
elif mask_type == "normal": elif mask_type == "normal":
lengths = np.random.normal(mask_length, mask_other, size=num_mask) lengths = np.random.normal(mask_length, mask_other, size=num_mask)
lengths = [max(1, int(round(x))) for x in lengths] lengths = [max(1, int(round(x))) for x in lengths]
@ -119,9 +114,9 @@ def compute_mask_indices(
min_length = min(lengths) min_length = min(lengths)
for length in sorted(lengths, reverse=True): for length in sorted(lengths, reverse=True):
lens = np.fromiter( lens = np.fromiter(
(e - s if e - s >= length + min_space else 0 for s, e in parts), (e - s if e - s >= length + min_space else 0
np.int, for s, e in parts),
) np.int_, )
l_sum = np.sum(lens) l_sum = np.sum(lens)
if l_sum == 0: if l_sum == 0:
break break
@ -137,13 +132,10 @@ def compute_mask_indices(
mask_idc = np.random.choice(sz - min_len, num_mask, replace=False) mask_idc = np.random.choice(sz - min_len, num_mask, replace=False)
mask_idc = np.asarray( mask_idc = np.asarray([
[ mask_idc[j] + offset
mask_idc[j] + offset for j in range(len(mask_idc)) for offset in range(lengths[j])
for j in range(len(mask_idc)) ])
for offset in range(lengths[j])
]
)
mask_idcs.append(np.unique(mask_idc[mask_idc < sz])) mask_idcs.append(np.unique(mask_idc[mask_idc < sz]))
@@ -158,54 +150,54 @@ compute_mask_indices(
 class WavLMConfig:
     def __init__(self, cfg=None):
         self.extractor_mode: str = "default"  # mode for feature extractor. default has a single group norm with d groups in the first conv block, whereas layer_norm has layer norms in every block (meant to use with normalize=True)
         self.encoder_layers: int = 12  # num encoder layers in the transformer
         self.encoder_embed_dim: int = 768  # encoder embedding dimension
         self.encoder_ffn_embed_dim: int = 3072  # encoder embedding dimension for FFN
         self.encoder_attention_heads: int = 12  # num encoder attention heads
         self.activation_fn: str = "gelu"  # activation function to use
         self.layer_norm_first: bool = False  # apply layernorm first in the transformer
         self.conv_feature_layers: str = "[(512,10,5)] + [(512,3,2)] * 4 + [(512,2,2)] * 2"  # string describing convolutional feature extraction layers in form of a python list that contains [(dim, kernel_size, stride), ...]
         self.conv_bias: bool = False  # include bias in conv encoder
         self.feature_grad_mult: float = 1.0  # multiply feature extractor var grads by this
         self.normalize: bool = False  # normalize input to have 0 mean and unit variance during training
         # dropouts
         self.dropout: float = 0.1  # dropout probability for the transformer
         self.attention_dropout: float = 0.1  # dropout probability for attention weights
         self.activation_dropout: float = 0.0  # dropout probability after activation in FFN
         self.encoder_layerdrop: float = 0.0  # probability of dropping a tarnsformer layer
         self.dropout_input: float = 0.0  # dropout to apply to the input (after feat extr)
         self.dropout_features: float = 0.0  # dropout to apply to the features (after feat extr)
         # masking
         self.mask_length: int = 10  # mask length
         self.mask_prob: float = 0.65  # probability of replacing a token with mask
         self.mask_selection: str = "static"  # how to choose mask length
         self.mask_other: float = 0  # secondary mask argument (used for more complex distributions), see help in compute_mask_indicesh
         self.no_mask_overlap: bool = False  # whether to allow masks to overlap
         self.mask_min_space: int = 1  # min space between spans (if no overlap is enabled)
         # channel masking
         self.mask_channel_length: int = 10  # length of the mask for features (channels)
         self.mask_channel_prob: float = 0.0  # probability of replacing a feature with 0
         self.mask_channel_selection: str = "static"  # how to choose mask length for channel masking
         self.mask_channel_other: float = 0  # secondary mask argument (used for more complex distributions), see help in compute_mask_indices
         self.no_mask_channel_overlap: bool = False  # whether to allow channel masks to overlap
         self.mask_channel_min_space: int = 1  # min space between spans (if no overlap is enabled)
         # positional embeddings
         self.conv_pos: int = 128  # number of filters for convolutional positional embeddings
         self.conv_pos_groups: int = 16  # number of groups for convolutional positional embedding
         # relative position embedding
         self.relative_position_embedding: bool = True  # apply relative position embedding
         self.num_buckets: int = 320  # number of buckets for relative position embedding
         self.max_distance: int = 1280  # maximum distance for relative position embedding
         self.gru_rel_pos: bool = True  # apply gated relative position embedding
         if cfg is not None:
             self.update(cfg)
@ -216,9 +208,8 @@ class WavLMConfig:
class WavLM(nn.Layer): class WavLM(nn.Layer):
def __init__( def __init__(
self, self,
cfg: WavLMConfig, cfg: WavLMConfig, ) -> None:
) -> None:
super().__init__() super().__init__()
logger.info(f"WavLM Config: {cfg.__dict__}") logger.info(f"WavLM Config: {cfg.__dict__}")
@ -230,14 +221,11 @@ class WavLM(nn.Layer):
conv_layers=feature_enc_layers, conv_layers=feature_enc_layers,
dropout=0.0, dropout=0.0,
mode=cfg.extractor_mode, mode=cfg.extractor_mode,
conv_bias=cfg.conv_bias, conv_bias=cfg.conv_bias, )
)
self.post_extract_proj = ( self.post_extract_proj = (nn.Linear(self.embed, cfg.encoder_embed_dim)
nn.Linear(self.embed, cfg.encoder_embed_dim) if self.embed != cfg.encoder_embed_dim else
if self.embed != cfg.encoder_embed_dim None)
else None
)
self.mask_prob = cfg.mask_prob self.mask_prob = cfg.mask_prob
self.mask_selection = cfg.mask_selection self.mask_selection = cfg.mask_selection
@ -260,8 +248,7 @@ class WavLM(nn.Layer):
self.mask_emb = self.create_parameter( self.mask_emb = self.create_parameter(
shape=[cfg.encoder_embed_dim], shape=[cfg.encoder_embed_dim],
default_initializer=nn.initializer.Uniform(), default_initializer=nn.initializer.Uniform(), )
)
self.encoder = TransformerEncoder(cfg) self.encoder = TransformerEncoder(cfg)
self.layer_norm = LayerNorm(self.embed) self.layer_norm = LayerNorm(self.embed)
@ -278,8 +265,7 @@ class WavLM(nn.Layer):
self.mask_other, self.mask_other,
min_masks=2, min_masks=2,
no_overlap=self.no_mask_overlap, no_overlap=self.no_mask_overlap,
min_space=self.mask_min_space, min_space=self.mask_min_space, )
)
# mask_indices = torch.from_numpy(mask_indices).to(x.device) # mask_indices = torch.from_numpy(mask_indices).to(x.device)
mask_indices = paddle.to_tensor(mask_indices, dtype='int64') mask_indices = paddle.to_tensor(mask_indices, dtype='int64')
x[mask_indices] = self.mask_emb x[mask_indices] = self.mask_emb
@ -295,40 +281,35 @@ class WavLM(nn.Layer):
self.mask_channel_selection, self.mask_channel_selection,
self.mask_channel_other, self.mask_channel_other,
no_overlap=self.no_mask_channel_overlap, no_overlap=self.no_mask_channel_overlap,
min_space=self.mask_channel_min_space, min_space=self.mask_channel_min_space, )
)
mask_channel_indices = ( mask_channel_indices = (
# torch.from_numpy(mask_channel_indices) # torch.from_numpy(mask_channel_indices)
paddle.to_tensor(mask_channel_indices, dtype='int64') paddle.to_tensor(mask_channel_indices, dtype='int64')
.to(x.device) .to(x.device).unsqueeze(1).expand(-1, T, -1))
.unsqueeze(1)
.expand(-1, T, -1)
)
x[mask_channel_indices] = 0 x[mask_channel_indices] = 0
return x, mask_indices return x, mask_indices
def forward_padding_mask( def forward_padding_mask(
self, features: Tensor, padding_mask: Tensor, self,
) -> Tensor: features: Tensor,
padding_mask: Tensor, ) -> Tensor:
extra = padding_mask.size(1) % features.size(1) extra = padding_mask.size(1) % features.size(1)
if extra > 0: if extra > 0:
padding_mask = padding_mask[:, :-extra] padding_mask = padding_mask[:, :-extra]
padding_mask = padding_mask.view( padding_mask = padding_mask.view(
padding_mask.size(0), features.size(1), -1 padding_mask.size(0), features.size(1), -1)
)
padding_mask = padding_mask.all(-1) padding_mask = padding_mask.all(-1)
return padding_mask return padding_mask
def extract_features( def extract_features(
self, self,
source: Tensor, source: Tensor,
padding_mask: Optional[Tensor] = None, padding_mask: Optional[Tensor]=None,
mask: bool = False, mask: bool=False,
ret_conv: bool = False, ret_conv: bool=False,
output_layer: Optional[int] = None, output_layer: Optional[int]=None,
ret_layer_results: bool = False, ret_layer_results: bool=False, ):
):
if self.feature_grad_mult > 0: if self.feature_grad_mult > 0:
features = self.feature_extractor(source) features = self.feature_extractor(source)
@ -339,7 +320,7 @@ class WavLM(nn.Layer):
with paddle.no_grad(): with paddle.no_grad():
features = self.feature_extractor(source) features = self.feature_extractor(source)
features = features.transpose([0, 2, 1]) # [1, 49, 512] features = features.transpose([0, 2, 1]) # [1, 49, 512]
features = self.layer_norm(features) features = self.layer_norm(features)
if padding_mask is not None: if padding_mask is not None:
@ -351,9 +332,7 @@ class WavLM(nn.Layer):
features = self.dropout_input(features) features = self.dropout_input(features)
if mask: if mask:
x, mask_indices = self.apply_mask( x, mask_indices = self.apply_mask(features, padding_mask)
features, padding_mask
)
else: else:
x = features x = features
@ -366,10 +345,14 @@ class WavLM(nn.Layer):
x, layer_results = self.encoder( x, layer_results = self.encoder(
x, x,
padding_mask=padding_mask, padding_mask=padding_mask,
layer=None if output_layer is None else output_layer - 1 layer=None if output_layer is None else output_layer - 1)
)
# print(f"Debugging: x.shape: {x.shape}, x.mean(): {x.mean()}, x.std(): {x.std()}") # print(f"Debugging: x.shape: {x.shape}, x.mean(): {x.mean()}, x.std(): {x.std()}")
res = {"x": x, "padding_mask": padding_mask, "features": features, "layer_results": layer_results} res = {
"x": x,
"padding_mask": padding_mask,
"features": features,
"layer_results": layer_results
}
feature = res["features"] if ret_conv else res["x"] feature = res["features"] if ret_conv else res["x"]
if ret_layer_results: if ret_layer_results:
@ -381,14 +364,12 @@ class WavLM(nn.Layer):
class ConvFeatureExtractionModel(nn.Layer): class ConvFeatureExtractionModel(nn.Layer):
def __init__( def __init__(self,
self, conv_layers: List[Tuple[int, int, int]],
conv_layers: List[Tuple[int, int, int]], dropout: float=0.0,
dropout: float = 0.0, mode: str="default",
mode: str = "default", conv_bias: bool=False,
conv_bias: bool = False, conv_type: str="default"):
conv_type: str = "default"
):
super().__init__() super().__init__()
assert mode in {"default", "layer_norm"} assert mode in {"default", "layer_norm"}
@ -400,17 +381,20 @@ class ConvFeatureExtractionModel(nn.Layer):
stride, stride,
is_layer_norm=False, is_layer_norm=False,
is_group_norm=False, is_group_norm=False,
conv_bias=False, conv_bias=False, ):
):
def make_conv(): def make_conv():
conv = nn.Conv1D(n_in, n_out, k, stride=stride, bias_attr=conv_bias, conv = nn.Conv1D(
weight_attr=nn.initializer.KaimingNormal()) n_in,
n_out,
k,
stride=stride,
bias_attr=conv_bias,
weight_attr=nn.initializer.KaimingNormal())
# nn.init.kaiming_normal_(conv.weight) # nn.init.kaiming_normal_(conv.weight)
return conv return conv
assert ( assert (is_layer_norm and is_group_norm
is_layer_norm and is_group_norm ) == False, "layer norm and group norm are exclusive"
) == False, "layer norm and group norm are exclusive"
if is_layer_norm: if is_layer_norm:
return nn.Sequential( return nn.Sequential(
@ -419,19 +403,18 @@ class ConvFeatureExtractionModel(nn.Layer):
nn.Sequential( nn.Sequential(
TransposeLast(), TransposeLast(),
nn.LayerNorm(normalized_shape=dim, epsilon=1e-5), nn.LayerNorm(normalized_shape=dim, epsilon=1e-5),
TransposeLast(), TransposeLast(), ),
), nn.GELU(), )
nn.GELU(),
)
elif is_group_norm: elif is_group_norm:
return nn.Sequential( return nn.Sequential(
make_conv(), make_conv(),
nn.Dropout(p=dropout), nn.Dropout(p=dropout),
nn.GroupNorm(num_groups=dim, num_channels=dim, epsilon=1e-5), nn.GroupNorm(
nn.GELU(), num_groups=dim, num_channels=dim, epsilon=1e-5),
) nn.GELU(), )
else: else:
return nn.Sequential(make_conv(), nn.Dropout(p=dropout), nn.GELU()) return nn.Sequential(
make_conv(), nn.Dropout(p=dropout), nn.GELU())
self.conv_type = conv_type self.conv_type = conv_type
if self.conv_type == "default": if self.conv_type == "default":
@ -449,9 +432,7 @@ class ConvFeatureExtractionModel(nn.Layer):
stride, stride,
is_layer_norm=mode == "layer_norm", is_layer_norm=mode == "layer_norm",
is_group_norm=mode == "default" and i == 0, is_group_norm=mode == "default" and i == 0,
conv_bias=conv_bias, conv_bias=conv_bias, ))
)
)
in_d = dim in_d = dim
elif self.conv_type == "conv2d": elif self.conv_type == "conv2d":
in_d = 1 in_d = 1
@ -460,9 +441,7 @@ class ConvFeatureExtractionModel(nn.Layer):
assert len(cl) == 3 assert len(cl) == 3
(dim, k, stride) = cl (dim, k, stride) = cl
self.conv_layers.append( self.conv_layers.append(paddle.nn.Conv2D(in_d, dim, k, stride))
paddle.nn.Conv2D(in_d, dim, k, stride)
)
self.conv_layers.append(paddle.nn.ReLU()) self.conv_layers.append(paddle.nn.ReLU())
in_d = dim in_d = dim
elif self.conv_type == "custom": elif self.conv_type == "custom":
@ -473,17 +452,13 @@ class ConvFeatureExtractionModel(nn.Layer):
assert len(cl) == 3 assert len(cl) == 3
(dim, k, stride) = cl (dim, k, stride) = cl
self.conv_layers.append( self.conv_layers.append(
paddle.nn.Conv2D(in_d, dim, k, stride, padding=1) paddle.nn.Conv2D(in_d, dim, k, stride, padding=1))
) self.conv_layers.append(paddle.nn.LayerNorm([dim, idim]))
self.conv_layers.append(
paddle.nn.LayerNorm([dim, idim])
)
self.conv_layers.append(paddle.nn.ReLU()) self.conv_layers.append(paddle.nn.ReLU())
in_d = dim in_d = dim
if (i + 1) % 2 == 0: if (i + 1) % 2 == 0:
self.conv_layers.append( self.conv_layers.append(
paddle.nn.MaxPool2D(2, stride=2, ceil_mode=True) paddle.nn.MaxPool2D(2, stride=2, ceil_mode=True))
)
idim = int(math.ceil(idim / 2)) idim = int(math.ceil(idim / 2))
else: else:
pass pass
@ -518,8 +493,8 @@ class TransformerEncoder(nn.Layer):
self.dropout = args.dropout self.dropout = args.dropout
self.embedding_dim = args.encoder_embed_dim self.embedding_dim = args.encoder_embed_dim
dropout = 0 dropout = 0
std = math.sqrt((4 * (1.0 - dropout)) / (args.conv_pos * self.embedding_dim)) std = math.sqrt(
(4 * (1.0 - dropout)) / (args.conv_pos * self.embedding_dim))
self.pos_conv = nn.Conv1D( self.pos_conv = nn.Conv1D(
self.embedding_dim, self.embedding_dim,
@ -528,15 +503,16 @@ class TransformerEncoder(nn.Layer):
padding=args.conv_pos // 2, padding=args.conv_pos // 2,
groups=args.conv_pos_groups, groups=args.conv_pos_groups,
weight_attr=nn.initializer.Normal(mean=0, std=std), weight_attr=nn.initializer.Normal(mean=0, std=std),
bias_attr=True bias_attr=True)
)
# nn.init.normal_(self.pos_conv.weight, mean=0, std=std) # nn.init.normal_(self.pos_conv.weight, mean=0, std=std)
# nn.init.constant_(self.pos_conv.bias, 0) # nn.init.constant_(self.pos_conv.bias, 0)
# self.pos_conv = nn.utils.weight_norm(self.pos_conv, name="weight", dim=2) # self.pos_conv = nn.utils.weight_norm(self.pos_conv, name="weight", dim=2)
# self.pos_conv.weight_g = self.pos_conv.weight_g.unsqueeze(0).unsqueeze(0) # self.pos_conv.weight_g = self.pos_conv.weight_g.unsqueeze(0).unsqueeze(0)
self.pos_conv = nn.utils.weight_norm(self.pos_conv, name="weight", dim=2) self.pos_conv = nn.utils.weight_norm(
self.pos_conv = nn.Sequential(self.pos_conv, SamePad(args.conv_pos), nn.GELU()) self.pos_conv, name="weight", dim=2)
self.pos_conv = nn.Sequential(self.pos_conv,
SamePad(args.conv_pos), nn.GELU())
if hasattr(args, "relative_position_embedding"): if hasattr(args, "relative_position_embedding"):
self.relative_position_embedding = args.relative_position_embedding self.relative_position_embedding = args.relative_position_embedding
@ -547,25 +523,23 @@ class TransformerEncoder(nn.Layer):
self.num_buckets = 0 self.num_buckets = 0
self.max_distance = 0 self.max_distance = 0
self.layers = nn.LayerList( self.layers = nn.LayerList([
[ TransformerSentenceEncoderLayer(
TransformerSentenceEncoderLayer( embedding_dim=self.embedding_dim,
embedding_dim=self.embedding_dim, ffn_embedding_dim=args.encoder_ffn_embed_dim,
ffn_embedding_dim=args.encoder_ffn_embed_dim, num_attention_heads=args.encoder_attention_heads,
num_attention_heads=args.encoder_attention_heads, dropout=self.dropout,
dropout=self.dropout, attention_dropout=args.attention_dropout,
attention_dropout=args.attention_dropout, activation_dropout=args.activation_dropout,
activation_dropout=args.activation_dropout, activation_fn=args.activation_fn,
activation_fn=args.activation_fn, layer_norm_first=args.layer_norm_first,
layer_norm_first=args.layer_norm_first, has_relative_attention_bias=(
has_relative_attention_bias=(self.relative_position_embedding and i == 0), self.relative_position_embedding and i == 0),
num_buckets=self.num_buckets, num_buckets=self.num_buckets,
max_distance=self.max_distance, max_distance=self.max_distance,
gru_rel_pos=args.gru_rel_pos, gru_rel_pos=args.gru_rel_pos, )
) for i in range(args.encoder_layers)
for i in range(args.encoder_layers) ])
]
)
self.layer_norm_first = args.layer_norm_first self.layer_norm_first = args.layer_norm_first
self.layer_norm = LayerNorm(self.embedding_dim) self.layer_norm = LayerNorm(self.embedding_dim)
@ -574,14 +548,19 @@ class TransformerEncoder(nn.Layer):
# self.apply(init_bert_params) # self.apply(init_bert_params)
def forward(self, x, padding_mask=None, streaming_mask=None, layer=None): def forward(self, x, padding_mask=None, streaming_mask=None, layer=None):
x, layer_results = self.extract_features(x, padding_mask, streaming_mask, layer) x, layer_results = self.extract_features(x, padding_mask,
streaming_mask, layer)
# print("x.shape", x.shape) # print("x.shape", x.shape)
if self.layer_norm_first and layer is None: if self.layer_norm_first and layer is None:
x = self.layer_norm(x) x = self.layer_norm(x)
return x, layer_results return x, layer_results
def extract_features(self, x, padding_mask=None, streaming_mask=None, tgt_layer=None): def extract_features(self,
x,
padding_mask=None,
streaming_mask=None,
tgt_layer=None):
if padding_mask is not None: if padding_mask is not None:
x[padding_mask] = 0 x[padding_mask] = 0
@ -598,7 +577,6 @@ class TransformerEncoder(nn.Layer):
# x = x.transpose(0, 1) # x = x.transpose(0, 1)
x = x.transpose([1, 0, 2]) x = x.transpose([1, 0, 2])
layer_results = [] layer_results = []
z = None z = None
if tgt_layer is not None: if tgt_layer is not None:
@ -608,7 +586,12 @@ class TransformerEncoder(nn.Layer):
for i, layer in enumerate(self.layers): for i, layer in enumerate(self.layers):
dropout_probability = np.random.random() dropout_probability = np.random.random()
if not self.training or (dropout_probability > self.layerdrop): if not self.training or (dropout_probability > self.layerdrop):
x, z, pos_bias = layer(x, self_attn_padding_mask=padding_mask, need_weights=False,self_attn_mask=streaming_mask, pos_bias=pos_bias) x, z, pos_bias = layer(
x,
self_attn_padding_mask=padding_mask,
need_weights=False,
self_attn_mask=streaming_mask,
pos_bias=pos_bias)
if tgt_layer is not None: if tgt_layer is not None:
layer_results.append((x, z)) layer_results.append((x, z))
if i == tgt_layer: if i == tgt_layer:
@ -633,20 +616,19 @@ class TransformerSentenceEncoderLayer(nn.Layer):
def __init__( def __init__(
self, self,
embedding_dim: float = 768, embedding_dim: float=768,
ffn_embedding_dim: float = 3072, ffn_embedding_dim: float=3072,
num_attention_heads: float = 8, num_attention_heads: float=8,
dropout: float = 0.1, dropout: float=0.1,
attention_dropout: float = 0.1, attention_dropout: float=0.1,
activation_dropout: float = 0.1, activation_dropout: float=0.1,
activation_fn: str = "relu", activation_fn: str="relu",
layer_norm_first: bool = False, layer_norm_first: bool=False,
has_relative_attention_bias: bool = True, has_relative_attention_bias: bool=True,
num_buckets: int = 0, num_buckets: int=0,
max_distance: int = 0, max_distance: int=0,
rescale_init: bool = False, rescale_init: bool=False,
gru_rel_pos: bool = True, gru_rel_pos: bool=True, ) -> None:
) -> None:
super().__init__() super().__init__()
# Initialize parameters # Initialize parameters
@ -666,8 +648,7 @@ class TransformerSentenceEncoderLayer(nn.Layer):
num_buckets=num_buckets, num_buckets=num_buckets,
max_distance=max_distance, max_distance=max_distance,
rescale_init=rescale_init, rescale_init=rescale_init,
gru_rel_pos=gru_rel_pos, gru_rel_pos=gru_rel_pos, )
)
self.dropout1 = nn.Dropout(dropout) self.dropout1 = nn.Dropout(dropout)
self.dropout2 = nn.Dropout(self.activation_dropout) self.dropout2 = nn.Dropout(self.activation_dropout)
@ -679,7 +660,8 @@ class TransformerSentenceEncoderLayer(nn.Layer):
self.self_attn_layer_norm = LayerNorm(self.embedding_dim) self.self_attn_layer_norm = LayerNorm(self.embedding_dim)
if self.activation_name == "glu": if self.activation_name == "glu":
self.fc1 = GLU_Linear(self.embedding_dim, ffn_embedding_dim, "swish") self.fc1 = GLU_Linear(self.embedding_dim, ffn_embedding_dim,
"swish")
else: else:
self.fc1 = nn.Linear(self.embedding_dim, ffn_embedding_dim) self.fc1 = nn.Linear(self.embedding_dim, ffn_embedding_dim)
self.fc2 = nn.Linear(ffn_embedding_dim, self.embedding_dim) self.fc2 = nn.Linear(ffn_embedding_dim, self.embedding_dim)
@ -687,14 +669,12 @@ class TransformerSentenceEncoderLayer(nn.Layer):
# layer norm associated with the position wise feed-forward NN # layer norm associated with the position wise feed-forward NN
self.final_layer_norm = LayerNorm(self.embedding_dim) self.final_layer_norm = LayerNorm(self.embedding_dim)
def forward( def forward(self,
self, x: Tensor,
x: Tensor, self_attn_mask: Tensor=None,
self_attn_mask: Tensor = None, self_attn_padding_mask: Tensor=None,
self_attn_padding_mask: Tensor = None, need_weights: bool=False,
need_weights: bool = False, pos_bias=None):
pos_bias=None
):
""" """
LayerNorm is applied either before or after the self-attention/ffn LayerNorm is applied either before or after the self-attention/ffn
modules similar to the original Transformer imlementation. modules similar to the original Transformer imlementation.
@ -710,8 +690,7 @@ class TransformerSentenceEncoderLayer(nn.Layer):
key_padding_mask=self_attn_padding_mask, key_padding_mask=self_attn_padding_mask,
need_weights=False, need_weights=False,
attn_mask=self_attn_mask, attn_mask=self_attn_mask,
position_bias=pos_bias position_bias=pos_bias)
)
# import pdb; pdb.set_trace() # import pdb; pdb.set_trace()
x = self.dropout1(x) x = self.dropout1(x)
x = residual + x x = residual + x
@ -734,8 +713,7 @@ class TransformerSentenceEncoderLayer(nn.Layer):
key_padding_mask=self_attn_padding_mask, key_padding_mask=self_attn_padding_mask,
need_weights=need_weights, need_weights=need_weights,
attn_mask=self_attn_mask, attn_mask=self_attn_mask,
position_bias=pos_bias position_bias=pos_bias)
)
x = self.dropout1(x) x = self.dropout1(x)
x = residual + x x = residual + x

@@ -109,11 +109,11 @@ class MultiHeadAttention(nn.Layer):
         n_batch, n_ctx, n_state = q.shape
         scale = (n_state // self.n_head)**-0.25
         q = paddle.transpose(
-            q.view(*q.shape[:2], self.n_head, -1), (0, 2, 1, 3)) * scale
+            q.reshape([*q.shape[:2], self.n_head, -1]), (0, 2, 1, 3)) * scale
         k = paddle.transpose(
-            k.view(*k.shape[:2], self.n_head, -1), (0, 2, 3, 1)) * scale
+            k.reshape([*k.shape[:2], self.n_head, -1]), (0, 2, 3, 1)) * scale
         v = paddle.transpose(
-            v.view(*v.shape[:2], self.n_head, -1), (0, 2, 1, 3))
+            v.reshape([*v.shape[:2], self.n_head, -1]), (0, 2, 1, 3))
         qk = q @ k
         if mask is not None:
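The hunk above replaces the torch-style `.view(*shape)` calls with Paddle's `reshape([...])`, which takes the whole target shape as a single list. A small sketch of the resulting pattern (shapes are illustrative):

```python
import paddle

n_head = 8
q = paddle.randn([2, 10, 512])  # (batch, ctx, state)

# Pass the full target shape as one list to reshape(), then move heads
# ahead of the context axis: (batch, head, ctx, head_dim).
q = paddle.transpose(q.reshape([*q.shape[:2], n_head, -1]), (0, 2, 1, 3))
print(q.shape)  # [2, 8, 10, 64]
```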
@@ -823,7 +823,7 @@ class BeamSearchDecoder(TokenDecoder):
         if self.finished_sequences is None:  # for the first update
             self.finished_sequences = [{} for _ in range(batch_size)]
-        logprobs = F.log_softmax(logits, axis=-1, dtype=paddle.float32)
+        logprobs = F.log_softmax(logits, axis=-1, dtype='float32')
         next_tokens, source_indices, finished_sequences = [], [], []
         for i in range(batch_size):
             scores, sources, finished = {}, {}, {}

@@ -969,7 +969,7 @@ class ApplyTimestampRules(LogitFilter):
             logits[:, last_allowed + 1:] = -np.inf
         # if sum of probability over timestamps is above any other token, sample timestamp
-        logprobs = F.log_softmax(logits, axis=-1, dtype=paddle.float32)
+        logprobs = F.log_softmax(logits, axis=-1, dtype='float32')
         for k in range(tokens.shape[0]):
             # When using paddle.logsumexp on a 32GB Tesla-V100 GPU, we encountered CUDA error 700.
             # To bypass this issue in CI, we have decomposed the operation into separate steps.

@@ -138,7 +138,7 @@ class Pitch():
                        input: np.ndarray,
                        use_continuous_f0: bool=True,
                        use_log_f0: bool=True) -> np.ndarray:
-        input = input.astype(np.float)
+        input = input.astype(np.float_)
         frame_period = 1000 * self.hop_length / self.sr
         f0, timeaxis = pyworld.dio(
             input,

@@ -203,9 +203,9 @@ def main():
     sentences, speaker_set = get_phn_dur(dur_file)
     merge_silence(sentences)
-    # split data into 3 sections
     if args.dataset == "baker":
         wav_files = sorted(list((rootdir / "Wave").rglob("*.wav")))
+        # split data into 3 sections
         num_train = 9800
         num_dev = 100
         train_wav_files = wav_files[:num_train]

@@ -841,6 +841,9 @@ class FastSpeech2(nn.Layer):
             spk_emb = self.spk_projection(F.normalize(spk_emb))
             hs = hs + spk_emb.unsqueeze(1)
         elif self.spk_embed_integration_type == "concat":
+            # one wave `spk_emb` under synthesize, the dim is `1`
+            if spk_emb.dim() == 1:
+                spk_emb = spk_emb.unsqueeze(0)
             # concat hidden states with spk embeds and then apply projection
             spk_emb = F.normalize(spk_emb).unsqueeze(1).expand(
                 shape=[-1, paddle.shape(hs)[1], -1])
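The added guard above handles single-utterance synthesis, where the speaker embedding arrives as a 1-D vector rather than a (batch, dim) matrix; unsqueezing gives it the batch axis that the later expand over time expects. A sketch with an illustrative embedding size:

```python
import paddle

spk_emb = paddle.randn([256])        # one utterance's speaker embedding at synthesis time
if spk_emb.dim() == 1:
    spk_emb = spk_emb.unsqueeze(0)   # (1, 256): adds the batch axis expected downstream
print(spk_emb.shape)                 # [1, 256]
```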

@@ -55,7 +55,9 @@ class GaussianUpsampling(nn.Layer):
         if h_masks is not None:
             t = t * paddle.to_tensor(h_masks, dtype="float32")
-        c = ds.cumsum(axis=-1) - ds / 2
+        ds_cumsum = ds.cumsum(axis=-1)
+        ds_half = ds / 2
+        c = ds_cumsum.astype(ds_half.dtype) - ds_half
         energy = -1 * self.delta * (t.unsqueeze(-1) - c.unsqueeze(1))**2
         if d_masks is not None:
             d_masks = ~(d_masks.unsqueeze(1))
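Splitting the cumulative-duration expression lets the cumsum be cast to the dtype of `ds / 2` before the subtraction, since Paddle's elementwise ops do not reliably promote mismatched dtypes. A compact sketch, assuming the durations arrive integer-typed (values are illustrative):

```python
import paddle

ds = paddle.to_tensor([2, 3, 1])               # per-phone durations, integer-typed here
ds_cumsum = ds.cumsum(axis=-1)                 # stays integer: [2, 5, 6]
ds_half = ds / 2                               # true division yields a floating tensor
c = ds_cumsum.astype(ds_half.dtype) - ds_half  # cast first so both operands share a dtype
print(c)                                       # [1.0, 3.5, 5.5]
```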

@@ -577,8 +577,9 @@ class VITSGenerator(nn.Layer):
         # decoder
         z_p = m_p + paddle.randn(
             paddle.shape(m_p)) * paddle.exp(logs_p) * noise_scale
-        z = self.flow(z_p, y_mask, g=g, inverse=True)
-        wav = self.decoder((z * y_mask)[:, :, :max_len], g=g)
+        z = self.flow(z_p, y_mask.astype(z_p.dtype), g=g, inverse=True)
+        wav = self.decoder(
+            (z * y_mask.astype(z.dtype))[:, :, :max_len], g=g)
         return wav.squeeze(1), attn.squeeze(1), dur.squeeze(1)

@@ -695,4 +696,5 @@ class VITSGenerator(nn.Layer):
         path = paddle.cast(path, dtype='float32')
         pad_tmp = self.pad1d(path)[:, :-1]
         path = path - pad_tmp
-        return path.unsqueeze(1).transpose([0, 1, 3, 2]) * mask
+        return path.unsqueeze(1).transpose(
+            [0, 1, 3, 2]) * mask.astype(path.dtype)

@@ -129,6 +129,7 @@ class PosteriorEncoder(nn.Layer):
         """
         x_mask = make_non_pad_mask(x_lengths).unsqueeze(1)
+        x_mask = x_mask.astype(x.dtype)
         x = self.input_conv(x) * x_mask
         x = self.encoder(x, x_mask, g=g)
         stats = self.proj(x) * x_mask

@@ -155,6 +155,7 @@ class TextEncoder(nn.Layer):
         """
         x = self.emb(x) * math.sqrt(self.attention_dim)
         x_mask = make_non_pad_mask(x_lengths).unsqueeze(1)
+        x_mask = x_mask.astype(x.dtype)
         # encoder assume the channel last (B, T_text, attention_dim)
         # but mask shape shoud be (B, 1, T_text)
         x, _ = self.encoder(x, x_mask)
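Both encoder hunks add the same guard: `make_non_pad_mask` returns a boolean mask, and multiplying it against float features needs matching dtypes, so the mask is cast to the feature dtype first. A minimal sketch of the pattern (shapes are illustrative):

```python
import paddle

x = paddle.randn([2, 192, 50])                  # (B, C, T) hidden features
x_mask = paddle.ones([2, 1, 50], dtype='bool')  # stand-in for make_non_pad_mask(...).unsqueeze(1)
x_mask = x_mask.astype(x.dtype)                 # bool -> float32 so the elementwise product is valid
x = x * x_mask
```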

@@ -180,7 +180,12 @@ def make_pad_mask(lengths, xs=None, length_dim=-1):
     """
     if length_dim == 0:
         raise ValueError("length_dim cannot be 0: {}".format(length_dim))
-    bs = paddle.shape(lengths.unsqueeze(0))
+
+    # check if ilens is 0-dim tensor, if so, add a dimension
+    if lengths.ndim == 0:
+        lengths = lengths.unsqueeze(0)
+
+    bs = paddle.shape(lengths)
     if xs is None:
         maxlen = paddle.cast(lengths.max(), dtype=bs.dtype)
     else:

@@ -347,7 +352,9 @@ def get_random_segments(
     """
     b, c, t = paddle.shape(x)
     max_start_idx = x_lengths - segment_size
-    start_idxs = paddle.cast(paddle.rand([b]) * max_start_idx, 'int64')
+    rand_number = paddle.rand([b])
+    start_idxs = paddle.cast(rand_number *
+                             max_start_idx.astype(rand_number.dtype), 'int64')
     segments = get_segments(x, start_idxs, segment_size)
     return segments, start_idxs
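The `make_pad_mask` change covers the case where a caller passes a single utterance length as a 0-D tensor; promoting it to shape (1,) keeps the rest of the shape logic uniform. A short sketch:

```python
import paddle

lengths = paddle.to_tensor(7)        # a single utterance length; 0-D on recent Paddle releases
if lengths.ndim == 0:
    lengths = lengths.unsqueeze(0)   # (1,) so downstream code can treat it as a batch of one
print(lengths.shape)                 # [1]
```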

@@ -36,7 +36,7 @@ def convert_dtype_to_np_dtype_(dtype):
     elif dtype is core.VarDesc.VarType.FP16:
         return np.float16
     elif dtype is core.VarDesc.VarType.BOOL:
-        return np.bool
+        return np.bool_
     elif dtype is core.VarDesc.VarType.INT32:
         return np.int32
     elif dtype is core.VarDesc.VarType.INT64:
