[TTS] Add hifigan (#1097)

* add hifigan
* integrate synthesize, synthesize_e2e, inference for tts, test=tts
* add some python files, test=tts
* update readme, test=doc_fix

parent 675cff258b
commit 19ef7210a0
@ -0,0 +1,117 @@
# HiFiGAN with CSMSC

This example contains code used to train a [HiFiGAN](https://arxiv.org/abs/2010.05646) model with the [Chinese Standard Mandarin Speech Corpus](https://www.data-baker.com/open_source.html).

## Dataset
### Download and Extract
Download CSMSC from the [official website](https://www.data-baker.com/data/index/source) and extract it to `~/datasets`. Then the dataset is in the directory `~/datasets/BZNSYP`.

### Get MFA Result and Extract
We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) results to cut the silence at the edges of audio.
You can download it from here: [baker_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/BZNSYP/with_tone/baker_alignment_tone.tar.gz), or train your own MFA model by referring to the [mfa example](https://github.com/PaddlePaddle/Parakeet/tree/develop/examples/mfa) in our repo.

## Get Started
Assume the path to the dataset is `~/datasets/BZNSYP`.
Assume the path to the MFA result of CSMSC is `./baker_alignment_tone`.
Run the command below to
1. **source path**.
2. preprocess the dataset.
3. train the model.
4. synthesize wavs.
    - synthesize waveform from `metadata.jsonl`.
```bash
./run.sh
```
You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to use only one stage. For example, running the following command will only preprocess the dataset.
```bash
./run.sh --stage 0 --stop-stage 0
```

### Data Preprocessing
```bash
./local/preprocess.sh ${conf_path}
```
When it is done, a `dump` folder is created in the current directory. The structure of the dump folder is listed below.
```text
dump
├── dev
│   ├── norm
│   └── raw
├── test
│   ├── norm
│   └── raw
└── train
    ├── norm
    ├── raw
    └── feats_stats.npy
```

The dataset is split into 3 parts, namely `train`, `dev`, and `test`, each of which contains a `norm` and `raw` subfolder. The `raw` folder contains the log magnitude of the mel spectrogram of each utterance, while the `norm` folder contains the normalized spectrogram. The statistics used to normalize the spectrogram are computed from the training set and stored in `dump/train/feats_stats.npy`.
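
The normalization step can be sketched as follows. This is only an illustration, not the actual `normalize.py`; in particular, the layout of `feats_stats.npy` (mean in row 0, standard deviation in row 1) is an assumption here.

```python
import numpy as np

def normalize(feats, stats):
    """Per-dimension mean/std normalization of a mel spectrogram.

    ``stats`` is assumed to hold the training-set mean in row 0 and the
    standard deviation in row 1 (shape: [2, n_mels]).
    """
    mean, std = stats[0], stats[1]
    return (feats - mean) / std

# toy example: 4 frames of an 80-dim mel spectrogram
rng = np.random.default_rng(0)
feats = rng.normal(loc=3.0, scale=2.0, size=(4, 80))
stats = np.stack([feats.mean(axis=0), feats.std(axis=0)])
normed = normalize(feats, stats)
```

Because `dev` and `test` reuse the training-set statistics, their normalized features are generally not exactly zero-mean, unit-variance.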

Also, there is a `metadata.jsonl` in each subfolder. It is a table-like file that contains the id and the paths to the spectrogram of each utterance.
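
`metadata.jsonl` is a jsonlines file (one JSON object per line), so it can be read with the standard library alone. A minimal sketch; the field names used in the toy records (`utt_id`, `feats`) are illustrative assumptions:

```python
import io
import json

def read_metadata(fp):
    """Parse a jsonlines file object into a list of record dicts."""
    return [json.loads(line) for line in fp if line.strip()]

# toy stand-in for dump/train/norm/metadata.jsonl
sample = io.StringIO(
    '{"utt_id": "009901", "feats": "dump/train/norm/009901_feats.npy"}\n'
    '{"utt_id": "009902", "feats": "dump/train/norm/009902_feats.npy"}\n'
)
records = read_metadata(sample)
```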

### Model Training
```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path}
```
`./local/train.sh` calls `${BIN_DIR}/train.py`.
Here's the complete help message.

```text
usage: train.py [-h] [--config CONFIG] [--train-metadata TRAIN_METADATA]
                [--dev-metadata DEV_METADATA] [--output-dir OUTPUT_DIR]
                [--ngpu NGPU] [--verbose VERBOSE]

Train a HiFiGAN model.

optional arguments:
  -h, --help            show this help message and exit
  --config CONFIG       config file to overwrite default config.
  --train-metadata TRAIN_METADATA
                        training data.
  --dev-metadata DEV_METADATA
                        dev data.
  --output-dir OUTPUT_DIR
                        output dir.
  --ngpu NGPU           if ngpu == 0, use cpu.
  --verbose VERBOSE     verbose.
```

1. `--config` is a config file in yaml format to overwrite the default config, which can be found at `conf/default.yaml`.
2. `--train-metadata` and `--dev-metadata` should be the metadata files in the normalized subfolders of `train` and `dev` in the `dump` folder.
3. `--output-dir` is the directory to save the results of the experiment. Checkpoints are saved in `checkpoints/` inside this directory.
4. `--ngpu` is the number of gpus to use; if `ngpu == 0`, the CPU is used.
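
The help message above corresponds to an argparse setup roughly like the following. This is a sketch for illustration, not the actual `train.py`:

```python
import argparse

# Mirror the CLI described by the help message above (illustrative only).
parser = argparse.ArgumentParser(description="Train a HiFiGAN model.")
parser.add_argument("--config", type=str, help="config file to overwrite default config.")
parser.add_argument("--train-metadata", type=str, help="training data.")
parser.add_argument("--dev-metadata", type=str, help="dev data.")
parser.add_argument("--output-dir", type=str, help="output dir.")
parser.add_argument("--ngpu", type=int, default=1, help="if ngpu == 0, use cpu.")

# parse a sample command line instead of sys.argv
args = parser.parse_args(["--ngpu", "0", "--output-dir", "exp/default"])
```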

### Synthesizing
`./local/synthesize.sh` calls `${BIN_DIR}/../synthesize.py`, which can synthesize waveform from `metadata.jsonl`.
```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name}
```

```text
usage: synthesize.py [-h] [--generator-type GENERATOR_TYPE] [--config CONFIG]
                     [--checkpoint CHECKPOINT] [--test-metadata TEST_METADATA]
                     [--output-dir OUTPUT_DIR] [--ngpu NGPU]
                     [--verbose VERBOSE]

Synthesize with GANVocoder.

optional arguments:
  -h, --help            show this help message and exit
  --generator-type GENERATOR_TYPE
                        type of GANVocoder, should in {pwgan, mb_melgan,
                        style_melgan, } now
  --config CONFIG       GANVocoder config file.
  --checkpoint CHECKPOINT
                        snapshot to load.
  --test-metadata TEST_METADATA
                        dev data.
  --output-dir OUTPUT_DIR
                        output dir.
  --ngpu NGPU           if ngpu == 0, use cpu.
  --verbose VERBOSE     verbose.
```

1. `--config` is the config file. You should use the same config with which the model is trained.
2. `--checkpoint` is the checkpoint to load. Pick one of the checkpoints from `checkpoints` inside the training output directory.
3. `--test-metadata` is the metadata of the test dataset. Use the `metadata.jsonl` in the `dev/norm` subfolder from the processed directory.
4. `--output-dir` is the directory to save the synthesized audio files.
5. `--ngpu` is the number of gpus to use; if `ngpu == 0`, the CPU is used.

## Fine-tuning
@ -0,0 +1,167 @@
# This is the configuration file for the CSMSC dataset.
# This configuration is based on HiFiGAN V1, which is an official configuration.
# But I found that the optimizer setting does not work well with my implementation.
# So I changed the optimizer settings as follows:
# - AdamW -> Adam
# - betas: [0.8, 0.99] -> betas: [0.5, 0.9]
# - Scheduler: ExponentialLR -> MultiStepLR
# To match the shift size difference, the upsample scales are also modified from the original 256-shift setting.

###########################################################
#                FEATURE EXTRACTION SETTING               #
###########################################################
fs: 24000          # Sampling rate.
n_fft: 2048        # FFT size (samples).
n_shift: 300       # Hop size (samples). 12.5ms
win_length: 1200   # Window length (samples). 50ms
                   # If set to null, it will be the same as fft_size.
window: "hann"     # Window function.
n_mels: 80         # Number of mel basis.
fmin: 80           # Minimum frequency in mel basis calculation. (Hz)
fmax: 7600         # Maximum frequency in mel basis calculation. (Hz)

###########################################################
#         GENERATOR NETWORK ARCHITECTURE SETTING          #
###########################################################
generator_params:
    in_channels: 80                       # Number of input channels.
    out_channels: 1                       # Number of output channels.
    channels: 512                         # Number of initial channels.
    kernel_size: 7                        # Kernel size of initial and final conv layers.
    upsample_scales: [5, 5, 4, 3]         # Upsampling scales.
    upsample_kernel_sizes: [10, 10, 8, 6] # Kernel size for upsampling layers.
    resblock_kernel_sizes: [3, 7, 11]     # Kernel size for residual blocks.
    resblock_dilations:                   # Dilations for residual blocks.
        - [1, 3, 5]
        - [1, 3, 5]
        - [1, 3, 5]
    use_additional_convs: true            # Whether to use additional conv layers in residual blocks.
    bias: true                            # Whether to use bias parameter in conv.
    nonlinear_activation: "leakyrelu"     # Nonlinear activation type.
    nonlinear_activation_params:          # Nonlinear activation parameters.
        negative_slope: 0.1
    use_weight_norm: true                 # Whether to apply weight normalization.

###########################################################
#       DISCRIMINATOR NETWORK ARCHITECTURE SETTING        #
###########################################################
discriminator_params:
    scales: 3                              # Number of multi-scale discriminators.
    scale_downsample_pooling: "AvgPool1D"  # Pooling operation for scale discriminator.
    scale_downsample_pooling_params:
        kernel_size: 4                     # Pooling kernel size.
        stride: 2                          # Pooling stride.
        padding: 2                         # Padding size.
    scale_discriminator_params:
        in_channels: 1                     # Number of input channels.
        out_channels: 1                    # Number of output channels.
        kernel_sizes: [15, 41, 5, 3]       # List of kernel sizes.
        channels: 128                      # Initial number of channels.
        max_downsample_channels: 1024      # Maximum number of channels in downsampling conv layers.
        max_groups: 16                     # Maximum number of groups in downsampling conv layers.
        bias: true
        downsample_scales: [4, 4, 4, 4, 1] # Downsampling scales.
        nonlinear_activation: "leakyrelu"  # Nonlinear activation.
        nonlinear_activation_params:
            negative_slope: 0.1
    follow_official_norm: true             # Whether to follow the official norm setting.
    periods: [2, 3, 5, 7, 11]              # List of periods for multi-period discriminator.
    period_discriminator_params:
        in_channels: 1                     # Number of input channels.
        out_channels: 1                    # Number of output channels.
        kernel_sizes: [5, 3]               # List of kernel sizes.
        channels: 32                       # Initial number of channels.
        downsample_scales: [3, 3, 3, 3, 1] # Downsampling scales.
        max_downsample_channels: 1024      # Maximum number of channels in downsampling conv layers.
        bias: true                         # Whether to use bias parameter in conv layer.
        nonlinear_activation: "leakyrelu"  # Nonlinear activation.
        nonlinear_activation_params:       # Nonlinear activation parameters.
            negative_slope: 0.1
        use_weight_norm: true              # Whether to apply weight normalization.
        use_spectral_norm: false           # Whether to apply spectral normalization.

###########################################################
#                   STFT LOSS SETTING                     #
###########################################################
use_stft_loss: false # Whether to use multi-resolution STFT loss.
use_mel_loss: true   # Whether to use Mel-spectrogram loss.
mel_loss_params:
    fs: 24000
    fft_size: 2048
    hop_size: 300
    win_length: 1200
    window: "hann"
    num_mels: 80
    fmin: 0
    fmax: 12000
    log_base: null
generator_adv_loss_params:
    average_by_discriminators: false # Whether to average loss by #discriminators.
discriminator_adv_loss_params:
    average_by_discriminators: false # Whether to average loss by #discriminators.
use_feat_match_loss: true
feat_match_loss_params:
    average_by_discriminators: false # Whether to average loss by #discriminators.
    average_by_layers: false         # Whether to average loss by #layers in each discriminator.
    include_final_outputs: false     # Whether to include final outputs in feat match loss calculation.

###########################################################
#               ADVERSARIAL LOSS SETTING                  #
###########################################################
lambda_aux: 45.0       # Loss balancing coefficient for STFT loss.
lambda_adv: 1.0        # Loss balancing coefficient for adversarial loss.
lambda_feat_match: 2.0 # Loss balancing coefficient for feat match loss.

###########################################################
#                  DATA LOADER SETTING                    #
###########################################################
batch_size: 16        # Batch size.
batch_max_steps: 8400 # Length of each audio in batch. Make sure it is divisible by hop_size.
num_workers: 2        # Number of workers in DataLoader.

###########################################################
#             OPTIMIZER & SCHEDULER SETTING               #
###########################################################
generator_optimizer_params:
    beta1: 0.5
    beta2: 0.9
    weight_decay: 0.0     # Generator's weight decay coefficient.
generator_scheduler_params:
    learning_rate: 2.0e-4 # Generator's learning rate.
    gamma: 0.5            # Generator's scheduler gamma.
    milestones:           # At each milestone, lr will be multiplied by gamma.
        - 200000
        - 400000
        - 600000
        - 800000
generator_grad_norm: -1   # Generator's gradient norm.
discriminator_optimizer_params:
    beta1: 0.5
    beta2: 0.9
    weight_decay: 0.0     # Discriminator's weight decay coefficient.
discriminator_scheduler_params:
    learning_rate: 2.0e-4 # Discriminator's learning rate.
    gamma: 0.5            # Discriminator's scheduler gamma.
    milestones:           # At each milestone, lr will be multiplied by gamma.
        - 200000
        - 400000
        - 600000
        - 800000
discriminator_grad_norm: -1 # Discriminator's gradient norm.

###########################################################
#                    INTERVAL SETTING                     #
###########################################################
generator_train_start_steps: 1     # Number of steps to start to train generator.
discriminator_train_start_steps: 0 # Number of steps to start to train discriminator.
train_max_steps: 2500000           # Number of training steps.
save_interval_steps: 5000          # Interval steps to save checkpoint.
eval_interval_steps: 1000          # Interval steps to evaluate the network.

###########################################################
#                     OTHER SETTING                       #
###########################################################
num_snapshots: 10 # max number of snapshots to keep while training
seed: 42          # random seed for paddle, random, and np.random
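
As the header comments note, the upsample scales were adapted from the original 256-shift setting to the 300-sample hop used here: the product of `upsample_scales` must equal `n_shift`, because the generator expands each mel frame into one hop of waveform samples. The same relation makes `batch_max_steps` a whole number of frames. A quick check of the numbers above:

```python
import math

n_shift = 300                   # hop size from the feature settings above
upsample_scales = [5, 5, 4, 3]  # generator upsampling scales
total_upsampling = math.prod(upsample_scales)

batch_max_steps = 8400          # audio samples per training example
frames_per_example = batch_max_steps // n_shift
```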
@ -0,0 +1,168 @@
# This is the configuration file for the CSMSC dataset.
# This configuration is based on HiFiGAN V1, which is an official configuration.
# But I found that the optimizer setting does not work well with my implementation.
# So I changed the optimizer settings as follows:
# - AdamW -> Adam
# - betas: [0.8, 0.99] -> betas: [0.5, 0.9]
# - Scheduler: ExponentialLR -> MultiStepLR
# To match the shift size difference, the upsample scales are also modified from the original 256-shift setting.

###########################################################
#                FEATURE EXTRACTION SETTING               #
###########################################################
fs: 24000          # Sampling rate.
n_fft: 2048        # FFT size (samples).
n_shift: 300       # Hop size (samples). 12.5ms
win_length: 1200   # Window length (samples). 50ms
                   # If set to null, it will be the same as fft_size.
window: "hann"     # Window function.
n_mels: 80         # Number of mel basis.
fmin: 80           # Minimum frequency in mel basis calculation. (Hz)
fmax: 7600         # Maximum frequency in mel basis calculation. (Hz)

###########################################################
#         GENERATOR NETWORK ARCHITECTURE SETTING          #
###########################################################
generator_params:
    in_channels: 80                       # Number of input channels.
    out_channels: 1                       # Number of output channels.
    channels: 512                         # Number of initial channels.
    kernel_size: 7                        # Kernel size of initial and final conv layers.
    upsample_scales: [5, 5, 4, 3]         # Upsampling scales.
    upsample_kernel_sizes: [10, 10, 8, 6] # Kernel size for upsampling layers.
    resblock_kernel_sizes: [3, 7, 11]     # Kernel size for residual blocks.
    resblock_dilations:                   # Dilations for residual blocks.
        - [1, 3, 5]
        - [1, 3, 5]
        - [1, 3, 5]
    use_additional_convs: true            # Whether to use additional conv layers in residual blocks.
    bias: true                            # Whether to use bias parameter in conv.
    nonlinear_activation: "leakyrelu"     # Nonlinear activation type.
    nonlinear_activation_params:          # Nonlinear activation parameters.
        negative_slope: 0.1
    use_weight_norm: true                 # Whether to apply weight normalization.

###########################################################
#       DISCRIMINATOR NETWORK ARCHITECTURE SETTING        #
###########################################################
discriminator_params:
    scales: 3                              # Number of multi-scale discriminators.
    scale_downsample_pooling: "AvgPool1D"  # Pooling operation for scale discriminator.
    scale_downsample_pooling_params:
        kernel_size: 4                     # Pooling kernel size.
        stride: 2                          # Pooling stride.
        padding: 2                         # Padding size.
    scale_discriminator_params:
        in_channels: 1                     # Number of input channels.
        out_channels: 1                    # Number of output channels.
        kernel_sizes: [15, 41, 5, 3]       # List of kernel sizes.
        channels: 128                      # Initial number of channels.
        max_downsample_channels: 1024      # Maximum number of channels in downsampling conv layers.
        max_groups: 16                     # Maximum number of groups in downsampling conv layers.
        bias: true
        downsample_scales: [4, 4, 4, 4, 1] # Downsampling scales.
        nonlinear_activation: "leakyrelu"  # Nonlinear activation.
        nonlinear_activation_params:
            negative_slope: 0.1
    follow_official_norm: true             # Whether to follow the official norm setting.
    periods: [2, 3, 5, 7, 11]              # List of periods for multi-period discriminator.
    period_discriminator_params:
        in_channels: 1                     # Number of input channels.
        out_channels: 1                    # Number of output channels.
        kernel_sizes: [5, 3]               # List of kernel sizes.
        channels: 32                       # Initial number of channels.
        downsample_scales: [3, 3, 3, 3, 1] # Downsampling scales.
        max_downsample_channels: 1024      # Maximum number of channels in downsampling conv layers.
        bias: true                         # Whether to use bias parameter in conv layer.
        nonlinear_activation: "leakyrelu"  # Nonlinear activation.
        nonlinear_activation_params:       # Nonlinear activation parameters.
            negative_slope: 0.1
        use_weight_norm: true              # Whether to apply weight normalization.
        use_spectral_norm: false           # Whether to apply spectral normalization.

###########################################################
#                   STFT LOSS SETTING                     #
###########################################################
use_stft_loss: false # Whether to use multi-resolution STFT loss.
use_mel_loss: true   # Whether to use Mel-spectrogram loss.
mel_loss_params:
    fs: 24000
    fft_size: 2048
    hop_size: 300
    win_length: 1200
    window: "hann"
    num_mels: 80
    fmin: 0
    fmax: 12000
    log_base: null
generator_adv_loss_params:
    average_by_discriminators: false # Whether to average loss by #discriminators.
discriminator_adv_loss_params:
    average_by_discriminators: false # Whether to average loss by #discriminators.
use_feat_match_loss: true
feat_match_loss_params:
    average_by_discriminators: false # Whether to average loss by #discriminators.
    average_by_layers: false         # Whether to average loss by #layers in each discriminator.
    include_final_outputs: false     # Whether to include final outputs in feat match loss calculation.

###########################################################
#               ADVERSARIAL LOSS SETTING                  #
###########################################################
lambda_aux: 45.0       # Loss balancing coefficient for STFT loss.
lambda_adv: 1.0        # Loss balancing coefficient for adversarial loss.
lambda_feat_match: 2.0 # Loss balancing coefficient for feat match loss.

###########################################################
#                  DATA LOADER SETTING                    #
###########################################################
batch_size: 16        # Batch size.
batch_max_steps: 8400 # Length of each audio in batch. Make sure it is divisible by hop_size.
num_workers: 2        # Number of workers in DataLoader.

###########################################################
#             OPTIMIZER & SCHEDULER SETTING               #
###########################################################
generator_optimizer_params:
    beta1: 0.5
    beta2: 0.9
    weight_decay: 0.0     # Generator's weight decay coefficient.
generator_scheduler_params:
    learning_rate: 2.0e-4 # Generator's learning rate.
    gamma: 0.5            # Generator's scheduler gamma.
    milestones:           # At each milestone, lr will be multiplied by gamma.
        - 200000
        - 400000
        - 600000
        - 800000
generator_grad_norm: -1   # Generator's gradient norm.
discriminator_optimizer_params:
    beta1: 0.5
    beta2: 0.9
    weight_decay: 0.0     # Discriminator's weight decay coefficient.
discriminator_scheduler_params:
    learning_rate: 2.0e-4 # Discriminator's learning rate.
    gamma: 0.5            # Discriminator's scheduler gamma.
    milestones:           # At each milestone, lr will be multiplied by gamma.
        - 200000
        - 400000
        - 600000
        - 800000
discriminator_grad_norm: -1 # Discriminator's gradient norm.

###########################################################
#                    INTERVAL SETTING                     #
###########################################################
generator_train_start_steps: 1     # Number of steps to start to train generator.
discriminator_train_start_steps: 0 # Number of steps to start to train discriminator.
train_max_steps: 2500000           # Number of training steps.
save_interval_steps: 10000         # Interval steps to save checkpoint.
eval_interval_steps: 1000          # Interval steps to evaluate the network.
log_interval_steps: 100            # Interval steps to record the training log.

###########################################################
#                     OTHER SETTING                       #
###########################################################
num_snapshots: 10 # max number of snapshots to keep while training
seed: 42          # random seed for paddle, random, and np.random
@ -0,0 +1,62 @@
#!/bin/bash

source path.sh

gpus=0
stage=0
stop_stage=100

source ${MAIN_ROOT}/utils/parse_options.sh || exit 1

if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
    python3 ${MAIN_ROOT}/paddlespeech/t2s/exps/fastspeech2/gen_gta_mel.py \
        --fastspeech2-config=fastspeech2_nosil_baker_ckpt_0.4/default.yaml \
        --fastspeech2-checkpoint=fastspeech2_nosil_baker_ckpt_0.4/snapshot_iter_76000.pdz \
        --fastspeech2-stat=fastspeech2_nosil_baker_ckpt_0.4/speech_stats.npy \
        --dur-file=durations.txt \
        --output-dir=dump_finetune \
        --phones-dict=fastspeech2_nosil_baker_ckpt_0.4/phone_id_map.txt
fi

if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
    python3 local/link_wav.py \
        --old-dump-dir=dump \
        --dump-dir=dump_finetune
fi

if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
    # reuse features' stats (mean and std) computed on the original dump
    echo "Get features' stats ..."
    cp dump/train/feats_stats.npy dump_finetune/train/
fi

if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
    # normalize; dev and test should use train's stats
    echo "Normalize ..."
    python3 ${BIN_DIR}/../normalize.py \
        --metadata=dump_finetune/train/raw/metadata.jsonl \
        --dumpdir=dump_finetune/train/norm \
        --stats=dump_finetune/train/feats_stats.npy
    python3 ${BIN_DIR}/../normalize.py \
        --metadata=dump_finetune/dev/raw/metadata.jsonl \
        --dumpdir=dump_finetune/dev/norm \
        --stats=dump_finetune/train/feats_stats.npy
    python3 ${BIN_DIR}/../normalize.py \
        --metadata=dump_finetune/test/raw/metadata.jsonl \
        --dumpdir=dump_finetune/test/norm \
        --stats=dump_finetune/train/feats_stats.npy
fi

if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
    CUDA_VISIBLE_DEVICES=${gpus} \
    FLAGS_cudnn_exhaustive_search=true \
    FLAGS_conv_workspace_size_limit=4000 \
    python ${BIN_DIR}/train.py \
        --train-metadata=dump_finetune/train/norm/metadata.jsonl \
        --dev-metadata=dump_finetune/dev/norm/metadata.jsonl \
        --config=conf/finetune.yaml \
        --output-dir=exp/finetune \
        --ngpu=1
fi
@ -0,0 +1,85 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import os
from operator import itemgetter
from pathlib import Path

import jsonlines
import numpy as np


def main():
    # parse config and args
    parser = argparse.ArgumentParser(
        description="Link wave files from the old dump dir and regenerate metadata.")

    parser.add_argument(
        "--old-dump-dir",
        default=None,
        type=str,
        help="directory of the original dumped feature files.")
    parser.add_argument(
        "--dump-dir",
        type=str,
        required=True,
        help="directory of the finetune dumped feature files.")
    args = parser.parse_args()

    old_dump_dir = Path(args.old_dump_dir).expanduser()
    old_dump_dir = old_dump_dir.resolve()
    dump_dir = Path(args.dump_dir).expanduser()
    # use absolute path
    dump_dir = dump_dir.resolve()
    dump_dir.mkdir(parents=True, exist_ok=True)

    assert old_dump_dir.is_dir()
    assert dump_dir.is_dir()

    for sub in ["train", "dev", "test"]:
        # symlink the *_wave.npy files in old_dump_dir to the corresponding
        # locations in dump_dir
        output_dir = dump_dir / sub
        output_dir.mkdir(parents=True, exist_ok=True)
        results = []
        for name in os.listdir(output_dir / "raw"):
            # e.g. 003918_feats.npy
            utt_id = name.split("_")[0]
            mel_path = output_dir / ("raw/" + name)
            gen_mel = np.load(mel_path)
            wave_name = utt_id + "_wave.npy"
            wav = np.load(old_dump_dir / sub / ("raw/" + wave_name))
            os.symlink(old_dump_dir / sub / ("raw/" + wave_name),
                       output_dir / ("raw/" + wave_name))
            num_sample = wav.shape[0]
            num_frames = gen_mel.shape[0]
            wav_path = output_dir / ("raw/" + wave_name)

            record = {
                "utt_id": utt_id,
                "num_samples": num_sample,
                "num_frames": num_frames,
                "feats": str(mel_path),
                "wave": str(wav_path),
            }
            results.append(record)

        results.sort(key=itemgetter("utt_id"))

        with jsonlines.open(output_dir / "raw/metadata.jsonl", 'w') as writer:
            for item in results:
                writer.write(item)


if __name__ == "__main__":
    main()
@ -0,0 +1,55 @@
#!/bin/bash

stage=0
stop_stage=100

config_path=$1

if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
    # get durations from MFA's result
    echo "Generate durations.txt from MFA results ..."
    python3 ${MAIN_ROOT}/utils/gen_duration_from_textgrid.py \
        --inputdir=./baker_alignment_tone \
        --output=durations.txt \
        --config=${config_path}
fi

if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
    # extract features
    echo "Extract features ..."
    python3 ${BIN_DIR}/../preprocess.py \
        --rootdir=~/datasets/BZNSYP/ \
        --dataset=baker \
        --dumpdir=dump \
        --dur-file=durations.txt \
        --config=${config_path} \
        --cut-sil=True \
        --num-cpu=20
fi

if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
    # get features' stats (mean and std)
    echo "Get features' stats ..."
    python3 ${MAIN_ROOT}/utils/compute_statistics.py \
        --metadata=dump/train/raw/metadata.jsonl \
        --field-name="feats"
fi

if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
    # normalize; dev and test should use train's stats
    echo "Normalize ..."
    python3 ${BIN_DIR}/../normalize.py \
        --metadata=dump/train/raw/metadata.jsonl \
        --dumpdir=dump/train/norm \
        --stats=dump/train/feats_stats.npy
    python3 ${BIN_DIR}/../normalize.py \
        --metadata=dump/dev/raw/metadata.jsonl \
        --dumpdir=dump/dev/norm \
        --stats=dump/train/feats_stats.npy

    python3 ${BIN_DIR}/../normalize.py \
        --metadata=dump/test/raw/metadata.jsonl \
        --dumpdir=dump/test/norm \
        --stats=dump/train/feats_stats.npy
fi
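Stage 3 of the preprocessing script normalizes the dev and test splits with the *training* split's statistics, so every split shares one scale and no statistics leak from held-out data. A minimal sketch (pure Python, illustrative numbers) of that z-score step:

```python
from math import sqrt

def zscore(xs, mu, std):
    """Standardize a feature column with a precomputed mean/std."""
    return [(x - mu) / std for x in xs]

train = [1.0, 3.0, 5.0]
mu = sum(train) / len(train)                                # train mean: 3.0
std = sqrt(sum((x - mu) ** 2 for x in train) / len(train))  # population std

# dev/test reuse the train-set mu/std instead of computing their own
dev = zscore([2.0, 3.0], mu, std)
```

In the real pipeline `compute_statistics.py` writes these stats to `dump/train/feats_stats.npy` once, and all three `normalize.py` calls read the same file.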
@ -0,0 +1,14 @@
#!/bin/bash

config_path=$1
train_output_path=$2
ckpt_name=$3

FLAGS_allocator_strategy=naive_best_fit \
FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/../synthesize.py \
    --config=${config_path} \
    --checkpoint=${train_output_path}/checkpoints/${ckpt_name} \
    --test-metadata=dump/test/norm/metadata.jsonl \
    --output-dir=${train_output_path}/test \
    --generator-type=hifigan
@ -0,0 +1,13 @@
#!/bin/bash

config_path=$1
train_output_path=$2

FLAGS_cudnn_exhaustive_search=true \
FLAGS_conv_workspace_size_limit=4000 \
python ${BIN_DIR}/train.py \
    --train-metadata=dump/train/norm/metadata.jsonl \
    --dev-metadata=dump/dev/norm/metadata.jsonl \
    --config=${config_path} \
    --output-dir=${train_output_path} \
    --ngpu=1
@ -0,0 +1,13 @@
#!/bin/bash
export MAIN_ROOT=`realpath ${PWD}/../../../`

export PATH=${MAIN_ROOT}:${MAIN_ROOT}/utils:${PATH}
export LC_ALL=C

export PYTHONDONTWRITEBYTECODE=1
# Use UTF-8 in Python to avoid UnicodeDecodeError when LC_ALL=C
export PYTHONIOENCODING=UTF-8
export PYTHONPATH=${MAIN_ROOT}:${PYTHONPATH}

MODEL=hifigan
export BIN_DIR=${MAIN_ROOT}/paddlespeech/t2s/exps/gan_vocoder/${MODEL}
@ -0,0 +1,32 @@
#!/bin/bash

set -e
source path.sh

gpus=0,1
stage=0
stop_stage=100

conf_path=conf/default.yaml
train_output_path=exp/default
ckpt_name=snapshot_iter_50000.pdz

# With the following options, you can choose the range of stages to run,
# e.g. `./run.sh --stage 0 --stop-stage 0`.
# These options cannot be mixed with the positional args `$1`, `$2` ...
source ${MAIN_ROOT}/utils/parse_options.sh || exit 1

if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
    # prepare data
    ./local/preprocess.sh ${conf_path} || exit -1
fi

if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
    # train model; all `ckpt` files are saved under the `train_output_path/checkpoints/` dir
    CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path} || exit -1
fi

if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
    # synthesize
    CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name} || exit -1
fi
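The stage gating in run.sh follows the pattern `[ ${stage} -le N ] && [ ${stop_stage} -ge N ]`: stage N executes iff `stage <= N <= stop_stage`. An illustrative Python rendering of that selection logic (names here are mine, not from the repo):

```python
def stages_to_run(stage, stop_stage,
                  names=("preprocess", "train", "synthesize")):
    """Return the stage names that run.sh would execute for this range."""
    return [name for n, name in enumerate(names) if stage <= n <= stop_stage]

# `./run.sh --stage 0 --stop-stage 0` runs only preprocessing:
print(stages_to_run(0, 0))
# the default stop_stage=100 simply means "run everything from `stage` on":
print(stages_to_run(1, 100))
```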
@ -1,135 +0,0 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import os
from pathlib import Path

import soundfile as sf
from paddle import inference

from paddlespeech.t2s.frontend.zh_frontend import Frontend


def main():
    parser = argparse.ArgumentParser(
        description="Paddle Inference with speedyspeech & parallel wavegan.")
    parser.add_argument(
        "--inference-dir", type=str, help="dir to save inference models")
    parser.add_argument(
        "--text",
        type=str,
        help="text to synthesize, a 'utt_id sentence' pair per line")
    parser.add_argument("--output-dir", type=str, help="output dir")
    parser.add_argument(
        "--enable-auto-log", action="store_true", help="use auto log")
    parser.add_argument(
        "--phones-dict",
        type=str,
        default="phones.txt",
        help="phone vocabulary file.")

    args, _ = parser.parse_known_args()

    frontend = Frontend(phone_vocab_path=args.phones_dict)
    print("frontend done!")

    fastspeech2_config = inference.Config(
        str(Path(args.inference_dir) / "fastspeech2.pdmodel"),
        str(Path(args.inference_dir) / "fastspeech2.pdiparams"))
    fastspeech2_config.enable_use_gpu(50, 0)
    # This line must stay commented out; otherwise it causes OOM
    # fastspeech2_config.enable_memory_optim()
    fastspeech2_predictor = inference.create_predictor(fastspeech2_config)

    pwg_config = inference.Config(
        str(Path(args.inference_dir) / "pwg.pdmodel"),
        str(Path(args.inference_dir) / "pwg.pdiparams"))
    pwg_config.enable_use_gpu(100, 0)
    pwg_config.enable_memory_optim()
    pwg_predictor = inference.create_predictor(pwg_config)

    if args.enable_auto_log:
        import auto_log
        os.makedirs("output", exist_ok=True)
        pid = os.getpid()
        logger = auto_log.AutoLogger(
            model_name="fastspeech2",
            model_precision='float32',
            batch_size=1,
            data_shape="dynamic",
            save_path="./output/auto_log.log",
            inference_config=fastspeech2_config,
            pids=pid,
            process_name=None,
            gpu_ids=0,
            time_keys=['preprocess_time', 'inference_time', 'postprocess_time'],
            warmup=0)

    output_dir = Path(args.output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    sentences = []

    with open(args.text, 'rt') as f:
        for line in f:
            items = line.strip().split()
            utt_id = items[0]
            sentence = "".join(items[1:])
            sentences.append((utt_id, sentence))

    for utt_id, sentence in sentences:
        if args.enable_auto_log:
            logger.times.start()
        input_ids = frontend.get_input_ids(sentence, merge_sentences=True)
        phone_ids = input_ids["phone_ids"]
        phones = phone_ids[0].numpy()

        if args.enable_auto_log:
            logger.times.stamp()

        input_names = fastspeech2_predictor.get_input_names()
        phones_handle = fastspeech2_predictor.get_input_handle(input_names[0])

        phones_handle.reshape(phones.shape)
        phones_handle.copy_from_cpu(phones)

        fastspeech2_predictor.run()
        output_names = fastspeech2_predictor.get_output_names()
        output_handle = fastspeech2_predictor.get_output_handle(output_names[0])
        output_data = output_handle.copy_to_cpu()

        input_names = pwg_predictor.get_input_names()
        mel_handle = pwg_predictor.get_input_handle(input_names[0])
        mel_handle.reshape(output_data.shape)
        mel_handle.copy_from_cpu(output_data)

        pwg_predictor.run()
        output_names = pwg_predictor.get_output_names()
        output_handle = pwg_predictor.get_output_handle(output_names[0])
        wav = output_data = output_handle.copy_to_cpu()

        if args.enable_auto_log:
            logger.times.stamp()

        sf.write(output_dir / (utt_id + ".wav"), wav, samplerate=24000)

        if args.enable_auto_log:
            logger.times.end(stamp=True)
        print(f"{utt_id} done!")

    if args.enable_auto_log:
        logger.report()


if __name__ == "__main__":
    main()
@ -1,178 +0,0 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import logging
from pathlib import Path

import numpy as np
import paddle
import soundfile as sf
import yaml
from yacs.config import CfgNode

from paddlespeech.t2s.frontend.zh_frontend import Frontend
from paddlespeech.t2s.models.fastspeech2 import FastSpeech2
from paddlespeech.t2s.models.fastspeech2 import FastSpeech2Inference
from paddlespeech.t2s.models.parallel_wavegan import PWGGenerator
from paddlespeech.t2s.models.parallel_wavegan import PWGInference
from paddlespeech.t2s.modules.normalizer import ZScore


def evaluate(args, fastspeech2_config, pwg_config):
    # the dataloader is too verbose
    logging.getLogger("DataLoader").disabled = True

    # construct dataset for evaluation
    sentences = []
    with open(args.text, 'rt') as f:
        for line in f:
            items = line.strip().split()
            utt_id = items[0]
            sentence = "".join(items[1:])
            sentences.append((utt_id, sentence))

    with open(args.phones_dict, "r") as f:
        phn_id = [line.strip().split() for line in f.readlines()]
    vocab_size = len(phn_id)
    print("vocab_size:", vocab_size)
    with open(args.speaker_dict, 'rt') as f:
        spk_id = [line.strip().split() for line in f.readlines()]
    spk_num = len(spk_id)
    print("spk_num:", spk_num)

    odim = fastspeech2_config.n_mels
    model = FastSpeech2(
        idim=vocab_size,
        odim=odim,
        spk_num=spk_num,
        **fastspeech2_config["model"])

    model.set_state_dict(
        paddle.load(args.fastspeech2_checkpoint)["main_params"])
    model.eval()

    vocoder = PWGGenerator(**pwg_config["generator_params"])
    vocoder.set_state_dict(paddle.load(args.pwg_checkpoint)["generator_params"])
    vocoder.remove_weight_norm()
    vocoder.eval()
    print("model done!")

    frontend = Frontend(phone_vocab_path=args.phones_dict)
    print("frontend done!")

    stat = np.load(args.fastspeech2_stat)
    mu, std = stat
    mu = paddle.to_tensor(mu)
    std = paddle.to_tensor(std)
    fastspeech2_normalizer = ZScore(mu, std)

    stat = np.load(args.pwg_stat)
    mu, std = stat
    mu = paddle.to_tensor(mu)
    std = paddle.to_tensor(std)
    pwg_normalizer = ZScore(mu, std)

    fastspeech2_inference = FastSpeech2Inference(fastspeech2_normalizer, model)
    pwg_inference = PWGInference(pwg_normalizer, vocoder)

    output_dir = Path(args.output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    # test the first 20 speakers on the first 2 sentences
    spk_ids = list(range(20))
    for spk_id in spk_ids:
        for utt_id, sentence in sentences[:2]:
            input_ids = frontend.get_input_ids(sentence, merge_sentences=True)
            phone_ids = input_ids["phone_ids"]
            flags = 0
            for part_phone_ids in phone_ids:
                with paddle.no_grad():
                    mel = fastspeech2_inference(
                        part_phone_ids, spk_id=paddle.to_tensor(spk_id))
                    temp_wav = pwg_inference(mel)
                if flags == 0:
                    wav = temp_wav
                    flags = 1
                else:
                    wav = paddle.concat([wav, temp_wav])
            sf.write(
                str(output_dir / (str(spk_id) + "_" + utt_id + ".wav")),
                wav.numpy(),
                samplerate=fastspeech2_config.fs)
            print(f"{spk_id}_{utt_id} done!")


def main():
    # parse args and config and redirect to train_sp
    parser = argparse.ArgumentParser(
        description="Synthesize with fastspeech2 & parallel wavegan.")
    parser.add_argument(
        "--fastspeech2-config", type=str, help="fastspeech2 config file.")
    parser.add_argument(
        "--fastspeech2-checkpoint",
        type=str,
        help="fastspeech2 checkpoint to load.")
    parser.add_argument(
        "--fastspeech2-stat",
        type=str,
        help="mean and standard deviation used to normalize spectrogram when training fastspeech2."
    )
    parser.add_argument(
        "--pwg-config", type=str, help="parallel wavegan config file.")
    parser.add_argument(
        "--pwg-checkpoint",
        type=str,
        help="parallel wavegan generator parameters to load.")
    parser.add_argument(
        "--pwg-stat",
        type=str,
        help="mean and standard deviation used to normalize spectrogram when training parallel wavegan."
    )
    parser.add_argument(
        "--phones-dict", type=str, default=None, help="phone vocabulary file.")
    parser.add_argument(
        "--speaker-dict", type=str, default=None, help="speaker id map file.")
    parser.add_argument(
        "--text",
        type=str,
        help="text to synthesize, a 'utt_id sentence' pair per line.")
    parser.add_argument("--output-dir", type=str, help="output dir.")
    parser.add_argument(
        "--ngpu", type=int, default=1, help="if ngpu == 0, use cpu.")
    parser.add_argument("--verbose", type=int, default=1, help="verbose.")

    args = parser.parse_args()

    if args.ngpu == 0:
        paddle.set_device("cpu")
    elif args.ngpu > 0:
        paddle.set_device("gpu")
    else:
        print("ngpu should >= 0 !")

    with open(args.fastspeech2_config) as f:
        fastspeech2_config = CfgNode(yaml.safe_load(f))
    with open(args.pwg_config) as f:
        pwg_config = CfgNode(yaml.safe_load(f))

    print("========Args========")
    print(yaml.safe_dump(vars(args)))
    print("========Config========")
    print(fastspeech2_config)
    print(pwg_config)

    evaluate(args, fastspeech2_config, pwg_config)


if __name__ == "__main__":
    main()
@ -1,175 +0,0 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import logging
from pathlib import Path

import numpy as np
import paddle
import soundfile as sf
import yaml
from yacs.config import CfgNode

from paddlespeech.t2s.frontend import English
from paddlespeech.t2s.models.fastspeech2 import FastSpeech2
from paddlespeech.t2s.models.fastspeech2 import FastSpeech2Inference
from paddlespeech.t2s.models.parallel_wavegan import PWGGenerator
from paddlespeech.t2s.models.parallel_wavegan import PWGInference
from paddlespeech.t2s.modules.normalizer import ZScore


def evaluate(args, fastspeech2_config, pwg_config):
    # the dataloader is too verbose
    logging.getLogger("DataLoader").disabled = True

    # construct dataset for evaluation
    sentences = []
    with open(args.text, 'rt') as f:
        for line in f:
            line_list = line.strip().split()
            utt_id = line_list[0]
            sentence = " ".join(line_list[1:])
            sentences.append((utt_id, sentence))

    with open(args.phones_dict, "r") as f:
        phn_id = [line.strip().split() for line in f.readlines()]
    vocab_size = len(phn_id)
    phone_id_map = {}
    for phn, id in phn_id:
        phone_id_map[phn] = int(id)
    print("vocab_size:", vocab_size)
    with open(args.speaker_dict, 'rt') as f:
        spk_id = [line.strip().split() for line in f.readlines()]
    spk_num = len(spk_id)
    print("spk_num:", spk_num)

    odim = fastspeech2_config.n_mels
    model = FastSpeech2(
        idim=vocab_size,
        odim=odim,
        spk_num=spk_num,
        **fastspeech2_config["model"])

    model.set_state_dict(
        paddle.load(args.fastspeech2_checkpoint)["main_params"])
    model.eval()

    vocoder = PWGGenerator(**pwg_config["generator_params"])
    vocoder.set_state_dict(paddle.load(args.pwg_checkpoint)["generator_params"])
    vocoder.remove_weight_norm()
    vocoder.eval()
    print("model done!")

    frontend = English(phone_vocab_path=args.phones_dict)
    print("frontend done!")

    stat = np.load(args.fastspeech2_stat)
    mu, std = stat
    mu = paddle.to_tensor(mu)
    std = paddle.to_tensor(std)
    fastspeech2_normalizer = ZScore(mu, std)

    stat = np.load(args.pwg_stat)
    mu, std = stat
    mu = paddle.to_tensor(mu)
    std = paddle.to_tensor(std)
    pwg_normalizer = ZScore(mu, std)

    fastspeech2_inference = FastSpeech2Inference(fastspeech2_normalizer, model)
    pwg_inference = PWGInference(pwg_normalizer, vocoder)

    output_dir = Path(args.output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    # only test speaker 0
    spk_id = 0
    for utt_id, sentence in sentences:
        input_ids = frontend.get_input_ids(sentence)
        phone_ids = input_ids["phone_ids"]

        with paddle.no_grad():
            mel = fastspeech2_inference(
                phone_ids, spk_id=paddle.to_tensor(spk_id))
            wav = pwg_inference(mel)

        sf.write(
            str(output_dir / (str(spk_id) + "_" + utt_id + ".wav")),
            wav.numpy(),
            samplerate=fastspeech2_config.fs)
        print(f"{spk_id}_{utt_id} done!")


def main():
    # parse args and config and redirect to train_sp
    parser = argparse.ArgumentParser(
        description="Synthesize with fastspeech2 & parallel wavegan.")
    parser.add_argument(
        "--fastspeech2-config", type=str, help="fastspeech2 config file.")
    parser.add_argument(
        "--fastspeech2-checkpoint",
        type=str,
        help="fastspeech2 checkpoint to load.")
    parser.add_argument(
        "--fastspeech2-stat",
        type=str,
        help="mean and standard deviation used to normalize spectrogram when training fastspeech2."
    )
    parser.add_argument(
        "--pwg-config", type=str, help="parallel wavegan config file.")
    parser.add_argument(
        "--pwg-checkpoint",
        type=str,
        help="parallel wavegan generator parameters to load.")
    parser.add_argument(
        "--pwg-stat",
        type=str,
        help="mean and standard deviation used to normalize spectrogram when training parallel wavegan."
    )
    parser.add_argument(
        "--phones-dict", type=str, default=None, help="phone vocabulary file.")
    parser.add_argument(
        "--speaker-dict", type=str, default=None, help="speaker id map file.")
    parser.add_argument(
        "--text",
        type=str,
        help="text to synthesize, a 'utt_id sentence' pair per line.")
    parser.add_argument("--output-dir", type=str, help="output dir.")
    parser.add_argument(
        "--ngpu", type=int, default=1, help="if ngpu == 0, use cpu.")
    parser.add_argument("--verbose", type=int, default=1, help="verbose.")

    args = parser.parse_args()

    if args.ngpu == 0:
        paddle.set_device("cpu")
    elif args.ngpu > 0:
        paddle.set_device("gpu")
    else:
        print("ngpu should >= 0 !")

    with open(args.fastspeech2_config) as f:
        fastspeech2_config = CfgNode(yaml.safe_load(f))
    with open(args.pwg_config) as f:
        pwg_config = CfgNode(yaml.safe_load(f))

    print("========Args========")
    print(yaml.safe_dump(vars(args)))
    print("========Config========")
    print(fastspeech2_config)
    print(pwg_config)

    evaluate(args, fastspeech2_config, pwg_config)


if __name__ == "__main__":
    main()
@ -1,189 +0,0 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import logging
from pathlib import Path

import jsonlines
import numpy as np
import paddle
import soundfile as sf
import yaml
from yacs.config import CfgNode

from paddlespeech.t2s.datasets.data_table import DataTable
from paddlespeech.t2s.models.fastspeech2 import FastSpeech2
from paddlespeech.t2s.models.fastspeech2 import FastSpeech2Inference
from paddlespeech.t2s.models.parallel_wavegan import PWGGenerator
from paddlespeech.t2s.models.parallel_wavegan import PWGInference
from paddlespeech.t2s.modules.normalizer import ZScore


def evaluate(args, fastspeech2_config, pwg_config):
    # the dataloader is too verbose
    logging.getLogger("DataLoader").disabled = True

    # construct dataset for evaluation
    with jsonlines.open(args.test_metadata, 'r') as reader:
        test_metadata = list(reader)

    fields = ["utt_id", "text"]

    spk_num = None
    if args.speaker_dict is not None:
        print("multiple speaker fastspeech2!")
        with open(args.speaker_dict, 'rt') as f:
            spk_id = [line.strip().split() for line in f.readlines()]
        spk_num = len(spk_id)
        fields += ["spk_id"]
    elif args.voice_cloning:
        print("voice cloning!")
        fields += ["spk_emb"]
    else:
        print("single speaker fastspeech2!")
    print("spk_num:", spk_num)

    test_dataset = DataTable(data=test_metadata, fields=fields)

    odim = fastspeech2_config.n_mels
    with open(args.phones_dict, "r") as f:
        phn_id = [line.strip().split() for line in f.readlines()]
    vocab_size = len(phn_id)
    print("vocab_size:", vocab_size)

    model = FastSpeech2(
        idim=vocab_size,
        odim=odim,
        spk_num=spk_num,
        **fastspeech2_config["model"])

    model.set_state_dict(
        paddle.load(args.fastspeech2_checkpoint)["main_params"])
    model.eval()

    vocoder = PWGGenerator(**pwg_config["generator_params"])
    vocoder.set_state_dict(paddle.load(args.pwg_checkpoint)["generator_params"])
    vocoder.remove_weight_norm()
    vocoder.eval()
    print("model done!")

    stat = np.load(args.fastspeech2_stat)
    mu, std = stat
    mu = paddle.to_tensor(mu)
    std = paddle.to_tensor(std)
    fastspeech2_normalizer = ZScore(mu, std)

    stat = np.load(args.pwg_stat)
    mu, std = stat
    mu = paddle.to_tensor(mu)
    std = paddle.to_tensor(std)
    pwg_normalizer = ZScore(mu, std)

    fastspeech2_inference = FastSpeech2Inference(fastspeech2_normalizer, model)
    pwg_inference = PWGInference(pwg_normalizer, vocoder)

    output_dir = Path(args.output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    for datum in test_dataset:
        utt_id = datum["utt_id"]
        text = paddle.to_tensor(datum["text"])
        spk_emb = None
        spk_id = None
        if args.voice_cloning and "spk_emb" in datum:
            spk_emb = paddle.to_tensor(np.load(datum["spk_emb"]))
        elif "spk_id" in datum:
            spk_id = paddle.to_tensor(datum["spk_id"])
        with paddle.no_grad():
            wav = pwg_inference(
                fastspeech2_inference(text, spk_id=spk_id, spk_emb=spk_emb))
        sf.write(
            str(output_dir / (utt_id + ".wav")),
            wav.numpy(),
            samplerate=fastspeech2_config.fs)
        print(f"{utt_id} done!")


def main():
    # parse args and config and redirect to train_sp
    parser = argparse.ArgumentParser(
        description="Synthesize with fastspeech2 & parallel wavegan.")
    parser.add_argument(
        "--fastspeech2-config", type=str, help="fastspeech2 config file.")
    parser.add_argument(
        "--fastspeech2-checkpoint",
        type=str,
        help="fastspeech2 checkpoint to load.")
    parser.add_argument(
        "--fastspeech2-stat",
        type=str,
        help="mean and standard deviation used to normalize spectrogram when training fastspeech2."
    )
    parser.add_argument(
        "--pwg-config", type=str, help="parallel wavegan config file.")
    parser.add_argument(
        "--pwg-checkpoint",
        type=str,
        help="parallel wavegan generator parameters to load.")
    parser.add_argument(
        "--pwg-stat",
        type=str,
        help="mean and standard deviation used to normalize spectrogram when training parallel wavegan."
    )
    parser.add_argument(
        "--phones-dict", type=str, default=None, help="phone vocabulary file.")
    parser.add_argument(
        "--speaker-dict",
        type=str,
        default=None,
        help="speaker id map file for multiple speaker model.")

    def str2bool(s):
        return s.lower() == 'true'

    parser.add_argument(
        "--voice-cloning",
        type=str2bool,
        default=False,
        help="whether training voice cloning model.")
    parser.add_argument("--test-metadata", type=str, help="test metadata.")
    parser.add_argument("--output-dir", type=str, help="output dir.")
    parser.add_argument(
        "--ngpu", type=int, default=1, help="if ngpu == 0, use cpu.")
    parser.add_argument("--verbose", type=int, default=1, help="verbose.")

    args = parser.parse_args()
    if args.ngpu == 0:
        paddle.set_device("cpu")
    elif args.ngpu > 0:
        paddle.set_device("gpu")
    else:
        print("ngpu should >= 0 !")

    with open(args.fastspeech2_config) as f:
        fastspeech2_config = CfgNode(yaml.safe_load(f))
    with open(args.pwg_config) as f:
        pwg_config = CfgNode(yaml.safe_load(f))

    print("========Args========")
    print(yaml.safe_dump(vars(args)))
    print("========Config========")
    print(fastspeech2_config)
    print(pwg_config)

    evaluate(args, fastspeech2_config, pwg_config)


if __name__ == "__main__":
    main()
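The `ZScore` objects above wrap the train-set mean/std around the models: the acoustic model works in normalized mel space, and the inference wrappers map its output back to the original scale before the vocoder consumes it. A hedged sketch of that role (this class is illustrative, not paddlespeech's implementation):

```python
class SimpleZScore:
    """Toy scalar z-score normalizer mirroring the normalize/denormalize pair."""

    def __init__(self, mu, sigma):
        self.mu, self.sigma = mu, sigma

    def forward(self, x):
        # normalize: original scale -> model space
        return [(v - self.mu) / self.sigma for v in x]

    def inverse(self, x):
        # denormalize: model space -> original scale
        return [v * self.sigma + self.mu for v in x]

norm = SimpleZScore(mu=2.0, sigma=0.5)
mel = norm.inverse([0.0, 2.0])   # -> [2.0, 3.0]
```

This is why both a `fastspeech2_stat` and a `pwg_stat` file are loaded: each model was trained against its own feature statistics.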
@@ -1,187 +0,0 @@

```python
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import logging
import os
from pathlib import Path

import numpy as np
import paddle
import soundfile as sf
import yaml
from paddle import jit
from paddle.static import InputSpec
from yacs.config import CfgNode

from paddlespeech.t2s.frontend.zh_frontend import Frontend
from paddlespeech.t2s.models.fastspeech2 import FastSpeech2
from paddlespeech.t2s.models.fastspeech2 import FastSpeech2Inference
from paddlespeech.t2s.models.parallel_wavegan import PWGGenerator
from paddlespeech.t2s.models.parallel_wavegan import PWGInference
from paddlespeech.t2s.modules.normalizer import ZScore


def evaluate(args, fastspeech2_config, pwg_config):
    # dataloader has been too verbose
    logging.getLogger("DataLoader").disabled = True

    # construct dataset for evaluation
    sentences = []
    with open(args.text, 'rt') as f:
        for line in f:
            items = line.strip().split()
            utt_id = items[0]
            sentence = "".join(items[1:])
            sentences.append((utt_id, sentence))

    with open(args.phones_dict, "r") as f:
        phn_id = [line.strip().split() for line in f.readlines()]
    vocab_size = len(phn_id)
    print("vocab_size:", vocab_size)
    odim = fastspeech2_config.n_mels
    model = FastSpeech2(
        idim=vocab_size, odim=odim, **fastspeech2_config["model"])

    model.set_state_dict(
        paddle.load(args.fastspeech2_checkpoint)["main_params"])
    model.eval()

    vocoder = PWGGenerator(**pwg_config["generator_params"])
    vocoder.set_state_dict(paddle.load(args.pwg_checkpoint)["generator_params"])
    vocoder.remove_weight_norm()
    vocoder.eval()
    print("model done!")

    frontend = Frontend(phone_vocab_path=args.phones_dict)
    print("frontend done!")

    stat = np.load(args.fastspeech2_stat)
    mu, std = stat
    mu = paddle.to_tensor(mu)
    std = paddle.to_tensor(std)
    fastspeech2_normalizer = ZScore(mu, std)

    stat = np.load(args.pwg_stat)
    mu, std = stat
    mu = paddle.to_tensor(mu)
    std = paddle.to_tensor(std)
    pwg_normalizer = ZScore(mu, std)

    fastspeech2_inference = FastSpeech2Inference(fastspeech2_normalizer, model)
    fastspeech2_inference.eval()
    fastspeech2_inference = jit.to_static(
        fastspeech2_inference, input_spec=[InputSpec([-1], dtype=paddle.int64)])
    paddle.jit.save(fastspeech2_inference,
                    os.path.join(args.inference_dir, "fastspeech2"))
    fastspeech2_inference = paddle.jit.load(
        os.path.join(args.inference_dir, "fastspeech2"))
    pwg_inference = PWGInference(pwg_normalizer, vocoder)
    pwg_inference.eval()
    pwg_inference = jit.to_static(
        pwg_inference, input_spec=[
            InputSpec([-1, 80], dtype=paddle.float32),
        ])
    paddle.jit.save(pwg_inference, os.path.join(args.inference_dir, "pwg"))
    pwg_inference = paddle.jit.load(os.path.join(args.inference_dir, "pwg"))

    output_dir = Path(args.output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    for utt_id, sentence in sentences:
        input_ids = frontend.get_input_ids(sentence, merge_sentences=True)
        phone_ids = input_ids["phone_ids"]
        flags = 0
        for part_phone_ids in phone_ids:
            with paddle.no_grad():
                mel = fastspeech2_inference(part_phone_ids)
                temp_wav = pwg_inference(mel)
            if flags == 0:
                wav = temp_wav
                flags = 1
            else:
                wav = paddle.concat([wav, temp_wav])
        sf.write(
            str(output_dir / (utt_id + ".wav")),
            wav.numpy(),
            samplerate=fastspeech2_config.fs)
        print(f"{utt_id} done!")


def main():
    # parse args and config and redirect to train_sp
    parser = argparse.ArgumentParser(
        description="Synthesize with fastspeech2 & parallel wavegan.")
    parser.add_argument(
        "--fastspeech2-config", type=str, help="fastspeech2 config file.")
    parser.add_argument(
        "--fastspeech2-checkpoint",
        type=str,
        help="fastspeech2 checkpoint to load.")
    parser.add_argument(
        "--fastspeech2-stat",
        type=str,
        help="mean and standard deviation used to normalize spectrogram when training fastspeech2."
    )
    parser.add_argument(
        "--pwg-config", type=str, help="parallel wavegan config file.")
    parser.add_argument(
        "--pwg-checkpoint",
        type=str,
        help="parallel wavegan generator parameters to load.")
    parser.add_argument(
        "--pwg-stat",
        type=str,
        help="mean and standard deviation used to normalize spectrogram when training parallel wavegan."
    )
    parser.add_argument(
        "--phones-dict",
        type=str,
        default="phone_id_map.txt",
        help="phone vocabulary file.")
    parser.add_argument(
        "--text",
        type=str,
        help="text to synthesize, a 'utt_id sentence' pair per line.")
    parser.add_argument("--output-dir", type=str, help="output dir.")
    parser.add_argument(
        "--inference-dir", type=str, help="dir to save inference models")
    parser.add_argument(
        "--ngpu", type=int, default=1, help="if ngpu == 0, use cpu.")
    parser.add_argument("--verbose", type=int, default=1, help="verbose.")

    args = parser.parse_args()

    if args.ngpu == 0:
        paddle.set_device("cpu")
    elif args.ngpu > 0:
        paddle.set_device("gpu")
    else:
        print("ngpu should >= 0 !")

    with open(args.fastspeech2_config) as f:
        fastspeech2_config = CfgNode(yaml.safe_load(f))
    with open(args.pwg_config) as f:
        pwg_config = CfgNode(yaml.safe_load(f))

    print("========Args========")
    print(yaml.safe_dump(vars(args)))
    print("========Config========")
    print(fastspeech2_config)
    print(pwg_config)

    evaluate(args, fastspeech2_config, pwg_config)


if __name__ == "__main__":
    main()
```
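The per-sentence loop above stitches waveform chunks together with a `flags` variable; the same accumulation can be sketched more idiomatically by collecting the chunks in a list and concatenating once (a sketch with NumPy standing in for Paddle tensors, and a hypothetical `vocode` callable in place of the acoustic model + vocoder pair):

```python
import numpy as np

def synthesize_utterance(part_phone_ids, vocode):
    # `vocode` is a hypothetical stand-in for the fastspeech2 + vocoder
    # pipeline: it maps one sentence's phone ids to a waveform chunk.
    chunks = [vocode(ids) for ids in part_phone_ids]
    return np.concatenate(chunks) if chunks else np.zeros(0, dtype=np.float32)

# toy vocoder: each phone id becomes two identical samples
fake_vocode = lambda ids: np.repeat(np.asarray(ids, dtype=np.float32), 2)
wav = synthesize_utterance([[1, 2], [3]], fake_vocode)
# wav now holds the concatenated waveform for the whole utterance
```

Concatenating once at the end avoids reallocating the growing tensor on every sentence, which the `flags` pattern does.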
@@ -1,166 +0,0 @@

```python
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import logging
from pathlib import Path

import numpy as np
import paddle
import soundfile as sf
import yaml
from yacs.config import CfgNode

from paddlespeech.t2s.frontend import English
from paddlespeech.t2s.models.fastspeech2 import FastSpeech2
from paddlespeech.t2s.models.fastspeech2 import FastSpeech2Inference
from paddlespeech.t2s.models.parallel_wavegan import PWGGenerator
from paddlespeech.t2s.models.parallel_wavegan import PWGInference
from paddlespeech.t2s.modules.normalizer import ZScore


def evaluate(args, fastspeech2_config, pwg_config):
    # dataloader has been too verbose
    logging.getLogger("DataLoader").disabled = True

    # construct dataset for evaluation
    sentences = []
    with open(args.text, 'rt') as f:
        for line in f:
            line_list = line.strip().split()
            utt_id = line_list[0]
            sentence = " ".join(line_list[1:])
            sentences.append((utt_id, sentence))

    with open(args.phones_dict, "r") as f:
        phn_id = [line.strip().split() for line in f.readlines()]
    vocab_size = len(phn_id)
    phone_id_map = {}
    for phn, id in phn_id:
        phone_id_map[phn] = int(id)
    print("vocab_size:", vocab_size)
    odim = fastspeech2_config.n_mels
    model = FastSpeech2(
        idim=vocab_size, odim=odim, **fastspeech2_config["model"])

    model.set_state_dict(
        paddle.load(args.fastspeech2_checkpoint)["main_params"])
    model.eval()

    vocoder = PWGGenerator(**pwg_config["generator_params"])
    vocoder.set_state_dict(paddle.load(args.pwg_checkpoint)["generator_params"])
    vocoder.remove_weight_norm()
    vocoder.eval()
    print("model done!")

    frontend = English(phone_vocab_path=args.phones_dict)
    print("frontend done!")

    stat = np.load(args.fastspeech2_stat)
    mu, std = stat
    mu = paddle.to_tensor(mu)
    std = paddle.to_tensor(std)
    fastspeech2_normalizer = ZScore(mu, std)

    stat = np.load(args.pwg_stat)
    mu, std = stat
    mu = paddle.to_tensor(mu)
    std = paddle.to_tensor(std)
    pwg_normalizer = ZScore(mu, std)

    fastspeech2_inference = FastSpeech2Inference(fastspeech2_normalizer, model)
    pwg_inference = PWGInference(pwg_normalizer, vocoder)

    output_dir = Path(args.output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    for utt_id, sentence in sentences:
        input_ids = frontend.get_input_ids(sentence)
        phone_ids = input_ids["phone_ids"]

        with paddle.no_grad():
            mel = fastspeech2_inference(phone_ids)
            wav = pwg_inference(mel)

        sf.write(
            str(output_dir / (utt_id + ".wav")),
            wav.numpy(),
            samplerate=fastspeech2_config.fs)
        print(f"{utt_id} done!")


def main():
    # parse args and config and redirect to train_sp
    parser = argparse.ArgumentParser(
        description="Synthesize with fastspeech2 & parallel wavegan.")
    parser.add_argument(
        "--fastspeech2-config", type=str, help="fastspeech2 config file.")
    parser.add_argument(
        "--fastspeech2-checkpoint",
        type=str,
        help="fastspeech2 checkpoint to load.")
    parser.add_argument(
        "--fastspeech2-stat",
        type=str,
        help="mean and standard deviation used to normalize spectrogram when training fastspeech2."
    )
    parser.add_argument(
        "--pwg-config", type=str, help="parallel wavegan config file.")
    parser.add_argument(
        "--pwg-checkpoint",
        type=str,
        help="parallel wavegan generator parameters to load.")
    parser.add_argument(
        "--pwg-stat",
        type=str,
        help="mean and standard deviation used to normalize spectrogram when training parallel wavegan."
    )
    parser.add_argument(
        "--phones-dict",
        type=str,
        default="phone_id_map.txt",
        help="phone vocabulary file.")
    parser.add_argument(
        "--text",
        type=str,
        help="text to synthesize, a 'utt_id sentence' pair per line.")
    parser.add_argument("--output-dir", type=str, help="output dir.")
    parser.add_argument(
        "--ngpu", type=int, default=1, help="if ngpu == 0, use cpu.")
    parser.add_argument("--verbose", type=int, default=1, help="verbose.")

    args = parser.parse_args()

    if args.ngpu == 0:
        paddle.set_device("cpu")
    elif args.ngpu > 0:
        paddle.set_device("gpu")
    else:
        print("ngpu should >= 0 !")

    with open(args.fastspeech2_config) as f:
        fastspeech2_config = CfgNode(yaml.safe_load(f))
    with open(args.pwg_config) as f:
        pwg_config = CfgNode(yaml.safe_load(f))

    print("========Args========")
    print(yaml.safe_dump(vars(args)))
    print("========Config========")
    print(fastspeech2_config)
    print(pwg_config)

    evaluate(args, fastspeech2_config, pwg_config)


if __name__ == "__main__":
    main()
```
@@ -1,192 +0,0 @@

```python
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import logging
import os
from pathlib import Path

import numpy as np
import paddle
import soundfile as sf
import yaml
from paddle import jit
from paddle.static import InputSpec
from yacs.config import CfgNode

from paddlespeech.t2s.frontend.zh_frontend import Frontend
from paddlespeech.t2s.models.fastspeech2 import FastSpeech2
from paddlespeech.t2s.models.fastspeech2 import FastSpeech2Inference
from paddlespeech.t2s.models.melgan import MelGANGenerator
from paddlespeech.t2s.models.melgan import MelGANInference
from paddlespeech.t2s.modules.normalizer import ZScore


def evaluate(args, fastspeech2_config, melgan_config):
    # dataloader has been too verbose
    logging.getLogger("DataLoader").disabled = True

    # construct dataset for evaluation
    sentences = []
    with open(args.text, 'rt') as f:
        for line in f:
            items = line.strip().split()
            utt_id = items[0]
            sentence = "".join(items[1:])
            sentences.append((utt_id, sentence))

    with open(args.phones_dict, "r") as f:
        phn_id = [line.strip().split() for line in f.readlines()]
    vocab_size = len(phn_id)
    print("vocab_size:", vocab_size)
    odim = fastspeech2_config.n_mels
    model = FastSpeech2(
        idim=vocab_size, odim=odim, **fastspeech2_config["model"])

    model.set_state_dict(
        paddle.load(args.fastspeech2_checkpoint)["main_params"])
    model.eval()

    vocoder = MelGANGenerator(**melgan_config["generator_params"])
    vocoder.set_state_dict(
        paddle.load(args.melgan_checkpoint)["generator_params"])
    vocoder.remove_weight_norm()
    vocoder.eval()
    print("model done!")

    frontend = Frontend(phone_vocab_path=args.phones_dict)
    print("frontend done!")

    stat = np.load(args.fastspeech2_stat)
    mu, std = stat
    mu = paddle.to_tensor(mu)
    std = paddle.to_tensor(std)
    fastspeech2_normalizer = ZScore(mu, std)

    stat = np.load(args.melgan_stat)
    mu, std = stat
    mu = paddle.to_tensor(mu)
    std = paddle.to_tensor(std)
    melgan_normalizer = ZScore(mu, std)

    fastspeech2_inference = FastSpeech2Inference(fastspeech2_normalizer, model)
    fastspeech2_inference.eval()
    fastspeech2_inference = jit.to_static(
        fastspeech2_inference, input_spec=[InputSpec([-1], dtype=paddle.int64)])
    paddle.jit.save(fastspeech2_inference,
                    os.path.join(args.inference_dir, "fastspeech2"))
    fastspeech2_inference = paddle.jit.load(
        os.path.join(args.inference_dir, "fastspeech2"))

    mb_melgan_inference = MelGANInference(melgan_normalizer, vocoder)
    mb_melgan_inference.eval()
    mb_melgan_inference = jit.to_static(
        mb_melgan_inference,
        input_spec=[
            InputSpec([-1, 80], dtype=paddle.float32),
        ])
    paddle.jit.save(mb_melgan_inference,
                    os.path.join(args.inference_dir, "mb_melgan"))
    mb_melgan_inference = paddle.jit.load(
        os.path.join(args.inference_dir, "mb_melgan"))

    output_dir = Path(args.output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    for utt_id, sentence in sentences:
        input_ids = frontend.get_input_ids(sentence, merge_sentences=True)
        phone_ids = input_ids["phone_ids"]
        flags = 0
        for part_phone_ids in phone_ids:
            with paddle.no_grad():
                mel = fastspeech2_inference(part_phone_ids)
                temp_wav = mb_melgan_inference(mel)
            if flags == 0:
                wav = temp_wav
                flags = 1
            else:
                wav = paddle.concat([wav, temp_wav])
        sf.write(
            str(output_dir / (utt_id + ".wav")),
            wav.numpy(),
            samplerate=fastspeech2_config.fs)
        print(f"{utt_id} done!")


def main():
    # parse args and config and redirect to train_sp
    parser = argparse.ArgumentParser(
        description="Synthesize with fastspeech2 & multi band melgan.")
    parser.add_argument(
        "--fastspeech2-config", type=str, help="fastspeech2 config file.")
    parser.add_argument(
        "--fastspeech2-checkpoint",
        type=str,
        help="fastspeech2 checkpoint to load.")
    parser.add_argument(
        "--fastspeech2-stat",
        type=str,
        help="mean and standard deviation used to normalize spectrogram when training fastspeech2."
    )
    parser.add_argument(
        "--melgan-config", type=str, help="multi band melgan config file.")
    parser.add_argument(
        "--melgan-checkpoint",
        type=str,
        help="multi band melgan generator parameters to load.")
    parser.add_argument(
        "--melgan-stat",
        type=str,
        help="mean and standard deviation used to normalize spectrogram when training multi band melgan."
    )
    parser.add_argument(
        "--phones-dict",
        type=str,
        default="phone_id_map.txt",
        help="phone vocabulary file.")
    parser.add_argument(
        "--text",
        type=str,
        help="text to synthesize, a 'utt_id sentence' pair per line.")
    parser.add_argument("--output-dir", type=str, help="output dir.")
    parser.add_argument(
        "--inference-dir", type=str, help="dir to save inference models")
    parser.add_argument(
        "--ngpu", type=int, default=1, help="if ngpu == 0, use cpu.")
    parser.add_argument("--verbose", type=int, default=1, help="verbose.")

    args = parser.parse_args()

    if args.ngpu == 0:
        paddle.set_device("cpu")
    elif args.ngpu > 0:
        paddle.set_device("gpu")
    else:
        print("ngpu should >= 0 !")

    with open(args.fastspeech2_config) as f:
        fastspeech2_config = CfgNode(yaml.safe_load(f))
    with open(args.melgan_config) as f:
        melgan_config = CfgNode(yaml.safe_load(f))

    print("========Args========")
    print(yaml.safe_dump(vars(args)))
    print("========Config========")
    print(fastspeech2_config)
    print(melgan_config)

    evaluate(args, fastspeech2_config, melgan_config)


if __name__ == "__main__":
    main()
```
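All of the synthesis scripts above build `ZScore` normalizers from saved mean/std statistics (`*_stat.npy`, a stacked `[mu, std]` array). A minimal NumPy sketch of the transform these normalizers are assumed to compute (not the actual code of the Paddle `ZScore` module):

```python
import numpy as np

def zscore(x, mu, std):
    # normalize features to zero mean, unit variance per dimension
    return (x - mu) / std

def inv_zscore(x, mu, std):
    # map normalized features back to the original scale
    return x * std + mu

# stacked [mu, std] rows, mirroring what np.load(args.*_stat) returns
stats = np.array([[2.0, 10.0], [0.5, 4.0]])
mu, std = stats
feats = np.array([[2.5, 14.0], [1.5, 6.0]])
norm = zscore(feats, mu, std)
```

The acoustic model predicts in the normalized space, so the inference wrappers denormalize with the same statistics before handing mels to the vocoder.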
@@ -0,0 +1,13 @@

```python
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
```
@ -0,0 +1,277 @@
|
|||||||
|
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
import argparse
|
||||||
|
import logging
|
||||||
|
import os
|
||||||
|
import shutil
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
import jsonlines
|
||||||
|
import numpy as np
|
||||||
|
import paddle
|
||||||
|
import yaml
|
||||||
|
from paddle import DataParallel
|
||||||
|
from paddle import distributed as dist
|
||||||
|
from paddle import nn
|
||||||
|
from paddle.io import DataLoader
|
||||||
|
from paddle.io import DistributedBatchSampler
|
||||||
|
from paddle.optimizer import Adam
|
||||||
|
from paddle.optimizer.lr import MultiStepDecay
|
||||||
|
from yacs.config import CfgNode
|
||||||
|
|
||||||
|
from paddlespeech.t2s.datasets.data_table import DataTable
|
||||||
|
from paddlespeech.t2s.datasets.vocoder_batch_fn import Clip
|
||||||
|
from paddlespeech.t2s.models.hifigan import HiFiGANEvaluator
|
||||||
|
from paddlespeech.t2s.models.hifigan import HiFiGANGenerator
|
||||||
|
from paddlespeech.t2s.models.hifigan import HiFiGANMultiScaleMultiPeriodDiscriminator
|
||||||
|
from paddlespeech.t2s.models.hifigan import HiFiGANUpdater
|
||||||
|
from paddlespeech.t2s.modules.losses import DiscriminatorAdversarialLoss
|
||||||
|
from paddlespeech.t2s.modules.losses import FeatureMatchLoss
|
||||||
|
from paddlespeech.t2s.modules.losses import GeneratorAdversarialLoss
|
||||||
|
from paddlespeech.t2s.modules.losses import MelSpectrogramLoss
|
||||||
|
from paddlespeech.t2s.training.extensions.snapshot import Snapshot
|
||||||
|
from paddlespeech.t2s.training.extensions.visualizer import VisualDL
|
||||||
|
from paddlespeech.t2s.training.seeding import seed_everything
|
||||||
|
from paddlespeech.t2s.training.trainer import Trainer
|
||||||
|
|
||||||
|
|
||||||
|
def train_sp(args, config):
|
||||||
|
# decides device type and whether to run in parallel
|
||||||
|
# setup running environment correctly
|
||||||
|
world_size = paddle.distributed.get_world_size()
|
||||||
|
if (not paddle.is_compiled_with_cuda()) or args.ngpu == 0:
|
||||||
|
paddle.set_device("cpu")
|
||||||
|
else:
|
||||||
|
paddle.set_device("gpu")
|
||||||
|
if world_size > 1:
|
||||||
|
paddle.distributed.init_parallel_env()
|
||||||
|
|
||||||
|
# set the random seed, it is a must for multiprocess training
|
||||||
|
seed_everything(config.seed)
|
||||||
|
|
||||||
|
print(
|
||||||
|
f"rank: {dist.get_rank()}, pid: {os.getpid()}, parent_pid: {os.getppid()}",
|
||||||
|
)
|
||||||
|
|
||||||
|
# dataloader has been too verbose
|
||||||
|
logging.getLogger("DataLoader").disabled = True
|
||||||
|
|
||||||
|
# construct dataset for training and validation
|
||||||
|
with jsonlines.open(args.train_metadata, 'r') as reader:
|
||||||
|
train_metadata = list(reader)
|
||||||
|
train_dataset = DataTable(
|
||||||
|
data=train_metadata,
|
||||||
|
fields=["wave", "feats"],
|
||||||
|
converters={
|
||||||
|
"wave": np.load,
|
||||||
|
"feats": np.load,
|
||||||
|
}, )
|
||||||
|
with jsonlines.open(args.dev_metadata, 'r') as reader:
|
||||||
|
dev_metadata = list(reader)
|
||||||
|
dev_dataset = DataTable(
|
||||||
|
data=dev_metadata,
|
||||||
|
fields=["wave", "feats"],
|
||||||
|
converters={
|
||||||
|
"wave": np.load,
|
||||||
|
"feats": np.load,
|
||||||
|
}, )
|
||||||
|
|
||||||
|
# collate function and dataloader
|
||||||
|
train_sampler = DistributedBatchSampler(
|
||||||
|
train_dataset,
|
||||||
|
batch_size=config.batch_size,
|
||||||
|
shuffle=True,
|
||||||
|
drop_last=True)
|
||||||
|
dev_sampler = DistributedBatchSampler(
|
||||||
|
dev_dataset,
|
||||||
|
batch_size=config.batch_size,
|
||||||
|
shuffle=False,
|
||||||
|
drop_last=False)
|
||||||
|
print("samplers done!")
|
||||||
|
|
||||||
|
if "aux_context_window" in config.generator_params:
|
||||||
|
aux_context_window = config.generator_params.aux_context_window
|
||||||
|
else:
|
||||||
|
aux_context_window = 0
|
||||||
|
train_batch_fn = Clip(
|
||||||
|
batch_max_steps=config.batch_max_steps,
|
||||||
|
hop_size=config.n_shift,
|
||||||
|
aux_context_window=aux_context_window)
|
||||||
|
|
||||||
|
train_dataloader = DataLoader(
|
||||||
|
train_dataset,
|
||||||
|
batch_sampler=train_sampler,
|
||||||
|
collate_fn=train_batch_fn,
|
||||||
|
num_workers=config.num_workers)
|
||||||
|
|
||||||
|
dev_dataloader = DataLoader(
|
||||||
|
dev_dataset,
|
||||||
|
batch_sampler=dev_sampler,
|
||||||
|
collate_fn=train_batch_fn,
|
||||||
|
num_workers=config.num_workers)
|
||||||
|
print("dataloaders done!")
|
||||||
|
|
||||||
|
generator = HiFiGANGenerator(**config["generator_params"])
|
||||||
|
discriminator = HiFiGANMultiScaleMultiPeriodDiscriminator(
|
||||||
|
**config["discriminator_params"])
|
||||||
|
if world_size > 1:
|
||||||
|
generator = DataParallel(generator)
|
||||||
|
discriminator = DataParallel(discriminator)
|
||||||
|
print("models done!")
|
||||||
|
|
||||||
|
criterion_feat_match = FeatureMatchLoss(**config["feat_match_loss_params"])
|
||||||
|
criterion_mel = MelSpectrogramLoss(
|
||||||
|
fs=config.fs,
|
||||||
|
fft_size=config.n_fft,
|
||||||
|
hop_size=config.n_shift,
|
||||||
|
win_length=config.win_length,
|
||||||
|
window=config.window,
|
||||||
|
num_mels=config.n_mels,
|
||||||
|
fmin=config.fmin,
|
||||||
|
fmax=config.fmax, )
|
||||||
|
criterion_gen_adv = GeneratorAdversarialLoss(
|
||||||
|
**config["generator_adv_loss_params"])
|
||||||
|
criterion_dis_adv = DiscriminatorAdversarialLoss(
|
||||||
|
**config["discriminator_adv_loss_params"])
|
||||||
|
print("criterions done!")
|
||||||
|
|
||||||
|
lr_schedule_g = MultiStepDecay(**config["generator_scheduler_params"])
|
||||||
|
# Compared to multi_band_melgan.v1 config, Adam optimizer without gradient norm is used
|
||||||
|
generator_grad_norm = config["generator_grad_norm"]
|
||||||
|
gradient_clip_g = nn.ClipGradByGlobalNorm(
|
||||||
|
generator_grad_norm) if generator_grad_norm > 0 else None
|
||||||
|
print("gradient_clip_g:", gradient_clip_g)
|
||||||
|
|
||||||
|
optimizer_g = Adam(
|
||||||
|
learning_rate=lr_schedule_g,
|
||||||
|
grad_clip=gradient_clip_g,
|
||||||
|
parameters=generator.parameters(),
|
||||||
|
**config["generator_optimizer_params"])
|
||||||
|
lr_schedule_d = MultiStepDecay(**config["discriminator_scheduler_params"])
|
||||||
|
discriminator_grad_norm = config["discriminator_grad_norm"]
|
||||||
|
gradient_clip_d = nn.ClipGradByGlobalNorm(
|
||||||
|
discriminator_grad_norm) if discriminator_grad_norm > 0 else None
|
||||||
|
print("gradient_clip_d:", gradient_clip_d)
|
||||||
|
optimizer_d = Adam(
|
||||||
|
learning_rate=lr_schedule_d,
|
||||||
|
grad_clip=gradient_clip_d,
|
||||||
|
parameters=discriminator.parameters(),
|
||||||
|
**config["discriminator_optimizer_params"])
|
||||||
|
print("optimizers done!")
|
||||||
|
|
||||||
|
output_dir = Path(args.output_dir)
|
||||||
|
output_dir.mkdir(parents=True, exist_ok=True)
|
||||||
|
if dist.get_rank() == 0:
|
||||||
|
config_name = args.config.split("/")[-1]
|
||||||
|
# copy conf to output_dir
|
||||||
|
shutil.copyfile(args.config, output_dir / config_name)
|
||||||
|
|
||||||
|
updater = HiFiGANUpdater(
|
||||||
|
models={
|
||||||
|
"generator": generator,
|
||||||
|
"discriminator": discriminator,
|
||||||
|
},
|
||||||
|
optimizers={
|
||||||
|
"generator": optimizer_g,
|
||||||
|
"discriminator": optimizer_d,
|
||||||
|
},
|
||||||
|
criterions={
|
||||||
|
"mel": criterion_mel,
|
||||||
|
"feat_match": criterion_feat_match,
|
||||||
|
"gen_adv": criterion_gen_adv,
|
||||||
|
"dis_adv": criterion_dis_adv,
|
||||||
|
},
|
||||||
|
schedulers={
|
||||||
|
"generator": lr_schedule_g,
|
||||||
|
"discriminator": lr_schedule_d,
|
||||||
|
},
|
||||||
|
dataloader=train_dataloader,
|
||||||
|
discriminator_train_start_steps=config.discriminator_train_start_steps,
|
||||||
|
    # only hifigan has generator_train_start_steps
        generator_train_start_steps=config.generator_train_start_steps,
        lambda_adv=config.lambda_adv,
        lambda_aux=config.lambda_aux,
        lambda_feat_match=config.lambda_feat_match,
        output_dir=output_dir)

    evaluator = HiFiGANEvaluator(
        models={
            "generator": generator,
            "discriminator": discriminator,
        },
        criterions={
            "mel": criterion_mel,
            "feat_match": criterion_feat_match,
            "gen_adv": criterion_gen_adv,
            "dis_adv": criterion_dis_adv,
        },
        dataloader=dev_dataloader,
        lambda_adv=config.lambda_adv,
        lambda_aux=config.lambda_aux,
        lambda_feat_match=config.lambda_feat_match,
        output_dir=output_dir)

    trainer = Trainer(
        updater,
        stop_trigger=(config.train_max_steps, "iteration"),
        out=output_dir)

    if dist.get_rank() == 0:
        trainer.extend(
            evaluator, trigger=(config.eval_interval_steps, 'iteration'))
        trainer.extend(VisualDL(output_dir), trigger=(1, 'iteration'))
        trainer.extend(
            Snapshot(max_size=config.num_snapshots),
            trigger=(config.save_interval_steps, 'iteration'))

    print("Trainer Done!")
    trainer.run()


def main():
    # parse args and config and redirect to train_sp

    parser = argparse.ArgumentParser(
        description="Train a HiFiGAN model.")
    parser.add_argument(
        "--config", type=str, help="config file to overwrite default config.")
    parser.add_argument("--train-metadata", type=str, help="training data.")
    parser.add_argument("--dev-metadata", type=str, help="dev data.")
    parser.add_argument("--output-dir", type=str, help="output dir.")
    parser.add_argument(
        "--ngpu", type=int, default=1, help="if ngpu == 0, use cpu.")
    parser.add_argument("--verbose", type=int, default=1, help="verbose.")

    args = parser.parse_args()

    with open(args.config, 'rt') as f:
        config = CfgNode(yaml.safe_load(f))

    print("========Args========")
    print(yaml.safe_dump(vars(args)))
    print("========Config========")
    print(config)
    print(
        f"master sees the world size: {dist.get_world_size()}, from pid: {os.getpid()}"
    )

    # dispatch
    if args.ngpu > 1:
        dist.spawn(train_sp, (args, config), nprocs=args.ngpu)
    else:
        train_sp(args, config)


if __name__ == "__main__":
    main()
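The `trigger=(N, 'iteration')` pairs above schedule trainer extensions (evaluation, VisualDL logging, snapshots) at fixed iteration intervals. A minimal sketch of that interval-trigger idea; the class name `IntervalTrigger` is illustrative, the real trigger machinery lives inside paddlespeech's training package:

```python
# Illustrative sketch of an interval trigger: fires every `period` iterations,
# like trigger=(config.save_interval_steps, 'iteration') above.
class IntervalTrigger:
    def __init__(self, period):
        self.period = period

    def __call__(self, iteration):
        # fire on every multiple of the period
        return iteration % self.period == 0


snapshot_trigger = IntervalTrigger(1000)
print(snapshot_trigger(1000))  # True
print(snapshot_trigger(1500))  # False
```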
@ -0,0 +1,136 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
from pathlib import Path

import soundfile as sf
from paddle import inference

from paddlespeech.t2s.frontend.zh_frontend import Frontend


# only inference for models trained with csmsc now
def main():
    parser = argparse.ArgumentParser(
        description="Paddle Inference with speedyspeech & parallel wavegan.")
    # acoustic model
    parser.add_argument(
        '--am',
        type=str,
        default='fastspeech2_csmsc',
        choices=['speedyspeech_csmsc', 'fastspeech2_csmsc'],
        help='Choose acoustic model type of tts task.')
    parser.add_argument(
        "--phones_dict", type=str, default=None, help="phone vocabulary file.")
    parser.add_argument(
        "--tones_dict", type=str, default=None, help="tone vocabulary file.")
    # voc
    parser.add_argument(
        '--voc',
        type=str,
        default='pwgan_csmsc',
        choices=['pwgan_csmsc', 'mb_melgan_csmsc', 'hifigan_csmsc'],
        help='Choose vocoder type of tts task.')
    # other
    parser.add_argument(
        "--text",
        type=str,
        help="text to synthesize, a 'utt_id sentence' pair per line")
    parser.add_argument(
        "--inference_dir", type=str, help="dir to save inference models")
    parser.add_argument("--output_dir", type=str, help="output dir")

    args, _ = parser.parse_known_args()

    frontend = Frontend(
        phone_vocab_path=args.phones_dict, tone_vocab_path=args.tones_dict)
    print("frontend done!")

    # model: {model_name}_{dataset}
    am_name = args.am[:args.am.rindex('_')]
    am_dataset = args.am[args.am.rindex('_') + 1:]

    am_config = inference.Config(
        str(Path(args.inference_dir) / (args.am + ".pdmodel")),
        str(Path(args.inference_dir) / (args.am + ".pdiparams")))
    am_config.enable_use_gpu(100, 0)
    # This line must be commented out for fastspeech2; otherwise it will OOM
    if am_name != 'fastspeech2':
        am_config.enable_memory_optim()
    am_predictor = inference.create_predictor(am_config)

    voc_config = inference.Config(
        str(Path(args.inference_dir) / (args.voc + ".pdmodel")),
        str(Path(args.inference_dir) / (args.voc + ".pdiparams")))
    voc_config.enable_use_gpu(100, 0)
    voc_config.enable_memory_optim()
    voc_predictor = inference.create_predictor(voc_config)

    output_dir = Path(args.output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    sentences = []

    print("in new inference")

    with open(args.text, 'rt') as f:
        for line in f:
            items = line.strip().split()
            utt_id = items[0]
            sentence = "".join(items[1:])
            sentences.append((utt_id, sentence))

    get_tone_ids = False
    if am_name == 'speedyspeech':
        get_tone_ids = True

    am_input_names = am_predictor.get_input_names()

    for utt_id, sentence in sentences:
        input_ids = frontend.get_input_ids(
            sentence, merge_sentences=True, get_tone_ids=get_tone_ids)
        phone_ids = input_ids["phone_ids"]
        if get_tone_ids:
            tone_ids = input_ids["tone_ids"]
            tones = tone_ids[0].numpy()
            tones_handle = am_predictor.get_input_handle(am_input_names[1])
            tones_handle.reshape(tones.shape)
            tones_handle.copy_from_cpu(tones)

        phones = phone_ids[0].numpy()
        phones_handle = am_predictor.get_input_handle(am_input_names[0])
        phones_handle.reshape(phones.shape)
        phones_handle.copy_from_cpu(phones)

        am_predictor.run()
        am_output_names = am_predictor.get_output_names()
        am_output_handle = am_predictor.get_output_handle(am_output_names[0])
        am_output_data = am_output_handle.copy_to_cpu()

        voc_input_names = voc_predictor.get_input_names()
        mel_handle = voc_predictor.get_input_handle(voc_input_names[0])
        mel_handle.reshape(am_output_data.shape)
        mel_handle.copy_from_cpu(am_output_data)

        voc_predictor.run()
        voc_output_names = voc_predictor.get_output_names()
        voc_output_handle = voc_predictor.get_output_handle(voc_output_names[0])
        wav = voc_output_handle.copy_to_cpu()

        sf.write(output_dir / (utt_id + ".wav"), wav, samplerate=24000)

        print(f"{utt_id} done!")


if __name__ == "__main__":
    main()
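The scripts in this commit repeatedly split a model tag like `fastspeech2_csmsc` on its last underscore (`args.am[:args.am.rindex('_')]`) to recover the `{model_name}_{dataset}` pair. A small standalone sketch of that convention; the helper name `split_model_tag` is illustrative, not from the repo:

```python
def split_model_tag(tag: str):
    """Split a '{model_name}_{dataset}' tag on the LAST underscore,
    so multi-word model names like 'mb_melgan' stay intact."""
    idx = tag.rindex('_')
    return tag[:idx], tag[idx + 1:]


print(split_model_tag('fastspeech2_csmsc'))  # ('fastspeech2', 'csmsc')
print(split_model_tag('mb_melgan_csmsc'))    # ('mb_melgan', 'csmsc')
```

Splitting on the first underscore instead would break names such as `mb_melgan_csmsc`, which is why `rindex` is used.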
@ -1,185 +0,0 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import logging
import os
from pathlib import Path

import jsonlines
import numpy as np
import paddle
import soundfile as sf
import yaml
from paddle import jit
from paddle.static import InputSpec
from yacs.config import CfgNode

from paddlespeech.t2s.datasets.data_table import DataTable
from paddlespeech.t2s.models.parallel_wavegan import PWGGenerator
from paddlespeech.t2s.models.parallel_wavegan import PWGInference
from paddlespeech.t2s.models.speedyspeech import SpeedySpeech
from paddlespeech.t2s.models.speedyspeech import SpeedySpeechInference
from paddlespeech.t2s.modules.normalizer import ZScore


def evaluate(args, speedyspeech_config, pwg_config):
    # dataloader has been too verbose
    logging.getLogger("DataLoader").disabled = True

    # construct dataset for evaluation
    with jsonlines.open(args.test_metadata, 'r') as reader:
        test_metadata = list(reader)
    test_dataset = DataTable(
        data=test_metadata, fields=["utt_id", "phones", "tones"])

    with open(args.phones_dict, "r") as f:
        phn_id = [line.strip().split() for line in f.readlines()]
    vocab_size = len(phn_id)
    print("vocab_size:", vocab_size)
    with open(args.tones_dict, "r") as f:
        tone_id = [line.strip().split() for line in f.readlines()]
    tone_size = len(tone_id)
    print("tone_size:", tone_size)

    model = SpeedySpeech(
        vocab_size=vocab_size,
        tone_size=tone_size,
        **speedyspeech_config["model"])
    model.set_state_dict(
        paddle.load(args.speedyspeech_checkpoint)["main_params"])
    model.eval()

    vocoder = PWGGenerator(**pwg_config["generator_params"])
    vocoder.set_state_dict(paddle.load(args.pwg_checkpoint)["generator_params"])
    vocoder.remove_weight_norm()
    vocoder.eval()
    print("model done!")

    stat = np.load(args.speedyspeech_stat)
    mu, std = stat
    mu = paddle.to_tensor(mu)
    std = paddle.to_tensor(std)
    speedyspeech_normalizer = ZScore(mu, std)
    speedyspeech_normalizer.eval()

    stat = np.load(args.pwg_stat)
    mu, std = stat
    mu = paddle.to_tensor(mu)
    std = paddle.to_tensor(std)
    pwg_normalizer = ZScore(mu, std)
    pwg_normalizer.eval()

    speedyspeech_inference = SpeedySpeechInference(speedyspeech_normalizer,
                                                   model)
    speedyspeech_inference.eval()
    speedyspeech_inference = jit.to_static(
        speedyspeech_inference,
        input_spec=[
            InputSpec([-1], dtype=paddle.int64), InputSpec(
                [-1], dtype=paddle.int64)
        ])
    paddle.jit.save(speedyspeech_inference,
                    os.path.join(args.inference_dir, "speedyspeech"))
    speedyspeech_inference = paddle.jit.load(
        os.path.join(args.inference_dir, "speedyspeech"))

    pwg_inference = PWGInference(pwg_normalizer, vocoder)
    pwg_inference.eval()
    pwg_inference = jit.to_static(
        pwg_inference, input_spec=[
            InputSpec([-1, 80], dtype=paddle.float32),
        ])
    paddle.jit.save(pwg_inference, os.path.join(args.inference_dir, "pwg"))
    pwg_inference = paddle.jit.load(os.path.join(args.inference_dir, "pwg"))

    output_dir = Path(args.output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    for datum in test_dataset:
        utt_id = datum["utt_id"]
        phones = paddle.to_tensor(datum["phones"])
        tones = paddle.to_tensor(datum["tones"])

        with paddle.no_grad():
            wav = pwg_inference(speedyspeech_inference(phones, tones))
        sf.write(
            output_dir / (utt_id + ".wav"),
            wav.numpy(),
            samplerate=speedyspeech_config.fs)
        print(f"{utt_id} done!")


def main():
    # parse args and config and redirect to train_sp
    parser = argparse.ArgumentParser(
        description="Synthesize with speedyspeech & parallel wavegan.")
    parser.add_argument(
        "--speedyspeech-config", type=str, help="config file for speedyspeech.")
    parser.add_argument(
        "--speedyspeech-checkpoint",
        type=str,
        help="speedyspeech checkpoint to load.")
    parser.add_argument(
        "--speedyspeech-stat",
        type=str,
        help="mean and standard deviation used to normalize spectrogram when training speedyspeech."
    )
    parser.add_argument(
        "--pwg-config", type=str, help="config file for parallel wavegan.")
    parser.add_argument(
        "--pwg-checkpoint",
        type=str,
        help="parallel wavegan generator parameters to load.")
    parser.add_argument(
        "--pwg-stat",
        type=str,
        help="mean and standard deviation used to normalize spectrogram when training parallel wavegan."
    )
    parser.add_argument(
        "--phones-dict", type=str, default=None, help="phone vocabulary file.")
    parser.add_argument(
        "--tones-dict", type=str, default=None, help="tone vocabulary file.")
    parser.add_argument("--test-metadata", type=str, help="test metadata")
    parser.add_argument("--output-dir", type=str, help="output dir")
    parser.add_argument(
        "--inference-dir", type=str, help="dir to save inference models")
    parser.add_argument(
        "--ngpu", type=int, default=1, help="if ngpu == 0, use cpu.")
    parser.add_argument("--verbose", type=int, default=1, help="verbose")

    args, _ = parser.parse_known_args()

    if args.ngpu == 0:
        paddle.set_device("cpu")
    elif args.ngpu > 0:
        paddle.set_device("gpu")
    else:
        print("ngpu should >= 0 !")

    with open(args.speedyspeech_config) as f:
        speedyspeech_config = CfgNode(yaml.safe_load(f))
    with open(args.pwg_config) as f:
        pwg_config = CfgNode(yaml.safe_load(f))

    print("========Args========")
    print(yaml.safe_dump(vars(args)))
    print("========Config========")
    print(speedyspeech_config)
    print(pwg_config)

    evaluate(args, speedyspeech_config, pwg_config)


if __name__ == "__main__":
    main()
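The `ZScore` normalizer built from the `*_stat` files above applies z-score normalization with the training-set mean and standard deviation, and the inference wrappers invert it before the vocoder. A minimal pure-Python sketch of the forward and inverse transforms (illustrative only; the real `ZScore` in `paddlespeech.t2s.modules.normalizer` operates on paddle tensors):

```python
def zscore(x, mu, std):
    # normalize: subtract the training-set mean, divide by its std
    return [(v - mu) / std for v in x]


def inverse_zscore(x, mu, std):
    # denormalize: exact inverse of the transform above
    return [v * std + mu for v in x]


mel = [1.0, 3.0, 5.0]
mu, std = 3.0, 2.0
norm = zscore(mel, mu, std)
print(norm)                           # [-1.0, 0.0, 1.0]
print(inverse_zscore(norm, mu, std))  # [1.0, 3.0, 5.0]
```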
@ -0,0 +1,268 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import logging
from pathlib import Path

import jsonlines
import numpy as np
import paddle
import soundfile as sf
import yaml
from yacs.config import CfgNode

from paddlespeech.s2t.utils.dynamic_import import dynamic_import
from paddlespeech.t2s.datasets.data_table import DataTable
from paddlespeech.t2s.modules.normalizer import ZScore

model_alias = {
    # acoustic model
    "speedyspeech":
    "paddlespeech.t2s.models.speedyspeech:SpeedySpeech",
    "speedyspeech_inference":
    "paddlespeech.t2s.models.speedyspeech:SpeedySpeechInference",
    "fastspeech2":
    "paddlespeech.t2s.models.fastspeech2:FastSpeech2",
    "fastspeech2_inference":
    "paddlespeech.t2s.models.fastspeech2:FastSpeech2Inference",
    # voc
    "pwgan":
    "paddlespeech.t2s.models.parallel_wavegan:PWGGenerator",
    "pwgan_inference":
    "paddlespeech.t2s.models.parallel_wavegan:PWGInference",
    "mb_melgan":
    "paddlespeech.t2s.models.melgan:MelGANGenerator",
    "mb_melgan_inference":
    "paddlespeech.t2s.models.melgan:MelGANInference",
}


def evaluate(args):
    # dataloader has been too verbose
    logging.getLogger("DataLoader").disabled = True

    # construct dataset for evaluation
    with jsonlines.open(args.test_metadata, 'r') as reader:
        test_metadata = list(reader)

    # Init body.
    with open(args.am_config) as f:
        am_config = CfgNode(yaml.safe_load(f))
    with open(args.voc_config) as f:
        voc_config = CfgNode(yaml.safe_load(f))

    print("========Args========")
    print(yaml.safe_dump(vars(args)))
    print("========Config========")
    print(am_config)
    print(voc_config)

    # model: {model_name}_{dataset}
    am_name = args.am[:args.am.rindex('_')]
    am_dataset = args.am[args.am.rindex('_') + 1:]

    if am_name == 'fastspeech2':
        fields = ["utt_id", "text"]
        spk_num = None
        if am_dataset in {"aishell3", "vctk"} and args.speaker_dict:
            print("multiple speaker fastspeech2!")
            with open(args.speaker_dict, 'rt') as f:
                spk_id = [line.strip().split() for line in f.readlines()]
            spk_num = len(spk_id)
            fields += ["spk_id"]
        elif args.voice_cloning:
            print("voice cloning!")
            fields += ["spk_emb"]
        else:
            print("single speaker fastspeech2!")
        print("spk_num:", spk_num)
    elif am_name == 'speedyspeech':
        fields = ["utt_id", "phones", "tones"]

    test_dataset = DataTable(data=test_metadata, fields=fields)

    with open(args.phones_dict, "r") as f:
        phn_id = [line.strip().split() for line in f.readlines()]
    vocab_size = len(phn_id)
    print("vocab_size:", vocab_size)

    tone_size = None
    if args.tones_dict:
        with open(args.tones_dict, "r") as f:
            tone_id = [line.strip().split() for line in f.readlines()]
        tone_size = len(tone_id)
        print("tone_size:", tone_size)

    # acoustic model
    odim = am_config.n_mels
    am_class = dynamic_import(am_name, model_alias)
    am_inference_class = dynamic_import(am_name + '_inference', model_alias)

    if am_name == 'fastspeech2':
        am = am_class(
            idim=vocab_size, odim=odim, spk_num=spk_num, **am_config["model"])
    elif am_name == 'speedyspeech':
        am = am_class(
            vocab_size=vocab_size, tone_size=tone_size, **am_config["model"])

    am.set_state_dict(paddle.load(args.am_ckpt)["main_params"])
    am.eval()
    am_mu, am_std = np.load(args.am_stat)
    am_mu = paddle.to_tensor(am_mu)
    am_std = paddle.to_tensor(am_std)
    am_normalizer = ZScore(am_mu, am_std)
    am_inference = am_inference_class(am_normalizer, am)
    print("am_inference.training0:", am_inference.training)
    am_inference.eval()
    print("acoustic model done!")

    # vocoder
    # model: {model_name}_{dataset}
    voc_name = args.voc[:args.voc.rindex('_')]
    voc_class = dynamic_import(voc_name, model_alias)
    voc_inference_class = dynamic_import(voc_name + '_inference', model_alias)
    voc = voc_class(**voc_config["generator_params"])
    voc.set_state_dict(paddle.load(args.voc_ckpt)["generator_params"])
    voc.remove_weight_norm()
    voc.eval()
    voc_mu, voc_std = np.load(args.voc_stat)
    voc_mu = paddle.to_tensor(voc_mu)
    voc_std = paddle.to_tensor(voc_std)
    voc_normalizer = ZScore(voc_mu, voc_std)
    voc_inference = voc_inference_class(voc_normalizer, voc)
    print("voc_inference.training0:", voc_inference.training)
    voc_inference.eval()
    print("voc done!")

    output_dir = Path(args.output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    for datum in test_dataset:
        utt_id = datum["utt_id"]
        with paddle.no_grad():
            # acoustic model
            if am_name == 'fastspeech2':
                phone_ids = paddle.to_tensor(datum["text"])
                spk_emb = None
                spk_id = None
                # multi speaker
                if args.voice_cloning and "spk_emb" in datum:
                    spk_emb = paddle.to_tensor(np.load(datum["spk_emb"]))
                elif "spk_id" in datum:
                    spk_id = paddle.to_tensor(datum["spk_id"])
                mel = am_inference(phone_ids, spk_id=spk_id, spk_emb=spk_emb)
            elif am_name == 'speedyspeech':
                phone_ids = paddle.to_tensor(datum["phones"])
                tone_ids = paddle.to_tensor(datum["tones"])
                mel = am_inference(phone_ids, tone_ids)
            # vocoder
            wav = voc_inference(mel)
        sf.write(
            str(output_dir / (utt_id + ".wav")),
            wav.numpy(),
            samplerate=am_config.fs)
        print(f"{utt_id} done!")


def main():
    # parse args and config and redirect to train_sp
    parser = argparse.ArgumentParser(
        description="Synthesize with acoustic model & vocoder")
    # acoustic model
    parser.add_argument(
        '--am',
        type=str,
        default='fastspeech2_csmsc',
        choices=[
            'speedyspeech_csmsc', 'fastspeech2_csmsc', 'fastspeech2_ljspeech',
            'fastspeech2_aishell3', 'fastspeech2_vctk'
        ],
        help='Choose acoustic model type of tts task.')
    parser.add_argument(
        '--am_config',
        type=str,
        default=None,
        help='Config of acoustic model. Use default config when it is None.')
    parser.add_argument(
        '--am_ckpt',
        type=str,
        default=None,
        help='Checkpoint file of acoustic model.')
    parser.add_argument(
        "--am_stat",
        type=str,
        default=None,
        help="mean and standard deviation used to normalize spectrogram when training acoustic model."
    )
    parser.add_argument(
        "--phones_dict", type=str, default=None, help="phone vocabulary file.")
    parser.add_argument(
        "--tones_dict", type=str, default=None, help="tone vocabulary file.")
    parser.add_argument(
        "--speaker_dict", type=str, default=None, help="speaker id map file.")

    def str2bool(str):
        return True if str.lower() == 'true' else False

    parser.add_argument(
        "--voice-cloning",
        type=str2bool,
        default=False,
        help="whether training voice cloning model.")
    # vocoder
    parser.add_argument(
        '--voc',
        type=str,
        default='pwgan_csmsc',
        choices=[
            'pwgan_csmsc', 'pwgan_ljspeech', 'pwgan_aishell3', 'pwgan_vctk',
            'mb_melgan_csmsc'
        ],
        help='Choose vocoder type of tts task.')

    parser.add_argument(
        '--voc_config',
        type=str,
        default=None,
        help='Config of voc. Use default config when it is None.')
    parser.add_argument(
        '--voc_ckpt', type=str, default=None, help='Checkpoint file of voc.')
    parser.add_argument(
        "--voc_stat",
        type=str,
        default=None,
        help="mean and standard deviation used to normalize spectrogram when training voc."
    )
    # other
    parser.add_argument(
        "--ngpu", type=int, default=1, help="if ngpu == 0, use cpu.")
    parser.add_argument("--test_metadata", type=str, help="test metadata.")
    parser.add_argument("--output_dir", type=str, help="output dir.")

    args = parser.parse_args()

    if args.ngpu == 0:
        paddle.set_device("cpu")
    elif args.ngpu > 0:
        paddle.set_device("gpu")
    else:
        print("ngpu should >= 0 !")

    evaluate(args)


if __name__ == "__main__":
    main()
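The `model_alias` table maps short names to `"module.path:ClassName"` strings that `dynamic_import` resolves at runtime, so the same script can instantiate any registered acoustic model or vocoder. A standalone sketch of that resolution pattern using `importlib`; the helper `resolve_alias` and the demo alias entry are illustrative, not the repo's implementation:

```python
import importlib


def resolve_alias(name, alias_table):
    """Resolve a 'module.path:AttrName' alias to the actual object."""
    module_path, _, attr_name = alias_table[name].partition(':')
    return getattr(importlib.import_module(module_path), attr_name)


# Hypothetical alias entry, mirroring the "module.path:ClassName" format above.
demo_alias = {"mean": "statistics:mean"}
mean_fn = resolve_alias("mean", demo_alias)
print(mean_fn([1, 2, 3]))  # 2
```

Registering a new model then only requires adding a pair of entries (e.g. `hifigan` / `hifigan_inference`) to the table, which is exactly what the e2e script below does.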
@ -0,0 +1,336 @@
|
|||||||
|
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
|
||||||
|
#
|
||||||
|
# Licensed under the Apache License, Version 2.0 (the "License");
|
||||||
|
# you may not use this file except in compliance with the License.
|
||||||
|
# You may obtain a copy of the License at
|
||||||
|
#
|
||||||
|
# http://www.apache.org/licenses/LICENSE-2.0
|
||||||
|
#
|
||||||
|
# Unless required by applicable law or agreed to in writing, software
|
||||||
|
# distributed under the License is distributed on an "AS IS" BASIS,
|
||||||
|
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
|
||||||
|
# See the License for the specific language governing permissions and
|
||||||
|
# limitations under the License.
|
||||||
|
import argparse
|
||||||
|
import os
|
||||||
|
from pathlib import Path
|
||||||
|
|
||||||
|
import numpy as np
|
||||||
|
import paddle
|
||||||
|
import soundfile as sf
|
||||||
|
import yaml
|
||||||
|
from paddle import jit
|
||||||
|
from paddle.static import InputSpec
|
||||||
|
from yacs.config import CfgNode
|
||||||
|
|
||||||
|
from paddlespeech.s2t.utils.dynamic_import import dynamic_import
|
||||||
|
from paddlespeech.t2s.frontend import English
|
||||||
|
from paddlespeech.t2s.frontend.zh_frontend import Frontend
|
||||||
|
from paddlespeech.t2s.modules.normalizer import ZScore
|
||||||
|
|
||||||
|
model_alias = {
|
||||||
|
# acoustic model
|
||||||
|
"speedyspeech":
|
||||||
|
"paddlespeech.t2s.models.speedyspeech:SpeedySpeech",
|
||||||
|
"speedyspeech_inference":
|
||||||
|
"paddlespeech.t2s.models.speedyspeech:SpeedySpeechInference",
|
||||||
|
"fastspeech2":
|
||||||
|
"paddlespeech.t2s.models.fastspeech2:FastSpeech2",
|
||||||
|
"fastspeech2_inference":
|
||||||
|
"paddlespeech.t2s.models.fastspeech2:FastSpeech2Inference",
|
||||||
|
# voc
|
||||||
|
"pwgan":
|
||||||
|
"paddlespeech.t2s.models.parallel_wavegan:PWGGenerator",
|
||||||
|
"pwgan_inference":
|
||||||
|
"paddlespeech.t2s.models.parallel_wavegan:PWGInference",
|
||||||
|
"mb_melgan":
|
||||||
|
"paddlespeech.t2s.models.melgan:MelGANGenerator",
|
||||||
|
"mb_melgan_inference":
|
||||||
|
"paddlespeech.t2s.models.melgan:MelGANInference",
|
||||||
|
"style_melgan":
|
||||||
|
"paddlespeech.t2s.models.melgan:StyleMelGANGenerator",
|
||||||
|
"style_melgan_inference":
|
||||||
|
"paddlespeech.t2s.models.melgan:StyleMelGANInference",
|
||||||
|
"hifigan":
|
||||||
|
"paddlespeech.t2s.models.hifigan:HiFiGANGenerator",
|
||||||
|
"hifigan_inference":
|
||||||
|
"paddlespeech.t2s.models.hifigan:HiFiGANInference",
|
||||||
|
}
|
||||||
|
|
||||||
|
|
||||||
|
def evaluate(args):
|
||||||
|
|
||||||
|
# Init body.
|
||||||
|
with open(args.am_config) as f:
|
||||||
|
am_config = CfgNode(yaml.safe_load(f))
|
||||||
|
with open(args.voc_config) as f:
|
||||||
|
voc_config = CfgNode(yaml.safe_load(f))
|
||||||
|
|
||||||
|
print("========Args========")
|
||||||
|
print(yaml.safe_dump(vars(args)))
|
||||||
|
print("========Config========")
|
||||||
|
print(am_config)
|
||||||
|
print(voc_config)
|
||||||
|
|
||||||
|
# construct dataset for evaluation
|
||||||
|
sentences = []
|
||||||
|
with open(args.text, 'rt') as f:
|
||||||
|
for line in f:
|
||||||
|
items = line.strip().split()
|
||||||
|
utt_id = items[0]
|
||||||
|
if args.lang == 'zh':
|
||||||
|
sentence = "".join(items[1:])
|
||||||
|
elif args.lang == 'en':
|
||||||
|
sentence = " ".join(items[1:])
|
||||||
|
sentences.append((utt_id, sentence))
|
||||||
|
|
||||||
|
with open(args.phones_dict, "r") as f:
|
||||||
|
phn_id = [line.strip().split() for line in f.readlines()]
|
||||||
|
vocab_size = len(phn_id)
|
||||||
|
print("vocab_size:", vocab_size)
|
||||||
|
|
||||||
|
tone_size = None
|
||||||
|
if args.tones_dict:
|
||||||
|
with open(args.tones_dict, "r") as f:
|
||||||
|
tone_id = [line.strip().split() for line in f.readlines()]
|
||||||
|
tone_size = len(tone_id)
|
||||||
|
print("tone_size:", tone_size)
|
||||||
|
|
||||||
|
spk_num = None
|
||||||
|
if args.speaker_dict:
|
||||||
|
with open(args.speaker_dict, 'rt') as f:
|
||||||
|
spk_id = [line.strip().split() for line in f.readlines()]
|
||||||
|
spk_num = len(spk_id)
|
||||||
|
print("spk_num:", spk_num)
|
||||||
|
|
||||||
|
# frontend
|
||||||
|
if args.lang == 'zh':
|
||||||
|
frontend = Frontend(
|
||||||
|
phone_vocab_path=args.phones_dict, tone_vocab_path=args.tones_dict)
|
||||||
|
elif args.lang == 'en':
|
||||||
|
frontend = English(phone_vocab_path=args.phones_dict)
|
||||||
|
print("frontend done!")
|
||||||
|
|
||||||
|
# acoustic model
|
||||||
|
odim = am_config.n_mels
|
||||||
|
# model: {model_name}_{dataset}
|
||||||
|
am_name = args.am[:args.am.rindex('_')]
|
||||||
|
am_dataset = args.am[args.am.rindex('_') + 1:]
|
||||||
|
|
||||||
|
am_class = dynamic_import(am_name, model_alias)
|
||||||
|
am_inference_class = dynamic_import(am_name + '_inference', model_alias)
|
||||||
|
|
||||||
|
if am_name == 'fastspeech2':
|
||||||
|
am = am_class(
|
||||||
|
idim=vocab_size, odim=odim, spk_num=spk_num, **am_config["model"])
|
||||||
|
elif am_name == 'speedyspeech':
|
||||||
|
am = am_class(
|
||||||
|
vocab_size=vocab_size, tone_size=tone_size, **am_config["model"])
|
||||||
|
|
||||||
|
am.set_state_dict(paddle.load(args.am_ckpt)["main_params"])
|
||||||
|
am.eval()
|
||||||
|
am_mu, am_std = np.load(args.am_stat)
|
||||||
|
am_mu = paddle.to_tensor(am_mu)
|
||||||
|
am_std = paddle.to_tensor(am_std)
|
||||||
|
am_normalizer = ZScore(am_mu, am_std)
|
||||||
|
am_inference = am_inference_class(am_normalizer, am)
|
||||||
|
am_inference.eval()
|
||||||
|
print("acoustic model done!")
|
||||||
|
|
||||||
|
# vocoder
|
||||||
|
# model: {model_name}_{dataset}
|
||||||
|
voc_name = args.voc[:args.voc.rindex('_')]
|
||||||
|
voc_class = dynamic_import(voc_name, model_alias)
|
||||||
|
voc_inference_class = dynamic_import(voc_name + '_inference', model_alias)
|
||||||
|
voc = voc_class(**voc_config["generator_params"])
|
||||||
|
voc.set_state_dict(paddle.load(args.voc_ckpt)["generator_params"])
|
||||||
|
voc.remove_weight_norm()
|
||||||
|
voc.eval()
|
||||||
|
voc_mu, voc_std = np.load(args.voc_stat)
|
||||||
|
voc_mu = paddle.to_tensor(voc_mu)
|
||||||
|
voc_std = paddle.to_tensor(voc_std)
|
||||||
|
voc_normalizer = ZScore(voc_mu, voc_std)
|
||||||
|
voc_inference = voc_inference_class(voc_normalizer, voc)
|
||||||
|
voc_inference.eval()
|
||||||
|
print("voc done!")
|
||||||
|
|
||||||
|
    # whether dygraph to static
    if args.inference_dir:
        # acoustic model
        if am_name == 'fastspeech2':
            if am_dataset in {"aishell3", "vctk"} and args.speaker_dict:
                print(
                    "Haven't tested dygraph to static for multi-speaker fastspeech2 yet!"
                )
            else:
                am_inference = jit.to_static(
                    am_inference,
                    input_spec=[InputSpec([-1], dtype=paddle.int64)])
                paddle.jit.save(am_inference,
                                os.path.join(args.inference_dir, args.am))
                am_inference = paddle.jit.load(
                    os.path.join(args.inference_dir, args.am))
        elif am_name == 'speedyspeech':
            am_inference = jit.to_static(
                am_inference,
                input_spec=[
                    InputSpec([-1], dtype=paddle.int64),
                    InputSpec([-1], dtype=paddle.int64)
                ])

            paddle.jit.save(am_inference,
                            os.path.join(args.inference_dir, args.am))
            am_inference = paddle.jit.load(
                os.path.join(args.inference_dir, args.am))

        # vocoder
        voc_inference = jit.to_static(
            voc_inference,
            input_spec=[
                InputSpec([-1, 80], dtype=paddle.float32),
            ])
        paddle.jit.save(voc_inference,
                        os.path.join(args.inference_dir, args.voc))
        voc_inference = paddle.jit.load(
            os.path.join(args.inference_dir, args.voc))
    output_dir = Path(args.output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)

    for utt_id, sentence in sentences:
        get_tone_ids = False
        if am_name == 'speedyspeech':
            get_tone_ids = True
        if args.lang == 'zh':
            input_ids = frontend.get_input_ids(
                sentence, merge_sentences=True, get_tone_ids=get_tone_ids)
            phone_ids = input_ids["phone_ids"]
            phone_ids = phone_ids[0]
            if get_tone_ids:
                tone_ids = input_ids["tone_ids"]
                tone_ids = tone_ids[0]
        elif args.lang == 'en':
            input_ids = frontend.get_input_ids(sentence)
            phone_ids = input_ids["phone_ids"]
        else:
            print("lang should be in {'zh', 'en'}!")

        with paddle.no_grad():
            # acoustic model
            if am_name == 'fastspeech2':
                # multi speaker
                if am_dataset in {"aishell3", "vctk"}:
                    spk_id = paddle.to_tensor(args.spk_id)
                    mel = am_inference(phone_ids, spk_id)
                else:
                    mel = am_inference(phone_ids)
            elif am_name == 'speedyspeech':
                mel = am_inference(phone_ids, tone_ids)
            # vocoder
            wav = voc_inference(mel)
        sf.write(
            str(output_dir / (utt_id + ".wav")),
            wav.numpy(),
            samplerate=am_config.fs)
        print(f"{utt_id} done!")
def main():
    # parse args and config and redirect to train_sp
    parser = argparse.ArgumentParser(
        description="Synthesize with acoustic model & vocoder")
    # acoustic model
    parser.add_argument(
        '--am',
        type=str,
        default='fastspeech2_csmsc',
        choices=[
            'speedyspeech_csmsc', 'fastspeech2_csmsc', 'fastspeech2_ljspeech',
            'fastspeech2_aishell3', 'fastspeech2_vctk'
        ],
        help='Choose acoustic model type of tts task.')
    parser.add_argument(
        '--am_config',
        type=str,
        default=None,
        help='Config of acoustic model. Use default config when it is None.')
    parser.add_argument(
        '--am_ckpt',
        type=str,
        default=None,
        help='Checkpoint file of acoustic model.')
    parser.add_argument(
        "--am_stat",
        type=str,
        default=None,
        help="mean and standard deviation used to normalize spectrogram when training acoustic model."
    )
    parser.add_argument(
        "--phones_dict", type=str, default=None, help="phone vocabulary file.")
    parser.add_argument(
        "--tones_dict", type=str, default=None, help="tone vocabulary file.")
    parser.add_argument(
        "--speaker_dict", type=str, default=None, help="speaker id map file.")
    parser.add_argument(
        '--spk_id',
        type=int,
        default=0,
        help='spk id for multi speaker acoustic model')
    # vocoder
    parser.add_argument(
        '--voc',
        type=str,
        default='pwgan_csmsc',
        choices=[
            'pwgan_csmsc', 'pwgan_ljspeech', 'pwgan_aishell3', 'pwgan_vctk',
            'mb_melgan_csmsc', 'style_melgan_csmsc', 'hifigan_csmsc'
        ],
        help='Choose vocoder type of tts task.')
    parser.add_argument(
        '--voc_config',
        type=str,
        default=None,
        help='Config of voc. Use default config when it is None.')
    parser.add_argument(
        '--voc_ckpt', type=str, default=None, help='Checkpoint file of voc.')
    parser.add_argument(
        "--voc_stat",
        type=str,
        default=None,
        help="mean and standard deviation used to normalize spectrogram when training voc."
    )
    # other
    parser.add_argument(
        '--lang',
        type=str,
        default='zh',
        help='Choose model language. zh or en')
    parser.add_argument(
        "--inference_dir",
        type=str,
        default=None,
        help="dir to save inference models")
    parser.add_argument(
        "--ngpu", type=int, default=1, help="if ngpu == 0, use cpu.")
    parser.add_argument(
        "--text",
        type=str,
        help="text to synthesize, a 'utt_id sentence' pair per line.")
    parser.add_argument("--output_dir", type=str, help="output dir.")

    args = parser.parse_args()

    if args.ngpu == 0:
        paddle.set_device("cpu")
    elif args.ngpu > 0:
        paddle.set_device("gpu")
    else:
        print("ngpu should be >= 0!")

    evaluate(args)


if __name__ == "__main__":
    main()
@ -0,0 +1,15 @@
# Copyright (c) 2020 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from .hifigan import *
from .hifigan_updater import *
@ -0,0 +1,779 @@
# -*- coding: utf-8 -*-
"""HiFi-GAN Modules.

This code is based on https://github.com/jik876/hifi-gan.

"""
import copy
from typing import Any
from typing import Dict
from typing import List

import paddle
import paddle.nn.functional as F
from paddle import nn

from paddlespeech.t2s.modules.activation import get_activation
from paddlespeech.t2s.modules.nets_utils import initialize
from paddlespeech.t2s.modules.residual_block import HiFiGANResidualBlock as ResidualBlock
class HiFiGANGenerator(nn.Layer):
    """HiFiGAN generator module."""

    def __init__(
            self,
            in_channels: int=80,
            out_channels: int=1,
            channels: int=512,
            kernel_size: int=7,
            upsample_scales: List[int]=(8, 8, 2, 2),
            upsample_kernel_sizes: List[int]=(16, 16, 4, 4),
            resblock_kernel_sizes: List[int]=(3, 7, 11),
            resblock_dilations: List[List[int]]=[(1, 3, 5), (1, 3, 5),
                                                 (1, 3, 5)],
            use_additional_convs: bool=True,
            bias: bool=True,
            nonlinear_activation: str="leakyrelu",
            nonlinear_activation_params: Dict[str, Any]={"negative_slope": 0.1},
            use_weight_norm: bool=True,
            init_type: str="xavier_uniform", ):
        """Initialize HiFiGANGenerator module.
        Parameters
        ----------
        in_channels : int
            Number of input channels.
        out_channels : int
            Number of output channels.
        channels : int
            Number of hidden representation channels.
        kernel_size : int
            Kernel size of initial and final conv layer.
        upsample_scales : list
            List of upsampling scales.
        upsample_kernel_sizes : list
            List of kernel sizes for upsampling layers.
        resblock_kernel_sizes : list
            List of kernel sizes for residual blocks.
        resblock_dilations : list
            List of dilation list for residual blocks.
        use_additional_convs : bool
            Whether to use additional conv layers in residual blocks.
        bias : bool
            Whether to add bias parameter in convolution layers.
        nonlinear_activation : str
            Activation function module name.
        nonlinear_activation_params : dict
            Hyperparameters for activation function.
        use_weight_norm : bool
            Whether to use weight norm.
            If set to true, it will be applied to all of the conv layers.
        """
        super().__init__()

        # initialize parameters
        initialize(self, init_type)

        # check hyperparameters are valid
        assert kernel_size % 2 == 1, "Kernel size must be odd number."
        assert len(upsample_scales) == len(upsample_kernel_sizes)
        assert len(resblock_dilations) == len(resblock_kernel_sizes)

        # define modules
        self.num_upsamples = len(upsample_kernel_sizes)
        self.num_blocks = len(resblock_kernel_sizes)
        self.input_conv = nn.Conv1D(
            in_channels,
            channels,
            kernel_size,
            1,
            padding=(kernel_size - 1) // 2, )
        self.upsamples = nn.LayerList()
        self.blocks = nn.LayerList()
        for i in range(len(upsample_kernel_sizes)):
            assert upsample_kernel_sizes[i] == 2 * upsample_scales[i]
            self.upsamples.append(
                nn.Sequential(
                    get_activation(nonlinear_activation, **
                                   nonlinear_activation_params),
                    nn.Conv1DTranspose(
                        channels // (2**i),
                        channels // (2**(i + 1)),
                        upsample_kernel_sizes[i],
                        upsample_scales[i],
                        padding=upsample_scales[i] // 2 + upsample_scales[i] %
                        2,
                        output_padding=upsample_scales[i] % 2, ), ))
            for j in range(len(resblock_kernel_sizes)):
                self.blocks.append(
                    ResidualBlock(
                        kernel_size=resblock_kernel_sizes[j],
                        channels=channels // (2**(i + 1)),
                        dilations=resblock_dilations[j],
                        bias=bias,
                        use_additional_convs=use_additional_convs,
                        nonlinear_activation=nonlinear_activation,
                        nonlinear_activation_params=nonlinear_activation_params,
                    ))
        self.output_conv = nn.Sequential(
            nn.LeakyReLU(),
            nn.Conv1D(
                channels // (2**(i + 1)),
                out_channels,
                kernel_size,
                1,
                padding=(kernel_size - 1) // 2, ),
            nn.Tanh(), )

        nn.initializer.set_global_initializer(None)

        # apply weight norm
        if use_weight_norm:
            self.apply_weight_norm()

        # reset parameters
        self.reset_parameters()
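With the defaults above, each `Conv1DTranspose` stage halves the channel count (`channels // 2**(i + 1)`) while the time axis grows by `upsample_scales[i]`, so one input frame ends up as `prod(upsample_scales)` samples (the hop size). A small sketch of that arithmetic, using the default values:

```python
import math

# Default generator config from the constructor above.
upsample_scales = (8, 8, 2, 2)
channels = 512

# One mel frame expands to this many waveform samples overall.
total_upsampling = math.prod(upsample_scales)
# Channel width after each upsampling stage: channels // 2**(i + 1).
stage_channels = [channels // (2**(i + 1)) for i in range(len(upsample_scales))]

print(total_upsampling)  # 256
print(stage_channels)    # [256, 128, 64, 32]
```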
    def forward(self, c):
        """Calculate forward propagation.
        Parameters
        ----------
        c : Tensor
            Input tensor (B, in_channels, T).
        Returns
        ----------
        Tensor
            Output tensor (B, out_channels, T).
        """
        c = self.input_conv(c)
        for i in range(self.num_upsamples):
            c = self.upsamples[i](c)
            # initialize
            cs = 0.0
            for j in range(self.num_blocks):
                cs += self.blocks[i * self.num_blocks + j](c)
            c = cs / self.num_blocks
        c = self.output_conv(c)

        return c

    def reset_parameters(self):
        """Reset parameters.
        This initialization follows official implementation manner.
        https://github.com/jik876/hifi-gan/blob/master/models.py
        """
        # normal distribution with float parameters (mean 0.0, std 0.01)
        dist = paddle.distribution.Normal(loc=0.0, scale=0.01)

        def _reset_parameters(m):
            if isinstance(m, nn.Conv1D) or isinstance(m, nn.Conv1DTranspose):
                w = dist.sample(m.weight.shape)
                m.weight.set_value(w)

        self.apply(_reset_parameters)

    def apply_weight_norm(self):
        """Recursively apply weight normalization to all the Convolution layers
        in the sublayers.
        """

        def _apply_weight_norm(layer):
            if isinstance(layer, (nn.Conv1D, nn.Conv2D, nn.Conv1DTranspose)):
                nn.utils.weight_norm(layer)

        self.apply(_apply_weight_norm)

    def remove_weight_norm(self):
        """Recursively remove weight normalization from all the Convolution
        layers in the sublayers.
        """

        def _remove_weight_norm(layer):
            try:
                nn.utils.remove_weight_norm(layer)
            except ValueError:
                pass

        self.apply(_remove_weight_norm)

    def inference(self, c):
        """Perform inference.
        Parameters
        ----------
        c : Tensor
            Input tensor (T, in_channels).
        Returns
        ----------
        Tensor
            Output tensor (T * prod(upsample_scales), out_channels).
        """
        c = self.forward(c.transpose([1, 0]).unsqueeze(0))
        return c.squeeze(0).transpose([1, 0])
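In `forward`, the `num_blocks` residual blocks attached to each upsampling stage all see the same input; their outputs are summed and averaged (the multi-receptive-field fusion). A framework-free sketch with stand-in blocks in place of the real `ResidualBlock`s:

```python
import numpy as np

# Stand-in for the num_blocks residual blocks at one resolution; the real
# blocks differ only in kernel size / dilations, here they just scale.
num_blocks = 3
blocks = [lambda x, k=k: x * k for k in (1.0, 2.0, 3.0)]

c = np.ones((1, 4, 8))  # stand-in feature map (B, C, T)

# Same accumulation pattern as HiFiGANGenerator.forward.
cs = 0.0
for j in range(num_blocks):
    cs = cs + blocks[j](c)
c = cs / num_blocks

print(c[0, 0, 0])  # (1 + 2 + 3) / 3 = 2.0
```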
class HiFiGANPeriodDiscriminator(nn.Layer):
    """HiFiGAN period discriminator module."""

    def __init__(
            self,
            in_channels: int=1,
            out_channels: int=1,
            period: int=3,
            kernel_sizes: List[int]=[5, 3],
            channels: int=32,
            downsample_scales: List[int]=[3, 3, 3, 3, 1],
            max_downsample_channels: int=1024,
            bias: bool=True,
            nonlinear_activation: str="leakyrelu",
            nonlinear_activation_params: Dict[str, Any]={"negative_slope": 0.1},
            use_weight_norm: bool=True,
            use_spectral_norm: bool=False,
            init_type: str="xavier_uniform", ):
        """Initialize HiFiGANPeriodDiscriminator module.
        Parameters
        ----------
        in_channels : int
            Number of input channels.
        out_channels : int
            Number of output channels.
        period : int
            Period.
        kernel_sizes : list
            Kernel sizes of initial conv layers and the final conv layer.
        channels : int
            Number of initial channels.
        downsample_scales : list
            List of downsampling scales.
        max_downsample_channels : int
            Number of maximum downsampling channels.
        bias : bool
            Whether to add bias parameter in convolution layers.
        nonlinear_activation : str
            Activation function module name.
        nonlinear_activation_params : dict
            Hyperparameters for activation function.
        use_weight_norm : bool
            Whether to use weight norm.
            If set to true, it will be applied to all of the conv layers.
        use_spectral_norm : bool
            Whether to use spectral norm.
            If set to true, it will be applied to all of the conv layers.
        """
        super().__init__()

        # initialize parameters
        initialize(self, init_type)

        assert len(kernel_sizes) == 2
        assert kernel_sizes[0] % 2 == 1, "Kernel size must be odd number."
        assert kernel_sizes[1] % 2 == 1, "Kernel size must be odd number."

        self.period = period
        self.convs = nn.LayerList()
        in_chs = in_channels
        out_chs = channels
        for downsample_scale in downsample_scales:
            self.convs.append(
                nn.Sequential(
                    nn.Conv2D(
                        in_chs,
                        out_chs,
                        (kernel_sizes[0], 1),
                        (downsample_scale, 1),
                        padding=((kernel_sizes[0] - 1) // 2, 0), ),
                    get_activation(nonlinear_activation, **
                                   nonlinear_activation_params), ))
            in_chs = out_chs
            # NOTE: Use downsample_scale + 1?
            out_chs = min(out_chs * 4, max_downsample_channels)
        self.output_conv = nn.Conv2D(
            out_chs,
            out_channels,
            (kernel_sizes[1] - 1, 1),
            1,
            padding=((kernel_sizes[1] - 1) // 2, 0), )

        if use_weight_norm and use_spectral_norm:
            raise ValueError("Either use use_weight_norm or use_spectral_norm.")

        # apply weight norm
        if use_weight_norm:
            self.apply_weight_norm()

        # apply spectral norm
        if use_spectral_norm:
            self.apply_spectral_norm()

    def forward(self, x):
        """Calculate forward propagation.
        Parameters
        ----------
        x : Tensor
            Input tensor (B, in_channels, T).
        Returns
        ----------
        list
            List of each layer's tensors.
        """
        # transform 1d to 2d -> (B, C, T/P, P)
        b, c, t = paddle.shape(x)
        if t % self.period != 0:
            n_pad = self.period - (t % self.period)
            x = F.pad(x, (0, n_pad), "reflect", data_format="NCL")
            t += n_pad
        x = x.reshape([b, c, t // self.period, self.period])

        # forward conv
        outs = []
        for layer in self.convs:
            x = layer(x)
            outs += [x]
        x = self.output_conv(x)
        x = paddle.flatten(x, 1, -1)
        outs += [x]

        return outs

    def apply_weight_norm(self):
        """Recursively apply weight normalization to all the Convolution layers
        in the sublayers.
        """

        def _apply_weight_norm(layer):
            if isinstance(layer, (nn.Conv1D, nn.Conv2D, nn.Conv1DTranspose)):
                nn.utils.weight_norm(layer)

        self.apply(_apply_weight_norm)

    def apply_spectral_norm(self):
        """Recursively apply spectral normalization to all the Conv2D layers
        in the sublayers.
        """

        def _apply_spectral_norm(m):
            if isinstance(m, nn.Conv2D):
                nn.utils.spectral_norm(m)

        self.apply(_apply_spectral_norm)
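The period discriminator folds the 1D waveform into a 2D map so that the 2D convs see one column per period. A NumPy sketch of the pad-and-reshape step in `forward` (reflect padding, as above; shapes are illustrative):

```python
import numpy as np

period = 3
x = np.arange(10, dtype=np.float32).reshape(1, 1, 10)  # (B, C, T) with T=10

# Pad T up to a multiple of `period`, then fold into (B, C, T/P, P).
b, c, t = x.shape
if t % period != 0:
    n_pad = period - (t % period)
    x = np.pad(x, ((0, 0), (0, 0), (0, n_pad)), mode='reflect')
    t += n_pad
x = x.reshape(b, c, t // period, period)

print(x.shape)  # (1, 1, 4, 3)
```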
class HiFiGANMultiPeriodDiscriminator(nn.Layer):
    """HiFiGAN multi-period discriminator module."""

    def __init__(
            self,
            periods: List[int]=[2, 3, 5, 7, 11],
            discriminator_params: Dict[str, Any]={
                "in_channels": 1,
                "out_channels": 1,
                "kernel_sizes": [5, 3],
                "channels": 32,
                "downsample_scales": [3, 3, 3, 3, 1],
                "max_downsample_channels": 1024,
                "bias": True,
                "nonlinear_activation": "leakyrelu",
                "nonlinear_activation_params": {
                    "negative_slope": 0.1
                },
                "use_weight_norm": True,
                "use_spectral_norm": False,
            },
            init_type: str="xavier_uniform", ):
        """Initialize HiFiGANMultiPeriodDiscriminator module.
        Parameters
        ----------
        periods : list
            List of periods.
        discriminator_params : dict
            Parameters for hifi-gan period discriminator module.
            The period parameter will be overwritten.
        """
        super().__init__()
        # initialize parameters
        initialize(self, init_type)

        self.discriminators = nn.LayerList()
        for period in periods:
            params = copy.deepcopy(discriminator_params)
            params["period"] = period
            self.discriminators.append(HiFiGANPeriodDiscriminator(**params))

    def forward(self, x):
        """Calculate forward propagation.
        Parameters
        ----------
        x : Tensor
            Input noise signal (B, 1, T).
        Returns
        ----------
        List
            List of list of each discriminator outputs, which consists of each layer output tensors.
        """
        outs = []
        for f in self.discriminators:
            outs += [f(x)]

        return outs
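The multi-period wrapper builds one discriminator per period from a shared config, deep-copying it so that only `"period"` differs between instances and the shared dict stays untouched. A sketch with a trimmed-down config dict:

```python
import copy

# Trimmed-down version of discriminator_params for illustration.
discriminator_params = {"kernel_sizes": [5, 3], "channels": 32}
periods = [2, 3, 5, 7, 11]

# Same pattern as HiFiGANMultiPeriodDiscriminator.__init__: deep-copy the
# shared config and overwrite only the period.
configs = []
for period in periods:
    params = copy.deepcopy(discriminator_params)
    params["period"] = period
    configs.append(params)

print([p["period"] for p in configs])  # [2, 3, 5, 7, 11]
```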
class HiFiGANScaleDiscriminator(nn.Layer):
    """HiFi-GAN scale discriminator module."""

    def __init__(
            self,
            in_channels: int=1,
            out_channels: int=1,
            kernel_sizes: List[int]=[15, 41, 5, 3],
            channels: int=128,
            max_downsample_channels: int=1024,
            max_groups: int=16,
            bias: bool=True,
            downsample_scales: List[int]=[2, 2, 4, 4, 1],
            nonlinear_activation: str="leakyrelu",
            nonlinear_activation_params: Dict[str, Any]={"negative_slope": 0.1},
            use_weight_norm: bool=True,
            use_spectral_norm: bool=False,
            init_type: str="xavier_uniform", ):
        """Initialize HiFiGAN scale discriminator module.
        Parameters
        ----------
        in_channels : int
            Number of input channels.
        out_channels : int
            Number of output channels.
        kernel_sizes : list
            List of four kernel sizes. The first is used for the first conv layer,
            the second for the downsampling layers, and the remaining two for the output layers.
        channels : int
            Initial number of channels for conv layer.
        max_downsample_channels : int
            Maximum number of channels for downsampling layers.
        max_groups : int
            Maximum number of groups in downsampling conv layers.
        bias : bool
            Whether to add bias parameter in convolution layers.
        downsample_scales : list
            List of downsampling scales.
        nonlinear_activation : str
            Activation function module name.
        nonlinear_activation_params : dict
            Hyperparameters for activation function.
        use_weight_norm : bool
            Whether to use weight norm.
            If set to true, it will be applied to all of the conv layers.
        use_spectral_norm : bool
            Whether to use spectral norm.
            If set to true, it will be applied to all of the conv layers.
        """
        super().__init__()

        # initialize parameters
        initialize(self, init_type)

        self.layers = nn.LayerList()

        # check kernel size is valid
        assert len(kernel_sizes) == 4
        for ks in kernel_sizes:
            assert ks % 2 == 1

        # add first layer
        self.layers.append(
            nn.Sequential(
                nn.Conv1D(
                    in_channels,
                    channels,
                    # NOTE: Use always the same kernel size
                    kernel_sizes[0],
                    bias_attr=bias,
                    padding=(kernel_sizes[0] - 1) // 2, ),
                get_activation(nonlinear_activation, **
                               nonlinear_activation_params), ))

        # add downsample layers
        in_chs = channels
        out_chs = channels
        # NOTE(kan-bayashi): Remove hard coding?
        groups = 4
        for downsample_scale in downsample_scales:
            self.layers.append(
                nn.Sequential(
                    nn.Conv1D(
                        in_chs,
                        out_chs,
                        kernel_size=kernel_sizes[1],
                        stride=downsample_scale,
                        padding=(kernel_sizes[1] - 1) // 2,
                        groups=groups,
                        bias_attr=bias, ),
                    get_activation(nonlinear_activation, **
                                   nonlinear_activation_params), ))
            in_chs = out_chs
            # NOTE: Remove hard coding?
            out_chs = min(in_chs * 2, max_downsample_channels)
            # NOTE: Remove hard coding?
            groups = min(groups * 4, max_groups)

        # add final layers
        out_chs = min(in_chs * 2, max_downsample_channels)
        self.layers.append(
            nn.Sequential(
                nn.Conv1D(
                    in_chs,
                    out_chs,
                    kernel_size=kernel_sizes[2],
                    stride=1,
                    padding=(kernel_sizes[2] - 1) // 2,
                    bias_attr=bias, ),
                get_activation(nonlinear_activation, **
                               nonlinear_activation_params), ))
        self.layers.append(
            nn.Conv1D(
                out_chs,
                out_channels,
                kernel_size=kernel_sizes[3],
                stride=1,
                padding=(kernel_sizes[3] - 1) // 2,
                bias_attr=bias, ), )

        if use_weight_norm and use_spectral_norm:
            raise ValueError("Either use use_weight_norm or use_spectral_norm.")

        # apply weight norm
        if use_weight_norm:
            self.apply_weight_norm()

        # apply spectral norm
        if use_spectral_norm:
            self.apply_spectral_norm()

    def forward(self, x):
        """Calculate forward propagation.
        Parameters
        ----------
        x : Tensor
            Input noise signal (B, 1, T).
        Returns
        ----------
        List
            List of output tensors of each layer.
        """
        outs = []
        for f in self.layers:
            x = f(x)
            outs += [x]

        return outs

    def apply_weight_norm(self):
        """Recursively apply weight normalization to all the Convolution layers
        in the sublayers.
        """

        def _apply_weight_norm(layer):
            if isinstance(layer, (nn.Conv1D, nn.Conv2D, nn.Conv1DTranspose)):
                nn.utils.weight_norm(layer)

        self.apply(_apply_weight_norm)

    def apply_spectral_norm(self):
        """Recursively apply spectral normalization to all the Conv2D layers
        in the sublayers.
        """

        def _apply_spectral_norm(m):
            if isinstance(m, nn.Conv2D):
                nn.utils.spectral_norm(m)

        self.apply(_apply_spectral_norm)
class HiFiGANMultiScaleDiscriminator(nn.Layer):
    """HiFi-GAN multi-scale discriminator module."""

    def __init__(
            self,
            scales: int=3,
            downsample_pooling: str="AvgPool1D",
            # follow the official implementation setting
            downsample_pooling_params: Dict[str, Any]={
                "kernel_size": 4,
                "stride": 2,
                "padding": 2,
            },
            discriminator_params: Dict[str, Any]={
                "in_channels": 1,
                "out_channels": 1,
                "kernel_sizes": [15, 41, 5, 3],
                "channels": 128,
                "max_downsample_channels": 1024,
                "max_groups": 16,
                "bias": True,
                "downsample_scales": [2, 2, 4, 4, 1],
                "nonlinear_activation": "leakyrelu",
                "nonlinear_activation_params": {
                    "negative_slope": 0.1
                },
            },
            follow_official_norm: bool=False,
            init_type: str="xavier_uniform", ):
        """Initialize HiFiGAN multi-scale discriminator module.
        Parameters
        ----------
        scales : int
            Number of multi-scales.
        downsample_pooling : str
            Pooling module name for downsampling of the inputs.
        downsample_pooling_params : dict
            Parameters for the above pooling module.
        discriminator_params : dict
            Parameters for hifi-gan scale discriminator module.
        follow_official_norm : bool
            Whether to follow the norm setting of the official
            implementation. The first discriminator uses spectral norm
            and the other discriminators use weight norm.
        """
        super().__init__()

        # initialize parameters
        initialize(self, init_type)

        self.discriminators = nn.LayerList()

        # add discriminators
        for i in range(scales):
            params = copy.deepcopy(discriminator_params)
            if follow_official_norm:
                if i == 0:
                    params["use_weight_norm"] = False
                    params["use_spectral_norm"] = True
                else:
                    params["use_weight_norm"] = True
                    params["use_spectral_norm"] = False
            self.discriminators.append(HiFiGANScaleDiscriminator(**params))
        self.pooling = getattr(nn, downsample_pooling)(
            **downsample_pooling_params)

    def forward(self, x):
        """Calculate forward propagation.
        Parameters
        ----------
        x : Tensor
            Input noise signal (B, 1, T).
        Returns
        ----------
        List
            List of list of each discriminator outputs, which consists of each layer output tensors.
        """
        outs = []
        for f in self.discriminators:
            outs += [f(x)]
            x = self.pooling(x)

        return outs
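Each scale discriminator after the first sees the input pooled once more, so the three scales judge the waveform at progressively lower resolutions. A NumPy sketch of how the default pooling settings (kernel 4, stride 2, padding 2) shrink the time axis across the scales; this simplified pooling includes the zero padding in each average, which real pooling layers may handle differently:

```python
import numpy as np

def avg_pool1d(x, kernel_size=4, stride=2, padding=2):
    # Simplified 1D average pooling over the last axis (includes padding).
    x = np.pad(x, ((0, 0), (0, 0), (padding, padding)))
    t_out = (x.shape[-1] - kernel_size) // stride + 1
    return np.stack(
        [x[..., i * stride:i * stride + kernel_size].mean(-1)
         for i in range(t_out)], axis=-1)

x = np.ones((1, 1, 16), dtype=np.float32)  # stand-in waveform (B, 1, T)
lengths = []
for _ in range(3):  # scales = 3
    lengths.append(x.shape[-1])
    x = avg_pool1d(x)

print(lengths)  # [16, 9, 5]
```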
class HiFiGANMultiScaleMultiPeriodDiscriminator(nn.Layer):
    """HiFi-GAN multi-scale + multi-period discriminator module."""

    def __init__(
            self,
            # Multi-scale discriminator related
            scales: int=3,
            scale_downsample_pooling: str="AvgPool1D",
            scale_downsample_pooling_params: Dict[str, Any]={
                "kernel_size": 4,
                "stride": 2,
                "padding": 2,
            },
            scale_discriminator_params: Dict[str, Any]={
                "in_channels": 1,
                "out_channels": 1,
                "kernel_sizes": [15, 41, 5, 3],
                "channels": 128,
                "max_downsample_channels": 1024,
                "max_groups": 16,
                "bias": True,
                "downsample_scales": [2, 2, 4, 4, 1],
                "nonlinear_activation": "leakyrelu",
                "nonlinear_activation_params": {
                    "negative_slope": 0.1
                },
            },
            follow_official_norm: bool=True,
            # Multi-period discriminator related
            periods: List[int]=[2, 3, 5, 7, 11],
            period_discriminator_params: Dict[str, Any]={
                "in_channels": 1,
                "out_channels": 1,
                "kernel_sizes": [5, 3],
                "channels": 32,
                "downsample_scales": [3, 3, 3, 3, 1],
                "max_downsample_channels": 1024,
                "bias": True,
                "nonlinear_activation": "leakyrelu",
                "nonlinear_activation_params": {
                    "negative_slope": 0.1
                },
                "use_weight_norm": True,
                "use_spectral_norm": False,
            },
            init_type: str="xavier_uniform", ):
        """Initialize HiFiGAN multi-scale + multi-period discriminator module.
        Parameters
        ----------
        scales : int
            Number of multi-scales.
        scale_downsample_pooling : str
            Pooling module name for downsampling of the inputs.
        scale_downsample_pooling_params : dict
            Parameters for the above pooling module.
        scale_discriminator_params : dict
            Parameters for hifi-gan scale discriminator module.
        follow_official_norm : bool
            Whether to follow the norm setting of the official
            implementation. The first discriminator uses spectral norm
            and the other discriminators use weight norm.
        periods : list
            List of periods.
        period_discriminator_params : dict
            Parameters for hifi-gan period discriminator module.
            The period parameter will be overwritten.
        """
        super().__init__()

        # initialize parameters
        initialize(self, init_type)

        self.msd = HiFiGANMultiScaleDiscriminator(
            scales=scales,
            downsample_pooling=scale_downsample_pooling,
            downsample_pooling_params=scale_downsample_pooling_params,
            discriminator_params=scale_discriminator_params,
            follow_official_norm=follow_official_norm, )
        self.mpd = HiFiGANMultiPeriodDiscriminator(
            periods=periods,
            discriminator_params=period_discriminator_params, )

    def forward(self, x):
        """Calculate forward propagation.
        Parameters
||||||
|
----------
|
||||||
|
x : Tensor
|
||||||
|
Input noise signal (B, 1, T).
|
||||||
|
Returns
|
||||||
|
----------
|
||||||
|
List:
|
||||||
|
List of list of each discriminator outputs,
|
||||||
|
which consists of each layer output tensors.
|
||||||
|
Multi scale and multi period ones are concatenated.
|
||||||
|
"""
|
||||||
|
msd_outs = self.msd(x)
|
||||||
|
mpd_outs = self.mpd(x)
|
||||||
|
return msd_outs + mpd_outs
|
||||||
|
|
||||||
|
|
||||||
|
class HiFiGANInference(nn.Layer):
|
||||||
|
def __init__(self, normalizer, hifigan_generator):
|
||||||
|
super().__init__()
|
||||||
|
self.normalizer = normalizer
|
||||||
|
self.hifigan_generator = hifigan_generator
|
||||||
|
|
||||||
|
def forward(self, logmel):
|
||||||
|
normalized_mel = self.normalizer(logmel)
|
||||||
|
wav = self.hifigan_generator.inference(normalized_mel)
|
||||||
|
return wav
|
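The combined discriminator's forward pass is plain list concatenation: one output list per scale from the MSD plus one per period from the MPD. A minimal pure-Python sketch of that bookkeeping, using stand-in strings for the per-layer tensor lists (defaults `scales=3`, `periods=[2, 3, 5, 7, 11]`):

```python
# With the default config the combined discriminator returns
# scales + len(periods) = 3 + 5 = 8 output lists; forward() computes
# msd_outs + mpd_outs, i.e. ordinary Python list concatenation.
scales = 3
periods = [2, 3, 5, 7, 11]

msd_outs = [f"msd_{i}" for i in range(scales)]  # stand-ins for per-layer tensor lists
mpd_outs = [f"mpd_{p}" for p in periods]
outs = msd_outs + mpd_outs

print(len(outs))  # 8
```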
@ -0,0 +1,247 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import logging
from typing import Dict

import paddle
from paddle import distributed as dist
from paddle.io import DataLoader
from paddle.nn import Layer
from paddle.optimizer import Optimizer
from paddle.optimizer.lr import LRScheduler

from paddlespeech.t2s.training.extensions.evaluator import StandardEvaluator
from paddlespeech.t2s.training.reporter import report
from paddlespeech.t2s.training.updaters.standard_updater import StandardUpdater
from paddlespeech.t2s.training.updaters.standard_updater import UpdaterState

logging.basicConfig(
    format='%(asctime)s [%(levelname)s] [%(filename)s:%(lineno)d] %(message)s',
    datefmt='[%Y-%m-%d %H:%M:%S]')
logger = logging.getLogger(__name__)
logger.setLevel(logging.INFO)


class HiFiGANUpdater(StandardUpdater):
    def __init__(self,
                 models: Dict[str, Layer],
                 optimizers: Dict[str, Optimizer],
                 criterions: Dict[str, Layer],
                 schedulers: Dict[str, LRScheduler],
                 dataloader: DataLoader,
                 generator_train_start_steps: int=0,
                 discriminator_train_start_steps: int=100000,
                 lambda_adv: float=1.0,
                 lambda_aux: float=1.0,
                 lambda_feat_match: float=1.0,
                 output_dir=None):
        self.models = models
        self.generator: Layer = models['generator']
        self.discriminator: Layer = models['discriminator']

        self.optimizers = optimizers
        self.optimizer_g: Optimizer = optimizers['generator']
        self.optimizer_d: Optimizer = optimizers['discriminator']

        self.criterions = criterions
        self.criterion_feat_match = criterions['feat_match']
        self.criterion_mel = criterions['mel']
        self.criterion_gen_adv = criterions["gen_adv"]
        self.criterion_dis_adv = criterions["dis_adv"]

        self.schedulers = schedulers
        self.scheduler_g = schedulers['generator']
        self.scheduler_d = schedulers['discriminator']

        self.dataloader = dataloader

        self.generator_train_start_steps = generator_train_start_steps
        self.discriminator_train_start_steps = discriminator_train_start_steps
        self.lambda_adv = lambda_adv
        self.lambda_aux = lambda_aux
        self.lambda_feat_match = lambda_feat_match

        self.state = UpdaterState(iteration=0, epoch=0)
        self.train_iterator = iter(self.dataloader)

        log_file = output_dir / 'worker_{}.log'.format(dist.get_rank())
        self.filehandler = logging.FileHandler(str(log_file))
        logger.addHandler(self.filehandler)
        self.logger = logger
        self.msg = ""

    def update_core(self, batch):
        self.msg = "Rank: {}, ".format(dist.get_rank())
        losses_dict = {}
        # parse batch
        wav, mel = batch

        # Generator
        if self.state.iteration > self.generator_train_start_steps:
            # (B, out_channels, T * prod(upsample_scales))
            wav_ = self.generator(mel)

            # initialize
            gen_loss = 0.0
            aux_loss = 0.0

            # mel spectrogram loss
            mel_loss = self.criterion_mel(wav_, wav)
            aux_loss += mel_loss
            report("train/mel_loss", float(mel_loss))
            losses_dict["mel_loss"] = float(mel_loss)

            gen_loss += aux_loss * self.lambda_aux

            # adversarial loss
            if self.state.iteration > self.discriminator_train_start_steps:
                p_ = self.discriminator(wav_)
                adv_loss = self.criterion_gen_adv(p_)
                report("train/adversarial_loss", float(adv_loss))
                losses_dict["adversarial_loss"] = float(adv_loss)

                # feature matching loss
                # no need to track gradients
                with paddle.no_grad():
                    p = self.discriminator(wav)
                fm_loss = self.criterion_feat_match(p_, p)
                report("train/feature_matching_loss", float(fm_loss))
                losses_dict["feature_matching_loss"] = float(fm_loss)

                adv_loss += self.lambda_feat_match * fm_loss

                gen_loss += self.lambda_adv * adv_loss

            report("train/generator_loss", float(gen_loss))
            losses_dict["generator_loss"] = float(gen_loss)

            self.optimizer_g.clear_grad()
            gen_loss.backward()

            self.optimizer_g.step()
            self.scheduler_g.step()

        # Discriminator
        if self.state.iteration > self.discriminator_train_start_steps:
            # re-compute wav_, which leads to better quality
            with paddle.no_grad():
                wav_ = self.generator(mel)

            p = self.discriminator(wav)
            p_ = self.discriminator(wav_.detach())
            real_loss, fake_loss = self.criterion_dis_adv(p_, p)
            dis_loss = real_loss + fake_loss
            report("train/real_loss", float(real_loss))
            report("train/fake_loss", float(fake_loss))
            report("train/discriminator_loss", float(dis_loss))
            losses_dict["real_loss"] = float(real_loss)
            losses_dict["fake_loss"] = float(fake_loss)
            losses_dict["discriminator_loss"] = float(dis_loss)

            self.optimizer_d.clear_grad()
            dis_loss.backward()

            self.optimizer_d.step()
            self.scheduler_d.step()

        self.msg += ', '.join('{}: {:>.6f}'.format(k, v)
                              for k, v in losses_dict.items())


class HiFiGANEvaluator(StandardEvaluator):
    def __init__(self,
                 models: Dict[str, Layer],
                 criterions: Dict[str, Layer],
                 dataloader: DataLoader,
                 lambda_adv: float=1.0,
                 lambda_aux: float=1.0,
                 lambda_feat_match: float=1.0,
                 output_dir=None):
        self.models = models
        self.generator = models['generator']
        self.discriminator = models['discriminator']

        self.criterions = criterions
        self.criterion_feat_match = criterions['feat_match']
        self.criterion_mel = criterions['mel']
        self.criterion_gen_adv = criterions["gen_adv"]
        self.criterion_dis_adv = criterions["dis_adv"]

        self.dataloader = dataloader

        self.lambda_adv = lambda_adv
        self.lambda_aux = lambda_aux
        self.lambda_feat_match = lambda_feat_match

        log_file = output_dir / 'worker_{}.log'.format(dist.get_rank())
        self.filehandler = logging.FileHandler(str(log_file))
        logger.addHandler(self.filehandler)
        self.logger = logger
        self.msg = ""

    def evaluate_core(self, batch):
        # logging.debug("Evaluate: ")
        self.msg = "Evaluate: "
        losses_dict = {}
        wav, mel = batch

        # Generator
        # (B, out_channels, T * prod(upsample_scales))
        wav_ = self.generator(mel)

        # initialize
        gen_loss = 0.0
        aux_loss = 0.0

        # adversarial loss
        p_ = self.discriminator(wav_)
        adv_loss = self.criterion_gen_adv(p_)
        report("eval/adversarial_loss", float(adv_loss))
        losses_dict["adversarial_loss"] = float(adv_loss)

        # feature matching loss
        p = self.discriminator(wav)
        fm_loss = self.criterion_feat_match(p_, p)
        report("eval/feature_matching_loss", float(fm_loss))
        losses_dict["feature_matching_loss"] = float(fm_loss)
        adv_loss += self.lambda_feat_match * fm_loss

        gen_loss += self.lambda_adv * adv_loss

        # mel spectrogram loss
        mel_loss = self.criterion_mel(wav_, wav)
        aux_loss += mel_loss
        report("eval/mel_loss", float(mel_loss))
        losses_dict["mel_loss"] = float(mel_loss)

        gen_loss += aux_loss * self.lambda_aux

        report("eval/generator_loss", float(gen_loss))
        losses_dict["generator_loss"] = float(gen_loss)

        # Discriminator
        p = self.discriminator(wav)
        real_loss, fake_loss = self.criterion_dis_adv(p_, p)
        dis_loss = real_loss + fake_loss
        report("eval/real_loss", float(real_loss))
        report("eval/fake_loss", float(fake_loss))
        report("eval/discriminator_loss", float(dis_loss))

        losses_dict["real_loss"] = float(real_loss)
        losses_dict["fake_loss"] = float(fake_loss)
        losses_dict["discriminator_loss"] = float(dis_loss)

        self.msg += ', '.join('{}: {:>.6f}'.format(k, v)
                              for k, v in losses_dict.items())
        self.logger.info(self.msg)
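The updater above implements a two-phase schedule: until `discriminator_train_start_steps` only the mel (aux) loss trains the generator; past it, adversarial and feature-matching losses are added and the discriminator begins updating. A small sketch of which losses are active at a given iteration, assuming the default thresholds:

```python
# Sketch of the updater's two-phase training schedule (default thresholds).
def active_losses(iteration,
                  generator_train_start_steps=0,
                  discriminator_train_start_steps=100000):
    losses = []
    if iteration > generator_train_start_steps:
        losses.append("mel")  # aux loss always trains the generator
        if iteration > discriminator_train_start_steps:
            # adversarial + feature-matching losses kick in later
            losses.extend(["adversarial", "feature_matching"])
    if iteration > discriminator_train_start_steps:
        losses.append("discriminator")
    return losses

print(active_losses(50000))   # ['mel']
print(active_losses(150000))  # ['mel', 'adversarial', 'feature_matching', 'discriminator']
```

Warming up the generator on the mel loss alone before the adversarial game starts is the usual stabilization trick for GAN vocoders.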
@ -0,0 +1,207 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
import math
from typing import Any
from typing import Dict
from typing import List

import paddle
from paddle import nn
from paddle.nn import functional as F

from paddlespeech.t2s.modules.activation import get_activation


class WaveNetResidualBlock(nn.Layer):
    """A gated activation unit composed of a 1D convolution, a gated tanh
    unit and parametric residual and skip connections. For more details,
    refer to `WaveNet: A Generative Model for Raw Audio <https://arxiv.org/abs/1609.03499>`_.

    Parameters
    ----------
    kernel_size : int, optional
        Kernel size of the 1D convolution, by default 3
    residual_channels : int, optional
        Feature size of the residual output (and also the input), by default 64
    gate_channels : int, optional
        Output feature size of the 1D convolution, by default 128
    skip_channels : int, optional
        Feature size of the skip output, by default 64
    aux_channels : int, optional
        Feature size of the auxiliary input (e.g. spectrogram), by default 80
    dropout : float, optional
        Probability of the dropout before the 1D convolution, by default 0.
    dilation : int, optional
        Dilation of the 1D convolution, by default 1
    bias : bool, optional
        Whether to use bias in the 1D convolution, by default True
    use_causal_conv : bool, optional
        Whether to use causal padding for the 1D convolution, by default False
    """

    def __init__(self,
                 kernel_size: int=3,
                 residual_channels: int=64,
                 gate_channels: int=128,
                 skip_channels: int=64,
                 aux_channels: int=80,
                 dropout: float=0.,
                 dilation: int=1,
                 bias: bool=True,
                 use_causal_conv: bool=False):
        super().__init__()
        self.dropout = dropout
        if use_causal_conv:
            padding = (kernel_size - 1) * dilation
        else:
            assert kernel_size % 2 == 1
            padding = (kernel_size - 1) // 2 * dilation
        self.use_causal_conv = use_causal_conv

        self.conv = nn.Conv1D(
            residual_channels,
            gate_channels,
            kernel_size,
            padding=padding,
            dilation=dilation,
            bias_attr=bias)
        if aux_channels is not None:
            self.conv1x1_aux = nn.Conv1D(
                aux_channels, gate_channels, kernel_size=1, bias_attr=False)
        else:
            self.conv1x1_aux = None

        gate_out_channels = gate_channels // 2
        self.conv1x1_out = nn.Conv1D(
            gate_out_channels, residual_channels, kernel_size=1, bias_attr=bias)
        self.conv1x1_skip = nn.Conv1D(
            gate_out_channels, skip_channels, kernel_size=1, bias_attr=bias)

    def forward(self, x, c):
        """
        Parameters
        ----------
        x : Tensor
            Shape (N, C_res, T), the input features.
        c : Tensor
            Shape (N, C_aux, T), the auxiliary input.

        Returns
        -------
        res : Tensor
            Shape (N, C_res, T), the residual output, which is used as the
            input of the next ResidualBlock in a stack of ResidualBlocks.
        skip : Tensor
            Shape (N, C_skip, T), the skip output, which is collected among
            each layer in a stack of ResidualBlocks.
        """
        x_input = x
        x = F.dropout(x, self.dropout, training=self.training)
        x = self.conv(x)
        # trim the extra causal padding on the right of the time axis
        x = x[:, :, :x_input.shape[-1]] if self.use_causal_conv else x
        if c is not None:
            c = self.conv1x1_aux(c)
            x += c

        a, b = paddle.chunk(x, 2, axis=1)
        x = paddle.tanh(a) * F.sigmoid(b)

        skip = self.conv1x1_skip(x)
        res = (self.conv1x1_out(x) + x_input) * math.sqrt(0.5)
        return res, skip


class HiFiGANResidualBlock(nn.Layer):
    """Residual block module in HiFiGAN."""

    def __init__(
            self,
            kernel_size: int=3,
            channels: int=512,
            dilations: List[int]=(1, 3, 5),
            bias: bool=True,
            use_additional_convs: bool=True,
            nonlinear_activation: str="leakyrelu",
            nonlinear_activation_params: Dict[str, Any]={"negative_slope": 0.1},
    ):
        """Initialize HiFiGANResidualBlock module.
        Parameters
        ----------
        kernel_size : int
            Kernel size of dilation convolution layer.
        channels : int
            Number of channels for convolution layer.
        dilations : List[int]
            List of dilation factors.
        use_additional_convs : bool
            Whether to use additional convolution layers.
        bias : bool
            Whether to add bias parameter in convolution layers.
        nonlinear_activation : str
            Activation function module name.
        nonlinear_activation_params : dict
            Hyperparameters for activation function.
        """
        super().__init__()

        self.use_additional_convs = use_additional_convs
        self.convs1 = nn.LayerList()
        if use_additional_convs:
            self.convs2 = nn.LayerList()
        assert kernel_size % 2 == 1, "Kernel size must be odd number."

        for dilation in dilations:
            self.convs1.append(
                nn.Sequential(
                    get_activation(nonlinear_activation,
                                   **nonlinear_activation_params),
                    nn.Conv1D(
                        channels,
                        channels,
                        kernel_size,
                        1,
                        dilation=dilation,
                        bias_attr=bias,
                        padding=(kernel_size - 1) // 2 * dilation, ), ))
            if use_additional_convs:
                self.convs2.append(
                    nn.Sequential(
                        get_activation(nonlinear_activation,
                                       **nonlinear_activation_params),
                        nn.Conv1D(
                            channels,
                            channels,
                            kernel_size,
                            1,
                            dilation=1,
                            bias_attr=bias,
                            padding=(kernel_size - 1) // 2, ), ))

    def forward(self, x):
        """Calculate forward propagation.
        Parameters
        ----------
        x : Tensor
            Input tensor (B, channels, T).
        Returns
        ----------
        Tensor
            Output tensor (B, channels, T).
        """
        for idx in range(len(self.convs1)):
            xt = self.convs1[idx](x)
            if self.use_additional_convs:
                xt = self.convs2[idx](xt)
            x = xt + x
        return x
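The residual block's core operations are simple enough to check numerically: the gated tanh unit multiplies a tanh branch by a sigmoid gate, and the padding rules in `__init__` differ between the causal and "same" cases. A scalar sketch in pure Python (no paddle needed):

```python
import math

# Scalar version of the gated activation unit: the gate_channels-wide conv
# output is split in half along channels; one half goes through tanh, the
# other through a sigmoid gate, and the halves are multiplied.
def gated_activation(a, b):
    return math.tanh(a) * (1.0 / (1.0 + math.exp(-b)))

# Padding amounts as computed in WaveNetResidualBlock.__init__:
def pad_amount(kernel_size, dilation, use_causal_conv):
    if use_causal_conv:
        # full receptive field minus one; the extra right side is trimmed in forward()
        return (kernel_size - 1) * dilation
    assert kernel_size % 2 == 1
    # "same" padding: output length equals input length
    return (kernel_size - 1) // 2 * dilation

print(pad_amount(3, 1, False))  # 1
print(pad_amount(3, 4, True))   # 8
```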
@ -0,0 +1,220 @@
# Copyright (c) 2021 PaddlePaddle Authors. All Rights Reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
# Modified from espnet(https://github.com/espnet/espnet)
from typing import Any
from typing import Dict
from typing import List
from typing import Optional

from paddle import nn
from paddle.nn import functional as F

from paddlespeech.t2s.modules.activation import get_activation


class Stretch2D(nn.Layer):
    def __init__(self, w_scale: int, h_scale: int, mode: str="nearest"):
        """Stretch an image (or image-like object) with some interpolation.

        Parameters
        ----------
        w_scale : int
            Scale factor of the width.
        h_scale : int
            Scale factor of the height.
        mode : str, optional
            Interpolation mode; supported modes are "nearest", "bilinear",
            "trilinear", "bicubic", "linear" and "area", by default "nearest"

            For more details about interpolation, see
            `paddle.nn.functional.interpolate <https://www.paddlepaddle.org.cn/documentation/docs/en/api/paddle/nn/functional/interpolate_en.html>`_.
        """
        super().__init__()
        self.w_scale = w_scale
        self.h_scale = h_scale
        self.mode = mode

    def forward(self, x):
        """
        Parameters
        ----------
        x : Tensor
            Shape (N, C, H, W)

        Returns
        -------
        Tensor
            Shape (N, C, H', W'), where ``H'=h_scale * H``, ``W'=w_scale * W``.
            The stretched image.
        """
        out = F.interpolate(
            x, scale_factor=(self.h_scale, self.w_scale), mode=self.mode)
        return out
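For `mode="nearest"` the stretch is just element repetition along each axis. A pure-Python sketch of a single (H, W) plane, illustrating what `F.interpolate` does in the nearest case (the real layer operates on 4-D tensors):

```python
# Nearest-neighbour stretch of a 2-D list: every element is repeated
# w_scale times along the width and every row h_scale times along the height.
def stretch_nearest(rows, h_scale, w_scale):
    out = []
    for row in rows:
        stretched = [v for v in row for _ in range(w_scale)]
        out.extend([list(stretched) for _ in range(h_scale)])
    return out

print(stretch_nearest([[1, 2]], 2, 3))
# [[1, 1, 1, 2, 2, 2], [1, 1, 1, 2, 2, 2]]
```

In `UpsampleNet` below, `Stretch2D(scale, 1, ...)` uses `h_scale=1`, so only the time (width) axis is stretched.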


class UpsampleNet(nn.Layer):
    """A Layer to upsample spectrogram by applying consecutive stretch and
    convolutions.

    Parameters
    ----------
    upsample_scales : List[int]
        Upsampling factors for each stretch.
    nonlinear_activation : Optional[str], optional
        Activation after each convolution, by default None
    nonlinear_activation_params : Dict[str, Any], optional
        Parameters passed to construct the activation, by default {}
    interpolate_mode : str, optional
        Interpolation mode of the stretch, by default "nearest"
    freq_axis_kernel_size : int, optional
        Convolution kernel size along the frequency axis, by default 1
    use_causal_conv : bool, optional
        Whether to use causal padding before convolution, by default False

        If True, causal padding is used along the time axis, i.e. the padding
        amount is ``receptive field - 1`` before and 0 after, respectively.

        If False, "same" padding is used along the time axis.
    """

    def __init__(self,
                 upsample_scales: List[int],
                 nonlinear_activation: Optional[str]=None,
                 nonlinear_activation_params: Dict[str, Any]={},
                 interpolate_mode: str="nearest",
                 freq_axis_kernel_size: int=1,
                 use_causal_conv: bool=False):
        super().__init__()
        self.use_causal_conv = use_causal_conv
        self.up_layers = nn.LayerList()

        for scale in upsample_scales:
            stretch = Stretch2D(scale, 1, interpolate_mode)
            assert freq_axis_kernel_size % 2 == 1
            freq_axis_padding = (freq_axis_kernel_size - 1) // 2
            kernel_size = (freq_axis_kernel_size, scale * 2 + 1)
            if use_causal_conv:
                padding = (freq_axis_padding, scale * 2)
            else:
                padding = (freq_axis_padding, scale)
            conv = nn.Conv2D(
                1, 1, kernel_size, padding=padding, bias_attr=False)
            self.up_layers.extend([stretch, conv])
            if nonlinear_activation is not None:
                # for compatibility
                nonlinear_activation = nonlinear_activation.lower()

                nonlinear = get_activation(nonlinear_activation,
                                           **nonlinear_activation_params)
                self.up_layers.append(nonlinear)

    def forward(self, c):
        """
        Parameters
        ----------
        c : Tensor
            Shape (N, F, T), spectrogram

        Returns
        -------
        Tensor
            Shape (N, F, T'), where ``T' = upsample_factor * T``, upsampled
            spectrogram
        """
        c = c.unsqueeze(1)
        for f in self.up_layers:
            if self.use_causal_conv and isinstance(f, nn.Conv2D):
                # trim the extra causal padding on the right of the time axis
                c = f(c)[:, :, :, :c.shape[-1]]
            else:
                c = f(c)
        return c.squeeze(1)


class ConvInUpsampleNet(nn.Layer):
    """A Layer to upsample spectrogram composed of a convolution and an
    UpsampleNet.

    Parameters
    ----------
    upsample_scales : List[int]
        Upsampling factors for each stretch.
    nonlinear_activation : Optional[str], optional
        Activation after each convolution, by default None
    nonlinear_activation_params : Dict[str, Any], optional
        Parameters passed to construct the activation, by default {}
    interpolate_mode : str, optional
        Interpolation mode of the stretch, by default "nearest"
    freq_axis_kernel_size : int, optional
        Convolution kernel size along the frequency axis, by default 1
    aux_channels : int, optional
        Feature size of the input, by default 80
    aux_context_window : int, optional
        Context window of the first 1D convolution applied to the input. It
        is related to the kernel size of the convolution, by default 0

        If causal convolution is used, the kernel size is ``window + 1``,
        else the kernel size is ``2 * window + 1``.
    use_causal_conv : bool, optional
        Whether to use causal padding before convolution, by default False

        If True, causal padding is used along the time axis, i.e. the padding
        amount is ``receptive field - 1`` before and 0 after, respectively.

        If False, "same" padding is used along the time axis.
    """

    def __init__(self,
                 upsample_scales: List[int],
                 nonlinear_activation: Optional[str]=None,
                 nonlinear_activation_params: Dict[str, Any]={},
                 interpolate_mode: str="nearest",
                 freq_axis_kernel_size: int=1,
                 aux_channels: int=80,
                 aux_context_window: int=0,
                 use_causal_conv: bool=False):
        super().__init__()
        self.aux_context_window = aux_context_window
        self.use_causal_conv = use_causal_conv and aux_context_window > 0
        kernel_size = aux_context_window + 1 if use_causal_conv else 2 * aux_context_window + 1
        self.conv_in = nn.Conv1D(
            aux_channels,
            aux_channels,
            kernel_size=kernel_size,
            bias_attr=False)
        self.upsample = UpsampleNet(
            upsample_scales=upsample_scales,
            nonlinear_activation=nonlinear_activation,
            nonlinear_activation_params=nonlinear_activation_params,
            interpolate_mode=interpolate_mode,
            freq_axis_kernel_size=freq_axis_kernel_size,
            use_causal_conv=use_causal_conv)

    def forward(self, c):
        """
        Parameters
        ----------
        c : Tensor
            Shape (N, F, T), spectrogram

        Returns
        -------
        Tensor
            Shape (N, F, T'), where ``T' = upsample_factor * T``, upsampled
            spectrogram
        """
        c_ = self.conv_in(c)
        c = c_[:, :, :-self.aux_context_window] if self.use_causal_conv else c_
        return self.upsample(c)
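The shape bookkeeping of these upsampling layers is easy to verify: each `Stretch2D` multiplies the time axis by its scale, so the total upsampling factor is the product of `upsample_scales`, and `conv_in`'s kernel size follows the causal / non-causal rule from the docstring. A small sketch:

```python
from functools import reduce
from operator import mul

# Total time-axis upsampling factor of UpsampleNet: the product of the
# per-stage scales, since each Stretch2D multiplies T by its scale.
def total_upsample_factor(upsample_scales):
    return reduce(mul, upsample_scales, 1)

# Kernel size of ConvInUpsampleNet.conv_in, per the docstring rule.
def conv_in_kernel_size(aux_context_window, use_causal_conv):
    return (aux_context_window + 1 if use_causal_conv
            else 2 * aux_context_window + 1)

print(total_upsample_factor([4, 4, 4, 4]))  # 256, i.e. T' = 256 * T
print(conv_in_kernel_size(2, False))        # 5
```

For a vocoder, the product of `upsample_scales` must equal the hop size of the mel analysis, so one spectrogram frame expands to exactly one hop of waveform samples.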