merge the develop

pull/1225/head
huangyuxin 4 years ago
commit a1d8ab0f99

@@ -530,7 +530,7 @@ You are warmly welcome to submit questions in [discussions](https://github.com/P
 ## Acknowledgement
-- Many thanks to [yeyupiaoling](https://github.com/yeyupiaoling) for years of attention, constructive advice and great help.
+- Many thanks to [yeyupiaoling](https://github.com/yeyupiaoling)/[PPASR](https://github.com/yeyupiaoling/PPASR)/[PaddlePaddle-DeepSpeech](https://github.com/yeyupiaoling/PaddlePaddle-DeepSpeech)/[VoiceprintRecognition-PaddlePaddle](https://github.com/yeyupiaoling/VoiceprintRecognition-PaddlePaddle)/[AudioClassification-PaddlePaddle](https://github.com/yeyupiaoling/AudioClassification-PaddlePaddle) for years of attention, constructive advice and great help.
 - Many thanks to [AK391](https://github.com/AK391) for TTS web demo on Huggingface Spaces using Gradio.
 - Many thanks to [mymagicpower](https://github.com/mymagicpower) for the Java implementation of ASR upon [short](https://github.com/mymagicpower/AIAS/tree/main/3_audio_sdks/asr_sdk) and [long](https://github.com/mymagicpower/AIAS/tree/main/3_audio_sdks/asr_long_audio_sdk) audio files.
 - Many thanks to [JiehangXie](https://github.com/JiehangXie)/[PaddleBoBo](https://github.com/JiehangXie/PaddleBoBo) for developing Virtual Uploader(VUP)/Virtual YouTuber(VTuber) with PaddleSpeech TTS function.

@@ -497,7 +497,6 @@ year={2021}
 <a name="欢迎贡献"></a>
 ## Contribute to PaddleSpeech
 You are warmly welcome to submit questions in [Discussions](https://github.com/PaddlePaddle/PaddleSpeech/discussions) and report bugs you find in [Issues](https://github.com/PaddlePaddle/PaddleSpeech/issues). We also sincerely hope you will join in developing PaddleSpeech!
 ### Contributors
@@ -539,7 +538,7 @@ year={2021}
 ## Acknowledgement
-- Many thanks to [yeyupiaoling](https://github.com/yeyupiaoling) for years of attention, constructive advice and great help on many problems.
+- Many thanks to [yeyupiaoling](https://github.com/yeyupiaoling)/[PPASR](https://github.com/yeyupiaoling/PPASR)/[PaddlePaddle-DeepSpeech](https://github.com/yeyupiaoling/PaddlePaddle-DeepSpeech)/[VoiceprintRecognition-PaddlePaddle](https://github.com/yeyupiaoling/VoiceprintRecognition-PaddlePaddle)/[AudioClassification-PaddlePaddle](https://github.com/yeyupiaoling/AudioClassification-PaddlePaddle) for years of attention, constructive advice and great help on many problems.
 - Many thanks to [AK391](https://github.com/AK391) for the web demo of our TTS on Huggingface Spaces using Gradio.
 - Many thanks to [mymagicpower](https://github.com/mymagicpower) for the Java implementations of PaddleSpeech ASR for [short](https://github.com/mymagicpower/AIAS/tree/main/3_audio_sdks/asr_sdk) and [long](https://github.com/mymagicpower/AIAS/tree/main/3_audio_sdks/asr_long_audio_sdk) audio files.
 - Many thanks to [JiehangXie](https://github.com/JiehangXie)/[PaddleBoBo](https://github.com/JiehangXie/PaddleBoBo) for building a Virtual Uploader (VUP)/Virtual YouTuber (VTuber) with PaddleSpeech TTS.

@@ -72,10 +72,7 @@ lambda_adv: 4.0 # Loss balancing coefficient.
 ###########################################################
 batch_size: 8              # Batch size.
 batch_max_steps: 24000     # Length of each audio in batch. Make sure dividable by n_shift.
-pin_memory: true           # Whether to pin memory in Pytorch DataLoader.
-num_workers: 4             # Number of workers in Pytorch DataLoader.
-remove_short_samples: true # Whether to remove samples the length of which are less than batch_max_steps.
-allow_cache: true          # Whether to allow cache in dataset. If true, it requires cpu memory.
+num_workers: 2             # Number of workers in DataLoader.
 ###########################################################
 # OPTIMIZER & SCHEDULER SETTING #

@@ -79,10 +79,7 @@ lambda_adv: 4.0 # Loss balancing coefficient.
 ###########################################################
 batch_size: 8              # Batch size.
 batch_max_steps: 25500     # Length of each audio in batch. Make sure dividable by n_shift.
-pin_memory: true           # Whether to pin memory in Pytorch DataLoader.
-num_workers: 2             # Number of workers in Pytorch DataLoader.
-remove_short_samples: true # Whether to remove samples the length of which are less than batch_max_steps.
-allow_cache: true          # Whether to allow cache in dataset. If true, it requires cpu memory.
+num_workers: 2             # Number of workers in DataLoader.
 ###########################################################
 # OPTIMIZER & SCHEDULER SETTING #

@@ -88,7 +88,7 @@ discriminator_adv_loss_params:
 batch_size: 32             # Batch size.
 # batch_max_steps(24000) == prod(noise_upsample_scales)(80) * prod(upsample_scales)(300, n_shift)
 batch_max_steps: 24000     # Length of each audio in batch. Make sure dividable by n_shift.
-num_workers: 2             # Number of workers in Pytorch DataLoader.
+num_workers: 2             # Number of workers in DataLoader.
 ###########################################################
 # OPTIMIZER & SCHEDULER SETTING #

@@ -119,7 +119,7 @@ lambda_feat_match: 2.0 # Loss balancing coefficient for feat match loss.
 ###########################################################
 batch_size: 16             # Batch size.
 batch_max_steps: 8400      # Length of each audio in batch. Make sure dividable by hop_size.
-num_workers: 2             # Number of workers in Pytorch DataLoader.
+num_workers: 2             # Number of workers in DataLoader.
 ###########################################################
 # OPTIMIZER & SCHEDULER SETTING #

@@ -119,7 +119,7 @@ lambda_feat_match: 2.0 # Loss balancing coefficient for feat match loss.
 ###########################################################
 batch_size: 16             # Batch size.
 batch_max_steps: 8400      # Length of each audio in batch. Make sure dividable by hop_size.
-num_workers: 2             # Number of workers in Pytorch DataLoader.
+num_workers: 2             # Number of workers in DataLoader.
 ###########################################################
 # OPTIMIZER & SCHEDULER SETTING #

@@ -72,10 +72,7 @@ lambda_adv: 4.0 # Loss balancing coefficient.
 ###########################################################
 batch_size: 8              # Batch size.
 batch_max_steps: 25600     # Length of each audio in batch. Make sure dividable by n_shift.
-pin_memory: true           # Whether to pin memory in Pytorch DataLoader.
-num_workers: 4             # Number of workers in Pytorch DataLoader.
-remove_short_samples: true # Whether to remove samples the length of which are less than batch_max_steps.
-allow_cache: true          # Whether to allow cache in dataset. If true, it requires cpu memory.
+num_workers: 2             # Number of workers in DataLoader.
 ###########################################################
 # OPTIMIZER & SCHEDULER SETTING #

@@ -2,7 +2,7 @@
 ###########################################
 # Data #
 ###########################################
-train_manifest: data/manifest.train.tiny
+train_manifest: data/manifest.train
 dev_manifest: data/manifest.dev
 test_manifest: data/manifest.test
 min_input_len: 0.05  # second
@@ -19,8 +19,10 @@ vocab_filepath: data/lang_char/vocab.txt
 unit_type: 'spm'
 spm_model_prefix: data/lang_char/bpe_unigram_8000
 mean_std_filepath: ""
-# augmentation_config: conf/augmentation.json
-batch_size: 10
+augmentation_config: conf/preprocess.yaml
+batch_size: 16
+maxlen_in: 5  # if input length > maxlen-in, batchsize is automatically reduced
+maxlen_out: 150  # if output length > maxlen-out, batchsize is automatically reduced
 raw_wav: True  # use raw_wav or kaldi feature
 spectrum_type: fbank  #linear, mfcc, fbank
 feat_dim: 80
@@ -84,13 +86,13 @@ accum_grad: 2
 global_grad_clip: 5.0
 optim: adam
 optim_conf:
-  lr: 0.004
-  weight_decay: 1.0e-06
-scheduler: warmuplr
+  lr: 2.5
+  weight_decay: 1e-06
+scheduler: noam
 scheduler_conf:
   warmup_steps: 25000
   lr_decay: 1.0
-log_interval: 5
+log_interval: 50
 checkpoint:
   kbest_n: 50
   latest_n: 5
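Note on the scheduler switch in this hunk: with `noam`, the configured `lr` acts as a dimensionless scale on an inverse-square-root warmup curve rather than as an absolute learning rate, which is why it moves from 0.004 to 2.5. A minimal sketch of the curve (the `d_model` value here is only an assumed example; in the trainer it comes from `encoder_conf.output_size`):

```python
def noam_lr(step: int, d_model: int = 256, warmup_steps: int = 25000,
            scale: float = 2.5) -> float:
    """Noam schedule: linear warmup for warmup_steps, then ~1/sqrt(step) decay."""
    step = max(step, 1)
    return scale * d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)

# The peak learning rate is reached at step == warmup_steps.
print(noam_lr(25000), noam_lr(100000))
```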

@@ -19,8 +19,10 @@ vocab_filepath: data/lang_char/vocab.txt
 unit_type: 'spm'
 spm_model_prefix: data/lang_char/bpe_unigram_8000
 mean_std_filepath: ""
-# augmentation_config: conf/augmentation.json
-batch_size: 10
+augmentation_config: conf/preprocess.yaml
+batch_size: 16
+maxlen_in: 5  # if input length > maxlen-in, batchsize is automatically reduced
+maxlen_out: 150  # if output length > maxlen-out, batchsize is automatically reduced
 raw_wav: True  # use raw_wav or kaldi feature
 spectrum_type: fbank  #linear, mfcc, fbank
 feat_dim: 80

@@ -14,7 +14,6 @@ ckpt_prefix=$3
 for type in fullsentence; do
     echo "decoding ${type}"
-    batch_size=32
     python3 -u ${BIN_DIR}/test.py \
     --ngpu ${ngpu} \
     --config ${config_path} \
@@ -22,7 +21,6 @@ for type in fullsentence; do
     --result_file ${ckpt_prefix}.${type}.rsl \
     --checkpoint_path ${ckpt_prefix} \
     --opts decode.decoding_method ${type} \
-    --opts decode.decode_batch_size ${batch_size}
     if [ $? -ne 0 ]; then
         echo "Failed in evaluation!"

@@ -12,5 +12,5 @@
 ## Transformer
 | Model | Params | Config | Val loss | Char-BLEU |
 | --- | --- | --- | --- | --- |
-| FAT + Transformer+ASR MTL | 50.26M | conf/transformer_mtl_noam.yaml | 62.86 | 19.45 |
+| FAT + Transformer+ASR MTL | 50.26M | conf/transformer_mtl_noam.yaml | 69.91 | 20.26 |
 | FAT + Transformer+ASR MTL with word reward | 50.26M | conf/transformer_mtl_noam.yaml | 62.86 | 20.80 |

@@ -2,42 +2,35 @@
 ###########################################
 # Data #
 ###########################################
-train_manifest: data/manifest.train.tiny
+train_manifest: data/manifest.train
 dev_manifest: data/manifest.dev
 test_manifest: data/manifest.test
-min_input_len: 5.0  # frame
-max_input_len: 3000.0  # frame
-min_output_len: 0.0  # tokens
-max_output_len: 400.0  # tokens
-min_output_input_ratio: 0.01
-max_output_input_ratio: 20.0
 ###########################################
 # Dataloader #
 ###########################################
-vocab_filepath: data/lang_char/vocab.txt
+vocab_filepath: data/lang_char/ted_en_zh_bpe8000.txt
 unit_type: 'spm'
-spm_model_prefix: data/lang_char/bpe_unigram_8000
+spm_model_prefix: data/lang_char/ted_en_zh_bpe8000
 mean_std_filepath: ""
 # augmentation_config: conf/augmentation.json
-batch_size: 10
-raw_wav: True  # use raw_wav or kaldi feature
-spectrum_type: fbank  #linear, mfcc, fbank
+batch_size: 20
 feat_dim: 83
-delta_delta: False
-dither: 1.0
-target_sample_rate: 16000
-max_freq: None
-n_fft: None
 stride_ms: 10.0
 window_ms: 25.0
-use_dB_normalization: True
-target_dB: -20
-random_seed: 0
-keep_transcription_text: False
-sortagrad: True
-shuffle_method: batch_shuffle
-num_workers: 2
+sortagrad: 0  # Feed samples from shortest to longest ; -1: enabled for all epochs, 0: disabled, other: enabled for 'other' epochs
+maxlen_in: 512  # if input length > maxlen-in, batchsize is automatically reduced
+maxlen_out: 150  # if output length > maxlen-out, batchsize is automatically reduced
+minibatches: 0  # for debug
+batch_count: auto
+batch_bins: 0
+batch_frames_in: 0
+batch_frames_out: 0
+batch_frames_inout: 0
+augmentation_config:
+num_workers: 0
+subsampling_factor: 1
+num_encs: 1
 ############################################
@@ -80,18 +73,18 @@ model_conf:
 ###########################################
 # Training #
 ###########################################
-n_epoch: 20
+n_epoch: 40
 accum_grad: 2
 global_grad_clip: 5.0
 optim: adam
 optim_conf:
-  lr: 0.004
-  weight_decay: 1.0e-06
-scheduler: warmuplr
+  lr: 2.5
+  weight_decay: 0.
+scheduler: noam
 scheduler_conf:
   warmup_steps: 25000
   lr_decay: 1.0
-log_interval: 5
+log_interval: 50
 checkpoint:
   kbest_n: 50
   latest_n: 5

@@ -5,12 +5,6 @@
 train_manifest: data/manifest.train
 dev_manifest: data/manifest.dev
 test_manifest: data/manifest.test
-min_input_len: 5.0  # frame
-max_input_len: 3000.0  # frame
-min_output_len: 0.0  # tokens
-max_output_len: 400.0  # tokens
-min_output_input_ratio: 0.01
-max_output_input_ratio: 20.0
 ###########################################
 # Dataloader #
@@ -20,24 +14,23 @@ unit_type: 'spm'
 spm_model_prefix: data/lang_char/ted_en_zh_bpe8000
 mean_std_filepath: ""
 # augmentation_config: conf/augmentation.json
-batch_size: 10
-raw_wav: True  # use raw_wav or kaldi feature
-spectrum_type: fbank  #linear, mfcc, fbank
+batch_size: 20
 feat_dim: 83
-delta_delta: False
-dither: 1.0
-target_sample_rate: 16000
-max_freq: None
-n_fft: None
 stride_ms: 10.0
 window_ms: 25.0
-use_dB_normalization: True
-target_dB: -20
-random_seed: 0
-keep_transcription_text: False
-sortagrad: True
-shuffle_method: batch_shuffle
-num_workers: 2
+sortagrad: 0  # Feed samples from shortest to longest ; -1: enabled for all epochs, 0: disabled, other: enabled for 'other' epochs
+maxlen_in: 512  # if input length > maxlen-in, batchsize is automatically reduced
+maxlen_out: 150  # if output length > maxlen-out, batchsize is automatically reduced
+minibatches: 0  # for debug
+batch_count: auto
+batch_bins: 0
+batch_frames_in: 0
+batch_frames_out: 0
+batch_frames_inout: 0
+augmentation_config:
+num_workers: 0
+subsampling_factor: 1
+num_encs: 1
 ############################################
@@ -80,18 +73,18 @@ model_conf:
 ###########################################
 # Training #
 ###########################################
-n_epoch: 20
+n_epoch: 40
 accum_grad: 2
 global_grad_clip: 5.0
 optim: adam
 optim_conf:
   lr: 2.5
-  weight_decay: 1.0e-06
+  weight_decay: 0.
 scheduler: noam
 scheduler_conf:
   warmup_steps: 25000
   lr_decay: 1.0
-log_interval: 5
+log_interval: 50
 checkpoint:
   kbest_n: 50
   latest_n: 5
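The `maxlen_in`/`maxlen_out` comments in the Dataloader hunks above describe batch-size reduction for long utterances; a rough sketch of that rule (the exact logic lives in the batch-building code, e.g. `make_batchset`, so treat the reduction factor below as an approximation):

```python
def reduced_batch_size(base_bs: int, ilen: int, olen: int,
                       maxlen_in: int = 512, maxlen_out: int = 150) -> int:
    """Shrink the batch size when input/output lengths exceed maxlen_in/maxlen_out."""
    factor = max(int(ilen / maxlen_in), int(olen / maxlen_out))
    return max(1, int(base_bs / (1 + factor)))

print(reduced_batch_size(20, ilen=400, olen=60))    # short utterance: full batch of 20
print(reduced_batch_size(20, ilen=1600, olen=60))   # long input: batch shrinks to 5
```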

@@ -14,15 +14,18 @@ ckpt_prefix=$3
 for type in fullsentence; do
     echo "decoding ${type}"
-    batch_size=32
     python3 -u ${BIN_DIR}/test.py \
     --ngpu ${ngpu} \
     --config ${config_path} \
     --decode_cfg ${decode_config_path} \
     --result_file ${ckpt_prefix}.${type}.rsl \
     --checkpoint_path ${ckpt_prefix} \
+<<<<<<< HEAD
     --opts decode.decoding_method ${type} \
     --opts decode.decode_batch_size ${batch_size}
+=======
+    --opts decoding.decoding_method ${type} \
+>>>>>>> 6272496d9c26736750b577fd832ea9dd4ddc4e6e
     if [ $? -ne 0 ]; then
         echo "Failed in evaluation!"

@@ -72,10 +72,7 @@ lambda_adv: 4.0 # Loss balancing coefficient.
 ###########################################################
 batch_size: 8              # Batch size.
 batch_max_steps: 24000     # Length of each audio in batch. Make sure dividable by n_shift.
-pin_memory: true           # Whether to pin memory in Pytorch DataLoader.
-num_workers: 4             # Number of workers in Pytorch DataLoader.
-remove_short_samples: true # Whether to remove samples the length of which are less than batch_max_steps.
-allow_cache: true          # Whether to allow cache in dataset. If true, it requires cpu memory.
+num_workers: 2             # Number of workers in DataLoader.
 ###########################################################
 # OPTIMIZER & SCHEDULER SETTING #

@@ -178,6 +178,32 @@ pretrained_models = {
         'speech_stats':
         'feats_stats.npy',
     },
+    # style_melgan
+    "style_melgan_csmsc-zh": {
+        'url':
+        'https://paddlespeech.bj.bcebos.com/Parakeet/released_models/style_melgan/style_melgan_csmsc_ckpt_0.1.1.zip',
+        'md5':
+        '5de2d5348f396de0c966926b8c462755',
+        'config':
+        'default.yaml',
+        'ckpt':
+        'snapshot_iter_1500000.pdz',
+        'speech_stats':
+        'feats_stats.npy',
+    },
+    # hifigan
+    "hifigan_csmsc-zh": {
+        'url':
+        'https://paddlespeech.bj.bcebos.com/Parakeet/released_models/hifigan/hifigan_csmsc_ckpt_0.1.1.zip',
+        'md5':
+        'dd40a3d88dfcf64513fba2f0f961ada6',
+        'config':
+        'default.yaml',
+        'ckpt':
+        'snapshot_iter_2500000.pdz',
+        'speech_stats':
+        'feats_stats.npy',
+    },
 }

 model_alias = {
@@ -199,6 +225,14 @@ model_alias = {
     "paddlespeech.t2s.models.melgan:MelGANGenerator",
     "mb_melgan_inference":
     "paddlespeech.t2s.models.melgan:MelGANInference",
+    "style_melgan":
+    "paddlespeech.t2s.models.melgan:StyleMelGANGenerator",
+    "style_melgan_inference":
+    "paddlespeech.t2s.models.melgan:StyleMelGANInference",
+    "hifigan":
+    "paddlespeech.t2s.models.hifigan:HiFiGANGenerator",
+    "hifigan_inference":
+    "paddlespeech.t2s.models.hifigan:HiFiGANInference",
 }
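The `model_alias` entries added above are "module:Class" strings that get resolved with a dynamic import. A small sketch of how such an alias can be turned into a class object (the helper name is illustrative, not the project's actual function):

```python
from importlib import import_module

def resolve_alias(alias: str):
    """Turn a "module:Class" alias such as
    "paddlespeech.t2s.models.hifigan:HiFiGANGenerator" into the class object."""
    module_name, class_name = alias.split(":")
    return getattr(import_module(module_name), class_name)

# Usage (requires paddlespeech to be installed):
# HiFiGANGenerator = resolve_alias("paddlespeech.t2s.models.hifigan:HiFiGANGenerator")
```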
@@ -266,7 +300,7 @@ class TTSExecutor(BaseExecutor):
             default='pwgan_csmsc',
             choices=[
                 'pwgan_csmsc', 'pwgan_ljspeech', 'pwgan_aishell3', 'pwgan_vctk',
-                'mb_melgan_csmsc'
+                'mb_melgan_csmsc', 'style_melgan_csmsc', 'hifigan_csmsc'
             ],
             help='Choose vocoder type of tts task.')
@@ -504,37 +538,47 @@ class TTSExecutor(BaseExecutor):
         am_name = am[:am.rindex('_')]
         am_dataset = am[am.rindex('_') + 1:]
         get_tone_ids = False
+        merge_sentences = False
         if am_name == 'speedyspeech':
             get_tone_ids = True
         if lang == 'zh':
             input_ids = self.frontend.get_input_ids(
-                text, merge_sentences=True, get_tone_ids=get_tone_ids)
+                text,
+                merge_sentences=merge_sentences,
+                get_tone_ids=get_tone_ids)
             phone_ids = input_ids["phone_ids"]
-            phone_ids = phone_ids[0]
             if get_tone_ids:
                 tone_ids = input_ids["tone_ids"]
-                tone_ids = tone_ids[0]
         elif lang == 'en':
-            input_ids = self.frontend.get_input_ids(text)
+            input_ids = self.frontend.get_input_ids(
+                text, merge_sentences=merge_sentences)
             phone_ids = input_ids["phone_ids"]
         else:
             print("lang should in {'zh', 'en'}!")

-        # am
-        if am_name == 'speedyspeech':
-            mel = self.am_inference(phone_ids, tone_ids)
-        # fastspeech2
-        else:
-            # multi speaker
-            if am_dataset in {"aishell3", "vctk"}:
-                mel = self.am_inference(
-                    phone_ids, spk_id=paddle.to_tensor(spk_id))
-            else:
-                mel = self.am_inference(phone_ids)
-
-        # voc
-        wav = self.voc_inference(mel)
-        self._outputs['wav'] = wav
+        flags = 0
+        for i in range(len(phone_ids)):
+            part_phone_ids = phone_ids[i]
+            # am
+            if am_name == 'speedyspeech':
+                part_tone_ids = tone_ids[i]
+                mel = self.am_inference(part_phone_ids, part_tone_ids)
+            # fastspeech2
+            else:
+                # multi speaker
+                if am_dataset in {"aishell3", "vctk"}:
+                    mel = self.am_inference(
+                        part_phone_ids, spk_id=paddle.to_tensor(spk_id))
+                else:
+                    mel = self.am_inference(part_phone_ids)
+            # voc
+            wav = self.voc_inference(mel)
+            if flags == 0:
+                wav_all = wav
+                flags = 1
+            else:
+                wav_all = paddle.concat([wav_all, wav])
+        self._outputs['wav'] = wav_all

     def postprocess(self, output: str='output.wav') -> Union[str, os.PathLike]:
         """
@@ -16,6 +16,7 @@ import json
 import os
 import time
 from collections import defaultdict
+from collections import OrderedDict
 from contextlib import nullcontext
 from typing import Optional
@@ -23,21 +24,18 @@ import jsonlines
 import numpy as np
 import paddle
 from paddle import distributed as dist
-from paddle.io import DataLoader
 from yacs.config import CfgNode

-from paddlespeech.s2t.io.collator import SpeechCollator
-from paddlespeech.s2t.io.collator import TripletSpeechCollator
-from paddlespeech.s2t.io.dataset import ManifestDataset
-from paddlespeech.s2t.io.sampler import SortagradBatchSampler
-from paddlespeech.s2t.io.sampler import SortagradDistributedBatchSampler
+from paddlespeech.s2t.frontend.featurizer import TextFeaturizer
+from paddlespeech.s2t.io.dataloader import BatchDataLoader
 from paddlespeech.s2t.models.u2_st import U2STModel
-from paddlespeech.s2t.training.gradclip import ClipGradByGlobalNormWithLog
-from paddlespeech.s2t.training.scheduler import WarmupLR
+from paddlespeech.s2t.training.optimizer import OptimizerFactory
+from paddlespeech.s2t.training.reporter import ObsScope
+from paddlespeech.s2t.training.reporter import report
+from paddlespeech.s2t.training.scheduler import LRSchedulerFactory
 from paddlespeech.s2t.training.timer import Timer
 from paddlespeech.s2t.training.trainer import Trainer
 from paddlespeech.s2t.utils import bleu_score
-from paddlespeech.s2t.utils import ctc_utils
 from paddlespeech.s2t.utils import layer_tools
 from paddlespeech.s2t.utils import mp_tools
 from paddlespeech.s2t.utils.log import Log
@@ -96,6 +94,8 @@ class U2STTrainer(Trainer):
             # loss div by `batch_size * accum_grad`
             loss /= train_conf.accum_grad
             losses_np = {'loss': float(loss) * train_conf.accum_grad}
+            if st_loss:
+                losses_np['st_loss'] = float(st_loss)
             if attention_loss:
                 losses_np['att_loss'] = float(attention_loss)
             if ctc_loss:
@@ -125,6 +125,12 @@ class U2STTrainer(Trainer):
             iteration_time = time.time() - start

+            for k, v in losses_np.items():
+                report(k, v)
+            report("batch_size", self.config.collator.batch_size)
+            report("accum", train_conf.accum_grad)
+            report("step_cost", iteration_time)
+
             if (batch_index + 1) % train_conf.log_interval == 0:
                 msg += "train time: {:>.3f}s, ".format(iteration_time)
                 msg += "batch size: {}, ".format(self.config.batch_size)
@ -204,16 +210,34 @@ class U2STTrainer(Trainer):
data_start_time = time.time() data_start_time = time.time()
for batch_index, batch in enumerate(self.train_loader): for batch_index, batch in enumerate(self.train_loader):
dataload_time = time.time() - data_start_time dataload_time = time.time() - data_start_time
msg = "Train: Rank: {}, ".format(dist.get_rank()) msg = "Train:"
msg += "epoch: {}, ".format(self.epoch) observation = OrderedDict()
msg += "step: {}, ".format(self.iteration) with ObsScope(observation):
msg += "batch : {}/{}, ".format(batch_index + 1, report("Rank", dist.get_rank())
len(self.train_loader)) report("epoch", self.epoch)
msg += "lr: {:>.8f}, ".format(self.lr_scheduler()) report('step', self.iteration)
msg += "data time: {:>.3f}s, ".format(dataload_time) report("lr", self.lr_scheduler())
self.train_batch(batch_index, batch, msg) self.train_batch(batch_index, batch, msg)
self.after_train_batch() self.after_train_batch()
data_start_time = time.time() report('iter', batch_index + 1)
report('total', len(self.train_loader))
report('reader_cost', dataload_time)
observation['batch_cost'] = observation[
'reader_cost'] + observation['step_cost']
observation['samples'] = observation['batch_size']
observation['ips,sent./sec'] = observation[
'batch_size'] / observation['batch_cost']
for k, v in observation.items():
msg += f" {k.split(',')[0]}: "
msg += f"{v:>.8f}" if isinstance(v,
float) else f"{v}"
msg += f" {k.split(',')[1]}" if len(
k.split(',')) == 2 else ""
msg += ","
msg = msg[:-1] # remove the last ","
if (batch_index + 1
) % self.config.training.log_interval == 0:
logger.info(msg)
except Exception as e: except Exception as e:
logger.error(e) logger.error(e)
raise e raise e
@ -244,97 +268,88 @@ class U2STTrainer(Trainer):
def setup_dataloader(self): def setup_dataloader(self):
config = self.config.clone() config = self.config.clone()
config.defrost()
config.keep_transcription_text = False
# train/valid dataset, return token ids
config.manifest = config.train_manifest
train_dataset = ManifestDataset.from_config(config)
config.manifest = config.dev_manifest
dev_dataset = ManifestDataset.from_config(config)
if config.model_conf.asr_weight > 0.:
Collator = TripletSpeechCollator
TestCollator = SpeechCollator
else:
TestCollator = Collator = SpeechCollator
collate_fn_train = Collator.from_config(config) load_transcript = True if config.model_conf.asr_weight > 0 else False
config.augmentation_config = ""
collate_fn_dev = Collator.from_config(config)
if self.parallel: if self.train:
batch_sampler = SortagradDistributedBatchSampler( # train/valid dataset, return token ids
train_dataset, self.train_loader = BatchDataLoader(
json_file=config.train_manifest,
train_mode=True,
sortagrad=False,
batch_size=config.batch_size, batch_size=config.batch_size,
num_replicas=None, maxlen_in=config.maxlen_in,
rank=None, maxlen_out=config.maxlen_out,
shuffle=True, minibatches=0,
drop_last=True, mini_batch_size=1,
sortagrad=config.sortagrad, batch_count='auto',
shuffle_method=config.shuffle_method) batch_bins=0,
else: batch_frames_in=0,
batch_sampler = SortagradBatchSampler( batch_frames_out=0,
train_dataset, batch_frames_inout=0,
shuffle=True, preprocess_conf=config.augmentation_config, # aug will be off when train_mode=False
n_iter_processes=config.num_workers,
subsampling_factor=1,
load_aux_output=load_transcript,
num_encs=1,
dist_sampler=True)
self.valid_loader = BatchDataLoader(
json_file=config.dev_manifest,
train_mode=False,
sortagrad=False,
batch_size=config.batch_size, batch_size=config.batch_size,
drop_last=True, maxlen_in=float('inf'),
sortagrad=config.sortagrad, maxlen_out=float('inf'),
shuffle_method=config.shuffle_method) minibatches=0,
self.train_loader = DataLoader( mini_batch_size=1,
train_dataset, batch_count='auto',
batch_sampler=batch_sampler, batch_bins=0,
collate_fn=collate_fn_train, batch_frames_in=0,
num_workers=config.num_workers, ) batch_frames_out=0,
self.valid_loader = DataLoader( batch_frames_inout=0,
dev_dataset, preprocess_conf=config.augmentation_config, # aug will be off when train_mode=False
batch_size=config.batch_size, n_iter_processes=config.num_workers,
shuffle=False, subsampling_factor=1,
drop_last=False, load_aux_output=load_transcript,
collate_fn=collate_fn_dev, num_encs=1,
num_workers=config.num_workers, ) dist_sampler=True)
logger.info("Setup train/valid Dataloader!")
# test dataset, return raw text else:
config.manifest = config.test_manifest # test dataset, return raw text
# filter test examples, will cause less examples, but no mismatch with training decode_batch_size = config.get('decode',dict()).get('decode_batch_size', 1)
# and can use large batch size , save training time, so filter test egs now. self.test_loader = BatchDataLoader(
# config.min_input_len = 0.0 # second json_file=config.data.test_manifest,
# config.max_input_len = float('inf') # second train_mode=False,
# config.min_output_len = 0.0 # tokens sortagrad=False,
# config.max_output_len = float('inf') # tokens batch_size=decode_batch_size,
# config.min_output_input_ratio = 0.00 maxlen_in=float('inf'),
# config.max_output_input_ratio = float('inf') maxlen_out=float('inf'),
test_dataset = ManifestDataset.from_config(config) minibatches=0,
# return text ord id mini_batch_size=1,
config.keep_transcription_text = True batch_count='auto',
config.augmentation_config = "" batch_bins=0,
decode_batch_size = config.get('decode', dict()).get( batch_frames_in=0,
'decode_batch_size', 1) batch_frames_out=0,
self.test_loader = DataLoader( batch_frames_inout=0,
test_dataset, preprocess_conf=config.augmentation_config, # aug will be off when train_mode=False
batch_size=decode_batch_size, n_iter_processes=config.num_workers,
shuffle=False, subsampling_factor=1,
drop_last=False, num_encs=1,
collate_fn=TestCollator.from_config(config), dist_sampler=False)
num_workers=config.num_workers, )
# return text token id logger.info("Setup test Dataloader!")
config.keep_transcription_text = False
self.align_loader = DataLoader(
test_dataset,
batch_size=decode_batch_size,
shuffle=False,
drop_last=False,
collate_fn=TestCollator.from_config(config),
num_workers=config.num_workers, )
logger.info("Setup train/valid/test/align Dataloader!")
def setup_model(self): def setup_model(self):
config = self.config config = self.config
model_conf = config model_conf = config
with UpdateConfig(model_conf): with UpdateConfig(model_conf):
model_conf.input_dim = self.train_loader.collate_fn.feature_size if self.train:
model_conf.output_dim = self.train_loader.collate_fn.vocab_size model_conf.input_dim = self.train_loader.feat_dim
model_conf.output_dim = self.train_loader.vocab_size
else:
model_conf.input_dim = self.test_loader.feat_dim
model_conf.output_dim = self.test_loader.vocab_size
model = U2STModel.from_config(model_conf) model = U2STModel.from_config(model_conf)
@ -350,35 +365,38 @@ class U2STTrainer(Trainer):
scheduler_type = train_config.scheduler scheduler_type = train_config.scheduler
scheduler_conf = train_config.scheduler_conf scheduler_conf = train_config.scheduler_conf
if scheduler_type == 'expdecaylr': scheduler_args = {
lr_scheduler = paddle.optimizer.lr.ExponentialDecay( "learning_rate": optim_conf.lr,
learning_rate=optim_conf.lr, "verbose": False,
gamma=scheduler_conf.lr_decay, "warmup_steps": scheduler_conf.warmup_steps,
verbose=False) "gamma": scheduler_conf.lr_decay,
elif scheduler_type == 'warmuplr': "d_model": model_conf.encoder_conf.output_size,
lr_scheduler = WarmupLR( }
learning_rate=optim_conf.lr, lr_scheduler = LRSchedulerFactory.from_args(scheduler_type,
warmup_steps=scheduler_conf.warmup_steps, scheduler_args)
verbose=False)
elif scheduler_type == 'noam': def optimizer_args(
lr_scheduler = paddle.optimizer.lr.NoamDecay( config,
learning_rate=optim_conf.lr, parameters,
d_model=model_conf.encoder_conf.output_size, lr_scheduler=None, ):
warmup_steps=scheduler_conf.warmup_steps, train_config = config.training
verbose=False) optim_type = train_config.optim
else: optim_conf = train_config.optim_conf
raise ValueError(f"Not support scheduler: {scheduler_type}") scheduler_type = train_config.scheduler
scheduler_conf = train_config.scheduler_conf
grad_clip = ClipGradByGlobalNormWithLog(train_config.global_grad_clip) return {
weight_decay = paddle.regularizer.L2Decay(optim_conf.weight_decay) "grad_clip": train_config.global_grad_clip,
if optim_type == 'adam': "weight_decay": optim_conf.weight_decay,
optimizer = paddle.optimizer.Adam( "learning_rate": lr_scheduler
learning_rate=lr_scheduler, if lr_scheduler else optim_conf.lr,
parameters=model.parameters(), "parameters": parameters,
weight_decay=weight_decay, "epsilon": 1e-9 if optim_type == 'noam' else None,
grad_clip=grad_clip) "beta1": 0.9 if optim_type == 'noam' else None,
else: "beat2": 0.98 if optim_type == 'noam' else None,
raise ValueError(f"Not support optim: {optim_type}") }
optimzer_args = optimizer_args(config, model.parameters(), lr_scheduler)
optimizer = OptimizerFactory.from_args(optim_type, optimzer_args)
self.model = model self.model = model
self.optimizer = optimizer self.optimizer = optimizer
@ -418,26 +436,30 @@ class U2STTester(U2STTrainer):
def __init__(self, config, args): def __init__(self, config, args):
super().__init__(config, args) super().__init__(config, args)
self.text_feature = TextFeaturizer(
unit_type=self.config.collator.unit_type,
vocab_filepath=self.config.collator.vocab_filepath,
spm_model_prefix=self.config.collator.spm_model_prefix)
self.vocab_list = self.text_feature.vocab_list
def ordid2token(self, texts, texts_len): def id2token(self, texts, texts_len, text_feature):
""" ord() id to chr() chr """ """ ord() id to chr() chr """
trans = [] trans = []
for text, n in zip(texts, texts_len): for text, n in zip(texts, texts_len):
n = n.numpy().item() n = n.numpy().item()
ids = text[:n] ids = text[:n]
trans.append(''.join([chr(i) for i in ids])) trans.append(text_feature.defeaturize(ids.numpy().tolist()))
return trans return trans
def translate(self, audio, audio_len): def translate(self, audio, audio_len):
""""E2E translation from extracted audio feature""" """"E2E translation from extracted audio feature"""
decode_cfg = self.config.decode decode_cfg = self.config.decode
text_feature = self.test_loader.collate_fn.text_feature
self.model.eval() self.model.eval()
hyps = self.model.decode( hyps = self.model.decode(
audio, audio,
audio_len, audio_len,
text_feature=text_feature, text_feature=self.text_feature,
decoding_method=decode_cfg.decoding_method, decoding_method=decode_cfg.decoding_method,
beam_size=decode_cfg.beam_size, beam_size=decode_cfg.beam_size,
word_reward=decode_cfg.word_reward, word_reward=decode_cfg.word_reward,
@ -458,23 +480,20 @@ class U2STTester(U2STTrainer):
len_refs, num_ins = 0, 0 len_refs, num_ins = 0, 0
start_time = time.time() start_time = time.time()
text_feature = self.test_loader.collate_fn.text_feature
refs = [ refs = self.id2token(texts, texts_len, self.text_feature)
"".join(chr(t) for t in text[:text_len])
for text, text_len in zip(texts, texts_len)
]
hyps = self.model.decode( hyps = self.model.decode(
audio, audio,
audio_len, audio_len,
text_feature=text_feature, text_feature=self.text_feature,
decoding_method=decode_cfg.decoding_method, decoding_method=decode_cfg.decoding_method,
beam_size=decode_cfg.beam_size, beam_size=decode_cfg.beam_size,
word_reward=decode_cfg.word_reward, word_reward=decode_cfg.word_reward,
decoding_chunk_size=decode_cfg.decoding_chunk_size, decoding_chunk_size=decode_cfg.decoding_chunk_size,
num_decoding_left_chunks=decode_cfg.num_decoding_left_chunks, num_decoding_left_chunks=decode_cfg.num_decoding_left_chunks,
simulate_streaming=decode_cfg.simulate_streaming) simulate_streaming=decode_cfg.simulate_streaming)
decode_time = time.time() - start_time decode_time = time.time() - start_time
for utt, target, result in zip(utts, refs, hyps): for utt, target, result in zip(utts, refs, hyps):
@@ -507,7 +526,7 @@ class U2STTester(U2STTrainer):
         decode_cfg = self.config.decode
         bleu_func = bleu_score.char_bleu if decode_cfg.error_rate_type == 'char-bleu' else bleu_score.bleu
-        stride_ms = self.test_loader.collate_fn.stride_ms
+        stride_ms = self.config.collator.stride_ms
         hyps, refs = [], []
         len_refs, num_ins = 0, 0
         num_frames = 0.0
@@ -524,7 +543,8 @@ class U2STTester(U2STTrainer):
                 len_refs += metrics['len_refs']
                 num_ins += metrics['num_ins']
                 rtf = num_time / (num_frames * stride_ms)
-                logger.info("RTF: %f, BELU (%d) = %f" % (rtf, num_ins, bleu))
+                logger.info("RTF: %f, instance (%d), batch BELU = %f" %
+                            (rtf, num_ins, bleu))

         rtf = num_time / (num_frames * stride_ms)
         msg = "Test: "
@@ -555,13 +575,6 @@ class U2STTester(U2STTrainer):
                 })
                 f.write(data + '\n')

-    @paddle.no_grad()
-    def align(self):
-        ctc_utils.ctc_align(self.config, self.model, self.align_loader,
-                            self.config.decode.decode_batch_size,
-                            self.config.stride_ms, self.vocab_list,
-                            self.args.result_file)
-
     def load_inferspec(self):
         """infer model and input spec.
@@ -569,11 +582,11 @@ class U2STTester(U2STTrainer):
             nn.Layer: inference model
             List[paddle.static.InputSpec]: input spec.
         """
-        from paddlespeech.s2t.models.u2 import U2InferModel
-        infer_model = U2InferModel.from_pretrained(self.test_loader,
+        from paddlespeech.s2t.models.u2_st import U2STInferModel
+        infer_model = U2STInferModel.from_pretrained(self.test_loader,
                                                    self.config.clone(),
                                                    self.args.checkpoint_path)
-        feat_dim = self.test_loader.collate_fn.feature_size
+        feat_dim = self.test_loader.feat_dim
         input_spec = [
             paddle.static.InputSpec(shape=[1, None, feat_dim],
                                     dtype='float32'),  # audio, [B,T,D]
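For context, `load_inferspec` pairs the pretrained model with `paddle.static.InputSpec` entries so the model can later be exported with `paddle.jit.to_static`. A hedged sketch of such a spec list (only the first entry appears in the hunk above; the second, for audio lengths, is an assumption):

```python
import paddle

feat_dim = 83  # fbank dimension used by the ST configs above
input_spec = [
    paddle.static.InputSpec(shape=[1, None, feat_dim], dtype='float32'),  # audio, [B, T, D]
    paddle.static.InputSpec(shape=[1], dtype='int64'),                    # audio length, [B] (assumed)
]
# static_model = paddle.jit.to_static(infer_model, input_spec=input_spec)
```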

@ -31,11 +31,17 @@ class CustomConverter():
""" """
def __init__(self, subsampling_factor=1, dtype=np.float32): def __init__(self,
subsampling_factor=1,
dtype=np.float32,
load_aux_input=False,
load_aux_output=False):
"""Construct a CustomConverter object.""" """Construct a CustomConverter object."""
self.subsampling_factor = subsampling_factor self.subsampling_factor = subsampling_factor
self.ignore_id = -1 self.ignore_id = -1
self.dtype = dtype self.dtype = dtype
self.load_aux_input = load_aux_input
self.load_aux_output = load_aux_output
def __call__(self, batch): def __call__(self, batch):
"""Transform a batch and send it to a device. """Transform a batch and send it to a device.
@ -49,34 +55,53 @@ class CustomConverter():
""" """
# batch should be located in list # batch should be located in list
assert len(batch) == 1 assert len(batch) == 1
(xs, ys), utts = batch[0] data, utts = batch[0]
assert xs[0] is not None, "please check Reader and Augmentation impl." xs_data, ys_data = [], []
for ud in data:
# perform subsampling if ud[0].ndim > 1:
if self.subsampling_factor > 1: # speech data (input): (speech_len, feat_dim)
xs = [x[::self.subsampling_factor, :] for x in xs] xs_data.append(ud)
else:
# get batch of lengths of input sequences # text data (output): (text_len, )
ilens = np.array([x.shape[0] for x in xs]) ys_data.append(ud)
# perform padding and convert to tensor assert xs_data[0][
# currently only support real number 0] is not None, "please check Reader and Augmentation impl."
if xs[0].dtype.kind == "c":
xs_pad_real = pad_list([x.real for x in xs], 0).astype(self.dtype) xs_pad, ilens = [], []
xs_pad_imag = pad_list([x.imag for x in xs], 0).astype(self.dtype) for xs in xs_data:
# Note(kamo): # perform subsampling
# {'real': ..., 'imag': ...} will be changed to ComplexTensor in E2E. if self.subsampling_factor > 1:
# Don't create ComplexTensor and give it E2E here xs = [x[::self.subsampling_factor, :] for x in xs]
# because torch.nn.DataParellel can't handle it.
xs_pad = {"real": xs_pad_real, "imag": xs_pad_imag} # get batch of lengths of input sequences
else: ilens.append(np.array([x.shape[0] for x in xs]))
xs_pad = pad_list(xs, 0).astype(self.dtype)
# perform padding and convert to tensor
# currently only support real number
xs_pad.append(pad_list(xs, 0).astype(self.dtype))
if not self.load_aux_input:
xs_pad, ilens = xs_pad[0], ilens[0]
break
# NOTE: this is for multi-output (e.g., speech translation) # NOTE: this is for multi-output (e.g., speech translation)
ys_pad = pad_list( ys_pad, olens = [], []
[np.array(y[0][:]) if isinstance(y, tuple) else y for y in ys],
self.ignore_id) for ys in ys_data:
ys_pad.append(
pad_list([
np.array(y[0][:]) if isinstance(y, tuple) else y for y in ys
], self.ignore_id))
olens.append(
np.array([
y[0].shape[0] if isinstance(y, tuple) else y.shape[0]
for y in ys
]))
if not self.load_aux_output:
ys_pad, olens = ys_pad[0], olens[0]
break
olens = np.array(
[y[0].shape[0] if isinstance(y, tuple) else y.shape[0] for y in ys])
return utts, xs_pad, ilens, ys_pad, olens return utts, xs_pad, ilens, ys_pad, olens

@@ -18,7 +18,9 @@ from typing import Text
 import jsonlines
 import numpy as np
+from paddle.io import BatchSampler
 from paddle.io import DataLoader
+from paddle.io import DistributedBatchSampler

 from paddlespeech.s2t.io.batchfy import make_batchset
 from paddlespeech.s2t.io.converter import CustomConverter
@@ -73,7 +75,10 @@ class BatchDataLoader():
                  preprocess_conf=None,
                  n_iter_processes: int=1,
                  subsampling_factor: int=1,
-                 num_encs: int=1):
+                 load_aux_input: bool=False,
+                 load_aux_output: bool=False,
+                 num_encs: int=1,
+                 dist_sampler: bool=False):
         self.json_file = json_file
         self.train_mode = train_mode
         self.use_sortagrad = sortagrad == -1 or sortagrad > 0
@@ -89,6 +94,9 @@ class BatchDataLoader():
         self.num_encs = num_encs
         self.preprocess_conf = preprocess_conf
         self.n_iter_processes = n_iter_processes
+        self.load_aux_input = load_aux_input
+        self.load_aux_output = load_aux_output
+        self.dist_sampler = dist_sampler

         # read json data
         with jsonlines.open(json_file, 'r') as reader:
@@ -126,21 +134,36 @@
         # Setup a converter
         if num_encs == 1:
             self.converter = CustomConverter(
-                subsampling_factor=subsampling_factor, dtype=np.float32)
+                subsampling_factor=subsampling_factor,
+                dtype=np.float32,
+                load_aux_input=load_aux_input,
+                load_aux_output=load_aux_output)
         else:
             assert NotImplementedError("not impl CustomConverterMulEnc.")

         # hack to make batchsize argument as 1
         # actual bathsize is included in a list
-        # default collate function converts numpy array to pytorch tensor
+        # default collate function converts numpy array to paddle tensor
         # we used an empty collate function instead which returns list
         self.dataset = TransformDataset(self.minibaches, self.converter,
                                         self.reader)
+
+        if self.dist_sampler:
+            self.sampler = DistributedBatchSampler(
+                dataset=self.dataset,
+                batch_size=1,
+                shuffle=not self.use_sortagrad if self.train_mode else False,
+                drop_last=False, )
+        else:
+            self.sampler = BatchSampler(
+                dataset=self.dataset,
+                batch_size=1,
+                shuffle=not self.use_sortagrad if self.train_mode else False,
+                drop_last=False, )
+
         self.dataloader = DataLoader(
             dataset=self.dataset,
-            batch_size=1,
-            shuffle=not self.use_sortagrad if self.train_mode else False,
+            batch_sampler=self.sampler,
             collate_fn=batch_collate,
             num_workers=self.n_iter_processes, )
@@ -168,5 +191,8 @@ class BatchDataLoader():
         echo += f"subsampling_factor: {self.subsampling_factor}, "
         echo += f"num_encs: {self.num_encs}, "
         echo += f"num_workers: {self.n_iter_processes}, "
+        echo += f"load_aux_input: {self.load_aux_input}, "
+        echo += f"load_aux_output: {self.load_aux_output}, "
+        echo += f"dist_sampler: {self.dist_sampler}, "
         echo += f"file: {self.json_file}"
         return echo
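The new `dist_sampler` flag boils down to choosing between the two paddle.io samplers shown above. A minimal sketch with a toy dataset (everything other than the paddle.io classes is illustrative):

```python
from paddle.io import BatchSampler, DataLoader, Dataset, DistributedBatchSampler

class ToyDataset(Dataset):
    def __len__(self):
        return 8

    def __getitem__(self, idx):
        return [float(idx)] * 4, idx

def make_sampler(dataset, dist_sampler: bool, train_mode: bool):
    """Pick the sampler the same way the dist_sampler flag does in BatchDataLoader."""
    cls = DistributedBatchSampler if dist_sampler else BatchSampler
    return cls(dataset=dataset, batch_size=1, shuffle=train_mode, drop_last=False)

dataset = ToyDataset()
loader = DataLoader(dataset,
                    batch_sampler=make_sampler(dataset, dist_sampler=False, train_mode=True))
```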

@@ -68,7 +68,7 @@ class LoadInputsAndTargets():
         if mode not in ["asr"]:
             raise ValueError("Only asr are allowed: mode={}".format(mode))

-        if preprocess_conf is not None:
+        if preprocess_conf:
             self.preprocessing = Transformation(preprocess_conf)
             logger.warning(
                 "[Experimental feature] Some preprocessing will be done "
@@ -82,12 +82,11 @@ class LoadInputsAndTargets():
         self.load_output = load_output
         self.load_input = load_input
         self.sort_in_input_length = sort_in_input_length
-        if preprocess_args is None:
-            self.preprocess_args = {}
-        else:
+        if preprocess_args:
             assert isinstance(preprocess_args, dict), type(preprocess_args)
             self.preprocess_args = dict(preprocess_args)
+        else:
+            self.preprocess_args = {}

         self.keep_all_data_on_mem = keep_all_data_on_mem

     def __call__(self, batch, return_uttid=False):
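The change from `is not None` checks to plain truthiness above means an empty `preprocess_conf`/`preprocess_args` (empty string or empty dict) now falls through to the disabled/default branch. A tiny illustration:

```python
for conf in (None, "", {}, "conf/preprocess.yaml"):
    enabled_old = conf is not None   # old check: "" and {} still counted as "configured"
    enabled_new = bool(conf)         # new check: only a non-empty value enables preprocessing
    print(repr(conf), enabled_old, enabled_new)
```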

@ -196,41 +196,50 @@ def evaluate(args):
output_dir = Path(args.output_dir) output_dir = Path(args.output_dir)
output_dir.mkdir(parents=True, exist_ok=True) output_dir.mkdir(parents=True, exist_ok=True)
merge_sentences = False
for utt_id, sentence in sentences: for utt_id, sentence in sentences:
get_tone_ids = False get_tone_ids = False
if am_name == 'speedyspeech': if am_name == 'speedyspeech':
get_tone_ids = True get_tone_ids = True
if args.lang == 'zh': if args.lang == 'zh':
input_ids = frontend.get_input_ids( input_ids = frontend.get_input_ids(
sentence, merge_sentences=True, get_tone_ids=get_tone_ids) sentence,
merge_sentences=merge_sentences,
get_tone_ids=get_tone_ids)
phone_ids = input_ids["phone_ids"] phone_ids = input_ids["phone_ids"]
phone_ids = phone_ids[0]
if get_tone_ids: if get_tone_ids:
tone_ids = input_ids["tone_ids"] tone_ids = input_ids["tone_ids"]
tone_ids = tone_ids[0]
elif args.lang == 'en': elif args.lang == 'en':
input_ids = frontend.get_input_ids(sentence) input_ids = frontend.get_input_ids(
sentence, merge_sentences=merge_sentences)
phone_ids = input_ids["phone_ids"] phone_ids = input_ids["phone_ids"]
else: else:
print("lang should in {'zh', 'en'}!") print("lang should in {'zh', 'en'}!")
with paddle.no_grad(): with paddle.no_grad():
# acoustic model flags = 0
if am_name == 'fastspeech2': for i in range(len(phone_ids)):
# multi speaker part_phone_ids = phone_ids[i]
if am_dataset in {"aishell3", "vctk"}: # acoustic model
spk_id = paddle.to_tensor(args.spk_id) if am_name == 'fastspeech2':
mel = am_inference(phone_ids, spk_id) # multi speaker
if am_dataset in {"aishell3", "vctk"}:
spk_id = paddle.to_tensor(args.spk_id)
mel = am_inference(part_phone_ids, spk_id)
else:
mel = am_inference(part_phone_ids)
elif am_name == 'speedyspeech':
part_tone_ids = tone_ids[i]
mel = am_inference(part_phone_ids, part_tone_ids)
# vocoder
wav = voc_inference(mel)
if flags == 0:
wav_all = wav
flags = 1
else: else:
mel = am_inference(phone_ids) wav_all = paddle.concat([wav_all, wav])
elif am_name == 'speedyspeech':
mel = am_inference(phone_ids, tone_ids)
# vocoder
wav = voc_inference(mel)
sf.write( sf.write(
str(output_dir / (utt_id + ".wav")), str(output_dir / (utt_id + ".wav")),
wav.numpy(), wav_all.numpy(),
samplerate=am_config.fs) samplerate=am_config.fs)
print(f"{utt_id} done!") print(f"{utt_id} done!")

@@ -13,7 +13,9 @@
 # limitations under the License.
 from abc import ABC
 from abc import abstractmethod
+from typing import List
+import numpy as np
 import paddle
 from g2p_en import G2p
 from g2pM import G2pM
@@ -21,6 +23,7 @@ from g2pM import G2pM
 from paddlespeech.t2s.frontend.normalizer.normalizer import normalize
 from paddlespeech.t2s.frontend.punctuation import get_punctuations
 from paddlespeech.t2s.frontend.vocab import Vocab
+from paddlespeech.t2s.frontend.zh_normalization.text_normlization import TextNormalizer

 # discard opencc untill we find an easy solution to install it on windows
 # from opencc import OpenCC
@@ -53,6 +56,7 @@ class English(Phonetics):
         self.vocab = Vocab(self.phonemes + self.punctuations)
         self.vocab_phones = {}
         self.punc = ":,;。?!“”‘’':,;.?!"
+        self.text_normalizer = TextNormalizer()
         if phone_vocab_path:
             with open(phone_vocab_path, 'rt') as f:
                 phn_id = [line.strip().split() for line in f.readlines()]
@@ -78,19 +82,42 @@ class English(Phonetics):
         phonemes = [item for item in phonemes if item in self.vocab.stoi]
         return phonemes

-    def get_input_ids(self, sentence: str) -> paddle.Tensor:
-        result = {}
-        phones = self.phoneticize(sentence)
-        # remove start_symbol and end_symbol
-        phones = phones[1:-1]
-        phones = [phn for phn in phones if not phn.isspace()]
-        phones = [
+    def _p2id(self, phonemes: List[str]) -> np.array:
+        # replace unk phone with sp
+        phonemes = [
             phn if (phn in self.vocab_phones and phn not in self.punc) else "sp"
-            for phn in phones
+            for phn in phonemes
         ]
-        phone_ids = [self.vocab_phones[phn] for phn in phones]
-        phone_ids = paddle.to_tensor(phone_ids)
-        result["phone_ids"] = phone_ids
+        phone_ids = [self.vocab_phones[item] for item in phonemes]
+        return np.array(phone_ids, np.int64)
+
+    def get_input_ids(self, sentence: str,
+                      merge_sentences: bool=False) -> paddle.Tensor:
+        result = {}
+        sentences = self.text_normalizer._split(sentence, lang="en")
+        phones_list = []
+        temp_phone_ids = []
+        for sentence in sentences:
+            phones = self.phoneticize(sentence)
+            # remove start_symbol and end_symbol
+            phones = phones[1:-1]
+            phones = [phn for phn in phones if not phn.isspace()]
+            phones_list.append(phones)
+
+        if merge_sentences:
+            merge_list = sum(phones_list, [])
+            # rm the last 'sp' to avoid the noise at the end
+            # cause in the training data, no 'sp' in the end
+            if merge_list[-1] == 'sp':
+                merge_list = merge_list[:-1]
+            phones_list = []
+            phones_list.append(merge_list)
+
+        for part_phones_list in phones_list:
+            phone_ids = self._p2id(part_phones_list)
+            phone_ids = paddle.to_tensor(phone_ids)
+            temp_phone_ids.append(phone_ids)
+        result["phone_ids"] = temp_phone_ids
         return result

     def numericalize(self, phonemes):

@@ -53,7 +53,7 @@ class TextNormalizer():
     def __init__(self):
         self.SENTENCE_SPLITOR = re.compile(r'([:,;。?!,;?!][”’]?)')

-    def _split(self, text: str) -> List[str]:
+    def _split(self, text: str, lang="zh") -> List[str]:
         """Split long text into sentences with sentence-splitting punctuations.
         Parameters
         ----------
@@ -65,7 +65,8 @@ class TextNormalizer():
             Sentences.
         """
         # Only for pure Chinese here
-        text = text.replace(" ", "")
+        if lang == "zh":
+            text = text.replace(" ", "")
         text = self.SENTENCE_SPLITOR.sub(r'\1\n', text)
         text = text.strip()
         sentences = [sentence.strip() for sentence in re.split(r'\n+', text)]
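The splitter keeps the punctuation that triggered the split and, with the new `lang` argument, only strips spaces for Chinese. A self-contained sketch of the same logic (standalone copy of the regex shown above):

```python
import re

SENTENCE_SPLITOR = re.compile(r'([:,;。?!,;?!][”’]?)')

def split(text: str, lang: str = "zh"):
    """Mirror of TextNormalizer._split: split on sentence-ending punctuation, keep it."""
    if lang == "zh":
        text = text.replace(" ", "")
    text = SENTENCE_SPLITOR.sub(r'\1\n', text).strip()
    return [s.strip() for s in re.split(r'\n+', text) if s.strip()]

print(split("Hello there! How are you today?", lang="en"))
# ['Hello there!', 'How are you today?']
```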

@@ -940,7 +940,6 @@ class StyleFastSpeech2Inference(FastSpeech2Inference):
         Tensor
             Output sequence of features (L, odim).
         """
-        spk_id = paddle.to_tensor(spk_id)
         normalized_mel, d_outs, p_outs, e_outs = self.acoustic_model.inference(
             text,
             durations=None,
