Cherry-pick to r1.4 branch (#3798)

* [TTS] add DiffSinger with the Opencpop dataset (#3005)
* Update requirements.txt
* Fix VITS reduce_sum's input/output dtype, test=tts (#3028)
* [TTS] add Opencpop PWGAN example (#3031): add Opencpop voc, test=tts; soft link; update textnorm_test_cases.txt
* [TTS] add Opencpop HiFiGAN example (#3038): add Opencpop voc and HiFiGAN, test=tts; soft link
* Fix dtype diff of the last expand_v2 op of VITS (#3041)
* [ASR] add Squeezeformer model (#2755): fix subsample rate error; merge classes as required; fix missing code; split code into a new file; remove rel_shift; code-style fixes, test=asr; update README.md / README_cn.md
* Fix input dtype of elementwise_mul op from bool to int64 (#3054)
* [TTS] add SVS frontend (#3062)
* [TTS] clean StarGANv2 VC model code and add docstrings (#2987)
* [Doc] change "define asr server config" to "chunk asr config", test=doc (#3067): update README.md / README_cn.md
* Get music score, test=doc (#3070)
* [TTS] fix elementwise_floordiv's fill_constant (#3075): add a float converter for min_value in attention
* Fix paddle2onnx's install version; install the newest paddle2onnx in run.sh (#3084)
* [TTS] update svs_music_score.md (#3085)
* Remove unused dependency, test=tts (#3097)
* Update bug-report-tts.md (#3120)
* [TTS] fix VITS lite inference (#3098)
* [TTS] add StarGANv2 VC trainer (#3143): fix StarGANv2VCUpdater and losses; fix StarGANv2VCEvaluator; add some type hints
* [TTS] [Hackathon No.190] model reproduction: iSTFTNet (#3006): an iSTFTNet implementation based on HiFiGAN that does not affect the function and execution of HiFiGAN; add comments in hifigan.py and iSTFT.yaml; add and format iSTFTNet.md; delete the unused self.istft_layer_id, move self.output_conv behind the else branch, rename conv_post to output_conv; update the iSTFTNet_csmsc_ckpt.zip download link
* Add a function for generating SRT files (#3123): on top of the original websocket_client.py, add the ability to generate SRT subtitle files from wav- or mp3-format audio files; restore the original websocket_client.py; document the subtitle generation function in the README
* Fix examples/aishell local/train.sh if-condition bug, test=asr (#3146)
* Fix some preprocessing bugs (#3155): add AMP for the U2 conformer; fix scaler save and load; move scaler.unscale_ below grad clip
* [TTS] add StarGANv2 VC preprocessing (#3163)
* [TTS] [Hackathon] add JETS (#3109)
* Update quick_start.md (#3175)
* [BUG] fix progress bar unit (#3177)
* Update quick_start_cn.md (#3176)
* [TTS] StarGANv2 VC: fix some trainer bugs, add reset_parameters (#3182)
* VITS learning rate revised, test=tts
* [s2t] move dataset into paddlespeech.dataset (#3183): add aidatatang; fix imports; fix some typos (#3178)
* [s2t] move s2t data preprocessing into paddlespeech.dataset (#3189): move avg model, compute wer, and format rsl into paddlespeech.dataset; fix format rsl; fix avg ckpts
* Update pretrained model in README (#3193)
* [TTS] fix losses of StarGANv2 VC (#3184)
* Add a new AISHELL model for better CER; add README
* [s2t] fix CLI args to config (#3194): fix train CLI; update README.md
* [ASR] support HuBERT, fine-tuned on the LibriSpeech dataset (#3088): HuBERT decode; copyright, notes, and example updates; HuBERT CLI support; docs and train config; librispeech.py
* [ASR] fix ASR 0-d tensor (#3214)
* fix: 🐛 fix the server-side Python ASREngine being unable to use the conformer_talcs model (#3230): pass the code-switch flag; docs: 📝 update docs; change the model detection logic
* Add WavLM implementation: fix model m5s; code cleanup per review comments in https://github.com/PaddlePaddle/PaddleSpeech/pull/3242; fix error in tts/st; change the path of the uploaded weight
* Update phonecode.py: fix the landline-number regex, following https://github.com/speechio/chinese_text_normalization/blob/master/python/cn_tn.py; the fixed pattern is: pattern = re.compile(r"\D((0(10|2[1-3]|[3-9]\d{2})-?)?[1-9]\d{6,7})\D")
* Adapt the WavLM ASR model to pretrained weights and CLI; change the MD5 of the pretrained tar file due to bug fixes; delete examples/librispeech/asr5/format_rsl.py; update released_model.md; code cleanup for CIs; fix the transpose usages ignored before; update setup.py
* Refactor MFA scripts
* Final cleaning: modify SSL/infer.py and README for WavLM inclusion in model options; update readme and readme_cn
* Remove the Tsinghua PyPI mirror
* Update setup.py (#3294): refactor rhy; fix ckpt
* Add dtype param for the arange API (#3302)
* Add scripts for TTS code switching: add t2s assets; more comments on the TTS frontend; pin librosa==0.8.1 and numpy==1.23.5 so paddleaudio aligns with this version; move SSL into t2s.frontend; fix spk_id for 0-D tensor; add SSML, English-frontend, and mixed-frontend unit tests; fix long-text OOM using SSML; filter commas; update polyphonic handling; remove prints; hotfix English G2P
* Fix profiler (#3323)
* Fix the 0-d tensor problem in the old grad clip (#3334)
* Update to Python 3.8; remove fluid
* Add RoFormer: add RoFormer result; support position interpolation for longer attention context window lengths; RoPE with position interpolation; RoPE for streaming decoding; fix rotary embedding; fix weight decay; fix develop view conflict with the model's
* Add XPU support for SpeedySpeech (#3502): fix typos; update the description of nxpu
* Add XPU support for FastSpeech2 (#3514): optimize
* Update ge2e_clone.py (#3517): fix a multiple-space error on Windows
* Fix README (#3527): update README.md / README_cn.md
* FIX: add missing imports; fix the implementation of a special method
* [benchmark] add max_mem_reserved for benchmark (#3604): fix profiler
* Fix develop bug: change function view to reshape (#3633)
* [benchmark] fix gpu_mem unit (#3634)
* Add file-encoding detection when reading files (#3606); fixes #3605
* Bugfix: audio_len should be 1-D, not 0-D, which raised a list-index-out-of-range error in the following decode process (#3490)
* Update README.md (#3532): fix a typo
* Pin the paddlepaddle version (#3701); fix code style
* [Fix Speech Issue No.5] issue 3444: fix paddlespeech.s2t.transform.transformation import error (#3779)
* [Fix Speech Issue No.8] issue 3652: fix a bug in the merge_yi function (#3786)
* [test] add CLI test README (#3784): fix code style
* [test] fix test CLI bug (#3793)
* Update setup.py (#3795)
* Adapt to the view behavior change, fix KeyError (#3794)
* Fix README demo run error
* Pin the opencc version

---------

Co-authored-by: liangym <34430015+lym0302@users.noreply.github.com>
Co-authored-by: TianYuan <white-sky@qq.com>
Co-authored-by: 夜雨飘零 <yeyupiaoling@foxmail.com>
Co-authored-by: zxcd <228587199@qq.com>
Co-authored-by: longRookie <68834517+longRookie@users.noreply.github.com>
Co-authored-by: twoDogy <128727742+twoDogy@users.noreply.github.com>
Co-authored-by: lemondy <lemondy9@gmail.com>
Co-authored-by: ljhzxc <33015549+ljhzxc@users.noreply.github.com>
Co-authored-by: PiaoYang <495384481@qq.com>
Co-authored-by: WongLaw <mailoflawrence@gmail.com>
Co-authored-by: Hui Zhang <zhtclz@foxmail.com>
Co-authored-by: Shuangchi He <34329208+Yulv-git@users.noreply.github.com>
Co-authored-by: TianHao Zhang <32243340+Zth9730@users.noreply.github.com>
Co-authored-by: guanyc <guanyc@gmail.com>
Co-authored-by: jiamingkong <kinetical@live.com>
Co-authored-by: zoooo0820 <zoooo0820@qq.com>
Co-authored-by: shuishu <990941859@qq.com>
Co-authored-by: LixinGuo <18510030324@126.com>
Co-authored-by: gmm <38800877+mmglove@users.noreply.github.com>
Co-authored-by: Wang Huan <wanghuan29@baidu.com>
Co-authored-by: Kai Song <50285351+USTCKAY@users.noreply.github.com>
Co-authored-by: skyboooox <zcj924@gmail.com>
Co-authored-by: fazledyn-or <ataf@openrefactory.com>
Co-authored-by: luyao-cv <1367355728@qq.com>
Co-authored-by: Color_yr <402067010@qq.com>
Co-authored-by: JeffLu <luzhenhui@gmail.com>
Co-authored-by: Luzhenhui <luzhenhui@mqsz.com>
Co-authored-by: satani99 <42287151+satani99@users.noreply.github.com>
Co-authored-by: mjxs <52824616+kk-2000@users.noreply.github.com>
Co-authored-by: Mattheliu <leonliuzx@outlook.com>

Tags: r1.4, r1.4.2
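The phonecode.py fix in the commit message above replaces the landline-number regex. The corrected pattern (quoted verbatim from the commit) can be exercised directly:

```python
import re

# Fixed landline pattern from the phonecode.py change: an optional area code
# (010 / 02X / 0XXX), an optional hyphen, then a 7-8 digit subscriber number,
# bounded by non-digit characters on both sides.
pattern = re.compile(r"\D((0(10|2[1-3]|[3-9]\d{2})-?)?[1-9]\d{6,7})\D")

m = pattern.search("tel:010-12345678;")
print(m.group(1))  # -> 010-12345678
```

Note the leading/trailing `\D` means the number must be surrounded by non-digits, so it will not fire in the middle of a longer digit run.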
parent 9d61b8c5ac
commit 7b780369f6
@ -1,3 +0,0 @@
# [Aishell1](http://openslr.elda.org/33/)

This Open Source Mandarin Speech Corpus, AISHELL-ASR0009-OS1, is 178 hours long. It is a part of AISHELL-ASR0009, whose utterances cover 11 domains, including smart home, autonomous driving, and industrial production. All recordings were made in a quiet indoor environment using 3 different devices simultaneously: a high-fidelity microphone (44.1 kHz, 16-bit), an Android mobile phone (16 kHz, 16-bit), and an iOS mobile phone (16 kHz, 16-bit). The high-fidelity audio was re-sampled to 16 kHz to build AISHELL-ASR0009-OS1. 400 speakers from different accent areas in China were invited to participate in the recording. Through professional speech annotation and strict quality inspection, the manual transcription accuracy is above 95%. The corpus is divided into training, development, and test sets. (This database is free for academic research; it may not be used commercially without permission.)
@ -0,0 +1,163 @@
# This is the parameter configuration file for PaddleSpeech Offline Serving.

#################################################################################
#                             SERVER SETTING                                    #
#################################################################################
host: 0.0.0.0
port: 8090

# The task format in the engine_list is: <speech task>_<engine type>
# task choices = ['asr_python', 'asr_inference', 'tts_python', 'tts_inference', 'cls_python', 'cls_inference', 'text_python', 'vector_python']
protocol: 'http'
engine_list: ['asr_python', 'tts_python', 'cls_python', 'text_python', 'vector_python']


#################################################################################
#                               ENGINE CONFIG                                   #
#################################################################################

################################### ASR #########################################
################### speech task: asr; engine_type: python #######################
asr_python:
    model: 'conformer_talcs'
    lang: 'zh_en'
    sample_rate: 16000
    cfg_path:  # [optional]
    ckpt_path:  # [optional]
    decode_method: 'attention_rescoring'
    force_yes: True
    codeswitch: True
    device:  # set 'gpu:id' or 'cpu'

################### speech task: asr; engine_type: inference #######################
asr_inference:
    # model_type choices=['deepspeech2offline_aishell']
    model_type: 'deepspeech2offline_aishell'
    am_model:   # the pdmodel file of the am static model [optional]
    am_params:  # the pdiparams file of the am static model [optional]
    lang: 'zh'
    sample_rate: 16000
    cfg_path:
    decode_method:
    force_yes: True

    am_predictor_conf:
        device:  # set 'gpu:id' or 'cpu'
        switch_ir_optim: True
        glog_info: False  # True -> print glog
        summary: True     # False -> do not show predictor config


################################### TTS #########################################
################### speech task: tts; engine_type: python #######################
tts_python:
    # am (acoustic model) choices=['speedyspeech_csmsc', 'fastspeech2_csmsc',
    #                              'fastspeech2_ljspeech', 'fastspeech2_aishell3',
    #                              'fastspeech2_vctk', 'fastspeech2_mix',
    #                              'tacotron2_csmsc', 'tacotron2_ljspeech']
    am: 'fastspeech2_csmsc'
    am_config:
    am_ckpt:
    am_stat:
    phones_dict:
    tones_dict:
    speaker_dict:

    # voc (vocoder) choices=['pwgan_csmsc', 'pwgan_ljspeech', 'pwgan_aishell3',
    #                        'pwgan_vctk', 'mb_melgan_csmsc', 'style_melgan_csmsc',
    #                        'hifigan_csmsc', 'hifigan_ljspeech', 'hifigan_aishell3',
    #                        'hifigan_vctk', 'wavernn_csmsc']
    voc: 'mb_melgan_csmsc'
    voc_config:
    voc_ckpt:
    voc_stat:

    # others
    lang: 'zh'
    device:  # set 'gpu:id' or 'cpu'

################### speech task: tts; engine_type: inference #######################
tts_inference:
    # am (acoustic model) choices=['speedyspeech_csmsc', 'fastspeech2_csmsc']
    am: 'fastspeech2_csmsc'
    am_model:   # the pdmodel file of your am static model (XX.pdmodel)
    am_params:  # the pdiparams file of your am static model (XX.pdiparams)
    am_sample_rate: 24000
    phones_dict:
    tones_dict:
    speaker_dict:

    am_predictor_conf:
        device:  # set 'gpu:id' or 'cpu'
        switch_ir_optim: True
        glog_info: False  # True -> print glog
        summary: True     # False -> do not show predictor config

    # voc (vocoder) choices=['pwgan_csmsc', 'mb_melgan_csmsc', 'hifigan_csmsc']
    voc: 'mb_melgan_csmsc'
    voc_model:   # the pdmodel file of your vocoder static model (XX.pdmodel)
    voc_params:  # the pdiparams file of your vocoder static model (XX.pdiparams)
    voc_sample_rate: 24000

    voc_predictor_conf:
        device:  # set 'gpu:id' or 'cpu'
        switch_ir_optim: True
        glog_info: False  # True -> print glog
        summary: True     # False -> do not show predictor config

    # others
    lang: 'zh'


################################### CLS #########################################
################### speech task: cls; engine_type: python #######################
cls_python:
    # model choices=['panns_cnn14', 'panns_cnn10', 'panns_cnn6']
    model: 'panns_cnn14'
    cfg_path:    # [optional] Config of cls task.
    ckpt_path:   # [optional] Checkpoint file of model.
    label_file:  # [optional] Label file of cls task.
    device:      # set 'gpu:id' or 'cpu'

################### speech task: cls; engine_type: inference #######################
cls_inference:
    # model_type choices=['panns_cnn14', 'panns_cnn10', 'panns_cnn6']
    model_type: 'panns_cnn14'
    cfg_path:
    model_path:   # the pdmodel file of the cls static model [optional]
    params_path:  # the pdiparams file of the cls static model [optional]
    label_file:   # [optional] Label file of cls task.

    predictor_conf:
        device:  # set 'gpu:id' or 'cpu'
        switch_ir_optim: True
        glog_info: False  # True -> print glog
        summary: True     # False -> do not show predictor config


################################### Text #########################################
################### text task: punc; engine_type: python #######################
text_python:
    task: punc
    model_type: 'ernie_linear_p3_wudao'
    lang: 'zh'
    sample_rate: 16000
    cfg_path:    # [optional]
    ckpt_path:   # [optional]
    vocab_file:  # [optional]
    device:      # set 'gpu:id' or 'cpu'


################################### Vector ######################################
################### Vector task: spk; engine_type: python #######################
vector_python:
    task: spk
    model_type: 'ecapatdnn_voxceleb12'
    sample_rate: 16000
    cfg_path:   # [optional]
    ckpt_path:  # [optional]
    device:     # set 'gpu:id' or 'cpu'
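The comment at the top of this config defines the `<speech task>_<engine type>` naming convention for `engine_list`. A small hypothetical validator (not part of PaddleSpeech; names are illustrative) that checks entries against that convention:

```python
# Assumed helper, not part of PaddleSpeech: checks that every engine_list
# entry follows the "<speech task>_<engine type>" format documented above.
VALID_TASKS = {"asr", "tts", "cls", "text", "vector"}
VALID_ENGINE_TYPES = {"python", "inference"}

def validate_engine_list(engine_list):
    """Return the entries that do not follow <task>_<engine type>."""
    bad = []
    for entry in engine_list:
        task, sep, engine = entry.rpartition("_")
        if not sep or task not in VALID_TASKS or engine not in VALID_ENGINE_TYPES:
            bad.append(entry)
    return bad

print(validate_engine_list(
    ["asr_python", "tts_python", "cls_python", "text_python", "vector_python"]))  # -> []
```

Using `rpartition` splits on the last underscore, so the engine type is always the final component even if a task name ever contained an underscore.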
File diff suppressed because one or more lines are too long
File diff suppressed because one or more lines are too long
[image added: 294 KiB]
@ -0,0 +1,98 @@
############################################
#          Network Architecture           #
############################################
cmvn_file:
cmvn_file_type: "json"
# encoder related
encoder: conformer
encoder_conf:
    output_size: 256                # dimension of attention
    attention_heads: 4
    linear_units: 2048              # the number of units of position-wise feed forward
    num_blocks: 12                  # the number of encoder blocks
    dropout_rate: 0.1               # sublayer output dropout
    positional_dropout_rate: 0.1
    attention_dropout_rate: 0.0
    input_layer: conv2d             # encoder input type, you can choose conv2d, conv2d6 and conv2d8
    normalize_before: True
    cnn_module_kernel: 15
    use_cnn_module: True
    activation_type: 'swish'
    pos_enc_layer_type: 'rope_pos'  # abs_pos, rel_pos, rope_pos
    selfattention_layer_type: 'rel_selfattn'  # unused
    causal: true
    use_dynamic_chunk: true
    cnn_module_norm: 'layer_norm'   # using nn.LayerNorm makes the model converge faster
    use_dynamic_left_chunk: false
# decoder related
decoder: transformer                # transformer, bitransformer
decoder_conf:
    attention_heads: 4
    linear_units: 2048
    num_blocks: 6
    r_num_blocks: 0                 # only for bitransformer
    dropout_rate: 0.1               # sublayer output dropout
    positional_dropout_rate: 0.1
    self_attention_dropout_rate: 0.0
    src_attention_dropout_rate: 0.0
# hybrid CTC/attention
model_conf:
    ctc_weight: 0.3
    lsm_weight: 0.1                 # label smoothing option
    reverse_weight: 0.0             # only for bitransformer
    length_normalized_loss: false
    init_type: 'kaiming_uniform'    # !Warning: needed for convergence

###########################################
#                  Data                   #
###########################################
train_manifest: data/manifest.train
dev_manifest: data/manifest.dev
test_manifest: data/manifest.test

###########################################
#               Dataloader                #
###########################################
vocab_filepath: data/lang_char/vocab.txt
spm_model_prefix: ''
unit_type: 'char'
preprocess_config: conf/preprocess.yaml
feat_dim: 80
stride_ms: 10.0
window_ms: 25.0
sortagrad: 0     # Feed samples from shortest to longest; -1: enabled for all epochs, 0: disabled, other: enabled for 'other' epochs
batch_size: 32
maxlen_in: 512   # if input length > maxlen_in, the batch size is automatically reduced
maxlen_out: 150  # if output length > maxlen_out, the batch size is automatically reduced
minibatches: 0   # for debug
batch_count: auto
batch_bins: 0
batch_frames_in: 0
batch_frames_out: 0
batch_frames_inout: 0
num_workers: 2
subsampling_factor: 1
num_encs: 1

###########################################
#                Training                 #
###########################################
n_epoch: 240
accum_grad: 1
global_grad_clip: 5.0
dist_sampler: True
optim: adam
optim_conf:
    lr: 0.001
    weight_decay: 1.0e-6
scheduler: warmuplr
scheduler_conf:
    warmup_steps: 25000
    lr_decay: 1.0
log_interval: 100
checkpoint:
    kbest_n: 50
    latest_n: 5
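The training section selects `scheduler: warmuplr` with `warmup_steps: 25000`. Assuming the ESPnet/WeNet-style Noam warmup schedule (a reasonable reading of the name; the exact implementation lives in paddlespeech.s2t), the learning rate ramps up to `lr` at `warmup_steps` and then decays as `step**-0.5`:

```python
# Sketch of a Noam-style warmup LR schedule (assumption: this matches the
# `warmuplr` scheduler named in the config; values from the config above).
def warmup_lr(step, lr=0.001, warmup_steps=25000):
    step = max(step, 1)  # avoid 0**-0.5 at the first step
    return lr * warmup_steps ** 0.5 * min(step ** -0.5,
                                          step * warmup_steps ** -1.5)

print(round(warmup_lr(25000), 6))  # -> 0.001, the peak, reached at warmup_steps
```

Before `warmup_steps` the linear term dominates; afterwards the `step**-0.5` term takes over, so the peak rate equals the configured `lr` exactly at step 25000.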
@ -0,0 +1,98 @@
############################################
#          Network Architecture           #
############################################
cmvn_file:
cmvn_file_type: "json"
# encoder related
encoder: conformer
encoder_conf:
    output_size: 256                # dimension of attention
    attention_heads: 4
    linear_units: 2048              # the number of units of position-wise feed forward
    num_blocks: 12                  # the number of encoder blocks
    dropout_rate: 0.1               # sublayer output dropout
    positional_dropout_rate: 0.1
    attention_dropout_rate: 0.0
    input_layer: conv2d             # encoder input type, you can choose conv2d, conv2d6 and conv2d8
    normalize_before: True
    cnn_module_kernel: 15
    use_cnn_module: True
    activation_type: 'swish'
    pos_enc_layer_type: 'rope_pos'  # abs_pos, rel_pos, rope_pos
    selfattention_layer_type: 'rel_selfattn'  # unused
    causal: true
    use_dynamic_chunk: true
    cnn_module_norm: 'layer_norm'   # using nn.LayerNorm makes the model converge faster
    use_dynamic_left_chunk: false
# decoder related
decoder: bitransformer              # transformer, bitransformer
decoder_conf:
    attention_heads: 4
    linear_units: 2048
    num_blocks: 3
    r_num_blocks: 3                 # only for bitransformer
    dropout_rate: 0.1               # sublayer output dropout
    positional_dropout_rate: 0.1
    self_attention_dropout_rate: 0.0
    src_attention_dropout_rate: 0.0
# hybrid CTC/attention
model_conf:
    ctc_weight: 0.3
    lsm_weight: 0.1                 # label smoothing option
    reverse_weight: 0.3             # only for bitransformer
    length_normalized_loss: false
    init_type: 'kaiming_uniform'    # !Warning: needed for convergence

###########################################
#                  Data                   #
###########################################
train_manifest: data/manifest.train
dev_manifest: data/manifest.dev
test_manifest: data/manifest.test

###########################################
#               Dataloader                #
###########################################
vocab_filepath: data/lang_char/vocab.txt
spm_model_prefix: ''
unit_type: 'char'
preprocess_config: conf/preprocess.yaml
feat_dim: 80
stride_ms: 10.0
window_ms: 25.0
sortagrad: 0     # Feed samples from shortest to longest; -1: enabled for all epochs, 0: disabled, other: enabled for 'other' epochs
batch_size: 32
maxlen_in: 512   # if input length > maxlen_in, the batch size is automatically reduced
maxlen_out: 150  # if output length > maxlen_out, the batch size is automatically reduced
minibatches: 0   # for debug
batch_count: auto
batch_bins: 0
batch_frames_in: 0
batch_frames_out: 0
batch_frames_inout: 0
num_workers: 2
subsampling_factor: 1
num_encs: 1

###########################################
#                Training                 #
###########################################
n_epoch: 240
accum_grad: 1
global_grad_clip: 5.0
dist_sampler: True
optim: adam
optim_conf:
    lr: 0.001
    weight_decay: 1.0e-6
scheduler: warmuplr
scheduler_conf:
    warmup_steps: 25000
    lr_decay: 1.0
log_interval: 100
checkpoint:
    kbest_n: 50
    latest_n: 5
@ -0,0 +1,98 @@
############################################
#          Network Architecture           #
############################################
cmvn_file:
cmvn_file_type: "json"
# encoder related
encoder: squeezeformer
encoder_conf:
    encoder_dim: 256    # dimension of attention
    output_size: 256    # dimension of output
    attention_heads: 4
    num_blocks: 12      # the number of encoder blocks
    reduce_idx: 5
    recover_idx: 11
    feed_forward_expansion_factor: 8
    input_dropout_rate: 0.1
    feed_forward_dropout_rate: 0.1
    attention_dropout_rate: 0.1
    adaptive_scale: true
    cnn_module_kernel: 31
    normalize_before: false
    activation_type: 'swish'
    pos_enc_layer_type: 'rel_pos'
    time_reduction_layer_type: 'stream'
    causal: true
    use_dynamic_chunk: true
    use_dynamic_left_chunk: false

# decoder related
decoder: transformer
decoder_conf:
    attention_heads: 4
    linear_units: 2048
    num_blocks: 6
    dropout_rate: 0.1   # sublayer output dropout
    positional_dropout_rate: 0.1
    self_attention_dropout_rate: 0.0
    src_attention_dropout_rate: 0.0
# hybrid CTC/attention
model_conf:
    ctc_weight: 0.3
    lsm_weight: 0.1     # label smoothing option
    length_normalized_loss: false
    init_type: 'kaiming_uniform'  # !Warning: needed for convergence

###########################################
#                  Data                   #
###########################################
train_manifest: data/manifest.train
dev_manifest: data/manifest.dev
test_manifest: data/manifest.test

###########################################
#               Dataloader                #
###########################################
vocab_filepath: data/lang_char/vocab.txt
spm_model_prefix: ''
unit_type: 'char'
preprocess_config: conf/preprocess.yaml
feat_dim: 80
stride_ms: 10.0
window_ms: 25.0
sortagrad: 0     # Feed samples from shortest to longest; -1: enabled for all epochs, 0: disabled, other: enabled for 'other' epochs
batch_size: 32
maxlen_in: 512   # if input length > maxlen_in, the batch size is automatically reduced
maxlen_out: 150  # if output length > maxlen_out, the batch size is automatically reduced
minibatches: 0   # for debug
batch_count: auto
batch_bins: 0
batch_frames_in: 0
batch_frames_out: 0
batch_frames_inout: 0
num_workers: 2
subsampling_factor: 1
num_encs: 1

###########################################
#                Training                 #
###########################################
n_epoch: 240
accum_grad: 1
global_grad_clip: 5.0
dist_sampler: True
optim: adam
optim_conf:
    lr: 0.001
    weight_decay: 1.0e-6
scheduler: warmuplr
scheduler_conf:
    warmup_steps: 25000
    lr_decay: 1.0
log_interval: 100
checkpoint:
    kbest_n: 50
    latest_n: 5
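The `reduce_idx`/`recover_idx` pair in the Squeezeformer config above implements the model's temporal U-Net: the frame rate is halved at block `reduce_idx` and restored at block `recover_idx`. A minimal sketch (assuming simple halving and full recovery, ignoring padding and subsampling details) of the sequence length each encoder block sees:

```python
# Sketch only: illustrates Squeezeformer's time reduce/recover pattern with
# the indices from the config (num_blocks=12, reduce_idx=5, recover_idx=11).
def frames_per_block(t, num_blocks=12, reduce_idx=5, recover_idx=11):
    lengths = []
    cur = t
    for i in range(num_blocks):
        if i == reduce_idx:
            cur = cur // 2  # time reduction: halve the frame rate
        if i == recover_idx:
            cur = t         # time recovery: restore the original length
        lengths.append(cur)
    return lengths

print(frames_per_block(100))
```

So with 100 input frames, blocks 0-4 run at full rate, blocks 5-10 at half rate (the expensive middle of the stack), and block 11 at full rate again.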
@ -0,0 +1,93 @@
############################################
#          Network Architecture           #
############################################
cmvn_file:
cmvn_file_type: "json"
# encoder related
encoder: squeezeformer
encoder_conf:
    encoder_dim: 256    # dimension of attention
    output_size: 256    # dimension of output
    attention_heads: 4
    num_blocks: 12      # the number of encoder blocks
    reduce_idx: 5
    recover_idx: 11
    feed_forward_expansion_factor: 8
    input_dropout_rate: 0.1
    feed_forward_dropout_rate: 0.1
    attention_dropout_rate: 0.1
    adaptive_scale: true
    cnn_module_kernel: 31
    normalize_before: false
    activation_type: 'swish'
    pos_enc_layer_type: 'rel_pos'
    time_reduction_layer_type: 'conv1d'

# decoder related
decoder: transformer
decoder_conf:
    attention_heads: 4
    linear_units: 2048
    num_blocks: 6
    dropout_rate: 0.1
    positional_dropout_rate: 0.1
    self_attention_dropout_rate: 0.0
    src_attention_dropout_rate: 0.0

# hybrid CTC/attention
model_conf:
    ctc_weight: 0.3
    lsm_weight: 0.1     # label smoothing option
    length_normalized_loss: false
    init_type: 'kaiming_uniform'  # !Warning: needed for convergence

###########################################
#                  Data                   #
###########################################
train_manifest: data/manifest.train
dev_manifest: data/manifest.dev
test_manifest: data/manifest.test

###########################################
#               Dataloader                #
###########################################
vocab_filepath: data/lang_char/vocab.txt
spm_model_prefix: ''
unit_type: 'char'
preprocess_config: conf/preprocess.yaml
feat_dim: 80
stride_ms: 10.0
window_ms: 25.0
sortagrad: 0     # Feed samples from shortest to longest; -1: enabled for all epochs, 0: disabled, other: enabled for 'other' epochs
batch_size: 32
maxlen_in: 512   # if input length > maxlen_in, the batch size is automatically reduced
maxlen_out: 150  # if output length > maxlen_out, the batch size is automatically reduced
minibatches: 0   # for debug
batch_count: auto
batch_bins: 0
batch_frames_in: 0
batch_frames_out: 0
batch_frames_inout: 0
num_workers: 2
subsampling_factor: 1
num_encs: 1

###########################################
#                Training                 #
###########################################
n_epoch: 150
accum_grad: 8
global_grad_clip: 5.0
dist_sampler: False
optim: adam
optim_conf:
    lr: 0.002
    weight_decay: 1.0e-6
scheduler: warmuplr
scheduler_conf:
    warmup_steps: 25000
    lr_decay: 1.0
log_interval: 100
checkpoint:
    kbest_n: 50
    latest_n: 5
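The `maxlen_in`/`maxlen_out` comments in the dataloader section above describe dynamic batch-size reduction for long utterances. A hypothetical sketch of such a rule (the exact formula used by the codebase may differ; this is only meant to illustrate the comment):

```python
# Assumed illustration, not PaddleSpeech's actual batching code: shrink the
# batch proportionally when an utterance exceeds maxlen_in (input frames)
# or maxlen_out (output tokens), never dropping below a batch of 1.
def reduced_batch_size(batch_size, ilen, olen, maxlen_in=512, maxlen_out=150):
    factor = min(1.0, maxlen_in / ilen, maxlen_out / olen)
    return max(1, int(batch_size * factor))

print(reduced_batch_size(32, ilen=1024, olen=100))  # -> 16
```

The idea is to keep the total number of frames per batch roughly constant, so very long utterances do not blow up memory.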
@ -0,0 +1,108 @@
# JETS with CSMSC
This example contains code used to train a [JETS](https://arxiv.org/abs/2203.16852v1) model with the [Chinese Standard Mandarin Speech Corpus](https://www.data-baker.com/open_source.html).

## Dataset
### Download and Extract
Download CSMSC from its [Official Website](https://test.data-baker.com/data/index/source).

### Get MFA Result and Extract
We use [MFA](https://github.com/MontrealCorpusTools/Montreal-Forced-Aligner) to get phonemes and durations for JETS.
You can download [baker_alignment_tone.tar.gz](https://paddlespeech.bj.bcebos.com/MFA/BZNSYP/with_tone/baker_alignment_tone.tar.gz), or train your own MFA model by referring to the [mfa example](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/mfa) in our repo.

## Get Started
Assume the path to the dataset is `~/datasets/BZNSYP`.
Assume the path to the MFA result of CSMSC is `./baker_alignment_tone`.
Run the command below to
1. **source path**.
2. preprocess the dataset.
3. train the model.
4. synthesize wavs.
    - synthesize waveform from `metadata.jsonl`.
    - synthesize waveform from a text file.

```bash
./run.sh
```
You can choose a range of stages you want to run, or set `stage` equal to `stop-stage` to run only one stage. For example, the following command will only preprocess the dataset.
```bash
./run.sh --stage 0 --stop-stage 0
```
### Data Preprocessing
```bash
./local/preprocess.sh ${conf_path}
```
When it is done, a `dump` folder is created in the current directory. The structure of the dump folder is listed below.

```text
dump
├── dev
│   ├── norm
│   └── raw
├── phone_id_map.txt
├── speaker_id_map.txt
├── test
│   ├── norm
│   └── raw
└── train
    ├── feats_stats.npy
    ├── norm
    └── raw
```
The dataset is split into 3 parts, namely `train`, `dev`, and` test`, each of which contains a `norm` and `raw` subfolder. The raw folder contains wave、mel spectrogram、speech、pitch and energy features of each utterance, while the norm folder contains normalized ones. The statistics used to normalize features are computed from the training set, which is located in `dump/train/feats_stats.npy`.
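The normalization applied here is standard per-dimension mean/variance (z-score) scaling, with the statistics computed from the training set only and then reused for `dev` and `test`. A minimal pure-Python sketch of the idea (the actual pipeline does this in `normalize.py` using the stats from `compute_statistics.py`; the helper names below are illustrative):

```python
# Mean/variance (z-score) normalization: stats come from the *training* set
# only, and the same stats are applied to dev and test features.
def compute_stats(feats):
    """Return per-dimension (mean, std) over a list of feature vectors."""
    n = len(feats)
    dims = len(feats[0])
    mean = [sum(f[d] for f in feats) / n for d in range(dims)]
    std = [
        (sum((f[d] - mean[d]) ** 2 for f in feats) / n) ** 0.5
        for d in range(dims)
    ]
    return mean, std

def normalize(feat, mean, std, eps=1e-8):
    """Shift by the training mean and divide by the training std."""
    return [(x - m) / (s + eps) for x, m, s in zip(feat, mean, std)]

train_feats = [[1.0, 10.0], [3.0, 14.0]]      # toy "training set"
mean, std = compute_stats(train_feats)        # mean=[2.0, 12.0], std=[1.0, 2.0]
dev_feat = normalize([4.0, 16.0], mean, std)  # dev vector scaled with train stats
```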

Also, there is a `metadata.jsonl` in each subfolder. It is a table-like file that contains phones, text_lengths, the path of feats, feats_lengths, the path of pitch features, the path of energy features, the path of raw waves, speaker, and the id of each utterance.
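Each line of `metadata.jsonl` is an independent JSON object (JSON Lines format), so it can be inspected with nothing but the standard library. The field names in the toy sample below follow the description above but should be treated as illustrative, not as the exact schema:

```python
import json

def load_metadata(lines):
    """Parse JSON-Lines metadata (one JSON object per line) into dicts."""
    return [json.loads(line) for line in lines if line.strip()]

# Toy sample mirroring the fields described above (illustrative names).
sample = '''{"utt_id": "009901", "phones": ["sil", "k", "a1"], "speaker": "csmsc"}
{"utt_id": "009902", "phones": ["sil", "n", "i3"], "speaker": "csmsc"}
'''
records = load_metadata(sample.splitlines())
print(records[0]["utt_id"])  # -> 009901
```

In real use you would pass an open file instead of the toy string, e.g. `load_metadata(open("dump/train/norm/metadata.jsonl"))`.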

### Model Training
```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path}
```
`./local/train.sh` calls `${BIN_DIR}/train.py`.
Here's the complete help message.
```text
usage: train.py [-h] [--config CONFIG] [--train-metadata TRAIN_METADATA]
                [--dev-metadata DEV_METADATA] [--output-dir OUTPUT_DIR]
                [--ngpu NGPU] [--phones-dict PHONES_DICT]

Train a JETS model.

optional arguments:
  -h, --help            show this help message and exit
  --config CONFIG       config file to overwrite default config.
  --train-metadata TRAIN_METADATA
                        training data.
  --dev-metadata DEV_METADATA
                        dev data.
  --output-dir OUTPUT_DIR
                        output dir.
  --ngpu NGPU           if ngpu == 0, use cpu.
  --phones-dict PHONES_DICT
                        phone vocabulary file.
```
1. `--config` is a config file in yaml format to overwrite the default config, which can be found at `conf/default.yaml`.
2. `--train-metadata` and `--dev-metadata` should be the metadata file in the normalized subfolder of `train` and `dev` in the `dump` folder.
3. `--output-dir` is the directory to save the results of the experiment. Checkpoints are saved in `checkpoints/` inside this directory.
4. `--ngpu` is the number of gpus to use; if ngpu == 0, use cpu.
5. `--phones-dict` is the path of the phone vocabulary file.

### Synthesizing
`./local/synthesize.sh` calls `${BIN_DIR}/synthesize.py`, which can synthesize waveform from `metadata.jsonl`.

```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name}
```

`./local/synthesize_e2e.sh` calls `${BIN_DIR}/synthesize_e2e.py`, which can synthesize waveform from a text file.
```bash
CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize_e2e.sh ${conf_path} ${train_output_path} ${ckpt_name}
```

## Pretrained Model
The pretrained model can be downloaded here:

- [jets_csmsc_ckpt_1.5.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/jets_csmsc_ckpt_1.5.0.zip)

The static model can be downloaded here:

- [jets_csmsc_static_1.5.0.zip](https://paddlespeech.bj.bcebos.com/Parakeet/jets_csmsc_static_1.5.0.zip)
@ -0,0 +1,224 @@
# This configuration was tested on 4 GPUs (V100) with 32GB GPU
# memory. It takes around 2 weeks to finish the training,
# but the 100k-iters model should already generate reasonable results.
###########################################################
#                FEATURE EXTRACTION SETTING               #
###########################################################

n_mels: 80
fs: 22050            # sampling rate (Hz)
n_fft: 1024          # FFT size (samples).
n_shift: 256         # Hop size (samples). ~11.6 ms at 22050 Hz
win_length: null     # Window length (samples).
                     # If set to null, it will be the same as fft_size.
window: "hann"       # Window function.
fmin: 0              # minimum frequency for Mel basis
fmax: null           # maximum frequency for Mel basis
f0min: 80            # Minimum f0 for pitch extraction.
f0max: 400           # Maximum f0 for pitch extraction.


##########################################################
#                    TTS MODEL SETTING                   #
##########################################################
model:
    # generator related
    generator_type: jets_generator
    generator_params:
        adim: 256                                    # attention dimension
        aheads: 2                                    # number of attention heads
        elayers: 4                                   # number of encoder layers
        eunits: 1024                                 # number of encoder ff units
        dlayers: 4                                   # number of decoder layers
        dunits: 1024                                 # number of decoder ff units
        positionwise_layer_type: conv1d              # type of position-wise layer
        positionwise_conv_kernel_size: 3             # kernel size of position-wise conv layer
        duration_predictor_layers: 2                 # number of layers of duration predictor
        duration_predictor_chans: 256                # number of channels of duration predictor
        duration_predictor_kernel_size: 3            # filter size of duration predictor
        use_masking: True                            # whether to apply masking for padded part in loss calculation
        encoder_normalize_before: True               # whether to perform layer normalization before the input
        decoder_normalize_before: True               # whether to perform layer normalization before the input
        encoder_type: transformer                    # encoder type
        decoder_type: transformer                    # decoder type
        conformer_rel_pos_type: latest               # relative positional encoding type
        conformer_pos_enc_layer_type: rel_pos        # conformer positional encoding type
        conformer_self_attn_layer_type: rel_selfattn # conformer self-attention type
        conformer_activation_type: swish             # conformer activation type
        use_macaron_style_in_conformer: true         # whether to use macaron style in conformer
        use_cnn_in_conformer: true                   # whether to use CNN in conformer
        conformer_enc_kernel_size: 7                 # kernel size in CNN module of conformer-based encoder
        conformer_dec_kernel_size: 31                # kernel size in CNN module of conformer-based decoder
        init_type: xavier_uniform                    # initialization type
        init_enc_alpha: 1.0                          # initial value of alpha for encoder
        init_dec_alpha: 1.0                          # initial value of alpha for decoder
        transformer_enc_dropout_rate: 0.2            # dropout rate for transformer encoder layer
        transformer_enc_positional_dropout_rate: 0.2 # dropout rate for transformer encoder positional encoding
        transformer_enc_attn_dropout_rate: 0.2       # dropout rate for transformer encoder attention layer
        transformer_dec_dropout_rate: 0.2            # dropout rate for transformer decoder layer
        transformer_dec_positional_dropout_rate: 0.2 # dropout rate for transformer decoder positional encoding
        transformer_dec_attn_dropout_rate: 0.2       # dropout rate for transformer decoder attention layer
        pitch_predictor_layers: 5                    # number of conv layers in pitch predictor
        pitch_predictor_chans: 256                   # number of channels of conv layers in pitch predictor
        pitch_predictor_kernel_size: 5               # kernel size of conv layers in pitch predictor
        pitch_predictor_dropout: 0.5                 # dropout rate in pitch predictor
        pitch_embed_kernel_size: 1                   # kernel size of conv embedding layer for pitch
        pitch_embed_dropout: 0.0                     # dropout rate after conv embedding layer for pitch
        stop_gradient_from_pitch_predictor: true     # whether to stop the gradient from pitch predictor to encoder
        energy_predictor_layers: 2                   # number of conv layers in energy predictor
        energy_predictor_chans: 256                  # number of channels of conv layers in energy predictor
        energy_predictor_kernel_size: 3              # kernel size of conv layers in energy predictor
        energy_predictor_dropout: 0.5                # dropout rate in energy predictor
        energy_embed_kernel_size: 1                  # kernel size of conv embedding layer for energy
        energy_embed_dropout: 0.0                    # dropout rate after conv embedding layer for energy
        stop_gradient_from_energy_predictor: false   # whether to stop the gradient from energy predictor to encoder
        generator_out_channels: 1
        generator_channels: 512
        generator_global_channels: -1
        generator_kernel_size: 7
        generator_upsample_scales: [8, 8, 2, 2]
        generator_upsample_kernel_sizes: [16, 16, 4, 4]
        generator_resblock_kernel_sizes: [3, 7, 11]
        generator_resblock_dilations: [[1, 3, 5], [1, 3, 5], [1, 3, 5]]
        generator_use_additional_convs: true
        generator_bias: true
        generator_nonlinear_activation: "leakyrelu"
        generator_nonlinear_activation_params:
            negative_slope: 0.1
        generator_use_weight_norm: true
        segment_size: 64                             # segment size for random windowed discriminator

    # discriminator related
    discriminator_type: hifigan_multi_scale_multi_period_discriminator
    discriminator_params:
        scales: 1
        scale_downsample_pooling: "AvgPool1D"
        scale_downsample_pooling_params:
            kernel_size: 4
            stride: 2
            padding: 2
        scale_discriminator_params:
            in_channels: 1
            out_channels: 1
            kernel_sizes: [15, 41, 5, 3]
            channels: 128
            max_downsample_channels: 1024
            max_groups: 16
            bias: True
            downsample_scales: [2, 2, 4, 4, 1]
            nonlinear_activation: "leakyrelu"
            nonlinear_activation_params:
                negative_slope: 0.1
            use_weight_norm: True
            use_spectral_norm: False
        follow_official_norm: False
        periods: [2, 3, 5, 7, 11]
        period_discriminator_params:
            in_channels: 1
            out_channels: 1
            kernel_sizes: [5, 3]
            channels: 32
            downsample_scales: [3, 3, 3, 3, 1]
            max_downsample_channels: 1024
            bias: True
            nonlinear_activation: "leakyrelu"
            nonlinear_activation_params:
                negative_slope: 0.1
            use_weight_norm: True
            use_spectral_norm: False
    # others
    sampling_rate: 22050          # needed in the inference for saving wav
    cache_generator_outputs: True # whether to cache generator outputs in the training
    use_alignment_module: False   # whether to use alignment module

###########################################################
#                      LOSS SETTING                       #
###########################################################
# loss function related
generator_adv_loss_params:
    average_by_discriminators: False # whether to average loss value by #discriminators
    loss_type: mse                   # loss type, "mse" or "hinge"
discriminator_adv_loss_params:
    average_by_discriminators: False # whether to average loss value by #discriminators
    loss_type: mse                   # loss type, "mse" or "hinge"
feat_match_loss_params:
    average_by_discriminators: False # whether to average loss value by #discriminators
    average_by_layers: False         # whether to average loss value by #layers of each discriminator
    include_final_outputs: True      # whether to include final outputs for loss calculation
mel_loss_params:
    fs: 22050        # must be the same as the training data
    fft_size: 1024   # fft points
    hop_size: 256    # hop size
    win_length: null # window length
    window: hann     # window type
    num_mels: 80     # number of Mel basis
    fmin: 0          # minimum frequency for Mel basis
    fmax: null       # maximum frequency for Mel basis
    log_base: null   # null represents natural log

###########################################################
#                ADVERSARIAL LOSS SETTING                 #
###########################################################
lambda_adv: 1.0        # loss scaling coefficient for adversarial loss
lambda_mel: 45.0       # loss scaling coefficient for Mel loss
lambda_feat_match: 2.0 # loss scaling coefficient for feat match loss
lambda_var: 1.0        # loss scaling coefficient for duration loss
lambda_align: 2.0      # loss scaling coefficient for KL divergence loss
# others
sampling_rate: 22050          # needed in the inference for saving wav
cache_generator_outputs: True # whether to cache generator outputs in the training


# extra module for additional inputs
pitch_extract: dio           # pitch extractor type
pitch_extract_conf:
    reduction_factor: 1
    use_token_averaged_f0: false
pitch_normalize: global_mvn  # normalizer for the pitch feature
energy_extract: energy       # energy extractor type
energy_extract_conf:
    reduction_factor: 1
    use_token_averaged_energy: false
energy_normalize: global_mvn # normalizer for the energy feature


###########################################################
#                   DATA LOADER SETTING                   #
###########################################################
batch_size: 32  # Batch size.
num_workers: 4  # Number of workers in DataLoader.

##########################################################
#             OPTIMIZER & SCHEDULER SETTING              #
##########################################################
# optimizer setting for generator
generator_optimizer_params:
    beta1: 0.8
    beta2: 0.99
    epsilon: 1.0e-9
    weight_decay: 0.0
generator_scheduler: exponential_decay
generator_scheduler_params:
    learning_rate: 2.0e-4
    gamma: 0.999875

# optimizer setting for discriminator
discriminator_optimizer_params:
    beta1: 0.8
    beta2: 0.99
    epsilon: 1.0e-9
    weight_decay: 0.0
discriminator_scheduler: exponential_decay
discriminator_scheduler_params:
    learning_rate: 2.0e-4
    gamma: 0.999875
generator_first: True # whether to start updating generator first

##########################################################
#                 OTHER TRAINING SETTING                 #
##########################################################
num_snapshots: 10         # max number of snapshots to keep while training
train_max_steps: 350000   # Number of training steps. == total_iters / ngpus, total_iters = 1000000
save_interval_steps: 1000 # Interval steps to save checkpoint.
eval_interval_steps: 250  # Interval steps to evaluate the network.
seed: 777                 # random seed number
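The `exponential_decay` scheduler above shrinks the learning rate geometrically: after `step` updates the rate is `learning_rate * gamma ** step`. A minimal sketch of that rule (whether the framework applies the decay per step or per epoch is an implementation detail; this sketch assumes per-step):

```python
def exponential_decay(base_lr, gamma, step):
    """Learning rate after `step` updates under geometric (exponential) decay."""
    return base_lr * (gamma ** step)

lr0 = exponential_decay(2.0e-4, 0.999875, 0)  # initial rate, unchanged at step 0
lr_1k = exponential_decay(2.0e-4, 0.999875, 1000)  # slightly decayed after 1k steps
```

With `gamma: 0.999875` the rate falls slowly, roughly halving every `ln(2)/ln(1/gamma) ≈ 5545` steps.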
@ -0,0 +1,15 @@
#!/bin/bash

train_output_path=$1

stage=0
stop_stage=0

if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
    python3 ${BIN_DIR}/inference.py \
        --inference_dir=${train_output_path}/inference \
        --am=jets_csmsc \
        --text=${BIN_DIR}/../../assets/sentences.txt \
        --output_dir=${train_output_path}/pd_infer_out \
        --phones_dict=dump/phone_id_map.txt
fi
@ -0,0 +1,77 @@
#!/bin/bash
set -e
stage=0
stop_stage=100

config_path=$1

if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
    # get durations from MFA's result
    echo "Generate durations.txt from MFA results ..."
    python3 ${MAIN_ROOT}/utils/gen_duration_from_textgrid.py \
        --inputdir=./baker_alignment_tone \
        --output=durations.txt \
        --config=${config_path}
fi

if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
    # extract features
    echo "Extract features ..."
    python3 ${BIN_DIR}/preprocess.py \
        --dataset=baker \
        --rootdir=~/datasets/BZNSYP/ \
        --dumpdir=dump \
        --dur-file=durations.txt \
        --config=${config_path} \
        --num-cpu=20 \
        --cut-sil=True \
        --token_average=True
fi

if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
    # get features' stats (mean and std)
    echo "Get features' stats ..."
    python3 ${MAIN_ROOT}/utils/compute_statistics.py \
        --metadata=dump/train/raw/metadata.jsonl \
        --field-name="feats"

    python3 ${MAIN_ROOT}/utils/compute_statistics.py \
        --metadata=dump/train/raw/metadata.jsonl \
        --field-name="pitch"

    python3 ${MAIN_ROOT}/utils/compute_statistics.py \
        --metadata=dump/train/raw/metadata.jsonl \
        --field-name="energy"
fi

if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
    # normalize and convert phone/speaker to id; dev and test should use train's stats
    echo "Normalize ..."
    python3 ${BIN_DIR}/normalize.py \
        --metadata=dump/train/raw/metadata.jsonl \
        --dumpdir=dump/train/norm \
        --feats-stats=dump/train/feats_stats.npy \
        --pitch-stats=dump/train/pitch_stats.npy \
        --energy-stats=dump/train/energy_stats.npy \
        --phones-dict=dump/phone_id_map.txt \
        --speaker-dict=dump/speaker_id_map.txt

    python3 ${BIN_DIR}/normalize.py \
        --metadata=dump/dev/raw/metadata.jsonl \
        --dumpdir=dump/dev/norm \
        --feats-stats=dump/train/feats_stats.npy \
        --pitch-stats=dump/train/pitch_stats.npy \
        --energy-stats=dump/train/energy_stats.npy \
        --phones-dict=dump/phone_id_map.txt \
        --speaker-dict=dump/speaker_id_map.txt

    python3 ${BIN_DIR}/normalize.py \
        --metadata=dump/test/raw/metadata.jsonl \
        --dumpdir=dump/test/norm \
        --feats-stats=dump/train/feats_stats.npy \
        --pitch-stats=dump/train/pitch_stats.npy \
        --energy-stats=dump/train/energy_stats.npy \
        --phones-dict=dump/phone_id_map.txt \
        --speaker-dict=dump/speaker_id_map.txt
fi
@ -0,0 +1,18 @@
#!/bin/bash

config_path=$1
train_output_path=$2
ckpt_name=$3
stage=0
stop_stage=0

if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
    FLAGS_allocator_strategy=naive_best_fit \
    FLAGS_fraction_of_gpu_memory_to_use=0.01 \
    python3 ${BIN_DIR}/synthesize.py \
        --config=${config_path} \
        --ckpt=${train_output_path}/checkpoints/${ckpt_name} \
        --phones_dict=dump/phone_id_map.txt \
        --test_metadata=dump/test/norm/metadata.jsonl \
        --output_dir=${train_output_path}/test
fi
@ -0,0 +1,22 @@
#!/bin/bash

config_path=$1
train_output_path=$2
ckpt_name=$3

stage=0
stop_stage=0

if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
    FLAGS_allocator_strategy=naive_best_fit \
    FLAGS_fraction_of_gpu_memory_to_use=0.01 \
    python3 ${BIN_DIR}/synthesize_e2e.py \
        --am=jets_csmsc \
        --config=${config_path} \
        --ckpt=${train_output_path}/checkpoints/${ckpt_name} \
        --phones_dict=dump/phone_id_map.txt \
        --output_dir=${train_output_path}/test_e2e \
        --text=${BIN_DIR}/../../assets/sentences.txt \
        --inference_dir=${train_output_path}/inference
fi
@ -0,0 +1,12 @@
#!/bin/bash

config_path=$1
train_output_path=$2

python3 ${BIN_DIR}/train.py \
    --train-metadata=dump/train/norm/metadata.jsonl \
    --dev-metadata=dump/dev/norm/metadata.jsonl \
    --config=${config_path} \
    --output-dir=${train_output_path} \
    --ngpu=1 \
    --phones-dict=dump/phone_id_map.txt
@ -0,0 +1,13 @@
#!/bin/bash
export MAIN_ROOT=`realpath ${PWD}/../../../`

export PATH=${MAIN_ROOT}:${MAIN_ROOT}/utils:${PATH}
export LC_ALL=C

export PYTHONDONTWRITEBYTECODE=1
# Use UTF-8 in Python to avoid UnicodeDecodeError when LC_ALL=C
export PYTHONIOENCODING=UTF-8
export PYTHONPATH=${MAIN_ROOT}:${PYTHONPATH}

MODEL=jets
export BIN_DIR=${MAIN_ROOT}/paddlespeech/t2s/exps/${MODEL}
@ -0,0 +1,41 @@
#!/bin/bash

set -e
source path.sh

gpus=0
stage=0
stop_stage=100

conf_path=conf/default.yaml
train_output_path=exp/default
ckpt_name=snapshot_iter_150000.pdz

# with the following command, you can choose the stage range you want to run
# such as `./run.sh --stage 0 --stop-stage 0`
# this cannot be mixed with `$1`, `$2` ...
source ${MAIN_ROOT}/utils/parse_options.sh || exit 1

if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
    # prepare data
    ./local/preprocess.sh ${conf_path} || exit -1
fi

if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
    # train model, all `ckpt` files are saved under `train_output_path/checkpoints/` dir
    CUDA_VISIBLE_DEVICES=${gpus} ./local/train.sh ${conf_path} ${train_output_path} || exit -1
fi

if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
    CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize.sh ${conf_path} ${train_output_path} ${ckpt_name} || exit -1
fi

if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
    # synthesize_e2e
    CUDA_VISIBLE_DEVICES=${gpus} ./local/synthesize_e2e.sh ${conf_path} ${train_output_path} ${ckpt_name} || exit -1
fi

if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
    CUDA_VISIBLE_DEVICES=${gpus} ./local/inference.sh ${train_output_path} || exit -1
fi
@ -0,0 +1,46 @@
#!/bin/bash

train_output_path=$1

stage=0
stop_stage=0

# pwgan
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
    python3 ${BIN_DIR}/../inference.py \
        --inference_dir=${train_output_path}/inference \
        --am=speedyspeech_csmsc \
        --voc=pwgan_csmsc \
        --text=${BIN_DIR}/../../assets/sentences.txt \
        --output_dir=${train_output_path}/pd_infer_out \
        --phones_dict=dump/phone_id_map.txt \
        --tones_dict=dump/tone_id_map.txt \
        --device xpu
fi

# for more GAN Vocoders
# multi band melgan
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
    python3 ${BIN_DIR}/../inference.py \
        --inference_dir=${train_output_path}/inference \
        --am=speedyspeech_csmsc \
        --voc=mb_melgan_csmsc \
        --text=${BIN_DIR}/../../assets/sentences.txt \
        --output_dir=${train_output_path}/pd_infer_out \
        --phones_dict=dump/phone_id_map.txt \
        --tones_dict=dump/tone_id_map.txt \
        --device xpu
fi

# hifigan
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
    python3 ${BIN_DIR}/../inference.py \
        --inference_dir=${train_output_path}/inference \
        --am=speedyspeech_csmsc \
        --voc=hifigan_csmsc \
        --text=${BIN_DIR}/../../assets/sentences.txt \
        --output_dir=${train_output_path}/pd_infer_out \
        --phones_dict=dump/phone_id_map.txt \
        --tones_dict=dump/tone_id_map.txt \
        --device xpu
fi
@ -0,0 +1,122 @@
#!/bin/bash

config_path=$1
train_output_path=$2
ckpt_name=$3

stage=0
stop_stage=0

# pwgan
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
    FLAGS_allocator_strategy=naive_best_fit \
    python3 ${BIN_DIR}/../synthesize_e2e.py \
        --am=speedyspeech_csmsc \
        --am_config=${config_path} \
        --am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
        --am_stat=dump/train/feats_stats.npy \
        --voc=pwgan_csmsc \
        --voc_config=pwg_baker_ckpt_0.4/pwg_default.yaml \
        --voc_ckpt=pwg_baker_ckpt_0.4/pwg_snapshot_iter_400000.pdz \
        --voc_stat=pwg_baker_ckpt_0.4/pwg_stats.npy \
        --lang=zh \
        --text=${BIN_DIR}/../../assets/sentences.txt \
        --output_dir=${train_output_path}/test_e2e \
        --phones_dict=dump/phone_id_map.txt \
        --tones_dict=dump/tone_id_map.txt \
        --inference_dir=${train_output_path}/inference \
        --ngpu=0 \
        --nxpu=1
fi

# for more GAN Vocoders
# multi band melgan
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
    FLAGS_allocator_strategy=naive_best_fit \
    python3 ${BIN_DIR}/../synthesize_e2e.py \
        --am=speedyspeech_csmsc \
        --am_config=${config_path} \
        --am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
        --am_stat=dump/train/feats_stats.npy \
        --voc=mb_melgan_csmsc \
        --voc_config=mb_melgan_csmsc_ckpt_0.1.1/default.yaml \
        --voc_ckpt=mb_melgan_csmsc_ckpt_0.1.1/snapshot_iter_1000000.pdz \
        --voc_stat=mb_melgan_csmsc_ckpt_0.1.1/feats_stats.npy \
        --lang=zh \
        --text=${BIN_DIR}/../../assets/sentences.txt \
        --output_dir=${train_output_path}/test_e2e \
        --phones_dict=dump/phone_id_map.txt \
        --tones_dict=dump/tone_id_map.txt \
        --inference_dir=${train_output_path}/inference \
        --ngpu=0 \
        --nxpu=1
fi

# the pretrained models haven't been released yet
# style melgan
# style melgan's dygraph-to-static-graph conversion is not ready yet
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
    FLAGS_allocator_strategy=naive_best_fit \
    python3 ${BIN_DIR}/../synthesize_e2e.py \
        --am=speedyspeech_csmsc \
        --am_config=${config_path} \
        --am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
        --am_stat=dump/train/feats_stats.npy \
        --voc=style_melgan_csmsc \
        --voc_config=style_melgan_csmsc_ckpt_0.1.1/default.yaml \
        --voc_ckpt=style_melgan_csmsc_ckpt_0.1.1/snapshot_iter_1500000.pdz \
        --voc_stat=style_melgan_csmsc_ckpt_0.1.1/feats_stats.npy \
        --lang=zh \
        --text=${BIN_DIR}/../../assets/sentences.txt \
        --output_dir=${train_output_path}/test_e2e \
        --phones_dict=dump/phone_id_map.txt \
        --tones_dict=dump/tone_id_map.txt \
        --ngpu=0 \
        --nxpu=1
        # --inference_dir=${train_output_path}/inference
fi

# hifigan
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
    FLAGS_allocator_strategy=naive_best_fit \
    python3 ${BIN_DIR}/../synthesize_e2e.py \
        --am=speedyspeech_csmsc \
        --am_config=${config_path} \
        --am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
        --am_stat=dump/train/feats_stats.npy \
        --voc=hifigan_csmsc \
        --voc_config=hifigan_csmsc_ckpt_0.1.1/default.yaml \
        --voc_ckpt=hifigan_csmsc_ckpt_0.1.1/snapshot_iter_2500000.pdz \
        --voc_stat=hifigan_csmsc_ckpt_0.1.1/feats_stats.npy \
        --lang=zh \
        --text=${BIN_DIR}/../../assets/sentences.txt \
        --output_dir=${train_output_path}/test_e2e \
        --phones_dict=dump/phone_id_map.txt \
        --tones_dict=dump/tone_id_map.txt \
        --inference_dir=${train_output_path}/inference \
        --ngpu=0 \
        --nxpu=1
fi

# wavernn
if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
    echo "in wavernn syn_e2e"
    FLAGS_allocator_strategy=naive_best_fit \
    python3 ${BIN_DIR}/../synthesize_e2e.py \
        --am=speedyspeech_csmsc \
        --am_config=${config_path} \
        --am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
        --am_stat=dump/train/feats_stats.npy \
        --voc=wavernn_csmsc \
        --voc_config=wavernn_csmsc_ckpt_0.2.0/default.yaml \
        --voc_ckpt=wavernn_csmsc_ckpt_0.2.0/snapshot_iter_400000.pdz \
        --voc_stat=wavernn_csmsc_ckpt_0.2.0/feats_stats.npy \
        --lang=zh \
        --text=${BIN_DIR}/../../assets/sentences.txt \
        --output_dir=${train_output_path}/test_e2e \
        --phones_dict=dump/phone_id_map.txt \
        --tones_dict=dump/tone_id_map.txt \
        --inference_dir=${train_output_path}/inference \
        --ngpu=0 \
        --nxpu=1
fi
@ -0,0 +1,110 @@
#!/bin/bash

config_path=$1
train_output_path=$2
ckpt_name=$3
stage=0
stop_stage=0

# pwgan
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
    FLAGS_allocator_strategy=naive_best_fit \
    python3 ${BIN_DIR}/../synthesize.py \
        --am=speedyspeech_csmsc \
        --am_config=${config_path} \
        --am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
        --am_stat=dump/train/feats_stats.npy \
        --voc=pwgan_csmsc \
        --voc_config=pwg_baker_ckpt_0.4/pwg_default.yaml \
        --voc_ckpt=pwg_baker_ckpt_0.4/pwg_snapshot_iter_400000.pdz \
        --voc_stat=pwg_baker_ckpt_0.4/pwg_stats.npy \
        --test_metadata=dump/test/norm/metadata.jsonl \
        --output_dir=${train_output_path}/test \
        --phones_dict=dump/phone_id_map.txt \
        --tones_dict=dump/tone_id_map.txt \
        --ngpu=0 \
        --nxpu=1
fi

# for more GAN Vocoders
# multi band melgan
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
    FLAGS_allocator_strategy=naive_best_fit \
    python3 ${BIN_DIR}/../synthesize.py \
        --am=speedyspeech_csmsc \
        --am_config=${config_path} \
        --am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
        --am_stat=dump/train/feats_stats.npy \
        --voc=mb_melgan_csmsc \
        --voc_config=mb_melgan_csmsc_ckpt_0.1.1/default.yaml \
        --voc_ckpt=mb_melgan_csmsc_ckpt_0.1.1/snapshot_iter_1000000.pdz \
        --voc_stat=mb_melgan_csmsc_ckpt_0.1.1/feats_stats.npy \
        --test_metadata=dump/test/norm/metadata.jsonl \
        --output_dir=${train_output_path}/test \
        --phones_dict=dump/phone_id_map.txt \
        --tones_dict=dump/tone_id_map.txt \
        --ngpu=0 \
        --nxpu=1
fi

# style melgan
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
    FLAGS_allocator_strategy=naive_best_fit \
    python3 ${BIN_DIR}/../synthesize.py \
        --am=speedyspeech_csmsc \
        --am_config=${config_path} \
        --am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
        --am_stat=dump/train/feats_stats.npy \
        --voc=style_melgan_csmsc \
        --voc_config=style_melgan_csmsc_ckpt_0.1.1/default.yaml \
        --voc_ckpt=style_melgan_csmsc_ckpt_0.1.1/snapshot_iter_1500000.pdz \
        --voc_stat=style_melgan_csmsc_ckpt_0.1.1/feats_stats.npy \
        --test_metadata=dump/test/norm/metadata.jsonl \
        --output_dir=${train_output_path}/test \
        --phones_dict=dump/phone_id_map.txt \
        --tones_dict=dump/tone_id_map.txt \
        --ngpu=0 \
        --nxpu=1
fi

# hifigan
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
    echo "in hifigan syn"
    FLAGS_allocator_strategy=naive_best_fit \
    python3 ${BIN_DIR}/../synthesize.py \
        --am=speedyspeech_csmsc \
        --am_config=${config_path} \
        --am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
        --am_stat=dump/train/feats_stats.npy \
        --voc=hifigan_csmsc \
        --voc_config=hifigan_csmsc_ckpt_0.1.1/default.yaml \
        --voc_ckpt=hifigan_csmsc_ckpt_0.1.1/snapshot_iter_2500000.pdz \
        --voc_stat=hifigan_csmsc_ckpt_0.1.1/feats_stats.npy \
        --test_metadata=dump/test/norm/metadata.jsonl \
        --output_dir=${train_output_path}/test \
        --phones_dict=dump/phone_id_map.txt \
        --tones_dict=dump/tone_id_map.txt \
        --ngpu=0 \
        --nxpu=1
fi

# wavernn
if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
    echo "in wavernn syn"
    FLAGS_allocator_strategy=naive_best_fit \
    python3 ${BIN_DIR}/../synthesize.py \
        --am=speedyspeech_csmsc \
        --am_config=${config_path} \
        --am_ckpt=${train_output_path}/checkpoints/${ckpt_name} \
        --am_stat=dump/train/feats_stats.npy \
        --voc=wavernn_csmsc \
        --voc_config=wavernn_csmsc_ckpt_0.2.0/default.yaml \
        --voc_ckpt=wavernn_csmsc_ckpt_0.2.0/snapshot_iter_400000.pdz \
        --voc_stat=wavernn_csmsc_ckpt_0.2.0/feats_stats.npy \
        --test_metadata=dump/test/norm/metadata.jsonl \
        --output_dir=${train_output_path}/test \
        --tones_dict=dump/tone_id_map.txt \
        --phones_dict=dump/phone_id_map.txt \
        --ngpu=0 \
        --nxpu=1
fi
@ -0,0 +1,16 @@
#!/bin/bash

config_path=$1
train_output_path=$2

python ${BIN_DIR}/train.py \
    --train-metadata=dump/train/norm/metadata.jsonl \
    --dev-metadata=dump/dev/norm/metadata.jsonl \
    --config=${config_path} \
    --output-dir=${train_output_path} \
    --ngpu=0 \
    --nxpu=1 \
    --phones-dict=dump/phone_id_map.txt \
    --tones-dict=dump/tone_id_map.txt \
    --use-relative-path=True
@ -0,0 +1,42 @@
#!/bin/bash

set -e
source path.sh

xpus=0,1
stage=0
stop_stage=100

conf_path=conf/default.yaml
train_output_path=exp/default
ckpt_name=snapshot_iter_76.pdz

# With the following command, you can choose the stage range you want to run,
# such as `./run_xpu.sh --stage 0 --stop-stage 0`.
# This cannot be mixed with positional arguments `$1`, `$2` ...
source ${MAIN_ROOT}/utils/parse_options.sh || exit 1

if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
    # prepare data
    ./local/preprocess.sh ${conf_path} || exit -1
fi

if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
    # train model; all `ckpt` files land under the `train_output_path/checkpoints/` dir
    FLAGS_selected_xpus=${xpus} ./local/train_xpu.sh ${conf_path} ${train_output_path} || exit -1
fi

if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
    # synthesize, vocoder is pwgan by default
    FLAGS_selected_xpus=${xpus} ./local/synthesize_xpu.sh ${conf_path} ${train_output_path} ${ckpt_name} || exit -1
fi

if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
    # synthesize_e2e, vocoder is pwgan by default
    FLAGS_selected_xpus=${xpus} ./local/synthesize_e2e_xpu.sh ${conf_path} ${train_output_path} ${ckpt_name} || exit -1
fi

if [ ${stage} -le 4 ] && [ ${stop_stage} -ge 4 ]; then
    # inference with static model
    FLAGS_selected_xpus=${xpus} ./local/inference_xpu.sh ${train_output_path} || exit -1
fi
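The `--stage` / `--stop-stage` flags above are handled by the Kaldi-style `utils/parse_options.sh` sourced by the script. As a rough, hypothetical sketch (not the actual utility), the core idea is to map each `--foo-bar value` flag onto an existing shell variable `foo_bar`:

```shell
#!/bin/bash
# Hypothetical, simplified sketch of Kaldi-style option parsing:
# each `--foo-bar value` pair overwrites the shell variable `foo_bar`.
parse_opts() {
    while [ $# -gt 0 ]; do
        case "$1" in
            --*)
                # strip leading `--`, map dashes to underscores: --stop-stage -> stop_stage
                name=$(printf '%s' "${1#--}" | tr '-' '_')
                eval "${name}=\"$2\""
                shift 2 ;;
            *) break ;;
        esac
    done
}

stage=0
stop_stage=100
parse_opts --stage 2 --stop-stage 3
echo "stage=${stage} stop_stage=${stop_stage}"   # prints: stage=2 stop_stage=3
```

The real `parse_options.sh` additionally validates that the target variable already exists, which is why flags cannot be mixed with positional `$1`, `$2` arguments.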
@ -0,0 +1,55 @@
#!/bin/bash

train_output_path=$1

stage=0
stop_stage=0

# pwgan
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
    python3 ${BIN_DIR}/../inference.py \
        --inference_dir=${train_output_path}/inference \
        --am=fastspeech2_csmsc \
        --voc=pwgan_csmsc \
        --text=${BIN_DIR}/../../assets/sentences.txt \
        --output_dir=${train_output_path}/pd_infer_out \
        --phones_dict=dump/phone_id_map.txt \
        --device xpu
fi

# for more GAN Vocoders
# multi band melgan
if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
    python3 ${BIN_DIR}/../inference.py \
        --inference_dir=${train_output_path}/inference \
        --am=fastspeech2_csmsc \
        --voc=mb_melgan_csmsc \
        --text=${BIN_DIR}/../../assets/sentences.txt \
        --output_dir=${train_output_path}/pd_infer_out \
        --phones_dict=dump/phone_id_map.txt \
        --device xpu
fi

# hifigan
if [ ${stage} -le 2 ] && [ ${stop_stage} -ge 2 ]; then
    python3 ${BIN_DIR}/../inference.py \
        --inference_dir=${train_output_path}/inference \
        --am=fastspeech2_csmsc \
        --voc=hifigan_csmsc \
        --text=${BIN_DIR}/../../assets/sentences.txt \
        --output_dir=${train_output_path}/pd_infer_out \
        --phones_dict=dump/phone_id_map.txt \
        --device xpu
fi

# wavernn
if [ ${stage} -le 3 ] && [ ${stop_stage} -ge 3 ]; then
    python3 ${BIN_DIR}/../inference.py \
        --inference_dir=${train_output_path}/inference \
        --am=fastspeech2_csmsc \
        --voc=wavernn_csmsc \
        --text=${BIN_DIR}/../../assets/sentences.txt \
        --output_dir=${train_output_path}/pd_infer_out \
        --phones_dict=dump/phone_id_map.txt \
        --device xpu
fi
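All of these scripts gate their steps with the same `[ ${stage} -le N ] && [ ${stop_stage} -ge N ]` pattern, so a step runs exactly when `stage <= N <= stop_stage`. A minimal standalone illustration (the step labels here are hypothetical):

```shell
#!/bin/bash
# Minimal sketch of the stage-gating pattern shared by these scripts:
# a numbered step runs only when stage <= N <= stop_stage.
stage=1
stop_stage=2

maybe_run() {
    # $1: this step's number, $2: a label for it
    if [ ${stage} -le $1 ] && [ ${stop_stage} -ge $1 ]; then
        echo "running $2"
    fi
}

maybe_run 0 "preprocess"   # skipped: 0 < stage
maybe_run 1 "train"        # runs
maybe_run 2 "synthesize"   # runs
maybe_run 3 "inference"    # skipped: 3 > stop_stage
```

Because every block re-tests the range independently, re-running with a narrower `--stage`/`--stop-stage` pair resumes the pipeline without repeating earlier steps.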