pull/1361/head
huangyuxin 3 years ago
commit 3845804cc9

@ -463,7 +463,6 @@ Normally, [Speech SoTA](https://paperswithcode.com/area/speech), [Audio SoTA](ht
- [Automatic Speech Recognition](./docs/source/asr/quick_start.md)
- [Introduction](./docs/source/asr/models_introduction.md)
- [Data Preparation](./docs/source/asr/data_preparation.md)
- [Data Augmentation](./docs/source/asr/augmentation.md)
- [Ngram LM](./docs/source/asr/ngram_lm.md)
- [Text-to-Speech](./docs/source/tts/quick_start.md)
- [Introduction](./docs/source/tts/models_introduction.md)

@ -468,7 +468,6 @@ PaddleSpeech's **text-to-speech** mainly consists of three modules: the text frontend, 声
- [Custom Training for Automatic Speech Recognition](./docs/source/asr/quick_start.md)
- [Introduction](./docs/source/asr/models_introduction.md)
- [Data Preparation](./docs/source/asr/data_preparation.md)
- [Data Augmentation](./docs/source/asr/augmentation.md)
- [Ngram Language Model](./docs/source/asr/ngram_lm.md)
- [Custom Training for Text-to-Speech](./docs/source/tts/quick_start.md)
- [Introduction](./docs/source/tts/models_introduction.md)

@ -1,40 +0,0 @@
# Data Augmentation Pipeline
Data augmentation is often a highly effective way to improve deep learning performance. We augment the speech data by synthesizing new audio clips, applying small random perturbations (label-invariant transformations) to the raw audio. You do not have to run the synthesis yourself: it is built into the data provider and is applied on the fly, with fresh randomness in each epoch during training.
The following optional augmentation components can be selected, configured, and inserted into the processing pipeline.
* Audio
  - Volume Perturbation
  - Speed Perturbation
  - Shifting Perturbation
  - Online Bayesian normalization
  - Noise Perturbation (requires background noise audio files)
  - Impulse Response (requires impulse audio files)
* Feature
  - SpecAugment
  - Adaptive SpecAugment
To tell the trainer which augmentation components are needed and in what order they run, you must prepare an *augmentation configuration file* in [JSON](http://www.json.org/) format in advance. For example:
```
[{
"type": "speed",
"params": {"min_speed_rate": 0.95,
"max_speed_rate": 1.05},
"prob": 0.6
},
{
"type": "shift",
"params": {"min_shift_ms": -5,
"max_shift_ms": 5},
"prob": 0.8
}]
```
When the `augment_conf_file` argument is set to the path of the example configuration file above, every audio clip in every epoch is processed as follows: with a 60% chance, it is first speed-perturbed with a speed rate sampled uniformly between 0.95 and 1.05; then, with an 80% chance, it is shifted in time by an offset sampled randomly between -5 ms and 5 ms. Finally, the newly synthesized audio clip is fed into the feature extractor for training.
For other configuration examples, please refer to `examples/conf/augmentation.example.json`.
Use data augmentation with care: improper augmentation can harm training by widening the gap between training and test data.
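
The configuration semantics described above can be sketched in plain Python. This is only an illustration, not the data provider's actual code: the helper names (`speed_perturb`, `shift_perturb`, `augment`), the 16 kHz sample rate, the linear-interpolation resampling, and the zero-padding used for shifting are all assumptions made for the example.

```python
import json
import random

# Same structure as the example configuration above.
CONFIG_JSON = """
[{"type": "speed",
  "params": {"min_speed_rate": 0.95, "max_speed_rate": 1.05},
  "prob": 0.6},
 {"type": "shift",
  "params": {"min_shift_ms": -5, "max_shift_ms": 5},
  "prob": 0.8}]
"""

SAMPLE_RATE = 16000  # assumed sample rate in Hz


def speed_perturb(samples, rng, min_speed_rate, max_speed_rate):
    """Resample by linear interpolation; a rate above 1 shortens the clip."""
    if not samples:
        return []
    rate = rng.uniform(min_speed_rate, max_speed_rate)
    new_len = max(1, int(len(samples) / rate))
    last = len(samples) - 1
    out = []
    for i in range(new_len):
        pos = i * rate
        lo = min(int(pos), last)
        hi = min(lo + 1, last)
        frac = pos - lo
        out.append(samples[lo] * (1.0 - frac) + samples[hi] * frac)
    return out


def shift_perturb(samples, rng, min_shift_ms, max_shift_ms):
    """Shift in time, zero-padding so the clip length is unchanged."""
    offset = int(rng.uniform(min_shift_ms, max_shift_ms) * SAMPLE_RATE / 1000)
    if offset == 0:
        return list(samples)
    if abs(offset) >= len(samples):
        return [0.0] * len(samples)
    if offset > 0:
        return [0.0] * offset + samples[:-offset]
    return samples[-offset:] + [0.0] * (-offset)


# Hypothetical registry mapping the config "type" field to a function.
AUGMENTORS = {"speed": speed_perturb, "shift": shift_perturb}


def augment(samples, config, rng):
    """Apply each component in order, each with its own probability."""
    for component in config:
        if rng.random() < component["prob"]:
            func = AUGMENTORS[component["type"]]
            samples = func(samples, rng, **component["params"])
    return samples


config = json.loads(CONFIG_JSON)
clip = [float(i % 100) for i in range(SAMPLE_RATE)]  # one second of dummy audio
augmented = augment(clip, config, random.Random(0))
```

Because each component draws its own random number, a clip may receive both perturbations, only one, or neither in a given epoch, which is why the same clip looks different from epoch to epoch.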

@ -27,7 +27,6 @@ Contents
asr/models_introduction
asr/data_preparation
asr/augmentation
asr/feature_list
asr/ngram_lm

@ -257,6 +257,7 @@ python3 ${BIN_DIR}/../synthesize_e2e.py \
--output_dir=exp/default/test_e2e \
--phones_dict=fastspeech2_nosil_aishell3_ckpt_0.4/phone_id_map.txt \
--speaker_dict=fastspeech2_nosil_aishell3_ckpt_0.4/speaker_id_map.txt \
- --spk_id=0
+ --spk_id=0 \
+ --inference_dir=exp/default/inference
```

@ -0,0 +1,19 @@
#!/bin/bash
train_output_path=$1
stage=0
stop_stage=0
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
python3 ${BIN_DIR}/../inference.py \
--inference_dir=${train_output_path}/inference \
--am=fastspeech2_aishell3 \
--voc=pwgan_aishell3 \
--text=${BIN_DIR}/../sentences.txt \
--output_dir=${train_output_path}/pd_infer_out \
--phones_dict=dump/phone_id_map.txt \
--speaker_dict=dump/speaker_id_map.txt \
--spk_id=0
fi

@ -20,4 +20,5 @@ python3 ${BIN_DIR}/../synthesize_e2e.py \
--output_dir=${train_output_path}/test_e2e \
--phones_dict=dump/phone_id_map.txt \
--speaker_dict=dump/speaker_id_map.txt \
- --spk_id=0
+ --spk_id=0 \
+ --inference_dir=${train_output_path}/inference

@ -240,13 +240,14 @@ python3 ${BIN_DIR}/../synthesize_e2e.py \
--am_ckpt=fastspeech2_nosil_vctk_ckpt_0.5/snapshot_iter_66200.pdz \
--am_stat=fastspeech2_nosil_vctk_ckpt_0.5/speech_stats.npy \
--voc=pwgan_vctk \
- --voc_config=pwg_vctk_ckpt_0.5/pwg_default.yaml \
- --voc_ckpt=pwg_vctk_ckpt_0.5/pwg_snapshot_iter_1000000.pdz \
- --voc_stat=pwg_vctk_ckpt_0.5/pwg_stats.npy \
+ --voc_config=pwg_vctk_ckpt_0.1.1/default.yaml \
+ --voc_ckpt=pwg_vctk_ckpt_0.1.1/snapshot_iter_1500000.pdz \
+ --voc_stat=pwg_vctk_ckpt_0.1.1/feats_stats.npy \
--lang=en \
--text=${BIN_DIR}/../sentences_en.txt \
--output_dir=exp/default/test_e2e \
--phones_dict=fastspeech2_nosil_vctk_ckpt_0.5/phone_id_map.txt \
--speaker_dict=fastspeech2_nosil_vctk_ckpt_0.5/speaker_id_map.txt \
- --spk_id=0
+ --spk_id=0 \
+ --inference_dir=exp/default/inference
```

@ -0,0 +1,20 @@
#!/bin/bash
train_output_path=$1
stage=0
stop_stage=0
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
python3 ${BIN_DIR}/../inference.py \
--inference_dir=${train_output_path}/inference \
--am=fastspeech2_vctk \
--voc=pwgan_vctk \
--text=${BIN_DIR}/../sentences_en.txt \
--output_dir=${train_output_path}/pd_infer_out \
--phones_dict=dump/phone_id_map.txt \
--speaker_dict=dump/speaker_id_map.txt \
--spk_id=0 \
--lang=en
fi

@ -20,4 +20,5 @@ python3 ${BIN_DIR}/../synthesize_e2e.py \
--output_dir=${train_output_path}/test_e2e \
--phones_dict=dump/phone_id_map.txt \
--speaker_dict=dump/speaker_id_map.txt \
- --spk_id=0
+ --spk_id=0 \
+ --inference_dir=${train_output_path}/inference

@ -14,9 +14,11 @@
import argparse
from pathlib import Path
import numpy
import soundfile as sf
from paddle import inference
from paddlespeech.t2s.frontend import English
from paddlespeech.t2s.frontend.zh_frontend import Frontend
@ -29,20 +31,38 @@ def main():
'--am',
type=str,
default='fastspeech2_csmsc',
- choices=['speedyspeech_csmsc', 'fastspeech2_csmsc'],
+ choices=[
+ 'speedyspeech_csmsc', 'fastspeech2_csmsc', 'fastspeech2_aishell3',
+ 'fastspeech2_vctk'
+ ],
help='Choose acoustic model type of tts task.')
parser.add_argument(
"--phones_dict", type=str, default=None, help="phone vocabulary file.")
parser.add_argument(
"--tones_dict", type=str, default=None, help="tone vocabulary file.")
parser.add_argument(
"--speaker_dict", type=str, default=None, help="speaker id map file.")
parser.add_argument(
'--spk_id',
type=int,
default=0,
help='spk id for multi speaker acoustic model')
# voc
parser.add_argument(
'--voc',
type=str,
default='pwgan_csmsc',
- choices=['pwgan_csmsc', 'mb_melgan_csmsc', 'hifigan_csmsc'],
+ choices=[
+ 'pwgan_csmsc', 'mb_melgan_csmsc', 'hifigan_csmsc', 'pwgan_aishell3',
+ 'pwgan_vctk'
+ ],
help='Choose vocoder type of tts task.')
# other
parser.add_argument(
'--lang',
type=str,
default='zh',
help='Choose model language. zh or en')
parser.add_argument(
"--text",
type=str,
@ -53,8 +73,12 @@ def main():
args, _ = parser.parse_known_args()
# frontend
if args.lang == 'zh':
frontend = Frontend(
phone_vocab_path=args.phones_dict, tone_vocab_path=args.tones_dict)
elif args.lang == 'en':
frontend = English(phone_vocab_path=args.phones_dict)
print("frontend done!")
# model: {model_name}_{dataset}
@ -83,30 +107,53 @@ def main():
print("in new inference")
# construct dataset for evaluation
sentences = []
with open(args.text, 'rt') as f:
for line in f:
items = line.strip().split()
utt_id = items[0]
if args.lang == 'zh':
sentence = "".join(items[1:])
elif args.lang == 'en':
sentence = " ".join(items[1:])
sentences.append((utt_id, sentence))
get_tone_ids = False
get_spk_id = False
if am_name == 'speedyspeech':
get_tone_ids = True
if am_dataset in {"aishell3", "vctk"} and args.speaker_dict:
get_spk_id = True
spk_id = numpy.array([args.spk_id])
am_input_names = am_predictor.get_input_names()
print("am_input_names:", am_input_names)
merge_sentences = True
for utt_id, sentence in sentences:
if args.lang == 'zh':
input_ids = frontend.get_input_ids(
sentence,
merge_sentences=merge_sentences,
get_tone_ids=get_tone_ids)
phone_ids = input_ids["phone_ids"]
elif args.lang == 'en':
input_ids = frontend.get_input_ids(
- sentence, merge_sentences=True, get_tone_ids=get_tone_ids)
+ sentence, merge_sentences=merge_sentences)
phone_ids = input_ids["phone_ids"]
else:
print("lang should be in {'zh', 'en'}!")
if get_tone_ids:
tone_ids = input_ids["tone_ids"]
tones = tone_ids[0].numpy()
tones_handle = am_predictor.get_input_handle(am_input_names[1])
tones_handle.reshape(tones.shape)
tones_handle.copy_from_cpu(tones)
if get_spk_id:
spk_id_handle = am_predictor.get_input_handle(am_input_names[1])
spk_id_handle.reshape(spk_id.shape)
spk_id_handle.copy_from_cpu(spk_id)
phones = phone_ids[0].numpy()
phones_handle = am_predictor.get_input_handle(am_input_names[0])
phones_handle.reshape(phones.shape)
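
The per-language sentence handling in this hunk can be sketched on its own. In the diff, an unsupported `--lang` value only triggers a `print`, and the code then appears to continue with `sentence` and `phone_ids` unset. The variant below is an illustration rather than the PR's code: the helper name `parse_sentences` is hypothetical, and raising `ValueError` is a design choice made for this sketch so that a bad language fails early.

```python
def parse_sentences(lines, lang):
    """Turn 'utt_id word word ...' lines into (utt_id, sentence) pairs.

    Chinese tokens are joined without spaces and English tokens with
    single spaces, mirroring the zh/en branches in the hunk above.
    """
    if lang not in {"zh", "en"}:
        # Assumption for this sketch: fail fast instead of printing.
        raise ValueError(f"lang should be in {{'zh', 'en'}}, got {lang!r}")
    joiner = "" if lang == "zh" else " "
    sentences = []
    for line in lines:
        items = line.strip().split()
        if not items:
            continue  # skip blank lines
        utt_id = items[0]
        sentences.append((utt_id, joiner.join(items[1:])))
    return sentences


english = parse_sentences(["001 Life was like a box of chocolates\n"], "en")
chinese = parse_sentences(["002 你好 世界\n"], "zh")
```

Validating the language once, before the loop, keeps the per-sentence code free of a third branch that cannot produce usable input.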

@ -159,9 +159,16 @@ def evaluate(args):
# acoustic model
if am_name == 'fastspeech2':
if am_dataset in {"aishell3", "vctk"} and args.speaker_dict:
- print(
- "Haven't test dygraph to static for multi speaker fastspeech2 now!"
- )
+ am_inference = jit.to_static(
+ am_inference,
+ input_spec=[
+ InputSpec([-1], dtype=paddle.int64),
+ InputSpec([1], dtype=paddle.int64)
+ ])
+ paddle.jit.save(am_inference,
+ os.path.join(args.inference_dir, args.am))
+ am_inference = paddle.jit.load(
+ os.path.join(args.inference_dir, args.am))
else:
am_inference = jit.to_static(
am_inference,

@ -781,7 +781,7 @@ class FastSpeech2(nn.Layer):
elif self.spk_embed_integration_type == "concat":
# concat hidden states with spk embeds and then apply projection
spk_emb = F.normalize(spk_emb).unsqueeze(1).expand(
- shape=[-1, hs.shape[1], -1])
+ shape=[-1, paddle.shape(hs)[1], -1])
hs = self.spk_projection(paddle.concat([hs, spk_emb], axis=-1))
else:
raise NotImplementedError("support only add or concat.")

@ -86,11 +86,13 @@ requirements = {
def write_version_py(filename='paddlespeech/__init__.py'):
import paddlespeech
- if hasattr(paddlespeech, "__version__") and paddlespeech.__version__ == VERSION:
+ if hasattr(paddlespeech,
+ "__version__") and paddlespeech.__version__ == VERSION:
return
with open(filename, "a") as f:
f.write(f"\n__version__ = '{VERSION}'\n")
def remove_version_py(filename='paddlespeech/__init__.py'):
with open(filename, "r") as f:
lines = f.readlines()
@ -256,8 +258,4 @@ setup_info = dict(
setup(**setup_info)
remove_version_py()
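
The stamp-then-clean pattern used by `write_version_py` and `remove_version_py` can be tried in isolation. The sketch below is a simplified assumption-laden version: the helper names `write_version` and `remove_version` are hypothetical, the `hasattr` check on the imported package is omitted, and the step that reopens the file for writing is assumed, since the hunk does not show it.

```python
import os
import tempfile

VERSION = "0.1.0"  # example value for this sketch


def write_version(filename, version):
    """Append a __version__ assignment to the given module file."""
    with open(filename, "a") as f:
        f.write(f"\n__version__ = '{version}'\n")


def remove_version(filename):
    """Rewrite the file without any line that mentions __version__."""
    with open(filename, "r") as f:
        lines = f.readlines()
    # Assumed step: reopen for writing and keep the remaining lines.
    with open(filename, "w") as f:
        for line in lines:
            if "__version__" not in line:
                f.write(line)


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "__init__.py")
    with open(path, "w") as f:
        f.write("import os\n")
    write_version(path, VERSION)
    with open(path) as f:
        stamped = f.read()
    remove_version(path)
    with open(path) as f:
        restored = f.read()
```

One detail worth noticing in this sketch: the blank line written before the version assignment is not removed afterwards, so `restored` ends with an extra newline compared with the original file.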

@ -19,11 +19,13 @@ VERSION = '0.1.0'
def write_version_py(filename='paddleaudio/__init__.py'):
import paddleaudio
- if hasattr(paddleaudio, "__version__") and paddleaudio.__version__ == VERSION:
+ if hasattr(paddleaudio,
+ "__version__") and paddleaudio.__version__ == VERSION:
return
with open(filename, "a") as f:
f.write(f"\n__version__ = '{VERSION}'\n")
def remove_version_py(filename='paddleaudio/__init__.py'):
with open(filename, "r") as f:
lines = f.readlines()
@ -32,6 +34,7 @@ def remove_version_py(filename='paddleaudio/__init__.py'):
if "__version__" not in line:
f.write(line)
write_version_py()
setuptools.setup(

@ -1,21 +1,19 @@
#!/usr/bin/env python3
# Apache 2.0 (http://www.apache.org/licenses/LICENSE-2.0)
'''
Merge training configs into a single inference config.
The single inference config is for the CLI, which takes only one config for inference.
The training configs include: model config, preprocess config, decode config, vocab file, and cmvn file.
'''
import yaml
import json
import os
import argparse
import json
import math
import os
from contextlib import redirect_stdout
from yacs.config import CfgNode
from paddlespeech.s2t.frontend.utility import load_dict
from contextlib import redirect_stdout
def save(save_path, config):
@ -29,11 +27,13 @@ def load(save_path):
config.merge_from_file(save_path)
return config
def load_json(json_path):
with open(json_path) as f:
json_content = json.load(f)
return json_content
def remove_config_part(config, key_list):
if len(key_list) == 0:
return
@ -41,6 +41,7 @@ def remove_config_part(config, key_list):
config = config[key_list[i]]
config.pop(key_list[-1])
def load_cmvn_from_json(cmvn_stats):
means = cmvn_stats['mean_stat']
variance = cmvn_stats['var_stat']
@ -54,14 +55,14 @@ def load_cmvn_from_json(cmvn_stats):
cmvn_stats = {"mean": means, "istd": variance}
return cmvn_stats
def merge_configs(
conf_path="conf/conformer.yaml",
preprocess_path="conf/preprocess.yaml",
decode_path="conf/tuning/decode.yaml",
vocab_path="data/vocab.txt",
cmvn_path="data/mean_std.json",
- save_path = "conf/conformer_infer.yaml",
- ):
+ save_path="conf/conformer_infer.yaml", ):
# Load the configs
config = load(conf_path)
@ -75,8 +76,7 @@ def merge_configs(
preprocess_config = load(preprocess_path)
for idx, process in enumerate(preprocess_config["process"]):
if process['type'] == "cmvn_json":
- preprocess_config["process"][idx][
- "cmvn_path"] = cmvn_stats
+ preprocess_config["process"][idx]["cmvn_path"] = cmvn_stats
break
config.preprocess_config = preprocess_config
@ -95,7 +95,8 @@ def merge_configs(
# Remove some parts of the config
if os.path.exists(preprocess_path):
- remove_train_list = ["train_manifest",
+ remove_train_list = [
+ "train_manifest",
"dev_manifest",
"test_manifest",
"n_epoch",
@ -126,7 +127,8 @@ def merge_configs(
"maxlen_out",
]
else:
- remove_train_list = ["train_manifest",
+ remove_train_list = [
+ "train_manifest",
"dev_manifest",
"test_manifest",
"n_epoch",
@ -153,12 +155,13 @@ def merge_configs(
save(save_path, config)
if __name__ == "__main__":
- parser = argparse.ArgumentParser(
- prog='Config merge', add_help=True)
+ parser = argparse.ArgumentParser(prog='Config merge', add_help=True)
parser.add_argument(
- '--cfg_pth', type=str, default = 'conf/transformer.yaml', help='origin config file')
+ '--cfg_pth',
+ type=str,
+ default='conf/transformer.yaml',
+ help='origin config file')
parser.add_argument(
'--pre_pth', type=str, default="conf/preprocess.yaml", help='')
parser.add_argument(
@ -177,7 +180,4 @@ if __name__ == "__main__":
preprocess_path=parser_args.pre_pth,
vocab_path=parser_args.vb_pth,
cmvn_path=parser_args.cmvn_pth,
- save_path = parser_args.save_pth,
- )
+ save_path=parser_args.save_pth, )
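
The way `remove_config_part` strips training-only keys from a nested config can be sketched with plain dictionaries. This is an illustration under assumptions: the script operates on a yacs `CfgNode`, whereas the example uses `dict`, and the nesting under `collator` and `training` is a hypothetical layout chosen for the demo. The key names `train_manifest` and `n_epoch` come from the removal list in the diff.

```python
def remove_config_part(config, key_list):
    """Delete the entry addressed by key_list from a nested mapping."""
    if len(key_list) == 0:
        return
    # Walk down to the parent of the entry to remove.
    for key in key_list[:-1]:
        config = config[key]
    config.pop(key_list[-1])


# Hypothetical nested layout, used only for this example.
config = {
    "collator": {"train_manifest": "data/manifest.train", "batch_size": 32},
    "training": {"n_epoch": 50},
}

remove_config_part(config, ["collator", "train_manifest"])
remove_config_part(config, ["training", "n_epoch"])
remove_config_part(config, [])  # an empty key list is a no-op
```

After these calls only the inference-relevant entry (`batch_size` in this toy layout) remains, which is the effect the merge script aims for before saving the single inference config.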
