Merge branch 'PaddlePaddle:develop' into develop

pull/2380/head
WongLaw 2 years ago committed by GitHub
commit fab5b3a3a6

@ -888,7 +888,7 @@ You are warmly welcome to submit questions in [discussions](https://github.com/P
</p> </p>
## Acknowledgement ## Acknowledgement
- Many thanks to [HighCWu](https://github.com/HighCWu)for adding [VITS-aishell3](./examples/aishell3/vits) and [VITS-VC](./examples/aishell3/vits-vc) examples. - Many thanks to [HighCWu](https://github.com/HighCWu) for adding [VITS-aishell3](./examples/aishell3/vits) and [VITS-VC](./examples/aishell3/vits-vc) examples.
- Many thanks to [david-95](https://github.com/david-95) for improving TTS, fixing the multi-punctuation bug, and contributing to multiple programs and data. - Many thanks to [david-95](https://github.com/david-95) for improving TTS, fixing the multi-punctuation bug, and contributing to multiple programs and data.
- Many thanks to [BarryKCL](https://github.com/BarryKCL) for improving the TTS Chinese frontend based on [G2PW](https://github.com/GitYCC/g2pW). - Many thanks to [BarryKCL](https://github.com/BarryKCL) for improving the TTS Chinese frontend based on [G2PW](https://github.com/GitYCC/g2pW).
- Many thanks to [yeyupiaoling](https://github.com/yeyupiaoling)/[PPASR](https://github.com/yeyupiaoling/PPASR)/[PaddlePaddle-DeepSpeech](https://github.com/yeyupiaoling/PaddlePaddle-DeepSpeech)/[VoiceprintRecognition-PaddlePaddle](https://github.com/yeyupiaoling/VoiceprintRecognition-PaddlePaddle)/[AudioClassification-PaddlePaddle](https://github.com/yeyupiaoling/AudioClassification-PaddlePaddle) for years of attention, constructive advice and great help. - Many thanks to [yeyupiaoling](https://github.com/yeyupiaoling)/[PPASR](https://github.com/yeyupiaoling/PPASR)/[PaddlePaddle-DeepSpeech](https://github.com/yeyupiaoling/PaddlePaddle-DeepSpeech)/[VoiceprintRecognition-PaddlePaddle](https://github.com/yeyupiaoling/VoiceprintRecognition-PaddlePaddle)/[AudioClassification-PaddlePaddle](https://github.com/yeyupiaoling/AudioClassification-PaddlePaddle) for years of attention, constructive advice and great help.

@ -13,9 +13,9 @@ if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
FLAGS_fraction_of_gpu_memory_to_use=0.01 \ FLAGS_fraction_of_gpu_memory_to_use=0.01 \
python3 ${BIN_DIR}/synthesize_e2e.py \ python3 ${BIN_DIR}/synthesize_e2e.py \
--task_name=synthesize \ --task_name=synthesize \
--wav_path=source/SSB03540307.wav\ --wav_path=source/SSB03540307.wav \
--old_str='请播放歌曲小苹果' \ --old_str='请播放歌曲小苹果' \
--new_str='歌曲真好听' \ --new_str='歌曲真好听' \
--source_lang=zh \ --source_lang=zh \
--target_lang=zh \ --target_lang=zh \
--erniesat_config=${config_path} \ --erniesat_config=${config_path} \

@ -29,9 +29,11 @@ Or train your MFA model reference to [mfa example](https://github.com/PaddlePadd
Assume the paths to the datasets are: Assume the paths to the datasets are:
- `~/datasets/data_aishell3` - `~/datasets/data_aishell3`
- `~/datasets/VCTK-Corpus-0.92` - `~/datasets/VCTK-Corpus-0.92`
Assume the path to the MFA results of the datasets are: Assume the path to the MFA results of the datasets are:
- `./aishell3_alignment_tone` - `./aishell3_alignment_tone`
- `./vctk_alignment` - `./vctk_alignment`
Run the command below to Run the command below to
1. **source path**. 1. **source path**.
2. preprocess the dataset. 2. preprocess the dataset.

@ -15,7 +15,7 @@ if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
python3 ${BIN_DIR}/synthesize_e2e.py \ python3 ${BIN_DIR}/synthesize_e2e.py \
--task_name=synthesize \ --task_name=synthesize \
--wav_path=source/p243_313.wav \ --wav_path=source/p243_313.wav \
--old_str='For that reason cover should not be given.' \ --old_str='For that reason cover should not be given' \
--new_str='今天天气很好' \ --new_str='今天天气很好' \
--source_lang=en \ --source_lang=en \
--target_lang=zh \ --target_lang=zh \
@ -36,8 +36,8 @@ if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
python3 ${BIN_DIR}/synthesize_e2e.py \ python3 ${BIN_DIR}/synthesize_e2e.py \
--task_name=synthesize \ --task_name=synthesize \
--wav_path=source/SSB03540307.wav \ --wav_path=source/SSB03540307.wav \
--old_str='请播放歌曲小苹果' \ --old_str='请播放歌曲小苹果' \
--new_str="Thank you!" \ --new_str="Thank you" \
--source_lang=zh \ --source_lang=zh \
--target_lang=en \ --target_lang=en \
--erniesat_config=${config_path} \ --erniesat_config=${config_path} \

@ -14,7 +14,7 @@ if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ]; then
python3 ${BIN_DIR}/synthesize_e2e.py \ python3 ${BIN_DIR}/synthesize_e2e.py \
--task_name=synthesize \ --task_name=synthesize \
--wav_path=source/p243_313.wav \ --wav_path=source/p243_313.wav \
--old_str='For that reason cover should not be given.' \ --old_str='For that reason cover should not be given' \
--new_str='I love you very much do you love me' \ --new_str='I love you very much do you love me' \
--source_lang=en \ --source_lang=en \
--target_lang=en \ --target_lang=en \
@ -36,8 +36,8 @@ if [ ${stage} -le 1 ] && [ ${stop_stage} -ge 1 ]; then
python3 ${BIN_DIR}/synthesize_e2e.py \ python3 ${BIN_DIR}/synthesize_e2e.py \
--task_name=edit \ --task_name=edit \
--wav_path=source/p243_313.wav \ --wav_path=source/p243_313.wav \
--old_str='For that reason cover should not be given.' \ --old_str='For that reason cover should not be given' \
--new_str='For that reason cover is not impossible to be given.' \ --new_str='For that reason cover is not impossible to be given' \
--source_lang=en \ --source_lang=en \
--target_lang=en \ --target_lang=en \
--erniesat_config=${config_path} \ --erniesat_config=${config_path} \

@ -148,4 +148,4 @@ source path.sh
CUDA_VISIBLE_DEVICES= bash ./local/test.sh ./data sv0_ecapa_tdnn_voxceleb12_ckpt_0_2_1/model/ conf/ecapa_tdnn.yaml CUDA_VISIBLE_DEVICES= bash ./local/test.sh ./data sv0_ecapa_tdnn_voxceleb12_ckpt_0_2_1/model/ conf/ecapa_tdnn.yaml
``` ```
The performance of the released models are shown in [this](./RESULTS.md) The performance of the released models are shown in [this](./RESULT.md)

@ -34,3 +34,15 @@ Pretrain model from http://mobvoi-speech-public.ufile.ucloud.cn/public/wenet/wen
| conformer | 32.52 M | conf/conformer.yaml | spec_aug | aishell1 | ctc_greedy_search | - | 0.052534 | | conformer | 32.52 M | conf/conformer.yaml | spec_aug | aishell1 | ctc_greedy_search | - | 0.052534 |
| conformer | 32.52 M | conf/conformer.yaml | spec_aug | aishell1 | ctc_prefix_beam_search | - | 0.052915 | | conformer | 32.52 M | conf/conformer.yaml | spec_aug | aishell1 | ctc_prefix_beam_search | - | 0.052915 |
| conformer | 32.52 M | conf/conformer.yaml | spec_aug | aishell1 | attention_rescoring | - | 0.047904 | | conformer | 32.52 M | conf/conformer.yaml | spec_aug | aishell1 | attention_rescoring | - | 0.047904 |
## Conformer Streaming Pretrained Model
Pretrain model from https://paddlespeech.bj.bcebos.com/s2t/wenetspeech/asr1/asr1_chunk_conformer_wenetspeech_ckpt_1.0.0a.model.tar.gz
| Model | Params | Config | Augmentation| Test set | Decode method | Chunk Size | CER |
| --- | --- | --- | --- | --- | --- | --- | --- |
| conformer | 32.52 M | conf/chunk_conformer.yaml | spec_aug | aishell1 | attention | 16 | 0.056273 |
| conformer | 32.52 M | conf/chunk_conformer.yaml | spec_aug | aishell1 | ctc_greedy_search | 16 | 0.078918 |
| conformer | 32.52 M | conf/chunk_conformer.yaml | spec_aug | aishell1 | ctc_prefix_beam_search | 16 | 0.079080 |
| conformer | 32.52 M | conf/chunk_conformer.yaml | spec_aug | aishell1 | attention_rescoring | 16 | 0.054401 |

@ -605,8 +605,8 @@ class U2BaseModel(ASRInterface, nn.Layer):
xs: paddle.Tensor, xs: paddle.Tensor,
offset: int, offset: int,
required_cache_size: int, required_cache_size: int,
att_cache: paddle.Tensor, # paddle.zeros([0, 0, 0, 0]) att_cache: paddle.Tensor=paddle.zeros([0, 0, 0, 0]),
cnn_cache: paddle.Tensor, # paddle.zeros([0, 0, 0, 0]) cnn_cache: paddle.Tensor=paddle.zeros([0, 0, 0, 0])
) -> Tuple[paddle.Tensor, paddle.Tensor, paddle.Tensor]: ) -> Tuple[paddle.Tensor, paddle.Tensor, paddle.Tensor]:
""" Export interface for c++ call, give input chunk xs, and return """ Export interface for c++ call, give input chunk xs, and return
output from time 0 to current chunk. output from time 0 to current chunk.

@ -86,7 +86,7 @@ class MultiHeadedAttention(nn.Layer):
self, self,
value: paddle.Tensor, value: paddle.Tensor,
scores: paddle.Tensor, scores: paddle.Tensor,
mask: paddle.Tensor, # paddle.ones([0, 0, 0], dtype=paddle.bool) mask: paddle.Tensor=paddle.ones([0, 0, 0], dtype=paddle.bool)
) -> paddle.Tensor: ) -> paddle.Tensor:
"""Compute attention context vector. """Compute attention context vector.
Args: Args:
@ -127,14 +127,13 @@ class MultiHeadedAttention(nn.Layer):
return self.linear_out(x) # (batch, time1, d_model) return self.linear_out(x) # (batch, time1, d_model)
def forward( def forward(self,
self,
query: paddle.Tensor, query: paddle.Tensor,
key: paddle.Tensor, key: paddle.Tensor,
value: paddle.Tensor, value: paddle.Tensor,
mask: paddle.Tensor, # paddle.ones([0,0,0], dtype=paddle.bool) mask: paddle.Tensor=paddle.ones([0, 0, 0], dtype=paddle.bool),
pos_emb: paddle.Tensor, # paddle.empty([0]) pos_emb: paddle.Tensor=paddle.empty([0]),
cache: paddle.Tensor # paddle.zeros([0,0,0,0]) cache: paddle.Tensor=paddle.zeros([0, 0, 0, 0])
) -> Tuple[paddle.Tensor, paddle.Tensor]: ) -> Tuple[paddle.Tensor, paddle.Tensor]:
"""Compute scaled dot product attention. """Compute scaled dot product attention.
Args: Args:
@ -244,14 +243,13 @@ class RelPositionMultiHeadedAttention(MultiHeadedAttention):
return x return x
def forward( def forward(self,
self,
query: paddle.Tensor, query: paddle.Tensor,
key: paddle.Tensor, key: paddle.Tensor,
value: paddle.Tensor, value: paddle.Tensor,
mask: paddle.Tensor, # paddle.ones([0,0,0], dtype=paddle.bool) mask: paddle.Tensor=paddle.ones([0, 0, 0], dtype=paddle.bool),
pos_emb: paddle.Tensor, # paddle.empty([0]) pos_emb: paddle.Tensor=paddle.empty([0]),
cache: paddle.Tensor # paddle.zeros([0,0,0,0]) cache: paddle.Tensor=paddle.zeros([0, 0, 0, 0])
) -> Tuple[paddle.Tensor, paddle.Tensor]: ) -> Tuple[paddle.Tensor, paddle.Tensor]:
"""Compute 'Scaled Dot Product Attention' with rel. positional encoding. """Compute 'Scaled Dot Product Attention' with rel. positional encoding.
Args: Args:

@ -108,8 +108,8 @@ class ConvolutionModule(nn.Layer):
def forward( def forward(
self, self,
x: paddle.Tensor, x: paddle.Tensor,
mask_pad: paddle.Tensor, # paddle.ones([0,0,0], dtype=paddle.bool) mask_pad: paddle.Tensor=paddle.ones([0, 0, 0], dtype=paddle.bool),
cache: paddle.Tensor # paddle.zeros([0,0,0,0]) cache: paddle.Tensor=paddle.zeros([0, 0, 0, 0])
) -> Tuple[paddle.Tensor, paddle.Tensor]: ) -> Tuple[paddle.Tensor, paddle.Tensor]:
"""Compute convolution module. """Compute convolution module.
Args: Args:

@ -121,16 +121,11 @@ class DecoderLayer(nn.Layer):
if self.concat_after: if self.concat_after:
tgt_concat = paddle.cat( tgt_concat = paddle.cat(
(tgt_q, self.self_attn(tgt_q, tgt, tgt, tgt_q_mask, (tgt_q, self.self_attn(tgt_q, tgt, tgt, tgt_q_mask)[0]), dim=-1)
paddle.empty([0]),
paddle.zeros([0, 0, 0, 0]))[0]),
dim=-1)
x = residual + self.concat_linear1(tgt_concat) x = residual + self.concat_linear1(tgt_concat)
else: else:
x = residual + self.dropout( x = residual + self.dropout(
self.self_attn(tgt_q, tgt, tgt, tgt_q_mask, self.self_attn(tgt_q, tgt, tgt, tgt_q_mask)[0])
paddle.empty([0]), paddle.zeros([0, 0, 0, 0]))[
0])
if not self.normalize_before: if not self.normalize_before:
x = self.norm1(x) x = self.norm1(x)
@ -139,15 +134,11 @@ class DecoderLayer(nn.Layer):
x = self.norm2(x) x = self.norm2(x)
if self.concat_after: if self.concat_after:
x_concat = paddle.cat( x_concat = paddle.cat(
(x, self.src_attn(x, memory, memory, memory_mask, (x, self.src_attn(x, memory, memory, memory_mask)[0]), dim=-1)
paddle.empty([0]),
paddle.zeros([0, 0, 0, 0]))[0]),
dim=-1)
x = residual + self.concat_linear2(x_concat) x = residual + self.concat_linear2(x_concat)
else: else:
x = residual + self.dropout( x = residual + self.dropout(
self.src_attn(x, memory, memory, memory_mask, self.src_attn(x, memory, memory, memory_mask)[0])
paddle.empty([0]), paddle.zeros([0, 0, 0, 0]))[0])
if not self.normalize_before: if not self.normalize_before:
x = self.norm2(x) x = self.norm2(x)

@ -175,9 +175,7 @@ class BaseEncoder(nn.Layer):
decoding_chunk_size, self.static_chunk_size, decoding_chunk_size, self.static_chunk_size,
num_decoding_left_chunks) num_decoding_left_chunks)
for layer in self.encoders: for layer in self.encoders:
xs, chunk_masks, _, _ = layer(xs, chunk_masks, pos_emb, mask_pad, xs, chunk_masks, _, _ = layer(xs, chunk_masks, pos_emb, mask_pad)
paddle.zeros([0, 0, 0, 0]),
paddle.zeros([0, 0, 0, 0]))
if self.normalize_before: if self.normalize_before:
xs = self.after_norm(xs) xs = self.after_norm(xs)
# Here we assume the mask is not changed in encoder layers, so just # Here we assume the mask is not changed in encoder layers, so just
@ -190,9 +188,9 @@ class BaseEncoder(nn.Layer):
xs: paddle.Tensor, xs: paddle.Tensor,
offset: int, offset: int,
required_cache_size: int, required_cache_size: int,
att_cache: paddle.Tensor, # paddle.zeros([0,0,0,0]) att_cache: paddle.Tensor=paddle.zeros([0, 0, 0, 0]),
cnn_cache: paddle.Tensor, # paddle.zeros([0,0,0,0]), cnn_cache: paddle.Tensor=paddle.zeros([0, 0, 0, 0]),
att_mask: paddle.Tensor, # paddle.ones([0,0,0], dtype=paddle.bool) att_mask: paddle.Tensor=paddle.ones([0, 0, 0], dtype=paddle.bool)
) -> Tuple[paddle.Tensor, paddle.Tensor, paddle.Tensor]: ) -> Tuple[paddle.Tensor, paddle.Tensor, paddle.Tensor]:
""" Forward just one chunk """ Forward just one chunk
Args: Args:
@ -255,7 +253,6 @@ class BaseEncoder(nn.Layer):
xs, xs,
att_mask, att_mask,
pos_emb, pos_emb,
mask_pad=paddle.ones([0, 0, 0], dtype=paddle.bool),
att_cache=att_cache[i:i + 1] if elayers > 0 else att_cache, att_cache=att_cache[i:i + 1] if elayers > 0 else att_cache,
cnn_cache=cnn_cache[i:i + 1] cnn_cache=cnn_cache[i:i + 1]
if paddle.shape(cnn_cache)[0] > 0 else cnn_cache, ) if paddle.shape(cnn_cache)[0] > 0 else cnn_cache, )
@ -328,8 +325,7 @@ class BaseEncoder(nn.Layer):
chunk_xs = xs[:, cur:end, :] chunk_xs = xs[:, cur:end, :]
(y, att_cache, cnn_cache) = self.forward_chunk( (y, att_cache, cnn_cache) = self.forward_chunk(
chunk_xs, offset, required_cache_size, att_cache, cnn_cache, chunk_xs, offset, required_cache_size, att_cache, cnn_cache)
paddle.ones([0, 0, 0], dtype=paddle.bool))
outputs.append(y) outputs.append(y)
offset += y.shape[1] offset += y.shape[1]

@ -76,10 +76,9 @@ class TransformerEncoderLayer(nn.Layer):
x: paddle.Tensor, x: paddle.Tensor,
mask: paddle.Tensor, mask: paddle.Tensor,
pos_emb: paddle.Tensor, pos_emb: paddle.Tensor,
mask_pad: paddle. mask_pad: paddle.Tensor=paddle.ones([0, 0, 0], dtype=paddle.bool),
Tensor, # paddle.ones([0, 0, 0], dtype=paddle.bool) att_cache: paddle.Tensor=paddle.zeros([0, 0, 0, 0]),
att_cache: paddle.Tensor, # paddle.zeros([0, 0, 0, 0]) cnn_cache: paddle.Tensor=paddle.zeros([0, 0, 0, 0])
cnn_cache: paddle.Tensor, # paddle.zeros([0, 0, 0, 0])
) -> Tuple[paddle.Tensor, paddle.Tensor, paddle.Tensor, paddle.Tensor]: ) -> Tuple[paddle.Tensor, paddle.Tensor, paddle.Tensor, paddle.Tensor]:
"""Compute encoded features. """Compute encoded features.
Args: Args:
@ -106,8 +105,7 @@ class TransformerEncoderLayer(nn.Layer):
if self.normalize_before: if self.normalize_before:
x = self.norm1(x) x = self.norm1(x)
x_att, new_att_cache = self.self_attn( x_att, new_att_cache = self.self_attn(x, x, x, mask, cache=att_cache)
x, x, x, mask, paddle.empty([0]), cache=att_cache)
if self.concat_after: if self.concat_after:
x_concat = paddle.concat((x, x_att), axis=-1) x_concat = paddle.concat((x, x_att), axis=-1)
@ -195,9 +193,9 @@ class ConformerEncoderLayer(nn.Layer):
x: paddle.Tensor, x: paddle.Tensor,
mask: paddle.Tensor, mask: paddle.Tensor,
pos_emb: paddle.Tensor, pos_emb: paddle.Tensor,
mask_pad: paddle.Tensor, #paddle.ones([0, 0, 0],dtype=paddle.bool) mask_pad: paddle.Tensor=paddle.ones([0, 0, 0], dtype=paddle.bool),
att_cache: paddle.Tensor, # paddle.zeros([0, 0, 0, 0]) att_cache: paddle.Tensor=paddle.zeros([0, 0, 0, 0]),
cnn_cache: paddle.Tensor, # paddle.zeros([0, 0, 0, 0]) cnn_cache: paddle.Tensor=paddle.zeros([0, 0, 0, 0])
) -> Tuple[paddle.Tensor, paddle.Tensor, paddle.Tensor, paddle.Tensor]: ) -> Tuple[paddle.Tensor, paddle.Tensor, paddle.Tensor, paddle.Tensor]:
"""Compute encoded features. """Compute encoded features.
Args: Args:

@ -19,6 +19,10 @@ from pathlib import Path
import paddle import paddle
from paddle import distributed as dist from paddle import distributed as dist
world_size = dist.get_world_size()
if world_size > 1:
dist.init_parallel_env()
from visualdl import LogWriter from visualdl import LogWriter
from paddlespeech.s2t.training.reporter import ObsScope from paddlespeech.s2t.training.reporter import ObsScope
@ -122,9 +126,6 @@ class Trainer():
else: else:
raise Exception("invalid device") raise Exception("invalid device")
if self.parallel:
self.init_parallel()
self.checkpoint = Checkpoint( self.checkpoint = Checkpoint(
kbest_n=self.config.checkpoint.kbest_n, kbest_n=self.config.checkpoint.kbest_n,
latest_n=self.config.checkpoint.latest_n) latest_n=self.config.checkpoint.latest_n)
@ -173,11 +174,6 @@ class Trainer():
""" """
return self.args.ngpu > 1 return self.args.ngpu > 1
def init_parallel(self):
"""Init environment for multiprocess training.
"""
dist.init_parallel_env()
@mp_tools.rank_zero_only @mp_tools.rank_zero_only
def save(self, tag=None, infos: dict=None): def save(self, tag=None, infos: dict=None):
"""Save checkpoint (model parameters and optimizer states). """Save checkpoint (model parameters and optimizer states).

@ -480,8 +480,7 @@ class PaddleASRConnectionHanddler:
self.offset, self.offset,
required_cache_size, required_cache_size,
att_cache=self.att_cache, att_cache=self.att_cache,
cnn_cache=self.cnn_cache, cnn_cache=self.cnn_cache)
att_mask=paddle.ones([0, 0, 0], dtype=paddle.bool))
outputs.append(y) outputs.append(y)
# update the global offset, in decoding frame unit # update the global offset, in decoding frame unit

@ -58,7 +58,7 @@ def _readtg(tg_path: str, lang: str='en', fs: int=24000, n_shift: int=300):
durations[-2] += durations[-1] durations[-2] += durations[-1]
durations = durations[:-1] durations = durations[:-1]
# replace ' and 'sil' with 'sp' # replace '' and 'sil' with 'sp'
phones = ['sp' if (phn == '' or phn == 'sil') else phn for phn in phones] phones = ['sp' if (phn == '' or phn == 'sil') else phn for phn in phones]
if lang == 'en': if lang == 'en':
@ -195,7 +195,7 @@ def words2phns(text: str, lang='en'):
wrd = wrd.upper() wrd = wrd.upper()
if (wrd not in ds): if (wrd not in ds):
wrd2phns[str(index) + '_' + wrd] = 'spn' wrd2phns[str(index) + '_' + wrd] = 'spn'
phns.extend('spn') phns.extend(['spn'])
else: else:
wrd2phns[str(index) + '_' + wrd] = word2phns_dict[wrd].split() wrd2phns[str(index) + '_' + wrd] = word2phns_dict[wrd].split()
phns.extend(word2phns_dict[wrd].split()) phns.extend(word2phns_dict[wrd].split())
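Note on the `phns.extend` change above: `list.extend` iterates over its argument, so passing the bare string `'spn'` appends three one-character items instead of one phone label. A minimal standalone illustration (the variable names are for this sketch only):

```python
# list.extend iterates over its argument; a str iterates character by character.
phns_old = []
phns_old.extend('spn')      # appends 's', 'p', 'n' as separate items

phns_new = []
phns_new.extend(['spn'])    # appends the single label 'spn'

print(phns_old)
print(phns_new)
```

With the old call, the phone sequence would be two entries longer than the word-to-phone mapping for every out-of-dictionary word.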

@ -137,9 +137,6 @@ def prep_feats_with_dur(wav_path: str,
new_wav = np.concatenate( new_wav = np.concatenate(
[wav_org[:wav_left_idx], blank_wav, wav_org[wav_right_idx:]]) [wav_org[:wav_left_idx], blank_wav, wav_org[wav_right_idx:]])
# the audio is masked correctly
sf.write(str("mask_wav.wav"), new_wav, samplerate=fs)
# 4. get old and new mel span to be mask # 4. get old and new mel span to be mask
old_span_bdy = get_span_bdy( old_span_bdy = get_span_bdy(
mfa_start=mfa_start, mfa_end=mfa_end, span_to_repl=span_to_repl) mfa_start=mfa_start, mfa_end=mfa_end, span_to_repl=span_to_repl)
@ -274,7 +271,8 @@ def get_wav(wav_path: str,
new_str: str='', new_str: str='',
duration_adjust: bool=True, duration_adjust: bool=True,
fs: int=24000, fs: int=24000,
n_shift: int=300): n_shift: int=300,
task_name: str='synthesize'):
outs = get_mlm_output( outs = get_mlm_output(
wav_path=wav_path, wav_path=wav_path,
@ -298,9 +296,11 @@ def get_wav(wav_path: str,
alt_wav = np.squeeze(alt_wav) alt_wav = np.squeeze(alt_wav)
old_time_bdy = [n_shift * x for x in old_span_bdy] old_time_bdy = [n_shift * x for x in old_span_bdy]
if task_name == 'edit':
wav_replaced = np.concatenate( wav_replaced = np.concatenate(
[wav_org[:old_time_bdy[0]], alt_wav, wav_org[old_time_bdy[1]:]]) [wav_org[:old_time_bdy[0]], alt_wav, wav_org[old_time_bdy[1]:]])
else:
wav_replaced = alt_wav
wav_dict = {"origin": wav_org, "output": wav_replaced} wav_dict = {"origin": wav_org, "output": wav_replaced}
return wav_dict return wav_dict
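Note on the `task_name` branch above: in `edit` mode the generated audio is spliced back into the original waveform between the old span boundaries, while in `synthesize` mode only the generated audio is returned. The sketch below mirrors that logic with plain Python lists instead of NumPy arrays; `assemble_output` and the toy values are assumptions for illustration and do not appear in the source.

```python
from typing import List


def assemble_output(wav_org: List[int],
                    alt_wav: List[int],
                    old_span_bdy: List[int],
                    n_shift: int,
                    task_name: str) -> List[int]:
    """Hypothetical helper mirroring the edit/synthesize branch in get_wav."""
    # Convert frame indices to sample indices, as the diff does.
    old_time_bdy = [n_shift * x for x in old_span_bdy]
    if task_name == 'edit':
        # Splice the generated audio into the original waveform.
        return wav_org[:old_time_bdy[0]] + alt_wav + wav_org[old_time_bdy[1]:]
    # 'synthesize': keep only the generated audio.
    return list(alt_wav)


wav_org = list(range(10))   # toy "waveform" of 10 samples
alt_wav = [100, 101]        # toy generated segment

edited = assemble_output(wav_org, alt_wav, [1, 3], n_shift=2, task_name='edit')
synthesized = assemble_output(
    wav_org, alt_wav, [1, 3], n_shift=2, task_name='synthesize')
print(edited)
print(synthesized)
```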
@ -356,7 +356,11 @@ def parse_args():
"--ngpu", type=int, default=1, help="if ngpu == 0, use cpu.") "--ngpu", type=int, default=1, help="if ngpu == 0, use cpu.")
# ernie sat related # ernie sat related
parser.add_argument("--task_name", type=str, help="task name") parser.add_argument(
"--task_name",
type=str,
choices=['edit', 'synthesize'],
help="task name.")
parser.add_argument("--wav_path", type=str, help="path of old wav") parser.add_argument("--wav_path", type=str, help="path of old wav")
parser.add_argument("--old_str", type=str, help="old string") parser.add_argument("--old_str", type=str, help="old string")
parser.add_argument("--new_str", type=str, help="new string") parser.add_argument("--new_str", type=str, help="new string")
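Note on the `choices` restriction above: `argparse` rejects any value outside the list and exits with a usage error, so a misspelled task name fails at parse time instead of silently falling through to the `else` branch. A minimal sketch with only this one argument (the `build_parser` wrapper is hypothetical):

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical wrapper holding only the --task_name argument."""
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--task_name",
        type=str,
        choices=['edit', 'synthesize'],
        help="task name.")
    return parser


parser = build_parser()
accepted = parser.parse_args(["--task_name", "edit"]).task_name

try:
    parser.parse_args(["--task_name", "edti"])  # misspelled on purpose
    rejected = False
except SystemExit:
    # argparse prints a usage message to stderr and calls sys.exit(2).
    rejected = True

# Omitting the flag is still allowed and yields None.
omitted = parser.parse_args([]).task_name
```

Because the flag is not marked `required`, an omitted `--task_name` still reaches the script's `else` branch.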
@ -410,10 +414,9 @@ if __name__ == '__main__':
if args.task_name == 'edit': if args.task_name == 'edit':
new_str = new_str new_str = new_str
elif args.task_name == 'synthesize': elif args.task_name == 'synthesize':
new_str = old_str + new_str new_str = old_str + ' ' + new_str
else: else:
new_str = old_str + new_str new_str = old_str + ' ' + new_str
print("new_str:", new_str)
# Extractor # Extractor
mel_extractor = LogMelFBank( mel_extractor = LogMelFBank(
@ -467,7 +470,8 @@ if __name__ == '__main__':
new_str=new_str, new_str=new_str,
duration_adjust=args.duration_adjust, duration_adjust=args.duration_adjust,
fs=erniesat_config.fs, fs=erniesat_config.fs,
n_shift=erniesat_config.n_shift) n_shift=erniesat_config.n_shift,
task_name=args.task_name)
sf.write( sf.write(
args.output_name, wav_dict['output'], samplerate=erniesat_config.fs) args.output_name, wav_dict['output'], samplerate=erniesat_config.fs)
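Note on the `old_str + ' ' + new_str` change above: with the trailing punctuation removed from the example scripts, concatenating English strings without a separator fuses the last word of `old_str` with the first word of `new_str`. The sketch below assumes that downstream processing splits English text on whitespace; it uses the strings from the example scripts in this diff.

```python
old_str = 'For that reason cover should not be given'
new_str = 'I love you very much do you love me'

joined_before = old_str + new_str        # previous behaviour
joined_after = old_str + ' ' + new_str   # behaviour after this change

words_before = joined_before.split()
words_after = joined_after.split()

print(words_before[7])   # the fused token
print(len(words_before), len(words_after))
```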

@ -15,6 +15,7 @@ dataline=$(cat ${FILENAME})
# parser params # parser params
IFS=$'\n' IFS=$'\n'
lines=(${dataline}) lines=(${dataline})
python=python
# The training params # The training params
model_name=$(func_parser_value "${lines[1]}") model_name=$(func_parser_value "${lines[1]}")
@ -68,7 +69,7 @@ if [[ ${MODE} = "benchmark_train" ]];then
if [[ ${model_name} == "pwgan" ]]; then if [[ ${model_name} == "pwgan" ]]; then
# Download and extract the csmsc dataset # Download and extract the csmsc dataset
wget -nc https://weixinxcxdb.oss-cn-beijing.aliyuncs.com/gwYinPinKu/BZNSYP.rar wget -nc https://paddle-wheel.bj.bcebos.com/benchmark/BZNSYP.rar
mkdir -p BZNSYP mkdir -p BZNSYP
unrar x BZNSYP.rar BZNSYP unrar x BZNSYP.rar BZNSYP
wget -nc https://paddlespeech.bj.bcebos.com/Parakeet/benchmark/durations.txt wget -nc https://paddlespeech.bj.bcebos.com/Parakeet/benchmark/durations.txt
@ -80,6 +81,10 @@ if [[ ${MODE} = "benchmark_train" ]];then
python ../paddlespeech/t2s/exps/gan_vocoder/normalize.py --metadata=dump/test/raw/metadata.jsonl --dumpdir=dump/test/norm --stats=dump/train/feats_stats.npy python ../paddlespeech/t2s/exps/gan_vocoder/normalize.py --metadata=dump/test/raw/metadata.jsonl --dumpdir=dump/test/norm --stats=dump/train/feats_stats.npy
fi fi
echo "barrier start"
PYTHON="${python}" bash test_tipc/barrier.sh
echo "barrier end"
if [[ ${model_name} == "mdtc" ]]; then if [[ ${model_name} == "mdtc" ]]; then
# Download and extract the Snips dataset # Download and extract the Snips dataset
wget https://paddlespeech.bj.bcebos.com/datasets/hey_snips_kws_4.0.tar.gz.1 wget https://paddlespeech.bj.bcebos.com/datasets/hey_snips_kws_4.0.tar.gz.1
