Merge branch 'PaddlePaddle:develop' into develop

pull/2493/head
liangym 3 years ago committed by GitHub
commit c737dab833
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23

@ -19,8 +19,6 @@
<div align="center">
<h4>
<a href="#quick-start"> Quick Start </a>
| <a href="#quick-start-server"> Quick Start Server </a>
| <a href="#quick-start-streaming-server"> Quick Start Streaming Server</a>
| <a href="#documents"> Documents </a>
| <a href="#model-list"> Models List </a>
| <a href="https://aistudio.baidu.com/aistudio/education/group/info/25130"> AIStudio Courses </a>
@ -159,6 +157,8 @@ Via the easy-to-use, efficient, flexible and scalable implementation, our vision
- 🧩 *Cascaded models application*: as an extension of the typical traditional audio tasks, we combine the workflows of the aforementioned tasks with other fields like Natural Language Processing (NLP) and Computer Vision (CV).
### Recent Update
- 🔥 2022.09.26: Add Voice Cloning, TTS finetune, and ERNIE-SAT in [PaddleSpeech Web Demo](./demos/speech_web).
- ⚡ 2022.09.09: Add AISHELL-3 Voice Cloning [example](./examples/aishell3/vc2) with ECAPA-TDNN speaker encoder.
- ⚡ 2022.08.25: Release TTS [finetune](./examples/other/tts_finetune/tts3) example.
- 🔥 2022.08.22: Add ERNIE-SAT models: [ERNIE-SAT-vctk](./examples/vctk/ernie_sat)、[ERNIE-SAT-aishell3](./examples/aishell3/ernie_sat)、[ERNIE-SAT-zh_en](./examples/aishell3_vctk/ernie_sat).
- 🔥 2022.08.15: Add [g2pW](https://github.com/GitYCC/g2pW) into TTS Chinese Text Frontend.
@ -705,7 +705,7 @@ PaddleSpeech supports a series of most popular models. They are summarized in [r
<tbody>
<tr>
<td>Speaker Verification</td>
<td>VoxCeleb12</td>
<td>VoxCeleb1/2</td>
<td>ECAPA-TDNN</td>
<td>
<a href = "./examples/voxceleb/sv0">ecapa-tdnn-voxceleb12</a>
@ -714,6 +714,31 @@ PaddleSpeech supports a series of most popular models. They are summarized in [r
</tbody>
</table>
<a name="SpeakerDiarization"></a>
**Speaker Diarization**
<table style="width:100%">
<thead>
<tr>
<th> Task </th>
<th> Dataset </th>
<th> Model Type </th>
<th> Example </th>
</tr>
</thead>
<tbody>
<tr>
<td>Speaker Diarization</td>
<td>AMI</td>
<td>ECAPA-TDNN + AHC / SC</td>
<td>
<a href = "./examples/ami/sd0">ecapa-tdnn-ami</a>
</td>
</tr>
</tbody>
</table>
<a name="PunctuationRestoration"></a>
**Punctuation Restoration**
@ -767,6 +792,7 @@ Normally, [Speech SoTA](https://paperswithcode.com/area/speech), [Audio SoTA](ht
- [Text-to-Speech](#TextToSpeech)
- [Audio Classification](#AudioClassification)
- [Speaker Verification](#SpeakerVerification)
- [Speaker Diarization](#SpeakerDiarization)
- [Punctuation Restoration](#PunctuationRestoration)
- [Community](#Community)
- [Welcome to contribute](#contribution)

@ -19,10 +19,8 @@
</p>
<div align="center">
<h4>
<a href="#安装"> 安装 </a>
| <a href="#快速开始"> 快速开始 </a>
| <a href="#快速使用服务"> 快速使用服务 </a>
| <a href="#快速使用流式服务"> 快速使用流式服务 </a>
| <a href="#教程文档"> 教程文档 </a>
| <a href="#模型列表"> 模型列表 </a>
| <a href="https://aistudio.baidu.com/aistudio/education/group/info/25130"> AIStudio 课程 </a>
@ -181,6 +179,8 @@
</div>
### 近期更新
- 🔥 2022.09.26: 新增 Voice Cloning, TTS finetune 和 ERNIE-SAT 到 [PaddleSpeech 网页应用](./demos/speech_web)。
- ⚡ 2022.09.09: 新增基于 ECAPA-TDNN 声纹模型的 AISHELL-3 Voice Cloning [示例](./examples/aishell3/vc2)。
- ⚡ 2022.08.25: 发布 TTS [finetune](./examples/other/tts_finetune/tts3) 示例。
- 🔥 2022.08.22: 新增 ERNIE-SAT 模型: [ERNIE-SAT-vctk](./examples/vctk/ernie_sat)、[ERNIE-SAT-aishell3](./examples/aishell3/ernie_sat)、[ERNIE-SAT-zh_en](./examples/aishell3_vctk/ernie_sat)。
- 🔥 2022.08.15: 将 [g2pW](https://github.com/GitYCC/g2pW) 引入 TTS 中文文本前端。
@ -717,8 +717,8 @@ PaddleSpeech 的 **语音合成** 主要包含三个模块:文本前端、声
</thead>
<tbody>
<tr>
<td>Speaker Verification</td>
<td>VoxCeleb12</td>
<td>声纹识别</td>
<td>VoxCeleb1/2</td>
<td>ECAPA-TDNN</td>
<td>
<a href = "./examples/voxceleb/sv0">ecapa-tdnn-voxceleb12</a>
@ -727,6 +727,31 @@ PaddleSpeech 的 **语音合成** 主要包含三个模块:文本前端、声
</tbody>
</table>
<a name="说话人日志模型"></a>
**说话人日志**
<table style="width:100%">
<thead>
<tr>
<th> 任务 </th>
<th> 数据集 </th>
<th> 模型类型 </th>
<th> 脚本 </th>
</tr>
</thead>
<tbody>
<tr>
<td>说话人日志</td>
<td>AMI</td>
<td>ECAPA-TDNN + AHC / SC</td>
<td>
<a href = "./examples/ami/sd0">ecapa-tdnn-ami</a>
</td>
</tr>
</tbody>
</table>
<a name="标点恢复模型"></a>
**标点恢复**
@ -786,6 +811,7 @@ PaddleSpeech 的 **语音合成** 主要包含三个模块:文本前端、声
- [语音合成](#语音合成模型)
- [声音分类](#声音分类模型)
- [声纹识别](#声纹识别模型)
- [说话人日志](#说话人日志模型)
- [标点恢复](#标点恢复模型)
- [技术交流群](#技术交流群)
- [欢迎贡献](#欢迎贡献)

@ -21,7 +21,7 @@ engine_list: ['asr_online']
################################### ASR #########################################
################### speech task: asr; engine_type: online #######################
asr_online:
model_type: 'conformer_online_wenetspeech'
model_type: 'conformer_u2pp_online_wenetspeech'
am_model: # the pdmodel file of am static model [optional]
am_params: # the pdiparams file of am static model [optional]
lang: 'zh'

@ -9,6 +9,7 @@ Acoustic Model | Training Data | Token-based | Size | Descriptions | CER | WER |
[Ds2 Online Aishell ASR0 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_online_aishell_fbank161_ckpt_0.2.1.model.tar.gz) | Aishell Dataset | Char-based | 491 MB | 2 Conv + 5 LSTM layers | 0.0666 |-| 151 h | [Ds2 Online Aishell ASR0](../../examples/aishell/asr0) | onnx/inference/python |
[Ds2 Offline Aishell ASR0 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_offline_aishell_ckpt_1.0.1.model.tar.gz)| Aishell Dataset | Char-based | 1.4 GB | 2 Conv + 5 bidirectional LSTM layers| 0.0554 |-| 151 h | [Ds2 Offline Aishell ASR0](../../examples/aishell/asr0) | inference/python |
[Conformer Online Wenetspeech ASR1 Model](https://paddlespeech.bj.bcebos.com/s2t/wenetspeech/asr1/asr1_chunk_conformer_wenetspeech_ckpt_1.0.0a.model.tar.gz) | WenetSpeech Dataset | Char-based | 457 MB | Encoder:Conformer, Decoder:Transformer, Decoding method: Attention rescoring| 0.11 (test\_net) 0.1879 (test\_meeting) |-| 10000 h |- | python |
[Conformer U2PP Online Wenetspeech ASR1 Model](https://paddlespeech.bj.bcebos.com/s2t/wenetspeech/asr1/asr1_chunk_conformer_u2pp_wenetspeech_ckpt_1.1.1.model.tar.gz) | WenetSpeech Dataset | Char-based | 476 MB | Encoder:Conformer, Decoder:BiTransformer, Decoding method: Attention rescoring| 0.047198 (aishell test\_-1) 0.059212 (aishell test\_16) |-| 10000 h |- | python |
[Conformer Online Aishell ASR1 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr1/asr1_chunk_conformer_aishell_ckpt_0.2.0.model.tar.gz) | Aishell Dataset | Char-based | 189 MB | Encoder:Conformer, Decoder:Transformer, Decoding method: Attention rescoring| 0.0544 |-| 151 h | [Conformer Online Aishell ASR1](../../examples/aishell/asr1) | python |
[Conformer Offline Aishell ASR1 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr1/asr1_conformer_aishell_ckpt_1.0.1.model.tar.gz) | Aishell Dataset | Char-based | 189 MB | Encoder:Conformer, Decoder:Transformer, Decoding method: Attention rescoring | 0.0460 |-| 151 h | [Conformer Offline Aishell ASR1](../../examples/aishell/asr1) | python |
[Transformer Aishell ASR1 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr1/asr1_transformer_aishell_ckpt_0.1.1.model.tar.gz) | Aishell Dataset | Char-based | 128 MB | Encoder:Transformer, Decoder:Transformer, Decoding method: Attention rescoring | 0.0523 || 151 h | [Transformer Aishell ASR1](../../examples/aishell/asr1) | python |

@ -26,6 +26,10 @@ if [ ${seed} != 0 ]; then
export FLAGS_cudnn_deterministic=True
fi
# The default memory allocator strategy may cause GPU training to hang;
# use naive_best_fit so an OOM error is raised when memory is exhausted.
export FLAGS_allocator_strategy=naive_best_fit
if [ ${ngpu} == 0 ]; then
python3 -u ${BIN_DIR}/train.py \
--ngpu ${ngpu} \
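These launch-script hunks export the `naive_best_fit` allocator before starting training. The same flag can also be set from Python, as long as it happens before `paddle` initializes its GPU allocator; a minimal sketch of that pattern (not part of this commit):

```python
import os

# Mirror `export FLAGS_allocator_strategy=naive_best_fit` from the shell
# scripts above; the flag must be set before paddle is imported.
os.environ['FLAGS_allocator_strategy'] = 'naive_best_fit'

import paddle  # noqa: E402  -- imported after the flag is set on purpose
```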

@ -35,6 +35,10 @@ echo ${ips_config}
mkdir -p exp
# The default memory allocator strategy may cause GPU training to hang;
# use naive_best_fit so an OOM error is raised when memory is exhausted.
export FLAGS_allocator_strategy=naive_best_fit
if [ ${ngpu} == 0 ]; then
python3 -u ${BIN_DIR}/train.py \
--ngpu ${ngpu} \

@ -0,0 +1,44 @@
###########################################################
# DATA SETTING #
###########################################################
dataset_type: Ernie
train_path: data/iwslt2012_zh/train.txt
dev_path: data/iwslt2012_zh/dev.txt
test_path: data/iwslt2012_zh/test.txt
batch_size: 64
num_workers: 2
data_params:
pretrained_token: ernie-3.0-base-zh
punc_path: data/iwslt2012_zh/punc_vocab
seq_len: 100
###########################################################
# MODEL SETTING #
###########################################################
model_type: ErnieLinear
model:
pretrained_token: ernie-3.0-base-zh
num_classes: 4
###########################################################
# OPTIMIZER SETTING #
###########################################################
optimizer_params:
weight_decay: 1.0e-6 # weight decay coefficient.
scheduler_params:
learning_rate: 1.0e-5 # learning rate.
gamma: 0.9999 # scheduler gamma must be in (0.0, 1.0); the closer to 1.0, the better.
###########################################################
# TRAINING SETTING #
###########################################################
max_epoch: 20
###########################################################
# OTHER SETTING #
###########################################################
num_snapshots: 10 # max number of snapshots to keep while training
seed: 42 # random seed for paddle, random, and np.random
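Each of these configs (this one and the medium/mini/nano/tiny variants below) is consumed by the `ErnieLinear` code added later in this commit, which loads the YAML with `yacs` and builds the model from the `model` section. A minimal sketch of that pattern, assuming a hypothetical config filename:

```python
import yaml
from yacs.config import CfgNode

from paddlespeech.text.models.ernie_linear import ErnieLinear

# Load a config like the one above and build the model from its `model`
# section, mirroring TextExecutor._init_from_path_new later in this commit.
with open('conf/ernie-3.0-base-zh.yaml') as f:  # hypothetical path
    config = CfgNode(yaml.safe_load(f))

model = ErnieLinear(**config['model'])  # pretrained_token + num_classes
model.eval()
```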

@ -0,0 +1,44 @@
###########################################################
# DATA SETTING #
###########################################################
dataset_type: Ernie
train_path: data/iwslt2012_zh/train.txt
dev_path: data/iwslt2012_zh/dev.txt
test_path: data/iwslt2012_zh/test.txt
batch_size: 64
num_workers: 2
data_params:
pretrained_token: ernie-3.0-medium-zh
punc_path: data/iwslt2012_zh/punc_vocab
seq_len: 100
###########################################################
# MODEL SETTING #
###########################################################
model_type: ErnieLinear
model:
pretrained_token: ernie-3.0-medium-zh
num_classes: 4
###########################################################
# OPTIMIZER SETTING #
###########################################################
optimizer_params:
weight_decay: 1.0e-6 # weight decay coefficient.
scheduler_params:
learning_rate: 1.0e-5 # learning rate.
gamma: 0.9999 # scheduler gamma must be in (0.0, 1.0); the closer to 1.0, the better.
###########################################################
# TRAINING SETTING #
###########################################################
max_epoch: 20
###########################################################
# OTHER SETTING #
###########################################################
num_snapshots: 10 # max number of snapshots to keep while training
seed: 42 # random seed for paddle, random, and np.random

@ -0,0 +1,44 @@
###########################################################
# DATA SETTING #
###########################################################
dataset_type: Ernie
train_path: data/iwslt2012_zh/train.txt
dev_path: data/iwslt2012_zh/dev.txt
test_path: data/iwslt2012_zh/test.txt
batch_size: 64
num_workers: 2
data_params:
pretrained_token: ernie-3.0-mini-zh
punc_path: data/iwslt2012_zh/punc_vocab
seq_len: 100
###########################################################
# MODEL SETTING #
###########################################################
model_type: ErnieLinear
model:
pretrained_token: ernie-3.0-mini-zh
num_classes: 4
###########################################################
# OPTIMIZER SETTING #
###########################################################
optimizer_params:
weight_decay: 1.0e-6 # weight decay coefficient.
scheduler_params:
learning_rate: 1.0e-5 # learning rate.
gamma: 0.9999 # scheduler gamma must be in (0.0, 1.0); the closer to 1.0, the better.
###########################################################
# TRAINING SETTING #
###########################################################
max_epoch: 20
###########################################################
# OTHER SETTING #
###########################################################
num_snapshots: 10 # max number of snapshots to keep while training
seed: 42 # random seed for paddle, random, and np.random

@ -0,0 +1,44 @@
###########################################################
# DATA SETTING #
###########################################################
dataset_type: Ernie
train_path: data/iwslt2012_zh/train.txt
dev_path: data/iwslt2012_zh/dev.txt
test_path: data/iwslt2012_zh/test.txt
batch_size: 64
num_workers: 2
data_params:
pretrained_token: ernie-3.0-nano-zh
punc_path: data/iwslt2012_zh/punc_vocab
seq_len: 100
###########################################################
# MODEL SETTING #
###########################################################
model_type: ErnieLinear
model:
pretrained_token: ernie-3.0-nano-zh
num_classes: 4
###########################################################
# OPTIMIZER SETTING #
###########################################################
optimizer_params:
weight_decay: 1.0e-6 # weight decay coefficient.
scheduler_params:
learning_rate: 1.0e-5 # learning rate.
gamma: 0.9999 # scheduler gamma must be in (0.0, 1.0); the closer to 1.0, the better.
###########################################################
# TRAINING SETTING #
###########################################################
max_epoch: 20
###########################################################
# OTHER SETTING #
###########################################################
num_snapshots: 10 # max number of snapshots to keep while training
seed: 42 # random seed for paddle, random, and np.random

@ -0,0 +1,44 @@
###########################################################
# DATA SETTING #
###########################################################
dataset_type: Ernie
train_path: data/iwslt2012_zh/train.txt
dev_path: data/iwslt2012_zh/dev.txt
test_path: data/iwslt2012_zh/test.txt
batch_size: 64
num_workers: 2
data_params:
pretrained_token: ernie-tiny
punc_path: data/iwslt2012_zh/punc_vocab
seq_len: 100
###########################################################
# MODEL SETTING #
###########################################################
model_type: ErnieLinear
model:
pretrained_token: ernie-tiny
num_classes: 4
###########################################################
# OPTIMIZER SETTING #
###########################################################
optimizer_params:
weight_decay: 1.0e-6 # weight decay coefficient.
scheduler_params:
learning_rate: 1.0e-5 # learning rate.
gamma: 0.9999 # scheduler gamma must be in (0.0, 1.0); the closer to 1.0, the better.
###########################################################
# TRAINING SETTING #
###########################################################
max_epoch: 20
###########################################################
# OTHER SETTING #
###########################################################
num_snapshots: 10 # max number of snapshots to keep while training
seed: 42 # random seed for paddle, random, and np.random

@ -26,6 +26,10 @@ if [ ${seed} != 0 ]; then
export FLAGS_cudnn_deterministic=True
fi
# The default memory allocator strategy may cause GPU training to hang;
# use naive_best_fit so an OOM error is raised when memory is exhausted.
export FLAGS_allocator_strategy=naive_best_fit
if [ ${ngpu} == 0 ]; then
python3 -u ${BIN_DIR}/train.py \
--ngpu ${ngpu} \

@ -29,6 +29,10 @@ fi
# export FLAGS_cudnn_exhaustive_search=true
# export FLAGS_conv_workspace_size_limit=4000
# The default memory allocator strategy may cause GPU training to hang;
# use naive_best_fit so an OOM error is raised when memory is exhausted.
export FLAGS_allocator_strategy=naive_best_fit
if [ ${ngpu} == 0 ]; then
python3 -u ${BIN_DIR}/train.py \
--ngpu ${ngpu} \

@ -26,6 +26,10 @@ if [ ${seed} != 0 ]; then
export FLAGS_cudnn_deterministic=True
fi
# The default memory allocator strategy may cause GPU training to hang;
# use naive_best_fit so an OOM error is raised when memory is exhausted.
export FLAGS_allocator_strategy=naive_best_fit
if [ ${ngpu} == 0 ]; then
python3 -u ${BIN_DIR}/train.py \
--ngpu ${ngpu} \

@ -19,6 +19,10 @@ if [ ${seed} != 0 ]; then
export FLAGS_cudnn_deterministic=True
fi
# The default memory allocator strategy may cause GPU training to hang;
# use naive_best_fit so an OOM error is raised when memory is exhausted.
export FLAGS_allocator_strategy=naive_best_fit
if [ ${ngpu} == 0 ]; then
python3 -u ${BIN_DIR}/train.py \
--ngpu ${ngpu} \

@ -32,6 +32,10 @@ fi
mkdir -p exp
# The default memory allocator strategy may cause GPU training to hang;
# use naive_best_fit so an OOM error is raised when memory is exhausted.
export FLAGS_allocator_strategy=naive_best_fit
if [ ${ngpu} == 0 ]; then
python3 -u ${BIN_DIR}/train.py \
--ngpu ${ngpu} \

@ -34,6 +34,10 @@ fi
mkdir -p exp
# The default memory allocator strategy may cause GPU training to hang;
# use naive_best_fit so an OOM error is raised when memory is exhausted.
export FLAGS_allocator_strategy=naive_best_fit
if [ ${ngpu} == 0 ]; then
python3 -u ${BIN_DIR}/train.py \
--ngpu ${ngpu} \

@ -35,6 +35,10 @@ echo ${ips_config}
mkdir -p exp
# The default memory allocator strategy may cause GPU training to hang;
# use naive_best_fit so an OOM error is raised when memory is exhausted.
export FLAGS_allocator_strategy=naive_best_fit
if [ ${ngpu} == 0 ]; then
python3 -u ${BIN_DIR}/train.py \
--ngpu ${ngpu} \

@ -42,3 +42,7 @@
```bash
paddlespeech text --task punc --input 今天的天气真不错啊你下午有空吗我想约你一起去吃饭
```
- Faster Punctuation Restoration
```bash
paddlespeech text --task punc --input 今天的天气真不错啊你下午有空吗我想约你一起去吃饭 --model ernie_linear_p3_wudao_fast
```
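The fast model can also be called through the Python API; a sketch assuming the `TextExecutor` keyword arguments shown later in this commit:

```python
from paddlespeech.cli.text.infer import TextExecutor

text_executor = TextExecutor()
result = text_executor(
    text='今天的天气真不错啊你下午有空吗我想约你一起去吃饭',
    task='punc',
    model='ernie_linear_p3_wudao_fast',  # routes to the new fast path
    lang='zh')
print(result)
```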

@ -43,3 +43,7 @@
```bash
paddlespeech text --task punc --input 今天的天气真不错啊你下午有空吗我想约你一起去吃饭
```
- 快速标点恢复
```bash
paddlespeech text --task punc --input 今天的天气真不错啊你下午有空吗我想约你一起去吃饭 --model ernie_linear_p3_wudao_fast
```

@ -12,6 +12,7 @@
# See the License for the specific language governing permissions and
# limitations under the License.
import argparse
import io
import os
import sys
import time
@ -51,7 +52,7 @@ class ASRExecutor(BaseExecutor):
self.parser.add_argument(
'--model',
type=str,
default='conformer_wenetspeech',
default='conformer_u2pp_wenetspeech',
choices=[
tag[:tag.index('-')]
for tag in self.task_resource.pretrained_models.keys()
@ -229,6 +230,8 @@ class ASRExecutor(BaseExecutor):
audio_file = input
if isinstance(audio_file, (str, os.PathLike)):
logger.debug("Preprocess audio_file:" + audio_file)
elif isinstance(audio_file, io.BytesIO):
audio_file.seek(0)
# Get the object for feature extraction
if "deepspeech2" in model_type or "conformer" in model_type or "transformer" in model_type:
@ -352,6 +355,8 @@ class ASRExecutor(BaseExecutor):
if not os.path.isfile(audio_file):
logger.error("Please input the right audio file path")
return False
elif isinstance(audio_file, io.BytesIO):
audio_file.seek(0)
logger.debug("checking the audio file format......")
try:
@ -465,7 +470,7 @@ class ASRExecutor(BaseExecutor):
@stats_wrapper
def __call__(self,
audio_file: os.PathLike,
model: str='conformer_wenetspeech',
model: str='conformer_u2pp_wenetspeech',
lang: str='zh',
sample_rate: int=16000,
config: os.PathLike=None,
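With the default switched to `conformer_u2pp_wenetspeech`, the Python API picks up the new model automatically; a minimal sketch (the wav path is hypothetical):

```python
from paddlespeech.cli.asr.infer import ASRExecutor

asr_executor = ASRExecutor()
# `model` now defaults to 'conformer_u2pp_wenetspeech'; it is passed
# explicitly here only for clarity. zh.wav is a hypothetical input file.
text = asr_executor(
    audio_file='zh.wav',
    model='conformer_u2pp_wenetspeech',
    lang='zh',
    sample_rate=16000)
print(text)
```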

@ -20,10 +20,13 @@ from typing import Optional
from typing import Union
import paddle
import yaml
from yacs.config import CfgNode
from ..executor import BaseExecutor
from ..log import logger
from ..utils import stats_wrapper
from paddlespeech.text.models.ernie_linear import ErnieLinear
__all__ = ['TextExecutor']
@ -139,6 +142,66 @@ class TextExecutor(BaseExecutor):
self.model.eval()
# Initialize new (fast) models.
def _init_from_path_new(self,
task: str='punc',
model_type: str='ernie_linear_p7_wudao',
lang: str='zh',
cfg_path: Optional[os.PathLike]=None,
ckpt_path: Optional[os.PathLike]=None,
vocab_file: Optional[os.PathLike]=None):
if hasattr(self, 'model'):
logger.debug('Model had been initialized.')
return
self.task = task
if cfg_path is None or ckpt_path is None or vocab_file is None:
tag = '-'.join([model_type, task, lang])
self.task_resource.set_task_model(tag, version=None)
self.cfg_path = os.path.join(
self.task_resource.res_dir,
self.task_resource.res_dict['cfg_path'])
self.ckpt_path = os.path.join(
self.task_resource.res_dir,
self.task_resource.res_dict['ckpt_path'])
self.vocab_file = os.path.join(
self.task_resource.res_dir,
self.task_resource.res_dict['vocab_file'])
else:
self.cfg_path = os.path.abspath(cfg_path)
self.ckpt_path = os.path.abspath(ckpt_path)
self.vocab_file = os.path.abspath(vocab_file)
model_name = model_type[:model_type.rindex('_')]
if self.task == 'punc':
# punc list
self._punc_list = []
with open(self.vocab_file, 'r') as f:
for line in f:
self._punc_list.append(line.strip())
# model
with open(self.cfg_path) as f:
config = CfgNode(yaml.safe_load(f))
self.model = ErnieLinear(**config["model"])
_, tokenizer_class = self.task_resource.get_model_class(model_name)
state_dict = paddle.load(self.ckpt_path)
self.model.set_state_dict(state_dict["main_params"])
self.model.eval()
# tokenizer: the fast version uses ernie-3.0-mini-zh; the slow version uses ernie-1.0.
if 'fast' not in model_type:
self.tokenizer = tokenizer_class.from_pretrained('ernie-1.0')
else:
self.tokenizer = tokenizer_class.from_pretrained(
'ernie-3.0-mini-zh')
else:
raise NotImplementedError
def _clean_text(self, text):
text = text.lower()
text = re.sub('[^A-Za-z0-9\u4e00-\u9fa5]', '', text)
@ -179,7 +242,7 @@ class TextExecutor(BaseExecutor):
else:
raise NotImplementedError
def postprocess(self) -> Union[str, os.PathLike]:
def postprocess(self, isNewTrainer: bool=False) -> Union[str, os.PathLike]:
"""
Output postprocess and return human-readable results such as texts and audio files.
"""
@ -192,13 +255,13 @@ class TextExecutor(BaseExecutor):
input_ids[1:seq_len - 1])
labels = preds[1:seq_len - 1].tolist()
assert len(tokens) == len(labels)
if isNewTrainer:
self._punc_list = [0] + self._punc_list
text = ''
for t, l in zip(tokens, labels):
text += t
if l != 0: # label 0 means no punctuation.
text += self._punc_list[l]
return text
else:
raise NotImplementedError
@ -255,10 +318,20 @@ class TextExecutor(BaseExecutor):
"""
Python API to call an executor.
"""
paddle.set_device(device)
self._init_from_path(task, model, lang, config, ckpt_path, punc_vocab)
self.preprocess(text)
self.infer()
res = self.postprocess() # Retrieve result of text task.
# Old-version models go through the original initialization path.
if model in ['ernie_linear_p7_wudao', 'ernie_linear_p3_wudao']:
paddle.set_device(device)
self._init_from_path(task, model, lang, config, ckpt_path,
punc_vocab)
self.preprocess(text)
self.infer()
res = self.postprocess() # Retrieve result of text task.
# New (fast) models go through the new initialization path.
else:
paddle.set_device(device)
self._init_from_path_new(task, model, lang, config, ckpt_path,
punc_vocab)
self.preprocess(text)
self.infer()
res = self.postprocess(isNewTrainer=True)
return res

@ -25,6 +25,8 @@ model_alias = {
"deepspeech2online": ["paddlespeech.s2t.models.ds2:DeepSpeech2Model"],
"conformer": ["paddlespeech.s2t.models.u2:U2Model"],
"conformer_online": ["paddlespeech.s2t.models.u2:U2Model"],
"conformer_u2pp": ["paddlespeech.s2t.models.u2:U2Model"],
"conformer_u2pp_online": ["paddlespeech.s2t.models.u2:U2Model"],
"transformer": ["paddlespeech.s2t.models.u2:U2Model"],
"wenetspeech": ["paddlespeech.s2t.models.u2:U2Model"],
@ -51,6 +53,10 @@ model_alias = {
"paddlespeech.text.models:ErnieLinear",
"paddlenlp.transformers:ErnieTokenizer"
],
"ernie_linear_p3_wudao": [
"paddlespeech.text.models:ErnieLinear",
"paddlenlp.transformers:ErnieTokenizer"
],
# ---------------------------------
# -------------- TTS --------------

@ -68,6 +68,46 @@ asr_dynamic_pretrained_models = {
'',
},
},
"conformer_u2pp_wenetspeech-zh-16k": {
'1.1': {
'url':
'https://paddlespeech.bj.bcebos.com/s2t/wenetspeech/asr1/asr1_chunk_conformer_u2pp_wenetspeech_ckpt_1.1.1.model.tar.gz',
'md5':
'eae678c04ed3b3f89672052fdc0c5e10',
'cfg_path':
'model.yaml',
'ckpt_path':
'exp/chunk_conformer_u2pp/checkpoints/avg_10',
'model':
'exp/chunk_conformer_u2pp/checkpoints/avg_10.pdparams',
'params':
'exp/chunk_conformer_u2pp/checkpoints/avg_10.pdparams',
'lm_url':
'',
'lm_md5':
'',
},
},
"conformer_u2pp_online_wenetspeech-zh-16k": {
'1.1': {
'url':
'https://paddlespeech.bj.bcebos.com/s2t/wenetspeech/asr1/asr1_chunk_conformer_u2pp_wenetspeech_ckpt_1.1.2.model.tar.gz',
'md5':
'925d047e9188dea7f421a718230c9ae3',
'cfg_path':
'model.yaml',
'ckpt_path':
'exp/chunk_conformer_u2pp/checkpoints/avg_10',
'model':
'exp/chunk_conformer_u2pp/checkpoints/avg_10.pdparams',
'params':
'exp/chunk_conformer_u2pp/checkpoints/avg_10.pdparams',
'lm_url':
'',
'lm_md5':
'',
},
},
"conformer_online_multicn-zh-16k": {
'1.0': {
'url':
@ -529,7 +569,7 @@ text_dynamic_pretrained_models = {
'ckpt/model_state.pdparams',
'vocab_file':
'punc_vocab.txt',
},
}
},
"ernie_linear_p3_wudao-punc-zh": {
'1.0': {
@ -543,8 +583,22 @@ text_dynamic_pretrained_models = {
'ckpt/model_state.pdparams',
'vocab_file':
'punc_vocab.txt',
},
}
},
"ernie_linear_p3_wudao_fast-punc-zh": {
'1.0': {
'url':
'https://paddlespeech.bj.bcebos.com/text/ernie_linear_p3_wudao_fast-punc-zh.tar.gz',
'md5':
'c93f9594119541a5dbd763381a751d08',
'cfg_path':
'ckpt/model_config.json',
'ckpt_path':
'ckpt/model_state.pdparams',
'vocab_file':
'punc_vocab.txt',
}
}
}
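For reference, the executor builds the lookup key into this table from the model, task, and language names, as `_init_from_path_new` does earlier in this commit:

```python
# tag = '{model}-{task}-{lang}', the key format of the dicts above.
tag = '-'.join(['ernie_linear_p3_wudao_fast', 'punc', 'zh'])
assert tag == 'ernie_linear_p3_wudao_fast-punc-zh'
```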
# ---------------------------------

@ -40,7 +40,6 @@ class U2Infer():
self.preprocess_conf = config.preprocess_config
self.preprocess_args = {"train": False}
self.preprocessing = Transformation(self.preprocess_conf)
self.reverse_weight = getattr(config.model_conf, 'reverse_weight', 0.0)
self.text_feature = TextFeaturizer(
unit_type=config.unit_type,
vocab=config.vocab_filepath,
@ -89,8 +88,7 @@ class U2Infer():
ctc_weight=decode_config.ctc_weight,
decoding_chunk_size=decode_config.decoding_chunk_size,
num_decoding_left_chunks=decode_config.num_decoding_left_chunks,
simulate_streaming=decode_config.simulate_streaming,
reverse_weight=self.reverse_weight)
simulate_streaming=decode_config.simulate_streaming)
rsl = result_transcripts[0][0]
utt = Path(self.audio_file).name
logger.info(f"hyp: {utt} {result_transcripts[0][0]}")

@ -316,7 +316,6 @@ class U2Tester(U2Trainer):
vocab=self.config.vocab_filepath,
spm_model_prefix=self.config.spm_model_prefix)
self.vocab_list = self.text_feature.vocab_list
self.reverse_weight = getattr(config.model_conf, 'reverse_weight', 0.0)
def id2token(self, texts, texts_len, text_feature):
""" ord() id to chr() chr """
@ -351,8 +350,7 @@ class U2Tester(U2Trainer):
ctc_weight=decode_config.ctc_weight,
decoding_chunk_size=decode_config.decoding_chunk_size,
num_decoding_left_chunks=decode_config.num_decoding_left_chunks,
simulate_streaming=decode_config.simulate_streaming,
reverse_weight=self.reverse_weight)
simulate_streaming=decode_config.simulate_streaming)
decode_time = time.time() - start_time
for utt, target, result, rec_tids in zip(

@ -507,16 +507,14 @@ class U2BaseModel(ASRInterface, nn.Layer):
num_decoding_left_chunks, simulate_streaming)
return hyps[0][0]
def attention_rescoring(
self,
speech: paddle.Tensor,
speech_lengths: paddle.Tensor,
beam_size: int,
decoding_chunk_size: int=-1,
num_decoding_left_chunks: int=-1,
ctc_weight: float=0.0,
simulate_streaming: bool=False,
reverse_weight: float=0.0, ) -> List[int]:
def attention_rescoring(self,
speech: paddle.Tensor,
speech_lengths: paddle.Tensor,
beam_size: int,
decoding_chunk_size: int=-1,
num_decoding_left_chunks: int=-1,
ctc_weight: float=0.0,
simulate_streaming: bool=False) -> List[int]:
""" Apply attention rescoring decoding, CTC prefix beam search
is applied first to get nbest, then we resoring the nbest on
attention decoder with corresponding encoder out
@ -536,7 +534,7 @@ class U2BaseModel(ASRInterface, nn.Layer):
"""
assert speech.shape[0] == speech_lengths.shape[0]
assert decoding_chunk_size != 0
if reverse_weight > 0.0:
if self.reverse_weight > 0.0:
# decoder should be a bitransformer decoder if reverse_weight > 0.0
assert hasattr(self.decoder, 'right_decoder')
device = speech.place
@ -574,7 +572,7 @@ class U2BaseModel(ASRInterface, nn.Layer):
self.eos)
decoder_out, r_decoder_out, _ = self.decoder(
encoder_out, encoder_mask, hyps_pad, hyps_lens, r_hyps_pad,
reverse_weight) # (beam_size, max_hyps_len, vocab_size)
self.reverse_weight) # (beam_size, max_hyps_len, vocab_size)
# ctc score in ln domain
decoder_out = paddle.nn.functional.log_softmax(decoder_out, axis=-1)
decoder_out = decoder_out.numpy()
@ -594,12 +592,13 @@ class U2BaseModel(ASRInterface, nn.Layer):
score += decoder_out[i][j][w]
# last decoder output token is `eos`, for the last decoder input token.
score += decoder_out[i][len(hyp[0])][self.eos]
if reverse_weight > 0:
if self.reverse_weight > 0:
r_score = 0.0
for j, w in enumerate(hyp[0]):
r_score += r_decoder_out[i][len(hyp[0]) - j - 1][w]
r_score += r_decoder_out[i][len(hyp[0])][self.eos]
score = score * (1 - reverse_weight) + r_score * reverse_weight
score = score * (1 - self.reverse_weight
) + r_score * self.reverse_weight
# add ctc score (which in ln domain)
score += hyp[1] * ctc_weight
if score > best_score:
@ -748,8 +747,7 @@ class U2BaseModel(ASRInterface, nn.Layer):
ctc_weight: float=0.0,
decoding_chunk_size: int=-1,
num_decoding_left_chunks: int=-1,
simulate_streaming: bool=False,
reverse_weight: float=0.0):
simulate_streaming: bool=False):
"""u2 decoding.
Args:
@ -821,8 +819,7 @@ class U2BaseModel(ASRInterface, nn.Layer):
decoding_chunk_size=decoding_chunk_size,
num_decoding_left_chunks=num_decoding_left_chunks,
ctc_weight=ctc_weight,
simulate_streaming=simulate_streaming,
reverse_weight=reverse_weight)
simulate_streaming=simulate_streaming)
hyps = [hyp]
else:
raise ValueError(f"Not support decoding method: {decoding_method}")
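The refactor above reads `reverse_weight` from the model config (`self.reverse_weight`) instead of threading it through every decode call; the score combination itself is unchanged. A standalone sketch of that formula (a hypothetical helper, not part of the commit):

```python
def rescore(forward_score: float, reverse_score: float, ctc_score: float,
            reverse_weight: float, ctc_weight: float) -> float:
    # Blend left-to-right and right-to-left attention-decoder scores, then
    # add the CTC prefix score (all in the log domain), as in
    # U2BaseModel.attention_rescoring above.
    score = forward_score * (1 - reverse_weight) + reverse_score * reverse_weight
    return score + ctc_score * ctc_weight
```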

@ -30,7 +30,7 @@ asr_online:
decode_method:
num_decoding_left_chunks: -1
force_yes: True
device: # cpu or gpu:id
device: cpu # cpu or gpu:id
continuous_decoding: True # enable continue decoding when endpoint detected
am_predictor_conf:
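This config drives the streaming ASR server, which can be launched from Python as well as from the CLI. A sketch assuming the standard `ServerExecutor` entry point (config and log paths are hypothetical):

```python
from paddlespeech.server.bin.paddlespeech_server import ServerExecutor

server_executor = ServerExecutor()
server_executor(
    config_file='./conf/ws_conformer_application.yaml',  # hypothetical path
    log_file='./log/paddlespeech.log')
```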

@ -22,6 +22,7 @@ from numpy import float32
from yacs.config import CfgNode
from paddlespeech.audio.transform.transformation import Transformation
from paddlespeech.audio.utils.tensor_utils import st_reverse_pad_list
from paddlespeech.cli.asr.infer import ASRExecutor
from paddlespeech.cli.log import logger
from paddlespeech.resource import CommonTaskResource
@ -602,24 +603,31 @@ class PaddleASRConnectionHanddler:
hyps_pad = pad_sequence(
hyp_list, batch_first=True, padding_value=self.model.ignore_id)
ori_hyps_pad = hyps_pad
hyps_lens = paddle.to_tensor(
[len(hyp[0]) for hyp in hyps], place=self.device,
dtype=paddle.long) # (beam_size,)
hyps_pad, _ = add_sos_eos(hyps_pad, self.model.sos, self.model.eos,
self.model.ignore_id)
hyps_lens = hyps_lens + 1 # Add <sos> at the beginning
encoder_out = self.encoder_out.repeat(beam_size, 1, 1)
encoder_mask = paddle.ones(
(beam_size, 1, encoder_out.shape[1]), dtype=paddle.bool)
decoder_out, _, _ = self.model.decoder(
encoder_out, encoder_mask, hyps_pad,
hyps_lens) # (beam_size, max_hyps_len, vocab_size)
r_hyps_pad = st_reverse_pad_list(ori_hyps_pad, hyps_lens - 1,
self.model.sos, self.model.eos)
decoder_out, r_decoder_out, _ = self.model.decoder(
encoder_out, encoder_mask, hyps_pad, hyps_lens, r_hyps_pad,
self.model.reverse_weight) # (beam_size, max_hyps_len, vocab_size)
# ctc score in ln domain
decoder_out = paddle.nn.functional.log_softmax(decoder_out, axis=-1)
decoder_out = decoder_out.numpy()
# r_decoder_out will be 0.0 if reverse_weight is 0.0 or the decoder is a
# conventional transformer decoder.
r_decoder_out = paddle.nn.functional.log_softmax(r_decoder_out, axis=-1)
r_decoder_out = r_decoder_out.numpy()
# Only use decoder score for rescoring
best_score = -float('inf')
best_index = 0
@ -631,6 +639,13 @@ class PaddleASRConnectionHanddler:
# last decoder output token is `eos`, for the last decoder input token.
score += decoder_out[i][len(hyp[0])][self.model.eos]
if self.model.reverse_weight > 0:
r_score = 0.0
for j, w in enumerate(hyp[0]):
r_score += r_decoder_out[i][len(hyp[0]) - j - 1][w]
r_score += r_decoder_out[i][len(hyp[0])][self.model.eos]
score = score * (1 - self.model.reverse_weight
) + r_score * self.model.reverse_weight
# add ctc score (which in ln domain)
score += hyp[1] * self.ctc_decode_config.ctc_weight

@ -107,11 +107,14 @@ class PaddleTextConnectionHandler:
assert len(tokens) == len(labels)
text = ''
is_fast_model = 'fast' in self.text_engine.config.model_type
for t, l in zip(tokens, labels):
text += t
if l != 0: # label 0 means no punctuation.
text += self._punc_list[l]
if is_fast_model:
text += self._punc_list[l - 1]
else:
text += self._punc_list[l]
return text
else:
raise NotImplementedError
@ -160,14 +163,23 @@ class TextEngine(BaseEngine):
return False
self.executor = TextServerExecutor()
self.executor._init_from_path(
task=config.task,
model_type=config.model_type,
lang=config.lang,
cfg_path=config.cfg_path,
ckpt_path=config.ckpt_path,
vocab_file=config.vocab_file)
if 'fast' in config.model_type:
self.executor._init_from_path_new(
task=config.task,
model_type=config.model_type,
lang=config.lang,
cfg_path=config.cfg_path,
ckpt_path=config.ckpt_path,
vocab_file=config.vocab_file)
else:
self.executor._init_from_path(
task=config.task,
model_type=config.model_type,
lang=config.lang,
cfg_path=config.cfg_path,
ckpt_path=config.ckpt_path,
vocab_file=config.vocab_file)
logger.info("Using model: %s." % (config.model_type))
logger.info("Initialize Text server engine successfully on device: %s."
% (self.device))
return True

@ -7,7 +7,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/cat.wav https://paddlespe
paddlespeech cls --input ./cat.wav --topk 10
# Punctuation_restoration
paddlespeech text --input 今天的天气真不错啊你下午有空吗我想约你一起去吃饭
paddlespeech text --input 今天的天气真不错啊你下午有空吗我想约你一起去吃饭 --model ernie_linear_p3_wudao_fast
# Speech_recognition
wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespeech.bj.bcebos.com/PaddleAudio/en.wav
