diff --git a/README.md b/README.md index 59c61f776..72db64b7d 100644 --- a/README.md +++ b/README.md @@ -19,8 +19,6 @@

Quick Start - | Quick Start Server - | Quick Start Streaming Server | Documents | Models List | AIStudio Courses @@ -159,6 +157,8 @@ Via the easy-to-use, efficient, flexible and scalable implementation, our vision - 🧩 *Cascaded models application*: as an extension of the typical traditional audio tasks, we combine the workflows of the aforementioned tasks with other fields like Natural language processing (NLP) and Computer Vision (CV). ### Recent Update +- 🔥 2022.09.26: Add Voice Cloning, TTS finetune, and ERNIE-SAT in [PaddleSpeech Web Demo](./demos/speech_web). +- ⚡ 2022.09.09: Add AISHELL-3 Voice Cloning [example](./examples/aishell3/vc2) with ECAPA-TDNN speaker encoder. - ⚡ 2022.08.25: Release TTS [finetune](./examples/other/tts_finetune/tts3) example. - 🔥 2022.08.22: Add ERNIE-SAT models: [ERNIE-SAT-vctk](./examples/vctk/ernie_sat)、[ERNIE-SAT-aishell3](./examples/aishell3/ernie_sat)、[ERNIE-SAT-zh_en](./examples/aishell3_vctk/ernie_sat). - 🔥 2022.08.15: Add [g2pW](https://github.com/GitYCC/g2pW) into TTS Chinese Text Frontend. @@ -705,7 +705,7 @@ PaddleSpeech supports a series of most popular models. They are summarized in [r
       <td>Speaker Verification</td>
-      <td>VoxCeleb12</td>
+      <td>VoxCeleb1/2</td>
       <td>ECAPA-TDNN</td>
       <td>
       <a href="./examples/voxceleb/sv0">ecapa-tdnn-voxceleb12</a>
@@ -714,6 +714,31 @@ PaddleSpeech supports a series of most popular models. They are summarized in [r
+
+**Speaker Diarization**
+
+<table style="width:100%">
+  <thead>
+    <tr>
+      <th> Task </th>
+      <th> Dataset </th>
+      <th> Model Type </th>
+      <th> Example </th>
+    </tr>
+  </thead>
+  <tbody>
+    <tr>
+      <td>Speaker Diarization</td>
+      <td>AMI</td>
+      <td>ECAPA-TDNN + AHC / SC</td>
+      <td>
+      <a href="./examples/ami/sd0">ecapa-tdnn-ami</a>
+      </td>
+    </tr>
+  </tbody>
+</table>
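The "ECAPA-TDNN + AHC / SC" entry in the new table denotes a pipeline that extracts one speaker embedding per segment and then clusters segments into speakers. As a rough illustration of the AHC stage (not the recipe's actual implementation; the embeddings and the distance threshold below are placeholders), one can agglomeratively cluster embeddings by cosine distance:

```python
# Illustrative sketch of the AHC step: cluster per-segment speaker
# embeddings by cosine distance. Embeddings and threshold are placeholders,
# not the values used by the ami/sd0 recipe.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import pdist

embeddings = np.random.rand(20, 192)        # 20 segments, 192-dim ECAPA-TDNN vectors
dists = pdist(embeddings, metric="cosine")  # pairwise cosine distances
tree = linkage(dists, method="average")     # average-linkage agglomerative merge tree
labels = fcluster(tree, t=0.6, criterion="distance")  # cut the tree -> speaker labels
print(labels)                               # one cluster id per segment
```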
+ **Punctuation Restoration** @@ -767,6 +792,7 @@ Normally, [Speech SoTA](https://paperswithcode.com/area/speech), [Audio SoTA](ht - [Text-to-Speech](#TextToSpeech) - [Audio Classification](#AudioClassification) - [Speaker Verification](#SpeakerVerification) + - [Speaker Diarization](#SpeakerDiarization) - [Punctuation Restoration](#PunctuationRestoration) - [Community](#Community) - [Welcome to contribute](#contribution) diff --git a/README_cn.md b/README_cn.md index 070a656a2..725f7eda1 100644 --- a/README_cn.md +++ b/README_cn.md @@ -19,10 +19,8 @@

- 安装 + 安装 | 快速开始 - | 快速使用服务 - | 快速使用流式服务 | 教程文档 | 模型列表 | AIStudio 课程 @@ -181,6 +179,8 @@

### 近期更新 +- 🔥 2022.09.26: 新增 Voice Cloning, TTS finetune 和 ERNIE-SAT 到 [PaddleSpeech 网页应用](./demos/speech_web)。 +- ⚡ 2022.09.09: 新增基于 ECAPA-TDNN 声纹模型的 AISHELL-3 Voice Cloning [示例](./examples/aishell3/vc2)。 - ⚡ 2022.08.25: 发布 TTS [finetune](./examples/other/tts_finetune/tts3) 示例。 - 🔥 2022.08.22: 新增 ERNIE-SAT 模型: [ERNIE-SAT-vctk](./examples/vctk/ernie_sat)、[ERNIE-SAT-aishell3](./examples/aishell3/ernie_sat)、[ERNIE-SAT-zh_en](./examples/aishell3_vctk/ernie_sat)。 - 🔥 2022.08.15: 将 [g2pW](https://github.com/GitYCC/g2pW) 引入 TTS 中文文本前端。 @@ -717,8 +717,8 @@ PaddleSpeech 的 **语音合成** 主要包含三个模块:文本前端、声
-      <td>Speaker Verification</td>
-      <td>VoxCeleb12</td>
+      <td>声纹识别</td>
+      <td>VoxCeleb1/2</td>
       <td>ECAPA-TDNN</td>
       <td>
       <a href="./examples/voxceleb/sv0">ecapa-tdnn-voxceleb12</a>
@@ -727,6 +727,31 @@ PaddleSpeech 的 **语音合成** 主要包含三个模块:文本前端、声
+
+**说话人日志**
+
+<table style="width:100%">
+  <thead>
+    <tr>
+      <th> 任务 </th>
+      <th> 数据集 </th>
+      <th> 模型类型 </th>
+      <th> 脚本 </th>
+    </tr>
+  </thead>
+  <tbody>
+    <tr>
+      <td>说话人日志</td>
+      <td>AMI</td>
+      <td>ECAPA-TDNN + AHC / SC</td>
+      <td>
+      <a href="./examples/ami/sd0">ecapa-tdnn-ami</a>
+      </td>
+    </tr>
+  </tbody>
+</table>
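The same table also names SC (spectral clustering) as an alternative to AHC. A sketch of that variant over a cosine-affinity matrix (again illustrative only; the speaker count and affinity construction are assumptions, not the recipe's actual code):

```python
# Illustrative sketch of the SC (spectral clustering) alternative named in
# the table above. The affinity matrix and speaker count are placeholders.
import numpy as np
from sklearn.cluster import SpectralClustering
from sklearn.metrics.pairwise import cosine_similarity

embeddings = np.random.rand(20, 192)      # per-segment ECAPA-TDNN vectors
affinity = cosine_similarity(embeddings)  # similarity in [-1, 1]
affinity = np.clip(affinity, 0.0, 1.0)    # spectral clustering needs non-negative weights
sc = SpectralClustering(n_clusters=2, affinity="precomputed", random_state=0)
labels = sc.fit_predict(affinity)         # speaker label per segment
print(labels)
```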
+ **标点恢复** @@ -786,6 +811,7 @@ PaddleSpeech 的 **语音合成** 主要包含三个模块:文本前端、声 - [语音合成](#语音合成模型) - [声音分类](#声音分类模型) - [声纹识别](#声纹识别模型) + - [说话人日志](#说话人日志模型) - [标点恢复](#标点恢复模型) - [技术交流群](#技术交流群) - [欢迎贡献](#欢迎贡献)
diff --git a/demos/streaming_asr_server/conf/application.yaml b/demos/streaming_asr_server/conf/application.yaml index a89d312ab..d446e13b6 100644 --- a/demos/streaming_asr_server/conf/application.yaml +++ b/demos/streaming_asr_server/conf/application.yaml @@ -21,7 +21,7 @@ engine_list: ['asr_online'] ################################### ASR ######################################### ################### speech task: asr; engine_type: online ####################### asr_online: - model_type: 'conformer_online_wenetspeech' + model_type: 'conformer_u2pp_online_wenetspeech' am_model: # the pdmodel file of am static model [optional] am_params: # the pdiparams file of am static model [optional] lang: 'zh'
diff --git a/docs/source/released_model.md b/docs/source/released_model.md index d6691812e..bdac2c5bb 100644 --- a/docs/source/released_model.md +++ b/docs/source/released_model.md @@ -9,6 +9,7 @@ Acoustic Model | Training Data | Token-based | Size | Descriptions | CER | WER | [Ds2 Online Aishell ASR0 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_online_aishell_fbank161_ckpt_0.2.1.model.tar.gz) | Aishell Dataset | Char-based | 491 MB | 2 Conv + 5 LSTM layers | 0.0666 |-| 151 h | [Ds2 Online Aishell ASR0](../../examples/aishell/asr0) | onnx/inference/python | [Ds2 Offline Aishell ASR0 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr0/asr0_deepspeech2_offline_aishell_ckpt_1.0.1.model.tar.gz)| Aishell Dataset | Char-based | 1.4 GB | 2 Conv + 5 bidirectional LSTM layers| 0.0554 |-| 151 h | [Ds2 Offline Aishell ASR0](../../examples/aishell/asr0) | inference/python | [Conformer Online Wenetspeech ASR1 Model](https://paddlespeech.bj.bcebos.com/s2t/wenetspeech/asr1/asr1_chunk_conformer_wenetspeech_ckpt_1.0.0a.model.tar.gz) | WenetSpeech Dataset | Char-based | 457 MB | Encoder:Conformer, Decoder:Transformer, Decoding method: Attention rescoring| 0.11 (test\_net) 0.1879 (test\_meeting) |-| 10000 h |- | python | +[Conformer U2PP Online Wenetspeech ASR1 Model](https://paddlespeech.bj.bcebos.com/s2t/wenetspeech/asr1/asr1_chunk_conformer_u2pp_wenetspeech_ckpt_1.1.1.model.tar.gz) | WenetSpeech Dataset | Char-based | 476 MB | Encoder:Conformer, Decoder:BiTransformer, Decoding method: Attention rescoring| 0.047198 (aishell test\_-1) 0.059212 (aishell test\_16) |-| 10000 h |- | python | [Conformer Online Aishell ASR1 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr1/asr1_chunk_conformer_aishell_ckpt_0.2.0.model.tar.gz) | Aishell Dataset | Char-based | 189 MB | Encoder:Conformer, Decoder:Transformer, Decoding method: Attention rescoring| 0.0544 |-| 151 h | [Conformer Online Aishell ASR1](../../examples/aishell/asr1) | python | [Conformer Offline Aishell ASR1 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr1/asr1_conformer_aishell_ckpt_1.0.1.model.tar.gz) | Aishell Dataset | Char-based | 189 MB | Encoder:Conformer, Decoder:Transformer, Decoding method: Attention rescoring | 0.0460 |-| 151 h | [Conformer Offline Aishell ASR1](../../examples/aishell/asr1) | python | [Transformer Aishell ASR1 Model](https://paddlespeech.bj.bcebos.com/s2t/aishell/asr1/asr1_transformer_aishell_ckpt_0.1.1.model.tar.gz) | Aishell Dataset | Char-based | 128 MB | Encoder:Transformer, Decoder:Transformer, Decoding method: Attention rescoring | 0.0523 || 151 h | [Transformer Aishell ASR1](../../examples/aishell/asr1) | python |
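Both the streaming-server config above and the CLI defaults later in this patch move to the new U2++ WenetSpeech checkpoints. A sketch of invoking the new model through the Python API, mirroring the updated default in paddlespeech/cli/asr/infer.py (the wav path is a placeholder):

```python
# Sketch: run ASR with the new U2++ WenetSpeech model via the CLI executor.
# Mirrors the new default in paddlespeech/cli/asr/infer.py; "zh.wav" is a
# placeholder 16 kHz Mandarin recording.
from paddlespeech.cli.asr.infer import ASRExecutor

asr = ASRExecutor()
text = asr(
    audio_file="zh.wav",
    model="conformer_u2pp_wenetspeech",  # new default added by this patch
    lang="zh",
    sample_rate=16000,
)
print(text)
```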
diff --git a/examples/aishell/asr0/local/train.sh b/examples/aishell/asr0/local/train.sh index 256b30d22..2b71b7f76 100755 --- a/examples/aishell/asr0/local/train.sh +++ b/examples/aishell/asr0/local/train.sh @@ -26,6 +26,10 @@ if [ ${seed} != 0 ]; then export FLAGS_cudnn_deterministic=True fi +# The default memory allocator strategy may cause GPU training to hang, +# because no OOM error is raised when memory is exhausted. +export FLAGS_allocator_strategy=naive_best_fit + if [ ${ngpu} == 0 ]; then python3 -u ${BIN_DIR}/train.py \ --ngpu ${ngpu} \
diff --git a/examples/aishell/asr1/local/train.sh b/examples/aishell/asr1/local/train.sh index f514de303..bfa8dd97d 100755 --- a/examples/aishell/asr1/local/train.sh +++ b/examples/aishell/asr1/local/train.sh @@ -35,6 +35,10 @@ echo ${ips_config} mkdir -p exp +# The default memory allocator strategy may cause GPU training to hang, +# because no OOM error is raised when memory is exhausted. +export FLAGS_allocator_strategy=naive_best_fit + if [ ${ngpu} == 0 ]; then python3 -u ${BIN_DIR}/train.py \ --ngpu ${ngpu} \
diff --git a/examples/iwslt2012/punc0/conf/ernie-3.0-base.yaml b/examples/iwslt2012/punc0/conf/ernie-3.0-base.yaml new file mode 100644 index 000000000..845b13fd8 --- /dev/null +++ b/examples/iwslt2012/punc0/conf/ernie-3.0-base.yaml @@ -0,0 +1,43 @@ +########################################################### +# DATA SETTING # +########################################################### +dataset_type: Ernie +train_path: data/iwslt2012_zh/train.txt +dev_path: data/iwslt2012_zh/dev.txt +test_path: data/iwslt2012_zh/test.txt +batch_size: 64 +num_workers: 2 +data_params: + pretrained_token: ernie-3.0-base-zh + punc_path: data/iwslt2012_zh/punc_vocab + seq_len: 100 + + +########################################################### +# MODEL SETTING # +########################################################### +model_type: ErnieLinear +model: + pretrained_token: ernie-3.0-base-zh + num_classes: 4 + +########################################################### +# OPTIMIZER SETTING # +########################################################### +optimizer_params: + weight_decay: 1.0e-6 # weight decay coefficient. + +scheduler_params: + learning_rate: 1.0e-5 # learning rate. + gamma: 0.9999 # scheduler gamma must be between (0.0, 1.0); closer to 1.0 is better. + +########################################################### +# TRAINING SETTING # +########################################################### +max_epoch: 20 + +########################################################### +# OTHER SETTING # +########################################################### +num_snapshots: 10 # max number of snapshots to keep while training +seed: 42 # random seed for paddle, random, and np.random
diff --git a/examples/iwslt2012/punc0/conf/ernie-3.0-medium.yaml b/examples/iwslt2012/punc0/conf/ernie-3.0-medium.yaml new file mode 100644 index 000000000..392ba011c --- /dev/null +++ b/examples/iwslt2012/punc0/conf/ernie-3.0-medium.yaml @@ -0,0 +1,43 @@ +########################################################### +# DATA SETTING # +########################################################### +dataset_type: Ernie +train_path: data/iwslt2012_zh/train.txt +dev_path: data/iwslt2012_zh/dev.txt +test_path: data/iwslt2012_zh/test.txt +batch_size: 64 +num_workers: 2 +data_params: + pretrained_token: ernie-3.0-medium-zh + punc_path: data/iwslt2012_zh/punc_vocab + seq_len: 100 + + +########################################################### +# MODEL SETTING # +########################################################### +model_type: ErnieLinear +model: + pretrained_token: ernie-3.0-medium-zh + num_classes: 4 + +########################################################### +# OPTIMIZER SETTING # +########################################################### +optimizer_params: + weight_decay: 1.0e-6 # weight decay coefficient. + +scheduler_params: + learning_rate: 1.0e-5 # learning rate. + gamma: 0.9999 # scheduler gamma must be between (0.0, 1.0); closer to 1.0 is better. + +########################################################### +# TRAINING SETTING # +########################################################### +max_epoch: 20 + +########################################################### +# OTHER SETTING # +########################################################### +num_snapshots: 10 # max number of snapshots to keep while training +seed: 42 # random seed for paddle, random, and np.random
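These configs are consumed by the new `_init_from_path_new` path added later in this patch. A minimal sketch of that flow, assuming the repo-relative config path:

```python
# Minimal sketch: build the ErnieLinear punctuation model from one of the
# new configs, mirroring TextExecutor._init_from_path_new in this patch.
# The config path is an assumption for illustration.
import yaml
from yacs.config import CfgNode

from paddlespeech.text.models.ernie_linear import ErnieLinear

with open("examples/iwslt2012/punc0/conf/ernie-3.0-base.yaml") as f:
    config = CfgNode(yaml.safe_load(f))

model = ErnieLinear(**config["model"])  # pretrained_token=ernie-3.0-base-zh, num_classes=4
model.eval()
```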
diff --git a/examples/iwslt2012/punc0/conf/ernie-3.0-mini.yaml b/examples/iwslt2012/punc0/conf/ernie-3.0-mini.yaml new file mode 100644 index 000000000..c57fd94a8 --- /dev/null +++ b/examples/iwslt2012/punc0/conf/ernie-3.0-mini.yaml @@ -0,0 +1,43 @@ +########################################################### +# DATA SETTING # +########################################################### +dataset_type: Ernie +train_path: data/iwslt2012_zh/train.txt +dev_path: data/iwslt2012_zh/dev.txt +test_path: data/iwslt2012_zh/test.txt +batch_size: 64 +num_workers: 2 +data_params: + pretrained_token: ernie-3.0-mini-zh + punc_path: data/iwslt2012_zh/punc_vocab + seq_len: 100 + + +########################################################### +# MODEL SETTING # +########################################################### +model_type: ErnieLinear +model: + pretrained_token: ernie-3.0-mini-zh + num_classes: 4 + +########################################################### +# OPTIMIZER SETTING # +########################################################### +optimizer_params: + weight_decay: 1.0e-6 # weight decay coefficient. + +scheduler_params: + learning_rate: 1.0e-5 # learning rate. + gamma: 0.9999 # scheduler gamma must be between (0.0, 1.0); closer to 1.0 is better. + +########################################################### +# TRAINING SETTING # +########################################################### +max_epoch: 20 + +########################################################### +# OTHER SETTING # +########################################################### +num_snapshots: 10 # max number of snapshots to keep while training +seed: 42 # random seed for paddle, random, and np.random
diff --git a/examples/iwslt2012/punc0/conf/ernie-3.0-nano-zh.yaml b/examples/iwslt2012/punc0/conf/ernie-3.0-nano-zh.yaml new file mode 100644 index 000000000..a7a84c4c1 --- /dev/null +++ b/examples/iwslt2012/punc0/conf/ernie-3.0-nano-zh.yaml @@ -0,0 +1,43 @@ +########################################################### +# DATA SETTING # +########################################################### +dataset_type: Ernie +train_path: data/iwslt2012_zh/train.txt +dev_path: data/iwslt2012_zh/dev.txt +test_path: data/iwslt2012_zh/test.txt +batch_size: 64 +num_workers: 2 +data_params: + pretrained_token: ernie-3.0-nano-zh + punc_path: data/iwslt2012_zh/punc_vocab + seq_len: 100 + + +########################################################### +# MODEL SETTING # +########################################################### +model_type: ErnieLinear +model: + pretrained_token: ernie-3.0-nano-zh + num_classes: 4 + +########################################################### +# OPTIMIZER SETTING # +########################################################### +optimizer_params: + weight_decay: 1.0e-6 # weight decay coefficient. + +scheduler_params: + learning_rate: 1.0e-5 # learning rate. + gamma: 0.9999 # scheduler gamma must be between (0.0, 1.0); closer to 1.0 is better. + +########################################################### +# TRAINING SETTING # +########################################################### +max_epoch: 20 + +########################################################### +# OTHER SETTING # +########################################################### +num_snapshots: 10 # max number of snapshots to keep while training +seed: 42 # random seed for paddle, random, and np.random
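The `gamma` comment above describes an exponential learning-rate decay, lr_t = learning_rate * gamma**t, so a gamma near 1.0 decays very slowly. A sketch with Paddle's scheduler (which scheduler class the trainer actually instantiates is an assumption here; the values come from the config):

```python
# Sketch of the schedule implied by scheduler_params above:
# lr_t = learning_rate * gamma**t. Assumes an ExponentialDecay-style
# scheduler; learning_rate and gamma are taken from the config.
import paddle

sched = paddle.optimizer.lr.ExponentialDecay(learning_rate=1.0e-5, gamma=0.9999)
for step in range(3):
    sched.step()
    print(step, sched.get_lr())  # 1e-05 * 0.9999**(step + 1)
```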
diff --git a/examples/iwslt2012/punc0/conf/ernie-tiny.yaml b/examples/iwslt2012/punc0/conf/ernie-tiny.yaml new file mode 100644 index 000000000..6a5b7fee2 --- /dev/null +++ b/examples/iwslt2012/punc0/conf/ernie-tiny.yaml @@ -0,0 +1,43 @@ +########################################################### +# DATA SETTING # +########################################################### +dataset_type: Ernie +train_path: data/iwslt2012_zh/train.txt +dev_path: data/iwslt2012_zh/dev.txt +test_path: data/iwslt2012_zh/test.txt +batch_size: 64 +num_workers: 2 +data_params: + pretrained_token: ernie-tiny + punc_path: data/iwslt2012_zh/punc_vocab + seq_len: 100 + + +########################################################### +# MODEL SETTING # +########################################################### +model_type: ErnieLinear +model: + pretrained_token: ernie-tiny + num_classes: 4 + +########################################################### +# OPTIMIZER SETTING # +########################################################### +optimizer_params: + weight_decay: 1.0e-6 # weight decay coefficient. + +scheduler_params: + learning_rate: 1.0e-5 # learning rate. + gamma: 0.9999 # scheduler gamma must be between (0.0, 1.0); closer to 1.0 is better. + +########################################################### +# TRAINING SETTING # +########################################################### +max_epoch: 20 + +########################################################### +# OTHER SETTING # +########################################################### +num_snapshots: 10 # max number of snapshots to keep while training +seed: 42 # random seed for paddle, random, and np.random
diff --git a/examples/librispeech/asr0/local/train.sh b/examples/librispeech/asr0/local/train.sh index 71659e28d..bb41fd554 100755 --- a/examples/librispeech/asr0/local/train.sh +++ b/examples/librispeech/asr0/local/train.sh @@ -26,6 +26,10 @@ if [ ${seed} != 0 ]; then export FLAGS_cudnn_deterministic=True fi +# The default memory allocator strategy may cause GPU training to hang, +# because no OOM error is raised when memory is exhausted. +export FLAGS_allocator_strategy=naive_best_fit + if [ ${ngpu} == 0 ]; then python3 -u ${BIN_DIR}/train.py \ --ngpu ${ngpu} \
diff --git a/examples/librispeech/asr1/local/train.sh b/examples/librispeech/asr1/local/train.sh index f729ed22c..e274b9133 100755 --- a/examples/librispeech/asr1/local/train.sh +++ b/examples/librispeech/asr1/local/train.sh @@ -29,6 +29,10 @@ fi # export FLAGS_cudnn_exhaustive_search=true # export FLAGS_conv_workspace_size_limit=4000 +# The default memory allocator strategy may cause GPU training to hang, +# because no OOM error is raised when memory is exhausted. +export FLAGS_allocator_strategy=naive_best_fit + if [ ${ngpu} == 0 ]; then python3 -u ${BIN_DIR}/train.py \ --ngpu ${ngpu} \
diff --git a/examples/librispeech/asr2/local/train.sh b/examples/librispeech/asr2/local/train.sh index 1f414ad41..c2f2d4b65 100755 --- a/examples/librispeech/asr2/local/train.sh +++ b/examples/librispeech/asr2/local/train.sh @@ -26,6 +26,10 @@ if [ ${seed} != 0 ]; then export FLAGS_cudnn_deterministic=True fi +# The default memory allocator strategy may cause GPU training to hang, +# because no OOM error is raised when memory is exhausted. +export FLAGS_allocator_strategy=naive_best_fit + if [ ${ngpu} == 0 ]; then python3 -u ${BIN_DIR}/train.py \ --ngpu ${ngpu} \
diff --git a/examples/timit/asr1/local/train.sh b/examples/timit/asr1/local/train.sh index 661407582..1088c7ffa 100755 --- a/examples/timit/asr1/local/train.sh +++ b/examples/timit/asr1/local/train.sh @@ -19,6 +19,10 @@ if [ ${seed} != 0 ]; then export FLAGS_cudnn_deterministic=True fi +# The default memory allocator strategy may cause GPU training to hang, +# because no OOM error is raised when memory is exhausted. +export FLAGS_allocator_strategy=naive_best_fit + if [ ${ngpu} == 0 ]; then python3 -u ${BIN_DIR}/train.py \ --ngpu ${ngpu} \
diff --git a/examples/tiny/asr0/local/train.sh b/examples/tiny/asr0/local/train.sh index 8b67902fe..e233a0c0a 100755 --- a/examples/tiny/asr0/local/train.sh +++ b/examples/tiny/asr0/local/train.sh @@ -32,6 +32,10 @@ fi mkdir -p exp +# The default memory allocator strategy may cause GPU training to hang, +# because no OOM error is raised when memory is exhausted. +export FLAGS_allocator_strategy=naive_best_fit + if [ ${ngpu} == 0 ]; then python3 -u ${BIN_DIR}/train.py \ --ngpu ${ngpu} \
diff --git a/examples/tiny/asr1/local/train.sh b/examples/tiny/asr1/local/train.sh index 459f2e218..fbfb41f6f 100755 --- a/examples/tiny/asr1/local/train.sh +++ b/examples/tiny/asr1/local/train.sh @@ -34,6 +34,10 @@ fi mkdir -p exp +# The default memory allocator strategy may cause GPU training to hang, +# because no OOM error is raised when memory is exhausted. +export FLAGS_allocator_strategy=naive_best_fit + if [ ${ngpu} == 0 ]; then python3 -u ${BIN_DIR}/train.py \ --ngpu ${ngpu} \
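The same allocator export is added to every recipe above. If a training entry point is launched from Python rather than from these shell scripts, the flag can be set before Paddle initializes (a sketch; the flag name and value are taken from the diff):

```python
# Sketch: set the allocator strategy from Python instead of the shell,
# so GPU training fails fast with an OOM error rather than hanging.
# Must run before paddle allocates any GPU memory.
import os

os.environ["FLAGS_allocator_strategy"] = "naive_best_fit"

import paddle  # imported after the flag is set

print(paddle.get_flags(["FLAGS_allocator_strategy"]))
```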
diff --git a/examples/wenetspeech/asr1/local/train.sh b/examples/wenetspeech/asr1/local/train.sh index 01af00b61..6813d270c 100755 --- a/examples/wenetspeech/asr1/local/train.sh +++ b/examples/wenetspeech/asr1/local/train.sh @@ -35,6 +35,10 @@ echo ${ips_config} mkdir -p exp +# The default memory allocator strategy may cause GPU training to hang, +# because no OOM error is raised when memory is exhausted. +export FLAGS_allocator_strategy=naive_best_fit + if [ ${ngpu} == 0 ]; then python3 -u ${BIN_DIR}/train.py \ --ngpu ${ngpu} \
diff --git a/paddlespeech/cli/README.md b/paddlespeech/cli/README.md index 19c822040..e6e216c0b 100644 --- a/paddlespeech/cli/README.md +++ b/paddlespeech/cli/README.md @@ -42,3 +42,7 @@ ```bash paddlespeech text --task punc --input 今天的天气真不错啊你下午有空吗我想约你一起去吃饭 ``` +- Faster Punctuation Restoration + ```bash + paddlespeech text --task punc --input 今天的天气真不错啊你下午有空吗我想约你一起去吃饭 --model ernie_linear_p3_wudao_fast + ```
diff --git a/paddlespeech/cli/README_cn.md b/paddlespeech/cli/README_cn.md index 4b15d6c7b..6464c598c 100644 --- a/paddlespeech/cli/README_cn.md +++ b/paddlespeech/cli/README_cn.md @@ -43,3 +43,7 @@ ```bash paddlespeech text --task punc --input 今天的天气真不错啊你下午有空吗我想约你一起去吃饭 ``` +- 快速标点恢复 + ```bash + paddlespeech text --task punc --input 今天的天气真不错啊你下午有空吗我想约你一起去吃饭 --model ernie_linear_p3_wudao_fast + ```
diff --git a/paddlespeech/cli/asr/infer.py b/paddlespeech/cli/asr/infer.py index 7296776f9..437f64631 100644 --- a/paddlespeech/cli/asr/infer.py +++ b/paddlespeech/cli/asr/infer.py @@ -12,6 +12,7 @@ # See the License for the specific language governing permissions and # limitations under the License. import argparse +import io import os import sys import time @@ -51,7 +52,7 @@ class ASRExecutor(BaseExecutor): self.parser.add_argument( '--model', type=str, - default='conformer_wenetspeech', + default='conformer_u2pp_wenetspeech', choices=[ tag[:tag.index('-')] for tag in self.task_resource.pretrained_models.keys() @@ -229,6 +230,8 @@ class ASRExecutor(BaseExecutor): audio_file = input if isinstance(audio_file, (str, os.PathLike)): logger.debug("Preprocess audio_file:" + audio_file) + elif isinstance(audio_file, io.BytesIO): + audio_file.seek(0) # Get the object for feature extraction if "deepspeech2" in model_type or "conformer" in model_type or "transformer" in model_type: @@ -352,6 +355,8 @@ class ASRExecutor(BaseExecutor): if not os.path.isfile(audio_file): logger.error("Please input the right audio file path") return False + elif isinstance(audio_file, io.BytesIO): + audio_file.seek(0) logger.debug("checking the audio file format......") try: @@ -465,7 +470,7 @@ class ASRExecutor(BaseExecutor): @stats_wrapper def __call__(self, audio_file: os.PathLike, - model: str='conformer_wenetspeech', + model: str='conformer_u2pp_wenetspeech', lang: str='zh', sample_rate: int=16000, config: os.PathLike=None,
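With the new `io.BytesIO` branches above, in-memory audio can be passed to the executor as well as a path. A hedged sketch, assuming the rest of the pipeline accepts the handle end to end ("zh.wav" is a placeholder):

```python
# Sketch: feed in-memory audio to the ASR executor, exercising the new
# io.BytesIO branches in preprocess/_check above.
import io

from paddlespeech.cli.asr.infer import ASRExecutor

asr = ASRExecutor()
with open("zh.wav", "rb") as f:
    buf = io.BytesIO(f.read())

print(asr(audio_file=buf, model="conformer_u2pp_wenetspeech", lang="zh", sample_rate=16000))
```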
diff --git a/paddlespeech/cli/text/infer.py b/paddlespeech/cli/text/infer.py index 24b8c9c25..ff822f674 100644 --- a/paddlespeech/cli/text/infer.py +++ b/paddlespeech/cli/text/infer.py @@ -20,10 +20,13 @@ from typing import Optional from typing import Union import paddle +import yaml +from yacs.config import CfgNode from ..executor import BaseExecutor from ..log import logger from ..utils import stats_wrapper +from paddlespeech.text.models.ernie_linear import ErnieLinear __all__ = ['TextExecutor'] @@ -139,6 +142,66 @@ class TextExecutor(BaseExecutor): self.model.eval() + + # init new-style models + def _init_from_path_new(self, + task: str='punc', + model_type: str='ernie_linear_p7_wudao', + lang: str='zh', + cfg_path: Optional[os.PathLike]=None, + ckpt_path: Optional[os.PathLike]=None, + vocab_file: Optional[os.PathLike]=None): + if hasattr(self, 'model'): + logger.debug('Model had been initialized.') + return + + self.task = task + + if cfg_path is None or ckpt_path is None or vocab_file is None: + tag = '-'.join([model_type, task, lang]) + self.task_resource.set_task_model(tag, version=None) + self.cfg_path = os.path.join( + self.task_resource.res_dir, + self.task_resource.res_dict['cfg_path']) + self.ckpt_path = os.path.join( + self.task_resource.res_dir, + self.task_resource.res_dict['ckpt_path']) + self.vocab_file = os.path.join( + self.task_resource.res_dir, + self.task_resource.res_dict['vocab_file']) + else: + self.cfg_path = os.path.abspath(cfg_path) + self.ckpt_path = os.path.abspath(ckpt_path) + self.vocab_file = os.path.abspath(vocab_file) + + model_name = model_type[:model_type.rindex('_')] + + if self.task == 'punc': + # punc list + self._punc_list = [] + with open(self.vocab_file, 'r') as f: + for line in f: + self._punc_list.append(line.strip()) + + # model + with open(self.cfg_path) as f: + config = CfgNode(yaml.safe_load(f)) + self.model = ErnieLinear(**config["model"]) + + _, tokenizer_class = self.task_resource.get_model_class(model_name) + state_dict = paddle.load(self.ckpt_path) + self.model.set_state_dict(state_dict["main_params"]) + self.model.eval() + + # tokenizer: fast models use ernie-3.0-mini-zh; slow models use ernie-1.0 + if 'fast' not in model_type: + self.tokenizer = tokenizer_class.from_pretrained('ernie-1.0') + else: + self.tokenizer = tokenizer_class.from_pretrained( + 'ernie-3.0-mini-zh') + + else: + raise NotImplementedError + def _clean_text(self, text): text = text.lower() text = re.sub('[^A-Za-z0-9\u4e00-\u9fa5]', '', text) @@ -179,7 +242,7 @@ else: raise NotImplementedError - def postprocess(self) -> Union[str, os.PathLike]: + def postprocess(self, isNewTrainer: bool=False) -> Union[str, os.PathLike]: """ Output postprocess and return human-readable results such as texts and audio files. """ @@ -192,13 +255,13 @@ input_ids[1:seq_len - 1]) labels = preds[1:seq_len - 1].tolist() assert len(tokens) == len(labels) - + if isNewTrainer: + self._punc_list = [0] + self._punc_list text = '' for t, l in zip(tokens, labels): text += t if l != 0: # Non punc. text += self._punc_list[l] - return text else: raise NotImplementedError @@ -255,10 +318,20 @@ """ Python API to call an executor. """ - paddle.set_device(device) - self._init_from_path(task, model, lang, config, ckpt_path, punc_vocab) - self.preprocess(text) - self.infer() - res = self.postprocess() # Retrieve result of text task. - + # Old-style models keep the original init and postprocess path. + if model in ['ernie_linear_p7_wudao', 'ernie_linear_p3_wudao']: + paddle.set_device(device) + self._init_from_path(task, model, lang, config, ckpt_path, + punc_vocab) + self.preprocess(text) + self.infer() + res = self.postprocess() # Retrieve result of text task. + # New-style (fast) models go through the new init and postprocess path. + else: + paddle.set_device(device) + self._init_from_path_new(task, model, lang, config, ckpt_path, + punc_vocab) + self.preprocess(text) + self.infer() + res = self.postprocess(isNewTrainer=True) return res
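A sketch of exercising the new fast model through the Python API, matching the CLI example added to paddlespeech/cli/README.md above (argument names follow the `__call__` signature shown in this hunk):

```python
# Sketch: punctuation restoration with the new fast model via the Python
# API; mirrors the `paddlespeech text ... --model ernie_linear_p3_wudao_fast`
# CLI example added above.
from paddlespeech.cli.text.infer import TextExecutor

text_executor = TextExecutor()
result = text_executor(
    text="今天的天气真不错啊你下午有空吗我想约你一起去吃饭",
    task="punc",
    model="ernie_linear_p3_wudao_fast",  # routes to _init_from_path_new
    lang="zh",
)
print(result)
```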
diff --git a/paddlespeech/resource/model_alias.py b/paddlespeech/resource/model_alias.py index 9c76dd4b3..f5ec655b7 100644 --- a/paddlespeech/resource/model_alias.py +++ b/paddlespeech/resource/model_alias.py @@ -25,6 +25,8 @@ model_alias = { "deepspeech2online": ["paddlespeech.s2t.models.ds2:DeepSpeech2Model"], "conformer": ["paddlespeech.s2t.models.u2:U2Model"], "conformer_online": ["paddlespeech.s2t.models.u2:U2Model"], + "conformer_u2pp": ["paddlespeech.s2t.models.u2:U2Model"], + "conformer_u2pp_online": ["paddlespeech.s2t.models.u2:U2Model"], "transformer": ["paddlespeech.s2t.models.u2:U2Model"], "wenetspeech": ["paddlespeech.s2t.models.u2:U2Model"], @@ -51,6 +53,10 @@ model_alias = { "paddlespeech.text.models:ErnieLinear", "paddlenlp.transformers:ErnieTokenizer" ], + "ernie_linear_p3_wudao": [ + "paddlespeech.text.models:ErnieLinear", + "paddlenlp.transformers:ErnieTokenizer" + ], # --------------------------------- # -------------- TTS --------------
diff --git a/paddlespeech/resource/pretrained_models.py b/paddlespeech/resource/pretrained_models.py index f049879a3..0103651bc 100644 --- a/paddlespeech/resource/pretrained_models.py +++ b/paddlespeech/resource/pretrained_models.py @@ -68,6 +68,46 @@ asr_dynamic_pretrained_models = { '', }, }, + "conformer_u2pp_wenetspeech-zh-16k": { + '1.1': { + 'url': + 'https://paddlespeech.bj.bcebos.com/s2t/wenetspeech/asr1/asr1_chunk_conformer_u2pp_wenetspeech_ckpt_1.1.1.model.tar.gz', + 'md5': + 'eae678c04ed3b3f89672052fdc0c5e10', + 'cfg_path': + 'model.yaml', + 'ckpt_path': + 'exp/chunk_conformer_u2pp/checkpoints/avg_10', + 'model': + 'exp/chunk_conformer_u2pp/checkpoints/avg_10.pdparams', + 'params': + 'exp/chunk_conformer_u2pp/checkpoints/avg_10.pdparams', + 'lm_url': + '', + 'lm_md5': + '', + }, + }, + "conformer_u2pp_online_wenetspeech-zh-16k": { + '1.1': { + 'url': + 'https://paddlespeech.bj.bcebos.com/s2t/wenetspeech/asr1/asr1_chunk_conformer_u2pp_wenetspeech_ckpt_1.1.2.model.tar.gz', + 'md5': + '925d047e9188dea7f421a718230c9ae3', + 'cfg_path': + 'model.yaml', + 'ckpt_path': + 'exp/chunk_conformer_u2pp/checkpoints/avg_10', + 'model': + 'exp/chunk_conformer_u2pp/checkpoints/avg_10.pdparams', + 'params': + 'exp/chunk_conformer_u2pp/checkpoints/avg_10.pdparams', + 'lm_url': + '', + 'lm_md5': + '', + }, + }, "conformer_online_multicn-zh-16k": { '1.0': { 'url': @@ -529,7 +569,7 @@ text_dynamic_pretrained_models = { 'ckpt/model_state.pdparams', 'vocab_file': 'punc_vocab.txt', - }, + } }, "ernie_linear_p3_wudao-punc-zh": { '1.0': { 'url': 'https://paddlespeech.bj.bcebos.com/text/ernie_linear_p3_wudao-punc-zh.tar.gz', 'md5': ..., 'cfg_path': 'ckpt/model_config.json', 'ckpt_path': 'ckpt/model_state.pdparams', 'vocab_file': 'punc_vocab.txt', - }, + } }, + "ernie_linear_p3_wudao_fast-punc-zh": { + '1.0': { + 'url': + 'https://paddlespeech.bj.bcebos.com/text/ernie_linear_p3_wudao_fast-punc-zh.tar.gz', + 'md5': + 'c93f9594119541a5dbd763381a751d08', + 'cfg_path': + 'ckpt/model_config.json', + 'ckpt_path': + 'ckpt/model_state.pdparams', + 'vocab_file': + 'punc_vocab.txt', + } + } } # ---------------------------------
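How these new registry entries are looked up, per `_init_from_path_new` above: the executor joins model, task, and language into a tag, and the alias table is keyed by the model name with its last suffix stripped:

```python
# Sketch: how a model/task/lang triple maps to the new registry keys above.
model_type, task, lang = "ernie_linear_p3_wudao_fast", "punc", "zh"
tag = "-".join([model_type, task, lang])
print(tag)  # -> "ernie_linear_p3_wudao_fast-punc-zh", a key in text_dynamic_pretrained_models

# model_alias strips the trailing "_fast" part to pick the class:
model_name = model_type[:model_type.rindex("_")]
print(model_name)  # -> "ernie_linear_p3_wudao", the key added to model_alias
```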
diff --git a/paddlespeech/s2t/exps/u2/bin/test_wav.py b/paddlespeech/s2t/exps/u2/bin/test_wav.py index 4588def0b..46925faed 100644 --- a/paddlespeech/s2t/exps/u2/bin/test_wav.py +++ b/paddlespeech/s2t/exps/u2/bin/test_wav.py @@ -40,7 +40,6 @@ class U2Infer(): self.preprocess_conf = config.preprocess_config self.preprocess_args = {"train": False} self.preprocessing = Transformation(self.preprocess_conf) - self.reverse_weight = getattr(config.model_conf, 'reverse_weight', 0.0) self.text_feature = TextFeaturizer( unit_type=config.unit_type, vocab=config.vocab_filepath, @@ -89,8 +88,7 @@ ctc_weight=decode_config.ctc_weight, decoding_chunk_size=decode_config.decoding_chunk_size, num_decoding_left_chunks=decode_config.num_decoding_left_chunks, - simulate_streaming=decode_config.simulate_streaming, - reverse_weight=self.reverse_weight) + simulate_streaming=decode_config.simulate_streaming) rsl = result_transcripts[0][0] utt = Path(self.audio_file).name logger.info(f"hyp: {utt} {result_transcripts[0][0]}")
diff --git a/paddlespeech/s2t/exps/u2/model.py b/paddlespeech/s2t/exps/u2/model.py index a13a6385e..a6197d073 100644 --- a/paddlespeech/s2t/exps/u2/model.py +++ b/paddlespeech/s2t/exps/u2/model.py @@ -316,7 +316,6 @@ class U2Tester(U2Trainer): vocab=self.config.vocab_filepath, spm_model_prefix=self.config.spm_model_prefix) self.vocab_list = self.text_feature.vocab_list - self.reverse_weight = getattr(config.model_conf, 'reverse_weight', 0.0) def id2token(self, texts, texts_len, text_feature): """ ord() id to chr() chr """ @@ -351,8 +350,7 @@ class U2Tester(U2Trainer): ctc_weight=decode_config.ctc_weight, decoding_chunk_size=decode_config.decoding_chunk_size, num_decoding_left_chunks=decode_config.num_decoding_left_chunks, - simulate_streaming=decode_config.simulate_streaming, - reverse_weight=self.reverse_weight) + simulate_streaming=decode_config.simulate_streaming) decode_time = time.time() - start_time for utt, target, result, rec_tids in zip(
diff --git a/paddlespeech/s2t/models/u2/u2.py b/paddlespeech/s2t/models/u2/u2.py index 48b05d20c..4fe51c151 100644 --- a/paddlespeech/s2t/models/u2/u2.py +++ b/paddlespeech/s2t/models/u2/u2.py @@ -507,16 +507,14 @@ class U2BaseModel(ASRInterface, nn.Layer): num_decoding_left_chunks, simulate_streaming) return hyps[0][0] - def attention_rescoring( - self, - speech: paddle.Tensor, - speech_lengths: paddle.Tensor, - beam_size: int, - decoding_chunk_size: int=-1, - num_decoding_left_chunks: int=-1, - ctc_weight: float=0.0, - simulate_streaming: bool=False, - reverse_weight: float=0.0, ) -> List[int]: + def attention_rescoring(self, + speech: paddle.Tensor, + speech_lengths: paddle.Tensor, + beam_size: int, + decoding_chunk_size: int=-1, + num_decoding_left_chunks: int=-1, + ctc_weight: float=0.0, + simulate_streaming: bool=False) -> List[int]: """ Apply attention rescoring decoding, CTC prefix beam search is applied first to get nbest, then we rescore the nbest on the attention decoder with the corresponding encoder output @@ -536,7 +534,7 @@ """ assert speech.shape[0] == speech_lengths.shape[0] assert decoding_chunk_size != 0 - if reverse_weight > 0.0: + if self.reverse_weight > 0.0: # decoder should be a bitransformer decoder if reverse_weight > 0.0 assert hasattr(self.decoder, 'right_decoder') device = speech.place @@ -574,7 +572,7 @@ self.eos) decoder_out, r_decoder_out, _ = self.decoder( encoder_out, encoder_mask, hyps_pad, hyps_lens, r_hyps_pad, - reverse_weight) # (beam_size, max_hyps_len, vocab_size) + self.reverse_weight) # (beam_size, max_hyps_len, vocab_size) # ctc score in ln domain decoder_out = paddle.nn.functional.log_softmax(decoder_out, axis=-1) decoder_out = decoder_out.numpy() @@ -594,12 +592,13 @@ class U2BaseModel(ASRInterface, nn.Layer): score += decoder_out[i][j][w] # last decoder output token is `eos`, for last decoder input token. score += decoder_out[i][len(hyp[0])][self.eos] - if reverse_weight > 0: + if self.reverse_weight > 0: r_score = 0.0 for j, w in enumerate(hyp[0]): r_score += r_decoder_out[i][len(hyp[0]) - j - 1][w] r_score += r_decoder_out[i][len(hyp[0])][self.eos] - score = score * (1 - reverse_weight) + r_score * reverse_weight + score = score * (1 - self.reverse_weight + ) + r_score * self.reverse_weight # add ctc score (which in ln domain) score += hyp[1] * ctc_weight if score > best_score: @@ -748,8 +747,7 @@ class U2BaseModel(ASRInterface, nn.Layer): ctc_weight: float=0.0, decoding_chunk_size: int=-1, num_decoding_left_chunks: int=-1, - simulate_streaming: bool=False, - reverse_weight: float=0.0): + simulate_streaming: bool=False): """u2 decoding. Args: @@ -821,8 +819,7 @@ class U2BaseModel(ASRInterface, nn.Layer): decoding_chunk_size=decoding_chunk_size, num_decoding_left_chunks=num_decoding_left_chunks, ctc_weight=ctc_weight, - simulate_streaming=simulate_streaming, - reverse_weight=reverse_weight) + simulate_streaming=simulate_streaming) hyps = [hyp] else: raise ValueError(f"Not support decoding method: {decoding_method}")
diff --git a/paddlespeech/server/conf/ws_conformer_application.yaml b/paddlespeech/server/conf/ws_conformer_application.yaml index d72eb2379..d5357c853 100644 --- a/paddlespeech/server/conf/ws_conformer_application.yaml +++ b/paddlespeech/server/conf/ws_conformer_application.yaml @@ -30,7 +30,7 @@ asr_online: decode_method: num_decoding_left_chunks: -1 force_yes: True - device: # cpu or gpu:id + device: cpu # cpu or gpu:id continuous_decoding: True # enable continue decoding when endpoint detected am_predictor_conf:
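The rescoring math above now reads `reverse_weight` from the model itself. A small worked sketch of the combined score (the numeric values are illustrative, not from any real run):

```python
# Worked sketch of the score combination in attention_rescoring above:
# interpolate forward and right-to-left decoder scores by reverse_weight,
# then add the CTC prefix score weighted by ctc_weight.
reverse_weight, ctc_weight = 0.3, 0.5
score, r_score, ctc_score = -12.4, -13.1, -9.8  # log-domain scores for one hypothesis

final = score * (1 - reverse_weight) + r_score * reverse_weight
final += ctc_score * ctc_weight
print(final)  # the hypothesis with the largest combined score wins
```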
diff --git a/paddlespeech/server/engine/asr/online/python/asr_engine.py b/paddlespeech/server/engine/asr/online/python/asr_engine.py index ae0260929..adcd9bc14 100644 --- a/paddlespeech/server/engine/asr/online/python/asr_engine.py +++ b/paddlespeech/server/engine/asr/online/python/asr_engine.py @@ -22,6 +22,7 @@ from numpy import float32 from yacs.config import CfgNode from paddlespeech.audio.transform.transformation import Transformation +from paddlespeech.audio.utils.tensor_utils import st_reverse_pad_list from paddlespeech.cli.asr.infer import ASRExecutor from paddlespeech.cli.log import logger from paddlespeech.resource import CommonTaskResource @@ -602,24 +603,31 @@ class PaddleASRConnectionHanddler: hyps_pad = pad_sequence( hyp_list, batch_first=True, padding_value=self.model.ignore_id) + ori_hyps_pad = hyps_pad hyps_lens = paddle.to_tensor( [len(hyp[0]) for hyp in hyps], place=self.device, dtype=paddle.long) # (beam_size,) hyps_pad, _ = add_sos_eos(hyps_pad, self.model.sos, self.model.eos, self.model.ignore_id) hyps_lens = hyps_lens + 1 # Add <sos> at beginning - encoder_out = self.encoder_out.repeat(beam_size, 1, 1) encoder_mask = paddle.ones( (beam_size, 1, encoder_out.shape[1]), dtype=paddle.bool) - decoder_out, _, _ = self.model.decoder( - encoder_out, encoder_mask, hyps_pad, - hyps_lens) # (beam_size, max_hyps_len, vocab_size) + r_hyps_pad = st_reverse_pad_list(ori_hyps_pad, hyps_lens - 1, + self.model.sos, self.model.eos) + decoder_out, r_decoder_out, _ = self.model.decoder( + encoder_out, encoder_mask, hyps_pad, hyps_lens, r_hyps_pad, + self.model.reverse_weight) # (beam_size, max_hyps_len, vocab_size) # ctc score in ln domain decoder_out = paddle.nn.functional.log_softmax(decoder_out, axis=-1) decoder_out = decoder_out.numpy() + # r_decoder_out will be 0.0 if reverse_weight is 0.0 or the decoder is a + # conventional transformer decoder. + r_decoder_out = paddle.nn.functional.log_softmax(r_decoder_out, axis=-1) + r_decoder_out = r_decoder_out.numpy() + # Only use decoder score for rescoring best_score = -float('inf') best_index = 0 @@ -631,6 +639,13 @@ class PaddleASRConnectionHanddler: # last decoder output token is `eos`, for last decoder input token. score += decoder_out[i][len(hyp[0])][self.model.eos] + if self.model.reverse_weight > 0: + r_score = 0.0 + for j, w in enumerate(hyp[0]): + r_score += r_decoder_out[i][len(hyp[0]) - j - 1][w] + r_score += r_decoder_out[i][len(hyp[0])][self.model.eos] + score = score * (1 - self.model.reverse_weight + ) + r_score * self.model.reverse_weight # add ctc score (which in ln domain) score += hyp[1] * self.ctc_decode_config.ctc_weight
diff --git a/paddlespeech/server/engine/text/python/text_engine.py b/paddlespeech/server/engine/text/python/text_engine.py index 6167e7784..cc72c0543 100644 --- a/paddlespeech/server/engine/text/python/text_engine.py +++ b/paddlespeech/server/engine/text/python/text_engine.py @@ -107,11 +107,14 @@ class PaddleTextConnectionHandler: assert len(tokens) == len(labels) text = '' + is_fast_model = 'fast' in self.text_engine.config.model_type for t, l in zip(tokens, labels): text += t if l != 0: # Non punc. - text += self._punc_list[l] - + if is_fast_model: + text += self._punc_list[l - 1] + else: + text += self._punc_list[l] return text else: raise NotImplementedError @@ -160,14 +163,23 @@ class TextEngine(BaseEngine): return False self.executor = TextServerExecutor() - self.executor._init_from_path( - task=config.task, - model_type=config.model_type, - lang=config.lang, - cfg_path=config.cfg_path, - ckpt_path=config.ckpt_path, - vocab_file=config.vocab_file) - + if 'fast' in config.model_type: + self.executor._init_from_path_new( + task=config.task, + model_type=config.model_type, + lang=config.lang, + cfg_path=config.cfg_path, + ckpt_path=config.ckpt_path, + vocab_file=config.vocab_file) + else: + self.executor._init_from_path( + task=config.task, + model_type=config.model_type, + lang=config.lang, + cfg_path=config.cfg_path, + ckpt_path=config.ckpt_path, + vocab_file=config.vocab_file) + logger.info("Using model: %s." % (config.model_type)) logger.info("Initialize Text server engine successfully on device: %s." % (self.device)) return True
diff --git a/tests/unit/cli/test_cli.sh b/tests/unit/cli/test_cli.sh index 15604961d..c6837c303 100755 --- a/tests/unit/cli/test_cli.sh +++ b/tests/unit/cli/test_cli.sh @@ -7,7 +7,7 @@ wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/cat.wav https://paddlespe paddlespeech cls --input ./cat.wav --topk 10 # Punctuation_restoration -paddlespeech text --input 今天的天气真不错啊你下午有空吗我想约你一起去吃饭 +paddlespeech text --input 今天的天气真不错啊你下午有空吗我想约你一起去吃饭 --model ernie_linear_p3_wudao_fast # Speech_recognition wget -c https://paddlespeech.bj.bcebos.com/PaddleAudio/zh.wav https://paddlespeech.bj.bcebos.com/PaddleAudio/en.wav
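For reference, the label-index difference handled in text_engine.py above, as a standalone sketch (the vocab and the index-0 placeholder convention for old models are assumptions for illustration):

```python
# Sketch of the fast/slow label mapping handled in text_engine.py above:
# fast models map label l (l != 0) to punc_list[l - 1]; old models assume
# an index-0 placeholder and map l to punc_list[l]. Vocab is illustrative.
punc_list = ["，", "。", "？"]

def attach_punc(tokens, labels, is_fast_model):
    text = ""
    for t, l in zip(tokens, labels):
        text += t
        if l != 0:  # label 0 means "no punctuation after this token"
            text += punc_list[l - 1] if is_fast_model else ([""] + punc_list)[l]
    return text

print(attach_punc(list("你好吗"), [0, 0, 3], is_fast_model=True))  # -> 你好吗？
```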