diff --git a/README.md b/README.md
index 9d7ed4258..0a8566940 100644
--- a/README.md
+++ b/README.md
@@ -157,13 +157,15 @@ Via the easy-to-use, efficient, flexible and scalable implementation, our vision
- 🧩 *Cascaded models application*: as an extension of the typical traditional audio tasks, we combine the workflows of the aforementioned tasks with other fields like Natural language processing (NLP) and Computer Vision (CV).
### Recent Update
+- 🔥 2022.11.07: [U2/U2++ C++ High Performance Streaming ASR Deployment](https://github.com/PaddlePaddle/PaddleSpeech/blob/develop/speechx/examples/u2pp_ol/wenetspeech).
+- 👑 2022.11.01: Add [Adversarial Loss](https://arxiv.org/pdf/1907.04448.pdf) for [Chinese English mixed TTS](./examples/zh_en_tts/tts3).
- 🔥 2022.10.26: Add [Prosody Prediction](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/rhy) for TTS.
- 🎉 2022.10.21: Add [SSML](https://github.com/PaddlePaddle/PaddleSpeech/discussions/2538) for TTS Chinese Text Frontend.
- 👑 2022.10.11: Add [Wav2vec2ASR](./examples/librispeech/asr3), wav2vec2.0 fine-tuning for ASR on LibriSpeech.
-- 🔥 2022.09.26: Add Voice Cloning, TTS finetune, and ERNIE-SAT in [PaddleSpeech Web Demo](./demos/speech_web).
+- 🔥 2022.09.26: Add Voice Cloning, TTS finetune, and [ERNIE-SAT](https://arxiv.org/abs/2211.03545) in [PaddleSpeech Web Demo](./demos/speech_web).
- ⚡ 2022.09.09: Add AISHELL-3 Voice Cloning [example](./examples/aishell3/vc2) with ECAPA-TDNN speaker encoder.
- ⚡ 2022.08.25: Release TTS [finetune](./examples/other/tts_finetune/tts3) example.
-- 🔥 2022.08.22: Add ERNIE-SAT models: [ERNIE-SAT-vctk](./examples/vctk/ernie_sat)、[ERNIE-SAT-aishell3](./examples/aishell3/ernie_sat)、[ERNIE-SAT-zh_en](./examples/aishell3_vctk/ernie_sat).
+- 🔥 2022.08.22: Add [ERNIE-SAT](https://arxiv.org/abs/2211.03545) models: [ERNIE-SAT-vctk](./examples/vctk/ernie_sat)、[ERNIE-SAT-aishell3](./examples/aishell3/ernie_sat)、[ERNIE-SAT-zh_en](./examples/aishell3_vctk/ernie_sat).
- 🔥 2022.08.15: Add [g2pW](https://github.com/GitYCC/g2pW) into TTS Chinese Text Frontend.
- 🔥 2022.08.09: Release [Chinese English mixed TTS](./examples/zh_en_tts/tts3).
- ⚡ 2022.08.03: Add ONNXRuntime infer for TTS CLI.
@@ -578,7 +580,7 @@ PaddleSpeech supports a series of most popular models. They are summarized in [r
- ERNIE-SAT |
+ ERNIE-SAT |
VCTK / AISHELL-3 / ZH_EN |
ERNIE-SAT-vctk / ERNIE-SAT-aishell3 / ERNIE-SAT-zh_en
@@ -696,9 +697,9 @@ PaddleSpeech 的 **语音合成** 主要包含三个模块:文本前端、声
-
+
-**唤醒**
+**语音唤醒**
@@ -711,11 +712,11 @@ PaddleSpeech 的 **语音合成** 主要包含三个模块:文本前端、声
- 唤醒 |
+ 语音唤醒 |
hey-snips |
- PANN |
+ MDTC |
- pann-hey-snips
+ mdtc-hey-snips
|
diff --git a/demos/asr_deployment/README.md b/demos/asr_deployment/README.md
new file mode 100644
index 000000000..9d36f19f2
--- /dev/null
+++ b/demos/asr_deployment/README.md
@@ -0,0 +1,100 @@
+([简体中文](./README_cn.md)|English)
+# ASR Deployment by SpeechX
+
+## Introduction
+
+ASR deployment supports U2/U2++/DeepSpeech2 ASR models in C++, which is good practice for industrial deployment.
+
+For more information about SpeechX, please see [here](../../speechx/README.md).
+
+## Usage
+### 1. Environment
+
+* python - 3.7
+* docker - `registry.baidubce.com/paddlepaddle/paddle:2.2.2-gpu-cuda10.2-cudnn7`
+* os - Ubuntu 16.04.7 LTS
+* gcc/g++/gfortran - 8.2.0
+* cmake - 3.16.0
+
+For more information, please see [here](../../speechx/README.md).
+
+### 2. Compile SpeechX
+
+Please see [here](../../speechx/README.md).
+
+### 3. Usage
+
+For the U2++ ASR deployment example, please see [here](../../speechx/examples/u2pp_ol/wenetspeech/).
+
+First, go to the `speechx/examples/u2pp_ol/wenetspeech` directory, for example:
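+
+```bash
+cd speechx/examples/u2pp_ol/wenetspeech
+```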
+
+- Source path.sh
+ ```bash
+ source path.sh
+ ```
+
+- Download the model, and prepare the test data and CMVN file
+ ```bash
+  ./run.sh --stage 0 --stop_stage 1
+ ```
+
+- Decode with WAV
+
+ ```bash
+ # FP32
+ ./local/recognizer.sh
+
+ # INT8
+ ./local/recognizer_quant.sh
+ ```
+
+ Output:
+ ```bash
+ I1026 16:13:24.683531 48038 u2_recognizer_main.cc:55] utt: BAC009S0916W0495
+ I1026 16:13:24.683578 48038 u2_recognizer_main.cc:56] wav dur: 4.17119 sec.
+ I1026 16:13:24.683595 48038 u2_recognizer_main.cc:64] wav len (sample): 66739
+ I1026 16:13:25.037652 48038 u2_recognizer_main.cc:87] Pratial result: 3 这令
+ I1026 16:13:25.043697 48038 u2_recognizer_main.cc:87] Pratial result: 4 这令
+ I1026 16:13:25.222124 48038 u2_recognizer_main.cc:87] Pratial result: 5 这令被贷款
+ I1026 16:13:25.228385 48038 u2_recognizer_main.cc:87] Pratial result: 6 这令被贷款
+ I1026 16:13:25.414669 48038 u2_recognizer_main.cc:87] Pratial result: 7 这令被贷款的员工
+ I1026 16:13:25.420714 48038 u2_recognizer_main.cc:87] Pratial result: 8 这令被贷款的员工
+ I1026 16:13:25.608129 48038 u2_recognizer_main.cc:87] Pratial result: 9 这令被贷款的员工们请
+ I1026 16:13:25.801620 48038 u2_recognizer_main.cc:87] Pratial result: 10 这令被贷款的员工们请食难安
+ I1026 16:13:25.804101 48038 feature_cache.h:44] set finished
+ I1026 16:13:25.804128 48038 feature_cache.h:51] compute last feats done.
+ I1026 16:13:25.948771 48038 u2_recognizer_main.cc:87] Pratial result: 11 这令被贷款的员工们请食难安
+ I1026 16:13:26.246963 48038 u2_recognizer_main.cc:113] BAC009S0916W0495 这令被贷款的员工们请食难安
+ ```
+
+## Result
+
+> CER is computed on the aishell test set.
+> RTF = processing time / audio duration, computed including feature extraction and decoding, which is closer to end-to-end.
+> Machine: Intel(R) Xeon(R) Gold 6271C CPU @ 2.60GHz with `avx512_vnni`.
+
+### FP32
+
+```
+Overall -> 5.75 % N=104765 C=99035 S=5587 D=143 I=294
+Mandarin -> 5.75 % N=104762 C=99035 S=5584 D=143 I=294
+English -> 0.00 % N=0 C=0 S=0 D=0 I=0
+Other -> 100.00 % N=3 C=0 S=3 D=0 I=0
+```
+
+```
+RTF is: 0.315337
+```
+
+### INT8
+
+```
+Overall -> 5.83 % N=104765 C=98943 S=5675 D=147 I=286
+Mandarin -> 5.83 % N=104762 C=98943 S=5672 D=147 I=286
+English -> 0.00 % N=0 C=0 S=0 D=0 I=0
+Other -> 100.00 % N=3 C=0 S=3 D=0 I=0
+```
+
+```
+RTF is: 0.269674
+```
diff --git a/demos/asr_deployment/README_cn.md b/demos/asr_deployment/README_cn.md
new file mode 100644
index 000000000..ee4aa8489
--- /dev/null
+++ b/demos/asr_deployment/README_cn.md
@@ -0,0 +1,96 @@
+(简体中文|[English](./README.md))
+# 基于SpeechX 的 ASR 部署
+
+## 简介
+
+支持 U2/U2++/Deepspeech2 模型的 C++ 部署,其在工业实践中经常被用到。
+
+更多 Speechx 信息可以参看[文档](../../speechx/README.md)。
+
+## 使用
+### 1. 环境
+
+* python - 3.7
+* docker - `registry.baidubce.com/paddlepaddle/paddle:2.2.2-gpu-cuda10.2-cudnn7`
+* os - Ubuntu 16.04.7 LTS
+* gcc/g++/gfortran - 8.2.0
+* cmake - 3.16.0
+
+更多信息可以参看[文档](../../speechx/README.md)。
+
+### 2. 编译 SpeechX
+
+更多信息可以参看[文档](../../speechx/README.md)。
+
+### 3. 例子
+
+u2++ 识别部署参看[这里](../../speechx/examples/u2pp_ol/wenetspeech/)。
+
+以下命令均在 `speechx/examples/u2pp_ol/wenetspeech` 目录下执行,例如:
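+
+```bash
+cd speechx/examples/u2pp_ol/wenetspeech
+```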
+
+- Source path.sh
+ ```bash
+ source path.sh
+ ```
+
+- 下载模型,准备测试数据和cmvn文件
+ ```bash
+  ./run.sh --stage 0 --stop_stage 1
+ ```
+
+- 解码
+
+ ```bash
+ # FP32
+ ./local/recognizer.sh
+
+ # INT8
+ ./local/recognizer_quant.sh
+ ```
+
+ 输出:
+ ```bash
+ I1026 16:13:24.683531 48038 u2_recognizer_main.cc:55] utt: BAC009S0916W0495
+ I1026 16:13:24.683578 48038 u2_recognizer_main.cc:56] wav dur: 4.17119 sec.
+ I1026 16:13:24.683595 48038 u2_recognizer_main.cc:64] wav len (sample): 66739
+ I1026 16:13:25.037652 48038 u2_recognizer_main.cc:87] Pratial result: 3 这令
+ I1026 16:13:25.043697 48038 u2_recognizer_main.cc:87] Pratial result: 4 这令
+ I1026 16:13:25.222124 48038 u2_recognizer_main.cc:87] Pratial result: 5 这令被贷款
+ I1026 16:13:25.228385 48038 u2_recognizer_main.cc:87] Pratial result: 6 这令被贷款
+ I1026 16:13:25.414669 48038 u2_recognizer_main.cc:87] Pratial result: 7 这令被贷款的员工
+ I1026 16:13:25.420714 48038 u2_recognizer_main.cc:87] Pratial result: 8 这令被贷款的员工
+ I1026 16:13:25.608129 48038 u2_recognizer_main.cc:87] Pratial result: 9 这令被贷款的员工们请
+ I1026 16:13:25.801620 48038 u2_recognizer_main.cc:87] Pratial result: 10 这令被贷款的员工们请食难安
+ I1026 16:13:25.804101 48038 feature_cache.h:44] set finished
+ I1026 16:13:25.804128 48038 feature_cache.h:51] compute last feats done.
+ I1026 16:13:25.948771 48038 u2_recognizer_main.cc:87] Pratial result: 11 这令被贷款的员工们请食难安
+ I1026 16:13:26.246963 48038 u2_recognizer_main.cc:113] BAC009S0916W0495 这令被贷款的员工们请食难安
+ ```
+
+## 结果
+
+> CER 测试集为 aishell-test
+> RTF 计算包含提特征和解码
+> 测试机器: Intel(R) Xeon(R) Gold 6271C CPU @ 2.60GHz avx512_vnni
+
+### FP32
+
+```
+Overall -> 5.75 % N=104765 C=99035 S=5587 D=143 I=294
+Mandarin -> 5.75 % N=104762 C=99035 S=5584 D=143 I=294
+English -> 0.00 % N=0 C=0 S=0 D=0 I=0
+Other -> 100.00 % N=3 C=0 S=3 D=0 I=0
+```
+
+```
+RTF is: 0.315337
+```
+
+### INT8
+
+```
+Overall -> 5.87 % N=104765 C=98909 S=5711 D=145 I=289
+Mandarin -> 5.86 % N=104762 C=98909 S=5708 D=145 I=289
+English -> 0.00 % N=0 C=0 S=0 D=0 I=0
+Other -> 100.00 % N=3 C=0 S=3 D=0 I=0
+```
diff --git a/docs/source/cls/custom_dataset.md b/docs/source/cls/custom_dataset.md
index e39dcf12d..b7c06cd7a 100644
--- a/docs/source/cls/custom_dataset.md
+++ b/docs/source/cls/custom_dataset.md
@@ -108,7 +108,7 @@ for epoch in range(1, epochs + 1):
optimizer.clear_grad()
# Calculate loss
- avg_loss = loss.numpy()[0]
+ avg_loss = float(loss)
# Calculate metrics
preds = paddle.argmax(logits, axis=1)
diff --git a/docs/source/released_model.md b/docs/source/released_model.md
index 2f3c9d098..79e8f4f46 100644
--- a/docs/source/released_model.md
+++ b/docs/source/released_model.md
@@ -22,7 +22,7 @@ Acoustic Model | Training Data | Token-based | Size | Descriptions | CER | WER |
Model | Pre-Train Method | Pre-Train Data | Finetune Data | Size | Descriptions | CER | WER | Example Link |
:-------------:| :------------:| :-----: | -----: | :-----: |:-----:| :-----: | :-----: | :-----: |
[Wav2vec2-large-960h-lv60-self Model](https://paddlespeech.bj.bcebos.com/wav2vec/wav2vec2-large-960h-lv60-self.pdparams) | wav2vec2 | Librispeech and LV-60k Dataset (5.3w h) | - | 1.18 GB |Pre-trained Wav2vec2.0 Model | - | - | - |
-[Wav2vec2ASR-large-960h-librispeech Model](https://paddlespeech.bj.bcebos.com/s2t/librispeech/asr3/wav2vec2ASR-large-960h-librispeech_ckpt_1.3.0.model.tar.gz) | wav2vec2 | Librispeech and LV-60k Dataset (5.3w h) | Librispeech (960 h) | 1.18 GB |Encoder: Wav2vec2.0, Decoder: CTC, Decoding method: Greedy search | - | 0.0189 | [Wav2vecASR Librispeech ASR3](../../examples/librispeech/asr3) |
+[Wav2vec2ASR-large-960h-librispeech Model](https://paddlespeech.bj.bcebos.com/s2t/librispeech/asr3/wav2vec2ASR-large-960h-librispeech_ckpt_1.3.1.model.tar.gz) | wav2vec2 | Librispeech and LV-60k Dataset (5.3w h) | Librispeech (960 h) | 718 MB |Encoder: Wav2vec2.0, Decoder: CTC, Decoding method: Greedy search | - | 0.0189 | [Wav2vecASR Librispeech ASR3](../../examples/librispeech/asr3) |
### Language Model based on NGram
Language Model | Training Data | Token-based | Size | Descriptions
diff --git a/docs/tutorial/cls/cls_tutorial.ipynb b/docs/tutorial/cls/cls_tutorial.ipynb
index 56b488adc..3cee64991 100644
--- a/docs/tutorial/cls/cls_tutorial.ipynb
+++ b/docs/tutorial/cls/cls_tutorial.ipynb
@@ -509,7 +509,7 @@
" optimizer.clear_grad()\n",
"\n",
" # Calculate loss\n",
- " avg_loss += loss.numpy()[0]\n",
+ " avg_loss += float(loss)\n",
"\n",
" # Calculate metrics\n",
" preds = paddle.argmax(logits, axis=1)\n",
diff --git a/examples/aishell3/ernie_sat/README.md b/examples/aishell3/ernie_sat/README.md
index 9b7768985..bd5964c3a 100644
--- a/examples/aishell3/ernie_sat/README.md
+++ b/examples/aishell3/ernie_sat/README.md
@@ -1,5 +1,5 @@
# ERNIE-SAT with AISHELL-3 dataset
-ERNIE-SAT speech-text joint pretraining framework, which achieves SOTA results in cross-lingual multi-speaker speech synthesis and cross-lingual speech editing tasks, It can be applied to a series of scenarios such as Speech Editing, personalized Speech Synthesis, and Voice Cloning.
+[ERNIE-SAT](https://arxiv.org/abs/2211.03545) is a speech-text joint pretraining framework that achieves SOTA results in cross-lingual multi-speaker speech synthesis and cross-lingual speech editing tasks. It can be applied to a series of scenarios such as Speech Editing, personalized Speech Synthesis, and Voice Cloning.
## Model Framework
In ERNIE-SAT, we propose two innovations:
diff --git a/examples/aishell3/tts3/local/export2lite.sh b/examples/aishell3/tts3/local/export2lite.sh
new file mode 120000
index 000000000..f7719914a
--- /dev/null
+++ b/examples/aishell3/tts3/local/export2lite.sh
@@ -0,0 +1 @@
+../../../csmsc/tts3/local/export2lite.sh
\ No newline at end of file
diff --git a/examples/aishell3/tts3/run.sh b/examples/aishell3/tts3/run.sh
index f730f3761..90b342125 100755
--- a/examples/aishell3/tts3/run.sh
+++ b/examples/aishell3/tts3/run.sh
@@ -58,3 +58,13 @@ fi
if [ ${stage} -le 6 ] && [ ${stop_stage} -ge 6 ]; then
./local/ort_predict.sh ${train_output_path}
fi
+
+if [ ${stage} -le 7 ] && [ ${stop_stage} -ge 7 ]; then
+ # This model is not supported, because 3 ops are not supported on 'arm'. These unsupported ops are: 'round, set_value, share_data'.
+ # This model is not supported, because 4 ops are not supported on 'x86'. These unsupported ops are: 'matmul_v2, round, set_value, share_data'.
+ # ./local/export2lite.sh ${train_output_path} inference pdlite fastspeech2_aishell3 x86
+ # x86 ok, arm Segmentation fault
+ # ./local/export2lite.sh ${train_output_path} inference pdlite pwgan_aishell3 x86
+ # x86 ok, arm ok
+ ./local/export2lite.sh ${train_output_path} inference pdlite hifigan_aishell3 x86
+fi
diff --git a/examples/aishell3_vctk/ernie_sat/README.md b/examples/aishell3_vctk/ernie_sat/README.md
index 321957835..fbf9244d1 100644
--- a/examples/aishell3_vctk/ernie_sat/README.md
+++ b/examples/aishell3_vctk/ernie_sat/README.md
@@ -1,5 +1,5 @@
# ERNIE-SAT with AISHELL-3 and VCTK dataset
-ERNIE-SAT speech-text joint pretraining framework, which achieves SOTA results in cross-lingual multi-speaker speech synthesis and cross-lingual speech editing tasks, It can be applied to a series of scenarios such as Speech Editing, personalized Speech Synthesis, and Voice Cloning.
+[ERNIE-SAT](https://arxiv.org/abs/2211.03545) is a speech-text joint pretraining framework that achieves SOTA results in cross-lingual multi-speaker speech synthesis and cross-lingual speech editing tasks. It can be applied to a series of scenarios such as Speech Editing, personalized Speech Synthesis, and Voice Cloning.
## Model Framework
In ERNIE-SAT, we propose two innovations:
diff --git a/examples/csmsc/tts2/local/export2lite.sh b/examples/csmsc/tts2/local/export2lite.sh
new file mode 120000
index 000000000..402fd8334
--- /dev/null
+++ b/examples/csmsc/tts2/local/export2lite.sh
@@ -0,0 +1 @@
+../../tts3/local/export2lite.sh
\ No newline at end of file
diff --git a/examples/csmsc/tts2/run.sh b/examples/csmsc/tts2/run.sh
index 557dd4ff3..75fdb2109 100755
--- a/examples/csmsc/tts2/run.sh
+++ b/examples/csmsc/tts2/run.sh
@@ -60,3 +60,16 @@ fi
if [ ${stage} -le 6 ] && [ ${stop_stage} -ge 6 ]; then
./local/ort_predict.sh ${train_output_path}
fi
+
+# must run after stage 3 (the stage that generates the static models)
+if [ ${stage} -le 7 ] && [ ${stop_stage} -ge 7 ]; then
+ # This model is not supported, because 3 ops are not supported on 'arm'. These unsupported ops are: 'round, set_value, share_data'.
+ # This model is not supported, because 4 ops are not supported on 'x86'. These unsupported ops are: 'matmul_v2, round, set_value, share_data'.
+ ./local/export2lite.sh ${train_output_path} inference pdlite speedyspeech_csmsc x86
+ # x86 ok, arm Segmentation fault
+ # ./local/export2lite.sh ${train_output_path} inference pdlite pwgan_csmsc x86
+ # x86 ok, arm Segmentation fault
+ # ./local/export2lite.sh ${train_output_path} inference pdlite mb_melgan_csmsc x86
+ # x86 ok, arm ok
+ # ./local/export2lite.sh ${train_output_path} inference pdlite hifigan_csmsc x86
+fi
diff --git a/examples/csmsc/tts3/local/export2lite.sh b/examples/csmsc/tts3/local/export2lite.sh
new file mode 100755
index 000000000..f99905cfe
--- /dev/null
+++ b/examples/csmsc/tts3/local/export2lite.sh
@@ -0,0 +1,18 @@
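+# Convert an exported static inference model (*.pdmodel / *.pdiparams) into a
+# Paddle-Lite model with paddle_lite_opt.
+# Usage:  ./local/export2lite.sh <train_output_path> <model_dir> <output_dir> <model> <valid_targets>
+# e.g.:   ./local/export2lite.sh ${train_output_path} inference pdlite fastspeech2_csmsc x86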
+train_output_path=$1
+model_dir=$2
+output_dir=$3
+model=$4
+valid_targets=$5
+
+model_name=${model%_*}
+echo model_name: ${model_name}
+
+mkdir -p ${train_output_path}/${output_dir}
+
+paddle_lite_opt \
+ --model_file ${train_output_path}/${model_dir}/${model}.pdmodel \
+ --param_file ${train_output_path}/${model_dir}/${model}.pdiparams \
+ --optimize_out ${train_output_path}/${output_dir}/${model}_${valid_targets} \
+ --valid_targets ${valid_targets}
diff --git a/examples/csmsc/tts3/run.sh b/examples/csmsc/tts3/run.sh
index 80acf8200..8d646ecc3 100755
--- a/examples/csmsc/tts3/run.sh
+++ b/examples/csmsc/tts3/run.sh
@@ -61,3 +61,16 @@ fi
if [ ${stage} -le 6 ] && [ ${stop_stage} -ge 6 ]; then
./local/ort_predict.sh ${train_output_path}
fi
+
+# must run after stage 3 (the stage that generates the static models)
+if [ ${stage} -le 7 ] && [ ${stop_stage} -ge 7 ]; then
+ # This model is not supported, because 3 ops are not supported on 'arm'. These unsupported ops are: 'round, set_value, share_data'.
+ # This model is not supported, because 4 ops are not supported on 'x86'. These unsupported ops are: 'matmul_v2, round, set_value, share_data'.
+ ./local/export2lite.sh ${train_output_path} inference pdlite fastspeech2_csmsc x86
+ # x86 ok, arm Segmentation fault
+ # ./local/export2lite.sh ${train_output_path} inference pdlite pwgan_csmsc x86
+ # x86 ok, arm Segmentation fault
+ # ./local/export2lite.sh ${train_output_path} inference pdlite mb_melgan_csmsc x86
+ # x86 ok, arm ok
+ # ./local/export2lite.sh ${train_output_path} inference pdlite hifigan_csmsc x86
+fi
diff --git a/examples/csmsc/tts3/run_cnndecoder.sh b/examples/csmsc/tts3/run_cnndecoder.sh
index bae833157..645d1af09 100755
--- a/examples/csmsc/tts3/run_cnndecoder.sh
+++ b/examples/csmsc/tts3/run_cnndecoder.sh
@@ -75,7 +75,6 @@ if [ ${stage} -le 8 ] && [ ${stop_stage} -ge 8 ]; then
fi
# paddle2onnx streaming
-
if [ ${stage} -le 9 ] && [ ${stop_stage} -ge 9 ]; then
# install paddle2onnx
version=$(echo `pip list |grep "paddle2onnx"` |awk -F" " '{print $2}')
@@ -97,3 +96,34 @@ if [ ${stage} -le 10 ] && [ ${stop_stage} -ge 10 ]; then
./local/ort_predict_streaming.sh ${train_output_path}
fi
+# must run after stage 3 (the stage that generates the static models)
+if [ ${stage} -le 11 ] && [ ${stop_stage} -ge 11 ]; then
+ # This model is not supported, because 3 ops are not supported on 'arm'. These unsupported ops are: 'round, set_value, share_data'.
+ # This model is not supported, because 4 ops are not supported on 'x86'. These unsupported ops are: 'matmul_v2, round, set_value, share_data'.
+ ./local/export2lite.sh ${train_output_path} inference pdlite fastspeech2_csmsc x86
+ # x86 ok, arm Segmentation fault
+ # ./local/export2lite.sh ${train_output_path} inference pdlite pwgan_csmsc x86
+ # x86 ok, arm Segmentation fault
+ # ./local/export2lite.sh ${train_output_path} inference pdlite mb_melgan_csmsc x86
+ # x86 ok, arm ok
+ # ./local/export2lite.sh ${train_output_path} inference pdlite hifigan_csmsc x86
+fi
+
+# must run after stage 5 (the stage that generates the streaming static models)
+if [ ${stage} -le 12 ] && [ ${stop_stage} -ge 12 ]; then
+ # streaming acoustic model
+ # This model is not supported, because 3 ops are not supported on 'arm'. These unsupported ops are: 'round, set_value, share_data'.
+ # This model is not supported, because 4 ops are not supported on 'x86'. These unsupported ops are: 'matmul_v2, round, set_value, share_data'.
+ # ./local/export2lite.sh ${train_output_path} inference pdlite fastspeech2_csmsc x86
+ ./local/export2lite.sh ${train_output_path} inference_streaming pdlite_streaming fastspeech2_csmsc_am_encoder_infer x86
+ # x86 ok, arm Segmentation fault
+ ./local/export2lite.sh ${train_output_path} inference_streaming pdlite_streaming fastspeech2_csmsc_am_decoder x86
+ # x86 ok, arm Segmentation fault
+ ./local/export2lite.sh ${train_output_path} inference_streaming pdlite_streaming fastspeech2_csmsc_am_postnet x86
+ # x86 ok, arm Segmentation fault
+ # ./local/export2lite.sh ${train_output_path} inference_streaming pdlite_streaming pwgan_csmsc x86
+ # x86 ok, arm Segmentation fault
+ # ./local/export2lite.sh ${train_output_path} inference_streaming pdlite_streaming mb_melgan_csmsc x86
+ # x86 ok, arm ok
+ # ./local/export2lite.sh ${train_output_path} inference_streaming pdlite_streaming hifigan_csmsc x86
+fi
diff --git a/examples/librispeech/asr3/conf/wav2vec2ASR.yaml b/examples/librispeech/asr3/conf/wav2vec2ASR.yaml
index b19881b70..c45bd692a 100644
--- a/examples/librispeech/asr3/conf/wav2vec2ASR.yaml
+++ b/examples/librispeech/asr3/conf/wav2vec2ASR.yaml
@@ -70,7 +70,6 @@ train_manifest: data/manifest.train
dev_manifest: data/manifest.dev
test_manifest: data/manifest.test-clean
-
###########################################
# Dataloader #
###########################################
@@ -95,6 +94,12 @@ dist_sampler: True
shortest_first: True
return_lens_rate: True
+############################################
+# Data Augmentation #
+############################################
+audio_augment: # for raw audio
+ sample_rate: 16000
+  speeds: [95, 100, 105]  # speed perturbation factors, in percent of the original speed
###########################################
# Training #
@@ -115,6 +120,3 @@ log_interval: 1
checkpoint:
kbest_n: 50
latest_n: 5
-augment: True
-
-
diff --git a/examples/ljspeech/tts3/local/export2lite.sh b/examples/ljspeech/tts3/local/export2lite.sh
new file mode 120000
index 000000000..f7719914a
--- /dev/null
+++ b/examples/ljspeech/tts3/local/export2lite.sh
@@ -0,0 +1 @@
+../../../csmsc/tts3/local/export2lite.sh
\ No newline at end of file
diff --git a/examples/ljspeech/tts3/run.sh b/examples/ljspeech/tts3/run.sh
index 956185935..7ab591862 100755
--- a/examples/ljspeech/tts3/run.sh
+++ b/examples/ljspeech/tts3/run.sh
@@ -59,3 +59,14 @@ fi
if [ ${stage} -le 6 ] && [ ${stop_stage} -ge 6 ]; then
./local/ort_predict.sh ${train_output_path}
fi
+
+# must run after stage 3 (the stage that generates the static models)
+if [ ${stage} -le 7 ] && [ ${stop_stage} -ge 7 ]; then
+ # This model is not supported, because 3 ops are not supported on 'arm'. These unsupported ops are: 'round, set_value, share_data'.
+ # This model is not supported, because 4 ops are not supported on 'x86'. These unsupported ops are: 'matmul_v2, round, set_value, share_data'.
+ ./local/export2lite.sh ${train_output_path} inference pdlite fastspeech2_ljspeech x86
+ # x86 ok, arm Segmentation fault
+ # ./local/export2lite.sh ${train_output_path} inference pdlite pwgan_ljspeech x86
+ # x86 ok, arm ok
+ # ./local/export2lite.sh ${train_output_path} inference pdlite hifigan_ljspeech x86
+fi
\ No newline at end of file
diff --git a/examples/other/mfa/README.md b/examples/other/mfa/README.md
index c24524ab4..216d1275b 100644
--- a/examples/other/mfa/README.md
+++ b/examples/other/mfa/README.md
@@ -4,3 +4,6 @@ Run the following script to get started, for more detail, please see `run.sh`.
```bash
./run.sh
```
+# Rhythm tags for MFA
+If you want to get rhythm tags with duration through the MFA tool, you may add the flag `--rhy-with-duration` to the first two commands in `run.sh`.
+Note that only the CSMSC dataset is supported so far, and we replace `#` with `sp` in the rhythm tags for MFA.
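+
+For example, a sketch of those two commands with the flag added (the directories and the lexicon prefix below are placeholders; see `run.sh` for the exact invocations):
+```bash
+python local/reorganize_baker.py --root-dir=<BZNSYP_dir> --output-dir=<corpus_dir> --resample-audio --rhy-with-duration
+python local/generate_lexicon.py <lexicon_prefix> --with-tone --rhy-with-duration
+```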
diff --git a/examples/other/mfa/local/generate_lexicon.py b/examples/other/mfa/local/generate_lexicon.py
index e9445665b..3deb24701 100644
--- a/examples/other/mfa/local/generate_lexicon.py
+++ b/examples/other/mfa/local/generate_lexicon.py
@@ -182,12 +182,17 @@ if __name__ == "__main__":
"--with-tone", action="store_true", help="whether to consider tone.")
parser.add_argument(
"--with-r", action="store_true", help="whether to consider erhua.")
+    parser.add_argument(
+        "--rhy-with-duration",
+        action="store_true",
+        help="whether to add rhythm tags (sp1-sp4) to the generated lexicon.")
args = parser.parse_args()
lexicon = generate_lexicon(args.with_tone, args.with_r)
symbols = generate_symbols(lexicon)
with open(args.output + ".lexicon", 'wt') as f:
+ if args.rhy_with_duration:
+ f.write("sp1 sp1\nsp2 sp2\nsp3 sp3\nsp4 sp4\n")
for k, v in lexicon.items():
f.write(f"{k} {v}\n")
diff --git a/examples/other/mfa/local/reorganize_baker.py b/examples/other/mfa/local/reorganize_baker.py
index 153e01d13..0e0035bda 100644
--- a/examples/other/mfa/local/reorganize_baker.py
+++ b/examples/other/mfa/local/reorganize_baker.py
@@ -23,6 +23,7 @@ for more details.
"""
import argparse
import os
+import re
import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
@@ -32,6 +33,22 @@ import librosa
import soundfile as sf
from tqdm import tqdm
+replace_dict = {
+ ";": "",
+ "。": "",
+ ":": "",
+ "—": "",
+ ")": "",
+ ",": "",
+ "“": "",
+ "(": "",
+ "、": "",
+ "…": "",
+ "!": "",
+ "?": "",
+ "”": ""
+}
+
def get_transcripts(path: Union[str, Path]):
transcripts = {}
@@ -55,9 +72,13 @@ def resample_and_save(source, target, sr=16000):
def reorganize_baker(root_dir: Union[str, Path],
output_dir: Union[str, Path]=None,
- resample_audio=False):
+ resample_audio=False,
+ rhy_dur=False):
root_dir = Path(root_dir).expanduser()
- transcript_path = root_dir / "ProsodyLabeling" / "000001-010000.txt"
+ if rhy_dur:
+ transcript_path = root_dir / "ProsodyLabeling" / "000001-010000_rhy.txt"
+ else:
+ transcript_path = root_dir / "ProsodyLabeling" / "000001-010000.txt"
transcriptions = get_transcripts(transcript_path)
wave_dir = root_dir / "Wave"
@@ -92,6 +113,46 @@ def reorganize_baker(root_dir: Union[str, Path],
print("Done!")
+def insert_rhy(sentence_first, sentence_second):
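+    """Insert rhythm tokens (sp1-sp4) into a pinyin sequence.
+
+    `sentence_first` is the Chinese transcript annotated with #1-#4 prosody
+    marks; `sentence_second` is the space-separated pinyin transcription.
+    Each #N mark is converted to an spN token inserted at the corresponding
+    position of the pinyin sequence.
+    """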
+ sub = '#'
+ return_words = []
+    sentence_first = sentence_first.translate(str.maketrans(replace_dict))
+ rhy_idx = [substr.start() for substr in re.finditer(sub, sentence_first)]
+ re_rhy_idx = []
+ sentence_first_ = sentence_first.replace("#1", "").replace(
+ "#2", "").replace("#3", "").replace("#4", "")
+ sentence_seconds = sentence_second.split(" ")
+ for i, w in enumerate(rhy_idx):
+ re_rhy_idx.append(w - i * 2)
+ i = 0
+ # print("re_rhy_idx: ", re_rhy_idx)
+ for sentence_s in (sentence_seconds):
+ return_words.append(sentence_s)
+ if i < len(re_rhy_idx) and len(return_words) - i == re_rhy_idx[i]:
+ return_words.append("sp" + sentence_first[rhy_idx[i] + 1:rhy_idx[i]
+ + 2])
+ i = i + 1
+ return return_words
+
+
+def normalize_rhy(root_dir: Union[str, Path]):
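+    """Create ProsodyLabeling/000001-010000_rhy.txt from 000001-010000.txt,
+    inserting spN rhythm tokens into each pinyin line according to the #N
+    prosody marks of the corresponding Chinese transcript line.
+    """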
+ root_dir = Path(root_dir).expanduser()
+ transcript_path = root_dir / "ProsodyLabeling" / "000001-010000.txt"
+ target_transcript_path = root_dir / "ProsodyLabeling" / "000001-010000_rhy.txt"
+
+ with open(transcript_path) as f:
+ lines = f.readlines()
+
+ with open(target_transcript_path, 'wt') as f:
+ for i in range(0, len(lines), 2):
+            sentence_first = lines[i]  # keep the first line (Chinese transcript) as-is
+ f.write(sentence_first)
+ transcription = lines[i + 1].strip()
+ f.write("\t" + " ".join(
+ insert_rhy(sentence_first.split('\t')[1], transcription)) +
+ "\n")
+
+
if __name__ == "__main__":
parser = argparse.ArgumentParser(
description="Reorganize Baker dataset for MFA")
@@ -104,6 +165,12 @@ if __name__ == "__main__":
"--resample-audio",
action="store_true",
help="To resample audio files or just copy them")
+    parser.add_argument(
+        "--rhy-with-duration",
+        action="store_true",
+        help="whether to generate and use the rhythm-annotated transcript (000001-010000_rhy.txt).")
args = parser.parse_args()
- reorganize_baker(args.root_dir, args.output_dir, args.resample_audio)
+ if args.rhy_with_duration:
+ normalize_rhy(args.root_dir)
+ reorganize_baker(args.root_dir, args.output_dir, args.resample_audio,
+ args.rhy_with_duration)
diff --git a/examples/other/tn/data/textnorm_test_cases.txt b/examples/other/tn/data/textnorm_test_cases.txt
index e9a479b47..17e90d0b6 100644
--- a/examples/other/tn/data/textnorm_test_cases.txt
+++ b/examples/other/tn/data/textnorm_test_cases.txt
@@ -122,4 +122,6 @@ iPad Pro的秒控键盘这次也推出白色版本。|iPad Pro的秒控键盘这
近期也一反常态地发表看空言论|近期也一反常态地发表看空言论
985|九八五
12~23|十二到二十三
-12-23|十二到二十三
\ No newline at end of file
+12-23|十二到二十三
+25cm²|二十五平方厘米
+25m|米
\ No newline at end of file
diff --git a/examples/other/tts_finetune/tts3/run_mix.sh b/examples/other/tts_finetune/tts3/run_mix.sh
index 71008ef5b..960278a53 100755
--- a/examples/other/tts_finetune/tts3/run_mix.sh
+++ b/examples/other/tts_finetune/tts3/run_mix.sh
@@ -108,3 +108,4 @@ if [ ${stage} -le 6 ] && [ ${stop_stage} -ge 6 ]; then
--spk_id=$replace_spkid
fi
+
diff --git a/examples/vctk/ernie_sat/README.md b/examples/vctk/ernie_sat/README.md
index 94c7ae25d..1808e2074 100644
--- a/examples/vctk/ernie_sat/README.md
+++ b/examples/vctk/ernie_sat/README.md
@@ -1,5 +1,5 @@
# ERNIE-SAT with VCTK dataset
-ERNIE-SAT speech-text joint pretraining framework, which achieves SOTA results in cross-lingual multi-speaker speech synthesis and cross-lingual speech editing tasks, It can be applied to a series of scenarios such as Speech Editing, personalized Speech Synthesis, and Voice Cloning.
+[ERNIE-SAT](https://arxiv.org/abs/2211.03545) is a speech-text joint pretraining framework that achieves SOTA results in cross-lingual multi-speaker speech synthesis and cross-lingual speech editing tasks. It can be applied to a series of scenarios such as Speech Editing, personalized Speech Synthesis, and Voice Cloning.
## Model Framework
In ERNIE-SAT, we propose two innovations:
diff --git a/examples/vctk/tts3/local/export2lite.sh b/examples/vctk/tts3/local/export2lite.sh
new file mode 120000
index 000000000..f7719914a
--- /dev/null
+++ b/examples/vctk/tts3/local/export2lite.sh
@@ -0,0 +1 @@
+../../../csmsc/tts3/local/export2lite.sh
\ No newline at end of file
diff --git a/examples/vctk/tts3/run.sh b/examples/vctk/tts3/run.sh
index b5184aed8..16f1eae18 100755
--- a/examples/vctk/tts3/run.sh
+++ b/examples/vctk/tts3/run.sh
@@ -58,3 +58,14 @@ fi
if [ ${stage} -le 6 ] && [ ${stop_stage} -ge 6 ]; then
./local/ort_predict.sh ${train_output_path}
fi
+
+# must run after stage 3 (the stage that generates the static models)
+if [ ${stage} -le 7 ] && [ ${stop_stage} -ge 7 ]; then
+ # This model is not supported, because 3 ops are not supported on 'arm'. These unsupported ops are: 'round, set_value, share_data'.
+ # This model is not supported, because 4 ops are not supported on 'x86'. These unsupported ops are: 'matmul_v2, round, set_value, share_data'.
+ ./local/export2lite.sh ${train_output_path} inference pdlite fastspeech2_vctk x86
+ # x86 ok, arm Segmentation fault
+ # ./local/export2lite.sh ${train_output_path} inference pdlite pwgan_vctk x86
+ # x86 ok, arm ok
+ # ./local/export2lite.sh ${train_output_path} inference pdlite hifigan_vctk x86
+fi
diff --git a/paddlespeech/cls/exps/panns/train.py b/paddlespeech/cls/exps/panns/train.py
index fba38a01c..133893081 100644
--- a/paddlespeech/cls/exps/panns/train.py
+++ b/paddlespeech/cls/exps/panns/train.py
@@ -101,7 +101,7 @@ if __name__ == "__main__":
optimizer.clear_grad()
# Calculate loss
- avg_loss += loss.numpy()[0]
+ avg_loss += float(loss)
# Calculate metrics
preds = paddle.argmax(logits, axis=1)
diff --git a/paddlespeech/kws/exps/mdtc/train.py b/paddlespeech/kws/exps/mdtc/train.py
index 94e45d590..d5bb5e020 100644
--- a/paddlespeech/kws/exps/mdtc/train.py
+++ b/paddlespeech/kws/exps/mdtc/train.py
@@ -110,7 +110,7 @@ if __name__ == '__main__':
optimizer.clear_grad()
# Calculate loss
- avg_loss += loss.numpy()[0]
+ avg_loss += float(loss)
# Calculate metrics
num_corrects += corrects
diff --git a/paddlespeech/s2t/exps/wav2vec2/model.py b/paddlespeech/s2t/exps/wav2vec2/model.py
index 933e268ed..4f6bc0c5b 100644
--- a/paddlespeech/s2t/exps/wav2vec2/model.py
+++ b/paddlespeech/s2t/exps/wav2vec2/model.py
@@ -71,7 +71,8 @@ class Wav2Vec2ASRTrainer(Trainer):
wavs_lens_rate = wavs_lens / wav.shape[1]
target_lens_rate = target_lens / target.shape[1]
wav = wav[:, :, 0]
- wav = self.speech_augmentation(wav, wavs_lens_rate)
+        if hasattr(train_conf, 'audio_augment'):
+ wav = self.speech_augmentation(wav, wavs_lens_rate)
loss = self.model(wav, wavs_lens_rate, target, target_lens_rate)
# loss div by `batch_size * accum_grad`
loss /= train_conf.accum_grad
@@ -277,7 +278,9 @@ class Wav2Vec2ASRTrainer(Trainer):
logger.info("Setup model!")
# setup speech augmentation for wav2vec2
- self.speech_augmentation = TimeDomainSpecAugment()
+ if hasattr(config, 'audio_augment') and self.train:
+ self.speech_augmentation = TimeDomainSpecAugment(
+ **config.audio_augment)
if not self.train:
return
diff --git a/paddlespeech/s2t/models/wav2vec2/processing/speech_augmentation.py b/paddlespeech/s2t/models/wav2vec2/processing/speech_augmentation.py
index 78a0782e7..ac9bf45db 100644
--- a/paddlespeech/s2t/models/wav2vec2/processing/speech_augmentation.py
+++ b/paddlespeech/s2t/models/wav2vec2/processing/speech_augmentation.py
@@ -641,14 +641,11 @@ class DropChunk(nn.Layer):
class TimeDomainSpecAugment(nn.Layer):
"""A time-domain approximation of the SpecAugment algorithm.
-
This augmentation module implements three augmentations in
the time-domain.
-
1. Drop chunks of the audio (zero amplitude or white noise)
2. Drop frequency bands (with band-drop filters)
3. Speed peturbation (via resampling to slightly different rate)
-
Arguments
---------
perturb_prob : float from 0 to 1
@@ -677,7 +674,6 @@ class TimeDomainSpecAugment(nn.Layer):
drop_chunk_noise_factor : float
The noise factor used to scale the white noise inserted, relative to
the average amplitude of the utterance. Default 0 (no noise inserted).
-
Example
-------
>>> inputs = paddle.randn([10, 16000])
@@ -718,7 +714,6 @@ class TimeDomainSpecAugment(nn.Layer):
def forward(self, waveforms, lengths):
"""Returns the distorted waveforms.
-
Arguments
---------
waveforms : tensor
diff --git a/paddlespeech/t2s/exps/sentences_ssml.txt b/paddlespeech/t2s/exps/sentences_ssml.txt
new file mode 100644
index 000000000..e3614f224
--- /dev/null
+++ b/paddlespeech/t2s/exps/sentences_ssml.txt
@@ -0,0 +1,10 @@
+0001 考古人员西藏布达拉宫里发现一个被隐藏的装有宝藏箱子。
+0002 有人询问中国银行北京分行行长是否叫任我行。
+0003 市委书记亲自率领审计员对这家公司进行财务审计,发现企业的利润率数据虚假。
+0004 学生们对代数理解不深刻,特别是小数点,在数数时容易弄错。
+0005 赵长军从小学习武术,擅长散打,长大后参军,担任连长。
+0006 我说她涨了工资,她就涨红着脸,摇头否认。
+0007 请把这封信交给团长,告诉他,前线的供给一定要有保障。
+0008 矿下的巷道,与北京四合院的小巷有点相似。
+0009 他常叹自己命薄,几亩薄田,种点薄荷。
+0010 小明对天相很有研究,在宿舍说了一宿有关星宿的常识。
\ No newline at end of file
diff --git a/paddlespeech/t2s/frontend/zh_normalization/quantifier.py b/paddlespeech/t2s/frontend/zh_normalization/quantifier.py
index 268d7229b..598030e43 100644
--- a/paddlespeech/t2s/frontend/zh_normalization/quantifier.py
+++ b/paddlespeech/t2s/frontend/zh_normalization/quantifier.py
@@ -18,6 +18,25 @@ from .num import num2str
# 温度表达式,温度会影响负号的读法
# -3°C 零下三度
RE_TEMPERATURE = re.compile(r'(-?)(\d+(\.\d+)?)(°C|℃|度|摄氏度)')
+measure_dict = {
+ "cm2": "平方厘米",
+ "cm²": "平方厘米",
+ "cm3": "立方厘米",
+ "cm³": "立方厘米",
+ "cm": "厘米",
+ "db": "分贝",
+ "ds": "毫秒",
+ "kg": "千克",
+ "km": "千米",
+ "m2": "平方米",
+ "m²": "平方米",
+ "m³": "立方米",
+ "m3": "立方米",
+ "ml": "毫升",
+ "m": "米",
+ "mm": "毫米",
+ "s": "秒"
+}
def replace_temperature(match) -> str:
@@ -35,3 +54,10 @@ def replace_temperature(match) -> str:
unit: str = "摄氏度" if unit == "摄氏度" else "度"
result = f"{sign}{temperature}{unit}"
return result
+
+
+def replace_measure(sentence) -> str:
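+    """Replace unit abbreviations with their Chinese readings,
+    e.g. '25cm²' -> '25平方厘米' (the digits are verbalized by later rules).
+    """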
+ for q_notation in measure_dict:
+ if q_notation in sentence:
+ sentence = sentence.replace(q_notation, measure_dict[q_notation])
+ return sentence
diff --git a/paddlespeech/t2s/frontend/zh_normalization/text_normlization.py b/paddlespeech/t2s/frontend/zh_normalization/text_normlization.py
index bc663c70d..8f8e3b07d 100644
--- a/paddlespeech/t2s/frontend/zh_normalization/text_normlization.py
+++ b/paddlespeech/t2s/frontend/zh_normalization/text_normlization.py
@@ -46,6 +46,7 @@ from .phonecode import RE_TELEPHONE
from .phonecode import replace_mobile
from .phonecode import replace_phone
from .quantifier import RE_TEMPERATURE
+from .quantifier import replace_measure
from .quantifier import replace_temperature
@@ -91,6 +92,7 @@ class TextNormalizer():
sentence = RE_TIME.sub(replace_time, sentence)
sentence = RE_TEMPERATURE.sub(replace_temperature, sentence)
+ sentence = replace_measure(sentence)
sentence = RE_FRAC.sub(replace_frac, sentence)
sentence = RE_PERCENTAGE.sub(replace_percentage, sentence)
sentence = RE_MOBILE_PHONE.sub(replace_mobile, sentence)
diff --git a/speechx/examples/codelab/u2/utils b/speechx/examples/codelab/u2/utils
new file mode 120000
index 000000000..23cef9612
--- /dev/null
+++ b/speechx/examples/codelab/u2/utils
@@ -0,0 +1 @@
+../../../../utils
\ No newline at end of file
diff --git a/speechx/examples/u2pp_ol/wenetspeech/README.md b/speechx/examples/u2pp_ol/wenetspeech/README.md
index 9a8f8af51..b90b8e201 100644
--- a/speechx/examples/u2pp_ol/wenetspeech/README.md
+++ b/speechx/examples/u2pp_ol/wenetspeech/README.md
@@ -2,10 +2,10 @@
## Testing with Aishell Test Data
-## Download wav and model
+### Download wav and model
```
-run.sh --stop_stage 0
+./run.sh --stop_stage 0
```
### compute feature
@@ -22,7 +22,6 @@ run.sh --stop_stage 0
### decoding using wav
-
```
./run.sh --stage 3 --stop_stage 3
```
diff --git a/speechx/examples/u2pp_ol/wenetspeech/RESULTS.md b/speechx/examples/u2pp_ol/wenetspeech/RESULTS.md
index 6a8e8c46d..5b33f3641 100644
--- a/speechx/examples/u2pp_ol/wenetspeech/RESULTS.md
+++ b/speechx/examples/u2pp_ol/wenetspeech/RESULTS.md
@@ -2,9 +2,11 @@
7176 utts, duration 36108.9 sec.
-## Attention Rescore
+## U2++ Attention Rescore
-### u2++ FP32
+> Intel(R) Xeon(R) Gold 6271C CPU @ 2.60GHz, supports `avx512_vnni`
+> RTF is computed including feature extraction and decoding, which is closer to end-to-end.
+### FP32
#### CER
@@ -17,20 +19,29 @@ Other -> 100.00 % N=3 C=0 S=3 D=0 I=0
#### RTF
-> RTF with feature and decoder which is more end to end.
-
-* Intel(R) Xeon(R) Gold 6271C CPU @ 2.60GHz, support `avx512_vnni`
-
```
I1027 10:52:38.662868 51665 u2_recognizer_main.cc:122] total wav duration is: 36108.9 sec
I1027 10:52:38.662858 51665 u2_recognizer_main.cc:121] total cost:11169.1 sec
I1027 10:52:38.662876 51665 u2_recognizer_main.cc:123] RTF is: 0.309318
```
-* Intel(R) Xeon(R) Gold 6148 CPU @ 2.40GHz, not support `avx512_vnni`
+### INT8
+
+> RTF improves by 12.8% relative, counting feature extraction and decoding time.
+
+#### CER
+
+```
+Overall -> 5.83 % N=104765 C=98943 S=5675 D=147 I=286
+Mandarin -> 5.83 % N=104762 C=98943 S=5672 D=147 I=286
+English -> 0.00 % N=0 C=0 S=0 D=0 I=0
+Other -> 100.00 % N=3 C=0 S=3 D=0 I=0
+```
+
+#### RTF
```
-I1026 16:13:26.247121 48038 u2_recognizer_main.cc:123] total wav duration is: 36108.9 sec
-I1026 16:13:26.247130 48038 u2_recognizer_main.cc:124] total decode cost:13656.7 sec
-I1026 16:13:26.247138 48038 u2_recognizer_main.cc:125] RTF is: 0.378208
+I1110 09:59:52.551712 37249 u2_recognizer_main.cc:122] total wav duration is: 36108.9 sec
+I1110 09:59:52.551717 37249 u2_recognizer_main.cc:123] total decode cost:9737.63 sec
+I1110 09:59:52.551723 37249 u2_recognizer_main.cc:124] RTF is: 0.269674
```
diff --git a/speechx/examples/u2pp_ol/wenetspeech/local/decode.sh b/speechx/examples/u2pp_ol/wenetspeech/local/decode.sh
index e9c81009c..068f34a07 100755
--- a/speechx/examples/u2pp_ol/wenetspeech/local/decode.sh
+++ b/speechx/examples/u2pp_ol/wenetspeech/local/decode.sh
@@ -10,7 +10,7 @@ mkdir -p $exp
ckpt_dir=./data/model
model_dir=$ckpt_dir/asr1_chunk_conformer_u2pp_wenetspeech_static_1.3.0.model/
-utils/run.pl JOB=1:$nj $data/split${nj}/JOB/decoder.fbank.wolm.log \
+utils/run.pl JOB=1:$nj $data/split${nj}/JOB/decoder.log \
ctc_prefix_beam_search_decoder_main \
--model_path=$model_dir/export.jit \
--vocab_path=$model_dir/unit.txt \
diff --git a/speechx/examples/u2pp_ol/wenetspeech/local/nnet.sh b/speechx/examples/u2pp_ol/wenetspeech/local/nnet.sh
index 5455b5c9b..f947e6b17 100755
--- a/speechx/examples/u2pp_ol/wenetspeech/local/nnet.sh
+++ b/speechx/examples/u2pp_ol/wenetspeech/local/nnet.sh
@@ -1,18 +1,21 @@
#!/bin/bash
-set -x
set -e
. path.sh
+nj=20
data=data
exp=exp
+
mkdir -p $exp
ckpt_dir=./data/model
model_dir=$ckpt_dir/asr1_chunk_conformer_u2pp_wenetspeech_static_1.3.0.model/
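+# run u2 nnet forward on each of the $nj feature splits in parallel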
+utils/run.pl JOB=1:$nj $data/split${nj}/JOB/nnet.log \
u2_nnet_main \
--model_path=$model_dir/export.jit \
- --feature_rspecifier=ark,t:$exp/fbank.ark \
+ --vocab_path=$model_dir/unit.txt \
+ --feature_rspecifier=ark,t:${data}/split${nj}/JOB/fbank.ark \
--nnet_decoder_chunk=16 \
--receptive_field_length=7 \
--subsampling_rate=4 \
@@ -20,4 +23,3 @@ u2_nnet_main \
--nnet_encoder_outs_wspecifier=ark,t:$exp/encoder_outs.ark \
--nnet_prob_wspecifier=ark,t:$exp/logprobs.ark
echo "u2 nnet decode."
-
diff --git a/speechx/examples/u2pp_ol/wenetspeech/run.sh b/speechx/examples/u2pp_ol/wenetspeech/run.sh
index 2bc855dec..870c5deeb 100755
--- a/speechx/examples/u2pp_ol/wenetspeech/run.sh
+++ b/speechx/examples/u2pp_ol/wenetspeech/run.sh
@@ -24,8 +24,6 @@ fi
ckpt_dir=$data/model
-model_dir=$ckpt_dir/asr1_chunk_conformer_u2pp_wenetspeech_static_1.3.0.model/
-
if [ ${stage} -le 0 ] && [ ${stop_stage} -ge 0 ];then
# download u2pp model
diff --git a/speechx/speechx/frontend/audio/data_cache.h b/speechx/speechx/frontend/audio/data_cache.h
index f538df1dd..5fe5e4fe0 100644
--- a/speechx/speechx/frontend/audio/data_cache.h
+++ b/speechx/speechx/frontend/audio/data_cache.h
@@ -32,7 +32,6 @@ class DataCache : public FrontendInterface {
// accept waves/feats
void Accept(const kaldi::VectorBase& inputs) override {
data_ = inputs;
- SetDim(data_.Dim());
}
bool Read(kaldi::Vector* feats) override {
@@ -41,7 +40,6 @@ class DataCache : public FrontendInterface {
}
(*feats) = data_;
data_.Resize(0);
- SetDim(data_.Dim());
return true;
}
diff --git a/speechx/speechx/nnet/decodable.cc b/speechx/speechx/nnet/decodable.cc
index 7f6859082..5fe2b9842 100644
--- a/speechx/speechx/nnet/decodable.cc
+++ b/speechx/speechx/nnet/decodable.cc
@@ -71,6 +71,7 @@ bool Decodable::AdvanceChunk() {
VLOG(3) << "decodable exit;";
return false;
}
+ CHECK_GE(frontend_->Dim(), 0);
VLOG(1) << "AdvanceChunk feat cost: " << timer.Elapsed() << " sec.";
VLOG(2) << "Forward in " << features.Dim() / frontend_->Dim() << " feats.";
diff --git a/tests/unit/tts/test_pwg.py b/tests/unit/tts/test_pwg.py
index 78cb34f25..10c82c9fd 100644
--- a/tests/unit/tts/test_pwg.py
+++ b/tests/unit/tts/test_pwg.py
@@ -13,6 +13,7 @@
# limitations under the License.
import paddle
import torch
+from paddle.device.cuda import synchronize
from parallel_wavegan.layers import residual_block
from parallel_wavegan.layers import upsample
from parallel_wavegan.models import parallel_wavegan as pwgan
@@ -24,7 +25,6 @@ from paddlespeech.t2s.models.parallel_wavegan import PWGGenerator
from paddlespeech.t2s.models.parallel_wavegan import ResidualBlock
from paddlespeech.t2s.models.parallel_wavegan import ResidualPWGDiscriminator
from paddlespeech.t2s.utils.layer_tools import summary
-from paddlespeech.t2s.utils.profile import synchronize
paddle.set_device("gpu:0")
device = torch.device("cuda:0")
|