From dbe8cee2482f5d83417d2ece62f67672f8151011 Mon Sep 17 00:00:00 2001
From: tianhao zhang <15600919271@163.com>
Date: Thu, 13 Oct 2022 07:10:04 +0000
Subject: [PATCH 1/5] release wav2vec2ASR and wav2vec2.0 model, update Recent
 Update
---
 README.md                     | 1 +
 README_cn.md                  | 1 +
 docs/source/released_model.md | 2 ++
 3 files changed, 4 insertions(+)

diff --git a/README.md b/README.md
index 72db64b7d..c05e12424 100644
--- a/README.md
+++ b/README.md
@@ -157,6 +157,7 @@ Via the easy-to-use, efficient, flexible and scalable implementation, our vision
 - 🧩 *Cascaded models application*: as an extension of the typical traditional audio tasks, we combine the workflows of the aforementioned tasks with other fields like Natural language processing (NLP) and Computer Vision (CV).
 ### Recent Update
+- 👑 2022.10.11: Add [Wav2vec2ASR](./examples/librispeech/asr3), wav2vec2.0 fine-tuning for ASR on LibriSpeech.
 - 🔥 2022.09.26: Add Voice Cloning, TTS finetune, and ERNIE-SAT in [PaddleSpeech Web Demo](./demos/speech_web).
 - ⚡ 2022.09.09: Add AISHELL-3 Voice Cloning [example](./examples/aishell3/vc2) with ECAPA-TDNN speaker encoder.
 - ⚡ 2022.08.25: Release TTS [finetune](./examples/other/tts_finetune/tts3) example.

diff --git a/README_cn.md b/README_cn.md
index 725f7eda1..20e2d3c85 100644
--- a/README_cn.md
+++ b/README_cn.md
@@ -179,6 +179,7 @@
 ### 近期更新
+- 👑 2022.10.11: 新增 [Wav2vec2ASR](./examples/librispeech/asr3),在 LibriSpeech 上针对 ASR 任务对 wav2vec2.0 进行 fine-tuning。
 - 🔥 2022.09.26: 新增 Voice Cloning, TTS finetune 和 ERNIE-SAT 到 [PaddleSpeech 网页应用](./demos/speech_web)。
 - ⚡ 2022.09.09: 新增基于 ECAPA-TDNN 声纹模型的 AISHELL-3 Voice Cloning [示例](./examples/aishell3/vc2)。
 - ⚡ 2022.08.25: 发布 TTS [finetune](./examples/other/tts_finetune/tts3) 示例。

diff --git a/docs/source/released_model.md b/docs/source/released_model.md
index bdac2c5bb..3d51f1122 100644
--- a/docs/source/released_model.md
+++ b/docs/source/released_model.md
@@ -17,6 +17,8 @@ Acoustic Model | Training Data | Token-based | Size | Descriptions | CER | WER |
 [Conformer Librispeech ASR1 Model](https://paddlespeech.bj.bcebos.com/s2t/librispeech/asr1/asr1_conformer_librispeech_ckpt_0.1.1.model.tar.gz) | Librispeech Dataset | subword-based | 191 MB | Encoder:Conformer, Decoder:Transformer, Decoding method: Attention rescoring |-| 0.0338 | 960 h | [Conformer Librispeech ASR1](../../examples/librispeech/asr1) | python |
 [Transformer Librispeech ASR1 Model](https://paddlespeech.bj.bcebos.com/s2t/librispeech/asr1/asr1_transformer_librispeech_ckpt_0.1.1.model.tar.gz) | Librispeech Dataset | subword-based | 131 MB | Encoder:Transformer, Decoder:Transformer, Decoding method: Attention rescoring |-| 0.0381 | 960 h | [Transformer Librispeech ASR1](../../examples/librispeech/asr1) | python |
 [Transformer Librispeech ASR2 Model](https://paddlespeech.bj.bcebos.com/s2t/librispeech/asr2/asr2_transformer_librispeech_ckpt_0.1.1.model.tar.gz) | Librispeech Dataset | subword-based | 131 MB | Encoder:Transformer, Decoder:Transformer, Decoding method: JoinCTC w/ LM |-| 0.0240 | 960 h | [Transformer Librispeech ASR2](../../examples/librispeech/asr2) | python |
+[Wav2vec2-large-960h-lv60-self Model](https://paddlespeech.bj.bcebos.com/wav2vec/wav2vec2-large-960h-lv60-self.pdparams) | Librispeech and LV-60k Dataset | - | 1.18 GB | Pre-trained Wav2vec2.0 Model |-| - | 5.3w h | [Wav2vecASR Librispeech ASR3](../../examples/librispeech/asr3) | python |
+[Wav2vec2ASR-large-960h-librispeech Model](https://paddlespeech.bj.bcebos.com/s2t/librispeech/asr3/wav2vec2ASR-large-960h-librispeech_ckpt_1.3.0.model.tar.gz) | Librispeech | - | 1.18 GB | Encoder:Wav2vec2.0, Decoder:CTC, Decoding method: Greedy search |-| 0.0189 | 960 h | [Wav2vecASR Librispeech ASR3](../../examples/librispeech/asr3) | python |
 ### Language Model based on NGram
 Language Model | Training Data | Token-based | Size | Descriptions

From f29294153b141811bd48dc00206876be07cd9290 Mon Sep 17 00:00:00 2001
From: tianhao zhang <15600919271@163.com>
Date: Thu, 13 Oct 2022 12:06:10 +0000
Subject: [PATCH 2/5] update reference.md and released_model.md
---
 docs/source/reference.md      | 6 ++++++
 docs/source/released_model.md | 8 ++++++--
 2 files changed, 12 insertions(+), 2 deletions(-)

diff --git a/docs/source/reference.md b/docs/source/reference.md
index 0d36d96f7..0b555578e 100644
--- a/docs/source/reference.md
+++ b/docs/source/reference.md
@@ -28,6 +28,8 @@ We borrowed a lot of code from these repos to build `model` and `engine`, thanks
 * [speechbrain](https://github.com/speechbrain/speechbrain/blob/develop/LICENSE)
 - Apache-2.0 License
 - ECAPA-TDNN SV model
+- ASR with CTC and pre-trained wav2vec2 models.
+
 * [chainer](https://github.com/chainer/chainer/blob/master/LICENSE)
 - MIT License
@@ -43,3 +45,7 @@
 * [g2pW](https://github.com/GitYCC/g2pW/blob/master/LICENCE)
 - Apache-2.0 license
+
+* [transformers](https://github.com/huggingface/transformers)
+- Apache-2.0 License
+- Wav2vec2.0
\ No newline at end of file

diff --git a/docs/source/released_model.md b/docs/source/released_model.md
index 3d51f1122..f60fe9d48 100644
--- a/docs/source/released_model.md
+++ b/docs/source/released_model.md
@@ -17,8 +17,12 @@ Acoustic Model | Training Data | Token-based | Size | Descriptions | CER | WER |
 [Conformer Librispeech ASR1 Model](https://paddlespeech.bj.bcebos.com/s2t/librispeech/asr1/asr1_conformer_librispeech_ckpt_0.1.1.model.tar.gz) | Librispeech Dataset | subword-based | 191 MB | Encoder:Conformer, Decoder:Transformer, Decoding method: Attention rescoring |-| 0.0338 | 960 h | [Conformer Librispeech ASR1](../../examples/librispeech/asr1) | python |
 [Transformer Librispeech ASR1 Model](https://paddlespeech.bj.bcebos.com/s2t/librispeech/asr1/asr1_transformer_librispeech_ckpt_0.1.1.model.tar.gz) | Librispeech Dataset | subword-based | 131 MB | Encoder:Transformer, Decoder:Transformer, Decoding method: Attention rescoring |-| 0.0381 | 960 h | [Transformer Librispeech ASR1](../../examples/librispeech/asr1) | python |
 [Transformer Librispeech ASR2 Model](https://paddlespeech.bj.bcebos.com/s2t/librispeech/asr2/asr2_transformer_librispeech_ckpt_0.1.1.model.tar.gz) | Librispeech Dataset | subword-based | 131 MB | Encoder:Transformer, Decoder:Transformer, Decoding method: JoinCTC w/ LM |-| 0.0240 | 960 h | [Transformer Librispeech ASR2](../../examples/librispeech/asr2) | python |
-[Wav2vec2-large-960h-lv60-self Model](https://paddlespeech.bj.bcebos.com/wav2vec/wav2vec2-large-960h-lv60-self.pdparams) | Librispeech and LV-60k Dataset | - | 1.18 GB | Pre-trained Wav2vec2.0 Model |-| - | 5.3w h | [Wav2vecASR Librispeech ASR3](../../examples/librispeech/asr3) | python |
-[Wav2vec2ASR-large-960h-librispeech Model](https://paddlespeech.bj.bcebos.com/s2t/librispeech/asr3/wav2vec2ASR-large-960h-librispeech_ckpt_1.3.0.model.tar.gz) | Librispeech | - | 1.18 GB | Encoder:Wav2vec2.0, Decoder:CTC, Decoding method: Greedy search |-| 0.0189 | 960 h | [Wav2vecASR Librispeech ASR3](../../examples/librispeech/asr3) | python |
+
+### Self-Supervised Pre-trained Model
+Model | Pre-Train Method | Pre-Train Data | Finetune Data | Size | Descriptions | CER | WER | Example Link |
+:-------------:| :------------:| :-----: | :-----: | :-----: | :-----: | :-----: | :-----: | :-----: |
+[Wav2vec2-large-960h-lv60-self Model](https://paddlespeech.bj.bcebos.com/wav2vec/wav2vec2-large-960h-lv60-self.pdparams) | wav2vec2 | Librispeech and LV-60k Dataset (53,000 h) | - | 1.18 GB | Pre-trained Wav2vec2.0 Model | - | - | - |
+[Wav2vec2ASR-large-960h-librispeech Model](https://paddlespeech.bj.bcebos.com/s2t/librispeech/asr3/wav2vec2ASR-large-960h-librispeech_ckpt_1.3.0.model.tar.gz) | wav2vec2 | Librispeech and LV-60k Dataset (53,000 h) | Librispeech (960 h) | 1.18 GB | Encoder: Wav2vec2.0, Decoder: CTC, Decoding method: Greedy search | - | 0.0189 | [Wav2vecASR Librispeech ASR3](../../examples/librispeech/asr3) |
 ### Language Model based on NGram
 Language Model | Training Data | Token-based | Size | Descriptions

From 49c0cf9e317e76563cdc048d790b4267090ad2d5 Mon Sep 17 00:00:00 2001
From: tianhao zhang <15600919271@163.com>
Date: Fri, 14 Oct 2022 02:11:21 +0000
Subject: [PATCH 3/5] format reference.md
---
 docs/source/reference.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/source/reference.md b/docs/source/reference.md
index 0b555578e..9a47a2302 100644
--- a/docs/source/reference.md
+++ b/docs/source/reference.md
@@ -48,4 +48,4 @@ We borrowed a lot of code from these repos to build `model` and `engine`, thanks
 * [transformers](https://github.com/huggingface/transformers)
 - Apache-2.0 License
-- Wav2vec2.0
\ No newline at end of file
+- Wav2vec2.0

From 86f65f0b8eb8843b9d0a92539ce4aa30420d3700 Mon Sep 17 00:00:00 2001
From: tianhao zhang <15600919271@163.com>
Date: Sun, 16 Oct 2022 14:17:26 +0000
Subject: [PATCH 4/5] fix wav2vec2 report loss bug
---
 paddlespeech/s2t/exps/wav2vec2/model.py | 26 +++++++++++--------------
 1 file changed, 11 insertions(+), 15 deletions(-)

diff --git a/paddlespeech/s2t/exps/wav2vec2/model.py b/paddlespeech/s2t/exps/wav2vec2/model.py
index de4c895f2..16feac5de 100644
--- a/paddlespeech/s2t/exps/wav2vec2/model.py
+++ b/paddlespeech/s2t/exps/wav2vec2/model.py
@@ -13,6 +13,7 @@
 # limitations under the License.
 """Contains wav2vec2 model."""
 import json
+import math
 import os
 import time
 from collections import defaultdict
@@ -46,25 +47,20 @@ logger = Log(__name__).getlog()
 class Wav2Vec2ASRTrainer(Trainer):
     def __init__(self, config, args):
         super().__init__(config, args)
-        self.avg_train_loss = 0
+        self.avg_train_loss = 0.0

-    def update_average(self, batch_index, loss, avg_loss):
+    def update_average(self, batch_index, loss):
         """Update running average of the loss.
         Arguments
         ---------
+        batch_index : int
+            current batch index
         loss : paddle.tensor
            detached loss, a single float value.
-        avg_loss : float
-            current running average.
-        Returns
-        -------
-        avg_loss : float
-            The average loss.
         """
-        if paddle.isfinite(loss):
-            avg_loss -= avg_loss / (batch_index + 1)
-            avg_loss += float(loss) / (batch_index + 1)
-        return avg_loss
+        if math.isfinite(loss):
+            self.avg_train_loss -= self.avg_train_loss / (batch_index + 1)
+            self.avg_train_loss += loss / (batch_index + 1)

     def train_batch(self, batch_index, batch, msg):
         train_conf = self.config
@@ -80,8 +76,8 @@ class Wav2Vec2ASRTrainer(Trainer):
         # loss div by `batch_size * accum_grad`
         loss /= train_conf.accum_grad
-        self.avg_train_loss = self.update_average(batch_index, loss,
-                                                  self.avg_train_loss)
+        # update self.avg_train_loss
+        self.update_average(batch_index, float(loss))

         # loss backward
         if (batch_index + 1) % train_conf.accum_grad != 0:
@@ -106,7 +102,7 @@ class Wav2Vec2ASRTrainer(Trainer):
             self.lr_scheduler.step()
         self.iteration += 1
-        losses_np = {'loss': float(self.avg_train_loss) * train_conf.accum_grad}
+        losses_np = {'loss': self.avg_train_loss * train_conf.accum_grad}
         iteration_time = time.time() - start
         for k, v in losses_np.items():
             report(k, v)

From 0b5af77494d46398bcc93a3a03738d7ea3c399c4 Mon Sep 17 00:00:00 2001
From: Ming
Date: Mon, 17 Oct 2022 14:36:19 +0800
Subject: [PATCH 5/5] fixed ENIRE-SAT => ERNIE-SAT,test=doc (#2533)

* fixed ENIRE-SAT => ERNIE-SAT,test=doc

* fixed ERNIE,test=doc
---
 demos/speech_web/README.md                                | 4 ++--
 demos/speech_web/web_client/src/components/Experience.vue | 6 +++---
 .../{ENIRE_SAT/ENIRE_SAT.vue => ERNIE_SAT/ERNIE_SAT.vue}  | 0
 3 files changed, 5 insertions(+), 5 deletions(-)
 rename demos/speech_web/web_client/src/components/SubMenu/{ENIRE_SAT/ENIRE_SAT.vue => ERNIE_SAT/ERNIE_SAT.vue} (100%)

diff --git a/demos/speech_web/README.md b/demos/speech_web/README.md
index 89d22382a..572781ab6 100644
--- a/demos/speech_web/README.md
+++ b/demos/speech_web/README.md
@@ -21,14 +21,14 @@ Paddle Speech Demo 是一个以 PaddleSpeech 的语音交互功能为主体开
 + 小数据微调:基于小数据集的微调方案,内置用12句话标贝中文女声微调示例,你也可以通过一键重置,录制自己的声音,注意在安静环境下录制,效果会更好。你可以在 [【Finetune your own AM based on FastSpeech2 with AISHELL-3】](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/other/tts_finetune/tts3)中尝试使用自己的数据集进行微调。
-+ ENIRE-SAT:语言-语音跨模态大模型 ENIRE-SAT 可视化展示示例,支持个性化合成,跨语言语音合成(音频为中文则输入英文文本进行合成),语音编辑(修改音频文字中间的结果)功能。 ENIRE-SAT 更多实现细节,可以参考:
++ ERNIE-SAT:语言-语音跨模态大模型 ERNIE-SAT 可视化展示示例,支持个性化合成,跨语言语音合成(音频为中文则输入英文文本进行合成),语音编辑(修改音频文字中间的结果)功能。 ERNIE-SAT 更多实现细节,可以参考:
   + [【ERNIE-SAT with AISHELL-3 dataset】](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3/ernie_sat)
   + [【ERNIE-SAT with with AISHELL3 and VCTK datasets】](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/aishell3_vctk/ernie_sat)
   + [【ERNIE-SAT with VCTK dataset】](https://github.com/PaddlePaddle/PaddleSpeech/tree/develop/examples/vctk/ernie_sat)
 运行效果:
- ![效果](https://user-images.githubusercontent.com/30135920/192155349-9ef93d20-730b-413d-8d50-412fedf11d4b.png)
+ ![效果](https://user-images.githubusercontent.com/30135920/196076507-7eb33d39-2345-4268-aee7-6270b9ac8b98.png)

diff --git a/demos/speech_web/web_client/src/components/Experience.vue b/demos/speech_web/web_client/src/components/Experience.vue
index f593c0c14..ca0e1440f 100644
--- a/demos/speech_web/web_client/src/components/Experience.vue
+++ b/demos/speech_web/web_client/src/components/Experience.vue
@@ -7,7 +7,7 @@
 import VPRT from './SubMenu/VPR/VPRT.vue'
 import IET from './SubMenu/IE/IET.vue'
 import VoiceCloneT from './SubMenu/VoiceClone/VoiceClone.vue'
-import ENIRE_SATT from './SubMenu/ENIRE_SAT/ENIRE_SAT.vue'
+import ERNIE_SATT from './SubMenu/ERNIE_SAT/ERNIE_SAT.vue'
 import FineTuneT from './SubMenu/FineTune/FineTune.vue'
@@ -47,8 +47,8 @@
-
-
+
+

diff --git a/demos/speech_web/web_client/src/components/SubMenu/ENIRE_SAT/ENIRE_SAT.vue b/demos/speech_web/web_client/src/components/SubMenu/ERNIE_SAT/ERNIE_SAT.vue
similarity index 100%
rename from demos/speech_web/web_client/src/components/SubMenu/ENIRE_SAT/ENIRE_SAT.vue
rename to demos/speech_web/web_client/src/components/SubMenu/ERNIE_SAT/ERNIE_SAT.vue
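
The loss-reporting fix in PATCH 4 replaces return-value plumbing with an in-place incremental mean that skips non-finite losses. The arithmetic it relies on can be sketched in isolation as follows; this is a standalone sketch (the `LossAverager` class is hypothetical, standing in for `Wav2Vec2ASRTrainer`, with all Paddle specifics omitted):

```python
import math


class LossAverager:
    """Minimal sketch of the running-average logic in update_average."""

    def __init__(self):
        self.avg_train_loss = 0.0

    def update_average(self, batch_index, loss):
        # Skip NaN/Inf so one diverged batch does not poison the average.
        if math.isfinite(loss):
            # Incremental mean: avg_n = avg_{n-1} + (x_n - avg_{n-1}) / n,
            # written as two in-place updates with n = batch_index + 1.
            self.avg_train_loss -= self.avg_train_loss / (batch_index + 1)
            self.avg_train_loss += loss / (batch_index + 1)


averager = LossAverager()
for i, batch_loss in enumerate([2.0, 4.0, float("nan"), 6.0]):
    averager.update_average(i, batch_loss)
```

After the first two batches the average is exactly 3.0, and the NaN batch leaves it unchanged. Note that, as in the patch, a skipped batch still advances `batch_index`, so later updates divide by the total batch count including skipped batches; the result is a dampened average rather than the strict mean of the finite losses.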