([简体中文](./README_cn.md)|English)

------------------------------------------------------------------------------------

**PaddleSpeech** is an open-source toolkit on the [PaddlePaddle](https://github.com/PaddlePaddle/Paddle) platform for a variety of critical tasks in speech and audio, with state-of-the-art and influential models.

**PaddleSpeech** won the [NAACL2022 Best Demo Award](https://2022.naacl.org/blog/best-demo-award/). Please check out our paper on [arXiv](https://arxiv.org/abs/2205.12007).

##### Speech Recognition

| Input Audio | Recognition Result |
|---|---|
| | I knocked at the door on the ancient side of the building. |
| | 我认为跑步最重要的就是给我带来了身体健康。 |


##### Punctuation Restoration

| Input Text | Output Text |
|---|---|
| 今天的天气真不错啊你下午有空吗我想约你一起去吃饭 | 今天的天气真不错啊!你下午有空吗?我想约你一起去吃饭。 |


| Speech-to-Text Module Type | Dataset | Model Type | Example |
|---|---|---|---|
| Speech Recognition | Aishell | DeepSpeech2 RNN + Conv based Models | deepspeech2-aishell |
| Speech Recognition | Aishell | Transformer based Attention Models | u2.transformer.conformer-aishell |
| Speech Recognition | Librispeech | Transformer based Attention Models | deepspeech2-librispeech / transformer.conformer.u2-librispeech / transformer.conformer.u2-kaldi-librispeech |
| Speech Recognition | TIMIT | Unified Streaming & Non-streaming Two-pass | u2-timit |
| Alignment | THCHS30 | MFA | mfa-thchs30 |
| Language Model | | Ngram Language Model | kenlm |
| Speech Translation (English to Chinese) | TED En-Zh | Transformer + ASR MTL | transformer-ted |
| Speech Translation (English to Chinese) | TED En-Zh | FAT + Transformer + ASR MTL | fat-st-ted |


| Text-to-Speech Module Type | Model Type | Dataset | Example |
|---|---|---|---|
| Text Frontend | | | tn / g2p |
| Acoustic Model | Tacotron2 | LJSpeech / CSMSC | tacotron2-ljspeech / tacotron2-csmsc |
| Acoustic Model | Transformer TTS | LJSpeech | transformer-ljspeech |
| Acoustic Model | SpeedySpeech | CSMSC | speedyspeech-csmsc |
| Acoustic Model | FastSpeech2 | LJSpeech / VCTK / CSMSC / AISHELL-3 / ZH_EN / finetune | fastspeech2-ljspeech / fastspeech2-vctk / fastspeech2-csmsc / fastspeech2-aishell3 / fastspeech2-zh_en / fastspeech2-finetune |
| Acoustic Model | ERNIE-SAT | VCTK / AISHELL-3 / ZH_EN | ERNIE-SAT-vctk / ERNIE-SAT-aishell3 / ERNIE-SAT-zh_en |
| Acoustic Model | DiffSinger | Opencpop | DiffSinger-opencpop |
| Vocoder | WaveFlow | LJSpeech | waveflow-ljspeech |
| Vocoder | Parallel WaveGAN | LJSpeech / VCTK / CSMSC / AISHELL-3 / Opencpop | PWGAN-ljspeech / PWGAN-vctk / PWGAN-csmsc / PWGAN-aishell3 / PWGAN-opencpop |
| Vocoder | Multi Band MelGAN | CSMSC | Multi Band MelGAN-csmsc |
| Vocoder | Style MelGAN | CSMSC | Style MelGAN-csmsc |
| Vocoder | HiFiGAN | LJSpeech / VCTK / CSMSC / AISHELL-3 / Opencpop | HiFiGAN-ljspeech / HiFiGAN-vctk / HiFiGAN-csmsc / HiFiGAN-aishell3 / HiFiGAN-opencpop |
| Vocoder | WaveRNN | CSMSC | WaveRNN-csmsc |
| Voice Cloning | GE2E | Librispeech, etc. | GE2E |
| Voice Cloning | SV2TTS (GE2E + Tacotron2) | AISHELL-3 | VC0 |
| Voice Cloning | SV2TTS (GE2E + FastSpeech2) | AISHELL-3 | VC1 |
| Voice Cloning | SV2TTS (ECAPA-TDNN + FastSpeech2) | AISHELL-3 | VC2 |
| Voice Cloning | GE2E + VITS | AISHELL-3 | VITS-VC |
| End-to-End | VITS | CSMSC / AISHELL-3 | VITS-csmsc / VITS-aishell3 |


| Task | Dataset | Model Type | Example |
|---|---|---|---|
| Audio Classification | ESC-50 | PANN | pann-esc50 |

| Task | Dataset | Model Type | Example |
|---|---|---|---|
| Keyword Spotting | hey-snips | MDTC | mdtc-hey-snips |

| Task | Dataset | Model Type | Example |
|---|---|---|---|
| Speaker Verification | VoxCeleb1/2 | ECAPA-TDNN | ecapa-tdnn-voxceleb12 |

| Task | Dataset | Model Type | Example |
|---|---|---|---|
| Speaker Diarization | AMI | ECAPA-TDNN + AHC / SC | ecapa-tdnn-ami |

| Task | Dataset | Model Type | Example |
|---|---|---|---|
| Punctuation Restoration | IWSLT2012_zh | Ernie Linear | iwslt2012-punc0 |