Released Models

Speech-to-Text Models

Speech Recognition Model

| Acoustic Model | Training Data | Token-based | Size | Descriptions | CER | WER | Hours of speech | Example Link | Inference Type |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Ds2 Online Wenetspeech ASR0 Model | WenetSpeech Dataset | Char-based | 1.2 GB | 2 Conv + 5 LSTM layers | 0.152 (test_net, w/o LM); 0.2417 (test_meeting, w/o LM); 0.053 (aishell, w/ LM) | - | 10000 h | - | onnx/inference/python |
| Ds2 Online Aishell ASR0 Model | Aishell Dataset | Char-based | 491 MB | 2 Conv + 5 LSTM layers | 0.0666 | - | 151 h | Ds2 Online Aishell ASR0 | onnx/inference/python |
| Ds2 Offline Aishell ASR0 Model | Aishell Dataset | Char-based | 1.4 GB | 2 Conv + 5 bidirectional LSTM layers | 0.0554 | - | 151 h | Ds2 Offline Aishell ASR0 | inference/python |
| Conformer Online Wenetspeech ASR1 Model | WenetSpeech Dataset | Char-based | 457 MB | Encoder: Conformer, Decoder: Transformer, Decoding method: Attention rescoring | 0.11 (test_net); 0.1879 (test_meeting) | - | 10000 h | - | python |
| Conformer U2PP Online Wenetspeech ASR1 Model | WenetSpeech Dataset | Char-based | 476 MB | Encoder: Conformer, Decoder: BiTransformer, Decoding method: Attention rescoring | 0.047198 (aishell test_-1); 0.059212 (aishell test_16) | - | 10000 h | - | python |
| Conformer Online Aishell ASR1 Model | Aishell Dataset | Char-based | 189 MB | Encoder: Conformer, Decoder: Transformer, Decoding method: Attention rescoring | 0.0544 | - | 151 h | Conformer Online Aishell ASR1 | python |
| Conformer Offline Aishell ASR1 Model | Aishell Dataset | Char-based | 189 MB | Encoder: Conformer, Decoder: Transformer, Decoding method: Attention rescoring | 0.0460 | - | 151 h | Conformer Offline Aishell ASR1 | python |
| Transformer Aishell ASR1 Model | Aishell Dataset | Char-based | 128 MB | Encoder: Transformer, Decoder: Transformer, Decoding method: Attention rescoring | 0.0523 | - | 151 h | Transformer Aishell ASR1 | python |
| Ds2 Offline Librispeech ASR0 Model | Librispeech Dataset | Char-based | 1.3 GB | 2 Conv + 5 bidirectional LSTM layers | - | 0.0467 | 960 h | Ds2 Offline Librispeech ASR0 | inference/python |
| Conformer Librispeech ASR1 Model | Librispeech Dataset | Subword-based | 191 MB | Encoder: Conformer, Decoder: Transformer, Decoding method: Attention rescoring | - | 0.0338 | 960 h | Conformer Librispeech ASR1 | python |
| Transformer Librispeech ASR1 Model | Librispeech Dataset | Subword-based | 131 MB | Encoder: Transformer, Decoder: Transformer, Decoding method: Attention rescoring | - | 0.0381 | 960 h | Transformer Librispeech ASR1 | python |
| Transformer Librispeech ASR2 Model | Librispeech Dataset | Subword-based | 131 MB | Encoder: Transformer, Decoder: Transformer, Decoding method: Joint CTC w/ LM | - | 0.0240 | 960 h | Transformer Librispeech ASR2 | python |
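
The CER and WER columns report character and word error rates. In this table they appear to be written as fractions, so 0.0544 would correspond to 5.44%. As a minimal sketch, assuming the usual definition (Levenshtein edit distance between hypothesis and reference, divided by the reference length), the two metrics can be computed as shown here. The function names `edit_distance`, `cer`, and `wer` are illustrative and are not taken from PaddleSpeech, whose scoring scripts may normalize text differently.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitution, insertion, deletion cost 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,         # deletion
                            curr[j - 1] + 1,     # insertion
                            prev[j - 1] + cost)) # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate; spaces are dropped (an assumption made for this sketch)."""
    ref_chars = list(reference.replace(" ", ""))
    hyp_chars = list(hypothesis.replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


def wer(reference, hypothesis):
    """Word error rate over whitespace-separated tokens."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must not be empty")
    return edit_distance(ref_words, hyp_words) / len(ref_words)


if __name__ == "__main__":
    print(wer("the cat sat", "the cat sit"))  # one substitution out of three words
    print(cer("abcd", "abed"))                # one substitution out of four characters
```

Because insertions count as errors, both rates can exceed 1.0 when the hypothesis is much longer than the reference.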

Self-Supervised Pre-trained Model

| Model | Pre-Train Method | Pre-Train Data | Finetune Data | Size | Descriptions | CER | WER | Example Link |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Wav2vec2-large-960h-lv60-self Model | wav2vec2 | Librispeech and LV-60k Dataset (53,000 h) | - | 1.18 GB | Pre-trained Wav2vec2.0 Model | - | - | - |
| Wav2vec2ASR-large-960h-librispeech Model | wav2vec2 | Librispeech and LV-60k Dataset (53,000 h) | Librispeech (960 h) | 718 MB | Encoder: Wav2vec2.0, Decoder: CTC, Decoding method: Greedy search | - | 0.0189 | Wav2vecASR Librispeech ASR3 |
| Wav2vec2-large-wenetspeech-self Model | wav2vec2 | WenetSpeech Dataset (10,000 h) | - | 714 MB | Pre-trained Wav2vec2.0 Model | - | - | - |
| Wav2vec2ASR-large-aishell1 Model | wav2vec2 | WenetSpeech Dataset (10,000 h) | aishell1 (train set) | 1.17 GB | Encoder: Wav2vec2.0, Decoder: CTC, Decoding method: Greedy search | 0.0453 | - | - |

Whisper Model

| Demo Link | Training Data | Size | Descriptions | CER | Model |
| --- | --- | --- | --- | --- | --- |
| Whisper | 680k h from the internet | large: 5.8G, medium: 2.9G, small: 923M, base: 277M, tiny: 145M | Encoder: Transformer, Decoder: Transformer, Decoding method: Greedy search | 2.7 (large, Librispeech) | whisper-large, whisper-medium, whisper-medium-English-only, whisper-small, whisper-small-English-only, whisper-base, whisper-base-English-only, whisper-tiny, whisper-tiny-English-only |

Language Model based on NGram

| Language Model | Training Data | Token-based | Size | Descriptions |
| --- | --- | --- | --- | --- |
| English LM | CommonCrawl(en.00) | Word-based | 8.3 GB | Pruned with 0 1 1 1 1; about 1.85 billion n-grams; 'trie' binary with '-a 22 -q 8 -b 8' |
| Mandarin LM Small | Baidu Internal Corpus | Char-based | 2.8 GB | Pruned with 0 1 2 4 4; about 0.13 billion n-grams; 'probing' binary with default settings |
| Mandarin LM Large | Baidu Internal Corpus | Char-based | 70.4 GB | No pruning; about 3.7 billion n-grams; 'probing' binary with default settings |
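
The pruning thresholds and the 'trie' / 'probing' settings in the descriptions match the command-line options of the KenLM tools `lmplz` and `build_binary`. The sketch below only assembles the corresponding command lines and does not run them. It assumes the models were built with KenLM, that five pruning thresholds imply a 5-gram model, and that the file names (`corpus.txt`, `en.arpa`, `en.klm`, and so on) are placeholders; none of this is confirmed by the table itself.

```python
import shlex


def lmplz_command(order, prune, text_path, arpa_path):
    """Argument list for estimating an n-gram model with KenLM's lmplz.

    `prune` holds one count threshold per n-gram order; an empty
    sequence means no pruning.
    """
    cmd = ["lmplz", "-o", str(order)]
    if prune:
        cmd += ["--prune", *[str(p) for p in prune]]
    cmd += ["--text", text_path, "--arpa", arpa_path]
    return cmd


def build_binary_command(arpa_path, binary_path, structure="probing", options=()):
    """Argument list for converting an ARPA file to a KenLM binary.

    `structure` is "probing" or "trie"; `options` carries extra flags
    such as the trie quantization settings.
    """
    return ["build_binary", *options, structure, arpa_path, binary_path]


# English LM: pruned with 0 1 1 1 1, 'trie' binary with -a 22 -q 8 -b 8.
english_estimate = lmplz_command(5, [0, 1, 1, 1, 1], "corpus.txt", "en.arpa")
english_binary = build_binary_command(
    "en.arpa", "en.klm", structure="trie",
    options=["-a", "22", "-q", "8", "-b", "8"],
)

# Mandarin LM Small: pruned with 0 1 2 4 4, 'probing' binary with defaults.
mandarin_estimate = lmplz_command(5, [0, 1, 2, 4, 4], "zh_corpus.txt", "zh.arpa")
mandarin_binary = build_binary_command("zh.arpa", "zh.klm")

if __name__ == "__main__":
    for cmd in (english_estimate, english_binary, mandarin_estimate, mandarin_binary):
        print(shlex.join(cmd))
```

In KenLM, the 'trie' structure with quantization (`-q`, `-b`) and pointer compression (`-a`) trades lookup speed for a smaller memory footprint, while 'probing' is faster but larger.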

Speech Translation Models

| Model | Training Data | Token-based | Size | Descriptions | BLEU | Example Link |
| --- | --- | --- | --- | --- | --- | --- |
| (only for CLI) Transformer FAT-ST MTL En-Zh | Ted-En-Zh | Spm | | Encoder: Transformer, Decoder: Transformer, Decoding method: Attention | 20.80 | Transformer Ted-En-Zh ST1 |

Text-to-Speech Models

Acoustic Models

| Model Type | Dataset | Example Link | Pretrained Models | Static / ONNX / Paddle-Lite Models | Size (static) |
| --- | --- | --- | --- | --- | --- |
| Tacotron2 | LJSpeech | tacotron2-ljspeech | tacotron2_ljspeech_ckpt_0.2.0.zip | | |
| Tacotron2 | CSMSC | tacotron2-csmsc | tacotron2_csmsc_ckpt_0.2.0.zip | tacotron2_csmsc_static_0.2.0.zip | 103 MB |
| TransformerTTS | LJSpeech | transformer-ljspeech | transformer_tts_ljspeech_ckpt_0.4.zip | | |
| SpeedySpeech | CSMSC | speedyspeech-csmsc | speedyspeech_csmsc_ckpt_0.2.0.zip | speedyspeech_csmsc_static_0.2.0.zip, speedyspeech_csmsc_onnx_0.2.0.zip, speedyspeech_csmsc_pdlite_1.3.0.zip | 13 MB |
| FastSpeech2 | CSMSC | fastspeech2-csmsc | fastspeech2_nosil_baker_ckpt_0.4.zip | fastspeech2_csmsc_static_0.2.0.zip, fastspeech2_csmsc_onnx_0.2.0.zip, fastspeech2_csmsc_pdlite_1.3.0.zip | 157 MB |
| FastSpeech2-Conformer | CSMSC | fastspeech2-csmsc | fastspeech2_conformer_baker_ckpt_0.5.zip | | |
| FastSpeech2-CNNDecoder | CSMSC | fastspeech2-csmsc | fastspeech2_cnndecoder_csmsc_ckpt_1.0.0.zip | fastspeech2_cnndecoder_csmsc_static_1.0.0.zip, fastspeech2_cnndecoder_csmsc_streaming_static_1.0.0.zip, fastspeech2_cnndecoder_csmsc_onnx_1.0.0.zip, fastspeech2_cnndecoder_csmsc_streaming_onnx_1.0.0.zip, fastspeech2_cnndecoder_csmsc_pdlite_1.3.0.zip, fastspeech2_cnndecoder_csmsc_streaming_pdlite_1.3.0.zip | 84 MB |
| FastSpeech2 | AISHELL-3 | fastspeech2-aishell3 | fastspeech2_aishell3_ckpt_1.1.0.zip | fastspeech2_aishell3_static_1.1.0.zip, fastspeech2_aishell3_onnx_1.1.0.zip, fastspeech2_aishell3_pdlite_1.3.0.zip | 147 MB |
| FastSpeech2 | LJSpeech | fastspeech2-ljspeech | fastspeech2_nosil_ljspeech_ckpt_0.5.zip | fastspeech2_ljspeech_static_1.1.0.zip, fastspeech2_ljspeech_onnx_1.1.0.zip, fastspeech2_ljspeech_pdlite_1.3.0.zip | 145 MB |
| FastSpeech2 | VCTK | fastspeech2-vctk | fastspeech2_vctk_ckpt_1.2.0.zip | fastspeech2_vctk_static_1.1.0.zip, fastspeech2_vctk_onnx_1.1.0.zip, fastspeech2_vctk_pdlite_1.3.0.zip | 145 MB |
| FastSpeech2 | ZH_EN | fastspeech2-zh_en | fastspeech2_mix_ckpt_1.2.0.zip | fastspeech2_mix_static_0.2.0.zip, fastspeech2_mix_onnx_0.2.0.zip | 145 MB |
| FastSpeech2 | Male | | fastspeech2_male_ckpt_1.3.0.zip | | |

Vocoders

| Model Type | Dataset | Example Link | Pretrained Models | Static / ONNX / Paddle-Lite Models | Size (static) |
| --- | --- | --- | --- | --- | --- |
| WaveFlow | LJSpeech | waveflow-ljspeech | waveflow_ljspeech_ckpt_0.3.zip | | |
| Parallel WaveGAN | CSMSC | PWGAN-csmsc | pwg_baker_ckpt_0.4.zip | pwg_baker_static_0.4.zip, pwgan_csmsc_onnx_0.2.0.zip, pwgan_csmsc_pdlite_1.3.0.zip | 4.8 MB |
| Parallel WaveGAN | LJSpeech | PWGAN-ljspeech | pwg_ljspeech_ckpt_0.5.zip | pwgan_ljspeech_static_1.1.0.zip, pwgan_ljspeech_onnx_1.1.0.zip, pwgan_ljspeech_pdlite_1.3.0.zip | 4.8 MB |
| Parallel WaveGAN | AISHELL-3 | PWGAN-aishell3 | pwg_aishell3_ckpt_0.5.zip | pwgan_aishell3_static_1.1.0.zip, pwgan_aishell3_onnx_1.1.0.zip, pwgan_aishell3_pdlite_1.3.0.zip | 4.8 MB |
| Parallel WaveGAN | VCTK | PWGAN-vctk | pwg_vctk_ckpt_0.5.zip | pwgan_vctk_static_1.1.0.zip, pwgan_vctk_onnx_1.1.0.zip, pwgan_vctk_pdlite_1.3.0.zip | 4.8 MB |
| Multi Band MelGAN | CSMSC | MB MelGAN-csmsc | mb_melgan_csmsc_ckpt_0.1.1.zip, mb_melgan_baker_finetune_ckpt_0.5.zip | mb_melgan_csmsc_static_0.1.1.zip, mb_melgan_csmsc_onnx_0.2.0.zip, mb_melgan_csmsc_pdlite_1.3.0.zip | 7.6 MB |
| Style MelGAN | CSMSC | Style MelGAN-csmsc | style_melgan_csmsc_ckpt_0.1.1.zip | | |
| HiFiGAN | CSMSC | HiFiGAN-csmsc | hifigan_csmsc_ckpt_0.1.1.zip | hifigan_csmsc_static_0.1.1.zip, hifigan_csmsc_onnx_0.2.0.zip, hifigan_csmsc_pdlite_1.3.0.zip | 46 MB |
| HiFiGAN | LJSpeech | HiFiGAN-ljspeech | hifigan_ljspeech_ckpt_0.2.0.zip | hifigan_ljspeech_static_1.1.0.zip, hifigan_ljspeech_onnx_1.1.0.zip, hifigan_ljspeech_pdlite_1.3.0.zip | 49 MB |
| HiFiGAN | AISHELL-3 | HiFiGAN-aishell3 | hifigan_aishell3_ckpt_0.2.0.zip | hifigan_aishell3_static_1.1.0.zip, hifigan_aishell3_onnx_1.1.0.zip, hifigan_aishell3_pdlite_1.3.0.zip | 46 MB |
| HiFiGAN | VCTK | HiFiGAN-vctk | hifigan_vctk_ckpt_0.2.0.zip | hifigan_vctk_static_1.1.0.zip, hifigan_vctk_onnx_1.1.0.zip, hifigan_vctk_pdlite_1.3.0.zip | 46 MB |
| WaveRNN | CSMSC | WaveRNN-csmsc | wavernn_csmsc_ckpt_0.2.0.zip | wavernn_csmsc_static_0.2.0.zip | 18 MB |
| Parallel WaveGAN | Male | | pwg_male_ckpt_1.3.0.zip | | |

Voice Cloning

| Model Type | Dataset | Example Link | Pretrained Models |
| --- | --- | --- | --- |
| GE2E | AISHELL-3, etc. | ge2e | ge2e_ckpt_0.3.zip |
| GE2E + Tacotron2 | AISHELL-3 | ge2e-Tacotron2-aishell3 | tacotron2_aishell3_ckpt_vc0_0.2.0.zip |
| GE2E + FastSpeech2 | AISHELL-3 | ge2e-fastspeech2-aishell3 | fastspeech2_nosil_aishell3_vc1_ckpt_0.5.zip |

Audio Classification Models

| Model Type | Dataset | Example Link | Pretrained Models | Static Models |
| --- | --- | --- | --- | --- |
| PANN | Audioset | audioset_tagging_cnn | panns_cnn6.pdparams, panns_cnn10.pdparams, panns_cnn14.pdparams | panns_cnn6_static.tar.gz (18M), panns_cnn10_static.tar.gz (19M), panns_cnn14_static.tar.gz (289M) |
| PANN | ESC-50 | pann-esc50 | esc50_cnn6.tar.gz, esc50_cnn10.tar.gz, esc50_cnn14.tar.gz | |

Speaker Verification Models

| Model Type | Dataset | Example Link | Pretrained Models | Static Models |
| --- | --- | --- | --- | --- |
| ECAPA-TDNN | VoxCeleb | voxceleb_ecapatdnn | ecapatdnn.tar.gz | - |

Punctuation Restoration Models

| Model Type | Dataset | Example Link | Pretrained Models |
| --- | --- | --- | --- |
| Ernie Linear | IWSLT2012_zh | iwslt2012_punc0 | ernie_linear_p3_iwslt2012_zh_ckpt_0.1.1.zip |