# Released Models

## Speech-to-Text Models

### Speech Recognition Model
Acoustic Model | Training Data | Token-based | Size | Descriptions | CER | WER | Hours of speech | Example Link |
---|---|---|---|---|---|---|---|---|
Ds2 Online Aishell ASR0 Model | Aishell Dataset | Char-based | 345 MB | 2 Conv + 5 LSTM layers with only forward direction | 0.080 | - | 151 h | D2 Online Aishell ASR0 |
Ds2 Offline Aishell ASR0 Model | Aishell Dataset | Char-based | 306 MB | 2 Conv + 3 bidirectional GRU layers | 0.064 | - | 151 h | Ds2 Offline Aishell ASR0 |
Conformer Online Aishell ASR1 Model | Aishell Dataset | Char-based | 189 MB | Encoder:Conformer, Decoder:Transformer, Decoding method: Attention rescoring | 0.0565 | - | 151 h | Conformer Online Aishell ASR1 |
Conformer Offline Aishell ASR1 Model | Aishell Dataset | Char-based | 189 MB | Encoder:Conformer, Decoder:Transformer, Decoding method: Attention rescoring | 0.0483 | - | 151 h | Conformer Offline Aishell ASR1 |
Transformer Aishell ASR1 Model | Aishell Dataset | Char-based | 128 MB | Encoder:Transformer, Decoder:Transformer, Decoding method: Attention rescoring | 0.0523 | - | 151 h | Transformer Aishell ASR1 |
Ds2 Offline Librispeech ASR0 Model | Librispeech Dataset | Char-based | 518 MB | 2 Conv + 3 bidirectional LSTM layers | - | 0.0725 | 960 h | Ds2 Offline Librispeech ASR0 |
Conformer Librispeech ASR1 Model | Librispeech Dataset | subword-based | 191 MB | Encoder:Conformer, Decoder:Transformer, Decoding method: Attention rescoring | - | 0.0337 | 960 h | Conformer Librispeech ASR1 |
Transformer Librispeech ASR1 Model | Librispeech Dataset | subword-based | 131 MB | Encoder:Transformer, Decoder:Transformer, Decoding method: Attention rescoring | - | 0.0381 | 960 h | Transformer Librispeech ASR1 |
Transformer Librispeech ASR2 Model | Librispeech Dataset | subword-based | 131 MB | Encoder:Transformer, Decoder:Transformer, Decoding method: JoinCTC w/ LM | - | 0.0240 | 960 h | Transformer Librispeech ASR2 |
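The CER and WER columns above are both edit-distance metrics: the Levenshtein distance between the recognized text and the reference transcript, normalized by the reference length, computed over characters (CER, used for Mandarin) or words (WER, used for English). A minimal sketch of the computation (illustrative only, not the evaluation code used for these models):

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences (1-row DP)."""
    n = len(hyp)
    dp = list(range(n + 1))
    for i in range(1, len(ref) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,                          # deletion
                        dp[j - 1] + 1,                      # insertion
                        prev + (ref[i - 1] != hyp[j - 1]))  # substitution
            prev = cur
    return dp[n]

def cer(ref, hyp):
    """Character error rate: edit distance over reference character count."""
    return edit_distance(list(ref), list(hyp)) / len(ref)

def wer(ref, hyp):
    """Word error rate: the same computation over word tokens."""
    return edit_distance(ref.split(), hyp.split()) / len(ref.split())
```

So a CER of 0.0483 (Conformer Offline Aishell ASR1) means roughly 4.83 character edits per 100 reference characters.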
### Language Model based on NGram
Language Model | Training Data | Token-based | Size | Descriptions |
---|---|---|---|---|
English LM | CommonCrawl(en.00) | Word-based | 8.3 GB | Pruned with 0 1 1 1 1; About 1.85 billion n-grams; 'trie' binary with '-a 22 -q 8 -b 8' |
Mandarin LM Small | Baidu Internal Corpus | Char-based | 2.8 GB | Pruned with 0 1 2 4 4; About 0.13 billion n-grams; 'probing' binary with default settings |
Mandarin LM Large | Baidu Internal Corpus | Char-based | 70.4 GB | No Pruning; About 3.7 billion n-grams; 'probing' binary with default settings |
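The "Pruned with 0 1 1 1 1" notation follows KenLM-style count pruning: one threshold per n-gram order, and n-grams whose count does not exceed their order's threshold are dropped (0 means no pruning for that order). A toy sketch of the idea, with hypothetical helper names (this is not KenLM's implementation, which prunes on adjusted counts during estimation):

```python
from collections import Counter

def ngram_counts(tokens, max_order):
    """Count all n-grams of order 1..max_order in a token list."""
    counts = {n: Counter() for n in range(1, max_order + 1)}
    for n in range(1, max_order + 1):
        for i in range(len(tokens) - n + 1):
            counts[n][tuple(tokens[i:i + n])] += 1
    return counts

def prune(counts, thresholds):
    """Drop n-grams of order n whose count is <= thresholds[n-1],
    e.g. thresholds [0, 1, 2, 4, 4] as in the Mandarin LM Small row."""
    return {n: Counter({g: c for g, c in counts[n].items()
                        if c > thresholds[n - 1]})
            for n in counts}
```

The 'trie' and 'probing' labels name KenLM's two binary formats: probing is a faster hash-table layout, trie is smaller; `-a 22 -q 8 -b 8` quantizes the trie's probabilities and backoffs to save further space.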
## Speech Translation Models
Model | Training Data | Token-based | Size | Descriptions | BLEU | Example Link |
---|---|---|---|---|---|---|
(only for CLI) Transformer FAT-ST MTL En-Zh | Ted-En-Zh | Spm | | Encoder:Transformer, Decoder:Transformer, Decoding method: Attention | 20.80 | Transformer Ted-En-Zh ST1 |
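The BLEU column reports translation quality (here 20.80 on the percentage scale, as produced by standard tools such as sacrebleu). For orientation, a heavily simplified sentence-level sketch of the metric: the geometric mean of clipped n-gram precisions times a brevity penalty. This is illustrative only and not the scoring setup used for the model above:

```python
import math
from collections import Counter

def bleu(reference, hypothesis, max_order=4):
    """Simplified, smoothed sentence-level BLEU over token lists."""
    log_precisions = []
    for n in range(1, max_order + 1):
        ref_ngrams = Counter(tuple(reference[i:i + n])
                             for i in range(len(reference) - n + 1))
        hyp_ngrams = Counter(tuple(hypothesis[i:i + n])
                             for i in range(len(hypothesis) - n + 1))
        # clip each hypothesis n-gram count by its count in the reference
        overlap = sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
        total = max(sum(hyp_ngrams.values()), 1)
        # smooth zero overlaps so the geometric mean stays defined
        log_precisions.append(math.log(max(overlap, 1e-9) / total))
    # brevity penalty punishes hypotheses shorter than the reference
    bp = min(1.0, math.exp(1.0 - len(reference) / len(hypothesis)))
    return bp * math.exp(sum(log_precisions) / max_order)
```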
## Text-to-Speech Models

### Acoustic Models
Model Type | Dataset | Example Link | Pretrained Models | Static Models | Size (static) |
---|---|---|---|---|---|
Tacotron2 | LJSpeech | tacotron2-ljspeech | tacotron2_ljspeech_ckpt_0.2.0.zip | ||
Tacotron2 | CSMSC | tacotron2-csmsc | tacotron2_csmsc_ckpt_0.2.0.zip | tacotron2_csmsc_static_0.2.0.zip | 103MB |
TransformerTTS | LJSpeech | transformer-ljspeech | transformer_tts_ljspeech_ckpt_0.4.zip | ||
SpeedySpeech | CSMSC | speedyspeech-csmsc | speedyspeech_nosil_baker_ckpt_0.5.zip | speedyspeech_nosil_baker_static_0.5.zip | 12MB |
FastSpeech2 | CSMSC | fastspeech2-csmsc | fastspeech2_nosil_baker_ckpt_0.4.zip | fastspeech2_nosil_baker_static_0.4.zip | 157MB |
FastSpeech2-Conformer | CSMSC | fastspeech2-csmsc | fastspeech2_conformer_baker_ckpt_0.5.zip | ||
FastSpeech2 | AISHELL-3 | fastspeech2-aishell3 | fastspeech2_nosil_aishell3_ckpt_0.4.zip | ||
FastSpeech2 | LJSpeech | fastspeech2-ljspeech | fastspeech2_nosil_ljspeech_ckpt_0.5.zip | ||
FastSpeech2 | VCTK | fastspeech2-vctk | fastspeech2_nosil_vctk_ckpt_0.5.zip | |
### Vocoders

### Voice Cloning
Model Type | Dataset | Example Link | Pretrained Models |
---|---|---|---|
GE2E | AISHELL-3, etc. | ge2e | ge2e_ckpt_0.3.zip |
GE2E + Tacotron2 | AISHELL-3 | ge2e-tacotron2-aishell3 | tacotron2_aishell3_ckpt_vc0_0.2.0.zip |
GE2E + FastSpeech2 | AISHELL-3 | ge2e-fastspeech2-aishell3 | fastspeech2_nosil_aishell3_vc1_ckpt_0.5.zip |
## Audio Classification Models
Model Type | Dataset | Example Link | Pretrained Models | Static Models |
---|---|---|---|---|
PANN | Audioset | audioset_tagging_cnn | panns_cnn6.pdparams, panns_cnn10.pdparams, panns_cnn14.pdparams | panns_cnn6_static.tar.gz(18M), panns_cnn10_static.tar.gz(19M), panns_cnn14_static.tar.gz(289M) |
PANN | ESC-50 | pann-esc50 | esc50_cnn6.tar.gz, esc50_cnn10.tar.gz, esc50_cnn14.tar.gz | |
## Speaker Verification Models
Model Type | Dataset | Example Link | Pretrained Models | Static Models |
---|---|---|---|---|
ECAPA-TDNN | VoxCeleb | voxceleb_ecapatdnn | ecapatdnn.tar.gz | - |
## Punctuation Restoration Models
Model Type | Dataset | Example Link | Pretrained Models |
---|---|---|---|
Ernie Linear | IWSLT2012_zh | iwslt2012_punc0 | ernie_linear_p3_iwslt2012_zh_ckpt_0.1.1.zip |
## Speech Recognition Model from Paddle 1.8
Acoustic Model | Training Data | Token-based | Size | Descriptions | CER | WER | Hours of speech |
---|---|---|---|---|---|---|---|
Ds2 Offline Aishell model | Aishell Dataset | Char-based | 234 MB | 2 Conv + 3 bidirectional GRU layers | 0.0804 | — | 151 h |
Ds2 Offline Librispeech model | Librispeech Dataset | Word-based | 307 MB | 2 Conv + 3 bidirectional RNN layers with shared weights | — | 0.0685 | 960 h |
Ds2 Offline Baidu en8k model | Baidu Internal English Dataset | Word-based | 273 MB | 2 Conv + 3 bidirectional GRU layers | — | 0.0541 | 8628 h |