Released Models
Speech-to-Text Models
Speech Recognition Model
Acoustic Model | Training Data | Token-based | Size | Descriptions | CER | WER | Hours of speech | Example Link | Inference Type | static_model |
---|---|---|---|---|---|---|---|---|---|---|
Ds2 Online Wenetspeech ASR0 Model | Wenetspeech Dataset | Char-based | 1.2 GB | 2 Conv + 5 LSTM layers | 0.152 (test_net, w/o LM) 0.2417 (test_meeting, w/o LM) 0.053 (aishell, w/ LM) | - | 10000 h | - | onnx/inference/python | - |
Ds2 Online Aishell ASR0 Model | Aishell Dataset | Char-based | 491 MB | 2 Conv + 5 LSTM layers | 0.0666 | - | 151 h | Ds2 Online Aishell ASR0 | onnx/inference/python | - |
Ds2 Offline Aishell ASR0 Model | Aishell Dataset | Char-based | 1.4 GB | 2 Conv + 5 bidirectional LSTM layers | 0.0554 | - | 151 h | Ds2 Offline Aishell ASR0 | inference/python | - |
Conformer Online Wenetspeech ASR1 Model | WenetSpeech Dataset | Char-based | 457 MB | Encoder:Conformer, Decoder:Transformer, Decoding method: Attention rescoring | 0.11 (test_net) 0.1879 (test_meeting) | - | 10000 h | - | python | - |
Conformer U2PP Online Wenetspeech ASR1 Model | WenetSpeech Dataset | Char-based | 540 MB | Encoder:Conformer, Decoder:BiTransformer, Decoding method: Attention rescoring | 0.047198 (aishell test_-1) 0.059212 (aishell test_16) | - | 10000 h | - | python | FP32 INT8 |
Conformer Online Aishell ASR1 Model | Aishell Dataset | Char-based | 189 MB | Encoder:Conformer, Decoder:Transformer, Decoding method: Attention rescoring | 0.0544 | - | 151 h | Conformer Online Aishell ASR1 | python | - |
Conformer Offline Aishell ASR1 Model | Aishell Dataset | Char-based | 189 MB | Encoder:Conformer, Decoder:Transformer, Decoding method: Attention rescoring | 0.0460 | - | 151 h | Conformer Offline Aishell ASR1 | python | - |
Transformer Aishell ASR1 Model | Aishell Dataset | Char-based | 128 MB | Encoder:Transformer, Decoder:Transformer, Decoding method: Attention rescoring | 0.0523 | - | 151 h | Transformer Aishell ASR1 | python | - |
Ds2 Offline Librispeech ASR0 Model | Librispeech Dataset | Char-based | 1.3 GB | 2 Conv + 5 bidirectional LSTM layers | - | 0.0467 | 960 h | Ds2 Offline Librispeech ASR0 | inference/python | - |
Conformer Librispeech ASR1 Model | Librispeech Dataset | subword-based | 191 MB | Encoder:Conformer, Decoder:Transformer, Decoding method: Attention rescoring | - | 0.0338 | 960 h | Conformer Librispeech ASR1 | python | - |
Transformer Librispeech ASR1 Model | Librispeech Dataset | subword-based | 131 MB | Encoder:Transformer, Decoder:Transformer, Decoding method: Attention rescoring | - | 0.0381 | 960 h | Transformer Librispeech ASR1 | python | - |
Transformer Librispeech ASR2 Model | Librispeech Dataset | subword-based | 131 MB | Encoder:Transformer, Decoder:Transformer, Decoding method: Joint CTC w/ LM | - | 0.0240 | 960 h | Transformer Librispeech ASR2 | python | - |
Conformer TALCS ASR1 Model | TALCS Dataset | subword-based | 470 MB | Encoder:Conformer, Decoder:Transformer, Decoding method: Attention rescoring | - | 0.0844 | 587 h | Conformer TALCS ASR1 | python | - |
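The CER and WER columns above are error rates, where lower is better. As a minimal sketch of how such figures are commonly computed (not the evaluation script used to produce the numbers in this table), both are the edit distance between reference and hypothesis divided by the reference length, counted in characters for CER and in words for WER. The function names `edit_distance`, `cer`, and `wer` are illustrative, and the whitespace handling is an assumption.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to empty hyp
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference token
                curr[j - 1] + 1,     # insertion of a hypothesis token
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate, ignoring spaces (an assumption of this sketch)."""
    ref_chars = list(reference.replace(" ", ""))
    if not ref_chars:
        raise ValueError("reference must contain at least one character")
    hyp_chars = list(hypothesis.replace(" ", ""))
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


if __name__ == "__main__":
    # One substitution and one deletion over six reference words.
    print(wer("the cat sat on the mat", "the cat sit on mat"))
    # One substituted character over six reference characters.
    print(cer("今天天气很好", "今天天汽很好"))
```

Character-level scoring suits the char-based Mandarin models, where word boundaries are not marked in the text, while word-level scoring is the usual choice for the English Librispeech models.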
Self-Supervised Pre-trained Model
Model | Pre-Train Method | Pre-Train Data | Finetune Data | Size | Descriptions | CER | WER | Example Link |
---|---|---|---|---|---|---|---|---|
Wav2vec2-large-960h-lv60-self Model | wav2vec2 | Librispeech and LV-60k Dataset (53,000 h) | - | 1.18 GB | Pre-trained Wav2vec2.0 Model | - | - | - |
Wav2vec2ASR-large-960h-librispeech Model | wav2vec2 | Librispeech and LV-60k Dataset (53,000 h) | Librispeech (960 h) | 718 MB | Encoder: Wav2vec2.0, Decoder: CTC, Decoding method: Greedy search | - | 0.0189 | Wav2vecASR Librispeech ASR3 |
Wav2vec2-large-wenetspeech-self Model | wav2vec2 | Wenetspeech Dataset (10,000 h) | - | 714 MB | Pre-trained Wav2vec2.0 Model | - | - | - |
Wav2vec2ASR-large-aishell1 Model | wav2vec2 | Wenetspeech Dataset (10,000 h) | aishell1 (train set) | 1.18 GB | Encoder: Wav2vec2.0, Decoder: CTC, Decoding method: Greedy search | 0.0510 | - | - |
Whisper Model
Demo Link | Training Data | Size | Descriptions | CER | Model |
---|---|---|---|---|---|
Whisper | 680k h from the internet | large: 5.8 GB, medium: 2.9 GB, small: 923 MB, base: 277 MB, tiny: 145 MB | Encoder:Transformer, Decoder:Transformer, Decoding method: Greedy search | 0.027 (large, Librispeech) | whisper-large, whisper-medium, whisper-medium-English-only, whisper-small, whisper-small-English-only, whisper-base, whisper-base-English-only, whisper-tiny, whisper-tiny-English-only |
Language Model based on NGram
Language Model | Training Data | Token-based | Size | Descriptions |
---|---|---|---|---|
English LM | CommonCrawl(en.00) | Word-based | 8.3 GB | Pruned with 0 1 1 1 1; About 1.85 billion n-grams; 'trie' binary with '-a 22 -q 8 -b 8' |
Mandarin LM Small | Baidu Internal Corpus | Char-based | 2.8 GB | Pruned with 0 1 2 4 4; About 0.13 billion n-grams; 'probing' binary with default settings |
Mandarin LM Large | Baidu Internal Corpus | Char-based | 70.4 GB | No Pruning; About 3.7 billion n-grams; 'probing' binary with default settings |
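The "Pruned with 0 1 2 4 4" style entries above read like per-order count thresholds, one number per n-gram order. The following is a simplified sketch under that assumption, following the convention of KenLM's `lmplz --prune` option, where n-grams whose count is less than or equal to the threshold for their order are dropped. The source does not name the toolkit, so that link is an assumption; the helpers `count_ngrams` and `prune_ngrams` are hypothetical, and a real toolkit prunes on adjusted counts during estimation rather than on raw counts.

```python
from collections import Counter


def count_ngrams(tokens, max_order):
    """Count all n-grams of order 1..max_order in a token sequence."""
    counts = Counter()
    for n in range(1, max_order + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def prune_ngrams(counts, thresholds):
    """Keep an n-gram only if its count exceeds the threshold for its order.

    thresholds[k - 1] is the threshold for order k, so "0 1 2 4 4" would be
    passed as [0, 1, 2, 4, 4] (assumed interpretation, see lead-in).
    """
    kept = {}
    for ngram, count in counts.items():
        if count > thresholds[len(ngram) - 1]:
            kept[ngram] = count
    return kept


if __name__ == "__main__":
    tokens = "a b a b c".split()
    counts = count_ngrams(tokens, max_order=2)
    # Threshold 0 keeps every unigram; threshold 1 drops singleton bigrams.
    kept = prune_ngrams(counts, thresholds=[0, 1])
    for ngram, count in sorted(kept.items()):
        print(" ".join(ngram), count)
```

Under this reading, higher thresholds on the higher orders explain why the pruned Mandarin LM Small is far smaller than the unpruned Mandarin LM Large.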
Speech Translation Models
Model | Training Data | Token-based | Size | Descriptions | BLEU | Example Link |
---|---|---|---|---|---|---|
(only for CLI) Transformer FAT-ST MTL En-Zh | Ted-En-Zh | Spm | - | Encoder:Transformer, Decoder:Transformer, Decoding method: Attention | 20.80 | Transformer Ted-En-Zh ST1 |
Text-to-Speech Models
Acoustic Models
Vocoders
Voice Cloning
Model Type | Dataset | Example Link | Pretrained Models |
---|---|---|---|
GE2E | AISHELL-3, etc. | ge2e | ge2e_ckpt_0.3.zip |
GE2E + Tacotron2 | AISHELL-3 | ge2e-Tacotron2-aishell3 | tacotron2_aishell3_ckpt_vc0_0.2.0.zip |
GE2E + FastSpeech2 | AISHELL-3 | ge2e-fastspeech2-aishell3 | fastspeech2_nosil_aishell3_vc1_ckpt_0.5.zip |
Audio Classification Models
Model Type | Dataset | Example Link | Pretrained Models | Static Models |
---|---|---|---|---|
PANN | Audioset | audioset_tagging_cnn | panns_cnn6.pdparams, panns_cnn10.pdparams, panns_cnn14.pdparams | panns_cnn6_static.tar.gz(18M), panns_cnn10_static.tar.gz(19M), panns_cnn14_static.tar.gz(289M) |
PANN | ESC-50 | pann-esc50 | esc50_cnn6.tar.gz, esc50_cnn10.tar.gz, esc50_cnn14.tar.gz | - |
Speaker Verification Models
Model Type | Dataset | Example Link | Pretrained Models | Static Models |
---|---|---|---|---|
ECAPA-TDNN | VoxCeleb | voxceleb_ecapatdnn | ecapatdnn.tar.gz | - |
Punctuation Restoration Models
Model Type | Dataset | Example Link | Pretrained Models |
---|---|---|---|
Ernie Linear | IWSLT2012_zh | iwslt2012_punc0 | ernie_linear_p3_iwslt2012_zh_ckpt_0.1.1.zip |