From 9b7fc7e903ff408095c17d893cde9e9c9bd9b08d Mon Sep 17 00:00:00 2001
From: yangyaming
Date: Thu, 12 Oct 2017 11:38:22 +0800
Subject: [PATCH 1/3] Add doc for Chinese LM.

---
 README.md | 13 +++++++------
 1 file changed, 7 insertions(+), 6 deletions(-)

diff --git a/README.md b/README.md
index fca2528a..efd45efb 100644
--- a/README.md
+++ b/README.md
@@ -232,13 +232,13 @@ Now the preprocessing is done and we get a clean corpus to train the language mo
 
 #### Mandarin LM
 
-Different from the English language model, Mandarin language model is character-based where each token is a Chinese character. We use an internal corpus to train the released Mandarin language model. This corpus contains billions of tokens. The preprocessing has tiny difference from English language model and main steps include:
+Unlike the English language model, the Mandarin language model is character-based: each token is a Chinese character. We use an internal corpus to train the released Mandarin language models. The corpus contains billions of tokens. The preprocessing differs only slightly from that for the English language model; the main steps include:
 
 * The beginning and trailing whitespace characters are removed.
 * English punctuations and Chinese punctuations are removed.
 * A whitespace character between two tokens is inserted.
 
-Please notice that the released language model only contains Chinese simplified characters. After preprocessing done we can begin to train the language model. The key training arguments are '-o 5 --prune 0 1 2 4 4'. Please refer above section for the meaning of each argument. We also convert the arpa file to binary file using default settings.
+Please note that the released language models contain only simplified Chinese characters. Once the preprocessing is done, we can begin training the language models. The key training arguments are '-o 5 --prune 0 1 2 4 4' for the small LM and '-o 5' for the large LM; please refer to the section above for the meaning of each argument. We also convert each ARPA file to a binary file using the default settings, as sketched below.
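+
+A minimal sketch of these two steps with KenLM (assuming `lmplz` and `build_binary` are on `PATH`; `corpus.txt` and the output file names are illustrative):
+
+```bash
+# Small LM: 5-gram model with aggressive pruning
+lmplz -o 5 --prune 0 1 2 4 4 --text corpus.txt --arpa zh_small.arpa
+# Large LM: 5-gram model with no pruning
+lmplz -o 5 --text corpus.txt --arpa zh_large.arpa
+# Convert an ARPA file to a binary file using the default settings
+build_binary zh_small.arpa zh_small.klm
+```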
 
 ### Speech-to-text Inference
 
@@ -459,10 +459,11 @@ Mandarin | [Internal Mandarin Model](to-be-added) | Baidu Mandarin Dataset | 291
 
 #### Language Model Released
 
-Language Model | Training Data | Token-based | Size | Filter Configuraiton
-:-------------:| :------------:| :-----: | -----: | -----------------:
-[English LM](http://paddlepaddle.bj.bcebos.com/model_zoo/speech/common_crawl_00.prune01111.trie.klm) | To Be Added | Word-based | 8.3 GB | To Be Added
-[Mandarin LM](http://cloud.dlnel.org/filepub/?uuid=d21861e4-4ed6-45bb-ad8e-ae417a43195e) | To Be Added | Character-based | 2.8 GB | To Be Added
+Language Model | Training Data | Token-based | Size | Description
+:-------------:| :------------:| :-----: | -----: | :-----------------:
+[English LM](http://paddlepaddle.bj.bcebos.com/model_zoo/speech/common_crawl_00.prune01111.trie.klm) | [en.00.deduped.xz](http://web-language-models.s3-website-us-east-1.amazonaws.com/ngrams/en/deduped/en.00.deduped.xz) | Word-based | 8.3 GB | pruned with 0 1 1 1 1<br>about 1.85 billion n-grams<br>'trie' binary with '-a 22 -q 8 -b 8'
+[Mandarin LM Small](http://cloud.dlnel.org/filepub/?uuid=d21861e4-4ed6-45bb-ad8e-ae417a43195e) | Small internal data | Character-based | 2.8 GB | pruned with 0 1 2 4 4<br>about 0.13 billion n-grams<br>'probing' binary with default settings
+Mandarin LM Large | Large internal data | Character-based | 70.4 GB | no pruning<br>about 3.7 billion n-grams<br>'probing' binary with default settings
 
 ## Experiments and Benchmarks

From d78d4fa6ffa1a7978dfe9767fb1c3ec526e58a10 Mon Sep 17 00:00:00 2001
From: yangyaming
Date: Thu, 12 Oct 2017 17:06:08 +0800
Subject: [PATCH 2/3] Add url for large Mandarin LM.

---
 README.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/README.md b/README.md
index efd45efb..8dd13f92 100644
--- a/README.md
+++ b/README.md
@@ -463,7 +463,7 @@ Language Model | Training Data | Token-based | Size | Description
 :-------------:| :------------:| :-----: | -----: | :-----------------:
 [English LM](http://paddlepaddle.bj.bcebos.com/model_zoo/speech/common_crawl_00.prune01111.trie.klm) | [en.00.deduped.xz](http://web-language-models.s3-website-us-east-1.amazonaws.com/ngrams/en/deduped/en.00.deduped.xz) | Word-based | 8.3 GB | pruned with 0 1 1 1 1<br>about 1.85 billion n-grams<br>'trie' binary with '-a 22 -q 8 -b 8'
 [Mandarin LM Small](http://cloud.dlnel.org/filepub/?uuid=d21861e4-4ed6-45bb-ad8e-ae417a43195e) | Small internal data | Character-based | 2.8 GB | pruned with 0 1 2 4 4<br>about 0.13 billion n-grams<br>'probing' binary with default settings
-Mandarin LM Large | Large internal data | Character-based | 70.4 GB | no pruning<br>about 3.7 billion n-grams<br>'probing' binary with default settings
+[Mandarin LM Large](http://cloud.dlnel.org/filepub/?uuid=245d02bb-cd01-4ebe-b079-b97be864ec37) | Large internal data | Character-based | 70.4 GB | no pruning<br>about 3.7 billion n-grams<br>'probing' binary with default settings
 
 ## Experiments and Benchmarks

From e8a5a17b1dee1853668b9cd6dcf22facc18c74ab Mon Sep 17 00:00:00 2001
From: yangyaming
Date: Fri, 3 Nov 2017 15:09:03 +0800
Subject: [PATCH 3/3] Refine doc.

---
 README.md | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/README.md b/README.md
index 8dd13f92..c9a4e8d5 100644
--- a/README.md
+++ b/README.md
@@ -459,11 +459,11 @@ Mandarin | [Internal Mandarin Model](to-be-added) | Baidu Mandarin Dataset | 291
 
 #### Language Model Released
 
-Language Model | Training Data | Token-based | Size | Description
-:-------------:| :------------:| :-----: | -----: | :-----------------:
-[English LM](http://paddlepaddle.bj.bcebos.com/model_zoo/speech/common_crawl_00.prune01111.trie.klm) | [en.00.deduped.xz](http://web-language-models.s3-website-us-east-1.amazonaws.com/ngrams/en/deduped/en.00.deduped.xz) | Word-based | 8.3 GB | pruned with 0 1 1 1 1<br>about 1.85 billion n-grams<br>'trie' binary with '-a 22 -q 8 -b 8'
-[Mandarin LM Small](http://cloud.dlnel.org/filepub/?uuid=d21861e4-4ed6-45bb-ad8e-ae417a43195e) | Small internal data | Character-based | 2.8 GB | pruned with 0 1 2 4 4<br>about 0.13 billion n-grams<br>'probing' binary with default settings
-[Mandarin LM Large](http://cloud.dlnel.org/filepub/?uuid=245d02bb-cd01-4ebe-b079-b97be864ec37) | Large internal data | Character-based | 70.4 GB | no pruning<br>about 3.7 billion n-grams<br>'probing' binary with default settings
+Language Model | Training Data | Token-based | Size | Descriptions
+:-------------:| :------------:| :-----: | -----: | :-----------------
+[English LM](http://paddlepaddle.bj.bcebos.com/model_zoo/speech/common_crawl_00.prune01111.trie.klm) | [CommonCrawl(en.00)](http://web-language-models.s3-website-us-east-1.amazonaws.com/ngrams/en/deduped/en.00.deduped.xz) | Word-based | 8.3 GB | Pruned with 0 1 1 1 1;<br>About 1.85 billion n-grams;<br>'trie' binary with '-a 22 -q 8 -b 8'
+[Mandarin LM Small](http://cloud.dlnel.org/filepub/?uuid=d21861e4-4ed6-45bb-ad8e-ae417a43195e) | Baidu Internal Corpus | Char-based | 2.8 GB | Pruned with 0 1 2 4 4;<br>About 0.13 billion n-grams;<br>'probing' binary with default settings
+[Mandarin LM Large](http://cloud.dlnel.org/filepub/?uuid=245d02bb-cd01-4ebe-b079-b97be864ec37) | Baidu Internal Corpus | Char-based | 70.4 GB | No pruning;<br>About 3.7 billion n-grams;<br>'probing' binary with default settings
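+
+For reference, a sketch of how the 'trie' binary settings listed above can be reproduced with KenLM's `build_binary` (the ARPA file name is illustrative; the output name matches the released English LM):
+
+```bash
+# 'trie' data structure with 8-bit quantization (-q 8 -b 8)
+# and up to 22 bits of pointer-array compression (-a 22)
+build_binary -a 22 -q 8 -b 8 trie common_crawl_00.arpa common_crawl_00.prune01111.trie.klm
+```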
 
 ## Experiments and Benchmarks