If you wish to train a better language model of your own, please refer to [KenLM](https://github.com/kpu/kenlm) for tutorials. Here we provide some tips showing how we prepared our English and Mandarin language models.
#### English LM
The English corpus is from the [Common Crawl Repository](http://commoncrawl.org) and you can download it from [statmt](http://data.statmt.org/ngrams/deduped_en). We use part en.00 to train our English language model. There are some preprocessing steps before training (a sketch of the whole pipeline follows this list):
* Characters not in [A-Za-z0-9\s'] (letters, digits, whitespace and apostrophe) are removed, and Arabic numerals are converted to English words, e.g. 1000 to one thousand.
* Repeated whitespace is squeezed to a single space and leading whitespace is removed. Notice that all transcriptions are lowercase, so all characters are converted to lowercase.
* The top 400,000 words by frequency are selected to build the vocabulary, and all words not in the vocabulary are replaced with 'UNKNOWNWORD'.
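The three steps above amount to a two-pass pipeline: a first pass normalizes the text and counts word frequencies, and a second pass rewrites the corpus with out-of-vocabulary words replaced. Below is a minimal Python sketch of such a pipeline; the function names and file layout are ours, and it borrows the third-party `num2words` package for the digit-to-words conversion, so treat it as an illustration rather than the exact scripts behind the released model.

```python
import re
from collections import Counter

from num2words import num2words  # third-party: pip install num2words

VOCAB_SIZE = 400000

def spell_number(match):
    # e.g. "1000" -> "one thousand"; strip the hyphens and commas that
    # num2words inserts so the character filter below keeps the words intact.
    return num2words(int(match.group())).replace('-', ' ').replace(',', '')

def normalize_line(line):
    line = line.lower()                       # transcriptions are lowercase
    line = re.sub(r'\d+', spell_number, line) # convert Arabic numerals
    line = re.sub(r"[^a-z0-9\s']", '', line)  # drop characters outside the set
    return re.sub(r'\s+', ' ', line).strip()  # squeeze and strip whitespace

def build_vocab(path):
    # First pass: count normalized words, keep the most frequent ones.
    counter = Counter()
    with open(path, encoding='utf-8') as f:
        for line in f:
            counter.update(normalize_line(line).split())
    return {word for word, _ in counter.most_common(VOCAB_SIZE)}

def clean_corpus(src, dst):
    # Second pass: write the normalized corpus with OOV words replaced.
    vocab = build_vocab(src)
    with open(src, encoding='utf-8') as fin, \
         open(dst, 'w', encoding='utf-8') as fout:
        for line in fin:
            words = normalize_line(line).split()
            fout.write(' '.join(w if w in vocab else 'UNKNOWNWORD'
                                for w in words) + '\n')

if __name__ == '__main__':
    clean_corpus('en.00.deduped', 'cleaned_en00.txt')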
Now the preprocessing is done and we get a clean corpus to train the language model. Our released language model is pruned with '0 1 1 1 1'. To save disk storage we convert the arpa file to a 'trie' binary file with the parameters '-a 22 -q 8 -b 8'.
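With the cleaned corpus in hand, training and binarization are plain KenLM invocations. The sketch below wraps them in Python for illustration; the paths are assumptions, and the 5-gram order is inferred from the five pruning thresholds rather than stated explicitly above.

```python
import subprocess

# Assumed location of the KenLM binaries; adjust to where KenLM was built.
KENLM_BIN = './kenlm/build/bin'

# Train a 5-gram model; '--prune 0 1 1 1 1' gives one threshold per
# n-gram order, which is why five values imply order 5.
with open('cleaned_en00.txt', 'rb') as corpus, open('lm.arpa', 'wb') as arpa:
    subprocess.run(
        [f'{KENLM_BIN}/lmplz', '-o', '5',
         '--prune', '0', '1', '1', '1', '1'],
        stdin=corpus, stdout=arpa, check=True)

# Convert the arpa file to the compact 'trie' binary format with the
# quantization/compression parameters quoted above.
subprocess.run(
    [f'{KENLM_BIN}/build_binary', '-a', '22', '-q', '8', '-b', '8',
     'trie', 'lm.arpa', 'lm.trie'],
    check=True)
```

Here '-q 8' and '-b 8' quantize the probabilities and backoff weights to 8 bits, and '-a 22' bounds the pointer-compression bits, trading a little accuracy for a much smaller file.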