pull/6/head
Hu Weiwei 7 years ago committed by GitHub
parent 0fcd0812e7
commit cc5e420331
No known key found for this signature in database
GPG Key ID: 4AEE18F83AFDEB23

@ -187,7 +187,7 @@ Six optional augmentation components are provided to be selected, configured and
- Noise Perturbation (need background noise audio files) - Noise Perturbation (need background noise audio files)
- Impulse Response (need impulse audio files) - Impulse Response (need impulse audio files)
In order to inform the trainer of what augmentation components are needed and what their processing orders are, it is required to prepare in advance a *augmentation configuration file* in [JSON](http://www.json.org/) format. For example: In order to inform the trainer of what augmentation components are needed and what their processing orders are, it is required to prepare in advance an *augmentation configuration file* in [JSON](http://www.json.org/) format. For example:
``` ```
[{ [{
@ -226,7 +226,7 @@ If you wish to train your own better language model, please refer to [KenLM](htt
#### English LM #### English LM
The English corpus is from the [Common Crawl Repository](http://commoncrawl.org) and you can download it from [statmt](http://data.statmt.org/ngrams/deduped_en). We use part en.00 to train our English languge model. There are some preprocessing steps before training: The English corpus is from the [Common Crawl Repository](http://commoncrawl.org) and you can download it from [statmt](http://data.statmt.org/ngrams/deduped_en). We use part en.00 to train our English language model. There are some preprocessing steps before training:
* Characters not in \[A-Za-z0-9\s'\] (\s represents whitespace characters) are removed and Arabic numbers are converted to English numbers like 1000 to one thousand. * Characters not in \[A-Za-z0-9\s'\] (\s represents whitespace characters) are removed and Arabic numbers are converted to English numbers like 1000 to one thousand.
* Repeated whitespace characters are squeezed to one and the beginning whitespace characters are removed. Notice that all transcriptions are lowercase, so all characters are converted to lowercase. * Repeated whitespace characters are squeezed to one and the beginning whitespace characters are removed. Notice that all transcriptions are lowercase, so all characters are converted to lowercase.

Loading…
Cancel
Save