# [THCHS30](http://openslr.elda.org/18/)

This is the *data part* of the `THCHS30 2015` acoustic data
& scripts dataset.

The dataset is described in more detail in the paper ``THCHS-30 : A Free
Chinese Speech Corpus`` by Dong Wang, Xuewei Zhang.

An earlier paper (if it can be called a paper) from 13 years ago describes the database:

Dong Wang, Dalei Wu, Xiaoyan Zhu, ``TCMSD: A new Chinese Continuous Speech Database``,
International Conference on Chinese Computing (ICCC'01), 2001, Singapore.

The layout of this data pack is the following:

``data``
    ``*.wav``
        audio data

    ``*.wav.trn``
        transcriptions

``{train,dev,test}``
    contain symlinks into the ``data`` directory for both audio and
    transcription files. The contents of these directories define the
    train/dev/test split of the data.

``{lm_word}``
    ``word.3gram.lm``
        word-based trigram LM
    ``lexicon.txt``
        word-based lexicon

``{lm_phone}``
    ``phone.3gram.lm``
        phone-based trigram LM
    ``lexicon.txt``
        phone-based lexicon

``README.TXT``
    this file
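
Since the split is defined purely by which symlinks sit in ``train``, ``dev`` and ``test``, it can be recovered by walking those directories. The following is a minimal Python sketch under stated assumptions: the helper names ``list_split`` and ``check_disjoint`` are hypothetical, and it assumes every ``*.wav`` entry in a split directory has a sibling ``*.wav.trn`` entry with the same base name.

```python
from pathlib import Path


def list_split(root, split):
    """Return sorted (audio, transcription) path pairs for one split.

    Assumption: each <split>/<name>.wav has a sibling <name>.wav.trn.
    Both entries may be symlinks into data/; resolve() follows them.
    """
    split_dir = Path(root) / split
    pairs = []
    for wav in sorted(split_dir.glob("*.wav")):
        trn = wav.with_name(wav.name + ".trn")
        if not trn.exists():
            # Skip audio without a transcription rather than guessing.
            continue
        pairs.append((wav.resolve(), trn.resolve()))
    return pairs


def check_disjoint(root, splits=("train", "dev", "test")):
    """Return True if no resolved audio file appears in more than one split."""
    seen = set()
    for split in splits:
        for wav, _ in list_split(root, split):
            if wav in seen:
                return False
            seen.add(wav)
    return True
```

For example, ``len(list_split("path/to/pack", "dev"))`` gives the number of dev utterances that have both files present, where ``path/to/pack`` is a placeholder for wherever the data pack was extracted.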


Data statistics
===============

Statistics for the data are as follows:

===========  ================  ==========  ===========
**dataset**  **audio (h:mm)**  **#sents**  **#words**
===========  ================  ==========  ===========
train        25                10,000      198,252
dev          2:14              893         17,743
test         6:15              2,495       49,085
===========  ================  ==========  ===========
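
As a quick sanity check, the per-split figures can be summed to get totals for the whole pack. This is only a sketch: it assumes the audio column is given in hours and minutes (so ``2:14`` means 2 h 14 min and ``25`` means 25 h), and the names ``STATS`` and ``to_minutes`` are illustrative rather than part of the dataset's scripts.

```python
def to_minutes(duration):
    """Convert 'h:mm' (or a bare hour count such as '25') to minutes."""
    hours, _, minutes = duration.partition(":")
    return int(hours) * 60 + int(minutes or 0)


# (audio, #sents, #words) rows copied from the statistics table.
STATS = {
    "train": ("25", 10_000, 198_252),
    "dev": ("2:14", 893, 17_743),
    "test": ("6:15", 2_495, 49_085),
}

total_minutes = sum(to_minutes(audio) for audio, _, _ in STATS.values())
total_sents = sum(sents for _, sents, _ in STATS.values())
total_words = sum(words for _, _, words in STATS.values())

print(f"audio: {total_minutes // 60}:{total_minutes % 60:02d}")  # audio: 33:29
print(f"sentences: {total_sents:,}")                             # sentences: 13,388
print(f"words: {total_words:,}")                                 # words: 265,080
```

Under that reading of the audio column, the pack holds roughly 33.5 hours of speech across 13,388 sentences.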