THCHS30
This is the data part of the THCHS30 2015 acoustic data & scripts dataset.
The dataset is described in more detail in the paper "THCHS-30: A Free Chinese Speech Corpus" by Dong Wang, Xuewei Zhang.
A paper (if it can be called a paper) from 13 years ago regarding the database:
Dong Wang, Dalei Wu, Xiaoyan Zhu, TCMSD: A new Chinese Continuous Speech Database, International Conference on Chinese Computing (ICCC'01), 2001, Singapore.
The layout of this data pack is the following:
``data``
    ``*.wav``
        audio data
    ``*.wav.trn``
        transcriptions

``{train,dev,test}``
    contain symlinks into the ``data`` directory for both audio and
    transcription files. The contents of these directories define the
    train/dev/test split of the data.

``{lm_word}``
    ``word.3gram.lm``
        word-based trigram LM
    ``lexicon.txt``
        word-based lexicon

``{lm_phone}``
    ``phone.3gram.lm``
        phone-based trigram LM
    ``lexicon.txt``
        phone-based lexicon

``README.TXT``
    this file
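The layout above suggests a simple way to enumerate one split: list the ``*.wav`` entries of a split directory and pair each with its ``*.wav.trn`` sibling. The following is a minimal sketch under that assumption. The function name ``list_split`` and the ``root`` argument are hypothetical and not part of the data pack. The format of the transcription files is not described in this README, so the sketch only locates them and does not parse them.

```python
from pathlib import Path


def list_split(root, split):
    """Return (audio, transcription) path pairs for one split directory.

    Assumes the layout described in this README: <root>/<split>/ holds
    *.wav entries, each with a sibling <name>.wav.trn entry.
    """
    split_dir = Path(root) / split
    pairs = []
    # "*.wav" matches full names only, so "x.wav.trn" is not picked up here.
    for wav in sorted(split_dir.glob("*.wav")):
        trn = wav.with_name(wav.name + ".trn")
        if trn.exists():  # skip audio without a transcription entry
            # resolve() follows the symlinks back to their targets
            pairs.append((wav.resolve(), trn.resolve()))
    return pairs
```

For example, ``list_split("/path/to/pack", "dev")`` would return the resolved audio/transcription pairs that make up the dev split, where ``/path/to/pack`` stands for wherever the data pack was unpacked.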
Data statistics
Statistics for the data are as follows:
===========  ==========  ==========  ===========
**dataset**  **audio**   **#sents**  **#words**
===========  ==========  ==========  ===========
train        25          10,000      198,252
dev          2:14        893         17,743
test         6:15        2,495       49,085
===========  ==========  ==========  ===========
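As a quick sanity check on the table, the totals and the average sentence length can be derived from the figures above. This is a sketch, not part of the data pack. The sentence and word counts are copied from the table. The audio column is assumed to be in ``hours:minutes`` with a bare ``25`` meaning 25 hours, which the README does not state explicitly. The names ``stats`` and ``to_minutes`` are illustrative only.

```python
# Figures copied from the statistics table in this README.
stats = {
    "train": {"audio": "25", "sents": 10_000, "words": 198_252},
    "dev": {"audio": "2:14", "sents": 893, "words": 17_743},
    "test": {"audio": "6:15", "sents": 2_495, "words": 49_085},
}


def to_minutes(audio):
    """Convert an audio entry to minutes.

    ASSUMPTION: entries are "hours:minutes"; a bare number means whole hours.
    """
    hours, _, minutes = audio.partition(":")
    return int(hours) * 60 + int(minutes or 0)


total_sents = sum(s["sents"] for s in stats.values())
total_words = sum(s["words"] for s in stats.values())
total_minutes = sum(to_minutes(s["audio"]) for s in stats.values())

# Average number of words per sentence in each split.
words_per_sent = {name: s["words"] / s["sents"] for name, s in stats.items()}
```

Under these assumptions the three splits together hold 13,388 sentences and 265,080 words, with each split averaging a little under 20 words per sentence.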