# [THCHS30](http://openslr.elda.org/18/)
This is the *data part* of the `THCHS30 2015` acoustic data
& scripts dataset.
The dataset is described in more detail in the paper ``THCHS-30 : A Free
Chinese Speech Corpus`` by Dong Wang, Xuewei Zhang.
An earlier paper (if it can be called a paper) from 13 years ago describes the database:
Dong Wang, Dalei Wu, Xiaoyan Zhu, ``TCMSD: A new Chinese Continuous Speech Database``,
International Conference on Chinese Computing (ICCC'01), 2001, Singapore.
The layout of this data pack is the following:

``data``
    ``*.wav``
        audio data
    ``*.wav.trn``
        transcriptions

``{train,dev,test}``
    contain symlinks into the ``data`` directory for both audio and
    transcription files. Contents of these directories define the
    train/dev/test split of the data.

``{lm_word}``
    ``word.3gram.lm``
        trigram LM based on word
    ``lexicon.txt``
        lexicon based on word

``{lm_phone}``
    ``phone.3gram.lm``
        trigram LM based on phone
    ``lexicon.txt``
        lexicon based on phone

``README.TXT``
    this file

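
The layout above suggests a simple way to enumerate one split: walk the split directory, pair each ``*.wav`` entry with its ``*.wav.trn`` sibling, and follow the symlinks back into ``data``. The following is a minimal sketch under that assumption. The function name ``list_split`` and the ``root`` argument are hypothetical and not part of the data pack, and the sketch makes no assumption about what is inside a transcription file.

```python
from pathlib import Path


def list_split(root, split):
    """Return (wav_path, trn_path) pairs for one split directory.

    Assumes the layout described in this README: ``<root>/<split>/``
    holds ``*.wav`` entries, each with a sibling ``<name>.wav.trn``.
    """
    split_dir = Path(root) / split
    pairs = []
    for wav in sorted(split_dir.glob("*.wav")):
        trn = wav.with_name(wav.name + ".trn")
        if trn.exists():  # skip audio that has no transcription entry
            # resolve() follows the symlink into the ``data`` directory
            pairs.append((wav.resolve(), trn.resolve()))
    return pairs
```

Calling ``list_split("/path/to/thchs30", "train")`` (path is a placeholder) would then return the resolved audio and transcription paths that make up the training split.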
Data statistics
===============
Statistics for the data are as follows (audio duration is given in hours, as h:mm):

=========== ========== ========== ===========
**dataset** **audio**  **#sents** **#words**
=========== ========== ========== ===========
train       25         10,000     198,252
dev         2:14       893        17,743
test        6:15       2,495      49,085
=========== ========== ========== ===========
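
The audio and sentence counts in the table can be checked against a local copy of the data. The sketch below counts the ``*.wav`` files in one split directory and sums their durations from the WAV headers. It assumes the files are plain PCM WAV readable by Python's standard ``wave`` module and that the audio column is in hours:minutes. The helper names ``split_duration_seconds`` and ``format_h_mm`` are hypothetical.

```python
import wave
from pathlib import Path


def split_duration_seconds(split_dir):
    """Return (file_count, total_seconds) over every ``*.wav`` in split_dir."""
    total = 0.0
    count = 0
    for wav_path in sorted(Path(split_dir).glob("*.wav")):
        with wave.open(str(wav_path), "rb") as wav:
            # duration = number of frames / frames per second
            total += wav.getnframes() / wav.getframerate()
        count += 1
    return count, total


def format_h_mm(seconds):
    """Format a duration in seconds as h:mm (assumed table notation)."""
    minutes = int(seconds // 60)
    return f"{minutes // 60}:{minutes % 60:02d}"
```

For example, ``count, total = split_duration_seconds("dev")`` followed by ``format_h_mm(total)`` gives a file count and a duration string that can be compared with the ``dev`` row.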