THCHS30

This is the data part of the THCHS30 2015 acoustic data & scripts dataset.

The dataset is described in more detail in the paper *THCHS-30: A Free Chinese Speech Corpus* by Dong Wang and Xuewei Zhang.

An earlier paper (if it can be called a paper) about the database, published in 2001:

Dong Wang, Dalei Wu, Xiaoyan Zhu, TCMSD: A new Chinese Continuous Speech Database, International Conference on Chinese Computing (ICCC'01), 2001, Singapore.

The layout of this data pack is the following:

``data``
  ``*.wav``
    audio data

  ``*.wav.trn``
    transcriptions

``{train,dev,test}``
  contain symlinks into the ``data`` directory for both audio and transcription files. Contents of these directories define the train/dev/test split of the data.

``{lm_word}``
  ``word.3gram.lm``
    trigram LM based on word

  ``lexicon.txt``
    lexicon based on word

``{lm_phone}``
  ``phone.3gram.lm``
    trigram LM based on phone

  ``lexicon.txt``
    lexicon based on phone

``README.TXT``
  this file
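
The layout above pairs every ``X.wav`` with a transcription named ``X.wav.trn``, and the split directories point back into ``data``. A minimal sketch of collecting those pairs for one split follows. The helper name ``list_split`` and the ``root`` argument are hypothetical and are not part of the data pack or of any script shipped with it. The sketch also assumes the split entries are real symlinks (or plain files) on the local filesystem.

```python
from pathlib import Path


def list_split(root, split):
    """Return sorted (wav, trn) path pairs for one split directory.

    Hypothetical helper: assumes each ``X.wav`` entry under ``root/split``
    has a sibling ``X.wav.trn``. Entries without a transcription are skipped.
    """
    pairs = []
    for wav in sorted(Path(root, split).glob("*.wav")):
        trn = wav.with_name(wav.name + ".trn")
        if trn.exists():
            # resolve() follows symlinks back into the data directory
            pairs.append((wav.resolve(), trn.resolve()))
    return pairs
```

Calling ``list_split("/path/to/pack", "train")`` would then return the audio and transcription paths that define the training split, where the path is a placeholder for wherever the pack was unpacked.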

Data statistics

Statistics for the data are as follows; audio duration is given in hours (h:mm):

===========  ==========  ==========  ===========
**dataset**  **audio**   **#sents**  **#words**
===========  ==========  ==========  ===========
    train        25        10,000      198,252
    dev         2:14         893        17,743
    test        6:15        2,495       49,085
===========  ==========  ==========  ===========
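
As a quick sanity check on the table, the average sentence length per split can be derived from the ``#sents`` and ``#words`` columns. The sketch below copies those two columns from the table. The dictionary and function names are illustrative only.

```python
# (#sents, #words) per split, copied from the statistics table
STATS = {
    "train": (10_000, 198_252),
    "dev": (893, 17_743),
    "test": (2_495, 49_085),
}


def words_per_sentence(stats):
    """Return the average number of words per sentence for each split."""
    return {split: words / sents for split, (sents, words) in stats.items()}


if __name__ == "__main__":
    for split, avg in words_per_sentence(STATS).items():
        print(f"{split}: {avg:.1f} words per sentence")
```

All three splits come out at a little under 20 words per sentence, so sentence length is consistent across the train/dev/test split.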