THCHS30

This is the data part of the THCHS30 2015 acoustic data & scripts dataset.

The dataset is described in more detail in the paper "THCHS-30: A Free Chinese Speech Corpus" by Dong Wang and Xuewei Zhang.

An earlier paper (if it can be called a paper) about the database appeared in 2001:

Dong Wang, Dalei Wu, Xiaoyan Zhu, TCMSD: A new Chinese Continuous Speech Database, International Conference on Chinese Computing (ICCC'01), 2001, Singapore.

The layout of this data pack is the following:

  ``data``
      ``*.wav``
          audio data

      ``*.wav.trn``
          transcriptions

  ``{train,dev,test}``
      contain symlinks into the ``data`` directory for both audio and transcription files. Contents of these directories define the train/dev/test split of the data.

  ``{lm_word}``
      ``word.3gram.lm``
          trigram LM based on word

      ``lexicon.txt``
          lexicon based on word

  ``{lm_phone}``
      ``phone.3gram.lm``
          trigram LM based on phone

      ``lexicon.txt``
          lexicon based on phone

  ``README.TXT``
      this file
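
The layout above can be walked with a few lines of Python. This is only a minimal sketch, not part of the data pack: the function name ``list_split`` and the ``root`` argument are hypothetical, and it assumes that every ``*.wav`` entry in a split directory has a sibling ``*.wav.trn`` entry next to it, as the layout suggests.

```python
from pathlib import Path


def list_split(root, split):
    """Return (wav_path, trn_path) pairs for one split directory.

    Assumption: <root>/<split>/ holds *.wav entries plus a sibling
    <name>.wav.trn for each one, and either may be a symlink into
    <root>/data, as described in the layout.
    """
    split_dir = Path(root) / split
    pairs = []
    # "*.wav" does not match "*.wav.trn", so only audio entries are listed.
    for wav in sorted(split_dir.glob("*.wav")):
        trn = wav.with_name(wav.name + ".trn")
        # exists() follows symlinks, so broken links are skipped here.
        if trn.exists():
            # resolve() follows symlinks to the real files.
            pairs.append((wav.resolve(), trn.resolve()))
    return pairs
```

For example, ``list_split("/path/to/data_thchs30", "train")`` would return the resolved audio and transcription paths of the training split, where the root path is a placeholder for wherever the pack was unpacked.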

Data statistics

Statistics for the data are as follows (audio duration is given in hours, or hours:minutes):

===========  ==========  ==========  ===========
**dataset**  **audio**   **#sents**  **#words**
===========  ==========  ==========  ===========
    train        25        10,000      198,252
    dev         2:14         893        17,743
    test        6:15        2,495       49,085
===========  ==========  ==========  ===========
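
The audio and sentence columns can be checked against a local copy with the Python standard library. The sketch below is an illustration under assumptions, not the script used to produce the table: the helper names are hypothetical, the audio files are assumed to be plain PCM WAV files readable by the ``wave`` module, and the audio column is assumed to be in hours:minutes.

```python
import wave
from pathlib import Path


def wav_seconds(path):
    """Duration of one PCM WAV file in seconds, read from its header."""
    with wave.open(str(path), "rb") as w:
        return w.getnframes() / w.getframerate()


def format_h_mm(seconds):
    """Format a duration as hours:minutes, rounded to the nearest minute."""
    minutes = round(seconds / 60)
    return f"{minutes // 60}:{minutes % 60:02d}"


def split_stats(split_dir):
    """Return (number of *.wav files, total duration as h:mm) for a directory."""
    wavs = sorted(Path(split_dir).glob("*.wav"))
    total = sum(wav_seconds(p) for p in wavs)
    return len(wavs), format_h_mm(total)
```

Calling ``split_stats`` on the ``dev`` directory of an unpacked copy gives a file count and a duration that can be compared with the corresponding row of the table.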