# [MagicData](http://www.openslr.org/68/)

The MAGICDATA Mandarin Chinese Read Speech Corpus was developed by MAGIC DATA Technology Co., Ltd. and is freely published for non-commercial use.
The contents of the corpus and their descriptions are as follows:

* The corpus contains 755 hours of speech data, most of it recorded on mobile devices.
* 1080 speakers from different accent areas in China took part in the recording.
* The sentence transcription accuracy is higher than 98%.
* Recordings were made in a quiet indoor environment.
* The database is divided into training, validation, and test sets in a ratio of 51:1:2.
* Detailed information, such as speech data coding and speaker information, is preserved in the metadata file.
* The recording texts cover diverse domains, including interactive Q&A, music search, SNS messages, and home command and control.
* Segmented transcripts are also provided.
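
As a rough illustration of the 51:1:2 split, the sketch below estimates how many hours each subset would hold. It assumes the ratio applies to total audio duration, which the description above does not state explicitly. The split labels `train`, `dev`, and `test` and the helper `split_hours` are hypothetical names chosen for this example, so treat the figures as approximations rather than official statistics.

```python
# Assumption: the 51:1:2 ratio is applied to total audio duration (hours).
# The split labels and this helper are illustrative, not part of the corpus tooling.
TOTAL_HOURS = 755
RATIO = {"train": 51, "dev": 1, "test": 2}


def split_hours(total_hours, ratio):
    """Distribute total_hours across splits in proportion to ratio."""
    parts = sum(ratio.values())
    return {name: total_hours * share / parts for name, share in ratio.items()}


if __name__ == "__main__":
    for name, hours in split_hours(TOTAL_HOURS, RATIO).items():
        print(f"{name}: ~{hours:.1f} h")
```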

The corpus aims to support researchers in speech recognition, machine translation, speaker recognition, and other speech-related fields, and is therefore free for academic use.