
MagicData

The MAGICDATA Mandarin Chinese Read Speech Corpus was developed by MAGIC DATA Technology Co., Ltd. and is published free of charge for non-commercial use. The corpus and its contents are described as follows:

  • The corpus contains 755 hours of speech data, most of which was recorded on mobile devices.
  • 1080 speakers from different accent areas in China were invited to participate in the recording.
  • The sentence transcription accuracy is higher than 98%.
  • Recordings were made in a quiet indoor environment.
  • The database is divided into a training set, a validation set, and a testing set in a ratio of 51:1:2.
  • Detailed information such as speech data coding and speaker information is preserved in the metadata file.
  • The domain of recording texts is diversified, including interactive Q&A, music search, SNS messages, home command and control, etc.
  • Segmented transcripts are also provided.
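
As a rough illustration of what the 51:1:2 split means in practice, the sketch below converts the ratio into approximate hours per subset. It assumes the ratio applies to audio duration, which the description above does not state (it could equally refer to utterances or speakers), so the figures are estimates only. The names `TOTAL_HOURS`, `RATIO`, and `split_hours` are illustrative and not part of the corpus or of PaddleSpeech.

```python
from fractions import Fraction

# Figures taken from the corpus description above.
TOTAL_HOURS = 755
# Assumption: the 51:1:2 ratio is measured in hours of audio.
RATIO = {"train": 51, "dev": 1, "test": 2}


def split_hours(total, ratio):
    """Distribute `total` across the subsets in proportion to `ratio`."""
    parts = sum(ratio.values())
    return {name: total * Fraction(weight, parts) for name, weight in ratio.items()}


hours = split_hours(TOTAL_HOURS, RATIO)
for name, value in hours.items():
    print(f"{name}: {float(value):.1f} h")
# train: 713.1 h
# dev: 14.0 h
# test: 28.0 h
```

Under this assumption, the validation and testing sets together account for only about 42 hours of the corpus.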

The corpus is intended to support researchers in speech recognition, machine translation, speaker recognition, and other speech-related fields, and is therefore free for academic use.