PaddleSpeech/paddlespeech/dataset/aidatatang_200zh

Aidatatang_200zh

Aidatatang_200zh is a free Mandarin Chinese speech corpus provided by Beijing DataTang Technology Co., Ltd under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International Public License. The corpus has the following characteristics:

  • The corpus contains 200 hours of acoustic data, most of it recorded on mobile devices.
  • 600 speakers from different accent areas in China were invited to participate in the recording.
  • The transcription accuracy for each sentence is higher than 98%.
  • Recordings were made in a quiet indoor environment.
  • The database is divided into a training set, a validation set, and a test set in a ratio of 7:1:2.
  • Detailed information such as speech data coding and speaker information is preserved in the metadata file.
  • Segmented transcripts are also provided.
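
As a rough illustration of what the 7:1:2 ratio implies, the sketch below divides the stated totals (200 hours, 600 speakers) proportionally. The README does not say whether the ratio applies to audio duration, speaker count, or utterance count, so both results are assumptions, not official figures. `split_by_ratio` is a hypothetical helper and is not part of PaddleSpeech.

```python
def split_by_ratio(total, ratio=(7, 1, 2)):
    """Divide `total` into parts proportional to `ratio`."""
    denom = sum(ratio)
    return [total * r / denom for r in ratio]


names = ("train", "dev", "test")

# Assumption: the ratio applies to the total audio duration (200 hours).
hours = dict(zip(names, split_by_ratio(200)))

# Assumption: the ratio applies to the speaker count (600 speakers).
speakers = dict(zip(names, split_by_ratio(600)))

for name in names:
    print(f"{name}: about {hours[name]:.0f} h or {speakers[name]:.0f} speakers")
```

The actual size of each subset should be taken from the metadata shipped with the corpus.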

The corpus is intended to support research in speech recognition, machine translation, voiceprint recognition, and other speech-related fields, and is therefore free for academic use.