Aidatatang_200zh

Aidatatang_200zh is a free Chinese Mandarin speech corpus provided by Beijing DataTang Technology Co., Ltd under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International Public License. The contents of the corpus are described below:

The corpus contains 200 hours of acoustic data, most of it recorded on mobile devices.
600 speakers from different accent areas in China took part in the recording.
The transcription accuracy for each sentence is higher than 98%.
Recordings were made in a quiet indoor environment.
The database is divided into training, validation, and test sets in a ratio of 7:1:2.
Detailed information such as speech data coding and speaker information is preserved in the metadata file.
Segmented transcripts are also provided.
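
To make the 7:1:2 ratio concrete, the sketch below shows how such a ratio translates into set sizes. It is only an illustration: the corpus ships with its own predefined split, and splitting at the speaker level, the `split_by_ratio` helper, and the `spkNNNN` identifiers are all assumptions made for this example, not part of the corpus or of PaddleSpeech.

```python
def split_by_ratio(items, ratio=(7, 1, 2)):
    """Split a list into consecutive train/dev/test parts following `ratio`.

    Hypothetical helper for illustration only; the real corpus already
    provides its own training, validation, and test sets.
    """
    total = sum(ratio)
    n = len(items)
    n_train = n * ratio[0] // total
    n_dev = n * ratio[1] // total
    train = items[:n_train]
    dev = items[n_train:n_train + n_dev]
    test = items[n_train + n_dev:]  # the remainder goes to the test set
    return train, dev, test


# Made-up speaker identifiers; the real IDs live in the corpus metadata.
speakers = [f"spk{i:04d}" for i in range(600)]
train, dev, test = split_by_ratio(speakers)
print(len(train), len(dev), len(test))  # 420 60 120
```

With 600 items, a 7:1:2 ratio yields 420, 60, and 120 items respectively.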

The corpus aims to support researchers in speech recognition, machine translation, voiceprint recognition, and other speech-related fields, and it is free for academic use.