Mandarin Chinese Desktop Speech Recognition Corpus - SMS (120 people)

Full Official Name: Mandarin Chinese Desktop Speech Recognition Corpus - SMS (120 people)
Submission date: Jan. 24, 2014, 4:30 p.m.

This corpus comprises 7,142 entries uttered by 120 speakers (59 males and 61 females) with different dialects, ages and educational levels, recorded through a head-mounted noise-canceling microphone. The database comprises 16,499 short messages (SMS). Speech samples are stored as 16-bit, 22.05 kHz WAV files, totalling 21.7 hours of speech. The total size of the data is 3.2 GB. Each speaker read 120-150 items. Text files are stored in Unicode format. All data have been proofread manually. The transcriptions include non-speech markers (background noise, background speech, speaker sounds) as well as markers for mispronunciations, channel distortions, omitted words and duplicated words. The corpus is intended for use in testing and in telephone natural speech recognition systems.
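The stated data size can be sanity-checked from the audio format and the total duration. The sketch below is only an illustration: it assumes single-channel (mono) uncompressed PCM and ignores WAV headers and the transcription files, none of which the description specifies. The constant and function names are hypothetical.

```python
# Figures taken from the corpus description.
SAMPLE_RATE_HZ = 22_050   # 22.05 kHz
BYTES_PER_SAMPLE = 2      # 16-bit samples
DURATION_HOURS = 21.7

# Assumption (not stated in the description): mono recordings.
CHANNELS = 1


def pcm_size_bytes(hours, sample_rate_hz=SAMPLE_RATE_HZ,
                   bytes_per_sample=BYTES_PER_SAMPLE, channels=CHANNELS):
    """Return the raw PCM payload size in bytes for the given duration."""
    seconds = hours * 3600
    return seconds * sample_rate_hz * bytes_per_sample * channels


size_bytes = pcm_size_bytes(DURATION_HOURS)
size_gib = size_bytes / 2**30  # binary gigabytes (GiB)

print(f"{size_bytes / 1e9:.2f} GB (decimal), {size_gib:.2f} GiB (binary)")
```

Under the mono assumption, the result is close to the 3.2 figure given above when expressed in binary gigabytes, which suggests the stated size refers to the audio payload.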

Creator(s)
Distributor(s)
Right Holder(s)