Resource: GlobalPhone Vietnamese
|Date of Submission||Jan. 24, 2014, 4:29 p.m.|
|Resource Type||Primary Text|
The GlobalPhone corpus developed in collaboration with the Karlsruhe Institute of Technology (KIT) was designed to provide read speech data for the development and evaluation of large continuous speech recognition systems in the most widespread languages of the world, and to provide a uniform, multilingual speech and text database for language independent and language adaptive speech recognition as well as for language identification tasks.
The entire GlobalPhone corpus enables the acquisition of acoustic-phonetic knowledge of the following 20 spoken languages: Arabic (ELRA-S0192), Bulgarian (ELRA-S0319), Chinese-Mandarin (ELRA-S0193), Chinese-Shanghai (ELRA-S0194), Croatian (ELRA-S0195), Czech (ELRA-S0196), French (ELRA-S0197), German (ELRA-S0198), Hausa (ELRA-S0347), Japanese (ELRA-S0199), Korean (ELRA-S0200), Polish (ELRA-S0320), Portuguese (Brazilian) (ELRA-S0201), Russian (ELRA-S0202), Spanish (Latin America) (ELRA-S0203), Swedish (ELRA-S0204), Tamil (ELRA-S0205), Thai (ELRA-S0321), Turkish (ELRA-S0206), Vietnamese (ELRA-S0322).
In each language about 100 sentences were read from each of the 100 speakers. The read texts were selected from national newspapers available via Internet to provide a large vocabulary (up to 65,000 words). The read articles cover national and international political news as well as economic news. The speech is available in 16bit, 16kHz mono quality, recorded with a close-speaking microphone (Sennheiser 440-6) and same recording equipment for all languages. The transcriptions are internally validated and supplemented by special markers for spontaneous effects like stuttering, false starts, and non-verbal effects like laughing and hesitations. Speaker information like age, gender, occupation, etc. as well as information about the recording setup complement the database. The entire GlobalPhone corpus contains over 450 hours of speech spoken by more than 1900 native adult speakers.
Data is shortened by means of the shorten program written by Tony Robinson, available from Softsound's web page: http://www.softsound.com/ linux distributions, or simulated versions such as cygwin. Alternatively, the data could be delivered unshorten.
The Vietnamese part of GlobalPhone was collected in summer 2009. In total 160 speakers were recorded, 140 of them in the cities of Hanoi and Ho Chi Minh City in Vietnam, and an additional set of 20 speakers were recorded in Karlsruhe, Germany. All speakers are Vietnamese native speakers, covering the main dialectal variants from South and North Vietnam. Of these 160 speakers, 70 were female and 90 were male. The majority of speakers are well educated, being graduated students and engineers. The age distribution of the speakers ranges from 18 to 65 years. Each speaker read between 50 and 200 utterances from newspaper articles, corresponding to roughly 9.5 minutes of speech or 138 utterances per person, in total we recorded 22.112 utterances. The speech was recorded using a close-talking microphone Sennheiser HM420 in a push-to-talk scenario using an inhouse developed modern laptop-based data collection toolkit. All data were recorded at 16kHz and 16bit resolution in PCM format. The data collection took place in small-sized rooms with very low background noise. Information on recording place and environmental noise conditions are provided in a separate speaker session file for each speaker. The speech data was recorded in two phases. In a first phase data was collected from 140 speakers in the cities of Hanoi and Ho Chi Minh. In the second phase we selected utterances from the text corpus in order to cover rare Vietnamese phonemes. This second recording phase was carried out with 20 Vietnamese graduate students who live in Karlsruhe. In sum, 22.112 utterances were spoken, corresponding to 25.25 hours of speech. The text data used for recording mainly came from the news posted in online editions of 15 Vietnamese newspaper websites as listed below, where the first 12 were used for the training set, while the last three were used for the development and evaluation set. The text data collected from the first 12 websites cover almost 4 Million word tokens with a vocabulary of 30.000 words resulting in an Out-of-Vocabulary rate of 0% on the development set and 0.067% on the evaluation set. For the text selection we followed the standard GlobalPhone protocols and focused on national and international politics and economics news (see [SCHULTZ 2002]). The transcriptions are provided in Vietnamese-style Roman script, i.e. using several diacritics encoded in UTF-8. The Vietnamese data are organized in a training set of 140 speakers with 22.15 hours of speech, a development set of 10 speakers, 6 from North and 4 from South Vietnam with 1:40 hours of speech and an evaluation set of 10 speakers with same gender and dialect distribution as the development set with 1:30 hours of speech. More details on corpus statistics, collection scenario, and system building based on the Vietnamese part of GlobalPhone can be found under [Vu and Schultz, 2009, 2010].
Vietnamese Newspaper sources:
[Schultz 2002] Tanja Schultz (2002): GlobalPhone: A Multilingual Speech and Text Database developed at Karlsruhe University, Proceedings of the International Conference of Spoken Language Processing, ICSLP 2002, Denver, CO, September 2002.