Resource: GlobalPhone Hausa
|Date of Submission||Jan. 24, 2014, 4:29 p.m.|
|Resource Type||Primary Text|
The GlobalPhone corpus, developed in collaboration with the Karlsruhe Institute of Technology (KIT), was designed to provide read speech data for the development and evaluation of large vocabulary continuous speech recognition systems in the most widespread languages of the world. It also provides a uniform, multilingual speech and text database for language-independent and language-adaptive speech recognition as well as for language identification tasks.
The entire GlobalPhone corpus enables the acquisition of acoustic-phonetic knowledge of the following 20 spoken languages: Arabic (ELRA-S0192), Bulgarian (ELRA-S0319), Chinese-Mandarin (ELRA-S0193), Chinese-Shanghai (ELRA-S0194), Croatian (ELRA-S0195), Czech (ELRA-S0196), French (ELRA-S0197), German (ELRA-S0198), Hausa (ELRA-S0347), Japanese (ELRA-S0199), Korean (ELRA-S0200), Polish (ELRA-S0320), Portuguese (Brazilian) (ELRA-S0201), Russian (ELRA-S0202), Spanish (Latin America) (ELRA-S0203), Swedish (ELRA-S0204), Tamil (ELRA-S0205), Thai (ELRA-S0321), Turkish (ELRA-S0206), Vietnamese (ELRA-S0322).
In each language, about 100 native speakers were asked to read about 100 sentences each. The read texts were selected from national newspapers available via the Internet to provide a large vocabulary (up to 65,000 words). The read articles cover national and international political news as well as economic news. The speech is available in 16-bit, 16 kHz mono quality, recorded with a close-speaking microphone (Sennheiser 440-6) and the same recording equipment for all languages. The transcriptions are internally validated and supplemented by special markers for spontaneous effects such as stuttering and false starts, and for non-verbal effects such as laughing and hesitations. Speaker information such as age, gender, and occupation, as well as information about the recording setup, complements the database. The entire GlobalPhone corpus contains over 450 hours of speech spoken by more than 1900 native adult speakers.
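The stated audio format allows a rough storage estimate. The following is a minimal sketch, assuming uncompressed linear PCM with no file headers; the function name `pcm_bytes` is hypothetical, and the size of the distributed files may differ if they are compressed or stored in a container format.

```python
# Rough storage estimate for 16-bit, 16 kHz mono audio.
# Assumption: raw, uncompressed PCM without headers (not stated in the description).

SAMPLE_RATE_HZ = 16_000   # 16 kHz, as stated for the corpus
BITS_PER_SAMPLE = 16      # 16-bit samples
CHANNELS = 1              # mono


def pcm_bytes(duration_seconds: int) -> int:
    """Return the number of bytes of raw PCM needed for the given duration."""
    bytes_per_second = SAMPLE_RATE_HZ * (BITS_PER_SAMPLE // 8) * CHANNELS
    return duration_seconds * bytes_per_second


bytes_per_hour = pcm_bytes(3600)
corpus_bytes = pcm_bytes(450 * 3600)  # "over 450 hours" taken as a lower bound

print(f"Per hour: {bytes_per_hour / 1e6:.1f} MB")
print(f"450 hours: {corpus_bytes / 1e9:.2f} GB")
```

Under these assumptions, one hour of speech takes about 115 MB, so the full corpus would need at least roughly 52 GB as raw PCM.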
Hausa is a member of the Chadic language family, which together with the Semitic and Cushitic languages belongs to the Afroasiatic language family. With over 25 million speakers, it is widely spoken in West Africa. The collection of the Hausa speech and text corpus followed the GlobalPhone collection standards. First, a large text corpus was built by crawling websites that cover the main Hausa newspaper sources. The crawled websites are listed below:
After cleaning and normalization, these texts were used to build language models and to select prompts for the speech data recordings. All texts are encoded in Boko, which is a Latin-based alphabet that was imposed by the British colonial administration in the 1930s as Hausa’s modern official orthography. Boko consists of 22 characters of the English alphabet plus five special characters.
For the Hausa GlobalPhone collection, native speakers of Hausa were asked to read the prompted sentences. The entire collection took place in five different locations in Cameroon. In total, the corpus contains 7,895 utterances spoken by 33 male and 69 female speakers aged 16 to 60 years. The speech data contains a variety of accents: Maroua, Douala, Yaoundé, Bafoussam, Ngaoundéré, and Nigeria. The accents are documented in the speaker information files. The speech data was recorded under varying environmental conditions, and some parts are slightly noisy. The recording time, number of utterances, and spoken word tokens, together with the division of the Hausa GlobalPhone database into training, development, and evaluation sets, are listed in the table below.