Resource: GlobalPhone Bulgarian Pronunciation Dictionary 260k entries (extended version)
|Reference||GlobalPhone Bulgarian Pronunciation Dictionary 260k entries (extended version)|
|Date of Submission||April 6, 2017, 5:50 p.m.|
This extended version of the Bulgarian Pronunciation Dictionary called Bulgarian-Dict260k contains pronunciations of more than 260,000 word forms. The dictionary matches in phone set and format the original GlobalPhone Bulgarian Pronunciation Dictionary (see ELRA-S0351) of 20,000 word forms. Bulgarian-Dict260k was built based on the extension of the Bulgarian GlobalPhone text database to improve language modeling and to reduce the high Out-Of-Vocabulary rate resulting from the rich morphology of the Bulgarian language. For this purpose, roughly 9 Million word tokens were collected from the internet sources of national, international, and economic news available from the online newspapers "Banker" (http://www.banker.bg/), "Kesh" (http://www.cash.bg), and �Sega" (http://www.segabg.com/). After text cleaning and normalization, all word forms were extracted. Pronunciations were created in an automatic process using hand-crafted grapheme-to-phoneme rules. The generated pronunciations were manually cross-checked by native speakers, correcting potential errors of the automatic generation.