NEMLAR Written Corpus

Full Official Name: NEMLAR Written Corpus
Submission date: Jan. 24, 2014, 4:30 p.m.

This corpus was produced within the NEMLAR project (http://www.nemlar.org). Two other resources produced within the same project are also available: the NEMLAR Broadcast News Speech Corpus (ELRA-S0219) and the NEMLAR Speech Synthesis Corpus (ELRA-S0220).

The NEMLAR Written Corpus consists of about 500,000 words of Arabic text from 13 different categories. It aims to be a well-balanced corpus that represents the variety of syntactic, semantic and pragmatic features of the modern Arabic language.

The categories are:
• Political news: 48,000 words
• Political debate: 30,000 words
• Islamic text (Preaching and others): 29,000 words
• Phrases of common words: 8,500 words
• Text from broadcast news: 5,500 words
• Business: 20,000 words
• Arabic literature: 30,000 words
• General news: 100,000 words
• Interviews: 56,000 words
• Scientific press: 50,000 words
• Sports press: 50,000 words
• Dictionary entries explanation: 52,000 words
• Legal domain text: 21,000 words

The data span the period from the late 1990s to 2005.

The corpus is provided in 4 different versions:
• Raw text
• Fully vowelized text
• Text with Arabic lexical analysis
• Text with Arabic POS-tags

Diacritics, lexical analysis and POS-tags were generated by RDI’s tool Fassieh©. The accuracy of the automatic analysis is around 95%. To reach the accuracy rate of about 99% defined for this corpus, linguists used the visual revision mode of Fassieh©, in which the linguist either approves the most likely analysis (most of the time) or selects another one manually (in the 4% minority of cases).

The database is distributed on 1 ISO 9660 CD-ROM volume. It has been validated by an external partner and a validation report is provided.
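
As a quick sanity check on the category breakdown given above, the listed sizes can be tallied and converted to shares of the whole corpus. The following is a minimal sketch only: the word counts are copied from this description, while the names `CATEGORY_WORDS` and `category_shares` are illustrative and are not part of the corpus distribution or of any NEMLAR tool.

```python
# Category sizes in words, copied from the description above.
# The dictionary and function names are hypothetical, chosen for illustration.
CATEGORY_WORDS = {
    "Political news": 48_000,
    "Political debate": 30_000,
    "Islamic text (Preaching and others)": 29_000,
    "Phrases of common words": 8_500,
    "Text from broadcast news": 5_500,
    "Business": 20_000,
    "Arabic literature": 30_000,
    "General news": 100_000,
    "Interviews": 56_000,
    "Scientific press": 50_000,
    "Sports press": 50_000,
    "Dictionary entries explanation": 52_000,
    "Legal domain text": 21_000,
}


def category_shares(sizes):
    """Return each category's share of the total word count, in percent."""
    total = sum(sizes.values())
    return {name: 100.0 * count / total for name, count in sizes.items()}


total_words = sum(CATEGORY_WORDS.values())
shares = category_shares(CATEGORY_WORDS)

# Print the categories from largest to smallest with their share of the corpus.
for name, count in sorted(CATEGORY_WORDS.items(), key=lambda kv: -kv[1]):
    print(f"{name:<40} {count:>8,} words  {shares[name]:5.1f}%")
print(f"{'Total':<40} {total_words:>8,} words")
```

The listed figures are rounded category sizes, so the total computed here is nominal and actual token counts in the distributed files may differ.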
