"Le Monde Diplomatique" Arabic tagged corpus

Full Official Name: "Le Monde Diplomatique" Arabic tagged corpus
Submission date: Jan. 24, 2014, 4:30 p.m.

This corpus contains 102,960 vowelised, lemmatised and tagged words from 58 texts of Le Monde Diplomatique Arabic (see also ELRA-W0036-04). Each text comes with three files:
- the raw text in Arabic,
- the vowelised text in Arabic,
- one XML file containing the morphological annotation of the text.

Each word of a text is associated with several pieces of information, such as its size, its rank in the text and the number of the paragraph where it occurs. Each word corresponds to one node in the XML file. The node records the following positional features of the word in the text:
- paragraph number in the text, i.e. the paragraph where the word occurs,
- sentence number in the paragraph,
- sentence number in the text,
- rank of the word in the text,
- rank of the first character of the word in the text,
- word size.

The annotation of the word is added as sub-nodes:
- the word as it appears in the non-vowelised text,
- the vowelised word,
- the word lemma,
- the grammatical category of the word.
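
As an illustration of the node layout described above, here is a minimal Python sketch that reads one annotation file into records. The actual XML schema is not given in this description, so every element and attribute name below (`word`, `paragraph`, `rank`, `char_offset`, `unvowelised`, `category`, and so on), the sample data, and the offset convention are hypothetical placeholders to be replaced with the names found in the distributed files.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass

# Hypothetical sample: tag names, attribute names and values are invented
# for illustration and do not reflect the corpus's real schema.
SAMPLE = """<text>
  <word paragraph="1" sentence_in_paragraph="1" sentence_in_text="1"
        rank="1" char_offset="0" size="4">
    <unvowelised>كتاب</unvowelised>
    <vowelised>كِتَابٌ</vowelised>
    <lemma>كِتَاب</lemma>
    <category>noun</category>
  </word>
  <word paragraph="1" sentence_in_paragraph="1" sentence_in_text="1"
        rank="2" char_offset="5" size="4">
    <unvowelised>جديد</unvowelised>
    <vowelised>جَدِيدٌ</vowelised>
    <lemma>جَدِيد</lemma>
    <category>adjective</category>
  </word>
</text>"""


@dataclass
class Word:
    # Positional features (one per item in the description)
    paragraph: int
    sentence_in_paragraph: int
    sentence_in_text: int
    rank: int
    char_offset: int  # rank of the first character; 0-based is an assumption
    size: int
    # Annotation sub-nodes
    unvowelised: str
    vowelised: str
    lemma: str
    category: str


def parse_words(xml_string: str) -> list[Word]:
    """Turn each (hypothetical) <word> node into a Word record."""
    root = ET.fromstring(xml_string)
    words = []
    for node in root.iter("word"):
        words.append(
            Word(
                paragraph=int(node.get("paragraph")),
                sentence_in_paragraph=int(node.get("sentence_in_paragraph")),
                sentence_in_text=int(node.get("sentence_in_text")),
                rank=int(node.get("rank")),
                char_offset=int(node.get("char_offset")),
                size=int(node.get("size")),
                unvowelised=node.findtext("unvowelised", default=""),
                vowelised=node.findtext("vowelised", default=""),
                lemma=node.findtext("lemma", default=""),
                category=node.findtext("category", default=""),
            )
        )
    return words


if __name__ == "__main__":
    for w in parse_words(SAMPLE):
        print(w.rank, w.unvowelised, w.lemma, w.category)
```

For a file on disk, `ET.parse(path).getroot()` could replace `ET.fromstring`. Whether the positional features are stored as attributes or as child elements in the real files is not stated here, so the accessors may need adjusting.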

Creator(s)
Distributor(s)
Right Holder(s)