Resource: "Le Monde Diplomatique" Arabic tagged corpus

Reference "Le Monde Diplomatique" Arabic tagged corpus
Date of Submission Jan. 24, 2014, 4:30 p.m.
Status accepted
ISLRN 124-139-628-259-2
Resource Type Primary Text
Media Type Text
Source
Language Arabic
Description

This corpus contains 102,960 vowelised, lemmatised and tagged words (58 texts from Le Monde Diplomatique Arabic, see also ELRA-W0036-04).

Each text comes with three files:
- raw text in Arabic,
- vowelised text in Arabic,
- one XML file containing the morphological annotation of the text.

Each word in a text carries several pieces of information, such as its size, its rank in the text, and the number of the paragraph in which it occurs. Each word corresponds to one node in the XML file. Each node records the following positional features of the word in the text:
- Paragraph number in the text, i.e. paragraph where the word can be found,
- Sentence number in the paragraph,
- Sentence number in the text,
- Rank of the word in the text,
- Rank of the first character of the word in the text,
- Word size.

Annotation information for each word is added as "sub-nodes":
- Word as it appears in the non-vowelised text,
- Vowelised word,
- Word lemma,
- Grammatical category of the word.
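
The node layout described above can be read with a standard XML parser. The following is a minimal sketch only: the element names (`text`, `word`, `unvowelised`, `vowelised`, `lemma`, `category`), the attribute names, and the sample word are hypothetical placeholders chosen for illustration, since this record does not give the actual schema of the annotation files. They would need to be replaced with the names used in the distributed XML.

```python
import xml.etree.ElementTree as ET
from dataclasses import dataclass
from typing import List

# Hypothetical sample: tag and attribute names are assumptions made for
# illustration, not the corpus's documented schema.
SAMPLE_XML = """<text>
  <word paragraph="1" sentence_in_paragraph="1" sentence_in_text="1"
        rank="1" char_offset="0" size="3">
    <unvowelised>كتب</unvowelised>
    <vowelised>كَتَبَ</vowelised>
    <lemma>كَتَبَ</lemma>
    <category>verb</category>
  </word>
</text>"""

# The six positional features, assumed here to be stored as attributes.
POSITIONAL = ("paragraph", "sentence_in_paragraph", "sentence_in_text",
              "rank", "char_offset", "size")
# The four annotation features, assumed here to be stored as sub-nodes.
ANNOTATION = ("unvowelised", "vowelised", "lemma", "category")


@dataclass
class Word:
    paragraph: int
    sentence_in_paragraph: int
    sentence_in_text: int
    rank: int
    char_offset: int
    size: int
    unvowelised: str
    vowelised: str
    lemma: str
    category: str


def parse_words(xml_text: str) -> List[Word]:
    """Turn every (hypothetical) <word> node into a Word record."""
    root = ET.fromstring(xml_text)
    words = []
    for node in root.iter("word"):
        positional = {name: int(node.attrib[name]) for name in POSITIONAL}
        annotation = {name: node.findtext(name, default="") for name in ANNOTATION}
        words.append(Word(**positional, **annotation))
    return words


words = parse_words(SAMPLE_XML)
for w in words:
    print(w.rank, w.unvowelised, w.vowelised, w.lemma, w.category)
```

Keeping the positional features and the annotation features in two separate name lists mirrors the split made in the description, so adapting the sketch to the real files should only require editing those two tuples and the node name.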

Version 1.0
Distributor ELRA