Resource: Al-Hayat Arabic Corpus

Reference Al-Hayat Arabic Corpus
Date of Submission Jan. 24, 2014, 4:17 p.m.
Status accepted
ISLRN 365-777-769-398-7
Resource Type Primary Text
Media Type Text
Language Arabic

The corpus was developed in the course of a research project at the University of Essex, in collaboration with the Open University.
The corpus contains Al-Hayat newspaper articles, with value added for the development of Language Engineering and Information Retrieval applications.
The data have been distributed into 7 subject-specific databases, following the Al-Hayat subject tags: General, Car, Computer, News, Economics, Science, and Sport.
Mark-up, numbers, special characters and punctuation have been removed. The total file size is 268 MB. The dataset contains 18,639,264 distinct tokens in 42,591 articles across the 7 domains.

Version 1.0
Distributor ELRA