OSIAN Corpus

Full Official Name: The Open Source International Arabic News (OSIAN) corpus
Submission date: Jan. 8, 2018, 4:48 p.m.

The Open Source International Arabic News (OSIAN) corpus was collected from international Arabic news websites such as CNN, DW, RT, and Aljazeera. Using a server-friendly crawling policy, we extracted 1 million web pages. After the necessary cleaning and filtering steps, the OSIAN corpus contains 477,556 articles comprising 2,861,944 sentences and roughly 157 million words. The corpus is encoded in XML. Each article is annotated with metadata giving its web location and the date of its extraction, and each word is annotated with its lemma and part of speech.
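
As a rough illustration of how an XML corpus with article-level metadata and word-level annotation might be read, the following Python sketch parses a small made-up sample. The element and attribute names (`corpus`, `article`, `sentence`, `w`, `url`, `extracted`, `lemma`, `pos`), the sample content, and the tag values are all assumptions for illustration; this record does not specify the actual OSIAN schema, so they would need to be adapted to the distributed files.

```python
import xml.etree.ElementTree as ET

# Made-up sample. Tag names, attribute names, and values are assumptions,
# not the actual OSIAN schema or data.
SAMPLE = """<corpus>
  <article url="https://example.org/news/1" extracted="2018-01-01">
    <sentence>
      <w lemma="كتب" pos="VERB">كتب</w>
      <w lemma="ولد" pos="NOUN">الولد</w>
    </sentence>
  </article>
</corpus>"""


def read_articles(xml_text):
    """Return one dict per article: its metadata plus (word, lemma, pos) tuples."""
    root = ET.fromstring(xml_text)
    articles = []
    for article in root.iter("article"):
        # Collect every annotated word in the article, in document order.
        tokens = [(w.text, w.get("lemma"), w.get("pos")) for w in article.iter("w")]
        articles.append(
            {
                "url": article.get("url"),              # web location (assumed attribute)
                "extracted": article.get("extracted"),  # extraction date (assumed attribute)
                "tokens": tokens,
            }
        )
    return articles


if __name__ == "__main__":
    for art in read_articles(SAMPLE):
        print(art["url"], art["extracted"], len(art["tokens"]))
```

For files the size of the full corpus, loading a whole document into memory may not be practical; a streaming parser such as `xml.etree.ElementTree.iterparse` would likely be a better fit.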

Creator(s)
Distributor(s)
Right Holder(s)