Resource: NUM 5M Mongolian written corpus

Reference NUM 5M Mongolian written corpus
Date of Submission July 12, 2017, 11:06 a.m.
Status accepted
ISLRN 492-817-146-504-9
Resource Type Primary Text
Media Type Text
Language Mongolian
Format/MIME Type Plain text
Access Medium Downloadable

This is a corpus of written Mongolian drawn mostly from online and printed daily newspapers, literature, and laws.

The collected raw texts were reduced from 5 million to 4.8 million words after cleaning. The cleaned corpus comprises:
- 144 texts from laws until 2009,
- 288 literary texts currently used in primary and secondary school textbooks in Mongolia (including stories, novels, novelettes),
- 1,134 editorials from the printed newspaper "Unen" dating from 1984 to 1989,
- 2,477 online newswire texts dating from 2003 to 2009.
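
As a quick check on the composition listed above, the following minimal sketch totals the text counts and prints each domain's share of the texts. The counts are copied from the list; the dictionary keys and the function name `domain_shares` are illustrative only. The shares are by number of texts, not by words, since per-domain word counts are not given here.

```python
# Text counts per domain, copied from the list above.
# The key names are illustrative labels, not official category names.
TEXT_COUNTS = {
    "laws": 144,
    "literature": 288,
    "Unen editorials": 1134,
    "online newswire": 2477,
}


def domain_shares(counts):
    """Return the total number of texts and each domain's fraction of it."""
    total = sum(counts.values())
    return total, {name: n / total for name, n in counts.items()}


total, shares = domain_shares(TEXT_COUNTS)
print(f"total texts: {total}")
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```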

Part of this corpus, about 2,800 sentences with 100,000 words, has been POS-tagged manually and stored in XML TEI format.
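
The exact TEI markup of the tagged portion is not described in this record. The sketch below shows one way such a file might be read, assuming sentences are encoded as `<s>` elements, words as `<w>` elements in the TEI namespace, and the tag is held in a `pos` attribute. All three are assumptions, and the sample string and its tags are invented for illustration, not taken from the corpus.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Standard TEI P5 namespace; whether the corpus files declare it is an assumption.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented sample, NOT taken from the corpus; element and attribute names are assumed.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <s><w pos="N">ном</w><w pos="V">уншина</w></s>
    <s><w pos="N">хүн</w></s>
  </body></text>
</TEI>"""


def count_tags(xml_text, attr="pos"):
    """Count sentences and POS tags in a TEI string under the assumed layout."""
    root = ET.fromstring(xml_text)
    sentences = root.findall(".//tei:s", TEI_NS)
    tags = Counter(
        w.get(attr) for s in sentences for w in s.findall("tei:w", TEI_NS)
    )
    return len(sentences), tags


n_sentences, tag_counts = count_tags(SAMPLE)
print(n_sentences, dict(tag_counts))
```

If the real files keep the tag in a different attribute (for example `type` or `ana`), passing that name as `attr` should be enough to adapt the sketch.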

Version 1.0
Distributor ELRA