ISLRN

NUM 5M Mongolian written corpus

Full Official Name: NUM 5M Mongolian written corpus

Submission date: July 12, 2017, 11:06 a.m.

This is a corpus of Mongolian text, mostly from domains such as online or printed daily newspapers, literature, and laws. The collected raw text was reduced from 5 million to 4.8 million words after cleaning. The cleaned corpus comprises:

- 144 texts from laws dating up to 2009,
- 288 texts from literature currently used in primary and secondary school textbooks in Mongolia (including stories, novels, novelettes),
- 1,134 editorials from the printed newspaper "Unen" dating from 1984 to 1989,
- 2,477 online newswire texts dating from 2003 to 2009.

Part of this corpus, about 2,800 sentences with 100,000 words, has been POS-tagged manually and stored in XML TEI format.

Creator(s)

Distributor(s)

ELRA

Right Holder(s)

Status: Accepted

ISLRN:

492-817-146-504-9

Version

1.0

Source

http://catalog.elra.info/product_info.php?products_id=1309

Resource Type

Primary Text

Media Type

Text

Language(s)

Mongolian

Access Medium

Downloadable