NUM 5M Mongolian written corpus

Full Official Name: NUM 5M Mongolian written corpus
Submission date: July 12, 2017, 11:06 a.m.

This is a corpus of Mongolian text drawn mostly from online and printed daily newspapers, literature, and laws. Cleaning reduced the collected raw text from 5 million to 4.8 million words. The cleaned corpus comprises:
- 144 law texts dating up to 2009,
- 288 literary texts currently used in primary and secondary school textbooks in Mongolia (including stories, novels, and novelettes),
- 1,134 editorials from the printed newspaper "Unen" dating from 1984 to 1989,
- 2,477 online newswire texts dating from 2003 to 2009.
Part of this corpus, about 2,800 sentences containing 100,000 words, has been POS-tagged manually and stored in XML TEI format.
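
As a rough illustration of how the manually POS-tagged TEI portion might be read, the sketch below counts sentences and tag frequencies with Python's standard library. The element names (`s`, `w`), the `pos` attribute, the tag labels, and the sample sentences are all assumptions made for illustration; the actual markup used in this corpus is not documented here and may differ.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Standard TEI namespace; whether the corpus files declare it is an assumption.
TEI_NS = "http://www.tei-c.org/ns/1.0"

# Hypothetical sample. The corpus's real element names, attribute names,
# and tagset may differ from what is shown here.
SAMPLE = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body>
    <s><w pos="N">ном</w><w pos="V">уншина</w></s>
    <s><w pos="N">хууль</w></s>
  </body></text>
</TEI>"""


def count_pos_tags(xml_text):
    """Return (sentence_count, Counter of POS tags) for a TEI-style document."""
    root = ET.fromstring(xml_text)
    # Find every sentence element at any depth.
    sentences = root.findall(f".//{{{TEI_NS}}}s")
    # Tally the assumed 'pos' attribute of each word directly under a sentence.
    tags = Counter(
        w.get("pos", "UNKNOWN")
        for s in sentences
        for w in s.findall(f"{{{TEI_NS}}}w")
    )
    return len(sentences), tags


sentence_count, tag_counts = count_pos_tags(SAMPLE)
print(sentence_count, dict(tag_counts))
```

To apply this to real files, the XPath expressions and the attribute name would need to be adjusted to match the corpus's actual TEI schema.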

Right Holder(s)