ISLRN

The EMILLE Lancaster Corpus

Full Official Name: The EMILLE Lancaster Corpus

Submission date: Jan. 24, 2014, 4:31 p.m.

The EMILLE Lancaster Corpus consists of three components: monolingual, parallel and annotated corpora. There are monolingual corpora for seven South Asian languages: Bengali, Gujarati, Hindi, Punjabi, Sinhala, Tamil and Urdu. The EMILLE monolingual corpora contain approximately 58,880,000 words (including 2,627,000 words of transcribed spoken data for Bengali, Gujarati, Hindi, Punjabi and Urdu). The parallel corpus consists of 200,000 words of text in English and its accompanying translations in Hindi, Bengali, Punjabi, Gujarati and Urdu. The annotated component includes the Urdu monolingual and parallel corpora automatically annotated for parts of speech, together with twenty written Hindi corpus files annotated to show the nature of demonstrative use. All other components are annotated at the sentence level. The corpus is marked up using CES-compliant SGML and encoded using Unicode.

References: Xiao, Z., McEnery, A., Baker, P. and Hardie, A. 2004. ‘Developing Asian language corpora: standards and practice’ in Sornlertlamvanich, V., Tokunaga, T. and Huang, C. (eds.) Proceedings of the Fourth Workshop on Asian Language Resources, pp. 1-8. March 25, Sanya.

This database is available only for commercial use. For research use by academic organisations, a more complete set of the EMILLE Lancaster Corpus is available under the reference ELRA-W0037 (The EMILLE/CIIL Corpus).

Creator(s)

Distributor(s)

ELRA

Right Holder(s)

Status: Accepted

ISLRN: 438-045-014-925-0

Version

1.0

Source

http://catalog.elra.info/product_info.php?products_id=714

Resource Type

Primary Text

Media Type

Text

Language(s)

Bengali

English

Gujarati

Hindi

Panjabi

Sinhala

Tamil

Urdu

Access Medium