Resource: Amharic-English bilingual corpus

Reference Amharic-English bilingual corpus
Date of Submission Jan. 24, 2014, 4:17 p.m.
Status accepted
ISLRN 590-255-335-719-0
Resource Type Primary Text
Media Type Text
Source
Language Amharic, English
Description

The Amharic-English bilingual corpus contains parallel text from the legal and news domains in Amharic script, in transliterated form, and in English. The corpus comprises 232,653 words in Amharic and 291,701 words in English.

This parallel corpus contains documents from two domains, legal and news, in English and Amharic. The two domains were processed separately. The Amharic documents were prepared in the language's own script, which differs from the Latin alphabet. For ease of use and processing, as well as for normalization, the Amharic documents were transliterated and the English documents were converted to lower case. In addition, clean documents were prepared with the two domains combined rather than kept separate.
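
The two normalization steps described above (lower-casing the English text and transliterating the Amharic text) can be illustrated with a minimal Python sketch. This record does not state which transliteration scheme or tooling was used for the corpus, so the mapping table `TOY_TRANSLIT` and the function names below are hypothetical and cover only three characters for demonstration.

```python
# Illustrative only: a three-character toy table, NOT the scheme used for the corpus.
# The actual transliteration convention is not specified in this record.
TOY_TRANSLIT = {
    "ሰ": "se",
    "ላ": "la",
    "ም": "m",
}


def normalize_english(line):
    """Convert an English line to lower case."""
    return line.lower()


def transliterate_amharic(line, table=None):
    """Replace each Ethiopic character found in the table with its Latin form.

    Characters missing from the table (spaces, punctuation, unmapped
    letters) are passed through unchanged.
    """
    if table is None:
        table = TOY_TRANSLIT
    return "".join(table.get(ch, ch) for ch in line)


if __name__ == "__main__":
    print(normalize_english("The Federal Court"))
    print(transliterate_amharic("ሰላም"))
```

A full mapping would need an entry for every character of the Ethiopic script that occurs in the documents; the pass-through behaviour makes any gaps in the table visible in the output.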

Amharic is a Semitic language spoken in Ethiopia.

Version 1.0
Distributor ELRA