Resource: EUROPARL Corpus Parallel Corpora: Portuguese-English
|Reference||EUROPARL Corpus Parallel Corpora: Portuguese-English|
|Date of Submission||Jan. 20, 2016, 11:58 a.m.|
|Resource Type||Primary Text|
The EUROPARL Corpus (Portuguese-English subpart of the parallel corpora) was extracted from the proceedings of the European Parliament. It contains transcriptions of sessions dating from 1996 to 2011, with a total of approximately 58,324,562 tokens of European Portuguese (L1) and 49,216,896 tokens of English (translation).
The EUROPARL Corpus is composed of one text file for the English corpus and two files for the Portuguese version: a text file and an annotated file. The text versions contain plain text with no annotation. The Portuguese annotated file is a four-column file with one token per line, each token followed by a PoS tag and a lemma. The corpus was automatically PoS-tagged with the MBT tagger (http://ilk.uvt.nl/mbt/) and lemmatized with MBLEM (http://ilk.uvt.nl/mbma/), following the annotation scheme of the Corpus of Reference of Contemporary Portuguese.
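The one-token-per-line layout of the annotated file can be read with a few lines of Python. The following is only a minimal sketch under stated assumptions: the description does not specify the column separator or the content of the fourth column, so the tab separator, the column order (token, PoS tag, lemma), the `extra` field, and the sample lines with their tags are all hypothetical and should be checked against the actual file.

```python
from typing import Iterable, Iterator, NamedTuple, Tuple


class AnnotatedToken(NamedTuple):
    token: str
    pos: str
    lemma: str
    extra: Tuple[str, ...]  # any remaining columns (content not documented)


def parse_annotated_lines(
    lines: Iterable[str], sep: str = "\t"
) -> Iterator[AnnotatedToken]:
    """Yield one AnnotatedToken per non-empty line.

    Assumes tab-separated columns in the order token, PoS tag, lemma;
    both the separator and the order are assumptions, not documented facts.
    """
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        fields = line.split(sep)
        if len(fields) < 3:
            continue  # skip lines that do not match the assumed layout
        token, pos, lemma, *extra = fields
        yield AnnotatedToken(token, pos, lemma, tuple(extra))


# Made-up sample lines; the tags and the fourth column are placeholders.
sample = ["o\tDA\to\t_\n", "Parlamento\tPNM\tparlamento\t_\n", "\n"]
parsed = list(parse_annotated_lines(sample))
```

To process the real file, an open file handle can be passed in place of `sample`, so the corpus is streamed line by line rather than loaded into memory at once.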