Resource: EUROPARL Corpus Parallel Corpora: Portuguese-English

Reference EUROPARL Corpus Parallel Corpora: Portuguese-English
Date of Submission Jan. 20, 2016, 11:58 a.m.
Status accepted
ISLRN 435-502-922-727-2
Resource Type Primary Text
Media Type Text
Source
Language English, Portuguese
Description

The EUROPARL Corpus (the Portuguese-English subpart of the parallel corpora) was extracted from the proceedings of the European Parliament. It contains transcriptions of sessions from 1996 to 2011, with a total of approximately 58,324,562 tokens of European Portuguese (L1) and 49,216,896 tokens of English (translation).

The EUROPARL Corpus is composed of one text file for the English corpus and two files for the Portuguese version: a text file and an annotated file. The text version contains plain text and no further annotation. The Portuguese annotated file is a four-column file with one token per line, followed by a PoS tag and a lemma. The corpus was automatically PoS-tagged with the MBT tagger (http://ilk.uvt.nl/mbt/) and lemmatized with MBLEM (http://ilk.uvt.nl/mbma/), following the annotation scheme of the Corpus of Reference of Contemporary Portuguese.
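
As an illustration of the annotated format described above, the following is a minimal sketch of how such a file might be read. It rests on several assumptions that this record does not confirm: that columns are separated by whitespace, that the first three columns are the token, the PoS tag and the lemma in that order, and that any remaining column can be kept uninterpreted. The names `AnnotatedToken` and `parse_annotated_lines`, as well as the sample lines and tags, are hypothetical.

```python
from typing import Iterable, Iterator, NamedTuple, Tuple


class AnnotatedToken(NamedTuple):
    token: str
    pos: str                # PoS tag (assumed to be the second column)
    lemma: str              # lemma (assumed to be the third column)
    extra: Tuple[str, ...]  # any further columns, kept uninterpreted


def parse_annotated_lines(lines: Iterable[str]) -> Iterator[AnnotatedToken]:
    """Yield one AnnotatedToken per non-empty line.

    Assumption: columns are whitespace-separated; lines with fewer
    than three fields (e.g. blank separator lines) are skipped.
    """
    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue
        token, pos, lemma, *rest = fields
        yield AnnotatedToken(token, pos, lemma, tuple(rest))


# Hypothetical sample lines; the tags are illustrative only and are
# not taken from the corpus or its annotation scheme.
sample = ["casas\tCN\tcasa\tX\n", "\n", "as\tDA\to\tY\n"]
rows = list(parse_annotated_lines(sample))
```

The same function could be applied to an open file handle instead of a list, since it only iterates over lines. The actual delimiter and column order should be checked against the distributed file before relying on this sketch.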

Version 1.0
Distributor ELRA