Resource: MULTEXT JOC Corpus

Reference MULTEXT JOC Corpus
Date of Submission Jan. 24, 2014, 4:30 p.m.
Status accepted
ISLRN 900-482-746-635-0
Resource Type Primary Text
Media Type Text
Source
Language English, French, German, Italian, Spanish (Castilian)
Size 5,000,000 words
Description

This CD-ROM contains part of the corpus developed in the MULTEXT project, which was financed by the European Commission (LRE 62-050). This part contains raw, tagged and aligned data from the Written Questions and Answers of the Official Journal of the European Community. The corpus contains approx. 5 million words in English, French, German, Italian and Spanish (approx. 1 million words per language). About 800,000 words were grammatically tagged and manually checked for English, French, Italian and Spanish, i.e. roughly 200,000 words per language. The same subset for French, German, Italian and Spanish was aligned to English at the sentence level.

The JOC corpus is delivered in a format conformant to the Corpus Encoding Standard (CES) at each level of treatment:

paragraph annotation level, conformant to the CESDOC specifications (1 million words × 5 languages);
morpho-syntactic annotation level (PoS tagging), conformant to the CESANA specifications (200,000 words × 4 languages);
parallel text alignment at sentence level, conformant to the CESALIGN specifications (200,000 words × 4 languages).
Additional information: http://www.lpl.univ-aix.fr/projects/multext

Version 1.0
Distributor ELRA