The MLCC text corpus has two main components - one set to allow comparable studies to be carried out in different languages and one set as the basis for translation studies.
The first set is referred as the Polylingual Document Collection, a collection of newspaper articles from financial newspapers in 6 languages (Dutch, English, French, German, Italian and Spanish). It consists of the following sub-corpora:
Dutch - Het Financieele Dagblad - 1992-1993 (Samples)
The corpus contains articles from the Dutch financial newspaper Het Financieele Dagblad editions of 2nd January 1992 through to 24th December 1993. It contains around 8.5 million words of text.
English - The Financial Times - 1993 (Samples)
The corpus contains articles from the British financial newspaper The Financial Times editions from the year 1993. The corpus contains around 30 million words.
French - Le Monde - 1992-1993 (Samples)
A corpus of articles from the French newspaper Le Monde, consisting of two years worth (1992-1993) of articles on financial subjects, approximately 10 million words.
German - Handelsblatt - 1986-1988 (Samples)
This subcorpus consists of articles from the period 02.01.1986 to 15.06.1988. It contains some 33 million words. It may be possible to obtain more recent articles from Handelsblatt.
Italian - Il Sole 24 Ore - 1992-1993 (Samples)
The corpus described here contains articles from the Italian financial newspaper Il Sole 24 Ore from the year 1992. This corpus contains some 1.88 million words. The SGML-markup was done by the University of Edinburgh.
Spanish - Expansion - 1994 (Samples)
This subcorpus contains articles from the Spanish financial newspaper Expansion editions from 21.10.1991 to 24.10.1991 and 14.05.1994 to 27.12.1994. It contains some 10 million words.
The second set is a Multilingual Parallel Corpus consisting of translated data in nine European languages: Danish, Dutch, English, French, German, Greek, Italian, Portuguese and Spanish. The parallel data, provided by the European Commission, comprises two sub-corpora from the Official Journal of the European Communities:
Official Journal of the European Commission, C Series: Written Questions 1993
Records of questions and answers regarding European Community matters. The data is regularly published as one section of the C Series of the Official Journal of the European Community in all official languages (previously nine). This corpus contains written questions asked by members of the European Parliament and corresponding answers from the European Commission in 9 parallel versions. The total size of the corpus is approximately 10.2 million words (ca. 1.1 million words per language).
Official Journal of the European Commission, Annex: Debates of the European Parliament 1992-1994
This parallel corpus is the records of Parliamentary sitting published as an annex to the Official Journal of the European Community Debates of the European Parliament. The Parliamentary Debates are a record of what was said by members of the meeting as well as written input provided to the meeting. The original data from which the translations are produced consist of a transcript of the sittings, each member speaking in the language of his choice. The final version consists of nine parallel versions of the material. The texts delivered comprise the Debates of Parliament from January 1992 to July 1994. This sub-corpus contains some 5 to 8 million words per language.