Resource: A "scientific" corpus of modern French ("La Recherche" magazine) - Raw data

Reference A "scientific" corpus of modern French ("La Recherche" magazine) - Raw data
Date of Submission Jan. 24, 2014, 4:17 p.m.
Status accepted
ISLRN 508-941-013-339-7
Resource Type Primary Text
Media Type Text
Source
Language French
Description

This "scientific" corpus of modern French was produced by the University of Nantes (France) with funding from ELRA, within the framework of the European Commission project LRsP&P (Language Resources Production & Packaging - LE4-8335).
The corpus contains all articles published in La Recherche magazine in 1998, from issue 305 (January) to issue 315 (December), amounting to 447,244 tokens and 30,238 types. It is intended for use in text analysis and related applications.
The texts, provided in XML (Extensible Markup Language) format, have been marked up in SGML (Standard Generalized Markup Language). The XML structure encoded only the constituent parts of each text (title, body, etc.), whereas the richer SGML markup goes down to the word level and includes the grammatical category and the canonical form of each word. The annotation conforms to the guidelines of the TEI (Text Encoding Initiative) international project.
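As an illustration of the token and type figures quoted above, the following minimal sketch counts tokens (running words) and types (distinct word forms) in a short French sentence. The tokenizer, the function name `token_type_counts`, and the sample sentence are assumptions made for this example; the counting method actually used for the corpus is not documented in this record and may differ.

```python
import re


def token_type_counts(text):
    """Return (number of tokens, number of types) for a text.

    Hypothetical tokenizer: lowercases the text and treats every run of
    letters (accented letters included) as one token. This is an
    assumption for illustration, not the corpus's own tokenization.
    """
    tokens = re.findall(r"[^\W\d_]+", text.lower())
    return len(tokens), len(set(tokens))


# Invented sample sentence, not taken from the corpus.
sample = "La recherche avance et la science avance aussi."
n_tokens, n_types = token_type_counts(sample)

# Type/token ratio computed from the figures stated in this record.
corpus_ratio = 30238 / 447244
```

With the figures stated in the description, the type/token ratio of the corpus is roughly 0.068, that is, about one distinct word form for every fifteen running words.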

Version 1.0
Distributor ELRA