Resource: Modern French Corpus including Anaphors Tagging
Reference: Modern French Corpus including Anaphors Tagging
Date of Submission: Jan. 24, 2014, 4:30 p.m.
Resource Type: Primary Text
Format/MIME Type: Plain text
The corpus, which includes anaphora tagging, was created by the CRISTAL-GRESEC team (Stendhal-Grenoble 3 University, France) and XRCE (Xerox Research Centre Europe, France) in response to the call launched by the DGLF-LF (the national institution for the French language and the languages spoken in France) for the creation of modern French corpora.
Over 1 million words have been annotated. The texts were selected to provide a broad sample of the French language (scientific and human science articles, extracts from newspapers and magazines, legal texts, etc.) and to reflect the research interests of the teams working on the project. The processed corpora supplied by ELRA are listed below:
- Two books published by the CNRS: La protection des oeuvres scientifiques en droit d'auteur français, Xavier Strubel. Paris, CNRS Editions, 1997 (77 591 words) and Cinquante ans de traction à la SNCF. Enjeux politiques, économiques et réponses techniques, Clive Lamming. Paris, CNRS Editions, 1997 (124 990 words).
Below are the tagged anaphoric elements:
The annotation scheme was defined in XML format. The texts were divided into sections, paragraphs and sentences. Sentence segmentation was carried out with NLP tools developed by XRCE, and the annotation itself was done manually by two qualified linguists. A large subset of the anaphoric phrases was automatically pre-annotated. The antecedents and the anaphoric relations were tagged manually, with editing tools (Emacs, macros from the Author/Editor software) used to ease the work. 5% of the corpora were checked to measure the annotation reliability.
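To give a rough idea of how an XML annotation of this kind might be read, the following Python sketch parses a small invented sample and pairs each anaphor with its antecedent. The element and attribute names (`text`, `section`, `paragraph`, `sentence`, `exp`, `anaphor`, `id`, `ref`) and the sample sentences are assumptions made for illustration only; the actual tag set of the corpus is not described here and may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample. The tag and attribute names are illustrative
# assumptions, not the corpus's real annotation scheme.
SAMPLE = """
<text>
  <section>
    <paragraph>
      <sentence id="s1"><exp id="e1">Le train</exp> quitte la gare.</sentence>
      <sentence id="s2"><anaphor id="a1" ref="e1">Il</anaphor> arrive à Lyon.</sentence>
    </paragraph>
  </section>
</text>
"""


def anaphoric_links(xml_text):
    """Return (sentence id, anaphor text, antecedent text) triples."""
    root = ET.fromstring(xml_text)
    # Index every candidate antecedent by its (hypothetical) id attribute.
    antecedents = {
        el.get("id"): "".join(el.itertext()) for el in root.iter("exp")
    }
    links = []
    for sentence in root.iter("sentence"):
        for ana in sentence.iter("anaphor"):
            # Antecedent is None when the reference cannot be resolved.
            antecedent = antecedents.get(ana.get("ref"))
            links.append((sentence.get("id"), "".join(ana.itertext()), antecedent))
    return links


if __name__ == "__main__":
    for sentence_id, anaphor, antecedent in anaphoric_links(SAMPLE):
        print(sentence_id, anaphor, "->", antecedent)
```

Indexing antecedents by identifier before walking the sentences keeps the lookup simple when an anaphor refers back across sentence or paragraph boundaries.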