Resource: Phrase Detectives Corpus

Reference Phrase Detectives Corpus
Date of Submission May 17, 2017, 4:29 p.m.
Status accepted
ISLRN 052-688-100-874-5
Resource Type Primary Text
Media Type Text
Source
Language English
Format/MIME Type application/xml, text/html, text/plain
Size 28024 KB
Access Medium Web Download
Description

*Introduction*

The Phrase Detectives Corpus was developed by the School of Computer Science and Electronic Engineering at the University of Essex. It consists of approximately 19,012 words across 40 documents annotated for anaphora through the Phrase Detectives game, an online interactive "game-with-a-purpose" (GWAP) designed to collect data about English anaphoric coreference.

The use of GWAPs to create language resources is growing. In general, they rely on non-monetary incentives, such as entertainment, to motivate participation, and they can be successful for large-scale, persistent annotation efforts.

*Data*

The documents in the corpus are taken from Wikipedia articles and from narrative text in Project Gutenberg. Wikipedia articles and annotation files are presented as XML, and Project Gutenberg source files are presented as plain text. All text is encoded as UTF-8. The annotations comprise a gold standard version created by multiple experts and a set created by a large non-expert crowd (via the Phrase Detectives game).

The data was annotated according to a prevalent linguistically oriented approach to anaphora used in several tasks, including OntoNotes Release 5.0 (LDC2013T19), SemEval-2010 Task 1 OntoNotes English: Coreference Resolution in Multiple Languages (LDC2011T01), and The ARRAU Corpus of Anaphoric Information (LDC2013T22).

Version 1.1
Creator Jon Chamberlain, Massimo Poesio, Udo Kruschwitz
Distributor Linguistic Data Consortium
Rights Holder Portions © 2017 University of Essex, © 2017 Trustees of the University of Pennsylvania