ISLRN

BOLT English Co-reference -- Discussion Forum, SMS/Chat, and Conversational Telephone Speech

Full Official Name: BOLT English Co-reference -- Discussion Forum, SMS/Chat, and Conversational Telephone Speech

Submission date: Dec. 16, 2020, 8:16 p.m.

*Introduction*

BOLT English Co-reference -- Discussion Forum, SMS/Chat, and Conversational Telephone Speech was developed by Raytheon BBN Technologies and consists of co-reference annotation on English discussion forum (DF), SMS/Chat and conversational telephone speech (CTS) data.

The DARPA BOLT (Broad Operational Language Translation) program developed machine translation and information retrieval for less formal genres, focusing particularly on user-generated content. The Linguistic Data Consortium (LDC) supported the BOLT program by collecting informal data sources -- discussion forums, text messaging and chat -- in Chinese, Egyptian Arabic and English. The collected data was translated and annotated for various tasks including word alignment, treebanking, propbanking and co-reference.

*Data*

DF data was collected from the web using a combination of manual and automatic processes. SMS/Chat material was donated or collected via live platforms. CTS data was taken from LDC's Arabic and Chinese CALLHOME and CALLFRIEND telephone collections; the audio files were transcribed and translated into English.

Co-reference annotation aims to identify all of the connections between specific mentions in the text that refer to the same entities and events in the discourse context. BOLT co-reference annotation was performed on top of BOLT treebank annotation. It covers noun phrases (including proper nouns, nominals, pronouns and null arguments), possessives, proper noun pre-modifiers and verbs.

Annotation files are presented in UTF-8 encoded XML format.

*Acknowledgements*

This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR0011-11-C-0145. The content does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.

Creator(s)

Nitin Agarwal

Michelle Franchini

Michelle Kappler

Linnea Micciulla

Lance Ramshaw

Sameer Pradhan

Distributor(s)

Linguistic Data Consortium

Right Holder(s)

Status : Accepted

ISLRN :

494-155-932-422-8

Version

1.0

Source

https://catalog.ldc.upenn.edu/LDC2020T20

Resource Type

Primary Text

Media Type

Text

Language(s)

English

Access Medium

Web Download