ISLRN

C-ORAL-ROM

Full Official Name: C-ORAL-ROM - Integrated reference corpora for spoken romance languages. Multi-media edition; tools of analysis; standard linguistic measurements for validation in HLT

Submission date: Jan. 24, 2014, 4:22 p.m.

Description

The C-ORAL-ROM resource is a multilingual corpus of spontaneous (1) speech for the main Romance languages, of around 1,200,000 words (IST 2000-26228). The resource comprises three components: a) Multimedia corpus; b) Speech software; c) Appendix.

The corpus consists of four comparable collections of recorded Italian, French, Portuguese and Spanish spontaneous speech sessions (around 300,000 words for each language). The collections are delivered respectively by the following providers:
* Università di Firenze (Dipartimento di Italianistica, LABLITA);
* Université de Provence (Description Linguistique Informatisée sur Corpus);
* Fundação da Universidade de Lisboa/Centro de Linguística da Universidade de Lisboa;
* Universidad Autónoma de Madrid (Departamento de Lingüística, Lenguas Modernas, Lógica y F. de la Ciencia, Laboratorio de Lingüística Informática).

The C-ORAL-ROM corpus provides the acoustic source of each session together with the following main annotations:
* The orthographic transcription, in CHAT format, enriched with the tagging of terminal and non-terminal prosodic breaks
* Session metadata
* The text-to-speech synchronization, in WIN PITCH CORPUS format, based on the alignment of each transcribed utterance

The multimedia corpus comes with the speech software Win Pitch Corpus (© Pitch France. Minimal configuration: Pentium III, 1 GHz, 252 MB RAM, Sound Blaster or compatible sound card, running under Windows 2000 or XP only. GDPLUS.dll must be installed in the same directory as the program). (2)

A series of appendices is also provided, containing: a) the purely textual corpus in .TXT and .XML format; b) the PoS tagging of all transcriptions and the corresponding frequency lists of lemmas, in .TXT files; c) a set of linguistic measurements extracted from the main corpus annotations, in Excel files; d) the specifications and validation of the resource; e) corpus metadata.

Package

1. DVDs 1 to 8 contain the multimedia corpus edition (DVDs 1-2 French; DVDs 3-4 Italian; DVDs 5-6 Portuguese; DVDs 7-8 Spanish). All collections have the same folder structure, which directly mirrors the C-ORAL-ROM corpus design (see below). For each session, the folders contain:
* the uncompressed .WAV files (Windows PCM: 22,050 Hz; 16 bit);
* the .TXT file of the transcripts;
* the .XML file defining the text-to-speech alignment in WIN PITCH CORPUS format, and its .DTD.

2. The CD contains the speech software and the Appendix:
a) Speech software: Win Pitch Corpus (10 licenses)
b) Appendix:
* The C-ORAL-ROM transcription files in .TXT and .XML format
* The C-ORAL-ROM transcription files with PoS tagging in .TXT files
* The frequency list of lemmas for each language collection in .TXT files
* Measurements of spoken language variability in Excel files
* The corpus specifications: a) Corpus design; b) Metadata description; c) Dialogue representation format; d) Prosodic tagging; e) Alignment format; f) XML format; g) PoS tagging and lemma formats; h) Glossaries
* Resource validation reports
* Multimedia sample files

Main Features

The resource aims to represent the variety of speech acts performed in everyday language and to enable the induction of prosodic and syntactic structures in the four Romance languages, from both a quantitative and a qualitative point of view. The resource has been designed for prosodic modeling, test bed procedures in HLT and corpus-based studies of spontaneous speech. C-ORAL-ROM has relevant added value at the following levels:
* Corpus design
* Metadata
* Dialogue representation
* Prosodic annotation
* PoS tagging
* Multimedia storage
* Speech analysis

CORPUS DESIGN

The corpus design of the C-ORAL-ROM resource aims to ensure that a large variety of speech act typologies and natural prosodic contours, which are the most characteristic linguistic features of spontaneous speech, have a chance to occur.
To this end, the main variation parameters of the spoken domain (channel variation, dialogue structure, sociological domain of use, and semantic domain of application) are represented in a corpus design schema covering a wide range of semantic and pragmatic domains of application. The four language collections are considered comparable insofar as they fit the corpus design schema. More specifically, each language collection in the C-ORAL-ROM corpus is consistent with the following average structure (check the documentation for deviations):

INFORMAL: 150,000 words, from at least 64 texts of 1,500 words each and 10 texts of 4,500 words each
INFORMAL/Family-Private context: 124,500 words
INFORMAL/Family-Private context/Monologues: 42,000 words
INFORMAL/Family-Private context/Dialogues-Conversations: 82,500 words
INFORMAL/Public context: 25,500 words
INFORMAL/Public context/Monologues: 6,000 words
INFORMAL/Public context/Dialogues-Conversations: 19,500 words

FORMAL: 150,000 words
FORMAL/Formal in natural context: 2 or 3 samples averaging 3,000 words for each of the following typical domains of use, 65,000 words in total
FORMAL/Formal in natural context/political speech
FORMAL/Formal in natural context/political debate
FORMAL/Formal in natural context/preaching
FORMAL/Formal in natural context/teaching
FORMAL/Formal in natural context/professional explanation
FORMAL/Formal in natural context/conference
FORMAL/Formal in natural context/business
FORMAL/Formal in natural context/law (through media allowed)
FORMAL/Media context: 2 or 3 samples averaging 3,000 words for each of the following typical domains of use, 60,000 words in total
FORMAL/Media context/news (small sample)
FORMAL/Media context/meteo (small sample)
FORMAL/Media context/interviews
FORMAL/Media context/reportage
FORMAL/Media context/scientific press
FORMAL/Media context/sport talk shows
FORMAL/Media context/political debate
FORMAL/Media context/talk shows thematic discussions
FORMAL/Media context/talk shows culture
FORMAL/Media context/talk shows science
FORMAL/Telephone: 25,000 words (3)
FORMAL/Telephone/private conversations
FORMAL/Telephone/phone calls to services or man-machine interaction (10,000 words) (4)

METADATA

For each session, a rich set of metadata is delivered in CHAT format, ensuring multitask exploitation of the resource for linguistics and human language technologies. The metadata contain essential information regarding the speakers, the recording situation, the topic, the acoustic quality, and the source of the collected data.

DIALOGUE REPRESENTATION

The corpora are orthographically transcribed in a standard textual format (CHAT format; MacWhinney, 1994) with annotation of speaker turns. The textual string is divided into utterances. The main non-linguistic and paralinguistic acoustic events in the speech flow are reported in the transcripts.

PROSODIC ANNOTATION

The four Romance collections are completely tagged with respect to prosodic breaks. Terminal and non-terminal breaks are distinguished through perceptual judgments and reported in the transcripts.
The level of inter-annotator agreement on the assignment of prosodic tags has been validated by an external institution.

MULTIMEDIA STORAGE

The multimedia storage ensures a natural and meaningful text/sound correspondence for prosodic modeling, test bed procedures and corpus-based studies of spontaneous speech.

SPEECH SOFTWARE

Win Pitch Corpus is a software program for computer-aided alignment of large corpora. It provides a method for easy and precise selection of alignment units, ranging from syllables to whole sentences, in a hierarchical storage system for aligned data. The method relies on the user's ability to visually link a moving target with the perception of the corresponding speech sound, played back at a rate reduced by at least 30%. Segments derived from alignment can be defined on 8 independent layers, with automatic generation of the corresponding database, which can be saved directly in both XML and Excel formats. Besides text-to-speech alignment, Win Pitch Corpus, which is Unicode compliant, has numerous features for acoustic analysis of speech, such as real-time fundamental frequency tracking, spectrographic display, and re-synthesis after editing of prosodic parameters.

For more information: http://www.elda.org/en/proj/coralrom.html

___________________
(1) As defined in C-ORAL-ROM: comprising formal and informal speech.
(2) ELDA does not take responsibility for software products that come with the distributed resources. Pitch France is fully responsible for this software.
(3) Text length not defined (preferably an upper limit of 1,500 words; no lower limit).
(4) Field not present in the Portuguese corpus. The texts in this field are not delivered aligned to the acoustic source.
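
The word budgets given in the corpus design schema above can be cross-checked with a short script. This is a minimal sketch and not part of the distributed resource: the nested-dictionary layout and the names DESIGN and total_words are illustrative assumptions, and only the figures themselves are taken from the description.

```python
# Average word budgets for ONE language collection, copied from the corpus
# design schema in the description. The dictionary layout is an assumption
# made for illustration; it is not a format used by the resource itself.
DESIGN = {
    "INFORMAL": {
        "Family-Private context": {
            "Monologues": 42_000,
            "Dialogues-Conversations": 82_500,
        },
        "Public context": {
            "Monologues": 6_000,
            "Dialogues-Conversations": 19_500,
        },
    },
    "FORMAL": {
        "Formal in natural context": 65_000,
        "Media context": 60_000,
        "Telephone": 25_000,
    },
}


def total_words(node):
    """Sum the word counts of a node and everything nested below it."""
    if isinstance(node, int):
        return node
    return sum(total_words(child) for child in node.values())


informal = DESIGN["INFORMAL"]
print(total_words(informal["Family-Private context"]))  # 124500
print(total_words(informal["Public context"]))          # 25500
print(total_words(DESIGN["INFORMAL"]))                  # 150000
print(total_words(DESIGN["FORMAL"]))                    # 150000
print(total_words(DESIGN))                              # 300000 per language
print(4 * total_words(DESIGN))                          # 1200000 for four languages
```

The sub-totals agree with the figures stated in the schema (124,500 and 25,500 words for the two informal contexts, 150,000 words each for INFORMAL and FORMAL), and the four collections together give the "around 1,200,000 words" quoted at the start of the description. These are the average design figures; the documentation lists deviations for individual collections.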

Creator(s)

Distributor(s)

ELRA

Right Holder(s)

Status : Accepted

ISLRN :

318-977-046-077-4

Version

1.0

Source

http://catalog.elra.info/product_info.php?products_id=757

Resource Type

Primary Text

Media Type

Audio

Language(s)

French

Italian

Portuguese

Spanish

Access Medium