Resource: CAREGIVER Corpus
|Date of Submission||Sept. 3, 2020, 4:14 p.m.|
|Resource Type||Primary Text|
|Language||Dutch, Flemish, English, Finnish|
CAREGIVER is a multilingual speech corpus for modeling language acquisition, designed and recorded within the framework of the EU-funded Acquisition of Communication and Recognition Skills (ACORNS) project. The motivation behind the corpus and its design rely on current knowledge of infant language acquisition. Instead of recording infants and children, the voices of their primary and secondary caregivers were captured as read speech, in both infant-directed and adult-directed modes, across four languages. The challenges and methods involved in obtaining prompts of similar complexity and semantics across languages, as well as the normalized recording procedures employed at different locations, are covered. An orthographic transcription is available for every utterance. Time-aligned word and phone annotations also exist for some of the sub-corpora.
The corpus as distributed, however, deviates from this setup in a few respects. It contains nearly 66,000 utterance-based audio files spoken over a two-year period by 16 male and 14 female native speakers of Dutch, English, and Finnish. Swedish is not provided. For Dutch, only the year 2 recordings are available.
1) UK English:
To be mentioned as reference to the corpus: