Resource: Arabic Speech Corpus

Reference Arabic Speech Corpus
Date of Submission Aug. 19, 2016, 10:57 a.m.
Status accepted
ISLRN 866-568-447-697-8
Resource Type Other
Media Type Audio
Source
Language Arabic
Size 3.7 hours
Description

This speech corpus was developed as part of PhD work carried out by Nawar Halabi at the University of Southampton. The corpus was recorded in a professional studio with a Neumann TLM 103 studio microphone by one male speaker of South Levantine Arabic (Damascene accent). The transcript was collected from “Aljazeera Learn” (Aljazeera 2015), a language-learning website chosen because its text is fully diacritised, which makes it easier to phonetise. The transcript was split into utterances at punctuation marks to make the recording sessions easier for the speaker. Speech synthesised from this corpus has produced a high-quality, natural voice. The corpus consists of 1813 utterances totalling 3.7 hours:
- 2.1 hours of normal utterances,
- 1.6 hours of nonsense utterances (utterances that are not semantically, orthographically or syntactically correct).

This package corresponds to version 2.0 of the corpus and includes:
- 1813 .wav files containing spoken utterances,
- 1813 .lab files containing text utterances,
- 1813 .TextGrid files containing the phoneme labels with the time stamps of the phoneme boundaries in the corresponding .wav files; these files can be opened with Praat (see http://www.fon.hum.uva.nl/praat/),
- a single text file of phonetic transcriptions, with one line per utterance in the form "[wav_filename]" "[Phoneme Sequence]",
- a single text file of orthographic transcriptions, with one line per utterance in the form "[wav_filename]" "[Orthographic Transcript]". The orthography is in Buckwalter transliteration (see http://www.qamus.org/transliteration.htm), which is more convenient for software that cannot read Arabic script and can easily be converted back to Arabic script,
- an extra 18 minutes of fully annotated speech, used to evaluate the corpus, provided separately from the above but with the same structure.
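
As an illustration of the transcription file layout and the Buckwalter conversion described above, the following Python sketch parses one line of the form "[wav_filename]" "[Orthographic Transcript]" and maps the Buckwalter text back to Arabic script. It is a minimal sketch, not part of the package: the function names, the sample file name and the assumption that the double quotes appear literally in the file are illustrative only. The character table follows the standard Buckwalter scheme.

```python
import re

# Standard Buckwalter transliteration: one ASCII character per Arabic code point.
BUCKWALTER_TO_ARABIC = {
    "'": "\u0621", "|": "\u0622", ">": "\u0623", "&": "\u0624",
    "<": "\u0625", "}": "\u0626", "A": "\u0627", "b": "\u0628",
    "p": "\u0629", "t": "\u062a", "v": "\u062b", "j": "\u062c",
    "H": "\u062d", "x": "\u062e", "d": "\u062f", "*": "\u0630",
    "r": "\u0631", "z": "\u0632", "s": "\u0633", "$": "\u0634",
    "S": "\u0635", "D": "\u0636", "T": "\u0637", "Z": "\u0638",
    "E": "\u0639", "g": "\u063a", "_": "\u0640", "f": "\u0641",
    "q": "\u0642", "k": "\u0643", "l": "\u0644", "m": "\u0645",
    "n": "\u0646", "h": "\u0647", "w": "\u0648", "Y": "\u0649",
    "y": "\u064a", "F": "\u064b", "N": "\u064c", "K": "\u064d",
    "a": "\u064e", "u": "\u064f", "i": "\u0650", "~": "\u0651",
    "o": "\u0652", "`": "\u0670", "{": "\u0671",
}

# Assumed line layout: "<wav file name>" "<transcript>" with literal double quotes.
LINE_RE = re.compile(r'^"([^"]*)"\s+"(.*)"\s*$')


def parse_transcript_line(line):
    """Return (wav_filename, transcript), or None if the line does not match."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    return match.group(1), match.group(2)


def buckwalter_to_arabic(text):
    """Map each Buckwalter character to Arabic; leave other characters unchanged."""
    return "".join(BUCKWALTER_TO_ARABIC.get(ch, ch) for ch in text)


if __name__ == "__main__":
    # Hypothetical example line; real file names and transcripts will differ.
    name, translit = parse_transcript_line('"sample.wav" "kitaAbN"')
    print(name, buckwalter_to_arabic(translit))
```

Characters outside the table, such as spaces and punctuation, pass through unchanged, so the same function can be applied to a whole utterance.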

Arabic Speech Corpus by Nawar Halabi is licensed under a Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

Version 2.0
Creator Nawar Halabi
Distributor ELRA