Resource: Slovenian BNSI Broadcast News Speech Corpus

Reference Slovenian BNSI Broadcast News Speech Corpus
Date of Submission Jan. 24, 2014, 4:31 p.m.
Status accepted
ISLRN 502-280-144-938-4
Resource Type Primary Text
Media Type Audio
Language Slovenian

This speech database consists of TV news shows (both the evening news, “TV Dnevnik”, and the late-night news, “Odmevi”) from the archive of the Slovenian national broadcaster, RTV Slovenia. The recordings were made between June 1999 and May 2003.

The database comprises 36 hours of recordings (training set: 30 hours; development set: 3 hours; test set: 3 hours), transcribed and manually checked using the Transcriber tool. Transcription conventions are based on documents defined by LDC, LIMSI and COST 278 BN SIG. The transcriptions contain 268,000 words, of which 37,000 are distinct. The transcription files contain orthographic transcriptions, information on acoustic conditions and background, and segmentation at turn and section level. The topic of each news show section is described and marked using 25 topic categories. Speaker information consists of gender, speaking style, accent and origin.
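As a rough illustration of how section, turn and speaker information of this kind could be read, the sketch below parses a small Transcriber-style XML document. The element and attribute names (`Section`, `Turn`, `Speaker`, `Topic`, `type`, `mode`, `accent`) follow the generic Transcriber file format as commonly documented and are an assumption here; this record does not confirm the exact layout of the corpus files, and the sample content is invented.

```python
import xml.etree.ElementTree as ET

# Invented sample in the generic Transcriber layout (assumption, not corpus data).
SAMPLE_TRS = """<Trans>
  <Topics><Topic id="to1" desc="politics"/></Topics>
  <Speakers>
    <Speaker id="spk1" name="Speaker One" type="female" accent="" dialect="native"/>
  </Speakers>
  <Episode>
    <Section type="report" topic="to1" startTime="0" endTime="4.2">
      <Turn speaker="spk1" mode="planned" startTime="0" endTime="4.2">
        <Sync time="0"/>
        dober dan
      </Turn>
    </Section>
  </Episode>
</Trans>"""


def summarize_turns(trs_xml):
    """Return one dict per turn with its topic, speaker gender, style and text."""
    root = ET.fromstring(trs_xml)
    # Lookup tables for the speaker and topic declarations at the top of the file.
    speakers = {s.get("id"): s.attrib for s in root.iter("Speaker")}
    topics = {t.get("id"): t.get("desc") for t in root.iter("Topic")}

    rows = []
    for section in root.iter("Section"):
        for turn in section.iter("Turn"):
            speaker = speakers.get(turn.get("speaker"), {})
            # Transcribed words sit between <Sync/> markers; collapse whitespace.
            text = " ".join("".join(turn.itertext()).split())
            rows.append({
                "topic": topics.get(section.get("topic")),
                "gender": speaker.get("type"),
                "style": turn.get("mode"),
                "start": float(turn.get("startTime")),
                "end": float(turn.get("endTime")),
                "text": text,
            })
    return rows


if __name__ == "__main__":
    for row in summarize_turns(SAMPLE_TRS):
        print(row)
```

Overlapping speech, where a turn lists more than one speaker, is not handled in this sketch.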

The recordings include 1,565 speakers (1,069 male, 477 female, 19 unspecified).

The speech signal format is: 16 kHz, 16-bit, WAV, 1 channel.
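A minimal sketch of how a file could be checked against the stated signal format, using Python's standard `wave` module. The function names and the `EXPECTED` dictionary are illustrative and not part of the corpus distribution; 16-bit samples are expressed as a sample width of 2 bytes.

```python
import wave

# Signal format stated in this record: 16 kHz, 16-bit, 1 channel.
EXPECTED = {"channels": 1, "sample_width_bytes": 2, "sample_rate_hz": 16000}


def wav_format(path):
    """Read the header of a WAV file and return its basic format parameters."""
    with wave.open(path, "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "sample_rate_hz": wav.getframerate(),
        }


def matches_spec(path):
    """True if the file has the channel count, sample width and rate in EXPECTED."""
    return wav_format(path) == EXPECTED
```

Calling `matches_spec` on each audio file before feature extraction is one way to catch files that were resampled or converted after download.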

Version 1.0
Distributor ELRA