|Reference||The Serbian Semantic Textual Similarity News Corpus|
|Date of Submission||Feb. 7, 2018, 5:01 p.m.|
|Resource Type||Primary Text|
|Size||1192 sentence pairs|
The Serbian STS News Corpus consists of 1192 pairs of sentences in Serbian gathered from news sources on the web. Each sentence pair was manually annotated with a fine-grained semantic similarity score on a 0-5 scale. The final scores were obtained by averaging the individual scores of five annotators. All sentences are written in the Serbian Latin script.
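The averaging step described above can be illustrated with a minimal sketch. The function name `final_scores`, the nested-list input layout, and the toy scores are assumptions made for illustration only; they are not the corpus's actual file format or data.

```python
from statistics import mean


def final_scores(annotations):
    """Average the per-annotator scores of each sentence pair.

    `annotations` is a hypothetical layout: one inner list per sentence
    pair, holding one score in the 0-5 range per annotator.
    """
    averaged = []
    for pair_scores in annotations:
        # Reject scores outside the 0-5 scale used for annotation.
        if any(not 0 <= s <= 5 for s in pair_scores):
            raise ValueError("scores must lie in the 0-5 range")
        averaged.append(mean(pair_scores))
    return averaged


# Toy example with five annotators per pair (made-up numbers).
toy_annotations = [
    [5, 5, 4, 5, 5],
    [0, 1, 0, 0, 1],
    [3, 2, 3, 3, 2],
]
toy_final = final_scores(toy_annotations)
print(toy_final)
```

Averaging integer scores from five annotators is what makes the final scores fine-grained: the result can take any multiple of 0.2 between 0 and 5.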
The sentence pairs in this dataset were taken from the Serbian Paraphrase Corpus (paraphrase.sr). Typographical errors within the 1194 sentence pairs of the Serbian Paraphrase Corpus were manually corrected, and any missing diacritical marks were restored. Two of the 1194 pairs were removed from the corpus: one was found to be a duplicate, and the other included a text longer than one sentence. The remaining 1192 pairs were annotated with fine-grained semantic similarity scores.
The annotation methodology followed the one established in the SemEval STS shared tasks (2012-2017). One major difference is that the example pairs for each score that were used in the SemEval STS annotation instructions were replaced by new ones. Instead of one example per score value, three examples were included in the annotation guidelines for each score value. This was found to improve task comprehension and annotation quality. The new examples were taken from the 2012 MSRPar and the 2013-2016 Headlines portions of the annotated SemEval STS corpora in English, and were then professionally translated into Serbian.
The average annotator self-agreement score, expressed as the Pearson correlation coefficient r, is 0.93. The average inter-rater correlation between an annotator and the averaged scores of all other annotators is 0.92, which is effectively the upper bound for STS model performance on this dataset. The corpus (STS.news.sr) contains around 64 thousand tokens, so the average sentence length is around 27 tokens. The average semantic similarity score is 2.51.
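The inter-rater figure above correlates each annotator with the mean of all other annotators and then averages those correlations. A minimal sketch of that leave-one-out computation follows; the function names, the `scores[annotator][pair]` layout, and the toy numbers are illustrative assumptions, not the authors' code or data.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient r of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("r is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def leave_one_out_agreement(scores):
    """Average correlation of each annotator with the mean of the others.

    `scores[a][i]` is the score annotator `a` gave to sentence pair `i`
    (a hypothetical layout chosen for this sketch).
    """
    n_pairs = len(scores[0])
    correlations = []
    for a, own in enumerate(scores):
        others = [s for j, s in enumerate(scores) if j != a]
        # Mean score of all other annotators for every sentence pair.
        others_mean = [
            sum(s[i] for s in others) / len(others) for i in range(n_pairs)
        ]
        correlations.append(pearson(own, others_mean))
    return sum(correlations) / len(correlations)


# Toy example: three annotators, four sentence pairs (made-up numbers).
toy_scores = [
    [0, 2, 5, 3],
    [1, 2, 4, 3],
    [0, 3, 5, 2],
]
toy_agreement = leave_one_out_agreement(toy_scores)
print(round(toy_agreement, 3))
```

Excluding each annotator from the reference average keeps that annotator's own scores from inflating the correlation, which is why the result can serve as a ceiling for comparing model output against the averaged gold scores.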
Fine-grained Semantic Textual Similarity for Serbian, Vuk Batanović, Miloš Cvetanović, Boško Nikolić, in Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan (2018).
|Creator||Vuk Batanović - School of Electrical Engineering, University of Belgrade|
|Distributor||Vuk Batanović - School of Electrical Engineering, University of Belgrade|
|Rights Holder||Vuk Batanović - School of Electrical Engineering, University of Belgrade|