STS.news.sr

Full Official Name: The Serbian Semantic Textual Similarity News Corpus
Submission date: Feb. 7, 2018, 5:01 p.m.

The Serbian STS News Corpus consists of 1192 pairs of sentences in Serbian gathered from news sources on the web. Each sentence pair was manually annotated with fine-grained semantic similarity scores on the 0-5 scale. The final scores were obtained by averaging the individual scores of five annotators. All sentences are written in the Serbian Latin script.

Corpus creation

The sentence pairs in this dataset were taken from the Serbian Paraphrase Corpus (paraphrase.sr). Typographical errors within the 1194 sentence pairs of the Serbian Paraphrase Corpus were manually corrected. Any missing diacritical marks were restored as well. Two of the 1194 pairs were removed from the corpus: one was found to be a duplicate and the other included a text longer than one sentence. The remaining 1192 pairs were annotated with fine-grained semantic similarity scores.

Corpus annotation

The annotation methodology followed the one established in the SemEval STS shared tasks (2012-2017). One major difference is that the example pairs for each score that were used in the SemEval STS annotation instructions were replaced by new ones. Instead of one example per score value, three examples were included in the annotation guidelines for each score value. This was found to improve task comprehension and annotation quality. The new examples were taken from the 2012 MSRPar and the 2013-2016 Headlines portions of the annotated SemEval STS corpora in English, and were then professionally translated into Serbian.

Five annotators separately assigned semantic similarity scores in the 0-5 range to each pair in the corpus. They first scored a subset of 60 randomly selected pairs (~5% of the total), after which they proceeded to annotate the entire dataset. This initial batch was subsequently used to calculate the annotator self-agreement scores.

Corpus statistics

The average annotator self-agreement score, expressed in terms of the Pearson correlation coefficient r, is 0.93. The average inter-rater correlation between an annotator and the averaged scores of all other annotators is 0.92, which is effectively the upper bound for STS model performance on this dataset. STS.news.sr contains around 64 thousand tokens, making the average sentence length around 27 tokens. The average semantic similarity score value is 2.51.

Reference paper

Fine-grained Semantic Textual Similarity for Serbian, Vuk Batanović, Miloš Cvetanović, Boško Nikolić, in Proceedings of the 11th International Conference on Language Resources and Evaluation (LREC 2018), Miyazaki, Japan (2018).
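Illustrative sketch

The two aggregate measures described under corpus statistics (final scores as the mean of five annotators, and the correlation between each annotator and the averaged scores of all the others) can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the authors' evaluation code: the in-memory layout (one list of scores per annotator), the function names, and the toy ratings are all hypothetical and are not taken from the corpus or its distribution format.

```python
from math import sqrt
from statistics import mean


def pearson_r(xs, ys):
    """Pearson correlation coefficient r between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


def final_scores(ratings):
    """Average the annotators' scores for each pair.

    ratings[a][p] is the 0-5 score annotator a gave to pair p
    (an assumed layout, chosen only for this sketch).
    """
    return [mean(column) for column in zip(*ratings)]


def leave_one_out_correlations(ratings):
    """Correlate each annotator with the mean of all *other* annotators."""
    result = []
    for i, own in enumerate(ratings):
        others = [r for j, r in enumerate(ratings) if j != i]
        result.append(pearson_r(own, final_scores(others)))
    return result


# Made-up toy ratings: 5 annotators x 6 sentence pairs (not corpus data).
toy_ratings = [
    [0, 1, 3, 5, 4, 2],
    [0, 2, 3, 5, 4, 2],
    [1, 1, 2, 5, 3, 2],
    [0, 1, 3, 4, 4, 3],
    [0, 1, 4, 5, 4, 2],
]

gold = final_scores(toy_ratings)                       # averaged score per pair
per_annotator = leave_one_out_correlations(toy_ratings)
average_inter_rater = mean(per_annotator)              # analogous to the reported 0.92
```

Excluding an annotator's own scores from the average they are compared against avoids inflating the correlation, which is why a figure computed this way can be read as a rough ceiling for how well a model can be expected to agree with the averaged scores.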
