Resource: STEM-ECR

Reference The STEM-ECR Dataset: Grounding Scientific Entity References in STEM Scholarly Content to Authoritative Encyclopedic and Lexicographic Sources
Date of Submission Feb. 19, 2020, 12:25 p.m.
Status accepted
ISLRN 749-555-840-571-2
Resource Type Primary Text
Media Type Text
Language English
Format/MIME Type text/plain
Size 3.3 MB
Access Medium Web Download

The STEM-ECR v1.0 dataset introduces the task of Scientific Entity Extraction, Classification, and Resolution on scholarly publications in STEM (Science, Technology, Engineering, and Medicine) disciplines. It comprises annotated scholarly abstracts from the 10 STEM disciplines found to be the most prolific on a major publishing platform. The annotated data includes phrase-based scientific entities and, where applicable, their corresponding disambiguated references in Wikipedia and Wiktionary.
The purpose of the dataset is to provide a benchmark for the evaluation of scientific entity extraction, classification, and resolution tasks in a domain-independent fashion.

The source data for the annotations in this corpus are scholarly abstracts collected by Elsevier. Each abstract is provided as UTF-8 encoded files, with the abstract text and its corresponding character-span-based annotations in separate files.
A summary of the data by domain and number of mentions is given below:
Domain             Mentions
Astronomy          791
Agriculture        741
Engineering        741
Earth Science      698
Biology            649
Medicine           600
Material Science   574
Computer Science   553
Chemistry          483
Mathematics        297
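
As an illustration of how character-span annotations kept in a separate file can be aligned with the abstract text, the following is a minimal sketch only. It assumes a brat-style standoff layout (tab-separated identifier, "label start end", and surface text), which this record does not confirm for STEM-ECR. The sample abstract, the label "Material", and the names Span and parse_standoff are hypothetical and not taken from the dataset.

```python
from typing import List, NamedTuple


class Span(NamedTuple):
    label: str
    start: int  # inclusive character offset into the abstract text
    end: int    # exclusive character offset into the abstract text
    text: str   # surface string recorded in the annotation file


def parse_standoff(ann_text: str) -> List[Span]:
    """Parse entity lines of an ASSUMED brat-style standoff file.

    Expected line shape (an assumption, not confirmed by the dataset record):
        T1<TAB>Label 23 33<TAB>surface text
    Lines that do not match this shape are skipped.
    """
    spans = []
    for line in ann_text.splitlines():
        parts = line.split("\t")
        if len(parts) != 3 or not parts[0].startswith("T"):
            continue
        fields = parts[1].split(" ")
        if len(fields) != 3:
            # e.g. discontinuous spans, which this sketch does not handle
            continue
        label, start, end = fields
        spans.append(Span(label, int(start), int(end), parts[2]))
    return spans


# Hypothetical sample data, invented for illustration only.
abstract = "We study the growth of thin films on silicon."
annotations = "T1\tMaterial 23 33\tthin films\nT2\tMaterial 37 44\tsilicon\n"

spans = parse_standoff(annotations)
for span in spans:
    # Each offset pair should select exactly the recorded surface text.
    assert abstract[span.start:span.end] == span.text
    print(span.label, span.start, span.end, span.text)
```

Checking each offset pair against the abstract text in this way is a cheap safeguard against encoding or newline mismatches when the text and annotation files are read separately.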

This work was co-funded by the European Research Council for the project ScienceGRAPH (Grant agreement ID: 819536) and by the TIB Leibniz Information Centre for Science and Technology.

Version 1.0
Creator Jennifer D'Souza, Anett Hoppe - TIB Leibniz Information Centre for Science and Technology
Distributor TIB Leibniz Information Centre for Science and Technology
Rights Holder TIB Leibniz Information Centre for Science and Technology