Date of Submission: Jan. 24, 2014, 4:17 p.m.
BioLexicon is a large-scale English terminological resource which has been developed to address the needs emerging in text mining efforts in the biomedical domain. It contains information on:
Existing information on terms was integrated, augmented, complemented and linked through the processing of massive amounts of biomedical text, yielding, among other things, over 2.2M lexical entries (over 3.3M semantic relations), and information on over 1.8M variants and on over 2M synonymy relations. Extensive information is also provided on how verbs and nominalised verbs in the domain behave at both the syntactic and semantic levels, thus supporting applications that aim to discover relations and events involving biological entities in text. BioLexicon contains 658 domain-specific verbs, together with 1710 automatically extracted syntactic subcategorisation frames and 850 semantic event frames based on corpus annotation by domain experts.
This comprehensive coverage of biological terms makes BioLexicon a unique linguistic resource within the domain. It is primarily intended to support text mining and information retrieval in the biomedical domain; however, its standards-based structure and rich content make it a valuable resource for many other kinds of application.
In the first stage of the construction of BioLexicon, potential terms were pooled together from several resources representing selected semantic types of entities, such as genes and proteins, chemical compounds, species, enzymes, as well as various entities found in biological ontologies.
Terms were then organised into sets of synonymous variants and annotated with a number of static features that improve the resolution of term ambiguity. Once populated with terms from existing repositories, BioLexicon was augmented with term variants extracted from the scientific literature and complemented with manually selected lexical items, such as biologically relevant verbs and multiword token expressions. Linguistic information derived from corpus processing was then added to the entries, including syntactic subcategorisation information for verbs and nominalised verbs, and semantic event frame information. Finally, a subset of terms in BioLexicon was linked to Gene Regulation Ontology concepts to support the identification of gene regulatory events.
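The pooling and grouping steps described above can be illustrated with a minimal Python sketch. It is not the procedure actually used to build BioLexicon: the `LexicalEntry` class, the `pool_terms` function, the resource names and the identifiers are all hypothetical, and the sketch assumes that records sharing a concept identifier are synonymous variants.

```python
from dataclasses import dataclass, field


@dataclass
class LexicalEntry:
    """One set of synonymous term variants (hypothetical structure)."""
    entry_id: str
    semantic_type: str
    variants: set = field(default_factory=set)
    sources: set = field(default_factory=set)


def pool_terms(records):
    """Group (source, concept_id, semantic_type, term) records into entries.

    Assumption: records that share a concept_id are synonymous variants.
    """
    entries = {}
    for source, concept_id, semantic_type, term in records:
        entry = entries.get(concept_id)
        if entry is None:
            entry = LexicalEntry(concept_id, semantic_type)
            entries[concept_id] = entry
        entry.variants.add(term)   # collect every surface form of the concept
        entry.sources.add(source)  # remember which resource supplied it
    return entries


# Toy records; resource names and identifiers are invented for illustration.
records = [
    ("resource_a", "GENE:1", "gene/protein", "p53"),
    ("resource_b", "GENE:1", "gene/protein", "TP53"),
    ("literature", "GENE:1", "gene/protein", "tumor protein p53"),
    ("resource_c", "CHEM:7", "chemical compound", "glucose"),
]

entries = pool_terms(records)
```

Recording the contributing sources on each entry reflects the idea that variants from existing repositories and from the literature end up in the same synonym set, while the semantic type is one example of a static feature that could help with disambiguation.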
The schema of BioLexicon preserves term annotations and metadata derived from the original data resources. At the same time, it provides consistent lexical representation for terms of different semantic types. BioLexicon thus offers the clear advantage of a uniform lexical format for a wide coverage of biological terminology, with accompanying linguistic information.
BioLexicon is available in a relational database format (as a MySQL dump) and adheres to the EAGLES/ISO standards for lexical resources.
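To give a feel for how a relational lexicon of this kind might be queried, the following Python sketch builds a tiny in-memory SQLite database and looks up the variants of an entry. The table names (`entry`, `variant`), the column names and the sample rows are assumptions made for illustration only; they are not the actual BioLexicon schema, and the real distribution is a MySQL dump rather than SQLite.

```python
import sqlite3

# Hypothetical two-table layout: one row per lexical entry, one row per variant.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE entry (
    id INTEGER PRIMARY KEY,
    lemma TEXT,
    semantic_type TEXT
);
CREATE TABLE variant (
    entry_id INTEGER REFERENCES entry(id),
    written_form TEXT
);
INSERT INTO entry VALUES (1, 'p53', 'gene/protein');
INSERT INTO variant VALUES (1, 'TP53');
INSERT INTO variant VALUES (1, 'tumor protein p53');
""")


def variants_of(connection, lemma):
    """Return the written forms linked to the entry with the given lemma."""
    rows = connection.execute(
        "SELECT v.written_form "
        "FROM variant v JOIN entry e ON e.id = v.entry_id "
        "WHERE e.lemma = ? "
        "ORDER BY v.written_form",
        (lemma,),
    )
    return [row[0] for row in rows]


found = variants_of(conn, "p53")
```

With the real resource, the same kind of join would be run against the tables defined in the MySQL dump once it has been loaded into a MySQL server.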