This is Oxford University Press's most comprehensive single-volume dictionary, with 170,000 entries covering all varieties of English worldwide. The NODE data set constitutes a fully integrated range of formal data types suitable for language engineering and NLP applications: It is available in XML or SGML.
- Source dictionary data. The NODE data set includes all the information present in the New Oxford Dictionary of English itself, such as definition text, example sentences, grammatical indicators, and encyclopaedic material.
- Morphological data. Each NODE lemma (both headwords and subentries) has a full listing of all possible syntactic forms (e.g. plurals for nouns, inflections for verbs, comparatives and superlatives for adjectives), tagged to show their syntactic relationships. Each form has an IPA pronunciation. Full morphological data is also given for spelling variants (e.g. typical American variants), and a system of links enables straightforward correlation of variant forms to standard forms. The data set thus provides robust support for all look-up routines, and is equally viable for applications dealing with American and British English.
- Phrases and idioms. The NODE data set provides a rich and flexible codification of over 10,000 phrasal verbs and other multi-word phrases. It features comprehensive lexical resources enabling applications to identify a phrase not only in the form listed in the dictionary but also in a range of real-world variations, including alternative wording, variable syntactic patterns, inflected verbs, optional determiners, etc.
- Subject classification. Using a categorization scheme of 200 key domains, over 80,000 words and senses have been associated with particular subject areas, from aeronautics to zoology. As well as facilitating the extraction of subject-specific sub-lexicons, this also provides an extensive resource for document categorization and information retrieval.
- Semantic relationships. The relationships between every noun and noun sense in the dictionary are being codified using an extensive semantic taxonomy on the model of the Princeton WordNet project. (Mapping to WordNet 1.7 is supported.) This structure allows elements of the basic lexical database to function as a formal knowledge database, enabling functionality such as sense disambiguation and logical inference.
- Derived from the detailed and authoritative corpus-based research of Oxford University Press's lexicographic team, the NODE data set is a powerful asset for any task dealing with real-world contemporary English usage. By integrating a number of different data types into a single structure, it creates a coherent resource which can be queried along numerous axes, allowing open-ended exploitation by many kinds of language-related applications.