The Arboretum treebank is a morphologically and syntactically annotated repository of Danish sentences taken from Korpus 90 and Korpus 2000. Both corpora were compiled by the Society for Danish Language and Literature (http://ordnet.dk/korpusdk/fakta) and contain samples of written Danish from the 1990s and from around the year 2000, respectively. The treebank consists of about 425,000 tokens, with ca. 22,260 sentences/utterances containing 3 or more tokens.
In a first pass, all material was tokenized and tagged with the DanGram parser, using hand-written Constraint Grammar rules. In the next stage, the parser's dependency grammar and constituent conversion were applied to produce full syntactic tree structures. The automatic annotation was then revised at both the morphosyntactic and the structural levels, with iterative improvements made to the parser at the same time.
Arboretum provides named entity categories for all proper nouns. It also contains subclass categorisation for the pronoun and adverb word classes, facilitating conversion to different descriptive traditions. In addition, the dependency version contains structural markers concerning coordination and clause boundaries, as well as some morphological information concerning compounding.
The final treebank comes in two independent versions, constituent trees and dependency trees, and is distributed in the following formats:
1. Native dependency format (Constraint Grammar format)
2. Dependency annotation converted to MALT XML format
3. Native constituent tree format (Cross-language VISL standard)
4. Constituent format converted to TIGER XML
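
As a rough illustration of how dependency links in the native Constraint Grammar format (item 1) might be read, the sketch below extracts child/head pairs from token lines. It assumes that each token line carries a marker of the form #child->head, as in CG-3 style dependency output, with head 0 marking the root. The function name read_dependencies, the sample lines and their tags are invented for the example and are not taken from the treebank files, so the actual layout should be checked against the distributed data.

```python
import re

# Assumption: each token line carries a "#child->head" marker (CG-3 style),
# where head 0 denotes the root. Not verified against the Arboretum files.
DEP_MARKER = re.compile(r"#(\d+)->(\d+)")


def read_dependencies(lines):
    """Return (child, head) pairs for every line that has a dependency marker."""
    pairs = []
    for line in lines:
        match = DEP_MARKER.search(line)
        if match:
            pairs.append((int(match.group(1)), int(match.group(2))))
    return pairs


# Invented sample lines; the tags are placeholders, not real treebank output.
sample = [
    "Hun [hun] PERS @SUBJ> #1->2",
    "sover [sove] V PR @FS-STA #2->0",
]

print(read_dependencies(sample))  # [(1, 2), (2, 0)]
```

Lines without a marker are skipped rather than treated as errors, which keeps the sketch tolerant of sentence delimiters or metadata lines that may appear between tokens.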