The Hamburg Dependency Treebank
The Hamburg Dependency Treebank (HDT) is, to our knowledge, the largest dependency treebank currently available. Its annotations are genuine dependency annotations, i.e. they were not converted from phrase structures. The HDT is free for scientific and academic use.
All sentences were sourced from the German news site heise.de, from articles published between 1996 and 2001. The articles range from formulaic periodic updates on new BIOS revisions and processor models or quarterly earnings of tech companies, through features on general trends in the hardware and software market, to general coverage of social, legal and political issues in cyberspace, sometimes in the form of extensive weekly editorial comments. The mapping from sentences to articles and authors is retained, which allows, for example, the analysis of individual style. The manual annotation of the treebank was largely interleaved with the development of a standard for the morphological and syntactic annotation of sentences and of a constraint-based parser.
If you have questions regarding the HDT, send an email to hdt at informatik.uni-hamburg.de
The HDT consists of three parts:
- manually annotated and checked for consistency with DECCA (part A, 101,999 sentences)
- manually annotated but not checked with DECCA (part B, 104,795 sentences)
- automatically parsed with WCDG (part C, 55,027 sentences)
Download the HDT from the HZSK
UD conversion
There is a UD conversion of the HDT, performed by our TrUDucer tool. It has been part of the UD releases since version 2.4 and can be obtained from the UD_German-HDT GitHub repository. The dev branch contains the newest conversion. The conversion currently consists of nearly 3.4M tokens from parts A and B.
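UD treebanks are distributed in the CoNLL-U format: one token per line with ten tab-separated columns, comment lines starting with `#`, and a blank line after each sentence. The following is a minimal sketch of how such a file could be read with the Python standard library alone. The function name `read_conllu` and the sample sentence are illustrative assumptions; the sample is invented and not taken from the treebank.

```python
from collections import Counter


def read_conllu(lines):
    """Yield each sentence as a list of token dicts (hypothetical helper)."""
    sentence = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        if line.startswith("#"):
            continue  # sentence-level comment, e.g. "# text = ..."
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed token line; skipped in this sketch
        tok_id = cols[0]
        if "-" in tok_id or "." in tok_id:
            continue  # multiword token ranges and empty nodes
        sentence.append({
            "id": int(tok_id),
            "form": cols[1],
            "lemma": cols[2],
            "upos": cols[3],
            "head": int(cols[6]) if cols[6] != "_" else None,
            "deprel": cols[7],
        })
    if sentence:
        yield sentence


# Invented example sentence in CoNLL-U layout (not from the HDT).
SAMPLE = (
    "# text = Das ist gut.\n"
    "1\tDas\tder\tPRON\t_\t_\t3\tnsubj\t_\t_\n"
    "2\tist\tsein\tAUX\t_\t_\t3\tcop\t_\t_\n"
    "3\tgut\tgut\tADJ\t_\t_\t0\troot\t_\t_\n"
    "4\t.\t.\tPUNCT\t_\t_\t3\tpunct\t_\t_\n"
    "\n"
)

sentences = list(read_conllu(SAMPLE.splitlines()))
# Count dependency relation labels across all sentences read.
deprels = Counter(tok["deprel"] for sent in sentences for tok in sent)
```

To process a file from the repository, the same function can be given an open file handle instead of `SAMPLE.splitlines()`, since it only iterates over lines. Reading sentence by sentence keeps memory use low, which matters for a conversion of this size.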
Publications
- HDT-UD: A very large Universal Dependencies Treebank for German – describes the conversion to UD
- Because Size Does Matter: The Hamburg Dependency Treebank – the paper describing the HDT
- Eine umfassende Constraint-Dependenz-Grammatik des Deutschen (A comprehensive constraint dependency grammar of German) – the annotation guidelines
Software
- the toolbox, containing all sorts of helper scripts
- cda_parse, a Python library for parsing cda files
- cobacose, a web-based treebank search system
- jwcdg, the successor of the parser used for initial automatic annotation