The ComiGS corpus: Comic Strips Retold by Learners of German

For the ComiGS corpus, learners of German retold picture stories as the one shown below. The retelling task mildly constrains learner utterances to facilitate consistent annotation and reliable automatic processing but at the same time does not prime learners with textual information. We designed the corpus to be able to serve multiple purposes: The corpus is manually annotated, including target hypotheses and syntactic structures, mainly following the Falko guidelines. Due to the tasks nature, the inter-annotator agreement for the annotation of minimal and extended target hypotheses is very high (κ = 0.78 and κ = 0.53, respectively). If only considering which token is changed: κ = 0.87 and κ = 0.75. Note: These numbers are slightly higher than reported in the paper due to errors corrected in the annotations.


Moral mit Wespen

About the picture stories

The picture stories are by Erich Ohser, a German cartoonist. On the right you can see one of the stories used in the corpus, “Moral mit Wespen” (Moral with wasps).

In 1944, Ohser and his friend Erich Knauf were arrested for making anti Nazi jokes. Ohser committed suicide in prison a day before his trial, Erich Knauf was decapitated after a trial at the Volksgerichtshof.

Contents of the corpus

Thirty language learners described two to three picture stories,

resulting in seventy texts with nearly 1.5k sentences (18k tokens) in total. With an average of 12 tokens, the sentences are comparatively long.

The data is manually tokenized and annotated with minimal and extended target hypotheses. The corrections are manually annotated with dependency structures following the HDT using the schema by Foth (2006).

For 21 learners, all keystrokes by the learners were recorded, enabling researching the production process.

Download the corpus

Download: ComiGS corpus v0.8

If you make use of the corpus, please cite Christine Köhn and Arne Köhn: An Annotated Corpus of Comic Strips Stories Written by Language Learners (bibtex), which also describes the corpus in detail.


@InProceedings{Koehn2018-comigs,
  author =  "K{\"o}hn, Christine and K{\"o}hn, Arne",
  title =   "An Annotated Corpus of Picture Stories Retold by Language Learners",
  booktitle =   "Proceedings of the Joint Workshop on Linguistic Annotation, Multiword Expressions and Constructions (LAW-MWE-CxG-2018)",
  year =    "2018",
  publisher =   "Association for Computational Linguistics",
  pages =   "121--132",
  location =    "Santa Fe, New Mexico, USA",
  url =     "http://aclweb.org/anthology/W18-4914"
}

The ComiGS corpus is licensed under a Creative Commons Attribution 4.0 International License.
Creative
Commons Licence

Working with the data

The annotation of the target hypotheses was performed using EXMARaLDA. The annotation is encoded in XML.

The syntactic, PoS and lemma annotations (for the target hypotheses) were performed using the annotation frontend AnnoViewer of jwcdg. We will provide jwcdg's native format cda as well as CoNLL-X.

The ComiGS corpus can be automatically tagged with errors using Gerrant: Inga Kempfert and Christine Köhn: An Automatic Error Tagger for German

links

social