The ComiGS corpus is licensed under a Creative Commons Attribution 4.0 International License, with the following exceptions: the script compute_kappa.py is dual-licensed under the MIT license and the Apache License 2.0, and for the images in ./img local copyright laws apply. In Germany, the images have been “gemeinfrei” (roughly equivalent to public domain) since 2015-01-01.
img/
  the picture stories:
  1_moral_mit_wespen.png
    “Moral mit Wespen” (Moral with wasps)
  2_der_schmoeker.png
    “Der Schmöker” (The page-turner)
  3_was_zuviel_ist_ist_zuviel.png
    “Was zuviel ist, ist zuviel” (Enough is enough)

raw_text/
  contains the texts as typed by the participants. File name format:
  PARTICIPANTID_STORYID_IMAGENO.txt, where STORYID is one of 1, 2, 3 and
  every story has 6 images with IMAGENO numbers 1 to 6.

set1/
  data for set 1
  cda/
    cda files
    zh1/
      cda files for target hypothesis 1 (ZH1), one directory per
      participant. Directory name format: zh1_PARTICIPANTID. File name
      format: PARTICIPANTID_STORYID_SENTENCENO.txt.cda, where STORYID is
      one of 1, 2, 3 and SENTENCENO is the number of the sentence in that
      layer.
    zh2/
      same as zh1/ but for target hypothesis 2 (ZH2)
  conll/
    CoNLL-X files, converted from the cda files with convert-cda2conll.py.
    The directory is structured the same way as cda/.
  texts/
    ctok/
      texts from layer ctok, split into sentences (according to layer
      ctokS). File name format: PARTICIPANTID_STORYID_SENTENCENO.txt
    zh1/
      same as ctok/ but for target hypothesis 1 (ZH1) and the segmentation
      layer ZH1S
    zh2/
      same as ctok/ but for target hypothesis 2 (ZH2) and the segmentation
      layer ZH2S
  xml/
    XML files in the EXMARaLDA basic transcription format. File name
    format: PARTICIPANTID_STORYID.exb. Annotation layers:
    imageID
      spans for the images, image name format: STORYID-IMAGENO
    tok
      token layer, automatically tokenized with Seg
    tokS
      sentence segmentation for tok, automatically segmented with Seg
    ctok
      token layer, manually corrected
    ctokS
      sentence segmentation for ctok, manually corrected
    ZH1
      target hypothesis 1, manual annotation
    ZH1S
      sentence segmentation for ZH1, manually corrected
    ZH1tokmovid
      layer with ids indicating token movements and joins and/or splits of
      non-contiguous tokens between ctok and ZH1, manual annotation
    ZH2
      target hypothesis 2, manual annotation
    ZH2S
      sentence segmentation for ZH2, manually corrected
    ZH2tokmovid
      layer with ids indicating token movements and joins and/or splits of
      non-contiguous tokens between ctok and ZH2, manual annotation

set2/
  data for set 2, same structure as set 1 except that the directories are
  subdivided into annotator A and annotator B.

l1.org
  the participants’ first languages

proficiency_levels.org
  the participants’ proficiency levels

compute_kappa.py
  computes corpus statistics and inter-annotator agreements.
  kappa_positions measures the agreement on which tokens to change
  (regardless of whether the correction is the same). kappa measures
  identical changes (position and correction). Please note that
  kappa_positions is a valid computation of Cohen’s κ but kappa is not
  (see below for a discussion).

compute_kappa_output
  output of compute_kappa.py; the values are higher than those reported in
  the paper due to corrections in both annotations.

The syntactic, PoS and lemma annotations (for the target hypotheses) were performed with the annotation frontend AnnoViewer of jwcdg. In contrast to the syntactic and PoS annotations, the lemma annotation is incomplete (approximately 60 lemmas) due to missing vocabulary entries in the annotation interface, and it was not checked for correctness after the initial annotation. The cda and conll files contain morphological information which was derived automatically as a byproduct of using the annotation interface and which was not manually checked.
The picture stories are by Erich Ohser, a German cartoonist. In 1944, Ohser and his friend Erich Knauf were arrested for making anti-Nazi jokes. Ohser committed suicide in prison the day before his trial; Erich Knauf was beheaded after a trial at the Volksgerichtshof.
We wanted to compute Cohen’s κ not only for the agreement on which tokens to change but also to include the corrections themselves in the computation. As it turns out, this is not a valid computation of Cohen’s κ: the vocabulary is an open set and the annotation task is open-ended, so the categories are never exhaustive. However, exhaustiveness is one of the requirements:
The categories of the nominal scale are independent, mutually exclusive, and exhaustive. – Cohen (1960)
Therefore, kappa as calculated by compute_kappa.py is not Cohen’s κ. The problem is that the chance agreement pe has to be estimated, but for an open-ended task there is no straightforward way to approximate it. For kappa, we estimated pe as the chance agreement that both annotators perform a change on the same token, i.e. we used the same pe for kappa and kappa_positions. For kappa, this choice of pe is conservative, i.e. it tends to underestimate agreement: it assumes that whenever both annotators change a token, they agree on the correction (100% agreement on the correction given that a token is corrected). pe therefore cannot underestimate chance agreement, and kappa does not overestimate agreement.
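The shared-pe construction described above can be illustrated with a small sketch. This is not the actual compute_kappa.py; the helper names and the toy data are my own:

```python
def cohens_kappa(po, pe):
    """Cohen's kappa from observed agreement po and chance agreement pe."""
    return (po - pe) / (1 - pe)

def agreements(n_tokens, ann_a, ann_b):
    """ann_a / ann_b map token index -> correction string, containing only
    the tokens that the respective annotator changed."""
    changed_a, changed_b = set(ann_a), set(ann_b)
    both = changed_a & changed_b
    neither = n_tokens - len(changed_a | changed_b)

    # kappa_positions: agreement on *which* tokens to change.
    po_positions = (len(both) + neither) / n_tokens

    # pe: chance that two independent annotators change (or keep) a token.
    pa, pb = len(changed_a) / n_tokens, len(changed_b) / n_tokens
    pe = pa * pb + (1 - pa) * (1 - pb)

    # kappa: a changed token only counts as agreement if the correction is
    # identical; pe is reused (the conservative choice described above).
    same = sum(1 for t in both if ann_a[t] == ann_b[t])
    po_identical = (same + neither) / n_tokens

    return cohens_kappa(po_positions, pe), cohens_kappa(po_identical, pe)

# Toy example: 50 tokens, three changes per annotator, one disagreement on
# the correction itself (token 7).
a = {3: "der", 7: "Hund", 12: "läuft"}
b = {3: "der", 7: "Hunde", 20: "schnell"}
kappa_positions, kappa = agreements(50, a, b)
```

Because pe is shared, kappa is strictly lower than kappa_positions as soon as the annotators disagree on at least one correction.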
All in all, kappa_positions is a Cohen’s κ and kappa is not. Perhaps agreement computations simply should not be boiled down to a single number; compute_kappa.py also outputs the relevant raw numbers, which can be used to illustrate the level of agreement between annotators.
If you make use of the corpus, please cite Christine Köhn and Arne Köhn: An Annotated Corpus of Picture Stories Retold by Language Learners, which also describes the corpus in detail.
@InProceedings{Koehn2018-comigs,
  author    = "K{\"o}hn, Christine and K{\"o}hn, Arne",
  title     = "An Annotated Corpus of Picture Stories Retold by Language Learners",
  booktitle = "Proceedings of the Joint Workshop on Linguistic Annotation, Multiword Expressions and Constructions (LAW-MWE-CxG-2018)",
  year      = "2018",
  publisher = "Association for Computational Linguistics",
  pages     = "121--132",
  location  = "Santa Fe, New Mexico, USA",
  url       = "http://aclweb.org/anthology/W18-4914"
}