The Spoken Wikipedia Corpora

The Spoken Wikipedia project unites volunteer readers of Wikipedia articles. Hundreds of spoken articles in multiple languages are available to users who are, for one reason or another, unable or unwilling to consume the written version of an article. We turn this speech resource into a time-aligned corpus, making it accessible for research and fostering new ways of interacting with the material.

The SWC is a corpus of aligned Spoken Wikipedia articles from the English, German, and Dutch Wikipedia. This corpus has several outstanding characteristics:


Current Statistics

                      German  English  Dutch
#articles               1010     1314   3073
#speakers                339      395    145
total audio             386h     395h   224h
aligned words           249h     182h    79h
phonetically aligned    129h      77h
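
As a quick reading aid for the table above, the share of the recorded audio that is covered by word-level alignment can be computed per language. The sketch below only divides the "aligned words" hours by the "total audio" hours from the table; the names HOURS and coverage are made up for this illustration and are not part of the SWC tooling.

```python
# Hours taken from the statistics table: (total audio, aligned words).
HOURS = {
    "German": (386, 249),
    "English": (395, 182),
    "Dutch": (224, 79),
}


def coverage(total_hours, aligned_hours):
    """Return the fraction of the audio that has word-level alignment."""
    return aligned_hours / total_hours


if __name__ == "__main__":
    for language, (total, aligned) in HOURS.items():
        print(f"{language}: {coverage(total, aligned):.1%} of the audio is word-aligned")
```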

The Annotation Format

Each article is segmented into sections, sentences, and tokens. Each token is normalized, and the normalization is aligned to the audio.

[Figure: annotation layers visualized]
Example annotation of “500 hours of audio.” The SWC annotation, which binds text to audio, is shown above the text; the HTML markup, which adds hypertextuality, is shown below it. The SWC annotation marks sentences (s) and tokens (t) and adds normalization information (n), which refers to the audio. Note that the whitespace characters ([ ]) between words are original characters that are attached to the sentence but are not part of any token.

We treat the HTML as a second annotation layer over the plain text, as can be seen above. The two annotations are linked by exact text correspondence. In addition, we provide a phoneme-level alignment, which is not pictured.

Have a look at the SWC schema definition (RELAX NG compact syntax), which defines and explains the annotation in detail.
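
To make the structure described above more concrete, here is a minimal Python sketch that reads one annotated sentence. Only the element names s, t, and n come from the description above. Everything else is an assumption made for illustration: the attribute names pronunciation, start, and end, the millisecond values, the placement of token text as element content, and the hand-written SAMPLE fragment itself. The schema definition remains the authoritative reference for the real format.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment for "500 hours of audio."
# Element names (s, t, n) follow the description above; the attribute names
# and the timing values are assumptions, not taken from the SWC schema.
SAMPLE = (
    "<s>"
    '<t>500<n pronunciation="five" start="0" end="310"/>'
    '<n pronunciation="hundred" start="310" end="720"/></t>'
    " "
    '<t>hours<n pronunciation="hours" start="720" end="1150"/></t>'
    " "
    '<t>of<n pronunciation="of" start="1150" end="1260"/></t>'
    " "
    '<t>audio<n pronunciation="audio" start="1260" end="1800"/></t>'
    "<t>.</t>"
    "</s>"
)


def sentence_text(sentence):
    """Rebuild the original text, including whitespace outside the tokens."""
    return "".join(sentence.itertext())


def aligned_tokens(sentence):
    """Return (token, [(normalized word, start, end), ...]) for each token."""
    rows = []
    for token in sentence.iter("t"):
        norms = [
            (n.get("pronunciation"), int(n.get("start")), int(n.get("end")))
            for n in token.findall("n")
            if n.get("start") is not None and n.get("end") is not None
        ]
        rows.append((token.text or "", norms))
    return rows


root = ET.fromstring(SAMPLE)

if __name__ == "__main__":
    print(sentence_text(root))
    for token, norms in aligned_tokens(root):
        print(repr(token), norms)
```

In this sketch, one token ("500") carries two normalization entries and the final punctuation token carries none. Because the whitespace lives between the token elements, concatenating all text content reproduces the original string exactly, which is the kind of text correspondence that a second annotation layer could be linked against.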

Download Current Release (Oct 2017):

If you use this data, please cite our paper (BibTeX):

@InProceedings{KHN16.518,
  author = {Arne K{\"o}hn and Florian Stegen and Timo Baumann},
  title = {Mining the Spoken Wikipedia for Speech Data and Beyond},
  booktitle = {Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016)},
  year = {2016},
  month = {may},
  date = {23-28},
  location = {Portorož, Slovenia},
  editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Marko Grobelnik and Bente Maegaard and Joseph Mariani and Asuncion Moreno and Jan Odijk and Stelios Piperidis},
  publisher = {European Language Resources Association (ELRA)},
  address = {Paris, France},
  isbn = {978-2-9517408-9-1},
  islrn = {684-927-624-257-3/},
  language = {english}
 }

Software

Our master script will download all required software (including the ones listed below) and do all the alignment work for you. You can find the source code for the download and alignment pipeline at https://bitbucket.org/natsuhh/swc.

Original Release (Spring 2016):

Note that this release uses a completely different annotation schema; please use the current release instead. We provide this download for historical purposes only.
