The Spoken Wikipedia Corpora

The Spoken Wikipedia project unites volunteer readers of Wikipedia articles. Hundreds of spoken articles in multiple languages are available to users who are – for one reason or another – unable or unwilling to consume the written version of the article. We turn this speech resource into a time-aligned corpus, making it accessible for research and to foster new ways of interacting with the material.

The SWC is a corpus of aligned Spoken Wikipedia articles from the English, German, and Dutch Wikipedia. This corpus has several outstanding characteristics:

hundreds of hours of aligned audio
from a diverse set of readers
about a diverse set of topics
in a well-researched textual genre
licensed under a free license (CC BY-SA 4.0)
Annotations can be mapped back to the original html
phoneme-level alignments

NEWS

2018-07: State-of-the-Art Speech Recognition using SWC: Uses the SWC data to gain a WER reduction of 26%, achieves a WER of 14.
2018-06: Experiments regarding correlation between syntax and prosody using SWC: accepted @ Interspeech 2018! See below at "publications".
2017-12: Our journal article in Language Resources and Evaluation has been accepted! See below at "publications".
2017-11: Arne wrote a blog post with examples for working with SWC data (scroll down, it starts with validation and comparisons between XML and json)
2017-10: New release of the finalized SWC corpora (see below). More data, more quality, more annotation
Please participate in our rating experiment on speech quality (if you are a German speaker)!
Have a look at the amazing Wikipedia-Reader (example data) which allows to hyperlisten to the Spoken Wikipedia (and is based on the SWC); also have a look at the paper which proves the amazingness :-).
The SWC has its own ISLRN.

Publications

The journal article (don't have access to LRE? read the the accepted manuscript) (bibtex)

@Article{Baumann2018,
author="Baumann, Timo
and K{\"o}hn, Arne
and Hennig, Felix",
title="The Spoken Wikipedia Corpus collection: Harvesting, alignment and an application to hyperlistening",
journal="Language Resources and Evaluation",
year="2018",
month="Jan",
day="09",
issn="1574-0218",
doi="10.1007/s10579-017-9410-y",
url="https://doi.org/10.1007/s10579-017-9410-y"
}

Open Source Automatic Speech Recognition for German – Using 285h of German SWC data to obtain a free and high quality ASR
Interspeech 2018 paper about the correlation between prosody and syntax using SWC
Our original publication at LREC 2016. (bibtex)

@InProceedings{KHN16.518,
  author = {Arne K{\"o}hn and Florian Stegen and Timo Baumann},
  title = {Mining the Spoken Wikipedia for Speech Data and Beyond},
  booktitle = {Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016)},
  year = {2016},
  month = {may},
  date = {23-28},
  location = {Portorož, Slovenia},
  editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Marko Grobelnik and Bente Maegaard and Joseph Mariani and Asuncion Moreno and Jan Odijk and Stelios Piperidis},
  publisher = {European Language Resources Association (ELRA)},
  address = {Paris, France},
  isbn = {978-2-9517408-9-1},
  islrn = {684-927-624-257-3/},
  language = {english}
 }

Publication about our WikipediaReader that is based on the corpus published at SLPAT 2016.
Interspeech paper that ranks speakers in the SWC by likability based on crowdsourced preference ratings! See also the supplemental material.

Current Statistics

	German	English	Dutch
#articles	1010	1314	3073
#speakers	339	395	145
total audio	386h	395h	224h
aligned words	249h	182h	79h
phonetically aligned	129h	77h	—

The Annotation Format

Each article is tokenized into sections, sentences, and tokens. Each token is normalized and the normalization is aligned to the audio.

Annotation layer visualized

Exemplary annotation of “500 hours of audio.” with SWC annotation that binds text to audio above and HTML markup that adds hypertextuality below. The SWC annotation marks sentences (s), tokens (t) and adds normalization information (n), which refers to the audio. Note that the whitespace ([ ]) between words are original characters that are attached to the sentence but are not part of any token.

We treat the html as a second layer of annotation to the plain text, as can be seen above. Both annotations are linked by the exact text correspondence. In addition, we have a phoneme-level alignment, which is not pictured.

Have a look at the SWC schema definition (RelaxNG compact), which defines and explains the annotation in detail.

Download Current Release (Oct 2017):

For our release of English, German and Dutch data, please see http://hdl.handle.net/11022/0000-0007-C641-0.

If you use this data, please cite our paper: bibtex

@InProceedings{KHN16.518,
  author = {Arne K{\"o}hn and Florian Stegen and Timo Baumann},
  title = {Mining the Spoken Wikipedia for Speech Data and Beyond},
  booktitle = {Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016)},
  year = {2016},
  month = {may},
  date = {23-28},
  location = {Portorož, Slovenia},
  editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Marko Grobelnik and Bente Maegaard and Joseph Mariani and Asuncion Moreno and Jan Odijk and Stelios Piperidis},
  publisher = {European Language Resources Association (ELRA)},
  address = {Paris, France},
  isbn = {978-2-9517408-9-1},
  islrn = {684-927-624-257-3/},
  language = {english}
 }

Software

Our master script will download all required software (including the ones listed below) and do all the alignment work for you. You can find the source code for the download and alignment pipeline at https://bitbucket.org/natsuhh/swc.

Original Release (Spring 2016):

Note that this release has a completely different annotation schema, use the new release! We only provide this download for historical purposes.

Aligned text (XML format):
- German (17MB)
- English (19MB)
corresponding audio:
- German (11GB)
- English (12GB)
LICENSE: CC BY-SA-3.0. Each folder in the audio contains a info.json with the original URL of both audio and text, and the speaker's name.