The Spoken Wikipedia Corpora
The SWC is a corpus of aligned Spoken Wikipedia articles from the English, German, and Dutch Wikipedia. This corpus has several outstanding characteristics:
- hundreds of hours of aligned audio
- from a diverse set of readers
- about a diverse set of topics
- in a well-researched textual genre
- licensed under a free license (CC BY-SA 4.0)
- Annotations can be mapped back to the original html
- phoneme-level alignments
NEWS
- 2018-07: State-of-the-Art Speech Recognition using SWC: Uses the SWC data to gain a WER reduction of 26%, achieves a WER of 14.
- 2018-06: Experiments regarding correlation between syntax and prosody using SWC: accepted @ Interspeech 2018! See below at "publications".
- 2017-12: Our journal article in Language Resources and Evaluation has been accepted! See below at "publications".
- 2017-11: Arne wrote a blog post with examples for working with SWC data (scroll down, it starts with validation and comparisons between XML and json)
- 2017-10: New release of the finalized SWC corpora (see below). More data, more quality, more annotation
- Please participate in our rating experiment on speech quality (if you are a German speaker)!
- Have a look at the amazing Wikipedia-Reader (example data) which allows to hyperlisten to the Spoken Wikipedia (and is based on the SWC); also have a look at the paper which proves the amazingness :-).
- The SWC has its own ISLRN.
Publications
- The journal article (don't have access to LRE? read the the accepted manuscript) (bibtex)
@Article{Baumann2018, author="Baumann, Timo and K{\"o}hn, Arne and Hennig, Felix", title="The Spoken Wikipedia Corpus collection: Harvesting, alignment and an application to hyperlistening", journal="Language Resources and Evaluation", year="2018", month="Jan", day="09", issn="1574-0218", doi="10.1007/s10579-017-9410-y", url="https://doi.org/10.1007/s10579-017-9410-y" }
- Open Source Automatic Speech Recognition for German – Using 285h of German SWC data to obtain a free and high quality ASR
-
Interspeech 2018 paper about the correlation between prosody and syntax using SWC
@InProceedings{KHN16.518, author = {Arne K{\"o}hn and Florian Stegen and Timo Baumann}, title = {Mining the Spoken Wikipedia for Speech Data and Beyond}, booktitle = {Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016)}, year = {2016}, month = {may}, date = {23-28}, location = {Portorož, Slovenia}, editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Marko Grobelnik and Bente Maegaard and Joseph Mariani and Asuncion Moreno and Jan Odijk and Stelios Piperidis}, publisher = {European Language Resources Association (ELRA)}, address = {Paris, France}, isbn = {978-2-9517408-9-1}, islrn = {684-927-624-257-3/}, language = {english} }
- Publication about our WikipediaReader that is based on the corpus published at SLPAT 2016.
- Interspeech paper that ranks speakers in the SWC by likability based on crowdsourced preference ratings! See also the supplemental material.
Current Statistics
German | English | Dutch | |
---|---|---|---|
#articles | 1010 | 1314 | 3073 |
#speakers | 339 | 395 | 145 |
total audio | 386h | 395h | 224h |
aligned words | 249h | 182h | 79h |
phonetically aligned | 129h | 77h | — |
The Annotation Format
Each article is tokenized into sections, sentences, and tokens. Each token is normalized and the normalization is aligned to the audio.
Exemplary annotation of “500 hours of audio.” with SWC annotation that binds text to audio above and HTML markup that adds hypertextuality below. The SWC annotation marks sentences (s), tokens (t) and adds normalization information (n), which refers to the audio. Note that the whitespace ([ ]) between words are original characters that are attached to the sentence but are not part of any token.
We treat the html as a second layer of annotation to the plain text, as can be seen above. Both annotations are linked by the exact text correspondence. In addition, we have a phoneme-level alignment, which is not pictured.
Have a look at the SWC schema definition (RelaxNG compact), which defines and explains the annotation in detail.
Download Current Release (Oct 2017):
For our release of English, German and Dutch data, please see http://hdl.handle.net/11022/0000-0007-C641-0.
If you use this data, please cite our paper: bibtex
@InProceedings{KHN16.518, author = {Arne K{\"o}hn and Florian Stegen and Timo Baumann}, title = {Mining the Spoken Wikipedia for Speech Data and Beyond}, booktitle = {Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016)}, year = {2016}, month = {may}, date = {23-28}, location = {Portorož, Slovenia}, editor = {Nicoletta Calzolari (Conference Chair) and Khalid Choukri and Thierry Declerck and Marko Grobelnik and Bente Maegaard and Joseph Mariani and Asuncion Moreno and Jan Odijk and Stelios Piperidis}, publisher = {European Language Resources Association (ELRA)}, address = {Paris, France}, isbn = {978-2-9517408-9-1}, islrn = {684-927-624-257-3/}, language = {english} }
Software
Our master script will download all required software (including the ones listed below) and do all the alignment work for you. You can find the source code for the download and alignment pipeline at https://bitbucket.org/natsuhh/swc.
Original Release (Spring 2016):
Note that this release has a completely different annotation schema, use the new release! We only provide this download for historical purposes.