12th Annual Conference of the
International Speech Communication Association
Interspeech 2011 Florence
Technical Programme
This is the final programme for this session. For oral sessions, the timing on the left is the current presentation order, but this may still change, so please check at the conference itself.
Sun-Ses3-S1-P: Crowdsourcing for Speech Processing II
| Time: | Sunday 17:00 |
| Place: | Caravaggio (Adua 1) - Pala Affari - 1st Floor |
| Type: | Poster |
| Chair: | Maxine Eskenazi, David Suendermann, Gina-Anne Levow |
| #1 | A Transcription Task for Crowdsourcing with Automatic Quality Control
Chia-ying Lee (MIT Computer Science and Artificial Intelligence Laboratory) James Glass (MIT Computer Science and Artificial Intelligence Laboratory)
In this paper, we propose a two-stage transcription task design for crowdsourcing with an automatic quality control mechanism embedded in each stage. In the first stage, a support vector machine (SVM) classifier is utilized to quickly filter poor quality transcripts based on acoustic cues and language patterns in the transcript. In the second stage, word level confidence scores are used to estimate transcription quality and provide instantaneous feedback to the transcriber. The proposed design was evaluated using Amazon Mechanical Turk (MTurk) and tested on seven hours of academic lecture speech, which is typically conversational in nature and contains technical material. Compared to baseline transcripts, which were also collected from MTurk using a ROVER-based method, we observed that the new method resulted in higher quality transcripts while requiring less transcriber effort.
|
| #2 | Reliability-Weighted Acoustic Model Adaptation Using Crowd Sourced Transcriptions
Kartik Audhkhasi (University of Southern California) Panayiotis G. Georgiou (University of Southern California) Shrikanth S. Narayanan (University of Southern California)
This paper focuses on adaptation of acoustic models using speech transcribed by multiple noisy experts. A simple approach involves combining multiple transcripts using word frequency based Recognizer Output Voting Error Reduction (ROVER) followed by adaptation using the combined transcripts. But this assumes that the transcripts being combined are equally reliable. To overcome this assumption, we use two sets of scores to estimate this reliability. The first set is based on answers to some questions given by the transcribers. The second set is derived in an unsupervised way using the word frequency based ROVER transcripts and baseline acoustic models. The overall confidence is a convex combination of these scores and is used to perform a confidence weighted fusion. We adapt the baseline acoustic models using these combined transcripts. Recognition results for a Mexican Spanish ASR system show an absolute improvement of 0.5% in word error rate and 0.9% in sentence error rate.
|
| #3 | Crowdsourcing for word recognition in noise
Martin Cooke (Ikerbasque (Basque Science Foundation), Spain) Jon Barker (Department of Computer Science, University of Sheffield, UK) Maria Luisa Garcia Lecumberri (Language and Speech Laboratory, University of the Basque Country, Spain) Krzysztof Wasilewski (Department of Computer Science, University of Sheffield, UK)
Access to large samples of listeners is an appealing prospect for speech perception researchers, but lack of control over key factors such as listeners' linguistic backgrounds and quality of stimulus delivery is a formidable barrier to the application of crowdsourcing. We describe the outcome of a web-based listening experiment designed to discover consistent confusions amongst words presented in noise, alongside an identical task carried out using traditional laboratory methods. Web listeners were graded based on information they provided as well as via their responses to tokens recognised robustly by a majority of participants. While overall word identification scores even for the best-performing web subset were well below those obtained in the laboratory, word confusions with high levels of cross-listener agreement were obtained nevertheless, suggesting that focused application of crowdsourcing in speech perception can provide useful data for scientific analysis.
|
| #4 | Crowdsourcing preference tests, and how to detect cheating
Sabine Buchholz (Toshiba Research Europe Ltd., Cambridge Research Lab, Cambridge, UK) Javier Latorre (Toshiba Research Europe Ltd., Cambridge Research Lab, Cambridge, UK)
We describe an approach to crowdsource the evaluation of TTS systems by preference tests and report on lessons learnt from running 127 real-life crowdsourced tests. We show that at least one type of cheating becomes more prevalent over time if left unchecked and develop metrics to exclude cheats. We demonstrate that their exclusion improves test outcomes.
|
| #5 | Growing a Spoken Language Interface on Amazon Mechanical Turk
Ian McGraw (MIT) James Glass (MIT) Stephanie Seneff (MIT)
Typically data collection, transcription, language model generation, and deployment are separate phases of creating a spoken language interface. An unfortunate consequence of this is that the recognizer usually remains a static element of systems often deployed in dynamic environments. By providing an API for human intelligence, Amazon Mechanical Turk changes the way system developers can construct spoken language systems. In this work, we describe an architecture that automates and connects these four phases, effectively allowing the developer to grow a spoken language interface. In particular, we show that a human-in-the-loop programming paradigm, in which workers transcribe utterances behind the scenes, can alleviate the need for expert guidance in language model construction. We demonstrate the utility of these organic language models in a voice-search interface for photographs.
|
| #6 | Real user evaluation of spoken dialogue systems using Amazon Mechanical Turk
Filip Jurčíček (Engineering Department, Cambridge University) Simon Keizer (Engineering Department, Cambridge University) Milica Gasic (Engineering Department, Cambridge University) Francois Mairesse (Engineering Department, Cambridge University) Blaise Thomson (Engineering Department, Cambridge University) Kai Yu (Engineering Department, Cambridge University) Steve Young (Engineering Department, Cambridge University)
This paper describes a framework for evaluation of spoken dialogue systems. Typically, evaluation of dialogue systems is performed in a controlled test environment with carefully selected and instructed users. However, this approach is very demanding. An alternative is to recruit a large group of users who evaluate the dialogue systems in a remote setting under virtually no supervision. Crowdsourcing technology, for example Amazon Mechanical Turk (AMT), provides an efficient way of recruiting subjects. This paper describes an evaluation framework for spoken dialogue systems using AMT users and compares the obtained results with a recent trial in which the systems were tested by locally recruited users. The results suggest that the use of crowdsourcing technology is feasible and it can provide reliable results.
|
| #7 | Quality assessment of crowdsourcing transcriptions for African languages
Hadrien Gelas (Laboratoire Dynamique Du Langage, CNRS - Université de Lyon, France and Laboratoire Informatique de Grenoble, CNRS - Université Joseph Fourier Grenoble, France) Solomon Teferra Abate (Laboratoire Informatique de Grenoble, CNRS - Université Joseph Fourier Grenoble, France) Laurent Besacier (Laboratoire Informatique de Grenoble, CNRS - Université Joseph Fourier Grenoble, France) François Pellegrino (Laboratoire Dynamique Du Langage, CNRS - Université de Lyon, France)
We evaluate the quality of speech transcriptions acquired by crowdsourcing to develop ASR acoustic models (AM) for under-resourced languages. We have developed AMs using reference (REF) transcriptions and transcriptions from crowdsourcing (TRK) for Swahili and Amharic. While the Amharic transcription took much longer to complete than the Swahili transcription, the speech recognition systems developed using REF and TRK transcriptions have nearly the same word recognition error rates (40.1 vs 39.6 for Amharic and 38.0 vs 38.5 for Swahili). Moreover, the character level disagreement rates between REF and TRK are only 3.3% and 6.1% for Amharic and Swahili, respectively. We conclude that it is possible to acquire quality transcriptions from the crowd for under-resourced languages using Amazon's Mechanical Turk. Given this potential, we also raise some legal and ethical issues to consider.
|
| #8 | Using crowdsourcing to provide prosodic annotations for non-native speech
Keelan Evanini (Educational Testing Service) Klaus Zechner (Educational Testing Service)
We present the results of an experiment in which 2 expert and 11 naive annotators provided prosodic annotations for stress and boundary tones on a corpus of spontaneous speech produced by non-native speakers of English. The results show that agreement rates were higher for boundary tones than for stress. In addition, a crowdsourcing approach was implemented to combine the naive annotations to increase accuracy. The crowdsourcing approach was able to match expert agreement for stress (62.1%) with 3 naive annotators, and come within 7.2% of expert agreement for boundary tones (82.4%) with 11 naive annotators. This experiment also demonstrates that noticeable improvements in naive annotations can be obtained with a small amount of additional training.
|
| #9 | PodCastle: Recent Advances of a Spoken Document Retrieval Service Improved by Anonymous User Contributions
Masataka Goto (National Institute of Advanced Industrial Science and Technology (AIST)) Jun Ogata (National Institute of Advanced Industrial Science and Technology (AIST))
In this paper, we introduce recent advances of a speech retrieval web service, PodCastle, that collects and amplifies voluntary contributions by anonymous users. Our goal is to provide users with a public web service based on speech recognition and crowdsourcing so that they can experience state-of-the-art speech recognition performance through a useful service. PodCastle enables users to find speech data (such as podcasts and YouTube video clips) that include a search term, read full texts of their recognition results, and easily correct recognition errors by simply selecting from a list of candidates. The resulting corrections were used to improve both speech retrieval and recognition performance. In our experience from its practical use over the past four years (since December 2006), over half a million recognition errors in about one hundred thousand speech data were corrected by anonymous users, and we confirmed that the speech recognition performance of PodCastle was actually improved by those corrections.
|