ISCA

12th Annual Conference of the
International Speech Communication Association

Interspeech 2011 Florence

Technical Programme

This is the final programme for this session. For oral sessions, the timing on the left is the current presentation order, but this may still change, so please check at the conference itself.

Sun-Ses3-P4:
Spoken Language Resources, Evaluation and Standardization I

Time: Sunday 16:00
Place: Faenza 2 - Pala Congressi (Passi Perduti-Gallery)
Type: Poster
Chair: Sebastian Moeller

#1 Measurement of Objective Intelligibility of Japanese Accented English Using ERJ (English Read by Japanese) Database

Nobuaki Minematsu (The Univ. of Tokyo)
Koji Okabe (NEC Corporation)
Keisuke Ogaki (The Univ. of Tokyo)
Keikichi Hirose (The Univ. of Tokyo)

In many schools, English is taught as an international communication tool, and the goal of English pronunciation training is generally to acquire sufficiently intelligible pronunciation. However, defining intelligible pronunciation is not easy because it depends on the language background of the listeners. One kind of accented pronunciation, which is intelligible enough for some listeners, is often less intelligible for others. This paper focuses on the objective intelligibility of Japanese English through the ears of American English speakers with little exposure to Japanese English. A large listening test was conducted using the ERJ database. A balanced subset of this database was presented over a telephone line to the American listeners, who were asked to repeat what they heard. In total, 17,416 repeated responses were collected and transcribed manually. This paper describes the design of this experiment and some results from analyzing the transcriptions.

#2 From Single-Call to Multi-Call Quality: A Study on Long-term Quality Integration in Audio-Visual Speech Communication

Sebastian Möller (Quality and Usability Lab, TU Berlin, Germany)
Chihuy Bang (Quality and Usability Lab, TU Berlin, Germany)
Teele Tamme (Skype Labs, Skype, Tallinn, Estonia)
Markus Vaalgamaa (Skype Labs, Skype, Helsinki, Finland)
Benjamin Weiss (Quality and Usability Lab, TU Berlin, Germany)

Speech quality is commonly assumed to be the most important factor for the quality of a speech communication service and solution. However, little is known about how the quality experienced during individual calls forms the quality perception of an entire service or solution. Taking the example of an audio-visual IP-based communication solution, a long-term study is presented in which we analyze this relationship in a controlled setting. Results show temporal integration effects in the users’ response to time-varying quality levels and prove that simple averaging of call quality scores does not provide sufficiently accurate estimations of service quality.

#3 Optimal Selection of Limited Vocabulary Speech Corpora

Hui Lin (University of Washington)
Jeff Bilmes (University of Washington)

We address the problem of finding a subset of a large speech data corpus that is useful for accurately and rapidly prototyping novel and computationally expensive speech recognition architectures. To solve this problem, we express it as an optimization problem over submodular functions. Quantities such as vocabulary size (or quality) of a set of utterances, or quality of a bundle of word types are submodular functions which make finding the optimal solutions possible. We, moreover, are able to express our approach using graph cuts leading to a very fast implementation even on large initial corpora. We show results on the Switchboard-I corpus, demonstrating improved results over previous techniques for this purpose. We also demonstrate the variety of the resulting corpora that may be produced using our method.
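
As a rough illustration of the submodularity property this abstract relies on, the sketch below checks by brute force that vocabulary size shows diminishing returns on a toy corpus. It is not the authors' graph-cut method; the helper names (`vocab_size`, `marginal_gain`, `has_diminishing_returns`) and the toy utterances are assumptions made for illustration.

```python
from itertools import combinations


def vocab_size(utterances, subset):
    """Number of distinct word types in the utterances indexed by `subset`."""
    words = set()
    for index in subset:
        words.update(utterances[index].split())
    return len(words)


def marginal_gain(utterances, subset, new_index):
    """Vocabulary growth caused by adding one utterance to `subset`."""
    return vocab_size(utterances, subset | {new_index}) - vocab_size(utterances, subset)


def has_diminishing_returns(utterances):
    """Brute-force submodularity check (only feasible for tiny corpora).

    For every pair of subsets A <= B and every utterance j outside B,
    the gain of adding j to A must be at least the gain of adding j to B.
    """
    indices = range(len(utterances))
    all_subsets = [
        set(combo)
        for size in range(len(utterances) + 1)
        for combo in combinations(indices, size)
    ]
    for small in all_subsets:
        for big in all_subsets:
            if not small <= big:
                continue
            for j in indices:
                if j in big:
                    continue
                if marginal_gain(utterances, small, j) < marginal_gain(utterances, big, j):
                    return False
    return True


# Hypothetical toy corpus, one utterance per string.
toy_corpus = ["yes i do", "no i do not", "yes yes", "okay i see"]
submodular = has_diminishing_returns(toy_corpus)
```

On this toy corpus, adding "yes yes" to an empty selection contributes one new word type, while adding it to a selection that already contains "yes i do" contributes none. This is the diminishing-returns behaviour that makes such set functions amenable to efficient optimization.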

#4 Open Source Multi-Language Audio Database for Spoken Language Processing Applications

Stephen Zahorian (Binghamton University, Electrical & Computer Engineering Dept.)
Jiang Wu (Binghamton University, Electrical & Computer Engineering Dept.)
Montri Karnjanadecha (Binghamton University, Electrical & Computer Engineering Dept.)
Chandra Vootkuri (Binghamton University, Electrical & Computer Engineering Dept.)
Brian Wong (Binghamton University, Electrical & Computer Engineering Dept.)
Andrew Hwang (Binghamton University, Electrical & Computer Engineering Dept.)
Eldar Tokhtamyshev (Binghamton University, Electrical & Computer Engineering Dept.)

Over the past few decades, research in automatic speech recognition and automatic speaker recognition has been greatly facilitated by the sharing of large annotated speech databases such as those distributed by the Linguistic Data Consortium. Open sources, particularly web sites such as YouTube, contain vast and varied speech recordings in a variety of languages. These “open sources” for speech data are largely untapped as resources for speech research. In this paper, a project to collect, organize, and annotate a large group of this speech data is described. The data consists of approximately 25 hours of speech in each of three languages, English, Mandarin Chinese, and Russian. Each of the 900 recordings has been orthographically transcribed at the sentence/phrase level by human listeners. Some of the issues related to working with this low quality, varied, noisy speech data in three languages are described.

#5 The USC CARE Corpus: Child-Psychologist Interactions of Children with Autism Spectrum Disorders

Matthew P. Black (Signal Analysis and Interpretation Laboratory (SAIL), Viterbi School of Engineering, University of Southern California, Los Angeles, CA, USA)
Daniel Bone (Signal Analysis and Interpretation Laboratory (SAIL), Viterbi School of Engineering, University of Southern California, Los Angeles, CA, USA)
Marian E. Williams (University Center for Excellence in Developmental Disabilities, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA)
Phillip Gorrindo (Medical Scientist Training Program, Vanderbilt University, Nashville, TN, USA)
Pat Levitt (Zilkha Neurogenetic Institute, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA)
Shrikanth S. Narayanan (Signal Analysis and Interpretation Laboratory (SAIL), Viterbi School of Engineering, University of Southern California, Los Angeles, CA, USA)

We introduce the USC CARE Corpus, comprised of spontaneous and standardized child-psychologist interactions of children with a diagnosis of an autism spectrum disorder (ASD). The audio-video data is collected in the context of the Autism Diagnostic Observation Schedule (ADOS), which is a tool used by psychologists for a research-level diagnosis of ASD for children. The interaction consists of developmentally appropriate semi-structured social activities, providing the psychologist with a sample of behavior used to rate the child on a series of autism-relevant symptoms. Our goal with this multimodal corpus is to investigate how analytical technology (e.g., speech and language processing) can enhance this observational rating task and provide greater insight into social behavior and communication. We provide demographic statistics on the recruited children (60 to date), describe the multimodal recording set-up, and discuss current and future work for this novel corpus.

#6 Towards A Versatile Multi-Layered Description of Speech Corpora Using Algebraic Relations

Nelly Barbot (IRISA - University Rennes 1)
Vincent Barreaud (IRISA - University Rennes 1)
Olivier Boeffard (IRISA - University Rennes 1)
Laure Charonnat (IRISA - University Rennes 1)
Arnaud Delhay (IRISA - University Rennes 1)
Sebastien Le Maguer (IRISA - University Rennes 1)
Damien Lolive (IRISA - University Rennes 1)

This paper presents a software library named Roots (Rich Object Oriented Transcription System), which helps to describe spoken messages in a coherent manner by linking sequences of items on numerous levels (linguistic, phonological, or acoustic). The proposed representation is incremental and can thus describe any or all parts of an utterance. In order to link different levels of description, algebraic relations are used. Instead of relying solely on fixed, pre-determined relations, algebraic composition operators are proposed that can create a missing relation on demand. In terms of software architecture, object classes are defined based on well-grounded theoretical representations of speech (text, syntax, phonology and acoustics), without particular dependence on an annotation system (e.g. IPA is fully implemented). The API documentation for this software is available online.
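
The idea of creating a missing relation on demand by composition can be sketched as follows. This is a minimal illustration that assumes relations are stored as sets of index pairs between two levels; it does not reproduce the Roots API, and the names `compose`, `word_to_phoneme`, `phoneme_to_segment` and the toy data are hypothetical.

```python
def compose(rel_ab, rel_bc):
    """Compose two relations given as sets of (left_index, right_index) pairs.

    The result links a to c whenever some intermediate item b is linked
    to both of them.
    """
    return {(a, c) for (a, b1) in rel_ab for (b2, c) in rel_bc if b1 == b2}


# Three description levels of one hypothetical utterance.
words = ["speech", "lab"]
phonemes = ["s", "p", "iy", "ch", "l", "ae", "b"]
segments_ms = [(0, 80), (80, 140), (140, 260), (260, 350),
               (350, 420), (420, 540), (540, 600)]

# Stored relations: words to phonemes, and phonemes to acoustic segments.
word_to_phoneme = {(0, 0), (0, 1), (0, 2), (0, 3), (1, 4), (1, 5), (1, 6)}
phoneme_to_segment = {(i, i) for i in range(len(phonemes))}

# The word-to-segment relation was never stored; it is derived on demand.
word_to_segment = compose(word_to_phoneme, phoneme_to_segment)


def time_span(word_index):
    """Start and end time (ms) of a word, read off the derived relation."""
    linked = [segments_ms[s] for (w, s) in word_to_segment if w == word_index]
    return min(start for start, _ in linked), max(end for _, end in linked)


speech_span = time_span(0)
lab_span = time_span(1)
```

Only two relations are stored here, yet the time span of each word can be recovered, because the composed relation connects the word level directly to the acoustic level.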

#7 Announcing the Electromagnetic Articulography (Day 1) Subset of the mngu0 Articulatory Corpus

Korin Richmond (Centre for Speech Technology Research, Edinburgh University)
Phil Hoole (Institut fuer Phonetik und Sprachverarbeitung, Ludwig-Maximilians-Universitaet)
Simon King (Centre for Speech Technology Research, Edinburgh University)

This paper serves as an initial announcement of the availability of a corpus of articulatory data called mngu0. This corpus will ultimately consist of a collection of multiple sources of articulatory data acquired from a single speaker: electromagnetic articulography (EMA), audio, video, volumetric MRI scans, and 3D scans of dental impressions. This data will be provided free for research use. In this first stage of the release, we are making available one subset of EMA data, consisting of more than 1,300 phonetically diverse utterances recorded with a Carstens AG500 electromagnetic articulograph. Distribution of mngu0 will be managed by a dedicated “forum-style” web site. This paper both outlines the general goals motivating the distribution of the data and the creation of the mngu0 web forum, and also provides a description of the EMA data contained in this initial release.

#8 A Pitch Tracking Corpus with Evaluation on Multipitch Tracking Scenario

Gregor Pirker (Signal Processing and Speech Communication Laboratory, Graz University of Technology)
Michael Wohlmayr (Signal Processing and Speech Communication Laboratory, Graz University of Technology)
Stefan Petrik (Signal Processing and Speech Communication Laboratory, Graz University of Technology)
Franz Pernkopf (Signal Processing and Speech Communication Laboratory, Graz University of Technology)

In this paper, we introduce a novel pitch tracking database (PTDB) including ground truth signals obtained from a laryngograph. The database, referenced as PTDB-TUG, consists of 2342 phonetically rich sentences taken from the TIMIT corpus. Each sentence was recorded at least once by a male and a female native speaker. In total, the database contains 4720 recordings from 10 male and 10 female speakers. Furthermore, we evaluated two multipitch tracking systems on a subset of speakers to provide a benchmark for further research activities. The database can be downloaded at http://www.spsc.tugraz.at/tools.

#9 On building and evaluating a broadcast-news audio segmentation system

Taras Butko (Technical University of Catalonia)
Climent Nadeu (Technical University of Catalonia)

Audio segmentation is useful in diverse applications such as audio indexing and retrieval, subtitling, and monitoring of acoustic scenes. An initial audio segmentation stage may also help to improve the robustness of speech technologies such as automatic speech recognition and speaker diarization. In this paper, the Albayzín-2010 audio segmentation evaluation is first reported, including some conclusions drawn from the analysis of the set of eight submitted systems and their results. Then an audio segmentation system built in agreement with those conclusions is described and tested. Finally, using the experience gained, the initial design of both the acoustic classes and the detection scoring rules is refined with the aim of obtaining a more meaningful error rate measurement.

#10 Time- and Acoustic-Mediated Alignment Algorithms for Speech Recognition Evaluation

Simon Dobrišek (Ljubljana University, Faculty of Electrical Engineering)
France Mihelič (Ljubljana University, Faculty of Electrical Engineering)

The paper investigates the time- and acoustic-mediated alignment algorithms that can be used for better speech recognition evaluation. The edit-cost function, which weights the cost of speech unit matches, substitutions, deletions and insertions, is defined as a function of timed symbols or even as a function of speech signal segments. The algorithms are compared using several classical statistical measures of different types that are derived from speech recognition confusion matrices and are normally used to measure agreement between different classifications of the same set of objects. These measures provide a reasonable indication that the investigated algorithms provide more relevant speech recognition error statistics than the algorithms that are commonly used for this purpose.
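
To make the notion of a time-dependent edit cost concrete, the sketch below aligns two sequences of timed symbols with standard dynamic programming. The cost function (a label penalty plus a weighted distance between unit midpoints), the constants, and all names are assumptions chosen for illustration and are not taken from the paper.

```python
def midpoint(unit):
    """Temporal midpoint of a (label, start, end) unit, in seconds."""
    _, start, end = unit
    return (start + end) / 2


def pair_cost(ref_unit, hyp_unit, time_weight):
    """Hypothetical cost of pairing two units: label mismatch plus time distance."""
    label_penalty = 0.0 if ref_unit[0] == hyp_unit[0] else 1.0
    return label_penalty + time_weight * abs(midpoint(ref_unit) - midpoint(hyp_unit))


def align(ref, hyp, time_weight=4.0, gap_cost=1.0):
    """Align two timed symbol sequences; time_weight=0 gives plain edit distance."""
    n, m = len(ref), len(hyp)
    cost = [[0.0] * (m + 1) for _ in range(n + 1)]
    back = [[None] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0], back[i][0] = i * gap_cost, "del"
    for j in range(1, m + 1):
        cost[0][j], back[0][j] = j * gap_cost, "ins"
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            options = [
                (cost[i - 1][j - 1] + pair_cost(ref[i - 1], hyp[j - 1], time_weight), "pair"),
                (cost[i - 1][j] + gap_cost, "del"),
                (cost[i][j - 1] + gap_cost, "ins"),
            ]
            cost[i][j], back[i][j] = min(options)
    # Trace back from the end to recover the sequence of edit operations.
    ops, i, j = [], n, m
    while i > 0 or j > 0:
        move = back[i][j]
        if move == "pair":
            kind = "match" if ref[i - 1][0] == hyp[j - 1][0] else "sub"
            ops.append((kind, ref[i - 1][0], hyp[j - 1][0]))
            i, j = i - 1, j - 1
        elif move == "del":
            ops.append(("del", ref[i - 1][0], None))
            i -= 1
        else:
            ops.append(("ins", None, hyp[j - 1][0]))
            j -= 1
    ops.reverse()
    return ops


# Hypothetical reference and recognizer output as (label, start, end) units.
reference = [("a", 0.0, 0.5), ("b", 0.5, 1.0), ("c", 1.0, 1.5)]
hypothesis = [("b", 0.0, 0.5), ("c", 1.0, 1.5)]

label_only = align(reference, hypothesis, time_weight=0.0)
time_mediated = align(reference, hypothesis, time_weight=4.0)
```

With label-only costs, the recognized "b" is matched to the reference "b" even though it was produced during the reference "a". With the time-weighted cost it is instead counted as a substitution for "a", followed by a deletion of "b", which changes the resulting confusion statistics.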

#11 Effects of Shortening Speech Prompts of In-Car Voice User Interfaces on Users' Mental Models

Julia Niemann (Deutsche Telekom Laboratories)
Kati Schulz (Deutsche Telekom Laboratories)
Ina Wechsung (Deutsche Telekom Laboratories)

Shortening speech prompts is useful for reducing the tendency of drivers to allocate attention towards the display. However, it is so far unsettled whether shortened speech still supports a good mental model for users. A lab experiment was conducted. The effects of reducing the time effort of speech were evaluated via a transfer task, retrieval tasks and navigation-orientation tasks for three different strategies: (1) using sounds (earcons) for menu orientation (landmarking), (2) using command-based speech for interaction options, and (3) using up-tempo speech for content-based information. It was observed that earcons do not impair navigation-orientation performance. Command-based speech led to even better retrieval performance than the sentence-based representation of interaction. Only up-tempo speech decreased retrieval performance.

#12 Speech Transcript Evaluation for Information Retrieval

Laurens van der Werff (University of Twente)
Wessel Kraaij (Radboud University Nijmegen)
Franciska de Jong (University of Twente)

Speech recognition transcripts are being used in various fields of research and practical applications, putting various demands on their accuracy. Traditionally, ASR research has used intrinsic evaluation measures such as word error rate to determine transcript quality. In non-dictation-type applications such as speech retrieval, it is better to use extrinsic (or task-specific) measures. Indexing and the associated processing may eliminate certain errors, whereas the search query may reveal others. In this work, we argue that the standard extrinsic speech retrieval measure, average precision, is impractical for ASR evaluation. As an alternative, we propose the use of rank correlation measures on the output of the speech retrieval task, with the goal of predicting relative mean average precision. The measures we used showed a reasonably high correlation with average precision, but require much less human effort to calculate and can be more easily deployed in a variety of real-life settings.
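
One way to picture a rank correlation measure on retrieval output is sketched below. The abstract does not say which measures were used, so Kendall's tau is an assumption. It is also assumed that the two lists being compared are the rankings retrieved from a reference transcript and from an ASR transcript for the same query. The document identifiers are hypothetical.

```python
from itertools import combinations


def kendall_tau(ranking_a, ranking_b):
    """Kendall's tau between two rankings of the same items, without ties.

    Returns 1.0 for identical order and -1.0 for fully reversed order.
    """
    if len(ranking_a) < 2:
        raise ValueError("need at least two items to compare rankings")
    if len(set(ranking_a)) != len(ranking_a) or set(ranking_a) != set(ranking_b):
        raise ValueError("rankings must contain the same items, each exactly once")
    position_b = {item: rank for rank, item in enumerate(ranking_b)}
    concordant = discordant = 0
    # combinations() yields pairs in ranking_a order: first is ranked above second.
    for first, second in combinations(ranking_a, 2):
        if position_b[first] < position_b[second]:
            concordant += 1
        else:
            discordant += 1
    total_pairs = len(ranking_a) * (len(ranking_a) - 1) // 2
    return (concordant - discordant) / total_pairs


# Hypothetical result lists for one query.
reference_run = ["d1", "d2", "d3", "d4"]   # retrieved using a reference transcript
asr_run = ["d1", "d3", "d2", "d4"]         # retrieved using an ASR transcript

tau = kendall_tau(reference_run, asr_run)
```

A comparison of this kind needs no relevance judgments, only the two ranked lists, which is consistent with the abstract's point about reduced human effort.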

#13 The Albayzin 2010 Language Recognition Evaluation

Luis Javier Rodriguez-Fuentes (University of the Basque Country)
Mikel Penagarikano (University of the Basque Country)
Amparo Varona (University of the Basque Country)
Mireia Diez (University of the Basque Country)
German Bordel (University of the Basque Country)

The Albayzin 2010 Language Recognition Evaluation (LRE) was the second effort made by the Spanish/Portuguese community to benchmark language recognition technology. Like the Albayzin 2008 LRE, it was coordinated by the Software Technology Working Group of the University of the Basque Country, with the support of the Spanish Thematic Network on Speech Technology. A speech database was created for system development and evaluation. Speech signals were recorded from TV broadcasts, including clean and noisy speech. The task consisted of deciding whether or not a target language was spoken in a test utterance, and involved 6 target languages: English, Portuguese, Basque, Catalan, Galician and Spanish. Other (out-of-set) languages were also recorded to allow open-set verification tests. This paper presents the main features of the evaluation, analyses system performance under different conditions, including the confusion among languages, and gives hints for future evaluations.

#14 Progress and Prospects for Speech Technology: Results from Three Sexennial Surveys

Roger Moore (University of Sheffield)

In 1997, and again in 2003, the author was invited to conduct a survey at the IEEE workshop on Automatic Speech Recognition and Understanding (ASRU) in which attendees were offered a set of statements about putative future events relating to progress in various aspects of speech technology R&D. The task of the respondents was to assign a date to each possible event. The 1997 and 2003 results were published at INTERSPEECH 2005 in Lisbon. Six years later, the author was invited by the organisers of ASRU09 to repeat the survey for a third time, and this paper presents the combined results from all three surveys (1997, 2003 and 2009). The overall conclusion is that, over the twelve-year period, progress is perceived as slow, and the future appears to be generally no nearer than it has been in the past. However, on a positive note, the survey confirmed that the market for speech technology applications on mobile devices would be highly attractive over the next ten or so years.

#15 Painless WFST cascade construction for LVCSR - Transducersaurus

Josef Robert Novak (Graduate School of Information Science and Technology, The University of Tokyo)
Nobuaki Minematsu (Graduate School of Information Science and Technology, The University of Tokyo)
Keikichi Hirose (Graduate School of Information Science and Technology, The University of Tokyo)

This paper introduces the Transducersaurus toolkit which provides a set of classes for generating each of the fundamental components of a typical WFST ASR cascade, including a Context-dependency transducer, a Lexicon, a stochastic language model and an optional silence class model. The toolkit further implements a simple scripting language in order to facilitate the construction of cascades with a variety of popular combination and optimization methods and provides integrated support for the TCubed and Juicer WFST decoders, and both Sphinx and HTK format acoustic models. New results for two standard WSJ tasks are also provided, comparing a variety of cascade construction and optimization algorithms. These results illustrate the flexibility of the toolkit as well as the tradeoffs inherent in various build algorithms.