Dante - Di Michelino 150° sponsors







Corporate & Society Sponsors
Loquendo diamond package
Nuance gold package
ATT bronze package
Google silver package
Appen bronze package
Interactive Media bronze package
Microsoft bronze package
SpeechOcean bronze package
Avios logo package
NDI logo package

CNR-ISTC

Université d'Avignon
Speech Cycle
AT&T
Università di Firenze
FUB
FBK
Univ. Trento
Univ. Napoli
Univ. Tuscia
Univ. Calabria
Univ. Venezia

AISV
Comune di Firenze
Firenze Fiera
Florence Convention Bureau

ISCA

12th Annual Conference of the
International Speech Communication Association


Interspeech 2011 Florence

Interspeech 2011 Technical Programme

Sun-Ses2-O1:
Speaker Recognition - Modeling

Time: Sunday 13:30 Place: Auditorium - Pala Congressi Type: Oral
Chair: Andrea Paoloni

13:30 Skew Gaussian mixture models for speaker recognition

Avi Matza (Tel Aviv University)
Yuval Bistritz (Tel Aviv University)

This paper proposes skew Gaussian mixture models for speaker recognition and an associated algorithm for training them from experimental data. Speaker identification experiments were conducted in which speakers were modeled using the familiar Gaussian mixture model (GMM) and the new skew-GMM. Each model type was evaluated using two sets of feature vectors: the mel-frequency cepstral coefficients (MFCC), which are widely used in speaker recognition applications, and line spectral frequencies (LSF), which are used in many low bit rate speech coders but have been less successful in speech and speaker recognition. Results showed that the skew-GMM with LSF compares favorably with the GMM-MFCC pair (under fair comparison conditions). They indicated that skew-Gaussians are better suited for capturing the highly asymmetrical shapes of the LSF distribution, and thus the skew-GMM with LSF offers a worthy alternative to the GMM-MFCC pair for speaker recognition.
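As a rough illustration of the density such a model is built from, the sketch below evaluates a univariate skew-normal component and a mixture of them. The abstract does not give the authors' parameterization or training algorithm, so the Azzalini form used here and all function and parameter names are assumptions.

```python
import math


def normal_pdf(z):
    """Standard normal density."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    """Standard normal cumulative distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def skew_normal_pdf(x, loc=0.0, scale=1.0, shape=0.0):
    """Azzalini skew-normal density (assumed form); shape=0 gives a plain Gaussian."""
    z = (x - loc) / scale
    return 2.0 / scale * normal_pdf(z) * normal_cdf(shape * z)


def skew_mixture_pdf(x, weights, locs, scales, shapes):
    """Weighted sum of skew-normal components; weights are assumed to sum to 1."""
    return sum(
        w * skew_normal_pdf(x, m, s, a)
        for w, m, s, a in zip(weights, locs, scales, shapes)
    )
```

Setting every shape parameter to zero reduces the mixture to an ordinary GMM, which is one way to compare the two model types on equal terms.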

13:50 Towards Goat Detection in Text-Dependent Speaker Verification

Orith Toledo-Ronen (IBM Research – Haifa)
Hagai Aronowitz (IBM Research – Haifa)
Ron Hoory (IBM Research – Haifa)
Jason Pelecanos (IBM T.J. Watson Research Center)
David Nahamoo (IBM T.J. Watson Research Center)

We present a method that identifies speakers that are likely to have a high false-reject rate in a text-dependent speaker verification system (“goats”). The method normally uses only the enrollment data to perform this task. We begin with extracting an appropriate feature from each enrollment session. We then rank all the enrollment sessions in the system based on this feature. The lowest-ranking sessions are likely to have a high false-reject rate. We explore several features and show that the 1% lowest-ranking enrollments have a false reject rate of up to 7.8%, compared to our system’s overall rate of 2.0%. Furthermore, when using a single additional verification score from the true speaker for ranking, the false-reject of the 1% lowest-ranking sessions rises up to 33%.
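The ranking step described above can be sketched as follows. The feature itself is not specified in the abstract, so the per-session feature values, the convention that lower values indicate less reliable enrollments, and the function name are all hypothetical.

```python
def flag_likely_goats(session_features, fraction=0.01):
    """Return the lowest-ranking enrollment sessions.

    session_features maps a session id to a scalar feature value
    (assumed: lower value = more likely to be falsely rejected).
    """
    ranked = sorted(session_features, key=session_features.get)
    n_flagged = max(1, int(len(ranked) * fraction))
    return ranked[:n_flagged]
```

With `fraction=0.01` this returns the lowest-ranking 1% of sessions, the group for which the abstract reports the elevated false-reject rate.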

14:10 Speaker modeling using local binary decisions

Jean-Francois Bonastre (University of Avignon)
Xavier Anguera (Telefonica Research)
Gabriel H. Sierra (Advanced Technologies Application Center)
Pierre-Michel Bousquet (University of Avignon)

Accurate speaker modeling is a crucial step in any speaker-related algorithm. Many statistical speaker modeling techniques that deviate from the classical GMM/UBM approach and can accurately discriminate between speakers have been proposed over time. However, many of them require the evaluation of high-dimensional feature vectors and represent a speaker with a single vector, and therefore use no temporal information. In addition, they place most emphasis on modeling the most recurrent acoustic events rather than less frequent, speaker-discriminant information. In this paper we explain the main benefits of our recently proposed binary speaker modeling technique and demonstrate them in two applications, namely speaker recognition and speaker diarization. Both applications achieve near state-of-the-art results while performing most processing in the binary space.

14:30 New Developments in Voice Biometrics for User Authentication

Hagai Aronowitz (IBM Research - Haifa)
Ron Hoory (IBM Research - Haifa)
Jason Pelecanos (IBM T.J. Watson Research Center)
David Nahamoo (IBM T.J. Watson Research Center)

Voice biometrics for user authentication is a task in which the objective is to perform convenient, robust and secure authentication of speakers. In this work we investigate the use of state-of-the-art text-independent and text-dependent speaker verification technology for user authentication. We evaluate four different authentication conditions: speaker-specific digit strings, global digit strings, prompted digit strings, and text-independent. Harnessing the characteristics of the different types of conditions can provide benefits such as authentication transparent to the user (convenience), spoofing robustness (security) and improved accuracy (reliability). The systems were evaluated on a corpus collected by Wells Fargo Bank which consists of 750 speakers. We show how to adapt techniques such as joint factor analysis (JFA), Gaussian mixture models with nuisance attribute projection (GMM-NAP) and hidden Markov models with NAP (HMM-NAP) to obtain improved results for new authentication scenarios and environments.

14:50 Evaluation of i-vector Speaker Recognition Systems for Forensic Application

Miranti Indar Mandasari (Radboud University Nijmegen)
Mitchell McLaren (Radboud University Nijmegen)
David van Leeuwen (Radboud University Nijmegen)

This paper contributes a study on i-vector based speaker recognition systems and their application to forensics. The sensitivity of i-vector based speaker recognition is analyzed with respect to the effects of speech duration. This approach is motivated by the potentially limited speech available in a recording for a forensic case. In this context, the classification performance and calibration costs of the i-vector system are analyzed along with the role of normalization in the cosine kernel. Evaluated on the NIST SRE-2010 dataset, results highlight that normalization of the cosine kernel provided improved performance across all speech durations compared to the use of an unnormalized kernel. The normalized kernel was also found to play an important role in reducing miscalibration costs and providing well-calibrated likelihood ratios with limited speech duration.
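To make the distinction between the two kernels concrete, here is a minimal sketch of scoring a pair of i-vectors with and without length normalization. The abstract does not spell out the normalization, so treating the "unnormalized kernel" as a plain dot product is an assumption, and the function names are hypothetical.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def unnormalized_score(enroll_ivector, test_ivector):
    """Plain dot product; the score scales with the vector lengths."""
    return dot(enroll_ivector, test_ivector)


def cosine_score(enroll_ivector, test_ivector):
    """Dot product divided by both vector norms; bounded in [-1, 1]."""
    norm_product = math.sqrt(dot(enroll_ivector, enroll_ivector)) * math.sqrt(
        dot(test_ivector, test_ivector)
    )
    return dot(enroll_ivector, test_ivector) / norm_product
```

Because the normalized score ignores vector length, it depends only on the direction of the two i-vectors.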

15:10 Mixture of PLDA Models in I-Vector Space for Gender-Independent Speaker Recognition

Mohammed Senoussaoui (CRIM and École de Technologie Supérieure (ÉTS), Canada)
Patrick Kenny (Centre de recherche informatique de Montréal (CRIM), Canada)
Niko Brümmer (AGNITIO Labs, South Africa)
Edward De Villiers (AGNITIO Labs, South Africa)
Pierre Dumouchel (CRIM and École de Technologie Supérieure (ÉTS), Canada)

The speaker recognition community that participates in NIST evaluations has concentrated on designing gender- and channel-conditioned systems. In the real world, this conditioning is not feasible. Our main purpose in this work is to propose a mixture of Probabilistic Linear Discriminant Analysis (PLDA) models as a solution for making systems independent of speaker gender. In order to show the effectiveness of the mixture model, we first experiment on 2010 NIST telephone speech (det5), where we show that there is no loss of accuracy compared with a baseline gender-dependent model. We also successfully test the mixture model in a more realistic situation where there are cross-gender trials. Furthermore, we report results on microphone speech for the det1, det2, det3 and det4 tasks to confirm the effectiveness of the mixture model.

Sun-Ses2-O3:
Speech Representation and Modelling

Time: Sunday 13:30 Place: Brunelleschi (Green Room) - Pala Congressi - 2nd Floor Type: Oral
Chair: Yannis Stylianou

13:30 A Long-Term Harmonic plus Noise Model for Speech Signals

Faten Ben Ali (GIPSA-Lab, Grenoble Institute of Technology, Grenoble France - Unité Signaux et Systèmes, Ecole Nationale d'Ingénieurs de Tunis, Tunisie)
Laurent Girin (GIPSA-Lab, Grenoble Institute of Technology, Grenoble France)
Sonia Djaziri Larbi (Unité Signaux et Systèmes, Ecole Nationale d'Ingénieurs de Tunis, Tunisie)

The harmonic plus noise model (HNM) is widely used for spectral modeling of mixed harmonic/noise speech sounds. In this paper, we present an analysis/synthesis system based on a long-term two-band HNM. “Long-term” means that the time-trajectories of the HNM parameters are modeled using “smooth” (discrete cosine) functions depending on a small set of parameters. The goal is to capture and exploit the long-term correlation of spectral components on time segments of up to several hundreds of ms. The proposed long-term HNM enables joint compact representation of signals (thus a potential for low bit-rate coding) and easy signal transformation (e.g. time stretching) directly from the long-term parameters. Experiments show that it can be compared favourably with the short-term version in terms of parameter rates and signal quality.
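The idea of describing a parameter trajectory with a small set of discrete cosine coefficients can be sketched as below. This is a generic orthonormal DCT-II fit, offered as an assumption about the kind of smooth model meant; the abstract does not give the authors' exact formulation, and the function names are hypothetical.

```python
import math


def _basis(k, n, length):
    """Orthonormal DCT-II basis function k evaluated at frame n."""
    scale = math.sqrt((1.0 if k == 0 else 2.0) / length)
    return scale * math.cos(math.pi * (n + 0.5) * k / length)


def dct_coefficients(trajectory, num_coeffs):
    """Project a parameter trajectory onto the first num_coeffs cosine functions."""
    length = len(trajectory)
    return [
        sum(trajectory[n] * _basis(k, n, length) for n in range(length))
        for k in range(num_coeffs)
    ]


def dct_reconstruct(coeffs, length):
    """Rebuild a smooth trajectory of the given length from its coefficients."""
    return [
        sum(c * _basis(k, n, length) for k, c in enumerate(coeffs))
        for n in range(length)
    ]
```

Keeping only a few low-order coefficients gives the compact representation, and reconstructing at a different length from the same coefficients is one simple way to picture time stretching.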

13:50 A Frequency Domain Approach to ARX-LF Voiced Speech Parameterization and Synthesis

Alan O Cinneide (Dublin Institute of Technology)
David Dorran (Dublin Institute of Technology)
Gainza Mikel (Dublin Institute of Technology)
Eugene Coyle (Dublin Institute of Technology)

The ARX-LF model interprets voiced speech as an LF derivative glottal pulse exciting an all-pole vocal tract filter with an additional exogenous residual signal. It fully parameterizes the voice and has been shown to be useful for voice modification. However, the determination of the ARX-LF parameters from speech is very sensitive to the time placement of the analysis frame and not robust to phase distortion from, e.g., recording equipment. This paper describes a frequency domain approach to the determination of ARX-LF model parameters which is robust to these issues. An experiment was conducted comparing synthetic speech produced by this method with a time domain ARX-LF parameterization approach under real and simulated recording conditions, and the frequency domain approach was preferred by participants in a listening test.

14:10 Automatic Data-Driven Learning of Articulatory Primitives from Real-Time MRI Data using Convolutive NMF with Sparseness Constraints

Vikram Ramanarayanan (University of Southern California)
Athanasios Katsamanis (University of Southern California)
Shrikanth Narayanan (University of Southern California)

We present a procedure to automatically derive interpretable dynamic articulatory primitives in a data-driven manner from image sequences acquired through real-time magnetic resonance imaging (rt-MRI). More specifically, we propose a convolutive Nonnegative Matrix Factorization algorithm with sparseness constraints (cNMFsc) to decompose a given set of image sequences into a set of basis image sequences and an activation matrix. We use a recently-acquired rt-MRI corpus of read speech (460 MOCHA-TIMIT sentences from 4 speakers) as a test dataset for this procedure. We choose the free parameters of the algorithm empirically by analyzing algorithm performance for different parameter values. We finally attempt an interpretation of the extracted basis set of image sequences in a gesture-based Articulatory Phonology framework.

14:30 Online Pattern Learning for Non-Negative Convolutive Sparse Coding

Dong Wang (EURECOM)
Ravichander Vipperla (EURECOM)
Nicholas Evans (EURECOM)

The unsupervised learning of spectro-temporal speech patterns is relevant in a broad range of tasks. Convolutive non-negative matrix factorization (CNMF) and its sparse version, convolutive non-negative sparse coding (CNSC), are powerful, related tools. A particular difficulty of CNMF/CNSC, however, is the high demand on computing power and memory, which can prohibit their application to large scale tasks. In this paper, we propose an online algorithm for CNMF and CNSC, which processes input data piece by piece and updates the learned patterns after each piece using accumulated sufficient statistics. The online CNSC algorithm remarkably increases the convergence speed of CNMF/CNSC learning, thereby enabling its application to large scale tasks.

14:50 Sinewave Representations of Nonmodality

Nicolas Malyska (MIT Lincoln Laboratory)
Thomas F. Quatieri (MIT Lincoln Laboratory)
Robert Dunn (MIT Lincoln Laboratory)

Regions of nonmodal phonation, exhibiting deviations from uniform glottal-pulse periods and amplitudes, occur often and convey information about speaker- and linguistic-dependent factors. Such waveforms pose challenges for speech modeling, analysis/synthesis, and processing. In this paper, we investigate the representation of nonmodal pulse trains as a sum of harmonically-related sinewaves with time-varying amplitudes, phases, and frequencies. We show that a sinewave representation of any impulsive signal is not unique and also the converse, i.e., frame-based measurements of the underlying sinewave representation can yield different impulse trains. Finally, we argue that this ambiguity may explain the addition, deletion, and movement of pulses in sinewave synthesis, and we give a specific illustrative example of time-scale modification of a nonmodal case of diplophonia.

15:10 Time-Varying Signal Adaptive transform and IHT recovery of compressive sensed speech

Srikanth Raj Ch (Department of Electrical Communication Engineering, Indian Institute of Science, Bangalore, India-560012)
Sreenivas T. V. (Department of Electrical Communication Engineering, Indian Institute of Science, Bangalore, India-560012)

Compressive Sensing (CS) signal recovery has been formulated for signals sparse in a known linear transform domain. We consider the scenario in which the transform is unknown and the goal is to estimate the transform as well as the sparse signal from just the CS measurements. Specifically, we consider the speech signal as the output of a time-varying AR process, as in the linear system model of speech production, with the excitation being sparse. We propose an iterative algorithm to estimate both the system impulse response and the excitation signal from the CS measurements. We show that the proposed algorithm, in conjunction with a modified iterative hard thresholding, is able to estimate the signal adaptive transform accurately, leading to much higher quality signal reconstruction than the codebook based matching pursuit approach. Thus, we are able to get near “toll quality” speech reconstruction from sub-Nyquist rate CS measurements.
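For readers unfamiliar with iterative hard thresholding, the sketch below shows the standard, unmodified iteration for a known measurement matrix. It is not the authors' modified variant and does not estimate the transform; the matrix layout (list of rows), step size and function names are assumptions for illustration.

```python
def hard_threshold(x, k):
    """Keep the k largest-magnitude entries of x and zero the rest."""
    keep = set(sorted(range(len(x)), key=lambda i: abs(x[i]), reverse=True)[:k])
    return [v if i in keep else 0.0 for i, v in enumerate(x)]


def iht(A, y, k, step=1.0, iterations=50):
    """Standard IHT: gradient step on ||y - A x||^2, then hard thresholding.

    A is a list of rows (m x n), y has length m, k is the assumed sparsity.
    """
    m, n = len(A), len(A[0])
    x = [0.0] * n
    for _ in range(iterations):
        residual = [y[i] - sum(A[i][j] * x[j] for j in range(n)) for i in range(m)]
        gradient = [sum(A[i][j] * residual[i] for i in range(m)) for j in range(n)]
        x = hard_threshold([x[j] + step * gradient[j] for j in range(n)], k)
    return x
```

Convergence of this plain form depends on the step size and on properties of the measurement matrix, so the defaults here should be read as placeholders.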

Sun-Ses2-O2:
Speech Perception - Speech Intelligibility

Time: Sunday 13:30 Place: Leonardo - Pala Affari - Ground Floor Type: Oral
Chair: Anne Cutler

13:30 Segregation of whispered speech interleaved with noise or speech maskers

Nandini Iyer (Air Force Research Laboratory)
Douglas S. Brungart (Walter Reed Army Medical Center)
Brian D. Simpson (Air Force Research Laboratory)

Some listening environments require listeners to segregate a whispered target talker from a background of other talkers. In this experiment, a whispered speech signal was presented continuously in the presence of a continuous masker (noise, voiced speech or whispered speech) or alternated with the masker at an 8-Hz rate. Performance was near ceiling in the alternated whisper and noise condition, suggesting that harmonic structure due to voicing is not necessary to segregate a speech signal from an interleaved random-noise masker. Indeed, when whispered speech was interleaved with voiced speech, performance decreased relative to the continuous condition when the target talker was voiced but not when it was whispered, suggesting that listeners are better at selectively attending to unvoiced intervals and ignoring voiced intervals than the converse.

13:50 Monaural Azimuth Localization Using Spectral Dynamics of Speech

Roi Kliper (Interdisciplinary Center for Neural Computation, Hebrew University of Jerusalem, Israel)
Hendrik Kayser (Dept. of Physics, Carl von Ossietzky University, Oldenburg, Germany)
Daphna Weinshall (Interdisciplinary Center for Neural Computation, Hebrew University of Jerusalem, Israel)
Israel Nelken (Interdisciplinary Center for Neural Computation, Hebrew University of Jerusalem, Israel)
Jörn Anemüller (Dept. of Physics, Carl von Ossietzky University, Oldenburg, Germany)

We tackle the task of localizing speech signals on the horizontal plane using monaural cues. We show that monaural cues as incorporated in speech are efficiently captured by amplitude modulation spectra patterns. We demonstrate that by using these patterns, a linear Support Vector Machine can use directionality-related information to learn to discriminate and classify sound location at high resolution. We propose a straightforward and robust way of integrating information from two ears. Each ear is treated as an independent processor and information is integrated at the decision level thus resolving, to a large extent, ambiguity in location.

14:10 Prediction of binaural intelligibility level differences in reverberation

Jan Rennies (Fraunhofer IDMT Hearing, Speech and Audio Technology, Oldenburg, Germany)
Thomas Brand (Medical Physics, University of Oldenburg, Oldenburg, Germany)
Birger Kollmeier (Medical Physics, University of Oldenburg, Oldenburg, Germany)

Speech intelligibility can be substantially improved when speech and interfering noise are spatially separated. This spatial unmasking is commonly attributed to effects of head shadow and binaural auditory processing. In reverberant rooms spatial unmasking is generally reduced. In this study spatial unmasking is systematically measured in reverberant conditions for several configurations of binaural, diotic and monaural speech signals. The data are compared to predictions of a recently developed binaural speech intelligibility model. The high prediction accuracy (R²>0.97) indicates that the model is applicable in real rooms and may serve as a tool in room acoustical design.

14:30 Let’s all speak together! Exploring the impact of various languages on the comprehension of speech in multi-linguistic babble.

Aurore Gautreau ((1) INSERM U1028, Lyon Neuroscience Research Center, Brain Dynamics and Cognition Team, Lyon, France (2) CNRS UMR5292, Lyon Neuroscience Research Center, Brain Dynamics and Cognition Team, Lyon, France (3) University Lyon 1, Lyon, France)
Michel Hoen ((1) INSERM U1028, Lyon Neuroscience Research Center, Brain Dynamics and Cognition Team, Lyon, France (2) CNRS UMR5292, Lyon Neuroscience Research Center, Brain Dynamics and Cognition Team, Lyon, France (3) University Lyon 1, Lyon, France)
Fanny Meunier ((1) INSERM U1028, Lyon Neuroscience Research Center, Brain Dynamics and Cognition Team, Lyon, France (2) CNRS UMR5292, Lyon Neuroscience Research Center, Brain Dynamics and Cognition Team, Lyon, France (3) University Lyon 1, Lyon, France)

Our research explores the psycholinguistic processes involved in speech-in-speech situations. Our studies focused on the interference observed during speech-in-speech comprehension. Our goal is to clarify whether interference exists only at an acoustic level or whether there is clear psycholinguistic interference. In 3 experiments, we used 4-talker cocktail-party signals in different world languages: French, Breton, Irish and Italian. Participants had to identify French words inserted in babble noise. Results first confirmed that it is more difficult to understand a French word in a French background than in a babble composed of unknown languages. This result demonstrates that the interference effect is not purely acoustic but also linguistic. Results also showed that performance differed depending on the unknown language spoken in the background, demonstrating that some languages interfered more with French than others.

14:50 Cross-Rate Variation in the Intelligibility of Dual-Rate Gated Speech in Older Listeners

Valeriy Shafiro (Rush University Medical Center)
Stanley Sheft (Rush University Medical Center)
Robert Risley (Rush University Medical Center)

Intelligibility of sentences gated with a single primary rate (0.5-8 Hz, 25-75% duty cycle) or gated with an additional concurrent rate of 24 Hz and a 50% duty cycle was examined in older normal-hearing and hearing-impaired listeners. With a stronger effect of age than hearing loss, intelligibility tended to increase with primary rate and duty cycle, but varied for dual-rate gating. Reduction in the total amount of speech due to concurrent 24 Hz gating had little effect on the intelligibility for the lowest and highest primary rates, but was detrimental for rates between 2 and 4 Hz, mimicking the pattern previously obtained from young normal-hearing listeners. The dual-rate intelligibility decrement with a 2 Hz primary rate significantly correlated with speech intelligibility in multi-talker babble, suggesting overlap of perceptual processes. Overall, findings reflect interaction of central and peripheral processing of speech occurring on different time scales.
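A short sketch may help picture what single-rate and dual-rate gating do to the amount of speech retained. It assumes rectangular on/off gates that start in phase and a dual-rate gate formed as the product of the two; the abstract does not state these details, and the function names are hypothetical.

```python
def gate(t, rate_hz, duty):
    """Rectangular gate: 1.0 during the 'on' part of each cycle, else 0.0."""
    phase = (t * rate_hz) % 1.0
    return 1.0 if phase < duty else 0.0


def dual_rate_gate(t, primary_hz, primary_duty, secondary_hz=24.0, secondary_duty=0.5):
    """Speech passes only when both the primary and the concurrent gate are on."""
    return gate(t, primary_hz, primary_duty) * gate(t, secondary_hz, secondary_duty)


def retained_fraction(gate_fn, duration_s=1.0, fs=48000):
    """Fraction of samples passed by gate_fn over the given duration."""
    n = int(duration_s * fs)
    return sum(gate_fn(i / fs) for i in range(n)) / n
```

Under these assumptions, a 2 Hz gate with a 50% duty cycle retains about half of the signal, and adding the 24 Hz, 50% gate halves that again.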

15:10 An Efferent-Inspired Auditory Model Front-End for Speech Recognition

Chia-ying Lee (MIT Computer Science and Artificial Intelligence Laboratory)
James Glass (MIT Computer Science and Artificial Intelligence Laboratory)
Oded Ghitza (Boston University Hearing Research Center)

In this paper, we investigate a closed-loop auditory model and explore its potential as a feature representation for speech recognition. The closed-loop representation consists of an auditory-based, efferent-inspired feedback mechanism that regulates the operating point of a filter bank, thus enabling it to dynamically adapt to changing background noise. With dynamic adaptation, the closed-loop representation demonstrates an ability to compensate for the effects of noise on speech, and generates a consistent feature representation for speech when contaminated by different kinds of noises. Our preliminary experimental results indicate that the efferent-inspired feedback mechanism enables the closed-loop auditory model to consistently improve word recognition accuracies, when compared with an open-loop representation, for mismatched training and test noise conditions in a connected digit recognition task.

Sun-Ses2-O4:
Emotion, Speaking Style, and Social Behavior

Time: Sunday 13:30 Place: Michelangelo - Pala Affari - 2nd Floor Type: Oral
Chair: Anton Batliner

13:30 Acoustic-Linguistic Recognition of Interest in Speech with Bottleneck-BLSTM Nets

Martin Woellmer (Technische Universitaet Muenchen)
Felix Weninger (Technische Universitaet Muenchen)
Florian Eyben (Technische Universitaet Muenchen)
Bjoern Schuller (Technische Universitaet Muenchen)

This paper proposes a novel technique for speech-based interest recognition in natural conversations. We introduce a fully automatic system that exploits the principle of bidirectional Long Short-Term Memory (BLSTM) as well as the structure of so-called bottleneck networks. BLSTM nets are able to model a self-learned amount of context information, which was shown to be beneficial for affect recognition applications, while bottleneck networks allow for efficient feature compression within neural networks. In addition to acoustic features, our technique considers linguistic information obtained from a multi-stream BLSTM-HMM speech recognizer. Evaluations on the TUM AVIC corpus reveal that the bottleneck-BLSTM method prevails over all approaches that have been proposed for the Interspeech 2010 Paralinguistic Challenge task.

13:50 Automatic Detection of Anger in Human-Human Call Center Dialogs

Mustafa Erden (Electrical and Electronics Engineering Department, BOGAZICI University, Istanbul, Turkey)
Levent M. Arslan (Electrical and Electronics Engineering Department, BOGAZICI University, Istanbul, Turkey)

Automatic emotion recognition can enhance evaluation of customer satisfaction and detection of customer problems in call centers. For this purpose, emotion recognition is defined as a binary classification between angry and non-angry on Turkish human-human call center conversations. We investigated both acoustic and language models for this task. Support Vector Machines (SVM) resulted in 82.9% accuracy, whereas Gaussian Mixture Models (GMM) gave a slightly worse performance with 77.9%. In terms of language modeling, we compared word-based, stem-only and stem+ending structures. The stem+ending based system resulted in higher accuracy, with 72% using manual transcriptions. This can be mainly attributed to the agglutinative nature of the Turkish language. When we fused the acoustic and LM classifiers using a Multi Layer Perceptron (MLP), we achieved 89% correct detection of both angry and non-angry classes.

14:10 Improved Classification of Speaking Styles for Mental Health Monitoring using Phoneme Dynamics

Keng-hao Chang (Computer Science Division, University of California, Berkeley, CA, USA)
Howard Lei (International Computer Science Institute, Berkeley, CA, USA)
John Canny (Computer Science Division, University of California, Berkeley, CA, USA)

This paper investigates the usefulness of segmental phoneme dynamics for classification of speaking styles. We modeled local regularities of phonology based on the phoneme sequences emitted by a speech recognizer, using data obtained from a recording of 39 depressed patients with 7 different speaking styles: normal, pressured, slurred, stuttered, flat, slow and fast speech. We designed and compared two models: a language model treating each phoneme as a word unit and a context-dependent phoneme duration model based on Gaussians. Language modeling at the phoneme level performed better than the duration model. We also experimented with user normalization to improve performance. To verify the complementary effect of the models, we combined the classifiers at a decision level with a baseline Hidden Markov Model classifier built with spectral features. The improvement was 7.2% absolute (13.2% relative), reaching 61.7% accuracy in 7-class and 72.1% in 4-class classification.
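The absolute and relative gains quoted above can be checked against each other with a few lines of arithmetic. The sketch assumes both figures refer to the 7-class result, which the abstract does not state explicitly; the implied baseline is derived, not reported.

```python
def implied_baseline(final_accuracy, absolute_gain):
    """Baseline accuracy implied by a final accuracy and an absolute gain (in points)."""
    return final_accuracy - absolute_gain


def relative_gain(final_accuracy, absolute_gain):
    """Relative improvement over the implied baseline, in percent."""
    return 100.0 * absolute_gain / implied_baseline(final_accuracy, absolute_gain)


baseline = implied_baseline(61.7, 7.2)   # about 54.5% under the stated assumption
relative = relative_gain(61.7, 7.2)      # about 13.2%, consistent with the abstract
```

Under that assumption the two figures are mutually consistent: a 7.2-point gain over a baseline near 54.5% is a relative gain of roughly 13.2%.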

14:30 "You made me do it": Classification of Blame in Married Couples' Interactions by Fusing Automatically Derived Speech and Language Information

Matthew P. Black (Signal Analysis and Interpretation Laboratory (SAIL), Los Angeles, CA, USA)
Panayiotis G. Georgiou (Signal Analysis and Interpretation Laboratory (SAIL), Los Angeles, CA, USA)
Athanasios Katsamanis (Signal Analysis and Interpretation Laboratory (SAIL), Los Angeles, CA, USA)
Brian R. Baucom (Department of Psychology, University of Southern California, Los Angeles, CA, USA)
Shrikanth S. Narayanan (Signal Analysis and Interpretation Laboratory (SAIL), Los Angeles, CA, USA)

One of the goals of behavioral signal processing is the automatic prediction of relevant high-level human behaviors from complex, realistic interactions. In this work, we analyze dyadic discussions of married couples and try to classify extreme instances (low/high) of blame expressed from one spouse to another. Since blame can be conveyed through various communicative channels (e.g., speech, language, gestures), we compare two different classification methods in this paper. The first classifier is trained with the conventional static acoustic features and models "how" the spouses spoke. The second is a novel automatic speech recognition-derived classifier, which models "what" the spouses said. We get the best classification performance (82 percent accuracy) by exploiting the complementarity of these acoustic and lexical information sources through score-level fusion of the two classification methods.
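Score-level fusion of two classifiers can be pictured with the minimal sketch below. The abstract does not say how the scores are combined, so the weighted sum, the weight, the threshold and the function names are all assumptions.

```python
def fuse_scores(acoustic_score, lexical_score, weight=0.5):
    """Weighted sum of the two classifier scores (assumed fusion rule)."""
    return weight * acoustic_score + (1.0 - weight) * lexical_score


def classify_blame(acoustic_score, lexical_score, weight=0.5, threshold=0.0):
    """Label a session 'high' or 'low' blame from the fused score."""
    fused = fuse_scores(acoustic_score, lexical_score, weight)
    return "high" if fused > threshold else "low"
```

In practice the weight and threshold would be tuned on held-out data; here they are placeholders.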

14:50 Context and priming effects in the recognition of emotion in old and young listeners

Martijn Goudbeek (University of Tilburg)
Marie Nilsenová (University of Tilburg)

The development of our ability to recognize (vocal) emotional expression has been relatively understudied. Even less studied is the effect of linguistic (spoken) context on emotion perception. In this study we investigate the performance of young (18-25) and old (60-85) listeners on two tasks: an emotion recognition task where emotions expressed in a sustained vowel (/a/) had to be recognized and an emotion attribution task where listeners had to judge a neutral fragment that was preceded by a phrase that varied in speech rate and/or loudness. The results of the recognition task showed that old and young participants do not differ in their recognition accuracy. The emotion attribution task showed that young listeners are more likely to interpret neutral stimuli as emotional when the preceding speech is emotionally colored. The results are interpreted as evidence for diminished plasticity later in life.

15:10 Acoustic and Prosodic Correlates of Social Behavior

Agustin Gravano (Universidad de Buenos Aires)
Rivka Levitan (Columbia University)
Laura Willson (Columbia University)
Stefan Benus (Constantine the Philosopher University and Institute of Informatics, Slovak Academy of Sciences)
Julia Hirschberg (Columbia University)
Ani Nenkova (University of Pennsylvania)

We describe acoustic/prosodic and lexical correlates of social variables annotated on a large corpus of task-oriented spontaneous speech. We employ Amazon Mechanical Turk to label the corpus with a large number of social behaviors, and examine results for three of these here. We find significant differences between male and female speakers in perceptions of attempts to be liked, likeability, and speech planning; these differences also depend on the gender of their conversational partners.

Sun-Ses2-O5:
HMM-based Speech Synthesis I

Time: Sunday 13:30 Place: Raffaello - Pala Affari - 3rd Floor Type: Oral
Chair: Keiichi Tokuda

13:30 Decision Tree-based Clustering with Outlier Detection for HMM-based Speech Synthesis

Kyung Hwan Oh (Seoul National University)
June Sig Sung (Seoul National University)
Doo Hwa Hong (Seoul National University)
Nam Soo Kim (Seoul National University)

In order to express natural prosodic variations in continuous speech, sophisticated speech units such as context-dependent phone models are usually employed in HMM-based speech synthesis techniques. Since the training database cannot practically cover all the possible context factors, decision tree-based HMM state clustering is commonly applied. One of the serious problems in the decision tree-based method is that the criterion used for node splitting and stopping is sensitive to irrelevant outlier data. In this paper, we propose a novel approach to removing outliers during the decision tree growing phase. Experimental results show that removing outlying models improves the quality of the synthesized speech, especially for sentences which originally demonstrated poor quality.

13:50 Prediction of voice aperiodicity based on spectral representations in HMM speech synthesis

Hanna Silén (Department of Signal Processing, Tampere University of Technology, Tampere, Finland)
Elina Helander (Department of Signal Processing, Tampere University of Technology, Tampere, Finland)
Moncef Gabbouj (Department of Signal Processing, Tampere University of Technology, Tampere, Finland)

In hidden Markov model-based speech synthesis, speech is typically parameterized using source-filter decomposition. A widely used analysis/synthesis framework, STRAIGHT, decomposes the speech waveform into a framewise spectral envelope and a mixed mode excitation signal. Inclusion of an aperiodicity measure in the model enables synthesis also for signals that are not purely voiced or unvoiced. In the traditional approach employing hidden Markov modeling and decision tree-based clustering, the connection between speech spectrum and aperiodicities is not taken into account. In this paper, we take advantage of this dependency and predict voice aperiodicities afterwards based on synthetic spectral representations. The evaluations carried out for English data confirm that the proposed approach is able to provide prediction accuracy that is comparable to the traditional approach.

14:10A Perceptual Expressivity Modeling Technique for Speech Synthesis Based on Multiple-Regression HSMM

Takashi Nose (Tokyo Institute of Technology)
Takao Kobayashi (Tokyo Institute of Technology)

This paper describes a technique for modeling and controlling the emotional expressivity of speech in HMM-based speech synthesis. A problem of conventional HMM-based emotional speech synthesis is that the intensity of an emotional expression appearing in synthetic speech depends completely on the database used for model training. To take into account the emotional expressivity that listeners actually perceive, perceptual expressivity scores are introduced into a style control technique based on the multiple-regression hidden semi-Markov model (MRHSMM). The objective and subjective evaluation results show that the proposed technique works well when there is a large variation and bias of emotional expressivity in the training data.

14:30Multi-Speaker Modeling with Shared Prior Distributions and Model Structures for Bayesian Speech Synthesis

Kei Hashimoto (Nagoya Institute of Technology)
Yoshihiko Nankaku (Nagoya Institute of Technology)
Keiichi Tokuda (Nagoya Institute of Technology)

This paper investigates a multi-speaker modeling technique with shared prior distributions and model structures for Bayesian speech synthesis. The quality of synthesized speech is improved by selecting appropriate model structures in HMM-based speech synthesis. The Bayesian approach is known to work for such model selection. However, the result is strongly affected by the prior distributions of the model parameters. Therefore, determination of prior distributions and selection of model structures should be performed simultaneously. This paper investigates prior distributions and model structures in the situation where training data from multiple speakers are available. Prior distributions and model structures that represent acoustic features common to every speaker can be obtained by sharing them between multiple speaker-dependent models.

14:50Feature-Space Transform Tying in Unified Acoustic-Articulatory Modelling for Articulatory Control of HMM-based Speech Synthesis

Zhen-Hua Ling (iFLYTEK Speech Lab, University of Science and Technology of China, P.R.China)
Korin Richmond (CSTR, University of Edinburgh, United Kingdom)
Junichi Yamagishi (CSTR, University of Edinburgh, United Kingdom)

In previous work, we proposed a method to control the characteristics of synthetic speech flexibly by integrating articulatory features into hidden Markov model (HMM) based parametric speech synthesis. A unified acoustic-articulatory model was trained and a piecewise linear transform was adopted to describe the dependency between these two feature streams. The transform matrices were trained for each HMM state and were tied based on each state's context. In this paper, an improved acoustic-articulatory modelling method is proposed. A Gaussian mixture model (GMM) is introduced to model the articulatory space and the cross-stream transform matrices are trained for each Gaussian mixture instead of context-dependently. This means the dependency relationship can vary flexibly as the articulatory features change. Our results show that this method improves the effectiveness of control over vowel quality by modifying articulatory trajectories without degrading naturalness.

15:10The Effect of Using Normalized Models in Statistical Speech Synthesis

Matt Shannon (Cambridge University Engineering Department)
Heiga Zen (Toshiba Research Europe Ltd.)
William Byrne (Cambridge University Engineering Department)

The standard approach to HMM-based speech synthesis is inconsistent in the enforcement of the deterministic constraints between static and dynamic features. The trajectory HMM and autoregressive HMM have been proposed as normalized models which rectify this inconsistency. This paper investigates the practical effects of using these normalized models, and examines the strengths and weaknesses of the different models as probabilistic models of speech. The most striking difference observed is that the standard approach greatly underestimates predictive variance. We argue that the normalized models have better predictive distributions than the standard approach, but that all the models we consider are still far from satisfactory probabilistic models of speech. We also present evidence that better intra-frame correlation modelling goes some way towards improving existing normalized models.

Sun-Ses2-S1-O:
Speech and Language Processing-Based Assistive Technologies and Health Applications

Time:Sunday 13:30 Place:Caravaggio (Adua 1) - Pala Affari - 1st Floor Type:Oral
Chairs:Tobias Bocklet, Gokhan Tur

13:30Automatic Detection of Depression in Speech using Gaussian Mixture Modeling with Factor Analysis

Douglas Sturim (MIT Lincoln Laboratory)
Pedro Torres-Carrasquillo (MIT Lincoln Laboratory)
Thomas Quatieri (MIT Lincoln Laboratory)
Nicolas Malyska (MIT Lincoln Laboratory)
Alan McCree (MIT Lincoln Laboratory)

Of increasing importance in the civilian and military population is the recognition of Major Depressive Disorder at its earliest stages and intervention before the onset of severe symptoms. Toward the goal of more effective monitoring of depression severity, we investigate automatic classifiers of depression state that have the important property of mitigating nuisances due to data variability, such as speaker and channel effects, that are unrelated to levels of depression. To assess our measures, we use a 35-speaker free-response speech database of subjects treated for depression over a six-week duration, along with standard clinical HAMD depression ratings. Preliminary experiments indicate that by mitigating nuisances, thus focusing on depression severity as a class, we can significantly improve classification accuracy over baseline Gaussian-mixture-model-based classifiers.

13:50Utterance Verification for automating the Hearing In Noise Test (HINT)

H. Timothy Bunnell (Nemours Biomedical Research)
Jason Lilley (Nemours Biomedical Research)
Sigfrid Soli (House Ear Institute)
Ivan Pal (Compreval)

Tests of speech intelligibility play an essential role in many audiological procedures, including diagnostic assessment, verification of hearing aid and cochlear implant fittings, outcome assessment following intervention, and screening of applicants for hearing-critical jobs. The Hearing In Noise Test (HINT) [1] is a speech intelligibility test commonly used for these purposes. A limitation of the HINT, as well as other similar tests, is that they must be administered and scored by a human observer. The present study is an evaluation of a preliminary HMM-based utterance verification system that can be used in place of a human observer to administer HINT.

14:10Analyzing the Nature of ECA Interactions in Children with Autism

Emily Mower (University of Southern California)
Chi-Chun Lee (University of Southern California)
James Gibson (University of Southern California)
Theodora Chaspari (University of Southern California)
Marian Williams (University of Southern California)
Shrikanth Narayanan (University of Southern California)

Embodied conversational agents (ECA) offer platforms for the collection of structured interaction data. This paper discusses data collected from the Rachel system, an ECA developed at the University of Southern California, for interactions with children with autism. Two dyads composed of a child with autism and his parent participated in an experiment with two modes: interactions with and without the ECA present. The goal of this work is to assess the naturalness of the data recorded in the ECA interaction. This analysis was carried out using a classification framework with the prediction variable of the presence or absence of the ECA in the interaction. The results demonstrate that it is possible to estimate whether or not a parent is interacting with the ECA using speech data. However, it is not generally possible to do so for the child suggesting that the Rachel system elicits communication data similar to that elicited through interactions between the child and his parent.

Sun-Ses2-P1:
Second Language Acquisition, Development and Learning I

Time:Sunday 13:30 Place:Valfonda 1 - Pala Congressi (Passi Perduti-Gallery) Type:Poster
Chair:Cinzia Avesani

#1Acquisition of Timing Patterns in Second Language

Mikhail Ordin (Bielefeld University)
Leona Polyanskaya (Bielefeld University)

The paper presents an analysis of speech rhythm development in second language acquisition. 51 German learners of English with varying degrees of proficiency were recorded producing 33 identical sentences of quasi-spontaneous speech. Durational characteristics of consonantal, vocalic intervals and syllables were calculated to allow for analysis of temporal acoustic cues pertaining to speech rhythm. We found that 1) durational characteristics in a second language depend on the proficiency level of the speaker and therefore can be modeled to predict the proficiency level of the language learner; 2) durational characteristics become more consistent throughout the process of second language mastery; and 3) multiple rhythms can operate on multiple timescales.
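The durational measures mentioned above can be illustrated with a minimal sketch. The abstract does not name the metrics it uses, so the three computed here (%V, ΔC and the vocalic nPVI, all commonly used in speech rhythm research) are an assumption, and the interval durations are invented for illustration.

```python
from statistics import mean, pstdev

def rhythm_metrics(intervals):
    """Compute simple rhythm metrics from labelled intervals.

    intervals: list of (label, duration_in_seconds), label is "C" or "V".
    """
    voc = [d for lab, d in intervals if lab == "V"]
    con = [d for lab, d in intervals if lab == "C"]
    total = sum(voc) + sum(con)
    # %V: share of the utterance taken up by vocalic intervals
    percent_v = 100.0 * sum(voc) / total
    # Delta-C: standard deviation of consonantal interval durations
    delta_c = pstdev(con)
    # nPVI: mean normalised difference between successive vocalic intervals
    pairs = list(zip(voc, voc[1:]))
    npvi_v = 100.0 * mean([abs(a - b) / ((a + b) / 2.0) for a, b in pairs])
    return {"percent_v": percent_v, "delta_c": delta_c, "npvi_v": npvi_v}

# Invented durations (seconds), alternating consonantal and vocalic intervals
example = [("C", 0.08), ("V", 0.12), ("C", 0.10),
           ("V", 0.06), ("C", 0.12), ("V", 0.12)]
metrics = rhythm_metrics(example)
```

Metrics of this kind could then be compared across proficiency levels, which is the sort of analysis the abstract describes.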

#2Context-dependent Duration Modeling with Backoff Strategy and Look-up Tables for Pronunciation Assessment and Mispronunciation Detection

Hongyan Li (Institute of Automation, Chinese Academy of Sciences)
Shen Huang (Institute of Automation, Chinese Academy of Sciences)
Shijin Wang (Institute of Automation, Chinese Academy of Sciences)
Bo Xu (Institute of Automation, Chinese Academy of Sciences)

This paper presents an intensive study of contextual modeling methods for duration information, with the aim of improving the performance of pronunciation assessment and mispronunciation detection. The main ideas are as follows: 1) we extend the relations among duration sequences with different levels of contextual constraints and bring them into a unified framework; 2) a backoff mechanism is introduced to resolve the problems of data sparseness and unbalanced distribution; 3) rather than traditional parametric functions, we use discrete modeling of empirical duration distributions based on look-up tables, which can improve model precision and speed up computation. The experimental results show the effectiveness of these methods.
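A minimal sketch of how a look-up-table duration model with backoff might be organised is given below. It is not the authors' implementation: the class name `DurationTable`, the 10 ms bin width, the minimum-count threshold and the backoff order (triphone, then left biphone, then monophone) are all assumptions chosen for illustration.

```python
from collections import defaultdict

BIN_MS = 10     # histogram bin width in ms (assumed)
MIN_COUNT = 5   # samples needed before a context is trusted (assumed)

class DurationTable:
    """Hypothetical discrete duration model with context backoff."""

    def __init__(self):
        # context -> duration bin -> count
        self.tables = defaultdict(lambda: defaultdict(int))
        self.totals = defaultdict(int)

    @staticmethod
    def _contexts(left, phone, right):
        # Most specific first: triphone, left biphone, monophone
        return [(left, phone, right), (left, phone, None), (None, phone, None)]

    def add(self, left, phone, right, dur_ms):
        b = int(dur_ms // BIN_MS)
        for ctx in self._contexts(left, phone, right):
            self.tables[ctx][b] += 1
            self.totals[ctx] += 1

    def prob(self, left, phone, right, dur_ms):
        """Return (probability, context used), backing off when data is sparse."""
        b = int(dur_ms // BIN_MS)
        for ctx in self._contexts(left, phone, right):
            if self.totals[ctx] >= MIN_COUNT:
                return self.tables[ctx][b] / self.totals[ctx], ctx
        return 0.0, None

# Invented training samples: (left, phone, right, duration in ms)
table = DurationTable()
for sample in [("k", "a", "t", 85), ("k", "a", "t", 95), ("k", "a", "s", 82),
               ("k", "a", "s", 88), ("k", "a", "p", 81), ("m", "a", "t", 120)]:
    table.add(*sample)

p_backoff, ctx_backoff = table.prob("k", "a", "t", 84)
p_mono, ctx_mono = table.prob("m", "a", "t", 125)
```

In this toy data the triphone `k-a+t` has only two samples, so the query falls back to the left-biphone table. A table look-up of this kind replaces the evaluation of a parametric density, which is one plausible reading of the speed-up the abstract mentions.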

#3Perceptual training of vowel length contrast of Japanese by L2 listeners: Effects of an isolated word versus a word embedded in sentences

Mee Sonu (GITI / LASS Lab., Waseda University, Japan)
Keiichi Tajima (Department of Psychology, Hosei University, Japan)
Hiroaki Kato (NICT, Japan)
Yoshinori Sagisaka (GITI / LASS Lab., Waseda University, Japan)

In an attempt to improve the perception of vowel length contrasts in Japanese by L2 learners (L1 Korean), we compared two different training methods. The first involved training L2 learners with sets of isolated words contrasting the vowels (word training), whereas the other involved presenting the same words within sentences (sentence training). Word training and sentence training both led to significant improvement in the learners’ ability to identify vowel length contrasts in Japanese, and both types of training improved the listeners’ ability to perceive consonant length contrasts in Japanese, which the listeners were not trained to identify. Although the amount of overall improvement was not significantly different between training methods, sentence training showed improvement in a wider range of conditions than word training. These results indicate that word training and sentence training are both effective at improving perception of length contrasts in Japanese, but that sentence training may have some advantage over word training.

#4Similar Vowels in L1/L2 Production: Confused or Discerned in Early L2 English Learners with Different amount of Exposure

E-Chin Wu (None)

The degree of similarity between L1 and L2 sounds is said to be crucial for L1/L2 sound distinction. Similar L1 and L2 sounds, according to the Speech Learning Model (SLM), tend to be undistinguished. Yet much has been left undetermined about what counts as “similar”. In this study, the relation between phonetic similarity and L1/L2 sound distinction is explored by examining whether different acoustic cues were produced for Mandarin and English when the English vowels tested were similar to the Mandarin ones (i.e. /i/, /a/, and /u/). The effect of exposure was also investigated. The results showed that Mandarin children made distinctions for the similar vowels of Mandarin and English in the dimension expected and that exposure did affect how well certain vowels were discerned for the group with less exposure. The fact that L2 learners were able to pick out the differences even between similar vowels found in L1 and L2 implies a need to clarify the concept of similarity in L2 learning models.

#5Production and perception of Estonian vowels by native and non-native speakers

Lya Meister (Institute of Cybernetics at Tallinn University of Technology)
Einar Meister (Institute of Cybernetics at Tallinn University of Technology)

The aim of the paper is to study the production of Estonian vowel categories by L2 speakers of Estonian with a Russian-language background and to compare the results with perception data from the same subjects. Ten native Estonian subjects and ten L2 speakers participated in both the reading of an Estonian text corpus and the perception experiment. It was found that mostly the production and perception results show similar patterns and thus lend support to the common standpoint that L2 perception predicts the accuracy of L2 production. However, evidence was found that despite the correct perceptual identification of L2 vowels, in L2 production the native categorical vowel representation outweighs the newer L2 category pattern.

#6New feature parameters for pronunciation evaluation in English presentations at international conferences

Hiroshi Kibishi (Department of Computer Science and Engineering Toyohashi University of Technology, Japan)
Seiichi Nakagawa (Department of Computer Science and Engineering Toyohashi University of Technology, Japan)

We statistically analyzed the speakers' actual utterances to find combinations of acoustic features with a high correlation between the score estimated by a linear regression model and the score perceived by native English teachers. In this paper, we examined the quality of new acoustic features that are useful when used in combination with the system's estimates of pronunciation score and intelligibility. Results showed that the best combination of acoustic features produced correlation coefficients of 0.929 and 0.753 for pronunciation and intelligibility, respectively, using open data for speakers at the 10-sentence level.

#7Synchronous reading: learning French orthography by audiovisual training

Gérard Bailly (GIPSA-Lab)
William-Seamus Barbour (GIPSA-Lab)

We assess here the potential benefit of a karaoke-style reading system for learning sound-to-letter mapping in irregular languages. We have developed a framework that eases the development of interactive systems exploiting the alignment of text with audio at various levels (letters, phones, syllables, words, chunks, etc.). Synchronous reading consists of using text time-aligned with speech at the phone level to move a cursor – here a virtual finger – over the text in synchrony with its verbalization. We demonstrate here that this bimodal reading implicitly facilitates the learning of the correspondence between sounds and letters in French for native and foreign subjects. Native subjects are shown to benefit more strongly from synchronous reading.

#8Phoneme Level Non-Native Pronunciation Analysis by an Auditory Model-based Native Assessment Scheme

Christos Koniaris (KTH - Royal Institute of Technology)
Olov Engwall (KTH - Royal Institute of Technology)

We introduce a general method for automatic diagnostic evaluation of the pronunciation of individual non-native speakers based on a model of the human auditory system trained with native data stimuli. For each phoneme class, the Euclidean geometry similarity between the native perceptual domain and the non-native speech power spectrum domain is measured. The problematic phonemes for a given second language speaker are found by comparing this measure to the Euclidean geometry similarity for the same phonemes produced by native speakers only. The method is applied to different groups of non-native speakers of various language backgrounds and the experimental results are in agreement with theoretical findings of linguistic studies.

#9The open front vowel /æ/ in the production and perception of Czech students of English

Pavel Šturm (Institute of Phonetics, Charles University in Prague, Czech Republic)
Radek Skarnitzl (Institute of Phonetics, Charles University in Prague, Czech Republic)

This study addresses the acquisition of the English open front vowel by Czech learners of English, who are known to experience difficulties in both its production and perception. Secondary school students and university students of English judged the acceptability of the open front vowel as pronounced by other Czech learners of English. Their evaluations were plotted against acoustic measurements (F1, F2, and vowel duration) and linguistically relevant variables. The evaluations varied as a function of F1 and L2 experience. The experienced subjects perceived the vowel more accurately and consistently than did the relatively inexperienced subjects.

#10Error selection for ASR-based English pronunciation training in 'My Pronunciation Coach'

Catia Cucchiarini (Centre for Language and Speech Technology, Radboud University Nijmegen, The Netherlands)
Henk van den Heuvel (Centre for Language and Speech Technology, Radboud University Nijmegen, The Netherlands)
Eric Sanders (Centre for Language and Speech Technology, Radboud University Nijmegen, The Netherlands)
Helmer Strik (Department of Linguistics, Radboud University Nijmegen, The Netherlands)

In this paper we report on a study of pronunciation errors that was conducted within the framework of the project “My Pronunciation Coach”, which is aimed at developing an ASR-based system for pronunciation training for learners of English with Dutch as their mother tongue. This study was aimed at obtaining quantitative data on the occurrence of pronunciation errors in Dutch English speech. We present the results of this study and compare them to those of previous investigations. Finally, we discuss the implications of these results for the development of My Pronunciation Coach.

#11An Experimental Analysis of Pitch Patterns in Japanese Speakers of English with Verification by Speech Re-synthesis

Tomoko Nariai (University of Tsukuba)
Kazuyo Tanaka (University of Tsukuba)

Certain irregularities in utterances of a word or phrase occur in English as spoken by Japanese native subjects (Japanese English, henceforth). This study regards such irregular pitch patterns as one of the most common causes of deficiencies in Japanese English, and assumes that Japanese English would have better pitch patterns if these peculiarities were modified. Firstly, pitch patterns of Japanese English are statistically analyzed. The analytical results provide a rule for modifying the pitch patterns of Japanese English in order to improve naturalness. To check the appropriateness of the rule, the pitch patterns of several samples of Japanese English are acoustically modified and re-synthesized. The modified speech samples are evaluated in a listening experiment with native English speakers. On average, more than three times as many subjects preferred the modified speech to the original. The results therefore provide practical support for the proposed way of modifying Japanese English.

#12An Analysis of Word Duration in Native Speakers and Japanese Speakers of English

Tomoko Nariai (University of Tsukuba)
Kazuyo Tanaka (University of Tsukuba)

An analysis of word duration in English sentences uttered by native speakers of Japanese is presented, in which the difference in prosodic patterns between the English and Japanese languages is taken into account. The durations of Japanese speakers are compared with those of English speakers in terms of the percentage of the total sentence duration taken up by each individual word. The statistical analysis revealed that nouns and words at the ends of sentences were shorter for Japanese speakers than for English speakers. The former result suggests that English speakers put prominence on nouns, whereas Japanese speakers tend not to have the same rhythm as English speakers. The latter result suggests that phrase-final lengthening is insufficient in Japanese speakers.
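The percentage measure used in this comparison can be sketched as follows. The function name and the word durations are hypothetical. The point is only that expressing each word as a share of the whole sentence normalises away overall speaking rate before speakers are compared.

```python
def duration_shares(word_durations):
    """Return each word's duration as a percentage of the sentence total.

    word_durations: list of (word, duration_in_seconds).
    """
    total = sum(d for _, d in word_durations)
    return [(w, 100.0 * d / total) for w, d in word_durations]

# Invented durations for one sentence from two hypothetical speakers
native = [("the", 0.10), ("dog", 0.30), ("barked", 0.40), ("loudly", 0.20)]
learner = [("the", 0.20), ("dog", 0.30), ("barked", 0.40), ("loudly", 0.10)]

native_shares = dict(duration_shares(native))
learner_shares = dict(duration_shares(learner))
# In this invented data the learner's sentence-final word takes a smaller share.
final_word_gap = native_shares["loudly"] - learner_shares["loudly"]
```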

Sun-Ses2-P2:
Speech Enhancement

Time:Sunday 13:30 Place:Valfonda 2 - Pala Congressi (Passi Perduti-Gallery) Type:Poster
Chair:Dietrich Klakow

#1Evaluating artificial bandwidth extension by conversational tests in car using mobile devices with integrated hands-free functionality

Laura Laaksonen (Nokia, Symbian Smartphones, Audio Technology, Finland)
Ville Myllylä (Nokia, Symbian Smartphones, Audio Technology, Finland)
Riitta Niemistö (Nokia, Symbian Smartphones, Audio Technology, Finland)

This paper describes an artificial bandwidth extension (ABE) method that adds new high-frequency components to a narrowband signal by folding specifically gained subbands to frequencies from 4 kHz to 7 kHz, and improves the quality and intelligibility of narrowband speech in mobile devices. The proposed algorithm was evaluated by subjective listening tests. In addition, a rarely used conversational test was constructed. The speech quality of 1) a narrowband phone call, 2) a wideband phone call, and 3) a narrowband phone call enhanced with ABE was evaluated in a conversational context using mobile devices with integrated hands-free (IHF) functionality. The results indicate that in the IHF use case, ABE quality exceeds narrowband speech quality both in car noise and in a quiet environment.

#2Low-Frequency Bandwidth Extension of Telephone Speech Using Sinusoidal Synthesis and Gaussian Mixture Model

Hannu Pulakka (Department of Signal Processing and Acoustics, Aalto University, Finland)
Ulpu Remes (Adaptive Informatics Research Centre, Aalto University, Finland)
Santeri Yrttiaho (Department of Signal Processing and Acoustics, Aalto University, Finland)
Kalle Palomäki (Adaptive Informatics Research Centre, Aalto University, Finland)
Mikko Kurimo (Adaptive Informatics Research Centre, Aalto University, Finland)
Paavo Alku (Department of Signal Processing and Acoustics, Aalto University, Finland)

The limited audio bandwidth of narrowband telephone speech degrades the speech quality. This paper proposes a method that extends the bandwidth of telephone speech to the frequency range 0 - 300 Hz. The lowest harmonics of voiced speech are generated using sinusoidal synthesis. The energy in the extension band is estimated from spectral features using a Gaussian mixture model. The amplitudes and phases of the synthesized signal are adjusted based on the amplitudes and phases of the narrowband input speech. The proposed method was evaluated with listening tests together with a bandwidth extension method for the range 4 - 8 kHz. The low-frequency bandwidth extension was found to reduce dissimilarity with wideband speech but no perceived quality improvement was achieved.
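The sinusoidal synthesis step can be illustrated with a short sketch that sums the harmonics of the fundamental frequency lying below 300 Hz. This only illustrates the general technique. In the paper the extension-band energy is estimated with a Gaussian mixture model and the amplitudes and phases are adjusted from the narrowband input, whereas here they are simply passed in as arguments. The function name, sampling rate and frame length are assumptions.

```python
import math

def low_harmonics(f0_hz, amplitudes, phases,
                  fs_hz=16000, n_samples=160, cutoff_hz=300.0):
    """Synthesize one frame containing the harmonics of f0 below cutoff_hz."""
    out = [0.0] * n_samples
    for k, (amp, phase) in enumerate(zip(amplitudes, phases), start=1):
        fk = k * f0_hz
        if fk >= cutoff_hz:
            break  # higher harmonics are already present in the narrowband signal
        for n in range(n_samples):
            out[n] += amp * math.cos(2.0 * math.pi * fk * n / fs_hz + phase)
    return out

# With f0 = 100 Hz only the 100 Hz and 200 Hz harmonics fall below 300 Hz.
frame = low_harmonics(100.0, [1.0, 0.5, 0.25], [0.0, 0.0, 0.0])
# With f0 = 200 Hz only the fundamental is synthesized.
frame_high_f0 = low_harmonics(200.0, [1.0, 0.5], [0.0, 0.0])
```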

#3Memory-Based Approximation of the Gaussian Mixture Model Framework for Bandwidth Extension of Narrowband Speech

Amr Nour-Eldin (McGill University, Montreal, Canada)
Peter Kabal (McGill University, Montreal, Canada)

In this paper, we extend our previous work on exploiting speech temporal properties to improve Bandwidth Extension (BWE) of narrowband speech using Gaussian Mixture Models (GMMs). By quantifying temporal properties through information theoretic measures and using delta features, we have shown that narrowband memory significantly increases certainty about highband parameters. However, as delta features are non-invertible, they cannot be directly used to reconstruct highband frequency content. In the work presented herein, we embed temporal properties indirectly into the GMM structure through a memory-dependent tree-based approach to extend representation of the narrow band. In particular, sequences of past frames are progressively used to grow the GMM in a tree-like fashion. This growth approach results in reliable estimates for the GMM parameters such that Maximum Likelihood estimation is no longer necessary, thus circumventing the complexity accompanying high-dimensionality GMM training.

#4Speech enhancement by reconstruction from cleaned acoustic features

Philip Harding (University of East Anglia)
Ben Milner (University of East Anglia)

This paper proposes a novel method of speech enhancement that moves away from conventional filtering-based methods and instead aims to reconstruct clean speech from a set of speech features. Underlying the enhancement system is a speech model which at present is based on a sinusoidal model. This is driven by a set of speech features, comprising voicing, fundamental frequency and spectral envelope, that are extracted from the noisy speech. A maximum a posteriori approach is proposed for estimating clean spectral envelope features from the noisy spectral envelope. A set of subjective tests, measuring speech quality, noise intrusiveness and overall quality, found the proposed method to be highly effective at removing noise. Comparison against conventional speech enhancement methods found performance to be equivalent to Wiener filtering.

#5A Soft Decision-based Speech Enhancement using Acoustic Noise Classification

Jae-Hun Choi (Hanyang University)
Sang-Kyun Kim (Hanyang University)
Joon-Hyuk Chang (Hanyang University)

In this letter, we present a speech enhancement technique based on the ambient noise classification incorporating the Gaussian mixture model (GMM). The principal parameters of the statistical model-based speech enhancement algorithm such as the weighting parameter in the decision-directed (DD) method and the long-term smoothing parameter of the noise estimation, are chosen as different values according to the classified contexts to ensure best performance for each noise. For the real-time environment awareness, the noise classification is performed on a frame-by-frame basis using the GMM with the soft decision framework. The speech absence probability (SAP) is used in detecting the speech absence periods and updating the likelihood of the GMM.

#6A Noise Estimation Method Based on Speech Presence Probability and Spectral Sparseness

Chao Li (National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences)
Wenju Liu (National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences)

This paper addresses the problem of noise power spectrum estimation. Existing noise estimation methods do not perform reliably when the noise level increases abruptly (e.g., a narrowband noise burst). To overcome this problem, we improve the time-recursive averaging algorithm based on speech presence probability (SPP) by exploiting the sparseness of the speech spectrum. Firstly, we use the SPP estimation method based on fixed priors to achieve low SPP estimates at time-frequency bins where speech is absent. Furthermore, a spectral sparseness measure is proposed to adjust the SPP estimates. Experiments show that the proposed method can update the noise estimates faster than state-of-the-art approaches in both stationary and nonstationary noise.
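The time-recursive averaging that this abstract builds on can be sketched generically: the noise estimate in each frequency bin is updated quickly when speech is probably absent and frozen when speech is probably present. The sketch below shows that standard idea only. The fixed-prior SPP estimator and the sparseness-based adjustment proposed in the paper are not reproduced, and the function name and the smoothing constant are assumptions.

```python
def update_noise_psd(noise_psd, noisy_psd, spp, alpha=0.8):
    """One frame of SPP-controlled recursive averaging.

    noise_psd: previous noise power estimate per frequency bin
    noisy_psd: |Y(k)|^2 of the current noisy frame per bin
    spp:       speech presence probability per bin, in [0, 1]
    alpha:     base smoothing constant (assumed value)
    """
    updated = []
    for sigma2, y2, p in zip(noise_psd, noisy_psd, spp):
        # p = 1 gives alpha_eff = 1 (no update); p = 0 gives alpha_eff = alpha
        alpha_eff = alpha + (1.0 - alpha) * p
        updated.append(alpha_eff * sigma2 + (1.0 - alpha_eff) * y2)
    return updated

# Two bins with the same observation: speech present in the first, absent in the second
new_psd = update_noise_psd([1.0, 1.0], [5.0, 5.0], [1.0, 0.0])
```

Because the update rate depends directly on the SPP, an SPP that stays low during a sudden noise burst lets the estimate follow the burst, which is one way to read the motivation given in the abstract.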

#7Improved a posteriori Speech Presence Probability Estimation Based on Cepstro-Temporal Smoothing and Time-Frequency Correlation

Chao Li (National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences)
Wenju Liu (National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences)

In this paper, we present a novel estimator for the speech presence probability (SPP) at each time-frequency point in the short-time Fourier transform (STFT) domain. Existing SPP estimators do not perform reliably in nonstationary noise environments when applied to a speech enhancement task. To overcome this limitation, we propose a novel SPP estimation method. Firstly, the spectral outliers are eliminated by selectively smoothing the maximum likelihood estimate of the a priori signal-to-noise ratio (SNR) in the cepstral domain. Furthermore, an adaptive tracking scheme for the a priori SPP is derived by exploiting the strong correlation of speech presence in neighboring frequency bins of consecutive frames. The proposed approach outperforms the state-of-the-art approaches, resulting in less noise leakage and low speech distortion in both stationary and nonstationary noise environments.

#8A Rapid Adaptation Algorithm for Tracking Highly Non-Stationary Noises Based on Bayesian Inference for On-Line Spectral Change Point Detection

Md Foezur Rahman Chowdhury (INRS-EMT, Université du Québec, Montreal, QC, Canada)
Sid-Ahmed Selouani (Université de Moncton, Campus de Shippagan, NB, Canada)
Douglas O'Shaughnessy (INRS-EMT, Université du Québec, Montreal, QC, Canada)

This paper presents an innovative rapid adaptation technique for tracking highly non-stationary acoustic noises. The novelty of this technique is that it can detect the acoustic change points from the spectral characteristics of the observed speech signal in rapidly changing non-stationary acoustic environments. The proposed innovative noise tracking technique will be very suitable for joint additive and channel distortions compensation (JAC) for on-line automatic speech recognition (ASR). The Bayesian on-line change point detection (BOCPD) approach is used to implement this technique. The proposed algorithm is tested using highly non-stationary noisy speech samples from the Aurora2 speech database. Significant improvement in minimizing the delay in adaptation to new acoustic conditions is obtained for highly non-stationary noises compared to the most popular baseline noise tracking algorithm MCRA and its derivatives.

#9Single channel speech enhancement using MMSE estimation of short-time modulation magnitude spectrum

Kuldip Paliwal (Signal Processing Laboratory, School of Engineering, Griffith University, Australia)
Belinda Schwerin (Signal Processing Laboratory, School of Engineering, Griffith University, Australia)
Kamil Wojcicki (Signal Processing Laboratory, School of Engineering, Griffith University, Australia)

In this paper we investigate the enhancement of speech by applying MMSE short-time spectral magnitude estimation in the modulation domain. For this purpose, the traditional analysis-modification-synthesis framework is extended to include modulation domain processing. We compensate the noisy modulation spectrum for additive noise distortion by applying the MMSE short-time spectral magnitude estimation algorithm in the modulation domain. Subjective experiments were conducted to compare the quality of stimuli processed by the MMSE modulation magnitude estimator to those processed using the MMSE acoustic magnitude estimator and the modulation spectral subtraction method. The proposed method is shown to have better noise suppression than MMSE acoustic magnitude estimation, and improved speech quality compared to modulation domain spectral subtraction.

#10Speech Enhancement Using Masking Properties in Adverse Environments

Atanu Saha (Graduate School of Science and Engineering, Saitama University, Saitama, Japan)
Tetsuya Shimamura (Graduate School of Science and Engineering, Saitama University, Saitama, Japan)

In this paper, we propose a speech enhancement method that exploits the masking properties of the human auditory system. The masking properties are used to calculate a masking threshold. The spectral components that lie above the threshold are audible to human listeners. In the proposed method, these audible spectral components are suppressed according to a predefined attenuation factor of the original noise. Experimental results show that the proposed method performs significantly better than the conventional approaches.

#11Phoneme-dependent NMF for speech enhancement in monaural mixtures

Bhiksha Raj (Carnegie Mellon University)
Rita Singh (Carnegie Mellon University)
Tuomas Virtanen (Tampere University of Technology)

The problem of separating speech signals out of monaural mixtures (with other non-speech or speech signals) has become increasingly popular in recent times. Among the various solutions proposed, the most popular methods are based on compositional models such as non-negative matrix factorization (NMF) and latent variable models. Although these techniques are highly effective they largely ignore the inherently phonetic nature of speech. In this paper we present a phoneme-dependent NMF-based algorithm to separate speech from monaural mixtures. Experiments performed on speech mixed with music indicate that the proposed algorithm can result in significant improvement in separation performance, over conventional NMF-based separation.

#12Kernel PCA for Speech Enhancement

Christina Leitner (Graz University of Technology)
Franz Pernkopf (Graz University of Technology)
Gernot Kubin (Graz University of Technology)

In this paper, we apply kernel principal component analysis (kPCA), which has been successfully used for image de-noising, to speech enhancement. In contrast to other enhancement methods which are based on the magnitude spectrum, we rather apply kPCA to complex spectral data. This is facilitated by Gaussian kernels. In the experiments, we show good noise reduction with few artifacts for noise corrupted speech at different SNR levels using additive white Gaussian noise. We compared kPCA with linear PCA and spectral subtraction and evaluated all algorithms with perceptually motivated quality measures.

#13Objective Intelligibility Prediction of Speech by Combining Correlation and Distortion based Techniques

Angel Gomez (Dpt. of Signal Theory, Networking and Communications, University of Granada, Spain)
Belinda Schwerin (Signal Processing Laboratory, School of Engineering, Griffith University, Australia)
Kuldip Paliwal (Signal Processing Laboratory, School of Engineering, Griffith University, Australia)

A number of techniques based on correlation measurements have recently been proposed to provide an objective measure of intelligibility. These techniques are able to detect nonlinear distortions and provide intelligibility scores highly correlated with those given by human listeners. However, the performance of these techniques has not been found satisfactory for measuring the speech intelligibility of speech enhancement algorithms. In this paper we first investigate the different correlation-based methods, in the context of speech enhancement. We then propose to combine these correlation-based techniques with spectral distance-based ones. Results presented show that objective intelligibility prediction is significantly improved by this combination.

Sun-Ses2-P3:
ASR - Feature Extraction I

Time:Sunday 13:30 Place:Faenza 1 - Pala Congressi (Passi Perduti-Gallery) Type:Poster
Chair:Fabio Brugnara

#1Integrating recent MLP feature extraction techniques into TRAP architecture

Frantisek Grezl (Brno University of Technology)
Martin Karafiat (Brno University of Technology)

This paper focuses on incorporating recent techniques for multi-layer perceptron (MLP) based feature extraction into the Temporal Pattern (TRAP) and Hidden Activation TRAP (HATS) feature extraction schemes. The TRAP scheme has been the origin of various MLP-based features, some of which are now an integral part of state-of-the-art LVCSR systems. The modifications which brought the most improvement -- sub-phoneme targets and the Bottle-Neck technique -- are introduced into the original TRAP scheme. The introduction of sub-phoneme targets uncovered the hidden danger of having too many classes in the TRAP/HATS scheme. On the other hand, the Bottle-Neck technique improved the TRAP/HATS scheme so that it is competitive with other approaches.

#2Feature Frame Stacking in RNN-based Tandem ASR Systems - Learned vs. Predefined Context

Martin Woellmer (Technische Universitaet Muenchen)
Bjoern Schuller (Technische Universitaet Muenchen)
Gerhard Rigoll (Technische Universitaet Muenchen)

As phoneme recognition is known to profit from techniques that consider contextual information, neural networks applied in Tandem automatic speech recognition (ASR) systems usually employ some form of context modeling. While approaches based on multi-layer perceptrons or recurrent neural networks (RNN) are able to model a predefined amount of context by simultaneously processing a stacked sequence of successive feature vectors, bidirectional Long Short-Term Memory (BLSTM) networks were shown to be well-suited for incorporating a self-learned amount of context for phoneme prediction. In this paper, we evaluate combinations of BLSTM modeling and frame stacking to determine the most efficient method for exploiting context in RNN-based Tandem systems. Applying the COSINE corpus and our recently introduced multi-stream BLSTM-HMM decoder, we provide empirical evidence for the intuition that BLSTM networks make frame stacking redundant while RNNs profit from predefined feature-level context.
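
As an illustration of the "predefined context" mentioned above, frame stacking can be sketched as follows. This is a minimal sketch, not the authors' implementation: the function name `stack_frames`, the edge padding by repetition, and the toy feature values are all assumptions made for the example.

```python
def stack_frames(frames, left=2, right=2):
    """Concatenate each frame with `left` past and `right` future frames.

    Frames near the edges are padded by repeating the first or last frame
    (an assumption; other padding schemes are possible).
    """
    n = len(frames)
    stacked = []
    for t in range(n):
        window = []
        for offset in range(-left, right + 1):
            idx = min(max(t + offset, 0), n - 1)  # clamp index at the edges
            window.extend(frames[idx])
        stacked.append(window)
    return stacked


# Hypothetical 2-dimensional features for three frames.
features = [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1]]
stacked = stack_frames(features, left=1, right=1)
# Each stacked vector now has 2 * (1 + 1 + 1) = 6 dimensions.
```

A network fed with `stacked` sees a fixed window of context per frame, whereas a BLSTM would receive `features` directly and learn how much context to use.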

#3Improved Acoustic Feature Combination for LVCSR by Neural Networks

Christian Plahl (RWTH Aachen University)
Ralf Schlüter (RWTH Aachen University)
Hermann Ney (RWTH Aachen University)

This paper investigates the combination of different acoustic features. Several methods to combine these features such as concatenation or LDA are well known. Even though LDA improves the system, feature combination by LDA has been shown to be suboptimal. We introduce a new method based on neural networks. The posterior estimates derived from the NN lead to a significant improvement and achieve a 6% relative improvement in word error rate (WER). Results are also compared to system combination. While system combination has been reported to outperform all other combination techniques, in this work the proposed NN-based combination outperforms system combination. We achieve a 2% relative improvement in WER, resulting in an improvement of 7% relative to the baseline system. In addition to giving better recognition performance w.r.t. WER, NN-based combination reduces both training and testing complexity. Overall, we use a single set of acoustic models, together with the training of the NN.
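
Several abstracts in this programme report "relative" versus "absolute" error rate improvements. The distinction can be sketched as below; the WER values are hypothetical and are not taken from the paper.

```python
def relative_reduction(baseline_wer, new_wer):
    """Fraction of the baseline error rate removed by the new system."""
    return (baseline_wer - new_wer) / baseline_wer


def absolute_reduction(baseline_wer, new_wer):
    """Difference in error rate, in percentage points."""
    return baseline_wer - new_wer


# Hypothetical word error rates in percent (illustrative only).
baseline = 20.0
improved = 18.6

rel = relative_reduction(baseline, improved)      # about 0.07, i.e. 7% relative
abs_pts = absolute_reduction(baseline, improved)  # about 1.4 points absolute
```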

#4Hierarchical Tandem Features for ASR in Mandarin

Joel Pinto (Idiap Research Institute)
Mathew Magimai.-Doss (Idiap Research Institute)
Herve Bourlard (Idiap Research Institute)

We apply multilayer perceptron (MLP) based hierarchical Tandem features to large vocabulary continuous speech recognition in Mandarin. Hierarchical Tandem features are estimated using a cascade of two MLP classifiers which are trained independently. The first classifier is trained on perceptual linear predictive coefficients with a 90 ms temporal context. The second classifier is trained using the phonetic class conditional probabilities estimated by the first MLP, but with a relatively longer temporal context of about 150 ms. Experiments on the Mandarin DARPA GALE eval06 dataset show significant reduction (about 7.6% relative) in character error rates by using hierarchical Tandem features over conventional Tandem features.

#5Analysis and Comparison of Recent MLP Features for LVCSR Systems

Fabio Valente (Idiap Research Institute)
Mathew Magimai Doss (Idiap Research Institute)
Wen Wang (SRI International)

MLP based front-ends have evolved in different ways in recent years beyond the seminal TANDEM-PLP features. This paper aims at providing a fair comparison of these recent advances, including the use of different long/short temporal inputs and the use of complex architectures (bottleneck, hierarchy, multistream) that go beyond the conventional three-layer MLP. Furthermore, the paper identifies which of these actually provide advantages over the conventional TANDEM-PLP. The investigation is carried out on an LVCSR task for recognition of Mandarin Broadcast speech and results are analyzed in terms of Character Error Rate and phonetic confusions. Results reveal that, as stand-alone features, multistream front-ends can outperform conventional spectral features like MFCC by 10% while TANDEM-PLP only improves on them by 1%. When used in concatenation with MFCC features, hierarchical/bottleneck front-ends reduce the character error rate by 18% relative compared to 14% relative from TANDEM-PLP.

#6Deep Learning of Speech Features for Improved Phonetic Recognition

Jaehyung Lee (KAIST)
Soo-Young Lee (KAIST)

Recently, a remarkable performance result of 23.0% Phone Error Rate (PER) on the TIMIT core test set was reported by applying Deep Belief Network (DBN) on phonetic recognition [1]. Despite the good performance reported, there is still substantial room for improvement in the reported design in order to achieve optimal results. In this letter, we present an improved but simple architecture for phonetic recognition which uses log-Mel spectrum directly instead of Mel-Frequency Cepstral Coefficient (MFCC), and combines Deep Learning with conventional Baum-Welch re-estimation for subphoneme alignment. Experiments performed on TIMIT speech corpus show that the proposed method outperforms most of the conventional methods, yielding 21.4% PER on the complete test set of TIMIT and 22.1% on the core test set.

#7Globality-Locality Consistent Discriminant Analysis for Phone Classification

Heyun Huang (Department of Linguistics, Radboud University Nijmegen, Erasmuslaan 1, 6525 HT, Nijmegen, the Netherlands)
Yang Liu (Department of Computing, The Hong Kong Polytechnic University, Hung Hom, Kowloon, Hong Kong)
Jort Gemmeke (Department of Linguistics, Radboud University Nijmegen, Erasmuslaan 1, 6525 HT, Nijmegen, the Netherlands)
Louis ten Bosch (Department of Linguistics, Radboud University Nijmegen, Erasmuslaan 1, 6525 HT, Nijmegen, the Netherlands)
Bert Cranen (Department of Linguistics, Radboud University Nijmegen, Erasmuslaan 1, 6525 HT, Nijmegen, the Netherlands)
Lou Boves (Department of Linguistics, Radboud University Nijmegen, Erasmuslaan 1, 6525 HT, Nijmegen, the Netherlands)

Concatenating sequences of feature vectors helps to capture essential information about articulatory dynamics, at the cost of increasing the number of dimensions in the feature space, which may be characterized by the presence of manifolds. Existing supervised dimensionality reduction methods such as Linear Discriminant Analysis may destroy part of that manifold structure. In this paper, we propose a novel supervised dimensionality reduction algorithm, called Globality-Locality Consistent Discriminant Analysis (GLCDA), which aims to preserve global and local discriminant information simultaneously. Because it allows finding the optimal trade-off between global and local structure of data sets, GLCDA can provide a more faithful compact representation of high-dimensional observations than entirely global approaches or heuristic approaches aimed to preserve local information. Experimental results on the TIMIT phone classification task show the effectiveness of the proposed algorithm.

#8Front-End Compensation Methods for LVCSR Under Lombard Effect

Hynek Boril (Center for Robust Speech Systems (CRSS), The University of Texas at Dallas)
Frantisek Grezl (Speech@FIT, Brno University of Technology)
John H.L. Hansen (Center for Robust Speech Systems (CRSS), The University of Texas at Dallas)

This study analyzes the impact of noisy background variations and Lombard effect (LE) on large vocabulary continuous speech recognition (LVCSR). Robustness of several front-end feature extraction strategies combined with state-of-the-art feature distribution normalizations is tested on neutral and Lombard speech from the UT-Scope database presented in two types of background noise at various levels of SNR. An extension of a bottleneck (BN) front-end utilizing normalization of both critical band energies (CRBE) and BN outputs is proposed and shown to provide a competitive performance compared to the best MFCC-based system. A novel MFCC-based BN front-end is introduced and shown to outperform all other systems in all conditions considered (average 4.1% absolute WER reduction over the second best system). Additionally, two phenomena are observed: (i) combination of cepstral mean subtraction and recently established RASTALP filtering significantly reduces transient effects of RASTA band-pass filtering and increases ASR robustness to noise and LE; (ii) histogram equalization may benefit from utilizing reference distributions derived from pre-normalized rather than raw training features, and also from adopting distributions from different front-ends.
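
Cepstral mean subtraction, one of the normalizations named above, can be sketched as follows. This is a minimal per-utterance version under assumed inputs (a list of equal-length cepstral vectors); the function name and the toy values are hypothetical, and the RASTALP filtering and histogram equalization steps from the abstract are not shown.

```python
def cepstral_mean_subtraction(frames):
    """Subtract the per-dimension mean over the utterance from every frame."""
    num_frames = len(frames)
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / num_frames for d in range(dim)]
    return [[f[d] - means[d] for d in range(dim)] for f in frames]


# Hypothetical 2-dimensional cepstral vectors for a two-frame utterance.
utterance = [[1.0, 4.0], [3.0, 8.0]]
normalized = cepstral_mean_subtraction(utterance)
# After normalization every dimension has zero mean over the utterance.
```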

#9Classification of Fricatives Using Feature Extrapolation of Acoustic-Phonetic Features in Telephone Speech

Jung-Won Lee (Department School of Electrical and Electronic Engineering, Yonsei University)
Jeung-Yoon Choi (Department School of Electrical and Electronic Engineering, Yonsei University)
Hong-Goo Kang (Department School of Electrical and Electronic Engineering, Yonsei University)

This paper proposes a classification module for fricative consonants in telephone speech using an acoustic-phonetic feature extrapolation technique. In channel-deteriorated telephone speech, acoustic cues of fricative consonants are expected to be degraded or missing due to limited bandwidth. This paper applies an extrapolation technique to acoustic-phonetic features based on Gaussian mixture models, which uses statistical learning of the correspondence between acoustic-phonetic features of wideband speech and the spectral characteristics of telephone bandwidth speech. Experimental results with the NTIMIT database verify that the feature extrapolation improves the performance of the fricative classification module for all unvoiced fricatives by 0.5-5% (relative) compared to the performance obtained using only acoustic-phonetic features extracted from the narrowband signal.

#10Noise Robust Feature Extraction Based on Extended Weighted Linear Prediction in LVCSR

Sami Keronen (Aalto University School of Science and Technology)
Jouni Pohjalainen (Aalto University School of Science and Technology)
Paavo Alku (Aalto University School of Science and Technology)
Mikko Kurimo (Aalto University School of Science and Technology)

This paper introduces extended weighted linear prediction (XLP) to noise robust short-time spectrum analysis in the feature extraction process of a speech recognition system. XLP is a generalization of standard linear prediction (LP) and temporally weighted linear prediction (WLP) which have already been applied to noise robust speech recognition with good results. With XLP, higher controllability to the temporal weighting of different parts of the noisy speech is gained by taking the lags of the signal into account in prediction. Here, the performance of XLP is put up against WLP and conventional spectrum analysis methods FFT and LP on a large vocabulary continuous speech recognition (LVCSR) scheme using real world noisy data containing additive and convolutive noise. The results show improvements over the reference methods in several cases.

#11Comparing Different Flavors of Spectro-Temporal Features for ASR

Bernd T. Meyer (International Computer Science Institute, Berkeley, CA, USA)
Suman V. Ravuri (International Computer Science Institute, Berkeley, CA, USA)
Marc René Schädler (Medical Physics, Institute of Physics, University of Oldenburg, Germany)
Nelson Morgan (International Computer Science Institute, Berkeley, CA, USA)

In the last decade, several studies have shown that the robustness of ASR systems can be increased when 2D Gabor filters are used to extract specific modulation frequencies from the input pattern. This paper analyzes important design parameters for spectro-temporal features based on a Gabor filter bank: We perform experiments with filters that exhibit different phase sensitivity. Further, we analyze if non-linear weighting with a multi-layer perceptron (MLP) and a subsequent concatenation with mel frequency cepstral coefficients (MFCCs) has beneficial effects. For the Aurora2 noisy digit recognition task, the use of phase sensitive filters improved the MFCC baseline, whereas using filters that neglect phase information did not. While MLP processing alone did not have a large effect on the overall scores, the best results were obtained for MLP-processed phase sensitive filters and added MFCCs, with relative error reductions of over 40% for both noisy and clean training.

#12VTLN in the MFCC domain: band-limited versus local interpolation

Ehsan Variani (Johns Hopkins University)
Thomas Schaaf (Multimodal Technologies, Inc.)

We propose a new easy-to-implement method to compute a Linear Transform (LT) to perform Vocal Tract Length Normalization (VTLN) on truncated Mel Frequency Cepstral Coefficients (MFCCs) normally used in distributed speech recognition. The method is based on a Local Interpolation which is independent of the Mel filter design. Local Interpolation (LILT) VTLN is theoretically and experimentally compared to a global scheme based on band-limited interpolation (BLI-VTLN) and the conventional frequency warping scheme (FFT-VTLN). Investigating the interoperability of these methods shows that the performance of LILT-VTLN is on par with FFT-VTLN and BLI-VTLN. The statistical significance test also shows that there are no significant differences between FFT-VTLN, LILT-VTLN, and BLI-VTLN, even if the models and front ends do not match.

#13Multistream Bandpass Modulation Features for Robust Speech Recognition

Sridhar Krishna Nemala (Johns Hopkins University)
Kailash Patil (Johns Hopkins University)
Mounya Elhilali (Johns Hopkins University)

Current understanding of speech processing in the brain suggests dual streams of processing of temporal and spectral information, whereby slow vs. fast modulations are analyzed along parallel paths that encode various scales of information in speech signals. In this work, we propose a multistream approach to feature analysis for robust speaker-independent phoneme recognition. The scheme presented here centers around a multi-path bandpass modulation analysis of speech sounds with each stream covering an entire range of temporal and spectral modulations. By performing bandpass operations of slow vs. fast information along the spectral and temporal dimensions, the proposed scheme avoids the classic feature explosion problem of previous multistream approaches while maintaining the advantage of parallelism and localized feature analysis. The proposed architecture results in substantial improvements over standard baseline features and two state-of-the-art noise robust feature schemes.

#14An Analysis of Automatic Speech Recognition with Multiple Microphones

Davide Marino (University of Sheffield)
Thomas Hain (University of Sheffield)

Automatic speech recognition in real world situations often requires the use of microphones distant from the speaker's mouth. One or several microphones are placed in the surroundings to capture many versions of the original signal. Recognition with a single far field microphone yields considerably poorer performance than with person-mounted devices (headset, lapel), with the main causes being reverberation and noise. Acoustic beam-forming techniques allow significant improvements over the use of a single microphone, although the overall performance still remains well above the close-talking results. In this paper we investigate the use of beam-forming in the context of speaker movement, together with commonly used adaptation techniques and compare against a naive multi-stream approach. We show that even such a simple approach can yield equivalent results to beam-forming, allowing for far more powerful integration of multiple microphone sources in ASR systems.

Sun-Ses2-P4:
Spoken Dialogue & Spoken Language Understanding Systems

Time:Sunday 13:30 Place:Faenza 2 - Pala Congressi (Passi Perduti-Gallery) Type:Poster
Chair:Steve Renals

#1Multi-view approach for speaker turn role labeling in TV Broadcast News shows

Geraldine Damnati (France Telecom - Orange Labs)
Delphine Charlet (France Telecom - Orange Labs)

Speaker role recognition in TV Broadcast News shows is addressed in this paper. Speaker turns are assigned a role among anchor, reporter and other. A multi-view approach is proposed exploiting the complementarities of lexical cues obtained from Automatic Speech Recognition output and acoustical cues obtained from speech signal analysis. Early and late fusions are compared. A classification accuracy of 90.1% is obtained on automatically segmented speaker turns for a 6.5-hour test corpus of 14 shows mixing news and conversational speech. Further analyses are provided for other speaker turns showing interesting perspectives towards finer-grained speaker role characterization.

#2Evaluation of an Integrated Authoring Tool for Building Advanced Question-Answering Characters

Sudeep Gandhe (USC Institute for Creative Technologies)
Michael Rushforth (University of Texas at San Antonio)
Priti Aggarwal (USC Institute for Creative Technologies)
David Traum (USC Institute for Creative Technologies)

We present the evaluation of an integrated authoring tool for rapid prototyping of dialogue systems. These dialogue systems are designed to support virtual humans engaging in advanced question-answering dialogues, such as for training tactical questioning skills. The tool was designed to help non-experts, who may have little or no knowledge of linguistics or computer science, build virtual characters that can play the role of an interviewee. The tool has been successfully used by several different non-experts to create a number of virtual characters used successfully for both training and human subjects testing. We report on experiences with seven such characters, whose development time was as little as two weeks including concept development and a round of user testing.

#3Towards Unsupervised Spoken Language Understanding: Exploiting Query Click Logs for Slot Filling

Gokhan Tur (Microsoft Speech Labs | Microsoft Research)
Dilek Hakkani-Tür (Microsoft Speech Labs | Microsoft Research)
Dustin Hillard (Microsoft Speech Labs)
Asli Celikyilmaz (Microsoft Speech Labs)

In this paper, we present a novel approach to exploit user queries mined from search engine query click logs to bootstrap or improve slot filling models for spoken language understanding. We propose extending the earlier gazetteer population techniques to mine unannotated training data for semantic parsing. The automatically annotated mined data can then be used to train slot specific parsing models. We show that this method can be used to bootstrap slot filling models and can be combined with any available annotated data to improve performance. Furthermore, this approach may eliminate the need for populating and maintaining in-domain gazetteers, in addition to providing complementary information if they are already available.

#4Web-enhanced Contents Retrieval for Information Access Dialogue System

Donghyeon Lee (Department of Computer Science and Engineering, POSTECH, South Korea)
Cheongjae Lee (Academic Center for Computing and Media Studies, Kyoto University, Japan)
Minwoo Jeong (Department of Computer Science and Engineering, POSTECH, South Korea)
Kyungduk Kim (Department of Computer Science and Engineering, POSTECH, South Korea)
Seokhwan Kim (Department of Computer Science and Engineering, POSTECH, South Korea)
Junhwi Choi (Department of Computer Science and Engineering, POSTECH, South Korea)
Gary Geunbae Lee (Department of Computer Science and Engineering, POSTECH, South Korea)

We consider the problem of contents retrieval with complex queries for information access dialogue systems. To deal with complex queries, dialogue systems have traditionally relied on deep semantic processing such as full semantic parsing and ontology-based reasoning. However, these techniques require a large amount of semantic annotation and domain expert knowledge that are often very expensive to create, and thus they have been limited in practice. In this paper, we present a simple alternative method that enhances vector space model-based contents retrieval with a web search engine. For robust contents retrieval, our model expands vector spaces with web documents to capture underlying co-occurrence patterns between the query and contents. One merit of the proposed approach is that it does not require heavy semantic processing, and therefore it results in efficient content retrieval. We demonstrate that our method is beneficial in an electronic program guide dialogue system.

#5Uncertainty management for on-line optimisation of a POMDP-based large-scale spoken dialogue system

Lucie Daubigney (Supelec)
Milica Gasic (Cambridge University)
Senthilkumar Chandramohan (Supelec - UAPV)
Matthieu Geist (Supelec)
Olivier Pietquin (Supelec - UMI 2958 (CNRS GeorgiaTech))
Steve Young (Cambridge University)

The optimization of dialogue policies using reinforcement learning (RL) is now an accepted part of the state of the art in spoken dialogue systems (SDS). Yet, it is still the case that the commonly used training algorithms for SDS require a large number of dialogues and hence most systems still rely on artificial data generated by a user simulator. Optimization is therefore performed off-line before releasing the system to real users. Gaussian Processes (GP) for RL have recently been applied to dialogue systems. One advantage of GP is that they compute an explicit measure of uncertainty in the value function estimates computed during learning. In this paper, a class of novel learning strategies is described which use uncertainty to control exploration on-line. Comparisons between several exploration schemes show that significant improvements to learning speed can be obtained and that rapid and safe online optimisation is possible, even on a complex task.

#6Detection of task-incomplete dialogs based on utterance-and-behavior tag N-gram for spoken dialog systems

Sunao Hara (Graduate School of Information Science, Nagoya University, Japan)
Norihide Kitaoka (Graduate School of Information Science, Nagoya University, Japan)
Kazuya Takeda (Graduate School of Information Science, Nagoya University, Japan)

We propose a method of detecting "task incomplete" dialogs in spoken dialog systems using N-gram-based dialog models. We used a database created during a field test in which inexperienced users used a client-server music retrieval system with a spoken dialog interface on their own PCs. In this study, the dialog for a music retrieval task consisted of a sequence of user and system tags that related their utterances and behaviors. The dialogs were manually classified into two classes: the dialog either completed the music retrieval task or it didn't. We then detected dialogs that did not complete the task, using N-gram probability models or a Support Vector Machine with N-gram feature vectors trained using manually classified dialogs. Off-line and on-line detection experiments were conducted on a large amount of real data, and the results show that our proposed method achieved good classification performance.
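
The N-gram probability model variant described above can be sketched as a pair of tag-sequence models, one per class, with a dialog assigned to whichever model gives it the higher likelihood. This is a minimal sketch under stated assumptions: bigrams with add-one smoothing are used for brevity, and the tag names and training dialogs are invented for illustration rather than taken from the paper's data.

```python
import math
from collections import Counter


def train_bigram(sequences):
    """Count tag bigrams and their left contexts, with start/end markers."""
    bigrams, contexts, vocab = Counter(), Counter(), set()
    for seq in sequences:
        padded = ["<s>"] + list(seq) + ["</s>"]
        vocab.update(padded)
        for prev, cur in zip(padded, padded[1:]):
            bigrams[(prev, cur)] += 1
            contexts[prev] += 1
    return bigrams, contexts, vocab


def log_prob(seq, model, vocab_size):
    """Log-likelihood of a tag sequence under an add-one smoothed bigram model."""
    bigrams, contexts, _ = model
    padded = ["<s>"] + list(seq) + ["</s>"]
    total = 0.0
    for prev, cur in zip(padded, padded[1:]):
        total += math.log((bigrams[(prev, cur)] + 1) / (contexts[prev] + vocab_size))
    return total


def classify(seq, complete_model, incomplete_model):
    """Label a dialog by the class model with the higher log-likelihood."""
    vocab_size = len(complete_model[2] | incomplete_model[2])
    score_complete = log_prob(seq, complete_model, vocab_size)
    score_incomplete = log_prob(seq, incomplete_model, vocab_size)
    return "incomplete" if score_incomplete > score_complete else "complete"


# Hypothetical user (U_) and system (S_) tag sequences.
complete_dialogs = [
    ["U_REQUEST", "S_CONFIRM", "U_YES", "S_PLAY"],
    ["U_REQUEST", "S_PLAY"],
]
incomplete_dialogs = [
    ["U_REQUEST", "S_REJECT", "U_REQUEST", "S_REJECT"],
    ["U_REQUEST", "S_REJECT", "U_QUIT"],
]
complete_model = train_bigram(complete_dialogs)
incomplete_model = train_bigram(incomplete_dialogs)

label = classify(["U_REQUEST", "S_REJECT", "U_QUIT"], complete_model, incomplete_model)
```

For on-line detection, the same scores could be computed on the prefix of tags observed so far; that extension is not shown here.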

#7Shrinkage Based Features for Natural Language Call-Routing

Ruhi Sarikaya (IBM T.J. Watson Research Center)
Stanley F. Chen (IBM T.J. Watson Research Center)
Bhuvana Ramabhadran (IBM T.J. Watson Research Center)

The feature set used with a classifier can have a large impact on classification performance. This paper presents a set of shrinkage-based features for Maximum Entropy and other classifiers in the exponential family. These features are inspired by the exponential class-based language model, Model M. We motivate the use of these features for the task of text classification and evaluate them on a natural language call routing task. The proposed features along with a new word clustering method result in significant improvements in action classification accuracy over typical word-based features, particularly for small amounts of training data.

#8Clustering with modified cosine distance learned from constraints

Leonid Rachevsky (IBM)
Dimitri Kanevsky (IBM)
Ruhi Sarikaya (IBM)
Bhuvana Ramabhadran (IBM)

In this paper we present a modified cosine similarity metric that helps to make features more discriminative. The new metric is defined via various linear transformations of the original feature space to a space in which these samples are better separated. These transformations are learned from a set of constraints representing available domain knowledge by solving related optimization problems. We present results on two natural language call routing datasets that show significant improvements ranging from 3% to 5% absolute in the purity of clusters obtained in an unsupervised fashion.
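
The idea of a cosine similarity computed after a linear transformation can be sketched as below. This is an illustrative sketch only: the matrices are set by hand, whereas the abstract describes learning the transformations from constraints, and the function names are hypothetical.

```python
import math


def transform(matrix, vector):
    """Apply a linear transformation given as a list of rows."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def cosine(u, v):
    """Standard cosine similarity; returns 0.0 for a zero vector."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def modified_cosine(matrix, x, y):
    """Cosine similarity measured in the transformed feature space."""
    return cosine(transform(matrix, x), transform(matrix, y))


x = [1.0, 1.0]
y = [1.0, -1.0]
identity = [[1.0, 0.0], [0.0, 1.0]]  # leaves the space unchanged
stretch = [[2.0, 0.0], [0.0, 1.0]]   # hand-picked: emphasizes the first feature

plain = modified_cosine(identity, x, y)      # orthogonal in the original space
reweighted = modified_cosine(stretch, x, y)  # more similar after reweighting
```

With the identity matrix the metric reduces to ordinary cosine similarity, so a learned transformation can only change the clustering to the extent that it departs from the identity.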

#9Using Speaker ID to Discover Repeat Callers to a Spoken Dialog System

Andrew Fandrianto (Carnegie Mellon University)
Brian Langner (Carnegie Mellon University)
Alan W Black (Carnegie Mellon University)

This paper describes using speaker ID techniques to identify repeat callers in a spoken dialog system, using only acoustic features. Often it is useful to know if a dialog user is a novice or is experienced, and it can be the case that identifying data such as Caller ID is either unreliable or unavailable. Our approach attempts to remedy this by determining user identity in a dialog session using the acoustic information in the dialog. We optimize the audio content of each call by removing artifacts not relevant to modeling speech. This technique is applied to finding consecutive callers and creating unique user identities over all calls over a larger time frame, with the aim of tuning or adapting the dialog system based on the user identity. Our results show that the technique is effective in recognizing consecutive callers and in identifying unique user identities in a large set of calls.

#10Semantic graph clustering for POMDP-based spoken dialog systems

Florian Pinault (LIA University of Avignon)
Fabrice Lefèvre (LIA University of Avignon)

Dialog managers (DM) in spoken dialogue systems make decisions in highly uncertain conditions, due to errors from the speech recognition and spoken language understanding (SLU) modules. In this work a framework to interface efficient probabilistic modeling for both the SLU and the DM modules is described and investigated. Thorough representation of the user semantics is inferred by the SLU in the form of a graph of frames and, complemented with some contextual information, is mapped to a summary space in which a stochastic POMDP dialogue manager can perform planning of actions taking into account the uncertainty on the current dialogue state. Tractability is ensured by the use of an intermediate summary space. Also, to reduce the development cost of SDS, an approach based on clustering is proposed to automatically derive the master-summary mapping function. A preliminary implementation is presented in the MEDIA domain (touristic information and hotel booking) and tested with a simulated user.

#11Learning Place-Names from Spoken Utterances and Localization Results by Mobile Robot

Ryo Taguchi (Nagoya Institute of Technology)
Yuji Yamada (Nagoya Institute of Technology)
Koosuke Hattori (Nagoya Institute of Technology)
Taizo Umezaki (Nagoya Institute of Technology)
Masahiro Hoguro (Chubu University)
Naoto Iwahashi (National Institute of Information and Communications Technology)
Kotaro Funakoshi (Honda Research Institute Japan Co., Ltd.)
Mikio Nakano (Honda Research Institute Japan Co., Ltd.)

This paper proposes a method for the unsupervised learning of place-names from pairs of a spoken utterance and a localization result, which represents the current location of a mobile robot, without any a priori linguistic knowledge other than a phoneme acoustic model. In previous work, we have proposed a lexical learning method based on statistical model selection. This method can learn the words that represent a single object, such as proper nouns, but cannot learn the words that represent classes of objects, such as general nouns. This paper describes improvements of the method for learning both a phoneme sequence of each word and a distribution of objects that the word represents.

#12Active Learning for Dialogue Act Classification

Björn Gambäck (SICS, Swedish Institute of Computer Science AB / Norwegian University of Science and Technology)
Fredrik Olsson (SICS, Swedish Institute of Computer Science AB)
Oscar Täckström (SICS, Swedish Institute of Computer Science AB)

Active learning techniques were employed for classification of dialogue acts over two dialogue corpora, the English human-human Switchboard corpus and the Spanish human-machine Dihana corpus. It is shown clearly that active learning improves on a baseline obtained through a passive learning approach to tagging the same data sets. An error reduction of 7% was obtained on Switchboard, while a factor 5 reduction in the amount of labeled data needed for classification was achieved on Dihana. The passive Support Vector Machine learner used as baseline in itself significantly improves the state of the art in dialogue act classification on both corpora. On Switchboard it gives a 31% error reduction compared to the previously best reported result.

#13Speaker Role Recognition using question detection and characterization

Thierry Bazillon (Aix Marseille Universite, LIF-CNRS)
Benjamin Maza (Universite d'Avignon, LIA-CERI)
Mickael Rouvier (Universite d'Avignon, LIA-CERI)
Frederic Bechet (Aix Marseille Universite, LIF-CNRS)
Alexis Nasr (Aix Marseille Universite, LIF-CNRS)

Speech Data Mining is an area of research dedicated to characterizing audio streams containing speech of one or more speakers, using descriptors related to the form and the content of the speech signal. Besides the automatic word transcription process, information about the type of audio stream and the role and identity of speakers is also crucial to allow complex queries. In this framework we present a study done on broadcast conversations on how speakers express questions in conversations, starting with the initial intuition that the surface form of the questions uttered is a signature of the role of the speakers in the conversation (anchor, guest, expert, etc.). By classifying these questions with a set of labels and using this information in addition to the commonly used descriptors to classify speakers' roles in broadcast conversations, we want to improve the role classification accuracy and validate our initial intuition.

#14Learning Score Structure from Spoken Language for A Tennis Game

Qiang Huang (University of East Anglia)
Stephen Cox (University of East Anglia)

We describe a novel approach to inferring the scoring rules of a tennis game by analysing the chair umpire's speech. In a tennis match, the chair umpire, amongst other tasks, announces the scores. Hence his or her speech is the key resource for inferring the scoring rules of tennis. In this work, the learning procedure consists of two steps: speech recognition followed by rule inference. For speech recognition, we use two coupled language models, one for words and one for scores. The first makes use of the internal structure of a score, the second, the dependency of a score on the previous score. For rule inference, we utilize a multigram model to segment the recognised score streams into variable-length score sequences, each of them corresponding to a game in a tennis match. The approach is applied to four complete tennis matches, and shows both enhanced recognition performance, and a promising approach to inferring the scoring rules of the game.

#15Semi-automated classifier adaptation for natural language call routing

Silke M. Witt (West)

Commercial spoken dialogue systems traditionally are static in the sense that once deployed, these applications only get updated periodically. Also, the creation of classifiers in call routing applications requires expensive manual annotation of caller intents. This work introduces a process to semi-automatically annotate new data and to use the new annotations to update the training corpus to iteratively improve classification performance. The new method combines a multiple classifier voting schema and an iterative boosting mechanism to continually update the classifier with the new automatically annotated data. This method was tested with 6 weeks’ worth of data from a live system. It is shown that with this approach about 93% of all new utterances can be automatically annotated. Using the iterative boosting approach increased the size of the training corpus by about 6% per iteration while at the same time slightly increasing the classification accuracy.
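
The voting step described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the names `vote`, `auto_annotate` and `by_keyword`, the keyword-based toy classifiers and the agreement threshold are all hypothetical and are not taken from the paper, which does not spell out its voting rule here.

```python
from collections import Counter


def vote(labels, min_agreement=2):
    """Return the most common label if enough classifiers agree, else None."""
    label, count = Counter(labels).most_common(1)[0]
    return label if count >= min_agreement else None


def auto_annotate(utterances, classifiers, min_agreement=2):
    """Split utterances into auto-labelled pairs and ones left for manual review."""
    accepted, rejected = [], []
    for utt in utterances:
        label = vote([clf(utt) for clf in classifiers], min_agreement)
        if label is None:
            rejected.append(utt)            # classifiers disagree: keep for a human
        else:
            accepted.append((utt, label))   # candidate addition to the training corpus
    return accepted, rejected


def by_keyword(keyword, label, fallback):
    """Hypothetical toy classifier: one keyword, one label, one fallback label."""
    return lambda utt: label if keyword in utt else fallback


classifiers = [
    by_keyword("bill", "billing", "other"),
    by_keyword("pay", "billing", "support"),
    by_keyword("charge", "billing", "unknown"),
]
accepted, rejected = auto_annotate(["pay my bill", "router is broken"], classifiers)
```

In an iterative setup, the accepted pairs would be appended to the training corpus and the classifiers retrained before the next batch of utterances is processed.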

#16Interactional Style Detection for Versatile Dialogue Response Using Prosodic and Semantic Features

Wei-Bin Liang (Dept. of CSIE, NCKU, Tainan, Taiwan)
Chung-Hsien Wu (Dept. of CSIE, NCKU, Tainan, Taiwan)
Chih-Hung Wang (Dept. of CSIE, NCKU, Tainan, Taiwan)
Jhing-Fa Wang (Dept. of Electrical Engineering, NCKU, Tainan, Taiwan)

This work presents an approach to interactional style (IS) detection for versatile responses in spoken dialogue systems (SDSs). Since speakers generally express their intents in different styles, the responses of an SDS should be versatile rather than invariable. Moreover, the IS of dialogue turns can be affected by dialogue topics and speakers’ emotional states. In this work, three base-level classifiers are employed for preliminary detection: latent Dirichlet allocation for dialogue topic categorization, support vector machine for prosody-based emotional state identification and maximum entropy for semantic label-based emotional state identification. Finally, an artificial neural network is adopted for IS detection considering the scores estimated from the aforementioned classifiers. To evaluate the proposed approach, an SDS in a chatting domain was constructed. The evaluation results revealed that the performance of IS detection can achieve 82.67% accuracy.

#17Quality aspects of multimodal dialog systems: identity, stimulation and success

Christine Kuehnel (Quality & Usability Lab, Deutsche Telekom Laboratories, TU Berlin, Germany)
Benjamin Weiss (Quality & Usability Lab, Deutsche Telekom Laboratories, TU Berlin, Germany)
Matthias Schulz (Quality & Usability Lab, Deutsche Telekom Laboratories, TU Berlin, Germany)
Sebastian Moeller (Quality & Usability Lab, Deutsche Telekom Laboratories, TU Berlin, Germany)

So far, not much is known about the relationship between quality aspects of multimodal dialog systems. This paper aims at closing this gap by analyzing the influence of input and output modalities on the systems' usability. The underlying study has been carried out with a smart-home system offering speech, gesture and touch as well as the combination of these three for input and a speech-to-text system, a TV screen and a smartphone screen for output. The results indicate that the usability of a multimodal system is composed of hedonic and pragmatic aspects. The hedonic aspects are influenced by the identity transported by the output channels and the stimulation of the input modalities. A measure for task success was sufficient to describe the pragmatic aspect.

Sun-Ses2-S1-P:
Speech and Language Processing-Based Assistive Technologies and Health Applications

Time:Sunday 14:30 Place:Caravaggio (Adua 1) - Pala Affari - 1st Floor Type:Poster
Chairs:Shri Narayanan, Elmar Noeth

#1Incorporating Speech Recognition Engine Into an Intelligent Assistive Reading System for Dyslexic Students

Theologos Athanaselis (ILSP)
Stelios Bakamidis (ILSP)
Ioannis Dologlou (ILSP)
Evmorfia N. Argyriou (Department of Mathematics, School of Applied Mathematical and Physical Sciences National Technical University of Athens)
Antonis Symvonis (Department of Mathematics, School of Applied Mathematical and Physical Sciences National Technical University of Athens)

In this paper we present an approach for incorporating a state of the art speech recognition engine into a novel assistive reading system for Greek dyslexic students. This system is being developed in the framework of the AGENT-DYSL IST project, and helps dyslexic children learn to read fluently. Unlike previously presented approaches, the aim of this system is to monitor the progress and perspectives of a dyslexic user and supply personalised help. The goal of this help is to gradually increase the reading capabilities of the user and gradually diminish the assistance provided, until the user is able to read as a non-dyslexic reader would.

#2An Investigation of Depressed Speech Detection: Features and Normalization

Nicholas Cummins (The University of New South Wales)
Julien Epps (The University of New South Wales and National ICT Australia)
Michael Breakspear (Black Dog Institute and School of Psychiatry, The University of New South Wales)
Roland Goecke (Faculty of Information Sciences and Engineering, University of Canberra, and RSCS, Australian National University)

In recent years, the problem of automatic detection of mental illness from the speech signal has gained some initial interest; however, remaining questions include how speech segments should be selected, what features provide good discrimination, and what benefits feature normalization might bring given the speaker-specific nature of mental disorders. In this paper, these questions are addressed empirically using classifier configurations employed in emotion recognition from speech, evaluated on a 47-speaker depressed/neutral read sentences speech database. Results demonstrate that (1) Detailed spectral features are well suited to the task, (2) Speaker normalization provides benefits mainly for less detailed features, and (3) Dynamic information appears to provide little benefit. Classification accuracy using a combination of MFCC and formant based features approached 80% for this database.

#3Using Prosodic and Spectral Features in Detecting Depression in Elderly Males

Michelle Hewlett Sanchez (Speech Technology and Research Laboratory, SRI International and Stanford University)
Dimitra Vergyri (Speech Technology and Research Laboratory, SRI International)
Luciana Ferrer (Speech Technology and Research Laboratory, SRI International)
Colleen Richey (Speech Technology and Research Laboratory, SRI International)
Pablo Garcia (Robotics and Medical Systems Laboratory, SRI International)
Bruce Knoth (Robotics and Medical Systems Laboratory, SRI International)
William Jarrold (Center for Mind and Brain, University of California Davis)

As research in speech processing has matured, there has been much interest in paralinguistic speech processing problems including the speaker's mental and psychological health. In this study, we focus on speech features that can identify the speaker's emotional health, i.e., whether the speaker is depressed or not. We use prosodic speech measurements, such as pitch and energy, in addition to spectral features, such as formants and spectral tilt, and compute statistics of these features over different regions of the speech signal. These statistics are used as input features to a discriminative classifier that predicts the speaker's depression state. We find that with an N-fold leave-one-out cross-validation setup, we can achieve a prediction accuracy of 81.3%, where random guess is 50%.
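
A leave-one-out evaluation of the kind mentioned above can be sketched as follows. This is a minimal illustration, not the authors' setup: the one-dimensional feature, the nearest-class-mean rule standing in for their discriminative classifier, and the toy data are all hypothetical.

```python
def nearest_mean_predict(train, x):
    """Classify x by the closest class mean; train is a list of (feature, label)."""
    sums, counts = {}, {}
    for feature, label in train:
        sums[label] = sums.get(label, 0.0) + feature
        counts[label] = counts.get(label, 0) + 1
    return min(sums, key=lambda label: abs(x - sums[label] / counts[label]))


def leave_one_out_accuracy(data, predict):
    """Hold out each sample once, train on the rest, and count correct predictions."""
    hits = 0
    for i, (x, label) in enumerate(data):
        train = data[:i] + data[i + 1:]
        hits += predict(train, x) == label
    return hits / len(data)


# Hypothetical toy data: one summary statistic per speaker plus a class label.
data = [
    (0.1, "control"), (0.2, "control"), (0.3, "control"), (0.7, "control"),
    (0.9, "depressed"), (1.0, "depressed"), (1.1, "depressed"),
]
accuracy = leave_one_out_accuracy(data, nearest_mean_predict)
```

Because every held-out sample is excluded from training, the resulting accuracy is not inflated by the classifier having seen the test speaker.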

#4Combining phonological and acoustic ASR-free features for pathological speech intelligibility assessment

Catherine Middag (Department of Electronics and Information Systems, Ghent University, Belgium)
Tobias Bocklet (Chair of Pattern Recognition, University of Erlangen-Nuremberg, Germany)
Jean-Pierre Martens (Department of Electronics and Information Systems, Ghent University, Belgium)
Elmar Nöth (Chair of Pattern Recognition, University of Erlangen-Nuremberg, Germany)

Intelligibility is widely used to measure the severity of articulatory problems in pathological speech. Recently, a number of automatic intelligibility assessment tools have been developed. Most of them use automatic speech recognizers (ASR) to compare the patient’s utterance with the target text. These methods are bound to one language and tend to be less accurate when speakers hesitate or make reading errors. To circumvent these problems, two different ASR-free methods were developed over the last few years, only making use of the acoustic or phonological properties of the utterance. In this paper, we demonstrate that these ASR-free techniques are also able to predict intelligibility in other languages. Moreover, they prove to be complementary, resulting in even better intelligibility predictions when both methods are combined.

#5Speech Synthesis Parameter Generation for the Assistive Silent Speech Interface MVOCA

Robin Hofe (University of Sheffield, UK)
Stephen R. Ell (University of Hull, UK)
Michael J. Fagan (University of Hull, UK)
James M. Gilbert (University of Hull, UK)
Phil D. Green (University of Sheffield, UK)
Roger K. Moore (University of Sheffield, UK)
Sergey I. Rybchenko (University of Hull, UK)

In previous publications, a silent speech interface based on permanent-magnetic articulography (PMA) has been introduced and evaluated using standard automatic speech recognition techniques. However, word recognition is a task that is computationally expensive and introduces a significant time delay between speech articulation and generation of the acoustic signal. This paper investigates a direct synthesis approach where control parameters for parametric speech synthesis are generated directly from the sensor data of the silent speech interface, without an intermediate lexical representation. Users of such a device would not be tied to the limited vocabulary of a word-based recogniser and could therefore express themselves more freely. This paper presents a feasibility study that investigates whether it is possible to infer speech synthesis parameters from PMA sensor data.

#6Computer-Assisted Disfluency Counts for Stuttered Speech

Peter A. Heeman (Oregon Health & Science University)
Andy McMillin (Artz Center)
J. Scott Yaruss (University of Pittsburgh)

In this paper, we present computer tools to help speech-language pathologists in counting disfluencies, for both real-time counts and transcript-based counts. The latter tend to be more precise and show which words are involved in each disfluency. Our approach allows real-time counts to be used as the basis for transcript-based counts. We employ automatic speech recognition to generate a word transcript (for read-speech samples), and then automatically merge the disfluency annotations with the word transcript, and have the clinician review parts of the audio file where a disfluency annotation was placed.

#7Spectral Features for Automatic Blind Intelligibility Estimation of Spastic Dysarthric Speech

Richard Hummel (Queen's University, Kingston, Ontario, Canada)
Wai-Yip Chan (Queen's University, Kingston, Ontario, Canada)
Tiago Falk (Institut National de la Recherche Scientifique, Energy, Materials, and Telecommunications, Montreal, Quebec, Canada)

In this paper, we explore the use of the standard ITU-T P.563 speech quality estimation algorithm for automatic assessment of dysarthric speech intelligibility. A linear mapping consisting of three salient P.563 internal features is proposed and shown to accurately estimate spastic dysarthric speech intelligibility. Delta-energy features are further proposed in order to characterize the atypical spectral dynamics and limited vowel space observed with spastic dysarthria. Experiments using the publicly-available Universal Access database (10 speaker patients) show that when salient delta-energy and internal P.563 features are used, correlations with subjective intelligibility ratings as high as 0.98 can be attained.

#8Extraction of narrative recall patterns for neuropsychological assessment

Emily Prud'hommeaux (Center for Spoken Language Understanding, Oregon Health and Science University)
Brian Roark (Center for Spoken Language Understanding, Oregon Health and Science University)

Poor narrative memory is associated with a variety of neurodegenerative and developmental disorders, such as autism and Alzheimer's related dementia. Hence, narrative recall tasks are included in most standard neurological examinations. In this paper, we explore methods for automatically assessing the quality of retellings via alignment to the original narrative. Word alignments serve both to automate manual scoring and to derive other features related to narrative coherence that can be used for diagnostic classification of neurological disorders. Despite relatively high word alignment error rates, the automatic alignments provide sufficient information to achieve nearly as accurate diagnostic classification as manual scores. Furthermore, additional features that become available with alignment provide utility in classifying subject groups. While the additional features we explore here did not provide additive gains in accuracy, they point the way to the development of many potentially useful features in this domain.

#9Gesture Design of Hand-to-Speech Converter derived from Speech-to-Hand Converter based on Probabilistic Integration Model

Aki Kunikoshi (The University of Tokyo)
Yu Qiao (Shenzhen Institute of Advanced Technology)
Daisuke Saito (The University of Tokyo)
Nobuaki Minematsu (The University of Tokyo)
Keikichi Hirose (The University of Tokyo)

When dysarthrics try to communicate using speech, they often have to use speech synthesizers which require them to type word symbols or sound symbols. Input by this method often makes real-time communication troublesome. In this study, we are developing a novel speech synthesizer where speech is generated through hand motions rather than symbol input. By applying statistical voice conversion techniques, a hand space was mapped to a vowel space and a converter from hand motions to vowel transitions was developed. In this paper, we discuss the expansion of this system to consonant generation. In order to create the gestures for consonants, a Speech-to-Hand conversion system is developed using parallel data for vowels. Thus, we are able to automatically search for candidates for consonant gestures for a Hand-to-Speech system.

#10Powered Wheelchair Control Using Acoustic-Based Recognition of Head Gesture Accompanying Speech

Akira Sasou (National Institute of Advanced Industrial Science and Technology, AIST)

In this paper, we propose a novel interface for powered wheelchair control using the acoustic-based recognition of head gesture accompanying speech. A microphone array mounted on a wheelchair localizes the position of the user’s voice. Because the localized position of the user’s voice almost corresponds with that of the mouth, the tracking of the head movements accompanying speech can be achieved by means of the microphone array. The proposed interface does not require disabled people to wear any microphones or utter recognizable voice commands, but requires only two capabilities: the ability to move the head and the ability to utter an arbitrary sound. In our preliminary experiments, five subjects performed six kinds of head gestures accompanying speech. The head gestures of each subject were recognized using the models trained from the other subjects' data. The average recognition accuracy was 99.7 %.

#11 Analyzing training dependencies and posterior fusion in discriminant classification of apnea patients based on sustained and connected speech

Jose Luis Blanco (Universidad Politecnica de Madrid)
Ruben Fernandez (Universidad Politecnica de Madrid)
Doroteo Torre (Universidad Autonoma de Madrid)
Francisco Javier Caminero (Telefonica R&D)
Eduardo Lopez (Universidad Politecnica de Madrid)

We present a novel approach using both sustained vowels and connected speech to detect obstructive sleep apnoea (OSA) cases within a homogeneous group of speakers. The proposed scheme is based on state-of-the-art GMM-based classifiers, and acknowledges specifically the way in which acoustic models are trained on standard databases, as well as the complexity of the resulting models and their adaptation to specific data. Our experimental database contains a suitable number of utterances and sustained speech from healthy (i.e., control) and OSA Spanish speakers. Finally, a 25.1% relative reduction in classification error is achieved when fusing continuous and sustained speech classifiers.

Sun-Ses3-O1:
Speaker Recognition - Modeling, Automatic Procedures, Analysis I

Time:Sunday 16:00 Place:Auditorium - Pala Congressi Type:Oral
Chair:Luciano Romito

16:00Restoring the Residual Speaker Information in Total Variability Modeling for Speaker Verification

Ce Zhang (Digital Content Technology Research Center, Institute of Automation, Chinese Academy of Sciences)
Rong Zheng (Digital Content Technology Research Center, Institute of Automation, Chinese Academy of Sciences)
Bo Xu (Digital Content Technology Research Center, Institute of Automation, Chinese Academy of Sciences)

In this paper, we introduce the residual space into the Total Variability Modeling by assuming that the speaker super-vectors are not totally contained in a linear subspace of low dimension. Thus the feature reduction carried out by Probabilistic Principal Component Analysis (PPCA) leads to information loss including information of speaker as well as channel. We add the residual factor to restore the missing speaker information which is lost during the PPCA process. To utilize the recovered information effectively, we propose two fusion methods that combine the principal components with the residual factor. We compare the fusion results that are obtained with direct scoring and Support Vector Machines for classification, respectively. The experiments on NIST SRE 2006 show that the performance can be improved consistently by involving the residual factor, e.g. the best result achieves 6% relative improvement on Equal Error Rate (EER) compared to the baseline system.

16:20New Developments in Joint Factor Analysis for Speaker Verification

Hagai Aronowitz (IBM Research - Haifa)
Oren Barkan (IBM Research - Haifa)

Joint factor analysis (JFA) is widely used by state-of-the-art speech processing systems for tasks such as speaker verification, language identification and emotion detection. In this paper we introduce new developments for the JFA framework which we validate empirically for the speaker verification task but in principle may be beneficial for other tasks too. We first propose a method for obtaining improved recognition accuracy by better modeling supervector estimation uncertainty. We then propose a novel approach we name JFAlight for extremely efficient approximated estimation of speaker, common and channel factors. Using JFAlight we are able to efficiently score a given test session with a very small degradation in accuracy.

16:40Speaker recognition using temporal contours in linguistic units: the case of formant and formant-bandwidth trajectories

Joaquin Gonzalez-Rodriguez (ICSI/ATVS-UAM)

We describe a new approach to automatic speaker recognition based on explicit modeling of temporal contours in linguistic units (TCLU). Inspired by successful work in forensic speaker identification, we extend the approach to design a fully automatic system, with a high potential for combination with spectral systems. Using SRI’s Decipher phone, word and syllabic labels, we have tested up to 468 unit-based subsystems from 6 groups of lexically-determined units, namely phones, diphones, triphones, center phone in triphones, syllables and words, subsystems being combined at the score level. Evaluating with NIST SRE04 English-only 1s1s, their hierarchical fusion gives an EER of 4.20% (minDCF=0.018) from automatic formant tracking of conversational telephone speech. Combining extremely well with a Joint Factor Analysis system (from JFA EER of 4.25% to 2.47%, minDCF from 0.020 to 0.012), extensions such as more robust prosodic or spectral features are likely to further improve this approach.

17:00Discriminatively Trained i-vector Extractor for Speaker Verification

Ondrej Glembek (Brno University of Technology)
Lukas Burget (Brno University of Technology)
Niko Brummer (Agnitio, South Africa)
Oldrich Plchot (Brno University of Technology)
Pavel Matejka (Brno University of Technology)

We propose a strategy for discriminative training of the i-vector extractor in speaker recognition. The original i-vector extractor training was based on the maximum-likelihood generative modeling, where the EM algorithm was used. In our approach, the i-vector extractor parameters are numerically optimized to minimize the discriminative cross-entropy error function. Two versions of the i-vector extraction are studied: the original approach as defined for Joint Factor Analysis, and the simplified version, where orthogonalization of the i-vector extractor matrix is performed.

17:20Constrained Cepstral Speaker Recognition Using Matched UBM and JFA Training

Michelle Hewlett Sanchez (Speech Technology and Research Laboratory, SRI International and Stanford University)
Luciana Ferrer (Speech Technology and Research Laboratory, SRI International)
Elizabeth Shriberg (Speech Technology and Research Laboratory, SRI International)
Andreas Stolcke (Speech Technology and Research Laboratory, SRI International)

We study constrained speaker recognition systems, or systems that model standard cepstral features that fall within particular types of speech regions. A question in modeling such systems is whether to constrain universal background model (UBM) training, joint factor analysis (JFA), or both. We explore this question, as well as how to optimize UBM model size, using a corpus of Arabic male speakers. Over a large set of phonetic and prosodic constraints, we find that the performance of a system using constrained JFA and UBM is on average 5.24% better than when using constraint-independent (all frames) JFA and UBM. We find further improvement from optimizing UBM size based on the percentage of frames covered by the constraint.

17:40A New Perspective on GMM Subspace Compensation Based on PPCA and Wiener Filtering

Alan McCree (MIT Lincoln Laboratory)
Doug Sturim (MIT Lincoln Laboratory)
Doug Reynolds (MIT Lincoln Laboratory)

We present a new perspective on the subspace compensation techniques that currently dominate the field of speaker recognition using Gaussian Mixture Models (GMMs). Rather than the traditional factor analysis approach, we use Gaussian modeling in the sufficient statistic supervector space combined with Probabilistic Principal Component Analysis (PPCA) within-class and shared across class covariance matrices to derive a family of training and testing algorithms. Key to this analysis is the use of two noise terms for each speech cut: a random channel offset and a length dependent observation noise. Using the Wiener filtering perspective, formulas for optimal train and test algorithms for Joint Factor Analysis (JFA) are simple to derive. In addition, we can show that an alternative form of Wiener filtering results in the i-vector approach, thus tying together these two disparate techniques.

Sun-Ses3-O3:
Speech Analysis

Time:Sunday 16:00 Place:Brunelleschi (Green Room) - Pala Congressi - 2nd Floor Type:Oral
Chair:Thomas F. Quatieri

16:00Adaptive Estimation of Zeros of Time-Varying Z-Transforms

Christian Fischer Pedersen (Department of Electronic Systems, Aalborg University, Denmark)
Ove Andersen (Department of Electronic Systems, Aalborg University, Denmark)
Paul Dalsgaard (Department of Electronic Systems, Aalborg University, Denmark)

In the present paper, a method is proposed for adaptive estimation and tracking of roots of time-varying, complex, and univariate polynomials, e.g. z-transform polynomials that arise from finite signal sequences. The objective of the method is to alleviate the computational burden induced by factorization. The estimation is done by solving a set of linear equations; the number of equations equals the order of the polynomial. To avoid potential drifting of the estimations, it is proposed to verify with Aberth-Ehrlich’s factorization method at given intervals. A numerical experiment supplements theory by estimating roots of time-varying polynomials of different order. As a function of order, the proposed method has a lower run time than Lindsey-Fox and computing eigenvalues of companion matrices. The estimations are quite accurate, but tend to drift slightly in response to increasing coefficient perturbation lengths.
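
The idea of updating known roots after a small coefficient change, rather than refactorizing from scratch, can be illustrated with the sketch below. Note that it swaps in a plain Newton-type update per root, which is not the linear-equation estimator proposed in the paper, and that a simple residual check stands in for the periodic Aberth-Ehrlich verification. The function names and the example polynomial are assumptions made for illustration.

```python
def polyval(coeffs, z):
    """Evaluate a polynomial by Horner's rule; coeffs are highest degree first."""
    acc = 0j
    for c in coeffs:
        acc = acc * z + c
    return acc


def polyder(coeffs):
    """Coefficients of the derivative, highest degree first."""
    n = len(coeffs) - 1
    return [c * (n - i) for i, c in enumerate(coeffs[:-1])]


def track_roots(roots, new_coeffs, steps=1):
    """Move each previous root towards a root of the perturbed polynomial."""
    deriv = polyder(new_coeffs)
    for _ in range(steps):
        roots = [r - polyval(new_coeffs, r) / polyval(deriv, r) for r in roots]
    return roots


old_roots = [1.0 + 0j, 2.0 + 0j]   # roots of z^2 - 3z + 2
new_coeffs = [1.0, -3.02, 2.01]    # slightly perturbed coefficients
tracked = track_roots(old_roots, new_coeffs, steps=3)

# Drift check: the polynomial should be (almost) zero at every tracked root.
residual = max(abs(polyval(new_coeffs, r)) for r in tracked)
```

Such an update only works while the perturbation is small enough that each old root stays close to its successor; larger coefficient changes call for the periodic verification the abstract describes.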

16:20Identifying regions of non-modal phonation using features of the wavelet transform

John Kane (Trinity College Dublin)
Christer Gobl (Trinity College Dublin)

The present study proposes a new parameter for identifying breathy to tense voice qualities in a given speech segment using measurements from the wavelet transform. Techniques that can deliver robust information on the voice quality of a speech segment are desirable as they can help tune analysis strategies as well as provide automatic voice quality annotation in large corpora. The method described here involves wavelet-based decomposition of the speech signal into octave bands and fitting a regression line to the maximum amplitudes at the different scales. The slope coefficient is then evaluated in terms of its ability to differentiate voice qualities compared to other parameters in the literature. The new parameter (named Peak Slope) was shown to have robustness to babble noise added with signal to noise ratios as low as 10 dB. Furthermore, the proposed parameter was shown to provide better differentiation of breathy to tense voice qualities in both vowels and running speech.
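
The regression step described above can be sketched as follows, assuming the octave-band signals have already been produced by some wavelet filter bank, which is not shown. The log compression of the peak amplitudes, the use of the band index as the regressor and the function names are assumptions for illustration, not details confirmed by the abstract.

```python
import math


def regression_slope(xs, ys):
    """Least-squares slope of ys regressed on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def peak_slope(band_signals):
    """Slope of log10 peak amplitude across bands (each band needs a nonzero sample)."""
    scales = list(range(len(band_signals)))
    peaks = [math.log10(max(abs(s) for s in band)) for band in band_signals]
    return regression_slope(scales, peaks)


# Hypothetical band signals whose peak amplitude drops tenfold per band.
bands = [[0.2, -1.0, 0.5], [0.05, 0.1, -0.02], [-0.01, 0.004, 0.0]]
slope = peak_slope(bands)
```

A single slope value per segment is what makes such a parameter convenient for annotating large corpora: it can be thresholded or fed to a classifier directly.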

16:40Acoustic Analysis of Whispered Speech for Phoneme and Speaker Dependency

Xing Fan (University of Texas at Dallas)
Keith Godin (University of Texas at Dallas)
John Hansen (University of Texas at Dallas)

Whisper is used by talkers intentionally in certain circumstances to protect personal information. Due to the different production mechanisms in whispered speech, there are considerable differences existing between neutral and whispered speech in the spectral structure, such as formant shifting and spectral slope tilting. In this study, the dependency of those differences on speakers and phonemes is analyzed statistically by using a Vector Taylor Series (VTS) approximation with EM algorithm. The results from this study are useful for understanding the difference between whispered and neutral speech and shed light on model adaptation/compensation for whisper speech/speaker recognition and reconstruction of neutral speech from whispered speech.

17:00Multi-party Speech Recovery Exploiting Structured Sparsity Models

Afsaneh Asaei (Idiap Research Institute, Ecole Polytechnique Federale de Lausanne)
Mohammad Javad Taghizadeh (Idiap Research Institute, Ecole Polytechnique Federale de Lausanne)
Hervé Bourlard (Idiap Research Institute, Ecole Polytechnique Federale de Lausanne)
Volkan Cevher (Idiap Research Institute, Ecole Polytechnique Federale de Lausanne)

We study the sparsity of spectro-temporal representation of speech in reverberant acoustic conditions. This study motivates the use of structured sparsity models for efficient speech recovery. We formulate the underdetermined convolutive speech separation in spectro-temporal domain as the sparse signal recovery where we leverage model-based recovery algorithms. To tackle the ambiguity of the real acoustics, we exploit the Image Model of the enclosures to estimate the room impulse response function through a structured sparsity constraint optimization. The experiments conducted on real data recordings demonstrate the effectiveness of the proposed approach for multi-party speech applications.

17:20Modulation spectrum analysis for recognition of reverberant speech

Sri Harish Mallidi (Dept. of ECE, Johns Hopkins University)
Sriram Ganapathy (Dept. of ECE, Johns Hopkins University)
Hynek Hermansky (Dept. of ECE, Johns Hopkins University)

Recognition of reverberant speech constitutes a challenging problem for typical speech recognition systems. This is mainly due to the conventional short-term analysis/compensation techniques. In this paper, we present a feature extraction technique based on modeling long segments of temporal envelopes of the speech signal in narrow sub-bands using frequency domain linear prediction (FDLP). FDLP provides an all-pole approximation of the Hilbert envelope of the signal by linear prediction on cosine transform of the signal. We show that the FDLP modulation spectrum plays an important role in the robustness of the proposed feature extraction. Automatic speech recognition (ASR) experiments on speech data degraded with a number of room impulse responses (with varying degrees of distortion) show significant performance improvements for the proposed FDLP features when compared to other robust feature extraction techniques (average relative reduction of 40 % in word error rate). Similar improvements are also obtained for far-field data which contain natural reverberation in background noise.
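
The core of FDLP, linear prediction applied to the cosine transform of the signal, can be sketched as below. This is a simplified full-band illustration under stated assumptions: the paper works on long segments in narrow sub-bands, whereas this sketch uses one short frame, an unnormalised DCT-II, autocorrelation-method linear prediction and a model order chosen only for the toy example. All function names are hypothetical.

```python
import math


def dct2(x):
    """Unnormalised DCT-II of a real sequence (O(N^2), fine for short frames)."""
    n = len(x)
    return [sum(x[t] * math.cos(math.pi * (t + 0.5) * k / n) for t in range(n))
            for k in range(n)]


def autocorr(c, order):
    """Autocorrelation of c for lags 0..order."""
    return [sum(c[i] * c[i + lag] for i in range(len(c) - lag))
            for lag in range(order + 1)]


def levinson(r, order):
    """Levinson-Durbin recursion; returns (a, err) with a[0] == 1."""
    a = [1.0] + [0.0] * order
    err = r[0]
    for i in range(1, order + 1):
        acc = r[i] + sum(a[j] * r[i - j] for j in range(1, i))
        k = -acc / err
        new_a = a[:]
        for j in range(1, i):
            new_a[j] = a[j] + k * a[i - j]
        new_a[i] = k
        a = new_a
        err *= 1.0 - k * k
    return a, err


def fdlp_envelope(x, order):
    """All-pole estimate of the temporal envelope, one value per input sample."""
    a, err = levinson(autocorr(dct2(x), order), order)
    n = len(x)
    env = []
    for t in range(n):
        theta = math.pi * (t + 0.5) / n   # time index mapped to the model's angle axis
        re = sum(a[j] * math.cos(j * theta) for j in range(order + 1))
        im = -sum(a[j] * math.sin(j * theta) for j in range(order + 1))
        env.append(err / (re * re + im * im))
    return env


frame = [0.0] * 128
frame[40] = 1.0                           # a single click at sample 40
envelope = fdlp_envelope(frame, order=2)
peak_time = max(range(len(envelope)), key=envelope.__getitem__)
```

Because the prediction runs over the transform coefficients rather than the waveform, the poles of the model describe where energy is concentrated in time, which is why the click in the toy frame shows up as a peak of the estimated envelope near its position.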

17:40Discrete Choice Models for Non-Intrusive Quality Assessment

Petko N. Petkov (KTH-Royal Institute of Technology, Stockholm, Sweden)
W. Bastiaan Kleijn (Victoria University of Wellington, Wellington, New Zealand)
Bert de Vries (DSP Research, GN ReSound A/S, Eindhoven, Netherlands)

Non-intrusive signal quality assessment in general, and its application to speech signal processing, in particular, builds extensively upon statistical regression models. Commonly, the raw preference scores used for fitting these models belong to a categorical scale. Averaging the scores over a number of test subjects results in smooth, close-to-continuous ratings, thus justifying the use of regression as opposed to classification models. A form of marginalization, averaging subjective ratings takes away useful information about the reliability of individual test points. Using a model tailored to the raw data achieves highly competitive performance in terms of conventional performance measures while providing the additional advantage of identifying the usability of individual test points. In this paper, we consider the application of discrete choice models to non-intrusive quality assessment of speech.

Sun-Ses3-O2:
Speech Perception - Perceptual Learning and Cross-Language Perception

Time:Sunday 16:00 Place:Leonardo - Pala Affari - Ground Floor Type:Oral
Chair:Catia Cucchiarini

16:00Perceptual learning of liquids

Odette Scharenborg (Max Planck Institute for Psycholinguistics, Nijmegen, The Netherlands)
Holger Mitterer (Max Planck Institute for Psycholinguistics, Nijmegen, The Netherlands)
James M. McQueen (Max Planck Institute for Psycholinguistics, Nijmegen, The Netherlands, and Donders Institute for Brain, Cognition and Behaviour, Centre for Cognition, & Behavioural Science Institute, Radboud University Nijmegen, The Netherlands)

Previous research on lexically-guided perceptual learning has focussed on contrasts that differ primarily in local cues, such as plosive and fricative contrasts. The present research had two aims: to investigate whether perceptual learning occurs for a contrast with non-local cues, the /l/-/r/ contrast, and to establish whether STRAIGHT can be used to create ambiguous sounds on an /l/-/r/ continuum. Listening experiments showed lexically-guided learning about the /l/-/r/ contrast. Listeners can thus tune in to unusual speech sounds characterised by non-local cues. Moreover, STRAIGHT can be used to create stimuli for perceptual learning experiments, opening up new research possibilities. Index Terms: perceptual learning, morphing, liquids, human word recognition, STRAIGHT.

16:20The Efficiency of Cross-dialectal Word Recognition

Annelie Tuinman (Max Planck Institute for Psycholinguistics)
Holger Mitterer (Max Planck Institute for Psycholinguistics)
Anne Cutler (Max Planck Institute for Psycholinguistics)

Dialects of the same language can differ in the casual speech processes they allow; e.g., British English allows the insertion of [r] at word boundaries in sequences such as saw ice, while American English does not. In two speeded word recognition experiments, American listeners heard such British English sequences; in contrast to non-native listeners, they accurately perceived intended vowel-initial words even with intrusive [r]. Thus despite input mismatches, cross-dialectal word recognition benefits from the full power of native-language processing.

16:40Estimation of Perceptual Spaces for Speaker Identities Based on the Cross-Lingual Discrimination Task

Minoru Tsuzaki (Faculty of Music, Kyoto City University of Arts)
Keiichi Tokuda (Department of Computer Science, Nagoya Institute of Technology)
Hisashi Kawai (Knowledge Creating Communication Research Center, NICT)
Jinfu Ni (Knowledge Creating Communication Research Center, NICT)

This paper reconfirms that talker identity can be transmitted across languages. Talker discrimination was examined in the ABX paradigm, where stimuli A and B were utterances by different talkers in the same language and stimulus X was an utterance by either talker A or talker B in a different language. The average hit rate of this discrimination task was as high as 0.89. Mutual distance matrices were generated using the discrimination index, d'. By applying multidimensional scaling, three-dimensional perceptual spaces were estimated. Features related to loudness and spectral centroid contributed strongly to the perceptual dimensions.
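As a rough illustration of how a d'-based distance matrix could be assembled, the sketch below uses the simple yes/no signal detection definition, d' = z(hit rate) - z(false-alarm rate). This is an assumption: ABX data are usually converted to d' with a task-specific model that the abstract does not describe. The talker names, the rates, the clipping constant `eps` and the function names are all hypothetical.

```python
from statistics import NormalDist

# Inverse of the standard normal CDF (the z-transform used in signal detection).
_Z = NormalDist().inv_cdf


def d_prime(hit_rate, false_alarm_rate, eps=1e-3):
    """Yes/no d' = z(H) - z(F); rates are clipped so z() stays finite (assumed eps)."""
    h = min(max(hit_rate, eps), 1.0 - eps)
    f = min(max(false_alarm_rate, eps), 1.0 - eps)
    return _Z(h) - _Z(f)


def distance_matrix(pair_rates, talkers):
    """Build a symmetric matrix of d' values.

    pair_rates maps (talker_a, talker_b) -> (hit_rate, false_alarm_rate).
    """
    index = {t: i for i, t in enumerate(talkers)}
    dist = [[0.0] * len(talkers) for _ in talkers]
    for (a, b), (hit, fa) in pair_rates.items():
        d = d_prime(hit, fa)
        dist[index[a]][index[b]] = d
        dist[index[b]][index[a]] = d
    return dist


# Hypothetical rates for three talkers; not taken from the paper.
rates = {("t1", "t2"): (0.89, 0.11), ("t1", "t3"): (0.70, 0.30), ("t2", "t3"): (0.95, 0.05)}
matrix = distance_matrix(rates, ["t1", "t2", "t3"])
```

A matrix of this kind is the usual input to multidimensional scaling, which places talkers so that distances in the low-dimensional space approximate the pairwise d' values.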

17:00The relation between perception and production in L2 phonological processing

Sharon Peperkamp (Laboratoire de Sciences Cognitives et Psycholinguistique, Paris)
Camillia Bouchon (Laboratoire Psychologie de la Perception, Paris)

Seventeen French-English bilinguals read aloud a set of English sentences and performed an ABX discrimination task that assessed their perception of the English /I/-/i/ contrast. Global nativelikeness in production correlated with pronunciation accuracy for the vowels /I/ and /i/, and both production measures correlated with self-estimated pronunciation skills. However, performance on the perception task did not correlate with either global nativelikeness or /I,i/ pronunciation accuracy. These results are discussed in light of theories about the relation between perception and production in L2 phonological processing.

17:20The Role of Word-Initial Glottal Stops in Recognizing English Words

Maria Paola Bissiri (Institute of Phonetics, Charles University in Prague, Czech Republic)
María Luisa Lecumberri (Department of English Philology, University of the Basque Country, Vitoria, Spain)
Martin Cooke (Ikerbasque (Basque Science Foundation), Spain)
Jan Volín (Institute of Phonetics, Charles University in Prague, Czech Republic)

English word-initial vowels in natural continuous speech are optionally preceded by glottal stops or functionally equivalent glottalizations. It may be claimed that these glottal elements disturb the smooth flow of speech. However, they clearly mark word boundaries, which may potentially facilitate speech processing in the brain of the listener. The present study utilizes the word-monitoring paradigm to determine whether listeners react faster to words with or without glottalizations. Three groups of subjects were compared: Czech and Spanish learners of English and native English speakers. The results indicate that perceptual use of glottalization for word segmentation is not entirely governed by universal rules and reflects the mother tongue of the listener as well as the status (L1/L2) of the target language.

17:40Effect of language experience on the categorical perception of Cantonese vowel duration

Caicai ZHANG (Language Engineering Laboratory, The Chinese University of Hong Kong)
Gang PENG (Language Engineering Laboratory, The Chinese University of Hong Kong)
William S-Y. WANG (Language Engineering Laboratory, The Chinese University of Hong Kong)

This study investigated the effect of language experience on the categorical perception of Cantonese vowel duration distinction. By comparing Cantonese and Mandarin listeners’ performances, we found that: (1) duration change elicited categorical perception in the performance of Cantonese listeners, but not in Mandarin listeners; (2) Cantonese listeners were affected by the vowel quality differences, whereas Mandarin subjects were generally unbiased towards the quality differences; (3) effect of duration was overridden by the vowel quality [a] condition in the performance of Cantonese listeners. Our findings suggested that vowel quality is incorporated as a phonological cue in Cantonese.

Sun-Ses3-O4:
Speech Enhancement and Dereverberation

Time:Sunday 16:00 Place:Michelangelo - Pala Affari - 2nd Floor Type:Oral
Chair:Peter Kabal

16:00Single channel dereverberation using example-based speech enhancement with uncertainty decoding technique

Keisuke Kinoshita (NTT Communication Science Labs.)
Mehrez Souden (NTT Communication Science Labs.)
Marc Delcroix (NTT Communication Science Labs.)
Tomohiro Nakatani (NTT Communication Science Labs.)

A speech signal captured by a distant microphone is generally contaminated by reverberation, which severely degrades the audible quality and intelligibility of the observed speech. In this paper, we investigate single channel dereverberation, which has been considered one of the most challenging tasks. We propose an example-based speech enhancement approach used in combination with a non-example-based (conventional) blind dereverberation algorithm, so that the two complement each other. The term example-based refers to a method that uses exact (rather than brief, statistical) information about clean speech as its model. It is important to note that the combination of the two algorithms is formulated using the uncertainty decoding technique, thereby achieving a smooth and theoretically grounded interconnection. Experimental results show that the proposed method achieves better dereverberation in severely reverberant environments than the conventional methods in terms of objective quality measures.

16:20A statistical room impulse response model with frequency dependent reverberation time for single-microphone late reverberation suppression

Jan Erkelens (Delft University of Technology)
Richard Heusdens (Delft University of Technology)

Single-channel late reverberation suppression algorithms use statistical room impulse response (RIR) models to derive late reverberance spectral variance (LRSV) estimators. Current RIR models are based on white Gaussian noise sequences with exponentially decaying variance. The whiteness assumption means that the same decay constant is assumed for all frequencies. Since decay constants are generally frequency dependent, there is a need for RIR models that take this into account. We propose a new time-varying RIR model that consists of a sum of decaying cosine functions with random phases, with frequency dependent decay constants. We show that the resulting LRSV estimators have the same form as existing ones, but with an inherent frequency dependency of the decay constant. Experiments with real measured RIRs, however, indicate that, for the purpose of reverberation suppression, using a frequency independent decay constant is often sufficiently good. A common assumption in the derivation of LRSV estimators is that the sum of direct signal and early reflections is uncorrelated with the late reverberation. We verify this assumption experimentally on measured RIRs and conclude that it is accurate.
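The two model families contrasted in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' formulation: the decay constant is derived from a reverberation time T60 through the standard 60 dB energy-decay definition, all cosine components have unit amplitude, and the function names, frequencies and T60 values are hypothetical.

```python
import math
import random


def decay_constant(t60_s, fs_hz):
    """Per-sample amplitude decay constant giving a 60 dB energy drop after t60_s seconds."""
    return 3.0 * math.log(10.0) / (t60_s * fs_hz)


def white_noise_rir(t60_s, fs_hz, length, seed=0):
    """Classical model: white Gaussian noise shaped by one exponential envelope."""
    rng = random.Random(seed)
    delta = decay_constant(t60_s, fs_hz)
    return [rng.gauss(0.0, 1.0) * math.exp(-delta * n) for n in range(length)]


def cosine_sum_rir(freqs_hz, t60s_s, fs_hz, length, seed=0):
    """Sum of decaying cosines with random phases; each frequency has its own T60."""
    rng = random.Random(seed)
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in freqs_hz]
    deltas = [decay_constant(t60, fs_hz) for t60 in t60s_s]
    rir = []
    for n in range(length):
        sample = 0.0
        for f, phi, delta in zip(freqs_hz, phases, deltas):
            sample += math.exp(-delta * n) * math.cos(2.0 * math.pi * f * n / fs_hz + phi)
        rir.append(sample)
    return rir


# Hypothetical example: low frequencies decay more slowly than high ones.
h = cosine_sum_rir([250.0, 1000.0], [0.6, 0.3], 8000, 400, seed=1)
```

In the white-noise model a single `delta` applies to every frequency, which is the whiteness assumption the abstract refers to; the cosine-sum version lets each component carry its own decay.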

16:40An Assessment of the Improvement Potential of Time-Frequency Masking for Speech Dereverberation

Chenxi Zheng (Department of Electrical Engineering, Queen's University, Kingston, Canada)
Tiago Falk (Institut National de la Recherche Scientifique (EMT), Montreal, Canada)
Wai-Yip Chan (Department of Electrical Engineering, Queen's University, Kingston, Canada)

The effect of ideal time-frequency masking (ITFM) on the intelligibility of reverberated speech is tested using objective measures, namely STI and PESQ scores. The best choice of ITFM threshold is determined for a range of reverberation times (RTs). Four existing dereverberation algorithms are also assessed. Objective test results and informal subjective listening show that ITFM provides great intelligibility improvement for all RTs and outperforms the existing dereverberation algorithms, one of which assumes perfect knowledge of the room impulse response. While ITFM provides only a best possible performance bound, our results demonstrate the potential improvement that could be obtained using time-frequency masking for speech dereverberation.
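The abstract does not define the mask, so the following is only a sketch of one common form, an ideal binary mask. It assumes oracle access to the target power (for example direct sound plus early reflections) and to the interference power (for example late reverberation) in every time-frequency cell. A cell is kept when its local ratio exceeds a threshold in dB. The function names and the example values are hypothetical.

```python
import math


def ideal_binary_mask(target_power, interference_power, threshold_db=0.0):
    """Return a 0/1 mask: 1 where the local target-to-interference ratio exceeds the threshold."""
    mask = []
    for t_row, i_row in zip(target_power, interference_power):
        row = []
        for t, i in zip(t_row, i_row):
            if i <= 0.0:
                keep = t > 0.0  # no interference at all: keep any cell with target energy
            elif t <= 0.0:
                keep = False  # no target energy: discard
            else:
                keep = 10.0 * math.log10(t / i) > threshold_db
            row.append(1 if keep else 0)
        mask.append(row)
    return mask


def apply_mask(mixture_power, mask):
    """Zero out the time-frequency cells the mask rejects."""
    return [[p * m for p, m in zip(p_row, m_row)]
            for p_row, m_row in zip(mixture_power, mask)]


# Hypothetical 2x2 spectrogram powers (rows: frames, columns: frequency bins).
target = [[4.0, 1.0], [0.5, 9.0]]
interference = [[1.0, 1.0], [2.0, 0.0]]
mask_0db = ideal_binary_mask(target, interference, threshold_db=0.0)
mask_m10db = ideal_binary_mask(target, interference, threshold_db=-10.0)
```

Lowering the threshold keeps more cells, which is why the choice of threshold matters for each reverberation time. Because the mask relies on oracle knowledge, it gives an upper bound rather than a deployable algorithm.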

17:00Perceptual Improvement of a Two-Stage Algorithm for Speech Dereverberation

Thiago Prego (Program of Electrical Engineering, COPPE, Federal University of Rio de Janeiro, Brazil.)
Amaro de Lima (Program of Electrical Engineering, COPPE, Federal University of Rio de Janeiro and Federal Center for Technological Education Celso Suckow da Fonseca (CEFET-RJ), Brazil.)
Sergio Netto (Program of Electrical Engineering, COPPE, Federal University of Rio de Janeiro, Brazil.)

This paper presents three effective proposals for a two-stage algorithm for one-microphone reverberant speech enhancement. The original algorithm is divided into two blocks: one that deals with the coloration effect, due to the early reflections, and the other for reducing the long-term reverberation. The proposed modifications consider changing the linear-prediction model order, the adaptation stepsize and stop criterion for the first algorithm stage. All the modifications are evaluated by a perceptual-quality measure specific for the speech-reverberation context. Experimental results for a 200-signal database show that the proposed improvements yielded an increase of 12% in perceptual measure and a reduction of about 96% in the computation cost when compared to the original framework.

17:20A Model-Based Spectral Envelope Wiener Filter for Perceptually Motivated Speech Enhancement

Najib Hadir (Spoken Language Systems)
Friedrich Faubel (Spoken Language Systems)
Dietrich Klakow (Spoken Language Systems)

In this work, we present a model-based Wiener filter whose frequency response is optimized in the dimensionally reduced log-Mel domain. That is achieved by making use of a reasonably novel speech feature enhancement approach that has originally been developed in the area of speech recognition. Its combination with Wiener filtering is motivated by the fact that signal reconstruction from log-Mel features sounds very unnatural. Hence, we correct only the spectral envelope and preserve the fine spectral structure of the noisy signal. Experiments on a Wall Street Journal corpus showed a relative improvement of up to 24% in PESQ and 45% in log spectral distance (LSD), compared to Ephraim and Malah's log spectral amplitude estimator.

17:40Binaural Noise-Reduction Method based on Blind Source Separation and Perceptual post processing

Jorge Marin-Hurtado (Georgia Institute of Technology)
Devangi Parikh (Georgia Institute of Technology)
David Anderson (Georgia Institute of Technology)

Binaural hearing aids include a wireless link to exchange the signals received at each side, allowing the implementation of more efficient noise-reduction algorithms for hostile environments such as babble noise. Although several binaural noise-reduction techniques have been proposed in the literature, only a few of them preserve localization cues of the target and interfering signals simultaneously without degrading the SNR improvement. This paper proposes a novel binaural noise-reduction method based on blind source separation (BSS) and a perceptual post-processing technique. Objective and subjective tests under four different scenarios were performed. The method showed good output sound quality, high SNR improvement at very low input SNR conditions, and preservation of localization cues for the signal and noise, outperforming both an existing BSS-based method and a multichannel Wiener filter (MWF).

Sun-Ses3-O5:
ASR - Feature Extraction II

Time:Sunday 16:00 Place:Raffaello - Pala Affari - 3rd Floor Type:Oral
Chair:Dong Yu

16:00Region Dependent Transform on MLP Features for Speech Recognition

Tim Ng (Raytheon BBN Technologies)
Bing Zhang (Raytheon BBN Technologies)
Spyros Matsoukas (Raytheon BBN Technologies)
Long Nguyen (Raytheon BBN Technologies)

In this work, Region Dependent Transform (RDT) is used as a feature extraction process to combine traditional short-term acoustic features with features derived from a Multi-Layer Perceptron (MLP) trained on long-term features. When compared to the conventional feature augmentation approach, substantial improvement is obtained. Moreover, an improved RDT training procedure in which speaker dependent transforms are taken into account is proposed for feature combination in Speaker Adaptive Training. By incorporating the higher dimensional features output from the layer prior to the bottleneck layer into our Speech-to-Text (STT) system using RDT, significant improvement is achieved as compared to using the conventional bottleneck features. In summary, by using the features derived from MLP with RDT, 8.2% to 11.4% relative reduction in Character Error Rate is achieved for our Mandarin STT systems.

16:20Discriminant Sub-Space Projection of Spectro-Temporal Speech Features based on Maximizing Mutual Information

Martin Heckmann (Honda Research Institute Europe GmbH)
Claudius Gläser (Honda Research Institute Europe GmbH)

We previously developed noise robust Hierarchical Spectro-Temporal (HIST) speech features. The learning of the features was performed in an unsupervised way with unlabeled speech data. In a final stage we deployed Principal Component Analysis (PCA) to reduce the feature dimensions and to diagonalize them. In this paper we investigate if a discriminant projection can further increase the performance. We maximize the mutual information between the features and the phoneme categories using a procedure known as Maximizing Renyi's Mutual Information (MRMI) and also compare it to Linear Discriminant Analysis (LDA). Based on recognition tests in clean and in noise, i.e. in matching and mismatching conditions, we show that the discriminant projections increase recognition scores compared to PCA in matching conditions. However, this improvement does not transfer to the mismatching, i.e. noisy, conditions. We discuss measures to alleviate this problem. Overall MRMI performs better than LDA.

16:40Combining feature space discriminative training with long-term spectro-temporal features for noise-robust speech recognition

Takashi Fukuda (IBM Research - Tokyo, IBM Japan Ltd.)
Osamu Ichikawa (IBM Research - Tokyo, IBM Japan Ltd.)
Masafumi Nishimura (IBM Research - Tokyo, IBM Japan Ltd.)

Discriminative training of feature space using the maximum mutual information (fMMI) objective function has been shown to yield remarkable accuracy improvements. For noisy environments, fMMI can be regarded as an effective noise compensation algorithm and can play a significant role for noise robustness. Feature space speaker adaptation techniques such as fMLLR are also well known and are suitable for mismatched test data. These feature space transform algorithms are essential for modern speech recognition but still need further improvement against low SNR conditions. In contrast, long-term spectro-temporal information has also received attention to support traditional short-term features. We previously proposed long-term temporal features to improve ASR accuracy for low SNR speech. In this paper, we show that long-term temporal features can be combined with fMMI to build more discriminative models of noisy speech and that the proposed method performed favorably at low SNR conditions.

17:00Combining Frame and Segment Level Processing via Temporal Pooling for Phonetic Classification

Sumit Chopra (AT&T Labs Research)
Patrick Haffner (AT&T Labs Research)
Dimitrios Dimitriadis (AT&T Labs Research)

We propose a simple, yet novel, multi-layer model for the problem of phonetic classification. Our model combines a frame level transformation of the acoustic signal with a segment level phone classification. Our key contribution is the study of new temporal pooling strategies that interface these two levels, determining how frame scores are converted into segment scores. On the TIMIT benchmark, we match the best performance obtained using a single classifier. Diversity in pooling strategies is further used to generate candidate classifiers with complementary performance characteristics, which perform even better as an ensemble. Without the use of any phonetic knowledge, our ensemble model achieves a 16.96% phone classification error. While our data-driven approach is exhaustive, the combinatorial inflation is limited to the smaller segmental half of the system.
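To make the idea of temporal pooling concrete, the sketch below converts a variable-length list of per-frame score vectors into a fixed-length segment vector. The three strategies shown (mean, max, and pooling over contiguous regions of the segment) are generic illustrations chosen for this example; the abstract does not say which strategies the authors studied, and all function names are hypothetical.

```python
def mean_pool(frame_scores):
    """Average each score dimension over the frames of a segment."""
    n = len(frame_scores)
    dims = len(frame_scores[0])
    return [sum(frame[d] for frame in frame_scores) / n for d in range(dims)]


def max_pool(frame_scores):
    """Take the maximum of each score dimension over the frames of a segment."""
    dims = len(frame_scores[0])
    return [max(frame[d] for frame in frame_scores) for d in range(dims)]


def region_pool(frame_scores, regions=3, pool=mean_pool):
    """Split a non-empty segment into contiguous regions, pool each, and concatenate."""
    n = len(frame_scores)
    pooled = []
    for r in range(regions):
        start = r * n // regions
        end = max((r + 1) * n // regions, start + 1)  # keep every region non-empty
        pooled.extend(pool(frame_scores[start:end]))
    return pooled


# Hypothetical segment: six frames, two score dimensions per frame.
frames = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0], [7.0, 6.0], [9.0, 8.0], [11.0, 10.0]]
segment_vector = region_pool(frames, regions=3)
```

Whatever the segment duration, the output length depends only on the number of regions and score dimensions, so a segment-level classifier can consume it directly. Different pooling choices yield different classifiers, which is the kind of diversity an ensemble can exploit.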

17:20Improved Bottleneck Features Using Pretrained Deep Neural Networks

Dong Yu (Microsoft Research)
Michael L. Seltzer (Microsoft Research)

Bottleneck features (BNs) have been shown to be effective in improving the accuracy of ASR systems. Conventionally, BNs are extracted from a multi-layer perceptron (MLP) trained to predict context-independent monophone states. The MLP typically has three hidden layers and is trained using the backpropagation algorithm. In this paper, we propose two improvements to the training of bottleneck features motivated by recent advances in deep neural networks (DNNs). First, we show how the use of unsupervised pretraining of a DNN enhances the network’s discriminative power and improves the bottleneck features it generates. Second, we show that a neural network trained to predict context-dependent senone targets produces better bottleneck features than one trained to predict monophone states. Bottleneck features trained using the proposed methods produced a 16% relative reduction in sentence error rate over conventional bottleneck features on a large vocabulary business search task.

17:40Minimum Classification Error Based Spectro-Temporal Feature Extraction for Robust Audio Event Classification

Yuan-Fu Liao (Department of Electronic Engineering, National Taipei University of Technology, Taipei, Taiwan)

Mel-frequency cepstral coefficients (MFCCs) are the most popular features for automatic audio classification (AAC). However, MFCCs are often not robust in adverse environments. In this paper, a minimum classification error (MCE)-based method is proposed to extract new and robust spectro-temporal features as alternatives to MFCCs. The robustness of the proposed features is evaluated on noisy non-speech sounds from the RWCP Sound Scene Database in Real Acoustic Environment, with settings modeled on the Aurora 2 multi-condition training task. Experimental results show that the proposed features achieved the lowest average recognition error rate, 3.17%, which is much better than state-of-the-art MFCCs plus mean subtraction, variance normalization and ARMA filtering (MFCC+MVA, 4.31%), Gabor filters with principal component analysis (Gabor+PCA, 4.43%) and linear discriminant analysis (LDA, 4.20%) features. We thus confirm the robustness of the proposed spectro-temporal feature extraction approach.

Sun-Ses3-S1-O:
Crowdsourcing for Speech Processing I

Time:Sunday 16:00 Place:Caravaggio (Adua 1) - Pala Affari - 1st Floor Type:Oral
Chairs:Maxine Eskenazi, David Suendermann, Gina-Anne Levow

16:00Speaking to the Crowd: looking at past achievements in using crowdsourcing for speech and predicting future challenges

Gabriel Parent (Language Technologies Institute, Carnegie Mellon University)
Maxine Eskenazi (Language Technologies Institute, Carnegie Mellon University)

This paper examines the literature on the use of crowdsourcing for speech-related tasks: speech acquisition, transcription and annotation as well as the assessment of speech technology. 29 papers were found, representing 37 different experiments, which were annotated and analyzed to find trends in the field. The paper focuses on the different techniques used for quality control and the variety of sources of “crowds”. Finally, we propose several challenges for the future of crowdsourcing for speech processing.

Sun-Ses3-P1:
Prosodic Structure

Time:Sunday 16:00 Place:Valfonda 1 - Pala Congressi (Passi Perduti-Gallery) Type:Poster
Chair:Elizabeth Shriberg

#1Where should pitch accents and phrase breaks go? A syntax tree transducer solution

Joseph Tepperman (Rosetta Stone Labs)
Emily Nava (Rosetta Stone Labs)

Motivated by a desire to assess the prosody of foreign language learners, this study demonstrates the benefit of high-level syntactic information in automatically deciding where phrase breaks and pitch accents should go in text. The connection between syntax and prosody is well-established, and naturally lends itself to tree-based probabilistic models. With automatically-derived parse trees paired to tree transducer models, we found that categorical prosody tags for unseen text can be determined with significantly higher accuracy than they can with a baseline method that uses sequential models of part-of-speech tags. On the Boston University Radio News Corpus, the tree transducer outperformed the baseline by 14% overall for accents, and by 3% overall for breaks. These automatic results fell within this corpus's range of inter-speaker agreement in assigning accents and breaks to text.

#2Phrasal prominences do not need pitch movements: postfocal phrasal heads in Italian

Giuliano Bocci (University of Bologna)
Cinzia Avesani (CNR- Institute of Cognitive Sciences and Technologies)

Informationally Given phrases following an instance of focus are generally realized in a compressed pitch range and are assumed to lack prosodic prominences above the word level. In this paper, we address the question of the metrical representation of postfocal constituents in Tuscan Italian. The results of a production experiment show that, despite their being realized with a low and flat F0 contour, postfocal constituents are not extrametrical, but are phrased and assigned metrical prominences of phrasal level. The impact of our results on the prosodic representation of Italian is discussed. Index Terms: Prosody, Focus, prosodic hierarchy, Givenness.

#3Intonation of left dislocated topics in Modern Greek

David Le Gac (University of Rouen, EA4305 lidifra)
Hiyon Yoo (University Paris Diderot, UMR 7110 LLF CNRS)

We present in this paper the results of a production experiment testing the effects of the discourse activation state on the intonation of left-dislocated topics in Modern Greek. The activation states of the topics (active, inactive and semi-active) were examined in three different sentence types, namely declaratives, WH-questions and yes-no questions. Results show that the tunes are not affected by activation state but by sentence type. This supports the idea that the intonation of these topics is rather governed by a phonological process, probably grounded in perceptually oriented principles.

#4Phrases, pitch and perceived prominence in Māori

Laura Thompson (University of Auckland, New Zealand)
Catherine I. Watson (University of Auckland, New Zealand)
Ray Harlow (University of Waikato, New Zealand)
Jeanette King (University of Canterbury, New Zealand)
Margaret Maclagan (University of Canterbury, New Zealand)
Helen Charters (University of Auckland, New Zealand)
Peter Keegan (University of Auckland, New Zealand)

This study explores phrase-level prosody and prominence in the Māori language. Limited existing prosodic analysis and anecdotal evidence of diachronic change have motivated the present investigation into alignment of descriptions of intonation and stress with prominence perception test results and pitch analysis of speech data. In general, we find the expected case does occur most often, but examining results across speakers with birthdates spanning a century shows conservatism in modern elders and contradictory results in younger speakers: while making 'errors' in prominence placement, they are often as faithful to the overall expected contour as their elders.

#5Perceptual sensitivity to prenuclear and nuclear intonational patterns

Tomáš Duběda (Charles University in Prague, Institute of Translation Studies)

We describe a perceptual experiment whose goal is to compare perceptual sensitivity to pitch accent contrasts in nuclear and prenuclear positions. The material consists of Czech sentences which have been resynthesized with controlled intonation. The results show that changes in nuclear pitch accents are perceived more sharply than changes in prenuclear pitch accents, and that the H* accent is perceptually more salient than the other accent types (L*H, L* and S*). The effect of constituent edge on the perception of intonational contrasts has not been confirmed.

#6Tonal Alignment Defined: the case of Southern Irish English

Raya Kalaldeh (Trinity College Dublin)

This paper proposes to define tonal alignment features as either intrinsic (the default alignment) or extrinsic (shifts away from the default alignment due to prosodic contextual factors). Intrinsic alignment is different for pre-nuclear (PN) and nuclear (N) accents. This distinction is illustrated for a variety of Irish English (IrE), Drogheda English (DroghE), where the PN and N peaks of H* accents are intrinsically aligned at a time point 70%-80% and 60%-75% into the vowel of the accented syllable, respectively. Extrinsic alignment shifts of PN and N peaks are very small and do not exceed the accented vowel boundaries.

#7Using Mutual Information to Identify Regions of Analysis for Prosodic Analysis

Andrew Rosenberg (Queens College / CUNY)

This paper presents a novel technique for empirically identifying regions of analysis for time/value information. The technique relies on analysis of mutual information between the contour, and some variable of interest. We present the use of this technique in the analysis of prosody in American English speech, where we identify valuable regions of analysis for the classification of phrase ending intonation. We also use the technique to investigate the most informative region of analysis for pitch accent detection.
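One way to picture region selection by mutual information is sketched below. It is an illustration under assumptions rather than the author's procedure: each candidate region of a contour is summarized by its mean value, the means are quantized into equal-width bins, and regions are ranked by the mutual information (in bits) between the binned value and the class label. The function names, the binning scheme and the toy contours are hypothetical.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information in bits between two equal-length sequences of discrete symbols."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x,y) / (p(x) p(y)) simplifies to c * n / (count_x * count_y)
        mi += (c / n) * math.log2(c * n / (px[x] * py[y]))
    return mi


def quantize(values, bins):
    """Map real values to equal-width bin indices over their observed range."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / bins
    return [min(int((v - lo) / width), bins - 1) for v in values]


def rank_regions(contours, labels, regions, bins=4):
    """Score each (start, end) region by MI between its binned mean and the labels."""
    scored = []
    for start, end in regions:
        means = [sum(c[start:end]) / (end - start) for c in contours]
        scored.append((mutual_information(quantize(means, bins), labels), (start, end)))
    return sorted(scored, reverse=True)


# Hypothetical pitch contours: only the second half separates the two classes.
contours = [[0.0, 0.0, 1.0, 1.0], [0.0, 0.0, 1.2, 0.8],
            [0.0, 0.0, 5.0, 5.0], [0.0, 0.0, 5.1, 4.9]]
labels = ["L", "L", "H", "H"]
ranking = rank_regions(contours, labels, [(0, 2), (2, 4)], bins=2)
```

In this toy case the region covering the last two samples carries all the label information and the first region carries none, so the ranking puts the second region first.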

#8Prosodic highlights in Mandarin continuous speech—Cross-genre attributes and implications

Chiu-yu Tseng (Phonetics Lab, Institute of Linguistics, Academia Sinica Taipei, Taiwan)
Zhao-yu Su (Taiwan International Graduate Program (TIGP), Academia Sinica Taipei, Taiwan)
Chi-Feng Huang (Phonetics Lab, Institute of Linguistics, Academia Sinica Taipei, Taiwan)

The present study examines perceived prosodic highlights in three genres of fluent continuous Mandarin to test (1) whether prosodic highlights are genre related, (2) how they interact with discourse structure, (3) how they signal information status, (4) whether systematic acoustic patterns could be obtained from speech data analysis, and (5) whether prosodic highlights are layered over discourse structure. Results demonstrate that prosodic highlighting is genre related; distribution of key information can be attributed to linguistic content and communicative needs. Prosodic highlighting is an extra layer over discourse structure: the former signals key information while the latter signals underlying linguistic association.

#9When two newly-acquired words are one: New words differing in stress alone are not automatically represented differently

Simone Sulpizio (Department of Cognitive Science and Education, University of Trento, Italy)
James McQueen (Donders Institute for Brain, Cognition, and Behaviour, Centre for Cognition, Radboud University Nijmegen, Nijmegen, The Netherlands)

Do listeners use lexical stress at an early stage in word learning? Artificial-lexicon studies have shown that listeners can learn new spoken words easily. These studies used non-words differing in consonants and/or vowels, but not differing only in stress. If listeners use stress information in word learning, they should be able to learn new words that differ only in stress (e.g., BInulo-biNUlo). We investigated this issue here. When learning new words, Italian listeners relied on segmental information; they did not take stress information into account. Newly-acquired words differing in stress alone are not automatically represented as different words.

#10Automatic Determination of the Standard Chinese Prosodic Phrase Boundaries by F0 Generation Model

Shehui Bu (School of Computer Science & Engineering, South China University of Technology, Guangzhou, China)
Zhenjie Zhuo (School of Computer Science & Engineering, South China University of Technology, Guangzhou, China)
Lingling Yang (School of Computer Science & Engineering, South China University of Technology, Guangzhou, China)
Shuichi Itahashi (National Institute of Informatics , Tokyo, Japan)

We propose an automatic method for determining the boundaries of prosodic phrases in real speech waves. In this method, the dynamic programming (DP) and the least mean square error (LMSE) methods were implemented based on the F0 generation model. In order to evaluate the accuracy and validity of the proposed method, a set of 973 standard Chinese speech sentences was selected. The cumulative proportion of the estimated prosodic phrase boundaries approached 76% when ET(0i) was less than the average duration of the prosodic phrases. Thus, it can be concluded that the proposed method can be used in practical applications.

#11Measuring speakers’ similarity in speech by means of prosodic cues: methods and potential

Celine De Looze (Speech Communication Laboratory, Trinity College Dublin, Ireland)
Stephane Rauzy (Laboratoire Parole et Langage, Aix-Marseille Université, UMR 6057, Aix-en-Provence, France)

This study presents a method for measuring speakers’ similarity (the tendency for speakers to exhibit similar speech patterns) by means of prosodic cues. It shows that similarity changes throughout social interaction and that its variations can be informative about speakers’ attitudes, similarity being more important when speakers are more involved in the interaction. It supports the assumption that similarity is part of social interaction and may be implemented in spoken dialogue systems.

#12Tonal Variations in Mandarin: New Evidence from Spontaneous and Read Speech

Li-chiung Yang (Tunghai University)

It is well-known that tone variation occurs in read speech due to a number of different contextual effects, most prominently local tone sequencing and downstepping, whereas in spontaneous speech, a greater variability of tone shape has been observed as arising from global discourse context. In this paper we show comparative results on tonal realization in read and spontaneous speech. Specifically, our study confirms the effects of tonal sequencing, and shows that higher amplitude increases conformity to lexical shape in both speech modes, while spontaneous speech is both more varied and pitch reduced in the defined f0 direction.

Sun-Ses3-P2:
Language Processing

Time:Sunday 16:00 Place:Valfonda 2 - Pala Congressi (Passi Perduti-Gallery) Type:Poster
Chair:Frederic Bechet

#1Accounting for prosodic information to improve ASR-based topic tracking for TV Broadcast News

Camille Guinaudeau (INRIA Rennes)
Julia Hirschberg (Columbia University)

The increasing quantity of video material available on line requires improved methods to help users navigate such data, among which are topic tracking techniques. The goal of this paper is to show that prosodic information can improve an ASR-based topic tracking system for French TV Broadcast News. To this end, two kinds of prosodic information—extracted with and without a learning phase—are integrated in the system. This integration shows significant improvements in the F1-measure, by 13 and 8 points for the two techniques compared with the baseline system.

#2Morpheme Conversion for Connecting Speech Recognizer and Language Analyzers in Unsegmented Languages

Kenji Imamura (NTT Cyber Space Laboratories)
Tomoko Izumi (NTT Cyber Space Laboratories)
Kugatsu Sadamitsu (NTT Cyber Space Laboratories)
Kuniko Saito (NTT Cyber Space Laboratories)
Satoshi Kobashikawa (NTT Cyber Space Laboratories)
Hirokazu Masataki (NTT Cyber Space Laboratories)

Connecting automatic speech recognizers (ASRs) and language analyzers is difficult since they may be based on different part-of-speech (POS) systems; the latter cannot directly analyze the outputs of the former. In addition, in unsegmented languages such as Japanese, the ASR outputs are likely to have a different word segmentation from that of the language analyzer inputs because the two are developed independently. A conventional approach is to generate raw texts from the ASR outputs and re-analyze them using a morphological analyzer. However, if the ASR outputs contain recognition errors, the morphological analyzer incorrectly analyzes them even though they contain correctly recognized words. To avoid this problem, we propose a morpheme conversion method that directly converts ASR outputs into morpheme sequences suitable for the language analyzers. Our experiments show that morpheme conversion is more robust than the conventional approach against recognition errors.

#3Emotion Detection Based on Concept Inference and Spoken Sentence Analysis for Customer Service

Ren-Ying Fang (Department of Electrical Engineering, National Cheng Kung University)
Bo-Wei Chen (Department of Electrical Engineering, National Cheng Kung University)
Jhing-Fa Wang (Department of Electrical Engineering, National Cheng Kung University)
Chung-Hsien Wu (Department of Computer Science and Information Engineering, National Cheng Kung University)

In this study, we aim to construct an emotion detection system, which is capable of recognizing emotions of customers from product feedback forms for customer service. The proposed system integrated the ontology and the fuzzy inference algorithm for analyzing and recognizing various emotions through verbal/spoken sentences. Companies can evaluate users’ degree of satisfaction from emotions and automatically find out opinions of interest, saving unnecessary human resources.

#4Commas recovery with syntactic features in French and in Czech

Christophe Cerisara (LORIA UMR 7503, BP 239 - 54506 Vandoeuvre, France)
Pavel Král (Dept. of Computer Science & Engineering, University of West Bohemia, Plzeň, Czech Rep.)
Claire Gardent (LORIA UMR 7503, BP 239 - 54506 Vandoeuvre, France)

Automatic speech transcripts can be made more readable and useful for further processing by enriching them with punctuation marks and other meta-linguistic information. In this work, we study how to improve automatic recovery of one of the most difficult punctuation marks, the comma, in French and in Czech. We show that comma detection performance is largely improved in both languages by integrating syntactic features derived from dependency structures into our baseline Conditional Random Field model. We further study the relative impact of language-independent vs. language-specific features, and show that a combination of both gives the largest improvement. Finally, the robustness of these features to speech recognition errors is discussed.

#5Redundancy Reduction in ASR of Spontaneous Speech through Statistical Machine Translation

Daniele Falavigna (FBK-Fondazione Bruno Kessler)

This paper describes a system, based on statistical machine translation, that tries to remove non-relevant words from the output of an automatic audio transcription system, such as erroneously inserted function words, filled pauses, interjections, and word fragments, as well as to repair, to a certain extent, ungrammatical pieces of sentences. For this work we decided to concentrate on a political speeches application domain, due to the immediate availability of a parallel corpus of automatic audio transcriptions and related, manually produced proceedings. The system can effectively detect and correct several errors (mainly insertions) included in the alignment between a given automatic audio transcription and a reference transcription derived from a corresponding proceeding. Preliminary results, expressed in terms of word error rate, show that the proposed approach yields a relative improvement of 5% over using the pure automatic transcription of the audio.

#6From Interview to News Text : A Study of Taiwan TV Political Interviews in Newspaper Reports

Chin-Chih Chiang (Department of Communication Managements, Shih Hsin University, Taipei, Taiwan)

Mode and genre transformations are main issues in the digital era, especially apparent when news interviews are adapted into written news reports. This paper investigates how interactions in a television news interview are excerpted and adapted into a written news text in newspapers. Through a text analysis of two political interviews in Taiwan on television, and five newspaper reports based on the two interviews, this study discovered that the reporter focuses on the answers of the interviewee, and tends to extract those that fit the political news narrative and are salient in the interactions. Furthermore, this paper demonstrates how the oral interview is adapted into a written news text via various discursive techniques, how interviewee answers are quoted, and how the local interview context is constructed in the news text.

Sun-Ses3-P3:
ASR - language models I

Time:Sunday 16:00 Place:Faenza 1 - Pala Congressi (Passi Perduti-Gallery) Type:Poster
Chair:Michael Riley

#1Unary Data Structures for Language Models

Jeffrey Sorensen (Google, Inc.)
Cyril Allauzen (Google, Inc.)

Language models are important components of speech recognition and machine translation systems. Trained on billions of words, and consisting of billions of parameters, language models often are the single largest components of these systems. There have been many proposed techniques to reduce the storage requirements for language models. A technique based upon pointer-free compact storage of ordinal trees shows compression competitive with the best proposed systems, while retaining the full finite state structure, and without using computationally expensive block compression schemes or lossy quantization techniques.

#2Bayesian Language Model Interpolation for Mobile Speech Input

Cyril Allauzen (Google Research, 76 Ninth AV, NY, NY, USA)
Michael Riley (Google Research, 76 Ninth AV, NY, NY, USA)

This paper explores various static interpolation methods for approximating a single dynamically-interpolated language model used for a variety of recognition tasks on the Google Android platform. The goal is to find the statically-interpolated first-pass LM that best reduces search errors in a two-pass system or that even allows eliminating the more complex dynamic second pass entirely. Static interpolation weights that are uniform, prior-weighted, and the maximum likelihood, maximum a posteriori, and Bayesian solutions are considered. Analysis argues and recognition experiments on Android test data show that a Bayesian interpolation approach performs best.
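As a rough illustration of what a statically interpolated LM involves, the sketch below estimates maximum likelihood interpolation weights with EM on held-out data. It is not the Bayesian method the abstract favours, and the function names and the toy probabilities are assumptions made for this example only.

```python
def em_interpolation_weights(component_probs, iterations=50):
    """Estimate static mixture weights by EM (maximum likelihood).

    component_probs[t][k] is the probability that component LM k assigns
    to held-out token t given its history (hypothetical input format).
    """
    num_models = len(component_probs[0])
    weights = [1.0 / num_models] * num_models  # start from uniform weights
    for _ in range(iterations):
        totals = [0.0] * num_models
        for probs in component_probs:
            mix = sum(w * p for w, p in zip(weights, probs))
            # E-step: posterior responsibility of each component for this token
            for k in range(num_models):
                totals[k] += weights[k] * probs[k] / mix
        # M-step: new weight is the average responsibility
        weights = [t / len(component_probs) for t in totals]
    return weights


def interpolate(weights, probs):
    """Probability of one token under the statically interpolated model."""
    return sum(w * p for w, p in zip(weights, probs))


if __name__ == "__main__":
    held_out = [[0.5, 0.1], [0.4, 0.1], [0.6, 0.2], [0.1, 0.3]]
    w = em_interpolation_weights(held_out)
    print(w, interpolate(w, held_out[0]))
```

Uniform weights correspond to skipping the EM loop entirely; the other schemes listed in the abstract would replace the M-step or add a prior over the weights.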

#3On the Estimation of Discount Parameters for Language Model Smoothing

Martin Sundermeyer (RWTH Aachen)
Ralf Schlüter (RWTH Aachen)
Hermann Ney (RWTH Aachen)

The goal of statistical language modeling is to find probability estimates for arbitrary word sequences. To obtain nonzero values, the probability distributions found in the training data need to be smoothed. In the widely-used Kneser-Ney family of smoothing algorithms, this is achieved by absolute discounting. The discount parameters can be computed directly using some approximation formulas minimizing the leaving-one-out log-likelihood of the training data. In this work, we outline several shortcomings of the standard estimators for the discount parameters. We propose an efficient method for computing the discount values on held-out data and analyze the resulting parameter estimates. Experiments on large English and French corpora show consistent improvements in perplexity and word error rate over the baseline method. At the same time, this approach can be used for language model pruning, leading to slightly better results than standard pruning algorithms.
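For orientation, the sketch below shows the commonly cited closed-form discount estimate D = n1 / (n1 + 2 n2), where n1 and n2 are the numbers of n-grams seen exactly once and twice, together with an interpolated absolute-discounting probability for bigrams. This corresponds to the kind of standard estimator the abstract takes as its baseline, not to the held-out method the authors propose; the function names and the toy counts are assumptions.

```python
from collections import Counter


def absolute_discount(ngram_counts):
    """Closed-form discount estimate D = n1 / (n1 + 2 * n2)."""
    count_of_counts = Counter(ngram_counts.values())
    n1, n2 = count_of_counts[1], count_of_counts[2]
    if n1 == 0:
        raise ValueError("no singleton n-grams; estimate is undefined")
    return n1 / (n1 + 2 * n2)


def interpolated_prob(history, word, bigram_counts, lower_order_prob, discount):
    """Absolute discounting with interpolation for a bigram model.

    lower_order_prob is any callable returning a unigram-level probability;
    full Kneser-Ney would use continuation counts there instead.
    """
    history_total = sum(c for (h, _), c in bigram_counts.items() if h == history)
    if history_total == 0:
        return lower_order_prob(word)
    followers = sum(1 for (h, _), c in bigram_counts.items() if h == history and c > 0)
    discounted = max(bigram_counts.get((history, word), 0) - discount, 0.0) / history_total
    # Mass removed by discounting is redistributed via the lower-order model
    backoff_mass = discount * followers / history_total
    return discounted + backoff_mass * lower_order_prob(word)


if __name__ == "__main__":
    counts = {("a", "b"): 1, ("a", "c"): 2, ("a", "d"): 1, ("b", "a"): 3}
    d = absolute_discount(counts)
    print(d, interpolated_prob("a", "c", counts, lambda w: 0.25, d))
```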

#4N-grams for Conditional Random Fields or a Failure-transition Posterior for Acyclic FSTs

Patrick Lehnen (RWTH Aachen University)
Stefan Hahn (RWTH Aachen University)
Hermann Ney (RWTH Aachen University)

Freely available software packages for the training of Conditional Random Fields, e.g. CRF++, do not support longer n-grams than bigram, which can be attributed to the fact that training CRFs in original notation has a polynomial computation time dependence on the target vocabulary size and an exponential dependence on the n-gram size. We transfer the backing-off idea from language models to CRFs. We realized the software with Finite State Transducers, where we modified the calculation of the posterior algorithm. To implement the backing-off scheme, we applied Failure-transitions known from OpenFST. Proof of concept is given on the semantic tagging task MEDIA and on the grapheme-to-phoneme (G2P) conversion tasks NetTalk and Celex, showing that computational time increases much below the size of the target vocabulary and showing error rate reduction on the G2P tasks.

#5Hybrid Language Models Using Mixed Types of Sub-lexical Units for Open Vocabulary German LVCSR

M. Ali Basha Shaik (Computer Science Department i6, RWTH Aachen University, Aachen, Germany)
Amr El-Desoky Mousa (Computer Science Department i6, RWTH Aachen University, Aachen, Germany)
Ralf Schlueter (Computer Science Department i6, RWTH Aachen University, Aachen, Germany)
Hermann Ney (Computer Science Department i6, RWTH Aachen University, Aachen, Germany)

German is a highly inflected language with a large number of words derived from the same root. It makes use of a high degree of word compounding leading to high Out-of-vocabulary (OOV) rates, and Language Model (LM) perplexities. For such languages the use of sub-lexical units for Large Vocabulary Continuous Speech Recognition (LVCSR) becomes a natural choice. In this paper, we investigate the use of mixed types of sub-lexical units in the same recognition lexicon. Namely, morphemic or syllabic units combined with pronunciations called graphones, normal graphemic morphemes or syllables along with full-words. This mixture of units is used for building hybrid LMs suitable for open vocabulary LVCSR where the system operates over an open, constantly changing vocabulary like in broadcast news, political debates, etc. A relative reduction of around 5.0% in Word Error Rate (WER) is obtained compared to a traditional full-words system. Moreover, around 40% of the OOVs are recognized.

#6Morpheme Based Factored Language Models for German LVCSR

Amr El-Desoky Mousa (Computer Science Department i6, RWTH Aachen University, Aachen, Germany)
M. Ali Basha Shaik (Computer Science Department i6, RWTH Aachen University, Aachen, Germany)
Ralf Schlueter (Computer Science Department i6, RWTH Aachen University, Aachen, Germany)
Hermann Ney (Computer Science Department i6, RWTH Aachen University, Aachen, Germany)

German is a highly inflectional language, where a large number of words can be generated from the same root. It makes liberal use of compounding, leading to high Out-of-vocabulary (OOV) rates and poor Language Model (LM) probability estimates. Therefore, the use of morphemes for language modeling is considered a better choice for Large Vocabulary Continuous Speech Recognition (LVCSR) than full words. Thereby, better lexical coverage and lower LM perplexities are achieved. On the other hand, the use of Factored Language Models (FLMs) is considered a successful approach that allows the integration of many information sources to get better LM probability estimates. In this paper, we try a combined methodology for language modeling where both morphological decomposition and factored language modeling are used in one model called morpheme based FLM. Finally, we obtain around 2.5% relative reduction in Word Error Rate (WER) with respect to a traditional full-words system.

#7Compound Word Recombination for German LVCSR

Markus Nußbaum-Thom (RWTH Aachen University, Computer Science Department, Chair 6 for Human Language Technology and Pattern Recognition)
Amr El-Desoky Mousa (RWTH Aachen University, Computer Science Department, Chair 6 for Human Language Technology and Pattern Recognition)
Ralf Schlüter (RWTH Aachen University, Computer Science Department, Chair 6 for Human Language Technology and Pattern Recognition)
Hermann Ney (RWTH Aachen University, Computer Science Department, Chair 6 for Human Language Technology and Pattern Recognition)

Compound words are a difficulty for German speech recognition systems since they cause high out-of-vocabulary and word error rates. State of the art approaches augment the language model by the fragments of compounds in order to increase lexical coverage, lower the perplexity and out-of-vocabulary rate. The fragments are tagged in order to concatenate subsequent equally tagged fragments in the recognition result, but this does not guarantee the recombination of proper words. Such recombination techniques neglect the large vocabulary of the language model training data for recombination although most compounds are covered by it. In this paper, we investigate the use of this vocabulary for the recombination of compound words from the recognition result. The approach is tested on two large vocabulary tasks on top of full-word and fragment based language models and achieves good improvements of 3-7% relative over the baseline compound-sensitive word error rate.

#8Lattice-Based Risk Minimization Training for Unsupervised Language Model Adaptation

Akio Kobayashi (NHK Science and Technology Research Laboratories)
Takahiro Oku (NHK Science and Technology Research Laboratories)
Shinichi Homma (NHK Science and Technology Research Laboratories)
Toru Imai (NHK Science and Technology Research Laboratories)
Seiichi Nakagawa (Toyohashi University of Technology)

This paper describes a lattice-based risk minimization training method for unsupervised language model (LM) adaptation. In a broadcast archiving system, unsupervised LM adaptation using transcriptions generated by speech recognition is considered to be useful for improving the performance. However, conventional linear interpolation methods occasionally degrade the performance because of incorrect words in the training transcriptions. Accordingly, we propose a new adaptation method aiming to reflect error information among training lattices. The method minimizes the whole risk of training lattices to yield a log-linear model, which consists of a set of linguistic features. The advantage of the method is that the model parameters can be obtained efficiently in an unsupervised manner. Experimental results obtained in transcribing Japanese broadcast news showed a significant word error rate reduction relative to conventional mixture LMs.

#9Similarity language model

Christian Gillot (Nancy-Universite)
Christophe Cerisara (LORIA)

The similarity language model is a statistical model that makes efficient use of long distance information when possible and falls back to a standard ngram language model when not. To estimate the probability distribution of a given target context, each training example of the ngram model is retrieved and its similarity to the context is estimated. In this work, this is done by performing a string alignment and training the system to estimate the similarity of each possible alignment. Whereas in the ngram model all such examples are deemed equal, the more similar an example is to the current context, the more weight it is given in the estimation of the probability distribution. The proposed model outperforms a modified Kneser-Ney 4-gram model.

#10Data Sampling and Dimensionality Reduction Approaches for Reranking ASR Outputs Using Discriminative Language Models

Erinc Dikici (Bogazici University)
Murat Semerci (Bogazici University)
Murat Saraclar (Bogazici University)
Ethem Alpaydin (Bogazici University)

This paper investigates various approaches to data sampling and dimensionality reduction for discriminative language models (DLM). Being a feature based language modeling approach, the aim of DLM is to rerank the ASR output with discriminatively trained feature parameters. Using a Turkish morphology based feature set, we examine the use of online Principal Component Analysis (PCA) as a dimensionality reduction method. We exploit ranking perceptron and ranking SVM as two alternative discriminative modeling techniques, and apply data sampling to improve their efficiency. We obtain a reduction in word error rate (WER) of 0.4%, significant at p<0.001 over the baseline perceptron result.

#11Training a language model using webdata for large vocabulary Japanese spontaneous speech recognition

Ryo Masumura (Graduate School of Engineering, Tohoku University)
Seongjun Hahm (Graduate School of Engineering, Tohoku University)
Akinori Ito (Graduate School of Engineering, Tohoku University)

This paper describes a language modeling method using large-scale spoken language data retrieved from the Web for spontaneous speech recognition. To this end, we downloaded 15 million Web pages that comprehensively include various topics. Next, spoken language-like texts are selected from the downloaded Web data using a naive Bayes classifier. Moreover, we complemented typical linguistic phenomena such as fillers and pauses using simulation models. A language model trained on the generated data performed as well as one trained on the large-scale spontaneous speech corpus (Corpus of Spontaneous Japanese, CSJ). By combining the generated data and CSJ, we obtained further improvement of word accuracy.

#12Large Vocabulary SOUL Neural Network Language Models

Hai-Son Le (Universite Paris-Sud and LIMSI CNRS)
Ilya Oparin (LIMSI CNRS)
Abdel Messaoudi (LIMSI CNRS)
Alexandre Allauzen (Universite Paris-Sud and LIMSI CNRS)
Jean-Luc Gauvain (LIMSI CNRS)
Francois Yvon (Universite Paris-Sud and LIMSI CNRS)

This paper presents a continuation of research on Structured OUtput Layer Neural Network language models (SOUL NNLM) for automatic speech recognition. As SOUL NNLMs allow estimating probabilities for all in-vocabulary words and not only for those pertaining to a limited shortlist, we investigate their performance on a large-vocabulary task. Significant improvements both in perplexity and word error rate over conventional shortlist-based NNLMs are shown on a challenging Arabic GALE task characterized by a recognition vocabulary of about 300k entries. A new training scheme is proposed for SOUL NNLMs that is based on separate training of the out-of-shortlist part of the output layer. It enables using more data at each iteration of a neural network without any considerable slow-down in training and brings additional improvements in speech recognition performance.

#13Improved Spoken Query Transcription using Co-occurrence Information

Jonathan Mamou (IBM Research)
Abhinav Sethy (IBM Research)
Bhuvana Ramabhadran (IBM Research)
Ron Hoory (IBM Research)
Paul Vozila (Nuance Communications)

Spoken queries are a natural medium for searching the Mobile Web. Language modeling for voice search recognition offers different challenges compared to more conventional speech applications. The challenges arise from the fact that spoken queries are usually a set of keywords and do not have a syntactic and grammatical structure. This paper describes a co-occurrence based approach to improve the accuracy of automatic transcription of voice queries. With the right choice of scoring function and co-occurrence level, we show that co-occurrence information gives a 2% relative accuracy improvement over a state of the art system.

#14Unsupervised Latent Speaker Language Modeling

Yik-Cheung Tam (Nuance Communications Inc)
Paul Vozila (Nuance Communications Inc)

In commercial speech applications, millions of speech utterances from the field are collected from millions of users, creating a challenge to best leverage the user data to enhance speech recognition performance. Motivated by an intuition that similar users may produce similar utterances, we propose a latent speaker model for unsupervised language modeling. Inspired by latent semantic analysis (LSA), an unsupervised method to extract latent topics from document corpora, we view the accumulated unsupervised text from a user as a document in the corpora. We employ latent Dirichlet-Tree allocation, a tree-based LSA, to extract the latent speakers in an unsupervised fashion. During speaker adaptation, a new speaker model is adapted via a linear interpolation of the latent speaker models. On an in-house evaluation, the proposed method reduces the word error rates by 1.4% compared to a well-tuned baseline with speaker-independent and speaker-dependent adaptation.

Sun-Ses3-P4:
Spoken Language Resources, Evaluation and Standardization I

Time:Sunday 16:00 Place:Faenza 2 - Pala Congressi (Passi Perduti-Gallery) Type:Poster
Chair:Sebastian Moeller

#1Measurement of Objective Intelligibility of Japanese Accented English Using ERJ (English Read by Japanese) Database

Nobuaki Minematsu (The Univ. of Tokyo)
Koji Okabe (NEC Corporation)
Keisuke Ogaki (The Univ. of Tokyo)
Keikichi Hirose (The Univ. of Tokyo)

In many schools, English is taught as an international communication tool and the goal of English pronunciation training is generally to acquire intelligible enough pronunciation. However, defining intelligible pronunciation is not easy because it depends on the language background of listeners. One kind of accented pronunciation, which is intelligible enough for some listeners, is often less intelligible for others. This paper focuses on the objective intelligibility of Japanese English through the ears of American English speakers with little exposure to Japanese English. A large listening test was conducted using the ERJ database. A balanced subset of this database was presented over a telephone line to the American listeners, who were asked to repeat what they heard. In total, 17,416 repetitive responses were collected and transcribed manually. This paper describes the design of this experiment and some results of analyzing the transcriptions.

#2From Single-Call to Multi-Call Quality: A Study on Long-term Quality Integration in Audio-Visual Speech Communication

Sebastian Möller (Quality and Usability Lab, TU Berlin, Germany)
Chihuy Bang (Quality and Usability Lab, TU Berlin, Germany)
Teele Tamme (Skype Labs, Skype, Tallinn, Estonia)
Markus Vaalgamaa (Skype Labs, Skype, Helsinki, Finland)
Benjamin Weiss (Quality and Usability Lab, TU Berlin, Germany)

Speech quality is commonly assumed to be the most important factor for the quality of a speech communication service and solution. However, little is known about how the quality experienced during individual calls forms the quality perception of an entire service or solution. Taking the example of an audio-visual IP-based communication solution, a long-term study is presented in which we analyze this relationship in a controlled setting. Results show temporal integration effects in the users’ response to time-varying quality levels and prove that simple averaging of call quality scores does not provide sufficiently accurate estimations of service quality.

#3Optimal Selection of Limited Vocabulary Speech Corpora

Hui Lin (University of Washington)
Jeff Bilmes (University of Washington)

We address the problem of finding a subset of a large speech data corpus that is useful for accurately and rapidly prototyping novel and computationally expensive speech recognition architectures. To solve this problem, we express it as an optimization problem over submodular functions. Quantities such as vocabulary size (or quality) of a set of utterances, or quality of a bundle of word types are submodular functions which make finding the optimal solutions possible. We, moreover, are able to express our approach using graph cuts leading to a very fast implementation even on large initial corpora. We show results on the Switchboard-I corpus, demonstrating improved results over previous techniques for this purpose. We also demonstrate the variety of the resulting corpora that may be produced using our method.
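The submodularity claim can be made concrete with a small check: the vocabulary size of a set of utterances shows diminishing returns, so adding an utterance to a small set gains at least as many new word types as adding it to a superset. The sketch below only illustrates that property; it does not reproduce the authors' objective or their graph-cut optimization, and the utterances and function names are invented for the example.

```python
def vocabulary_size(utterances):
    """Number of distinct word types in a list of whitespace-tokenized utterances."""
    vocab = set()
    for utterance in utterances:
        vocab.update(utterance.split())
    return len(vocab)


def marginal_gain(selected, candidate):
    """New word types contributed by adding candidate to the selected set."""
    return vocabulary_size(selected + [candidate]) - vocabulary_size(selected)


if __name__ == "__main__":
    small = ["call home"]
    large = ["call home", "call the office"]  # superset of small
    candidate = "call the bank"
    # Diminishing returns: the gain on the smaller set is never less
    print(marginal_gain(small, candidate), marginal_gain(large, candidate))
```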

#4Open Source Multi-Language Audio Database for Spoken Language Processing Applications

Stephen Zahorian (Binghamton University, Electrical & Computer Engineering Dept.)
Jiang Wu (Binghamton University, Electrical & Computer Engineering Dept.)
Montri Karnjanadecha (Binghamton University, Electrical & Computer Engineering Dept.)
Chandra Vootkuri (Binghamton University, Electrical & Computer Engineering Dept.)
Brian Wong (Binghamton University, Electrical & Computer Engineering Dept.)
Andrew Hwang (Binghamton University, Electrical & Computer Engineering Dept.)
Eldar Tokhtamyshev (Binghamton University, Electrical & Computer Engineering Dept.)

Over the past few decades, research in automatic speech recognition and automatic speaker recognition has been greatly facilitated by the sharing of large annotated speech databases such as those distributed by the Linguistic Data Consortium. Open sources, particularly web sites such as YouTube, contain vast and varied speech recordings in a variety of languages. These “open sources” for speech data are largely untapped as resources for speech research. In this paper, a project to collect, organize, and annotate a large group of this speech data is described. The data consists of approximately 25 hours of speech in each of three languages, English, Mandarin Chinese, and Russian. Each of the 900 recordings has been orthographically transcribed at the sentence/phrase level by human listeners. Some of the issues related to working with this low quality, varied, noisy speech data in three languages are described.

#5The USC CARE Corpus: Child-Psychologist Interactions of Children with Autism Spectrum Disorders

Matthew P. Black (Signal Analysis and Interpretation Laboratory (SAIL), Viterbi School of Engineering, University of Southern California, Los Angeles, CA, USA)
Daniel Bone (Signal Analysis and Interpretation Laboratory (SAIL), Viterbi School of Engineering, University of Southern California, Los Angeles, CA, USA)
Marian E. Williams (University Center for Excellence in Developmental Disabilities, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA)
Phillip Gorrindo (Medical Scientist Training Program, Vanderbilt University, Nashville, TN, USA)
Pat Levitt (Zilkha Neurogenetic Institute, Keck School of Medicine, University of Southern California, Los Angeles, CA, USA)
Shrikanth S. Narayanan (Signal Analysis and Interpretation Laboratory (SAIL), Viterbi School of Engineering, University of Southern California, Los Angeles, CA, USA)

We introduce the USC CARE Corpus, comprised of spontaneous and standardized child-psychologist interactions of children with a diagnosis of an autism spectrum disorder (ASD). The audio-video data is collected in the context of the Autism Diagnostic Observation Schedule (ADOS), which is a tool used by psychologists for a research-level diagnosis of ASD for children. The interaction consists of developmentally appropriate semi-structured social activities, providing the psychologist with a sample of behavior used to rate the child on a series of autism-relevant symptoms. Our goal with this multimodal corpus is to investigate how analytical technology (e.g., speech and language processing) can enhance this observational rating task and provide greater insight into social behavior and communication. We provide demographic statistics on the recruited children (60 to date), describe the multimodal recording set-up, and discuss current and future work for this novel corpus.

#6Towards A Versatile Multi-Layered Description of Speech Corpora Using Algebraic Relations

Nelly Barbot (IRISA - University Rennes 1)
Vincent Barreaud (IRISA - University Rennes 1)
Olivier Boeffard (IRISA - University Rennes 1)
Laure Charonnat (IRISA - University Rennes 1)
Arnaud Delhay (IRISA - University Rennes 1)
Sebastien Le Maguer (IRISA - University Rennes 1)
Damien Lolive (IRISA - University Rennes 1)

This paper presents a software library named Roots (for Rich Object Oriented Transcription System) that helps to describe spoken messages in a coherent manner, linking sequences of items on numerous levels (linguistic, phonological, or acoustic). The proposed representation is incremental and can thus describe any or all parts of an utterance. In order to link different levels of description, algebraic relations are used. Instead of relying solely on fixed, pre-determined relations, algebraic composition operators are proposed that can create a missing relation on demand. In terms of software architecture, object classes are defined based on well-grounded theoretical representations of speech (text, syntax, phonology and acoustics), without particular dependences on an annotation system (e.g. IPA is fully implemented). The API documentation for this software is available online.

#7Announcing the Electromagnetic Articulography (Day 1) Subset of the mngu0 Articulatory Corpus

Korin Richmond (Centre for Speech Technology Research, Edinburgh University)
Phil Hoole (Institut fuer Phonetik und Sprachverarbeitung, Ludwig-Maximilians-Universitaet)
Simon King (Centre for Speech Technology Research, Edinburgh University)

This paper serves as an initial announcement of the availability of a corpus of articulatory data called mngu0. This corpus will ultimately consist of a collection of multiple sources of articulatory data acquired from a single speaker: electromagnetic articulography (EMA), audio, video, volumetric MRI scans, and 3D scans of dental impressions. This data will be provided free for research use. In this first stage of the release, we are making available one subset of EMA data, consisting of more than 1,300 phonetically diverse utterances recorded with a Carstens AG500 electromagnetic articulograph. Distribution of mngu0 will be managed by a dedicated “forum-style” web site. This paper both outlines the general goals motivating the distribution of the data and the creation of the mngu0 web forum, and also provides a description of the EMA data contained in this initial release.

#8A Pitch Tracking Corpus with Evaluation on Multipitch Tracking Scenario

Gregor Pirker (Signal Processing and Speech Communication Laboratory, Graz University of Technology)
Michael Wohlmayr (Signal Processing and Speech Communication Laboratory, Graz University of Technology)
Stefan Petrik (Signal Processing and Speech Communication Laboratory, Graz University of Technology)
Franz Pernkopf (Signal Processing and Speech Communication Laboratory, Graz University of Technology)

In this paper, we introduce a novel pitch tracking database (PTDB) including ground truth signals obtained from a laryngograph. The database, referenced as PTDB-TUG, consists of 2342 phonetically rich sentences taken from the TIMIT corpus. Each sentence was recorded at least once by a male and a female native speaker. In total, the database contains 4720 recordings from 10 male and 10 female speakers. Furthermore, we evaluated two multipitch tracking systems on a subset of speakers to provide a benchmark for further research activities. The database can be downloaded at http://www.spsc.tugraz.at/tools.

#9On building and evaluating a broadcast-news audio segmentation system

Taras Butko (Technical University of Catalonia)
Climent Nadeu (Technical University of Catalonia)

Audio segmentation is useful in diverse applications like audio indexing and retrieval, subtitling, monitoring of acoustic scenes, etc. Also, an initial audio segmentation stage may help to improve the robustness of speech technologies like automatic speech recognition and speaker diarization. In this paper, firstly, the Albayzín-2010 audio segmentation evaluation is reported, including some conclusions drawn from the analysis of the set of eight submitted systems and their results. Then an audio segmentation system built in agreement with those conclusions is described and tested. Finally, by using the gained experience, the initial design of both the acoustic classes and the detection scoring rules is refined, aiming to obtain a more meaningful error rate measurement.

#10Time- and Acoustic-Mediated Alignment Algorithms for Speech Recognition Evaluation

Simon Dobrišek (Ljubljana University, Faculty of Electrical Engineering)
France Mihelič (Ljubljana University, Faculty of Electrical Engineering)

The paper investigates the time- and acoustic-mediated alignment algorithms that can be used for better speech recognition evaluation. The edit-cost function, which weights the cost of speech unit matches, substitutions, deletions and insertions, is defined as a function of timed symbols or even as a function of speech signal segments. The algorithms are compared using several classical statistical measures of different types that are derived from speech recognition confusion matrices and are normally used to measure agreement between different classifications of the same set of objects. These measures provide a reasonable indication that the investigated algorithms provide more relevant speech recognition error statistics than the algorithms that are commonly used for this purpose.
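The idea of an edit-cost function defined on timed symbols can be illustrated with a small dynamic-programming sketch. This is not the authors' algorithm: the cost values below (boundary time differences for matched pairs, a hypothetical 0.001 penalty for differing labels, segment duration for insertions and deletions) and the names `pair_cost`, `gap_cost` and `align` are assumptions chosen only for illustration.

```python
from typing import List, Optional, Tuple

Timed = Tuple[str, float, float]  # (label, start_time, end_time)


def pair_cost(ref: Timed, hyp: Timed) -> float:
    """Assumed cost of pairing two timed symbols: how far apart their
    boundaries are, plus a tiny penalty if the labels differ."""
    boundary_gap = abs(ref[1] - hyp[1]) + abs(ref[2] - hyp[2])
    return boundary_gap + (0.0 if ref[0] == hyp[0] else 0.001)


def gap_cost(sym: Timed) -> float:
    """Assumed cost of an insertion or deletion: the symbol's duration."""
    return sym[2] - sym[1]


def align(ref: List[Timed], hyp: List[Timed]) -> List[Tuple[str, Optional[str], Optional[str]]]:
    """Return the minimum-cost list of (operation, ref_label, hyp_label)."""
    n, m = len(ref), len(hyp)
    cost = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        cost[i][0] = cost[i - 1][0] + gap_cost(ref[i - 1])
    for j in range(1, m + 1):
        cost[0][j] = cost[0][j - 1] + gap_cost(hyp[j - 1])
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(
                cost[i - 1][j - 1] + pair_cost(ref[i - 1], hyp[j - 1]),
                cost[i - 1][j] + gap_cost(ref[i - 1]),
                cost[i][j - 1] + gap_cost(hyp[j - 1]),
            )
    # Trace back from the end, re-evaluating the same expressions
    # that were used to fill the table.
    ops = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + pair_cost(ref[i - 1], hyp[j - 1]):
            kind = "cor" if ref[i - 1][0] == hyp[j - 1][0] else "sub"
            ops.append((kind, ref[i - 1][0], hyp[j - 1][0]))
            i, j = i - 1, j - 1
        elif i > 0 and cost[i][j] == cost[i - 1][j] + gap_cost(ref[i - 1]):
            ops.append(("del", ref[i - 1][0], None))
            i -= 1
        else:
            ops.append(("ins", None, hyp[j - 1][0]))
            j -= 1
    ops.reverse()
    return ops
```

With a reference `a b c` and a hypothesis that contains only `a` and a slightly shifted `c`, this sketch marks `b` as deleted rather than substituted, because pairing `b` with the distant `c` would cost more than deleting it.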

#11Effects of Shortening Speech Prompts of In-Car Voice User Interfaces on Users' Mental Models

Julia Niemann (Deutsche Telekom Laboratories)
Kati Schulz (Deutsche Telekom Laboratories)
Ina Wechsung (Deutsche Telekom Laboratories)

Shortening speech prompts is useful to reduce the tendency of drivers to allocate attention towards the display. However, it is so far unsettled whether shortened speech still supports a good mental model for users. A lab experiment was conducted. The effects of reducing the time effort of speech were evaluated via a transfer task, retrieval tasks and navigation-orientation tasks for three different strategies: (1) using sounds (earcons) for menu orientation (land marking), (2) using command-based speech for interaction options, and (3) using uptempo speech for content-based information. It was observed that earcons do not impair navigation-orientation performance. Command-based speech leads to even better retrieval performance than the sentence-based representation of interaction. Only uptempo speech decreased retrieval performance.

#12Speech Transcript Evaluation for Information Retrieval

Laurens van der Werff (University of Twente)
Wessel Kraaij (Radboud University Nijmegen)
Franciska de Jong (University of Twente)

Speech recognition transcripts are being used in various fields of research and practical applications, putting various demands on their accuracy. Traditionally, ASR research has used intrinsic evaluation measures such as word error rate to determine transcript quality. In non-dictation-type applications such as speech retrieval, it is better to use extrinsic (or task-specific) measures. Indexation and the associated processing may eliminate certain errors, whereas the search query may reveal others. In this work, we argue that the standard extrinsic speech retrieval measure, average precision, is impractical for ASR evaluation. As an alternative we propose the use of ranked correlation measures on the output of the speech retrieval task, with the goal of predicting relative mean average precision. The measures we used showed a reasonably high correlation with average precision, but require much less human effort to calculate and can be more easily deployed in a variety of real-life settings.
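As an illustration of a rank correlation measure on retrieval output, the sketch below computes Kendall's tau between two rankings of the same documents, for example one obtained from a reference transcript and one from an ASR transcript. The abstract does not say which correlation measures were used, so the choice of Kendall's tau and the function name `kendall_tau` are assumptions.

```python
from itertools import combinations
from typing import Hashable, Sequence


def kendall_tau(rank_a: Sequence[Hashable], rank_b: Sequence[Hashable]) -> float:
    """Kendall's tau-a between two rankings of the same items (no ties).

    Returns 1.0 for identical orderings and -1.0 for reversed ones.
    """
    pos_a = {doc: i for i, doc in enumerate(rank_a)}
    pos_b = {doc: i for i, doc in enumerate(rank_b)}
    if set(pos_a) != set(pos_b) or len(pos_a) < 2:
        raise ValueError("rankings must contain the same two or more items")
    concordant = discordant = 0
    for x, y in combinations(rank_a, 2):
        # A pair is concordant when both rankings order it the same way.
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) > 0:
            concordant += 1
        else:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)
```

No relevance judgements are needed to compute such a score, which is consistent with the abstract's point about reduced human effort.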

#13The Albayzin 2010 Language Recognition Evaluation

Luis Javier Rodriguez-Fuentes (University of the Basque Country)
Mikel Penagarikano (University of the Basque Country)
Amparo Varona (University of the Basque Country)
Mireia Diez (University of the Basque Country)
German Bordel (University of the Basque Country)

The Albayzin 2010 Language Recognition Evaluation (LRE) was the second effort made by the Spanish/Portuguese community for benchmarking language recognition technology. Like the Albayzin 2008 LRE, it was coordinated by the Software Technology Working Group of the University of the Basque Country, with the support of the Spanish Thematic Network on Speech Technology. A speech database was created for system development and evaluation. Speech signals were recorded from TV broadcasts, including clean and noisy speech. The task consisted in deciding whether or not a target language was spoken in a test utterance, and involved 6 target languages: English, Portuguese, Basque, Catalan, Galician and Spanish; other (Out-Of-Set) languages were also recorded to allow open-set verification tests. This paper presents the main features of the evaluation, analyses system performance under different conditions, including the confusion among languages, and gives hints for future evaluations.

#14Progress and Prospects for Speech Technology: Results from Three Sexennial Surveys

Roger Moore (University of Sheffield)

In 1997, and again in 2003, the author was invited to conduct a survey at the IEEE workshop on Automatic Speech Recognition and Understanding (ASRU) in which attendees were offered a set of statements about putative future events relating to progress in various aspects of speech technology R&D. The task of the respondents was to assign a date to each possible event. The 1997 and 2003 results were published at INTERSPEECH 2005 in Lisbon. Six years later, the author was invited by the organisers of ASRU09 to repeat the survey for a third time, and this paper presents the combined results from all three surveys (1997, 2003 and 2009). The overall conclusion is that, over the twelve-year period, progress is perceived as slow, and the future appears to be generally no nearer than it has been in the past. However, on a positive note, the survey confirmed that the market for speech technology applications on mobile devices would be highly attractive over the next ten or so years.

#15Painless WFST cascade construction for LVCSR - Transducersaurus

Josef Robert Novak (Graduate School of Information Science and Technology, The University of Tokyo)
Nobuaki Minematsu (Graduate School of Information Science and Technology, The University of Tokyo)
Keikichi Hirose (Graduate School of Information Science and Technology, The University of Tokyo)

This paper introduces the Transducersaurus toolkit which provides a set of classes for generating each of the fundamental components of a typical WFST ASR cascade, including a Context-dependency transducer, a Lexicon, a stochastic language model and an optional silence class model. The toolkit further implements a simple scripting language in order to facilitate the construction of cascades with a variety of popular combination and optimization methods and provides integrated support for the TCubed and Juicer WFST decoders, and both Sphinx and HTK format acoustic models. New results for two standard WSJ tasks are also provided, comparing a variety of cascade construction and optimization algorithms. These results illustrate the flexibility of the toolkit as well as the tradeoffs inherent in various build algorithms.

Sun-Ses3-S1-P:
Crowdsourcing for Speech Processing II

Time:Sunday 17:00 Place:Caravaggio (Adua 1) - Pala Affari - 1st Floor Type:Poster
Chairs:Maxine Eskenazi, David Suendermann, Gina-Anne Levow

#1A Transcription Task for Crowdsourcing with Automatic Quality Control

Chia-ying Lee (MIT Computer Science and Artificial Intelligence Laboratory)
James Glass (MIT Computer Science and Artificial Intelligence Laboratory)

In this paper, we propose a two-stage transcription task design for crowdsourcing with an automatic quality control mechanism embedded in each stage. For the first stage, a support vector machine (SVM) classifier is utilized to quickly filter poor quality transcripts based on acoustic cues and language patterns in the transcript. In the second stage, word-level confidence scores are used to estimate transcription quality and provide instantaneous feedback to the transcriber. The proposed design was evaluated using Amazon Mechanical Turk (MTurk) and tested on seven hours of academic lecture speech, which is typically conversational in nature and contains technical material. Compared to baseline transcripts which were also collected from MTurk using a ROVER-based method, we observed that the new method resulted in higher quality transcripts while requiring less transcriber effort.

#2Reliability-Weighted Acoustic Model Adaptation Using Crowd Sourced Transcriptions

Kartik Audhkhasi (University of Southern California)
Panayiotis G. Georgiou (University of Southern California)
Shrikanth S. Narayanan (University of Southern California)

This paper focuses on adaptation of acoustic models using speech transcribed by multiple noisy experts. A simple approach involves combining multiple transcripts using word frequency based Recognizer Output Voting Error Reduction (ROVER) followed by adaptation using the combined transcripts. But this assumes that the transcripts being combined are equally reliable. To overcome this assumption, we use two sets of scores to estimate this reliability. The first set is based on answers to some questions given by the transcribers. The second set is derived in an unsupervised way using the word frequency based ROVER transcripts and baseline acoustic models. The overall confidence is a convex combination of these scores and is used to perform a confidence weighted fusion. We adapt the baseline acoustic models using these combined transcripts. Recognition results for a Mexican Spanish ASR system show an absolute improvement of 0.5% in word error rate and 0.9% in sentence error rate.
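The confidence-weighted fusion described above can be sketched as follows. This is a simplified illustration, not the authors' system: it assumes the transcripts are already aligned slot by slot (in ROVER this alignment is part of the procedure), that both reliability scores lie in [0, 1], and that the mixing weight `alpha` is a free parameter. The names `overall_confidence` and `weighted_vote` are hypothetical.

```python
from collections import defaultdict
from typing import List, Sequence


def overall_confidence(question_score: float, unsupervised_score: float,
                       alpha: float = 0.5) -> float:
    """Convex combination of two reliability scores (alpha in [0, 1])."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return alpha * question_score + (1.0 - alpha) * unsupervised_score


def weighted_vote(aligned_words: Sequence[Sequence[str]],
                  confidences: Sequence[float]) -> List[str]:
    """Pick, for each aligned slot, the word with the largest summed confidence.

    aligned_words[t][k] is the word transcriber t wrote in slot k.
    """
    fused = []
    for slot in zip(*aligned_words):
        votes = defaultdict(float)
        for word, conf in zip(slot, confidences):
            votes[word] += conf
        fused.append(max(votes, key=votes.get))
    return fused
```

Unlike plain word-frequency voting, one reliable transcriber can outweigh two unreliable ones here, which is the situation the equal-reliability assumption fails to handle.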

#3Crowdsourcing for word recognition in noise

Martin Cooke (Ikerbasque (Basque Science Foundation), Spain)
Jon Barker (Department of Computer Science, University of Sheffield, UK)
Maria Luisa Garcia Lecumberri (Language and Speech Laboratory, University of the Basque Country, Spain)
Krzysztof Wasilewski (Department of Computer Science, University of Sheffield, UK)

Access to large samples of listeners is an appealing prospect for speech perception researchers, but lack of control over key factors such as listeners' linguistic backgrounds and quality of stimulus delivery is a formidable barrier to the application of crowdsourcing. We describe the outcome of a web-based listening experiment designed to discover consistent confusions amongst words presented in noise, alongside an identical task carried out using traditional laboratory methods. Web listeners were graded based on information they provided as well as via their responses to tokens recognised robustly by a majority of participants. While overall word identification scores even for the best-performing web subset were well below those obtained in the laboratory, word confusions with high levels of cross-listener agreement were obtained nevertheless, suggesting that focused application of crowdsourcing in speech perception can provide useful data for scientific analysis.

#4Crowdsourcing preference tests, and how to detect cheating

Sabine Buchholz (Toshiba Research Europe Ltd., Cambridge Research Lab, Cambridge, UK)
Javier Latorre (Toshiba Research Europe Ltd., Cambridge Research Lab, Cambridge, UK)

We describe an approach to crowdsource the evaluation of TTS systems by preference tests and report on lessons learnt from running 127 real-life crowdsourced tests. We show that at least one type of cheating becomes more prevalent over time if left unchecked and develop metrics to exclude cheats. We demonstrate that their exclusion improves test outcomes.

#5Growing a Spoken Language Interface on Amazon Mechanical Turk

Ian McGraw (MIT)
James Glass (MIT)
Stephanie Seneff (MIT)

Typically data collection, transcription, language model generation, and deployment are separate phases of creating a spoken language interface. An unfortunate consequence of this is that the recognizer usually remains a static element of systems often deployed in dynamic environments. By providing an API for human intelligence, Amazon Mechanical Turk changes the way system developers can construct spoken language systems. In this work, we describe an architecture that automates and connects these four phases, effectively allowing the developer to grow a spoken language interface. In particular, we show that a human-in-the-loop programming paradigm, in which workers transcribe utterances behind the scenes, can alleviate the need for expert guidance in language model construction. We demonstrate the utility of these organic language models in a voice-search interface for photographs.

#6Real user evaluation of spoken dialogue systems using Amazon Mechanical Turk

Filip Jurčíček (Engineering Department, Cambridge University)
Simon Keizer (Engineering Department, Cambridge University)
Milica Gasic (Engineering Department, Cambridge University)
Francois Mairesse (Engineering Department, Cambridge University)
Blaise Thomson (Engineering Department, Cambridge University)
Kai Yu (Engineering Department, Cambridge University)
Steve Young (Engineering Department, Cambridge University)

This paper describes a framework for evaluation of spoken dialogue systems. Typically, evaluation of dialogue systems is performed in a controlled test environment with carefully selected and instructed users. However, this approach is very demanding. An alternative is to recruit a large group of users who evaluate the dialogue systems in a remote setting under virtually no supervision. Crowdsourcing technology, for example Amazon Mechanical Turk (AMT), provides an efficient way of recruiting subjects. This paper describes an evaluation framework for spoken dialogue systems using AMT users and compares the obtained results with a recent trial in which the systems were tested by locally recruited users. The results suggest that the use of crowdsourcing technology is feasible and it can provide reliable results.

#7Quality assessment of crowdsourcing transcriptions for African languages

Hadrien Gelas (Laboratoire Dynamique Du Langage, CNRS - Université de Lyon, France and Laboratoire Informatique de Grenoble, CNRS - Université Joseph Fourier Grenoble, France)
Solomon Teferra Abate (Laboratoire Informatique de Grenoble, CNRS - Université Joseph Fourier Grenoble, France)
Laurent Besacier (Laboratoire Informatique de Grenoble, CNRS - Université Joseph Fourier Grenoble, France)
François Pellegrino (Laboratoire Dynamique Du Langage, CNRS - Université de Lyon, France)

We evaluate the quality of speech transcriptions acquired by crowdsourcing to develop ASR acoustic models (AM) for under-resourced languages. We have developed AMs using reference (REF) transcriptions and transcriptions from crowdsourcing (TRK) for Swahili and Amharic. While the Amharic transcription was much slower to complete than that of Swahili, the speech recognition systems developed using REF and TRK transcriptions have very similar word recognition error rates (40.1 vs 39.6 for Amharic and 38.0 vs 38.5 for Swahili). Moreover, the character-level disagreement rates between REF and TRK are only 3.3% and 6.1% for Amharic and Swahili, respectively. We conclude that it is possible to acquire quality transcriptions from the crowd for under-resourced languages using Amazon's Mechanical Turk. Recognizing this potential, we also raise some legal and ethical issues to consider.

#8Using crowdsourcing to provide prosodic annotations for non-native speech

Keelan Evanini (Educational Testing Service)
Klaus Zechner (Educational Testing Service)

We present the results of an experiment in which 2 expert and 11 naive annotators provided prosodic annotations for stress and boundary tones on a corpus of spontaneous speech produced by non-native speakers of English. The results show that agreement rates were higher for boundary tones than for stress. In addition, a crowdsourcing approach was implemented to combine the naive annotations to increase accuracy. The crowdsourcing approach was able to match expert agreement for stress (62.1%) with 3 naive annotators, and come within 7.2% of expert agreement for boundary tones (82.4%) with 11 naive annotators. This experiment also demonstrates that noticeable improvements in naive annotations can be obtained with a small amount of additional training.

#9PodCastle: Recent Advances of a Spoken Document Retrieval Service Improved by Anonymous User Contributions

Masataka Goto (National Institute of Advanced Industrial Science and Technology (AIST))
Jun Ogata (National Institute of Advanced Industrial Science and Technology (AIST))

In this paper, we introduce recent advances of a speech retrieval web service, PodCastle, that collects and amplifies voluntary contributions by anonymous users. Our goal is to provide users with a public web service based on speech recognition and crowdsourcing so that they can experience state-of-the-art speech recognition performance through a useful service. PodCastle enables users to find speech data (such as podcasts and YouTube video clips) that include a search term, read full texts of their recognition results, and easily correct recognition errors by simply selecting from a list of candidates. The resulting corrections were used to improve both speech retrieval and recognition performance. In our experience from its practical use over the past four years (since December 2006), over half a million recognition errors in about one hundred thousand speech data were corrected by anonymous users, and we confirmed that the speech recognition performance of PodCastle was actually improved by those corrections.

Mon-Ses1-O1:
Speaker Recognition - Modeling, Automatic Procedures, Analysis II

Time:Monday 10:00 Place:Auditorium - Pala Congressi Type:Oral
Chair:Kornel Laskowski

10:00Data-driven Gaussian Component Selection for Fast GMM-Based Speaker Verification

Ce Zhang (Digital Content Technology Research Center, Institute of Automation, Chinese Academy of Sciences)
Rong Zheng (Digital Content Technology Research Center, Institute of Automation, Chinese Academy of Sciences)
Bo Xu (Digital Content Technology Research Center, Institute of Automation, Chinese Academy of Sciences)

In this paper, a fast likelihood calculation for Gaussian mixture models (GMM) is presented, which divides the acoustic space into disjoint subsets and then assigns the most relevant Gaussians to each of them. A data-driven approach is explored to select Gaussian components, which guarantees that the loss brought by pre-discarding the least useful Gaussians can be easily controlled by a manually set parameter. To avoid rapid growth of the index table size, a two-level index scheme is proposed. We adjust several sets of parameters to validate our work, which is expected to speed up the computation while maintaining the performance. The results of the experiments on the female part of the telephone condition of NIST SRE 2006 indicate that the speed can be improved up to 5 times over the GMM-UBM baseline system without performance loss.

10:20Analysis of i-vector Length Normalization in Speaker Recognition Systems

Daniel Garcia-Romero (Department of Electrical and Computer Engineering, University of Maryland, College Park, MD)
Carol Y. Espy-Wilson (Department of Electrical and Computer Engineering, University of Maryland, College Park, MD)

We present a method to boost the performance of probabilistic generative models that work with i-vector representations. The proposed approach deals with the non-Gaussian behavior of i-vectors by performing a simple length normalization. This non-linear transformation allows the use of probabilistic models with Gaussian assumptions that yield equivalent performance to that of more complicated systems based on Heavy-Tailed assumptions. Significant performance improvements are demonstrated on the telephone portion of NIST SRE 2010.
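The length normalization step itself amounts to scaling each vector to unit Euclidean norm. A minimal sketch, assuming i-vectors are given as plain lists of floats (the function name `length_normalize` is hypothetical, and any other preprocessing a full system might apply is omitted):

```python
import math
from typing import List, Sequence


def length_normalize(ivector: Sequence[float]) -> List[float]:
    """Scale a vector to unit Euclidean length, keeping its direction."""
    norm = math.sqrt(sum(x * x for x in ivector))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in ivector]
```

The transformation is non-linear because the scaling factor depends on the vector itself: all vectors end up on the unit sphere regardless of their original length.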

10:40An Analysis Framework based on Random Subspace Sampling for Speaker Verification

Weiwu Jiang (Department of Systems Engineering & Engineering Management, The Chinese University of Hong Kong, Hong Kong S.A.R, China)
Zhifeng Li (Department of Information Engineering, The Chinese University of Hong Kong, Hong Kong S.A.R, China)
Helen Meng (Department of Systems Engineering & Engineering Management, The Chinese University of Hong Kong, Hong Kong S.A.R, China)

Using Joint Factor Analysis (JFA) supervector for subspace analysis has many problems, such as high processing complexity and overfitting. We propose an analysis framework based on random subspace sampling to address these problems. In this framework, JFA supervectors are first partitioned equally and each partitioned subvector is projected on to a subspace by PCA. All projected subvectors are then concatenated and PCA is applied again to reduce the dimension by projection onto a low-dimensional feature space. Finally, we randomly sample this feature space and build classifiers for the sampled features. The classifiers are fused to produce the final classification output. Experiments on NIST SRE 2008 corpora demonstrate the effectiveness of the proposed framework.

11:00Factor analysis back ends for MLLR transforms in speaker recognition

Nicolas Scheffer (SRI International)
Yun Lei (SRI International)
Luciana Ferrer (SRI International)

The purpose of this work is to show how recent developments in cepstral-based systems for speaker recognition can be leveraged for the use of Maximum Likelihood Linear Regression (MLLR) transforms. Speaker recognition systems based on MLLR transforms have been shown to be greatly beneficial in combination with standard systems, but most of the advances in speaker modeling techniques have been implemented for cepstral features. We show how these advances, based on Factor Analysis, such as eigenchannel and ivector, can be easily employed to achieve very high accuracy. We show that they outperform the current state-of-the-art MLLR-SVM system that SRI submitted during the NIST SRE 2010 evaluation. The advantages of leveraging the new approaches are manifold: the ability to process a large amount of data, working in a reduced dimensional space, importing any advances made for cepstral systems to the MLLR features, and the potential for system combination at the ivector level.

11:20Report on Performance Results in the NIST 2010 Speaker Recognition Evaluation

Craig S. Greenberg (National Institute of Standards and Technology)
Alvin F. Martin (National Institute of Standards and Technology)
Bradford N. Barr (National Institute of Standards and Technology)
George R. Doddington (Unaffiliated)

In the spring of 2010, the National Institute of Standards and Technology organized a Speaker Recognition Evaluation in which several factors believed to affect the performance of speaker recognition systems were explored. Among the factors considered in the evaluation were channel conditions, duration of training and test segments, number of training segments, and level of vocal effort. New cost function parameters emphasizing lower false alarm rates were used for two of the tests in the evaluation, and the reduction in false alarm rates exhibited by many of the systems suggests that the new measure may have helped to focus research on the low false alarm region of operation, which is important in many applications.

11:40iVector Fusion of Prosodic and Cepstral Features for Speaker Verification

Marcel Kockmann (Brno University of Technology)
Luciana Ferrer (SRI International)
Lukas Burget (Brno University of Technology)
Jan Cernocky (Brno University of Technology)

In this paper we apply the promising iVector extraction technique followed by PLDA modeling to simple prosodic contour features. With this procedure we achieve results comparable to a system that models much more complex prosodic features using our recently proposed SMM-based iVector modeling technique. We then propose a combination of both prosodic iVectors by joint PLDA modeling that leads to significant improvements over individual systems with an EER of 5.4% on NIST SRE 2008 telephone data. Finally, we can combine these two prosodic iVector front ends with a baseline cepstral iVector system to achieve up to 21% relative reduction in new DCF.

Mon-Ses1-O3:
Acoustic Event Detection

Time:Monday 10:00 Place:Brunelleschi (Green Room) - Pala Congressi - 2nd Floor Type:Oral
Chair:Dirk van Compernolle

10:00Learning new acoustic events in an HMM-based system using MAP adaptation

Jürgen Thomas Geiger (Institute for Human-Machine Communication, Technische Universität München, 80290 Munich, Germany)
Mohamed Anouar Lakhal (Institute for Human-Machine Communication, Technische Universität München, 80290 Munich, Germany)
Björn Schuller (Institute for Human-Machine Communication, Technische Universität München, 80290 Munich, Germany)
Gerhard Rigoll (Institute for Human-Machine Communication, Technische Universität München, 80290 Munich, Germany)

In this paper, we present a system for the recognition of acoustic events suited for a robotic application. HMMs are used to model different acoustic event classes. We are especially looking at the open-set case, where a class of acoustic events occurs that was not included in the training phase. We evaluate how newly occurring classes can be learnt using MAP adaptation or conventional training methods. A small database of acoustic events was recorded with a robotic platform to perform the experiments.

10:20Alternative Frequency Scale Cepstral Coefficient for Robust Sound Event Recognition

Yiren Leng (Institute for Infocomm Research, A*STAR, Singapore)
Huy Dat Tran (Institute for Infocomm Research, A*STAR, Singapore)
Norihide Kitaoka (Nagoya University, Japan)
Haizhou Li (Institute for Infocomm Research, A*STAR, Singapore)

There are two issues when applying MFCCs for sound event recognition: 1) sound events have a broader spectral range than speech, thus the log-frequency scale is less informative; 2) low-frequency noise is more prevalent, thus the log-frequency scale captures more noise. To address these issues, we study two alternative frequency scales and show that they outperform MFCCs for sound event recognition under mismatch conditions using SVMs without the need for complex algorithms.

10:40Evaluation of Abnormal Sound Detection using Multi-stage GMM in Various Environments

Akinori Ito (Graduate School of Engineering, Tohoku University)
Akihito Aiba (Graduate School of Engineering, Tohoku University)
Masashi Ito (Tohoku Institute of Technology)
Shozo Makino (Tohoku Bunka Gakuen University)

We have been developing a method to automatically detect incidents by detecting abnormal sound events in audio signals recorded in real environments. The proposed method uses multi-stage Gaussian Mixture Models (GMM), which learn rare sounds using multiple GMMs. In this work, we investigated the relationship between the sound environment and detection performance, and we found that the performance deteriorates in noisy environments. The performance largely depended on the signal-to-noise (SN) ratio of the abnormal sounds. Next, we investigated methods for determining the hyperparameters of the multi-stage GMM, which comprise the intermediate thresholds, the number of mixture components of the GMMs and the detection threshold. From the experimental results, the combination of percentile-based threshold determination and Bayesian information criterion (BIC)-based mixture determination was most effective. However, when using the automatically determined parameters, the detection performance deteriorated by up to 20%.

11:00Unsupervised learning of acoustic events using dynamic time warping and hierarchical K-means++ clustering

Joerg Schmalenstroeer (Department of Communications Engineering, University of Paderborn, Germany)
Markus Bartek (Department of Communications Engineering, University of Paderborn, Germany)
Reinhold Haeb-Umbach (Department of Communications Engineering, University of Paderborn, Germany)

In this paper we propose to jointly consider Segmental Dynamic Time Warping and distance clustering for the unsupervised learning of acoustic events. As a result, the computational complexity increases only linearly with the database size compared to a quadratic increase in a sequential setup, where all pairwise SDTW distances between segments are computed prior to clustering. Further, we discuss options for seed value selection for clustering and show that drawing seeds with a probability proportional to the distance from the already drawn seeds, known as K-means++ clustering, results in a significantly higher probability of finding representatives of each of the underlying classes, compared to the commonly used draws from a uniform distribution. Experiments are performed on an acoustic event classification and an isolated digit recognition task, where on the latter the final word accuracy approaches that of supervised training.
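The seed selection rule can be sketched as below. This is an illustration under assumptions, not the authors' implementation: `distance` stands in for whatever segment distance is used (SDTW in the paper), and the `power` parameter is added here because the common K-means++ formulation weights candidates by the squared distance (`power=2.0`), whereas the abstract speaks of the distance itself (`power=1.0`).

```python
import random
from typing import Callable, List, Optional, Sequence, TypeVar

T = TypeVar("T")


def draw_seeds(items: Sequence[T], k: int,
               distance: Callable[[T, T], float],
               rng: Optional[random.Random] = None,
               power: float = 1.0) -> List[T]:
    """Draw k seeds; each new seed is sampled with probability proportional
    to (distance to the nearest already drawn seed) ** power."""
    rng = rng or random.Random()
    seeds = [rng.choice(items)]  # the first seed is drawn uniformly
    while len(seeds) < k:
        weights = [min(distance(x, s) for s in seeds) ** power for x in items]
        if sum(weights) == 0.0:
            raise ValueError("fewer than k distinct items")
        # Items already chosen have weight zero and cannot be drawn again.
        seeds.append(rng.choices(items, weights=weights, k=1)[0])
    return seeds
```

Note that each round needs only the distances from every item to the seeds drawn so far, never the full pairwise distance matrix.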

11:20Feature Extraction Assessment for an Acoustic-Event Classification Task using the Entropy Triangle

David Mejía-Navarrete (Universidad Carlos III de Madrid)
Ascensión Gallardo-Antolín (Universidad Carlos III de Madrid)
Carmen Peláez-Moreno (Universidad Carlos III de Madrid)
Francisco J. Valverde-Albacete (Universidad Carlos III de Madrid)

We assess the behaviour of five different feature extraction methods for an acoustic event classification task, all built on the same underlying SVM technology, by means of two different techniques: accuracy and the entropy triangle. The entropy triangle is able to find a classifier instance whose relatively high accuracy stems from an attempt to specialize in some classes to the detriment of the overall behaviour. In all other cases, those of fair classifiers, accuracy and the entropy triangle agree.

11:40Unsupervised Audio Analysis for Categorizing Heterogeneous Consumer Domain Videos

Pradeep Natarajan (Raytheon BBN Technologies)
Stavros Tsakalidis (Raytheon BBN Technologies)
Vasant Manohar (Raytheon BBN Technologies)
Rohit Prasad (Raytheon BBN Technologies)
Prem Natarajan (Raytheon BBN Technologies)

The ever increasing volume of consumer domain videos on the Internet has led to a surge in interest in automatically analyzing such content. The audio signal in these videos contains salient information, but applying current automatic speech recognition (ASR) techniques is not viable due to high variability, noise and multilingual content. We present two unsupervised techniques which do not rely on ASR to address these challenges. The first method involves learning an unsupervised codebook by clustering audio features, and the second involves directly matching low-level features using the pyramid match kernel (PMK). Experimental results on a ~200 hour audio corpus downloaded from YouTube show that both our approaches significantly outperform the traditional approach of first segmenting the audio stream to a set of mid-level classes (e.g. speech, non-speech, music, silence) and using the duration statistics of these classes to train high-level classifiers.

Mon-Ses1-O2:
Speech Production - Articulatory Measurements

Time:Monday 10:00 Place:Leonardo - Pala Affari - Ground Floor Type:Oral
Chair:Paavo Alku

10:00Visualization of vocal tract shape using interleaved real-time MRI of multiple scan planes

Yoon-Chul Kim (University of Southern California)
Michael I. Proctor (University of Southern California)
Shrikanth S. Narayanan (University of Southern California)
Krishna S. Nayak (University of Southern California)

Conventional real-time magnetic resonance imaging (RT-MRI) of the upper airway typically acquires information about the vocal tract from a single midsagittal scan plane. This provides insights into the dynamics of all articulators, but does not allow for visualization of several important features in vocal tract shaping, such as grooving/doming of the tongue, asymmetries in tongue shape, and lateral shaping of the pharyngeal airway. In this paper, we present an approach to RT-MRI of multiple scan planes of interest using time-interleaved acquisition, in which temporal resolution is compromised for greater spatial coverage. We demonstrate simultaneous visualization of vocal tract dynamics from midsagittal, coronal, and axial scan planes in the articulation of English fricatives.

10:20Biomechanical Tongue Models: An Approach to Studying Inter-speaker Variability

Ralf Winkler (ZAS, Berlin, Germany)
Susanne Fuchs (ZAS, Berlin, Germany)
Pascal Perrier (DPC/GIPSA-lab, Grenoble-INP, CNRS, Grenoble, France)
Mark Tiede (Haskins Labs, New Haven, CT, USA and R.L.E.-MIT, Boston, MA, USA)

Speakers of a given language vary with respect to their acoustics, articulation, and motor commands. This variation is driven by a variety of influences, such as emotional states, communicative interaction, and individual properties of the vocal tract. In this work we focus on the latter. First, we build speaker-specific biomechanical tongue models. Second, we discuss the impact of the relative position of the bending in the vocal tract on the basis of extensive simulations with two different models. We focus on /i,a,u/ by defining target regions in the acoustic space, and discuss the corresponding speaker-specific articulatory and motor command variability observed.

10:40Quantifying Articulatory Distinctiveness of Vowels

Jun Wang (University of Nebraska - Lincoln)
Jordan R. Green (University of Nebraska - Lincoln)
Ashok Samal (University of Nebraska - Lincoln)
David B. Marx (University of Nebraska - Lincoln)

The articulatory distinctiveness among vowels has been frequently characterized descriptively based on tongue height and front-back position; however, very few empirical methods have been proposed to characterize vowels based on time-varying articulatory characteristics. Such information is not only needed to improve knowledge about the articulation of vowels but also to determine the contribution of articulatory imprecision to poor speech intelligibility. In this paper, a novel statistical shape analysis was used to derive a vowel space that depicted the quantified articulatory distinctiveness among vowels based on tongue and lip movements. The effectiveness of the approach was supported by vowel classification accuracy of up to 91.7%. The theoretical relevance and clinical implication of the derived vowel space were discussed.

11:00Direct Estimation of Articulatory Kinematics from Real-time Magnetic Resonance Image Sequences

Michael Proctor (University of Southern California)
Adam Lammert (University of Southern California)
Athanasios Katsamanis (University of Southern California)
Louis Goldstein (University of Southern California)
Christina Hagedorn (University of Southern California)
Shrikanth Narayanan (University of Southern California)

A method of rapid, automatic extraction of consonantal articulatory trajectories from real-time magnetic resonance image sequences is described. Constriction location targets are estimated by identifying regions of maximally-dynamic correlated pixel activity along the palate, the alveolar ridge, and at the lips. Tissue movement into and out of the constriction location is estimated by calculating the change in mean pixel intensity in a circle located at the center of the region of interest. Closure and release gesture timings are estimated from landmarks in the velocity profile derived from the smoothed intensity function. We demonstrate the utility of the technique in the analysis of Italian intervocalic consonant production.
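
The intensity-based part of this procedure can be sketched roughly as below. This is a hedged illustration rather than the authors' method: the frame representation (a list of pixel rows), the moving-average smoother and all function names are assumptions, and the landmark detection on the velocity profile is not shown.

```python
def mean_intensity_in_circle(frame, cx, cy, radius):
    """Mean of the pixels whose coordinates lie within radius of (cx, cy).
    frame is a list of rows; frame[y][x] is a pixel intensity."""
    values = [
        frame[y][x]
        for y in range(len(frame))
        for x in range(len(frame[y]))
        if (x - cx) ** 2 + (y - cy) ** 2 <= radius ** 2
    ]
    if not values:
        raise ValueError("circle does not cover any pixel")
    return sum(values) / len(values)

def moving_average(series, width=3):
    """Centred moving average; the window shrinks at the edges."""
    half = width // 2
    out = []
    for i in range(len(series)):
        window = series[max(0, i - half): i + half + 1]
        out.append(sum(window) / len(window))
    return out

def velocity(series, frame_rate):
    """First difference of the series, scaled to units per second."""
    return [(b - a) * frame_rate for a, b in zip(series, series[1:])]
```

Applying `mean_intensity_in_circle` to every frame of a sequence gives an intensity function over time; smoothing it and differencing it yields a velocity profile of the kind the abstract refers to.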

11:20Combined optical distance sensing and electropalatography to measure articulation

Peter Birkholz (Clinic for Phoniatrics, Pedaudiology, and Communication Disorders, University Hospital Aachen and RWTH Aachen University)
Christiane Neuschaefer-Rube (Clinic for Phoniatrics, Pedaudiology, and Communication Disorders, University Hospital Aachen and RWTH Aachen University)

We present the first prototype of a new optoelectronic instrument for the combined real-time measurement of the tongue contour in the mid-sagittal plane, the contact pattern between the tongue and the palate, and the position of the lips. The instrument consists of a thin acrylic pseudopalate with embedded contact sensors, as for electropalatography, and optical distance sensors to measure tongue-palate distances, as for glossometry. One additional distance sensor is located at the anterior side of the upper incisors to register the degree of opening and protrusion of the lips. Together, the sensors provide complementary information about the articulation of vowels and consonants, which was verified in initial experiments. The instrument offers new perspectives for the study of normal and disordered speech production, as well as for silent speech interfaces and speech prostheses for laryngectomees.

11:40Simulating Post-L F0 Bouncing by Modeling Articulatory Dynamics

Santitham Prom-on (University College London)
Yi Xu (University College London)
Fang Liu (Stanford University)

Post-L F0 bouncing (post-L bouncing for short) is a prosodic phenomenon whereby F0 is temporarily raised following a very low pitch. The phenomenon is quite robust, but is not widely known, and it has never been computationally modeled. This paper presents the results of our simulation of the phenomenon by modeling articulatory dynamics. Using the quantitative Target Approximation (qTA) model, we were able to simulate the F0 rise after the Mandarin L tone by adding an acceleration adjustment to the initial state of the first post-L Neutral tone. Furthermore, a linear relationship was found between the added acceleration and the amount of F0 lowering in the L tone. We interpreted the results as evidence that post-L bouncing is directly related to the articulatory mechanism of producing a very low pitch.

Mon-Ses1-O4:
Speech Synthesis - Unit Selection and Hybrid approaches

Time:Monday 10:00 Place:Michelangelo - Pala Affari - 2nd Floor Type:Oral
Chair:Junichi Yamagishi

10:00Enriching text-to-speech synthesis using automatic dialog act tags

Vivek Kumar Rangarajan Sridhar (AT&T Labs - Research)
Alistair Conkie (AT&T Labs - Research)
Ann Syrdal (AT&T Labs - Research)
Srinivas Bangalore (AT&T Labs - Research)

We present an approach for enriching dialog based text-to-speech (TTS) synthesis systems by explicitly controlling the expressiveness through the use of dialog act tags. The dialog act tags in our framework are automatically obtained by training a maximum entropy classifier on the Switchboard-DAMSL data set, unrelated to the TTS database. We compare the voice quality produced by exploiting automatic dialog act tags with that using human annotations of dialog acts, and with two forms of reference databases. Even though the inventory of tags is different for the automatic tagger and human annotation, exploiting either form of dialog markup generates better voice quality in comparison with the reference voices in subjective evaluation.

10:20Joint Target and Join Cost Weight Training for Unit Selection Synthesis

Lukas Latacz (Vrije Universiteit Brussel)
Wesley Mattheyses (Vrije Universiteit Brussel)
Werner Verhelst (Vrije Universiteit Brussel)

One of the key challenges of optimizing a unit selection voice is obtaining suitable target and join cost weights. In this paper we investigate several strategies to train these weights automatically. Two training algorithms are tested, both based on an acoustic distance that approximates human perception: a modified version of the well-known linear regression training and an iterative algorithm that tries to minimize a selection error. Since a single, global set of weights might not always select the best sequence of units, we investigate whether using multiple weight sets could improve the synthesis quality.

10:40Prominence-Based Prosody Prediction for Unit Selection Speech Synthesis

Andreas Windmann (Faculty of Linguistics and Literature, Bielefeld University, Germany)
Igor Jauk (Faculty of Technology, Bielefeld University, Germany)
Fabio Tamburini (Department of Linguistics and Oriental Studies, University of Bologna, Italy)
Petra Wagner (Faculty of Linguistics and Literature, Bielefeld University, Germany)

This paper describes the development and evaluation of a prosody prediction module for unit selection speech synthesis that is based on the notion of perceptual prominence. We outline the design principles of the module and describe its implementation in the Bonn Open Synthesis System (BOSS). Moreover, we report results of perception experiments that have been conducted in order to evaluate prominence prediction. The paper is concluded by a general discussion of the approach and a sketch of perspectives for further work.

11:00Evaluating the meaning of synthesized listener vocalizations

Sathish Pammi (DFKI GmbH)
Marc Schröder (DFKI GmbH)

Spoken and multimodal dialogue systems start to use listener vocalizations for more natural interaction. In a unit selection framework, using a finite set of recorded listener vocalizations, synthesis quality is high but the acoustic variability is limited. As a result, many combinations of segmental form and intended meaning cannot be synthesized. This paper presents an algorithm in the unit selection domain for increasing the range of vocalizations that can be synthesized with a given set of recordings. We investigate whether the approach makes the synthesized vocalizations convey a meaning closer to the intended meaning, using a pairwise comparison perception test. The results partially confirm the hypothesis, indicating that in many cases, the algorithm makes available more appropriate alternatives to the available set of recorded listener vocalizations.

11:20A Hybrid TTS Approach for Prosody and Acoustic Modules

Iñaki Sainz (Aholab Signal Processing Laboratory, University of the Basque Country)
Daniel Erro (Aholab Signal Processing Laboratory, University of the Basque Country)
Eva Navas (Aholab Signal Processing Laboratory, University of the Basque Country)
Inma Hernáez (Aholab Signal Processing Laboratory, University of the Basque Country)

Unit selection (US) TTS systems generate quite natural speech, but its quality is highly variable. Statistical parametric (SP) systems offer far more consistent quality but reduced naturalness due to their vocoding nature. We present a hybrid approach (HA) that tries to improve overall naturalness by combining both synthesis methods. Contrary to other works, the fusion of methods is performed in both the prosody and acoustic modules, yielding a more robust prosody prediction and achieving greater naturalness. Objective and subjective experiments show the validity of our procedure.

11:40Uniform Speech Parameterization for Multi-form Segment Synthesis

Alexander Sorin (Speech Technologies, IBM Haifa Research Lab, Haifa, Israel)
Slava Shechtman (Speech Technologies, IBM Haifa Research Lab, Haifa, Israel)
Vincent Pollet (Text-To-Speech Research, Nuance Communications, Merelbeke, Belgium)

In multi-form segment synthesis, speech is constructed by sequencing speech segments of different natures: model segments, i.e. mathematical abstractions of speech, and template segments, i.e. speech waveform fragments. These multi-form segments can have shared, layered or alternate speech parameterization schemes. This paper introduces an advanced uniform speech parameterization scheme for statistical model segments and waveform segments employed in our multi-form segment synthesis system. Mel-Regularized Cepstrum derived from amplitude and phase spectra forms its basic framework. Furthermore, a new adaptive enhancement technique for model segments is presented that reduces the perceived gap in quality and similarity between model and template segments.

Mon-Ses1-O5:
Speech Enhancement analysis and Evaluation

Time:Monday 10:00 Place:Raffaello - Pala Affari - 3rd Floor Type:Oral
Chair:Doug O'Shaughnessy

10:00Theoretical analysis of musical noise and speech distortion in structure-generalized parametric blind spatial subtraction array

Ryoichi Miyazaki (Nara Institute of Science and Technology)
Hiroshi Saruwatari (Nara Institute of Science and Technology)
Kiyohiro Shikano (Nara Institute of Science and Technology)

In this paper, we propose the structure-generalized parametric blind spatial subtraction array (BSSA) and theoretically analyze the amounts of musical noise and speech distortion it generates via higher-order statistics. We theoretically prove a tradeoff between the amount of musical noise and speech distortion in various BSSA structures. We also reveal that the best speech recognition performance can be obtained when a lower exponent parameter is used in parametric BSSA.

10:20Subjective and objective evaluation of speech intelligibility enhancement under constant energy and duration constraints

Yan Tang (Language and Speech Laboratory, Universidad del Pais Vasco)
Martin Cooke (Ikerbasque (Basque Science Foundation))

Speakers appear to adopt strategies to improve speech intelligibility for interlocutors in adverse acoustic conditions. Generated speech, whether synthetic, recorded or live, may also benefit from context-sensitive modifications in challenging situations. The current study measured the effect on intelligibility of six spectral and temporal modifications operating under global constraints of constant input-output energy and duration. Reallocation of energy from mid-frequency regions with high local SNR produced the largest intelligibility benefits, while other approaches such as pause insertion or maintenance of a constant segmental SNR actually led to a deterioration in intelligibility. Listener scores correlated only moderately well with recent objective intelligibility estimators, suggesting that further development of intelligibility models is required to improve predictions for modified speech.

10:40A Risk-Estimation-Based Comparison of Mean Square Error and Itakura-Saito Distortion Measures for Speech Enhancement

Nagarjuna Reddy Muraka (Indian Institute of Science)
Chandra Sekhar Seelamantula (Indian Institute of Science)

The goal of speech enhancement algorithms is to provide an estimate of clean speech starting from noisy observations. In general, the estimate is obtained by minimizing a chosen distortion metric. The most commonly employed cost is the mean-square error (MSE), which results in a Wiener-filter solution. Since the ground truth is not available in practice, the practical utility of the optimal estimators is limited. Alternatively, one can optimize an unbiased estimate of the MSE. This is the key idea behind Stein's unbiased risk estimation (SURE) principle. Within this framework, we derive SURE solutions for the MSE and Itakura-Saito (IS) distortion measures. We also propose parametric versions of the corresponding SURE estimators, which give additional flexibility in controlling the attenuation characteristics for maximum signal-to-noise-ratio (SNR) gain. We compare the performance of the two distortion measures in terms of attenuation profiles, average segmental SNR, global SNR, and visual inspection of spectrograms. We also include a comparison with the standard power spectral subtraction technique. The results show that the SURE-IS approach consistently gives better performance gain than SURE-MSE. The perceived sound quality is also better in the case of the SURE-IS estimator.
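
For readers unfamiliar with the two distortion measures being compared, the sketch below computes them directly on per-bin power values, together with the textbook Wiener gain. It does not implement the SURE estimators derived in the paper; the function names and the per-bin list representation are assumptions for illustration.

```python
import math

def mse(reference, estimate):
    """Mean-square error between two equal-length sequences."""
    return sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)

def itakura_saito(reference, estimate):
    """Mean Itakura-Saito distortion between two strictly positive power spectra."""
    total = 0.0
    for r, e in zip(reference, estimate):
        ratio = r / e
        total += ratio - math.log(ratio) - 1.0
    return total / len(reference)

def wiener_gain(speech_power, noise_power):
    """Per-bin Wiener gain; note that it needs the clean-speech power."""
    return [s / (s + n) for s, n in zip(speech_power, noise_power)]
```

The Wiener gain depends on the clean-speech power, which is the ground truth the abstract notes is unavailable in practice. The IS measure is also asymmetric in its arguments, unlike the MSE.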

11:00On Noise Tracking for Noise Floor Estimation

Mahdi Triki (Philips Research)

Various speech enhancement techniques (e.g. noise suppression, dereverberation) rely on knowledge of the statistics of the clean signal and the noise process. In practice, however, these statistics are not explicitly available, and the overall enhancement accuracy critically depends on the estimation quality of the unknown statistics. In this respect, subspace-based approaches have been shown to allow for reduced estimation delay and to achieve a good tradeoff between tracking and final misadjustment. To track noise non-stationarity accurately, these schemes face the challenge of estimating the correlation matrix of the observed signal from a limited number of samples. In this paper, we investigate the effect of covariance estimation artifacts on noise PSD tracking. We show that the estimation downsides can be alleviated using an appropriate selection scheme.

11:20Maximum a posteriori estimation of noise from non-acoustic reference signals in very low signal-to-noise ratio environments

Ben Milner (University of East Anglia)

This paper examines whether non-acoustic noise reference signals can provide accurate estimates of noise at very low signal-to-noise ratios (SNRs) where conventional estimation methods are less effective. The environment chosen for the investigation is Formula 1 motor racing where SNRs are as low as -15dB and the non-acoustic reference signals are engine speed, road speed and throttle measurements. Noise is found to relate closely to these reference signals and a maximum a posteriori method (MAP) is proposed to estimate airflow and tyre noise from these parameters. Objective tests show MAP estimation to be more accurate than a range of conventional noise estimation methods. Subjective listening tests then compare speech enhancement using the proposed MAP estimation to conventional methods with the former found to give significantly higher speech quality.

11:40Blind speech prior estimation for generalized minimum mean-square error short-time spectral amplitude estimator

Ryo Wakisaka (Nara Institute of Science and Technology)
Hiroshi Saruwatari (Nara Institute of Science and Technology)
Kiyohiro Shikano (Nara Institute of Science and Technology)
Tomoya Takatani (Toyota Motor Corporation)

In this paper, to achieve high-quality speech enhancement, we introduce the generalized minimum mean-square error short-time spectral amplitude estimator with a new blind prior estimation of the speech probability density function (p.d.f.). To deal with various types of speech signals with different p.d.f., we propose an algorithm of speech kurtosis estimation based on moment-cumulant transformation for blind adaptation to the shape parameter of speech p.d.f. From the objective and subjective evaluation experiments, we show the improved noise reduction performance of the proposed method.

Mon-Ses1-P1:
Paralinguistic Information - Classification and Detection

Time:Monday 10:00 Place:Valfonda 1 - Pala Congressi (Passi Perduti-Gallery) Type:Poster
Chair:Julia Hirschberg

#1On the use of multimodal cues for the prediction of degrees of involvement in spontaneous conversation

Catharine Oertel (Trinity College Dublin)
Stefan Scherer (Ulm University)
Nick Campbell (Trinity College Dublin)

Quantifying the degree of involvement of a group of participants in a conversation is a task which humans accomplish every day, but it is something that, as of yet, machines are unable to do. In this study we first investigate the correlation between visual cues (gaze and blinking rate) and involvement. We then test the suitability of prosodic cues (acoustic model) as well as gaze and blinking (visual model) for the prediction of the degree of involvement by using a support vector machine (SVM). We also test whether the fusion of the acoustic and the visual model improves the prediction. We show that we are able to predict three classes of involvement with a reduction of error rate of 0.30 (accuracy = 0.68).

#2Anger Recognition in Spoken Dialog Using Linguistic and Para-Linguistic Information

Narichika Nomoto (NTT Cyber Space Laboratories, NTT Corporation)
Masafumi Tamoto (NTT Cyber Space Laboratories, NTT Corporation)
Hirokazu Masataki (NTT Cyber Space Laboratories, NTT Corporation)
Osamu Yoshioka (NTT Cyber Space Laboratories, NTT Corporation)
Satoshi Takahashi (NTT Cyber Space Laboratories, NTT Corporation)

This paper proposes a method to recognize anger-dialog based on linguistic and para-linguistic information in speech. Anger is classified into two types: HotAnger (agitated) and ColdAnger (calm). Conventional prosody-features based on para-linguistic information can reliably recognize the former but not the latter. To recognize anger more robustly, we apply other para-linguistic cues named dialog-features, which are seen in conversational interactive situations between two speakers, such as turn-taking and back-channel feedback. We also utilize linguistic-features which represent conversational emotional salience. They are acquired by Pearson's chi-square test by comparing the automatically-transcribed texts between angry and neutral dialogs. Experiments show that the proposed feature combination improves the F-measure of ColdAnger and HotAnger by 26.9 points and 16.1 points, respectively, against a baseline that uses only prosody.

#3Recognition of Personality Traits from Human Spoken Conversations

Alexei V. Ivanov (Department of Information Engineering and Computer Science, University of Trento, Italy)
Giuseppe Riccardi (Department of Information Engineering and Computer Science, University of Trento, Italy)
Adam J. Sporka (Czech Technical University in Prague, Czech Republic)
Jakub Franc (Dept. of Psychology, Faculty of Arts, Charles University, Prague, Czech Republic)

We are interested in understanding human personality and its manifestations in human interactions. The automatic analysis of such personality traits in natural conversation is quite complex due to the user-profiled corpora acquisition, annotation task and multidimensional modeling. While this topic has been addressed extensively in experimental psychology research, speech and language scientists have only recently engaged in limited experiments. In this paper we describe an automated system for speaker-independent personality prediction in the context of human-human spoken conversations. The evaluation of such a system is carried out on the PersIA human-human spoken dialog corpus annotated with user self-assessments of the Big-Five personality traits. The personality predictor has been trained on paralinguistic features and its evaluation on five personality traits shows encouraging results for the conscientiousness and extroversion labels.

#4Using Multiple Databases for Training in Emotion Recognition: To Unite or to Vote?

Björn Schuller (Institute for Human-Machine Communication, Technische Universitaet Muenchen, Germany)
Zixing Zhang (Institute for Human-Machine Communication, Technische Universitaet Muenchen, Germany)
Felix Weninger (Institute for Human-Machine Communication, Technische Universitaet Muenchen, Germany)
Gerhard Rigoll (Institute for Human-Machine Communication, Technische Universitaet Muenchen, Germany)

We present an extensive study on the performance of data agglomeration and decision-level fusion for robust cross-corpus emotion recognition. We compare joint training with multiple databases and late fusion of classifiers trained on single databases, employing six frequently used corpora of natural or elicited emotion, namely ABC, AVIC, DES, eNTERFACE, SAL and VAM, and three classifiers, i.e. SVM, Random Forests and Naive Bayes, to best cover for singular effects. On average over classifier and database, data agglomeration and majority voting deliver relative improvements of unweighted accuracy by 9.0 % and 4.8 %, respectively, over single-database cross-corpus classification of arousal, while majority voting performs best for valence recognition.
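
Two of the ingredients mentioned in this abstract, decision-level fusion by majority voting and unweighted accuracy, can be sketched as follows. The sketch assumes that unweighted accuracy means the mean of per-class recalls; the function names and the tie-breaking rule are illustrative assumptions, not details from the paper.

```python
from collections import Counter

def majority_vote(predictions):
    """predictions: one label list per classifier, all of the same length.
    Ties go to the label that appears first in classifier order."""
    fused = []
    for labels in zip(*predictions):
        fused.append(Counter(labels).most_common(1)[0][0])
    return fused

def unweighted_accuracy(truth, predicted):
    """Mean of per-class recalls, so each class counts equally
    regardless of how many instances it has."""
    recalls = []
    for cls in set(truth):
        idx = [i for i, t in enumerate(truth) if t == cls]
        hits = sum(1 for i in idx if predicted[i] == cls)
        recalls.append(hits / len(idx))
    return sum(recalls) / len(recalls)
```

In the late-fusion setting, each inner list would hold the decisions of a classifier trained on a single database, whereas data agglomeration would instead pool the databases before training one classifier.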

#5“Would You Buy A Car From Me?” – On the Likability of Telephone Voices

Felix Burkhardt (Deutsche Telekom Laboratories)
Björn Schuller (Institute for Human-Machine Communication, Technische Universität München)
Benjamin Weiss (Quality & Usability Lab, Technische Universität Berlin)
Felix Weninger (Institute for Human-Machine Communication, Technische Universität München)

We researched how “likable” or “pleasant” a speaker appears based on a subset of the “Agender” database, which was recently introduced at the 2010 Interspeech Paralinguistic Challenge. 32 participants rated the stimuli according to their likability on a seven-point scale. An ANOVA showed that the samples rated are significantly different, although the inter-rater agreement is not very high. Experiments with automatic regression and classification by REPTree ensemble learning resulted in a cross-correlation of up to .378 with the evaluator weighted estimator, and 67.6 % accuracy in binary classification (likable / not likable). Analysis of individual acoustic feature groups reveals that for this data, auditory spectral features seem to contribute most to reliable automatic likability analysis.

#6Automatic Identification of Salient Acoustic Instances in Couples' Behavioral Interactions using Diverse Density Support Vector Machines

James Gibson (Signal Analysis and Interpretation Laboratory (SAIL), University of Southern California, Los Angeles, CA, USA)
Athanasios Katsamanis (Signal Analysis and Interpretation Laboratory (SAIL), University of Southern California, Los Angeles, CA, USA)
Matthew Black (Signal Analysis and Interpretation Laboratory (SAIL), University of Southern California, Los Angeles, CA, USA)
Shrikanth Narayanan (Signal Analysis and Interpretation Laboratory (SAIL), University of Southern California, Los Angeles, CA, USA)

Behavioral coding focuses on deriving higher-level behavioral annotations using observational data of human interactions. Automatically identifying salient events in the observed signal data could lead to a deeper understanding of how specific events in an interaction correspond to the perceived high-level behaviors of the subjects. In this paper, we analyze a corpus of married couples' interactions, in which a number of relevant behaviors, e.g., level of acceptance, were manually coded at the session-level. We propose a multiple instance learning approach called Diverse Density Support Vector Machines, trained with acoustic features, to classify extreme cases of these behaviors, e.g., low acceptance vs. high acceptance. This method has the benefit of identifying salient behavioral events within the interactions, which is demonstrated by comparable classification performance to traditional SVMs while using only a subset of the events from the interactions for classification.

#7Predicting Speaker Changes and Listener Responses With And Without Eye-contact

Daniel Neiberg (CTT, TMH, CSC, KTH)
Joakim Gustafson (CTT, TMH, CSC, KTH)

This paper compares turn-taking in terms of timing and prediction in human-human conversations under conditions in which participants have eye contact versus no eye contact, as found in the HCRC Map Task corpus. By measuring between-speaker intervals, it was found that a larger proportion of speaker shifts occurred in overlap in the no-eye-contact condition. For prediction, we used prosodic and spectral features parametrized by time-varying length-invariant discrete cosine coefficients. With Gaussian mixture modeling and variations of classifier fusion schemes, we explored the task of predicting whether there is an upcoming speaker change (SC) or not (HOLD) at the end of an utterance (EOU) with a pause lag of 200 ms. The label SC was further split into LRs (listener responses, e.g. back-channels) and other TURN-SHIFTs. Prediction was found to be somewhat easier for the eye-contact condition, for which the average recall rates were 60.57%, 66.35% and 62.00% for TURN-SHIFTs, LRs and SC, respectively.

#8Emotion Classification Using Inter- and Intra-Subband Energy Variation

Senaka Amarakeerthi (University of Aizu, Japan)
Tin Lay Nwe (Institute for Infocomm Research, Singapore)
C De Silva Liyanage (University of Brunei, Brunei)
Michael Cohen (University of Aizu, Japan)

Speech is one of the most important signals that can be used to detect human emotions. Speech is modulated by different emotions by varying frequency- and energy-related acoustic parameters such as pitch, energy and formants. In this paper, we describe research on analyzing inter- and intra-subband energy variations to differentiate five emotions. The emotions considered are anger, fear, dislike, sadness, and neutral. We employ a Two-Layered Cascaded Subband (TLCS) filter to study the energy variations for extraction of acoustic features. Experiments were conducted on the Berlin Emotional Data Corpus (BEDC). We achieve average accuracy of 76.4% and 69.3% for speaker-dependent and -independent emotion classifications, respectively.

#9Emotion Classification of Infants’ Cries using Duration Ratios of Acoustic Segments

Kazuki Kitahara (Nagasaki University)
Shinzi Michiwaki (Nagasaki University)
Miku Sato (Nagasaki University)
Shoichi Matsunaga (Nagasaki University)
Masaru Yamashita (Nagasaki University)
Kazuyuki Shinohara (Nagasaki University)

We propose an approach to the classification of emotion clusters using prosodic features. In our approach, we use the duration ratios of specific acoustic segments (resonant cry and silence segments) in the infants’ cries as prosodic features. We use power and pitch information to detect these segment periods and use a normal distribution as a prosodic model to approximate the occurrence probability of the duration ratios of these segments. Classification experiments on two major emotion clusters are carried out. When the detection performance for the segment periods is about 75%, an emotion classification rate of 70.8% is achieved. The classification performance of our approach using the duration ratios was significantly better than that of the method using power and spectral features, thereby indicating the effectiveness of using prosodic features. Furthermore, we describe a classification method using both spectral and prosodic features with a slightly better performance (71.9%).
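
The normal-distribution model of duration ratios described above can be illustrated with a minimal sketch. It assumes one scalar duration ratio per cry and one normal model per emotion cluster; the cluster names, the training values and the maximum-density decision rule are hypothetical and are not taken from the paper.

```python
from statistics import NormalDist, mean, stdev

def fit_normal(ratios):
    """Fit a normal distribution to one cluster's duration ratios."""
    return NormalDist(mu=mean(ratios), sigma=stdev(ratios))

def classify(ratio, models):
    """Return the cluster label whose normal model gives the highest density."""
    return max(models, key=lambda label: models[label].pdf(ratio))

# Hypothetical training data: resonant-cry duration divided by total cry duration.
training = {
    "cluster_a": [0.62, 0.70, 0.66, 0.74, 0.68],
    "cluster_b": [0.35, 0.41, 0.30, 0.44, 0.38],
}
models = {label: fit_normal(values) for label, values in training.items()}
predicted = classify(0.65, models)
```

With several segment types, one such model per segment type could be combined by multiplying densities, which would be a further assumption beyond what the abstract states.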

#10Vowels formants analysis allows straightforward detection of high arousal acted and spontaneous emotions

Bogdan Vlasenko (Cognitive Systems, IESK, OvGU)
Dmytro Prylipko (Cognitive Systems, IESK, OvGU)
David Philippou-Hübner (Cognitive Systems, IESK, OvGU)
Andreas Wendemuth (Cognitive Systems, IESK, OvGU)

The role of automatic emotion recognition from speech grows continually because of the accepted importance of reacting to the emotional state of the user in human-computer interaction. Most state-of-the-art emotion recognition methods are based on context-independent turn- and frame-level analysis. Our earlier ICME 2011 article showed that robust detection of high-arousal acted emotions can be performed on a context-dependent vowel basis. In contrast to HMM/GMM classification with 39-dimensional MFCC vectors, a much more convenient Neyman-Pearson criterion using only a single average F1 value is employed here. In this paper we apply the proposed method to spontaneous emotion recognition from speech. We also avoid speaker-dependent acoustic features in favor of gender-specific ones. Finally, we compare performance on acted and spontaneous emotions for different criterion threshold values.

#11Intra-, Inter-, and Cross-cultural Classification of Vocal Affect

Daniel Neiberg (Department of Speech, Music and Hearing (TMH), KTH, Stockholm, Sweden)
Petri Laukka (Department of Psychology, Stockholm University, Stockholm, Sweden)
Hillary Anger Elfenbein (Olin Business School, Washington University in St. Louis, St. Louis, MO, USA)

We present intra-, inter- and cross-cultural classifications of vocal expressions. Stimuli were selected from the VENEC corpus and consisted of portrayals of 11 emotions, each expressed with 3 levels of intensity. Classification (nu-SVM) was based on acoustic measures related to pitch, intensity, formants, voice source and duration. Results showed that mean recall across emotions was around 2.4-3 times higher than chance level for both intra- and inter-cultural conditions. For cross-cultural conditions, the relative performance dropped 26%, 32%, and 34% for high, medium, and low emotion intensity, respectively. This suggests that intra-cultural models were more sensitive to mismatched conditions for low emotion intensity. Preliminary results further indicated that recall rate varied as a function of emotion, with lust and sadness showing the smallest performance drops in the cross-cultural condition.

Mon-Ses1-P2:
Applications for Learning, Education, Aged and Handicapped Persons

Time:Monday 10:00 Place:Valfonda 2 - Pala Congressi (Passi Perduti-Gallery) Type:Poster
Chair:Roberto Gretter

#1Verifying Human Users in Speech-Based Interactions

Sajad Shirali-Shahreza (University of Toronto)
Yashar Ganjali (University of Toronto)
Ravin Balakrishnan (University of Toronto)

Verifying that a live human is interacting with an automated speech-based system is needed in some applications such as biometric authentication. In this paper, we present a method to verify that the user is human. Simply stated, our method asks the user to repeat a sentence. The reply is analyzed to verify that it is the requested sentence and was said by a human, not a speech synthesis system. Our method takes advantage of both speech synthesizer and speech recognizer limitations to detect computer programs, which is a new, and potentially more accessible, way to develop CAPTCHA systems. Using an acoustic model trained on voices of over 1000 users, our system can verify the user’s answer with 98% accuracy and with 80% success in distinguishing humans from computers.

#2Automatic Assessment of Prosody in High-Stakes English Tests

Jian Cheng (Knowledge Technologies, Pearson)

Prosody can be used to infer whether or not candidates fully understand a passage they are reading aloud. In this paper, we focused on automatic assessment of prosody in a read-aloud section for a high-stakes English test. A new method was proposed to handle fundamental frequency (F0) of unvoiced segments that significantly improved the predictive power of F0. The k-means clustering method was used to build canonical contour models at the word level for F0 and energy. A direct comparison between the candidate’s contours and ideal contours gave a strong prediction of the candidate’s human prosody rating. Duration information at the phoneme level was an even better predictive feature. When the contours and duration information were combined, the correlation coefficient r = 0.80 was obtained, which exceeded the correlation between human raters (r = 0.75). The results support the use of the new methods for evaluating prosody in high-stakes assessments.

#3Improvement of Segmental Mispronunciation Detection with Prior Knowledge Extracted from Large L2 Speech Corpus

Dean Luo (Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences/The Chinese University of Hong Kong)
Xuesong Yang (Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences/The Chinese University of Hong Kong)
Lan Wang (Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences/The Chinese University of Hong Kong)

In this paper, we propose novel methods that utilize prior mispronunciation knowledge extracted from a large L2 speech corpus to improve segmental mispronunciation detection performance. Mispronunciation rules are categorized and the occurrence frequency of each error type is calculated from the phone-level annotation of the corpus. Based on these rules and statistics of mispronunciations, we construct extended pronunciation lexicons, with prior probabilities that reflect how likely each type of error is to occur, as language models for ASR. A 2-pass confusion-network-based strategy, which uses posterior probability scores with optimal thresholds estimated from the L2 speech corpus, is introduced to refine phone recognition results. Experimental results show that the proposed methods can improve mispronunciation detection performance significantly.

#4Off-Topic Detection in Automated Speech Assessment Applications

Jian Cheng (Knowledge Technologies, Pearson)
Jianqiang Shen (Knowledge Technologies, Pearson)

Automated L2 speech assessment applications need some mechanism for validating the relevance of user responses before providing scores. In this paper, we discuss a method for off-topic detection in an automated speech assessment application: a high-stakes English test (PTE Academic). Unlike traditional topic detection techniques that use characteristics of text alone, our method mainly focuses on features derived from speech confidence scores. We also enhanced our off-topic detection model by incorporating other features derived from acoustic likelihood, language model likelihood, and garbage modeling. The final combination model significantly outperformed classification from any individual feature. When fixing the false rejection rate at 5% in our test set, we achieved a false acceptance rate of 9.8%, a very promising result.
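
Reporting a false acceptance rate at a fixed false rejection rate amounts to choosing a decision threshold on held-out scores. The following is a minimal sketch of that evaluation step only, assuming a single relevance score per response where higher means more on-topic; the function name and the toy scores are hypothetical and do not come from the paper.

```python
def far_at_fixed_frr(on_topic_scores, off_topic_scores, target_frr=0.05):
    """Pick the threshold that rejects at most target_frr of on-topic responses,
    then report the false acceptance rate on off-topic responses.
    A response is accepted when its score is >= threshold."""
    ranked = sorted(on_topic_scores)
    n_reject = int(target_frr * len(ranked))  # on-topic responses we may reject
    threshold = ranked[n_reject]              # lowest score that is still accepted
    frr = sum(s < threshold for s in on_topic_scores) / len(on_topic_scores)
    far = sum(s >= threshold for s in off_topic_scores) / len(off_topic_scores)
    return threshold, frr, far

# Toy data; a 25% target is used only because the lists are tiny.
on_topic = [0.30, 0.45, 0.60, 0.65, 0.70, 0.80, 0.85, 0.90]
off_topic = [0.10, 0.20, 0.35, 0.50, 0.62, 0.75, 0.15, 0.40]
threshold, frr, far = far_at_fixed_frr(on_topic, off_topic, target_frr=0.25)
```

On a realistically sized test set the default 5% target would be used, matching the operating point quoted in the abstract.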

#5Towards Context-dependent Phonetic Spelling Error Correction in Children’s Freely Composed Text for Diagnostic and Pedagogical Purposes

Sebastian Stüker (Karlsruhe Institute of Technology)
Johanna Fay (Pädagogische Hochschule Karlsruhe)
Kay Berkling (Karlsruhe Institute of Technology)

Reading and writing are core competencies of any society. In Germany, international and national comparative studies such as PISA or IGLU have shown that around 25% of German school children do not reach the minimal competence level necessary to function effectively in society by the age of 15. Automated diagnosis and spelling tutoring of children can play an important role in raising their orthographic level of competence. One of several necessary steps in an automatic spelling tutoring system is the automatic correction of achieved text that was freely written by children and contains errors. Based on the common knowledge that children in the first years of school write as they speak, we propose a novel, context-sensitive spelling correction algorithm that uses phonetic similarities in order to achieve this step. We evaluate our approach on a test set of texts written by children and show that it outperforms Hunspell, a well-established isolated error correction program used in text processors.

#6Factored Translation Models for improving a Speech into Sign Language Translation System

Verónica López-Ludeña (Grupo de Tecnología del Habla. Universidad Politécnica de Madrid.)
Rubén San-Segundo (Grupo de Tecnología del Habla. Universidad Politécnica de Madrid.)
Ricardo Cordoba (Grupo de Tecnología del Habla. Universidad Politécnica de Madrid.)
Javier Ferreiros (Grupo de Tecnología del Habla. Universidad Politécnica de Madrid.)
Juan Manuel Montero (Grupo de Tecnología del Habla. Universidad Politécnica de Madrid.)
José Manuel Pardo (Grupo de Tecnología del Habla. Universidad Politécnica de Madrid.)

This paper proposes the use of Factored Translation Models (FTMs) for improving a Speech into Sign Language Translation System. These FTMs allow syntactic-semantic information to be incorporated during the translation process. This new information makes it possible to significantly reduce the translation error rate. This paper also analyses different alternatives for dealing with non-relevant words. The speech into sign language translation system has been developed and evaluated in a specific application domain: the renewal of Identity Documents and Driver’s Licenses. The translation system uses a phrase-based translation system (Moses). The evaluation results reveal that the BLEU score has improved from 69.11% to 73.92% and the mSER has been reduced from 30.56% to 24.81%.

#7Formant maps in Hungarian vowels – online data inventory for research, and education

Kálmán Abari (Institute of Psychology, University of Debrecen, Hungary)
Zsuzsanna Zsófia Rácz (Department of Telecommunications and Media Informatics, Budapest University of Technology and Economics, Hungary)
Gábor Olaszy (Department of Telecommunications and Media Informatics, Budapest University of Technology and Economics, Hungary)

This paper describes a project for creating an online system for studying the main formant movements of Hungarian vowels in spoken words, as a function of their sound environment. The speech material and the formant data corresponding to the vowels together provide research data for many other purposes as well. For efficient presentation of the data and to allow multilevel comparisons among formant features, an online solution was developed. The inventory data can be regarded as a reference, because of the strict conformity between the defined formant data and the formants of the spoken words. A two-step manual verification phase was performed after the completion of automatic formant tracking. The online query ensures quick and widespread studying of formant maps in vowels. The database is available at: “http://hungarianspeech.tmit.bme.hu/formant”. Index Terms: formant map, Hungarian vowels, live measurements, coarticulation, evaluation material.

#8Automatic Subtitling of the Basque Parliament Plenary Sessions Videos

Germán Bordel (Department of Electricity and Electronics, University of the Basque Country, Spain)
Silvia Nieto (Department of Electricity and Electronics, University of the Basque Country, Spain)
Mikel Penagarikano (Department of Electricity and Electronics, University of the Basque Country, Spain)
Luis Javier Rodriguez-Fuentes (Department of Electricity and Electronics, University of the Basque Country, Spain)
Amparo Varona (Department of Electricity and Electronics, University of the Basque Country, Spain)

Subtitling of video content offered on the web by Spanish administration agencies is required by law, allowing people with hearing impairments to follow it. The automatic video subtitling system described in this paper has been developed for the videos that the Basque Parliament posts on its website (http://www.parlamentovasco.euskolegebiltzarra.org/), and has been running since September 2010. A specific characteristic of this system is the use of a simple phonetic decoder based on a joint selection of Basque and Spanish phone models, since it is not unusual for parliamentarians to mix the two languages. The system uses the manually transcribed Session Diaries (nearly verbatim but containing some errors) as subtitles, synchronizing text and voice by means of an acoustic decoder, a multilingual orthographic-phonetic transcriber and a very-large-symbol-sequence aligner.

#9Generating Animated Pronunciation from Speech through Articulatory Feature Extraction

Yurie Iribe (Information and Media Center, Toyohashi University of Technology, Japan)
Silasak Manosavanh (Graduate School of Engineering, Toyohashi University of Technology, Japan)
Kouichi Katsurada (Graduate School of Engineering, Toyohashi University of Technology, Japan)
Ryoko Hayashi (Graduate School of Intercultural Studies, Kobe University, Japan)
Chunyue Zhu (School of Language and Communication, Kobe University, Japan)
Tsuneo Nitta (Graduate School of Engineering, Toyohashi University of Technology, Japan)

We automatically generate CG animations to express the pronunciation movement of speech through articulatory feature (AF) extraction to help learn a pronunciation. The proposed system uses MRI data to map AFs to coordinate values that are needed to generate the animations. By using magnetic resonance imaging (MRI) data, we can observe the movements of the tongue, palate, and pharynx in detail while a person utters words. AFs and coordinate values are extracted by multi-layer neural networks (MLN). Specifically, the system displays animations of the pronunciation movements of both the learner and teacher from their speech in order to show in what way the learner’s pronunciation is wrong. Learners can thus understand their wrong pronunciation and the correct pronunciation method through specific animated pronunciations. Experiments to compare MRI data with the generated animations confirmed the accuracy of articulatory features. Additionally, we verified the effectiveness of using AF to generate animation.

#10A Tale of Two Tasks: Detecting Children’s Off-Task Speech in a Reading Tutor

Wei Chen (Language Technologies Institute, School of Computer Science, Carnegie Mellon University, USA)
Jack Mostow (Project LISTEN, School of Computer Science, Carnegie Mellon University, USA)

How can an automated tutor detect children’s off-task utterances? To answer this question, we trained SVM classifiers on a corpus of 495 children’s 36,492 computer-assisted oral reading utterances. On a test set of 651 utterances by 10 held-out readers, the classifier correctly detected 88% of off-task utterances and misclassified 17% of on-task utterances as off-task. As a test of generality, we applied the same classifier to 20 children’s 410 responses to vocabulary questions. The classifier detected 84% of off-task utterances but misclassified 57% of on-task utterances. Acoustic and lexical features helped detect off-task speech in both tasks.

#11The problems encountered by Japanese EL2 with English short vowels as illustrated on the 3D Vowel Chart

Toshiko Isei-Jaakkola (Chubu University)
Takatoshi Naka (Chukyo University)
Keikichi Hirose (The University of Tokyo)

In this study we attempted to illustrate, using graphs and a three-dimensional (3D) vowel chart, to what extent Japanese university students who study English immediately after their enrolment have acquired English short vowels, and thus to clarify what their problems are when producing American English short vowels. We predicted that Japanese learners of English (JEL2) have a weakness in lip-rounding and protrusion, since there are no such articulatory movements in Japanese vowels. This was clarified by observing F2 and F3. In this case, JEL2 have problems with simultaneous lip and jaw movements in general. We also found a difference between female and male JEL2. As far as this experiment is concerned, female JEL2’s tongue and jaw movement (F2) is less stable than males’. Moreover, the results suggest that the 3D Vowel Chart may be more useful for EL2 than the graph.

#12Automatic generation of listening comprehension learning material in European Portuguese

Thomas Pellegrini (INESC-ID)
Rui Correia (IST)
Isabel Trancoso (INESC-ID / IST)
Jorge Baptista (Universidade do Algarve)
Nuno Mamede (INESC-ID / IST)

The goal of this work is the automatic selection of materials for a listening comprehension game. We would like to select automatically transcribed sentences from recent broadcast news corpora, in order to gather material for the games with little human effort. The recognized words are used as the ground solution of the exercises, thus sentences with misrecognitions need to be filtered out. Our experiments confirmed the feasibility of the filter chain that automatically selects sentences, although harder confidence thresholds may be needed. Together with the correct words, wrong candidates, namely distractors, are also needed to build the exercises. Two techniques of distractor generation are presented, either based on the confusion networks produced by the recognizer, or on phonetic distances. The experiments confirmed the complementarity of both approaches.

#13Candidate Generation for ASR Output Error Correction Using a Context-Dependent Syllable Cluster-Based Confusion Matrix

Chao-Hong Liu (Department of Computer Science and Information Engineering, National Cheng Kung University, Taiwan)
Chung-Hsien Wu (Department of Computer Science and Information Engineering, National Cheng Kung University, Taiwan)
David Sarwono (Department of Computer Science and Information Engineering, National Cheng Kung University, Taiwan)
Jhing-Fa Wang (Department of Electrical Engineering, National Cheng Kung University, Taiwan)

Error correction techniques have been proposed in the applications of language learning and spoken dialogue systems for spoken language understanding. These techniques include two consecutive stages: the generation of correction candidates and the selection of correction candidates. In this study, a Context-Dependent Syllable Cluster (CD-SC)-based Confusion Matrix is proposed for the generation of correction candidates. A Contextual Fitness Score, measuring the sequential relationship to the neighbors of the candidate, is proposed for corrected syllable sequence selection. Finally, the n-gram language model is used to determine the final word sequence output. Experiments show that the proposed method improved from 0.742 to 0.771 in terms of BLEU score as compared to the conventional speech recognition mechanism.

#14Semi-Supervised Tree Support Vector Machine for Online Cough Recognition

Thai Hoa Huynh (A*STAR Institute for Infocomm Research, Singapore)
Vu An Tran (A*STAR Institute for Infocomm Research, Singapore)
Huy Dat Tran (A*STAR Institute for Infocomm Research, Singapore)

Pneumonia and asthma are among the top causes of death worldwide, with 300 million people affected. In 2005, 255,000 people died of asthma alone. Good control requires both proper medication and continual monitoring day and night. In this paper, we introduce a novel classifier, the Semi-Supervised Tree Support Vector Machine, to target the problem of cough detection and monitoring. It adaptively analyzes the distribution of the samples’ confidence metrics, automatically selects the most informative samples and re-trains the core Tree SVM classifier accordingly. We also introduce a new way to build the Tree SVM, based on Fisher Linear Discriminant (FLD) analysis. Both are meant to improve final system performance, and our proposed classifier demonstrates a good improvement over the conventional method, validated on a database consisting of comprehensive body sounds recorded with a wearable contact microphone.

Mon-Ses1-P3:
Robust Speech Recognition I

Time:Monday 10:00 Place:Faenza 1 - Pala Congressi (Passi Perduti-Gallery) Type:Poster
Chair:Pietro Laface

#1A versatile Gaussian splitting approach to non-linear state estimation and its application to noise-robust ASR

Volker Leutnant (Department of Communications Engineering, University of Paderborn)
Alexander Krueger (Department of Communications Engineering, University of Paderborn)
Reinhold Haeb-Umbach (Department of Communications Engineering, University of Paderborn)

In this work, a splitting and weighting scheme that allows for splitting a Gaussian density into a Gaussian mixture density (GMM) is extended to allow the mixture components to be arranged along arbitrary directions. The parameters of the Gaussian mixture are chosen such that the GMM and the original Gaussian still exhibit equal central moments up to an order of four. The resulting mixtures' covariances will have eigenvalues that are smaller than those of the covariance of the original distribution, which is a desirable property in the context of non-linear state estimation, since the underlying assumptions of the extended Kalman filter are better justified in this case. Application to speech feature enhancement in the context of noise-robust automatic speech recognition reveals the beneficial properties of the proposed approach in terms of a reduced word error rate on the Aurora 2 recognition task.
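
The moment-matching idea can be illustrated in one dimension. The sketch below is not the paper's multivariate scheme: it assumes a univariate Gaussian split into three equal-variance components placed at mu - d, mu and mu + d. Matching the second and fourth central moments under that assumption gives weights 1/6, 2/3, 1/6 and a component variance of sigma2 - d*d/3, which is smaller than the original variance. The function names and the offset d are hypothetical.

```python
def split_gaussian(mu, sigma2, d):
    """Split N(mu, sigma2) into three equal-variance components at mu-d, mu, mu+d.
    Weights (1/6, 2/3, 1/6) and variance sigma2 - d*d/3 keep the central
    moments up to order four equal to those of the original Gaussian."""
    s2 = sigma2 - d * d / 3.0
    if s2 <= 0:
        raise ValueError("offset d too large: need d*d < 3*sigma2")
    return [(1 / 6, mu - d, s2), (2 / 3, mu, s2), (1 / 6, mu + d, s2)]

def mixture_central_moment(components, mu, order):
    """Central moment of a Gaussian mixture about mu, for order in 1..4."""
    total = 0.0
    for weight, mean, s2 in components:
        delta = mean - mu
        # Moments about mu of a single component N(mu + delta, s2).
        moments = {
            1: delta,
            2: delta**2 + s2,
            3: delta**3 + 3 * delta * s2,
            4: delta**4 + 6 * delta**2 * s2 + 3 * s2**2,
        }
        total += weight * moments[order]
    return total

components = split_gaussian(mu=1.0, sigma2=4.0, d=1.5)
second = mixture_central_moment(components, 1.0, 2)  # original Gaussian: sigma2
fourth = mixture_central_moment(components, 1.0, 4)  # original Gaussian: 3 * sigma2**2
```

The odd central moments vanish by symmetry, so only the second and fourth moments constrain the weights and the component variance.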

#2Generalized-Log Spectral Mean Normalization for Speech Recognition

Hilman Ferdinandus Pardede (Tokyo Institute of Technology)
Koichi Shinoda (Tokyo Institute of Technology)

Most compensation methods for robust speech recognition against noise assume independence between speech, additive noise and convolutive noise. However, the nonlinear nature of the distortion caused by noise may introduce correlation between noise and speech. To tackle this issue, we propose generalized-log spectral mean normalization (GLSMN), in which log spectral mean normalization (LSMN) is carried out in the q-logarithmic domain. Experiments on the Aurora-2 database show that GLSMN improved speech recognition accuracies by 20% compared to cepstral mean normalization (CMN) in the mel-frequency domain.
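
As a rough illustration of mean normalization in a q-logarithmic domain, the sketch below uses the Tsallis q-logarithm, ln_q(x) = (x^(1-q) - 1) / (1 - q), which reduces to the natural logarithm as q approaches 1. Whether the paper uses exactly this definition, and how it chooses q, is an assumption here; the function names and the toy spectra are hypothetical.

```python
import math

def q_log(x, q):
    """Tsallis q-logarithm; reduces to the natural log as q -> 1."""
    if abs(q - 1.0) < 1e-12:
        return math.log(x)
    return (x ** (1.0 - q) - 1.0) / (1.0 - q)

def q_log_mean_normalize(frames, q):
    """Subtract the per-band mean over time in the q-log domain.
    frames: list of frames, each a list of positive spectral energies."""
    logged = [[q_log(energy, q) for energy in frame] for frame in frames]
    n_bands = len(logged[0])
    means = [sum(f[b] for f in logged) / len(logged) for b in range(n_bands)]
    return [[f[b] - means[b] for b in range(n_bands)] for f in logged]

# Toy spectra: three frames, two frequency bands.
frames = [[1.0, 4.0], [4.0, 9.0], [9.0, 16.0]]
normalized = q_log_mean_normalize(frames, q=0.5)
```

Setting q to 1 recovers ordinary log spectral mean normalization, which makes the conventional method a special case of this sketch.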

#3Zero-Crossing-Based Channel Attentive Weighting of Cepstral Features for Robust Speech Recognition: The ETRI 2011 CHiME Challenge System

Young-Ik Kim (ETRI)
Hoon-Young Cho (ETRI)
Sang-Hoon Kim (ETRI)

We present a practical and noise-robust speech recognition system which estimates a target-to-interferers power ratio using a zero-crossing-based binaural model and applies the power ratio to a channel attentive missing feature decoder in the cepstral domain. In a natural multisource environment, our binaural model extracts spatial cues at each zero-crossing of a filterbank output signal to localize multiple sound sources and reliably estimates a ratio mask which segregates target speech from interfering noises. Our system uses gammatone filterbank cepstral coefficients (GFCCs) for recognition, and the channel attentive decoder uses the ratio mask to weight the cepstral features when calculating the output probability in Viterbi decoding. In experiments on the CHiME final test set, our channel attentive GFCC system improves the baseline recognition result by 12.2% on average, and with the noisy training condition, the average improvement amounts to 18.8%.

#4Feature Compensation for Speech Recognition in Severely Adverse Environments due to Background Noise and Channel Distortion

Wooil Kim (University of Texas at Dallas)
John H. L. Hansen (University of Texas at Dallas)

This paper proposes an effective feature compensation scheme to address severely adverse environments for robust speech recognition, where background noise and channel distortion are simultaneously involved. An iterative channel estimation method is integrated into the framework of our Parallel Combined Gaussian Mixture Model based feature compensation algorithm. A new speech corpus is generated which reflects both additive and convolutional noise corruption. Performance evaluation demonstrates that the proposed feature compensation scheme is significantly effective in improving speech recognition performance in the presence of both background noise and channel distortion, compared to conventional methods including the ETSI AFE.

#5Binaural cues for fragment-based speech recognition in reverberant multisource environments

Ning Ma (Department of Computer Science, University of Sheffield, Sheffield S1 4DP, UK)
Jon Barker (Department of Computer Science, University of Sheffield, Sheffield S1 4DP, UK)
Heidi Christensen (Department of Computer Science, University of Sheffield, Sheffield S1 4DP, UK)
Phil Green (Department of Computer Science, University of Sheffield, Sheffield S1 4DP, UK)

This paper addresses the problem of speech recognition using distant binaural microphones in reverberant multisource noise conditions. Our scheme employs a two stage fragment decoding approach: first spectro-temporal acoustic source fragments are identified using signal level cues, and second, a hypothesis-driven stage simultaneously searches for the most probable speech/background fragment labelling and the corresponding acoustic model state sequence. The paper reports the first successful attempt to use binaural localisation cues within this framework. By integrating binaural cues and acoustic models in a consistent probabilistic framework, the decoder is able to derive significant recognition performance benefits from fragment location estimates despite their inherent unreliability.

#6Sub-band level Histogram Equalization for Robust Speech Recognition

Vikas Joshi (Indian Institute of Technology, Madras (IIT-M))
Raghvendra Biligi (Indian Institute of Technology, Madras (IIT-M))
Umesh S (Indian Institute of Technology, Madras (IIT-M))
Luz Garcia (University of Granada, Spain)
Carmen Benitez (University of Granada, Spain)

This paper describes a novel modification of Histogram Equalization (HEQ) approach to robust speech recognition. We propose separate equalization of the high frequency (HF) and low frequency (LF) bands. We study different combinations of the sub-band equalization and obtain best results when we perform a two-stage equalization. First, conventional HEQ is performed on the cepstral features, which does not completely equalize HF and LF bands, even though the overall histogram equalization is good. In the second stage, an equalization is done separately on the HF and the LF components of the above equalized cepstra. We refer to this approach as Sub-band Histogram Equalization (S-HEQ). The new set of features has better equalization of the sub-bands as well as the overall cepstral histogram. Recognition results show a relative improvement of 12% and 15% over conventional HEQ in WER on Aurora-2 and Aurora-4 databases respectively.
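
For readers unfamiliar with the first stage, conventional histogram equalization of one feature dimension can be sketched as a mapping through the empirical CDF onto a reference distribution. The sketch assumes a standard normal reference and a mid-rank CDF estimate; it does not reproduce the second, sub-band stage of S-HEQ, and the function name and sample values are hypothetical.

```python
from statistics import NormalDist

def histogram_equalize(values):
    """Map one feature dimension onto a standard normal reference distribution.
    Each value is replaced by the reference quantile of its empirical CDF rank."""
    reference = NormalDist(0.0, 1.0)
    n = len(values)
    order = sorted(range(n), key=lambda i: values[i])
    equalized = [0.0] * n
    for rank, i in enumerate(order):
        cdf = (rank + 0.5) / n  # mid-rank estimate keeps cdf strictly in (0, 1)
        equalized[i] = reference.inv_cdf(cdf)
    return equalized

# Toy cepstral coefficient track with a skewed distribution.
track = [3.0, 10.0, 4.0, 100.0, 5.0]
equalized = histogram_equalize(track)
```

Because the mapping depends only on ranks, it is monotonic: the ordering of the original values is preserved while their distribution is reshaped toward the reference.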

#7GMM-based missing-feature reconstruction on multi-frame windows

Ulpu Remes (Aalto University School of Science)
Yoshihiko Nankaku (Nagoya Institute of Technology)
Keiichi Tokuda (Nagoya Institute of Technology)

Methods for missing-feature reconstruction substitute noise-corrupted features with clean-speech estimates calculated based on reliable information found in the noisy speech signal. Gaussian mixture model (GMM) based reconstruction has conventionally focussed on reliable information present in a single frame. In this work, GMM-based reconstruction is applied on windows that span several time frames. Mixtures of factor analysers (MFA) are used to limit the number of model parameters needed to describe the feature distribution as window width increases. Using the window-based MFA in a noisy speech recognition task resulted in relative error reductions of up to 52% compared to frame-based GMM.

#8Improvements of a dual-input DBN for noise robust ASR

Yang Sun (Centre for Language and Speech Technology, Radboud University Nijmegen)
Jort F. Gemmeke (Centre for Language and Speech Technology, Radboud University Nijmegen)
Bert Cranen (Centre for Language and Speech Technology, Radboud University Nijmegen)
Louis ten Bosch (Centre for Language and Speech Technology, Radboud University Nijmegen)
Lou Boves (Centre for Language and Speech Technology, Radboud University Nijmegen)

In previous work we have shown that an ASR system consisting of a dual-input Dynamic Bayesian Network (DBN) which simultaneously observes MFCC acoustic features and an exemplar-based Sparse Classification (SC) phoneme predictor stream can achieve better word recognition accuracies in noise than a system that observes only one input stream. This paper explores three modifications of the SC input to further improve the noise robustness of the dual-input DBN system: 1) using state likelihoods instead of phoneme likelihoods, 2) integrating more contextual information and 3) using a complete set of likelihood distributions. Experiments on Aurora 2 reveal that the combination of the first two approaches significantly improves the recognition results, achieving up to 29% (absolute) accuracy gain at SNR -5 dB. In the dual-input system, using the full likelihood vector does not outperform using the best state prediction.

#9Denoising Using Optimized Wavelet Filtering for Automatic Speech Recognition

Randy Gomez (Kyoto University)
Tatsuya Kawahara (Kyoto University)

We present an improved denoising method based on filtering of the noisy wavelet coefficients using a Wiener gain for automatic speech recognition (ASR). We optimize the wavelet parameters for speech and different noise profiles to achieve a better estimate of the Wiener gain for effective filtering. Moreover, we introduce a scaling parameter to the Wiener gain, to minimize mismatch caused by distortion during the denoising process. Experimental results in large vocabulary continuous speech recognition (LVCSR) show that the proposed method is effective and robust to different noise conditions.
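
A Wiener gain with an extra scaling factor, applied coefficient by coefficient, can be sketched as follows. The sketch assumes the clean power is estimated by power subtraction and that the scaling parameter multiplies the gain; the abstract does not specify either choice, and the function names and values are hypothetical. The wavelet transform itself is omitted.

```python
def wiener_gain(signal_power, noise_power, scale=1.0):
    """Wiener gain from estimated clean-signal and noise power, times a scaling factor."""
    if signal_power + noise_power == 0:
        return 0.0
    return scale * signal_power / (signal_power + noise_power)

def denoise_coefficients(coeffs, noise_power, scale=1.0):
    """Apply the gain to each noisy wavelet coefficient.
    Clean power is estimated by power subtraction, floored at zero."""
    out = []
    for c in coeffs:
        clean_power = max(c * c - noise_power, 0.0)
        out.append(wiener_gain(clean_power, noise_power, scale) * c)
    return out

# Toy coefficients with an assumed noise power of 1.0.
denoised = denoise_coefficients([2.0, 0.5], noise_power=1.0)
```

Coefficients whose power falls below the noise estimate are driven to zero, while strong coefficients pass through almost unchanged.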

#10Noise Robust Speaker-Independent Speech Recognition with Invariant-Integration Features Using Power-Bias Subtraction

Florian Müller (Institute for Signal Processing, University of Lübeck, Germany)
Alfred Mertins (Institute for Signal Processing, University of Lübeck, Germany)

This paper presents new results about the robustness of invariant-integration features (IIF) in noisy conditions. Furthermore, it is shown that a feature-enhancement method known as "power-bias subtraction" for noisy conditions can be combined with the IIF approach to improve its performance in noisy environments while keeping the robustness of the IIFs to mismatching vocal-tract length training-testing conditions. Results of experiments with training on clean speech only as well as experiments with matched-condition training are presented.

Mon-Ses1-P4:
ASR - Acoustic Models I

Time:Monday 10:00 Place:Faenza 2 - Pala Congressi (Passi Perduti-Gallery) Type:Poster
Chair:Lori Lamel

#1Semi-automatic acoustic model generation from large unsynchronized audio and text chunks

Michele Alessandrini (Università Politecnica delle Marche)
Giorgio Biagetti (Università Politecnica delle Marche)
Alessandro Curzi (Università Politecnica delle Marche)
Claudio Turchetti (Università Politecnica delle Marche)

In this paper an effective technique to train an acoustic model from large and unsynchronized audio and text chunks is presented. Given such a speech corpus, an algorithm to automatically segment each chunk into smaller fragments and to synchronize those to the corresponding text is defined. These smaller fragments are more suitable for the standard model training algorithms used in automatic speech recognition systems. The proposed approach is particularly suitable to bootstrap language models without relying on specialized training material or borrowing from models trained for other similar languages. Extensive experimentation using the CMU Sphinx 4 recognizer and the SphinxTrain model generator in a setting designed for large-vocabulary continuous speech recognition shows the effectiveness of the approach.

#2Unsupervised Testing Strategies for ASR

Brian Strope (Google)
Doug Beeferman (Google)
Alexander Gruenstein (Google)
Xin Lei (Google)

This paper describes unsupervised strategies for estimating relative accuracy differences between acoustic models or language models used for automatic speech recognition. To test acoustic models, the approach extends ideas used for unsupervised discriminative training to include a more explicit validation on held out data. To test language models, we use a dual interpretation of the same process, this time allowing us to measure differences by exploiting expected `truth gradients' between strong and weak acoustic models. The paper shows correlations between supervised and unsupervised measures across a range of acoustic model and language model variations. We also use unsupervised tests to assess the non-stationary nature of mobile speech input.

#3Acoustic Model Training with Detecting Transcription Errors in the Training Data

Gakuto KURATA (IBM Research - Tokyo)
Nobuyasu ITOH (IBM Research - Tokyo)
Masafumi NISHIMURA (IBM Research - Tokyo)

As the target of ASR has moved from clean read speech to spontaneous conversational speech, we need to prepare orthographic transcripts of spontaneous conversational speech to train acoustic models (AMs). However, it is expensive and slow to manually transcribe such speech word by word. We propose a framework to train an AM based on easy-to-make rough transcripts in which fillers and word fragments are not precisely transcribed and some transcription errors are included. By focusing on the phone duration in the result of forced alignment between the rough transcripts and the utterances, we can detect the erroneous parts in the rough transcripts. A preliminary experiment showed that we can detect the erroneous parts with moderately high recall and precision. Through ASR experiments with conversational telephone speech, we confirmed that automatic detection helped improve the performance of the AM trained with both conventional ML and state-of-the-art boosted MMI criteria.

#4Towards Unsupervised Training of Speaker Independent Acoustic Models

Aren Jansen (Johns Hopkins University)
Kenneth Church (Johns Hopkins University)

Can we automatically discover speaker independent phoneme-like subword units with zero resources in a surprise language? There have been a number of recent efforts to automatically discover repeated spoken terms without a recognizer. This paper investigates the feasibility of using these results as constraints for unsupervised acoustic model training. We start with a relatively small set of word types, as well as their locations in the speech. The training process assumes that repetitions of the same (unknown) word share the same (unknown) sequence of subword units. For each word type, we train a whole-word hidden Markov model with Gaussian mixture observation densities and collapse correlated states across the word types using spectral clustering. We find that the resulting state clusters align reasonably well along phonetic lines. In evaluating cross-speaker word similarity, the proposed techniques outperform both raw acoustic features and language-mismatched acoustic models.

#5Acoustic Modeling with Bootstrap and Restructuring Based on Full Covariance

Xiaodong Cui (IBM T. J. Watson Research Center)
Xin Chen (University of Missouri, Columbia)
Jian Xue (IBM T. J. Watson Research Center)
Peder A. Olsen (IBM T. J. Watson Research Center)
John R. Hershey (Mitsubishi Electric Research Laboratories)
Bowen Zhou (IBM T. J. Watson Research Center)

Bootstrap and restructuring (BSRS) has been shown in our previous work to be superior to the conventional acoustic modeling approach when dealing with low-resourced languages. This paper presents a full covariance based BSRS scheme, which is an extension of our previous work on diagonal covariance based BSRS acoustic modeling. Since full covariance provides richer structural information about the acoustic model than its diagonal counterpart, it is advantageous for both model clustering and refinement. Therefore, in this work, full covariance is employed in BSRS to keep the structural information until the last step before being converted to diagonal covariance for practical applications. We show that using full covariance further improves the performance over diagonal covariance in the BSRS acoustic modeling framework under the same model size without increasing computational cost in decoding.

#6An i-Vector based Approach to Acoustic Sniffing for Irrelevant Variability Normalization based Acoustic Model Training and Speech Recognition

Jian Xu (University of Science and Technology of China)
Yu Zhang (Shanghai Jiao Tong University)
Zhi-Jie Yan (Microsoft Research Asia)
Qiang Huo (Microsoft Research Asia)

This paper presents a new approach to acoustic sniffing for irrelevant variability normalization (IVN) based acoustic model training and speech recognition. Given a training corpus, a so-called i-vector is extracted from each training speech segment. A clustering algorithm is used to group the training i-vectors into multiple clusters, each corresponding to an acoustic condition. The acoustic sniffing can then be implemented as finding the most similar cluster by comparing the i-vector extracted from a speech segment with the centroid of each cluster. Experimental results on the Switchboard-1 conversational telephone speech transcription task suggest that the i-vector based acoustic sniffing outperforms our previous Gaussian mixture model (GMM) based approach. The proposed approach is very efficient and can therefore deal with very large-scale training corpora on current mainstream computing platforms, yet has very low run-time cost.
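The sniffing step described in the abstract above (pick the cluster whose centroid is most similar to a segment's i-vector) can be sketched as follows. This is only an illustration: the abstract does not say which similarity measure is used, so cosine similarity is an assumption, the names `cosine_similarity` and `sniff_condition` are hypothetical, and the three-dimensional vectors are toy values, not real i-vectors.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors (assumed similarity measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def sniff_condition(ivector, centroids):
    """Return the index of the cluster centroid most similar to the i-vector."""
    scores = [cosine_similarity(ivector, c) for c in centroids]
    return max(range(len(centroids)), key=lambda k: scores[k])


# Toy centroids standing in for clusters of training i-vectors.
centroids = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
segment_ivector = [0.1, 0.9, 0.2]
print(sniff_condition(segment_ivector, centroids))  # index of the nearest condition
```

Because the run-time work is one similarity computation per cluster, the cost grows linearly with the number of acoustic conditions, which is consistent with the low run-time cost the abstract reports.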

#7Log-linear Optimization of Second-order Polynomial Features with Subsequent Dimension Reduction for Speech Recognition

Muhammad Ali Tahir (RWTH Aachen University, Aachen, Germany)
Ralf Schlueter (RWTH Aachen University, Aachen, Germany)
Hermann Ney (RWTH Aachen University, Aachen, Germany)

Second order polynomial features are useful for speech recognition because they can be used to model class specific covariance even with a pooled covariance acoustic model. Previous experiments with second order features have shown word error rate improvements. However, the improvement comes at the price of a large increase in the number of parameters. This paper investigates the discriminative training of second order features, with a subsequent dimension reduction transform to limit the increase in the number of parameters. The acoustic model parameters and the transformation matrix parameters are modeled log-linearly and optimized using the maximum mutual information criterion. The advantage of log-linear optimization lies in its ability to robustly combine different kinds of features. Experiments are performed for second order MFCC features on the EPPS large vocabulary task and have resulted in a decrease in word error rate.

#8Genre Categorization and Modeling for Broadcast Speech Transcription

Qingqing Zhang (Spoken Language Processing Group, LIMSI-CNRS)
Lori Lamel (Spoken Language Processing Group, LIMSI-CNRS)
Jean-Luc Gauvain (Spoken Language Processing Group, LIMSI-CNRS)

Broadcast News (BN) speech transcription has attracted research since the mid-1990s due to the challenges of the task. More recently, research has been moving towards more spontaneous broadcast data, commonly called Broadcast Conversation (BC) speech. Considering the large style difference between BN and BC genres, specific modeling of genres should intuitively result in improved system performance. In this paper BN- and BC-style speech recognition has been explored by designing genre-specific systems. In order to separate the training data, an automatic genre categorization with two novel features is proposed. Experiments showed that automatic categorization of genre labels of the training data compared favorably to the original manually specified genre labels provided with corpora. When test data sets were classified into BN or BC genres and tested by the corresponding genre-specific speech recognition systems, modest but consistent error reductions were achieved compared to the baseline genre-independent systems.

#9Individual Error Minimization Learning Framework and its Applications to Speech Recognition and Utterance Verification

Sunghwan Shin (Center for Signal and Image Processing, Georgia Institute of Technology, Atlanta, GA, USA)
Ho-Young Jung (Speech Language Processing Team, Electronics and Telecommunications Research Institute, Daejeon, South Korea)
Biing-Hwang Juang (Center for Signal and Image Processing, Georgia Institute of Technology, Atlanta, GA, USA)

In this paper, we extend the individual recognition error minimization criteria, MDE/MIE/MSE [1], to the word level and apply them to word recognition and verification tasks. In order to effectively reduce potential errors at the word level, we expand the training token selection scheme to be more appropriate for a word-level learning framework, by taking into account neighboring words and by covering internal phonemes in each training word. Then, we examine the proposed word-level learning criteria on the TIMIT word recognition task and further investigate individual rejection performance of the recognition errors in utterance verification (UV). Experimental results confirm that each of the word-level objective criteria primarily reduces its corresponding target error type. The rejection rates of insertion and substitution errors are also improved with the MIE and MSE criteria, which leads to additional word error rate reduction after the rejection.

#10Effective Triphone Mapping for Acoustic Modeling in Speech Recognition

Sakhia Darjaa (Slovak Academy of Sciences)
Miloš Cerňak (Slovak Academy of Sciences)
Marián Trnka (Slovak Academy of Sciences)
Milan Rusko (Slovak Academy of Sciences)
Róbert Sabo (Slovak Academy of Sciences)

This paper presents effective triphone mapping for acoustic model training in automatic speech recognition, which allows the synthesis of unseen triphones. A description of this data-driven model clustering is presented, along with experiments performed using 350 hours of a Slovak audio database of mixed read and spontaneous speech. The proposed technique is compared with tree-based state tying, and it is shown that for bigger acoustic models, with 4000 states or more, a triphone-mapped HMM system achieves better performance than a tree-based state tying system. The main gain in performance is due to latent application of triphone mapping on monophones with multiple Gaussian pdfs, so the cloned triphones are initialized better than with single-Gaussian monophones. The absolute decrease in word error rate was 0.46% (5.73% relative) for models with 7500 states, and 0.4% (5.17% relative) at 11500 states.

#11Analysis of Dialectal Influence in Pan-Arabic ASR

Udhyakumar Nallasamy (Language Technologies Institute, CMU)
Michael Garbus (Language Technologies Institute, CMU)
Florian Metze (Language Technologies Institute, CMU)
Qin Jin (Language Technologies Institute, CMU)
Thomas Schaaf (Multimodal Technologies, Inc.)
Tanja Schultz (Language Technologies Institute, CMU)

In this paper, we present various experiments on analyzing the influence of five dialects of the Arabic language in an Automatic Speech Recognition (ASR) system. We discuss our efforts in building the baseline ASR system and present a detailed analysis of the impact of dialects on different ASR components including the front-end and pronunciation dictionary. We use ASR phonetic decision tree as a diagnostic tool to evaluate the robustness of different front-ends to dialectal variations in the speech data. We also perform a rule-based analysis of the pronunciation dictionary, which enables us to identify dialectal words in the vocabulary and automatically generate pronunciations for unseen words.

#12Connected Digit Recognition by Means of Reservoir Computing

Azarakhsh Jalalvand (ELIS-UGent)
Fabian Triefenbach (ELIS-UGent)
David Verstraeten (ELIS-UGent)
Jean-Pierre Martens (ELIS-UGent)

Most automatic speech recognition systems employ Hidden Markov Models with Gaussian mixture emission distributions to model the acoustics. There have been several attempts, however, to challenge this approach, e.g. by introducing a neural network (NN) as an alternative acoustic model. Although the performance of these so-called hybrid systems is actually quite good, their training is often problematic and time consuming. By using a reservoir (a recurrent NN in which only the output weights are trainable), we can overcome this disadvantage and yet obtain good accuracy. In this paper, we propose the first reservoir-based connected digit recognition system, and we demonstrate good performance on the Aurora-2 testbed. Since reservoir computing (RC) is a new technology, we anticipate that our present system is still sub-optimal and that further improvements are possible.
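To make the reservoir idea in the abstract above concrete, here is a minimal sketch of a recurrent network whose input and recurrent weights are drawn at random and left fixed, with only a linear readout being trained. It is not the authors' digit recognizer: the toy task (recalling the previous input sample), the weight scaling, the reservoir size and the use of the LMS (delta) rule for the readout are all assumptions made for illustration; reservoir readouts are more commonly fitted by regularized linear regression.

```python
import math
import random


def make_reservoir(n_inputs, n_units, seed=0):
    """Draw fixed random input and recurrent weights (these are never trained)."""
    rng = random.Random(seed)
    w_in = [[rng.uniform(-0.5, 0.5) for _ in range(n_inputs)]
            for _ in range(n_units)]
    # Dividing by sqrt(n_units) keeps the recurrent gain small (assumed heuristic).
    w_res = [[rng.uniform(-0.5, 0.5) / math.sqrt(n_units) for _ in range(n_units)]
             for _ in range(n_units)]
    return w_in, w_res


def run_reservoir(w_in, w_res, inputs):
    """Return the reservoir state after each input vector in the sequence."""
    state = [0.0] * len(w_res)
    states = []
    for u in inputs:
        state = [
            math.tanh(sum(wi * uj for wi, uj in zip(w_in[i], u))
                      + sum(wr * sk for wr, sk in zip(w_res[i], state)))
            for i in range(len(w_res))
        ]
        states.append(state)
    return states


def train_readout(states, targets, lr=0.05, epochs=100):
    """Fit only the linear output weights, here with the LMS (delta) rule."""
    w_out = [0.0] * len(states[0])
    for _ in range(epochs):
        for x, t in zip(states, targets):
            err = t - sum(w * xi for w, xi in zip(w_out, x))
            w_out = [w + lr * err * xi for w, xi in zip(w_out, x)]
    return w_out


def mean_squared_error(w_out, states, targets):
    """Average squared readout error over a sequence."""
    errs = [(t - sum(w * xi for w, xi in zip(w_out, x))) ** 2
            for x, t in zip(states, targets)]
    return sum(errs) / len(errs)


# Toy task: recall the previous input sample from the current reservoir state.
rng = random.Random(1)
signal = [rng.uniform(-1.0, 1.0) for _ in range(200)]
inputs = [[s] for s in signal]
targets = [0.0] + signal[:-1]

w_in, w_res = make_reservoir(n_inputs=1, n_units=20)
states = run_reservoir(w_in, w_res, inputs)
w_out = train_readout(states, targets)

baseline = mean_squared_error([0.0] * 20, states, targets)
trained = mean_squared_error(w_out, states, targets)
print(f"MSE with untrained readout: {baseline:.4f}, after training: {trained:.4f}")
```

Since the recurrent part is never updated, training reduces to fitting a linear model on the recorded states, which is why this kind of system avoids the slow and problematic training the abstract attributes to hybrid NN systems.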

#13Large Margin - Minimum Classification Error Using Sum of Shifted Sigmoids as the Loss Function

Madhavi Ratnagiri (Department of Electrical and Computer Engineering, Rutgers University, Piscataway, New Jersey)
Biing-Hwang Juang (School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, Georgia)
Lawrence Rabiner (Department of Electrical and Computer Engineering, Rutgers University, Piscataway, New Jersey)

We have developed a novel loss function that embeds large-margin classification into Minimum Classification Error (MCE) training. Unlike previous efforts, this approach employs a loss function that is bounded and requires neither incremental adjustment of the margin nor prior MCE training. It extends the Bayes risk formulation of MCE using Parzen Window estimation to incorporate large-margin classification and develops a loss function that is a sum of shifted sigmoids. Experimental results show improvement in recognition performance when evaluated on the TIDigits database.
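The abstract above does not give the formula, so the following is only a guess at the general shape of such a loss, not the authors' definition. It assumes a misclassification measure `d` that is negative for correctly classified tokens, a hypothetical set of margin shifts, and an average of sigmoids each shifted by one margin; the names `mce_loss` and `shifted_sigmoid_loss` and all parameter values are illustrative.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)


def mce_loss(d, slope=5.0):
    """Conventional MCE loss: one sigmoid of the misclassification measure d."""
    return sigmoid(slope * d)


def shifted_sigmoid_loss(d, shifts=(0.0, 0.5, 1.0), slope=5.0):
    """Average of sigmoids shifted by assumed margins; bounded between 0 and 1."""
    return sum(sigmoid(slope * (d + m)) for m in shifts) / len(shifts)


# A token that is correct (d < 0) but close to the decision boundary.
d = -0.5
print(mce_loss(d), shifted_sigmoid_loss(d))
```

Under these assumptions a correctly classified token that lies within the margin still receives a noticeable loss, which is the large-margin effect, while the value stays bounded because it is an average of bounded terms.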

#14Representing Phonological features through a two-level finite state model

Javier Mikel Olaso (Universidad del Pais Vasco)
María Inés Torres (Universidad del Pais Vasco)
Raquel Justo (Universidad del Pais Vasco)

Articulatory information has been shown to be useful for improving phone recognition performance in ASR systems, with Dynamic Neural Networks being the most successful method for detecting articulatory gestures from the speech signal. On the other hand, Stochastic Finite State Automata (SFSA) have been effectively used in many speech-input natural language tasks. In this work SFSA are used to represent phonological features. A hierarchical model able to consider sequences of acoustic observations along with sequences of phonological features is defined. From this formulation a classifier of articulatory features has been derived and then evaluated on a Spanish phonetic corpus. Experimental results show that this is a promising framework to detect and include phonological knowledge in ASR systems. Keywords: phonological features, ASR, finite state models, stochastic finite state automata, k-tss models

#15Optimization of the Gaussian Mixture Model Evaluation on GPU

Jan Vanek (University of West Bohemia)
Jan Trmal (University of West Bohemia)
Josef V. Psutka (University of West Bohemia)
Josef Psutka (University of West Bohemia)

In this paper we present a highly optimized implementation of the Gaussian mixture acoustic model evaluation algorithm. Evaluation of these likelihoods is one of the most computationally intensive parts of automatic speech recognizers, but it can be well parallelized and offloaded to GPU devices. Our approach offers significant speed-up compared to recently published approaches, since it exploits the GPU architecture better. All the recent implementations targeted only NVIDIA graphics processors and were programmed in either the CUDA or the OpenCL GPU programming framework. We present results for both CUDA and OpenCL. Results suggest that even very large acoustic models can be utilized in real-time speech recognition engines on computers and laptops equipped with a low-end GPU. Optimization of the acoustic likelihood computation on the GPU makes it possible to use the remaining GPU resources for offloading other compute-intensive parts of the LVCSR decoder.
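For readers unfamiliar with the computation being offloaded, the sketch below evaluates the log-likelihood of one feature vector under a diagonal-covariance Gaussian mixture. It is a plain CPU reference in Python, not the GPU implementation the abstract describes; the diagonal-covariance assumption and the function name `gmm_log_likelihood` are illustrative choices.

```python
import math


def gmm_log_likelihood(x, weights, means, variances):
    """Log-likelihood of feature vector x under a diagonal-covariance GMM."""
    log_terms = []
    for w, mu, var in zip(weights, means, variances):
        # Log of the Gaussian normalisation constant for this component.
        log_norm = -0.5 * sum(math.log(2.0 * math.pi * v) for v in var)
        # Squared distance to the mean, scaled per dimension by the variance.
        mahal = sum((xi - mi) ** 2 / v for xi, mi, v in zip(x, mu, var))
        log_terms.append(math.log(w) + log_norm - 0.5 * mahal)
    # Log-sum-exp over components avoids underflow for unlikely frames.
    peak = max(log_terms)
    return peak + math.log(sum(math.exp(t - peak) for t in log_terms))


# Two-component toy model over two-dimensional features.
weights = [0.6, 0.4]
means = [[0.0, 0.0], [3.0, 3.0]]
variances = [[1.0, 1.0], [0.5, 2.0]]
print(gmm_log_likelihood([0.5, -0.2], weights, means, variances))
```

Every component, every state and every frame can be scored independently of the others, which is the property that makes this step a natural fit for parallel hardware.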

Mon-Ses2-O1:
Speaker Recognition - Analysis and Statistics I

Time:Monday 13:30 Place:Auditorium - Pala Congressi Type:Oral
Chair:David Van Leeuwen

13:30Harmonic Structure Transform for Speaker Recognition

Kornel Laskowski (KTH Speech Music and Hearing)
Qin Jin (Carnegie Mellon University)

We evaluate a new filterbank structure, yielding the harmonic structure cepstral coefficients (HSCCs), on a mismatched-session closed-set speaker classification task. The novelty of the filterbank lies in its averaging of energy at frequencies related by harmonicity rather than by adjacency. Improvements are presented which achieve a 37%rel reduction in error rate under these conditions. The improved features are combined with a similar Mel-frequency cepstral coefficient (MFCC) system to yield error rate reductions of 32%rel, suggesting that HSCCs offer information which is complementary to that available to today's MFCC-based systems.

13:50Combining Evidence from Spectral and Source-like Features for Person Recognition from Humming

Hemant Patil (Dhirubhai Ambani Institute of Information and Communication Technology, Gandhinagar, INDIA)
Maulik Madhavi (Dhirubhai Ambani Institute of Information and Communication Technology, Gandhinagar, INDIA)
Keshab Parhi (Department of Electrical and Computer Engineering, University of Minnesota, Minneapolis, USA)

In this paper, the hum of a person is used in a voice biometric system. In addition, a recently proposed feature set, Variable length Teager Energy Based Mel Frequency Cepstral Coefficients (VTMFCC), is found to capture perceptually meaningful source-like information from the hum signal. For person recognition, MFCC gives an EER of 13.14% and a %ID of 64.96%. A reduction in equal error rate (EER) of 0.2% and an improvement in identification rate of 7.3% are achieved when a score-level fusion system combines evidence from MFCC (system) and VTMFCC (source-like features), compared with MFCC alone. Results are reported for various feature dimensions and population sizes.
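Score-level fusion, as mentioned in the abstract above, is often implemented as a weighted sum of the per-trial scores of the two systems. The abstract does not state the fusion rule or the weights, so the linear rule, the `weight_a` parameter and the score values below are assumptions for illustration only.

```python
def fuse_scores(scores_a, scores_b, weight_a=0.5):
    """Combine two systems' per-trial scores with a weighted sum (assumed rule)."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both systems must score the same trials")
    return [weight_a * a + (1.0 - weight_a) * b
            for a, b in zip(scores_a, scores_b)]


# Hypothetical scores for three trials from an MFCC and a VTMFCC system.
mfcc_scores = [1.2, -0.3, 0.8]
vtmfcc_scores = [0.9, 0.1, -0.4]
print(fuse_scores(mfcc_scores, vtmfcc_scores, weight_a=0.7))
```

In practice the two score streams are usually brought to a comparable range before being combined, and the weight is tuned on held-out data.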

14:10Improvements in Speaker Characterization Using Spectral Subband Energy Based on Harmonic plus Noise Model

Yanhua Long (iFly Speech Lab, University of Science and Technology of China (USTC))
Zhi-Jie Yan (Microsoft Research Asia, Beijing, China)
Frank K. Soong (Microsoft Research Asia, Beijing, China)
Lirong Dai (iFly Speech Lab, University of Science and Technology of China (USTC))
Wu Guo (iFly Speech Lab, University of Science and Technology of China (USTC))

We previously proposed the use of Spectral Subband Energy Ratio (SSER) as speaker features in a speaker verification system [1]. Those SSER features were derived from two distinct components, the harmonic and noise parts of speech, which were decomposed from the original speech by the Harmonic plus Noise Model (HNM). In this paper, we report several recent improvements to this approach. First, we go into the details of the two distinct speech components and achieve surprisingly better performance by extracting only the separate Spectral Subband Energy features from each component. Second, we propose a soft unvoiced/voiced (U/V) decision method to preserve more speech data during HNM analysis and feature extraction. Greatly improved experimental results show the effectiveness of this soft U/V decision. Finally, a preliminary attempt to move feature extraction from the linear frequency domain to the mel-frequency domain has also been examined.

14:30Implicit Segmentation in Two-Wire Speaker Recognition

Yosef Solewicz (Technology Section, Israel National Police)
Hagai Aronowitz (IBM Research - Haifa)

This paper presents a novel self-contained two-wire speaker recognition framework. The classical approach to two-wire speaker recognition usually requires a preliminary explicit speaker segmentation stage in order to extract audio files for the two hypothesized speakers. We propose an implicit speaker segmentation method implemented at the supervector level of speaker recognition systems. By periodically extracting successive supervectors from the two-wire audio it is possible to further associate them to each of the hypothesized speakers before scoring both streams. We show that the proposed technique leads to recognition performance comparable to standard approaches while requiring substantially fewer resources.

14:50Boosting Speaker Recognition Performance with Compact Representations

Sibel Yaman (IBM T. J. Watson Research Center)
Jason Pelecanos (IBM T. J. Watson Research Center)
Mohamed K. Omar (IBM T. J. Watson Research Center)

This paper describes a speaker recognition system combination approach in which the compact forms of MAP adapted GMM supervectors are used to boost the performance of a high-dimensional supervector-based system or a combination of multiple systems. The compact supervector representations are subjected to a diagonal transformation to emphasize those dimensions that describe significant speaker information and to de-emphasize noisy dimensions. Scores obtained from these representations are then combined with the scores obtained from high-dimensional supervector representations. The transformation parameters and the combination weights are estimated by minimizing a discriminative training objective function that approximates a minimum detection cost function. We carried out experiments on two NIST 2008 Speaker Recognition Evaluation English telephony tasks to compare the proposed approach with direct score combination obtained from low- and high-dimensional supervector representations. We have found that the proposed approach yields up to 18% relative gain.

15:10Partitioning of Two-Speaker Conversation Datasets

Carlos Vaquero (University of Zaragoza)
Alfonso Ortega (University of Zaragoza)
Eduardo Lleida (University of Zaragoza)

We address the speaker partitioning problem on datasets composed of two-speaker conversations. In such a situation, it is desirable to obtain a good overall diarization performance, but even in that case the performance of the partitioning problem can be severely degraded if some of the recordings are incorrectly segmented. We show that the performance of a bottom-up speaker clustering approach for the partitioning of two-speaker conversation datasets is sensitive to errors in the diarization, to the point that the Diarization Error Rate for every recording should be as low as 1% to avoid degradation in performance due to the diarization process. Finally, we propose a set of confidence measures along with a logistic regression approach to detect those conversations whose segmentation hypothesis is reliable enough to perform speaker clustering, showing that it enables an improvement in clustering performance at the expense of missing a small portion of the speakers in the dataset.

Mon-Ses2-O3:
Speech Segmentation

Time:Monday 13:30 Place:Brunelleschi (Green Room) - Pala Congressi - 2nd Floor Type:Oral
Chair:Daniele Falavigna

13:30A Two-stage Sample-based Phone Boundary Detector using Segmental Similarity Features

Yih-Ru Wang (National Chiao Tung University, Taiwan)

In this paper, a two-stage sample-based phone boundary detection algorithm is proposed. In the first stage, some local sample-based acoustic parameters are used to pre-select some phone boundary candidates. Then, in the second stage, some high-order statistics of the log-likelihood differences of two adjacent speech segments around each boundary candidate are calculated to serve as similarity measure for candidate verification. Experimental results on the TIMIT speech corpus showed that EERs of 8.6% and 7.6% were achieved for one-stage and two-stage sample-based phone boundary detections, respectively. Moreover, for the two-stage system, 42.1% and 81.9% of boundaries detected were within 5- and 15-sample error tolerance from manual labeling results.

13:50Iterative Improvement of Speaker Segmentation in A Noisy Environment Using High-level Knowledge

Qiang Huang (University of East Anglia)
Stephen Cox (University of East Anglia)

Our goal is to understand the progress of a tennis game according to its soundtrack. The chair umpire’s speech is one of the most useful sources of information, and we focus on identifying the locations of this signal on the soundtrack. Although current techniques for audio segmentation can work well on this task when the acoustics of the training and test data are well matched, they fail when there is a mismatch, which occurs when the chair umpires are different in the test and training data. Our technique uses high-level knowledge of the syntax of the audio events to make a coarse estimate of the location of the umpire’s speech. The data gathered from these locations is then iteratively refined. A model is built from this data that enables a more accurate determination of the location of the speech segments. Our approach is applied to three different tennis games. The results obtained show similar performance to that obtained using supervised methods.

14:10Hierarchical Audio Segmentation with HMM and Factor Analysis in Broadcast News Domain

Diego Castan (University of Zaragoza)
Carlos Vaquero (University of Zaragoza)
Alfonso Ortega (University of Zaragoza)
David Martinez (University of Zaragoza)
Jesus Villalba (University of Zaragoza)
Eduardo Lleida (University of Zaragoza)

This paper investigates the performance of a Factor Analysis stage in audio segmentation systems. The system described here is designed to segment and classify the audio files coming from broadcast programs into five different classes: speech, speech with noise, speech with music, music or others. This task was recently proposed as a competitive evaluation organized by the Spanish Network on Speech Technologies as part of the conference FALA 2010. The system proposed here makes use of a hierarchical structure in two steps with two different acoustic features. First, the system decides among music, speech with music or the rest of the classes by using HMM/GMM and a smoothed combination of MFCC and Chroma as feature vectors. Next, the system classifies speech and speech with noise by using FA and MFCC as acoustic features. The results show that, with this configuration, the error rate achieved is lower than the one obtained by the best system presented in the FALA 2010 evaluation.

14:30Syllable Segmentation of Continuous Speech Using Auditory Attention Cues

Ozlem KALINLI (Sony Computer Entertainment America)

Segmentation of speech into syllables is beneficial for many spoken language processing applications since it provides information about phonological and rhythmic aspects of speech. Traditional methods usually detect syllable nuclei using features such as energies in critical bands, linear predictive coding spectra, pitch, voicing, etc. Here, a novel system that uses auditory attention cues is proposed for predicting syllable boundaries. The auditory attention cues are biologically inspired and capture changes in sound characteristics by using 2D spectro-temporal receptive filters. When tested on TIMIT, it is shown that the proposed method successfully predicts syllable boundaries and performs as well as or better than state-of-the-art syllable nucleus detection methods.

14:50Exploiting phone-class specific landmarks for refinement of segment boundaries in TTS databases

Vijayaditya Peddinti (International Institute of Information Technology - Hyderabad)
Kishore Prahallad (International Institute of Information Technology - Hyderabad)

High accuracy speech segmentation methods invariably depend on manually labelled data. However, under-resourced languages do not have the annotated speech corpora required for training these segmentors. In this paper we propose a boundary refinement technique which uses knowledge of phone-class specific sub-band energy events, in place of manual labels, to guide the refinement process. The use of this knowledge enables proper placement of boundaries in regions with multiple spectral discontinuities in close proximity. It also helps in the correction of large alignment errors. The proposed refinement technique provides boundaries with an accuracy of 82% within 20 ms of the actual boundary. Combining the proposed technique with an iterative isolated HMM training technique boosts the accuracy to 89%, without the use of any manually labelled data.

15:10Phoneme-Level Text to Audio Synchronization on Speech Signals with Background Music

Agnes Pedone (Audionamix)
Juan Jose Burred (Audionamix)
Simon Maller (Audionamix)
Pierre Leveau (Audionamix)

We address the task of synchronizing a given phoneme transcription with the corresponding speech signal, when the latter is linearly mixed with background music. To that end, we propose a new method based on Non-negative Matrix Factorization in the time-frequency domain, which models the speech as a source-filter factorization that includes a synchronization parameter matrix. Phoneme models, which consist of collections of basic spectral envelopes, are learned from a training set of isolated speech. The model is subjected to an iterative Maximum Likelihood optimization that concurrently estimates pitch, synchronization parameters and the contribution of the music part. Results show the feasibility of the system for application in text-informed audio processing and automatic subtitle synchronization.

Mon-Ses2-S1:
Show & Tell Demonstration - Speech Systems and Applications

Time:Monday 13:30 Place:Donatello (Room Onice) - Pala Congressi - Ground Floor Type:Poster
Chair:Dimitrios Dimitriadis

#1An Affective Spoken Storyteller

Felix Burkhardt (Deutsche Telekom Laboratories)

We present software that reads texts with emotional expression. The software is developed as part of the Emofilt open source emotional speech synthesis software. The affective storyteller consists of a text editor which offers a set of emotional speaking styles that can be used to mark up the text. The system was validated in a perception experiment which, although the number of participants was not very large, showed the general usability of the approach.

#2Text Driven 3D Photo-Realistic Talking Head

Lijuan Wang (Microsoft Research Asia)
Frank Soong (Microsoft Research Asia)
Wei Han (Department of Computer Science, Shanghai Jiao Tong University, China)
Qiang Huo (Microsoft Research Asia)

We propose a new 3D photo-realistic talking head with a personalized, photo-realistic appearance. It extends our prior, high-quality, 2D photo-realistic talking head to 3D. We use a 2D-to-3D reconstruction algorithm to automatically adapt a general 3D head mesh model to the individual. In training, super feature vectors consisting of 3D geometry, texture and speech are formed to train a statistical, multi-streamed, Hidden Markov Model (HMM). The HMM is then used to synthesize both the trajectories of geometry animation and dynamic texture. The 3D talking head animation can be controlled by the rendered geometric trajectory while the facial expressions and articulator movements are rendered with the dynamic 2D image sequences. Head motions and facial expression can also be separately controlled by manipulating corresponding parameters. The new 3D talking head has many useful applications such as voice-agent, tele-presence, gaming, social networking, etc.

#3Physical Models Producing Vowels with Pitch Variation

Arai Takayuki (Sophia University)

Physical models of the human vocal tract are useful for education in acoustics and speech science. To excite such vocal-tract models, different types of sound sources may be used. We have developed two new types of physical models which produce a glottal source with a variable fundamental frequency. Both types are based on a reed vibration, and the length of the vibratory portion can be varied manually. In the first type, the reed itself is curved, while the reed of the second type is straight but its support is curved. In each case, we can demonstrate vowel production with pitch variation by combining vocal-tract models with our proposed source models.

#4An Engine-Independent Text-to-Speech Workplace

Margot Mieskes (European Media Laboratory GmbH)

We present a web-based graphical user interface for access to Text-to-Speech engines. The workplace is intended to be engine-independent, allowing the user to not worry about the interaction with the specific engine, but to focus on his/her task and create a good synthesis result. Additionally, the workplace offers support for non-expert users in specific tuning and interaction tasks, such as phonetic transcriptions or creating a lexicon for usage during synthesis. We also present two application scenarios which were the basis for creating this workplace and the current status of the workplace.

#5An application to test the emotion conveyed by vocal and musical signals

Simone Carcone (ISIM_garage Phys. Dept., University of Rome Tor Vergata, Italy)
Carlo Giovannella (ISIM_garage Dip. Fisica e Scuola IaD - University of Rome Tor Vergata)

We present an application that makes it straightforward to build tests that measure the emotion conveyed by multimodal and single-modal signals, among them voice, music and sounds. The application is available as a stand-alone application and, partially, as a web service.

#6Automatic Speech Recognition System Dedicated for Polish

Mariusz Ziółko (Department of Electronics, AGH University of Science and Technology)
Jakub Gałka (Department of Electronics, AGH University of Science and Technology)
Bartosz Ziółko (Department of Electronics, AGH University of Science and Technology)
Tomasz Jadczyk (Department of Electronics, AGH University of Science and Technology)
Skurzok Dawid (Department of Electronics, AGH University of Science and Technology)
Mąsior Mariusz (Department of Electronics, AGH University of Science and Technology)

An automatic speech recognition system for Polish is demonstrated. A few layers of our system differ from popular approaches as a result of differences between the Polish and English languages.

#7Joint Application of Speech and Speaker Recognition for Automation and Security in Smart Home

Kong Aik Lee (Institute for Infocomm Research, Singapore)
Anthony Larcher (Institute for Infocomm Research, Singapore)
Helen Thai (Institute for Infocomm Research, Singapore)
Bin Ma (Institute for Infocomm Research, Singapore)
Haizhou Li (Institute for Infocomm Research, Singapore)

This paper describes the deployment of speech technologies in STARHome, a fully functional smart home prototype. We make use of speech and speaker recognition technologies to provide three voice services, namely, voice command for controlling home appliances, voice biometric for entrance-door access control, and service customization (speaker-loaded command control). Voice applications for STARHome have been designed to deal with short utterances and low SNR.

#8Adding a Speech Cursor to a Multimodal Dialogue System

Staffan Larsson (University of Gothenburg)
Alexander Berman (Talkamatic AB)
Jessica Villing (University of Gothenburg)

This paper describes an in-vehicle dialogue system demonstrating a novel combination of flexible multimodal menu-based dialogue and a "speech cursor" which enables menu navigation as well as browsing long lists using haptic input and spoken output.

#9Prosody Toolkit: Integrating HTK, Praat and WEKA

Scott Thomas Christie (Cognitive Science, University of Minnesota)
Serguei Pakhomov (Center for Clinical and Cognitive Neuropharmacology, College of Pharmacy, University of Minnesota)

A major hurdle in computational speech analysis is the effective integration of available tools originally developed for purposes unrelated to each other. We present a Python-based tool to enable an efficient and organized processing workflow incorporating automatic speech recognition using HTK, phoneme-level prosodic feature extraction in Praat and machine learning in WEKA. Our system is extensible, customizable and organizes prosodic data by phoneme and time stamp in a tabular fashion in preparation for analysis using other utilities. Plotting of prosodic information is supported to enable visualization of prosodic features.

#10Collecting life logs for experience-based corpora

Fabiano Francesconi (DISI - University of Trento, 38050 Povo (Trento), Italy)
Arindam Ghosh (DISI - University of Trento, 38050 Povo (Trento), Italy)
Giuseppe Riccardi (DISI - University of Trento, 38050 Povo (Trento), Italy)
Marco Ronchetti (DISI - University of Trento, 38050 Povo (Trento), Italy)
Alex Vagin (DISI - University of Trento, 38050 Povo (Trento), Italy)

In this paper we propose an approach to lightweight acquisition, sharing and annotation of experience-based corpora via mobile devices. Corpora acquisition is a crucial and often costly process in speech and language science and engineering. To address this problem, we have built a system for creating location-based corpora annotated with multimedia tags (e.g. text, speech, image) generated by end-users. We describe a relevant case study for the collection of mobile user life logs. We plan to make such tools and platforms publicly available to the research community for collaborative development and distributed experiential corpora collection.

Mon-Ses2-O2:
Speech Production - Coarticulation and Speech Timing

Time:Monday 13:30 Place:Leonardo - Pala Affari - Ground Floor Type:Oral
Chair:Wim van Dommelen

13:30Jaw movement in vowels and liquids forming the syllable nucleus

Štefan Beňuš (Department of Eng. and Am. Studies, Constantine the Philosopher University, Nitra, Slovakia & Institute of Informatics, Slovak Academy of Sciences, Bratislava, Slovakia)
Marianne Pouplier (Institute of Phonetics and Speech Processing, Ludwig-Maximilians-University, Munich, Germany)

This paper investigates jaw movements in the production of Slovak syllables with and without vowels. We test the hypothesis that /l, r/ in the syllable nucleus position show a degree of jaw opening comparable to vowels, therefore providing a rising-falling sonority profile even in syllables lacking vowels. We also investigate whether the phonemic length distinction occurring for both vowels and syllabic consonants is implemented in a similar fashion for the different nucleus types. Our articulatory data show that the jaw activity during syllabic liquids is indeed comparable to that of vowels, and that the jaw is recruited to help maintain the main lingual articulation. This became evident in particular in an interaction between nucleus type and phonemic length effects.

13:50Coarticulation across prosodic domains in Italian: An ultrasound investigation

Barbara Gili Fivela (CRIL & Università del Salento)
Antonio Stella (CRIL & Università del Salento)
Sonia D'Apolito (CRIL & Università del Salento)
Francesco Sigona (CRIL & Università del Salento)

This work aims at exploring the phasing of vowels across prosodic boundaries, by analyzing ultrasound data relating to the production of V(#)CV (/i#ba/) sequences by three speakers of Italian. Sequences are inserted in sentences in such a way that C corresponds to the beginning of prosodic domains of various levels and it is strengthened depending on the type of boundary. The influence of prosodic boundaries is investigated on cross boundary vowels by means of ultrasound data, which show that a smaller degree of (preservative) coarticulation is found when strong rather than weak prosodic boundaries are realized.

14:10Investigating the stability of intergestural timing relations

Juraj Simko (CITEC, Bielefeld University, Germany)
Fred Cummins (University College Dublin, Ireland)
Štefan Beňuš (Constantine the Philosopher University, Nitra, Slovakia, Institute of Informatics, Slovak Academy of Sciences, Bratislava, Slovakia)

An articulatory analysis of lip and tongue coordination in VCV sequences is presented for four Slovak speakers. Lip and tongue movements are obtained for many utterances elicited in a manner that ensures great variation in both rate and in articulatory precision. Theory and models suggest that gestures might not be sequenced in simple linear order, but that medial consonant timing may be tied to the evolution of the following vowel gesture. We find that the relative timing of the consonant and second vowel gestures is most stable and exhibits least variability when the second vowel gesture provides the temporal reference frame. This work contributes to our understanding of non-linear coarticulatory effects in continuous, and variable, speech.

14:30Speech timing organization for the phonological length contrast in Italian consonants

Claudio Zmarich (CNR-Institute of Cognitive Sciences and Technologies, Padova, Italy)
Barbara Gili Fivela (Università del Salento, Lecce, Italy)
Pascal Perrier (DPC/GIPSA-lab, Grenoble-INP & CNRS, Grenoble, France)
Christophe Savariaux (DPC/GIPSA-lab, Grenoble-INP & CNRS, Grenoble, France)
Graziano Tisato (CNR-Institute of Cognitive Sciences and Technologies, Padova, Italy)

In Italian, length contrast is exploited in the consonant system. Previous articulatory studies have focused on the temporal organization of gestures in Italian geminates and on the lower lip kinematics of the singleton/geminate distinction, and have shown that the time interval between the nuclei of two successive syllables does not depend on the number of intervening consonants (Öhman’s Vowel-to-Vowel model). In this paper, data on lip and tongue gestures from four Italian subjects saying “mima” and “mimma” at fast and comfortable rates of delivery are discussed in order to directly test the validity of Öhman’s model for the gestural organization of Italian geminate consonants.

14:50Timing in Italian VNC sequences at different speech rates

Chiara Celata (Scuola Normale Superiore, Pisa, Italy)
Silvia Calamai (Università di Siena)

This study addresses the question of temporal cohesion in Italian word-medial VNC (vowel-nasal-obstruent) sequences varying in the laryngeal status of the post-nasal C, for two classes of obstruents distinct in terms of place and at three different speech rates. The temporal relations among the obstruent and the two preceding sonorant segments are examined, and variations in speaking tempo are shown to affect the timing pattern of different speech units in different ways. These results support a view of speech timing control in which temporal effects over constituents spanning syllable boundaries are to be combined with the effects observed over traditional syllable-sized units.

15:10Automatic Analysis of Singleton and Geminate Consonant Articulation Using Real-time Magnetic Resonance Imaging

Christina Hagedorn (Department of Linguistics, University of Southern California, USA)
Michael Proctor (Viterbi School of Engineering, University of Southern California, USA, Department of Linguistics, University of Southern California, USA)
Louis Goldstein (Department of Linguistics, University of Southern California, USA)

We explore robust methods of automatically quantifying constriction location, constriction degree and gestural kinematics of Italian short and long consonants using direct image analysis techniques applied to rtMRI data. Articulatory kinematics are estimated from correlated regional changes in pixel intensity. We demonstrate that these methods are capable of quantifying differences in constriction duration exhibited by short and long Italian consonants for labial, coronal and dorsal segments, and differences in constriction degree for labial and coronal consonants. No difference in constriction location is observed for geminates and singletons, while systematic differences in constriction location are observed between (i) coronal oral stops and coronal sonorants and (ii) dorsal stops flanked by vowels differing in backness.

Mon-Ses2-O4:
ASR - Acoustic Models II

Time:Monday 13:30 Place:Michelangelo - Pala Affari - 2nd Floor Type:Oral
Chair:Frank Seide

13:30Conversational Speech Transcription Using Context-Dependent Deep Neural Networks

Frank Seide (Microsoft Research Asia)
Gang Li (Microsoft Research Asia)
Dong Yu (Microsoft Research)

We apply the recently proposed Context-Dependent Deep-Neural-Network HMMs, or CD-DNN-HMMs, to speech-to-text transcription. For single-pass speaker-independent recognition on the 2000 NIST Hub5 phone-call transcription benchmark (Switchboard), the word-error rate is reduced from 23.6%, obtained by discriminatively trained Gaussian-mixture HMMs, to 16.2%, a 31% relative improvement. CD-DNN-HMMs combine classic artificial-neural-network HMMs with traditional tied-state triphones and deep-belief-network pre-training. They had previously been shown to reduce errors by a relative 16% when trained on tens of hours of data using hundreds of tied states. This paper takes CD-DNN-HMMs further and applies them to transcription using over 300 hours of training data, over 9000 tied states, and up to 9 hidden layers, and demonstrates how sparseness can be exploited. On four less well-matched transcription tasks, we observe relative error reductions of 22-28%.
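
As a worked check of the relative improvement quoted above, the short sketch below recomputes it from the two word-error rates given in the abstract. The helper name `relative_reduction` is an illustrative choice, not something from the paper.

```python
def relative_reduction(baseline, improved):
    """Relative reduction of an error rate with respect to the baseline."""
    return (baseline - improved) / baseline


# Word-error rates (in percent) as stated in the abstract.
reduction = relative_reduction(23.6, 16.2)
print(f"{reduction:.1%}")
```

The absolute gain is 7.4 percentage points; dividing by the 23.6% baseline gives roughly 31%, consistent with the abstract.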

13:50Sequential Classification Criteria for NNs in Automatic Speech Recognition

Guangsen Wang (School of Computing, National University of Singapore)
Khe Chai Sim (School of Computing, National University of Singapore)

Neural networks (NNs) are discriminative classifiers which have been successfully integrated with hidden Markov models (HMMs), either in the hybrid NN/HMM or tandem connectionist systems. Typically, the NNs are trained with the frame-based cross-entropy criterion to classify phonemes or phoneme states. However, for word recognition, the word error rate is more closely related to the sequence classification criteria, such as maximum mutual information and minimum phone error. In this paper, the lattice-based sequence classification criteria are used to train the NNs in the hybrid NN/HMM system and the tandem system. A product-of-expert-based factorization and smoothing scheme is proposed for the hybrid system to scale the lattice-based NN training up to 6000 triphone states. Experimental results on the WSJCAM0 reveal that the NNs trained with the sequential classification criterion yield a 24.2% relative improvement compared to the cross-entropy trained NNs for the hybrid system.

14:10Grapheme-Based Automatic Speech Recognition Using KL-HMM

Mathew Magimai.-Doss (Idiap Research Institute, Martigny, Switzerland)
Ramya Rasipuram (Idiap Research Institute, Martigny, Switzerland; Ecole Polytechnique Federale, Lausanne (EPFL), Switzerland)
Guillermo Aradilla (Idiap Research Institute, Martigny, Switzerland)
Herve Bourlard (Idiap Research Institute, Martigny, Switzerland; Ecole Polytechnique Federale, Lausanne (EPFL), Switzerland)

State-of-the-art automatic speech recognition (ASR) systems typically use phonemes as subword units. In this work, we present a novel grapheme-based ASR system that jointly models phoneme and grapheme information using a Kullback-Leibler divergence-based HMM system (KL-HMM). More specifically, the underlying subword unit models are grapheme units and the phonetic information is captured through phoneme posterior probabilities (referred to as posterior features) estimated using a multilayer perceptron (MLP). We investigate the proposed approach for ASR on the English language, where the correspondence between phoneme and grapheme is weak. In particular, we investigate the effect of contextual modeling on the grapheme-based KL-HMM system and the use of an MLP trained on auxiliary data. Experiments on the DARPA Resource Management corpus have shown that the grapheme-based ASR system modeling longer subword unit context can achieve the same performance as a phoneme-based ASR system, irrespective of the data on which the MLP is trained.
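
To make the idea of a Kullback-Leibler local score concrete, here is a minimal sketch. It assumes each HMM state holds a categorical distribution over phonemes and that the state cost for a frame is KL(state || posterior feature); the direction of the divergence, the three-phoneme inventory and all numbers are assumptions for illustration, not details from the abstract.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length.

    `eps` floors both arguments of the logarithm to avoid division by zero.
    """
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0.0
    )


# Hypothetical state distributions over three phoneme classes.
state_a = [0.8, 0.1, 0.1]
state_b = [0.1, 0.1, 0.8]

# Hypothetical MLP phoneme posterior for one frame.
posterior = [0.7, 0.2, 0.1]

cost_a = kl_divergence(state_a, posterior)
cost_b = kl_divergence(state_b, posterior)
best = "state_a" if cost_a < cost_b else "state_b"
print(best)
```

The frame is assigned the lower cost by the state whose distribution resembles the posterior, which is the role the local score plays during decoding.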

14:30Direct Error Rate Minimization of Hidden Markov Models

Joseph Keshet (TTI-Chicago)
Chih-Chieh Cheng (Department of Computer Science and Engineering, University of California, San Diego)
Mark Stoehr (Department of Computer Science, University of Chicago)
David McAllester (TTI-Chicago)

We explore discriminative training of HMM parameters that directly minimizes the expected error rate. In discriminative training one is interested in training a system to minimize a desired error function, like word error rate, phone error rate, or frame error rate. We review a recent method (McAllester, Hazan and Keshet, 2010), which introduces an analytic expression for the gradient of the expected error rate. The analytic expression leads to a perceptron-like update rule, which is adapted here for training of HMMs in an online fashion. While the proposed method can work with any type of error function used in speech recognition, we evaluated it on phoneme recognition of TIMIT, where the desired error function used for training was frame error rate. Except for the case of GMM with a single mixture per state, the proposed update rule provides lower error rates, both in terms of frame error rate and phone error rate, than other approaches, including MCE and large margin.

14:50On the Effectiveness of Statistical Modeling based Template Matching Approach for Continuous Speech Recognition

Xie Sun (University of Missouri)
Xin Chen (University of Missouri)
Yunxin Zhao (University of Missouri)

In this work, we validate the effectiveness of our recently proposed integrated template matching and statistical modeling approach on four baseline systems with increasing phone recognition accuracies in the range of 73% to 78% for the TIMIT task. The four baselines were generated using the methods of 1) Discriminative Training (DT) of Minimum Phone Error (MPE), 2) MFCC concatenated with ensemble Multiple Layer Perceptron (MFCC+EMLP) features, 3) DT combined with the MFCC+EMLP features, and 4) data sampling based ensemble acoustic models integrated with DT and MFCC+EMLP features. Experimental results obtained by carrying out template matching based rescoring on the phone lattices that were generated by the baseline models show that our template matching approach has produced consistent and significant improvements over the four baselines, and the highest recognition accuracy was 79.55% obtained from rescoring the phone lattices produced by the ensemble acoustic model baseline.

15:10Comparison of Smoothing Techniques for Robust Context Dependent Acoustic Modelling in Hybrid NN/HMM Systems

Guangsen Wang (School of Computing, National University of Singapore)
Khe Chai Sim (School of Computing, National University of Singapore)

Hybrid Neural Network/Hidden Markov Model (NN/HMM) systems have been found to yield high quality phone recognition performance. One issue with modelling the Context Dependent (CD) NN/HMM is the robust estimation of the NN parameters to reliably predict the large number of CD state posteriors. Previously, factorization based on conditional probabilities has been commonly adopted to circumvent this problem. This paper proposes two factorization schemes based on the product-of-expert framework, depending on the choice of the experts. In addition, smoothing and interpolation schemes are introduced to improve robustness. Experimental results on the WSJCAM0 reveal that the proposed CD NN/HMM parameter estimation techniques achieve consistent improvement compared to CI hybrid systems. The best hybrid system achieves a 21.7% relative phone error rate reduction and a 17.6% word error reduction compared to a discriminatively trained context dependent triphone GMM/HMM system.

Mon-Ses2-O5:
Robust Speech Recognition II

Time:Monday 13:30 Place:Raffaello - Pala Affari - 3rd Floor Type:Oral
Chair:Maurizio Omologo

13:30Propagation of Uncertainty through Multilayer Perceptrons for Robust Automatic Speech Recognition

Ramón Fernandez Astudillo (Spoken Language Laboratory, INESC-ID-Lisboa)
Joao Paulo da Silva Neto (Spoken Language Laboratory, INESC-ID-Lisboa)

Observation uncertainty techniques offer a way to dynamically compensate automatic speech recognizers to account for the information missing in real world scenarios. These techniques have been demonstrated to compensate effectively for multiple environment distortions and improve the integration of ASR systems with speech enhancement pre-processing through uncertainty propagation. Unfortunately, observation uncertainty techniques rely on statistical methods and as such are limited to GMM-HMM architectures. In this paper we explore the application of observation uncertainty and uncertainty propagation techniques to multi-layer perceptrons (MLPs). We develop solutions for propagation through a generic MLP and exemplify potential gains with a large vocabulary robust ASR experiment on the AURORA4 database using a hybrid MLP-HMM recognizer.

13:50Mapping Sparse Representation to State Likelihoods in Noise-Robust Automatic Speech Recognition

Katariina Mahkonen (Tampere University of Technology)
Antti Hurmalainen (Tampere University of Technology)
Tuomas Virtanen (Tampere University of Technology)
Jort Gemmeke (Radboud University Nijmegen)

This paper proposes learning-based methods for mapping a sparse representation of noisy speech to state likelihoods in an automatic speech recognition system. We represent speech as a sparse linear combination of exemplars extracted from training data. The weights of exemplars are mapped to speech state likelihoods using Ordinary Least Squares (OLS) and Partial Least Squares (PLS) regression. Recognition experiments are conducted using the CHiME noisy speech database. According to the results, both algorithms can be successfully used for training the mapping. We achieve improvements over the previous binary labeling system, and recognition scores close to 70% at -6dB SNR.

14:10Uncertainty measures for improving exemplar-based source separation

Heikki Kallasjoki (Adaptive Informatics Research Centre, Aalto University, Finland)
Ulpu Remes (Adaptive Informatics Research Centre, Aalto University, Finland)
Jort F. Gemmeke (Centre for Language and Speech Technology, Radboud University Nijmegen, The Netherlands)
Tuomas Virtanen (Department of Signal Processing, Tampere University of Technology, Finland)
Kalle J. Palomäki (Adaptive Informatics Research Centre, Aalto University, Finland)

This work studies the use of observation uncertainty measures for improving the speech recognition performance of an exemplar-based source separation based front end. To generate the observation uncertainty estimates for the enhanced features, we propose the use of heuristic methods based on the sparse representation of the noisy signal in the exemplar-based source separation algorithm. The effectiveness of the proposed measures is evaluated in a large vocabulary noisy speech recognition task. The best proposed measure achieved relative error reductions up to 18% over the baseline feature enhancement method without uncertainty measures.

14:30Maximum Confidence Measure Based Interaural Phase Difference Estimation for Noise Masking in Dual-Microphone Robust Speech Recognition

Hsien-Cheng Liao (Information and Communications Research Labs, Industrial Technology Research Institute, Taiwan)
Yuan-Fu Liao (Department of Electronic Engineering, National Taipei University of Technology, Taiwan)
Chin-Hui Lee (School of Electrical and Computer Engineering, Georgia Institute of Technology, USA)

A new one-stage maximum confidence measure (MCM) based interaural phase difference estimation framework for noise masking is proposed to closely integrate the underlying speech models into dual-microphone array noise filtering for robust speech recognition. The main ideas are: (1) utilizing both the speech and filler models of the recognizer to feed back confidence measures (CMs) that indicate the degree of separation between filtered speech and interference noises, and (2) automatically optimizing the parameters of the microphone array with an expectation maximization (EM) algorithm based on the proposed MCM criterion. Experimental results on a Mandarin voice command task show that the proposed approach significantly improves the final speech recognition rates. Moreover, the observed performance degradation is usually graceful under low signal-to-noise ratios (SNRs) and close interference noise conditions.

14:50A Performance Monitoring Approach to Fusing Enhanced Spectrogram Channels in Robust Speech Recognition

Shirin Badiezadegan (McGill University)
Richard Rose (McGill University)

An implementation of a performance monitoring approach to feature channel integration in robust automatic speech recognition is presented. Motivated by psychophysical evidence in human speech perception, the approach combines multiple feature channels using a closed loop criterion relating to the overall performance of the system. The multiple feature channels correspond to an ensemble of reconstructed spectrograms generated by applying multiresolution discrete wavelet transform analysis-synthesis filter-banks to corrupted speech spectrograms. The spectrograms associated with these feature channels differ in the degree to which information has been suppressed in multiple scales and frequency bands. The performance of this approach is evaluated in both the Aurora 2 and the Aurora 3 speech in noise task domains.

15:10Generalized Variable Parameter HMMs for Noise Robust Speech Recognition

Ning Cheng (Chinese Academy of Sciences)
Xunying Liu (Cambridge University)
Lan Wang (Chinese Academy of Sciences)

Handling variable, non-stationary ambient noise is a challenging task for automatic speech recognition (ASR) systems. To address this issue, multi-style, noise condition independent (CI) model training using speech data collected in diverse noise environments, or uncertainty decoding techniques can be used. An alternative approach is to explicitly approximate the continuous trajectory of Gaussian component mean and variance parameters against the varying noise level, for example, using variable parameter HMMs (VP-HMM). This paper investigates a more generalized form of variable parameter HMMs (GVP-HMM). In addition to Gaussian component means and variances, it can also provide a more compact trajectory modelling for tied linear transformations. An alternative noise condition dependent (CD) training algorithm is also proposed to handle the bias to training noise condition distribution. Consistent error rate gains were obtained over conventional VP-HMM mean and variance only trajectory modelling on a medium vocabulary Mandarin Chinese in-car navigation command recognition task.

Mon-Ses2-P1:
Source Separation and Speech Enhancement

Time:Monday 13:30 Place:Valfonda 1 - Pala Congressi (Passi Perduti-Gallery) Type:Poster
Chair:Marco Matassoni

#1Monaural Voiced Speech Segregation Based on Pitch and Comb Filter

Xueliang Zhang (Computer Science Department, Inner Mongolia University)
Wenju Liu (National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences)

The correlogram is an important mid-level representation for periodic sounds which is widely used in sound source separation and pitch detection. However, it is very time consuming to compute. In this paper, we present a novel scheme for monaural voiced speech separation without computing correlograms. The noisy speech is first decomposed into time-frequency units. The pitch contour of the target speech is extracted according to the zero crossing rate of the units. Then we apply a comb filter to label each unit as target speech or intrusion. Compared with the previous correlogram-based method, the proposed algorithm saves computing time and also yields better performance.
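
The link between zero crossing rate and periodicity mentioned above can be sketched as follows. This only illustrates the general principle that a narrow-band periodic signal crosses zero twice per period; it is not the authors' pitch extraction algorithm, and the function name, sample rate and 200 Hz test tone are assumptions chosen for the example.

```python
import math


def zero_crossing_frequency(samples, sample_rate):
    """Estimate the dominant frequency (Hz) from the zero crossing count.

    Assumes a roughly sinusoidal signal, which crosses zero twice per period.
    """
    crossings = sum(
        1
        for a, b in zip(samples, samples[1:])
        if (a < 0) != (b < 0)
    )
    duration = (len(samples) - 1) / sample_rate
    return crossings / (2.0 * duration)


# Synthetic 200 Hz tone, 0.1 s at 16 kHz; the small phase offset keeps
# samples from landing exactly on zero.
sample_rate = 16000
tone = [
    math.sin(2 * math.pi * 200 * n / sample_rate + 0.1)
    for n in range(1600)
]

estimate = zero_crossing_frequency(tone, sample_rate)
print(f"estimated frequency: {estimate:.0f} Hz")
```

Counting sign changes costs one pass over the samples, which is the kind of saving over an autocorrelation-based correlogram that the abstract alludes to.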

#2Fast and simple iterative algorithm of Lp-norm minimization for under-determined speech separation

Yasuharu Hirasawa (Kyoto University)
Naoki Yasuraoka (Kyoto University)
Toru Takahashi (Kyoto University)
Tetsuya Ogata (Kyoto University)
Hiroshi G. Okuno (Kyoto University)

This paper presents an efficient algorithm to solve the Lp-norm minimization problem for under-determined speech separation; that is, for the case in which there are more sound sources than microphones. We employ an auxiliary function method in order to derive update rules under the assumption that the amplitude of each sound source follows a generalized Gaussian distribution. Experiments reveal that our method solves the L1-norm minimization problem ten times faster than a general solver, and also solves the Lp-norm minimization problem efficiently, especially when the parameter p is small; when p is not more than 0.7, it runs in real-time without loss of separation quality.

#3Monaural Speech Separation Based on a 2D Processing and Harmonic Analysis

Azam Rabiee (Dept. of Computer, Islamic Azad University, Dolatabad Branch, Isfahan, Iran)
Saeed Setayeshi (Faculty of Nuclear Engineering and Physics, Amirkabir University of Technology, Iran)
Soo-Young Lee (Dept. of Electrical Engineering, Korea Advanced Institute of Science & Technology, Korea)

This paper proposes a new Computational Auditory Scene Analysis (CASA) approach based on a 2D spectro-temporal analysis and harmonic separation. The 2D processing, the so-called Grating Compression Transform (GCT), analyzes the spectro-temporal content of the spectrogram, mimicking the processing of the primary auditory cortex. The estimated pitches from the GCT analysis are used for separation using harmonic magnitude suppression (HMS). A powerful aspect of our model is that it requires no prior training on a specific training corpus. A baseline system based on the harmonic separation is designed for comparison. Since the baseline system is similar to the proposed system except for the auditory-cortex-like analysis, the SIR results illustrate the importance of that analysis in this task.

#4Underdetermined Blind Source Separation with Fuzzy Clustering for Arbitrarily Arranged Sensors

Ingrid Jafari (Department of Electrical, Electronic and Computer Engineering, The University of Western Australia, Australia)
Serajul Haque (Department of Electrical, Electronic and Computer Engineering, The University of Western Australia, Australia)
Roberto Togneri (Department of Electrical, Electronic and Computer Engineering, The University of Western Australia, Australia)
Sven Nordholm (Department of Electrical and Computer Engineering, Curtin University, Australia)

Recently, time-frequency masking has developed into an important approach to the blind source separation (BSS) problem, particularly in the presence of reverberation. However, previous research has been limited by factors such as the sensor arrangement and/or the mask estimation technique implemented. This paper presents a novel integration of two established approaches to BSS in an effort to overcome such limitations. A multidimensional feature vector is extracted from a non-linear sensor arrangement, and the fuzzy c-means algorithm is then applied to cluster the feature vectors into representations of the source speakers. Fuzzy time-frequency masks are estimated and applied to the observations for source recovery. Evaluations of the proposed approach demonstrated improved separation quality over all test conditions. This establishes the potential of multidimensional fuzzy c-means clustering for mask estimation in the context of blind source separation.
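The clustering step named in this abstract can be illustrated with a minimal sketch of the textbook fuzzy c-means updates. It is not the authors' implementation: the function name, the explicit `init_centers` argument, the fuzzifier value `m=2.0` and the toy 2-D points are all assumptions made for illustration, and the multidimensional feature extraction from the sensor signals is not shown. The returned memberships play the role of the fuzzy masks, one row per time-frequency point.

```python
import math


def fuzzy_c_means(points, init_centers, m=2.0, n_iter=50):
    """Textbook fuzzy c-means (illustrative sketch, not the paper's code).

    points       : list of feature vectors (tuples or lists of floats)
    init_centers : initial cluster centres, one per source (assumed given)
    m            : fuzzifier, must be > 1 (assumed value)
    Returns (centers, memberships); memberships[k][i] is the degree to
    which point k belongs to cluster i, and each row sums to 1.
    """
    centers = [list(c) for c in init_centers]
    dim = len(points[0])
    exponent = 2.0 / (m - 1.0)
    memberships = []
    for _ in range(n_iter):
        # Membership update: u_ki = 1 / sum_j (d_ki / d_kj)^(2 / (m - 1)).
        memberships = []
        for p in points:
            dists = [max(math.dist(p, c), 1e-12) for c in centers]
            row = [1.0 / sum((d_i / d_j) ** exponent for d_j in dists)
                   for d_i in dists]
            memberships.append(row)
        # Centre update: mean of the points weighted by u_ki^m.
        for i in range(len(centers)):
            weights = [row[i] ** m for row in memberships]
            total = sum(weights)
            centers[i] = [sum(w * p[d] for w, p in zip(weights, points)) / total
                          for d in range(dim)]
    return centers, memberships


# Toy usage: two well-separated groups of 2-D feature vectors.
toy_points = [(0.0, 0.0), (0.1, 0.2), (0.2, 0.1),
              (5.0, 5.0), (5.1, 4.9), (4.9, 5.2)]
toy_centers, toy_masks = fuzzy_c_means(toy_points, [(0.0, 0.0), (5.0, 5.0)])
```

Unlike hard clustering, every point keeps a graded membership in every cluster, which is what allows a soft rather than binary mask.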

#5On Initial Seed Selection for Frequency Domain Blind Speech Separation

Dang Hai Tran Vu (University of Paderborn)
Reinhold Haeb-Umbach (University of Paderborn)

In this paper we address the problem of initial seed selection for frequency domain iterative blind speech separation (BSS) algorithms. The derivation of the seeding algorithm is guided by the goal to select samples which are likely to be caused by source activity and not by noise and at the same time originate from different sources. The proposed algorithm has moderate computational complexity and finds better seed values than alternative schemes, as is demonstrated by experiments on the database of the SiSEC2010 challenge.

#6Spatial filter calibration based on minimization of modified LSD

Nobuaki Tanaka (Waseda University)
Tetsuji Ogawa (Waseda University)
Tetsunori Kobayashi (Waseda University)

A new sound source separation method has been developed that is robust against individual variability in microphones and acoustic lines. A specific area containing a target sound source is enhanced by using a spatial filter built by time-frequency masking. However, there is a strong likelihood that the spatial filters will be distorted by individual variability in microphone characteristics and acoustic lines. To solve this problem, calibration of the shapes of these spatial filters was attempted using a modified log-spectral distance (MLSD) minimization criterion, which uses utterances made by each individual (i.e., a sound source) at the desired positions. The effectiveness of this spatial filter calibration was verified in speech recognition experiments; MLSD-based calibration yielded fewer word errors than either no calibration or calibration using other criteria.

#7Probabilistic Spectrum Envelope: Categorized Audio-features Representation for NMF-based Sound Decomposition

Toru Nakashika (Kobe University)
Tetsuya Takiguchi (Kobe University)
Yasuo Ariki (Kobe University)

NMF (Non-negative Matrix Factorization) has been one of the most useful techniques for audio signal analysis in recent years. In particular, supervised NMF, in which a large number of samples are used for analyzing a signal, is garnering much attention in sound source separation and noise reduction research. However, because such methods require all the possible samples for the analysis, it is hard to build a practical system based on them. In this paper, we propose a novel method of signal analysis that combines NMF with a probabilistic approach. In this approach, it is assumed that each audio-source category (such as phonemes or musical instruments) has an environment-invariant feature, called the Probabilistic Spectrum Envelope (PSE). First, the PSE of each category is learned using a technique based on Gaussian Process Regression. Then, the observed spectrum is analyzed by a combination of supervised NMF and a Genetic Algorithm with the pre-trained PSEs.

#8A high resolution multiple source localization by generalized cumulant structure (GCS) matrix

Jinho Choi (KAIST)
Chang D. Yoo (KAIST)

This paper considers a high-resolution multiple non-stationary and non-Gaussian source localization algorithm based on the proposed generalized cumulant structure (GCS) matrix, which is constructed as a weighted sum of the second and fourth order cumulants of the sensor signals. The weight determines the rank and range space of the GCS matrix, and the range space of the GCS matrix should be the same as the range space of the virtual array manifold matrix in order to estimate the true directions of arrival (DOAs) of the sources. To estimate the weight and the DOAs of the sources, a rank-constrained optimization problem is formulated. The optimal solution is computationally heavy, and for this reason a suboptimal solution is considered. With the weight set to an arbitrary value, singular value decomposition of the GCS matrix is performed to determine the singular matrix associated with the null space of the virtual array response matrix, and either this singular matrix or the singular matrix obtained using only the second order (SO) statistic is used to obtain the proposed spatial spectrum. Experimental results show that the proposed algorithm performs better than the recently proposed SO cumulant based algorithm for synthetic and real speech data.

#9Single channel speech music separation using nonnegative matrix factorization with sliding window and spectral masks

Emad M. Grais (Faculty of Engineering and Natural Sciences, Sabanci University, Istanbul, Turkey.)
Hakan Erdogan (Faculty of Engineering and Natural Sciences, Sabanci University, Istanbul, Turkey.)

A single channel speech-music separation algorithm based on nonnegative matrix factorization (NMF) with sliding windows and spectral masks is proposed in this work. We train a set of basis vectors for each source signal using NMF in the magnitude spectral domain. Rather than forming the columns of the matrices to be decomposed by NMF of a single spectral frame, we build them with multiple spectral frames stacked in one column. After observing the mixed signal, NMF is used to decompose its magnitude spectra into a weighted linear combination of the trained basis vectors for both sources. An initial spectrogram estimate for each source is found, and a spectral mask is built using these initial estimates. This mask is used to weight the mixed signal spectrogram to find the contributions of each source signal in the mixed signal. The method is shown to perform better than the conventional NMF approach.
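The final masking step described in this abstract can be sketched as follows. The abstract does not give the exact form of the mask, so the ratio mask below (with an assumed exponent `power`) is only one common choice, and the function and argument names are hypothetical. The initial per-source spectrogram estimates are taken as given; the NMF decomposition that would produce them is not shown.

```python
def soft_mask_separate(mix, est_a, est_b, power=2.0, eps=1e-12):
    """Weight a mixture magnitude spectrogram with a ratio mask.

    mix, est_a, est_b : lists of rows (frequency bins) of non-negative floats
                        with identical shapes; est_a and est_b are the initial
                        estimates of the two sources (assumed available).
    power             : mask exponent (assumed; the abstract does not fix it).
    Returns (out_a, out_b), whose element-wise sum equals `mix`.
    """
    out_a, out_b = [], []
    for mix_row, a_row, b_row in zip(mix, est_a, est_b):
        row_a, row_b = [], []
        for x, a, b in zip(mix_row, a_row, b_row):
            wa = a ** power
            wb = b ** power
            mask = wa / (wa + wb + eps)       # share of the bin given to source A
            row_a.append(mask * x)
            row_b.append((1.0 - mask) * x)    # remainder goes to source B
        out_a.append(row_a)
        out_b.append(row_b)
    return out_a, out_b


# Toy usage: one frequency bin, two frames.
speech_hat, music_hat = soft_mask_separate([[4.0, 2.0]],
                                           [[3.0, 0.0]],
                                           [[1.0, 5.0]])
```

Because the two masks sum to one in every bin, the separated spectrograms always add back up to the observed mixture, which the raw initial estimates need not do.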

#10Perceptually-inspired Processing for Multichannel Wiener Filter

Jorge I. Marin (Georgia Institute of Technology, Atlanta, GA, USA)
David V. Anderson (Georgia Institute of Technology, Atlanta, GA, USA)

Binaural noise-reduction techniques based on the Multichannel Wiener filter (MWF) have been reported as promising candidates for use in binaural hearing aids because of their effective SNR improvement at any direction of arrival of the target signal and their preservation of localization cues. There are different MWF techniques derived in the FFT domain. The use of FFT-based processing involves two important challenges for the real-time implementation of these techniques in a digital hearing aid: high computational cost and processing delay. To reduce computational cost and processing delay without degrading the SNR improvement and sound quality, this paper proposes the use of an auditory representation instead of an FFT representation. The proposed processing shows significant advantages over FFT-based processing: reduction of the computational cost and processing delay, and improvement of the output SNR and sound quality.

#11Speech recognition in mixed sound of speech and music based on vector quantization and non-negative matrix factorization

Shoichi Nakano (Department of Computer Science and Engineering, Toyohashi University of Technology)
Kazumasa Yamamoto (Department of Computer Science and Engineering, Toyohashi University of Technology)
Seiichi Nakagawa (Department of Computer Science and Engineering, Toyohashi University of Technology)

This paper describes a speech recognition method for mixed sound, consisting of speech and music, that removes the music only based on vector quantization (VQ) and non-negative matrix factorization (NMF). For isolated word recognition using the clean speech model, an improvement of about 15% was obtained compared with the case of not removing music. Furthermore, a high recognition rate of about 90% was achieved, even under the 0 dB condition using a model trained from the mixed sound after removing the music according to the VQ method.

#12Reduction of Highly Nonstationary Ambient Noise by Integrating Spectral and Locational Characteristics of Speech and Noise

Tomohiro Nakatani (NTT Corporation)
Shoko Araki (NTT Corporation)
Marc Delcroix (NTT Corporation)
Takuya Yoshioka (NTT Corporation)
Masakiyo Fujimoto (NTT Corporation)

This paper proposes a new multi-channel noise reduction approach that can appropriately handle highly nonstationary noise based on the spectral and locational features of speech and noise. We focus on a distant talking scenario, where a 2-ch microphone array receives a target speaker’s voice from the front while it receives highly nonstationary ambient noise from any direction. To cope well with this scenario, we introduce prior training not only for the spectral features of speech and noise but also for their locational features, and utilize them in a unified manner. The proposed method can distinguish rapid changes in speech and noise based mainly on their locational features, while it can reliably estimate the spectral shapes of the speech based largely on the spectral features. A filter-bank based implementation is also discussed to enable the proposed method to work in real time. Experiments using the PASCAL CHiME separation and recognition challenge task show the superiority of the proposed method as regards both speech quality and automatic speech recognition performance.

#13Voice processing by dynamic glottal models with applications to speech enhancement

Carlo Drioli (University of Verona)
Andrea Calanca (University of Verona)

We discuss the use of low-dimensional physical models of the voice source for speech coding and processing applications. A class of waveform-adaptive dynamic glottal models and parameter tracking procedures are illustrated. The model and analysis procedures are assessed by addressing speech encoding and enhancement, achievable by using a state space version of the dynamical model in an extended Kalman filtering framework. The proposed method is shown to provide better SNR improvement than a standard AR Kalman filtering scheme.

#14Supervised Sparse Coding Strategy in Cochlear Implants

Jinqiu Sang (Institute of Sound and Vibration Research, University of Southampton)
Guoping Li (Institute of Sound and Vibration Research, University of Southampton)
Hongmei Hu (Institute of Sound and Vibration Research, University of Southampton)
Mark E Lutman (Institute of Sound and Vibration Research, University of Southampton)
Stefan Bleeck (Institute of Sound and Vibration Research, University of Southampton)

In this paper we explore how to improve a sparse coding (SC) strategy that was successfully used to improve subjective speech perception in noisy environments in cochlear implants. On the basis of the existing unsupervised algorithm, we developed an enhanced supervised SC strategy, using the SC shrinkage (SCS) principle. The new algorithm is implemented at the stage of the spectral envelopes after the signal separation in a 22-channel filter bank. SCS can extract and transmit the most important information from noisy speech. The new algorithm is compared with the unsupervised algorithm for speech in babble and white noise (signal-to-noise ratios, SNR = 10 dB, 5 dB, 0 dB) using objective measures in a cochlear implant simulation. Results show that the supervised SC strategy performs better in white noise, but not significantly better in babble noise.

Mon-Ses2-P2:
HMM-based Speech Synthesis II

Time:Monday 13:30 Place:Valfonda 2 - Pala Congressi (Passi Perduti-Gallery) Type:Poster
Chair:Tomoki Toda

#1Continuous Control of the Degree of Articulation in HMM-based Speech Synthesis

Benjamin Picart (TCTS Lab, Faculté Polytechnique (FPMs), University of Mons (UMons), Belgium)
Thomas Drugman (TCTS Lab, Faculté Polytechnique (FPMs), University of Mons (UMons), Belgium)
Thierry Dutoit (TCTS Lab, Faculté Polytechnique (FPMs), University of Mons (UMons), Belgium)

This paper focuses on the implementation of a continuous control of the degree of articulation (hypo/hyperarticulation) in the framework of HMM-based speech synthesis. The adaptation of a neutral speech synthesizer to generate hypo and hyperarticulated speech using a limited amount of speech data is first studied. This is done using inter-speaker voice adaptation techniques, applied here to intra-speaker voice adaptation. The implementation of a continuous control of the degree of articulation is then proposed in a second step. Finally, a subjective evaluation shows that good quality neutral/hypo/hyperarticulated speech, and also any intermediate, interpolated or extrapolated articulation degrees, can be obtained from an HMM-based speech synthesizer.

#2Estimation of Window Coefficients for Dynamic Feature Extraction for HMM based Speech Synthesis

Ling-Hui Chen (iFLYTEK Speech Lab, University of Science and Technology of China, Hefei, P.R. China)
Yoshihiko Nankaku (Nagoya Institute of Technology, Nagoya, Japan)
Heiga Zen (Nagoya Institute of Technology, Nagoya, Japan)
Keiichi Tokuda (Nagoya Institute of Technology, Nagoya, Japan)
Zhen-Hua Ling (iFLYTEK Speech Lab, University of Science and Technology of China, Hefei, P.R. China)
Li-Rong Dai (iFLYTEK Speech Lab, University of Science and Technology of China, Hefei, P.R. China)

In standard approaches to hidden Markov model (HMM)-based speech synthesis, window coefficients for calculating dynamic features are pre-determined and fixed. This may not be optimal for capturing the various context-dependent dynamic characteristics of speech signals. This paper proposes a data-driven technique to estimate the window coefficients. They are optimized so as to maximize the likelihood of trajectory HMMs given the data. Experimental results show that the proposed technique can achieve performance comparable to that of the mean- and variance-updated trajectory HMMs in the naturalness of synthesized speech, while offering significantly lower computational cost.
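To make the role of the window coefficients concrete, the sketch below computes a dynamic feature as a weighted sum of neighbouring static values, using the fixed windows that are conventional in HMM-based synthesis. It shows only the pre-determined baseline, not the estimation technique proposed in the paper; the function name and the edge handling (repeating the first and last frame) are assumptions.

```python
def dynamic_features(static, window):
    """Apply a dynamic-feature window to a 1-D static parameter track.

    static : list of floats, one value per frame
    window : odd-length list of coefficients centred on the current frame
    Frames beyond the ends are replaced by the nearest valid frame
    (an assumed edge convention).
    """
    half = len(window) // 2
    last = len(static) - 1
    out = []
    for t in range(len(static)):
        acc = 0.0
        for k, w in enumerate(window):
            idx = min(max(t + k - half, 0), last)  # clamp to valid frames
            acc += w * static[idx]
        out.append(acc)
    return out


DELTA_WINDOW = [-0.5, 0.0, 0.5]        # conventional first-order window
DELTA_DELTA_WINDOW = [1.0, -2.0, 1.0]  # conventional second-order window

track = [0.0, 1.0, 2.0, 3.0]
delta = dynamic_features(track, DELTA_WINDOW)              # [0.5, 1.0, 1.0, 0.5]
delta_delta = dynamic_features(track, DELTA_DELTA_WINDOW)  # [1.0, 0.0, 0.0, -1.0]
```

A data-driven approach of the kind described above would treat the entries of `window` as free parameters rather than constants.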

#3Inverse Filtering Based Harmonic plus Noise Excitation Model for HMM-based Speech Synthesis

Zhengqi Wen (National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences)
Jianhua Tao (National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences)

In this paper, a new Voicing Cut-Off Frequency (VCO) estimation method based on inverse filtering is presented. The spectrum of the residual signal obtained by inverse filtering is split into sub-bands, which are clustered into two classes using the K-means algorithm. The Viterbi algorithm is then used to search for a smoothed VCO contour. Based on this new VCO estimation method, an adaptation of the Harmonic plus Noise Model is also proposed to reconstruct the residual signal with both harmonic and noise components. The proposed excitation model can reduce the buzziness of speech generated by conventional vocoders that use a simple pulse train, and has been integrated into an HMM-based speech synthesis system (HTS). A listening test showed that HTS with our new method gives better synthesized speech quality than the traditional HTS, which uses only a simple pulse train excitation model.

#4Improved HNM-based Vocoder for Statistical Synthesizers

Daniel Erro (University of the Basque Country (UPV/EHU))
Iñaki Sainz (University of the Basque Country (UPV/EHU))
Eva Navas (University of the Basque Country (UPV/EHU))
Inma Hernaez (University of the Basque Country (UPV/EHU))

Statistical parametric synthesizers have achieved very good performance scores in recent years. Nevertheless, as they require the use of vocoders to parameterize speech (during training) and to reconstruct waveforms (during synthesis), the speech generated from statistical models lacks some degree of naturalness. In previous work we explored the usefulness of the harmonics plus noise model in the design of a high-quality speech vocoder. Quite promising results were achieved when this vocoder was integrated into a synthesizer. In this paper, we describe some recent improvements related to the excitation parameters, particularly the so-called maximum voiced frequency. Its estimation and explicit modelling lead to even better synthesis performance, as confirmed by subjective comparisons with other well-known methods.

#5A Statistical Phrase/Accent Model for Intonation Modeling

Gopala Krishna Anumanchipalli (Language Technologies Institute, Carnegie Mellon University, USA; INESC-ID/IST Lisboa, Portugal)
Luís C. Oliveira (INESC-ID/IST Lisboa, Portugal)
Alan W Black (Language Technologies Institute, Carnegie Mellon University, USA)

This paper proposes a statistical phrase/accent model of voice fundamental frequency (F0) for speech synthesis. It presents an approach for automatic extraction and modeling of phrase and accent phenomena from F0 contours by taking into account their overall trends in the training data. An iterative optimization algorithm is described to extract these components, minimizing the reconstruction error of the F0 contour. This method of modeling local and global components of F0 separately is shown to be better than conventional F0 models used in Statistical Parametric Speech Synthesis (SPSS). Perceptual evaluations confirm that the proposed model is significantly better than baseline SPSS F0 models in 3 prosodically diverse tasks -- read speech, radio broadcast speech and audio book speech.

#6Intermediate-State HMMs to Capture Continuously-Changing Signal Features

Gustav Eje Henter (Sound and Image Processing Laboratory, KTH – Royal Institute of Technology, Stockholm, Sweden)
W. Bastiaan Kleijn (School of Engineering and Computer Science, Victoria University of Wellington, Wellington, New Zealand)

Traditional discrete-state HMMs are not well suited for describing steadily evolving, path-following natural processes like motion capture data or speech. HMMs cannot represent incremental progress between behaviors, and sequences sampled from the models have unnatural segment durations, unsmooth transitions, and excessive rapid variation. We propose to address these problems by permitting the state variable to occupy positions between the discrete states, and present a concrete left-right model incorporating this idea. We call such models intermediate-state HMMs. The state evolution remains Markovian. We describe training using the generalized EM algorithm and present the associated update formulas. An experiment shows that the intermediate-state model is capable of gradual transitions, with more natural durations and less noise in sampled sequences compared to a conventional HMM.

#7Automatic sentence selection from speech corpora including diverse speech for improved HMM-TTS synthesis quality

Norbert Braunschweiler (Toshiba Research Europe Ltd., Cambridge Research Laboratory, United Kingdom)
Sabine Buchholz (Toshiba Research Europe Ltd., Cambridge Research Laboratory, United Kingdom)

Using publicly available audiobooks for HMM-TTS poses new challenges. This paper addresses the issue of diverse speech in audiobooks. The aim is to identify diverse speech likely to have a negative effect on HMM-TTS quality. Manual removal of diverse speech was found to yield better synthesis quality despite halving the training corpus. To handle large amounts of data an automatic approach is proposed. The approach uses a small set of acoustic and text based features. A series of listening tests showed that the manual selection is most preferred, while the automatic selection showed significant preference over the full training set.

#8Phonological Knowledge Guided HMM State Mapping for Cross-Lingual Speaker Adaptation

Hui Liang (Idiap Research Institute & École Polytechnique Fédérale de Lausanne, Switzerland)
John Dines (Idiap Research Institute, Martigny, Switzerland)

Within the HMM state mapping-based cross-lingual speaker adaptation framework, the minimum Kullback-Leibler divergence criterion has been typically employed to measure the similarity of two average voice state distributions from two respective languages for state mapping construction. Considering that this simple criterion doesn't take any language-specific information into account, we propose a data-driven, phonological knowledge guided approach to strengthen the mapping construction - state distributions from the two languages are clustered according to broad phonetic categories using decision trees and mapping rules are constructed only within each of the clusters. Objective evaluation of our proposed approach demonstrates reduction of mel-cepstral distortion and that mapping rules derived from a single training speaker generalize to other speakers, with subtle improvement being detected during subjective listening tests.

#9Reformulating Prosodic Break Model into Segmental HMMs and Information Fusion

Nicolas Obin (IRCAM)
Pierre Lanchantin (IRCAM)
Anne Lacheret (Modyco Lab.)
Xavier Rodet (IRCAM)

In this paper, a method for prosodic break modelling based on segmental-HMMs and Dempster-Shafer fusion for speech synthesis is presented, and the relative importance of linguistic and metric constraints in prosodic break modelling is assessed. A context-dependent segmental-HMM is used to explicitly model the linguistic and the metric constraints. Dempster-Shafer fusion is used to balance the relative importance of the linguistic and the metric constraints in the segmental-HMM. A linguistic processing chain based on surface and deep syntactic parsing is additionally used to extract linguistic information of different kinds. An objective evaluation provided evidence that the optimal combination of the linguistic and the metric constraints significantly outperforms both the conventional HMM (linguistic information only) and the segmental-HMM (equal balance of linguistic and metric constraints), and confirmed that the linguistic constraint takes precedence over the metric constraint.

#9Multipulse Sequences for Residual Signal Modeling

Ranniery Maia (Toshiba Research Europe Limited, Cambridge Research Laboratory, UK)
Heiga Zen (Toshiba Research Europe Limited, Cambridge Research Laboratory, UK)
Kate Knill (Toshiba Research Europe Limited, Cambridge Research Laboratory, UK)
Mark Gales (Toshiba Research Europe Limited, Cambridge Research Laboratory, UK)
Sabine Buchholz (Toshiba Research Europe Limited, Cambridge Research Laboratory, UK)

In source-filter models of speech production, the residual signal contains important information for the generation of natural-sounding re-synthesized speech. Typically, the voiced regions of residual signals are regarded as a mixture of glottal pulse and noise. This paper introduces a novel approach to represent the noise component of voiced regions of residual signals through autoregressive filtering of multipulse sequences. The positions and amplitudes of the non-zero samples of these multipulse signals are optimized through a closed-loop procedure. The method is applied to excitation modeling in statistical parametric synthesis. Experimental results indicate that the use of multipulse-based noise component construction eliminates the need for run-time ad hoc procedures such as high-pass filtering and time modulation, which are common in excitation models for statistical parametric synthesizers, with no loss of synthesized speech quality.

#10Can Objective Measures Predict the Intelligibility of Modified HMM-based Synthetic Speech in Noise?

Cassia Valentini-Botinhao (Centre for Speech Technology Research, University of Edinburgh, UK)
Junichi Yamagishi (Centre for Speech Technology Research, University of Edinburgh, UK)
Simon King (Centre for Speech Technology Research, University of Edinburgh, UK)

Synthetic speech can be modified to improve intelligibility in noise. In order to perform modifications automatically, it would be useful to have an objective measure that could predict the intelligibility of modified synthetic speech for human listeners. We analysed the impact on intelligibility – and on how well objective measures predict it – when we separately modify speaking rate, fundamental frequency, line spectral pairs and spectral peaks. Shifting LSPs can increase intelligibility for human listeners; other modifications had weaker effects. Among the objective measures we evaluated, the Dau model and the Glimpse proportion were the best predictors of human performance.

#11Speech Synthesis based on Articulatory-Movement HMMs with Voice-source Codebook

Tsuneo Nitta (Toyohashi University of Technology)
Takayuki Onoda (Toyohashi University of Technology)
Masashi Kimura (Toyohashi University of Technology)
Yurie Iribe (Toyohashi University of Technology)
Kouichi Katsurada (Toyohashi University of Technology)

This paper describes speech synthesis based on a single model of articulatory-movement HMMs that is shared by speech recognition (SR) and speech synthesis (SS). In the SS module, speaker-invariant HMMs are applied to generate an articulatory feature (AF) sequence, and then, after converting the AFs into vocal tract parameters by using a multi-layer neural network (MLN), a speech signal is synthesized by an LSP digital filter. A CELP coding technique is applied to improve sound quality when generating the voice source from embedded codes in the corresponding state of the HMMs. The proposed speech synthesis system separates phonetic information from speaker individuality. Therefore, a target speaker's voice can be synthesized with a small amount of speech data. In the experiments, we carried out listening tests with 10 subjects and evaluated both the sound quality and the individuality of the synthesized speech. As a result, we confirmed that the proposed synthesis system can produce good quality speech of a target speaker after training with only two sentences.

#12Large-scale Subjective Evaluations of Speech Rate Control Methods for HMM-based Speech Synthesizers

Tsuneo Kato (KDDI R&D Laboratories Inc.)
Makoto Yamada (KDDI R&D Laboratories Inc.)
Nobuyuki Nishizawa (KDDI R&D Laboratories Inc.)
Keiichiro Oura (Nagoya Institute of Technology)
Keiichi Tokuda (Nagoya Institute of Technology)

In this paper, we compared three speech rate control methods for HMM-based speech synthesis by large-scale subjective evaluations. The methods are 1) synthesizing by HMMs trained from corpora at a target speech rate, 2) stretching or shrinking utterance durations proportionally in waveform generation, and 3) determining state durations by the ML criterion under a constraint on utterance duration. The results indicated that proportional shrinking had significant advantages for a fast rate, whereas HMMs trained from slow speech had a slight advantage for a slow rate. We also found an advantage for proportionally shrunk speech from a synthesizer trained on slow speech corpora.
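The second method in this abstract, proportional stretching or shrinking, can be sketched in its simplest form as a uniform rescaling of segment durations. The abstract applies the scaling in waveform generation; the version below operates on frame counts instead and is only an illustration. The function name, the `rate` convention (values above 1 mean faster speech) and the cumulative rounding scheme are assumptions.

```python
def scale_durations(frames, rate):
    """Rescale segment durations (in frames) by a common factor.

    frames : list of non-negative integer durations
    rate   : speech-rate factor; rate > 1 shortens, rate < 1 lengthens
    Rounding is applied to cumulative boundaries, so the output total is
    round(sum(frames) / rate) and rounding errors do not accumulate.
    """
    if rate <= 0:
        raise ValueError("rate must be positive")
    scaled = []
    cumulative_in = 0
    previous_boundary = 0
    for d in frames:
        cumulative_in += d
        boundary = round(cumulative_in / rate)
        scaled.append(boundary - previous_boundary)
        previous_boundary = boundary
    return scaled


fast = scale_durations([4, 6, 10], 2.0)   # [2, 3, 5]
slow = scale_durations([4, 6, 10], 0.5)   # [8, 12, 20]
```

Every segment keeps its share of the utterance, in contrast to the third method, where the model decides how a fixed total duration is distributed.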

#13HMM-Based Emphatic Speech Synthesis Using Unsupervised Context Labeling

Yu Maeno (Interdisciplinary Graduate School of Science and Engineering, Tokyo Institute of Technology)
Takashi Nose (Interdisciplinary Graduate School of Science and Engineering, Tokyo Institute of Technology)
Takao Kobayashi (Interdisciplinary Graduate School of Science and Engineering, Tokyo Institute of Technology)
Yusuke Ijima (NTT Cyber Space Laboratories, NTT Corporation)
Hideharu Nakajima (NTT Cyber Space Laboratories, NTT Corporation)
Hideyuki Mizuno (NTT Cyber Space Laboratories, NTT Corporation)
Osamu Yoshioka (NTT Cyber Space Laboratories, NTT Corporation)

This paper describes an approach to HMM-based expressive speech synthesis which does not require any supervised labeling process for emphasis context. We use appealing-style speech whose sentences were taken from real domains. To reduce the cost for labeling speech data with an emphasis context for the model training, we propose an unsupervised labeling technique of the emphasis context based on the difference between original and generated F0 patterns of training sentences. Although the criterion for the emphasis labeling is quite simple, subjective evaluation results reveal that the unsupervised labeling is comparable to the labeling conducted carefully by a human in terms of speech naturalness and emphasis reproducibility.

Mon-Ses2-P3:
Phonetics and Phonology, Stress, Accent, Rhythm

Time:Monday 13:30 Place:Faenza 1 - Pala Congressi (Passi Perduti-Gallery) Type:Poster
Chair:Bernd Möbius

#1Chinese and Italian Speech Rhythm. Normalization and the CCI Algorithm.

Chiara Bertini (Laboratorio di Linguistica, Scuola Normale Superiore, Pisa, Italy)
Pier Marco Bertinetto (Laboratorio di Linguistica, Scuola Normale Superiore, Pisa, Italy)
Na Zhi (Laboratorio di Linguistica, Scuola Normale Superiore, Pisa, Italy)

This paper re-examines the speech rhythm of Beijing Chinese and Pisa Italian by means of the Control/Compensation Index (CCI), with a view to normalizing the speech data, in order to reduce the effect of the rate factor. Two metrics were applied: (a) DnCCI, an adaptation to the CCI model of the nPVI normalization strategy; (b) SnCCI, a z-score normalization, which takes into account the actual constitution of each V- and C-interval, by referring the individual segment’s duration to the mean duration of the members of the corresponding natural phoneme class. The results indicate the advantage of the SnCCI metrics as a normalization strategy.

#2Rhythm metrics on syllables and feet do not work as expected

Paolo Mairano (GIPSA-Lab, Université Stendhal Grenoble 3)
Antonio Romano (LFSAG, Università di Torino)

The aim of this paper is to explore the possibility of using rhythm metrics on the traditional units of speech rhythm (the syllable and the foot), instead of applying them to consonantal and vocalic intervals. Although [14] had already shown that the standard deviation of syllables and feet did not provide a satisfactory representation of the traditional rhythm classes, some recent studies obtained encouraging results. In particular, [2] applied the PVI to English and Estonian syllables and feet, and a similar approach is intrinsic to the YARD index (cf. [15]), though only at the syllable level. We computed the deltas and the PVIs on syllables (measured as the distance between two successive vocalic onsets) and feet (measured as the distance between the onsets of two stressed vowels) for 30 samples of 14 languages. The results do not confirm expectations and do not seem to support the use of these units for the study of speech rhythm in these terms.
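The measures named in this abstract can be sketched from their commonly used definitions: the delta as a standard deviation of interval durations, and the raw and normalized Pairwise Variability Index (PVI) as averages over successive interval pairs. This is an illustration, not the authors' code: the function names are hypothetical, the onset times are made up, and the use of the population rather than the sample standard deviation is an assumption.

```python
import statistics


def intervals_from_onsets(onsets):
    """Durations between successive onsets (e.g. vocalic onsets for syllables)."""
    return [later - earlier for earlier, later in zip(onsets, onsets[1:])]


def delta(durations):
    """Standard deviation of the durations (population form assumed)."""
    return statistics.pstdev(durations)


def raw_pvi(durations):
    """Mean absolute difference between successive durations."""
    diffs = [abs(a - b) for a, b in zip(durations, durations[1:])]
    return sum(diffs) / len(diffs)


def normalized_pvi(durations):
    """Like raw_pvi, but each difference is divided by the pair mean, times 100."""
    terms = [abs(a - b) / ((a + b) / 2.0)
             for a, b in zip(durations, durations[1:])]
    return 100.0 * sum(terms) / len(terms)


# Hypothetical vocalic onset times in seconds.
syllables = intervals_from_onsets([0.00, 0.20, 0.50, 0.70])
print(delta(syllables), raw_pvi(syllables), normalized_pvi(syllables))
```

For feet, the same functions would be applied to the onsets of stressed vowels only. The normalized PVI divides out the local tempo, so it is less sensitive to speaking rate than the delta or the raw PVI.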

#3Applying Rhythm Features to Automatically Assess Non-Native Speech

Lei Chen (Educational Testing Service)
Klaus Zechner (Educational Testing Service)

Speech rhythm measurements have been used in a limited number of previous studies on automated speech assessment, an approach using speech recognition technology to judge non-native speakers' proficiency levels. However, one of the most problematic issues of these previous studies is a lack of a comparison of these rhythm features with other effective non-rhythm features found in decade-long previous research. In this paper, we extracted both non-rhythm and rhythm features and compared them with respect to their performances to predict proficiency scores rated by humans. We show that adding rhythm features significantly improves the performance of the scoring model based only on non-rhythm features.

#4Prosodic Synchrony in Co-operative Task-based Dialogues: A Measure of Agreement and Disagreement

Brian Vaughan (Trinity College Dublin)

Prosodic synchrony has been reported to be an important aspect of conversational dyads. In this paper, synchrony in four different dyads is examined. A Time Aligned Moving Average (TAMA) procedure is used to temporally align the prosodic measurements for the detection of synchrony in the dyads. An overlapping windowed correlation procedure is used to measure synchrony for six different prosodic parameters: mean pitch, pitch range, mean intensity, intensity range, centre of gravity and spectral slope. This study shows that a windowed correlation procedure better captures the dynamic nature of speech synchrony than a single measure across a whole conversation. This method also enables points of concurrent synchrony between prosodic parameters to be detected. Moreover, the synchrony of the prosodic parameters was considered in relation to levels of agreement and disagreement in the four dyads. Results show only one parameter in one dyad to be significantly correlated with agreement/disagreement.
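
An overlapping windowed correlation of the kind described here might look like the following sketch. It assumes the two speakers' prosodic measurements have already been time-aligned into equal-length sequences (the TAMA step is not reproduced), and the window and step sizes are hypothetical parameters rather than values from the paper.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length sequences; NaN if either is constant."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return math.nan
    return sxy / math.sqrt(sxx * syy)


def windowed_correlation(x, y, window, step):
    """Correlate x and y inside each window; windows overlap whenever step < window."""
    if len(x) != len(y):
        raise ValueError("sequences must be time-aligned and of equal length")
    starts = range(0, len(x) - window + 1, step)
    return [pearson(x[s:s + window], y[s:s + window]) for s in starts]


# Hypothetical aligned mean-pitch tracks for two speakers.
speaker_a = [1, 2, 3, 3, 2, 1]
speaker_b = [1, 2, 3, 1, 2, 3]
print(windowed_correlation(speaker_a, speaker_b, window=3, step=3))
```

A single correlation over the whole sequence would hide the change from positive to negative association that the per-window values expose.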

#5Low and High, Short and Long by Crook or by Hook?

Oliver Niebuhr (Department of General and Comparative Linguistics, University of Kiel, Germany)
Astrid Wolf (Department of General and Comparative Linguistics, University of Kiel, Germany)

The paper deals with perceived speech rhythm, starting from the observation that two nouns with a conjunction in between (‘X and/or Y’, cf. title) sound more rhythmical in a particular noun order. A perception experiment on German with real and pseudo nouns provides evidence that speech rhythm is not just created prosodically by means of high and low or long and short syllables, but that the phonetic properties of the vowel nuclei and of the consonantal onsets and offsets of the stressed syllables are separate segmental constituents of speech rhythm.

#6Estimating Speaking Rate by Means of Rhythmicity Parameters

Christian Heinrich (Institute of Phonetics and Speech Processing, LMU Munich)
Florian Schiel (Institute of Phonetics and Speech Processing, LMU Munich)

In this paper we present a speech rate estimator based on so-called rhythmicity features derived from a modified version of the short-time energy envelope. To evaluate the new method, it is compared to a traditional speech rate estimator on the basis of semi-automatic segmentation. Speech material from the Alcohol Language Corpus (ALC) covering intoxicated and sober speech of different speech styles provides a statistically sound foundation to test upon. The proposed measure clearly correlates with the semi-automatically determined speech rate and seems to be robust across speech styles and speaker states.

#7Comparing word and syllable prominence rated by naive listeners

Denis Arnold (Language and Speech Communication, University of Bonn, Germany)
Bernd Möbius (Department of Computational Linguistics and Phonetics, Saarland University, Germany)
Petra Wagner (Faculty of Linguistics and Literature, University of Bielefeld, Germany)

Prominence has been widely studied on the word level and the syllable level. An extensive study comparing the two approaches is missing in the literature. This study investigates how word and syllable prominence relate to each other in German. We find that perceptual ratings based on the word level are more extreme than those based on the syllable level. The correlations between word prominence and acoustic features are greater than the correlations between syllable prominence and acoustic features.

#8L1 / L2 perception of lexical stress with F0 peak-delay: effect of an extra syllable added

Shinichi Tokuma (Chuo University)
Yi Xu (University College London)

This study re-examined the perceptual effect of F0 peak-delay on L1 / L2 perception of English lexical stress. A trisyllabic English nonsense word ‘ninini’ whose F0 was set to reach its peak around the second syllable was embedded in a frame sentence and used as the stimulus of the perceptual experiment. Native English and Japanese speakers were asked to determine lexical stress locations in the experiment. The results showed that delayed F0 peaks which were aligned with the second syllable of the stimulus words perceptually affected both Japanese and English groups, although in a slightly different manner: the Japanese group perceived the delayed F0 peaks as a cue to lexical stress in the first syllable when the peaks were aligned with, or before, the end of /n/ in the second syllable, while the English group had the boundary shifted to an earlier temporal position. It was also discovered that the Japanese group had greater sensitivity to the delayed peak positions.

#9Letter-to-Phoneme Conversion based on Two-Stage Neural Network focusing on Letter and Phoneme Contexts

Seng Kheang (Graduate School of Computer Science and Engineering, Toyohashi University of Technology)
Iribe Yurie (Information and Media Center, Toyohashi University of Technology)
Nitta Tsuneo (Graduate School of Computer Science and Engineering, Toyohashi University of Technology)

The improvement of Letter-To-Phoneme (L2P) conversion that can output the phoneme strings corresponding to Out-Of-Vocabulary (OOV) words, especially in English, has become one of the most important issues in Text-To-Speech (TTS) research. In this paper, we propose a Two-Stage Neural Network (NN) based approach to solve the problem of conflicting output at the phonemic level. Both Letter and Phoneme Context-Dependent models are combined and implemented in the first-stage NN to convert several letters into several phonemes. Then, the second-stage NN predicts the final output phoneme by observing a combination of several consecutive phoneme sequences obtained from the first-stage NN. Therefore, our L2P conversion module takes a sequence of letters as input and outputs only one phoneme at a time. Measured mainly by word accuracy on OOV words, this new approach usually provides higher performance.

#10An international English speech corpus for longitudinal study of accent development

Rosemary Orr (University College Utrecht, The Netherlands)
Hugo Quene (Utrecht Institute of Linguistics OTS, Utrecht University, The Netherlands)
Roeland van Beek (University College Utrecht, The Netherlands)
Thari Diefenbach (University College Utrecht, The Netherlands)
David van Leeuwen (Department of Language and Speech, Radboud University, Nijmegen, The Netherlands)
Marijn Huijbregts (Department of Language and Speech, Radboud University, Nijmegen, The Netherlands)

If English is used intensively as a lingua franca in a multilanguage community, do speakers then converge towards a single common English accent? This speech corpus allows for longitudinal study to investigate the question of convergence by means of repeated speech recordings of students at an English-language college over a period of 5 years. This paper describes the content and collection of the corpus and the type of research that is envisaged, as well as tools used to manage and analyze the recordings, including tools for automatic phone recognition; prosodic analyses; and intelligibility experiments using the SRT method.

#11A CORPUS-BASED STUDY OF ENGLISH PRONUNCIATION VARIATIONS

Sunhee Kim (Seoul National University, Korea)
Kyuwhan Lee (Seoul National University, Korea)
Minhwa Chung (Seoul National University, Korea)

This paper aims to present an analysis of English pronunciation variations using the TIMIT corpus of American English. The manually annotated data are analyzed by comparing the pronunciation variants to their canonical pronunciations which are defined by using the CMU Pronunciation Dictionary. Vowels and consonants are separately analyzed with respect to substitution, deletion and insertion. The results show that: i) vowels are more subject to substitution than deletion, whereas consonants are more subject to deletion than substitution; and ii) vocalic substitutions are related to the raising and the reduction of vowels, whereas consonantal substitutions are related to changes in voice, place of articulation and manner of articulation. Given that the ultimate goal of pronunciation training in the area of second language acquisition is to help students achieve a reasonably "intelligible" pronunciation rather than an "accent-less" pronunciation, the results of this study will contribute to the determination of "comprehensible" pronunciation of English. Furthermore, they will also contribute to the study of English phonetics and phonology as well as to the development of pronunciation modeling of English speech recognizers.

#12Long term average speech spectra in Yolngu Matha and Pitjantjatjara speaking females and males

Hywel Stoakes (The University of Melbourne and Flinders University)
Andrew Butcher (Flinders University)
Janet Fletcher (The University of Melbourne)
Marija Tabain (La Trobe University)

This paper provides a spectral analysis of two Australian languages, Yolngu Matha (YM) and Pitjantjatjara (PTJ), and of Australian Aboriginal English (AAE) as spoken in two language communities. The aim of this study is to show clear quantitative spectral differences between Australian Aboriginal English and the two Aboriginal languages. Thirteen speakers of Yolngu Matha (ten male, three female) and three female speakers of Pitjantjatjara were recorded reading or retelling a passage in their first language and also in English. The results show that there is a difference between the spectral averages of the two language groups, with AAE having higher amplitudes at higher frequencies when compared to the two Australian languages. In contrast, the Australian languages have higher amplitudes for frequencies between 750 Hz and 2 kHz.

#13Context and speaker dependency in the relation of vowel formants and subglottal resonances – Evidence from Hungarian

Tekla Etelka Gráczi (Research Institute for Linguistics, Hungarian Academy of Sciences, Budapest, Hungary)
Steven M. Lulich (Department of Psychology, Washington University, Saint Louis, Missouri)
Tamás Gábor Csapó (Department of Telecommunications and Media Informatics, Budapest University of Technology and Economics, Budapest, Hungary)
András Beke (Eötvös Loránd University, Budapest, Hungary)

Subglottal resonances are claimed to divide front/back vowels and low/high vowels in several languages, including Hungarian. However, some ‘recalcitrant’ vowels appear to resist this mould. We therefore performed a careful analysis of the role coarticulation and speaker-dependent effects might play in the recalcitrance of these vowels in Hungarian. The present analysis focused on various stop contexts in order to examine effects triggered by place of articulation. It is shown that the subglottal resonances indeed divide the vowel space as claimed, and that the recalcitrance of certain vowels is due to coarticulation with specific consonants. The magnitude of the coarticulation effect is speaker dependent.

Mon-Ses2-P4:
ASR - Search, Keyword Spotting and Confidence Measures I

Time:Monday 13:30 Place:Faenza 2 - Pala Congressi (Passi Perduti-Gallery) Type:Poster
Chair:Mark Gales

#1Event Selection from Phone Posteriorgrams Using Matched Filters

Keith Kintzley (Johns Hopkins University)
Aren Jansen (Johns Hopkins University)
Hynek Hermansky (Johns Hopkins University)

In this paper we address the issue of how to select a minimal set of phonetic events from a phone posteriorgram while minimizing the loss of information. We derive phone posteriorgrams from two sources, Gaussian mixture models and sparse multilayer perceptrons, and apply phone-specific matched filters to the posteriorgrams to yield a smaller set of phonetic events. We introduce a mutual information based performance measure to compare phonetic event selection techniques and demonstrate that events extracted using matched filters can reduce input data while significantly improving performance of an event-based keyword spotting system.

#2A Piecewise Aggregate Approximation Lower-Bound Estimate for Posteriorgram-based Dynamic Time Warping

Yaodong Zhang (MIT Computer Science and Artificial Intelligence Laboratory)
James Glass (MIT Computer Science and Artificial Intelligence Laboratory)

In this paper, we propose a novel lower-bound estimate for dynamic time warping (DTW) methods that use an inner product distance on multi-dimensional posterior probability vectors known as posteriorgrams. Compared to our previous work, the new lower-bound estimate uses piecewise aggregate approximation (PAA) to reduce the time required for calculating the lower-bound estimate. We describe the PAA lower-bound construction process and prove that it can be efficiently used in an admissible K nearest neighbor (KNN) search. The amount of computational savings is quantified by a set of unsupervised spoken keyword spotting experiments. The results show that the newly proposed PAA lower-bound is able to speed up DTW-KNN search by 28% without affecting the keyword spotting performance.
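
As background for the abstract above, the sketch below shows only the piecewise aggregate approximation step: a sequence of frame vectors is cut into roughly equal segments and each segment is replaced by its mean vector. It does not reproduce the paper's lower-bound construction or its proof of admissibility; the function name and the toy two-class posteriorgram are illustrative assumptions.

```python
def paa(frames, n_segments):
    """Reduce a list of equal-dimension frame vectors to n_segments mean vectors."""
    n = len(frames)
    if not 1 <= n_segments <= n:
        raise ValueError("n_segments must be between 1 and the number of frames")
    dim = len(frames[0])
    reduced = []
    for i in range(n_segments):
        # Integer boundaries keep every segment non-empty even when n is not
        # a multiple of n_segments.
        start = i * n // n_segments
        end = (i + 1) * n // n_segments
        segment = frames[start:end]
        reduced.append([sum(f[d] for f in segment) / len(segment) for d in range(dim)])
    return reduced


# Toy posteriorgram: four frames over two phone classes.
posteriorgram = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 0.0]]
print(paa(posteriorgram, 2))
```

Working on the shorter averaged sequence is what makes a bound cheaper to evaluate than one computed on every frame.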

#3OOV Detection and Recovery using Hybrid Models with Different Fragments

Long Qin (Language Technologies Institute, School of Computer Science, Carnegie Mellon University)
Ming Sun (Language Technologies Institute, School of Computer Science, Carnegie Mellon University)
Alexander Rudnicky (Language Technologies Institute, School of Computer Science, Carnegie Mellon University)

In this paper, we address the out-of-vocabulary (OOV) detection and recovery problem by developing three different fragment-word hybrid systems. A fragment language model (LM) and a word LM were trained separately and then combined into a single hybrid LM. Using this hybrid model, the recognizer can recognize any OOVs as fragment sequences. Different types of fragments, such as phones, subwords, and graphones, were tested and compared on the WSJ 5k and 20k evaluation sets. The experimental results show that the subword and the graphone hybrid systems perform better than the phone hybrid system in both the 5k and 20k tasks. Furthermore, given less training data, the subword hybrid system is preferable to the graphone hybrid system.

#4AUC Optimization Based Confidence Measure for Keyword Spotting

Haiyang Li (School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China)
Jiqing Han (School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China)
Tieran Zheng (School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China)

Confidence measures play an important role in keyword spotting. To enhance the effectiveness of the confidence measure, we propose a novel method which improves the performance of keyword spotting by directly maximizing the area under the ROC curve (AUC). First, we approximate the AUC as an objective function with the weighted mean confidence measure. Then, we optimize the objective function by training the weighting factors with the generalized probabilistic descent algorithm. Compared with the current method based on the minimum classification error (MCE) criterion, the proposed method makes a global enhancement of the ROC curve and does not need to train any threshold. The experiments conducted on the King-ASR-023 database show that the proposed method outperforms both the method averaging phone-level confidences and the method based on MCE.
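
For readers unfamiliar with the quantity being optimized, the following sketch computes the empirical AUC as the fraction of (correct detection, false alarm) pairs that the confidence score ranks correctly, counting ties as half. This is the standard pairwise definition, not the smooth approximation or the generalized probabilistic descent training used in the paper; the scores are made up for illustration.

```python
def auc(positive_scores, negative_scores):
    """Empirical AUC: probability that a positive outranks a negative (ties count 0.5)."""
    if not positive_scores or not negative_scores:
        raise ValueError("need at least one positive and one negative score")
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))


# Hypothetical confidences for true keyword hits and for false alarms.
hits = [0.9, 0.8]
false_alarms = [0.7, 0.85]
print(auc(hits, false_alarms))
```

Because the value depends only on the ranking of scores, no decision threshold has to be chosen to evaluate it.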

#5An Empirical Study of Multilingual Spoken Term Detection

Zejun Ma (Institute of Automation, Chinese Academy of Sciences)
Xiaorui Wang (Institute of Automation, Chinese Academy of Sciences)
Bo Xu (Institute of Automation, Chinese Academy of Sciences)

This paper introduces the design of a multilingual spoken term detection system using the CALLHOME and CALLFRIEND multilingual databases published by LDC. For our experiments, seven languages, namely Arabic, English, German, Japanese, Korean, Mandarin Chinese and Spanish, are used to train and evaluate the STD system. As the core module of our language-general STD system, the multilingual automatic speech recogniser combines the acoustic and language models of seven languages into a unified model set. Much of our work focuses on the comparison of multilingual acoustic models: the conventional global phoneme set (GPS) based method and the recently proposed subspace GMM (SGMM) method are investigated in detail. The experimental results demonstrate the viability of our multilingual STD system. It is shown that the resulting multilingual system not only supports seven different languages but also gives satisfying performance gains over the monolingual systems.

#6Fusing Multiple Confidence Measures for Chinese Spoken Term Detection

Zejun Ma (Institute of Automation, Chinese Academy of Sciences)
Xiaorui Wang (Institute of Automation, Chinese Academy of Sciences)
Bo Xu (Institute of Automation, Chinese Academy of Sciences)

In the spoken term detection (STD) task, the confidence measure is used to assess the reliability of detected terms. The widely used confidence measure in STD is based on the normalized lattice posterior probability. In this paper, however, several distinct confidence estimation methods are investigated to improve the baseline lattice confidence: the acoustic and duration confidences are estimated by a hybrid Hidden Markov Model/Artificial Neural Network (HMM/ANN) and a phonetic duration model respectively. These two confidences plus the lattice confidence are linearly interpolated to produce a more reliable confidence measure. The experimental results show the feasibility and effectiveness of our combination approach. The proposed method substantially improves the STD performance, with a 4.8%-11.1% relative equal error rate (EER) reduction on three evaluation sets compared with the baseline lattice confidence.
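
The linear interpolation of the three confidences can be sketched as a weighted sum. The weights below are hypothetical placeholders (the abstract does not state them or how they were tuned), and the function name is an assumption for illustration.

```python
def fuse_confidences(lattice, acoustic, duration, weights=(0.5, 0.3, 0.2)):
    """Linearly interpolate three confidence scores; weights are assumed to sum to 1."""
    w_lattice, w_acoustic, w_duration = weights
    if abs(w_lattice + w_acoustic + w_duration - 1.0) > 1e-9:
        raise ValueError("interpolation weights must sum to 1")
    return w_lattice * lattice + w_acoustic * acoustic + w_duration * duration


# Hypothetical scores for one detected term.
print(fuse_confidences(lattice=0.8, acoustic=0.6, duration=0.4))
```

In practice the weights would be chosen on held-out data, for instance by a grid search that minimizes the equal error rate.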

#7Response Probability Based Decoding Algorithm for Large Vocabulary Continuous Speech Recognition

Zhanlei Yang (National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences, Beijing, China)
Hao Chao (National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences, Beijing, China)
Wenju Liu (National Laboratory of Pattern Recognition (NLPR), Institute of Automation, Chinese Academy of Sciences, Beijing, China)

Acoustic space is made up of phonemes, and it can be modeled using a universal background model (UBM). Therefore, there are some relations between the phonemes and the Gaussian mixture components of the UBM. This paper represents these relations by proposing a response probability (RP) model, which describes the location information of speech observations within the whole acoustic space. At the decoding stage, the proposed RP model is fused with the traditional acoustic model (AM) and language model (LM). After integrating RP, the decoder is guided to weaken or enhance different path candidates and directed to extend the most promising paths. Experiments conducted on Mandarin broadcast speech show that the character error rate is reduced by 9.15% relative when the RP model is used and by 11.89% relative when an improved RP model is used.

#8Combining Lattice-Based Language Dependent and Independent Approaches for Out-of-Language Detection in LVCSR

Yuxiang Shan (Tsinghua National Laboratory for Information Science and Technology, Department of Electronic Engineering, Tsinghua University, Beijing, China)
Yan Deng (Tsinghua National Laboratory for Information Science and Technology, Department of Electronic Engineering, Tsinghua University, Beijing, China)
Jia Liu (Tsinghua National Laboratory for Information Science and Technology, Department of Electronic Engineering, Tsinghua University, Beijing, China)

In this paper, the Out-Of-Language (OOL) detection problem is handled by both language dependent (LD) and language independent (LI) approaches. In the LD approach, a novel speech content and language joint recognition algorithm is proposed, which integrates a phone lattice-based vector space modeling language recognition (LRE) backend into the conventional speech decoding procedure. In the LI approach, lattice-derived confidence measures are used. Since these two approaches reflect two different dimensions of uncertainties encoded in lattices, combining them improves both the LRE and OOL detection performance. Experiments also show that for the LD approach the detection accuracies can be significantly increased by applying heuristic phone lattice reconstruction. Evaluated on a Mandarin/English mixed conversational telephone speech corpus with a Mandarin speech recognizer, the proposed method achieves an EER of 12.68% in OOL detection, and reduces the recognition error by 33.06%.

#9Evaluation of tree-trellis based decoding in over-million LVCSR

Naoaki Ito (Nagoya Institute of Technology)
Yoshihiko Nankaku (Nagoya Institute of Technology)
Akinobu Lee (Nagoya Institute of Technology)
Keiichi Tokuda (Nagoya Institute of Technology)

Very large vocabulary continuous speech recognition (CSR) that can recognize every sentence is one of the important goals in speech recognition. Several attempts have been made to achieve very large vocabulary CSR. However, very large vocabulary CSR using a tree-trellis based decoder has not been reported. We report the performance evaluation and improvement of the “Julius” tree-trellis based decoder in large vocabulary CSR (LVCSR) involving a vocabulary of more than one million words, referred to here as over-million LVCSR. Experiments indicated that Julius achieved a word accuracy of about 91% and a real time factor of about 2 in over-million LVCSR for Japanese newspaper speech transcription.

#10Lattice Based Discriminative Model Combination Using Automatically Induced Phonetic Contexts

Hao Huang (Department of Information Science and Engineering, Xinjiang University)
Bing Hu Li (Department of Information Science and Engineering, Xinjiang University)

Discriminative model combination integrates several model scores using discriminatively trained weighting factors. In recent research, context-dependent scaling is often applied. One limitation of this approach is that it introduces a large number of parameters, and a large parameter set with limited training data might cause training instability. In this paper, we propose to use automatically induced contexts modeled by phonetic decision trees. The question in each tree node is chosen to maximize the minimum phone error criterion. A first-order approximation of the objective increase is used for question selection to make tree growing efficient. Experimental results on continuous speech recognition show that the method is capable of inducing crucial phonetic contexts and obtains error reduction with many fewer parameters, compared with the results from manually selected contexts.

#11Predicting Human Perceived Accuracy of ASR Systems

Taniya Mishra (AT&T Labs-Research)
Andrej Ljolje (AT&T Labs-Research)
Mazin Gilbert (AT&T Labs-Research)

Word error rate (WER), which is the most commonly used method of measuring automatic speech recognition (ASR) accuracy, penalizes all types of ASR errors equally. However, humans differentially weigh different types of ASR errors. They judge ASR errors that distort the meaning of the spoken message more harshly than those that do not. Aiming to align more closely with human perception of ASR accuracy, we developed a new metric HPA (Human Perceived Accuracy) that predicts the subjective perceived accuracy of ASR transcriptions. HPA is computed based on the central idea of differential weighting of different ASR errors. Applied to the particular task of automatically recognizing voicemails, we found that the correlation between HPA and the human judgement of ASR accuracy was significantly higher (r-value=0.91) than the correlation between WER and human judgement (r-value=0.65).
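
The baseline metric discussed above, WER, is conventionally computed as the word-level edit distance between reference and hypothesis divided by the reference length, with substitutions, deletions and insertions all costing 1. The sketch below illustrates that uniform weighting; it is not the HPA metric, whose weighting scheme the abstract does not specify, and the example sentences are invented.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Dropping "at" costs one deletion out of four reference words.
print(word_error_rate("call me at five", "call me five"))
```

Every error here has the same cost regardless of whether it changes the meaning of the message, which is the property the abstract argues against.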

#12Cross-lingual study of ASR errors: on the role of the context in human perception of near homophones

Ioana Vasilescu (LIMSI-CNRS)
Dahbia Yahia (LIMSI-CNRS)
Natalie Snoeren (LIMSI-CNRS)
Martine Adda-Decker (LPP/LIMSI-CNRS)
Lori Lamel (LIMSI-CNRS)

It is well-accepted that human listeners significantly outperform machines when it comes to transcribing speech. This paper presents a paradigm for perceptual experiments that aims to increase our understanding of automatic speech recognition errors. The role of the context length is investigated through perceptual recovery of small homophonic words or near homophones yielding frequent automatic transcription errors. The same experimental protocol of varied size speech stimuli transcription is applied to both English and French. Our hypothesis is that ambiguity due to homophonic words reduces with context size for both languages, which in turn should entail reduced perception and transcription errors. The results show that context plays a central role as the human WER decreases significantly with increasing context. The long-term aim is to improve the modelling of such ambiguous items to reduce automatic errors.

#13Performance Prediction of Speech Recognition Using Average-Voice-Based Speech Synthesis

Tatsuhiko Saito (Tokyo Institute of Technology)
Takashi Nose (Tokyo Institute of Technology)
Takao Kobayashi (Tokyo Institute of Technology)
Yohei Okato (Mitsubishi Electric Corporation)

This paper describes a technique for predicting the performance of a speech recognition system using a small amount of target speakers’ data. In the conventional HMM-based technique, a speaker-dependent model was used and thus a considerable amount of training data was needed. To reduce the amount of training data, we introduce an average voice model as prior knowledge for the target speakers’ acoustic models, and adapt it to the target speakers using speaker adaptation. Experimental results show that the use of the average voice model effectively reduces the amount of training data required from the target speakers, and that the prediction accuracy is significantly improved compared to the conventional technique, especially when a smaller amount of training data is available.

#14Confidence Measures For Turkish Call Center Conversations

Ali Haznedaroglu (Bogazici University)
Levent M. Arslan (Bogazici University)

Automatic speech recognition accuracies for call center conversations are still below intended levels due to harsh conditions such as channel distortions, external noises, co-articulated speech, etc. The agglutinative and free-word-order nature of Turkish degrades recognition performance further; therefore the use of confidence measures (CMs) is necessary to retrieve correct information from the calls. In this paper, two conversational CMs, namely speech overlap ratio and opposite party energy level, are proposed and tested together with single-channel confidence measures on Turkish stereo call center recordings. Experimental results show that conversational CMs improve the rating accuracies of the utterances with respect to their recognition rates.

#15Spoken Document Confidence Estimation Using Contextual Coherence

Taichi Asami (NTT Cyber Space Laboratories, NTT Corporation)
Narichika Nomoto (NTT Cyber Space Laboratories, NTT Corporation)
Satoshi Kobashikawa (NTT Cyber Space Laboratories, NTT Corporation)
Yoshikazu Yamaguchi (NTT Cyber Space Laboratories, NTT Corporation)
Hirokazu Masataki (NTT Cyber Space Laboratories, NTT Corporation)
Satoshi Takahashi (NTT Cyber Space Laboratories, NTT Corporation)

Selecting well-recognized transcripts is critical if information retrieval systems are to extract business intelligence from massive spoken document databases. To achieve this goal, we target spoken document confidence measures that represent the recognition rates of each document. We focus on the incoherent word occurrences over several utterances in ill-recognized transcripts of spoken documents. The proposed method uses contextual coherence as a measure of spoken document confidence. The contextual coherence is formulated as the mean of pointwise mutual information (PMI). We also propose a smoothing method for PMI, which deals with the data sparseness problem. Compared to the conventional method, our smoothing technique improves the correlation coefficient between spoken document confidence scores and recognition rates from 0.573 to 0.672. Moreover, an even higher correlation coefficient, 0.710, is achieved by combining the contextual-based and decoder-based confidence measures.
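
A mean-PMI coherence score of the kind described can be sketched as follows. The sketch assumes that word and word-pair probabilities are estimated as document frequencies in some background collection, and it simply skips pairs that never co-occur; the paper's own smoothing method for sparse counts is not reproduced. All names and the toy collection are illustrative assumptions.

```python
import math
from itertools import combinations


def mean_pmi(words, documents):
    """Mean PMI over distinct word pairs, using document-level co-occurrence counts.

    Returns None when no pair of the given words co-occurs in the collection.
    """
    doc_sets = [set(doc) for doc in documents]
    n_docs = len(doc_sets)

    def prob(*terms):
        # Fraction of documents containing all of the given terms.
        return sum(1 for d in doc_sets if all(t in d for t in terms)) / n_docs

    scores = []
    for x, y in combinations(sorted(set(words)), 2):
        p_xy = prob(x, y)
        if p_xy > 0:  # unseen pairs are skipped here, not smoothed
            scores.append(math.log(p_xy / (prob(x) * prob(y))))
    return sum(scores) / len(scores) if scores else None


# Toy background collection: "a" and "b" always appear together.
background = [["a", "b"], ["a", "b"], ["c"], ["c"]]
print(mean_pmi(["a", "b"], background))
```

A transcript whose words rarely co-occur in the background data receives a low score, which is the intuition behind using coherence as a proxy for recognition quality.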

Mon-Ses3-O1:
Speaker Recognition - Analysis and Statistics II

Time:Monday 16:00 Place:Auditorium - Pala Congressi Type:Oral
Chair:Najim Dehak

16:00Intersession compensation and scoring methods in the i-vectors space for speaker recognition

Pierre-Michel Bousquet (LIA Avignon)
Driss Matrouf (LIA Avignon)
Jean-François Bonastre (LIA Avignon)

We propose here, for speaker recognition, new intersession compensation and scoring methods, in the total factor space (i-vectors), adapted to our system configuration.

16:20Kernel alignment maximization for speaker recognition based on high-level features

Szymon Drgas (Poznan University of Technology)
Adam Dabrowski (Poznan University of Technology)

In this paper text-independent automatic speaker verification based on support vector machines is considered. A generalized linear kernel training method based on kernel alignment maximization is proposed. First, kernel matrix decomposition into a sum of maximally aligned directions in the input space is performed and this decomposition is spectrally optimized. The method was evaluated for high-level speaker features: prosodic, articulatory and lexical. The experiments were undertaken employing Switchboard corpus. The proposed algorithm gave equal error rate (EER) reduction up to 23%.

16:40Kernel partial least squares for speaker recognition

Balaji Vasan Srinivasan (University of Maryland)
Daniel Garcia-Romero (University of Maryland)
Dmitry N. Zotkin (University of Maryland)
Ramani Duraiswami (University of Maryland)

I-vectors are a concise representation of speaker characteristics. Recent advances in speaker recognition have utilized their ability to capture speaker and channel variability to develop efficient recognition engines. Inter-speaker relationships in the i-vector space are non-linear. Accomplishing effective speaker recognition requires a good modeling of these non-linearities and can be cast as a machine learning problem. In this paper, we propose a kernel partial least squares (kernel PLS, or KPLS) framework for modeling speakers in the i-vectors space. The resulting recognition system is tested across several conditions of the NIST SRE 2010 extended core data set and compared against state-of-the-art systems: Joint Factor Analysis (JFA), Probabilistic Linear Discriminant Analysis (PLDA), and Cosine Distance Scoring (CDS) classifiers. Improvements are shown.
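
Of the baselines named above, cosine distance scoring is the simplest to illustrate: the verification score is the cosine of the angle between the enrolment and test i-vectors, compared against a threshold. The sketch below shows only that score, without any channel compensation or score normalization a full system might apply, and it is unrelated to the KPLS method itself; the vectors and the threshold are hypothetical.

```python
import math


def cosine_score(target_ivector, test_ivector):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(target_ivector, test_ivector))
    norm = math.sqrt(sum(a * a for a in target_ivector)) * math.sqrt(
        sum(b * b for b in test_ivector)
    )
    if norm == 0:
        raise ValueError("i-vectors must be non-zero")
    return dot / norm


# Toy 2-dimensional "i-vectors"; real ones have hundreds of dimensions.
enrolled = [1.0, 2.0]
test = [2.0, 4.0]
threshold = 0.5  # hypothetical decision threshold
print(cosine_score(enrolled, test) > threshold)
```

Since the score is linear in the normalized vectors, it cannot capture the non-linear inter-speaker relationships that motivate the kernel approach.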

17:00Conversational-Side-Specific Inter-Session Variability Compensation

Mohamed Omar (IBM T. J. Watson Research Center)
Jason Pelecanos (IBM T. J. Watson Research Center)

This paper investigates three methods for estimating a conversational-side-specific projection or affine transform to compensate for session and channel effects. In the first, we estimate the projection based on an estimate of the within-class covariance matrix from the statistics of a conversational-side-specific subset of the development data. In the second, we use a subset of the development data to construct a discriminative objective function which is used to estimate the projection parameters. In the third, an affine transform of the observation vectors of each conversational side is estimated using maximum likelihood estimation. We present several experiments that show how these three techniques perform compared to our baseline system on the interview tasks of the NIST 2008 and the NIST 2010 speaker recognition evaluations. The best of these techniques gives a performance improvement of up to 20% relative compared to the baseline system.

17:20A speaker line-up for the Likelihood Ratio

David Van Leeuwen (Radboud University Nijmegen)
Niko Brümmer (Agnitio Research)

We propose an analogy to eye witness line-up in order to compute calibrated likelihood ratios for speaker recognition, by including the target model in an identification trial with a cohort of foils. Expressions for the likelihood ratio as a function of cohort size, identification rank and system ROC performance are derived, and some properties of the likelihood ratio function are discussed. The line-up procedure is used as a method to calibrate speaker recognition scores. Using NIST SRE 2010, we find calibration loss comparable to linear calibration (FoCal), while the proposed method gives improved discrimination.

17:40Towards Fully Bayesian Speaker Recognition: Integrating Out the Between-Speaker Covariance

Jesús Antonio Villalba López (University of Zaragoza)
Niko Brümmer (Agnitio, South Africa)

We propose a variational Bayes solution to integrate out the model parameters in a generative i-vector speaker recognizer. The existing state-of-the-art in generative i-vector modelling plugs in fixed maximum-likelihood point-estimates of model parameters. This recipe may suffer from over-fitting of especially the between-speaker covariance. We show how to integrate out the between-speaker covariance and demonstrate dramatic improvements on NIST SRE 2010.

Mon-Ses3-O3:
ASR - Lexical, Prosodic and Multi-Lingual Models

Time:Monday 16:00 Place:Brunelleschi (Green Room) - Pala Congressi - 2nd Floor Type:Oral
Chair:Murat Saraclar

16:00Learning from Mistakes: Expanding Pronunciation Lexicons using Word Recognition Errors

Sravana Reddy (The University of Chicago)
Evandro Gouvea (The University of Chicago)

We introduce the problem of learning pronunciations of out-of-vocabulary words from word recognition mistakes made by an ASR system. This question is especially relevant in cases where the ASR engine is a black box -- meaning that the only acoustic cues about the speech data come from word recognition output. This paper presents an EM approach to inferring pronunciations from n-best word recognition hypotheses, which outperforms pronunciation estimates of a grapheme-to-phoneme system.

16:20Improving non-native ASR through stochastic multilingual phoneme space transformations

David Imseng (Idiap Research Institute)
Hervé Bourlard (Idiap Research Institute)
John Dines (Idiap Research Institute)
Philip N. Garner (Idiap Research Institute)
Mathew Magimai Doss (Idiap Research Institute)

We propose a stochastic phoneme space transformation technique that allows the conversion of conditional source phoneme posterior probabilities (conditioned on the acoustics) into target phoneme posterior probabilities. The source and target phonemes can be in any language and phoneme format such as the International Phonetic Alphabet. The novel technique makes use of a Kullback-Leibler divergence based hidden Markov model and can be applied to non-native and accented speech recognition or used to adapt systems to under-resourced languages. In this paper, and in the context of hybrid HMM/MLP recognizers, we successfully apply the proposed approach to non-native English speech recognition on the HIWIRE dataset.

16:40Unsupervised Arabic Dialect Adaptation with Self-Training

Scott Novotney (BBN Technologies)
Rich Schwartz (BBN Technologies)
Sanjeev Khudanpur (Johns Hopkins HLT COE/ECE Dept.)

Useful training data for automatic speech recognition systems of colloquial speech is usually limited to expensive in-domain transcription. Broadcast news is an appealing source of easily available data to bootstrap into a new dialect. However, some languages, like Arabic, have deep linguistic differences resulting in poor cross-domain performance. If no in-domain transcripts are available, but a large amount of in-domain audio is, self-training may be a suitable technique to bootstrap into the domain. In this work, we attempt to adapt Modern Standard Arabic (MSA) models to Levantine Arabic without any in-domain manual transcription. We contrast with varying amounts of in-domain transcription and show that: 1) self-training is effective with only one hour of in-domain transcripts; 2) self-training is not a suitable solution to improve strong MSA models on Levantine; 3) two metrics that quantify model bias predict self-training success; and 4) model bias explains the failure of self-training to adapt across strong domain mismatch.

17:00Template-based Automatic Speech Recognition meets Prosody

Dino Seppi (ESAT - Katholieke Universiteit Leuven)
Kris Demuynck (ESAT - Katholieke Universiteit Leuven)
Dirk Van Compernolle (ESAT - Katholieke Universiteit Leuven)

In this paper, we use prosodic information to improve the accuracy of our template-based automatic speech recognizer. Prosodic information is harvested adopting a data-driven approach. A number of prosodic features is extracted, then combined into major groups, and finally studied separately and together. All acoustic evidence, both segmental and suprasegmental, is modelled non-parametrically. The different sources of information are conveniently combined with segmental conditional random fields. Prosody enhances the accuracy of the state-of-the-art baseline by reducing the word error rate by 7% relative on the nov92, 20k trigram, Wall Street Journal task.

17:20Pronunciation Learning from Continuous Speech

Ibrahim Badr (CSAIL MIT)
Ian McGraw (CSAIL MIT)
James Glass (CSAIL MIT)

This paper explores the use of continuous speech data to learn stochastic lexicons. Building on previous work in which we augmented graphones with acoustic examples of isolated words, we extend our pronunciation mixture model framework to two domains containing spontaneous speech: a weather information retrieval spoken dialogue system and the academic lectures domain. We find that our learned lexicons out-perform expert, hand-crafted lexicons in each domain.

17:40State-Level Data Borrowing for Low-Resource Speech Recognition based on Subspace GMMs

Yanmin Qian (Tsinghua University)
Daniel Povey (Microsoft Research)
Jia Liu (Tsinghua University)

Large vocabulary continuous speech recognition is always a difficult task, and it is particularly so for low-resource languages. The scenario we focus on here is having only 1 hour of acoustic training data in the “target” language. This paper presents work on a data borrowing strategy combined with the recently proposed Subspace Gaussian Mixture Model (SGMM). We developed data borrowing strategies based on two approaches: one based on minimizing K-L Divergence, and one that also takes into account state occupation counts. We demonstrate improvements versus the baseline SGMM setup, which itself is better than a conventional HMM-GMM system. The SGMMs are more robustly estimated by borrowing data from the non-target language at the acoustic-state level. Although we tested the approach for SGMMs, we expect the general idea of borrowing data from a non-target language to be applicable for conventional GMMs as well.
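
To make the K-L divergence criterion mentioned above more concrete, the sketch below picks the non-target-language state closest to a target-language state. It rests on a strong simplifying assumption that is not taken from the abstract: each state is summarised by a single diagonal-covariance Gaussian, for which the divergence has a closed form. The function names and the data layout (a `(means, variances)` pair per state) are hypothetical.

```python
import math


def kl_diag_gauss(mean_p, var_p, mean_q, var_q):
    """KL(p || q) for two diagonal-covariance Gaussians given per-dimension lists."""
    total = 0.0
    for mp, vp, mq, vq in zip(mean_p, var_p, mean_q, var_q):
        total += math.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0
    return 0.5 * total


def closest_state(target, candidates):
    """Index of the candidate state with the smallest KL(target || candidate).

    `target` and each candidate are (means, variances) pairs; this layout is
    an assumption made for illustration only.
    """
    t_mean, t_var = target
    return min(
        range(len(candidates)),
        key=lambda i: kl_diag_gauss(t_mean, t_var, candidates[i][0], candidates[i][1]),
    )
```

The abstract's second strategy also weighs in state occupation counts; that refinement is left out here because the abstract does not say how the two quantities are combined.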

Mon-Ses3-P5:
Speech Synthesis - Selected Topics

Time:Monday 16:00 Place:Donatello (Room Onice) - Pala Congressi - Ground Floor Type:Poster
Chair:Enrico Zovato

#1A Grammar Based Approach to Style Specific Phrase Prediction

Alok Parlikar (Language Technologies Institute, Carnegie Mellon University)
Alan W Black (Language Technologies Institute, Carnegie Mellon University)

We present an approach to style specific phrasing for Text-to-Speech (TTS) systems. We formulate the problem of phrase break prediction (or phrasing) as generation of a sequence of breaks (B) and non-breaks (NB) after each word in a sentence. We use prosodic breaks in speech data to build shallow parses over corresponding text. We then learn a grammar that can predict these shallow prosodic parses from text. We then combine this prosodic phrasing information with other word level features in a CART tree to predict where phrase breaks should be inserted in new text. We show that a model built to target a specific reading style can predict phrase breaks more accurately than the standard generic model.

#2Unsupervised features from text for speech synthesis in a speech-to-speech translation system

Oliver Watts (Centre for Speech Technology Research, University of Edinburgh, UK)
Bowen Zhou (IBM T. J. Watson Research Center, Yorktown Heights, NY 10598, USA)

We explore the use of linguistic features that can be extracted from unannotated text in an unsupervised, language-independent fashion for text-to-speech (TTS) conversion in the context of a speech-to-speech translation system. The features are intended to act as surrogates for conventional part of speech (POS) features. Unlike POS features, the experimental features assume only the availability of tools and data that must already be in place for the construction of other components of the translation system, and can therefore be used for the TTS module without incurring additional TTS-specific costs. We here describe the use of the experimental features in a speech synthesiser, using six different configurations of the system to allow the comparison of the proposed features with conventional, knowledge-based POS features. We present results of objective and subjective evaluations of the usefulness of the new features.

#3Unsupervised continuous-valued word features for phrase-break prediction without a part-of-speech tagger

Oliver Watts (Centre for Speech Technology Research, University of Edinburgh, UK)
Junichi Yamagishi (Centre for Speech Technology Research, University of Edinburgh, UK)
Simon King (Centre for Speech Technology Research, University of Edinburgh, UK)

Part of speech (POS) tags are foremost among the features conventionally used to predict intonational phrase-breaks for text to speech (TTS) conversion. The construction of such systems therefore presupposes the availability of a POS tagger for the relevant language, or of a corpus manually tagged with POS. However, such tools and resources are not available in the majority of the world's languages, and manually labelling text with POS tags is an expensive and time-consuming process. We therefore propose the use of continuous-valued features that summarise the distributional characteristics of word types as surrogates for POS features. Importantly, such features are obtained in an unsupervised manner from an untagged text corpus. We present results on the phrase-break prediction task, where use of the features closes the gap in performance between a baseline system (using only basic punctuation-related features) and a topline system (incorporating a state-of-the-art POS tagger).

#4Albayzín 2010: a Spanish text to speech evaluation

Francisco Campillo (Group on Multimedia Technologies, University of Vigo)
Francisco Méndez (Group on Multimedia Technologies, University of Vigo)
Montserrat Arza (Instituto Ramón Piñeiro)
Laura Docío (Group on Multimedia Technologies, University of Vigo)
Antonio Bonafonte (TALP Research Center, Universitat Politècnica de Catalunya)
Eva Navas (Aholab Signal Processing Laboratory, University of the Basque Country)
Iñaki Sainz (Aholab Signal Processing Laboratory, University of the Basque Country)

Albayzín 2010 Text-to-Speech Evaluation Campaign was the second biannual Albayzín Campaign. A Spanish corpus was provided by the Group of Multimedia Technologies of the University of Vigo, and six teams developed a total of ten systems for the evaluation. A set of test sentences was released to be synthesized, and an on-line evaluation was conducted, focusing on naturalness, similarity to the original voice, and intelligibility. In this paper the evaluation details and results are described.

#5Combining Active and Semi-supervised Learning for Homograph Disambiguation in Mandarin Text-to-Speech Synthesis

Binbin Shen (Graduate School at Shenzhen, Tsinghua University)
Zhiyong Wu (Graduate School at Shenzhen, Tsinghua University)
Yongxin Wang (Department of Computer Science and Technology, Tsinghua University)
Lianhong Cai (Department of Computer Science and Technology, Tsinghua University)

Grapheme-to-phoneme conversion (G2P) is a crucial step for Mandarin text-to-speech (TTS) synthesis, where homograph disambiguation is the core issue. Several machine learning algorithms have been proposed to solve the issue by building models from well-annotated training corpora. However, preparing such corpora is laborious and time-consuming, since a great deal of manual labeling is required to validate the proper pronunciations of the homographs. This work addresses the problem by introducing active learning (AL) and semi-supervised learning (SSL) algorithms for the homograph disambiguation task using unlabeled data. Experiments show that the proposed framework can greatly reduce the cost of manual labeling while preserving the performance of the trained model.

#6Automatically Creating a Diphone Set from a Speech Database

Thomas Ewender (ETH Zürich)
Beat Pfister (ETH Zürich)

This paper presents a measure that scores various aspects of phone quality. The measure is designed to penalize phone instances with one or several characteristics that are not desirable in concatenation-based speech synthesis. Depending on the phone type, these aspects amongst others include spectrum, phase, fundamental frequency, duration, voicing and plosive quality. We applied this quality measure to select diphone sets from four different speech databases and demonstrate the quality of these diphone sets by means of synthesis examples. The quality of these examples showed that the proposed measure can be applied to select a high-quality diphone set from a speech database.

#7Automatic Viseme Clustering for Audiovisual Speech Synthesis

Wesley Mattheyses (Vrije Universiteit Brussel)
Lukas Latacz (Vrije Universiteit Brussel)
Werner Verhelst (Vrije Universiteit Brussel)

A common approach in visual speech synthesis is the use of visemes as atomic units of speech. In this paper, phoneme-based and viseme-based audiovisual speech synthesis techniques are compared in order to explore the balancing between data availability and an improved audiovisual coherence for synthesis optimization. A technique for automatic viseme clustering is described and it is compared to the standardized viseme set described in MPEG-4. Both objective and subjective testing indicated that a phoneme-based approach leads to better synthesis results. In addition, the test results improve when more different visemes are defined. This raises some questions on the widely applied viseme-based approach. It appears that a many-to-one phoneme-to-viseme mapping is not capable of describing all subtle details of the visual speech information. In addition, with viseme-based synthesis the perceived synthesis quality is affected by the loss of audiovisual coherence in the synthetic speech.

#8Perceptual Quality Dimensions of Text-to-Speech Systems

Florian Hinterleitner (Quality and Usability Lab, Deutsche Telekom Laboratories, TU Berlin, Germany)
Sebastian Möller (Quality and Usability Lab, Deutsche Telekom Laboratories, TU Berlin, Germany)
Christoph Norrenbrock (Digital Signal Processing and System Theory, CAU Kiel, Germany)
Ulrich Heute (Digital Signal Processing and System Theory, CAU Kiel, Germany)

The aim of this paper is to analyze the perceptual quality dimensions of state-of-the-art text-to-speech systems (TTS). Therefore, several pretests were conducted to determine a suitable set of attribute scales. The resulting 16 scales were used in a semantic differential on a diverse database containing 16 different TTS systems. A subsequent multidimensional analysis (Principal Axis Factor analysis with Promax rotation) resulted in three underlying quality dimensions. They were labeled naturalness, disturbances, and temporal distortions. A mapping of these factors onto the perceived overall quality revealed that naturalness contributes the most to the quality of TTS signals.

#10A Pointwise Approach to Pronunciation Estimation for a TTS Front-end

Shinsuke Mori (Kyoto University)
Graham Neubig (Kyoto University)

In this paper, we propose a pointwise approach to the Japanese TTS front-end. In this approach, phoneme sequence estimation of sentences is decomposed into two tasks: word segmentation of the input sentence and phoneme estimation of each word. These two tasks are then solved by pointwise classifiers. In contrast to an existing method, the n-gram model based on sequences of word-phoneme pairs, this framework enables us to use various language resources such as sentences partially annotated with word boundary information and phoneme sequences, word sequences annotated with phoneme sequences, etc. In the experiments, we compared a pair-based tri-gram model and the combination of a pointwise word segmenter and a pointwise phoneme sequence estimator. The results showed that our framework successfully enables a front-end to refer to a partially annotated corpus and/or a word sequence list annotated with phoneme sequences to realize a far larger improvement in accuracy.

#11Correlating Text with Prosody

Mohamed Abou-Zleikha (University College Dublin)
Julie Carson-Berndsen (University College Dublin)

The prediction of prosody from text information has long been recognised as a requirement for natural sounding speech synthesis. While an examination of the relationship between text information and prosody typically focuses on the role of accent, duration and phrasing both from a statistical and rule-based perspective, this paper investigates the correlation between the similarity calculated with respect to text information and the similarity calculated with respect to prosody from an exemplar-based perspective. Two text features are studied, the syntactic tree and the dependency tree, along with two prosody features, the pitch and the intensity. The work in this paper investigates the correlation between text information and prosody information, the conditional membership probability between text analysis information and prosody information, and the effect of the number of exemplars on the conditional membership probability.

#12"What is... Dengue Fever?" Modeling and Predicting Pronunciation Errors in a Text-to-Speech System

Andrew Rosenberg (Queens College / CUNY)
Raul Fernandez (IBM Research)
Bhuvana Ramabhadran (IBM Research)

We propose a system to predict baseform-generation errors in a text-to-speech (TTS) front-end, and aid in the process of customizing the synthesis engine to a novel application with large, open-ended vocabulary. We motivate the use of the system by using data collected during the deployment of the IBM TTS engine in the Watson Deep Question-Answering system customized to play a game of Jeopardy!. We propose a set of features derived from a lexeme's orthography and candidate baseform, and use a variety of learning schemes and data sampling algorithms to address the issue of skewed class priors in the training data. We show that 1) these different approaches provide complementary information that can then be exploited by fusion schemes to improve on the baseline performances, and 2) it is possible to use these techniques to retrieve a list of likely incorrect lexemes so as to reduce the number of tokens that must be vetted before finding and fixing an error.

#13Aperiodicity Analysis for Quality Estimation of Text-To-Speech Signals

Christoph Norrenbrock (Digital Signal Processing and System Theory, Christian-Albrechts-University of Kiel, Germany)
Ulrich Heute (Digital Signal Processing and System Theory, Christian-Albrechts-University of Kiel, Germany)
Florian Hinterleitner (Quality and Usability Lab, Deutsche Telekom Laboratories, TU Berlin, Germany)
Sebastian Möller (Quality and Usability Lab, Deutsche Telekom Laboratories, TU Berlin, Germany)

This contribution presents a new approach towards non-intrusive quality assessment of Text-To-Speech (TTS) signals. Perturbation measures which capture the degree of excitation-specific aperiodicity in voiced speech are investigated concerning their quality implications in synthesized speech. Based on two independent TTS databases for which formal attribute-based listening tests have been conducted, we show that perturbation measures are sensitive to quality aspects of prosody and voice characteristic. Furthermore a dominant dependency on TTS type, namely non-uniform unit-selection and diphone synthesis, is identified. Yet, considerable differences between male and female TTS samples are recognized, emphasizing the need for gender-specific quality assessment.

Mon-Ses3-O2:
Physiology and Pathology of Spoken Language

Time:Monday 16:00 Place:Leonardo - Pala Affari - Ground Floor Type:Oral
Chair:Tim Bunnell

16:00Novel VTEO Based Mel Cepstral Features for Classification of Normal and Pathological Voices

Hemant Patil (Dhirubhai Ambani Institute of Information and Communication Technology, DA-IICT Gandhinagar, India.)
Pallavi Baljekar (Department of Electronics and Communication, Manipal Institute of Technology (MIT), Manipal University, Manipal, India)

In this paper, novel Variable length Teager Energy Operator (VTEO) based Mel cepstral features, viz., VTMFCC, are proposed for automatic classification of normal and pathological voices. Experiments have been carried out using this proposed feature set, MFCC and their score-level fusion. Classification was performed using a 2nd order polynomial classifier on a subset of the MEEI database. The equal error rate (EER) obtained with fusion was 3.2% lower than the EER of MFCC alone, which was used as the baseline. Effectiveness of the proposed feature set was also investigated under degraded conditions using the NOISEX-92 database for babble and high frequency channel noise.
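
For orientation, the classical discrete Teager energy operator is x(n)^2 - x(n-1)x(n+1). The sketch below generalises the one-sample lag to a lag of k samples; treating this as the "variable length" operator is an assumption, since the abstract does not give the definition, and the function name `vteo` is hypothetical. The Mel filterbank and cepstral steps that would follow are not shown.

```python
import math


def vteo(x, k=1):
    """Lag-k Teager energy: x[n]^2 - x[n-k] * x[n+k] for every valid n.

    k = 1 gives the classical Teager energy operator. The output is shorter
    than the input by 2 * k samples because the edges have no neighbours.
    """
    if k < 1:
        raise ValueError("lag k must be a positive integer")
    return [x[n] * x[n] - x[n - k] * x[n + k] for n in range(k, len(x) - k)]
```

For a pure tone A*cos(w*n) the output is the constant A^2 * sin^2(k*w), so the operator tracks amplitude and frequency jointly, and the lag changes which frequencies are emphasised.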

16:20Temporal Performance of Dysarthric Patients in Speech and Tapping Tasks

Eiji Shimura (Niigata University of Health and Welfare)
Kazuhiko Kakehi (Chukyo University)

Dysarthria is defined as a locomotor disorder of the vocal speech organ due to a pathological change of nerve and muscle systems. Several methods of speaking rate control have been widely used for the rehabilitation of dysarthria. However, these methods are not always effective, depending on the condition of the dysarthric patient. In this study, we investigated the tempo perception performance of dysarthric patients, which has not yet been fully studied. Several types of experiments were conducted for both dysarthric patients and normal subjects. The experiments included speech production and tapping tasks with and without reference samples of utterances or tapping. The experimental results showed that some of the dysarthric subjects exhibited disorders both in the locomotor function of the vocal speech organ and in their memory of tempo and rhythm.

16:40A comparative acoustic study on speech of glossectomy patients and normal subjects

Xinhui Zhou (Department of Electrical and Computer Engineering, University of Maryland, College Park, USA)
Maureen Stone (Departments of Neural and Pain Sciences and Orthodontics, University of Maryland Dental School, Baltimore, USA)
Carol Espy-Wilson (Department of Electrical and Computer Engineering, University of Maryland, College Park, USA)

Oral, head and neck cancer represents 3% of all cancers in the United States and is the 6th most common cancer worldwide. Tongue cancer patients are treated by glossectomy, a surgical procedure to remove the cancerous tumor. As a result, tongue properties such as volume, shape, muscle structure, and motility are affected, and consequently so are the vocal tract acoustics. This study compares the speech acoustics of normal subjects and partial glossectomy patients with T1 or T2 tumors. The acoustic signals of four vowels (/iy/, /uw/, /eh/, and /ah/) and two fricatives (/s/ and /sh/) were analyzed. Our results show that, while the average formants (F1-F3) for the four vowels are very similar between the normal subjects and the glossectomy patients, the average centers of gravity for the two fricatives differ significantly. These differences in fricatives can be explained by the more posterior constriction in patients due to the glossectomy and its resulting longer front cavity.
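
The center of gravity used above to compare fricatives is commonly computed as the power-weighted mean frequency of the spectrum. The sketch below shows that common definition only; whether the study weights by power or by magnitude is not stated in the abstract, and the function name and input layout are hypothetical.

```python
def centre_of_gravity(freqs_hz, power):
    """Power-weighted mean frequency (Hz) of a spectrum.

    `freqs_hz` holds the bin centre frequencies and `power` the non-negative
    spectral power in each bin.
    """
    if len(freqs_hz) != len(power):
        raise ValueError("freqs_hz and power must have the same length")
    total = sum(power)
    if total <= 0.0:
        raise ValueError("spectrum has no energy")
    return sum(f * p for f, p in zip(freqs_hz, power)) / total
```

A longer front cavity tends to lower the main spectral peak of a fricative, which pulls this statistic down; that is the direction of the effect the abstract attributes to the more posterior constriction.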

17:00Dysperiodicity analysis of perceptually assessed synthetic stimuli

Ali Alpan (Université Libre de Bruxelles, Brussels, Belgium)
Francis Grenez (Université Libre de Bruxelles, Brussels, Belgium)
Jean Schoentgen (Université Libre de Bruxelles, Brussels, Belgium)

The objective is to analyze vocal dysperiodicities in perceptually assessed synthetic speech sounds. The analysis involves a variogram-based method that enables tracking instantaneous vocal dysperiodicities. The dysperiodicity trace is summarized by means of the signal-to-dysperiodicity ratio, which has been shown to correlate strongly with the perceived degree of hoarseness of the speaker. The stimuli have been generated by a synthesizer of disordered voices that has been shown to generate natural-sounding speech fragments comprising diverse vocal perturbations. The speech stimuli have been perceptually assessed by nine listeners according to grade, breathiness and roughness. In previous studies, signal-to-dysperiodicity ratios have been correlated with perceived degrees of hoarseness. The objective here is to extend the analysis to roughness and breathiness. A second objective is to analyze the dependence of the signal-to-dysperiodicity ratio on the signal properties fixed by the synthesizer parameters. Results show a good correlation between signal-to-dysperiodicity ratios and perceptual scores. At most two frequency bands are necessary to predict the perceptual scores. Additive noise contributes most, followed by jitter. The interaction between noise parameters, vocal frequency and vowel category contributes moderately or feebly.

17:20Is the perception of voice quality language-dependant? A comparison of French and Italian listeners and dysphonic speakers

Alain Ghio (LPL, Laboratoire Parole et Langage, CNRS, Aix-Marseille University, France)
Frédérique Weisz (Service ORL, CHU de la Timone, Marseille, France)
Giovanna Baracca (Fondazione IRCCS Cà Granda Ospedale Maggiore Policlinico, ORL Dept, Milano, Italy)
Giovanna Cantarella (Fondazione IRCCS Cà Granda Ospedale Maggiore Policlinico, ORL Dept, Milano, Italy)
Danièle Robert (Laboratoire Parole et Langage & Service ORL, CHU de la Timone, Aix-Marseille University, France)
Virginie Woisard (Service ORL, Unité voix et déglutition, CHU Rangueil-Larrey, Toulouse, France)
Franco Fussi (Centro Audiologico Foniatrico Azienda USL Ravenna, Italy)
Antoine Giovanni (Laboratoire Parole et Langage & Service ORL, CHU de la Timone, Aix-Marseille University, France)

We present an experiment in which the voice quality of French and Italian dysphonic speakers was evaluated by French and Italian listeners, all specialists in phoniatrics. Results showed that both groups of speakers were perceived in the same way by the two groups of listeners in terms of overall severity and breathiness. However, the perception of roughness is clearly language-dependent: Italian listeners underestimate roughness compared to French listeners. If we link these perception results with measures obtained in speech production, we can hypothesize that this is a case of a perception/production adaptation process.

17:40Automatic Selection of Acoustic and Non-linear Dynamic Features in Voice Signals for Hypernasality Detection

Juan Rafael Orozco (Universidad de Antioquia)
Santiago Murillo (Universidad Nacional de Colombia)
Andres Marino Alvarez (Universidad Nacional de Colombia)
Julian David Arias (Universidad Antonio Nariño)
Edilson Delgado (Instituto Tecnologico Metropolitano)
Jesus Francisco Vargas (Universidad de Antioquia)
Cesar German Castellanos (Universidad Nacional de Colombia)

Automatic detection of hypernasality in voices of children with Cleft Lip and Palate (CLP) is performed using two characterization techniques, one based on acoustic, noise and cepstral analysis and the other based on nonlinear dynamic features. Besides characterization, two automatic feature selection techniques are implemented in order to find optimal sub-spaces to better discriminate between healthy and hypernasal voices. Results indicate that nonlinear dynamic features are a valuable tool for automatic detection of hypernasality; additionally, both feature selection techniques show stable and consistent results, achieving accuracy levels of up to 93.73%.

Mon-Ses3-O4:
Source Separation

Time:Monday 16:00 Place:Michelangelo - Pala Affari - 2nd Floor Type:Oral
Chair:Tomohiro Nakatani

16:00Frequency Oriented PCA for Blind Speech Separation of Convolutive Mixtures in Multiple Environments

Yasmina Benabderrahmane (INRS-EMT Telecommunications Canada)
Sid Ahmed Selouani (Université de Moncton Canada)
Douglas O'Shaughnessy (INRS-EMT Telecommunications Canada)

This paper reports the results of a comparative study on blind speech separation (BSS) of two types of convolutive mixtures. The separation criterion is based on Frequency Oriented Principal Components Analysis (FOPCA). This method is compared to two other well-known methods: the Degenerate Unmixing Evaluation Technique (DUET) and Convolutive Fast Independent Component Analysis (C-FICA). The efficiency of FOPCA is exploited to derive a BSS algorithm for the under-determined case (more speakers than microphones). The FOPCA method is compared objectively in terms of the signal-to-interference ratio (SIR) and Perceptual Evaluation of Speech Quality (PESQ) criteria, and subjectively by the Mean Opinion Score (MOS). Conventional frequency-domain algorithms are usually subject to permutation problems. The proposed algorithm has the attractive feature that this problem does not arise.

16:20Blind Speech Separation in Time-Domain Using Block-Toeplitz Structure of Reconstructed Signal Matrices

Zbynek Koldovsky (Technical University of Liberec)
Petr Tichavsky (Institute of Information Theory and Automation)
Jiri Malek (Technical University of Liberec)

Methods for Blind Source Separation (BSS) aim at recovering signals from their mixture without prior knowledge about the signals and the mixing system. Among others, they provide tools for enhancing speech signals when they are disturbed by unknown noise or other interfering signals in the mixture. This paper considers a recent time-domain BSS method that is based on a complete decomposition of a signal subspace into components that should be independent. The components are used to reconstruct images of original signals using an ad hoc weighting, which influences the final performance of the method markedly. We propose a novel weighting scheme that utilizes block-Toeplitz structure of signal matrices and relies thus on an established property. We provide experiments with blind speech separation and speech recognition that prove the better performance of the modified BSS method.

16:40Generalized method for solving the permutation problem in frequency-domain blind source separation of convolved speech signals

Auxiliadora Sarmiento (Department of Signal Theory and Communications, University of Seville, Seville, Spain.)
Iván Durán (Department of Signal Theory and Communications, University of Seville, Seville, Spain.)
Sergio Cruces (Department of Signal Theory and Communications, University of Seville, Seville, Spain.)
Pablo Aguilera (Department of Signal Theory and Communications, University of Seville, Seville, Spain.)

The blind speech separation of convolutive mixtures can be performed in the time-frequency domain. The separation problem reduces to a set of instantaneous mixing problems, one for each frequency bin, that can be solved independently by any appropriate instantaneous ICA algorithm. However, the arbitrary order of the estimated sources in each frequency bin, known as the permutation problem, has to be resolved to successfully recover the original sources. This paper deals with the permutation problem in the general case of N sources and N observations. The proposed method combines a correlation approach based on the amplitude correlation property of speech signals, and an optimal pairing scheme to align the permuted solutions. Our method is robust to artificially permuted speech signals. Experimental results on simulated convolutive mixtures show the effectiveness of the proposed method in terms of the quality of the separated signals, assessed by objective and perceptual measures.
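
The amplitude correlation idea described above can be illustrated with a small sketch: for one frequency bin, choose the ordering of the separated outputs whose amplitude envelopes correlate best with reference envelopes (for example, those of an already aligned neighbouring bin). This is not the paper's optimal pairing scheme; it substitutes a brute-force search over all N! orderings, which is only practical for small N. All function names and the data layout are hypothetical.

```python
import math
from itertools import permutations


def pearson(a, b):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0.0 or var_b == 0.0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def best_permutation(ref_envs, bin_envs):
    """Return perm such that bin_envs[perm[i]] is assigned to reference source i.

    Each element of ref_envs and bin_envs is one source's amplitude envelope
    over time frames. The ordering with the largest summed correlation wins.
    """
    n = len(ref_envs)

    def score(perm):
        return sum(pearson(ref_envs[i], bin_envs[perm[i]]) for i in range(n))

    return max(permutations(range(n)), key=score)
```

Applying the returned ordering bin by bin lines up each source across frequency before the signals are transformed back to the time domain.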

17:00Adaptation of speaker-specific bases in non-negative matrix factorization for single channel speech-music separation

Emad M. Grais (Faculty of Engineering and Natural Sciences, Sabanci University, Istanbul, Turkey.)
Hakan Erdogan (Faculty of Engineering and Natural Sciences, Sabanci University, Istanbul, Turkey.)

This paper introduces a speaker adaptation algorithm for nonnegative matrix factorization (NMF) models. The proposed adaptation algorithm is a combination of Bayesian and subspace model adaptation. The adapted model is used to separate a speech signal from a background music signal in a single recording. Training speech data for multiple speakers is used with NMF to train a set of basis vectors as a general model for speech signals. The probabilistic interpretation of NMF is used to achieve Bayesian adaptation to adjust the general model with respect to the actual properties of the speech signal that is observed in the mixed signal. The Bayesian adapted model is adapted again by a linear transform, which changes the subspace that the Bayesian adapted model spans to better match the speech signal that is in the mixed signal. The experimental results show that combining Bayesian with linear transform adaptation improves the separation results.

17:20An Informed Source Separation System for Speech Signals

Shuhua Zhang (GIPSA-lab, Grenoble Institute of Technology, Grenoble, France)
Laurent Girin (GIPSA-lab, Grenoble Institute of Technology, Grenoble, France)

In two previous papers, we proposed an audio Informed Source Separation (ISS) system which can achieve the separation of I > 2 musical sources from linear instantaneous stationary stereo (2-channel) mixtures, based on audio signal’s natural sparsity, pre-mix source signals analysis, and side-information embedding (within the mix signal). In the present paper and for the first time, we apply this system to mixtures of (up to seven) simultaneous speech signals. Compared to the reference MPEG-4 Spatial Audio Object Coding system, our system provides much cleaner separated speech signals (consistently 10–20 dB higher Signal to Interference Ratios), revealing strong potential for audio conference applications.

17:40Adaptive Blocking Beamforming for Speech Separation

Ngoc Thuy Tran (Institute for Telecommunications Research, University of South Australia)
William Cowley (Institute for Telecommunications Research, University of South Australia)
Andre Pollok (Institute for Telecommunications Research, University of South Australia)

This paper tackles the speech separation problem in a meeting room using a new acoustic beamforming method – adaptive blocking (AB) beamformer. The proposed method is an optimum beamforming with a structure similar to the generalized sidelobe canceller (GSC) structure, but simpler. Thus, it inherits the flexibility of GSC and functions well in dynamic environments. We investigate the performance of the proposed method through different experiments and compare the results with a GSC beamformer for minimum variance distortionless response (MVDR). The experimental setups include one wanted speaker, two interferers, air conditioner noise and uncorrelated sensor noise. AB provides improvement over MVDR-GSC.

Mon-Ses3-O5:
Multimodal Signal Processing

Time:Monday 16:00 Place:Raffaello - Pala Affari - 3rd Floor Type:Oral
Chair:Keikichi Hirose

16:00Asynchronous Multimodal Text Entry using Speech and Gesture Keyboards

Per Ola Kristensson (University of St Andrews)
Keith Vertanen (Princeton University)

We propose reducing errors in text entry by combining speech and gesture keyboard input. We describe a merge model that combines recognition results in an asynchronous and flexible manner. We collected speech and gesture data of users entering both short email sentences and web search queries. By merging recognition results from both modalities, word error rate was reduced by 53% relative for email sentences and 29% relative for web searches. For email utterances with speech errors, we investigated providing gesture keyboard corrections of only the erroneous words. Without the user explicitly indicating the incorrect words, our model was able to reduce the word error rate by 44% relative.

16:20Robust Bimodal Person Identification Using Face and Speech with Limited Training Data and Corruption of Both Modalities

Niall McLaughlin (Queen's University Belfast)
Ji Ming (Queen's University Belfast)
Danny Crookes (Queen's University Belfast)

This paper presents a novel method of audio-visual fusion for person identification where both the speech and facial modalities may be corrupted, and there is a lack of prior knowledge about the corruption. Furthermore, we assume there is a limited amount of training data for each modality (e.g., a short training speech segment and a single training facial image for each person). A new representation and a modified cosine similarity are introduced for combining and comparing bimodal features with limited training data as well as vastly differing data rates and feature sizes. Optimal feature selection and multicondition training are used to reduce the mismatch between training and testing, thereby making the system robust to unknown bimodal corruption. Experiments have been carried out on a bimodal data set created from the SPIDRE and AR databases with variable noise corruption of speech and occlusion in the face images. The new method has demonstrated improved recognition accuracy.

16:40Toward a multi-speaker visual articulatory feedback system

Atef Ben Youssef (GIPSA-Lab)
Thomas Hueber (GIPSA-Lab)
Pierre Badin (GIPSA-Lab)
Gérard Bailly (GIPSA-Lab)

In this paper, we present recent developments on the HMM-based acoustic-to-articulatory inversion approach that we develop for a “visual articulatory feedback” system. In this approach, multi-stream phoneme HMMs are trained jointly on synchronous streams of acoustic and articulatory data, acquired by electromagnetic articulography (EMA). Acoustic-to-articulatory inversion is achieved in two steps. Phonetic and state decoding is first performed. Then articulatory trajectories are inferred from the decoded phone and state sequence using the maximum-likelihood parameter generation algorithm (MLPG). We introduce here a new procedure for the re-estimation of the HMM parameters, based on the Minimum Generation Error criterion (MGE). We also investigate the use of model adaptation techniques based on maximum likelihood linear regression (MLLR), as a first step toward a multi-speaker visual articulatory feedback system.

17:00Statistical Mapping between Articulatory and Acoustic Data for an Ultrasound-based Silent Speech Interface

Thomas Hueber (GIPSA-lab, Speech and Cognition Department / CNRS)
Elie-Laurent Benaroya (ESPCI ParisTech, Sigma laboratory)
Bruce Denby (Université Pierre et Marie Curie / ESPCI ParisTech)
Gérard Chollet (Telecom ParisTech, LTCI/CNRS)

This paper presents recent developments on our “silent speech interface” that converts tongue and lip motions, captured by ultrasound and video imaging, into audible speech. In our previous studies, the mapping between the observed articulatory movements and the resulting speech sound was achieved using a unit selection approach. We investigate here the use of statistical mapping techniques, based on the joint modeling of visual and spectral features, using respectively Gaussian Mixture Models (GMM) and Hidden Markov Models (HMM). The prediction of the voiced/unvoiced parameter from visual articulatory data is also investigated using an artificial neural network (ANN). A continuous speech database consisting of one hour of high-speed ultrasound and video sequences was specifically recorded to evaluate the proposed mapping techniques.

17:20Unsupervised geometry calibration of acoustic sensor networks using source correspondences

Joerg Schmalenstroeer (Department of Communications Engineering, University of Paderborn, Germany)
Florian Jacob (Department of Communications Engineering, University of Paderborn, Germany)
Reinhold Haeb-Umbach (Department of Communications Engineering, University of Paderborn, Germany)
Marius H. Hennecke (Department of Computer Science, TU Dortmund University, Germany)
Gernot A. Fink (Department of Computer Science, TU Dortmund University, Germany)

In this paper we propose a procedure for estimating the geometric configuration of an arbitrary acoustic sensor placement. It determines the position and the orientation of microphone arrays in 2D while locating a source by direction-of-arrival (DoA) estimation. Neither artificial calibration signals nor unnatural user activity are required. The problem of scale indeterminacy inherent to DoA-only observations is solved by adding time difference of arrival (TDOA) measurements. The geometry calibration method is numerically stable and delivers precise results in moderately reverberated rooms. Simulation results are confirmed by laboratory experiments.

17:40Investigations on Speaking Mode Discrepancies in EMG-based Speech Recognition

Michael Wand (Cognitive Systems Lab, Karlsruhe Institute of Technology, Karlsruhe, Germany)
Matthias Janke (Cognitive Systems Lab, Karlsruhe Institute of Technology, Karlsruhe, Germany)
Tanja Schultz (Cognitive Systems Lab, Karlsruhe Institute of Technology, Karlsruhe, Germany)

In this paper we present our recent study on the impact of speaking mode variabilities on speech recognition by surface electromyography (EMG). Surface electromyography captures the electric potentials of the human articulatory muscles, which enables a user to communicate naturally without making any audible sound. Our previous experiments have shown that the EMG signal varies greatly between different speaking modes, like audibly uttered speech and silently articulated speech. In this study we extend our previous research and quantify the impact of different speaking modes by investigating the amount of mode-specific leaves in phonetic decision trees. We show that this measure correlates highly with discrepancies in the spectral energy of the EMG signal, as well as with differences in the performance of a recognizer on different speaking modes. We furthermore present how EMG signal adaptation by spectral mapping decreases the effect of the speaking mode.

Mon-Ses3-P1:
Pitch Processing - Singing Voice Analysis

Time:Monday 16:00 Place:Valfonda 1 - Pala Congressi (Passi Perduti-Gallery) Type:Poster
Chair:Thomas Drugman

#1Fundamental Frequency Estimation Using Modified Higher Order Moments And Multiple Windows

Alipah Pawi (School of Engineering and Design, Brunel University, Uxbridge, London, UK)
Saeed Vaseghi (School of Engineering and Design, Brunel University, Uxbridge, London, UK)
Ben Milner (School of Computing Sciences, East Anglia University, Norwich, UK)
Seyed Ghorshi (School of Science and Engineering, Sharif University of Technology, International Kish Campus, Kish Island, Iran)

This paper proposes a set of higher-order modified moments for estimation of the fundamental frequency of speech and explores the impact of the speech window length on pitch estimation error. The pitch extraction methods are evaluated in a range of noise types and SNRs. For calculation of errors, pitch reference values are calculated from manually corrected estimates of the periods obtained from laryngograph signals. The results obtained for the 3rd and 4th order modified moments compare well with methods based on correlation and magnitude difference criteria and with the YIN method, with improved pitch accuracy and fewer large errors.

#2EM-based Gain Adaptation for Probabilistic Multipitch Tracking

Michael Wohlmayr (Graz University of Technology)
Franz Pernkopf (Graz University of Technology)

We introduce an EM algorithm for automatic speaker gain adaptation, and use this approach for probabilistic multipitch tracking. We derive a lower bound on the log-likelihood of the gain parameters and use a fast pruning method to make lower bound optimization efficient. We evaluate the performance of gain adapted multipitch tracking on the GRID database, where 3000 speech mixtures were generated for each mixing level. For gain differences in the range of zero up to 18dB, the proposed method achieves almost the same performance as for the case where the gain is assumed to be known.

#3Joint Robust Voicing Detection and Pitch Estimation Based on Residual Harmonics

Thomas Drugman (University of Mons)
Abeer Alwan (University of California, Los Angeles)

This paper focuses on the problem of pitch tracking in noisy conditions. A method using harmonic information in the residual signal is presented. The proposed criterion is used both for pitch estimation, as well as for determining the voicing segments of speech. In the experiments, the method is compared to six state-of-the-art pitch trackers on the Keele and CSTR databases. The proposed technique is shown to be particularly robust to additive noise, leading to a significant improvement in adverse conditions.

#4Epoch Extraction in High Pass Filtered Speech using Hilbert Envelope

Govind D (Indian Institute of Technology Guwahati)
Prasanna S R Mahadeva (Indian Institute of Technology Guwahati)
Debadatta Pati (Indian Institute of Technology Guwahati)

Hilbert envelope (HE) is defined as the magnitude of the analytic signal. This work proposes HE based zero frequency filtering (ZFF) approach for the extraction of epochs in high pass filtered speech. Epochs in speech correspond to instants of significant excitation like glottal closure instants. The ZFF method for epoch extraction is based on the signal energy around the impulse at zero frequency which seems to be significantly attenuated in case of high pass filtered speech. The low frequency nature of HE reinforces the signal energy around the impulse at zero frequency. This work therefore processes the HE of high pass filtered speech or its residual by zero frequency filtering for epoch extraction. The proposed approach shows significant improvement in performance for the high pass filtered speech compared to the conventional ZFF of speech.
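The definition of the Hilbert envelope as the magnitude of the analytic signal can be made concrete with a small sketch. The code below builds the analytic signal by zeroing the negative-frequency half of a DFT; it uses a plain O(N^2) transform from the standard library for readability, and the function names are assumptions made for this illustration. It covers only the envelope definition, not the zero frequency filtering step described in the abstract.

```python
import cmath
import math


def analytic_signal(x):
    """Analytic signal of a real sequence via a direct DFT (illustration only)."""
    n = len(x)
    # Forward DFT of the real input.
    spectrum = [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
                for k in range(n)]
    # One-sided weighting: keep DC (and Nyquist when n is even),
    # double the positive frequencies, zero the negative ones.
    weights = [0.0] * n
    weights[0] = 1.0
    if n % 2 == 0:
        weights[n // 2] = 1.0
        positive = range(1, n // 2)
    else:
        positive = range(1, (n + 1) // 2)
    for k in positive:
        weights[k] = 2.0
    # Inverse DFT of the one-sided spectrum gives the complex analytic signal.
    return [sum(weights[k] * spectrum[k] * cmath.exp(2j * math.pi * k * t / n)
                for k in range(n)) / n
            for t in range(n)]


def hilbert_envelope(x):
    """Hilbert envelope: magnitude of the analytic signal, sample by sample."""
    return [abs(z) for z in analytic_signal(x)]


# A cosine with a whole number of cycles has a constant envelope equal to its amplitude.
tone = [math.cos(2 * math.pi * 4 * t / 64) for t in range(64)]
envelope = hilbert_envelope(tone)
```

For real use, an FFT-based routine from a numerical library would replace the direct transform.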

#5Robust HNR-based Closed-loop Pitch and Harmonic Parameters Estimation

Alexander Pavlovets (Department of Electronic Computing Devices, Belarusian State University of Informatics and Radioelectronics, Minsk, Belarus)
Alexander Petrovsky (Department of Electronic Computing Devices, Belarusian State University of Informatics and Radioelectronics, Minsk, Belarus)

An important problem in the speech coding framework is model parameter estimation. In most cases, parametric speech coding methods do not preserve the shape of the speech waveform. As a result, parameters are usually estimated in a straightforward manner and the analysis-by-synthesis method is hardly used. A novel analysis-by-synthesis parameter estimation method for speech coders based on harmonic models is presented. We introduce an improved speech model based on robust separation of harmonic and noise components. The separation is performed using the Pitch Tracking Modified DFT (PTDFT). Harmonic parameters and pitch frequency are estimated simultaneously in a closed-loop manner based on the Harmonic-to-Noise Ratio (HNR).

#6Exploring Bessel Features for Detection of Glottal Closure Instants

Chetana Prakash (International Institute of Information Technology, Hyderabad - India)
Dhananjaya Nagaraje Gowda (Indian Institute of Technology, Chennai - India)
Suryakanth V. Gangashetty (International Institute of Information Technology, Hyderabad - India)

For voiced speech, the most significant excitation takes place around the instant of glottal closure. Glottal closure instant (GCI) information is useful for accurate speech analysis. In particular, accurate spectrum analysis is performed by considering the speech in the intervals of glottal closure. In this paper we propose an approach for the detection of GCIs by exploring Bessel features and the use of an AM-FM signal model. Using an appropriate range of Bessel coefficients, a narrow-band, band-limited signal is obtained for the given signal. The band-limited signal is considered as an AM-FM signal. The signal is band-limited to 0-300 Hz to remove the effect of formants. The amplitude envelope (AE) function of the AM-FM signal model is estimated by the discrete energy separation algorithm (DESA). The performance of the method is demonstrated using the CMU-Arctic database. The corresponding electro-glottograph (EGG) signals are used as a reference for the validation of the detected GCI locations.

#7Evaluation of Glottal Epoch Detection Algorithms on Different Voice Types

Joao Paulo Cabral (University College of Dublin)
John Kane (Trinity College Dublin)
Christer Gobl (Trinity College Dublin)
Julie Carson-Berndsen (University College of Dublin)

According to the source-filter model of speech production, speech can be represented by passing the excitation signal through the vocal tract filter. Epoch or instant of maximum excitation corresponds to the glottal closure instant. Several speech processing applications require robust epoch detection but this is a difficult task. Although state-of-the-art epoch estimation methods can produce reliable results, they are generally evaluated using speech recorded with a neutral voice quality (modal voice). This paper reviews and evaluates six popular algorithms for the calculation of glottal closure instants on speech spoken with modal voice and seven additional voice qualities. Results show that the performance of each method is affected by the voice type and that some methods perform better than others for each voice quality.

#8A divide et impera algorithm for optimal pitch stylization

Antonio Origlia (LUSI-Lab, Department of Physics, Federico II University, Naples, Italy)
Giovanni Abete (Department of Modern Philology, Federico II University, Naples, Italy)
Francesco Cutugno (LUSI-Lab, Department of Physics, Federico II University, Naples, Italy)
Iolanda Alfano (Department of Humanities Studies, University of Salerno, Italy)
Renata Savy (Department of Humanities Studies, University of Salerno, Italy)
Bogdan Ludusan (LUSI-Lab, Department of Physics, Federico II University, Naples, Italy)

We present OpS, a divide et impera algorithm to address the problem of pitch stylization as an optimization process in O(NlogN). We aim at balancing the quality of the stylized curve and its cost in terms of the number of control points used. We also investigate how the occurrence of prominent syllables can be exploited to obtain less expensive stylizations. Our tests show that the basic OpS algorithm performs in a similar way to the MOMEL algorithm without having to set any parameter. By introducing prominence, we show that the cost of the stylization is lowered without losing perceptual equality.

#9Singing Voice Analysis Using Relative Harmonic Delays

Ricardo Sousa (University of Porto-FEUP)
Aníbal Ferreira (University of Porto-FEUP)

In this paper we introduce new phase-related features denoting the delay between the harmonics and the fundamental frequency of a periodic signal, notably of voiced singing. These features are identified as Normalized Relative Delay (NRD) and denote the phase contribution to the shape invariance of a periodic signal. Thus, NRDs are amenable to a physical and psychophysical interpretation and are structurally independent of the overall time shift of the signal, an important property that is shared with the magnitude spectrum in the case of a locally stationary signal. We describe the NRD and report on preliminary studies testing the discrimination capability of NRDs applied to singing signals.

#10Singing voice synthesis: Singer-dependent vibrato modeling and coherent processing of spectral envelope

Siu Wa Lee (Institute for Infocomm Research)
Minghui Dong (Institute for Infocomm Research)

A pleasant singing voice is often ornamented by vibrato. This pitch fluctuation acts as a distinctive feature for singing and promotes voice quality. Nevertheless, independent pitch processing in singing voice synthesis does not guarantee the output quality. The spectral envelope actually varies with pitch during human voice production. This paper proposes a modeling technique for singers’ vibratos, followed by joint processing of vibrato and spectral envelope, such that these attributes are consistent. The performance of the proposed processing has been verified by a subjective listening test. The synthetic singing outputs are found to have quality similar to that of human singing.

#11Chorus Digitalis: experiments in chironomic choir singing

Sylvain Le Beux (LIMSI-CNRS)
Lionel Feugère (LIMSI-CNRS)
Christophe d'Alessandro (LIMSI-CNRS)

This paper reports on experiments in real-time gestural control of voice synthesis. The ability of hand writing gestures for controlling singing intonation (chironomic singing synthesis) is studied. In a first part, the singing synthesizer and controller are described. The system is developed in an environment for multi-users music synthesis, allowing for synthetic choir singing. In a second part, performances of subjects playing with the system are analysed. The results show that chironomic singers are able to control melody with accuracy, to perform vibrato, portamento and other types of fine-grained intonation variations, and to give convincing musical performances.

Mon-Ses3-P2:
Prosodic Modeling

Time:Monday 16:00 Place:Valfonda 2 - Pala Congressi (Passi Perduti-Gallery) Type:Poster
Chair:Hiroya Fujisaki

#1Prominence Model for Prosodic Features in Automatic Lexical Stress and Pitch Accent Detection

Kun Li (Human-Computer Communications Laboratory, The Chinese University of Hong Kong, Hong Kong)
Shuang Zhang (Human-Computer Communications Laboratory, The Chinese University of Hong Kong, Hong Kong)
Mingxing Li (Human-Computer Communications Laboratory, The Chinese University of Hong Kong, Hong Kong)
Wai-Kit Lo (Human-Computer Communications Laboratory, The Chinese University of Hong Kong, Hong Kong)
Helen Meng (Human-Computer Communications Laboratory, The Chinese University of Hong Kong, Hong Kong)

A prominence model is proposed for enhancing prosodic features in automatic lexical stress and pitch accent detection. We make use of a loudness model and incorporate differential pitch values to improve conventional features. Experiments show that these new prosodic features can improve the detection of lexical stress and pitch accent by about 6%. We further employ a prominence model to take into account effects from neighboring syllables. For pitch accent detection, we achieve a further performance improvement from 80.61% to 83.30%. For lexical stress detection, we achieve performance improvements in (i) classification of primary, secondary and unstressed syllables (from 76.92% to 78.64%), as well as (ii) determining the presence or absence of primary stress (from 86.99% to 89.80%).

#2Hierarchical Stress Modeling in Mandarin Text-to-Speech

Ya LI (National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China)
Jianhua TAO (National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing, China)
Xiaoying XU (Department of Chinese Language & Literature, Beijing Normal University, Beijing, China)

Automatic stress prediction is helpful for speech synthesis and natural speech understanding. This paper proposes a novel hierarchical Mandarin stress modeling method. The top level emphasizes stressed syllables, while the bottom level focuses, for the first time, on unstressed syllables because of their importance for both the naturalness and the expressiveness of synthetic speech. A Maximum Entropy model is adopted to predict stress structure from textual features. Experiments show that the modeling method can capture the macro- and micro-characteristics of stress successfully. The F-scores of the two-level stress predictions are 73.3% and 78.7%, respectively, which are satisfactory compared to other prosody predictions.

#3Automatic Prosodic Events Detection by Using Syllable-based Acoustic, Lexical and Syntactic Features

Chong-Jia Ni (NLPR,CASIA)
Wen-Ju Liu (NLPR,CASIA)
Bo Xu (NLPR,CASIA)

Automatic prosodic event detection and annotation are important for both speech understanding and natural speech synthesis. In this paper, the complementary model method is proposed to detect prosodic events. This method discards the independence assumption between the acoustic features and the lexical and syntactic features, models not only the features of the current syllable but also its contextual features at the model level, and realizes the complementarities by taking advantage of each model. The experiments on the Boston University Radio News Corpus show that the complementary model can yield a 91.40% pitch accent detection accuracy rate, a 95.19% intonational phrase boundary (IPB) detection accuracy rate and a 93.96% break index detection accuracy rate. When compared with previous work, the results for pitch accent, IPB and break index detection are significantly better.

#4Using Dynamic Time Warping to compute prosodic similarity measures

Albert Rilliard (LIMSI-CNRS)
Alexandre Allauzen (LIMSI-CNRS / Univ. Paris-Sud 11)
Philippe Boula de Mareüil (LIMSI-CNRS)

This paper presents the use of Dynamic Time Warping (DTW) for measuring prosodic differences between variable-sized sentences. This methodological study may apply to various prosodic functions, accented or expressive speech. Both the structuring and attitudinal functions of prosody are investigated here. We evaluated the relevance of three prosodic (dis)similarity measures to account for perceived variations. The importance of constraints on the DTW alignment process is highlighted, together with the possibility to use prosodic features beyond pitch. Results show the effectiveness of DTW-based measurements to capture different syntactic-prosodic structures and to cluster prosodically similar attitudinal expressions, irrespective of the utterance length.
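To make the role of alignment constraints concrete, here is a minimal sketch of DTW between two one-dimensional contours (for example, f0 values per frame) of different lengths. The absolute-difference local cost and the optional band constraint (a Sakoe-Chiba style window) are assumptions chosen for illustration; the abstract does not say which local cost, features, or constraints the authors used.

```python
def dtw_distance(a, b, window=None):
    """Accumulated DTW cost between sequences a and b.

    window: optional maximum |i - j| allowed along the warping path.
    It is widened automatically to at least |len(a) - len(b)| so that
    a valid path always exists.
    """
    n, m = len(a), len(b)
    if window is None:
        window = max(n, m)  # effectively unconstrained
    window = max(window, abs(n - m))
    inf = float("inf")
    # cost[i][j]: best accumulated cost aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - window), min(m, i + window) + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # a advances alone
                                     cost[i][j - 1],      # b advances alone
                                     cost[i - 1][j - 1])  # both advance
    return cost[n][m]


# Two toy contours with the same shape but different lengths.
short_contour = [1.0, 2.0, 3.0]
long_contour = [1.0, 2.0, 2.0, 3.0]
same_shape = dtw_distance(short_contour, long_contour)
offset = dtw_distance([0.0, 0.0], [1.0, 1.0])
```

A tighter `window` forbids extreme stretching of one contour against the other, which is one simple way to constrain the alignment.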

#5Applying the quantitative target approximation model (qTA) to German and Brazilian Portuguese

Plinio Barbosa (State University of Campinas, Campinas, Brazil)
Hansjörg Mixdorff (Beuth University of Applied Sciences, Berlin, Germany)
Sandra Madureira (Catholic University of São Paulo, São Paulo, Brazil)

This work is an attempt to explore a prosodic domain other than the syllable for the quantitative target approximation (qTA) model. This is done by studying the model's ability to synthesise the melodic contours of two different languages, German and Brazilian Portuguese, in two distinct speaking styles, reading and storytelling. The connected utterances studied here present more complex material than hitherto studied using the qTA model. However, the modelling accuracy on these data is similar to that of the Fujisaki model. The results show that the word can be the domain for both prominence marking and phrase boundary type (terminal and non-terminal). By restricting the qTA parameter search space for the two mentioned functions, it is possible to develop an encoding scheme for them.

#6Stylization and Trajectory Modelling of Short and Long Term Speech Prosody Variations

Nicolas Obin (IRCAM)
Anne Lacheret (Modyco Lab.)
Xavier Rodet (IRCAM)

In this paper, a unified trajectory model based on the stylization and the modelling of f0 variations simultaneously over various temporal domains is proposed. The syllable is used as the minimal temporal domain for the description of speech prosody, and short-term and long-term f0 variations are stylized and modelled simultaneously over various temporal domains. During the training, a context-dependent model is estimated according to the joint stylized f0 contours over the syllable and a set of long-term temporal domains. During the synthesis, f0 variations are determined using the long-term variations as trajectory constraints. In a subjective evaluation in speech synthesis, the stylization and trajectory modelling of short and long term speech prosody variations is shown to consistently model speech prosody and to outperform the conventional short-term modelling.

#7Toward a Continuous Modeling of French Prosodic Structure: Using Acoustic Features to Predict Prominence Location and Prominence Degree

Mathieu Avanzi (Université de Neuchâtel)
Nicolas Obin (IRCAM, Paris)
Anne Lacheret (Université de Paris Ouest Nanterre)
Bernard Victorri (Lattice/ENS, Paris)

The aim of this paper is to present a tool developed in order to generate French rhythmical structure semi-automatically. On the basis of a phonemic alignment, the software first locates prominent syllables by considering basic acoustic features such as F0, duration and silent pause. It then assigns a degree of prominence to each syllable identified. The estimation of this degree results from a computation of the values of silent pause, relative duration and height averages used for prominence detection in the first step. The second part of the article presents an experiment conducted in order to validate the algorithm's performance, by comparing the predictions of the software with a continuous manual coding. The performance of the algorithm is encouraging: a Fleiss’ kappa calculation estimates the rate at 0.8, and a correlation agreement calculation at 91%, in the best cases.
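For readers unfamiliar with the agreement measure quoted above, the following sketch computes Fleiss' kappa from a table of category counts. The table layout (one row per rated item, one column per category, each row summing to the number of raters) and the toy numbers are assumptions made for illustration; they are not the authors' data.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa for a table counts[item][category] of rater votes.

    Every row must sum to the same number of raters (at least 2).
    """
    n_items = len(counts)
    n_raters = sum(counts[0])
    n_categories = len(counts[0])
    # Observed agreement per item, then averaged over items.
    per_item = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
                for row in counts]
    observed = sum(per_item) / n_items
    # Chance agreement from the overall share of votes in each category.
    shares = [sum(row[j] for row in counts) / (n_items * n_raters)
              for j in range(n_categories)]
    expected = sum(p * p for p in shares)
    return (observed - expected) / (1.0 - expected)


# Toy example: 4 syllables, 3 raters, categories (prominent, not prominent).
perfect = [[3, 0], [0, 3], [3, 0], [0, 3]]
mixed = [[2, 1], [1, 2]]
kappa_perfect = fleiss_kappa(perfect)
kappa_mixed = fleiss_kappa(mixed)
```

A value of 1 means complete agreement, while values at or below 0 indicate agreement no better than chance.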

#8Optimal models of prosodic prominence using the Bayesian information criterion

Tim Mahrt (Department of Linguistics, University of Illinois Urbana-Champaign)
Jui-Ting Huang (Department of Electrical and Computer Engineering, University of Illinois Urbana-Champaign)
Yoonsook Mo (Department of Linguistics, University of Illinois Urbana-Champaign)
Margaret Fleck (Department of Computer Science, University of Illinois Urbana-Champaign)
Mark Hasegawa-Johnson (Department of Electrical and Computer Engineering, University of Illinois Urbana-Champaign)
Jennifer Cole (Department of Linguistics, University of Illinois Urbana-Champaign)

This study investigated the relation between various acoustic features and prominence. Past research has suggested that duration, pitch, and intensity all play a role in the perception of prominence. In our past work, we found a correlation between these acoustic features and speaker agreement over the placement of prominence. The current study was motivated by a need to enrich our understanding of this correlation. Using the Bayesian information criterion, we show that the best model for a feature that cues prosody is not necessarily a single Gaussian. Rather, the best model depends on the feature. This finding has consequences for our understanding of the role of these features in the perception of prosody and for prosody recognition systems.
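The model comparison described above can be sketched with the usual definition BIC = k ln(n) - 2 ln(L), where k is the number of free parameters, n the number of samples and L the maximized likelihood; the lower value is preferred. The code below compares a single Gaussian with a two-component Gaussian mixture on one-dimensional data. The candidate models, the small EM routine, its initialization and the toy data are all assumptions made for illustration, not the procedure used in the study.

```python
import math


def gaussian_logpdf(x, mean, var):
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)


def log_sum_exp(a, b):
    m = max(a, b)
    return m + math.log(math.exp(a - m) + math.exp(b - m))


def bic(log_likelihood, n_params, n_samples):
    """Bayesian information criterion; lower is better."""
    return n_params * math.log(n_samples) - 2.0 * log_likelihood


def fit_single_gaussian(data, var_floor=1e-3):
    """Maximum-likelihood Gaussian; returns (log-likelihood, parameter count)."""
    n = len(data)
    mean = sum(data) / n
    var = max(sum((x - mean) ** 2 for x in data) / n, var_floor)
    return sum(gaussian_logpdf(x, mean, var) for x in data), 2


def fit_two_gaussians(data, iterations=50, var_floor=1e-3):
    """Two-component mixture fitted by a basic EM loop (illustrative only)."""
    n = len(data)
    grand_mean = sum(data) / n
    start_var = max(sum((x - grand_mean) ** 2 for x in data) / n, var_floor)
    means, variances, weights = [min(data), max(data)], [start_var] * 2, [0.5, 0.5]

    def component_logs(x):
        return [math.log(weights[k]) + gaussian_logpdf(x, means[k], variances[k])
                for k in range(2)]

    for _ in range(iterations):
        # E-step: responsibility of each component for each sample.
        resp = []
        for x in data:
            lp = component_logs(x)
            total = log_sum_exp(lp[0], lp[1])
            resp.append([math.exp(lp[0] - total), math.exp(lp[1] - total)])
        # M-step: re-estimate weights, means and variances.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            if nk < 1e-9:
                continue  # leave an empty component unchanged
            weights[k] = min(max(nk / n, 1e-9), 1.0 - 1e-9)
            means[k] = sum(r[k] * x for r, x in zip(resp, data)) / nk
            variances[k] = max(
                sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, data)) / nk,
                var_floor)
    loglik = sum(log_sum_exp(*component_logs(x)) for x in data)
    return loglik, 5  # two means, two variances, one free weight


# Toy bimodal feature values (made-up numbers).
values = [-0.2, -0.1, 0.0, 0.1, 0.2, 9.8, 9.9, 10.0, 10.1, 10.2]
bic_one = bic(*fit_single_gaussian(values), len(values))
bic_two = bic(*fit_two_gaussians(values), len(values))
```

On this clearly bimodal toy data the mixture obtains the lower BIC despite its extra parameters, which mirrors the kind of outcome the abstract describes for some features.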

#9Quantitative Analysis of Tone Coarticulation in Mandarin

Hussein Hussein (Beuth University of Applied Sciences, Berlin, Germany)
Hansjörg Mixdorff (Beuth University of Applied Sciences, Berlin, Germany)

The current paper examines the effect of tone coarticulation in Mandarin on the amplitude and duration of tone commands of the Fujisaki model and whether declination needs to be taken into account when synthesizing F0 contours of Mandarin. Based on a corpus of short sentences, mean parameters of the Fujisaki model were calculated for the 15 combinations of Mandarin tones. The resulting smoothed F0 contours differ from the canonical shapes due to tonal coarticulation. Results of averaged parameters suggest that sequences of tone commands with the same polarity can usually be merged into a single tone command because their tone command amplitudes At are very similar. T2 for the first and T1 for the second command in these sequences are also very similar, so they can be set to the same value. As a consequence, tonal combinations can be interpreted as sequences of tone switches between high and low tones, considerably simplifying the modeling. It was also found that for most utterances phrase commands of magnitude Ap greater than 0 occurred, indicating that the phrase component should be taken into account when analyzing and synthesizing F0 contours of Mandarin.

#10Tracking pitch contours using minimum jerk trajectories

Daniel Neiberg (CTT, TMH, CSC, KTH)
G Ananthakrishnan (CTT, TMH, CSC, KTH)
Joakim Gustafson (CTT, TMH, CSC, KTH)

This paper proposes a fundamental frequency tracker, with the specific purpose of comparing the automatic estimates with pitch contours that are sketched by trained phoneticians. The method uses a frequency domain approach to estimate pitch tracks that form minimum jerk trajectories. This method tries to mimic motor movements of the hand made while sketching. When the fundamental frequencies tracked by the proposed method on the oral and laryngograph signals were compared using the MOCHA-TIMIT database, the correlation was 0.98 and the root mean squared error was 4.0 Hz, which was slightly better than a state-of-the-art pitch tracking algorithm included in ESPS. We also demonstrate how the proposed algorithm could be applied when comparing with sketches made by phoneticians for the variations in accent II among the Swedish dialects.
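
For orientation, the classic point-to-point minimum jerk profile (zero velocity and acceleration at both ends) has a closed form. The sketch below evaluates it for a pitch movement between two F0 targets. It is not the tracker proposed in the abstract, and the function name and example values are assumptions made for illustration.

```python
def minimum_jerk(f_start, f_end, duration, t):
    """F0 (Hz) at time t on a minimum jerk movement from f_start to f_end."""
    # Normalised time, clamped to the movement interval [0, 1].
    tau = min(max(t / duration, 0.0), 1.0)
    # Fifth-order polynomial that minimises integrated squared jerk.
    shape = 10 * tau**3 - 15 * tau**4 + 6 * tau**5
    return f_start + (f_end - f_start) * shape

# Hypothetical rise from 120 Hz to 180 Hz over 200 ms, sampled every 20 ms.
contour = [minimum_jerk(120.0, 180.0, 0.2, i * 0.02) for i in range(11)]
```

The resulting contour starts and ends flat and passes through the midpoint of the two targets halfway through the movement.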

Mon-Ses3-P3:
Discourse and Dialogue

Time:Monday 16:00 Place:Faenza 1 - Pala Congressi (Passi Perduti-Gallery) Type:Poster
Chair:Patrick Ehlen

#1On the use of linguistic features in an automatic system for speech analytics of telephone conversations

Benjamin Maza (LIA, University of Avignon, France)
Marc El-Beze (LIA, University of Avignon, France)
Georges Linares (LIA, University of Avignon, France)
Renato De Mori (McGill University, School of Computer Science, Montreal, Quebec, Canada)

Research on the analysis of human/human conversations in a call centre is described. The purpose of the research is to provide short reports of each conversation with information useful for monitoring the call centre efficiency. Data from real users talking over the telephone with agents are processed by an automatic speech recognition (ASR) system. Reports are grouped into classes by the agents based on a predefined taxonomy. A training set of manually transcribed data is used for training the extraction of features relevant to the application and the classification of the conversations. The use of all the words of the application vocabulary, of automatically selected keywords, and of automatically learned sentence chunks containing semantic classes of words are compared and evaluated with a totally different test set. The results show a significant increase in performance when chunks are used, even in comparison with the use of bags of words obtained with a boosting algorithm.

#2Determining What Questions To Ask, with the Help of Spectral Graph Theory

Abe Kazemzadeh (University of Southern California)
Sungbok Lee (University of Southern California)
Panayiotis Georgiou (University of Southern California)
Shrikanth Narayanan (University of Southern California)

This paper considers objects and questions asked about the objects to be a graph and formulates the knowledge goal of a question-asking agent in terms of connecting this graph. The game of twenty questions can be thought of as a test of such a question-asking agent's knowledge. If this were completely specified, the goal of question asking would be to find the answer as quickly as possible. However, if the agent's knowledge is incomplete, it must have a secondary goal for the questions it plans: to complete its knowledge. We claim that this secondary goal of a question-asking agent can be formulated in terms of spectral graph theory. We show how the eigenvalues of a graph Laplacian of the question-object adjacency graph can identify whether a set of knowledge contains disconnected components and the zero elements of the powers of the question-object adjacency graph provide a way to identify these questions.
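
The role of adjacency-matrix powers can be sketched on a toy question-object graph. The example below covers only the "zero elements of the powers" idea (the Laplacian eigenvalue test needs a numerical library and is omitted). The graph, the node numbering and the function names are assumptions made for illustration, not the authors' implementation.

```python
def mat_mul(a, b):
    """Multiply two square matrices given as lists of lists."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]

def reachability(adj):
    """0/1 pattern of (I + A)^(n-1); a zero at (i, j) means no path from i to j."""
    n = len(adj)
    step = [[1 if (i == j or adj[i][j]) else 0 for j in range(n)]
            for i in range(n)]
    reach = [row[:] for row in step]
    for _ in range(max(n - 2, 0)):
        reach = [[1 if v else 0 for v in row] for row in mat_mul(reach, step)]
    return reach

def disconnected_pairs(adj):
    """Pairs of nodes that no chain of questions and objects links together."""
    reach = reachability(adj)
    n = len(adj)
    return [(i, j) for i in range(n) for j in range(i + 1, n) if reach[i][j] == 0]

def adjacency(n, edges):
    """Symmetric adjacency matrix for an undirected graph."""
    adj = [[0] * n for _ in range(n)]
    for i, j in edges:
        adj[i][j] = adj[j][i] = 1
    return adj

# Hypothetical knowledge: objects 0 and 1, questions 2 and 3.
# Question 2 has only been answered for object 0, question 3 only for object 1.
incomplete = adjacency(4, [(0, 2), (1, 3)])
gaps = disconnected_pairs(incomplete)
# Asking a new question (node 4) about both objects connects the graph.
completed = adjacency(5, [(0, 2), (1, 3), (0, 4), (1, 4)])
remaining = disconnected_pairs(completed)
```

In this toy case the zero entries single out the object and question pairs that the agent's knowledge cannot yet relate, and they disappear once a question shared by both objects is added.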

#3'Are you sure you're paying attention?' -- 'Uh-huh'. Communicating understanding as a marker of attentiveness

Hendrik Buschmeier (Sociable Agents Group, CITEC, Bielefeld University)
Zofia Malisz (Faculty of Linguistics and Literary Studies, Bielefeld University)
Marcin Wlodarczak (Faculty of Linguistics and Literary Studies, Bielefeld University)
Stefan Kopp (Sociable Agents Group, CITEC, Bielefeld University)
Petra Wagner (Faculty of Linguistics and Literary Studies, Bielefeld University)

We report on first results of an experiment designed to investigate properties of communicative feedback produced by non-attentive listeners in dialogue. Listeners were found to produce less feedback when distracted by an ancillary task. A decreased number of feedback expressions communicating understanding was a particularly reliable indicator of distractedness. We argue this finding could be used to facilitate recognition of attentional states in dialogue system users.

#4Projectability of Transition-relevance Places using Prosodic Features in Japanese Spontaneous Conversation

Yuichi Ishimoto (National Institute of Informatics)
Mika Enomoto (Tokyo University of Technology)
Hitoshi Iida (Tokyo University of Technology)

In this paper, to clarify acoustic features for predicting the ends of utterances, we investigated prosodic features that project transition relevance places in Japanese spontaneous conversation. Acoustic parameters used as the prosodic features are the fundamental frequency, power, and mora duration of accentual phrases and words. Results showed that the fundamental frequency and power at the beginning of the final accentual phrase indicate whether the utterance includes utterance-final elements, which are the syntactic cue for detecting the end-of-utterance. In addition, the mora duration lengthened in the final accentual phrase. That is, these prosodic features around the beginning of the final accentual phrase showed the characteristic changes that make hearers predict the transition relevance places.

#5Measuring Final Lengthening for Speaker-Change Prediction

Anna Hjalmarsson (KTH Speech Music and Hearing)
Kornel Laskowski (KTH Speech Music and Hearing)

We explore pre-silence syllabic lengthening as a cue for next-speakership prediction in spontaneous dialogue. When estimated using a transcription-mediated procedure, lengthening is shown to reduce error rates by 25% relative to majority class guessing. Lengthening should therefore be exploited by dialogue systems. With that in mind, we evaluate an automatic measure of spectral envelope change, Mel-spectral flux (MSF), and show that its speaker-independent performance is at least as good as that of the transcription-mediated measure. Modeling MSF is likely to improve turn uptake in dialogue systems, and to benefit other applications needing an estimate of durational variability in speech.

#6Incremental Learning and Forgetting in Stochastic Turn-Taking Models

Kornel Laskowski (KTH Speech Music and Hearing)
Jens Edlund (KTH Speech Music and Hearing)
Mattias Heldner (KTH Speech Music and Hearing)

We present a computational framework for stochastically modeling dyad interaction chronograms. The framework's most novel feature is the capacity for incremental learning and forgetting. To showcase its flexibility, we design experiments answering four concrete questions about the systematics of spoken interaction. The results show that: (1) individuals are clearly affected by one another; (2) there is individual variation in interaction strategy; (3) strategies wander in time rather than converge; and (4) individuals exhibit similarity with their interlocutors. We expect the proposed framework to be capable of answering many such questions with little additional effort.

#7Reinforcement Learning of Argumentation Dialogue Policies in Negotiation

Kallirroi Georgila (Institute for Creative Technologies, University of Southern California)
David Traum (Institute for Creative Technologies, University of Southern California)

We build dialogue system policies for negotiation, and in particular for argumentation. These dialogue policies are designed for negotiation against users of different cultural norms (individualists, collectivists, and altruists). In order to learn these policies we build simulated users (SUs), i.e. models that simulate the behavior of real users, and use reinforcement learning (RL). The SUs are trained on a spoken dialogue corpus in a negotiation domain, and then tweaked towards a particular cultural norm using hand-crafted rules. We evaluate the learned policies in a simulation setting. Our results are consistent with our SUs, in other words, the policies learn what they are designed to learn, which shows that RL is a promising technique for learning policies in domains, such as argumentation, that are more complex than standard slot-filling applications.

#8Topic Switching Strategies for Spoken Dialogue Systems

Tobias Heinroth (Ulm University)
Savina Koleva (Ulm University)
Wolfgang Minker (Ulm University)

One of the most important challenges researchers are facing within the field of Spoken Dialogue Systems is that life is neither domain dependent nor driven by a single task. In the recent past various methods to handle multiple tasks or topics in parallel within spoken human-human and human-computer dialogues have been investigated. In this paper we compare several task switching approaches such as discourse markers and task recovery methods. The aim of our study is to reveal which strategies users prefer regarding metrics such as efficiency, friendliness, and reliability. Furthermore we investigate how the different strategies influence the cognitive capacity of the subjects. The dialogues used for the study have been implemented utilising the OwlSpeak Spoken Dialogue Manager, which applies ontologies as dialogue models that can be dynamically combined during runtime.

#9Unsupervised Clustering of Utterances using Non-parametric Bayesian Methods

Ryuichiro Higashinaka (NTT Cyber Space Laboratories)
Noriaki Kawamae (NTT Comware Corporation)
Kugatsu Sadamitsu (NTT Cyber Space Laboratories, NTT Corporation)
Yasuhiro Minami (NTT Communication Science Laboratories)
Toyomi Meguro (NTT Communication Science Laboratories)
Kohji Dohsaka (NTT Communication Science Laboratories)
Hirohito Inagaki (NTT Cyber Space Laboratories, NTT Corporation)

Unsupervised clustering of utterances can be useful for the modeling of dialogue acts for dialogue applications. Previously, the Chinese restaurant process (CRP), a non-parametric Bayesian method, has been introduced and has shown promising results for the clustering of utterances in dialogue. This paper newly introduces the infinite HMM, which is also a non-parametric Bayesian method, and verifies its effectiveness. Experimental results in two dialogue domains show that the infinite HMM, which takes into account the sequence of utterances in its clustering process, significantly outperforms the CRP. Although the infinite HMM outperformed other methods, we also found that clustering complex dialogue data, such as human-human conversations, is still hard when compared to human-machine dialogues.
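
As background, the Chinese restaurant process can be sketched as a sequential seating rule: each new item joins an existing cluster with probability proportional to its size, or opens a new cluster with probability proportional to a concentration parameter. The code below samples from this prior only. It does not use utterance content and is not the clustering model of the paper. The function name, `alpha` and `seed` are assumptions made for illustration.

```python
import random

def crp_assignments(n_items, alpha, seed=0):
    """Sample cluster labels for n_items from a Chinese restaurant process."""
    rng = random.Random(seed)
    counts = []        # counts[k] = number of items already in cluster k
    assignments = []
    for i in range(n_items):
        # Cluster k is chosen with probability counts[k] / (i + alpha),
        # a new cluster with probability alpha / (i + alpha).
        r = rng.random() * (i + alpha)
        cumulative = 0.0
        cluster = len(counts)          # default: open a new cluster
        for k, count in enumerate(counts):
            cumulative += count
            if r < cumulative:
                cluster = k
                break
        if cluster == len(counts):
            counts.append(0)
        counts[cluster] += 1
        assignments.append(cluster)
    return assignments, counts

labels, sizes = crp_assignments(50, alpha=1.0, seed=1)
```

The number of clusters is not fixed in advance, which is the property that makes such priors attractive when the inventory of dialogue acts is unknown.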

Mon-Ses3-P4:
SLP for Speech Translation, Information Extraction and Retrieval

Time:Monday 16:00 Place:Faenza 2 - Pala Congressi (Passi Perduti-Gallery) Type:Poster
Chair:Dekai Wu

#1OOV Sensitive Named-Entity Recognition in Speech

Carolina Parada (Johns Hopkins University)
Frederick Jelinek (Johns Hopkins University)

Named Entity Recognition (NER), an information extraction task, is typically applied to spoken documents by cascading a large vocabulary continuous speech recognizer (LVCSR) and a named entity tagger. Recognizing named entities in automatically decoded speech is difficult since LVCSR errors can confuse the tagger. This is especially true of out-of-vocabulary (OOV) words, which are often named entities and always produce transcription errors. In this work, we improve speech NER by including features indicative of OOVs based on an OOV detector, allowing for the identification of regions of speech containing named entities, even if they are incorrectly transcribed. We construct a new speech NER data set and demonstrate significant improvements for this task.

#2Speech Translation with Grammar Driven Probabilistic Phrasal Bilexica Extraction

Markus Saers (Hong Kong University of Science and Technology)
Dekai Wu (Hong Kong University of Science and Technology)
Chi-Kiu Lo (Hong Kong University of Science and Technology)
Karteek Addanki (Hong Kong University of Science and Technology)

We introduce a new type of transduction grammar that allows for learning of probabilistic phrasal bilexica, leading to a significant improvement in spoken language translation accuracy. The current state-of-the-art in statistical machine translation relies on a complicated and crude pipeline to learn probabilistic phrasal bilexica---the very core of any speech translation system. In this paper, we present a more principled approach to learning probabilistic phrasal bilexica, based on stochastic transduction grammar learning applicable to speech corpora.

#3An Efficient Unified Extraction Algorithm for Bilingual Data

Christoph Tillmann (IBM Research)
Sanjika Hewavitharana (CMU)

The paper presents a unified algorithm for aligning sentences with their translations in bilingual data. The sentence alignment problem is handled as a large-scale pattern recognition problem similar to the task of finding the word sequence that corresponds to an acoustic input signal in isolated word automatic speech recognition (ASR). The algorithm gains efficiency from related work on dynamic programming (DP) search for speech recognition (ney84). The one-stage stack-based search is parametrized in a novel way, such that the unified algorithm can be used on various types of data that have been previously handled by separate implementations. With the help of unified beam-search candidate pruning, the algorithm is very efficient as it can be carried out in a single run over the data. Results are presented on a Russian-English and a Spanish-English extraction task. Based on a simple word-based scoring model, text chunk pairs are extracted out of several trillion candidates.

#4Using Features from Topic Models to Alleviate Over-generation in Hierarchical Phrase-based Translation

Songfang Huang (IBM T.J. Watson Research Center)
Bowen Zhou (IBM T.J. Watson Research Center)

In hierarchical phrase-based translation systems, the grammars (SCFG rules) have an over-generation problem because we can replace the non-terminal X with almost anything without knowing the syntactic or semantic role of X. In this paper, we present an approach that uses topic models to learn the distributions for non-terminals in each SCFG rule, based on which we further derive static features for the discriminative framework of statistical machine translation. Experimental results on three corpora show that we can obtain some gains in BLEU by using these features derived from topic models to alleviate the over-generation problem in hierarchical phrase-based translation.

#5An Empirical Study on Improving Hierarchical Phrase-based Translation Using Alignment Features

Songfang Huang (IBM T.J. Watson Research Center)
Bowen Zhou (IBM T.J. Watson Research Center)

In this paper, we empirically investigate three new features from word alignments to improve speech-to-speech translation on mobile devices for low-resource languages. The three features include one feature about alignment for boundary words of the target side phrase, one about the balance of terminal words between the source and the target side, and another about the number of unaligned words. We carry out experiments on both directions (E2F and F2E) for Pashto and Dari, two official languages of Afghanistan. By using the proposed alignment features, we can obtain improvements (up to 1% BLEU score) on the test sets for both Pashto and Dari.

#6Robust Speech Translation by Domain Adaptation

Xiaodong He (Microsoft Research)
Li Deng (Microsoft Research)

Domain adaptation is crucial to achieve robust performance across different conditions in speech translation. In this paper, we study the problem of adapting a general-domain, written-text-style machine translation system to a travel-domain speech translation task. We study a variety of domain adaptation techniques in a unified decoding process. The experimental results demonstrate significant translation improvement on the target scenario after domain adaptation. The results also demonstrate robust translation performance achieved across multiple conditions via joint data selection and model combination. We finally point out further directions for robust translation via variability-adaptive and discriminatively-adaptive learning.

#7Enhancements to the Training Process of Classifier-based Speech Translator via Topic Modeling

Emil Ettelaie (University of Southern California)
Panayiotis G. Georgiou (University of Southern California)
Shrikanth S. Narayanan (University of Southern California)

Classification of sentences based on their meaning (or concept) has been used as a component in speech translation and spoken language understanding systems. Preparing training data for this type of classifier is often a tedious task. In our previous work, we presented a method of clustering sentences as a step toward automated annotation of concepts. To measure the distance between two sentences, that method relied on the local lexical dependencies in their translations. In this work, we apply Topic Modeling to enhance the previously proposed distance metric so that it includes information from semantic associations among the words. Our experiments on the DARPA USC Transonics and BBN Transtac data sets show the advantage of incorporating this information as performance improvements in a set of clustering tasks.

#8A scalable approach for building a parallel corpus from the Web

Vivek Kumar Rangarajan Sridhar (AT&T Labs - Research)
Luciano Barbosa (AT&T Labs - Research)
Srinivas Bangalore (AT&T Labs - Research)

Parallel text acquisition from the Web is an attractive way to augment statistical models (e.g., machine translation, cross-lingual document retrieval) with domain representative data. The basis for obtaining such data is a collection of pairs of bilingual Web sites or pages. In this work, we propose a crawling strategy that locates bilingual Web sites by constraining the visitation policy of the crawler to the graph neighborhood of bilingual sites on the Web. Subsequently, we use a novel mining technique that recursively extracts text and links from the collection of bilingual Web sites obtained from the crawling. We demonstrate the efficacy of our approach in the context of machine translation in the tourism and hospitality domain. The parallel text obtained using our novel crawling strategy results in a relative improvement of 21% in BLEU score (English-to-Spanish) over an out-of-domain seed translation model trained on the European parliamentary proceedings.

#9Spoken Term Detection Results using Plural Subword Models by Estimating Detection Performance for Each Query

Yoshiaki Itoh (Iwate Prefectural University)
Kohei Iwata (Iwate Prefectural University)
Ishigame Masaaki (Iwate Prefectural University)
Kazuyo Tanaka (University of Tsukuba)
Shi-wook Lee (National Institute of AIST)

The present paper proposes a new integration method of plural spoken term detection (STD) results obtained from plural subword models that we previously proposed. We confirmed that these new subword models, which are the 1/2 phone model, the 1/3 phone model, and the sub-phonetic segment (SPS) model, are effective for STD systems, which must be vocabulary-free in order to process arbitrary query words. In addition, these models are more sophisticated on the time axis than conventional phone models, such as the triphone model. In the present study, we utilize the results of the subword models explicitly when integrating the plural results. For this purpose, we introduce an STD performance index that expresses the degree of detection difficulty for each query word. The index is approximated by the recognition accuracy of the query subword sequence. We demonstrate improved performance through experiments using an actual presentation speech corpus.

#10SpeechForms - From Web to Speech and Back

Luciano Barbosa (AT&T Labs - Research, Inc.)
Diamantino Caseiro (AT&T Labs - Research, Inc.)
Giuseppe Di Fabbrizio (AT&T Labs - Research, Inc.)
Amanda Stent (AT&T Labs - Research, Inc.)

This paper describes SpeechForms, a system that uses novel techniques to automatically identify form element semantics and form element content, and to semi-automatically generate language models that allow users to fill out each web form element by voice. Preliminary experimental results show that simple per-element language models are faster and may be more accurate than statistical n-gram language models trained on large amounts of web text data.

#11Image Processing Filters for Line Detection-based Spoken Term Detection

Kazuyuki NORITAKE (Graduate School of Science and Technology, Ryukoku University)
Hiroaki NANJO (Faculty of Science and Technology, Ryukoku University)
Takehiko YOSHIMI (Faculty of Science and Technology, Ryukoku University)

Spoken term detection (STD) from oral presentations is addressed. Specifically, we regard STD as a line detection problem in an image file, in which each pixel holds a syllable-distance between the query term and automatic speech recognition (ASR) results. Since such an image file inevitably includes ASR errors, line detection in a noisy image should be investigated. In this paper, we propose line detection-oriented image processing filters for STD. We achieved an F-measure of 0.46 for the low frequency term (out-of-vocabulary term in the ASR system) detection task, and an F-measure of 0.75 for the known term (in-vocabulary term in the ASR system) detection task.
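
The line detection view can be sketched with a toy distance image. The code below assumes a 0/1 syllable distance (0 for identical syllables) and searches only straight diagonals, so it leaves out both the image processing filters proposed in the paper and any handling of insertions or deletions. All names and the example syllables are hypothetical.

```python
def distance_image(query, transcript):
    """Pixel (i, j) holds the distance between query syllable i and ASR syllable j."""
    return [[0 if q == t else 1 for t in transcript] for q in query]

def best_diagonal(image):
    """Return (start column, summed distance) of the darkest diagonal line."""
    n_query, n_transcript = len(image), len(image[0])
    best = None
    for start in range(n_transcript - n_query + 1):
        cost = sum(image[i][start + i] for i in range(n_query))
        if best is None or cost < best[1]:
            best = (start, cost)
    return best

# Hypothetical query and a recognised syllable sequence with one ASR error
# ("kyu" recognised instead of "kyo").
query = ["to", "kyo"]
transcript = ["ko", "n", "to", "kyu", "e"]
hit = best_diagonal(distance_image(query, transcript))
```

A query occurrence shows up as a diagonal of low-distance pixels, and recognition errors appear as gaps in that line, which is why noise-robust line detection matters.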

#12Using Latent Topic Features for Named Entity Extraction in Search Queries

Joseph Polifroni (Nokia Research Center)
Francois Mairesse (Nokia Research Center)

Search is one of the most quickly growing applications in the mobile market. As people rely more on portable devices for performing search, it becomes increasingly important to analyze user queries in order to achieve more targeted results over a broad set of search entities. While most previous work has relied on lexico-syntactic features and handcrafted knowledge sources, this paper investigates methods for learning latent semantic features from unlabelled user-generated content. We extract word-topic associations by training a Latent Dirichlet Allocation model on a corpus of online reviews, and show that this information improves named-entity classification performance over broad domain search queries. We believe that topical features provide a rich source of information from data with minimal manual effort, and no dependency on a specific language.

#13Language model expansion using webdata for spoken document retrieval

Ryo Masumura (Graduate School of Engineering, Tohoku University)
Seongjun Hahm (Graduate School of Engineering, Tohoku University)
Akinori Ito (Graduate School of Engineering, Tohoku University)

In recent years, there has been increasing demand for ad hoc retrieval of spoken documents. We can use existing text retrieval methods by transcribing spoken documents into text data using a Large Vocabulary Continuous Speech Recognizer (LVCSR). However, retrieval performance deteriorates severely due to recognition errors and OOV words. To solve these problems, we previously proposed an expansion method that compensates the transcription using text data downloaded from the Web. In this paper, we introduce two improvements into the existing document expansion framework. First, we exploit a large-scale sample database of webdata as the source of the relevant documents. Using the sample database, we can avoid the bias introduced by choosing keywords in the existing methods. Next, we exploit a document retrieval method based on statistical language models (SLMs), which is a popular framework in information retrieval. In addition to using the existing SLM framework, we propose a new smoothing method that considers recognition errors and missing keywords. The retrieval experiments showed that a good result was obtained by the proposed methods.

#14Effects of Query Expansion for Spoken Document Passage Retrieval

Tomoyosi Akiba (Toyohashi University of Technology)
Koichiro Honda (Toyohashi University of Technology)

One of the major challenges for spoken document retrieval is how to handle speech recognition errors within the target documents. Query expansion is promising for this challenge. In this paper, we apply relevance models, a type of query expansion method, for the spoken document passage retrieval task. We adapted the original relevance model for passage retrieval. We also extended it to benefit from massive collections of Web documents for query expansion. Through our experimental evaluation, we found that our relevance model successfully improved the retrieval performance. We also found that using Web documents was effective when the transcription of the target documents had a high word error rate.

#15Unsupervised Hidden Markov Modeling of Spoken Queries for Spoken Term Detection without Speech Recognition

Chun-an Chan (National Taiwan University)
Lin-shan Lee (National Taiwan University)

We propose an unsupervised technique to model the spoken query using a hidden Markov model (HMM) for spoken term detection without speech recognition. By unsupervised segmentation, clustering and training, a set of HMMs, referred to as acoustic segment HMMs (ASHMMs), is generated from the spoken archive to model the signal variations and frame trajectories. An unsupervised technique is also designed for ASHMM parameter training. A model-based approach for spoken term detection is then developed by constructing a query HMM from the ASHMMs, and then scoring the spoken documents using the query HMM. Experiments show that this model-based approach complements the feature-based dynamic time warping approach. A significant improvement in detection performance is achieved by integrating the two methods.

#16Topic Identification from Audio Recordings using Rich Recognition Results and Neural Network based Classifiers

Roberto Gemello (Loquendo)
Franco Mana (Loquendo)
Pier Domenico Batzu (Loquendo)

This paper investigates the use of a Neural Network classifier for topic identification from conversational telephone speech, which exploits rich recognition results coming from an automatic speech recognizer. The baseline features used to feed the neural classifier are produced using the words extracted from the 1-best sequence. Rich recognition results include the word union of the first n-best sequences, the consensus hypothesis and the full or pruned Word Confusion Network generated from the n-best sequences. Different probabilistic information attached to the words, including confidence and word posterior probabilities, is investigated together with classical and probabilistic feature weighting schemes. A large set of experiments on conversational telephone speech from the Fisher corpus is reported, showing significant improvements when compared to the state of the art.

Tue-Ses1-O1:
ASR - language models II

Time:Tuesday 10:00 Place:Auditorium - Pala Congressi Type:Oral
Chair:Stephan Kanthak

10:00Empirical Evaluation and Combination of Advanced Language Modeling Techniques

Tomas Mikolov (Speech@FIT, Brno University of Technology)
Anoop Deoras (Johns Hopkins University)
Stefan Kombrink (Speech@FIT, Brno University of Technology)
Lukas Burget (Speech@FIT, Brno University of Technology)
Jan Cernocky (Speech@FIT, Brno University of Technology)

We present results obtained with several advanced language modeling techniques, including a class-based model, cache model, maximum entropy model, structured language model, random forest language model and several types of neural network based language models. We show results obtained after combining all these models together by using linear interpolation. We conclude that for both small and moderately sized tasks, we obtain new state-of-the-art results with a combination of models, which is significantly better than the performance of any individual model. The perplexity reductions obtained against a Good-Turing trigram baseline are over 50%, and against a modified Kneser-Ney smoothed 5-gram over 40%.
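
Linear interpolation of language models can be sketched as a weighted sum of the per-word probabilities each model assigns, followed by the usual perplexity computation. The probabilities and weights below are made-up toy values, and the function names are assumptions. In practice the weights would be tuned on held-out data.

```python
import math

def interpolate(prob_streams, weights):
    """Per-word probabilities of the linearly interpolated model.

    prob_streams[m][i] is the probability model m assigns to word i of the test text.
    """
    return [sum(w * p for w, p in zip(weights, probs))
            for probs in zip(*prob_streams)]

def perplexity(probs):
    """Perplexity of a text given the probability assigned to each of its words."""
    return math.exp(-sum(math.log(p) for p in probs) / len(probs))

# Toy example: two models that are each confident on a different word.
model_a = [0.5, 0.1]
model_b = [0.1, 0.5]
combined = interpolate([model_a, model_b], [0.5, 0.5])
ppl_a, ppl_b, ppl_mix = perplexity(model_a), perplexity(model_b), perplexity(combined)
```

In this toy case the mixture has lower perplexity than either component because the two models make complementary errors, which is the effect the combination experiments rely on.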

10:20Personalizing Model M for Voice-search

Geoffrey Zweig (Microsoft)
Shuangyu Chang (Microsoft)

Model M is a recently proposed class based exponential n-gram language model. In this paper, we extend it with personalization features, address the scalability issues present with large data sets, and test its effectiveness on the Bing Mobile voice-search task. We find that Model M by itself reduces both perplexity and word error rate compared with a conventional model, and that the personalization features produce a further significant improvement. The personalization features provide a very large improvement when the history contains a relevant query; thus the overall effect is gated by the number of times a user re-queries a past request.

10:40Sentence Selection by Direct Likelihood Maximization for Language Model Adaptation

Takahiro Shinozaki (Department of Computer Science, Tokyo Institute of Technology, Tokyo, Japan)
Yu Kubota (Department of Computer Science, Tokyo Institute of Technology, Tokyo, Japan)
Sadaoki Furui (Department of Computer Science, Tokyo Institute of Technology, Tokyo, Japan)
Eiji Utsunomiya (Technology Development Center, KDDI R&D Labs. Inc., Tokyo, Japan)
Yasutaka Shindoh (Technology Development Center, KDDI R&D Labs. Inc., Tokyo, Japan)

A general framework of language model task adaptation is to select documents in a large training set based on a language model estimated on development data. However, this strategy has the deficiency that the selected documents are biased toward the most frequent patterns in the development data. To address this problem, a new task adaptation method is proposed that selects documents in the training set so as to directly reduce the perplexity on the development set. Moreover, a weighting method to modify the perplexity objective function is proposed to improve the generalization to unseen data. The proposed adaptation methods are evaluated by large vocabulary speech recognition experiments. It is shown that the proposed adaptation with the weighting term produces a compact-size model that gives consistently lower word error rates for different tasks.

11:00Feature Combination Approaches for Discriminative Language Models

Ebru Arisoy (IBM T.J. Watson Research Center)
Bhuvana Ramabhadran (IBM T.J. Watson Research Center)
Hong-Kwang Jeff Kuo (IBM T.J. Watson Research Center)

This paper focuses on feature combination approaches for discriminative language models (DLMs). DLM is a feature-based log-linear language modeling approach where the feature parameters are estimated discriminatively. DLM allows for easy integration of various knowledge sources into language modeling. Choosing the proper strategy when combining features coming from different information sources is important. We investigated three approaches for combining lexical, word class, and acoustic features in DLMs. The three approaches are joint parameter estimation, cascade training, and model score combination. The cascade approach is an interesting approach that finally gave the best test set performance, improving the word error rate by 0.49% absolute (3% relative) on transcription of English Broadcast News. The word class features and state duration features were found to be very complementary, and their combination provided most of the improvement.

11:20On-line Language Model Biasing for Multi-Pass Automatic Speech Recognition

Sankaranarayanan Ananthakrishnan (Raytheon BBN Technologies)
Stavros Tsakalidis (Raytheon BBN Technologies)
Rohit Prasad (Raytheon BBN Technologies)
Prem Natarajan (Raytheon BBN Technologies)

The language model (LM) is a critical component in statistical automatic speech recognition (ASR) systems, serving to establish a probability distribution over the hypothesis space. In typical use, the LM is trained off-line and remains static at run-time. We describe a novel LM biasing method suitable for multi-pass ASR systems. We use k-best lists from the initial recognition pass to obtain a confidence-weighted biasing of the LM training corpus. The latter is used to train a LM biased to the test input. The biased LM is used in the second pass to obtain refined hypotheses either by re-decoding or by re-ranking the k-best list. We sketch an on-line implementation of this scheme that lends itself to integration within low-latency systems. On Farsi and English test sets, we obtained relative reductions in perplexity of 24.5% and 31.6%, respectively. Additionally, relative reductions of 1.6% and 1.8% in WER were obtained for large-vocabulary Farsi and English ASR, respectively.
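
The perplexity and WER figures above are relative reductions. As a minimal sketch of how such figures are usually computed, the snippet below derives perplexity from per-token log probabilities and then a relative reduction. The function names and the toy probabilities are assumptions made for illustration; they are not taken from the paper.

```python
import math


def perplexity(token_logprobs):
    """Perplexity from natural-log token probabilities: exp(-mean log p)."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def relative_reduction(baseline, improved):
    """Relative reduction in percent, usable for perplexity or WER alike."""
    return 100.0 * (baseline - improved) / baseline


# Toy per-token probabilities, made up for illustration (not from the paper).
static_lm = [math.log(0.1)] * 4   # hypothetical static LM
biased_lm = [math.log(0.2)] * 4   # hypothetical LM biased to the test input

ppl_static = perplexity(static_lm)
ppl_biased = perplexity(biased_lm)
gain = relative_reduction(ppl_static, ppl_biased)
```

With these toy values the biased model assigns twice the probability to every token, so its perplexity is half that of the static model.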

11:40Mandarin word-character hybrid-input Neural Network Language Model

Moonyoung Kang (Raytheon BBN Technologies)
Tim Ng (Raytheon BBN Technologies)
Long Nguyen (Raytheon BBN Technologies)

We applied a neural network language model (NNLM) to Chinese by training and testing it on the 2011 GALE Mandarin evaluation task. Exploiting the fact that there are no word boundaries in written Chinese, we trained various NNLMs using either words or characters or both, including a word-character hybrid-input NNLM which accepts both words and characters as input. Our best result showed up to 0.6% absolute (6.3% relative) Character Error Rate (CER) reduction compared to an un-pruned 4-gram standard language model.

Tue-Ses1-O3:
Voice Conversion

Time:Tuesday 10:00 Place:Brunelleschi (Green Room) - Pala Congressi - 2nd Floor Type:Oral
Chair:Junichi Yamagishi

10:00One-to-Many Voice Conversion Based on Tensor Representation of Speaker Space

Daisuke Saito (The University of Tokyo)
Keisuke Yamamoto (The University of Tokyo)
Nobuaki Minematsu (The University of Tokyo)
Keikichi Hirose (The University of Tokyo)

This paper describes a novel approach to flexible control of speaker characteristics using tensor representation of speaker space. Realization of conversion from/to an arbitrary speaker's voice is an important objective in voice conversion. For this purpose, eigenvoice conversion (EVC) was proposed. In the EVC, a speaker space is based on GMM supervectors, and each speaker is represented by a weighted sum of eigen-supervectors. In this paper, we revisit construction of the speaker space. In our approach, each speaker is represented as a matrix of which the row and the column respectively correspond to the Gaussian component and the dimension of the mean vector, and the speaker space is derived by the tensor analysis of the set of the matrices. Our approach can solve an implicit problem of supervector representation, and it improves the performance of voice conversion. Experimental results of one-to-many voice conversion demonstrate the effectiveness of the proposed approach.

10:20A Study on Bag of Gaussian Model with Application to Voice Conversion

Yu Qiao (Shenzhen Institutes of Advanced Technology)
Tong Tong (Shenzhen Institutes of Advanced Technology)
Nobuaki Minematsu (The University of Tokyo)

The GMM-based mapping techniques have proved to be an efficient way to find a nonlinear regression function between two spaces, and have found success in voice conversion. In these methods, a linear transformation is estimated for each Gaussian component, and the final conversion function is a weighted summation of all linear transformations. These linear transformations fit well for samples near the center of at least one Gaussian component, but may not deal well with samples far from the centers of all Gaussian distributions. To overcome this problem, this paper proposes the Bag of Gaussian Model (BGM). The BGM consists of two types of Gaussian distributions, namely basic and complex distributions. Compared with the classical GMM, the BGM is adaptive to samples: for a given sample, the BGM can select the set of Gaussian distributions which fit the sample best. We develop a data-driven method to construct the BGM and show how to estimate a regression function with it. We carry out experiments on voice conversion tasks. The experimental results demonstrate the usefulness of BGM-based methods.
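
The classical GMM-based mapping that this abstract starts from (one linear transformation per Gaussian component, combined by a weighted summation) can be sketched in one dimension as follows. This is only an illustration of the baseline idea, not of the proposed BGM; the posterior weighting is the usual choice for such mappings, and the component parameters and the names `convert` and `components` are made-up assumptions.

```python
import math


def gaussian_pdf(x, mean, var):
    """Density of a 1-D Gaussian with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def posteriors(x, components):
    """Posterior probability of each component given the source sample x."""
    scores = [c["weight"] * gaussian_pdf(x, c["mean"], c["var"]) for c in components]
    total = sum(scores)  # may underflow to 0 for x very far from every center
    return [s / total for s in scores]


def convert(x, components):
    """Posterior-weighted sum of the per-component linear maps a * x + b."""
    post = posteriors(x, components)
    return sum(p * (c["a"] * x + c["b"]) for p, c in zip(post, components))


# Two hypothetical components; all numbers are illustrative only.
components = [
    {"weight": 0.5, "mean": -1.0, "var": 1.0, "a": 2.0, "b": 0.0},
    {"weight": 0.5, "mean": 1.0, "var": 1.0, "a": 0.5, "b": 1.0},
]
midpoint = convert(0.0, components)
```

At `x = 0.0` both components are equally likely, so the converted value is the plain average of the two linear maps. For samples far from every center the posteriors are dominated by whichever component is least distant, which is the situation the abstract identifies as problematic.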

10:40A Bayesian Approach to Voice Conversion Based on GMMs Using Multiple Model Structures

Lei Li (Nagoya Institute of Technology)
Yoshihiko Nankaku (Nagoya Institute of Technology)
Keiichi Tokuda (Nagoya Institute of Technology)

A spectral conversion method using multiple Gaussian Mixture Models (GMMs) based on the Bayesian framework is proposed. A typical spectral conversion framework is based on a GMM. However, in this conventional method, the appropriate number of GMM mixtures depends on the amount of training data, and thus the number of mixtures must be determined beforehand. In the proposed method, the variational Bayesian approach is applied to GMM-based voice conversion, and multiple GMMs are integrated as a single statistical model. Appropriate model structures are stochastically selected for each frame based on the Bayesian framework.

11:00Quality Improvement of Voice Conversion Systems Based on Trellis Structured VQ

Mahdi Eslami (Lecturer at Khaje Nasir Toosi University of Technology)
Hamid Sheikhzadeh (Assistant Professor)
Abolghasem Sayadiyan (Associate Professor)

Common voice conversion systems employ a spectral time domain mapping to convert speech from one speaker to another. The speech quality of conversion methods does not sound natural because the spectral/time domain patterns of two speakers’ speech do not match completely. In this paper we propose a method that uses intraframe (dynamic) characteristics in addition to inter frame characteristics to find the converted speech frames. This method is based on VQ and uses a trellis structure to find the best conversion function. The proposed method provides high quality converted voice, low computational complexity and small trained model size in contrast to other common methods. Subjective and objective evaluations are employed to demonstrate the superiority of the proposed method over the VQ-based and GMM-based methods.

11:20Voice Conversion using GMM with Enhanced Global Variance

Hadas Benisty (Department of Electrical Engineering Technion, Israel Institute of Technology Haifa, 32000, Israel)
David Malah (Department of Electrical Engineering Technion, Israel Institute of Technology Haifa, 32000, Israel)

The goal of voice conversion is to transform a sentence said by one speaker, to sound as if another speaker had said it. The classical conversion based on a Gaussian Mixture Model and several other schemes suggested since, produce muffled sounding outputs, due to excessive smoothing of the spectral envelopes. To reduce the muffling effect, enhancement of the Global Variance (GV) of the spectral features was recently suggested. We propose a different approach for GV enhancement, based on the classical conversion formalized as a GV-constrained minimization. Listening tests show that an improvement in quality is achieved by the proposed approach.
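
To make the notion of Global Variance concrete, here is a minimal sketch that computes the GV of a single feature trajectory and rescales the trajectory around its mean to reach a target GV. This simple variance scaling is a swapped-in illustration, not the GV-constrained minimization proposed in the paper; the function names and numbers are assumptions.

```python
def global_variance(track):
    """Variance of one feature dimension over all frames of an utterance."""
    mean = sum(track) / len(track)
    return sum((x - mean) ** 2 for x in track) / len(track)


def scale_to_target_gv(track, target_gv):
    """Rescale a trajectory around its mean so that its GV equals target_gv."""
    mean = sum(track) / len(track)
    gv = global_variance(track)
    if gv == 0.0:
        return list(track)  # a flat trajectory cannot be rescaled
    gain = (target_gv / gv) ** 0.5
    return [mean + gain * (x - mean) for x in track]


# Illustrative over-smoothed trajectory and a hypothetical target GV.
smoothed = [0.0, 1.0, 2.0, 1.0]
enhanced = scale_to_target_gv(smoothed, target_gv=2.0)
```

The rescaled trajectory keeps the same mean and shape but has a wider dynamic range, which is the effect GV enhancement aims for in order to counter muffled output.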

11:40Spectral Envelope Transformation using DFW and Amplitude Scaling for Voice Conversion with Parallel or Nonparallel Corpora

Elizabeth Godoy (Orange Labs)
Olivier Rosec (Orange Labs)
Thierry Chonavel (Telecom Bretagne)

Dynamic Frequency Warping (DFW) offers an appealing alternative to GMM-based voice conversion, which suffers from "over-smoothing" that hinders speech quality. However, to adjust spectral power after DFW, previous work returns to GMM-transformation. This paper proposes a more effective DFW with amplitude scaling (DFWA) that functions on the acoustic class level and is independent of GMM-transformation. The amplitude scaling compares average target and warped source log amplitude spectra for each class. DFWA outperforms the GMM in terms of both speech quality and timbre conversion, as confirmed in objective and subjective testing. Moreover, DFWA performance is equivalent using parallel or nonparallel corpora.

Tue-Ses1-P5:
Speech Audio Analysis and Classification

Time:Tuesday 10:00 Place:Donatello (Room Onice) - Pala Congressi - Ground Floor Type:Poster
Chair:Olivier Rosec

#1Stop Consonant Recognition by Temporal Fine Structure of Burst

Seppo Fagerlund (Department of Signal Processing and Acoustics, Aalto University)
Unto K. Laine (Department of Signal Processing and Acoustics, Aalto University)

The automatic classification of unvoiced stop consonants is widely considered a difficult task for traditional frequency domain and even time-frequency methods. The main reason for this is their short duration and diverse temporal structure. In this paper we present a novel method for stop consonant recognition. The method is based on statistical properties of the short temporal fine structure of the burst part. Classification is also evaluated with a simple frequency domain method.

#2Phonetic Classification Using Controlled Random Walks

Katrin Kirchhoff (University of Washington)
Andrei Alexandrescu (Facebook)

Recently, semi-supervised learning algorithms for phonetic classifiers have been proposed that have obtained promising results. Often, these algorithms attempt to satisfy learning criteria that are not inherent in the standard generative or discriminative training procedures for phonetic classifiers. Graph-based learners in particular maximize an objective function that not only incorporates the classification accuracy on a labeled set but also the global smoothness of the predicted label assignment. In this paper we investigate a novel graph-based semi-supervised learning framework that implements a controlled random walk, i.e. different possible moves in the random walk are controlled by probabilities that are dependent on the properties of the graph itself. Experimental results on the TIMIT corpus are presented that demonstrate the effectiveness of this procedure.

#3Keyphrase Cloud Generation of Broadcast News

Luís Marujo (LTI/CMU and INESC-ID/IST/UTL)
Márcio Viveiros (VoiceInteraction)
João P. Neto (INESC-ID/IST/UTL and VoiceInteraction)

This paper describes an enhanced automatic keyphrase extraction method applied to Broadcast News. The keyphrase extraction process is used to create a concept level for each news item, on top of the words resulting from a speech recognition system output and news indexation. It contributes to the generation of a tag/keyphrase cloud of the top news included in a Multimedia Monitoring Solution system for TV and radio news/programs, running daily and monitoring 12 TV channels and 4 radio stations.

#4Optimized Feature Extraction and HMMs in Subword Detectors

Alfonso M. Canterla (Department of Electronics and Telecommunications, NTNU)
Magne H. Johnsen (Department of Electronics and Telecommunications, NTNU)

This paper presents methods and results for optimizing subword detectors in continuous speech. Speech detectors are useful within areas like detection-based ASR, pronunciation training, phonetic analysis, word spotting, etc. We build detectors for both articulatory features and phones by discriminative training of detector-specific MFCC filterbanks and HMMs. The resulting filterbanks are clearly different from each other and reflect acoustic properties of the corresponding detection classes. For the TIMIT task, our detector-specific features reduce the average detection error rate by 20% compared to standard MFCCs.

#5Real-World Speech/Non-Speech Audio Classification Based on Sparse Representation Features and GPCs

Ziqiang Shi (School of Computer Science and Technology, Harbin Institute of Technology)
Jiqing Han (School of Computer Science and Technology, Harbin Institute of Technology)
Tieran Zheng (School of Computer Science and Technology, Harbin Institute of Technology)

A novel and robust approach for content based speech/non-speech audio classification is proposed based on sparse representation (SR) features and Gaussian process classifiers (GPCs). The projections of the noise robust sparse representations for audio signals computed by -norm minimization are used as features. GPCs are used to learn and predict audio categories. Compared to the difficulties of Support Vector Machines (SVMs) in determining the hyperparameters, GPCs employ a Bayesian selection criterion to estimate them. Experimental results on real-world audio datasets show that the SR features are more robust to audio variants than mel-frequency cepstral coefficients (MFCCs) and that the proposed approach performs better than SVMs.

#6Privacy Preserving Speaker Verification using Adapted GMMs

Manas Pathak (Carnegie Mellon University)
Bhiksha Raj (Carnegie Mellon University)

In this paper we present an adapted UBM-GMM based privacy preserving speaker verification (PPSV) system, where the system is not able to observe the speech data provided by the user and the user does not observe the models trained by the system. These privacy criteria are important in order to prevent an adversary with unauthorized access to the user's client device from impersonating the user, and to prevent another adversary who breaks into the verification system from learning about the user's speech patterns in order to impersonate the user in another system. We present protocols for speaker enrollment and verification which preserve privacy according to these requirements and report experiments with a prototype implementation on the YOHO dataset.

#7Clustering Expressive Speech Styles in Audiobooks Using Glottal Source Parameters

Eva Szekely (CNGL, School of Computer Science and Informatics, University College Dublin, Dublin, Ireland)
Joao Cabral (CNGL, School of Computer Science and Informatics, University College Dublin, Dublin, Ireland)
Peter Cahill (CNGL, School of Computer Science and Informatics, University College Dublin, Dublin, Ireland)
Julie Carson-Berndsen (CNGL, School of Computer Science and Informatics, University College Dublin, Dublin, Ireland)

A great challenge for text-to-speech synthesis is to produce expressive speech. The main problem is that it is difficult to synthesise high-quality speech using expressive corpora. With the increasing interest in audiobook corpora for speech synthesis, there is a demand to synthesise speech which is rich in prosody, emotions and voice styles. In this work, Self-Organising Feature Maps (SOFM) are used for clustering the speech data using voice quality parameters of the glottal source, in order to map out the variety of voice styles in the corpus. Subjective evaluation showed that this clustering method successfully separated the speech data into groups of utterances associated with different voice characteristics. This work can be applied in unit-selection synthesis by selecting appropriate data sets to synthesise utterances with specific voice styles. It can also be used in parametric speech synthesis to model different voice styles separately.

#8On the use of the rhythmogram for automatic syllabic prominence detection

Bogdan Ludusan (LUSI-lab, Department of Physical Sciences, Federico II University, Naples, Italy)
Antonio Origlia (LUSI-lab, Department of Physical Sciences, Federico II University, Naples, Italy)
Francesco Cutugno (LUSI-lab, Department of Physical Sciences, Federico II University, Naples, Italy)

In this paper we will investigate the usefulness of the rhythmogram, a speech rhythm representation based on the Auditory Primal Sketch model, for the automatic detection of prominent syllables. This representation was compared to other features usually used for this task and it showed a higher performance in the identification of prominent/non-prominent syllables. A new prominence detection algorithm is proposed, combining the rhythmogram and pitch features and tested on two corpora of Italian and French. The results obtained showed significant detection improvements with respect to other systems in the literature, 0.9% and 2.5% accuracy increase respectively.

#9Speech Modulation Features for Robust Nonnative Speech Accent Detection

Sethserey Sam (Laboratoire d'Informatique de Grenoble (LIG)-France / MICA Research Center-Vietnam)
Xiong Xiao (School of Computer Engineering, Nanyang Technological University, Singapore / Temasek Lab@NTU, Nanyang Technological University, Singapore)
Laurent Besacier (LIG Laboratory, UMR CNRS 5524 BP 53, 38041, Grenoble Cedex 9, France)
Eric Castelli (MICA research center, UMI CNRS 2954, HUT, Hanoi, Vietnam)
Haizhou Li (Department of Human Language Technology, Institute for Infocomm Research, Singapore / Temasek Lab@NTU, Nanyang Technological University, Singapore)
Eng Siong Chng (School of Computer Engineering, Nanyang Technological University, Singapore / Temasek Lab@NTU, Nanyang Technological University, Singapore)

In this paper, we propose to use speech modulation features for robust nonnative accent detection. The modulation spectrum carries long-term temporal information of speech and may discriminate accents of native and nonnative speakers. For each speech segment to be tested, we extract a 10-dimensional feature vector from the modulation spectrum and use it for model training and testing. The proposed modulation features are compared with other popular features such as pitch and formant on a nonnative French accent detection task. Results show that the modulation features produce good detection performance and are quite robust to channel distortions. In addition, when the test scores of modulation features and pitch features are combined, performance is further significantly improved. The best equal error rate is 13.1%, obtained by fusing pitch and modulation-based systems.

#10Frame-Level Vocal Effort Likelihood Space Modeling for Improved Whisper-Island Detection

Chi Zhang (Ph.D. student)
John Hansen (Professor, Chair of EE Department)

In this study, a frame-based vocal effort likelihood space modeling framework for improved whisper-island detection within normally phonated audio streams is proposed. The proposed method is based on first training a traditional GMM for whisper and neutral speech, which is then employed to extract a newly proposed discriminative feature set entitled Vocal Effort Likelihood (VEL), for whisper-island detection. The VEL feature set is integrated within a BIC/T2-BIC segmentation scheme for vocal effort change point (VECP) detection. With the dimension-reduced VEL 2-D feature set, the proposed framework has reduced computational costs versus prior method [1]. The proposed algorithm is shown to improve performance in VECP detection with the lowest Multi-Error Score (MES) of 6.33. Finally, experimental performance achieves a 100% detection rate for the proposed algorithm.

#11Speaker Identification for Whispered Speech Using A Training Feature Transformation From Neutral To Whisper

Xing Fan (University of Texas at Dallas)
John Hansen (University of Texas at Dallas)

Due to the mismatched spectral structures resulting from the different production mechanisms of whispered speech, the performance of speaker identification/speech recognition systems trained with neutral speech degrades significantly when tested with whispered speech. This study considers a feature transformation method in the training phase that leads to a more robust speaker model for speaker ID with whispered speech. In the proposed system, a Speech Mode Independent (SMI) Universal Background Model (UBM) is built using collected real neutral features and pseudo whispered features generated with Vector Taylor Series (VTS) or Constrained Maximum Likelihood Linear Regression (CMLLR) model adaptation. Text-independent closed set speaker ID results show an accuracy of 88.87% using the proposed method, which represents a relative improvement of 46.26% compared with the 79.29% accuracy of the baseline system.
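
The 46.26% relative improvement is numerically consistent with a relative reduction in identification error rather than a relative gain in accuracy. The short check below, using only the two accuracies quoted in the abstract, shows the arithmetic; the function name is an assumption made for illustration.

```python
def relative_error_reduction(baseline_acc, new_acc):
    """Relative reduction of the error rate (100 - accuracy), in percent."""
    baseline_err = 100.0 - baseline_acc
    new_err = 100.0 - new_acc
    return 100.0 * (baseline_err - new_err) / baseline_err


# Accuracies quoted in the abstract: 79.29% baseline, 88.87% proposed.
reduction = relative_error_reduction(79.29, 88.87)
print(round(reduction, 2))
```

The error rate falls from 20.71% to 11.13%, a drop of 9.58 points, which is about 46.26% of the baseline error.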

#12An Accurate and Robust Gender Identification Algorithm

Andrea DeMarco (University of East Anglia, School of Computing Sciences)
Stephen J. Cox (University of East Anglia, School of Computing Sciences)

We describe a robust, unsupervised method of automatic gender identification from speech. We first design a baseline gender classifier based on MFCC features, and add a second classifier that uses context-dependent but text-independent pitch features. The results of these classifiers are then examined for disagreements in gender classification. Any disagreements are resolved by the use of a novel pitch-shifting mechanism applied to the utterances. We show how the acoustic context classifier provides very good gender identification results, and how these are further enhanced by the pitch-shifting process. Furthermore this enhancement is preserved across a set of different corpora.

#13Deep Belief Networks for Automatic Music Genre Classification

Xiaohong Yang (Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, China)
Qingcai Chen (Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, China)
Shusen Zhou (Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, China)
Xiaolong Wang (Harbin Institute of Technology Shenzhen Graduate School, Shenzhen, China)

This paper proposes an approach to automatic music genre classification using deep belief networks. The deep belief network is constructed based on restricted Boltzmann machines and takes the acoustic features extracted through content-based analysis of music signals as input. The model parameters are initially determined after the deep belief network is trained by a greedy layer-wise learning algorithm with feature vectors that are comprised of short-term and long-term features. Then the parameters are fine-tuned to a local optimum using the back propagation algorithm. Experiments on the GTZAN dataset show that the performance of music genre classification based on deep belief networks is superior to that of widely used classification methods such as support vector machine, K-nearest neighbor and linear discriminant analysis.

#14Image Representation of the Subband Power Distribution for Robust Sound Classification

Jonathan William Dennis (Institute for Infocomm Research, A*STAR, Singapore)
Huy Dat Tran (Institute for Infocomm Research, A*STAR, Singapore)
Haizhou Li (Institute for Infocomm Research, A*STAR, Singapore)

This paper proposes a robust sound event classification method, based on a selective image feature derived from the novel subband power distribution (SPD), which represents the distribution of power against frequency. This method is an extension of our previous work, which was motivated by the visual perception of the spectrogram to produce a robust feature for sound classification. Unlike the conventional spectrogram, the proposed SPD representation is invariant to time-shifting and therefore suitable in practice where the detected sound clips may not be balanced. Furthermore, we develop a missing feature classification method, which automatically selects the sparse, representative areas of the signal from the noisy SPD images. The method is tested on a large database containing 50 sound classes, under several different noise environments. A significant improvement in performance is obtained in mismatched conditions, producing an average classification accuracy of 87.5% with 0dB noise.

#15Acoustic and Visual Cues of Turn-Taking Dynamics in Dyadic Interactions

Bo Xiao (University of Southern California)
Viktor Rozgic (University of Southern California)
Athanasios Katsamanis (University of Southern California)
Brian Baucom (University of Southern California)
Panayiotis Georgiou (University of Southern California)
Shrikanth Narayanan (University of Southern California)

In this paper we introduce an empirical study of multimodal cues of turn-taking dynamics in a social interaction context. We first identify pauses, gaps and overlapped speech segments in the dyadic conversation dataset. Second, we define two types of measurements, Mean Equalized Energy (MEE) and Animation Level (AL) on the audio and video channels, respectively. Then, we verify the hypothesis that the speaker with higher MEE or AL is more likely to take the floor after silence or overlapped speech. The results suggest that both the vocal and visual movement energy offer useful cues towards inferring the intention of the interlocutor to grab the floor.

Tue-Ses1-O2:
Phonology and Phonetics

Time:Tuesday 10:00 Place:Leonardo - Pala Affari - Ground Floor Type:Oral
Chair:Mark Hasegawa-Johnson

10:00Laryngealization and Breathiness in Persian

Vahid Sadeghi (Imam Khomeini International University)

Persian has sequences of two vowels separated by an intervening glottal consonant (/h/ or /ʔ/). The VG(lottal)V sequence becomes reduced in certain occurrences, with the perceptual effect of the loss of the glottal consonant. The purpose of this study is to provide an acoustic description of VGV sequences in reduced forms. A production study examined three acoustic measurements of phonation types: H1-H2, H1-F1, and F0. The measurements were made at 15 ms time intervals throughout the second vowel to determine the time course of phonation effect. The issue of interest is what properties of VGV remain where G is lost. It is shown that /ʔ/ will be preserved as vowel laryngealization and /h/ as breathiness.

10:20Age-dependent differences in the neutralization of the intervocalic voicing contrast: Evidence from an apparent-time study on East Franconian

Viola Müller (Institute of Phonetics and Speech Processing (IPS), University of Munich (LMU), Munich, Germany)
Jonathan Harrington (Institute of Phonetics and Speech Processing (IPS), University of Munich (LMU), Munich, Germany)
Felicitas Kleber (Institute of Phonetics and Speech Processing (IPS), University of Munich (LMU), Munich, Germany)
Ulrich Reubold (Institute of Phonetics and Speech Processing (IPS), University of Munich (LMU), Munich, Germany)

The main aim of the present study was to investigate the extent to which East Franconian speakers neutralize the voicing opposition in intervocalic stops when they produce a variety of Standard German. A second aim was to test whether young and old speakers differ in their extent of neutralization and tend to a more standard-like pronunciation. We analyzed contrast maintenance by means of the vowel-to-stop duration ratio. An acoustic analysis of leiden-leiten revealed that old East Franconian speakers neutralized the voicing contrast either completely or to a greater extent than young East Franconian speakers. Young East Franconian speakers preserved the voicing contrast, although to a lesser extent than the Standard German speakers. A forced choice perception experiment showed that young but not old East Franconians perceived the lenis/fortis contrast. The results point to a sound change in progress in which a phonemic [± voice] stop distinction is developing in East Franconian.

10:40Comparing syllable frequencies in corpora of written and spoken language

Barbara Samlowski (Division of Language and Speech Communication, University of Bonn, Germany)
Bernd Möbius (Department of Computational Linguistics and Phonetics, Saarland University, Germany)
Petra Wagner (Faculty of Linguistics and Literary Studies, Bielefeld University, Germany)

In this study, various German language corpora were compared in order to discover the extent to which syllable frequencies remain stable across different contexts and modalities. Although considerable differences in relative frequency were found among the more common syllables, rank numbers proved to be more robust. Variation across corpora was mostly due to vocabulary characteristics of particular corpus domains rather than to systematic differences between spoken and written language. The results indicate that syllable frequencies in written corpora can be taken as a rough estimate for their frequency in spoken language.

11:00Sylli: Automatic Phonological Syllabification for Italian

Iacoponi Luca (University of Pisa)
Savy Renata (University of Salerno)

We will present a complete syllabifier for Italian (Sylli), that is based on phonological principles, flexible and easy to adapt for other uses, alphabets and languages. Crucial concepts regarding syllabification principles in modern phonological theory will be discussed (§1.1); specific issues concerning Italian syllabification will then be summarised (§1.2) and an overview of the available automatic syllabification models will be provided (§1.3). We will then move on to describe the program structure, the syllabification algorithm and two particular issues concerning syllabification in Italian (§2). Finally, we will illustrate the results of a manual syllabification test carried out by linguists to verify the accuracy of the algorithm (§3).

11:20A preliminary study on the production of signs in Brazilian Sign Language when one of the manual articulators is unavailable

André Nogueira Xavier (State University of Campinas)
Plinio Almeida Barbosa (State University of Campinas)

This paper aims at discussing the realization of some Brazilian Sign Language signs, articulated with both hands, when one of them is unavailable. As will be discussed, this unavailability is caused by extra-linguistic factors, as well as by a linguistic one. The data considered here were collected through the observation of spontaneous signing and discussed with three subjects. Their analysis revealed that the production of two-handed signs when one of the hands is not available does not simply consist of realizing them with only one hand, but alternatively employing other strategies, such as using a one-handed sign equivalent in meaning. Index terms: sign language, manual articulators, dynamical system

11:40Electroglottograph and Acoustic Cues for Phonation Contrasts in Taiwan Min Falling Tones

Ho-hsien Pan (National Chiao Tung University, TAIWAN)
Mao-hsu Chen (National Chiao Tung University, TAIWAN)
Shao-ren Lyu (National Chiao Tung University, TAIWAN)

This study explored the effective articulatory and acoustic parameters for distinguishing Taiwan Min falling unchecked tones 53 and 31 and checked tones 5 and 3. Data were collected from Zhangzhou, Quanzhou, and mixed accents in northern, central, and southern Taiwan. Results showed that the EGG parameters Contact Quotient (CQ) and Peak Increase in Contact (PIC) were not effective in distinguishing checked from unchecked tones across speakers. In contrast, f0 contour and Cepstral Peak Prominence (CPP) consistently distinguished checked tones from unchecked tones across speakers. The f0 onset was highest for tone 53, followed in order by tones 3, 5 and 31. The f0 contours of tone 5 were the highest in the latter half of the vowels. CPP measures of checked tones were higher than those of unchecked tones in the latter portion of vowels.

Tue-Ses1-O4:
Robust Speech Recognition III

Time:Tuesday 10:00 Place:Michelangelo - Pala Affari - 2nd Floor Type:Oral
Chair:Richard Stern

10:00Sinusoidal Approach for the Single-Channel Speech Separation and Recognition Challenge

Pejman Mowlaee (Institute of Communication Acoustics (IKA), Ruhr-Universität Bochum (RUB), Bochum, Germany)
Rahim Saeidi (School of Computing, University of Eastern Finland, Joensuu, Finland)
Zheng-Hua Tan (Dept. of Electronic Systems, Aalborg University, Aalborg, Denmark)
Mads Græsbøll Christensen (Dept. of Architecture, Design & Media Technology Aalborg University, Aalborg, Denmark)
Tomi Kinnunen (School of Computing, University of Eastern Finland, Joensuu, Finland)
Søren Holdt Jensen (Dept. of Electronic Systems, Aalborg University, Aalborg, Denmark)
Pasi Fränti (School of Computing, University of Eastern Finland, Joensuu, Finland)

Most single-channel speech separation (SCSS) systems use the short-time Fourier transform as their parametric features. Recent studies have shown that employing sinusoidal features for the SCSS application results in a high perceived speech quality. In this paper, we make a systematic study of automatic speech recognition results for an SCSS system that uses sinusoidal features composed of amplitude and frequency. We compare the speech recognition results with those already reported by other participants in the single-channel speech separation and recognition challenge. Our results show that a newly proposed system achieves an overall recognition accuracy of 52.3%, which ranks at the median among all other participants in the challenge.

10:20Semi-supervised Single-Channel Speech-Music Separation for Automatic Speech Recognition

Cemil Demir (TUBİTAK BİLGEM)
Murat Saraçlar (Bogazici University)
Ali Taylan Cemgil (Bogazici University)

In this study, we propose a semi-supervised speech-music separation method which uses the speech, music and speech-music segments in an audio recording to separate the speech and music signals from each other. In this strategy, we assume that the background music of the mixed signal is composed of repetitions of the music segment in the recording. Therefore, we use a mixture model to represent the music signal. The speech signal is modeled using a Non-negative Matrix Factorization (NMF) model, and the template and excitation matrices of the NMF model are estimated using the speech and mixed segments of the recording simultaneously. The separation performance of the proposed method is evaluated in an automatic speech recognition task and compared with the traditional NMF method.

10:40A Level-dependent Auditory Filter-bank for Speech Recognition in Reverberant Environments

HariKrishna Maganti (Fondazione Bruno Kessler - Center for Information Technology - IRST)
Marco Matassoni (Fondazione Bruno Kessler - Center for Information Technology - IRST)

Distortions due to reverberation have a detrimental effect on the performance of automatic speech recognition (ASR). In this work, an auditory filter-bank based feature is presented to improve ASR in reverberant conditions. The proposed technique is based on a gammachirp filter bank, which provides a level-dependent frequency response to emulate the mechanisms of the human auditory system, particularly basilar membrane filtering, which is aimed at improving the robustness of the ear. The low-frequency tail of the gammachirp filter, which is unaffected by bandwidth parameters due to the level-dependent frequency resolution, is effective in reducing reverberation distortions. Experiments are performed on the Aurora-5 meeting recorder digit task, recorded with four different microphones in hands-free mode in a real meeting room. The ASR experiments using the proposed gammachirp-based features show reliable and consistent improvements when compared to other conventional feature extraction techniques.

11:00A Multichannel Feature-Based Processing for Robust Speech Recognition

Mehrez Souden (NTT Communication Science Laboratories)
Keisuke Kinoshita (NTT Communication Science Laboratories)
Marc Delcroix (NTT Communication Science Laboratories)
Tomohiro Nakatani (NTT Communication Science Laboratories)

We propose a new approach for multichannel robust speech recognition. This approach extends the vector Taylor series (VTS)-based feature compensation from the single channel to the multichannel case. Precisely, we use the first order VTS to approximate each of the microphone feature vectors. Afterwards, these features are jointly processed to estimate the acoustic channel and noise statistics via expectation maximization (EM). Experimental results with TI-Digits and measured impulse responses show that the proposed method can achieve significant gains in terms of word recognition accuracy in different noise conditions.

11:20Feature Normalization Using Structured Full Transforms for Robust Speech Recognition

Xiong Xiao (Temasek Laboratories @ NTU, Nanyang Technological University, Singapore)
Jinyu Li (Microsoft Corporation, USA)
Eng Siong Chng (School of Computer Engineering, Nanyang Technological University, Singapore)
Haizhou Li (Department of Human Language Technology, Institute for Infocomm Research, Singapore)

Classical mean and variance normalization (MVN) uses a diagonal transform and a bias vector to normalize the mean and variance of noisy features to reference values. As MVN uses a diagonal transform, it ignores correlation between feature dimensions. Although a full transform is able to make use of feature correlation, its large number of parameters may not be estimated reliably from a short observation, e.g. one utterance. We propose a novel structured full transform that has the same number of free parameters as a diagonal transform while being able to capture correlation between feature dimensions. The proposed structured transform can be estimated reliably from one utterance by maximizing the likelihood of the normalized features on a reference Gaussian mixture model. Experimental results on the Aurora-4 task show that the structured transform produces consistently better speech recognition results than the diagonal transform and also outperforms the advanced front-end (AFE) feature extractor.
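
For readers unfamiliar with the baseline mentioned in this abstract, the following is a minimal sketch of classical per-utterance MVN only, not of the proposed structured full transform. It assumes zero mean and unit variance as the reference values; the function name and the toy feature values are illustrative and do not come from the paper.

```python
import math

def mean_variance_normalize(frames, eps=1e-8):
    """Normalize each feature dimension of one utterance independently.

    `frames` is a list of equal-length feature vectors. The reference
    values (zero mean, unit variance) are an assumption of this sketch.
    """
    n = len(frames)
    dim = len(frames[0])
    means = [sum(f[d] for f in frames) / n for d in range(dim)]
    stds = [
        math.sqrt(sum((f[d] - means[d]) ** 2 for f in frames) / n)
        for d in range(dim)
    ]
    # Diagonal transform: each dimension gets its own scale and bias,
    # so correlation between dimensions is ignored.
    return [
        [(f[d] - means[d]) / (stds[d] + eps) for d in range(dim)]
        for f in frames
    ]

# Toy utterance with three frames and two feature dimensions.
utterance = [[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]]
normalized = mean_variance_normalize(utterance)
```

Because every dimension is scaled on its own, this transform has one scale and one bias per dimension, which is the parameter budget the abstract refers to.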

11:40A Robust Estimation Method of Noise Mixture Model for Noise Suppression

Masakiyo Fujimoto (NTT Communication Science Laboratories, NTT Corporation)
Shinji Watanabe (NTT Communication Science Laboratories, NTT Corporation)
Tomohiro Nakatani (NTT Communication Science Laboratories, NTT Corporation)

Vector Taylor series (VTS)-based noise suppression usually employs a single Gaussian distribution for the noise model. However, this is insufficient for non-stationary noise, which has a multi-peak distribution. It is very complex to estimate the multi-peak distribution of the noise when the noise is treated as a random variable or a hidden variable. To solve these problems, we investigate a way of estimating the noise mixture model by using a minimum mean squared error (MMSE) estimate of the noise. By iterating the MMSE estimation of noise and noise model estimation, the proposed method realizes the simultaneous optimization of both the observed signal model and the noise model. The proposed method significantly outperformed the VTS-based approach, and the maximum improvement in the word error rate was about 12 %.

Tue-Ses1-O5:
Spoken Language Understanding

Time:Tuesday 10:00 Place:Raffaello - Pala Affari - 3rd Floor Type:Oral
Chair:Ruhi Sarikaya

10:00Multi-Task Learning for Spoken Language Understanding with Shared Slots

Xiao Li (Microsoft Corporation)
Ye-Yi Wang (Microsoft Corporation)
Gokhan Tur (Microsoft Corporation)

This paper addresses the problem of learning multiple spoken language understanding (SLU) tasks that have overlapping sets of slots. In such a scenario, it is possible to achieve better slot filling performance by learning multiple tasks simultaneously, as opposed to learning them independently. We focus on presenting a number of simple multi-task learning algorithms for slot filling systems based on semi-Markov CRFs, assuming the knowledge of shared slots. Furthermore, we discuss an intra-domain clustering method that automatically discovers shared slots from training data. The effectiveness of our proposed approaches is demonstrated in an SLU application that involves three different yet related tasks.

10:20Learning Weighted Entity Lists from Web Click Logs for Spoken Language Understanding

Dustin Hillard (Microsoft Speech Labs)
Asli Celikyilmaz (Microsoft Speech Labs)
Dilek Hakkani-Tur (Microsoft Speech Labs)
Gokhan Tur (Microsoft Speech Labs)

Named entity lists provide important features for language understanding, but typical lists can contain many ambiguous or incorrect phrases. We present an approach for automatically learning weighted entity lists by mining user clicks from web search logs. The approach significantly outperforms multiple baseline approaches and the weighted lists improve spoken language understanding tasks such as domain detection and slot filling. Our methods are general and can be easily applied to large quantities of entities, across any number of lists.

10:40Bootstrapping Domain Detection Using Query Click Logs for New Domains

Dilek Hakkani-Tür (Microsoft Speech Labs | Microsoft Research)
Gokhan Tur (Microsoft Speech Labs | Microsoft Research)
Larry Heck (Microsoft Speech Labs | Microsoft Research)
Elizabeth Shriberg (Microsoft Speech Labs)

Domain detection in spoken dialog systems is usually treated as a multi-class, multi-label classification problem, and training of domain classifiers requires collection and manual annotation of example utterances. In this work, we propose using web search query logs, which include queries entered by users and the links they subsequently click on, to bootstrap domain detection for new domains. While sampling user queries from the query click logs to train new domain classifiers, we introduce two types of measures based on the behavior of the users who entered a query and the form of the query. We show that both types of measures result in reductions in the error rate as compared to randomly sampling training queries. In controlled experiments over five domains, we achieve the best gain from the combination of the two types of sampling criteria.

11:00Multi-Domain Spoken Language Understanding with Approximate Inference

Asli Celikyilmaz (Microsoft)
Dilek Hakkani-Tur (Microsoft)
Gokhan Tur (Microsoft)

This paper presents a semi-latent topic model for semantic domain detection in spoken language understanding systems. We use labeled utterance information to capture latent topics, which directly correspond to semantic domains. Additionally, we introduce an ’informative prior’ for Bayesian inference that can simultaneously segment utterances of known domains into classes and divide them from out-of-domain utterances. We show that our model generalizes well on the task of classifying spoken language utterances and compare its results to those of an unsupervised topic model, which does not use labeled information.

11:20Speech Indexing Using Semantic Context Inference

Chien-Lin Huang (Human Language Technology Department, Institute for Infocomm Research, A*STAR, Singapore)
Bin Ma (Human Language Technology Department, Institute for Infocomm Research, A*STAR, Singapore)
Haizhou Li (Human Language Technology Department, Institute for Infocomm Research, A*STAR, Singapore)
Chung-Hsien Wu (Computer Science and Information Engineering, National Cheng Kung University, Taiwan)

This study presents a novel approach to spoken document retrieval based on semantic context inference for speech indexing. Each recognized term in a spoken document is mapped onto a semantic inference vector containing a bag of semantic terms through a semantic relation matrix. The semantic context inference vector is then constructed by summing up all the semantic inference vectors. Semantic term expansion and re-weighting make the semantic context inference vector a suitable representation for speech indexing. The experiments were conducted on 1550 anchor news stories collected from 198 hours of Mandarin Chinese broadcast news. The experimental results indicate that the proposed semantic context inference-based indexing contributes to a substantial performance improvement of spoken document retrieval.
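
The mapping-and-summing step described in this abstract can be sketched roughly as follows. This is an illustration under stated assumptions, not the authors' implementation: the relation matrix is stored as a nested dictionary, its terms and weights are invented, and the re-weighting step mentioned in the abstract is not modeled.

```python
def semantic_context_vector(recognized_terms, relation_matrix):
    """Sum the semantic inference vectors of all recognized terms.

    `relation_matrix[term]` is a hypothetical row of the semantic
    relation matrix: it maps related semantic terms to weights.
    """
    context = {}
    for term in recognized_terms:
        # Terms missing from the matrix contribute nothing in this sketch.
        for semantic_term, weight in relation_matrix.get(term, {}).items():
            context[semantic_term] = context.get(semantic_term, 0.0) + weight
    return context

# Invented toy relation matrix and recognizer output.
relation_matrix = {
    "bank": {"bank": 1.0, "finance": 0.6, "loan": 0.4},
    "loan": {"loan": 1.0, "finance": 0.5},
}
vector = semantic_context_vector(["bank", "loan", "bank"], relation_matrix)
```

Note that "finance" receives weight in the resulting vector although it was never recognized, which is the term-expansion effect the abstract describes.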

11:40Automatically Optimizing Utterance Classification Performance without Human in the Loop

Yun-Cheng Ju (Microsoft Research)
Jasha Droppo (Microsoft Research)

The Utterance Classification (UC) method has become a developer’s choice over traditional Context Free Grammars (CFGs) for voice menus in telephony applications. This data driven method achieves higher accuracy and has great potential to utilize a huge amount of labeled training data. But, having a human manually label the training data can be expensive. This paper provides a robust recipe for training a UC system using inexpensive acoustic data with limited transcriptions or semantic labels. It also describes two new algorithms that use caller confirmation, which naturally occurred within a dialog, to generate pseudo semantic labels. Experimental results show that, after having sufficient labeled data to achieve a reasonable accuracy, both of our algorithms can use unlabeled data to achieve the same performance as a system trained with labeled data, while completely eliminating the need for human supervision.

Tue-Ses1-P1:
Human Speech and Sound Perception I

Time:Tuesday 10:00 Place:Valfonda 1 - Pala Congressi (Passi Perduti-Gallery) Type:Poster
Chair:Denis Burnham

#1Parallels in infants’ attention to speech articulation and to physical changes in speech-unrelated objects

Eeva Klintfors (Dept of Linguistics, Section for Phonetics, Stockholm University)
Ellen Marklund (Dept of Linguistics, Section for Phonetics, Stockholm University)
Francisco Lacerda (Dept of Linguistics, Section for Phonetics, Stockholm University)

The mechanisms of how children develop the capacity to make use of visual cues while listening to speech are not exhaustively explored. The purpose of this study is to explore potential parallels between infants’ way of attending to speech articulation and their perception of physical changes in speech-unrelated objects. The current research questions grew out of an earlier study in which it was found that perception of speech in infants seems to be based on a match between auditory and visual prominence – as opposed to a match between sound and face. Data suggested that speech perception in infancy may function as described by Stevens’ power law, and two methodological supplements to test this hypothesis were made: first, a non-speech test condition was added to investigate infants’ perception of speech-unrelated objects, and second, amplitude-manipulated stimuli were added to introduce systematic changes in loudness. Results showed that visually prominent stimuli were favored in the speech and non-speech conditions.

#2Speech events are recoverable from unlabeled articulatory data: Using an unsupervised clustering approach on data obtained from Electromagnetic Midsagittal Articulography (EMA)

Daniel Duran (Institute for Natural Language Processing, University of Stuttgart, Germany)
Jagoda Bruni (Institute for Natural Language Processing, University of Stuttgart, Germany)
Grzegorz Dogil (Institute for Natural Language Processing, University of Stuttgart, Germany)
Hinrich Schütze (Institute for Natural Language Processing, University of Stuttgart, Germany)

Some models of speech perception/production and language acquisition make use of a quasi-continuous representation of the acoustic speech signal. We investigate whether such models could potentially profit from incorporating articulatory information in an analogous fashion. In particular, we investigate how articulatory information represented by EMA measurements can influence unsupervised phonetic speech categorization. By incorporating the acoustic signal and non-synthetic, raw articulatory data, we present first results of a clustering procedure similar to those applied in numerous language acquisition and speech perception models. It is observed that non-labeled articulatory data, i.e. data without previously assumed landmarks, yield good clustering results. A more effective clustering outcome for plosives than for vowels seems to support the motor view of speech perception.

#3Children’s recognition of their own voice: influence of phonological impairment

Sofia Strömbergsson (Department of Speech, Music and Hearing, School of Computer Science and Communication, Royal Institute of Technology (KTH), Stockholm, Sweden)

This study explores the ability to identify the recorded voice as one’s own, in three groups of children: one group of children with phonological impairment (PI) and two groups of children with typical speech and language development; 4-5 year-olds and 7-8 year-olds. High average performance rates in all three groups suggest that these children indeed recognize their recorded voice as their own, with no significant difference between the groups. Signs indicating that children with deviant speech use their speech deviance as a cue to identifying their own voice are discussed.

#4Evaluation of Bone-conducted Ultrasonic Hearing-aid Regarding Transmission of Speaker Discrimination Information

Takayuki Kagomiya (Health Research Institute, National Institute of Advanced Industrial Science and Technology (AIST), Japan)
Seiji Nakagawa (Health Research Institute, National Institute of Advanced Industrial Science and Technology (AIST), Japan)

Human listeners can perceive speech signals in a voice-modulated ultrasonic carrier from a bone-conduction stimulator, even if the listeners are patients with sensorineural hearing loss. Considering this fact, we have been developing a bone-conducted ultrasonic hearing aid (BCUHA). The purpose of this study is to evaluate the usability of BCUHA regarding transmission of speaker discrimination information. For this purpose, a prototype speaker discrimination test was developed. The test consists of 120 pairs of 10 words spoken by 10 speakers, and the examinee is asked to judge whether the speakers of each pair are "same" or "different". The usability of BCUHA was assessed using the speaker discrimination test. The test was also conducted under air-conduction (AC) and cochlear implant simulator (CIsim) conditions. The results show that BCUHA can transmit speaker information as well as CIsim.

#5Impact of Different Feedback Mechanisms in EMG-based Speech Recognition

Christian Herff (Cognitive Systems Lab, Karlsruhe Institute of Technology, Karlsruhe, Germany)
Matthias Janke (Cognitive Systems Lab, Karlsruhe Institute of Technology, Karlsruhe, Germany)
Michael Wand (Cognitive Systems Lab, Karlsruhe Institute of Technology, Karlsruhe, Germany)
Tanja Schultz (Cognitive Systems Lab, Karlsruhe Institute of Technology, Karlsruhe, Germany)

This paper reports on our recent research in the feedback effects of Silent Speech. Our technology is based on surface electromyography (EMG) which captures the electrical potentials of the human articulatory muscles rather than the acoustic speech signal. While recognition results are good for loudly articulated speech and when experienced users speak silently, novice users usually achieve far worse results when speaking silently. Since there is no acoustic feedback when speaking silently, we investigate different kinds of feedback modes: no additional feedback except the natural somatosensory feedback (like the touching of the lips), visual feedback using a mirror and indirect acoustic feedback by speaking simultaneously to a previously recorded audio signal. In addition we examine recorded EMG data when the subject speaks audibly and silently in a loud environment to see if the Lombard effect can be observed in Silent Speech, too.

#6Phonotactic constraints and the segmentation of Cantonese speech

Michael C. W. Yip (The Hong Kong Institute of Education)

Two word-spotting experiments were conducted to examine the question of whether native Cantonese listeners are constrained by phonotactic information in the segmentation of Cantonese continuous speech. Because no legal consonant clusters occur within individual Cantonese words, this kind of phonotactic information may well cue native Cantonese listeners to the locations of possible word boundaries in continuous speech. The observed results from the two experiments confirmed this prediction. Together with other relevant studies, we argue that phonotactic constraints are one of the useful sources of information in segmenting Cantonese continuous speech.

#7Reaction time and decision difficulty in the perception of intonation

Katrin Schneider (Institute for Natural Language Processing, University of Stuttgart, Germany)
Grzegorz Dogil (Institute for Natural Language Processing, University of Stuttgart, Germany)
Bernd Möbius (Department of Computational Linguistics and Phonetics, Saarland University, Germany)

An experiment was carried out to test the Categorical Perception as well as possible Perceptual Magnet Effects in the two boundary tone categories L% and H% in German, corresponding to statement vs. question interpretation, respectively. Additionally, reaction times (RT) were logged during all subtests to see if they support the results. Analyses revealed that RTs always increased with rising difficulty of the perceptual task, and decreased when the decision process was easy. Task-specific results showed that RT also correlated with the number of possible answers during a perceptual decision, i.e. more answer alternatives resulted in longer RT. Furthermore, female subjects generally reacted faster during all perceptual tasks, although this did not necessarily correlate with the accuracy of the results. Nevertheless, the results confirmed the usefulness of RT to support the analyses and the interpretation of perceptual data.

#8Processing of stress related acoustic cues as indexed by ERPs

Ferenc Honbolygó (Institute for Psychology, Hungarian Academy of Sciences, Budapest, Hungary)
Valéria Csépe (Institute for Psychology, Hungarian Academy of Sciences, Budapest, Hungary)

The present paper investigated the event-related brain potential correlates of the processing of word stress related acoustic changes. We studied the processing of non-speech stimuli containing similar intensity and f0 changes as speech stimuli in a passive oddball paradigm. Contrary to our previous results using speech stimuli with a trochaic stress pattern contrasted with an iambic stress pattern, non-speech stimuli elicited a single MMN component. This result was interpreted as showing that the processing of stress information is based on speech-specific mechanisms, instead of solely acoustic mechanisms.

#9On the relationship between perceived accentedness, acoustic similarity, and processing difficulty in foreign-accented speech

Marijt J. Witteman (MPI for Psycholinguistics, Nijmegen, The Netherlands, International Max Planck Research School, Radboud University, Nijmegen, The Netherlands)
Andrea Weber (MPI for Psycholinguistics, Nijmegen, The Netherlands, Donders Institute for Brain, Cognition and Behaviour, Nijmegen, The Netherlands)
James M. McQueen (MPI for Psycholinguistics, Nijmegen, The Netherlands, Donders Institute for Brain, Cognition and Behaviour, Nijmegen, The Netherlands, Behavioural Science Instit)

Foreign-accented speech is often perceived as more difficult to understand than native speech. What causes this potential difficulty, however, remains unknown. In the present study, we compared acoustic similarity and accent ratings of American-accented Dutch with a cross-modal priming task designed to measure online speech processing. We focused on two Dutch diphthongs: ui and ij. Though both diphthongs deviated from standard Dutch to varying degrees and perceptually varied in accent strength, native Dutch listeners recognized words containing the diphthongs easily. Thus, not all foreign-accented speech hinders comprehension, and acoustic similarity and perceived accentedness are not always predictive of processing difficulties.

#10Perception Boundary between Single and Geminate Stops in 3- and 4-mora Japanese Words

Shigeaki Amano (Faculty of Human Informatics, Aichi-Shukutoku University)
Yukari Hirata (Department of East Asian Languages and Literatures, Colgate University)

The perception boundary between single and geminate stops was examined by regression analyses in 3- and 4-mora Japanese words spoken at various speaking rates. It was found that the perception boundary is well predicted by a linear function with duration of stop closure and durations of word or disyllable which contained the single and geminate stops. However, we conclude that the disyllable duration was a better variable than the word duration because it provides a more consistent explanation for the perception boundary regardless of word length and speaking rate variations. The results support a relational acoustic invariance theory.

#11Correlation Analysis of Acoustic Features with Perceptual Voice Quality Similarity for Similar Speaker Selection

Yusuke Ijima (NTT Cyber Space Laboratories, NTT Corporation)
Mitsuaki Isogai (NTT Cyber Space Laboratories, NTT Corporation)
Hideyuki Mizuno (NTT Cyber Space Laboratories, NTT Corporation)

This paper describes the correlations between various acoustic features and perceptual voice quality similarity. We focus on identifying the acoustic features that are correlated with voice quality similarity. First, a large-scale perceptual experiment using the voices of 62 speakers is conducted and perceptual similarity scores between each pair of speakers are acquired. Next, multiple linear regression analysis is carried out; it shows that five acoustic features exhibit high correlation to voice quality similarity. Last, we perform similar speaker selection based on multiple linear regression with the above features and moreover, assess its performance by classifying speakers based on the perceptual similarity. The results indicate that the combination of the five acoustic features in classifying speakers into two classes is effective in choosing speakers with similar voice quality; it reduces the error rate by about 44 % compared to using just the cepstrum.

Tue-Ses1-P2:
Multilingual and Multimodal Approaches to Spoken Language

Time:Tuesday 10:00 Place:Valfonda 2 - Pala Congressi (Passi Perduti-Gallery) Type:Poster
Chair:Michael Johnston

#1Can Audio-Visual Speech Recognition outperform Acoustically Enhanced Speech Recognition in Automotive Environment?

Rajitha Navarathna (Speech, Audio, Image and Video Technology Lab, Queensland University of Technology, Australia)
Tristan Kleinschmidt (Speech, Audio, Image and Video Technology Lab, Queensland University of Technology, Australia)
David Dean (Speech, Audio, Image and Video Technology Lab, Queensland University of Technology, Australia)
Sridha Sridharan (Speech, Audio, Image and Video Technology Lab, Queensland University of Technology, Australia)
Patrick Lucey (Disney Research Pittsburgh, USA)

The use of visual features in the form of lip movements to improve the performance of acoustic speech recognition has been shown to work well, particularly in noisy acoustic conditions. However, whether this technique can outperform speech recognition incorporating well-known acoustic enhancement techniques, such as spectral subtraction, or multi-channel beamforming is not known. This is an important question to be answered especially in an automotive environment, for the design of an efficient human-vehicle computer interface. We perform a variety of speech recognition experiments on a challenging automotive speech dataset and results show that synchronous HMM-based audio-visual fusion can outperform traditional single as well as multi-channel acoustic speech enhancement techniques. We also show that further improvement in recognition performance can be obtained by fusing speech-enhanced audio with the visual modality, demonstrating the complementary nature of the two robust speech recognition approaches.

#2A Multimodal Approach to Dictation of Handwritten Historical Documents

Vicent Alabau (Institut Tecnològic d'Informàtica, Universitat Politècnica de València, Camino de Vera, s/n, 46022, Valencia, Spain)
Verónica Romero (Institut Tecnològic d'Informàtica, Universitat Politècnica de València, Camino de Vera, s/n, 46022, Valencia, Spain)
Antonio-L. Lagarda (Institut Tecnològic d'Informàtica, Universitat Politècnica de València, Camino de Vera, s/n, 46022, Valencia, Spain)
Carlos-D. Martínez-Hinarejos (Institut Tecnològic d'Informàtica, Universitat Politècnica de València, Camino de Vera, s/n, 46022, Valencia, Spain)

Handwritten Text Recognition is a problem that has gained attention in recent years due to the interest in the transcription of historical documents. Handwritten Text Recognition employs models that are similar to those employed in Automatic Speech Recognition (Hidden Markov Models and n-grams). Dictation of the contents of the document is an alternative to text recognition. In this work, we compare the performance of a Handwritten Text Recognition system with that of two speech dictation systems: a non-multimodal system that only uses speech, and a multimodal system that first performs text recognition, the output of which is used in the subsequent speech recognition. Results show that the multimodal combination outperforms all of the non-multimodal systems considered.

#3Weight Optimization for Bimodal Unit-Selection Talking Head Synthesis

Asterios Toutios (University Nancy 2 / LORIA)
Utpala Musti (University Nancy 2 / LORIA)
Slim Ouni (University Nancy 2 / LORIA)
Vincent Colotte (University Henri Poincaré Nancy 1 / LORIA)

This paper addresses talking head synthesis based on the concatenation of units comprising both acoustic and visual information. Selection of appropriate diphone units to synthesize a given text string is based on the minimization of a weighted linear combination of four costs that reflect linguistic, acoustic, and visual considerations. We present initial work toward a method to determine automatically the weights applied to each cost, using a series of metrics that assess quantitatively the performance of synthesis.
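
As a rough illustration of the selection criterion named in this abstract, the sketch below scores candidate units by a weighted linear combination of four costs and keeps the cheapest one. The candidate names, cost values and weights are invented, the four costs are left unnamed because the abstract does not identify them, and a real unit-selection system would search over whole unit sequences rather than single units.

```python
def weighted_cost(costs, weights):
    """Weighted linear combination of per-unit costs."""
    return sum(w * c for w, c in zip(weights, costs))

def select_unit(candidates, weights):
    """Return the name of the candidate with the lowest combined cost."""
    return min(candidates, key=lambda name: weighted_cost(candidates[name], weights))

# Hypothetical candidates, each with four cost values.
candidates = {
    "unit_a": (0.2, 0.5, 0.1, 0.4),
    "unit_b": (0.3, 0.1, 0.2, 0.2),
}
weights = (1.0, 1.0, 2.0, 0.5)  # illustrative weights, not tuned values
best = select_unit(candidates, weights)
```

Changing the weights can change which unit wins, which is why the abstract treats their automatic determination as the central problem.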

#4Modality Selection and Perceived Mental Effort in a mobile Application

Stefan Schaffer (Research Training Group prometei, TU Berlin)
Benjamin Jöckel (Research Training Group prometei, TU Berlin)
Ina Wechsung (Deutsche Telekom Laboratories, Quality & Usability Lab, TU Berlin)
Robert Schleicher (Deutsche Telekom Laboratories, Quality & Usability Lab, TU Berlin)
Sebastian Möller (Deutsche Telekom Laboratories, Quality & Usability Lab, TU Berlin)

This paper describes a study investigating the influence of efficiency and effectiveness on modality selection and perceived mental effort. Each participant had to perform several tasks with a smartphone application offering touch screen input and Wizard-of-Oz simulated speech recognition as input modalities. The results show that efficiency and effectiveness have a strong influence on modality selection. Speech usage increases with increasing efficiency of speech input. A lower effectiveness of speech input raised the threshold for changing the modality selection strategy. For effectiveness, mental effort differed significantly between the groups presented with low and high speech recognition error rates.

#5A cross-lingual spoken content search system

Jitendra Ajmera (IBM Research - India)
Ashish Verma (IBM Research - India)

This paper presents an approach towards enabling audio search for those languages where training an automatic speech recognition (ASR) system is difficult, owing to lack of training resources. Our work is related to previous approaches where the problem of allowing search for out-of-vocabulary terms has been addressed. In the proposed approach, the acoustic models (AM) for the phonetic recognizer are trained on a base language for which training data is available and used to search the content in a similar language. A phonetic language model (PLM) is trained for each language independently using text data available from a variety of sources including the web. We have performed experiments to evaluate this approach for searching through a Gujarati corpus where the AM were trained on an Indian-English corpus. The experimental results show that this approach can provide a P@10 (precision at 10) accuracy of up to 0.65.
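
The P@10 figure reported in this abstract is the fraction of the ten top-ranked results that are relevant. A minimal sketch of that metric follows; the segment identifiers and relevance judgments are invented for illustration and are unrelated to the paper's data.

```python
def precision_at_k(ranked_results, relevant, k=10):
    """Fraction of the top-k ranked results that are relevant."""
    top_k = ranked_results[:k]
    hits = sum(1 for item in top_k if item in relevant)
    return hits / k

# Hypothetical ranked search output and ground-truth relevant segments.
ranked = ["seg%d" % i for i in range(1, 13)]
relevant = {"seg1", "seg2", "seg4", "seg5", "seg7", "seg9", "seg11"}
p_at_10 = precision_at_k(ranked, relevant)
```

In this toy case six of the first ten results are relevant, and the relevant segment ranked eleventh does not count toward the score.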

#6NeMo: a Platform for Multilingual News Monitoring

Fabio Brugnara (FBK, Italy)
Daniele Falavigna (FBK, Italy)
Marcello Federico (FBK, Italy)
Christian Girardi (FBK, Italy)
Diego Giuliani (FBK, Italy)
Roberto Gretter (FBK, Italy)

News Monitor (NeMo) is an environment in which the Human Language Technology research unit at FBK brings together its technologies pertaining to Automatic Speech Recognition, Machine Translation and Natural Language Processing. In this view, it is a dynamic framework where we can share ideas and technologies, refine algorithms, and see and discuss the performance and errors of our algorithms, which are applied daily to fresh data. In this paper we describe a framework in which a set of parallel news streams in different languages are automatically transcribed and translated. The architecture of the system utilizes modules that perform ASR, MT and NLP. The development of the various modules relies upon a continuous acquisition of parallel data (both audio and text) in different languages. In particular, the availability of large corpora of aligned multilingual text/audio data has allowed us to implement unsupervised Acoustic Model training approaches.

#7Unsupervised Learning of Acoustic Unit Descriptors for Audio Content Representation and Classification

Sourish Chaudhuri (Carnegie Mellon University)
Mark Harvilla (Carnegie Mellon University)
Bhiksha Raj (bhiksha@cs.cmu.edu)

In this paper, we attempt to represent audio as a sequence of acoustic units using unsupervised learning and use them for multi-class classification. We expect the acoustic units to represent sounds or sound sequences to automatically create a sound alphabet. We use audio from multi-class YouTube-quality multimedia data to converge on a set of sound units, such that each audio file is represented as a sequence of these units. We then try to learn category language models over sequences of the acoustic units, and use them to generate acoustic and language model scores for each category. Finally, we use a margin-based classification algorithm to weight the category scores to predict the class that each test data point belongs to. We compare different settings and report encouraging results on this task.

#8Conditioned Hidden Markov Model Fusion for Multimodal Classification

Michael Glodek (Ulm University)
Stefan Scherer (Ulm University)
Friedhelm Schwenker (Ulm University)

Classification using hidden Markov models (HMM) is in general done by comparing the model likelihoods and choosing the class more likely to have generated the data. This work investigates a conditioned HMM which additionally provides a probability for a class label and compares different fusion strategies. The notion is two-fold: on the one hand applications in affective computing might pass their uncertainty of the classification to the next processing unit, on the other hand different streams might be fused to increase the performance. The data set studied incorporates two modalities and is based on a naturalistic multiparty dialogue. The goal is to discriminate between laughter and utterances. It turned out that the conditioned HMM outperforms classical HMM using different late fusion approaches while additionally providing a certainty about class decision.
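The classical baseline described in the first sentence above (comparing model likelihoods and picking the more likely class) can be sketched as follows. This is not the conditioned HMM of the paper. It assumes small discrete-observation HMMs with made-up parameters, and the normalised scores assume equal class priors.

```python
def forward_likelihood(obs, start, trans, emit):
    """P(obs | model) for a discrete HMM via the forward algorithm."""
    n_states = len(start)
    alpha = [start[s] * emit[s][obs[0]] for s in range(n_states)]
    for symbol in obs[1:]:
        alpha = [
            sum(alpha[p] * trans[p][s] for p in range(n_states)) * emit[s][symbol]
            for s in range(n_states)
        ]
    return sum(alpha)


def classify(obs, models):
    """Return the most likely label and likelihoods normalised over classes."""
    likes = {label: forward_likelihood(obs, *m) for label, m in models.items()}
    total = sum(likes.values())
    scores = {label: value / total for label, value in likes.items()}
    return max(likes, key=likes.get), scores


# Hypothetical two-state models over a binary observation alphabet {0, 1}.
shared_trans = [[0.9, 0.1], [0.1, 0.9]]
models = {
    "laughter": ([0.5, 0.5], shared_trans, [[0.2, 0.8], [0.1, 0.9]]),
    "utterance": ([0.5, 0.5], shared_trans, [[0.8, 0.2], [0.9, 0.1]]),
}
label, scores = classify([1, 1, 0, 1], models)
print(label, scores)
```

For long sequences the products underflow, so practical implementations work with log-probabilities or scaled forward variables.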

#9Distant Speech Recognition in a Smart Home: Comparison of Several Multisource ASRs in Realistic Conditions

Benjamin Lecouteux (LIG UMR CNRS/UJF/G-INP 5217)
Michel Vacher (LIG UMR CNRS/UJF/G-INP 5217)
François Portet (LIG UMR CNRS/UJF/G-INP 5217)

While the smart home domain has become a major field of application of ICT to improve support and wellness of people in loss of autonomy, speech technology in smart homes has, compared with other ICTs, received limited attention. This paper presents the Sweethome project, whose aim is to make it possible for frail persons to control their domestic environment through voice interfaces. Several state-of-the-art and novel ASR techniques were evaluated on realistic data acquired in a multiroom smart home. This distant speech corpus was recorded with 21 speakers playing scenarios including activities of daily living in a smart home equipped with several microphones. Techniques acting at the decoding stage and using a priori knowledge such as DDA give better results (WER=8.8%, Domotic F-measure=96.8%) than the baseline (WER=18.3%, Domotic F-measure=89.2%) and other approaches.

#10A Robust Approach to Mining Repeated Sequence in Audio Stream

Jiansong Chen (Institute of Automation, Chinese Academy of Sciences)
Lei Zhu (Institute of Automation, Chinese Academy of Sciences)
Bailan Feng (Institute of Automation, Chinese Academy of Sciences)
Peng Ding (Institute of Automation, Chinese Academy of Sciences)
Bo Xu (Institute of Automation, Chinese Academy of Sciences)

In multimedia streams, repeated sequences (e.g., commercials or news anchorpersons) usually imply potentially significant content. Therefore, mining repeated sequences is an important approach to analyzing multimedia content. This paper reports on a robust unsupervised technique for discovering repeated sequences in audio streams. Unlike previous research, the approach transforms the repeated sequence detection task into a Hidden Markov Model (HMM) decoding problem in a similarity trellis. To resist the false and missing matches that occur in real applications, we present a soft definition of repeated sequence, termed maximal loosely repeated sequence, as the objective for detection. A Viterbi-like algorithm is used to mine all the maximal loosely repeated sequences in the stream. In addition, we propose a novel metric to evaluate the repeated sequence detection algorithm. Extensive experiments on both simulated data and real broadcast data demonstrate the effectiveness of our method.

Tue-Ses1-P3:
ASR - New Paradigms and Other Topics

Time:Tuesday 10:00 Place:Faenza 1 - Pala Congressi (Passi Perduti-Gallery) Type:Poster
Chair:Maurizio Omologo

#1Accelerated Parallelizable Neural Network Learning Algorithm for Speech Recognition

Dong Yu (Microsoft Research)
Li Deng (Microsoft Research)

We describe a set of novel, batch-mode algorithms we developed recently as one key component in scalable, deep neural network based speech recognition. The essence of these algorithms is to structure the single-hidden-layer neural network so that the upper-layer’s weights can be written as a deterministic function of the lower-layer’s weights. This structure is effectively exploited during training by plugging in the deterministic function to the least square error objective function while calculating the gradients. Accelerating techniques are further exploited to make the weight updates move along the most promising directions. The experiments on TIMIT frame-level phone and phone-state classification show strong results. In particular, the error rate is strictly monotonically dropping as the mini-batch size increases. This demonstrates the potential for the proposed batch-mode algorithms in large scale speech recognition since they are easily parallelizable across computers.
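To illustrate how upper-layer weights can follow deterministically from lower-layer weights, here is a sketch of the standard least-squares solution, assuming a linear output layer and an invertible $H H^{\top}$. The symbols $X$ (inputs), $T$ (targets), $W$ (lower-layer weights), $U$ (upper-layer weights), $H$ (hidden activations) and $\sigma$ (hidden nonlinearity) are notation chosen for this illustration and may differ from the paper's.

```latex
\begin{align}
H &= \sigma(W^{\top} X), \\
E(U, W) &= \lVert U^{\top} H - T \rVert_F^2, \\
\frac{\partial E}{\partial U} &= 2\,H\,(U^{\top} H - T)^{\top} = 0
\;\Longrightarrow\; U = (H H^{\top})^{-1} H\, T^{\top}.
\end{align}
```

Substituting this $U$ back into $E$ leaves an objective that depends on $W$ alone, which is the kind of structure the abstract describes exploiting when the gradients are calculated.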

#2Deep Convex Network: A Scalable Architecture for Deep Learning

Li Deng (Microsoft Research)
Dong Yu (Microsoft Research)

We recently developed context-dependent DNN-HMM (Deep-Neural-Net/Hidden-Markov-Model) for large-vocabulary ASR. While achieving remarkable recognition error rate reduction, we face the seemingly insurmountable problem of scalability in dealing with the virtually unlimited amount of training data available nowadays. To overcome the scalability challenge, we have designed a novel deep learning architecture, deep convex network (DCN). The learning problem in DCN is convex within each layer. Additional structure-exploited fine tuning further improves the quality of DCN. The full learning in DCN is batch-mode based, making it naturally amenable to parallel training that can be distributed over many machines. Experimental results on both MNIST and TIMIT tasks evaluated thus far demonstrate superior performance of DCN over the DBN (Deep Belief Network). The superiority is reflected not only in training scalability and CPU-only computation, but also in recognition accuracy in both tasks.

#3Modeling Broad Context for Tone Recognition with Conditional Random Fields

Siwei Wang (Department of Computer Science University of Chicago)
Gina-Anne Levow (Department of Linguistics University of Washington)

We propose a tone recognition approach that employs linear-chain Conditional Random Fields (CRF) to model tone variation due to intonation effects. We implement three linear-chain CRFs which aim at modeling intonation effects at phrase-, sentence- and story-level boundaries, where we show that standard recognition techniques degrade and common normalization approaches do not improve. We show that all linear-chain CRFs outperform the baseline unigram model, and the biggest improvement, 4% in overall accuracy, is found in recognizing 3rd tones. In particular, Phrase Bigram CRFs show a drastic 39% improvement in recognizing 3rd tones located at initial boundaries. This improvement shows that the position-specific modeling of initial tones in bigram CRFs captures the intonation effects better than the baseline unigram model.

#4Improved Tonal Language Speech Recognition by Integrating Spectro-temporal Evidence and Pitch Information with Properly Chosen Tonal Acoustic Units

Shang-wen Li (Graduate Institute of Communication Engineering, National Taiwan University)
Yow-bang Wang (Graduate Institute of Electrical Engineering, National Taiwan University)
Liang-che Sun (Graduate Institute of Communication Engineering, National Taiwan University)
Lin-shan Lee (Graduate Institute of Communication Engineering, National Taiwan University)

We propose an improved Tandem system for tonal language speech recognition. Three different types of features, cepstral, spectro-temporal and pitch features, are integrated for modeling tone and phoneme variation simultaneously. Tonal phonemes (or tonemes) are used for MLP posterior estimation, and tonal acoustic units for HMM recognition. In our experiments conducted on Mandarin broadcast news, a 19.3% relative CER reduction was achieved over the conventional MFCC Tandem baseline. With different training acoustic units, we analyze the complementarity among the three types of features in tone, phoneme, and toneme classification.

#5Kullback-Leibler divergence-based ASR training data selection

Evandro Gouvea (European Media Laboratory GmbH)
Marelie Davel (CSIR Meraka Institute)

Data preparation and selection affect systems in a wide range of complexities. A system built for a resource-rich language may be so large as to include borrowed languages. A system built for a resource-scarce language may be affected by how carefully the training data is selected and produced. Accuracy is affected by the presence of enough samples of qualitatively relevant information. We propose a method using the Kullback-Leibler divergence to solve two problems related to data preparation: the ordering of alternate pronunciations in a lexicon, and the selection of transcription data. In both cases, we want to guarantee that a particular distribution of n-grams is achieved. In the case of lexicon design, we want to ascertain that phones will be present often enough. In the case of training data selection for scarcely resourced languages, we want to make sure that some n-grams are better represented than others. We show that our proposed technique yields encouraging results.
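The idea of steering a selection towards a particular distribution can be sketched with a generic greedy procedure. It is not necessarily the authors' algorithm, and it uses unigrams for brevity. The target distribution, the toy sentences and all function names are assumptions made for illustration.

```python
import math
from collections import Counter


def kl_divergence(p, q, eps=1e-9):
    """D(p || q) over the keys of p; eps guards against zero mass in q."""
    return sum(
        pv * math.log(pv / max(q.get(key, 0.0), eps))
        for key, pv in p.items()
        if pv > 0
    )


def normalise(counts):
    """Turn raw counts into a probability distribution."""
    total = sum(counts.values())
    return {key: value / total for key, value in counts.items()} if total else {}


def greedy_select(sentences, target, n_select):
    """Greedily pick sentences whose pooled unit distribution approaches target."""
    chosen, pooled = [], Counter()
    remaining = list(range(len(sentences)))
    for _ in range(min(n_select, len(remaining))):
        best = min(
            remaining,
            key=lambda i: kl_divergence(
                target, normalise(pooled + Counter(sentences[i]))
            ),
        )
        chosen.append(best)
        pooled += Counter(sentences[best])
        remaining.remove(best)
    return chosen


# Hypothetical phone sequences and a uniform target over three phones.
sentences = [["a", "a", "a"], ["a", "b", "c"], ["b", "c", "c"], ["a", "a", "b"]]
target = {"a": 1 / 3, "b": 1 / 3, "c": 1 / 3}
print(greedy_select(sentences, target, 2))
```

Each step re-evaluates every remaining sentence, so the cost grows quadratically with the pool size. A non-uniform target would favour some units over others.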

#6Articulatory Feature Classification Using Nearest Neighbors

Arild Brandrud Næss (Department of Electronics and Telecommunications, Norwegian University of Science and Technology, Trondheim, Norway)
Karen Livescu (Toyota Technological Institute, Chicago IL, USA)
Rohit Prabhavalkar (Department of Computer Science and Engineering, The Ohio State University, Columbus OH, USA)

Recognizing aspects of articulation from audio recordings of speech is an important problem, either as an end in itself or as part of an articulatory approach to automatic speech recognition. In this paper we study the frame-level classification of a set of articulatory features (AFs) inspired by the vocal tract variables of articulatory phonology. We compare k nearest neighbor (kNN) classifiers and multilayer perceptrons (MLPs), using different acoustic feature vectors, and classify the AFs either independently or jointly. We also consider using the MLP outputs for all of the AFs as inputs to kNN classifiers for the individual AFs, effectively using the MLPs as a form of nonlinear dimensionality reduction and allowing the decision for each AF to be based on the MLPs for the other AFs. We find that MLPs outperform kNN classifiers, while kNN classifiers using MLP outputs outperform both.

#7Continuous episodic memory based speech recognition using articulatory dynamics

Sébastien Demange (INRIA - LORIA, UMR 7503, BP239, 54506 Vandoeuvre-lès-Nancy)
Slim Ouni (University of Nancy 2 - LORIA, UMR 7503, BP239, 54506 Vandoeuvre-lès-Nancy)

In this paper we present a speech recognition system based on articulatory dynamics. We do not extend the acoustic feature with any explicit articulatory measurements but instead the articulatory dynamics of speech are structurally embodied within episodic memories. The proposed recognizer is made of different memories, each specialized for a particular articulator. As not all articulators contribute equally to the realization of a particular phoneme, the specialized memories do not perform equally on each phoneme. We show, through phone string recognition experiments, that combining the recognition hypotheses resulting from the different articulatory specialized memories leads to significant recognition improvements.

#8Graphone Model Interpolation and Arabic Pronunciation Generation

T. Li (Department of Engineering, University of Cambridge, Trumpington Street, Cambridge, CB2 1PZ, U.K.)
P. C. Woodland (Department of Engineering, University of Cambridge, Trumpington Street, Cambridge, CB2 1PZ, U.K.)
F. Diehl (Department of Engineering, University of Cambridge, Trumpington Street, Cambridge, CB2 1PZ, U.K.)
M. J. F. Gales (Department of Engineering, University of Cambridge, Trumpington Street, Cambridge, CB2 1PZ, U.K.)

This paper extends n-gram graphone model pronunciation generation to use a mixture of such models. This technique is useful when pronunciation data is for a specific variant (or set of variants) of a language, such as for a dialect, and only a small amount of pronunciation dictionary training data for that specific variant is available. The performance of the interpolated n-gram graphone model is evaluated on Arabic phonetic pronunciation generation for words that can't be handled by the Buckwalter Morphological Analyser. The pronunciations produced are also used to train an Arabic broadcast audio speech recognition system. In both cases the interpolated graphone model leads to improved performance.

#9Grapheme-to-Phoneme Conversion using Conditional Random Fields

Irina Illina (LORIA/INRIA)
Dominique Fohr (LORIA/INRIA)
Denis Jouvet (LORIA/INRIA)

We propose an approach to grapheme-to-phoneme conversion based on a probabilistic method: Conditional Random Fields (CRF). CRFs give long-term predictions and assume a relaxed state independence condition. Moreover, we propose an algorithm for the one-to-one letter-to-phoneme alignment needed for CRF training. This alignment is based on discrete HMMs. The proposed system is validated on two pronunciation dictionaries. Different CRF features are studied: POS-tag, context size, unigram versus bigram. Our approach compares favorably with the performance of the state-of-the-art Joint-Multigram Models for the quality of the pronunciations, and provides better recall and precision measures for the generation of multiple pronunciation variants.

#10Bilingual Acoustic Model Adaptation by Unit Merging on Different Levels and Cross-level Integration

Ching-Feng Yeh (National Taiwan University)
Chao-Yu Huang (National Taiwan University)
Lin-Shan Lee (National Taiwan University)

This paper presents a bilingual acoustic model adaptation approach for transcribing Mandarin-English code-mixed lectures with highly unbalanced language distribution. This includes an adaptation structure, merging of Mandarin and English acoustic units on model, state and Gaussian levels, and a cross-level integration scheme. The corpora tested include two real courses, in which special terminologies were produced in the guest language of English (about 15-19%) and embedded in the utterances produced in the host language (about 81-85%). The code-mixing nature of the target corpora and the very small percentage of the English data made the task difficult. Preliminary experiments showed that unit merging was helpful, merging on lower levels offered more improvements, and cross-level integration was even better. The code-mixing situation considered is actually very natural in the spoken language of the daily lives of many people in the globalized world today.

#11A qualitative evaluation of phoneme-to-phoneme technology

Marijn Schraagen (Utrecht Institute of Linguistics OTS, Utrecht University)
Gerrit Bloothooft (Utrecht Institute of Linguistics OTS, Utrecht University)

Automatic speech recognition systems apply grapheme-to-phoneme transcription (G2P) to model pronunciation of items in the lexicon. General purpose G2P transcriptions are not always accurate, e.g., in a multilingual environment. To improve the transcription quality, G2P transcriptions can be postprocessed using a phoneme-to-phoneme (P2P) converter. This paper discusses the applicability of P2P technology based on results of a speech recognition experiment using P2P conversion on a multilingual speech corpus. P2P conversion can be applied successfully; however, the analysis also shows limitations of P2P technology.

#12Cheap Bootstrap of Multi-Lingual Hidden Markov Models

Daniele Falavigna (FBK, Italy)
Roberto Gretter (FBK, Italy)

In this work we investigate the usage of TV audio data for cross-language training of multi-lingual acoustic models. We intend to take advantage of the availability of a training speech corpus, formed by parallel news uttered in different languages and transmitted over separate audio channels. Spanish, French and Russian phone Hidden Markov Models (HMMs) are bootstrapped using an unsupervised training procedure starting from an Italian set of phone HMMs. The use of confidence measures was also investigated, in order to select the training audio data, and has proven to be effective. The usage of cross-language information, i.e. exploiting the temporal alignment of news in different languages to build cross-language news-dependent LMs, was also proven to benefit the acoustic model training.

#13Adaptive Stream Fusion in Multistream Recognition of Speech

Nima Mesgarani (Johns Hopkins University)
Samuel Thomas (Johns Hopkins University)
Hynek Hermansky (Johns Hopkins University)

A new method to deal with variable distortions of speech during the operation of the system is proposed. First, multiple processing streams are formed by extracting different spectral and temporal modulation components from the speech signal. Information in each stream is used to estimate posterior probabilities of phonemes. Initial values for a weighted integration of these individual estimates are found by normalized cross-correlation of the estimates with the actual phoneme labels on the training data. A statistical model of the final estimated posterior probabilities is used to characterize the system performance. During the operation, the weights in the linear fusion are adapted using particle filtering to optimize the performance. Results on phoneme recognition from noisy speech indicate the effectiveness of the proposed method.
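The weighted integration of per-stream estimates mentioned above can be sketched as a plain linear fusion of posterior vectors. This shows only the fusion step with fixed weights; the cross-correlation initialisation and particle-filter adaptation from the abstract are not reproduced. The function name and numbers are hypothetical.

```python
def fuse_posteriors(stream_posteriors, weights):
    """Weighted linear fusion of per-stream phoneme posterior vectors."""
    if len(stream_posteriors) != len(weights):
        raise ValueError("one weight per stream is required")
    total_w = sum(weights)
    if total_w <= 0:
        raise ValueError("weights must sum to a positive value")
    # Normalise the weights so the fused vector is again a distribution.
    norm_w = [w / total_w for w in weights]
    n_classes = len(stream_posteriors[0])
    return [
        sum(w * post[c] for w, post in zip(norm_w, stream_posteriors))
        for c in range(n_classes)
    ]


# Two hypothetical streams over three phoneme classes; stream 0 is trusted more.
streams = [[0.7, 0.2, 0.1], [0.2, 0.5, 0.3]]
fused = fuse_posteriors(streams, weights=[3.0, 1.0])
print(fused)
```

An adaptive system would update `weights` over time as conditions change; here they stay constant for clarity.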

#14Unsupervised Audio Patterns Discovery using HMM-based Self-Organized Units

Man-hung Siu (Raytheon BBN Technologies)
Herbert Gish (Raytheon BBN Technologies)
Steve Lowe (Raytheon BBN Technologies)
Arthur Chan (Raytheon BBN Technologies)

In our previous work, we trained an HMM-based speech recognizer without transcription or any knowledge or resources. The trained HMM recognizer was used to transcribe audio into self-organized units (SOUs) and we evaluated its performance on the task of topic identification. In this paper, we report our work in applying SOUs to discover audio patterns in spoken documents without supervision. By recognizing audio into SOUs, which are sound-like units, the discovery of common audio patterns can be carried out extremely efficiently over a large corpus, without dynamic programming comparisons as proposed by earlier work. Experiments were performed on Mandarin conversational telephone speech using both the one-best SOU token sequences and SOU consensus networks. We show that using SOUs as keys to audio patterns, we can discover frequently spoken words with good purity.

#15Nearest Neighbors with Learned Distances for Phonetic Frame Classification

John Labiak (University of Chicago)
Karen Livescu (TTIC)

Nearest neighbor-based techniques provide an approach to acoustic modeling that avoids the often lengthy and heuristic process of training traditional Gaussian mixture-based models. Here we study the problem of choosing the distance metric for a k-nearest neighbor (k-NN) phonetic frame classifier. We compare the standard Euclidean distance to two learned Mahalanobis distances, based on large-margin nearest neighbors (LMNN) and locality preserving projections (LPP). We use locality sensitive hashing for approximate nearest neighbor search to reduce the test time of k-NN classification. We compare the error rates of these approaches, as well as of baseline Gaussian mixture-based and multilayer perceptron classifiers, on the task of phonetic frame classification of speech from the TIMIT database. The k-NN classifiers outperform Gaussian mixture models, but not multilayer perceptrons. We find that the best k-NN classification performance is obtained using LPP, while LMNN is close behind.
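As a minimal sketch of how the distance metric changes a k-NN decision, the code below writes a Mahalanobis-style distance as a Euclidean distance after a linear map L. Methods such as LMNN or LPP would learn L from data; here it is simply set by hand. The frames, labels and function names are invented, and the search is exact rather than hashed.

```python
from collections import Counter


def project(L, x):
    """Apply the linear map L (a list of rows) to vector x."""
    return [sum(l_ij * x_j for l_ij, x_j in zip(row, x)) for row in L]


def mahalanobis_sq(L, x, y):
    """Squared distance ||L(x - y)||^2; the identity L gives Euclidean distance."""
    diff = [a - b for a, b in zip(x, y)]
    return sum(v * v for v in project(L, diff))


def knn_classify(L, train, query, k=3):
    """Majority vote among the k nearest training frames under the metric."""
    nearest = sorted(train, key=lambda item: mahalanobis_sq(L, item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical 2-D frames: dimension 0 is informative, dimension 1 is noisy.
train = [
    ([0.0, 5.0], "aa"), ([0.2, -4.0], "aa"), ([0.1, 3.0], "aa"),
    ([1.0, 0.1], "iy"), ([1.1, -0.2], "iy"), ([0.9, 0.3], "iy"),
]
query = [0.1, 0.0]
euclidean = [[1.0, 0.0], [0.0, 1.0]]
downweight_noise = [[1.0, 0.0], [0.0, 0.01]]  # hand-set stand-in for a learned map
print(knn_classify(euclidean, train, query))
print(knn_classify(downweight_noise, train, query))
```

With the identity map the noisy dimension dominates and the query is labelled "iy"; shrinking that dimension flips the decision to "aa".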

Tue-Ses1-P4:
Speaker Recognition - Modeling, Automatic Procedures, Analysis III

Time:Tuesday 10:00 Place:Faenza 2 - Pala Congressi (Passi Perduti-Gallery) Type:Poster
Chair:Daniel Garcia-Romero

#1i-vector Based Speaker Recognition on Short Utterances

Ahilan Kanagasundaram (Speech and Audio Research Laboratory, Queensland University of Technology)
Robbie Vogt (Speech and Audio Research Laboratory, Queensland University of Technology)
David Dean (Speech and Audio Research Laboratory, Queensland University of Technology)
Sridha Sridharan (Speech and Audio Research Laboratory, Queensland University of Technology)
Michael Mason (Speech and Audio Research Laboratory, Queensland University of Technology)

Robust speaker verification on short utterances remains a key consideration when deploying automatic speaker recognition, as many real world applications often have access to only limited duration speech data. This paper explores how the recent technologies focused around total variability modeling behave when training and testing utterance lengths are reduced. Results are presented which provide a comparison of Joint Factor Analysis (JFA) and i-vector based systems including various compensation techniques: Within-Class Covariance Normalization (WCCN), LDA, Scatter Difference Nuisance Attribute Projection (SDNAP) and Gaussian Probabilistic Linear Discriminant Analysis (GPLDA). Speaker verification performance for utterances with as little as 2 sec of data taken from the NIST Speaker Recognition Evaluations is presented to provide a clearer picture of the current performance characteristics of these techniques in short utterance conditions.

#2Study of Overlapped Speech Detection for NIST SRE Summed Channel Speaker Recognition

Hanwu Sun (Institute for Infocomm Research)
Bin Ma (Institute for Infocomm Research)

This paper studies the overlapped speech detection for improving the performance of the summed channel speaker recognition system in NIST Speaker Recognition Evaluation (SRE). The speaker recognition system includes four main modules: voice activity detection, speaker diarization, overlapped speaker detection and speaker recognition. We adopt a GMM based overlapped speaker detection system, by using entropy, MFCC and LPC features, to remove the overlapped segments in summed channel test condition. With the overlapped speech detection, the speaker diarization achieves a relative 18% diarization error rate reduction for the 2008 NIST SRE summed channel test set, and we obtain relative equal error rate reductions of 13.3% and 9.4% in speaker recognition on the 1conv-summed task and 8conv-summed task, respectively.

#3Super-Dirichlet Mixture Models using Differential Line Spectral Frequencies for Text-Independent Speaker Identification

Zhanyu Ma (KTH-Royal Institute of Technology, Sound and Image Processing Lab)
Arne Leijon (KTH-Royal Institute of Technology, Sound and Image Processing Lab)

A new text-independent speaker identification (SI) system is proposed. This system utilizes the line spectral frequencies (LSFs) as an alternative feature set for capturing the speaker characteristics. The boundary and ordering properties of the LSFs are considered and the LSFs are transformed to the differential LSF (DLSF) space. Since the dynamic information is useful for speaker recognition, we represent the dynamic information of the DLSFs by considering two neighbors of the current frame, one from the past frames and the other from the following frames. The current frame and the neighbor frames are cascaded into a supervector. The statistical distribution of this supervector is modelled by the so-called super-Dirichlet mixture model, which is an extension of the Dirichlet mixture model. Compared to the conventional SI system, which uses the mel-frequency cepstral coefficients and is based on the Gaussian mixture model, the proposed SI system shows a promising improvement.

#4Comparison of Voice Activity Detectors for Interview Speech in NIST Speaker Recognition Evaluation

Hon-Bill Yu (The Hong Kong Polytechnic University)
Man-Wai Mak (The Hong Kong Polytechnic University)

Interview speech in NIST Speaker Recognition Evaluations (SREs) has substantially lower signal-to-noise ratio, which necessitates robust voice activity detection (VAD). This paper highlights the characteristics of interview speech files in NIST SREs and discusses the difficulties in performing VAD in these files. To overcome these difficulties, this paper proposes using speech enhancement techniques as a pre-processing step for enhancing the reliability of energy-based and statistical-model-based VADs. It was found that spectral subtraction can make better use of the background spectrum than the likelihood-ratio tests in statistical-model-based VADs. A decision strategy is also proposed to overcome the undesirable effects caused by impulsive signals and sinusoidal background signals. Results on NIST 2010 SRE show that the proposed VAD outperforms the statistical-model-based VAD, the ETSI-AMR speech coder, and the ASR transcripts provided by NIST SRE Workshop.

#5Eigen-Voice Based Anchor Modeling System for Speaker Identification using MLLR Super-Vector

Achintya Kumar Sarkar (Indian Institute of Technology Madras)
S. Umesh (Indian Institute of Technology Madras)

In this paper, we propose an anchor modeling scheme where instead of conventional “anchor” speakers, we use eigenvectors that span the Eigen-voice space. The computational advantage of conventional anchor-modeling based speaker identification systems comes from representing all speakers in a space spanned by a small number of anchor speakers instead of having separate speaker models. The conventional “anchor” speakers are usually chosen using data-driven clustering and the number of such speakers is also empirically determined. The use of the proposed eigenvoice-based anchors provides a more systematic way of spanning the speaker-space and of determining the optimal number of anchors. In our proposed method, the eigenvector space is built using the Maximum Likelihood Linear Regression (MLLR) super-vectors of non-target speakers. Further, the proposed method does not require calculation of the likelihood with respect to anchor speaker models to create the speaker-characterization vector as done in conventional anchor systems. Instead, speakers are characterized with respect to eigen-space by projecting the speaker’s MLLR super-vector onto the eigen-voice space. This makes the method computationally efficient. Experimental results show that the proposed method consistently performs better than the conventional anchor modeling technique for different numbers of anchor speakers.
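The projection step described above can be sketched as follows, assuming the eigenvoice basis is already available as orthonormal vectors (for example from a principal component analysis of non-target supervectors, which is an assumption here) together with a mean supervector. All names and numbers are hypothetical.

```python
def characterise_speaker(supervector, mean, eigenvoices):
    """Coordinates of a mean-centred supervector in an orthonormal eigenvoice basis."""
    centred = [s - m for s, m in zip(supervector, mean)]
    # One dot product per basis vector; no per-anchor likelihood is computed.
    return [sum(e_i * c_i for e_i, c_i in zip(e, centred)) for e in eigenvoices]


# Toy 3-dimensional "supervector" and a 2-dimensional eigenvoice basis.
mean = [1.0, 1.0, 1.0]
eigenvoices = [[1.0, 0.0, 0.0], [0.0, 0.6, 0.8]]
coords = characterise_speaker([3.0, 2.0, 1.0], mean, eigenvoices)
print(coords)
```

The resulting low-dimensional vector plays the role of the speaker-characterization vector; how two such vectors are then compared is left open in this sketch.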

#6Automatic Detection of Speaker Attributes Based on Utterance Text

Wen Wang (SRI International)
Andreas Kathol (SRI International)
Harry Bratt (SRI International)

In this paper, we present models for detecting various attributes of a speaker based on uttered text alone. These attributes include whether the speaker is speaking his/her native language, the speaker's age and gender, and the regional information reported by the speakers. We explore various lexical features as well as features inspired by Linguistic Inquiry and Word Count and Dictionary of Affect in Language. Overall, results suggest that when audio data is not available, by exploring effective feature sets only from uttered text and system combinations of multiple classification algorithms, we can build high quality statistical models to detect these attributes of speakers, comparable to systems that can exploit the audio data.

#7Comparison of Speaker Recognition Approaches for Real Applications

Sandro Cumani (Politecnico di Torino)
Pier Domenico Batzu (Loquendo)
Daniele Colibro (Loquendo)
Claudio Vair (Loquendo)
Pietro Laface (Politecnico di Torino)
Vasileios Vasilakakis (Politecnico di Torino)

This paper describes the experimental setup and the results obtained using several state-of-the-art speaker recognition classifiers. The comparison of the different approaches aims at the development of real world applications, taking into account memory and computational constraints, and possible mismatches with respect to the training environment. The NIST SRE 2008 database has been considered our reference dataset, whereas nine commercially available databases of conversational speech in languages different from the ones used for developing the speaker recognition systems have been tested as representative of an application domain. Our results, evaluated on the two domains, show that the classifiers based on i-vectors obtain the best recognition and calibration accuracy. Gaussian PLDA and a recently introduced discriminative SVM together with an adaptive symmetric score normalization achieve the best performance using low memory and processing resources.

#8Modeling Speaker Personality using Voice

Tim Polzehl (Quality and Usability Lab, Technische Universität Berlin / Deutsche Telekom Laboratories; Germany)
Sebastian Möller (Quality and Usability Lab, Technische Universität Berlin / Deutsche Telekom Laboratories; Germany)
Florian Metze (Language Technologies Institute, Carnegie Mellon University; Pittsburgh, PA; USA)

In this paper, we validate the application of an established personality assessment and modeling paradigm to speech input, and extend earlier work towards text independent acted speech input. We show that human labelers can consistently label speech data generated across multiple recording sessions, and investigate further which of the 5 scales in the NEO-FFI scheme can be assessed from speech, and how a manipulation of one scale influences the perception of another. Finally, we present a clustering of human labels of perceived personality traits, which will be useful in future experiments on automatic classification and generation of personality traits from speech.

#9Structural Joint Factor Analysis for Speaker Recognition

Marc Ferras (Tokyo Institute of Technology)
Koichi Shinoda (Tokyo Institute of Technology)
Sadaoki Furui (Tokyo Institute of Technology)

In recent years, adaptation techniques have been given a special focus in speaker recognition tasks. Addressing the separation of speaker and session variation effects, Joint Factor Analysis (JFA) has been consolidated as a powerful adaptation framework and has become ubiquitous in the last NIST Speaker Recognition Evaluations (SRE). However, its global parameter sharing strategy is not necessarily optimal when a small amount of adaptation data is available. In this paper, we address this issue by resorting to a regularization approach such as structural MAP. We introduce two variants of structural JFA (SJFA) that, depending on the amount of data, use coarser or finer parameter approximations in the adaptation process. One of these variants is shown to considerably outperform JFA. We report relative gains over 25% EER on the 2006 NIST SRE data for GMMSVM systems using SJFA over systems using JFA.

#10Acoustic Forest for SMAP-based Speaker Verification

Sangeeta Biswas (Tokyo Institute of Technology)
Marc Ferras (Tokyo Institute of Technology)
Koichi Shinoda (Tokyo Institute of Technology)
Sadaoki Furui (Tokyo Institute of Technology)

In speaker verification, structural maximum-a-posteriori (SMAP) adaptation for Gaussian mixture models (GMMs) has been proven effective, especially when the speech segment is very short. In SMAP adaptation, an acoustic tree of Gaussian components is constructed to represent the hierarchical acoustic space. Until now, however, there has been no clear way to automatically find the optimal tree structure for a given speaker. In this paper, we propose using an acoustic forest, which is a set of trees, for SMAP adaptation, instead of a single tree. In this approach, we combine the results of SMAP adaptation systems with different acoustic trees. A key issue is how to combine the trees. We explore three score fusion techniques, and evaluate our approach in the text-independent speaker verification task of the NIST 2006 SRE plan using 10-second speech segments. Our proposed method decreased EER by 3.2% from the relevance MAP adaptation and by 1.6% from the conventional SMAP with a single tree.
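
Combining the outputs of several tree-specific systems can be illustrated with the simplest form of score fusion, an unweighted average. This is only a sketch: the abstract does not name its three fusion techniques, so averaging is an assumption here, and the scores and threshold are hypothetical.

```python
from typing import Sequence


def fuse_mean(tree_scores: Sequence[float]) -> float:
    """Unweighted average of the verification scores from each tree."""
    if not tree_scores:
        raise ValueError("need at least one score")
    return sum(tree_scores) / len(tree_scores)


def accept(tree_scores: Sequence[float], threshold: float) -> bool:
    """Accept the claimed identity if the fused score reaches the threshold."""
    return fuse_mean(tree_scores) >= threshold


# Hypothetical scores for one trial from three SMAP systems,
# each adapted with a different acoustic tree.
trial = [0.5, 1.5, -0.5]
print(fuse_mean(trial))    # 0.5
print(accept(trial, 0.0))  # True
```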

#11Mixture of Auto-Associative Neural Networks for Speaker Verification

Sivaram Garimella (The Johns Hopkins University)
Samuel Thomas (The Johns Hopkins University)
Hynek Hermansky (The Johns Hopkins University)

The paper introduces a mixture of auto-associative neural networks for speaker verification. A new objective function based on posterior probabilities of phoneme classes is used for training the mixture. This objective function allows each component of the mixture to model part of the acoustic space corresponding to a broad phonetic class. This paper also proposes how factor analysis can be applied in this setting. The proposed techniques show promising results on a subset of NIST-08 speaker recognition evaluation (SRE) and yield about 10% relative improvement when combined with the state-of-the-art Gaussian Mixture Model i-vector system.

Tue-Ses2-O1:
Dialect and Accent Identification

Time:Tuesday 13:30 Place:Auditorium - Pala Congressi Type:Oral
Chair:David Martínez

13:30In search of cues discriminating West-African accents in French

Philippe Boula de Mareüil (LIMSI-CNRS)
Jean-Luc Rouas (LIMSI-CNRS & LABRI)
Manuela Yapomo (LIMSI-CNRS)

The aim of this study is twofold: to determine the extent to which West-African French accents can be distinguished and to find out phonetic cues discriminating French varieties spoken in Burkina Faso, Ivory Coast, Mali and Senegal. A perceptual experiment showed that these accents are well identified by West-African listeners. Prosodic and segmental cues such as the /R/ pronunciation were investigated by making use of automatic phoneme alignment. They allowed Senegal (with a tendency toward word-initial stress followed by a falling pitch movement) and Ivory Coast (with a tendency to delete/vocalise the /R/ consonant) to be differentiated.

13:50Computer and Human Recognition of Regional Accents of British English

Abualsoud Hanani (Department of Electronic, Electrical and Computer Engineering - The University of Birmingham - UK)
Martin J. Russell (Department of Electronic, Electrical and Computer Engineering - The University of Birmingham - UK)
Michael J. Carey (Department of Electronic, Electrical and Computer Engineering - The University of Birmingham - UK)

This paper is concerned with classification of the 14 regional accents of British English in the ABI (Accents of the British Isles) speech corpus. Results are reported using a state-of-the-art Language Identification system, variants of Huckvale’s ACCDIST system, and human listeners. The best performance, 95.18% accuracy, is obtained using the text-dependent ACCDIST measure. The performance of a conventional (text-independent) acoustic Language Identification system is poor, but is improved significantly (89.6% accuracy) by the addition of phone sequence information. Human performance (58.25% accuracy) is much lower than expected.

14:10Target-aware Lattice Rescoring for Dialect Recognition

Rong Tong (Institute for Infocomm Research, Singapore)
Bin Ma (Institute for Infocomm Research, Singapore)
Haizhou Li (Institute for Infocomm Research, Singapore)
Eng Siong Chng (Nanyang Technological University, Singapore)

We observed that human listeners distinguish one dialect from another by paying special attention to some particular phonetic and/or phonotactic patterns. Motivated by this observation, we propose a technique that emulates this process. We explore a target-aware lattice rescoring (TALR) process that revises the n-gram statistics in a lattice with target dialect information. We then derive n-gram statistics as the phonotactic features from the lattice and develop a system under the vector space modeling framework. The experiment results show that the proposed technique consistently improves dialect recognition performance on 30-second test utterances. We achieved equal error rates (EERs) of 4.57% and 13.28% with 3-gram statistics for Chinese and English dialect recognition in 2007 NIST Language Recognition Evaluation 30-second closed test sets.

14:30Effective Arabic Dialect Classification Using Diverse Phonotactic Models

Murat Akbacak (SRI International)
Dimitra Vergyri (SRI International)
Andreas Stolcke (Speech @ Microsoft)
Nicolas Scheffer (SRI International)
Arindam Mandal (SRI International)

We study the effectiveness of recently developed speaker and language recognition techniques based on speech recognition models for the discrimination of Arabic dialects. Specifically, we investigate dialect-specific and cross-dialectal phonotactic models, using both language models and support vector machines (SVMs). Techniques are evaluated both alone and in combination with others, including a cepstral GMM with joint factor analysis (JFA), and using a four-dialect data set employing 30-second telephone speech samples. We find good complementarity from different features and modeling paradigms, and achieve 2% average equal error rate for pairwise classification.

14:50Characterizing Deletion Transformations across Dialects using a Sophisticated Tying Mechanism

Nancy Chen (MIT/Lincoln Laboratory)
Wade Shen (MIT/Lincoln Laboratory)
Joe Campbell (MIT/Lincoln Laboratory)

In this work, we propose extensions of our Phone-based Pronunciation Model (PPM) for analyzing dialect differences. We compared these systems using 3 metrics and 2 datasets of English. Empirical results suggest that (1) sophisticated tying is suitable in modeling deletion transformations across dialects, beating standard tying by 33% relative, and (2) APM (Acoustic-based Pronunciation Model) improves performance in generating dialect-specific pronunciations, dialect identification and rule retrieval, achieving relative gains beyond 34%.

15:10Dialect and Accent Recognition using Phonetic-Segmentation Supervectors

Fadi Biadsy (Columbia University)
Julia Hirschberg (Columbia University)
Daniel Ellis (Columbia University)

We describe a new approach to automatic dialect and accent recognition which exceeds state-of-the-art performance in three recognition tasks. This approach improves the accuracy and substantially lowers the time complexity of our earlier phonetic-based kernel approach for dialect recognition. In contrast to state-of-the-art acoustic-based systems, our approach employs phone labels and segmentation to constrain the acoustic models. Given a speaker's utterance, we first obtain phone hypotheses using a phone recognizer and then extract GMM-supervectors for each phone type, effectively summarizing the speaker's phonetic characteristics in a single vector of phone-type supervectors. Using these vectors, we design a kernel function that computes the phonetic similarities between pairs of utterances to train SVM classifiers to identify dialects. Comparing this approach to the state-of-the-art, we obtain a 12.9% relative improvement in EER on Arabic dialects, and a 17.9% relative improvement for American vs. Indian English dialects. We also see a 53.5% relative improvement over a GMM-UBM on American Southern vs. Non-Southern English.

Tue-Ses2-O3:
ASR - Acoustic Models III

Time:Tuesday 13:30 Place:Brunelleschi (Green Room) - Pala Congressi - 2nd Floor Type:Oral
Chair:Ralf Schlueter

13:30Generalized Baum-Welch Algorithm and Its Implication to a New Extended Baum-Welch Algorithm

Roger Hsiao (InterACT, Language Technologies Institute, Carnegie Mellon University)
Tanja Schultz (InterACT, Language Technologies Institute, Carnegie Mellon University)

This paper describes how we can use the generalized Baum-Welch (GBW) algorithm to develop better extended Baum-Welch (EBW) algorithms. Based on GBW, we show that the backoff term in the EBW algorithm comes from KL-divergence which is used as a regularization function. This finding allows us to develop a fast EBW algorithm, which can reduce the time of model space discriminative training by half, without incurring any degradation on recognition accuracy. We compare the performance of the new EBW algorithm with the original one on various large scale systems including Farsi, Iraqi and modern standard Arabic ASR systems.

13:50Word Boundary Modelling and Full Covariance Gaussians for Arabic Speech-to-Text Systems

Frank Diehl (Cambridge University)
Mark Gales (Cambridge University)
Andrew Liu (Cambridge University)
Marcus Tomalin (Cambridge University)
Phil Woodland (Cambridge University)

This paper describes recent improvements to the Cambridge Arabic Large Vocabulary Continuous Speech Recognition (LVCSR) Speech-to-Text (STT) system. It is shown that word-boundary context markers provide a powerful method to enhance graphemic systems by implicit phonetic information, improving the modelling capability of graphemic systems. In addition, a robust technique for full covariance Gaussian modelling in the Minimum Phone Error (MPE) training framework is introduced. This reduces the full covariance training to a diagonal covariance training problem, thereby solving related robustness problems. The full system results show that the combined use of these and other techniques within the multi-branch combination framework reduces the Word Error Rate (WER) of the complete system by up to 5.9% relative.

14:10A Fully Automated Derivation of State-based Eigentriphones for Triphone Modeling with No Tied States using Regularization

Tom Ko (The Hong Kong University of Science and Technology)
Brian Mak (The Hong Kong University of Science and Technology)

Recently we proposed an alternative method called eigentriphone to solve the data insufficiency problem in triphone acoustic modeling without the need of state tying. The idea is to treat the acoustic modeling problem of infrequent triphones ("poor triphones") as an adaptation problem from the more frequent triphones ("rich triphones"): firstly, an eigenbasis is developed over the rich triphones that have sufficient training data and the eigenvectors are called eigentriphones; then the poor triphones are adapted in a fashion similar to eigenvoice adaptation. Since, in general, no states are tied in our method, all triphones (states) are distinct so that they can be more discriminative than tied-state triphones. In our previous work, the number of eigentriphones was determined in advance with a set of development data. In this paper, we investigate simply using all of them with the help of regularization to naturally penalize the less important ones. In addition, the model-based eigenbasis is replaced by three state-based eigenbases. Experimental evaluation on the WSJ 5K task shows that triphone models trained using our new eigentriphone approach without state tying perform at least as well as the common tied-state triphone models.

14:30Reducing Computational Complexities of Exemplar-Based Sparse Representations With Applications to Large Vocabulary Speech Recognition

Tara Sainath (IBM T.J. Watson Research Center)
Bhuvana Ramabhadran (IBM T.J. Watson Research Center)
David Nahamoo (IBM T.J. Watson Research Center)
Dimitri Kanevsky (IBM T.J. Watson Research Center)

Recently, exemplar-based sparse representation phone identification features (Spif) have shown promising results on large vocabulary speech recognition tasks. However, one problem with exemplar-based techniques is that they are computationally expensive. In this paper, we present two methods to speed up the creation of Spif features. First, we explore a technique to quickly select a subset of informative exemplars among millions of training examples. Secondly, we make approximations to the sparse representation computation such that a matrix-matrix multiplication is reduced to a matrix-vector product. We present results on four large vocabulary tasks, including Broadcast News where acoustic models are trained with 50 and 400 hours, and a Voice Search task, where models are trained with 160 and 1000 hours. Results on all tasks indicate improvements in speedup by a factor of four relative to the original Spif features, as well as improvements in word error rate (WER) in combination with a baseline HMM system.

14:50An i-Vector based Approach to Training Data Clustering for Improved Speech Recognition

Yu Zhang (Shanghai Jiao Tong University)
Jian Xu (University of Science and Technology of China)
Zhi-Jie Yan (Microsoft Research Asia)
Qiang Huo (Microsoft Research Asia)

We present a new approach to clustering training data for improved speech recognition. Given a training corpus, a so-called i-vector is extracted from each training utterance. A hierarchical divisive clustering algorithm is then used to cluster the training i-vectors into multiple clusters. For each cluster, an acoustic model (AM) is trained accordingly. The multiple AMs trained in this way can then be used in the recognition stage to improve recognition accuracy. The proposed approach is very efficient and can therefore deal with very large-scale training corpora on current mainstream computing platforms. We report experimental results on a voice search task with 7,500 hours of speech training data.

15:10Rapid Training of Acoustic Models using Graphics Processing Units

Senaka Buthpitiya (Carnegie Mellon University)
Ian Lane (Carnegie Mellon University)
Jike Chong (Carnegie Mellon University)

Robust and accurate speech recognition systems can only be realized with adequately trained acoustic models. For common languages, state-of-the-art systems are now trained on thousands of hours of speech data. Even with a large cluster of machines the entire training process can take many weeks. To overcome this development bottleneck we propose a new framework for rapid training of acoustic models using highly parallel graphics processing units (GPUs). In this paper we focus on Viterbi training and describe the optimizations required for effective throughput on GPU processors. Using a single NVIDIA GTX580 GPU our proposed approach is shown to be 51x faster than a sequential CPU implementation, enabling a moderately sized acoustic model to be trained on 1000 hours of speech data in just over 9 hours. Moreover, we show that our implementation on a two-GPU system can perform 67% faster than a standard parallel reference implementation on a high-end 32-core Xeon server. Our GPU-based training platform empowers research groups to rapidly evaluate new ideas and build accurate and robust acoustic models on very large training corpora.

Tue-Ses2-S1:
Show & Tell Demonstration - Mobility and Web-services

Time:Tuesday 13:30 Place:Donatello (Room Onice) - Pala Congressi - Ground Floor Type:Poster
Chair:Mazin Gilbert

#1Making an automatic speech recognition service freely available on the web

Stuart Nicholas Wrigley (University of Sheffield)
Thomas Hain (University of Sheffield)

The state-of-the-art speech recognition system developed by the AMIDA project, which performed well in the NIST RT'09 evaluation, has been made available as a web service. The service provides free access to ASR aimed specifically at the scientific community. There are two ways in which this service can be accessed: via a standard web browser and programmatically via an API.

#2AT&T VoiceBuilder: A Cloud-based Text-To-Speech Voice Builder Tool

Yeon-Jun Kim (AT&T Labs - Research, Inc.)
Thomas Okken (AT&T Labs - Research, Inc.)
Alistair Conkie (AT&T Labs - Research, Inc.)
Giuseppe Di Fabbrizio (AT&T Labs - Research, Inc.)

The AT&T VoiceBuilder provides a new tool to researchers and practitioners who want to have their voices synthesized by a high-quality, commercial-grade text-to-speech (TTS) system without the need to install, configure, or manage speech processing software and equipment. It is implemented as a web service on the AT&T Speech Mashup Portal. The proposed system records, processes, and validates users' utterances, and provides a web service API to make the new voice immediately available to real-time applications. All the procedures are fully-automated to avoid human intervention.

#3Extending Audio Notetaker to Browse WebASR Transcriptions

Roger Tucker (Sonocent Ltd, Chepstow, UK)
Dan Fry (Sonocent Ltd, Chepstow, UK)
Vincent Wan (Department of Computer Science, University of Sheffield, UK)
Stuart Wrigley (Department of Computer Science, University of Sheffield, UK)
Thomas Hain (Department of Computer Science, University of Sheffield, UK)

The audio annotation tool Audio Notetaker has been extended to allow browsing of transcripts produced with the WebASR system from Sheffield University. The interface has been designed to be usable with as much as 50% recognition error.

#4A Web-Based Tool for Developing Multilingual Pronunciation Lexicons

Samantha Ainsley (Department of Computer Science, Columbia University, USA)
Linne Ha (Google Inc., USA)
Martin Jansche (Google Inc., USA)
Ara Kim (Formerly Google Inc., USA)
Masayuki Nanzawa (Google Inc., USA)

We present a web-based tool for generating and editing pronunciation lexicons in multiple languages. The tool is implemented as a web application on Google App Engine and can be accessed remotely from a web browser. The client application displays to users a textual prompt and interface that reconfigures based on language and task. It lets users generate pronunciations via constrained phoneme selection, which allows users with no special training to provide phonemic transcriptions efficiently and accurately.

#5Speak4it and the Multimodal Semantic Interpretation System

Michael Johnston (AT&T Labs Research)
Patrick Ehlen (AT&T Labs)

Multimodal interaction allows users to specify commands using combinations of inputs from multiple different modalities. For example, in a local search application, a user might say “gas stations” while simultaneously tracing a route on a touchscreen display. In this demonstration, we describe the extension of our cloud-based speech recognition architecture to a Multimodal Semantic Interpretation System (MSIS) that supports processing of multimodal inputs streamed over HTTP. We illustrate the capabilities of the framework using Speak4it℠, a deployed mobile local search application supporting combined speech and gesture input. We provide interactive demonstrations of Speak4it on the iPhone and iPad and explain the challenges of supporting true multimodal interaction in a deployed mobile service.

#6TSAB -- Web Interface for Transcribed Speech Collections

Tanel Alumäe (Institute of Cybernetics at Tallinn University of Technology, Estonia)
Ahti Kitsik (Codehoop OU)

This paper describes a new web interface for accessing large transcribed spoken data collections. The system uses automatic or manual time-aligned transcriptions with speaker and topic segmentation information to present structured speech data more efficiently and make accessing relevant speech data quicker. The system is independent of the underlying speech processing technology. The software is free and open-source.

#7Visual Voice Mail to Text on the iPhone/iPad

Andrej Ljolje (AT&T Labs - Research)
Vincent Goffin (AT&T Labs - Research)
Diamantino Caseiro (AT&T Labs - Research)
Taniya Mishra (AT&T Labs - Research)
Mazin Gilbert (AT&T Labs - Research)

A visual Voice-Mail-to-Text (VMTT) transcription system takes a conventional voice mail and converts it to formatted text following the standard punctuation, capitalization and presentation conventions. The text can then be used in a plethora of applications, from emails to databases, text messages, etc., which in turn allow searching, classification, data extraction, statistical analyses and other processes. Here we demonstrate the VMTT application by displaying the best scoring hypotheses from various recognition passes, the addition of punctuation and capitalization, formatting by using appropriate conventions for times, dates, dollar amounts and abbreviations, and finally applying grayscaling to lower the impact of the words recognized with low confidence scores.
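
The grayscaling step can be pictured as a mapping from a word's confidence score to a display shade. The sketch below assumes a linear mapping onto 8-bit gray values with a lightest shade of 200; neither the mapping nor the names come from the VMTT system itself.

```python
from typing import List, Tuple


def gray_level(confidence: float, lightest: int = 200) -> int:
    """Map a confidence in [0, 1] to an 8-bit gray value.

    0 is black (fully confident); `lightest` is the palest shade,
    used for words with zero confidence.
    """
    c = min(max(confidence, 0.0), 1.0)  # clamp out-of-range scores
    return round((1.0 - c) * lightest)


def shade(words: List[Tuple[str, float]]) -> List[Tuple[str, int]]:
    """Pair each recognized word with the gray value to render it in."""
    return [(word, gray_level(conf)) for word, conf in words]


# Hypothetical recognizer output: (word, confidence score).
hypothesis = [("call", 1.0), ("me", 0.5), ("at", 0.0)]
print(shade(hypothesis))  # [('call', 0), ('me', 100), ('at', 200)]
```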

#8Percy - an HTML5 framework for media rich web experiments on mobile devices

Christoph Draxler (Institute of Phonetics and Speech Processing, LMU Munich)

Percy is a small software framework for perception experiments via the WWW. It is implemented entirely in dynamic HTML and makes use of the new multimedia tags available in HTML5, eliminating the need for browser plug-ins or external players to display media content. With Percy, perception experiments can be run on any platform supporting HTML5, including tablet computers, smartphones or game consoles, and can thus access new participant populations. Percy supports touch interfaces and measures reaction times. It stores its data in a relational database system on a server. This allows immediate access to the experiment data from statistics packages, spreadsheet programs or via standard database access application programming interfaces. The system has been used for an online experiment on the identification of regional variants by phonetic features in German. Furthermore, the software has been used in a number of experiments in German, Castilian Spanish and English.

#9The KLAIR toolkit for recording interactive dialogues with a virtual infant

Mark Huckvale (University College London)

The goal of the KLAIR project is to facilitate research into the computational modelling of spoken language acquisition. Previously we have described the KLAIR toolkit that implements a virtual infant that can see, hear and talk. In this demonstration we show how the toolkit can be used to record interactive dialogues with caregivers. The outcomes are both an audio-video recording and a log of the "beliefs" and "goals" of the infant control program. These recordings can then be analysed by machine learning systems to model spoken language acquisition. In our demonstration, visitors will be able to interact with KLAIR and try to teach it the names of some toys.

#10Real-time Prototype for Integration of Blind Source Extraction and Robust Automatic Speech Recognition

Francesco Nesta (Fondazione Bruno Kessler-Irst)
Marco Matassoni (Fondazione Bruno Kessler-Irst)
Hari Krishna Maganti (Fondazione Bruno Kessler-Irst)

This demo presents a real-time prototype for automatic blind source extraction and speech recognition in the presence of multiple interfering noise sources. Binaurally recorded mixtures are processed by a combined Blind/Semi-Blind Source Separation algorithm in order to obtain an estimate of the target signal. The recovered target signal is segmented and used as input to a real-time automatic speech recognition (ASR) system. Further, to improve the recognition performance, noise-robust features based on Gammatone Frequency Cepstral Coefficients (GFCC) are used. The demo utilizes the data provided for the CHiME Pascal speech separation and recognition challenge and also real-time mixtures recorded on-site. Users will be able to listen to the recovered target signal and compare it with the original mixture and ASR output.

Tue-Ses2-O2:
First Language Acquisition

Time:Tuesday 13:30 Place:Leonardo - Pala Affari - Ground Floor Type:Oral
Chair:Cinzia Avesani

13:30The Multi Timescale Phoneme Acquisition Model of the Self-Organizing Based on the Dynamic Features

Kouki MIYAZAWA (Graduate School of Human Sciences, Waseda University)
Hideaki MIURA (Graduate School of Human Sciences, Waseda University)
Hideaki KIKUCHI (Graduate School of Human Sciences, Waseda University)
Reiko MAZUKA (RIKEN Brain Science Institute)

It is unclear how infants learn the acoustic expression of each phoneme of their native language. Recent studies have investigated phoneme acquisition using computational models. However, these studies used a limited vocabulary as input and did not handle continuous speech comparable to that of a natural environment. We therefore use natural continuous speech and build a self-organization model that simulates human cognitive ability, and we analyze the quality and quantity of the speech information necessary for acquiring the native phoneme system. Our model is designed to learn the values of the acoustic features of continuous speech and to estimate the number and boundaries of the phoneme categories without explicit instruction. In a recent study, our model could acquire the detailed vowels of the input language. In this study, we examine the mechanism necessary for an infant to acquire all the phonemes of a language, including consonants. In natural speech, vowels have stationary features; hence, our earlier model is suitable for learning them. However, learning consonants with the earlier model is difficult because most consonants have more dynamic features than vowels. To solve this problem, we designed a method to separate “stable” and “dynamic” speech patterns using a feature-extraction method based on the auditory representations used by humans. Using this method, we show that the acquisition of unstable phonemes is possible without instruction.

13:50The time-course of talker-specificity effects for newly-learned pseudowords: Evidence for a hybrid model of lexical representation

Helen Brown (Department of Psychology, University of York)
M. Gareth Gaskell (Department of Psychology, University of York)

Whilst research shows that talker information affects recognition of recently studied words, it remains unclear whether this information is stored in long-term memory. Three experiments explored whether talker-specificity effects (TSEs) for pseudowords changed over time and were affected by within- and between-talker variability during study. Results showed TSEs immediately after study in all experiments, consistent with episodic models, but TSEs remained a week later only for pseudowords studied in a single voice. Furthermore, source memory data suggested that talker information becomes less accessible over time, supporting hybrid models that incorporate aspects of both episodic and abstract lexical representation.

14:10A parametric approach to intonation acquisition research: Validation on child-directed speech data

Britta Lintfert (Institute of Natural Language Processing, University of Stuttgart, Germany)
Antje Schweitzer (Institute of Natural Language Processing, University of Stuttgart, Germany)
Bernd Möbius (Department of Computational Linguistics and Phonetics, Saarland University, Germany)

This paper validates a parametric approach to intonation acquisition research using child-directed speech data. An advantage of this approach is that it can be used for studying child speech as well as adult speech. Within the field of prosody acquisition it reconciles independent approaches to child prosody with ToBI-based approaches. In this paper we substantiate this claim by showing that clusters of parameterized contours obtained from German child-directed speech correlate with GToBI(S) categories, and by elaborating how, alternatively, the parameters can be mapped to properties that are relevant in independent approaches.

14:30Modelling Novelty Preference in Word Learning

Maarten Versteegh (International Max Planck Research School for Language Sciences / Radboud University, Nijmegen, The Netherlands)
Louis ten Bosch (Radboud University Nijmegen)
Lou Boves (Radboud University Nijmegen)

This paper investigates the effects of novel words on a cognitively plausible computational model of word learning. The model is first familiarized with a set of words, achieving high recognition scores and subsequently offered novel words for training. We show that the model is able to recognize the novel words as different from the previously seen words, based on a measure of novelty that we introduce. We then propose a procedure analogous to novelty preference in infants. Results from simulations of word learning show that adding this procedure to our model speeds up training and helps the model attain higher recognition rates.

14:50Using Imitation to learn Infant-Adult Acoustic Mappings

G Ananthakrishnan (Centre for Speech Technology, Royal Institute of Technology (KTH), Stockholm, Sweden)
Giampiero Salvi (Centre for Speech Technology, Royal Institute of Technology (KTH), Stockholm, Sweden)

This paper discusses a model which conceptually demonstrates how infants could learn the normalization between infant and adult acoustics. The model proposes that the mapping can be inferred from the topological correspondences between the adult and infant acoustic spaces, which are clustered separately in an unsupervised manner. The model requires feedback from the adult in order to select the right topology for clustering, which is a crucial aspect of the model. The feedback takes the form of an overall rating of the infant's imitation effort, rather than a frame-by-frame correspondence. Using synthetic but continuous speech data, we demonstrate that clusters with a good topological correspondence are perceived to be similar by a phonetically trained listener.

15:10Thresholding word activations for response scoring - Modelling psycholinguistic data

Christina Bergmann (Centre for Language and Speech Technology/International Max Planck Research School for Language Sciences, Radboud University Nijmegen, The Netherlands)
Louis ten Bosch (Centre for Language and Speech Technology, Radboud University Nijmegen, The Netherlands)
Lou Boves (Centre for Language and Speech Technology, Radboud University Nijmegen, The Netherlands)

In the present paper we replicate simulations of infant word learning and the effect of variation in the input. We then investigate to what extent the results are influenced by the way in which the continuous response functions are treated and what effects the use of thresholds can have on the data. Our results show that the underlying response pattern, as uncovered by different thresholds, varies greatly. Nonetheless, the overall output of the model is often correct and able to generalise to unseen data. Thus, we show that the model can give correct responses even in uncertain circumstances. Links of this finding to language acquisition research are discussed.
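
The effect of a threshold on response scoring can be shown with a toy example: the same continuous activation pattern yields a response or no response depending on where the threshold sits. This is a sketch under assumed names and values, not the scoring procedure used in the paper.

```python
from typing import Dict, Optional


def respond(activations: Dict[str, float], threshold: float) -> Optional[str]:
    """Return the most activated word if it reaches `threshold`, else None."""
    word, score = max(activations.items(), key=lambda item: item[1])
    return word if score >= threshold else None


# Hypothetical word activations for a single test stimulus.
trial = {"ball": 0.62, "doll": 0.31, "car": 0.07}
print(respond(trial, 0.5))  # ball
print(respond(trial, 0.7))  # None
```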

Tue-Ses2-O4:
Spoken Dialogue Systems I

Time:Tuesday 13:30 Place:Michelangelo - Pala Affari - 2nd Floor Type:Oral
Chair:Olivier Pietquin

13:30User Study of Spoken Decision Support System

Teruhisa Misu (NICT)
Kiyonori Ohtake (NICT)
Chiori Hori (NICT)
Hisashi Kawai (NICT)
Satoshi Nakamura (NICT)

This paper presents the results of the user evaluation of spoken decision support dialogue systems, which help users select from a set of alternatives. Thus far, we have modeled this decision support dialogue as a partially observable Markov decision process (POMDP), and optimized its dialogue strategy to maximize the value of the user’s decision. In this paper, we present a comparative evaluation of the optimized dialogue strategy with several baseline methods, and demonstrate that the optimized dialogue strategy that was effective in user simulation experiments works well in an evaluation by real users.

13:50Efficient Probabilistic Tracking of User Goal and Dialog History for Spoken Dialog Systems

Antoine Raux (Honda Research Institute USA)
Yi Ma (Ohio State University)

In this paper, we describe Dynamic Probabilistic Ontology Trees, a new probabilistic model to track dialog state in a dialog system. Our model captures both the user goal and the history of user dialog acts using a unified Bayesian Network. We perform efficient inference using a form of blocked Gibbs sampling designed to exploit the structure of the model. Evaluation on a corpus of dialogs from the CMU Let's Go system shows that our approach significantly outperforms a deterministic baseline and is able to exploit long N-best lists without loss of accuracy.

14:10Tackling a Shilly-Shally Classifier for Predicting Task Success in Spoken Dialogue Interaction

Alexander Schmitt (University of Ulm, Germany)
Alexander Zgorzelski (University of Ulm, Germany)
Wolfgang Minker (University of Ulm, Germany)

Statistical models that predict that a task with a telephone-based Spoken Dialogue System (SDS) is unlikely to be completed can be useful for adapting dialogue strategies. They can also trigger the decision to route callers directly to human assistance once it is clear that the SDS cannot automate the call. This paper addresses a number of issues that arise when deploying such models. We show that the predictions of a model are subject to strong variations between several adjacent dialogue steps. As a consequence, we show that the accuracy can be significantly increased when using sequences of equal predictions as the basis of the decision-making. Furthermore, we implement a confidence metric that takes into account the certainty of the classifier to determine the optimum decision point.
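The idea of basing the decision on runs of equal predictions can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function name `first_stable_decision`, the label strings, and the run length `k` are assumptions introduced here, and the confidence metric mentioned in the abstract is not modelled.

```python
from typing import Optional, Sequence


def first_stable_decision(predictions: Sequence[str], target: str, k: int) -> Optional[int]:
    """Return the index of the dialogue step at which `target` has been
    predicted k times in a row, or None if that never happens.

    Hypothetical helper: `target` and `k` are illustrative parameters.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    run = 0  # length of the current run of `target` predictions
    for step, label in enumerate(predictions):
        run = run + 1 if label == target else 0
        if run >= k:
            return step
    return None


# Made-up per-step predictions that flip back and forth before settling.
steps = ["ok", "fail", "ok", "fail", "fail", "fail", "ok"]
early = first_stable_decision(steps, "fail", 1)   # reacts to a single prediction
stable = first_stable_decision(steps, "fail", 3)  # waits for three equal predictions
```

With `k = 1` the decision fires on the first isolated "fail" prediction, while a larger `k` delays it until the classifier has stopped oscillating.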

14:30Evaluation of Listening-oriented Dialogue Control Rules based on the Analysis of HMMs

Toyomi Meguro (NTT Communication Science Laboratories, NTT Corporation)
Ryuichiro Higashinaka (NTT Cyber Space Laboratories, NTT Corporation)
Yasuhiro Minami (NTT Communication Science Laboratories, NTT Corporation)
Kohji Dohsaka (NTT Communication Science Laboratories, NTT Corporation)

We have been working on listening-oriented dialogues for the purpose of building listening agents. In our previous work [1], we trained hidden Markov models (HMMs) from listening-oriented dialogues (LoDs) between humans, and by analyzing them, discovered a distinguishing dialogue flow of LoD. For example, listeners suppress their information giving and self-disclosure, and, instead, increase acknowledgments and questions to lead speakers’ utterances. As an initial step for building listening agents, we decided to create dialogue control rules based on our analysis of the HMMs. We built our rule-based system and compared it with three other systems by a Wizard of Oz (WoZ) experiment. As a result, we found that our rule-based system achieved as much user satisfaction as human listeners.

14:50Large-Scale Experiments on Data-Driven Design of Commercial Spoken Dialog Systems

David Suendermann (SpeechCycle)
Jackson Liscombe (SpeechCycle)
Jonathan Bloom (SpeechCycle)
Grace Li (SpeechCycle)
Roberto Pieraccini (SpeechCycle)

The design of commercial spoken dialog systems is most commonly based on hand-crafting call flows. Voice interaction designers write prompts, predict caller responses, set speech recognition parameters, and implement interaction strategies, all based on “best design practices”. Recently, we presented the mathematical framework “Contender” (similar to reinforcement learning) that allows for replacing manual decisions made during system design by data-driven soft decisions made at system run time optimizing the cumulative reward of an application. The current paper reports on the results of 26 Contenders implemented in commercial applications processing a total of about 15 million calls.

15:10Comparing system-driven and free dialogue in in-vehicle interaction

Fredrik Kronlid (Talkamatic AB)
Jessica Villing (University of Gothenburg)
Alexander Berman (Talkamatic AB)
Staffan Larsson (University of Gothenburg)

It is widely held that a free, natural dialogue model is more efficient and less distracting than system-initiative, state-based dialogue. This paper describes an evaluation of two systems, one using system-directed dialogue and one using a more "free" dialogue, focusing on distraction and efficiency. The level of distraction is measured using an automotive industry standard test (LCT), and the efficiency is measured by counting the number of completed tasks. The efficiency is increased by 42% using the free, natural dialogue model while the LCT results are unclear. Using a free dialogue model increases the efficiency and reduces the distraction in some cases.

Tue-Ses2-O5:
Spoken Language Resources, Evaluation and Standardization II

Time:Tuesday 13:30 Place:Raffaello - Pala Affari - 3rd Floor Type:Oral
Chair:Paolo Baggia

13:30Rapid Evaluation of Speech Representations for Spoken Term Discovery

Michael Carlin (Johns Hopkins University)
Samuel Thomas (Johns Hopkins University)
Aren Jansen (Johns Hopkins University)
Hynek Hermansky (Johns Hopkins University)

Acoustic front ends are typically developed for supervised learning tasks and are thus optimized to minimize word error rate, phone error rate, etc. However, in recent efforts to develop zero resource speech technologies, the goal is not to use transcribed speech to train systems but instead to discover the acoustic structure of the spoken language automatically. For this new setting, we require a framework for evaluating the quality of speech representations without coupling to a particular recognition architecture. Motivated by the spoken term discovery task, we present a dynamic time warping-based framework for quantifying how well a representation can associate words of the same type spoken by different speakers. We benchmark the quality of a wide range of speech representations using multiple frame-level distance metrics and demonstrate that our recognizer-free performance metrics can accurately predict phone recognition accuracies.
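As a rough illustration of the dynamic time warping (DTW) idea behind such a framework, the sketch below computes a textbook DTW cost between two sequences of feature frames. It is an assumption-laden example rather than the authors' method: the function name `dtw_distance`, the Euclidean frame distance, and the unconstrained step pattern are choices made here, and the abstract does not specify them.

```python
import math
from typing import Sequence


def frame_distance(u: Sequence[float], v: Sequence[float]) -> float:
    """Euclidean distance between two feature frames of equal dimension."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


def dtw_distance(a: Sequence[Sequence[float]], b: Sequence[Sequence[float]]) -> float:
    """Accumulated frame distance along the cheapest DTW alignment of a and b."""
    n, m = len(a), len(b)
    if n == 0 or m == 0:
        raise ValueError("both sequences must be non-empty")
    inf = float("inf")
    # cost[i][j] holds the best accumulated cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(a[i - 1], b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


# Toy one-dimensional "features": the second token is a slower version of the first.
word_a = [(0.0,), (1.0,), (2.0,)]
word_b = [(0.0,), (1.0,), (1.0,), (2.0,)]
same_word_cost = dtw_distance(word_a, word_b)
other_cost = dtw_distance(word_a, [(0.0,), (2.0,)])
```

A representation that serves spoken term discovery well would be expected to give low alignment costs for tokens of the same word type across speakers and higher costs otherwise.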

13:50Phonemic Similarity Metrics to Compare Pronunciation Methods

Ben Hixon (Department of Computer Science, Hunter College of The City University of New York)
Eric Schneider (Department of Computer Science, Hunter College of The City University of New York)
Susan L. Epstein (Department of Computer Science, Hunter College of The City University of New York)

As grapheme-to-phoneme methods proliferate, their careful evaluation becomes increasingly important. This paper explores a variety of metrics to compare the automatic pronunciation methods of three freely-available grapheme-to-phoneme packages on a large dictionary. Two metrics, presented here for the first time, rely upon a novel weighted phonemic substitution matrix constructed from substitution frequencies in a collection of trusted alternate pronunciations. These new metrics are sensitive to the degree of mutability among phonemes. An alignment tool uses this matrix to compare phoneme substitutions between pairs of pronunciations.
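One way to picture a metric built on a weighted substitution matrix is a weighted edit distance between two phoneme sequences, sketched below. Everything specific here is assumed for illustration: the function name `weighted_edit_distance`, the ARPAbet-style symbols, and the substitution weights are made up and are not taken from the paper.

```python
from typing import Dict, Sequence, Tuple


def weighted_edit_distance(
    ref: Sequence[str],
    hyp: Sequence[str],
    sub_cost: Dict[Tuple[str, str], float],
    default_sub: float = 1.0,
    indel: float = 1.0,
) -> float:
    """Edit distance where substitutions are priced by a lookup table.

    `sub_cost` is treated as symmetric; pairs missing from it cost `default_sub`.
    """
    n, m = len(ref), len(hyp)
    dist = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dist[i][0] = i * indel  # delete every reference phoneme
    for j in range(1, m + 1):
        dist[0][j] = j * indel  # insert every hypothesis phoneme
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            p, q = ref[i - 1], hyp[j - 1]
            if p == q:
                s = 0.0
            else:
                s = sub_cost.get((p, q), sub_cost.get((q, p), default_sub))
            dist[i][j] = min(
                dist[i - 1][j] + indel,      # deletion
                dist[i][j - 1] + indel,      # insertion
                dist[i - 1][j - 1] + s,      # match or substitution
            )
    return dist[n][m]


# Illustrative weights only: a frequent, "cheap" vowel confusion.
costs = {("AA", "AO"): 0.2}
reference = ["T", "AH", "M", "AA", "T", "OW"]
close_variant = ["T", "AH", "M", "AO", "T", "OW"]
distant_variant = ["T", "AH", "M", "IY", "T", "OW"]
close_score = weighted_edit_distance(reference, close_variant, costs)
distant_score = weighted_edit_distance(reference, distant_variant, costs)
```

Under such weights, a pronunciation that differs only by a commonly interchanged phoneme is penalised less than one that differs by an unrelated phoneme.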

14:10Investigating the effect of number of interlocutors on the quality of experience for multi-party audio conferencing

Janto Skowronek (Deutsche Telekom Laboratories - Technische Universität Berlin)
Alexander Raake (Deutsche Telekom Laboratories - Technische Universität Berlin)

Towards an assessment method for multi-party audio conferencing systems, we investigated in a pilot study the influence of the number of interlocutors and the audio reproduction method on the quality of experience. Despite some room for improving the sensitivity of our experimental method, the results show that the number of interlocutors and the audio reproduction method influence at least partially the cognitive effort required in conferencing situations and the perceived speech transmission quality.

14:30On Development of Consistently Punctuated Speech Corpora

Jachym Kolar (LIMSI-CNRS, France)
Lori Lamel (LIMSI-CNRS, France)

Punctuation of automatically recognized speech is important to enhance readability of transcripts and to aid downstream NLP processing. This paper is concerned with issues involved in developing training and test corpora for automatic punctuation systems. Punctuation annotation in speech transcripts is difficult since there are numerous cases for which no standard punctuation rules exist. Special punctuation annotation guidelines tailored to spoken language were developed. Using these guidelines, almost 100 hours of broadcast news and conversation data in English and French have been punctuated by trained annotators. Measures of inter-annotator agreement are provided for both languages and differences between languages and genre are analyzed and discussed, along with some of the most frequent disagreements between annotators. Overall, using the guidelines, the annotation consistency has been significantly improved.

14:50A Multimodal Real-Time MRI Articulatory Corpus for Speech Research

Shrikanth Narayanan (University of Southern California)
Erik Bresch (University of Southern California)
Prasanta Ghosh (University of Southern California)
Louis Goldstein (University of Southern California)
Athanasios Katsamanis (University of Southern California)
Yoon Kim (University of Southern California)
Adam Lammert (University of Southern California)
Michael Proctor (University of Southern California)
Vikram Ramanarayanan (University of Southern California)
Yinghua Zhu (University of Southern California)

We present MRI-TIMIT: a large-scale database of synchronized audio and real-time magnetic resonance imaging (rtMRI) data for speech research. The database currently consists of speech data acquired from two male and two female speakers of American English. Subjects’ upper airways were imaged in the midsagittal plane while reading the same 460-sentence set used in the MOCHA-TIMIT corpus. Accompanying acoustic recordings were phonemically transcribed using forced alignment. Vocal tract tissue boundaries were automatically identified in each video frame, allowing for dynamic quantification of each speaker’s midsagittal articulation. The database and companion toolset provide a unique resource with which to examine articulatory-acoustic relationships in speech production.

15:10Building an audio-visual corpus of Australian English: large corpus collection with an economical portable and replicable Black Box

Denis Burnham (MARCS Auditory Laboratories, University of Western Sydney, Australia)
Dominique Estival (MARCS Auditory Laboratories, University of Western Sydney, Australia)
Steven Fazio (MARCS Auditory Laboratories, University of Western Sydney, Australia)
Felicity Cox (Macquarie University, Australia)
Robert Dale (Macquarie University, Australia)
Jette Viethen (Macquarie University, Australia)
Steve Cassidy (Macquarie University, Australia)
Julien Epps (University of New South Wales, Australia)
Roberto Togneri (University of Western Australia, Australia)
Yuko Kinoshita (University of Canberra, Australia)
Roland Göcke (University of Canberra, Australia)
Joanne Arciuli (University of Sydney, Australia)
Marc Onslow (University of Sydney, Australia)
Trent Lewis (Flinders University, Australia)
Andy Butcher (Flinders University, Australia)
John Hajek (University of Melbourne, Australia)
Michael Wagner (University of Canberra, Australia)

The Big Australian Speech Corpus project incorporates the strategic goals of 30 Chief Investigators from various speech science areas. Speech from 1000 geographically and socially diverse speakers is being recorded using a uniform and automated protocol plus standardized hardware and software to produce a widely applicable and extensible database – AusTalk. Here we describe the project’s major components and organization; share the lessons learnt from difficulties and challenges; and present the results achieved so far.

Tue-Ses2-S1-O:
Spoken Language Processing of Human-Human Conversations I

Time:Tuesday 13:30 Place:Caravaggio (Adua 1) - Pala Affari - 1st Floor Type:Oral
Chair:Dilek Hakkani-Tur

13:30Language-Independent Socio-Emotional Role Recognition in the AMI Meetings Corpus

Fabio Valente (Idiap Research Institute)
Alessandro Vinciarelli (University of Glasgow)

Social roles are a coding scheme that characterizes the relationships between group members during a discussion and their roles oriented toward the functioning of the group as a group. They can be related to phenomena like engagement, hot-spots and social dominance. This work presents an investigation of language-independent automatic social role recognition in AMI meetings based on turn statistics and prosodic features. At first, turn-taking statistics and prosodic features are integrated into a single generative conversation model which achieves a role recognition accuracy of 59%. This model is then extended to explicitly account for dependencies (or influence) between speakers, achieving an accuracy of 65%. The last contribution consists of investigating the statistical dependencies between the formal and the social role that participants have; integrating the information related to the formal role in the model, the recognition achieves an accuracy of 68%.

13:50Measuring acoustic-prosodic entrainment with respect to multiple levels and dimensions

Rivka Levitan (Columbia University)
Julia Hirschberg (Columbia University)

In conversation, speakers become more like each other in various dimensions. This phenomenon, commonly called entrainment, coordination, or alignment, is widely believed to be crucial to the success and naturalness of human interactions. We investigate entrainment in four acoustic and prosodic dimensions. We explore whether speakers coordinate with each other in these dimensions over the conversation as a whole as well as on a turn-by-turn basis and in both relative and absolute terms, and whether this coordination improves over the course of the conversation.

14:10Automatic Call Quality Monitoring Using Cost-Sensitive Classification

Youngja Park (IBM T.J. Watson Research Center)

We propose advanced text analytics and cost-sensitive classification-based approaches for call monitoring and show that automatic monitoring on ASR transcripts can be achieved with high accuracy. We identified features that aim to estimate the agent’s attitude and the customer’s sentiment by analyzing a large number of human monitoring results. To enhance the accuracy of feature extraction, we apply various techniques to improve the quality of transcripts, such as sentence boundary detection and disfluency removal. We further note that quality monitoring has a skewed class distribution and unequal classification error costs, and thus apply cost-sensitive classification algorithms. Validation on 386 customer calls confirms the benefits of our approach. An SVM-based method achieves an accuracy of 83.16% and an F1 score of 67.66% for bad calls, which is promising. This system can therefore be used to conduct initial monitoring of all calls and to select calls that require human monitoring.
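The role of unequal error costs can be illustrated with the textbook minimum-expected-cost decision rule. This is only a sketch under stated assumptions: the abstract does not say which cost-sensitive algorithms were used, and the function name `min_cost_label`, the probability input, and the cost values are hypothetical.

```python
def min_cost_label(p_bad: float, cost_miss: float, cost_false_alarm: float) -> str:
    """Pick the label with the lower expected cost for a single call.

    p_bad            -- estimated probability that the call is bad
    cost_miss        -- cost of labelling a bad call as good (assumed value)
    cost_false_alarm -- cost of labelling a good call as bad (assumed value)
    """
    if not 0.0 <= p_bad <= 1.0:
        raise ValueError("p_bad must lie in [0, 1]")
    expected_cost_if_good = p_bad * cost_miss
    expected_cost_if_bad = (1.0 - p_bad) * cost_false_alarm
    return "bad" if expected_cost_if_bad < expected_cost_if_good else "good"


# Same classifier output, different cost assumptions.
equal_costs = min_cost_label(0.3, cost_miss=1.0, cost_false_alarm=1.0)
miss_is_expensive = min_cost_label(0.3, cost_miss=5.0, cost_false_alarm=1.0)
```

When missing a bad call is assumed to be five times as costly as a false alarm, a call with only a 30% estimated chance of being bad is still flagged for human review.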

Tue-Ses2-P1:
Human Speech and Sound Perception II

Time:Tuesday 13:30 Place:Valfonda 1 - Pala Congressi (Passi Perduti-Gallery) Type:Poster
Chair:Holger Mitterer

#1Pointing Gestures do not Influence the Perception of Lexical Stress

Alexandra Jesse (Department of Psychology, University of Massachusetts, Amherst, U.S.A.)
Holger Mitterer (Max Planck Institute for Psycholinguistics, Nijmegen, The Netherlands)

We investigated whether seeing a pointing gesture influences the perceived lexical stress. A pitch contour continuum between the Dutch words “CAnon” (‘canon’) and “kaNON” (‘cannon’) was presented along with a pointing gesture during the first or the second syllable. Pointing gestures following natural recordings but not Gaussian functions influenced stress perception (Experiments 1 and 2), especially when auditory context preceded (Experiment 2). This was not replicated in Experiment 3. Natural pointing gestures failed to affect the categorization of a pitch peak timing continuum (Experiment 4). There is thus no convincing evidence that seeing a pointing gesture influences lexical stress perception.

#2Relationships between Phonetic Features and Speech Perception

Ian Cushing (University of Salford)
Francis Li (University of Salford)
Ken Worrall (Her Majesty’s Government Communications Centre)
Tim Jackson (Her Majesty’s Government Communications Centre)

This paper concerns the relationships amongst acoustic phonetic features of speech signals, perceived vocal effort, and speech clarity. It presents a statistical analysis of a large number of subjective tests on an anechoic speech corpus with five levels of vocal effort, namely hushed, normal, raised, loud, and shouted, with the aim of mapping objective acoustic phonetic features onto subjective ratings. Results show that listeners can differentiate vocal effort from subtle acoustic phonetic variations. There is also a correlation between clarity and vocal effort. A regression model is further established to predict vocal effort from acoustic phonetic analysis.

#3The representation of speech in a nonlinear auditory model: time-domain analysis of simulated auditory-nerve firing patterns

Guy Brown (Department of Computer Science, University of Sheffield)
Tim Jürgens (Medizinische Physik, Carl-von-Ossietzky Universität Oldenburg)
Ray Meddis (Department of Psychology, University of Essex)
Matthew Robertson (Department of Computer Science, University of Sheffield)
Nicholas Clark (Department of Psychology, University of Essex)

A nonlinear auditory model is appraised in terms of its ability to encode speech formant frequencies in the fine time structure of its output. It is demonstrated that groups of model auditory nerve (AN) fibres with similar interpeak intervals accurately encode the resonances of synthetic three-formant syllables, in close agreement with physiological data. Acoustic features are derived from the interpeak intervals and used as the input to a hidden Markov model-based automatic speech recognition system. In a digits-in-noise recognition task, interval-based features gave a better performance than features based on AN firing rate at every signal-to-noise ratio tested.

#4An Automatic Voice Pleasantness Classification System based on Prosodic and Acoustic Patterns of Voice Preference

Luis Pinto-Coelho (Instituto Politécnico do Porto)
Daniela Braga (Microsoft, China)
Miguel Sales-Dias (Microsoft Language Development Center)
Carmen Garcia-Mateo (University of Vigo)

In the last few years the number of systems and devices that use voice-based interaction has grown significantly. For continued use of these systems, the interface must be reliable and pleasant in order to provide an optimal user experience. However, there are currently very few studies that try to evaluate how good a voice is when the application is a speech-based interface. In this paper we present a new automatic voice pleasantness classification system based on prosodic and acoustic patterns of voice preference. Our study is based on a multi-language database composed of female voices. In the objective performance evaluation the system achieved a 7.3% error rate.

#5Contributions of F1 and F2 (F2’) to the perception of plosive consonants

René Carré (Laboratoire Dynamique du Langage, CNRS-Université Lyon 2, Lyon)
Pierre Divenyi (Speech and Hearing Research, Veterans Affairs Northern California Health Care System, Martinez CA, USA)
Willy Serniclaes (CNRS-LEAPLE, Université René Descartes, Paris)
Emmanuel Ferragne (Laboratoire Dynamique du Langage, CNRS-Université Lyon 2, Lyon)
Egidio Marsico (Laboratoire Dynamique du Langage, CNRS-Université Lyon 2, Lyon)
Viet-Son Nguyen (Centre MICA, CNRS/UMI2954, Hanoi University of Sciences and Technology)

This study examined the contribution of F1 and F2 alone to the perception of plosive consonants in a CV context. Applying a 3-Bark spectral integration, the F2 frequency was corrected for effects of proximity either to F1 or to F3, i.e., was replaced by F2’. Subjects used a two-dimensional Method of Adjustment to select the F1 and F2 consonant onset frequencies that led to a subjectively optimal percept of a predefined target CV. Results indicate that place prototypes are guided by F2 and are largely independent of F1. Nevertheless, while F2 alone is sufficient for segregating place prototypes for some consonants and vocalic contexts, it is insufficient for explaining the perception of place.

#6Auditory speech processing is affected by visual speech in the periphery

Jeesun Kim (MARCS Auditory Laboratories, University of Western Sydney, Australia)
Chris Davis (MARCS Auditory Laboratories, University of Western Sydney, Australia)

Two experiments were conducted to determine whether visual speech presented in the visual periphery affects the perceived identity of speech sounds. Auditory speech targets (vCv syllables) were presented in noise (-8 dB) with congruent or incongruent visual speech presented in full-face or upper-half face conditions. Participants’ eye movements were monitored to ensure that visual speech input occurred only from the periphery. In Experiment 1 participants had only to identify what they heard. The results showed that peripherally presented visual speech (full-face) facilitated identification of AV congruent stimuli compared to the upper-face control. Likewise, visual speech reduced correct identification for the incongruent stimuli. Experiment 2 was the same as the first except that in addition participants performed a central visual task. Again significant effects of visual speech were found. These results show that peripheral visual speech affects speech recognition.

#7Visual Speech Speeds Up Auditory Identification Responses

Tim Paris (MARCS, University of Western Sydney)
Jeesun Kim (MARCS, University of Western Sydney)
Chris Davis (MARCS, University of Western Sydney)

Auditory speech perception is more accurate when combined with visual speech. Recent ERP studies suggest that visual speech helps 'predict' which phoneme will be heard via feedback from visual to auditory areas, with more visually salient articulations associated with greater facilitation. Two experiments tested this hypothesis with a speeded auditory identification measure. Stimuli consisted of the sounds 'apa', 'aka' and 'ata', with matched and mismatched videos that showed the talker’s whole face or upper face (control). The percentage of matched AV videos was set at 85% in Experiment 1 and 15% in Experiment 2. Results showed that responses to matched whole face stimuli were faster than both upper face and mismatched videos in both experiments. Furthermore, salient phonemes (aPa) showed a greater reduction in reaction times than ambiguous ones (aKa). The current study provides support for the proposal that visual speech speeds up processing of auditory speech.

#8Agglomerative Hierarchical Clustering of Emotions in Speech Based on Subjective Relative Similarity

Ryoichi Takashima (Kobe University)
Tohru Nagano (IBM Research - Tokyo)
Ryuki Tachibana (IBM Research - Tokyo)
Masafumi Nishimura (IBM Research - Tokyo)

When we humans are asked whether or not the emotions in two speech samples are in the same category, the judgment depends on the size of the target category. Hierarchical clustering is a suitable technique for simulating such perceptions by humans of relative similarities of the emotions in speech. For better reflection of subjective similarities in clustering results, we have devised a method of hierarchical clustering that uses a new type of relative similarity data based on tagging the most similar pair in sets of three samples. This type of data allowed us to create a closed-loop algorithm for feature weight learning that uses the clustering performance as the objective function. When classifying the utterances of a specific sentence in Japanese recorded at a real call center, the method reduced the errors by 15.2%.

#9Optimal Syllabic Rates and Processing Units in Perceiving Mandarin Spoken Sentences

Guangting Mai (Language Engineering Laboratory, Department of Electronic Engineering, The Chinese University of Hong Kong, Hong Kong)
Gang Peng (Language Engineering Laboratory, Department of Electronic Engineering, The Chinese University of Hong Kong, Hong Kong)

This paper presents our investigation of syllable-related processing during human perception of Mandarin spoken sentences. Two behavioral perception experiments were conducted employing a signal synthesis method from a previous study [1]. We found (1) a clear relationship between speech intelligibility and syllabic rates of spoken sentences and (2) significantly higher speech intelligibility of sentences acoustically segmented at sub-syllable and syllable levels than at the level beyond one syllable. We therefore revealed the optimal syllabic rates and processing units in perceiving Mandarin continuous speech and further discussed the association between our results and the possible underlying neural mechanisms in the human brain.

#10Cross-Lingual Speaker Discrimination Using Natural and Synthetic Speech

Mirjam Wester (Centre for Speech Technology Research, University of Edinburgh, United Kingdom)
Hui Liang (Idiap Research Institute, Martigny, Switzerland)

This paper describes speaker discrimination experiments in which native English listeners were presented with natural speech stimuli in English and Mandarin, synthetic speech stimuli in English and Mandarin, or natural Mandarin speech and synthetic English speech stimuli. In each experiment, listeners were asked to judge whether the sentences in a pair were spoken by the same person or not. We found that the results of Mandarin/English speaker discrimination were very similar to those found in previous work on German/English and Finnish/English speaker discrimination. We conclude from this and previous work that listeners are able to discriminate between speakers across languages or across speech types, but the combination of these two factors leads to a speaker discrimination task that is too difficult for listeners to perform successfully, given the fact that the quality of across-language speaker adapted speech synthesis at present still needs to be improved.

Tue-Ses2-P2:
Speech Audio Analysis

Time:Tuesday 13:30 Place:Valfonda 2 - Pala Congressi (Passi Perduti-Gallery) Type:Poster
Chair:Toshiaki Fukada

#1Robust Audio Fingerprinting Based on Local Spectral Luminance Maxima Scheme

Yong-Zhe Shi (Department of Electronic Engineering, Tsinghua University, Beijing 100084, China)
Wei-Qiang Zhang (Department of Electronic Engineering, Tsinghua University, Beijing 100084, China)
Jia Liu (Department of Electronic Engineering, Tsinghua University, Beijing 100084, China)

This paper proposes a robust audio fingerprinting system based on a local spectral luminance maxima (LSLM) scheme using image processing approaches. Our approach treats the spectrogram of an audio clip as a 2-D image and extracts the local luminance maxima of the spectrum image as the discriminative characteristics. LSLM are selected for their resilience against quantization, compression, noise addition, and similar distortions. Experimental results show that the proposed binary audio fingerprints outperform some state-of-the-art methods in terms of both robustness and reliability, especially in noisy environments.
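Treating a spectrogram as an image and picking its local maxima can be sketched as below. The example only illustrates peak picking on a small 2-D array; the function name `local_maxima`, the square neighbourhood, and the `radius` parameter are assumptions made here, and the way the paper turns such maxima into binary fingerprints is not described in the abstract.

```python
from typing import List, Sequence, Tuple


def local_maxima(image: Sequence[Sequence[float]], radius: int = 1) -> List[Tuple[int, int]]:
    """Return (row, col) positions whose value is strictly greater than every
    other cell in the surrounding (2 * radius + 1) square window."""
    rows = len(image)
    cols = len(image[0]) if rows else 0
    peaks = []
    for r in range(rows):
        for c in range(cols):
            value = image[r][c]
            is_peak = True
            # Scan the neighbourhood, clipped at the image borders.
            for rr in range(max(0, r - radius), min(rows, r + radius + 1)):
                for cc in range(max(0, c - radius), min(cols, c + radius + 1)):
                    if (rr, cc) != (r, c) and image[rr][cc] >= value:
                        is_peak = False
                        break
                if not is_peak:
                    break
            if is_peak:
                peaks.append((r, c))
    return peaks


# Toy "spectrogram": rows could be frequency bins, columns time frames.
spec = [
    [0.1, 0.2, 0.1, 0.0],
    [0.2, 0.9, 0.3, 0.1],
    [0.1, 0.3, 0.2, 0.6],
    [0.0, 0.1, 0.1, 0.2],
]
fine_peaks = local_maxima(spec, radius=1)
coarse_peaks = local_maxima(spec, radius=2)
```

A larger neighbourhood keeps only the most prominent peaks, which is one plausible way to trade fingerprint density against robustness.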

#2Entropy Driven Inference of Stochastic Grammars

Unto Kalervo Laine (Department of Signal Processing and Acoustics, Aalto University, Espoo, Finland)

A new method for inferring specific stochastic grammars is presented. The process called Hybrid Model Learner (HML) applies entropy rate to guide the agglomeration process of type ab->c. Each rule derived from the input sequence is associated with a certain entropy-rate difference. A grammar automatically inferred from an example sequence can be used to detect and recognize similar structures in unknown sequences. Two important schools of thought, that of structuralism and the other of ‘stochasticism’ are discussed, including how these two have met and are influencing current statistical learning methods. It is argued that syntactic methods may provide universal tools to model and describe structures from the very elementary level of signals up to the highest one, that of language.

#3An Efficient Pre-processing Scheme to Improve the Sound Source Localization System in Noisy Environment

Sheng-Chieh Lee (Department of Electrical Engineering, National Cheng Kung University, Tainan, Taiwan)
Bo-Wei Chen (Department of Electrical Engineering, National Cheng Kung University, Tainan, Taiwan)
Jhing-Fa Wang (Department of Electrical Engineering, National Cheng Kung University, Tainan, Taiwan)
Chung-Hsien Wu (Department of Computer Science and Information Engineering, National Cheng Kung University, Tainan, Taiwan)
Min-Jian Liao (Department of Electrical Engineering, National Cheng Kung University, Tainan, Taiwan)

In this study, we introduce an efficient pre-processing scheme for direction of arrival (DOA) estimation, which is capable of reducing the noise and reverberation effects in speech sound source localization. Furthermore, the presented system is also suitable for far-field speech localization. The proposed method can be subdivided into three stages: linear phase-difference approximation, covariance matrix reconstruction, and frequency bin selection. The first two stages provide an initial reduction of the influence of noise and reverberation; the last stage filters out noisy frequency bands according to the eigenvalue decomposition (EVD) of the covariance matrix. The experimental results show that the proposed system is effective at detecting speech arriving from different directions. For speech signals at different signal-to-noise ratios (SNRs), the average estimation errors can be decreased by about 5 to 7.5 degrees.

#4A study on auditory feature spaces for speech-driven lip animation

Guylaine Le-Jan (INRIA)
Yannick Benezeth (INRIA)
Guillaume Gravier (IRISA - CNRS)
Frédéric Bimbot (IRISA - CNRS)

We present in this paper a study on auditory feature spaces for speech-driven face animation. The goal is to provide solid analytic ground to underscore the description capability of some well-known features in relation to lipsync. A set of various audio features describing the temporal and spectral shape of the speech signal has been computed on annotated audio extracts. The dimension of the input feature space has been reduced with PCA and the contribution of each input feature is investigated to determine the most descriptive ones. The resulting feature space is quantitatively and qualitatively analyzed for the description of acoustic units (phonemes, visemes, etc.) and we demonstrate that the use of some low-level features in addition to MFCC increases the relevance of the feature space. Finally, we evaluate the stability of these features w.r.t. the gender of the speaker.

#5Phase-only Speech Reconstruction Using Very Short Frames

Erfan Loweimi (Amirkabir University of Technology (AUT))
Seyed Mohammad Ahadi (Amirkabir University of Technology (AUT))
Hamid Sheikhzadeh (Amirkabir University of Technology (AUT))

This paper aims to investigate the potential of the speech phase spectrum. We observed that the window shape and the scale incompatibility error (SIE) are two important factors that strongly influence the quality of phase-only reconstructed speech. After evaluating the effects of different windows, we found the Chebyshev window with a dynamic range of 25 to 30 dB to be the best option. Drawing on Hilbert transform relations, we removed the SIE and found the reason why frame length extension improves the quality of ordinary phase-only reconstructed speech. Results show that the phase spectrum, even at very short frame lengths such as 16 ms, can be highly informative.

#6Frequency-Warped and Stabilized Time-Varying Cepstral Coefficients

Trond Skogstad (NTNU)
Torbjørn Svendsen (NTNU)

This paper presents a set of cepstral parameters based on time-varying linear prediction. The lattice filter structure is utilized to accommodate efficient stabilization of models and a Bark-like warped frequency scale. As the proposed cepstral features are based on non-stationary spectral analysis there is a potential for complementary information not captured in conventional features. In classification and recognition experiments, the proposed features are shown to improve performance when augmenting MFCCs.

#7Using Human Perception for Automatic Accent Assessment

Freddy William (CRSS University of Texas at Dallas)
Abhijeet Sangwan (CRSS University of Texas at Dallas)
John H.L. Hansen (CRSS University of Texas at Dallas)

In this study, a new algorithm for automatic accent evaluation of native and non-native speakers is presented. The proposed system consists of two main steps: alignment and scoring. At the alignment step, the speech utterance is processed using a Weighted Finite State Transducer (WFST) based technique to automatically estimate the pronunciation errors. Subsequently, in the scoring step a Maximum Entropy (ME) based technique is employed to assign perceptually motivated scores to pronunciation errors. The combination of the two steps yields an approach that measures accent based on perceptual impact of pronunciation errors, and is termed as the Perceptual WFST (P-WFST). The P-WFST is evaluated on American English (AE) spoken by native and non-native (native speakers of Mandarin-Chinese) speakers from the CU-Accent corpus. The proposed P-WFST algorithm shows higher and more consistent correlation with human evaluated accent scores, when compared to the Goodness Of Pronunciation (GOP) algorithm.

#8A study of the effectiveness of articulatory strokes for phonemic recognition

Carlos Molina (Universidad de Chile)
Sungbok Lee (University of Southern California)
Shrikanth Narayanan (University of Southern California)
Néstor Becerra Yoma (Universidad de Chile)

This paper explores a framework to incorporate articulatory movement information into a classical ASR scheme based on the concept of articulatory stroke. Articulatory stroke is a geometrical segmental unit which corresponds to a target approaching-releasing articulatory gesture. It has been shown that critical and non-critical (i.e., secondary or dummy) articulatory gestures can be classified with about 88% accuracy using the stroke parameters. Phonetic recognition accuracy is also investigated by augmenting the conventional MFCC features with the articulatory stroke features (obtained using the MOCHA corpus). It is found that the phonetic recognition accuracy increases 15% with respect to the best result using the ordinary MFCC parameters only. This provides supporting evidence for the usefulness of the articulatory stroke representation of articulatory movements not only for speech production description but also for automatic speech recognition.

#9Auditory Filterbank Improves Voice Morphing

Erika Okamoto (Faculty of Systems Engineering, Wakayama University, Japan)
Toshio Irino (Faculty of Systems Engineering, Wakayama University, Japan)
Ryuichi Nisimura (Faculty of Systems Engineering, Wakayama University, Japan)
Hideki Kawahara (Faculty of Systems Engineering, Wakayama University, Japan)

This paper presents a new method for vocal tract length (VTL) estimation and normalization based on a gammachirp auditory filterbank (GCFB) to improve the sound quality in voice morphing. VTL ratios between 28 speakers were estimated based on the spectral distances for all permutations (756 = ${}_{28}P_{2}$). The VTL estimation using the mel-frequency filterbank (MFFB), which is a preprocessor for calculating MFCCs commonly used in ASR, was also evaluated for comparison. The results of subjective listening tests of morphed voice sounds with and without VTL normalization are also reported. The objective and subjective results indicate that VTL normalization is essential for voice morphing, and the proposed GCFB-based method outperforms the MFCC-based method.

#10Monaural Sound Localization

Anna Katharina Fuchs (Signal Processing and Speech Communication Laboratory, Graz University of Technology, Austria)
Christian Feldbauer (Signal Processing and Speech Communication Laboratory, Graz University of Technology, Austria)
Michael Stark (Signal Processing and Speech Communication Laboratory, Graz University of Technology, Austria)

The principles of human sound localization imply binaural (interaural level and time difference) as well as monaural cues. The latter are captured by the head-related transfer functions (HRTFs), which describe the direction-dependent, spectral shaping of the incident sound wave, and can be exploited to determine the direction. In this paper an accurate talker localization strategy in the horizontal plane using the signal of only one microphone is presented. The sound localization method is developed based on a set of HRTF measurements taken from a dummy head and a statistical model of speech. High-dimensional spectral features (STFT coefficients) are taken and the direction of the sound source is evaluated with Gaussian mixture models (GMMs) using a maximum likelihood (ML) framework. An evaluation of the developed method in a synthetic test environment yields excellent localization results and leads to a promising approach which can be further investigated in future research.

Tue-Ses2-P3:
Speech Coding

Time:Tuesday 13:30 Place:Faenza 1 - Pala Congressi (Passi Perduti-Gallery) Type:Poster
Chair:Alan McCree

#1Dual-mode AVQ Coding Based on Spectral Masking and Sparseness Detection for ITU-T G.711.1/G.722 Super-wideband Extensions

Masahiro Fukui (NTT Cyber Space Laboratories, NTT Corporation)
Shigeaki Sasaki (NTT Cyber Space Laboratories, NTT Corporation)
Yusuke Hiwasaki (NTT Cyber Space Laboratories, NTT Corporation)
Sachiko Kurihara (NTT Cyber Space Laboratories, NTT Corporation)
Yoichi Haneda (NTT Cyber Space Laboratories, NTT Corporation)

ITU-T Recommendations G.711.1 Annex D and G.722 Annex B, which are super-wideband (50-14000 Hz) extensions to G.711.1 and G.722, have been recently standardized. This paper introduces a new coding method proposed and employed in the above ITU-T standards: an adaptive spectral masking of the algebraic vector quantization (AVQ) for MDCT-domain non-sparse signals using the spectral envelope. This paper also proposes a two-mode method that switches between sparse-type coding, such as ordinary AVQ, and the proposed spectral masking, based on MDCT-domain sparseness analysis. The proposed method improves the sound quality by more than 0.1 points on a five-grade scale, on average over speech, music, and mixed content. This is a significant improvement in sound quality.

#2Phone Impact Based Speech Transmission Technique for Reliable Speech Recognition in Poor Wireless Network Conditions

Azar Taufique (Open Networking Advanced Research Laboratory, University of Texas at Dallas)
Kumaran Vijayasankar (Open Networking Advanced Research Laboratory, University of Texas at Dallas)
Wooil Kim (Center for Robust Speech Systems, University of Texas at Dallas)
John H.L. Hansen (Center for Robust Speech Systems, University of Texas at Dallas)
Marco Tacca (Open Networking Advanced Research Laboratory, University of Texas at Dallas)
Andrea Fumagalli (Open Networking Advanced Research Laboratory, University of Texas at Dallas)

This paper presents a preliminary study on an effective differentiable network service technique to achieve improved speech recognition under severely poor wireless channel conditions, by leveraging multiple priority levels applied to speech classes. Each speech class is assigned a different priority level based on its level of impact on speech recognition performance. Based on their priority level, frames of each speech class are given distinct levels of network quality of service (QoS) to satisfy the delay requirement and enable speech recognition at the receiver. The experimental results prove that the proposed scheme is effective at providing wireless network service for robust speech recognition under poor channel conditions, showing up to 2.67 dB and 5.93 dB lower Signal to Noise Ratio (SNR) operating regions compared to the VU based and plain protocols respectively.

#3Automatic Speech Codec Identification with Applications to Tampering Detection of Speech Recordings

Jingting Zhou (Department of Electrical and Computer Engineering, University of Maryland)
Daniel Garcia-Romero (Department of Electrical and Computer Engineering, University of Maryland)
Carol Espy-Wilson (Department of Electrical and Computer Engineering, University of Maryland)

In this paper we explored many versions of CELP coders and studied the different codebooks they use to encode the noisy part of the residual. Taking advantage of the noise patterns they generate, an algorithm was proposed to detect GSM-AMR, EFR, HR and SILK coders. The algorithm was then extended to identify the subframe offset in order to perform tampering detection on cellphone speech recordings.

#4A hybrid quasi-harmonic/CELP wideband speech coding scheme for unit selection TTS synthesis

Chang-Heon Lee (Orange Labs TECH/ASAP/VOICE, Lannion, France)
Olivier Rosec (Orange Labs TECH/ASAP/VOICE, Lannion, France)
Yannis Stylianou (Institute of Computer Science, FORTH, and Multimedia Informatics Lab, CSD, UoC, Greece)

This paper suggests a new wideband speech coding model to efficiently compress acoustic inventories for concatenative unit selection text-to-speech (TTS) synthesis system. To fulfill the requirements of TTS synthesizer such as partial segment decoding and random access capability, a non-predictive scheme was adopted which combines the adaptive Quasi-Harmonic Model (aQHM) with the innovative codebook (ICB) model. aQHM plays a major role in modeling pitch harmonic components, and ICB compensates, in a closed-loop way, for the modeling error of aQHM. This is especially important in transient or unvoiced regions. To further improve the coding efficiency, a hybrid coding framework is also suggested. Results from a large French speech database show that the proposed algorithm provides similar speech quality to the high quality AMR-WB codec while it supports the random access capability.

#5Voice Quality Characterization of IETF Opus Codec

Anssi Rämö (Nokia Research Center)
Henri Toukomaa (Nokia Research Center)

This paper discusses the voice quality of Opus, an IETF-driven open-source voice and audio codec. Opus is a newly developed hybrid codec based on SILK and CELT codec technologies. The construction of Opus is described briefly in this paper and, more importantly, its optimal operating points are identified based on listening test results. Voice quality was evaluated with two subjective listening tests. Industry standard voice codecs (3GPP AMR and AMR-WB, and ITU-T G.718B, G.722.1C and G.719) as well as direct signals were used as voice quality references.

#6Leja ordering LSFs for accurate estimation of predictor coefficients

Christian Fischer Pedersen (Department of Electronic Systems, Aalborg University, Denmark)

Linear prediction (LP) is the most prevalent method for spectral modelling of speech, and line spectrum pair (LSP) decomposition is the standard method to robustly represent the coefficients of LP models. Specifically, the angles of LSP polynomial roots, i.e. line spectrum frequencies (LSFs), encode exactly the same information as LP coefficients. The conversion of LP coefficients to LSFs and back has received considerable attention since the mid-1970s, when LSFs were introduced. The present paper demonstrates how Leja ordering of LSFs reduces amplification of rounding errors when converting LSFs to LP coefficients. The theory behind Leja ordering and the conversion of LSFs to LP coefficients is presented. To supplement theory, numerical experiments illustrate the accuracy gain achieved by Leja ordering LSFs prior to conversion. Accuracy is measured as the root mean square deviation between estimated coefficient vectors with and without prior Leja ordering.
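
As a rough illustration of the idea, the sketch below puts a set of roots into Leja order and then expands the corresponding polynomial by multiplying the linear factors in that order. It is a minimal sketch under stated assumptions, not the paper's implementation: the function names `leja_order` and `poly_from_roots` are hypothetical, and the greedy rule used here (start from a point of largest modulus, then repeatedly pick the point that maximizes the product of distances to the points already chosen) is the textbook definition of Leja ordering. For large root sets the running product may need to be accumulated in the log domain.

```python
import numpy as np

def leja_order(points):
    """Return indices that put `points` into (greedy) Leja order."""
    pts = np.asarray(points, dtype=complex)
    remaining = list(range(len(pts)))
    # Start with a point of largest modulus (ties: first occurrence).
    first = max(remaining, key=lambda i: abs(pts[i]))
    order = [first]
    remaining.remove(first)
    # Running product of distances from each point to the chosen points.
    dist_prod = np.ones(len(pts))
    while remaining:
        dist_prod *= np.abs(pts - pts[order[-1]])
        nxt = max(remaining, key=lambda i: dist_prod[i])
        order.append(nxt)
        remaining.remove(nxt)
    return order

def poly_from_roots(roots, order=None):
    """Expand prod_k (z - roots[k]), multiplying factors in the given order."""
    roots = np.asarray(roots, dtype=complex)
    if order is None:
        order = range(len(roots))
    coeffs = np.array([1.0 + 0.0j])
    for i in order:
        coeffs = np.convolve(coeffs, [1.0, -roots[i]])
    return coeffs

# Hypothetical LSF angles (radians); the roots lie on the unit circle.
lsfs = np.array([0.3, 0.9, 1.6, 2.4])
roots = np.concatenate([np.exp(1j * lsfs), np.exp(-1j * lsfs)])
coeffs = poly_from_roots(roots, leja_order(roots))
```

In exact arithmetic the order of multiplication does not matter; the ordering only changes how rounding errors accumulate in floating point.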

#7Improved Quality for Conversational VoIP using Path Diversity

Qipeng Gong (McGill University)
Peter Kabal (McGill University)

In Voice-over-IP, the quality of interactive conversation is important to users. Quality-based playout buffering seeks an optimum balance between delay and loss. However, such a scheme still suffers when packet losses are bursty. Path diversity can alleviate the effect of losses and improve perceived quality by providing redundancy. In this paper, a new scheme is proposed which evaluates the performance of both paths. We consider three different path diversity schemes. The playout scheduling algorithms are designed based on conversational quality including both calling quality and interactivity. The simulation results show the efficacy of our algorithms in correcting for losses (isolated and burst) and improving perceived conversational quality.

#8Tree Encoding for the ITU-T G.711.1 Speech Coder

Abdul Hannan Khan (McGill University)
Peter Kabal (McGill University)

This paper examines enhancements to the ITU-T Recommendation G.711.1 PCM wideband extension speech coder. To further improve the core lower-band coding performance, the use of vector quantization and delayed decision coding is studied. A particular case of delayed decision coding, tree encoding, is implemented in the above standard. The bitstream is compatible with both the legacy G.711 and the G.711.1 decoder. PESQ (ITU-T P.862, Perceptual Evaluation of Speech Quality) is used to evaluate the performance. Both the vector quantizer and tree encoder have better performance than the original core layer encoder.

#9Parallel and Hierarchical Decision Making for Sparse Coding in Speech Recognition

Dong Wang (EURECOM)
Ravichander Vipperla (EURECOM)
Nicholas Evans (EURECOM)

Sparse coding exhibits promising performance in speech processing, mainly due to the large number of bases that can be used to represent speech signals. However, the high demand for computational power represents a major obstacle in the case of large datasets, as does the difficulty in utilising information scattered sparsely in high dimensional features. This paper reports the use of an online dictionary learning technique, proposed recently by the machine learning community, to learn large scale bases efficiently, and proposes a new parallel and hierarchical architecture to make use of the sparse information in high dimensional features. The approach uses multilayer perceptrons (MLPs) to model sparse feature subspaces and make local decisions accordingly; the latter are integrated by additional MLPs in a hierarchical way for making global decisions. Experiments on the WSJ database show that the proposed approach not only solves the problem of prohibitive computation with large-dimensional sparse features, but also provides better performance in a frame-level phone prediction task.

#10A New Model-based Mandarin-speech Coding System

Chen-Yu Chiang (Institute of Communication Engineering, National Chiao Tung University, Taiwan)
Jyh-Her Yang (Institute of Communication Engineering, National Chiao Tung University, Taiwan)
Ming-Chieh Liu (Institute of Communication Engineering, National Chiao Tung University, Taiwan)
Yih-Ru Wang (Institute of Communication Engineering, National Chiao Tung University, Taiwan)
Yuan-Fu Liao (Department of Electronic Engineering, National Taipei University of Technology, Taiwan)
Sin-Horn Chen (Institute of Communication Engineering, National Chiao Tung University, Taiwan)

In this paper, a new model-based Mandarin-speech coding system is proposed. It employs a prosody-enriched ASR with a hierarchical prosodic model (HPM) to generate from the input speech enriched transcriptions, including linguistic features, prosodic tags and spectral parameters in the encoder. By sending these features to the decoder, we can first reconstruct the prosodic-acoustic features of syllable pitch contour, syllable duration, syllable energy level, and inter-syllable pause duration by HPM using the linguistic features and prosodic tags; and then combined with spectral parameters to reconstruct the input speech signal by an HMM-based speech synthesizer. Experimental results show that the reconstructed speech has good quality at a low data rate of 543 bits/s.

Tue-Ses2-P4:
Robustness and Adaptation for ASR

Time:Tuesday 13:30 Place:Faenza 2 - Pala Congressi (Passi Perduti-Gallery) Type:Poster
Chair:Vivek Kumar

#1Using Unsupervised Feature-Based Speaker Adaptation for Improved Transcription of Spoken Archives

Petr Cerva (Institute of Information Technology and Electronics, Faculty of Mechatronics Technical University of Liberec, Studentska 2, CZ 461 17, Liberec, Czech Republic)
Karel Palecek (Institute of Information Technology and Electronics, Faculty of Mechatronics Technical University of Liberec, Studentska 2, CZ 461 17, Liberec, Czech Republic)
Jan Silovsky (Institute of Information Technology and Electronics, Faculty of Mechatronics Technical University of Liberec, Studentska 2, CZ 461 17, Liberec, Czech Republic)
Jan Nouza (Institute of Information Technology and Electronics, Faculty of Mechatronics Technical University of Liberec, Studentska 2, CZ 461 17, Liberec, Czech Republic)

This paper deals with unsupervised feature-based speaker adaptation techniques. The goal is to design an optimal adaptation approach for improving the recognition accuracy of an LVCSR system developed for automatic transcription of large archives of spoken Czech (e.g. the archive of the parliament talks, historical archives of Czech broadcast stations, etc.). For this purpose, several modifications of VTLN and CMLLR techniques were investigated and combined together. Our study focuses on the application of the adaptation methods in the recognition process as well as in building a normalized acoustic model within the speaker adaptive training scheme. The methods were evaluated experimentally on a large amount of varied data (with a total of 93k words). The resulting two-step adaptation scheme yields a significant WER reduction from 17.8 % to 14.8 %.

#2Online Speaker Adaptation with Pre-computed FMLLR Transformations

Volker Fischer (European Media Laboratory GmbH, Schloß-Wolfsbrunnenweg 33, D-69118 Heidelberg, Germany)
Siegfried Kunzmann (European Media Laboratory GmbH, Schloß-Wolfsbrunnenweg 33, D-69118 Heidelberg, Germany)

This paper presents a memory efficient single pass speech recognizer that makes use of pre-computed FMLLR transformations for online speaker adaptation. For that purpose we apply unsupervised segment clustering to the training corpus, create a transformation matrix for each cluster, and train a text-independent Gaussian mixture classifier for cluster selection during runtime. We use the RWTH Aachen University open source speech recognition toolkit for evaluation and compare the results to a standard speaker adaptive two pass decoding strategy. Results indicate that the method improves single pass recognition in VTLN feature space almost without overhead due to cluster selection, and show a relative improvement of up to 15 percent over speaker adaptive decoding, if only little data is available for unsupervised online adaptation.
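
The runtime step described above (pick a cluster, then apply that cluster's pre-computed transformation) might be sketched as follows. This is an illustrative simplification, not the authors' system: each cluster is scored with a single diagonal-covariance Gaussian as a stand-in for the Gaussian mixture classifier, the transformation is assumed to be the usual affine feature-space form x' = Ax + b, and the dictionary keys and the function names `diag_gauss_loglik` and `select_and_apply` are hypothetical.

```python
import numpy as np

def diag_gauss_loglik(frames, mean, var):
    """Per-frame log-likelihood under a diagonal-covariance Gaussian."""
    diff = frames - mean
    return -0.5 * np.sum(np.log(2.0 * np.pi * var) + diff ** 2 / var, axis=1)

def select_and_apply(frames, clusters):
    """Pick the best-scoring cluster and apply its affine transform.

    `clusters` is a list of dicts with keys "mean", "var" (classifier
    parameters) and "A", "b" (pre-computed transform); all hypothetical.
    """
    # Total log-likelihood of the whole segment under each cluster model.
    scores = [diag_gauss_loglik(frames, c["mean"], c["var"]).sum()
              for c in clusters]
    best = int(np.argmax(scores))
    A, b = clusters[best]["A"], clusters[best]["b"]
    # Row-vector convention: each frame x becomes A @ x + b.
    return best, frames @ A.T + b
```

Because the transforms are computed offline, the only per-segment cost in this sketch is the classifier scoring and one matrix multiplication.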

#3Instantaneous Speaker Adaptation through Selection and Combination of fMLLR Transformation Matrices

Diego Giuliani (Fondazione Bruno Kessler, Trento, Italy)
Fabio Brugnara (Fondazione Bruno Kessler, Trento, Italy)

This paper addresses instantaneous speaker adaptation, based on feature-space maximum likelihood linear regression (fMLLR), in the context of an automatic transcription task. We investigate the use of fMLLR-based adaptation when the need for a preliminary decoding pass for a speech segment is removed, as sufficient statistics for adaptation parameter estimation are gathered with respect to a Gaussian mixture model. To cope with limited adaptation data, in addition to using feature-space maximum a posteriori linear regression (fMAPLR), an investigation is conducted where the transformation matrix to be applied to the speech segment is estimated through selection and combination of pre-computed fMLLR transformation matrices. For speaker adaptively trained acoustic models, results of recognition experiments show that the proposed approach is moderately better than fMLLR but not as good as fMAPLR.

#4Joint Bilinear Transformation Space Based Maximum a Posteriori Linear Regression Adaptation using Prior with Variance Function

Hwa Jeon Song (Spoken Language Processing Team, Electronics and Telecommunications Research Institute, Korea)
Yunkeun Lee (Spoken Language Processing Team, Electronics and Telecommunications Research Institute, Korea)
Hyung Soon Kim (School of Electrical Engineering, Pusan National University, Korea)

This paper proposes a new joint maximum a posteriori linear regression (MAPLR) adaptation using single prior distribution with a variance function in bilinear transformation space (BITS). There are two indirect adaptation methods based on the linear transformation in BITS and these are tightly coupled by joint MAP-based estimation. The proposed method not only has the scalable parameters but also is based on only one prior distribution, unlike the conventional joint MAP-MAPLR method with two priors. Experimental results, especially for small amount of adaptation data, show the synergy between two indirect BITS-based methods over other methods.

#5A Study on Combining VTLN and SAT to Improve the Performance of Automatic Speech Recognition

Rama Sanand Doddipatla (Aalto University)
Mikko Kurimo (Aalto University)

In this paper, we present ideas to combine VTLN and SAT to improve the performance of automatic speech recognition. We show that VTLN matrices can be used as SAT transformation matrices in recognition, though the training still follows conventional SAT. This will be useful when there is very little adaptation data and the SAT transformation matrix cannot be estimated to perform the required adaptation. We also present a study to understand whether VTLN can be performed after SAT and whether such a combination is better than the conventional approach, where VTLN is performed before SAT. Finally, we present a novel approach to perform VTLN by using VTLN matrices in cascade. This allows us to include warping factors that are not included in the initial search space. We show through recognition experiments that these combinations improve the performance of ASR, with major gains in the mismatched train and test speaker conditions.

#6Incorporating Regional Information to Enhance MAP-based Stochastic Feature Compensation for Robust Speech Recognition

Yu Tsao (Spoken Language Communication Group, National Institute of Information and Communications Technology)
Paul R. Dixon (Spoken Language Communication Group, National Institute of Information and Communications Technology)
Chiori Hori (Spoken Language Communication Group, National Institute of Information and Communications Technology)
Hisashi Kawai (Spoken Language Communication Group, National Institute of Information and Communications Technology)

In this study, we propose an environment structuring framework to facilitate suitable prior density preparation for MAP-based stochastic feature matching (SFM) for robust speech recognition. We use a two-stage hierarchical structure to construct the environment structuring framework to characterize the regional information of various speaker and speaking environments. With the regional information, we derive three types of prior densities, namely clustered prior, sequential prior, and hierarchical prior densities. We also designed an integrated prior density to combine the advantages of the above three prior densities. From our experimental results on the Aurora-2 task, we confirmed that with regional information, we can obtain more suitable prior densities and thus enhance the performance of MAP-based SFM. Moreover, we found that by using the integrated prior density, which integrates multiple knowledge sources from the other three, MAP-based SFM gives the best performance.

#7A Study on the Effect of Pitch on LPCC and PLPC Features for Children's ASR in comparison to MFCC

Shweta Ghai (Department of Electronics and Electrical Engineering, Indian Institute of Technology Guwahati, Guwahati-781039, India)
Rohit Sinha (Department of Electronics and Electrical Engineering, Indian Institute of Technology Guwahati, Guwahati-781039, India)

In this work, following our previous studies, we study and quantify the effect of pitch on LPCC and PLPC features and explore their efficacy for children's mismatched ASR in comparison to MFCC. Our analysis shows that, unlike MFCC, the LPCC feature is not strongly influenced by pitch variations. On the other hand, PLPC, like MFCC, is found to be significantly affected by pitch variations, although to a comparatively lesser extent. However, after explicit pitch normalization of children's speech, MFCC is found to result in the best children's speech recognition performance on adults' speech trained models in comparison to LPCC and PLPC features.

#8About Handling Boundary Uncertainty in a Speaking Rate Dependent Modeling Approach

Denis Jouvet (LORIA - INRIA Nancy)
Dominique Fohr (LORIA - CNRS)
Irina Illina (LORIA - Université Nancy II)

Variability dependent modeling provides a way of handling the impact of some variability sources in the modeling. In many cases, the variability factor is estimated in a deterministic way, leading to a mere selection of the most adequate model. However, there is always some uncertainty in the estimation of the variability sources, which may induce a suboptimal model selection. This paper considers the context of a speaking rate dependent modeling approach, and shows that the uncertainty on the speech segment boundaries, which translates into an uncertainty in the speaking rate estimation, can be handled in the training process and/or in the decoding process. Preliminary results reported here are promising for dealing with variability estimation uncertainty.

#9An Active Learning Approach to Task Adaptation

Ji Wu (Department of Electronic Engineering, Tsinghua University, Beijing, China)
Zhiyang He (Tsinghua-iFlytek Joint Laboratory for Speech Technologies, Beijing, China)
Ping Lv (Tsinghua-iFlytek Joint Laboratory for Speech Technologies, Beijing, China)

An active learning approach is proposed to automatically analyze speech recognition tasks and select particularly useful adaptation data. The distribution of task data is first estimated, which is a combination of two distributions based on N-best recognition results and low confidence data. After that, a subset of adaptation data is selected in two stages using a greedy algorithm according to the estimated distribution. Low confidence data are first selected and manually labeled. Then, the high confidence data are selected based on the top-best recognition results, which are also used as labels for the adaptation. The experimental results of the subsequent task adaptation show that the proposed active learning approach can effectively select the useful data to improve the overall performance of the system. The word accuracy is close to, and even exceeds, the performance of supervised adaptation using all of the data, when only 10%-20% of the total data need to be manually labeled.

#10Efficient Speaker and Noise Normalization for Robust Speech Recognition

Vikas Joshi (Department of Electrical Engineering, Indian Institute of Technology, Madras, India)
Raghavendra Bilgi (Department of Electrical Engineering, Indian Institute of Technology, Madras, India)
Umesh S (Department of Electrical Engineering, Indian Institute of Technology, Madras, India)
Carmen Benitez (Dept of Signal Theory, Telematics and Communications, University of Granada, Spain)
Luz García Martínez (Dept of Signal Theory, Telematics and Communications, University of Granada, Spain)

In this paper, we describe a computationally efficient approach for combining speaker and noise normalization techniques. In particular, we combine the simple yet effective Histogram Equalization (HEQ) for noise compensation with vocal-tract length normalization (VTLN) for speaker normalization. While it is intuitive to remove noise first and then perform VTLN, this is difficult since HEQ performs noise compensation in the cepstral domain, while VTLN involves warping in the spectral domain. In this paper, we investigate the use of the recently proposed T-VTLN approach to speaker normalization where matrix transformations are directly applied on cepstral features. We show that the speaker-specific warp factors estimated even from noisy speech using this approach closely match those from clean speech. Further, using sub-band HEQ (S-HEQ) and T-VTLN we get a significant relative improvement of 20% and an impressive 33.54% over baseline in recognition accuracy for the Aurora-2 and Aurora-4 tasks, respectively.
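
For readers unfamiliar with histogram equalization of cepstral features, the following minimal sketch shows the generic idea: each feature dimension is mapped through its empirical CDF onto a reference distribution. It assumes a standard normal reference and a rank-based CDF estimate, both of which are illustrative choices; the abstract does not specify the HEQ or S-HEQ variant used, and the function name `histogram_equalize` is hypothetical.

```python
from statistics import NormalDist

import numpy as np

def histogram_equalize(features):
    """Map each column of `features` (frames x dims) to a standard normal."""
    x = np.asarray(features, dtype=float)
    n_frames = x.shape[0]
    # Rank of every frame within its column (0 = smallest value).
    ranks = x.argsort(axis=0).argsort(axis=0)
    # Empirical CDF, offset by 0.5 so values stay strictly inside (0, 1).
    cdf = (ranks + 0.5) / n_frames
    # Inverse CDF of the assumed reference distribution N(0, 1).
    inv_cdf = np.vectorize(NormalDist().inv_cdf)
    return inv_cdf(cdf)
```

The mapping is monotonic within each dimension, so the ordering of frames is preserved while the marginal distribution is forced to match the reference.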

#11How Realistic is Artificially Added Noise?

Thomas Winkler (Fraunhofer IAIS)

Evaluations of algorithms for robust automatic speech recognition (ASR) are often based on artificial noisy speech instead of realistic noisy speech. In this paper we compare the ASR performance of speech with artificial additive noise to the performance of realistic noisy speech. All data was recorded during the same recording campaign and with nearly identical channel characteristics. The simulation process takes into account all major characteristics of the noisy reference data. Clean speech, noisy speech and simulated speech are compared for different aspects of robust ASR including noise reduction by Spectral Subtraction and the ETSI robust front end. The results show that artificial noisy speech, even in very controlled simulation environments, is not very similar to, and not a full substitute for, realistic noisy data. While the tendencies of the improvement for artificial and realistic data are similar for the evaluated approaches, the magnitude can be quite different.

Tue-Ses2-S1-P:
Spoken Language Processing of Human-Human Conversations II

Time:Tuesday 14:30 Place:Caravaggio (Adua 1) - Pala Affari - 1st Floor Type:Poster
Chair:Dilek Hakkani-Tur

#1Learning Influences from Word Use in Polylogue

Tomoharu Iwata (NTT)
Shinji Watanabe (NTT)

We propose a probabilistic model for estimating influences among speakers from conversation data with multiple people. In conversations, people tend to mimic their companions' behavior depending on their level of trust. With the proposed model, we assume that the word use of a speaker depends on the word use of previous speakers as well as their own earlier word use and the general word distribution. The influences can be efficiently estimated by using the expectation maximization (EM) algorithm. Experiments on two meeting data sets in Japanese and in English demonstrate the effectiveness of the proposed method.

#2Identifying Agreement/Disagreement in Conversational Speech: A Cross-lingual Study

Wen Wang (SRI International)
Kristin Precoda (SRI International)
Colleen Richey (SRI International)
Geoffrey Raymond (University of California, Santa Barbara)

This paper presents models for detecting agreement/disagreement between speakers in English and Arabic broadcast conversation shows. We explore a variety of features, including lexical, structural, durational, and prosodic features. We experiment with these features using Conditional Random Fields models and conduct systematic investigations on the efficacy of various feature groups across languages. Sampling approaches are examined for handling highly imbalanced data. Overall, we achieved 79.2 pct. (precision), 50.5 pct. (recall), 61.7 pct. (F1) for agreement detection and 69.2 pct. (precision), 46.9 pct. (recall), and 55.9 pct. (F1) for disagreement detection, on English broadcast conversation data; and 89.2 pct. (precision), 30.1 pct. (recall), 45.1 pct. (F1) for agreement detection and 75.9 pct. (precision), 28.4 pct. (recall), and 41.3 pct. (F1) for disagreement detection, on Arabic broadcast conversation data.
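
The reported F1 values can be related to the precision and recall figures through the standard harmonic-mean formula, F1 = 2PR / (P + R). The short check below recomputes F1 from the rounded figures quoted in the abstract; the helper name `f1_score` and the dictionary labels are illustrative only. Since the published precision and recall are themselves rounded, the recomputed values can differ from the reported ones by about 0.1 point.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    return 2.0 * precision * recall / (precision + recall)

# (precision, recall, reported F1), all in percent, as quoted in the abstract.
reported = {
    "English agreement": (79.2, 50.5, 61.7),
    "English disagreement": (69.2, 46.9, 55.9),
    "Arabic agreement": (89.2, 30.1, 45.1),
    "Arabic disagreement": (75.9, 28.4, 41.3),
}

for name, (p, r, f1) in reported.items():
    print(f"{name}: recomputed F1 = {f1_score(p, r):.2f}, reported = {f1}")
```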

#3A Dual Channel Coupled Decoder for Fillers and Feedback

Daniel Neiberg (CTT, TMH, CSC, KTH)
Joakim Gustafson (CTT, TMH, CSC, KTH)

This study presents a dual channel decoder capable of modeling cross-speaker dependencies for segmentation and classification of fillers and feedbacks in conversational speech found in the DEAL corpus. For the same number of Gaussians per state, we have shown improvement in terms of average F-score for the successive addition of 1) increased frame rate from 10 ms to 50 ms, 2) Joint Maximum Cross-Correlation (JMXC) features in a single channel decoder, 3) a joint transition matrix which captures dependencies symmetrically across the two channels, and 4) coupled acoustic model retraining symmetrically across the two channels. The final step gives a relative improvement of over 100% for fillers and feedbacks compared to our previously published results. The F-scores are in a range that makes it possible to use the decoder as both a voice activity detector and an illocutionary act decoder for semi-automatic annotation.

#4An Analysis of PCA-based Vocal Entrainment Measures in Married Couples' Affective Spoken Interactions

Chi-Chun Lee (Signal Analysis and Interpretation Laboratory, University of Southern California)
Athanasios Katsamanis (Signal Analysis and Interpretation Laboratory, University of Southern California)
Matthew P. Black (Signal Analysis and Interpretation Laboratory, University of Southern California)
Brian R. Baucom (Department of Psychology, University of Southern California)
Panayiotis G. Georgiou (Signal Analysis and Interpretation Laboratory, University of Southern California)
Shrikanth S. Narayanan (Signal Analysis and Interpretation Laboratory, University of Southern California)

Entrainment has played a crucial role in analyzing married couples’ interactions. In this work, we introduce a novel technique for quantifying vocal entrainment based on Principal Component Analysis (PCA). The entrainment measure, as we define it in this work, is the amount of preserved variability of one interlocutor’s speaking characteristics when projected onto the representing space of the other’s speaking characteristics. Our analysis of real couples’ interactions shows that when a spouse is rated as having positive emotion, he/she has a higher value of vocal entrainment compared to when rated as having negative emotion. We further performed various statistical analyses on the strength and the directionality of vocal entrainment under different affective interaction conditions to bring quantitative insights into the entrainment phenomenon. These analyses, along with a baseline prediction model, demonstrate the validity and utility of the proposed PCA-based vocal entrainment measure.

Tue-Ses3-O1:
Language Identification

Time:Tuesday 16:00 Place:Auditorium - Pala Congressi Type:Oral
Chair:Philippe Boula de Mareüil

16:00Data-driven UBM Generation via Tied Gaussians for GMM-Supervector Based Accent Identification

Rong Zheng (Digital Content Technology Research Center, Institute of Automation Chinese Academy of Sciences, Beijing 100190, China)
Ce Zhang (Digital Content Technology Research Center, Institute of Automation Chinese Academy of Sciences, Beijing 100190, China)
Bo Xu (Digital Content Technology Research Center, Institute of Automation Chinese Academy of Sciences, Beijing 100190, China)

This paper presents a new approach to exploit data-driven universal background model (UBM) generation using tied Gaussians for accent identification (AID). The motivation of the proposed algorithm is to potentially utilize broad phonetic-specific accent characteristics by Gaussian mixture model (GMM) and examine data-driven phonetically-inspired UBM creation for GMM-supervector based accent classification. In this work, we discuss the issues involved in applying cumulative posterior probability based Gaussian selection and tree structure based UBM parameter estimation. Derivation and validation of the UBM refined by tied Gaussians are reported in this paper. Performance evaluations comparing our system with other well-known techniques for AID are also provided. Better performance is further achieved by fusing these acoustic-based accent classifiers. Comparison experiments conducted on the CSLU foreign-accented English (FAE) dataset show the effectiveness of the proposed method.

16:20I3A Language Recognition System for Albayzin 2010 LRE

David Martínez (University of Zaragoza)
Jesús Villalba (University of Zaragoza)
Antonio Miguel (University of Zaragoza)
Alfonso Ortega (University of Zaragoza)
Eduardo Lleida (University of Zaragoza)

This paper describes the two systems submitted to the Albayzin 2010 Language Recognition Evaluation by I3A. This evaluation is similar to the one organized by NIST every 2 years, but the languages to be recognized are those spoken in the Iberian peninsula (Spanish, Catalan, Basque, Galician and Portuguese) plus English. Both submissions are a fusion of five phonotactic and three acoustic subsystems. The only difference between them is the normalization and fusion of the scores. State-of-the-art methods for Language Recognition are adapted to and investigated on the KALAKA-2 database. Our primary system was ranked in the first position of the evaluation.

16:40Dimensionality Reduction for Using High-Order n-grams in SVM-Based Phonotactic Language Recognition

Mikel Penagarikano (University of the Basque Country)
Amparo Varona (University of the Basque Country)
Luis Javier Rodriguez-Fuentes (University of the Basque Country)
German Bordel (University of the Basque Country)

SVM-based phonotactic language recognition is state-of-the-art technology. However, due to computational bounds, phonotactic information is usually limited to low-order phone n-grams (up to n=3). In a previous work, we proposed a feature selection algorithm, based on n-gram frequencies, which allowed us to work successfully with high-order n-grams. In this work, we use two feature projection methods for dimensionality reduction in feature spaces including up to 4-grams: Principal Component Analysis (PCA) and Random Projection. These methods allow us to attain competitive performance even for small feature sets. Experiments were carried out on the NIST 2009 LRE database. The best performance (1.93% EER) was attained by using the feature selection algorithm to get around 11500 features. When considering smaller sets of features, PCA provided the best performance. A 500-dimensional PCA feature projection yielded 2.15% EER, meaning a 25% improvement with regard to using feature selection.

17:00Language Recognition via Ivectors and Dimensionality Reduction

Najim Dehak (MIT - CSAIL)
Pedro A. Torres Carrasquillo (MIT Lincoln Lab)
Douglas Reynolds (MIT Lincoln Lab)
Reda Dehak (LRDE- EPITA)

In this paper, a new language identification system is presented based on the total variability approach previously developed in the field of speaker identification. Various techniques are employed to extract the most salient features in the lower dimensional i-vector space and the system developed results in excellent performance on the 2009 LRE evaluation set without the need for any post-processing or backend techniques. Additional performance gains are observed when the system is combined with other acoustic systems.

17:20Language Recognition in iVectors Space

David Martínez (University of Zaragoza)
Oldrich Plchot (Brno University of Technology)
Lukas Burget (Brno University of Technology)
Ondrej Glembek (Brno University of Technology)
Pavel Matejka (Brno University of Technology)

The concept of so-called iVectors, where each utterance is represented by a fixed-length low-dimensional feature vector, has recently become very successful in speaker verification. In this work, we apply the same idea in the context of Language Recognition (LRE). To recognize language in the iVector space, we experiment with three different linear classifiers: one based on a generative model, where classes are modeled by Gaussian distributions with a shared covariance matrix, and two discriminative classifiers, namely linear Support Vector Machine and Logistic Regression. The tests were performed on the NIST LRE 2009 dataset and the results were compared with state-of-the-art LRE based on Joint Factor Analysis (JFA). While the iVector system offers better performance, it also seems to be complementary to JFA, as their fusion shows another improvement.

Tue-Ses3-O3:
ASR - Search, Keyword Spotting and Confidence Measures II

Time:Tuesday 16:00 Place:Brunelleschi (Green Room) - Pala Congressi - 2nd Floor Type:Oral
Chair:Geoffrey Zweig

16:00A Template Based Voice Trigger System Using Bhattacharyya Edit Distance

Evelyn Kurniawati (STMicroelectronics Asia Pacific, Pte. Ltd.)
Samsudin Ng (STMicroelectronics Asia Pacific, Pte. Ltd.)
Karthik Muralidhar (STMicroelectronics Asia Pacific, Pte. Ltd.)
Sapna George (STMicroelectronics Asia Pacific, Pte. Ltd.)

Dynamic Time Warping (DTW) is frequently used in isolated word recognition systems due to its simplicity and robustness to noise. However, the computational effort required by a DTW-based solution is proportional to the number of words registered in the system. Vector Quantization (VQ) is employed to alleviate this by converting the spoken input to a sequence of discrete symbols to be matched with the stored word template. In this paper, we propose the use of Bhattacharyya distance as the cost function for this pattern matching problem. The template used is a string of discrete symbols, each modeled by a Gaussian Mixture Model (GMM) representing a context-dependent sub-word unit. The system is tested on a 100-template matching task from two registrations of 50 cable TV channel names to simulate voice-triggered remote control. An average of 92% accuracy is obtained. A scheme is also proposed to enable a guest user without registration data to use the system efficiently.
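For readers unfamiliar with the Bhattacharyya distance, the sketch below computes its closed form for two single Gaussians with diagonal covariance. This is only an assumption-laden illustration: the abstract describes symbols modeled by GMMs, for which no closed form exists, and how the authors handle that is not stated here. The function name is hypothetical.

```python
import math


def bhattacharyya_diag(mean1, var1, mean2, var2):
    """Bhattacharyya distance between two diagonal-covariance Gaussians.

    Each argument is a sequence of per-dimension means or variances.
    """
    dist = 0.0
    for m1, v1, m2, v2 in zip(mean1, var1, mean2, var2):
        v = 0.5 * (v1 + v2)  # averaged variance in this dimension
        # Mean-separation term.
        dist += 0.125 * (m1 - m2) ** 2 / v
        # Covariance-mismatch term.
        dist += 0.5 * math.log(v / math.sqrt(v1 * v2))
    return dist


# Two unit-variance Gaussians whose means differ by 2 in one dimension.
d = bhattacharyya_diag([0.0], [1.0], [2.0], [1.0])
```

The distance is zero for identical Gaussians and symmetric in its arguments, which makes it usable as a substitution cost between template symbols.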

16:20Acoustic Look-Ahead for More Efficient Decoding in LVCSR

David Nolden (RWTH Aachen)
Ralf Schlüter (RWTH Aachen)
Hermann Ney (RWTH Aachen)

In this paper we propose novel approximations of a generalized acoustic look-ahead to speed up the search process in large vocabulary continuous speech recognition (LVCSR). Unlike earlier methods, we do not employ any phoneme- or syllable-level heuristics. First we define and analyze the perfect acoustic look-ahead as a simple pre-evaluation of the original acoustic models into the future. This method is very slow, but reveals the best possible impact on the search space that can be achieved through acoustic look-ahead. In a second step, we derive efficient and simple approximative look-ahead models from the perfect models. We show that the approximative models compare well to the perfect models regarding the search space, and that the approximative models significantly improve the efficiency in comparison to the baseline, without any negative effect on the precision.

16:40A new Epsilon Filter for Efficient Composition of Weighted Finite-State Transducers

Frank Duckhorn (Institute of Acoustics and Speech Communication, Technical University Dresden, Germany)
Matthias Wolff (Institute of Acoustics and Speech Communication, Technical University Dresden, Germany)
Rüdiger Hoffmann (Institute of Acoustics and Speech Communication, Technical University Dresden, Germany)

In this paper we propose a new composition algorithm for weighted finite-state transducers, which are increasingly used in speech and pattern recognition applications. Composition joins multiple transducers into one. We have implemented an embedded speech-based dialog system for steering applications. Regular grammars are very useful for this purpose, but they may grow considerably through determinization. Composition using the sequential or the matching epsilon-filter does not perform optimally without determinization. Our new algorithm combines the advantages of these two epsilon-filters for size reduction, so composition and decoding time can be saved. It can be applied to many current algorithms, including on-the-fly ones.

17:00A Bottom-Up Stepwise Knowledge-Integration Approach to Large Vocabulary Continuous Speech Recognition Using Weighted Finite State Machines

Sabato Marco Siniscalchi (Kore University of Enna)
Torbjorn Svendsen (Norwegian University of Science and Technology (NTNU))
Chin-Hui Lee (Georgia Institute of Technology)

A bottom-up, stepwise, knowledge integration framework is proposed to realize detection-based, large vocabulary continuous speech recognition (LVCSR) with a weighted finite state machine (WFSM). The WFSM framework offers a flexible architecture for different types of knowledge network compositions, each of which can be built and optimized independently. Speech attribute detectors are used as an intermediate block to obtain phoneme posterior probabilities over which a phoneme recognition network is designed. Lexical access and syntax knowledge integration over this phoneme network are then performed to deliver the decoded sentences. Experimental evidence illustrates that the proposed system outperforms several hybrid HMM/ANN systems with different configurations on the Wall Street Journal task while it is competitive with conventional LVCSR technology.

17:20Combining Information Sources for Confidence Estimation with CRF Models

Matthew Stephen Seigel (Cambridge University Engineering Department)
Philip Woodland (Cambridge University Engineering Department)

Obtaining accurate confidence measures for automatic speech recognition (ASR) transcriptions is an important task which stands to benefit from the use of multiple information sources. This paper investigates the application of conditional random field (CRF) models as a principled technique for combining multiple features from such sources. A novel method for combining suitably defined features is presented, allowing for confidence annotation using lattice-based features of hypotheses other than the lattice 1-best. The resulting framework is applied to different stages of a state-of-the-art large vocabulary speech recognition pipeline, and consistent improvements are shown over a sophisticated baseline system.

17:40Evaluation of Fast Spoken Term Detection Using a Suffix Array

Kouichi Katsurada (Toyohashi University of Technology)
Shinta Sawada (Toyohashi University of Technology)
Shigeki Teshima (Toyohashi University of Technology)
Yurie Iribe (Toyohashi University of Technology)
Tsuneo Nitta (Toyohashi University of Technology)

We previously proposed [1] fast spoken term detection that uses a suffix array as a data structure for searching large-scale speech documents. In this method, a keyword is divided into sub-keywords, and two of them are searched for in the speech database. Although the search is executed very quickly on a 10,000-h speech database, we only proposed a variety of matching procedures in [1]. In this paper, we compare different varieties of matching procedures in which the number of sub-keywords to be searched for and the number of phonemes in a sub-keyword are different. We also compare the performance and the process time of our method with typical spoken term detection using an inverted index.
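The basic idea of searching with a suffix array can be sketched as follows: all suffixes of a phoneme sequence are sorted once, after which every occurrence of a sub-keyword is found by binary search. This is a minimal exact-match sketch with a naive construction; the approximate matching and sub-keyword combination described in the abstract are not reproduced, and the function names and toy phoneme string are hypothetical.

```python
def build_suffix_array(seq):
    """Return start positions of all suffixes of seq in sorted order.

    Naive construction, adequate for illustration only.
    """
    return sorted(range(len(seq)), key=lambda i: seq[i:])


def find_occurrences(seq, suffix_array, pattern):
    """Return sorted start positions where pattern occurs exactly in seq."""
    m = len(pattern)

    def prefix(k):
        start = suffix_array[k]
        return seq[start:start + m]

    # Lower bound: first suffix whose prefix is >= pattern.
    lo, hi = 0, len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) < pattern:
            lo = mid + 1
        else:
            hi = mid
    first = lo
    # Upper bound: first suffix whose prefix is > pattern.
    hi = len(suffix_array)
    while lo < hi:
        mid = (lo + hi) // 2
        if prefix(mid) <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[first:lo])


# Toy phoneme sequence (hypothetical), stored as a tuple of symbols.
phonemes = tuple("k a t a k a n a".split())
sa = build_suffix_array(phonemes)
hits = find_occurrences(phonemes, sa, ("k", "a"))
```

Each lookup costs a logarithmic number of comparisons in the length of the indexed sequence, which is what makes the structure attractive for very large speech databases.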

Tue-Ses3-O2:
Second Language Acquisition, Development and Learning II

Time:Tuesday 16:00 Place:Leonardo - Pala Affari - Ground Floor Type:Oral
Chair:Robert Fox

16:00On Mispronunciation Lexicon Generation using Joint-sequence Multigrams in Computer-Aided Pronunciation Training

Xiaojun Qian (Department of Systems Engineering and Engineering Management, the Chinese University of Hong Kong)
Helen Meng (Department of Systems Engineering and Engineering Management, the Chinese University of Hong Kong)
Frank Soong (Speech Group, Microsoft Research Asia)

We investigate the use of joint-sequence multigrams to generate L2 mispronunciation lexicons for mispronunciation detection and diagnosis. In the joint-sequence framework, a pair of parallel strings (namely, the input string of either graphemes or phonemes of the canonical pronunciation and the phonetic string of the mispronunciation) are aligned to form joint units for probabilistic estimation. We compare results on lexicons produced by phoneme-to-mispronunciation conversion and those by grapheme-to-mispronunciation conversion. Results reflect the hypothesized advantage (1.1% reduction in expected miss rate) in unifying phonetic confusion due to L1 negative transfer with those due to grapheme-to-phoneme errors. The impact of mispronunciation by mis-use of analogy is also studied. Recognition results show the benefit of a lexicon with proper priors.

16:20Validating a second language perception model for classroom context. A longitudinal study within the Perceptual Assimilation Model

Bianca Sisinni (University of Salento)
Mirko Grimaldi (University of Salento)

The present study verified whether adult listeners retain the ability to improve non-native speech perception and if it can be significantly enhanced in the formal context, a very impoverished context with respect to the natural one. We tested (i) whether perceptual learning is possible for adults in a classroom context during focused phonetic lessons, and (ii) whether it follows the pattern predicted for natural acquisition by the PAM-L2 [1]. The results showed that adult listeners are still able to improve foreign sound perception and this ability seems to occur also in formal contexts in line with the PAM-L2 predictions.

16:40The role of variability in non-native perceptual learning of a Japanese geminate-singleton fricative contrast

Makiko Sadakata (Donders Institute for Brain, Cognition and Behaviour, Centre for Cognition, Radboud University Nijmegen)
James M. McQueen (Behavioural Science Institute, Donders Institute for Brain, Cognition and Behaviour, Centre for Cognition, Radboud University Nijmegen, Max Planck Institute for Psycholinguistics)

The current study reports the enhancing effect of a high variability training procedure in the learning of a Japanese geminate-singleton fricative contrast. Dutch natives took part in a five-day training procedure in which they identified geminate and singleton variants of the Japanese fricative /s/. They heard either many repetitions of a limited set of words recorded by a single speaker (simple training) or fewer repetitions of a more variable set of words recorded by multiple speakers (variable training). Pre-post identification evaluations and a transfer test indicated clear benefits of the variable training.

17:00Fluency Changes with General Progress in L2 Proficiency

Jared Bernstein (Knowledge Technologies, Pearson)
Jian Cheng (Knowledge Technologies, Pearson)
Masanori Suzuki (Knowledge Technologies, Pearson)

Second language learners tend to speak slower at every level of linguistic analysis, often in an uneven tempo, with longer pauses at the start and before some words and constructions, than is typical of native speech. As noted by Zhang & Elder, native listeners focus on phonological fluency in making judgments about L2 proficiency. Improved understanding of how fluency grows with progress in overall oral proficiency may lead to measures of fluency that would be useful for measuring proficiency itself. Spontaneous speech sampled from populations of L2 speakers of English and Spanish showed orderly, seemingly linear increments in the rates at which words and larger constituents are spoken as a function of human-judged general proficiency level. Results suggest that unit/time fluency measures match native expert perception of oral proficiency, supporting the hypothesis that performance-in-time is a core attribute of speaking proficiency and efficient spoken communication.

17:20Tongue Gestures Awareness and Pronunciation Training

Slim Ouni (University Nancy 2 - LORIA)

Pronunciation training based on speech production techniques illustrating tongue movements is gaining popularity. However, there is not sufficient evidence that learners can imitate some tongue animation. In this paper, we investigated human awareness of controlling their tongue body gestures. In a first experiment, participants were asked to perform some tongue movements. This task was evaluated by observing ultrasound imaging of the tongue recorded during the experiment. In a second experiment, a short session of training was added where participants can observe ultrasound imaging in realtime of their own tongue movements. The goal was to increase their awareness of their tongue gestures. A pre-test and post-test were carried out without any feedback. The results suggest that it is not easy to finely control tongue body gestures; and that we gain in performance after a short training session which suggests that providing visual feedback improves tongue gesture awareness.

17:40Impact of speaker variability on speech perception in non-native listeners

Wim A. van Dommelen (Department of Language and Communication Studies, NTNU)
Valerie Hazan (Department of Speech, Hearing and Phonetic Sciences, UCL)

This study investigates the perception of English words produced by 45 native talkers presented in moderate noise to native Norwegian listeners. The relative intelligibility of individual talkers is compared with that obtained for native listeners in order to determine whether inherent talker clarity is determined by global acoustic-phonetic characteristics. Talker intelligibility was strongly correlated across native and non-native listeners; there was also strong correlation across groups as to the lexical items most often misperceived. Word intelligibility for both was correlated with certain acoustic-phonetic characteristics of the talker’s productions, including amount of energy in the mid-frequency region and mean word duration.

Tue-Ses3-O4:
SLP for Information Extraction and Retrieval I

Time:Tuesday 16:00 Place:Michelangelo - Pala Affari - 2nd Floor Type:Oral
Chair:Pascale Fung

16:00Latent Topic Modeling for Audio Corpus Summarization

Timothy J. Hazen (MIT Lincoln Laboratory)

This work presents techniques for automatically summarizing the topical content of an audio corpus. Probabilistic latent semantic analysis (PLSA) is used to learn a set of latent topics in an unsupervised fashion. These latent topics are ranked by their relative importance in the corpus and a summary of each topic is generated from signature words that aptly describe the content of that topic. This paper presents techniques for producing a high quality summarization. An example summarization of conversational data from the Fisher corpus that demonstrates the effectiveness of our approach is presented and evaluated.

16:20Investigation of Spontaneous Speech Characterization Applied to Speaker Role Recognition

Richard Dufour (LIUM - University of Le Mans)
Yannick Estève (LIUM - University of Le Mans)
Paul Deléglise (LIUM - University of Le Mans)

Extracting information from large data is a challenging task. In this paper, we investigate the link between speech spontaneity levels and speaker roles, and the relevance of using automatic spontaneous speech characterization as a speaker role identification feature. Applying this automatic spontaneous speech characterization system to a broadcast news corpus containing ten manually labeled speaker roles allowed us to highlight this relationship. We therefore propose to directly apply the spontaneous speech characterization approach in order to automatically recognize speaker roles. Experimental results show that characteristics used to detect speech spontaneity could be very useful for recognizing speaker roles, as we reached an overall classification precision of 74.4%.

16:40Zero-resource audio-only spoken term detection based on a combination of template matching techniques

Armando Muscariello (IRISA-INRIA Rennes Bretagne Atlantique)
Guillaume Gravier (IRISA-INRIA Rennes Bretagne Atlantique)
Frédéric Bimbot (IRISA-INRIA Rennes Bretagne Atlantique)

Spoken term detection is a well-known information retrieval task that seeks to extract contentful information from audio by locating occurrences of known query words of interest. This paper describes a zero-resource approach to this task based on pattern matching of spoken term queries at the acoustic level. The template matching module comprises the cascade of a segmental variant of dynamic time warping and a self-similarity matrix comparison to further improve robustness to speech variability. This solution notably differs from more traditional train and test methods that, while shown to be very accurate, rely upon the availability of large amounts of linguistic resources. We evaluate our framework on different parameterizations of the speech templates: raw MFCC features and Gaussian posteriorgrams, French and English phonetic posteriorgrams output by two different state-of-the-art phoneme recognizers.
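The template matching above builds on dynamic time warping. The sketch below shows the standard whole-sequence DTW recursion on scalar features, as a point of reference only: the segmental variant and the self-similarity matrix comparison used in the paper are not reproduced, and the function name and the default local distance are assumptions.

```python
def dtw_distance(a, b, dist=lambda x, y: abs(x - y)):
    """Standard DTW cost between sequences a and b.

    cost[i][j] holds the cheapest alignment of a[:i] with b[:j].
    """
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = dist(a[i - 1], b[j - 1])
            # Extend the best of: insertion, deletion, or match.
            cost[i][j] = local + min(cost[i - 1][j],
                                     cost[i][j - 1],
                                     cost[i - 1][j - 1])
    return cost[n][m]


# A time-stretched copy of the same contour aligns at zero cost.
stretched = dtw_distance([1, 2, 3], [1, 2, 2, 3])
shifted = dtw_distance([1, 2, 3], [2, 3, 4])
```

With real speech, each element would be a feature vector (for example an MFCC frame or a posteriorgram) and `dist` a suitable vector distance.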

17:00Automatic Learning in Content Indexing Service using Phonetic Alignment

Yeon-Jun Kim (AT&T Labs-Research)
Dave C. Gibbon (AT&T Labs-Research)

Content indexing has become necessary, not just optional, in the era where broadcast, cable and Internet produce huge amounts of media daily. Text information from spoken audio is still a key feature to understand content along with other meta-data and video features. In this paper, a new method is introduced to improve transcription quality, which allows more accurate content indexing. Our method finds phonetic similarities between two imperfect sources, closed captions and ASR outputs, and aligns them together to make quality transcriptions. In the process, even out-of-vocabulary words could be learned automatically. Given broadcast news audio and closed captions, our experimental results show that the proposed method, on average, improves word correct rates by 11% over the ASR output using the baseline language model and by 6% over the one using the adapted language model.

17:20Leveraging Relevance Cues for Improved Spoken Document Retrieval

Pei-Ning Chen (National Taiwan Normal University)
Kuan-Yu Chen (Institute of Information Science, Academia Sinica)
Berlin Chen (National Taiwan Normal University)

Spoken document retrieval (SDR) has emerged as an active area of research in the speech processing community. The fundamental problems facing SDR are generally three-fold: 1) a query is often only a vague expression of an underlying information need, 2) there is likely to be a word usage mismatch between a query and a spoken document even if they are topically related to each other, and 3) the imperfect speech recognition transcript carries wrong information and thus deviates somewhat from representing the true theme of a spoken document. To mitigate the above problems, in this paper, we study a novel use of a relevance language modeling framework for SDR. It not only inherits the merits of several existing techniques but also provides a flexible but systematic way to render the lexical and topical relationships between a query and a spoken document. Experiments conducted on the TDT SDR task show promise of the methods deduced from our retrieval framework.

17:40Spoken Lecture Summarization by Random Walk over a Graph Constructed with Automatically Extracted Key Terms

Yun-Nung Chen (Graduate Institute of Computer Science and Information Engineering, National Taiwan University, Taiwan)
Yu Huang (Graduate Institute of Computer Science and Information Engineering, National Taiwan University, Taiwan)
Ching-Feng Yeh (Graduate Institute of Communication Engineering, National Taiwan University, Taiwan)
Lin-Shan Lee (Graduate Institute of Computer Science and Information Engineering, National Taiwan University, Taiwan)

This paper proposes an improved approach for spoken lecture summarization, in which random walk is performed on a graph constructed with automatically extracted key terms and probabilistic latent semantic analysis (PLSA). Each sentence of the document is represented as a node of the graph and the edge between two nodes is weighted by the topical similarity between the two sentences. The basic idea is that sentences topically similar to more important sentences should be more important. In this way all sentences in the document can be jointly considered more globally rather than individually. Experimental results showed significant improvement in terms of ROUGE evaluation.
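The random walk described above can be sketched as a PageRank-style power iteration over a sentence similarity graph. In this sketch the similarity matrix and the prior importance scores are simply given as inputs; how the paper derives them from key terms and PLSA is not reproduced, and the function name, damping value, and toy numbers are assumptions.

```python
def random_walk_scores(similarity, prior, damping=0.85, iters=100):
    """Score graph nodes (sentences) by a damped random walk.

    similarity -- square matrix of non-negative edge weights
    prior      -- initial importance score per sentence
    """
    n = len(similarity)
    # Row-normalize edge weights into transition probabilities;
    # a node with no outgoing weight jumps uniformly.
    transition = []
    for row in similarity:
        total = sum(row)
        if total > 0:
            transition.append([w / total for w in row])
        else:
            transition.append([1.0 / n] * n)
    prior_sum = sum(prior)
    prior = [p / prior_sum for p in prior]
    scores = list(prior)
    for _ in range(iters):
        # Mix the prior with importance propagated from neighbours.
        scores = [
            (1.0 - damping) * prior[i]
            + damping * sum(scores[j] * transition[j][i] for j in range(n))
            for i in range(n)
        ]
    return scores


# Toy graph (hypothetical): sentence 0 is similar to both other sentences.
sim = [[0.0, 0.9, 0.8],
       [0.9, 0.0, 0.1],
       [0.8, 0.1, 0.0]]
scores = random_walk_scores(sim, prior=[1.0, 1.0, 1.0])
ranking = sorted(range(len(scores)), key=lambda i: -scores[i])
```

A summary would then be formed from the top-ranked sentences; sentence 0 ranks first here because it is strongly connected to the others.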

Tue-Ses3-S1-O:
Speech and Audio Processing for Human-Robot Interaction I

Time:Tuesday 16:00 Place:Caravaggio (Adua 1) - Pala Affari - 1st Floor Type:Oral
Chair:Laurence Devillers

16:00Using Prominence Detection to Generate Acoustic Feedback in Tutoring Situations

Lars Schillingmann (Applied Informatics Group, Faculty of Technology, Bielefeld University, Germany)
Petra Wagner (Faculty of Linguistics and Literature, Bielefeld University, Germany)
Christian Munier (Applied Informatics Group, Faculty of Technology, Bielefeld University, Germany)
Britta Wrede (Applied Informatics Group, Faculty of Technology, Bielefeld University, Germany)
Katharina Rohlfing (Emergentist Semantics Group, CITEC, Bielefeld University, Germany)

Robots interacting with humans need to understand actions and make use of language in social interactions. Research on infant development has shown that language helps the learner to structure visual observations of action. This acoustic information typically in the form of narration overlaps with action sequences and provides infants with a bottom-up guide to find structure within them. This concept has been introduced as acoustic packaging by Hirsh-Pasek and Golinkoff. We developed and integrated a prominence detection module in our acoustic packaging system to detect semantically relevant information linguistically highlighted by the tutor. Evaluation results on speech data from adult-infant interactions show a significant agreement with human raters. Furthermore a first approach based on acoustic packages which uses the prominence detection results to generate acoustic feedback is presented.

16:20Bayesian Extension of MUSIC for Sound Source Localization and Tracking

Takuma Otsuka (Graduate School of Informatics, Kyoto University)
Kazuhiro Nakadai (Honda Research Institute Japan, Co., Ltd.)
Tetsuya Ogata (Graduate School of Informatics, Kyoto University)
Hiroshi G. Okuno (Graduate School of Informatics, Kyoto University)

This paper presents a Bayesian extension of a MUSIC-based sound source localization (SSL) and tracking method. SSL is important for distant speech enhancement and simultaneous speech separation for successful speech recognition, as well as for auditory scene analysis by mobile robots. One of the drawbacks of existing SSL methods is the necessity of careful parameter tuning, e.g., the sound source detection threshold depending on the reverberation time and the number of sources. Our contribution consists of (1) automatic parameter estimation in the variational Bayesian framework and (2) tracking of sound sources with reliability. Experimental results demonstrate that our method robustly tracks multiple sound sources in a reverberant environment with RT20 = 800 (ms).

16:40Speech-based Non-prototypical Affect Recognition for Child-Robot Interaction in Reverberated Environments

Martin Woellmer (Technische Universitaet Muenchen)
Felix Weninger (Technische Universitaet Muenchen)
Bjoern Schuller (Technische Universitaet Muenchen)

We present a study on the effect of reverberation on acoustic-linguistic recognition of non-prototypical emotions during child-robot interaction. Investigating the well-defined Interspeech 2009 Emotion Challenge task of recognizing negative emotions in children's speech, we focus on the impact of artificial and real reverberation conditions on the quality of linguistic features and on emotion recognition accuracy. To maintain acceptable recognition performance for both spoken content and affective state, we consider matched and multi-condition training and apply our novel multi-stream automatic speech recognition system, which outperforms conventional Hidden Markov Modeling. Depending on the acoustic condition, we obtain unweighted emotion recognition accuracies of between 65.4% and 70.3% applying our multi-stream system in combination with the SimpleLogistic algorithm for joint acoustic-linguistic analysis.
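Unweighted accuracy is commonly computed as the mean of the per-class recalls, so that a rare class counts as much as a frequent one. The sketch below assumes that definition; the function name, labels, and toy data are hypothetical.

```python
from collections import defaultdict


def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls over all classes present in y_true."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, guess in zip(y_true, y_pred):
        total[truth] += 1
        if truth == guess:
            correct[truth] += 1
    return sum(correct[c] / total[c] for c in total) / len(total)


# Imbalanced toy data: a classifier that always answers "idle".
truth = ["negative"] * 2 + ["idle"] * 8
guess = ["idle"] * 10
ua = unweighted_accuracy(truth, guess)                      # per-class mean
wa = sum(t == g for t, g in zip(truth, guess)) / len(truth)  # plain accuracy
```

In this toy case the plain accuracy is 0.8 while the unweighted accuracy is only 0.5, which is why the unweighted figure is preferred for imbalanced emotion classes.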

Tue-Ses3-P1:
Voice Activity Detection

Time:Tuesday 16:00 Place:Valfonda 1 - Pala Congressi (Passi Perduti-Gallery) Type:Poster
Chair:Abeer Alwan

#1Voice activity detection in MTF-based power envelope restoration

Masashi Unoki (Japan Advanced Institute of Science and Technology)
Xugang Lu (National Institute of Information and Communications Technology)
Rico Petrick (Dresden University of Technology)
Shota Morita (Japan Advanced Institute of Science and Technology)
Masato Akagi (Japan Advanced Institute of Science and Technology)
Ruediger Hoffmann (Dresden University of Technology)

This paper reports comparative evaluations of conventional voice activity detection (VAD) methods in both artificial and realistic reverberant environments. Both conventional and standard (G.729) methods are discussed. In general, these methods work well under clean conditions but their performance is drastically affected by reverberation. We therefore developed a method using MTF-based power envelope restoration to improve the robustness of VAD in reverberant environments. Experimental results demonstrated that the proposed method is superior to conventional methods with regard to robustness and providing accurate VAD (reducing both the false acceptance rate and false rejection rate) in both artificial and actual reverberant environments.
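The two error rates named in this abstract can be computed from frame-level labels. The following is a minimal sketch, not the authors' evaluation code: the function name `far_frr`, the 0/1 frame labelling, and the toy label sequences are assumptions made for illustration, and published definitions of FAR and FRR can differ in detail.

```python
def far_frr(reference, decisions):
    """Frame-level false acceptance and false rejection rates.

    reference, decisions: equal-length sequences of 0 (non-speech) or 1 (speech).
    FAR = non-speech frames labelled as speech / all non-speech frames.
    FRR = speech frames labelled as non-speech / all speech frames.
    """
    if len(reference) != len(decisions):
        raise ValueError("sequences must have equal length")
    pairs = list(zip(reference, decisions))
    false_accepts = sum(1 for r, d in pairs if r == 0 and d == 1)
    false_rejects = sum(1 for r, d in pairs if r == 1 and d == 0)
    n_nonspeech = sum(1 for r in reference if r == 0)
    n_speech = len(reference) - n_nonspeech
    # Guard against division by zero when one class is absent.
    far = false_accepts / n_nonspeech if n_nonspeech else 0.0
    frr = false_rejects / n_speech if n_speech else 0.0
    return far, frr


# Hypothetical toy labels: 4 non-speech frames and 4 speech frames.
reference = [0, 0, 1, 1, 1, 0, 0, 1]
decisions = [0, 1, 1, 1, 0, 0, 0, 1]
far, frr = far_frr(reference, decisions)
print(far, frr)
```

Reporting both rates together matters because a detector can trivially lower one by raising the other.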

#2Using Spectral Fluctuation of Speech in multi-feature HMM-based voice activity detection

Miquel Espi (Graduate School of Information Science and Technology, The University of Tokyo, Japan)
Shigeki Miyabe (Graduate School of Information Science and Technology, The University of Tokyo, Japan)
Takuya Nishimoto (Graduate School of Information Science and Technology, The University of Tokyo, Japan)
Nobutaka Ono (Graduate School of Information Science and Technology, The University of Tokyo, Japan)
Shigeki Sagayama (Graduate School of Information Science and Technology, The University of Tokyo, Japan)

Observation of the speech spectrum shows that speech has a specific spectral fluctuation pattern along both time and frequency. In this paper, we exploit this property in a multi-feature approach for voice activity detection. The effect of separating such specific spectral fluctuation using multi-stage HPSS (Harmonic-Percussive Sound Separation) has been analyzed over conventional features in voice activity detection, reducing frame-wise detection error by up to 78%, depending on the SNR conditions and noise type. The multi-feature approach has been tested using Hidden Markov Models to model the feature stream as a sequence, and has outperformed standard and similar VAD proposals in utterance-based tests intended for automatic speech recognition.

#3Linear Dynamic Models for Voice Activity Detection

Kannu Mehta (Department of Electrical Engineering, IIT Roorkee, India)
Chau Khoa Pham (School of Computer Engineering, Nanyang Technological University, Singapore)
Eng Siong Chng (School of Computer Engineering, Nanyang Technological University, Singapore)

In this paper, we propose a robust voice activity detection method based on the long-term stationarity (LTS) of the speech signal. The approach is motivated by the fact that noise, in the time domain, is more stationary than speech. We describe the use of linear dynamic models (LDMs) as a means of calculating the long-term stationarity of the signal and propose a voice activity detector that compares the degree of stationarity at different times in the signal. We evaluate the proposed approach in the presence of five types of noise at various SNR levels. Comparison with G.729-Annex B, order statistics filters (OSF) VAD, windowed autocorrelation lag energy (WALE), and autocorrelation zero-crossing rate (AZR) schemes demonstrates that the accuracy of the LTS-based VAD scheme, averaged over all noises and all SNRs, is 3.94% better than that obtained by the best among the considered VAD schemes.

#4Detection of Shouted Speech in the Presence of Ambient Noise

Jouni Pohjalainen (Department of Signal Processing and Acoustics, Aalto University, Finland)
Tuomo Raitio (Department of Signal Processing and Acoustics, Aalto University, Finland)
Paavo Alku (Department of Signal Processing and Acoustics, Aalto University, Finland)

This study focuses on the detection of shouted speech in realistic noisy conditions. An automatic system based on modified mel frequency cepstral coefficient (MFCC) feature extraction and Gaussian mixture model (GMM) classification is developed. The performance of the automatic system is compared against human perception measured by a listening test. At moderate noise levels, the automatic system outperforms humans. In severe conditions, classification by humans is clearly better.

#5Breath-detection-based Telephony Speech Phrasing

Takashi Fukuda (IBM Research - Tokyo, IBM Japan Ltd.)
Osamu Ichikawa (IBM Research - Tokyo, IBM Japan Ltd.)
Masafumi Nishimura (IBM Research - Tokyo, IBM Japan Ltd.)

In ASR technology for call center conversations, the system usually divides an input signal into separate utterances and eliminates the unneeded silence parts of the signal before doing ASR processing on the detected utterances. This means the input signal should be split into utterances of the proper length for ASR performance. However, typical VAD techniques sometimes generate overly long speech segments because they focus only on the length of non-speech between sentences. In contrast, it is shown that speakers typically take breaths when speaking more than one sentence or long sentences. These breaths are highly correlated with the major prosodic breaks. In this paper, we focus on the breath events in the pause intervals and attempt to split the input signal into utterances by detecting the breathing events. The proposed method significantly improved performance for both breath detection and ASR.

#6Multi-channel voice activity detection based on conic constraints

Gibak Kim (Daegu University, Korea)

Unlike single-microphone techniques for voice activity detection (VAD), multi-microphone signal processing usually exploits the spatial information of signals received at multiple microphones. In this paper, we propose a VAD algorithm based on conic constraints to achieve robustness against direction of arrival (DOA) estimation error. The proposed algorithm uses the phase vector as a feature and detects the presence of the target speech by comparing the angles between the phase vector of the multi-microphone input signal and two mean phase vectors, one for the target speech+interference period and one for the interference-only period. The proposed algorithm was tested with simulation data generated from measured impulse responses for seven uniformly distributed microphones. The simulation results showed that the proposed algorithm provides a reliable VAD metric in the presence of competing speech. The results also supported the robustness of the proposed algorithm against DOA estimation error.
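The angle comparison described in this abstract can be illustrated with a small sketch. It is only one plausible reading and not the paper's algorithm: the decision rule (choose the smaller angle), the use of the inner-product magnitude (which ignores a common phase offset), and all names and phase values below are assumptions.

```python
import cmath
import math


def phase_vector(phases):
    """Unit-modulus complex vector built from per-microphone phases (radians)."""
    return [cmath.exp(1j * p) for p in phases]


def angle_between(u, v):
    """Angle in radians between two complex vectors, from the magnitude of their inner product."""
    inner = sum(a * b.conjugate() for a, b in zip(u, v))
    norm_u = math.sqrt(sum(abs(a) ** 2 for a in u))
    norm_v = math.sqrt(sum(abs(b) ** 2 for b in v))
    # Clamp to 1.0 so rounding error cannot push acos out of its domain.
    cosine = min(1.0, abs(inner) / (norm_u * norm_v))
    return math.acos(cosine)


def target_present(observed, mean_target_plus_interf, mean_interf_only):
    """Assumed rule: declare target speech when the observation lies closer
    in angle to the target+interference mean than to the interference-only mean."""
    return (angle_between(observed, mean_target_plus_interf)
            < angle_between(observed, mean_interf_only))


# Hypothetical mean phase vectors for a three-microphone array.
mean_target = phase_vector([0.0, 0.5, 1.0])
mean_interf = phase_vector([0.0, -0.8, -1.6])

near_target = phase_vector([0.0, 0.45, 1.1])
print(target_present(near_target, mean_target, mean_interf))
```

Because the comparison is made on angles rather than on an exact steering direction, a small error in the assumed target direction shifts both angles only slightly, which is consistent with the robustness the abstract reports.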

#7Multi-Sensor Voice Activity Detection based on Multiple Observation Hypothesis Testing

Theodoros Petsatodis (CTIF Aalborg University Denmark)
Fotios Talantzis (Imperial College London England)
Christos Boukis (Accenture Interactive, Greece)
Zheng-Hua Tan (Department of Electronic Systems, Aalborg University Denmark)
Ramjee Prasad (CTIF Aalborg University Denmark)

Voice Activity Detection (VAD) in acoustic environments remains a challenging task due to potentially adverse noise and reverberation conditions. The problem becomes even more difficult when the microphones used to detect speech reside far from the speaker. An unsupervised VAD scheme is presented in this paper. The system is based on processing signals captured by multiple far-field sensors in order to integrate spatial information in addition to the frequency content available at a single channel recording. To decide upon the presence or absence of speech the system employs a modified multiple observation hypothesis that tests at each sensor the probability of having an active speaker and then fuses the decisions. To minimize mis-detections and enhance the performance of the hypothesis test a computationally efficient forgetting scheme is also employed. Simulations conducted in several artificial environments illustrate that significant improvements in performance can be expected from the proposed scheme when compared to systems of similar philosophy.

#8Online Speech Activity Detection in Broadcast News

Chao Gao (Raytheon BBN Technologies)
Guruprasad Saikumar (Raytheon BBN Technologies)
Saurabh Khanwalkar (Raytheon BBN Technologies)
Avi Herscovici (Raytheon BBN Technologies)
Anoop Kumar (Raytheon BBN Technologies)
Amit Srivastava (Raytheon BBN Technologies)
Premkumar Natarajan (Raytheon BBN Technologies)

In this paper, we investigate the important implications of real-time processing for the design of a speech activity detection (SAD) system. In particular, the impact of the unique constraints posed by an online automatic speech recognition system is studied. Our investigation is built on a real-life application of speech technology, the BBN Broadcast Monitoring System (BMS), which encapsulates a real-time automatic Rich Transcription system. We propose an adaptive segmentation method that is capable of variable-scale speech boundary detection in an online SAD system. We evaluate how different choices of boundary detection granularity impact the performance of speech-to-text (STT) and speaker diarization. In addition, the entangled interactions between STT and speaker diarization are evaluated and the mechanism for trading off the performance of these two system aspects is studied. In our experiment, the adaptive segmentation mechanism in the proposed SAD system reduces the error rates of STT and speaker diarization by 2.4% and 9.5% relative, respectively, compared to the baseline system.

#9A Real-Time Speech Command Detector for a Smart Control Room

Daniel Reich (Karlsruhe Institute of Technology (KIT))
Felix Putze (Karlsruhe Institute of Technology (KIT))
Dominic Heger (Karlsruhe Institute of Technology (KIT))
Joris Ijsselmuiden (Fraunhofer Institute of Optronics, System Technologies and Image Exploitation (IOSB))
Rainer Stiefelhagen (Fraunhofer Institute of Optronics, System Technologies and Image Exploitation (IOSB))
Tanja Schultz (Karlsruhe Institute of Technology (KIT))

In this work we present an always-on speech recognition system that discriminates spoken commands directed to the system from other spoken input. For discrimination we integrated various features ranging from prosodic cues and decoding features to linguistic information. The resulting “Speech Command Detector” provides intuitive hands-free user interaction in a Smart Control Room environment where voice commands are directed toward a large interactive display. Based on a recognition vocabulary of 259 words with more than 10k possible commands, the Speech Command Detector detected 88.3% of the commands correctly while maintaining a very low False Positive Rate of 1.5%. In a cross-domain setup the system was evaluated on a Star Trek episode. With only minor adjustments, our system achieved very promising results with a 91.2% command detection rate at a False Positive Rate of 1.8%.

#10Robust Voice Activity Detector for Real World Applications Using Harmonicity and Modulation frequency

Ekapol Chuangsuwanich (Massachusetts Institute of Technology)
James Glass (Massachusetts Institute of Technology)

The task of robustly detecting distant speech in low SNR environments for automatic speech recognition is examined using a two-stage approach based on two distinguishing features of speech, namely harmonicity and modulation frequency (MF). A modified metric for harmonicity is used as a gating function to a set of parallel classifiers that incorporate MFs computed on different frequency bands. Performance is evaluated on both the frame-level discriminative power and the system-level ASR results on a real-world robotic forklift task. Compared to other previously proposed features, the combined approach shows good generalization across different kinds of dynamic noise conditions, and obtains a significant improvement in the false alarm rate at low speech miss rate settings. The overall ASR results also improved significantly compared to the ETSI AMR-VAD2, while reducing the number of false alarms by a factor of two.

#11On Noise Robust Voice Activity detection

Tomas Dekens (Vrije Universiteit Brussel, ETRO-DSSP)
Werner Verhelst (Vrije Universiteit Brussel, ETRO-DSSP)

In this paper, we show that the performance of voice activity detection (VAD) algorithms can be highly dependent on the type of background noise, and we introduce a new VAD algorithm that is based on relative energy measurements in different frequency bands. The experimental results are compared to the results obtained with two other spectrum-based VADs, and it is concluded that a VAD configured to use around 3 frequency bands can cope best with a large variety of background sounds.

#12Adaptive regularization framework for robust voice activity detection

Xugang Lu (National Institute of Information and Communications Technology)
Masashi Unoki (Japan Advanced Institute of Science and Technology)
Ryosuke Isotani (National Institute of Information and Communications Technology)
Hisashi Kawai (National Institute of Information and Communications Technology)
Satoshi Nakamura (National Institute of Information and Communications Technology)

We have proposed a regularization framework for designing VAD algorithms. In the framework, the balance between false acceptance rate (FAR) and false rejection rate (FRR) related to noise reduction and speech distortion was implicitly controlled by using a regularization parameter. In addition, the regularization was done in a reproducing kernel Hilbert space (RKHS) which made it easy to apply a nonlinear transform function for noise reduction. Under this framework, a better tradeoff between FAR and FRR was obtained in VAD. In this study, an adaptive regularization framework was further developed in which the regularization parameter was changed adaptively according to local variations of the signal to noise ratio (SNR). We tested our algorithm on VAD experiments, and compared it with several typical VAD algorithms. The results showed that the proposed algorithm could be used to improve the robustness of VAD.

Tue-Ses3-P2:
Human Speech Production I

Time:Tuesday 16:00 Place:Valfonda 2 - Pala Congressi (Passi Perduti-Gallery) Type:Poster
Chair:Francesco Cutugno

#1On the use of extended context for HMM-based spontaneous conversational speech synthesis

Tomoki Koriyama (Tokyo Institute of Technology)
Takashi Nose (Tokyo Institute of Technology)
Takao Kobayashi (Tokyo Institute of Technology)

This paper addresses an issue of prosodic variability of spontaneous speech in HMM-based spontaneous conversational speech synthesis. We propose an extended context set including information peculiar to spontaneous speech derived from the annotation data embedded in a large-scale database of spontaneous Japanese. We show the effectiveness of the newly introduced contexts from the results of objective and subjective evaluation experiments. We also propose stopping criteria for decision-tree clustering to alleviate an over-fitting problem. Experimental results show that the restriction of the size of each leaf node can improve the quality of synthetic speech.

#2Predicting Tongue Positions from Acoustics and Facial Features

Asterios Toutios (University Nancy 2 / LORIA)
Slim Ouni (University Nancy 2 / LORIA)

We test the hypothesis that adding information regarding the positions of electromagnetic articulograph (EMA) sensors on the lips and jaw can improve the results of a typical acoustic-to-EMA mapping system, based on support vector regression, that targets the tongue sensors. Our initial motivation is to use such a system in the context of adding a tongue animation to a talking head built on the basis of concatenating bimodal acoustic-visual units. For completeness, we also train a system that maps only jaw and lip information to tongue information.

#3Assessing acoustic reduction: Exploiting local structure in speech

Louis ten Bosch (Radboud University Nijmegen)
Annika Hämäläinen (Loquendo S.p.A.)
Mirjam Ernestus (Radboud University Nijmegen)

This paper presents a method to quantify the spectral characteristics of reduction in speech. Hämäläinen et al. (2009) proposed a measure of spectral reduction which is able to predict a substantial amount of the variation in duration that the linguistically motivated variables do not account for. In this paper, we continue studying acoustic reduction in speech by developing a new acoustic measure of reduction, based on local manifold structure in speech. We show that this measure yields significantly improved statistical models for predicting variation in duration.

#4THE “FORTIS-LENIS” DISTINCTION IN BULGARIAN AND GERMAN

Bistra Andreeva (Saarland University)
Magdalena Wolska (Saarland University)

The present study investigates the voicing contrast in Bulgarian and German. Analyses of two production experiments are reported. In the first experiment logatoms were constructed containing /p, t, k/ and /b, d, g/ in intervocalic position. In the second experiment one Bulgarian and one German sentence were elicited in different focus conditions resulting in different accentuation levels. Based on the obtained data we analyze the phonetic implementation of the phonological categories voiced vs. voiceless and the influence of focus condition and accentuation. It is shown that, first, the two languages differ in the phonetic realization of /p, t, k/ but not /b, d, g/ in intervocalic position in terms of voice onset time (short vs. long lag /p, t, k/ in Bulgarian and German respectively). Second, accentuation levels are realised in different ways in the two languages.

#5Acoustic Correlates of Glottal Gaps

Gang Chen (Department of Electrical Engineering, University of California, Los Angeles)
Jody Kreiman (Division of Head and Neck Surgery, UCLA School of Medicine, Los Angeles)
Yen-Liang Shue (Dolby Australia)
Abeer Alwan (Department of Electrical Engineering, University of California, Los Angeles)

During speech production, the vocal folds may not close completely. The resulting glottal gap (GG) or incomplete glottal closure has not been systematically studied in terms of its acoustic and/or perceptual consequences. This paper uses high-speed imaging to investigate the relationship between GG area, source parameters, and acoustic measures for 6 subjects. Results showed that the cepstral peak prominence (CPP) and the harmonics-to-noise ratio (HNR) are affected by GG area, indicating the presence of more spectral noise with increasing GG area. Analysis of a glide phonation from breathy to pressed for one female speaker showed that measures H1*-H2* and H1*-A3* were positively correlated with GG area under a steady fundamental frequency (F0). In some phonatory modes, increasing F0 may reduce the amplitude of vocal fold vibration, increase GG area, and produce a lower spectral tilt due to significant aspiration noise, leading to a negative correlation between GG area and H1*-A3*.

#6Using a Genetic Algorithm to Estimate Parameters of a Coarticulation Model

Brian Bush (Oregon Health & Science University)
John-Paul Hosom (Oregon Health & Science University)
Alexander Kain (Oregon Health & Science University)
Akiko Amano-Kusumoto (Oregon Health & Science University)

We present a real-coded genetic algorithm that efficiently estimates parameters of a formant trajectory model. The genetic algorithm uses roulette-wheel selection and elitism to minimize the root mean square error between the observed formant trajectory and the model trajectory. Parameters, including vowel and consonant target values and coarticulation parameters, are estimated for a corpus of English clear and conversational CVC words. Results are presented that show the discovery of consistent consonant formant targets, even when those consonants do not themselves have formant structure. We also present findings of a relationship between a coarticulation parameter and the consonant identity.
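As an illustration of the technique named in this abstract, the sketch below runs a real-coded genetic algorithm with roulette-wheel selection and elitism that minimizes the root mean square error between an observed and a model trajectory. It is not the authors' formant model: the sigmoid `model`, the parameter bounds, the arithmetic crossover, the Gaussian mutation, and every numeric setting are hypothetical choices made only so the example is self-contained.

```python
import math
import random


def model(params, t):
    """Hypothetical toy trajectory: a sigmoid glide between two targets (Hz)."""
    start, target, slope, centre = params
    return start + (target - start) / (1.0 + math.exp(-slope * (t - centre)))


def rmse(params, times, observed):
    """Root mean square error between the model and the observed trajectory."""
    squared = [(model(params, t) - y) ** 2 for t, y in zip(times, observed)]
    return math.sqrt(sum(squared) / len(squared))


def roulette_select(population, fitnesses, rng):
    """Pick one individual with probability proportional to its fitness."""
    pick = rng.uniform(0.0, sum(fitnesses))
    running = 0.0
    for individual, fitness in zip(population, fitnesses):
        running += fitness
        if running >= pick:
            return individual
    return population[-1]


def run_ga(times, observed, bounds, pop_size=40, generations=60, seed=0):
    rng = random.Random(seed)
    population = [[rng.uniform(lo, hi) for lo, hi in bounds]
                  for _ in range(pop_size)]
    history = []  # best RMSE at each generation
    for _ in range(generations):
        costs = [rmse(ind, times, observed) for ind in population]
        best = min(range(pop_size), key=costs.__getitem__)
        history.append(costs[best])
        # Lower error means higher fitness.
        fitnesses = [1.0 / (1e-9 + c) for c in costs]
        # Elitism: the best individual survives unchanged.
        next_population = [list(population[best])]
        while len(next_population) < pop_size:
            a = roulette_select(population, fitnesses, rng)
            b = roulette_select(population, fitnesses, rng)
            w = rng.random()
            # Arithmetic crossover on real-valued genes.
            child = [w * x + (1.0 - w) * y for x, y in zip(a, b)]
            for i, (lo, hi) in enumerate(bounds):
                if rng.random() < 0.1:  # Gaussian mutation
                    child[i] += rng.gauss(0.0, 0.05 * (hi - lo))
                child[i] = min(hi, max(lo, child[i]))  # keep within bounds
            next_population.append(child)
        population = next_population
    costs = [rmse(ind, times, observed) for ind in population]
    best = min(range(pop_size), key=costs.__getitem__)
    history.append(costs[best])
    return population[best], history


# Synthetic "observed" trajectory generated from known toy parameters.
times = [i * 0.01 for i in range(21)]
observed = [model([500.0, 1500.0, 60.0, 0.1], t) for t in times]
bounds = [(200.0, 1000.0), (1000.0, 2500.0), (10.0, 100.0), (0.0, 0.2)]
best_params, history = run_ga(times, observed, bounds)
print(history[0], history[-1])
```

Because the elite individual is copied unchanged into each new generation, the best error recorded in `history` can never increase from one generation to the next.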

#7Synthesis of breathy, normal, and pressed phonation using a two-mass model with a triangular glottis

Peter Birkholz (Clinic of Phoniatrics, Pedaudiology, and Communication Disorders, University Hospital Aachen and RWTH Aachen University, Aachen, Germany)
Bernd J. Kröger (Clinic of Phoniatrics, Pedaudiology, and Communication Disorders, University Hospital Aachen and RWTH Aachen University, Aachen, Germany)
Christiane Neuschaefer-Rube (Clinic of Phoniatrics, Pedaudiology, and Communication Disorders, University Hospital Aachen and RWTH Aachen University, Aachen, Germany)

Two-mass models of the vocal folds and their variants are valuable tools for voice synthesis and analysis, but are not able to produce breathy voice qualities. The produced voice qualities usually lie between normal and pressed. The reason for this property is that the mass elements are aligned parallel to the dorso-ventral axis. As a result, the glottis always closes simultaneously along the entire length of the vocal folds. For breathy phonation, however, the closure happens rather gradually. This article introduces a modified two-mass model with mass elements that are inclined with respect to the dorso-ventral axis as a function of the degree of abduction. In this way, the closing phase of the glottis becomes progressively more gradual when the degree of abduction is increased. This model is able to produce the continuum of voice qualities from pressed over normal to breathy voices.

#8Analysis of inter-articulator correlation in acoustic-to-articulatory inversion using generalized smoothness criterion

Prasanta Ghosh (PhD Student)
Shrikanth Narayanan (Professor)

The movements of the different speech articulators are known to be correlated to various degrees during speech production. In this paper, we investigate whether the inter-articulator correlation is preserved among the articulators estimated through acoustic-to-articulatory inversion using the generalized smoothness criterion (GSC). Theoretical analysis of inter-articulator correlation in GSC reveals that the correlation between any two estimated articulators may not be identical to that between the corresponding measured articulatory trajectories; however, based on smoothness constraints provided by the real articulatory data, we found that, in practice, the correlation among articulators is approximately preserved in GSC-based inversion. We validate the theoretical analysis using a modified version of GSC in which correlations among articulators are explicitly imposed, and observe that there is no significant benefit in inversion using such a modified GSC.

#9Frequency-domain representation of source-filter coupling and its effect in the production of voice

Tokihiko Kaburagi (Kyushu University)

The acoustic coupling between the voice production system and the vocal tract has a significant influence on the production of voice. In this study, the coupling effect was represented using the acoustic pressure difference across the glottis, which is capable of inducing a flow, and the mean acoustic pressure in the glottis, which acts as a driving force for the vocal folds. These specific acoustic pressures were then interpreted in the frequency domain in the form of frequency responses, and incorporated into a model of the voice production system. In this framework, we were able to test the effect of source-filter coupling by filtering frequency responses. Numerical results revealed that these responses and the input impedance of the vocal tract both exhibited a dominant peak around 4 kHz. In addition, voice production simulations revealed that this high-frequency peak has a significant influence on the spatio-temporal pattern of glottal volume flow and vocal fold movements.

#10Method for speech inversion with large scale statistical evaluation

Heikki Rasilo (Aalto University, School of Electrical Engineering, Department of Signal Processing and Acoustics)
Unto K. Laine (Aalto University, School of Electrical Engineering, Department of Signal Processing and Acoustics)
Okko Räsänen (Aalto University, School of Electrical Engineering, Department of Signal Processing and Acoustics)
Toomas Altosaar (Aalto University, School of Electrical Engineering, Department of Signal Processing and Acoustics)

An articulatory model of speech production is created for the purpose of studying the links between speech production and perception. A computationally effective method for speech inversion is proposed, using a two-pole predictor structure in order to maintain better articulatory dynamics when compared to conventional dynamic programming methods. Preliminary tests for the effect of inversion are performed for 2500 Finnish syllables extracted from continuous speech, consisting of 125 different syllable classes. A cluster selectivity test shows that the syllables are more reliably clustered using the automatically obtained parametric representation of articulatory gestures rather than the original formant representation that is used as a starting point for the inversion.

#11Italian in the no-man's land between stress-timing and syllable-timing? Speakers are more stress-timed than listeners

Bettina Braun (University of Konstanz)
Sabine Geiselmann (University of Konstanz)

How syllable-timed is Italian? We investigate two contexts for vowel reduction, unstressed syllables and syllables in polysyllabic words. In a production experiment, a large sample of speakers from Tuscany read di- and trisyllabic target words with different stress placement in a sentence context. Results showed vowel reduction in unstressed syllables, both in terms of duration and spectral quality as well as polysyllabic shortening (without spectral reduction). These temporal adjustments are of similar magnitude as reported for stress-timed languages. Results of a two-alternative forced choice task, however, showed little sensitivity to temporal patterns in monosyllabic fragments. Hence, production patterns appear to be more stress-timed than perceptual mechanisms which has implications for duration models in speech synthesis.

#12The Lombard Effect in Spontaneous Dialog Speech

Laura Folk (Institute of Phonetics and Speech Processing, LMU Munich)
Florian Schiel (Bavarian Archive for Speech Signals, LMU Munich)

The Lombard effect -- environmental noise affects speech production -- has already been studied extensively for read lab speech. In this study spontaneous dialog speech produced by 24 German speakers has been recorded under noisy conditions and analysed for the Lombard effect. A sophisticated experimental setup using behind-the-ear hearing aid equipment allows us to insert real car noise into the perceived audio stream of speakers while maintaining the normal auditory feedback loop. We found that the main Lombard effects -- rising fundamental frequency and intensity -- can be confirmed for dialog speech. Speaking rate did not slow down, although such slowing was reported earlier for read speech. We also found that certain rhythmicity features regarding the dynamics of the RMS energy contour change significantly under Lombard conditions, but only for the female speakers.

Tue-Ses3-P3:
Speaker Recognition - Analysis and Statistics III

Time:Tuesday 16:00 Place:Faenza 1 - Pala Congressi (Passi Perduti-Gallery) Type:Poster
Chair:Pierre-Michel Bousquet

#1Variational Bayesian Model Selection for GMM-Speaker Verification using Universal Background Model

Timur Pekhovsky (Department of Speaker Verification and Identification, Speech Technology Center, St. Petersburg, Russia)
Alexandra Lokhanova (Department of Speaker Verification and Identification, Speech Technology Center, St. Petersburg, Russia)

In this paper we propose to use Variational Bayesian Analysis (VBA) instead of Maximum Likelihood (ML) estimation for Universal Background Model (UBM) building in GMM text-independent speaker verification systems. Using VBA estimation solves the problem of the optimal choice of the UBM mixture dimensionality for the training data set, as well as the problem of noise Gaussians, which are typical for ML estimation. Experiments using the NIST 2006 and 2008 SRE datasets (cellular channels only) demonstrate superior efficiency of baseline verification systems with a UBM trained using the VBA method compared to standard ML training. Verification error was reduced by almost 8% compared to a baseline system with standard ML training for the UBM.

#2To Weight or not to Weight: Source-Normalised LDA for Speaker Recognition using i-vectors

Mitchell McLaren (Radboud University Nijmegen)
David van Leeuwen (Radboud University Nijmegen)

Source-normalised Linear Discriminant Analysis (SN-LDA) was recently introduced to improve speaker recognition using i-vectors extracted from multiple speech sources. SN-LDA normalises for the effect of speech source in the calculation of the between-speaker covariance matrix. Source-normalised-and-weighted (SNAW) LDA computes a weighted average of source-normalised covariance matrices to better exploit available information. This paper investigates the statistical significance of performance gains offered by SNAW-LDA over SN-LDA. An exhaustive search for optimal scatter weights was conducted to determine the potential benefit of SNAW-LDA. When evaluated on both NIST 2008 and 2010 SRE datasets, scatter-weighting in SNAW-LDA tended to overfit the LDA transform to the evaluation dataset while offering few statistically significant performance improvements over SN-LDA.

#3Maximum Entropy based Data Selection for Speaker Recognition

Chien-Lin Huang (Human Language Technology Department, Institute for Infocomm Research, A*STAR, Singapore)
Bin Ma (Human Language Technology Department, Institute for Infocomm Research, A*STAR, Singapore)

This paper presents a data selection method for speaker recognition. Since more data does not guarantee better results, the way data are selected becomes important. In GMM-UBM speaker recognition, the UBM is trained to represent the speaker-independent distribution of acoustic features while the GMM speaker model is tailored for a specific speaker. In this study of data selection for speaker recognition, we apply the maximum entropy criterion to remove the redundant feature frames in the UBM training and to select the discriminative feature frames in the GMM speaker modeling. Experiments conducted on the 2008 NIST Speaker Recognition Evaluation corpus show that the proposed method outperforms the baseline system without data selection.

#4Addressing the Data-Imbalance Problem in Kernel-based Speaker Verification via Utterance Partitioning and Speaker Comparison

Wei Rao (The Hong Kong Polytechnic University)
Man-Wai Mak (The Hong Kong Polytechnic University)

A problematic issue of the GMM-SVM approach to speaker verification is the serious imbalance between the numbers of speaker- and impostor-class utterances for training speaker-dependent SVMs. This data-imbalance problem can be addressed by creating more speaker-class supervectors for SVM training through utterance partitioning techniques (UP-AVR) or avoiding the SVM training so that speaker scores are formulated as an inner product discriminant function (IPDF) between the target-speaker's and test supervectors. This paper highlights the differences between these two approaches and compares the effect of using different kernels -- including the KL divergence kernel, GMM-UBM mean interval (GUMI) kernel and geometric-mean-comparison kernel -- on their performance. Experiments on the NIST 2010 Speaker Recognition Evaluation suggest that GMM-SVM with UP-AVR is superior to speaker comparison and that the GUMI kernel is slightly better than the KL kernel in speaker comparison.

#5Single-channel Head Orientation Estimation Based on Discrimination of Acoustic Transfer Function

Ryoichi Takashima (Graduate School of System Informatics, Kobe University, Japan)
Tetsuya Takiguchi (Graduate School of System Informatics, Kobe University, Japan)
Yasuo Ariki (Graduate School of System Informatics, Kobe University, Japan)

This paper presents a talker’s head orientation estimation method using only a single microphone, where phoneme HMMs of clean speech are introduced to separate the acoustic transfer function at the user’s position and head orientation. The frame sequence of the acoustic transfer function is estimated by maximizing the likelihood of training data uttered from a given position with a given head orientation. Using the separated frame sequence data, the user’s position and head orientation are trained by SVM in advance. Then, for each test utterance, the acoustic transfer function is separated, and the user’s position and head orientation are estimated by discriminating the separated acoustic transfer function using SVM. The effectiveness of this method has been confirmed by talker localization and head orientation estimation experiments performed in a real environment.

#6Maximum Likelihood i-vector Space Using PCA for Speaker Verification

Zhenchun Lei (School of Computer and Information Engineering, Jiangxi Normal University, Nanchang, China)
Yingchun Yang (College of Computer Science, Zhejiang University, Hangzhou, China)

This paper proposes a new approach to training the i-vector space using a variant of PCA with the Baum-Welch statistics for speaker verification. In the eigenvoice model, the rank of the variability space is bounded by the number of training speakers, so a variant of the probabilistic PCA approach is introduced for estimating the parameters. This constraint does not exist in the i-vector model, because the number of utterances is much larger than the rank of the total variability space. We adopt the EM algorithm for PCA with the statistics to train the total variability space, using the maximum likelihood criterion. After WCCN, cosine similarity scoring is used for the decision. These two total variability spaces are fused at the feature level and the score level. Experiments on the NIST SRE 2008 data show that performance in the two total variability spaces is comparable, and that it improves clearly after feature fusion and score fusion.
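The cosine similarity scoring step mentioned in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `cosine_score` and the toy vectors are assumptions, and both i-vectors are assumed to have already been through WCCN.

```python
import numpy as np

def cosine_score(enroll_ivec, test_ivec):
    """Cosine similarity between two (already channel-compensated) i-vectors."""
    enroll = np.asarray(enroll_ivec, dtype=float)
    test = np.asarray(test_ivec, dtype=float)
    return float(enroll @ test / (np.linalg.norm(enroll) * np.linalg.norm(test)))

# Hypothetical 4-dimensional vectors, for illustration only.
target = np.array([1.0, 2.0, 0.0, -1.0])
same_direction = 3.0 * target                    # scaled copy: maximal score
orthogonal = np.array([2.0, -1.0, 5.0, 0.0])     # inner product with target is zero

score_same = cosine_score(target, same_direction)
score_orth = cosine_score(target, orthogonal)
```

A verification decision would then compare the score with a threshold tuned on development data. Because the score ignores vector length, only the direction of the i-vector matters.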

#7Speaker Verification using Sparse Representations on Total Variability I-Vectors

Ming Li (Signal Analysis and Interpretation Laboratory, Department of Electrical Engineering, University of Southern California, Los Angeles, USA)
Xiang Zhang (Key Laboratory of Speech Acoustics and Content Understanding, Chinese Academy of Sciences, Beijing, China)
Yonghong Yan (Key Laboratory of Speech Acoustics and Content Understanding, Chinese Academy of Sciences, Beijing, China)
Shrikanth Narayanan (Signal Analysis and Interpretation Laboratory, Department of Electrical Engineering, University of Southern California, Los Angeles, USA)

In this paper, the sparse representation computed by L1-minimization with quadratic constraints is employed to model the i-vectors in the low dimensional total variability space after performing the Within-Class Covariance Normalization and Linear Discriminant Analysis channel compensation. First, we propose the background normalized L2 residual as a scoring criterion. Second, we demonstrate that the Tnorm can be efficiently achieved by using the Tnorm data as the non-target samples in the over-complete dictionary. Finally, by fusing with the conventional i-vector based support vector machine (SVM) and cosine distance scoring system, we demonstrate overall system performance improvement. Experimental results show that the proposed fusion system achieved 4.05% (male) and 5.25% (female) equal error rate (EER) after Tnorm on the single-single multi-language handheld telephone task of NIST SRE 2008 and outperformed the SVM baseline by yielding 7.1% and 4.9% relative EER reduction for the male and female tasks, respectively.

#8Robust Speaker Recognition in Non-Stationary Room Environments Based on Empirical Mode Decomposition

Taufiq Hasan (University of Texas at Dallas)
John Hansen (University of Texas at Dallas)

In this study, we consider the problem of speaker recognition in a non-stationary room/channel mismatched condition. In such circumstances, cepstral coefficients are affected in a way that the short-term stationarity assumption, on which conventional feature normalization methods are based, may not be valid. We observe that the empirical mode decomposition (EMD) applied to the cepstral feature stream can partially separate out the non-stationary channel components, if present, into its residual signal and other lower order intrinsic mode functions (IMFs), which leads us to develop a filtering scheme based on this decomposition. The proposed method works in the time domain, making use of the instantaneous frequency function obtained through Hilbert spectral analysis of the IMFs. Experimental evaluations on the TIMIT database with added non-stationary room channels in test demonstrate the superiority of the proposed scheme compared to conventional feature normalization schemes. Additional experiments performed on the newly released noisy robust open set speaker identification (ROSSI) and NIST SRE corpora also confirm the effectiveness of the proposed method in stationary room/channel mismatched conditions.

#9Range based multi microphone array fusion for speaker activity detection in small meetings

Jani Even (ATR Intelligent Robotics and Communication Laboratories, Kyoto, Japan)
Panikos Heracleous (ATR Intelligent Robotics and Communication Laboratories, Kyoto, Japan)
Carlos Ishi (ATR Intelligent Robotics and Communication Laboratories, Kyoto, Japan)
Norihiro Hagita (ATR Intelligent Robotics and Communication Laboratories, Kyoto, Japan)

This paper presents a method for speaker activity detection in small meetings. The activity of the participants is deduced from audio streams obtained by multiple microphone arrays. One novelty of the proposed approach is that it uses a human tracker that relies on scanning laser range finders to localize the participants. First, this additional information is exploited by the beamforming algorithm creating the audio streams for each of the microphone arrays. Then, at each array, the speaker activity detection is performed using Gaussian mixture models that were trained beforehand. Finally, a fusion procedure that also uses the location information combines the detection results of the different microphone arrays. An experiment reproducing a meeting configuration demonstrates the effectiveness of the system.

#10Speaker verification robust to talking style variation using multiple kernel learning based on conditional entropy minimization

Tetsuji Ogawa (Waseda University)
Hideitsu Hino (Waseda University)
Noboru Murata (Waseda University)
Tetsunori Kobayashi (Waseda University)

We developed a new speaker verification system that is robust to intra-speaker variation. Intra-speaker variations are very likely to occur due to changes in talking styles, the periods when an individual speaks, and so on. It is well known that such variation generally degrades the performance of speaker verification systems. To solve this problem, we applied multiple kernel learning (MKL) based on conditional entropy minimization, which forces the data to be compactly aggregated for each speaker class and ensures that the different speaker classes are far apart from each other. Experimental results showed that the proposed speaker verification system was more robust to intra-speaker variation derived from changes in talking styles than the conventional maximum margin-based system.

#11Regularized Logistic Regression Fusion for Speaker Verification

Ville Marko Hautamaki (Institute for Infocomm Research)
Kong Aik Lee (Institute for Infocomm Research)
Tomi Kinnunen (University of Eastern Finland)
Bin Ma (Institute for Infocomm Research)
Haizou Li (Institute for Infocomm Research)

Fusion of the base classifiers is seen as the way to achieve state-of-the-art performance in speaker verification systems. The standard approach is to pose the fusion problem as a linear binary classification task. The most successful loss function in speaker verification fusion has been the weighted logistic regression popularized by the FoCal toolkit. However, it is known that optimizing logistic regression can overfit severely without appropriate regularization. In addition, subset classifier selection can be achieved by using an external 0/1 loss function on the best subset. In this work, we propose to use LASSO-based regularization on the FoCal cost function to achieve improved performance and a classifier subset selection method integrated into one optimization task. The proposed method is able to achieve a 51% relative improvement in Actual DCF over the FoCal baseline.
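To illustrate how an L1 (LASSO) penalty can combine score fusion with classifier subset selection, here is a minimal sketch. It is not the FoCal cost function or the authors' method: it uses plain (unweighted) logistic regression trained by proximal gradient descent, and the function names, toy scores, and hyperparameters are all assumptions.

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of the L1 norm: shrink each entry towards zero by t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def fit_l1_fusion(scores, labels, lam, lr=0.1, n_iter=500):
    """Fit fusion weights w and bias b with an L1 penalty lam on w.

    scores: (n_trials, n_classifiers) base-classifier scores
    labels: (n_trials,) 1 for target trials, 0 for non-target trials
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=float)
    n_trials, n_classifiers = scores.shape
    w = np.zeros(n_classifiers)
    b = 0.0
    for _ in range(n_iter):
        posterior = 1.0 / (1.0 + np.exp(-(scores @ w + b)))
        err = posterior - labels
        grad_w = scores.T @ err / n_trials
        grad_b = err.mean()
        # Gradient step on the logistic loss, then L1 shrinkage on the weights only.
        w = soft_threshold(w - lr * grad_w, lr * lam)
        b -= lr * grad_b
    return w, b

# Hypothetical scores: column 0 is informative, column 1 is nearly constant noise.
toy_scores = np.array([[2.0, 0.1], [1.5, -0.1], [1.0, 0.1],
                       [-1.0, 0.1], [-1.5, -0.1], [-2.0, 0.1]])
toy_labels = np.array([1, 1, 1, 0, 0, 0])

w_light, b_light = fit_l1_fusion(toy_scores, toy_labels, lam=0.01)
w_heavy, b_heavy = fit_l1_fusion(toy_scores, toy_labels, lam=10.0)
```

With a light penalty the informative classifier keeps a positive weight, while a heavy penalty drives every weight to exactly zero. Intermediate values of `lam` are where subset selection happens, since weakly useful classifiers are zeroed first.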

#12A Longest Matching Segment Approach with Bayesian Adaptation - Application to Noise-Robust Speaker Recognition

Ayeh Jafari (Queen's University Belfast)
Ramji Srinivasan (Queen's University Belfast)
Danny Crookes (Queen's University Belfast)
Ming Ji (Queen's University Belfast)

Temporal dynamics is an important feature of speech that distinguishes speech from noise, as well as distinguishing between different speakers. In this paper, we present an approach to extract long-range temporal dynamics of speech for text-independent speaker recognition. We aim to maximize the noise immunity arising from the distinct temporal dynamics of speech. The new approach achieves this by identifying the longest matching segments between the training data and test data for recognition. Additionally, the new approach combines Bayesian adaptation, multicondition training and missing-feature theory to further advance the ability to model noisy speech. Experiments have been conducted on the NIST 2002 SRE database in the presence of various types of noise including fast-varying song and music. The new approach has shown improved performance over conventional noise-robust techniques.

#13Data Selection with Kurtosis and Nasality features for Speaker Recognition

Howard Lei (International Computer Science Institute)
Nikki Mirghafori (International Computer Science Institute)

We propose new data selection approaches based on speaker discriminability features, including kurtosis and a set of nasality features which exploit spectral properties of nasal speech sounds. Data selected based on the speaker discriminability features are used to implement end-to-end speaker recognition systems, which produce significant improvements when combined with the baseline system (which uses the speech-only data regions determined by a speech/non-speech detector); the optimal combination of systems produces roughly a 24% improvement over the baseline. Results suggest that focusing the modeling power on data regions selected via the kurtosis and nasality speaker discriminability features, part of which are often discarded in the speech/non-speech detection process, can improve speaker recognition.

#14Use of The Harmonic Phase in Speaker Recognition

Inma Hernaez (University of the Basque Country)
Ibon Saratxaga (University of the Basque Country)
Jon Sanchez (University of the Basque Country)
Eva Navas (University of the Basque Country)
Iker Luengo (University of the Basque Country)

In this paper a novel set of features with a promising ability to identify speakers is presented. These features are based on the harmonic phase of the speech signal and have been previously used successfully in an ASR task. Using the SI-284 subset of the WSJ database, a GMM has been trained for each of the 283 speakers and several speaker identification experiments have been performed, with a high level of success. The feature extraction method and the performed experiments are described. The results show that the features present excellent identification performance, very close to the performance of the MFCC parameters.

Tue-Ses3-P4:
Voice Conversion and Speech Synthesis

Time:Tuesday 16:00 Place:Faenza 2 - Pala Congressi (Passi Perduti-Gallery) Type:Poster
Chair:Alan Black

#1Gaussian Process Experts for Voice Conversion

Nicholas Pilkington (Computer Laboratory, University of Cambridge, Cambridge)
Heiga Zen (Toshiba Research Europe Ltd., Cambridge Research Lab.)
Mark Gales (Toshiba Research Europe Ltd., Cambridge Research Lab.)

Conventional approaches to voice conversion typically use a GMM to represent the joint probability density of source and target features. This model is then used to perform spectral conversion between speakers. This approach is reasonably effective but can be prone to overfitting and oversmoothing of the target spectra. This paper proposes an alternative scheme that uses a collection of Gaussian process experts to perform the spectral conversion. Gaussian processes are robust to overfitting and oversmoothing and can predict the target spectra more accurately. Experimental results indicate that the objective performance of voice conversion can be improved using the proposed approach.

#2Intonation Conversion From Neutral to Expressive Speech

Christophe VEAUX (Ircam)
Xavier RODET (Ircam)

Intonation is one of the most important factors of speech expressivity. This paper presents a conversion method for the F0 contours. The F0 segments are represented with discrete cosine transform (DCT) coefficients at the syllable level. Multi-level dynamic features are added to model the temporal correlation between syllables and to constrain the F0 contour at the phrase level. Gaussian mixture models (GMM) are used to map the prosodic features between neutral and expressive speech, and the converted F0 contour is generated under the dynamic features constraints. Experimental evaluation using a database of acted emotional speech shows the effectiveness of the proposed F0 model and conversion method.
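The syllable-level DCT representation of an F0 segment described in this abstract can be illustrated with a short sketch. It is an assumption-laden example rather than the authors' code: it uses an orthonormal DCT-II, and the helper names, the number of retained coefficients, and the F0 values are hypothetical.

```python
import numpy as np

def dct_matrix(n):
    """Orthonormal DCT-II basis as an (n, n) matrix; rows are basis functions."""
    k = np.arange(n)[:, None]
    t = np.arange(n)[None, :]
    basis = np.cos(np.pi * (t + 0.5) * k / n) * np.sqrt(2.0 / n)
    basis[0, :] = np.sqrt(1.0 / n)
    return basis

def f0_to_dct(f0_segment, n_coeffs):
    """Keep the first n_coeffs DCT coefficients of one syllable's F0 contour."""
    f0 = np.asarray(f0_segment, dtype=float)
    return dct_matrix(len(f0))[:n_coeffs] @ f0

def dct_to_f0(coeffs, n_frames):
    """Rebuild an n_frames-long contour from (possibly truncated) coefficients."""
    coeffs = np.asarray(coeffs, dtype=float)
    return dct_matrix(n_frames)[:len(coeffs)].T @ coeffs

# Hypothetical F0 values (Hz) for one syllable, 8 voiced frames.
syllable_f0 = np.array([120.0, 125.0, 131.0, 136.0, 138.0, 135.0, 128.0, 121.0])

coeffs = f0_to_dct(syllable_f0, n_coeffs=3)    # compact description of the shape
smoothed = dct_to_f0(coeffs, n_frames=8)       # low-order approximation
exact = dct_to_f0(f0_to_dct(syllable_f0, 8), 8)
```

The first coefficient tracks the mean F0 level of the syllable and the next few capture slope and curvature, so a handful of coefficients gives a fixed-length feature vector regardless of syllable duration.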

#3Speaker-Adaptive Speech Synthesis Based on Eigenvoice Conversion and Language-Dependent Prosodic Conversion in Speech-to-Speech Translation

Nobuhiko Hattori (Nara Institute of Science and Technology)
Tomoki Toda (Nara Institute of Science and Technology)
Hisashi Kawai (National Institute of Information and Communications Technology)
Hiroshi Saruwatari (Nara Institute of Science and Technology)
Kiyohiro Shikano (Nara Institute of Science and Technology)

This paper describes a novel approach based on voice conversion (VC) to speaker-adaptive speech synthesis for speech-to-speech translation. Voice quality of translated speech in an output language is usually different from that of an input speaker of the translation system since a text-to-speech system is developed with another speaker's voices in the output language. To render the input speaker's voice quality in the translated speech, we propose a voice quality control method based on one-to-many eigenvoice conversion (EVC) and language-dependent prosodic conversion. Spectral parameters of the translated speech are effectively converted by one-to-many EVC enabling unsupervised speaker adaptation. Moreover, prosodic parameters are modified considering their global differences between the input and output languages. The effectiveness of the proposed method is confirmed by experimental evaluations on cross-lingual VC among Japanese, English, and Chinese.

#4Adding Glottal Source Information to Intra-lingual Voice Conversion

Javier Pérez (Universitat Politècnica de Catalunya)
Antonio Bonafonte (Universitat Politècnica de Catalunya)

This paper studies the inclusion of glottal source characteristics in voice conversion (VC) systems. We use source/filter decomposition to parametrize the vocal tract (LSF), the glottal source (LF model), and the aspiration noise (amplitude-modulated high-pass filtered AWGN noise). To evaluate the impact of this new parametrization in VC, we use a reference conversion system that estimates a linear transformation function using a joint target/source model obtained with CART and GMM. The reference system is based on the LPC model, uses LSF to represent the vocal tract and a selection technique for the residual. We use the reference algorithm to build a VC system for each of the three parameter sets. We compared both parametrizations in the framework of an intra-lingual voice conversion task in Spanish. The results show that the new source/filter representation clearly improves the overall performance, both in terms of speaker identity transformation and voice quality.

#6Formant-controlled HMM-based Speech Synthesis

Ming Lei (iFLYTEK Speech Lab, University of Science and Technology of China)
Junichi Yamagishi (CSTR, University of Edinburgh)
Korin Richmond (CSTR, University of Edinburgh)
Zhen-Hua Ling (iFLYTEK Speech Lab, University of Science and Technology of China)
Simon King (CSTR, University of Edinburgh)
Li-Rong Dai (iFLYTEK Speech Lab, University of Science and Technology of China)

This paper proposes a novel framework that enables us to manipulate and control formants in HMM-based speech synthesis. In this framework, the dependency between formants and spectral features is modelled by piecewise linear transforms; formant parameters are effectively mapped by these to the means of Gaussian distributions over the spectral synthesis parameters. The spectral envelope features generated under the influence of formants in this way may then be passed to high-quality vocoders to generate the speech waveform. This provides two major advantages over conventional frameworks. First, we can achieve spectral modification by changing formants only in those parts where we want control, whereas the user must specify all formants manually in conventional formant synthesisers (e.g. Klatt). Second, this can produce high-quality speech. Our results show the proposed method can control vowels in the synthesized speech by manipulating F1 and F2 without any degradation in synthesis quality.

#7Analysis of HMM-Based Lombard Speech Synthesis

Tuomo Raitio (Department of Signal Processing and Acoustics, Aalto University, Helsinki, Finland)
Antti Suni (Department of Speech Sciences, University of Helsinki, Helsinki, Finland)
Martti Vainio (Department of Speech Sciences, University of Helsinki, Helsinki, Finland)
Paavo Alku (Department of Signal Processing and Acoustics, Aalto University, Helsinki, Finland)

Humans modify their voice in interfering noise in order to maintain the intelligibility of their speech - this is called the Lombard effect. This ability, however, has not been extensively modeled in speech synthesis. Here we compare several methods of synthesizing speech in noise using a physiologically based statistical speech synthesis system (GlottHMM). The results show that in a realistic street noise situation the synthetic Lombard speech is judged by listeners both as appropriate for the situation and as intelligible as natural Lombard speech. Of the different types of models, one using adaptation and extrapolation performed the best.

#8Discrete/Continuous Modelling of Speaking Style in HMM-based Speech Synthesis: Design and Evaluation

Nicolas Obin (IRCAM)
Pierre Lanchantin (IRCAM)
Anne Lacheret (Modyco Lab.)
Xavier Rodet (IRCAM)

This paper assesses the ability of an HMM-based speech synthesis system to model the speech characteristics of various speaking styles. A discrete/continuous HMM is presented to model the symbolic and acoustic speech characteristics of a speaking style. The proposed model is used to model the average characteristics of a speaking style that is shared among various speakers, depending on specific situations of speech communication. The evaluation consists of an identification experiment on 4 speaking styles based on delexicalized speech, which is compared to a similar experiment on natural speech. The comparison is discussed and reveals that the discrete/continuous HMM consistently models the speech characteristics of a speaking style.

#9Factored MLLR Adaptation For Singing Voice Generation

June Sig Sung (Seoul National University)
Doo Hwa Hong (Seoul National University)
Shin Jae Kang (Seoul National University)
Nam Soo Kim (Seoul National University)

In our previous study, we proposed factored MLLR (FMLLR) where each MLLR parameter is defined as a function of a control vector. We presented a method to train the FMLLR parameters based on a general framework of the expectation-maximization (EM) algorithm. In this paper, we extend the FMLLR structure from diagonal to unrestricted full matrix with a sophisticated algorithm for the training of relevant parameters. In the experiments on artificial generation of singing voice, we evaluate the performance of the FMLLR technique with two matrix structures and also compare with other approaches to parameter adaptation in HMM-based speech synthesis.

#11Adaptation of Prosody in Speech Synthesis by Changing Command Values of the Generation Process Model of Fundamental Frequency

Keikichi Hirose (Department of Information and Communication Engineering, the University of Tokyo)
Keiko Ochi (Department of Information and Communication Engineering, the University of Tokyo)
Ryusuke Mihara (Department of Information and Communication Engineering, the University of Tokyo)
Hiroya Hashimoto (Department of Information and Communication Engineering, the University of Tokyo)
Daisuke Saito (Department of Information and Communication Engineering, the University of Tokyo)
Nobuaki Minematsu (Department of Information and Communication Engineering, the University of Tokyo)

A method was developed to adapt prosody to a new speaker/style in speech synthesis. It is based on predicting differences between target and original speakers/styles and applying them to the original one. Differences in fundamental frequency (F0) contours are represented in the framework of the generation process model, as differences in the command magnitudes/amplitudes. While the original speaker/style requires a certain amount of training corpus, the corpus for training command differences can be small. Furthermore, in the case of style adaptation, the corpus need not be uttered by the same speaker as the original style. Speech synthesis was conducted using an HMM-based speech synthesis system, where prosody was controlled by the method. Listening experiments on synthetic speech with style adaptation and voice conversion both showed the validity of the method.

#12Prosody Conversion for Emotional Mandarin Speech Synthesis Using the Tone Nucleus Model

Miaomiao Wen (Department of Electrical Engineering and Information Systems, the University of Tokyo, Japan)
Miaomiao Wang (Department of Electrical Engineering and Information Systems, the University of Tokyo, Japan)
Keikichi Hirose (Department of Information and Communication Engineering, the University of Tokyo, Japan)
Nobuaki Minematsu (Department of Information and Communication Engineering, the University of Tokyo, Japan)

In this paper, the tone nucleus model is employed to represent and convert F0 contours for synthesizing emotional Mandarin speech from neutral speech. Compared with previous prosody transforming methods, the proposed method 1) only converts the tone nucleus part of each syllable rather than the whole F0 contour to avoid data sparseness problems; 2) builds mapping functions for well-chosen tone nucleus model parameters to better capture Mandarin tonal information. Using only a modest amount of training data, the perceptual accuracy achieved by our method was shown to be comparable to that obtained by a professional speaker.

#13Rapid Adaptation of Foreign-accented HMM-based Speech Synthesis

Reima Karhila (Adaptive Informatics Research Centre, Aalto University, Helsinki, Finland)
Mirjam Wester (Centre for Speech Technology Research, University of Edinburgh, UK)

This paper presents findings of listeners’ perception of speaker identity in synthetic speech. We investigated what the effect is on the perceived identity of a speaker when using differently accented average voice models and limited amounts (5 and 15 sentences) of a speaker’s data to create the synthetic stimuli. A speaker discrimination task was used to measure speaker identity. Native English listeners were presented with natural and synthetic speech stimuli in English and were asked whether they thought the sentences were spoken by the same person or not. An accent rating task was carried out to measure the perceived accents of the synthetic speech stimuli. Listeners perform as well at speaker discrimination when the stimuli have been created using 5 or 15 adaptation sentences as when using 105 sentences. The accent rating task shows listeners perceive different accents in the synthetic stimuli. However, listeners do not base speaker similarity decisions on perceived accent.

#14The Effects of Phoneme Errors in Speaker Adaptation for HMM Speech Synthesis

Bálint Tóth (Department of Telecommunications and Media Informatics, Budapest University of Technology and Economics)
Tibor Fegyó (Department of Telecommunications and Media Informatics, Budapest University of Technology and Economics)
Géza Németh (Department of Telecommunications and Media Informatics, Budapest University of Technology and Economics)

In this paper, phoneme errors in the adaptation data of HMM-based synthesis are investigated. Phoneme errors are likely to appear in automatic speech recognition (ASR) based transcriptions. The research also investigates the prospect of unsupervised adaptation based purely on ASR transcriptions. To achieve better quality, a new method is introduced for selecting an optimal subset of ASR transcription based adaptation data. Quality evaluation of the method was also performed. The results showed that adaptation was successful even at phoneme error rates higher than 50%.

Tue-Ses3-S1-P:
Speech and Audio Processing for Human-Robot Interaction II

Time:Tuesday 17:00 Place:Caravaggio (Adua 1) - Pala Affari - 1st Floor Type:Poster
Chair:Alex Rudnicky

#1Blind Source Separation for Robot Audition using Fixed Beamforming with HRTFs

Mounira Maazaoui (Telecom ParisTech)
Yves Grenier (Telecom ParisTech)
Karim Abed-Meraim (Telecom ParisTech)

We present a two-stage blind source separation (BSS) algorithm for robot audition. The algorithm is based on beamforming preprocessing and a BSS algorithm using a sparsity separation criterion. Before the BSS step, we filter the sensor outputs with beamforming filters to reduce the reverberation and the environmental noise. As we are in a robot audition context, the manifold of the sensor array is hard to model, so we use pre-measured Head Related Transfer Functions (HRTFs) to estimate the beamforming filters. In this article, we show the good performance of this method as compared to a single-stage BSS-only method.

#2Audio-Visual Voice Activity Detection in Dynamically Changing Environments

Takami Yoshida (Graduate School of Information Science and Engineering, Tokyo Institute of Technology)
Keisuke Nakamura (Honda Research Institute Japan Co. Ltd.)
Kazuhiro Nakadai (Honda Research Institute Japan Co. Ltd.)

This paper addresses Audio-Visual Voice Activity Detection (AV-VAD) for a robot, especially in a dynamically changing environment. In such an environment, VAD has to deal with two issues. One is that audio and visual data, that is, speech signals and lip movements, are not fully synchronized. The other is that the qualities of audio and visual data always change depending on various factors such as the acoustic noise level and the geometrical relationship between the robot and a user. We propose an audio-visual integration method based on a state transition model to cope with asynchronicity between voice and lip activities, and also a modality selection method to use the best stream for VAD among audio, visual, and audio-visual. We implemented an online AV-VAD system based on the proposed methods for our upper-torso humanoid robot SIG, which has a camera and an 8 ch microphone array. Experimental results showed that the AV integration with the proposed methods improved by 22.3 and 32.3 points compared to audio-only VAD and visual-only VAD, respectively.

#3Emotion detection from speech in human-robot interaction

Marie Tahon (LIMSI-CNRS)
Agnès Delaborde (LIMSI-CNRS)
Laurence Devillers (LIMSI-CNRS)

We focus in this paper on the detection of emotions in the voice of a speaker in a Human-Robot Interaction context. This work is part of the ROMEO project, which aims to design a robot for both elderly people and children. Our system offers several modules based on multi-level processing of the audio cues. The affective markers produced by these different modules will make it possible to drive the emotional behaviour of the robot. Since the models are built with recorded data and the system will be tested on real-life data, we need to estimate the performance of our emotion detection system in cross-corpus situations. Cross-validation experiments on a three-class detection task show that derivatives and energy features may be removed from our feature set for this specific task. Cross-corpora experiments on anger-positive-neutral data suggest that detection performance may be better with two different models: one for child voices, one for adult voices.

#4WEIGHTED ORDERED CLASSES - NEAREST NEIGHBORS : A NEW FRAMEWORK FOR AUTOMATIC EMOTION RECOGNITION FROM SPEECH

Yazid Attabi (Centre de recherche informatique de Montréal, Montréal, Canada)
Pierre Dumouchel (Centre de recherche informatique de Montréal, Montréal, Canada)

In this paper we present a new framework for emotion recognition from speech based on a similarity concept called Weighted Ordered Classes-Nearest Neighbors (WOC-NN). Unlike the k-nearest neighbor, an instance-similarity based method, WOC-NN computes similarities between a test instance and a class pattern of each emotion class. An emotion class pattern is a representation of its ranked neighboring classes. A Hamming distance is used as the distance metric, enhanced with two improvements: i) weighting the importance of each class rank of each neighborhood pattern and ii) discarding irrelevant class ranks from the patterns. Thus, the decision process in WOC-NN exploits more information than the Bayes rule, which makes use only of the information about the model class that minimizes Bayes risk. This extra information allows WOC-NN to make more accurate predictions. Also, the results show that the proposed system outperforms state-of-the-art systems when applied to the FAU AIBO corpus.

#5Prosodic Analysis of a Corpus of Tales

David Doukhan (LIMSI-CNRS)
Albert Rilliard (LIMSI-CNRS)
Sophie Rosset (LIMSI-CNRS)
Martine Adda-Decker (LIMSI-CNRS, LPP-CNRS)
Christophe d'Alessandro (LIMSI-CNRS)

This paper presents a prosodic analysis of a corpus of 12 tales, read by a male speaker. The work is part of the GVLex project, which aims at giving a storytelling ability to a humanoid robot. One main point is to improve text-to-speech synthesis expressivity according to a semi-automatic analysis of a given tale. Automatic tagging and prosodic stylization were applied to the corpus. The extracted parameters are described and analyzed according to relevant elements of the tale structure. The results underline the expressive strategy used by the speaker to impersonate the different kinds of characters and during the different structural parts of each tale. The relevance of these prosodic parameters is then discussed in order to propose relevant instructions to enhance the expressivity of a non-uniform-units text-to-speech synthesizer.

#6Analysis of acoustic-prosodic features related to paralinguistic information carried by interjections in dialogue speech

Carlos T. Ishi (ATR Intelligent Robotics and Communication Labs.)
Hiroshi Ishiguro (ATR Intelligent Robotics and Communication Labs.)
Norihiro Hagita (ATR Intelligent Robotics and Communication Labs.)

Interjections are often used in dialogue communication for expressing a reaction (such as agreement, surprise and disgust) to the interlocutor. Thus, a correct interpretation of the paralinguistic information (intention, attitude or emotion) carried by interjections is important for achieving a smooth dialogue interaction between humans and robots. In the present work, analyses are conducted on several interjections appearing in spontaneous conversational speech databases to investigate the relationship between acoustic-prosodic features (related to intonation and voice quality) and their paralinguistic functions in dialogue speech. It is found that there are common and interjection-dependent relationships between acoustic features and paralinguistic information. Regardless of the interjection type, non-modal voice qualities, such as whispery, harsh and pressed voices, are shown to be important cues for the expression of emotions and attitudes.

#7Robust intonation pattern classification in human robot interaction

Martin Heckmann (Honda Research Institute Europe GmbH)
Kazuhiro Nakadai (Honda Research Institute Japan Co. Ltd.)
Hirofumi Nakajima (Honda Research Institute Japan Co. Ltd.)

We present a system for the classification of intonation patterns in human robot interaction. The system distinguishes questions from other types of utterances and can deal with additional reverberations, background noise, as well as music interfering with the speech signal. The main building blocks of our system are a multi channel source separation, robust fundamental frequency extraction and tracking, segmentation of the speech signal, and classification of the fundamental frequency pattern of the last speech segment. We evaluate the system with Japanese sentences which are ambiguous without intonation information in a realistic human robot interaction scenario. Despite the challenging task our system is able to classify the intonation pattern with good accuracy. With several experiments we evaluate the contribution of the different aspects of our system.

#8ASR for human-symbiotic robot “EMIEW2” with Mechanical Noise and Floor-Level Noise Reduction

Takashi Sumiyoshi (Central Research Laboratory, Hitachi, Ltd., Tokyo, Japan)
Masahito Togami (Central Research Laboratory, Hitachi, Ltd., Tokyo, Japan)
Yasunari Obuchi (Central Research Laboratory, Hitachi, Ltd., Tokyo, Japan)

A human-symbiotic robot called “EMIEW2” is introduced, together with its auditory function, which includes two noise reduction methods against self-generated mechanical noise and external floor-level noise. The former type of noise is produced by the robot itself, and it is a difficult problem because it can be loud, nonstationary, and wide-band. We adopt a maximized SNR technique, in which the noise correlation matrix is selected from noise clusters that are learned from pre-recorded noise signals. The latter type of noise, which can occur when robots are used in office environments, is also a problem, and we address it by expanding the beamforming area from one dimension (azimuth angle) to two dimensions (azimuth and elevation angles). We evaluated these methods in a 100-word speech recognition task and show that both methods are effective for improving the speech recognition rate.

Wed-Ses1-O1:
Speaker Diarization I

Time:Wednesday 10:00 Place:Auditorium - Pala Congressi Type:Oral
Chair:Janez Zibert

10:00Speaker Diarization Using A Priori Acoustic Information

Hagai Aronowitz (IBM Haifa - Research)

Speaker diarization is usually performed in a blind manner without using a priori knowledge about the identity or acoustic characteristics of the participating speakers. In this paper we propose a novel framework for incorporating available a priori knowledge such as potential participating speakers, channels, background noise and gender, and integrating these knowledge sources into blind speaker diarization-type algorithms. We demonstrate this framework on two tasks. The first task is agent-customer speaker diarization for call-center phone calls and the second task is speaker diarization for a PDA recorder which is part of an assistive living system for the elderly. For both of these tasks, incorporating the a priori information into our blind speaker diarization systems significantly improves diarization accuracy.

10:20Improved Overlapped Speech Handling for Speaker Diarization

Kofi Boakye (Lawrence Livermore National Laboratory)
Oriol Vinyals (International Computer Science Institute)
Gerald Friedland (International Computer Science Institute)

We present our ongoing work in addressing the issue of overlapped speech in speaker diarization through the use of overlap segmentation, overlapped speech exclusion, and overlap segment labeling. Using feature analysis, we identify the most salient features from a candidate list including those from our previous system and a set of newly proposed features. In addition, through independent optimization of overlap exclusion and labeling, we obtain a relative diarization error rate improvement of 15.1% on a sampled subset of the AMI Meeting Corpus, more than double our previous result. When analyzed independently, we show that the performance improvement due to overlapped speech exclusion now rivals that of an oracle system using reference overlap segments.

10:40Exploiting Intra-Conversation Variability for Speaker Diarization

Stephen Shum (MIT Computer Science and Artificial Intelligence Laboratory)
Najim Dehak (MIT Computer Science and Artificial Intelligence Laboratory)
Ekapol Chuangsuwanich (MIT Computer Science and Artificial Intelligence Laboratory)
Douglas Reynolds (MIT Lincoln Laboratory)
Jim Glass (MIT Computer Science and Artificial Intelligence Laboratory)

In this paper, we propose a new approach to speaker diarization based on the Total Variability approach to speaker verification. Drawing on previous work done in applying factor analysis priors to the diarization problem, we arrive at a simplified approach that exploits intra-conversation variability in the Total Variability space through the use of Principal Component Analysis (PCA). Using our proposed methods, we demonstrate the ability to achieve state-of-the-art performance (0.9% DER) in the diarization of summed-channel telephone data from the NIST 2008 SRE.

11:00Speaker Clustering Based on Non-negative Matrix Factorization

Masafumi Nishida (Doshisha University)
Seiichi Yamamoto (Doshisha University)

This paper addresses unsupervised speaker clustering for multi-party conversations. Hierarchical clustering methods were mainly used in previous studies. However, these methods require many processes, such as distance calculation and cluster merging, when there are many utterances in conversation data. We propose a clustering method based on non-negative matrix factorization. The proposed method can perform fast and robust clustering by decomposing a matrix consisting of distances between models. We conducted speaker clustering experiments using a Bayesian information criterion based method, a method based on the likelihood ratio between Gaussian mixture models, and the proposed method. Experimental results showed that the proposed method achieves higher clustering accuracy than these conventional methods.

11:20Information Bottleneck Features for HMM/GMM Speaker Diarization of Meetings Recordings

Sree Harsha Yella (Idiap Research Institute)
Fabio Valente (Idiap Research Institute)

Improved diarization results can be obtained through the combination of multiple systems. Several combination techniques have been proposed based on output voting, initialization and also integrated approaches. This paper proposes and investigates a novel approach to combine diarization systems through the use of features. A first diarization system, based on the Information Bottleneck system, is used to generate a set of features that contain information relevant to the clustering. Those features are later used in conjunction with conventional MFCC in a second diarization system. While appearing fully integrated, the approach does not need modifications to either of the two systems in order to integrate the information. Experiments on 24 recordings from the NIST RT06/RT07/RT09 evaluations collected in five meeting rooms reveal that when the IB features are used together with MFCC, the total speaker error is reduced from 12% to 9.7%, i.e., by approximately 19% relative.
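The relative figure quoted above follows from the two absolute error rates. A short check, in which the function name is illustrative rather than taken from the paper:

```python
def relative_reduction(baseline: float, improved: float) -> float:
    """Return the error reduction as a fraction of the baseline error."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (baseline - improved) / baseline


# Speaker error rates quoted in the abstract, in percent.
reduction = relative_reduction(12.0, 9.7)
print(f"{reduction:.1%}")  # (12 - 9.7) / 12, i.e. roughly 19% relative
```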

11:40Cross Likelihood Ratio Based Speaker Clustering Using Eigenvoice Models

David Wang (Queensland University of Technology)
Robert Vogt (Queensland University of Technology)
Sridha Sridharan (Queensland University of Technology)
David Dean (Queensland University of Technology)

This paper proposes the use of eigenvoice modeling techniques with the Cross Likelihood Ratio (CLR) as a criterion for speaker clustering within a speaker diarization system. The CLR has previously been shown to be a robust decision criterion for speaker clustering using Gaussian Mixture Models. Recently, eigenvoice modeling techniques have become increasingly popular, due to their ability to adequately represent a speaker based on sparse training data, as well as their improved capture of differences in speaker characteristics. This paper hence proposes that it would be beneficial to capitalize on the advantages of eigenvoice modeling in a CLR framework. Results obtained on the 2002 Rich Transcription (RT-02) Evaluation dataset show an improved clustering performance, resulting in a 35.1% relative improvement in the overall Diarization Error Rate (DER) compared to the baseline system.

Wed-Ses1-O3:
ASR - New Paradigms

Time:Wednesday 10:00 Place:Brunelleschi (Green Room) - Pala Congressi - 2nd Floor Type:Oral
Chair:John Hansen

10:00New Methods for Template Selection and Compression in Continuous Speech Recognition

Xie Sun (University of Missouri)
Yunxin Zhao (University of Missouri)

We propose a maximum likelihood method for selecting template representatives, and in order to include more information in the selected template representatives, we further propose to create compressed template representatives using a Gaussian mixture model (GMM) merging algorithm. A Kullback-Leibler (KL) divergence based local distance is proposed for Dynamic Time Warping (DTW) in template matching. Experimental results on the tasks of TIMIT phone recognition and large vocabulary continuous speech recognition demonstrated that the proposed template selection method significantly improved the recognition accuracy over the HMM baseline while only 5% or 10% of the templates were selected from the total, and that the template compression method provided further recognition accuracy gains over the template selection method.
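To illustrate how a KL-divergence local distance can be plugged into DTW, here is a minimal sketch. It rests on assumptions that the abstract does not state: each template frame is represented by a single diagonal-covariance Gaussian (rather than a GMM), the local distance is the symmetric KL divergence, and the warping uses the basic three-step recursion. All names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

# A frame model: (means, variances) of a diagonal-covariance Gaussian.
Gaussian = Tuple[Sequence[float], Sequence[float]]


def kl_diag(p: Gaussian, q: Gaussian) -> float:
    """KL(p || q) for two diagonal-covariance Gaussians."""
    (mean_p, var_p), (mean_q, var_q) = p, q
    return 0.5 * sum(
        math.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0
        for mp, vp, mq, vq in zip(mean_p, var_p, mean_q, var_q)
    )


def symmetric_kl(p: Gaussian, q: Gaussian) -> float:
    """Symmetrised KL divergence, used here as the DTW local distance."""
    return kl_diag(p, q) + kl_diag(q, p)


def dtw(x: List[Gaussian], y: List[Gaussian]) -> float:
    """Accumulated cost of the best alignment between two templates."""
    n, m = len(x), len(y)
    cost = [[math.inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = symmetric_kl(x[i - 1], y[j - 1])
            # Allowed moves: insertion, deletion, or diagonal match.
            cost[i][j] = local + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return cost[n][m]


# Toy one-dimensional templates; the second is a time-stretched copy.
template_a = [([0.0], [1.0]), ([1.0], [1.0])]
template_b = [([0.0], [1.0]), ([0.0], [1.0]), ([1.0], [1.0])]
print(dtw(template_a, template_b))
```

Because the second toy template only repeats a frame of the first, the warping path absorbs the repetition and the accumulated cost is zero.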

10:20Structured Support Vector Machines for Noise Robust Continuous Speech Recognition

Shi-Xiong Zhang (Department of Engineering, University of Cambridge)
M.J.F. Gales (Department of Engineering, University of Cambridge)

The use of discriminative models is an interesting alternative to generative models for speech recognition. This paper examines one form of these models, structured support vector machines (SVMs), for noise robust speech recognition. One important aspect of structured SVMs is the form of the joint feature space. In this work features based on generative models are used, which allows model-based compensation schemes to be applied to yield robust joint features. However, these features require the segmentation of frames into words, or sub-words, to be specified. In previous work this segmentation was obtained using generative models. Here the segmentations are refined using the parameters of the structured SVM. A Viterbi-like scheme for obtaining “optimal” segmentations, and modifications to the training algorithm to allow them to be efficiently used, are described. The performance of the approach is evaluated on a noise corrupted continuous digit task: AURORA 2.

10:40Continuous Digits Recognition Leveraging Invariant Structure

Masayuki Suzuki (The University of Tokyo)
Gakuto Kurata (IBM Research - Tokyo)
Masafumi Nishimura (IBM Research - Tokyo)
Nobuaki Minematsu (The University of Tokyo)

Recently, an invariant structure of speech was proposed, where the inevitable acoustic variations caused by non-linguistic factors are effectively removed from speech. The invariant structure was applied to isolated word recognition, and the experimental results showed good performance. However, the previous method cannot be applied to continuous speech recognition directly because there was no efficient decoding algorithm. In this paper, we propose a method to leverage the invariant structure in continuous digits recognition. We use a traditional HMM-based Automatic Speech Recognition (ASR) system to get N-best lists with phone alignments. Then we construct invariant structures using these phone alignments, and re-rank the N-best lists by investigating whether the invariant structure for each hypothesis is appropriate or not. Experimental results show a relative WER improvement of 17.4% over the baseline ASR system.

11:00Convergence of Line Search A-Function methods

Dimitri Kanevsky (IBM)
David Nahamoo (IBM)
Tara Sainath (IBM)
Bhuvana Ramabhadran (IBM)

Recently, the Line Search A-Function (LSAF) was introduced as a technique that generalizes the Extended Baum-Welch (EBW) algorithm for functions of continuous probability densities. It was shown that LSAF provides a unified scheme for a large class of optimization problems that involve discriminant objective functions of different probability densities or sparse representation objective functions such as Approximate Bayesian Compressive Sensing. In this paper, we show that a discrete EBW recursion (that was initially developed to optimize functions of discrete distributions) also fits the scope of the LSAF technique. We demonstrate the utility and robustness of the technique for discrete distributions through the experimental setup of a TIMIT phone classification task using a Convex Hull Sparse Representation approach with different Lq regularization (q being any positive number).

11:20Hidden Boosted MMI and Hierarchical State Posterior Feature for Automatic Speech Recognition based on Hidden Conditional Neural Fields

Yasuhisa Fujii (Department of Computer Science and Engineering, Toyohashi University of Technology, Japan)
Kazumasa Yamamoto (Department of Computer Science and Engineering, Toyohashi University of Technology, Japan)
Seiichi Nakagawa (Department of Computer Science and Engineering, Toyohashi University of Technology, Japan)

We have investigated automatic speech recognition using Hidden Conditional Neural Fields (HCNF). In this paper, we propose a new objective function, Hidden Boosted MMI (HB-MMI) that considers the number of errors in the training data even if the correct state sequence is not known for training the HCNF. The experimental results show that HB-MMI can improve recognition accuracy if overfitting does not occur. We also present an automatic speech recognition method using a hierarchical state posterior feature where the output from the first stage HCNF is used as input for the second stage HCNF. The experimental results show that the feature improves recognition accuracy. By combining both of the proposed methods, we obtain further improvements.

11:40Recognition and Real Time Performances of a Lightweight Ultrasound Based Silent Speech Interface Employing a Language Model

Jun Cai (Université Pierre et Marie Curie, Paris, France)
Bruce Denby (Université Pierre et Marie Curie, Paris, France)
Pierre Roussel (SIGMA Laboratory, ESPCI ParisTech, CNRS-UMR 7084, Paris, France)
Gerard Dreyfus (SIGMA Laboratory, ESPCI ParisTech, CNRS-UMR 7084, Paris, France)
Lise Crevier-Buchman (Laboratoire de Phonétique et Phonologie, CNRS-UMR 7018, Paris, France)

The work presents advances in the implementation of an ultrasound based silent speech interface system. Use of a portable acquisition device, a visual speech recognizer system with a language model, and real time tests with the Julius system are described. Experiments with two types of visual feature extraction are also presented. Results show that good recognition and real time performance can be obtained with a portable silent speech interface employing a language model.

Wed-Ses1-S3:
Speech Processing Tools

Time:Wednesday 10:00 Place:Donatello (Room Onice) - Pala Congressi - Ground Floor Type:Poster
Chair:Christoph Draxler

#1Speech Processing Tools - An Introduction to Interoperability

Christoph Draxler (Institute of Phonetics and Speech Processing, LMU Munich)
Toomas Altosaar (Aalto University School of Science and Technology, Espoo, Finland)
Sadaoki Furui (Dept. of Computer Science, Tokyo Institute of Technology, Japan)
Mark Liberman (Dept. of Linguistics, University of Pennsylvania, Philadelphia PA, USA)
Peter Wittenburg (Max Planck Institute for Psycholinguistics, Nijmegen, The Netherlands)

Research and development in the field of spoken language depends critically on the existence of software tools. A large range of excellent tools have been developed and are widely used today. Most tools were developed by individuals who recognized the need for a given tool, had the necessary conceptual and programming skills, and were deeply rooted in the application field, namely spoken language. Excellent tools are a prerequisite to research. However, tool developers rarely receive academic recognition for their efforts. Journals, conferences and funding agencies are interested in the results of the work on a research question while the tools developed to achieve these results are of less interest. The Interspeech 2011 special event on speech processing tools aims to provide a forum for tool developers to improve their academic visibility and thus enhance their motivation to continue developing the software needed by the community.

#2EasyAlign: an automatic phonetic alignment tool under Praat

Jean-Philippe Goldman (University of Geneva)

We provide a user-friendly automatic phonetic alignment tool for continuous speech, named EasyAlign. It is developed as a plug-in of Praat, the popular speech analysis software, and it is freely available. Its main advantage is that one can easily align speech from an orthographic transcription. It requires a few minor manual steps and the result is a multi-level annotation within a TextGrid composed of phonetic, syllabic, lexical and utterance tiers. Evaluation showed that the performance of this HTK-based aligner is comparable to human alignment and to other existing alignment tools. It was originally fully available for French and English. Community interest in its extension to other languages helped to develop a straightforward methodology to add languages. While Spanish and Taiwan Min were recently added, other languages are under development.

#3MTRANS: A multi-channel, multi-tier speech annotation tool

Julián Villegas (Ikerbasque (Basque Science Foundation), Spain)
Martin Cooke (Ikerbasque (Basque Science Foundation), Spain and Language and Speech Laboratory, Universidad del Pais Vasco, Spain)
Vincent Aubanel (Ikerbasque (Basque Science Foundation), Spain)
Marco A. Piccolino-Boniforti (Dept. Linguistics, Univ. Cambridge, UK)

MTRANS, a freely available tool for annotating multi-channel speech is presented. This software tool is designed to provide visual and aural display flexibility required for transcribing multi-party conversations; in particular, it eases the analysis of speech overlaps by overlaying waveforms and spectrograms (with controllable transparency), and the mapping from media channels to annotation tiers by allowing arbitrary associations between them. MTRANS supports interoperability with other tools via the Open Sound Control protocol.

#4The JSafran platform for semi-automatic speech processing

Christophe Cerisara (LORIA-CNRS UMR 7503)
Claire Gardent (LORIA-CNRS UMR 7503)

JSafran is an open-source Java platform for editing, annotating and transforming speech corpora both manually and automatically at many levels: transcription, alignment, morphosyntactic tagging, syntactic parsing and semantic role labelling. It integrates preconfigured state-of-the-art libraries for this purpose, including the Sphinx4, TreeTagger, OpenNLP, MaltParser and MATE applications, as well as the companion JTrans software for text-to-speech alignment and transcription. Despite the complexity of such speech processing tasks, JSafran has been designed to maximize simplicity both for the end-user, thanks to an easy-to-use GUI that controls all of these automatic and manual annotation functionalities, and for the developer, thanks to well-defined interfaces and to the multi-level stand-off annotation paradigm. JSafran has been used so far for several tasks, including the creation of a new French treebank on top of the broadcast news ESTER corpus.

#5The Social Signal Interpretation Framework (SSI) for Real Time Signal Processing and Recognition

Johannes Wagner (Lab for Human Centered Multimedia, Augsburg University)
Florian Lingenfelser (Lab for Human Centered Multimedia, Augsburg University)
Elisabeth Andre (Lab for Human Centered Multimedia, Augsburg University)

The construction of systems for recording, processing and recognising a human's social and affective signals is a challenging effort that involves numerous but necessary sub-tasks. In this article, we introduce our Social Signal Interpretation (SSI) tool, a framework dedicated to supporting the development of such systems. It provides a flexible architecture to construct pipelines that handle multiple modalities like audio or video and to establish on- and offline recognition tasks. The plug-in system of SSI encourages developers to integrate external code, while an XML interface allows anyone to write their own applications with a simple text editor. Furthermore, data recording, annotation and classification can be done using a straightforward graphical user interface, allowing simple access for inexperienced users.

#6ELAN – aspects of interoperability and functionality

Han Sloetjes (Max Planck Institute for Psycholinguistics, The Language Archive, Nijmegen, The Netherlands)
Peter Wittenburg (Max Planck Institute for Psycholinguistics, The Language Archive, Nijmegen, The Netherlands)
Aarthy Somasundaram (Max Planck Institute for Psycholinguistics, The Language Archive, Nijmegen, The Netherlands)

ELAN is a multimedia annotation tool that has been developed for roughly ten years now and is still being extended and improved in, on average, two or three major updates per year. This paper describes the current state of the application, the main areas of attention of the past few years and the plans for the near future. The emphasis will be on various interoperability issues: interoperability with other tools through file conversions, process based interoperability with other tools by means of commands sent to or received from other applications, interoperability on the level of the data model and semantic interoperability.

#7Open source voice creation toolkit for the MARY TTS Platform

Marc Schröder (DFKI GmbH)
Marcela Charfuelan (DFKI GmbH)
Sathish Pammi (DFKI GmbH)
Ingmar Steiner (INRIA/LORIA Speech Group)

This paper describes an open source voice creation toolkit that supports the creation of unit selection and HMM-based voices, for the MARY (Modular Architecture for Research on speech Synthesis) TTS platform. The toolkit can be easily employed to create voices in the languages already supported by MARY TTS, but also provides the tools and generic reusable run-time system modules to add new languages. The voice creation toolkit is mainly intended to be used by research groups on speech technology throughout the world, notably those who do not have their own pre-existing technology yet. We try to provide them with a reusable technology that lowers the entrance barrier for them, making it easier to get started. The toolkit is developed in Java and includes an intuitive Graphical User Interface (GUI) for most of the common tasks in the creation of a synthetic voice. We present the toolkit and discuss a number of interoperability issues.

#8Java Visual Speech Components for Rapid Application Development of GUI based Speech Processing Applications

Stefan Steidl (International Computer Science Institute (ICSI))
Korbinian Riedhammer (Computer Science Department, University of Erlangen-Nuremberg, Germany)
Tobias Bocklet (Computer Science Department, University of Erlangen-Nuremberg, Germany)
Florian Hönig (Computer Science Department, University of Erlangen-Nuremberg, Germany)
Elmar Nöth (Computer Science Department, University of Erlangen-Nuremberg, Germany)

In this paper, we describe a new Java framework for an easy and efficient way of developing new GUI based speech processing applications. Standard components are provided to display the speech signal, the power plot, and the spectrogram. Furthermore, a component to create a new transcription and to display and manipulate an existing transcription is provided, as well as a component to display and manually correct external pitch values. These Swing components can be easily embedded into one's own Java programs. They can be synchronized to display the same region of the speech file. The object-oriented design provides base classes for rapid development of custom components.

#9mTalk - A Multimodal Browser for Mobile Services

Michael Johnston (AT&T Labs - Research, Inc.)
Giuseppe Di Fabbrizio (AT&T Labs - Research, Inc.)
Simon Urbanek (AT&T Labs - Research, Inc.)

The mTalk multimodal browser is a tool which enables rapid prototyping for research and development of mobile multimodal interfaces combining natural modalities such as speech, touch, and gesture. mTalk integrates a broad range of open standards for authoring graphical and spoken user interfaces and is supported by a cloud-based multimodal processing architecture. In this paper, we describe mTalk and illustrate its capabilities through examination of a series of sample applications.

#10Web-based automatic speech recognition service - webASR

Stuart Nicholas Wrigley (University of Sheffield)
Thomas Hain (University of Sheffield)

A state-of-the-art automatic speech recognition (ASR) system was developed as part of the AMIDA project, whose core domain was the transcription of small to medium sized meetings. The system has performed well in recent NIST evaluations (RT'07 and RT'09). This research-grade ASR system has now been made available as a free web service (webASR) targeting non-commercial researchers. Access to the service is via a standard browser-based interface as well as an API. The service provides the facility to upload audio recordings which are then processed by the ASR system to produce a word-level transcript. Such transcripts are available in a range of formats to suit different needs and technical expertise. The API allows the core webASR functionality to be integrated seamlessly into applications and services. Detailed descriptions of the system design and user interface are provided.

#11A Web based Speech Transcription Workplace

Markus Klehr (European Media Laboratory GmbH, Heidelberg, Germany)
Andreas Ratzka (European Media Laboratory GmbH, Heidelberg, Germany)
Thomas Ross (European Media Laboratory GmbH, Heidelberg, Germany)

We describe our web based speech transcription tool EML Transcription Workplace (TWP). Apart from its main purpose of annotating audio data, it also includes support for the management of transcription data, ASR based pre-transcription, assignment of work packages to specific users, user management and a correction/verification workflow. These features help to increase the productivity of both transcriptionists and supervisors and facilitate further processing.

#12WinPitch, a multimodal tool for speech analysis of endangered languages

Philippe Martin (UFRL, Université Paris Diderot)

WinPitch is a speech analysis program running on PC and Mac for acoustical analysis of speech corpora. It includes a large number of specialized functions to transcribe, align and analyze large sound and video recordings. It supports multiple hierarchical layers for segmentation (up to 96 layers), speaker lists, and overlapping speech. Various character encodings, including Unicode, are supported, with optional right to left text display for Arabic and Hebrew transcriptions. Interfaces with other popular speech analysis programs are provided, as well as standard alignment input and output in XML format. Many functions are devoted to the transcription, alignment and description of less documented languages, such as slow speed playback, programmable keyboard, automatic lexicon generation and text labeling. Various software functions are described together with their applications to the analysis of Parkatêjê, a Timbira language spoken in Amazonia by about 400 speakers.

#13Recording caregiver interactions for machine acquisition of spoken language using the KLAIR virtual infant

Mark Huckvale (University College London)

The goal of the KLAIR project is to facilitate research into the computational modelling of spoken language acquisition. Previously we have described the KLAIR toolkit that implements a virtual infant that can see, hear and talk. In this paper we describe how the toolkit has been enhanced and extended to make it easier to build interactive applications that promote dialogues with human subjects, and also to record and document them. Primary developments are the introduction of 3D models, integration of speech recognition, real-time video recording, support for .NET languages, and additional tools for supporting interactive experiments. An example experimental configuration is described in which KLAIR appears to learn how to say the names of toys in order to encourage dialogue with caregivers.

Wed-Ses1-O2:
Prosody I

Time:Wednesday 10:00 Place:Leonardo - Pala Affari - Ground Floor Type:Oral
Chair:Gérard Bailly

10:00A quantitative investigation of the prosody of Verum Focus in Italian

Giuseppina Turco (Max Planck Institute for Psycholinguistics, Nijmegen, The Netherlands)
Michele Gubian (Centre for Language & Speech Technology, Radboud University, Nijmegen, The Netherlands)
Jessamyn Schertz (Centre for Language & Speech Technology, Radboud University, Nijmegen, The Netherlands)

In this study we present a preliminary investigation of the prosodic marking of Verum focus (VF) in Italian, which is said to be realized with a pitch accent on the finite verb (e.g. A: Paul has not eaten the banana - B: (No), Paul HAS eaten the banana!). We tried to discover whether and how Italian speakers prosodically mark VF when producing full-fledged sentences using a semi-spontaneous production experiment on 27 speakers. Speech rate and f0 contours were extracted using automatic data processing tools and were subsequently analysed using Functional Data Analysis (FDA), which allowed for automatic visualization of patterns in the contour shapes. Our results show that the postfocal region of VF sentences exhibits faster speech rate and lower f0 compared to non-VF cases. However, an expected consistent difference of f0 effect on the focal region of the VF sentence was not found in this analysis.

10:20Effects of focus on f0 and duration in Irish (Gaelic) declaratives

Amelie Dorn (Trinity College Dublin)
Ailbhe Ní Chasaide (Trinity College Dublin)

This pilot study investigates the effects of focus (broad, narrow and contrastive) on tonal patterns, f0 scaling and duration of accented syllables and rhythmic feet for a controlled dataset in Donegal Irish. Results show differences in pre-focal tonal patterns between broad focus and the other focus types. Narrow and contrastive focus renditions are implemented by largely the same phonetic means. Focused domains are overall longer in duration and have wider f0 excursion than broad focus. Durational differences depend on sentence position.

10:40The phonology and phonetics of perceived prosody: What do listeners imitate?

Jennifer Cole (University of Illinois at Urbana-Champaign)
Stefanie Shattuck-Hufnagel (Massachusetts Institute of Technology)

An imitation experiment tests the hypothesis that when asked to reproduce a spontaneously-spoken utterance that they hear, speakers imitate the prosody of the stimulus in its phonological structure more accurately than the phonetic details. Results suggest that speakers rarely distort the presence of a pitch accent or an intonational phrase boundary, but more often change the nature of the phonetic cues, e.g. the duration of a pause or the occurrence of irregular pitch periods associated with boundaries and accents in American English. These findings argue for an encoding of phonological prosodic structure that is separate from the phonetic cues that signal that structure.

11:00Uncovering the effect of imitation on tonal patterns of French Accentual Phrases

Amandine Michelas (Aix-Marseille Université & Laboratoire Parole et Langage)
Noël Nguyen (Aix-Marseille Université & Laboratoire Parole et Langage)

French accentual phrases (APs) are characterized by the presence of a typical final f0 rise and an optional/additional initial f0 rise. This study tested whether between-speaker speech imitation influenced the realization of AP tonal patterns. The experiment was based on 3-syllable APs, whose tonal patterns differed in the potential placement of an initial rise. In two shadowing tasks (without/with explicit instructions to imitate the speaker’s way of pronouncing the stimuli), participants produced more initial rises when they heard a stimulus including both initial and final rises relative to stimuli in which only a final rise was present. Thus, imitation influences the realization of AP tonal patterns in French.

11:20Crossmodal prosodic and gestural contribution to the perception of contrastive focus

Pilar Prieto (ICREA- Universitat Pompeu Fabra)
Cecilia Pugliesi (Universitat Pompeu Fabra)
Joan Borràs-Comes (Universitat Pompeu Fabra)
Ernesto Arroyo (Universitat Pompeu Fabra)
Josep Blat (Universitat Pompeu Fabra)

Speech prosody has traditionally been analyzed in terms of acoustic features. Even though visual features and gestures have been shown to help and enhance linguistic processing, the conventional view is that facial and body gesture information in oral (non-sign) languages tends to be redundant and has the role of helping the hearer recover the meaning of an utterance. We conducted two perception experiments with a 3D animated character showing conflicting auditory and visual information to investigate two related questions regarding the importance of gestures in conveying prosodic meaning: (a) how important are facial cues with respect to auditory cues for the perception of contrastive focus?; and (b) what is the relevance of the different gestural movements (i.e., head nod and eyebrow raising) for the perception of this type of focus? Our findings reveal that the visual component is crucial in the semantic interpretation of contrastive focus.

11:40Temporal relationship between auditory and visual prosodic cues

Erin Cvejic (MARCS Auditory Laboratories, University of Western Sydney, Australia)
Jeesun Kim (MARCS Auditory Laboratories, University of Western Sydney, Australia)
Chris Davis (MARCS Auditory Laboratories, University of Western Sydney, Australia)

It has been reported that non-articulatory visual cues to prosody tend to align with auditory cues, emphasizing auditory events that are in close alignment (visual alignment hypothesis). We investigated the temporal relationship between visual and auditory prosodic cues in a large corpus of utterances to determine the extent to which non-articulatory visual prosodic cues align with auditory ones. Six speakers saying 30 sentences in three prosodic conditions (x2 repetitions) were recorded in a dialogue exchange task, to measure how often eyebrow movements and rigid head tilts aligned with auditory prosodic cues, the temporal distribution of such movements, and the variation across prosodic conditions. The timing of brow raises and head tilts was not aligned with auditory cues, and the occurrence of visual cues was inconsistent, lending little support to the visual alignment hypothesis. Different types of visual cues may combine with auditory cues in different ways to signal prosody.

Wed-Ses1-O4:
Spoken Dialogue Systems II

Time:Wednesday 10:00 Place:Michelangelo - Pala Affari - 2nd Floor Type:Oral
Chair:Steve Young

10:00Optimizing Situated Dialogue Management in Unknown Environments

Heriberto Cuayahuitl (German Research Center for Artificial Intelligence (DFKI))
Nina Dethlefs (University of Bremen)

We present a conversational learning agent that helps users navigate through complex and challenging spatial environments. The agent exhibits adaptive behaviour by learning spatially-aware dialogue actions while the user carries out the navigation task. To this end, we use Hierarchical Reinforcement Learning with relational representations to efficiently optimize dialogue actions tightly-coupled with spatial ones, and Bayesian networks to model the user's beliefs of the navigation environment. Since these beliefs are continuously changing, we induce the agent's behaviour in real time. Experimental results, using simulation, are encouraging: they show efficient adaptation to the user's navigation knowledge, specifically to the generated route and the intermediate locations to negotiate with the user.

10:20Acoustic-similarity based technique to improve concept recognition

Om D Deshmukh (IBM Research India)
Shajith Ikbal (IBM Research India)
Ashish Verma (IBM Research India)
Etienne Marcheret (IBM Watson Research Center)

In this work we propose an acoustic-similarity based technique to improve the recognition of in-grammar utterances in typical directed-dialog applications where the Automatic Speech Recognition (ASR) system consists of one or more class-grammars embedded in the Language Model (LM). The proposed technique increases the transition cost of LM paths by a value proportional to the average acoustic similarity between that LM path and all the in-grammar utterances. Proposed modifications improve the in-grammar concept recognition rate by 0.5% absolute at lower grammar fanouts and by about 2% at higher fanouts as compared to a technique which reduces the probability of entering all the LM paths by a uniform value. The improvements are more pronounced as the fanout size of the grammar is increased and especially at operating points corresponding to lower False Accept (FA) values.
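
The cost adjustment described in this abstract can be illustrated with a minimal sketch. It assumes each LM path already has a list of acoustic-similarity scores against the in-grammar utterances. The path names, the scores, and the scaling factor `alpha` are hypothetical, and the abstract does not say how the similarity itself is computed.

```python
def penalized_costs(path_costs, similarities, alpha):
    """Add alpha * (average acoustic similarity) to each LM path's cost.

    path_costs:   dict mapping an LM path name to its transition cost
    similarities: dict mapping the same names to similarity scores
                  against the in-grammar utterances (hypothetical values)
    alpha:        proportionality constant (an assumed tuning parameter)
    """
    new_costs = {}
    for path, cost in path_costs.items():
        sims = similarities[path]
        avg_similarity = sum(sims) / len(sims)
        new_costs[path] = cost + alpha * avg_similarity
    return new_costs


# Two illustrative LM paths with equal starting cost.
path_costs = {"path_a": 2.0, "path_b": 2.0}
similarities = {"path_a": [0.9, 0.7], "path_b": [0.1, 0.3]}
costs = penalized_costs(path_costs, similarities, alpha=1.0)
```

In this toy case the path that sounds more like the in-grammar utterances ends up with the higher cost, in contrast to the uniform penalty that the abstract uses as its comparison point.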

10:40Dialog Methods for Improved Alphanumeric String Capture

Doug Peters (Nuance Communications)
Peter Stubley (Nuance Communications)

In this paper, we consider advances in automated over-the-phone alphanumeric string capture. For this task, acoustic confusions typically result in significant error rates. Of course, confusions also exist in human-to-human communication. However, humans employ dialog-level strategies with which to disambiguate confusions and correct errors – allowing high-fidelity transmission of alphanumeric strings across all but the noisiest of channels. These human strategies are examined and a subset amenable to automation is identified. The resulting automated error-correction dialog achieves 30% dialog error rate reduction compared to a conventional application in a high-volume commercial deployment. Further, the fact that there are many recognition errors in the context of a structurally simple dialog recommends this task for dialog optimization. We present an example of offline optimization and discuss the potential for online learning.

11:00Detecting the Status of a Predictive Incremental Speech Understanding Model for Real-Time Decision-Making in a Spoken Dialogue System

David DeVault (USC Institute for Creative Technologies)
Kenji Sagae (USC Institute for Creative Technologies)
David Traum (USC Institute for Creative Technologies)

We explore the potential for a responsive spoken dialogue system to use the real-time status of an incremental speech understanding model to guide its incremental decision-making about how to respond to a user utterance that is still in progress. Spoken dialogue systems have a range of potentially useful real-time response options as a user is speaking, such as providing acknowledgments or backchannels, interrupting the user to ask a clarification question or to initiate the system's response, or even completing the user's utterance at appropriate moments. However, implementing such incremental response capabilities seems to require that a system be able to assess its own level of understanding incrementally, so that an appropriate response can be selected at each moment. In this paper, we use a data-driven classification approach to explore the trade-offs that a virtual human dialogue system faces in reliably identifying how its understanding is progressing during a user utterance.

11:20User Simulation in Dialogue Systems using Inverse Reinforcement Learning

Senthilkumar Chandramohan (Supelec / LIA - UAPV)
Matthieu Geist (Supelec)
Fabrice Lefevre (LIA - UAPV)
Olivier Pietquin (Supelec / UMI 2958 (CNRS - GeorgiaTech))

Spoken Dialogue Systems (SDS) are man-machine interfaces which use natural language as the medium of interaction. Dialogue corpora generation for the purpose of training and evaluating dialogue systems is an expensive process. User simulators focus on simulating human users in order to generate synthetic data. Existing methods for user simulation mainly focus on generating data with the same statistical consistency as in the dialogue corpus. This paper outlines a novel approach for user simulation based on Inverse Reinforcement Learning (IRL). The task of building the user simulator is perceived as a task of imitation learning.

11:40Lossless Value Directed Compression of Complex User Goal States for Statistical Spoken Dialogue Systems

Paul A. Crook (Interaction Lab, Heriot-Watt University, Edinburgh, UK)
Oliver Lemon (Interaction Lab, Heriot-Watt University, Edinburgh, UK)

This paper presents initial results in the application of Value Directed Compression (VDC) to spoken dialogue management states for reasoning about complex user goals. On a small but realistic SDS problem VDC generates a lossless compression which achieves a 6 fold reduction in the number of dialogue states required by a Partially Observable Markov Decision Process (POMDP) dialogue manager (DM). Reducing the number of dialogue states reduces the computational power, memory and storage requirements of the hardware used to deploy such POMDP SDSs, thus increasing the complexity of the POMDP SDSs which could theoretically be deployed. In addition, in the case when on-line reinforcement learning is used to learn the DM policy, it should lead to, in this case, a 6 fold reduction in policy learning time. These are the first automatic compression results that have been presented for POMDP SDS states which represent user goals as sets over the possible domain objects.

Wed-Ses1-S1:
Speaker State Challenge - Intoxication and Sleepiness I

Time:Wednesday 10:00 Place:Raffaello - Pala Affari - 3rd Floor Type:Oral
Chair:Bjoern Schuller

10:00The INTERSPEECH 2011 Speaker State Challenge

Björn Schuller (Technische Universität München)
Stefan Steidl (ICSI)
Anton Batliner (FAU Erlangen-Nuremberg)
Florian Schiel (BAS, Ludwig-Maximilians-Universität München)
Jarek Krajewski (University of Wuppertal)

While the first open comparative challenges in the field of paralinguistics targeted more 'conventional' phenomena such as emotion, age, and gender, there still exists a multiplicity of not yet covered, but highly relevant speaker states and traits. The INTERSPEECH 2011 Speaker State Challenge thus addresses two new sub-challenges to overcome the usually low compatibility of results: In the Intoxication Sub-Challenge, alcoholisation of speakers has to be determined in two classes; in the Sleepiness Sub-Challenge, another two-class classification task has to be solved. This paper introduces the conditions, the Challenge corpora "Alcohol Language Corpus" and "Sleepy Language Corpus", and a standard feature set that may be used. Further, baseline results are given.

10:20Combining Multiple Phoneme-based Classifiers with Audio Feature-based Classifier for the Detection of Alcohol Intoxication

Claude Montacié (STIH Laboratory, Paris Sorbonne University, France)
Marie-José Caraty (LIPADE laboratory, Paris Descartes University, France)

This article describes the two systems which we submitted for the Intoxication Sub-Challenge of the INTERSPEECH 2011 Speaker State Challenge. First, we developed an extended Baseline System with a significant improvement of the unweighted accuracy compared to the Official Baseline System (OBS) on the development set. Then, we investigated the phonetic variations of speech under alcoholisation and developed gender-dependent Phoneme-based SVM classifiers. For this purpose, we selected the most relevant phonemes and investigated a system combining six Phoneme-based SVM classifiers. Its accuracy is slightly below that of the OBS. Finally, the combination of the two systems is presented.

10:40Intoxication Detection using Phonetic, Phonotactic and Prosodic Cues

Fadi Biadsy (Computer Science Department, Columbia University, New York, USA)
William Yang Wang (Computer Science Department, Columbia University, New York, USA)
Andrew Rosenberg (Computer Science Department, Queens College (CUNY), New York, USA)
Julia Hirschberg (Computer Science Department, Columbia University, New York, USA)

In this paper, we investigate multiple approaches for automatically detecting intoxicated speakers given samples of their speech. Intoxicated speech in a given language can be viewed simply as a different accent of this language; therefore we adopt our recent approach to dialect and accent recognition to detect intoxication. The system models phonetic structural differences across sober and intoxicated speakers. This approach employs SVM with a kernel function that computes similarities between adapted phone GMMs which summarize speakers' phonetic characteristics in their utterances. We also investigate additional cues, such as prosodic events, phonotactics and phonetic durations under intoxicated and sober conditions. We find that our phonetic-based system when combined with phonotactic features provides us with our best result on the official development set, an accuracy of 73% and an equal error rate of 26.3%, significantly higher than the official baseline.

11:00Drink and Speak: On the automatic classification of alcohol intoxication by acoustic, prosodic and text-based features

Tobias Bocklet (University of Erlangen-Nuremberg)
Korbinian Riedhammer (University of Erlangen-Nuremberg)
Elmar Nöth (University of Erlangen-Nuremberg)

This paper focuses on the automatic detection of a person’s blood alcohol based on automatic speech processing approaches. We compare different feature sets on the ALC dataset of the Interspeech 2011 Speaker State Challenge. Three feature sets are based on spectral observations: TRAPS, MFCC, and PLP. These are modeled by GMMs. Classification is either done by a Gaussian classifier or by SVMs. A prosodic system extracts a 292-dimensional feature vector. Transcription-based systems employ features from the available transcription. We compare the stand-alone performances of these systems and combine them on score level. Combination on score level achieved a significant improvement of 15% on the development set. On the test set we achieved a UA of 68.63%, which is a significant relative improvement of more than 5% compared to the baseline system.

11:20Intoxicated Speech Detection Using Hierarchical Features and Iterative Speaker Normalization

Daniel Bone (Signal Analysis and Interpretation Laboratory (SAIL), Los Angeles, CA, USA)
Matthew P. Black (Signal Analysis and Interpretation Laboratory (SAIL), Los Angeles, CA, USA)
Ming Li (Signal Analysis and Interpretation Laboratory (SAIL), Los Angeles, CA, USA)
Angeliki Metallinou (Signal Analysis and Interpretation Laboratory (SAIL), Los Angeles, CA, USA)
Sungbok Lee (Signal Analysis and Interpretation Laboratory (SAIL), Los Angeles, CA, USA)
Shrikanth S. Narayanan (Signal Analysis and Interpretation Laboratory (SAIL), Los Angeles, CA, USA)

Speaker state recognition is a challenging problem due to speaker and context variability. Intoxication detection is an important area of paralinguistic speech research with potential real-world applications. In this work, we build upon a base set of various static acoustic features by proposing the combination of several different methods for this learning task. The methods include extracting hierarchical acoustic features, performing iterative speaker normalization, and using a set of GMM supervectors. We obtain an optimal unweighted recall for intoxication recognition using score-level fusion of these subsystems. Unweighted average recall performance is 70.54% on the test set, an improvement of 4.64% absolute (7.04% relative) over the baseline model accuracy of 65.9%.
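
The figures quoted in this abstract can be checked with a short sketch of unweighted average recall and of the absolute versus relative improvement over a baseline. The function names and the toy labels are illustrative assumptions. Only the 70.54% and 65.9% values come from the abstract.

```python
def unweighted_average_recall(y_true, y_pred):
    """Mean of the per-class recalls, so each class counts equally."""
    classes = sorted(set(y_true))
    recalls = []
    for c in classes:
        total = sum(1 for t in y_true if t == c)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        recalls.append(hits / total)
    return sum(recalls) / len(recalls)


def improvement(system, baseline):
    """Return (absolute, relative) improvement; relative is in percent."""
    absolute = system - baseline
    relative = 100.0 * absolute / baseline
    return absolute, relative


# Toy labels (hypothetical): 4 sober and 2 intoxicated recordings.
y_true = ["sober", "sober", "sober", "sober", "intox", "intox"]
y_pred = ["sober", "sober", "sober", "intox", "intox", "sober"]
uar = unweighted_average_recall(y_true, y_pred)  # (3/4 + 1/2) / 2

# Values reported in the abstract: 70.54% on test versus a 65.9% baseline.
abs_gain, rel_gain = improvement(70.54, 65.9)
```

Rounded to two decimals, `abs_gain` and `rel_gain` reproduce the 4.64% absolute and 7.04% relative improvement stated above.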

11:40Attention, Sobriety Checkpoint! Can Humans Determine by Means of Voice, if Someone is Drunk... and can Automatic Classifiers Compete?

Stefan Ultes (Institute of Information Technology, University of Ulm, Germany)
Alexander Schmitt (Institute of Information Technology, University of Ulm, Germany)
Wolfgang Minker (Institute of Information Technology, University of Ulm, Germany)

This paper analyzes the human performance of recognizing drunk speakers merely by voice and compares the results with the performance of an automatic statistical classifier. The study is carried out within the Interspeech 2011 Speaker State Challenge employing the Alcohol Language Corpus (ALC). The 79 subjects yielded an average performance of 55.8% unweighted accuracy on a balanced intoxicated/non-intoxicated sample set. The statistical classifier developed in this study reaches a performance of 66.6% unweighted accuracy on the test set. In comparison, the subject with the highest performance yielded 70.0%. Our classifier is based on 4368 acoustic and prosodic features. Incorporating linguistic features along with feature selection using Information Gain Ratio (IGR) ranking added 0.7% absolute improvement while resulting in a 29% smaller feature space.

12:00Does it Groove or Does it Stumble - Automatic Classification of Alcoholic Intoxication Using Prosodic Features

Florian Hönig (Pattern Recognition Lab, University of Erlangen-Nuremberg)
Anton Batliner (Pattern Recognition Lab, University of Erlangen-Nuremberg)

This paper studies how prosodic features can help in the automatic detection of alcoholic intoxication. We compute features that have recently been proposed to model speech rhythm such as the pair-wise variability index for consonantal and vocalic segments (PVI) and study their aptness for the task. Further, we use a large prosodic feature vector modelling the usual candidates - pitch, intensity, and duration - and apply it onto different units such as words, syllables and stressed syllables to create generalizations of the rhythm features mentioned. The results show that the prosodic features computed are suitable for detecting alcoholic intoxication and add complementary information to state-of-the-art features. The database is the intoxication database provided by the organizers of the 2011 Interspeech Speaker State Challenge.
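
As an illustration of the rhythm measure named in this abstract, the following is a minimal sketch of the raw and normalized pair-wise variability index in their commonly used formulation. The abstract does not state which variant the authors computed, so the choice of formulas and the example durations are assumptions.

```python
def raw_pvi(durations):
    """Mean absolute difference between successive interval durations."""
    diffs = [abs(a - b) for a, b in zip(durations, durations[1:])]
    return sum(diffs) / len(diffs)


def normalized_pvi(durations):
    """Like raw_pvi, but each difference is divided by the mean of the
    pair, and the result is scaled by 100."""
    terms = [abs(a - b) / ((a + b) / 2.0)
             for a, b in zip(durations, durations[1:])]
    return 100.0 * sum(terms) / len(terms)


# Hypothetical vocalic interval durations in milliseconds.
vocalic = [100, 50, 100, 50]
r_pvi = raw_pvi(vocalic)
n_pvi = normalized_pvi(vocalic)

# Perfectly regular durations give an index of zero.
flat = normalized_pvi([80, 80, 80])
```

The normalized variant is less sensitive to overall speech rate because each difference is scaled by the local mean duration.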

Wed-Ses1-S2-O:
Speech Technology for Under-Resourced Languages I

Time:Wednesday 10:00 Place:Caravaggio (Adua 1) - Pala Affari - 1st Floor Type:Oral
Chairs:Alexey Karpov, Laurent Besacier

10:00Rapid building of an ASR system for Under-Resourced Languages based on Multilingual Unsupervised Training

Ngoc Thang Vu (Karlsruhe Institute of Technology (KIT))
Franziska Kraus (Karlsruhe Institute of Technology (KIT))
Tanja Schultz (Karlsruhe Institute of Technology (KIT))

This paper presents our work on rapid language adaptation of acoustic models based on multilingual cross-language bootstrapping and unsupervised training. We used Automatic Speech Recognition (ASR) systems in the six source languages English, French, German, Spanish, Bulgarian and Polish to build from scratch an ASR system for Vietnamese, an under-resourced language. System building was performed without using any transcribed audio data by applying three consecutive steps, i.e. cross-language transfer, unsupervised training based on the “multilingual A-stabil” confidence score [1], and bootstrapping. We investigated the correlation between performance of “multilingual A-stabil” and the number of source languages and improved the performance of “multilingual A-stabil” by applying it at the syllable level. Furthermore, we showed that increasing the number of source language ASR systems for the multilingual framework results in better performance of the final ASR system in the target language Vietnamese. The final Vietnamese recognition system has a Syllable Error Rate (SyllER) of 16.8% on the development set and 16.1% on the evaluation set.

10:20Places and Manner of Articulation of Bangla Consonants: An EPG based study

Shyamal Kr Das Mandal (Centre for Educational Technology, Indian Institute of Technology Kharagpur)
Somnath Chandra (Department of Information Technologies, Government of India)
Swaran Lata (Department of Information Technologies, Government of India)
Ashoke Kumar Datta (BOM Public Charitable Trusts, Kolkata, India)

The Bangla phoneme inventory consists of 32 consonants, of which 16 are stops (plosives) and 4 are affricates. This paper presents a detailed investigation of the place and manner of articulation of Bengali phonemes. The place of articulation study of consonants is based on an Electropalatography (EPG) system, and the manner of articulation study is based on an acoustic study of a large number of well-spoken Bengali VCV sequences, in which V represents the seven Bangla vowels /u/, /o/, /ɔ/, /a/, /æ/, /e/ and /i/ while C represents all the consonants of Bangla. The study shows that in Bengali, plosives have three distinct places of articulation, namely dental, alveolar and post-alveolar, and four manners of articulation.

10:40Efficient harvesting of Internet audio for resource-scarce ASR

Marelie Hattingh Davel (CSIR Meraka Institute)
Charl van Heerden (CSIR Meraka Institute)
Neil Kleynhans (CSIR Meraka Institute)

Spoken recordings that have been transcribed for human reading (e.g. as captions for audiovisual material, or to provide alternative modes of access to recordings) are widely available in many languages. Such recordings and transcriptions have proven to be a valuable source of ASR data in well-resourced languages, but have not been exploited to a significant extent in under-resourced languages or dialects. Techniques used to harvest such data typically assume the availability of a fairly accurate ASR system, which is generally not available when working with resource-scarce languages. In this work, we define a process whereby an ASR corpus is bootstrapped using unmatched ASR models in conjunction with speech and approximate transcriptions sourced from the Internet. We introduce a new segmentation technique based on the use of a phone-internal garbage model, and demonstrate how this technique (combined with limited filtering) can be used to develop a large, high-quality corpus in an under-resourced dialect with minimal effort.

Wed-Ses1-P1:
Human Speech Production II

Time:Wednesday 10:00 Place:Valfonda 1 - Pala Congressi (Passi Perduti-Gallery) Type:Poster
Chair:Francis Grenez

#2Articulatory Reduction in Mandarin Chinese Words

Jeffrey Berry (University of Arizona)
Sunjing Ji (University of Arizona)
Ian Fasel (University of Arizona)
Diana Archangeli (University of Arizona)

We investigate the effect of reduction induced by repetition during articulation. Specifically, we report how tongue movement differs between the first mention of Mandarin words and that of later repetitions using ultrasound imaging. Two analyses were carried out in this paper: tongue deformation and timing. We used Dynamic Time Warping to measure the tongue deformation from a neutral position. We used Functional Data Analysis to measure the timing difference between the first and later repetitions. We found that the tongue deviates less from the neutral and moves faster in time for the later repetitions, namely the more reduced ones. Our study shows promise for more thorough investigation of speech reduction from the articulatory perspective, and provides insights for constructing applications for speech synthesis/recognition towards more natural speech.
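
For readers unfamiliar with the alignment technique mentioned in this abstract, here is a minimal sketch of classic Dynamic Time Warping on two scalar sequences. The abstract does not specify the features or the local cost the authors used on the ultrasound data, so the absolute-difference cost and the one-dimensional toy inputs are assumptions.

```python
def dtw_distance(a, b):
    """Accumulated cost of the best monotonic alignment between a and b,
    using absolute difference as the local cost."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = best accumulated cost aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # stretch b
                                     cost[i][j - 1],      # stretch a
                                     cost[i - 1][j - 1])  # step both
    return cost[n][m]


# A time-stretched copy aligns at zero cost.
same_shape = dtw_distance([1, 2, 3], [1, 2, 2, 3])
# A constant offset of 1 over two frames accumulates a cost of 2.
offset = dtw_distance([0, 0], [1, 1])
```

Because the alignment absorbs differences in timing, the remaining cost reflects differences in shape, which is the property a deformation measure needs.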

#3Morphological Variation in the Adult Vocal Tract: A Modeling Study of its Potential Acoustic Impact

Adam Lammert (University of Southern California)
Michael Proctor (University of Southern California)
Athanasios Katsamanis (University of Southern California)
Shrikanth Narayanan (University of Southern California)

In order to fully understand inter-speaker variability in the acoustical and articulatory domains, morphological variability must be considered, as well. Human vocal tracts display substantial morphological differences, all of which have the potential to impact a speaker's acoustic output. The palate and rear pharyngeal wall, in particular, vary widely and have the potential to strongly impact the resonant properties of the vocal tract. To gain a better understanding of this impact, we combine an examination of morphological variation with acoustic modeling experiments. The goal is to show the theoretical acoustic effect of common inter-speaker differences for a set of English vowels. Modeling results indicate that the effect is indeed strong, but also surprisingly complex and context-specific, even when morphology varies in relatively straightforward ways.

#4Analysis and automatic estimation of children's subglottal resonances

Steven M. Lulich (Washington University in Saint Louis)
Harish Arsikere (University of California, Los Angeles)
John R. Morton (Washington University in Saint Louis)
Gary K. F. Leung (University of California, Los Angeles)
Abeer Alwan (University of California, Los Angeles)
Mitchell S. Sommers (Washington University in Saint Louis)

Models and measurements of subglottal resonances are generally made from adult data, but there are several applications in which it would be useful to know about subglottal resonances in children. We therefore conducted an analysis of both new and old recordings of children's subglottal acoustics in order 1) to produce a fuller picture of the variability of children's subglottal resonances, and 2) to confirm that existing models of subglottal acoustics can be reasonably applied to children. We also tested the effectiveness of recent algorithms for estimating children's subglottal resonances from speech formants and the fundamental frequency, which were originally formulated based on adult data. It was found that these algorithms are effective for children at least 150 cm tall.

#5Acceleration Sensor Based Estimates of Subglottal Resonances: Short vs. Long Vowels

Wolfgang Wokurek (Institut fuer Maschinelle Sprachverarbeitung, Universitaet Stuttgart, Deutschland)
Andreas Madsack (Institut fuer Maschinelle Sprachverarbeitung, Universitaet Stuttgart, Deutschland)

The current version ACCV4 of our acceleration sensor device is presented and used to study the influence of vowel duration on the estimates of resonances of the subglottal system of 7 female and 9 male speakers. The sensor records movements in all three spatial directions below 5kHz. It is gently pressed to the neck of the speaker in front of the cricothyroid ligament, a soft tissue in the lower part of the larynx. Linear prediction is used to estimate three resonances between 500Hz and 2kHz. Statistically significant differences in the estimates taken for the first two subglottal formants are found.

#6Comparison of nasalance measurements from accelerometers and microphones and preliminary development of novel features

Nicolas Audibert (Laboratoire de Phonétique et Phonologie, UMR7018 CNRS/Université Paris 3-Sorbonne-Nouvelle, Paris, France)
Angélique Amelot (Laboratoire de Phonétique et Phonologie, UMR7018 CNRS/Université Paris 3-Sorbonne-Nouvelle, Paris, France)

This study compares four nasalance measures computed as ratios between the amplitude of signals recorded with accelerometers and microphones. Two new measures based on RMS amplitude differences between the nasal signal and either the vocal folds vibration signal (LND) or the oral acoustic signal (OND) are introduced. Measures were compared on a total of 584 utterances produced by four native French speakers. Results show that (1) all measures separate nasal from oral consonants, (2) the different experimental setups cannot be considered equivalent, (3) difference-based measures appear to better describe the time course of nasality than ratio-based measures.

#7The effect of seeing the interlocutor on speech production in different noise types

Michael Fitzpatrick (MARCS Auditory Laboratories, University of Western Sydney)
Jeesun Kim (MARCS Auditory Laboratories, University of Western Sydney)
Chris Davis (MARCS Auditory Laboratories, University of Western Sydney)

Talkers modify their speech production in noisy environments partly as a reflex but also as an intentional communicative strategy to facilitate the transmission of the speech signal to the interlocutor. Previous studies have shown that the characteristics of such modifications vary depending on the type of noise. The current study examined whether speech production (and its interaction with noise type) would be affected by being able to see their interlocutor or not. Participants completed an interactive communication game in various noise conditions with/without being able to see their interlocutor. The results show that speech modifications differed with noise condition and that the speech amplitude was significantly lower when interlocutors could see each other. These results suggest that talkers actively monitor their environment and adopt appropriate speech production for efficient communication.

#8Conversing in the presence of a competing conversation: effects on speech production

Vincent Aubanel (Language and Speech Laboratory, Faculty of Letters, University of the Basque Country, Spain)
Martin Cooke (Ikerbasque, Basque Foundation for Science, Spain)
Julian Villegas (Language and Speech Laboratory, Faculty of Letters, University of the Basque Country, Spain)
Maria Luisa Garcia Lecumberri (Language and Speech Laboratory, Faculty of Letters, University of the Basque Country, Spain)

How does a background conversation affect a foreground conversation? In this scenario, and unlike traditional studies of noise-induced speech modification (Lombard speech), listeners have to cope with the additional challenge of competing speech material. In the current study, pairs of talkers engaged in natural dialogs in the absence or presence of another talker pair. Changes in speech level revealed only a small energetic masking effect of the background pair, but very large modifications in prosodic parameters (F0, speech rate) were observed during overlaps within conversations. The presence of the background pair led to increases in the numbers of dysfluencies, mistiming and interruptions, suggesting that interlocutors suffer from competing speech in ways which are not well-described by Lombard speech modifications. Longer inter-turn pauses seen in the background present condition may indicate that listeners monitor the other conversation to avoid temporally-competing speech material where possible.

#9Very short utterances and timing in turn-taking

Mattias Heldner (KTH Speech, Music and Hearing, Stockholm, Sweden)
Jens Edlund (KTH Speech, Music and Hearing, Stockholm, Sweden)
Anna Hjalmarsson (KTH Speech, Music and Hearing, Stockholm, Sweden)
Kornel Laskowski (KTH Speech, Music and Hearing, Stockholm, Sweden)

This work explores the timing of very short utterances in conversations, as well as the effects of excluding intervals adjacent to such utterances from distributions of between-speaker interval durations. The results show that very short utterances are more precisely timed to the preceding utterance than longer utterances in terms of a smaller variance and a larger proportion of no-gap-no-overlaps. Excluding intervals adjacent to very short utterances furthermore results in measures of central tendency closer to zero (i.e. no-gap-no-overlaps) as well as larger variance (i.e. relatively longer gaps and overlaps).

#10Validating rt-MRI based articulatory representations via articulatory recognition

Athanasios Katsamanis (University of Southern California)
Erik Bresch (University of Southern California)
Vikram Ramanarayanan (University of Southern California)
Shrikanth Narayanan (University of Southern California)

The large corpus of real time magnetic resonance image sequences of the vocal tract during speech production that was recently acquired and can be referred to as MRI-TIMIT, provides us with a unique platform for systematically studying articulatory dynamics. Compared to previously collected articulatory datasets, e.g., using articulography or X-rays, MRI-TIMIT is a rich source of information for the entire vocal tract and not only for certain articulatory landmarks and further has the potential to continue increasing in size covering a large variety of speakers and speaking styles. In this work, we investigate an articulatory representation based on full vocal tract shapes. We employ an articulatory recognition framework in MRI-TIMIT to analyze its merits and drawbacks. We argue that articulatory recognition can serve as a general validation tool for real-time MRI based articulatory representations.

#11An Electropalatographic and Acoustic Study on Anticipatory Coarticulation in V1#C2V2 Sequences in Standard Chinese

Yinghao Li (1. Phonetics Lab, Peking University; 2. English Department, Yanbian University)
Jiangping Kong (Phonetics Lab, Peking University)

This paper presents the data on the anticipatory coarticulation of C2 and V2 on V1 in V1#C2V2 sequences in Standard Chinese. Electropalatographic measures and F2 trajectory were obtained to define the articulatory and F2 targets for V1 as well as the displacement for articulatory and F2 transition of V1. Results show that the articulatory target is affected only by C2 place, while C2 place, C2 manner, and V2 show combined effect on the articulatory and F2 displacement of V1. Lip rounding associated with V2 is found to affect the F2 target and F2 transition of V1.

#12Final /t/ reduction in Dutch past-participles: the role of word predictability and morphological decomposability

Iris Hanique (CLS, Radboud University Nijmegen & Max Planck Institute for Psycholinguistics, the Netherlands)
Mirjam Ernestus (CLS, Radboud University Nijmegen & Max Planck Institute for Psycholinguistics, the Netherlands)

This corpus study demonstrates that the realization of word-final /t/ in Dutch past-participles in various speech styles is affected by a word's predictability and paradigmatic relative frequency. In particular, /t/s are shorter and more often absent if the two preceding words are more predictable. In addition, /t/s, especially in irregular verbs, are more reduced, the lower the verb's lemma frequency relative to the past-participle's frequency. Both effects are more pronounced in more spontaneous speech. These findings are expected if speech planning plays an important role in speech reduction.

#13Parametrising Degree of Articulator Movement from Dynamic MRI Data

Zeynab Raeesy (Phonetics Laboratory, University of Oxford, UK)
Ladan Baghai-Ravary (Phonetics Laboratory, University of Oxford, UK)
John Coleman (Phonetics Laboratory, University of Oxford, UK)

A new approach is proposed for quantifying the degree of articulator movement in a phoneme as a scalar value, using vocal tract images captured with dynamic MRI. It indicates the degree of movement of the articulators, rather than the acoustic realisation of that movement. We show that this is a valid method for characterising the overall dynamics of the vocal tract, by demonstrating a definite correlation with a similarly-defined scalar measure of the dynamics of the acoustic signal. The calculation of the new measure produces images showing the location of any movement within the vocal tract, and also shows this information separately for the initial and final segments of each phoneme. Finally, we observe that, while some sounds may involve more movement of the articulators than would be expected from the dynamics of the acoustic signal, it is rare for the degree of articulation derived from the MRI data to be significantly less than expected.

Wed-Ses1-P2:
Systems for LVCSR and rich transcription

Time:Wednesday 10:00 Place:Valfonda 2 - Pala Congressi (Passi Perduti-Gallery) Type:Poster
Chair:Diego Giuliani

#1Improving LVCSR System Combination Using Neural Network Language Model Cross Adaptation

Xunying Liu (Cambridge University)
Mark Gales (Cambridge University)
Phil Woodland (Cambridge University)

State-of-the-art large vocabulary continuous speech recognition (LVCSR) systems often combine outputs from multiple sub-systems developed at different sites. Cross system adaptation can be used as an alternative to direct hypothesis level combination schemes such as ROVER. The standard approach involves only cross adapting acoustic models. To fully exploit the complementary features among sub-systems, language model (LM) cross adaptation techniques can be used. Previous research on multi-level N-gram LM cross adaptation is extended to further include the cross adaptation of neural network LMs in this paper. Using this improved LM cross adaptation framework, significant error rate gains of 4.0%-7.1% relative were obtained over acoustic model only cross adaptation when combining a range of Chinese LVCSR sub-systems used in the 2010 and 2011 DARPA GALE evaluations.

#2Towards High Performance LVCSR in Speech-to-Speech Translation System on Smart Phones

Jian Xue (IBM T.J. Watson Research Center)
Xiaodong Cui (IBM T.J. Watson Research Center)
Gregg Daggett (IBM T.J. Watson Research Center)
Etienne Marcheret (IBM T.J. Watson Research Center)
Bowen Zhou (IBM T.J. Watson Research Center)

This paper presents the endeavors to improve the performance of large vocabulary continuous speech recognition (LVCSR) in speech-to-speech translation system on smart phones. A variety of techniques towards high LVCSR performance are investigated to achieve high accuracy and low latency given constrained resources. This includes one-pass streaming mode decoding for minimum latency, acoustic modeling with full-covariance based on bootstrap and model restructuring for improving recognition accuracy with limited training data; quantized discriminative feature space transforms and quantized Gaussian mixture model to reduce memory usage with negligible degradation on recognition accuracy. Some speed optimization methods are also discussed to increase the recognition speed. The proposed techniques evaluated on the DARPA Transtac datasets will be shown to give good overall performance under the constraints of both CPU and memory on smart phones.

#3Deploying Google Search by Voice in Cantonese

Yun-Hsuan Sung (Google Inc.)
Martin Jansche (Google Inc.)
Pedro Moreno (Google Inc.)

We describe our efforts in deploying Google search by voice for Cantonese, a southern Chinese dialect widely spoken in and around Hong Kong and Guangzhou. We collected audio data from local Cantonese speakers in Hong Kong and Guangzhou by using our DataHound smartphone application. This data was used to create appropriate acoustic models. Language models were trained on anonymized query logs from Google Web Search for Hong Kong. Because users in Hong Kong frequently mix English and Cantonese in their queries, we designed our system from the ground up to handle both languages. We report on experiments with different techniques for mapping the phoneme inventories for both languages into a common space. Based on extensive experiments we report word error rates and web scores for both Hong Kong and Guangzhou data. Cantonese Google search by voice was launched in December 2010.

#4An Investigation on Speech Recognition for Colloquial Arabic

Sarah Al-Shareef (The University of Sheffield)
Thomas Hain (The University of Sheffield)

This paper describes a study of grapheme-based speech recognition for colloquial Arabic. An investigation of language and acoustic model configurations is carried out to illustrate the differences between colloquial and modern standard Arabic (MSA) on the example of Levantine telephone conversations. The study defines extensive and carefully crafted data sets for different dialects and studies their overlap with MSA sources. The use of grapheme models is re-investigated, and alternative configurations for acoustic models to correct obvious shortcomings are tested. The recognition performance was analyzed on two levels: corpus-level and dialect-level. In addition, modifications of dictionaries to allow better specification of sound patterns are explored. Overall, the experiments highlight the need for higher level information on acoustic model selection.

#5A multithreaded implementation of Viterbi decoding on Recursive Transition Networks

Fabio Brugnara (HLT research unit, FBK - Fondazione Bruno Kessler, Trento, Italy)

This paper describes the move to a multithreaded implementation of a Recursive Transition Network Viterbi speech decoder, undertaken with the objective of performing low-latency synchronous decoding on live audio streams to support online subtitling. The approach was meant to be independent of any specific hardware, in order to be easily exploitable on common computers, and portable to different operating systems. In the paper, the reference serial algorithm is presented, together with the modifications introduced to distribute most of the load to different threads by means of a dispatcher/collector thread and several worker threads. Results are presented, confirming a performance benefit in accordance with the design goals.

#6Recurrent Neural Network based Language Modeling in Meeting Recognition

Stefan Kombrink (Brno University of Technology)
Tomas Mikolov (Brno University of Technology)
Martin Karafiat (Brno University of Technology)
Lukas Burget (Brno University of Technology)

We use recurrent neural network (RNN) based language models to improve the BUT English meeting recognizer. On the baseline setup using the original language models we decrease word error rate (WER) by more than 1% absolute through n-best list rescoring and language model adaptation. When n-gram language models are trained on the same moderately sized data set as the RNN models, improvements are higher, yielding a system which performs comparably to the baseline. A noticeable improvement was observed with unsupervised adaptation of RNN models. Furthermore, we examine the influence of word history on WER and show how to speed up rescoring by caching common prefix strings.

#7Ad-Hoc Meeting Transcription on Clusters of Mobile Devices

Michele Cossalter (Carnegie Mellon University)
Priya Sundararajan (Carnegie Mellon University)
Ian Lane (Carnegie Mellon University)

For all the time invested in meetings, very little of the wealth of information that is exchanged is preserved. In this paper, we propose a novel platform for meeting transcription using cellular phones for recognition. As most meeting participants carry cellular phones, this platform will allow meetings to be transcribed anywhere without requiring any additional infrastructure. In our proposed platform, we compare three approaches for combining audio from multiple devices: microphone selection, either at signal or feature level, and combination of decoder outputs via confusion network combination. We evaluated our approach on speech collected in a meeting environment and found that the early microphone selection at signal level obtained a 16% improvement in speech recognition accuracy compared to using a single recording device. Moreover this approach offered a comparable performance to multi-system confusion network combination, while requiring significantly lower computational cost.

#8ROVER Enhancement with Automatic Error Detection

Kacem Abida (University of Waterloo)
Fakhri Karray (University of Waterloo)

In this paper, an approach is presented to improve the existing performance of the Recognizer Output Voting Error Reduction (ROVER) procedure used for speech decoders’ combination in automatic speech transcription. A contextual analysis is injected within the ROVER process to detect and eliminate erroneous words. This filtering is carried out through the combination of automatic error detection techniques. Experiments showed it is possible to outperform the ROVER baseline, and that combining it with error detection methods leads to an even lower Word Error Rate (WER) in the final ROVER composite output.
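
The voting step that ROVER relies on can be illustrated with a minimal sketch. It assumes the hypotheses have already been aligned into slots of equal length (ROVER itself builds this alignment by dynamic programming into a word transition network), it uses plain frequency voting without confidence scores, and it leaves out the error-detection filtering proposed in the paper. The function name `vote` and the empty-string null marker are illustrative choices, not taken from the paper.

```python
from collections import Counter

NULL = ""  # hypothetical marker for "no word" in an alignment slot


def vote(aligned_hyps):
    """Pick the most frequent word in each alignment slot."""
    if not aligned_hyps:
        return []
    length = len(aligned_hyps[0])
    if any(len(h) != length for h in aligned_hyps):
        raise ValueError("hypotheses must be pre-aligned to equal length")
    output = []
    for slot in zip(*aligned_hyps):
        # Ties are broken in favour of the word seen first in the slot
        word, _ = Counter(slot).most_common(1)[0]
        if word != NULL:
            output.append(word)
    return output


hyps = [
    ["the", "cat", "sat", NULL],
    ["the", "cat", "sat", "down"],
    ["a", "cat", "sad", "down"],
]
print(vote(hyps))
```

An error-detection stage of the kind described above would sit before the vote, removing words flagged as likely errors from each slot so that they cannot win it.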

#9Automatic Comma Insertion of Lecture Transcripts Based on Multiple Annotations

Yuya Akita (Kyoto University)
Tatsuya Kawahara (Kyoto University)

To enhance readability and usability of speech recognition results, automatic punctuation is an essential process. In this paper, we address automatic comma prediction based on conditional random fields (CRF) using lexical, syntactic and pause information. Since there is large disagreement in comma insertion between humans, we model individual tendencies of punctuation using annotations given by multiple annotators, and combine these models by voting and interpolation frameworks. Experimental evaluations on real lecture speech demonstrated that the combination of individual punctuation models achieves higher prediction accuracy for commas agreed by all annotators and those given by individual annotators.

Wed-Ses1-P3:
Language, Dialect Identification and Speaker Diarization

Time:Wednesday 10:00 Place:Faenza 1 - Pala Congressi (Passi Perduti-Gallery) Type:Poster
Chair: Nancy Chen

#1Study on the Relevance Factor of Maximum a Posteriori with GMM for Language Recognition

Chang Huai You (Institute for Infocomm Research, Singapore)
Haizhou Li (Institute for Infocomm Research, Singapore)
Kong Aik Lee (Institute for Infocomm Research, Singapore)

In this paper, the relevance factor in maximum a posteriori (MAP) adaptation of Gaussian mixture model (GMM) from universal background model (UBM) is studied for language recognition. In conventional MAP, the relevance factor is typically set empirically to a constant. Knowing that the relevance factor determines how much the observed training data influence the model adaptation, and thus the resulting GMMs, we believe that the relevance factor should depend on the data for more effective modeling. We formulate the estimation of relevance factor in a systematic manner and study its role in characterizing spoken languages with supervectors. We use a Bhattacharyya-based language recognition system on National Institute of Standards and Technology (NIST) language recognition evaluation (LRE) 2009 task to investigate the validity of the data-dependent relevance factor. Experimental results show that we achieve improved performance by using the proposed relevance factor.
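
For orientation, the conventional MAP mean update with a fixed relevance factor can be sketched as below. This is the standard GMM-UBM formulation for a single Gaussian component, not the data-dependent estimate proposed in the paper; the function name, argument layout and the default value of 16 are assumptions made for illustration.

```python
def map_adapt_mean(ubm_mean, frames, posteriors, relevance=16.0):
    """MAP-adapt the mean of one Gaussian component of a UBM.

    ubm_mean   : list of floats, the UBM mean of this component
    frames     : list of feature vectors (lists of floats)
    posteriors : per-frame occupation probability of this component
    relevance  : fixed relevance factor r (assumed constant here)
    """
    n = sum(posteriors)  # soft count of frames assigned to the component
    if n == 0.0:
        return list(ubm_mean)  # no data: keep the prior mean
    dim = len(ubm_mean)
    # Posterior-weighted average of the observed data
    data_mean = [
        sum(p * x[d] for p, x in zip(posteriors, frames)) / n
        for d in range(dim)
    ]
    alpha = n / (n + relevance)  # data-versus-prior interpolation weight
    return [alpha * data_mean[d] + (1.0 - alpha) * ubm_mean[d] for d in range(dim)]


frames = [[2.0, 0.0]] * 16
adapted = map_adapt_mean([0.0, 0.0], frames, [1.0] * 16, relevance=16.0)
print(adapted)
```

With 16 frames and a relevance factor of 16 the weight alpha is 0.5, so the adapted mean lies halfway between the UBM mean and the data mean. A larger relevance factor keeps the adapted model closer to the UBM, which is why its choice matters.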

#2Improving Multiband Position Pitch Algorithm for Localization and Tracking of Multiple Concurrent Speakers by using a Frequency Selective Criterion

Tania Habib (Signal Processing and Speech Communication Lab, Graz University of Technology, Austria)
Harald Romsdorfer (Signal Processing and Speech Communication Lab, Graz University of Technology, Austria)

We present an auditory inspired frequency selective extension to the multiband position-pitch (MPoPi) algorithm and a new particle filtering algorithm for localization and tracking of an arbitrary number of concurrent speakers. In the particle filtering framework, we combine standard bootstrap with importance sampling techniques. The proposed algorithm was tested on real-world recordings using a 24 channel microphone array in a meeting room for different location and speaker combinations. The results show that using the frequency selective criterion outperforms state-of-the-art and our original algorithms.

#3On the Use of Lattices of Time-Synchronous Cross-Decoder Phone Co-occurrences in a SVM-Phonotactic Language Recognition System

Amparo Varona (University of the Basque Country)
Mikel Penagarikano (University of the Basque Country)
Luis Javier Rodriguez-Fuentes (University of the Basque Country)
German Bordel (University of the Basque Country)

This paper presents a simple approach to phonotactic language recognition which uses Lattices of Time-Synchronous Cross-Decoder Phone Co-occurrences. In previous works we have successfully applied cross-decoder information, but using statistics of n-grams extracted from 1-best phone strings. In this work, the method to build and properly use lattices of cross-decoder phone co-occurrences is fully explained and developed. Experiments were carried out on the 2007 NIST LRE database. The proposed approach outperformed the baseline phonotactic systems both considering 3-grams and 4-grams. Best results were obtained by considering the m=400 most likely cross-decoder co-occurrences: 1.29% EER and CLLR=0.203. The fusion of the baseline system with the proposed approach yielded 1.22% EER and CLLR=0.203 (18% and 15% relative improvements) for n=3, and 1.17% EER and CLLR=0.197 (15% and 10% relative improvements) for n=4, outperforming state-of-the-art phonotactic systems on the same task.

#4Speaker Clustering Based on Utterance-oriented Dirichlet Process Mixture Model

Naohiro Tawara (Department of Science and Engineering, Waseda University)
Shinji Watanabe (NTT Communication Science Laboratories, NTT Corporation)
Tetsuji Ogawa (Waseda Institute for Advanced Study)
Tetsunori Kobayashi (Department of Science and Engineering, Waseda University)

This paper provides the analytical solution and algorithm for the utterance-oriented Dirichlet process mixture model (UO-DPMM) within a non-parametric Bayesian framework, and thus realizes fully Bayesian speaker clustering. We carried out preliminary speaker clustering experiments using the TIMIT database to compare the proposed method with the conventional Bayesian Information Criterion (BIC) based method, which is an approximate Bayesian approach. The results showed that the proposed method outperformed the conventional one in terms of both computational cost and robustness to changes in tuning parameters.

#5PLDA-based Clustering for Speaker Diarization of Broadcast Streams

Jan Silovsky (Institute of Information Technology and Electronics, Faculty of Mechatronics, Technical University of Liberec, Czech Republic)
Jan Prazak (Institute of Information Technology and Electronics, Faculty of Mechatronics, Technical University of Liberec, Czech Republic)
Petr Cerva (Institute of Information Technology and Electronics, Faculty of Mechatronics, Technical University of Liberec, Czech Republic)
Jindrich Zdansky (Institute of Information Technology and Electronics, Faculty of Mechatronics, Technical University of Liberec, Czech Republic)
Jan Nouza (Institute of Information Technology and Electronics, Faculty of Mechatronics, Technical University of Liberec, Czech Republic)

This paper presents two approaches to speaker clustering based on Probabilistic Linear Discriminant Analysis (PLDA) in the speaker diarization task. We refer to the approaches as the multifold-PLDA approach and the onefold-PLDA approach. For both approaches, a simple factor analysis model is employed to extract a low-dimensional representation of a sequence of acoustic feature vectors - so-called i-vectors - and these i-vectors are modeled using the PLDA model. Further, two-stage clustering with a Bayesian Information Criterion (BIC) based approach applied in the first stage and a PLDA-based approach in the second stage is examined. We carried out our experiments using the COST278 multilingual broadcast news database. The best evaluated system yielded a 42% relative improvement of the speaker error rate over a baseline BIC-based system.

#6iVector Approach to Phonotactic Language Recognition

Mehdi Soufifar (Department of Electronics and Telecommunications, NTNU, Trondheim, Norway)
Marcel Kockmann (Brno University of Technology, Speech@FIT, Czech Republic)
Lukas Burget (Brno University of Technology, Speech@FIT, Czech Republic)
Olda Plchot (Brno University of Technology, Speech@FIT, Czech Republic)
Ondrej Glembek (Brno University of Technology, Speech@FIT, Czech Republic)
Torbjørn Svendsen (Department of Electronics and Telecommunications, NTNU, Trondheim, Norway)

This paper addresses a novel technique for representation and processing of n-gram counts in phonotactic language recognition (LRE): subspace multinomial modelling represents the vectors of n-gram counts by low dimensional vectors of coordinates in total variability subspace, called iVector. Two techniques for iVector scoring are tested: support vector machines (SVM), and logistic regression (LR). Using standard NIST LRE 2009 task as our evaluation set, the latter scoring approach was shown to outperform phonotactic LRE system based on direct SVM classification of n-gram count vectors. The proposed iVector paradigm also shows comparable results to previously proposed PCA-based phonotactic feature extraction.

#7Discriminative Features For Language Identification

Christopher Alberti (Google Inc.)
Michiel Bacchiani (Google Inc.)

In this paper we investigate the use of discriminatively trained feature transforms to improve the accuracy of a MAP-SVM language recognition system. We train the feature transforms by alternately solving an SVM optimization on MAP supervectors estimated from transformed features, and performing a small step on the transforms in the direction of the antigradient of the SVM objective function. We applied this method on the LRE2003 dataset, and obtained a 5.9% relative reduction of pooled equal error rate.

#8Perceptual sensitivity to dialectal and generational variations in vowels

Robert Allen Fox (Department of Speech and Hearing Science, The Ohio State University)
Ewa Jacewicz (Department of Speech and Hearing Science, The Ohio State University)

Perception of dialect variation is well studied with respect to perceptual similarity of talkers based on dialectal markers. This study examines the perceptual distinctiveness of regional vowel variants in light of cross-generational changes in vowel productions. Listeners from two regional dialects of English identified the dialect of the speaker in monosyllabic words (produced by older adults, young adults and children). Differential listener sensitivity to speaker dialect was found, which was highly affected by speaker generation. This suggests that the ability to determine dialect membership is an interaction between the perceptual spaces of listeners and the acoustic variations in vowels.

#9Investigation of Cross-show Speaker Diarization

Qian Yang (Cognitive Systems Lab, Karlsruhe Institute of Technology, Karlsruhe, Germany)
Tanja Schultz (Cognitive Systems Lab, Karlsruhe Institute of Technology, Karlsruhe, Germany)
Qin Jin (Language Technologies Institute, Carnegie Mellon University, USA)

The goal of cross-show diarization is to index speech segments of speakers for a set of shows, with the particular challenge that reappearing speakers across shows have to be assigned to the same speaker identity. In this paper, we introduce three cross-show diarization systems and present our initial experiments on the cross-show diarization task. Among the three systems, the Global-BIC-cluster achieves the best performance and obtains 15.53% and 13.21% cross-show diarization error rate (DER) on the dev and test set respectively. However, the incremental approach is considered to be more effective in real life. By applying T-Norm to the incremental system, we obtain 13.18% and 10.97% relative improvements in terms of cross-show DER on the dev and test sets. We also investigate the impact of the show order on cross-show DER for the incremental approach.

#10Language Identification for Text Chats

Vesa Siivola (Rosetta Stone)
Bryan Pellom (Rosetta Stone)
Meagan Sills (Rosetta Stone)

This work aims to classify the language of typed messages in a text chat system used by language learners. A method for training a language classifier from unlabeled data is presented. A dictionary-based method is used to produce initial classification of the messages. Character based n-gram models of order 3 and 5 are built. A method for selectively choosing the n-grams to be modeled is used to train 15-gram models. This method produces the best-performing classifier. It has models for 57 languages and obtains over 95% accuracy on the classification of messages that are unambiguously in one language.
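
As a rough illustration of character n-gram language classification, the sketch below trains one trigram count model per language and picks the label with the highest add-one-smoothed log-likelihood. It is a toy under stated assumptions: the training texts, the function names and the smoothing scheme are invented for the example, and neither the dictionary-based bootstrapping from unlabeled data nor the selective n-gram method described in the abstract is reproduced.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """All overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def train(samples, n=3):
    """samples: dict mapping language label -> training text."""
    return {lang: Counter(char_ngrams(text, n)) for lang, text in samples.items()}


def classify(text, models, n=3):
    """Return the label whose add-one-smoothed n-gram model scores highest."""
    vocab = set()
    for counts in models.values():
        vocab.update(counts)
    v = len(vocab) + 1  # +1 reserves probability mass for unseen n-grams
    best_lang, best_score = None, -math.inf
    for lang, counts in models.items():
        total = sum(counts.values())
        score = sum(
            math.log((counts[g] + 1) / (total + v)) for g in char_ngrams(text, n)
        )
        if score > best_score:
            best_lang, best_score = lang, score
    return best_lang


# Toy training data, invented for this example
models = train({"en": "the cat and the hat", "fi": "kissa ja hattu ja talo"})
print(classify("the hat", models))
```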

#11Spoken Language Recognition in the Latent Topic Simplex

Kong Aik Lee (Institute for Infocomm Research, Singapore)
Chang Huai You (Institute for Infocomm Research, Singapore)
Ville Hautamäki (Institute for Infocomm Research, Singapore)
Anthony Larcher (Institute for Infocomm Research, Singapore)
Haizhou Li (Institute for Infocomm Research, Singapore)

This paper proposes the use of latent topic modeling for spoken language recognition, where a topic is defined as a discrete distribution over phone n-grams. The latent topics are trained in an unsupervised manner using the latent Dirichlet allocation (LDA) technique. Language recognition is then performed in a low dimensional simplex defined by the latent topics. We apply the Bhattacharyya measure to compute the n-gram similarity in the topic simplex. Our study shows that some of the latent topics are language specific while others exhibit multilingual characteristics. Experiments conducted on the NIST 2007 language detection task show that language cues can be sufficiently preserved in the topic simplex.
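
The Bhattacharyya measure between two discrete distributions, such as two points on a topic simplex, can be sketched as follows. This shows the textbook coefficient and distance only; how the paper applies the measure to its topic representations is not specified in the abstract, and the function names are illustrative.

```python
import math


def bhattacharyya_coefficient(p, q):
    """Overlap between two discrete distributions on the same support."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    return sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))


def bhattacharyya_distance(p, q):
    """Negative log of the coefficient; infinite when there is no overlap."""
    bc = bhattacharyya_coefficient(p, q)
    return math.inf if bc == 0.0 else -math.log(bc)


same = bhattacharyya_coefficient([0.5, 0.5, 0.0], [0.5, 0.5, 0.0])
disjoint = bhattacharyya_coefficient([1.0, 0.0], [0.0, 1.0])
print(same, disjoint)
```

The coefficient is 1 for identical distributions and 0 for distributions with disjoint support, so it can serve directly as a similarity score.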

Wed-Ses1-P4:
Paralinguistic Information - Analysis and Tools

Time:Wednesday 10:00 Place:Faenza 2 - Pala Congressi (Passi Perduti-Gallery) Type:Poster
Chair:Shri Narayanan

#1Investigating Robustness of Spectral Moments on Normal- and High-Effort Speech

Frederike Gottsmann (Fraunhofer FKIE)
Corinna Harwardt (Fraunhofer FKIE)

In this paper we are looking for a robust value of the spectral moments that doesn’t change when a speaker varies his vocal effort from normal to loud speech. To do this we first calculate the first four spectral moments for normal and loud speech. Then we compare the results for each single phoneme. After this, we do a correlation analysis to check whether normal and loud speech are linked with each other linearly. The results of the investigations show that plosives and fricatives are robust to changes of vocal effort. Vowels and sonorants demonstrate significant differences in vocal effort.
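
One common way to compute the first four spectral moments is to treat the power spectrum as a probability distribution over frequency, as sketched below. The abstract does not give the authors' exact definitions, so the normalisation, the use of excess kurtosis and the function name are assumptions made for illustration.

```python
import math


def spectral_moments(freqs, power):
    """Return (centre of gravity, standard deviation, skewness, excess kurtosis)."""
    total = sum(power)
    if total <= 0.0:
        raise ValueError("spectrum must have positive total power")
    weights = [p / total for p in power]  # treat the spectrum as a distribution
    mean = sum(w * f for w, f in zip(weights, freqs))

    def central(k):
        # k-th central moment of frequency under the spectral weights
        return sum(w * (f - mean) ** k for w, f in zip(weights, freqs))

    var = central(2)
    if var == 0.0:
        return mean, 0.0, 0.0, 0.0  # all energy in a single bin
    sd = math.sqrt(var)
    skew = central(3) / sd ** 3
    kurt = central(4) / var ** 2 - 3.0
    return mean, sd, skew, kurt


# Symmetric toy spectrum centred on 200 Hz
print(spectral_moments([100.0, 200.0, 300.0], [1.0, 2.0, 1.0]))
```

Computing these per phoneme for normal and loud speech and then correlating the two sets of values would mirror the comparison described above.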

#2Comparing the Impact of Raised Vocal Effort on Various Spectral Parameters

Corinna Harwardt (Fraunhofer FKIE)

Vocal effort changes induce various modifications to acoustic characteristics of speech. In this paper we investigate the impact of raised vocal effort on the speech spectrum. In particular, we look at different spectral parameters and compare the changes. The parameters we take into account are spectral tilt, spectral center of gravity, energy ratio and spectral moments. We carry out tests on the complete data set with all phonemes pooled into one distribution and tests with the data divided into three phoneme classes. Furthermore we run vocal effort classification tests to verify our results from statistical analysis. The results indicate significant changes for all parameters on the complete distribution as well as for vowels and sonorants. For obstruents we observe significant changes, too. But the modifications are much smaller than those for the other two phoneme classes. The parameters that are less affected by raised vocal effort are energy ratio and second spectral moment.

#4Vowel Context and Speaker Interactions Influencing Glottal Open Quotient and Formant Frequency Shifts in Physical Task Stress

Keith W. Godin (University of Texas at Dallas)
John H. L. Hansen (University of Texas at Dallas)

Physical task stress is known to affect the fundamental frequency of speech. This study of two American English vowels /IY/ and /AH/ investigates whether physical task stress affects the center frequencies of formants F1 and F2, and whether it affects the glottal open quotient, and whether these effects are different for different speakers, the different vowels, and two different vowel contexts. Formant center frequencies are measured from the acoustic waveform, and the glottal open quotient is measured from the electroglottograph signal. The study finds in general that the production of vowels is affected by physical task stress. In particular, the study finds that F1, F2, and the glottal open quotient are affected by physical task stress. It also finds that the effects of stress on F1 vary for different speakers, and that the effects of stress on the glottal open quotient vary for different combinations of speaker and vowel.

#5Prosodic Correlates of Individual Physiological Response to Stress

Serguei Pakhomov (University of Minnesota)
Michael Kotlyar (University of Minnesota)

Response to stress is an important health risk factor. We compared several methods based on automatic speech analysis for extracting prosodic information from spontaneous speech of 19 subjects participating in a study of the effects of bupropion (a medication that increases smoking cessation rates and treats depression) on stress response in smokers. Automatically extracted mean fundamental frequency (F0), F0 variability, and the mean duration of silent pauses significantly correlate with physiological measures of stress: plasma concentration of epinephrine (adrenaline), heart rate and blood pressure. These findings indicate that automated speech analysis may be used for non-invasive stress response measurement.

#6The vocal effort of dominance in scenario meetings

Marcela Charfuelan (DFKI GmbH, Language Technology Lab)
Marc Schröder (DFKI GmbH, Language Technology Lab)

In this paper we address two questions about dominance in the AMI-IDIAP scenario meetings: (i) do the annotated most and least dominant utterances correlate with different levels of vocal effort? and if so (ii) how quantitatively discriminative are the vocal effort effects for prosody, voice quality and low level acoustic features? For answering these questions we perform supervised learning with dominance annotations in AMI-IDIAP meetings and vocal effort annotations in controlled data. A linear discriminant analysis (LDA) classifier is used to optimise class separability. We have found that the most and least dominant utterances are acoustically correlated with loud and soft vocal effort. We were able to quantify around 55% discrimination of equal distributions of most dominant, neutral and least dominant utterances using low level acoustic measures.

#7A Preliminary Model of Emotional Prosody using Multidimensional Scaling

Sona Patel (NCCR Center for Affective Sciences (CISA), University of Geneva, Switzerland)
Rahul Shrivastav (Department of Speech, Language and Hearing Sciences, University of Florida, USA)

Models of emotional prosody based on perception have typically required listeners to rate emotional expressions according to the psychological dimensions (arousal, valence, and power). We propose a perception-based model without assuming that the psychological dimensions are those used by listeners to differentiate emotional prosody. Instead, multidimensional scaling is used to identify three perceptual dimensions, which are then regressed onto a dynamic feature set that does not require a training set or normalization to a speaker’s “neutral” expression. The model predictions for Dimensions 1 and 3 closely matched the perceptual model; however, a moderately close match was observed for Dimension 2.

#8An Exploratory Study of the Relations between Perceived Emotion Strength and Articulatory Kinematics

Jangwon Kim (University of Southern California)
Sungbok Lee (University of Southern California)
Shrikanth Narayanan (University of Southern California)

Acoustic and articulatory behaviors underlying emotion strength perception are studied by analyzing acted emotional speech. Listeners evaluated emotion identity, strength and confidence. Parameters related to pitch, loudness and articulatory kinematics are associated with a 2-level (strong/weak) representation of the emotion strength. Two-class discriminant analyses show averaged leave-one-out accuracies of 65.8% and 63.8% in the acoustic and articulatory domains, respectively. Two-factor ANOVA (emotion type/strength) indicates that the listeners assess the emotion strength based on the nature of perceived emotions in the arousal dimension. Only hot anger and happiness show significant differences in pitch use in the strength contrast. Such contrasts are also observed in tongue lowering and/or advancing. The strength contrast by listeners may mainly rely upon pitch and loudness. However, interactions between the acoustic and articulatory parameters in strength perception are complex.

#9Improved Acoustic Characterization of Breathy and Whispery Voices

Carlos T. Ishi (ATR Intelligent Robotics and Communication Labs.)
Hiroshi Ishiguro (ATR Intelligent Robotics and Communication Labs.)
Norihiro Hagita (ATR Intelligent Robotics and Communication Labs.)

Breathy and whispery voices are non-modal phonations produced by an air escape through the glottis, and may carry important linguistic or paralinguistic information, depending on the language. In order to improve the acoustic characterization of breathy and whispery segments, we proposed a normalized breathiness power measure (NBP) by embedding a mid-frequency voicing measure (F1F3syn) in its formulation. A partial inverse filtering pre-processing and a sub-band periodicity-based frequency boundary selection approach were also proposed for improving the performance of the F1F3syn and NBP measures. Improvements from 70 to 83% on detection of breathy/whispery segments are achieved by the proposed NBP measure relative to previous methods, for a false detection rate of 10% in modal and rough segments.

#10Neutral to Target Emotion Conversion Using Source and Suprasegmental Information

Govind D (Indian Institute of Technology Guwahati)
Prasanna S R Mahadeva (Indian Institute of Technology Guwahati)
Yegnanarayana B (International Institute of Information Technology Hyderabad)

This work uses instantaneous pitch and strength of excitation along with duration of syllable-like units as the parameters for emotion conversion. Instantaneous pitch and duration of the syllable-like units of the neutral speech are modified by the prosody modification of its linear prediction (LP) residual using the instants of significant excitation. The strength of excitation is modified by scaling the Hilbert envelope (HE) of the LP residual. The target emotion speech is then synthesized using the prosody- and strength-modified LP residual. The pitch, duration and strength modification factors for emotion conversion are derived using the syllable-like units of initial, middle and final regions from an emotion speech database having different speakers, texts and emotions. The effectiveness of the region-wise modification of source and suprasegmental features over the gross-level modification is confirmed by the waveforms, spectrograms and subjective evaluations.

#11A multimodal analysis of vocal and visual backchannels in spontaneous dialogs

Khiet P. Truong (Human Media Interaction, University of Twente)
Ronald Poppe (Human Media Interaction, University of Twente)
Iwan de Kok (Human Media Interaction, University of Twente)
Dirk Heylen (Human Media Interaction, University of Twente)

Backchannels (BCs) are short vocal and visual listener responses that signal attention, interest, and understanding to the speaker. Previous studies have investigated BC prediction in telephone-style dialogs from prosodic cues. In contrast, we consider spontaneous face-to-face dialogs. The additional visual modality allows speaker and listener to monitor each other's attention continuously, and we hypothesize that this affects the BC-inviting cues. In this study, we investigate how gaze, in addition to prosody, can cue BCs. Moreover, we focus on the type of BC performed, with the aim to find out whether vocal and visual BCs are invited by similar cues. In contrast to telephone-style dialogs, we do not find rising/falling pitch to be a BC-inviting cue. However, in a face-to-face setting, gaze appears to cue BCs. In addition, we find that mutual gaze occurs significantly more often during visual BCs. Moreover, vocal BCs are more likely to be timed during pauses in the speaker's speech.

#12Kernel models for affective lexicon creation

Nikos Malandrakis (Dept. of ECE, Technical University of Crete, 73100 Chania, Greece)
Alexandros Potamianos (Dept. of ECE, Technical University of Crete, 73100 Chania, Greece)
Elias Iosif (Dept. of ECE, Technical University of Crete, 73100 Chania, Greece)
Shrikanth Narayanan (SAIL Lab, Dept. of EE, Univ. of Southern California, Los Angeles, CA 90089, USA)

Emotion recognition algorithms for spoken dialogue applications typically employ lexical models that are trained on labeled in-domain data. In this paper, we propose a domain-independent approach to affective text modeling that is based on the creation of an affective lexicon. Starting from a small set of manually annotated seed words, continuous valence ratings for new words are estimated using semantic similarity scores and a kernel model. The parameters of the model are trained using least mean squares estimation. Word-level scores are combined to produce sentence-level scores via simple linear and non-linear fusion. The proposed method is evaluated on the SemEval news headline polarity task and on the ChIMP politeness and frustration detection dialogue task, achieving state-of-the-art results on both. For politeness detection, best results are obtained when the affective model is adapted using in-domain data. For frustration detection, the domain-independent model and non-linear fusion achieve the best performance.

Wed-Ses1-S2-P:
Speech Technology for Under-Resourced Languages II

Time:Wednesday 11:00 Place:Caravaggio (Adua 1) - Pala Affari - 1st Floor Type:Poster
Chairs:Laurent Besacier, Alexey Karpov

#1Automatic Prosody Generation for Serbo-Croatian Speech Synthesis Based on Regression Trees

Milan Sečujski (Faculty of Technical Sciences, University of Novi Sad, Serbia)
Darko Pekar (“AlfaNum – Speech Technologies Ltd.”, Novi Sad, Serbia)
Nikša Jakovljević (Faculty of Technical Sciences, University of Novi Sad, Serbia)

The paper presents the module for automatic generation of prosodic features of synthesized speech, namely, f0 targets and phonetic segment durations, within the speech synthesizer AlfaNumTTS, the most sophisticated speech synthesis system for the Serbo-Croatian language to date. The module is based on regression trees trained on a studio-recorded single-speaker database of Serbo-Croatian. The database has been annotated for phonemic identity as well as a number of prosodic events such as pitch accents, phrase breaks and prosodic prominence. Besides the traditional description of the intonational phonology of Serbo-Croatian through four distinct accent types, within this study we have examined the possibility of representing them as tonal sequences, which has been suggested in recent linguistic literature. The results obtained confirm that the four accents can indeed be reduced to sequences of high and low tones without loss of quality, provided that phonemic length contrast is preserved.

#2Very Large Vocabulary ASR for Spoken Russian with Syntactic and Morphemic Analysis

Alexey Karpov (St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences (SPIIRAS))
Irina Kipyatkova (St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences (SPIIRAS))
Andrey Ronzhin (St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences (SPIIRAS))

In this paper, we present a word-based very large vocabulary automatic speech recognition system for Russian. Some novel methods are proposed for organization of the lexicon and the language model. A two-level morpho-phonemic prefix graph that uses some information on the morphemic structure of lexical units is suggested for a compact representation of the pronunciation vocabulary and search space. Such a model is more compact than the lexical tree or the linearly-based vocabulary and speeds up the recognition process. Syntactic analysis of a training text corpus, in combination with statistical analysis, is suggested for generation of N-gram language models. The syntax-based Russian language model allows taking into account long-distance syntactic dependencies between word pairs. The results show that the syntactic-statistic language model gives a 5% relative improvement in the word and letter error rates with respect to the baseline models.

#3Cross-language phone recognition when the target language phoneme inventory is not known

Timothy Kempton (University of Sheffield)
Roger Moore (University of Sheffield)
Thomas Hain (University of Sheffield)

Cross-language speech recognition often assumes a certain amount of knowledge about the target language. However, there are hundreds of languages where not even the phoneme inventory is known. In the work reported here, phone recognisers are evaluated on a cross-language task with minimum target knowledge. A phonetic distance measure is introduced for the evaluation, allowing a distance to be calculated between any utterance of any language. This has a number of spin-off applications such as allophone detection, a phone-based ROVER approach to recognition, and cross-language forced alignment. Results show that some of these novel approaches will be of immediate use in characterising languages where there is little phonological knowledge.

#4A Paradigm for Small Vocabulary Speech Recognition Based on Redundant Spectro-Temporal Feature Sets

Sourish Chaudhuri (Carnegie Mellon University)
Bhiksha Raj (Carnegie Mellon University)

Speech recognition techniques have come to rely almost completely on HMM-based frameworks. In this paper, we present a novel paradigm for small-vocabulary speech recognition based on a recently proposed word spotting technique. Recent work using discriminative classifiers with ordered spectro-temporal features to detect the presence of keywords obtained encouraging improvements over HMM-based models. We propose to extend this approach to recognize continuous speech in our work. Our method uses discriminative models to predict which words are present in a speech signal and hypothesize their locations. A graph search using dynamic programming is then used to obtain the most likely sequence of words from the hypothesis set produced as a result of combining the results from the discriminative word classifiers. While this approach doesn't perform as well as state-of-the-art ASR systems, it can be particularly useful for languages with small amounts of annotated data available.

#5GorUp: an ontology-driven Audio Information Retrieval system that suits the requirements of under-resourced languages

Nora Barroso (Irunweb Enterprise - Irun)
Karmele López de Ipiña (Computational Intelligence Group, Department of System Engineering and Automation - University of the Basque Country)
Aitzol Ezeiza (Computational Intelligence Group, Department of System Engineering and Automation - University of the Basque Country)
Carmen Hernández (Computational Intelligence Group, Department of System Engineering and Automation - University of the Basque Country)
Nerea Ezeiza (Computational Intelligence Group, Department of System Engineering and Automation - University of the Basque Country)
Odei Barroso (Irunweb Enterprise - Irun)
Unai Susperregi (Irunweb Enterprise - Irun)
Barroso Simeon (Insima Teknologia, Donostia)

GorUp is an Information Retrieval system that provides information about the contents of audio broadcast news in Basque, Spanish, and French. Since the resources available for Basque in general, and for this task in particular, were very few, data optimization methodologies had to be applied in various phases of the development. Moreover, the agglutinative nature of Basque required the use of morphemes and other sub-word units. Additionally, some keyword spotting and semantic methods have also been applied in the system in order to retrieve information properly. In most cases, the methods employed during this project could suit the requirements of many under-resourced languages, and one of these techniques could be the ontology-based approach. This paper presents the system in general for Basque and emphasizes the techniques employed in order to enhance the system using a semantic ontology.

#6Woefzela - An open-source platform for ASR data collection in the developing world

Nic De Vries (Human Language Technologies Research Group, Meraka Institute, CSIR, Pretoria, South Africa)
Jaco Badenhorst (Human Language Technologies Research Group, Meraka Institute, CSIR, Pretoria, South Africa)
Marelie Davel (Human Language Technologies Research Group, Meraka Institute, CSIR, Pretoria, South Africa)
Etienne Barnard (Multilingual Speech Technologies, North-West University, Vanderbijlpark 1900, South Africa)
Alta De Waal (Human Language Technologies Research Group, Meraka Institute, CSIR, Pretoria, South Africa)

Building transcribed speech corpora for under-resourced languages plays a pivotal role in developing speech technologies for such languages. We have developed an open-source tool for devices running the Android operating system to facilitate the efficient collection of speech data for Automatic Speech Recognition system development. The tool was designed for use in typical developing-world conditions; we present the relevant design choices and analyse the effectiveness of this tool by means of a case study. In particular, we introduce a novel semi-real-time quality monitoring system, which reduces the amount of erroneous data collected from users and thus increases the efficiency of the data collection process.

#7A Study on the Perception of Tone and Intonation in Sesotho

Hansjörg Mixdorff (Beuth University of Applied Sciences Berlin, Germany)
Lehlohonolo Mohasi (University of Stellenbosch, South Africa)
'Malillo Machobane (National University of Lesotho)
Thomas Niesler (University of Stellenbosch, South Africa)

This paper presents a study on the perception of Sesotho, a Southern African tonal language, employing a set of recorded minimal pairs, whose F0 contours were analyzed in a previous study using the Fujisaki model and resynthesized. Sequences of prosodically modified stimuli were produced to examine the effect of these modifications on word identification, statement/question distinction, as well as focus identification. With few exceptions, results regarding word identification are in line with our expectations. F0 modifications even seem to override vowel differences between words, but they do not change perception when the vowel is the only difference. With respect to the statement/question distinction, shortening of the penultimate syllable, higher speech rate and increased phrase command magnitude Ap all increase the probability that an utterance is judged to be a question. The focus experiment only produced inconclusive results, possibly due to its complex setting.

#8Developing a broadband automatic speech recognition system for Afrikaans

Febe de Wet (Human Language Technology Competency Area, CSIR Meraka Institute, Pretoria, South Africa)
Alta de Waal (Human Language Technology Competency Area, CSIR Meraka Institute, Pretoria, South Africa)
Gerhard van Huyssteen (Centre for Text Technology (CTexT), North-West University, Potchefstroom, South Africa)

Afrikaans is one of the eleven official languages of South Africa. It is classified as an under-resourced language. No annotated broadband speech corpora currently exist for Afrikaans. This article reports on the development of speech resources for Afrikaans, specifically a broadband speech corpus and an extended pronunciation dictionary. Baseline results for an ASR system that was built using these resources are also presented. In addition, the article suggests different strategies to exploit the close relationship between Afrikaans and Dutch for the purposes of technology development.

#9Multi-accent speech recognition of Afrikaans, Black and White varieties of South African English

Herman Kamper (Stellenbosch University)
Thomas Niesler (Stellenbosch University)

We investigate speech recognition performance of systems employing several accent-specific recognisers in parallel for simultaneous recognition of multiple accents. We compare these systems with oracle systems, in which test utterances are presented to matching accent-specific recognisers, and with accent-independent systems, in which data are pooled. Afrikaans (AE), Black (BE) and White (EE) accents of South African English are considered. We find that, when accent is classified on a per-utterance basis, parallel systems outperform oracle systems for the AE+EE accent pair while the opposite is observed for BE+EE. When accent is identified on a per-speaker basis, oracle or better performance is obtained for both accent pairs. Furthermore, parallel systems using multi-accent acoustic modelling, which allows cross-accent sharing of acoustic data, outperform parallel systems using accent-specific acoustic models. The former also yields better performance than accent-independent systems.

#10Perceptual Representation of Consonant Sounds in Thai

Charturong Tantibundhit (Department of Electrical and Computer Engineering, Thammasat University, Thailand)
Chutamanee Onsuwan (Department of Linguistics, Thammasat University, Thailand)
Tanawan Saimai (Department of Electrical and Computer Engineering, Thammasat University, Thailand)
Nantaporn Saimai (Department of Electrical and Computer Engineering, Thammasat University, Thailand)
Sumonmas Thatphithakkul (National Electronics and Computer Technology Center (NECTEC), Thailand)
P. Chootrakool (National Electronics and Computer Technology Center (NECTEC), Thailand)
Krit Kosawat (National Electronics and Computer Technology Center (NECTEC), Thailand)
Nattanun Thatphithakkul (National Electronics and Computer Technology Center (NECTEC), Thailand)

This work is an attempt to construct a perceptual representation of Thai consonants based on perceptual identification results (from 28 Thais) of 21 phonemes presented in noise. The experiment is designed to equally make pairwise comparisons among 21 word-initial phonemes, which results in 210 real-word stimulus pairs. Percent correct responses and confusion matrices are obtained. Similarity score and perceptual distance for each phoneme pair are systematically derived from confusion scores based on a method proposed by Shepard (1972). Then, a perceptual space of Thai consonants takes shape and could roughly be divided into 5 groupings: glide, glottal, nasal, aspirated obstruent, and a combination of liquid and unaspirated obstruent. It is suggested that these phonological classes reflect the most distinct and relevant perceptual properties of Thai consonants. Preliminary cross-linguistic observation is addressed in light of the data of English consonants from Miller and Nicely (1955).
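The derivation of similarity and distance from confusion data can be sketched as follows. The abstract does not give the formula; the sketch assumes the form commonly attributed to Shepard (1972), where similarity is the ratio of cross-confusions to correct identifications and distance is its negative logarithm. The function names and the 3x3 toy matrix are hypothetical and are not data from the study.

```python
import math

def shepard_similarity(conf, i, j):
    """Similarity of stimuli i and j from a confusion matrix.

    conf[r][c] is how often stimulus r was identified as c.
    Assumed form: s_ij = (p_ij + p_ji) / (p_ii + p_jj).
    """
    return (conf[i][j] + conf[j][i]) / (conf[i][i] + conf[j][j])

def perceptual_distance(conf, i, j):
    """Distance d_ij = -ln(s_ij); never-confused pairs are infinitely far apart."""
    s = shepard_similarity(conf, i, j)
    return math.inf if s == 0 else -math.log(s)

# Hypothetical toy confusion counts (each row sums to 100 presentations).
toy_conf = [
    [80, 15, 5],
    [10, 85, 5],
    [5, 5, 90],
]

# Number of unordered pairs among 21 phonemes, matching the 210 stimulus pairs.
n_pairs = math.comb(21, 2)
```

In the toy matrix, items 0 and 1 are confused with each other more often than items 0 and 2, so they end up closer in the derived space. A distance matrix built this way is the usual input to multidimensional scaling or clustering.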

#11A cross-lingual approach to the development of an HMM-based speech synthesis system for Malay

Mumtaz Begum Mustafa (University of Malaya)
Ainon Raja Noor (University of Malaya)
Roziati Zainuddin (University of Malaya)
Zuraidah M. Don (University of Malaya)
Gerry Knowles (University of Malaya)

This research reports the development of an HMM-based speech synthesis system for Malay, which is an under-resourced language with few resources including recorded speech and segmental labels. We propose the cross-lingual use of resources for developing a Malay HMM-based speech synthesis system. We used the Festival English speech synthesis system to generate time-aligned phone transcriptions for Malay using a specially constructed Malay grapheme-to-phoneme database and English CART. These transcriptions together with Malay recorded speech databases were used for training and synthesis of Malay speech. The effectiveness of the proposed approach is confirmed by intelligibility and naturalness tests on the synthetic speech.

Wed-Ses2-O1:
Speaker Diarization II

Time:Wednesday 13:30 Place:Auditorium - Pala Congressi Type:Oral
Chair:Hagai Aronowitz

13:30Prosodic and Phonetic Features for Speaker Clustering in Speaker Diarization Systems

Janez Zibert (Department of Information Sciences and Technology, University of Primorska, Koper, Slovenia)
France Mihelic (Faculty of Electrical Engineering, University of Ljubljana, Ljubljana, Slovenia)

This paper is focused on speaker clustering methods that are used in speaker diarization systems. We concentrate on developing proper representations of speaker segments for clustering and research different similarity measures for joining speaker segments. We realize two speaker clustering systems. The first is a standard approach using a bottom-up agglomerative clustering principle with the BIC as a merging criterion. In the second system we developed a fusion-based speaker-clustering, where speaker segments are modeled by acoustic and prosodic representations. In this way we additionally model the speaker prosodic characteristics and combine them with the basic acoustic information of speakers, which leads to improved clustering of the segments in the case of similar speaker acoustic properties and poor acoustic conditions.
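A BIC merging criterion of the kind used in the first system can be sketched as below. This is a minimal illustration, not the authors' implementation: it assumes each segment is modeled by a single diagonal-covariance Gaussian, and the names `delta_bic`, `penalty_weight` and the toy segments are hypothetical.

```python
import math

def _sum_log_var(frames):
    """Sum of log variances over feature dimensions (log-determinant of a diagonal covariance)."""
    n, d = len(frames), len(frames[0])
    total = 0.0
    for k in range(d):
        column = [f[k] for f in frames]
        mean = sum(column) / n
        var = sum((x - mean) ** 2 for x in column) / n
        total += math.log(max(var, 1e-6))  # variance floor avoids log(0)
    return total

def delta_bic(seg_a, seg_b, penalty_weight=1.0):
    """Delta-BIC between modeling two segments separately and jointly.

    Negative values favour one shared model (merge the segments);
    positive values favour two separate models (keep them apart).
    """
    n_a, n_b = len(seg_a), len(seg_b)
    n = n_a + n_b
    d = len(seg_a[0])
    gain = 0.5 * (n * _sum_log_var(seg_a + seg_b)
                  - n_a * _sum_log_var(seg_a)
                  - n_b * _sum_log_var(seg_b))
    # A diagonal Gaussian has 2*d free parameters (d means, d variances).
    penalty = 0.5 * penalty_weight * (2 * d) * math.log(n)
    return gain - penalty

# Hypothetical 1-D toy segments: two with the same statistics, one shifted.
seg_same_1 = [[0.1 * (i % 10)] for i in range(50)]
seg_same_2 = [[0.1 * (i % 10)] for i in range(50)]
seg_other = [[5.0 + 0.1 * (i % 10)] for i in range(50)]
```

In bottom-up agglomerative clustering, the pair of clusters with the lowest delta-BIC is merged at each step, and merging stops once no pair has a negative value.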

13:50Diarization-based Speaker Retrieval for Broadcast Television Archives

Marijn Huijbregts (Radboud University Nijmegen, Centre for Language and Speech Technology)
David van Leeuwen (Radboud University Nijmegen, Centre for Language and Speech Technology)

In this study we extend a query-by-example diarization-based speaker retrieval system to a full speaker retrieval system for broadcast television. The envisioned system is capable of finding all speakers in an archive using their names instead of example speech fragments. Information extracted from a television guide is used to label speaker clusters that most likely correspond to the found names. As part of the labeling process, all speaker clusters are first classified automatically based on their role in the programs they appear in. The role classification accuracy is 64% on our evaluation set. Speaker names can automatically be attributed to a fraction of the speaker clusters with an accuracy of 70%.

14:10The detection of overlapping speech with prosodic features for speaker diarization

Martin Zelenák (Universitat Politecnica de Catalunya)
Javier Hernando (Universitat Politecnica de Catalunya)

Overlapping speech is responsible for a certain amount of the errors produced by standard speaker diarization systems in meeting environments. We are investigating a set of prosody-based long-term features as a potential complement to our overlap detection system relying on short-term spectral parameters. The most relevant features are selected in a two-step process. They are first evaluated and sorted according to the mRMR criterion, and then the optimal number is determined by an iterative wrapper approach. We show that the addition of prosodic features decreased overlap detection error. Detected overlap segments are used in speaker diarization to recover missed speech by assigning multiple speaker labels and to increase the purity of speaker clusters.

14:30LP Residual Features for Robust, Privacy-Sensitive Speaker Diarization

Sree Hari Krishnan Parthasarathi (Idiap Research Institute, EPFL)
Herve Bourlard (Idiap Research Institute, EPFL)
Daniel Gatica-Perez (Idiap Research Institute, EPFL)

We present a comprehensive study of linear prediction residual for speaker diarization on single and multiple distant microphone conditions in privacy-sensitive settings, a requirement to analyze a wide range of spontaneous conversations. Two representations of the residual are compared, namely real-cepstrum and MFCC, with the latter performing better. Experiments on RT06eval show that residual with subband information from 2.5 kHz to 3.5 kHz and spectral slope yields a performance close to traditional MFCC features. As a way to objectively evaluate privacy in terms of linguistic information, we perform phoneme recognition. Residual features yield low phoneme accuracies compared to traditional MFCC features.

14:50Extending the Task of Diarization to Speaker Attribution

Houman Ghaemmaghami (Queensland University of Technology)
David Dean (Queensland University of Technology)
Robbie Vogt (Queensland University of Technology)
Sridha Sridharan (Queensland University of Technology)

In this paper we extend the concept of speaker annotation within a single recording, or speaker diarization, to a collection-wide approach we call speaker attribution. Accordingly, speaker attribution is the task of clustering expectantly homogenous inter-session clusters obtained using diarization according to common cross-recording identities. The result of attribution is a collection of spoken audio across multiple recordings attributed to speaker identities. In this paper, an attribution system is proposed using mean-only MAP adaptation of a combined-gender UBM to model clusters from a perfect diarization system, as well as a JFA-based system with session variability compensation. The normalized cross-likelihood ratio is calculated for each pair of clusters to construct an attribution matrix and the complete linkage algorithm is employed to conduct clustering of the inter-session clusters. A matched cluster purity and coverage of 87.1% was obtained on the NIST 2008 SRE corpus.
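The clustering step can be illustrated with a small complete-linkage sketch. It assumes the attribution matrix has already been turned into a symmetric matrix of pairwise dissimilarities; the toy values, the stopping `threshold` and the function name are hypothetical, and the computation of the normalized cross-likelihood ratio itself is not shown.

```python
def complete_linkage(dist, threshold):
    """Agglomerative clustering with complete linkage.

    dist is a symmetric matrix of pairwise dissimilarities.
    Two clusters are merged only while the largest dissimilarity
    between any of their members stays at or below threshold.
    """
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > 1:
        best = None  # (linkage distance, index a, index b)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                linkage = max(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or linkage < best[0]:
                    best = (linkage, a, b)
        if best[0] > threshold:
            break  # no remaining pair is close enough to merge
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]

# Hypothetical dissimilarities between four per-recording speaker clusters.
toy_dist = [
    [0.0, 0.2, 0.9, 0.8],
    [0.2, 0.0, 0.7, 0.9],
    [0.9, 0.7, 0.0, 0.3],
    [0.8, 0.9, 0.3, 0.0],
]
identities = complete_linkage(toy_dist, threshold=0.5)
```

Complete linkage is conservative: a cluster joins another only if every cross-pair is similar, which guards against chaining unrelated speakers through a single ambiguous recording.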

15:10Comparing Multi-Stage Approaches for Cross-Show Speaker Diarization

Viet-Anh Tran (LIMSI-CNRS)
Viet Bac Le (Vocapia Research)
Claude Barras (LIMSI-CNRS)
Lori Lamel (LIMSI-CNRS)

Acoustic speaker diarization is investigated for situations where a collection of shows from the same source needs to be processed. In this case, the same speaker should receive the same label across all shows. We compare different architectures for cross-show speaker diarization: the obvious concatenation of all shows, a hybrid system combining a local first clustering stage with a second global stage, and an incremental system which processes the shows in a predefined order and updates the speaker models accordingly, this latter system being best suited to real applicative situations. These three strategies were compared to a baseline system on a set of 46 ten-minute samples of British English scientific podcasts.

Wed-Ses2-O3:
Adaptation for ASR

Time:Wednesday 13:30 Place:Brunelleschi (Green Room) - Pala Congressi - 2nd Floor Type:Oral
Chair:Phil Woodland

13:30Model Adaptation for Automatic Speech Recognition Based on Multiple Time Scale Evolution

Shinji Watanabe (NTT Corporation)
Atsushi Nakamura (NTT Corporation)
Biing-Hwang Juang (Georgia Institute of Technology)

Changes in speech characteristics originate from various factors, at various (temporal) rates, in real-world conversation. These temporal changes have their own dynamics and therefore, we propose to extend the single (time-) incremental adaptations to a multiscale adaptation, which has the potential of greatly increasing the model's robustness as it will include adaptation mechanisms to approximate the nature of the characteristic change. The formulation of the incremental adaptation assumes a time evolution system of the model, where the posterior distributions, used in the decision process, are successively updated based on a macroscopic time scale in accordance with the Kalman filter theory. In this paper, we extend the original incremental adaptation scheme, based on a single time scale, to multiple time scales, and apply the method to the adaptation of both the acoustic model and the language model. We further investigate methods to integrate the multi-scale adaptation scheme to realize robust speech recognition performance. Large vocabulary continuous speech recognition experiments for English and Japanese lectures revealed the importance of modeling multiscale properties in speech recognition.

13:50Integrated Online Speaker Clustering and Adaptation

Catherine Breslin (Toshiba Research Europe Ltd.)
KK Chin (Toshiba Research Europe Ltd.)
Mark Gales (Toshiba Research Europe Ltd.)
Kate Knill (Toshiba Research Europe Ltd.)

For many applications, it is necessary to produce speech transcriptions in a causal fashion. To produce high quality transcripts, speaker adaptation is often used. This requires online speaker clustering and incremental adaptation techniques to be developed. This paper presents an integrated approach to online speaker clustering and adaptation which allows efficient clustering of speakers using the same accumulated statistics that are normally used for adaptation. Using a consistent criterion for both clustering and adaptation should yield gains for both stages. The proposed approach is evaluated on a meetings transcription task using audio from multiple distant microphones. Consistent gains over standard clustering and adaptation were obtained.

14:10A study on speaker normalized MLP features in LVCSR

Zoltán Tüske (Human Language Technology and Pattern Recognition, Computer Science Department, RWTH Aachen University)
Christian Plahl (Human Language Technology and Pattern Recognition, Computer Science Department, RWTH Aachen University)
Ralf Schlüter (Human Language Technology and Pattern Recognition, Computer Science Department, RWTH Aachen University)

Different normalization methods are applied in recent Large Vocabulary Continuous Speech Recognition Systems (LVCSR) to reduce the influence of speaker variability on the acoustic models. In this paper we investigate the use of Vocal Tract Length Normalization (VTLN) and Speaker Adaptive Training (SAT) in Multi Layer Perceptron (MLP) feature extraction on an English task. We achieve significant improvements by each normalization method and we gain further by stacking the normalizations. Studying features transformed by Constrained Maximum Likelihood Linear Regression (CMLLR) based SAT as possible input for MLP, further experiments show that MLP could not consistently take advantage of SAT as it does in the case of VTLN.

14:30Matrix-Variate Distribution of Training Models for Robust Speaker Adaptation

Yongwon Jeong (Pusan National University)
Young Kuk Kim (LG Electronics)

In this paper, we describe a new speaker adaptation method based on the matrix-variate distribution of training models. A set of mean vectors of hidden Markov models (HMMs) is assumed to be drawn from the matrix-variate normal distribution, and bases are derived under this assumption. The resulting bases have the same dimension as that of the eigenvoice, thus adaptation can be performed using the same equation. In the isolated-word experiments, the proposed method showed performance comparable to the eigenvoice in a clean environment, and showed better performance than the eigenvoice in both babble and factory floor noises. The experimental results demonstrated the validity of the matrix-variate normal assumption about the training models, thus the proposed method can be used for rapid speaker adaptation in noisy environments.

14:50Separating Speaker and Environmental Variability Using Factored Transforms

Michael Seltzer (Microsoft Research)
Alex Acero (Microsoft Research)

Two primary sources of variability that degrade accuracy in speech recognition systems are the speaker and the environment. While many algorithms for speaker or environment adaptation have been proposed to improve performance, far less attention has been paid to approaches that address both factors. In this paper, we present a method for compensating for speaker and environmental mismatch using a cascade of CMLLR transforms. The proposed approach enables speaker transforms estimated in one environment to be effectively applied to speech from the same user in a different environment. This approach can be further improved using a new training method called speaker and environment adaptive training. When applying speaker transforms to new environments, the proposed approach results in a 13% relative improvement over conventional CMLLR.

15:10Your Mobile Virtual Assistant Just Got Smarter!

Mazin Gilbert (AT&T)
Iker Arizmendi (AT&T)
Enrico Bocchieri (AT&T)
Diamantino Caseiro (AT&T)
Vincent Goffin (AT&T)
Andrej Ljolje (AT&T)
Mike Philips (Vlingo)
Chao Wang (Vlingo)
Jay Wilpon (AT&T)

A Mobile Virtual Assistant (MVA) is a communication agent that recognizes and understands free speech, and performs actions such as retrieving information and completing transactions. One essential characteristic of MVAs is their ability to learn and adapt without supervision. This paper describes our ongoing research in developing more intelligent MVAs that recognize and understand very large vocabulary speech input across a variety of tasks. In particular, we present our architecture for unsupervised acoustic and language model adaptation. Experimental results show that unsupervised acoustic model learning approaches the performance of supervised learning when adapting on 40-50 device-specific utterances. Unsupervised language model learning results in an 8% absolute drop in word error rate.

Wed-Ses2-O2:
Prosody II

Time:Wednesday 13:30 Place:Leonardo - Pala Affari - Ground Floor Type:Oral
Chair:Pilar Prieto

13:30Analysing the correspondence between automatic prosodic segmentation and syntactic structure

Gyorgy Szaszak (Department of Telecommunication and Media Informatics, Budapest University for Technology and Economics, Budapest, Hungary)
Katalin Nagy (Department of Telecommunication and Media Informatics, Budapest University for Technology and Economics, Budapest, Hungary)
Andras Beke (Research Institute for Linguistics, Hungarian Academy of Sciences, Budapest, Hungary)

Prosody and syntax are highly related, even if the prosodic structure cannot be directly mapped to the syntactic one and vice versa. This paper presents an experiment exploring to what degree a powerful HMM-based automatic prosodic segmentation tool can recover the syntactic structure of an utterance in speech understanding systems. Results show that the approach is capable of recalling up to 92% of syntactic clause boundaries and up to 71% of embedded syntactic phrase boundaries based on the detection of phonological phrases. Recall rates do not depend further on the syntactic level (whether the phrase is multiply embedded or not), but clause boundaries can be well separated from lower level syntactic phrases based on the type of the aligned phonological phrase(s). These findings can be exploited in speech understanding systems, allowing for the recovery of the skeleton of the syntactic structure, based purely on the speech signal.

13:50Long-distance rhythmic dependencies and their application to automatic language identification

Joseph Tepperman (Rosetta Stone Labs)
Emily Nava (Rosetta Stone Labs)

The perception of rhythmic differences among languages relies on varieties in periodicity within prominence groups. But the consensus in phonetic research on rhythm is that existing measures don't capture true rhythm by that definition - instead, they merely measure short-term timing. This work proposes a new rhythm measure, the Generalized Variability Index (GVI), that examines durational contexts over arbitrarily long linguistic distances. To evaluate this new measure, we conducted a set of experiments in automatic language identification using large amounts of data from 11 languages in the Globalphone and TIMIT corpora. When added to baseline rhythm measures, these new GVI features offer absolute improvement in 11-way language classification accuracy by as much as 12%. Moreover, the addition of wider and wider durational context in the GVI continues to contribute information useful for automatic language ID, abating in usefulness only at a distance of about 10 syllables.

14:10Symbolic and Direct Sequential Modeling of Prosody for Classification of Speaking-Style and Nativeness

Andrew Rosenberg (Queens College / CUNY)

In this paper, we explore the differences between direct and symbolic sequential modeling of prosody. We use sequential models to characterize speech in two tasks, classifying speaking-style and distinguishing native from non-native speech. We explore the use of a spike-and-slab model to directly model pitch contour data. We find in both of these tasks that sequences of symbolic prosodic events lead to improved performance over approaches that model pitch contours directly. We also explore the use of hypothesized prosodic events in both tasks. We find the speaking-style results to be robust to automatic annotation, whereas, when classifying nativeness, the spike-and-slab model leads to better performance.

14:30Prosodic Analysis and Perception of Mandarin Utterances Conveying Attitudes

Wentao Gu (Nanjing Normal University)
Ting Zhang (Nanjing Normal University)
Hiroya Fujisaki (The University of Tokyo)

After differentiating attitudes from emotions, the present work investigates prosodic manifestations and perceptual attributes of Mandarin utterances conveying various attitudes. A speech corpus was designed to incorporate five classes of attitudes: friendly/hostile, polite/rude, serious/joking, praising/blaming, and confident/uncertain. A perceptual experiment reveals two different patterns between intended and perceived attitudes. Statistical analysis of prosodic features shows that speech rate is distinctive in all five classes, while utterance-level F0 height and F0 range are distinctive only for some classes. Moreover, F0 features in the words carrying sentential stress are more distinctive than utterance-level settings. The relation between perception and acoustics is also examined.

14:50Predicting Taiwan Mandarin tone shapes from their duration

Chierh Cheng (Department of Speech, Hearing and Phonetic Sciences, University College London, UK)
Michele Gubian (Centre for Language and Speech Technology, Radboud University, Nijmegen, NL)

A preliminary study on modelling tonal variation as a function of duration is carried out. An experimentally controlled acoustic database was used to construct functional linear models. In the construction of the linear models, duration was used as the independent variable in predicting the shape of disyllabic pitch contours in Taiwan Mandarin, given the target tone sequences. Results showed that by moving duration values from long to short, tonal curve shapes of disyllables ranging from non-reduced to reduced were approximated with an adequate goodness-of-fit (usually below one semitone RMSE). This study provides a novel approach to examining the relation between duration and F0 realisation of small units such as disyllables and also supports the time pressure account of phonetic reduction in general.

15:10Variation of Accent Type and of Context – Influences on Pragmatic Focus Interpretation

Charlotte Wollermann (University of Bonn; University of Duisburg-Essen)
Ulrich Schade (University of Bonn; Fraunhofer Institute for Communication, Information Processing and Ergonomics)
Bernhard Schröder (University of Duisburg-Essen)

We present an empirical study on the variation of accent type and of context on pragmatic focus interpretation. The material was based on audio-recordings of nine German speakers who were instructed to read dialogues with embedded question-answer pairs in which the answers constituted the pragmatic focus of the utterance. Different accent types occurred for marking the focus constituent. The audio-material was presented to 53 subjects. Interpretation was tested by using pictures intended to illustrate the (non-)exhaustive reading. When presenting the picture illustrating the non-exhaustive reading, the results show in general a significant influence of both context and prosody, but the contextual influence is stronger.

Wed-Ses2-O4:
SLP for Information Extraction and Retrieval II

Time:Wednesday 13:30 Place:Michelangelo - Pala Affari - 2nd Floor Type:Oral
Chair:Haizhou Li

13:30Topic Segmentation of TV-streams by mathematical morphology and vectorization

Vincent Claveau (IRISA-CNRS)
Sébastien Lefèvre (Valoria, Univ. of South Brittany)

A fine-grained segmentation of radio or TV broadcasts is an essential step for most multimedia processing. Applying segmentation algorithms to the speech transcripts seems straightforward. Yet most of these algorithms are not suited to short segments or noisy data. In this paper, we propose a new segmentation technique inspired by the image segmentation field and relying on a new way to compute similarities between candidate segments. This new topic segmentation technique is evaluated on two corpora of French TV broadcasts, on which it largely outperforms existing state-of-the-art approaches.

13:50Probabilistic Latent Semantic Analysis for Broadcast News Story Segmentation

Mimi Lu (Shaanxi Provincial Key Laboratory of Speech and Image Information Processing, School of Computer Science, Northwestern Polytechnical University, China;Institute for Infocomm Research, A*STAR, Singapore)
Cheung-Chi Leung (Institute for Infocomm Research, A*STAR, Singapore)
Lei Xie (Shaanxi Provincial Key Laboratory of Speech and Image Information Processing, School of Computer Science, Northwestern Polytechnical University, China)
Bin Ma (Institute for Infocomm Research, A*STAR, Singapore)
Haizhou Li (Institute for Infocomm Research, A*STAR, Singapore)

This paper proposes to perform probabilistic latent semantic analysis (PLSA) for broadcast news story segmentation. PLSA exploits a deeper latent relation among terms beyond their occurrences, so conceptual matching can be employed to replace literal matching. Unlike text segmentation, lexical based story segmentation has to be carried out on LVCSR transcripts, where the incorrect recognition of out-of-vocabulary words inevitably impacts the semantic relation. We use phoneme subwords as the basic term units to address this problem. We integrate a cross entropy measurement with PLSA to depict lexical cohesion and compare its performance with the widely used cosine similarity metric. Furthermore, we evaluate two approaches, namely TextTiling and dynamic programming (DP), for story boundary identification. Experimental results show that PLSA based methods bring a significant performance boost to story segmentation and the cross entropy based DP approach provides the best performance.

14:10Hybrid Speech Recognition for Voice Search: a Comparative Study

Evandro Gouvea (Carnegie Mellon University)

We compare different units for use in information retrieval of items by voice. We compare a word based system with a subword based one, a combination of these into a hybrid system, and a phonetic one. The subword set is derived by splitting words using a Minimum Description Length (MDL) criterion. In general, we convert an index written in terms of words into an index written in terms of these different units. A speech recognition engine that uses a language model and pronunciation dictionary built from each such inventory of units is completely independent of the information retrieval task and can therefore remain fixed, making this approach ideal for resource constrained systems. We demonstrate that recognition accuracy and recall results at higher OOV rates are much better for the hybrid system than for the alternatives. On a music lyrics task at 80% OOV, the hybrid system has a recall of 82.9%, compared to 75.2% for the subword-based one and 47.4% for a word system.

14:30A New Phonetic Candidate Generator for Improving Search Query Efficiency

Bo Peng (Microsoft Research Asia)
Yao Qian (Microsoft Research Asia)
Frank Soong (Microsoft Research Asia)
Bo Zhang (College of Software, NanKai University)

Queries misspelled because of homophones or mispronunciation are difficult to correct with conventional spelling correction methods. In phonetic candidate generation, the generator produces candidates which are phonetically similar to a given query. In this paper, we present a new phonetic candidate generator for improving the search efficiency of a query. The proposed generator consists of three modules: letter-to-sound (LTS) conversion, a phonetic “trie”, and a phonetic similarity estimator based upon Levenshtein distance and Kullback-Leibler Divergence (KLD) between phones. This generator yields a significant improvement over Double-metaphone in terms of candidate accuracy and effective candidate set size.
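
As an illustration of the Levenshtein component mentioned in this abstract, the following is a minimal sketch that ranks lexicon entries by edit distance over phone sequences. It uses a unit substitution cost, whereas the abstract describes a KLD-based measure between phones; the lexicon, its phone symbols, and the function names are hypothetical and not taken from the paper.

```python
def levenshtein(a, b):
    """Edit distance between two sequences (here: lists of phone symbols)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1  # unit cost; an assumption of this sketch
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def rank_candidates(query_phones, lexicon):
    """Sort lexicon words by phone-level edit distance to the query."""
    return sorted(lexicon, key=lambda w: levenshtein(query_phones, lexicon[w]))


# Hypothetical phone transcriptions, for illustration only.
LEXICON = {
    "knight": ["n", "ay", "t"],
    "night": ["n", "ay", "t"],
    "neat": ["n", "iy", "t"],
    "bright": ["b", "r", "ay", "t"],
}

print(rank_candidates(["n", "ay", "t"], LEXICON))
```

Replacing the unit cost with a phone-to-phone dissimilarity would make acoustically close substitutions cheaper than distant ones, which is the role the abstract assigns to the KLD term.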

14:50Towards Voice-Input Symbolic Pattern Retrieval using Parameter-Based Search

Yukiko Suzuki (School of Media Science, Tokyo University of Technology, Tokyo, Japan)
Kiyoaki Aikawa (School of Media Science, Tokyo University of Technology, Tokyo, Japan)

This paper proposes a symbolic pattern retrieval method using emotional feature vectors. Queries and symbolic patterns are represented by emotional vectors composed of eight numerical parameters. Since the proposed method uses numerical vectors close to raw data instead of recognized text, the information loss from data conversion is small. This is an advantage over conventional text-based search such as recent spoken document retrieval approaches. Five similarity measures were compared on a test collection. The cosine similarity and the Euclidean distance showed the best performance among the five similarity measures. OOV analysis clarified several problems for achieving voice-input symbolic pattern retrieval.
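
The two best-performing measures named in this abstract can be sketched as follows for eight-parameter vectors. The pattern names and vector values are invented for illustration, and the retrieval function is an assumption about how such a search could be organised, not the authors' system.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def euclidean_distance(u, v):
    """Straight-line distance between two vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def retrieve(query, patterns, top_k=1):
    """Return the names of the top_k patterns with the highest cosine similarity."""
    ranked = sorted(patterns,
                    key=lambda name: cosine_similarity(query, patterns[name]),
                    reverse=True)
    return ranked[:top_k]


# Hypothetical eight-parameter emotional vectors.
PATTERNS = {
    "smile": [0.9, 0.1, 0.0, 0.0, 0.2, 0.0, 0.0, 0.1],
    "tears": [0.0, 0.8, 0.1, 0.0, 0.0, 0.3, 0.0, 0.0],
    "anger": [0.0, 0.1, 0.9, 0.2, 0.0, 0.0, 0.1, 0.0],
}
QUERY = [0.8, 0.2, 0.0, 0.0, 0.1, 0.0, 0.0, 0.0]

print(retrieve(QUERY, PATTERNS))
```

Cosine similarity ignores vector length and compares direction only, while Euclidean distance is sensitive to magnitude; on length-normalised vectors the two produce the same ranking.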

15:10A Language Independent Approach to Audio Search

Vikram Gupta (IIT Delhi, India)
Jitendra Ajmera (IBM Research, India)
Arun Kumar (CARE, IIT Delhi, India)
Ashish Verma (IBM Research, India)

In this paper, we propose an approach to audio search that requires no language specific resources. This approach is most useful in scenarios where no training data exists to create an automatic speech recognition (ASR) system for a language, e.g. in the case of most regional languages or dialects. In this approach, a neural network is trained for a language where training data exists, e.g. English. This neural network estimates a sequence of probability vectors for an audio segment, which is referred to as the posteriorgram representation of that segment. Components of the probability vector are posterior probabilities of English phonemes at any given frame of speech. A template matching technique is then used to compare the query posteriorgram against the content posteriorgram over the searchable audio content. We present experiments to show that, even for another language such as Hindi, the probabilities obtained from the neural network trained on English provide a characteristic representation of a word. A dynamic time warping algorithm with appropriate modifications is applied, and an encouraging P@N performance of 46.24% for Hindi and 65.22% for English is reported for the audio search task, using the same MLP trained on English data in both cases.
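
The template matching step described in this abstract can be sketched with plain dynamic time warping over posteriorgram frames. The abstract mentions modifications to DTW that it does not specify, so this sketch uses the textbook recursion; the frame distance (negative log inner product) and the length normalisation are assumptions, and the two-class posterior vectors are toy data.

```python
import math


def frame_distance(p, q, eps=1e-10):
    """Negative log inner product of two posterior vectors (small when frames agree)."""
    return -math.log(max(sum(a * b for a, b in zip(p, q)), eps))


def dtw_cost(query, content):
    """Length-normalised DTW alignment cost between two posteriorgrams."""
    n, m = len(query), len(content)
    inf = float("inf")
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = frame_distance(query[i - 1], content[j - 1])
            # Extend the cheapest of the three allowed predecessor paths.
            acc[i][j] = d + min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
    return acc[n][m] / (n + m)


# Toy posterior vectors over two hypothetical phoneme classes.
A = [0.9, 0.1]
B = [0.1, 0.9]
query = [A, B]
matching = [A, A, B]      # same phone order, spoken more slowly
mismatching = [B, B, A]   # different phone order

print(dtw_cost(query, matching), dtw_cost(query, mismatching))
```

A lower cost indicates a better match, so candidate segments of the searchable audio would be ranked by ascending cost.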

Wed-Ses2-S1:
Speaker State Challenge - Intoxication and Sleepiness II

Time:Wednesday 13:30 Place:Raffaello - Pala Affari - 3rd Floor Type:Oral
Chair:Anton Batliner

13:30Perception of Alcoholic Intoxication in Speech

Florian Schiel (Bavarian Archive for Speech Signals, Ludwig-Maximilians-Universität)

The ALC sub-challenge of the Interspeech Speaker State Challenge (ISSC) aims at the automatic classification of speech signals into intoxicated and sober speech. In this context we conducted a perception experiment on data derived from the same corpus to analyse the performance of humans on the same task. The results show that humans still outperform comparable baseline results of ISSC. Female and male listeners perform at the same level, but there is strong evidence that intoxication in female voices is easier to recognize than in male voices. Prosodic features contribute to the decision of human listeners but do not seem to be dominant. In analogy to Doddington's zoo of speaker verification, we find some evidence for the existence of lambs and goats but no wolves.

13:50Detecting sleepiness by fusing classifiers trained with novel acoustic features

Tauhidur Rahman (University of Texas at Dallas)
Soroosh Mariooryad (University of Texas at Dallas)
Shalini Keshavamurthy (University of Texas at Dallas)
Gang Liu (University of Texas at Dallas)
John H.L. Hansen (University of Texas at Dallas)
Carlos Busso (University of Texas at Dallas)

Automatic sleepiness detection is a challenging task that can lead to advances in various domains including traffic safety, medicine and human-machine interaction. This paper analyzes the discriminative power of different acoustic features to detect sleepiness. The study uses the Sleepy Language Corpus (SLC). Along with standard acoustic features, novel features are proposed, including functionals across voiced segment statistics in the F0 contour, likelihoods of reference models used to contrast non-neutral speech, and a set of noise-robust spectral features. These feature sets, which have performed well in other paralinguistic tasks such as emotion recognition, are used to train classifiers that are combined at the feature and decision levels. The best unweighted accuracy (UA) is obtained by combining the classifiers at the decision level under a maximum likelihood framework (UA = 70.97%). This performance is higher than the best results reported on the corpus.
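
The unweighted accuracy (UA) quoted in the abstracts of this session is the mean of the per-class recalls, so each class counts equally regardless of how many examples it has. A minimal sketch of the computation, with invented labels ("SL" for sleepy, "NSL" for not sleepy) and invented predictions:

```python
from collections import defaultdict


def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls (unweighted average recall)."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Imbalanced toy data: a classifier that always answers "NSL".
y_true = ["NSL"] * 8 + ["SL"] * 2
y_pred = ["NSL"] * 10

plain_accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(plain_accuracy, unweighted_accuracy(y_true, y_pred))
```

On this toy data the always-"NSL" classifier reaches 80% plain accuracy but only 50% UA, which is why UA is the preferred measure when the classes are imbalanced.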

14:10An HMM-Based Approach to the INTERSPEECH 2011 Speaker State Challenge

Albino Nogueiras (Universitat Politecnica de Catalunya. Barcelona, SPAIN.)

The current main trend in paralinguistic information recognition is so-called static classification. In this kind of classification the low level descriptors are pooled together by means of statistical functionals, and all, or almost all, information about the temporal structure and evolution of speech is lost. Although this approach represents the state of the art, we believe that dynamic classification, where temporal information is kept, still deserves some attention because it can handle aspects that the static one cannot. In this paper the INTERSPEECH 2011 Speaker State Challenge is addressed using the Automatic Speech Recognition system developed at UPC, which has already been used in a similar task: emotion recognition. Although results fall below the baseline, we believe that they are close enough to be taken into account.

14:30RANSAC-based Training Data Selection for Speaker State Recognition

Elif Bozkurt (Koc University, Istanbul, Turkey)
Engin Erzin (Koc University, Istanbul, Turkey)
Cigdem Eroglu Erdem (Bahcesehir University, Istanbul, Turkey)
Arif Tanju Erdem (Ozyegin University, Istanbul, Turkey)

We present a Random Sampling Consensus (RANSAC) based training approach for the problem of speaker state recognition from spontaneous speech. Our system is trained and tested with the INTERSPEECH 2011 Speaker State Challenge corpora, which include the Intoxication and the Sleepiness Sub-challenges, where each sub-challenge defines a two-class classification task. We perform RANSAC-based training data selection coupled with Support Vector Machine (SVM) based classification to prune possible outliers in the training data. Our experimental evaluations indicate that RANSAC-based training data selection provides 66.32% and 65.38% unweighted average (UA) recall on the development and test sets for the Sleepiness Sub-challenge, respectively, and a slight improvement in Intoxication Sub-challenge performance.

14:50University of Ljubljana System for Interspeech 2011 Speaker State Challenge

Rok Gajšek (University of Ljubljana)
Simon Dobrišek (University of Ljubljana)
France Mihelič (University of Ljubljana)

The paper presents our efforts in the Interspeech 2011 Speaker State Challenge. Both systems, for the Intoxication and the Sleepiness Sub-Challenge, are based on a Universal Background Model (UBM) in the form of a Hidden Markov Model (HMM) and Maximum A Posteriori (MAP) adaptation. With the combination of our HMM-UBM-MAP derived super-vectors and selected statistical functionals from the baseline feature set, we were able to surpass the baseline system in both sub-challenges. By employing majority voting fusion of the best systems we were able to further improve the performance. In the Intoxication Sub-Challenge our best result on the test set is 67.46%, and in the Sleepiness Sub-Challenge 71.28%.
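
Majority voting fusion, as mentioned in this abstract, can be sketched as taking the most frequent label per utterance across systems. The abstract does not say how many systems were fused or how ties were handled; the three systems, the labels ("I" for intoxicated, "S" for sober), and the predictions below are hypothetical.

```python
from collections import Counter


def majority_vote(predictions):
    """Fuse per-system label lists (all of equal length) by per-utterance majority."""
    fused = []
    for labels in zip(*predictions):  # labels from every system for one utterance
        fused.append(Counter(labels).most_common(1)[0][0])
    return fused


# Hypothetical outputs of three systems on four utterances.
system_a = ["I", "S", "I", "S"]
system_b = ["I", "I", "S", "S"]
system_c = ["S", "I", "I", "S"]

print(majority_vote([system_a, system_b, system_c]))
```

With an odd number of systems and two classes no ties can occur, which is one reason to fuse three or five systems rather than an even number.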

15:10Speaker State Classification Based on Fusion of Asymmetric SIMPLS and Support Vector Machines

Dong-Yan Huang (Institute for Infocomm Research)
Shuzhi Sam Ge (Social Robotics Lab, Interactive Digital Media Institute)
Zhengchen Zhang (Social Robotics Lab, Interactive Digital Media Institute)

This paper describes a Speaker State Classification System (SSCS) for the INTERSPEECH 2011 Speaker State Challenge. Our SSC system for the Intoxication and Sleepiness Sub-Challenges uses fusion of several individual sub-systems. We make use of the three standard feature sets per corpus provided by the organizers, together with MFCCs. Modeling is based on our own classification method, asymmetric simple partial least squares (ASIMPLS), and on Support Vector Machines (SVMs), followed by calibration and multiple fusion methods. The advantage of asymmetric SIMPLS is that it tends to protect the minority class from being misclassified while boosting performance on the majority class. Our experimental results show that our SSC system performs better than the baseline system. Our final fusion results in a 1.8% absolute improvement in unweighted accuracy for the Alcohol Language Corpus (ALC) and about 0.7% for the Sleepy Language Corpus (SLC) on the development set over the baseline. On the test set, we obtain 1.1% and 1.4% absolute improvement, respectively.