12th Annual Conference of the
International Speech Communication Association

Interspeech 2011 Florence

Technical Programme

This is the final programme for this session. For oral sessions, the timing on the left is the current presentation order, but this may still change, so please check at the conference itself.

Wed-Ses1-S2-P:
Speech Technology for Under-Resourced Languages II

Time: Wednesday 11:00
Place: Caravaggio (Adua 1) - Pala Affari - 1st Floor
Type: Poster
Chair: Laurent Besacier, Alexey Karpov

#1 Automatic Prosody Generation for Serbo-Croatian Speech Synthesis Based on Regression Trees

Milan Sečujski (Faculty of Technical Sciences, University of Novi Sad, Serbia)
Darko Pekar (“AlfaNum – Speech Technologies Ltd.”, Novi Sad, Serbia)
Nikša Jakovljević (Faculty of Technical Sciences, University of Novi Sad, Serbia)

The paper presents the module for automatic generation of prosodic features of synthesized speech, namely f0 targets and phonetic segment durations, within the speech synthesizer AlfaNumTTS, the most sophisticated speech synthesis system for the Serbo-Croatian language to date. The module is based on regression trees trained on a studio-recorded single-speaker database of Serbo-Croatian. The database has been annotated for phonemic identity as well as a number of prosodic events such as pitch accents, phrase breaks and prosodic prominence. Besides the traditional description of the intonational phonology of Serbo-Croatian through four distinct accent types, within this study we have examined the possibility of representing them as tonal sequences, which has been suggested in recent linguistic literature. The results obtained confirm that the four accents can indeed be reduced to sequences of high and low tones without loss of quality, provided that phonemic length contrast is preserved.

#2 Very Large Vocabulary ASR for Spoken Russian with Syntactic and Morphemic Analysis

Alexey Karpov (St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences (SPIIRAS))
Irina Kipyatkova (St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences (SPIIRAS))
Andrey Ronzhin (St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences (SPIIRAS))

In this paper, we present a word-based very large vocabulary automatic speech recognition system for Russian. Some novel methods are proposed for organization of the lexicon and the language model. A two-level morpho-phonemic prefix graph that uses some information on the morphemic structure of lexical units is suggested for a compact representation of the pronunciation vocabulary and search space. Such a model is more compact than the lexical tree or the linearly-based vocabulary and speeds up the recognition process. Syntactic analysis of a training text corpus, in combination with statistical analysis, is suggested for generation of N-gram language models. The syntax-based Russian language model allows long-distance syntactic dependencies between word pairs to be taken into account. The results show that the syntactic-statistic language model gives a 5% relative improvement in the word and letter error rates with respect to the baseline models.
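As a reading aid, the "relative improvement" quoted above is usually computed as the reduction in error rate divided by the baseline error rate. The sketch below assumes that convention; the function name and the error rates in the usage note are hypothetical and are not figures from the paper.

```python
def relative_improvement(baseline_error, new_error):
    """Relative error reduction with respect to the baseline.

    Assumes both error rates are expressed in the same unit
    (for example, percent word error rate).
    """
    if baseline_error <= 0:
        raise ValueError("baseline error rate must be positive")
    return (baseline_error - new_error) / baseline_error
```

For instance, with a hypothetical baseline word error rate of 30.0% and a new rate of 28.5%, `relative_improvement(30.0, 28.5)` gives a relative improvement of about 5%, even though the absolute reduction is only 1.5 percentage points.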

#3 Cross-language phone recognition when the target language phoneme inventory is not known

Timothy Kempton (University of Sheffield)
Roger Moore (University of Sheffield)
Thomas Hain (University of Sheffield)

Cross-language speech recognition often assumes a certain amount of knowledge about the target language. However, there are hundreds of languages where not even the phoneme inventory is known. In the work reported here, phone recognisers are evaluated on a cross-language task with minimum target knowledge. A phonetic distance measure is introduced for the evaluation, allowing a distance to be calculated between any utterance of any language. This has a number of spin-off applications such as allophone detection, a phone-based ROVER approach to recognition, and cross-language forced alignment. Results show that some of these novel approaches will be of immediate use in characterising languages where there is little phonological knowledge.
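The abstract does not define the phonetic distance measure. One common way to compare two phone sequences without a shared phoneme inventory is a weighted edit distance whose substitution cost depends on phonetic features. The sketch below assumes that approach; the `FEATURES` table, the three binary features, and the function names are illustrative assumptions, not the measure used in the paper.

```python
# Hypothetical binary features per phone: (voiced, nasal, vowel).
# Illustrative only; not the feature set used in the paper.
FEATURES = {
    "p": (0, 0, 0),
    "b": (1, 0, 0),
    "m": (1, 1, 0),
    "a": (1, 0, 1),
}

def phone_cost(x, y):
    """Substitution cost: fraction of features on which two phones differ."""
    fx, fy = FEATURES[x], FEATURES[y]
    return sum(a != b for a, b in zip(fx, fy)) / len(fx)

def phonetic_distance(ref, hyp, indel=1.0):
    """Weighted edit distance between two phone sequences."""
    rows, cols = len(ref) + 1, len(hyp) + 1
    d = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        d[i][0] = i * indel          # delete all of ref[:i]
    for j in range(1, cols):
        d[0][j] = j * indel          # insert all of hyp[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            d[i][j] = min(
                d[i - 1][j] + indel,                               # deletion
                d[i][j - 1] + indel,                               # insertion
                d[i - 1][j - 1] + phone_cost(ref[i - 1], hyp[j - 1]),  # substitution
            )
    return d[-1][-1]
```

With this table, `["p", "a"]` and `["b", "a"]` differ only in voicing of the first phone, so they come out closer than two sequences that differ by a whole inserted or deleted phone.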

#4 A Paradigm for Small Vocabulary Speech Recognition Based on Redundant Spectro-Temporal Feature Sets

Sourish Chaudhuri (Carnegie Mellon University)
Bhiksha Raj (Carnegie Mellon University)

Speech recognition techniques have come to rely almost completely on HMM-based frameworks. In this paper, we present a novel paradigm for small-vocabulary speech recognition based on a recently proposed word spotting technique. Recent work using discriminative classifiers with ordered spectro-temporal features to detect the presence of keywords obtained encouraging improvements over HMM-based models. We propose to extend this approach to recognize continuous speech. Our method uses discriminative models to predict which words are present in a speech signal and hypothesize their locations. A graph search using dynamic programming is then used to obtain the most likely sequence of words from the hypothesis set produced by combining the results from the discriminative word classifiers. While this approach doesn't perform as well as state-of-the-art ASR systems, it can be particularly useful for languages with small amounts of annotated data available.
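The abstract does not give details of the graph search. A minimal sketch of one way such a search could work is shown below, assuming each word classifier emits hypotheses of the form (word, start, end, score) with positive confidence scores, and that the best sequence is the set of non-overlapping hypotheses with the highest total score. The function name and the hypothesis format are assumptions made for illustration.

```python
from bisect import bisect_right

def best_word_sequence(hyps):
    """Pick the highest-scoring sequence of non-overlapping word hypotheses.

    hyps: list of (word, start, end, score) tuples; scores are assumed
    to be positive confidences. Returns (total_score, list_of_words).
    """
    hyps = sorted(hyps, key=lambda h: h[2])   # order by end time
    ends = [h[2] for h in hyps]
    # best[k] holds the best (score, words) using only the first k hypotheses.
    best = [(0.0, [])]
    for k, (word, start, end, score) in enumerate(hyps):
        # Number of earlier hypotheses that end at or before this one starts.
        p = bisect_right(ends, start, 0, k)
        take = (best[p][0] + score, best[p][1] + [word])
        skip = best[k]
        best.append(take if take[0] > skip[0] else skip)
    return best[-1]
```

For example, given the hypothetical hypotheses `("call", 0, 5, 2.0)`, `("comb", 3, 8, 3.0)` and `("home", 5, 9, 1.5)`, the search prefers "call home" (total 3.5) over the single higher-scoring but overlapping "comb" (3.0).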

#5 GorUp: an ontology-driven Audio Information Retrieval system that suits the requirements of under-resourced languages

Nora Barroso (Irunweb Enterprise - Irun)
Karmele López de Ipiña (Computational Intelligence Group, Department of System Engineering and Automation - University of the Basque Country)
Aitzol Ezeiza (Computational Intelligence Group, Department of System Engineering and Automation - University of the Basque Country)
Carmen Hernández (Computational Intelligence Group, Department of System Engineering and Automation - University of the Basque Country)
Nerea Ezeiza (Computational Intelligence Group, Department of System Engineering and Automation - University of the Basque Country)
Odei Barroso (Irunweb Enterprise - Irun)
Unai Susperregi (Irunweb Enterprise - Irun)
Barroso Simeon (Insima Teknologia, Donostia)

GorUp is an Information Retrieval system that provides information about the contents of audio broadcast news in Basque, Spanish, and French. Since the resources available for Basque in general, and for this task in particular, were very few, data optimization methodologies had to be applied in various phases of the development. Moreover, the agglutinative nature of Basque required the use of morphemes and other sub-word units. Additionally, some keyword spotting and semantic methods have also been applied in the system in order to retrieve information properly. In most cases, the methods employed during this project could suit the requirements of many under-resourced languages, and one of these techniques could be the ontology-based approach. This paper presents the system in general for Basque and emphasizes the techniques employed in order to enhance the system using a semantic ontology.

#6 Woefzela - An open-source platform for ASR data collection in the developing world

Nic De Vries (Human Language Technologies Research Group, Meraka Institute, CSIR, Pretoria, South Africa)
Jaco Badenhorst (Human Language Technologies Research Group, Meraka Institute, CSIR, Pretoria, South Africa)
Marelie Davel (Human Language Technologies Research Group, Meraka Institute, CSIR, Pretoria, South Africa)
Etienne Barnard (Multilingual Speech Technologies, North-West University, Vanderbijlpark 1900, South Africa)
Alta De Waal (Human Language Technologies Research Group, Meraka Institute, CSIR, Pretoria, South Africa)

Building transcribed speech corpora for under-resourced languages plays a pivotal role in developing speech technologies for such languages. We have developed an open-source tool for devices running the Android operating system to facilitate the efficient collection of speech data for Automatic Speech Recognition system development. The tool was designed for use in typical developing-world conditions; we present the relevant design choices and analyse the effectiveness of this tool by means of a case study. In particular, we introduce a novel semi-real-time quality monitoring system, which reduces the amount of erroneous data collected from users and thus increases the efficiency of the data collection process.

#7 A Study on the Perception of Tone and Intonation in Sesotho

Hansjörg Mixdorff (Beuth University of Applied Sciences Berlin, Germany)
Lehlohonolo Mohasi (University of Stellenbosch, South Africa)
'Malillo Machobane (National University of Lesotho)
Thomas Niesler (University of Stellenbosch, South Africa)

This paper presents a study on the perception of Sesotho, a Southern African tonal language, employing a set of recorded minimal pairs, whose F0 contours were analyzed in a previous study using the Fujisaki model and resynthesized. Sequences of prosodically modified stimuli were produced to examine the effect of these modifications on word identification, statement/question distinction, as well as focus identification. With few exceptions, results regarding word identification are in line with our expectations. F0 modifications even seem to override vowel differences between words, but they do not change perception when the vowel is the only difference. With respect to the statement/question distinction, shortening of the penultimate syllable, higher speech rate and increased phrase command magnitude Ap all increase the probability that an utterance is judged to be a question. The focus experiment only produced inconclusive results, possibly due to its complex setting.

#8 Developing a broadband automatic speech recognition system for Afrikaans

Febe de Wet (Human Language Technology Competency Area, CSIR Meraka Institute, Pretoria, South Africa)
Alta de Waal (Human Language Technology Competency Area, CSIR Meraka Institute, Pretoria, South Africa)
Gerhard van Huyssteen (Centre for Text Technology (CTexT), North-West University, Potchefstroom, South Africa)

Afrikaans is one of the eleven official languages of South Africa. It is classified as an under-resourced language. No annotated broadband speech corpora currently exist for Afrikaans. This article reports on the development of speech resources for Afrikaans, specifically a broadband speech corpus and an extended pronunciation dictionary. Baseline results for an ASR system that was built using these resources are also presented. In addition, the article suggests different strategies to exploit the close relationship between Afrikaans and Dutch for the purposes of technology development.

#9 Multi-accent speech recognition of Afrikaans, Black and White varieties of South African English

Herman Kamper (Stellenbosch University)
Thomas Niesler (Stellenbosch University)

We investigate speech recognition performance of systems employing several accent-specific recognisers in parallel for simultaneous recognition of multiple accents. We compare these systems with oracle systems, in which test utterances are presented to matching accent-specific recognisers, and with accent-independent systems, in which data are pooled. Afrikaans (AE), Black (BE) and White (EE) accents of South African English are considered. We find that, when accent is classified on a per-utterance basis, parallel systems outperform oracle systems for the AE+EE accent pair while the opposite is observed for BE+EE. When accent is identified on a per-speaker basis, oracle or better performance is obtained for both accent pairs. Furthermore, parallel systems using multi-accent acoustic modelling, which allows cross-accent sharing of acoustic data, outperform parallel systems using accent-specific acoustic models. The former also yield better performance than accent-independent systems.

#10 Perceptual Representation of Consonant Sounds in Thai

Charturong Tantibundhit (Department of Electrical and Computer Engineering, Thammasat University, Thailand)
Chutamanee Onsuwan (Department of Linguistics, Thammasat University, Thailand)
Tanawan Saimai (Department of Electrical and Computer Engineering, Thammasat University, Thailand)
Nantaporn Saimai (Department of Electrical and Computer Engineering, Thammasat University, Thailand)
Sumonmas Thatphithakkul (National Electronics and Computer Technology Center (NECTEC), Thailand)
P. Chootrakool (National Electronics and Computer Technology Center (NECTEC), Thailand)
Krit Kosawat (National Electronics and Computer Technology Center (NECTEC), Thailand)
Nattanun Thatphithakkul (National Electronics and Computer Technology Center (NECTEC), Thailand)

This work is an attempt to construct a perceptual representation of Thai consonants based on perceptual identification results (from 28 Thai listeners) of 21 phonemes presented in noise. The experiment is designed to make pairwise comparisons equally among 21 word-initial phonemes, which results in 210 real-word stimulus pairs. Percent correct responses and confusion matrices are obtained. The similarity score and perceptual distance for each phoneme pair are systematically derived from confusion scores based on a method proposed by Shepard (1972). A perceptual space of Thai consonants then takes shape and can roughly be divided into 5 groupings: glide, glottal, nasal, aspirated obstruent, and a combination of liquid and unaspirated obstruent. It is suggested that these phonological classes reflect the most distinct and relevant perceptual properties of Thai consonants. A preliminary cross-linguistic observation is discussed in light of the data on English consonants from Miller and Nicely (1955).
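The abstract does not spell out the computation. The form commonly attributed to Shepard (1972) takes the similarity of two sounds as the sum of their mutual confusions divided by the sum of their correct identifications, and the perceptual distance as the negative natural logarithm of that similarity. The sketch below assumes that form; the function names and the toy confusion matrix in the usage note are hypothetical, and whether the paper uses exactly this variant is not confirmed by the abstract.

```python
import math

def shepard_similarity(conf, i, j):
    """Similarity of stimuli i and j from a confusion-proportion matrix.

    conf[a][b] is the proportion of presentations of stimulus a that
    were identified as b (assumed form: (p_ij + p_ji) / (p_ii + p_jj)).
    """
    return (conf[i][j] + conf[j][i]) / (conf[i][i] + conf[j][j])

def perceptual_distance(conf, i, j):
    """Distance as the negative log of similarity (larger = more distinct)."""
    return -math.log(shepard_similarity(conf, i, j))

def n_unordered_pairs(n):
    """Number of distinct pairs among n items, e.g. 21 phonemes."""
    return n * (n - 1) // 2
```

With a toy two-sound matrix `[[0.8, 0.2], [0.1, 0.9]]`, the similarity is 0.3 / 1.7, and sounds that are confused less often end up farther apart. `n_unordered_pairs(21)` reproduces the 210 stimulus pairs mentioned above.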

#11 A cross-lingual approach to the development of an HMM-based speech synthesis system for Malay

Mumtaz Begum Mustafa (University of Malaya)
Ainon Raja Noor (University of Malaya)
Roziati Zainuddin (University of Malaya)
Zuraidah M. Don (University of Malaya)
Gerry Knowles (University of Malaya)

This research reports the development of an HMM-based speech synthesis system for Malay, an under-resourced language with few resources such as recorded speech and segmental labels. We propose the cross-lingual use of resources for developing a Malay HMM-based speech synthesis system. We used the Festival English speech synthesis system to generate time-aligned phone transcriptions for Malay using a specially constructed Malay grapheme-to-phoneme database and an English CART. These transcriptions, together with Malay recorded speech databases, were used for training and synthesis of Malay speech. The effectiveness of the proposed approach is confirmed by intelligibility and naturalness tests on the synthetic speech.