12th Annual Conference of the
International Speech Communication Association
Interspeech 2011 Florence
Technical Programme
This is the final programme for this session. For oral sessions, the timing on the left is the current presentation order, but this may still change, so please check at the conference itself.
Tue-Ses3-P4: Voice Conversion and Speech Synthesis
Time: Tuesday 16:00
Place: Faenza 2 - Pala Congressi (Passi Perduti-Gallery)
Type: Poster
Chair: Alan Black
| #1 | Gaussian Process Experts for Voice Conversion
Nicholas Pilkington (Computer Laboratory, University of Cambridge, Cambridge) Heiga Zen (Toshiba Research Europe Ltd., Cambridge Research Lab.) Mark Gales (Toshiba Research Europe Ltd., Cambridge Research Lab.)
Conventional approaches to voice conversion typically use a GMM to represent the joint probability density of source and target features. This model is then used to perform spectral conversion between speakers. This approach is reasonably effective but can be prone to overfitting and oversmoothing of the target spectra. This paper proposes an alternative scheme that uses a collection of Gaussian process experts to perform the spectral conversion. Gaussian processes are robust to overfitting and oversmoothing and can predict the target spectra more accurately. Experimental results indicate that the objective performance of voice conversion can be improved using the proposed approach.
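For readers unfamiliar with the conventional baseline mentioned in this abstract, the following is a minimal sketch of joint-density GMM conversion, where the converted frame is the conditional mean of the target given the source. It is not the authors' code; the function name `gmm_convert` and all parameter names are hypothetical, and the GMM parameters are assumed to have been trained elsewhere on aligned source/target frames.

```python
import numpy as np

def gmm_convert(x, weights, mu_x, mu_y, sigma_xx, sigma_yx):
    """Convert one source frame x using the conditional mean of a joint-density GMM.

    Hypothetical shapes: weights (M,), mu_x (M, Dx), mu_y (M, Dy),
    sigma_xx (M, Dx, Dx), sigma_yx (M, Dy, Dx).
    """
    n_mix = len(weights)
    # Log posterior of each mixture given x (the shared 2*pi constant is dropped).
    log_post = np.empty(n_mix)
    for m in range(n_mix):
        diff = x - mu_x[m]
        _, logdet = np.linalg.slogdet(sigma_xx[m])
        maha = diff @ np.linalg.solve(sigma_xx[m], diff)
        log_post[m] = np.log(weights[m]) - 0.5 * (logdet + maha)
    post = np.exp(log_post - log_post.max())
    post /= post.sum()
    # Posterior-weighted sum of per-mixture linear regressions onto the target space.
    y = np.zeros(mu_y.shape[1])
    for m in range(n_mix):
        y += post[m] * (mu_y[m] + sigma_yx[m] @ np.linalg.solve(sigma_xx[m], x - mu_x[m]))
    return y
```

With a single mixture this reduces to one linear regression from source to target features; with several mixtures, each frame is a blend of regressions weighted by the mixture posteriors.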
|
| #2 | Intonation Conversion From Neutral to Expressive Speech
Christophe VEAUX (Ircam) Xavier RODET (Ircam)
Intonation is one of the most important factors of speech expressivity. This paper presents a conversion method for the F0 contours. The F0 segments are represented with discrete cosine transform (DCT) coefficients at the syllable level. Multi-level dynamic features are added to model the temporal correlation between syllables and to constrain the F0 contour at the phrase level. Gaussian mixture models (GMM) are used to map the prosodic features between neutral and expressive speech, and the converted F0 contour is generated under the dynamic features constraints. Experimental evaluation using a database of acted emotional speech shows the effectiveness of the proposed F0 model and conversion method.
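As an illustration of representing a syllable-level F0 segment with DCT coefficients, here is a small sketch using an orthonormal DCT-II written directly in NumPy. The function names and the choice of how many coefficients to keep are assumptions for illustration; the abstract does not state the DCT variant or order the authors used.

```python
import numpy as np

def _dct_basis(n_coeffs, n):
    """Orthonormal DCT-II basis matrix of shape (n_coeffs, n)."""
    k = np.arange(n_coeffs)[:, None]
    t = np.arange(n)[None, :]
    basis = np.cos(np.pi * (t + 0.5) * k / n)
    scale = np.full(n_coeffs, np.sqrt(2.0 / n))
    scale[0] = np.sqrt(1.0 / n)
    return scale[:, None] * basis

def dct_ii(contour, n_coeffs):
    """First n_coeffs orthonormal DCT-II coefficients of a 1-D F0 segment."""
    contour = np.asarray(contour, dtype=float)
    return _dct_basis(n_coeffs, len(contour)) @ contour

def idct_ii(coeffs, n):
    """Rebuild an n-sample contour from (possibly truncated) DCT-II coefficients."""
    coeffs = np.asarray(coeffs, dtype=float)
    return coeffs @ _dct_basis(len(coeffs), n)

# Example: a rising log-F0 segment summarised by its first 4 coefficients.
syllable_log_f0 = np.linspace(4.8, 5.2, 20)
coeffs = dct_ii(syllable_log_f0, 4)
smoothed = idct_ii(coeffs, 20)
```

Keeping only the low-order coefficients gives a compact, smooth description of the segment shape: the first coefficient tracks the mean level and the following ones capture slope and curvature.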
|
| #3 | Speaker-Adaptive Speech Synthesis Based on Eigenvoice Conversion and Language-Dependent Prosodic Conversion in Speech-to-Speech Translation
Nobuhiko Hattori (Nara Institute of Science and Technology) Tomoki Toda (Nara Institute of Science and Technology) Hisashi Kawai (National Institute of Information and Communications Technology) Hiroshi Saruwatari (Nara Institute of Science and Technology) Kiyohiro Shikano (Nara Institute of Science and Technology)
This paper describes a novel approach based on voice conversion (VC) to speaker-adaptive speech synthesis for speech-to-speech translation. Voice quality of translated speech in an output language is usually different from that of an input speaker of the translation system since a text-to-speech system is developed with another speaker's voices in the output language. To render the input speaker's voice quality in the translated speech, we propose a voice quality control method based on one-to-many eigenvoice conversion (EVC) and language-dependent prosodic conversion. Spectral parameters of the translated speech are effectively converted by one-to-many EVC enabling unsupervised speaker adaptation. Moreover, prosodic parameters are modified considering their global differences between the input and output languages. The effectiveness of the proposed method is confirmed by experimental evaluations on cross-lingual VC among Japanese, English, and Chinese.
|
| #4 | Adding Glottal Source Information to Intra-lingual Voice Conversion
Javier Pérez (Universitat Politècnica de Catalunya) Antonio Bonafonte (Universitat Politècnica de Catalunya)
This paper studies the inclusion of glottal source characteristics in voice conversion (VC) systems. We use source/filter decomposition to parametrize the vocal tract (LSF), the glottal source (LF model), and the aspiration noise (amplitude-modulated high-pass filtered AWGN noise). To evaluate the impact of this new parametrization in VC, we use a reference conversion system that estimates a linear transformation function using a joint target/source model obtained with CART and GMM. The reference system is based on the LPC model, uses LSF to represent the vocal tract and a selection technique for the residual. We use the reference algorithm to build a VC system for each of the three parameter sets. We compared both parametrizations in the framework of an intra-lingual voice conversion task in Spanish. The results show that the new source/filter representation clearly improves the overall performance, both in terms of speaker identity transformation and voice quality.
|
| #6 | Formant-controlled HMM-based Speech Synthesis
Ming Lei (iFLYTEK Speech Lab, University of Science and Technology of China) Junichi Yamagishi (CSTR, University of Edinburgh) Korin Richmond (CSTR, University of Edinburgh) Zhen-Hua Ling (iFLYTEK Speech Lab, University of Science and Technology of China) Simon King (CSTR, University of Edinburgh) Li-Rong Dai (iFLYTEK Speech Lab, University of Science and Technology of China)
This paper proposes a novel framework that enables us to manipulate and control formants in HMM-based speech synthesis. In this framework, the dependency between formants and spectral features is modelled by piecewise linear transforms; formant parameters are effectively mapped by these to the means of Gaussian distributions over the spectral synthesis parameters. The spectral envelope features generated under the influence of formants in this way may then be passed to high-quality vocoders to generate the speech waveform. This provides two major advantages over conventional frameworks. First, we can achieve spectral modification by changing formants only in those parts where we want control, whereas the user must specify all formants manually in conventional formant synthesisers (e.g. Klatt). Second, this can produce high-quality speech. Our results show the proposed method can control vowels in the synthesized speech by manipulating F1 and F2 without any degradation in synthesis quality.
|
| #7 | Analysis of HMM-Based Lombard Speech Synthesis
Tuomo Raitio (Department of Signal Processing and Acoustics, Aalto University, Helsinki, Finland) Antti Suni (Department of Speech Sciences, University of Helsinki, Helsinki, Finland) Martti Vainio (Department of Speech Sciences, University of Helsinki, Helsinki, Finland) Paavo Alku (Department of Signal Processing and Acoustics, Aalto University, Helsinki, Finland)
Humans modify their voice in interfering noise in order to maintain the intelligibility of their speech - this is called the Lombard effect. This ability, however, has not been extensively modeled in speech synthesis. Here we compare several methods of synthesizing speech in noise using a physiologically based statistical speech synthesis system (GlottHMM). The results show that in a realistic street noise situation the synthetic Lombard speech is judged by listeners both as appropriate for the situation and as intelligible as natural Lombard speech. Of the different types of models, one using adaptation and extrapolation performed the best.
|
| #8 | Discrete/Continuous Modelling of Speaking Style in HMM-based Speech Synthesis: Design and Evaluation
Nicolas Obin (IRCAM) Pierre Lanchantin (IRCAM) Anne Lacheret (Modyco Lab.) Xavier Rodet (IRCAM)
This paper assesses the ability of an HMM-based speech synthesis system to model the speech characteristics of various speaking styles. A discrete/continuous HMM is presented to model the symbolic and acoustic speech characteristics of a speaking style. The proposed model is used to model the average characteristics of a speaking style that is shared among various speakers, depending on specific situations of speech communication. The evaluation consists of an identification experiment on 4 speaking styles based on delexicalized speech, which is compared to a similar experiment on natural speech. The comparison is discussed and reveals that the discrete/continuous HMM consistently models the speech characteristics of a speaking style.
|
| #9 | Factored MLLR Adaptation For Singing Voice Generation
June Sig Sung (Seoul National University) Doo Hwa Hong (Seoul National University) Shin Jae Kang (Seoul National University) Nam Soo Kim (Seoul National University)
In our previous study, we proposed factored MLLR (FMLLR) where each MLLR parameter is defined as a function of a control vector. We presented a method to train the FMLLR parameters based on a general framework of the expectation-maximization (EM) algorithm. In this paper, we extend the FMLLR structure from diagonal to unrestricted full matrix with a sophisticated algorithm for the training of relevant parameters. In the experiments on artificial generation of singing voice, we evaluate the performance of the FMLLR technique with two matrix structures and also compare with other approaches to parameter adaptation in HMM-based speech synthesis.
|
| #11 | Adaptation of Prosody in Speech Synthesis by Changing Command Values of the Generation Process Model of Fundamental Frequency
Keikichi Hirose (Department of Information and Communication Engineering, the University of Tokyo) Keiko Ochi (Department of Information and Communication Engineering, the University of Tokyo) Ryusuke Mihara (Department of Information and Communication Engineering, the University of Tokyo) Hiroya Hashimoto (Department of Information and Communication Engineering, the University of Tokyo) Daisuke Saito (Department of Information and Communication Engineering, the University of Tokyo) Nobuaki Minematsu (Department of Information and Communication Engineering, the University of Tokyo)
A method was developed to adapt prosody to a new speaker/style in speech synthesis. It is based on predicting differences between target and original speakers/styles and applying them to the original one. Differences in fundamental frequency (F0) contours are represented in the framework of the generation process model, as differences in the command magnitudes/amplitudes. While the original speaker/style requires a certain amount of training corpus, the corpus for training command differences can be small. Furthermore, in the case of style adaptation, the corpus need not be uttered by the same speaker as the original style. Speech synthesis was conducted using an HMM-based speech synthesis system, where prosody was controlled by the method. Listening experiments on synthetic speech with style adaptation and voice conversion both showed the validity of the method.
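The generation process model of F0 referred to here is commonly known as the Fujisaki model, in which ln F0 is a base level plus phrase and accent components driven by commands. The sketch below shows how adding differences to command magnitudes/amplitudes changes the contour while command timings stay fixed. It is an illustration only: the constants `ALPHA`, `BETA` and `GAMMA` are typical textbook values assumed here, the function names are hypothetical, and the differences are simply given as inputs rather than predicted as in the paper.

```python
import numpy as np

# Assumed typical constants (1/s, 1/s, ceiling); not taken from the abstract.
ALPHA, BETA, GAMMA = 3.0, 20.0, 0.9

def phrase_response(t):
    """Impulse response of the phrase control mechanism (zero for t < 0)."""
    tp = np.maximum(t, 0.0)
    return ALPHA**2 * tp * np.exp(-ALPHA * tp)

def accent_response(t):
    """Step response of the accent control mechanism, clipped at GAMMA."""
    tp = np.maximum(t, 0.0)
    return np.minimum(1.0 - (1.0 + BETA * tp) * np.exp(-BETA * tp), GAMMA)

def log_f0(t, base_f0, phrase_cmds, accent_cmds):
    """ln F0(t) from phrase commands (onset, magnitude) and accent commands (on, off, amplitude)."""
    y = np.full(len(t), np.log(base_f0))
    for onset, magnitude in phrase_cmds:
        y += magnitude * phrase_response(t - onset)
    for on, off, amplitude in accent_cmds:
        y += amplitude * (accent_response(t - on) - accent_response(t - off))
    return y

def adapt_commands(phrase_cmds, accent_cmds, d_phrase, d_accent):
    """Add given differences to command magnitudes/amplitudes; timings are unchanged."""
    new_phrase = [(t0, a + d) for (t0, a), d in zip(phrase_cmds, d_phrase)]
    new_accent = [(t1, t2, a + d) for (t1, t2, a), d in zip(accent_cmds, d_accent)]
    return new_phrase, new_accent

# Example: one phrase command and one accent command, then a hypothetical style shift.
t = np.linspace(0.0, 2.0, 201)
phrase, accent = [(0.1, 0.4)], [(0.3, 0.7, 0.3)]
original = log_f0(t, 100.0, phrase, accent)
new_phrase, new_accent = adapt_commands(phrase, accent, [0.1], [0.2])
adapted = log_f0(t, 100.0, new_phrase, new_accent)
```

Because only a handful of command values change, the amount of data needed to learn the differences can plausibly be much smaller than what is needed to model full F0 contours, which is consistent with the claim in the abstract.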
|
| #12 | Prosody Conversion for Emotional Mandarin Speech Synthesis Using the Tone Nucleus Model
Miaomiao Wen (Department of Electrical Engineering and Information Systems, the University of Tokyo, Japan) Miaomiao Wang (Department of Electrical Engineering and Information Systems, the University of Tokyo, Japan) Keikichi Hirose (Department of Information and Communication Engineering, the University of Tokyo, Japan) Nobuaki Minematsu (Department of Information and Communication Engineering, the University of Tokyo, Japan)
In this paper, the tone nucleus model is employed to represent and convert F0 contours for synthesizing emotional Mandarin speech from neutral speech. Compared with previous prosody transforming methods, the proposed method 1) only converts the tone nucleus part of each syllable rather than the whole F0 contour to avoid data sparseness problems; 2) builds mapping functions for well-chosen tone nucleus model parameters to better capture Mandarin tonal information. Using only a modest amount of training data, the perceptual accuracy achieved by our method was shown to be comparable to that obtained by a professional speaker.
|
| #13 | Rapid Adaptation of Foreign-accented HMM-based Speech Synthesis
Reima Karhila (Adaptive Informatics Research Centre, Aalto University, Helsinki, Finland) Mirjam Wester (Centre for Speech Technology Research, University of Edinburgh, UK)
This paper presents findings of listeners’ perception of speaker identity in synthetic speech. We investigated what the effect is on the perceived identity of a speaker when using differently accented average voice models and limited amounts (5 and 15 sentences) of a speaker’s data to create the synthetic stimuli. A speaker discrimination task was used to measure speaker identity. Native English listeners were presented with natural and synthetic speech stimuli in English and were asked whether they thought the sentences were spoken by the same person or not. An accent rating task was carried out to measure the perceived accents of the synthetic speech stimuli. Listeners perform as well at speaker discrimination when the stimuli have been created using 5 or 15 adaptation sentences as when using 105 sentences. The accent rating task shows listeners perceive different accents in the synthetic stimuli. However, listeners do not base speaker similarity decisions on perceived accent.
|
| #14 | The Effects of Phoneme Errors in Speaker Adaptation for HMM Speech Synthesis
Bálint Tóth (Department of Telecommunications and Media Informatics, Budapest University of Technology and Economics) Tibor Fegyó (Department of Telecommunications and Media Informatics, Budapest University of Technology and Economics) Géza Németh (Department of Telecommunications and Media Informatics, Budapest University of Technology and Economics)
In this paper, the effect of phoneme errors in the adaptation data of HMM-based synthesis is investigated. Phoneme errors are likely to appear in automatic speech recognition (ASR) based transcriptions. The research also investigates the prospects of unsupervised adaptation based purely on ASR transcriptions. To achieve better quality, a new method is introduced for selecting an optimal subset of ASR transcription based adaptation data. Quality evaluation of the method was also performed. The results showed that adaptation was successful even at phoneme error rates higher than 50%.
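The abstract does not say how the phoneme error rate was computed. A common definition, assumed here, is the Levenshtein (edit) distance between the reference and recognised phoneme sequences divided by the reference length. The function name and the example phoneme strings below are hypothetical.

```python
def phoneme_error_rate(reference, hypothesis):
    """Edit distance between two phoneme sequences, divided by the reference length."""
    if not reference:
        raise ValueError("reference must contain at least one phoneme")
    # prev[j] holds the distance between the processed reference prefix and hypothesis[:j].
    prev = list(range(len(hypothesis) + 1))
    for i, ref_ph in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_ph in enumerate(hypothesis, start=1):
            substitution = prev[j - 1] + (0 if ref_ph == hyp_ph else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(reference)

# Example: two substitutions in a four-phoneme reference.
per = phoneme_error_rate(["h", "e", "l", "o"], ["h", "a", "l", "u"])
```

Under this definition the rate can exceed 100% when the hypothesis contains many insertions, so a threshold such as 50% refers to errors relative to the reference length, not to the fraction of wrong hypothesis phonemes.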
|