Speaker Adaptation 3

ICSLP'98 Proceedings

On-line Hierarchical Transformation of Hidden Markov Models for Speaker Adaptation

Authors:

Jen-Tzung Chien, National Cheng Kung University (Taiwan)

Page (NA) Paper number 102

Abstract:

This paper presents a novel framework for on-line hierarchical transformation of hidden Markov models (HMMs) for speaker adaptation. Our aim is to incrementally transform (or adapt) all the HMM parameters to a new speaker, even when some HMM units are unseen in the adaptation data. The transformation paradigm is formulated as an approximate Bayesian estimate, in which the prior statistics and the transformation parameters are incrementally updated with each consecutive block of adaptation data. Under this formulation, the updated prior statistics and the current block of data are sufficient for on-line transformation. Further, we establish a hierarchical tree of HMMs and use it to dynamically control how transformations are shared across HMM units. In speaker adaptation experiments, we demonstrate the superiority of the proposed on-line transformation over other methods.

SL980102.PDF (From Author) SL980102.PDF (Rasterized)
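The incremental idea above — keep only running statistics plus the current data block, and apply one shared transform so that even unseen HMM units get adapted — can be sketched in a few lines. The sketch below is a deliberately simplified least-squares version (a single shared affine transform of the means, with a 1-best frame-to-mean alignment assumed), not Chien's exact approximate-Bayes formulation:

```python
import numpy as np

class OnlineTransform:
    """Block-incremental estimate of one shared affine transform of the
    model means (simplified least-squares sketch, not the paper's
    approximate-Bayes update): only the running statistics G, k and the
    current data block are needed, never the past blocks themselves."""
    def __init__(self, dim):
        self.G = np.zeros((dim + 1, dim + 1))  # accumulated outer products
        self.k = np.zeros((dim, dim + 1))      # accumulated cross statistics

    def update(self, means, frames):
        # means[i]: HMM mean aligned to observation frames[i] (1-best alignment)
        for mu, x in zip(means, frames):
            xi = np.append(mu, 1.0)            # extended mean [mu; 1]
            self.G += np.outer(xi, xi)
            self.k += np.outer(x, xi)

    def transform(self):
        # W minimizing sum ||x - W [mu; 1]||^2 over all blocks seen so far;
        # applying W to every mean adapts seen and unseen units alike
        return self.k @ np.linalg.pinv(self.G)
```

Because `update` only folds each block into `G` and `k`, adaptation data can be discarded as soon as it has been absorbed, which is what makes the scheme on-line.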



High-Speed Speaker Adaptation Using Phoneme Dependent Tree-Structured Speaker Clustering

Authors:

Motoyuki Suzuki, Computer Center / Graduate School of Information Sciences, Tohoku Univ. (Japan)
Toshiaki Abe, Graduate School of Engineering, Tohoku Univ. (Japan)
Hiroki Mori, Graduate School of Engineering, Tohoku Univ. (Japan)
Shozo Makino, Computer Center / Graduate School of Information Sciences, Tohoku Univ. (Japan)
Hirotomo Aso, Graduate School of Engineering, Tohoku Univ. (Japan)

Page (NA) Paper number 992

Abstract:

Tree-structured speaker clustering was previously proposed as a high-speed speaker adaptation method: it selects the model most similar to a target speaker. However, it does not take into account speaker differences that depend on phoneme class. In this paper, we propose a speaker adaptation method based on speaker clustering that accounts for phoneme-class-dependent speaker differences. Experimental results showed that the new method gave better performance than the original one. Furthermore, we propose an improved method that uses the tree structure of a similar phoneme as a substitute for any phoneme that does not appear in the adaptation data. The experimental results showed that this improved method outperformed the previously proposed one.

SL980992.PDF (From Author) SL980992.PDF (Rasterized)
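Model selection in a cluster tree can be illustrated with a toy sketch: starting at the root, descend to whichever child cluster model gives the adaptation data the highest likelihood, and return the leaf reached. The `Node` class, the diagonal-Gaussian scoring, and the tree itself are illustrative assumptions, not the paper's phoneme-dependent models:

```python
import numpy as np

class Node:
    """One cluster model in the speaker-clustering tree (illustrative:
    a diagonal Gaussian per node rather than a full HMM set)."""
    def __init__(self, mean, var, children=()):
        self.mean = np.asarray(mean, dtype=float)
        self.var = np.asarray(var, dtype=float)
        self.children = list(children)

def loglik(node, data):
    # diagonal-Gaussian log-likelihood of the adaptation data under the node
    d = data - node.mean
    return float(np.sum(-0.5 * (np.log(2 * np.pi * node.var) + d ** 2 / node.var)))

def select_speaker_model(root, data):
    # descend the tree, always moving to the child whose model fits the
    # adaptation data best; the leaf reached is the selected speaker model
    node = root
    while node.children:
        node = max(node.children, key=lambda c: loglik(c, data))
    return node
```

The paper's phoneme-dependent variant would run one such descent per phoneme class, falling back to a similar phoneme's tree when a phoneme is absent from the adaptation data.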



The Use of Confidence Measures in Unsupervised Adaptation of Speech Recognizers

Authors:

Tasos Anastasakos, Motorola, Lexicus Division (USA)
Sreeram V. Balakrishnan, Motorola, Lexicus Division (USA)

Page (NA) Paper number 599

Abstract:

Confidence estimation of the output hypothesis of a speech recognizer offers a way to assess the probability that the recognized words are correct. This work investigates the use of confidence scores to select speech segments for unsupervised speaker adaptation. Our approach is motivated by initial experiments showing that the use of mis-labeled data significantly degrades the performance of certain adaptation schemes. We focus on a rapid self-adaptation scenario that uses only a few seconds of adaptation data. The adaptation algorithm is based on an extension of the MLLR transformation method that can be applied to the observation vectors. We present experimental results on the ARPA WSJ large vocabulary dictation task.

SL980599.PDF (From Author) SL980599.PDF (Rasterized)
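The selection step can be sketched as follows: run a first recognition pass, discard segments whose confidence falls below a threshold, and estimate the adaptation transform from the surviving frames only. For brevity the sketch estimates a global feature-space bias rather than a full MLLR transform; the 0.9 threshold and the data layout are hypothetical:

```python
import numpy as np

def adapt_bias(hypotheses, model_means, conf_threshold=0.9):
    """Confidence-gated unsupervised adaptation (simplified sketch):
    keep only first-pass segments whose confidence clears the threshold,
    then estimate a global feature-space bias from the kept frames.
    `hypotheses` is a list of (state_ids, frames, confidence) tuples;
    the threshold value and this data layout are assumptions."""
    kept_frames, kept_means = [], []
    for states, frames, conf in hypotheses:
        if conf < conf_threshold:          # likely mis-recognized: exclude
            continue
        kept_frames.append(np.asarray(frames))
        kept_means.append(model_means[np.asarray(states)])
    X = np.vstack(kept_frames)
    M = np.vstack(kept_means)
    return (M - X).mean(axis=0)            # shift applied to observation vectors
```

Because the bias acts on the observation vectors rather than the model, it matches the paper's feature-space flavor of adaptation, and low-confidence (likely mis-labeled) segments never contaminate the statistics.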



Speaker Normalization with All-Pass Transforms

Authors:

John McDonough, Center for Language and Speech Processing, The Johns Hopkins University (USA)
William Byrne, Center for Language and Speech Processing, The Johns Hopkins University (USA)
Xiaoqiang Luo, Center for Language and Speech Processing, The Johns Hopkins University (USA)

Page (NA) Paper number 869

Abstract:

Speaker normalization is a process in which the short-time features of speech from a given speaker are transformed so as to better match some speaker independent model. Vocal tract length normalization (VTLN) is a popular speaker normalization scheme wherein the frequency axis of the short-time spectrum associated with a speaker's speech is rescaled or warped prior to the extraction of cepstral features. In this work, we develop a novel speaker normalization scheme by exploiting the fact that frequency domain transformations similar to that inherent in VTLN can be accomplished entirely in the cepstral domain through the use of conformal maps. We propose a class of such maps, designated all-pass transforms for reasons given hereafter, and in a set of speech recognition experiments conducted on the Switchboard Corpus demonstrate their capacity to achieve word error rate reductions of 3.7% absolute.

SL980869.PDF (From Author) SL980869.PDF (Rasterized)
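The frequency mapping induced by a first-order all-pass transform z → (z − α)/(1 − αz) is easy to write down and behaves like a VTLN warp: for |α| < 1 it maps [0, π] onto itself monotonically, compressing or stretching the axis depending on the sign of α. The paper's key point is that this mapping can be applied as a linear operation directly on the cepstra; the sketch below shows only the frequency-domain map:

```python
import numpy as np

def allpass_warp(omega, alpha):
    """Frequency mapping induced by the first-order all-pass transform
    z -> (z - alpha) / (1 - alpha * z): for |alpha| < 1 it maps [0, pi]
    onto itself monotonically, like a VTLN frequency warp."""
    omega = np.asarray(omega, dtype=float)
    return omega + 2.0 * np.arctan(alpha * np.sin(omega)
                                   / (1.0 - alpha * np.cos(omega)))
```

With α = 0 the map is the identity; positive α stretches low frequencies (as for a shorter vocal tract) and negative α compresses them.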



Toward On-Line Learning of Chinese Continuous Speech Recognition System

Authors:

Rong Zheng, Speech Recognition Lab, Dept. of Electrical Engr., Tsinghua University (China)
Zuoying Wang, Speech Recognition Lab, Dept. of Electrical Engr., Tsinghua University (China)

Page (NA) Paper number 276

Abstract:

In this paper, we present an integrated on-line learning scheme that combines state-of-the-art speaker normalization and adaptation techniques to improve the performance of our large vocabulary Chinese continuous speech recognition (CSR) system. We use VTLN to remove inter-speaker variation in both the training and testing stages. To facilitate dynamic determination of the transformation scale, we devise a tree-based transformation method as the key component of our incremental adaptation. Experiments show that the combined on-line learning scheme (incremental and unsupervised), which gives approximately a 22-26% error reduction rate, outperforms either method used separately (18.34% and 2.7%, respectively).

SL980276.PDF (From Author) SL980276.PDF (Rasterized)
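The VTLN component of such a scheme is typically a maximum-likelihood grid search over candidate warp factors. A generic sketch, in which the factor grid and the `warp` and `score` stand-ins are assumptions rather than the paper's recipe:

```python
import numpy as np

def pick_warp_factor(frames, warp, score,
                     factors=(0.80, 0.90, 1.00, 1.10, 1.20)):
    """Maximum-likelihood VTLN warp-factor selection (generic sketch):
    warp the features with each candidate factor and keep the one the
    speaker-independent model scores highest. `warp` rewarps the
    features and `score` stands in for the acoustic-model likelihood."""
    return max(factors, key=lambda a: score(warp(frames, a)))
```

In a full system the chosen factor would be applied during both training and testing, with the tree-based incremental adaptation run on top of the normalized features.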



The CHAM Model of Hyperarticulate Adaptation During Human-Computer Error Resolution

Authors:

Sharon L. Oviatt, Oregon Graduate Institute (USA)

Page (NA) Paper number 49

Abstract:

When using interactive systems, people adapt their speech during attempts to resolve system recognition errors. This paper summarizes the two-stage Computer-elicited Hyperarticulate Adaptation Model (CHAM), which accounts for systematic changes in human speech during interactive error handling. According to CHAM, Stage I adaptation is manifested as a single change involving increased duration of speech and pauses. This change is associated with a moderate degree of hyperarticulation, and occurs when the system error rate is low. In contrast, Stage II adaptations are associated with more extreme hyperarticulation during a high system error rate, and entail changes in multiple features of speech, including duration, articulation, intonation pattern, fundamental frequency and amplitude. This paper summarizes the empirical findings and linguistic theory upon which CHAM is based, as well as the model's main predictions. Finally, the implications of CHAM for designing future interactive systems with improved error handling are discussed.

SL980049.PDF (From Author) SL980049.PDF (Rasterized)
