Authors:
Jen-Tzung Chien, National Cheng Kung University (Taiwan)
Page (NA) Paper number 102
Abstract:
This paper presents a novel framework of on-line hierarchical transformation
of hidden Markov models (HMMs) for speaker adaptation. Our aim is
to incrementally transform (or adapt) all the HMM parameters to a new
speaker even when some HMM units are unseen in the adaptation data.
The transformation paradigm is formulated according to an approximate
Bayesian estimate, in which the prior statistics and the transformation
parameters are incrementally updated for each consecutive block of
adaptation data. Under this formulation, the updated prior statistics and the
current block of data are sufficient for on-line transformation. Further,
we establish a hierarchical tree of HMMs and use it to dynamically
control the transformation sharing for each HMM unit. In the speaker
adaptation experiments, we demonstrate the superiority of the proposed
on-line transformation over other methods.
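To make the incremental idea concrete, the following is a minimal sketch (not the paper's hierarchical HMM algorithm) of Bayesian updating for a single Gaussian mean: the posterior after each block becomes the prior for the next, so only the updated prior statistics and the current block are ever needed. The function names and the conjugate-prior setup are illustrative assumptions.

    import numpy as np

    def incremental_map_mean(blocks, prior_mean=0.0, prior_count=5.0):
        """MAP-update a Gaussian mean one data block at a time.

        Toy illustration only: after each block we keep just the updated
        prior statistics (mean, count), never the past blocks themselves.
        """
        mean, count = prior_mean, prior_count
        for block in blocks:
            n = len(block)
            # Posterior mean: count-weighted combination of prior and new data.
            mean = (count * mean + block.sum()) / (count + n)
            count += n  # today's posterior is tomorrow's prior
            yield mean

    rng = np.random.default_rng(0)
    blocks = [rng.normal(2.0, 1.0, size=20) for _ in range(5)]
    for i, m in enumerate(incremental_map_mean(blocks), start=1):
        print(f"after block {i}: mean estimate = {m:.3f}")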
Authors:
Motoyuki Suzuki, Computer Center / Graduate school of Information Sciences, Tohoku Univ. (Japan)
Toshiaki Abe, Graduate school of Engineering, Tohoku Univ. (Japan)
Hiroki Mori, Graduate school of Engineering, Tohoku Univ. (Japan)
Shozo Makino, Computer Center / Graduate school of Information Sciences, Tohoku Univ. (Japan)
Hirotomo Aso, Graduate school of Engineering, Tohoku Univ. (Japan)
Page (NA) Paper number 992
Abstract:
Tree-structured speaker clustering was proposed as a high-speed
speaker adaptation method: it selects the cluster model most similar
to a target speaker. However, it does not consider speaker
differences that depend on phoneme class. In this paper, we propose a
speaker adaptation method based on speaker clustering that takes
phoneme-class-dependent speaker differences into account. The experimental
results showed that the new method gave better performance than the
original method. Furthermore, we propose an improved method that
uses the tree structure of a similar phoneme as a substitute for any
phoneme that does not appear in the adaptation data. The experimental
results showed that this improved method gave better performance than
the previously proposed method.
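As a rough sketch of the substitution idea (hypothetical data structures and scoring function, not the authors' implementation), one can descend a phoneme's cluster tree using adaptation data from an acoustically similar phoneme whenever the phoneme itself is missing:

    class Node:
        def __init__(self, model, children=()):
            self.model = model            # cluster-dependent acoustic model
            self.children = list(children)

    def descend(tree, data, score):
        """Greedily pick, at each level, the child whose model fits `data` best."""
        node = tree
        while node.children:
            node = max(node.children, key=lambda c: score(c.model, data))
        return node.model

    def select_models(trees, adapt_data, similar, score):
        """Choose one cluster model per phoneme, substituting a similar
        phoneme's adaptation data when the phoneme itself is unseen."""
        chosen = {}
        for phoneme, tree in trees.items():
            source = phoneme if phoneme in adapt_data else similar.get(phoneme)
            if source in adapt_data:
                chosen[phoneme] = descend(tree, adapt_data[source], score)
            else:
                chosen[phoneme] = tree.model  # no substitute: keep the root model
        return chosen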
Authors:
Tasos Anastasakos, Motorola, Lexicus Division (USA)
Sreeram V. Balakrishnan, Motorola, Lexicus Division (USA)
Page (NA) Paper number 599
Abstract:
Confidence estimation of the output hypothesis of a speech recognizer
offers a way to assess the probability that the recognized words are
correct. This work investigates the application of confidence scores
for selection of speech segments in unsupervised speaker adaptation.
Our approach is motivated by initial experiments showing that the
use of mislabeled data significantly degrades the performance of
particular adaptation schemes. We focus on a rapid self-adaptation
scenario that uses only a few seconds of adaptation data. The adaptation
algorithm is based on an extension to the MLLR transformation method
that can be applied to the observation vectors. We present experimental
results of this work on the ARPA WSJ large vocabulary dictation task.
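The selection step can be pictured with a minimal sketch (hypothetical names and threshold; the recognizer is assumed to emit a per-segment confidence in [0, 1]): only hypotheses scoring above the threshold are passed on as adaptation data, keeping likely mislabeled segments out.

    def select_adaptation_data(hypotheses, threshold=0.9):
        """Keep (features, transcript) pairs whose confidence clears the bar."""
        return [(feats, words) for feats, words, conf in hypotheses
                if conf >= threshold]

    hyps = [
        ("feats_1", "move the cursor", 0.96),
        ("feats_2", "left",            0.52),  # likely a recognition error
        ("feats_3", "stop",            0.94),
    ]
    print(select_adaptation_data(hyps))  # keeps feats_1 and feats_3 only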
Authors:
John McDonough, Center for Language and Speech Processing, The Johns Hopkins University (USA)
William Byrne, Center for Language and Speech Processing, The Johns Hopkins University (USA)
Xiaoqiang Luo, Center for Language and Speech Processing, The Johns Hopkins University (USA)
Page (NA) Paper number 869
Abstract:
Speaker normalization is a process in which the short-time features
of speech from a given speaker are transformed so as to better match
some speaker independent model. Vocal tract length normalization (VTLN)
is a popular speaker normalization scheme wherein the frequency axis
of the short-time spectrum associated with a speaker's speech is rescaled
or warped prior to the extraction of cepstral features. In this work,
we develop a novel speaker normalization scheme by exploiting the fact
that frequency domain transformations similar to that inherent in VTLN
can be accomplished entirely in the cepstral domain through the use
of conformal maps. We propose a class of such maps, designated all-pass
transforms for reasons given hereafter, and in a set of speech recognition
experiments conducted on the Switchboard Corpus demonstrate their capacity
to achieve word error rate reductions of 3.7% absolute.
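For orientation, the first-order member of this family is the well-known bilinear transform (a standard textbook example; the class of all-pass transforms the paper proposes may be more general):

    % First-order all-pass map of the unit circle onto itself; for
    % |alpha| < 1 it warps the frequency axis much as VTLN does, while
    % acting linearly on the cepstral coefficients.
    \[
      \hat{z} = \frac{z - \alpha}{1 - \alpha z}, \qquad |\alpha| < 1,
    \]
    \[
      \hat{\omega} = \omega
        + 2 \arctan\!\left( \frac{\alpha \sin\omega}{1 - \alpha \cos\omega} \right),
    \]
    % and the warped cepstrum is a linear map of the original:
    \[
      \hat{c} = A(\alpha)\, c .
    \]

Because such a map is conformal on the unit circle, the warp can be applied entirely in the cepstral domain as the matrix A(alpha), which is what makes the scheme attractive computationally.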
Authors:
Rong Zheng, Speech Recognition Lab, Dept. of Electrical Engr., Tsinghua University (China)
Zuoying Wang, Speech Recognition Lab, Dept. of Electrical Engr., Tsinghua University (China)
Page (NA) Paper number 276
Abstract:
In this paper, we present an integrated on-line learning scheme
that combines state-of-the-art speaker normalization and adaptation
techniques to improve the performance of our large vocabulary Chinese
continuous speech recognition (CSR) system. We use VTLN to remove
inter-speaker variation in both the training and testing stages. To facilitate
dynamic determination of the transformation scale, we devise a tree-based
transformation method as the key component of our incremental adaptation.
Experiments show that the combined on-line learning scheme (incremental
and unsupervised), which yields an error reduction rate of approximately
22-26%, outperforms either method used separately (18.34% and
2.7%, respectively).
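A minimal sketch of the VTLN half (hypothetical function names, standard warp range): maximum-likelihood selection of a per-speaker warp factor by grid search, which is the usual way VTLN warps are chosen.

    import numpy as np

    def select_warp(utterance, extract_features, log_likelihood,
                    warps=np.arange(0.88, 1.13, 0.02)):
        """Return the warp factor whose re-extracted features the
        acoustic model scores highest (maximum-likelihood VTLN)."""
        return max(warps,
                   key=lambda a: log_likelihood(extract_features(utterance, a)))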
Authors:
Sharon L. Oviatt, Oregon Graduate Institute (USA)
Page (NA) Paper number 49
Abstract:
When using interactive systems, people adapt their speech during attempts
to resolve system recognition errors. This paper summarizes the two-stage
Computer-elicited Hyperarticulate Adaptation Model (CHAM), which accounts
for systematic changes in human speech during interactive error handling.
According to CHAM, Stage I adaptation is manifest as a singular change
involving the increased duration of speech and pauses. This change
is associated with a moderate degree of hyperarticulation, which occurs
when the system error rate is low. In contrast, Stage II adaptations
are associated with more extreme hyperarticulation during a high system
error rate and entail changes in multiple features of speech, including
duration, articulation, intonation pattern, fundamental frequency, and
amplitude. This paper summarizes the empirical findings and linguistic
theory upon which CHAM is based, as well as the model's main predictions.
Finally, the implications of CHAM are discussed for designing future
interactive systems with improved error handling.