Authors:
Roland Kuhn, Panasonic Technologies Inc., Speech Technology Laboratory (USA)
Patrick Nguyen, Panasonic Technologies Inc., Speech Technology Laboratory (USA)
Jean-Claude Junqua, Panasonic Technologies Inc., Speech Technology Laboratory (USA)
Lloyd Goldwasser, Panasonic Technologies Inc., Speech Technology Laboratory (USA)
Nancy Niedzielski, Panasonic Technologies Inc., Speech Technology Laboratory (USA)
Steven Fincke, Panasonic Technologies Inc., Speech Technology Laboratory (USA)
Ken Field, Panasonic Technologies Inc., Speech Technology Laboratory (USA)
Matteo Contolini, Panasonic Technologies Inc., Speech Technology Laboratory (USA)
Page (NA) Paper number 303
Abstract:
We have devised a new class of fast adaptation techniques for speech
recognition. These techniques are based on prior knowledge of speaker
variation, obtained by applying Principal Component Analysis (PCA)
or a similar technique to T vectors of dimension D derived from T speaker-dependent
models. This offline step yields T basis vectors called "eigenvoices".
We constrain the model for new speaker S to be located in the space
spanned by the first K eigenvoices. Speaker adaptation involves estimating
the K eigenvoice coefficients for the new speaker; typically, K is
very small compared to D. We conducted mean adaptation experiments
on the Isolet database. With a large amount of supervised adaptation
data, most eigenvoice techniques performed slightly better than MAP
or MLLR; with small amounts of supervised adaptation data or for unsupervised
adaptation, some eigenvoice techniques performed much better. We believe
that the eigenvoice approach would yield rapid adaptation for most
speech recognition systems.
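The pipeline the abstract describes — supervectors from T speaker-dependent models, PCA to obtain eigenvoices, then estimation of K coefficients for a new speaker — can be sketched as follows. This is a simplified illustration on synthetic data: a plain least-squares projection stands in for the maximum-likelihood coefficient estimation used in the paper, and all names and dimensions are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D, K = 20, 100, 3           # T speakers, supervector dimension D, K << D

# Supervectors flattened from T speaker-dependent models (synthetic here).
supervectors = rng.normal(size=(T, D))

# Offline step: PCA on the supervectors yields the "eigenvoices".
mean_voice = supervectors.mean(axis=0)
centered = supervectors - mean_voice
# Rows of Vt are principal directions; keep the first K as eigenvoices.
_, _, Vt = np.linalg.svd(centered, full_matrices=False)
eigenvoices = Vt[:K]                                # shape (K, D)

# Adaptation: estimate K coefficients for a new speaker, constraining the
# adapted model to the span of the first K eigenvoices.
new_speaker = rng.normal(size=D)
coeffs = eigenvoices @ (new_speaker - mean_voice)   # only K numbers, K << D
adapted_model = mean_voice + coeffs @ eigenvoices   # constrained model
```

Because the adapted model is determined by only K coefficients rather than D parameters, very little adaptation data is needed to estimate it — the source of the rapid adaptation the abstract reports.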
Authors:
Sue E. Johnson, Cambridge University (U.K.)
Philip C. Woodland, Cambridge University (U.K.)
Page (NA) Paper number 726
Abstract:
In this paper, speaker clustering schemes are investigated in the context
of improving unsupervised adaptation for broadcast news transcription.
The various techniques are presented within a framework of top-down
split-and-merge clustering. Since these schemes are to be used for
MLLR-based adaptation, a natural evaluation metric for clustering is
the increase in data likelihood from adaptation. Two types of cluster
splitting criteria have been used. The first minimises a covariance-based
distance measure and for the second we introduce a two-step E-M type
procedure to form clusters which directly maximise the likelihood of
the adapted data. It is shown that the direct maximisation technique
produces a higher data likelihood and also gives a reduction in word
error rate.
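The two-step E-M-type idea — alternately assigning segments to the cluster under which their likelihood is highest, then re-estimating the clusters — can be made concrete with a toy sketch. Note the assumptions: the real scheme scores each cluster with its MLLR-adapted models, whereas here a single diagonal Gaussian per cluster (and single-vector "segments") stands in, purely to show the shape of the loop; all data and names are invented.

```python
import numpy as np

def loglik(x, mu, var):
    # Diagonal-Gaussian log-likelihood of each row of x.
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var, axis=-1)

def em_split(segments, iters=5):
    # Initialise the split with the two most distant segments (deterministic).
    d = np.linalg.norm(segments[:, None] - segments[None, :], axis=-1)
    i, j = np.unravel_index(d.argmax(), d.shape)
    mus = [segments[i], segments[j]]
    vars_ = [np.ones(segments.shape[1])] * 2
    for _ in range(iters):
        # Step 1: assign each segment to the cluster maximising its likelihood.
        scores = np.stack([loglik(segments, m, v) for m, v in zip(mus, vars_)])
        assign = scores.argmax(axis=0)
        # Step 2: re-estimate the cluster parameters from the assignments.
        mus, vars_ = [], []
        for c in range(2):
            members = segments[assign == c]
            mus.append(members.mean(axis=0))
            vars_.append(members.var(axis=0) + 1e-3)   # variance floor
    return assign

rng = np.random.default_rng(1)
# Two well-separated synthetic "speaker" populations of segments.
segs = np.vstack([rng.normal(-2, 1, size=(30, 4)),
                  rng.normal(2, 1, size=(30, 4))])
labels = em_split(segs)
```

Each iteration can only increase the total data likelihood, which is exactly the evaluation metric the paper argues is natural for MLLR-based adaptation.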
Authors:
Olli Viikki, Nokia Research Center, Speech and Audio Systems Laboratory (Finland)
Kari Laurila, Nokia Research Center, Speech and Audio Systems Laboratory (Finland)
Page (NA) Paper number 313
Abstract:
In this paper, we examine the use of speaker adaptation in adverse
noise conditions. In particular, we focus on incremental on-line speaker
adaptation since it, in addition to its other advantages, enables joint
speaker and environment adaptation. First, we show that on-line adaptation
is superior to off-line adaptation when realistic changing noise conditions
are considered. Next, we show that a conventional left-to-right HMM
structure is not well suited for on-line adaptation in variable noise
conditions due to unreliable state-frame alignments of noisy utterances.
To overcome this problem, we suggest the use of state-duration-constrained
HMMs. Our experimental results indicate that the performance gain due
to adaptation is much greater with duration-constrained HMMs than
with conventional left-to-right HMMs. In addition to the appropriate
model structure, we point out that in long-term adaptation, such as
incremental on-line adaptation, the supervised approach is a necessity.
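One common way to impose the kind of minimum-duration constraint the abstract motivates is to replace each HMM state with a chain of tied copies, so the decoder cannot leave a state before a minimum number of frames has been aligned to it. The sketch below shows this topology transformation on a transition matrix; it is one standard construction, not necessarily the paper's exact mechanism, and the example HMM is invented.

```python
import numpy as np

def expand_min_duration(trans, min_dur):
    """Expand an n-state left-to-right transition matrix so that each
    state must be occupied for at least min_dur frames."""
    n = trans.shape[0]
    out = np.zeros((n * min_dur, n * min_dur))
    for s in range(n):
        base = s * min_dur
        # Forced forward transitions through the first min_dur - 1 copies.
        for k in range(min_dur - 1):
            out[base + k, base + k + 1] = 1.0
        last = base + min_dur - 1
        if s + 1 < n:
            out[last, last] = trans[s, s]                    # self-loop
            out[last, (s + 1) * min_dur] = trans[s, s + 1]   # exit to next state
        else:
            out[last, last] = 1.0                            # absorbing final state
    return out

# 3-state left-to-right HMM with self-loops.
trans = np.array([[0.6, 0.4, 0.0],
                  [0.0, 0.7, 0.3],
                  [0.0, 0.0, 1.0]])
expanded = expand_min_duration(trans, min_dur=3)
```

Because the first copies of each state have no self-loop, a noisy frame can no longer cause the Viterbi alignment to skip through a state in a single frame — addressing the unreliable state-frame alignments described above.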
Authors:
Mark J.F. Gales, IBM Almaden Research Center (USA)
Page (NA) Paper number 375
Abstract:
When performing speaker adaptation there are two conflicting requirements.
First, the transform must be powerful enough to model the speaker. Second,
the transform should be rapidly estimated for any particular speaker.
Recently, the most popular adaptation schemes have used many parameters
to adapt the models. This paper examines an adaptation scheme requiring
few parameters to adapt the models: cluster adaptive training (CAT). It may
be viewed as a simple extension to speaker clustering. A linear interpolation
of the cluster means is used as the mean of the particular speaker.
This scheme naturally falls into an adaptive training framework. Maximum
likelihood estimates of the interpolation weights are given. Furthermore,
re-estimation formulae for cluster means, represented both explicitly
and by sets of transforms of some canonical mean, are given. On a speaker-independent
task, CAT reduced the word error rate using very little adaptation data,
compared to a standard speaker-independent model set.
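The core CAT step — modelling the speaker's mean as a linear interpolation of the cluster means, with maximum-likelihood interpolation weights — reduces, under an identity-covariance assumption made here for brevity, to a small C-by-C least-squares solve that is independent of the model dimension. The sketch below illustrates this on synthetic data; all names and values are invented for the example.

```python
import numpy as np

rng = np.random.default_rng(2)
C, D = 3, 10                          # C cluster means of dimension D

# Cluster means (rows of M), e.g. from speaker clustering.
M = rng.normal(size=(C, D))

# Synthetic adaptation data drawn around a known interpolation of clusters.
true_w = np.array([0.2, 0.5, 0.3])
obs = true_w @ M + rng.normal(scale=0.01, size=(200, D))

# ML interpolation weights under identity covariance: minimising
# sum_t ||o_t - M^T w||^2 gives the normal equations (M M^T) w = M mean(o).
w = np.linalg.solve(M @ M.T, M @ obs.mean(axis=0))

# Adapted mean for this speaker: a linear interpolation of cluster means.
speaker_mean = w @ M
```

Only C weights are estimated per speaker, which is why the scheme adapts well from very little data; the full method additionally re-estimates the cluster means themselves inside an adaptive training loop.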