Authors:
Tomoko Matsui, NTT Human Interface Labs. (Japan)
Kiyoaki Aikawa, NTT Human Interface Labs. (Japan)
Page (NA) Paper number 714
Abstract:
This paper investigates a method of creating robust speaker models
that are not sensitive to session-dependent (SD) utterance-variation
and handset-dependent (HD) distortion for HMM-based speaker verification
systems in a real telephone network. We recently reported a method
of creating session-independent (SI) speaker-HMMs that are not sensitive
to SD utterance-variation. In that method, a distortion function
that transforms SI speaker-HMMs to SD speaker-HMMs is introduced, and
the parameters in the function and the speaker-HMM parameters are jointly
estimated using a speaker adaptive training algorithm. This paper
proposes a method that is less sensitive to SD utterance-variation
and HD distortion than the previous method. This new idea focuses
on different difficulties in estimating parameters in distortion functions
for SD utterance-variation and HD distortion. In text-independent verification
experiments using telephone speech data, the error reduction rate of
the improved method compared with that of the conventional method of
cepstral mean normalization is 24%.
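For reference, the baseline named in this abstract, cepstral mean normalization, can be sketched as follows. This is a minimal illustration of the standard technique, not the authors' implementation; the function name and the toy two-dimensional "cepstra" are assumptions made for the example.

```python
def cepstral_mean_normalize(frames):
    """Subtract the per-coefficient mean, taken over all frames of one utterance.

    `frames` is a list of equal-length cepstral vectors (hypothetical toy data here).
    A stationary channel adds a constant offset in the cepstral domain, so
    removing the utterance mean removes that offset.
    """
    if not frames:
        return []
    n_frames = len(frames)
    dim = len(frames[0])
    means = [sum(frame[d] for frame in frames) / n_frames for d in range(dim)]
    return [[frame[d] - means[d] for d in range(dim)] for frame in frames]


# Toy utterance with two frames and two coefficients per frame.
utterance = [[1.0, 4.0], [3.0, 8.0]]
normalized = cepstral_mean_normalize(utterance)
```

After normalization each coefficient has zero mean over the utterance, which is why the method reduces sensitivity to fixed handset or channel effects.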
Authors:
Haakan Melin, KTH, Dept. of Speech, Music and Hearing (Sweden)
Johan W. Koolwaaij, KUN, Dept. of Language & Speech (The Netherlands)
Johan Lindberg, KTH, Dept. of Speech, Music and Hearing (Sweden)
Frédéric Bimbot, IRISA / CNRS & INRIA - Sigma2 (France)
Page (NA) Paper number 467
Abstract:
The problem of how to train variance parameters on scarce data is addressed
in the context of text-dependent, HMM-based, automatic speaker verification.
Three variations of variance flooring are explored as a means to prevent
over-fitting. With the best performing one, the floor to a variance
vector of a client model is proportional to the corresponding variance
vector in a non-client multi-speaker model. It is also found that creating
a client-model by adapting the means and mixture weights from the non-client
model while keeping variances constant works comparably to variance
flooring and is much simpler. Comparisons are made on three large telephone
quality corpora: Gandalf, SESP and Polycost.
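The best-performing flooring rule described in this abstract, a floor proportional to the corresponding non-client variance, might look roughly like the sketch below. The scale factor `gamma` and its default value are hypothetical; the abstract does not give the proportionality constant.

```python
def floor_variances(client_var, world_var, gamma=0.5):
    """Floor each client variance at gamma times the matching world-model variance.

    `gamma` is an assumed proportionality constant, chosen only for illustration.
    """
    if len(client_var) != len(world_var):
        raise ValueError("variance vectors must have the same length")
    return [max(c, gamma * w) for c, w in zip(client_var, world_var)]


# A client variance estimated from scarce data (0.01) is raised to the floor.
floored = floor_variances([0.01, 2.0, 0.4], [1.0, 1.0, 1.0], gamma=0.5)
```

Variances already above the floor are left untouched, so only the poorly estimated, overly narrow components are affected.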
Authors:
Dijana Petrovska-Delacrétaz, CIRC-EPFL (Switzerland)
Jan Cernocky, Technical University of Brno (Czech Republic)
Jean Hennebert, CIRC-EPFL (Switzerland)
Gérard Chollet, ENST, Paris (France)
Page (NA) Paper number 536
Abstract:
Most of the current text-independent speaker verification techniques
are based on modelling the global probability distribution function
of speakers in the acoustic vector space. We present an alternative
approach based on class-dependent verification systems using automatically
determined segmental units, obtained with temporal decomposition and
labelled through unsupervised clustering. The core of the system is
a set of multi-layer perceptrons (MLP) trained to discriminate between
the client and an independent set of world speakers. Each MLP is dedicated
to work with data segments that are previously selected as belonging
to a particular class. Issues and potential advantages of the segmental
approach are presented. Performances of global and segmental approaches
are tested on the NIST'98 database (250 female and 250 male speakers),
showing promising results for the proposed new segmental approach.
A comparison with a state-of-the-art system based on Gaussian Mixture
Modelling is also included.
Authors:
Qi Li, Bell Labs, Lucent Technologies (USA)
Page (NA) Paper number 815
Abstract:
A fast algorithm for left-to-right HMM decoding is proposed in this
paper. The algorithm is developed based on a sequential detection scheme
which is asymptotically optimal in the sense of detecting a possible
change in distribution as reliably and quickly as possible. The scheme
is extended to HMM decoding in determining the state segmentations
for likelihood or other score computations. As a sequential scheme,
it can determine a state boundary in a few time steps after it occurs.
The examples in this paper show that the proposed algorithm is 5 to
9 times faster than the Viterbi algorithm while still providing
the same or similar decoding results. The proposed algorithm can be
applied to speaker recognition, audio segmentation, voice/silence detection,
and many other applications, where an assumption of the algorithm is
usually satisfied.
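To illustrate what a sequential change-detection scheme does, the sketch below uses the well-known CUSUM test on a one-dimensional signal. The abstract does not name the specific scheme, so CUSUM is an assumption here, as are the unit-variance Gaussian model, the function name, and the threshold value.

```python
def cusum_change_point(samples, mean_before, mean_after, threshold):
    """Return the index at which a change is declared, or None if none is found.

    Assumes unit-variance Gaussian observations whose mean switches from
    `mean_before` to `mean_after` (an illustrative model, not the paper's).
    """
    stat = 0.0
    for t, x in enumerate(samples):
        # Log-likelihood ratio of "after" versus "before" for this sample.
        llr = (mean_after - mean_before) * (x - (mean_before + mean_after) / 2.0)
        # Accumulate evidence, resetting to zero whenever it goes negative.
        stat = max(0.0, stat + llr)
        if stat > threshold:
            return t
    return None


# The true change happens at index 3; it is declared one step later.
samples = [0.0, 0.0, 0.0, 2.0, 2.0, 2.0]
detected_at = cusum_change_point(samples, 0.0, 2.0, threshold=3.0)
no_change = cusum_change_point([0.0] * 5, 0.0, 2.0, threshold=3.0)
```

Because the decision is made sample by sample, a boundary is reported shortly after it occurs, without a backward pass over the whole sequence as in Viterbi decoding.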
Authors:
Jesper Østergaard Olsen, Aalborg University (Denmark)
Page (NA) Paper number 334
Abstract:
For most classifier architectures realistic training schemes only allow
classifiers corresponding to local optima of the training criteria
to be constructed. One way of dealing with this problem is to work
with classifier ensembles: multiple classifiers are trained for the
same classification problem and combined into one "super" classifier.
The problem addressed in this paper is text-prompted speaker verification
by means of phoneme-dependent Radial Basis Function networks trained
by gradient descent error minimisation. In this context ensemble techniques
are introduced by combining different classifiers that classify feature
vectors, which have been pre-processed using different linear transforms.
Four different types of linear transforms are studied: the Fisher
transform, the LDA transform, the PCA transform and the cosine transform.
The verification system is evaluated on the Gandalf database, where
the equal error rate is reduced from 3.6% to 3.2% when ensemble techniques
are introduced.
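One simple way to combine an ensemble is to average the scores of its members and threshold the result. The abstract does not state the combination rule, so score averaging, the function names, and the threshold below are all assumptions for illustration.

```python
def ensemble_score(scores):
    """Average the scores produced by the individual classifiers (assumed rule)."""
    if not scores:
        raise ValueError("at least one classifier score is required")
    return sum(scores) / len(scores)


def accept(scores, threshold=0.5):
    """Accept the identity claim if the averaged score reaches the threshold."""
    return ensemble_score(scores) >= threshold


# Hypothetical scores from four classifiers, one per linear transform.
member_scores = [0.9, 0.4, 0.6, 0.7]
combined = ensemble_score(member_scores)
decision = accept(member_scores)
```

Here one member on its own (0.4) would reject the claim, while the combined score accepts it, which is the kind of error smoothing an ensemble is meant to provide.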
Authors:
Jesper Østergaard Olsen, Aalborg University (Denmark)
Page (NA) Paper number 335
Abstract:
A new discriminant speaker model is introduced in this paper. The
model is text dependent and relies on characterising speakers in terms
of the angular distance between "projection vectors", which allow
good discrimination between individual speakers. The projection models
require only a small amount of enrollment data per target speaker,
but they also require a set of "cohort speakers", each with a
relatively large amount of training speech. The projection model technique is evaluated on
the Gandalf database and compared to conventional Gaussian Mixture
Models (GMMs). It is found that the projection models require less
storage per target speaker, while at the same time achieving lower
error rates, particularly when applied for speaker identification and
recognition under mismatched conditions.
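The angular distance between two vectors, on which the projection model relies, can be computed as shown below. This is only the generic geometric definition; how the paper derives its projection vectors is not described in the abstract, and the function name is an assumption.

```python
import math


def angular_distance(u, v):
    """Return the angle in radians between two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("angular distance is undefined for a zero vector")
    # Clamp to [-1, 1] to guard against rounding error before acos.
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.acos(cosine)


orthogonal = angular_distance([1.0, 0.0], [0.0, 1.0])
parallel = angular_distance([1.0, 0.0], [2.0, 0.0])
```

The measure depends only on direction, not on vector length, so two vectors that differ by a scale factor have distance zero.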