Authors:
James Moody, Queensland University of Technology (Australia)
Stefan Slomka, Queensland University of Technology (Australia)
Jason Pelecanos, Queensland University of Technology (Australia)
Sridha Sridharan, Queensland University of Technology (Australia)
Page (NA) Paper number 667
Abstract:
This paper studies the reliance of a Gaussian Mixture Model (GMM) based
closed-set speaker identification system on model convergence and describes
methods to improve this convergence. It shows that Vector Quantisation
GMMs (VQGMMs) outperform a simple GMM mainly because they reduce the
complexity of the data during training. In addition,
it is shown that the VQGMM system is less computationally complex than
the traditional GMM, yielding a system which is quicker to train and
which gives higher performance. We also investigate four different
VQ distance measures which can be used in the training of a VQGMM and
compare their respective performances. It is found that the improvements
gained by the VQGMM are only marginally dependent on the distance measure.
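The VQ step the abstract describes can be illustrated with a small k-means codebook trainer whose centroids could then seed the GMM component means. This is a minimal sketch under that assumption, not the authors' exact VQGMM recipe, and it uses only the Euclidean distance (the paper compares four VQ distance measures).

```python
import random

def vq_train(frames, k, iters=20, seed=0):
    """Train a VQ codebook with k-means on feature frames.

    frames: list of feature vectors (lists of floats).
    Returns k codewords; these could initialise GMM component means.
    Illustrative sketch only, not the paper's exact procedure.
    """
    rng = random.Random(seed)
    codebook = [list(f) for f in rng.sample(frames, k)]  # initial codewords
    for _ in range(iters):
        cells = [[] for _ in range(k)]
        for x in frames:
            # Nearest codeword under squared Euclidean distance;
            # the paper also studies three other VQ distance measures.
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(x, codebook[c])))
            cells[j].append(x)
        for j, cell in enumerate(cells):
            if cell:  # move each codeword to the mean of its cell
                dim = len(cell[0])
                codebook[j] = [sum(v[d] for v in cell) / len(cell)
                               for d in range(dim)]
    return codebook
```

On well-separated data the codewords settle on the cluster centres, which is the reduced-complexity representation the abstract credits for the VQGMM's advantage.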
Authors:
Kemal Sönmez, SRI International (USA)
Elizabeth Shriberg, SRI International (USA)
Larry Heck, Nuance Communications (USA)
Mitchel Weintraub, SRI International (USA)
Page (NA) Paper number 920
Abstract:
Statistics of frame-level pitch have recently been used in speaker
recognition systems with good results. Although they convey useful
long-term information about a speaker's distribution of f0 values,
such statistics fail to capture information about local dynamics in
intonation that characterize an individual's speaking style. In this
work, we take a first step toward capturing such suprasegmental patterns
for automatic speaker verification. Specifically, we model the speaker's
f0 movements by fitting a piecewise linear model to the f0 track to
obtain a stylized f0 contour. Parameters of the model are then used
as statistical features for speaker verification. We report results
on the 1998 NIST speaker verification evaluation. Prosody modeling improves
the verification performance of a cepstrum-based Gaussian mixture model
system (as measured by a task-specific Bayes risk) by 10%.
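The stylization step can be sketched as a per-segment least-squares line fit over the f0 track. This toy version uses fixed, equal-length segments for simplicity; the paper's piecewise linear model chooses breakpoints from the data, so treat the segmentation here as an assumption.

```python
def stylize_f0(f0, n_segments):
    """Fit one least-squares line per equal-length segment of an f0 track.

    Returns a (slope, intercept) pair per segment; such parameters are
    the kind of statistical feature the abstract describes.
    Fixed segment boundaries are an illustrative simplification.
    """
    seg_len = len(f0) // n_segments
    params = []
    for s in range(n_segments):
        ys = f0[s * seg_len:(s + 1) * seg_len]
        xs = list(range(len(ys)))
        n = len(xs)
        mx = sum(xs) / n
        my = sum(ys) / n
        sxx = sum((x - mx) ** 2 for x in xs)
        sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
        slope = sxy / sxx if sxx else 0.0
        params.append((slope, my - slope * mx))
    return params
```

On a rise-fall contour the two fitted slopes recover the local f0 movement directions, which is exactly the dynamic information frame-level pitch statistics miss.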
Authors:
Douglas A. Reynolds, MIT Lincoln Laboratory (USA)
Elliot Singer, MIT Lincoln Laboratory (USA)
Beth A. Carlson, MIT Lincoln Laboratory (USA)
Gerald C. O'Leary, MIT Lincoln Laboratory (USA)
Jack J. McLaughlin, MIT Lincoln Laboratory (USA)
Marc A. Zissman, MIT Lincoln Laboratory (USA)
Page (NA) Paper number 610
Abstract:
Classical speaker and language recognition techniques can be applied
to the classification of unknown utterances by computing the likelihoods
of the utterances given a set of well trained target models. This
paper addresses the problem of grouping unknown utterances when no
information is available regarding the speaker or language classes
or even the total number of classes. Approaches to blind message clustering
are presented based on conventional hierarchical clustering techniques
and an integrated cluster generation and selection method called the
d* algorithm. Results are presented using message sets derived from
the Switchboard and Callfriend corpora. Potential applications include
automatic indexing of recorded speech corpora by speaker/language tags
and automatic or semiautomatic selection of speaker-specific speech
utterances for speaker recognition adaptation.
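The conventional hierarchical clustering the abstract builds on can be sketched as bottom-up agglomeration over a pairwise message-distance matrix. This sketch uses single linkage and a stopping threshold as assumptions; the paper's d* algorithm integrates cluster generation and selection in a way not modeled here.

```python
def agglomerate(dist, threshold):
    """Bottom-up clustering of messages given a symmetric distance matrix.

    Repeatedly merges the closest pair of clusters (single linkage)
    until the best available merge distance exceeds `threshold`.
    A minimal sketch of conventional hierarchical clustering; the
    d* cluster-selection step from the paper is not modeled.
    """
    clusters = [[i] for i in range(len(dist))]
    while len(clusters) > 1:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                # Single-linkage distance between the two clusters.
                d = min(dist[i][j] for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best[0] > threshold:
            break
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return clusters
```

With message distances derived from model likelihoods, the surviving clusters would correspond to hypothesised speaker or language groups.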
Authors:
Diamantino Caseiro, INESC/IST (Portugal)
Isabel M. Trancoso, INESC/IST (Portugal)
Page (NA) Paper number 1093
Abstract:
Current language identification systems vary significantly in their
complexity. The systems that use higher level linguistic information
have the best performance. Nevertheless, that information is hard to
collect for each new language. The system presented in this paper is
easily extendable to new languages because it uses very little linguistic
information. In fact, the presented system needs only one language
specific phone recogniser (in our case the Portuguese one), and is
trained with speech from each of the other languages. On the SpeechDat-M
corpus, with 6 European languages (English, French, German, Italian,
Portuguese and Spanish), our system achieved an identification rate
of 83.4% on 5-second utterances. This result shows an improvement of
5% over our previous version, mainly through the use of a neural network
classifier. Both the baseline and the full system were implemented
in real time.
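The single-recogniser architecture can be sketched in the classic PRLM style: one phone recogniser decodes every utterance, and a phonotactic model trained per language scores the decoding. The bigram models and scoring below are assumptions for illustration; the paper's actual system uses a neural network classifier on top.

```python
import math
from collections import defaultdict

def train_bigrams(phone_strings):
    """Estimate phone-bigram log-probabilities from decoded phone strings."""
    counts = defaultdict(lambda: defaultdict(int))
    for s in phone_strings:
        for a, b in zip(s, s[1:]):
            counts[a][b] += 1
    model = {}
    for a, nxt in counts.items():
        total = sum(nxt.values())
        model[a] = {b: math.log(c / total) for b, c in nxt.items()}
    return model

def score(model, phones, floor=-10.0):
    """Average bigram log-likelihood of one phone string under one language."""
    pairs = list(zip(phones, phones[1:]))
    ll = sum(model.get(a, {}).get(b, floor) for a, b in pairs)
    return ll / max(1, len(pairs))

def identify(models, phones):
    """Pick the language whose phonotactic model scores the decoding highest."""
    return max(models, key=lambda lang: score(models[lang], phones))
```

Because only the front-end phone recogniser is language-specific, extending such a system to a new language needs only decoded training speech, which is the extensibility argument the abstract makes.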
Authors:
Jerome Braun, Computer Science Department, University of Massachusetts Lowell (USA)
Haim Levkowitz, Computer Science Department, University of Massachusetts Lowell (USA)
Page (NA) Paper number 405
Abstract:
We present a novel approach to Automatic Language Identification (LID).
We propose Perceptually Guided Training (PGT), a novel LID training
method, involving identification of utterance parts which are particularly
significant perceptually for the language identification process, and
exploitation of these Perceptually Significant Regions (PSRs) to guide
the LID training process. Our approach involves a Recurrent Neural
Network (RNN) as the main mechanism. We propose that, because long-range
intra-utterance acoustic context is significant for LID, RNNs are
particularly suitable for the LID task. Our approach does
not require phonetic labeling or transcription of the training corpus.
LIREN/PGT, the LID system we developed, incorporates our approach.
Our LID experiments were on English, German, and Mandarin Chinese,
using the OGI-TS corpus.
Authors:
Sarel van Vuuren, Oregon Graduate Institute of Science and Technology (USA)
Hynek Hermansky, Oregon Graduate Institute of Science and Technology (USA)
Page (NA) Paper number 631
Abstract:
We provide an analysis of the relative importance of components of
the modulation spectrum for speaker verification. The aim is to remove
less relevant components and reduce system sensitivity to acoustic
disturbances while improving verification accuracy. Spectral components
between about 0.1Hz and 10Hz are found to contain the most useful speaker
information. We discuss this result in the context of RASTA processing
and cepstral mean subtraction. When compared to cepstral mean subtraction
that retains components up to 50Hz, lowpass filtering to 10Hz with
downsampling by 75 percent is found to significantly improve robustness
in mismatched conditions. The downsampling results in large computational
savings.
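The lowpass-plus-downsampling operation on a feature trajectory can be sketched directly. A moving average stands in here for the lowpass filter (the filter shape and window length are illustrative assumptions), and keeping every fourth frame corresponds to the 75 percent downsampling the abstract reports.

```python
def smooth_and_decimate(track, win=5, keep_every=4):
    """Lowpass-filter one cepstral coefficient's trajectory, then decimate.

    track: per-frame values of one cepstral coefficient over time.
    A moving average is used as a simple stand-in for the lowpass filter;
    keeping every 4th frame matches 75 percent downsampling.
    """
    half = win // 2
    smoothed = []
    for t in range(len(track)):
        # Average over a window clipped at the track edges.
        lo = max(0, t - half)
        hi = min(len(track), t + half + 1)
        smoothed.append(sum(track[lo:hi]) / (hi - lo))
    return smoothed[::keep_every]
```

Since the retained modulation components lie below about 10Hz, the decimated trajectory preserves the speaker-relevant information while cutting the frame rate, which is where the computational savings come from.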