ABSTRACT
Gaussian mixture models (GMMs) have proven to be one of the most
powerful statistical methods for speaker identification. In the GMM method,
the covariance matrix is usually assumed to be diagonal, i.e., the
feature components are treated as approximately uncorrelated. This assumption
may not hold. This paper concentrates on finding an orthogonal speaker-dependent
transformation that reduces the correlation between feature components. This
transformation is based on the eigenvectors of the within-class scatter
matrix, which is computed at each stage of the iterative training of the GMM parameters.
Hence the transformation matrix and the GMM parameters are both updated in
each iteration until the total log-likelihood converges. An experimental
evaluation of the proposed method is conducted on a 100-person connected
digit database for text independent speaker identification. The experimental
result shows a reduction in the error rate by 42% when 7-digit utterances
are used for testing.
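The iterative scheme above can be sketched as follows. This is a minimal illustration with hypothetical function names, assuming hard frame-to-component assignments; it is not the authors' implementation:

```python
import numpy as np

def within_class_transform(features, component_ids, n_components):
    """Orthogonal transform from the eigenvectors of the within-class
    scatter matrix (a sketch of the decorrelation step described above).

    features:      (N, D) feature vectors
    component_ids: (N,) hard assignment of each frame to a mixture component
    """
    dim = features.shape[1]
    scatter = np.zeros((dim, dim))
    for k in range(n_components):
        cluster = features[component_ids == k]
        if len(cluster) == 0:
            continue
        centered = cluster - cluster.mean(axis=0)
        scatter += centered.T @ centered       # within-class scatter
    # Eigenvectors of a symmetric matrix form an orthogonal basis.
    _, eigvecs = np.linalg.eigh(scatter)
    return eigvecs

# One iteration: transform the features, re-estimate the GMM on the
# transformed features, and repeat until the total log-likelihood converges.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
ids = rng.integers(0, 3, size=200)
T = within_class_transform(X, ids, 3)
X_decorrelated = X @ T
```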
A0009.pdf
ABSTRACT
In this paper we present a new approach to text-independent speaker
verification. Speaker models are created from complete data sets derived
from a set of sentences. A decision on an identity claim is based on the
calculation of the mean next-neighbour distance between a speaker model
and a test utterance. A vector quantization technique serves to efficiently
extract this frame-based similarity measure. It is the purpose of this
paper to investigate this new approach and to test its performance on a large
database as a function of a number of parameters, i.e., the number of data
vectors in each model and the length of the test utterance. The best
results on a set of 108 speakers are a 0.93% false rejection rate and a 0.98%
false acceptance rate.
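As a sketch of the measure above (with hypothetical names, and assuming a Euclidean distance, which the abstract does not specify):

```python
import numpy as np

def mean_next_neighbour_distance(model_vectors, test_frames):
    """Average, over the test utterance, of each frame's distance to its
    nearest vector in the speaker model (frame-based similarity measure)."""
    # Pairwise Euclidean distances, shape (test frames, model vectors)
    dists = np.linalg.norm(
        test_frames[:, None, :] - model_vectors[None, :, :], axis=-1)
    return dists.min(axis=1).mean()

# An identity claim would be accepted when the score falls below a
# threshold tuned to trade false rejections against false acceptances.
model = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
test = np.array([[0.1, 0.0], [0.9, 0.1]])
score = mean_next_neighbour_distance(model, test)
```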
A0042.pdf
ABSTRACT
The first motivation for using Gaussian mixture models for text-independent
speaker identification is the observation that a linear combination
of Gaussian basis functions is capable of representing a large class of
sample distributions. While this technique generally gives good results,
little is known about which specific part of a speech signal best identifies
a speaker. This contribution suggests a procedure, based on the Jensen
divergence measure, to automatically extract from the input speech signal
the part that best contributes to identifying a speaker. The results
obtained show that this technique can significantly increase the performance
of a speaker recognition system.
A0209.pdf
ABSTRACT
Automatic identification of non-linguistic speech features (e.g.
the speaker or the language of an utterance) is currently of practical
interest. In this paper, we first impose a set of requirements that
we think a statistical model used in non-linguistic feature identification
should satisfy, namely capturing both short- and
long-term correlations in addition to maintaining a certain acoustic resolution.
A model satisfying these requirements, while at the same time having the
attractive property of requiring no transcribed speech material during training,
is proposed. An experimental evaluation of the approach in speaker recognition
on the TIMIT database is presented, where recognition rates of up to 99.2%
are achieved.
A0225.pdf
ABSTRACT
In this paper, we present and compare two alternative post-processing approaches for generating decision rules for text-dependent speaker identification based on Gaussian Mixture Models (GMMs). The first approach, a linear programming method, minimizes a cost over combined scores obtained from the N-best GMM output probabilities. The second, more heuristic approach is based on combining output score probabilities to generate decision rules. Statistical tools have been developed to explore the relative impact of these approaches on recognition accuracy. Experiments on the Spidre database are presented to show the effects of these two approaches on speaker identification performance (including the number of N-best hypotheses and handset variability). The linear programming approach does not show any improvement; the combined statistical approach, however, has demonstrated an improvement of more than 11% compared to our standard system.
ABSTRACT
We investigate the use of variable resolution spectral analysis for
speaker recognition. The spectral resolution is simply determined by a
unique parameter. A speaker can therefore be represented by this parameter
and a stochastic model, which means that each speaker is represented in
a different acoustic space. For speaker verification tasks, the likelihood
ratio compared to a threshold should not depend on the representation space,
so that likelihood ratios remain comparable. We experimented with different
spectral resolutions using several classifiers, but obtained no improvement
in the results, and the classifiers turned out not to be very sensitive
to the different feature sets.
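The comparability requirement follows from a standard change-of-variables argument (not spelled out in the abstract): for an invertible feature map y = f(x), the Jacobian factor that densities acquire is identical in the numerator and the denominator, so the likelihood ratio, and any fixed threshold on it, is unchanged:

```latex
p_Y(y \mid \cdot) \;=\; \frac{p_X(x \mid \cdot)}{\lvert \det J_f(x) \rvert}
\quad\Longrightarrow\quad
\frac{p_Y(y \mid \text{speaker})}{p_Y(y \mid \text{impostor})}
  \;=\; \frac{p_X(x \mid \text{speaker})}{p_X(x \mid \text{impostor})}.
```

When each speaker uses a different resolution, the map f differs per speaker, which is precisely why the comparability of the ratios has to be checked rather than assumed.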
A0289.pdf
ABSTRACT
Recent work in ASR shows that band splitting, forming multiple paths with recombination at the decision stage, can give recognition accuracy comparable with the conventional full-band approach. One of the many interesting questions with band-splitting relates to the bandwidths of each sub-band, and the use of frequency warping functions such as mel. This paper examines the use of mel and linear frequency scales in the context of band-splitting and speaker recognition. We demonstrate how sub-band error profiles can lead to a new scale, lying between linear and mel, that gives both an equalised sub-band error profile and an improved overall recognition accuracy.
ABSTRACT
This paper evaluates 63 Automatic Gender Identification (AGI) systems
for text-independent clean speech segments, coded speech and speech segments
affected by reverberation. The AGI systems contain a Linear Classifier
(LC) with inputs from a combination of two average pitch detection methods
and paired Gaussian Mixture Models trained with mel-cepstral, autocorrelation,
reflection and log-area-ratio parameterised speech data. An AGI system
is built which is able to handle the LPC10, CELP and GSM coders with no
significant loss in accuracy, and to reduce the impact of even severe reverberation
by subjecting the training data of the LC to a different room response.
Using speech segments with an average duration of 890ms (after silence
removal), the best AGI system had an accuracy of 98.5% averaged over all
clean and adverse conditions.
A0307.pdf
ABSTRACT
The present study aims at examining the relative importance of various
acoustic features as cues to familiar speaker identification. The study
also attempts to examine the validity of the prototype model, as the key
to human speaker recognition. To this aim 20 speakers were recorded. Their
voices were modified using an analysis-synthesis system, which enabled
analysis and modification of the glottal waveform, of the pitch, and of
the formants. A group of 30 listeners had to identify the speakers in an
open-set experiment. The results suggest that on average, the contribution
of the vocal tract features is more important than that of the glottal
source features. Examination of individual speakers reveals that changes
of identical features affect the identification of different speakers
differently. This finding suggests that for each speaker a different group
of acoustic features serves as a cue to vocal identity, and, along with
other predictions that were found to be valid, supports the adequacy of
the prototype model.
A0345.pdf
ABSTRACT
In this paper, we present a novel architecture for a Speaker Recognition
system over the telephone. The proposed system introduces acoustic information
into a HMM-based recognizer. This is achieved by using a phonetic classifier
during the training phase. Three broad phonetic classes: voiced frames,
unvoiced frames and transitions, are defined. We design speaker templates
by the parallel connection of the outputs of the single-state HMMs
and by the combination of the single-state HMMs into a four-state HMM
after estimation of the transition probabilities. The results show that
this architecture performs better than others without phonetic classification.
A0395.pdf
ABSTRACT
Speaker recognition by human listeners and by an automatic system were compared. Eight male and eight female speakers were involved. The effect of speech quality was also investigated: wide band, telephone band, and two conditions with noise (SNR +6 dB and 0 dB). For this purpose noise samples were used with a spectrum shaped according to the long-term speech spectrum. The automatic speaker recognition was based on an algorithm which uses a description of the signal by the covariance in the spectral domain. It was found that for both methods the male speakers are slightly better recognized. One to two words are sufficient, in the wide band condition, for correct subjective recognition. The automatic recognition requires a slightly longer utterance.
ABSTRACT
This paper reports on the development of a foreign speaker accent classification
system based on phoneme class specific accent discrimination models. This
new approach to the problem of automatic accent classification allows fast
and reliable prediction of the speaker accents for continuous speech through
exploitation of the accent specific information at the phoneme level. The
system was trained and evaluated on a corpus representing three speaker
groups with native Australian English (AuE), Lebanese Arabic (LA) and South
Vietnamese (SV) accents. The speaker accent classification rates achieved
by our system come close to the benchmarks set by human listeners.
A0470.pdf
ABSTRACT
Speaker recognition experiments have been conducted with the publicly
available YOHO database to compare the performance of human listeners and
computers. Two types of listening experiments have been performed: the first
is the forced-choice speaker discrimination test, which corresponds
to the task of speaker identification. The second, the same-different
judgment, is similar to the task of speaker verification. It is shown that
human listeners perform well on the same-different judgment task, but their
error rate in speaker discrimination is relatively large. Moreover, human
listeners are more robust to session variability, while the machine's
performance degrades considerably when the reference and test utterances
come from different recording sessions.
A0489.pdf
ABSTRACT
In this study, Hidden Markov Models (HMMs) were used to evaluate pronunciation. Native and non-native speakers were asked to pronounce ten Dutch words. Each word was subsequently evaluated by an expert listener. Her main task was to decide whether a word was spoken by a native or a non-native speaker. For each word type, two versions of prototype HMMs were defined: one to be trained on tokens produced by a single native speaker, and another to be trained on tokens produced by a group of native speakers. For testing the different types of HMM, forced recognition was performed using native and non-native judged tokens. We expected that recognition with multi-speaker HMMs would allow a more effective discrimination between native and non-native tokens than recognition with single-speaker models. A comparison of Equal Error Rates partly confirmed this hypothesis.
ABSTRACT
The performance of speaker recognition algorithms drops significantly
when testing and training acoustic environments differ. This decrease is
caused by the mismatch between the statistics representing
the speaker and the testing acoustic data. This paper reports our preliminary
results on the application of a novel environmental compensation algorithm
to the problem of speaker recognition and identification. This new technique,
called the Delta Vector Taylor Series (DVTS) approach, improves performance
at signal-to-noise ratios below 20 dB. The algorithm imposes a model of
how the environment modifies speaker statistics and uses Expectation-Maximization
(EM) to solve a joint maximum-likelihood formulation of the
speaker recognition problem over both the speakers and the environment.
We report experimental results on subsets of the TIMIT and NTIMIT databases.
A0572.pdf
ABSTRACT
This paper investigates the effects of using multiple time intervals for the calculation of regression coefficients. The technique that we have used is referred to as Wavelet-Like Regression (WLR). Using this approach we have found that the underlying time series in the cepstral domain differs slightly depending upon the index of the series, and that by employing a technique that accounts for this, such as WLR, we may achieve an incremental improvement in recognition performance at negligible extra cost.
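As a sketch of the idea: standard linear-regression (delta) coefficients computed over a ±K-frame interval, with K varied per cepstral index. The abstract does not give the WLR window lengths, so the per-index windows below are hypothetical:

```python
import numpy as np

def regression_coeffs(cepstra, window):
    """Standard linear-regression (delta) coefficients over +/- `window`
    frames, with edge padding at the utterance boundaries."""
    T, _ = cepstra.shape
    denom = 2 * sum(k * k for k in range(1, window + 1))
    padded = np.pad(cepstra, ((window, window), (0, 0)), mode='edge')
    delta = np.zeros_like(cepstra)
    for k in range(1, window + 1):
        delta += k * (padded[window + k: window + k + T]
                      - padded[window - k: window - k + T])
    return delta / denom

def wavelet_like_regression(cepstra, windows):
    """Hypothetical WLR-style sketch: a different regression interval
    for each cepstral index (one window length per column)."""
    return np.column_stack([
        regression_coeffs(cepstra[:, [i]], w)[:, 0]
        for i, w in enumerate(windows)
    ])

# Example: two cepstral indices, regression over +/-1 and +/-2 frames.
C = np.tile(np.arange(10, dtype=float).reshape(-1, 1), (1, 2))
D = wavelet_like_regression(C, [1, 2])
```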
ABSTRACT
In this paper, we report our recent work on applying the combined MLLR and MCE approach to estimating the time-varying polynomial Gaussian mean functions in the trended HMM. We call this integrated approach minimum classification error linear regression (MCELR). The transformation matrices associated with each polynomial coefficient are calculated to minimize the recognition error on the adaptation data, using a gradient descent algorithm. A speech recognizer based on these results is implemented in speaker adaptation experiments using the TI46 corpora. Results show that the trended HMM always outperforms the standard HMM and that adaptation of the linear regression coefficients is always better when fewer than three adaptation tokens are used.
ABSTRACT
This paper presents a text-independent speaker recognition system based on vowel spotting and Continuous Mixture Hidden Markov Models. The same modeling technique is applied both to vowel spotting and speaker identification/verification procedures. The system is evaluated on two speech databases, TIMIT and NTIMIT, resulting in high accuracy rates. Closed-set identification accuracy on TIMIT and NTIMIT databases is 98.09% and 59.32%, respectively. Concerning the verification experiments, accuracy of 98.28% for TIMIT, and 83.04% for NTIMIT databases is obtained. The nearly real time response of the classification procedure, the low memory requirements and the small amount of training and testing data are some of the additional advantages of the proposed speaker recognition system.
ABSTRACT
One of the frequently used assumptions in speaker verification is that two speech segments (phonemes, subwords, words) can be considered independent, so that the log-likelihood of a test utterance is just the sum of the log-likelihoods of the speech segments in that utterance. This paper reports on cases in which this observation-independence assumption seems to be violated, namely for test utterances which call a certain speech model more than once. For example, a pin code which contains a non-unique digit set performs worse in verification than a pin code which consists of four different digits. Results illustrate that violating the independence assumption too strongly might result in increasing EERs even as more information (in the form of digits) is added to the test utterance.
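The independence assumption in question reduces utterance scoring to a plain sum; a minimal sketch with illustrative scores:

```python
def utterance_log_likelihood(segment_log_likelihoods):
    """Under the observation-independence assumption, the utterance score
    is just the sum of the per-segment log-likelihoods."""
    return sum(segment_log_likelihoods)

# A 4-digit pin scored digit by digit: "1-2-3-4" calls four distinct
# digit models, whereas "1-2-2-4" calls the model for "2" twice -- the
# repeated-model case where the independence assumption is questionable.
score = utterance_log_likelihood([-12.3, -9.8, -9.8, -11.1])
```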
ABSTRACT
In this paper we propose a new measure to classify speakers with respect to their behaviour in speaker recognition systems. Taking the proposal made by EAGLES as a point of departure, we show that it fails to yield results that are consistent between closely related speaker recognition methods and between different amounts of speech available for the recognition task. We show that measures based on straightforward confusion matrices, which take only the 1-best classification into account, cannot result in consistent classifications. As an alternative we propose a measure based on n-best scores in a speaker identification paradigm, and show that it yields more consistent performance.
ABSTRACT
We present in this paper preliminary results using speaker recognition
and speech recognition techniques, designed at LIP6, to index audio data
of video movies. The assumption that only one person is speaking at the
same time is made. In a first approach, we work on unsupervised dialogue
indexing using speaker recognition techniques. For this purpose, we develop
Silence/Noise/Music/Speech detection algorithms in order to cut the audio data
into segments that we hope are homogeneous in terms of speaker identity.
In a second approach, we develop a supervised audio data indexing method
that exploits knowledge of the movie script.
A1258.pdf
ABSTRACT
Recently, the set of spectral parameters of every speech frame that
results from filtering the frequency sequence of mel-scaled filter-bank
energies with a simple first-order high-pass FIR filter has proved to
be an efficient speech representation in terms of both speech recognition
rate and computational load. In this paper, we apply the same technique
to speaker recognition. Frequency filtering approximately equalizes the
cepstrum variance, enhancing the oscillations of the spectral envelope
curve that are most effective for discriminating between speakers. In this
way, even better speaker identification results than with the conventional
mel-cepstrum were observed with continuous-observation Gaussian-density HMMs,
especially in noisy conditions.
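A minimal sketch of the filtering step, using a plain first difference along the band index as the first-order high-pass FIR filter (the paper's exact filter coefficients are an assumption here):

```python
import numpy as np

def frequency_filtered(log_fbank):
    """Filter the FREQUENCY sequence of log filter-bank energies with a
    first-order high-pass FIR filter: out[k] = e[k] - e[k-1].

    log_fbank: (frames, bands) log mel filter-bank energies.
    Each frame loses one coefficient; the filtered vector replaces the
    cepstrum as the per-frame feature.
    """
    return np.diff(log_fbank, axis=1)

# One frame with four band energies yields three filtered coefficients.
frame = np.array([[1.0, 2.0, 4.0, 3.0]])
feat = frequency_filtered(frame)
```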
A1360.pdf