Speaker Verification and Identification


Model Transformation for Robust Speaker Recognition from Telephone Data

Authors:

Françoise Beaufays, SRI International (U.S.A.)
Mitch Weintraub, SRI International (U.S.A.)

Volume 2, Page 1063

Abstract:

In the context of automatic speaker recognition, we propose a model transformation technique that renders speaker models more robust to acoustic mismatches and to data scarcity by appropriately increasing their variances. We use a stereo database containing speech recorded simultaneously under different acoustic conditions to derive a synthetic variance distribution. This distribution is then used to modify the variances of other speaker models from other telephone databases.
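The variance-broadening idea described in the abstract can be illustrated with a minimal sketch; the function names and the additive combination rule below are illustrative assumptions, not the authors' exact procedure:

```python
import numpy as np

def synthetic_variance_offsets(clean_feats, degraded_feats):
    """Per-dimension variance increase observed between the two channels
    of a stereo database (same speech, different acoustic conditions)."""
    return np.maximum(degraded_feats.var(axis=0) - clean_feats.var(axis=0), 0.0)

def broaden_model(model_vars, offsets, alpha=1.0):
    """Broaden a speaker model's variances with the synthetic distribution
    to make it more robust to mismatch and data scarcity."""
    return model_vars + alpha * offsets
```

Here the variance increase measured on the stereo database stands in for the "synthetic variance distribution" applied to speaker models trained on other telephone databases.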

ic971063.pdf




Speaker Recognition with the Switchboard corpus

Authors:

Lori Lamel, LIMSI (France)
Jean-Luc Gauvain, LIMSI (France)

Volume 2, Page 1067

Abstract:

In this paper we present our development work carried out in preparation for the March '96 speaker recognition test on the Switchboard corpus organized by NIST. The speaker verification system evaluated was a Gaussian mixture model (GMM). We provide experimental results on the development-test and evaluation-test data, along with some experiments carried out since the evaluation comparing the GMM with a phone-based approach. Better performance is obtained by training on data from multiple sessions recorded with different handsets. High error rates are obtained even with a phone-based approach, both with and without orthographic transcriptions of the training data. We also describe a human perceptual test carried out on a subset of the development data, which demonstrates the difficulty human listeners had with this task.

ic971067.pdf




Handset Dependent Background Models for Robust Text-Independent Speaker Recognition

Authors:

Larry P. Heck, Speech Technology and Research Lab. (U.S.A.)
Mitch Weintraub, Speech Technology and Research Lab. (U.S.A.)

Volume 2, Page 1071

Abstract:

This paper studies the effects of handset distortion on telephone-based speaker recognition performance, resulting in the following observations: (1) the major factor in speaker recognition errors is whether the handset type (e.g., electret, carbon) differs between training and testing, not whether the telephone lines are mismatched; (2) the distribution of speaker recognition scores for true speakers is bimodal, with one mode dominated by matched-handset tests and the other by mismatched handsets; (3) cohort-based normalization methods derive much of their performance gain from implicitly selecting cohorts trained with the same handset type as the claimant; and (4) using a handset-dependent background model matched to the handset type of the claimant's training data sharpens and separates the true- and false-speaker score distributions. Results on the 1996 NIST Speaker Recognition Evaluation corpus show that using handset-matched background models reduces false acceptances (at a 10% miss rate).
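Scoring with a handset-dependent background model can be sketched as follows; single diagonal Gaussians stand in for the GMMs actually used, and all names are illustrative:

```python
import numpy as np

def gauss_loglik(frames, mean, var):
    # diagonal-covariance Gaussian log density per frame, summed over dims
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (frames - mean) ** 2 / var,
                         axis=-1)

def verification_score(frames, speaker, backgrounds, handset):
    """Average log-likelihood ratio of the claimant model against the
    background model matched to the handset type ('carbon', 'electret')
    of the claimant's TRAINING data."""
    bkg_mean, bkg_var = backgrounds[handset]
    spk_mean, spk_var = speaker
    llr = gauss_loglik(frames, spk_mean, spk_var) - gauss_loglik(frames, bkg_mean, bkg_var)
    return llr.mean()
```

Matching the background model to the claimant's training handset, rather than to the test call, is the key design choice the abstract reports.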

ic971071.pdf




Telephone Based Speaker Recognition using Multiple Binary Classifier and Gaussian Mixture Models

Authors:

Pierre J. Castellano, Queensland University of Technology (Australia)
Stefan Slomka, Queensland University of Technology (Australia)
Sridha Sridharan, Queensland University of Technology (Australia)

Volume 2, Page 1075

Abstract:

The present study evaluates multiple binary classifier model (MBCM) and Gaussian mixture model (GMM) solutions to both automatic speaker verification (ASV) and identification (ASI) problems involving text-independent telephone speech from the King speech database. The MBCM's accuracy is enhanced by selectively removing the classifiers within the model that perform worst (pruning). An unpruned MBCM outperforms a GMM for ASV with speakers taken from within the same dialect region (San Diego, CA). Once pruned, the MBCM is found to be 2.6 times more accurate than the GMM. For closed-set ASI on the same data, the MBCM is roughly twice as accurate as the GMM, but only after pruning.

ic971075.pdf




Comparison of Whole Word and Subword Modeling Techniques for Speaker Verification with Limited Training Data

Authors:

Stephan Euler, Bosch Telecom (Germany)
Rainer Langlitz, Bosch Telecom (Germany)
Joachim Zinke, FH Friedberg (Germany)

Volume 2, Page 1079

Abstract:

In this paper we use whole-word and subword hidden Markov models for text-dependent speaker verification. In this application, usually only a small amount of training data is available for each model. To cope with this limitation, we propose an intermediate functional representation of the training data that allows robust initialization of the models. This new approach is tested on two databases and compared both with standard training techniques and with the dynamic time warping method. Second, we give results for two types of subword units. The scores of these units are combined in two different ways to obtain word error rates.

ic971079.pdf




A Comparison of Model Estimation Techniques for Speaker Verification

Authors:

Michael J. Carey, Ensigma Ltd (U.K.)
Eluned S. Parris, Ensigma Ltd (U.K.)
Stephen J. Bennett, Ensigma Ltd (U.K.)
Harvey Lloyd-Thomas, Ensigma Ltd (U.K.)

Volume 2, Page 1083

Abstract:

We address the problem of building speaker-dependent HMMs for a speaker verification system. A number of model-building techniques are described and the comparative performance of systems using models built with these techniques is presented. Mean-estimated models, in which the means of the HMMs are estimated using segmental K-means but the variances are taken from speaker-independent models, outperformed the other techniques for training times from 120 s down to 15 s. Mean-estimated models were also built with varying numbers of components in the state mixture distributions, and a performance gain was again observed. Incorporating transitional features into the system degraded performance when the Baum-Welch algorithm was used for model estimation. However, including delta and delta-delta cepstra in the system using mean-estimated models gave a significant improvement in performance. These changes halved the equal error rate of the system to 7.8%.
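The mean-estimated models can be sketched roughly as follows; this uses a plain K-means in place of segmental K-means and copies diagonal variances unchanged from a speaker-independent model (all names illustrative):

```python
import numpy as np

def mean_estimated_model(feats, si_vars, k, iters=10, seed=0):
    """Estimate k mixture means from the speaker's (limited) data with a
    small K-means, but keep variances from a speaker-independent model."""
    rng = np.random.default_rng(seed)
    means = feats[rng.choice(len(feats), k, replace=False)]
    for _ in range(iters):
        # assign each frame to its nearest mean, then re-estimate means
        d = np.linalg.norm(feats[:, None] - means[None], axis=-1)
        labels = d.argmin(axis=1)
        for j in range(k):
            if np.any(labels == j):
                means[j] = feats[labels == j].mean(axis=0)
    # variances are NOT re-estimated: they come from the SI model
    return means, np.tile(si_vars, (k, 1))
```

Freezing the variances at speaker-independent values is what makes this estimation usable with very little enrollment data, the regime the abstract addresses.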

ic971083.pdf




Speaker Verification using Frame and Utterance Level Likelihood Normalization

Authors:

Seiichi Nakagawa, TUT, Toyohashi (Japan)
Konstantin P. Markov, TUT, Toyohashi (Japan)

Volume 2, Page 1087

Abstract:

In this paper, we propose a new method in which the likelihood normalization technique is applied at both the frame and utterance levels. In this method, based on Gaussian mixture models (GMM), every frame of the test utterance is input to the claimed-speaker model and to all background speaker models in parallel. Thus, at each frame, likelihoods from all the background models are available and can be used to normalize the claimed-speaker likelihood at that frame. A special kind of likelihood normalization, called Weighting Models Rank, is also proposed. We have evaluated our method on two databases, TIMIT and NTT. Results show that combining frame- and utterance-level likelihood normalization in some cases reduces the equal error rate (EER) by more than half.
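The frame-level normalization can be sketched as follows, assuming per-frame log-likelihoods are already computed; averaging the background likelihoods per frame is one plausible choice, not necessarily the paper's exact formula:

```python
import numpy as np

def logmeanexp(a, axis=0):
    # numerically stable log of the mean of exponentials
    m = a.max(axis=axis)
    return m + np.log(np.mean(np.exp(a - m), axis=axis))

def frame_normalized_score(claim_ll, background_ll):
    """claim_ll: (T,) per-frame log-likelihoods of the claimed speaker.
    background_ll: (B, T) per-frame log-likelihoods of B background models.
    Normalize the claimed likelihood at EVERY frame by that frame's
    average background likelihood, then average over the utterance."""
    return np.mean(claim_ll - logmeanexp(background_ll, axis=0))
```

Because the normalizer is recomputed at each frame, a locally atypical frame is discounted immediately rather than only once at the utterance level.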

ic971087.pdf




A New Codebook Training Algorithm for VQ-based Speaker Recognition

Authors:

Jialong He, University of Ulm (Germany)
Li Liu, University of Ulm (Germany)
Günther Palm, University of Ulm (Germany)

Volume 2, Page 1091

Abstract:

VQ-based speaker recognition has proven to be a successful method. Usually, a codebook is trained to minimize the quantization error for the data from an individual speaker. Codebooks trained with this criterion have weak discriminative power when used as a classifier. The LVQ algorithm can be used to train the VQ-based classifier globally; however, the correlation between feature vectors is not taken into consideration, so a high classification rate for individual feature vectors does not lead to a high classification rate for the test sentences. In this paper, a heuristic training procedure is proposed to retrain the codebooks so that they give a lower classification error rate for randomly selected vector groups. Evaluation experiments demonstrated that codebooks trained with this method provide much higher recognition rates than those trained with the LBG algorithm alone, and they can often outperform the more powerful Gaussian mixture speaker models.
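The baseline VQ classification that the retraining procedure targets can be sketched as follows (illustrative names; the proposed heuristic retraining itself is not reproduced here):

```python
import numpy as np

def quantization_error(frames, codebook):
    """Mean Euclidean distance from each frame to its nearest codeword."""
    d = np.linalg.norm(frames[:, None, :] - codebook[None, :, :], axis=-1)
    return d.min(axis=1).mean()

def identify(frames, codebooks):
    """Closed-set identification: pick the speaker whose codebook yields
    the smallest average quantization error on the test frames."""
    errs = {spk: quantization_error(frames, cb) for spk, cb in codebooks.items()}
    return min(errs, key=errs.get)
```

The abstract's point is that minimizing each speaker's own quantization error (as LBG does) is not the same as minimizing this decision rule's error, which motivates discriminative retraining.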

ic971091.pdf




Bispectrum Features for Robust Speaker Identification

Authors:

Stanley Wenndt, Rome Laboratory (U.S.A.)
Sanyogita Shamsunder, Colorado State University (U.S.A.)

Volume 2, Page 1095

Abstract:

Along with the spoken message, speech contains information about the identity of the speaker. The goal of speaker identification is therefore to develop features that are unique to each speaker. This paper explores a new feature for speech and shows how it can be used for robust speaker identification. The results are compared with the cepstrum feature, given its widespread use and success in speaker identification applications. The cepstrum, however, has shown a lack of robustness in varying conditions, especially in a cross-condition environment where the classifier is trained on clean data but tested on corrupted data. Part of the bispectrum is used as a new feature, and we demonstrate its usefulness in varying noise settings.
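The bispectrum itself can be computed directly from its definition; the sketch below estimates it for a single frame (the paper uses only part of the bispectrum as its feature, a selection not reproduced here):

```python
import numpy as np

def bispectrum(x):
    """Direct bispectrum estimate of a 1-D frame:
    B(f1, f2) = X(f1) * X(f2) * conj(X(f1 + f2)), indices taken mod N."""
    X = np.fft.fft(x)
    n = len(x)
    i = np.arange(n)
    return (X[i[:, None]] * X[i[None, :]]
            * np.conj(X[(i[:, None] + i[None, :]) % n]))
```

The bispectrum suppresses additive Gaussian noise (whose third-order cumulants vanish), which is the robustness property motivating its use over the cepstrum.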

ic971095.pdf




Speaker Identification Based Text to Audio Alignment for an Audio Retrieval System

Authors:

Deb K. Roy, MIT Media Lab (U.S.A.)
Carl Malamud, IMS (U.S.A.)

Volume 2, Page 1099

Abstract:

We report on an audio retrieval system that lets Internet users efficiently access a large audio database containing recordings of the proceedings of the United States House of Representatives. The audio has been temporally aligned to text transcripts of the proceedings (which are manually generated by the U.S. Government) using a novel method based on speaker identification. Speaker sequence and approximate timing information are extracted from the text transcript and used to constrain a Viterbi alignment of speaker models to the observed audio. Speakers are modeled by computing Gaussian statistics of cepstral coefficients extracted from samples of each person's speech. The speaker identification is used to locate speaker transition points in the audio, which are then linked to the corresponding speaker transitions in the text transcript. The alignment system has been successfully integrated into a World Wide Web based search-and-browse system as an experimental service on the Internet.
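The per-speaker Gaussian models can be sketched as follows; a single diagonal Gaussian per speaker is assumed here, and the transcript-constrained Viterbi alignment over the speaker sequence is omitted:

```python
import numpy as np

class GaussianSpeakerModel:
    """Diagonal Gaussian over cepstral features, fit from samples of one
    speaker's speech (a minimal stand-in for the models described above)."""
    def __init__(self, feats):
        self.mean = feats.mean(axis=0)
        self.var = feats.var(axis=0) + 1e-6  # floor to avoid zero variance

    def loglik(self, frames):
        return -0.5 * np.sum(np.log(2 * np.pi * self.var)
                             + (frames - self.mean) ** 2 / self.var, axis=-1)

def label_segment(frames, models):
    """Pick the known speaker whose model best explains an audio segment;
    changes in the winning label locate speaker transition points."""
    return max(models, key=lambda s: models[s].loglik(frames).sum())
```

In the full system these segment scores would be combined under the speaker-sequence constraint from the transcript rather than decided independently per segment.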

ic971099.pdf




Robust Speaker Recognition through Acoustic Array Processing and Spectral Normalization

Authors:

Joaquin Gonzalez-Rodriguez, Univ. Politécnica de Madrid (Spain)
Javier Ortega-Garcia, Univ. Politécnica de Madrid (Spain)

Volume 2, Page 1103

Abstract:

This paper describes the development of a robust speaker recognition system obtained through the joint use of acoustic array processing and spectral normalization as input to a Gaussian mixture model speaker recognition system. Results obtained with these techniques have been reported previously by the authors [10], but operational problems appear when extensive testing with different configurations and testing conditions is intended. In this paper, we describe an open system developed to cope with this problem. The number and geometry of the microphones, the time-delay estimation method, the array processing structure, and the spectral normalization technique, together with the room size, noise type, and SNR, are some of the options that can be easily changed. The system also allows testing with real multichannel databases, and any new algorithm can easily be incorporated into it.

ic971103.pdf




Providing Single and Multi-Channel Acoustical Robustness to Speaker Identification Systems

Authors:

Javier Ortega-Garcia, Univ. Politécnica de Madrid (Spain)
Joaquin Gonzalez-Rodriguez, Univ. Politécnica de Madrid (Spain)

Volume 2, Page 1107

Abstract:

Acoustical mismatch between the training and testing phases degrades the performance of automatic speaker recognition systems. Providing robustness to speaker recognizers is therefore a priority. Robustness in the acoustical stage can be accomplished through speech enhancement techniques applied as a stage prior to the recognizer; these techniques aim to reduce the impact of acoustical noise on the input signal. In this paper, several spectral-subtraction-derived techniques are used to enhance single-channel noisy speech. Other approaches, based on dual-channel (adaptive filtering) and multi-channel (microphone array) processing, are also presented as solutions to speech enhancement needs. A comparative analysis of the proposed techniques, with different types of noise at different SNRs, as a pre-processing stage to an ergodic HMM-based speaker recognizer, is presented.
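Basic magnitude spectral subtraction, the starting point of the single-channel techniques above, can be sketched as follows; the over-subtraction factor and spectral floor are common illustrative choices, not the paper's specific variants:

```python
import numpy as np

def spectral_subtraction(mag, noise_mag, alpha=1.0, beta=0.02):
    """Subtract an estimated noise magnitude spectrum (e.g. averaged over
    non-speech frames) from each frame's magnitude spectrum.
    mag: (frames, bins) magnitudes; noise_mag: (bins,) noise estimate.
    A spectral floor of beta * noise_mag keeps results non-negative."""
    cleaned = mag - alpha * noise_mag
    return np.maximum(cleaned, beta * noise_mag)
```

The floor is what limits the "musical noise" artifacts that plain half-wave rectification of the subtracted spectrum would produce.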

ic971107.pdf
