Session ThMD Speaker Recognition and Language Identification

Chairperson: Douglas Reynolds, MIT, USA



GAUSSIAN MIXTURE MODELS WITH COMMON PRINCIPAL AXES AND THEIR APPLICATION IN TEXT-INDEPENDENT SPEAKER IDENTIFICATION

Authors: Kuo-Hwei Yuo and Hsiao-Chuan Wang

Department of Electrical Engineering, National Tsing Hua University, Hsinchu, Taiwan 30043 E-mail: hcwang@ee.nthu.edu.tw

Volume 5 pages 2279 - 2282

ABSTRACT

Gaussian mixture models (GMMs) have been demonstrated to be one of the most powerful statistical methods for speaker identification. In the GMM method, the covariance matrix is usually assumed to be diagonal, which implies that the feature components are relatively uncorrelated. This assumption may not be correct. This paper concentrates on finding an orthogonal, speaker-dependent transformation that reduces the correlation between feature components. The transformation is based on the eigenvectors of the within-class scatter matrix obtained at each stage of the iterative training of the GMM parameters. Hence the transformation matrix and the GMM parameters are both updated in each iteration until the total log-likelihood converges. An experimental evaluation of the proposed method is conducted on a 100-person connected-digit database for text-independent speaker identification. The results show a 42% reduction in error rate when 7-digit utterances are used for testing.
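
The transformation step described above can be illustrated with a small sketch (not the authors' code; the array shapes and the use of a diagonal-covariance GMM are assumptions): the responsibilities of the current GMM define a within-class scatter matrix, whose eigenvectors form the orthogonal transform applied to the features before the next training iteration.

```python
import numpy as np

def common_principal_axes_step(X, means, covs, weights):
    """One illustrative update of the speaker-dependent orthogonal transform.
    X: (T, D) feature frames; means, covs: (M, D) diagonal GMM parameters;
    weights: (M,) mixture weights."""
    # Responsibilities (posterior probability of each mixture per frame).
    diffs = X[:, None, :] - means[None, :, :]                        # (T, M, D)
    log_gauss = -0.5 * (np.sum(diffs ** 2 / covs[None], axis=2)
                        + np.sum(np.log(2 * np.pi * covs), axis=1))
    log_post = np.log(weights)[None, :] + log_gauss
    log_post -= log_post.max(axis=1, keepdims=True)
    post = np.exp(log_post)
    post /= post.sum(axis=1, keepdims=True)                          # (T, M)

    # Within-class scatter: responsibility-weighted scatter around each mean.
    Sw = np.einsum('tm,tmd,tme->de', post, diffs, diffs) / len(X)
    _, eigvecs = np.linalg.eigh(Sw)                                  # orthogonal axes
    return X @ eigvecs, eigvecs      # rotated features and the transform itself
```
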
A0009.pdf 



SPEAKER MODELS DESIGNED FROM COMPLETE DATA SETS: A NEW APPROACH TO TEXT-INDEPENDENT SPEAKER VERIFICATION

Authors: D. R. Dersch (1) and R. W. King (2)

1 Speech Technology Research Group Department of Electrical Engineering, The University of Sydney, NSW 2006 2 Faculty of Information Technology, University of South Australia, SA 5095 Email: dersch@speech.su.oz.au, robin.king@UniSA.edu.au

Volume 5 pages 2283 - 2286

ABSTRACT

In this paper we present a new approach to text-independent speaker verification. Speaker models are created from complete data sets derived from a set of sentences. A decision on an identity claim is based on the calculation of the mean nearest-neighbour distance between a speaker model and a test utterance. A vector quantization technique serves to extract this frame-based similarity measure efficiently. The purpose of this paper is to investigate this new approach and to test its performance on a large database as a function of a number of parameters, i.e., the number of data vectors in each model and the length of the test utterance. The best results on a set of 108 speakers are a 0.93% false rejection rate and a 0.98% false acceptance rate.
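
As a rough sketch of the similarity measure (the function names and the acceptance rule are illustrative assumptions, not the authors' implementation): each test frame is matched to its closest vector in the speaker's reference set, and the distances are averaged.

```python
import numpy as np

def mean_nn_distance(test_frames, speaker_vectors):
    """Mean nearest-neighbour distance between test frames (T, D) and a
    speaker's reference vectors (N, D)."""
    d = np.linalg.norm(test_frames[:, None, :] - speaker_vectors[None, :, :], axis=2)
    return d.min(axis=1).mean()      # closest reference vector per frame, averaged

def accept_claim(test_frames, speaker_vectors, threshold):
    """Accept the identity claim when the mean distance is small enough;
    the threshold would be tuned on development data."""
    return mean_nn_distance(test_frames, speaker_vectors) < threshold
```
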
A0042.pdf 



A Double Gaussian Mixture Modeling Approach to Speaker Recognition

Authors: Rivarol Vergin and Douglas O'Shaughnessy

CML Technologies, 75 Blvd de la Technologie, Hull, J8Z-3G4, Québec, Canada; INRS-Télécommunications, 16 Place du Commerce, Île-des-Soeurs, H3E-1H6, Québec, Canada. Email: vergin@inrs-telecom.uquebec.ca

Volume 5 pages 2287 - 2290

ABSTRACT

The first motivation for using Gaussian mixture models for text-independent speaker identification is the observation that a linear combination of Gaussian basis functions can represent a large class of sample distributions. While this technique generally gives good results, little is known about which specific part of a speech signal best identifies a speaker. This contribution suggests a procedure, based on the Jensen divergence measure, to automatically extract from the input speech signal the part that contributes most to identifying a speaker. The results obtained show that this technique can significantly increase the performance of a speaker recognition system.
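
The paper defines the exact selection procedure; the snippet below only illustrates the Jensen divergence itself (a hedged sketch): the entropy of a mixture of distributions minus the mean entropy of its components, which is zero when the distributions coincide and grows as they separate, so speech regions where speaker-conditional distributions diverge most would be the ones retained.

```python
import numpy as np

def entropy(p):
    p = np.asarray(p, dtype=float)
    p = p[p > 0]
    return -np.sum(p * np.log(p))

def jensen_divergence(distributions, weights=None):
    """Jensen divergence of a set of discrete distributions (rows of
    `distributions`): H(sum_i w_i p_i) - sum_i w_i H(p_i)."""
    P = np.asarray(distributions, dtype=float)
    w = (np.full(len(P), 1.0 / len(P)) if weights is None
         else np.asarray(weights, dtype=float))
    return entropy(w @ P) - sum(wi * entropy(p) for wi, p in zip(w, P))
```
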
A0209.pdf 



AN ACOUSTIC SUBWORD UNIT APPROACH TO NON-LINGUISTIC SPEECH FEATURE IDENTIFICATION

Authors: Mohamed Afify (1), Yifan Gong (1),(2), Jean-Paul Haton (1)

(1) CRIN/CNRS-INRIA-Lorraine, B.P. 239, 54506 Vandoeuvre-lès-Nancy, France (2) Media Technologies Laboratory, Texas Instruments, P.O. Box 655303, MS 8374, Dallas, TX 75265, U.S.A.

Volume 5 pages 2291 - 2294

ABSTRACT

Automatic identification of non-linguistic speech features (e.g. the speaker or the language of an utterance) is currently of practical interest. In this paper, we first impose a set of requirements that we think a statistical model used for non-linguistic feature identification should satisfy, namely capturing both short- and long-term correlations while maintaining a certain acoustic resolution. We propose a model that satisfies these requirements and at the same time has the attractive property of requiring no transcribed speech material during training. An experimental evaluation of the approach on speaker recognition with the TIMIT database is presented, where recognition rates of up to 99.2% are achieved.
A0225.pdf 



N-best GMM's for Speaker Identification

Authors: Chakib Tadj (1), Pierre Dumouchel (2), Yu Fang (3)

(1)Ecole de Technologie Superieure 1100 rue Notre Dame Ouest Montreal (Qc) - H3C 1K3 - Canada (2)Centre de Recherche Informatique de Montreal 1801, avenue McGill College, bureau 800 Montreal (Qc) - H3A 2N4 - Canada (3)Institut Universitaire de Technologie 16 Place du commerce Nun's Island (Qc) - H3E 1H6 - Canada

Volume 5 pages 2295 - 2298

ABSTRACT

In this paper, we present and compare two alternative post-processing approaches for generating decision rules for text-dependent speaker identification based on Gaussian Mixture Models (GMMs). The first approach, a linear programming method, is used to minimize a cost on combined scores obtained from the N-best GMM output probabilities. The second, more heuristic, approach is based on combining output score probabilities to generate decision rules. Statistical tools have been developed to explore the relative influence of these approaches on recognition accuracy. Experiments on the Spidre database are presented to show the effects of these two approaches on speaker identification performance (including the number of N-best hypotheses and handset variability). The linear programming approach does not show any improvement; however, the combined statistical approach demonstrates an improvement of more than 11% compared to our baseline system.
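
A minimal sketch of a heuristic, score-level combination (the rank-weighted voting rule below is an assumption for illustration, not the paper's exact rule): each test segment votes for its N best speakers and the votes are accumulated.

```python
import numpy as np

def n_best_vote(segment_scores, n_best=5):
    """Combine N-best GMM outputs across segments (illustrative rule).
    segment_scores: (num_segments, num_speakers) log-likelihoods."""
    num_segments, num_speakers = segment_scores.shape
    tally = np.zeros(num_speakers)
    rank_weights = np.arange(n_best, 0, -1, dtype=float)   # N, N-1, ..., 1
    for seg in segment_scores:
        best = np.argsort(seg)[::-1][:n_best]              # indices of the N best
        tally[best] += rank_weights[:len(best)]
    return int(np.argmax(tally))                           # identified speaker index
```
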

A0266.pdf



MODEL DEPENDENT SPECTRAL REPRESENTATIONS FOR SPEAKER RECOGNITION

Authors: G. Gravier (1), C. Mokbel (2), G. Chollet (1)

(1) ENST, Dpt. Signal, 46 rue Barrault, 75634 Paris Cedex 13, France (2) France Télécom, CNET - DIH/RCP, Lannion (1) {gravier,cholletg}@sig.enst.fr, mokbel@cnet.lannion.fr

Volume 5 pages 2299 - 2302

ABSTRACT

We investigate the use of variable-resolution spectral analysis for speaker recognition. The spectral resolution is determined by a single parameter, so a speaker can be represented by this parameter together with a stochastic model, which means that each speaker is represented in a different acoustic space. For speaker verification tasks, the likelihood ratio that is compared to a threshold should not depend on the representation space, so that likelihood ratios remain comparable. We experimented with different spectral resolutions and several classifiers, but obtained no improvement in the results; the classifiers turned out not to be very sensitive to the different feature sets.
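
The verification criterion mentioned above is the usual log-likelihood ratio against a threshold; a minimal sketch (the function and parameter names are assumptions): both claimant and background models are scored on features extracted with the claimed speaker's spectral resolution, so the ratio remains comparable across speakers.

```python
import numpy as np

def verify_claim(features, speaker_logpdf, background_logpdf, threshold):
    """Average log-likelihood-ratio test; `features` are assumed to be
    computed with the claimed speaker's own spectral resolution."""
    llr = np.mean(speaker_logpdf(features) - background_logpdf(features))
    return llr > threshold
```
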
A0289.pdf 



EQUALIZING SUB-BAND ERROR RATES IN SPEAKER RECOGNITION

Authors: Roland Auckenthaler (1) and John S. Mason (2)

(1) Department of Electronics, Technical University Graz, Inffeldgasse 12, A-8010 Graz, Austria (2) Department of Electrical & Electronic Engineering, University of Wales Swansea, SA2 8PP, UK email: {eeaucken, J.S.D.Mason}@swansea.ac.uk

Volume 5 pages 2303 - 2306

ABSTRACT

Recent work in ASR shows that band splitting, forming multiple paths with recombination at the decision stage, can give recognition accuracy comparable with the conventional full-band approach. One of the many interesting questions with band splitting concerns the bandwidth of each sub-band and the use of frequency warping functions such as mel. This paper examines the use of mel and linear frequency scales in the context of band splitting and speaker recognition. We demonstrate how sub-band error profiles can lead to a new scale, between linear and mel, giving both an equalised sub-band error profile and an improved overall recognition accuracy.
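
The band-splitting idea can be sketched as follows (a hedged illustration; the band edges, per-band classifiers and equal-weight recombination are assumptions): each sub-band is scored independently and the scores are recombined at the decision stage.

```python
import numpy as np

def subband_score(logspec_frames, band_edges, band_scorers):
    """logspec_frames: (T, num_bins) log-spectral frames; band_edges: list of
    (lo, hi) bin ranges; band_scorers: one scoring function per sub-band.
    Scores are recombined here with equal weights (illustrative choice)."""
    scores = []
    for (lo, hi), scorer in zip(band_edges, band_scorers):
        scores.append(scorer(logspec_frames[:, lo:hi]))   # per-band score
    return np.mean(scores)                                # decision-stage recombination
```
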

A0296.pdf



AUTOMATIC GENDER IDENTIFICATION UNDER ADVERSE CONDITIONS

Authors: Stefan Slomka and Sridha Sridharan

Speech Research Laboratory, Signal Processing Research Center Queensland University of Technology, GPO Box 2434, Brisbane, 4001, Australia E-mail: slomka@markov.eese.qut.edu.au and s.sridharan@qut.edu.au

Volume 5 pages 2307 - 2310

ABSTRACT

This paper evaluates 63 Automatic Gender Identification (AGI) systems on text-independent clean speech segments, coded speech, and speech segments affected by reverberation. The AGI systems combine a Linear Classifier (LC), whose inputs come from two average-pitch detection methods, with paired Gaussian Mixture Models trained on mel-cepstral, autocorrelation, reflection and log-area-ratio parameterised speech data. An AGI system is built which handles the LPC10, CELP and GSM coders with no significant loss in accuracy and reduces the impact of even severe reverberation by exposing the training data of the LC to a different room response. Using speech segments with an average duration of 890 ms (after silence removal), the best AGI system had an accuracy of 98.5% averaged over all clean and adverse conditions.
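
A hedged sketch of the fusion described above (the feature layout, weights and sign convention are illustrative assumptions): the linear classifier combines the average-pitch estimates with the difference of the paired GMM log-likelihoods.

```python
import numpy as np

def gender_decision(pitch_estimates, male_loglik, female_loglik, weights, bias):
    """Linear classifier over two average-pitch estimates and the paired-GMM
    score difference; a positive output is read as 'male' (illustrative only)."""
    x = np.concatenate([np.atleast_1d(pitch_estimates),
                        [male_loglik - female_loglik]])
    return 'male' if float(np.dot(weights, x) + bias) > 0 else 'female'
```
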
A0307.pdf 



Acoustic Features and Perceptive Processes in the Identification of Familiar Voices

Authors: Yizhar Lavner*, Isak Gath* and Judith Rosenhouse**

* Dept. of Biomedical Engineering and ** Dept. of General studies, Technion, Israel Institute of Technology, Haifa, Israel

Volume 5 pages 2311 - 2314

ABSTRACT

The present study examines the relative importance of various acoustic features as cues for familiar-speaker identification. The study also attempts to examine the validity of the prototype model as the key to human speaker recognition. To this end, 20 speakers were recorded. Their voices were modified using an analysis-synthesis system, which enabled analysis and modification of the glottal waveform, the pitch, and the formants. A group of 30 listeners had to identify the speakers in an open-set experiment. The results suggest that, on average, the contribution of the vocal tract features is more important than that of the glottal source features. Examination of individual speakers reveals that changes to identical features affect the identification of different speakers differently. This finding suggests that for each speaker a different group of acoustic features serves as a cue to vocal identity and, along with other predictions that were found to be valid, supports the adequacy of the prototype model.
A0345.pdf 



On the use of Acoustic Segmentation in Speaker Identification

Authors: L. Rodríguez-Liñares and C. García-Mateo

E.T.S.E. Telecomunicación Dept. de Tecnoloxías das Comunicacións. Universidade de Vigo, 36200-VIGO (Pontevedra), Spain. Tel. +34 86 812664, FAX: +34 86 812116, E-mail: leandro@tsc.uvigo.es

Volume 5 pages 2315 - 2318

ABSTRACT

In this paper, we present a novel architecture for a speaker recognition system over the telephone. The proposed system introduces acoustic information into an HMM-based recognizer by using a phonetic classifier during the training phase. Three broad phonetic classes are defined: voiced frames, unvoiced frames and transitions. We design speaker templates both by connecting the outputs of the single-state HMMs in parallel and by combining the single-state HMMs into a four-state HMM after estimating the transition probabilities. The results show that this architecture performs better than architectures without phonetic classification.
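
The "parallel connection" of the class models can be sketched roughly as follows (an illustration under the assumption that each single-state model reduces to a per-class log-density; the actual system uses HMMs): each frame is scored by the model of its broad phonetic class and the log-likelihoods are summed.

```python
import numpy as np

def parallel_class_score(frames, frame_classes, class_models):
    """frames: (T, D) features; frame_classes: (T,) array of labels in
    {'voiced', 'unvoiced', 'transition'}; class_models: dict mapping each
    label to a log-density function over frames (illustrative sketch)."""
    total = 0.0
    for label, logpdf in class_models.items():
        mask = frame_classes == label
        if np.any(mask):
            total += float(np.sum(logpdf(frames[mask])))
    return total
```
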
A0395.pdf 



SPEAKER RECOGNITION BY HUMANS AND MACHINES

Authors: Herman J. M. Steeneken and David A. van Leeuwen

TNO Human Factors Research Institute, Soesterberg, The Netherlands Tel. +31 346 356269, FAX +31 346 353977, E-mail: steeneken@tm.tno.nl

Volume 5 pages 2319 - 2322

ABSTRACT

Speaker recognition by human listeners and by an automatic system was compared. Eight male and eight female speakers were involved. The effect of speech quality was also investigated: wide band, telephone band, and two conditions with noise (SNR +6 dB and 0 dB). For this purpose, noise samples were used with a spectrum shaped according to the long-term speech spectrum. The automatic speaker recognition was based on an algorithm that describes the signal by its covariance in the spectral domain. It was found that for both methods the male speakers are recognized slightly better. In the wide-band condition, one to two words are sufficient for correct subjective recognition; automatic recognition requires a slightly longer utterance.

A0413.pdf



Foreign Speaker Accent Classification using Phoneme-Dependent Accent Discrimination Models and Comparisons with Human Perception Benchmarks

Authors: Karsten Kumpf and Robin W. King (1)

Speech Technology Research Group Department of Electrical Engineering, University of Sydney, NSW 2006, Australia (1) Faculty of Information Technology, University of South Australia, SA 5095, Australia Email: karsten@speech.usyd.edu.au, robin.king@unisa.edu.au

Volume 5 pages 2323 - 2326

ABSTRACT

This paper reports on the development of a foreign speaker accent classification system based on phoneme-class-specific accent discrimination models. This new approach to automatic accent classification allows fast and reliable prediction of speaker accent for continuous speech by exploiting accent-specific information at the phoneme level. The system was trained and evaluated on a corpus representing three speaker groups with native Australian English (AuE), Lebanese Arabic (LA) and South Vietnamese (SV) accents. The speaker accent classification rates achieved by our system come close to the benchmarks set by human listeners.
A0470.pdf 



A COMPARISON OF HUMAN AND MACHINE IN SPEAKER RECOGNITION

Authors: Li Liu, Jialong He, and Günther Palm

Abteilung Neuroinformatik, University of Ulm, Germany, li@neuro.informatik.uni-ulm.de

Volume 5 pages 2327 - 2330

ABSTRACT

Speaker recognition experiments were conducted with the publicly available YOHO database to compare the performance of human listeners and computers. Two types of listening experiments were performed: the first is a forced-choice speaker discrimination test, which corresponds to the task of speaker identification; the second is a same-different judgement, which is similar to the task of speaker verification. It is shown that human listeners perform well on the same-different judgement task, but their error rate for speaker discrimination is relatively large. Moreover, human listeners are more robust to session variability, whereas the machine's performance degrades considerably when the reference and test utterances come from different recording sessions.
A0489.pdf 



EVALUATION OF SECOND LANGUAGE LEARNERS' PRONUNCIATION USING HIDDEN MARKOV MODELS

Authors: Simo M.A. Goddijn (1) and Guus de Krom (2)

1 Forensic Science Laboratory, Rijswijk 2 Computer and Humanities Department / Utrecht Institute of Linguistics-OTS University of Utrecht, Trans 10, 3512 JK Utrecht, the Netherlands Tel: + 31 30 2536-59, Fax: + 31 30 2536000, E-mail: Guus.deKrom@let.ruu.nl

Volume 5 pages 2331 - 2334

ABSTRACT

In this study, Hidden Markov Models (HMMs) were used to evaluate pronunciation. Native and non-native speakers were asked to pronounce ten Dutch words. Each word was subsequently evaluated by an expert listener. Her main task was to decide whether a word was spoken by a native or a non-native speaker. For each word type, two versions of prototype HMMs were defined: one to be trained on tokens produced by a single native speaker, and another to be trained on tokens produced by a group of native speakers. For testing the different types of HMM, forced recognition was performed using native and non-native judged tokens. We expected that recognition with multi-speaker HMMs would allow a more effective discrimination between native and non-native tokens than recognition with single-speaker models. A comparison of Equal Error Rates partly confirmed this hypothesis.

A0501.pdf



Delta Vector Taylor Series Environment Compensation for Speaker Recognition

Authors: Brian Eberman and Pedro J. Moreno

Digital Equipment Corporation, Cambridge Research Laboratory. Email: bse@crl.dec.com, pjm@crl.dec.com

Volume 5 pages 2335 - 2338

ABSTRACT

The performance of speaker recognition algorithms drops significantly when the testing and training acoustic environments differ. This decrease is caused by the mismatch between the statistics representing the speaker and those of the testing acoustic data. This paper reports our preliminary results on the application of a novel environmental compensation algorithm to the problems of speaker recognition and identification. The new technique, called the Delta Vector Taylor Series (DVTS) approach, improves performance at signal-to-noise ratios below 20 dB. The algorithm imposes a model of how the environment modifies the speaker statistics and uses Expectation-Maximization (EM) to solve a joint maximum-likelihood formulation of the speaker recognition problem over both the speakers and the environment. We report experimental results on a subset of the TIMIT and NTIMIT databases.
A0572.pdf 



Wavelet-Like Regression Features in the Cepstral Domain for Speaker Recognition

Authors: Jonathan Hume

Department of Electrical & Electronic Engineering, University of Wales Swansea, SWANSEA, SA2 8PP, UK. email: J.Hume@swansea.ac.uk

Volume 5 pages 2339 - 2342

ABSTRACT

This paper investigates the effect of using multiple time intervals for the calculation of regression coefficients. The technique we use is referred to as Wavelet-Like Regression (WLR). Using this approach we have found that the underlying time series in the cepstral domain differs slightly depending on the index of the series, and that by employing a technique that accounts for this, such as WLR, we can achieve an incremental improvement in recognition performance at negligible extra cost.
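
The multi-interval idea can be sketched with the familiar regression (delta) coefficient formula computed over several window widths and stacked (a hedged sketch; the spans and the stacking are assumptions, not the paper's exact WLR definition).

```python
import numpy as np

def regression_coeffs(cepstra, K):
    """Standard regression (delta) coefficients over a +/- K frame window:
    sum_k k*(c[t+k] - c[t-k]) / (2 * sum_k k^2), with edge padding."""
    T = cepstra.shape[0]
    padded = np.pad(cepstra, ((K, K), (0, 0)), mode='edge')
    num = sum(k * (padded[K + k:K + k + T] - padded[K - k:K - k + T])
              for k in range(1, K + 1))
    return num / (2.0 * sum(k * k for k in range(1, K + 1)))

def multi_span_regression(cepstra, spans=(1, 2, 4)):
    """Regression features computed over several time spans and stacked."""
    return np.hstack([regression_coeffs(cepstra, K) for K in spans])
```
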

A0929.pdf




MINIMUM CLASSIFICATION ERROR LINEAR REGRESSION (MCELR) FOR SPEAKER ADAPTATION USING HMM WITH TREND FUNCTIONS

Authors: Rathinavelu Chengalvarayan

Currently at: Speech Processing Group, Bell Labs Lucent Technologies, Naperville, IL 60566, USA Tel: (630) 224 6398, Fax: (630) 979 5915 Email: rathi@lucent.com

Volume 5 pages 2343 - 2346

ABSTRACT

In this paper, we report our recent work on applying a combined MLLR and MCE approach to estimating the time-varying polynomial Gaussian mean functions in the trended HMM. We call this integrated approach minimum classification error linear regression (MCELR). The transformation matrices associated with each polynomial coefficient are calculated to minimize the recognition error on the adaptation data, and the estimation procedure is developed using a gradient descent algorithm. A speech recognizer based on these results is evaluated in speaker adaptation experiments on the TI46 corpora. Results show that the trended HMM always outperforms the standard HMM and that adaptation of the linear regression coefficients is always better when fewer than three adaptation tokens are used.

A0982.pdf



A CONTINUOUS HMM TEXT-INDEPENDENT SPEAKER RECOGNITION SYSTEM BASED ON VOWEL SPOTTING

Authors: Nikos Fakotakis*, Kallirroi Georgila*, Anastasios Tsopanoglou**

* Wire Communications Laboratory, Electrical and Computer Engineering Dept., University of Patras, 26110 Rion, Patras, Greece Tel: +30 61 997336, Fax:+30 61 991855, e-mail: fakotaki@wcl.ee.upatras.gr, rgeorgil@wcl.ee.upatras.gr ** KNOWLEDGE S.A., Human Machine Communication Dept., N.E.O. Patron-Athinon 37, 264 41 Patras, Greece Tel: +30 61 452820, Fax:+30 61 453819, e-mail:KNOWLEDGE@Patra.hol.gr

Volume 5 pages 2347 - 2350

ABSTRACT

This paper presents a text-independent speaker recognition system based on vowel spotting and Continuous Mixture Hidden Markov Models. The same modeling technique is applied both to vowel spotting and to the speaker identification/verification procedures. The system is evaluated on two speech databases, TIMIT and NTIMIT, with high accuracy rates. Closed-set identification accuracy on the TIMIT and NTIMIT databases is 98.09% and 59.32%, respectively. For the verification experiments, accuracies of 98.28% on TIMIT and 83.04% on NTIMIT are obtained. The nearly real-time response of the classification procedure, the low memory requirements and the small amount of training and testing data required are additional advantages of the proposed speaker recognition system.

A1180.pdf



ON THE INDEPENDENCE OF DIGITS IN CONNECTED DIGIT STRINGS

Authors: J.W. Koolwaaij and L. Boves

Department of Language and Speech, Nijmegen University P.O. Box 9103, 6500 HD Nijmegen, the Netherlands E-mail: koolwaaij,boves@let.kun.nl

Volume 5 pages 2351 - 2354

ABSTRACT

A frequently used assumption in speaker verification is that two speech segments (phonemes, subwords, words) are independent, so that the log-likelihood of a test utterance is simply the sum of the log-likelihoods of the speech segments in that utterance. This paper reports on cases in which this observation-independence assumption appears to be violated, namely test utterances that call a given speech model more than once. For example, a PIN code containing a repeated digit performs worse in verification than a PIN code consisting of four different digits. The results illustrate that violating the independence assumption too strongly can increase the EER even as more information (in the form of digits) is added to the test utterance.
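
The independence assumption under scrutiny simply states that the utterance score is the sum of per-segment scores; a one-function sketch (names are illustrative):

```python
import numpy as np

def utterance_log_likelihood(segment_log_likelihoods):
    """Under the observation-independence assumption, the utterance
    log-likelihood is the plain sum of its segment (digit) log-likelihoods.
    The paper shows this breaks down most when the same digit model is
    called more than once within one utterance (e.g. a PIN like 1-2-3-3)."""
    return float(np.sum(segment_log_likelihoods))
```
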

A1204.pdf




A NEW PROCEDURE FOR CLASSIFYING SPEAKERS IN SPEAKER VERIFICATION SYSTEMS

Authors: J.W. Koolwaaij and L. Boves

Department of Language and Speech, Nijmegen University P.O. Box 9103, 6500 HD Nijmegen, the Netherlands E-mail: koolwaaij,boves@let.kun.nl

Volume 5 pages 2355 - 2358

ABSTRACT

In this paper we propose a new measure for classifying speakers with respect to their behaviour in speaker recognition systems. Taking the proposal made by EAGLES as a point of departure, we show that it fails to yield results that are consistent between closely related speaker recognition methods and between different amounts of speech available for the recognition task. We show that measures based on straightforward confusion matrices, which take only the 1-best classification into account, cannot result in consistent classifications. As an alternative we propose a measure based on n-best scores in a speaker identification paradigm, and show that it yields more consistent results.
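
The exact measure is defined in the paper; the sketch below only illustrates the n-best ingredient it relies on (the names and the aggregation by averaging are assumptions): for every test, the rank of the true speaker in the identification list is recorded and then summarised per speaker.

```python
import numpy as np

def n_best_rank_profile(score_matrix, true_speakers, n_best=10):
    """score_matrix: (num_tests, num_speakers) log-likelihoods;
    true_speakers: (num_tests,) true speaker indices. Returns the mean
    (clipped) rank of each speaker over their own tests."""
    order = np.argsort(score_matrix, axis=1)[:, ::-1]       # best speaker first
    true_speakers = np.asarray(true_speakers)
    ranks = np.array([int(np.where(order[i] == s)[0][0]) + 1
                      for i, s in enumerate(true_speakers)])
    ranks = np.minimum(ranks, n_best)
    return {int(s): float(ranks[true_speakers == s].mean())
            for s in np.unique(true_speakers)}
```
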

A1205.pdf



SOUND CHANNEL VIDEO INDEXING

Authors: Claude Montacié and Marie-José Caraty

LIP6 - Université Pierre et Marie Curie - CNRS 4, place Jussieu - 75252 Paris Cedex 5 - France Tel. (33/0) 1 44 27 62 81, FAX (33/0) 1 44 27 70 00, E-mail: montacie@laforia.ibp.fr

Volume 5 pages 2359 - 2362

ABSTRACT

In this paper we present preliminary results of using speaker recognition and speech recognition techniques, designed at LIP6, to index the audio data of video movies. We assume that only one person is speaking at any given time. In a first approach, we address unsupervised dialogue indexing using speaker recognition techniques. For this purpose, we develop silence/noise/music/speech detection algorithms in order to cut the audio data into segments that we expect to be homogeneous in terms of speaker identity. In a second approach, we develop a supervised audio data indexing method that makes use of the movie script.
A1258.pdf 



CDHMM SPEAKER RECOGNITION BY MEANS OF FREQUENCY FILTERING OF FILTER-BANK ENERGIES

Authors: J. Hernando and C. Nadeu

Universitat Politècnica de Catalunya Barcelona, Spain javier@gps.tsc.upc.es

Volume 5 pages 2363 - 2366

ABSTRACT

Recently, the spectral parameters obtained for every speech frame by filtering the frequency sequence of mel-scaled filter-bank energies with a simple first-order high-pass FIR filter have proved to be an efficient speech representation in terms of both speech recognition rate and computational load. In this paper, we apply the same technique to speaker recognition. Frequency filtering approximately equalizes the cepstrum variance, enhancing the oscillations of the spectral envelope that are most effective for discriminating between speakers. In this way, even better speaker identification results than with conventional mel-cepstrum were observed with continuous-observation Gaussian-density HMMs, especially in noisy conditions.
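
A minimal sketch of the frequency-filtering step (the filter coefficient below is an assumption; the paper specifies the actual first-order high-pass filter used): the sequence of log filter-bank energies of each frame is filtered along the band index rather than along time.

```python
import numpy as np

def frequency_filter(log_fbank, alpha=1.0):
    """Filter the frequency sequence of log filter-bank energies with a
    first-order high-pass FIR filter y[k] = e[k] - alpha * e[k-1], applied
    along the band axis of a (num_frames, num_bands) array."""
    shifted = np.concatenate([np.zeros((log_fbank.shape[0], 1)),
                              log_fbank[:, :-1]], axis=1)
    return log_fbank - alpha * shifted
```
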
A1360.pdf 
