Speaker Recognition II

Chair: S. Parthasarathy, AT&T Labs, USA



Magnitude-Only Estimation of Handset Nonlinearity with Application to Speaker Recognition

Authors:

Thomas F Quatieri, MIT (U.S.A.)
Douglas A. Reynolds, MIT (U.S.A.)
Gerald C O'Leary, MIT (U.S.A.)

Volume 2, Page 745, Paper number 1087

Abstract:

A method is described for estimating telephone handset nonlinearity by matching the spectral magnitude of the distorted signal to the output of a nonlinear channel model, driven by an undistorted reference. This "magnitude-only" representation allows the model to directly match unwanted speech formants that arise over nonlinear channels and that are a potential source of degradation in speaker and speech recognition algorithms. As such, the method is particularly suited to algorithms that use only spectral magnitude information. The distortion model consists of a memoryless polynomial nonlinearity sandwiched between two finite-length linear filters. Minimization of a mean-squared spectral magnitude error, with respect to model parameters, relies on iterative estimation via a gradient descent technique, using a Jacobian in the iterative correction term with gradients calculated by finite-element approximation. Initial work has demonstrated the algorithm's usefulness in speaker recognition over telephone channels by reducing mismatch between high- and low-quality handset conditions.
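The estimation loop the abstract describes — gradient descent on a mean-squared spectral-magnitude error, with gradients obtained by finite differences — can be sketched roughly as follows. This is a minimal illustration that fits only the memoryless polynomial (the two linear filters of the full model are omitted), and the FFT size, learning rate, step count, and perturbation size are all assumptions, not values from the paper.

```python
import numpy as np

def spectral_mag_error(params, ref, target_mag, n_fft=256):
    """Mean-squared error between the spectral magnitude of the modelled
    channel output and the target (distorted) spectral magnitude."""
    # Memoryless polynomial nonlinearity driven by the undistorted reference.
    out = sum(c * ref ** (k + 1) for k, c in enumerate(params))
    mag = np.abs(np.fft.rfft(out, n_fft))
    return np.mean((mag - target_mag) ** 2)

def fit_nonlinearity(ref, target_mag, order=2, lr=1e-3, steps=500, eps=1e-5):
    """Gradient descent on the polynomial coefficients, with each partial
    derivative approximated by a forward finite difference."""
    params = np.zeros(order)
    params[0] = 1.0  # start from the identity (undistorted) channel
    for _ in range(steps):
        base = spectral_mag_error(params, ref, target_mag)
        grad = np.zeros_like(params)
        for i in range(len(params)):
            p = params.copy()
            p[i] += eps
            grad[i] = (spectral_mag_error(p, ref, target_mag) - base) / eps
        params -= lr * grad
    return params
```

With a synthetic quadratic distortion (y = x + 0.3x^2), the fitted quadratic coefficient recovers the true value closely.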

ic981087.pdf (From Postscript)




Non-Parametric Estimation and Correction of Non-Linear Distortion in Speech Systems

Authors:

Rajesh Balchandran, Rutgers University (U.S.A.)
Richard J Mammone, Rutgers University (U.S.A.)

Volume 2, Page 749, Paper number 1400

Abstract:

The performance of speech systems such as speaker recognition degrades drastically when non-linear distortion causes a mismatch between training and testing conditions. This paper describes a technique to estimate and correct such non-linear distortion in speech. The focus is on constrained restoration of degraded speech; that is, distortion in the test speech is undone relative to the training speech. Restoration is a two-step process: estimation followed by inversion. The non-linearity is estimated in the form of a look-up table by a process of statistical matching against a reference speech template. This statistical matching technique provides a very good estimate of the true non-linear characteristic, and the process is robust, computationally efficient, and universally applicable. Speaker-ID experiments using artificially corrupted test speech showed significant improvement in performance after the test speech was "cleaned" using this technique. The restoration process itself does not introduce appreciable distortion.
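The estimate-then-invert scheme can be illustrated with CDF matching, one standard way to build such a look-up table for a monotonic nonlinearity. For clarity this sketch matches the distorted signal against the clean signal itself rather than against a separate reference template as in the paper, and the bin count is an arbitrary assumption.

```python
import numpy as np

def estimate_nonlinearity_lut(reference, distorted, n_bins=64):
    """Estimate a monotonic nonlinearity as a look-up table by statistical
    (CDF) matching: each amplitude level of the distorted signal is paired
    with the reference amplitude at the same cumulative-probability rank."""
    probs = np.linspace(0.0, 1.0, n_bins)
    levels = np.quantile(distorted, probs)   # distorted-domain entries
    targets = np.quantile(reference, probs)  # matching clean-domain values
    return levels, targets

def invert_distortion(signal, levels, targets):
    """Undo the distortion by interpolating through the look-up table."""
    return np.interp(signal, levels, targets)
```

Applied to speech passed through a saturating nonlinearity such as tanh, the restored signal is much closer to the clean one than the distorted input was.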

ic981400.pdf (Scanned)




A Distance Measure between Collections of Distributions and its Application to Speaker Recognition

Authors:

Homayoon S.M Beigi, IBM Research (U.S.A.)
Stephane H Maes, IBM Research (U.S.A.)
Jeffrey S Sorensen, IBM Research (U.S.A.)

Volume 2, Page 753, Paper number 2520

Abstract:

This paper presents a distance measure for evaluating the closeness of two sets of distributions. Many solutions exist in the literature for computing the distance between two individual distributions. To cluster speakers using pre-computed models of their speech, however, a distance is needed between the models themselves, which are normally built from a collection of distributions such as Gaussians. The definition of this distance measure creates many possibilities for speaker verification, speaker adaptation, speaker segmentation and many other related applications. A distance measure for evaluating the closeness of two collections of distributions is presented, together with several applications and results using this measure.
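The paper's exact measure is not reproduced in this abstract; one plausible instantiation of a distance between weighted collections of Gaussians, sketched below, matches each component to its nearest counterpart in the other collection by KL divergence and symmetrises the weighted sum. All of this is an illustrative assumption, not the authors' definition.

```python
import numpy as np

def kl_diag(m1, v1, m2, v2):
    """KL divergence between two diagonal-covariance Gaussians."""
    return 0.5 * np.sum(np.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

def collection_distance(set_a, set_b):
    """Distance between two weighted collections of diagonal Gaussians.
    Each set is a list of (weight, mean, variance) tuples. Every component
    is matched to its nearest counterpart in the other set; the matched
    divergences are weight-averaged and the result symmetrised."""
    def directed(src, dst):
        return sum(w * min(kl_diag(m, v, m2, v2) for _, m2, v2 in dst)
                   for w, m, v in src)
    return 0.5 * (directed(set_a, set_b) + directed(set_b, set_a))
```

The measure is zero for identical collections and grows as component means or variances diverge.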

ic982520.pdf (From Postscript)




Clustering Speakers by their Voices

Authors:

Alex Solomonoff, GTE/BBN (U.S.A.)
Angela Mielke, GTE/BBN (U.S.A.)
Michael Schmidt, GTE/BBN (U.S.A.)
Herbert Gish, GTE/BBN (U.S.A.)

Volume 2, Page 757, Paper number 2122

Abstract:

The problem of clustering speakers by their voices is addressed. With the mushrooming of available speech data, from television broadcasts to voice mail, automatic systems for archive retrieval, organization and labeling by speaker are necessary. Clustering conversations by speaker is a solution to all three of these tasks. Another application for speaker clustering is to group utterances together for speaker adaptation in speech recognition. Metrics based on the purity and completeness of clusters are introduced. Our approach to speaker clustering is then described, and finally experimental results on a subset of the Switchboard corpus are presented.
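Purity- and completeness-style clustering metrics can be computed as in the sketch below. These are common simple variants (majority-label purity, and the fraction of each speaker's utterances captured by that speaker's single best cluster), not necessarily the exact definitions used in the paper.

```python
from collections import Counter

def cluster_purity(clusters):
    """clusters: list of clusters, each a list of true speaker labels.
    Purity: fraction of utterances whose label matches the majority
    label of their cluster."""
    total = sum(len(c) for c in clusters)
    return sum(Counter(c).most_common(1)[0][1] for c in clusters) / total

def cluster_completeness(clusters):
    """Fraction of each speaker's utterances that fall in that speaker's
    single best cluster, averaged over all utterances."""
    per_speaker, best = Counter(), Counter()
    for c in clusters:
        for label, n in Counter(c).items():
            per_speaker[label] += n
            best[label] = max(best[label], n)
    return sum(best.values()) / sum(per_speaker.values())
```

Both metrics equal 1.0 only when every cluster is pure and every speaker's utterances sit in one cluster.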

ic982122.pdf (Scanned)




GMM Based Speaker Identification Using Training-Time-Dependent Number of Mixtures

Authors:

Chakib Tadj, Ecole de Technologie Superieure (Canada)
Pierre Dumouchel, Centre de Recherche Informatique de Montreal (Canada)
Pierre Ouellet, Ecole de Technologie Superieure (Canada)

Volume 2, Page 761, Paper number 1193

Abstract:

In this paper, we study the performance of our standard GMM speaker identification system when only a limited amount of training data is available. We explore the use of a different number of mixture components for different speakers/models. Two approaches are presented: (a) a nonlinear transformation from speech duration to number of mixtures is proposed in order to set the appropriate number of model mixtures for each speaker according to the available training data; (b) from exhaustive experiments, an appropriate linear transformation is deduced. The resulting transformation offers several advantages: (a) each speaker is well modelled; (b) performance is improved by more than 6% on the SPIDRE corpus; and (c) the number of mixtures is reduced, leading to a faster system response.
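A duration-to-mixture-count mapping of the kind described could look like the sketch below. The logarithmic form, the base and cap values, and the 10-second scale are all assumptions for illustration only; the paper derives its own transformations from experiments.

```python
import math

def num_mixtures(train_seconds, base=4, cap=64):
    """Illustrative nonlinear (logarithmic) mapping from available
    training-data duration to GMM mixture count: more data supports more
    mixtures, but with diminishing returns and a hard upper bound."""
    if train_seconds <= 0:
        return base
    n = int(base * math.log2(1.0 + train_seconds / 10.0))
    return max(base, min(cap, n))
```

The mapping is monotone non-decreasing and bounded, so sparsely trained speakers get small, well-estimated models while richly trained speakers get larger ones.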

ic981193.pdf (From Postscript)




Frame Pruning for Speaker Recognition

Authors:

Laurent Besacier, LIA - Avignon (France)
Jean-Francois Bonastre, LIA - Avignon (France)

Volume 2, Page 765, Paper number 1381

Abstract:

In this paper, we propose a frame selection procedure for text-independent speaker identification. Instead of averaging the frame likelihoods along the whole test utterance, some of them are rejected (pruning) and the final score is computed from a limited number of frames. This pruning stage requires a prior frame-level likelihood normalization in order to make comparisons between frames meaningful. The normalization procedure alone leads to a significant performance enhancement. As far as pruning is concerned, the optimal number of frames to prune is learned on a tuning data set for normal and telephone speech. Validation of the pruning procedure on 567 speakers leads to a 27% identification rate improvement on TIMIT, and to 17% on NTIMIT.
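The normalize-then-prune idea can be sketched as below. The log-softmax normalization across speakers and the per-speaker pruning of the worst frames are one plausible reading of the procedure, and the pruning fraction is an assumption; the paper tunes the number of pruned frames on held-out data.

```python
import numpy as np

def prune_and_score(frame_loglikes, prune_fraction=0.3):
    """Score speakers from per-frame log-likelihoods with frame pruning.

    frame_loglikes: array of shape (n_frames, n_speakers).
    Step 1 normalises each frame across speakers (log-softmax) so frame
    scores become comparable; step 2 drops the worst-scoring fraction of
    frames per speaker and averages the survivors."""
    x = np.asarray(frame_loglikes, dtype=float)
    norm = x - np.logaddexp.reduce(x, axis=1, keepdims=True)
    n_keep = max(1, int(round(norm.shape[0] * (1.0 - prune_fraction))))
    return np.sort(norm, axis=0)[-n_keep:].mean(axis=0)  # argmax identifies
```

A few outlier frames (e.g. from channel glitches) can flip a plain likelihood average toward the wrong speaker; pruning them restores the correct decision.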

ic981381.pdf (From Postscript)




Feature Selection for a DTW-based Speaker Verification System

Authors:

Medha Pandit, University of Surrey (U.K.)
Josef Kittler, University of Surrey (U.K.)

Volume 2, Page 769, Paper number 1995

Abstract:

Speaker verification systems, in general, require 20 to 30 features as input for satisfactory verification. We show that this feature set can be optimised by choosing an appropriate subset of the input features. This paper proposes a technique for optimising the feature set of a Dynamic Time Warping based text-dependent speaker verification system in order to improve the false acceptance rate. The optimisation technique is based on the l-r (plus-l, take-away-r) algorithm. The proposed scheme is applied to cepstrum coefficients and their first-order orthogonal polynomial coefficients. Experiments are conducted on two databases, one French and one Spanish. The results indicate that with the optimised feature set the performance of the system may improve but is never degraded. Moreover, the speed of verification is significantly increased.
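The plus-l, take-away-r search the abstract refers to is a standard sequential selection scheme; a generic sketch follows. The `score` callable standing in for verification performance, and the parameter defaults, are assumptions for illustration.

```python
def plus_l_minus_r(features, score, l=2, r=1, target=5):
    """Plus-l-minus-r sequential feature selection: repeatedly add the l
    features that most improve score(subset), then drop the r features
    whose removal hurts the score least, until `target` features remain
    (assumes l > r so the search makes net forward progress)."""
    selected, remaining = [], list(features)
    while len(selected) < target:
        for _ in range(l):  # forward (plus-l) steps
            if not remaining:
                break
            best = max(remaining, key=lambda f: score(selected + [f]))
            selected.append(best)
            remaining.remove(best)
        for _ in range(r):  # backward (minus-r) steps
            if len(selected) <= 1:
                break
            worst = max(selected,
                        key=lambda f: score([g for g in selected if g != f]))
            selected.remove(worst)
            remaining.append(worst)
    return selected
```

The backtracking steps let the search discard a feature that looked good in isolation but is redundant once others are added, which pure forward selection cannot do.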

ic981995.pdf (From Postscript)




AHUMADA: A Large Speech Corpus in Spanish for Speaker Identification and Verification

Authors:

Javier Ortega-Garcia, Univ. Politecnica de Madrid (Spain)
Joaquin Gonzalez-Rodriguez, Univ. Politecnica de Madrid (Spain)
Victoria Marrero-Aguiar, UNED (Spain)
Juan Jesus Diaz-Gomez, Servicio de Policia Judicial (Spain)
Ramon Garcia-Jimenez, Servicio de Policia Judicial (Spain)
Jose Lucena-Molina, Servicio de Policia Judicial (Spain)
Jose Antonio G. Sanchez-Molero, Servicio de Policia Judicial (Spain)

Volume 2, Page 773, Paper number 1558

Abstract:

Speaker recognition is a major task when security applications with speech input are needed. Regarding speaker identity, several factors of variability must be considered: (a) factors concerning inherent intra-speaker variability (manner of speaking, inter-session variability, dialectal variations, emotional condition, etc.) or forced intra-speaker variability (Lombard effect, cocktail-party effect); (b) factors depending on external influences (kind of microphone, channel effects, noise, reverberation, etc.). To cope with all these variability sources, a specific speech database called AHUMADA has been designed and collected for speaker recognition tasks in Castilian Spanish. AHUMADA comprises six different recording sessions, including both in situ and telephone speech recordings. A total of 104 male speakers uttered isolated digits, digit strings, phonologically balanced short utterances, phonologically and syllabically balanced read text, and more than one minute of spontaneous speech, so about 15 GB of speech material is available. Speaker verification results concerning the available variability sources are also presented.

ic981558.pdf (Scanned)




Text-Prompted Speaker Verification Experiments with Phoneme Specific MLPs

Authors:

Dijana Petrovska-Delacretaz, CIRC-EPFL (Switzerland)
Jean Hennebert, CIRC-EPFL (Switzerland)

Volume 2, Page 777, Paper number 2383

Abstract:

The aims of the study described in this paper are (1) to assess the relative speaker-discriminant properties of phonemes and (2) to investigate the importance of temporal frame-to-frame information for speaker modelling, in the framework of a text-prompted speaker verification system using Hidden Markov Models (HMMs) and Multi-Layer Perceptrons (MLPs). It is shown that, under similar experimental conditions, nasals, fricatives and vowels convey more speaker-specific information than plosives and liquids. Regarding the influence of frame-to-frame temporal information, significant improvements are reported from the inclusion of several acoustic frames at the input of the MLPs. Results also tend to show that each phoneme has its own optimal MLP context size giving the best Equal Error Rate (EER).
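Feeding "several acoustic frames" to an MLP amounts to stacking each frame with its temporal neighbours; a minimal sketch of that preprocessing step follows. The edge-padding choice and context size are assumptions, not details from the paper.

```python
import numpy as np

def stack_context(frames, context=2):
    """Concatenate each acoustic frame with `context` neighbours on each
    side (edges padded by repeating the first/last frame), producing the
    widened input vectors fed to a phoneme-specific MLP."""
    padded = np.pad(frames, ((context, context), (0, 0)), mode="edge")
    width = 2 * context + 1
    return np.stack([padded[i:i + width].reshape(-1)
                     for i in range(frames.shape[0])])
```

Varying `context` per phoneme is exactly the knob whose optimum the paper reports differs from phoneme to phoneme.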

ic982383.pdf (From Postscript)




An Efficient Phonotactic-Acoustic System for Language Identification

Authors:

Jiri Navratil, Technical University of Ilmenau (Germany)
Werner Zuehlke, Technical University of Ilmenau (Germany)

Volume 2, Page 781, Paper number 1122

Abstract:

This paper presents a combined two-component system for language identification based on phonotactic and acoustic features. The phonotactic part, consisting of a multilingual phone recognizer with a double bigram-decoding architecture and a phonetic-context mapping, is supported by a second part that models the pronunciation of the recognized phone sequence using Gaussian density models. Both parts are post-processed by a neural-network-based final classifier. Measured on the NIST'95 evaluation set, the described system outperforms state-of-the-art components and, at the same time, requires considerably less computation than implicit phonotactic-acoustic modeling and parallel-recognizer architectures.
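The phonotactic component boils down to scoring the decoded phone string under per-language n-gram models. The sketch below uses a single add-alpha smoothed bigram model per language as a simplified stand-in for the paper's double bigram-decoding architecture; the smoothing scheme and toy alphabet are assumptions.

```python
import math
from collections import Counter

def train_bigram(phone_seqs, alpha=1.0):
    """Train an add-alpha smoothed phone-bigram model; returns a scorer
    that computes the log-likelihood of a new phone sequence."""
    bi, uni, vocab = Counter(), Counter(), set()
    for seq in phone_seqs:
        seq = ["<s>"] + list(seq)
        vocab.update(seq)
        for a, b in zip(seq, seq[1:]):
            bi[(a, b)] += 1
            uni[a] += 1
    v = len(vocab)

    def logprob(seq):
        seq = ["<s>"] + list(seq)
        return sum(math.log((bi[(a, b)] + alpha) / (uni[a] + alpha * v))
                   for a, b in zip(seq, seq[1:]))
    return logprob

def identify(phone_seq, models):
    """Pick the language whose phonotactic model scores the string highest."""
    return max(models, key=lambda lang: models[lang](phone_seq))
```

Even with two toy "languages" distinguished only by their phone-transition statistics, the bigram scores separate test strings correctly.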

ic981122.pdf (From Postscript)
