ICASSP '98
Abstracts - SP23

SP23.1
Magnitude-Only Estimation of Handset Nonlinearity with Application to Speaker Recognition
T. Quatieri,
D. Reynolds,
G. O'Leary (Lincoln Laboratory, MIT, USA)
A method is described for estimating telephone handset nonlinearity by matching the spectral magnitude of the distorted signal to the output of a nonlinear channel model, driven by an undistorted reference. This "magnitude-only" representation allows the model to directly match unwanted speech formants that arise over nonlinear channels and that are a potential source of degradation in speaker and speech recognition algorithms. As such, the method is particularly suited to algorithms that use only spectral magnitude information. The distortion model consists of a memoryless polynomial nonlinearity sandwiched between two finite-length linear filters. Minimization of a mean-squared spectral magnitude error, with respect to model parameters, relies on iterative estimation via a gradient descent technique, using a Jacobian in the iterative correction term with gradients calculated by finite-element approximation. Initial work has demonstrated the algorithm's usefulness in speaker recognition over telephone channels by reducing mismatch between high- and low-quality handset conditions.
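As a rough illustration of the distortion model and error criterion described above, the sketch below builds a filter-polynomial-filter channel, scores it with a mean-squared spectral-magnitude error against the observed distorted signal, and approximates gradients by finite differences. The filter lengths, polynomial order, and naive DFT are illustrative choices, not the authors' implementation.

```python
import math

def fir(x, h):
    """Convolve signal x with FIR taps h (output truncated to len(x))."""
    return [sum(h[k] * x[n - k] for k in range(len(h)) if 0 <= n - k < len(x))
            for n in range(len(x))]

def polynomial(x, coeffs):
    """Memoryless polynomial nonlinearity: y = c0 + c1*x + c2*x^2 + ..."""
    return [sum(c * v**i for i, c in enumerate(coeffs)) for v in x]

def dft_magnitude(x):
    """Naive DFT magnitude spectrum (fine for short illustrative signals)."""
    N = len(x)
    return [abs(sum(x[n] * complex(math.cos(-2 * math.pi * k * n / N),
                                   math.sin(-2 * math.pi * k * n / N))
                    for n in range(N))) for k in range(N)]

def channel(x, h1, poly, h2):
    """Reference -> FIR h1 -> polynomial nonlinearity -> FIR h2."""
    return fir(polynomial(fir(x, h1), poly), h2)

def magnitude_error(reference, distorted, h1, poly, h2):
    """Mean-squared error between the spectral magnitude of the model output
    (driven by the undistorted reference) and that of the distorted signal."""
    m_model = dft_magnitude(channel(reference, h1, poly, h2))
    m_obs = dft_magnitude(distorted)
    return sum((a - b)**2 for a, b in zip(m_model, m_obs)) / len(m_obs)

def finite_diff_grad(f, params, eps=1e-5):
    """Gradient of f with respect to params by forward finite differences,
    standing in for the finite-element gradient approximation in the paper."""
    base = f(params)
    grad = []
    for i in range(len(params)):
        p = list(params)
        p[i] += eps
        grad.append((f(p) - base) / eps)
    return grad
```

An identity channel (unit filters, linear polynomial) gives zero spectral-magnitude error, which is a useful sanity check before running gradient descent on the model parameters.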
SP23.2
Non-Parametric Estimation and Correction of Non-Linear Distortion in Speech Systems
R. Balchandran,
R. Mammone (Rutgers University, USA)
The performance of speech systems such as speaker recognition degrades drastically when there is a mismatch between training and testing conditions caused by non-linear distortion. This paper describes a technique to estimate and correct such non-linear distortion in speech. The focus is on constrained restoration of degraded speech; that is, distortion in the test speech is undone relative to the training speech. Restoration is a two-step process: estimation followed by inversion. The non-linearity is estimated in the form of a look-up table by a process of statistical matching using a reference speech template. This statistical matching technique provides a very good estimate of the true non-linear characteristic, and the process is robust, computationally efficient, and universally applicable. Speaker-ID experiments, using artificially corrupted test speech, showed significant improvement in performance after the test speech was "cleaned" using this technique. The restoration process itself does not introduce appreciable distortion.
SP23.3
A Distance Measure between Collections of Distributions and its Application to Speaker Recognition
H. Beigi,
S. Maes,
J. Sorensen (IBM Research, USA)
This paper presents a distance measure for evaluating the closeness of two sets of distributions. The problem of finding the distance between two individual distributions has been addressed by many solutions in the literature. To cluster speakers using pre-computed models of their speech, a need arises for computing a distance between these models, which are normally built from a collection of distributions such as Gaussians. The definition of this distance measure creates many possibilities for speaker verification, speaker adaptation, speaker segmentation, and many other related applications. Several applications of the measure, together with experimental results, are presented.
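As a generic illustration of measuring the closeness of two collections of distributions, the sketch below combines the closed-form KL divergence between univariate Gaussians with a weighted nearest-component matching and symmetrization. This is one plausible instantiation of the idea, not necessarily the measure defined in the paper.

```python
import math

def kl_gauss(m1, s1, m2, s2):
    """KL divergence between N(m1, s1^2) and N(m2, s2^2), closed form."""
    return math.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2 * s2**2) - 0.5

def set_distance(A, B):
    """A, B: lists of (weight, mean, std) components.
    Weighted average, over components of one set, of the KL divergence to
    the closest component of the other set; symmetrized by averaging."""
    def one_way(X, Y):
        total_w = sum(w for w, _, _ in X)
        return sum(w * min(kl_gauss(m, s, m2, s2) for _, m2, s2 in Y)
                   for w, m, s in X) / total_w
    return 0.5 * (one_way(A, B) + one_way(B, A))
```

The measure is zero when both collections coincide and grows as components drift apart, which is the property needed for speaker clustering, verification, and segmentation decisions.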
SP23.4
Clustering Speakers by their Voices
A. Solomonoff,
A. Mielke,
M. Schmidt,
G. Herbert (GTE/BBN, USA)
The problem of clustering speakers by their voices is addressed. With the mushrooming of available speech data, from television broadcasts to voice mail, automatic systems for archive retrieval, organization, and labeling by speaker are necessary. Clustering conversations by speaker is a solution to all three of these tasks. Another application of speaker clustering is to group utterances together for speaker adaptation in speech recognition. Metrics based on the purity and completeness of clusters are introduced. Our approach to speaker clustering is then described, and finally experimental results on a subset of the Switchboard corpus are presented.
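Purity- and completeness-style clustering metrics can be illustrated along the following lines (the paper's exact definitions may differ): purity asks how dominated each cluster is by a single speaker, completeness how concentrated each speaker's conversations are in a single cluster.

```python
from collections import Counter

def purity(clusters):
    """clusters: list of lists of speaker labels, one list per cluster.
    Fraction of items belonging to their cluster's dominant speaker."""
    n = sum(len(c) for c in clusters)
    return sum(Counter(c).most_common(1)[0][1] for c in clusters) / n

def completeness(clusters):
    """For each speaker, the fraction of their items that landed in their
    single largest within-cluster group, averaged over speakers."""
    per_speaker_total = Counter(label for c in clusters for label in c)
    best = Counter()
    for c in clusters:
        for label, cnt in Counter(c).items():
            best[label] = max(best[label], cnt)
    return (sum(best[s] / per_speaker_total[s] for s in per_speaker_total)
            / len(per_speaker_total))
```

A perfect clustering scores 1.0 on both; mixing a second speaker into a cluster lowers purity, while splitting one speaker across clusters lowers completeness.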
SP23.5
GMM Based Speaker Identification Using Training-Time-Dependent Number of Mixtures
C. Tadj (Ecole de Technologie Superieure, Canada);
P. Dumouchel (Centre de Recherche Informatique de Montreal, Canada);
P. Ouellet (Ecole de Technologie Superieure, Canada)
In this paper, we study the performance of our standard GMM speaker identification system when only a limited amount of training data is available. We explore the use of a different number of mixture components for different speakers/models. Two approaches are presented: (a) a nonlinear transformation of speech duration vs. number of mixtures is proposed in order to correctly set the appropriate number of model mixtures for each speaker according to the available training data; (b) from exhaustive experiments, an appropriate linear transformation is deduced. The resulting transformation offers several advantages: (a) each speaker is well modeled; (b) performance is improved by more than 6% on the SPIDRE corpus; and finally (c) the number of mixtures is reduced, which leads to a faster system response.
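The idea of tying each speaker's mixture count to their training duration can be sketched as a clamped transformation. The slope and bounds below are illustrative placeholders; the paper deduces its transformation from exhaustive experiments.

```python
def mixtures_for_duration(seconds, per_second=0.5, minimum=4, maximum=64):
    """Map training-speech duration to a GMM mixture count: a linear
    transformation clamped to sensible lower and upper bounds, so that
    sparsely trained speakers get small, well-estimated models."""
    return max(minimum, min(maximum, int(round(per_second * seconds))))
```

Speakers with very little data are held at the floor (avoiding poorly estimated components), while the ceiling caps model size, and hence scoring time, for data-rich speakers.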
SP23.6
Frame Pruning for Speaker Recognition
L. Besacier,
J. Bonastre (LIA - Avignon, France)
In this paper, we propose a frame selection procedure for text-independent speaker identification. Instead of averaging the frame likelihoods along the whole test utterance, some frames are rejected (pruning) and the final score is computed from a limited number of frames. This pruning stage requires a prior frame-level likelihood normalization in order to make comparisons between frames meaningful. This normalization procedure alone leads to a significant performance enhancement. As far as pruning is concerned, the optimal number of frames pruned is learned on a tuning data set for normal and telephone speech. Validation of the pruning procedure on 567 speakers leads to a 27% identification rate improvement on TIMIT, and to 17% on NTIMIT.
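The normalize-then-prune scoring described above can be sketched as follows. Subtracting each frame's maximum score across all speaker models is one common frame-level normalization (the paper's exact scheme may differ), and the pruning fraction here is a placeholder for the value tuned on held-out data.

```python
def pruned_score(frame_loglikes, speaker, prune_fraction=0.3):
    """frame_loglikes: list of dicts {speaker_id: log-likelihood}, one per
    frame. Normalize each frame, drop the worst-scoring fraction of frames
    for the target speaker, and average the rest."""
    # Frame-level normalization so scores are comparable across frames.
    normed = [f[speaker] - max(f.values()) for f in frame_loglikes]
    # Keep only the best (1 - prune_fraction) of the frames.
    kept = sorted(normed, reverse=True)
    n_keep = max(1, int(len(kept) * (1 - prune_fraction)))
    return sum(kept[:n_keep]) / n_keep
```

After this normalization a speaker who wins every frame scores exactly 0, so outlier frames (e.g. noise bursts) are the ones that get pruned rather than dominating the utterance average.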
SP23.7
Feature Selection for a DTW-based Speaker Verification System
M. Pandit,
J. Kittler (University of Surrey, UK)
Speaker verification systems, in general, require 20 to 30 features as input for satisfactory verification. We show that this feature set can be optimised by choosing an appropriate feature subset from the input feature set. This paper proposes a technique for optimising the feature set in a Dynamic Time Warping (DTW) based text-dependent speaker verification system, to improve the false acceptance rate. The optimisation technique is based on the l-r (plus-l, take-away-r) algorithm. The proposed scheme is applied to study cepstrum coefficients and their first-order orthogonal polynomial coefficients. Experiments are conducted on two databases: French and Spanish. The results indicate that with the optimised feature set the performance of the system may improve, but it is never degraded. Moreover, the speed of verification is significantly increased.
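The plus-l, take-away-r family of algorithms, to which the l-r technique belongs, can be sketched as follows: repeatedly add the l best single features, then remove the r least useful ones, scoring each candidate subset with a user-supplied criterion (here a stand-in for verification performance; the paper's criterion and schedule may differ).

```python
def plus_l_minus_r(features, score, target_size, l=2, r=1):
    """features: list of feature ids; score(subset) -> float, higher is
    better. Grows a subset by l greedy additions then r greedy removals
    per round until target_size features remain selected."""
    selected = []
    while len(selected) < target_size:
        for _ in range(l):  # plus-l: add the best remaining feature
            candidates = [f for f in features if f not in selected]
            if not candidates:
                break
            selected.append(max(candidates,
                                key=lambda f: score(selected + [f])))
        for _ in range(r):  # minus-r: drop the least useful feature,
            if len(selected) <= 1:  # i.e. whose removal hurts the score least
                break
            least_useful = max(selected,
                               key=lambda f: score([g for g in selected
                                                    if g != f]))
            selected.remove(least_useful)
    return selected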
SP23.8
AHUMADA: A Large Speech Corpus in Spanish for Speaker Identification and Verification
J. Ortega-Garcia,
J. Gonzalez-Rodriguez (Univ. Politecnica de Madrid, Spain);
V. Marrero-Aguiar (UNED, Spain);
J. Diaz-Gomez,
R. Garcia-Jimenez,
J. Lucena-Molina,
J. Sanchez-Molero (Servicio de Policia Judicial, Spain)
Speaker recognition is a major task when security applications with speech input are needed. Regarding speaker identity, several factors of variability must be considered: (a) factors concerning peculiar intra-speaker variability (manner of speaking, inter-session variability, dialectal variations, emotional condition, etc.) or forced intra-speaker variability (Lombard effect, cocktail-party effect); (b) factors depending on external influences (kind of microphone, channel effects, noise, reverberation, etc.). To cope with all these variability sources, a specific speech database called AHUMADA has been designed and collected for speaker recognition tasks in Castilian Spanish. AHUMADA incorporates six different recording sessions, including both in situ and telephone speech recordings. A total of 104 male speakers uttered isolated digits, digit strings, phonologically balanced short utterances, phonologically and syllabically balanced read text, and more than one minute of spontaneous speech, so about 15 GB of speech material is available. Speaker verification results concerning the available variability sources are also presented.
SP23.9
Text-Prompted Speaker Verification Experiments with Phoneme Specific MLPs
D. Petrovska-Delacretaz,
J. Hennebert (CIRC-EPFL, Switzerland)
The aims of the study described in this paper are (1) to assess the relative speaker-discriminant properties of phonemes and (2) to investigate the importance of temporal frame-to-frame information for speaker modelling, in the framework of a text-prompted speaker verification system using Hidden Markov Models (HMMs) and Multi-Layer Perceptrons (MLPs). It is shown that, under similar experimental conditions, nasals, fricatives and vowels convey more speaker-specific information than plosives and liquids. Regarding the influence of frame-to-frame temporal information, significant improvements are reported from the inclusion of several acoustic frames at the input of the MLPs. Results also tend to show that each phoneme has its own optimal MLP context size giving the best Equal Error Rate (EER).
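Feeding several acoustic frames to an MLP input is typically done by stacking each frame with its neighbours into one vector. The sketch below shows one common realization, with edges padded by repetition; the per-phoneme context size c is exactly what the study tunes.

```python
def stack_context(frames, c):
    """frames: list of per-frame feature vectors (lists of floats).
    Returns one stacked vector per frame, concatenating the c left
    neighbours, the frame itself, and the c right neighbours.
    Out-of-range neighbours are replaced by the nearest edge frame."""
    out = []
    for i in range(len(frames)):
        window = []
        for j in range(i - c, i + c + 1):
            window.extend(frames[min(max(j, 0), len(frames) - 1)])
        out.append(window)
    return out
```

With d-dimensional frames and context c, the MLP input grows to (2c + 1) * d, so larger contexts trade richer temporal information against more network parameters per phoneme model.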
SP23.10
An Efficient Phonotactic-Acoustic System for Language Identification
J. Navratil,
W. Zuehlke (Technical University of Ilmenau, Germany)
This paper presents a combined two-component system for language identification based on phonotactic and acoustic features. The phonotactic part, consisting of a multilingual phone recognizer with a double bigram-decoding architecture and a phonetic-context mapping, is supported by a second part that models the pronunciation of the recognized phone sequence using Gaussian density models. Both parts are post-processed by a neural-based final classifier. Measured on the NIST'95 evaluation set, the described system outperforms state-of-the-art components and, at the same time, requires considerably less computation than implicit phonotactic-acoustic modeling and parallel recognizer architectures.
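The core of any phonotactic language-ID component is scoring a decoded phone sequence against per-language n-gram models. The sketch below uses plain add-alpha smoothed bigrams and a max-log-probability decision; the paper's double bigram-decoding architecture and neural final classifier are not reproduced here.

```python
import math

def train_bigrams(sequences, alphabet, alpha=1.0):
    """Add-alpha smoothed bigram log-probabilities estimated from a list
    of phone sequences (each a sequence of symbols from alphabet)."""
    counts = {a: {b: alpha for b in alphabet} for a in alphabet}
    for seq in sequences:
        for a, b in zip(seq, seq[1:]):
            counts[a][b] += 1
    logprobs = {}
    for a, row in counts.items():
        total = sum(row.values())
        logprobs[a] = {b: math.log(c / total) for b, c in row.items()}
    return logprobs

def score_sequence(seq, model):
    """Total bigram log-probability of a phone sequence under one model."""
    return sum(model[a][b] for a, b in zip(seq, seq[1:]))

def identify(seq, models):
    """models: {language: bigram table}. Return the best-scoring language."""
    return max(models, key=lambda lang: score_sequence(seq, models[lang]))
```

Training one bigram table per language from recognized phone streams and comparing sequence scores is the phonotactic baseline that the paper's richer context mapping and acoustic pronunciation models build upon.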