Speaker and Language Recognition 4

On The Convergence Of Gaussian Mixture Models: Improvements Through Vector Quantization

Authors:

James Moody, Queensland University of Technology (Australia)
Stefan Slomka, Queensland University of Technology (Australia)
Jason Pelecanos, Queensland University of Technology (Australia)
Sridha Sridharan, Queensland University of Technology (Australia)

Paper number 667

Abstract:

This paper studies the reliance of a Gaussian Mixture Model (GMM) based closed-set speaker identification system on model convergence and describes methods to improve this convergence. It shows that Vector Quantisation GMMs (VQGMMs) outperform a simple GMM mainly because vector quantisation reduces the complexity of the data seen during training. In addition, it is shown that the VQGMM system is less computationally complex than the traditional GMM, yielding a system which is quicker to train and which gives higher performance. We also investigate four different VQ distance measures which can be used in the training of a VQGMM and compare their respective performances. It is found that the improvements gained by the VQGMM are only marginally dependent on the distance measure.
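
Seeding EM with a vector-quantisation codebook is straightforward to sketch. The following is a minimal illustration, assuming scikit-learn and a Euclidean k-means codebook (the paper compares four VQ distance measures, and this is not the authors' implementation): the centroids initialise the mixture means, which typically speeds and stabilises EM convergence relative to random starts.

```python
# Sketch only: VQ-initialised GMM training for closed-set speaker ID.
# The Euclidean k-means codebook stands in for the paper's VQ stage.
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

def train_vqgmm(features, n_mixtures=32, seed=0):
    """features: (n_frames, n_dims) cepstral vectors for one speaker."""
    # Stage 1: vector quantisation -- cluster the training frames.
    codebook = KMeans(n_clusters=n_mixtures, n_init=5,
                      random_state=seed).fit(features)
    # Stage 2: EM refinement, seeded with the VQ centroids.
    gmm = GaussianMixture(n_components=n_mixtures, covariance_type="diag",
                          means_init=codebook.cluster_centers_,
                          random_state=seed)
    return gmm.fit(features)

def identify(test_features, speaker_models):
    """Closed-set ID: return the speaker whose model scores the test
    utterance with the highest average log-likelihood per frame."""
    return max(speaker_models,
               key=lambda s: speaker_models[s].score(test_features))
```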

SL980667.PDF (From Author) SL980667.PDF (Rasterized)

Modeling Dynamic Prosodic Variation for Speaker Verification

Authors:

Kemal Sönmez, SRI International (USA)
Elizabeth Shriberg, SRI International (USA)
Larry Heck, Nuance Communications (USA)
Mitchel Weintraub, SRI International (USA)

Paper number 920

Abstract:

Statistics of frame-level pitch have recently been used in speaker recognition systems with good results. Although they convey useful long-term information about a speaker's distribution of f0 values, such statistics fail to capture information about the local dynamics in intonation that characterize an individual's speaking style. In this work, we take a first step toward capturing such suprasegmental patterns for automatic speaker verification. Specifically, we model the speaker's f0 movements by fitting a piecewise linear model to the f0 track to obtain a stylized f0 contour. Parameters of the model are then used as statistical features for speaker verification. We report results on the 1998 NIST speaker verification evaluation. Prosody modeling improves the verification performance of a cepstrum-based Gaussian mixture model system (as measured by a task-specific Bayes risk) by 10%.
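
As an illustration of the stylisation step, the sketch below (NumPy only, assuming a 100 Hz frame rate) fits a least-squares line to fixed-length pieces of each voiced stretch of an f0 track and collects (slope, duration) parameters. The paper fits a proper piecewise linear model; the fixed-window segmentation here is a simplification.

```python
import numpy as np

def stylize_f0(f0, frame_rate=100.0, seg_len=20):
    """Crude f0 stylisation: f0 is an array with 0 at unvoiced frames.
    Returns rows of (log-f0 slope in 1/s, segment duration in s)."""
    voiced = f0 > 0
    # Boundaries of contiguous voiced/unvoiced runs.
    edges = np.flatnonzero(np.diff(voiced.astype(int)))
    bounds = np.concatenate(([0], edges + 1, [len(f0)]))
    params = []
    for a, b in zip(bounds[:-1], bounds[1:]):
        if b <= a or not voiced[a]:
            continue
        # Fit a line to each fixed-length piece of the voiced run.
        for s in range(a, b, seg_len):
            e = min(s + seg_len, b)
            if e - s < 3:
                continue
            t = np.arange(s, e) / frame_rate
            slope, _ = np.polyfit(t, np.log(f0[s:e]), 1)
            params.append((slope, (e - s) / frame_rate))
    return np.array(params)
```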

SL980920.PDF (From Author) SL980920.PDF (Rasterized)

Blind Clustering of Speech Utterances Based on Speaker and Language Characteristics

Authors:

Douglas A. Reynolds, MIT Lincoln Laboratory (USA)
Elliot Singer, MIT Lincoln Laboratory (USA)
Beth A. Carlson, MIT Lincoln Laboratory (USA)
Gerald C. O'Leary, MIT Lincoln Laboratory (USA)
Jack J. McLaughlin, MIT Lincoln Laboratory (USA)
Marc A. Zissman, MIT Lincoln Laboratory (USA)

Paper number 610

Abstract:

Classical speaker and language recognition techniques can be applied to the classification of unknown utterances by computing the likelihoods of the utterances given a set of well-trained target models. This paper addresses the problem of grouping unknown utterances when no information is available regarding the speaker or language classes, or even the total number of classes. Approaches to blind message clustering are presented based on conventional hierarchical clustering techniques and on an integrated cluster generation and selection method called the d* algorithm. Results are presented using message sets derived from the Switchboard and Callfriend corpora. Potential applications include automatic indexing of recorded speech corpora by speaker/language tags and automatic or semiautomatic selection of speaker-specific speech utterances for speaker recognition adaptation.
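
As a sketch of the conventional hierarchical route (not the d* algorithm), one can fit a small GMM per utterance, use a symmetrised cross-likelihood as the pairwise distance, and cut an agglomerative tree at a threshold. This assumes NumPy, SciPy and scikit-learn, and all parameter choices are illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import squareform
from sklearn.mixture import GaussianMixture

def cluster_messages(utterances, threshold, n_mix=8):
    """utterances: list of (n_frames, n_dims) feature arrays.
    Returns one cluster label per utterance."""
    models = [GaussianMixture(n_components=n_mix,
                              covariance_type="diag").fit(u)
              for u in utterances]
    n = len(utterances)
    d = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            # Symmetrised cross-likelihood: how much worse each utterance
            # scores under the other's model than under its own.
            dij = (models[i].score(utterances[i]) - models[j].score(utterances[i]) +
                   models[j].score(utterances[j]) - models[i].score(utterances[j]))
            d[i, j] = d[j, i] = max(dij, 0.0)
    # Agglomerative clustering on the condensed distance matrix.
    tree = linkage(squareform(d), method="average")
    return fcluster(tree, t=threshold, criterion="distance")
```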

SL980610.PDF (From Author) SL980610.PDF (Rasterized)

Spoken Language Identification Using The SpeechDat Corpus

Authors:

Diamantino Caseiro, INESC/IST (Portugal)
Isabel M. Trancoso, INESC/IST (Portugal)

Paper number 1093

Abstract:

Current language identification systems vary significantly in their complexity, and the systems that use higher-level linguistic information perform best. That information, however, is hard to collect for each new language. The system presented in this paper is easily extendable to new languages because it uses very little linguistic information: it needs only one language-specific phone recogniser (in our case the Portuguese one) and is trained with speech from each of the other languages. On the SpeechDat-M corpus, which covers six European languages (English, French, German, Italian, Portuguese and Spanish), our system achieved an identification rate of 83.4% on 5-second utterances, an improvement of 5% over our previous version obtained mainly through the use of a neural network classifier. Both the baseline and the full system run in real time.
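
The architecture is essentially PRLM (phone recognition followed by language modelling): decode all speech with the single Portuguese recogniser, train a phonotactic model per language on the decoded training data, and classify a test utterance by maximum likelihood. A minimal sketch with add-alpha-smoothed bigrams follows; the paper's neural network classifier is omitted, and all names are illustrative.

```python
import math
from collections import defaultdict

def train_bigram(decoded_utts, alpha=0.5):
    """decoded_utts: list of phone-label sequences produced by the single
    (e.g. Portuguese) recogniser. Returns a smoothed bigram model."""
    counts = defaultdict(lambda: defaultdict(float))
    vocab = set()
    for phones in decoded_utts:
        vocab.update(phones)
        for a, b in zip(phones, phones[1:]):
            counts[a][b] += 1.0
    model = {}
    for a in vocab:
        total = sum(counts[a].values()) + alpha * len(vocab)
        model[a] = {b: (counts[a][b] + alpha) / total for b in vocab}
    return model

def identify_language(phones, language_models):
    """Pick the language whose phonotactic model best predicts the
    decoded phone sequence of the test utterance."""
    def logprob(model):
        return sum(math.log(model.get(a, {}).get(b, 1e-9))
                   for a, b in zip(phones, phones[1:]))
    return max(language_models, key=lambda lang: logprob(language_models[lang]))
```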

SL981093.PDF (From Author) SL981093.PDF (Rasterized)

Automatic Language Identification with Perceptually Guided Training and Recurrent Neural Networks

Authors:

Jerome Braun, Computer Science Department, University of Massachusetts Lowell (USA)
Haim Levkowitz, Computer Science Department, University of Massachusetts Lowell (USA)

Paper number 405

Abstract:

We present a novel approach to Automatic Language Identification (LID). We propose Perceptually Guided Training (PGT), a novel LID training method that identifies utterance parts which are perceptually significant for language identification and exploits these Perceptually Significant Regions (PSRs) to guide the training process. Our approach uses a Recurrent Neural Network (RNN) as the main mechanism; we argue that, because long-range intra-utterance acoustic context is significant for LID, RNNs are particularly suitable for the task. Our approach does not require phonetic labeling or transcription of the training corpus. LIREN/PGT, the LID system we developed, incorporates our approach. Our LID experiments were conducted on English, German, and Mandarin Chinese, using the OGI-TS corpus.
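
As a sketch of the recurrent component only (assuming PyTorch; dimensions are hypothetical), a GRU can summarise an utterance's long-range acoustic context in its final hidden state, which a linear layer maps to language scores. The PGT procedure for locating PSRs is not reproduced here; in training it would amount to weighting the loss toward perceptually significant regions.

```python
import torch
import torch.nn as nn

class RNNLanguageID(nn.Module):
    """Utterance-level language classifier over frame features."""
    def __init__(self, n_features=13, hidden=64, n_languages=3):
        super().__init__()
        self.rnn = nn.GRU(n_features, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_languages)

    def forward(self, x):           # x: (batch, n_frames, n_features)
        _, h = self.rnn(x)          # final hidden state: (1, batch, hidden)
        return self.out(h[-1])      # unnormalised language scores

# Usage: scores = RNNLanguageID()(torch.randn(4, 300, 13))
# Training would use cross-entropy; PGT would upweight training material
# drawn from regions identified as perceptually significant (PSRs).
```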

SL980405.PDF (From Author) SL980405.PDF (Scanned)

On the Importance of Components of the Modulation Spectrum for Speaker Verification

Authors:

Sarel van Vuuren, Oregon Graduate Institute of Science and Technology (USA)
Hynek Hermansky, Oregon Graduate Institute of Science and Technology (USA)

Paper number 631

Abstract:

We provide an analysis of the relative importance of components of the modulation spectrum for speaker verification. The aim is to remove less relevant components and reduce system sensitivity to acoustic disturbances while improving verification accuracy. Spectral components between about 0.1 Hz and 10 Hz are found to contain the most useful speaker information. We discuss this result in the context of RASTA processing and cepstral mean subtraction. When compared to cepstral mean subtraction that retains components up to 50 Hz, lowpass filtering to 10 Hz with downsampling by 75 percent is found to significantly improve robustness in mismatched conditions. The downsampling also yields large computational savings.
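
The processing described is straightforward to sketch. Assuming SciPy and a 100 Hz cepstral frame rate (both assumptions, not the paper's exact configuration): subtract the per-utterance mean (removing the 0 Hz component, as cepstral mean subtraction does), lowpass each cepstral trajectory at 10 Hz, and keep every fourth frame for 75 percent downsampling.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def filter_modulation_spectrum(cepstra, frame_rate=100.0, cutoff=10.0,
                               keep_every=4):
    """cepstra: (n_frames, n_coeffs). Lowpass-filters each cepstral
    trajectory at `cutoff` Hz, then downsamples by keeping every
    `keep_every`-th frame (a 75% reduction for keep_every=4)."""
    x = cepstra - cepstra.mean(axis=0)            # remove 0 Hz (CMS-like)
    b, a = butter(4, cutoff / (frame_rate / 2.0), btype="low")
    filtered = filtfilt(b, a, x, axis=0)          # zero-phase lowpass
    return filtered[::keep_every]
```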

SL980631.PDF (From Author) SL980631.PDF (Rasterized)
