Speaker and Language Recognition 3


Robust Speaker Verification Insensitive to Session-dependent Utterance Variation and Handset-dependent Distortion

Authors:

Tomoko Matsui, NTT Human Interface Labs. (Japan)
Kiyoaki Aikawa, NTT Human Interface Labs. (Japan)

Page (NA) Paper number 714

Abstract:

This paper investigates a method of creating robust speaker models that are not sensitive to session-dependent (SD) utterance variation and handset-dependent (HD) distortion for HMM-based speaker verification systems in a real telephone network. We recently reported a method of creating session-independent (SI) speaker-HMMs that are not sensitive to SD utterance variation. In that method, a distortion function that transforms SI speaker-HMMs into SD speaker-HMMs is introduced, and the parameters of that function and the speaker-HMM parameters are jointly estimated using a speaker adaptive training algorithm. This paper proposes a method that is less sensitive to SD utterance variation and HD distortion than the previous one. The new idea focuses on the different difficulties of estimating the distortion-function parameters for SD utterance variation and for HD distortion. In text-independent verification experiments on telephone speech data, the improved method yields a 24% relative error reduction compared with conventional cepstral mean normalization.
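A minimal illustrative sketch, not the authors' implementation: assuming the session-dependent distortion can be approximated by an additive cepstral bias applied to the SI speaker-HMM means, one alternating re-estimation step in a speaker-adaptive-training style loop could look as follows (the function names and the single-bias-per-session simplification are assumptions for illustration).

    import numpy as np

    def estimate_session_bias(frames, si_means, state_assignments):
        # Re-estimate one session's additive distortion: mean residual between
        # the observed frames and the SI means of their aligned states.
        return (frames - si_means[state_assignments]).mean(axis=0)

    def reestimate_si_means(sessions, n_states, dim):
        # Re-estimate the shared session-independent means from
        # bias-compensated frames pooled over all sessions.
        acc = np.zeros((n_states, dim))
        cnt = np.zeros(n_states)
        for frames, assignments, bias in sessions:
            compensated = frames - bias      # remove session-dependent distortion
            np.add.at(acc, assignments, compensated)
            np.add.at(cnt, assignments, 1.0)
        cnt[cnt == 0] = 1.0                  # guard against empty states
        return acc / cnt[:, None]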

SL980714.PDF (From Author) SL980714.PDF (Rasterized)



A Comparative Evaluation of Variance Flooring Techniques in HMM-based Speaker Verification

Authors:

Haakan Melin, KTH, Dept. of Speech, Music and Hearing (Sweden)
Johan W. Koolwaaij, KUN, Dept. of Language & Speech (The Netherlands)
Johan Lindberg, KTH, Dept. of Speech, Music and Hearing (Sweden)
Frédéric Bimbot, IRISA / CNRS & INRIA - Sigma2 (France)

Page (NA) Paper number 467

Abstract:

The problem of how to train variance parameters on scarce data is addressed in the context of text-dependent, HMM-based automatic speaker verification. Three variations of variance flooring are explored as a means of preventing over-fitting. With the best-performing one, the floor for a variance vector of a client model is proportional to the corresponding variance vector of a non-client multi-speaker model. It is also found that creating a client model by adapting the means and mixture weights of the non-client model while keeping the variances fixed performs comparably to variance flooring and is much simpler. Comparisons are made on three large telephone-quality corpora: Gandalf, SESP and Polycost.
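A rough sketch of the best-performing flooring rule described above (the proportionality constant alpha and the function name are assumptions for illustration):

    import numpy as np

    def floor_variances(client_var, world_var, alpha=0.1):
        # Element-wise flooring: a client-model variance may not fall below
        # alpha times the corresponding variance of the non-client world model.
        return np.maximum(client_var, alpha * world_var)

    # Example: floor the variance vector of one Gaussian in a client model.
    client_var = np.array([0.02, 0.50, 0.01])
    world_var  = np.array([0.30, 0.40, 0.25])
    print(floor_variances(client_var, world_var))   # -> [0.03  0.5   0.025]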

SL980467.PDF (From Author) SL980467.PDF (Rasterized)



Text-Independent Speaker Verification Using Automatically Labelled Acoustic Segments

Authors:

Dijana Petrovska-Delacrétaz, CIRC-EPFL (Switzerland)
Jan Cernocky, Technical University of Brno (Czech Republic)
Jean Hennebert, CIRC-EPFL (Switzerland)
Gérard Chollet, ENST, Paris (France)

Page (NA) Paper number 536

Abstract:

Most current text-independent speaker verification techniques are based on modelling the global probability distribution function of speakers in the acoustic vector space. We present an alternative approach based on class-dependent verification systems using automatically determined segmental units, obtained with temporal decomposition and labelled through unsupervised clustering. The core of the system is a set of multi-layer perceptrons (MLPs) trained to discriminate between the client and an independent set of world speakers. Each MLP is dedicated to data segments previously assigned to a particular class. Issues and potential advantages of the segmental approach are presented. The performance of the global and segmental approaches is evaluated on the NIST'98 database (250 female and 250 male speakers), showing promising results for the proposed segmental approach. A comparison with a state-of-the-art system based on Gaussian Mixture Modelling is also included.
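A minimal sketch of how class-dependent segmental scores could be combined (purely illustrative; the per-class MLP callables, segment labels and simple score averaging are assumptions, not the paper's exact formulation):

    import numpy as np

    def segmental_score(segment_frames, segment_labels, class_mlps):
        # Each segment is scored by the MLP dedicated to its automatically
        # assigned class; the client-vs-world scores are then averaged.
        scores = [class_mlps[label](frames)
                  for frames, label in zip(segment_frames, segment_labels)]
        return float(np.mean(scores))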

SL980536.PDF (From Author) SL980536.PDF (Rasterized)



A Fast Decoding Algorithm Based on Sequential Detection of the Changes in Distribution

Authors:

Qi Li, Bell Labs, Lucent Technologies (USA)

Page (NA) Paper number 815

Abstract:

A fast algorithm for left-to-right HMM decoding is proposed in this paper. The algorithm is based on a sequential detection scheme that is asymptotically optimal in the sense of detecting a possible change in distribution as reliably and quickly as possible. The scheme is extended to HMM decoding for determining the state segmentations used in likelihood or other score computations. As a sequential scheme, it can determine a state boundary within a few time steps after it occurs. The examples in this paper show that the proposed algorithm is 5 to 9 times faster than the Viterbi algorithm while providing the same or similar decoding results. The proposed algorithm can be applied to speaker recognition, audio segmentation, voice/silence detection, and many other applications in which its underlying assumption is usually satisfied.
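A minimal sketch of the kind of sequential test involved (a Page/CUSUM-style change detector under assumed state output densities; the threshold value and function names are illustrative assumptions):

    def detect_state_boundary(frames, logpdf_current, logpdf_next, threshold=5.0):
        # Accumulate the log-likelihood ratio of "next state" versus "current
        # state"; declare a boundary once the statistic exceeds the threshold.
        stat = 0.0
        for t, x in enumerate(frames):
            stat = max(0.0, stat + logpdf_next(x) - logpdf_current(x))
            if stat > threshold:
                return t   # boundary detected a few frames after the change
        return None        # no change detected in this frame sequence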

SL980815.PDF (From Author) SL980815.PDF (Rasterized)



Speaker Verification With Ensemble Classifiers Based On Linear Speech Transforms

Authors:

Jesper Østergaard Olsen, Aalborg University (Denmark)

Page (NA) Paper number 334

Abstract:

For most classifier architectures, realistic training schemes only allow classifiers corresponding to local optima of the training criterion to be constructed. One way of dealing with this problem is to work with classifier ensembles: multiple classifiers are trained for the same classification problem and combined into one ``super'' classifier. The problem addressed in this paper is text-prompted speaker verification by means of phoneme-dependent Radial Basis Function networks trained by gradient-descent error minimisation. In this context, ensemble techniques are introduced by combining classifiers that operate on feature vectors pre-processed with different linear transforms. Four types of linear transforms are studied: the Fisher transform, the LDA transform, the PCA transform and the cosine transform. The verification system is evaluated on the Gandalf database, where the equal error rate is reduced from 3.6% to 3.2% when ensemble techniques are introduced.
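A rough sketch of the ensemble idea, averaging scores from classifiers that each see a differently transformed feature space (the transform matrices, classifier callables and plain score averaging are assumptions for illustration):

    import numpy as np

    def ensemble_score(features, transforms, classifiers):
        # Score the utterance with each classifier on its own linearly
        # transformed feature space, then average into one ensemble score.
        scores = [clf(features @ T.T) for T, clf in zip(transforms, classifiers)]
        return float(np.mean(scores))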

SL980334.PDF (From Author) SL980334.PDF (Rasterized)



Speaker Recognition Based On Discriminative Projection Models

Authors:

Jesper Østergaard Olsen, Aalborg University (Denmark)

Page (NA) Paper number 335

Abstract:

A new discriminant speaker model is introduced in this paper. The model is text-dependent and characterises speakers in terms of the angular distance between ``projection vectors'', which allow good discrimination between individual speakers. The projection models require only a small amount of enrollment data per target speaker, but they do require a set of ``cohort speakers'' for whom a relatively large amount of training speech is available. The projection-model technique is evaluated on the Gandalf database and compared with conventional Gaussian Mixture Models (GMMs). The projection models are found to require less storage per target speaker while achieving lower error rates, particularly for speaker identification and for recognition under mismatched conditions.
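A minimal sketch of scoring by angular distance between projection vectors (the function name and the use of a plain cosine-based angle are illustrative assumptions):

    import numpy as np

    def angular_distance(u, v):
        # Angle between a test projection vector and a target-speaker
        # projection vector; a small angle indicates the same speaker.
        cos = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
        return float(np.arccos(np.clip(cos, -1.0, 1.0)))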

SL980335.PDF (From Author) SL980335.PDF (Rasterized)
