Speaker Adaptation and Normalization in Adverse Environments

Chair: Yunxin Zhao, University of Illinois, USA


Minimum Cross-Entropy Adaptation of Hidden Markov Models

Authors:

Mohamed Afify, Universite Henri Poincare, Nancy (France)
Jean-Paul Haton, Universite Henri Poincare, Nancy (France)

Volume 1, Page 73, Paper number 1812

Abstract:

Adaptation techniques that benefit from distribution correlation are important in practical situations with sparse adaptation data. The so-called EMAP algorithm provides an optimal, though expensive, solution. In this article we start from EMAP and propose an approximate optimisation criterion based on maximising a set of local densities. We then obtain expressions for these local densities based on the principle of minimum cross-entropy (MCE). The solution to the MCE problem is obtained using an analogy with MAP estimation and avoids the use of complex numerical procedures, thus resulting in a simple adaptation algorithm. The implementation of the proposed method for the adaptation of HMMs with mixture Gaussian densities is discussed, and its efficiency is evaluated on an alphabet recognition task.
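
The abstract notes that the MCE solution is obtained by analogy with MAP estimation. As a hedged illustration only, not the authors' algorithm, the following Python/NumPy sketch shows the flavor of MAP-style adaptation of a single Gaussian mean under sparse data; the function name and the prior weight `tau` are assumptions for the example.

```python
import numpy as np

def map_adapt_mean(prior_mean, adapt_frames, tau=10.0):
    """MAP-style update of a Gaussian mean from sparse adaptation frames.

    tau weights the prior mean: with only a few frames the estimate stays
    close to the prior, and it moves toward the sample mean as data grows.
    """
    n = len(adapt_frames)
    sample_mean = adapt_frames.mean(axis=0)
    return (tau * prior_mean + n * sample_mean) / (tau + n)

# Five adaptation frames at (1, 1) pull a zero prior mean only part way:
adapted = map_adapt_mean(np.zeros(2), np.ones((5, 2)))
# (10 * 0 + 5 * 1) / 15 = 1/3 in each dimension
```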

ic981812.pdf (From Postscript)


Improving Viterbi Bayesian Predictive Classification via Sequential Bayesian Learning in Robust Speech Recognition

Authors:

Hui Jiang, University of Tokyo (Japan)
Keikichi Hirose, University of Tokyo (Japan)
Qiang Huo, University of Hong Kong (Hong Kong)

Volume 1, Page 77, Paper number 1648

Abstract:

In this paper, we extend our previously proposed Viterbi Bayesian predictive classification (VBPC) algorithm to accommodate a new class of prior probability density functions (pdf's) for continuous density hidden Markov model (CDHMM) based robust speech recognition. The initial prior pdf of the CDHMM is assumed to be a finite mixture of natural conjugate prior pdf's of its complete-data density. As new observation data arrive, the true posterior pdf is approximated by a finite mixture pdf of the same type, which retains the most significant terms of the true posterior density according to their contribution to the corresponding predictive density. The updated mixture pdf is then used to improve the VBPC performance. Experimental results on a speaker-independent recognition task of isolated Japanese digits confirm the viability and usefulness of the proposed technique.

ic981648.pdf (From Postscript)


Discriminative Learning of Additive Noise and Channel Distortions for Robust Speech Recognition

Authors:

Jiqing Han, Systems Engineering Research Institute (Korea)
Munsung Han, Systems Engineering Research Institute (Korea)
Gyu-Bong Park, Systems Engineering Research Institute (Korea)
Jeongue Park, Systems Engineering Research Institute (Korea)
Wen Gao, Harbin Institute of Technology (China)
Doosung Hwang, Systems Engineering Research Institute (Korea)

Volume 1, Page 81, Paper number 1338

Abstract:

Learning the influence of additive noise and channel distortions from training data is an effective approach to robust speech recognition. Most previous methods are based on the maximum likelihood estimation criterion. In this paper, we propose a new method for discriminative learning of environmental parameters, based on the Minimum Classification Error (MCE) criterion. Using a simple classifier of our own design and the Generalized Probabilistic Descent (GPD) algorithm, we learn the environmental parameters iteratively. Once the parameters are obtained, we estimate the clean speech features from the observed speech features and then use these estimates to train or test the back-end HMM classifier. On a Korean task of 18 isolated confusable words, the best error rate reduction is 32.1% relative to a conventional HMM system.
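
As a rough sketch of the GPD idea described above, and not the authors' classifier, the following Python/NumPy example learns an additive environmental bias by gradient descent on a sigmoid-smoothed misclassification measure; the nearest-class-mean classifier, the learning rate, and all names are assumptions for illustration.

```python
import numpy as np

def sigmoid(d):
    return 1.0 / (1.0 + np.exp(-d))

def gpd_learn_bias(obs, labels, class_means, bias0, lr=0.2, epochs=50):
    """Discriminatively learn an additive environmental bias b by GPD.

    Each observation y is compensated as x = y - b and scored by squared
    distance to every class mean. The misclassification measure
    d = ||x - mu_correct||^2 - ||x - mu_best_rival||^2 is smoothed by a
    sigmoid, and b is updated by gradient descent on that loss.
    """
    b = bias0.astype(float).copy()
    for _ in range(epochs):
        for y, k in zip(obs, labels):
            x = y - b
            dists = np.sum((class_means - x) ** 2, axis=1)
            order = np.argsort(dists)
            j = order[0] if order[0] != k else order[1]  # best rival class
            loss = sigmoid(dists[k] - dists[j])
            # d(d)/db = 2 * (mu_correct - mu_rival), independent of x
            b -= lr * loss * (1.0 - loss) * 2.0 * (class_means[k] - class_means[j])
    return b

# Two 1-D classes at 0 and 4, observed through a constant channel shift:
means = np.array([[0.0], [4.0]])
obs = np.vstack([np.full((10, 1), 2.5), np.full((10, 1), 6.5)])
labels = np.array([0] * 10 + [1] * 10)
bias = gpd_learn_bias(obs, labels, means, bias0=np.zeros(1))
```

Because the criterion is discriminative, the learned bias stops improving once the compensated tokens are classified correctly; it need not match the true channel shift exactly.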

ic981338.pdf (From Postscript)


A Combination of Discriminative and Maximum Likelihood Techniques for Noise Robust Speech Recognition

Authors:

Kari Laurila, Nokia Research Center (Finland)
Marcel Vasilache, Nokia Research Center (Finland)
Olli Viikki, Nokia Research Center (Finland)

Volume 1, Page 85, Paper number 1674

Abstract:

In this paper, we study how discriminative and Maximum Likelihood (ML) techniques should be combined in order to maximize the recognition accuracy of a speaker-independent Automatic Speech Recognition (ASR) system that includes speaker adaptation. We compare two training approaches for the speaker-independent case and examine how well they perform together with four different speaker adaptation schemes. On a noise-robust connected digit recognition task, we show that Minimum Classification Error (MCE) training for speaker-independent modeling, together with a Bayesian speaker adaptation scheme, provides the highest classification accuracy over the whole lifespan of an ASR system. With MCE training we reduce recognition errors by 30% over the ML approach in the speaker-independent case. With Bayesian speaker adaptation we further reduce the error rates by 62% using as few as five adaptation utterances.

ic981674.pdf (From Postscript)


Frame-Synchronous Stochastic Matching Based on the Kullback-Leibler Information

Authors:

Lionel Delphin-Poulat, Telecom CNET/DIH/RCP (France)
Chafic Mokbel, Telecom CNET/DIH/RCP (France)
Jerome Idier, Laboratoire des Signaux et Systemes Supelec (France)

Volume 1, Page 89, Paper number 1242

Abstract:

An acoustic mismatch between a given utterance and a model degrades the performance of the speech recognition process. We choose to model speech by Hidden Markov Models (HMMs) in the cepstrum domain and the mismatch by a parametric function. In order to reduce the mismatch, one has to estimate the parameters of this function. In this paper, we present a frame-synchronous estimation of these parameters and show that they can be computed recursively. Such a recursive formulation makes it possible to track variations of the parameters over time. We give the general equations and study the particular case of an affine transform. Finally, we report recognition experiments carried out over both PSTN and cellular telephone networks to show the efficiency of the method in a real context.
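
The paper estimates the parameters of a general affine transform; as a much-simplified, assumption-laden sketch of the frame-synchronous recursive idea, the following Python/NumPy example tracks only an additive cepstral bias with an exponential forgetting factor.

```python
import numpy as np

def track_bias(frames, model_means, alpha=0.05):
    """Frame-synchronous recursive estimate of an additive cepstral bias.

    Each incoming frame is compensated with the current bias, matched to
    the closest model mean, and the residual updates the bias with an
    exponential forgetting factor so that slow channel drift can be tracked.
    """
    b = np.zeros(frames.shape[1])
    for y in frames:
        x = y - b
        mu = model_means[np.argmin(np.sum((model_means - x) ** 2, axis=1))]
        b = (1.0 - alpha) * b + alpha * (y - mu)  # recursive update
    return b

# Clean frames alternate between two model means; a constant channel
# bias of 1.0 is added, and the recursive estimate converges toward it.
means = np.array([[0.0], [5.0]])
frames = np.tile(means, (100, 1)) + 1.0  # 200 biased frames
bias = track_bias(frames, means)
```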

ic981242.pdf (From Postscript)


Unsupervised Speaker Normalization Using Canonical Correlation Analysis

Authors:

Yasuo Ariki, Ryukoku University (Japan)
Miharu Sakuragi, Ryukoku University (Japan)

Volume 1, Page 93, Paper number 1431

Abstract:

Conventional speaker-independent HMMs ignore speaker differences and pool the speech data of all speakers in a single observation space. As a result, the output probability distributions of the HMMs become diffuse, which degrades recognition accuracy. To solve this problem, we construct a speaker subspace for each individual speaker and relate the subspaces of the standard speaker and the input speaker by o-space canonical correlation analysis. To remove the constraint of supervised normalization, in which input speakers must utter the same sentences as the standard speaker, we propose in this paper an unsupervised speaker normalization method that automatically segments the speech data into phoneme data by the Viterbi decoding algorithm and then associates the mean feature vectors of the phoneme data by o-space canonical correlation analysis. We show that the phoneme recognition rate of this unsupervised method is equivalent to that of the supervised normalization method we proposed previously.
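
A minimal Python/NumPy sketch of canonical correlation analysis on paired phoneme mean vectors may help make the association step concrete; this is a generic CCA via whitening and SVD, not the paper's o-space formulation, and the regularizer `reg` is an assumption.

```python
import numpy as np

def cca(X, Y, reg=1e-6):
    """Canonical correlation analysis between paired sets of mean vectors.

    X and Y are (n_phonemes, dim) matrices whose rows are phoneme mean
    vectors for the standard speaker and the input speaker, paired by
    phoneme label. Returns projections A, B and canonical correlations.
    """
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    n = X.shape[0]
    Sxx = Xc.T @ Xc / n + reg * np.eye(X.shape[1])
    Syy = Yc.T @ Yc / n + reg * np.eye(Y.shape[1])
    Sxy = Xc.T @ Yc / n
    Wx = np.linalg.inv(np.linalg.cholesky(Sxx))  # whitening for X
    Wy = np.linalg.inv(np.linalg.cholesky(Syy))  # whitening for Y
    U, s, Vt = np.linalg.svd(Wx @ Sxy @ Wy.T)
    return Wx.T @ U, Wy.T @ Vt.T, s

# Input-speaker means that are an exact rotation of the standard
# speaker's means should yield canonical correlations near 1.
rng = np.random.default_rng(0)
X = rng.standard_normal((50, 3))
Q, _ = np.linalg.qr(rng.standard_normal((3, 3)))
A, B, corrs = cca(X, X @ Q)
```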

ic981431.pdf (From Postscript)


Speaker Independent Acoustic Modeling Using Speaker Normalization

Authors:

Jun Ishii, Mitsubishi Electric Corporation (Japan)
Toshiaki Fukada, ATR Interpreting Telecommunications Research Labs (Japan)

Volume 1, Page 97, Paper number 1899

Abstract:

This paper proposes a novel speaker-independent (SI) modeling approach for spontaneous speech data from multiple speakers. The SI acoustic model parameters are estimated by training separately for inter-speaker variability and for intra-speaker phonetically related variation, in order to obtain a more accurate acoustic model. A linear transformation technique is used both for the speaker normalization that extracts intra-speaker phonetically related variation and for the re-estimation of inter-speaker variability. The proposed modeling is evaluated on Japanese spontaneous speech data using continuous density mixture Gaussian HMMs. Experimental results show that the proposed acoustic model achieves reductions in word error rate over the standard SI model regardless of the type of acoustic model used.
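
As a hedged illustration of the linear transformation technique mentioned above, and not the authors' estimation procedure, a least-squares affine map between paired feature sets can be sketched in Python/NumPy as follows.

```python
import numpy as np

def estimate_linear_transform(X, Y):
    """Least-squares affine map Y ~ X @ W + b from speaker features X
    to speaker-normalized targets Y."""
    Xa = np.hstack([X, np.ones((X.shape[0], 1))])  # append bias column
    coef, *_ = np.linalg.lstsq(Xa, Y, rcond=None)
    return coef[:-1], coef[-1]

# Recover a known transform from noiseless paired features.
rng = np.random.default_rng(1)
X = rng.standard_normal((200, 4))
W_true = rng.standard_normal((4, 4))
b_true = rng.standard_normal(4)
W, b = estimate_linear_transform(X, X @ W_true + b_true)
```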

ic981899.pdf (From Postscript)


Robust Speech Recognition for Multiple Topological Scenarios of the GSM Mobile Phone System

Authors:

Theodoros Salonidis, Technical University of Crete (Greece)
Vassilios Digalakis, Technical University of Crete (Greece)

Volume 1, Page 101, Paper number 1623

Abstract:

This paper deals with robust speech recognition in the GSM mobile environment. Our focus is on the voice degradation due to the losses in the GSM coding scheme. We therefore propose an experimental framework of network topologies consisting of various coding-decoding systems placed in tandem. After measuring the recognition performance for each of these network scenarios, we try to increase recognition accuracy by using feature compensation and model adaptation algorithms. We first compare the different methods for all the network topologies, assuming the topology is known. We then investigate the more realistic case in which the network topology through which the voice has passed is unknown. The results show that robustness can be achieved even in this case.

ic981623.pdf (From Postscript)
