ICASSP '98
Abstract - SP3

SP3.1
Minimum Cross-Entropy Adaptation of Hidden Markov Models
M. Afify,
J. Haton (Universite Henri Poincare, Nancy, France)
Adaptation techniques that benefit from distribution correlation are important in practical situations with sparse adaptation data. The so-called EMAP algorithm provides an optimal, though expensive, solution. In this article we start from EMAP and propose an approximate optimisation criterion based on maximising a set of local densities. We then obtain expressions for these local densities from the principle of minimum cross-entropy (MCE). The solution to the MCE problem is obtained through an analogy with MAP estimation and avoids complex numerical procedures, resulting in a simple adaptation algorithm. The implementation of the proposed method for the adaptation of HMMs with mixture Gaussian densities is discussed, and its efficiency is evaluated on an alphabet recognition task.
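The MAP analogy that closes the derivation reduces, for a single Gaussian mean, to a count-weighted interpolation between the prior mean and the sample mean of the adaptation data. A minimal sketch (the function name, tau value, and toy data are illustrative, not from the paper):

```python
import numpy as np

def map_adapt_mean(mu_prior, tau, data):
    """MAP update of a Gaussian mean: the posterior mean interpolates
    between the prior mean and the sample mean of the adaptation data,
    weighted by the prior count tau and the data count n."""
    n = len(data)
    x_bar = np.mean(data, axis=0)
    return (tau * mu_prior + n * x_bar) / (tau + n)

mu_prior = np.array([0.0, 0.0])
data = np.array([[1.0, 2.0], [3.0, 2.0]])   # sparse adaptation data
mu_map = map_adapt_mean(mu_prior, tau=2.0, data=data)
print(mu_map)   # halfway between the prior [0, 0] and the sample mean [2, 2]
```

With sparse data the prior dominates; as more adaptation frames arrive, the estimate moves toward the sample mean.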

SP3.2
Improving Viterbi Bayesian Predictive Classification via Sequential Bayesian Learning in Robust Speech Recognition
H. Jiang,
K. Hirose (University of Tokyo, Japan);
Q. Huo (University of Hong Kong, P R China)
In this paper, we extend our previously proposed Viterbi Bayesian predictive classification (VBPC) algorithm to accommodate a new class of prior probability density function (pdf) for continuous density hidden Markov model (CDHMM) based robust speech recognition. The initial prior pdf of the CDHMM is assumed to be a finite mixture of natural conjugate prior pdfs of its complete-data density. As new observation data arrive, the true posterior pdf is approximated by a finite mixture pdf of the same type, which retains the most significant terms of the true posterior density according to their contribution to the corresponding predictive density. The updated mixture pdf is then used to improve VBPC performance. Experimental results on a speaker-independent recognition task of isolated Japanese digits confirm the viability and usefulness of the proposed technique.
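The approximation step, keeping only the most significant mixture terms and renormalizing, can be sketched as below; note that the paper ranks terms by their contribution to the predictive density, whereas this toy version ranks by raw mixture weight:

```python
import numpy as np

def prune_mixture(weights, k):
    """Approximate a finite mixture by keeping its k most significant
    terms and renormalizing, so the pruned weights still sum to one."""
    idx = np.argsort(weights)[::-1][:k]   # indices of the k largest weights
    kept = np.zeros_like(weights)
    kept[idx] = weights[idx]
    return kept / kept.sum()

w = np.array([0.05, 0.40, 0.10, 0.30, 0.15])
print(prune_mixture(w, k=2))   # only the two largest terms survive
```

Pruning keeps the posterior representation compact, so the mixture does not grow without bound as data accumulate.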

SP3.3
Discriminative Learning of Additive Noise and Channel Distortions for Robust Speech Recognition
J. Han,
M. Han,
G. Park,
J. Park (Systems Engineering Research Institute, South Korea);
W. Gao (Harbin Institute of Technology, P R China);
D. Hwang (Systems Engineering Research Institute, South Korea)
Learning the influence of additive noise and channel distortions from training data is an effective approach to robust speech recognition. Most previous methods are based on the maximum likelihood estimation criterion. In this paper, we propose a new method for discriminative learning of environmental parameters, based on the Minimum Classification Error (MCE) criterion. Using a simple classifier of our own design and the Generalized Probabilistic Descent (GPD) algorithm, we learn the environmental parameters iteratively. Once the parameters are learned, we estimate the clean speech features from the observed speech features and use these estimates to train or test the back-end HMM classifier. On a Korean task of 18 isolated confusable words, a best error rate reduction of 32.1% relative to the conventional HMM system is obtained.
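The GPD update can be illustrated on a toy two-class nearest-mean classifier with a single additive bias parameter; the sigmoid-smoothed misclassification measure and all learning constants below are illustrative choices, not the paper's:

```python
import numpy as np

def gpd_learn_bias(X, labels, means, steps=500, lr=0.5, alpha=0.25):
    """Learn an additive environment bias b by gradient descent on a
    sigmoid-smoothed MCE loss, for a two-class nearest-mean classifier.
    d > 0 means the compensated frame x - b is misclassified."""
    b = np.zeros(X.shape[1])
    for _ in range(steps):
        for x, y in zip(X, labels):
            z = x - b
            d = np.sum((z - means[y]) ** 2) - np.sum((z - means[1 - y]) ** 2)
            s = 1.0 / (1.0 + np.exp(-alpha * d))   # smoothed 0/1 loss
            # chain rule: d(d)/d(b) = 2 * (means[y] - means[1 - y])
            grad = alpha * s * (1.0 - s) * 2.0 * (means[y] - means[1 - y])
            b -= lr * grad
    return b

# Toy setup: clean class means 0 and 4, observed speech shifted by +1.5.
means = np.array([[0.0], [4.0]])
X = np.array([[1.5], [5.5]])   # clean features plus the true bias
labels = [0, 1]
b = gpd_learn_bias(X, labels, means)
print(b)   # approaches the true bias 1.5
```

Subtracting the learned bias from the observed features then recovers (approximately) the clean features fed to the back-end classifier.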

SP3.4
A Combination of Discriminative and Maximum Likelihood Techniques for Noise Robust Speech Recognition
K. Laurila,
M. Vasilache,
O. Viikki (Nokia Research Center, Finland)
In this paper, we study how discriminative and Maximum Likelihood (ML) techniques should be combined in order to maximize the recognition accuracy of a speaker-independent Automatic Speech Recognition (ASR) system that includes speaker adaptation. We compare two training approaches for the speaker-independent case and examine how well they perform together with four different speaker adaptation schemes. In a noise robust connected digit recognition task we show that the Minimum Classification Error (MCE) training approach for speaker-independent modeling, together with the Bayesian speaker adaptation scheme, provides the highest classification accuracy over the whole lifespan of an ASR system. With MCE training we reduce recognition errors by 30% over the ML approach in the speaker-independent case. With the Bayesian speaker adaptation scheme we further reduce the error rates by 62% using as few as five adaptation utterances.
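Assuming the quoted 62% is relative to the MCE speaker-independent baseline (as "further reduce" suggests), the two relative reductions compose as:

```python
# Composing the two quoted relative error-rate reductions:
# MCE training cuts errors by 30% vs. ML, and Bayesian adaptation
# cuts the remaining errors by a further 62%.
mce = 0.30
bayes = 0.62
residual = (1 - mce) * (1 - bayes)   # fraction of the ML errors left
total_reduction = 1 - residual
print(f"{total_reduction:.1%}")      # 73.4% overall relative reduction vs. ML
```

Relative reductions multiply on the surviving error mass rather than adding, which is why the combined figure is below 30% + 62%.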

SP3.5
Frame-Synchronous Stochastic Matching Based on the Kullback-Leibler Information
L. Delphin-Poulat,
C. Mokbel (France Telecom CNET/DIH/RCP, France);
J. Idier (Laboratoire des Signaux et Systemes, France)
An acoustic mismatch between a given utterance and a model degrades the performance of the speech recognition process. We model speech by Hidden Markov Models (HMMs) in the cepstrum domain and the mismatch by a parametric function. To reduce the mismatch, one has to estimate the parameters of this function. In this paper, we present a frame-synchronous estimation of these parameters and show that they can be computed recursively, which allows parameter variations to be tracked. We give general equations and study the particular case of an affine transform. Finally, we report recognition experiments carried out over both the PSTN and a cellular telephone network to show the efficiency of the method in a real context.
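A frame-synchronous recursive estimate of a purely additive cepstral bias, the simplest special case of the affine transform studied here, can be sketched as a forgetting-factor running mean of the frame residuals; the model means and the forgetting factor below are illustrative:

```python
import numpy as np

def track_bias(frames, means, forget=0.98):
    """Recursively estimate an additive cepstral bias, frame by frame.
    Each frame's residual against its (assumed known) model mean updates
    the estimate; the forgetting factor down-weights old frames so the
    estimate can track slowly varying mismatch."""
    b = np.zeros(frames.shape[1])
    w = 0.0
    for x, mu in zip(frames, means):
        w = forget * w + 1.0
        b += (x - mu - b) / w   # recursive (forgetting) mean of residuals
    return b

rng = np.random.default_rng(0)
mu = np.zeros((200, 2))                                   # toy model means
frames = mu + 1.0 + 0.1 * rng.standard_normal((200, 2))   # true bias = +1
b = track_bias(frames, mu)
print(b)   # close to [1, 1]
```

With forget = 1 this reduces to the ordinary running mean; values below 1 trade estimation variance for tracking speed.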

SP3.6
Unsupervised Speaker Normalization Using Canonical Correlation Analysis
Y. Ariki,
M. Sakuragi (Ryukoku University, Japan)
Conventional speaker-independent HMMs ignore speaker differences and pool the speech data in a single observation space. As a result, the output probability distributions of the HMMs become diffuse, which degrades recognition accuracy. To solve this problem, we construct a speaker subspace for each individual speaker and correlate the subspaces of the standard speaker and the input speaker by o-space canonical correlation analysis. To remove the constraint of supervised normalization, namely that input speakers must speak the same sentences as the standard speaker, we propose in this paper an unsupervised speaker normalization method that automatically segments the speech data into phoneme data by the Viterbi decoding algorithm and then associates the mean feature vectors of the phoneme data by o-space canonical correlation analysis. We show that the phoneme recognition rate of this unsupervised method is equivalent to that of the supervised normalization method we proposed previously.
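Plain canonical correlation analysis on paired vectors, the building block behind the o-space association of phoneme mean vectors, can be sketched as a whitening step followed by an SVD of the cross-covariance (toy data, not the paper's features):

```python
import numpy as np

def canonical_correlations(X, Y):
    """CCA between paired feature sets X and Y (rows are matched pairs,
    e.g. phoneme mean vectors of two speakers): whiten each set, then
    the singular values of the whitened cross-covariance are the
    canonical correlations."""
    Xc = X - X.mean(axis=0)
    Yc = Y - Y.mean(axis=0)
    n = len(X)
    Cxx, Cyy, Cxy = Xc.T @ Xc / n, Yc.T @ Yc / n, Xc.T @ Yc / n

    def inv_sqrt(C):
        # inverse matrix square root via eigendecomposition
        vals, vecs = np.linalg.eigh(C)
        return vecs @ np.diag(1.0 / np.sqrt(vals)) @ vecs.T

    M = inv_sqrt(Cxx) @ Cxy @ inv_sqrt(Cyy)
    return np.linalg.svd(M, compute_uv=False)

rng = np.random.default_rng(1)
X = rng.standard_normal((100, 3))
Y = 2.0 * X + 0.01 * rng.standard_normal((100, 3))   # near-linear relation
print(canonical_correlations(X, Y))   # all close to 1
```

High canonical correlations indicate that one speaker's mean vectors are nearly a linear transform of the other's, which is what the normalization exploits.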

SP3.7
Speaker Independent Acoustic Modeling Using Speaker Normalization
J. Ishii (Mitsubishi Electric Corporation, Japan);
T. Fukada (ATR Interpreting Telecommunications Research Labs, Japan)
This paper proposes a novel speaker-independent (SI) modeling method for spontaneous speech data from multiple speakers. The SI acoustic model parameters are estimated by training separately for inter-speaker variability and for intra-speaker phonetically related variation, in order to obtain a more accurate acoustic model. A linear transformation technique is used for speaker normalization to extract the intra-speaker phonetically related variation, and also for the re-estimation of inter-speaker variability. The proposed modeling is evaluated on Japanese spontaneous speech data using continuous density mixture Gaussian HMMs. Experimental results show that the proposed acoustic model achieves reductions in word error rate over the standard SI model regardless of the type of acoustic model used.
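A least-squares estimate of a linear (affine) transformation of the kind used for speaker normalization can be sketched as follows; the paired feature matrices are toy data, not the paper's:

```python
import numpy as np

def estimate_affine(X, Y):
    """Least-squares affine map y ~ W x + b taking one speaker's feature
    vectors onto another's (rows of X and Y are paired observations)."""
    X1 = np.hstack([X, np.ones((len(X), 1))])   # append a bias column
    W, *_ = np.linalg.lstsq(X1, Y, rcond=None)
    return W   # shape (d+1, d): last row is the bias b

rng = np.random.default_rng(2)
X = rng.standard_normal((50, 2))
Y = X @ np.array([[1.5, 0.0], [0.2, 0.8]]) + np.array([0.3, -0.1])
W = estimate_affine(X, Y)
print(W.round(3))   # recovers the transform and bias (noise-free toy data)
```

Applying the estimated transform to a speaker's features removes much of the inter-speaker variability before the phonetic variation is modeled.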

SP3.8
Robust Speech Recognition for Multiple Topological Scenarios of the GSM Mobile Phone System
T. Salonidis,
V. Digalakis (Technical University of Crete, Greece)
This paper deals with robust speech recognition in the GSM mobile environment. Our focus is on the voice degradation caused by the losses of the GSM coding scheme. We first propose an experimental framework of network topologies consisting of various coding-decoding systems placed in tandem. After measuring recognition performance for each of these network scenarios, we try to increase recognition accuracy using feature compensation and model adaptation algorithms. We first compare the different methods for all network topologies, assuming the topology is known. We then investigate the more realistic case in which the network topology the voice has passed through is unknown. The results show that robustness can be achieved even in this case.
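The abstract does not name the feature-compensation algorithms used; cepstral mean normalization is a standard baseline for convolutional (codec/channel) mismatch of this kind and makes the idea concrete:

```python
import numpy as np

def cepstral_mean_norm(frames):
    """Subtract the per-utterance cepstral mean: a fixed convolutional
    channel appears as an additive offset in the cepstrum, so removing
    the mean removes much of the channel effect."""
    return frames - frames.mean(axis=0)

rng = np.random.default_rng(3)
clean = rng.standard_normal((100, 4))
channel = np.array([0.5, -1.0, 0.2, 0.0])   # fixed channel offset
observed = clean + channel
same = np.allclose(cepstral_mean_norm(observed), cepstral_mean_norm(clean))
print(same)   # True: the fixed offset is removed exactly
```

When the topology (and hence the channel) is unknown, such compensation is attractive precisely because it needs no knowledge of which coders the voice passed through.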