Tetsuya Takiguchi, NAIST (Japan)
Satoshi Nakamura, NAIST (Japan)
Kiyohiro Shikano, NAIST (Japan)
Qiang Huo, ATR ITL (Japan)
The performance of a speech recognizer degrades drastically in reverberant environments. We previously proposed a novel algorithm that models the observed signal as a composition of HMMs for clean speech, noise, and the acoustic transfer function. However, estimating the HMM parameters of the acoustic transfer function remains a serious problem. In our previous paper, we measured real impulse responses at training positions in an experimental room; measuring impulse responses for every possible new room is inconvenient and unrealistic. This paper presents a new method for estimating the HMM parameters of the acoustic transfer function from adaptation data using an HMM decomposition algorithm. Its effectiveness is confirmed by a series of speaker-dependent and speaker-independent word recognition experiments on simulated distant-talking speech data.
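As a rough illustration of the composition idea (not the authors' HMM decomposition algorithm itself), the sketch below combines cepstral mean vectors of clean speech, additive noise, and an acoustic transfer function, assuming log spectra related to cepstra by an orthonormal DCT; all names are ours.

    import numpy as np
    from scipy.fftpack import dct, idct

    def compose_mean(speech_cep, noise_cep, transfer_cep):
        """Compose cepstral mean vectors of clean speech, additive noise and
        an acoustic transfer function into one observed-speech mean. The
        transfer function (a convolution) is additive in the cepstral
        domain; the noise must be added in the linear spectral domain."""
        reverbed = speech_cep + transfer_cep                # convolution
        lin_speech = np.exp(idct(reverbed, norm='ortho'))   # to linear domain
        lin_noise = np.exp(idct(noise_cep, norm='ortho'))
        return dct(np.log(lin_speech + lin_noise), norm='ortho')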
Driss Matrouf, LIMSI (France)
Jean-Luc Gauvain, LIMSI (France)
It is well known that the performance of speech recognition systems degrades rapidly as the mismatch between the training and test conditions increases. Approaches to compensating for this mismatch generally assume that the training data are noise-free and the test data are noisy. In practice, this assumption is seldom correct. In this paper, we propose an iterative technique to compensate for noise in both the training and test data. The adopted approach compensates the speech model parameters using the noise present in the test data, and compensates the test data frames using the noise present in the training data. The training and test data are assumed to come from different, unknown microphones and acoustic environments. The benefit of such a compensation scheme has been assessed on the MASK task using a continuous density HMM-based speech recognizer. Experimental results show the advantage of compensating for both test and training noise.
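In the spirit of this two-sided scheme, and with heavy simplification (log power spectra instead of cepstra, noise estimates assumed given, the iterative re-estimation loop omitted), one step might look like the following; everything here is our own sketch.

    import numpy as np

    def add_noise_log(logspec, noise_log):
        # power-domain addition of two log power spectra
        return np.log(np.exp(logspec) + np.exp(noise_log))

    def two_sided_step(model_means, test_frames, train_noise, test_noise):
        """One compensation step: the model means absorb the test noise and
        the test frames absorb the training noise, so models and data meet
        in a common 'noisy on both sides' domain. In the paper this step
        sits inside an iterative loop in which the noise estimates are
        refined; here they are simply assumed known."""
        return (add_noise_log(model_means, test_noise),
                add_noise_log(test_frames, train_noise))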
Shigeki Sagayama, NTT HI Labs (Japan)
Yoshikazu Yamaguchi, NTT HI Labs (Japan)
Satoshi Takahashi, NTT HI Labs (Japan)
Jun-ichi Takahashi, NTT HI Labs (Japan)
This paper describes a Jacobian approach to fast adaptation of acoustic models to noisy environments. Acoustic models built under an assumed noise condition are compensated by applying Jacobian matrices to the difference between the assumed and observed noise cepstra. A detailed mathematical formulation and algorithm derivation are presented. Experiments showed that when only a small amount of training data is available, this approach outperforms existing approaches (such as PMC and NOVO) for composing a model from speech and noise models. It drastically reduces computational cost by replacing the complicated computation of model composition with simple matrix arithmetic, enabling real-time adaptation to environmental noise. Combination with spectral subtraction is also discussed.
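A minimal sketch of the first-order update, reconstructed from the abstract under the usual log-spectral noise-addition model with an orthonormal DCT (the paper's exact formulation may differ):

    import numpy as np
    from scipy.fftpack import dct, idct

    def jacobian_adapt(model_cep, assumed_noise_cep, observed_noise_cep):
        """First-order update of a noise-composed cepstral mean when the
        noise moves from its assumed value to the observed one. model_cep
        is the mean already composed with the assumed noise, so its linear
        spectrum approximates S + N."""
        s_plus_n = np.exp(idct(model_cep, norm='ortho'))
        n = np.exp(idct(assumed_noise_cep, norm='ortho'))
        dim = len(model_cep)
        C = dct(np.eye(dim), norm='ortho', axis=0)   # orthonormal DCT matrix
        J = C @ np.diag(n / s_plus_n) @ C.T          # d(cep_noisy)/d(cep_noise)
        return model_cep + J @ (observed_noise_cep - assumed_noise_cep)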
Mohamed Afify, CRIN/CNRS-INRIA-Lorraine (France)
Yifan Gong, Speech research,Texas Instruments (U.S.A.)
Jean-Paul Haton, CRIN/CNRS-INRIA-Lorraine (France)
In the context of continuous density hidden Markov models (CDHMMs), we present a unified maximum likelihood (ML) approach to acoustic mismatch compensation. This is achieved by introducing additive Gaussian biases at the state level in both the mel cepstral and linear spectral domains. Flexible modelling of different mismatch effects can be obtained through appropriate bias tying. We present a maximum likelihood approach for jointly estimating both the mel cepstral and linear spectral biases from the observed mismatched speech given only one set of clean speech models; the resulting bias estimates are used to compensate the clean speech models during decoding. The proposed approach is applied to the recognition of noisy Lombard speech, and a significant improvement in word recognition rate is achieved.
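For the cepstral (convolutional) part, the ML bias estimate has a simple closed form under diagonal Gaussians. The sketch below assumes state/mixture posteriors have already been computed by an E-step; the linear-spectral bias of the paper has no such closed form and is omitted. All names are ours.

    import numpy as np

    def estimate_cepstral_bias(frames, means, variances, posteriors):
        """Closed-form ML estimate of one shared additive cepstral bias b,
        under the model y_t ~ N(mu_k + b, var_k) with diagonal covariances.
        posteriors[t, k] are state/mixture occupation probabilities from an
        E-step on the current models."""
        num = np.zeros(means.shape[1])
        den = np.zeros(means.shape[1])
        for t, y in enumerate(frames):
            for k, (mu, var) in enumerate(zip(means, variances)):
                num += posteriors[t, k] * (y - mu) / var
                den += posteriors[t, k] / var
        return num / den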
Beth T. Logan, Cambridge University (U.K.)
Anthony J. Robinson, Cambridge University (U.K.)
This paper describes a new algorithm to enhance and recognise noisy speech when only the noisy signal is available. The system uses autoregressive hidden Markov models (HMMs) to model the clean speech and noise, and combines these to form a model for the noisy speech. The probability framework developed is then used to re-estimate the noise models from the corrupted speech waveform, and the process is repeated. Enhancement is performed using Wiener filters formed from the final clean speech models and noise estimates. Results are presented for additive stationary Gaussian and coloured noise.
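The enhancement step itself is classical Wiener filtering. A self-contained sketch, assuming the clean-speech and noise power spectra are already available per frequency bin (the paper derives them from the decoded autoregressive HMM states):

    import numpy as np

    def wiener_enhance(noisy, speech_psd, noise_psd, frame=256, hop=128):
        """Overlap-add Wiener filtering with gain H = S / (S + N).
        speech_psd and noise_psd are per-bin power spectra of length
        frame // 2 + 1."""
        window = np.hanning(frame)
        gain = speech_psd / (speech_psd + noise_psd)
        out = np.zeros(len(noisy))
        norm = np.zeros(len(noisy))
        for start in range(0, len(noisy) - frame + 1, hop):
            seg = noisy[start:start + frame] * window
            spec = np.fft.rfft(seg) * gain
            out[start:start + frame] += np.fft.irfft(spec, n=frame) * window
            norm[start:start + frame] += window ** 2
        return out / np.maximum(norm, 1e-8)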
Hiroki Yamamoto, Canon Inc (Japan)
Tetsuo Kosaka, Canon Inc (Japan)
Masayuki Yamada, Canon Inc (Japan)
Yasuhiro Komori, Canon Inc (Japan)
Minoru Fujita, Canon Inc (Japan)
In this paper, we describe a fast speech recognition algorithm for noisy environments. Accurate, fast recognition under noise requires both a very fast search algorithm and models that are well adapted to the noisy environment. First, for model adaptation, we propose MCMS-PMC, a combination of parallel model combination (PMC) and modified cepstral mean subtraction (MCMS), which estimates the cepstral mean while taking the additive noise into account. Then, for fast recognition, we propose new techniques for creating a noise-adapted scalar-quantized codebook, so that MCMS-PMC can be incorporated into IDMM+SQ, the fast scalar-quantization-based recognition algorithm we proposed at ICASSP 96. Finally, the effectiveness of the proposed method is demonstrated through speaker-independent, telephone-bandwidth continuous speech recognition experiments.
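As an illustration of estimating a cepstral mean while taking additive noise into account (the general idea behind MCMS; the estimator below is our own guess, not the paper's), one can remove an estimated noise power in the linear spectral domain before averaging:

    import numpy as np
    from scipy.fftpack import dct, idct

    def noise_aware_cepstral_mean(cep_frames, noise_cep):
        """Channel cepstral mean estimated with the additive noise removed
        (our sketch, not the paper's exact MCMS estimator). The noise power
        is subtracted in the linear spectral domain, with a small floor,
        before the frames are averaged in the cepstral domain."""
        lin_noise = np.exp(idct(noise_cep, norm='ortho'))
        clean_ceps = []
        for c in cep_frames:
            lin = np.exp(idct(c, norm='ortho'))
            lin_speech = np.maximum(lin - lin_noise, 1e-3 * lin)  # floor
            clean_ceps.append(dct(np.log(lin_speech), norm='ortho'))
        return np.mean(clean_ceps, axis=0)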
Bhiksha Raj, Carnegie Mellon University, Pittsburgh (U.S.A.)
Vipul Parikh, Carnegie Mellon University, Pittsburgh (U.S.A.)
Richard Stern, Carnegie Mellon University, Pittsburgh (U.S.A.)
Recognition of broadcast data, such as TV and radio programs, is a topic of great interest. One of the problems with such data is the frequent presence of background music, which degrades the performance of speech recognition systems. In this paper we examine the effects of different kinds of music on automatic speech recognition systems by comparing them with the relatively well-known effects of white noise. We also examine the extent to which compensation algorithms that have been applied successfully to noisy speech also help improve recognition accuracy for speech corrupted by music. It is hoped that these experimental comparisons will lead to a better understanding of how to compensate for the effects of background music.
Jenq-Neng Hwang, University of Washington (U.S.A.)
Chien-Jen Wang, University of Washington (U.S.A.)
This paper presents a maximum likelihood joint-space adaptation technique for robust speech recognition. In this joint-space adaptation process, N-best HMM inversion adapts the speech features frame by frame, non-parametrically, to compensate for temporal deviations, while the models are transformed parametrically to capture the global characteristics of the mismatch. The proposed method compensates for the mismatch better than either single-space adaptation alone. The algorithm operates only on the given test speech and the models, so no stereo or adaptation data are required. As verified by experiments performed under different mismatch environments, the proposed method improves performance in all cases without degrading performance under matched conditions.
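The following toy alternation conveys the joint-space flavour with drastic simplifications: unit-variance Gaussian means and a nearest-mean alignment stand in for the HMM and the N-best inversion, and a single global bias stands in for the parametric model transform. It illustrates the alternation only, not the authors' algorithm.

    import numpy as np

    def joint_space_adapt(frames, means, iters=3, step=0.5):
        """Alternate a parametric model update (a single global bias) with a
        non-parametric per-frame feature correction. frames: (T, D) array;
        means: (K, D) Gaussian means, unit variances assumed."""
        bias = np.zeros(frames.shape[1])
        x = frames.copy()
        for _ in range(iters):
            # alignment substitute: nearest bias-compensated mean per frame
            d = ((x[:, None, :] - (means + bias)[None, :, :]) ** 2).sum(-1)
            target = means[d.argmin(axis=1)] + bias
            # parametric step: the global bias absorbs the mean residual
            bias += (x - target).mean(axis=0)
            # non-parametric step: each frame moves part-way to its target
            x += step * (target - x)
        return x, bias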
Kuan-Chieh Yen, University of Illinois (U.S.A.)
Yunxin Zhao, University of Illinois (U.S.A.)
A signal-separation front-end based on adaptive decorrelation filtering (ADF) was integrated with an HMM-based speaker-independent continuous speech recognition system for co-channel speech recognition. The ADF is improved by addressing the adaptation gain for both system stability and efficiency: an upper bound on the adaptation rate is derived for stability, and an accelerated sequence of adaptation gains is introduced for efficiency. The system was evaluated under simulated room acoustic conditions with both time-invariant and time-varying channels. It is shown that the system significantly improved the signal-to-interference ratio and the word recognition accuracy, and that combining the derived upper bound on the adaptation rate with the accelerated adaptation gain sequence achieved the best balance of stability and efficiency.
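A toy sample-by-sample version of the decorrelation idea: symmetric cross-coupling filters are adapted so the two outputs become decorrelated. The fixed gain mu stands in for the paper's stability-bounded, accelerated gain sequence, and signs and scalings here are ours.

    import numpy as np

    def adf_separate(x1, x2, taps=32, mu=1e-4):
        """Separate two co-channel signals: each output is its input minus a
        filtered version of the other output, and the cross-coupling
        filters are adapted to reduce the cross-correlation of the
        outputs."""
        n = len(x1)
        a = np.zeros(taps)          # filter from y1 into channel 2
        b = np.zeros(taps)          # filter from y2 into channel 1
        y1, y2 = np.zeros(n), np.zeros(n)
        y1[:taps], y2[:taps] = x1[:taps], x2[:taps]
        for t in range(taps, n):
            y1[t] = x1[t] - b @ y2[t - taps:t][::-1]
            y2[t] = x2[t] - a @ y1[t - taps:t][::-1]
            # decorrelation updates (fixed gain; the paper bounds and
            # accelerates this gain for stability and efficiency)
            b += mu * y1[t] * y2[t - taps:t][::-1]
            a += mu * y2[t] * y1[t - taps:t][::-1]
        return y1, y2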
Martin P. Cooke, University of Sheffield (U.K.)
Andrew C. Morris, University of Sheffield (U.K.)
Philip D. Green, University of Sheffield (U.K.)
In noisy listening conditions, the information available on which to base speech recognition decisions is necessarily incomplete: some spectro-temporal regions are dominated by other sources. We report on the application of a variety of missing-data techniques to speech recognition. These techniques may be based on marginal distributions or on reconstruction of the missing parts of the spectrum. Applying these ideas to the Resource Management task yields performance that is robust to random removal of up to 80% of the frequency channels, but that falls off rapidly with deletions which more realistically simulate masked speech. We report on a vowel classification experiment designed to isolate some of the RM problems for more detailed exploration. The results of this experiment confirm the general superiority of marginals-based schemes, demonstrate the viability of shared covariance statistics, and suggest several ways in which performance improvements on the larger task may be obtained.
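The marginals-based scoring is easy to state for diagonal Gaussians: integrating out a missing dimension simply drops its term from the log-likelihood. A minimal sketch, with all names ours:

    import numpy as np

    def marginal_log_likelihood(frame, mask, mean, var):
        """Log-likelihood of one spectral frame under a diagonal Gaussian,
        marginalising out the unreliable (masked) channels."""
        m = mask.astype(bool)      # True where the channel is reliable
        d = frame[m] - mean[m]
        return -0.5 * np.sum(np.log(2 * np.pi * var[m]) + d * d / var[m])

    # usage: score a 20-channel frame with 80% of channels deleted at random
    rng = np.random.default_rng(1)
    frame, mean, var = rng.normal(size=20), np.zeros(20), np.ones(20)
    mask = rng.random(20) > 0.8
    print(marginal_log_likelihood(frame, mask, mean, var))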
Detlef Hardt, Technical University of Berlin (Germany)
Klaus Fellbaum, Brandenburg Technical University of Cottbus (Germany)
In real text-dependent, telephone-based speaker verification systems, both additive and convolutional noise influence the error rate considerably. In this paper, different procedures for making a speaker verification system more robust against noise are compared. We use either spectral subtraction in addition to MFCC feature extraction, or PLP and RASTA-PLP alone (without spectral subtraction). For spectral subtraction, two variants were examined: one placed in front of the system as a pre-processing stage, and one integrated into the MFCC computation. The first variant has the advantage that the window length can be chosen independently of that of the MFCC procedure, which led to better results. However, the most effective procedure for telephone speech data is J-RASTA-PLP, although estimating the optimal J factor is difficult. At first we used a fixed J factor based on an off-line measurement of the noise power. Finally, we performed some experiments to optimize the system w
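The variant integrated into the MFCC computation can be sketched as follows, assuming mel filterbank power outputs per frame and an off-line noise power estimate; the Berouti-style over-subtraction factor and floor are illustrative choices of ours, not the paper's settings:

    import numpy as np
    from scipy.fftpack import dct

    def mfcc_with_subtraction(power_frames, noise_power,
                              alpha=2.0, beta=0.01, n_ceps=12):
        """Spectral subtraction inside the MFCC computation: the noise
        power estimate is removed from each (mel filterbank) power frame
        before the log and DCT. alpha is the over-subtraction factor and
        beta the spectral floor; both values are illustrative."""
        ceps = []
        for p in power_frames:
            clean = np.maximum(p - alpha * noise_power, beta * p)
            ceps.append(dct(np.log(clean), norm='ortho')[:n_ceps])
        return np.array(ceps)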
Kari Laurila, NRC (Finland)
In this paper, we present a method for incorporating and re-estimating state duration constraints within maximum likelihood training of hidden Markov models. In the recognition phase, we find the optimal state sequence that satisfies the duration constraints obtained in the training phase. Our goal is to make speaker-dependent training and recognition perform well with very little training data when there is a mismatch between the training and testing environments. We take advantage of the fact that speakers tend to preserve their speaking style in similar situations (e.g. when speaking to a machine), and our main means of reaching this goal is to force similar state segmentations in the training and recognition phases. We show that the proposed method substantially improves the robustness of a speech recognizer, decreasing error rates by over 93% compared with a standard approach.
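Duration constraints of the minimum-duration kind can be imposed at decode time by a standard state-chaining construction; the sketch below shows that construction only (the paper additionally re-estimates the constraints during ML training):

    import numpy as np

    def expand_min_durations(trans, min_dur):
        """Enforce a minimum duration per HMM state by replacing state i
        with min_dur[i] chained copies; the chain must be traversed before
        the state can be left, and only the last copy keeps the original
        self-loop and exit transitions."""
        offsets = np.cumsum([0] + list(min_dur))   # first copy of each state
        big = np.zeros((offsets[-1], offsets[-1]))
        for i, d in enumerate(min_dur):
            first = offsets[i]
            for j in range(d - 1):                 # deterministic chain
                big[first + j, first + j + 1] = 1.0
            last = first + d - 1
            big[last, last] = trans[i, i]          # self-loop on last copy
            for k in range(len(min_dur)):          # exits to other states
                if k != i:
                    big[last, offsets[k]] = trans[i, k]
        return big

    # usage: a 2-state left-to-right model with minimum durations 3 and 2
    A = np.array([[0.6, 0.4], [0.0, 1.0]])
    print(expand_min_durations(A, [3, 2]))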