Chair: Alex Acero, Microsoft Research, USA
Noboru Kanedera, Ishikawa National College of Technology (Japan)
Hynek Hermansky, Oregon Graduate Institute of Science and Technology (U.S.A.)
Takayuki Arai, International Computer Science Institute (U.S.A.)
We report on the effect of band-pass filtering the time trajectories of spectral envelopes on speech recognition. Several types of filters (linear-phase FIR, DCT, and DFT) are studied. The results indicate the relative importance of different components of the modulation spectrum of speech for ASR. The general conclusions are: (1) most of the useful linguistic information lies in modulation frequency components between 1 and 16 Hz, with the dominant component at around 4 Hz; (2) it is important to preserve the phase information in the modulation frequency domain; (3) features that include components around 4 Hz of the modulation spectrum outperform the conventional delta features; (4) features that represent several modulation frequency bands with appropriate center frequencies and bandwidths increase recognition performance.
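The core operation — band-pass filtering the time trajectory of a spectral-envelope channel so that only the 1–16 Hz modulation band survives — can be sketched as follows. This is an illustrative windowed-sinc linear-phase FIR, not the authors' exact filters; the 100 Hz frame rate is an assumption:

```python
import numpy as np

def bandpass_fir(lo_hz, hi_hz, fs, numtaps=101):
    """Linear-phase FIR band-pass: difference of two windowed-sinc low-passes."""
    n = np.arange(numtaps) - (numtaps - 1) / 2
    def lowpass(fc):
        h = 2.0 * fc / fs * np.sinc(2.0 * fc / fs * n)
        return h * np.hamming(numtaps)
    return lowpass(hi_hz) - lowpass(lo_hz)

frame_rate = 100.0  # assumed frame rate of the envelope trajectory (Hz)
h = bandpass_fir(1.0, 16.0, frame_rate)

# synthetic channel trajectory: a 4 Hz component (inside the modulation
# pass-band) plus a 25 Hz component (outside it)
t = np.arange(300) / frame_rate
trajectory = np.sin(2 * np.pi * 4 * t) + np.sin(2 * np.pi * 25 * t)
filtered = np.convolve(trajectory, h, mode="same")
```

Because the filter is linear-phase (symmetric taps), the phase structure of the retained modulation components is preserved, which conclusion (2) above identifies as important.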
Kuldip K. Paliwal, Griffith University (Australia)
Cepstral coefficients, derived either through linear prediction analysis or from a filter bank, are perhaps the most commonly used features in currently available speech recognition systems. In this paper, we propose spectral subband centroids as new features and use them as a supplement to cepstral features for speech recognition. We show that these features have properties similar to formant frequencies and are quite robust to noise. Recognition results reported in the paper justify their usefulness as supplementary features.
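A spectral subband centroid is simply the energy-weighted mean frequency of each sub-band, which is why it tracks formant-like peaks. A minimal sketch (uniform bands and the exponent `gamma` are assumptions for illustration):

```python
import numpy as np

def subband_centroids(power_spec, fs, n_bands=4, gamma=1.0):
    """Energy-weighted mean frequency (centroid) of each spectral sub-band."""
    n = len(power_spec)
    freqs = np.linspace(0.0, fs / 2.0, n)            # bin center frequencies
    edges = np.linspace(0, n, n_bands + 1).astype(int)
    centroids = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        w = power_spec[lo:hi] ** gamma               # gamma compresses dynamic range
        centroids.append(np.sum(freqs[lo:hi] * w) / (np.sum(w) + 1e-12))
    return np.array(centroids)
```

If a band contains a single dominant spectral peak, its centroid lands on the peak frequency, mimicking a formant estimate; additive broadband noise shifts the weighted mean only mildly, which is the source of the robustness claim.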
Shoji Kajita, Nagoya University (Japan)
Kazuya Takeda, Nagoya University (Japan)
Fumitada Itakura, Nagoya University (Japan)
Subband-autocorrelation (SBCOR) analysis is a noise-robust acoustic analysis based on a filter bank and autocorrelation analysis; it aims to extract periodicities associated with the inverse of the center frequency of each subband. In this paper, it is derived that SBCOR amounts to lateral inhibitive weighting (LIW) of the power spectrum, and it is shown, using a DTW word recognizer, that LIW is significantly effective for noise-robust acoustic analysis. An interpretation of LIW is also given. In the second half of the paper, a technique for flattening the noise spectral envelope with an LPC inverse filter is applied to speech degraded by noise, and DTW word recognition is performed. The idea behind this inverse filtering is to weaken the strong periodic components contained in the noise. Experimental results using a 32nd-order LPC inverse filter show that the recognition performance of SBCOR (or LIW) is improved for computer-room noise.
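One SBCOR coefficient is the autocorrelation of a sub-band signal evaluated at lag 1/fc, the inverse of that band's center frequency. The sketch below uses a hypothetical rectangular FFT-domain band-pass in place of the paper's filter bank:

```python
import numpy as np

def sbcor(x, fs, fc, bw):
    """Normalized subband autocorrelation at lag 1/fc (one SBCOR coefficient).

    Rectangular FFT-domain band-pass is an illustrative simplification;
    the paper uses a proper filter bank.
    """
    X = np.fft.rfft(x)
    f = np.fft.rfftfreq(len(x), 1.0 / fs)
    X[(f < fc - bw / 2) | (f > fc + bw / 2)] = 0.0   # keep band around fc
    y = np.fft.irfft(X, len(x))
    lag = int(round(fs / fc))                        # lag = one period of fc
    return np.dot(y[:-lag], y[lag:]) / (np.dot(y, y) + 1e-12)
```

A periodic component at fc yields a value near 1, while energy that is merely strong but not periodic at 1/fc contributes little, which is what makes the measure noise-robust.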
Brian P Strope, University of California, Los Angeles (U.S.A.)
Abeer Alwan, University of California, Los Angeles (U.S.A.)
A novel technique that characterizes the position and motion of dominant spectral peaks in speech significantly reduces the error rate of an HMM-based word-recognition system. The technique includes approximate auditory filtering, temporal adaptation, identification of local spectral peaks in each frame, grouping of neighboring peaks into threads, estimation of frequency derivatives, and slowly updated approximations of the threads and their derivatives. This processing provides a frame-based speech representation which is both dependent on perceptually salient aspects of the frame's immediate context and well suited to segmentally-stationary statistical characterization. In noise, the representation reduces the error rate obtained with standard Mel-filter-based feature vectors by as much as a factor of 4, and provides improvements over other common feature-vector manipulations.
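The peak-picking and thread-grouping steps can be sketched with a greedy frame-to-frame linker. This is only a toy version of the idea (the matching rule and `max_jump` threshold are assumptions, not the authors' algorithm):

```python
import numpy as np

def local_peaks(frame_db):
    """Indices of local spectral maxima in one frame."""
    return [i for i in range(1, len(frame_db) - 1)
            if frame_db[i] > frame_db[i - 1] and frame_db[i] >= frame_db[i + 1]]

def track_threads(frames_db, max_jump=2):
    """Greedily link each frame's peaks to the nearest peak of the
    previous frame (within max_jump bins), forming peak threads."""
    threads = []    # each thread: list of (frame_index, bin_index)
    active = {}     # bin of a thread's most recent peak -> thread index
    for t, frame in enumerate(frames_db):
        new_active = {}
        for p in local_peaks(frame):
            near = [b for b in active if abs(b - p) <= max_jump]
            if near:
                idx = active[min(near, key=lambda b: abs(b - p))]
                threads[idx].append((t, p))      # continue an existing thread
            else:
                threads.append([(t, p)])         # start a new thread
                idx = len(threads) - 1
            new_active[p] = idx
        active = new_active
    return threads
```

Frequency derivatives would then be estimated from each thread's bin trajectory, giving the per-frame peak position and motion features the abstract describes.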
Jingdong Chen, National Laboratory of Pattern Recognition (China)
Bo Xu, National Laboratory of Pattern Recognition (China)
Taiyi Huang, National Laboratory of Pattern Recognition (China)
This paper presents a novel kind of speech feature: the modified Mellin transform of the log-spectrum of the speech signal (MMTLS for short). Because of the scale-invariance property of the modified Mellin transform, the new feature is insensitive to variation in vocal tract length across individual speakers, and is thus more appropriate for speaker-independent speech recognition than the popularly used cepstrum. Preliminary experiments show that the MMTLS-based method performs much better than the LPC- and MFC-based methods. Moreover, its error rate is very consistent across different outlier speakers.
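The scale-invariance property at work here can be illustrated with a toy construction: sample the log-spectrum on an exponential frequency grid, so that a frequency scaling f → a·f (a longer or shorter vocal tract, to first order) becomes a mere translation, and take the FFT magnitude. This is only a sketch of the invariance principle, not the paper's modified Mellin transform:

```python
import numpy as np

def scale_invariant_mag(log_spec, fmin, fmax, n=512):
    """|FFT| of a log-spectrum sampled on an exponential frequency grid.

    Frequency scaling f -> a*f shifts the sampled sequence, so the FFT
    magnitude is approximately unchanged -- the invariance the Mellin
    transform provides exactly.
    """
    f = fmin * (fmax / fmin) ** (np.arange(n) / n)   # exponential grid
    return np.abs(np.fft.fft(log_spec(f)))

# a smooth "log-spectrum" with a bump at 800 Hz, and a copy with the
# frequency axis scaled by 1.25 (a crude vocal-tract-length change)
bump = lambda f: np.exp(-(np.log(f / 800.0)) ** 2 / (2 * 0.2 ** 2))
m1 = scale_invariant_mag(bump, 100.0, 4000.0)
m2 = scale_invariant_mag(lambda f: bump(1.25 * f), 100.0, 4000.0)
```

The two magnitude vectors come out nearly identical even though the underlying spectra are scaled versions of each other, which is the property that removes speaker-dependent vocal-tract-length variation.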
Naoto Iwahashi, Sony Corporation (Japan)
Hongchang Pao, Sony Corporation (Japan)
Hitoshi Honda, Sony Corporation (Japan)
Katsuki Minamino, Sony Corporation (Japan)
Masanori Omote, Sony Corporation (Japan)
This paper describes a novel technique for noise-robust speech recognition which incorporates the characteristics of the noise distribution directly into the features. The feature of each analysis frame itself has a stochastic form, representing the probability density function of the speech component estimated from the noisy speech. Using the sequence of these probability density functions together with hidden Markov models of clean speech, the observation probability of the noisy speech is calculated. No explicit SNR information is used anywhere in the process. The technique is evaluated on large-vocabulary isolated word recognition in a car-noise environment and is found to clearly outperform nonlinear spectral subtraction (a 13% to 44% reduction in recognition errors).
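When both the stochastic feature and an HMM state output density are modeled as diagonal Gaussians, the expected observation likelihood has a closed form: the two variances simply add. The sketch below shows that identity; it is one standard way to score a distribution-valued feature against a Gaussian state, and is an assumption rather than necessarily the paper's exact formulation:

```python
import numpy as np

def gauss_logpdf(x, mean, var):
    """Log density of a diagonal-covariance Gaussian."""
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mean) ** 2 / var)

def expected_loglik(feat_mean, feat_var, state_mean, state_var):
    """log E_x[N(x; state_mean, state_var)] with x ~ N(feat_mean, feat_var).

    The integral of a product of two Gaussians is itself a Gaussian
    evaluation with summed variances: N(feat_mean; state_mean,
    state_var + feat_var).  Feature uncertainty widens the state density.
    """
    return gauss_logpdf(feat_mean, state_mean, state_var + feat_var)
```

As `feat_var` shrinks to zero this reduces to the usual point-feature likelihood, so conventional scoring is the deterministic special case.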
Srinivasan Umesh, IIT (India)
Leon Cohen, Hunter College of CUNY (U.S.A.)
Douglas J. Nelson, US Department of Defense (U.S.A.)
In this paper, we present improvements over our original scale-cepstrum. The scale-cepstrum is motivated by a desire to normalize the first-order effects of differences in vocal-tract length for a given vowel. Our subsequent work (in ICSLP'96) showed that a frequency warping more appropriate than the originally used log-warping is necessary to account for the frequency dependence of the scale factor. Using this more appropriate frequency warping and a modified method of computing the scale-cepstrum, we obtain improved features that provide better separability between vowels than before and are also robust to noise.
Shigeki Okawa, AT&T Labs (U.S.A.)
Enrico L. Bocchieri, AT&T Labs (U.S.A.)
Alexandros Potamianos, AT&T Labs (U.S.A.)
This paper presents a new approach to multi-band automatic speech recognition (ASR). Recent work by Bourlard and Hermansky suggests that multi-band ASR gives more accurate recognition, especially in noisy acoustic environments, by combining the likelihoods of different frequency bands. Here we evaluate this likelihood recombination (LC) approach to multi-band ASR and propose an alternative method, namely feature recombination (FC). In the FC system, after a separate acoustic analyzer is applied to each sub-band, a single vector is composed by combining the sub-band features, and the speech classifier calculates the likelihood from that vector. Thus band-limited noise affects only a few of the feature components, as in the multi-band LC system, but at the same time all feature components are jointly modeled, as in conventional ASR. Experimental results show that the FC system yields better performance than both conventional ASR and the LC strategy for noisy speech.
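The FC idea — per-band analysis, then one concatenated vector for a single classifier — can be sketched as below. The band edges, the DCT-based per-band analyzer, and the number of coefficients are illustrative assumptions, not the paper's configuration:

```python
import numpy as np

def dct_ii(x):
    """Unnormalized type-II DCT, used here as a tiny cepstral analyzer."""
    n = len(x)
    k = np.arange(n)
    return np.array([np.sum(x * np.cos(np.pi * (k + 0.5) * m / n))
                     for m in range(n)])

def fc_features(power_spec, band_edges, n_ceps=4):
    """Feature recombination: analyze each sub-band separately, then
    concatenate the per-band features into one vector so that a single
    classifier models all bands jointly."""
    feats = []
    for lo, hi in zip(band_edges[:-1], band_edges[1:]):
        log_band = np.log(power_spec[lo:hi] + 1e-12)
        feats.append(dct_ii(log_band)[:n_ceps])  # low-order coefficients per band
    return np.concatenate(feats)
```

Noise confined to one band corrupts only that band's slice of the vector, while the classifier still sees all bands in one observation — the property the abstract contrasts with per-band likelihood recombination.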