ICASSP '98 - Abstract - SP19


 
SP19.1

   
On Properties of Modulation Spectrum for Robust Automatic Speech Recognition
N. Kanedera  (Ishikawa National College of Technology, Japan);   H. Hermansky  (Oregon Graduate Institute of Science and Technology, USA);   T. Arai  (International Computer Science Institute, USA)
We report on the effect of band-pass filtering the time trajectories of spectral envelopes on speech recognition. Several types of filter (linear-phase FIR, DCT, and DFT) are studied. The results indicate the relative importance of different components of the modulation spectrum of speech for ASR. The general conclusions are: (1) most of the useful linguistic information lies in modulation frequency components between 1 and 16 Hz, with the dominant component at around 4 Hz; (2) it is important to preserve the phase information in the modulation frequency domain; (3) features that include components at around 4 Hz in the modulation spectrum outperform conventional delta features; and (4) features that represent several modulation frequency bands with appropriate center frequencies and bandwidths increase recognition performance.
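As a rough illustration of the first conclusion, the sketch below (not the authors' code; the frame rate, filter length, and band edges are assumptions) band-pass filters the time trajectory of each log subband energy so that only modulation frequencies of roughly 1 to 16 Hz survive.

    import numpy as np
    from scipy.signal import firwin, lfilter

    def filter_modulation_spectrum(log_spectrogram, frame_rate=100.0,
                                   low_hz=1.0, high_hz=16.0, num_taps=101):
        """log_spectrogram: array of shape (num_frames, num_bands)."""
        # Linear-phase FIR band-pass filter on the modulation (frame-rate) axis.
        taps = firwin(num_taps, [low_hz, high_hz], pass_zero=False, fs=frame_rate)
        # Filter each band's time trajectory independently (axis 0 = time).
        filtered = lfilter(taps, 1.0, log_spectrogram, axis=0)
        # Compensate the linear-phase group delay so frames stay aligned
        # (edge wraparound is ignored in this sketch).
        delay = (num_taps - 1) // 2
        return np.roll(filtered, -delay, axis=0)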
 
SP19.2

   
Spectral Subband Centroid Features for Speech Recognition
K. Paliwal  (Griffith University, Australia)
Cepstral coefficients derived either through linear prediction analysis or from a filter bank are perhaps the most commonly used features in current speech recognition systems. In this paper, we propose spectral subband centroids as new features and use them as a supplement to cepstral features for speech recognition. We show that these features have properties similar to formant frequencies and that they are quite robust to noise. Recognition results reported in the paper justify their usefulness as supplementary features.
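A minimal sketch of the centroid computation follows; the power-spectrum input, the equal-width linear subbands, and the exponent gamma are assumptions for illustration, not the paper's exact configuration.

    import numpy as np

    def subband_centroids(power_spectrum, sample_rate, num_subbands=8, gamma=1.0):
        """power_spectrum: one-sided power spectrum of one frame (length K)."""
        K = len(power_spectrum)
        freqs = np.linspace(0.0, sample_rate / 2.0, K)
        edges = np.linspace(0.0, sample_rate / 2.0, num_subbands + 1)
        centroids = np.empty(num_subbands)
        for m in range(num_subbands):
            band = (freqs >= edges[m]) & (freqs < edges[m + 1])
            w = power_spectrum[band] ** gamma
            # Power-weighted mean frequency of the subband; near a formant the
            # centroid is pulled toward the formant peak.
            centroids[m] = np.sum(freqs[band] * w) / (np.sum(w) + 1e-12)
        return centroids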
 
SP19.3

   
Spectral Weighting of SBCOR for Noise Robust Speech Recognition
S. Kajita, K. Takeda, F. Itakura  (Nagoya University, Japan)
Subband-autocorrelation (SBCOR) analysis is a noise-robust acoustic analysis based on a filter bank and autocorrelation analysis; it aims to extract periodicities associated with the inverse of the center frequency of each subband. In this paper, it is derived that SBCOR amounts to lateral inhibitive weighting (LIW) of the power spectrum, and it is shown that LIW is significantly effective for noise-robust acoustic analysis with a DTW word recognizer. An interpretation of LIW is also given. In the second half of the paper, a technique for flattening the noise spectral envelope with an LPC inverse filter is applied to noise-degraded speech, and DTW word recognition is performed. The idea behind this inverse filtering technique is to weaken the strong periodic components contained in the noise. Experimental results using a 32nd-order LPC inverse filter show that the recognition performance of SBCOR (or LIW) is improved in computer-room noise.
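A minimal sketch of the SBCOR idea (not the authors' implementation; the Butterworth filters and relative bandwidth are assumptions): band-pass filter the frame and take the normalized autocorrelation of each subband output at the lag 1/fc, the inverse of the subband center frequency.

    import numpy as np
    from scipy.signal import butter, lfilter

    def sbcor_frame(frame, sample_rate, center_freqs, rel_bandwidth=0.3):
        values = []
        for fc in center_freqs:
            low = fc * (1.0 - rel_bandwidth / 2.0)
            high = fc * (1.0 + rel_bandwidth / 2.0)
            b, a = butter(4, [low, high], btype="bandpass", fs=sample_rate)
            x = lfilter(b, a, frame)
            lag = int(round(sample_rate / fc))   # one period at the center frequency
            num = np.dot(x[:-lag], x[lag:])      # autocorrelation at that lag
            den = np.dot(x, x) + 1e-12           # zero-lag normalization
            values.append(num / den)
        return np.array(values)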
 
SP19.4

   
Robust Word Recognition Using Threaded Spectral Peaks
B. Strope, A. Alwan  (University of California, Los Angeles, USA)
A novel technique that characterizes the position and motion of dominant spectral peaks in speech significantly reduces the error rate of an HMM-based word-recognition system. The technique comprises approximate auditory filtering, temporal adaptation, identification of local spectral peaks in each frame, grouping of neighboring peaks into threads, estimation of frequency derivatives, and slowly updated approximations of the threads and their derivatives. This processing provides a frame-based speech representation that is both dependent on perceptually salient aspects of the frame's immediate context and well suited to segmentally stationary statistical characterization. In noise, the representation reduces the error rate obtained with standard Mel-filter-based feature vectors by as much as a factor of four, and provides improvements over other common feature-vector manipulations.
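The sketch below is a much simplified stand-in for the threading step only (per-frame peak picking plus nearest-frequency linking across frames); the peak-picking order and the frequency tolerance are assumptions, and the auditory filtering, temporal adaptation, and derivative-estimation stages are omitted.

    import numpy as np
    from scipy.signal import argrelmax

    def thread_spectral_peaks(log_spectrogram, freqs, max_jump_hz=200.0):
        """log_spectrogram: (num_frames, num_bins). Returns a list of threads,
        each a list of (frame_index, frequency) points."""
        threads, active = [], []            # active: list of (last_freq, thread)
        for t, frame in enumerate(log_spectrogram):
            peak_bins = argrelmax(frame, order=2)[0]
            new_active = []
            for f in freqs[peak_bins]:
                # Attach the peak to the closest still-active thread if it lies
                # within the allowed frequency jump; otherwise start a new thread.
                if active:
                    dists = [abs(f - lf) for lf, _ in active]
                    j = int(np.argmin(dists))
                    if dists[j] <= max_jump_hz:
                        _, thread = active.pop(j)
                        thread.append((t, f))
                        new_active.append((f, thread))
                        continue
                thread = [(t, f)]
                threads.append(thread)
                new_active.append((f, thread))
            active = new_active
        return threads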
 
SP19.5

   
A Novel Robust Feature of Speech Signal Based on the Mellin Transform for Speaker-independent Speech Recognition
J. Chen, B. Xu, T. Huang  (National Laboratory of Pattern Recognition, P R China)
This paper presents a novel speech feature: the modified Mellin transform of the log-spectrum of the speech signal (MMTLS for short). Because of the scale-invariance property of the modified Mellin transform, the new feature is insensitive to variation in vocal tract length among individual speakers, and is thus more appropriate for speaker-independent speech recognition than the widely used cepstrum. Preliminary experiments show that the MMTLS-based method performs much better than LPC- and MFC-based methods. Moreover, its error rate is very consistent across different outlier speakers.
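The scale-invariance argument can be illustrated with the following sketch (an illustrative stand-in, not the paper's modified Mellin transform): resampling the log-spectrum on an exponentially spaced frequency grid turns a uniform frequency scaling (a vocal-tract-length change) into a shift along that grid, which the FFT magnitude discards. The grid limits and size are assumptions.

    import numpy as np

    def mellin_like_feature(log_spectrum, sample_rate, num_points=128,
                            f_min=100.0, f_max=4000.0):
        K = len(log_spectrum)
        freqs = np.linspace(0.0, sample_rate / 2.0, K)
        # Exponentially spaced grid: f -> a*f becomes a shift along this axis.
        grid = np.exp(np.linspace(np.log(f_min), np.log(f_max), num_points))
        warped = np.interp(grid, freqs, log_spectrum)
        # The FFT magnitude over the warped axis is (approximately) scale invariant.
        return np.abs(np.fft.rfft(warped - warped.mean()))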
 
SP19.6

   
Stochastic Features for Noise Robust Speech Recognition
N. Iwahashi, H. Pao, H. Honda, K. Minamino, M. Omote  (Sony Corporation, Japan)
This paper describes a novel technique for noise-robust speech recognition that incorporates the characteristics of the noise distribution directly in the features. The feature for each analysis frame has a stochastic form, representing the probability density function of the estimated speech component in the noisy speech. Using the sequence of these probability density functions and hidden Markov models of clean speech, the observation probability of the noisy speech is calculated. No explicit SNR information is used at any stage of the technique. The technique is evaluated on large-vocabulary isolated word recognition in a car-noise environment and is found to clearly outperform nonlinear spectral subtraction (between 13% and 44% reduction in recognition errors).
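One common way to score such a stochastic feature, shown below as an illustrative formulation rather than the paper's exact one, is to treat the frame feature as a Gaussian N(mu_f, var_f) over the estimated clean-speech feature; its expected likelihood under a diagonal-Gaussian HMM output density N(mu_m, var_m) then has the closed form N(mu_f; mu_m, var_f + var_m).

    import numpy as np

    def expected_log_likelihood(mu_f, var_f, mu_m, var_m):
        """All arguments are 1-D arrays; diagonal covariances are assumed."""
        var = var_f + var_m                  # convolution of the two Gaussians
        diff = mu_f - mu_m
        return -0.5 * np.sum(np.log(2.0 * np.pi * var) + diff ** 2 / var)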
 
SP19.7

   
Improved Scale-Cepstral Analysis in Speech
S. Umesh  (IIT, India);   L. Cohen  (Hunter College of CUNY, USA);   D. Nelson  (US Department of Defense, USA)
In this paper, we present improvements over the scale-cepstrum we proposed originally. The scale-cepstrum is motivated by the desire to normalize the first-order effects of differences in vocal-tract length for a given vowel. Our subsequent work (ICSLP '96) showed that a frequency-warping more appropriate than the original log-warping is necessary to account for the frequency dependence of the scale factor. Using this frequency-warping and a modified method of computing the scale-cepstrum, we obtain improved features that provide better separability between vowels than before and are also robust to noise.
 
SP19.8

   
Multi-Band Speech Recognition in Noisy Environments
S. Okawa, E. Bocchieri, A. Potamianos  (AT&T Labs, USA)
This paper presents a new approach to multi-band automatic speech recognition (ASR). Recent work by Bourlard and Hermansky suggests that multi-band ASR gives more accurate recognition, especially in noisy acoustic environments, by combining the likelihoods of different frequency bands. Here we evaluate this likelihood recombination (LC) approach to multi-band ASR and propose an alternative method, feature recombination (FC). In the FC system, after separate acoustic analyzers are applied to each sub-band, a single vector is formed by combining the sub-band features, and the speech classifier computes the likelihood from that vector. Thus, band-limited noise affects only a few of the feature components, as in the multi-band LC system, while at the same time all feature components are jointly modeled, as in conventional ASR. Experimental results show that the FC system can yield better performance than both conventional ASR and the LC strategy on noisy speech.
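The contrast between the two recombination schemes can be sketched as follows (illustrative only; the log_likelihood scorer interface is hypothetical): LC combines per-band log-likelihoods from separate classifiers, while FC concatenates the sub-band features and scores them with a single classifier so that all components are modeled jointly.

    import numpy as np

    def likelihood_recombination(subband_features, subband_models, weights):
        """LC: each sub-band has its own classifier; log-likelihoods are combined."""
        return sum(w * m.log_likelihood(x)
                   for x, m, w in zip(subband_features, subband_models, weights))

    def feature_recombination(subband_features, full_band_model):
        """FC: sub-band features are concatenated and scored by one classifier."""
        joint = np.concatenate(subband_features)
        return full_band_model.log_likelihood(joint)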
 
