Abstract - SP29


 
SP29.1

   
An Acoustic-Phonetic Feature-Based System for the Automatic Recognition of Fricative Consonants
A. Ali, J. Van der Spiegel  (University of Pennsylvania, USA);   P. Mueller  (Corticon Inc., USA)
In this paper, the acoustic-phonetic characteristics and the automatic recognition of the American English fricatives are investigated. The acoustic features that exist in the literature are evaluated and new features are proposed. To test the value of the extracted features, a knowledge-based acoustic-phonetic system for the automatic recognition of fricatives in speaker-independent continuous speech is proposed. The system uses auditory-based front-end processing and incorporates new algorithms for the extraction and manipulation of acoustic-phonetic features that proved rich in information content. Several features, describing the relative amplitude, the location of the most dominant spectral peak, the spectral shape, and the duration of the unvoiced portion, are combined in the recognition process. Recognition accuracies of 95% for voicing detection and 93% for place-of-articulation detection are obtained on continuous speech from the TIMIT database, covering 22 speakers from 5 different dialect regions.
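As a concrete illustration of the kind of features involved, the sketch below computes a dominant-peak location and a relative-amplitude measure for a single frame (NumPy; the windowing and normalization choices are our own simplification, not the authors' front end):

    import numpy as np

    def fricative_features(frame, rate):
        # Magnitude spectrum of a Hamming-windowed frame
        spectrum = np.abs(np.fft.rfft(frame * np.hamming(len(frame))))
        freqs = np.fft.rfftfreq(len(frame), d=1.0 / rate)
        peak = np.argmax(spectrum)
        peak_freq = freqs[peak]                              # dominant peak location (Hz)
        rel_amp = spectrum[peak] / (spectrum.sum() + 1e-12)  # relative amplitude
        return peak_freq, rel_amp

    # Example: a noise burst standing in for a fricative frame at 16 kHz
    print(fricative_features(np.random.randn(512), 16000))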
 
SP29.2

   
On Second Order Statistics and Linear Estimation of Cepstral Coefficients
Y. Ephraim  (George Mason University, USA);   M. Rahim  (AT&T Labs, USA)
Explicit expressions for the second-order statistics of cepstral components representing clean and noisy signal waveforms are derived. The noise is assumed additive to the signal, and the spectral components of each process are assumed statistically independent complex Gaussian random variables. The key result developed here is an explicit expression for the cross-covariance between the log-spectra of the clean and noisy signals. In the absence of noise, this expression is used to show that the covariance matrix of cepstral components representing a vector of N signal samples approaches a fixed, signal-independent, diagonal matrix at a rate of 1/N^2. In addition, the cross-covariance expression is used to develop an explicit linear minimum mean square error estimator for the clean cepstral components given the noisy cepstral components. Recognition results on the ten English digits using the fixed covariance matrix and the linear estimator are presented.
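In concrete terms, the linear MMSE estimator has the familiar form c_hat = mu_x + S_xy S_yy^{-1} (c_y - mu_y); the paper's contribution is the closed-form cross-covariance S_xy. A minimal numerical sketch (NumPy; the statistics below are random placeholders, not the derived expressions):

    import numpy as np

    def lmmse_cepstra(c_noisy, mu_x, mu_y, cov_xy, cov_yy):
        # Linear MMSE estimate of the clean cepstra given noisy cepstra
        gain = cov_xy @ np.linalg.inv(cov_yy)
        return mu_x + gain @ (c_noisy - mu_y)

    d = 13                                  # cepstral order (illustrative)
    rng = np.random.default_rng(0)
    A = rng.normal(size=(d, d))
    cov_yy = A @ A.T + d * np.eye(d)        # any positive-definite matrix
    cov_xy = 0.5 * cov_yy                   # placeholder cross-covariance
    mu = np.zeros(d)
    print(lmmse_cepstra(rng.normal(size=d), mu, mu, cov_xy, cov_yy)[:3])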
 
SP29.3

   
An Algorithm for Robust Signal Modelling in Speech Recognition
R. Vergin  (CML Technologies, Canada)
The most popular set of parameters used in recognition systems is the mel frequency cepstral coefficients. While these generally give good results, the filtering process used in their evaluation reduces the signal's resolution in the frequency domain, which can impair discrimination between phonemes. This paper presents a new parameterization approach that preserves most of the characteristics of mel frequency cepstral coefficients while maintaining the initial frequency resolution obtained from the fast Fourier transform. Experimental results show that this technique can significantly increase the performance of a recognition system.
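For orientation, the conventional MFCC pipeline the paper departs from looks like this (NumPy; generic textbook filter-bank and DCT choices, with the resolution loss occurring at the filter-bank smoothing step):

    import numpy as np

    def mfcc(frame, rate, n_filters=24, n_ceps=13):
        spec = np.abs(np.fft.rfft(frame * np.hamming(len(frame)))) ** 2
        n_fft = len(frame)
        mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
        imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
        edges = imel(np.linspace(0.0, mel(rate / 2.0), n_filters + 2))
        bins = np.floor((n_fft + 1) * edges / rate).astype(int)
        fbank = np.zeros((n_filters, len(spec)))
        for i in range(n_filters):          # triangular mel filters
            l, c, r = bins[i], bins[i + 1], bins[i + 2]
            fbank[i, l:c] = np.linspace(0.0, 1.0, c - l, endpoint=False)
            fbank[i, c:r] = np.linspace(1.0, 0.0, r - c, endpoint=False)
        log_e = np.log(fbank @ spec + 1e-12)
        # Type-II DCT decorrelates the log filter-bank energies
        k = np.arange(n_filters)
        dct = np.cos(np.pi * np.outer(np.arange(n_ceps), k + 0.5) / n_filters)
        return dct @ log_e

    print(mfcc(np.random.randn(512), 16000)[:4])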
 
SP29.4

   
Automatic Speech Recognition Based on Cepstral Coefficients and a MEL-Based Discrete Energy Operator
H. Tolba, D. O'Shaughnessy  (INRS-Telecommunications, Canada)
In this paper, a novel feature vector based on both Mel Frequency Cepstral Coefficients (MFCCs) and a Mel-based nonlinear Discrete-time Energy Operator (MDEO) is proposed as the input to an HMM-based Automatic Continuous Speech Recognition (ACSR) system. Our goal is to improve the performance of such a recognizer using the new feature vector. Experiments show that the use of the new feature vector increases the recognition rate of the ACSR system. The HTK Hidden Markov Model Toolkit was used throughout. Experiments were done on both the TIMIT and NTIMIT databases. For the TIMIT database, when the MDEO was included in the feature vector to test a multi-speaker ACSR system, we found that the error rate decreased by about 9.51%. For NTIMIT, on the other hand, the MDEO deteriorated the performance of the recognizer. That is, the new feature vector is useful for clean speech but not for telephone speech.
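The energy-operator family used here is built on the Teager-Kaiser discrete-time operator, Psi[x](n) = x(n)^2 - x(n-1)x(n+1). A minimal sketch (NumPy; applying it per mel band to obtain the MDEO is our reading of the approach, not a verbatim reproduction of the authors' algorithm):

    import numpy as np

    def discrete_energy_operator(x):
        # Teager-Kaiser operator: x(n)^2 - x(n-1) * x(n+1)
        x = np.asarray(x, dtype=float)
        return x[1:-1] ** 2 - x[:-2] * x[2:]

    # For a pure tone the output is nearly constant, ~ (amplitude * freq)^2
    tone = np.sin(0.1 * np.arange(256))
    print(discrete_energy_operator(tone)[:5])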
 
SP29.5

   
Compression of Acoustic Features for Speech Recognition in Network Environments
G. Ramaswamy, P. Gopalakrishnan  (IBM, USA)
In this paper, we describe a new compression algorithm for encoding acoustic features used in typical speech recognition systems. The proposed algorithm uses a combination of simple techniques, such as linear prediction and multi-stage vector quantization, and the current version encodes the acoustic features at a fixed rate of 4.0 kbit/s. The compression algorithm can be used very effectively for speech recognition in network environments, such as those employing a client-server model, or to reduce storage in general speech recognition applications. The algorithm has also been tuned for practical implementations, so that the computational complexity and memory requirements are modest. We have successfully tested the compression algorithm on many test sets from several different languages, and it performed very well, with no significant change in recognition accuracy due to compression.
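A toy version of the two building blocks named above, first-order linear prediction followed by multi-stage vector quantization, might look like this (NumPy; the codebook sizes, predictor coefficient, and bit allocation are placeholders, not the paper's 4.0 kbit/s design):

    import numpy as np

    def msvq_encode(residual, codebooks):
        # Each stage quantizes the residual left by the previous stage
        indices, r = [], residual.copy()
        for cb in codebooks:                # cb: (codewords, dim)
            i = int(np.argmin(((cb - r) ** 2).sum(axis=1)))
            indices.append(i)
            r = r - cb[i]
        return indices

    def msvq_decode(indices, codebooks):
        return sum(cb[i] for cb, i in zip(codebooks, indices))

    dim, a = 13, 0.9                        # a: assumed predictor coefficient
    rng = np.random.default_rng(0)
    prev, cur = rng.normal(size=dim), rng.normal(size=dim)
    books = [rng.normal(size=(16, dim)) for _ in range(3)]  # 3 stages x 4 bits
    idx = msvq_encode(cur - a * prev, books)                # client/encoder side
    cur_hat = a * prev + msvq_decode(idx, books)            # decoder side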
 
SP29.6

   
Speaker Clustering for Speech Recognition Using the Parameters Characterizing Vocal-Tract Dimensions
M. Naito, L. Deng, Y. Sagisaka  (ATR-ITL, Japan)
We propose speaker clustering methods based on vocal-tract-size-related articulatory parameters associated with individual speakers. Two parameters characterizing gross vocal-tract dimensions are first derived from the formants of speaker-specific Japanese vowels, and are then used to cluster a total of 148 male Japanese speakers. The resulting speaker clusters are found to be significantly different from those obtained by conventional acoustic criteria. Japanese phoneme recognition experiments are carried out using speaker-clustered tied-state HMMs (HMNets) trained for each cluster. Compared with the baseline gender-dependent model, a 5.7% reduction in recognition error is achieved with the clustering method based on vocal-tract parameters.
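The clustering step itself can be as simple as k-means in the two-dimensional parameter space; the sketch below uses made-up formant data and a crude formant-to-parameter mapping (NumPy; the paper's actual parameter derivation is not reproduced):

    import numpy as np

    def kmeans(X, k, iters=50, seed=0):
        rng = np.random.default_rng(seed)
        centers = X[rng.choice(len(X), size=k, replace=False)]
        for _ in range(iters):
            labels = np.argmin(((X[:, None, :] - centers) ** 2).sum(-1), axis=1)
            centers = np.array([X[labels == j].mean(0) if np.any(labels == j)
                                else centers[j] for j in range(k)])
        return labels, centers

    # Per speaker: mean F1..F3 of a reference vowel -> two rough proxies
    # for vocal-tract dimensions (values and mapping are illustrative only)
    rng = np.random.default_rng(1)
    formants = rng.normal(size=(148, 3)) * 100 + np.array([500.0, 1500.0, 2500.0])
    params = np.stack([1.0 / formants.sum(1),            # ~ overall tract size
                       formants[:, 1] / formants[:, 2]], axis=1)
    labels, _ = kmeans(params, k=4)
    print(np.bincount(labels))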
 
SP29.7

   
Baby Ears: A Recognition System for Affective Vocalizations
M. Slaney  (Interval Research Corporation, USA);   G. McRoberts  (Lehigh University, USA)
We collected more than 500 utterances from adults talking to their infants. We automatically classified 65% of the strongest utterances correctly as approval, attentional bids, or prohibition. We used several pitch and formant measures, and a multidimensional Gaussian mixture-model discriminator, to perform this task. As previous studies have shown, changes in pitch are an important cue for affective messages; we found that timbre, or cepstral coefficients, is also important. In this test, the utterances of female speakers were easier to classify than those of male speakers. We hope this research will allow us to build machines that sense the "emotional state" of a user.
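A common way to build such a discriminator is one Gaussian mixture model per affect class, classifying by maximum log-likelihood. A sketch assuming scikit-learn and synthetic utterance-level features (the real inputs would be the pitch, formant, and cepstral measures described above):

    import numpy as np
    from sklearn.mixture import GaussianMixture

    classes = ["approval", "attention", "prohibition"]
    rng = np.random.default_rng(0)
    # Synthetic 4-dimensional utterance features, one cloud per class
    train = {c: rng.normal(loc=i, size=(100, 4)) for i, c in enumerate(classes)}

    models = {c: GaussianMixture(n_components=2, covariance_type="diag",
                                 random_state=0).fit(X)
              for c, X in train.items()}

    def classify(x):
        # Pick the class whose GMM assigns the highest log-likelihood
        scores = {c: m.score_samples(x[None, :])[0] for c, m in models.items()}
        return max(scores, key=scores.get)

    print(classify(rng.normal(loc=2.0, size=4)))   # likely "prohibition"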
 
SP29.8

   
Quantization of Cepstral Parameters for Speech Recognition over the World Wide Web
V. Digalakis  (Technical University of Crete, Greece);   L. Neumeyer  (SRI International, USA);   M. Perakakis  (Technical University of Crete, Greece)
We examine alternative architectures for a client-server model of speech-enabled applications over the World Wide Web. We compare a server-only processing model, where the client encodes and transmits the speech signal to the server, to a model where the recognition front end, implemented as a Java applet, runs locally at the client and encodes and transmits the cepstral coefficients to the recognition server over the Internet. We follow a novel encoding paradigm, trying to maximize recognition performance instead of perceptual reproduction, and we find that by transmitting the cepstral coefficients we can achieve significantly higher recognition performance at a fraction of the bit rate required when encoding the speech signal directly.
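As a baseline for what client-side encoding of features involves, the sketch below applies uniform scalar quantization to one frame of cepstral coefficients before transmission (NumPy; the bit allocation and clipping range are arbitrary stand-ins, and the paper's actual coding scheme is more sophisticated):

    import numpy as np

    def quantize(ceps, lo, hi, bits=6):
        # Map each coefficient to an integer index the client can transmit
        levels = 2 ** bits - 1
        idx = np.round((np.clip(ceps, lo, hi) - lo) / (hi - lo) * levels)
        return idx.astype(np.uint8)

    def dequantize(idx, lo, hi, bits=6):
        # Server side: reconstruct approximate coefficients from indices
        return lo + idx / (2 ** bits - 1) * (hi - lo)

    c = np.random.randn(13)                 # one cepstral frame
    lo, hi = -5.0, 5.0
    c_hat = dequantize(quantize(c, lo, hi), lo, hi)
    print(np.max(np.abs(c - c_hat)))        # small reconstruction error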
 
