Features for Automatic Speech Recognition II

Chair: M. Lennig, Nuance, USA


An Acoustic-Phonetic Feature-Based System for the Automatic Recognition of Fricative Consonants

Authors:

Ahmed M. Abdelatty Ali, University of Pennsylvania (U.S.A.)
Jan Van der Spiegel, University of Pennsylvania (U.S.A.)
Paul Mueller, Corticon Inc. (U.S.A.)

Volume 2, Page 961, Paper number 1086

Abstract:

In this paper, the acoustic-phonetic characteristics and the automatic recognition of the American English fricatives are investigated. The acoustic features that exist in the literature are evaluated and new features are proposed. To test the value of the extracted features, a knowledge-based acoustic-phonetic system for the automatic recognition of fricatives in speaker-independent continuous speech is proposed. The system uses auditory-based front-end processing and incorporates new algorithms for the extraction and manipulation of the acoustic-phonetic features, which proved rich in information content. Several features, describing the relative amplitude, the location of the most dominant spectral peak, the spectral shape, and the duration of the unvoiced portion, are combined in the recognition process. Recognition accuracies of 95% for voicing detection and 93% for place-of-articulation detection are obtained on TIMIT continuous speech from 22 speakers covering 5 different dialect regions.
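
Two of the cues the abstract names, the location of the most dominant spectral peak and a relative-amplitude measure, can be sketched as follows. This is an illustrative toy only; the paper's actual feature definitions (band edges, normalization, auditory front end) are more elaborate, and the 2500 Hz split used here is an assumption.

```python
import numpy as np

def fricative_cues(frame, sample_rate=16000):
    """Toy versions of two cues from the abstract: dominant spectral peak
    location and a crude relative amplitude (high-band vs. low-band energy,
    in dB). The 2500 Hz boundary is illustrative, not the paper's value."""
    spectrum = np.abs(np.fft.rfft(frame * np.hamming(len(frame))))
    freqs = np.fft.rfftfreq(len(frame), d=1.0 / sample_rate)
    peak_hz = freqs[np.argmax(spectrum)]
    low = spectrum[freqs < 2500].sum()
    high = spectrum[freqs >= 2500].sum()
    relative_amplitude = 10 * np.log10((high + 1e-12) / (low + 1e-12))
    return peak_hz, relative_amplitude

# A synthetic "fricative-like" frame: broadband noise
rng = np.random.default_rng(0)
frame = rng.standard_normal(512)
peak_hz, rel_amp = fricative_cues(frame)
```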

ic981086.pdf (From Postscript)


On Second Order Statistics and Linear Estimation of Cepstral Coefficients

Authors:

Yariv Ephraim, George Mason University (U.S.A.)
Mazin Rahim, AT&T Labs (U.S.A.)

Volume 2, Page 965, Paper number 1118

Abstract:

Explicit expressions for the second-order statistics of cepstral components representing clean and noisy signal waveforms are derived. The noise is assumed additive to the signal, and the spectral components of each process are assumed statistically independent complex Gaussian random variables. The key result developed here is an explicit expression for the cross-covariance between the log-spectra of the clean and noisy signals. In the absence of noise, this expression is used to show that the covariance matrix of cepstral components representing a vector of N signal samples approaches a fixed, signal-independent, diagonal matrix at a rate of 1/N². In addition, the cross-covariance expression is used to develop an explicit linear minimum mean-square error estimator for the clean cepstral components given noisy cepstral components. Recognition results on the ten English digits using the fixed covariance matrix and the linear estimator are presented.
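
The estimator mentioned in the abstract has the standard linear MMSE form; as a sketch in generic notation (not the paper's own symbols), with clean cepstrum c_x and noisy cepstrum c_y:

```latex
\hat{c}_x \;=\; \mu_{c_x} \;+\; \Sigma_{c_x c_y}\,\Sigma_{c_y c_y}^{-1}\,\bigl(c_y - \mu_{c_y}\bigr)
```

where the clean-noisy cross-covariance Σ_{c_x c_y} is the quantity for which the paper derives an explicit expression.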

ic981118.pdf (Scanned)


An Algorithm for Robust Signal Modelling in Speech Recognition

Authors:

Rivarol Vergin, CML Technologies (Canada)

Volume 2, Page 969, Paper number 1345

Abstract:

The most popular set of parameters used in recognition systems is the mel-frequency cepstral coefficients. While generally giving good results, the filtering process used in evaluating these parameters reduces the signal resolution in the frequency domain, which can hinder discrimination between phonemes. This paper presents a new parameterization approach that preserves most of the characteristics of mel-frequency cepstral coefficients while maintaining the initial frequency resolution obtained from the fast Fourier transform. The results obtained show that this technique can significantly increase the performance of a recognition system.
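
The resolution loss the abstract refers to comes from the triangular mel filterbank, which collapses all FFT bins into a few dozen channel energies. A minimal sketch of the standard filterbank (parameter values here are common defaults, not the paper's):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=24, n_fft=512, sample_rate=16000):
    """Standard triangular mel filterbank. Mapping n_fft//2 + 1 FFT bins
    onto n_filters channel energies is the frequency-resolution reduction
    the abstract describes."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0),
                          n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sample_rate).astype(int)
    fb = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(1, n_filters + 1):
        left, center, right = bins[i - 1], bins[i], bins[i + 1]
        for k in range(left, center):            # rising slope
            fb[i - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right):           # falling slope
            fb[i - 1, k] = (right - k) / max(right - center, 1)
    return fb

fb = mel_filterbank()
# 257 FFT bins are collapsed into 24 filterbank energies
```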

ic981345.pdf (Scanned)


Automatic Speech Recognition Based on Cepstral Coefficients and a MEL-Based Discrete Energy Operator

Authors:

Hesham Tolba, INRS-Telecommunications (Canada)
Douglas O'Shaughnessy, INRS-Telecommunications (Canada)

Volume 2, Page 973, Paper number 1346

Abstract:

In this paper, a novel feature vector based on both Mel-Frequency Cepstral Coefficients (MFCCs) and a Mel-based nonlinear Discrete-time Energy Operator (MDEO) is proposed as the input to an HMM-based Automatic Continuous Speech Recognition (ACSR) system. Our goal is to improve the performance of such a recognizer using the new feature vector. Experiments show that the new feature vector increases the recognition rate of the ACSR system. The HTK Hidden Markov Model Toolkit was used throughout, with experiments on both the TIMIT and NTIMIT databases. For TIMIT, when the MDEO was included in the feature vector to test a multi-speaker ACSR system, the error rate decreased by about 9.51%. For NTIMIT, on the other hand, the MDEO degraded the recognizer's performance. That is, the new feature vector is useful for clean speech but not for telephone speech.
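
The discrete-time energy operator family the MDEO belongs to is the Teager-Kaiser operator, ψ[n] = x[n]² − x[n−1]·x[n+1]; the "Mel-based" variant applies such an operator per mel band. A sketch of the core operator only (the full MDEO feature is not reproduced here):

```python
import numpy as np

def discrete_energy_operator(x):
    """Teager-Kaiser discrete-time energy operator:
    psi[n] = x[n]^2 - x[n-1] * x[n+1].
    The paper's MDEO applies an operator of this family per mel band;
    this is only the core operator, not the full feature."""
    x = np.asarray(x, dtype=float)
    return x[1:-1] ** 2 - x[:-2] * x[2:]

# For a pure tone A*cos(w*n), the operator is exactly A^2 * sin^2(w),
# a constant tracking the tone's "energy" (amplitude and frequency).
n = np.arange(200)
tone = 0.5 * np.cos(0.3 * n)
psi = discrete_energy_operator(tone)
```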

ic981346.pdf (From Postscript)


Compression of Acoustic Features for Speech Recognition in Network Environments

Authors:

Ganesh N. Ramaswamy, IBM (U.S.A.)
Ponani S. Gopalakrishnan, IBM (U.S.A.)

Volume 2, Page 977, Paper number 1619

Abstract:

In this paper, we describe a new compression algorithm for encoding acoustic features used in typical speech recognition systems. The proposed algorithm uses a combination of simple techniques, such as linear prediction and multi-stage vector quantization, and the current version encodes the acoustic features at a fixed rate of 4.0 kbit/s. The compression algorithm can be used very effectively for speech recognition in network environments, such as those employing a client-server model, or to reduce storage in general speech recognition applications. The algorithm has also been tuned for practical implementation, so that its computational complexity and memory requirements are modest. We have successfully tested the compression algorithm on many test sets in several different languages, and it performed very well, with no significant change in recognition accuracy due to compression.
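
Multi-stage vector quantization, one of the techniques named above, encodes a vector as a sequence of codebook indices, each stage quantizing the residual left by the previous stages. A minimal sketch with random placeholder codebooks (a real coder, like the one described, trains its codebooks and adds linear prediction):

```python
import numpy as np

rng = np.random.default_rng(1)

def nearest(codebook, v):
    """Index of the codeword closest to v in squared Euclidean distance."""
    return int(np.argmin(((codebook - v) ** 2).sum(axis=1)))

def msvq_encode(v, codebooks):
    """Multi-stage VQ: stage i quantizes the residual after stages 0..i-1.
    Codebooks here are random stand-ins, purely for illustration."""
    indices, residual = [], v.copy()
    for cb in codebooks:
        i = nearest(cb, residual)
        indices.append(i)
        residual = residual - cb[i]
    return indices, residual

dim = 13                                     # e.g. one cepstral vector
codebooks = [rng.standard_normal((64, dim))  # 2 stages x 6 bits/stage
             for _ in range(2)]
v = rng.standard_normal(dim)
indices, err = msvq_encode(v, codebooks)
```

Transmitting only the stage indices (here 12 bits per vector) instead of the vector itself is what makes the low fixed bit rate possible.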

ic981619.pdf (From Postscript)


Speaker Clustering for Speech Recognition Using the Parameters Characterizing Vocal-Tract Dimensions

Authors:

Masaki Naito, ATR-ITL (Japan)
Li Deng, ATR-ITL (Japan)
Yoshinori Sagisaka, ATR-ITL (Japan)

Volume 2, Page 981, Paper number 1889

Abstract:

We propose speaker clustering methods based on vocal-tract-size-related articulatory parameters associated with individual speakers. Two parameters characterizing gross vocal-tract dimensions are first derived from the formants of speaker-specific Japanese vowels, and are then used to cluster a total of 148 male Japanese speakers. The resulting speaker clusters are found to differ significantly from those obtained by conventional acoustic criteria. Japanese phoneme recognition experiments are carried out using speaker-clustered tied-state HMMs (HMNets) trained for each cluster. Compared with the baseline gender-dependent model, a 5.7% reduction in recognition error was achieved with the clustering method using vocal-tract parameters.
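
The textbook link between formants and vocal-tract size is the uniform-tube (quarter-wavelength) model, F_k = (2k−1)·c/(4L). A sketch of a length estimate from that relation; note this classic formula is only the background idea, not the paper's actual two parameters, which are derived differently from Japanese vowels:

```python
def vocal_tract_length(formants_hz, speed_of_sound=35000.0):
    """Estimate vocal-tract length L (cm) from formants via the
    uniform-tube model F_k = (2k-1) * c / (4L), solving for L at each
    formant and averaging. c is in cm/s. Illustrative only."""
    estimates = [(2 * k - 1) * speed_of_sound / (4.0 * f)
                 for k, f in enumerate(formants_hz, start=1)]
    return sum(estimates) / len(estimates)

# Neutral-vowel formants near 500/1500/2500 Hz imply roughly a 17.5 cm tract
length_cm = vocal_tract_length([500.0, 1500.0, 2500.0])
```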

ic981889.pdf (From Postscript)


Baby Ears: A Recognition System for Affective Vocalizations

Authors:

Malcolm G. Slaney, Interval Research Corporation (U.S.A.)
Gerald McRoberts, Lehigh University (U.S.A.)

Volume 2, Page 985, Paper number 2138

Abstract:

We collected more than 500 utterances from adults talking to their infants. We automatically classified 65% of the strongest utterances correctly as approval, attentional bids, or prohibition. We used several pitch and formant measures, and a multidimensional Gaussian mixture-model discriminator, to perform this task. As previous studies have shown, changes in pitch are an important cue for affective messages; we found that timbre, or cepstral coefficients, is also important. The utterances of female speakers in this test were easier to classify than those of male speakers. We hope this research will allow us to build machines that sense the "emotional state" of a user.
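
The discriminator described is a Gaussian-based classifier over acoustic measures. As a simplified stand-in (a single full-covariance Gaussian per class rather than a mixture, on made-up two-dimensional pitch features), the decision rule looks like this:

```python
import numpy as np

class GaussianClassifier:
    """Per-class full-covariance Gaussian classifier: a single-Gaussian
    stand-in for the Gaussian mixture-model discriminator the abstract
    describes. The 2-D features below are synthetic, not the paper's."""
    def fit(self, X, y):
        self.classes = sorted(set(y))
        self.params = {}
        for c in self.classes:
            Xc = X[np.asarray(y) == c]
            mu = Xc.mean(axis=0)
            cov = np.cov(Xc.T) + 1e-6 * np.eye(X.shape[1])
            self.params[c] = (mu, np.linalg.inv(cov),
                              np.log(np.linalg.det(cov)))
        return self

    def predict(self, x):
        def neg_log_likelihood(c):   # up to a class-independent constant
            mu, inv_cov, log_det = self.params[c]
            d = x - mu
            return d @ inv_cov @ d + log_det
        return min(self.classes, key=neg_log_likelihood)

rng = np.random.default_rng(2)
# Synthetic (mean pitch in Hz, pitch range in Hz) features per utterance
approval = rng.normal([250.0, 80.0], 5.0, size=(50, 2))
prohibition = rng.normal([150.0, 20.0], 5.0, size=(50, 2))
X = np.vstack([approval, prohibition])
y = ["approval"] * 50 + ["prohibition"] * 50
clf = GaussianClassifier().fit(X, y)
```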

ic982138.pdf (From Postscript)


Quantization of Cepstral Parameters for Speech Recognition over the World Wide Web

Authors:

Vassilios Digalakis, Technical University of Crete (Greece)
Leonardo G. Neumeyer, SRI International (U.S.A.)
Manolis Perakakis, Technical University of Crete (Greece)

Volume 2, Page 989, Paper number 2184

Abstract:

We examine alternative architectures for a client-server model of speech-enabled applications over the World Wide Web. We compare a server-only processing model, where the client encodes and transmits the speech signal to the server, to a model where the recognition front end, implemented as a Java applet, runs locally at the client and encodes and transmits the cepstral coefficients to the recognition server over the Internet. We follow a novel encoding paradigm, trying to maximize recognition performance instead of perceptual reproduction, and we find that by transmitting the cepstral coefficients we can achieve significantly higher recognition performance at a fraction of the bit rate required when encoding the speech signal directly.

ic982184.pdf (From Postscript)
