Klaus Kasper, University of Frankfurt (Germany)
Herbert Reininger, University of Frankfurt (Germany)
Dietrich Wolf, University of Frankfurt (Germany)
In this paper we present a robust, speaker-independent speech recognition system consisting of a feature extraction stage based on a model of the auditory periphery and a Locally Recurrent Neural Network for scoring the derived feature vectors. A number of recognition experiments were carried out to investigate the robustness of this combination against different types of noise in the test data. The proposed method is compared with Cepstral, RASTA, and JAH-RASTA processing for feature extraction and Hidden Markov Models for scoring. The results show that the information in features from the auditory model is best exploited by Locally Recurrent Neural Networks. The robustness achieved by this combination is comparable to that of JAH-RASTA combined with HMMs, but without requiring explicit adaptation to the noise during speech pauses.
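For readers unfamiliar with the architecture, the following is a minimal sketch of one common form of locally recurrent layer, in which each unit feeds back only its own previous activation rather than the full layer state; the shapes and the tanh nonlinearity are our illustrative assumptions, not details given in the abstract.

```python
import numpy as np

def locally_recurrent_layer(x_seq, W_in, w_fb, b):
    """Sketch of a locally recurrent layer: each unit has a scalar
    self-loop on its own previous activation (element-wise feedback),
    which is what distinguishes locally recurrent networks from fully
    recurrent ones.  x_seq: (T, D) input sequence; W_in: (H, D) input
    weights; w_fb: (H,) per-unit feedback gains; b: (H,) biases."""
    T = x_seq.shape[0]
    h = np.zeros((T, W_in.shape[0]))
    for t in range(T):
        pre = W_in @ x_seq[t] + b
        if t > 0:
            pre += w_fb * h[t - 1]   # element-wise self-feedback only
        h[t] = np.tanh(pre)
    return h
```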
Tai-Hwei Hwang, NTHU (Taiwan)
Lee-Ming Lee, MIT (Taiwan)
Hsiao-Chuan Wang, NTHU (Taiwan)
When a speech signal is contaminated by additive noise, its cepstral coefficients can be regarded as functions of the noise power. Using a Taylor series expansion with respect to noise power, the cepstral vector can be approximated by a nominal vector plus a first-derivative term. The nominal cepstrum corresponds to the clean speech signal, and the first-derivative term is the quantity that adapts the speech feature to the noisy environment. A deviation vector is introduced to estimate the derivative term. Experiments show that feature adaptation based on deviation vectors is superior to projection-based methods.
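The expansion described above can be written compactly as follows; the notation is ours, for illustration only:

```latex
% First-order Taylor expansion of the noisy cepstrum c in the noise
% power N (our notation, not necessarily the authors' own):
%   c(0): nominal cepstrum of the clean speech signal
%   d:    deviation vector estimating the derivative term
c(N) \;\approx\; c(0) \;+\; N \left.\frac{\partial c}{\partial N}\right|_{N=0},
\qquad
\hat{c} \;=\; c(0) + d .
```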
Keith I. Francis, Cedarville College (U.S.A.)
Timothy R. Anderson, Armstrong Laboratory (U.S.A.)
An improved method for phoneme recognition in noise is presented, using an auditory image model and cross-correlation in a binaural approach called the binaural auditory image model (BAIM). Current binaural methods are explained as background to BAIM processing. BAIM and a variation of the cocktail-party processor incorporating the auditory image model are applied in phoneme recognition experiments. The results show that BAIM performs as well as or better than current methods at most signal-to-noise ratios.
Daniel J. Mashao, Brown University (U.S.A.)
John E. Adcock, Brown University (U.S.A.)
In an effort to improve the recognition performance of talker-independent speech systems, many adaptive methods have been proposed. These methods generally seek to exploit the higher recognition rates of talker-dependent systems and extend them to talker-independent systems. This is typically achieved by placing talkers into several categories, usually by gender or vocal-tract size. In this paper we investigate a similar idea, but categorize each utterance independently. An utterance is processed using several spectral compressions, and the compression with the maximum likelihood is then used to train a better model. For testing, the spectral compression with the maximum likelihood is used to decode the utterance. While the spectral compressions divided the utterances well, this did not translate into a significant improvement in performance, and the computational cost increased substantially.
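A minimal sketch of the per-utterance categorization step, assuming hypothetical `analyze` and `log_likelihood` callables in place of the system's own front end and scoring routines:

```python
def categorize_utterances(utterances, compressions, analyze, log_likelihood):
    """Sketch of per-utterance categorization: each utterance is
    analyzed under several spectral compressions and assigned to the
    compression whose features score highest under the current model.
    `analyze` (front end taking an utterance and a compression factor)
    and `log_likelihood` (model scoring) are hypothetical stand-ins.
    In training, the buckets returned here would be used to re-train
    compression-specific models."""
    buckets = {c: [] for c in compressions}
    for utt in utterances:
        scores = {c: log_likelihood(analyze(utt, c)) for c in compressions}
        buckets[max(scores, key=scores.get)].append(utt)
    return buckets
```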
Johan de Veth, University of Nijmegen (The Netherlands)
Louis Boves, University of Nijmegen (The Netherlands)
In this paper we propose an extension to the classical RASTA technique. The new method consists of classical RASTA filtering followed by a phase correction operation. In this manner, the influence of the communication channel is removed as effectively as with classical RASTA, but without introducing the left-context dependency that classical RASTA does. The new method is therefore better suited to automatic speech recognition based on context-independent modeling with Gaussian mixture hidden Markov models. We tested this in the context of connected digit recognition over the phone. When we used context-dependent hidden Markov models (i.e., word models), we found that classical RASTA and phase-corrected RASTA performed equally well. For context-independent phone-based models, we found that phase-corrected RASTA can outperform classical RASTA, depending on the acoustic resolution of the models.
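For reference, a sketch of classical RASTA filtering of log-spectral trajectories using the well-known RASTA band-pass filter, with zero-phase forward-backward filtering shown as one plausible stand-in for a phase-corrected variant; the paper's actual phase-correction operation is not specified in the abstract.

```python
import numpy as np
from scipy.signal import lfilter, filtfilt

# Classical RASTA band-pass filter on log-spectral trajectories:
#   H(z) = 0.1 * (2 + z^-1 - z^-3 - 2 z^-4) / (1 - 0.98 z^-1)
b = 0.1 * np.array([2.0, 1.0, 0.0, -1.0, -2.0])
a = np.array([1.0, -0.98])

def rasta_classical(logspec):
    """Causal RASTA filtering along time (axis 0); each column is one
    log-spectral band trajectory.  The IIR pole is what introduces the
    left-context dependency discussed in the abstract."""
    return lfilter(b, a, logspec, axis=0)

def rasta_zero_phase(logspec):
    """One way to remove the filter's phase distortion:
    forward-backward filtering, which has a zero-phase response.
    This is only an illustrative stand-in for the paper's
    phase-correction operation."""
    return filtfilt(b, a, logspec, axis=0)
```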
Shoji Kajita, Nagoya University (Japan)
Kazuya Takeda, Nagoya University (Japan)
Fumitada Itakura, Nagoya University (Japan)
This paper describes an extended subband-crosscorrelation (SBXCOR) analysis intended to improve robustness against noise. SBXCOR analysis, which we have proposed previously, is a binaural speech processing technique that uses two input signals and extracts, in each subband, the periodicities associated with the inverse of the center frequency (CF). In this paper, SBXCOR is extended by taking an exponentially weighted sum of the crosscorrelation values at integer multiples of the inverse of the CF, so as to capture more of the periodicities contained in the two input signals. Experimental results using a DTW word recognizer showed that this processing improves the performance of SBXCOR for both white noise and computer-room noise. For white noise, the extended SBXCOR performed significantly better than the smoothed group delay spectrum and mel-frequency cepstral coefficients (MFCC) extracted from both monaural and binaural signals. For the computer-room noise, however, it outperformed them only at 0 dB SNR.
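A sketch of the extension as we read it from the abstract: the crosscorrelation of the two subband signals is sampled at integer multiples of the period 1/CF and summed with exponentially decaying weights. The parameter names and the normalization are our own assumptions.

```python
import numpy as np

def extended_sbxcor(left, right, cf, fs, n_periods=4, decay=0.5):
    """Illustrative extended-SBXCOR value for one subband: sum the
    normalized crosscorrelation of the two subband signals at integer
    multiples of the period 1/CF, weighted by an exponential decay.
    left, right: subband signals; cf: center frequency (Hz);
    fs: sampling rate (Hz)."""
    period = int(round(fs / cf))   # lag (in samples) of one period 1/CF
    norm = np.sqrt(np.dot(left, left) * np.dot(right, right)) + 1e-12
    value = 0.0
    for k in range(1, n_periods + 1):
        lag = k * period
        if lag >= len(left):
            break
        r = np.dot(left[:-lag], right[lag:]) / norm  # crosscorr at lag
        value += (decay ** (k - 1)) * r
    return value
```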
Michael J. Tomlinson, DRA Malvern (U.K.)
Martin J. Russell, DRA Malvern (U.K.)
Roger K. Moore, DRA Malvern (U.K.)
Andrew P. Buckland, DRA Malvern (U.K.)
Martin A. Fawley, DRA Malvern (U.K.)
Although the possibility of asynchrony between different components of the speech spectrum has been acknowledged, its potential effect on automatic speech recogniser performance has only recently been studied. This paper presents the results of continuous speech recognition experiments in which such asynchrony is accommodated using a variant of HMM decomposition. The paper begins with an investigation of the effects of partitioning the speech spectrum explicitly into sub-bands. Asynchrony between these sub-bands is then accommodated, resulting in a significant decrease in word errors. The same decomposition technique has previously been used successfully to compensate for asynchrony between the two input streams in an audio-visual speech recognition system.
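One way to picture the decomposition is as a search over a composite state space in which each sub-band advances through its own states, with a bound on how far the two streams may drift apart. This sketch is our illustration only; the asynchrony bound is an assumption, not a detail from the paper.

```python
from itertools import product

def composite_states(states_a, states_b, max_async=2):
    """Enumerate the composite state space for two sub-band HMM
    streams: the recognizer searches over pairs of per-band state
    indices, allowing the streams to be at most `max_async` states
    apart before they must re-synchronize."""
    return [(sa, sb) for sa, sb in product(states_a, states_b)
            if abs(sa - sb) <= max_async]

# Example: two 5-state streams with at most 2 states of asynchrony.
space = composite_states(range(5), range(5), max_async=2)
```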
Hervé Bourlard, FPMS - TCTS (Belgium)
Stéphane Dupont, FPMS - TCTS (Belgium)
In the framework of Hidden Markov Models (HMM) or hybrid HMM/Artificial Neural Network (ANN) systems, we present a new approach to automatic speech recognition (ASR). The general idea is to divide the full frequency band (represented in terms of critical bands) into several subbands, compute phone probabilities for each subband on the basis of subband acoustic features, perform dynamic programming independently for each band, and merge the subband recognizers (recombining their respective, possibly weighted, scores) at some segmental level corresponding to temporal anchor points. The results presented in this paper confirm preliminary tests reported earlier. On both isolated word and continuous speech tasks, it is shown that even quite simple recombination strategies allow this subband ASR approach to yield at least comparable performance on clean speech while providing better robustness to narrowband noise.
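A minimal sketch of the recombination step at an anchor point, assuming independent subbands so that a weighted sum of log scores corresponds to a product of the subband likelihoods; the weights shown are illustrative, one simple strategy among those the abstract alludes to.

```python
import numpy as np

def recombine_subband_scores(log_scores, weights=None):
    """Merge per-subband log-likelihoods for one segment (e.g. a word
    or phone between two anchor points) as a weighted sum.  With equal
    weights this reduces to assuming independent subbands."""
    log_scores = np.asarray(log_scores, dtype=float)  # shape: (n_bands,)
    if weights is None:
        weights = np.ones_like(log_scores) / len(log_scores)
    return float(np.dot(weights, log_scores))

# Example: four subbands, with a noise-corrupted band down-weighted.
merged = recombine_subband_scores([-12.3, -10.1, -45.0, -11.7],
                                  weights=[0.3, 0.3, 0.1, 0.3])
```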
Sangita Tibrewala, OGI (U.S.A.)
Hynek Hermansky, ICSI (U.S.A.)
A new approach to automatic speech recognition based on independent class-conditional probability estimates in several frequency sub-bands is presented. The approach is shown to be especially applicable to environments which cause partial corruption of the frequency spectrum of the signal. Some of the issues involved in the implementation of the approach are also addressed.
Brian E.D. Kingsbury, ICSI / UC Berkeley (U.S.A.)
Nelson Morgan, ICSI / UC Berkeley (U.S.A.)
The performance of the PLP, log-RASTA-PLP, and J-RASTA-PLP front ends for recognition of highly reverberant speech is measured and compared with the performance of humans, of an experimental RASTA-like front end, and of a PLP-based recognizer trained on reverberant speech. While humans are able to recognize the reverberant test set reliably, achieving a 6.1% word error rate, the best RASTA-PLP-based recognizer has a word error rate of 68.7% on the same test set, and the PLP-based recognizer trained on reverberant speech has a 50.3% word error rate. Our experimental variant of RASTA processing provides a statistically significant improvement in performance on the reverberant speech, with a best word error rate of 64.1%.
Saeed Vaseghi, QUB (Northern Ireland)
Naomi Harte, QUB (Northern Ireland)
Ben Milner, QUB (Northern Ireland)
This paper explores the modelling of phonetic segments of speech with multi-resolution spectral/time correlates. For spectral representation, a set of multi-resolution cepstral features is proposed. Cepstral features obtained from a DCT of the log energy-spectrum over the full voice bandwidth (100-4000 Hz) are combined with higher-resolution features obtained from DCTs of the lower (say 100-2100 Hz) and upper (2100-4000 Hz) subband halves. This approach can be extended to several levels of different resolutions. For representation of the temporal structure of speech segments, or phones, the conventional cepstral and dynamic cepstral features, which represent speech at the sub-phonetic level, are supplemented by a set of phonetic features that describe the trajectory of speech over the duration of a phoneme. A conditional probability model for the phonetic and subphonetic features is considered. Experimental evaluations demonstrate that the inclusion of segmental features results in about a 10% decrease in error rates.
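A sketch of the two-level multi-resolution cepstrum described above; the feature counts and the orthonormal DCT are illustrative choices, not the authors' exact configuration.

```python
import numpy as np
from scipy.fft import dct

def multires_cepstra(log_spectrum, n_ceps=8):
    """Two-level multi-resolution cepstrum: a DCT over the full-band
    log spectrum is concatenated with DCTs over its lower and upper
    halves, which resolve each half at twice the spectral resolution
    of the full-band analysis."""
    half = len(log_spectrum) // 2
    full = dct(log_spectrum, type=2, norm='ortho')[:n_ceps]
    low = dct(log_spectrum[:half], type=2, norm='ortho')[:n_ceps]
    high = dct(log_spectrum[half:], type=2, norm='ortho')[:n_ceps]
    return np.concatenate([full, low, high])
```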
Javier Hernando, UPC-Barcelona (Spain)
Speech dynamic features are routinely used in current speech recognition systems in combination with short-term (static) spectral features. Although many existing speech recognition systems do not weight the two kinds of features, some weighting appears desirable in order to increase the recognition accuracy of the system. Where such weighting is performed, it is either tuned manually or consists simply of compensating the variances. The aim of this paper is to propose a method for automatically estimating an optimum state-dependent stream weighting in a CDHMM recognition system by means of a maximum-likelihood based training algorithm. Unlike previous work, it is shown that simple constraints on the new weighting parameters make it possible to apply the maximum-likelihood criterion to this problem. Experimental results in speaker-independent digit recognition show a substantial increase in recognition accuracy.
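A sketch of how a state-dependent stream weighting enters the emission score. The particular constraint used here (non-negative weights renormalized to sum to the number of streams) is our reading of "simple constraints", not necessarily the paper's exact formulation.

```python
import numpy as np

def weighted_stream_loglik(log_liks, stream_weights):
    """State emission score as a weighted sum of per-stream
    log-likelihoods (e.g. static vs. dynamic cepstral streams).
    Weights are clipped to be non-negative and renormalized to sum to
    the number of streams, one simple constraint that keeps the
    maximum-likelihood weight estimate well defined."""
    w = np.maximum(np.asarray(stream_weights, dtype=float), 0.0)
    w = w * (len(w) / w.sum())      # renormalize: sum(w) == n_streams
    return float(np.dot(w, np.asarray(log_liks, dtype=float)))

# Example: static and delta streams for one state.
score = weighted_stream_loglik([-34.2, -28.9], [1.3, 0.7])
```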
Richard C. Rose, ATT Labs-Research (U.S.A.)
Eduardo Lleida, University of Zaragoza (Spain)
This paper investigates procedures for obtaining user-configurable speech recognition vocabularies. These procedures use example utterances of vocabulary words to perform unsupervised automatic acoustic baseform determination in terms of a set of speaker independent subword acoustic units. Several procedures, differing both in the definition of subword acoustic model context and in the phonotactic constraints used in decoding have been investigated. The tendency of input utterances to contain out-of-vocabulary or non-speech information is accounted for using likelihood ratio based utterance verification procedures. Comparisons of different definitions of the likelihood ratio used for utterance verification and of different criteria for estimating parameters used in the likelihood ratio test have been performed. The performance of these techniques has been evaluated on utterances taken from a trial of a voice label recognition service.
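A minimal sketch of likelihood-ratio utterance verification as described above, with per-frame normalization and a zero threshold as illustrative choices rather than the paper's settings.

```python
def verify_utterance(target_loglik, alternate_loglik, n_frames,
                     threshold=0.0):
    """Accept an utterance only if the log-likelihood ratio between the
    decoded target (baseform) model and an alternate (filler or
    background) model clears a threshold; this is how out-of-vocabulary
    and non-speech input can be rejected."""
    llr = (target_loglik - alternate_loglik) / max(n_frames, 1)
    return llr > threshold
```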
Alexandros Potamianos, ATT Labs-Research (U.S.A.)
Richard C. Rose, ATT Labs-Research (U.S.A.)
Frequency warping approaches to speaker normalization have been proposed and evaluated on various speech recognition tasks. These techniques have been found to significantly improve performance even for speaker-independent recognition from short utterances over the telephone network. In maximum likelihood (ML) based model adaptation, a linear transformation is estimated and applied to the model parameters in order to increase the likelihood of the input utterance. The purpose of this paper is to demonstrate that significant advantage can be gained by performing frequency warping and ML speaker adaptation in a unified framework. A procedure is described which compensates utterances by simultaneously scaling the frequency axis and reshaping the spectral energy contour. This procedure is shown to reduce the error rate in a telephone-based connected digit recognition task by 30-40%.
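A sketch of the warp-selection half of the procedure (the companion ML linear-transform adaptation is omitted). `extract_features` and `log_likelihood` are hypothetical callables standing in for the recognizer's own routines, and the grid of warp factors is an illustrative choice.

```python
import numpy as np

def select_warp_factor(utterance, extract_features, log_likelihood,
                       warp_factors=np.arange(0.88, 1.13, 0.02)):
    """ML frequency-warp selection: score the utterance under a small
    grid of linear frequency-warp factors and keep the factor that
    maximizes the model likelihood.  extract_features(utterance, warp)
    is a hypothetical warped filterbank front end; log_likelihood
    scores a feature sequence under the current models."""
    best_alpha, best_ll = None, -np.inf
    for alpha in warp_factors:
        ll = log_likelihood(extract_features(utterance, warp=alpha))
        if ll > best_ll:
            best_alpha, best_ll = alpha, ll
    return best_alpha
```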