Klaus Kasper, University of Frankfurt (Germany)
Herbert Reininger, University of Frankfurt (Germany)
Dietrich Wolf, University of Frankfurt (Germany)
In this paper we present a robust, speaker-independent speech recognition system consisting of a feature extraction stage based on a model of the auditory periphery and a Locally Recurrent Neural Network for scoring the derived feature vectors. A number of recognition experiments were carried out to investigate the robustness of this combination against different types of noise in the test data. The proposed method is compared with Cepstral, RASTA, and JAH-RASTA processing for feature extraction and Hidden Markov Models for scoring. The results show that the information in features from the auditory model is best exploited by Locally Recurrent Neural Networks. The robustness achieved by this combination is comparable to that of JAH-RASTA combined with HMMs, but without requiring explicit adaptation to the noise during speech pauses.
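For readers unfamiliar with the architecture, the following is a minimal sketch of one common form of locally recurrent layer, in which each unit feeds back only its own previous activation rather than the full layer state; the shapes and the tanh nonlinearity are our illustrative assumptions, not details given in the abstract.

```python
import numpy as np

def locally_recurrent_layer(x_seq, W_in, w_fb, b):
    """Sketch of a locally recurrent layer: each unit has a scalar
    self-loop on its own previous activation (element-wise feedback),
    which is what distinguishes locally recurrent networks from fully
    recurrent ones.  x_seq: (T, D) input sequence; W_in: (H, D) input
    weights; w_fb: (H,) per-unit feedback gains; b: (H,) biases."""
    T = x_seq.shape[0]
    h = np.zeros((T, W_in.shape[0]))
    for t in range(T):
        pre = W_in @ x_seq[t] + b
        if t > 0:
            pre += w_fb * h[t - 1]   # element-wise self-feedback only
        h[t] = np.tanh(pre)
    return h
```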
Tai-Hwei Hwang, NTHU (Taiwan)
Lee-Ming Lee, MIT (Taiwan)
Hsiao-Chuan Wang, NTHU (Taiwan)
When a speech signal is contaminated by additive noise, its cepstral coefficients can be regarded as functions of the noise power. Using a Taylor series expansion with respect to noise power, the cepstral vector can be approximated by a nominal vector plus a first-derivative term. The nominal cepstrum corresponds to the clean speech signal, and the first-derivative term is the quantity that adapts the speech feature to the noisy environment. A deviation vector is introduced to estimate the derivative term. Experiments show that feature adaptation based on deviation vectors is superior to projection-based methods.
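The expansion described above can be written compactly as follows; the notation is ours, for illustration only:

```latex
% First-order Taylor expansion of the noisy cepstrum c in the noise
% power N (our notation, not necessarily the authors' own):
%   c(0): nominal cepstrum of the clean speech signal
%   d:    deviation vector estimating the derivative term
c(N) \;\approx\; c(0) \;+\; N \left.\frac{\partial c}{\partial N}\right|_{N=0},
\qquad
\hat{c} \;=\; c(0) + d .
```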
Keith I. Francis, Cedarville College (U.S.A.)
Timothy R. Anderson, Armstrong Laboratory (U.S.A.)
An improved method for phoneme recognition in noise is presented, using an auditory image model and cross-correlation in a binaural approach called the binaural auditory image model (BAIM). Current binaural methods are explained as background to BAIM processing. BAIM and a variation of the cocktail-party processor incorporating the auditory image model are applied in phoneme recognition experiments. The results show that BAIM performs as well as or better than current methods at most signal-to-noise ratios.
Daniel J. Mashao, Brown University (U.S.A.)
John E. Adcock, Brown University (U.S.A.)
In an effort to improve the recognition performance of talker-independent speech systems, many adaptive methods have been proposed. These methods generally seek to exploit the higher recognition rates of talker-dependent systems and extend them to talker-independent systems. This is typically achieved by placing talkers into several categories, usually by gender or vocal-tract size. In this paper we investigate a similar idea, but categorize each utterance independently. An utterance is processed using several spectral compressions, and the compression with the maximum likelihood is then used to train a better model. For testing, the spectral compression with the maximum likelihood is used to decode the utterance. While the spectral compressions divided the utterances well, this did not translate into a significant improvement in performance, and the computational cost increased substantially.
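A minimal sketch of the per-utterance categorization step, assuming hypothetical `analyze` and `log_likelihood` callables in place of the system's own front end and scoring routines:

```python
def categorize_utterances(utterances, compressions, analyze, log_likelihood):
    """Sketch of per-utterance categorization: each utterance is
    analyzed under several spectral compressions and assigned to the
    compression whose features score highest under the current model.
    `analyze` (front end taking an utterance and a compression factor)
    and `log_likelihood` (model scoring) are hypothetical stand-ins.
    In training, the buckets returned here would be used to re-train
    compression-specific models."""
    buckets = {c: [] for c in compressions}
    for utt in utterances:
        scores = {c: log_likelihood(analyze(utt, c)) for c in compressions}
        buckets[max(scores, key=scores.get)].append(utt)
    return buckets
```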
Johan de Veth, University of Nijmegen (The Netherlands)
Louis Boves, University of Nijmegen (The Netherlands)
In this paper we propose an extension to the classical RASTA technique. The new method consists of classical RASTA filtering followed by a phase correction operation. In this manner, the influence of the communication channel is removed as effectively as with classical RASTA, but without introducing the left-context dependency that classical RASTA does. The new method is therefore better suited to automatic speech recognition based on context-independent modeling with Gaussian mixture hidden Markov models. We tested this in the context of connected digit recognition over the phone. When we used context-dependent hidden Markov models (i.e., word models), we found that classical RASTA and phase-corrected RASTA performed equally well. For context-independent phone-based models, we found that phase-corrected RASTA can outperform classical RASTA, depending on the acoustic resolution of the models.
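For reference, a sketch of classical RASTA filtering of log-spectral trajectories using the well-known RASTA band-pass filter, with zero-phase forward-backward filtering shown as one plausible stand-in for a phase-corrected variant; the paper's actual phase-correction operation is not specified in the abstract.

```python
import numpy as np
from scipy.signal import lfilter, filtfilt

# Classical RASTA band-pass filter on log-spectral trajectories:
#   H(z) = 0.1 * (2 + z^-1 - z^-3 - 2 z^-4) / (1 - 0.98 z^-1)
b = 0.1 * np.array([2.0, 1.0, 0.0, -1.0, -2.0])
a = np.array([1.0, -0.98])

def rasta_classical(logspec):
    """Causal RASTA filtering along time (axis 0); each column is one
    log-spectral band trajectory.  The IIR pole is what introduces the
    left-context dependency discussed in the abstract."""
    return lfilter(b, a, logspec, axis=0)

def rasta_zero_phase(logspec):
    """One way to remove the filter's phase distortion:
    forward-backward filtering, which has a zero-phase response.
    This is only an illustrative stand-in for the paper's
    phase-correction operation."""
    return filtfilt(b, a, logspec, axis=0)
```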
Shoji Kajita, Nagoya University (Japan)
Kazuya Takeda, Nagoya University (Japan)
Fumitada Itakura, Nagoya University (Japan)
This paper describes an extended subband-crosscorrelation (SBXCOR) analysis intended to improve robustness against noise. SBXCOR analysis, which we have proposed previously, is a binaural speech processing technique that uses two input signals and extracts, in each subband, the periodicities associated with the inverse of the center frequency (CF). In this paper, SBXCOR is extended by taking an exponentially weighted sum of the crosscorrelation values at integer multiples of the inverse of the CF, so as to capture more of the periodicities contained in the two input signals. Experimental results using a DTW word recognizer showed that this processing improves the performance of SBXCOR for both white noise and computer-room noise. For white noise, the extended SBXCOR performed significantly better than the smoothed group delay spectrum and mel-frequency cepstral coefficients (MFCC) extracted from both monaural and binaural signals. For the computer-room noise, however, it outperformed them only at 0 dB SNR.
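A sketch of the extension as we read it from the abstract: the crosscorrelation of the two subband signals is sampled at integer multiples of the period 1/CF and summed with exponentially decaying weights. The parameter names and the normalization are our own assumptions.

```python
import numpy as np

def extended_sbxcor(left, right, cf, fs, n_periods=4, decay=0.5):
    """Illustrative extended-SBXCOR value for one subband: sum the
    normalized crosscorrelation of the two subband signals at integer
    multiples of the period 1/CF, weighted by an exponential decay.
    left, right: subband signals; cf: center frequency (Hz);
    fs: sampling rate (Hz)."""
    period = int(round(fs / cf))   # lag (in samples) of one period 1/CF
    norm = np.sqrt(np.dot(left, left) * np.dot(right, right)) + 1e-12
    value = 0.0
    for k in range(1, n_periods + 1):
        lag = k * period
        if lag >= len(left):
            break
        r = np.dot(left[:-lag], right[lag:]) / norm  # crosscorr at lag
        value += (decay ** (k - 1)) * r
    return value
```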
Michael J. Tomlinson, DRA Malvern (U.K.)
Martin J. Russell, DRA Malvern (U.K.)
Roger K. Moore, DRA Malvern (U.K.)
Andrew P. Buckland, DRA Malvern (U.K.)
Martin A. Fawley, DRA Malvern (U.K.)
Although the possibility of asynchrony between different components of the speech spectrum has been acknowledged, its potential effect on automatic speech recogniser performance has only recently been studied. This paper presents the results of continuous speech recognition experiments in which such asynchrony is accommodated using a variant of HMM decomposition. The paper begins with an investigation of the effects of partitioning the speech spectrum explicitly into sub-bands. Asynchrony between these sub-bands is then accommodated, resulting in a significant decrease in word errors. The same decomposition technique has previously been used successfully to compensate for asynchrony between the two input streams in an audio-visual speech recognition system.
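One way to picture the decomposition is as a search over a composite state space in which each sub-band advances through its own states, with a bound on how far the two streams may drift apart. This sketch is our illustration only; the asynchrony bound is an assumption, not a detail from the paper.

```python
from itertools import product

def composite_states(states_a, states_b, max_async=2):
    """Enumerate the composite state space for two sub-band HMM
    streams: the recognizer searches over pairs of per-band state
    indices, allowing the streams to be at most `max_async` states
    apart before they must re-synchronize."""
    return [(sa, sb) for sa, sb in product(states_a, states_b)
            if abs(sa - sb) <= max_async]

# Example: two 5-state streams with at most 2 states of asynchrony.
space = composite_states(range(5), range(5), max_async=2)
```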
Hervé Bourlard, FPMS - TCTS (Belgium)
Stéphane Dupont, FPMS - TCTS (Belgium)
In the framework of Hidden Markov Models (HMM) or hybrid HMM/Artificial Neural Network (ANN) systems, we present a new approach to automatic speech recognition (ASR). The general idea is to divide the full frequency band (represented in terms of critical bands) into several subbands, compute phone probabilities for each subband on the basis of subband acoustic features, perform dynamic programming independently for each band, and merge the subband recognizers (recombining their respective, possibly weighted, scores) at some segmental level corresponding to temporal anchor points. The results presented in this paper confirm preliminary tests reported earlier. On both isolated word and continuous speech tasks, it is shown that even quite simple recombination strategies allow this subband ASR approach to yield at least comparable performance on clean speech while providing better robustness to narrowband noise.
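A minimal sketch of the recombination step at an anchor point, assuming independent subbands so that a weighted sum of log scores corresponds to a product of the subband likelihoods; the weights shown are illustrative, one simple strategy among those the abstract alludes to.

```python
import numpy as np

def recombine_subband_scores(log_scores, weights=None):
    """Merge per-subband log-likelihoods for one segment (e.g. a word
    or phone between two anchor points) as a weighted sum.  With equal
    weights this reduces to assuming independent subbands."""
    log_scores = np.asarray(log_scores, dtype=float)  # shape: (n_bands,)
    if weights is None:
        weights = np.ones_like(log_scores) / len(log_scores)
    return float(np.dot(weights, log_scores))

# Example: four subbands, with a noise-corrupted band down-weighted.
merged = recombine_subband_scores([-12.3, -10.1, -45.0, -11.7],
                                  weights=[0.3, 0.3, 0.1, 0.3])
```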
Sangita Tibrewala, OGI (U.S.A.)
Hynek Hermansky, ICSI (U.S.A.)
A new approach to automatic speech recognition based on independent class-conditional probability estimates in several frequency sub-bands is presented. The approach is shown to be especially applicable to environments which cause partial corruption of the frequency spectrum of the signal. Some of the issues involved in the implementation of the approach are also addressed.
Brian E.D. Kingsbury, ICSI / UC Berkeley (U.S.A.)
Nelson Morgan, ICSI / UC Berkeley (U.S.A.)
The performance of the PLP, log-RASTA-PLP, and J-RASTA-PLP front ends for recognition of highly reverberant speech is measured and compared with the performance of humans, of an experimental RASTA-like front end, and of a PLP-based recognizer trained on reverberant speech. While humans are able to recognize the reverberant test set reliably, achieving a 6.1% word error rate, the best RASTA-PLP-based recognizer has a word error rate of 68.7% on the same test set, and the PLP-based recognizer trained on reverberant speech has a 50.3% word error rate. Our experimental variant of RASTA processing provides a statistically significant improvement in performance on the reverberant speech, with a best word error rate of 64.1%.
Saeed Vaseghi, QUB (Northern Ireland)
Naomi Harte, QUB (Northern Ireland)
Ben Milner, QUB (Northern Ireland)
This paper explores the modelling of phonetic segments of speech with multi-resolution spectral/time correlates. For spectral representation, a set of multi-resolution cepstral features is proposed. Cepstral features obtained from a DCT of the log energy-spectrum over the full voice bandwidth (100-4000 Hz) are combined with higher-resolution features obtained from DCTs of the lower (say 100-2100 Hz) and upper (2100-4000 Hz) subband halves. This approach can be extended to several levels of different resolutions. For representation of the temporal structure of speech segments, or phones, the conventional cepstral and dynamic cepstral features, which represent speech at the sub-phonetic level, are supplemented by a set of phonetic features that describe the trajectory of speech over the duration of a phoneme. A conditional probability model for the phonetic and subphonetic features is considered. Experimental evaluations demonstrate that the inclusion of segmental features results in about a 10% decrease in error rates.
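A sketch of the two-level multi-resolution cepstrum described above; the feature counts and the orthonormal DCT are illustrative choices, not the authors' exact configuration.

```python
import numpy as np
from scipy.fft import dct

def multires_cepstra(log_spectrum, n_ceps=8):
    """Two-level multi-resolution cepstrum: a DCT over the full-band
    log spectrum is concatenated with DCTs over its lower and upper
    halves, which resolve each half at twice the spectral resolution
    of the full-band analysis."""
    half = len(log_spectrum) // 2
    full = dct(log_spectrum, type=2, norm='ortho')[:n_ceps]
    low = dct(log_spectrum[:half], type=2, norm='ortho')[:n_ceps]
    high = dct(log_spectrum[half:], type=2, norm='ortho')[:n_ceps]
    return np.concatenate([full, low, high])
```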
Javier Hernando, UPC-Barcelona (Spain)
Speech dynamic features are routinely used in current speech recognition systems in combination with short-term (static) spectral features. Although many existing speech recognition systems do not weight the two kinds of features, some weighting appears desirable in order to increase the recognition accuracy of the system. Where such weighting is performed, it is either tuned manually or consists simply of compensating the variances. The aim of this paper is to propose a method for automatically estimating an optimum state-dependent stream weighting in a CDHMM recognition system by means of a maximum-likelihood based training algorithm. Unlike previous work, it is shown that simple constraints on the new weighting parameters make it possible to apply the maximum-likelihood criterion to this problem. Experimental results in speaker-independent digit recognition show a substantial increase in recognition accuracy.
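A sketch of how a state-dependent stream weighting enters the emission score. The particular constraint used here (non-negative weights renormalized to sum to the number of streams) is our reading of "simple constraints", not necessarily the paper's exact formulation.

```python
import numpy as np

def weighted_stream_loglik(log_liks, stream_weights):
    """State emission score as a weighted sum of per-stream
    log-likelihoods (e.g. static vs. dynamic cepstral streams).
    Weights are clipped to be non-negative and renormalized to sum to
    the number of streams, one simple constraint that keeps the
    maximum-likelihood weight estimate well defined."""
    w = np.maximum(np.asarray(stream_weights, dtype=float), 0.0)
    w = w * (len(w) / w.sum())      # renormalize: sum(w) == n_streams
    return float(np.dot(w, np.asarray(log_liks, dtype=float)))

# Example: static and delta streams for one state.
score = weighted_stream_loglik([-34.2, -28.9], [1.3, 0.7])
```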
Richard C. Rose, ATT Labs-Research (U.S.A.)
Eduardo Lleida, University of Zaragoza (Spain)
This paper investigates procedures for obtaining user-configurable speech recognition vocabularies. These procedures use example utterances of vocabulary words to perform unsupervised automatic acoustic baseform determination in terms of a set of speaker independent subword acoustic units. Several procedures, differing both in the definition of subword acoustic model context and in the phonotactic constraints used in decoding have been investigated. The tendency of input utterances to contain out-of-vocabulary or non-speech information is accounted for using likelihood ratio based utterance verification procedures. Comparisons of different definitions of the likelihood ratio used for utterance verification and of different criteria for estimating parameters used in the likelihood ratio test have been performed. The performance of these techniques has been evaluated on utterances taken from a trial of a voice label recognition service.
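A minimal sketch of likelihood-ratio utterance verification as described above, with per-frame normalization and a zero threshold as illustrative choices rather than the paper's settings.

```python
def verify_utterance(target_loglik, alternate_loglik, n_frames,
                     threshold=0.0):
    """Accept an utterance only if the log-likelihood ratio between the
    decoded target (baseform) model and an alternate (filler or
    background) model clears a threshold; this is how out-of-vocabulary
    and non-speech input can be rejected."""
    llr = (target_loglik - alternate_loglik) / max(n_frames, 1)
    return llr > threshold
```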
Alexandros Potamianos, ATT Labs-Research (U.S.A.)
Richard C. Rose, ATT Labs-Research (U.S.A.)
Frequency warping approaches to speaker normalization have been proposed and evaluated on various speech recognition tasks. These techniques have been found to significantly improve performance even for speaker-independent recognition from short utterances over the telephone network. In maximum likelihood (ML) based model adaptation, a linear transformation is estimated and applied to the model parameters in order to increase the likelihood of the input utterance. The purpose of this paper is to demonstrate that significant advantage can be gained by performing frequency warping and ML speaker adaptation in a unified framework. A procedure is described which compensates utterances by simultaneously scaling the frequency axis and reshaping the spectral energy contour. This procedure is shown to reduce the error rate in a telephone-based connected digit recognition task by 30-40%.
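A sketch of the warp-selection half of the procedure (the companion ML linear-transform adaptation is omitted). `extract_features` and `log_likelihood` are hypothetical callables standing in for the recognizer's own routines, and the grid of warp factors is an illustrative choice.

```python
import numpy as np

def select_warp_factor(utterance, extract_features, log_likelihood,
                       warp_factors=np.arange(0.88, 1.13, 0.02)):
    """ML frequency-warp selection: score the utterance under a small
    grid of linear frequency-warp factors and keep the factor that
    maximizes the model likelihood.  extract_features(utterance, warp)
    is a hypothetical warped filterbank front end; log_likelihood
    scores a feature sequence under the current models."""
    best_alpha, best_ll = None, -np.inf
    for alpha in warp_factors:
        ll = log_likelihood(extract_features(utterance, warp=alpha))
        if ll > best_ll:
            best_alpha, best_ll = alpha, ll
    return best_alpha
```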