ABSTRACT
In this paper, a new approach to linear prediction (LP) analysis is proposed. The approach assumes that the speech signal is cyclostationary and uses the cyclic autocorrelation function to compute the LP parameters. Since the cyclic autocorrelation function of a stationary random signal is zero, independent of its statistical description, the analysis is robust to additive noise, whether white or colored. It is applied to speech recognition, and preliminary results demonstrate its robustness to additive white noise.
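The property the abstract relies on can be illustrated with a minimal sketch. The function below computes one common estimator of the cyclic autocorrelation, R_x^alpha(tau) = (1/N) sum_n x[n+tau] x[n] exp(-j 2 pi alpha n); the function name, the asymmetric-lag convention, and the test signals are illustrative assumptions, not taken from the paper (the LP parameters themselves would then be obtained by solving normal equations built from such values).

```python
import numpy as np

def cyclic_autocorrelation(x, alpha, lag):
    """Estimate R_x^alpha(lag) = (1/N) * sum_n x[n+lag] * x[n] * exp(-j*2*pi*alpha*n).

    For a wide-sense stationary signal this is approximately zero for any
    cycle frequency alpha != 0, which is the basis of the claimed robustness
    to additive stationary noise."""
    N = len(x) - lag
    n = np.arange(N)
    return np.sum(x[n + lag] * x[n] * np.exp(-2j * np.pi * alpha * n)) / N

# A pure tone at frequency 0.1 is cyclostationary with cycle frequency 0.2,
# so its cyclic autocorrelation at alpha = 0.2 is clearly nonzero, while
# stationary white noise gives a value near zero at the same alpha.
n = np.arange(4000)
tone = np.cos(2 * np.pi * 0.1 * n)
noise = np.random.default_rng(0).standard_normal(4000)
r_tone = cyclic_autocorrelation(tone, 0.2, 0)
r_noise = cyclic_autocorrelation(noise, 0.2, 0)
```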
ABSTRACT
The context-dependent modeling technique is extended to include non-speech filler segments occurring between speech word units. In addition to the conventional context-dependent word or subword units, the proposed acoustic modeling provides an efficient way of accounting for the effects of the surrounding speech on the inter-word non-speech segments, especially for small-vocabulary recognition tasks. It is argued that a robust recognition scheme is obtained by explicitly accounting for context-dependent inter-word filler acoustics in training while ignoring their explicit context dependencies during recognition testing. Results on a connected digit recognition task over the telephone network indicate an improvement in the error rate from 2.5% to 0.9%, i.e., about a 64% word-error-rate reduction, using the improved model set.
ABSTRACT
This paper investigates Cepstrum Mean Normalization (CMN), which has been widely acknowledged to be useful for compensating multiplicative distortions. However, the performance of conventional CMN is limited because normalization by a single cepstrum mean vector is not enough to compensate for the many factors of multiplicative distortion in real environments. To solve this problem, a new method, E-CMN, is proposed. The method estimates two cepstrum mean vectors per speaker, one for speech and the other for non-speech, and subtracts them from the input cepstrum. This method is capable of compensating various kinds of multiplicative distortion collectively to normalize input spectra. Furthermore, a new model-adaptive approach, E-CMN/PMC, based on E-CMN and HMM composition, is proposed for environments with both additive noise and multiplicative distortions. This method is simplified in the sense that speech models and an additive noise model can be combined without any iterative operations. Matching gains for all frequency bands between the speech models and the noise model are uniquely estimated as the cepstrum mean vector for speech. The performance of E-CMN/PMC in adverse car environments is finally evaluated.
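The core E-CMN step can be sketched in a few lines: estimate one cepstral mean over speech frames and one over non-speech frames for a speaker, then subtract the matching mean from each frame. The function name and the assumption that per-frame speech/non-speech labels come from a separate voice-activity decision are illustrative, not from the paper.

```python
import numpy as np

def e_cmn(cepstra, is_speech):
    """E-CMN sketch: per-class cepstral mean subtraction.

    cepstra   : (T, D) array of cepstrum frames for one speaker
    is_speech : (T,) boolean speech/non-speech frame labels (assumed given)

    Returns the normalized frames plus the two estimated mean vectors."""
    speech_mean = cepstra[is_speech].mean(axis=0)
    nonspeech_mean = cepstra[~is_speech].mean(axis=0)
    out = np.where(is_speech[:, None],
                   cepstra - speech_mean,
                   cepstra - nonspeech_mean)
    return out, speech_mean, nonspeech_mean

# After normalization, the speech frames and the non-speech frames each
# have (empirical) zero mean, whatever channel offset they started with.
rng = np.random.default_rng(1)
cep = rng.normal(size=(100, 13)) + 5.0   # constant offset mimics a channel
labels = np.zeros(100, dtype=bool)
labels[:60] = True
norm, sp_mean, ns_mean = e_cmn(cep, labels)
```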
ABSTRACT
Signal representation is crucial for designing a speech recognizer. The feature extractor selects the information used by the classifier to perform the recognition. In noisy environments, the data vectors representing the speech signal are altered and recognizer performance is degraded by two main factors: (1) the mismatch between the training and recognition conditions, and (2) the degradation of the signal to be recognized. In such a situation, the representation of the speech signal plays an important role. In this paper, we analyze the importance of the representation for speech recognition in noise and apply the Discriminative Feature Extraction (DFE) method to optimize it. The experiments presented in this work show that the DFE method, which has been successfully applied in clean environments, also improves speech recognizers in noise.
ABSTRACT
In automatic speech recognition (ASR) systems, immunity to additive noise may be introduced either at the preprocessing stage or at the pattern-matching stage. The Feature Selective Modeling (FSM) approach suggested in this paper is applied at the pattern-matching stage but, in contrast to most existing methods, is optimized on a per-model basis so that the noise-robust and phonetically descriptive parameters of a particular model can be placed in focus. For sonorant sounds, this is done by marking the lowest n mean values of each HMM density function as noise sensitive in a log-filterbank representation. Noise robustness is obtained by de-emphasizing the marked feature dimensions. Two de-emphasis methods, mean value masking and dimensional reduction, are presented and experimentally compared to the PMC algorithm [2].
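A minimal sketch of the dimensional-reduction flavour of this idea: flag the n lowest mean values of a diagonal-Gaussian HMM density as noise sensitive, then evaluate the log-likelihood only over the remaining dimensions. Function names and the exact likelihood form are assumptions for illustration; the paper's mean value masking variant would instead replace the marked terms rather than drop them.

```python
import numpy as np

def noise_sensitive_mask(mean, n_mask):
    """FSM-style marking (sketch): in a log-filterbank representation, flag
    the n_mask lowest mean values of a density as noise sensitive and
    return a boolean keep-mask over the feature dimensions."""
    keep = np.ones(mean.shape[0], dtype=bool)
    keep[np.argsort(mean)[:n_mask]] = False
    return keep

def reduced_log_likelihood(x, mean, var, keep):
    """Diagonal-Gaussian log-likelihood over the kept (robust) dims only."""
    diff = x[keep] - mean[keep]
    return -0.5 * (keep.sum() * np.log(2 * np.pi)
                   + np.sum(np.log(var[keep]))
                   + np.sum(diff ** 2 / var[keep]))

# Low-energy log-filterbank channels (the smallest means) are the ones
# most easily swamped by additive noise, so they are the ones de-emphasized.
mean = np.array([1.0, -3.0, 2.0, 0.5])
var = np.ones(4)
keep = noise_sensitive_mask(mean, 2)
ll_at_mean = reduced_log_likelihood(mean, mean, var, keep)
```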
ABSTRACT
We extend the input transformation approach for adapting hybrid connectionist speech recognizers to allow multiple transformations to be trained. Previous work has shown the efficacy of the linear input transformation approach to speaker adaptation [1][2][3], but has focused only on training global transformations. This approach is clearly suboptimal, since it assumes that a single transformation is appropriate for every region of the acoustic feature input space, that is, for every phonetic class, microphone, and noise level. In this paper, we propose a new algorithm to train mixtures of transformation networks (MTNs) in the hybrid connectionist recognition framework. The approach is based on the idea of partitioning the acoustic feature space into R regions and training an input transformation for each region. The transformations are combined probabilistically according to the degree to which the acoustic features belong to each region, with the combination weights derived from a separate acoustic gating network (AGN). We apply the new algorithm to non-native speaker adaptation and present recognition results for the 1994 WSJ Spoke 3 development set. The MTN technique can also be used for noise- or microphone-robust recognition, or for other non-speech neural network pattern recognition problems.
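The combination step described above can be sketched as follows: a gating network produces soft region memberships g_r(x), and the adapted feature is the gate-weighted sum of R per-region linear transformations. All parameter names are illustrative, the gate is assumed to be a single softmax layer, and all weights are assumed to be trained elsewhere (e.g. by backpropagation through the recognizer, as in the hybrid connectionist setting).

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D score vector."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def mtn_transform(x, Ws, bs, Wg, bg):
    """Mixture of transformation networks (sketch).

    x       : (D,) acoustic feature vector
    Ws, bs  : R per-region linear transforms, each (D, D) and (D,)
    Wg, bg  : acoustic gating network parameters, (R, D) and (R,)

    Returns sum_r g_r(x) * (W_r @ x + b_r), with gates g = softmax(Wg x + bg)."""
    g = softmax(Wg @ x + bg)                            # (R,) soft memberships
    ys = np.stack([W @ x + b for W, b in zip(Ws, bs)])  # (R, D) region outputs
    return g @ ys

# Sanity check: with R = 2 identity transforms, the gate-weighted combination
# must reproduce the input exactly, since the gates sum to one.
D = 3
x = np.array([0.5, -1.0, 2.0])
Ws = [np.eye(D), np.eye(D)]
bs = [np.zeros(D), np.zeros(D)]
Wg, bg = np.zeros((2, D)), np.zeros(2)
y = mtn_transform(x, Ws, bs, Wg, bg)
```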