Søren Kamaric Riis, IMM, DTU, Lyngby (Denmark)
Anders Krogh, The Sanger Centre, Wellcome Trust Genome Campus (U.K.)
This paper presents a general framework for hybrids of Hidden Markov models (HMM) and neural networks (NN). In the new framework, called Hidden Neural Networks (HNN), the usual HMM probability parameters are replaced by neural network outputs. To ensure a probabilistic interpretation, the HNN is normalized globally, as opposed to the local normalization enforced on the parameters of standard HMMs. Furthermore, all parameters in the HNN are estimated simultaneously according to the discriminative conditional maximum likelihood (CML) criterion. The HNNs show clear performance gains compared to standard HMMs on TIMIT continuous speech recognition benchmarks. On the task of recognizing five broad phoneme classes, an accuracy of 84% is obtained compared to 76% for a standard HMM. Additionally, we report a preliminary result of 69% accuracy on the TIMIT 39-phoneme task.
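As a rough NumPy sketch of the global-normalization idea (not the authors' implementation), the conditional likelihood of a labelling can be computed as the ratio of a label-constrained path sum to the sum over all paths, with the per-frame match scores supplied by a network. The toy scores, uniform transitions and constraint mask below are illustrative assumptions; maximizing the returned quantity over all parameters, including the networks producing the scores, corresponds to the CML criterion.

    import numpy as np

    def log_sum_exp(a):
        m = np.max(a)
        return m + np.log(np.sum(np.exp(a - m)))

    def forward_logZ(log_scores, log_trans, allowed=None):
        """Forward recursion over states; allowed[t] optionally restricts the
        states usable at frame t (the 'clamped' pass for the labelled phones)."""
        T, S = log_scores.shape
        alpha = log_scores[0].copy()
        if allowed is not None:
            alpha[~allowed[0]] = -np.inf
        for t in range(1, T):
            alpha = np.array([log_sum_exp(alpha + log_trans[:, s]) for s in range(S)])
            alpha += log_scores[t]
            if allowed is not None:
                alpha[~allowed[t]] = -np.inf
        return log_sum_exp(alpha)

    def conditional_log_likelihood(log_scores, log_trans, allowed):
        # log P(labels | x) = log(clamped path sum) - log(free path sum)
        return forward_logZ(log_scores, log_trans, allowed) - forward_logZ(log_scores, log_trans)

    # toy example: 5 frames, 3 states, unnormalized network match scores
    rng = np.random.default_rng(0)
    T, S = 5, 3
    log_scores = rng.normal(size=(T, S))           # network outputs, not locally normalized
    log_trans = np.log(np.full((S, S), 1.0 / S))   # uniform transitions
    allowed = np.zeros((T, S), dtype=bool)
    allowed[:3, 0] = True                          # frames constrained to the labelled states
    allowed[2:, 1] = True
    print(conditional_log_likelihood(log_scores, log_trans, allowed))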
Hideyuki Watanabe, ATR (Japan)
Shigeru Katagiri, ATR (Japan)
In this paper we apply Discriminative Metric Design (DMD), a general methodology for discriminative class-feature design, to a speech recognizer based on Hidden Markov Model (HMM) classification. This implementation makes it possible to represent the salient features of each acoustic unit that are essential for the recognition decision, and accordingly enhances robustness against irrelevant pattern variations. We demonstrate its high utility in experiments on speaker-dependent Japanese word recognition using linear feature extractors and mixture Gaussian HMMs. Furthermore, we summarize several other recently proposed design methods related to DMD and show that they are special implementations of the DMD concept.
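To make the class-feature idea concrete, here is a small, hedged NumPy sketch of class-dependent linear feature metrics: each class scores a pattern in its own transformed feature space. The Mahalanobis-style scorer and all names are assumptions made for illustration; the joint discriminative training of the transforms and the HMMs is omitted.

    import numpy as np

    def class_scores(x, transforms, means, inv_covs):
        """x: (D,) feature vector; transforms[c]: (d, D); means[c]: (d,);
        inv_covs[c]: (d, d). Returns one (negative-distance) score per class."""
        scores = []
        for A, m, P in zip(transforms, means, inv_covs):
            y = A @ x                         # class-specific feature space
            diff = y - m
            scores.append(-diff @ P @ diff)   # Mahalanobis-style score
        return np.array(scores)

    # toy usage: 2 classes, 3-D features projected to 2-D class-specific spaces
    rng = np.random.default_rng(1)
    A = [rng.normal(size=(2, 3)) for _ in range(2)]
    mu = [rng.normal(size=2) for _ in range(2)]
    P = [np.eye(2) for _ in range(2)]
    print(class_scores(rng.normal(size=3), A, mu, P))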
Yonghong Yan, CSLU, OGI (U.S.A.)
Mark Fanty, CSLU, OGI (U.S.A.)
Ronald Cole, CSLU, OGI (U.S.A.)
Neural network training targets for speech recognition are estimated using a novel method. Rather than the usual zero/one targets, continuous targets are generated from forward-backward probabilities, so each training pattern has more than one class active. Experiments showed that the new method decreased the error rate by 15% on a continuous digit recognition task.
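A minimal NumPy sketch of the soft-target idea follows: the forward-backward state posteriors gamma_t(s) replace hard 0/1 targets, so several classes are active per frame. The toy HMM and variable names are illustrative, not taken from the paper.

    import numpy as np

    def logsumexp(a):
        m = np.max(a)
        return m + np.log(np.sum(np.exp(a - m)))

    def state_posteriors(log_a, log_b):
        """log_a: (S, S) transition log-probs, log_b: (T, S) emission log-probs.
        Returns gamma: (T, S), the per-frame state posteriors."""
        T, S = log_b.shape
        alpha = np.full((T, S), -np.inf)
        beta = np.zeros((T, S))
        alpha[0] = log_b[0] - np.log(S)                    # uniform initial state
        for t in range(1, T):
            for s in range(S):
                alpha[t, s] = log_b[t, s] + logsumexp(alpha[t - 1] + log_a[:, s])
        for t in range(T - 2, -1, -1):
            for s in range(S):
                beta[t, s] = logsumexp(log_a[s] + log_b[t + 1] + beta[t + 1])
        log_px = logsumexp(alpha[-1])                      # total data likelihood
        return np.exp(alpha + beta - log_px)

    # toy example: 2 states, 4 frames
    log_a = np.log(np.array([[0.7, 0.3], [0.4, 0.6]]))
    log_b = np.log(np.array([[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.2, 0.8]]))
    gamma = state_posteriors(log_a, log_b)
    print(gamma.sum(axis=1))   # each row sums to 1; the rows become the soft targets
    # training then uses a cross-entropy loss against gamma instead of 0/1 labels:
    #   loss_t = -sum_s gamma[t, s] * log softmax(net(x_t))[s]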
Eric A. Woudenberg, ATRI, Kyoto (Japan)
Alain Biem, ATRI, Kyoto (Japan)
Erik McDermott, ATRI, Kyoto (Japan)
Shigeru Katagiri, ATRI, Kyoto (Japan)
In this paper we propose a simple but powerful method for normalizing various sources of mismatch between training and testing conditions in speech recognizers, based on a recent training methodology called the Generalized Probabilistic Descent method (GPD). In this new framework, a gradient-based method is used to adapt the parameters of the feature extraction process in order to minimize the distortion between new speech data and existing classifier models, whereas most conventional normalization/adaptation methods attempt to adapt the classification parameters. GPD was proposed as a general discriminative training method for pattern recognizers such as neural networks. Until now it has been used only for classifier design, sometimes in combination with the design of a non-adaptive feature extractor. This paper, in contrast, studies the adaptive training benefits of GPD in the framework of normalizing the feature extractor to a new pattern environment.
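The following sketch illustrates the flavour of GPD-style feature normalization in a deliberately simplified setup: a frozen prototype classifier stands in for the recognizer, and only a bias added to every feature vector is adapted by gradient descent on a smoothed misclassification measure. The classifier, the loss shape and the learning-rate values are assumptions for the example, not the paper's configuration.

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    def adapt_bias(frames, labels, prototypes, lr=0.05, alpha=1.0, epochs=10):
        """frames: (N, D) new-environment features, labels: (N,) correct classes,
        prototypes: (C, D) frozen classifier parameters. Returns the learned bias (D,)."""
        b = np.zeros(frames.shape[1])
        for _ in range(epochs):
            for x, k in zip(frames, labels):
                y = x + b
                g = -np.sum((prototypes - y) ** 2, axis=1)    # class scores
                wrong = np.argmax(np.where(np.arange(len(g)) == k, -np.inf, g))
                d = g[wrong] - g[k]                           # misclassification measure
                s = sigmoid(alpha * d) * (1 - sigmoid(alpha * d)) * alpha
                # d(g_j)/d(b) = -2 (y - m_j); descend the smoothed 0/1 loss
                grad = s * (-2 * (y - prototypes[wrong]) + 2 * (y - prototypes[k]))
                b -= lr * grad
        return b

    # toy usage: three frozen 2-D prototypes and features with a constant mismatch
    rng = np.random.default_rng(0)
    protos = np.array([[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]])
    labels = rng.integers(0, 3, size=100)
    frames = protos[labels] + 1.5 + 0.1 * rng.normal(size=(100, 2))
    print(adapt_bias(frames, labels, protos))   # moves toward compensating the +1.5 offset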
Mike Schuster, ATR (Japan)
In this paper a new framework for acoustic model building is presented. It is based on non-uniform segment models, which are learned and scored with a time-bidirectional recurrent neural network. While neural networks in speech recognition systems are usually used to estimate posterior "frame to phoneme" probabilities, here they are used to directly estimate "segment to phoneme" probabilities, which results in an improved duration model. The special MAP approach allows incorporation of long-term dependencies not only on the acoustic side but also on the phone (output) side, which automatically results in parameter-efficient context-dependent models. While the use of neural networks as frame or phoneme classifiers always results in discriminative training of the acoustic information, the MAP approach presented here also incorporates discriminative training of the internally learned phoneme language model. Classification tests on the TIMIT phoneme database gave promising results of 77.75 (82.38)% on the full test set with all 61 (39) symbols.
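As a hedged illustration (made-up dimensions, random weights, forward pass only), a bidirectional recurrent net can read a whole variable-length segment and output "segment to phoneme" posteriors directly:

    import numpy as np

    rng = np.random.default_rng(0)
    D, H, P = 13, 32, 61                # feature dim, hidden units, phoneme classes
    Wf, Uf = rng.normal(0, 0.1, (H, D)), rng.normal(0, 0.1, (H, H))
    Wb, Ub = rng.normal(0, 0.1, (H, D)), rng.normal(0, 0.1, (H, H))
    V = rng.normal(0, 0.1, (P, 2 * H))

    def segment_posteriors(segment):
        """segment: (T, D) frames of one acoustic segment -> (P,) phoneme posteriors."""
        hf = np.zeros(H)
        hb = np.zeros(H)
        for x in segment:                  # left-to-right pass
            hf = np.tanh(Wf @ x + Uf @ hf)
        for x in segment[::-1]:            # right-to-left pass
            hb = np.tanh(Wb @ x + Ub @ hb)
        z = V @ np.concatenate([hf, hb])   # summarize the whole segment at once
        e = np.exp(z - z.max())
        return e / e.sum()

    print(segment_posteriors(rng.normal(size=(17, D))).shape)   # (61,)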
Brian Mak, OGI (U.S.A.)
In applying neural networks to speech recognition, one often finds that slightly different training configurations lead to significantly different networks. Thus different training sessions using different setups will likely end up with "mixed" network configurations representing different solutions in different regions of the data space. This sensitivity to the initial weights, the training parameters and the training data can be exploited to enhance performance by using a committee of neural networks. In this paper, we study various ways to combine context-dependent (CD) and context-independent (CI) neural network phone estimators to improve phone recognition. As a result, we obtain increases in phone recognition accuracy of 6.3% and 2.2% using monophones and biphones, respectively.
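One simple committee rule, shown here as an assumed example rather than the paper's exact scheme, combines the CI and CD frame posteriors by a weighted log-domain average after pooling the CD units onto their monophones:

    import numpy as np

    def combine(ci_post, cd_post, cd_to_ci, w_ci=0.5, w_cd=0.5):
        """ci_post: (T, P) monophone posteriors, cd_post: (T, Q) context-dependent
        posteriors, cd_to_ci: length-Q mapping of each CD unit to its monophone."""
        T, P = ci_post.shape
        cd_mono = np.zeros((T, P))
        for q, p in enumerate(cd_to_ci):          # pool CD units onto monophones
            cd_mono[:, p] += cd_post[:, q]
        log_comb = w_ci * np.log(ci_post + 1e-10) + w_cd * np.log(cd_mono + 1e-10)
        comb = np.exp(log_comb)
        return comb / comb.sum(axis=1, keepdims=True)

    # toy usage: 4 CD units pooled onto 2 monophones over 3 frames
    ci = np.full((3, 2), 0.5)
    cd = np.full((3, 4), 0.25)
    print(combine(ci, cd, cd_to_ci=[0, 0, 1, 1]))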
Christoph Neukirchen, Duisburg University (Germany)
Gerhard Rigoll, Duisburg University (Germany)
This paper deals with the construction and optimization of a hybrid speech recognition system that consists of a combination of a neural vector quantizer (VQ) and discrete HMMs. In our investigations, an integration of VQ-based classification into the continuous classifier framework is given, and some constraints are derived that must hold for the pdfs in the discrete pattern classifier context. Furthermore, it is shown that for ML training of the whole system the VQ parameters must be estimated according to the MMI criterion. A novel gradient-search training method is derived for neural networks that serve as an optimal VQ. This allows faster training of arbitrary network topologies compared to traditional MMI-NN training. Integrating multilayer MMI-NNs as the VQ in the hybrid discrete-HMM-based speech recognizer leads to a large improvement compared to other supervised and unsupervised single-layer VQ systems. For the speaker-independent Resource Management database, the constructed hybrid MMI-connectionist/HMM system achieves recognition rates comparable to traditional, sophisticated continuous-pdf HMM systems.
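For reference, the quantity being maximized when the VQ labels are trained under the MMI criterion is the empirical mutual information between codebook labels and classes; a small sketch of that computation follows (the gradient-based network training itself is not shown, and the names are illustrative):

    import numpy as np

    def mutual_information(vq_labels, classes, n_codes, n_classes):
        joint = np.zeros((n_codes, n_classes))
        for y, c in zip(vq_labels, classes):      # count label/class co-occurrences
            joint[y, c] += 1
        joint /= joint.sum()
        py = joint.sum(axis=1, keepdims=True)
        pc = joint.sum(axis=0, keepdims=True)
        nz = joint > 0
        return np.sum(joint[nz] * np.log(joint[nz] / (py @ pc)[nz]))

    # toy usage: labels perfectly predict the class -> MI = H(class) = log 2
    print(mutual_information([0, 1, 0, 1], [0, 1, 0, 1], n_codes=2, n_classes=2))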
Yuchang Cao, SPRC, Queensland University of Technology (Australia)
Sridha Sridharan, SPRC, Queensland University of Technology (Australia)
Miles Moody, SPRC, Queensland University of Technology (Australia)
A novel speech separation structure which simulates the cocktail party effect using a modified iterative Wiener filter and a multi-layer perceptron neural network is presented. The neural network is used as a speaker recognition system to control the iterative Wiener filter. It is a modified perceptron with one hidden layer, using feature data extracted from LPC cepstral analysis. The proposed technique has been successfully used for speech separation when the interference is competing speech or broad-band noise.
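A bare-bones sketch of one iterative Wiener filtering step per frame is given below, assuming a noise power spectrum estimated elsewhere (for instance from frames the network judges to be interference only); the frame handling and the LPC-based speech-spectrum modelling used in the paper are simplified away.

    import numpy as np

    def iterative_wiener(frame, noise_psd, iterations=3):
        X = np.fft.rfft(frame)
        speech_psd = np.maximum(np.abs(X) ** 2 - noise_psd, 1e-10)   # initial estimate
        for _ in range(iterations):
            H = speech_psd / (speech_psd + noise_psd)                # Wiener gain
            S = H * X
            speech_psd = np.abs(S) ** 2                              # refine speech spectrum
        return np.fft.irfft(H * X, n=len(frame))

    # toy usage: 256-sample frame, flat noise floor matching the noise variance
    rng = np.random.default_rng(0)
    frame = np.sin(2 * np.pi * 0.05 * np.arange(256)) + 0.3 * rng.normal(size=256)
    print(iterative_wiener(frame, noise_psd=np.full(129, 0.3 ** 2 * 256)).shape)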
Devang Naik, Apple Computer Inc. (U.S.A.)
Hands-free desktop command and control speech recognition suffers from the critical drawback of failing to properly reject spurious conversation. This results in false acceptances of unintended speech commands that can inconvenience the user. A neural-network approach is proposed to detect spurious conversation by determining talker location. The approach is based on the premise that spoken utterances not directed towards the microphone source tend to be more reverberant and are likely to be spurious. The method estimates a confidence measure proportional to the amount of reverberation in the end-pointed speech signal. The measure is obtained from a neural network that determines whether the speech was directed at the microphone or was spoken otherwise. The proposed measure can be combined with acoustic, linguistic and semantic information to improve upon decisions taken by conventional rejection modeling schemes.
Tan Lee, Dept. of EE, CUHK (Hong Kong)
Pak Chung Ching, Dept. of EE, CUHK (Hong Kong)
This paper describes a novel design of a neural-network-based speech recognition system for isolated Cantonese syllables. Since Cantonese is a monosyllabic and tonal language, the recognition system consists of a tone recognizer and a base syllable recognizer. The tone recognizer adopts a multi-layer perceptron architecture in which each output neuron represents a particular tone. The syllable recognizer contains a large number of independently trained recurrent networks, each representing a designated Cantonese syllable. Such a modular structure provides greater flexibility to expand the system vocabulary progressively by adding new syllable models. To demonstrate the effectiveness of the proposed method, a speaker-dependent recognition system has been built with the vocabulary growing from 40 syllables to 200 syllables. In the case of 200 syllables, a top-1 recognition accuracy of 81.8% has been attained, and the top-3 accuracy is 95.2%.
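The modular decision rule implied by the abstract can be sketched as follows, with generic scoring callables standing in for the per-syllable recurrent networks and the tone perceptron; expanding the vocabulary amounts to adding one more entry to the model dictionary. All names here are hypothetical.

    def recognize(features, syllable_models, tone_net):
        """syllable_models: dict base-syllable -> scoring callable;
        tone_net: callable returning one score per tone."""
        base = max(syllable_models, key=lambda s: syllable_models[s](features))
        tones = tone_net(features)
        tone = max(range(len(tones)), key=lambda i: tones[i])
        return base, tone

    # toy usage with scalar "features" and dummy scorers
    models = {"si": lambda f: -abs(f - 1.0), "gau": lambda f: -abs(f - 2.0)}
    models["ming"] = lambda f: -abs(f - 3.0)        # vocabulary grows by adding an entry
    print(recognize(2.1, models, lambda f: [0.1, 0.7, 0.2]))   # ('gau', 1)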
Barbara Talle, University of Ulm (Germany)
Gabi Krone, University of Ulm (Germany)
Günther Palm, University of Ulm (Germany)
For technical speech recognition systems as well as for humans, it has been shown that combining acoustic and visual information can enhance speech recognition performance. But it remains an open question at which stage of processing the two information channels should be combined. In this paper we systematically investigate this problem by means of a neural speech recognition system applied to monosyllabic words. Different fusion architectures of multilayer perceptrons are compared for both noiseless and noisy acoustic data. Furthermore, different modularized neural architectures are examined for the acoustic channel alone. The results corroborate the idea of separate processing of the two channels until the final stage of classification.
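The contrast under study can be sketched in a few lines: in late fusion, separate per-channel classifiers are run independently and only their class posteriors are merged at the final decision stage, whereas early fusion concatenates the channels before a single network. The product combination rule below is one common choice, assumed here purely for illustration.

    import numpy as np

    def late_fusion(p_acoustic, p_visual):
        """p_acoustic, p_visual: (C,) class posteriors from the two channel nets."""
        p = p_acoustic * p_visual                   # merge only at the final stage
        return p / p.sum()

    def early_fusion_features(acoustic_vec, visual_vec):
        # contrasting architecture: concatenate channels before one joint net
        return np.concatenate([acoustic_vec, visual_vec])

    print(late_fusion(np.array([0.7, 0.2, 0.1]), np.array([0.5, 0.4, 0.1])))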
Ha-Jin Yu, KAIST (Korea)
Yung-Hwan Oh, KAIST (Korea)
A neural network model based on a non-uniform unit for speaker-independent continuous speech recognition is proposed. The functions of the neural network model include segmenting the input speech into sub-word units, classifying the units and detecting words, and each of these functions is implemented by a module. The proposed recognition unit can include an arbitrary number of phonemes, so that it can absorb co-articulation effects that spread over several phonemes. The unit classifier module separates the speech into stationary and transition parts and uses different parameters for each. The word detector module can learn all the pronunciation variations in the training data. The system is evaluated on a subset of the TIMIT speech data.