Chair: Wu Chou, Bell Labs (U.S.A.)
Peter Beyerlein, Philips Research Labs (Germany)
Discriminative model combination is a new approach in the field of automatic speech recognition which aims at an optimal integration of all given (acoustic and language) models into one log-linear posterior probability distribution. As opposed to the maximum entropy approach, the coefficients of the log-linear combination are optimized on training samples using discriminative methods to obtain an optimal classifier. Three methods for finding coefficients that minimize the empirical word error rate on given training data are discussed: the well-known GPD-based minimum error rate training, a minimization of the mean distance between the discriminant function of the model combination and an ``ideal'' discriminant function, and a minimization of a smoothed error count measure. The latter two methods lead to closed-form solutions for the coefficients of the model combination. The accuracy of a large vocabulary continuous speech recognition system was improved by the new approach.
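The log-linear posterior combination described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation; the hypotheses, model scores, and weight values below are invented for the example:

```python
import math

def log_linear_posterior(log_scores, lambdas):
    """Combine per-model log-scores of each hypothesis into one
    log-linear posterior: p(w|x) ~ exp(sum_i lambda_i * log p_i(w, x))."""
    unnorm = [sum(l * s for l, s in zip(lambdas, scores))
              for scores in log_scores]
    m = max(unnorm)                                   # log-sum-exp for stability
    log_z = m + math.log(sum(math.exp(u - m) for u in unnorm))
    return [math.exp(u - log_z) for u in unnorm]

# Two competing word sequences, each scored by an acoustic model and a
# language model (log-probabilities); the weights are illustrative values
# that would be found by discriminative training in the paper's setting.
log_scores = [(-10.0, -2.0), (-9.0, -4.0)]
post = log_linear_posterior(log_scores, lambdas=(1.0, 0.7))
```

Discriminative training then adjusts the `lambdas` so that the posterior of the correct word sequence is maximized on training data.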
Shawn M Herman, Lucent Technologies (U.S.A.)
Rafid A. Sukkar, Lucent Technologies (U.S.A.)
Vector Quantization (VQ) has been explored in the past as a means of reducing likelihood computation in speech recognizers which use hidden Markov models (HMMs) containing Gaussian output densities. Although this approach has proved successful, there is an extent beyond which further reduction in likelihood computation substantially degrades recognition accuracy. Since the components of the VQ frontend are typically designed after model training is complete, this degradation can be attributed to the fact that VQ and HMM parameters are not jointly estimated. In order to restore the accuracy of a recognizer using VQ to aggressively reduce computation, joint estimation is necessary. In this paper, we propose a technique which couples VQ frontend design with Minimum Classification Error training. We demonstrate on a large vocabulary subword task that in certain cases, our joint training algorithm can reduce the string error rate by 79% compared to that of VQ mixture selection alone.
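As a rough illustration of the kind of VQ-based Gaussian preselection the abstract builds on: each frame is quantized, and only a per-codeword shortlist of mixture components is scored. The shortlist size, unit covariances, and toy data below are assumptions for brevity, not the paper's configuration:

```python
import math

def build_shortlists(codebook, gaussians, k=2):
    """For each VQ codeword, preselect the k Gaussians whose means lie
    closest to it; at decode time only those Gaussians are evaluated."""
    def d2(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    return [sorted(range(len(gaussians)),
                   key=lambda g: d2(code, gaussians[g][0]))[:k]
            for code in codebook]

def fast_log_likelihood(x, codebook, gaussians, shortlists):
    """Quantize the frame, then score only the shortlisted mixture
    components (unit covariance; normalization constant dropped)."""
    def d2(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))
    c = min(range(len(codebook)), key=lambda i: d2(x, codebook[i]))
    logs = [math.log(w) - 0.5 * d2(x, mean)
            for mean, w in (gaussians[g] for g in shortlists[c])]
    m = max(logs)
    return m + math.log(sum(math.exp(v - m) for v in logs))

codebook = [(-1.0,), (1.0,)]
gaussians = [((-1.0,), 0.4), ((1.0,), 0.4), ((10.0,), 0.2)]  # (mean, weight)
shortlists = build_shortlists(codebook, gaussians, k=2)
```

The paper's contribution is to estimate the codebook and HMM parameters jointly under MCE rather than designing the shortlists after training, as this sketch does.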
Luca Rigazio, Panasonic Technologies Inc. (U.S.A.)
Jean-Claude Junqua, Panasonic Technologies Inc. (U.S.A.)
Michael Galler, Panasonic Technologies Inc. (U.S.A.)
Discriminative training is effective in enhancing robustness for recognition tasks characterized by high confusion rates. In this paper, we apply discriminative training to different components of a spelled-word recognizer to improve recognition accuracy among confusable letters. First, we weighted the HMM states to emphasize the discriminative parts of the letters. This training achieved a 17% decrease in unit (letter) error rate when the search was performed with an unconstrained grammar. Then we designed a new algorithm that relies on discriminative training to adapt the grammar transition probabilities and the language weight. This method uses acoustic information to provide a tight coupling between the acoustic and language models. Experimental results showed that state weighting followed by adaptation of a bigram language model reduced the total unit errors by 11% and the unit errors among the E-set of the English alphabet by 12%.
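State weighting of the kind described can be pictured as a Viterbi search in which each state's acoustic log-likelihood is scaled by a trained exponent. The decoder below is a generic textbook Viterbi with that one modification; the toy transition matrix, observations, and weight values are illustrative assumptions, not the paper's models:

```python
def viterbi_weighted(obs_loglik, log_trans, log_init, state_weights):
    """Viterbi decoding in which each state's acoustic log-likelihood is
    scaled by a per-state weight, so that the discriminative HMM states
    of confusable letters can be emphasized relative to the rest."""
    T, S = len(obs_loglik), len(log_init)
    delta = [log_init[s] + state_weights[s] * obs_loglik[0][s]
             for s in range(S)]
    backptrs = []
    for t in range(1, T):
        new_delta, back = [], []
        for s in range(S):
            best = max(range(S), key=lambda p: delta[p] + log_trans[p][s])
            back.append(best)
            new_delta.append(delta[best] + log_trans[best][s]
                             + state_weights[s] * obs_loglik[t][s])
        delta, backptrs = new_delta, backptrs + [back]
    path = [max(range(S), key=lambda s: delta[s])]   # backtrack from the end
    for back in reversed(backptrs):
        path.append(back[path[-1]])
    return list(reversed(path))

# Toy 2-state example: observations strongly favor state 0.
obs = [[-0.5, -3.0]] * 3
trans = [[-0.1, -2.3], [-2.3, -0.1]]    # log transition probabilities
path = viterbi_weighted(obs, trans, [-0.7, -0.7], [1.0, 1.0])
```

Discriminative training would then raise the weights of states that separate, say, "B" from "D", while the paper's second algorithm adapts `log_trans` and the language weight in the same spirit.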
Ralf Schlüter, RWTH Aachen (Germany)
Wolfgang Macherey, RWTH Aachen (Germany)
In this paper, a formally unifying approach for a class of discriminative training criteria, including the Maximum Mutual Information (MMI) and Minimum Classification Error (MCE) criteria, is presented, together with the optimization methods gradient descent (GD) and the extended Baum-Welch (EB) algorithm. Comparisons between the MMI and the MCE criterion are discussed, including the determination of the sets of word sequence hypotheses for discrimination using word graphs. Experiments were carried out on the SieTill corpus of telephone-line recorded German continuous digit strings. Across several approaches to acoustic modeling, the word error rates obtained by MMI training using single densities were consistently better than those for Maximum Likelihood (ML) using mixture densities. Finally, the results obtained with corrective training (CT), i.e. using only the best recognized word sequence in addition to the spoken word sequence, could not be improved upon by the word-graph based discriminative training.
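The standard MCE criterion referred to above turns a 0/1 recognition error into a differentiable quantity. The sketch below shows the usual construction (soft maximum over competitors plus a sigmoid); the smoothing constants `eta` and `gamma` are generic defaults, not values from the paper:

```python
import math

def mce_loss(g_correct, g_competing, eta=2.0, gamma=1.0):
    """Smoothed error count for one utterance under the MCE criterion:
    the misclassification measure d compares the correct hypothesis'
    discriminant score with a soft maximum over its competitors, and a
    sigmoid maps d to an approximate 0/1 error."""
    m = max(g_competing)
    g_anti = m + math.log(sum(math.exp(eta * (g - m)) for g in g_competing)
                          / len(g_competing)) / eta
    d = g_anti - g_correct                       # > 0 means likely error
    return 1.0 / (1.0 + math.exp(-gamma * d))    # smoothed 0/1 error count
```

Summing this loss over utterances gives a smoothed empirical error count that GD or EB-style updates can minimize; the competing scores `g_competing` would come from the word graph in the paper's setting.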
Wei Wei, Oregon Graduate Institute of Science and Technology (U.S.A.)
Sarel Van Vuuren, Oregon Graduate Institute of Science and Technology (U.S.A.)
For connected digit recognition the relative frequency of occurrence (prior) for context-dependent phonetic units at inter-word boundaries in training data tends to be much lower than the prior expected for a single test utterance. A problem in using a neural network to model context-dependent phonetic units is that it learns the prior of the training data and not that expected of a test utterance. We show how to compensate for the problem by roughly flattening the class prior for infrequently occurring context units by a suitable weighting of the neural network cost function -- based entirely on the training set prior. We show that this leads to improved recognition performance. We give results for telephone speech on the OGI Numbers Corpus. Our method gives a 12.37% reduction of the sentence-level error rate (to 14.76%) and a 9.93% reduction of the word-level error rate (to 3.81%) compared to not doing compensation.
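The prior-flattening cost weighting described above can be sketched directly from the training-set class frequencies. This is a minimal illustration assuming a simple inverse-prior weighting with a tunable exponent; the normalization choice and toy labels are assumptions for the example:

```python
from collections import Counter

def class_weights(labels, power=1.0):
    """Per-class cost weights inversely proportional to the training-set
    prior, roughly flattening the prior of rare context units; weights
    are normalized so their prior-weighted average is 1."""
    counts = Counter(labels)
    n = len(labels)
    priors = {c: counts[c] / n for c in counts}
    raw = {c: (1.0 / priors[c]) ** power for c in priors}
    mean_w = sum(raw[c] * priors[c] for c in raw)
    return {c: raw[c] / mean_w for c in raw}

# 'b' plays the role of a rare inter-word context unit.
labels = ['a'] * 90 + ['b'] * 10
w = class_weights(labels)
# The weighted cross-entropy term for one frame with true class c and
# network output p_c would then be:  loss = -w[c] * log(p_c)
```

With `power` between 0 and 1 the flattening is partial, which matches the "roughly flattening" behaviour the abstract describes.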
Jan Verhasselt, ELIS, University of Ghent (Belgium)
Jean-Pierre Martens, ELIS, University of Ghent (Belgium)
In this paper, we describe the incorporation of context-dependent models in hybrid Segment-Based/Neural Network speech recognition systems. We present alternative probabilistic frameworks and evaluate them by performing speaker-independent phone recognition experiments on the TIMIT corpus. We compare their recognition performance with that of a context-independent hybrid SB/NN system and with the best published results on this task.
Juergen Fritsch, University of Karlsruhe (Germany)
Michael Finke, University of Karlsruhe (Germany)
We present the ACID/HNN framework, a principled approach to hierarchical connectionist acoustic modeling in large vocabulary conversational speech recognition (LVCSR). Our approach consists of an Agglomerative Clustering algorithm based on Information Divergence (ACID) to automatically design and robustly estimate Hierarchies of Neural Networks (HNN) for arbitrarily large sets of context-dependent decision tree clustered HMM states. We argue that a hierarchical approach is crucial in applying locally discriminative connectionist models to the typically very large state spaces observed in LVCSR systems. We evaluate the ACID/HNN framework on the Switchboard conversational telephone speech corpus. Furthermore, we focus on the benefits of the proposed connectionist acoustic model, namely exploiting the hierarchical structure for speaker adaptation and decoding speed-up algorithms.
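The core factorization behind a hierarchy of networks like HNN is that a leaf state's posterior is the product of conditional posteriors along its root-to-leaf path. The sketch below illustrates that factorization only; the tree, the `classify` stand-in, and its outputs are invented for the example, not the ACID-derived hierarchy:

```python
def leaf_posteriors(node, x, classify):
    """Posterior of each leaf HMM state as the product of the conditional
    posteriors produced by the small networks along its root-to-leaf path."""
    if "children" not in node:
        return {node["state"]: 1.0}
    branch = classify(node["id"], x)   # conditional posterior over children
    out = {}
    for p, child in zip(branch, node["children"]):
        for state, q in leaf_posteriors(child, x, classify).items():
            out[state] = p * q
    return out

def classify(node_id, x):
    # Stand-in for a trained node network's softmax output.
    return [0.6, 0.4] if node_id == 0 else [0.5, 0.5]

tree = {"id": 0, "children": [
    {"state": "A"},
    {"id": 1, "children": [{"state": "B"}, {"state": "C"}]},
]}
post = leaf_posteriors(tree, None, classify)
```

Because each node network only discriminates among its children, the state space can grow arbitrarily large while each network stays small, which is the scalability argument the abstract makes for LVCSR.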
Tranzai Lee, Chinese Academy of Sciences (China)
Daowen Chen, Chinese Academy of Sciences (China)
In continuous speech recognition, coarticulation between two successive phonemes seriously degrades recognition performance. It is difficult for pure hidden Markov model (HMM) methods to cope with coarticulation, because HMM methods assume that two successive frames of speech are independent. Hybrid HMM and artificial neural network (ANN) methods with a feedback MLP [1,3] provide the ability to cope with coarticulation by means of the feedback input. In this paper, we propose a new feedback method for feedback hybrid HMM/ANN systems on the basis of the original methods [1,3]. The new feedback method provides more coarticulation information to the feedback ANN. As a result, it reduces the error rate by 20.4%. Additionally, building on our previous work on hybrid HMM/ANN methods with a feedback double-MLP structure, we discuss a method that reduces the computation of the feedback MLP during recognition.