Chair: Mazin Rahim, AT&T Labs, USA
Jon G Vaver, Department of Defense (U.S.A.)
We present results relevant to tasks involved in the confidence scoring of output from a continuous speech recognition system, including the search for predictor variables and model selection. We introduce the DET curve characteristic (DCC) score, which we use along with the normalized cross entropy (NCE) score to evaluate models and predictor variables. We also show results from experiments that suggest how the NCE and DCC scores vary with recognizer performance.
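The abstract does not define the NCE score. As a rough illustration, the sketch below assumes the definition commonly used in confidence-scoring evaluations: the entropy of a baseline that assigns every word the prior probability of being correct, reduced by the cross entropy of the confidence scores and normalized by that baseline entropy. The function name, argument names, and the clamping constant are hypothetical.

```python
import math


def normalized_cross_entropy(confidences, correct, eps=1e-12):
    """Sketch of an NCE score, assuming the commonly used definition.

    confidences: estimated probability that each word is correct.
    correct:     matching booleans, True where the word was correct.
    Requires at least one correct and one incorrect word.
    """
    n = len(confidences)
    n_correct = sum(1 for ok in correct if ok)
    p_c = n_correct / n  # prior probability that a word is correct

    # Entropy (in bits) of the baseline that assigns p_c to every word.
    h_max = -(n_correct * math.log2(p_c)
              + (n - n_correct) * math.log2(1.0 - p_c))

    # Log-likelihood (in bits) of the correctness tags under the scores.
    log_lik = 0.0
    for c, ok in zip(confidences, correct):
        c = min(max(c, eps), 1.0 - eps)  # clamp to avoid log(0)
        log_lik += math.log2(c) if ok else math.log2(1.0 - c)

    return (h_max + log_lik) / h_max
```

Under this definition, scores that merely repeat the prior give an NCE of about 0, informative scores give a positive value, and misleading scores give a negative one.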
Myoung-Wan Koo, Telecom (Korea)
Chin-Hui Lee, Lucent Technologies (U.S.A.)
Biing-Hwang Juang, Lucent Technologies (U.S.A.)
We propose a new decoder based on a generalized confidence score. The generalized confidence score is defined as a product of confidence scores obtained from confidence information sources such as likelihood, likelihood ratio, duration, duration ratio, language model probabilities, and supra-segmental information. Each confidence information source is converted into a confidence score by a confidence pre-processor. We show an extended hybrid decoder as an example of a decoder based on the generalized confidence score. The extended hybrid decoder uses multi-level confidence scores, such as frame-level, phone-level, and word-level likelihood ratios, while the conventional hybrid decoder uses only the frame-level confidence score. Experimental results show that the extended decoder performs better than the conventional hybrid decoder, particularly in dealing with out-of-vocabulary words or out-of-task sentences.
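The product form described in this abstract can be sketched as follows. The abstract does not say how the confidence pre-processor maps a raw information source to a score, so the sigmoid mapping here, along with every function and parameter name, is an assumption made only for illustration.

```python
import math


def sigmoid_preprocessor(raw, slope=1.0, offset=0.0):
    """Hypothetical pre-processor: squash a raw value into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-slope * (raw - offset)))


def generalized_confidence(scores):
    """Product of the per-source confidence scores."""
    return math.prod(scores)


# Example with made-up raw values for three information sources.
frame_llr, phone_llr, word_llr = 0.4, 1.2, -0.3
score = generalized_confidence(
    sigmoid_preprocessor(v) for v in (frame_llr, phone_llr, word_llr)
)
```

Because the result is a product, a single source with a score near zero pulls the overall confidence down regardless of the others.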
Takatoshi Jitsuhiro, NTT Human Interface Laboratories (Japan)
Satoshi Takahashi, NTT Human Interface Laboratories (Japan)
Kiyoaki Aikawa, NTT Human Interface Laboratories (Japan)
The rejection of unknown words is important in improving the performance of speech recognition. The anti-keyword model method can reject unknown words with high accuracy for small vocabularies and specified tasks. Unfortunately, it is either inconvenient or impossible to apply if the words in the vocabulary change frequently. We propose a new method for task-independent rejection of unknown words, in which a new phoneme confidence measure is used to verify partial utterances. The measure is used to verify each phoneme while locating candidates. Furthermore, the whole utterance is verified by a phonetic typewriter. This method can improve the accuracy of verification for each phoneme and the speed of the candidate search. Tests show that the proposed method improves the recognition rate by 4% compared to the conventional algorithm at equal error rates. Furthermore, a 3% improvement is obtained by training acoustic models with the MCE algorithm.
Jochen Junkawitsch, Siemens AG (Germany)
Harald Hoege, Siemens AG (Germany)
The assumption of statistically independent feature vectors within the HMM approach is a well-known problem. The aim of this study is to explore a simple and feasible method that takes the correlation of adjacent feature vectors into account. A so-called correlated HMM, which estimates the emission probability of a state with respect to correlated feature vectors, is built by combining two separate knowledge sources. On the one hand, a traditional HMM provides an emission probability conditioned on a certain state; on the other hand, a linear predictor delivers an emission probability that takes the previous feature vectors into account. The efficiency of this method is shown with the help of the German SpeechDat(M) database. Applying the correlated HMM within the verification procedure of a keyword spotter improved the Figure-of-Merit from 87.1% to 88.6%.
Frank Wessel, RWTH Aachen (Germany)
Klaus Macherey, RWTH Aachen (Germany)
Ralf Schlüter, RWTH Aachen (Germany)
Estimates of confidence for the output of a speech recognition system can be used in many practical applications of speech recognition technology. They can be employed for detecting possible errors and can help to avoid undesirable verification turns in automatic inquiry systems. In this paper we propose to estimate the confidence in a hypothesized word as its posterior probability, given all acoustic feature vectors of the speaker utterance. The basic idea of our approach is to estimate the posterior word probabilities as the sum of all word hypothesis probabilities which represent the occurrence of the same word in more or less the same segment of time. The word hypothesis probabilities are approximated by paths in a wordgraph and are computed using a simplified forward-backward algorithm. We present experimental results on the North American Business (NAB'94) and the German Verbmobil recognition tasks.
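The summation idea in this abstract can be illustrated with a small sketch. It is not the authors' forward-backward computation: it assumes the wordgraph has already been expanded into an explicit list of scored paths, and it treats "more or less the same segment of time" as any overlap in frames. All names and the data layout are hypothetical.

```python
import math


def overlaps(a, b):
    """True if two (word, start, end) hypotheses share word and time."""
    return a[0] == b[0] and a[1] < b[2] and b[1] < a[2]


def word_posterior(target, paths):
    """Posterior of a word hypothesis from explicitly listed paths.

    target: (word, start_frame, end_frame)
    paths:  list of (hypotheses, log_score), where hypotheses is a
            list of (word, start_frame, end_frame) tuples.
    """
    # Normalize path scores in a numerically stable way.
    max_log = max(log_score for _, log_score in paths)
    weights = [math.exp(log_score - max_log) for _, log_score in paths]
    total = sum(weights)

    # Sum the mass of every path containing the same word at an
    # overlapping time segment.
    mass = sum(
        w for (hyps, _), w in zip(paths, weights)
        if any(overlaps(target, h) for h in hyps)
    )
    return mass / total


paths = [
    ([("a", 0, 10), ("b", 10, 20)], math.log(0.6)),
    ([("a", 0, 9), ("c", 9, 20)], math.log(0.3)),
    ([("d", 0, 10), ("b", 10, 20)], math.log(0.1)),
]
confidence_a = word_posterior(("a", 0, 10), paths)
```

In this toy example the word "a" appears on two of the three paths with slightly different end frames, and both contribute to its posterior.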
Rafid A. Sukkar, Lucent Technologies (U.S.A.)
In this paper we formulate a training framework and present a method for task independent utterance verification. Verification-specific HMMs are defined and discriminatively trained using minimum verification error training. Task independence is accomplished by performing the verification on the subword level and training the verification models using a general phonetically balanced database that is independent of the application tasks. Experimental results show that the proposed method significantly outperforms two other commonly used task independent utterance verification techniques. It is shown that the equal error rate of false alarms and false keyword rejection is reduced by more than 22% compared to the other two methods on a large vocabulary recognition task.
Satya Dharanipragada, IBM (U.S.A.)
Salim E. Roukos, IBM (U.S.A.)
In applications such as audio indexing, spoken message retrieval, and video browsing, it is necessary to detect spoken words that are outside the vocabulary of the speech recognizer used in these systems, and to do so in large amounts of speech at speeds many times faster than real time. In this paper we present a fast, vocabulary-independent algorithm for spotting words in speech. The algorithm consists of a preprocessing stage and a coarse-to-detailed search strategy for spotting a word/phone sequence in speech. The preprocessing method provides a phone-level representation of the speech that can be searched efficiently. The coarse search, consisting of phone-ngram matching, identifies regions of speech as putative word hits. The detailed acoustic match is then conducted only at the putative hits identified in the coarse match. This gives us the desired accuracy and speed in wordspotting.
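A minimal sketch of the coarse-to-detailed strategy follows. It assumes the phone-level representation is a flat list of phone labels and uses a simple phone-agreement ratio as a stand-in for the detailed acoustic match, which the abstract does not specify. Function names, the voting scheme, and the example phone string are all hypothetical.

```python
from collections import defaultdict


def build_ngram_index(phones, n=2):
    """Map each phone n-gram to the positions where it starts."""
    index = defaultdict(list)
    for i in range(len(phones) - n + 1):
        index[tuple(phones[i:i + n])].append(i)
    return index


def coarse_search(index, query, n=2):
    """Vote for putative start positions of the query phone sequence."""
    votes = defaultdict(int)
    for j in range(len(query) - n + 1):
        for pos in index.get(tuple(query[j:j + n]), []):
            if pos - j >= 0:
                votes[pos - j] += 1
    return dict(votes)


def detailed_match(phones, query, start):
    """Stand-in for the acoustic match: fraction of agreeing phones."""
    window = phones[start:start + len(query)]
    return sum(p == q for p, q in zip(window, query)) / len(query)


phones = "sil k ae t sil d ao g sil k ae b".split()
query = ["k", "ae", "t"]
index = build_ngram_index(phones)
hits = coarse_search(index, query)
# The expensive step runs only at the putative hits.
scores = {start: detailed_match(phones, query, start) for start in hits}
```

The index is built once per recording, so each new query costs only a few dictionary lookups plus a detailed match at the small set of candidate positions.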
Richard C. Rose, AT&T Labs - Research (U.S.A.)
Huan Yao, AT&T Labs - Research (U.S.A.)
Giuseppe Riccardi, AT&T Labs - Research (U.S.A.)
Jeremy H. Wright, AT&T Labs - Research (U.S.A.)
Methods for utterance verification (UV) and their integration into statistical language modeling and spoken language understanding formalisms for a large vocabulary spoken understanding system are presented. The paper consists of three parts. First, a set of acoustic likelihood ratio based utterance verification techniques is described and applied to the problem of rejecting portions of a hypothesized word string that may have been incorrectly decoded by a large vocabulary continuous speech recognizer. Second, a procedure for integrating the acoustic level confidence measures with the statistical language model is described. Finally, the effect of integrating acoustic level confidence into the spoken language understanding unit (SLU) in a call-type classification task is discussed. These techniques were evaluated on utterances collected from a highly unconstrained call routing task performed over the telephone network. They have been evaluated in terms of their ability to classify utterances into a set of 15 semantic actions corresponding to call-types that are accepted by the application.