Thomas Schaaf, University of Karlsruhe (Germany)
Thomas Kemp, University of Karlsruhe (Germany)
For many practical applications of speech recognition systems, it is desirable to have an estimate of confidence for each hypothesized word, i.e. an estimate of which words in the output of the speech recognizer are likely to be correct and which are not reliable. We describe the development of the measure-of-confidence tagger JANKA, which provides confidence information for the words in the output of the speech recognizer JANUS-3-SR. On a spontaneous German human-to-human database, JANKA achieves a tagging accuracy of 90% at a baseline word accuracy of 82%.
Larry Gillick, Dragon Systems (U.S.A.)
Yoshiko Ito, Dragon Systems (U.S.A.)
Jonathan Young, Dragon Systems (U.S.A.)
In this paper we propose a novel way of estimating confidences for words that are recognized by a speech recognition system, together with a natural methodology for evaluating the overall quality of those confidence estimates. Our approach is based on an interpretation of a confidence as the probability that the corresponding recognized word is correct, and makes use of generalized linear models as a means for combining various predictor scores so as to arrive at confidence estimates. Experimental results using these models are presented based on four different sources of speech data: Switchboard, Spanish and Mandarin CallHome, and Wall Street Journal.
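As an illustration of the idea of combining predictor scores through a generalized linear model, the sketch below fits a logistic regression (a generalized linear model with a logit link) so that its output can be read as the probability that a recognized word is correct. This is a minimal sketch, not the authors' implementation: the two predictor scores, the toy training data, and the names `fit_logistic` and `confidence` are all assumptions made for the example.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(features, labels, lr=0.1, epochs=2000):
    """Fit weights (bias first) by batch gradient ascent on the log-likelihood."""
    n_feat = len(features[0])
    w = [0.0] * (n_feat + 1)
    for _ in range(epochs):
        grad = [0.0] * (n_feat + 1)
        for x, y in zip(features, labels):
            p = sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))
            err = y - p  # gradient of the log-likelihood w.r.t. the linear term
            grad[0] += err
            for j, xj in enumerate(x):
                grad[j + 1] += err * xj
        w = [wi + lr * g / len(features) for wi, g in zip(w, grad)]
    return w


def confidence(w, x):
    """Estimated probability that the word with predictor scores x is correct."""
    return sigmoid(w[0] + sum(wi * xi for wi, xi in zip(w[1:], x)))


# Hypothetical toy data: each row holds two made-up predictor scores for one
# recognized word; the label is 1 if the word was correct, 0 otherwise.
train_x = [(0.9, 0.8), (0.8, 0.6), (0.7, 0.9), (0.2, 0.3), (0.1, 0.4), (0.3, 0.1)]
train_y = [1, 1, 1, 0, 0, 0]

weights = fit_logistic(train_x, train_y)
high = confidence(weights, (0.9, 0.9))
low = confidence(weights, (0.1, 0.1))
```

Because the output is a probability rather than an arbitrary score, it can be thresholded or compared across words directly; in practice the predictors would be recognizer-derived quantities and the model would be fitted on held-out recognition output.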
Chalapathy Neti, IBM Research (U.S.A.)
Salim Roukos, IBM Research (U.S.A.)
Ellen Eide, IBM Research (U.S.A.)
The maximum a posteriori hypothesis is treated as the decoded truth in speech recognition. However, since the word recognition accuracy is not 100%, it is desirable for some applications to have an independent confidence measure of how good the maximum a posteriori hypothesis is relative to the spoken truth. Efforts are in progress [1,2,3] to develop such confidence measures with the intent of applying them to the assessment of confidence of whole utterances, rescoring of N-best lists, etc. In this paper, we explore the use of word-based confidence measures to adaptively modify the hypothesis score during search in continuous speech recognition: specifically, the weight given to the prediction of the current sequence of hypothesized words is changed as a function of the confidence of that sequence. Experimental results are described for the ATIS and Switchboard tasks. A relative reduction in word error of about 8% is obtained for ATIS.
Mitch Weintraub, SRI International (U.S.A.)
Françoise Beaufays, SRI International (U.S.A.)
Zeév Rivlin, SRI International (U.S.A.)
Yochai Konig, SRI International (U.S.A.)
Andreas Stolcke, SRI International (U.S.A.)
This paper proposes a probabilistic framework to define and evaluate confidence measures for word recognition. We describe a novel method to combine different knowledge sources and estimate the confidence in a word hypothesis, via a neural network. We also propose a measure of the joint performance of the recognition and confidence systems. The definitions and algorithms are illustrated with results on the Switchboard Corpus.
Javier Caminero, Telefonica I+D (Spain)
Luis Hernandez-Gomez, E.T.S.I. Telecomunicacion, UPM (Spain)
Celinda de la Torre, Telefonica I+D (Spain)
Cesar Martin, Telefonica I+D (Spain)
Utterance Verification (UV) is a critical function of an Automatic Speech Recognition (ASR) system working in real applications where spontaneous speech, out-of-vocabulary (OOV) words and acoustic noise are present. In this paper we present a new UV procedure with two major features: a) Confidence tests are applied to decoded string hypotheses obtained using word and garbage models that represent OOV words and noises. Thus the ASR system is designed to provide what we refer to as Word Spotting and Noise Spotting capabilities. b) The UV procedure is based on three different confidence tests, two based on acoustic measures and one founded on linguistic information, applied in a hierarchical structure. Experimental results from a real telephone application on a natural number recognition task show a 50% reduction in recognition errors with a moderate 12% rejection rate of correct utterances and a low 1.5% rate of false acceptance.
Antonio J. Rubio, University of Granada (Spain)
Jesus E. Diaz, University of Granada (Spain)
Pedro Garcia, University of Granada (Spain)
Jose C. Segura, University of Granada (Spain)
It is usually assumed that grammar probabilities and acoustic probabilities in a Continuous Speech Recognition system have to be incorporated into the overall score with different weights. This is an experimental fact and there is no generally accepted theoretical explanation. In this paper we propose an explanation for this fact, related to the way grammar scoring is incorporated in the search procedure. In accordance with this explanation, we perform a set of experiments to test our hypothesis. We also propose a new way of introducing grammar probabilities in a tree-based vocabulary search strategy, where systems are usually bound to use the worst strategy. Applying our ideas to unigrams is rather simple. For more complex language models such as bigrams, we have to implement a new procedure.
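For readers unfamiliar with the weighting in question, the conventional combination adds the acoustic log-probability to the language model log-probability scaled by a weight. The sketch below shows only that standard combination and how the weight can change which hypothesis wins; it does not reproduce the explanation or the tree-search procedure proposed in the paper. The hypotheses, their log-probabilities, and the function names are invented for illustration.

```python
def combined_log_score(acoustic_logprob, lm_logprob, lm_weight):
    """Weighted log-linear combination: log P(X|W) + lm_weight * log P(W)."""
    return acoustic_logprob + lm_weight * lm_logprob


def best_hypothesis(hyps, lm_weight):
    """Return the hypothesis with the highest combined score."""
    return max(hyps, key=lambda h: combined_log_score(hyps[h][0], hyps[h][1], lm_weight))


# Hypothetical values: (acoustic log-probability, language model log-probability).
hypotheses = {
    "recognize speech": (-120.0, -4.0),
    "wreck a nice beach": (-118.0, -9.0),
}

acoustic_only = best_hypothesis(hypotheses, lm_weight=0.0)
with_grammar = best_hypothesis(hypotheses, lm_weight=1.0)
```

With the weight at 0.0 the acoustically better string wins; with the weight at 1.0 the language model tips the decision to the other string. In practice the weight is tuned empirically on development data, which is the experimental fact the abstract refers to.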
Alexandros S. Manos, MIT-LCS (U.S.A.)
Victor W. Zue, MIT-LCS (U.S.A.)
A common approach to wordspotting is to augment the keyword models with "filler" models to account for non-keyword intervals. An alternative approach is to use a large vocabulary continuous speech recognition system (LVCSR) to produce a word string, and then search for the keywords in that string. While the latter approach typically yields higher performance, it requires costly computation and extensive training data. In this study, we develop several segment-based wordspotters in an effort to achieve performance comparable to that of the LVCSR spotter, but with only a fraction of the vocabulary. We investigate several methods to model the background, ranging from a few general models to refined phone representations. The task is to detect sixty-one keywords from continuous speech in the ATIS domain. The best performance we achieve is 91.4% Figure of Merit for the LVCSR spotter and 86.7% for a spotter using 57 phone-based filler models.
Bo-Ren Bai, NTU (Taiwan)
Chiu-Yu Tseng, Academia Sinica (Taiwan)
Lin-Shan Lee, NTU (Taiwan)
This paper presents a multi-phase approach for fast spotting of large-vocabulary Chinese keywords in a spontaneous Mandarin speech utterance using prosodic knowledge. Instead of searching through the whole utterance using a large number of keyword models, the multi-phase framework proposed here, which includes some special scoring schemes, achieves very good efficiency by exploiting the monosyllable-based structure of Mandarin Chinese. The approach is very fast, owing to very good boundary estimation and the deletion of most impossible syllable and keyword candidates using context-independent models, and also very accurate, owing to the carefully designed scoring processes. A task with 2611 keywords was tested. An inclusion rate of 85.79% for the top 10 candidates is attained, at a speed requiring only 1.2 times the utterance length on a Sparc 20 workstation.
Rachida El Méliani, INRS-Télécom (Canada)
Douglas O'Shaughnessy, INRS-Télécom (Canada)
Our goal is to design an accurate keyword spotter that can deal with any size of keyword set, since the size actually required in a wide range of applications is large (number of airports, number of names in a directory, etc.). This justifies the choice of an architecture based on a large-vocabulary continuous-speech recognizer. In a previous paper we introduced the use of strictly-lexical subword fillers for keyword spotting based on the INRS large-vocabulary continuous-speech recognizer, showing that, compared to acoustic fillers, they are a good compromise between memory and time consumption, freedom of keyword choice and task-independent training on the one hand and accuracy on the other. We propose here two new high-performance designs of individual strictly-lexical subword fillers that, this time, perform better than their acoustic counterparts while still keeping the advantages mentioned.
Martin Holzapfel, Siemens (Germany)
Günther Ruske, Technical University of Munich (Germany)
Harald Höge, Siemens (Germany)
A basic problem in keyword spotting is the fact that the keywords themselves cannot be completely different from background speech. Therefore, false alarms arise from those parts of the keyword which are also contained in the background. The paper describes the application of a model trellis which makes it possible to test, in a statistical way, individual phoneme sequences with respect to their influence on the underlying phoneme HMMs. It is shown that the Viterbi path is strongly affected by those partly fitting phoneme groups. The probability of occurrence of these phoneme sequences is captured by a statistical "speech model" consisting of a Markov graph of order up to 2. In this way sequences of 1, 2, or 3 phonemes are considered. By combining the model trellis and the statistical speech model, the probability of false alarms can be precalculated, thus providing a useful measure of the suitability of the keyword under consideration. When the choice of keywords was optimized by this suitability measure in a practical application (spotting multicom 94.4 data), the false alarm rate could be reduced by a factor of 3.5.
Suhardi Suhardi, Technical University of Berlin (Germany)
Klaus Fellbaum, Brandenburg Technical University of Cottbus (Germany)
We describe a wordspotting algorithm based on a predictive neural model for a telephone speech corpus. Each keyword is modeled as a whole word. For keyword detection scoring we used a minimum accumulated prediction residual. We computed empirically a threshold value for rejecting non-keyword speech in place of building non-keyword models. We tested the algorithm with the TUBTEL telephone speech corpus and compared it with other algorithms like the standard DTW-based wordspotting algorithm and the two-stage wordspotting algorithm based on a DTW and a multilayer perceptron.
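The scoring and rejection idea in this abstract can be sketched as follows: each keyword has a predictor that guesses the next feature frame from the previous one, the keyword with the smallest accumulated squared prediction error is the candidate, and the candidate is rejected as non-keyword speech if that error exceeds a threshold. This is a simplified sketch under stated assumptions, not the authors' algorithm: the neural predictors are replaced by hand-written toy functions, no time alignment or length normalization is performed, and the keyword names, frames, and threshold value are hypothetical.

```python
def accumulated_residual(frames, predict):
    """Sum of squared errors when `predict` guesses each frame from the previous one."""
    total = 0.0
    for prev, cur in zip(frames, frames[1:]):
        guess = predict(prev)
        total += sum((g - c) ** 2 for g, c in zip(guess, cur))
    return total


def spot_keyword(frames, keyword_predictors, threshold):
    """Return the keyword with the smallest residual, or None if it exceeds the threshold."""
    scores = {kw: accumulated_residual(frames, p) for kw, p in keyword_predictors.items()}
    best = min(scores, key=scores.get)
    return best if scores[best] <= threshold else None


# Toy stand-ins for trained per-keyword predictors (hypothetical).
predictors = {
    "rise": lambda frame: [v + 1.0 for v in frame],  # expects features to increase
    "fall": lambda frame: [v - 1.0 for v in frame],  # expects features to decrease
}

rising = [[0.0], [1.0], [2.0], [3.0]]  # matches the "rise" predictor exactly
flat = [[5.0], [5.0], [5.0]]           # matches neither predictor well

detected = spot_keyword(rising, predictors, threshold=1.0)
rejected = spot_keyword(flat, predictors, threshold=1.0)
```

Using a threshold in this way avoids training explicit non-keyword models, at the cost of having to choose the threshold empirically, as the abstract notes.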