ABSTRACT
This paper describes a new algorithm for key-phrase spotting applications. The algorithm consists of three processes. The first synergistically integrates N-grams with finite-state grammars (FSG), the two conventional language models (LMs) for speech recognition: all key phrases to be spotted are covered by the FSG component of the recognizer's LM, while the N-grams are used to decode the surrounding non-key phrases. The second is selective weighting, whose parameters independently control the triggering and completion of the FSG on top of the N-grams. The third is a word confirmation and rejection logic that determines whether to accept or reject a hypothesized key phrase. The proposed algorithm has been evaluated favorably in two separate experiments, in which only the FSG part of the LM needed to be updated for different application tasks while the N-gram part remained unchanged.
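The selective-weighting idea can be illustrated with a toy score-combination rule applied at LM transitions during decoding. The parameter names and values below are illustrative assumptions for a sketch, not taken from the paper:

```python
def transition_score(base_lm_logprob, entering_fsg, leaving_fsg,
                     trigger_weight=-2.0, completion_weight=0.5):
    """Toy sketch of selective weighting: independent parameters
    control entering (triggering) and leaving (completing) the FSG
    key-phrase sub-network relative to the background N-gram.
    Weights here are hypothetical values, not from the paper."""
    score = base_lm_logprob
    if entering_fsg:
        score += trigger_weight      # penalize spurious triggering
    if leaving_fsg:
        score += completion_weight   # reward a completed key phrase
    return score
```

Because the two weights act independently, the operating point for how easily key phrases are hypothesized can be tuned separately from how strongly completed key phrases are favored.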
ABSTRACT
This paper describes our recent development of algorithms for detecting keywords in continuous speech. Two different definitions of confidence measures are introduced. An advantage of these definitions is that they can be computed theoretically, without ad hoc tuning. Moreover, two distinct decoding algorithms are presented that incorporate these confidence measures into the search procedure. The first is a new way of detecting keywords in continuous speech using the standard Viterbi algorithm without modeling the non-keyword parts of the utterance. The second is an improved development of the algorithm described in [1], likewise without the need to model the non-keyword parts.
ABSTRACT
We describe our recent work in implementing a word-spotting system based on the ANGIE framework and the effects of varying the nature of the sublexical constraints placed upon the word-spotter's filler model. ANGIE is a framework for modelling speech in which the morphological and phonological substructures of words are jointly characterized by a context-free grammar and represented in a multi-layered hierarchical structure. In this representation, the upper layers capture syllabification, morphology, and stress; the preterminal layer represents phonemics; and the bottom terminal categories are the phones. ANGIE thus provides a flexible framework for exploring the effects of sublexical constraints within a word-spotting environment. Our experiments with spotting city names in ATIS validate the intuition that increasing the constraints present in the model improves performance, from 85.3 FOM for a phone bigram to 89.3 FOM for a word lexicon. They also empirically strengthen our belief that ANGIE provides a feasible framework for various speech recognition tasks, of which word-spotting is one.
ABSTRACT
The aim of this paper was to study the efficiency of sound duration, degree of sound voicing, and sound energy in the rejection procedure of an automatic speech recognition system. The three parameters were modelled using statistical models estimated on vocabulary words, out-of-vocabulary words, and noise tokens. Rejection of out-of-vocabulary words and noises depended on the score obtained by comparing the probabilities given by the different models. However, such an approach also causes false rejections (rejections of vocabulary words). A trade-off was therefore necessary between the false rejection rate and the false alarm rate on out-of-vocabulary words and noise tokens. The degree of voicing turned out to be the most efficient parameter for rejecting noise tokens: it reduced the HMM false acceptance rate from 6.3% to 2.3% at the same false rejection rate (9%). The duration parameter performed better on laboratory data, reducing the error rate on French numbers from 3.1% to 1.5% at a 5% false rejection rate.
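The trade-off between false rejections and false alarms can be sketched as a threshold sweep over model-comparison scores. This is a generic likelihood-ratio formulation with hypothetical score conventions, not the authors' exact procedure:

```python
def rejection_rates(vocab_scores, garbage_scores, threshold):
    """One operating point on the rejection trade-off curve.

    Scores are assumed to be log-likelihood ratios (vocabulary-word
    model minus out-of-vocabulary/noise model); a token is rejected
    when its score falls below `threshold`.

    Returns (false_rejection, false_alarm): the fraction of
    vocabulary words wrongly rejected, and the fraction of
    OOV/noise tokens wrongly accepted."""
    false_rejection = sum(s < threshold for s in vocab_scores) / len(vocab_scores)
    false_alarm = sum(s >= threshold for s in garbage_scores) / len(garbage_scores)
    return false_rejection, false_alarm
```

Sweeping `threshold` over the observed score range traces the full trade-off curve from which an operating point (such as the 9% or 5% false rejection rates quoted above) can be chosen.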
ABSTRACT
This paper describes keyword spotting using prosodic information as well as phonemic information. A Japanese word has its own F0 contour based on its lexical accent type, and this F0 contour is preserved in sentences. Prosodic dissimilarity between a keyword and the input speech is measured by DP matching of F0 contours, while the phonemic score is calculated by a conventional HMM technique. A total score based on these two measures is used for detecting keywords. The F0 contour of the keyword is smoothed using an F0 model. An evaluation test was carried out on recorded speech from a TV news program. Introducing prosodic information reduces false alarms by 30% to 50% over a wide range of detection rates.
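The DP matching of F0 contours can be sketched with a standard dynamic time warping (DTW) distance. The local cost (absolute F0 difference) and the length normalization below are assumptions for illustration, not the paper's exact formulation:

```python
def dtw_distance(a, b):
    """Length-normalized DTW distance between two F0 contours
    (sequences of F0 values, e.g. in Hz or semitones).
    Local cost is the absolute F0 difference; allowed moves are
    the usual insertion, deletion, and diagonal match."""
    n, m = len(a), len(b)
    INF = float("inf")
    d = [[INF] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i][j] = cost + min(d[i - 1][j],      # insertion
                                 d[i][j - 1],      # deletion
                                 d[i - 1][j - 1])  # match
    return d[n][m] / (n + m)
```

A combined detection score could then weight this prosodic dissimilarity against the HMM phonemic score, e.g. `total = phonemic - w * prosodic` for some hypothetical weight `w`.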
ABSTRACT
In this paper we present a new approach to topic spotting based on subword units (phonemes and feature vectors) instead of words. Topics are classified by running topic-dependent polygram language models over these symbol sequences and selecting the one with the best score. We trained and tested the two methods on three different corpora. The first is part of a media corpus containing data from TV shows on three different topics (IDS), the second is part of the Switchboard corpus, and the third is a collection of human-machine dialogs about train timetable information (EVAR corpus). The results on Switchboard are compared with phoneme-based approaches developed at CRIM (Montreal) and DRA (Malvern) and are presented as ROC curves; the results on IDS and EVAR are compared with a word-based approach and presented as confusion tables. We show that surprisingly little recognition accuracy is lost when going from word-based to subword-based topic spotting.
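The decision rule, scoring a symbol sequence under each topic-dependent model and choosing the best-scoring topic, can be sketched as follows. A simple add-one-smoothed bigram stands in for the polygram models here; the model order and smoothing are illustrative assumptions:

```python
import math
from collections import Counter


def train_ngram(seqs, n=2):
    """Add-one-smoothed n-gram model over subword symbols.
    Returns (n-gram counts, context counts, smoothing vocabulary size)."""
    counts, ctx, vocab = Counter(), Counter(), set()
    for seq in seqs:
        padded = ["<s>"] * (n - 1) + list(seq)
        vocab.update(seq)
        for i in range(n - 1, len(padded)):
            gram = tuple(padded[i - n + 1: i + 1])
            counts[gram] += 1
            ctx[gram[:-1]] += 1
    return counts, ctx, len(vocab) + 1


def logprob(seq, model, n=2):
    """Log-probability of a symbol sequence under one topic model."""
    counts, ctx, v = model
    padded = ["<s>"] * (n - 1) + list(seq)
    lp = 0.0
    for i in range(n - 1, len(padded)):
        gram = tuple(padded[i - n + 1: i + 1])
        lp += math.log((counts[gram] + 1) / (ctx[gram[:-1]] + v))
    return lp


def spot_topic(seq, topic_models):
    """Decide for the topic whose model scores the sequence best."""
    return max(topic_models, key=lambda t: logprob(seq, topic_models[t]))
```

In the paper's setting the input would be phoneme or feature-vector symbol sequences from a subword recognizer rather than the toy character sequences used here.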