Qiru Zhou, Bell Labs., Lucent Technologies (U.S.A.)
Wu Chou, Bell Labs., Lucent Technologies (U.S.A.)
In this paper, an approach to continuous speech recognition based on a layered self-adjusting decoding graph is described. It utilizes a scaffolding layer to support fast network expansion and release. A two-level hashing structure is also described, which introduces a self-adjusting capability into dynamic decoding on a general re-entrant decoding network. In stack decoding, the scaffolding layer enables the decoder to look several layers into the future, so that long-span inter-word context dependency can be preserved exactly. Experimental results indicate that highly efficient decoding can be achieved with significant savings in recognition resources.
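The abstract does not specify the layered structures themselves; as a rough illustration of how a two-level hashing scheme can support on-demand expansion and release of a re-entrant decoding network, consider the following minimal sketch (all names and the outer/inner key choice are assumptions, not the authors' design):

```python
# Minimal sketch of two-level hashing for dynamic network expansion.
# Outer hash: grammar-node id -> that node's expansion table.
# Inner hash: phonetic context -> concrete decoding arc.
# Purely illustrative; not the paper's actual data structures.

class DynamicNetwork:
    def __init__(self):
        self.expanded = {}  # node_id -> {context: arc}

    def get_arc(self, node_id, context, build_arc):
        contexts = self.expanded.setdefault(node_id, {})
        if context not in contexts:
            # Expand lazily, only when the beam first reaches this
            # (node, context) combination.
            contexts[context] = build_arc(node_id, context)
        return contexts[context]

    def release(self, node_id):
        # Free a node's expansion once no active hypothesis needs it.
        self.expanded.pop(node_id, None)

net = DynamicNetwork()
arc = net.get_arc(7, ("ax", "b"), lambda n, c: {"node": n, "ctx": c})
net.release(7)
```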
Stefan Ortmanns, RWTH Aachen (Germany)
Andreas Eiden, RWTH Aachen (Germany)
Hermann Ney, RWTH Aachen (Germany)
Norbert Coenen, RWTH Aachen (Germany)
This paper presents two look-ahead techniques for speeding up large vocabulary continuous speech recognition. These two techniques, referred to as language model look-ahead and phoneme look-ahead, are incorporated into the pruning process of the time-synchronous one-pass beam search algorithm. The search algorithm is based on a tree-organized pronunciation lexicon in connection with a bigram language model. Both look-ahead techniques have been tested on the 20,000-word NAB'94 task (ARPA North American Business corpus). The recognition experiments show that the combination of bigram language model look-ahead and phoneme look-ahead reduces the size of the search space by a factor of about 30, without affecting word recognition accuracy, in comparison with using no look-ahead pruning.
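Bigram language model look-ahead assigns each node of the lexical tree the best bigram probability of any word reachable below it, so that the anticipated LM score can be added to a partial hypothesis before the beam comparison. A minimal sketch, with a hypothetical tree layout:

```python
# Sketch of bigram LM look-ahead over a tree-organized lexicon.
# Each node receives the best log bigram probability of any word
# reachable below it. The data layout is a simplifying assumption.
import math

def lm_lookahead(children, node_words, bigram, predecessor, root=0):
    """children: node -> child nodes; node_words: node -> words ending
    there; bigram(v, w): P(w | v)."""
    scores = {}

    def visit(node):
        best = max((math.log(bigram(predecessor, w))
                    for w in node_words.get(node, [])),
                   default=-math.inf)
        for child in children.get(node, []):
            best = max(best, visit(child))
        scores[node] = best
        return best

    visit(root)
    return scores

# Tiny example: a root with two branches.
tree = {0: [1, 2]}
words = {1: ["see"], 2: ["sea", "seat"]}
probs = {"see": 0.20, "sea": 0.05, "seat": 0.10}
la = lm_lookahead(tree, words, lambda v, w: probs[w], predecessor="we")
```

Phoneme look-ahead works analogously, but anticipates the acoustic score of the next few phonemes rather than the language model score.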
Ken Hanazawa, Tokyo Institute of Technology (Japan)
Sadaoki Furui, Tokyo Institute of Technology (Japan)
Yasuhiro Minami, NTT Human Interface Laboratories (Japan)
This paper proposes an efficient method for large-vocabulary continuous-speech recognition, using a compact data structure and an efficient search algorithm. We introduce a very compact data structure, the DAWG (directed acyclic word graph), as the lexicon to reduce the search space. We also propose a search algorithm that obtains the N-best hypotheses using the DAWG structure. This search algorithm is composed of two phases: ``forward search'' and ``traceback''. Forward search, which basically uses the time-synchronous Viterbi algorithm, merges candidates and stores the information about them in DAWG structures to create phoneme graphs. Traceback traces the phoneme graphs to obtain the N-best hypotheses. An evaluation of this method using a speech-recognition-based telephone-directory-assistance system with a 4000-word vocabulary confirmed that our strategy improves speech recognition in terms of both speed and recognition rate.
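For readers unfamiliar with the structure: a DAWG is a minimized trie in which identical suffix subtrees are shared, which is what makes the lexicon so compact. A short (non-incremental) construction sketch, assuming a plain word list:

```python
# Build a trie, then merge equivalent subtrees bottom-up to obtain a
# DAWG. Real systems typically use incremental construction; this is
# an illustrative sketch only.

def build_dawg(words):
    nodes = [{}]                  # node id -> {label: child id}
    final = set()
    for word in words:
        n = 0
        for ch in word:
            if ch not in nodes[n]:
                nodes.append({})
                nodes[n][ch] = len(nodes) - 1
            n = nodes[n][ch]
        final.add(n)

    canon, rep = {}, {}           # signature -> representative; node -> merged id
    def merge(n):
        if n in rep:
            return rep[n]
        nodes[n] = {ch: merge(c) for ch, c in nodes[n].items()}
        sig = (n in final, tuple(sorted(nodes[n].items())))
        rep[n] = canon.setdefault(sig, n)
        return rep[n]

    return nodes, final, merge(0)

nodes, final, root = build_dawg(["tap", "top", "taps", "tops"])
# The trailing "p"/"ps" suffixes now share a single subgraph.
```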
Hermann Ney, RWTH Aachen (Germany)
Stefan Ortmanns, RWTH Aachen (Germany)
Ingo Lindam, RWTH Aachen (Germany)
This paper describes two methods for constructing word graphs for large vocabulary continuous speech recognition. Both word graph methods are based on a time-synchronous, left-to-right beam search strategy in connection with a tree-organized pronunciation lexicon. The first method is based on the so-called word pair approximation and fits directly into a word-conditioned search organization. In order to avoid the assumptions made in the word pair approximation, we design a second word graph method, based on a time-conditioned factoring of the search space. For the case of a trigram language model, we give a detailed comparison of both word graph methods with an integrated search method. The experiments were carried out on the 20,000-word North American Business (NAB'94) task.
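Under the word pair approximation, the boundary time between a word and its predecessor is assumed to depend only on that word pair, so word-graph construction need only record one bookkeeping entry per (predecessor, word, end time). A hypothetical sketch of that bookkeeping (not the authors' implementation):

```python
# Illustrative word-graph bookkeeping under the word pair
# approximation: for each word end at time t we record the word, its
# predecessor, the boundary time, and the word score.
from collections import defaultdict

class WordGraph:
    def __init__(self):
        # (end_time, word) -> list of (predecessor, start_time, score)
        self.edges = defaultdict(list)

    def record_word_end(self, t, word, predecessor, start_time, score):
        self.edges[(t, word)].append((predecessor, start_time, score))

    def arcs(self):
        # A later pass (e.g. trigram rescoring) walks these edges.
        return dict(self.edges)
```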
Sarvar Patel, Bellcore (U.S.A.)
In continuous speech recognition, a significant amount of time is spent in every frame evaluating interword transitions. In fact, if $N$ is the size of the vocabulary and each word transitions on average to $\overline{E}$ other words, then $O(N\overline{E})$ operations are required. Similarly, when evaluating a partially connected HMM, the Viterbi algorithm requires $O(N\overline{E})$ operations. This paper presents the first algorithm to break the $O(N\overline{E})$ complexity requirement. The new algorithm has an average complexity of $O(N\sqrt{\overline{E}})$. An algorithm was previously presented by the author for the special case of fully connected models; the new algorithm, however, is general: it speeds up the evaluation of both partially and fully connected HMMs and language models. Unlike pruning, it does not rely on heuristics that may sacrifice optimality, but fundamentally improves the basic evaluation step of the time-synchronous Viterbi algorithm.
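The sub-linear algorithm itself is the paper's contribution and is not reproduced here; the sketch below shows only the conventional interword transition step whose per-frame cost is $O(N\overline{E})$:

```python
# Conventional time-synchronous Viterbi interword transition step:
# every word end is propagated to every allowed successor word start,
# so the per-frame cost is O(N * E_bar). The paper's algorithm
# improves on this; that improvement is not shown here.

def interword_transitions(end_scores, successors, bigram_logp):
    """end_scores: word -> best log score at its final state this frame;
    successors: word -> iterable of allowed successor words."""
    start_scores = {}
    for v, score in end_scores.items():          # N word ends
        for w in successors[v]:                  # E_bar successors each
            cand = score + bigram_logp(v, w)
            if cand > start_scores.get(w, float("-inf")):
                start_scores[w] = cand
    return start_scores                          # new word-start scores
```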
Tung-Hui Chiang, ATC/CCL/ITRI (Taiwan)
Chung-Mou Pengwu, ATC/CCL/ITRI (Taiwan)
Shih-Chieh Chien, ATC/CCL/ITRI (Taiwan)
Chao-Huang Chang, ATC/CCL/ITRI (Taiwan)
This paper presents the first known results for the speaker-independent large-vocabulary Mandarin dictation system CCLMDS'96, developed by the Computer & Communication Research Laboratories (CCL) at the Industrial Technology Research Institute (ITRI). First, a fast search algorithm is proposed to improve search efficiency so that CCLMDS'96 can operate in real time on a personal computer. In addition, a discriminative scoring function is proposed to integrate the speech recognizer with the word-class-based bigram language model. With this discriminative scoring function, the system attains a word accuracy of 91.3%, which significantly outperforms the conventional integration approach.
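The discriminative scoring function is defined in the paper; for context, the conventional baseline it is compared against combines the acoustic score log-linearly with a word-class bigram, $P(w_2 \mid w_1) = P(w_2 \mid c_2)\,P(c_2 \mid c_1)$. A minimal sketch of that baseline (weights and table layout are assumptions):

```python
# Conventional word-class bigram plus log-linear integration with the
# acoustic score; the paper's discriminative function replaces this.
import math

def class_bigram_logp(w1, w2, word_class, p_word_in_class, p_class_bigram):
    # P(w2 | w1) = P(w2 | class(w2)) * P(class(w2) | class(w1))
    c1, c2 = word_class[w1], word_class[w2]
    return math.log(p_word_in_class[w2]) + math.log(p_class_bigram[(c1, c2)])

def combined_score(acoustic_logp, lm_logp, lm_weight=8.0):
    # Conventional integration: weighted sum of log scores.
    return acoustic_logp + lm_weight * lm_logp
```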
Tatsuo Matsuoka, NTT Human Interface Labs. (Japan)
Katsutoshi Ohtsuki, NTT Human Interface Labs. (Japan)
Takeshi Mori, NTT Human Interface Labs. (Japan)
Kotaro Yoshida, Tokyo Institute of Technology (Japan)
Sadaoki Furui, Tokyo Institute of Technology (Japan)
Katsuhiko Shirai, Waseda University (Japan)
A large-vocabulary continuous-speech recognition (LVCSR) system was developed and evaluated. To evaluate the system, a Japanese business-newspaper speech corpus was designed and recorded. The corpus was designed so that it can be used for Japanese LVCSR research in the same way that the Wall Street Journal (WSJ) corpus, for example, is used for English LVCSR research. Since Japanese sentences are written without spaces between words, morphological analysis was introduced to segment sentences into words so that word n-gram language models could be used. To enable the use of detailed word n-gram language models, a two-pass decoding strategy was applied. Context-dependent (CD) phone models and word trigram language models reduced the word error rate from 80.2% to 10.1% (an error reduction of about 88%). This result shows that CD phone modeling and word trigram language models can be used effectively in Japanese LVCSR.
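The abstract does not detail the second pass; a common realization of such a two-pass strategy is to rescore first-pass hypotheses with the word trigram, as in this hypothetical sketch:

```python
# Second-pass trigram rescoring of an N-best list produced with a
# weaker first-pass model. The trigram function and LM weight are
# assumptions, not the paper's configuration.

def rescore_nbest(hypotheses, trigram_logp, lm_weight=8.0):
    """hypotheses: list of (word_sequence, acoustic_logp)."""
    rescored = []
    for words, ac in hypotheses:
        padded = ["<s>", "<s>"] + list(words) + ["</s>"]
        lm = sum(trigram_logp(padded[i - 2], padded[i - 1], padded[i])
                 for i in range(2, len(padded)))
        rescored.append((words, ac + lm_weight * lm))
    return max(rescored, key=lambda h: h[1])
```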
Meinrad Niemöller, Siemens AG, Munich (Germany)
Alfred Hauenstein, Siemens AG, Munich (Germany)
Erwin Marschall, Siemens AG, Munich (Germany)
Petra Witschel, Siemens AG, Munich (Germany)
Ulrike Harke, Siemens AG, Munich (Germany)
A large vocabulary speech recognizer for German is presented. The main properties of the recognizer are speaker independence, continuous speech input, and real-time operation. It is integrated into a client/server framework, which allows simple porting between different hardware and software platforms. Methods such as simplified language model spreading in beam search and specialized word-begin and word-end modelling are introduced in order to achieve real-time operation on a Pentium-based PC. Recognition tests for two different dictation applications (controlled-speech newspaper dictation and spontaneous-speech medical dictation) are presented, showing the importance of additional effort in modelling spontaneous speech.
Barbara Peskin, Dragon Systems, Inc. (U.S.A.)
Larry Gillick, Dragon Systems, Inc. (U.S.A.)
Natalie Liberman, Dragon Systems, Inc. (U.S.A.)
Mike Newman, Dragon Systems, Inc. (U.S.A.)
Paul van Mulbregt, Dragon Systems, Inc. (U.S.A.)
Steven Wegmann, Dragon Systems, Inc. (U.S.A.)
This paper describes recent improvements made to Dragon's speech recognition system, which have improved performance on Switchboard recognition by roughly 10 percentage points in the past year. These improvements include the use of rapid speaker adaptation, a move from a 20 msec to a 10 msec frame rate for recognition, expansion of the acoustic training set and lexicon, and the introduction of interpolated language models. Preliminary results applying this Switchboard-trained system to conversations drawn from the English CallHome corpus are also quite strong, suggesting that this technology ports well to novel tasks. Finally, the paper includes a report on several research projects currently in progress which show promise of further reducing the error rate.
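Linear interpolation of language models is standard: $P(w \mid h) = \lambda P_1(w \mid h) + (1 - \lambda) P_2(w \mid h)$, with $\lambda$ estimated on held-out data by EM. A minimal sketch of the weight estimation (the specific component models Dragon interpolated are not reproduced):

```python
# EM estimation of a two-component interpolation weight on held-out
# events. p1 and p2 hold each component's probability for every
# held-out event; the data here is made up for illustration.

def estimate_lambda(p1, p2, iters=20):
    lam = 0.5
    for _ in range(iters):
        # E-step: posterior that each event came from model 1.
        post = [lam * a / (lam * a + (1 - lam) * b)
                for a, b in zip(p1, p2)]
        # M-step: the new weight is the average posterior.
        lam = sum(post) / len(post)
    return lam

lam = estimate_lambda([0.1, 0.4, 0.2], [0.3, 0.1, 0.2])
```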
Torsten Zeppenfeld, Carnegie Mellon University (U.S.A.)
Michael Finke, Carnegie Mellon University (U.S.A.)
Klaus Ries, Carnegie Mellon University (U.S.A.)
Martin Westphal, Carnegie Mellon University (U.S.A.)
Alex Waibel, Carnegie Mellon University (U.S.A.)
Recognition of conversational speech is one of the most challenging speech recognition tasks to date. While recognition error rates of 10% or lower can now be reached on speech dictation tasks with vocabularies in excess of 60,000 words, recognition of conversational speech has persistently resisted most attempts at improvement using proven techniques. Difficulties arise from shorter words, telephone channel degradation, and highly disfluent and coarticulated speech. In this paper, we describe the application, adaptation, and performance evaluation of our JANUS speech recognition engine on the Switchboard conversational speech recognition task. Through a number of algorithmic improvements, we have been able to reduce the word error rate from more than 50% to 38%, measured on the official 1996 NIST evaluation test set. Improvements include vocal tract length normalization, polyphonic modeling, label boosting, speaker adaptation with and without confidence measures, and speaking-mode-dependent pronunciation modeling.
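Of the listed improvements, vocal tract length normalization is the most self-contained: per speaker, choose the frequency-warp factor that maximizes the likelihood of the adaptation data over a small grid. A sketch with a stand-in piecewise-linear warp (breakpoint and grid values are assumptions):

```python
# VTLN sketch: grid search over warp factors with a piecewise-linear
# frequency warp. The breakpoint (0.875 * f_max) and the alpha grid
# are common choices, not necessarily those used in JANUS.
import numpy as np

def warp_frequencies(freqs, alpha, f_max=8000.0, f_break=0.875):
    # Scale by alpha below the breakpoint, then continue linearly so
    # that f_max still maps to f_max.
    fb = f_break * f_max
    return np.where(
        freqs <= fb,
        alpha * freqs,
        alpha * fb + (f_max - alpha * fb) * (freqs - fb) / (f_max - fb))

def best_warp(loglik_fn, alphas=np.arange(0.88, 1.13, 0.02)):
    # loglik_fn(alpha): likelihood of the speaker's adaptation data
    # when its features are extracted with warp factor alpha.
    return max(alphas, key=loglik_fn)
```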
Roland Kuhn, STL (U.S.A.)
Peter Nowell, DRA Malvern (U.K.)
Caroline Drouin, CRIM (Canada)
Topic spotting is often performed on the output of a large vocabulary recognizer or a keyword spotter. However, this requires detailed knowledge of the vocabulary, as well as transcribed training data. If portability to new topics and languages is important, then a topic spotter based on phoneme recognition is preferable. A phoneme recognizer is run on training data consisting of audio files labeled only by topic; no word transcripts are required. Phoneme sub-sequences which help to predict the topic are then extracted automatically. This work was carried out by two teams exploring three different approaches to phoneme-based topic spotting: the ``DP-ngram'', the ``decision tree'', and the ``Euclidean'' approach. Results obtained by each team on the ARM (Airborne Reconnaissance Mission) and Switchboard data sets were compared by means of Receiver Operating Characteristic (ROC) curves. The best performance for each team was obtained via a similar type of discriminative training.
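As a concrete, deliberately simplistic illustration of phoneme-based topic spotting (distinct from the three approaches studied in the paper), one can score a phoneme string against per-topic n-gram counts:

```python
# Naive phoneme-trigram topic scorer with add-one smoothing.
# Illustrative only; not the DP-ngram, decision tree, or Euclidean
# approach from the paper.
import math
from collections import Counter

def ngrams(phones, n=3):
    return zip(*(phones[i:] for i in range(n)))

def train(topic_to_strings, n=3):
    return {t: Counter(g for s in strings for g in ngrams(s, n))
            for t, strings in topic_to_strings.items()}

def score(phones, counts, n=3):
    total = sum(counts.values()) + len(counts) + 1
    return sum(math.log((counts[g] + 1) / total) for g in ngrams(phones, n))

def spot_topic(phones, models):
    return max(models, key=lambda t: score(phones, models[t]))
```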
Philip N. Garner, DRA Malvern (U.K.)
Aidan Hemsworth, DRA Malvern (U.K.)
The concept of usefulness for keyword selection in topic identification problems is reformulated and extended to the multi-class domain. The derivation is shown to be a generalisation of that for the two-class problem. The technique is applied to both multinomial- and Poisson-based estimates of word probability, and is shown to outperform, or compare favourably with, various information-theoretic techniques when classifying dialogue moves in the Map Task corpus and reports in the LOB corpus.
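The usefulness measure itself is derived in the paper; shown below is only the Poisson word-probability estimate the abstract mentions, used in a naive-Bayes-style class score over document word counts (a hedged sketch, not the authors' formulation):

```python
# Poisson model of per-document word counts, summed over a selected
# keyword set to score one class. Illustrative only.
import math

def poisson_logp(k, lam):
    # log P(k occurrences) under a Poisson with rate lam per document.
    return k * math.log(lam) - lam - math.lgamma(k + 1)

def class_score(doc_counts, class_rates, keywords):
    # class_rates[w]: expected count of w per document in this class.
    return sum(poisson_logp(doc_counts.get(w, 0), class_rates[w])
               for w in keywords)
```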
Seong-Jin Yun, KAIST (Korea)
Yung-Hwan Oh, KAIST (Korea)
Gyung-Chul Shin, KAIST (Korea)
We propose a stochastic lexicon model that represents pronunciation variations so as to optimally match the continuous speech recognizer. In this lexicon model, the baseform of each word is represented as a hidden Markov model whose subword states carry probability distributions over subword units. The proposed approach can also be applied to systems employing non-linguistic recognition units, and the lexicon is trained automatically from training utterances. In speaker-independent recognition tests on a 3000-word continuous speech database, the proposed system improves word accuracy by about 27.8% and sentence accuracy by about 22.4%.
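A rough sketch of how such a stochastic lexicon entry might be scored, assuming a left-to-right HMM whose states carry distributions over subword units (structure and names are assumptions, not the authors' exact model):

```python
# Forward-algorithm scoring of a recognized subword sequence against
# one lexicon entry. Illustrative sketch only.

def forward_prob(obs, emit, trans):
    """obs: list of subword labels; emit[s][u]: P(unit u | state s);
    trans[s][s2]: transition probability; states 0..S-1, left-to-right,
    entering at state 0 and ending in state S-1."""
    S = len(emit)
    alpha = [emit[0].get(obs[0], 1e-9)] + [0.0] * (S - 1)
    for u in obs[1:]:
        alpha = [sum(alpha[sp] * trans[sp].get(s, 0.0) for sp in range(S))
                 * emit[s].get(u, 1e-9) for s in range(S)]
    return alpha[S - 1]
```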