Authors:
Long Nguyen, BBN Technologies, GTE Internetworking (USA)
Richard Schwartz, BBN Technologies, GTE Internetworking (USA)
Paper number 1028
Abstract:
In January 1993, BBN demonstrated for the first time a real-time, 20K-word,
speaker-independent, continuous speech recognition system, implemented
in software on an off-the-shelf workstation. The key to the real-time
system was a novel, proprietary fast-match algorithm which had two
important properties: high-accuracy recognition and run-time proportional
to only the cube root of the vocabulary size. This paper describes
that fast-match algorithm in detail. While a number of fast-match
algorithms have been published, the BBN algorithm continues to have
novel features that have not appeared in the literature. In this fast-match,
the vocabulary is organized as a phonetic tree, with the last phoneme of
each word always located at a leaf. Each node of the phonetic tree is
assigned a set ID representing the group of words that share that node.
The acoustic models associated with the nodes are composite triphones,
which may have more than one right context. The language models used in
this fast-match are set probabilities Pr(set_of_words|some_word), which
are evaluated at every node during the search. The search itself is
similar to the usual beam search, with one addition: to activate a node,
we temporarily apply a set bigram; on leaving that node, we remove that
temporary bigram.
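
As a rough illustration of the data structure described above, here is a
minimal Python sketch of a phonetic tree with a word-set ID at each node and
dedicated leaves for final phonemes. All names, and the reading of
Pr(set_of_words|some_word) as an accumulation of word bigrams over a set,
are our assumptions; the proprietary algorithm is not fully specified by
the abstract.

```python
# Minimal sketch of the phonetic-tree organization described above.
# Class and function names are illustrative, not from the paper.

class PhoneNode:
    def __init__(self, phone):
        self.phone = phone
        self.children = {}     # next phone -> shared PhoneNode
        self.leaves = []       # word-final phonemes end in dedicated leaves
        self.word_set = set()  # the "set id": words sharing this node
        self.word_id = None    # set only at a leaf

def build_phonetic_tree(lexicon):
    """lexicon: dict of word_id -> phone sequence."""
    root = PhoneNode(None)
    for word_id, phones in lexicon.items():
        node = root
        for phone in phones[:-1]:
            node = node.children.setdefault(phone, PhoneNode(phone))
            node.word_set.add(word_id)
        leaf = PhoneNode(phones[-1])   # last phoneme always at a leaf
        leaf.word_set.add(word_id)
        leaf.word_id = word_id
        node.leaves.append(leaf)
    return root

def set_bigram(prev_word, node, bigram):
    # One plausible reading of Pr(set_of_words | some_word): accumulate
    # the word bigrams over the set of words sharing this node.
    return sum(bigram.get((prev_word, w), 0.0) for w in node.word_set)
```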
Authors:
Akinobu Lee, School of Informatics, Kyoto University (Japan)
Tatsuya Kawahara, School of Informatics, Kyoto University (Japan)
Shuji Doshita, School of Informatics, Kyoto University (Japan)
Paper number 655
Abstract:
We propose an efficient two-pass search algorithm for LVCSR. Instead
of a conventional word graph, the first, preliminary pass generates a "word
trellis index", keeping track of all surviving word hypotheses within
the beam at every time frame. Because it represents all found word boundaries
non-deterministically, we can (1) obtain accurate sentence-dependent
hypotheses on the second pass, and (2) avoid the expensive word-pair
approximation on the first pass. The second pass performs an efficient
stack decoding search, in which the index serves both as a predicted
word list and as a heuristic. Experimental results on a 5,000-word Japanese
dictation task show that, compared with the word-graph method, this
trellis-based method runs with less than one tenth of the memory cost while
maintaining high accuracy. Finally, by handling inter-word context dependency,
we achieved a word error rate of 5.6%.
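
A hedged sketch of how such a "word trellis index" might be recorded on the
first pass and consulted on the second follows; the exact record contents
and field names are our assumptions based on the abstract.

```python
# Sketch of a word trellis index: for every time frame, record all word
# hypotheses that survived the beam with that end frame.
from collections import defaultdict
from typing import NamedTuple

class WordHyp(NamedTuple):
    word: str
    start_frame: int
    score: float   # forward score of the hypothesis at its end frame

class TrellisIndex:
    def __init__(self):
        self.by_end_frame = defaultdict(list)

    def record(self, end_frame, hyp):
        # Called for every surviving hypothesis on the first pass;
        # boundaries are never collapsed to a single best choice, so
        # all found word boundaries are kept non-deterministically.
        self.by_end_frame[end_frame].append(hyp)

    def predicted_words(self, frame):
        # On the second (stack decoding) pass, the entries ending at
        # this frame act as the predicted word list, and their scores
        # as a heuristic for ordering the stack.
        return sorted(self.by_end_frame[frame], key=lambda h: -h.score)
```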
Authors:
Mike Schuster, ATR, Interpreting Telecommunications Laboratories (Japan)
Paper number 464
Abstract:
This paper describes some of the implementation details of the "Nozomi"
stack decoder for LVCSR. The decoder was tested on a Japanese Newspaper
Dictation Task using a 5,000-word vocabulary. Using continuous-density
acoustic models with 2000 and 3000 states trained on the JNAS/ASJ corpora
and a 3-gram LM trained on the RWC text corpus, both models provided
by the IPA group, it was possible to reach more than 95% word accuracy
on the standard test set. With computationally cheap acoustic models
we could achieve around 89% accuracy in near real time on a 300 MHz
Pentium II. Using a disk-based LM, total memory usage could be reduced
to 4 MB.
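
The abstract does not describe the disk-based LM format, but a
binary-searched file of fixed-size n-gram records is one standard way to
get a small memory footprint; the sketch below is ours, not Nozomi's
actual layout.

```python
# Illustrative disk-based LM lookup: fixed-size (key, log-prob) records
# sorted by a packed n-gram key, binary-searched with seek() so that
# memory stays O(1) regardless of LM size.
import struct

RECORD = struct.Struct("<Qf")  # 64-bit packed n-gram key, float log-prob

def lookup_logprob(f, num_records, key, backoff=None):
    """Binary search an open binary file of sorted records."""
    lo, hi = 0, num_records - 1
    while lo <= hi:
        mid = (lo + hi) // 2
        f.seek(mid * RECORD.size)
        k, logp = RECORD.unpack(f.read(RECORD.size))
        if k == key:
            return logp
        elif k < key:
            lo = mid + 1
        else:
            hi = mid - 1
    return backoff  # not found: caller falls back to a lower-order n-gram
```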
Authors:
Thomas Kemp, ISL, University of Karlsruhe (Germany)
Alex Waibel, ISL, University of Karlsruhe (Germany)
Paper number 757
Abstract:
To achieve the long-term goal of robust, real-time broadcast news transcription,
several problems have to be overcome, e.g. the variety of acoustic
conditions and the unlimited vocabulary. In this paper we address
the problem of unlimited vocabulary. We show that this problem is
more serious for German than it is for English. Using a speech recognition
system with a large vocabulary, we dynamically adapt the active vocabulary
to the topic of the current news segment. This is done by using information
retrieval (IR) techniques on a large collection of texts automatically
gathered from the internet. The same technique is also used to adapt
the language model of the recognition system. The process of vocabulary
adaptation and language model retraining is completely unsupervised.
We show that dynamic vocabulary adaptation can significantly reduce
the out-of-vocabulary (OOV) rate and the word error rate of our broadcast
news transcription system, View4You.
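
As a rough sketch of such an adaptation loop (our reading of the abstract,
with all names and thresholds invented for illustration), one could
retrieve topically similar texts with TF-IDF and promote their frequent
words into the active vocabulary:

```python
# Hedged sketch of topic-based vocabulary adaptation via IR: rank the
# gathered texts by TF-IDF similarity to a first-pass transcript of the
# news segment, then add frequent words from the top texts.
import math
from collections import Counter

def tfidf_vectors(docs):
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    n = len(docs)
    return [{w: tf * math.log(n / df[w]) for w, tf in Counter(doc).items()}
            for doc in docs]

def cosine(a, b):
    dot = sum(v * b[w] for w, v in a.items() if w in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def adapt_vocabulary(transcript, corpus, base_vocab, k_docs=50, n_new=2000):
    """transcript: tokens from a first-pass recognition of the segment;
    corpus: list of tokenized texts gathered from the internet."""
    vecs = tfidf_vectors(corpus + [transcript])
    query, doc_vecs = vecs[-1], vecs[:-1]
    top = sorted(range(len(corpus)),
                 key=lambda i: cosine(query, doc_vecs[i]),
                 reverse=True)[:k_docs]
    counts = Counter(w for i in top for w in corpus[i])
    new_words = [w for w, _ in counts.most_common() if w not in base_vocab]
    return set(base_vocab) | set(new_words[:n_new])
```

The retrieved texts could likewise serve as the corpus for the language
model retraining the abstract mentions.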
Authors:
Michiel Bacchiani, Boston University, ECE Department (USA)
Mari Ostendorf, Boston University, ECE Department (USA)
Paper number 586
Abstract:
Although most parameters in a speech recognition system are estimated
from data, the unit inventory and lexicon are generally hand-crafted
and therefore unlikely to be optimal. This paper describes a joint
solution to the problems of learning a unit inventory and corresponding
lexicon from data. The methodology, which requires multiple training
tokens per word, is then extended to handle infrequently observed words
using a hybrid system that combines automatically-derived units with
phone-based units. The hybrid system outperforms a phone-based system
in first-pass decoding experiments on a large vocabulary conversational
speech recognition task.
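
A minimal sketch of the hybrid-lexicon idea, under our reading of the
abstract (the token threshold and all names are assumptions):

```python
# Words with enough training tokens get automatically derived units;
# infrequently observed words fall back to phone-based pronunciations.
def build_hybrid_lexicon(word_counts, auto_lexicon, phone_lexicon,
                         min_tokens=5):
    lexicon = {}
    for word, count in word_counts.items():
        if count >= min_tokens and word in auto_lexicon:
            lexicon[word] = auto_lexicon[word]   # learned units
        else:
            lexicon[word] = phone_lexicon[word]  # phone-based units
    return lexicon
```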
Authors:
Don McAllaster, Dragon Systems (USA)
Lawrence Gillick, Dragon Systems (USA)
Francesco Scattone, Dragon Systems (USA)
Michael Newman, Dragon Systems (USA)
Paper number 986
Abstract:
We present a study of data simulated using acoustic models trained
on Switchboard data, and then recognized using various Switchboard-trained
models. Simple development models give a word error rate (WER) of about
47%, when recognizing real Switchboard conversations. If we simulate
speech from word transcriptions, obtaining the word pronunciations
from our recognition dictionary, the WER drops by a factor of five
to ten. If we use more realistic hand-labeled phonetic transcripts
to fabricate data, we obtain WERs in the low 40s, close to those found
in actual speech data. These and other experiments we describe in
the paper suggest that there is a substantial mismatch between real
speech and the combination of our acoustic models and the pronunciations
in our recognition dictionary. The use of simulation in speech recognition
research appears to be a promising tool in our efforts to understand
and reduce the size of this mismatch.
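
The abstract does not spell out the fabrication procedure, but simulating
data from trained HMMs typically amounts to sampling state durations and
emitting frames from the state output distributions; the following sketch
makes those assumptions explicit (single-Gaussian states are a
simplification of the paper's Switchboard-trained models).

```python
# Hedged sketch of fabricating acoustic data from a transcription: walk
# the states of each phone model, sample a geometric duration from the
# self-loop probability, and draw observation frames per state.
import numpy as np

def simulate_utterance(phones, models, rng=None):
    """models: phone -> list of state dicts with 'mean' (vector),
    'var' (vector), and 'self_loop' (self-transition probability)."""
    rng = rng or np.random.default_rng()
    frames = []
    for phone in phones:
        for state in models[phone]:
            dur = rng.geometric(1.0 - state["self_loop"])
            for _ in range(int(dur)):
                frames.append(rng.normal(state["mean"],
                                         np.sqrt(state["var"])))
    return np.stack(frames)
```

Recognizing such fabricated frames with the same (or different)
Switchboard-trained models is what lets the paper isolate the
model/pronunciation mismatch from real acoustic variability.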