Large Vocabulary Continuous Speech Recognition 3

The BBN Single-Phonetic-Tree Fast-Match Algorithm

Authors:

Long Nguyen, BBN Technologies, GTE Internetworking (USA)
Richard Schwartz, BBN Technologies, GTE Internetworking (USA)

Page (NA) Paper number 1028

Abstract:

In January 1993, BBN demonstrated for the first time a real-time, 20K-word, speaker-independent, continuous speech recognition system, implemented in software on an off-the-shelf workstation. The key to the real-time system was a novel, proprietary fast-match algorithm with two important properties: high-accuracy recognition and a run time proportional to only the cube root of the vocabulary size. This paper describes that fast-match algorithm in detail. While a number of fast-match algorithms have been published, the BBN algorithm continues to have novel features that have not appeared in the literature. In this fast-match, the vocabulary is organized as a phonetic tree, with the last phoneme of each word always located at a leaf. Each node of the phonetic tree is assigned a set id representing the group of words that share that node. The acoustic models associated with the nodes are composite triphones, which may have more than one right context. The language models used in this fast-match are probabilities of the form Pr(set_of_words|some_word), which are evaluated at every node during the search. The search itself is similar to the usual beam search with one addition: to activate a node, we temporarily use a set bigram; to leave that node, we remove that temporary bigram.
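The tree organization described above can be illustrated with a short sketch: build a phonetic prefix tree from a toy lexicon and record, at each node, the set of words sharing that node (the role played by the set ids). This is a hypothetical reconstruction for clarity, not BBN's proprietary implementation; the lexicon and data layout are assumptions.

```python
def build_phonetic_tree(lexicon):
    """Build a phonetic prefix tree from word -> phoneme-list entries.
    Each node records the set of words sharing its phoneme prefix
    (the 'set id' of the abstract). Illustrative sketch only, not
    BBN's proprietary implementation."""
    root = {"children": {}, "words": set()}
    for word, phones in lexicon.items():
        node = root
        for ph in phones:
            node = node["children"].setdefault(
                ph, {"children": {}, "words": set()})
            node["words"].add(word)  # every word sharing this prefix
    return root

# Toy lexicon (assumed phoneme transcriptions):
lexicon = {
    "cat":  ["k", "ae", "t"],
    "cab":  ["k", "ae", "b"],
    "cool": ["k", "uw", "l"],
}
tree = build_phonetic_tree(lexicon)
# The node for prefix [k, ae] is shared by "cat" and "cab":
shared = tree["children"]["k"]["children"]["ae"]["words"]
```

Because the last phoneme of each word sits at a leaf, a word is uniquely identified only at its leaf; interior nodes stand for whole word sets, which is what makes set-level language model scores applicable during the search.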

SL981028.PDF (From Author) SL981028.PDF (Rasterized)



An Efficient Two-pass Search Algorithm Using Word Trellis Index

Authors:

Akinobu Lee, School of Informatics, Kyoto University (Japan)
Tatsuya Kawahara, School of Informatics, Kyoto University (Japan)
Shuji Doshita, School of Informatics, Kyoto University (Japan)

Page (NA) Paper number 655

Abstract:

We propose an efficient two-pass search algorithm for LVCSR. Instead of a conventional word graph, the first, preliminary pass generates a "word trellis index", keeping track of all surviving word hypotheses within the beam at every time frame. Because the index represents all found word boundaries non-deterministically, we can (1) obtain accurate sentence-dependent hypotheses in the second pass, and (2) avoid the expensive word-pair approximation in the first pass. The second pass performs an efficient stack decoding search, in which the index is used as a predicted word list and as a source of heuristic scores. Experimental results on a 5,000-word Japanese dictation task show that, compared with the word-graph method, this trellis-based method runs with less than one tenth of the memory cost while maintaining high accuracy. Finally, by handling inter-word context dependency, we achieved a word error rate of 5.6%.
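The data structure can be sketched as a per-frame record of the word hypotheses that survived the beam, which the second pass consults to enumerate candidate words ending at a given frame. The function and the hypothesis format below are illustrative assumptions, not the authors' implementation.

```python
def build_trellis_index(frames):
    """frames[t] holds the word-end hypotheses surviving the beam at
    time frame t, each as (word, start_frame, log_score). The index
    keeps every survivor (so word boundaries stay non-deterministic),
    sorted best first, letting the second pass use it both as a
    predicted word list and as a heuristic score.
    Illustrative sketch; the hypothesis format is an assumption."""
    return {t: sorted(hyps, key=lambda h: -h[2])
            for t, hyps in enumerate(frames)}

# Toy first-pass output: two hypotheses survive at frame 1.
frames = [
    [("silence", 0, -0.2)],
    [("word_a", 0, -1.0), ("word_b", 0, -0.5)],
]
index = build_trellis_index(frames)
```

Keeping all survivors rather than committing to single word boundaries is what distinguishes this index from a word graph: boundary decisions are deferred to the second, sentence-dependent pass.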

SL980655.PDF (From Author) SL980655.PDF (Rasterized)



Nozomi -- a Fast, Memory-Efficient Stack Decoder For LVCSR

Authors:

Mike Schuster, ATR, Interpreting Telecommunications Laboratories (Japan)

Page (NA) Paper number 464

Abstract:

This paper describes some of the implementation details of the ``Nozomi'' stack decoder for LVCSR. The decoder was tested on a Japanese newspaper dictation task using a 5000-word vocabulary. Using continuous-density acoustic models with 2000 and 3000 states trained on the JNAS/ASJ corpora and a 3-gram LM trained on the RWC text corpus, both provided by the IPA group, it was possible to reach more than 95% word accuracy on the standard test set. With computationally cheap acoustic models we could achieve around 89% accuracy in near real time on a 300 MHz Pentium II. Using a disk-based LM, the memory usage could be reduced to 4 MB in total.
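A stack decoder of the kind described here maintains a priority queue (the "stack") of partial sentence hypotheses and always expands the most promising one. The following is a generic best-first sketch of that loop, not Nozomi's implementation; `expand` and `is_final` are assumed callbacks supplied by the recognizer.

```python
import heapq

def stack_decode(start, expand, is_final):
    """Best-first stack decoding: repeatedly pop the highest-scoring
    partial hypothesis and extend it by one word. Generic sketch of
    the technique, not Nozomi's implementation."""
    heap = [(0.0, start)]  # (negated accumulated log-prob, hypothesis)
    while heap:
        cost, hyp = heapq.heappop(heap)
        if is_final(hyp):
            return hyp, -cost  # first completed hypothesis is the best
        for succ, logp in expand(hyp):
            heapq.heappush(heap, (cost - logp, succ))
    return None, float("-inf")

# Toy search space: hypotheses are tuples of words (assumed format).
successors = {
    (): [(("a",), -1.0), (("b",), -0.5)],
    ("a",): [(("a", "end"), -0.1)],
    ("b",): [(("b", "end"), -1.0)],
}
best, score = stack_decode((), lambda h: successors.get(h, []),
                           lambda h: len(h) == 2)
```

In practice the per-hypothesis cost would also include a heuristic estimate of the remaining score, and memory optimizations such as the disk-based LM mentioned in the abstract keep the stack's footprint small.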

SL980464.PDF (From Author) SL980464.PDF (Rasterized)



Reducing the OOV Rate in Broadcast News Speech Recognition

Authors:

Thomas Kemp, ISL, University of Karlsruhe (Germany)
Alex Waibel, ISL, University of Karlsruhe (Germany)

Page (NA) Paper number 757

Abstract:

To achieve the long-term goal of robust, real-time broadcast news transcription, several problems have to be overcome, such as the variety of acoustic conditions and the unlimited vocabulary. In this paper we address the problem of unlimited vocabulary. We show that this problem is more serious for German than it is for English. Using a speech recognition system with a large vocabulary, we dynamically adapt the active vocabulary to the topic of the current news segment. This is done by applying information retrieval (IR) techniques to a large collection of texts automatically gathered from the Internet. The same technique is also used to adapt the language model of the recognition system. The process of vocabulary adaptation and language model retraining is completely unsupervised. We show that dynamic vocabulary adaptation can significantly reduce the out-of-vocabulary (OOV) rate and the word error rate of our broadcast news transcription system View4You.
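The vocabulary-adaptation step can be sketched with standard IR machinery: score the words occurring in retrieved topic-related texts by tf-idf and add the top-ranked out-of-vocabulary words to the active vocabulary. This is a hypothetical illustration of the general approach; the function, scoring, and parameters are assumptions, not the View4You implementation.

```python
import math
from collections import Counter

def adapt_vocabulary(base_vocab, retrieved_docs, corpus_df, n_docs,
                     top_k=100):
    """Rank words from topic-related documents by tf-idf and merge the
    top scorers into the active vocabulary. Hypothetical sketch of
    IR-based vocabulary adaptation; names and scoring are assumptions."""
    tf = Counter(w for doc in retrieved_docs for w in doc)
    scores = {w: c * math.log(n_docs / (1 + corpus_df.get(w, 0)))
              for w, c in tf.items() if w not in base_vocab}
    new_words = sorted(scores, key=scores.get, reverse=True)[:top_k]
    return base_vocab | set(new_words)

# Toy example: "genome" dominates the retrieved topic texts.
adapted = adapt_vocabulary(
    base_vocab={"the"},
    retrieved_docs=[["genome", "the", "genome"], ["genome", "cell"]],
    corpus_df={"the": 90, "genome": 2, "cell": 5},
    n_docs=100,
    top_k=1,
)
```

The same retrieved texts could then serve as adaptation data for LM retraining, which is why the abstract describes both steps as fully unsupervised.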

SL980757.PDF (From Author) SL980757.PDF (Rasterized)



Using Automatically-Derived Acoustic Sub-word Units in Large Vocabulary Speech Recognition

Authors:

Michiel Bacchiani, Boston University, ECE Department (USA)
Mari Ostendorf, Boston University, ECE Department (USA)

Page (NA) Paper number 586

Abstract:

Although most parameters in a speech recognition system are estimated from data, the unit inventory and lexicon are generally hand-crafted and therefore unlikely to be optimal. This paper describes a joint solution to the problems of learning a unit inventory and a corresponding lexicon from data. The methodology, which requires multiple training tokens per word, is then extended to handle infrequently observed words using a hybrid system that combines automatically-derived units with phone-based units. The hybrid system outperforms a phone-based system in first-pass decoding experiments on a large vocabulary conversational speech recognition task.

SL980586.PDF (From Author) SL980586.PDF (Rasterized)



Fabricating Conversational Speech Data with Acoustic Models: a Program to Examine Model-Data Mismatch

Authors:

Don McAllaster, Dragon Systems (USA)
Lawrence Gillick, Dragon Systems (USA)
Francesco Scattone, Dragon Systems (USA)
Michael Newman, Dragon Systems (USA)

Page (NA) Paper number 986

Abstract:

We present a study of data simulated using acoustic models trained on Switchboard data, and then recognized using various Switchboard-trained models. Simple development models give a word error rate (WER) of about 47% when recognizing real Switchboard conversations. If we simulate speech from word transcriptions, obtaining the word pronunciations from our recognition dictionary, the WER drops by a factor of five to ten. If we use more realistic hand-labeled phonetic transcripts to fabricate data, we obtain WERs in the low 40s, close to those found on actual speech data. These and other experiments described in the paper suggest that there is a substantial mismatch between real speech and the combination of our acoustic models and the pronunciations in our recognition dictionary. The use of simulation in speech recognition research appears to be a promising tool in our efforts to understand and reduce the size of this mismatch.

SL980986.PDF (From Author) SL980986.PDF (Rasterized)
