Authors:
Akio Ando, NHK Sci. & Tech. Res. Labs. (Japan)
Akio Kobayashi, NHK Sci. & Tech. Res. Labs. (Japan)
Toru Imai, NHK Sci. & Tech. Res. Labs. (Japan)
Page (NA) Paper number 16
Abstract:
This paper describes a thesaurus-based class n-gram model for broadcast
news transcription. The most important issue for class n-gram
models is how to develop the word classification. We construct a word
classification mapping based on a thesaurus so as to maximize the average
mutual information function on a training corpus. To examine the effectiveness
of the new method, we compare it with two of our previous methods, in
which the same thesaurus is used but the word-class mappings are determined
in different ways. The new method achieved substantially lower
perplexity for 83 news transcription sentences broadcast on June 4,
1996.
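The average mutual information criterion mentioned above can be sketched as follows. This is an illustrative toy computation, not the authors' thesaurus-based construction: the corpus and the two class mappings are invented examples.

```python
from collections import Counter
from math import log

def average_mutual_information(corpus, word2class):
    """Average mutual information between adjacent class labels:
    sum over class pairs of P(c1,c2) * log[P(c1,c2) / (P(c1)*P(c2))]."""
    classes = [word2class[w] for w in corpus]
    pairs = list(zip(classes, classes[1:]))
    n = len(pairs)
    pair_counts = Counter(pairs)
    left = Counter(c1 for c1, _ in pairs)   # marginal of the first position
    right = Counter(c2 for _, c2 in pairs)  # marginal of the second position
    ami = 0.0
    for (c1, c2), count in pair_counts.items():
        p12 = count / n
        ami += p12 * log(p12 * n * n / (left[c1] * right[c2]))
    return ami

# A mapping that separates the two alternating word groups scores higher
# than one that collapses every word into a single class.
corpus = ["a", "x", "b", "y"] * 50
split = {"a": 0, "b": 0, "x": 1, "y": 1}
merged = {"a": 0, "b": 0, "x": 0, "y": 0}
```

A word-classification search would then compare candidate mappings by this score and keep the one with the larger value.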
Authors:
Sreeram V. Balakrishnan, Motorola Lexicus Division (USA)
Page (NA) Paper number 295
Abstract:
As speech recognition systems are increasingly applied to real world
problems, it is often desirable to use the same recognition engine
for a variety of tasks of differing complexity. This paper explores
the relationship between the complexity of the recognition task and
the best strategies for pruning the recognition search space. We examine
two types of task: 20000 word WSJ dictation, and phone book access
using a 60 word grammar. For both tasks we compare two strategies
for pruning the search space: absolute pruning, where the number of
hypotheses is controlled by eliminating those whose scores fall more
than a fixed beamwidth below the best-scoring hypothesis, and rank-based
pruning, where hypotheses are ranked by score and all hypotheses
beneath a certain rank are eliminated. We present statistics characterizing
the behaviour of the recognizer under different pruning strategies
and show how the strategies affect error-rates.
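The two pruning strategies compared in this abstract can be sketched directly; the hypothesis labels and log-scores below are made-up examples, not data from the paper.

```python
def absolute_prune(hyps, beamwidth):
    """Absolute pruning: drop hypotheses scoring more than `beamwidth`
    below the best hypothesis (scores are log-probabilities)."""
    best = max(score for _, score in hyps)
    return [(h, s) for h, s in hyps if s >= best - beamwidth]

def rank_prune(hyps, max_rank):
    """Rank-based pruning: keep only the `max_rank` best-scoring hypotheses."""
    return sorted(hyps, key=lambda hs: hs[1], reverse=True)[:max_rank]

# Invented example: four active hypotheses with log-scores.
hyps = [("hyp A", -1.0), ("hyp B", -3.0), ("hyp C", -10.0), ("hyp D", -2.0)]
```

Note the different behavior: absolute pruning keeps a variable number of hypotheses depending on how scores cluster, while rank-based pruning keeps a fixed number regardless of the score spread.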
Authors:
Dhananjay Bansal, SCS, Carnegie Mellon University (USA)
Mosur K. Ravishankar, SCS, Carnegie Mellon University. (USA)
Page (NA) Paper number 829
Abstract:
In this paper we describe two new confidence measures for estimating
the reliability of speech-to-text output: "Likelihood Dependence" and
"Neighborhood Dependence". Each word in the speech-to-text output for
a given utterance is annotated with these two measures. Likelihood
dependence for a given word occurrence indicates how critical that
word is to the overall utterance likelihood, i.e., how much worse the
likelihood of the next best utterance becomes if that word is eliminated
from the recognition. Neighborhood dependence measures how stable a
given word is when neighboring words are changed in the recognition.
We show that correct and incorrect words in the recognition behave
significantly differently with respect to these measures. We also show
that on the broadcast news task they perform better than some of the
existing, commonly used measures.
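One reading of "likelihood dependence" can be sketched over an N-best list: for each word in the top hypothesis, measure the score drop to the best hypothesis that omits it. This is an illustrative simplification, not the authors' exact computation, and the hypotheses and scores are invented.

```python
def likelihood_dependence(nbest):
    """For each word in the top hypothesis, the drop in log-likelihood to the
    best competing hypothesis that does not contain that word.
    `nbest` is a list of (word_list, log_score) pairs sorted best-first."""
    best_words, best_score = nbest[0]
    dep = {}
    for w in best_words:
        rivals = [s for words, s in nbest[1:] if w not in words]
        # No rival omits the word: the word is maximally stable here.
        dep[w] = best_score - max(rivals) if rivals else float("inf")
    return dep
```

Words with a large dependence value are critical to the utterance likelihood and, per the abstract, behave differently when correct versus incorrect.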
Authors:
Jerome R. Bellegarda, Apple Computer, Inc. (USA)
Page (NA) Paper number 134
Abstract:
The goal of multi-span language modeling is to integrate the various
constraints, both local and global, that are present in the language.
In this paper, local constraints are captured via the usual n-gram
approach, while global constraints are taken into account through the
use of latent semantic analysis. An integrative formulation is derived
for the combination of these two paradigms, resulting in an entirely
data-driven, multi-span framework for large vocabulary speech recognition.
Because of the inherent complementarity in the two types of constraints,
the performance of the integrated language model compares favorably
with the corresponding n-gram performance. On a subset of the Wall
Street Journal speaker-independent, 20,000-word vocabulary, continuous
speech task, we observed a reduction in perplexity of about 25%, and
a reduction in average error rate of about 15%.
Authors:
Rathinavelu Chengalvarayan, Lucent Technologies (USA)
Page (NA) Paper number 21
Abstract:
Previous studies have shown that significantly enhanced recognition performance
can be achieved by incorporating information about HMM duration along
with the cepstral parameters. The reestimation formulas for the duration
parameters have previously been derived using fixed segmentation during
K-means training, and the duration statistics were then held fixed throughout
the additional minimum string error (MSE) training process. In this
study, we update the duration parameters along with the other model parameters
during the discriminative training iterations. The convergence property
of the training procedure based on the MSE approach is investigated,
and experimental results on a wireline connected-digit recognition task
demonstrate a 6% word error rate reduction when the newly trained
duration model parameters are used instead of duration parameters held fixed
during MSE training.
Authors:
Noah Coccaro, University of Colorado at Boulder: Department of Computer Science (USA)
Daniel Jurafsky, University of Colorado at Boulder: Departments of Linguistics and Computer Science (USA)
Page (NA) Paper number 852
Abstract:
We introduce a number of techniques designed to help integrate semantic
knowledge with N-gram language models for automatic speech recognition.
Our techniques allow us to integrate Latent Semantic Analysis (LSA),
a word-similarity algorithm based on word co-occurrence information,
with N-gram models. While LSA is good at predicting content words
which are coherent with the rest of a text, it is a bad predictor of
frequent words, has a low dynamic range, and is inaccurate when combined
linearly with N-grams. We show that modifying the dynamic range, applying
a per-word confidence metric, and using geometric rather than linear
combinations with N-grams produces a more robust language model which
has a lower perplexity on a Wall Street Journal test-set than a baseline
N-gram model.
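The linear-versus-geometric combination contrast can be sketched on two toy distributions over a shared vocabulary. This is a minimal illustration of the combination rules only; in a real language model the geometric combination must be renormalised over the full vocabulary, as done here on a two-word vocabulary.

```python
def linear_combine(p_ngram, p_lsa, lam=0.5):
    """Linear interpolation of two distributions (dicts over one vocabulary)."""
    return {w: lam * p_ngram[w] + (1 - lam) * p_lsa[w] for w in p_ngram}

def geometric_combine(p_ngram, p_lsa, lam=0.5):
    """Weighted geometric mean of two distributions, renormalised to sum to one."""
    raw = {w: p_ngram[w] ** lam * p_lsa[w] ** (1 - lam) for w in p_ngram}
    z = sum(raw.values())
    return {w: p / z for w, p in raw.items()}

# Invented example distributions: the n-gram model is confident, LSA is flat.
ngram = {"a": 0.8, "b": 0.2}
lsa = {"a": 0.5, "b": 0.5}
```

The geometric combination behaves like a product of experts: a flat LSA prediction dilutes the n-gram's preference less than linear averaging does, which is one reason the geometric form can be more robust.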
Authors:
Julio Pastor, Grupo de Tecnología del Habla - Departamento de Ingeniería Electrónica - E. T. S. I. Telecomunicación - Universidad Politécnica de Madrid (Spain)
José Colás, Grupo de Tecnología del Habla - Departamento de Ingeniería Electrónica - E. T. S. I. Telecomunicación - Universidad Politécnica de Madrid (Spain)
Rubén San-Segundo, Grupo de Tecnología del Habla - Departamento de Ingeniería Electrónica - E. T. S. I. Telecomunicación - Universidad Politécnica de Madrid (Spain)
José Manuel Pardo, Grupo de Tecnología del Habla - Departamento de Ingeniería Electrónica - E. T. S. I. Telecomunicación - Universidad Politécnica de Madrid (Spain)
Page (NA) Paper number 1108
Abstract:
Tag definition in stochastic language models (n-grams and n-pos) is
based on grouping together words with similar right and left context
behavior. A modification of the n-gram model using multi-tagged words
and unsupervised clustering was already introduced for French with
a corpus of millions of non-tagged words. We present a variation of the
bi-pos language model in which two tag sets are defined and assigned to
each word (multi-tagged model) using grammatical information. Each
tag set is based on a different context behavior. We use linguistic
expert knowledge and a simple automatic clustering procedure to obtain
groups of words with similar left-context behavior (first set of tags)
and with similar right-context behavior (second set of tags). We propose
a grammatically based model that is useful when no large text corpus is
available; a performance increase was observed when multi-tagged words
were used, owing to their better adaptation to the language.
Authors:
Vassilis Digalakis, Technical University of Crete (Greece)
Leonardo Neumeyer, SRI International (USA)
Manolis Perakakis, Technical University of Crete (Greece)
Page (NA) Paper number 940
Abstract:
We follow the paradigm that we previously introduced for encoding
the recognizer parameters in a client-server model used for recognition
over wireless networks and the WWW, aiming to maximize recognition
performance rather than perceptual reproduction quality. We present a new encoding
scheme for the mel frequency-warped cepstral parameters (MFCCs) that
uses product-code vector quantization, and we find that the required
bit rate to achieve the recognition performance of high-quality unquantized
speech is just 2000 bits per second. We also investigate the effect
of additive noise on the recognition performance when quantized features
are used, and we find that a small increase in the bit rate can provide
the necessary robustness.
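Product-code vector quantization of a cepstral vector can be sketched as follows: the vector is split into sub-vectors, each quantized against its own small codebook. The codebooks below are hand-picked toy values; real codebooks would be trained (e.g. by k-means) on MFCC data.

```python
def quantise(vector, codebooks):
    """Product-code VQ: split the vector into equal-length sub-vectors and
    return the nearest-codeword index for each sub-vector."""
    dim = len(vector) // len(codebooks)
    indices = []
    for i, book in enumerate(codebooks):
        sub = vector[i * dim:(i + 1) * dim]
        # Nearest codeword by squared Euclidean distance.
        idx = min(range(len(book)),
                  key=lambda k: sum((a - b) ** 2 for a, b in zip(sub, book[k])))
        indices.append(idx)
    return indices

def reconstruct(indices, codebooks):
    """Concatenate the selected codewords back into a full feature vector."""
    out = []
    for idx, book in zip(indices, codebooks):
        out.extend(book[idx])
    return out
```

The bit rate is then the sum over sub-vectors of log2(codebook size) bits per frame, which is how a budget such as 2000 bits per second is allocated across the feature dimensions.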
Authors:
Bernard Doherty, The Queen's University of Belfast (Ireland)
Saeed Vaseghi, The Queen's University of Belfast (Ireland)
Paul McCourt, The Queen's University of Belfast (Ireland)
Page (NA) Paper number 323
Abstract:
This paper presents a novel method for modeling phonetic context using
linear context transforms. Initial investigations have shown the feasibility
of synthesising context dependent models from context independent models
through weighted interpolation of the peripheral states of a given
hidden Markov model with its adjacent model. This idea can be further
extended to maximum likelihood estimation of not only single weights
but a matrix of weights, i.e. a transform. This paper outlines the application
of Maximum Likelihood Linear Regression (MLLR) as a means of modeling
context dependency in continuous density Hidden Markov Models (HMM).
Authors:
Michael T. Johnson, Purdue University (USA)
Mary P. Harper, Purdue University (USA)
Leah H. Jamieson, Purdue University (USA)
Page (NA) Paper number 871
Abstract:
The research presented here focuses on implementation and efficiency
issues associated with the use of word graphs for interfacing acoustic
speech recognition systems with natural language processing systems.
The effectiveness of various pruning methods for graph construction
is examined, as well as techniques for word graph compression. In
addition, the word graph representation is compared to another predominant
interface method, the N-best sentence list.
Authors:
Photina Jaeyoun Jang, School of Computer Science, Carnegie Mellon University (USA)
Alexander G. Hauptmann, School of Computer Science, Carnegie Mellon University (USA)
Page (NA) Paper number 934
Abstract:
We propose an unsupervised learning algorithm that learns hierarchical
patterns of word sequences in spoken language utterances. It extracts
cluster rules from training data based on high n-gram language model
probabilities to cluster words or segment a sentence. Cluster trees,
similar to parse trees, are constructed from the learned cluster rules.
This hierarchical clustering adds grammatical structure onto a traditional
trigram language model. The learned cluster rules are used to rescore
and improve the n-best utterance hypothesis list which is output by
a speech recognizer based on acoustic and trigram language model scores.
Our hierarchical cluster language model was trained on TREC broadcast
news data from 1995 and 1996, and reduced word error rate on the HUB-4
1997 broadcast news development set by 0.3% absolute. Prior symbolic
knowledge in the form of rules can also be incorporated by simply applying
the rules to the training data before the applicable learning iteration.
Authors:
Atsuhiko Kai, Toyohashi University of Technology (Japan)
Yoshifumi Hirose, Toyohashi University of Technology (Japan)
Seiichi Nakagawa, Toyohashi University of Technology (Japan)
Page (NA) Paper number 785
Abstract:
In this study, we investigate the effectiveness of an unknown word
processing (UWP) algorithm, which is incorporated into an N-gram language
model based speech recognition system to deal with filled pauses
and out-of-vocabulary (OOV) words. We have previously investigated
the effect of the UWP algorithm, which utilizes a simple subword sequence
decoder, in a spoken dialog system using a context-free grammar (CFG)
as the language model. Here, the effect of the UWP algorithm was investigated
using an N-gram based continuous speech recognition system on both a small
dialog task and a large-vocabulary read-speech dictation task. The
experimental results showed that UWP improves the recognition accuracy
and that an N-gram based system with UWP can improve the understanding
performance compared with a CFG-based system.
Authors:
Tetsunori Kobayashi, Waseda University (Japan)
Yosuke Wada, Waseda University (Japan)
Norihiko Kobayashi, Waseda University (Japan)
Page (NA) Paper number 708
Abstract:
Information source extension is utilized to improve the language model
for large vocabulary continuous speech recognition (LVCSR). McMillan's
theory, which states that source extension brings the model entropy closer
to the true source entropy, implies that a better language model can be
obtained by source extension (creating new units through word concatenation
and using these units for language modeling). In this paper, we examine
the effectiveness of this source extension. We tested two methods
of source extension: frequency-based extension and entropy-based extension.
We evaluated the effect in terms of perplexity and recognition accuracy
using Mainichi newspaper articles and the JNAS speech corpus. As a result,
the bigram perplexity was improved from 98.6 to 70.8 and the trigram
perplexity from 41.9 to 26.4. The bigram-based recognition accuracy
was improved from 79.8% to 85.3%.
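Frequency-based source extension can be sketched as merging the most frequent adjacent word pairs into new compound units; the toy corpus and the greedy left-to-right merge below are illustrative assumptions, not the paper's exact procedure.

```python
from collections import Counter

def extend_source(corpus, top_k=1):
    """Frequency-based extension: merge the top_k most frequent adjacent
    word pairs into single new units (greedy, left to right)."""
    merges = {pair for pair, _ in
              Counter(zip(corpus, corpus[1:])).most_common(top_k)}
    out, i = [], 0
    while i < len(corpus):
        if i + 1 < len(corpus) and (corpus[i], corpus[i + 1]) in merges:
            out.append(corpus[i] + "_" + corpus[i + 1])  # new extended unit
            i += 2
        else:
            out.append(corpus[i])
            i += 1
    return out
```

An n-gram model trained on the extended token stream effectively conditions on longer word histories wherever a merged unit appears, which is the mechanism behind the perplexity reductions reported above.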
Authors:
Akio Kobayashi, NHK Sci. & Tech. Res. Labs. (Japan)
Kazuo Onoe, NHK Sci. & Tech. Res. Labs. (Japan)
Toru Imai, NHK Sci. & Tech. Res. Labs. (Japan)
Akio Ando, NHK Sci. & Tech. Res. Labs. (Japan)
Page (NA) Paper number 973
Abstract:
This paper presents two linguistic techniques to improve broadcast
news transcription. The first one is an adaptation of a language model
which reflects current news content. It is based on a weighted mixture
of long-term news scripts and latest scripts as training data. The
mixture weights are given by the EM algorithm for linear interpolation
and then normalized by their text sizes. Not only n-grams but also
the vocabulary are updated by the latest news. We call it the Time
Dependent Language Model (TDLM). It achieved a 4.4% reduction in perplexity
and 0.7% improvement in word accuracy over the baseline language model.
The second technique is correction of the decoded transcriptions by
their corresponding electronic draft scripts. The corresponding drafts
are found by using a sentence similarity measure between them. Parts
considered to be recognition errors are replaced with text from the original
drafts. This post-correction led to a 6.7% improvement in word accuracy.
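The EM estimation of linear-interpolation mixture weights mentioned above can be sketched for two component models; the per-token probabilities below are invented, and the normalization by text size described in the abstract is omitted for brevity.

```python
def em_interpolation_weight(probs_a, probs_b, iterations=50, lam=0.5):
    """EM estimation of the weight `lam` in the mixture
    lam * P_a(w) + (1 - lam) * P_b(w), given the probability each
    component model assigns to every token of a held-out text."""
    for _ in range(iterations):
        # E-step: posterior that each token was generated by component A.
        posteriors = [lam * a / (lam * a + (1 - lam) * b)
                      for a, b in zip(probs_a, probs_b)]
        # M-step: the new weight is the mean posterior.
        lam = sum(posteriors) / len(posteriors)
    return lam
```

When one component consistently explains the held-out text better, the weight converges toward that component; when the two are symmetric, it settles at 0.5.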
Authors:
Jacques Koreman, University of the Saarland, Institute of Phonetics (Germany)
William J. Barry, University of the Saarland, Institute of Phonetics (Germany)
Bistra Andreeva, University of the Saarland, Institute of Phonetics (Germany)
Page (NA) Paper number 548
Abstract:
Three cross-language ASR experiments which use hidden Markov modelling
are described. Their goal is to process the signal so that its linguistically
relevant properties are better exploited for consonant identification.
Experiment 1 shows that consonant identification improves when vowel
transitions are used. In particular, the consonants' place of articulation
is identified better, because the vowel transitions contain formant
trajectories which depend on the consonant's place of articulation.
Experiment 2 shows that mapping acoustic parameters onto phonetic
features before applying hidden Markov modelling greatly improves consonant
identification rates. In experiment 3, the acoustic parameters from
the vowel transitions are also mapped onto consonantal (not vocalic!)
features ("relational processing"), as are the acoustic parameters
belonging to the consonants. The additional use of vowel transitions
does not further improve consonant identification, however. This is
probably due to undertraining of the vowel transitions in the Kohonen
network.
Authors:
Raymond Lau, MIT Laboratory for Computer Science (USA)
Stephanie Seneff, MIT Laboratory for Computer Science (USA)
Page (NA) Paper number 53
Abstract:
Previously, we introduced the ANGIE framework for modelling speech
where morphological and phonological substructures of words are jointly
characterized by a context-free grammar and represented in a multi-layered
hierarchical structure. We also demonstrated a phonetic recognizer
and a word-spotter based on ANGIE. In this work, we extend ANGIE to
a competitive continuous speech recognition system. Furthermore, given
that ANGIE is based on a context-free framework, we have decided to
combine ANGIE with TINA, a context-free based framework for natural
language understanding, into an integrated system. The integration
led to a 21.7% reduction in word error rate compared to a baseline
word bigram recognizer on ATIS. We also examined the addition of new
words to the vocabulary, an area we believe will benefit from both
ANGIE and the ANGIE-plus-TINA integration. The combination reduced
error rate by 20.8% over the baseline and outperformed several other
configurations tested not involving an integrated ANGIE-plus-TINA.
Authors:
Lalit R. Bahl, I.B.M. (USA)
S. De Gennaro, I.B.M. (USA)
P. De Souza, I.B.M. (USA)
E. Epstein, I.B.M. (USA)
J.M. Le Roux, I.B.M. (USA)
B. Lewis, I.B.M. (USA)
C. Waast, I.B.M. (USA)
Page (NA) Paper number 114
Abstract:
In French, the pronunciations of many words change dramatically depending
on the word immediately preceding them. In an ASR system that does not
model this phenomenon, known as "liaison", the result is a requirement
for unnatural pronunciations and considerable user dissatisfaction.
In this paper, we present the development of an acoustic model which
takes into account the wide variability of word pronunciations caused
by liaison, the integration of this model into a French continuous
speech recognition system, and decoding results.
Authors:
Fu-Hua Liu, IBM Watson Research Center (USA)
Michael Picheny, IBM Watson Research Center (USA)
Page (NA) Paper number 838
Abstract:
In this paper we describe a novel approach to address the issue of
different sampling frequencies in speech recognition. When a recognition
task needs a different sampling frequency from that of the reference
system, it is customary to re-train the system for the new sampling
rate. To circumvent the tedious training process, we propose a new
approach, termed Sampling Rate Transformation (SRT), that performs the
transformation directly on the speech recognition system. By re-scaling
the mel-filter design and filtering the system in the spectral domain,
SRT converts the existing system to the target spectral range. New systems
are obtained without using any data from the test environment. SRT reduces
the word error rate from 29.89% to 18.17% given 11 kHz test data and a
16 kHz SI system; the matched system for 11 kHz has an error rate of 16.17%.
We also examine MLLR and MAP adaptation. The best result from MLLR is
17.92% with 4.5 hours of speech. Similar improvements are also observed
in the speaker adaptation mode.
Authors:
Kristine Ma, GTE/BBN Technologies (USA)
George Zavaliagkos, GTE/BBN Technologies (USA)
Rukmini Iyer, GTE/BBN Technologies (USA)
Page (NA) Paper number 866
Abstract:
In this paper, we address the issue of deriving and using more realistic
pronunciations to represent words spoken in natural conversational
speech. Previous approaches include using automatic phoneme-based rule-learning
techniques, linguistic transformation rules, and a phonetically hand-labelled
corpus to expand the number of pronunciation variants per word. While
rule-based approaches have the advantage of being easily extensible
to infrequent or unobserved words, they suffer from the problem of
over-generalization. Using hand-transcribed data, one can obtain a
more concise set of new pronunciations, but this cannot be extended to
unobserved or infrequently occurring words. In this paper, we adopt
the hand-labelled corpus scheme to improve pronunciations for frequent
multi-word units and single words occurring in the training data, while
using the rule-based techniques to learn pronunciation variants and their
weights for the infrequent words. Furthermore, we experiment with
a new approach for speaker-dependent pronunciation modeling. The newly
expanded dictionaries are evaluated on the Switchboard and Callhome
corpora, giving a slight reduction in word recognition error rate.
Authors:
Sankar Basu, IBM T.J. Watson Research center (USA)
Abraham Ittycheriah, IBM T.J. Watson Research center (USA)
Stéphane Maes, IBM T.J. Watson Research center (USA)
Page (NA) Paper number 983
Abstract:
When a speech signal is shifted by a few samples, we have observed significant
variations in the feature vectors produced by the acoustic front-end.
Furthermore, the shifted utterances, when decoded with a continuous speech
recognition system, lead to dramatically different word error rates.
This paper analyzes the phenomenon and illustrates the well-known result
that classical acoustic front-end processors, including spectrum- and
cepstrum-based techniques, are sensitive to time shifts. After describing
the effect of sample-sized shifts on the spectral estimates of the signal,
we propose several techniques which take advantage of shift variations
to multiply the amount of training material that speech utterances can provide.
Finally, we illustrate how the acoustic front-end can be slightly modified
to render the recognizer invariant to small shifts.
Authors:
José B. Mariño, Universitat Politècnica de Catalunya (Spain)
Pau Pachès-Leal, Universitat Politècnica de Catalunya (Spain)
Albino Nogueiras, Universitat Politècnica de Catalunya (Spain)
Page (NA) Paper number 250
Abstract:
The performances of the demiphone (a context dependent subword unit
that models independently the left and the right parts of a phoneme)
and the triphone are compared. Continuous density hidden Markov modeling
for both types of units is tested with the HTK software using decision-tree
state clustering. The speech material is taken from the SpeechDat Spanish
database, composed of continuous speech utterances recorded through
the public telephone network. The training corpus is speaker- and
task-independent. Two testing sets are tried: isolated words corresponding
to speaker names, city names and phonetically rich words; and numbers
of Spanish identification cards and dates. The main conclusion is that
the demiphone simplifies the recognition system and yields a better
performance than the triphone. This result may be explained by the
ability of the demiphone to provide an excellent tradeoff between a
detailed coarticulation modeling and a proper parameter estimation.
Authors:
Shinsuke Mori, Tokyo Research Laboratory, IBM Japan (Japan)
Masafumi Nishimura, Tokyo Research Laboratory, IBM Japan (Japan)
Nobuyasu Itoh, Tokyo Research Laboratory, IBM Japan (Japan)
Page (NA) Paper number 989
Abstract:
In this paper we describe a word clustering method for class-based
n-gram models. The clustering criterion is the entropy on a corpus
different from the corpus used for n-gram model estimation, and the
search method is based on a greedy algorithm. We applied this method
to the Japanese EDR corpus and the English Penn Treebank corpus. The
perplexities of the word-based n-gram model on the EDR corpus and the
Penn Treebank are 153.1 and 203.5, respectively, while those of the
class-based n-gram model estimated with our method are 146.4 and 136.0,
respectively. These results indicate that our clustering method is better
than Brown's method and Ney's leaving-one-out method.
Authors:
João P. Neto, INESC/IST (Portugal)
Ciro Martins, INESC/IST (Portugal)
Luís B. Almeida, INESC/IST (Portugal)
Page (NA) Paper number 562
Abstract:
Due to the enormous development of large vocabulary, speaker-independent
continuous speech recognition systems, which has occurred essentially
for US English, there is a large demand for this kind of system for
other languages. In this paper we present the work done in the
development of a large vocabulary, speaker-independent continuous speech
recognition hybrid system for European Portuguese. This is a difficult
task due to the early development stage of this technology for European
Portuguese. The development of a system of
this kind for a new language depends on the availability of the appropriate
source components, mainly a speech corpus and large amounts of texts.
This work became possible due to the development of a new database
(BD-PUBLICO), a large vocabulary speech corpus for the European Portuguese
language developed by us over the last two years.
Authors:
Mukund Padmanabhan, IBM T. J. Watson Research Center (USA)
Bhuvana Ramabhadran, IBM T. J. Watson Research Center (USA)
Sankar Basu, IBM T. J. Watson Research Center (USA)
Page (NA) Paper number 210
Abstract:
In this paper we describe a new testbed for developing speech recognition
algorithms - a VoiceMail transcription task, analogous to other tasks
such as the Switchboard, CallHome, and the Hub 4 tasks, which are currently
used by speech recognition researchers. We describe (i) the use of
compound words to model co-articulation effects in commonly occurring
words, (ii) the use of linguistically derived phonological rules (that
model phenomena such as degemination, palatalization, etc.) for other
words, (iii) a new model-complexity adaptation technique that uses a
discriminant measure to allocate Gaussians to the mixtures modelling
the acoustic units (allophones), (iv) experiments using different feature
extraction methods, and (v) an investigation of the efficacy of some
well-known acoustic adaptation techniques on this task. We then report
experimental results showing that most of the modelling techniques we
investigated were useful in reducing the word error rate - from 87% (when
decoding with Switchboard acoustic and language models) to 38%.
Authors:
Sira Palazuelos, Laboratorio de Tecnologias de Rehabilitacion. ETSI de Telecomunicacion. Universidad Politecnica de Madrid (Spain)
Santiago Aguilera, Laboratorio de Tecnologias de Rehabilitacion. ETSI de Telecomunicacion. Universidad Politecnica de Madrid (Spain)
Jose Rodrigo, Laboratorio de Tecnologias de Rehabilitacion. ETSI de Telecomunicacion. Universidad Politecnica de Madrid (Spain)
Juan Godino, Laboratorio de Tecnologias de Rehabilitacion. ETSI de Telecomunicacion. Universidad Politecnica de Madrid (Spain)
Page (NA) Paper number 381
Abstract:
In this paper, we describe and evaluate recent work and results achieved
in a word prediction system for Spanish, applying both statistical
and grammatical methods: word pairs and trios, bipos and tripos models,
and a stochastic context-free grammar. The predictor is included in
several software applications to enhance the typing rate for users
with motor impairments who can only use a switch for writing. These
users have difficulties in writing, and usually in communicating with
other people, and the inclusion of word prediction in the system allows
them to increase their typing rate from 2-3 words per minute up to
8-10 words per minute.
Authors:
Kishore Papineni, IBM TJ Watson Research Center (USA)
Satya Dharanipragada, IBM TJ Watson Research Center (USA)
Page (NA) Paper number 559
Abstract:
Consider generating phonetic baseforms from orthographic spellings.
Availability of a segmentation (grouping) of the characters can be
exploited to achieve better phonetic translation. We are interested
in building segmentation models without using explicit segmentation
or alignment information during training. The heart of our segmentation
algorithm is a conditional probabilistic model that predicts whether
there are fewer, the same number of, or more phones than characters in
the word. We use just this contraction-expansion information on whole
words for training the model. The model has three components: a prior
model, a set of features, and weights of the features. The features are
selected and the weights assigned in a maximum entropy framework. Even
though the model is trained on whole words, we effectively localize it
on substrings to induce a segmentation of the word. Segmentation is
also aided by considering substrings in both forward and backward directions.
Authors:
Adam Berger, Carnegie Mellon University (USA)
Harry Printz, IBM (USA)
Page (NA) Paper number 679
Abstract:
We describe a large-scale investigation of dependency grammar language
models. Our work includes several significant departures from earlier
studies, notably a larger training corpus, improved model structure,
different feature types, new feature selection methods, and more coherent
training and test data. We report word error rate (WER) results of
a speech recognition experiment, in which we used these models to rescore
the output of the IBM speech recognition system.
Authors:
Ganesh N. Ramaswamy, I.B.M. Research Center (USA)
Harry Printz, I.B.M. Research Center (USA)
Ponani S. Gopalakrishnan, I. B. M. Research Center (USA)
Page (NA) Paper number 611
Abstract:
In this paper, we propose a new bootstrap technique to build domain-dependent
language models. We assume that a seed corpus consisting of a small
amount of data relevant to the new domain is available, which is used
to build a reference language model. We also assume the availability
of an external corpus, consisting of a large amount of data from various
sources, which need not be directly relevant to the domain of interest.
We use the reference language model and a suitable metric, such as
the perplexity measure, to select sentences from the external corpus
that are relevant to the domain. Once we have a sufficient number of
new sentences, we can rebuild the reference language model. We then
continue to select additional sentences from the external corpus, and
this process continues to iterate until some satisfactory termination
point is achieved. We also describe several methods to further enhance
the bootstrap technique, such as combining it with mixture modeling
and class-based modeling. The performance of the proposed approach
was evaluated through a set of experiments, and the results are discussed.
Analysis of the convergence properties of the approach and the conditions
that need to be satisfied by the external corpus and the seed corpus
are highlighted, but detailed work on these issues is deferred for
the future.
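The core selection step of this bootstrap technique can be sketched as scoring external sentences with a reference model and keeping those below a perplexity threshold. For brevity the sketch uses a unigram reference model with an invented floor probability for unseen words; the paper would use a full reference language model.

```python
from math import exp, log

def sentence_perplexity(sentence, unigram, floor=1e-6):
    """Per-word perplexity of a sentence under a unigram reference model.
    `floor` is an assumed probability for out-of-vocabulary words."""
    logprob = sum(log(unigram.get(w, floor)) for w in sentence)
    return exp(-logprob / len(sentence))

def select_relevant(external, unigram, threshold):
    """Keep the external-corpus sentences the reference model finds
    unsurprising, i.e. whose perplexity is below the threshold."""
    return [s for s in external
            if sentence_perplexity(s, unigram) < threshold]
```

After selection, the reference model would be rebuilt on the seed corpus plus the kept sentences, and the select-rebuild cycle iterated until a termination criterion is met.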
Authors:
Joan-Andreu Sánchez, DSIC-Universidad Politecnica de Valencia (Spain)
José-Miguel Benedi, DSIC-Universidad Politecnica de Valencia (Spain)
Page (NA) Paper number 1032
Abstract:
The use of the Inside-Outside algorithm for the estimation of the probability
distributions of Stochastic Context-Free Grammars in Natural-Language
processing is restricted due to the time complexity per iteration and
the large number of iterations that it needs to converge. Alternatively,
an algorithm based on the Viterbi score (VS) is used. This VS algorithm
converges more rapidly, but obtains less competitive models. We propose
a new algorithm that only considers the k-best derivations in the estimation
process. The time complexity per iteration of the algorithm is practically
the same as that of the VS algorithm. The proposed algorithm has been
tested on a part of the Wall Street Journal task processed in the
Penn Treebank project. The test-set perplexity for the VS algorithm
was 24.22, whereas it was 22.73 for the proposed algorithm with
k=3, an improvement of 6.15%.
Authors:
Ananth Sankar, SRI International (USA)
Page (NA) Paper number 194
Abstract:
We present two different approaches for robust estimation of the parameters
of context-dependent hidden Markov models (HMMs) for speech recognition.
The first approach, the Gaussian Merging-Splitting (GMS) algorithm,
uses Gaussian splitting to uniformly distribute the Gaussians in acoustic
space, and merging so as to compute only those Gaussians that have
enough data for robust estimation. We show that this method is more
robust than our previous training technique. The second approach, called
tied-transform HMMs, uses maximum-likelihood transformation-based acoustic
adaptation algorithms to transform a small HMM to a much larger HMM.
Since the transforms are shared or tied among Gaussians in the larger
HMM, robust estimation is achieved. We show that this approach gives
a significant improvement in recognition accuracy and a dramatic reduction
in memory needed to store the models.
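The merging step can be illustrated with the standard moment-matching formula for combining two weighted Gaussians (a 1-D sketch; the GMS algorithm itself operates on multivariate mixtures and applies its own merge criteria):

```python
def merge_gaussians(w1, mu1, var1, w2, mu2, var2):
    """Moment-matched merge of two weighted 1-D Gaussians: the merged mean
    and variance preserve the first two moments of the original pair."""
    w = w1 + w2
    mu = (w1 * mu1 + w2 * mu2) / w
    var = (w1 * (var1 + mu1 ** 2) + w2 * (var2 + mu2 ** 2)) / w - mu ** 2
    return w, mu, var
```

Merging two well-separated components correctly inflates the variance, which is why merging is reserved for components that lack enough data for robust estimation on their own.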
Authors:
Kristie Seymore, Carnegie Mellon University (USA)
Stanley Chen, Carnegie Mellon University (USA)
Ronald Rosenfeld, Carnegie Mellon University (USA)
Page (NA) Paper number 897
Abstract:
Topic adaptation for language modeling is concerned with adjusting
the probabilities in a language model to better reflect the expected
frequencies of topical words for a new document. We present a novel
technique for adapting a language model to the topic of a document,
using a nonlinear interpolation of n-gram language models. A three-way,
mutually exclusive division of the vocabulary into general, on-topic
and off-topic word classes is used to combine word predictions from
a topic-specific and a general language model. We achieve a slight
decrease in perplexity and speech recognition word error rate on a
Broadcast News test set using these techniques. Our results are compared
to results obtained through linear interpolation of topic models.
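The three-way routing of word predictions can be sketched as follows; the function and class names are illustrative assumptions (not the authors' formulation), and a real model would renormalize the combined distribution:

```python
# Sketch: combine a topic-specific and a general n-gram model using a
# three-way split of the vocabulary into general, on-topic, and off-topic
# word classes.
def combined_prob(word, context, p_topic, p_general, word_class):
    """Route each word's prediction to the model expected to be reliable."""
    cls = word_class.get(word, "general")
    if cls == "on-topic":        # topical words: trust the topic model
        return p_topic(word, context)
    elif cls == "off-topic":     # words absent from the topic: general model
        return p_general(word, context)
    else:                        # shared vocabulary: simple average stand-in
        return 0.5 * (p_topic(word, context) + p_general(word, context))
```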
Authors:
Kazuyuki Takagi, The University of Electro-Communications (Japan)
Rei Oguro, The University of Electro-Communications (Japan)
Kenji Hashimoto, The University of Electro-Communications (Japan)
Kazuhiko Ozeki, The University of Electro-Communications (Japan)
Page (NA) Paper number 26
Abstract:
This paper reports our work to improve a bigram language model for
Japanese TV broadcast news speech recognition. First, frequent word
strings were grouped into phrases so that they could be added to the
lexicon as new units of recognition. The test-set perplexity improved
when frequent function-word strings were used as additional recognition
units. Speech recognition performance improved both by grouping function-word
strings and by grouping compound nouns selected by word-association
ratio. Secondly, to alleviate the OOV problem related to nouns, we
built and tested a language model that switches its noun lexicon according
to the domain of the article to be recognized next.
Authors:
Hesham Tolba, INRS-Telecommunications (Canada)
Douglas O'Shaughnessy, INRS-Telecommunications (Canada)
Page (NA) Paper number 342
Abstract:
This paper addresses the implementation of a robust front-end, based
on a Voiced-Unvoiced (V-U) decision, for a large-vocabulary Continuous
Speech Recognition (CSR) system. Our approach is based on the separation
of the speech signal into voiced and unvoiced components, so that speech
enhancement can be performed on each component separately. The voiced
component is enhanced using adaptive comb filtering, whereas the unvoiced
component is enhanced using a modified spectral subtraction approach.
Experiments show that the proposed CSR system is robust in additive
noisy environments (SNR down to 0 dB).
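The unvoiced-branch enhancement can be illustrated with basic magnitude-domain spectral subtraction (a generic sketch of the technique; `alpha` and `floor` are assumed parameters, not the authors' modified-spectral-subtraction settings):

```python
def spectral_subtract(noisy_mag, noise_mag, alpha=2.0, floor=0.01):
    """Subtract a scaled noise-magnitude estimate bin by bin, flooring the
    result at a small fraction of the noisy magnitude so no bin goes
    negative (which would cause 'musical noise' artifacts)."""
    return [max(n - alpha * d, floor * n)
            for n, d in zip(noisy_mag, noise_mag)]
```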
Authors:
Juan Carlos Torrecilla, Telefonica Investigacion Y Desarrollo (Spain)
Ismael Cortázar, Telefonica Investigacion Y Desarrollo (Spain)
Luis A. Hernández, Telefonica Investigacion Y Desarrollo (Spain)
Page (NA) Paper number 321
Abstract:
In this paper we propose an efficient beam search procedure that combines
well-known search techniques, such as lexicon organization using tree-structured
grammars, with a novel approach that uses different types of subword
units depending on the local scores of the active words. An efficient
double-tree structure using phonemes and triphones is presented. Experimental
results on an isolated word recognition system reveal that the proposed
strategy achieves important reductions in computational cost with only
negligible increases in recognition errors. Tests over a vocabulary
of 955 Spanish words show a 0.5% increase in error rate for a 32%
reduction in the number of senones to be evaluated.
Authors:
Paul van Mulbregt, Dragon Systems, Inc. (USA)
Ira Carp, Dragon Systems, Inc. (USA)
Lawrence Gillick, Dragon Systems, Inc. (USA)
Steve Lowe, Dragon Systems, Inc. (USA)
Jon Yamron, Dragon Systems, Inc. (USA)
Page (NA) Paper number 116
Abstract:
Expertise in the automatic transcription of broadcast speech has progressed
to the point of being able to use the resulting transcripts for information
retrieval purposes. In this paper, we describe a corpus of automatically
recognized broadcast news and a method for segmenting the broadcast into
stories, and then apply this method to retrieve stories relating
to a specific topic. The method is based on Hidden Markov Models and
is analogous to the usual application of HMMs in speech recognition.
Authors:
Philip O'Neill, The Queen's University of Belfast (Ireland)
Saeed Vaseghi, The Queen's University of Belfast (Ireland)
Bernard Doherty, The Queen's University of Belfast (Ireland)
Wooi Haw Tan, The Queen's University of Belfast (Ireland)
Paul McCourt, The Queen's University of Belfast (Ireland)
Page (NA) Paper number 178
Abstract:
The choice of speech unit affects the accuracy, complexity, expandability
and ease of adaptation of ASRs to speaker and environmental variations.
This paper explores a method of subword modelling based on the concept
of multi-phone strings. The motivation for using the longer-duration
multi-phone strings is to reduce the loss of contextual information,
cross-phone correlation, and transition information. Multi-phone strings
are an alternative to context-dependent phones, and they include many
of the syllables. An advantage of multi-phone units is the existence
of more than one valid multi-phone transcription for each monophone
sequence; this can be used to improve ASR accuracy. A particular case
of multi-phone strings, namely phone pairs, is investigated in detail.
Experimental evaluations on TIMIT and WSJCAM0 are presented.
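A minimal sketch of deriving one phone-pair segmentation from a monophone sequence (illustrative only; the units and separator are assumptions, and the paper exploits the fact that several valid segmentations exist):

```python
def to_phone_pairs(phones):
    """Pair adjacent phones into phone-pair units; an unpaired final
    phone remains a monophone."""
    units = []
    for i in range(0, len(phones) - 1, 2):
        units.append(phones[i] + "+" + phones[i + 1])
    if len(phones) % 2:          # odd-length sequence: keep last phone alone
        units.append(phones[-1])
    return units
```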
Authors:
Nanette M. Veilleux, Boston University, Metropolitan College (USA)
Stefanie Shattuck-Hufnagel, Mass Institute of Technology, Research Lab of Electronics (USA)
Page (NA) Paper number 382
Abstract:
In a pilot study of phonetic modification of function words in two spontaneous
speech dialogues, 99 utterances of the syllable /tu/ corresponding
to the morphemes to, two, too, -to and to- included ten pronunciation
variants. Factors influencing phonetic modification included phonetic
context, prosody, part of speech, adjacent disfluency and individual
speaker. 11% of the acoustic landmarks defining /t/ closure, /t/ release
and vowel jaw opening maximum were not detectable in hand labelling.
In a separate corpus, 59% of recognition errors involved grammatical
or function words such as conjunctions, articles, prepositions, pronouns
and auxiliary verbs, and for 17 tokens of /tu/, half were misrecognized.
Implications of these preliminary results for linguistic theory, cognitive
modelling of speech processing and automatic speech recognition are
discussed.
Authors:
Fuliang Weng, SRI, International (USA)
Andreas Stolcke, SRI, International (USA)
Ananth Sankar, SRI, International (USA)
Page (NA) Paper number 136
Abstract:
We describe two new techniques for reducing word lattice sizes without
eliminating hypotheses. The first technique is an algorithm to reduce
the size of non-deterministic bigram word lattices by merging redundant
nodes. On bigram word lattices generated from Hub4 Broadcast News
speech, this reduces lattice sizes by half on average. The second technique
is an improved algorithm for expanding lattices with trigram language
models. Backed-off trigram probabilities are encoded without node
duplication by factoring the probabilities into bigram probabilities
and backoff weights, and duplicating nodes only for explicit trigrams.
Experiments on Broadcast News show that this method reduces trigram
lattice sizes by a factor of 6, and reduces expansion time by more
than a factor of 10. Compared to conventionally expanded lattices,
recognition with the compactly expanded lattices was also found to
be 40% faster, without affecting recognition accuracy.
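The factoring that enables the compact expansion can be sketched as follows, assuming dictionary-backed probability tables (`trigrams`, `bigrams`, and `backoff` are illustrative names, not the authors' data structures):

```python
def trigram_prob(w1, w2, w3, trigrams, bigrams, backoff):
    """P(w3 | w1, w2): use the explicit trigram if present; otherwise
    back off to bo(w1, w2) * P(w3 | w2), so only explicit trigrams
    require node duplication in the lattice."""
    if (w1, w2, w3) in trigrams:
        return trigrams[(w1, w2, w3)]
    return backoff.get((w1, w2), 1.0) * bigrams[(w2, w3)]
```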
Authors:
Mirjam Wester, A2RT, University of Nijmegen (The Netherlands)
Judith M. Kessens, A2RT, University of Nijmegen (The Netherlands)
Helmer Strik, A2RT, University of Nijmegen (The Netherlands)
Page (NA) Paper number 371
Abstract:
This paper describes how the performance of a continuous speech recognizer
for Dutch has been improved by modeling pronunciation variation. We
used three methods to model pronunciation variation. First, within-word
variation was dealt with. Phonological rules were applied to the words
in the lexicon, thus automatically generating pronunciation variants.
Secondly, cross-word pronunciation variation was modeled using two
different approaches. In the first approach, cross-word processes were
modeled by adding the variants as separate words to the lexicon; in
the second, this was done using multi-words. Recognition experiments
were carried out for each of the methods. A significant improvement
was found for modeling within-word variation. Furthermore, modeling
cross-word processes using multi-words led to significantly better
results than modeling them using separate words in the lexicon.
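Rule-based generation of within-word variants can be sketched as follows; the rule notation and the toy schwa-plus-/n/ deletion rule are assumptions for illustration, not the authors' rule set:

```python
def apply_optional_rules(pron, rules):
    """Return all pronunciations reachable by optionally applying each
    (pattern, replacement) phonological rule zero or more times."""
    variants = {pron}
    changed = True
    while changed:
        changed = False
        for pattern, replacement in rules:
            for v in list(variants):
                new = v.replace(pattern, replacement)
                if new not in variants:
                    variants.add(new)
                    changed = True
    return sorted(variants)
```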
Authors:
Edward W.D. Whittaker, Cambridge University (U.K.)
Philip C. Woodland, Cambridge University (U.K.)
Page (NA) Paper number 967
Abstract:
In this paper the main differences between language modelling of Russian
and English are examined. A Russian corpus and a comparable English
corpus are described. The effects of high inflectionality in Russian
and the relationship between the out-of-vocabulary rate and vocabulary
size are investigated. Standard word and class N-gram language modelling
techniques are applied to the two corpora and perplexity results are
reported. A novel approach to the modelling of inflected languages
is proposed and its efficacy compared with the other techniques.
Authors:
Petra Witschel, Siemens AG (Germany)
Page (NA) Paper number 471
Abstract:
Stochastic language models based on word n-grams require huge amounts
of training material, especially for large-vocabulary systems. With
class-based n-grams, much less training material is necessary and
higher coverage can be achieved. Building classes on the basis of linguistic
characteristics (POS) has the advantage that new words can be assigned
easily. Until now, class sets for POS-based language models have usually
been defined by linguistic experts. In this paper we present an approach
in which, for a given number of classes, a class set is generated automatically
such that the entropy of the language model is minimized. We perform
experiments on German medical reports comprising about 1.2 million words
of text and a vocabulary of 24000 words. Using our approach, we generate
an exemplary set of 196 optimized POS classes. Comparing the optimized
POS-based language model to a language model based on 196 conventionally
defined classes, we obtain an improvement of up to 10% in test-set perplexity.
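The class-based factorization underlying POS n-gram models can be sketched as follows (the dictionary-backed tables are illustrative assumptions); this factoring is what makes class models trainable from far less data than word n-grams:

```python
def class_bigram_prob(w, w_prev, word2class, p_class_trans, p_word_given_class):
    """P(w | w_prev) factored as P(class | prev_class) * P(w | class)."""
    c, c_prev = word2class[w], word2class[w_prev]
    return p_class_trans[(c_prev, c)] * p_word_given_class[(w, c)]
```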
Authors:
Mark Wright, BT Laboratories (U.K.)
Simon Hovell, BT Laboratories (U.K.)
Simon Ringland, BT Laboratories (U.K.)
Page (NA) Paper number 450
Abstract:
Many of the pruning strategies used to remove less likely hypotheses
from the search space in large vocabulary speech recognition (LVR)
systems have a peak search space many times greater than the average
search space. This paper discusses two such strategies used within
BT's speech recognition architecture, Step pruning and Histogram pruning.
Two-tier pruning is proposed as a simple but powerful extension applicable
to either of the above strategies. This seeks to limit the expansion
of the search space between the prune and acoustic match processes
without affecting accuracy. It is shown that the application of two-tier
pruning to either strategy reduces peak search effort, and results
in an average reduction in run time of 33% and 53% for step pruning
and histogram pruning respectively, with no loss in top-N accuracy.
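Histogram pruning in its basic rank-based form can be sketched as follows (a generic sketch; the two-tier extension described above adds a second, tighter limit between the prune and acoustic-match processes):

```python
import heapq

def histogram_prune(hypotheses, max_active):
    """Keep only the top max_active hypotheses by score, bounding the
    peak search space. hypotheses: list of (score, state), higher is better."""
    if len(hypotheses) <= max_active:
        return hypotheses
    return heapq.nlargest(max_active, hypotheses, key=lambda h: h[0])
```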
Authors:
George Zavaliagkos, BBN Technologies (USA)
Man-Hung Siu, BBN Technologies (USA)
Thomas Colthurst, BBN Technologies (USA)
Jayadev Billa, BBN Technologies (USA)
Page (NA) Paper number 1007
Abstract:
This paper explores techniques for utilizing untranscribed training
data pools to increase the available training data for automatic speech
recognition systems. It has been well established that current speech
recognition technology, especially in Large Vocabulary Conversational
Speech Recognition (LVCSR), is largely language independent, and that
the dominant factor with regard to performance in a given language
is the amount of available training data. The paper addresses this
need for increased training data by presenting ways to use untranscribed
acoustic data to increase the training data size and thus improve speech
recognition.
Authors:
Ea-Ee Jan, IBM Thomas J. Watson Research Center (USA)
Raimo Bakis, IBM Thomas J. Watson Research Center (USA)
Fu-Hua Liu, IBM Thomas J. Watson Research Center (USA)
Michael Picheny, IBM Thomas J. Watson Research Center (USA)
Page (NA) Paper number 862
Abstract:
Large vocabulary automatic speech recognition might assist hearing
impaired telephone users by displaying a transcription of the incoming
side of the conversation, but the system would have to achieve sufficient
accuracy on conversational-style, telephone-bandwidth speech. We describe
our development work toward such a system. This work comprised three
phases: Experiments with clean data filtered to 200-3500Hz, experiments
with real telephone data, and language model development. In the first
phase, the speaker independent error rate was reduced from 25% to 12%
by using MLLT, increasing the number of cepstral components from 9
to 13, and increasing the number of Gaussians from 30,000 to 120,000.
The resulting system, however, performed less well on actual telephony,
producing an error rate of 28.4%. By additional adaptation and the
use of an LDA and CDCN combination, the error rate was reduced to 19.1%.
Speaker adaptation further reduced the error rate to 10.96%. These results
were obtained with read speech. To explore the language-model requirements
in a more realistic situation, we collected some conversational speech
with an arrangement in which one participant could not hear the conversation
but only saw recognizer output on a screen. We found that a mixture
of language models, one derived from the Switchboard corpus and the
other from prepared texts, resulted in approximately 10% fewer errors
than either model alone.
Authors:
Antonio Bonafonte, Universitat Politecnica de Catalunya (Spain)
José B. Mariño, Universitat Politecnica de Catalunya (Spain)
Page (NA) Paper number 1125
Abstract:
X-grams are a generalization of n-grams in which the number of previous
conditioning words differs for each case and is decided from the
training data. X-grams reduce perplexity with respect to trigrams and
require fewer parameters. In this paper, the representation of
the x-grams using finite state automata is considered. This representation
leads to a new model, the non-deterministic x-grams, an approximation
that is much more efficient while suffering only a small degradation
in modeling capability. Empirical experiments on a continuous speech
recognition task show how, for each ending word, the number of transitions
is reduced from 1222 (the size of the lexicon) to around 66.
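A variable-length-context lookup in the spirit of x-grams can be sketched as follows (the dictionary representation is an assumption for illustration; the paper represents x-grams as finite state automata):

```python
def xgram_prob(word, history, model):
    """Use the longest conditioning context present in the model,
    shortening it word by word; () is the unigram fallback.
    model maps (context_tuple, word) -> probability."""
    for i in range(len(history)):
        key = (tuple(history[i:]), word)
        if key in model:
            return model[key]
    return model[((), word)]
```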