Language modeling


Semantic Clustering for Adaptive Language Modeling

Authors:

Reinhard Kneser, Philips Research (Germany)
Jochen Peters, Philips Research (Germany)

Volume 2, Page 779

Abstract:

In this paper we present efficient clustering algorithms for two novel class-based approaches to adaptive language modeling. In contrast to bigram and trigram class models, the proposed classes are related to the distribution and co-occurrence of words within complete text units and are thus mostly of a semantic nature. We introduce adaptation techniques such as adaptive linear interpolation and an approximation to minimum discriminant estimation, and show how to use the automatically derived semantic structure to allow fast adaptation to a particular topic or style. In experiments performed on the Wall Street Journal corpus, intuitively convincing semantic classes were obtained. The resulting adaptive language models were significantly better than a standard cache model. Compared to a static model, a reduction in perplexity of up to 31% was achieved.
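
As a rough illustration of the adaptive linear interpolation mentioned above, the sketch below re-estimates the weight of a topic-specific component against a static model on a small piece of adaptation text via EM; the function names and the probability floor are placeholders, not the authors' exact formulation.

    def estimate_lambda(adapt_words, p_topic, p_static, iters=10):
        # EM re-estimation of the interpolation weight on the adaptation text.
        lam = 0.5
        for _ in range(iters):
            post_sum = 0.0
            for w in adapt_words:
                num = lam * p_topic.get(w, 1e-9)
                post_sum += num / (num + (1.0 - lam) * p_static.get(w, 1e-9))
            lam = post_sum / len(adapt_words)
        return lam

    def p_adapted(w, lam, p_topic, p_static):
        # Adapted probability as a weighted mixture of the two components.
        return lam * p_topic.get(w, 1e-9) + (1.0 - lam) * p_static.get(w, 1e-9)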

ic970779.pdf




Task adaptation using MAP estimation in N-gram language modeling

Authors:

Hirokazu Masataki, ATR (Japan)
Yoshinori Sagisaka, ATR (Japan)
Kazuya Hisaki, Kyoto University (Japan)
Tatsuya Kawahara, Kyoto University (Japan)

Volume 2, Page 783

Abstract:

This paper describes a method of task adaptation in N-gram language modeling for accurately estimating the N-gram statistics from the small amount of data of the target task. Assuming a task-independent N-gram to be a priori knowledge, the N-gram is adapted to a target task by MAP (maximum a posteriori probability) estimation. Experimental results showed that the perplexities of the task-adapted models were 15% (trigram) and 24% (bigram) lower than those of the task-independent model, and that the perplexity reduction of the adaptation reached a maximum of 39% when the amount of text data in the adapted task was very small.
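
A minimal sketch of the MAP idea, assuming a Dirichlet prior centred on the task-independent model (the prior weight tau and the function names are illustrative, not taken from the paper):

    def map_adapted_prob(history, word, task_counts, history_counts, p_prior, tau=100.0):
        # p_map(w | h) = (c_task(h, w) + tau * p_prior(w | h)) / (c_task(h) + tau):
        # with little task data the task-independent prior dominates; with more
        # data the estimate moves towards the target-task relative frequencies.
        c_hw = task_counts.get((history, word), 0)
        c_h = history_counts.get(history, 0)
        return (c_hw + tau * p_prior(history, word)) / (c_h + tau)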

ic970783.pdf




Distant Bigram Language Modelling Using Maximum Entropy

Authors:

Michael Simons, RWTH Aachen (Germany)
Hermann Ney, RWTH Aachen (Germany)
Sven C. Martin, RWTH Aachen (Germany)

Volume 2, Page 787

Abstract:

In this paper, we apply the maximum entropy approach to so-called distant bigram language modelling. In addition to the usual unigram and bigram dependencies, we use distant bigram dependencies, where the immediate predecessor word of the word position under consideration is skipped. The contributions of this paper are: (1) We analyze the computational complexity of the resulting training algorithm, i.e. the generalized iterative scaling (GIS) algorithm, and study the details of its implementation. (2) We describe a method for handling unseen events in the maximum entropy approach; this is achieved by discounting the frequencies of observed events. (3) We study the effect of this discounting operation on the convergence of the GIS algorithm. (4) We give experimental perplexity results for a corpus from the WSJ task. By using the maximum entropy approach and the distant bigram dependencies, we are able to reduce the perplexity from 205.4 for our best conventional bigram model to 169.5.
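
The following toy sketch shows the flavour of a GIS parameter update for a maximum-entropy model; it is not the paper's implementation (in particular, the correction feature that makes the feature counts sum to a constant, and the discounting of observed frequencies discussed above, are omitted).

    import math

    def gis(xs, features, observed, iters=100):
        """xs: event space; features: binary feature functions f_i(x);
        observed: empirical expectation of each feature on the training data."""
        C = max(sum(f(x) for f in features) for x in xs)
        lambdas = [0.0] * len(features)
        for _ in range(iters):
            # Current model: p(x) proportional to exp(sum_i lambda_i * f_i(x)).
            scores = [math.exp(sum(lam * f(x) for lam, f in zip(lambdas, features)))
                      for x in xs]
            z = sum(scores)
            expected = [sum(s / z * f(x) for s, x in zip(scores, xs))
                        for f in features]
            # GIS update: move each weight by (1/C) * log(observed / expected).
            lambdas = [lam + (1.0 / C) * math.log(max(o, 1e-12) / max(e, 1e-12))
                       for lam, o, e in zip(lambdas, observed, expected)]
        return lambdas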

ic970787.pdf




Nonuniform Markov Models

Authors:

Eric Sven Ristad, Princeton University (U.S.A.)
Robert G. Thomas, Princeton University (U.S.A.)

Volume 2, Page 791

Abstract:

We propose a new way to model conditional independence in Markov models. The central feature of our nonuniform Markov model is that it makes predictions of varying lengths using contexts of varying lengths. Experiments on the Wall Street Journal reveal that the nonuniform model performs slightly better than the classic interpolated Markov model of Jelinek and Mercer (1980). This result is somewhat remarkable because both models contain identical numbers of parameters whose values are estimated in a similar manner. The only difference between the two models is how they combine the statistics of longer and shorter strings.

ic970791.pdf




Modelling word-pair relations in a category-based language model

Authors:

Thomas Niesler, University of Cambridge (U.K.)
P.C. Woodland, University of Cambridge (U.K.)

Volume 2, Page 795

Abstract:

A new technique for modelling word occurrence correlations within a word-category based language model is presented. Empirical observations indicate that the conditional probability of a word given its category, rather than maintaining the constant value normally assumed, exhibits an exponential decay towards a constant as a function of an appropriately defined measure of separation between the correlated words. Consequently a functional dependence of the probability upon this separation is postulated, and methods for determining both the related word pairs as well as the function parameters are developed. Experiments using the LOB, Switchboard and Wall Street Journal corpora indicate that this formulation captures the transient nature of the conditional probability effectively, and leads to reductions in perplexity of between 8 and 22%, where the largest improvements are delivered by correlations of words with themselves (self-triggers), and the reductions increase with the size of the training corpus.
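
A sketch of the postulated functional form (symbol names are illustrative): the conditional probability of a word given its category decays exponentially, as a function of the separation from the correlated word, towards a constant asymptotic value.

    import math

    def p_word_given_category(d, p0, p_inf, decay):
        # d: separation from the correlated (triggering) word, in the measure
        # defined in the paper; p0: value at d = 0; p_inf: asymptotic value.
        return p_inf + (p0 - p_inf) * math.exp(-d / decay)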

ic970795.pdf




Language Model Adaptation using mixtures and an exponentially decaying cache

Authors:

Philip Clarkson, Cambridge University (U.K.)
Anthony J. Robinson, Cambridge University (U.K.)

Volume 2, Page 799

Abstract:

This paper presents two techniques for language model adaptation. The first is based on the use of mixtures of language models: the training text is partitioned according to topic, a language model is constructed for each component, and at recognition time appropriate weightings are assigned to each component to model the observed style of language. The second technique is based on augmenting the standard trigram model with a cache component in which words' recurrence probabilities decay exponentially over time. Both techniques yield a significant reduction in perplexity over the baseline trigram language model when faced with multi-domain test text, the mixture-based model giving a 24% reduction and the cache-based model giving a 14% reduction. The two techniques attack the problem of adaptation at different scales, and as a result can be used in parallel to give a total perplexity reduction of 30%.
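
A minimal sketch of the exponentially decaying cache component, assuming each occurrence of a word contributes exp(-alpha * distance) to its cache score (the parameter values and names are placeholders, not those of the paper):

    import math

    def cache_prob(word, recent_words, alpha=0.005):
        # recent_words: the words seen so far, most recent last; each occurrence
        # contributes a weight that decays exponentially with its distance.
        n = len(recent_words)
        scores = {}
        for i, w in enumerate(recent_words):
            scores[w] = scores.get(w, 0.0) + math.exp(-alpha * (n - i))
        total = sum(scores.values())
        return scores.get(word, 0.0) / total if total > 0.0 else 0.0

    def adapted_prob(word, recent_words, p_trigram, lam=0.1):
        # Interpolate the cache component with the static trigram probability.
        return lam * cache_prob(word, recent_words) + (1.0 - lam) * p_trigram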

ic970799.pdf




Confidence-driven Estimator Perturbation: BMPC

Authors:

Stefan Besling, Philips Research (Germany)
Hans-Günter Meier, Fachhochschule Düsseldorf (Germany)

Volume 2, Page 803

Abstract:

In most practical applications of speech recognition, the acceptance and performance of the system depend strongly on its capability to adapt to the characteristics of the individual speaker. For the problem of language model adaptation, one has to find an efficient way to combine a typically well-trained a priori estimator for a domain with a regularly updated but undertrained estimator reflecting the speaker-specific data seen so far. In this paper we present a new language model estimation technique that makes explicit use of the confidence in estimates obtained on the (typically small) adaptation or training data. Mathematically, it attempts to perturb a given reliable a priori distribution in such a way that it fits into the confidence regions given by the training material. Experiments performed on real-life data supplied by US radiologists indicate that the method can improve on standard adaptation techniques such as linear interpolation.

ic970803.pdf




Domain Adaptation With Clustered Language Models

Authors:

Joerg Peter Ueberla, Forum Technology - DRA Malvern (U.K.)

Volume 2, Page 807

Abstract:

In this paper, a method of domain adaptation for clustered language models is developed. It is based on a previously developed clustering algorithm, but with a modified optimisation criterion. The results are shown to be slightly superior to those of the previously published 'Fillup' method, which can be used to adapt standard n-gram models. However, the improvement both methods give over models built from scratch on the adaptation data is quite small (less than 11% relative improvement in word error rate). This suggests that both methods are still unsatisfactory from a practical point of view.

ic970807.pdf




Improving Parsing of Spontaneous Speech with the Help of Prosodic Boundaries

Authors:

Ralf Kompe, University of Erlangen (Germany)
Andreas Kießling, University of Erlangen (Germany)
Heinrich Niemann, University of Erlangen (Germany)
Elmar Nöth, University of Erlangen (Germany)
Anton Batliner, L.M.-Univ. München (Germany)
Stefanie Schachtl, Siemens (Germany)
Tobias Ruland, Siemens (Germany)
Hans Ulrich Block, Siemens (Germany)

Volume 2, Page 811

Abstract:

Parsing can be improved in automatic speech understanding if prosodic boundaries are taken into account, because syntactic boundaries are often marked prosodically. Since large databases are needed for the training of statistical models, we developed a labeling scheme for syntactic-prosodic boundaries within the German VERBMOBIL speech-to-speech translation project. We compare the results of classifiers (multi-layer perceptrons and language models) trained on these labels with results for perceptual and syntactic labels. Recognition rates of up to 96% were achieved. The turns consist of 20 words on average and frequently contain sequences of partial sentence equivalents (restarts, ellipsis). The boundary scores computed by our classifiers were successfully integrated into the syntactic parsing of word graphs; currently, they improve the parse time by 92% and reduce the number of parse trees by 96%. This is achieved by introducing a special Prosodic Syntactic Clause Boundary symbol into our grammar and by guiding the search for the best word chain with the boundary scores.

ic970811.pdf




Specialized Language Models using Dialogue Predictions

Authors:

Cosmin Popovici, ICI (Romania)
Paolo Baggia, CSELT (Italy)

Volume 2, Page 815

Abstract:

This paper analyses language modeling in spoken dialogue systems for accessing a database. The use of several language models obtained by exploiting dialogue predictions gives better results than the use of a single model for the whole dialogue interaction. For this reason several models have been created, each one for a specific system question, such as the request or the confirmation of a parameter. The use of dialogue-dependent language models increases the performance both at the recognition and at the understanding level, especially on answers to system requests. Moreover using other methods to increase performance, like automatic clustering of vocabulary words or the use of better acoustic models during recognition, does not affect the improvements given by dialogue-dependent language models. The system used in our experiments is Dialogos, the Italian spoken dialogue system used for accessing railway timetable information over the telephone. The experiments were carried out on a large corpus of dialogues collected using Dialogos.
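
As a rough illustration of the dialogue-dependent selection described above, the sketch below maps a predicted system question to a specialised language model and falls back to a dialogue-independent model otherwise; the state and model names are hypothetical, not those used in Dialogos.

    SPECIALISED_LMS = {
        "request_departure_station": "lm_departure_station",
        "request_date": "lm_date",
        "confirm_parameter": "lm_confirmation",
    }

    def select_language_model(predicted_state, generic_lm="lm_generic"):
        # Use the model trained for the predicted system question when one
        # exists; otherwise fall back to the single dialogue-independent model.
        return SPECIALISED_LMS.get(predicted_state, generic_lm)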

ic970815.pdf




K-TLSS(S) Language Models for Speech Recognition

Authors:

Germán Bordel, UPV/EHU, Bilbao (Spain)
Amparo Varona, UPV/EHU, Bilbao (Spain)

Volume 2, Page 819

Abstract:

The class of K-Testable Languages in the Strict Sense (K-TLSS) is a subclass of the regular languages. Previous work demonstrates that stochastic K-TLSS language models describe the same probability distribution as N-gram models, and that smoothing techniques (back-off-like methods) can be applied efficiently. Given a set of k-TLSS models (k = 1...K) and a smoothing technique that specifically fits them, we propose integrating them into a single self-contained model, the K-TLSS(S), which embeds the smoothing within the topology and thus allows extremely simple parsing procedures. To build this model we designed a more general syntactic mechanism that we call the Stochastic Deterministic Finite State Automaton with Recursive Transitions. The topology of the new K-TLSS(S) models allows an easy pruning procedure, and pruned K-TLSS(S) models give probability distributions that are equivalent to variable N-gram models. Experimental results led to the conclusion that the effect of a small amount of pruning is always positive.

ic970819.pdf




Language Model Adaptation For Conversational Speech Recognition Using Automatically Tagged Pseudo-Morphological Classes

Authors:

Carlos Crespo, TID (Spain)
Daniel Tapias, TID (Spain)
Gregorio Escalada, TID (Spain)
Jorge Alvarez, TID (Spain)

Volume 2, Page 823

Abstract:

Statistical language models provide a powerful tool for modeling natural spoken language. Nevertheless, a large set of training sentences is required to reliably estimate the model parameters. In this paper we present a method to estimate n-gram probabilities from sparse data. The proposed language modeling strategy makes it possible to adapt a generic language model (LM) to a new semantic domain with just a few hundred sentences. This reduced set of sentences is automatically tagged with eighty different pseudo-morphological labels, and a word-bigram LM is then derived from them. Finally, this target-domain word-bigram LM is interpolated with a generic backoff word-bigram LM, which was estimated using a large text database. This strategy reduces the word error rate on the SPATIS (Spanish ATIS) task by 27%.
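
A minimal sketch of the final steps of this pipeline, assuming a maximum-likelihood word bigram over the adaptation sentences and a fixed interpolation weight (how the pseudo-morphological labels enter the estimation is not detailed above and is omitted here; names and weights are placeholders):

    from collections import Counter

    def train_target_bigram(adapt_sentences):
        # adapt_sentences: lists of words from the tagged target-domain sentences.
        bigrams, unigrams = Counter(), Counter()
        for sent in adapt_sentences:
            words = ["<s>"] + sent
            for w1, w2 in zip(words, words[1:]):
                bigrams[(w1, w2)] += 1
                unigrams[w1] += 1
        return lambda w1, w2: (bigrams[(w1, w2)] / unigrams[w1]
                               if unigrams[w1] else 0.0)

    def adapted_bigram(w1, w2, p_target, p_generic_backoff, lam=0.5):
        # Interpolate the sparse target-domain bigram with the generic backoff
        # bigram estimated on a large text database.
        return lam * p_target(w1, w2) + (1.0 - lam) * p_generic_backoff(w1, w2)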

ic970823.pdf
