Session W2B: Language Model Adaptation

Chairperson: Hermann Ney, RWTH Aachen, Germany



DOCUMENT SPACE MODELS USING LATENT SEMANTIC ANALYSIS

Authors: Yoshihiko Gotoh, Steve Renals

University of Sheffield, Department of Computer Science, Regent Court, 211 Portobello St., Sheffield S1 4DP, UK; e-mail: {y.gotoh, s.renals}@dcs.shef.ac.uk

Volume 3 pages 1443 - 1446

ABSTRACT

In this paper, an approach for constructing mixture language models (LMs) based on a notion of semantics is discussed. To this end, a technique known as latent semantic analysis (LSA) is used. The approach encapsulates corpus-derived semantic information and is able to model the varying style of the text. Using such information, the corpus texts are clustered in an unsupervised manner and mixture LMs are automatically created. This work builds on previous work in the field of information retrieval, recently applied by Bellegarda et al. to the problem of clustering words by semantic categories. The principal contribution of this work is to characterize the document space resulting from the LSA modelling and to demonstrate the approach in a mixture LM application. A comparison is made between manual and automatic clustering in order to elucidate how the semantic information is expressed in the space. It is shown that, using semantic information, mixture LMs perform better than a conventional single LM with only a slight increase in computational cost.
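To make the pipeline concrete, the sketch below clusters documents in an LSA space and builds a mixture LM over the clusters, with a maximum-likelihood unigram standing in for each component model. Every name and parameter here, and the use of scikit-learn, is an illustrative assumption, not the authors' implementation.

```python
from collections import Counter

import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import CountVectorizer


def train_unigram(texts):
    """Stand-in component LM: a maximum-likelihood unigram model."""
    counts = Counter(w for t in texts for w in t.split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def build_mixture_lm(documents, n_dims=50, n_clusters=4):
    # Word-document co-occurrence matrix (rows are documents here).
    word_doc = CountVectorizer().fit_transform(documents)

    # LSA: a truncated SVD projects each document into a low-dimensional
    # "document space" where semantically similar texts lie close together.
    doc_vectors = TruncatedSVD(n_components=n_dims).fit_transform(word_doc)

    # Unsupervised clustering of the document space.
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(doc_vectors)

    # One component LM per cluster, trained on that cluster's subcorpus.
    components = [
        train_unigram([d for d, lab in zip(documents, labels) if lab == k])
        for k in range(n_clusters)
    ]
    # Mixture weights proportional to cluster sizes.
    weights = np.bincount(labels, minlength=n_clusters) / len(documents)
    return weights, components


def mixture_prob(word, weights, components):
    """P(w) = sum_k lambda_k * P_k(w)."""
    return sum(wt * lm.get(word, 0.0) for wt, lm in zip(weights, components))
```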

A0235.pdf



ADAPTIVE TOPIC-DEPENDENT LANGUAGE MODELLING USING WORD-BASED VARIGRAMS

Authors: Sven C. Martin, Jörg Liermann, Hermann Ney

Lehrstuhl für Informatik VI, RWTH Aachen University of Technology, D-52056 Aachen, Germany; e-mail: martin@informatik.rwth-aachen.de

Volume 3 pages 1447 - 1450

ABSTRACT

This paper presents two extensions of the standard interpolated word trigram and cache model: the extension of the trigram model by useful word m-grams with m > 3, resulting in a varigram model, and the addition of topic-specific trigram models. We give the criteria for selecting useful m-grams and for partitioning the training corpus into topic-specific subcorpora. We apply both extensions, separately and in combination, to corpora of 4 and 39 million words taken from the Wall Street Journal corpus and show that large reductions in perplexity, up to 19% on the largest corpus, are achieved. We also performed some recognition experiments.
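As a rough illustration of the varigram idea, the sketch below collects candidate word m-grams with m > 3. The paper defines its own usefulness criterion; the simple frequency threshold used here is only an illustrative stand-in for it.

```python
from collections import Counter


def select_varigrams(tokens, max_m=5, min_count=3):
    """Collect candidate word m-grams with 3 < m <= max_m.

    A simple frequency threshold stands in for the paper's actual
    usefulness criterion; treat this purely as a sketch.
    """
    selected = {}
    for m in range(4, max_m + 1):
        counts = Counter(
            tuple(tokens[i:i + m]) for i in range(len(tokens) - m + 1)
        )
        selected[m] = {g: c for g, c in counts.items() if c >= min_count}
    return selected
```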

A0334.pdf



A Latent Semantic Analysis Framework for Large-Span Language Modeling

Authors: Jerome R. Bellegarda

Advanced Technology Group, Apple Computer, Cupertino, California 95014, USA; e-mail: jerome@apple.com; Tel: +1 (408) 974-7647

Volume 3 pages 1451 - 1454

ABSTRACT

A new framework is proposed to construct large-span, semantically-derived language models for large vocabulary speech recognition. It is based on the latent semantic analysis paradigm, which seeks to automatically uncover the salient semantic relationships between words and documents in a given corpus. Because of its semantic nature, a latent semantic language model is well suited to complement a conventional, more syntactically-oriented n-gram. An integrative formulation is proposed for the combination of the two paradigms. The performance of the resulting integrated language model, as measured by perplexity, compares favorably with the corresponding n-gram performance.
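The abstract does not spell out the integrative formulation, so the sketch below shows only one simple way such a combination could look: multiply the n-gram and LSA word probabilities and renormalize over the vocabulary. Both probability functions are assumed to be supplied by the caller; nothing here is the paper's actual formula.

```python
def integrated_prob(word, history, vocab, p_ngram, p_lsa):
    """Combine an n-gram LM with an LSA-derived word model.

    p_ngram(w, history) and p_lsa(w, history) are caller-supplied
    probability functions; their product is renormalized over the
    vocabulary so the result is a proper distribution.
    """
    scores = {v: p_ngram(v, history) * p_lsa(v, history) for v in vocab}
    total = sum(scores.values())
    return scores[word] / total
```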

A0642.pdf



A MAXIMUM LIKELIHOOD MODEL FOR TOPIC CLASSIFICATION OF BROADCAST NEWS

Authors: Richard Schwartz, Toru Imai*, Francis Kubala, Long Nguyen, John Makhoul

BBN Systems and Technologies, Cambridge, MA 02138, USA; *NHK (Japan Broadcasting Corp.) Sci. & Tech. Res. Labs., Tokyo 157, Japan; Tel: +1-617-873-3360; Fax: +1-617-873-2534; e-mail: schwartz@bbn.com

Volume 3 pages 1455 - 1458

ABSTRACT

We describe a new algorithm for topic classification that allows discrimination among thousands of topics. A mixture of topics explicitly models the fact that each story has multiple topics, that different words are related to different topics, and that most of the words are not related to any topic. The resulting model, trained by EM, has sharper word distributions that result in more accurate topic classification. We tested the algorithm on transcribed broadcast news texts. When trained on one year of stories containing over 5,000 different topics and tested on new (later) stories, the first-choice topic was among the manually annotated choices 76% of the time.
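The sketch below shows a minimal EM loop for this kind of topic-mixture model, with one fixed "general language" component absorbing topic-neutral words, as the abstract suggests. Initialization, smoothing, and all names are illustrative assumptions rather than the authors' algorithm.

```python
import numpy as np


def em_topic_mixture(docs, n_topics, n_iters=20, seed=0):
    """docs: list of stories, each a list of word ids in range(V).

    Component 0 is a fixed "general language" unigram that absorbs
    topic-neutral words; components 1..n_topics-1 are re-estimated.
    """
    V = 1 + max(w for d in docs for w in d)
    rng = np.random.default_rng(seed)

    word_given_topic = rng.dirichlet(np.ones(V), size=n_topics)  # (T, V)
    flat = np.bincount([w for d in docs for w in d], minlength=V)
    word_given_topic[0] = flat / flat.sum()  # fixed background component

    story_weights = [np.full(n_topics, 1.0 / n_topics) for _ in docs]

    for _ in range(n_iters):
        counts = np.zeros((n_topics, V))
        for d, doc in enumerate(docs):
            # E-step: posterior over topics for every token of the story.
            post = story_weights[d][:, None] * word_given_topic[:, doc]
            post /= post.sum(axis=0, keepdims=True)
            # M-step, per story: new mixture weights over topics.
            story_weights[d] = post.sum(axis=1) / len(doc)
            np.add.at(counts, (slice(None), doc), post)
        # M-step, global: re-estimate all topics except the background.
        counts += 1e-6  # tiny smoothing to avoid zero probabilities
        word_given_topic[1:] = counts[1:] / counts[1:].sum(axis=1, keepdims=True)

    # A new story can then be classified by ranking its posterior
    # story_weights after a few E-steps against the trained topics.
    return word_given_topic, story_weights
```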

A1000.pdf



LANGUAGE MODELLING FOR TASK-ORIENTED DOMAINS

Authors: Cosmin Popovici, Paolo Baggia

ICI - Institutul de Cercetari in Informatica, Bd. M. Averescu 8-10, Bucuresti (Romania); CSELT - Centro Studi e Laboratori Telecomunicazioni, Via G. Reiss Romoli 274, I-10148 Torino (Italy); e-mail: baggia@cselt.it

Volume 3 pages 1459 - 1462

ABSTRACT

This paper focuses on language modelling for task-oriented domains and presents a detailed analysis of the utterances acquired by the Dialogos spoken dialogue system. Dialogos provides access to the Italian Railways timetable over the public telephone network. The language modelling aspects of specificity and behaviour on rare events are studied. A technique for making the language model more robust, based on sentences generated by grammars, is presented. Experimental results show the benefit of the proposed technique. The performance gain of grammar-based language models over conventional ones is larger when the amount of training material is limited; the technique is therefore especially advantageous for developing language models in a new domain.
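As a toy illustration of the grammar-based technique, the sketch below samples sentences from a small hand-written CFG so they can be pooled with the real training data before n-gram estimation. The grammar fragment and all names are invented for illustration and are far simpler than the system's actual grammars.

```python
import random

# A toy fragment of a railway-timetable grammar; nonterminals are the
# uppercase keys, everything else is a terminal word.
GRAMMAR = {
    "S": [["i", "want", "a", "train", "TO", "CITY"],
          ["when", "does", "the", "train", "TO", "CITY", "leave"]],
    "TO": [["to"], ["for"]],
    "CITY": [["torino"], ["roma"], ["milano"]],
}


def generate(symbol="S"):
    """Expand one random derivation of the grammar into a word list."""
    if symbol not in GRAMMAR:
        return [symbol]  # terminal word
    words = []
    for s in random.choice(GRAMMAR[symbol]):
        words.extend(generate(s))
    return words


# Pool generated sentences with the (typically small) real training set
# before estimating n-gram counts, making rare constructions less rare.
synthetic_corpus = [" ".join(generate()) for _ in range(1000)]
```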

A1096.pdf



CHINESE LANGUAGE MODEL ADAPTATION BASED ON DOCUMENT CLASSIFICATION AND MULTIPLE DOMAIN-SPECIFIC LANGUAGE MODELS

Authors: Sung-Chien Lin (1), Chi-Lung Tsai (1), Lee-Feng Chien (2), Keh-Jiann Chen (2), Lin-Shan Lee (1,2)

(1) Dept. of Computer Science and Information Engineering, National Taiwan University; (2) Institute of Information Science, Academia Sinica, Taipei, Taiwan, Republic of China; e-mail: lsc@speech.ee.ntu.edu.tw

Volume 3 pages 1463 - 1466

ABSTRACT

Adaptation of language models to specific subject domains is important for real speech recognition applications. In this paper, a Chinese language model adaptation approach is presented, based on document classification and multiple domain-specific language models. The proposed document classification method, which uses perplexity and word bigram coverage as its primary measures, is able to model word associations and syntactic behaviour when classifying documents into clusters, and thus creates more effective domain-specific language models. Language model adaptation in speech recognition can therefore be achieved by selecting the most appropriate domain-specific language model. Preliminary tests applying the approach to Mandarin speech recognition show promising performance for real applications.
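A minimal sketch of the selection step, assuming per-domain bigram models: score a document against each domain LM by perplexity and bigram coverage, then pick the best-scoring domain. The way the two measures are combined below is an assumption, not the paper's formula.

```python
import math


def perplexity(words, bigram_prob):
    """bigram_prob(prev, w) is a caller-supplied smoothed probability."""
    logp = sum(math.log(bigram_prob(p, w)) for p, w in zip(words, words[1:]))
    return math.exp(-logp / max(len(words) - 1, 1))


def bigram_coverage(words, known_bigrams):
    """Fraction of the document's bigrams seen in a domain's training set."""
    pairs = list(zip(words, words[1:]))
    return sum(b in known_bigrams for b in pairs) / max(len(pairs), 1)


def pick_domain(words, domain_lms, alpha=1.0):
    """domain_lms: {name: (bigram_prob, known_bigrams)} per domain.

    Low perplexity and high coverage both favour a domain; the linear
    combination below is an illustrative way to trade them off.
    """
    def score(lm):
        bigram_prob, known = lm
        return (math.log(perplexity(words, bigram_prob))
                - alpha * bigram_coverage(words, known))

    return min(domain_lms, key=lambda name: score(domain_lms[name]))
```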

A1124.pdf
