ABSTRACT
In this paper, an approach for constructing mixture language models (LMs) based on semantic information is presented. To this end, a technique known as latent semantic analysis (LSA) is used. The approach encapsulates corpus-derived semantic information and is able to model the varying style of the text. Using this information, the corpus texts are clustered in an unsupervised manner and mixture LMs are created automatically. This work builds on previous work in the field of information retrieval, recently applied by Bellegarda et al. to the problem of clustering words by semantic categories. The principal contribution of this work is to characterize the document space resulting from the LSA modeling and to demonstrate the approach in a mixture LM application. Manual and automatic clustering are compared in order to elucidate how the semantic information is expressed in the space. It is shown that, using semantic information, mixture LMs perform better than a conventional single LM with only a slight increase in computational cost.
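As an illustration of the unsupervised clustering step, the following minimal sketch projects documents into an LSA space via truncated SVD and clusters them with k-means, assuming scikit-learn is available; the corpus, component count, and cluster count are illustrative placeholders, not the paper's setup.

# Minimal sketch of LSA-based document clustering (assumes scikit-learn).
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.cluster import KMeans

corpus = ["stocks fell sharply today", "the team won the final match",
          "bond yields rose", "the striker scored twice"]

tfidf = TfidfVectorizer().fit_transform(corpus)          # word-document matrix
lsa = TruncatedSVD(n_components=2).fit_transform(tfidf)  # latent semantic space
clusters = KMeans(n_clusters=2, n_init=10).fit_predict(lsa)  # unsupervised clusters
print(clusters)

One mixture-component LM would then be trained on each cluster's texts, and the components combined at recognition time.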
ABSTRACT
This paper presents two extensions of the standard interpolated word trigram and cache model: the extension of the trigram model by useful word m-grams with m > 3, resulting in a varigram model, and the addition of topic-specific trigram models. We give the criteria for selecting useful m-grams and for partitioning the training corpus into topic-specific subcorpora. We apply both extensions, separately and in combination, to corpora of 4 and 39 million words taken from the Wall Street Journal corpus and show that reductions in perplexity of up to 19% are achieved on the largest corpus. We also performed some recognition experiments.
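As a sketch of how useful m-grams might be selected, the snippet below applies a simple count threshold; the paper's actual selection criterion is likely more refined, so this is an illustrative assumption only.

# Hedged sketch: selecting candidate word m-grams (m > 3) by a count
# threshold; the paper's real criterion may differ.
from collections import Counter

def useful_mgrams(tokens, m, min_count=5):
    """Return m-grams occurring at least min_count times."""
    counts = Counter(tuple(tokens[i:i + m]) for i in range(len(tokens) - m + 1))
    return {g: c for g, c in counts.items() if c >= min_count}

tokens = "the wall street journal reported that the wall street journal".split()
print(useful_mgrams(tokens, 4, min_count=2))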
ABSTRACT
A new framework is proposed to construct large-span, semantically-derived language models for large vocabulary speech recognition. It is based on the latent semantic analysis paradigm, which seeks to automatically uncover the salient semantic relationships between words and documents in a given corpus. Because of its semantic nature, a latent semantic language model is well suited to complement a conventional, more syntactically-oriented n-gram. An integrative formulation is proposed for the combination of the two paradigms. The performance of the resulting integrated language model, as measured by perplexity, compares favorably with the corresponding n-gram performance.
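The sketch below shows the two information sources side by side using plain linear interpolation; note that this is a deliberately simplified stand-in for illustration, not the integrative formulation the paper actually proposes.

# Illustrative sketch only: combining local syntactic (n-gram) and global
# semantic (LSA) probability estimates by linear interpolation.
def combined_prob(p_ngram, p_lsa, lam=0.8):
    """Interpolate an n-gram estimate with an LSA-based estimate."""
    return lam * p_ngram + (1.0 - lam) * p_lsa

print(combined_prob(0.02, 0.05))  # -> 0.026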
ABSTRACT
We describe a new algorithm for topic classification that allows discrimination among thousands of topics. A mixture of topics explicitly models the fact that each story has multiple topics, that different words are related to different topics, and that most of the words are not related to any topic. The resulting model, trained by EM, has sharper word distributions that yield more accurate topic classification. We tested the algorithm on transcribed broadcast news texts. When trained on one year of stories containing over 5,000 different topics and tested on new (later) stories, the first-choice topic was among the manually annotated choices 76% of the time.
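A minimal sketch of the EM re-estimation for one story's topic weights is given below, assuming fixed toy topic-unigram distributions (here one content topic and one near-uniform background topic); the paper's model handles thousands of topics.

# Hedged sketch of EM for a per-story topic mixture: each word token is
# explained by a weighted mix of topic unigram distributions.
import numpy as np

def em_topic_weights(word_ids, topic_word, n_iters=50):
    """Estimate mixture weights over topics (rows of topic_word) for one story."""
    n_topics = topic_word.shape[0]
    lam = np.full(n_topics, 1.0 / n_topics)              # uniform initial weights
    for _ in range(n_iters):
        # E-step: responsibility of each topic for each word token
        probs = topic_word[:, word_ids] * lam[:, None]   # (topics, tokens)
        resp = probs / probs.sum(axis=0, keepdims=True)
        # M-step: re-estimate mixture weights
        lam = resp.sum(axis=1) / resp.shape[1]
    return lam

# Toy distributions: topic 0 is "content", topic 1 is a background model
topic_word = np.array([[0.5, 0.4, 0.05, 0.05],
                       [0.25, 0.25, 0.25, 0.25]])
print(em_topic_weights(np.array([0, 1, 0, 2]), topic_word))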
ABSTRACT
This paper focuses on language modelling for task-oriented domains and presents an accurate analysis of the utterances acquired by the Dialogos spoken dialogue system. Dialogos gives access to the Italian Railways timetable by telephone over the public network. The language modelling aspects of specificity and behaviour on rare events are studied. A technique for making a language model more robust, based on sentences generated by grammars, is presented. Experimental results show the benefit of the proposed technique. The performance gain of grammar-based language models over conventional ones is larger when the amount of training material is limited. This technique is therefore especially advantageous for developing language models in a new domain.
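The following toy sketch illustrates the grammar-generation idea: sentences are sampled from a small hand-written grammar and can be pooled with the real corpus before LM training. The grammar here is an invented placeholder, not the Dialogos grammar.

# Minimal sketch, assuming a hand-written CFG for a railway-timetable domain;
# generated sentences would be added to the LM training data.
import random

GRAMMAR = {
    "S":    [["I", "want", "a", "train", "from", "CITY", "to", "CITY", "TIME"]],
    "CITY": [["Milano"], ["Torino"], ["Roma"]],
    "TIME": [["in", "the", "morning"], ["tomorrow"], []],
}

def generate(symbol="S"):
    """Expand a symbol into a word list by random rule choice."""
    if symbol not in GRAMMAR:
        return [symbol]                      # terminal word
    words = []
    for sym in random.choice(GRAMMAR[symbol]):
        words.extend(generate(sym))
    return words

print(" ".join(generate()))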
ABSTRACT
Adaptation of language models to specific subject domains is important for real speech recognition applications. In this paper, a Chinese language model adaptation approach is presented, based mainly on document classification and multiple domain-specific language models. The proposed document classification method, which uses the perplexity value and the word bigram coverage value as primary measures, is able to model word associations and syntactic behaviour when clustering documents, and thus creates more effective domain-specific language models. Language model adaptation in speech recognition can therefore be achieved effectively by selecting the most appropriate domain-specific language model. Preliminary tests in Mandarin speech recognition show the promising performance of the proposed approach for real applications.
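As a hedged sketch of the selection step, the code below picks the domain LM that yields the lowest perplexity on a document; the word bigram coverage measure mentioned in the abstract is omitted for brevity, and the LM interface is an assumption.

# Sketch: choose the domain-specific LM with the lowest perplexity on a
# document (toy unigram "LMs"; a real system would use n-gram models).
import math

def perplexity(lm_prob, tokens):
    """Perplexity of a token sequence under a word-probability function."""
    log_sum = sum(math.log(lm_prob(w)) for w in tokens)
    return math.exp(-log_sum / len(tokens))

def select_domain_lm(domain_lms, tokens):
    """Return the name of the domain LM with the lowest perplexity."""
    return min(domain_lms, key=lambda name: perplexity(domain_lms[name], tokens))

domain_lms = {
    "finance": lambda w: {"stocks": 0.6, "match": 0.1}.get(w, 0.3),
    "sports":  lambda w: {"stocks": 0.1, "match": 0.6}.get(w, 0.3),
}
print(select_domain_lm(domain_lms, ["stocks", "stocks", "match"]))  # -> finance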