Topics in ASR


Writer adaptation of an HMM handwriting recognition system

Authors:

Andrew W. Senior, IBM Research (U.S.A.)
Krishna S. Nathan, IBM Research (U.S.A.)

Volume 2, Page 1447

Abstract:

This paper describes a scheme for adapting the parameters of a tied-mixture hidden Markov model on-line handwriting recognition system to improve performance on new writers' handwriting. The means and variances of the distributions are adapted using the Maximum Likelihood Linear Regression (MLLR) technique. Experiments are performed with a number of new writers in both supervised and unsupervised modes. Adaptation on as few as 5 words is found to result in models with a 6% lower error rate than the writer-independent model.
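The MLLR mean update described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes identity covariances, which reduces the maximum-likelihood estimate of the global transform W = [A b] to a weighted least-squares fit, and all names are illustrative.

```python
import numpy as np

def mllr_transform(means, frames, gamma):
    """Estimate a global MLLR mean transform W = [A  b].

    means : (M, d) Gaussian means of the writer-independent model
    frames: (T, d) adaptation observations from the new writer
    gamma : (T, M) Gaussian occupation probabilities from alignment

    With identity covariances the ML solution reduces to weighted
    least squares over the extended means xi_m = [mu_m ; 1].
    """
    M, d = means.shape
    xi = np.hstack([means, np.ones((M, 1))])        # extended means (M, d+1)
    Z = frames.T @ gamma @ xi                       # (d, d+1) cross-statistics
    G = xi.T @ (gamma.sum(axis=0)[:, None] * xi)    # (d+1, d+1) accumulator
    return Z @ np.linalg.inv(G)                     # W, shape (d, d+1)

def adapt_means(means, W):
    """Apply the transform: mu' = A mu + b for every Gaussian."""
    xi = np.hstack([means, np.ones((means.shape[0], 1))])
    return xi @ W.T
```

Given per-frame occupancies from a forced alignment, the adapted means move the writer-independent model toward the new writer even with very little data.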

ic971447.pdf




In-Service Adaptation of Multilingual Hidden-Markov-Models

Authors:

Udo Bub, Siemens AG (Germany)
Joachim Köhler, Siemens AG (Germany)
Bojan Imperl, University of Maribor (Slovenia)

Volume 2, Page 1451

Abstract:

In this paper we report on advances in our approach to porting an automatic speech recognition system to a new target task. When too little acoustic data is available for thorough estimation of the HMM parameters, an appropriate task-dependent model cannot be trained. Our basic idea for overcoming this problem is to create a task-independent seed model that can cope with all tasks equally well. The performance of such a generalist model is, of course, lower than that of task-dependent models (if these were available), so the seed model is gradually enhanced by using its own recognition results for incremental online task adaptation. Here, we use a multilingual Romance/Germanic seed model for a Slavic target task. In tests on Slovene digits, multilingual modeling yields the best recognition accuracy compared to other language-dependent models. Applying unsupervised online task adaptation, we observe a remarkable boost in recognition performance.

ic971451.pdf




Development of Dialect-Specific Speech Recognizers Using Adaptation Methods

Authors:

Vassilios Diakoloukas, TUC (Greece)
Vassilios Digalakis, TUC (Greece)
Leonardo Neumeyer, STAR-SRI (U.S.A.)
Jaan Kaja, Telia (Sweden)

Volume 2, Page 1455

Abstract:

Several adaptation approaches have been proposed to improve speech recognition performance under mismatched conditions. However, the application of these approaches has mostly been confined to speaker or channel adaptation tasks. In this paper, we first investigate the effect of mismatched dialects between training and testing speakers in an Automatic Speech Recognition (ASR) system. We find that a mismatch in dialects significantly degrades recognition accuracy. Consequently, we apply several adaptation approaches to develop a dialect-specific recognition system, starting from a dialect-dependent system trained on a different dialect and a small number of training sentences from the target dialect. We show that adaptation improves recognition performance dramatically with small numbers of training sentences. We further show that, although the performance of traditionally trained systems degrades sharply as the number of training speakers decreases, the performance of adapted systems is much less affected.

ic971455.pdf




Syllable-Based Relevance Feedback Techniques for Mandarin Voice Record Retrieval Using Speech Queries

Authors:

Bo-Ren Bai, NTU (Taiwan)
Lee-Feng Chien, Academia Sinica (Taiwan)
Lin-Shan Lee, NTU (Taiwan)

Volume 2, Page 1459

Abstract:

To cope with the rapid growth of audio resources on the Internet, we previously presented a syllable-based approach capable of retrieving Mandarin voice records using queries of unconstrained speech. The performance of this approach is still not satisfactory, however, and one reason is that the information the speech query provides about the requested subject is often insufficient. In this paper, we present approaches based on relevance feedback to improve on that work. The proposed approaches include a relevance measure adjustment scheme using a relevance table for the voice database, a query expansion scheme that generates a new query incorporating the feedback information, and a combination of the two. Extensive preliminary experiments were performed, with encouraging results.
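The query expansion step can be illustrated with a generic Rocchio-style update over syllable counts. The abstract does not specify the paper's actual relevance-table and expansion formulas, so this is an assumed, simplified form; all names are illustrative.

```python
from collections import Counter

def expand_query(query_syllables, relevant_docs, alpha=1.0, beta=0.75):
    """Rocchio-style expansion: reweight query syllables toward the
    centroid of documents the user marked relevant.

    query_syllables: list of syllable tokens in the original query
    relevant_docs  : list of docs, each a list of syllable tokens
    Returns a dict mapping syllable -> new weight.
    """
    q = Counter(query_syllables)
    feedback = Counter()
    for doc in relevant_docs:
        feedback.update(doc)
    n = max(len(relevant_docs), 1)
    expanded = {}
    for syl in set(q) | set(feedback):
        w = alpha * q[syl] + beta * feedback[syl] / n
        if w > 0:
            expanded[syl] = w
    return expanded
```

Syllables that occur in the feedback documents but not in the original query enter the new query with nonzero weight, which is the expansion effect the abstract describes.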

ic971459.pdf




Automatic Alternative Transcription Generation and Vocabulary Selection for Flexible Word Recognizers

Authors:

Doroteo Torre, TID (Spain)
Luis Villarrubia, TID (Spain)
Jose Maria Elvira, TID (Spain)
Luis Hernandez-Gomez, ETSIT-UPM (Spain)

Volume 2, Page 1463

Abstract:

With the emergence of Voice Response Systems that use Flexible Vocabulary Recognizers (FVRs), the prediction of word confusability has received increasing interest during the last few years. In this contribution we present a new method for estimating transcription confusability, based on a new statistical modelling criterion. We propose using this confusability measure in two different word error rate (WER) reduction procedures for FVRs: an automatic vocabulary selection procedure, suitable for applications where the set of vocabulary words is not fully fixed by the application, and an automatic procedure for generating alternative transcriptions. Experimental results on a telephone database show a 20% relative WER reduction using the automatic alternative transcription generation procedure on a 37-word vocabulary, and over 50% (20%) relative WER reduction using our unrestricted (restricted by groups of synonyms) vocabulary selection procedure instead of random word selection.

ic971463.pdf




An Advanced System to Generate Pronunciations of Proper Nouns

Authors:

Neeraj Deshmukh, ISIP (U.S.A.)
Julie Ngan, ISIP (U.S.A.)
Jonathan Hamaker, ISIP (U.S.A.)
Joseph Picone, ISIP (U.S.A.)

Volume 2, Page 1467

Abstract:

Accurate recognition of proper nouns is a critical component of automatic speech recognition (ASR). Since there are no obvious letter-to-sound conversion rules governing the pronunciation of any large set of proper nouns, this is an open-ended problem that evolves constantly under various sociolinguistic influences. A Boltzmann machine neural network is well suited to the task of generating the most likely pronunciations of a proper noun. This pronunciation output can be used to build better acoustic models for the noun, resulting in improved recognition performance. We present here an advanced version of this N-best pronunciation system, along with a multiple-pronunciation dictionary of 18,000 surnames and 25,000 pronunciations used as a training database. The database and software are available in the public domain.

ic971467.pdf




Automatic Pronunciation Scoring for Language Instruction

Authors:

Horacio Franco, SRI International (U.S.A.)
Leonardo Neumeyer, SRI International (U.S.A.)
Yoon Kim, SRI International (U.S.A.)
Orith Ronen, SRI International (U.S.A.)

Volume 2, Page 1471

Abstract:

In this work we address the task of grading the pronunciation quality of the speech of a student of a foreign language. The automatic grading system uses SRI's Decipher continuous speech recognition system to generate phonetic segmentations. Based on these segmentations and probabilistic models, we produce pronunciation scores for individual sentences or groups of sentences. Scores obtained from expert human listeners are used as the reference to evaluate the different machine scores and to provide targets when training some of the algorithms. In previous work we found that duration-based scores outperformed HMM log-likelihood-based scores. In this paper we show that HMM-based scores can be significantly improved by using average phone segment posterior probabilities. Correlation between machine and human scores rose from r=0.50 with likelihood-based scores to r=0.88 with posterior-based scores; the posterior-based scores also outperformed duration-based scores, mainly when few sentences are available to compute a score.
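A posterior-based score of the kind the abstract describes can be sketched, per phone segment, as the average frame posterior of the aligned phone. This is a simplified stand-in for the paper's average phone segment posterior probabilities; the per-frame phone-model log-likelihoods are assumed inputs produced by the recognizer.

```python
import numpy as np

def phone_posterior_score(frame_loglik, target):
    """Average frame posterior of the aligned phone over one segment.

    frame_loglik: (T, P) array of log-likelihoods of P phone models
                  for each of the T frames in the segment
    target      : index of the phone the segment is aligned to

    The posterior per frame is a softmax over phone models; the
    segment score averages the target phone's posterior over frames.
    """
    # Subtract the row max before exponentiating for numerical stability.
    ll = frame_loglik - frame_loglik.max(axis=1, keepdims=True)
    post = np.exp(ll)
    post /= post.sum(axis=1, keepdims=True)
    return float(post[:, target].mean())
```

A sentence-level grade would then average these segment scores over all phones in the utterance before comparing against the human reference scores.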

ic971471.pdf




Speaker-Independent Name Dialing With Out-of-Vocabulary Rejection

Authors:

Coimbatore S. Ramalingam, Texas Instruments (U.S.A.)
Lorin P. Netsch, Texas Instruments (U.S.A.)
Yu-Hung Kao, Texas Instruments (U.S.A.)

Volume 2, Page 1475

Abstract:

In this paper we propose a system for speaker-independent name dialing in which a name enrolled by one user can be used by other members of a family or co-workers in an office. We use speaker-independent sub-word models during enrollment; the recognized sub-word string is later used during recognition. We also present a mechanism for rejecting out-of-vocabulary (OOV) phrases. The best in-vocabulary (IV) correct and OOV rejection performance for other speakers is 90%/60% (IV/OOV) on a database containing eighteen speakers. If the orthography were known, the best performance would be 96%/65%.

ic971475.pdf




Hidden Understanding Models For Statistical Sentence Understanding

Authors:

Richard Schwartz, BBN Systems and Technologies, Cambridge, MA (U.S.A.)
Scott Miller, BBN Systems and Technologies, Cambridge, MA (U.S.A.)
David Stallard, BBN Systems and Technologies, Cambridge, MA (U.S.A.)
John Makhoul, BBN Systems and Technologies, Cambridge, MA (U.S.A.)

Volume 2, Page 1479

Abstract:

We describe the first sentence understanding system that is based entirely on learned methods, both for understanding individual sentences and for determining their meaning in the context of preceding sentences. We divide the problem into three stages: semantic parsing, semantic classification, and discourse modeling. Each of these stages requires a different model. When we ran this system on the last test (December 1994) of the ARPA Air Travel Information System (ATIS) task, we achieved a 13.7% error rate. The error rate for those sentences that are context-independent (class A) was 9.7%.

ic971479.pdf




An alternative scheme for perplexity estimation

Authors:

Frédéric Bimbot, ENST / SIG - CNRS / URA-820 (France)
Marc El-Bèze, LIA (France)
Michèle Jardino, LIMSI - CNRS (France)

Volume 2, Page 1483

Abstract:

Language models are usually evaluated on test texts using the perplexity derived directly from the model likelihood function. In order to use this measure in the framework of a comparative evaluation campaign, we have developed an alternative scheme for perplexity estimation. The method is derived from the Shannon game and based on gambling on the next word to come in a truncated sentence. We also use entropy bounds proposed by Shannon, based on the rank of the correct answer, to estimate a perplexity interval for non-probabilistic language models. The relevance of the approach is assessed on an example.
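For contrast with the Shannon-game scheme, the baseline measure the abstract starts from, perplexity derived directly from the model likelihood, can be computed as follows. The `model` callable is an assumed interface, not from the paper.

```python
import math

def perplexity(model, text):
    """Standard likelihood-based perplexity of a word sequence.

    model: callable (word, history_tuple) -> P(word | history)
    text : list of word tokens

    PP = 2 ** ( -(1/N) * sum_i log2 P(w_i | w_1..w_{i-1}) )
    """
    logp = 0.0
    history = []
    for w in text:
        logp += math.log2(model(w, tuple(history)))
        history.append(w)
    return 2 ** (-logp / len(text))
```

A model that spreads probability uniformly over V words has perplexity exactly V, which is the sanity check usually quoted for this measure; the paper's rank-based scheme instead brackets this quantity for models that expose only guess rankings.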

ic971483.pdf




Extensions to Phone-State Decision-Tree Clustering: Single Tree and Tagged Clustering

Authors:

Douglas B. Paul, Dragon Systems, Inc. (U.S.A.)

Volume 2, Page 1487

Abstract:

The following article describes two extensions to the traditional decision tree methods for clustering allophone HMM states in LVCSR systems. The first, single tree clustering, combines all allophone states of all phones into a single tree. This can be used to improve performance for very small systems. The single tree clustering structure can also be exploited for speaker and channel adaptation and is shown to provide a 30 percent reduction in the error rate for an LVCSR task under matched channel conditions and a greater reduction under mismatched channel conditions. The second, tagged clustering, is a mechanism for providing additional information to the clustering procedure. The tags are labels for any of a wide variety of factors, such as stress, placed on the triphones. These tags are then accessible to the clustering process. Small improvements in recognition performance were obtained under certain conditions. Both methods can be combined.

ic971487.pdf




Evaluation of fast algorithms for finding the nearest neighbor

Authors:

Stéphane Lubiarz, Matra Com. (France)
Philip Lockwood, Matra Com. (France)

Volume 2, Page 1491

Abstract:

In speech recognition systems, as well as in speech coders using vector quantization, the search for the nearest neighbor is a computationally intensive task. In this paper, we address the problem of fast nearest neighbor search. State-of-the-art solutions tend to approach logarithmic access time, but such performance is generally achieved at the expense of a significant increase in storage requirements. In this contribution, we compare several known approaches and propose new extensions. These new contributions allow a significant reduction in memory requirements without impacting performance in terms of the number of distances computed or the optimality of the search.
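One classic speed-up in this family is partial-distance elimination, which abandons a codeword as soon as its accumulated squared distance exceeds the best found so far. The abstract does not name the specific methods compared, so this is an illustrative example rather than the paper's algorithm.

```python
def nearest_neighbor(x, codebook):
    """Exact nearest neighbor with partial-distance elimination.

    x       : query vector (sequence of floats)
    codebook: list of codeword vectors of the same dimension
    Returns (index, squared_distance) of the nearest codeword.
    """
    best_i, best_d = 0, float("inf")
    for i, c in enumerate(codebook):
        d = 0.0
        for xj, cj in zip(x, c):
            d += (xj - cj) ** 2
            if d >= best_d:
                break  # this codeword can no longer win: abandon early
        else:
            best_i, best_d = i, d  # completed the loop, so d < best_d
    return best_i, best_d
```

The search stays exact (the same winner as exhaustive search) while skipping a large fraction of the per-dimension distance accumulations, which is the trade-off the paper's comparisons quantify.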

ic971491.pdf




Fusion of Visual and Acoustic Signals for Command-Word Recognition

Authors:

Rudolf Kober, FAW Ulm (Germany)
Ulrich Harz, FAW Ulm (Germany)
Jutta Schiffers, FAW Ulm (Germany)

Volume 2, Page 1495

Abstract:

In this paper, we investigate how the visual information of lip movement contributes to command-word recognition. The fusion of the acoustic and visual signals can be carried out either at the feature level or at the class level. Integration at the feature level means merging the acoustic and visual features into a combined feature vector which is fed into an HMM system. Fusion at the class level means classifying the two sources of information separately and combining the classification results. An HMM classifier is used for the acoustic signal and three different classifiers (HMM, DTW and ClaRe) for the visual signal; the classification results are combined using C4.5. The recognition rates of the two fusion schemes are comparable. Both yield small improvements at high SNRs for the acoustic/visual system in comparison to the acoustic system alone. Larger improvements (up to 12%) result at low SNRs.
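The two fusion levels can be sketched as follows. The class-level combiner here is a simple score average for illustration, whereas the paper combines classifier outputs with a learned C4.5 decision tree; shapes and names are assumptions.

```python
import numpy as np

def feature_fusion(acoustic, visual):
    """Feature-level fusion: concatenate the two streams frame by
    frame into one combined vector (fed to a single HMM in the paper).

    acoustic: (T, da) acoustic features, visual: (T, dv) visual features
    """
    return np.hstack([acoustic, visual])

def class_fusion(scores_by_classifier):
    """Class-level fusion, simplified: average the per-class scores of
    the separate classifiers and pick the best class.

    scores_by_classifier: list of (num_classes,) score arrays
    """
    stacked = np.vstack(scores_by_classifier)
    return int(stacked.mean(axis=0).argmax())
```

Feature-level fusion requires the streams to be frame-synchronous before concatenation, while class-level fusion lets each modality use whatever classifier suits it best, which is why the paper can mix HMM, DTW and ClaRe on the visual side.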

ic971495.pdf




Differences in visual information between face-to-face and telephone dialogues

Authors:

Yuri Iwano, Waseda University (Japan)
Yosuke Sugita, Waseda University (Japan)
Yusuke Kasahara, Waseda University (Japan)
Shu Nakazato, Waseda University (Japan)
Katsuhiko Shirai, Waseda University (Japan)

Volume 2, Page 1499

Abstract:

In this research, we analyzed conversations between pairs of subjects under two conditions: face-to-face conversation, which allows visual contact, and conversation over a telephone line, which does not. From the recorded videotape we extracted the subjects' actions, focusing especially on head movements. Comparing the dialogues under the two conditions suggests that there are two types of head movements: one intended to give a response to the partner, and the other to send some signal. We plan to analyze how visual information contributes to spoken dialogue perception, and the possibility of adopting it in a multi-modal human interface.

ic971499.pdf
