Acoustic Modeling - Miscellaneous Topics

Chair: Xavier Aubert, Philips GmbH Forschungslaboratorien, Germany

Improved Phone Recognition Using Bayesian Triphone Models

Authors:

Ji Ming, The Queen's University of Belfast (Northern Ireland)
F. Jack Smith, The Queen's University of Belfast (Northern Ireland)

Volume 1, Page 409, Paper number 1223

Abstract:

A crucial issue in triphone-based continuous speech recognition is the large number of models to be estimated against the limited availability of training data. This problem can be alleviated by composing a triphone model from less context-dependent models. This paper introduces a new statistical framework, derived from the Bayesian principle, to perform such a composition. The potential power of this new framework is explored, both algorithmically and experimentally, by an implementation with hidden Markov modeling techniques. This implementation is applied to the recognition of the 39-phone set on the TIMIT database. The new model achieves 74.4% and 75.6% accuracy on the core and complete test sets, respectively.

ic981223.pdf (From Postscript)

Multilingual Phone Recognition of Spontaneous Telephone Speech

Authors:

Cristobal Corredor-Ardoy, LIMSI-CNRS, Orsay (France)
Lori Lamel, LIMSI-CNRS, Orsay (France)
Martine Adda-Decker, LIMSI-CNRS, Orsay (France)
Jean-Luc Gauvain, LIMSI-CNRS, Orsay (France)

Volume 1, Page 413, Paper number 5241

Abstract:

In this paper we report on experiments with phone recognition of spontaneous telephone speech. Phone recognizers were trained and assessed on IDEAL, a multilingual corpus containing telephone speech in French, British English, German and Castilian Spanish. We investigated the influence of the training material composition (size and linguistic content) on the recognition performance using context-independent (CI) Hidden Markov Models and phonotactic bigram models. We found that when testing on spontaneous speech data, using only spontaneous speech training data gave the highest phone accuracies for the four languages, even though this data comprises only 14% of the available training data. The use of context-dependent (CD) HMMs reduced the phone error across the four languages, with the average error reduced to 51.9% from the 57.4% obtained with CI models. We suggest a straightforward way of detecting non-speech phenomena. The basic idea is to remove sequences of consonants between two silence labels from the recognized phone strings prior to scoring. This simple technique reduces the average phone error rate by 5.4% relative. The lowest phone error with CD models and filtering was obtained for Spanish (39.1%), with the four-language average being 49.1%.
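The filtering step described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `filter_consonant_runs`, the silence label `"sil"`, and the toy consonant set are all assumptions made for illustration, and the abstract does not say whether the two enclosing silence labels are merged afterwards (here they are both kept).

```python
from typing import List, Sequence, Set


def filter_consonant_runs(
    phones: Sequence[str], consonants: Set[str], silence: str = "sil"
) -> List[str]:
    """Drop every run of consonant labels that is enclosed by two silence labels.

    Hypothetical helper: label names and the consonant inventory are assumptions.
    The silence label must not be a member of `consonants`.
    """
    result: List[str] = []
    i = 0
    n = len(phones)
    while i < n:
        result.append(phones[i])
        if phones[i] == silence:
            # Scan forward over labels that are all consonants.
            j = i + 1
            while j < n and phones[j] in consonants:
                j += 1
            # Drop the run only if it is non-empty and closed by another silence.
            if j > i + 1 and j < n and phones[j] == silence:
                i = j  # the closing silence is appended on the next iteration
                continue
        i += 1
    return result


recognized = ["sil", "t", "k", "sil", "a", "t", "sil"]
print(filter_consonant_runs(recognized, {"t", "k", "s"}))
# → ['sil', 'sil', 'a', 't', 'sil']
```

A run that contains any non-consonant label (for example `sil t a sil`) is left untouched, since only consonant-only sequences are treated as likely non-speech events in this sketch.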

ic985241.pdf (Scanned)

Language Adaptation of Multilingual Phone Models for Vocabulary Independent Speech Recognition Tasks

Authors:

Joachim Koehler, Siemens AG (Germany)

Volume 1, Page 417, Paper number 1941

Abstract:

This paper presents our new results on multilingual phone modeling and adaptation into a new target language which is not included in the trained multilingual models. The experiments were carried out with the SpeechDat(M) and MacroPhone databases including the languages French, German, Italian, Portuguese, Spanish and American English. First, we constructed language-dependent and multilingual phone models. The recognition rate for an isolated word task decreased on average by only 3.2% when using 95 multilingual models instead of 232 language-dependent models. Second, we investigated adaptation techniques for cross-language transfer and showed that only 100 utterances from a new language were needed for adaptation. Using the MAP algorithm, the recognition rate was improved from 79.9% to 84.3%. Finally, we defined a phonetic-based dissimilarity measure between two languages and compared language-dependent and multilingual models for the purpose of cross-language transfer.
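The abstract does not say which model parameters were adapted or how the prior was weighted. As a rough illustration only, the sketch below shows the textbook MAP update of a single Gaussian mean, which interpolates between a prior mean (here standing in for a multilingual model) and the adaptation frames from the new language. The function name `map_adapt_mean` and the prior weight `tau` are assumptions, not details taken from the paper.

```python
from typing import List, Sequence


def map_adapt_mean(
    prior_mean: Sequence[float], frames: Sequence[Sequence[float]], tau: float = 10.0
) -> List[float]:
    """Textbook MAP estimate of a Gaussian mean (hypothetical helper).

    new_mean = (tau * prior_mean + sum of frames) / (tau + number of frames)
    `tau` (assumed value) controls how strongly the prior mean is trusted.
    """
    dim = len(prior_mean)
    sums = [0.0] * dim
    for frame in frames:
        for d in range(dim):
            sums[d] += frame[d]
    n = len(frames)
    return [(tau * prior_mean[d] + sums[d]) / (tau + n) for d in range(dim)]


# Ten adaptation frames at (2, 4) pull a prior mean of (0, 0) halfway when tau = 10.
print(map_adapt_mean([0.0, 0.0], [[2.0, 4.0]] * 10, tau=10.0))
# → [1.0, 2.0]
```

With few adaptation frames the estimate stays close to the prior mean, and with many frames it approaches the sample mean of the new data, which is why this kind of update is commonly used when only a small amount of target-language speech is available.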

ic981941.pdf (From Postscript)

Advances in Alpha Digit Recognition Using Syllables

Authors:

Jonathan Hamaker, Mississippi State University (U.S.A.)
Aravind Ganapathiraju, Mississippi State University (U.S.A.)
Joseph Picone, Mississippi State University (U.S.A.)
John J Godfrey, PSL, Texas Instruments Inc. (U.S.A.)

Volume 1, Page 421, Paper number 2044

Abstract:

In this paper, we present a set of experiments which explore the use of syllables for recognition of continuous alphadigit utterances. In this system, syllables are used as the primary unit of recognition. This work was motivated by our need to verify and isolate phenomena seen when performing syllable-based experiments on the SWITCHBOARD corpus. The performance of our base syllable system is better than that of a crossword triphone system while requiring a small portion of the resources necessary for triphone systems. All experiments were performed on the OGI Alphadigits corpus, which consists of telephone-bandwidth alphadigit strings. The WER of the best syllable system (context-independent syllables) reported here is 11.1%, compared to 12.2% for a crossword triphone system.

ic982044.pdf (From Postscript)

LVCSR Rescoring with Modified Loss Functions: A Decision Theoretic Perspective

Authors:

Vaibhava Goel, The Johns Hopkins University (U.S.A.)
William J. Byrne, The Johns Hopkins University (U.S.A.)
Sanjeev P. Khudanpur, The Johns Hopkins University (U.S.A.)

Volume 1, Page 425, Paper number 2387

Abstract:

The problem of speech decoding is considered here in a decision-theoretic framework, and a modified speech decoding procedure to minimize the expected risk under a general loss function is formulated. A specific word error rate loss function is considered and an implementation in an N-best list rescoring procedure is presented. Methods for estimation of the parameters of the resulting decision rule are provided for both supervised and unsupervised training. Preliminary experiments on an LVCSR task show small but statistically significant error rate improvements.
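The general idea of minimum expected risk rescoring over an N-best list can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' procedure: posteriors are obtained by simply normalizing the log scores over the list, the loss is the word-level Levenshtein distance, and the names `edit_distance` and `min_risk_hypothesis` are hypothetical. The parameter estimation methods mentioned in the abstract are not modeled here.

```python
import math
from typing import List, Sequence, Tuple


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def min_risk_hypothesis(
    nbest: Sequence[Tuple[List[str], float]]
) -> Tuple[List[str], float]:
    """Pick the N-best entry with the lowest expected word-error loss.

    `nbest` holds (word sequence, log score) pairs. Posteriors are approximated
    by normalizing exp(log score) over the list, which is an assumption.
    """
    max_log = max(score for _, score in nbest)
    weights = [math.exp(score - max_log) for _, score in nbest]
    total = sum(weights)
    posteriors = [w / total for w in weights]

    best_words: List[str] = []
    best_risk = math.inf
    for candidate, _ in nbest:
        # Expected loss of choosing `candidate` if each list entry were the truth.
        risk = sum(
            p * edit_distance(other, candidate)
            for (other, _), p in zip(nbest, posteriors)
        )
        if risk < best_risk:
            best_words, best_risk = candidate, risk
    return best_words, best_risk


nbest = [
    (["a", "b", "c"], math.log(0.40)),  # highest-scoring (MAP) hypothesis
    (["a", "b", "d"], math.log(0.35)),
    (["a", "e", "d"], math.log(0.25)),
]
words, risk = min_risk_hypothesis(nbest)
print(words)  # → ['a', 'b', 'd']
```

In this toy list the hypothesis with the highest score is not the one with the lowest expected word error: `a b d` lies closer, on average, to the other likely hypotheses, so it is selected instead.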

ic982387.pdf (From Postscript)

Boosting Long-Term Adaptation of Hidden-Markov-Models: Incremental Splitting of Probability Density Functions

Authors:

Udo Bub, Siemens AG (Germany)
Harald Hoege, Siemens AG (Germany)

Volume 1, Page 429, Paper number 1961

Abstract:

The research described in this paper focuses on ways to avoid the tedious training of Hidden-Markov-Models when setting up a new recognition task. A major speaker-independent cause of decreased recognition accuracy is a mismatch of the phonetic contexts between training and testing data. To overcome this problem, we introduced in previous work the idea of updating task-independent acoustic models by means of Bayesian learning. In this paper we introduce the new approach of adaptively splitting the probability density functions (pdfs) of a continuous density HMM. The goal is to model the appropriate state pdfs better so that they can more accurately match new contexts that are observed while the system is in service. Combining splitting with Bayesian adaptation yields a remarkable reduction of word error rate compared to Bayesian adaptation alone.

ic981961.pdf (From Postscript)

Improvements in Children's Speech Recognition Performance

Authors:

Subrata Das, IBM T.J. Watson Research Center (U.S.A.)
Don Nix, IBM T.J. Watson Research Center (U.S.A.)
Michael Picheny, IBM T.J. Watson Research Center (U.S.A.)

Volume 1, Page 433, Paper number 2172

Abstract:

There are several reasons why conventional speech recognition systems modeled on adult data fail to perform satisfactorily on children's speech input. For instance, children's vocal characteristics differ significantly from those of adults. In addition, their choices of vocabulary and sentence construction modalities usually do not follow adult patterns. We describe comparative studies demonstrating the performance gain realized by adapting to children's acoustic and language model data to construct a children's speech recognition system.

ic982172.pdf (From Postscript)

Speaker Normalized Acoustic Modeling Based on 3-D Viterbi Decoding

Authors:

Toshiaki Fukada, ATR-ITL (Japan)
Yoshinori Sagisaka, ATR-ITL (Japan)

Volume 1, Page 437, Paper number 1924

Abstract:

This paper describes a novel method for speaker normalization based on a frequency warping approach to reduce variations due to speaker-induced factors such as the vocal tract length. In our approach, a speaker normalized acoustic model is trained using time-varying (i.e., state, phoneme or word dependent) warping factors, while in the conventional approaches, the frequency warping factor is fixed for each speaker. These time-varying frequency warping factors are determined by a 3-dimensional (i.e., input frames, HMM states and warping factors) Viterbi decoding procedure. Experimental results on Japanese spontaneous speech recognition show that the proposed method yields a 9.7% improvement in speech recognition accuracy compared to the conventional speaker-independent model.

ic981924.pdf (From Postscript)

Adaptive Heterodyne Filters (AHF) for Detection and Attenuation of Narrow Band Signals

Authors:

Karl E Nelson, University of California (U.S.A.)
Michael A Soderstrand, University of California (U.S.A.)

Volume 1, Page 441, Paper number 2264

Abstract:

A fixed filter may be converted into an adaptive filter with a single adaptation parameter through the use of a new Adaptive Heterodyne Filter (AHF) concept, in which the frequency of the heterodyne signal is adjusted, thereby translating the entire filter transfer function in frequency. If the fixed filter is selected to be a very narrow-band band-pass filter, the new AHF concept can be used very effectively in the elimination of narrow-band interference in wide-band communications or control systems. A specific example of the removal of a slow-moving time-varying mechanical resonance from the control signal for a flight control system demonstrates the power of the new AHF concept.

ic982264.pdf (From Postscript)

Online Tool Wear Monitoring in Turning Using Time-Delay Neural Networks

Authors:

Bernhard Sick, University of Passau (Germany)

Volume 1, Page 445, Paper number 1478

Abstract:

Wear monitoring systems often use neural networks for a sensor fusion with multiple input patterns. Systems for a continuous online supervision of wear have to process pattern sequences. Therefore recurrent neural networks have been investigated in the past. However, in most cases where only noisy input or even noisy output patterns are available for a supervised learning, success is not forthcoming. That is, recurrent networks don't perform noticeably better than non-recurrent networks processing only the current input pattern like multilayer perceptrons. This paper demonstrates on the basis of an application example (online tool wear monitoring in turning) that results can be improved significantly with special non-recurrent networks. This approach uses time-delay neural networks which consider the position of a single pattern in a pattern sequence by means of delay elements at the synapses. In the mentioned application example the average error in the estimation of a characteristic wear parameter could be reduced by about 24% compared with multilayer perceptrons.
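To make the notion of delay elements at the synapses concrete, the following minimal sketch computes the output of a single time-delay unit over a pattern sequence. It is an assumption-laden illustration rather than the network used in the paper: the function name `tdnn_unit`, the tanh activation, and the toy weights are all hypothetical.

```python
import math
from typing import List, Sequence


def tdnn_unit(
    sequence: Sequence[Sequence[float]],
    weights: Sequence[Sequence[float]],
    bias: float = 0.0,
) -> List[float]:
    """Output of one time-delay unit for each time step with full history.

    weights[d] is the weight vector applied to the pattern delayed by d steps,
    so the unit sees the current pattern and the len(weights) - 1 previous ones.
    The tanh activation is an assumed choice.
    """
    num_delays = len(weights)
    outputs: List[float] = []
    for t in range(num_delays - 1, len(sequence)):
        activation = bias
        for d, w in enumerate(weights):
            pattern = sequence[t - d]
            activation += sum(wi * xi for wi, xi in zip(w, pattern))
        outputs.append(math.tanh(activation))
    return outputs


# One-dimensional patterns; the unit responds to the change between steps.
signal = [[1.0], [2.0], [3.0]]
print(len(tdnn_unit(signal, weights=[[1.0], [-1.0]])))  # → 2
```

Unlike a multilayer perceptron fed only the current pattern, such a unit has access to a short window of the sequence, yet it contains no feedback connections, which is what distinguishes it from a recurrent network.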