ICASSP '98


Abstract -  SP13   


 
SP13.1

   
Improved Phone Recognition Using Bayesian Triphone Models
J. Ming, F. Smith  (Queen's University of Belfast, N. Ireland)
A crucial issue in triphone-based continuous speech recognition is the large number of models to be estimated against the limited availability of training data. This problem can be alleviated by composing a triphone model from less context-dependent models. This paper introduces a new statistical framework, derived from the Bayesian principle, to perform such a composition. The potential power of this new framework is explored, both algorithmically and experimentally, by an implementation with hidden Markov modeling techniques. This implementation is applied to the recognition of the 39-phone set on the TIMIT database. The new model achieves 74.4% and 75.6% accuracy, respectively, on the core and complete test sets.
 
SP13.2

   
Multilingual Phone Recognition of Telephone Spontaneous Speech
C. Corredor-Ardoy, L. Lamel, M. Adda-Decker, J. Gauvain  (LIMSI-CNRS, Orsay, France)
In this paper we report on experiments with phone recognition of spontaneous telephone speech. Phone recognizers were trained and assessed on IDEAL, a multilingual corpus containing telephone speech in French, British English, German and Castilian Spanish. We investigated the influence of the training material composition (size and linguistic content) on the recognition performance using context-independent (CI) Hidden Markov Models and phonotactic bigram models. We found that when testing on spontaneous speech data, using only spontaneous speech training data gave the highest phone accuracies for the four languages, even though this data comprises only 14% of the available training data. The use of context-dependent (CD) HMMs reduced the phone error across the four languages, with the average error reduced to 51.9% from the 57.4% obtained with CI models. We suggest a straightforward way of detecting non-speech phenomena. The basic idea is to remove sequences of consonants between two silence labels from the recognized phone strings prior to scoring. This simple technique reduces the average phone error rate by 5.4% relative. The lowest phone error with CD models and filtering was obtained for Spanish (39.1%), with the four-language average being 49.1%.
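The filtering step described in this abstract can be sketched roughly as follows. This is only an illustration: the silence label `sil` and the consonant inventory are hypothetical placeholders (the abstract does not list the IDEAL phone sets), and the authors' actual rule may differ in detail.

```python
# Hypothetical labels: the abstract does not give the real phone inventories.
SILENCE = "sil"
CONSONANTS = {"p", "t", "k", "b", "d", "g", "f", "s", "z", "m", "n", "l", "r"}

def filter_nonspeech(phones):
    """Drop runs consisting only of consonants that sit between two silence labels."""
    out = []
    i, n = 0, len(phones)
    while i < n:
        out.append(phones[i])
        if phones[i] == SILENCE:
            # Scan the run of consonants that follows this silence.
            j = i + 1
            while j < n and phones[j] in CONSONANTS:
                j += 1
            # Discard the run only if it is non-empty and closed by another silence.
            if j < n and j > i + 1 and phones[j] == SILENCE:
                i = j  # continue from the closing silence (adjacent silences are kept)
                continue
        i += 1
    return out

def relative_reduction(before, after):
    """Relative error reduction, e.g. from 51.9% to 49.1% phone error."""
    return (before - after) / before
```

With the figures quoted in the abstract, `relative_reduction(51.9, 49.1)` comes out at roughly 0.054, which matches the stated 5.4% relative reduction.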
 
SP13.3

   
Language Adaptation of Multilingual Phone Models for Vocabulary Independent Speech Recognition Tasks
J. Koehler  (Siemens AG, Germany)
This paper presents our new results on multilingual phone modeling and adaptation to a new target language which is not included in the trained multilingual models. The experiments were carried out with the SpeechDat(M) and MacroPhone databases including the languages French, German, Italian, Portuguese, Spanish and American English. First, we constructed language-dependent and multilingual phone models. The recognition rate for an isolated word task decreased on average by only 3.2% using 95 multilingual instead of 232 language-dependent models. Second, we investigated adaptation techniques for cross-language transfer and showed that only 100 utterances from a new language were needed for adaptation. Using the MAP algorithm, the recognition rate was improved from 79.9% to 84.3%. Finally, we defined a phonetic-based dissimilarity measure between two languages and compared language-dependent and multilingual models for the purpose of cross-language transfer.
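The abstract names the MAP algorithm without giving formulas. As a rough illustration, the widely used MAP update for a single Gaussian mean interpolates between the prior (here, multilingual) mean and the adaptation data. The sketch below assumes hard assignment of frames to the Gaussian, and the prior weight `tau` is a hypothetical value, not one taken from the paper.

```python
def map_adapt_mean(prior_mean, frames, tau=10.0):
    """MAP estimate of a Gaussian mean.

    prior_mean: mean vector of the existing (e.g. multilingual) model
    frames:     adaptation vectors assigned to this Gaussian (hard assignment assumed)
    tau:        prior weight (hypothetical); larger values trust the prior more
    """
    n = len(frames)
    dim = len(prior_mean)
    # Per-dimension sum of the adaptation frames.
    sums = [sum(f[d] for f in frames) for d in range(dim)]
    # Weighted combination of prior mean and data; equals the prior when n == 0.
    return [(tau * prior_mean[d] + sums[d]) / (tau + n) for d in range(dim)]
```

With few adaptation frames the estimate stays close to the prior mean, and it moves toward the sample mean as more data arrives, which is why a small amount of new-language data can already help.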
 
SP13.4

   
Advances in Alpha Digit Recognition Using Syllables
J. Hamaker, A. Ganapathiraju, J. Picone  (Mississippi State University, USA);   J. Godfrey  (PSL, Texas Instruments Inc., USA)
In this paper, we present a set of experiments which explore the use of syllables for recognition of continuous alphadigit utterances. In this system, syllables are used as the primary unit of recognition. This work was motivated by our need to verify and isolate phenomena seen when performing syllable-based experiments on the SWITCHBOARD corpus. The performance of our base syllable system is better than a crossword triphone system while requiring a small portion of the resources necessary for triphone systems. All experiments were performed on the OGI Alphadigits corpus, which consists of telephone-bandwidth alphadigit strings. The WER of the best syllable system (context-independent syllables) reported here is 11.1% compared to 12.2% for a crossword triphone system.
 
SP13.5

   
LVCSR Rescoring with Modified Loss Functions: A Decision Theoretic Perspective
V. Goel, W. Byrne, S. Khudanpur  (The Johns Hopkins University, USA)
The problem of speech decoding is considered here in a decision-theoretic framework, and a modified speech decoding procedure that minimizes the expected risk under a general loss function is formulated. A specific word error rate loss function is considered and an implementation in an N-best list rescoring procedure is presented. Methods for estimation of the parameters of the resulting decision rule are provided for both supervised and unsupervised training. Preliminary experiments on an LVCSR task show small but statistically significant error rate improvements.
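A minimal sketch of minimum-expected-risk rescoring over an N-best list, under stated assumptions: the loss is word-level edit distance, and the posterior is obtained by normalizing the recognizer's log scores over the list. The paper's parameterized decision rule and its training methods are not reproduced here.

```python
import math

def word_edit_distance(ref, hyp):
    """Levenshtein distance between two word lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (r != h)))   # substitution or match
        prev = cur
    return prev[-1]

def mbr_rescore(nbest):
    """nbest: list of (word_list, log_score). Returns (best_words, expected_loss)."""
    # Normalize log scores into a posterior over the list (assumption).
    top = max(score for _, score in nbest)
    weights = [math.exp(score - top) for _, score in nbest]
    total = sum(weights)
    posterior = [w / total for w in weights]
    best, best_risk = None, float("inf")
    for hyp, _ in nbest:
        # Expected word error of choosing `hyp`, treating each list entry as the truth in turn.
        risk = sum(p * word_edit_distance(ref, hyp)
                   for (ref, _), p in zip(nbest, posterior))
        if risk < best_risk:
            best, best_risk = hyp, risk
    return best, best_risk
```

The selected hypothesis need not be the top-scoring one: a lower-scoring hypothesis that is close to many other likely hypotheses can have a smaller expected word error.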
 
SP13.6

   
Boosting Long-Term Adaptation of Hidden-Markov-Models: Incremental Splitting of Probability Density Functions
U. Bub, H. Hoege  (Siemens AG, Germany)
The research described in this paper focuses on possibilities to avoid the tedious training of Hidden-Markov-Models when setting up a new recognition task. A major speaker-independent cause for the decrease of recognition accuracy is a mismatch of the phonetic contexts between training and testing data. To overcome this problem, we introduced in previous work the idea of an update of task-independent acoustic models by means of Bayesian learning. In this paper we introduce the new approach of adaptively splitting the probability density functions (pdfs) of a continuous density HMM. The goal is to model the appropriate state pdfs better so that they can more accurately match new contexts that are observed while the system is in service. Combining splitting with Bayesian adaptation yields a remarkable reduction of word error rate compared to Bayesian adaptation only.
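The abstract does not state how or when a pdf is split. Purely as an illustration, the sketch below uses a common heuristic for diagonal-covariance Gaussian mixtures: the chosen component is replaced by two copies with half the weight each and means shifted by a fraction of the standard deviation. The perturbation factor `eps`, the occupancy threshold `min_count`, and the "split the busiest component" criterion are all assumptions, not the authors' method.

```python
import math

def split_component(weight, mean, var, eps=0.2):
    """Split one diagonal Gaussian into two with perturbed means (heuristic)."""
    std = [math.sqrt(v) for v in var]
    low = [m - eps * s for m, s in zip(mean, std)]
    high = [m + eps * s for m, s in zip(mean, std)]
    # Each half keeps the original variance and takes half the weight.
    return [(weight / 2, low, list(var)), (weight / 2, high, list(var))]

def split_busiest(mixture, counts, min_count=100.0):
    """mixture: list of (weight, mean, var); counts: frames observed per component.

    Splits the component that has seen the most in-service data, provided it
    has seen at least `min_count` frames (hypothetical criterion).
    """
    k = max(range(len(mixture)), key=lambda i: counts[i])
    if counts[k] < min_count:
        return list(mixture)
    return mixture[:k] + split_component(*mixture[k]) + mixture[k + 1:]
```

After such a split, the two new components could be refined separately by Bayesian adaptation as further data is observed; the mixture weights still sum to one.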
 
SP13.7

   
Improvements in Children's Speech Recognition Performance
S. Das, D. Nix, M. Picheny  (IBM T.J. Watson Research Center, USA)
There are several reasons why conventional speech recognition systems modeled on adult data fail to perform satisfactorily on children's speech input. For instance, children's vocal characteristics differ significantly from those of adults. In addition, their choices of vocabulary and sentence construction modalities usually do not follow adult patterns. We describe comparative studies demonstrating the performance gain realized by adapting to children's acoustic and language model data to construct a children's speech recognition system.
 
SP13.8

   
Speaker Normalized Acoustic Modeling Based on 3-D Viterbi Decoding
T. Fukada, Y. Sagisaka  (ATR-ITL, Japan)
This paper describes a novel method for speaker normalization based on a frequency warping approach to reduce variations due to speaker-induced factors such as the vocal tract length. In our approach, a speaker normalized acoustic model is trained using time-varying (i.e., state, phoneme or word dependent) warping factors, while in the conventional approaches, the frequency warping factor is fixed for each speaker. These time-varying frequency warping factors are determined by a 3-dimensional (i.e., input frames, HMM states and warping factors) Viterbi decoding procedure. Experimental results on Japanese spontaneous speech recognition show that the proposed method yields a 9.7% improvement in speech recognition accuracy compared to the conventional speaker-independent model.
 
SP13.9

   
Adaptive Heterodyne Filters (AHF) for Detection and Attenuation of Narrow Band Signals
K. Nelson, M. Soderstrand  (University of California, USA)
A fixed filter may be converted into an adaptive filter with a single adaptation parameter through the use of a new Adaptive Heterodyne Filter (AHF) concept, in which the frequency of the heterodyne signal is adjusted, thereby translating the entire filter transfer function in frequency. If the fixed filter is selected to be a very narrow-band band-pass filter, the new AHF concept can be used very effectively in the elimination of narrow-band interference in wide-band communications or control systems. A specific example of the removal of a slow-moving time-varying mechanical resonance from the control signal for a flight control system demonstrates the power of the new AHF concept.
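The frequency-translation idea can be illustrated with a simplified sketch. It assumes a complex heterodyne signal and a fixed moving-average low-pass prototype, so that shifting by `omega0` yields a band-pass response centred on `omega0`. Both choices are assumptions made for brevity; the abstract does not describe the actual AHF structure or the rule used to adapt the heterodyne frequency.

```python
import cmath
import math

def heterodyne_filter(x, taps, omega0):
    """Apply a fixed FIR filter whose response is translated by omega0 (rad/sample).

    x:      input samples (real or complex)
    taps:   coefficients of the fixed prototype filter
    omega0: heterodyne frequency, the single adjustable parameter
    """
    # Shift the input down so that omega0 lands at DC.
    down = [xn * cmath.exp(-1j * omega0 * n) for n, xn in enumerate(x)]
    # Run the fixed FIR filter (direct-form convolution, zero initial state).
    filtered = []
    for n in range(len(down)):
        acc = 0j
        for k, h in enumerate(taps):
            if n - k >= 0:
                acc += h * down[n - k]
        filtered.append(acc)
    # Shift back up so the output sits at the original frequency.
    return [fn * cmath.exp(1j * omega0 * n) for n, fn in enumerate(filtered)]
```

Changing `omega0` moves the pass band without redesigning `taps`. Subtracting the band-pass output from the input would give a notch at `omega0`, which is one way such a structure could attenuate a narrow-band interferer.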
 
SP13.10

   
Online Tool Wear Monitoring in Turning Using Time-Delay Neural Networks
B. Sick  (University of Passau, Germany)
Wear monitoring systems often use neural networks for sensor fusion with multiple input patterns. Systems for continuous online supervision of wear have to process pattern sequences. Therefore, recurrent neural networks have been investigated in the past. However, in most cases where only noisy input or even noisy output patterns are available for supervised learning, success is not forthcoming. That is, recurrent networks do not perform noticeably better than non-recurrent networks, such as multilayer perceptrons, that process only the current input pattern. This paper demonstrates on the basis of an application example (online tool wear monitoring in turning) that results can be improved significantly with special non-recurrent networks. This approach uses time-delay neural networks, which take into account the position of a single pattern in a pattern sequence by means of delay elements at the synapses. In the application example, the average error in the estimation of a characteristic wear parameter could be reduced by about 24% compared with multilayer perceptrons.
 
