ICASSP '98 Main Page

Abstract - SP13

SP13.1	Improved Phone Recognition Using Bayesian Triphone Models
J. Ming, F. Smith (Queen's University of Belfast, N. Ireland)
A crucial issue in triphone-based continuous speech recognition is the large number of models to be estimated against the limited availability of training data. This problem can be relieved by composing a triphone model from less context-dependent models. This paper introduces a new statistical framework, derived from the Bayesian principle, to perform such a composition. The potential power of this new framework is explored, both algorithmically and experimentally, by an implementation with hidden Markov modeling techniques. This implementation is applied to the recognition of the 39-phone set on the TIMIT database. The new model achieves 74.4% and 75.6% accuracy on the core and complete test sets, respectively.
SP13.2	Multilingual Phone Recognition of Telephone Spontaneous Speech
C. Corredor-Ardoy, L. Lamel, M. Adda-Decker, J. Gauvain (LIMSI-CNRS, Orsay, France)
In this paper we report on experiments with phone recognition of spontaneous telephone speech. Phone recognizers were trained and assessed on IDEAL, a multilingual corpus containing telephone speech in French, British English, German and Castilian Spanish. We investigated the influence of the training material composition (size and linguistic content) on the recognition performance using context-independent (CI) Hidden Markov Models and phonotactic bigram models. We found that when testing on spontaneous speech data, using only spontaneous speech training data gave the highest phone accuracies for the four languages, even though this data comprises only 14% of the available training data. The use of context-dependent (CD) HMMs reduced the phone error across the four languages, with the average error reduced to 51.9% from the 57.4% obtained with CI models. We suggest a straightforward way of detecting non-speech phenomena. The basic idea is to remove sequences of consonants between two silence labels from the recognized phone strings prior to scoring. This simple technique reduces the average phone error rate by 5.4% relative. The lowest phone error with CD models and filtering was obtained for Spanish (39.1%), with the four-language average being 49.1%.
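The filtering step described in SP13.2 (dropping consonant-only sequences that sit between two silence labels) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the label names in `SILENCE` and `VOWELS`, and the function names, are hypothetical, and the abstract does not say how adjacent silence labels are treated afterwards, so this sketch simply keeps them.

```python
# Hypothetical label sets, chosen only for illustration; a real system
# would use the phone inventory of its own recognizer.
SILENCE = {"sil"}
VOWELS = {"a", "e", "i", "o", "u"}


def is_consonant(phone):
    """Treat every label that is neither silence nor a vowel as a consonant."""
    return phone not in SILENCE and phone not in VOWELS


def drop_consonant_runs(phones):
    """Remove runs consisting only of consonants enclosed by two silence labels."""
    out = []
    i = 0
    n = len(phones)
    while i < n:
        out.append(phones[i])
        if phones[i] in SILENCE:
            # Scan the run of consonants that follows this silence label.
            j = i + 1
            while j < n and is_consonant(phones[j]):
                j += 1
            # Drop the run only if it is non-empty and closed by another silence.
            if j > i + 1 and j < n and phones[j] in SILENCE:
                i = j  # the closing silence is appended on the next iteration
                continue
        i += 1
    return out


recognized = ["sil", "t", "k", "sil", "a", "t", "sil"]
filtered = drop_consonant_runs(recognized)
print(filtered)
```

In the example, the run `t k` is removed because it contains no vowel and is bounded by silence on both sides, while `a t` is kept because it contains a vowel.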
SP13.3	Language Adaptation of Multilingual Phone Models for Vocabulary Independent Speech Recognition Tasks
J. Koehler (Siemens AG, Germany)
This paper presents our new results on multilingual phone modeling and adaptation to a new target language which is not included in the trained multilingual models. The experiments were carried out with the SpeechDat(M) and MacroPhone databases, which include French, German, Italian, Portuguese, Spanish and American English. First, we constructed language-dependent and multilingual phone models. The recognition rate for an isolated word task decreased on average by only 3.2% when using 95 multilingual models instead of 232 language-dependent models. Second, we investigated adaptation techniques for cross-language transfer and showed that only 100 utterances from a new language were needed for adaptation. Using the MAP algorithm, the recognition rate was improved from 79.9% to 84.3%. Finally, we defined a phonetic-based dissimilarity measure between two languages and compared language-dependent and multilingual models for the purpose of cross-language transfer.
SP13.4	Advances in Alpha Digit Recognition Using Syllables
J. Hamaker, A. Ganapathiraju, J. Picone (Mississippi State University, USA); J. Godfrey (PSL, Texas Instruments Inc., USA)
In this paper, we present a set of experiments which explore the use of syllables for recognition of continuous alphadigit utterances. In this system, syllables are used as the primary unit of recognition. This work was motivated by our need to verify and isolate phenomena seen when performing syllable-based experiments on the SWITCHBOARD corpus. The performance of our base syllable system is better than a crossword triphone system while requiring a small portion of the resources necessary for triphone systems. All experiments were performed on the OGI Alphadigits corpus, which consists of telephone-bandwidth alphadigit strings. The WER of the best syllable system (context-independent syllables) reported here is 11.1% compared to 12.2% for a crossword triphone system.
SP13.5	LVCSR Rescoring with Modified Loss Functions: A Decision Theoretic Perspective
V. Goel, W. Byrne, S. Khudanpur (The Johns Hopkins University, USA)
The problem of speech decoding is considered here in a decision-theoretic framework, and a modified speech decoding procedure that minimizes the expected risk under a general loss function is formulated. A specific word error rate loss function is considered and an implementation in an N-best list rescoring procedure is presented. Methods for estimation of the parameters of the resulting decision rule are provided for both supervised and unsupervised training. Preliminary experiments on an LVCSR task show a small but statistically significant error rate improvement.
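As a rough illustration of the idea in SP13.5, the sketch below picks the hypothesis from an N-best list that minimizes expected word-level edit distance, using the normalized hypothesis scores as a stand-in for the posterior distribution. This is an assumption-laden sketch and not the authors' procedure: the function names, the use of plain Levenshtein distance as the loss, and the toy scores are all hypothetical, and the parameter estimation methods mentioned in the abstract are not modeled.

```python
import math


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two word lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        cur = [i]
        for j, h in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def posteriors(log_scores):
    """Normalize log scores over the N-best list into probabilities."""
    m = max(log_scores)
    weights = [math.exp(s - m) for s in log_scores]
    total = sum(weights)
    return [w / total for w in weights]


def min_risk_hypothesis(nbest):
    """nbest: list of (word_list, log_score). Returns (hypothesis, expected loss)."""
    post = posteriors([score for _, score in nbest])
    best, best_risk = None, float("inf")
    for cand, _ in nbest:
        # Expected loss of choosing `cand` if the truth is distributed as `post`.
        risk = sum(p * edit_distance(ref, cand)
                   for (ref, _), p in zip(nbest, post))
        if risk < best_risk:
            best, best_risk = cand, risk
    return best, best_risk


# Toy N-best list with made-up scores.
nbest = [("a b c".split(), math.log(0.4)),
         ("a b d".split(), math.log(0.3)),
         ("a x d".split(), math.log(0.3))]
best, risk = min_risk_hypothesis(nbest)
print(best, round(risk, 3))
```

In this toy list the highest-scoring hypothesis is `a b c`, but `a b d` has the lower expected word error (0.7 against 0.9), so the minimum-risk rule selects it instead.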
SP13.6	Boosting Long-Term Adaptation of Hidden-Markov-Models: Incremental Splitting of Probability Density Functions
U. Bub, H. Hoege (Siemens AG, Germany)
The research described in this paper focuses on ways to avoid the tedious training of Hidden-Markov-Models when setting up a new recognition task. A major speaker-independent cause for the decrease of recognition accuracy is a mismatch of the phonetic contexts between training and testing data. To overcome this problem, we introduced in previous work the idea of updating task-independent acoustic models by means of Bayesian learning. In this paper we introduce the new approach of adaptively splitting the probability density functions (pdfs) of a continuous density HMM. The goal is to model the appropriate state pdfs better so that they can more accurately match new contexts that are observed while the system is in service. Combining splitting with Bayesian adaptation yields a remarkable reduction of word error rate compared to Bayesian adaptation alone.
SP13.7	Improvements in Children's Speech Recognition Performance
S. Das, D. Nix, M. Picheny (IBM T.J. Watson Research Center, USA)
There are several reasons why conventional speech recognition systems modeled on adult data fail to perform satisfactorily on children's speech input. For instance, children's vocal characteristics differ significantly from those of adults. In addition, their choices of vocabulary and sentence construction modalities usually do not follow adult patterns. We describe comparative studies demonstrating the performance gain realized by adapting to children's acoustic and language model data to construct a children's speech recognition system.
SP13.8	Speaker Normalized Acoustic Modeling Based on 3-D Viterbi Decoding
T. Fukada, Y. Sagisaka (ATR-ITL, Japan)
This paper describes a novel method for speaker normalization based on a frequency warping approach to reduce variations due to speaker-induced factors such as the vocal tract length. In our approach, a speaker normalized acoustic model is trained using time-varying (i.e., state, phoneme or word dependent) warping factors, while in the conventional approaches, the frequency warping factor is fixed for each speaker. These time-varying frequency warping factors are determined by a 3-dimensional (i.e., input frames, HMM states and warping factors) Viterbi decoding procedure. Experimental results on Japanese spontaneous speech recognition show that the proposed method yields a 9.7% improvement in speech recognition accuracy compared to the conventional speaker-independent model.
SP13.9	Adaptive Heterodyne Filters (AHF) for Detection and Attenuation of Narrow Band Signals
K. Nelson, M. Soderstrand (University of California, USA)
A fixed filter may be converted into an adaptive filter with a single adaptation parameter through the use of a new Adaptive Heterodyne Filter (AHF) concept, in which the frequency of the heterodyne signal is adjusted, thereby translating the entire filter transfer function in frequency. If the fixed filter is selected to be a very narrow-band band-pass filter, the new AHF concept can be used very effectively to eliminate narrow-band interference in wide-band communications or control systems. A specific example, the removal of a slow-moving time-varying mechanical resonance from the control signal for a flight control system, demonstrates the power of the new AHF concept.
SP13.10	Online Tool Wear Monitoring in Turning Using Time-Delay Neural Networks
B. Sick (University of Passau, Germany)
Wear monitoring systems often use neural networks for sensor fusion with multiple input patterns. Systems for continuous online supervision of wear have to process pattern sequences. Therefore, recurrent neural networks have been investigated in the past. However, in most cases where only noisy input or even noisy output patterns are available for supervised learning, success is not forthcoming. That is, recurrent networks do not perform noticeably better than non-recurrent networks, such as multilayer perceptrons, that process only the current input pattern. This paper demonstrates on the basis of an application example (online tool wear monitoring in turning) that results can be improved significantly with special non-recurrent networks. This approach uses time-delay neural networks, which consider the position of a single pattern in a pattern sequence by means of delay elements at the synapses. In this application example, the average error in the estimation of a characteristic wear parameter could be reduced by about 24% compared with multilayer perceptrons.
