ICASSP '98 Technical Program

Abstract - SP24


 
SP24.1

   
Fast Robust Inverse Transform Speaker Adapted Training Using Diagonal Transformations
H. Jin  (BBN Technologies, USA);   S. Matsoukas  (Northeastern University, USA);   R. Schwartz, F. Kubala  (BBN Technologies, USA)
We present a new method of Speaker Adapted Training (SAT) that is more robust and faster, and results in a lower error rate than previous methods. The method, called Inverse Transform SAT (ITSAT), is based on removing the differences between speakers before training, rather than modeling the differences during training. We develop several methods to avoid the problems associated with inverting the transformation. In one method, we interpolate the transformation matrix with an identity or diagonal transformation. We also apply constraints to the matrix to avoid estimation problems. Finally, we show that the resulting method is much faster, requires much less disk space, and results in higher accuracy than the original SAT method.
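The interpolation idea in the abstract can be sketched numerically: blend the estimated speaker transform with the identity so the matrix stays invertible, then map features back through the inverse. A minimal sketch; the names `interpolate_transform`, `inverse_transform_features`, and the weight `alpha` are illustrations, not from the paper:

```python
import numpy as np

def interpolate_transform(W, alpha):
    """Blend an estimated speaker transform W with the identity.

    alpha close to 1 trusts the estimated transform; smaller alpha
    pulls it toward the identity so the inverse stays well conditioned.
    (Hypothetical helper; the paper's exact interpolation may differ.)
    """
    return alpha * W + (1.0 - alpha) * np.eye(W.shape[0])

def inverse_transform_features(W, X):
    """Map speaker-dependent feature rows X back toward the canonical
    space by applying the inverse of the (interpolated) transform."""
    return X @ np.linalg.inv(W).T
```

Interpolating with a diagonal transform instead of the identity works the same way: replace `np.eye` with the diagonal estimate.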
 
SP24.2

   
Instantaneous Environment Adaptation Techniques Based on Fast PMC and MAP-CMS Methods
T. Kosaka, H. Yamamoto, M. Yamada, Y. Komori  (Canon Inc., Japan)
This paper proposes instantaneous environment adaptation techniques for both additive noise and channel distortion, based on the fast PMC (FPMC) and MAP-CMS methods. The instantaneous adaptation techniques enable a recognizer to improve recognition in real time on the single sentence that is used for the adaptation. The key innovations enabling the system to achieve the instantaneous adaptation are: 1) a cepstral mean subtraction method based on maximum a posteriori estimation (MAP-CMS), 2) a real-time implementation of the fast PMC that we proposed previously, 3) utilization of multi-pass search, and 4) a new method of combining MAP-CMS and FPMC to address channel distortion and additive noise jointly. Experimental results showed that the proposed methods enabled the system to perform recognition and adaptation simultaneously in nearly real time and yielded good improvements in performance.
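The MAP flavor of cepstral mean subtraction can be illustrated as a conjugate-Gaussian update: the channel mean is a convex combination of a prior mean and the utterance's sample mean. This is a sketch of the general idea; `tau` is an assumed prior-strength hyperparameter and the paper's exact estimator may differ:

```python
import numpy as np

def map_cepstral_mean(frames, prior_mean, tau):
    """MAP estimate of the channel (cepstral) mean: a convex combination
    of a prior mean and the utterance sample mean, weighted by the
    assumed prior strength tau."""
    n = len(frames)
    return (tau * prior_mean + n * np.mean(frames, axis=0)) / (tau + n)

def map_cms(frames, prior_mean, tau):
    """Subtract the MAP mean estimate from every frame (MAP-CMS)."""
    return frames - map_cepstral_mean(frames, prior_mean, tau)
```

With only a few frames the estimate stays close to the prior, which is what makes single-sentence adaptation stable.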
 
SP24.3

   
Unsupervised Adaptation Using Structural Bayes Approach
K. Shinoda  (NEC Corporation, Japan);   C. Lee  (Bell Labs, Lucent Technologies, USA)
It is well-known that the performance of recognition systems is often severely degraded when there is a mismatch between the training and testing environments. It is desirable to compensate for the mismatch while the system is in operation, without any supervised learning. Recently, a structural maximum a posteriori (SMAP) adaptation approach, in which a hierarchical structure in the parameter space is assumed, was proposed. In this paper, this SMAP method is applied to unsupervised adaptation. A novel normalization technique is also introduced as a front end for the adaptation process. The recognition results showed that the proposed method was effective even when only one utterance from a new speaker was used for adaptation. Furthermore, an effective way to combine supervised and unsupervised adaptation was investigated to reduce the need for a large amount of supervised learning data.
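The structural prior can be sketched as MAP smoothing along a root-to-leaf path of the parameter tree: each node shrinks its sample mean toward its parent's smoothed estimate, so sparsely observed leaves fall back on broader classes. A 1-D sketch with an assumed prior weight `tau`, not the paper's exact SMAP formulation:

```python
def smap_mean(path_means, path_counts, tau):
    """MAP smoothing along a root-to-leaf path of a parameter tree.

    path_means/path_counts are ordered root -> leaf; at each level the
    local sample mean is shrunk toward the parent's smoothed estimate.
    tau is an assumed prior weight (1-D illustrative sketch).
    """
    est = path_means[0]                      # root mean acts as the top prior
    for mean, count in zip(path_means[1:], path_counts[1:]):
        est = (tau * est + count * mean) / (tau + count)
    return est
```

A leaf with no adaptation data simply inherits its parent's estimate, which is how one utterance can still adapt every model.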
 
SP24.4

   
A Study on Speaker Normalization Using Vocal Tract Normalization and Speaker Adaptive Training
L. Welling  (University of Technology, Aachen, Germany);   R. Haeb-Umbach, X. Aubert  (Philips GmbH Forschungslaboratorien, Germany);   N. Haberland  (University of Technology, Aachen, Germany)
Although speaker normalization is attempted in very different manners, vocal tract normalization (VTN) and speaker adaptive training (SAT) share many common properties. We show that both lead to more compact representations of the phonetically relevant variations of the training data, and that both improve the error rate only if a complementary normalization or adaptation operation is conducted on the test data. Algorithms for fast test-speaker enrollment are presented for both normalization methods: in the framework of SAT, a pre-transformation step is proposed which alone, i.e. without subsequent unsupervised MLLR adaptation, reduces the error rate by almost 10% on the WSJ 5k test sets. For VTN, the use of a Gaussian mixture model makes a first recognition pass to obtain a preliminary transcription of the test utterance obsolete, with hardly any loss in performance.
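The Gaussian-mixture shortcut for VTN can be sketched as a grid search: score each candidate warp factor's warped features under a GMM and keep the best, with no preliminary transcription needed. The interfaces below (`warp_fn`, and the GMM as a means/variances/weights triple) are assumptions for illustration:

```python
import numpy as np

def gmm_loglik(X, means, variances, weights):
    """Total log-likelihood of frames X under a diagonal-covariance GMM,
    a stand-in for the abstract's Gaussian mixture model."""
    ll = 0.0
    for x in X:
        comp = weights * np.exp(-0.5 * np.sum((x - means) ** 2 / variances,
                                              axis=1))
        comp /= np.sqrt(np.prod(2.0 * np.pi * variances, axis=1))
        ll += np.log(np.sum(comp))
    return ll

def select_warp(frames, warp_fn, alphas, gmm):
    """Pick the warp factor whose warped features score best under the
    GMM; warp_fn(frames, alpha) and the gmm triple are assumed APIs."""
    scores = [gmm_loglik(warp_fn(frames, a), *gmm) for a in alphas]
    return alphas[int(np.argmax(scores))]
```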
 
SP24.5

   
Decision Tree State Tying Based on Segmental Clustering for Acoustic Modeling
W. Reichl, W. Chou  (Bell Labs, USA)
In this paper, a fast segmental clustering approach to decision-tree-tying-based acoustic modeling is proposed for large vocabulary speech recognition. It is based on a two-level clustering scheme for robust decision tree state clustering. This approach extends the conventional segmental K-means approach to phonetic decision tree state tying based acoustic modeling. It achieves high recognition performance while reducing the model training time from days to hours compared to approaches based on Baum-Welch training. Experimental results on the standard Resource Management and Wall Street Journal tasks are presented which demonstrate the robustness and efficacy of this approach.
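The node score commonly used in decision tree state clustering is the log-likelihood of the frames pooled at a node under a single Gaussian; a question's worth is the gain from splitting. A generic diagonal-covariance sketch of that criterion, not the paper's specific two-level segmental scheme:

```python
import numpy as np

def node_loglik(frames):
    """Log-likelihood of the frames pooled at a tree node under a single
    diagonal Gaussian fit to those frames (the usual node score in
    decision tree state clustering)."""
    n, d = frames.shape
    var = np.var(frames, axis=0)
    return -0.5 * n * (d * np.log(2.0 * np.pi) + d + np.sum(np.log(var)))

def split_gain(left, right):
    """Likelihood gain of splitting a node by a phonetic question."""
    parent = np.vstack([left, right])
    return node_loglik(left) + node_loglik(right) - node_loglik(parent)
```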
 
SP24.6

   
Automatic Question Generation for Decision Tree Based State Tying
K. Beulen, H. Ney  (RWTH Aachen, University of Technology, Germany)
Decision tree based state tying uses so-called phonetic questions to assign triphone states to reasonable acoustic models. These phonetic questions are in fact phonetic categories such as vowels, plosives or fricatives. The assumption behind this is that context phonemes which belong to the same phonetic class have a similar influence on the pronunciation of a phoneme. For a new phoneme set, which has to be used e.g. when switching to a different corpus, a phonetic expert is needed to define proper phonetic questions. In this paper a new method is presented which automatically defines good phonetic questions for a phoneme set. This method uses the intermediate clusters from a phoneme clustering algorithm, which are reduced to an appropriate number afterwards. Recognition results on the Wall Street Journal data for within-word and across-word phoneme models show that the automatically generated questions perform competitively with our best handcrafted question set.
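The core idea, keeping the intermediate clusters of a bottom-up phoneme clustering as candidate questions, can be sketched generically. The `distance` callable between clusters is an assumed stand-in for the acoustic similarity measure, and the paper's final reduction step is omitted:

```python
def generate_questions(phones, distance):
    """Bottom-up clustering of the phoneme set; every intermediate
    cluster is recorded as a candidate phonetic question."""
    clusters = [frozenset([p]) for p in phones]
    questions = list(clusters)
    while len(clusters) > 1:
        # merge the closest pair of clusters
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: distance(clusters[ij[0]], clusters[ij[1]]))
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
        questions.append(merged)
    return questions
```

With P phonemes this yields 2P - 1 candidate questions, one per node of the cluster tree.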
 
SP24.7

   
Scaled Random Segmental Models
J. Goldberger, D. Burshtein  (Tel Aviv University, Israel)
We present the concept of a scaled random segmental model, which aims to overcome the modeling problem created by the fact that segment realizations of the same phonetic unit differ in length. In the scaled model, the variance of the random mean trajectory is inversely proportional to the segment length. The scaled model enables Baum-Welch-type parameter reestimation, unlike the previously suggested non-scaled models, which require more complicated iterative estimation procedures. In phoneme classification experiments we have conducted, the scaled model showed improved performance compared to the non-scaled model.
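The scaling can be illustrated with a 1-D conjugate-Gaussian update: dividing the trajectory prior's variance by the segment length n keeps the posterior over the segment mean in closed form, which is what makes Baum-Welch-style reestimation possible. This is an illustration of the scaling only, not the paper's full procedure:

```python
def scaled_posterior_mean(frames, prior_mean, base_prior_var, obs_var):
    """Posterior mean of a segment's random mean when the prior variance
    is base_prior_var / n for an n-frame segment, as in the scaled
    model (1-D conjugate-Gaussian sketch; all parameter names are
    illustrative)."""
    n = len(frames)
    prior_var = base_prior_var / n      # scaled: tighter for long segments
    precision = n / obs_var + 1.0 / prior_var
    return (sum(frames) / obs_var + prior_mean / prior_var) / precision
```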
 
SP24.8

   
Factorial HMMs for Acoustic Modeling
B. Logan  (University of Cambridge, UK);   P. Moreno  (Digital Equip. Corporation, USA)
Recently, several extensions of hidden Markov models (HMMs) have been proposed in the machine learning research field. In this paper we study their possibilities and potential benefits for the field of acoustic modeling. We describe preliminary experiments using an alternative modeling approach known as factorial hidden Markov models (FHMMs). We present these models as extensions of HMMs and detail a modification to the original formulation which seems to allow a more natural fit to speech. We present experimental results on the phonetically balanced TIMIT database comparing the performance of FHMMs with HMMs. We also study alternative feature representations that might be better suited to FHMMs.
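The factorial structure can be made concrete by expanding the Cartesian product of the chains' state spaces and running an ordinary forward pass over the product space. This is exponential in the number of chains, which is why FHMM work relies on approximate inference, but it shows the model shape. All interfaces below are assumptions for illustration:

```python
import numpy as np
from itertools import product

def fhmm_forward(pi_list, A_list, loglik_fn, T):
    """Exact forward pass for a factorial HMM via the product state
    space (tractable only for tiny models).

    pi_list/A_list hold each chain's initial and transition
    probabilities; loglik_fn(t, state_tuple) returns the log
    observation likelihood, which is what couples the chains."""
    states = list(product(*[range(len(pi)) for pi in pi_list]))
    logalpha = {s: sum(np.log(pi[si]) for pi, si in zip(pi_list, s))
                   + loglik_fn(0, s)
                for s in states}
    for t in range(1, T):
        new = {}
        for s in states:
            acc = [logalpha[p] + sum(np.log(A[pj][sj])
                                     for A, pj, sj in zip(A_list, p, s))
                   for p in states]
            new[s] = np.logaddexp.reduce(acc) + loglik_fn(t, s)
        logalpha = new
    return float(np.logaddexp.reduce(list(logalpha.values())))
```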
 
