Adaptation and Acoustic Modeling

Chair: Hermann Ney, RWTH-Aachen, Germany



Fast Robust Inverse Transform Speaker Adapted Training Using Diagonal Transformations

Authors:

Hubert Jin, BBN Technologies (U.S.A.)
Spyros Matsoukas, Northeastern University (U.S.A.)
Richard Schwartz, BBN Technologies (U.S.A.)
Francis Kubala, BBN Technologies (U.S.A.)

Volume 2, Page 785, Paper number 2533

Abstract:

We present a new method of Speaker Adapted Training (SAT) that is more robust and faster, and results in a lower error rate, than previous methods. The method, called Inverse Transform SAT (ITSAT), is based on removing the differences between speakers before training, rather than modeling the differences during training. We develop several methods to avoid the problems associated with inverting the transformation. In one method, we interpolate the transformation matrix with an identity or diagonal transformation. We also apply constraints to the matrix to avoid estimation problems. Finally, we show that the resulting method is much faster, requires much less disk space, and results in higher accuracy than the original SAT method.
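The interpolation step mentioned in the abstract can be sketched as follows. The function name, the weight `alpha`, and the use of NumPy are illustrative assumptions, not details from the paper; the idea is only that shrinking an estimated transform toward the identity (or toward its own diagonal) keeps it well-conditioned for inversion.

```python
import numpy as np

def smooth_transform(W, alpha=0.9, mode="identity"):
    """Interpolate an estimated speaker transform W with a simpler
    fallback so that it can be inverted robustly.

    mode="identity": shrink toward the identity matrix.
    mode="diagonal": shrink toward W's own diagonal.
    alpha is the weight on the estimated transform (a free parameter).
    """
    W = np.asarray(W, dtype=float)
    if mode == "identity":
        fallback = np.eye(W.shape[0])
    else:  # "diagonal"
        fallback = np.diag(np.diag(W))
    W_smooth = alpha * W + (1.0 - alpha) * fallback
    # The smoothed transform is inverted to map speaker-dependent
    # features back toward a speaker-independent space before training.
    return np.linalg.inv(W_smooth)
```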

ic982533.pdf (From Postscript)

TOP



Instantaneous Environment Adaptation Techniques Based on Fast PMC and MAP-CMS Methods

Authors:

Tetsuo Kosaka, Canon Inc. (Japan)
Hiroki Yamamoto, Canon Inc. (Japan)
Masayuki Yamada, Canon Inc. (Japan)
Yasuhiro Komori, Canon Inc. (Japan)

Volume 2, Page 789, Paper number 1454

Abstract:

This paper proposes instantaneous environment adaptation techniques for both additive noise and channel distortion, based on the fast PMC (FPMC) and MAP-CMS methods. These techniques enable a recognizer to adapt to, and improve recognition of, the single sentence used for adaptation, in real time. The key innovations that enable instantaneous adaptation are: 1) a cepstral mean subtraction method based on maximum a posteriori estimation (MAP-CMS); 2) a real-time implementation of the fast PMC that we proposed previously; 3) the use of a multi-pass search; and 4) a new method of combining MAP-CMS and FPMC to handle channel distortion and additive noise jointly. Experimental results showed that the proposed methods allow the system to perform recognition and adaptation simultaneously, nearly in real time, with good improvements in performance.
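The abstract does not give the MAP-CMS formulas, but the standard MAP estimate of a Gaussian mean conveys the idea: the cepstral mean of one utterance is shrunk toward a prior mean, with the prior dominating when few frames are available. A minimal sketch, with the names and the prior weight `tau` assumed:

```python
import numpy as np

def map_cepstral_mean(frames, prior_mean, tau=20.0):
    """MAP estimate of the cepstral mean of one utterance.

    frames:     (T, D) cepstral vectors of the current utterance
    prior_mean: (D,) mean learned from training data (the prior)
    tau:        prior weight in "equivalent frames" (a free parameter)

    With few frames the estimate stays close to the prior; as T grows
    it converges to the utterance's own sample mean.
    """
    frames = np.asarray(frames, dtype=float)
    T = frames.shape[0]
    sample_mean = frames.mean(axis=0)
    return (tau * prior_mean + T * sample_mean) / (tau + T)

def map_cms(frames, prior_mean, tau=20.0):
    """Subtract the MAP-smoothed mean to compensate channel distortion."""
    return frames - map_cepstral_mean(frames, prior_mean, tau)
```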

ic981454.pdf (From Postscript)




Unsupervised Adaptation Using Structural Bayes Approach

Authors:

Koichi Shinoda, NEC Corporation (Japan)
Chin-Hui Lee, Bell Labs, Lucent Technologies (U.S.A.)

Volume 2, Page 793, Paper number 1761

Abstract:

It is well known that the performance of recognition systems is often severely degraded when there is a mismatch between the training and testing environments. It is desirable to compensate for this mismatch while the system is in operation, without any supervised learning. Recently, a structural maximum a posteriori (SMAP) adaptation approach was proposed, in which a hierarchical structure in the parameter space is assumed. In this paper, the SMAP method is applied to unsupervised adaptation. A novel normalization technique is also introduced as a front end for the adaptation process. Recognition results showed that the proposed method was effective even when only one utterance from a new speaker was used for adaptation. Furthermore, an effective way to combine supervised and unsupervised adaptation was investigated, to reduce the need for large amounts of supervised learning data.

ic981761.pdf (Scanned)




A Study on Speaker Normalization Using Vocal Tract Normalization and Speaker Adaptive Training

Authors:

Lutz Welling, University of Technology, Aachen (Germany)
Reinhold Haeb-Umbach, Philips GmbH Forschungslaboratorien (Germany)
Xavier Aubert, Philips GmbH Forschungslaboratorien (Germany)
Nils Haberland, University of Technology, Aachen (Germany)

Volume 2, Page 797, Paper number 1978

Abstract:

Although speaker normalization is attempted in very different manners, vocal tract normalization (VTN) and speaker adaptive training (SAT) share many properties. We show that both lead to more compact representations of the phonetically relevant variation in the training data, and that both improve the error rate only if a complementary normalization or adaptation operation is applied to the test data. Algorithms for fast test-speaker enrollment are presented for both methods: in the SAT framework, a pre-transformation step is proposed which, on its own, i.e. without subsequent unsupervised MLLR adaptation, reduces the error rate by almost 10% on the WSJ 5k test sets. For VTN, the use of a Gaussian mixture model eliminates the need for a first recognition pass to obtain a preliminary transcription of the test utterance, at hardly any loss in performance.
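The GMM-based VTN enrollment might look as follows: each candidate warp factor is scored by the likelihood of its warped features under a speaker-independent GMM, so no preliminary transcription is required. The grid of warp factors, the feature-warping callback, and all names here are assumptions for illustration, not details from the paper.

```python
import numpy as np

def log_gmm_likelihood(X, weights, means, variances):
    """Total log-likelihood of the rows of X under a diagonal-covariance GMM."""
    X = X[:, None, :]                         # (T, 1, D)
    diff2 = (X - means) ** 2 / variances      # (T, M, D)
    logp = (np.log(weights)
            - 0.5 * np.sum(np.log(2 * np.pi * variances) + diff2, axis=-1))
    # log-sum-exp over mixture components, summed over frames
    m = logp.max(axis=1, keepdims=True)
    return float(np.sum(m[:, 0] + np.log(np.exp(logp - m).sum(axis=1))))

def select_warp_factor(extract_features, audio, gmm,
                       alphas=np.arange(0.88, 1.13, 0.02)):
    """Pick the VTN warp factor whose warped features best fit a
    speaker-independent GMM.  extract_features(audio, alpha) is assumed
    to warp the frequency axis by alpha before computing cepstra
    (a stand-in here -- in practice the warp is applied in the filterbank).
    """
    scores = {a: log_gmm_likelihood(extract_features(audio, a), *gmm)
              for a in alphas}
    return max(scores, key=scores.get)
```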

ic981978.pdf (From Postscript)




Decision Tree State Tying Based on Segmental Clustering for Acoustic Modeling

Authors:

Wolfgang Reichl, Bell Labs (U.S.A.)
Wu Chou, Bell Labs (U.S.A.)

Volume 2, Page 801, Paper number 2505

Abstract:

In this paper, a fast segmental clustering approach to decision-tree-based state tying for acoustic modeling is proposed for large vocabulary speech recognition. It is based on a two-level clustering scheme for robust decision tree state clustering. This approach extends the conventional segmental K-means approach to phonetic decision tree state tying. It achieves high recognition performance while reducing the model training time from days to hours compared to approaches based on Baum-Welch training. Experimental results on the standard Resource Management and Wall Street Journal tasks demonstrate the robustness and efficacy of this approach.

ic982505.pdf (From Postscript)




Automatic Question Generation for Decision Tree Based State Tying

Authors:

Klaus Beulen, RWTH Aachen, University of Technology (Germany)
Hermann Ney, RWTH Aachen, University of Technology (Germany)

Volume 2, Page 805, Paper number 2436

Abstract:

Decision tree based state tying uses so-called phonetic questions to assign triphone states to suitable acoustic models. These phonetic questions are in fact phonetic categories, such as vowels, plosives or fricatives. The underlying assumption is that context phonemes belonging to the same phonetic class have a similar influence on the pronunciation of a phoneme. For a new phoneme set, which has to be used e.g. when switching to a different corpus, a phonetic expert is normally needed to define proper phonetic questions. In this paper, a new method is presented which automatically derives good phonetic questions for a phoneme set. The method uses the intermediate clusters from a phoneme clustering algorithm, which are afterwards reduced to an appropriate number. Recognition results on the Wall Street Journal data, for within-word and across-word phoneme models, show that the automatically generated questions perform competitively with our best handcrafted question set.
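A minimal sketch of the question-generation idea, assuming a bottom-up (agglomerative) phoneme clustering with a simple centroid distance; the paper's actual clustering criterion and reduction step are not specified in the abstract. Every intermediate cluster formed on the way up becomes a candidate phonetic question (a set of context phonemes):

```python
import numpy as np

def generate_questions(phonemes, features, n_questions=8):
    """Bottom-up phoneme clustering; each intermediate cluster is kept
    as a candidate phonetic question.  features[p] is a vector
    characterizing phoneme p -- in the paper this would come from the
    acoustic models; here it is a stand-in.
    """
    clusters = [frozenset([p]) for p in phonemes]
    centroid = {c: np.asarray(features[next(iter(c))], float) for c in clusters}
    size = {c: 1 for c in clusters}
    questions = []
    while len(clusters) > 1:
        # merge the closest pair of clusters (Euclidean centroid distance)
        i, j = min(((i, j) for i in range(len(clusters))
                    for j in range(i + 1, len(clusters))),
                   key=lambda ij: np.linalg.norm(
                       centroid[clusters[ij[0]]] - centroid[clusters[ij[1]]]))
        a, b = clusters[i], clusters[j]
        merged = a | b
        centroid[merged] = (size[a] * centroid[a] + size[b] * centroid[b]) \
            / (size[a] + size[b])
        size[merged] = size[a] + size[b]
        clusters = [c for c in clusters if c not in (a, b)] + [merged]
        questions.append(set(merged))  # record the intermediate cluster
    # reduce to an appropriate number, e.g. keep the largest clusters
    return sorted(questions, key=len, reverse=True)[:n_questions]
```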

ic982436.pdf (From Postscript)




Scaled Random Segmental Models

Authors:

Jacob Goldberger, Tel Aviv University (Israel)
David Burshtein, Tel Aviv University (Israel)

Volume 2, Page 809, Paper number 1030

Abstract:

We present the concept of a scaled random segmental model, which aims to overcome the modeling problem created by the fact that segment realizations of the same phonetic unit differ in length. In the scaled model, the variance of the random mean trajectory is inversely proportional to the segment length. The scaled model permits a Baum-Welch-type parameter reestimation, unlike the previously suggested non-scaled models, which require more complicated iterative estimation procedures. In phoneme classification experiments, the scaled model showed improved performance compared to the non-scaled model.
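The scaling can be written out as follows (notation assumed, not taken from the paper: $\mu_0$ and $\Sigma_\mu$ are the prior mean and variance of the random segment mean, $L$ the segment length, $y_t$ the observed frames):

```latex
% Non-scaled segmental model: the prior on the random mean ignores L
\mu \sim \mathcal{N}(\mu_0,\, \Sigma_\mu), \qquad
y_t \mid \mu \sim \mathcal{N}(\mu,\, \Sigma), \quad t = 1, \dots, L

% Scaled model: the prior variance shrinks with the segment length
\mu \sim \mathcal{N}\!\left(\mu_0,\, \tfrac{1}{L}\,\Sigma_\mu\right)
```

Making the prior variance proportional to $1/L$ keeps long and short realizations of the same unit comparably weighted, which is what allows the closed-form, Baum-Welch-type reestimation the abstract refers to.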

ic981030.pdf (From Postscript)




Factorial HMMs for Acoustic Modeling

Authors:

Beth Logan, University of Cambridge (U.K.)
Pedro J Moreno, Digital Equipment Corporation (U.S.A.)

Volume 2, Page 813, Paper number 2453

Abstract:

Recently, several extensions of hidden Markov models (HMMs) have been proposed in the machine learning research field. In this paper, we study their possibilities and potential benefits for acoustic modeling. We describe preliminary experiments using an alternative modeling approach known as factorial hidden Markov models (FHMMs). We present these models as extensions of HMMs and detail a modification to the original formulation which seems to allow a more natural fit to speech. We present experimental results on the phonetically balanced TIMIT database comparing the performance of FHMMs with HMMs. We also study alternative feature representations that might be better suited to FHMMs.
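One way to see FHMMs as extensions of HMMs: with independent underlying chains, a factorial HMM is equivalent to an ordinary HMM on the Cartesian product of the chains' state spaces, with transition probabilities that factorize over the chains. A sketch of that flattening (all names illustrative; real FHMM inference exploits the factored structure instead of building this exponentially large matrix):

```python
import itertools
import numpy as np

def fhmm_to_hmm(A_list, pi_list):
    """Flatten a factorial HMM with independent chains into one HMM
    over the Cartesian product of the chains' states.

    A_list:  per-chain transition matrices
    pi_list: per-chain initial state distributions

    The composite transition probability is the product of the
    per-chain transition probabilities.
    """
    states = list(itertools.product(*[range(A.shape[0]) for A in A_list]))
    n = len(states)
    A = np.ones((n, n))
    pi = np.ones(n)
    for i, s in enumerate(states):
        pi[i] = np.prod([p[k] for p, k in zip(pi_list, s)])
        for j, t in enumerate(states):
            for Ak, sk, tk in zip(A_list, s, t):
                A[i, j] *= Ak[sk, tk]
    return states, A, pi
```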

ic982453.pdf (From Postscript)
