Tomoko Matsui, NTT (Japan)
Tatsuo Matsuoka, NTT (Japan)
Sadaoki Furui, NTT (Japan)
Smoothed estimation and utterance verification are introduced into the N-best-based speaker adaptation method. That method is effective even for speakers whose utterances are frequently misrecognized with speaker-independent (SI) models, that is, for speakers who truly need adaptation techniques. The smoothed estimation improves performance for such speakers, and the utterance verification reduces the required amount of computation. Performance evaluation using connected-digit (four-digit string) recognition experiments over actual telephone lines showed a 36.4% reduction in error rate for speakers whose utterances are frequently misrecognized with SI models. In search of an effective model transformation for speaker adaptation, we also discuss replacing mixture-mean bias estimation with the widely used mixture-mean linear-regression-matrix estimation.
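Schematically, the two model transformations contrasted in the last sentence act on an SI mixture mean as follows (a sketch in our own notation, not necessarily the paper's):

    \hat{\mu}_k = \mu_k + b          (mixture-mean bias)
    \hat{\mu}_k = A \mu_k + b        (linear-regression matrix)

where b is a bias vector shared by a cluster of mixture means, and (A, b) are a regression matrix and offset estimated from the adaptation data.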
Michael Schüßler, FORWISS Erlangen (Germany)
Florian Gallwitz, University of Erlangen (Germany)
Stefan Harbeck, University of Erlangen (Germany)
Speaker adaptation algorithms often require a rather large amount of adaptation data in order to estimate the new parameters reliably. In this paper, we investigate how adaptation can be performed in real-time applications with only a few seconds of speech from each user. We propose a modified Bayesian codebook reestimation that does not need the computationally intensive evaluation of normal densities and thus speeds up the adaptation considerably, e.g., by a factor of 18 for 24-dimensional feature vectors. We performed experiments in two real-time applications with very small amounts of adaptation data and achieved a word error reduction of up to 11%.
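A minimal sketch of this kind of speed-up, assuming that hard nearest-codeword assignment replaces the per-frame evaluation of normal densities (function and variable names are ours, not the paper's):

    import numpy as np

    def fast_bayesian_reestimate(means, prior_weights, frames):
        """MAP-style codebook mean update with hard codeword assignment.

        means:         (K, D) current codebook mean vectors
        prior_weights: (K,) weights tau_k > 0 controlling adaptation speed
        frames:        (N, D) adaptation feature vectors
        """
        K, D = means.shape
        sums = np.zeros((K, D))
        counts = np.zeros(K)
        for x in frames:
            # Nearest codeword by Euclidean distance replaces evaluating
            # K normal densities for every frame.
            k = int(np.argmin(np.sum((means - x) ** 2, axis=1)))
            sums[k] += x
            counts[k] += 1.0
        # Bayesian interpolation between prior and sample statistics;
        # codewords that received no frames keep their prior means.
        return (prior_weights[:, None] * means + sums) / \
               (prior_weights + counts)[:, None]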
Shigeru Homma, NTT HILab (Japan)
Kiyoaki Aikawa, NTT HILab (Japan)
Shigeki Sagayama, NTT HILab (Japan)
Unsupervised speaker adaptation plays an important role in "batch dictation," the aim of which is to automatically transcribe large amounts of recorded dictation using speech recognition. In unsupervised speaker adaptation, which uses recognition results of the target speech as supervision, erroneous recognition results degrade the quality of the adapted acoustic models. This paper presents a new supervision selection method in which the correctness of the first candidate is judged from the likelihood ratio between the first and second candidates. The method eliminates erroneous recognition results and the corresponding speech data from the adaptive training data. We implemented it in an iterative unsupervised speaker adaptation procedure. In a practical application, batch-style speech-to-text conversion of recorded dictations of Japanese medical diagnoses, recognition errors are reduced by 50% compared with speaker-independent recognition.
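A sketch of the selection rule described, assuming log likelihoods of the two best recognition candidates are available per utterance (names and the threshold are illustrative):

    def select_reliable_utterances(nbest, threshold):
        """Keep utterances whose first candidate is judged reliable.

        nbest: list of (utterance_id, [(hypothesis, log_likelihood), ...])
               with candidates sorted by decreasing likelihood
        """
        selected = []
        for utt_id, hyps in nbest:
            if len(hyps) < 2:
                continue
            (best, ll1), (_, ll2) = hyps[0], hyps[1]
            # A large likelihood ratio (margin in the log domain) between
            # the first and second candidates suggests the first is correct.
            if ll1 - ll2 >= threshold:
                selected.append((utt_id, best))
        return selected

Utterances failing the test are dropped from the adaptive training data along with their recognition results.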
Jen-Tzung Chien, National Tsing Hua University (Taiwan)
Hsiao-Chuan Wang, National Tsing Hua University (Taiwan)
Chin-Hui Lee, Bell Labs (U.S.A.)
We propose an improved maximum a posteriori (MAP) learning algorithm of continuous-density hidden Markov model (CDHMM) parameters for speaker adaptation. The algorithm is developed by sequentially combining three adaptation approaches. First, the clusters of speaker-independent HMM parameters are locally transformed through a group of transformation functions. Then, the transformed HMM parameters are globally smoothed via MAP adaptation. Within the MAP adaptation, the parameters of units unseen in the adaptation data are further adapted by employing the transfer vector interpolation scheme. Experiments show that the combined algorithm converges rapidly and outperforms the other adaptation methods tested.
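The global smoothing step can be pictured with the standard MAP update of a Gaussian mean (a textbook form, not necessarily the paper's exact estimator):

    \hat{\mu} = \frac{\tau \mu_0 + \sum_t \gamma_t x_t}{\tau + \sum_t \gamma_t}

where \mu_0 is the locally transformed prior mean, \gamma_t the occupation probability of frame x_t, and \tau a prior weight that balances the prior against the adaptation data.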
Venkatesh Nagesha, DSI (U.S.A.)
Larry Gillick, DSI (U.S.A.)
This paper studies the use of transformation-based speaker adaptation in improving the performance of large vocabulary continuous speech recognition systems. We present a formulation of the adaptation procedure that is simpler than existing methods. Our experiments demonstrate that speaker normalization continues to be important even after significant amounts of speaker adaptation. An automatic clustering algorithm is compared to human expertise in sorting output distributions into collections that share the same transformation. We quantify improvements over standard Bayesian (maximum a posteriori, MAP) adaptation in terms of (a) speed of adaptation and (b) robustness to transcription errors. Finally, we discuss the use of speaker transformations in the training process.
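As an illustration of what an automatic alternative to hand-sorted transformation classes can look like, here is a simple k-means grouping of output-distribution means (our own sketch, not the clustering algorithm evaluated in the paper):

    import numpy as np

    def cluster_output_distributions(means, n_classes, n_iter=20, seed=0):
        """Group Gaussian means into classes by k-means; all Gaussians in
        one class then share a single adaptation transformation.

        means: (G, D) array of Gaussian mean vectors
        """
        rng = np.random.default_rng(seed)
        centers = means[rng.choice(len(means), n_classes,
                                   replace=False)].copy()
        for _ in range(n_iter):
            # Assign each Gaussian to its nearest class center.
            labels = np.linalg.norm(
                means[:, None] - centers[None], axis=2).argmin(axis=1)
            # Recompute centers; an empty class keeps its old center.
            for k in range(n_classes):
                if np.any(labels == k):
                    centers[k] = means[labels == k].mean(axis=0)
        return labels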
Eric Thelen, Philips Research (Germany)
Xavier Aubert, Philips Research (Germany)
Peter Beyerlein, Philips Research (Germany)
The combination of maximum likelihood linear regression (MLLR) with maximum a posteriori (MAP) adaptation has been investigated both for the enrollment of a new speaker and for the asymptotic recognition rate after several hours of dictation. We show that a least mean square approach to MLLR is quite effective in conjunction with phonetically derived regression classes. Results are presented for both ARPA read-speech test sets and real-life dictation, and significant improvements are reported. While MLLR achieves a faster adaptation rate when only a small amount of data is available, MAP has desirable asymptotic properties, and the combination of both methods provides the best results. Both incremental and iterative batch modes are studied and compared to the performance of speaker-dependent training.
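One common way to combine the two estimators, consistent with the asymptotics noted above, is to use the MLLR-adapted mean as the prior of the MAP update (a schematic form, not necessarily the exact implementation in the paper):

    \hat{\mu} = \frac{\tau W\xi + \sum_t \gamma_t x_t}{\tau + \sum_t \gamma_t}

where W is the regression matrix of the class containing the Gaussian, \xi its extended SI mean vector, \gamma_t the frame occupation probabilities, and \tau a prior weight. With little data the MLLR prior dominates; as the counts grow, the estimate approaches the speaker-dependent sample mean.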
Puming Zhan, Interactive Systems Laboratories (U.S.A.)
Martin Westphal, Interactive Systems Laboratories (U.S.A.)
In speech recognition, the speaker dependence of a system stems from the speaker dependence of the speech features, and variation in vocal tract shape is the major source of inter-speaker feature variation, though other sources also contribute. In this paper, we address speaker normalization approaches that aim at normalizing the speaker's vocal tract length by frequency warping (FWP). The FWP is implemented in the front-end preprocessing of our speech recognition system. We investigate formant-based and ML-based FWP in linear and nonlinear warping modes and compare them in detail. All experimental results are based on our JANUS3 large vocabulary continuous speech recognition system and the Spanish Spontaneous Scheduling Task (SSST) database.
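A sketch of a piecewise-linear warp of the frequency axis of the kind used for vocal tract length normalization, bent near the band edge so that the Nyquist frequency maps onto itself (the breakpoint and conventions are illustrative assumptions, not the paper's):

    import numpy as np

    def warp_frequency_axis(f, alpha, f_nyquist):
        """Piecewise-linear frequency warp for vocal tract normalization.

        f:         frequencies in Hz (scalar or array)
        alpha:     speaker-specific warp factor (around 1.0)
        f_nyquist: half the sampling rate
        """
        f = np.asarray(f, dtype=float)
        f_break = 0.8 * f_nyquist  # illustrative breakpoint
        return np.where(
            f <= f_break,
            alpha * f,
            alpha * f_break + (f_nyquist - alpha * f_break)
                            * (f - f_break) / (f_nyquist - f_break),
        )

The factor alpha can be chosen per speaker from formant estimates (formant-based FWP) or by maximizing the likelihood of the warped features under the current models (ML-based FWP).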
Tasos Anastasakos, BBN Corporation (U.S.A.)
John W. McDonough, BBN Corporation (U.S.A.)
John Makhoul, BBN Corporation (U.S.A.)
This paper describes the speaker adaptive training (SAT) approach for speaker independent (SI) speech recognizers as a method for joint speaker normalization and estimation of the parameters of the SI acoustic models. In SAT, speaker characteristics are modeled explicitly as linear transformations of the SI acoustic parameters. The effect of inter-speaker variability in the training data is reduced, leading to parsimonious acoustic models that represent more accurately the phonetically relevant information of the speech signal. The proposed training method is applied to the Wall Street Journal (WSJ) corpus that consists of multiple training speakers. Experimental results in the context of batch supervised adaptation demonstrate the effectiveness of the proposed method in large vocabulary speech recognition tasks and show that significant reductions in word error rate can be achieved over the common pooled speaker-independent paradigm.
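Schematically, SAT jointly estimates the compact SI model and one transform per training speaker (our notation):

    (\hat{\lambda}, \{\hat{W}_s\}) = \arg\max_{\lambda, \{W_s\}} \prod_s p(X_s \mid W_s(\lambda))

where \lambda denotes the SI acoustic parameters, X_s the training data of speaker s, and W_s a linear transformation of the SI parameters, so that inter-speaker variability is absorbed by the W_s rather than by the SI model itself.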
David Pye, University of Cambridge (U.K.)
Philip C. Woodland, University of Cambridge (U.K.)
This paper examines techniques for speaker normalisation and adaptation that are applied in training with the aim of removing some of the variability from the speaker-independent models. Two techniques are examined: vocal tract normalisation (VTN), which estimates a single "vocal tract length" parameter for each speaker and then modifies the speech parameterisation accordingly, and speaker adaptive training (SAT), which estimates Gaussian mean and variance parameters jointly with a speaker-specific set of maximum likelihood linear regression (MLLR) transformations. It is shown that VTN is effective for both clean speech and mismatched conditions and that the further improvements obtained by applying MLLR in testing are essentially additive. Detailed results from the use of SAT show that worthwhile improvements over using MLLR with standard speaker-independent models are obtained.
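The single-parameter VTN estimation can be pictured as a grid search (a sketch; warp_fn and loglik_fn stand in for the system's feature warping and acoustic scoring, and the grid is illustrative):

    import numpy as np

    def estimate_warp_factor(utterance, warp_fn, loglik_fn,
                             alphas=np.linspace(0.88, 1.12, 13)):
        """Choose the "vocal tract length" parameter for one speaker:
        re-parameterize the speech at each candidate factor and keep
        the one that scores highest under the current models."""
        best_alpha, best_ll = 1.0, -np.inf
        for alpha in alphas:
            ll = loglik_fn(warp_fn(utterance, alpha))
            if ll > best_ll:
                best_alpha, best_ll = alpha, ll
        return best_alpha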
Yasuo Ariki, Ryukoku University (Japan)
Conventional speaker-independent HMMs ignore speaker differences and pool all speech data in a single observation space. As a result, the probability distributions of the HMMs become flat, which causes recognition errors. To solve this problem, we construct a speaker subspace for each individual speaker and project that speaker's data onto his or her own subspace. This method extracts the speaker-independent phonetic information contained in the speech data, from which speaker-independent HMMs can be constructed. In this paper, we describe phoneme recognition experiments using speaker-independent HMMs constructed from speech data projected onto the speaker subspaces.
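One simple way to realize a per-speaker subspace projection is via the principal directions of that speaker's data (our construction, for illustration; the paper's may differ):

    import numpy as np

    def project_to_speaker_subspace(frames, dim):
        """Project one speaker's feature vectors onto the subspace
        spanned by the principal directions of that speaker's data.

        frames: (N, D) feature vectors of a single speaker
        dim:    dimensionality of the speaker subspace (dim <= D)
        """
        centered = frames - frames.mean(axis=0)
        # Principal axes of this speaker's data via the SVD.
        _, _, vt = np.linalg.svd(centered, full_matrices=False)
        basis = vt[:dim]           # (dim, D) orthonormal rows
        return centered @ basis.T  # frame coordinates in the subspace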
Jun Ishii, ATR (Japan)
Masahiro Tonomura, ATR (Japan)
We propose novel speaker-independent (SI) modeling and speaker adaptation based on a linear transformation. An SI model and speaker-dependent (SD) models are usually generated using the same preprocessing of the acoustic data. This straightforward preprocessing causes a serious problem: the probability distributions of the SI models become broad, and the SI models do not provide good initial estimates for speaker adaptation. To solve these problems, a normalized SI model is generated by removing speaker characteristics with a shift vector obtained by the maximum likelihood linear regression (MLLR) technique. In addition, we propose a speaker adaptation method that combines the MLLR and maximum a posteriori (MAP) techniques starting from the normalized SI model. In a baseline recognition test, the normalized SI model achieved a 12.8% reduction in phoneme recognition error rate compared to the conventional SI model. Furthermore, the proposed adaptation method using the normalized SI model was more effective than the conventional method tested.
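The normalization can be sketched as follows, assuming a pure shift (our schematic, not necessarily the paper's exact procedure): for each training speaker s, MLLR yields a shift vector b_s, and the normalized SI model is trained on the shifted data,

    b_s = \arg\max_b p(X_s \mid \{\mu_k + b\}), \qquad \tilde{x}_t^{(s)} = x_t^{(s)} - b_s

so that what remains after subtracting b_s is largely free of gross speaker characteristics, giving sharper SI distributions and a better starting point for MLLR/MAP adaptation.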
John W. McDonough, BBN STD (U.S.A.)
Tasos Anastasakos, BBN STD (U.S.A.)
George Zavaliagkos, BBN STD (U.S.A.)
Herbert Gish, BBN STD (U.S.A.)
Speaker adaptation is the process of transforming some speaker-independent acoustic model in such a way as to more closely match the characteristics of a particular speaker. It has been shown by several researchers to be an effective means of improving the performance of large vocabulary continuous speech recognition systems. Until very recently, speaker adaptation has been used exclusively as a part of the recognition process. This is undesirable inasmuch as it leads to a mismatched condition between test and training, and hence sub-optimal recognition performance. Very recently, there has been a growing interest in applying speaker-adaptation techniques to HMM training in order to alleviate the training/test mismatch. In prior work, we presented an iterative scheme for determining the maximum likelihood solution for the set of speaker-independent means and variances when speaker-dependent adaptation is performed during HMM training. In the present work, we investigate specific issues encountered in applying this general framework to the task of improving recognition performance on the Switchboard Corpus.
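For fixed speaker transforms \mu \mapsto A_s \mu + b_s, the iterative scheme referred to above gives the SI mean of a Gaussian in the familiar generalized least-squares form (schematic, single-Gaussian notation):

    \hat{\mu} = \Big( \sum_s \sum_t \gamma_t^{(s)} A_s^\top \Sigma^{-1} A_s \Big)^{-1} \sum_s \sum_t \gamma_t^{(s)} A_s^\top \Sigma^{-1} (x_t^{(s)} - b_s)

where \gamma_t^{(s)} is the occupation probability of frame x_t^{(s)} from speaker s and \Sigma the current SI covariance; the variances are updated analogously, and means, variances, and transforms are re-estimated in alternation.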