ABSTRACT
To realize good speaker adaptation for context-dependent HMMs using small-size training data, reasonable adaptation of unseen models has to be achieved using the relation between the models that appear in the training data and those that do not. In this paper, a new speaker adaptation method for context-dependent HMMs using two spatial constraints is proposed: 1) the spatial relation among phoneme-context hierarchical models, and 2) the spatial relation between speaker-specific models and speaker-independent models. Several implementations based on this idea are proposed and evaluated on a 520-word speech recognition task, with 25 words used for adaptation per speaker. The best result reduced the error rate by 30%, showing the effectiveness of the proposed method.
ABSTRACT
This paper describes a high-speed algorithm for a speech recognizer based on speaker-cluster HMMs. Speaker-cluster HMMs, which make it possible to deal with variety among speakers, have been reported to show good performance. However, the amount of computation grows in proportion to the number of clusters when speaker-cluster HMMs are used in speaker-independent recognition, where the recognition processes must be run in parallel using every speaker-cluster HMM. To reduce the computation, we introduce a multi-pass search over the broad search space covering lexical and speaker variation. Furthermore, output probability recalculation is introduced to reduce the state output probability computation. We conducted experiments on 1000-word speaker-independent continuous telephone speech recognition. The result in the case where 7 speaker clusters are used shows a computation reduction of about 30%.
ABSTRACT
This paper introduces two novel techniques for instantaneous speaker adaptation, reference speaker weighting and consistency modeling. An approach to hierarchical speaker clustering using gender and speaking rate as the clustering criteria is also presented. All three methods attempt to utilize the underlying within-speaker correlations that are present between the acoustic realizations of different phones. By accounting for these correlations, a limited amount of adaptation data can be used to adapt every phonetic acoustic model, including those for phones which have not been observed in the adaptation data. In instantaneous adaptation experiments using the DARPA Resource Management corpus, a reduction in word error rate of 20% has been achieved using a combination of these new techniques.
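Reference speaker weighting lends itself to a compact illustration: the test speaker's means are expressed as a weighted combination of reference speakers' means, so a few observed phones constrain every phone's model. The sketch below uses a deliberately simplified, hypothetical setting (1-d means, least-squares weights, synthetic speakers), not the paper's actual formulation:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical setting: R reference speakers, U phone units, 1-d means.
R, U = 4, 6
ref_means = rng.normal(0.0, 1.0, size=(R, U))  # one row per reference speaker

# Assume the test speaker's means lie in the span of the references.
true_w = np.array([0.5, 0.3, 0.2, 0.0])
target = true_w @ ref_means

# Only some phone units appear in the adaptation data.
observed = [0, 1, 2, 3, 4]

# Least-squares weights over the reference speakers, fit on the
# observed units only...
w, *_ = np.linalg.lstsq(ref_means[:, observed].T, target[observed], rcond=None)

# ...then every unit, observed or not, receives an adapted mean.
adapted = w @ ref_means
print(np.abs(adapted - target).max())  # small: the unseen unit is adapted too
```

The key property is in the last line: unit 5 never appears in the adaptation data, yet its mean is adapted through the shared weights.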
ABSTRACT
This paper describes Jacobian adaptation (JA) of acoustic models to environmental noise and its experimental evaluation. JA is based on a ``noise adaptation'' idea, i.e., acoustic model adaptation from an initial noise A to a target noise B, and uses Jacobian matrices to relate changes in environmental noise to changes in the ``speech+noise'' acoustic model. It is experimentally shown that JA performs well compared with existing techniques such as HMM composition, particularly when only a short sample (shorter than 1 sec) of the target noise is given, and that JA is very advantageous in terms of computational cost. Moreover, this paper describes JA used in combination with noise spectral subtraction and shows that improving the SNR by spectral subtraction leads to higher efficiency.
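The first-order idea behind JA can be sketched numerically. The toy example below uses hypothetical filter-bank energies and a diagonal Jacobian in the log-spectral domain; the cepstral-domain Jacobian matrices used in practice are more involved:

```python
import numpy as np

# Hypothetical filter-bank energies in the linear spectral domain.
speech = np.array([4.0, 9.0, 16.0])   # clean-speech channel energies
noise_a = np.array([1.0, 1.0, 1.0])   # initial noise A (seen at training time)
noise_b = np.array([2.0, 0.5, 3.0])   # target noise B (short sample)

# "Speech+noise" model trained under noise A, in the log-spectral domain.
model_a = np.log(speech + noise_a)

# Jacobian of log(S+N) with respect to log(N), evaluated at noise A:
# d log(S+N) / d log(N) = N / (S+N); diagonal here because each channel
# is treated independently in this toy setting.
jacobian = noise_a / (speech + noise_a)

# First-order (Jacobian) adaptation toward noise B.
model_b_ja = model_a + jacobian * (np.log(noise_b) - np.log(noise_a))

# Exact recomposition, for comparison with JA's approximation.
model_b_exact = np.log(speech + noise_b)
print(np.max(np.abs(model_b_ja - model_b_exact)))
```

As the change from noise A to noise B grows, the first-order approximation degrades, which is consistent with JA being aimed at moderate environmental changes at low computational cost.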
ABSTRACT
An unsupervised, sentence-level, discriminative HMM adaptation algorithm based on silence/speech classification is presented. Silence and speech regions are determined either using an endpointer or using the segmentation obtained from the recognizer in a first pass. An unsupervised discriminative training procedure using the gradient descent algorithm, with N-best competing strings with word insertions, is then used to improve the discrimination between silence and speech. Experiments on connected digits show about a 40-80% reduction in insertion errors, a small reduction in substitution errors, and a negligible effect on deletion errors. In addition, experiments on noisy speech showed about 70% word error rate reduction, thus demonstrating the robustness of the proposed adaptation technique.
ABSTRACT
Hidden Markov model (HMM) adaptation is currently of interest as a way to overcome the degradation effect of speaker and/or channel mismatches in practical speech recognition systems. The Bayesian framework provides a theoretically optimal formulation for combining adaptation data and prior knowledge, but it suffers from the drawback of being incapable of adapting parameters of the models that have no observations in the adaptation speech. In this article we present a new predictive (in the sense of influencing unobserved distribution parameters) adaptation algorithm for the mean vectors of an HMM. We also point out some theoretical relationships between the proposed method and other techniques used in the context of predictive model adaptation. The efficacy of the proposed approach is demonstrated in speaker adaptation experiments for both an isolated word task and TIMIT phonetic recognition.
ABSTRACT
The recognition accuracy of recent large-vocabulary Automatic Speech Recognition (ASR) systems is highly related to the mismatch between the training and test sets. For example, dialect differences across the training and testing speakers result in a significant degradation in recognition performance. Some popular adaptation approaches improve the recognition performance of speech recognizers based on hidden Markov models with continuous mixture densities by using linear transforms to adapt the means, and possibly the covariances, of the mixture Gaussians. In this paper, we propose a novel adaptation technique that adapts the means and, optionally, the covariances of the mixture Gaussians by using multiple stochastic transformations. We perform both speaker and dialect adaptation experiments, and we show that our method significantly improves the recognition accuracy and the robustness of our system. The experiments are carried out with SRI's DECIPHER speech recognition system.
ABSTRACT
Recently there has been much work done on how to transform HMMs, trained typically in a speaker-independent fashion on clean training data, to be more representative of data from a particular speaker or acoustic environment. These transforms are trained on a small amount of training data, so large numbers of components are required to share the same transform. Normally, each component is constrained to use only one transform. This paper examines how to optimally, in a maximum likelihood sense, assign components to transforms and allow each component, or component grouping, to make use of many transformations. The theory for obtaining both the ``weights'' for each transform and the transforms given a set of weights is presented. The techniques are evaluated on both speaker and environmental adaptation tasks.
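The combination of transforms that the paper studies can be written as a weighted sum; the sketch below is a hypothetical miniature (scalar means, two affine transforms, weights given rather than estimated by maximum likelihood as in the paper):

```python
# Two hypothetical affine transforms (A, b), e.g. estimated on
# different broad classes of acoustic components.
transforms = [(1.2, 0.5), (0.9, -0.3)]

def adapt_mean(mu, weights):
    """Adapted mean as a weighted sum of transformed means:
    mu' = sum_i w_i * (A_i * mu + b_i).  The paper estimates the
    weights by maximum likelihood; here they are simply given."""
    return sum(w * (a * mu + b) for w, (a, b) in zip(weights, transforms))

mu = 2.0
print(adapt_mean(mu, [1.0, 0.0]))  # falls back to a single transform
print(adapt_mean(mu, [0.5, 0.5]))  # shares both transforms equally
```

Setting one weight to 1 recovers the conventional one-transform-per-component case, so the weighted scheme strictly generalizes it.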
ABSTRACT
Linear Discriminant Analysis (LDA) has been widely applied to speech recognition, resulting in improved recognition performance and improved robustness. LDA designs a linear transformation that projects an n-dimensional space onto an m-dimensional space (m < n) such that class separability is maximized. This paper presents new results related to our previous work [6] on nonlinear discriminant analysis (NLDA) based on the discriminant properties of Artificial Neural Networks (ANNs), and more particularly MLPs. Experiments performed on the isolated-word large-vocabulary Phonebook database show that NLDA provides a method for designing discriminant features that is particularly efficient both for continuous-density HMM and for hybrid HMM/ANN recognizers.
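As a reminder of the linear baseline that NLDA generalizes: classical LDA takes the leading eigenvectors of S_w^{-1} S_b as projection directions. A minimal sketch on synthetic two-class data (not the Phonebook setup):

```python
import numpy as np

rng = np.random.default_rng(0)
# Two hypothetical phone classes in a 3-dimensional feature space.
x1 = rng.normal([0.0, 0.0, 0.0], 1.0, size=(200, 3))
x2 = rng.normal([2.0, 1.0, 0.0], 1.0, size=(200, 3))

m1, m2 = x1.mean(axis=0), x2.mean(axis=0)
sw = np.cov(x1, rowvar=False) + np.cov(x2, rowvar=False)  # within-class scatter
sb = np.outer(m1 - m2, m1 - m2)                           # between-class scatter

# LDA directions: leading eigenvectors of S_w^{-1} S_b; they maximize
# between-class scatter relative to within-class scatter.
evals, evecs = np.linalg.eig(np.linalg.solve(sw, sb))
order = np.argsort(evals.real)[::-1]
w = evecs.real[:, order[:1]]       # project 3-d features onto 1-d (m < n)

p1, p2 = x1 @ w, x2 @ w
print(abs(p1.mean() - p2.mean()))  # the classes stay well separated in 1-d
```

NLDA replaces this fixed linear projection with the learned nonlinear mapping of an MLP's hidden layers.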
ABSTRACT
The combination of a model of auditory perception (PEMO) as feature extractor and a Locally Recurrent Neural Network (LRNN) as classifier yields promising ASR results in noise. Our study focuses on the interplay between both techniques and their ability to complement each other in the task of robust speech recognition. We performed recognition experiments with modifications of PEMO processing concerning amplitude compression and envelope modulation filtering. The results show that the distinct and sparse peaks of the PEMO speech representation, which are well maintained in noise, are sufficient cues for LRNN-based recognition due to the LRNN's ability to exploit information which is distributed over time. Enhanced envelope modulation bandpass filtering of PEMO feature vectors better reflects the average modulation spectrum of speech and further decreases the influence of noise.
ABSTRACT
In this paper, we describe a rate of speech estimator that is derived directly from the acoustic signal. This measure has been developed as an alternative to lexical measures of speaking rate such as phones or syllables per second, which, in previous work, we estimated using a first recognition pass; the accuracy of our earlier lexical rate estimate depended on the quality of recognition. Here we show that our new measure is a good predictor of word error rate, and in addition, correlates moderately well with lexical speech rate. We also show that a simple modification of the model transition probabilities based on this measure can reduce the error rate almost as much as using lexical phones per second calculated from manually transcribed data. When we categorized test utterances based on speaking rate thresholds computed from the training set, we observed that a different transition probability value was required to minimize the error rate in each speaking rate bin. However, the reduction of error provided by this approach is still small in comparison with the increases in error observed for unusually fast or slow speech.
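The transition-probability modification can be pictured with a toy sketch; the power-law parametrization and the names below are our own illustration, not the paper's formula:

```python
def adjust_self_loop(self_loop_p, rate_factor):
    """Lower the self-loop probability for fast speech so HMM states are
    exited sooner; rate_factor > 1 means faster-than-average speech.
    The power-law form is one simple hypothetical parametrization."""
    p = self_loop_p ** rate_factor
    return p, 1.0 - p  # (self-loop, exit) still sum to 1

# Average-rate speech leaves the transition unchanged; fast speech
# shortens the expected state duration 1 / (1 - p).
print(adjust_self_loop(0.8, 1.0))
print(adjust_self_loop(0.8, 2.0))
```

Binning utterances by estimated rate and choosing one such value per bin mirrors the experiment described above, where each speaking-rate bin required a different transition-probability value.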
ABSTRACT
Formant frequencies have rarely been used as acoustic features for speech recognition, in spite of their phonetic significance. For some speech sounds one or more of the formants may be so badly defined that it is not useful to attempt a frequency measurement. Also, it is often difficult to decide which formant labels to attach to particular spectral peaks. This paper describes a new method of formant analysis which includes techniques to overcome both of the above difficulties. Using the same data and HMM model structure, results are compared between a recognizer using conventional cepstrum features and one using three formant frequencies, combined with fewer cepstrum features to represent general spectral trends. For the same total number of features, results show that including formant features can offer increased accuracy over using cepstrum features only.
ABSTRACT
Speaker normalization and speaker adaptation are two strategies to tackle the variations arising from speaker, channel, and environment. Vocal tract length normalization (VTLN) is an effective speaker normalization approach to compensate for variations in vocal tract shape. Maximum Likelihood Linear Regression (MLLR) is a recently proposed method for speaker adaptation. In this paper, we propose a speaker-specific Bark-scale VTLN method, investigate the combination of VTLN with MLLR, and present an iterative procedure for decoding the combined VTLN and MLLR system. The results show that: (1) the new VTLN method is very effective, reducing the word error rate by up to 11%; (2) the combination of VTLN and MLLR can provide up to 15% word error reduction; (3) both VTLN and MLLR are more effective for the push-to-talk data than for the cross-talk data.
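For orientation, one common VTLN implementation (not necessarily the speaker-specific Bark-scale warp proposed here) applies a piecewise-linear warp to the frequency axis, with a single warp factor per speaker; a sketch under those assumptions:

```python
def warp_frequency(f_hz, alpha, f_max=8000.0, f_cut=0.85):
    """Piecewise-linear VTLN warp. Below f_cut*f_max, frequencies are
    scaled by the speaker's warp factor alpha; above, a second linear
    piece keeps the warp continuous and maps f_max back to f_max."""
    knee = f_cut * f_max
    if f_hz <= knee:
        return alpha * f_hz
    slope = (f_max - alpha * knee) / (f_max - knee)
    return alpha * knee + slope * (f_hz - knee)

# alpha > 1 stretches the spectrum (shorter vocal tract);
# alpha < 1 compresses it (longer vocal tract).
print(warp_frequency(1000.0, 1.1))  # below the knee: plain scaling by alpha
print(warp_frequency(8000.0, 1.1))  # the band edge maps back to itself
```

The warp is typically applied to the filter-bank center frequencies during feature extraction, and alpha is chosen per speaker, e.g. by maximum likelihood over a small grid.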
ABSTRACT
A new strategy for speaker adaptation is described that is based on: (1) pre-clustering all the speakers in the training set acoustically into clusters; (2) for each speaker cluster, building a system using the data from the speakers who belong to that cluster; (3) when a test speaker's data is available, finding a subset of these clusters closest to the test speaker; (4) transforming each of the selected clusters to bring it closer to the test speaker's acoustic space; (5) building a speaker-adapted model using the transformed cluster models. This method solves the problem of excessive storage for the training speaker models [1], as it is relatively inexpensive to store a model for each cluster. Also, as each cluster contains a number of speakers, the parameters of the models for each cluster can be robustly estimated. The algorithm has been evaluated on a large vocabulary system and compared to existing algorithms. The improvement over existing algorithms such as MLLR [2] is statistically significant.
ABSTRACT
It has recently been shown that normalisation of vocal tract length can significantly increase recognition accuracy in speaker independent automatic speech recognition systems. An inherent difficulty with this technique is in automatically estimating the normalisation parameter from a new speaker's speech, and previous techniques have typically relied on an exhaustive search to estimate this parameter. In this paper, we present a method of normalising utterances by a linear warping of mel filter bank channels in which the normalisation parameter is estimated by fitting formant estimates to a probabilistic model. This method is fast, computationally inexpensive and requires only a limited amount of data for estimation. It generates normalisations which are close to those which would be found by an exhaustive search. The normalisation is applied to a phoneme recognition task using the TIMIT database and results show a useful improvement over an un-normalised speaker independent system.
ABSTRACT
In this paper we describe experiments with the acoustic front-end of our large-vocabulary speech recognition system. In particular, two aspects are studied: 1) linear transforms for feature extraction and 2) the modelling of the emission probabilities. Experiments are reported on a 5000-word task of the ARPA Wall Street Journal database. For the linear transforms our main results are: a) Filter-bank coefficients yield a word error rate of 9.3%. b) A cepstral decorrelation reduces the error rate from 9.3% to 8.0%. c) By applying a linear discriminant analysis (LDA), a further reduction in the error rate from 8.0% to 7.1% is obtained. d) Recognition results are similar for an LDA applied to filter-bank outputs and to cepstral coefficients. The experiments with density modelling gave the following results: a) Gaussian and Laplacian densities yield similar error rates. b) One single vector of variances or absolute deviations outperforms density-specific or mixture-specific vectors.
ABSTRACT
A new method to improve the accuracy of Autoregressive Hidden Markov Model (AR-HMM) based recognition systems is proposed. The technique uses the bilinear transform to warp the frequency scale of the observation vectors; hence it uses a better perceptual measure to compare the observation vectors to the trained models. Results presented for the E-set letters from the ISOLET database and the first speaker-dependent task of the Resource Management (RM) database show that this technique improves recognition accuracy considerably. However, in the case of the RM system, the recognition results still fall short of those obtained from a similar mel-frequency cepstral (MFCC) based system without delta parameters. Reasons for the inferior performance of the AR-HMM system are proposed and future research directions are suggested. The models built for the RM task are incorporated into an existing enhancement algorithm to form a large-vocabulary speaker-dependent enhancement system. Preliminary results are presented for this system.
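The bilinear transform warps the frequency axis through a first-order all-pass section; a small sketch of the induced warping function (the warp factor 0.42 is an illustrative choice, not taken from the paper):

```python
import math

def bilinear_warp(omega, a):
    """Frequency warping induced by the first-order all-pass (bilinear)
    transform z' = (z - a) / (1 - a*z); for 0 < a < 1 low frequencies
    are stretched, approximating a perceptual (mel-like) scale."""
    return omega + 2.0 * math.atan2(a * math.sin(omega), 1.0 - a * math.cos(omega))

# The band edges stay fixed; frequencies in between move upward for a > 0.
print(bilinear_warp(0.0, 0.42))
print(bilinear_warp(math.pi, 0.42))
print(bilinear_warp(1.0, 0.42) > 1.0)  # low frequencies are stretched
```

Because the warp is realized by an all-pass substitution, it can be applied directly to model or feature parameters without resampling the signal, which is what makes it attractive for AR-HMMs.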
ABSTRACT
The hybrid algorithm of SMQ (Statistical Matrix Quantization) and HMM shows high performance in vocabulary-unspecific, speaker-independent speech recognition; however, it requires a large amount of computation and memory at the segment-quantizer stage of SMQ. In this paper, we propose a newly developed two-stage segment quantizer, with a feature extractor based on the KL expansion and a classifier, that can be trained using competitive KL/GPD training. Experimental results show a 1/30 to 1/40 reduction in both computation time and memory size with the same performance as the earlier version of SMQ.
ABSTRACT
This paper describes a new rapid speaker adaptation algorithm using a small amount of adaptation data. This algorithm, termed adaptation by correlation (ABC), exploits the intrinsic correlation among speech units to update the speech models. The algorithm updates the means of each Gaussian based on its correlation with means of the Gaussians which are observed in the adaptation data; the updating formula is derived from the theory of least squares. Our experiments on the ARPA NAB-94 evaluation (Eval-94) and the ARPA Hub4-96 (Hub4-96) tasks indicate that ABC seems more stable than MLLR when the amount of data for adaptation is very small (~ 5 seconds), and that ABC seems to enhance MLLR when they are combined.
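The correlation idea behind ABC can be illustrated in miniature: if inter-unit correlations are known, the shifts observed for a few Gaussian means constrain the means that were never observed. The linear least-squares predictor below (1-d means, synthetic speakers, our own notation) is a simplification of the paper's formulation:

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic speakers: 5 correlated speech units with 1-d means each.
n_speakers, n_units = 50, 5
base = rng.normal(0.0, 1.0, size=(n_speakers, 1))           # per-speaker offset
data = base + 0.1 * rng.normal(size=(n_speakers, n_units))  # unit means per speaker
si_mean = data.mean(axis=0)       # speaker-independent means
cov = np.cov(data, rowvar=False)  # inter-unit covariance ("correlation" prior)

# New speaker: only units 0 and 1 appear in the adaptation data.
obs, uns = [0, 1], [2, 3, 4]
true_shift = 0.8                  # this speaker's offset, unknown to the adapter
obs_means = si_mean[obs] + true_shift

# Linear least-squares prediction of the unobserved means from the
# observed ones: mu_uns = si_uns + C_uo C_oo^{-1} (obs - si_obs).
gain = cov[np.ix_(uns, obs)] @ np.linalg.inv(cov[np.ix_(obs, obs)])
pred = si_mean[uns] + gain @ (obs_means - si_mean[obs])
print(np.abs(pred - (si_mean[uns] + true_shift)).max())  # close to the truth
```

With only two observed units the predictor moves all five unit means, which mirrors why such correlation-based updates stay stable on very small (~5 second) adaptation sets.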