Chair: Shigeki Sagayama, NTT Human Interface Labs, Japan
Scott S. Chen, IBM (U.S.A.)
Ponani S. Gopalakrishnan, IBM (U.S.A.)
One difficult problem we are often faced with in cluster analysis is how to choose the number of clusters. In this paper, we propose to choose the number of clusters by optimizing the Bayesian information criterion (BIC), a model selection criterion from the statistics literature. We develop a termination criterion for hierarchical clustering methods which optimizes BIC in a greedy fashion. The resulting algorithms are fully automatic. Our experiments on Gaussian mixture modeling and speaker clustering demonstrate that BIC is able to choose the number of clusters according to the intrinsic complexity present in the data.
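A minimal sketch of BIC-based selection of the number of clusters, using scikit-learn's GaussianMixture as an assumed stand-in for the clustering models; the paper's greedy termination criterion for hierarchical clustering is not reproduced here.

```python
# Choosing the number of clusters by minimizing BIC, sketched with
# scikit-learn's GaussianMixture as a stand-in for the paper's models.
# BIC = -2 * log-likelihood + (#free parameters) * log(N); lower is better.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic data with three well-separated Gaussian clusters.
X = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(200, 2)),
    rng.normal(loc=[5, 5], scale=0.5, size=(200, 2)),
    rng.normal(loc=[0, 5], scale=0.5, size=(200, 2)),
])

bic_scores = {}
for k in range(1, 8):
    gmm = GaussianMixture(n_components=k, covariance_type="full",
                          random_state=0).fit(X)
    bic_scores[k] = gmm.bic(X)

best_k = min(bic_scores, key=bic_scores.get)
print("BIC per k:", {k: round(v, 1) for k, v in bic_scores.items()})
print("chosen number of clusters:", best_k)
```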
Atsushi Nakamura, ATR Interpreting Telecommunications Research Labs (Japan)
In continuous speech recognition featuring hidden Markov models (HMMs), word N-grams and time-synchronous beam search, a local modeling mismatch in the HMM will often cause recognition performance to degrade. To cope with this problem, this paper proposes a method of restructuring Gaussian mixture pdfs in a pre-trained speaker-independent HMM based on speech data. In this method, mixture components are copied and shared among multiple mixture pdfs, taking the tendency of local errors into account. The tendency is obtained by comparing the pre-trained HMM with the speech data that was used in pre-training. Experimental results show that the proposed method can effectively repair local modeling mismatches and improve recognition performance.
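The error-driven selection of which components to share is specific to the paper; the following hedged sketch only illustrates the basic mechanical operation of copying a Gaussian component from one mixture pdf into another and renormalizing the mixture weights.

```python
# Hedged illustration: copy (share) a Gaussian component from a donor
# mixture pdf into a recipient mixture pdf and renormalize its weights.
# How the donor component and the initial weight are chosen is assumed
# here; the paper derives this from the tendency of local errors.
import numpy as np

def share_component(recipient, donor, comp_idx, init_weight=0.1):
    """recipient/donor: dicts with 'weights' (K,), 'means' (K,D), 'vars' (K,D)."""
    w = recipient["weights"] * (1.0 - init_weight)      # make room for the new weight
    recipient["weights"] = np.append(w, init_weight)
    recipient["means"] = np.vstack([recipient["means"],
                                    donor["means"][comp_idx]])
    recipient["vars"] = np.vstack([recipient["vars"],
                                   donor["vars"][comp_idx]])
    return recipient

# Two toy 2-component, 3-dimensional mixture pdfs.
rng = np.random.default_rng(1)
pdf_a = {"weights": np.array([0.6, 0.4]),
         "means": rng.normal(size=(2, 3)), "vars": np.ones((2, 3))}
pdf_b = {"weights": np.array([0.5, 0.5]),
         "means": rng.normal(size=(2, 3)), "vars": np.ones((2, 3))}

pdf_a = share_component(pdf_a, pdf_b, comp_idx=0)
print("new weights:", pdf_a["weights"], "sum =", pdf_a["weights"].sum())
```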
Timothy J Hazen, MIT (U.S.A.)
Andrew K Halberstadt, MIT (U.S.A.)
This paper investigates the use of aggregation as a means of improving the performance and robustness of mixture Gaussian models. This technique produces models that are more accurate and more robust across test sets than models obtained through traditional cross-validation using a development set. A theoretical justification for this technique is presented along with experimental results on phonetic classification, phonetic recognition, and word recognition tasks on the TIMIT and Resource Management corpora. In speech classification and recognition tasks, error rate reductions of up to 12% were observed using this technique. A method for utilizing tree-structured density functions to prune the aggregated models is also presented.
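A minimal sketch, assuming aggregation here means averaging the class-conditional scores of several mixture Gaussian classifiers trained on different cross-validation folds; the paper's exact aggregation scheme and the tree-structured pruning are not shown.

```python
# Hedged sketch of model aggregation: train one GMM classifier per
# cross-validation fold and classify by averaging the per-class
# log-likelihoods of all folds' models.
import numpy as np
from sklearn.mixture import GaussianMixture

def train_fold_models(X, y, n_folds=4, n_components=2, seed=0):
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    folds = np.array_split(order, n_folds)
    models = []                       # one {class: GMM} dict per fold
    for held_out in range(n_folds):
        train_idx = np.concatenate([f for i, f in enumerate(folds) if i != held_out])
        per_class = {}
        for c in np.unique(y):
            Xc = X[train_idx][y[train_idx] == c]
            per_class[c] = GaussianMixture(n_components, random_state=seed).fit(Xc)
        models.append(per_class)
    return models

def aggregate_classify(models, X):
    classes = sorted(models[0].keys())
    # Average each class's log-likelihood over the aggregated models.
    scores = np.stack([np.mean([m[c].score_samples(X) for m in models], axis=0)
                       for c in classes], axis=1)
    return np.array(classes)[np.argmax(scores, axis=1)]

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 1.0, size=(120, 2)),
               rng.normal(3.0, 1.0, size=(120, 2))])
y = np.array([0] * 120 + [1] * 120)
models = train_fold_models(X, y)
pred = aggregate_classify(models, X)
print("training-set accuracy of the aggregated classifier:", (pred == y).mean())
```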
Mark J.F. Gales, IBM T.J. Watson Research (U.S.A.)
A standard problem in many classification tasks is how to model feature vectors whose elements are highly correlated. If multi-variate Gaussian distributions are used to model the data then they must have full covariance matrices to do so. This requires a large number of parameters per distribution, which restricts the number of distributions that may be robustly estimated, particularly when high-dimensional feature vectors are required. This paper describes an alternative to full covariance matrices in these situations: an approximate full covariance matrix. The covariance matrix is split into two elements, one full and one diagonal, which may be tied at completely separate levels. Typically the full elements are extensively tied, resulting in only a small increase in the number of parameters compared to the diagonal case and thus dramatically increasing the number of distributions that may be robustly estimated. Simple re-estimation formulae for all the parameters within the standard EM framework are presented. On a large-vocabulary speech recognition task a 10% reduction in word error rate over a standard system was achieved.
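A hedged numerical sketch of one way such a full/diagonal split can be realized, in the spirit of semi-tied covariances: each Gaussian keeps its own diagonal element while a single full element (a shared transform) is tied across all Gaussians; the paper's tying structure and EM re-estimation formulae are not reproduced.

```python
# Hedged sketch of a covariance split into a tied full element and a
# per-Gaussian diagonal element:
#   Sigma_m = H @ diag(d_m) @ H.T,  with the full element H shared.
# The log-likelihood can then be evaluated in the transformed space
# A = inv(H) at diagonal-covariance cost plus one shared log|det A|.
import numpy as np

def semi_tied_loglik(x, mean, diag_var, A, logdet_A):
    """log N(x; mean, inv(A) diag(diag_var) inv(A).T) for one Gaussian."""
    z = A @ (x - mean)                                   # shared full element
    D = len(x)
    return (logdet_A
            - 0.5 * (D * np.log(2 * np.pi)
                     + np.sum(np.log(diag_var))
                     + np.sum(z * z / diag_var)))        # per-Gaussian diagonal

rng = np.random.default_rng(0)
D, M = 4, 3
H = np.eye(D) + 0.1 * rng.normal(size=(D, D))            # tied full element
A = np.linalg.inv(H)
logdet_A = np.linalg.slogdet(A)[1]
means = rng.normal(size=(M, D))
diag_vars = rng.uniform(0.5, 2.0, size=(M, D))           # per-Gaussian diagonal

x = rng.normal(size=D)
for m in range(M):
    # Check against the explicit full-covariance evaluation.
    Sigma = H @ np.diag(diag_vars[m]) @ H.T
    diff = x - means[m]
    ref = -0.5 * (D * np.log(2 * np.pi)
                  + np.linalg.slogdet(Sigma)[1]
                  + diff @ np.linalg.solve(Sigma, diff))
    assert np.isclose(semi_tied_loglik(x, means[m], diag_vars[m], A, logdet_A), ref)
print("semi-tied evaluation matches explicit full-covariance evaluation")
```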
Ramesh A Gopinath, IBM (U.S.A.)
Maximum Likelihood (ML) modeling of multiclass data for classification often suffers from the following problems: (a) data insufficiency, implying overtrained or unreliable models; (b) large storage requirements; (c) large computational requirements; and/or (d) ML not discriminating between classes. Sharing parameters across classes (or constraining the parameters) clearly tends to alleviate the first three problems. In this paper we show that in some cases it can also lead to better discrimination (as evidenced by reduced misclassification error). The parameters considered are the means and variances of the Gaussians and linear transformations of the feature space (or equivalently of the Gaussian means). Some constraints on the parameters are shown to lead to Linear Discriminant Analysis (a well-known result) while others are shown to lead to optimal feature spaces (a relatively new result). Applications of some of these ideas to the speech recognition problem are also given.
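The constraint that ties the class covariances recovers Linear Discriminant Analysis; the following is a hedged sketch of the classical scatter-matrix formulation of LDA, not the paper's ML derivation or its extensions to optimal feature spaces.

```python
# Hedged sketch: the classical scatter-matrix route to Linear Discriminant
# Analysis, i.e. the feature-space projection that constrained ML modeling
# with tied covariances recovers (the paper's derivation is not shown).
import numpy as np

def lda_projection(X, y, n_dims):
    classes = np.unique(y)
    mean_all = X.mean(axis=0)
    D = X.shape[1]
    Sw = np.zeros((D, D))                       # within-class scatter
    Sb = np.zeros((D, D))                       # between-class scatter
    for c in classes:
        Xc = X[y == c]
        mc = Xc.mean(axis=0)
        Sw += (Xc - mc).T @ (Xc - mc)
        diff = (mc - mean_all)[:, None]
        Sb += len(Xc) * (diff @ diff.T)
    # Directions maximizing between-class over within-class scatter.
    eigvals, eigvecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(eigvals.real)[::-1]
    return eigvecs.real[:, order[:n_dims]]      # D x n_dims projection

rng = np.random.default_rng(0)
X = np.vstack([rng.normal([0, 0, 0], 1.0, size=(100, 3)),
               rng.normal([3, 3, 0], 1.0, size=(100, 3))])
y = np.array([0] * 100 + [1] * 100)
W = lda_projection(X, y, n_dims=1)
print("projected class means:", (X[y == 0] @ W).mean(), (X[y == 1] @ W).mean())
```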
Mehryar Mohri, AT&T Labs - Research (U.S.A.)
Michael D. Riley, AT&T Labs - Research (U.S.A.)
Donald Hindle, AT&T Labs - Research (U.S.A.)
Andrej Ljolje, AT&T Labs - Research (U.S.A.)
Fernando C Pereira, AT&T Labs - Research (U.S.A.)
We combine our earlier approach to context-dependent network representation with our algorithm for determinizing weighted networks to build optimized networks for large-vocabulary speech recognition that combine an n-gram language model, a pronunciation dictionary and context-dependency modeling. While fully expanded networks have been used before in restrictive settings (medium vocabulary or no cross-word contexts), we demonstrate that our network determinization method makes it practical to use fully expanded networks in large-vocabulary recognition with full cross-word context modeling as well. For the DARPA North American Business News (NAB) task, we give network sizes, recognition speeds and accuracies using bigram and trigram grammars with vocabulary sizes ranging from 10,000 to 160,000 words. With our construction, the fully expanded NAB context-dependent networks contain only about twice as many arcs as the corresponding language models. Interestingly, we also find that, with these networks, real-time word accuracy is improved by increasing vocabulary size and n-gram order.
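A toy sketch of weighted determinization over the tropical (min, +) semiring for a small acyclic weighted acceptor, to illustrate the subset-with-residual-weights construction; this is only an assumed illustration of the idea, not the authors' optimized large-network construction.

```python
# Hedged sketch of weighted determinization over the tropical (min, +)
# semiring: determinized states are sets of (state, residual weight)
# pairs, and the minimal weight per label is pushed onto the new arc.
from collections import defaultdict

def determinize(arcs, start, finals):
    """arcs: list of (src, label, weight, dst); finals: {state: final_weight}.
    Returns deterministic arcs, the start subset, and final weights per subset."""
    start_subset = frozenset({(start, 0.0)})
    det_arcs, det_finals = [], {}
    queue, seen = [start_subset], {start_subset}
    while queue:
        subset = queue.pop()
        # Final weight of the subset: best residual + original final weight.
        fw = [r + finals[q] for q, r in subset if q in finals]
        if fw:
            det_finals[subset] = min(fw)
        # Group outgoing transitions by label, keeping the best weight per target.
        by_label = defaultdict(lambda: defaultdict(lambda: float("inf")))
        for q, r in subset:
            for src, label, w, dst in arcs:
                if src == q:
                    by_label[label][dst] = min(by_label[label][dst], r + w)
        for label, dests in by_label.items():
            arc_w = min(dests.values())                   # weight pushed onto the arc
            nxt = frozenset((q, w - arc_w) for q, w in dests.items())
            det_arcs.append((subset, label, arc_w, nxt))
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return det_arcs, start_subset, det_finals

# Toy nondeterministic acceptor: two paths for label 'a' with different weights.
arcs = [(0, "a", 1.0, 1), (0, "a", 3.0, 2), (1, "b", 2.0, 3), (2, "b", 1.0, 3)]
det_arcs, s0, det_finals = determinize(arcs, start=0, finals={3: 0.0})
for src, label, w, dst in det_arcs:
    print(sorted(src), f"--{label}/{w}-->", sorted(dst))
```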
Mei-Yuh Hwang, Microsoft Research (U.S.A.)
Xuedong Huang, Microsoft Research (U.S.A.)
Senones were introduced in 1992 to share hidden Markov model (HMM) parameters at a sub-phonetic level, and decision trees were incorporated by Hwang in 1993 to predict unseen phonetic contexts. In this paper, we describe two applications of the senonic decision tree: (1) dynamically downsizing a speech recognition system for small platforms and (2) sharing the Gaussian covariances of continuous density HMMs (CHMMs). We experimented with how to balance different parameters to offer the best trade-off between recognition accuracy and system size. The dynamically downsized system, without retraining, performed even better than the regular Baum-Welch trained system. The shared covariance model provided as good a performance as the unshared full model and thus gave us the freedom to increase the number of Gaussian means to improve the accuracy of the model. Combining the downsizing and covariance sharing algorithms, a total error reduction of 8% was achieved over the Baum-Welch trained system with approximately the same parameter size.
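A back-of-the-envelope sketch of the accuracy/size trade-off that covariance sharing opens up: with covariances drawn from a shared pool, each Gaussian stores only an index, freeing parameters for additional means. The senone and Gaussian counts below are made-up illustrative values, not the paper's system configurations.

```python
# Hedged parameter-count calculator for a CHMM with per-Gaussian
# vs. shared (pooled) diagonal covariances.
def chmm_param_count(n_senones, n_gauss_per_senone, dim,
                     n_shared_covariances=None):
    n_gauss = n_senones * n_gauss_per_senone
    means = n_gauss * dim
    weights = n_gauss
    if n_shared_covariances is None:          # unshared diagonal covariances
        covs = n_gauss * dim
    else:                                     # shared pool + one index per Gaussian
        covs = n_shared_covariances * dim + n_gauss
    return means + weights + covs

dim = 39
for shared in (None, 256):
    n = chmm_param_count(n_senones=6000, n_gauss_per_senone=8, dim=dim,
                         n_shared_covariances=shared)
    label = "unshared" if shared is None else f"{shared} shared covariances"
    print(f"{label}: {n:,} parameters")
```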
Brian K Mak, AT&T Labs - Research (U.S.A.)
Enrico L. Bocchieri, AT&T Labs - Research (U.S.A.)
Recently we presented our novel subspace distribution clustering hidden Markov models (SDCHMMs), which can be converted from continuous density hidden Markov models (CDHMMs) by clustering subspace Gaussians in each stream over all models. The model conversion has two drawbacks: (1) it does not take advantage of the fewer model parameters in SDCHMMs --- theoretically, SDCHMMs may be trained with a smaller amount of data; and (2) it involves two separate optimization steps (first training CDHMMs, then clustering subspace Gaussians), so the resulting SDCHMMs are not guaranteed to be optimal. In this paper, we show how SDCHMMs may be trained directly from less speech data if we have a priori knowledge of their architecture. On the ATIS task, a context-independent 20-stream SDCHMM system trained using our novel SDCHMM reestimation algorithm with only 8 minutes of speech performs as well as a CDHMM system trained with 105 minutes of speech.
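A hedged sketch of the subspace tying idea behind the CDHMM-to-SDCHMM conversion: split the feature space into streams, pool every Gaussian's per-stream (mean, variance) pair, quantize the pairs into a small codebook per stream, and let each Gaussian store only codeword indices. The stream widths, codebook sizes and clustering method below are illustrative assumptions, and the paper's direct SDCHMM reestimation algorithm is not shown.

```python
# Hedged sketch of subspace distribution clustering: cluster each stream's
# subspace Gaussians (mean + log-variance) over all models with a plain
# k-means, and store only codeword indices per Gaussian.
import numpy as np

def build_subspace_codebooks(means, variances, n_streams, codewords=32,
                             n_iters=15, seed=0):
    """means, variances: (M, D).  Each stream covers D // n_streams dimensions."""
    M, D = means.shape
    width = D // n_streams
    rng = np.random.default_rng(seed)
    codebooks, indices = [], np.zeros((M, n_streams), dtype=int)
    for s in range(n_streams):
        dims = slice(s * width, (s + 1) * width)
        # One point per Gaussian: its subspace mean and log-variance.
        pts = np.hstack([means[:, dims], np.log(variances[:, dims])])
        cb = pts[rng.choice(M, codewords, replace=False)]
        for _ in range(n_iters):                 # plain k-means
            d = ((pts[:, None, :] - cb[None, :, :]) ** 2).sum(axis=2)
            a = d.argmin(axis=1)
            for k in range(codewords):
                if np.any(a == k):
                    cb[k] = pts[a == k].mean(axis=0)
        codebooks.append(cb)
        indices[:, s] = a
    return codebooks, indices

rng = np.random.default_rng(2)
means = rng.normal(size=(500, 20))
variances = rng.uniform(0.5, 2.0, size=(500, 20))
codebooks, indices = build_subspace_codebooks(means, variances, n_streams=10)
print("each Gaussian is now stored as", indices.shape[1], "codeword indices")
```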