ABSTRACT
Recently, context-dependent phone units, such as triphones, have been used to model subword units in speech recognition based on Hidden Markov Models (HMMs). While most such methods employ clustering of the HMM parameters (e.g., subword clustering, state clustering, etc.) to control HMM size so as to avoid poor recognition accuracy due to an insufficiency of training data, none of them provides an effective criterion for the optimal degree of clustering that should be performed. This paper proposes a method in which state clustering is accomplished by way of phonetic decision trees and in which the MDL criterion is used to optimize the degree of clustering. Large-vocabulary Japanese recognition experiments show that the models obtained by this method achieved the highest accuracy among the models of various sizes obtained with conventional clustering approaches.
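As a rough illustration of MDL-based model selection of the kind described above, a candidate clustering can be scored by its negative log-likelihood plus a complexity penalty growing with the number of free parameters; the clustering with the smallest description length wins. The function names and all numbers below are hypothetical, not the paper's implementation:

```python
import math

def description_length(log_likelihood, n_params, n_frames):
    """MDL cost: negative log-likelihood plus a complexity penalty
    of (n_params / 2) * log(n_frames)."""
    return -log_likelihood + 0.5 * n_params * math.log(n_frames)

def select_clustering(candidates, n_frames):
    """Pick the candidate clustering (log_likelihood, n_params, label)
    that minimizes the MDL description length."""
    return min(candidates, key=lambda c: description_length(c[0], c[1], n_frames))

# Hypothetical candidates: finer clusterings fit the data better but
# pay a larger parameter penalty.
candidates = [
    (-70000.0, 1000, "coarse"),
    (-50000.0, 4000, "medium"),
    (-49500.0, 16000, "fine"),
]
best = select_clustering(candidates, n_frames=100000)  # selects "medium"
```

Unlike a fixed likelihood-gain threshold, this rule needs no tuning: the penalty term automatically stops cluster splitting once the likelihood gain no longer pays for the extra parameters.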
ABSTRACT
This paper proposes an utterance verification system for hidden Markov model (HMM) based automatic speech recognition systems. A verification objective function, based on a multi-layer perceptron (MLP), is adopted which combines confidence measures from both the recognition and verification models. Discriminative minimum verification error training is applied to optimize the parameters of the MLP and the verification models. Our proposed system provides a framework for combining different knowledge sources for utterance verification using an objective function that is consistently applied during both training and testing. Experimental results on telephone-based connected digits are presented.
ABSTRACT
This paper presents an efficient approximation of the Gaussian mixture state probability density functions of continuous observation density hidden Markov models (CHMMs). In CHMMs, the Gaussian mixtures carry a high computational cost, which amounts to a significant fraction (e.g. 30% to 70%) of the total computation. To achieve higher computation and memory efficiency, we approximate the Gaussian mixtures by (a) decomposition into functions defined on subspaces of the feature space, and (b) clustering the resulting subspace pdf's. Intuitively, when clustering in a subspace of few dimensions, even a few function codewords can provide a small distortion. Therefore, we obtain a significant reduction of the total computation (up to a factor of two) and memory savings (up to a factor of twelve), without significant changes in the CHMMs' accuracy.
ABSTRACT
Phonetic decision trees have been widely used for obtaining robust context-dependent models in HMM-based systems. There are five key issues to consider when constructing phonetic decision trees: the alignment of data with the chosen phone classes; the quality of the modelling of the underlying data; the choice of partitioning method at each node; the goodness-of-split criterion; and the method for determining appropriate tree sizes. A popular existing method uses efficient but crude approximate methods for each of these. This paper introduces and evaluates more detailed alternatives to the standard approximations.
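A minimal sketch of the standard goodness-of-split approximation referenced above, assuming each node's data is modelled by a single diagonal-covariance Gaussian fitted to that node (the helper names and toy frames are illustrative, not the paper's implementation):

```python
import math

def node_loglik(frames):
    """Log-likelihood of frames under a single diagonal-covariance
    Gaussian fitted by maximum likelihood to those same frames."""
    n = len(frames)
    d = len(frames[0])
    ll = 0.0
    for j in range(d):
        col = [f[j] for f in frames]
        mean = sum(col) / n
        var = max(sum((x - mean) ** 2 for x in col) / n, 1e-8)  # floor variance
        # Per-dimension ML Gaussian log-likelihood: -(n/2)(log(2*pi*var) + 1)
        ll += -0.5 * n * (math.log(2 * math.pi * var) + 1.0)
    return ll

def split_gain(parent, left, right):
    """Likelihood gain of splitting a node by a phonetic question."""
    return node_loglik(left) + node_loglik(right) - node_loglik(parent)
```

A question that separates two acoustically distinct clusters yields a large positive gain; tree growing greedily takes the best question at each node and stops when the gain falls below a threshold.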
ABSTRACT
Semi-continuous Hidden Markov Models (SCHMM) with Gaussian distributions are often used in continuous speech or handwriting recognition systems. Our paper compares Gaussian and tree-structured polynomial classifiers, which have been successfully used in pattern recognition for many years. In our system, the binary classifier tree is generated by clustering HMM states using an entropy measure. For handwriting recognition, Gaussians are clearly outperformed by polynomial classification. However, for speech recognition, polynomial classification currently performs slightly worse because some system parameters are not yet optimized.
ABSTRACT
This paper describes a new approach to the ML-SSS (Maximum Likelihood Successive State Splitting) algorithm that uses a tied-mixture representation of the output probability density function instead of a single Gaussian during the splitting phase of the ML-SSS algorithm. The tied-mixture representation results in a better state split gain, because it is able to measure differences in the phoneme environment space that ML-SSS cannot. With this more informative gain, the new algorithm can choose a better split state and corresponding data. Phoneme clustering experiments were conducted, yielding up to 38% error reduction compared to the ML-SSS algorithm.
ABSTRACT
In this paper, we present results for the Minimum Classification Error (MCE) [1] framework for discriminative training applied to tasks in continuous phoneme recognition. The results obtained using MCE are compared with results for Maximum Likelihood Estimation (MLE). We examine the ability of MCE to attain high recognition performance with a small number of parameters. Phoneme-level and string-level MCE loss functions were used as the optimization criteria for a Prototype-Based Minimum Error Classifier (PBMEC) [2] and an HMM [3]. The former was optimized using Generalized Probabilistic Descent; the latter was optimized using an approximated second-order method, the Quickprop algorithm. Two databases were used in this evaluation: 1) the ATR 5240 isolated word dataset for 6 speakers, in both speaker-dependent and multi-speaker modes; 2) the TIMIT database. For both databases, MCE training yielded striking gains in performance and classifier compactness compared to MLE baselines. For instance, through MCE training, performance similar to that of the Maximum Likelihood Successive State Splitting algorithm (ML-SSS) [4] could be obtained with 20 times fewer parameters.
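The MCE loss mentioned above is commonly built from a misclassification measure that contrasts the correct class's discriminant score with a soft-max over competitor scores, pushed through a sigmoid to make it differentiable; a sketch under that standard formulation (the function name and default smoothing parameters are illustrative):

```python
import math

def mce_loss(g_correct, g_competitors, eta=1.0, gamma=1.0):
    """Smooth 0-1 loss: d > 0 roughly means a misclassification.
    eta controls how sharply the competitor soft-max approximates
    the best competitor; gamma controls the sigmoid slope."""
    m = len(g_competitors)
    # Soft-max (log-mean-exp) of the competing discriminant scores.
    anti = (1.0 / eta) * math.log(
        sum(math.exp(eta * g) for g in g_competitors) / m)
    d = -g_correct + anti                    # misclassification measure
    return 1.0 / (1.0 + math.exp(-gamma * d))  # sigmoid loss in (0, 1)
```

When the correct score dominates, the loss approaches 0; when a competitor dominates, it approaches 1, so gradient descent on the summed loss directly targets classification errors rather than likelihood.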
ABSTRACT
We present a novel approach to hidden Markov model (HMM) state clustering based on the use of broad phone classes and an allophone class entropy measure. Most state-of-the-art large-vocabulary speech recognizers are based on context-dependent (CD) phone HMMs that use Gaussian mixture models for the state-conditioned observation densities. A common approach for robust HMM parameter estimation is to cluster HMM states where each state cluster shares a set of parameters such as the components of a Gaussian mixture model. In all the current state clustering algorithms, the HMM states are clustered only within their respective allophone classes. While this makes some intuitive sense, it prevents the clustering of states across allophone class boundaries, even when the states are acoustically similar. Our algorithm allows clustering across allophone class boundaries by defining broad phone groups within which two states from different allophone classes can be clustered together. An allophone class entropy measure is used to control the clustering of states belonging to different allophone classes. Experimental results on three test sets are presented.
ABSTRACT
Speech recognition requires solving many space and time problems that can have a critical effect on the overall system performance. We describe the use of two general new algorithms [5] that transform recognition networks into equivalent ones that require much less time and space in large-vocabulary speech recognition. The new algorithms generalize classical automata determinization and minimization to deal properly with the probabilities of alternative hypotheses and with the relationships between units (distributions, phones, words) at different levels in the recognition system.
ABSTRACT
Computer speech recognition has been very successful in limited domains and for isolated word recognition. However, widespread use of large-vocabulary continuous-speech recognizers is limited by the speed of current recognizers, which cannot reach acceptable error rates while running in real time. This paper shows how to harness shared-memory multiprocessors, which are becoming increasingly common, to significantly increase the speed, and therefore the accuracy or vocabulary size, of a speech recognizer. We describe the parallelization of an existing high-quality speech recognizer, achieving speedups by factors of 3, 5 and 6 on 4, 8 and 12 processors, respectively, for the benchmark North American Business News (NAB) recognition task.
ABSTRACT
This paper studies algorithms for reducing the computational effort of the mixture density calculations in HMM-based speech recognition systems. These likelihood calculations take about 70% of the total recognition time in the RWTH system for large vocabulary continuous speech recognition. To reduce the computational cost of the likelihood calculations, we investigate several space partitioning methods. A detailed comparison of these techniques is given on the North American Business Corpus (NAB'94) for a 20,000-word task. As a result, the so-called projection search algorithm in combination with the VQ method reduces the cost of likelihood computation by a factor of about 8 with no significant loss in word recognition accuracy.
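The VQ idea above can be sketched as a preselection step (all names, the codebook, and the shortlists below are hypothetical): each frame is quantized to its nearest codeword, and only the densities associated with that cell of the partition are actually scored:

```python
def nearest_codeword(x, codebook):
    """Index of the codeword closest to x (squared Euclidean distance)."""
    def dist2(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(range(len(codebook)), key=lambda i: dist2(codebook[i]))

def preselected_logliks(x, codebook, shortlists, loglik_fn):
    """Quantize the frame, then evaluate only the densities on that
    cell's shortlist; all remaining densities are skipped entirely."""
    cell = nearest_codeword(x, codebook)
    return {g: loglik_fn(g, x) for g in shortlists[cell]}
```

The saving comes from the shortlist being far smaller than the full density set; the quantization itself is cheap relative to scoring thousands of mixtures.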
ABSTRACT
To cope with the prohibitive growth of lexical-tree-based search graphs when using cross-word context-dependent (CD) phone models, a novel, efficient search topology was developed. The lexicon is stored as a compact static network with no language model (LM) information attached to it. The static representation avoids the cost of dynamic tree expansion, facilitates the integration of additional pronunciation information (e.g. assimilation rules), and is easier to integrate into existing search engines. Moreover, the network representation also results in a compact structure when words have alternative pronunciations, and, due to its construction, it offers partial LM forwarding at no extra cost. Next, all knowledge sources (pronunciation information, language model and acoustic models) are combined by a slightly modified token-passing algorithm, resulting in a one-pass time-synchronous recognition system.
ABSTRACT
We present a decision-tree based procedure to quantize the feature space of a speech recognizer, with the motivation of reducing the computation time required for evaluating Gaussians in a speech recognition system. The entire feature space is quantized into non-overlapping regions, where each region is bounded by a number of hyperplanes. Further, each region is characterized by the occurrence of only a small number of the total alphabet of allophones (sub-phonetic speech units); by identifying the region in which a test feature vector lies, only the Gaussians that model the density of allophones that exist in that region need be evaluated. The quantization of the feature space is done in a hierarchical manner using a binary decision tree. Each node of the decision tree represents a region of the feature space, and is further characterized by a hyperplane (a vector v_n and a scalar threshold value h_n) that subdivides the region corresponding to the current node into two non-overlapping regions corresponding to the two children of the current node. Given a test feature vector, the process of finding the region that it lies in involves traversing this binary decision tree, which is computationally inexpensive. We present results of experiments showing that the Gaussian computation time can be reduced by as much as a factor of 20 with negligible degradation in accuracy.
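The traversal described above can be sketched as follows (the `Node` class and the two-leaf example tree are illustrative, not the paper's data structures): each internal node tests one dot product against its threshold, and the leaf reached yields the allophone shortlist to evaluate.

```python
def dot(v, x):
    return sum(a * b for a, b in zip(v, x))

class Node:
    """Internal node: hyperplane (v, h) splitting its region in two.
    Leaf node: the shortlist of allophones observed in that region."""
    def __init__(self, v=None, h=None, left=None, right=None, allophones=None):
        self.v, self.h, self.left, self.right = v, h, left, right
        self.allophones = allophones

def shortlist(root, x):
    """Descend the tree with one dot product per level; the leaf's
    allophone list tells us which Gaussians need be evaluated for x."""
    node = root
    while node.allophones is None:
        node = node.left if dot(node.v, x) <= node.h else node.right
    return node.allophones
```

The cost per frame is one dot product per tree level, i.e. logarithmic in the number of regions, which is negligible next to full Gaussian evaluation.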
ABSTRACT
We describe a sub-vector clustering technique to reduce the memory size and computational cost of continuous density hidden Markov models (CHMMs). Acoustic models in modern large-vocabulary, continuous speech recognition systems are typically CHMMs. Systems with 100,000 Gaussian distributions of 40-60 dimensions are common, needing several tens of MB of memory. Computing HMM state likelihoods is several tens of times slower than real time. We show that by clustering and quantizing the Gaussian distributions a few dimensions at a time, both computation and memory costs can be reduced several fold without significant loss of recognition accuracy. On the 1994 Wall Street Journal 20K test set, this technique reduced the acoustic model size by a factor of 9-10, and HMM state output likelihood computation time by a factor of 4-5.
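A minimal sketch of the sub-vector scoring idea just described, assuming diagonal covariances (the data layout and function names are hypothetical): per frame, every (sub-vector, codeword) pair is scored once, after which each Gaussian's log-likelihood reduces to a sum of table lookups.

```python
import math

def subvector_loglik(x_sub, mean_sub, var_sub):
    """Diagonal-Gaussian log-likelihood restricted to one sub-vector."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (x - m) ** 2 / v)
               for x, m, v in zip(x_sub, mean_sub, var_sub))

def frame_loglik(x, quantized_gaussians, codebooks, subdims):
    """Each Gaussian stores one codeword index per sub-vector.
    Per frame, score each codeword of each sub-codebook once, then
    assemble every Gaussian's log-likelihood from the tables."""
    tables = []
    start = 0
    for sub, book in zip(subdims, codebooks):
        x_sub = x[start:start + sub]
        tables.append([subvector_loglik(x_sub, m, v) for (m, v) in book])
        start += sub
    return [sum(tables[s][idx] for s, idx in enumerate(g))
            for g in quantized_gaussians]
```

Memory shrinks because many Gaussians share each sub-vector codeword, and computation shrinks because the per-frame Gaussian evaluations collapse into additions of precomputed values.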
ABSTRACT
In this paper, the incorporation of path merging within BT's dynamic speech recognition architecture [1] is discussed. One of the disadvantages of dynamic network generation is the size of the network generated. This is to a large extent due to the creation of many duplicate network portions. The use of a path merging strategy can redress this problem to some extent. This paper discusses the theory behind path merging, demonstrating a 22% speed improvement on a typical recognition task with no loss in top-N accuracy.
ABSTRACT
This paper describes the use of an explicit word duration model in the environment of an HMM-based time-asynchronous stack search decoder. The benefit of the method is demonstrated on the task of connected digit recognition. Analysis of typical errors observed on this task suggests that appropriate word duration modeling can improve recognition accuracy. A duration model based on the Gamma distribution, applied as a post-processing step during iterations of the search algorithm, reduces the error rate of the baseline system by 14%.
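A sketch of a Gamma-based duration penalty applied as a rescoring step, in the spirit of the method above (the shape/rate parameterization, the `rescore` signature, and the weighting are assumptions, not the paper's exact formulation):

```python
import math

def gamma_log_duration(d, shape, rate):
    """Log-probability density of a word lasting d frames under a
    Gamma(shape, rate) distribution, using lgamma for the normalizer."""
    return (shape * math.log(rate) + (shape - 1) * math.log(d)
            - rate * d - math.lgamma(shape))

def rescore(path_score, durations, models, weight=1.0):
    """Add a weighted Gamma duration term for each (word, frames) pair
    on a hypothesis, as a post-processing step on its path score."""
    return path_score + weight * sum(
        gamma_log_duration(d, *models[w]) for w, d in durations)
```

Hypotheses whose word durations are implausibly short or long (a common failure mode in connected-digit errors) are thereby penalized relative to well-timed alternatives.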
ABSTRACT
We show that the standard hypothesis scoring paradigm used in maximum-likelihood-based speech recognition systems is not optimal with regard to minimizing the word error rate, the commonly used performance metric in speech recognition. This can lead to sub-optimal performance, especially in high-error-rate environments where word error and sentence error are not necessarily monotonically related. To address this discrepancy, we developed a new algorithm that explicitly minimizes expected word error for recognition hypotheses. First, we approximate the posterior hypothesis probabilities using N-best lists. We then compute the expected word error for each hypothesis with respect to the posterior distribution, and choose the hypothesis with the lowest error. Experiments show improved recognition rates on two spontaneous speech corpora.
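The selection step described above can be sketched as follows, assuming softmax-normalized posteriors over the N-best list and word-level Levenshtein distance (the function names and the score scale are illustrative):

```python
import math

def word_errors(hyp, ref):
    """Levenshtein distance between two word sequences."""
    m, n = len(hyp), len(ref)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cur[j] = min(prev[j] + 1, cur[j - 1] + 1,
                         prev[j - 1] + (hyp[i - 1] != ref[j - 1]))
        prev = cur
    return prev[n]

def min_expected_wer_hyp(nbest, scores, scale=1.0):
    """Approximate posteriors over the N-best list by a softmax of the
    (scaled) hypothesis scores, then return the hypothesis whose
    expected word error under that posterior is smallest."""
    mx = max(scores)
    post = [math.exp(scale * (s - mx)) for s in scores]
    z = sum(post)
    post = [p / z for p in post]
    def expected_error(h):
        return sum(p * word_errors(h, r) for p, r in zip(post, nbest))
    return min(nbest, key=expected_error)
```

Note how this can disagree with the standard maximum-posterior rule: a hypothesis that is close in words to many probable competitors can have lower expected word error than the single top-scoring hypothesis.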
ABSTRACT
In this paper, we describe the new BBN BYBLOS efficient 2-Pass N-Best decoder used for the 1996 Hub-4 Benchmark Tests. The decoder uses a quick fastmatch to determine the likely word endings. Then, in the second pass, it performs a time-synchronous beam search using a detailed continuous-density HMM and a trigram language model to decide the word starting positions. From these word starts, the decoder, without looking at the input speech, constructs a trigram word lattice and generates the top N likely hypotheses. This new 2-pass N-Best decoder maintains recognition performance comparable to the old 4-pass N-Best decoder, while its search strategy is simpler and much more efficient.
ABSTRACT
To improve the performance of continuous speech recognition, it is effective to incorporate grammatical knowledge of the task into a word network in FSN (finite state network) form. However, such networks can require huge memory, so we introduce an efficient memory management method for large word networks: a distributed FSN model and a hierarchical memory model. The system keeps the word network divided into small sub-networks and activates each sub-network when necessary. Using this method, we can recognize continuously spoken sentences of Japanese addresses, which are made up of 390K geographic names, with only 5.6 Mbytes of local memory on average.