Session MAA: Training Techniques and Efficient Decoding in ASR

Chairperson: Jerome Bellegarda, Apple Computer, USA



ACOUSTIC MODELING BASED ON THE MDL PRINCIPLE FOR SPEECH RECOGNITION

Authors: Koichi Shinoda and Takao Watanabe

NEC Corporation 4-1-1 Miyazaki, Miyamae-ku, Kawasaki 216, JAPAN {shinoda,watanabe}@hum.cl.nec.co.jp

Volume 1 pages 99 - 102

ABSTRACT

Recently, context-dependent phone units, such as triphones, have been used to model subword units in speech recognition based on Hidden Markov Models (HMMs). While most such methods employ clustering of the HMM parameters (e.g., subword clustering, state clustering) to control HMM size and thus avoid the poor recognition accuracy that results from insufficient training data, none of them provides an effective criterion for the optimal degree of clustering. This paper proposes a method in which state clustering is accomplished by way of phonetic decision trees and the MDL criterion is used to optimize the degree of clustering. Large-vocabulary Japanese recognition experiments show that the models obtained by this method achieved the highest accuracy among models of various sizes obtained with conventional clustering approaches.
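
As a rough illustration of the MDL criterion described above, here is a minimal Python sketch of the split decision for a phonetic decision tree, assuming single diagonal-covariance Gaussians per state cluster (the function names and variance floor are illustrative, not from the paper):

```python
import numpy as np

def node_ll(frames):
    """ML log-likelihood of frames under one diagonal-covariance Gaussian."""
    n, d = frames.shape
    var = frames.var(axis=0) + 1e-8               # ML variances, floored
    return -0.5 * n * (d * np.log(2.0 * np.pi) + np.log(var).sum() + d)

def mdl_delta(parent, left, right, total_frames):
    """Change in description length if `parent` splits into `left`/`right`.

    Each extra cluster costs (num_params / 2) * log(total_frames), so a split
    is accepted only when the likelihood gain beats that penalty, i.e. when
    the returned value is negative."""
    num_params = 2 * parent.shape[1]              # diagonal Gaussian: mean + var
    ll_gain = node_ll(left) + node_ll(right) - node_ll(parent)
    return -ll_gain + 0.5 * num_params * np.log(total_frames)
```

Tree growing would keep the question with the most negative value and stop when no question yields a negative one, so no held-out data or hand-tuned size threshold is needed.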

A0032.pdf



DISCRIMINATIVE UTTERANCE VERIFICATION USING MULTIPLE CONFIDENCE MEASURES

Authors: Piyush Modi and Mazin Rahim

AT&T Labs 180 Park Avenue, Florham Park, New Jersey 07932-0971, USA Email: piyush@research.att.com, mazin@research.att.com

Volume 1 pages 103 - 106

ABSTRACT

This paper proposes an utterance verification system for hidden Markov model (HMM) based automatic speech recognition systems. A verification objective function, based on a multi-layer-perceptron (MLP), is adopted which combines confidence measures from both the recognition and verification models. Discriminative minimum verification error training is applied for optimizing the parameters of the MLP and the verification models. Our proposed system provides a framework for combining different knowledge sources for utterance verification using an objective function that is consistently applied during both training and testing. Experimental results on telephone-based connected digits are presented.
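
The abstract gives no implementation detail, but a toy sketch of an MLP that fuses several confidence measures into one verification score might look as follows (plain cross-entropy training stands in here for the paper's discriminative minimum verification error criterion; all sizes and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Toy data: each row holds two confidence measures (one from the recognition
# model, one from the verification model); label 1 = accept, 0 = reject.
X = rng.normal(size=(200, 2))
y = (X.sum(axis=1) > 0).astype(float)

W1 = rng.normal(scale=0.5, size=(2, 8)); b1 = np.zeros(8)   # hidden layer
w2 = rng.normal(scale=0.5, size=8); b2 = 0.0                # output unit

lr = 0.5
for _ in range(500):
    h = np.tanh(X @ W1 + b1)
    s = sigmoid(h @ w2 + b2)        # combined confidence score in (0, 1)
    err = (s - y) / len(X)          # gradient at the pre-sigmoid output
    w2 -= lr * h.T @ err; b2 -= lr * err.sum()
    dh = np.outer(err, w2) * (1.0 - h ** 2)
    W1 -= lr * X.T @ dh; b1 -= lr * dh.sum(axis=0)

scores = sigmoid(np.tanh(X @ W1 + b1) @ w2 + b2)
accept = scores > 0.5               # the threshold sets the operating point
```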

A0078.pdf



SUBSPACE DISTRIBUTION CLUSTERING FOR CONTINUOUS OBSERVATION DENSITY HIDDEN MARKOV MODELS

Authors: Enrico Bocchieri and Brian Mak*

AT&T Labs-Research, 180 Park Ave, Florham Park, NJ 07932. (*) Oregon Graduate Institute, 20000 NW Walker Rd, Portland OR, 97006. enrico@research.att.com and mak@research.att.com

Volume 1 pages 107 - 110

ABSTRACT

This paper presents an efficient approximation of the Gaussian mixture state probability density functions of continuous observation density hidden Markov models (CHMMs). In CHMMs, the Gaussian mixtures carry a high computational cost, amounting to a significant fraction (e.g., 30% to 70%) of the total computation. To achieve higher computation and memory efficiency, we approximate the Gaussian mixtures by (a) decomposing them into functions defined on subspaces of the feature space, and (b) clustering the resulting subspace pdf's. Intuitively, when clustering in a subspace of only a few dimensions, even a few function codewords can provide low distortion. We therefore obtain a significant reduction of the total computation (up to a factor of two) and memory savings (up to a factor of twelve), without significant changes in CHMM accuracy.
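
A minimal sketch of the idea, assuming diagonal-covariance Gaussians split into fixed-width sub-vectors and plain k-means on the stacked (mean, log-variance) parameters; the paper's actual distortion measure and codebook sizes may differ:

```python
import numpy as np

def subspace_cluster(means, variances, stream_dims=3, codebook_size=16, iters=10):
    """Cluster the subspace Gaussians of all mixtures, one codebook per stream.

    means, variances: (num_gaussians, dim) diagonal-Gaussian parameters, with
    dim divisible by stream_dims and num_gaussians >= codebook_size.
    """
    n, dim = means.shape
    codebooks, indices = [], []
    for start in range(0, dim, stream_dims):
        sl = slice(start, start + stream_dims)
        points = np.hstack([means[:, sl], np.log(variances[:, sl])])
        cb = points[np.random.choice(n, codebook_size, replace=False)].copy()
        for _ in range(iters):                       # plain k-means
            dist = ((points[:, None, :] - cb[None, :, :]) ** 2).sum(-1)
            assign = dist.argmin(axis=1)
            for k in range(codebook_size):
                if (assign == k).any():
                    cb[k] = points[assign == k].mean(axis=0)
        codebooks.append(cb)
        indices.append(assign)
    return codebooks, np.stack(indices, axis=1)      # per-Gaussian codeword tuples
```

After clustering, each original Gaussian is stored as a short tuple of codeword indices, which is where the memory saving comes from.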

A0106.pdf



A Comparative Study of Methods for Phonetic Decision-Tree State Clustering

Authors: H.J. Nock M.J.F. Gales S.J. Young

Cambridge University Engineering Department, Trumpington Street, Cambridge CB2 1PZ, UK. Tel: [+44] 1223 332800 Fax: [+44] 1223 332662 email: {hjn11,mjfg,sjy}@eng.cam.ac.uk

Volume 1 pages 111 - 114

ABSTRACT

Phonetic decision trees have been widely used for obtaining robust context-dependent models in HMM-based systems. There are five key issues to consider when constructing phonetic decision trees: the alignment of data with the chosen phone classes; the quality of the modelling of the underlying data; the choice of partitioning method at each node; the goodness-of-split criterion and the method for determining appropriate tree sizes. A popular existing method uses efficient but crude approximate methods for each of these. This paper introduces and evaluates more detailed alternatives to the standard approximations.
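
For context, the "efficient but crude" standard approximation evaluates a candidate split from pooled sufficient statistics only, assuming a single diagonal-covariance Gaussian per cluster and a fixed frame/state alignment; a sketch (helper names are illustrative):

```python
import numpy as np

def cluster_ll(occ, sum_x, sum_xx):
    """Approximate cluster log-likelihood from pooled occupancy statistics,
    assuming one diagonal-covariance Gaussian and a fixed frame/state
    alignment -- the crude-but-cheap standard approximation."""
    d = len(sum_x)
    var = sum_xx / occ - (sum_x / occ) ** 2 + 1e-8
    return -0.5 * occ * (np.log(var).sum() + d * (np.log(2.0 * np.pi) + 1.0))

def split_gain(yes_stats, no_stats):
    """Goodness of one phonetic question: likelihood gain of the split.
    Each stats tuple is (occupancy, sum of features, sum of squares)."""
    occ = yes_stats[0] + no_stats[0]
    sx = yes_stats[1] + no_stats[1]
    sxx = yes_stats[2] + no_stats[2]
    return (cluster_ll(*yes_stats) + cluster_ll(*no_stats)
            - cluster_ll(occ, sx, sxx))
```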

A0123.pdf



Comparing Gaussian and Polynomial Classification in SCHMM-Based Recognition Systems

Authors: Alfred Kaltenmeier, Jurgen Franke

Daimler Benz AG, Research Institute, Wilhelm Runge Str. 11, D-89081 Ulm Germany e-mail: kaltenmeier@dbag.ulm.DaimlerBenz.COM

Volume 1 pages 115 - 118

ABSTRACT

Semi-continuous Hidden Markov Models (SCHMMs) with Gaussian distributions are often used in continuous speech and handwriting recognition systems. Our paper compares Gaussian and tree-structured polynomial classifiers, which have been used successfully in pattern recognition for many years. In our system, the binary classifier tree is generated by clustering HMM states using an entropy measure. For handwriting recognition, Gaussians are clearly outperformed by polynomial classification. For speech recognition, however, polynomial classification currently performs slightly worse because some system parameters are not yet optimized.
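
A polynomial classifier in the classical pattern-recognition sense can be sketched in a few lines: expand the features to degree two and solve a least-squares regression onto one-hot class targets (a generic textbook recipe, not the authors' exact tree-structured system):

```python
import numpy as np

def quad_expand(X):
    """Degree-2 polynomial expansion: [1, x_i, x_i * x_j for i <= j]."""
    n, d = X.shape
    cross = [X[:, i] * X[:, j] for i in range(d) for j in range(i, d)]
    return np.hstack([np.ones((n, 1)), X, np.stack(cross, axis=1)])

def train_polynomial_classifier(X, labels, num_classes):
    """Closed-form training: regress polynomial features onto one-hot targets."""
    P = quad_expand(X)
    T = np.eye(num_classes)[labels]              # one-hot target matrix
    A, *_ = np.linalg.lstsq(P, T, rcond=None)
    return A

def classify(A, X):
    return (quad_expand(X) @ A).argmax(axis=1)   # class with largest output
```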

A0135.pdf



MAXIMUM LIKELIHOOD SUCCESSIVE STATE SPLITTING ALGORITHM FOR TIED-MIXTURE HMNET

Authors: Alexandre Girardi, Harald Singer, Kiyohiro Shikano, Satoshi Nakamura

Nara Institute of Science and Technology Takayama-cho 8916-5, Ikoma-shi, Nara-ken 630-01 Japan E-mail: alex-g@is.aist-nara.ac.jp

Volume 1 pages 119 - 122

ABSTRACT

This paper describes a new approach to the ML-SSS (Maximum Likelihood Successive State Splitting) algorithm that uses a tied-mixture representation of the output probability density function, instead of a single Gaussian, during the splitting phase. The tied-mixture representation yields a better state split gain because it can measure differences in the phoneme environment space that standard ML-SSS cannot. With this more informative gain, the new algorithm can choose a better split state and the corresponding data. Phoneme clustering experiments show an error reduction of up to 38% compared to the original ML-SSS algorithm.
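
A hedged sketch of the modified split gain: with a tied codebook, each state is just a weight vector over shared Gaussians, so the gain of a split can be measured with mixture likelihoods rather than single Gaussians. `estimate_weights` (e.g., one EM pass on the weights) is a hypothetical helper, not the paper's exact procedure:

```python
import numpy as np

def tied_mixture_ll(component_lls, weights):
    """Total log-likelihood under a tied-mixture state: all states share one
    Gaussian codebook; only the mixture weights differ between states.
    component_lls: (frames, codebook_size) per-Gaussian log-densities."""
    z = component_lls + np.log(weights)
    m = z.max(axis=1, keepdims=True)
    return (m[:, 0] + np.log(np.exp(z - m).sum(axis=1))).sum()

def tied_split_gain(lls_yes, lls_no, estimate_weights):
    """Gain of a candidate split, scored with tied mixtures instead of single
    Gaussians, so it can register phoneme-environment differences that a
    lone Gaussian averages away."""
    lls_all = np.vstack([lls_yes, lls_no])
    return (tied_mixture_ll(lls_yes, estimate_weights(lls_yes))
            + tied_mixture_ll(lls_no, estimate_weights(lls_no))
            - tied_mixture_ll(lls_all, estimate_weights(lls_all)))
```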

A0148.pdf



STRING-LEVEL MCE FOR CONTINUOUS PHONEME RECOGNITION

Authors: Erik McDermott Shigeru Katagiri

ATR Human Information Processing Res Labs 2-2 Hikari-dai, Seika-cho, Soraku-gun, Kyoto 619-02, Japan

Volume 1 pages 123 - 126

ABSTRACT

In this paper, we present results for the Minimum Classification Error (MCE) [1] framework for discriminative training applied to tasks in continuous phoneme recognition. The results obtained using MCE are compared with results for Maximum Likelihood Estimation (MLE). We examine the ability of MCE to attain high recognition performance with a small number of parameters. Phoneme-level and string-level MCE loss functions were used as the optimization criteria for a Prototype-Based Minimum Error Classifier (PBMEC) [2] and an HMM [3]. The former was optimized using Generalized Probabilistic Descent; the latter, using an approximate second-order method, the Quickprop algorithm. Two databases were used in this evaluation: 1) the ATR 5240 isolated word datasets for 6 speakers, in both speaker-dependent and multi-speaker mode; 2) the TIMIT database. For both databases, MCE training yielded striking gains in performance and classifier compactness compared to MLE baselines. For instance, through MCE training, performance similar to that of the Maximum Likelihood Successive State Splitting algorithm (ML-SSS) [4] could be obtained with 20 times fewer parameters.
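
The MCE machinery the abstract refers to condenses to a few lines: a smooth misclassification measure compares the correct string's score against a soft-max over competitors, and a sigmoid turns it into a differentiable string-error count. A sketch with illustrative smoothing constants:

```python
import numpy as np

def mce_loss_and_grads(g_correct, g_competitors, eta=2.0, gamma=1.0):
    """String-level MCE loss for one training utterance.

    g_correct: discriminant (e.g. log-likelihood) of the correct string;
    g_competitors: scores of the N best incorrect strings. Returns the smooth
    loss and its gradients w.r.t. the scores."""
    z = eta * np.asarray(g_competitors)          # soft-max over competitors
    lse = np.log(np.mean(np.exp(z - z.max()))) + z.max()
    d = -g_correct + lse / eta                   # misclassification measure
    loss = 1.0 / (1.0 + np.exp(-gamma * d))      # smooth 0/1 string error
    dl_dd = gamma * loss * (1.0 - loss)
    w = np.exp(z - z.max()); w /= w.sum()        # competitor responsibilities
    return loss, -dl_dd, dl_dd * w               # dL/dg_correct, dL/dg_competitors
```

GPD then follows these gradients back into whichever parameters (PBMEC prototypes or HMM parameters) produced the scores.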

A0176.pdf



HMM STATE CLUSTERING ACROSS ALLOPHONE CLASS BOUNDARIES

Authors: Ze'ev Rivlin, Ananth Sankar, and Harry Bratt

Speech Technology And Research Laboratory SRI International Menlo Park, California 94025 U.S.A. {zev,sankar,harry}@speech.sri.com

Volume 1 pages 127 - 130

ABSTRACT

We present a novel approach to hidden Markov model (HMM) state clustering based on the use of broad phone classes and an allophone class entropy measure. Most state-of-the-art large-vocabulary speech recognizers are based on context-dependent (CD) phone HMMs that use Gaussian mixture models for the state-conditioned observation densities. A common approach for robust HMM parameter estimation is to cluster HMM states where each state cluster shares a set of parameters such as the components of a Gaussian mixture model. In all the current state clustering algorithms, the HMM states are clustered only within their respective allophone classes. While this makes some intuitive sense, it prevents the clustering of states across allophone class boundaries, even when the states are acoustically similar. Our algorithm allows clustering across allophone class boundaries by defining broad phone groups within which two states from different allophone classes can be clustered together. An allophone class entropy measure is used to control the clustering of states belonging to different allophone classes. Experimental results on three test sets are presented.
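
A minimal sketch of the gating logic described above; the broad phone groups, the post-merge class counts, and the entropy threshold are illustrative assumptions:

```python
import numpy as np

def class_entropy(counts):
    """Entropy (bits) of the allophone-class occupancy mix in a cluster."""
    p = np.asarray(counts, dtype=float)
    p = p[p > 0] / p.sum()
    return -(p * np.log2(p)).sum()

def may_merge(phone_a, phone_b, broad_group, merged_counts, max_entropy):
    """Permit a cross-allophone state merge only within one broad phone group,
    and only while the merged cluster's class mix stays below an entropy
    budget (the thresholding scheme here is an illustrative assumption)."""
    return (broad_group[phone_a] == broad_group[phone_b]
            and class_entropy(merged_counts) <= max_entropy)
```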

A0207.pdf



Weighted Determinization and Minimization for Large Vocabulary Speech Recognition

Authors: Mehryar Mohri Michael Riley

AT&T Labs – Research, 180 Park Avenue, Florham Park, NJ 07932-0971, USA

Volume 1 pages 131 - 134

ABSTRACT

Speech recognition requires solving many space and time problems that can have a critical effect on the overall system performance. We describe the use of two general new algorithms [5] that transform recognition networks into equivalent ones that require much less time and space in large-vocabulary speech recognition. The new algorithms generalize classical automata determinization and minimization to deal properly with the probabilities of alternative hypotheses and with the relationships between units (distributions, phones, words) at different levels in the recognition system.
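
For intuition, a toy weighted determinization over the tropical (min, +) semiring for an acceptor; the published algorithms also handle transducers and minimization, which this sketch omits:

```python
from collections import defaultdict, deque

def determinize(arcs, start, finals):
    """Weighted subset construction over the tropical (min, +) semiring.

    arcs: state -> list of (label, weight, next_state); finals: state -> weight.
    Terminates for acyclic (more generally, determinizable) automata."""
    start_set = frozenset([(start, 0.0)])
    queue, seen = deque([start_set]), {start_set}
    det_arcs, det_finals = {}, {}
    while queue:
        S = queue.popleft()
        by_label = defaultdict(list)
        for state, residual in S:
            for label, w, nxt in arcs.get(state, []):
                by_label[label].append((nxt, residual + w))
        det_arcs[S] = []
        for label, pairs in by_label.items():
            w_min = min(w for _, w in pairs)           # weight pushed onto the arc
            residuals = {}
            for nxt, w in pairs:                       # leftover weight per state
                residuals[nxt] = min(w - w_min, residuals.get(nxt, float("inf")))
            T = frozenset(residuals.items())
            det_arcs[S].append((label, w_min, T))
            if T not in seen:
                seen.add(T); queue.append(T)
        fin = [r + finals[q] for q, r in S if q in finals]
        if fin:
            det_finals[S] = min(fin)
    return start_set, det_arcs, det_finals
```

Each deterministic state is a set of (state, residual-weight) pairs, so at most one arc leaves a state per label while path weights are preserved.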

A0234.pdf



Parallel Speech Recognition

Authors: Steven Phillips and Anne Rogers

AT&T Labs-Research, 180 Park Ave, PO Box 971, Florham Park, NJ 07932-0971 email: {phillips,amr}@research.att.com

Volume 1 pages 135 - 138

ABSTRACT

Computer speech recognition has been very successful in limited domains and for isolated word recognition. However, widespread use of large-vocabulary continuous-speech recognizers is limited by the speed of current recognizers, which cannot reach acceptable error rates while running in real time. This paper shows how to harness shared-memory multiprocessors, which are becoming increasingly common, to significantly increase the speed, and therefore the accuracy or vocabulary size, of a speech recognizer. We describe the parallelization of an existing high-quality speech recognizer, achieving speedups of 3, 5 and 6 on 4, 8 and 12 processors, respectively, on the benchmark North American business news (NAB) recognition task.
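
The data-parallel part is easy to picture: partition the active Gaussians across worker threads on the shared-memory machine and concatenate the scores. A hedged sketch of that part only (NumPy releases the GIL inside its kernels, so threads give real speedup; the paper parallelizes the search itself as well):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def log_gaussians(frame, means, inv_vars, consts):
    """Diagonal-Gaussian log-densities for one block of states."""
    diff = frame - means
    return consts - 0.5 * np.einsum("ij,ij->i", diff * inv_vars, diff)

def parallel_state_likelihoods(frame, blocks, pool):
    """Score the active states block-by-block on a thread pool and reassemble."""
    futures = [pool.submit(log_gaussians, frame, m, v, c) for m, v, c in blocks]
    return np.concatenate([f.result() for f in futures])

# pool = ThreadPoolExecutor(max_workers=8)   # e.g. one worker per processor
```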

A0242.pdf



FAST LIKELIHOOD COMPUTATION METHODS FOR CONTINUOUS MIXTURE DENSITIES IN LARGE VOCABULARY SPEECH RECOGNITION

Authors: Stefan Ortmanns, Thorsten Firzlaff and Hermann Ney

Lehrstuhl für Informatik VI, RWTH Aachen - University of Technology, D-52056 Aachen, Germany

Volume 1 pages 139 - 142

ABSTRACT

This paper studies algorithms for reducing the computational effort of the mixture density calculations in HMM-based speech recognition systems. These likelihood calculations take about 70% of the total recognition time in the RWTH system for large vocabulary continuous speech recognition. To reduce their cost, we investigate several space partitioning methods. A detailed comparison of these techniques is given on the North American Business Corpus (NAB'94) for a 20,000-word task. As a result, the so-called projection search algorithm combined with the VQ method reduces the cost of likelihood computation by a factor of about 8 with no significant loss in word recognition accuracy.
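
A sketch of the VQ half of that combination: precompute, for each codeword of a coarse codebook, a shortlist of nearby Gaussians, then evaluate only the shortlist for each frame (shortlist size and the floor value are illustrative):

```python
import numpy as np

def build_shortlists(codebook, means, keep=64):
    """For every VQ codeword, precompute the Gaussians whose means lie
    closest to it; only those are evaluated exactly at decode time."""
    d = ((codebook[:, None, :] - means[None, :, :]) ** 2).sum(-1)
    return np.argsort(d, axis=1)[:, :keep]

def fast_likelihoods(x, codebook, shortlists, means, inv_vars, consts, floor):
    cw = ((codebook - x) ** 2).sum(-1).argmin()    # nearest codeword
    scores = np.full(len(means), floor)            # unselected densities floored
    idx = shortlists[cw]
    diff = x - means[idx]
    scores[idx] = consts[idx] - 0.5 * ((diff ** 2) * inv_vars[idx]).sum(-1)
    return scores
```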

A0335.pdf



A STATIC LEXICON NETWORK REPRESENTATION FOR CROSS-WORD CONTEXT DEPENDENT PHONES

Authors: Kris Demuynck, Jacques Duchateau and Dirk Van Compernolle

K. U. Leuven - ESAT., Kardinaal Mercierlaan 94, B-3001 Heverlee, Belgium E-mail: Kris.Demuynck@esat.kuleuven.ac.be

Volume 1 pages 143 - 146

ABSTRACT

To cope with the prohibitive growth of lexical-tree-based search graphs when using cross-word context dependent (CD) phone models, a novel, efficient search topology was developed. The lexicon is stored as a compact static network with no language model (LM) information attached to it. The static representation avoids the cost of dynamic tree expansion, facilitates the integration of additional pronunciation information (e.g. assimilation rules), and is easier to integrate into existing search engines. Moreover, the network representation remains compact when words have alternative pronunciations and, by construction, offers partial LM forwarding at no extra cost. All knowledge sources (pronunciation information, language model and acoustic models) are then combined by a slightly modified token-passing algorithm, resulting in a one-pass time-synchronous recognition system.
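
The token-passing step can be sketched generically as follows, with pronunciation and (partially forwarded) LM scores assumed to be precompiled into the static network's arc weights; this is a plain textbook version, not the authors' modified variant:

```python
import math
from collections import defaultdict

def token_passing(network, start, frames, acoustic_score):
    """Time-synchronous token passing over a static lexicon network.

    network: node -> list of (next_node, arc_score) arcs.
    acoustic_score(node, frame): log-likelihood of the frame in the HMM
    state attached to the node."""
    tokens = {start: (0.0, [])}                      # node -> (score, history)
    for frame in frames:
        best = defaultdict(lambda: (-math.inf, None))
        for node, (score, hist) in tokens.items():
            for nxt, arc in network.get(node, []):
                s = score + arc + acoustic_score(nxt, frame)
                if s > best[nxt][0]:                 # keep one token per node
                    best[nxt] = (s, hist + [nxt])
        tokens = dict(best)
    return max(tokens.values(), key=lambda t: t[0])  # best final token
```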

A0365.pdf



DECISION-TREE BASED QUANTIZATION OF THE FEATURE SPACE OF A SPEECH RECOGNIZER

Authors: M. Padmanabhan, L. R. Bahl, D. Nahamoo, P. de Souza

IBM T. J. Watson Research Center P. O. Box 218, Yorktown Heights, NY 10598

Volume 1 pages 147 - 150

ABSTRACT

We present a decision-tree based procedure to quantize the feature space of a speech recognizer, with the motivation of reducing the computation time required for evaluating Gaussians in a speech recognition system. The entire feature space is quantized into non-overlapping regions, each bounded by a number of hyperplanes. Further, each region is characterized by the occurrence of only a small number of the total alphabet of allophones (sub-phonetic speech units); by identifying the region in which a test feature vector lies, only the Gaussians that model the density of allophones occurring in that region need be evaluated. The quantization of the feature space is done in a hierarchical manner using a binary decision tree. Each node of the decision tree represents a region of the feature space and is further characterized by a hyperplane (a vector v_n and a scalar threshold value h_n) that subdivides the region corresponding to the current node into two non-overlapping regions corresponding to the node's two children. Given a test feature vector, finding the region in which it lies involves traversing this binary decision tree, which is computationally inexpensive. We present experimental results showing that the Gaussian computation time can be reduced by as much as a factor of 20 with negligible degradation in accuracy.
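
A sketch of the decode-time traversal, using hypothetical `node.v` / `node.h` / `leaf.allophones` fields for the hyperplanes and the per-region allophone lists:

```python
import numpy as np

def find_region(x, root):
    """Walk the tree of hyperplanes (v_n, h_n): one dot product per level."""
    node = root
    while not node.is_leaf:
        node = node.left if x @ node.v < node.h else node.right
    return node

def shortlisted_scores(x, root, gaussians):
    """Evaluate only the Gaussians of allophones that occur in x's region;
    the recognizer backs off to a floor score for all other allophones."""
    leaf = find_region(x, root)
    scores = {}
    for g in leaf.allophones:                 # small subset of the full set
        m = gaussians[g]
        diff = x - m.mean
        scores[g] = m.const - 0.5 * ((diff ** 2) * m.inv_var).sum()
    return scores
```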

A0451.pdf



SUB-VECTOR CLUSTERING TO IMPROVE MEMORY AND SPEED PERFORMANCE OF ACOUSTIC LIKELIHOOD COMPUTATION

Authors: M. Ravishankar, R. Bisiani* and E. Thayer

School of Computer Science, Carnegie Mellon University, Pittsburgh, PA-15213, USA. *Dept. of Computer Science, University of Milan, Italy Tel. +1 412 268 3344, FAX: +1 412 268 5576, E-mail: rkm@cs.cmu.edu

Volume 1 pages 151 - 154

ABSTRACT

We describe a sub-vector clustering technique to reduce the memory size and computational cost of continuous density hidden Markov models (CHMMs). Acoustic models in modern large-vocabulary, continuous speech recognition systems are typically CHMMs. Systems with 100,000 Gaussian distributions of 40-60 dimensions are common, needing several tens of MB of memory. Computing HMM state likelihoods is several tens of times slower than real time. We show that by clustering and quantizing the Gaussian distributions a few dimensions at a time, both computation and memory costs can be reduced several fold without significant loss of recognition accuracy. On the 1994 Wall Street Journal 20K test set, this technique reduced the acoustic model size by a factor of 9-10, and HMM state output likelihood computation time by a factor of 4-5.
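
The decode-time payoff is easy to sketch: all sub-vector codewords are scored once per frame, after which each Gaussian's log-density is a sum of table lookups (the field names below are illustrative):

```python
import numpy as np

def score_codebooks(x, codebooks, slices):
    """Score every codeword of every sub-vector codebook once per frame."""
    tables = []
    for cb, sl in zip(codebooks, slices):
        diff = x[sl] - cb["mean"]                    # (codewords, sub_dims)
        tables.append(cb["const"] - 0.5 * ((diff ** 2) * cb["inv_var"]).sum(-1))
    return tables

def gaussian_ll(codeword_ids, tables):
    """A quantized Gaussian is just its per-sub-vector codeword indices, so
    its log-density is a sum of table lookups, with no per-Gaussian math."""
    return sum(t[i] for t, i in zip(tables, codeword_ids))
```

Per-frame Gaussian arithmetic thus scales with the number of codewords, not with the 100,000 distributions.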

A0601.pdf



The Incorporation of Path Merging in a Dynamic Network Recogniser

Authors: Simon Hovell

Speech Technology Unit, BT Laboratories, Martlesham Heath, Suffolk, England. simon.hovell@bt-sys.bt.co.uk

Volume 1 pages 155 - 158

ABSTRACT

In this paper, the incorporation of path merging within BT's dynamic speech recognition architecture [1] is discussed. One of the disadvantages of dynamic network generation is the size of the network generated, largely due to the creation of many duplicate network portions. A path merging strategy can redress this problem to some extent. This paper discusses the theory behind path merging, demonstrating a 22% speed improvement on a typical recognition task with no loss in top-N accuracy.
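
The core of such a strategy can be sketched as memoized network expansion: portions that would be exact duplicates are built once and shared (the key fields are illustrative; BT's actual merge test may differ):

```python
def expand(node, build_portion, cache):
    """Dynamic network expansion with path merging: a network portion is
    identified here by (grammar state, phonetic context); exact duplicates
    are built once and shared instead of regenerated for every path."""
    key = (node.grammar_state, node.context)   # illustrative merge key
    if key not in cache:
        cache[key] = build_portion(node)
    return cache[key]
```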

A0707.pdf



IMPROVEMENT ON CONNECTED DIGITS RECOGNITION USING DURATION CONSTRAINTS IN THE ASYNCHRONOUS DECODING SCHEME

Authors: Miroslav Novak

IBM Watson Research Center - Human Language Technologies Group P.O. Box 218, Yorktown Heights, NY 10598, USA email: novak@watson.ibm.com

Volume 1 pages 159 - 162

ABSTRACT

This paper describes the use of an explicit word duration model in the environment of an HMM-based, time-asynchronous stack search decoder. The benefit of the method is demonstrated on the task of connected digit recognition. Analysis of typical errors observed on this task suggests that appropriate word duration modeling can improve recognition accuracy. A duration model based on the Gamma distribution, applied as a post-processing step during iterations of the search algorithm, reduces the error rate of the baseline system by 14%.
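
A minimal sketch of the duration penalty, assuming per-word Gamma parameters estimated from training alignments and a tunable weight (both assumptions, not details from the paper):

```python
import math

def log_gamma_pdf(dur, shape, rate):
    """Log Gamma density used as the word-duration score (dur > 0 frames)."""
    return (shape * math.log(rate) - math.lgamma(shape)
            + (shape - 1.0) * math.log(dur) - rate * dur)

def rescore_with_durations(path_score, word_durations, gamma_params, weight=1.0):
    """Add the duration penalty for each (word, duration) on a hypothesized
    path, as a post-processing step while the stack decoder compares paths."""
    penalty = sum(log_gamma_pdf(d, *gamma_params[w]) for w, d in word_durations)
    return path_score + weight * penalty
```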

A0850.pdf



EXPLICIT WORD ERROR MINIMIZATION IN N-BEST LIST RESCORING

Authors: Andreas Stolcke Yochai Konig Mitchel Weintraub

Speech Technology and Research Laboratory SRI International, Menlo Park, CA, U.S.A. http://www.speech.sri.com/ {stolcke,konig,mw}@speech.sri.com

Volume 1 pages 163 - 166

ABSTRACT

We show that the standard hypothesis scoring paradigm used in maximum-likelihood-based speech recognition systems is not optimal with regard to minimizing the word error rate, the commonly used performance metric in speech recognition. This can lead to sub-optimal performance, especially in high-error-rate environments where word error and sentence error are not necessarily monotonically related. To address this discrepancy, we developed a new algorithm that explicitly minimizes expected word error for recognition hypotheses. First, we approximate the posterior hypothesis probabilities using N-best lists. We then compute the expected word error for each hypothesis with respect to the posterior distribution, and choose the hypothesis with the lowest error. Experiments show improved recognition rates on two spontaneous speech corpora.
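
The algorithm is compact enough to sketch in full: turn scaled N-best scores into approximate posteriors, then pick the hypothesis whose posterior-weighted edit distance to the other entries is smallest (the posterior scale is an illustrative tuning constant):

```python
import numpy as np

def edit_distance(a, b):
    """Word-level Levenshtein distance between two hypotheses (word lists)."""
    d = np.arange(len(b) + 1)
    for i, wa in enumerate(a, 1):
        prev, d[0] = d[0], i
        for j, wb in enumerate(b, 1):
            prev, d[j] = d[j], min(d[j] + 1, d[j - 1] + 1, prev + (wa != wb))
    return d[-1]

def min_expected_wer_hypothesis(nbest, scores, scale=0.05):
    """Pick the N-best entry minimizing expected word error, not max posterior.

    nbest: list of word lists; scores: combined acoustic+LM log scores,
    treated as approximate log posteriors once scaled and normalized."""
    z = scale * np.asarray(scores)
    post = np.exp(z - z.max()); post /= post.sum()
    expected = [sum(p * edit_distance(hyp, other)
                    for p, other in zip(post, nbest)) for hyp in nbest]
    return nbest[int(np.argmin(expected))]
```

When one hypothesis dominates the posterior this reduces to the usual best-scoring choice; the two decisions diverge exactly in the high-error-rate regime the abstract targets.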

A0925.pdf



EFFICIENT 2-PASS N-BEST DECODER

Authors: Long Nguyen, Richard Schwartz

BBN Systems & Technologies 70 Fawcett Street Cambridge, MA 02138, USA. ln@bbn.com

Volume 1 pages 167 - 170

ABSTRACT

In this paper, we describe the new BBN BYBLOS efficient 2-pass N-Best decoder used for the 1996 Hub-4 Benchmark Tests. The decoder uses a quick fastmatch to determine the likely word endings. In the second pass, it performs a time-synchronous beam search, using a detailed continuous-density HMM and a trigram language model, to decide the word starting positions. From these word starts, the decoder, without looking at the input speech, constructs a trigram word lattice and generates the top N likely hypotheses. This new 2-pass N-Best decoder maintains recognition performance comparable to that of the old 4-pass N-Best decoder, while its search strategy is simpler and much more efficient.

A0988.pdf



A MEMORY MANAGEMENT METHOD FOR A LARGE WORD NETWORK

Authors: T. Iwasaki and Y. Abe

Human Media Technology Dept. Information Technology R&D Center MITSUBISHI Electric Corp. 5-1-1, Ofuna, Kamakura, Kanagawa, 247, Japan

Volume 1 pages 171 - 174

ABSTRACT

To improve the performance of continuous speech recognition, it is effective to incorporate grammatical knowledge of the task into a word network in FSN (finite state network) form. However, such networks can require a huge amount of memory, so we introduce an efficient memory management method for large word networks: a distributed FSN model and a hierarchical memory model. The system keeps the word network divided into small sub-networks and activates each sub-network only when necessary. Using this method, we can recognize continuously spoken Japanese address sentences, drawn from 390K geographic names, with only 5.6 Mbytes of local memory on average.
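
A hedged sketch of the activation machinery: sub-networks live on disk and are paged in when the search first needs them; the LRU eviction policy below is an illustrative stand-in for the paper's hierarchical memory model:

```python
import pickle
from collections import OrderedDict

class SubNetworkCache:
    """Keep the FSN split into sub-networks on disk; load each sub-network
    when the search first activates it and evict the least recently used
    one when the local-memory budget is exceeded."""

    def __init__(self, path_for, max_resident=32):
        self.path_for = path_for                  # sub-network id -> file path
        self.max_resident = max_resident
        self.resident = OrderedDict()             # id -> loaded sub-network

    def activate(self, net_id):
        if net_id in self.resident:
            self.resident.move_to_end(net_id)     # mark as recently used
        else:
            with open(self.path_for(net_id), "rb") as f:
                self.resident[net_id] = pickle.load(f)
            if len(self.resident) > self.max_resident:
                self.resident.popitem(last=False) # evict LRU sub-network
        return self.resident[net_id]
```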

A1184.pdf
