Acoustic Modeling


Rescoring under Fuzzy Degrees with a Multilayer Neural Network in a Rule-Based Speech Recognition System

Authors:

Olivier Oppizzi, University of Avignon (France)
Régis Quélavoine, University of Avignon (France)

Volume 3, Page 1723

Abstract:

In this paper, a speech rescoring system is developed on a set of phonetic hypotheses produced by a bottom-up knowledge-based decoder. An original method to automatically compute a fuzzy membership function from the statistics of top-down acoustic rules is compared with a possibilistic measure. To aggregate the fuzzy degrees into a phonetic score, a multilayer neural network is trained on the results of all the rules in order to detect how these rules characterize different phonemes and thereby to give a weight to each rule. Rescoring performance of top-down rules for fricatives will be discussed on an isolated-word speech database of French with 1000 utterances pronounced by five speakers.

ic971723.pdf




Optimization Of HMM By A Genetic Algorithm

Authors:

Chak-Wai Chau, City University of Hong Kong (Hong Kong)
S. Kwong, City University of Hong Kong (Hong Kong)
C.K. Diu, Department of Applied Computing (Hong Kong)
W.R. Fahrner, FernUniversitaet (Germany)

Volume 3, Page 1727

Abstract:

The Hidden Markov Model (HMM) is a natural and highly robust statistical methodology for automatic speech recognition. It has also been tested and proven in a wide range of applications. The model parameters of the HMM are essential in describing the behavior of the utterance of the speech segments. Many successful heuristic algorithms have been developed to optimize the model parameters so that they best describe the training observation sequences. In practice, however, all of these methods find only one local maximum. None of them can recover from that local maximum to reach the global maximum or a better local maximum. In this paper, a stochastic search method called the Genetic Algorithm (GA) is presented for HMM training. The GA mimics natural evolution and performs a global search within the defined search space. Experimental results showed that using the GA for HMM training (GA-HMM training) performs better than using other heuristic algorithms.
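
As a rough illustration of the idea, the sketch below runs a small genetic search over the parameters of a discrete HMM, using the log-likelihood from the scaled forward algorithm as fitness. The chromosome encoding, the row-wise crossover, the Gaussian mutation, the elitist selection, and all names and settings are assumptions made for this example; the abstract does not specify the authors' operators.

```python
import math
import random
from typing import List, Tuple

Matrix = List[List[float]]
Model = Tuple[List[float], Matrix, Matrix]  # (initial probs, transitions, emissions)


def normalize(row: List[float]) -> List[float]:
    """Rescale a row of positive numbers so that it sums to 1."""
    total = sum(row)
    return [x / total for x in row]


def random_model(n_states: int, n_symbols: int, rng: random.Random) -> Model:
    """Draw a random discrete HMM with strictly positive probabilities."""
    def row(n: int) -> List[float]:
        return normalize([rng.random() + 1e-3 for _ in range(n)])
    return (row(n_states),
            [row(n_states) for _ in range(n_states)],
            [row(n_symbols) for _ in range(n_states)])


def log_likelihood(model: Model, obs: List[int]) -> float:
    """Log P(obs | model), computed with the scaled forward algorithm."""
    pi, A, B = model
    n = len(pi)
    alpha = [pi[i] * B[i][obs[0]] for i in range(n)]
    scale = sum(alpha)
    loglik = math.log(scale)
    alpha = [a / scale for a in alpha]
    for o in obs[1:]:
        alpha = [sum(alpha[i] * A[i][j] for i in range(n)) * B[j][o]
                 for j in range(n)]
        scale = sum(alpha)
        loglik += math.log(scale)
        alpha = [a / scale for a in alpha]
    return loglik


def mutate(model: Model, rng: random.Random, rate: float = 0.1) -> Model:
    """Add Gaussian noise to every probability, then renormalize each row."""
    def jitter(row: List[float]) -> List[float]:
        return normalize([max(x + rng.gauss(0.0, rate), 1e-6) for x in row])
    pi, A, B = model
    return jitter(pi), [jitter(r) for r in A], [jitter(r) for r in B]


def crossover(p1: Model, p2: Model, rng: random.Random) -> Model:
    """Row-wise uniform crossover: each child row is copied from one parent."""
    def pick(r1: List[float], r2: List[float]) -> List[float]:
        return r1 if rng.random() < 0.5 else r2
    return (pick(p1[0], p2[0]),
            [pick(a, b) for a, b in zip(p1[1], p2[1])],
            [pick(a, b) for a, b in zip(p1[2], p2[2])])


def ga_train(obs: List[int], n_states: int = 2, n_symbols: int = 2,
             pop_size: int = 20, generations: int = 30, seed: int = 0):
    """Elitist GA: keep the best quarter, refill with mutated offspring."""
    rng = random.Random(seed)
    population = [random_model(n_states, n_symbols, rng) for _ in range(pop_size)]
    history = []  # best log-likelihood seen at the start of each generation
    for _ in range(generations):
        population.sort(key=lambda m: log_likelihood(m, obs), reverse=True)
        history.append(log_likelihood(population[0], obs))
        elite = population[: pop_size // 4]
        children = []
        while len(elite) + len(children) < pop_size:
            p1, p2 = rng.sample(elite, 2)
            children.append(mutate(crossover(p1, p2, rng), rng))
        population = elite + children
    best = max(population, key=lambda m: log_likelihood(m, obs))
    history.append(log_likelihood(best, obs))
    return best, history


# Toy observation sequence over two symbols (illustrative only).
observations = [0, 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1]
best_model, best_history = ga_train(observations)
```

Because the elite individuals are carried over unchanged, the best log-likelihood in `best_history` never decreases from one generation to the next.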

ic971727.pdf




Inference of Variable-length Acoustic Units for Continuous Speech Recognition

Authors:

Sabine Deligne, Telecom Paris (France)
Frédéric Bimbot, Telecom Paris (France)

Volume 3, Page 1731

Abstract:

In the field of speech recognition, the patterns assumed to structure the speech material (phonemes, triphones, words...) are defined a priori according to a linguistic criterion, whereas the recognition criterion is based on an acoustic similarity measure. This may result in a lack of consistency for the recognition units. In this paper, we explore the possibility of a more data-driven approach, where recognition units are derived according to an acoustic criterion and then mapped to variable-length sequences of phonemes in an unsupervised way. Continuous speech recognition experiments are reported to evaluate the consistency of those units as opposed to linguistically defined units.

ic971731.pdf




Comparative Performance Analysis of Statistical Trajectory Models in Cellular Environment

Authors:

Bojan Petek, University of Ljubljana (Slovenia)
Ove Andersen, CPK, Aalborg University (Denmark)
Paul Dalsgaard, CPK, Aalborg University (Denmark)

Volume 3, Page 1735

Abstract:

Two systems (Statistical Trajectory Models (STM) and continuous density HMMs) utilizing three preprocessing methodologies (MFCC, RASTA and FBDYN) were evaluated on two databases, namely CTIMIT and the corresponding downsampled TIMIT. Within the bounds of the experimental setup, the comparative performance analysis showed that the STM significantly outperforms the HMM system on the CTIMIT database. Specifically, the performance of the STM system was found to be at least 10% better than that obtained by the HMM system when the RASTA preprocessing was used. The performance of both systems with FBDYN parametrization was found to be inferior to that obtained with MFCC and RASTA. On the other hand, in low-noise conditions on the TIMIT database, FBDYN yielded an improved performance for the HMM system, whereas STM achieved the best results with the MFCC parametrization.

ic971735.pdf




Inter-Digit HMM Connected Digit Recognition Using the Macrophone Corpus

Authors:

Yu-Hung Kao, Texas Instruments (U.S.A.)
Lorin P. Netsch, Texas Instruments (U.S.A.)

Volume 3, Page 1739

Abstract:

Continuous digit recognition over the telephone channel is a key technology for many telecommunications applications such as voice dialing, automatic banking, and credit card number entry. Speech recognizers usually achieve high performance by modeling the acoustics in Hidden Markov Models (HMMs) using large numbers of multivariate Gaussian mixtures with assumed diagonal covariance in order to model the variability of different speakers and channel conditions. In this paper, we present a system that uses single-mixture, 16-feature Gaussian distributions with an assumed identity covariance to achieve 1.0% word error and 5.7% sentence error rate on the Macrophone corpus. We found that inter-digit modeling, discriminant training, and per-utterance adaptation can each contribute about a 30% reduction in error rate. Using this approach, we can realize a system with relatively low memory requirements.
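
With an identity covariance, the Gaussian log-density depends on the feature vector only through its squared Euclidean distance to the mean, so each distribution needs to store just a mean vector. A minimal sketch of that scoring follows; the function names and the toy vectors are hypothetical and are not taken from the paper.

```python
import math
from typing import List, Sequence


def log_gaussian_identity(x: Sequence[float], mean: Sequence[float]) -> float:
    """Log-density of N(mean, I) at x: -(d/2)*log(2*pi) - 0.5*||x - mean||^2."""
    if len(x) != len(mean):
        raise ValueError("x and mean must have the same dimension")
    sq_dist = sum((xi - mi) ** 2 for xi, mi in zip(x, mean))
    return -0.5 * len(x) * math.log(2.0 * math.pi) - 0.5 * sq_dist


def best_state(x: Sequence[float], means: List[Sequence[float]]) -> int:
    """Index of the mean with the highest log-density, i.e. the nearest mean."""
    return max(range(len(means)),
               key=lambda k: log_gaussian_identity(x, means[k]))


# Toy 16-dimensional example (values are illustrative only).
means = [[0.0] * 16, [1.0] * 16]
frame = [0.9] * 16
winner = best_state(frame, means)
```

Since the normalizing constant is the same for every distribution, ranking by log-density is the same as ranking by distance to the mean.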

ic971739.pdf




Wide Context Acoustic Modeling in Read vs. Spontaneous Speech

Authors:

Michael Finke, University of Karlsruhe (Germany)
Ivica Rogina, University of Karlsruhe (Germany)

Volume 3, Page 1743

Abstract:

Context-dependent acoustic models have been applied in speech recognition research for many years, and have been shown to increase the recognition accuracy significantly. The most common approach is to use triphones. Recently, several speech recognition groups have started investigating the use of larger phonetic context windows when building acoustic models. In this paper we discuss some of the computational problems arising from wide context modeling (polyphonic modeling) and present methods to cope with these problems. A two stage decision tree based polyphonic clustering approach is described which implements a more flexible parameter tying scheme. The new clustering approach gave us significant improvement across all tasks - WSJ, SWB, and Spontaneous Scheduling Task - and across all languages involved (German, Spanish, English). We report recognition results based on the JANUS speech recognition toolkit on two tasks comparing acoustic context phenomena in English read versus spontaneous speech. We used our WSJ 60K recognizer and the JANUS SWB 10K polyphonic recognizer.

ic971743.pdf




Performance of Hybrid MMI-Connectionist / HMM Systems on the WSJ Speech Database

Authors:

Jörg Rottland, Duisburg University (Germany)
Christoph Neukirchen, Duisburg University (Germany)
Daniel Willett, Duisburg University (Germany)

Volume 3, Page 1747

Abstract:

In this paper, a hybrid MMI-connectionist / hidden Markov model (HMM) speech recognition system for the Wall Street Journal (WSJ) database is presented. The HMM part of this system uses discrete probability density functions (pdf). The neural network (NN) is used to replace a classical vector quantizer (VQ), such as the k-means or LBG algorithm, of the kind typically used in discrete HMM systems. The NN is trained with an algorithm that tries to achieve maximum mutual information (MMI) between the generated output labels and the underlying phonetic description. The system has been trained and tested on the five-thousand-word speaker-independent WSJ task. The error rates of the MMI-connectionist approach are 21% lower than the error rates of a k-means system. The system achieves error rates that have previously been achieved only by the best continuous/semi-continuous HMM speech recognizers, with the advantage of a faster recognition algorithm.
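
The MMI criterion rewards output labels that carry as much information as possible about the phonetic classes. The sketch below only estimates that quantity from a table of co-occurrence counts; the function name and the toy counts are assumptions for illustration, and the network training procedure itself is not reproduced here.

```python
import math
from typing import List


def mutual_information(counts: List[List[float]]) -> float:
    """Mutual information in bits between row and column variables.

    counts[i][j] is how often label i co-occurred with phonetic class j.
    """
    total = sum(sum(row) for row in counts)
    row_tot = [sum(row) for row in counts]
    col_tot = [sum(col) for col in zip(*counts)]
    mi = 0.0
    for i, row in enumerate(counts):
        for j, c in enumerate(row):
            if c > 0:  # zero cells contribute nothing to the sum
                p_joint = c / total
                p_indep = (row_tot[i] / total) * (col_tot[j] / total)
                mi += p_joint * math.log2(p_joint / p_indep)
    return mi


# Labels that identify the class perfectly vs. labels that say nothing.
informative = mutual_information([[50, 0], [0, 50]])
uninformative = mutual_information([[25, 25], [25, 25]])
```

A quantizer whose labels separate the classes cleanly scores high, while one whose labels are independent of the classes scores zero.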

ic971747.pdf




Statistical Modeling of Co-Articulation in Continuous Speech Based on Data Driven Interpolation

Authors:

Don Sun, Bell Labs, Lucent Technologies (U.S.A.)

Volume 3, Page 1751

Abstract:

Parsimonious modeling of the context-dependent nature of speech due to co-articulation is very important for improving speech recognition systems. Most of the methods proposed for dealing with this problem are based on the idea of using context-dependent speech units, which inevitably increases the complexity of the model space. This paper presents a new approach to speech co-articulation modeling with complexity only comparable to that of context-independent models. We model the movement of a sequence of speech signals by a set of anchor points in the feature vector space corresponding to the target phonemic units. The transitions are modeled as interpolations between the target vectors. The auxiliary parameters specifying the transitional units are estimated "online" during recognition, hence they do not contribute to the complexity of the models. Some phonetic classification experiments showed that the new model can achieve the same performance as the more complex context-dependent models.
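
The anchor-and-interpolation idea can be sketched as follows. Linear interpolation and an evenly spaced frame schedule are assumptions made for illustration, since the abstract does not state the interpolation form, and the names and anchor values below are hypothetical.

```python
from typing import Dict, List, Sequence


def interpolate_transition(start: Sequence[float], end: Sequence[float],
                           n_frames: int) -> List[List[float]]:
    """Return n_frames vectors moving linearly from start to end, inclusive."""
    if n_frames < 2:
        raise ValueError("need at least two frames")
    if len(start) != len(end):
        raise ValueError("anchor vectors must have the same dimension")
    frames = []
    for t in range(n_frames):
        alpha = t / (n_frames - 1)  # 0.0 at the first anchor, 1.0 at the second
        frames.append([(1.0 - alpha) * s + alpha * e
                       for s, e in zip(start, end)])
    return frames


# One anchor vector per phonemic unit (toy two-dimensional values).
anchors: Dict[str, List[float]] = {"a": [1.0, 0.0], "i": [0.0, 2.0]}
trajectory = interpolate_transition(anchors["a"], anchors["i"], 5)
```

Only the anchors are stored as model parameters here; the transition frames are derived from them on demand, which is what keeps the parameter count close to that of a context-independent model.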

ic971751.pdf




Microsegment-Based Connected Digit Recognition

Authors:

John J. Godfrey, TI Dallas (U.S.A.)
Coimbatore S. Ramalingam, TI Dallas (U.S.A.)
Aravind Ganapathiraju, Mississippi State University (U.S.A.)
Joseph Picone, Mississippi State University (U.S.A.)

Volume 3, Page 1755

Abstract:

By building acoustic phonetic models which explicitly represent as much knowledge of pronunciation in a small domain (the digits) as possible, we can create a recognition system which not only performs well but allows for meaningful error analysis and improvement. An HMM-based recognizer for the digits and a few associated words was constructed in accord with these principles. About 65 phonetic models were trained on 140 carefully labeled utterances, then iteratively on unlabeled data under orthographic supervision. The basic system achieved less than 3% word error rate on digit strings of unknown length from unseen test speakers, and 1.4% on 7-digit strings of known length. This is competitive with word-based models using the same HMM engine and similar parameter settings. As an R&D system, it allows meaningful analysis of errors and relatively straightforward means of improvement.

ic971755.pdf




Context-Dependent Hybrid HME / HMM Speech Recognition using Polyphone Clustering Decision Trees

Authors:

Jürgen Fritsch, University of Karlsruhe (Germany)
Michael Finke, University of Karlsruhe (Germany)
Alex Waibel, University of Karlsruhe (Germany)

Volume 3, Page 1759

Abstract:

This paper presents a context-dependent hybrid connectionist speech recognition system that uses a set of generalized hierarchical mixtures of experts (HME) to estimate context-dependent posterior acoustic class probabilities. The connectionist part of the system is organized in a modular fashion, allowing the distributed training of such a system on regular workstations. Context classes are based on polyphonic contexts, clustered using decision trees which we adopt from our continuous density HMM recognizer JANUS. The system is evaluated on ESST, an English speaker-independent spontaneous speech database. Context-dependent modeling is shown to yield significant improvements over simple context-independent modeling, requiring only small additional overhead in terms of training and decoding time.

ic971759.pdf




Improved automatic recognition of Norwegian natural numbers by incorporating phonetic knowledge

Authors:

Knut Kvale, Telenor R&D (Norway)
Ingunn Amdal, Telenor R&D (Norway)

Volume 3, Page 1763

Abstract:

This paper addresses the problem of speaker-independent connected natural number recognition over telephone lines. Increasing the vocabulary from digits (0-9) to natural numbers (0-99) opens the way for more user-friendly services, but also introduces many new, language-specific problems, such as more similar-sounding words, a more complex grammar network, and more ambiguities due to segmentation problems of connected natural numbers. The paper shows that incorporating phonetic knowledge into a Norwegian natural number recogniser improved the recognition performance from 70.6% to 76.3% correctly recognised 8-digit telephone numbers in noisy conditions.

ic971763.pdf




Hybrid HMM/ANN Systems for Training Independent Tasks: Experiments on Phonebook and Related Improvements

Authors:

Stéphane Dupont, FPMS - TCTS (Belgium)
Hervé Bourlard, FPMS - TCTS (Belgium)
Olivier Deroo, FPMS - TCTS (Belgium)
Vincent Fontaine, FPMS - TCTS (Belgium)
Jean-Marc Boite, FPMS - TCTS (Belgium)

Volume 3, Page 1767

Abstract:

In this paper, we evaluate multi-Gaussian HMM systems and hybrid HMM/ANN systems in the framework of task-independent training for small (75 words) and medium-size (600 words) vocabularies. To do this, we use the Phonebook database, which is particularly well suited to this kind of experiment since (1) it is a very large telephone database and (2) the size and content of the test vocabulary are very flexible. For each system, different HMM topologies are compared to test the influence of state tying (with the number of parameters kept approximately constant) on the recognition performance. Two lexica (Phonebook and CMU) are also compared, and it is shown that the CMU lexicon leads to significantly better performance. Finally, it is shown that with a quite simple system and a few adaptations to the basic HMM/ANN scheme, recognition performance of 98.5% and 94.7% can easily be achieved on lexica of 75 and 600 words, respectively (isolated words, telephone speech and lexicon words not present in the training data).

ic971767.pdf




European Speech Databases for Telephone Applications

Authors:

Harald Höge, Siemens AG (Germany)
Herbert S. Tropf, Siemens AG (Germany)
Richard Winski, Vocalis Ltd. (U.K.)
Henk van den Heuvel, SPEX (The Netherlands)
Reinhold Haeb-Umbach, Philips GmbH (Germany)
Khalid Choukri, ELRA (France)

Volume 3, Page 1771

Abstract:

The SpeechDat project aims to produce speech databases for all official languages of the European Union and some major dialectal variants and minority languages, resulting in 28 speech databases. They will be recorded over fixed and mobile telephone networks. This will provide a realistic basis for training and assessment of both isolated and continuous-speech utterances, employing whole-word or subword approaches, and thus can be used for developing voice-driven teleservices including speaker verification. The specification of the databases has been developed jointly, and is essentially the same for each language to facilitate dissemination and use. There will be a controlled variation among the speakers concerning sex, age, dialect, environment of call, etc. The validation of all databases will be carried out centrally. The SpeechDat databases will be transferred to ELRA for distribution. The next databases to be recorded will cover East European languages.

ic971771.pdf




Development of a Large Vocabulary Speech Database for Cantonese

Authors:

Pak Chung Ching, Chinese University of Hong Kong (Hong Kong)
K.F. Chow, Chinese University of Hong Kong (Hong Kong)
Tan Lee, Chinese University of Hong Kong (Hong Kong)
L.W. Chan, Chinese University of Hong Kong (Hong Kong)
Alfred Y.P. Ng, Chinese University of Hong Kong (Hong Kong)

Volume 3, Page 1775

Abstract:

This paper describes our recent work on developing a large vocabulary speech database for Cantonese. As a major Chinese dialect, Cantonese is spoken by tens of millions of people in Southern China and Hong Kong. It is very different from Mandarin or Putonghua in phonology, phonetics, vocabulary and grammatical structure. A speech database specially designed for Cantonese is urgently needed for the design, implementation and performance evaluation of various speech recognition systems. The proposed database contains a large number of speech utterances which include isolated syllables, polysyllabic words and phonetically rich sentences. It covers most of the intra-syllable and inter-syllable acoustic variations. We hope that this pioneering work will benefit and facilitate future research activities in the related areas.

ic971775.pdf
