Hidden Markov Model Techniques 3


A Statistical Phonemic Segment Model for Speech Recognition Based on Automatic Phonemic Segmentation

Authors:

Katsura Aizawa, Toin University of Yokohama (Japan)
Chieko Furuichi, Toin University of Yokohama (Japan)

Page (NA) Paper number 544

Abstract:

This paper presents a method of constructing a statistical phonemic segment model (SPSM) for a speech recognition system based on speaker-independent, context-independent automatic phonemic segmentation. In our recent research we proposed a phoneme recognition system that used template matching with the same segmentation, and confirmed that a fixed 5-frame time sequence of feature vectors used as a template represents the features of a phoneme effectively. Here, to refine this mass of templates into a more compact model, we introduce a statistical modeling method. The SPSM connects five Gaussian N-mixture densities in series. In a closed Japanese spoken word recognition experiment on a VCV-balanced set of 4920 words spoken by 10 male adults, containing 34430 phonemes in total, the phoneme recognition rate with the SPSM reached 90.23%, compared with 80.39% using the phoneme templates.
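
As an illustration only (not code from the paper), the sketch below shows how a 5-frame segment might be scored against a model that connects five Gaussian mixture densities in series; all dimensions, mixture sizes and values are made up.

```python
import numpy as np

def log_gmm(x, weights, means, variances):
    """Log-likelihood of one frame under a diagonal-covariance Gaussian mixture."""
    x = np.asarray(x, dtype=float)
    log_comp = []
    for w, mu, var in zip(weights, means, variances):
        ll = -0.5 * np.sum(np.log(2.0 * np.pi * var) + (x - mu) ** 2 / var)
        log_comp.append(np.log(w) + ll)
    m = max(log_comp)
    return m + np.log(sum(np.exp(c - m) for c in log_comp))  # log-sum-exp

def score_segment(frames, position_gmms):
    """Score a 5-frame segment: one GMM per frame position, connected in series."""
    assert len(frames) == len(position_gmms) == 5
    return sum(log_gmm(f, *g) for f, g in zip(frames, position_gmms))

# Toy example: 2-dimensional features, 2 mixture components per position.
rng = np.random.default_rng(0)
gmms = [([0.5, 0.5],
         [rng.normal(size=2), rng.normal(size=2)],
         [np.ones(2), np.ones(2)]) for _ in range(5)]
segment = rng.normal(size=(5, 2))
print(score_segment(segment, gmms))
```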

SL980544.PDF (From Author) SL980544.PDF (Rasterized)



Improved Feature Decorrelation for HMM-based Speech Recognition

Authors:

Kris Demuynck, Katholieke Universiteit Leuven - ESAT (Belgium)
Jacques Duchateau, Katholieke Universiteit Leuven - ESAT (Belgium)
Dirk Van Compernolle, Lernout & Hauspie (Belgium)
Patrick Wambacq, Katholieke Universiteit Leuven - ESAT (Belgium)

Page (NA) Paper number 1081

Abstract:

Many HMM-based recognition systems use mixtures of diagonal-covariance Gaussians to model the observation density functions in the states. These mixtures are, however, only approximations of the real distributions. One of the approximations is the assumption that the off-diagonal elements of the covariance matrices of the Gaussians are close to zero (diagonal covariance). To make this assumption reasonable, most recognition systems apply some kind of parameter decorrelation near the end of the preprocessing, e.g. the inverse cosine transform used with cepstral representations. These transforms are, however, not optimal when it comes to decorrelating features at the level of the individual Gaussians. This paper presents a solution to the decorrelation problem that is optimal in a least-squares sense. It also demonstrates the link between the recently published maximum-likelihood modelling with semi-tied covariance matrices and the presented least-squares optimisation. Evaluation on a large vocabulary recognition task shows a 10% relative improvement.
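
The following sketch is only a stand-in: it decorrelates features with a single global eigenvector transform of the sample covariance, which is not the per-Gaussian least-squares solution or the semi-tied covariance estimation discussed in the paper, but it illustrates what a decorrelating linear transform does.

```python
import numpy as np

# Hypothetical feature matrix: rows are frames, columns are coefficients.
rng = np.random.default_rng(1)
A = rng.normal(size=(8, 8))
X = rng.normal(size=(5000, 8)) @ A           # correlated synthetic features

# Eigendecomposition of the sample covariance gives an orthogonal transform
# whose output dimensions are (globally) uncorrelated.
cov = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)
T = eigvecs.T                                # decorrelating transform
Y = X @ T.T                                  # transformed features

print(np.round(np.cov(Y, rowvar=False), 2))  # near-diagonal covariance
```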

SL981081.PDF (From Author) SL981081.PDF (Rasterized)



Efficient High-Order Hidden Markov Modelling

Authors:

J.A. du Preez, University of Stellenbosch (South Africa)
D.M. Weber, University of Stellenbosch (South Africa)

Page (NA) Paper number 1073

Abstract:

We present two powerful tools which allow efficient training of arbitrary-order (including mixed and infinite order) hidden Markov models. The method rests on two parts: an ORder rEDucing algorithm, which converts high-order models to an equivalent first-order representation, and a Fast (order) Incremental Training algorithm. We demonstrate that this method is more flexible, trains significantly faster and generalises better than prior work. Order reducing is also shown to give insight into the language modelling capabilities of certain high-order HMM topologies.
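
A toy illustration of the general idea behind order reduction, assuming a plain second-order HMM: the model is rewritten as a first-order HMM over composite states (previous state, current state). The order-reducing algorithm in the paper handles mixed and infinite orders and is considerably more involved.

```python
import numpy as np

# Hypothetical 2nd-order transition probabilities A2[i, j, k] = P(k | prev=j, prevprev=i).
n = 3
rng = np.random.default_rng(2)
A2 = rng.random(size=(n, n, n))
A2 /= A2.sum(axis=2, keepdims=True)

# Equivalent first-order model over composite states (i, j):
# (i, j) -> (j, k) with probability A2[i, j, k]; all other transitions are zero.
A1 = np.zeros((n * n, n * n))
for i in range(n):
    for j in range(n):
        for k in range(n):
            A1[i * n + j, j * n + k] = A2[i, j, k]

print(np.allclose(A1.sum(axis=1), 1.0))  # rows of the first-order matrix still sum to one
```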

SL981073.PDF (From Author) SL981073.PDF (Rasterized)



A Time-Synchronous, Tree-based Search Strategy in the Acoustic Fast Match of an Asynchronous Speech Recognition System

Authors:

Ellen M. Eide, IBM (USA)
Lalit R. Bahl, IBM (USA)

Page (NA) Paper number 165

Abstract:

This paper describes the interaction between a synchronous fast match and an asynchronous detailed match search in an automatic speech recognition system. We show a considerable speed-up through this hybrid architecture over a fully-asynchronous approach and discuss the advantages of the hybrid system.

SL980165.PDF (From Author) SL980165.PDF (Rasterized)



Effective Structural Adaptation of LVCSR Systems to Unseen Domains Using Hierarchical Connectionist Acoustic Models

Authors:

Jürgen Fritsch, Interactive Systems Labs, University of Karlsruhe (Germany)
Michael Finke, Interactive Systems Labs, Carnegie Mellon University (USA)
Alex Waibel, Interactive Systems Inc (USA)

Page (NA) Paper number 754

Abstract:

We present an approach to efficiently and effectively downsize and adapt the structure of large vocabulary conversational speech recognition (LVCSR) systems to unseen domains, requiring only small amounts of transcribed adaptation data. Our approach aims at bringing today's mostly task-dependent systems closer to the aspired goal of domain independence. To achieve this, we rely on the ACID/HNN framework, a hierarchical connectionist modeling paradigm that allows a tree-structured modeling hierarchy to be dynamically adapted to the differing specificity of phonetic context in new domains. Experimental validation of the proposed approach has been carried out by adapting the size and structure of ACID/HNN-based acoustic models trained on Switchboard to two quite different, unseen domains, Wall Street Journal and an English Spontaneous Scheduling Task. In both cases, our approach yields considerably downsized acoustic models with performance improvements of up to 18% over the unadapted baseline models.

SL980754.PDF (From Author) SL980754.PDF (Rasterized)



Support Vector Machines for Speech Recognition

Authors:

Aravind Ganapathiraju, Institute for Signal and Information Processing, Mississippi State University (USA)
Jonathan Hamaker, Institute for Signal and Information Processing, Mississippi State University (USA)
Joseph Picone, Institute for Signal and Information Processing, Mississippi State University (USA)

Page (NA) Paper number 410

Abstract:

A Support Vector Machine (SVM) is a promising machine learning technique that has generated a lot of interest in the pattern recognition community in recent years. The greatest asset of an SVM is its ability to construct nonlinear decision regions in a discriminative fashion. This paper describes an application of SVMs to two speech data classification experiments: 11 vowels spoken in isolation and 16 phones extracted from spontaneous telephone speech. The best performance achieved on the spontaneous speech classification task is a 51% error rate using an RBF kernel. This is comparable to frame-level classification achieved by other nonlinear modeling techniques such as artificial neural networks (ANN).
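
For readers who want to try the idea, a minimal frame-classification toy with an RBF-kernel SVM (using scikit-learn and synthetic data, not the paper's features or experimental setup) might look like this:

```python
import numpy as np
from sklearn.svm import SVC

# Two synthetic "phone classes" of fake 13-dimensional cepstral frames.
rng = np.random.default_rng(3)
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 13)),
               rng.normal(1.5, 1.0, size=(200, 13))])
y = np.array([0] * 200 + [1] * 200)

clf = SVC(kernel="rbf", C=1.0, gamma="scale")   # RBF-kernel SVM
clf.fit(X, y)
print("training accuracy:", clf.score(X, y))
```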

SL980410.PDF (From Author) SL980410.PDF (Rasterized)



Natural Number Recognition Using Discriminatively Trained Inter-Word Context Dependent Hidden Markov Models

Authors:

Malan B. Gandhi, Lucent Technologies Bell Laboratories (USA)

Page (NA) Paper number 90

Abstract:

Many automatic speech recognition telephony applications involve recognition of input containing some type of numbers. Traditionally, this has been achieved by using isolated or connected digit recognizers. However, as speech recognition finds a wider range of applications, it is often infeasible to impose restrictions on speaker behavior. This paper studies two model topologies for natural number recognition which use minimum classification error (MCE) trained inter-word context dependent acoustic models. One model topology uses triphone context units while another is of the head-body-tail (HBT) type. The performance of the models is evaluated on three natural number applications involving recognition of dates, time of day, and dollar amounts. Experimental results show that context dependent models reduce string error rates by as much as 50% over baseline context independent whole-word models. String accuracies of about 93% are obtained on these tasks while at the same time allowing users flexibility in speaking styles.

SL980090.PDF (From Author) SL980090.PDF (Rasterized)



Information Theoretic Approaches to Model Selection

Authors:

Jonathan Hamaker, Institute for Signal and Information Processing, Mississippi State University (USA)
Aravind Ganapathiraju, Institute for Signal and Information Processing, Mississippi State University (USA)
Joseph Picone, Institute for Signal and Information Processing, Mississippi State University (USA)

Page (NA) Paper number 653

Abstract:

The primary problem in large vocabulary conversational speech recognition (LVCSR) is poor acoustic-level matching due to large variability in pronunciations. There is much to explore about the "quality" of states in an HMM and the inter-relationships between inter-state and intra-state Gaussians used to model speech. Of particular interest is the variable discriminating power of the individual states. The fundamental concept addressed in this paper is to investigate means of exploiting such dependencies through model topology optimization based on the Bayesian Information Criterion (BIC) and the Minimum Description Length (MDL) principle.
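
As a reminder of how a BIC-style criterion trades likelihood against model size, here is a small, purely hypothetical comparison; the log-likelihoods are invented and the parameter count assumes diagonal Gaussians over 39-dimensional features.

```python
import numpy as np

def bic(log_likelihood, n_params, n_frames):
    """Bayesian Information Criterion; higher is better under this sign convention."""
    return log_likelihood - 0.5 * n_params * np.log(n_frames)

# Hypothetical comparison of two state models on the same data:
# 4 vs. 8 diagonal Gaussians (mixture weights + means + variances).
n_frames = 12000
for n_mix, loglik in [(4, -610000.0), (8, -604000.0)]:
    n_params = (n_mix - 1) + n_mix * 39 * 2
    print(n_mix, "mixtures, BIC =", bic(loglik, n_params, n_frames))
```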

SL980653.PDF (From Author) SL980653.PDF (Rasterized)



Continuous Speech Recognition Using Segmental Unit Input HMMs with a Mixture of Probability Density Functions and Context Dependency

Authors:

Kengo Hanai, Toyohashi University of Technology (Japan)
Kazumasa Yamamoto, Toyohashi University of Technology (Japan)
Nobuaki Minematsu, Toyohashi University of Technology (Japan)
Seiichi Nakagawa, Toyohashi University of Technology (Japan)

Page (NA) Paper number 1012

Abstract:

It is well known that HMMs with only the basic structure cannot adequately capture the correlations among successive frames. In our previous work, segmental unit HMMs were introduced to address this problem and their effectiveness was shown. The integration of delta-cepstrum and delta-delta-cepstrum into the segmental unit HMMs was also found to improve recognition performance. In this paper, we investigate further refinements of the models using a mixture of PDFs and/or context dependency, where, for a given syllable, only the preceding vowel is treated as context information. Recognition experiments showed that the accuracy was improved by 23%, which clearly indicates the effectiveness of the refinements examined in this paper. The proposed syllable-based HMM outperformed a triphone model.

SL981012.PDF (From Author) SL981012.PDF (Rasterized)



Gaussian Density Tree Structure in a Multi-Gaussian HMM-Based Speech Recognition System

Authors:

Jacques Simonin, France Telecom - CNET (France)
Lionel Delphin-Poulat, France Telecom - CNET (France)
Geraldine Damnati, France Telecom - CNET (France)

Page (NA) Paper number 1063

Abstract:

This paper presents the use of a Gaussian density tree structure that reduces computational cost during continuous speech recognition without significantly degrading recognition performance. The Gaussian tree structure is built by successively merging Gaussian densities. Each node of the tree is associated with a Gaussian density, and the actual HMM densities are associated with the leaves. We then propose a criterion for obtaining good recognition performance with this Gaussian tree structure. The structure is evaluated with a continuous speech recognition system on a telephone database. The criterion allows a 75 to 85% reduction in computational cost, in terms of log-likelihood computations, without any significant increase in word error rate.
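
A rough sketch of the node-then-leaf evaluation idea, with a randomly generated two-level tree; the paper's tree is built by successive density merging and uses its own selection criterion, so treat this only as an illustration of why the tree saves log-likelihood computations.

```python
import numpy as np

def log_gauss(x, mu, var):
    return -0.5 * np.sum(np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

rng = np.random.default_rng(4)
dim, n_nodes, leaves_per_node = 10, 8, 16
node_mu = rng.normal(size=(n_nodes, dim))                 # one Gaussian per tree node
leaf_mu = node_mu[:, None, :] + 0.3 * rng.normal(size=(n_nodes, leaves_per_node, dim))
var = np.ones(dim)

def tree_scores(x, n_best=2):
    """Score the node Gaussians first; expand only the n_best nodes to their leaf densities."""
    node_ll = np.array([log_gauss(x, m, var) for m in node_mu])
    best = np.argsort(node_ll)[-n_best:]
    out = {}
    for b in best:
        for j, m in enumerate(leaf_mu[b]):
            out[(b, j)] = log_gauss(x, m, var)
    return out  # log-likelihoods for the selected subset of leaf densities

x = rng.normal(size=dim)
scores = tree_scores(x)
print(len(scores), "of", n_nodes * leaves_per_node, "leaf densities evaluated")
```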

SL981063.PDF (From Author) SL981063.PDF (Rasterized)



Generalized Phone Modeling Based on Piecewise Linear Segment Lattice

Authors:

Hiroaki Kojima, Electrotechnical Laboratory, AIST, MITI (Japan)
Kazuyo Tanaka, Electrotechnical Laboratory, AIST, MITI (Japan)

Page (NA) Paper number 995

Abstract:

The goal of this work is to model phone-like units automatically from spoken word samples without using any transcriptions except for the lexical identification of the words. To this end, we have proposed the "piecewise linear segment lattice (PLSL)" model for phoneme representation. The model is structured as a lattice of segments, each of which is represented by the regression coefficients of the feature vectors within the segment. To organize phone models, operations including division, concatenation, blocking and clustering are applied to the models. This paper mainly reports on blocking and clustering. Experimental results on an isolated word recognition task show that the recognition rate is significantly improved by blocking the segments and by clustering the segments within a block. We obtain sufficient performance for the task with models consisting of at most 128 clusters of segment patterns.
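
A minimal sketch of representing a segment by regression coefficients of its feature vectors, here a per-dimension intercept and slope fitted by least squares; the PLSL model's lattice, blocking and clustering operations are not shown.

```python
import numpy as np

def segment_regression(frames):
    """Fit a straight line (intercept + slope) per feature dimension over one segment."""
    frames = np.asarray(frames, dtype=float)        # shape (n_frames, dim)
    t = np.linspace(-1.0, 1.0, len(frames))
    design = np.column_stack([np.ones_like(t), t])  # regressors: [1, t]
    coeffs, *_ = np.linalg.lstsq(design, frames, rcond=None)
    return coeffs                                   # shape (2, dim): intercepts, slopes

rng = np.random.default_rng(5)
seg = rng.normal(size=(7, 12))                      # a 7-frame, 12-dimensional toy segment
print(segment_regression(seg).shape)
```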

SL980995.PDF (From Author) SL980995.PDF (Rasterized)



A Flexible Method of Creating HMM Using Block-Diagonalization of Covariance Matrices

Authors:

Ryosuke Koshiba, Toshiba Kansai Research Laboratories (Japan)
Mitsuyoshi Tachimori, Toshiba Kansai Research Laboratories (Japan)
Hiroshi Kanazawa, Toshiba Kansai Research Laboratories (Japan)

Page (NA) Paper number 197

Abstract:

A new algorithm is proposed that reduces the amount of computation in the likelihood calculation of continuous mixture HMMs (CMHMMs) with block-diagonal covariance matrices while retaining a high recognition rate. The block matrices are optimized by minimizing the difference between the output probability calculated with full covariance matrices and that calculated with block-diagonal covariance matrices. The idea was implemented and tested on a continuous number recognition task.
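
As a sketch of why block-diagonal covariances are cheaper than full ones, the Gaussian log-density below is computed block by block; the block assignment and values are arbitrary, and the paper's optimization of the blocks is not shown.

```python
import numpy as np

def log_gauss_block_diag(x, mu, blocks):
    """Gaussian log-density with a block-diagonal covariance.
    `blocks` is a list of (index_array, covariance_block) pairs covering all dimensions."""
    ll = 0.0
    for idx, cov in blocks:
        d = x[idx] - mu[idx]
        sign, logdet = np.linalg.slogdet(cov)
        ll += -0.5 * (len(idx) * np.log(2 * np.pi) + logdet
                      + d @ np.linalg.solve(cov, d))
    return ll

rng = np.random.default_rng(6)
dim = 6
x, mu = rng.normal(size=dim), np.zeros(dim)
blocks = [(np.array([0, 1, 2]), np.eye(3)), (np.array([3, 4, 5]), 2 * np.eye(3))]
print(log_gauss_block_diag(x, mu, blocks))
```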

SL980197.PDF (From Author) SL980197.PDF (Rasterized)



HMM Topology Selection For Accurate Acoustic And Duration Modeling

Authors:

C. Chesta, Politecnico di Torino (Italy)
Pietro Laface, Politecnico di Torino (Italy)
F. Ravera, CSELT - Centro Studi e Laboratori Telecomunicazioni (Italy)

Page (NA) Paper number 149

Abstract:

In this paper we show that accurate HMMs for connected word recognition can be obtained without context dependent modeling and discriminative training. To account for different speaking rates, we define two HMMs for each word that must be trained. The two models have the same, standard, left to right topology with the possibility of skipping one state, but each model has a different number of states, automatically selected. Our simple modeling and training technique has been applied to connected digit recognition using the adult speaker portion of the TI/NIST corpus. The obtained results are comparable with the best ones reported in the literature for models with a larger number of densities.

SL980149.PDF (From Author) SL980149.PDF (Rasterized)



Context-Dependent Duration Modelling for Continuous Speech Recognition

Authors:

Tan Lee, Department of Electronic Engineering, The Chinese University of Hong Kong (Hong Kong)
Rolf Carlson, Centre for Speech Technology, Royal Institute of Technology (Sweden)
Björn Granström, Centre for Speech Technology, Royal Institute of Technology (Sweden)

Page (NA) Paper number 441

Abstract:

This paper presents a trial study of using context-dependent segmental duration for continuous speech recognition in a domain-specific application. Different modelling strategies are proposed for function words and content words. Stress status, word position in the utterance and phone position in the word are identified as the three most crucial factors affecting segmental duration in this application. In addition, speaking rate normalization is applied to further reduce duration variability. Experimental results show that the normalized duration models help improve the rank of the correct sentence in the N-best hypothesis list.

SL980441.PDF (From Author) SL980441.PDF (Rasterized)



Training of Context-Dependent Subspace Distribution Clustering Hidden Markov Model

Authors:

Brian Mak, Department of Computer Science, The Hong Kong University of Science & Technology (China)
Enrico Bocchieri, AT&T Labs -- Research (USA)

Page (NA) Paper number 699

Abstract:

Training of continuous density hidden Markov models (CDHMMs) is usually time-consuming and tedious due to the large number of model parameters involved. Recently we proposed a new derivative of the CDHMM, the subspace distribution clustering hidden Markov model (SDCHMM), which ties CDHMMs at the finer level of subspace distributions, resulting in many fewer model parameters. An SDCHMM training algorithm was also devised to train SDCHMMs directly from speech data without intermediate CDHMMs. On the ATIS task, speaker-independent context-independent (CI) SDCHMMs can be trained with as little as 8 minutes of speech with no loss in recognition accuracy --- a 25-fold reduction when compared with their CDHMM counterparts. In this paper, we extend our SDCHMM training to context-dependent (CD) modeling under various assumptions of prior knowledge. Despite the 30-fold increase in model parameters in the CD ATIS CDHMMs, equivalent CD SDCHMMs can still be estimated from a few minutes of ATIS data.
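
A toy illustration of the subspace-tying idea: each mixture component stores only indices into per-stream codebooks of subspace Gaussians, and its full-space log-likelihood is the sum of the tied subspace log-likelihoods. All sizes and values below are invented.

```python
import numpy as np

def log_gauss_1d(x, mu, var):
    return -0.5 * (np.log(2 * np.pi * var) + (x - mu) ** 2 / var)

# Hypothetical setup: 4 one-dimensional feature "streams", a shared codebook of 8
# subspace prototypes per stream, and a mixture component that stores only codebook
# indices instead of its own subspace parameters.
rng = np.random.default_rng(7)
n_streams, codebook_size = 4, 8
proto_mu = rng.normal(size=(n_streams, codebook_size))
proto_var = np.ones((n_streams, codebook_size))
component_codes = rng.integers(0, codebook_size, size=n_streams)  # one component's ties

def component_loglik(x):
    """Full-space log-likelihood = sum of the tied subspace log-likelihoods."""
    return sum(log_gauss_1d(x[s], proto_mu[s, component_codes[s]],
                            proto_var[s, component_codes[s]])
               for s in range(n_streams))

print(component_loglik(rng.normal(size=n_streams)))
```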

SL980699.PDF (From Author) SL980699.PDF (Rasterized)



Unsupervised Training of HMMs With Variable Number of Mixture Components Per State

Authors:

Cesar Martín del Alamo, Telefonica I+D (Spain)
Luis Villarrubia, Telefonica I+D (Spain)
Francisco Javier González, ETSI Telecomunicaci-on (Spain)
Luis A. Hernández, ETSI Telecomunicaci-on (Spain)

Page (NA) Paper number 443

Abstract:

In this work, automatic methods for determining the number of Gaussians per state in a set of hidden Markov models are studied. Four different mix-up criteria are proposed to decide how to increase the size of each state. These criteria, derived from maximum likelihood scores, are designed to increase the discrimination between states and yield a different number of Gaussians per state. We compare the proposed methods with the common approach in which the number of density functions in every state is equal and fixed in advance by the designer. Experimental results demonstrate that performance can be maintained while reducing the total number of density functions by 17% (from 2046 down to 1705). These results are obtained in a flexible large vocabulary isolated word recognizer using context-dependent models.

SL980443.PDF (From Author) SL980443.PDF (Rasterized)



Acoustic Observation Context Modeling in Segment Based Speech Recognition

Authors:

Máté Szarvas, Technical University of Budapest (Hungary)
Shoichi Matsunaga, NTT Human Interface Laboratories (Japan)

Page (NA) Paper number 1098

Abstract:

This paper describes a novel method that models the correlation between acoustic observations in contiguous speech segments. The basic idea behind the method is that acoustic observations are conditioned not only on the phonetic context but also on the preceding acoustic segment observation. The correlation between consecutive acoustic observations is modeled by polynomial mean trajectory segment models. This method is an extension of conventional segment modeling approaches in that it not only describes the correlation of acoustic observations inside segments but also between contiguous segments. It is also a generalization of phonetic context (e.g., triphone) modeling approaches because it can model acoustic context and phonetic context at the same time. In a speaker-independent phoneme classification test, using the proposed method resulted in a 7-9% reduction in error rate as compared to the traditional triphone segmental model system and a 31% reduction as compared to a similar triphone HMM system.

SL981098.PDF (From Author) SL981098.PDF (Rasterized)



Capturing Discriminative Information Using Multiple Modeling Techniques

Authors:

Ji Ming, Queen's University of Belfast (U.K.)
Philip Hanna, Queen's University of Belfast (U.K.)
Darryl Stewart, Queen's University of Belfast (U.K.)
Saeed Vaseghi, Queen's University of Belfast (U.K.)
F. Jack Smith, Queen's University of Belfast (U.K.)

Page (NA) Paper number 263

Abstract:

An acoustic model is a simplified mathematical representation of acoustic-phonetic information. The simplifying assumptions inherent to each model entail that it may only be capable of capturing a certain aspect of the available information. An effective combination of different types of model should therefore permit a combined model that can utilize all the information captured by the individual models. This paper reports some preliminary research in combining certain types of acoustic model for speech recognition. In particular, we designed and implemented a single HMM framework, which combines a segment-based modeling technique with the standard HMM technique. The recognition experiments, based on a speaker-independent E-set database, have shown that the combined model has the potential of producing a significantly higher performance than the individual models considered in isolation.

SL980263.PDF (From Author) SL980263.PDF (Rasterized)



Suprasegmental Duration Modelling with Elastic Constraints in Automatic Speech Recognition

Authors:

Laurence Molloy, Centre for Speech Technology Research (U.K.)
Stephen Isard, Centre for Speech Technology Research (U.K.)

Page (NA) Paper number 1103

Abstract:

In this paper a method of integrating a model of suprasegmental duration with an HMM-based recogniser at the post-processing level is presented. The N-best utterance output is rescored using a suitable linear combination of acoustic log-likelihood (provided by a set of tied-state triphone HMMs) and duration log-likelihood (provided by a set of durational models). The durational model used in the post-processing imposes syllable-level elastic constraints on the durational behaviour of speech segments. Results are presented for word accuracy on the Resource Management database after rescoring, using two different syllable-like constraint units: a fixed-size N-phone window and simple (unconstrained) phone duration probability scoring.
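
A minimal sketch of N-best rescoring with a linear combination of acoustic and duration log-likelihoods; the hypotheses, scores and weight below are invented for illustration.

```python
# Each hypothesis: (text, acoustic log-likelihood, duration log-likelihood).
hypotheses = [
    ("show the ships in the gulf", -4210.5, -83.1),
    ("show the ship in the gulf",  -4212.0, -70.4),
    ("show a ship in the gulf",    -4215.8, -75.9),
]
alpha = 5.0   # weight on the duration log-likelihood (would be tuned on held-out data)

# Rerank by the combined score and print the new order.
rescored = sorted(hypotheses, key=lambda h: h[1] + alpha * h[2], reverse=True)
for text, ac, du in rescored:
    print(f"{ac + alpha * du:10.1f}  {text}")
```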

SL981103.PDF (From Author) SL981103.PDF (Rasterized)



An Adaptive Gradient-Search Based Algorithm for Discriminative Training of HMM's

Authors:

Albino Nogueiras-Rodríguez, Universitat Politecnica de Catalunya (Spain)
José B. Mariño, Universitat Politecnica de Catalunya (Spain)
Enric Monte, Universitat Politecnica de Catalunya (Spain)

Page (NA) Paper number 769

Abstract:

Although discriminative training has proved to be a very powerful tool in acoustic modelling, it presents a major drawback: the lack of a formulation that guarantees convergence regardless of the initial conditions, as the Baum-Welch algorithm does for maximum likelihood training. For this reason, a gradient descent search is usually used for this kind of problem. Unfortunately, standard gradient descent algorithms rely heavily on the choice of the learning rates. This dependence is especially cumbersome because it means that, at each run of the discriminative training procedure, a search must be carried out over the parameters governing the algorithm. In this paper we describe an adaptive procedure for determining the optimal value of the step size at each iteration. While the computational and memory overhead of the algorithm is negligible, results show less dependence on the initial learning rate than standard gradient descent, and, when the same idea is used to apply self-scaling, it clearly outperforms it.
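
The exact adaptive step-size rule is given in the paper; as a generic illustration of adapting learning rates during gradient descent, the sketch below grows or shrinks each per-parameter step size according to gradient sign agreement (a delta-bar-delta-style rule, not the authors' procedure).

```python
import numpy as np

def minimise(grad_fn, x0, eta0=0.1, up=1.2, down=0.5, n_iter=100):
    """Gradient descent with per-parameter step sizes adapted from gradient sign agreement
    (delta-bar-delta style); not the exact adaptive rule proposed in the paper."""
    x = np.array(x0, dtype=float)
    eta = np.full_like(x, eta0)
    prev_grad = np.zeros_like(x)
    for _ in range(n_iter):
        g = grad_fn(x)
        agree = g * prev_grad > 0
        eta = np.where(agree, eta * up, np.where(g * prev_grad < 0, eta * down, eta))
        x -= eta * g
        prev_grad = g
    return x

# Toy objective: f(x) = sum((x - 3)^2), gradient 2 * (x - 3); minimum at x = 3.
print(minimise(lambda x: 2 * (x - 3.0), np.zeros(4)))
```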

SL980769.PDF (From Author) SL980769.PDF (Rasterized)



Task Adaptation of Sub-Lexical Unit Models Using the Minimum Confusibility Criterion on Task Independent Databases

Authors:

Albino Nogueiras-Rodríguez, Universitat Politecnica de Catalunya (Spain)
José B. Mariño, Universitat Politecnica de Catalunya (Spain)

Page (NA) Paper number 770

Abstract:

Discriminative training is a powerful tool in acoustic modeling for automatic speech recognition. Its strength lies in directly minimising the number of errors committed by the system at recognition time. This is usually accomplished by defining an auxiliary function that characterises the behaviour of the system and adjusting the parameters of the system so that this function is minimised. The main drawback of this approach is that a task-specific training database is needed. In this paper an alternative procedure is proposed: task adaptation using task-independent databases. It combines acoustic information, estimated using a general-purpose training database, with linguistic information taken from the definition of the task. In the experiments carried out, this technique led to large improvements in the recognition of two different tasks: clean-speech digit strings in English and dates in Spanish over the telephone network.

SL980770.PDF (From Author) SL980770.PDF (Rasterized)



Stochastic Calculus, Non-Linear Filtering, and the Internal Model Principle: Implications for Articulatory Speech Recognition

Authors:

Gordon Ramsay, ICP-INPG (France)

Page (NA) Paper number 671

Abstract:

A stochastic approach to modelling speech production and perception is discussed, based on Itô calculus. Speech is modelled by a system of non-linear stochastic differential equations evolving on a finite-dimensional state space, representing a partially-observed Markov process. The optimal non-linear filtering equations for the model are stated, and shown to exhibit a predictor-corrector structure, which mimics the structure of the original system. This is used to suggest a possible justification for the hypothesis that speakers and listeners make use of an ``internal model'' in producing and perceiving speech, and leads to a useful statistical framework for articulatory speech recognition.

SL980671.PDF (From Author) SL980671.PDF (Rasterized)



The Use of Meta-HMM in Multistream HMM Training for Automatic Speech Recognition

Authors:

Christian J. Wellekens, Institut Eurecom (France)
Jussi Kangasharju, Institut Eurecom (France)
Cedric Milesi, Institut Eurecom (France)

Page (NA) Paper number 271

Abstract:

Among the different attempts to improve recognition scores and robustness to noise, the recognition of parallel streams of data, each representing partial information about the test signal, and the fusion of their decisions have received a great deal of interest. The problem of training such models while taking recombination constraints into account at the level of speech sub-units has not yet been rigorously addressed. This paper shows how equivalence with an extended meta-HMM solves the problem and how the reestimation formulas have to be applied to guarantee equivalence between the multistream model and the meta-HMM. Experiments demonstrate the importance of the transition probabilities in the meta-HMM, which must satisfy certain constraints in order to represent a multistream HMM.

SL980271.PDF (From Author) SL980271.PDF (Rasterized)



Enhanced ASR By Acoustic Feature Filtering

Authors:

Christian J. Wellekens, Institut Eurecom - Sophia Antipolis (France)

Page (NA) Paper number 272

Abstract:

The use of contextual features has long been shown to improve recognition scores: numerical estimates of speed and acceleration appended to the current feature vectors, predictive HMMs, or neural networks. All of these implementations are particular cases of FIR filtering of the feature trajectories. This paper presents a new approach in which the characteristics of the filters are trained together with the HMM parameters, resulting in recognition improvements in first tests. Reestimation formulas are derived for the cut-off frequencies of ideal low-pass filters as well as for the impulse response coefficients of a general FIR low-pass filter. Filters can either be common to all feature vectors or dedicated to a given entry or a given HMM state.

SL980272.PDF (From Author) SL980272.PDF (Rasterized)



Soft State-Tying for HMM-based Speech Recognition

Authors:

Christoph Neukirchen, Department of Computer Science, Gerhard-Mercator-University Duisburg (Germany)
Daniel Willett, Department of Computer Science, Gerhard-Mercator-University Duisburg (Germany)
Gerhard Rigoll, Department of Computer Science, Gerhard-Mercator-University Duisburg (Germany)

Page (NA) Paper number 346

Abstract:

This paper introduces a method for regularization of HMM systems that avoids parameter overfitting caused by insufficient training data. Regularization is done by augmenting the EM training method by a penalty term that favors simple and smooth HMM systems. The penalty term is constructed as a mixture model of negative exponential distributions that is assumed to generate the state dependent emission probabilities of the HMMs. This new method is the successful transfer of a well known regularization approach in neural networks to the HMM domain and can be interpreted as a generalization of traditional state-tying for HMM systems. The effect of regularization is demonstrated for continuous speech recognition tasks by improving overfitted triphone models and by speaker adaptation with limited training data.

SL980346.PDF (From Author) SL980346.PDF (Rasterized)



Estimation Of Models For Non-Native Speech In Computer-Assisted Language Learning Based On Linear Model Combination

Authors:

Silke Witt, Cambridge University Engineering Dept (U.K.)
Steve Young, Cambridge University Engineering Dept (U.K.)

Page (NA) Paper number 1010

Abstract:

This paper investigates how to improve the acoustic modelling of non-native speech. For this purpose we present an adaptation technique that combines hidden Markov models of the source and the target language of a foreign-language student. Such model combination requires a mapping of the mean vectors from the target to the source language; three different mapping approaches, based on phonetic knowledge and/or acoustic distance measures, have been tested. The performance of this model combination method and several variations of it has been measured and compared with standard MLLR adaptation. For the baseline model combination, small improvements in recognition accuracy were obtained compared with the results of applying MLLR. Furthermore, slight further improvements were found with an a-priori approach, in which the models were combined with predefined weights before applying any of the adaptation techniques.
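
A minimal sketch of linear model combination at the level of Gaussian mean vectors, assuming the source-language mean has already been mapped to the corresponding target-language phone; the weight and vectors are invented.

```python
import numpy as np

def combine_means(mu_target, mu_source, weight):
    """Interpolate target-language and (mapped) source-language Gaussian means."""
    return weight * np.asarray(mu_target) + (1.0 - weight) * np.asarray(mu_source)

# Hypothetical mapped mean vectors for one Gaussian, combined with a predefined weight.
mu_target = np.array([1.2, -0.4, 0.8])
mu_source = np.array([0.9, -0.1, 1.1])
print(combine_means(mu_target, mu_source, 0.7))
```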

SL981010.PDF (From Author) SL981010.PDF (Rasterized)



Duration Modeling Using Cumulative Duration Probability and Speaking Rate Compensation

Authors:

Tae-Young Yang, Yonsei Univ. Dept. of Electronics (Korea)
Ji-Sung Kim, Yonsei Univ. Dept. of Electronics (Korea)
Chungyong Lee, Yonsei Univ. Dept. of Electronics (Korea)
Dae Hee Youn, Yonsei Univ. Dept. of Electronics (Korea)
Il-Whan Cha, Yonsei Univ. Dept. of Electronics (Korea)

Page (NA) Paper number 428

Abstract:

A duration modeling scheme and a speaking rate compensation technique are presented for an HMM-based connected digit recognizer. The proposed duration modeling technique uses a cumulative duration probability, which can also be used to obtain the duration bounds for bounded duration modeling. One of the advantages of the proposed technique is that the cumulative duration probability can be applied directly in the Viterbi decoding procedure without additional post-processing; it thus governs the state and word transitions at each frame. To alleviate the problems caused by fast or slow speech, a modification of the bounded duration modeling that accounts for speaking rate is described. Experimental results on Korean connected digit recognition show the effectiveness of the proposed duration modeling scheme and speaking rate compensation technique.
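
As a sketch of the idea, the function below returns a log cumulative duration probability that can be added to a path score frame by frame inside Viterbi decoding; the histograms are invented and the full decoder is only indicated in a comment.

```python
import numpy as np

# Hypothetical per-state duration histograms over 1..4 frames, normalised.
dur_pmf = np.array([[0.1, 0.3, 0.4, 0.2],
                    [0.2, 0.5, 0.2, 0.1]])
dur_cdf = np.cumsum(dur_pmf, axis=1)

def cumulative_duration_logprob(state, d):
    """Log cumulative probability of having stayed d frames in `state`.
    Added to the path score at each frame, so no separate post-processing is needed."""
    d = min(int(d), dur_pmf.shape[1])        # saturate beyond the histogram length
    return float(np.log(dur_cdf[state, d - 1]))

# Inside a Viterbi recursion, a self-loop of state s at duration d would be scored as
#   frame_loglik + cumulative_duration_logprob(s, d)
# so that unusually long stays in a state are penalised frame by frame.
print(cumulative_duration_logprob(0, 2), cumulative_duration_logprob(0, 4))
```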

SL980428.PDF (From Author) SL980428.PDF (Rasterized)



Probabilistic Modeling with Bayesian Networks for Automatic Speech Recognition

Authors:

Geoffrey Zweig, IBM T.J. Watson Research Center (USA)
Stuart Russell, U.C. Berkeley (USA)

Page (NA) Paper number 858

Abstract:

This paper describes the application of Bayesian networks to automatic speech recognition (ASR). Bayesian networks enable the construction of probabilistic models in which an arbitrary set of variables can be associated with each speech frame in order to explicitly model factors such as acoustic context, speaking rate, or articulator positions. Once the basic inference machinery is in place, a wide variety of models can be expressed and tested. We have implemented a Bayesian network system for isolated word recognition, and present experimental results on the PhoneBook database. These results indicate that performance improves when the observations are conditioned on an auxiliary variable modeling acoustic/articulatory context. The use of multivalued and multiple context variables further improves recognition accuracy.

SL980858.PDF (From Author) SL980858.PDF (Rasterized)
