Authors:
Katsura Aizawa, Toin University of Yokohama (Japan)
Chieko Furuichi, Toin University of Yokohama (Japan)
Page (NA) Paper number 544
Abstract:
This paper presents a method of constructing a statistical phonemic
segment model (SPSM) for a speech recognition system based on speaker-independent,
context-independent automatic phonemic segmentation. In our earlier
research, we proposed a phoneme recognition system using template
matching with the same segmentation, and confirmed that a fixed five-frame
time sequence of feature vectors used as a template represents phoneme
features effectively. Here, to refine this large collection of templates
into a more compact model, we introduce a statistical modeling method.
The SPSM connects five Gaussian N-mixture densities in series. In a
closed-set Japanese spoken word recognition experiment using 4920
VCV-balanced words spoken by 10 adult males, comprising 34430 phonemes
in total, the phoneme recognition rate using the SPSM reached 90.23%,
compared with 80.39% using phoneme templates.
Authors:
Kris Demuynck, Katholieke Universiteit Leuven - ESAT (Belgium)
Jacques Duchateau, Katholieke Universiteit Leuven - ESAT (Belgium)
Dirk Van Compernolle, Lernout & Hauspie (Belgium)
Patrick Wambacq, Katholieke Universiteit Leuven - ESAT (Belgium)
Page (NA) Paper number 1081
Abstract:
Many HMM-based recognition systems use mixtures of diagonal-covariance
Gaussians to model the observation density functions in the states.
These mixtures are, however, only approximations of the real distributions.
One of the approximations is the assumption that the off-diagonal elements
of the covariance matrices of the Gaussians are close to zero (diagonal
covariance). To make this assumption hold, most recognition systems apply
some kind of parameter decorrelation near the end of the preprocessing,
e.g. the inverse cosine transform used with cepstral transformations.
These transforms are, however, not optimal when it comes to decorrelating
features at the level of the individual Gaussians. This paper presents a
solution to the decorrelation problem that is optimal in a least-squares
sense. It also demonstrates the link between the recently published
maximum likelihood modelling with semi-tied covariance matrices and the
presented least-squares optimisation. Evaluation on a large vocabulary
recognition task shows a 10% relative improvement.
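As a concrete illustration of the kind of per-Gaussian decorrelation the abstract refers to (not the paper's least-squares solution itself), the sketch below diagonalizes a single 2-D covariance by rotating into its eigenbasis; the matrix values are invented for the example.

```python
import math

def eigen_rotation(cov):
    # Rotation angle that diagonalizes a symmetric 2x2 covariance [[a, b], [b, c]]:
    # the off-diagonal of R * cov * R^T vanishes when tan(2*theta) = 2b / (a - c).
    a, b, c = cov[0][0], cov[0][1], cov[1][1]
    theta = 0.5 * math.atan2(2.0 * b, a - c)
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    return [[cos_t, sin_t], [-sin_t, cos_t]]

def transform_cov(rot, cov):
    # Computes R * cov * R^T for 2x2 matrices.
    rc = [[sum(rot[i][k] * cov[k][j] for k in range(2)) for j in range(2)]
          for i in range(2)]
    return [[sum(rc[i][k] * rot[j][k] for k in range(2)) for j in range(2)]
            for i in range(2)]

# A strongly correlated Gaussian: a diagonal model would fit it poorly.
cov = [[2.0, 1.2], [1.2, 1.0]]
rot = eigen_rotation(cov)
decorrelated = transform_cov(rot, cov)
```

In the rotated coordinates the off-diagonal covariance is zero, so a diagonal model is exact for this one Gaussian; the difficulty the paper addresses is finding a single transform that serves all Gaussians at once.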
Authors:
J.A. du Preez, University of Stellenbosch (South Africa)
D.M. Weber, University of Stellenbosch (South Africa)
Page (NA) Paper number 1073
Abstract:
We present two powerful tools that allow efficient training of arbitrary-order
(including mixed- and infinite-order) hidden Markov models. The method
rests on two parts: an algorithm that converts high-order models
to an equivalent first-order representation (ORder rEDucing), and a
Fast (order) Incremental Training algorithm. We demonstrate that this
method is more flexible, trains significantly faster, and generalises
better than prior work. Order reducing is also shown to give insight
into the language modelling capabilities of certain high-order HMM
topologies.
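The core of the order-reducing idea can be sketched as follows: a second-order chain with transitions P(k | i, j) is rewritten as a first-order chain over composite states (i, j). The transition table below is hypothetical, and the real ORED algorithm handles arbitrary mixed orders; this shows only the simplest case.

```python
# Hypothetical second-order transition probabilities P(next | prev, cur)
# over states 'a' and 'b' (illustrative numbers, not from the paper).
second_order = {
    ('a', 'a'): {'a': 0.7, 'b': 0.3},
    ('a', 'b'): {'a': 0.4, 'b': 0.6},
    ('b', 'a'): {'a': 0.5, 'b': 0.5},
    ('b', 'b'): {'a': 0.2, 'b': 0.8},
}

def reduce_order(p2):
    # Composite state (prev, cur): a first-order transition
    # (i, j) -> (j, k) inherits the second-order probability P(k | i, j).
    first_order = {}
    for (i, j), nxt in p2.items():
        first_order[(i, j)] = {(j, k): p for k, p in nxt.items()}
    return first_order

first_order = reduce_order(second_order)
```

The resulting model is a plain first-order HMM, so standard first-order training and decoding machinery applies unchanged.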
Authors:
Ellen M. Eide, IBM (USA)
Lalit R. Bahl, IBM (USA)
Page (NA) Paper number 165
Abstract:
This paper describes the interaction between a synchronous fast match
and an asynchronous detailed match search in an automatic speech recognition
system. We show a considerable speed-up through this hybrid architecture
over a fully-asynchronous approach and discuss the advantages of the
hybrid system.
Authors:
Jürgen Fritsch, Interactive Systems Labs, University of Karlsruhe (Germany)
Michael Finke, Interactive Systems Labs, Carnegie Mellon University (USA)
Alex Waibel, Interactive Systems Inc (USA)
Page (NA) Paper number 754
Abstract:
We present an approach to efficiently and effectively downsizing and
adapting the structure of large vocabulary conversational speech recognition
(LVCSR) systems to unseen domains, requiring only small amounts of
transcribed adaptation data. Our approach aims at bringing today's mostly
task-dependent systems closer to the desired goal of domain independence.
To achieve this, we rely on the ACID/HNN framework, a hierarchical
connectionist modeling paradigm that allows a tree-structured modeling
hierarchy to be adapted dynamically to the differing specificity of
phonetic context in new domains. Experimental validation of the proposed
approach was carried out by adapting the size and structure of ACID/HNN-based
acoustic models trained on Switchboard to two quite different, unseen
domains, the Wall Street Journal and an English Spontaneous Scheduling
Task. In both cases, our approach yields considerably downsized acoustic
models with performance improvements of up to 18% over the unadapted
baseline models.
Authors:
Aravind Ganapathiraju, Institute for Signal and Information Processing, Mississippi State University (USA)
Jonathan Hamaker, Institute for Signal and Information Processing, Mississippi State University (USA)
Joseph Picone, Institute for Signal and Information Processing, Mississippi State University (USA)
Page (NA) Paper number 410
Abstract:
A Support Vector Machine (SVM) is a promising machine learning technique
that has generated a lot of interest in the pattern recognition community
in recent years. The greatest asset of an SVM is its ability to construct
nonlinear decision regions in a discriminative fashion. This paper
describes an application of SVMs to two speech data classification
experiments: 11 vowels spoken in isolation and 16 phones extracted
from spontaneous telephone speech. The best performance achieved on
the spontaneous speech classification task is a 51% error rate using
an RBF kernel. This is comparable to the frame-level classification
performance achieved by other nonlinear modeling techniques such as
artificial neural networks (ANNs).
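The paper's SVMs are not reproduced here, but the role of the RBF kernel in building a nonlinear decision region can be illustrated with a kernel perceptron, a simpler discriminative learner that uses the same kernel trick; the XOR data and the gamma value are illustrative, not from the paper.

```python
import math

def rbf(x, y, gamma=1.0):
    # Radial basis function kernel K(x, y) = exp(-gamma * ||x - y||^2).
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))

def train_kernel_perceptron(xs, ys, epochs=50):
    # Dual perceptron: alpha[i] counts mistakes on sample i; the decision
    # function is f(x) = sum_i alpha[i] * ys[i] * K(xs[i], x).
    alpha = [0] * len(xs)
    for _ in range(epochs):
        for i, (x, y) in enumerate(zip(xs, ys)):
            f = sum(a * t * rbf(s, x) for a, t, s in zip(alpha, ys, xs))
            if y * f <= 0:          # misclassified (or undecided): update
                alpha[i] += 1
    return alpha

def predict(alpha, xs, ys, x):
    f = sum(a * t * rbf(s, x) for a, t, s in zip(alpha, ys, xs))
    return 1 if f > 0 else -1

# XOR: not linearly separable in the input space, but separable once the
# RBF kernel implicitly maps the points into a richer feature space.
xs = [(0.0, 0.0), (1.0, 1.0), (0.0, 1.0), (1.0, 0.0)]
ys = [-1, -1, 1, 1]
alpha = train_kernel_perceptron(xs, ys)
```

An SVM replaces the mistake-driven updates with a margin-maximising optimisation, but its decision function has exactly this kernel-expansion form.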
Authors:
Malan B. Gandhi, Lucent Technologies Bell Laboratories (USA)
Page (NA) Paper number 90
Abstract:
Many automatic speech recognition telephony applications involve recognition
of input containing some type of numbers. Traditionally, this has been
achieved by using isolated or connected digit recognizers. However,
as speech recognition finds a wider range of applications, it is often
infeasible to impose restrictions on speaker behavior. This paper studies
two model topologies for natural number recognition which use minimum
classification error (MCE) trained inter-word context dependent acoustic
models. One model topology uses triphone context units while another
is of the head-body-tail (HBT) type. The performance of the models
is evaluated on three natural number applications involving recognition
of dates, time of day, and dollar amounts. Experimental results show
that context dependent models reduce string error rates by as much
as 50% over baseline context independent whole-word models. String
accuracies of about 93% are obtained on these tasks while at the same
time allowing users flexibility in speaking styles.
Authors:
Jonathan Hamaker, Institute for Signal and Information Processing, Mississippi State University (USA)
Aravind Ganapathiraju, Institute for Signal and Information Processing, Mississippi State University (USA)
Joseph Picone, Institute for Signal and Information Processing, Mississippi State University (USA)
Page (NA) Paper number 653
Abstract:
The primary problem in large vocabulary conversational speech recognition
(LVCSR) is poor acoustic-level matching due to large variability in
pronunciations. There is much to explore about the "quality" of states
in an HMM and the inter-relationships between the inter-state and
intra-state Gaussians used to model speech. Of particular interest is
the variable discriminating power of the individual states. This paper
investigates means of exploiting such dependencies through model topology
optimization based on the Bayesian Information Criterion (BIC) and the
Minimum Description Length (MDL) principle.
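A minimal sketch of BIC-based model selection on toy 1-D data (illustrative numbers, not the paper's experiments): the two-component model wins despite its larger parameter penalty. The hard-split fit below is a stand-in for a real EM fit.

```python
import math

def gauss_logpdf(x, mean, var):
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def ml_fit(data):
    # Maximum-likelihood mean and (biased) variance of a 1-D sample.
    mean = sum(data) / len(data)
    var = sum((x - mean) ** 2 for x in data) / len(data)
    return mean, var

def bic(log_likelihood, n_params, n_obs):
    # BIC as a model score: higher is better; the penalty grows with
    # model size and (logarithmically) with the amount of data.
    return log_likelihood - 0.5 * n_params * math.log(n_obs)

# Toy 1-D data with two clear clusters (illustrative, not from the paper).
data = [0.1, -0.2, 0.05, 0.15, 5.0, 4.9, 5.1, 4.95]

# Model A: one Gaussian (2 parameters).
m, v = ml_fit(data)
ll_one = sum(gauss_logpdf(x, m, v) for x in data)
bic_one = bic(ll_one, 2, len(data))

# Model B: two-component mixture fitted by a hard split (5 parameters:
# two means, two variances, one mixture weight).
left, right = data[:4], data[4:]
(m1, v1), (m2, v2) = ml_fit(left), ml_fit(right)
ll_two = sum(math.log(0.5 * math.exp(gauss_logpdf(x, m1, v1))
                      + 0.5 * math.exp(gauss_logpdf(x, m2, v2)))
             for x in data)
bic_two = bic(ll_two, 5, len(data))
```

Applied to HMM topology, the same trade-off decides whether splitting a state or adding a Gaussian buys enough likelihood to justify the extra parameters.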
Authors:
Kengo Hanai, Toyohashi University of Technology (Japan)
Kazumasa Yamamoto, Toyohashi University of Technology (Japan)
Nobuaki Minematsu, Toyohashi University of Technology (Japan)
Seiichi Nakagawa, Toyohashi University of Technology (Japan)
Page (NA) Paper number 1012
Abstract:
It is well known that HMMs with only the basic structure cannot adequately
capture the correlations among successive frames. In our previous
work, segmental-unit HMMs were introduced to solve this problem, and
their effectiveness was shown. The integration of delta-cepstrum
and delta-delta-cepstrum into the segmental-unit HMMs was also found
to improve recognition performance. In this paper,
we investigate further refinements of the models using a mixture
of PDFs and/or context dependency, where, for a given syllable, only
the preceding vowel is treated as context information. Recognition
experiments showed that the accuracy rate was improved by 23%, which
clearly indicates the effectiveness of the refinements examined in
this paper. The proposed syllable-based HMM outperformed a triphone
model.
Authors:
Jacques Simonin, France Telecom - CNET (France)
Lionel Delphin-Poulat, France Telecom - CNET (France)
Geraldine Damnati, France Telecom - CNET (France)
Page (NA) Paper number 1063
Abstract:
This paper presents the use of a Gaussian density tree structure that
reduces computational cost without significantly degrading recognition
performance during continuous speech recognition.
The Gaussian tree structure is built by successively merging Gaussian
densities. Each node of the tree is associated with a Gaussian density,
and the actual HMM densities are associated with the leaves. We then
propose a criterion for obtaining good recognition performance with this
Gaussian tree structure. The structure is evaluated with a continuous
speech recognition system on a telephone database. The criterion allows
a 75 to 85% reduction in computational cost, in terms of log-likelihood
computations, without any significant increase in word error rate during
the recognition process.
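One standard way to build such a tree is to merge pairs of Gaussians by moment matching, so that a parent node preserves the weight, mean, and variance of the mixture it summarizes. The sketch below covers the scalar (diagonal) case and is not necessarily the paper's exact merging criterion.

```python
def merge_gaussians(w1, mean1, var1, w2, mean2, var2):
    # Moment matching: the merged (parent-node) Gaussian keeps the total
    # weight, the weighted mean, and the second moment of the pair.
    w = w1 + w2
    mean = (w1 * mean1 + w2 * mean2) / w
    var = (w1 * (var1 + mean1 ** 2) + w2 * (var2 + mean2 ** 2)) / w - mean ** 2
    return w, mean, var

# Merging two identical densities must reproduce them exactly.
w, m, v = merge_gaussians(0.5, 1.0, 2.0, 0.5, 1.0, 2.0)

# Merging two separated densities inflates the variance to cover both.
w2_, m2_, v2_ = merge_gaussians(0.5, -1.0, 1.0, 0.5, 1.0, 1.0)
```

During decoding, a node whose (coarse) score is poor lets the search skip the whole subtree of leaf densities beneath it, which is where the log-likelihood savings come from.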
Authors:
Hiroaki Kojima, Electrotechnical Laboratory, AIST, MITI (Japan)
Kazuyo Tanaka, Electrotechnical Laboratory, AIST, MITI (Japan)
Page (NA) Paper number 995
Abstract:
The goal of this work is to model phone-like units automatically from
spoken word samples without using any transcriptions except for the
lexical identification of the words. To implement this task,
we have proposed the "piecewise linear segment lattice (PLSL)" model
for phoneme representation. The structure of this model is a lattice
of segments, each of which is represented by the regression coefficients
of the feature vectors within the segment. To organize phone
models, operations including division, concatenation, blocking and
clustering are applied to the models. This paper mainly reports on blocking
and clustering. Experimental results on an isolated word recognition
task show that the recognition rate is significantly improved by blocking
the segments and by clustering the segments within a block. We obtain
sufficient performance for the task with models consisting of at most
128 clusters of segment patterns.
Authors:
Ryosuke Koshiba, Toshiba Kansai Research Laboratories (Japan)
Mitsuyoshi Tachimori, Toshiba Kansai Research Laboratories (Japan)
Hiroshi Kanazawa, Toshiba Kansai Research Laboratories (Japan)
Page (NA) Paper number 197
Abstract:
We propose a new algorithm that reduces the amount of calculation in the
likelihood computation of continuous mixture HMMs (CMHMMs) with block-diagonal
covariance matrices while retaining a high recognition rate. The block
matrices are optimized by minimizing the difference between the output
probability calculated with full covariance matrices and that calculated
with block-diagonal covariance matrices. The idea was implemented and
tested on a continuous number recognition task.
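The computational appeal of block-diagonal covariances is that the Gaussian log-density factorizes over the blocks. A sketch for a 3-D feature with one 2x2 block and one 1x1 block (values invented for the example):

```python
import math

def logpdf_1d(x, mean, var):
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)

def logpdf_2d(x, mean, cov):
    # Full 2x2 Gaussian log-density via the analytic inverse/determinant.
    dx = [x[0] - mean[0], x[1] - mean[1]]
    det = cov[0][0] * cov[1][1] - cov[0][1] ** 2
    quad = (cov[1][1] * dx[0] ** 2 - 2 * cov[0][1] * dx[0] * dx[1]
            + cov[0][0] * dx[1] ** 2) / det
    return -0.5 * (2 * math.log(2 * math.pi) + math.log(det) + quad)

def block_diag_logpdf(x, mean, blocks):
    # A block-diagonal Gaussian factorizes over blocks, so the log-density
    # is a sum of small per-block terms instead of one large quadratic form.
    total, i = 0.0, 0
    for block in blocks:
        if len(block) == 1:
            total += logpdf_1d(x[i], mean[i], block[0][0])
            i += 1
        else:
            total += logpdf_2d(x[i:i + 2], mean[i:i + 2], block)
            i += 2
    return total

x = [0.3, -0.1, 1.2]
mean = [0.0, 0.0, 1.0]
# One 2x2 block (a correlated feature pair) plus one 1x1 block.
blocks = [[[1.0, 0.4], [0.4, 1.0]], [[2.0]]]
ll = block_diag_logpdf(x, mean, blocks)
```

With all blocks of size one this reduces to the usual diagonal-covariance computation, so block size trades accuracy against cost exactly as the abstract describes.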
Authors:
C. Chesta, Politecnico di Torino (Italy)
Pietro Laface, Politecnico di Torino (Italy)
F. Ravera, CSELT - Centro Studi e Laboratori Telecomunicazioni (Italy)
Page (NA) Paper number 149
Abstract:
In this paper we show that accurate HMMs for connected word recognition
can be obtained without context-dependent modeling or discriminative
training. To account for different speaking rates, we define two HMMs
for each word to be trained. The two models have the same standard
left-to-right topology, with the possibility of skipping one state,
but each model has a different, automatically selected number of states.
Our simple modeling and training technique has been applied to connected
digit recognition using the adult speaker portion of the TI/NIST corpus.
The results obtained are comparable with the best reported in
the literature for models with a larger number of densities.
Authors:
Tan Lee, Department of Electronic Engineering, The Chinese University of Hong Kong (Hong Kong)
Rolf Carlson, Centre for Speech Technology, Royal Institute of Technology (Sweden)
Björn Granström, Centre for Speech Technology, Royal Institute of Technology (Sweden)
Page (NA) Paper number 441
Abstract:
This paper presents a trial study of using context-dependent segmental
duration for continuous speech recognition in a domain-specific application.
Different modelling strategies are proposed for function words and
content words. Stress status, word position in the utterance and phone
position in the word are identified as the three most crucial factors
affecting segmental duration in this particular application. In addition,
speaking rate normalization is applied to further reduce duration
variability. Experimental results show that the normalized duration models
help improve the rank of the correct sentence in the N-best hypothesis
list.
Authors:
Brian Mak, Department of Computer Science, The Hong Kong University of Science & Technology (China)
Enrico Bocchieri, AT&T Labs -- Research (USA)
Page (NA) Paper number 699
Abstract:
Training of continuous density hidden Markov models (CDHMMs) is usually
time-consuming and tedious due to the large number of model parameters
involved. Recently we proposed a new derivative of the CDHMM, the subspace
distribution clustering hidden Markov model (SDCHMM), which ties CDHMMs
at the finer level of subspace distributions, resulting in many fewer
model parameters. An SDCHMM training algorithm has also been devised to
train SDCHMMs directly from speech data without intermediate CDHMMs.
On the ATIS task, speaker-independent context-independent (CI) SDCHMMs
can be trained with as little as 8 minutes of speech with no loss in
recognition accuracy --- a 25-fold reduction when compared with their
CDHMM counterparts. In this paper, we extend our novel SDCHMM training
to context-dependent (CD) modeling under various assumptions of prior
knowledge. Despite the 30-fold increase in model parameters in the
CD ATIS CDHMMs, their equivalent CD SDCHMMs can still be estimated
with a few minutes of ATIS data.
Authors:
Cesar Martín del Alamo, Telefonica I+D (Spain)
Luis Villarrubia, Telefonica I+D (Spain)
Francisco Javier González, ETSI Telecomunicaci-on (Spain)
Luis A. Hernández, ETSI Telecomunicaci-on (Spain)
Page (NA) Paper number 443
Abstract:
In this work, automatic methods for determining the number of Gaussians
per state in a set of hidden Markov models are studied. Four different
mix-up criteria are proposed to decide how to increase the size of
the states. These criteria, derived from maximum likelihood scores,
aim to increase the discrimination between states, yielding a
different number of Gaussians per state. We compare the proposed
methods with the common approach in which the number of density functions
in every state is equal and fixed in advance by the designer. Experimental
results demonstrate that performance can be maintained while reducing
the total number of density functions by 17% (from 2046 down to 1705).
These results are obtained in a flexible large vocabulary isolated
word recognizer using context dependent models.
Authors:
Máté Szarvas, Technical University of Budapest (Hungary)
Shoichi Matsunaga, NTT Human Interface Laboratories (Japan)
Page (NA) Paper number 1098
Abstract:
This paper describes a novel method that models the correlation between
acoustic observations in contiguous speech segments. The basic idea
behind the method is that acoustic observations are conditioned not
only on the phonetic context but also on the preceding acoustic segment
observation. The correlation between consecutive acoustic observations
is modeled by polynomial mean trajectory segment models. This method
is an extension of conventional segment modeling approaches in that
it describes the correlation of acoustic observations not only within
segments but also between contiguous segments. It is also a generalization
of phonetic context (e.g., triphone) modeling approaches because it
can model acoustic context and phonetic context at the same time.
In a speaker-independent phoneme classification test, using the proposed
method resulted in a 7-9% reduction in error rate as compared to the
traditional triphone segmental model system and a 31% reduction as
compared to a similar triphone HMM system.
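The simplest instance of a polynomial mean trajectory is a linear fit to a 1-D segment; the sketch below recovers such a trajectory by least squares. It omits the paper's key extension, conditioning the trajectory on the preceding acoustic segment.

```python
def fit_linear_trajectory(frames):
    # Least-squares fit of a linear mean trajectory mu(t) = b0 + b1 * t
    # to a segment of 1-D observations, with t = 0 .. len(frames) - 1.
    n = len(frames)
    ts = list(range(n))
    mean_t = sum(ts) / n
    mean_y = sum(frames) / n
    b1 = (sum(t * y for t, y in zip(ts, frames)) - n * mean_t * mean_y) / \
         (sum(t * t for t in ts) - n * mean_t ** 2)
    b0 = mean_y - b1 * mean_t
    return b0, b1

def trajectory(b0, b1, n):
    # The modelled mean at each frame of an n-frame segment.
    return [b0 + b1 * t for t in range(n)]

# A segment whose mean genuinely rises linearly is recovered exactly.
segment = [1.0, 1.5, 2.0, 2.5, 3.0]
b0, b1 = fit_linear_trajectory(segment)
```

A segment model then scores the residuals of the observations around this trajectory rather than treating every frame as an independent draw from a static state distribution.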
Authors:
Ji Ming, Queen's University of Belfast (U.K.)
Philip Hanna, Queen's University of Belfast (U.K.)
Darryl Stewart, Queen's University of Belfast (U.K.)
Saeed Vaseghi, Queen's University of Belfast (U.K.)
F. Jack Smith, Queen's University of Belfast (U.K.)
Page (NA) Paper number 263
Abstract:
An acoustic model is a simplified mathematical representation of acoustic-phonetic
information. The simplifying assumptions inherent to each model entail
that it may only be capable of capturing a certain aspect of the available
information. An effective combination of different types of model should
therefore permit a combined model that can utilize all the information
captured by the individual models. This paper reports some preliminary
research in combining certain types of acoustic model for speech recognition.
In particular, we designed and implemented a single HMM framework,
which combines a segment-based modeling technique with the standard
HMM technique. The recognition experiments, based on a speaker-independent
E-set database, have shown that the combined model has the potential
to produce significantly higher performance than the individual
models considered in isolation.
Authors:
Laurence Molloy, Centre for Speech Technology Research (U.K.)
Stephen Isard, Centre for Speech Technology Research (U.K.)
Page (NA) Paper number 1103
Abstract:
In this paper, a method of integrating a model of suprasegmental duration
with an HMM-based recogniser at the post-processing level is presented.
The N-best utterance output is rescored using a suitable linear combination
of acoustic log-likelihood (provided by a set of tied-state triphone
HMMs) and duration log-likelihood (provided by a set of durational
models). The durational model used in the post-processing imposes
syllable-level elastic constraints on the durational behaviour of speech
segments. Results are presented for word accuracy on the Resource
Management database after rescoring, using two different syllable-like
constraint units, a fixed-size N-phone window, and simple (unconstrained)
phone duration probability scoring.
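N-best rescoring with a linear combination of acoustic and duration log-likelihoods can be sketched as follows; the hypothesis texts, scores, and weight are invented for illustration.

```python
def rescore_nbest(hypotheses, weight):
    # Combined score = acoustic log-likelihood + weight * duration
    # log-likelihood; the N-best list is re-ranked by the combined score.
    return sorted(hypotheses,
                  key=lambda h: h['acoustic'] + weight * h['duration'],
                  reverse=True)

# Illustrative scores: the correct hypothesis is second acoustically
# but has far more plausible segment durations.
nbest = [
    {'text': 'wrong words', 'acoustic': -100.0, 'duration': -30.0},
    {'text': 'right words', 'acoustic': -101.0, 'duration': -10.0},
    {'text': 'other words', 'acoustic': -105.0, 'duration': -12.0},
]
reranked = rescore_nbest(nbest, weight=0.2)
```

The weight is typically tuned on held-out data; with weight zero the original acoustic ranking is recovered.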
Authors:
Albino Nogueiras-Rodríguez, Universitat Politecnica de Catalunya (Spain)
José B. Mariño, Universitat Politecnica de Catalunya (Spain)
Enric Monte, Universitat Politecnica de Catalunya (Spain)
Page (NA) Paper number 769
Abstract:
Although it has proved to be a very powerful tool in acoustic modelling,
discriminative training presents a major drawback: the lack of a formulation
that guarantees convergence regardless of the initial conditions, such
as the Baum-Welch algorithm provides in maximum likelihood training. For
this reason, a gradient descent search is usually used for this kind of
problem. Unfortunately, standard gradient descent algorithms rely heavily
on the choice of learning rates. This dependence is especially cumbersome
because it means that, at each run of the discriminative training
procedure, a search must be carried out over the parameters governing
the algorithm. In this paper we describe an adaptive procedure for
determining the optimal value of the step size at each iteration.
While the computational and memory overhead of the algorithm is negligible,
results show less dependence on the initial learning rate than standard
gradient descent; using the same idea to apply self-scaling, the procedure
clearly outperforms standard gradient descent.
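The abstract does not give the authors' exact adaptation rule; the sketch below uses a simple "bold driver" scheme (grow the step after a successful move, shrink and retry after a failed one) to show the general idea on a 1-D quadratic.

```python
def minimise(grad, loss, x0, step=1.0, iters=100):
    # "Bold driver" step-size adaptation: a move that lowers the loss is
    # accepted and the step grows; a move that raises it is rejected and
    # the step is halved. The initial step can therefore be chosen badly.
    x, current = x0, loss(x0)
    for _ in range(iters):
        candidate = x - step * grad(x)
        new_loss = loss(candidate)
        if new_loss < current:
            x, current = candidate, new_loss
            step *= 1.1
        else:
            step *= 0.5
    return x

# Quadratic bowl with its minimum at x = 3, started far away with a
# deliberately oversized initial step.
loss = lambda x: (x - 3.0) ** 2
grad = lambda x: 2.0 * (x - 3.0)
x_opt = minimise(grad, loss, x0=-10.0, step=10.0)
```

Because rejected moves only shrink the step, the scheme recovers from a poor initial learning rate instead of diverging, which is the property the abstract emphasises.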
Authors:
Albino Nogueiras-Rodríguez, Universitat Politecnica de Catalunya (Spain)
José B. Mariño, Universitat Politecnica de Catalunya (Spain)
Page (NA) Paper number 770
Abstract:
Discriminative training is a powerful tool in acoustic modeling for
automatic speech recognition. Its strength lies in the direct minimisation
of the number of errors committed by the system at recognition time.
This is usually accomplished by defining an auxiliary function that
characterises the behaviour of the system, and adjusting the parameters
of the system so that this function is minimised. The main drawback
of this approach is that a task-specific training database is needed.
In this paper an alternative procedure is proposed: task adaptation
using task-independent databases. It consists of combining acoustic
information, estimated using a general-purpose training database, with
linguistic information taken from the definition of the task. In the
experiments carried out, this technique led to great improvements
in the recognition of two different tasks: clean speech digit strings
in English, and dates in Spanish over the telephone line.
Authors:
Gordon Ramsay, ICP-INPG (France)
Page (NA) Paper number 671
Abstract:
A stochastic approach to modelling speech production and perception
is discussed, based on Itô calculus. Speech is modelled by a
system of non-linear stochastic differential equations evolving on
a finite-dimensional state space, representing a partially-observed
Markov process. The optimal non-linear filtering equations for the
model are stated, and shown to exhibit a predictor-corrector structure,
which mimics the structure of the original system. This is used to
suggest a possible justification for the hypothesis that speakers and
listeners make use of an ``internal model'' in producing and perceiving
speech, and leads to a useful statistical framework for articulatory
speech recognition.
Authors:
Christian J. Wellekens, Institut Eurecom (France)
Jussi Kangasharju, Institut Eurecom (France)
Cedric Milesi, Institut Eurecom (France)
Page (NA) Paper number 271
Abstract:
Among the different attempts to improve recognition scores and robustness
to noise, the recognition of parallel streams of data, each representing
partial information about the test signal, and the fusion of the resulting
decisions have received a great deal of interest. The problem of training
such models while taking into account recombination constraints at the
level of speech subunits has not yet been rigorously addressed. This paper
shows how equivalence with an extended meta-HMM solves the problem and how
reestimation formulas have to be applied to guarantee equivalence between
the multistream model and the meta-HMM. Experiments demonstrate the
importance of the transition probabilities in the meta-HMM, which have
to meet certain constraints in order to represent a multistream HMM.
Authors:
Christian J. Wellekens, Institut Eurecom - Sophia Antipolis (France)
Page (NA) Paper number 272
Abstract:
The use of contextual features has long been shown to improve
recognition scores: numerical estimates of speed and acceleration
appended to the current feature vectors, predictive HMMs, or neural
networks. All these implementations are particular cases of FIR filtering
of the feature trajectories. This paper presents a new approach in which
the characteristics of the filters are trained together with the HMM
parameters, resulting in improved recognition in first tests. Reestimation
formulas are derived for the cut-off frequencies of ideal low-pass filters
as well as for the impulse response coefficients of a general FIR low-pass
filter. Filters can be either common to all feature vectors or dedicated
to a given entry or a given HMM state.
Authors:
Christoph Neukirchen, Department of Computer Science, Gerhard-Mercator-University Duisburg (Germany)
Daniel Willett, Department of Computer Science, Gerhard-Mercator-University Duisburg (Germany)
Gerhard Rigoll, Department of Computer Science, Gerhard-Mercator-University Duisburg (Germany)
Page (NA) Paper number 346
Abstract:
This paper introduces a method for regularization of HMM systems that
avoids parameter overfitting caused by insufficient training data.
Regularization is done by augmenting the EM training method with a penalty
term that favors simple and smooth HMM systems. The penalty term
is constructed as a mixture model of negative exponential distributions
that is assumed to generate the state-dependent emission probabilities
of the HMMs. This new method is a successful transfer of a well-known
regularization approach from neural networks to the HMM domain
and can be interpreted as a generalization of traditional state-tying
for HMM systems. The effect of regularization is demonstrated for
continuous speech recognition tasks by improving overfitted triphone
models and by speaker adaptation with limited training data.
Authors:
Silke Witt, Cambridge University Engineering Dept (U.K.)
Steve Young, Cambridge University Engineering Dept (U.K.)
Page (NA) Paper number 1010
Abstract:
This paper investigates how to improve the acoustic modelling of non-native
speech. For this purpose we present an adaptation technique that combines
hidden Markov models of the source and target languages of a foreign
language student. Such model combination requires a mapping of the
mean vectors from the target to the source language. Three different
mapping approaches, based on phonetic knowledge and/or acoustic
distance measures, have therefore been tested. The performance of this
model combination method and several variations of it has been measured
and compared with standard MLLR adaptation. For the baseline model
combination, small improvements in recognition accuracy over the results
obtained with MLLR were observed. Furthermore, slight improvements were
found when using an a priori approach, in which the models were combined
with predefined weights before applying any of the adaptation techniques.
Authors:
Tae-Young Yang, Yonsei Univ. Dept. of Electronics (Korea)
Ji-Sung Kim, Yonsei Univ. Dept. of Electronics (Korea)
Chungyong Lee, Yonsei Univ. Dept. of Electronics (Korea)
Dae Hee Youn, Yonsei Univ. Dept. of Electronics (Korea)
Il-Whan Cha, Yonsei Univ. Dept. of Electronics (Korea)
Page (NA) Paper number 428
Abstract:
A duration modeling scheme and a speaking rate compensation technique
are presented for an HMM-based connected digit recognizer. The proposed
duration modeling technique uses a cumulative duration probability,
which can also be used to obtain the duration bounds for bounded
duration modeling. One advantage of the proposed technique is that the
cumulative duration probability can be applied directly in the Viterbi
decoding procedure without additional postprocessing; it thus governs
the state and word transitions at each frame. To alleviate the problems
caused by fast or slow speech, a modification of the bounded duration
modeling which accounts for speaking rate is described. Experimental
results on Korean connected digit recognition show the effectiveness
of the proposed duration modeling scheme and the speaking rate
compensation technique.
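A sketch of how a cumulative duration probability yields duration bounds for bounded duration modelling; the histogram and the tail threshold tau are illustrative, not the paper's values.

```python
def cumulative(duration_probs):
    # Running sum of a discrete duration distribution P(d), d = 1, 2, ...
    total, cdf = 0.0, []
    for p in duration_probs:
        total += p
        cdf.append(total)
    return cdf

def duration_bounds(duration_probs, tau=0.05):
    # Bounds for bounded duration modelling: the shortest and longest
    # durations whose cumulative probability lies within [tau, 1 - tau],
    # trimming the unlikely tails of the distribution.
    cdf = cumulative(duration_probs)
    lower = next(d for d, c in enumerate(cdf, start=1) if c >= tau)
    upper = next(d for d, c in enumerate(cdf, start=1) if c >= 1.0 - tau)
    return lower, upper

# Illustrative duration histogram for one HMM state (durations 1..6 frames).
probs = [0.02, 0.08, 0.4, 0.4, 0.08, 0.02]
lo, hi = duration_bounds(probs, tau=0.05)
```

During Viterbi decoding, a state transition would simply be disallowed while the accumulated state duration is below the lower bound and forced once it exceeds the upper bound, so no post-processing pass is needed.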
Authors:
Geoffrey Zweig, IBM T.J. Watson Research Center (USA)
Stuart Russell, U.C. Berkeley (USA)
Page (NA) Paper number 858
Abstract:
This paper describes the application of Bayesian networks to automatic
speech recognition (ASR). Bayesian networks enable the construction
of probabilistic models in which an arbitrary set of variables can
be associated with each speech frame in order to explicitly model factors
such as acoustic context, speaking rate, or articulator positions.
Once the basic inference machinery is in place, a wide variety of models
can be expressed and tested. We have implemented a Bayesian network
system for isolated word recognition, and present experimental results
on the PhoneBook database. These results indicate that performance
improves when the observations are conditioned on an auxiliary variable
modeling acoustic/articulatory context. The use of multivalued and
multiple context variables further improves recognition accuracy.