Authors:
Wu Chou, Bell Labs., Lucent Technologies (USA)
Wolfgang Reichl, Bell Labs., Lucent Technologies (USA)
Paper number 607
Abstract:
In this paper, an m-level optimal subtree based phonetic decision
tree clustering algorithm is described. Unlike prior approaches, the
m-level optimal subtree is used to generate log-likelihood estimates
from multiple-mixture Gaussians for phonetic decision tree based
state tying. It models the log-likelihood variations in node
splitting more accurately and is consistent with the acoustic space
partition induced by the set of phonetic questions applied during
the decision tree state tying process. To reduce the algorithmic
complexity, a caching scheme based on previous search results is
also described. It speeds up the construction of the m-level optimal
subtree significantly without degrading recognition performance,
making the proposed approach suitable for large vocabulary speech
recognition tasks. Experimental results on a standard Wall Street
Journal speech recognition task indicate that the proposed m-level
optimal subtree approach outperforms the conventional approach of
using single-mixture Gaussians in phonetic decision tree based state
tying.
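
For context, the conventional baseline that this abstract contrasts
with scores a candidate node split by the log-likelihood gain under a
single Gaussian per node. The following is a minimal illustrative
sketch of that baseline criterion (function names are ours, not the
authors'); the paper's contribution replaces the single Gaussian with
multiple-mixture estimates from an m-level subtree.

    import numpy as np

    def node_log_likelihood(frames):
        """Log likelihood of frames under one ML-fitted Gaussian.
        For an ML fit this reduces to
        -n/2 * (d*(1 + log(2*pi)) + log|Sigma|)."""
        n, d = frames.shape
        cov = np.cov(frames, rowvar=False) + 1e-6 * np.eye(d)
        sign, logdet = np.linalg.slogdet(cov)
        return -0.5 * n * (d * (1.0 + np.log(2.0 * np.pi)) + logdet)

    def split_gain(frames, answers):
        """Gain of splitting a node by a phonetic question;
        `answers` is a boolean mask (True = the 'yes' branch)."""
        yes, no = frames[answers], frames[~answers]
        return (node_log_likelihood(yes) + node_log_likelihood(no)
                - node_log_likelihood(frames))
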
Authors:
Thomas Kemp, ISL, University of Karlsruhe (Germany)
Alex Waibel, ISL, University of Karlsruhe (Germany)
Paper number 758
Abstract:
Current speech recognition systems require large amounts of expensive
transcribed data for parameter estimation. In this work, we describe
experiments aimed at training a speech recognizer without
transcriptions. The experiments were carried out with untranscribed
TV newscast recordings. The newscasts were automatically segmented
into segments with similar acoustic background conditions. We develop
a training scheme in which a recognizer is bootstrapped using very
little transcribed data and then improved using new, untranscribed
speech. We show that it is necessary to use a confidence measure to
judge the recognizer's initial transcriptions before using them.
Larger improvements can be achieved if the number of parameters in
the system is increased as more data becomes available. We show that
the beneficial effect of unsupervised training is not subsumed by
MLLR adaptation on the hypotheses. Using the described methods, we
found that the untranscribed data yields roughly one third of the
improvement obtained with transcribed material.
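
The bootstrap-and-filter loop described above can be summarized in a
short sketch. The recognizer interface (decode returning a hypothesis
with a confidence score, retrain) is hypothetical and stands in for
whatever toolkit is used; the threshold is likewise an assumed
parameter, not a value from the paper.

    def unsupervised_round(recognizer, untranscribed, threshold=0.9):
        """One round: transcribe new audio with the current model,
        keep only confidently recognized segments, then retrain."""
        selected = []
        for segment in untranscribed:
            hypothesis, confidence = recognizer.decode(segment)
            if confidence >= threshold:  # confidence-measure filter
                selected.append((segment, hypothesis))
        # Retraining may also grow the parameter count as the amount
        # of usable data grows, as the abstract suggests.
        recognizer.retrain(selected)
        return recognizer
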
Authors:
Clark Z. Lee, INRS-Telecommunications (Canada)
Douglas O'Shaughnessy, INRS-Telecommunications (Canada)
Paper number 208
Abstract:
For large vocabulary continuous speech recognition based on hidden
Markov models, we often face a trade-off between accuracy and speed.
This article proposes a new method in which complex models are used
to retain high accuracy, while speed is achieved by exploiting
similarities in acoustic matches. These similarities rest on an
assumption that we refer to as the look-phone-context property. By
using the look-phone-context property, the number of acoustic matches
can be substantially reduced in the course of scoring all possible
phonetic transcriptions of recognition hypotheses. Experiments on the
speaker-independent Wall Street Journal task show that a
fast-response system can be achieved without compromising accuracy.
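
The abstract does not spell out the mechanics, but the effect of the
look-phone-context property amounts to memoizing acoustic matches: if
two hypotheses contain the same phone in the same local context over
the same frame span, the match is computed once and reused. A
hypothetical sketch of such a cache (all names are ours):

    class AcousticScoreCache:
        """Reuse acoustic match scores keyed by a phone, its local
        context, and the frame span it is matched against."""
        def __init__(self, score_fn):
            self.score_fn = score_fn  # expensive HMM acoustic match
            self._cache = {}

        def score(self, phone, left, right, start, end):
            key = (phone, left, right, start, end)
            if key not in self._cache:
                self._cache[key] = self.score_fn(phone, left, right,
                                                 start, end)
            return self._cache[key]
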
Authors:
Jacques Duchateau, Katholieke Universiteit Leuven - ESAT (Belgium)
Kris Demuynck, Katholieke Universiteit Leuven - ESAT (Belgium)
Dirk Van Compernolle, Lernout and Hauspie Speech Products (Belgium)
Patrick Wambacq, Katholieke Universiteit Leuven - ESAT (Belgium)
Paper number 161
Abstract:
In an HMM based large vocabulary continuous speech recognition
system, the evaluation of the (context dependent) acoustic models is
very time consuming. In Semi-Continuous HMMs, a state is modelled as
a mixture of elementary (generally Gaussian) probability density
functions. Observation probability calculations for these states can
be made faster by reducing the size of the Gaussian mixture used to
model them. In this paper, we propose different criteria for deciding
which Gaussians should remain in a state's mixture and which can be
removed. The performance of the criteria is compared on context
dependent tied-state models using the WSJ recognition task. Our novel
criterion, which removes a Gaussian from a state if it is estimated
from too little acoustic data, outperforms the other criteria
described.
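
The novel criterion can be sketched in a few lines: a Gaussian whose
expected frame count (occupancy) falls below a threshold is deemed to
be estimated from too little acoustic data and is dropped, with the
remaining mixture weights renormalized. The threshold value and names
below are illustrative assumptions, not the paper's settings.

    import numpy as np

    def prune_mixture(weights, occupancy, min_frames=50.0):
        """Keep a component only if its occupancy count is above
        min_frames; never empty a state's mixture entirely."""
        keep = occupancy >= min_frames
        if not keep.any():
            keep[np.argmax(occupancy)] = True
        kept = weights[keep]
        return keep, kept / kept.sum()
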
Authors:
Ananth Sankar, SRI International (USA)
Paper number 193
Abstract:
Most current state-of-the-art large-vocabulary continuous speech
recognition (LVCSR) systems are based on state-clustered hidden
Markov models (HMMs). Typical systems use thousands of state
clusters, each represented by a Gaussian mixture model with a few
tens of Gaussians. In this paper, we show that models with far more
parameter tying, such as phonetically tied mixture (PTM) models, give
better performance in terms of both recognition accuracy and speed.
In particular, we achieved between a 5 and 10% improvement in word
error rate, while cutting the number of Gaussian distance
computations in half, on three different Wall Street Journal (WSJ)
test sets, by using a PTM system with 38 phone-class state clusters
as compared to a state-clustered system with 937 state clusters. For
both systems, the total number of Gaussians was fixed at about
30,000. This result is of real practical significance, as it shows
that a conceptually simpler PTM system can be faster and more
accurate than current state-of-the-art state-clustered HMM systems.
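
The contrast in the degree of tying follows directly from the fixed
Gaussian budget reported above:

    total_gaussians = 30000
    per_ptm_class = total_gaussians / 38       # ~790 per phone class
    per_state_cluster = total_gaussians / 937  # ~32 per state cluster
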
Authors:
Ramesh A. Gopinath, IBM T. J. Watson Research (USA)
Bhuvana Ramabhadran, IBM T. J. Watson Research (USA)
Satya Dharanipragada, IBM T. J. Watson Research (USA)
Paper number 397
Abstract:
Modeling data with Gaussian distributions is an important statistical
problem. To obtain robust models, one imposes constraints on the
means and covariances of these distributions. Constrained ML modeling
implies the existence of optimal feature spaces in which the
constraints are better satisfied. This paper introduces one such
constrained ML modeling technique, called factor analysis invariant
to linear transformations (FACILT), which is essentially factor
analysis in optimal feature spaces. FACILT is a generalization of
several existing methods for modeling covariances. This paper
presents an EM algorithm for FACILT modeling.
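
FACILT couples the factor analysis with an estimated linear feature
transform; that transform estimation is the paper's contribution and
is not reproduced here. As a grounding sketch, the inner building
block, the standard EM update for plain factor analysis (Sigma =
Lambda Lambda^T + Psi, fitted to a sample covariance S), looks as
follows:

    import numpy as np

    def factor_analysis_em(S, k, n_iter=100):
        """Fit Sigma = Lam Lam^T + diag(Psi) to sample covariance S
        with k factors (standard EM, not FACILT's transform step)."""
        d = S.shape[0]
        Lam = np.linalg.cholesky(S + 1e-6 * np.eye(d))[:, :k]
        Psi = np.diag(S).copy()
        for _ in range(n_iter):
            Sigma = Lam @ Lam.T + np.diag(Psi)
            beta = Lam.T @ np.linalg.inv(Sigma)  # k x d posterior map
            Ezz = np.eye(k) - beta @ Lam + beta @ S @ beta.T
            Lam = S @ beta.T @ np.linalg.inv(Ezz)
            Psi = np.diag(S - Lam @ beta @ S)
        return Lam, Psi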