Authors:
Su-Lin Wu, International Computer Science Institute (USA)
Brian E.D. Kingsbury, International Computer Science Institute (USA)
Nelson Morgan, International Computer Science Institute (USA)
Steven Greenberg, International Computer Science Institute (USA)
Paper number 854
Abstract:
Combining knowledge derived from both syllable- (100-250 ms) and phone-length
(40-100 ms) intervals in the automatic speech recognition process can
yield performance superior to that obtained using information derived
from a single time scale alone. The results are particularly pronounced
for reverberant test conditions that have not been incorporated into
the training set. In the present study, phone- and syllable-based
systems are combined at three distinct levels of the recognition process:
the frame, the syllable, and the entire utterance. Each strategy
successfully integrates the complementary strengths of the individual
systems, yielding a significant improvement in accuracy on a small-vocabulary,
naturally spoken, telephone speech corpus. The syllable-level combination
outperformed the other two methods under both relatively pristine and
moderately reverberant acoustic conditions, yielding a 20-40% relative
improvement over the baseline.
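The following minimal sketch illustrates one way the frame-level combination described above could work: the posterior streams of a phone-based and a syllable-based classifier are merged by a weighted average in the log domain and renormalized. The function name, the equal default weight, and the log-domain averaging rule are illustrative assumptions, not the paper's published formula.

    import numpy as np

    def combine_frame_posteriors(p_phone, p_syll, weight=0.5):
        # Weighted average of two posterior streams in the log domain,
        # renormalized so each frame's posteriors sum to one.
        log_p = weight * np.log(p_phone + 1e-10) \
              + (1.0 - weight) * np.log(p_syll + 1e-10)
        p = np.exp(log_p - log_p.max(axis=1, keepdims=True))
        return p / p.sum(axis=1, keepdims=True)

    # Toy usage: two 3-frame, 4-class posterior streams.
    rng = np.random.default_rng(0)
    a = rng.dirichlet(np.ones(4), size=3)
    b = rng.dirichlet(np.ones(4), size=3)
    print(combine_frame_posteriors(a, b))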
Authors:
Arun C. Surendran, Bell Labs, Lucent Technologies (USA)
Chin-Hui Lee, Bell Labs, Lucent Technologies (USA)
Paper number 859
Abstract:
Earlier work on parametric modeling of distortions for robust speech
recognition has focused on estimating the distortion parameter, using
maximum likelihood and other techniques, as a point in the parameter
space, and on treating this estimate as if it were the true value in
a plug-in maximum a posteriori (MAP) decoder. This approach is deficient
in most real environments, where the value of the distortion parameter
varies significantly for many reasons. In this paper we introduce an
approach that combines the power of parametric transformation and Bayesian
prediction to solve this problem. Instead of approximating the distortion
parameter with a point estimate, we average over its variation, thus
taking the distribution of the parameter into consideration as well.
This approach provides more robust performance than the conventional
maximum-likelihood approach. It also provides the solution that minimizes
the overall error given the distribution of the parameter. We present
results to demonstrate the robustness and effectiveness of the predictive
approach.
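A small sketch of the contrast the abstract draws, under simplifying assumptions: the distortion is modeled as an additive bias on a scalar Gaussian observation, the plug-in decoder scores with a point estimate of the bias, and the predictive decoder averages the likelihood over samples from the bias distribution. The function names and the Gaussian bias model are illustrative assumptions, not the authors' formulation.

    import numpy as np
    from scipy.stats import norm

    def plugin_likelihood(x, mu, sigma, b_hat):
        # Plug-in decoding: treat the point estimate b_hat of the
        # distortion (a scalar bias here) as if it were the true value.
        return norm.pdf(x, loc=mu + b_hat, scale=sigma)

    def predictive_likelihood(x, mu, sigma, b_samples):
        # Predictive decoding: average the likelihood over samples
        # drawn from the distortion parameter's distribution.
        return norm.pdf(x, loc=mu + b_samples, scale=sigma).mean()

    # Toy usage: the bias is uncertain, distributed N(0.5, 0.3^2).
    rng = np.random.default_rng(1)
    b_samples = rng.normal(0.5, 0.3, size=10000)
    x, mu, sigma = 1.0, 0.2, 0.5
    print(plugin_likelihood(x, mu, sigma, b_hat=0.5))
    print(predictive_likelihood(x, mu, sigma, b_samples))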
Authors:
Jean-Claude Junqua, Speech Technology Laboratory (USA)
Steven Fincke, Speech Technology Laboratory (USA)
Ken Field, Speech Technology Laboratory (USA)
Paper number 374
Abstract:
To study the Lombard reflex, more realistic databases representing
real-world conditions need to be recorded and analyzed. In this paper
we 1) propose a procedure for recording Lombard data that provides a
good approximation of realistic conditions and 2) compare two sets of
experiments: one in which subjects communicate with a device while
listening to noise through open-ear headphones, and one in which subjects
read a list. By studying acoustic correlates of the Lombard reflex and
performing off-line speaker-independent recognition experiments, we show
that the communication factor affects the Lombard reflex. We also show
evidence that several types of noise differing mainly in their spectral
tilt induce different acoustic changes. This result reinforces the notion
that it is difficult to separate the speaker from the environmental
stressor (in this case the noise) when studying the Lombard reflex.
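Since the abstract attributes the differing acoustic changes to the noises' spectral tilt, the following sketch shows one simple way to operationalize that measure: the slope, in dB per octave, of a least-squares line fit to the log power spectrum against log frequency. The paper does not specify its exact measure, so this definition is an assumption.

    import numpy as np

    def spectral_tilt(signal, fs):
        # Slope (dB per octave) of a least-squares line fit to the
        # log power spectrum against log2 frequency.
        spec = np.abs(np.fft.rfft(signal)) ** 2
        freqs = np.fft.rfftfreq(len(signal), d=1.0 / fs)
        keep = freqs > 0                      # drop DC before taking logs
        log_f = np.log2(freqs[keep])
        log_p = 10.0 * np.log10(spec[keep] + 1e-12)
        slope, _ = np.polyfit(log_f, log_p, 1)
        return slope

    # Toy usage: white noise should have a tilt near 0 dB/octave.
    rng = np.random.default_rng(2)
    print(spectral_tilt(rng.normal(size=16000), fs=16000))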
Authors:
Stefano Crafa, CSELT (Italy)
Luciano Fissore, CSELT (Italy)
Claudio Vair, CSELT (Italy)
Paper number 1140
Abstract:
In this paper, we present an integration of Data-Driven Parallel Model
Combination (DPMC) and Bayesian learning into a fast and accurate framework
that can easily be integrated into standard training and recognition
systems. The DPMC technique has been enhanced so that it no longer requires
modifying the acoustic models, as the original method did. Bayesian
learning is used to specialize a general noisy-speech model (the a priori
model) to the target acoustic environment, with the DPMC-generated
observations serving as adaptation data. Thanks to these innovations,
the proposed method achieves better performance than the original DPMC
while consuming far fewer computational resources.
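A minimal sketch of the two ingredients named above, under simplifying assumptions (scalar log-spectral features, single Gaussians): DPMC-style generation of noisy observations by log-adding samples from clean-speech and noise models, followed by a standard MAP update that shrinks the adapted mean toward the a priori model. The function names and the scalar setting are illustrative, not the authors' implementation.

    import numpy as np

    def dpmc_samples(mu_s, var_s, mu_n, var_n, n, rng):
        # Data-driven PMC: draw clean-speech and noise samples in the
        # log-spectral domain and combine them additively in the linear
        # domain (log-add), giving simulated noisy-speech observations.
        s = rng.normal(mu_s, np.sqrt(var_s), size=n)
        v = rng.normal(mu_n, np.sqrt(var_n), size=n)
        return np.log(np.exp(s) + np.exp(v))

    def map_mean_update(mu_prior, tau, observations):
        # Standard MAP estimate of a Gaussian mean: shrink the sample
        # mean toward the prior; tau controls the prior's weight.
        n = len(observations)
        return (tau * mu_prior + n * observations.mean()) / (tau + n)

    rng = np.random.default_rng(3)
    obs = dpmc_samples(mu_s=2.0, var_s=0.25, mu_n=1.0, var_n=0.04,
                       n=500, rng=rng)
    print(map_mean_update(mu_prior=2.1, tau=10.0, observations=obs))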
Authors:
Martin Hunke, San Francisco State University, San Francisco, CA (USA)
Meeran Hyun, San Francisco State University, San Francisco, CA (USA)
Steve Love, Meridian Speech Technology (USA)
Thomas Holton, San Francisco State University, San Francisco, CA (USA)
Paper number 715
Abstract:
In this study, the performance of an auditory-model feature-extraction
'front end' was assessed in an isolated-word speech recognition task
using a common hidden Markov model (HMM) 'back end', and compared with
the performance of other front-end feature representations, including
mel-frequency cepstral coefficients (MFCC) and two variants (J- and
L-) of the relative spectral amplitude (RASTA) technique. The recognition
task was performed in the presence of varying levels and types of additive
noise and spectral distortion using standard HMM whole-word models
with the Bellcore Digit database as a corpus. While all front ends
achieved comparable recognition performance in clean speech, the performance
of the auditory-model front end was generally significantly higher
than that of the other methods in recognition tasks involving background noise
or spectral distortion. Training HMMs with speech processed by the
auditory-model or L-RASTA front end in one type of noise also improved
the recognition performance with other kinds of noise. This 'cross-training'
effect did not occur with the MFCC or J-RASTA front end.
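A small sketch of one piece of the experimental setup described above, namely constructing test material with additive noise at a controlled level. The SNR-scaling rule is standard practice, and the function name is an assumption.

    import numpy as np

    def add_noise_at_snr(speech, noise, snr_db):
        # Scale the noise so that the mixture has the requested
        # signal-to-noise ratio, then add it to the speech.
        if len(noise) < len(speech):
            noise = np.tile(noise, int(np.ceil(len(speech) / len(noise))))
        noise = noise[:len(speech)]
        p_s = np.mean(speech ** 2)
        p_n = np.mean(noise ** 2)
        scale = np.sqrt(p_s / (p_n * 10.0 ** (snr_db / 10.0)))
        return speech + scale * noise

    # Toy usage: a 200 Hz tone mixed with white noise at 10 dB SNR.
    rng = np.random.default_rng(4)
    tone = np.sin(2 * np.pi * 200 * np.arange(8000) / 8000.0)
    noisy = add_noise_at_snr(tone, rng.normal(size=8000), snr_db=10.0)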
Authors:
Owen P. Kenny, Defence Science and Technology Organization (Australia)
Douglas J. Nelson, Department of Defence (USA)
Paper number 863
Abstract:
The problem of removing channel effects from speech has generally been
attacked by attempting to recover a time-varying filter which inverts
the entire channel impulse response. We show that human listeners are
insensitive to many channel conditions and that the human ear seems
to respond primarily to discontinuities in the channel. Motivated by
these observations, we propose a partial equalization in which the
channel effects to which the ear is sensitive are removed without full
inversion of the channel. In addition, we show that it is possible to
build filters of arbitrary length that neither reduce speech
intelligibility nor produce annoying artifacts.
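The following sketch illustrates the idea of partial equalization under stated assumptions: the channel's coarse log-magnitude envelope is estimated from the long-term spectrum, and only its smoothed inverse is applied, leaving fine spectral structure untouched. This is an illustration of the concept, not the authors' algorithm, and all names and parameters are assumptions.

    import numpy as np

    def partial_equalizer(signal, n_fft=512, smooth_bins=9):
        # Estimate the long-term log-magnitude spectrum, smooth it,
        # and apply only the smoothed inverse, leaving the fine
        # channel structure untouched.
        n_frames = len(signal) // n_fft
        frames = signal[:n_frames * n_fft].reshape(n_frames, n_fft)
        spec = np.fft.rfft(frames, axis=1)
        long_term = np.mean(np.abs(spec), axis=0) + 1e-12
        kernel = np.ones(smooth_bins) / smooth_bins
        coarse = np.convolve(np.log(long_term), kernel, mode='same')
        coarse -= coarse.mean()          # zero-mean log gain keeps level
        gain = np.exp(-coarse)           # inverse of the coarse envelope
        eq = np.fft.irfft(spec * gain, n=n_fft, axis=1)
        return eq.reshape(-1)

    rng = np.random.default_rng(5)
    equalized = partial_equalizer(rng.normal(size=16000))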