Authors:
Guojun Zhou, Robust Speech Processing Lab; Duke Univ. (USA)
John H.L. Hansen, Robust Speech Processing Lab; Duke Univ. (USA)
James F. Kaiser, Robust Speech Processing Lab; Duke Univ. (USA)
Paper number 840
Abstract:
Stressful environments, such as aircraft cockpits or high-workload tasks involving
stress or emotion, can degrade the performance of speech recognition systems. To
address this, we investigate a number
of linear and nonlinear features and processing methods for stressed
speech classification. The linear features include properties of pitch,
duration, intensity, glottal source, and the vocal tract spectrum.
Nonlinear processing is based on our newly proposed Teager Energy Operator (TEO)
speech feature, which incorporates frequency-domain critical-band filters
and properties of the resulting TEO autocorrelation envelope. In this
study, we employ Bayesian hypothesis testing and a hidden Markov
model processor as classification methods. Evaluations focus on speech
under loud, angry, and Lombard-effect speaking conditions from the SUSAS database.
Results using ROC curves and EER-based detection show that pitch is
the best of the five linear features for stress classification, while
the new nonlinear TEO-based feature outperforms the best linear feature
by +5.2%, with a reduction in classification-rate variability from
8.66 to 3.90.
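The nonlinear feature above is built on the discrete Teager Energy Operator,
psi[x(n)] = x(n)^2 - x(n-1) x(n+1). The following is a minimal sketch of that operator
applied within one illustrative critical band, followed by a normalized autocorrelation
envelope of the TEO profile; the band edges, filter order, and frame length are
assumptions for illustration only, not the exact feature configuration of the paper.

```python
import numpy as np
from scipy.signal import butter, lfilter

def teager_energy(x):
    """Discrete Teager Energy Operator: psi[x](n) = x(n)^2 - x(n-1)*x(n+1)."""
    psi = np.zeros_like(x)
    psi[1:-1] = x[1:-1] ** 2 - x[:-2] * x[2:]
    return psi

def critical_band_teo_autocorr(x, fs, band=(1000.0, 1480.0), frame_len=0.025):
    """Bandpass one (illustrative) critical band, apply the TEO, and return the
    normalized autocorrelation of one frame of the TEO profile.
    Band edges, filter order, and frame length are illustrative assumptions."""
    b, a = butter(4, [band[0] / (fs / 2), band[1] / (fs / 2)], btype="band")
    x_band = lfilter(b, a, x)
    psi = teager_energy(x_band)
    n = int(frame_len * fs)
    frame = psi[:n] - np.mean(psi[:n])
    ac = np.correlate(frame, frame, mode="full")[n - 1:]   # one-sided autocorrelation
    return ac / (ac[0] + 1e-12)                            # normalize at lag zero

fs = 8000
t = np.arange(0, 0.1, 1.0 / fs)
x = np.sin(2 * np.pi * 1200 * t)          # toy signal lying inside the chosen band
env = critical_band_teo_autocorr(x, fs)
print(env[:5])                            # first lags of the TEO autocorrelation envelope
```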
Authors:
Sahar E. Bou-Ghazale, Rockwell (USA)
John H.L. Hansen, Robust Speech Processing Lab; Duke Univ. (USA)
Paper number 918
Abstract:
It is well known that the performance of speech recognition algorithms
degrades in adverse environments where a speaker is
under stress, emotion, or the Lombard effect. This study evaluates the
effectiveness of traditional features in recognition of speech under
stress and formulates new features which are shown to improve stressed
speech recognition. The focus is on formulating robust features which
are less dependent on the speaking conditions rather than applying
compensation or adaptation techniques. The stressed speaking styles
considered are simulated angry and loud, Lombard effect speech, and
noisy actual stressed speech from the SUSAS database. In addition,
this study investigates the immunity of the LP and FFT power spectra to
the presence of stress. Our results show that, in contrast to the FFT's
relative immunity to noise, the LP power spectrum is more robust than the
FFT power spectrum to stress, as well as to a combination of noisy and
stressful conditions.
Two alternative frequency partitioning methods (M-MFCC, ExpoLog) are
proposed and compared with traditional MFCC features for stressed speech
recognition. It is shown that the alternative filterbank frequency partitions
are more effective for recognition of speech under both simulated and
actual stressed conditions.
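The filterbank-partitioning idea can be made concrete with a short sketch: a
triangular MFCC-style filterbank places its filters at equal spacing on a warped
frequency axis, so alternatives such as M-MFCC or ExpoLog amount to substituting a
different warping function. The warping shown below is the ordinary mel scale; the
exact M-MFCC and ExpoLog partitions proposed in the paper are not reproduced here.

```python
import numpy as np

def hz_to_mel(f):
    """Standard mel-scale warping used by conventional MFCC filterbanks."""
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def filterbank_edges(n_filters, fs, warp=hz_to_mel, unwarp=mel_to_hz):
    """Edge frequencies of a triangular filterbank, equally spaced on the warped
    axis.  Supplying a different warp/unwarp pair changes the frequency
    partition, which is the mechanism behind alternative filterbanks."""
    lo, hi = warp(0.0), warp(fs / 2.0)
    pts = np.linspace(lo, hi, n_filters + 2)   # n_filters triangles need n+2 edge points
    return unwarp(pts)

edges = filterbank_edges(24, fs=8000)
print(np.round(edges, 1))                      # Hz positions of the filter edges
```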
Authors:
Katrin Kirchhoff, University of Bielefeld (Germany)
Paper number 873
Abstract:
Robust speech recognition under varying acoustic conditions may be
achieved by exploiting multiple sources of information in the speech
signal. In addition to an acoustic signal representation, we use an
articulatory representation consisting of pseudo-articulatory features
as an additional information source. Hybrid ANN/HMM recognizers using
either of these representations are evaluated on a continuous numbers
recognition task (OGI Numbers95) under clean, reverberant and noisy
conditions. An error analysis of preliminary recognition results shows
that the different representations produce qualitatively different
errors, which suggests a combination of both representations. We investigate
various combination possibilities at the phoneme estimation level and
show that significant improvements can be achieved under all three
acoustic conditions.
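One common way to combine acoustic and pseudo-articulatory streams at the phoneme
estimation level is a weighted log-linear merge of the two ANN posterior estimates.
The sketch below shows only this one possibility, with an illustrative stream weight;
it is not a description of the specific combination schemes evaluated in the paper.

```python
import numpy as np

def combine_posteriors(p_acoustic, p_artic, weight=0.5):
    """Frame-level log-linear combination of two phoneme posterior estimates,
    one from an acoustic-feature ANN and one from a pseudo-articulatory ANN.
    The stream weight is an illustrative assumption."""
    eps = 1e-12
    log_p = weight * np.log(p_acoustic + eps) + (1.0 - weight) * np.log(p_artic + eps)
    p = np.exp(log_p - log_p.max(axis=-1, keepdims=True))
    return p / p.sum(axis=-1, keepdims=True)     # renormalize to a distribution

# Example: 3 frames x 4 phoneme classes from each stream
a = np.random.dirichlet(np.ones(4), size=3)
b = np.random.dirichlet(np.ones(4), size=3)
print(combine_posteriors(a, b).sum(axis=1))      # each frame sums to 1
```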
Authors:
Timothy Wark, Speech Research Laboratory, QUT (Australia)
Sridha Sridharan, Speech Research Laboratory, QUT (Australia)
Paper number 294
Abstract:
This paper considers the improvement of speaker identification performance
in reverberant conditions using additional lip information. Automatic
speaker identification (ASI) using speech characteristics alone can
be highly successful; however, problems occur when training and testing
conditions are mismatched. In particular, we find that ASI performance
drops dramatically when training speech is anechoic but test speech is
reverberant. Previous work [1][2] has shown that speaker-dependent information
can be extracted from the static and dynamic qualities of moving lips.
Given that lip information is unaffected by reverberation, we choose
to fuse this additional information with speech data. We propose a
new method for estimating confidence levels to allow adaptive fusion
of the audio and visual data. Identification results are presented
for increasing levels of artificially reverberated data, where lip
information is shown to provide a substantial improvement in ASI performance.
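The general form of such confidence-driven late fusion can be sketched as a weighted
sum of per-speaker scores from the audio and lip classifiers, with the audio weight
tied to an estimate of audio reliability. The confidence estimator itself is not
reproduced here; the weight and scores below are illustrative assumptions, not the
paper's specific method.

```python
import numpy as np

def fuse_scores(audio_loglik, lip_loglik, audio_confidence):
    """Late fusion of per-speaker log-likelihoods from audio and lip classifiers.
    audio_confidence in [0, 1] down-weights the audio stream when its reliability
    drops (e.g., under reverberant test conditions).  The confidence value is
    assumed to be supplied by a separate estimator."""
    w = np.clip(audio_confidence, 0.0, 1.0)
    return w * np.asarray(audio_loglik) + (1.0 - w) * np.asarray(lip_loglik)

audio = np.array([-110.2, -98.7, -120.5])   # scores for 3 candidate speakers (toy values)
lip = np.array([-45.1, -52.3, -40.8])
scores = fuse_scores(audio, lip, audio_confidence=0.3)   # low confidence: favor lips
print(int(np.argmax(scores)))                            # index of identified speaker
```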