Authors:
Masato Akagi, Japan Advanced Institute of Science and Technology (Japan)
Mamoru Iwaki, Japan Advanced Institute of Science and Technology (Japan)
Noriyoshi Sakaguchi, Japan Advanced Institute of Science and Technology (Japan)
Page (NA) Paper number 28
Abstract:
Humans have an excellent ability to select a particular sound source
from a noisy environment, called the ``Cocktail-Party Effect'' and
to compensate for physically missing sound, called the ``Illusion of
Continuity.'' This paper proposes a spectral peak tracker as a model
of the illusion of continuity (or phonemic restoration) and a spectral
sequence prediction method using a spectral peak tracker. Although
some models have already been proposed, they treat only spectral peak
frequencies and often generate wrong predicted spectra. We introduce
a peak representation of log-spectrum with four parameters: amplitude,
frequency, bandwidth, and asymmetry, using the spectral shape analysis
method described by the wavelet transformation, and we devise a time-varying
second-order system for formulating the trajectories of the parameters.
We demonstrate that the model can estimate and track the parameters
for connected vowels whose transition section has been partially replaced
by white noise.
Authors:
Aruna Bayya, Rockwell Semiconductor Systems, Newport Beach, CA (USA)
B. Yegnanarayana, Indian Institute of Technology, Madras (India)
Page (NA) Paper number 1121
Abstract:
In this paper we propose a set of features based on the group
delay spectrum for speech recognition systems. These features appear
to be more robust to channel variations and environmental changes than
features based on Mel-spectral coefficients. The main idea is to
derive cepstrum-like features from the group delay spectrum instead of
deriving them from the power spectrum. The group delay spectrum is computed
from a modified autocorrelation-like function. The effectiveness of
the new feature set is demonstrated by the results of both speaker-independent
(SI) and speaker-dependent (SD) recognition tasks. Preliminary results
indicate that using the new features, we can obtain results comparable
to Mel cepstra and PLP cepstra in most of the cases and a slight improvement
in noisy cases. More optimization of the parameters is needed to fully
exploit the nature of the new features.
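The core idea, cepstrum-like coefficients taken from the group delay spectrum rather than the power spectrum, can be sketched as follows. This is an illustrative reading of the approach, not the authors' exact feature pipeline (in particular, the modified autocorrelation-like computation is not reproduced, and the choice of 13 coefficients is assumed):

```python
import numpy as np

def group_delay_cepstra(x, n_coef=13):
    """Cepstrum-like features from the group delay spectrum (a sketch)."""
    n = np.arange(len(x))
    X = np.fft.rfft(x)
    Y = np.fft.rfft(n * x)                 # DFT of n * x[n]
    # Group delay: tau(w) = (Xr*Yr + Xi*Yi) / |X|^2 (regularized)
    tau = (X.real * Y.real + X.imag * Y.imag) / (np.abs(X) ** 2 + 1e-8)
    # DCT-II of the group delay spectrum gives cepstrum-like coefficients
    k = np.arange(len(tau))
    basis = np.cos(np.pi * np.outer(np.arange(n_coef), 2 * k + 1) / (2 * len(tau)))
    return basis @ tau
```

The group delay formula above avoids explicit (and unstable) phase unwrapping by working directly with the two transforms.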
Authors:
Frédéric Berthommier, ICP (France)
Hervé Glotin, IDIAP (Switzerland)
Emmanuel Tessier, ICP (France)
Hervé Bourlard, IDIAP (Switzerland)
Page (NA) Paper number 600
Abstract:
We propose a running demonstration of coupling between an intermediate
processing step (named CASA), based on the harmonicity cue, and partial
recognition, implemented with an HMM/ANN multistream technique [2].
The model is able to recognise words corrupted by narrow-band noise,
either stationary or with a variable center frequency. The principle
is to identify, frame by frame, the noisiest of four subbands
by analysing an SNR-dependent representation. A static partial recogniser
is fed with the remaining subbands. We establish on Numbers93 the noisy-band
identification (NBI) performance as well as the word error rate (WER),
and alter the correlation between these two indexes by changing the
distribution of the noise.
Authors:
Sen-Chia Chang, ATC/CCL, Industrial Technology Research Institute, Taiwan (Taiwan)
Shih-Chieh Chien, ATC/CCL, Industrial Technology Research Institute, Taiwan (Taiwan)
Chih-Chung Kuo, ATC/CCL, Industrial Technology Research Institute, Taiwan (Taiwan)
Page (NA) Paper number 1077
Abstract:
In this paper, a novel architecture, which integrates the recurrent
neural network (RNN) based compensation process and the hidden Markov
model (HMM) based recognition process into a unified framework, is
proposed. The RNN is employed to estimate the additive bias, which
represents the telephone channel effect, in the cepstral domain. Compensation
of telephone channel effects is implemented by subtracting the additive
bias from the cepstral coefficients of the input utterance. The integrated
recognition system is trained with the MCE/GPD (minimum classification
error/generalized probabilistic descent) method, using an objective function
that is designed to minimize recognition error rate. Experimental results
for speaker-independent Mandarin polysyllabic word recognition show
an error rate reduction of 21.5% compared to the baseline system.
Authors:
Stephen M. Chu, Beckman Institute and Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign (USA)
Yunxin Zhao, Beckman Institute and Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign (USA)
Page (NA) Paper number 690
Abstract:
This paper presents a method to improve the robustness of speech recognition
in noisy conditions. It has been shown that using dynamic features
in addition to static features can improve the noise robustness of
speech recognizers. In this work we show that in a continuous-density
Hidden Markov Model (HMM) based speech recognition system, weighting
the contribution of the dynamic features according to SNR levels can
further improve the performance, and we propose a two-step scheme to
adapt the weights for a given Signal to Noise Ratio (SNR). The first
step is to obtain the optimal weights for a set of selected SNR levels
by discriminative training. The Generalized Probabilistic Descent (GPD)
framework is used in our experiments. The second step is to interpolate
the set of SNR-specific weights obtained in step one for a new SNR
condition. Experimental results obtained with the proposed technique
are encouraging. Evaluation using speaker-independent digits with added
white Gaussian noise shows significant reduction in error rate at various
SNR levels.
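The second, interpolation step can be illustrated with a small sketch; the anchor SNRs and weight values below are hypothetical, and a real system would interpolate one weight per dynamic-feature stream:

```python
def interpolate_weights(snr, anchors):
    """Linearly interpolate a discriminatively trained weight between the
    nearest trained SNR anchor points (sketch of step two of the scheme).
    anchors maps SNR (dB) -> weight; values here are illustrative."""
    pts = sorted(anchors.items())
    if snr <= pts[0][0]:
        return pts[0][1]          # clamp below the lowest trained SNR
    if snr >= pts[-1][0]:
        return pts[-1][1]         # clamp above the highest trained SNR
    for (s0, w0), (s1, w1) in zip(pts, pts[1:]):
        if s0 <= snr <= s1:
            return w0 + (w1 - w0) * (snr - s0) / (s1 - s0)
```

Clamping at the extremes keeps the weight in the range covered by the discriminative training of step one.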
Authors:
Johan de Veth, A2RT, Dept. of Language & Speech, University of Nijmegen (The Netherlands)
Bert Cranen, A2RT, Dept. of Language & Speech, University of Nijmegen (The Netherlands)
Louis Boves, A2RT, Dept. of Language & Speech, University of Nijmegen (The Netherlands)
Page (NA) Paper number 360
Abstract:
In this paper we propose to introduce backing-off in the acoustic contributions
of the local distance functions used during Viterbi decoding as an
operationalisation of missing feature theory for increased recognition
robustness. Acoustic backing-off effectively removes the detrimental
influence of outlier values from the local decisions in the Viterbi
algorithm. It does so without the need for prior knowledge that specific
features are missing. Acoustic backing-off avoids any kind of explicit
outlier detection. This paper provides a proof of concept of acoustic
backing-off in the context of connected digit recognition over the
telephone, using artificial distortions of the acoustic observations.
It is shown that the word error rate can be maintained at the level
of 2.5% obtained for undisturbed features, even in cases where
a conventional local distance computation without backing-off leads
to a word error rate above 80%.
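One plausible way to realise such backing-off is to mix each Gaussian feature score with a small flat background density, so that an outlier feature contributes only a bounded penalty to the local distance; the mixing weight `lam` and floor `p_floor` below are illustrative, not the paper's values:

```python
import math

def backed_off_distance(x, mean, var, lam=0.01, p_floor=1e-4):
    """Local distance with acoustic backing-off (sketch): mix the Gaussian
    likelihood with a flat background density so one outlier feature
    cannot dominate the frame score. lam and p_floor are illustrative."""
    d = 0.0
    for xi, mu, v in zip(x, mean, var):
        p = math.exp(-0.5 * (xi - mu) ** 2 / v) / math.sqrt(2 * math.pi * v)
        # the penalty of any single feature is capped at -log(lam * p_floor)
        d += -math.log((1 - lam) * p + lam * p_floor)
    return d
```

With a plain Gaussian distance, a single wild feature value would contribute a quadratic penalty; here it saturates, which is the "removal of detrimental influence" without explicit outlier detection.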
Authors:
Laura Docío-Fernández, University of Vigo (Spain)
Carmen García-Mateo, University of Vigo (Spain)
Page (NA) Paper number 579
Abstract:
This paper addresses the problem of mismatch between training and testing
conditions in an HMM-based speech recognizer. Parallel Model Combination
(PMC) has been demonstrated to be an efficient technique for reducing the
effects of additive noise. In order to apply this technique, a noise
HMM must be trained at the recognition phase. Approaches that estimate
the noise model based on the Expectation-Maximization (EM) or Baum-Welch
algorithms are widely used. In these methods, recorded environmental
noise data are used, and their major drawback is that they need a long
sequence of noise data to estimate the model parameters properly. In
some real-life applications the amount of noise data can be too small,
so from a practical point of view the required noise sequence is a
critical parameter and should be as short as possible. We propose
a novel method for obtaining a more reliable noise model than training
it from scratch by using a short noise sequence.
Authors:
Simon Doclo, Katholieke Universiteit Leuven (Belgium)
Ioannis Dologlou, Katholieke Universiteit Leuven (Belgium)
Marc Moonen, Katholieke Universiteit Leuven (Belgium)
Page (NA) Paper number 131
Abstract:
This paper presents an iterative signal enhancement algorithm for noise
reduction in speech. The algorithm is based on a truncated singular
value decomposition (SVD) procedure, which has already been used as
a tool for signal enhancement [1][2]. Compared to the classical algorithms,
the novel algorithm gives rise to comparable improvements in signal-to-noise
ratio (SNR). Moreover, the algorithm has improved frequency selectivity
for filtering out the noise and performs better with respect to the
higher formants of the speech. It can also be extended easily to multiple
channels.
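A minimal sketch of truncated-SVD enhancement along the lines of [1][2] follows (a single pass, whereas the paper's algorithm is iterative; `frame` and `rank` are illustrative values):

```python
import numpy as np

def svd_enhance(x, frame=32, rank=6):
    """One pass of truncated-SVD signal enhancement (sketch): embed the
    signal in a Hankel matrix, keep only the top singular components,
    and average anti-diagonals back into a time series."""
    n = len(x) - frame + 1
    H = np.array([x[i:i + frame] for i in range(n)])   # n x frame Hankel matrix
    U, s, Vt = np.linalg.svd(H, full_matrices=False)
    s[rank:] = 0.0                                     # discard noise-dominated components
    Hr = (U * s) @ Vt
    # average anti-diagonals to map the low-rank matrix back to a signal
    y = np.zeros(len(x))
    counts = np.zeros(len(x))
    for i in range(n):
        y[i:i + frame] += Hr[i]
        counts[i:i + frame] += 1
    return y / counts
```

The low-rank step keeps the subspace spanned by the strong (speech-dominated) components while discarding most of the broadband noise energy.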
Authors:
Stéphane Dupont, Faculte Polytechnique de Mons (FPMs) (Belgium)
Page (NA) Paper number 581
Abstract:
In this paper, we propose to use the missing data theory to allow the
reconstruction of missing spectro-temporal parameters in the framework
of hybrid HMM/ANN systems. A simple signal-to-noise ratio estimator
is used to automatically detect the components that are unavailable
or corrupted by noise (missing components). A limited number of multidimensional
Gaussian distributions are then used to reconstruct those missing components
solely on the basis of the present data. The reconstructed vectors
are then used as input to an artificial neural network estimating the
HMM state probabilities. Continuous speech recognition experiments
have been done on filtered speech. In this case, filtered components
carry little or no information and hence should probably be ignored.
The results presented in this paper illustrate this point of view.
Complementary experiments also suggest the potential of the proposed
approach in the case of noisy speech.
Authors:
Ascension Gallardo-Antolín, Universidad Carlos III de Madrid (Spain)
Fernando Díaz-de-María, Universidad Carlos III de Madrid (Spain)
Francisco J. Valverde-Albacete, Universidad Carlos III de Madrid (Spain)
Page (NA) Paper number 584
Abstract:
This paper addresses the problem of speech recognition in the GSM environment.
In this context, new sources of distortion, such as transmission errors
or speech coding itself, significantly degrade the performance of speech
recognizers. While conventional approaches deal with these types of
distortion after decoding speech, we propose to recognize from the
digital speech representation of GSM. In particular, our work focuses
on the 13 kbit/s RPE-LTP GSM standard speech coder. In order to test
our recognizer we have compared it to a conventional recognizer in
several simulated situations, which allow us to gain insight into more
practical ones. Specifically, besides recognizing from clean digital
speech and evaluating the influence of speech coding distortion, the
proposed recognizer is faced with speech degraded by random errors,
burst errors and frame substitutions. The results are very encouraging:
the worse the transmission conditions are, the more recognizing from
digital speech outperforms the conventional approach.
Authors:
Petra Geutner, Universitaet Karlsruhe (Germany)
Matthias Denecke, MULTICOM (USA)
Uwe Meier, Carnegie Mellon University (USA)
Martin Westphal, Universitaet Karlsruhe (Germany)
Alex Waibel, Carnegie Mellon University (USA)
Page (NA) Paper number 772
Abstract:
This paper describes our latest efforts in building a speech recognizer
for operating a navigation system through speech instead of typed input.
Compared to conventional speech recognition for navigation systems,
where the input is usually restricted to a fixed set of keywords and
keyword phrases, complete spontaneous sentences are allowed as speech
input. We will present the interaction of speech input, parsing and
the reactions to the requested queries. Our system has been trained
on German spontaneous speech data and has been adapted to navigation
queries using MLLR. As the system is not restricted to command word
input, a parser further processes the recognized utterance. We show
that within a lab environment our system is able to handle arbitrary
spontaneous sentences as input to a navigation system successfully.
The recognizer achieves a word error rate of 18%, and evaluation of
the parser yields an error rate of 20%.
Authors:
Laurent Girin, Institut de la Communication Parlée de Grenoble (France)
Laurent Varin, Institut de la Communication Parlée de Grenoble (France)
Gang Feng, Institut de la Communication Parlée de Grenoble (France)
Jean-Luc Schwartz, Institut de la Communication Parlée de Grenoble (France)
Page (NA) Paper number 431
Abstract:
This paper deals with the improvement of a noisy speech enhancement
system based on the fusion of auditory and visual information. The
system was presented in previous papers and implemented in the context
of vowel to vowel and vowel to consonant transitions corrupted with
white noise. Its principle consists of an analysis-enhancement-synthesis
process based on a linear prediction (LP) model of the signal: the
LP filter is enhanced by associative tools that estimate clean LP
parameters from both the noisy audio and the visual information. The
detailed structure of the system is reviewed, and we focus on an improvement
that concerns precisely the associators: basic neural networks (multi-layer
perceptrons) are used instead of linear regression. It is shown that,
in the context of VCV transitions corrupted with white noise, neural
networks can improve the performance of the system in terms of intelligibility
gain, distance measures and classification tests.
Authors:
Ruhi Sarikaya, Duke University (USA)
John H.L. Hansen, Duke University (USA)
Page (NA) Paper number 922
Abstract:
This study presents a new approach for robust speech activity detection
(SAD). Our framework is based on HMM recognition of speech versus silence.
We model speech as one of fourteen large phone classes whereas silence
is represented as a separate model. Individual test utterances are
concatenated to simulate read continuous speech for testing. The HMM-based
algorithm is compared to both energy-based and speech-enhancement-based
SAD algorithms for clean, 5 dB and 0 dB SNR levels under white
Gaussian noise (WGN), aircraft cockpit noise (AIR) and automobile highway
noise (HWY). We found that our algorithm provides lower frame error
rates than the other two methods especially for HWY noise. Unlike other
studies, we evaluate our algorithm on the core test set of the standard
TIMIT database. Hence, results can be used as benchmarks to evaluate
future systems.
Authors:
Michel Héon, INRS-Telecommunications (Canada)
Hesham Tolba, INRS-Telecommunications (Canada)
Douglas O'Shaughnessy, INRS-Telecommunications (Canada)
Page (NA) Paper number 807
Abstract:
In this paper, the problem of robust speech recognition is considered.
Our approach is based on the noise reduction of the parameters that
we use for recognition, that is, the Mel-based cepstral coefficients.
A Temporal-Correlation-Based Recurrent Multilayer Neural Network (TCRMNN)
for noise reduction in the cepstral domain is used in order to obtain
less-variant parameters useful for robust recognition in noisy
environments. Experiments show that the use of the enhanced parameters
using such an approach increases the recognition rate of the continuous
speech recognition (CSR) process. The HTK Hidden Markov Model Toolkit
was used throughout. Experiments were done on a noisy version of the
TIMIT database. With such a pre-processing noise reduction technique
in the front-end of the HTK-based continuous speech recognition (CSR)
system, improvements in the recognition accuracy of about 17.77%
and 18.58% using single mixture monophones and triphones, respectively,
have been obtained at a moderate SNR of 20 dB.
Authors:
Juan M. Huerta, Carnegie Mellon University (USA)
Richard M. Stern, Carnegie Mellon University (USA)
Page (NA) Paper number 626
Abstract:
Speech coding affects speech recognition performance, with recognition
accuracy deteriorating as the coded bit rate decreases. Virtually all
systems that recognize coded speech reconstruct the speech waveform
from the coded parameters, and then perform recognition (after possible
noise and/or channel compensation) using conventional techniques. In
this paper we compare the recognition accuracy of coded speech obtained
by reconstructing the speech waveform with the speech recognition accuracy
obtained when using cepstral features derived from the coding parameters.
We focus our efforts on speech that has been coded using the 13-kbps
full-rate GSM codec, a Regular Pulse Excited Long Term Prediction (RPE-LTP)
codec. The GSM codec develops separate representations for the linear
prediction (LPC) filter and the residual signal components of the coded
speech. We measure the effects of quantization and coding on the accuracy
with which these parameters are represented, and present two different
methods for recombining them for speech recognition purposes. We observe
that by selectively combining the cepstral streams representing the
LPC parameters and the residual signal it is possible to obtain recognition
accuracy directly from the coded parameters that equals or exceeds
the recognition accuracy obtained from the reconstructed waveforms.
Authors:
Jeih-Weih Hung, IIS, Academia Sinica (Taiwan)
Jia-Lin Shen, IIS, Academia Sinica (Taiwan)
Lin-Shan Lee, the department of E.E., National Taiwan University (Taiwan)
Page (NA) Paper number 231
Abstract:
The parallel model combination (PMC) technique has been shown to achieve
very good performance for speech recognition under noisy conditions.
However, some problems remain with the PMC formulation.
In this paper, we first investigate these problems and propose some
modifications to the transformation process of PMC. Experimental results
show that this modified PMC can provide significant improvements over
the original PMC in recognition accuracy. An error rate reduction
on the order of 12.92% was achieved.
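For reference, the core of the standard PMC transformation is the log-add combination of clean-speech and noise statistics; a minimal sketch of that step (means only, omitting the variance terms and the cepstral-to-log-spectral rotation) is:

```python
import math

def pmc_log_add(mu_speech, mu_noise, g=1.0):
    """Log-add approximation used in PMC (sketch): combine log-spectral
    means of the clean-speech and noise models, with g an optional gain
    matching term. Variance combination is omitted for brevity."""
    return [math.log(g * math.exp(s) + math.exp(n))
            for s, n in zip(mu_speech, mu_noise)]
```

The modifications proposed in the paper target exactly this transformation process, whose details are not reproduced here.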
Authors:
Lamia Karray, France Telecom - CNET (France)
Jean Monné, France Telecom - CNET (France)
Page (NA) Paper number 430
Abstract:
Recognition performance decreases when recognition systems are used
over the telephone network, especially over wireless networks and in
noisy environments. Inefficient speech/non-speech detection appears
to be a major source of this degradation. Therefore, the robustness
of speech detection to noise is a challenging problem to be examined
in order to improve recognition performance for very noisy communications.
Speech collected in the GSM environment is an example of such very
noisy speech to be recognized. Several studies have been conducted
aiming to improve the robustness
of speech/non-speech detection used for speech recognition in adverse
conditions. This paper introduces a robust word boundary detection
algorithm reliable in the very noisy cellular network environment.
The algorithm is based on the statistics of noise and speech in the
observed signal. In order to decide on the binary hypotheses of noise
only versus speech plus noise, we use a likelihood ratio criterion.
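The binary hypothesis test can be sketched as follows, assuming a Gaussian model of some frame statistic (e.g. log energy) under each hypothesis; the paper's actual statistics and threshold choice are not specified here, so all parameter values are illustrative:

```python
import math

def lr_speech_detect(stat, noise_mean, noise_var, speech_mean, speech_var,
                     threshold=0.0):
    """Likelihood ratio test sketch: noise-only vs speech-plus-noise.
    A frame statistic is modelled as Gaussian under each hypothesis and
    the log-likelihood ratio is compared to a threshold."""
    def loglik(x, mu, var):
        return -0.5 * math.log(2 * math.pi * var) - (x - mu) ** 2 / (2 * var)
    llr = loglik(stat, speech_mean, speech_var) - loglik(stat, noise_mean, noise_var)
    return llr > threshold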
Authors:
Myung Gyu Song, Dept. of Electronics Eng., Pusan National Univ. (Korea)
Hoi In Jung, Dept. of Electronics Eng., Pusan National Univ. (Korea)
Kab-Jong Shim, Passenger Car E&R Center II, Hyundai Motor Company (Korea)
Hyung Soon Kim, Dept. of Electronics Eng., Pusan National Univ. (Korea)
Page (NA) Paper number 1065
Abstract:
In speech recognition for real-world applications, the performance
degradation due to the mismatch introduced between training and testing
environments should be overcome. In this paper, to reduce this mismatch,
we provide a hybrid method of spectral subtraction and residual noise
masking. We also employ a multiple-model approach to obtain improved
robustness over various noise environments. In this approach, multiple
model sets are made according to several noise masking levels, and then
the model set appropriate for the estimated noise level is selected
automatically in the recognition phase. In speaker-independent isolated
word recognition experiments in car noise environments, the proposed
method using model sets with only two masking levels reduces the average
word error rate by 60% in comparison with the spectral subtraction method.
Authors:
Klaus Linhard, Daimler Benz Research and Technology (Germany)
Tim Haulick, Daimler Benz Research and Technology (Germany)
Page (NA) Paper number 109
Abstract:
Spectral noise subtraction has the drawback of generating residual
noise with musical character, the so-called musical noise. We propose
a simple modification of the filter coefficient calculation, in the
form of a recursion on the previous coefficient. This recursion results
in a switching mechanism of the spectral subtraction gain, such that
speech pauses are processed with a nearly constant, very low gain
and speech components are processed with a nearly constant high gain.
Because of this switching mechanism, the generation of musical noise
is almost completely avoided. The well-known approach of Ephraim and
Malah has a similar mechanism, but the new recursive scheme is much
easier to implement while yielding comparable or sometimes better
performance.
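A minimal sketch of a recursive gain update of this flavour: the gain leans on its previous value, so it latches near the floor in pauses and near one in speech, instead of fluctuating bin by bin. `alpha` and `floor` are illustrative, and the underlying gain rule shown is simple power subtraction, not necessarily the authors':

```python
def recursive_gain(snr_post, prev_gain, alpha=0.9, floor=0.05):
    """Recursive spectral-subtraction gain update (sketch). snr_post is
    the a posteriori SNR of one frequency bin; the smoothed gain tracks
    a power-subtraction gain but inherits inertia from prev_gain, which
    suppresses the random bin-to-bin spikes heard as musical noise."""
    raw = max(1.0 - 1.0 / max(snr_post, 1e-6), floor)
    return alpha * prev_gain + (1.0 - alpha) * raw
```

Iterated over frames, the gain converges to the floor for noise-only bins and stays high for speech-dominated bins, which is the switching behaviour described above.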
Authors:
Shengxi Pan, Electronic Engineering Department, Tsinghua University (China)
Jia Liu, Electronic Engineering Department, Tsinghua University (China)
Jintao Jiang, Electronic Engineering Department, Tsinghua University (China)
Zuoying Wang, Electronic Engineering Department, Tsinghua University (China)
Dajin Lu, Electronic Engineering Department, Tsinghua University (China)
Page (NA) Paper number 413
Abstract:
In this paper, a new robust speech recognition algorithm based on multiple
models and integrated decision (MMID) is proposed, and a parallel MMID
(PMMID) algorithm is developed. With this new algorithm, the advantages
of different models can be integrated into one system. The algorithm
uses different acoustic models at the same time based on DDBHMM (duration
distribution based Hidden Markov Model). These models include
the channel-mismatch-correct (CMC) model, a more-alternative-pronunciation
model, tone and non-tone models of Mandarin Chinese speech, a voice
activity detection (VAD) model and a state-skip model. The speech recognition
accuracy of the multi-model system is better than that of a single-model
system in adverse environments. The experimental results show that
the error rate of the recognition system is 2.9%, a reduction of 81%
compared with the single-model baseline system.
Authors:
Dusan Macho, Slovak Technical University and Slovak Academy of Sciences (Slovak Republic)
Climent Nadeu, Universitat Politecnica de Catalunya (Spain)
Page (NA) Paper number 1137
Abstract:
One of today's great challenges in speech recognition is to ensure
the robustness of the speech representation used. Usually, the recognition
rate is strongly reduced when the speech is corrupted, e.g. by convolutional
or additive noise, and the speech features are not designed to be robust.
In this paper we study the effect of additive noise on the logarithmic
filter-bank energy representation. We use time and frequency filtering
techniques to emphasize the discriminative information and to reduce
the mismatch between noisy and clean speech representation. A 2-D spectral
representation is introduced to see the regions most affected by noise
in the 2-D quefrency-modulation frequency domain and to help to design
the frequency and time filter shapes. Experiments with one and two
dynamic feature sets show the usefulness of combining time and frequency
filtering for both white-noise and low-pass-noise speech recognition.
Finally, the power time and frequency filtering technique is presented.
Authors:
Bhiksha Raj, Carnegie Mellon University (USA)
Rita Singh, Carnegie Mellon University (USA)
Richard M. Stern, Carnegie Mellon University (USA)
Page (NA) Paper number 1152
Abstract:
Two types of algorithms are introduced that recover missing time-frequency
regions of log-spectral representations of speech. These compensation
algorithms modify the incoming feature vector without any changes to
the speech recognition system, in contrast to previously-described
approaches. The first approach clusters the log-spectral vectors representing
clean speech. Missing data are recovered by estimating the spectral
cluster in each analysis frame on the basis of the feature values that
are present. The second approach uses MAP procedures to estimate the
values of missing data elements based on their correlation with the
features that are present. Greatest recognition accuracy was obtained
using the correlation-based approach, presumably because of its ability
to exploit the temporal as well as spectral structure of speech. The
recognition accuracy provided by these algorithms approaches but does
not exceed that obtained by traditional marginalization. Nevertheless,
it is believed that these algorithms provide greater computational
efficiency and enable greater flexibility in recognition system structure.
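For a single Gaussian, a correlation-based MAP estimate reduces to the conditional mean of the missing elements given the present ones; the one-cluster sketch below illustrates that step (the paper's models are richer, e.g. per-cluster statistics and temporal context):

```python
import numpy as np

def conditional_fill(x, missing, mean, cov):
    """MAP reconstruction sketch: estimate missing log-spectral elements
    from the present ones via the conditional mean of a joint Gaussian,
    E[x_m | x_p] = mu_m + C_mp C_pp^{-1} (x_p - mu_p)."""
    x = np.array(x, dtype=float)
    mean = np.asarray(mean, dtype=float)
    cov = np.asarray(cov, dtype=float)
    m = np.asarray(missing)          # boolean mask of missing elements
    p = ~m                           # present elements
    x[m] = mean[m] + cov[np.ix_(m, p)] @ np.linalg.solve(
        cov[np.ix_(p, p)], x[p] - mean[p])
    return x
```

The off-diagonal block `C_mp` is what carries the spectral (and, with stacked frames, temporal) correlation that the abstract credits for the accuracy gain.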
Authors:
Volker Schless, Daimler-Benz AG (Germany)
Fritz Class, Daimler-Benz AG (Germany)
Page (NA) Paper number 138
Abstract:
We present an approach to joint application of spectral subtraction
(SPS) and model combination (PMC) for speech recognition in noisy environments.
Contrary to previous solutions, distortion introduced by SPS is not
modeled in PMC. Instead, we ensure compatibility of the two methods
by adapting the parameters of SPS (spectral floor and overestimation
factor) according to the current signal-to-noise ratio (SNR). Parameter setting
should be done to subtract a maximum of noise while minimizing distortion.
Experiments suggest that for each noise level different parameter sets
yield optimal performance. Setting the parameters adaptively according
to the noise level leads to undegraded results at high SNR while in
low SNR regions the benefits of the noise reduction process are significant.
This scheme leaves the model combination process unchanged which simplifies
parameter estimation and reduces computation time. Experiments show
significant improvements when using PMC with modified SPS instead of
standard SPS.
Authors:
Jia-Lin Shen, Institute of Information Science, Academia Sinica (Taiwan)
Jeih-Weih Hung, Institute of Information Science, Academia Sinica (Taiwan)
Lin-Shan Lee, Institute of Information Science, Academia Sinica (Taiwan)
Page (NA) Paper number 442
Abstract:
In this paper, an improved mismatch function that considers the signal
correlation between speech and noise is proposed to better estimate
the noisy-speech HMMs. A linearized model based on a Taylor series expansion
is used to approximate the proposed mismatch function. The parameters
of the noisy-speech HMMs can be estimated more precisely
by combining the parameters of the clean-speech and noise HMMs in
the log-spectral domain or cepstral domain. Experimental results show
that improved robustness for speech recognition in the presence of
white noise as well as colored noise can be obtained.
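In the power-spectral domain, a mismatch function with a speech-noise correlation term might look as follows (a sketch, not the paper's exact formulation; `rho = 0` recovers the usual additive, uncorrelated-noise assumption used in standard PMC):

```python
import math

def mismatch_log_power(log_s, log_n, rho=0.0):
    """Mismatch function sketch in the log-power domain with a
    speech-noise correlation coefficient rho: the cross term
    2*rho*sqrt(S*N) vanishes when speech and noise are uncorrelated."""
    s2, n2 = math.exp(log_s), math.exp(log_n)
    return math.log(s2 + n2 + 2.0 * rho * math.sqrt(s2 * n2))
```

A Taylor expansion of such a function around the clean-speech and noise means is what allows the noisy-speech HMM parameters to be combined in closed form.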
Authors:
Won-Ho Shin, Dept. of Electronic Eng., Yonsei Univ. (Korea)
Weon-Goo Kim, Dept. of Electrical Eng., Kunsan Univ. (Korea)
Chungyong Lee, Dept. of Electronic Eng., Yonsei Univ. (Korea)
Il-Whan Cha, Dept. of Electronic Eng., Yonsei Univ. (Korea)
Page (NA) Paper number 434
Abstract:
This paper investigates a projection-based likelihood measure that
improves speech recognition performance in noisy environments. The
projection-based likelihood measure is modified to provide weighting
and projection effects and to reduce computational complexity. It is
evaluated in sub-model-based word recognition using semi-continuous
hidden Markov models in speaker-independent mode. Experimental results
using the proposed measure are reported for several performance factors:
additive noise and noisy channel environments, various noise signals,
and combination with other compensation methods. In various noisy environments,
performance improvements were achieved compared to previously existing
methods.
Authors:
Tetsuya Takiguchi, Nara Institute of Science and Technology (Japan)
Satoshi Nakamura, Nara Institute of Science and Technology (Japan)
Kiyohiro Shikano, Nara Institute of Science and Technology (Japan)
Masatoshi Morishima, Laboratory for Information Technology, NTT DATA Corporation (Japan)
Toshihiro Isobe, Laboratory for Information Technology, NTT DATA Corporation (Japan)
Page (NA) Paper number 698
Abstract:
In this paper, we evaluate the performance of model adaptation by the previously
proposed HMM decomposition method on telephone speech recognition.
The HMM decomposition method separates a composed HMM into a known
phoneme HMM and an unknown noise and channel HMM by maximum likelihood
(ML) estimation of the HMM parameters. A transfer function (telephone
channel) HMM is estimated using adaptation speech data by applying
the HMM decomposition twice in the linear spectral domain for noise
and in the cepstral domain for channel. The telephone speech data
for evaluation are recorded through 10 kinds of ordinary analog telephone
handsets and cordless telephone handsets. The test results show that
the average phrase accuracy with the clean speech HMMs is 60.9% for
the ordinary analog telephone handsets, and 19.6% for the cordless
telephone handsets. By the HMM decomposition method, the average phrase
accuracy is improved to 78.1% for the ordinary analog telephone handsets,
and 50.5% for the cordless telephone handsets.
Authors:
Hesham Tolba, INRS-Telecommunications (Canada)
Douglas O'Shaughnessy, INRS-Telecommunications (Canada)
Page (NA) Paper number 341
Abstract:
This paper presents an evaluation of a robust Voiced-Unvoiced-based
large-vocabulary Continuous-Speech Recognition (CSR) system in the
presence of highly interfering noise. Comparative experiments have
indicated that the inclusion of an accurate Voiced-Unvoiced (V-U) classifier
in our design of a CSR system improves the performance of such a recognizer,
for speech contaminated by both additive Gaussian and uniform noises.
Our results show that the V-U-based CSR system outperforms the CMS-based
and the RASTA-PLP-based CSR systems in such environments for a wide
range of SNRs.
Authors:
Masashi Unoki, Japan Advanced Institute of Science and Technology (Japan)
Masato Akagi, Japan Advanced Institute of Science and Technology (Japan)
Page (NA) Paper number 1041
Abstract:
This paper proposes a method of extracting the desired signal from
a noisy signal. This method solves the problem of segregating two
acoustic sources by using constraints related to the four regularities
proposed by Bregman and by making two improvements to our previously
proposed method. One is to incorporate a method of estimating the
fundamental frequency using comb filtering on the filterbank.
The other is to reconsider the constraints on the separation block,
which constrain the instantaneous amplitude, input phase, and fundamental
frequency of the desired signal. Simulations performed to segregate
a vowel from a noisy vowel and to compare the results of using all
or only some constraints showed that our improved method can segregate
real speech precisely using all the constraints related to the four
regularities, and that the absence of some constraints reduces the accuracy.
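As a rough illustration of comb-based fundamental-frequency estimation, the sketch below scores candidate fundamentals by the spectral energy at their harmonics; it operates on a plain FFT spectrum rather than the paper's auditory filterbank, and the five-harmonic comb is an assumed choice:

```python
import numpy as np

def comb_f0_estimate(x, fs, f0_min=80, f0_max=400, n_harm=5):
    """Comb-style F0 estimate (sketch): for each candidate f0, average
    the spectral magnitude at its first n_harm harmonics and return the
    candidate with the highest score."""
    spec = np.abs(np.fft.rfft(x))
    freqs = np.fft.rfftfreq(len(x), 1.0 / fs)
    best_f0, best_score = f0_min, -1.0
    for f0 in range(f0_min, f0_max + 1):
        # nearest spectral bins at or above each harmonic frequency
        idx = np.searchsorted(freqs, np.arange(1, n_harm + 1) * f0)
        idx = idx[idx < len(spec)]
        score = spec[idx].mean()
        if score > best_score:
            best_f0, best_score = f0, score
    return best_f0
```

Averaging over a fixed number of harmonics penalizes sub- and super-harmonic candidates whose comb teeth fall between the true partials.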
Authors:
Tsuyoshi Usagawa, Kumamoto University (Japan)
Kenji Sakai, Kumamoto University (Japan)
Masanao Ebata, Kumamoto University (Japan)
Page (NA) Paper number 190
Abstract:
In this paper, a frequency-domain binaural model is introduced. The
proposed model is a revision of a former time-domain model which calculates
the interaural cross-correlation. The new model requires less computational
load and has comparable performance. It is based on FFT analysis and
uses the cross-power spectrum to obtain the interaural phase difference.
The performance of the models is examined not only on an isolated-word
speech recognition task but also on a speech enhancement task. Experimental
results show that the improvement in robustness on the speech recognition
task corresponds to about 15-20 dB when the surrounding noise is white
noise, which is a few decibels better than that obtained by the time-domain
model. However, when the surrounding noise is speech, the improvement
decreases to 10-15 dB. In addition, the proposed model can reproduce
the signal component from the specified direction as a binaural signal.
0190_01.PDF
(was: 0190_01.JPG)
| spectrogram in JPEG Fig.6(a)
File type: Image File
Format: JPEG
Tech. description: 1280x411 24bits
Creating Application: Unknown
Creating OS: linux 2.0.34
|
0190_02.WAV
(was: 0190_02.WAV)
| sound file Fig.6(a)
File type: Sound File
Format: WAV
Tech. description: (44.1k, 16bit, mono)
Creating Application: LHa for UNIX V 1.14c
Creating OS: linux 2.0.34
|
0190_03.PDF
(was: 0190_03.JPG)
| spectrogram in JPEG Fig.6(b)
File type: Image File
Format: JPEG
Tech. description: 1280x411 24bits
Creating Application: LHa for UNIX V 1.14c
Creating OS: linux 2.0.34
|
0190_04.WAV
(was: 0190_04.WAV)
| sound file Fig.6(b)
File type: Sound File
Format: WAV
Tech. description: (44.1k, 16bit, mono)
Creating Application: LHa for UNIX V 1.14c
Creating OS: linux 2.0.34
|
0190_05.PDF
(was: 0190_05.JPG)
| spectrogram in JPEG Fig.6(c)
File type: Image File
Format: JPEG
Tech. description: 1280x411 24bits
Creating Application: LHa for UNIX V 1.14c
Creating OS: linux 2.0.34
|
0190_06.WAV
(was: 0190_06.WAV)
| sound file Fig.6(c)
File type: Sound File
Format: WAV
Tech. description: (44.1k, 16bit, mono)
Creating Application: LHa for UNIX V 1.14c
Creating OS: linux 2.0.34
|
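The core frequency-domain step described above can be sketched as follows (a minimal illustration, not the authors' full model): the phase of the cross-power spectrum between the left and right channels gives the interaural phase difference per FFT bin, from which a time delay, and hence a direction, can be inferred.

```python
import numpy as np

def interaural_phase_difference(left, right, sr):
    """Interaural phase difference per frequency via the cross-power spectrum.

    The phase of conj(L) * R at each FFT bin is the interaural phase
    difference; for a pure delay d it follows -2*pi*f*d/sr (wrapped).
    """
    L = np.fft.rfft(left)
    R = np.fft.rfft(right)
    cross = np.conj(L) * R              # cross-power spectrum
    ipd = np.angle(cross)               # phase difference per bin
    freqs = np.fft.rfftfreq(len(left), 1.0 / sr)
    return freqs, ipd

# Example: right channel lags the left by 10 samples
sr = 16000
rng = np.random.default_rng(1)
x = rng.standard_normal(1024)
left, right = x[10:], x[:-10]
freqs, ipd = interaural_phase_difference(left, right, sr)
```

The inverse FFT of the same cross-power spectrum yields the cross-correlation used by the earlier time-domain model, which is why the two formulations give comparable performance at lower cost for the frequency-domain version.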
Authors:
An-Tzyh Yu, National Tsing-Hua University (Taiwan)
Hsiao-Chuan Wang, National Tsing-Hua University (Taiwan)
Page (NA) Paper number 38
Abstract:
The recognition of encoded speech enables possible applications of
speech recognition in low bit-rate communication systems. Such
applications will become necessary for the Internet and digital mobile
phones. In this paper, the encoded parameters of the speech signal in
various speech coding systems are evaluated on a designated speech
recognition task. An HMM with mixtures of continuous Gaussian densities
forms the framework of the speech recognition system.
Authors:
Tai-Hwei Hwang, National Tsing-Hua University (Taiwan)
Hsiao-Chuan Wang, National Tsing-Hua University (Taiwan)
Page (NA) Paper number 39
Abstract:
This paper proposes a modified parameter mapping scheme for the
parallel model combination (PMC) method. The modification aims to improve
the discriminative capabilities of the compensated models. It is achieved
by rearranging the distributions of the state models in order
to emphasize the contribution of the mean in the subsequent process.
The distributions of both the speech model and the noise model are
shaped in the cepstral domain through a covariance contracting procedure.
After the compensation steps, an expanding procedure on the adapted
covariance is necessary to release the emphasis. Through this process,
the discriminative capability is increased so that the recognition
accuracy is improved. In this paper, the recognition of Chinese names
demonstrates the improvement over the original PMC method, especially
when the SNR is low.
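For orientation, the standard PMC idea that the paper modifies can be sketched as follows. This is a mean-only sketch in the log-spectral domain under the usual log-normal approximation; the cepstral mapping and the covariance contracting/expanding steps the paper proposes are not reproduced here.

```python
import numpy as np

def pmc_mean_combine(mu_speech_log, mu_noise_log, gain=1.0):
    """Mean-only parallel model combination in the log-spectral domain.

    Gaussian means are mapped to the linear spectral domain, the speech
    and (gain-scaled) noise contributions are added, and the result is
    mapped back to the log domain.
    """
    lin = np.exp(mu_speech_log) + gain * np.exp(mu_noise_log)
    return np.log(lin)

# Example: with much weaker noise, the combined mean stays near speech
speech = np.array([2.0, 1.5, 1.0])
noise = np.array([-3.0, -3.0, -3.0])
combined = pmc_mean_combine(speech, noise)
```

In a full PMC system the same mapping is applied after an inverse DCT from the cepstral domain, and the covariances are combined analogously; the paper's contribution is to reshape those covariances before combination so that the means dominate the mapping.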
Authors:
Daniel Woo, The University of New South Wales (Australia)
Page (NA) Paper number 700
Abstract:
Human perceptual experiments are described that present listeners with
segmented stop consonant speech stimuli in noise. The selection of
short duration speech segments is based on a local measure of the signal-to-noise
ratio calculated over 1 ms windows. The aim is to create stimuli with
known fluctuations between a speech and a noise sample, to assess
whether the presence of short-duration "gaps" in the noise produces
favourable and unfavourable signal regions that influence identification.
Perceptual results are reported suggesting that human listeners make better
use of signals that comprise only positive local signal-to-noise
ratio segments. Such regions are assumed to be more favourable for
stimulus identification. Presentation of stimuli containing only negative
signal-to-noise ratio regions does not appear to contribute as much.
A model that is based on the accumulation of short duration spectral
segments is presented that produces a similar set of identification
functions for the same test stimuli.
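The local-SNR segment selection described above can be sketched as follows (an illustrative simplification, assuming the clean speech and noise signals are known separately, as they are when constructing stimuli):

```python
import numpy as np

def positive_snr_mask(speech, noise, sr, win_ms=1.0):
    """Flag 1 ms windows whose local SNR exceeds 0 dB.

    Local SNR is computed per short window from the known speech and
    noise signals; windows above 0 dB are the 'favourable' regions.
    """
    win = max(1, int(sr * win_ms / 1000))
    n = len(speech) // win
    mask = np.zeros(n, dtype=bool)
    for i in range(n):
        s = speech[i * win:(i + 1) * win]
        v = noise[i * win:(i + 1) * win]
        ps, pn = np.mean(s ** 2), np.mean(v ** 2) + 1e-12
        mask[i] = 10 * np.log10(ps / pn + 1e-12) > 0.0
    return mask

# Example: a 2.5 ms burst of strong signal inside weak constant noise
sr = 16000
noise = 0.01 * np.ones(160)
speech = np.zeros(160)
speech[40:80] = 0.5
mask = positive_snr_mask(speech, noise, sr)  # True only around the burst
```

Stimuli containing only the flagged (or only the unflagged) windows can then be assembled to test the perceptual contribution of each class of region, as in the experiments described above.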