Robust Speech Processing in Adverse Environments 4

ICSLP'98 Proceedings

Spectral Sequence Compensation Based on Continuity of Spectral Sequence

Authors:

Masato Akagi, Japan Advanced Institute of Science and Technology (Japan)
Mamoru Iwaki, Japan Advanced Institute of Science and Technology (Japan)
Noriyoshi Sakaguchi, Japan Advanced Institute of Science and Technology (Japan)

Page (NA) Paper number 28

Abstract:

Humans have an excellent ability to select a particular sound source in a noisy environment, called the ``Cocktail-Party Effect,'' and to compensate for physically missing sounds, called the ``Illusion of Continuity.'' This paper proposes a spectral peak tracker as a model of the illusion of continuity (or phonemic restoration), together with a spectral sequence prediction method that uses it. Although some models have already been proposed, they treat only spectral peak frequencies and often generate incorrect predicted spectra. We introduce a peak representation of the log-spectrum with four parameters: amplitude, frequency, bandwidth, and asymmetry, using a spectral shape analysis method based on the wavelet transform. We also devise a time-varying second-order system to formulate the trajectories of these parameters. We demonstrate that the model can estimate and track the parameters for connected vowels whose transition section has been partially replaced by white noise.

SL980028.PDF (From Author) SL980028.PDF (Rasterized)



Robust Features for Speech Recognition Systems

Authors:

Aruna Bayya, Rockwell Semiconductor Systems, Newport Beach, CA (USA)
B. Yegnanarayana, Indian Institute of Technology, Madras (India)

Page (NA) Paper number 1121

Abstract:

In this paper we propose a set of features based on the group delay spectrum for speech recognition systems. These features appear to be more robust to channel variations and environmental changes than features based on Mel-spectral coefficients. The main idea is to derive cepstrum-like features from the group delay spectrum instead of from the power spectrum. The group delay spectrum is computed from a modified autocorrelation-like function. The effectiveness of the new feature set is demonstrated by results on both speaker-independent (SI) and speaker-dependent (SD) recognition tasks. Preliminary results indicate that the new features yield results comparable to Mel cepstra and PLP cepstra in most cases, with a slight improvement in noisy conditions. Further optimization of the parameters is needed to fully exploit the nature of the new features.
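As a rough illustration of the idea, group delay can be computed from the standard DFT identity tau(w) = (X_R Y_R + X_I Y_I)/|X|^2, where Y is the DFT of n*x[n], and then decorrelated into cepstrum-like coefficients. This is only a minimal sketch using that standard identity; the paper's modified autocorrelation-like computation is not reproduced here.

```python
import numpy as np

def group_delay_spectrum(x, n_fft=512):
    """Group delay via the standard identity:
    tau(w) = (X_R*Y_R + X_I*Y_I) / |X|^2, with Y = DFT of n*x[n]."""
    n = np.arange(len(x))
    X = np.fft.rfft(x, n_fft)
    Y = np.fft.rfft(n * x, n_fft)
    denom = np.abs(X) ** 2 + 1e-10          # guard against spectral zeros
    return (X.real * Y.real + X.imag * Y.imag) / denom

def gd_cepstral_features(x, n_coef=13, n_fft=512):
    """Cepstrum-like features: inverse transform of the group delay
    spectrum instead of the log power spectrum."""
    tau = group_delay_spectrum(x, n_fft)
    return np.fft.irfft(tau, n_fft)[:n_coef]

frame = np.hamming(256) * np.random.default_rng(0).standard_normal(256)
feats = gd_cepstral_features(frame)
```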

SL981121.PDF (From Author) SL981121.PDF (Rasterized)



Interfacing of CASA and Partial Recognition Based on a Multistream Technique

Authors:

Frédéric Berthommier, ICP (France)
Hervé Glotin, IDIAP (Switzerland)
Emmanuel Tessier, ICP (France)
Hervé Bourlard, IDIAP (Switzerland)

Page (NA) Paper number 600

Abstract:

We propose a running demonstration of the coupling between an intermediate processing step (named CASA), based on the harmonicity cue, and partial recognition, implemented with an HMM/ANN multistream technique [2]. The model is able to recognise words corrupted with narrow-band noise, either stationary or with a variable centre frequency. The principle is to identify, frame by frame, the noisiest of four subbands by analysing an SNR-dependent representation. A static partial recogniser is fed with the remaining subbands. We establish on Numbers93 the noisy-band identification (NBI) performance as well as the word error rate (WER), and alter the correlation between these two indices by changing the distribution of the noise.

SL980600.PDF (From Author) SL980600.PDF (Rasterized)



An RNN-Based Compensation Method for Mandarin Telephone Speech Recognition

Authors:

Sen-Chia Chang, ATC/CCL, Industrial Technology Research Institute, Taiwan (Taiwan)
Shih-Chieh Chien, ATC/CCL, Industrial Technology Research Institute, Taiwan (Taiwan)
Chih-Chung Kuo, ATC/CCL, Industrial Technology Research Institute, Taiwan (Taiwan)

Page (NA) Paper number 1077

Abstract:

In this paper, a novel architecture that integrates a recurrent neural network (RNN) based compensation process and a hidden Markov model (HMM) based recognition process into a unified framework is proposed. The RNN is employed to estimate the additive bias, which represents the telephone channel effect, in the cepstral domain. Compensation of telephone channel effects is implemented by subtracting this additive bias from the cepstral coefficients of the input utterance. The integrated recognition system is trained with the MCE/GPD (minimum classification error/generalized probabilistic descent) method, using an objective function designed to minimize the recognition error rate. Experimental results for speaker-independent Mandarin polysyllabic word recognition show an error rate reduction of 21.5% compared to the baseline system.

SL981077.PDF (From Author) SL981077.PDF (Rasterized)



Robust Speech Recognition Using Discriminative Stream Weighting and Parameter Interpolation

Authors:

Stephen M. Chu, Beckman Institute and Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign (USA)
Yunxin Zhao, Beckman Institute and Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign (USA)

Page (NA) Paper number 690

Abstract:

This paper presents a method to improve the robustness of speech recognition in noisy conditions. It has been shown that using dynamic features in addition to static features can improve the noise robustness of speech recognizers. In this work we show that in a continuous-density Hidden Markov Model (HMM) based speech recognition system, weighting the contribution of the dynamic features according to SNR levels can further improve performance, and we propose a two-step scheme to adapt the weights for a given Signal-to-Noise Ratio (SNR). The first step is to obtain the optimal weights for a set of selected SNR levels by discriminative training; the Generalized Probabilistic Descent (GPD) framework is used in our experiments. The second step is to interpolate the set of SNR-specific weights obtained in step one for a new SNR condition. Experimental results obtained with the proposed technique are encouraging. Evaluation on speaker-independent digits with added white Gaussian noise shows a significant reduction in error rate at various SNR levels.
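Step two of the scheme, interpolating SNR-specific weights for an unseen SNR, can be sketched as follows. The trained SNR levels and weight values are illustrative placeholders, not figures from the paper.

```python
import numpy as np

# SNR levels (dB) for which weights were discriminatively trained (step one);
# the weight values here are invented for illustration.
trained_snrs    = np.array([0.0, 10.0, 20.0, 30.0])
static_weights  = np.array([0.35, 0.45, 0.55, 0.60])  # weight on static stream
dynamic_weights = 1.0 - static_weights                # complementary dynamic weight

def weights_for_snr(snr_db):
    """Step two: linearly interpolate the SNR-specific stream weights
    for an unseen SNR condition (clipped at the trained range)."""
    ws = np.interp(snr_db, trained_snrs, static_weights)
    return ws, 1.0 - ws

ws, wd = weights_for_snr(15.0)
```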

SL980690.PDF (From Author) SL980690.PDF (Rasterized)



Acoustic Backing-Off in the Local Distance Computation for Robust Automatic Speech Recognition

Authors:

Johan de Veth, A2RT, Dept. of Language & Speech, University of Nijmegen (The Netherlands)
Bert Cranen, A2RT, Dept. of Language & Speech, University of Nijmegen (The Netherlands)
Louis Boves, A2RT, Dept. of Language & Speech, University of Nijmegen (The Netherlands)

Page (NA) Paper number 360

Abstract:

In this paper we propose to introduce backing-off in the acoustic contributions of the local distance functions used during Viterbi decoding, as an operationalisation of missing feature theory for increased recognition robustness. Acoustic backing-off effectively removes the detrimental influence of outlier values from the local decisions in the Viterbi algorithm, without requiring prior knowledge that specific features are missing, and without any kind of explicit outlier detection. This paper provides a proof of concept of acoustic backing-off in the context of connected digit recognition over the telephone, using artificial distortions of the acoustic observations. It is shown that the word error rate can be maintained at the 2.5% level obtained for undisturbed features, even in cases where a conventional local distance computation without backing-off leads to a word error rate above 80%.

SL980360.PDF (From Author) SL980360.PDF (Rasterized)



Noise Model Selection For Robust Speech Recognition

Authors:

Laura Docío-Fernández, University of Vigo (Spain)
Carmen García-Mateo, University of Vigo (Spain)

Page (NA) Paper number 579

Abstract:

This paper addresses the problem of mismatch between training and testing conditions in an HMM-based speech recognizer. Parallel Model Combination (PMC) has been shown to be an efficient technique for reducing the effects of additive noise. To apply this technique, a noise HMM must be trained at recognition time. Approaches that estimate the noise model with the Expectation-Maximization (EM) or Baum-Welch algorithms are widely used. These methods use recorded environmental noise data, and their major drawback is that they need a long sequence of noise data to estimate the model parameters properly. In some real-life applications the amount of available noise data can be very small, so from a practical point of view the required amount of noise is a critical parameter that should be as short as possible. We propose a novel method that obtains a more reliable noise model from a short noise sequence than training one from scratch.

SL980579.PDF (From Author) SL980579.PDF (Rasterized)



A Novel Iterative Signal Enhancement Algorithm for Noise Reduction in Speech

Authors:

Simon Doclo, Katholieke Universiteit Leuven (Belgium)
Ioannis Dologlou, Katholieke Universiteit Leuven (Belgium)
Marc Moonen, Katholieke Universiteit Leuven (Belgium)

Page (NA) Paper number 131

Abstract:

This paper presents an iterative signal enhancement algorithm for noise reduction in speech. The algorithm is based on a truncated singular value decomposition (SVD) procedure, which has already been used as a tool for signal enhancement [1][2]. Compared to the classical algorithms, the novel algorithm gives rise to comparable improvements in signal-to-noise ratio (SNR). Moreover the algorithm has an improved frequency selectivity for filtering out the noise and performs better with respect to the higher formants of the speech. It can also be extended easily to multiple channels.
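The truncated-SVD enhancement step can be sketched as a single, non-iterative pass: embed the noisy frame in a Hankel matrix, keep the top singular components, and average the anti-diagonals back into a signal. The embedding order and rank below are arbitrary illustrative choices, and the paper's iterative refinement is not reproduced.

```python
import numpy as np

def svd_enhance(x, order=10, rank=4):
    """One pass of truncated-SVD signal enhancement."""
    n = len(x)
    rows = n - order + 1
    H = np.array([x[i:i + order] for i in range(rows)])   # Hankel embedding
    U, s, Vt = np.linalg.svd(H, full_matrices=False)
    s[rank:] = 0.0                                        # truncate noise subspace
    Hr = (U * s) @ Vt
    # anti-diagonal averaging maps the low-rank matrix back to a signal
    y = np.zeros(n)
    counts = np.zeros(n)
    for i in range(rows):
        y[i:i + order] += Hr[i]
        counts[i:i + order] += 1
    return y / counts

t = np.arange(200)
clean = np.sin(0.2 * t)
noisy = clean + 0.3 * np.random.default_rng(0).standard_normal(200)
enhanced = svd_enhance(noisy)
```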

SL980131.PDF (From Author) SL980131.PDF (Rasterized)



Missing Data Reconstruction for Robust Automatic Speech Recognition in the Framework of Hybrid HMM/ANN Systems

Authors:

Stéphane Dupont, Faculte Polytechnique de Mons (FPMs) (Belgium)

Page (NA) Paper number 581

Abstract:

In this paper, we propose to use missing data theory to reconstruct missing spectro-temporal parameters in the framework of hybrid HMM/ANN systems. A simple signal-to-noise ratio estimator is used to automatically detect the components that are unavailable or corrupted by noise (missing components). A limited number of multidimensional Gaussian distributions are then used to reconstruct those missing components solely on the basis of the present data. The reconstructed vectors are then used as input to an artificial neural network estimating the HMM state probabilities. Continuous speech recognition experiments have been performed on filtered speech. In this case, filtered components carry little or no information and hence should probably be ignored; the results presented in this paper illustrate this point of view. Complementary experiments also suggest the value of the proposed approach for noisy speech.

SL980581.PDF (From Author) SL980581.PDF (Rasterized)



Recognition from GSM Digital Speech

Authors:

Ascension Gallardo-Antolín, Universidad Carlos III de Madrid (Spain)
Fernando Díaz-de-María, Universidad Carlos III de Madrid (Spain)
Francisco J. Valverde-Albacete, Universidad Carlos III de Madrid (Spain)

Page (NA) Paper number 584

Abstract:

This paper addresses the problem of speech recognition in the GSM environment. In this context, new sources of distortion, such as transmission errors or speech coding itself, significantly degrade the performance of speech recognizers. While conventional approaches deal with these types of distortion after decoding speech, we propose to recognize from the digital speech representation of GSM. In particular, our work focuses on the 13 kbit/s RPE-LTP GSM standard speech coder. In order to test our recognizer we have compared it to a conventional recognizer in several simulated situations, which allow us to gain insight into more practical ones. Specifically, besides recognizing from clean digital speech and evaluating the influence of speech coding distortion, the proposed recognizer is faced with speech degraded by random errors, burst errors and frame substitutions. The results are very encouraging: the worse the transmission conditions are, the more recognizing from digital speech outperforms the conventional approach.

SL980584.PDF (From Author) SL980584.PDF (Rasterized)



Conversational Speech Systems For On-Board Car Navigation And Assistance

Authors:

Petra Geutner, Universitaet Karlsruhe (Germany)
Matthias Denecke, MULTICOM (USA)
Uwe Meier, Carnegie Mellon University (USA)
Martin Westphal, Universitaet Karlsruhe (Germany)
Alex Waibel, Carnegie Mellon University (USA)

Page (NA) Paper number 772

Abstract:

This paper describes our latest efforts in building a speech recognizer for operating a navigation system by speech instead of typed input. Compared to conventional speech recognition for navigation systems, where the input is usually restricted to a fixed set of keywords and keyword phrases, complete spontaneous sentences are allowed as speech input. We present the interaction of speech input, parsing and the reactions to the requested queries. Our system has been trained on German spontaneous speech data and has been adapted to navigation queries using MLLR. As the system is not restricted to command-word input, a parser further processes the recognized utterance. We show that within a lab environment our system can successfully handle arbitrary spontaneous sentences as input to a navigation system. The recognizer achieves a word error rate of 18%, and evaluation of the parser yields an error rate of 20%.

SL980772.PDF (From Author) SL980772.PDF (Rasterized)



A Signal Processing System for Having the Sound "Pop-Out" in Noise Thanks to the Image of the Speaker's Lips: New Advances Using Multi-Layer Perceptrons

Authors:

Laurent Girin, Institut de la Communication Parlée de Grenoble (France)
Laurent Varin, Institut de la Communication Parlée de Grenoble (France)
Gang Feng, Institut de la Communication Parlée de Grenoble (France)
Jean-Luc Schwartz, Institut de la Communication Parlée de Grenoble (France)

Page (NA) Paper number 431

Abstract:

This paper deals with the improvement of a noisy speech enhancement system based on the fusion of auditory and visual information. The system was presented in previous papers and implemented in the context of vowel-to-vowel and vowel-to-consonant transitions corrupted with white noise. Its principle is an analysis-enhancement-synthesis process based on a linear prediction (LP) model of the signal: the LP filter is enhanced by associative tools that estimate clean LP parameters from both the noisy audio and the visual information. The detailed structure of the system is recalled, and we focus on an improvement that concerns precisely these associators: basic neural networks (multi-layer perceptrons) are used instead of linear regression. It is shown that in the context of VCV transitions corrupted with white noise, neural networks can improve the performance of the system in terms of intelligibility gain, distance measures and classification tests.

SL980431.PDF (From Author) SL980431.PDF (Rasterized)



Robust Speech Activity Detection in the Presence of Noise

Authors:

Ruhi Sarikaya, Duke University (USA)
John H.L. Hansen, Duke University (USA)

Page (NA) Paper number 922

Abstract:

This study presents a new approach for robust speech activity detection (SAD). Our framework is based on HMM recognition of speech versus silence. We model speech as one of fourteen large phone classes, whereas silence is represented as a separate model. Individual test utterances are concatenated to simulate read continuous speech for testing. The HMM-based algorithm is compared to both an energy-based and a speech-enhancement-based SAD algorithm for clean, 5 dB and 0 dB SNR levels under white Gaussian noise (WGN), aircraft cockpit noise (AIR) and automobile highway noise (HWY). We found that our algorithm provides lower frame error rates than the other two methods, especially for HWY noise. Unlike other studies, we evaluate our algorithm on the core test set of the standard TIMIT database, so the results can serve as benchmarks for evaluating future systems.

SL980922.PDF (From Author) SL980922.PDF (Rasterized)



Robust Automatic Speech Recognition by the Application of a Temporal-Correlation-Based Recurrent Multilayer Neural Network to the Mel-Based Cepstral Coefficients

Authors:

Michel Héon, INRS-Telecommunications (Canada)
Hesham Tolba, INRS-Telecommunications (Canada)
Douglas O'Shaughnessy, INRS-Telecommunications (Canada)

Page (NA) Paper number 807

Abstract:

In this paper, the problem of robust speech recognition is considered. Our approach is based on noise reduction of the parameters used for recognition, i.e., the Mel-based cepstral coefficients. A Temporal-Correlation-Based Recurrent Multilayer Neural Network (TCRMNN) is used for noise reduction in the cepstral domain, in order to obtain less variant parameters useful for robust recognition in noisy environments. Experiments show that using the parameters enhanced by this approach increases the recognition rate of the continuous speech recognition (CSR) process. The HTK Hidden Markov Model Toolkit was used throughout, and experiments were performed on a noisy version of the TIMIT database. With this noise reduction technique as a pre-processing front end to the HTK-based CSR system, improvements in recognition accuracy of about 17.77% and 18.58% have been obtained at a moderate SNR of 20 dB, using single-mixture monophones and triphones, respectively.

SL980807.PDF (From Author) SL980807.PDF (Rasterized)



Speech Recognition from GSM Codec Parameters

Authors:

Juan M. Huerta, Carnegie Mellon University (USA)
Richard M. Stern, Carnegie Mellon University (USA)

Page (NA) Paper number 626

Abstract:

Speech coding affects speech recognition performance, with recognition accuracy deteriorating as the coded bit rate decreases. Virtually all systems that recognize coded speech reconstruct the speech waveform from the coded parameters, and then perform recognition (after possible noise and/or channel compensation) using conventional techniques. In this paper we compare the recognition accuracy of coded speech obtained by reconstructing the speech waveform with the speech recognition accuracy obtained when using cepstral features derived from the coding parameters. We focus our efforts on speech that has been coded using the 13-kbps full-rate GSM codec, a Regular Pulse Excited Long Term Prediction (RPE-LTP) codec. The GSM codec develops separate representations for the linear prediction (LPC) filter and the residual signal components of the coded speech. We measure the effects of quantization and coding on the accuracy with which these parameters are represented, and present two different methods for recombining them for speech recognition purposes. We observe that by selectively combining the cepstral streams representing the LPC parameters and the residual signal it is possible to obtain recognition accuracy directly from the coded parameters that equals or exceeds the recognition accuracy obtained from the reconstructed waveforms.

SL980626.PDF (From Author) SL980626.PDF (Rasterized)



Improved Parallel Model Combination Based on Better Domain Transformation for Speech Recognition Under Noisy Environments

Authors:

Jeih-Weih Hung, IIS, Academia Sinica (Taiwan)
Jia-Lin Shen, IIS, Academia Sinica (Taiwan)
Lin-Shan Lee, the department of E.E., National Taiwan University (Taiwan)

Page (NA) Paper number 231

Abstract:

The parallel model combination (PMC) technique has been shown to achieve very good performance for speech recognition under noisy conditions. However, some problems remain with the PMC formulation. In this paper, we first investigate these problems and then propose some modifications to the transformation process of PMC. Experimental results show that the modified PMC provides significant improvements in recognition accuracy over the original PMC, with an error rate reduction on the order of 12.92%.
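The baseline combination that the paper modifies can be sketched with the standard PMC log-add approximation, which combines clean-speech and noise means in the log-spectral domain by passing through the linear domain. This is the textbook form only; the paper's improved domain transformation is not reproduced here.

```python
import numpy as np

def pmc_logadd(mu_speech, mu_noise, gain_db=0.0):
    """Standard PMC log-add approximation for static log filter-bank
    energy means: exponentiate, add in the linear domain, take the log.
    gain_db is a level-matching term between speech and noise models."""
    g = 10.0 ** (gain_db / 10.0)
    return np.log(g * np.exp(mu_speech) + np.exp(mu_noise))

# toy 3-band example: speech dominates band 0, noise dominates band 2
mu_s = np.log(np.array([100.0, 10.0, 1.0]))
mu_n = np.log(np.array([1.0, 10.0, 100.0]))
mu_noisy = pmc_logadd(mu_s, mu_n)
```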

SL980231.PDF (From Author) SL980231.PDF (Rasterized)



Robust Speech/Non-Speech Detection in Adverse Conditions Based on Noise and Speech Statistics

Authors:

Lamia Karray, France Telecom - CNET (France)
Jean Monné, France Telecom - CNET (France)

Page (NA) Paper number 430

Abstract:

Recognition performance decreases when recognition systems are used over the telephone network, especially over wireless networks and in noisy environments. Inefficient speech/non-speech detection turns out to be a major source of this degradation. Speech detector robustness to noise is therefore a challenging problem that must be addressed in order to improve recognition performance on very noisy communications; speech collected in the GSM environment is an example of such very noisy speech to be recognized. Several studies have been conducted to improve the robustness of the speech/non-speech detection used for speech recognition in adverse conditions. This paper introduces a robust word boundary detection algorithm that remains reliable in the very noisy cellular network environment. The algorithm is based on the statistics of noise and speech in the observed signal: to decide between the binary hypotheses of noise only versus speech plus noise, we use a likelihood ratio criterion.
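The likelihood-ratio decision between the two hypotheses can be sketched under a simple zero-mean Gaussian model of an energy-like frame statistic. The model, variances and threshold below are illustrative assumptions, not the paper's actual noise and speech statistics.

```python
import numpy as np

def llr_speech_detector(frame_energy, noise_var, speech_var, threshold=0.0):
    """Per-frame log-likelihood ratio between
    H0: noise only (variance noise_var) and
    H1: speech plus noise (variance noise_var + speech_var),
    for a zero-mean Gaussian sample whose energy is frame_energy."""
    v0 = noise_var
    v1 = noise_var + speech_var
    llr = 0.5 * (np.log(v0 / v1) + frame_energy * (1.0 / v0 - 1.0 / v1))
    return llr > threshold, llr

# loud frame: energy well above the noise floor
is_speech, llr = llr_speech_detector(frame_energy=25.0,
                                     noise_var=1.0, speech_var=9.0)
```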

SL980430.PDF (From Author) SL980430.PDF (Rasterized)



Speech Recognition In Car Noise Environments Using Multiple Models According To Noise Masking Levels

Authors:

Myung Gyu Song, Dept. of Electronics Eng., Pusan National Univ. (Korea)
Hoi In Jung, Dept. of Electronics Eng., Pusan National Univ. (Korea)
Kab-Jong Shim, Passenger Car E&R Center II, Hyundai Motor Company (Korea)
Hyung Soon Kim, Dept. of Electronics Eng., Pusan National Univ. (Korea)

Page (NA) Paper number 1065

Abstract:

In speech recognition for real-world applications, the performance degradation due to the mismatch between training and testing environments must be overcome. In this paper, to reduce this mismatch, we propose a hybrid method of spectral subtraction and residual noise masking. We also employ a multiple-model approach to obtain improved robustness over various noise environments: multiple model sets are built for several noise masking levels, and the model set appropriate for the estimated noise level is selected automatically in the recognition phase. In speaker-independent isolated word recognition experiments in car noise environments, the proposed method, using model sets with only two masking levels, reduces the average word error rate by 60% in comparison with the spectral subtraction method.

SL981065.PDF (From Author) SL981065.PDF (Rasterized)



Spectral Noise Subtraction With Recursive Gain Curves

Authors:

Klaus Linhard, Daimler Benz Research and Technology (Germany)
Tim Haulick, Daimler Benz Research and Technology (Germany)

Page (NA) Paper number 109

Abstract:

Spectral noise subtraction has the drawback of generating residual noise with a musical character, the so-called musical noise. We propose a simple modification of the filter coefficient calculation, in the form of a recursion on the previous coefficient. This recursion produces a switching behaviour in the spectral subtraction gain: speech pauses are processed with a nearly constant, very low gain, while speech components are processed with a nearly constant high gain. Because of this switching behaviour, the generation of musical noise is almost completely avoided. The well-known approach of Ephraim and Malah has a similar mechanism, but the new recursive scheme is much easier to implement while yielding comparable, and sometimes better, performance.
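A recursive gain of this kind can be sketched as follows: the per-bin gain is smoothed towards the previous frame's gain, so it settles near a low value in pauses and near a high value during speech. The recursion form, smoothing constant and floor are illustrative choices, not the paper's exact rule.

```python
import numpy as np

def recursive_gain(noisy_power, noise_power, prev_gain, alpha=0.9, floor=0.05):
    """Spectral subtraction gain with a recursion on the previous gain."""
    snr = np.maximum(noisy_power / np.maximum(noise_power, 1e-12) - 1.0, 0.0)
    inst_gain = snr / (snr + 1.0)                  # Wiener-like instantaneous gain
    gain = alpha * prev_gain + (1.0 - alpha) * inst_gain
    return np.maximum(gain, floor)                 # spectral floor in pauses

bins = 8
gain = np.full(bins, 0.05)                         # start from a speech pause
noise_power = np.ones(bins)
for _ in range(50):                                # sustained speech frames
    gain = recursive_gain(noisy_power=20.0 * np.ones(bins),
                          noise_power=noise_power, prev_gain=gain)
```

After a run of speech frames the gain has switched up close to the instantaneous Wiener value of 0.95, illustrating the high-gain state of the switching mechanism.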

SL980109.PDF (From Author) SL980109.PDF (Rasterized)



A Novel Robust Speech Recognition Algorithm Based on Multi-Models and Integrated Decision Method

Authors:

Shengxi Pan, Electronic Engineering Department, Tsinghua University (China)
Jia Liu, Electronic Engineering Department, Tsinghua University (China)
Jintao Jiang, Electronic Engineering Department, Tsinghua University (China)
Zuoying Wang, Electronic Engineering Department, Tsinghua University (China)
Dajin Lu, Electronic Engineering Department, Tsinghua University (China)

Page (NA) Paper number 413

Abstract:

In this paper, a new robust speech recognition algorithm based on multiple models and integrated decision (MMID) is proposed, together with a parallel MMID (PMMID) algorithm. With this new algorithm, the advantages of different models can be integrated into one system. The algorithm uses different acoustic models simultaneously, based on DDBHMM (duration distribution based Hidden Markov Model). These models include a channel-mismatch-correction (CMC) model, a more-alternative-pronunciation model, tone and non-tone models of Mandarin Chinese speech, a voice activity detection (VAD) model and a state-skip model. The recognition accuracy of the multi-model system is better than that of a single-model system in adverse environments. Experimental results show that the error rate of the recognition system is 2.9%, a reduction of 81% compared with the single-model baseline system.

SL980413.PDF (From Author) SL980413.PDF (Rasterized)



On the Interaction Between Time and Frequency Filtering of Speech Parameters for Robust Speech Recognition

Authors:

Dusan Macho, Slovak Technical University and Slovak Academy of Sciences (Slovak Republic)
Climent Nadeu, Universitat Politecnica de Catalunya (Spain)

Page (NA) Paper number 1137

Abstract:

One of today's great challenges in speech recognition is to ensure the robustness of the speech representation used. The recognition rate is usually strongly reduced when the speech is corrupted, e.g. by convolutional or additive noise, and the speech features are not designed to be robust. In this paper we study the effect of additive noise on the logarithmic filter-bank energy representation. We use time and frequency filtering techniques to emphasize the discriminative information and to reduce the mismatch between noisy and clean speech representations. A 2-D spectral representation is introduced to identify the regions most affected by noise in the 2-D quefrency-modulation frequency domain and to help design the frequency and time filter shapes. Experiments with one and two dynamic feature sets show the usefulness of combining time and frequency filtering for both white-noise and low-pass-noise speech recognition. Finally, the power time and frequency filtering technique is presented.
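Typical time and frequency filters on log filter-bank energies can be sketched as a first difference across bands followed by a slope filter across frames. The paper designs its filter shapes from the 2-D analysis, so the filters below are only common illustrative choices.

```python
import numpy as np

def time_freq_filter(logE):
    """Filter a (frames x bands) matrix of log filter-bank energies:
    first difference along frequency, then a +/-1-frame slope along time."""
    # frequency filtering: H(z) = 1 - z^{-1} across bands
    freq_filtered = logE[:, 1:] - logE[:, :-1]
    # time filtering: slope over +/-1 frame, edge-padded at the boundaries
    padded = np.pad(freq_filtered, ((1, 1), (0, 0)), mode="edge")
    return padded[2:] - padded[:-2]

frames = np.random.default_rng(1).standard_normal((100, 20))  # 100 frames, 20 bands
features = time_freq_filter(frames)
```

The frequency difference removes a constant (convolutional) offset shared by neighbouring bands, while the time slope emphasizes modulation; both are cheap linear filters applied directly to the representation.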

SL981137.PDF (From Author) SL981137.PDF (Rasterized)



Inference Of Missing Spectrographic Features For Robust Speech Recognition

Authors:

Bhiksha Raj, Carnegie Mellon University (USA)
Rita Singh, Carnegie Mellon University (USA)
Richard M. Stern, Carnegie Mellon University (USA)

Page (NA) Paper number 1152

Abstract:

Two types of algorithms are introduced that recover missing time-frequency regions of log-spectral representations of speech. These compensation algorithms modify the incoming feature vector without any changes to the speech recognition system, in contrast to previously-described approaches. The first approach clusters the log-spectral vectors representing clean speech. Missing data are recovered by estimating the spectral cluster in each analysis frame on the basis of the feature values that are present. The second approach uses MAP procedures to estimate the values of missing data elements based on their correlation with the features that are present. Greatest recognition accuracy was obtained using the correlation-based approach, presumably because of its ability to exploit the temporal as well as spectral structure of speech. The recognition accuracy provided by these algorithms approaches but does not exceed that obtained by traditional marginalization. Nevertheless, it is believed that these algorithms provide greater computational efficiency and enable greater flexibility in recognition system structure.
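For a joint Gaussian model, the correlation-based MAP estimate of the missing components reduces to the conditional mean given the present components. The sketch below uses that closed form; the mean and covariance are placeholders, not statistics estimated from clean speech as in the paper.

```python
import numpy as np

def map_reconstruct(x, missing, mean, cov):
    """MAP (= conditional mean) estimate of missing elements of x under a
    joint Gaussian: m_miss + C_mp C_pp^{-1} (x_pres - m_pres)."""
    p = ~missing
    C_pp = cov[np.ix_(p, p)]          # present/present covariance
    C_mp = cov[np.ix_(missing, p)]    # missing/present cross-covariance
    x_hat = x.copy()
    x_hat[missing] = mean[missing] + C_mp @ np.linalg.solve(C_pp, x[p] - mean[p])
    return x_hat

# 3-band toy frame: band 1 is missing and strongly correlated with band 0
mean = np.zeros(3)
cov = np.array([[1.0, 0.8, 0.1],
                [0.8, 1.0, 0.1],
                [0.1, 0.1, 1.0]])
x = np.array([2.0, np.nan, 0.5])
missing = np.isnan(x)
x_filled = map_reconstruct(x, missing, mean, cov)
```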

SL981152.PDF (From Author) SL981152.PDF (Rasterized)

TOP


SNR-Dependent Flooring and Noise Overestimation for Joint Application of Spectral Subtraction and Model Combination

Authors:

Volker Schless, Daimler-Benz AG (Germany)
Fritz Class, Daimler-Benz AG (Germany)

Page (NA) Paper number 138

Abstract:

We present an approach to the joint application of spectral subtraction (SPS) and model combination (PMC) for speech recognition in noisy environments. Contrary to previous solutions, the distortion introduced by SPS is not modeled in PMC. Instead, we ensure compatibility of the two methods by adapting the parameters of SPS (spectral floor and overestimation factor) according to the current signal-to-noise ratio (SNR). The parameters should be set to subtract as much noise as possible while minimizing distortion. Experiments suggest that a different parameter set yields optimal performance at each noise level. Setting the parameters adaptively according to the noise level leaves results undegraded at high SNR, while at low SNR the benefits of the noise reduction process are significant. This scheme leaves the model combination process unchanged, which simplifies parameter estimation and reduces computation time. Experiments show significant improvements when PMC is used with the modified SPS instead of standard SPS.
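The core operation, power spectral subtraction with an SNR-dependent overestimation factor and spectral floor, can be sketched as below. The linear ramp used for alpha(SNR) is an illustrative assumption; the paper's actual parameter schedule is determined experimentally.

```python
import numpy as np

def spectral_subtract(noisy_psd, noise_psd, snr_db,
                      alpha_max=4.0, beta=0.01):
    """Power spectral subtraction with SNR-dependent noise
    overestimation (alpha) and spectral flooring (beta).

    alpha ramps from alpha_max at 0 dB SNR down to 1 at 20 dB,
    so more noise is subtracted when the SNR is low."""
    alpha = np.clip(alpha_max - (alpha_max - 1.0) * snr_db / 20.0,
                    1.0, alpha_max)
    clean_psd = noisy_psd - alpha * noise_psd
    floor = beta * noisy_psd          # floor relative to the noisy power
    return np.maximum(clean_psd, floor)
```

At high SNR alpha is 1 and the floor rarely engages, leaving the signal essentially untouched, which is what keeps high-SNR results undegraded.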

SL980138.PDF (From Author) SL980138.PDF (Rasterized)

TOP


Improved Robust Speech Recognition Considering Signal Correlation Approximated by Taylor Series

Authors:

Jia-Lin Shen, Institute of Information Science, Academia Sinica (Taiwan)
Jeih-Weih Hung, Institute of Information Science, Academia Sinica (Taiwan)
Lin-Shan Lee, Institute of Information Science, Academia Sinica (Taiwan)

Page (NA) Paper number 442

Abstract:

In this paper, an improved mismatch function that considers the signal correlation between speech and noise is proposed to better estimate the noisy-speech HMMs. A linearized model based on a Taylor series expansion is used to approximate the proposed mismatch function. The parameters of the noisy-speech HMMs can then be estimated more precisely by combining the parameters of the clean-speech and noise HMMs in the log-spectral or cepstral domain. Experimental results show improved robustness for speech recognition in the presence of white noise as well as colored noise.
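For reference, the baseline first-order Taylor linearization of the standard log-spectral mismatch function y = log(exp(x) + exp(n)) can be sketched as follows. This shows only the uncorrelated baseline; the paper's contribution is an additional speech-noise correlation term in the mismatch function, which is not reproduced here.

```python
import numpy as np

def noisy_mean_var(mu_x, var_x, mu_n, var_n):
    """First-order Taylor approximation of the noisy-speech
    log-spectral mean and variance for y = log(exp(x) + exp(n)),
    expanded around the clean-speech and noise means."""
    mu_y = np.logaddexp(mu_x, mu_n)   # g(mu_x, mu_n)
    dx = np.exp(mu_x - mu_y)          # dy/dx at the expansion point
    dn = np.exp(mu_n - mu_y)          # dy/dn = 1 - dx
    var_y = dx**2 * var_x + dn**2 * var_n
    return mu_y, var_y
```

When the noise mean is far below the speech mean, dx approaches 1 and the noisy-speech statistics collapse to the clean-speech statistics, as expected.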

SL980442.PDF (From Author) SL980442.PDF (Rasterized)

TOP


Speech Recognition in Noisy Environment Using Weighted Projection-Based Likelihood Measure

Authors:

Won-Ho Shin, Dept. of Electronic Eng., Yonsei Univ. (Korea)
Weon-Goo Kim, Dept. of Electrical Eng., Kunsan Univ. (Korea)
Chungyong Lee, Dept. of Electronic Eng., Yonsei Univ. (Korea)
Il-Whan Cha, Dept. of Electronic Eng., Yonsei Univ. (Korea)

Page (NA) Paper number 434

Abstract:

This paper investigates a projection-based likelihood measure that improves speech recognition performance in noisy environments. The projection-based likelihood measure is modified to incorporate both weighting and projection effects and to reduce computational complexity. It is evaluated on sub-model-based word recognition using a speaker-independent semi-continuous hidden Markov model. Experimental results for the proposed measure are reported for several performance factors: additive noise and noisy channel environments, various noise signals, and combination with another compensation method. In various noisy environments, performance improvements were achieved over previously existing methods.

SL980434.PDF (From Author) SL980434.PDF (Rasterized)

TOP


Evaluation of Model Adaptation by HMM Decomposition on Telephone Speech Recognition

Authors:

Tetsuya Takiguchi, Nara Institute of Science and Technology (Japan)
Satoshi Nakamura, Nara Institute of Science and Technology (Japan)
Kiyohiro Shikano, Nara Institute of Science and Technology (Japan)
Masatoshi Morishima, Laboratory for Information Technology, NTT DATA Corporation (Japan)
Toshihiro Isobe, Laboratory for Information Technology, NTT DATA Corporation (Japan)

Page (NA) Paper number 698

Abstract:

In this paper, we evaluate the performance of model adaptation by the previously proposed HMM decomposition method on telephone speech recognition. The HMM decomposition method separates a composed HMM into a known phoneme HMM and an unknown noise-and-channel HMM by maximum likelihood (ML) estimation of the HMM parameters. A transfer-function (telephone channel) HMM is estimated from adaptation speech data by applying HMM decomposition twice: in the linear spectral domain for noise, and in the cepstral domain for the channel. The telephone speech data for evaluation were recorded through 10 kinds of ordinary analog telephone handsets and cordless telephone handsets. The test results show that the average phrase accuracy with clean-speech HMMs is 60.9% for the ordinary analog handsets and 19.6% for the cordless handsets. With the HMM decomposition method, the average phrase accuracy improves to 78.1% for the ordinary analog handsets and 50.5% for the cordless handsets.

SL980698.PDF (From Author) SL980698.PDF (Rasterized)

TOP


Comparative Experiments to Evaluate a Voiced-Unvoiced-Based Pre-Processing Approach to Robust Automatic Speech Recognition in Low-SNR Environments

Authors:

Hesham Tolba, INRS-Telecommunications (Canada)
Douglas O'Shaughnessy, INRS-Telecommunications (Canada)

Page (NA) Paper number 341

Abstract:

This paper presents an evaluation of a robust Voiced-Unvoiced-based large-vocabulary Continuous-Speech Recognition (CSR) system in the presence of highly interfering noise. Comparative experiments have indicated that including an accurate Voiced-Unvoiced (V-U) classifier in our design of a CSR system improves the performance of such a recognizer for speech contaminated by both additive Gaussian and uniform noise. Our results show that the V-U-based CSR system outperforms the CMS-based and the RASTA-PLP-based CSR systems in such environments over a wide range of SNRs.
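A classic frame-level V-U decision combines short-time energy with the zero-crossing rate: voiced frames tend to have high energy and few zero crossings. This toy heuristic is only a stand-in to illustrate the kind of classifier meant; it is not the accurate classifier the paper evaluates, and the thresholds are arbitrary assumptions.

```python
import numpy as np

def voiced_unvoiced(frame, energy_thresh=0.01, zcr_thresh=0.25):
    """Toy voiced/unvoiced frame classifier: short-time energy plus
    zero-crossing rate. Voiced speech: high energy, low ZCR."""
    energy = np.mean(frame ** 2)
    zcr = np.mean(np.abs(np.diff(np.sign(frame)))) / 2.0
    return bool(energy > energy_thresh and zcr < zcr_thresh)
```

A 100 Hz sine (voiced-like) passes both tests, while white noise fails the zero-crossing test.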

SL980341.PDF (From Author) SL980341.PDF (Rasterized)

TOP


Signal Extraction From Noisy Signal Based on Auditory Scene Analysis

Authors:

Masashi Unoki, Japan Advanced Institute of Science and Technology (Japan)
Masato Akagi, Japan Advanced Institute of Science and Technology (Japan)

Page (NA) Paper number 1041

Abstract:

This paper proposes a method of extracting a desired signal from a noisy signal. The method addresses the problem of segregating two acoustic sources by using constraints related to the four regularities proposed by Bregman and by making two improvements to our previously proposed method. The first is to incorporate a method of estimating the fundamental frequency using comb filtering on the filterbank. The second is to reconsider the constraints on the separation block, which constrain the instantaneous amplitude, input phase, and fundamental frequency of the desired signal. Simulations performed to segregate a vowel from a noisy vowel, and to compare the results of using all or only some of the constraints, showed that our improved method can segregate real speech precisely using all the constraints related to the four regularities, and that the absence of some constraints reduces the accuracy.
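Comb-filter F0 estimation, the first improvement above, can be sketched generically: each candidate F0 is scored by the energy the spectrum carries at its harmonics. The 1/h harmonic weighting used here to suppress octave-down errors is an illustrative choice, not the authors' filterbank formulation.

```python
import numpy as np

def estimate_f0(signal, fs, f0_min=80.0, f0_max=400.0, n_harm=10):
    """Score comb filters against the power spectrum: the candidate F0
    whose harmonics capture the most (1/h-weighted) energy wins."""
    spec = np.abs(np.fft.rfft(signal)) ** 2
    bin_hz = fs / len(signal)                  # spectral resolution
    harmonics = np.arange(1, n_harm + 1)
    best_f0, best_score = f0_min, -1.0
    for f0 in np.arange(f0_min, f0_max, 1.0):
        idx = np.round(harmonics * f0 / bin_hz).astype(int)
        keep = idx < len(spec)
        # Weight by 1/h so a sub-harmonic candidate (f0/2) scores lower.
        score = np.sum(spec[idx[keep]] / harmonics[keep])
        if score > best_score:
            best_f0, best_score = f0, score
    return best_f0
```

On a synthetic harmonic signal (200 + 400 + 600 Hz) the estimator picks 200 Hz rather than the 100 Hz sub-harmonic.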

SL981041.PDF (From Author) SL981041.PDF (Rasterized)

TOP


Frequency Domain Binaural Model as the Front End of Speech Recognition System

Authors:

Tsuyoshi Usagawa, Kumamoto University (Japan)
Kenji Sakai, Kumamoto University (Japan)
Masanao Ebata, Kumamoto University (Japan)

Page (NA) Paper number 190

Abstract:

In this paper, a frequency-domain binaural model is introduced. The proposed model is a revision of our earlier time-domain model, which computed the interaural cross-correlation; the new model requires less computation while offering comparable performance. It is based on FFT analysis and uses the cross-power spectrum to obtain the interaural phase difference. The performance of the models is examined not only on an isolated-word speech recognition task but also on a speech enhancement task. Experimental results show that the improvement in robustness on the speech recognition task corresponds to about 15-20 dB when the surrounding noise is white noise, a few decibels better than that obtained with the time-domain model. However, when the surrounding noise is speech, the improvement decreases to 10-15 dB. In addition, the proposed model can reproduce the signal component from a specified direction as a binaural signal.
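The central quantity, the interaural phase difference (IPD) obtained from the cross-power spectrum, is a one-liner per frequency bin. This is a minimal sketch of that computation only, not the full binaural model.

```python
import numpy as np

def interaural_phase_difference(left, right, n_fft=512):
    """IPD per frequency bin from the cross-power spectrum of the two
    ear signals: the phase of X_L(f) * conj(X_R(f))."""
    xl = np.fft.rfft(left, n_fft)
    xr = np.fft.rfft(right, n_fft)
    cross = xl * np.conj(xr)          # cross-power spectrum
    return np.angle(cross)            # IPD in radians per bin
```

Bins whose IPD is consistent with the target direction can then be retained (and the rest attenuated) to enhance the signal from that direction.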

SL980190.PDF (From Author) SL980190.PDF (Rasterized)

0190_01.PDF
(was: 0190_01.JPG)
spectrogram in JPEG Fig.6(a)
File type: Image File
Format: JPEG
Tech. description: 1280x411 24bits
Creating Application: Unknown
Creating OS: linux 2.0.34
0190_02.WAV
(was: 0190_02.WAV)
sound file Fig.6(a)
File type: Sound File
Format: WAV
Tech. description: (44.1k, 16bit, mono)
Creating Application: LHa for UNIX V 1.14c
Creating OS: linux 2.0.34
0190_03.PDF
(was: 0190_03.JPG)
spectrogram in JPEG Fig.6(b)
File type: Image File
Format: JPEG
Tech. description: 1280x411 24bits
Creating Application: LHa for UNIX V 1.14c
Creating OS: linux 2.0.34
0190_04.WAV
(was: 0190_04.WAV)
sound file Fig.6(b)
File type: Sound File
Format: WAV
Tech. description: (44.1k, 16bit, mono)
Creating Application: LHa for UNIX V 1.14c
Creating OS: linux 2.0.34
0190_05.PDF
(was: 0190_05.JPG)
spectrogram in JPEG Fig.6(c)
File type: Image File
Format: JPEG
Tech. description: 1280x411 24bits
Creating Application: LHa for UNIX V 1.14c
Creating OS: linux 2.0.34
0190_06.WAV
(was: 0190_06.WAV)
sound file Fig.6(c)
File type: Sound File
Format: WAV
Tech. description: (44.1k, 16bit, mono)
Creating Application: LHa for UNIX V 1.14c
Creating OS: linux 2.0.34

TOP


A Study on the Recognition of Low Bit-Rate Encoded Speech

Authors:

An-Tzyh Yu, National Tsing-Hua University (Taiwan)
Hsiao-Chuan Wang, National Tsing-Hua University (Taiwan)

Page (NA) Paper number 38

Abstract:

The recognition of encoded speech opens up potential applications of speech recognition in low bit-rate communication systems. Such applications will become increasingly important on the Internet and in digital mobile telephony. In this paper, the encoded parameters of the speech signal in various speech coding systems are evaluated on a designated speech recognition task. A continuous-density Gaussian-mixture HMM forms the framework of the speech recognition system.

SL980038.PDF (From Author) SL980038.PDF (Rasterized)

TOP


Weighted Parallel Model Combination for Noisy Speech Recognition

Authors:

Tai-Hwei Hwang, National Tsing-Hua University (Taiwan)
Hsiao-Chuan Wang, National Tsing-Hua University (Taiwan)

Page (NA) Paper number 39

Abstract:

This paper proposes a modified parameter-mapping scheme for the parallel model combination (PMC) method. The modification aims to improve the discriminative capability of the compensated models. This is achieved by reshaping the state distributions so as to emphasize the contribution of the mean in the subsequent processing. The distributions of both the speech model and the noise model are shaped in the cepstral domain through a covariance-contracting procedure. After the compensation steps, an expanding procedure on the adapted covariance is needed to undo the emphasis. Through this process the discriminative capability is increased, so that recognition accuracy improves. Recognition of Chinese names demonstrates the improvement over the original PMC method, especially at low SNR.
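The combination step that the weighting wraps around is the standard log-normal PMC approximation in the log-spectral domain, sketched below. The covariance contracting/expanding procedure that constitutes the paper's contribution is not reproduced here.

```python
import numpy as np

def pmc_lognormal(mu_s, var_s, mu_n, var_n, g=1.0):
    """Log-normal approximation for PMC: map the speech and noise
    Gaussians to the linear domain, add them (noise scaled by gain g),
    and map the result back to log-normal parameters."""
    # log-normal -> linear-domain mean and variance
    m_s = np.exp(mu_s + var_s / 2.0)
    v_s = m_s**2 * (np.exp(var_s) - 1.0)
    m_n = np.exp(mu_n + var_n / 2.0)
    v_n = m_n**2 * (np.exp(var_n) - 1.0)
    # additive combination in the linear domain
    m = m_s + g * m_n
    v = v_s + g**2 * v_n
    # linear -> log-normal parameters of the combined model
    var_c = np.log(1.0 + v / m**2)
    mu_c = np.log(m) - var_c / 2.0
    return mu_c, var_c
```

When the noise mean is negligible the combined model reduces to the clean-speech model, a useful sanity check.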

SL980039.PDF (From Author) SL980039.PDF (Rasterized)

TOP


Favourable and Unfavourable Short Duration Segments of Speech in Noise

Authors:

Daniel Woo, The University of New South Wales (Australia)

Page (NA) Paper number 700

Abstract:

Human perceptual experiments are described in which listeners are presented with segmented stop-consonant speech stimuli in noise. The selection of short-duration speech segments is based on a local measure of the signal-to-noise ratio calculated over 1 ms windows. The aim is to create stimuli with known fluctuations between a speech and a noise sample, to assess whether the presence of short-duration "gaps" in the noise produces favourable and unfavourable signal regions that influence identification. The perceptual results suggest that human listeners make better use of signals that comprise only positive local signal-to-noise-ratio segments; such regions are assumed to be more favourable for stimulus identification. Presentation of stimuli containing only negative signal-to-noise-ratio regions does not appear to contribute as much. A model based on the accumulation of short-duration spectral segments is presented that produces a similar set of identification functions for the same test stimuli.
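The local SNR measure over 1 ms windows can be sketched directly; labeling windows with positive local SNR as favourable follows the abstract's definition, while the function names are assumptions.

```python
import numpy as np

def local_snr_db(speech, noise, fs, win_ms=1.0):
    """Local SNR in dB over short (default 1 ms) windows, given the
    separate speech and noise signals before mixing."""
    win = max(int(fs * win_ms / 1000.0), 1)
    n = (len(speech) // win) * win            # drop a ragged tail
    s = speech[:n].reshape(-1, win)
    v = noise[:n].reshape(-1, win)
    eps = 1e-12                               # avoid log(0)
    ps = (s ** 2).mean(axis=1) + eps
    pv = (v ** 2).mean(axis=1) + eps
    return 10.0 * np.log10(ps / pv)           # one value per window

def favourable_mask(speech, noise, fs, win_ms=1.0):
    """True for windows with positive local SNR (favourable regions)."""
    return local_snr_db(speech, noise, fs, win_ms) > 0.0
```

The mask can then be used to build stimuli containing only favourable or only unfavourable segments, as in the listening tests.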

SL980700.PDF (From Author) SL980700.PDF (Scanned)

TOP