Session TAB: Robustness in Recognition and Signal Processing II

Chairperson: Alex Waibel, Carnegie Mellon Univ., USA



ADAPTATION OF TIME DIFFERENTIATED CEPSTRUM FOR NOISY SPEECH RECOGNITION

Authors: Tai-Hwei Hwang, Lee-Min Lee* and Hsiao-Chuan Wang

Department of Electrical Engineering, National Tsing-Hua University, Hsinchu, Taiwan, ROC 30043 Email: hcwang@ee.nthu.edu.tw *Department of Electrical Engineering, Mingchi Institute of Technology, Taipei Hsien, Taiwan, ROC 243

Volume 3 pages 1075 - 1078

ABSTRACT

In this paper, a noise compensation algorithm using a first-order approximation of the cepstral function is presented. The derivative term is replaced by a difference of cepstra so that the adaptation covers a wide range of noise-power variation. The difference of the cepstral mean vectors between the clean and noisy versions, termed the deviation vector, is applied to adapt the cepstrum and delta cepstrum. The experimental results show that using the deviation vector to adapt the cepstral coefficients gains a significant improvement over the method based on the weighted projection measure. Further improvement can be made by jointly adapting the cepstrum and delta cepstrum.
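The core operation the abstract describes can be sketched as follows. This is an illustrative reconstruction, not the authors' code; the dimensions, data, and function names are assumptions. The deviation vector is the difference between the clean and noisy cepstral means, and adaptation adds it back to noisy test frames:

```python
import numpy as np

def deviation_vector(clean_cep, noisy_cep):
    """Mean cepstrum of clean speech minus mean cepstrum of its noisy version."""
    return clean_cep.mean(axis=0) - noisy_cep.mean(axis=0)

def adapt(noisy_frames, dev):
    """Shift each noisy cepstral frame by the deviation vector."""
    return noisy_frames + dev

rng = np.random.default_rng(0)
clean = rng.normal(0.0, 1.0, size=(200, 12))   # clean cepstra (frames x coeffs)
bias = np.full(12, 0.5)                        # stationary noise-induced cepstral bias
noisy = clean + bias                           # noisy version of the same speech
dev = deviation_vector(clean, noisy)           # equals -bias for this toy case
adapted = adapt(noisy, dev)                    # frames moved back toward clean space
```

With a purely additive cepstral bias the adaptation recovers the clean frames exactly; real noise is not this simple, which is why the paper also adapts the delta cepstrum.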

A0012.pdf



ON THE IMPORTANCE OF VARIOUS MODULATION FREQUENCIES FOR SPEECH RECOGNITION

Authors: Noboru Kanedera (1), Takayuki Arai (2), Hynek Hermansky (2), and Misha Pavel (3)

(1) Oregon Graduate Institute of Science & Technology, Portland, Oregon, U.S.A. (2) International Computer Science Institute, Berkeley, California, U.S.A. (3) Ishikawa National College of Technology, Japan

Volume 3 pages 1079 - 1082

ABSTRACT

Temporal processing of the time trajectories in the logarithmic spectrum domain, performed in cepstral mean subtraction, in computation of dynamic features in speech, or in RASTA processing, is becoming a common procedure in current ASR. Such temporal processing effectively enhances some components of the modulation spectrum of speech while suppressing others. It is therefore important to know the relative importance of various components of the modulation spectrum of speech. In this study we report on the effect of band-pass filtering of the time trajectories of spectral envelopes on speech recognition. Results indicate the relative importance of different components of the modulation spectrum of speech for ASR.
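The band-pass filtering of spectral-envelope trajectories studied here can be illustrated with a minimal sketch (not the paper's implementation; the frame rate, band edges, and test signals are assumptions). Each channel's time trajectory is filtered so only a chosen band of the modulation spectrum survives:

```python
import numpy as np

FRAME_RATE = 100.0            # frames per second -> modulation Nyquist = 50 Hz
LOW_HZ, HIGH_HZ = 1.0, 16.0   # keep modulation components in roughly 1-16 Hz

def bandpass_trajectories(log_spec):
    """log_spec: (frames, channels). Ideal band-pass along the time axis."""
    n = log_spec.shape[0]
    freqs = np.fft.rfftfreq(n, d=1.0 / FRAME_RATE)
    mask = (freqs >= LOW_HZ) & (freqs <= HIGH_HZ)
    spec = np.fft.rfft(log_spec, axis=0)
    return np.fft.irfft(spec * mask[:, None], n=n, axis=0)

t = np.arange(500) / FRAME_RATE
slow_drift = np.sin(2 * np.pi * 0.2 * t)     # 0.2 Hz channel drift (e.g. channel)
speech_like = np.sin(2 * np.pi * 4.0 * t)    # 4 Hz syllabic-rate modulation
traj = (slow_drift + speech_like)[:, None]   # one log-spectral channel
filtered = bandpass_trajectories(traj)       # drift removed, 4 Hz kept
```

The slow drift (outside the pass band) is removed while the syllabic-rate modulation passes unchanged, which is the kind of selectivity whose effect on ASR the paper measures band by band.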

A0044.pdf



A Robust RNN-based Pre-classification for Noisy Mandarin Speech Recognition

Authors: Wei-Tyng Hong and Sin-Horng Chen

Department of Communication Engineering, National Chiao Tung University, Hsinchu, Taiwan. TEL: 886-3-5731822, FAX: 886-3-5710116, E-Mail: jeff@cm.nctu.edu.tw

Volume 3 pages 1083 - 1086

ABSTRACT

This paper addresses the problem of speech signal pre-classification for robust noisy speech recognition. A novel RNN-based pre-classification scheme for noisy Mandarin speech recognition is proposed. The RNN, which is trained to be insensitive to noise-level variation, is employed to classify each input frame into three broad classes: initial, final, and pure noise. On-line noise tracking and estimation for noise model compensation is then performed. In addition, a broad-class likelihood compensation based on the RNN outputs is performed to aid recognition. Experimental results showed that a significant improvement in syllable recognition rate is achieved under non-stationary noise environments.

A0063.pdf



A PARALLEL ENVIRONMENT MODEL (PEM) FOR SPEECH RECOGNITION AND ADAPTATION

Authors: Mazin Rahim

AT&T Labs - Research, 180 Park Avenue, Florham Park, NJ 07932, USA mazin@research.att.com

Volume 3 pages 1087 - 1090

ABSTRACT

A speech recognition system for modeling an acoustic mismatch across different environments is presented. The basic philosophy is to apply discriminative learning techniques to separate the recognition process, that is represented by a hidden Markov model (HMM), from the environmental process which is denoted by a limited number of translation vectors. Each segment of speech is assigned to an environment and recognition is performed upon projecting the parameters of the HMM to best characterize the acoustic space of that environment. The proposed system provides an interesting framework for better modeling and adaptation of speech signals with varying acoustic conditions. Experimental findings on connected digits recognition for three different environments are reported.

A0087.pdf



ADAPTIVE MODEL COMBINATION FOR ROBUST SPEECH RECOGNITION IN CAR ENVIRONMENTS

Authors: Volker Schless and Fritz Class

Daimler-Benz AG, Research and Technology, Wilhelm-Runge-Str. 11, D-89081 Ulm, Germany e-mail: schless@dbag.ulm.daimlerbenz.com

Volume 3 pages 1091 - 1094

ABSTRACT

We present a new adaptive method for online noise estimation which extends the model combination approach to slowly varying noise conditions. The technique of model combination is reported to improve accuracy in speech recognition without extensive training on noisy speech data; only training of the noise characteristics is needed. However, if the noise characteristics vary over time, calculating the noise parameters once before recognition is not suitable. Therefore, the new method of online estimation allows an adaptation to the current noise situation. Furthermore, cepstral mean subtraction is added to the model combination scheme, which removes convolutional noise as well. Finally, it is shown how linear discriminant analysis eases the handling of dynamic effects for model combination.

A0136.pdf



A COMPARATIVE STUDY OF SPEECH DETECTION METHODS

Authors: Stefaan Van Gerven and Fei Xie

K. U. Leuven, Department of Electrical Engineering - ESAT Kardinaal Mercierlaan 94, B-3001 Heverlee, Belgium E-mail: Stefaan.VanGerven@esat.kuleuven.ac.be

Volume 3 pages 1095 - 1098

ABSTRACT

This paper addresses the important problem of speech detection. It describes the implementation of three speech detection methods and compares their performance under different signal-to-noise ratio (SNR) and stationarity conditions. The method that dynamically adjusts its thresholds is found to be the most reliable, even under very adverse recording conditions. Yet it is of low complexity and has a very moderate processing delay.
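The dynamic-threshold idea the comparison favors can be sketched in a few lines. This is a generic illustration under stated assumptions, not the evaluated implementation; the smoothing constant and margin are made up. A running noise-floor estimate follows frames that look like noise, and a frame is flagged as speech when its energy exceeds the floor by a margin:

```python
import numpy as np

ALPHA = 0.95        # smoothing factor for the noise-floor estimate
MARGIN_DB = 6.0     # a frame must exceed the floor by this much to count as speech

def detect_speech(frame_energies_db):
    """Return a boolean speech/non-speech flag per frame."""
    floor = frame_energies_db[0]
    flags = []
    for e in frame_energies_db:
        if e < floor + MARGIN_DB:
            # frame looks like noise: let the floor track it
            floor = ALPHA * floor + (1 - ALPHA) * e
        flags.append(e > floor + MARGIN_DB)
    return np.array(flags)

noise = np.full(50, -40.0)             # -40 dB background
speech = np.full(20, -20.0)            # -20 dB speech burst
energies = np.concatenate([noise, speech, noise])
flags = detect_speech(energies)        # True only during the burst
```

Because the floor only updates on noise-like frames, the threshold adapts to a drifting background without being dragged upward by the speech itself.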

A0199.pdf



Voice Activity Detection Using Source Separation Techniques

Authors: Nikos Doukas, Patrick Naylor and Tania Stathaki

Signal Processing Section, Dept of Electrical Engineering, Imperial College, UK. e-mail: n.doukas@ic.ac.uk

Volume 3 pages 1099 - 1102

ABSTRACT

A novel Voice Activity Detector is presented that is based on Source Separation techniques applied to single-sensor signals. It offers very accurate estimation of the endpoints in very low signal-to-noise ratio conditions, while maintaining low complexity. Since the procedure is totally iterative, it is suitable for use in real-time applications and is capable of operating in dynamically adapting situations. Results are presented for both white Gaussian and car engine background noise. The performance of the new technique is compared with that of the GSM Voice Activity Detector.

A0202.pdf



APPLYING BLIND SIGNAL SEPARATION TO THE RECOGNITION OF OVERLAPPED SPEECH

Authors: Tomohiko TANIGUCHI, Shoji KAJITA, Kazuya TAKEDA and Fumitada ITAKURA

Dept. of Information Electronics, Graduate School of Engineering Nagoya University, Nagoya, 464-01, Japan Tel:+81 52 789 3629, FAX: +81 52 789 3172, E-mail:takeda@nuee.nagoya-u.ac.jp

Volume 3 pages 1103 - 1106

ABSTRACT

A blind signal separation method based on minimizing mutual information is applied to the multi-speaker problem in speech recognition. Recognition experiments performed under different acoustic environments, in a soundproof room and a reverberant room, clarify that 1) the method can improve recognition accuracy by about 20% when the SNR condition is 0 dB, 2) the method is more effective when many speakers' speech is present than in the simple overlapped situation, and 3) the method does not work well under reverberant conditions.

A0218.pdf



MULTIRESOLUTION CHANNEL NORMALIZATION FOR ASR IN REVERBERANT ENVIRONMENTS

Authors: Carlos Avendano, Sangita Tibrewala, and Hynek Hermansky

Department of Electrical Engineering Oregon Graduate Institute of Science & Technology Portland, Oregon, USA

Volume 3 pages 1107 - 1110

ABSTRACT

To overcome the problems related to the long impulse responses produced by reverberation, we use a long-time-window (high frequency resolution) analysis during the channel normalization steps of the feature extraction process in automatic speech recognition (ASR). After normalization, a trade-off of frequency resolution for time resolution is used to increase the rate at which the time information is sampled (short-time domain), yielding an appropriate domain from which to derive ASR features. Experiments on data with reverberation times of about 0.5 s show that the new technique achieves a significant performance improvement of a speech recognizer under reverberation, with only some performance degradation on clean speech.

A0246.pdf



A SPEECH PRE-PROCESSING TECHNIQUE FOR END-POINT DETECTION IN HIGHLY NON-STATIONARY ENVIRONMENTS

Authors: R. Martínez, A. Álvarez, P. Gómez, M. Pérez, V. Nieto and V. Rodellar

Departamento de Arquitectura y Tecnología de Sistemas Informáticos Universidad Politécnica de Madrid Campus de Montegancedo, s/n, 28660 Boadilla del Monte, Madrid, SPAIN Tel.: +34.1.336.73.84, Fax: +34.1.336.74.12, e-mail: pedro@pino.datsi.fi.upm.es

Volume 3 pages 1111 - 1114

ABSTRACT

The determination of the precise moment at which speech begins or ends is an important problem in ASR. As shown in [1], small deviations from the optimum beginning and ending points imply a great decrease in recognition accuracy. The presence of noise [2] [3] is an added problem, especially when its level is high (around 95 dB, as in the case of this work) and its characteristics are highly non-stationary, since it can produce false alarms (more probable when the noise includes speech sounds). For this reason, in such conditions it is important to have a pre-processing stage that removes as much noise as possible and gives some clues that help to build an end-point detector for those environments. The method presented here offers a pre-processing technique for highly noisy and non-stationary environments which, at the same time as it enhances the speech, gives an equalized version of the SNR improvement (the Mean Spectral Energy Difference). Its main characteristic is that large differences in the noise level are reduced to a small ripple, while the presence of speech is distinguished by a large decrease in this Mean Spectral Energy Difference. Following this technique, any end-point detection approach (explicit, implicit or hybrid [3]) may render acceptable results.

A0262.pdf



APPLICATION OF SEVERAL CHANNEL AND NOISE COMPENSATION TECHNIQUES FOR ROBUST SPEAKER RECOGNITION*

Authors: L. Docío-Fernández and C. García-Mateo

E.T.S.I. Telecomunicación Communication Technologies Dept. University of Vigo, 36200 Vigo, Spain. Tel. +34 86 812 664, FAX: +34 86 812 116, E-mail: ldocio@tsc.uvigo.es

Volume 3 pages 1115 - 1118

ABSTRACT

This paper is concerned with the problem of Robust Speaker Recognition. An acoustical mismatch between training and testing conditions of hidden Markov model (HMM)-based speaker recognition systems often causes a severe degradation in the recognition performance. In telephone speaker recognition, for example, undesirable signal components due to ambient noise and channel distortion, as well as due to different variations of telephone handsets render the recognizer unusable for real-world applications. The purpose of this paper is to present several compensation techniques to decrease or to remove the mismatch between training and testing environment conditions. Some of the techniques described here have already been successfully applied in Robust Speech Recognition, and our preliminary results show that they are also very encouraging for Speaker Recognition.

A0396.pdf



Knowing the Wheat from the Weeds in Noisy Speech

Authors: H. Agaiby, T. J. Moir

Department of Electronic Engineering and Physics, University of Paisley, Paisley, PA1 2BE, UK. Tel. +44 141 848 3409, FAX: +44 141 848 3404, E-mail: hany@diana22.paisley.ac.uk

Volume 3 pages 1119 - 1122

ABSTRACT

This paper introduces a word boundary detection algorithm that works in a variety of noise conditions including what is commonly called the 'cocktail party' situation. The algorithm uses the direction of the signal as the main criterion for differentiating between desired-speech and background noise. To determine the signal direction the algorithm calculates estimates of the time delay between signals received at two microphones. These time delay estimates together with estimates of the coherence function and signal energy are used to locate word boundaries. The algorithm was tested using speech embedded in different types and levels of noise including car noise, factory noise, babble noise, and competing talkers. The test results showed that the algorithm performs very well under adverse conditions and with SNR down to -14.5dB.
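The time-delay estimation step at the heart of this algorithm can be illustrated with a simple cross-correlation sketch (the signals, delay, and function names are assumptions for the example; the paper additionally uses coherence and energy estimates):

```python
import numpy as np

def estimate_delay(x, y):
    """Lag in samples at which y best aligns with x (y delayed => positive lag)."""
    corr = np.correlate(y, x, mode="full")
    return int(np.argmax(corr)) - (len(x) - 1)

rng = np.random.default_rng(1)
src = rng.normal(size=1000)                         # broadband source signal
delay = 7                                           # extra travel time to mic 2
mic1 = src
mic2 = np.concatenate([np.zeros(delay), src[:-delay]])
lag = estimate_delay(mic1, mic2)                    # recovers the 7-sample delay
```

A word boundary detector can then accept only segments whose estimated lag matches the desired speaker's direction, which is how direction serves as the discriminating criterion.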

A0439.pdf



Model-based approach for robust speech recognition in noisy environments with multiple noise sources

Authors: Do Yeong Kim (1), (2) , Nam Soo Kim (2) , Chong Kwan Un (1)

(1) Department of Elec. Eng., KAIST, Korea dykim@eekaist.kaist.ac.kr (2) Human and Computer Interaction Lab., SAIT, Korea nskim@green.sait.samsung.co.kr

Volume 3 pages 1123 - 1126

ABSTRACT

In this paper, we consider hidden Markov model (HMM) parameter compensation in noisy environments with multiple noise sources, based on the vector Taylor series (VTS) approach. General formulations for multiple environmental variables are derived, and systematic expectation-maximization (EM) solutions are presented in the maximum likelihood (ML) sense. It is assumed that each noise source is independent and Gaussian distributed. To evaluate the proposed method, we conduct speaker-independent isolated word recognition experiments in various noisy environments. Experimental results show that the proposed algorithm achieves a significant improvement. In particular, the proposed method is consistently more effective than parallel model combination (PMC) based on the log-normal approximation.

A0445.pdf



NORMALIZATION OF SPEAKER VARIABILITY BY SPECTRUM WARPING FOR ROBUST SPEECH RECOGNITION

Authors: Y.C. Chu, Charlie Jie, Vincent Tung, Ben Lin and Richard Lee

Technology Center Philips Taiwan P.O. Box 22978, Taipei, Taiwan, R.O.C. Tel. +886 2 382 3207, FAX: +886 2 382 4598, E-mail: y.c.chu@tw.ccmail.philips.com

Volume 3 pages 1127 - 1130

ABSTRACT

This paper examines techniques for normalization of unseen speakers in recognition. Two implementations of linear spectrum warping were examined: time domain resampling and filter bank scaling. It is shown that for seen speakers, the models trained by unwarped utterances are less sensitive to spectrum warping by filter bank scaling than by resampling. A pitch-based scheme for warping factor estimation has been proposed. The method is shown to be cost-effective in reducing the variability of unseen speakers compared to the ML-based methods. In particular the combination of filter bank scaling with the pitch-based warping factor estimation reduces the error rate of isolated Mandarin digit recognition by more than 30% for unseen speakers.

A0455.pdf



LPC POLES TRACKER FOR MUSIC/SPEECH/NOISE SEGMENTATION AND MUSIC CANCELLATION

Authors: Stephane H. Maes

Human Language Technologies Group, Speech Decoding Design Department, IBM T.J. Watson Research Center P.O. Box 218, Route 134, Yorktown Heights, NY 10598, USA e-mail: smaes@watson.ibm.com

Volume 3 pages 1131 - 1134

ABSTRACT

In automatic speech recognition (ASR) of broadcast news shows, the input utterances are often corrupted by background music and noise. This paper proposes a new method for the automatic segmentation of speech signals according to the background: music, clean, or noisy. LPC analysis is used to extract the poles of the associated transfer function. Based on the time evolution of the poles, it is possible to discriminate the contributions of music, speech and noise: music poles are stable longer than speech poles, while noise poles have a more unstable behavior than speech poles. Once the background of a signal is identified, poles tagged as non-speech can be separated from the speech poles. Using only the speech poles along with the LPC residuals, it is possible to reconstruct a new signal freed of the music and noise contributions.
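The pole-extraction step described above can be sketched as follows (an illustrative reconstruction, not the paper's code; the LPC order, frame, and frequency are assumptions). LPC coefficients are fitted by the autocorrelation method and the roots of the prediction polynomial give the poles:

```python
import numpy as np

def lpc(frame, order):
    """Autocorrelation-method LPC coefficients via the normal equations."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    R = np.array([[r[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, r[1:order + 1])
    return np.concatenate([[1.0], -a])        # A(z) = 1 - sum_k a_k z^-k

def lpc_poles(frame, order=8):
    """Poles of the all-pole model 1/A(z)."""
    return np.roots(lpc(frame, order))

rng = np.random.default_rng(3)
n = np.arange(400)
# A tonal, "music-like" frame: a sinusoid plus a little noise, windowed.
frame = (np.sin(2 * np.pi * 0.05 * n) + 0.001 * rng.normal(size=400)) * np.hanning(400)
poles = lpc_poles(frame)                      # dominant pair near angle 2*pi*0.05
```

Tracking how stable such pole locations are from frame to frame is what lets the paper tag poles as music, speech, or noise.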

A0487.pdf



COMPARATIVE EVALUATIONS OF SEVERAL FRONT-ENDS FOR ROBUST SPEECH RECOGNITION

Authors: Doh-Suk Kim (1) , Jae-Hoon Jeong (1) , Soo-Young Lee (1) , Rhee M. Kil (2)

(1) Department of Electrical Engineering/ (2) Division of Basic Science Korea Advanced Institute of Science and Technology 373-1 Kusong-dong, Yusong-gu, Taejon 305-701, Korea E-mail: dsk@eekaist.kaist.ac.kr

Volume 3 pages 1135 - 1138

ABSTRACT

The zero-crossings with peak amplitudes (ZCPA) model, motivated by the human auditory periphery, is simple compared with other auditory models, but is a powerful speech analysis tool for robust speech recognition in noisy environments. In this paper, an improvement in the recognition rate of the ZCPA model is addressed by incorporating time-derivative features with several different time-derivative window lengths. Experimental results show that ZCPA has a relatively higher sensitivity to the derivative window length than conventional feature extraction algorithms. Also, experimental comparisons with several front-ends, including some auditory-like schemes, in real-world noisy environments demonstrate the robustness of the ZCPA model. The ZCPA model shows superior performance compared with other front-ends, especially in noisy conditions corrupted by white Gaussian noise.

A0576.pdf



SPEAKER NORMALIZATION THROUGH FORMANT-BASED WARPING OF THE FREQUENCY SCALE

Authors: Evandro B. Gouvea and Richard M. Stern

Department of Electrical and Computer Engineering School of Computer Science Carnegie Mellon University Pittsburgh, Pennsylvania 15213, USA Tel. +1 412 268 7116, FAX: +1 412 268 3890, Email: egouvea@cs.cmu.edu

Volume 3 pages 1139 - 1142

ABSTRACT

Speaker-dependent automatic speech recognition systems are known to outperform speaker-independent systems when enough training data are available to model acoustical variability among speakers. Speaker normalization techniques modify the spectral representation of incoming speech waveforms in an attempt to reduce variability between speakers. Recent successful speaker normalization algorithms have incorporated a speaker-specific frequency warping to the initial signal processing stages. These algorithms, however, do not make extensive use of acoustic features contained in the incoming speech. In this paper we study the possible benefits of the use of acoustic features in speaker normalization algorithms using frequency warping. We study the extent to which the use of such features, including specifically the use of formant frequencies, can improve recognition accuracy and reduce computational complexity for speaker normalization. We examine the characteristics and limitations of several types of feature sets and warping functions as we compare their performance relative to existing algorithms.
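A minimal sketch of the formant-based warping idea can make the approach concrete. This is a hypothetical illustration, not the authors' algorithm: the reference formant value, the use of a single formant ratio, and the purely linear warp are all assumptions for the example:

```python
import numpy as np

REF_F3 = 2500.0   # assumed reference third-formant frequency in Hz

def warp_factor(speaker_f3):
    """Warp factor from the ratio of a speaker's formant to the reference."""
    return speaker_f3 / REF_F3

def warp_frequencies(freqs_hz, alpha):
    """Linear warping: divide the frequency axis by the warp factor."""
    return np.asarray(freqs_hz, dtype=float) / alpha

alpha = warp_factor(2750.0)                       # speaker with a shorter vocal tract
warped = warp_frequencies([500.0, 1000.0, 2750.0], alpha)
# The speaker's third formant lands on the reference value after warping.
```

Using a measured acoustic feature such as a formant avoids the exhaustive per-speaker likelihood search of ML-based warp estimation, which is the computational saving the paper investigates.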

A0947.pdf



THE USE OF CEPSTRAL MEANS IN CONVERSATIONAL SPEECH RECOGNITION

Authors: Martin Westphal

Interactive Systems Laboratories, University of Karlsruhe, 76128 Karlsruhe, Germany westphal@ira.uka.de

Volume 3 pages 1143 - 1146

ABSTRACT

Environmental robustness and speaker independence are important issues in current speech recognition research. Channel and speaker adaptation methods do the best job when the adaptation is done towards a normalized acoustic model. Normalization methods might make use of the model but primarily influence the signal such that important information is kept and unwanted distortions are cancelled out. Most large-vocabulary conversational speech recognition systems use Cepstral Mean Subtraction (CMS), a channel normalization approach, to compensate for the acoustic channel (and also the speaker). In this paper we discuss the basic algorithm and variations of it in the context of conversational speech and report our experience using different approaches on two widely used conversational speech recognition tasks.
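The basic per-utterance form of CMS discussed here is a one-liner: a stationary convolutional channel appears as a constant additive offset in the cepstrum, so subtracting the utterance mean cancels it. A minimal sketch (the data and shapes are assumptions):

```python
import numpy as np

def cms(cepstra):
    """cepstra: (frames, coeffs). Subtract the per-utterance cepstral mean."""
    return cepstra - cepstra.mean(axis=0, keepdims=True)

rng = np.random.default_rng(2)
utt = rng.normal(size=(300, 13))      # cepstra of a clean utterance
channel = rng.normal(size=13)         # fixed channel = constant cepstral offset
observed = utt + channel              # what the recognizer actually sees
normalized = cms(observed)            # channel offset removed
```

Since `cms(utt + channel)` equals `cms(utt)`, any fixed channel is removed exactly; the variations the paper discusses concern how the mean is estimated for conversational speech, where utterances are short and speakers alternate.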

A0999.pdf



COMPENSATION FOR ENVIRONMENTAL AND SPEAKER VARIABILITY BY NORMALIZATION OF POLE LOCATIONS

Authors: Juan M. Huerta and Richard M. Stern

Department of Electrical and Computer Engineering School of Computer Science Carnegie Mellon University Pittsburgh, PA 15213, USA

Volume 3 pages 1147 - 1150

ABSTRACT

We present a compensation technique that corrects for the effects of noise and of speaker and environmental variability on speech recognition accuracy by modifying the positions of the poles representing the speech signal in the z-plane. This modification yields pole locations with statistics that more closely match those of the distribution of clean training speech. The parameters of the mapping are obtained from statistics of the distributions of the poles of the training and testing speech. Compensation is performed by direct modification of both the angle and the radius of the pole locations, and also by evaluating the cepstrum along a circle of radius less than 1 in the z-plane to enhance the salience of spectral peaks. These procedures are evaluated on the DARPA Resource Management database with added white noise. They are shown to compensate for the effects of environmental degradation, particularly at low SNRs.
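The "cepstrum along a circle of radius less than 1" operation has a simple closed form for an all-pole model, sketched below (an illustrative derivation under stated assumptions, not the authors' code). For poles p_i, the LPC cepstrum is c_n = sum_i p_i^n / n, so evaluating on the circle |z| = rho rescales each coefficient by rho**-n, equivalent to moving every pole outward relative to the evaluation contour:

```python
import numpy as np

def lpc_cepstrum(poles, n_coeffs):
    """Cepstrum of an all-pole model: c_n = Re(sum_i p_i**n) / n."""
    return np.array([np.real(np.sum(poles ** k)) / k
                     for k in range(1, n_coeffs + 1)])

def cepstrum_on_circle(cep, rho):
    """Re-evaluate the cepstrum on |z| = rho by exponential weighting."""
    n = np.arange(1, len(cep) + 1)
    return cep / rho ** n

# One conjugate pole pair at radius 0.9, angle +/-0.3 rad.
poles = np.array([0.9 * np.exp(1j * 0.3), 0.9 * np.exp(-1j * 0.3)])
cep = lpc_cepstrum(poles, 10)
cep_rho = cepstrum_on_circle(cep, 0.95)       # contour pulled inside the unit circle
expected = lpc_cepstrum(poles / 0.95, 10)     # same as scaling the poles outward
```

Because the poles sit closer to the evaluation contour, the corresponding spectral peaks become sharper, which is the peak-enhancement effect the abstract describes.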

A1005.pdf



CELLULAR PHONE SPEECH RECOGNITION: NOISE COMPENSATION vs. ROBUST ARCHITECTURES

Authors: J-B. Puel and R. André-Obrecht

IRIT - Université Paul Sabatier 118, route de Narbonne 31062 Toulouse cedex - France {puel, obrecht}@irit.fr

Volume 3 pages 1151 - 1154

ABSTRACT

This paper addresses the problem of speech recognition over telephone networks. When the communication channel is unknown, the large mismatch between the training data and the signal encountered in the recognition phase drastically decreases the performance of recognition systems. In this context, we compare a classical approach, the noise compensation method, with novel robust network modelings aiming to incorporate and manage more variability in the training data. We introduce multi-HMM and multi-transition systems, trained with data recorded over the analog switched network and the cellular phone network. These architectures give the best results and succeed in improving the recognizers' robustness, achieving up to a 77% reduction of the error rate for a system trained for the switched telephone network and used with a cellular phone. Nevertheless, this modeling requires training data recorded in both environments; when such data are not available, noise cancellation or channel compensation are the only affordable solutions.

A1097.pdf



Speech Recognition in Noise Using On-line HMM Adaptation

Authors: TungHui Chiang

Industrial Technology Research Institute (ITRI) Chutung, Hsinchu, Taiwan 310, R.O.C Advanced Technology Center (ATC) Computer & Communication Laboratories (CCL)

Volume 3 pages 1155 - 1158

ABSTRACT

In this paper, a novel two-stage framework is proposed to cope with speech recognition in adverse environments. In the first stage, an on-line HMM composition method is proposed which compensates the HMMs by making use of the on-line testing utterances. By using the proposed method, the dynamic change of environmental noise in each utterance can be well handled. In addition, a classifier trained using a discriminative learning procedure is incorporated in the second stage to enhance the system's discrimination capability. Since the recognition and adaptation processes are carried out in the same session in an unsupervised fashion, the proposed two-stage framework is suitable for practical use.

A1179.pdf
