Chair: Mark A. Clements, Georgia Institute of Technology (U.S.A.)
Richard J Barron, MIT (U.S.A.)
Charles K Sestok, MIT (U.S.A.)
Alan V. Oppenheim, MIT (U.S.A.)
This paper proposes several methods for noise reduction that use deterministic side information about the desired signal as a constraint on the reconstruction. Two forms of side information are considered separately: short-time linear predictive coefficients and short-time zero-phase impulse response coefficients. We derive general expressions for the ML, MAP, and MMSE estimators, and develop algorithms that yield the ML estimators with the above side information for speech corrupted by additive white Gaussian noise. We also explore the use of these methods in the traditional noise reduction problem with no side information.
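A minimal sketch of one way such side information could be exploited, assuming the clean speech's short-time LPC coefficients and excitation variance are available: build the speech power spectrum from the LPC model and apply a Wiener-type spectral gain. This illustrates the role of the constraint, not the paper's actual ML algorithm; all names and parameters below are illustrative.

```python
import numpy as np

def lpc_psd(a, g2, n_fft):
    """AR power spectrum g^2 / |A(e^jw)|^2 from LPC coefficients
    a = [1, a1, ..., ap] and excitation variance g2."""
    A = np.fft.rfft(np.asarray(a, dtype=float), n_fft)
    return g2 / np.maximum(np.abs(A) ** 2, 1e-12)

def enhance_frame(noisy, a, g2, noise_var):
    """Wiener-type gain built from the LPC side information."""
    n = len(noisy)
    Y = np.fft.rfft(noisy)
    p_speech = lpc_psd(a, g2, n)
    gain = p_speech / (p_speech + noise_var)
    return np.fft.irfft(gain * Y, n)
```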
Dimitrie C Popescu, Rutgers University (U.S.A.)
Ilija Zeljkovic, AT&T Labs (U.S.A.)
A method for applying Kalman filtering to speech signals corrupted by colored noise is presented. Both speech and colored noise are modeled as autoregressive (AR) processes using speech and silence regions determined by an automatic end-point detector. Due to the non-stationary nature of the speech signal, a non-stationary Kalman filter is used. Experiments indicate that non-stationary Kalman filtering outperforms the stationary case, with the average SNR improvement increasing from 0.53 dB to 2.3 dB. Even better results are obtained if the noise is also modeled as non-stationary, in addition to being colored, achieving an average SNR improvement of 7.14 dB.
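A sketch of the core recursion, assuming an AR(p) speech model in companion (state-space) form and white observation noise; the colored-noise case described above would augment the state with a second AR model for the noise (not shown). Parameter names are illustrative.

```python
import numpy as np

def kalman_ar_enhance(y, a, q, r):
    """Kalman filter for AR(p) speech in white observation noise.
    y: noisy samples; a: AR prediction coefficients [a1, ..., ap];
    q: driving-noise variance; r: observation-noise variance."""
    p = len(a)
    F = np.zeros((p, p))                 # companion-form transition matrix
    F[0, :] = a
    if p > 1:
        F[1:, :-1] = np.eye(p - 1)
    H = np.zeros((1, p)); H[0, 0] = 1.0  # observe the current sample
    Q = np.zeros((p, p)); Q[0, 0] = q
    x = np.zeros((p, 1))
    P = np.eye(p)
    out = np.zeros(len(y))
    for t, yt in enumerate(y):
        x = F @ x                        # time update
        P = F @ P @ F.T + Q
        s = float(H @ P @ H.T) + r       # innovation variance
        K = (P @ H.T) / s                # Kalman gain
        x = x + K * (yt - float(H @ x))  # measurement update
        P = (np.eye(p) - K @ H) @ P
        out[t] = x[0, 0]                 # filtered speech estimate
    return out
```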
Mitsunori Mizumachi, Advanced Institute of Science and Technology (Japan)
Masato Akagi, Advanced Institute of Science and Technology (Japan)
This paper proposes a method of noise reduction by paired microphones as a front-end processor for speech recognition systems. The method estimates the noise with a subtractive microphone array and removes it from the noisy speech signal by spectral subtraction (SS). Since the noise can be estimated analytically and frame by frame, the estimate does not depend on the noise's acoustic properties. The method can therefore also reduce non-stationary noises, for example the sudden noise of a closing door, which cannot be reduced by other SS methods. The results of computer simulations and experiments in a real environment show that this method can reduce LPC log spectral envelope distortions.
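A hedged sketch of the spectral-subtraction stage, assuming a per-frame noise spectrum estimate (for example, one delivered by the subtractive microphone array); the array processing itself is not reproduced here, and the spectral floor is an illustrative choice.

```python
import numpy as np

def spectral_subtract(noisy_frame, noise_frame, floor=0.01):
    """Magnitude spectral subtraction with a per-frame noise estimate,
    keeping the noisy phase; `floor` limits musical-noise artifacts."""
    n = len(noisy_frame)
    w = np.hanning(n)
    Y = np.fft.rfft(noisy_frame * w)
    N = np.fft.rfft(noise_frame * w)
    mag = np.maximum(np.abs(Y) - np.abs(N), floor * np.abs(Y))
    return np.fft.irfft(mag * np.exp(1j * np.angle(Y)), n)
```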
Laurent Girin, ICP, Grenoble (France)
Gang Feng, ICP, Grenoble (France)
Jean-Luc Schwartz, ICP, Grenoble (France)
This paper deals with a noisy speech enhancement technique based on the fusion of auditory and visual information. We first present the global structure of the system, and then focus on the tool used to fuse the two sources of information. The complete noise reduction system is implemented in the context of vowel transitions corrupted with white noise. A complete evaluation of the system in this context is presented, including distance measures, Gaussian classification scores, and a perceptual test. The results are very promising.
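As a rough illustration of late fusion (not the paper's actual tool), one could weight audio-derived and video-derived spectral estimates by the estimated audio SNR; the sigmoid weighting below is a hypothetical choice under that assumption.

```python
import numpy as np

def fuse_estimates(audio_est, visual_est, audio_snr_db):
    """Weight audio-derived and video-derived spectral estimates by the
    estimated audio SNR: trust audio when it is clean, lips when it is not."""
    w = 1.0 / (1.0 + np.exp(-audio_snr_db / 5.0))   # hypothetical sigmoid
    return w * np.asarray(audio_est) + (1.0 - w) * np.asarray(visual_est)
```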
Rafael Martinez, Universidad Politecnica de Madrid (Spain)
Pedro Gomez, Universidad Politecnica de Madrid (Spain)
Agustin Alvarez, Universidad Politecnica de Madrid (Spain)
Victor Nieto, Universidad Politecnica de Madrid (Spain)
Victoria Rodellar, Universidad Politecnica de Madrid (Spain)
Manuel Rubio, Universidad Politecnica de Madrid (Spain)
Mercedes Perez, Universidad Politecnica de Madrid (Spain)
Speech recognition in noisy environments is critical for applications in communications, automotive, avionics, and similar domains, where common recognizers fail to meet acceptable standards due to the noise picked up while recording the speech trace. Adaptive filtering has traditionally been used with acceptable success for noise cancellation in two-microphone schemes. Some problems with this technique stem from the non-stationary behavior of speech and noise, as sudden changes in their relative energy levels cause misadjustment and loss of lock in the adaptive algorithms. A method is presented here based on the dynamic adjustment of the adaptation factor by a 3-state automaton driven by continuous tracking of the energy differences between speech and noise. The paper discusses the proposed method and presents results from practical conditions, which show good stability and a notable ability for word-boundary detection under highly non-stationary conditions.
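A sketch of the idea under stated assumptions: an NLMS two-microphone canceller whose adaptation factor is switched by a 3-state automaton (freeze / cautious / fast) driven by the short-time energy difference between the channels. The thresholds and step sizes below are illustrative, not the paper's.

```python
import numpy as np

def nlms_with_gearshift(primary, reference, n_taps=64, eps=1e-6):
    """Two-microphone NLMS noise canceller whose adaptation factor is
    switched by a 3-state automaton driven by the short-time energy
    difference between the two channels."""
    primary = np.asarray(primary, dtype=float)
    reference = np.asarray(reference, dtype=float)
    mu_for_state = {0: 0.0, 1: 0.05, 2: 0.5}     # freeze / cautious / fast
    w = np.zeros(n_taps)
    out = np.zeros(len(primary))
    for n in range(n_taps, len(primary)):
        x = reference[n - n_taps:n][::-1]
        e = primary[n] - w @ x                   # enhanced output sample
        ep = np.sum(primary[n - n_taps:n] ** 2) + eps
        er = np.sum(x ** 2) + eps
        diff_db = 10.0 * np.log10(ep / er)       # speech-vs-noise energy gap
        # illustrative thresholds: speech dominant -> freeze adaptation,
        # noise dominant -> adapt fast, otherwise adapt cautiously
        state = 0 if diff_db > 6.0 else (2 if diff_db < -6.0 else 1)
        w += mu_for_state[state] * e * x / (x @ x + eps)
        out[n] = e
    return out
```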
Jaco Vermaak, Cambridge University (U.K.)
Mahesan Niranjan, Cambridge University (U.K.)
This paper investigates a Bayesian approach to the enhancement of speech signals corrupted by additive white Gaussian noise. Parametric models for the speech and noise processes are constructed, leading to a posterior distribution for the model parameters and uncorrupted speech samples given the observed noisy speech samples. Since this posterior is analytically intractable, inference concerning these variables is performed using Markov chain Monte Carlo (MCMC) methods. The efficiency of the sampling scheme within this framework is further improved by employing state-space techniques based on the Kalman filter.
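A minimal Gibbs-sampling illustration for an AR(1) speech model in additive white Gaussian noise, alternating single-site draws of the clean samples with a conjugate draw of the AR coefficient; the paper's scheme uses richer models and Kalman-based state-space moves, and the variances and iteration count below are illustrative.

```python
import numpy as np

def gibbs_enhance(y, q=0.01, r=0.1, n_iter=200):
    """Gibbs sampler for s[t] = a*s[t-1] + v[t], y[t] = s[t] + w[t],
    with v ~ N(0, q) and w ~ N(0, r). Returns the posterior-mean speech
    estimate (no burn-in discarded, for brevity)."""
    rng = np.random.default_rng(0)
    y = np.asarray(y, dtype=float)
    T = len(y)
    s = y.copy()
    a = 0.9
    s_sum = np.zeros(T)
    for _ in range(n_iter):
        # sample each clean sample given its neighbours and the observation
        for t in range(T):
            prec = 1.0 / r
            mean = y[t] / r
            if t > 0:                      # transition s[t-1] -> s[t]
                prec += 1.0 / q
                mean += a * s[t - 1] / q
            if t + 1 < T:                  # transition s[t] -> s[t+1]
                prec += a * a / q
                mean += a * s[t + 1] / q
            s[t] = rng.normal(mean / prec, np.sqrt(1.0 / prec))
        # conjugate Gaussian draw for the AR coefficient given the speech
        num = np.sum(s[1:] * s[:-1])
        den = np.sum(s[:-1] ** 2) + 1e-9
        a = rng.normal(num / den, np.sqrt(q / den))
        s_sum += s
    return s_sum / n_iter
```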
George S Kang, Naval Research Laboratory (U.S.A.)
Thomas M Moran, Naval Research Laboratory (U.S.A.)
In certain communication environments, digital speech transmission systems must work in severe acoustic environments where noise levels exceed 110 dB. In other environments, speakers must use an oxygen face mask. In both situations, the intelligibility of encoded speech falls below an acceptable level. We have developed a technique for improving speech quality in these situations. Previous speech improvement methods have focused on processing the corrupted signal after it has been picked up by the microphone; these methods have not performed adequately. In our technique, speech anomalies are attenuated by a microphone array before speech and noise become mixed into a single signal. Our microphone array prototype has shown excellent performance. In an example of speech recorded aboard an E2C aircraft, the noise-canceling microphone array improved the speech-to-noise ratio by as much as 18 dB. When the same technique was used in a face mask, muffled speech was almost completely restored to high-quality speech.
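The prototype's design is not described in the abstract; as a generic illustration of array-based noise attenuation, a delay-and-sum beamformer aligns the channels toward the talker before summing, so coherent speech adds up while off-axis noise partially cancels. The function below is a sketch under that assumption.

```python
import numpy as np

def delay_and_sum(channels, delays_samples):
    """Align each microphone channel toward the talker and average."""
    T = min(len(c) for c in channels)
    out = np.zeros(T)
    for c, d in zip(channels, delays_samples):
        # np.roll is a crude integer-sample alignment, fine for a sketch
        out += np.roll(np.asarray(c, dtype=float)[:T], -int(d))
    return out / len(channels)
```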
Daniel S Benincasa, Air Force Research Laboratory (U.S.A.)
Michael I Savic, Rensselaer Polytechnic Institute (U.S.A.)
This paper presents a voicing state determination algorithm (VSDA) that simultaneously estimates the voicing states of two speakers present in a segment of co-channel speech. Supervised learning trains a Bayesian classifier to predict the voicing states. The possible voicing states are silence, voiced/voiced, voiced/unvoiced, unvoiced/voiced, and unvoiced/unvoiced; we treat the silent state as a subset of the unvoiced class, except when both speakers are silent. We have chosen a binary decision-tree structure. Our feature set is a projection of a 37-dimensional feature vector onto a single dimension, applied at each branch of the decision tree using the Fisher linear discriminant. Co-channel speech produced from the TIMIT database is used for training and testing. Preliminary results at a signal-to-interference ratio of 0 dB give classification accuracies of 82.6%, 73.45%, and 68.24% on male/female, male/male, and female/female mixtures, respectively.
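A sketch of the projection used at each branch, assuming labeled 37-dimensional feature matrices for the two classes being split; this is the standard Fisher linear discriminant, with a small regularizer added for numerical stability.

```python
import numpy as np

def fisher_direction(X0, X1, reg=1e-6):
    """Fisher linear discriminant direction separating two classes.
    X0, X1: (n_samples, 37) feature matrices for the two voicing classes."""
    m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
    Sw = np.cov(X0, rowvar=False) * (len(X0) - 1) \
       + np.cov(X1, rowvar=False) * (len(X1) - 1)   # within-class scatter
    w = np.linalg.solve(Sw + reg * np.eye(Sw.shape[0]), m1 - m0)
    return w / np.linalg.norm(w)

# at each tree branch, a frame's 37-dim feature vector f is classified
# by thresholding the scalar projection f @ w
```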
Kuan-Chieh Yen, University of Illinois, Urbana-Champaign (U.S.A.)
Yunxin Zhao, University of Illinois, Urbana-Champaign (U.S.A.)
Three modifications to the adaptive decorrelation filtering (ADF) algorithm are proposed to improve the performance of a co-channel speech separation system. First, a simplified ADF (SADF) is suggested to reduce the computational complexity of ADF from O(N×N) to O(N) per sample, where N is the filter length used in the channel estimation. Second, a transform-domain ADF (TDADF) is developed to accelerate the convergence of the filter estimates while maintaining computational complexity at O(N). Third, a generalized ADF (GADF) is derived to handle the noncausal filter estimation problem often encountered in co-channel speech separation. Experimental results showed that when the average signal-to-interference ratios (SIRs) in the co-channel signals were 6.15 and 5.38 dB, both the SADF and TDADF improved the SIRs to around 18 to 19 dB, and the GADF further improved them to around 19 to 24 dB.
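A simplified gradient sketch of the basic decorrelation idea (not the SADF/TDADF/GADF variants): two cross-coupled FIR filters are adapted to drive the cross-correlation between the separated outputs toward zero. Filter length and step size are illustrative.

```python
import numpy as np

def adf_separate(x1, x2, n_taps=32, mu=1e-4):
    """Cross-coupled decorrelation filters for a 2x2 convolutive mixture:
        y1[n] = x1[n] - h12 * past(y2),   y2[n] = x2[n] - h21 * past(y1).
    Each filter is nudged to cancel correlation between its output and
    the other channel's recent past."""
    h12 = np.zeros(n_taps)
    h21 = np.zeros(n_taps)
    y1 = np.zeros(len(x1))
    y2 = np.zeros(len(x2))
    for n in range(n_taps, len(x1)):
        v2 = y2[n - n_taps:n][::-1]      # recent past of the other output
        v1 = y1[n - n_taps:n][::-1]
        y1[n] = x1[n] - h12 @ v2
        y2[n] = x2[n] - h21 @ v1
        h12 += mu * y1[n] * v2           # LMS-style decorrelation updates
        h21 += mu * y2[n] * v1
    return y1, y2
```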
James P LeBlanc, New Mexico State University (U.S.A.)
Phillip L De Leon, New Mexico State University (U.S.A.)
We present a computationally efficient method for separating mixed speech signals. The method uses a recursive adaptive gradient-descent technique whose cost function is designed to maximize the kurtosis of the output (separated) signals. The choice of kurtosis maximization as an objective function (which acts as a measure of separation) is supported by investigation and analysis of a class of random processes regarded as excellent models for speech signal statistics: spherically invariant random processes (SIRPs). Development and analysis of the adaptive algorithm are presented, along with simulation examples using actual voice signals.
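A batch sketch of kurtosis-driven separation, assuming pre-whitened mixtures; the paper develops a recursive adaptive version, and the step size and iteration count below are illustrative.

```python
import numpy as np

def whiten(X):
    """Zero-mean, unit-covariance preprocessing (assumed by the separator)."""
    X = X - X.mean(axis=1, keepdims=True)
    d, E = np.linalg.eigh(np.cov(X))
    return E @ np.diag(1.0 / np.sqrt(d + 1e-12)) @ E.T @ X

def kurtosis_separate(X, mu=0.01, n_iter=500):
    """Adapt a demixing matrix W to maximize output kurtosis, Y = W @ X.
    Speech is super-Gaussian (positive kurtosis), so ascent separates it."""
    rng = np.random.default_rng(1)
    m = X.shape[0]
    W = np.eye(m) + 0.01 * rng.standard_normal((m, m))
    for _ in range(n_iter):
        Y = W @ X
        # ascent direction for E[y^4]: 4 E[y^3 x^T] (constant dropped)
        W += mu * (Y ** 3) @ X.T / X.shape[1]
        # keep rows unit-norm so output variances stay ~1 on whitened data
        W /= np.linalg.norm(W, axis=1, keepdims=True)
    return W @ X
```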