Joerg Meyer, University of Bremen (Germany)
Klaus Uwe Simmer, University of Bremen (Germany)
This paper presents a multichannel algorithm for speech enhancement in hands-free telephone systems in cars. The new algorithm takes advantage of the special noise characteristics in fast-moving cars. The incoherence of the noise allows the use of adaptive Wiener filtering at frequencies above a theoretically determined frequency. Below this frequency, a smoothed spectral subtraction (SSS) is used to obtain improved noise suppression. The algorithm yields better noise reduction, with significantly less distortion and artificial noise, than spectral subtraction or Wiener filtering alone.
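A minimal single-channel sketch of the hybrid rule described above (the paper's algorithm is multichannel and derives the crossover from the noise-field coherence; the function name, the crossover_hz parameter, and the oversubtraction factor are illustrative assumptions):

```python
import numpy as np

def enhance_frame(noisy_fft, noise_psd, crossover_hz, fs, alpha=2.0):
    """Hybrid per-frame gain (illustrative sketch): spectral subtraction
    below a crossover frequency, Wiener filtering above it. The paper
    derives the crossover from the noise field; here it is a parameter."""
    n_bins = len(noisy_fft)
    freqs = np.fft.rfftfreq(2 * (n_bins - 1), d=1.0 / fs)
    noisy_psd = np.abs(noisy_fft) ** 2
    # Wiener gain: estimated clean PSD over noisy PSD, floored at zero
    clean_psd = np.maximum(noisy_psd - noise_psd, 0.0)
    wiener_gain = clean_psd / np.maximum(noisy_psd, 1e-12)
    # Magnitude spectral subtraction with oversubtraction factor alpha
    sub_mag = np.sqrt(np.maximum(noisy_psd - alpha * noise_psd, 0.0))
    sub_gain = sub_mag / np.maximum(np.abs(noisy_fft), 1e-12)
    gain = np.where(freqs < crossover_hz, sub_gain, wiener_gain)
    return gain * noisy_fft
```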
Nestor Becerra Yoma, CCIR/University of Edinburgh (U.K.)
Fergus McInnes, CCIR/University of Edinburgh (U.K.)
Mervyn Jack, CCIR/University of Edinburgh (U.K.)
This paper addresses the problem of speech recognition with signals corrupted by additive noise at moderate SNR. A technique is studied that combines spectral subtraction with a weighting of the acoustic pattern matching by the reliability of the noise cancellation. A model for additive noise is proposed and used to compute the variance of the hidden clean-signal information and the reliability of the spectral subtraction process. The results presented in this paper show that a proper weighting of the information provided by static parameters can substantially reduce the error rate.
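A minimal sketch of this kind of reliability weighting, assuming a weighted Euclidean local distance in the pattern matcher (the function and the variance-to-weight mapping are illustrative, not the paper's exact formulation):

```python
import numpy as np

def weighted_distance(obs, template, residual_var):
    """Reliability-weighted local distance for acoustic pattern matching
    (hypothetical form): spectral components with high residual variance
    after spectral subtraction are down-weighted in the match score."""
    weights = 1.0 / (1.0 + residual_var)  # reliable components -> weight near 1
    return np.sum(weights * (obs - template) ** 2)
```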
Michael E. Deisher, Intel (U.S.A.)
Andreas S. Spanias, ASU (U.S.A.)
This paper describes a technique for reduction of non-stationary noise in electronic voice communication systems. Removal of noise is needed in many such systems, particularly those deployed in harsh mobile or otherwise dynamic acoustic environments. The proposed method employs state-based statistical models of both speech and noise, and is thus capable of tracking variations in noise during sustained speech. This work extends the hidden Markov model (HMM) based minimum mean square error (MMSE) estimator to incorporate a ternary voicing state, and applies it to a harmonic representation of voiced speech. Noise reduction during voiced sounds is thereby improved. Performance is evaluated using speech and noise from standard databases. The extended algorithm is demonstrated to improve speech quality as measured by informal preference tests and objective measures, to preserve speech intelligibility as measured by informal Diagnostic Rhyme Tests, and to improve the performance of a low bit-rate speech coder and a speech recognition system when used as a pre-processor.
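The core of an HMM-based MMSE estimator can be sketched as a posterior-weighted sum of state-conditional Wiener filters; the following is a schematic of that step only (the paper's ternary voicing state and harmonic model for voiced frames are not shown, and all names are illustrative):

```python
import numpy as np

def hmm_mmse_gain(noisy_psd, speech_state_psds, noise_state_psds, posteriors):
    """MMSE gain as a posterior-weighted sum of state-conditional Wiener
    filters (schematic). posteriors[i, j] is the probability of speech
    state i and noise state j given the current noisy frame."""
    gain = np.zeros_like(noisy_psd)
    for i, s_psd in enumerate(speech_state_psds):
        for j, n_psd in enumerate(noise_state_psds):
            wiener = s_psd / (s_psd + n_psd)
            gain += posteriors[i, j] * wiener
    return gain
```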
Bruce L. McKinley, Signal Processing Consultants (U.S.A.)
Gary H. Whipple, U.S. Department of Defense (U.S.A.)
This paper presents two new algorithms for robust speech pause detection (SPD) in noise. Our approach was to formulate SPD as a statistical decision theory problem for the optimal detection of noise-only segments, using the framework of model-based speech enhancement (MBSE). The advantages of this approach are that it performs well in high-noise conditions, all necessary information is available within MBSE, and no other features need to be computed. The first algorithm is based on a maximum a posteriori probability (MAP) test and the second on a Neyman-Pearson test. Both tests make use of the spectral distance between the input vector and the composite spectral prototypes of the speech and noise models, as well as the probabilistic framework of the hidden Markov model. The algorithms are evaluated and shown to perform well against different types of noise at various SNRs.
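A schematic of the two decision rules, assuming the frame log-likelihoods under the speech and noise models are already available from MBSE (the function and its arguments are illustrative):

```python
import numpy as np

def detect_pause(log_lik_noise, log_lik_speech, log_prior_ratio=0.0,
                 np_threshold=None):
    """Frame-level pause decision (schematic). With np_threshold=None a
    MAP test is used; otherwise the log-likelihood ratio is compared
    against a Neyman-Pearson threshold chosen for a target false-alarm
    rate. The likelihoods would come from the MBSE speech/noise models."""
    llr = log_lik_noise - log_lik_speech
    if np_threshold is None:
        return llr + log_prior_ratio > 0.0   # MAP decision
    return llr > np_threshold                # Neyman-Pearson decision
```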
Andrzej Drygajlo, LTS-DE, EPFL (Switzerland)
Benito Carnero, LTS-DE, EPFL (Switzerland)
This paper addresses the problem of merging speech enhancement and coding within an auditory-modeling framework. The noisy signal is first processed by a fast wavelet packet transform algorithm to obtain an auditory spectrum, from which a rough masking model is estimated. This model is then used to refine a subtractive-type enhancement algorithm. The enhanced speech coefficients are then encoded in the same time-frequency transform domain, using masking-threshold constraints on the quantization noise. The advantage of the proposed method is that both enhancement and coding are performed on the transform coefficients, without additional FFT processing.
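A minimal sketch of a masking-refined subtractive rule of the kind described above (the auditory spectrum, masking model, and wavelet packet analysis are assumed to be computed elsewhere; this form is illustrative, not the paper's exact rule):

```python
import numpy as np

def masked_subtraction(coeffs, noise_est, masking_threshold):
    """Subtractive rule relaxed by a masking model (schematic): noise
    components already below the estimated masking threshold are left
    untouched, since they are inaudible; only supra-threshold noise is
    subtracted. coeffs are wavelet-packet (auditory-band) coefficients."""
    audible_noise = np.maximum(noise_est - masking_threshold, 0.0)
    return np.sign(coeffs) * np.maximum(np.abs(coeffs) - audible_noise, 0.0)
```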
Cheung-Fat Chan, City University of Hong Kong (Hong Kong)
Wai-Kwong Hui, City University of Hong Kong (Hong Kong)
Results are reported for improving the quality of narrowband CELP-coded speech by enhancing the pitch periodicity and by regenerating the highband components of the speech spectrum. Multiband excitation (MBE) analysis is applied to enhance the pitch periodicity by re-synthesizing the speech signal with a harmonic synthesizer. The highband magnitude spectra are regenerated by matching the lowband spectra against a trained wideband spectral codebook. Information about the voiced/unvoiced (V/UV) excitation in the highband is derived from a training procedure and recovered using the matched lowband index. Simulation results indicate that the quality of the wideband enhanced speech is significantly improved over the narrowband CELP-coded speech.
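A minimal sketch of the codebook-mapping step described above, assuming jointly trained lowband/highband codebooks and a stored V/UV flag per entry (all names are illustrative):

```python
import numpy as np

def regenerate_highband(lowband_env, lowband_codebook, highband_codebook,
                        highband_vuv):
    """Codebook-mapping sketch: the lowband envelope selects the nearest
    trained lowband entry, and the paired highband entry (plus its stored
    V/UV flag) reconstructs the missing highband. The codebooks would be
    trained jointly on wideband speech."""
    dists = np.sum((lowband_codebook - lowband_env) ** 2, axis=1)
    idx = int(np.argmin(dists))
    return highband_codebook[idx], highband_vuv[idx]
```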
Futoshi Asano, ETL (Japan)
Satoru Hayamizu, ETL (Japan)
A method is proposed for recovering the LPC spectrum from a microphone array input signal corrupted by ambient noise. The method is based on the CSS (coherent subspace) method, originally designed for DOA (direction of arrival) estimation with broadband array inputs. The noise energy is reduced in the subspace domain by the maximum likelihood method. To enhance the noise reduction, the noise-dominant subspace is further eliminated by projection, which is effective when the SNR is low and classification of noise and signal in the subspace domain is difficult. Simulation results show that small formants that cannot be estimated by a conventional delay-and-sum beamformer are well estimated by the proposed method.
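A simplified sketch of the subspace step, assuming the coherently focused covariance matrix has already been formed (the CSS focusing itself is omitted, and the function name is illustrative):

```python
import numpy as np

def signal_subspace_projector(cov, n_signal):
    """Projection onto the signal-dominant subspace (simplified): keep the
    n_signal largest eigenvectors of the (focused) spatial covariance and
    discard the noise-dominant complement."""
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    sig_vecs = eigvecs[:, -n_signal:]        # signal-dominant directions
    return sig_vecs @ sig_vecs.conj().T      # projection matrix
```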
Daniel S. Benincasa, Rome Laboratory (U.S.A.)
Michael I. Savic, Rensselaer Polytechnic Institute (U.S.A.)
This paper describes a technique for separating the speech of two speakers recorded over a single channel. The main focus of this research is the separation of overlapping voiced speech signals using constrained nonlinear optimization. Based on the assumption that voiced speech can be modeled as a slowly varying vocal tract filter driven by a quasi-periodic train of impulses, the speech waveform is represented as a sum of sine waves with time-varying amplitude, frequency, and phase. The unknown parameters of the speech model are the amplitudes, frequencies, and phases of the harmonics of both speech signals. Using constrained nonlinear optimization, we determine, on a frame-by-frame basis, the parameters that provide the least mean-square error (LMSE) between the original co-channel speech signal and the sum of the reconstructed speech signals.
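A simplified per-frame sketch: if the two pitch tracks are held fixed, the amplitudes and phases enter the sum-of-sines model linearly, so the LMSE fit reduces to linear least squares (the paper optimizes the frequencies as well, under constraints; all names here are illustrative):

```python
import numpy as np

def fit_two_talker_harmonics(x, fs, f0_a, f0_b, n_harm):
    """Least-squares fit of a two-talker sum-of-sines model for one frame
    (simplified): with the pitch tracks f0_a, f0_b fixed, amplitudes and
    phases enter linearly through cos/sin basis functions."""
    t = np.arange(len(x)) / fs
    cols = []
    for f0 in (f0_a, f0_b):
        for k in range(1, n_harm + 1):
            cols.append(np.cos(2 * np.pi * k * f0 * t))
            cols.append(np.sin(2 * np.pi * k * f0 * t))
    basis = np.column_stack(cols)
    coef, *_ = np.linalg.lstsq(basis, x, rcond=None)
    half = 2 * n_harm
    talker_a = basis[:, :half] @ coef[:half]   # reconstruction of talker A
    talker_b = basis[:, half:] @ coef[half:]   # reconstruction of talker B
    return talker_a, talker_b
```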
Te-Won Lee, Max-Planck-Society (U.S.A.)
Reinhold Orglmeister, Berlin University of Technology (Germany)
We present a new method for separating mixtures of real sources that have been convolved and time-delayed under real-world conditions. To this end, we learn two sets of parameters: one to unmix the mixtures and one to estimate the true density function. Solutions are discussed for feedback and feedforward architectures. Since the quality of separation depends on the modeling of the underlying density, we propose different methods that use context to approximate the density function more closely. The proposed density estimation achieves separation of a wider class of sources. Furthermore, we employ FIR polynomial matrix techniques in the frequency domain to invert a true-phase mixing system. The significance of the new method is demonstrated by the successful separation of two speakers, and of music and speech, recorded with two microphones in a reverberant room.
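For orientation, a single natural-gradient infomax update for the instantaneous-mixture case is sketched below; the paper's contribution lies in extending such rules to convolved, time-delayed mixtures with adaptive density models, none of which this sketch covers (the learning rate and nonlinearity are illustrative assumptions):

```python
import numpy as np

def infomax_step(W, x_batch, lr=0.01):
    """One natural-gradient infomax update for instantaneous mixtures
    (illustrative): u = W x, with a fixed tanh nonlinearity standing in
    for the assumed (supergaussian) source density."""
    u = W @ x_batch                     # shape (n_sources, n_samples)
    y = np.tanh(u)                      # score function of assumed density
    n = x_batch.shape[1]
    grad = (np.eye(W.shape[0]) - (y @ u.T) / n) @ W
    return W + lr * grad
```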
Georg F. Meyer, Keele University (U.K.)
Fabrice Plante, Liverpool University (U.K.)
Frederic Berthommier, ICP Grenoble (France)
Modulation maps provide an effective method for the segregation of voiced speech sounds from competing background activity. The maps are constructed by computing modulation spectra in a bank of auditory filters. Target spectra are recovered by sampling the modulation spectra at the first five multiples of the fundamental frequency of the target sound. If the modulation spectra are computed using a conventional DFT, windows of 200ms duration are necessary. Using the reassigned spectrum, a new time-frequency representation, the window size can be reduced to 50ms with minimal loss of performance. The algorithm is tested on a 'double vowel' identification task that has been used extensively in psychophysical experiments.
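A minimal sketch of the sampling step for one auditory channel, using a plain DFT (the paper's point is to replace this DFT with the reassigned spectrum so the window can shrink from 200ms to 50ms; all names are illustrative):

```python
import numpy as np

def sample_modulation_spectrum(envelope, fs, f0, n_harm=5):
    """Sample the modulation spectrum of one auditory-channel envelope at
    the first n_harm multiples of the target fundamental frequency
    (plain-DFT version of the sampling step described above)."""
    win = envelope * np.hanning(len(envelope))
    spec = np.abs(np.fft.rfft(win))
    freqs = np.fft.rfftfreq(len(win), d=1.0 / fs)
    bins = [int(np.argmin(np.abs(freqs - k * f0)))
            for k in range(1, n_harm + 1)]
    return spec[bins]
```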
Hector Raul Javkin, PTI-STL (U.S.A.)
Michael Galler, PTI-STL (U.S.A.)
Nancy Niedzielski, PTI-STL (U.S.A.)
Esophageal speakers, who produce a voice source by inducing vibration of the superior esophageal sphincter, must insufflate the esophagus with an air-injection gesture before every utterance, creating an air reservoir to drive the vibration. The resulting noise is generally unwanted by the speakers. This paper describes a method for the automatic recognition and rejection of the injection noise that occurs in esophageal speech.
Neeraj Magotra, University of New Mexico (U.S.A.)
Sudheer Sirivara, University of New Mexico (U.S.A.)
This paper deals with digital processing of speech as it pertains to the hearing impaired, specifically the development of a true real-time digital hearing aid. The system, based on the Texas Instruments TMS320C3X, implements frequency shaping, noise reduction, interaural time delay, amplitude compression, and various timing options, and also provides a testbed for future development. The device, referred to as the DIgital Programmable Hearing Aid (DIPHA), uses a wide bandwidth (up to 16 kHz) and is fully programmable, permitting various speech processing algorithms to be programmed and tested on hearing-impaired subjects in the real world as well as in the laboratory.
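As a small illustration of one of the listed functions, a per-frame amplitude compressor might look as follows (DIPHA's actual parameters, band structure, and implementation are not given in the abstract; everything here is an assumption):

```python
import numpy as np

def compress(frame, threshold_db=-40.0, ratio=3.0):
    """Simple per-frame amplitude compression of the kind a programmable
    hearing aid applies (illustrative only): levels above the threshold
    are scaled down by the compression ratio."""
    rms = np.sqrt(np.mean(frame ** 2)) + 1e-12
    level_db = 20 * np.log10(rms)
    if level_db <= threshold_db:
        return frame
    gain_db = (threshold_db - level_db) * (1.0 - 1.0 / ratio)
    return frame * 10 ** (gain_db / 20)
```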
Sharon Gannot, Tel-Aviv University (Israel)
David Burshtein, Tel-Aviv University (Israel)
Ehud Weinstein, Tel-Aviv University (Israel)
Speech quality and intelligibility may deteriorate significantly in the presence of background noise, especially when the speech signal is subject to subsequent processing. In this paper we present a class of Kalman-filter-based speech enhancement algorithms with some extensions, modifications, and improvements. The first algorithm employs the estimate-maximize (EM) method to iteratively estimate the spectral parameters of the speech and noise; the enhanced speech signal is obtained as a byproduct of the parameter estimation algorithm. The second algorithm is a sequential, computationally efficient, gradient-descent algorithm. We discuss various topics concerning the practical implementation of these algorithms. An experimental study using real speech and noise signals is provided to compare these algorithms with alternative speech enhancement algorithms, and to compare the performance of the iterative and sequential algorithms.
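The underlying per-sample Kalman recursion for speech modeled as an AR(p) process in additive white noise can be sketched as follows (textbook form; the paper's EM and gradient-descent parameter estimation, its actual contribution, is not shown, and the interface is illustrative):

```python
import numpy as np

def kalman_enhance(y, a, q, r):
    """Kalman recursion for AR(p) speech in additive white noise.
    y: noisy samples, a: AR coefficients (length p, newest lag first),
    q: excitation variance, r: observation-noise variance."""
    y = np.asarray(y, dtype=float)
    p = len(a)
    F = np.zeros((p, p))                # state transition: AR shift matrix
    F[0, :] = a
    F[1:, :-1] = np.eye(p - 1)
    H = np.zeros(p); H[0] = 1.0         # observe the current clean sample
    Q = np.zeros((p, p)); Q[0, 0] = q
    x, P = np.zeros(p), np.eye(p)
    out = np.empty(len(y))
    for n, obs in enumerate(y):
        x = F @ x                       # time update
        P = F @ P @ F.T + Q
        s = H @ P @ H + r               # innovation variance
        k = P @ H / s                   # Kalman gain
        x = x + k * (obs - H @ x)       # measurement update
        P = P - np.outer(k, H @ P)
        out[n] = x[0]                   # current clean-speech estimate
    return out
```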
Patrik Sörqvist, Ericsson (Sweden)
Peter Händel, Ericsson (Sweden)
Björn Ottersten, KTH (Sweden)
This paper presents a model-based approach to the suppression of noise in speech contaminated by additive noise. A Kalman-filter-based speech enhancement system is presented and its performance is investigated in detail. It is shown that, with a novel speech parameter estimation algorithm, it is possible to achieve 10 dB of noise suppression with high overall audible quality.