ABSTRACT
In this paper, a noise compensation algorithm based on a first-order approximation of the cepstral function is presented. The derivative term is replaced by a difference of cepstra so that the method can adapt to wide variations in noise power. The difference between the cepstral mean vectors of the clean and noisy versions, termed the deviation vector, is used to adapt both the cepstrum and the delta cepstrum. The experimental results show that using the deviation vector to adapt the cepstral coefficients gives a significant improvement over a method based on the weighted projection measure. Further improvement is obtained by jointly adapting the cepstrum and delta cepstrum.
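A minimal sketch of the deviation-vector idea as described above (function and variable names are ours, not the paper's): the deviation vector is the difference between the cepstral means of parallel clean and noisy data, and is added to the noisy test cepstra (and analogously to the delta cepstra) to shift them toward the clean acoustic space.

```python
import numpy as np

def deviation_vector(clean_cepstra, noisy_cepstra):
    """Difference of cepstral mean vectors between the clean and noisy
    versions; both inputs are (frames, dims) matrices of parallel data."""
    return clean_cepstra.mean(axis=0) - noisy_cepstra.mean(axis=0)

def compensate(test_cepstra, dev):
    # Shift every noisy test frame by the deviation vector; the same
    # operation can be applied jointly to the delta cepstra.
    return test_cepstra + dev
```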
ABSTRACT
Temporal processing of the time trajectories in the logarithmic spectrum domain, as performed in cepstral mean subtraction, in the computation of dynamic features, or in RASTA processing, is becoming a common procedure in current ASR. Such temporal processing effectively enhances some components of the modulation spectrum of speech while suppressing others. It is therefore important to know the relative importance of the various components of the modulation spectrum of speech. In this study we report on the effect of band-pass filtering of the time trajectories of spectral envelopes on speech recognition. The results indicate the relative importance of different components of the modulation spectrum of speech for ASR.
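As an illustration of the kind of temporal processing studied here, the sketch below band-pass filters the time trajectory of each log-spectral coefficient; the band edges (in Hz of the modulation spectrum) and the frame rate are hypothetical parameters, not the paper's settings.

```python
import numpy as np
from scipy.signal import butter, filtfilt

def filter_trajectories(logspec, lo_hz, hi_hz, frame_rate=100.0):
    """Band-pass filter the time trajectories of a (frames, bands)
    matrix of log spectral energies along the time axis."""
    nyq = frame_rate / 2.0
    b, a = butter(2, [lo_hz / nyq, hi_hz / nyq], btype="band")
    return filtfilt(b, a, logspec, axis=0)  # zero-phase, forward-backward
```

Passing, say, lo_hz=1.0 and hi_hz=16.0 keeps the modulation components around the syllabic rate while suppressing slow channel drifts and fast frame-to-frame fluctuations.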
ABSTRACT
This paper addresses the problem of speech signal pre-classification for robust noisy speech recognition. A novel RNN-based pre-classification scheme for noisy Mandarin speech recognition is proposed. The RNN, trained to be insensitive to noise-level variation, classifies each input frame into three broad classes: initial, final, and pure noise. On-line noise tracking and estimation for noise-model compensation is then performed. In addition, a broad-class likelihood compensation based on the RNN outputs is performed to aid recognition. Experimental results show that a significant improvement in syllable recognition rate is achieved under non-stationary noise environments.
ABSTRACT
A speech recognition system for modeling an acoustic mismatch across different environments is presented. The basic philosophy is to apply discriminative learning techniques to separate the recognition process, which is represented by a hidden Markov model (HMM), from the environmental process, which is represented by a limited number of translation vectors. Each segment of speech is assigned to an environment, and recognition is performed upon projecting the parameters of the HMM to best characterize the acoustic space of that environment. The proposed system provides an interesting framework for better modeling and adaptation of speech signals with varying acoustic conditions. Experimental findings on connected-digit recognition for three different environments are reported.
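A toy sketch of the environment selection step, with a single diagonal Gaussian standing in for the HMM (the scoring model and all names are illustrative assumptions, not the paper's system): each environment is one translation vector applied to the model means, and the utterance is assigned to the environment whose shifted model scores best.

```python
import numpy as np

def gauss_loglik(frames, mean, var):
    # Diagonal-Gaussian log-likelihood summed over all frames.
    d = frames - mean
    return float(np.sum(-0.5 * (np.log(2.0 * np.pi * var) + d * d / var)))

def best_environment(frames, mean, var, translations):
    # Shift the model mean by each environment's translation vector and
    # pick the environment under which the utterance is most likely.
    scores = [gauss_loglik(frames, mean + t, var) for t in translations]
    return int(np.argmax(scores))
```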
ABSTRACT
We present a new adaptive method for online noise estimation which extends the model combination approach to slowly varying noise conditions. Model combination is reported to improve speech recognition accuracy without extensive training on noisy speech data; only training of the noise characteristics is needed. However, if the noise characteristics vary over time, computing the noise parameters once before recognition is not adequate. The new method of online estimation therefore allows adaptation to the current noise situation. Furthermore, cepstral mean subtraction is added to the model combination scheme, which removes convolutional noise as well. Finally, it is shown how linear discriminant analysis eases the handling of dynamic effects in model combination.
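One simple way to realize such online noise estimation (the abstract does not specify the estimator; exponential averaging over non-speech frames is an assumption of this sketch) is to update the noise model recursively so it tracks slowly varying conditions before it is recombined with the clean models:

```python
def online_noise_update(noise_mean, frame, is_speech, alpha=0.99):
    # Exponentially weighted update on frames classified as non-speech;
    # the updated noise mean then feeds the model combination step.
    if not is_speech:
        noise_mean = alpha * noise_mean + (1.0 - alpha) * frame
    return noise_mean
```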
ABSTRACT
This paper addresses the important problem of speech detection. It describes the implementation of three speech detection methods and compares their performance under different signal-to-noise ratio (SNR) and stationarity conditions. The method that dynamically adjusts its thresholds is found to be the most reliable, even under very adverse recording conditions, yet it is of low complexity and has a very moderate processing delay.
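A minimal sketch of a detector with dynamically adjusted thresholds (parameters and structure are illustrative, not the paper's): the noise floor is initialized on the first frames and tracked on non-speech frames, so the decision threshold follows slow changes in the noise level.

```python
import numpy as np

def detect_speech(energies_db, init_frames=10, margin_db=6.0, beta=0.98):
    """energies_db: per-frame log energies. Returns a speech/non-speech
    flag per frame using an adaptive noise-floor threshold."""
    floor = float(np.mean(energies_db[:init_frames]))
    flags = []
    for e in energies_db:
        is_speech = e > floor + margin_db
        if not is_speech:                      # track the noise floor
            floor = beta * floor + (1.0 - beta) * e
        flags.append(is_speech)
    return flags
```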
ABSTRACT
A novel Voice Activity Detector is presented that is based on source separation techniques applied to single-sensor signals. It offers very accurate estimation of the endpoints in very low signal-to-noise ratio conditions while maintaining low complexity. Since the procedure is fully iterative, it is suitable for real-time applications and is capable of operating in dynamically adapting situations. Results are presented for both white Gaussian and car engine background noise. The performance of the new technique is compared with that of the GSM Voice Activity Detector.
ABSTRACT
A blind signal separation method based on minimizing mutual information is applied to the multi-speaker problem in speech recognition. Recognition experiments performed under different acoustic environments, a soundproof room and a reverberant room, show that 1) the method improves recognition accuracy by about 20% at an SNR of 0 dB, 2) the method is more effective when many speakers' speech is present than in the simple overlapped situation, and 3) the method does not work well under reverberant conditions.
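The abstract does not give the separation rule; a standard mutual-information-minimizing choice, stated here only as a reference point, is the natural-gradient update
\[
\mathbf{y}_t = W\mathbf{x}_t, \qquad
W \leftarrow W + \mu\left(I - \left\langle \varphi(\mathbf{y}_t)\,\mathbf{y}_t^{\mathsf T} \right\rangle\right) W,
\]
where \(\mathbf{x}_t\) stacks the microphone signals, \(\varphi\) is a fixed nonlinearity such as \(\tanh\), and \(\mu\) is a step size.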
ABSTRACT
To overcome the problems caused by the long impulse responses produced by reverberation, we use a long-time-window (high-frequency-resolution) analysis during the channel normalization steps of the feature extraction process in automatic speech recognition (ASR). After normalization, frequency resolution is traded for time resolution to increase the rate at which the time information is sampled (short-time domain), yielding an appropriate domain from which to derive ASR features. Experiments on data with reverberation times of about 0.5 s show that the new technique achieves a significant performance improvement for a speech recognizer under reverberation, with only some performance degradation on clean speech.
ABSTRACT
Determining the precise moment at which speech begins or ends is an important problem in ASR. As shown in [1], small deviations from the optimum beginning and ending points imply a large decrease in recognition accuracy. The presence of noise [2][3] is an added problem, especially when its level is high (around 95 dB, as in this work) and its characteristics are highly non-stationary, since it can produce false alarms (more probable when the noise includes speech sounds). For this reason, in such conditions it is important to have a pre-processing stage that removes as much noise as possible and provides clues that help to build an end-point detector for those environments. The method presented here offers a pre-processing technique for highly noisy, non-stationary environments that enhances the speech while producing an equalized measure of the SNR improvement, the Mean Spectral Energy Difference. Its main characteristic is that large differences in the noise level are reduced to a small ripple, while the presence of speech is marked by a large decrease in this Mean Spectral Energy Difference. Using this technique, any end-point detection approach (explicit, implicit, or hybrid [3]) may render acceptable results.
ABSTRACT
This paper is concerned with the problem of robust speaker recognition. An acoustical mismatch between the training and testing conditions of hidden Markov model (HMM)-based speaker recognition systems often causes a severe degradation in recognition performance. In telephone speaker recognition, for example, undesirable signal components due to ambient noise and channel distortion, as well as variations among telephone handsets, render the recognizer unusable for real-world applications. The purpose of this paper is to present several compensation techniques that decrease or remove the mismatch between training and testing environment conditions. Some of the techniques described here have already been applied successfully in robust speech recognition, and our preliminary results show that they are also very encouraging for speaker recognition.
ABSTRACT
This paper introduces a word boundary detection algorithm that works in a variety of noise conditions, including what is commonly called the 'cocktail party' situation. The algorithm uses the direction of the signal as the main criterion for differentiating between desired speech and background noise. To determine the signal direction, the algorithm calculates estimates of the time delay between the signals received at two microphones. These time delay estimates, together with estimates of the coherence function and signal energy, are used to locate word boundaries. The algorithm was tested using speech embedded in different types and levels of noise, including car noise, factory noise, babble noise, and competing talkers. The test results showed that the algorithm performs very well under adverse conditions, with SNRs down to -14.5 dB.
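A minimal cross-correlation sketch of the time-delay estimation step (a generic estimator; the paper's exact method may differ): the location of the correlation peak between the two equal-length microphone signals gives the inter-microphone delay and hence the arrival direction.

```python
import numpy as np

def estimate_delay(x1, x2, fs, max_delay_s=1e-3):
    """Return the delay (seconds) of x2 relative to x1 via the peak of
    their FFT-based cross-correlation, restricted to physically
    plausible lags for the given microphone spacing."""
    n = 2 * len(x1) - 1
    cc = np.fft.irfft(np.fft.rfft(x1, n) * np.conj(np.fft.rfft(x2, n)), n)
    m = len(x1) - 1
    cc = np.concatenate((cc[-m:], cc[:m + 1]))   # reorder to lags -m..+m
    lags = np.arange(-m, m + 1)
    keep = np.abs(lags) <= max_delay_s * fs
    return lags[keep][np.argmax(cc[keep])] / fs
```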
ABSTRACT
In this paper, we consider hidden Markov model (HMM) parameter compensation in noisy environments with multiple noise sources, based on the vector Taylor series (VTS) approach. General formulations for multiple environmental variables are derived, and systematic expectation-maximization (EM) solutions are presented in the maximum likelihood (ML) sense. Each noise source is assumed to be independent and Gaussian distributed. To evaluate the proposed method, we conduct speaker-independent isolated word recognition experiments in various noisy environments. Experimental results show that the proposed algorithm achieves a significant improvement. In particular, the proposed method is consistently more effective than parallel model combination (PMC) based on the log-normal approximation.
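For reference, the familiar single-noise-source VTS relation in the log-spectral domain (the paper's multi-source generalization is not reproduced here) models the noisy observation as
\[
\mathbf{y} = \mathbf{x} + g(\mathbf{x}, \mathbf{n}), \qquad
g(\mathbf{x}, \mathbf{n}) = \log\!\left( 1 + e^{\,\mathbf{n} - \mathbf{x}} \right),
\]
and the first-order Taylor expansion around the means \(\boldsymbol{\mu}_x, \boldsymbol{\mu}_n\),
\[
\mathbf{y} \approx \boldsymbol{\mu}_x + g(\boldsymbol{\mu}_x, \boldsymbol{\mu}_n)
+ \nabla_{\!x} g \,(\mathbf{x} - \boldsymbol{\mu}_x)
+ \nabla_{\!n} g \,(\mathbf{n} - \boldsymbol{\mu}_n),
\]
makes compensated means and covariances of the HMM Gaussians available in closed form.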
ABSTRACT
This paper examines techniques for the normalization of unseen speakers in recognition. Two implementations of linear spectrum warping were examined: time-domain resampling and filter bank scaling. It is shown that for seen speakers, models trained on unwarped utterances are less sensitive to spectrum warping by filter bank scaling than by resampling. A pitch-based scheme for warping factor estimation is proposed. The method is shown to be cost-effective in reducing the variability of unseen speakers compared with ML-based methods. In particular, the combination of filter bank scaling with pitch-based warping factor estimation reduces the error rate of isolated Mandarin digit recognition by more than 30% for unseen speakers.
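A sketch of the two ingredients, with an illustrative (not the paper's) pitch-to-warp mapping: the warping factor is tied to the ratio of the speaker's median F0 to a reference F0, and filter bank scaling moves the analysis filters along the frequency axis instead of resampling the waveform.

```python
import numpy as np

def warp_factor_from_pitch(f0_speaker_hz, f0_reference_hz):
    # Hypothetical mapping: higher pitch tends to mean a shorter vocal
    # tract, so warp by the median-pitch ratio, clipped to a sane range.
    return float(np.clip(f0_speaker_hz / f0_reference_hz, 0.8, 1.2))

def scale_filter_bank(center_freqs_hz, alpha):
    # Filter bank scaling: stretch or compress the frequency axis of the
    # analysis filters by the warping factor alpha.
    return np.asarray(center_freqs_hz) * alpha
```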
ABSTRACT
In automatic speech recognition (ASR) of broadcast news shows, the input utterances are often corrupted by background music and noise. This paper proposes a new method for the automatic segmentation of speech signals according to the background: music, clean, or noisy. LPC analysis is used to extract the poles of the associated transfer function. Based on the time evolution of the poles, it is possible to discriminate the contributions of music, speech, and noise: music poles are stable for longer than speech poles, while noise poles have a more unstable behavior than speech poles. Once the background of a signal is identified, poles tagged as non-speech can be separated from speech poles. Using only the speech poles along with the LPC residuals, it is possible to reconstruct a new signal freed of music and noise contributions.
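The pole extraction step can be sketched as standard per-frame LPC analysis (the order and names are illustrative); the classification then tracks how stable each pole's frequency and radius stay across successive frames.

```python
import numpy as np
from scipy.linalg import solve_toeplitz

def lpc_poles(frame, order=12):
    """Solve the autocorrelation (Yule-Walker) equations for the LPC
    coefficients of one windowed frame and return the poles, i.e. the
    roots of A(z) = 1 - a_1 z^-1 - ... - a_p z^-p."""
    r = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    a = solve_toeplitz(r[:order], r[1:order + 1])
    return np.roots(np.concatenate(([1.0], -a)))
```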
ABSTRACT
The zero-crossings with peak amplitudes (ZCPA) model, motivated by the human auditory periphery, is simple compared with other auditory models, yet it is a powerful speech analysis tool for robust speech recognition in noisy environments. In this paper, the recognition rate of the ZCPA model is improved by incorporating time-derivative features with several different derivative window lengths. Experimental results show that ZCPA is relatively more sensitive to the derivative window length than conventional feature extraction algorithms. In addition, experimental comparisons with several front ends, including some auditory-like schemes, in real-world noisy environments demonstrate the robustness of the ZCPA model. The ZCPA model shows superior performance compared with other front ends, especially in conditions corrupted by white Gaussian noise.
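A toy single-channel version of the ZCPA computation (bin count and compression details are assumptions): each interval between successive upward zero-crossings of a band-passed signal votes for the frequency it implies, weighted by the compressed peak amplitude within the interval; the full model sums such histograms over all auditory channels.

```python
import numpy as np

def zcpa_channel(band_signal, fs, n_bins=64, f_max=4000.0):
    """Histogram of inverse zero-crossing intervals, weighted by the
    log-compressed peak amplitude inside each interval."""
    hist = np.zeros(n_bins)
    up = np.where((band_signal[:-1] < 0) & (band_signal[1:] >= 0))[0]
    for i0, i1 in zip(up[:-1], up[1:]):
        freq = fs / (i1 - i0)                  # frequency implied by interval
        peak = np.max(np.abs(band_signal[i0:i1]))
        b = int(freq / f_max * n_bins)
        if b < n_bins:
            hist[b] += np.log1p(peak)          # peak-amplitude weighting
    return hist
```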
ABSTRACT
Speaker-dependent automatic speech recognition systems are known to outperform speaker-independent systems when enough training data are available to model acoustical variability among speakers. Speaker normalization techniques modify the spectral representation of incoming speech waveforms in an attempt to reduce variability between speakers. Recent successful speaker normalization algorithms have incorporated a speaker-specific frequency warping into the initial signal processing stages. These algorithms, however, do not make extensive use of acoustic features contained in the incoming speech. In this paper we study the possible benefits of the use of acoustic features in speaker normalization algorithms using frequency warping. We study the extent to which the use of such features, including specifically the use of formant frequencies, can improve recognition accuracy and reduce computational complexity for speaker normalization. We examine the characteristics and limitations of several types of feature sets and warping functions as we compare their performance relative to existing algorithms.
ABSTRACT
Environmental robustness and speaker independence are important issues in current speech recognition research. Channel and speaker adaptation methods work best when the adaptation is done towards a normalized acoustic model. Normalization methods may make use of the model but primarily influence the signal, such that important information is kept while unwanted distortions are cancelled out. Most large vocabulary conversational speech recognition systems use Cepstral Mean Subtraction (CMS), a channel normalization approach, to compensate for the acoustic channel (and also the speaker). In this paper we discuss the basic algorithm and variations of it in the context of conversational speech, and report our experience using different approaches on two widely used conversational speech recognition tasks.
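For concreteness, the basic algorithm and one common streaming variation can be sketched as follows (the recursive smoothing constant is an illustrative choice, not a value from the paper):

```python
import numpy as np

def cms_utterance(cepstra):
    # Basic CMS: subtract the utterance-level cepstral mean, removing a
    # stationary convolutional channel (and some speaker) component.
    return cepstra - cepstra.mean(axis=0)

def cms_recursive(cepstra, alpha=0.995):
    # Streaming variant: track the mean with an exponential window so the
    # normalization can run incrementally on conversational input.
    out = np.empty_like(cepstra)
    mean = cepstra[0].copy()
    for t in range(len(cepstra)):
        mean = alpha * mean + (1.0 - alpha) * cepstra[t]
        out[t] = cepstra[t] - mean
    return out
```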
ABSTRACT
We present a compensation technique that corrects for the effects of noise and of speaker and environment variability on speech recognition accuracy by modifying the positions of the poles representing the speech signal in the z-plane. This modification yields pole locations whose statistics more closely match those of the distribution of clean training speech. The parameters of the mapping are obtained from the statistics of the pole distributions of the training and testing speech. Compensation is performed by direct modification of both the angle and the radius of the pole locations, and also by evaluating the cepstrum along a circle of radius less than 1 in the z-plane to enhance the salience of spectral peaks. These procedures are evaluated on the DARPA Resource Management database with added white noise. They are shown to compensate for the effects of environmental degradation, particularly at low SNRs.
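A sketch of the two pole operations named above (the mapping parameters would come from the training and testing pole statistics; the functional form here is illustrative): poles are moved in angle and radius, and the all-pole cepstrum is evaluated on a circle of radius r < 1, which is equivalent to dividing each pole by r and thus sharpens spectral peaks.

```python
import numpy as np

def move_poles(poles, angle_scale=1.0, radius_scale=1.0):
    # Direct modification of pole angle and radius, e.g. toward the
    # statistics of poles observed in clean training speech.
    return (np.abs(poles) * radius_scale) * np.exp(1j * np.angle(poles) * angle_scale)

def cepstrum_from_poles(poles, n_coeffs=13, radius=1.0):
    """All-pole model cepstrum c[n] = (1/n) * sum_k p_k**n; evaluating
    along a circle of radius < 1 replaces each p_k by p_k / radius."""
    p = np.asarray(poles) / radius
    return np.array([np.real(np.sum(p ** n)) / n for n in range(1, n_coeffs)])
```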
ABSTRACT
This paper addresses the problem of speech recognition over telephone networks. When the communication channel is unknown, the large mismatch between the training data and the signal encountered in the recognition phase drastically degrades the performance of recognition systems. In this context, we compare a classical approach, the noise compensation method, with novel robust network models that aim to incorporate and manage more variability in the training data. We introduce multi-HMM and multi-transition systems, trained with data recorded over the analog switched network and the cellular phone network. These architectures give the best results and succeed in improving recognizer robustness, achieving up to a 77% reduction in error rate for a system trained on the switched telephone network and used with cellular phones. Nevertheless, this modeling requires training data recorded in both environments; when such data are not available, noise cancellation or channel compensation are the only affordable solutions.
ABSTRACT
In this paper, a novel two-stage framework is proposed to cope with speech recognition in adverse environments. In the first stage, an on-line HMM composition method is proposed that compensates the HMMs using the on-line testing utterances; with this method, the dynamic change of the environmental noise within each utterance can be handled well. In the second stage, a classifier trained with a discriminative learning procedure is incorporated to enhance the system's discrimination capability. Since the recognition and adaptation processes are carried out in the same session in an unsupervised fashion, the proposed two-stage framework is suitable for practical use.