Chair: Jean-Claude Junqua, Panasonic (USA)
Eduardo Lleida, University of Zaragoza (Spain)
Julian Fernandez, University of Zaragoza (Spain)
Enrique Masgrau, University of Zaragoza (Spain)
In this paper, a robust speech recognition system for videoconference applications based on a microphone array is presented. By means of the microphone array, the speech recognition system can determine the positions of the users and increase the signal-to-noise ratio (SNR) between the desired speaker signal and the interference from the other users. The user positions are estimated by combining a direction-of-arrival (DOA) estimation method with a speaker identification system. The beamforming uses the spatial references of the desired speaker and the interference locations. A minimum variance algorithm with spatial constraints, working in the frequency domain, is used to design the weights of the broadband microphone array. Results of the speech recognition system are reported for a simulated environment with several users querying a geographic database.
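As a rough sketch of the frequency-domain minimum-variance design with spatial constraints that the abstract describes (not the authors' exact formulation; the array geometry, diagonal loading, and covariance estimate below are illustrative assumptions), the per-frequency-bin weight computation might look like:

```python
import numpy as np

def steering_vector(mic_positions, doa_deg, freq, c=343.0):
    """Far-field steering vector of a linear array at one frequency bin."""
    delays = mic_positions * np.sin(np.deg2rad(doa_deg)) / c  # per-mic delays
    return np.exp(-2j * np.pi * freq * delays)

def lcmv_weights(R, constraints, responses):
    """Linearly constrained minimum variance weights:
    minimize w^H R w  subject to  C^H w = f."""
    C = np.column_stack(constraints)
    f = np.asarray(responses, dtype=complex)
    # light diagonal loading for numerical robustness (an assumption)
    Ri = np.linalg.inv(R + 1e-3 * np.trace(R).real / R.shape[0] * np.eye(R.shape[0]))
    RiC = Ri @ C
    return RiC @ np.linalg.solve(C.conj().T @ RiC, f)

# hypothetical 8-mic linear array, 5 cm spacing, one 1 kHz bin
mics = np.arange(8) * 0.05
freq = 1000.0
d_des = steering_vector(mics, 0.0, freq)    # desired speaker at broadside
d_int = steering_vector(mics, 40.0, freq)   # interfering speaker at 40 degrees
R = np.eye(8)                               # stand-in for the estimated covariance
w = lcmv_weights(R, [d_des, d_int], [1.0, 0.0])  # unit gain on target, null on interferer
print(abs(w.conj() @ d_des), abs(w.conj() @ d_int))
```

In a broadband array the same design is repeated independently for every frequency bin, which is why the spatial references of both the target and the interferers must be known.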
Takeshi Yamada, Nara Institute of Science and Technology (Japan)
Satoshi Nakamura, Nara Institute of Science and Technology (Japan)
Kiyohiro Shikano, Nara Institute of Science and Technology (Japan)
A microphone array is a promising solution for realizing hands-free speech recognition in real environments. Accurate talker localization is very important for speech recognition using a microphone array. However, localization of a moving talker is difficult in noisy, reverberant environments, and talker localization errors degrade the performance of speech recognition. To solve this problem, this paper proposes a new speech recognition algorithm which considers multiple talker-direction hypotheses simultaneously. The proposed algorithm performs a Viterbi search in a 3-dimensional trellis space composed of talker directions, input frames, and HMM states. As a result, the locus of the talker and the phoneme sequence of the speech are obtained by finding the optimal path with the highest likelihood. To evaluate the performance of the proposed algorithm, speech recognition experiments were carried out on simulated data and real-environment data. The results show that the proposed algorithm works well even when the talker moves.
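A minimal sketch of this kind of 3-D trellis search, with placeholder scores; the adjacency constraint on direction changes and the direction-move penalty below are assumptions for illustration, not the paper's exact formulation:

```python
import numpy as np

def viterbi_3d(loglik, log_trans, dir_penalty=-1.0):
    """loglik[d, t, s]: log-likelihood of frame t in HMM state s when the
    array is steered to direction d. log_trans[s, s']: HMM transitions.
    dir_penalty: cost of moving to an adjacent direction between frames."""
    D, T, S = loglik.shape
    delta = np.full((D, S), -np.inf)
    delta[:, 0] = loglik[:, 0, 0]            # start in state 0, any direction
    back = np.zeros((T, D, S, 2), dtype=int)
    for t in range(1, T):
        new = np.full((D, S), -np.inf)
        for d in range(D):
            for dp in (d - 1, d, d + 1):     # talker moves at most one step/frame
                if not 0 <= dp < D:
                    continue
                move = 0.0 if dp == d else dir_penalty
                for s in range(S):
                    cand = delta[dp] + log_trans[:, s] + move
                    sp = int(np.argmax(cand))
                    if cand[sp] > new[d, s]:
                        new[d, s] = cand[sp]
                        back[t, d, s] = (dp, sp)
        delta = new + loglik[:, t, :]
    # backtrack the jointly optimal (direction, state) path
    d, s = np.unravel_index(np.argmax(delta), delta.shape)
    path = [(int(d), int(s))]
    for t in range(T - 1, 0, -1):
        d, s = back[t, d, s]
        path.append((int(d), int(s)))
    return path[::-1], float(delta.max())

# toy usage: 5 steering directions, 40 frames, 3-state left-to-right HMM
loglik = np.random.randn(5, 40, 3)
log_trans = np.log(np.array([[0.6, 0.4, 0.0],
                             [0.0, 0.6, 0.4],
                             [0.0, 0.0, 1.0]]) + 1e-12)
path, score = viterbi_3d(loglik, log_trans)
```

The direction component of the best path gives the talker locus, while the state component yields the recognized phoneme sequence, matching the joint decoding idea in the abstract.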
Tadd B Hughes, Brown University (U.S.A.)
Hong-Seok Kim, Brown University (U.S.A.)
Joseph H DiBiase, Brown University (U.S.A.)
Harvey F Silverman, Brown University (U.S.A.)
A major problem for speech recognition systems is relieving the talker of the need to use a close-talking, head-mounted, or desk-stand microphone. A likely solution is an array of microphones that can steer itself to the talker and use a beamforming algorithm to overcome the reduced signal-to-noise ratio due to room acoustics. This paper reports results for a real-time tracking microphone array as the input to an HMM-based connected alpha-digits speech recognizer. For a talker in the very near field of the array (within a meter), performance approaches that of a close-talking microphone. The effects of both the noise-reducing steered array and a maximum a posteriori (MAP) training step are shown to be significant. The array system and the recognizer are described, experiments are presented, and the implications of combining the two systems are discussed.
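For orientation, the simplest steered-array front-end is delay-and-sum beamforming; the toy time-domain version below (nearest-sample delays, hypothetical near-field geometry) is a generic illustration rather than the Brown array's actual algorithm:

```python
import numpy as np

def delay_and_sum(signals, mic_positions, source_pos, fs, c=343.0):
    """Align and average microphone channels for a near-field source.
    signals: (num_mics, num_samples); positions in meters."""
    dists = np.linalg.norm(mic_positions - source_pos, axis=1)
    delays = (dists - dists.min()) / c           # relative propagation delays
    shifts = np.round(delays * fs).astype(int)   # nearest-sample alignment
    n = signals.shape[1] - shifts.max()
    aligned = np.stack([x[s:s + n] for x, s in zip(signals, shifts)])
    return aligned.mean(axis=0)                  # coherent speech, incoherent noise

# hypothetical 4-mic linear array, talker 0.8 m in front (very near field)
fs = 16000
mics = np.stack([np.arange(4) * 0.1, np.zeros(4)], axis=1)
talker = np.array([0.15, 0.8])
noisy = np.random.randn(4, fs)                   # stand-in for recorded channels
enhanced = delay_and_sum(noisy, mics, talker, fs)
```

Because the talker is in the near field, the steering uses full source coordinates rather than a single arrival angle; the tracking component updates `source_pos` as the talker moves.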
Franck Giron, NTT Human Interface Laboratories (Japan)
Yasuhiro Minami, NTT Human Interface Laboratories (Japan)
Masashi Tanaka, NTT Human Interface Laboratories (Japan)
Ken'ichi Furuya, NTT Human Interface Laboratories (Japan)
In hands-free speech recognition, the speaker should be able to move freely in front of the speech acquisition device. However, the speech signal is then subject to variations due to the continuous change of position in the acoustic space. This paper focuses on the role of speaker head rotations, as compared with static situations, under anechoic conditions. The effect of speaker directivity on speech recognition performance degradation is demonstrated, and a compensation method based on HMM composition is proposed to increase performance.
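As a hedged illustration of the channel part of HMM composition: a linear transfer function, such as the directivity change caused by a head rotation, is multiplicative in the spectrum and therefore additive in the cepstrum, so a first-order composition simply offsets the Gaussian means. The paper's method is more general; the sizes below are hypothetical:

```python
import numpy as np

def compose_channel(hmm_means, channel_cep):
    """First-order composition of a cepstral-domain HMM with a linear
    channel: every Gaussian mean is shifted by the channel cepstrum."""
    return hmm_means + channel_cep

# hypothetical sizes: 3 states x 4 mixtures x 13 cepstral coefficients
clean_means = np.random.randn(3, 4, 13)
rotation_cep = 0.1 * np.random.randn(13)  # cepstrum of the head-rotation transfer function
compensated_means = compose_channel(clean_means, rotation_cep)
```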
Alexander Fischer, Philips Research Labs, Aachen (Germany)
Volker Stahl, Philips Research Labs, Aachen (Germany)
This paper presents results of speaker-independent speech recognition experiments in car environments, covering acoustic front-ends, models, and their structures. The database comprises 350 speakers in 6 different cars. We investigate whole-word models, context-independent phoneme models, and context-dependent within-word phoneme models. We study task-dependent phoneme models (same vocabulary context in training and test) and present first results on task-independent scenarios (broad context in training, i.e. phonetically rich material). The latter allow flexible vocabulary definition for applications with dynamically changing command words, or for new applications, avoiding an expensive data collection. Acoustic preprocessing is carried out with mel-cepstrum features combined with spectral subtraction and SNR normalization. The task-dependent word error rates are well below 3% for both whole-word and phoneme models. The task-independent scenarios require further work.
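A minimal sketch of the magnitude-domain spectral subtraction step in such a front-end (the over-subtraction factor, spectral floor, and noise-estimation rule are common textbook choices, not the paper's reported settings):

```python
import numpy as np

def spectral_subtraction(frames_mag, noise_mag, alpha=2.0, floor=0.01):
    """Magnitude-domain spectral subtraction with a spectral floor.
    frames_mag: (num_frames, num_bins) STFT magnitudes;
    noise_mag: (num_bins,) noise estimate from non-speech frames."""
    clean = frames_mag - alpha * noise_mag      # over-subtract the noise estimate
    return np.maximum(clean, floor * frames_mag)  # floor to limit musical noise

# hypothetical usage on the STFT magnitudes of a car recording
mag = np.abs(np.random.randn(200, 257))         # stand-in spectrogram
noise_est = mag[:10].mean(axis=0)               # assume the first 10 frames are noise
enhanced = spectral_subtraction(mag, noise_est)
```

The cleaned magnitudes would then feed the mel filterbank and cepstral analysis, with SNR normalization applied afterwards.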
Lamia Karray, FT-CNET/DIH/RCP (France)
Abdellatif BenJelloun, FT-CNET/DIH/RCP (France)
Chafic Mokbel, FT-CNET/DIH/RCP (France)
This paper deals with the robustness of automatic speech recognition for noisy wireless communications. We propose several solutions to improve speech recognition over the cellular network. Two architectures are derived for the recognizer, based on hidden Markov models (HMMs) adapted to adverse noise conditions. Two more specific solutions are then developed to alleviate GSM cellular network defects, namely holes and impulsive noise: holes are detected and rejected, while impulsive noises are modeled using mixture-density HMMs and a maximum likelihood criterion. These solutions yield a noticeable reduction in recognition errors, and the last one appears particularly promising.
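As a crude stand-in for the hole detection-and-rejection step (the paper's actual detector is not specified in the abstract; the energy-floor heuristic below is purely illustrative):

```python
import numpy as np

def reject_holes(frames, energy_floor_db=-45.0):
    """Drop frames whose log energy falls below a floor, a simple proxy
    for detecting GSM transmission holes (sudden signal dropouts) so the
    recognizer never scores them. frames: (num_frames, frame_len)."""
    energy_db = 10 * np.log10(np.mean(frames ** 2, axis=1) + 1e-12)
    return frames[energy_db > energy_floor_db]
```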
Zhong-Hua Wang, INRS-Telecommunications (Canada)
Patrick Kenny, INRS-Telecommunications (Canada)
In this paper, we introduce a new approach, called nonstationary adaptation (NA), for recognizing speech in nonstationary adverse environments. Two models are used: one is a speaker-independent hidden Markov model (HMM) for clean speech; the other is an ergodic Markov chain representing the nonstationary adverse environment. Each state in the Markov chain represents one stationary adverse condition and has associated with it an affine transform estimated by maximum likelihood linear regression (MLLR). Three kinds of adverse environments are considered: (i) multi-speaker speech recognition, where the speaker identity changes randomly and thus constitutes a nonstationary adverse condition; (ii) recognition of speech corrupted by machine-gun noise; and (iii) the cross-talk problem. The algorithm is tested on the Nov92 development database of WSJ0 with a vocabulary size of 20,000. In multi-speaker speech recognition, NA decreases the error rate by 13.6%. For speech corrupted by machine-gun noise, a one-state Markov chain decreases the error rate by 18%, and a two-state Markov chain gives a further 14% decrease. In the cross-talk problem, a one-state Markov chain decreases the error rate by 16.8%, while two-state and three-state Markov chains decrease the error rate by 22% and 24.4%, respectively.
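Applying one of the per-state MLLR affine transforms is straightforward once its parameters have been estimated; a minimal sketch with hypothetical dimensions (the estimation step itself, solving the MLLR regression, is omitted):

```python
import numpy as np

def apply_mllr(means, W, b):
    """Apply one MLLR affine transform to a set of Gaussian means:
    mu' = W mu + b. One such (W, b) pair is attached to each state of
    the ergodic chain modeling the nonstationary environment."""
    return means @ W.T + b

# hypothetical sizes: 500 Gaussians, 39-dimensional features
means = np.random.randn(500, 39)
W = np.eye(39) + 0.01 * np.random.randn(39, 39)  # estimated by MLLR in practice
b = 0.1 * np.random.randn(39)
adapted = apply_mllr(means, W, b)
```

During decoding, the environment chain selects which transformed model set scores each frame, which is how the approach tracks changes such as a new speaker or a noise burst.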
Makoto Shozakai, Nara Institute of Science and Technology (Japan)
Satoshi Nakamura, Nara Institute of Science and Technology (Japan)
Kiyohiro Shikano, Nara Institute of Science and Technology (Japan)
A user-friendly speech interface in a car cabin is strongly needed for safety reasons. This paper describes a robust speech recognition method that can cope with additive noises and multiplicative distortions. A known additive noise, whose source signal is available, can be canceled by NLMS-VAD (Normalized Least Mean Squares with frame-wise Voice Activity Detection). On the other hand, an unknown additive noise, whose source signal is not available, is suppressed with CSS (Continuous Spectral Subtraction). Furthermore, various multiplicative distortions are simultaneously compensated with E-CMN (Exact Cepstrum Mean Normalization), a speaker-dependent and environment-dependent CMN applied separately to speech and non-speech. Evaluation results of the proposed method in car cabin environments are finally described.
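A minimal sketch of an NLMS canceller with VAD-gated adaptation, in the spirit of NLMS-VAD (the filter length, step size, and per-sample gating here are assumptions; the paper gates adaptation frame-wise):

```python
import numpy as np

def nlms_vad(reference, mic, taps=256, mu=0.5, eps=1e-6, vad=None):
    """Normalized LMS canceller: 'reference' is the known noise source
    (e.g. the car audio feed), 'mic' the contaminated microphone signal.
    When the VAD flags speech, adaptation is frozen so the filter does
    not track the talker instead of the noise path."""
    w = np.zeros(taps)
    out = np.zeros(len(mic))
    for n in range(taps, len(mic)):
        x = reference[n - taps:n][::-1]        # most recent reference samples
        e = mic[n] - w @ x                     # error = mic minus predicted noise
        out[n] = e
        if vad is None or not vad[n]:          # adapt only while speech is absent
            w += mu * e * x / (x @ x + eps)
    return out

# toy usage: the mic picks up a delayed, attenuated copy of the reference
fs = 16000
ref = np.random.randn(fs)                               # known source signal
mic = 0.3 * np.roll(ref, 8) + 0.05 * np.random.randn(fs)
cleaned = nlms_vad(ref, mic)
```

CSS and E-CMN would then be applied downstream, on the spectral and cepstral representations of the canceller output respectively.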