Compensation (Speaker, Channel, Noise)


Cepstrum-based filter-bank design using discriminative feature extraction training at various levels

Authors:

Alain Biem, ATR HIP. (Japan)
Shigeru Katagiri, ATR HIP. (Japan)

Volume 2, Page 1503

Abstract:

This paper investigates the realization of optimal filter bank-based cepstral parameters. The framework is the Discriminative Feature Extraction (DFE) method, which iteratively estimates the filter-bank parameters according to the errors that the system makes. Various parameters of the filter bank, such as center frequency, bandwidth, and gain, are optimized using both a string-level and a frame-level optimization scheme. Application to vowel and noisy telephone speech recognition tasks shows that the DFE method realizes a more robust classifier through appropriate feature extraction.

ic971503.pdf




Minimum Error Rate Training for Designing Tree-Structured Probability Density Function

Authors:

Wu Chou, Bell Labs. (U.S.A.)

Volume 2, Page 1507

Abstract:

In this paper, we propose a signal prototype classification and evaluation framework for acoustic modeling. Based on this framework, a new tree-structured likelihood function is derived. It uses a designated cluster kernel $f_{m}^{C}$ for signal prototype classification and a designated cluster kernel $f_{m}^{L}$ for likelihood evaluation of outlier or tail events of the cluster. A minimum classification error (MCE) rate training approach is described for designing the tree-structured likelihood function. Experimental results indicate that the new tree-structured likelihood function significantly improves the acoustic resolution of the model. It also yields a larger decoding speedup than the conventional approach.

ic971507.pdf




A Frequency-Weighted HMM Based on Minimum Error Classification for Noisy Speech Recognition

Authors:

Hiroshi Matsumoto, Shinshu University (Japan)
Masanori Ono, Shinshu University (Japan)

Volume 2, Page 1511

Abstract:

As a noise-robust HMM, we previously proposed a frequency-weighted HMM (HMM-FW) whose covariance matrices are replaced by the inverse of frequency-weighting matrices. In this HMM, the frequency-weighting parameters were common to all classes and states, and were adjusted experimentally. To achieve further noise robustness, this paper examines class- and state-dependent weighting parameters and the minimum error classification (MCE) training of their weighting characteristics. Using the NOISEX-92 database, the MCE-trained HMM-FWs are shown to be more robust, even under untrained noise conditions, than both the previous HMM-FW and the conventional HMM.

ic971511.pdf




Dictionary-Based Discriminative HMM Parameter Estimation for Continuous Speech Recognition Systems

Authors:

Daniel Willett, Duisburg University (Germany)
Christoph Neukirchen, Duisburg University (Germany)
Jörg Rottland, Duisburg University (Germany)

Volume 2, Page 1515

Abstract:

The estimation of the HMM parameters has always been a major issue in the design of speech recognition systems. Discriminative objectives like Maximum Mutual Information (MMI) or Minimum Classification Error (MCE) have proved to be superior to the common Maximum Likelihood Estimation (MLE) in cases where a robust estimation of the probability density functions (pdfs) is not possible. The determination of the overall likelihood of an acoustic observation is the most crucial point of MMI parameter estimation when applied to continuous speech systems. In contrast to the common approaches, which estimate the overall likelihood of the training observations by evaluating the most confusing sentences or by applying global state frequencies, this paper proposes performing a dictionary analysis to obtain estimates of the dictionary-based risk of confusing each pair of HMM states. These estimates are used to estimate the observations' likelihood and to control the discriminative MMI training procedure. Results on a monophone SCHMM speech recognition system are presented that demonstrate the practicability of the new approach.
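
For orientation, the standard MMI criterion can be written as below. This is the textbook form, not the dictionary-based variant proposed in the paper, and the symbols are assumptions introduced here: $O_r$ is the $r$-th training observation sequence, $w_r$ its reference transcription, $\lambda$ the HMM parameters, and the sum in the denominator runs over competing word sequences $w$.

$$
F_{\mathrm{MMI}}(\lambda) = \sum_{r=1}^{R} \log \frac{p_{\lambda}(O_r \mid w_r)\, P(w_r)}{\sum_{w} p_{\lambda}(O_r \mid w)\, P(w)}
$$

The denominator is the overall likelihood of the observation, the quantity the abstract identifies as crucial. The approaches it mentions differ in how they approximate that sum.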

ic971515.pdf




A DFE-Based Algorithm For Feature Selection In Speech Recognition

Authors:

Angel de la Torre, University of Granada (Spain)
Antonio M. Peinado, University of Granada (Spain)
Antonio J. Rubio, University of Granada (Spain)
Victoria Sanchez, University of Granada (Spain)

Volume 2, Page 1519

Abstract:

Algorithms that reduce the number of features without degrading the performance of pattern recognition systems play an important role in real applications. In this work, a new algorithm for feature selection is proposed. This algorithm is based on the Discriminative Feature Extraction (DFE) technique and has been applied to speech recognition. The experimental results show that the recognition systems tolerate substantial reductions in the number of features without performance degradation. For the representation used in our experiments, the recognition error rate is not significantly increased when the number of components in the feature vector is reduced from 42 to 20.

ic971519.pdf




Robustness Issues and Solutions in Speech Recognition Based Telephony Services

Authors:

Vijay Raman, NYNEX S&T Inc. (U.S.A.)
Vidhya Ramanujam, NYNEX S&T Inc. (U.S.A.)

Volume 2, Page 1523

Abstract:

HMM-based algorithms for speaker-dependent recognition as well as speaker-independent recognition form the basis of speech services developed at NYNEX S&T and deployed widely by NYNEX and other telephone service providers. Based on the analysis of the initially deployed VoiceDialing service, robustness of these algorithms was recognized to be a dominant issue. In this paper, we discuss the features of a high-performance, robust speaker-dependent recognition algorithm, and include some deployment issues that were successfully resolved.

ic971523.pdf




Speaker-Dependent Speech Recognition Based on Phone-Like Units Models --- Application to Voice Dialing

Authors:

Vincent Fontaine, FPMS - TCTS (Belgium)
Hervé Bourlard, FPMS - TCTS (Belgium)

Volume 2, Page 1527

Abstract:

This paper presents a speaker-dependent speech recognition system with application to voice dialing. This work has been developed under the constraints imposed by voice dialing applications, i.e., low memory requirements and limited training material. Two methods for producing speaker-dependent word baseforms based on Phone-Like Units (PLUs) are presented and compared: (1) a classical vector quantizer is used to divide the space into regions associated with PLUs; (2) a speaker-independent hybrid HMM/MLP recognizer is used to generate speaker-dependent PLU-based models. This work shows that very low error rates can be achieved even with very simple systems, namely a DTW-based recognizer. However, the best results are achieved when using the hybrid HMM/MLP system to generate the word baseforms. Finally, a real-time demonstration simulating voice dialing functions and including keyword spotting and rejection capabilities has been set up and can be tested online.

ic971527.pdf




Enhanced Control and Estimation of Parameters for a Telephone Based Isolated Digit Recognizer

Authors:

Josef G. Bauer, Siemens AG (Germany)

Volume 2, Page 1531

Abstract:

The paper studies the use of discriminative techniques for a telephone-based isolated digit recognizer with respect to reduced system complexity. The combination of Linear Discriminant Analysis (LDA) and Minimum Error Classification (MEC) training provides improved system performance at reduced cost for both the training process and the application. Experiments are performed on an isolated digit database recorded over public lines, including approximately 700 speakers. The use of a single linear transformation matrix based on LDA allows density modeling that does not consider variances explicitly while maintaining a high recognition rate. MEC training is found to perform best when the number of system parameters is small. An error rate reduction of up to 80% was achieved by the combination of the two methods for such a system configuration.

ic971531.pdf




HTIMIT and LLHDB: Speech Corpora for the Study of Handset Transducer Effects

Authors:

Douglas A. Reynolds, MIT Lincoln Laboratory (U.S.A.)

Volume 2, Page 1535

Abstract:

This paper describes two corpora collected at Lincoln Laboratory for the study of handset transducer effects on the speech signal: the handset TIMIT (HTIMIT) corpus and the Lincoln Laboratory Handset Database (LLHDB). The goal of these corpora is to minimize all confounding factors and produce speech differing only in transducer effects. The speech is recorded directly from a telephone unit in a sound booth using prompted text and extemporaneous descriptions. The two corpora allow comparison of speech collected from a person speaking into a handset (LLHDB) versus speech played through a loudspeaker into a handset (HTIMIT). A comparison of results between the two corpora addresses the realism of artificially creating handset-degraded speech by playing recorded speech through handsets. The corpora are designed primarily for speaker recognition experimentation, but since both speaker and speech recognition systems use the same acoustic features affected by the handset, the knowledge gleaned is directly transferable to speech recognizers. Initial speaker identification performance on these corpora is presented. In addition, the application of HTIMIT in developing a handset detector that was successfully used on a Switchboard speaker verification task is described.

ic971535.pdf




Robustness Improvements in Continuously Spelled Names over the Telephone

Authors:

Michael Galler, STL (U.S.A.)
Jean-Claude Junqua, STL (U.S.A.)

Volume 2, Page 1539

Abstract:

A speaker-independent speech recognizer for continuously spelled names, implemented for a switchboard call-routing task, is analyzed for sources of error. Results indicate that most errors are due to extraneous speech and end-point detection errors. Strategies are proposed for improving the robustness of recognition, including tolerance for speech with pauses and a letter-spotting strategy to handle extraneous speech. Experimental results on laboratory data indicate that with the letter-spotting method, the name retrieval error rate is reduced by 60.1% on noisy signals or signals with extraneous speech, while it increases from 4.5% to 5.5% on clean signals. On data collected during a telephone field trial, error is reduced by 54.1% in offline tests by introducing the letter-spotting algorithm.

ic971539.pdf




A Fast Algorithm for Stochastic Matching with Application to Robust Speaker Verification

Authors:

Qi Li, Bell Labs (U.S.A.)
S. Parthasarathy, Bell Labs (U.S.A.)
Aaron E. Rosenberg, Bell Labs (U.S.A.)

Volume 2, Page 1543

Abstract:

Acoustic mismatch between training and test environments is one of the major problems in telephone-based speaker recognition. Speaker recognition performance is degraded when an HMM trained under one set of conditions is used to evaluate data collected from different telephone channels, microphones, etc. The mismatch can be approximated as a linear transform in the cepstral domain. In this paper, we present a fast, efficient algorithm to estimate the parameters of the linear transform for real-time applications. Using the algorithm, test data are transformed toward the training conditions by rotation, scale, and translation without destroying the detailed characteristics of speech. Speaker-dependent HMMs can then be used to evaluate the details under the same conditions as training. Compared to cepstral mean subtraction (CMS) and other bias removal techniques, the proposed linear transform is more general, since CMS and the others only consider translation; compared to maximum-likelihood approaches for stochastic matching, the proposed algorithm is simpler and faster, since iterative techniques are not required. The proposed algorithm improves the performance of a speaker verification system in the experiments reported in this paper.
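
The contrast between translation-only compensation and a full linear transform can be illustrated with a minimal sketch. The abstract does not give the paper's estimation procedure, so the closed-form moment matching below is an assumption chosen only because it is non-iterative; the function names, and the use of NumPy, are likewise hypothetical.

```python
import numpy as np

def cms(frames):
    """Cepstral mean subtraction: translation only (removes the per-utterance mean)."""
    return frames - frames.mean(axis=0)

def _sym_power(mat, power):
    """Raise a symmetric positive-definite matrix to a real power via eigendecomposition."""
    vals, vecs = np.linalg.eigh(mat)
    return (vecs * vals**power) @ vecs.T

def fit_affine(test_frames, train_mean, train_cov):
    """Closed-form affine map y = A x + b matching test mean/covariance to training.

    ASSUMPTION: moment matching is an illustrative stand-in, not the
    estimator from the paper.
    """
    test_mean = test_frames.mean(axis=0)
    test_cov = np.cov(test_frames, rowvar=False)
    A = _sym_power(train_cov, 0.5) @ _sym_power(test_cov, -0.5)
    b = train_mean - A @ test_mean
    return A, b

def apply_affine(frames, A, b):
    """Apply y = A x + b to every row (frame) of a (num_frames, dim) array."""
    return frames @ A.T + b
```

Here `frames` is a `(num_frames, dim)` array of cepstral vectors. `cms` can only shift the data, whereas `fit_affine` followed by `apply_affine` also rescales and rotates it, which is the extra generality the abstract attributes to a linear transform.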

ic971543.pdf




A Bayesian Predictive Classification Approach to Robust Speech Recognition

Authors:

Qiang Huo, ATR-ITL (Japan)
Hui Jiang, University of Tokyo (Japan)
Chin Hui Lee, Bell Labs (U.S.A.)

Volume 2, Page 1547

Abstract:

We introduce a new Bayesian predictive classification (BPC) approach to robust speech recognition and apply the BPC framework to Gaussian mixture continuous density hidden Markov model based speech recognition. We propose and focus on one of the approximate BPC approaches, called quasi-Bayesian predictive classification (QBPC). Compared with standard plug-in maximum a posteriori decoding, the QBPC method achieves around 14% relative recognition error rate reduction when applied to speaker-independent recognition of a confusable vocabulary, namely the 26 English letters, where a broad range of mismatches between training and testing conditions exists. When applied to cross-gender testing on a less confusable vocabulary, namely 20 English digits and commands, the QBPC method achieves around 24% relative recognition error rate reduction.

ic971547.pdf




Robust Speech Recognition Based on Viterbi Bayesian Predictive Classification

Authors:

Hui Jiang, University of Tokyo (Japan)
Keikichi Hirose, University of Tokyo (Japan)
Qiang Huo, ATR (Japan)

Volume 2, Page 1551

Abstract:

In this paper, we investigate a new Bayesian predictive classification (BPC) approach to realize robust speech recognition when mismatches exist between training and test conditions but no accurate knowledge of the mismatch mechanism is available. A specific approximate BPC algorithm called Viterbi BPC (VBPC) is proposed for both isolated word and continuous speech recognition. The proposed VBPC algorithm is compared with the conventional Viterbi decoding algorithm on speaker-independent isolated digit and connected digit string (TIDIGITS) recognition tasks. The experimental results show that VBPC can considerably improve robustness when mismatches exist between training and testing conditions.

ic971551.pdf
