Session WMA: Speech Recognition in Adverse Environments; CSR and Error Analysis

Chairperson: Lori Lamel, LIMSI-CNRS, France

A Comparative Analysis of Blind Channel Equalization Methods for Telephone Speech Recognition

Authors: Wei-Wen Hung and Hsiao-Chuan Wang

Department of Electrical Engineering, National Tsing Hua University, Hsinchu, Taiwan 30043, Republic of China. E-mail: hcwang@ee.nthu.edu.tw

Volume 3 pages 1515 - 1518

ABSTRACT

A blind channel equalization method called signal bias removal (SBR) has been proposed and proved effective in compensating for the channel effect in telephone speech recognition. However, we found that the SBR method does not work well when additive noise and multiplicative distortion are taken into account at the same time. In this paper, we propose a new method, called modified signal bias removal (MSBR), which overcomes this problem in the SBR method. Experiments are conducted to evaluate the effectiveness of the MSBR method. The results show that the MSBR method outperforms SBR in a telephone speech recognition system, whether or not additive noise is considered.
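
The core SBR idea, iteratively estimating the channel bias as the mean deviation of the observed cepstra from their nearest clean-speech codewords and then subtracting it, can be sketched as follows. This is a minimal toy illustration; the function and variable names are ours, not the authors'.

```python
import numpy as np

def signal_bias_removal(cepstra, codebook, n_iter=3):
    """Iteratively estimate and remove a cepstral channel bias (the SBR idea).

    cepstra:  (T, D) cepstral frames of one utterance
    codebook: (K, D) clean-speech reference vectors
    """
    bias = np.zeros(cepstra.shape[1])
    for _ in range(n_iter):
        compensated = cepstra - bias
        # nearest clean codeword for every compensated frame
        d = ((compensated[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
        nearest = codebook[d.argmin(axis=1)]
        # bias = mean deviation of the frames from their nearest codewords
        bias = (cepstra - nearest).mean(axis=0)
    return cepstra - bias, bias

# toy check: a constant channel offset added to clean frames is recovered
rng = np.random.default_rng(0)
codebook = 3.0 * np.eye(4)                       # well-separated codewords
frames = codebook[rng.integers(0, 4, size=100)]  # a "clean" utterance
true_bias = np.array([0.5, -0.3, 0.2, 0.1])
clean, est = signal_bias_removal(frames + true_bias, codebook)
```

With well-separated codewords the nearest-neighbor assignment is exact and the bias is recovered in one pass; real cepstra require the iteration.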

A0010.pdf



HMM Retraining Based on State Duration Alignment for Noisy Speech Recognition

Authors: Wei-Wen Hung and Hsiao-Chuan Wang

Department of Electrical Engineering, National Tsing Hua University, Hsinchu, Taiwan 30043, Republic of China. E-mail: hcwang@ee.nthu.edu.tw

Volume 3 pages 1519 - 1522

ABSTRACT

It is known that incorporating the temporal information of state durations into an HMM can improve recognition performance. However, when a speech signal is contaminated by ambient noise, a state may still occupy too many or too few frames in the decoded state sequence, even if state durations are modeled. This phenomenon severely reduces the effectiveness of state-duration modeling techniques. To overcome this problem, a proportional alignment decoding (PAD) method combined with state duration statistics is proposed and shown experimentally to be effective when the speech signal is distorted by ambient noise. Instead of the Viterbi decoding algorithm, the PAD method is used for state decoding in the retraining phase of a conventional HMM and produces a new set of state duration statistics. This state duration alignment scheme is more effective at preventing a state from occupying too many or too few frames in the recognition phase.
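
The proportional-alignment idea can be sketched minimally: assign the frames of an utterance to the model's states in proportion to the states' mean durations, rather than by Viterbi decoding. The names below are our own toy sketch; in the paper this alignment drives the re-estimation of duration statistics.

```python
import numpy as np

def proportional_alignment(n_frames, mean_durations):
    """Assign the frames of an utterance to HMM states in proportion to
    the states' mean durations (the PAD idea, sketched)."""
    mean_durations = np.asarray(mean_durations, dtype=float)
    # cumulative fraction of the utterance at each state boundary
    boundaries = np.cumsum(mean_durations) / mean_durations.sum()
    ends = np.round(boundaries * n_frames).astype(int)
    starts = np.concatenate(([0], ends[:-1]))
    state_of_frame = np.zeros(n_frames, dtype=int)
    for s, (a, b) in enumerate(zip(starts, ends)):
        state_of_frame[a:b] = s
    return state_of_frame

# a 10-frame utterance over 3 states with mean durations 1:2:2
seg = proportional_alignment(10, [1.0, 2.0, 2.0])
```

The returned segmentation replaces the Viterbi state sequence during the retraining phase, so no state can absorb arbitrarily many or few frames.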

A0011.pdf



Fast Parallel Model Combination Noise Adaptation Processing

Authors: Yasuhiro KOMORI, Tetsuo KOSAKA, Hiroki YAMAMOTO, and Masayuki YAMADA

Media Technology Laboratory, Canon Inc., 890-12 Kashimada, Saiwai-ku, Kawasaki-shi, Kanagawa 211, Japan. E-mail: komori@cis.canon.co.jp

Volume 3 pages 1523 - 1526

ABSTRACT

In this paper, a fast PMC (Parallel Model Combination) noise adaptation method is proposed for continuous-HMM-based speech recognizers. The proposed method directly reduces the number of PMC operations by composing distributions according to their spatial relations. The proposed method is compared with the basic PMC algorithm in recognition accuracy and adaptation processing time on telephone speech. The results show that the proposed method saves around 65% (62.7%-70.9%) of the PMC computation with almost no degradation in recognition performance.
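
Basic PMC combines a clean-speech Gaussian and a noise Gaussian in the linear spectral domain under a log-normal approximation. The sketch below shows that combination step for per-dimension log-spectral means and variances; it is a simplified illustration of standard PMC, not the authors' fast variant, which reduces how often this step must be invoked.

```python
import numpy as np

def pmc_lognormal(mu_s, var_s, mu_n, var_n):
    """Combine clean-speech and noise Gaussians (log-spectral domain)
    via the log-normal approximation used in basic PMC (sketch)."""
    # log-normal moments: map each Gaussian to the linear spectral domain
    m_s = np.exp(mu_s + var_s / 2)
    m_n = np.exp(mu_n + var_n / 2)
    v_s = m_s**2 * (np.exp(var_s) - 1)
    v_n = m_n**2 * (np.exp(var_n) - 1)
    # speech and noise are additive in the linear domain
    m = m_s + m_n
    v = v_s + v_n
    # map the combined moments back to the log-spectral domain
    var = np.log(v / m**2 + 1)
    mu = np.log(m) - var / 2
    return mu, var

mu, var = pmc_lognormal(np.array([0.0]), np.array([0.1]),
                        np.array([-2.0]), np.array([0.05]))
```

As the noise mean tends to minus infinity, the compensated model falls back to the clean model, which is a useful sanity check on the mapping.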

A0034.pdf



SPEECH RECOGNITION MODULE FOR CSCW USING A MICROPHONE ARRAY

Authors: Takashi Endo, Shigeki Nagaya, Masayuki Nakazawa, Kiyoshi Furukawa and Ryuichi Oka

Tsukuba Research Center, Real World Computing Partnership, Tsukuba Mitsui Building 13F, 1-6-1 Takezono, Tsukuba-shi, Ibaraki 305, Japan. Tel: +81 298 53 1687, FAX: +81 298 53 1740, E-mail: enchan@trc.rwcp.or.jp

Volume 3 pages 1527 - 1530

ABSTRACT

This report proposes a recognition module for use in CSCW that suffers little degradation in recognition performance even when more than one person speaks at the same time and speakers are at a distance from the microphone. This is accomplished by controlling directionality using a microphone array and estimating the transmission characteristics from speakers to microphones. Evaluation by word spotting from continuous speech shows that this module raises the recognition rate by (1) 30% in an environment where two people are speaking at the same time, and (2) 15% when people speak at a distance of 160 cm from the microphone.
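
Directionality control with a microphone array can be illustrated with the simplest case, delay-and-sum beamforming with integer sample delays. This is only a toy sketch under our own naming; the module described above additionally estimates the speaker-to-microphone transmission characteristics.

```python
import numpy as np

def delay_and_sum(signals, delays):
    """Steer an array toward a speaker by delaying each channel so the
    target wavefront lines up, then averaging (delay-and-sum sketch).

    signals: (M, N) array, one row per microphone
    delays:  per-microphone integer sample delays of the target
    """
    M, N = signals.shape
    out = np.zeros(N)
    for m in range(M):
        out += np.roll(signals[m], -delays[m])  # undo the arrival delay
    return out / M

# toy check: one pulse arriving at three mics with different delays
t = np.zeros(32)
t[5] = 1.0
mics = np.stack([np.roll(t, d) for d in (0, 2, 4)])
y = delay_and_sum(mics, [0, 2, 4])
```

Signals aligned with the steering delays add coherently while sounds from other directions are averaged down, which is what preserves recognition with competing talkers.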

A0174.pdf




Relative Mel-Frequency Cepstral Coefficients Compensation for Robust Telephone Speech Recognition

Authors: Jiqing Han*,**, Munsung Han*, Gyu-Bong Park*, Jeongue Park*, Wen Gao**

* Language Understanding Lab., Systems Engineering Research Institute, ETRI, Korea. ** Department of Computer Science and Engineering, Harbin Institute of Technology, P.R. China. E-mail: {jqhan, mshan, gbpark, jgpark}@seri.re.kr, wgao@jdl.mcel.mot.com

Volume 3 pages 1531 - 1534

ABSTRACT

Finding robust and computationally simple methods is crucial for the practical application of telephone speech recognition. In this paper, we propose a new channel compensation method that applies a RASTA-like band-pass filter to the mel-frequency cepstral coefficients for robust telephone speech recognition. Experiments show that the proposed method reduces computational complexity compared with RASTA processing without losing performance, and that it also outperforms CMS and two-level CMS. We also verify that suppressing very low modulation frequencies is an effective approach for robust telephone speech recognition.
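
Band-pass filtering of cepstral trajectories along time can be sketched with the commonly published RASTA IIR filter, applied here to a single MFCC trajectory. The paper's filter is only RASTA-like, so treat these coefficients as an assumption for illustration.

```python
import numpy as np

def rasta_filter(traj):
    """Band-pass filter one cepstral-coefficient trajectory over time with
    the commonly cited RASTA IIR filter (causal sketch, zero initial state):

        H(z) = 0.1 * (2 + z^-1 - z^-3 - 2 z^-4) / (1 - 0.98 z^-1)
    """
    b = 0.1 * np.array([2.0, 1.0, 0.0, -1.0, -2.0])  # FIR part
    a = 0.98                                          # IIR pole
    x = np.asarray(traj, dtype=float)
    y = np.zeros_like(x)
    for n in range(len(x)):
        acc = sum(b[k] * x[n - k] for k in range(5) if n - k >= 0)
        if n > 0:
            acc += a * y[n - 1]
        y[n] = acc
    return y

# a constant channel offset (zero modulation frequency) is suppressed
y = rasta_filter(np.ones(300))
```

Because the numerator coefficients sum to zero, the DC gain is zero: a stationary convolutional channel adds a constant to each cepstral trajectory, and the filter removes it while passing speech-rate modulations.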

A0244.pdf



ROBUST SPEECH DETECTION METHOD FOR SPEECH RECOGNITION SYSTEM FOR TELECOMMUNICATION NETWORKS AND ITS FIELD TRIAL

Authors: Seiichi Yamamoto, Masaki Naito and Shingo Kuroiwa

KDD R&D Laboratories 2-1-15 Ohara Kamifukuoka, 356 Saitama, JAPAN Tel. +81 492 78 7311, FAX +81 492 78 7512, E-mail: yamamoto@lab.kdd.co.jp

Volume 3 pages 1535 - 1538

ABSTRACT

Input speech to speech recognition systems may be contaminated not only by various ambient noises but also by irrelevant sounds generated by users, such as coughing, tongue clicking, mouth noises and certain out-of-task utterances. The authors have developed a speech detection method that uses the likelihood of partial sentences to detect task utterances in speech contaminated with such irrelevant sounds. This paper describes the new speech detection method and reports on a field trial of speech recognition systems using it.

A0290.pdf



THE TUNING OF SPEECH DETECTION IN THE CONTEXT OF A GLOBAL EVALUATION OF A VOICE RESPONSE SYSTEM

Authors: Laurent MAUUARY and Lamia KARRAY

France Télécom, Centre National d'Études des Télécommunications, CNET/DIH/RCP, Technopole Anticipa, 2 avenue Pierre Marzin, 22307 Lannion, France. E-mail: mauuary@lannion.cnet.fr

Volume 3 pages 1539 - 1542

ABSTRACT

Field evaluations of automatic speech recognition (ASR) systems clearly demonstrate the importance of efficient rejection procedures for filtering out out-of-vocabulary tokens. High-performance speech recognition systems also require efficient speech detection. This paper presents an original framework for the global evaluation of speech recognition systems that allows the speech detection module of an ASR system to be tuned. A global evaluation makes it possible to measure the performance of a speech recognition system from the user's point of view and to identify its weak modules. Global evaluations are carried out on PSN (Public Switched Network) and GSM (Global System for Mobile communications) databases. On the PSN database, global evaluation is used to choose the best value for the speech detector threshold. The results also show that, for this optimal value, the rejection of out-of-vocabulary words is currently the main problem to be solved in building high-performance speech recognition systems for large public telecommunication applications. On the GSM database, global evaluation is used to assess the benefits of speech enhancement before speech detection. Results show that using spectral subtraction as the speech enhancement technique before detection drastically improves speech detection and, consequently, overall speech recognition.
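
The enhancement step evaluated on the GSM database, spectral subtraction, can be sketched in its basic magnitude-domain form. This is a minimal illustration with a spectral floor; the paper's exact variant is not specified here.

```python
import numpy as np

def spectral_subtraction(mag, noise_mag, floor=0.01):
    """Subtract an estimated noise magnitude spectrum from each frame's
    magnitude spectrum, flooring the result to avoid negative magnitudes
    (basic spectral subtraction, sketched)."""
    clean = mag - noise_mag
    return np.maximum(clean, floor * mag)

# two toy magnitude-spectrum frames, three frequency bins each
frames = np.array([[1.0, 0.5, 0.2],
                   [0.3, 0.2, 0.1]])
noise = np.array([0.25, 0.25, 0.25])   # noise estimate, e.g. from non-speech frames
enhanced = spectral_subtraction(frames, noise)
```

The floor matters for the downstream detector: hard-clipping to zero creates musical-noise artifacts that a speech detector can mistake for speech onsets.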

A0370.pdf



NEW METHODS IN CONTINUOUS MANDARIN SPEECH RECOGNITION

Authors: C. J. Chen, R. A. Gopinath, M. D. Monkowski, M. A. Picheny, and *K. Shen

IBM Thomas J. Watson Research Center, PO Box 218, Yorktown Heights, NY 10598, USA *IBM China Research Laboratory, 26 6th Street, Shangdi, Beijing 100085, China

Volume 3 pages 1543 - 1546

ABSTRACT

We describe new methods for speaker-independent, continuous Mandarin speech recognition based on the IBM HMM-based continuous speech recognition system (1-3). First, we treat tones in Mandarin as attributes of certain phonemes rather than of syllables. Second, instantaneous pitch is treated as a variable in the acoustic feature vector, in the same way as cepstra or energy. Third, by designing a set of word-segmentation rules to convert continuous Chinese text into segmented text, an effective trigram language model is trained (4). Applying these new methods, a speaker-independent, very-large-vocabulary continuous Mandarin dictation system is demonstrated. Decoding results show that its performance is similar to the best results for US English.

A0405.pdf



AUTOMATIC TRANSCRIPTION OF GENERAL AUDIO DATA: EFFECT OF ENVIRONMENT SEGMENTATION ON PHONETIC RECOGNITION

Authors: Michelle S. Spina and Victor W. Zue

Spoken Language Systems Group, Laboratory for Computer Science, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA. E-mail: {spina,zue}@sls.lcs.mit.edu

Volume 3 pages 1547 - 1550

ABSTRACT

The task of automatically transcribing general audio data is very different from those usually confronted by current automatic speech recognition systems. The general goal of our work is to determine the optimal training strategy for recognizing such data. Specifically, we have studied the effects of different speaking environments on a phonetic recognition task using data collected from a radio news program. We found that if a single recognizer is to be used, it is more effective to train on a smaller amount of homogeneous, clean data. This approach yielded a decrease in phonetic recognition error rate of over 26% relative to a system trained on an equivalent amount of data containing a variety of speaking environments. We found that additional gains can be made with a multiple-recognizer system trained on environment-specific data. Overall, this approach yielded a decrease in error rate of nearly 2%, with the error rate for some individual speaking environments decreasing by over 7%.

A0569.pdf



Automatic Recognition of Continuous Cantonese Speech with Very Large Vocabulary

Authors: Ying Pang Alfred NG (1), L. W. CHAN (1), P. C. CHING (2)

(1) Department of Computer Science and Engineering, (2) Department of Electronic Engineering, The Chinese University of Hong Kong, Shatin, New Territories, Hong Kong. Tel: (852) 2609 8411, Fax: (852) 2603 5302, E-mail: ypng@cs.cuhk.edu.hk

Volume 3 pages 1551 - 1554

ABSTRACT

This paper presents the first published results for automatic recognition of continuous Cantonese speech with a very large vocabulary. The vocabulary covered by this system is about the same as that encountered in the Hong Kong local Chinese newspaper, Wen Hui Bao. The system covers 6335 Chinese characters, and a large number of Chinese words can be formed by combining these characters. The input to the system is the endpointed speech waveform of a sentence or phrase; the output is the Big5-coded Chinese characters. In developing the recognition system, we have devised new methods for 1) the construction of a continuous Cantonese speech database, 2) lexical tone recognition in continuous Cantonese speech, and 3) the integration of lexical tone and base syllable recognition results. The speaker-dependent recognition rates for Chinese characters, base syllables and lexical tones are 90.94%, 94.73% and 69.7%, respectively.

A0587.pdf



SOURCE NORMALIZATION TRAINING FOR HMM APPLIED TO NOISY TELEPHONE SPEECH RECOGNITION

Authors: Yifan Gong

Speech Research, Media Technologies Laboratory, Texas Instruments, P.O. Box 655303, MS 8374, Dallas TX 75265, U.S.A. E-mail: Yifan.Gong@ti.com

Volume 3 pages 1555 - 1558

ABSTRACT

We refer to an environment e as some combination of speaker, handset, transmission channel and background noise condition, and regard any practical situation of a speech recognizer as a mixture of environments. A speech recognizer may be trained on multi-environment data. It may also need to adapt the trained acoustic models to new conditions. How to train an HMM with multi-environment data, and from what seed model to start an adaptation, are two questions of great importance. We propose a new solution to speech recognition based, for both training and adaptation, on separate modeling of phonetic variation and environmental variation. The problem is formulated as a hidden Markov process, where we assume that: speech x is generated by canonical distributions (independent of environmental factors); an unknown linear transformation W_e and a bias b_e, specific to environment e, are applied to x with probability P(e); and x cannot be observed directly, so what we observe is the outcome of the transformation, o = W_e x + b_e. Under the maximum-likelihood (ML) criterion, by applying the EM algorithm and extending Baum's forward and backward variables and algorithm, we obtain a novel joint solution for the parameters of the canonical distributions, the transformations and the biases. For special cases, on a noisy telephone speech database, the new formulation is compared to the per-utterance cepstral mean normalization (CMN) technique and shows more than 20% word error rate improvement.
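
The generative assumption o = W_e x + b_e implies that, if the environment transform were known, the canonical features could be recovered by inverting it. The sketch below illustrates that inversion alongside the per-utterance CMN baseline, on toy data with our own function names; the paper's actual contribution, the joint ML estimation of W_e and b_e by EM, is not reproduced here.

```python
import numpy as np

def normalize_to_canonical(obs, W_e, b_e):
    """Map observed features back to the canonical space, assuming the
    environment transform o = W_e x + b_e is known (model sketch)."""
    return np.linalg.solve(W_e, (obs - b_e).T).T

def cepstral_mean_normalization(obs):
    """Per-utterance CMN baseline: subtract the utterance mean."""
    return obs - obs.mean(axis=0)

rng = np.random.default_rng(1)
x = rng.normal(size=(50, 3))          # canonical features, 50 frames
W = np.diag([1.2, 0.8, 1.0])          # a toy environment transform
b = np.array([0.5, -0.2, 0.1])        # a toy environment bias
o = x @ W.T + b                       # what the recognizer observes
x_hat = normalize_to_canonical(o, W, b)
cmn = cepstral_mean_normalization(o)
```

CMN can only remove the bias-like part of the distortion; undoing the full linear transform requires knowing (or estimating) W_e, which is the extra power of the source-normalization formulation.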

A0606.pdf



THE DEVELOPMENT OF A SPEAKER INDEPENDENT CONTINUOUS SPEECH RECOGNIZER FOR PORTUGUESE

Authors: Joao P. Neto, Ciro A. Martins and Luis B. Almeida

INESC - IST R. Alves Redol, 9 1000 Lisboa - Portugal E-Mails: jpn@inesc.pt, cam@inesc.pt, lba@inesc.pt

Volume 3 pages 1559 - 1562

ABSTRACT

The development and evaluation of large-vocabulary, speaker-independent continuous speech recognition systems has mainly been done for American English. In this paper we present the work done to date in the development of a hybrid large-vocabulary, speaker-independent continuous speech recognition system for European Portuguese. Owing to the lack of a large, suitable speech and text database for developing such a system, we started collecting a large database and, at the same time, began developing a baseline system based on a smaller database. On this baseline system we applied techniques for automatic segmentation and labeling, in parallel with the development of a basic lexicon and language model for Portuguese. In the last part of this paper we also present the first steps of our work on the new database.

A0672.pdf



BLAME ASSIGNMENT FOR ERRORS MADE BY LARGE VOCABULARY SPEECH RECOGNIZERS

Authors: Lin Chase

The Robotics Institute Carnegie Mellon University 5000 Forbes Avenue Pittsburgh, Pennsylvania 15213 USA chase@cs.cmu.edu

Volume 3 pages 1563 - 1566

ABSTRACT

This paper describes an approach to identifying the reasons that speech recognition errors occur. The algorithm presented requires an accurate word transcript of the utterances being analyzed. It places errors into one of the following categories: 1) an out-of-vocabulary (OOV) word was spoken, 2) search error, 3) homophone substitution, 4) language model overwhelming correct acoustics, 5) transcript/pronunciation problems, 6) confused acoustic models, or 7) miscellaneous/not possible to categorize. Some categorizations of errors can supply training data to automatic corrective training methods that refine acoustic models. Other errors supply language model and lexicon designers with examples that identify potential improvements. The algorithm is described, and results on the combined 1992-1995 evaluation test sets of the North American Business (NAB) [1] [2] [3] corpus using the Sphinx-II recognizer [4] are presented.

A0757.pdf



Predicting Speech Recognition Performance

Authors: Atsushi Nakamura

ATR Interpreting Telecommunications Research Laboratories, 2-2 Hikaridai, Seika-cho, Soraku-gun, Kyoto 619-02, Japan. Tel: +81 774 95 1301, FAX: +81 774 95 1308, E-mail: atsushi@itl.atr.co.jp

Volume 3 pages 1567 - 1570

ABSTRACT

Predicting speech recognition performance in place of expensive recognition experiments is a very useful approach for the research and development of speech recognition systems. In this paper, we propose a method to predict speech recognition performance when using new test data and/or a new acoustic model. Performance prediction tests showed that the proposed method can accurately predict recognition performance, thus saving a large amount of computer resources.

A0801.pdf



A Voice Activity Detector for the ITU-T 8kbit/s Speech Coding Standard G.729

Authors: S. D. Watson*, B. M. G. Cheetham*, P. A. Barrett#, W. T. K. Wong# and A. V. Lewis#

*Department of Electrical Engineering and Electronics, The University of Liverpool, Liverpool L69 3BX, U.K. #BT Laboratories, Martlesham Heath, Ipswich IP5 7RE, U.K.

Volume 3 pages 1571 - 1574

ABSTRACT

Voice Activity Detectors (VADs) are widely used in speech technology applications where the available transmission or storage capacity is limited (e.g. mobile, DCME, etc.) and must be used with maximum economy. Modern digital speech coding algorithms can provide toll-quality speech at bit-rates as low as 8 kbit/s (e.g. ITU-T G.729), and the use of a VAD can achieve further economy in average bit-rate. This paper presents a modified version of the GSM VAD, for use with the ITU-T 8 kbit/s speech coding algorithm CS-ACELP, which makes an active/inactive decision for every 10 ms coding frame. The performance of the proposed voice activity detector is compared to that of the GSM coder in terms of VAD errors and subjective quality. Results indicate that the modified VAD performs similarly to the standardised GSM VAD while operating with G.729 parameters and coding frame size.
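
The per-frame decision structure (one active/inactive flag per 10 ms coding frame, i.e. 80 samples at 8 kHz) can be illustrated with a toy energy-based detector. The real GSM- and G.729-oriented VADs use a much richer parameter set; this sketch only shows the framing and decision shape.

```python
import numpy as np

def frame_vad(signal, frame_len=80, threshold=0.01):
    """Toy VAD: one active/inactive decision per coding frame
    (80 samples = 10 ms at 8 kHz), based on mean frame energy."""
    n = len(signal) // frame_len
    frames = signal[: n * frame_len].reshape(n, frame_len)
    energy = (frames**2).mean(axis=1)
    return energy > threshold

# 60 ms of silence, speech, silence (two frames each)
sig = np.concatenate([np.zeros(160), 0.5 * np.ones(160), np.zeros(160)])
decisions = frame_vad(sig)
```

In a coder, inactive frames are replaced by low-rate comfort-noise updates, which is where the average bit-rate saving comes from.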

A0826.pdf



VOCABULARY-INDEPENDENT RECOGNITION OF AMERICAN SPANISH PHRASES AND DIGIT STRINGS

Authors: Yeshwant K. Muthusamy and John J. Godfrey

Speech Recognition Branch Media Technologies Laboratory Texas Instruments, Dallas, Texas USA. E-mail: {yeshwant,godfrey}@csc.ti.com

Volume 3 pages 1575 - 1578

ABSTRACT

We describe the development of an R&D recognizer for several Spanish applications, starting from an existing recognition system for American English and modest language-specific resources. The experiments emphasize achieving phonetic accuracy on telephone speech without vocabulary-specific training. We use our basic recognition engine and simple grammar-building tools for predicting word sequences. Only the read sentences from two telephone speech corpora (Voice Across Hispanic America (VAHA) and a smaller TI corpus) are used for training. Word error rates (WER) of 1.9% on telephone service command phrases, 5.5% on telephone numbers, and 12% on continuously spoken sentences are achieved with the newly ported system.

A0991.pdf



RECOGNITION OF SPOKEN AND SPELLED PROPER NAMES

Authors: Michael Meyer and Hermann Hild

Interactive Systems Laboratories, University of Karlsruhe, 76128 Karlsruhe, Germany. E-mail: {hhild,mmeyer}@ira.uka.de

Volume 3 pages 1579 - 1582

ABSTRACT

Many speech applications, most prominently telephone directory assistance, require the recognition of proper names. However, the recognition of increasingly large sets of spoken names is difficult: besides technical limitations, very large recognition vocabularies contain many easily confused words or even homophones. Therefore, proper names are often spelled, or both spoken and spelled. In this paper we compare the performance of proper name recognition when a name is spoken only, spelled only, or both spoken and spelled. In the latter case, information about the same name is provided in two different representations. We address methods to exploit this redundancy and propose techniques to handle the recognition of large lists of spoken and spelled proper names.

A0998.pdf



HMM COMPENSATION FOR NOISY SPEECH RECOGNITION BASED ON CEPSTRAL PARAMETER GENERATION

Authors: Takao Kobayashi (1), Takashi Masuko (1), and Keiichi Tokuda (2)

(1)Precision and Intelligence Laboratory, Tokyo Institute of Technology, Yokohama, 226 Japan (2)Department of Computer Science, Nagoya Institute of Technology, Nagoya, 466 Japan E-mail: tkobayas@pi.titech.ac.jp, masuko@pi.titech.ac.jp, tokuda@ics.nitech.ac.jp

Volume 3 pages 1583 - 1586

ABSTRACT

This paper proposes a technique for compensating both the static and dynamic parameters of continuous mixture density HMMs to make them robust to noise. The technique is based on cepstral parameter generation from the HMM using dynamic parameters. The generated cepstral vector sequences of speech and noise are combined to yield the noisy-speech cepstral vector sequence, and the dynamic parameters are calculated from the obtained cepstral vector sequence. Model parameters for the noisy-speech HMM are obtained from the statistics of the noisy-speech parameter sequences. We use the mixture transition probability for estimating the parameters of the compensated model. Experimental results show the effectiveness of the proposed technique in noisy speech recognition.

A1020.pdf



ON THE ROBUSTNESS OF THE CRITICAL-BAND ADAPTIVE FILTERING METHOD FOR MULTI-SOURCE NOISY SPEECH RECOGNITION

Authors: G. Nokas, E. Dermatas and G. Kokkinakis

Wire Communications Laboratory, Electrical & Computer Engineering Dept., University of Patras, 26100 Patras, Greece. Tel: +30 61 991 722, FAX: +30 61 991 855, E-mail: nokas@george.wcl2.ee.upatras.gr

Volume 3 pages 1587 - 1590

ABSTRACT

In this paper we study the influence of the sub-band adaptive filtering speech enhancement method on speech recognition systems in multi-source noisy environments, using a speaker microphone and a noise reference microphone. In extensive experiments, the recognition score of a speaker-independent isolated-word speech recognition system based on continuous density HMMs (CDHMMs) was measured in the presence of real-life noises at various SNRs. In all experiments the results show an improvement in the mean recognition score when the sub-band adaptive filtering LMS method is used, in comparison to the full-band LMS method. This improvement increases when the types of noise distorting the speech signal change over time.
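
The full-band LMS baseline with a noise-reference microphone can be sketched as follows; the paper's method applies the same adaptation separately per critical band. Signals and names here are a toy construction of ours.

```python
import numpy as np

def lms_cancel(primary, reference, order=4, mu=0.05):
    """Full-band LMS noise cancellation: adapt an FIR filter so that the
    filtered reference-microphone signal predicts the noise reaching the
    primary (speaker) microphone; the prediction error is the output."""
    w = np.zeros(order)
    out = np.zeros_like(primary)
    for n in range(order, len(primary)):
        x = reference[n - order:n][::-1]   # most recent reference samples
        e = primary[n] - w @ x             # error = enhanced output sample
        w += mu * e * x                    # LMS weight update
        out[n] = e
    return out

rng = np.random.default_rng(2)
noise_ref = rng.normal(size=4000)
# noise at the speaker mic: a delayed, scaled copy of the reference noise
noise_at_mic = 0.8 * np.concatenate(([0.0], noise_ref[:-1]))
enhanced = lms_cancel(noise_at_mic, noise_ref)
```

With a speech-free primary channel the residual after convergence measures how much noise the canceller removes; splitting the adaptation into sub-bands lets each band converge at its own rate, which is the paper's point.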

A1137.pdf



A Space Transformation Approach for Robust Speech Recognition in Noisy Environments

Authors: Cun-tai Guan, Shu-hung Leung and Wing-hong Lau

Department of Electronic Engineering City University of Hong Kong, Kowloon, Hong Kong Tel: +852 2788 7193, Fax: +852 2784 4262, E-mail: eeguan@cpccux0.cityu.edu.hk

Volume 3 pages 1591 - 1594

ABSTRACT

To improve the robustness of speech recognition in additive-noise environments, an SVD-based space transformation approach is proposed. It is shown that with this approach, not only is the signal-to-noise ratio improved, but a significant reduction in recognition errors is also achieved. A multiple-model scheme based on the proposed method is developed, which provides high recognition rates over a large range of SNRs. Recognition experiments on a speaker-dependent monosyllabic database with additive noise show that this new approach significantly outperforms the LPC cepstrum, MFCC and OSALPC cepstrum.
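
One way an SVD-based transformation can improve the signal-to-noise ratio is by projecting noisy feature frames onto their dominant singular subspace; the sketch below illustrates this on synthetic low-rank data. It illustrates the general idea only, not the authors' specific transform.

```python
import numpy as np

def svd_denoise(frames, rank):
    """Project a matrix of noisy feature frames onto its top singular
    subspace: small singular values mostly carry noise, so truncating
    them raises the SNR (SVD space-transformation sketch)."""
    U, s, Vt = np.linalg.svd(frames, full_matrices=False)
    s[rank:] = 0.0
    return (U * s) @ Vt

rng = np.random.default_rng(3)
low_rank = rng.normal(size=(40, 2)) @ rng.normal(size=(2, 8))  # "signal"
noisy = low_rank + 0.01 * rng.normal(size=(40, 8))             # + noise
denoised = svd_denoise(noisy, rank=2)
```

The rank must be chosen (or a model built per SNR range, as in the multiple-model scheme above): too low discards signal, too high keeps noise.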

A1138.pdf



ROBUST ISOLATED WORD RECOGNITION USING WSP-PMC COMBINATION

Authors: Tzur Vaich and Arnon Cohen

Electrical and Computer Engineering Department, Ben Gurion University, P.O. Box 653, Beer-Sheva 84105, Israel. Tel: +972-7-6461545; FAX: +972-7-6472949; E-mail: Arnon@Newton.bgu.ac.il

Volume 3 pages 1595 - 1598

ABSTRACT

A new robust algorithm for isolated word recognition in low-SNR environments is proposed. The algorithm, called WSP, is described here for left-to-right models with no skips. It is shown that the algorithm outperforms the conventional HMM in the SNR range of 5 to 20 dB, and the PMC algorithm in the range of 0 to -9 dB.

A1149.pdf
