Session Th3A: Style and Accent Recognition

Chairperson: Gerard Chollet, ENST/SIG, Switzerland



USING ACCENT-SPECIFIC PRONUNCIATION MODELLING FOR IMPROVED LARGE VOCABULARY CONTINUOUS SPEECH RECOGNITION

Authors: J.J. Humphries, P.C. Woodland

e-mail: {jjh11,pcw}@eng.cam.ac.uk, Cambridge University Engineering Department, Trumpington Street, Cambridge, UK

Volume 5 pages 2367 - 2370

ABSTRACT

A method of modelling accent-specific pronunciation variations is presented. Speech from an unseen accent group is phonetically transcribed so that pronunciation variations can be derived. These context-dependent variations are clustered in decision trees, which serve as a model of the pronunciation variation associated with the new accent group. The trees are then used to build a new pronunciation dictionary for use during recognition. Experiments based on the Wall Street Journal and WSJCAM0 corpora are presented for the recognition of American speakers using a British English recogniser. Both speaker-independent and speaker-dependent adaptation scenarios are considered, giving up to a 20% reduction in word error rate. A linguistic analysis of the pronunciation model is presented, and finally the technique is combined with maximum likelihood linear regression, a well-proven acoustic adaptation technique, yielding further improvement.
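The dictionary-rebuilding step described above can be sketched as follows. The context-dependent rule shown (intervocalic /t/ flapping), the phone inventory, and the sample words are hypothetical illustrations; the paper learns such rules from decision trees rather than writing them by hand.

```python
# Minimal sketch of applying learned context-dependent pronunciation
# rules to rebuild a dictionary for a new accent group. The rule below
# and the VOWELS set are illustrative assumptions, not from the paper.

VOWELS = {"aa", "ae", "ah", "ax", "eh", "er", "ih", "iy"}

def apply_accent_rules(phones):
    """Map a canonical phone sequence to an accent-specific variant."""
    out = []
    for i, p in enumerate(phones):
        left = phones[i - 1] if i > 0 else "#"
        right = phones[i + 1] if i < len(phones) - 1 else "#"
        # Hypothetical rule: /t/ between vowels surfaces as a flap [dx].
        out.append("dx" if p == "t" and left in VOWELS and right in VOWELS else p)
    return out

def rebuild_dictionary(base_dict):
    """Keep both canonical and accent-adapted variants per word."""
    return {word: sorted({tuple(ph), tuple(apply_accent_rules(ph))})
            for word, ph in base_dict.items()}

base = {"better": ["b", "eh", "t", "ax"], "stop": ["s", "t", "aa", "p"]}
adapted = rebuild_dictionary(base)
```

A real system would generate variants like these for every dictionary entry and let the recogniser choose among them during decoding.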

A0121.pdf



AUTOMATIC SPEECH RECOGNITION FOR CHILDREN

Authors: Alexandros Potamianos, Shrikanth Narayanan and Sungbok Lee

AT&T Labs-Research, 180 Park Ave, P.O. Box 971, Florham Park, NJ 07932-0971, U.S.A. email: {potam,shri,sungbok}@research.att.com

Volume 5 pages 2371 - 2374

ABSTRACT

In this paper, the acoustic and linguistic characteristics of children's speech are investigated in the context of automatic speech recognition. Acoustic variability is identified as a major hurdle in building high-performance ASR applications for children. A simple speaker normalization algorithm combining frequency warping and spectral shaping, introduced in [5], is shown to reduce acoustic variability and significantly improve recognition performance for child speakers (by 25-45%). Age-dependent acoustic modeling further reduces word error rate by 10%. Piecewise-linear and phoneme-dependent frequency warping algorithms are proposed for reducing acoustic mismatch between the children's and adults' acoustic spaces.
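The piecewise-linear frequency warping mentioned above can be sketched as a two-segment warp: frequencies below a breakpoint are scaled by a factor alpha, and a second segment maps the Nyquist frequency to itself. The breakpoint and sample frequencies below are assumptions for illustration, not values from the paper.

```python
def piecewise_warp(f, alpha, f_break=4800.0, f_nyq=8000.0):
    """Two-segment piecewise-linear frequency warp.

    Frequencies below f_break are scaled by alpha; above it, a second
    linear segment joins (f_break, alpha * f_break) to (f_nyq, f_nyq)
    so the Nyquist frequency is left unchanged.
    """
    if f <= f_break:
        return alpha * f
    slope = (f_nyq - alpha * f_break) / (f_nyq - f_break)
    return alpha * f_break + slope * (f - f_break)
```

A warp factor alpha below 1 compresses a child's higher formant frequencies towards the adult range; in practice a per-speaker alpha would be chosen by maximising acoustic likelihood against the adult-trained models.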

A0183.pdf



RECOGNITION OF NON-NATIVE ACCENTS

Authors: Carlos Teixeira, Isabel Trancoso and Antonio Serralheiro

INESC/IST, Rua Alves Redol 9, 1000 LISBOA, PORTUGAL. Phone: +351.1.3100314, Fax: +351.1.3145843, Email: cjct@inesc.pt

Volume 5 pages 2375 - 2378

ABSTRACT

This paper deals with the problem of non-native accents in speech recognition. Reference tests were performed using whole-word and sub-word models trained either with a native accent or a pool of native and non-native accents. The results seem to indicate that the use of phonetic transcriptions for each specific accent may improve recognition scores with sub-word models. A data-driven process is used to derive transcription lattices. The recognition scores thus obtained were encouraging.
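A transcription lattice of the kind derived above can be represented minimally as a list of alternative phones per position; expanding it enumerates every pronunciation the lattice encodes. The word and its phone alternatives below are hypothetical examples, not taken from the paper.

```python
from itertools import product

def lattice_paths(lattice):
    """Expand a pronunciation lattice (alternative phones per position)
    into the full set of pronunciations it encodes."""
    return [list(path) for path in product(*lattice)]

# Hypothetical lattice for a word whose vowel varies across accents.
lattice = [["t"], ["iy", "ih"], ["m"]]
paths = lattice_paths(lattice)
```

In a data-driven setting, the alternatives at each position would be collected from aligned phonetic transcriptions of each accent group rather than listed by hand.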

A0657.pdf



SPEAKING MODE DEPENDENT PRONUNCIATION MODELING IN LARGE VOCABULARY CONVERSATIONAL SPEECH RECOGNITION

Authors: Michael Finke and Alex Waibel

finkem@cs.cmu.edu, ahw@cs.cmu.edu, Interactive Systems Laboratories, Carnegie Mellon University (USA)

Volume 5 pages 2379 - 2382

ABSTRACT

In spontaneous conversational speech there is a large amount of variability due to accents, speaking styles and speaking rates (also known as the speaking mode) [3]. Because current recognition systems usually use only a relatively small number of pronunciation variants for the words in their dictionaries, the amount of variability that can be modeled is limited. Increasing the number of variants per dictionary entry is the obvious solution. Unfortunately, this also means increasing the confusability between the dictionary entries, and thus often leads to an actual performance decrease. In this paper we present a framework for speaking mode dependent pronunciation modeling. The probability of encountering pronunciation variants is defined to be a function of the speaking style. The probability function is learned through decision trees from rule based generated pronunciation variants as observed on the Switchboard corpus. The framework is successfully applied to increase the performance of our state-of-the-art Janus Recognition Toolkit Switchboard recognizer significantly.
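The core idea above, making variant probabilities a function of speaking mode, can be sketched with a single mode feature, speaking rate. The logistic dependence and all numbers below are illustrative assumptions; the paper learns these probability functions with decision trees from rule-generated variants.

```python
import math

def p_reduced(rate, midpoint=14.0, steepness=0.8):
    """Probability of the reduced variant as a function of speaking
    rate (phones/second); rises smoothly with faster speech."""
    return 1.0 / (1.0 + math.exp(-steepness * (rate - midpoint)))

def choose_variant(variants, rate):
    """Pick the more probable of (full, reduced) variants given rate."""
    full, reduced = variants
    return reduced if p_reduced(rate) >= 0.5 else full

# Hypothetical full vs reduced pronunciations ("going to" vs "gonna").
variants = (["g", "ow", "ih", "ng", "t", "uw"], ["g", "ah", "n", "ax"])
```

Conditioning the variant probability on mode, rather than simply listing more variants, is what keeps dictionary confusability in check: the reduced form only competes when the speaking style makes it likely.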

A0930.pdf



A PROSODY ONLY DECISION-TREE MODEL FOR DISFLUENCY DETECTION

Authors: Elizabeth Shriberg (1), Rebecca Bates (2), Andreas Stolcke (1)

(1) Speech Technology and Research Laboratory, SRI International, Menlo Park, California {ees,stolcke}@speech.sri.com; http://www.speech.sri.com (2) Dept. of Electrical Engineering, Boston University, Boston, Massachusetts becky@raven.bu.edu; http://raven.bu.edu

Volume 5 pages 2383 - 2386

ABSTRACT

Speech disfluencies (filled pauses, repetitions, repairs, and false starts) are pervasive in spontaneous speech. The ability to detect and correct disfluencies automatically is important for effective natural language understanding, as well as for improving speech models in general. Previous approaches to disfluency detection have relied heavily on lexical information, which makes them less applicable when word recognition is unreliable. We have developed a disfluency detection method using decision tree classifiers that use only local and automatically extracted prosodic features. Because the model does not rely on lexical information, it is widely applicable even when word recognition is unreliable. The model performed significantly better than chance at detecting four disfluency types. It also outperformed a language model in the detection of false starts, given the correct transcription. Combining the prosody model with a specialized language model improved accuracy over either model alone for the detection of false starts. Results suggest that a prosody-only model can aid the automatic detection of disfluencies in spontaneous speech.
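A prosody-only classifier of this kind can be sketched as a small hand-written decision tree over local features. The features and thresholds below are hypothetical stand-ins for the ones the paper's trees learn from data.

```python
def classify_boundary(features):
    """Coarse disfluency label from local prosodic features only."""
    pause = features["pause_ms"]         # silence after the word
    lengthening = features["dur_ratio"]  # word duration / its mean
    f0_reset = features["f0_reset_hz"]   # pitch jump across boundary
    # Hypothetical splits; a real tree is induced from labelled data.
    if pause > 250 and f0_reset > 30:
        return "false_start"
    if lengthening > 1.5 and pause > 100:
        return "filled_pause"
    if pause > 150:
        return "repetition"
    return "fluent"
```

Because every feature here is extracted from the signal alone, the classifier still works when the word hypotheses from the recogniser are wrong, which is the motivation for avoiding lexical cues.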

A0934.pdf



A Novel Training Approach For Improving Speech Recognition Under Adverse Stressful Conditions

Authors: Sahar E. Bou-Ghazale and John H. L. Hansen

Robust Speech Processing Laboratory, Department of Electrical and Computer Engineering, Duke University, Box 90291, Durham, North Carolina 27708-0291, U.S.A. http://www.ee.duke.edu/people/seb.html http://www.ee.duke.edu/Research/Speech

Volume 5 pages 2387 - 2390

ABSTRACT

This paper presents a new training approach for improving recognition of speech under emotional and environmental stress. The proposed approach consists of training a speech recognizer with synthetically generated speech for each stress condition, using the stress perturbation models previously formulated in [4, 1]. These perturbation models statistically capture parameter variations under angry, loud, and Lombard-effect conditions and were employed in an analysis-synthesis scheme for generating stressed synthetic speech from isolated neutral speech. In this paper, two training approaches employing the synthetically generated stressed speech are presented: a speaker-independent and a speaker-adaptive training method. Both approaches outperform neutrally trained recognizers when tested with angry, loud, and Lombard-effect speech.
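The perturbation step can be sketched as scaling neutral-speech parameters by per-condition statistics. All scale values below are hypothetical; the paper estimates such statistics from real stressed speech and applies them within an analysis-synthesis framework rather than to summary parameters as shown here.

```python
import random

# Hypothetical (mean, std) multiplicative scale factors per condition.
STRESS_SCALES = {
    "loud":    {"pitch": (1.2, 0.05), "duration": (1.1, 0.05)},
    "angry":   {"pitch": (1.4, 0.10), "duration": (0.9, 0.05)},
    "lombard": {"pitch": (1.1, 0.05), "duration": (1.2, 0.05)},
}

def perturb(params, condition, rng=random.Random(0)):
    """Randomly scale neutral parameters towards a stress condition."""
    scales = STRESS_SCALES[condition]
    return {key: value * rng.gauss(*scales[key])
            for key, value in params.items()}

neutral = {"pitch": 120.0, "duration": 0.30}
loud = perturb(neutral, "loud")
```

Generating many such perturbed tokens per neutral utterance yields the stressed training data that both the speaker-independent and speaker-adaptive recognizers are trained on.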

A1146.pdf
