
Abstract: Session SP-2


SP-2.1  

Progress in Broadcast News Transcription at Dragon Systems
Steven Wegmann, Puming Zhan, Larry Gillick (Dragon Systems, Inc.)

In this paper we shall report on recent progress in acoustic modelling and preprocessing in our Broadcast News transcription system. We have gone back to basics in acoustic modelling, and re-examined some of our standard practices, in particular the use of IMELDA and frequency warping, in the context of the Broadcast News corpus. We shall also report on some preliminary experiments with a generalization of IMELDA, "semi-tied covariances". In combination, these improvements lead to a 3.5% absolute improvement over our eval97 models. We shall also describe our attempts to fix our rather primitive, silence-based preprocessing system, including initial results using a new speaker-change detection algorithm based on Hotelling's T²-test.
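
The two-sample Hotelling T² statistic on which such a speaker-change test is typically based can be written, for mean feature vectors estimated from n_1 and n_2 frames on either side of a candidate boundary and pooled covariance S, as

    T^2 = \frac{n_1 n_2}{n_1 + n_2} (\bar{x}_1 - \bar{x}_2)^\top S^{-1} (\bar{x}_1 - \bar{x}_2)

with a change point hypothesised where T² exceeds a threshold; the exact formulation used in the Dragon system is not given in the abstract.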


SP-2.2  

Recent Improvements to IBM's Speech Recognition System for Automatic Transcription of Broadcast News
Scott S Chen, Ellen M Eide, Mark Gales, Ramesh A Gopinath, Dimitri Kanevsky, Peder A Olsen (IBM)

We describe recent extensions and improvements to IBM's system for automatic transcription of broadcast news. The speech recognizer uses a total of 160 hours of acoustic transcription, 80 hours more than for the 1997 Hub4 evaluation. In addition to the improvements obtained in 1997, we made a number of changes and algorithmic enhancements. Among these were changing the acoustic vocabulary, reducing the number of phonemes, insertion of short pauses, mixture models consisting of non-Gaussian components, pronunciation networks, factor analysis (FACILT) and the Bayesian Information Criterion (BIC) applied to choosing the number of components in a Gaussian mixture model. The models were combined in a single system using NIST's script voting machine known as ROVER.
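
In the form typically used for selecting the number of mixture components, the Bayesian Information Criterion for a candidate model M with #(M) free parameters trained on N frames is

    \mathrm{BIC}(M) = \log L(X \mid M) - \frac{\lambda}{2}\, \#(M) \log N

with \lambda = 1 in the standard criterion, and the component count that maximises BIC is retained; whether the IBM system uses a tuned penalty weight is not stated in the abstract.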


SP-2.3  

Recent Experiments in Large Vocabulary Conversational Speech Recognition
Jayadev Billa, Thomas Colhurst, Amro El-Jaroudi, Rukmini Iyer, Kristine Ma, Carl Quillen, Fred Richardson, Manhung Siu, George Zavaliagkos, Herb Gish (BBN Technologies)

This paper describes the improvements that resulted in the 1998 Byblos Large Vocabulary Conversational Speech Recognition (LVCSR) System. Salient among these improvements are: improved signal processing, improved Hidden Markov Model (HMM) topology, use of quinphone context, introduction of diagonal speaker adapted training (DSAT), incorporation of variance adaptation in the MLLR framework, improvements in language modeling, increase in lexicon size and combination of multiple systems. These changes resulted in about a 7% absolute reduction in word error rates on a balanced Switchboard/Callhome English test set.
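
For reference, in the usual MLLR framework a Gaussian mean \mu is adapted through a regression-class transform (A, b) as

    \hat{\mu} = A\mu + b

and variance adaptation extends this with a further linear transform, e.g. \hat{\Sigma} = H \Sigma H^\top in one common parameterisation; the specific variance-adaptation form used in Byblos is not detailed in the abstract.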


SP-2.4  

Large Vocabulary Speech Recognition in French
Martine Adda-Decker, Gilles Adda, Jean-Luc S Gauvain, Lori F Lamel (LIMSI-CNRS)

In this contribution we present some design considerations concerning our large vocabulary continuous speech recognition system in French. The impact of the epoch of the text training material on lexical coverage, language model perplexity and recognition performance on newspaper texts is demonstrated. The effectiveness of larger vocabulary sizes and larger text training corpora for language modeling is investigated. French is a highly inflected language producing large lexical variety and a high homophone rate. About 30% of recognition errors are shown to be due to substitutions between inflected forms of a given root form. When word error rates are analysed as a function of word frequency, a significant increase in the error rate can be measured for frequency ranks above 5000.
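
Perplexity here is the standard language-model measure

    \mathrm{PP} = 2^{-\frac{1}{N}\sum_{i=1}^{N} \log_2 P(w_i \mid h_i)}

i.e. the geometric-mean inverse probability that the model assigns to the N words of the test text given their histories h_i.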


SP-2.5  

The Cambridge University Spoken Document Retrieval System
Sue E Johnson (Cambridge University Engineering Department), Pierre Jourlin (Cambridge University Computer Laboratory), Gareth L Moore (Cambridge University Engineering Department), Karen Sparck Jones (Cambridge University Computer Laboratory), Philip C Woodland (Cambridge University Engineering Department)

This paper describes the spoken document retrieval system that we have been developing and assesses its performance using automatic transcriptions of about 50 hours of broadcast news data. The recognition engine is based on the HTK broadcast news transcription system and the retrieval engine is based on the techniques developed at City University. The retrieval performance over a wide range of speech transcription error rates is presented and a number of recognition error metrics that more accurately reflect the impact of transcription errors on retrieval accuracy are defined and computed. The results demonstrate the importance of high accuracy automatic transcription. The final system is currently being evaluated on the 1998 TREC-7 spoken document retrieval task.
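
Retrieval engines of this kind typically rank documents with an Okapi-style combined weight; one common form (BM25) scores a document D against a query Q as

    \mathrm{score}(D, Q) = \sum_{q \in Q} \mathrm{idf}(q)\, \frac{f(q, D)\,(k_1 + 1)}{f(q, D) + k_1\left(1 - b + b\,\frac{|D|}{\mathrm{avgdl}}\right)}

where f(q, D) is the term frequency, |D| the document length, avgdl the average document length, and k_1, b tuning constants; the exact weighting used in the Cambridge/City University engine is not given in the abstract.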


SP-2.6  

Improvements in Recognition of Conversational Telephone Speech
Barbara Peskin, Michael Newman, Don McAllaster, Venkatesh Nagesha (Dragon Systems, Inc.), Hywel Richards (Dragon Systems UK), Steven Wegmann (Dragon Systems, Inc.), Melvyn Hunt (Dragon Systems UK), Larry Gillick (Dragon Systems, Inc.)

This paper describes recent changes in Dragon's speech recognition system which have markedly improved performance on conversational telephone speech. Key changes include: the conversion from mel-cepstra to modified PLP-based cepstra; the replacement of our usual IMELDA transform by a new transform using "semi-tied covariance"; a new multi-pass adaptation protocol; probabilities on alternate pronunciations in the lexicon; the addition of word-boundary tags in our acoustic models; and the redistribution of model parameters to build fewer output distributions, but with more mixture components per model.
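
In the semi-tied covariance model referred to here (and in SP-2.1), each Gaussian m in a class r keeps its own diagonal covariance \Sigma_m^{\mathrm{diag}} but shares a full transform H^{(r)}, giving an effective covariance

    \Sigma_m = H^{(r)}\, \Sigma_m^{\mathrm{diag}}\, H^{(r)\top}

so that a small number of shared full matrices model feature correlations without estimating a full covariance per component.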


SP-2.7  

The 1998 HTK System for Transcription of Conversational Telephone Speech
Thomas Hain, Philip C Woodland, Thomas R Niesler, Edward W.D. Whittaker (Cambridge University Engineering Department)

This paper describes the 1998 HTK large vocabulary speech recognition system for conversational telephone speech as used in the NIST 1998 Hub5E evaluation. Front-end and language modelling experiments conducted using various training and test sets from both the Switchboard and Callhome English corpora are presented. Our complete system includes reduced bandwidth analysis, side-based cepstral feature normalisation, vocal tract length normalisation (VTLN), triphone and quinphone hidden Markov models (HMMs) built using speaker adaptive training (SAT), maximum likelihood linear regression (MLLR) speaker adaptation and a confidence score based system combination. A detailed description of the complete system together with experimental results for each stage of our multi-pass decoding scheme is presented. The word error rate obtained is almost 20% lower than that of our 1997 system on the development set.
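
Side-based cepstral feature normalisation here corresponds to the usual cepstral mean (and optionally variance) normalisation computed per conversation side: with frames c_1, \dots, c_T from one side, each coefficient is normalised as

    \hat{c}_t = \frac{c_t - \mu}{\sigma}, \qquad \mu = \frac{1}{T}\sum_{t=1}^{T} c_t

the abstract does not specify whether variance scaling (division by \sigma) is included.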


SP-2.8  

Real-Time Telephone-Based Speech Recognition in the Jupiter Domain
James R Glass, Timothy J Hazen, I. Lee Hetherington (MIT Laboratory for Computer Science)

This paper describes our experiences with developing a real-time telephone-based speech recognizer as part of a conversational system in the weather information domain. This system has been used to collect spontaneous speech data which has proven to be extremely valuable for research in a number of different areas. After describing the corpus we have collected, we describe the development of the recognizer vocabulary, pronunciations, language and acoustic models for this system, the new weighted finite-state transducer-based lexical access component, and report on the current performance of the recognizer under several different conditions. We also analyze recognition latency to verify that the system performs in real time.
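
In weighted finite-state transducer terms, lexical access of this kind is usually expressed as a composition such as

    N = C \circ L \circ G

where C maps context-dependent units to phones, L is the pronunciation lexicon, and G the language model; real-time operation then corresponds to a real-time factor \mathrm{RTF} = t_{\mathrm{proc}} / t_{\mathrm{audio}} \le 1. The precise cascade used in the Jupiter recognizer is not given in the abstract.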




Last Update:  February 4, 1999         Ingo Höntsch