ICASSP '98 Technical Program

Abstract - SP27


 
SP27.1

   
Serbo-Croatian LVCSR on the Dictation and Broadcast News Domain
P. Scheytt, P. Geutner, A. Waibel  (University of Karlsruhe, Germany)
This paper describes the development of a Serbo-Croatian dictation and broadcast news speech recognizer. The intention is to generate an automatic text transcription of a news show, which will be submitted to a multilingual Informedia database. We outline the complete system development process using the JRTk, beginning with data collection, design and training of parameters, tuning and evaluation. We report on general recognition techniques such as segmentation, adaptation and language model interpolation, as well as language-specific problems, e.g. a high OOV rate due to inflected word forms. We show that even a small amount of acoustic training data, combined with Web-based interpolated language models, is sufficient to build a fairly reliable automatic news transcription system, which yields a performance of 36.0% WER.
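In its simplest linear form, the language model interpolation mentioned above is a weighted mixture of word probabilities from a sparse in-domain model and a broader Web-derived model. A minimal sketch, assuming toy unigram probabilities and an illustrative mixing weight (the abstract does not give the authors' actual models or weights):

```python
# Hedged sketch of linear language-model interpolation; the word lists,
# probabilities, and weight lam below are illustrative only.

def interpolate_lm(p_indomain, p_web, lam):
    """Return P(w) = lam * P_indomain(w) + (1 - lam) * P_web(w) over the
    union of both vocabularies (missing words contribute probability 0)."""
    vocab = set(p_indomain) | set(p_web)
    return {w: lam * p_indomain.get(w, 0.0) + (1 - lam) * p_web.get(w, 0.0)
            for w in vocab}

p_news = {"vlada": 0.4, "izbori": 0.6}                 # sparse in-domain estimates
p_web = {"vlada": 0.2, "izbori": 0.3, "grad": 0.5}     # broader Web estimates
mixed = interpolate_lm(p_news, p_web, lam=0.7)
# a mixture of two proper distributions is itself a proper distribution
assert abs(sum(mixed.values()) - 1.0) < 1e-9
```

Because both inputs are proper distributions, the interpolated model needs no renormalization; in practice the weight would be tuned on held-out data.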
 
SP27.2

   
Transcription of Broadcast News - Some Recent Improvements to IBM's LVCSR System
L. Polymenakos, P. Olsen, D. Kanevsky, R. Gopinath, P. Gopalakrishnan, S. Chen  (IBM, USA)
This paper describes extensions and improvements to IBM's large vocabulary continuous speech recognition (LVCSR) system for transcription of broadcast news. The recognizer uses an additional 35 hours of training data over the one used in the 1996 Hub4 evaluation (7). It includes a number of new features: optimal feature space for acoustic modeling (in training and/or testing), filler-word modeling, Bayesian Information Criterion (BIC) based segment clustering, an improved implementation of iterative MLLR and 4-gram language models. Results using the 1996 DARPA Hub4 evaluation data set are presented.
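Bayesian Information Criterion (BIC) based clustering of the kind mentioned above typically asks whether two segments are better modeled by one Gaussian or two, penalizing the extra parameters of the two-model hypothesis. A minimal 1-D sketch, assuming single Gaussians with scalar variance and an illustrative penalty weight `lam` (not IBM's actual formulation):

```python
import math

def delta_bic(x, y, lam=1.0):
    """ΔBIC for merging two 1-D feature segments, each modeled by a single
    Gaussian. A negative value means the merged (one-model) hypothesis is
    preferred, i.e. the segments likely belong to the same cluster."""
    def var(s):
        m = sum(s) / len(s)
        return max(sum((v - m) ** 2 for v in s) / len(s), 1e-12)

    n1, n2 = len(x), len(y)
    n = n1 + n2
    d = 1  # feature dimension; penalty counts mean and covariance parameters
    penalty = lam * 0.5 * (d + d * (d + 1) / 2) * math.log(n)
    # log-likelihood gain of keeping two models, minus the complexity penalty
    return ((n / 2) * math.log(var(x + y))
            - (n1 / 2) * math.log(var(x))
            - (n2 / 2) * math.log(var(y))
            - penalty)

same = delta_bic([0.0, 0.1, -0.1, 0.05, -0.05] * 4,
                 [0.02, 0.12, -0.08, 0.07, -0.03] * 4)
diff = delta_bic([0.0, 0.1, -0.1, 0.05, -0.05] * 4,
                 [10.0, 10.1, 9.9, 10.05, 9.95] * 4)
assert same < 0 < diff  # similar segments merge, dissimilar ones stay apart
```

Real systems apply the same test to multidimensional cepstral features with full covariances; the scalar version only illustrates the merge/no-merge decision.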
 
SP27.3

   
The BBN Byblos 1997 Large Vocabulary Conversational Speech Recognition System
G. Zavaliagkos, J. McDonough, D. Miller, A. El-Jaroudi, J. Billa, F. Richardson, K. Ma, M. Siu, H. Gish  (GTE/BBN Technologies, USA)
This paper presents the 1997 BBN Byblos Large Vocabulary Speech Recognition (LVCSR) system. We outline the algorithms and procedures used to train the system, describe the recognizer configuration, and present the major technological innovations that led to performance improvements. The major testbed for our results is the Switchboard corpus, where current word error rates vary from 27% to 34% depending on the test set. In addition, we present results on the CallHome Spanish and Arabic tests, where we demonstrate that technology developed on English corpora is very much portable to other problems and languages.
 
SP27.4

   
Experiments in Broadcast News Transcription
P. Woodland, T. Hain, S. Johnson, T. Niesler, A. Tuerk, S. Young  (Cambridge University Engineering Department, UK)
This paper presents the recent development of the HTK broadcast news transcription system. Previously we have used data-type-specific modeling based on adapted Wall Street Journal trained HMMs. However, we are now experimenting with data for which no manual pre-classification or segmentation is available; therefore automatic techniques are required and suitable acoustic modeling strategies must be adopted. An approach for automatic audio segmentation and classification is described and evaluated, as well as extensions to our previous work on segment clustering. A number of recognition experiments are presented that compare data-type-specific and non-specific models; differing amounts of training data; the use of gender-dependent modeling; and the effects of automatic data-type classification. It is shown that robust segmentation into a small number of audio types is possible and that models trained on a wide variety of data types can yield good performance.
 
SP27.5

   
Speech Recognition Performance on a Voicemail Transcription Task
M. Padmanabhan, E. Eide, B. Ramabhadran, G. Ramaswamy, L. Bahl  (IBM, USA)
The paper describes the collection of a novel database of voicemail messages (telephone-bandwidth large-vocabulary conversational speech) where the speech data represents interaction between a human and a machine, and is consequently quite different from existing databases (e.g., Switchboard or CallHome). We present an analysis of this data and several novel techniques to improve recognition performance on it. In particular, the use of a new discriminant measure for improving the acoustic models, an automated technique for cleaning up transcriptions to provide cleaner acoustic models, the use of compound words to model cross-word coarticulation effects, and the use of class language models are shown to improve the baseline recognition rate on the task.
 
SP27.6

   
Transcribing Broadcast News with the 1997 ABBOT System
G. Cook, T. Robinson  (Cambridge University Engineering Department, UK)
Recent DARPA CSR evaluations have focused on the transcription of broadcast news from both television and radio programmes. This is a challenging task because the data includes a variety of speaking styles and channel conditions. This paper describes the development of a connectionist-hidden Markov model (HMM) system, and the enhancements designed to improve performance on broadcast news data. Both multilayer perceptron (MLP) and recurrent neural network acoustic models have been investigated. We assess the effect of using gender-dependent acoustic models, and the impact on performance of varying both the number of parameters and the amount of training data used for acoustic modelling. The use of context-dependent phone models is described, and the effect of the number of context classes is investigated. We also describe a method for incorporating syllable boundary information during search. Results are reported on the 1997 DARPA Hub-4 development test set.
 
SP27.7

   
Experiments in Automatic Meeting Transcription Using JRTk
H. Yu, C. Clark, R. Malkin, A. Waibel  (Carnegie Mellon University, USA)
In this paper we describe our early exploration of automatic recognition of conversational speech in meetings for use in automatic summarizers and browsers to produce meeting minutes effectively and rapidly. To achieve optimal performance we started from two different baseline English recognizers, adapted them to meeting conditions, and tested the resulting performance. The data were found to be highly disfluent (conversational human-to-human speech), noisy (due to lapel microphones and the environment), and overlapped with background noise, resulting in error rates comparable so far to those on the CallHome conversational database (40-50% WER). A meeting browser is presented that allows the user to search and skim through highlights from a meeting efficiently despite the recognition errors.
 
SP27.8

   
Adaptive Vocabularies for Transcribing Multilingual Broadcast News
P. Geutner  (Universitaet Karlsruhe, Germany);   M. Finke, P. Scheytt  (Carnegie Mellon University, USA)
One of the most prevalent problems of large-vocabulary speech recognition systems is the large number of out-of-vocabulary words. This is especially the case when automatically transcribing broadcast news in languages other than English that have a large number of inflections and compound words. We introduce a set of techniques to decrease the number of out-of-vocabulary words during recognition by using linguistic knowledge about morphology and a two-pass recognition approach, where the first pass serves only to dynamically adapt the recognition dictionary to the speech segment to be recognized. A second recognition run is then carried out on the adapted vocabulary. With the proposed techniques we were able to reduce the OOV rate by more than 40%, thereby also improving recognition accuracy by an absolute 5.8%, from 64% to 69.8% word accuracy.
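The two-pass adaptation idea can be sketched as follows: take the first-pass hypothesis words, generate morphological variants, and add those variants that occur in a large background word list to the second-pass dictionary. The crude vowel-stripping stemmer and suffix list below are purely illustrative, not the authors' actual Serbo-Croatian morphology rules:

```python
# Hedged sketch of dictionary adaptation between two recognition passes.
# Stemming rule, suffixes, and word lists are hypothetical placeholders.

def expand_vocabulary(first_pass_words, full_wordlist,
                      suffixes=("a", "e", "u", "om")):
    """Collect inflected variants of first-pass hypotheses that occur in a
    large background word list, forming the second-pass dictionary."""
    adapted = set(first_pass_words)
    for w in first_pass_words:
        stem = w.rstrip("aeiou") or w   # crude stemming, for illustration only
        for suf in ("",) + tuple(suffixes):
            candidate = stem + suf
            if candidate in full_wordlist:
                adapted.add(candidate)
    return adapted

# first pass hypothesized "vlada"; its inflections from the background list
# are added, unrelated words are not
second_pass_vocab = expand_vocabulary(
    {"vlada"}, {"vlada", "vladu", "vladom", "grad"})
assert "vladu" in second_pass_vocab and "grad" not in second_pass_vocab
```

Restricting variants to words attested in a large background list keeps the adapted dictionary small enough for the second decoding pass while still covering inflections the static vocabulary missed.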
 
