ICASSP '98 Technical Program

Abstract - SP27


 
SP27.1

   
Serbo-Croatian LVCSR on the Dictation and Broadcast News Domain
P. Scheytt, P. Geutner, A. Waibel  (University of Karlsruhe, Germany)
This paper describes the development of a Serbo-Croatian dictation and broadcast news speech recognizer. The intention is to generate an automatic text transcription of a news show, which will be submitted to a multilingual Informedia database. We outline the complete system development process using the JRTk, beginning with data collection, design and training of parameters, tuning and evaluation. We report on general recognition techniques such as segmentation, adaptation and language model interpolation, as well as language-specific problems, e.g. a high OOV rate due to inflected word forms. We show that even a small amount of acoustic training data, combined with Web-based interpolated language models, is sufficient to build a fairly reliable automatic news transcription system, which yields a performance of 36.0% WER.
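In its simplest linear form, the language model interpolation mentioned above is a weighted mixture of word probabilities from a sparse in-domain model and a broader Web-derived model. A minimal sketch, assuming toy unigram probabilities and an illustrative mixing weight (the abstract does not give the authors' actual models or weights):

```python
# Hedged sketch of linear language-model interpolation; the word lists,
# probabilities, and weight lam below are illustrative only.

def interpolate_lm(p_indomain, p_web, lam):
    """Return P(w) = lam * P_indomain(w) + (1 - lam) * P_web(w) over the
    union of both vocabularies (missing words contribute probability 0)."""
    vocab = set(p_indomain) | set(p_web)
    return {w: lam * p_indomain.get(w, 0.0) + (1 - lam) * p_web.get(w, 0.0)
            for w in vocab}

p_news = {"vlada": 0.4, "izbori": 0.6}                 # sparse in-domain estimates
p_web = {"vlada": 0.2, "izbori": 0.3, "grad": 0.5}     # broader Web estimates
mixed = interpolate_lm(p_news, p_web, lam=0.7)
# a mixture of two proper distributions is itself a proper distribution
assert abs(sum(mixed.values()) - 1.0) < 1e-9
```

Because both inputs are proper distributions, the interpolated model needs no renormalization; in practice the weight would be tuned on held-out data.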
 
SP27.2

   
Transcription of Broadcast News - Some Recent Improvements to IBM's LVCSR System
L. Polymenakos, P. Olsen, D. Kanevsky, R. Gopinath, P. Gopalakrishnan, S. Chen  (IBM, USA)
This paper describes extensions and improvements to IBM's large vocabulary continuous speech recognition (LVCSR) system for transcription of broadcast news. The recognizer uses an additional 35 hours of training data over the one used in the 1996 Hub4 evaluation (7). It includes a number of new features: optimal feature space for acoustic modeling (in training and/or testing), filler-word modeling, Bayesian Information Criterion (BIC) based segment clustering, an improved implementation of iterative MLLR and 4-gram language models. Results using the 1996 DARPA Hub4 evaluation data set are presented.
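Bayesian Information Criterion (BIC) based clustering of the kind mentioned above typically asks whether two segments are better modeled by one Gaussian or two, penalizing the extra parameters of the two-model hypothesis. A minimal 1-D sketch, assuming single Gaussians with scalar variance and an illustrative penalty weight `lam` (not IBM's actual formulation):

```python
import math

def delta_bic(x, y, lam=1.0):
    """ΔBIC for merging two 1-D feature segments, each modeled by a single
    Gaussian. A negative value means the merged (one-model) hypothesis is
    preferred, i.e. the segments likely belong to the same cluster."""
    def var(s):
        m = sum(s) / len(s)
        return max(sum((v - m) ** 2 for v in s) / len(s), 1e-12)

    n1, n2 = len(x), len(y)
    n = n1 + n2
    d = 1  # feature dimension; penalty counts mean and covariance parameters
    penalty = lam * 0.5 * (d + d * (d + 1) / 2) * math.log(n)
    # log-likelihood gain of keeping two models, minus the complexity penalty
    return ((n / 2) * math.log(var(x + y))
            - (n1 / 2) * math.log(var(x))
            - (n2 / 2) * math.log(var(y))
            - penalty)

same = delta_bic([0.0, 0.1, -0.1, 0.05, -0.05] * 4,
                 [0.02, 0.12, -0.08, 0.07, -0.03] * 4)
diff = delta_bic([0.0, 0.1, -0.1, 0.05, -0.05] * 4,
                 [10.0, 10.1, 9.9, 10.05, 9.95] * 4)
assert same < 0 < diff  # similar segments merge, dissimilar ones stay apart
```

Real systems apply the same test to multidimensional cepstral features with full covariances; the scalar version only illustrates the merge/no-merge decision.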
 
SP27.3

   
The BBN Byblos 1997 Large Vocabulary Conversational Speech Recognition System
G. Zavaliagkos, J. McDonough, D. Miller, A. El-Jaroudi, J. Billa, F. Richardson, K. Ma, M. Siu, H. Gish  (GTE/BBN Technologies, USA)
This paper presents the 1997 BBN Byblos Large Vocabulary Speech Recognition (LVCSR) system. We outline the algorithms and procedures used to train the system, describe the recognizer configuration, and present the major technological innovations that led to performance improvements. The major testbed for our results is the Switchboard corpus, where current word error rates vary from 27% to 34% depending on the test set. In addition, we present results on the CallHome Spanish and Arabic tests, where we demonstrate that technology developed on English corpora is very much portable to other problems and languages.
 
SP27.4

   
Experiments in Broadcast News Transcription
P. Woodland, T. Hain, S. Johnson, T. Niesler, A. Tuerk, S. Young  (Cambridge University Engineering Department, UK)
This paper presents the recent development of the HTK broadcast news transcription system. Previously we have used data-type-specific modeling based on adapted Wall Street Journal trained HMMs. However, we are now experimenting with data for which no manual pre-classification or segmentation is available; therefore automatic techniques are required and suitable acoustic modeling strategies must be adopted. An approach for automatic audio segmentation and classification is described and evaluated, as well as extensions to our previous work on segment clustering. A number of recognition experiments are presented that compare data-type-specific and non-specific models; differing amounts of training data; the use of gender-dependent modeling; and the effects of automatic data-type classification. It is shown that robust segmentation into a small number of audio types is possible and that models trained on a wide variety of data types can yield good performance.
 
SP27.5

   
Speech Recognition Performance on a Voicemail Transcription Task
M. Padmanabhan, E. Eide, B. Ramabhadran, G. Ramaswamy, L. Bahl  (IBM, USA)
The paper describes the collection of a novel database of voicemail messages (telephone-bandwidth large-vocabulary conversational speech) where the speech data represents interaction between a human and a machine, and is consequently quite different from existing databases (e.g., Switchboard or CallHome). We present an analysis of this data and several novel techniques to improve recognition performance on it. In particular, the use of a new discriminant measure for improving the acoustic models, an automated technique for cleaning up transcriptions to provide cleaner acoustic models, the use of compound words to model cross-word coarticulation effects, and the use of class language models are shown to improve the baseline recognition rate on the task.
 
SP27.6

   
Transcribing Broadcast News with the 1997 ABBOT System
G. Cook, T. Robinson  (Cambridge University Engineering Department, UK)
Recent DARPA CSR evaluations have focused on the transcription of broadcast news from both television and radio programmes. This is a challenging task because the data includes a variety of speaking styles and channel conditions. This paper describes the development of a connectionist-hidden Markov model (HMM) system, and the enhancements designed to improve performance on broadcast news data. Both multilayer perceptron (MLP) and recurrent neural network acoustic models have been investigated. We assess the effect of using gender-dependent acoustic models, and the impact on performance of varying both the number of parameters and the amount of training data used for acoustic modelling. The use of context-dependent phone models is described, and the effect of the number of context classes is investigated. We also describe a method for incorporating syllable boundary information during search. Results are reported on the 1997 DARPA Hub-4 development test set.
 
SP27.7

   
Experiments in Automatic Meeting Transcription Using JRTk
H. Yu, C. Clark, R. Malkin, A. Waibel  (Carnegie Mellon University, USA)
In this paper we describe our early exploration of automatic recognition of conversational speech in meetings for use in automatic summarizers and browsers to produce meeting minutes effectively and rapidly. To achieve optimal performance we started from two different baseline English recognizers, adapted them to meeting conditions, and tested the resulting performance. The data were found to be highly disfluent (conversational human-to-human speech), noisy (due to lapel microphones and the environment), and overlapped with background noise, resulting in error rates comparable so far to those on the CallHome conversational database (40-50% WER). A meeting browser is presented that allows the user to search and skim through highlights from a meeting efficiently despite the recognition errors.
 
SP27.8

   
Adaptive Vocabularies for Transcribing Multilingual Broadcast News
P. Geutner  (Universitaet Karlsruhe, Germany);   M. Finke, P. Scheytt  (Carnegie Mellon University, USA)
One of the most prevalent problems of large-vocabulary speech recognition systems is the large number of out-of-vocabulary words. This is especially the case when automatically transcribing broadcast news in languages other than English that have a large number of inflections and compound words. We introduce a set of techniques to decrease the number of out-of-vocabulary words during recognition by using linguistic knowledge about morphology and a two-pass recognition approach, where the first pass serves only to dynamically adapt the recognition dictionary to the speech segment to be recognized. A second recognition run is then carried out on the adapted vocabulary. With the proposed techniques we were able to reduce the OOV rate by more than 40%, thereby also improving recognition accuracy by an absolute 5.8%, from 64% to 69.8% word accuracy.
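The two-pass adaptation idea can be sketched as follows: take the first-pass hypothesis words, generate morphological variants, and add those variants that occur in a large background word list to the second-pass dictionary. The crude vowel-stripping stemmer and suffix list below are purely illustrative, not the authors' actual Serbo-Croatian morphology rules:

```python
# Hedged sketch of dictionary adaptation between two recognition passes.
# Stemming rule, suffixes, and word lists are hypothetical placeholders.

def expand_vocabulary(first_pass_words, full_wordlist,
                      suffixes=("a", "e", "u", "om")):
    """Collect inflected variants of first-pass hypotheses that occur in a
    large background word list, forming the second-pass dictionary."""
    adapted = set(first_pass_words)
    for w in first_pass_words:
        stem = w.rstrip("aeiou") or w   # crude stemming, for illustration only
        for suf in ("",) + tuple(suffixes):
            candidate = stem + suf
            if candidate in full_wordlist:
                adapted.add(candidate)
    return adapted

# first pass hypothesized "vlada"; its inflections from the background list
# are added, unrelated words are not
second_pass_vocab = expand_vocabulary(
    {"vlada"}, {"vlada", "vladu", "vladom", "grad"})
assert "vladu" in second_pass_vocab and "grad" not in second_pass_vocab
```

Restricting variants to words attested in a large background list keeps the adapted dictionary small enough for the second decoding pass while still covering inflections the static vocabulary missed.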
 
