ABSTRACT
In this paper we report on our recent work in transcribing broadcast news shows. Radio and television broadcasts contain signal segments of various linguistic and acoustic natures. The shows contain both prepared and spontaneous speech. The signal may be studio quality, may have been transmitted over a telephone or other noisy channel (i.e., corrupted by additive noise and nonlinear distortions), or may contain speech over music. Transcription of this type of data poses challenges in dealing with a continuous stream of data under varying conditions. Our approach to this problem is to segment the data into a set of categories, which are then processed with category-specific acoustic models. We describe our 65k-word speech recognizer and experiments using different sets of acoustic models for the transcription of broadcast news data. The use of prior knowledge of the segment boundaries and types is shown not to crucially affect performance.
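To make the segment-then-decode strategy concrete, here is a minimal Python sketch of the idea; the segmenter, condition classifier, and per-category decoders are hypothetical names for illustration, not components named in the paper:

    # Hypothetical condition categories mirroring those in the abstract:
    # studio-quality speech, telephone/noisy-channel speech, and speech
    # over music, each mapped to its own acoustic model set.
    CATEGORY_MODELS = {"studio": "am_studio",
                       "telephone": "am_telephone",
                       "music": "am_music"}

    def transcribe_stream(audio, segment, classify, decoders):
        """Partition a continuous broadcast stream, label each segment's
        acoustic condition, and decode it with the matching models."""
        words = []
        for seg in segment(audio):       # acoustic segmentation
            category = classify(seg)     # one of CATEGORY_MODELS' keys
            words.extend(decoders[category].decode(seg))
        return words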
ABSTRACT
Continuous speech is far more natural and efficient than isolated speech for communication. However, for current state-of-the-art automatic speech recognition systems, isolated speech recognition (ISR) is far more accurate than continuous speech recognition (CSR). It is common practice in the speech research community to build CSR systems using only CSR data. In doing this we ignore the fact that isolated (a.k.a. discrete) speech is a special case of continuous speech. Slowing the speaking rate is a natural reaction for a user faced with the high error rates of current CSR systems. Ironically, CSR systems typically have a much higher word error rate when speakers slow down, since the acoustic models are usually derived exclusively from continuous speech corpora. In this paper, we summarize our efforts to improve the robustness of our speaker-independent CSR system without suffering a recognition accuracy penalty. In particular, the multi-style-trained system described in this paper attains a 7.0% word error rate on a test set consisting of both isolated and continuous speech, in contrast to the 10.9% word error rate achieved by the same system trained only on continuous speech.
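The gain reported above is easy to state in relative terms; the short sketch below checks that arithmetic and illustrates the pooling step that multi-style training implies (the corpus names are placeholders, not the authors' identifiers):

    # Multi-style training pools both speaking styles before acoustic
    # model estimation (a sketch, not the authors' training pipeline).
    def multistyle_corpus(continuous_utts, isolated_utts):
        return list(continuous_utts) + list(isolated_utts)

    # Relative improvement implied by the abstract's figures:
    baseline, multistyle = 10.9, 7.0      # % word error rates
    print(f"{(baseline - multistyle) / baseline:.1%} relative reduction")
    # -> 35.8% relative word error rate reduction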
ABSTRACT
In this paper, we report on the automatic recognition of Japanese broadcast-news speech. We have been working on large-vocabulary continuous speech recognition (LVCSR) for Japanese newspaper speech transcription and have achieved good performance. We have recently applied our LVCSR system to transcribing Japanese broadcast-news speech. We extended the vocabulary from 7k words to 20k words and trained language models using newspaper texts and broadcast-news manuscripts. These two language models were applied to our evaluation speech sets. The language model trained on broadcast-news manuscripts achieved better results for broadcast-news speech, while the language model trained on newspaper texts achieved better results for newspaper speech. We achieved a word error rate of 19.7% for anchor-speaker speech by using a bigram language model and a trigram language model, both trained on broadcast-news manuscripts.
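As a rough illustration of the language-model side, the following sketch estimates bigram and trigram counts from a list of tokenized manuscript sentences; `manuscript_sentences` is a placeholder, and the smoothing method and the way the two models are combined in decoding are left open, since the abstract does not specify them:

    from collections import Counter

    def ngram_counts(sentences, n):
        """Count n-grams over sentences padded with boundary symbols."""
        counts = Counter()
        for words in sentences:
            padded = ["<s>"] * (n - 1) + words + ["</s>"]
            for i in range(len(padded) - n + 1):
                counts[tuple(padded[i:i + n])] += 1
        return counts

    bigrams = ngram_counts(manuscript_sentences, 2)
    trigrams = ngram_counts(manuscript_sentences, 3)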
ABSTRACT
In spoken language systems, the segmentation of utterances into coherent linguistic/semantic units is very useful, as it eases processing after the speech recognition phase. In this paper, a methodology for semantic boundary prediction is presented and tested on a corpus of person-to-person dialogues. The approach is based on binary decision trees and uses text context, including broad classes of silent pauses, filled pauses, and human noises. The best results give more than 90% precision, almost 80% recall, and about 3% false alarms.
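A hedged sketch of such a predictor, using scikit-learn in place of whatever tree-induction tooling the authors used: each candidate boundary becomes a feature vector built from its text context plus pause/noise classes, and `X_train`, `y_train`, `X_test`, `y_test` are assumed to be prepared elsewhere:

    from sklearn.tree import DecisionTreeClassifier
    from sklearn.metrics import precision_score, recall_score

    # y = 1 where a word boundary is a semantic boundary, else 0;
    # features encode word-class context, silent-pause class, filled
    # pauses, and human noises, as described in the abstract.
    tree = DecisionTreeClassifier(max_depth=8)  # depth is an assumption
    tree.fit(X_train, y_train)
    y_pred = tree.predict(X_test)
    print("precision:", precision_score(y_test, y_pred))
    print("recall:   ", recall_score(y_test, y_pred))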
ABSTRACT
In this paper, we describe a new recognition system for 4-digit strings in Japanese under fluent speech conditions. In particular, we introduce several methods to address problems related to the spontaneity of speech: discrimination between speech and background noise, out-of-vocabulary words, pauses between digits, etc. These methods led to an error rate reduction of 76% compared to a simple recognizer based on start- and end-point detection with non-refined models.
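The 76% figure is a relative reduction, so the refined system's error rate is roughly a quarter of the baseline's; the abstract gives no absolute baseline, so the number below is purely illustrative:

    # Relative error rate reduction: E_new = E_baseline * (1 - 0.76).
    baseline_error = 0.25                 # hypothetical baseline
    improved_error = baseline_error * (1 - 0.76)
    print(f"improved error rate: {improved_error:.1%}")  # -> 6.0%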
ABSTRACT
In this paper, we describe our recent work in automatic transcription of broadcast news programming from radio and television. This is a very challenging recognition problem because of the frequent and unpredictable changes that occur in speaker, speaking style, topic, channel, and background conditions. Faced with such a problem, there is a strong tendency to try to carve the input into separable classes and deal with each one independently. We have chosen instead to rely on condition-independent models and adaptive algorithms to deal with this highly variable data. In addition, we have developed effective techniques to automatically segment the input waveform and cluster the segments into data sets containing similar speakers and conditions to support unsupervised adaptation on the test. Using this general approach, we achieved the best overall word error rate of 31.8% on the 1996 DARPA Hub-4 Unpartitioned Evaluation.
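The clustering step can be pictured with a small sketch; the feature choice (one mean cepstral vector per segment) and the clustering method are assumptions for illustration, not the actual procedure used in the evaluation system:

    import numpy as np
    from sklearn.cluster import AgglomerativeClustering

    def cluster_segments(segment_features, n_clusters=8):
        """Group automatically found segments so that segments with
        similar speakers/conditions share a cluster; each cluster then
        drives one round of unsupervised adaptation."""
        X = np.stack(segment_features)  # one feature vector per segment
        return AgglomerativeClustering(n_clusters=n_clusters).fit_predict(X)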