Authors:
Gary Cook, Cambridge University (U.K.)
Tony Robinson, Cambridge University (U.K.)
James Christie, Cambridge University (U.K.)
Paper number 65
Abstract:
Although the performance of state-of-the-art automatic speech recognition
systems on the challenging task of broadcast news transcription has
improved considerably in recent years, many of these systems operate
at 130-300 times real time. Many applications of automatic transcription
of broadcast news, e.g. closed-caption subtitles for television broadcasts,
require real-time operation. This paper describes a connectionist-HMM
system for broadcast news transcription, and the modifications to this
system necessary for real-time operation. We show that real-time operation
is possible with a relative increase in word error rate of about 12%.
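To make the reported numbers concrete, the sketch below (our own
illustration, not code from the paper) shows how a real-time factor and a
relative word-error-rate increase are conventionally computed; the 12%
figure is from the abstract, while the 150x speed and 20.0% baseline WER
in the example are made-up placeholders.

    def real_time_factor(processing_seconds, audio_seconds):
        # RTF 1.0 means real-time operation; larger values are slower.
        return processing_seconds / audio_seconds

    def relative_wer_increase(baseline_wer, new_wer):
        # Relative, not absolute, change in word error rate.
        return (new_wer - baseline_wer) / baseline_wer

    # A system needing 150 s to decode 1 s of audio runs at 150x real time.
    print(real_time_factor(150.0, 1.0))                   # 150.0
    # On a hypothetical 20.0% baseline, rising to 22.4% WER is a 12% relative increase.
    print(round(relative_wer_increase(0.200, 0.224), 2))  # 0.12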
Authors:
Ha-Jin Yu, LG Corporate Institute of Technology (Korea)
Hoon Kim, LG Corporate Institute of Technology (Korea)
Jae-Seung Choi, LG Corporate Institute of Technology (Korea)
Joon-Mo Hong, LG Corporate Institute of Technology (Korea)
Kew-Suh Park, LG Corporate Institute of Technology (Korea)
Jong-Seok Lee, LG Corporate Institute of Technology (Korea)
Hee-Youn Lee, LG Corporate Institute of Technology (Korea)
Paper number 412
Abstract:
This paper describes preliminary results of automatic recognition of
Korean broadcast-news speech. We have been working on flexible vocabulary
isolated-word speech recognition, and the same HMMs are used
for broadcast-news continuous speech recognition. The recognizer is
trained on phonetically balanced isolated-word speech rather than
on the broadcast news speech itself. In this research, we use several
different lexica to investigate recognition performance as a function
of word length. We also propose a long-distance bigram language model
that can be applied in the first stage of the search to reduce
recognition errors caused by the early pruning of correct hypotheses.
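The abstract does not spell out the model's exact form, so the sketch
below shows one plausible reading of a long-distance bigram: conditioning
each word on a predecessor d positions back and interpolating across
distances. The class name, the absence of smoothing, and the uniform
interpolation weights are our own assumptions, not the paper's design.

    from collections import defaultdict

    class LongDistanceBigram:
        def __init__(self, max_distance=3):
            self.max_distance = max_distance
            # counts[d][(prev, w)]: how often w follows prev at distance d;
            # context[d][prev]: occurrences of prev with a word d positions ahead.
            self.counts = [defaultdict(int) for _ in range(max_distance + 1)]
            self.context = [defaultdict(int) for _ in range(max_distance + 1)]

        def train(self, sentences):
            for sent in sentences:
                for i, w in enumerate(sent):
                    for d in range(1, min(i, self.max_distance) + 1):
                        prev = sent[i - d]
                        self.counts[d][(prev, w)] += 1
                        self.context[d][prev] += 1

        def prob(self, history, w):
            # Average the maximum-likelihood estimates over all available
            # distances (uniform interpolation weights, an assumption).
            total, n = 0.0, 0
            for d in range(1, min(len(history), self.max_distance) + 1):
                prev = history[-d]
                if self.context[d][prev]:
                    total += self.counts[d][(prev, w)] / self.context[d][prev]
                    n += 1
            return total / n if n else 0.0

In a first search pass, a context word two or three positions back can
keep a correct hypothesis alive even when the immediate bigram score is
weak, which is the pruning effect the abstract targets.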
Authors:
James R. Glass, MIT Lab for Computer Science (USA)
Timothy J. Hazen, MIT Lab for Computer Science (USA)
Paper number 593
Abstract:
This paper describes our experiences with developing a telephone-based
speech recognizer as part of a conversational system in the weather
information domain. This system has been used to collect spontaneous
speech data that have proven extremely valuable for research
in a number of different areas. After describing the corpus we have
collected, we describe the development of the recognizer vocabulary,
pronunciations, language and acoustic models for this system, and report
on the current performance of the recognizer under several different
conditions.
Authors:
Hsiao-Wuen Hon, Microsoft Research (USA)
Yun-Cheng Ju, Microsoft Research (USA)
Keiko Otani, Microsoft Research (USA)
Paper number 597
Abstract:
Input of Asian ideographic characters has traditionally been one of
the biggest impediments to information processing in Asia. Speech
is arguably the most effective and efficient input method for Asian
non-spelling characters. This paper presents a Japanese large-vocabulary
continuous speech recognition system based on Microsoft Whisper technology.
We focus on the aspects of the system that are language specific and
demonstrate the adaptability of the Whisper system to new languages.
In this paper, we demonstrate that our morpheme-based language models,
which distinguish pronunciation and part of speech, and our Whisper-based
Japanese senonic acoustic models yield state-of-the-art Japanese
LVCSR performance. The speaker-independent character and
Kana error rates on the JNAS database are 10% and 5%, respectively.
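Character and Kana error rates like those quoted above are conventionally
computed as edit distance divided by reference length. The following is a
minimal sketch of that standard computation, not code from the paper.

    def edit_distance(ref, hyp):
        # Levenshtein distance via a single rolling row of the DP table.
        m, n = len(ref), len(hyp)
        dp = list(range(n + 1))
        for i in range(1, m + 1):
            prev, dp[0] = dp[0], i
            for j in range(1, n + 1):
                cur = dp[j]
                dp[j] = min(dp[j] + 1,                          # deletion
                            dp[j - 1] + 1,                      # insertion
                            prev + (ref[i - 1] != hyp[j - 1]))  # substitution
                prev = cur
        return dp[n]

    def error_rate(ref, hyp):
        # (substitutions + insertions + deletions) / reference length.
        return edit_distance(ref, hyp) / len(ref)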
Authors:
Jean-Luc Gauvain, LIMSI/CNRS (France)
Lori F. Lamel, LIMSI/CNRS (France)
Gilles Adda, LIMSI/CNRS (France)
Paper number 84
Abstract:
Radio and television broadcasts consist of a continuous stream of data
composed of segments with differing linguistic and acoustic characteristics,
which poses challenges for transcription. In this paper we report
on our recent work in transcribing broadcast news data, including the
problem of partitioning the data into homogeneous segments prior to
word recognition. Gaussian mixture models are used to identify speech
and non-speech segments. A maximum-likelihood segmentation/clustering
process is then applied to the speech segments using GMMs and an agglomerative
clustering algorithm. The clustered segments are then labeled according
to bandwidth and gender. The recognizer is a continuous-mixture-density,
tied-state, cross-word, context-dependent HMM system with a 65k-word
trigram language model. Decoding is carried out in three passes, with a final
pass incorporating cluster-based test-set MLLR adaptation. The overall
word transcription error on the Nov'97 unpartitioned evaluation test
data was 18.5%.
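As a rough illustration of the first stage described above, the sketch
below labels frames as speech or non-speech by comparing per-frame
log-likelihoods under two Gaussian mixture models. The feature choice,
mixture size, and use of scikit-learn are our own assumptions; the paper's
segmentation and clustering pipeline is considerably more elaborate.

    from sklearn.mixture import GaussianMixture

    def train_models(speech_feats, nonspeech_feats, n_components=16):
        # One GMM per class, trained on labeled frame-level features
        # (e.g. cepstral vectors); n_components is an illustrative choice.
        speech = GaussianMixture(n_components=n_components).fit(speech_feats)
        nonspeech = GaussianMixture(n_components=n_components).fit(nonspeech_feats)
        return speech, nonspeech

    def label_frames(feats, speech_gmm, nonspeech_gmm):
        # True where the speech log-likelihood wins; contiguous runs of
        # True frames then form the candidate speech segments.
        return speech_gmm.score_samples(feats) > nonspeech_gmm.score_samples(feats)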