Authors:
Pedro J. Moreno, Compaq Computer Corporation (USA)
Chris Joerg, Compaq Computer Corporation (USA)
Jean-Manuel Van Thong, Compaq Computer Corporation (USA)
Oren Glickman, Compaq Computer Corporation (USA)
Page (NA) Paper number 68
Abstract:
In this paper we address the problem of aligning very long (often more
than one hour) audio files to their corresponding textual transcripts
in an effective manner. We present an efficient recursive technique
to solve this problem that works well even on noisy speech signals.
The key idea of the algorithm is to turn the forced alignment problem
into a recursive speech recognition problem with a gradually restricted
dictionary and language model. The algorithm is tolerant of acoustic
noise and of errors or gaps in the text transcript or audio tracks. We
report experimental results on a 3 hour audio file containing TV and
radio broadcasts. We show accurate alignments for speech under
a variety of real acoustic conditions such as speech over music and
speech over telephone lines. We also report results when the same audio
stream has been corrupted with white additive noise or compressed using
a popular web encoding format such as RealAudio. This algorithm has
been used in our internal multimedia indexing project. It has processed
more than 200 hours of audio from varied sources, such as WGBH NOVA
documentaries and NPR web audio files. The system aligns speech media
content at about one to five times real time, depending on the acoustic
conditions of the audio signal.
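The anchor-and-recurse idea can be illustrated with a toy sketch: treat a noisy recognizer pass as a sequence of (word, time) hypotheses, pick a word that occurs exactly once in both the transcript and the hypothesis as an anchor, and recurse on the two halves. This is only an illustration of the recursive strategy, not the paper's actual system: `align` and its inputs are invented here, and a real implementation would re-run recognition on each half with a restricted dictionary and language model rather than match words directly.

```python
from collections import Counter

def align(transcript, hyp):
    """Toy recursive aligner: anchor on words unique to both sequences.

    transcript: list of words to be aligned.
    hyp: list of (word, time) pairs from a noisy recognition pass.
    Returns a dict mapping transcript word index -> time for anchored words.
    """
    if not transcript or not hyp:
        return {}
    t_counts = Counter(transcript)
    h_counts = Counter(w for w, _ in hyp)
    # Anchors: words that appear exactly once in both sequences.
    anchors = [w for w in t_counts if t_counts[w] == 1 and h_counts[w] == 1]
    if not anchors:
        return {}
    # Split near the middle so the recursion stays balanced.
    mid = len(transcript) // 2
    a = min(anchors, key=lambda w: abs(transcript.index(w) - mid))
    ti = transcript.index(a)
    hi = next(i for i, (w, _) in enumerate(hyp) if w == a)
    aligned = {ti: hyp[hi][1]}
    aligned.update(align(transcript[:ti], hyp[:hi]))
    aligned.update({ti + 1 + k: t
                    for k, t in align(transcript[ti + 1:], hyp[hi + 1:]).items()})
    return aligned
```

Because each anchor splits both the transcript and the hypothesis, recognition errors (like "quack" for "quick" below) stay confined to one sub-problem instead of derailing the whole alignment.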
Authors:
Judith M. Kessens, A2RT, University of Nijmegen (The Netherlands)
Mirjam Wester, A2RT, University of Nijmegen (The Netherlands)
Catia Cucchiarini, A2RT, University of Nijmegen (The Netherlands)
Helmer Strik, A2RT, University of Nijmegen (The Netherlands)
Page (NA) Paper number 372
Abstract:
In this paper the performance of an automatic transcription tool is
evaluated. The transcription tool is a Continuous Speech Recognizer
(CSR) running in forced recognition mode. For evaluation, the performance
of the CSR was compared to that of nine expert listeners. Both the listeners
and the machine carried out exactly the same task: deciding, in 467 cases,
whether a segment was present or not. The performance of the CSR turned
out to be comparable to that of the experts.
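A minimal sketch of that kind of man-versus-machine comparison, assuming per-case boolean decisions (segment present or absent); the function names and data layout here are invented for illustration, not the paper's evaluation code:

```python
def majority(votes):
    """Majority decision of a list of boolean expert votes."""
    return sum(votes) * 2 > len(votes)

def agreement_rate(machine, expert_panels):
    """Fraction of cases where the recognizer's decision matches the
    majority decision of the expert listeners.

    machine: list of booleans, one decision per case.
    expert_panels: list of lists of booleans, the experts' votes per case.
    """
    matches = sum(m == majority(votes)
                  for m, votes in zip(machine, expert_panels))
    return matches / len(machine)
```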
Authors:
Jon Barker, University of Sheffield (U.K.)
Gethin Williams, University of Sheffield (U.K.)
Steve Renals, University of Sheffield (U.K.)
Page (NA) Paper number 643
Abstract:
In this paper we define an acoustic confidence measure based on the
estimates of local posterior probabilities produced by an HMM/ANN large
vocabulary continuous speech recognition system. We use this measure
to segment continuous audio into regions where it is and is not appropriate
to expend recognition effort. The segmentation is computationally
inexpensive and provides reductions in both overall word error rate
and decoding time. The technique is evaluated using material from
the Broadcast News corpus.
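The general mechanism can be sketched as follows: score each frame by its best local posterior, then group contiguous frames that fall above or below a confidence threshold, decoding only the high-confidence runs. This is a simplified stand-in with invented names and a plain max-posterior score, not the authors' exact formulation:

```python
def frame_confidence(posteriors):
    """Per-frame confidence = the largest local posterior probability."""
    return [max(frame) for frame in posteriors]

def segment(conf, threshold):
    """Group frames into (decode?, start, end) runs.

    Runs where decode? is False can be skipped, reducing both
    decoding time and insertions on non-speech material.
    """
    runs = []
    for i, c in enumerate(conf):
        decode = c >= threshold
        if runs and runs[-1][0] == decode:
            runs[-1] = (decode, runs[-1][1], i + 1)
        else:
            runs.append((decode, i, i + 1))
    return runs
```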
Authors:
Bryan L. Pellom, Duke University (USA)
John H.L. Hansen, Duke University (USA)
Page (NA) Paper number 853
Abstract:
In this study, a duration-based measure is formulated for assigning
confidence scores to phonetic time-alignments produced by an automatic
speech segmentation system. For speech corrupted by additive noise
or telephone channel environments, the proposed confidence measure
is shown to provide a reliable means by which gross segmentation errors
can be automatically detected and marked for manual correction.
The measure is evaluated by computing Receiver Operating Characteristic
(ROC) curves to illustrate the expected trade-off in probability of
detecting gross segmentation errors versus false alarm rates.
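The ROC evaluation itself is straightforward to sketch. Assuming lower confidence scores indicate likelier gross errors, sweeping a threshold over the scores of true-error and correct alignments yields (false-alarm rate, detection rate) pairs; the names below are illustrative, not the paper's code:

```python
def roc_points(error_scores, correct_scores):
    """Sweep a threshold over confidence scores and return ROC points.

    error_scores: scores of alignments with true gross segmentation errors.
    correct_scores: scores of correct alignments.
    An alignment is flagged when its score falls below the threshold.
    Returns a list of (false_alarm_rate, detection_rate) pairs.
    """
    thresholds = sorted(set(error_scores + correct_scores)) + [float("inf")]
    points = []
    for t in thresholds:
        detection = sum(s < t for s in error_scores) / len(error_scores)
        false_alarm = sum(s < t for s in correct_scores) / len(correct_scores)
        points.append((false_alarm, detection))
    return points
```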
Authors:
Thomas Hain, Cambridge University (U.K.)
Philip C. Woodland, Cambridge University (U.K.)
Page (NA) Paper number 851
Abstract:
Broadcast news audio data contains a wide variety of different speakers
and audio conditions (channel and background noise). This paper describes
a segmentation, gender detection and audio classification scheme for
such data which aims to provide a speech recogniser with a stream of
reasonably sized segments, each from a single speaker and audio type,
while discarding non-speech data. Each segment is labelled as either
narrow-band or wide-band and as coming from either a female or male speaker.
The segmentation
system has been evaluated on the DARPA 1997 broadcast news data set
and detailed segmentation accuracy results are presented. It is shown
that the speech recognition accuracy for these automatically derived
segments is very nearly the same as that for manually segmented data.
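The classification step described above can be sketched as assigning each segment the label whose model scores its frames best. The Gaussian-style scoring functions below are stand-ins; the actual system's models and features are not specified in the abstract:

```python
def classify_segment(features, models):
    """Return the label whose model gives the highest average log-likelihood.

    features: list of per-frame feature values for one segment.
    models: dict mapping label -> per-frame log-likelihood function.
    """
    return max(models,
               key=lambda label: sum(models[label](f) for f in features)
                                 / len(features))
```

The same scoring scheme applies to each decision in the cascade (speech/non-speech, narrow-band/wide-band, female/male), with one model per class.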
Authors:
Børge Lindberg, CPK, Aalborg University (Denmark)
Robrecht Comeyne, Lernout & Hauspie Speech Products NV (Belgium)
Christoph Draxler, IPSK, University of Munich (Germany)
Francesco Senia, CSELT (Italy)
Page (NA) Paper number 1126
Abstract:
With the globalisation and evolving technology of voice-driven man-machine
interfaces, there is a growing demand for the acquisition of spoken-language
resources from speaker populations representative of a range of languages
and countries. In this paper, experience from work within a large consortium
in creating large multilingual speech databases for tele-services is
reported. In particular, the methods used and experience gained in recruiting
speakers for such recordings are reported across the participating partners.
The reporting is from the
SpeechDat project (http://speechdat.phonetik.uni-muenchen.de).