Authors:
Matti Karjalainen, Helsinki University of Technology (Finland)
Toomas Altosaar, Helsinki University of Technology (Finland)
Miikka Huttunen, Helsinki University of Technology (Finland)
Paper number 885
Abstract:
An automated speech signal labeling tool, developed for the QuickSig
speech database environment, is described. It is based primarily on
the use of neural networks as diphone event detectors. For robustness,
only coarse categories of diphones, such as stop-vowel and vowel-nasal,
are used. 64 such detectors are implemented to cover all of the Finnish
diphones. The preprocessing of speech signals is carried out using
warped linear prediction and the diphone events from neural network
outputs are matched to the given text transcription using a simple
rule-based parser. For isolated-word labeling of single-speaker signals,
a well-trained system makes about 1-2% coarse labeling errors, and the
deviation of boundary positions from careful manual labeling averages
about 10 ms. Generalization to other speakers appears promising.
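As a rough illustration of how such detector events might be matched against a transcription, the Python sketch below greedily aligns time-ordered coarse-diphone detections with the expected coarse-diphone sequence. The data format, threshold, and function names are assumptions for illustration, not the authors' implementation.

# Illustrative sketch (not the authors' parser): greedily match coarse-diphone
# detector events, sorted by time, against the expected coarse-diphone sequence
# derived from the text transcription.
def match_events_to_transcription(events, expected):
    """events: list of (time_sec, coarse_class, score) from the detectors.
    expected: list of coarse diphone classes, e.g. ["stop-vowel", "vowel-nasal"].
    Returns a list of (coarse_class, time_sec or None) label anchors."""
    events = sorted(events, key=lambda e: e[0])
    labels, i = [], 0
    for target in expected:
        hit = None
        while i < len(events):
            t, cls, score = events[i]
            i += 1
            if cls == target and score > 0.5:   # detection threshold (assumed)
                hit = t
                break
        labels.append((target, hit))            # None marks a missed event
    return labels

if __name__ == "__main__":
    detections = [(0.12, "stop-vowel", 0.9), (0.31, "vowel-nasal", 0.8)]
    print(match_events_to_transcription(detections, ["stop-vowel", "vowel-nasal"]))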
Authors:
Harry Bratt, SRI International (USA)
Leonardo Neumeyer, SRI International (USA)
Elizabeth Shriberg, SRI International (USA)
Horacio Franco, SRI International (USA)
Paper number 926
Abstract:
We describe the methodologies for collecting and annotating a Latin-American
Spanish speech database. The database includes recordings by native
and nonnative speakers. The nonnative recordings are annotated with
ratings of pronunciation quality and detailed phonetic transcriptions.
We use the annotated database to investigate rater reliability, the
effect of each phone on overall perceived nonnativeness, and the frequency
of specific pronunciation errors.
Authors:
Neeraj Deshmukh, Institute for Signal and Information Processing, Mississippi State University (USA)
Aravind Ganapathiraju, Institute for Signal and Information Processing, Mississippi State University (USA)
Andi Gleeson, Institute for Signal and Information Processing, Mississippi State University (USA)
Jonathan Hamaker, Institute for Signal and Information Processing, Mississippi State University (USA)
Joseph Picone, Institute for Signal and Information Processing, Mississippi State University (USA)
Paper number 685
Abstract:
The SWITCHBOARD (SWB) corpus is one of the most important benchmarks
for large vocabulary conversational speech recognition (LVCSR) tasks.
The high error rates on SWB are largely attributable to an
acoustic model mismatch, the high frequency of poorly articulated monosyllabic
words, and large variations in pronunciations. It is imperative to
improve the quality of segmentations and transcriptions of the training
data to achieve better acoustic modeling. By adapting existing acoustic
models to only a small subset of such improved transcriptions, we have
achieved a 2% absolute improvement in performance.
Authors:
Demetrio Aiello, Fondazione Ugo Bordoni (Italy)
Cristina Delogu, Fondazione Ugo Bordoni (Italy)
Renato De Mori, Université d'Avignon et des Pays de Vaucluse (France)
Andrea Di Carlo, Fondazione Ugo Bordoni (Italy)
Marina Nisi, Fondazione Ugo Bordoni (Italy)
Silvia Tummeacciu, Fondazione Ugo Bordoni (Italy)
Paper number 499
Abstract:
The paper describes a system, written in Java, for generating textual
and visual scenarios used to collect speech corpora in the framework of
a Tourism Information System. Methods and experimental results are also presented
for evaluating the degree of understanding of the proposed scenarios.
The corpus generated from visual scenarios appears to be much richer
than the one generated from textual descriptions.
Authors:
Mauro Cettolo, ITC-IRST (Italy)
Daniele Falavigna, ITC-IRST (Italy)
Paper number 333
Abstract:
In spoken dialogue systems, the minimal unit of analysis does not necessarily
correspond to a full sentence. A possible approach for language processing
is to split the sentence into a sequence of units that can be
successively processed by linguistic modules. The goal of the Semantic
Boundary (SB) detector is to locate boundaries inside a sentence in
order to obtain such minimal units. Useful information for SB detection
can be extracted both from the acoustic signal of the utterance and
from its corresponding word sequence. In this paper, techniques for semantic
boundary prediction based on both acoustic and lexical knowledge are
presented, together with a method for combining the two knowledge sources.
Finally, performance on a corpus of hundreds of person-to-person dialogues
is reported; the best result gives 62.8% recall and 71.8% precision.
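A minimal sketch of one way the two knowledge sources could be combined, assuming each yields a boundary probability for every word gap; the linear interpolation, weight, and threshold are illustrative assumptions, not the paper's exact method.

# Illustrative sketch (assumed formulation): combine an acoustic boundary score
# (e.g. from pause/prosody features) with a lexical score (e.g. from an n-gram
# boundary model) by linear interpolation, then threshold.
def combine_scores(acoustic, lexical, weight=0.5, threshold=0.5):
    """acoustic, lexical: per word-gap boundary probabilities in [0, 1]."""
    return [weight * a + (1.0 - weight) * l >= threshold
            for a, l in zip(acoustic, lexical)]

def recall_precision(predicted, reference):
    tp = sum(p and r for p, r in zip(predicted, reference))
    return tp / max(1, sum(reference)), tp / max(1, sum(predicted))

if __name__ == "__main__":
    pred = combine_scores([0.9, 0.2, 0.7], [0.8, 0.1, 0.4])
    print(pred, recall_precision(pred, [True, False, True]))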
Authors:
Iman Gholampour, Electrical Engineering Department, Sharif University of Technology (Iran)
Kambiz Nayebi, Electrical Engineering Department, Sharif University of Technology (Iran)
Paper number 182
Abstract:
In this paper, a new method for the automatic segmentation of continuous
speech into phone-like units is presented. Our method is based on a
very fast presegmentation algorithm that uses a new statistical model
of speech and a search in a multilevel structure, called a dendrogram,
to decrease the insertion rate. The performance of the algorithms has
been tested on a large set of TIMIT sentences. According to these tests,
our final segmentation algorithm detects nearly 97% of segments with
an average boundary position error of less than 7 ms and an average
insertion rate of less than 12.7%. In addition to its acceptable precision,
our overall segmentation scheme has a very low computational cost and
can be implemented in real time on an average Pentium PC. The major
advantage of the presented algorithms is that they require no training
or threshold estimation. Details of the proposed algorithms and their
performance results are included in the paper.
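The sketch below illustrates the general idea of a dendrogram built over adjacent speech frames: the closest adjacent pair is merged first, and cutting the merge sequence at a chosen level yields phone-like segments. This is a generic formulation for illustration; the paper's statistical model and search are not reproduced here.

# Illustrative sketch of a dendrogram over adjacent speech frames (assumed,
# generic formulation). Adjacent segments with the smallest mean-feature
# distance are merged first; stopping at a chosen level yields segments.
import numpy as np

def dendrogram_segments(features, n_segments):
    """features: (n_frames, n_dims) array; returns frame indices of segment starts."""
    segs = [[i] for i in range(len(features))]          # one segment per frame
    means = [features[i].astype(float) for i in range(len(features))]
    while len(segs) > n_segments:
        dists = [np.linalg.norm(means[i] - means[i + 1]) for i in range(len(segs) - 1)]
        j = int(np.argmin(dists))                       # closest adjacent pair
        segs[j] = segs[j] + segs.pop(j + 1)
        means[j] = features[segs[j]].mean(axis=0)
        means.pop(j + 1)
    return [s[0] for s in segs]

if __name__ == "__main__":
    feats = np.random.rand(200, 12)                     # e.g. 12 cepstral coefficients
    print(dendrogram_segments(feats, 20))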
Authors:
Akemi Iida, Graduate School of Media and Governance, Keio University (Japan)
Nick Campbell, ATR Interpreting Telecommunications Research Laboratories (Japan)
Soichiro Iga, Graduate School of Media and Governance, Keio University (Japan)
Fumito Higuchi, Graduate School of Media and Governance, Keio University (Japan)
Michiaki Yasumura, Graduate School of Media and Governance, Keio University (Japan)
Paper number 818
Abstract:
This paper proposes three corpora of emotional speech in Japanese, each
maximizing the expression of one emotion (joy, anger, or sadness), for
use with CHATR, the concatenative speech synthesis system being developed
at ATR. A perceptual experiment was conducted using synthesized speech
generated from each emotion corpus, and the intended emotions proved
to be identifiable at a significant level. The authors' current work
is to identify the local acoustic features relevant for specifying a
particular emotion type. F0 and duration showed significant differences
among emotion types; AV (amplitude of voicing source) and GN (glottal
noise) also showed differences. This paper reports on the corpus design,
the perceptual experiment, and the results of the acoustic analysis.
Authors:
Pyungsu Kang, DSP Lab, Dept. of Electronics Engineering, Chonnam National University (Korea)
Jiyoung Kang, DSP Lab, Dept. of Electronics Engineering, Chonnam National University (Korea)
Jinyoung Kim, DSP Lab, Dept. of Electronics Engineering, Chonnam National University (Korea)
Paper number 264
Abstract:
We present a new mixed LDA-VQ method for predicting Korean prosodic
break indices (PBI) for a given utterance. PBI can be used as an important
cue to syntactic discontinuity in continuous speech recognition (CSR).
Our proposed LDA-VQ model consists of three steps. In the first step,
PBI is predicted from syllable and pause duration information using
linear discriminant analysis (LDA). In the second step, syllable tone
information is used to estimate PBI: the syllable tones are coded by
vector quantization (VQ) and PBI is estimated with a tri-tone model.
In the last step, the two PBI predictors are integrated by a weighting
factor. The LDA-VQ method was tested on 200 literal-style spoken sentences.
The experimental results showed 72% accuracy.
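A minimal sketch of the final integration step under the assumption that each predictor outputs per-syllable posteriors over break indices and that the combination is a single-weight interpolation; the array shapes and weight value are illustrative, not the paper's.

# Illustrative sketch (assumed form) of merging the duration-based LDA predictor
# and the tone-based VQ/tri-tone predictor with one weighting factor.
import numpy as np

def combine_pbi(p_lda, p_vq, weight=0.6):
    """p_lda, p_vq: (n_syllables, n_break_indices) posterior arrays."""
    combined = weight * np.asarray(p_lda) + (1.0 - weight) * np.asarray(p_vq)
    return combined.argmax(axis=1)          # predicted break index per syllable

if __name__ == "__main__":
    p_lda = [[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]]
    p_vq  = [[0.5, 0.4, 0.1], [0.2, 0.2, 0.6]]
    print(combine_pbi(p_lda, p_vq))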
Authors:
Mark Laws, University of Otago (New Zealand)
Richard Kilgour, University of Otago (New Zealand)
Paper number 1116
Abstract:
With the advent of spoken language computer interface systems, the
storage and management of speech corpora is becoming more of an issue
in the development of such systems. Until recently, even large corpora
were stored as individual text and speech files, or as a single, monolithic
file. The issues involved in management and retrieval of the data have
been, to a large extent, overlooked. Relational database management
systems (RDBMS) are proposed as an ideal tool for the management of
speech corpora. Relationships between words and phonemes, and the realisations
of these, can be stored and retrieved efficiently. An RDBMS may be constructed
with various levels to store speaker, language, label transcription,
and phonetic information, as well as speech as isolated words and derived
segmented units. An implementation of such a system, currently called
the Management Of Otago Speech Environment (MOOSE), is presented for
managing the Otago Speech Corpus. The applicability of MOOSE to other
corpora is currently under investigation.
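To make the idea of a multi-level relational layout concrete, the sketch below defines a hypothetical schema linking speakers, recordings, words, and phone segments and retrieves realisations of a phoneme. The table and column names are assumptions for illustration, not the actual MOOSE design.

# Illustrative sketch (hypothetical schema, not the actual MOOSE design).
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE speaker   (speaker_id INTEGER PRIMARY KEY, name TEXT, dialect TEXT);
CREATE TABLE recording (recording_id INTEGER PRIMARY KEY, speaker_id INTEGER,
                        path TEXT, sample_rate INTEGER,
                        FOREIGN KEY (speaker_id) REFERENCES speaker);
CREATE TABLE word      (word_id INTEGER PRIMARY KEY, recording_id INTEGER,
                        orthography TEXT, start_ms INTEGER, end_ms INTEGER,
                        FOREIGN KEY (recording_id) REFERENCES recording);
CREATE TABLE phone     (phone_id INTEGER PRIMARY KEY, word_id INTEGER,
                        label TEXT, start_ms INTEGER, end_ms INTEGER,
                        FOREIGN KEY (word_id) REFERENCES word);
""")

# Example query: all realisations of the phoneme /i/ by a given speaker.
rows = conn.execute("""
    SELECT r.path, p.start_ms, p.end_ms
    FROM phone p JOIN word w ON p.word_id = w.word_id
                 JOIN recording r ON w.recording_id = r.recording_id
    WHERE p.label = ? AND r.speaker_id = ?""", ("i", 1)).fetchall()
print(rows)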
Authors:
Fabrice Malfrère, Faculté Polytechnique de Mons (Belgium)
Olivier Deroo, Faculté Polytechnique de Mons (Belgium)
Thierry Dutoit, Faculté Polytechnique de Mons (Belgium)
Paper number 354
Abstract:
In this paper we compare two different methods for phonetically labeling
a speech database. The first approach is based on the alignment of
the speech signal on a high quality synthetic speech pattern, and the
second one uses a hybrid HMM/ANN system. Both systems have been evaluated
on manually segmented French read utterances from a speaker never seen
in the training stage of the HMM/ANN system. This study outlines the
advantages and drawbacks of both methods. The high-quality speech synthesis
system has the great advantage that no training stage is needed, while
the classical HMM/ANN system easily allows multiple phonetic transcriptions.
We derive a method for automatically building phonetically labeled
speech databases, using the synthesis-based segmentation tool to bootstrap
the training process of our hybrid HMM/ANN system. Such segmentation
tools will be a key element in the development of improved speech synthesis
and recognition systems.
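The first approach can be pictured as a dynamic time warping (DTW) alignment between the natural utterance and a synthetic rendition of the same text, with the synthesiser's known phone boundaries mapped through the warping path. The sketch below shows this generic idea; the feature representation and function names are assumptions, not the paper's system.

# Illustrative sketch (assumed formulation): align natural speech to a synthetic
# rendition with DTW, then map the synthesiser's known phone boundaries through
# the warping path to label the natural signal.
import numpy as np

def dtw_path(a, b):
    """a, b: (n, d) and (m, d) feature arrays; returns the optimal warping path."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]])
        i, j = (i - 1, j - 1) if step == 0 else (i - 1, j) if step == 1 else (i, j - 1)
    return path[::-1]

def map_boundaries(synthetic_boundaries, path):
    """synthetic_boundaries: frame indices in the synthetic signal."""
    mapping = {}
    for syn_frame, nat_frame in path:        # keep first natural frame per syn frame
        mapping.setdefault(syn_frame, nat_frame)
    return [mapping.get(b) for b in synthetic_boundaries]

if __name__ == "__main__":
    syn = np.random.rand(40, 12)   # synthetic-speech features (e.g. cepstra)
    nat = np.random.rand(55, 12)   # natural-speech features for the same text
    print(map_boundaries([10, 25, 39], dtw_path(syn, nat)))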
Authors:
J. Bruce Millar, Australian National University (Australia)
Paper number 705
Abstract:
This paper describes a World Wide Web-based system that allows a speech
corpus developer to build a comprehensive description of a spoken language
data corpus. The browser interface allows the user to define the speaking
environment to be described and then, in a modular fashion, to progressively
describe it and all the data files arising from it. The quality of the
description can be tested incrementally, and the feedback generated
can be used to further refine the description. The user is guided through
the necessary modules in a way that prompts, but does not demand, adherence
to a strict pattern. The complete data description can be displayed
on screen and also downloaded in a simple text form. Users may register
as clients of the system, in which case they can store descriptions
on the system and update them at any time.
Authors:
Claude Montacié, Laboratoire d'Informatique de Paris 6 (France)
Marie-José Caraty, Laboratoire d'Informatique de Paris 6 (France)
Paper number 1141
Abstract:
In this paper, we present techniques for warping the audio data of a
movie onto its script. In order to improve this script warping, a new
algorithm has been developed to split the audio data into silence, noise,
music, and speech segments without a training step. This segmentation
uses multiple techniques such as voiced/unvoiced segmentation, pitch
detection, pitch tracking, and speaker and speech recognition. The 102.47
minutes of the film "Contes de Printemps" by E. Rohmer have been indexed
with these techniques, with an average shift of less than one second
between the time-coded script and the audio data.
Authors:
David Pye, ORL (U.K.)
Nicholas J. Hollinghurst, ORL (U.K.)
Timothy J. Mills, ORL (U.K.)
Kenneth R. Wood, ORL (U.K.)
Paper number 517
Abstract:
This paper reports recent work at ORL on segmentation of digital audio/video
recordings. Firstly, we describe an audio segmentation algorithm that
partitions a soundtrack into manageably sized segments for speech recognition.
Secondly, we present an algorithm for detecting camera shot-break locations
in the video. The output of these two algorithms is combined to produce
a semantically meaningful segmentation of audio/video content, appropriate
for information retrieval. We report the success of the algorithms
in the context of television news retrieval.
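One simple way to combine the two boundary streams is to snap audio-segment boundaries to nearby camera shot breaks, as in the sketch below. This combination rule, tolerance, and function name are assumptions for illustration, not necessarily the rule used at ORL.

# Illustrative sketch (assumed combination rule): audio segment boundaries are
# snapped to the nearest shot break within a small tolerance, so that retrieval
# units respect both soundtrack and visual structure.
def combine_boundaries(audio_bounds, shot_breaks, tolerance=1.0):
    """audio_bounds, shot_breaks: boundary times in seconds (sorted)."""
    combined = []
    for t in audio_bounds:
        nearest = min(shot_breaks, key=lambda s: abs(s - t), default=None)
        combined.append(nearest if nearest is not None and abs(nearest - t) <= tolerance
                        else t)
    return sorted(set(combined))

if __name__ == "__main__":
    print(combine_boundaries([10.2, 33.7, 61.0], [9.8, 34.5, 80.0]))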
Authors:
Stefan Rapp, Sony International (Europe) GmbH (Germany)
Grzegorz Dogil, Institut f. Maschinelle Sprachverarbeitung, Univ. Stuttgart (Germany)
Paper number 906
Abstract:
We present methods for finding identical or nearly identical news stories
in hourly radio news broadcasts read by the same or different announcers.
They make it possible to build, at low cost, a large database of repeated,
professionally read speech that is especially interesting for prosody
research, but also, e.g., for concept-to-speech and sociolinguistic
studies. An automatically recorded complete radio news broadcast is
first segmented into individual news stories using HMM recognition.
Then, the word sequence estimates of the stories are either compared
directly (naive method) or realigned with the signals of other stories
(realignment method) to find out which stories have been read before
and which have not. Both methods can be further improved by computing
"meta distances" that also take into account distances to other stories.
We find that, on real-life data, the realignment method combined with
meta distances is the most reliable.
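One way to picture a "meta distance" is to compare two stories through their distance profiles to all other stories rather than only directly, as in the sketch below. This particular formulation is an assumption for illustration; the paper's exact definition is not given in the abstract.

# Illustrative sketch (assumed formulation) of meta distances derived from a
# matrix of direct story-to-story distances.
import numpy as np

def meta_distances(dist):
    """dist: (n, n) symmetric matrix of direct story-to-story distances."""
    dist = np.asarray(dist, dtype=float)
    n = len(dist)
    meta = np.zeros_like(dist)
    for i in range(n):
        for j in range(n):
            others = [k for k in range(n) if k not in (i, j)]
            # repeated stories should lie at similar distances from every other
            # story, so compare their distance profiles
            meta[i, j] = np.linalg.norm(dist[i, others] - dist[j, others])
    return meta

if __name__ == "__main__":
    direct = np.array([[0.0, 0.2, 0.9], [0.2, 0.0, 0.8], [0.9, 0.8, 0.0]])
    print(meta_distances(direct))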
Authors:
Christel Brindöpke, University Bielefeld (Germany)
Brigitte Schaffranietz, University Bielefeld (Germany)
Paper number 352
Abstract:
This article presents a phonetically defined annotation system for
German speech melody whose descriptive units cover the perceptually
relevant pitch movements. The units for the melodic annotation are
based on a model of read German utterances. The implementation of our
descriptive melodic units as part of a recently developed labelling
and testing environment for melodic aspects of speech allows the units
to be applied comfortably and intersubjectively in the annotation of
speech. An experimental evaluation based on a rating experiment confirms
that the melodic units describe spontaneous speech as adequately as
read speech.
Authors:
Karlheinz Stöber, IKP, Bonn University (Germany)
Wolfgang Hess, IKP, Bonn University (Germany)
Paper number 239
Abstract:
We describe a new approach to speaker-independent automatic phoneme
alignment. Typical algorithms for this task use only phoneme-to-frame
similarity measures, which are maximised or minimised in some way. In
addition to such similarity measures, we use phoneme duration hypotheses
generated by the speech synthesis system HADIFIX. Duration hypotheses
are difficult to incorporate into algorithms based on dynamic programming,
so we construct a cost function combining phoneme-to-frame similarity
with agreement between segment durations and the duration hypotheses,
and minimise this cost function with a genetic algorithm. The results
show that the accuracy of automatically determined phoneme boundaries
increases, especially for speakers not used in the training phase.
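The sketch below shows the general shape of such an approach: a cost function that trades off acoustic fit against deviation from predicted durations, searched by a simple genetic algorithm over duration vectors. The cost weighting, chromosome encoding, and GA operators are assumptions for illustration, not the HADIFIX-based system itself.

# Illustrative sketch (assumed formulation): score candidate segmentations with
# a cost combining phoneme-to-frame similarity and deviation from predicted
# durations, and search for a low-cost alignment with a simple genetic algorithm.
import random

def cost(durations, frame_scores, predicted_durations, alpha=0.5):
    """durations: frames assigned to each phoneme (sums to the utterance length).
    frame_scores[p][t]: similarity of frame t to phoneme p (higher is better).
    predicted_durations[p]: duration hypothesis (frames) from the synthesiser."""
    total, start = 0.0, 0
    for p, dur in enumerate(durations):
        acoustic = -sum(frame_scores[p][start:start + dur])   # reward matching frames
        total += alpha * acoustic + (1 - alpha) * abs(dur - predicted_durations[p])
        start += dur
    return total

def repair(durations, n_frames):
    """Force positive integer durations that sum exactly to n_frames."""
    durations = [max(1, int(round(d))) for d in durations]
    while sum(durations) != n_frames:
        i = random.randrange(len(durations))
        if sum(durations) > n_frames and durations[i] > 1:
            durations[i] -= 1
        elif sum(durations) < n_frames:
            durations[i] += 1
    return durations

def genetic_align(frame_scores, predicted_durations, n_frames,
                  pop_size=40, generations=300, mutation=0.2):
    n_phones = len(predicted_durations)
    pop = [repair([random.randint(1, 2 * d) for d in predicted_durations], n_frames)
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda c: cost(c, frame_scores, predicted_durations))
        parents = pop[: pop_size // 2]                         # keep the fittest half
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = random.sample(parents, 2)
            cut = random.randrange(1, n_phones)                # one-point crossover
            child = a[:cut] + b[cut:]
            if random.random() < mutation:
                child[random.randrange(n_phones)] += random.choice([-2, -1, 1, 2])
            children.append(repair(child, n_frames))
        pop = parents + children
    return min(pop, key=lambda c: cost(c, frame_scores, predicted_durations))

if __name__ == "__main__":
    scores = [[random.random() for _ in range(100)] for _ in range(5)]
    print(genetic_align(scores, [20, 15, 25, 20, 20], n_frames=100))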
Authors:
Amy Isard, Human Communication Research Centre, University of Edinburgh (Scotland)
David McKelvie, Human Communication Research Centre, University of Edinburgh (Scotland)
Henry S. Thompson, Human Communication Research Centre, University of Edinburgh (Scotland)
Paper number 322
Abstract:
The rapid growth in availability of high-quality recordings of natural
spoken dialogue (and natural spoken material more generally) has encouraged
us to improve the interchange of transcripts of such material, so that
these resources are easy for the scientific community as a whole to
exploit. In this paper, we describe a new SGML architecture which
we have recently adopted for the HCRC Map Task corpus (a corpus of
spontaneous task-oriented dialogues) with precisely these issues in
view. This architecture is oriented towards ease of processing and
update.